System Architecture Overview
A comprehensive guide to the distributed microservices platform powering our core product.
Overview
This document describes the architecture of the platform as of Q1 2026. The system is composed of seven core services running across three availability zones, backed by a polyglot persistence layer.
Key properties:
- Stateless services — all state lives in databases or distributed caches
- Event-driven — services communicate via Kafka topics for async operations
- Zero-trust networking — all inter-service calls are mTLS authenticated
- Observability-first — every service emits structured logs, metrics (Prometheus), and traces (OpenTelemetry)
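The observability-first property can be illustrated with a minimal sketch of structured logging that carries a `trace_id` field, so log lines can be joined with distributed traces. The `checkout` service name and field set below are illustrative, not taken from the real services:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one structured JSON line. Carrying the
    trace_id lets a log line be correlated with its distributed trace."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })


# Hypothetical service logger wired to stdout with the JSON formatter.
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields passed via `extra` become attributes on the record.
logger.info("order accepted",
            extra={"service": "checkout", "trace_id": "4bf92f3577b34da6"})
```

In production the same shape would come from the OpenTelemetry logging instrumentation rather than a hand-rolled formatter; the sketch only shows why every line is indexed by service, level, and trace_id downstream in Loki.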
High-Level Architecture
Request Lifecycle
A typical API request flows through five stages. Total end-to-end latency for a read path is 15–25ms at p95.
- TLS Termination — Cloudflare edge handles TLS offload and DDoS filtering
- Authentication — JWT validation, rate limiting, request signing (~2ms)
- Routing — API gateway dispatches to the target service via gRPC
- Service Handler — Business logic executes, DB queries run (~5–12ms)
- Emit & Respond — Kafka events are published asynchronously, trace is closed, response is returned to the client
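The five stages above can be sketched as a simple pipeline. The per-stage costs are nominal placeholders chosen to sit inside the ranges quoted in the list (not measured values), and the handlers are stand-ins for the real edge, gateway, and service code:

```python
# Illustrative five-stage read path; names follow the list above,
# per-stage costs in milliseconds are assumed placeholders.
STAGES = [
    ("tls_termination", 1.0),   # edge TLS offload and DDoS filtering
    ("authentication", 2.0),    # JWT validation, rate limiting (~2ms)
    ("routing", 0.5),           # API gateway dispatch via gRPC
    ("service_handler", 8.0),   # business logic + DB queries (~5-12ms)
    ("emit_and_respond", 1.0),  # async Kafka publish, close trace
]


def handle_request(stages):
    """Walk the pipeline in order, recording each stage executed and
    summing its nominal latency contribution."""
    trace, total_ms = [], 0.0
    for name, cost_ms in stages:
        trace.append(name)
        total_ms += cost_ms
    return trace, total_ms


trace, total = handle_request(STAGES)
```

Summing the nominal costs lands well under the 15-25ms p95 budget, which leaves headroom for tail effects in the service handler and database.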
Deployment Strategy
The platform uses a blue-green deployment model with canary releases. Each service is containerized and orchestrated via Kubernetes.
Infrastructure: one Helm chart per service; the HPA scales each deployment between 2 and 20 pods based on CPU utilization and request queue depth. Pod anti-affinity spreads replicas across the three availability zones so the platform survives a full zone failure.
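The HPA scaling behavior can be sketched with Kubernetes' standard proportional rule: desired replicas grow with the ratio of observed metric to target, clamped to the 2-20 range, and with multiple metrics (CPU and queue depth) the largest recommendation wins. The numbers in the usage below are hypothetical:

```python
import math


def desired_replicas(current, metric_value, metric_target,
                     min_replicas=2, max_replicas=20):
    """Kubernetes HPA rule: scale proportionally to how far the observed
    metric is from its target, clamped to the configured 2-20 range."""
    desired = math.ceil(current * metric_value / metric_target)
    return max(min_replicas, min(max_replicas, desired))


def desired_for_metrics(current, observations):
    """With several metrics (e.g. CPU and request queue depth), the HPA
    takes the largest per-metric recommendation."""
    return max(desired_replicas(current, value, target)
               for value, target in observations)


# Hypothetical readings: CPU at 150% of target, queue depth at 80%.
replicas = desired_for_metrics(4, [(150, 100), (80, 100)])
```

Here CPU pressure dominates, so the deployment scales from 4 to 6 pods even though queue depth alone would allow a scale-down.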
Observability Stack
Every service is instrumented with OpenTelemetry. Structured logs, metrics, and distributed traces flow through a central OTel Collector into purpose-built backends:
- Metrics → Prometheus (retention: 90 days, 15s scrape interval)
- Logs → Loki (structured JSON, indexed by service/level/trace_id)
- Traces → Tempo (100% sampling in staging, 10% head-based in prod)
- Dashboards → Grafana (unified query across all three backends)
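The 10% head-based sampling in prod means the keep/drop decision is made once, at the root span, and must be deterministic per trace so every service agrees. A minimal sketch of that decision, assuming hex trace ids (the bucketing scheme here is illustrative, not the exact OTel sampler implementation):

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based sampling decision made once at the root span: map the
    hex trace id deterministically into [0, 1) and keep the whole trace
    if it falls under the configured rate (0.10 in prod, 1.0 in staging).
    Every service computes the same answer for the same trace id."""
    bucket = int(trace_id, 16) % 10_000 / 10_000
    return bucket < rate
```

Because the decision depends only on the trace id, a sampled trace is kept end to end across all services it touches, which is what makes the 10% rate usable for debugging multi-service requests.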
Alerting runs in Grafana with PagerDuty integration. On-call watches the api-latency board; a p99 above 200ms sustained for 5 minutes triggers a page.
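The paging rule can be sketched as a check over a rolling window of per-minute p99 samples. This is a simplified model of what the Grafana alert evaluates, assuming 1-minute evaluation intervals:

```python
def should_page(p99_series_ms, threshold_ms=200.0, window=5):
    """Fire only when the last `window` consecutive per-minute p99
    samples all exceed the threshold, so a single transient spike does
    not page on-call."""
    if len(p99_series_ms) < window:
        return False
    return all(v > threshold_ms for v in p99_series_ms[-window:])
```

Requiring every sample in the window to breach the threshold is what turns "above 200ms for 5 minutes" into a sustained-condition alert rather than a spike detector.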
Design Notes
We chose event sourcing for the audit trail to maintain a complete, immutable history of all state changes. This supports both compliance requirements and debugging distributed transactions.
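The event-sourcing idea can be shown in miniature: state changes are appended as immutable events, and current state is rebuilt by replaying them in order. The entity and event names below are hypothetical; the real audit trail lives in Kafka-backed storage, not an in-memory list:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable audit-trail entry; history is only ever appended,
    never rewritten."""
    entity_id: str
    kind: str
    payload: dict


@dataclass
class EventStore:
    log: list = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.log.append(event)

    def replay(self, entity_id: str) -> dict:
        """Rebuild current state by folding an entity's events in order,
        the same mechanism used to reconstruct what a distributed
        transaction did after the fact."""
        state = {}
        for e in self.log:
            if e.entity_id == entity_id:
                state.update(e.payload)
        return state
```

Because the log is append-only, the full history stays available for compliance audits even after the derived state has moved on.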
The polyglot persistence approach — PostgreSQL for relational data, Redis for caching, Elasticsearch for search — lets each service choose the best storage engine for its access patterns rather than forcing everything through a single database.
Last updated: March 2026 — Maintained by the Platform Team