System Architecture Overview
A comprehensive guide to the distributed microservices platform powering our core product.
Overview
This document describes the architecture of the platform as of Q1 2026. The system is composed of seven core services running across three availability zones, backed by a polyglot persistence layer.
Key properties:
- Stateless services — all state lives in databases or distributed caches
- Event-driven — services communicate via Kafka topics for async operations
- Zero-trust networking — all inter-service calls are mTLS authenticated
- Observability-first — every service emits structured logs, metrics (Prometheus), and traces (OpenTelemetry)
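The observability-first property can be illustrated with a minimal sketch of structured logging that carries a `trace_id` field, so log lines can be joined with distributed traces. The `checkout` service name and field set below are illustrative, not taken from the real services:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one structured JSON line. Carrying the
    trace_id lets a log line be correlated with its distributed trace."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        })


# Hypothetical service logger wired to stdout with the JSON formatter.
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Fields passed via `extra` become attributes on the record.
logger.info("order accepted",
            extra={"service": "checkout", "trace_id": "4bf92f3577b34da6"})
```

In production the same shape would come from the OpenTelemetry logging instrumentation rather than a hand-rolled formatter; the sketch only shows why every line is indexed by service, level, and trace_id downstream in Loki.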
High-Level Architecture
Request Lifecycle
A typical API request flows through five stages. Total end-to-end latency for a read path is 15–25ms at p95.
- TLS Termination — Cloudflare edge handles TLS offload and DDoS filtering
- Authentication — JWT validation, rate limiting, request signing (~2ms)
- Routing — API gateway dispatches to the target service via gRPC
- Service Handler — Business logic executes, DB queries run (~5–12ms)
- Emit & Respond — Kafka events are published asynchronously, trace is closed, response is returned to the client
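The five stages above can be sketched as a simple pipeline. The per-stage costs are nominal placeholders chosen to sit inside the ranges quoted in the list (not measured values), and the handlers are stand-ins for the real edge, gateway, and service code:

```python
# Illustrative five-stage read path; names follow the list above,
# per-stage costs in milliseconds are assumed placeholders.
STAGES = [
    ("tls_termination", 1.0),   # edge TLS offload and DDoS filtering
    ("authentication", 2.0),    # JWT validation, rate limiting (~2ms)
    ("routing", 0.5),           # API gateway dispatch via gRPC
    ("service_handler", 8.0),   # business logic + DB queries (~5-12ms)
    ("emit_and_respond", 1.0),  # async Kafka publish, close trace
]


def handle_request(stages):
    """Walk the pipeline in order, recording each stage executed and
    summing its nominal latency contribution."""
    trace, total_ms = [], 0.0
    for name, cost_ms in stages:
        trace.append(name)
        total_ms += cost_ms
    return trace, total_ms


trace, total = handle_request(STAGES)
```

Summing the nominal costs lands well under the 15-25ms p95 budget, which leaves headroom for tail effects in the service handler and database.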
Deployment Strategy
The platform uses a blue-green deployment model with canary releases. Each service is containerized and orchestrated via Kubernetes.
Infrastructure: one Helm chart per service; the HPA scales each deployment between 2 and 20 pods based on CPU utilization and request queue depth. Pod anti-affinity spreads replicas across the three availability zones so the platform survives a full zone failure.
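The HPA scaling behavior can be sketched with Kubernetes' standard proportional rule: desired replicas grow with the ratio of observed metric to target, clamped to the 2-20 range, and with multiple metrics (CPU and queue depth) the largest recommendation wins. The numbers in the usage below are hypothetical:

```python
import math


def desired_replicas(current, metric_value, metric_target,
                     min_replicas=2, max_replicas=20):
    """Kubernetes HPA rule: scale proportionally to how far the observed
    metric is from its target, clamped to the configured 2-20 range."""
    desired = math.ceil(current * metric_value / metric_target)
    return max(min_replicas, min(max_replicas, desired))


def desired_for_metrics(current, observations):
    """With several metrics (e.g. CPU and request queue depth), the HPA
    takes the largest per-metric recommendation."""
    return max(desired_replicas(current, value, target)
               for value, target in observations)


# Hypothetical readings: CPU at 150% of target, queue depth at 80%.
replicas = desired_for_metrics(4, [(150, 100), (80, 100)])
```

Here CPU pressure dominates, so the deployment scales from 4 to 6 pods even though queue depth alone would allow a scale-down.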
Observability Stack
Every service is instrumented with OpenTelemetry. Structured logs, metrics, and distributed traces flow through a central OTel Collector into purpose-built backends:
- Metrics → Prometheus (retention: 90 days, 15s scrape interval)
- Logs → Loki (structured JSON, indexed by service/level/trace_id)
- Traces → Tempo (100% sampling in staging, 10% head-based in prod)
- Dashboards → Grafana (unified query across all three backends)
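The 10% head-based sampling in prod means the keep/drop decision is made once, at the root span, and must be deterministic per trace so every service agrees. A minimal sketch of that decision, assuming hex trace ids (the bucketing scheme here is illustrative, not the exact OTel sampler implementation):

```python
def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based sampling decision made once at the root span: map the
    hex trace id deterministically into [0, 1) and keep the whole trace
    if it falls under the configured rate (0.10 in prod, 1.0 in staging).
    Every service computes the same answer for the same trace id."""
    bucket = int(trace_id, 16) % 10_000 / 10_000
    return bucket < rate
```

Because the decision depends only on the trace id, a sampled trace is kept end to end across all services it touches, which is what makes the 10% rate usable for debugging multi-service requests.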
Alerting runs in Grafana with PagerDuty integration. On-call watches the api-latency board; a p99 above 200ms sustained for 5 minutes triggers a page.
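The paging rule can be sketched as a check over a rolling window of per-minute p99 samples. This is a simplified model of what the Grafana alert evaluates, assuming 1-minute evaluation intervals:

```python
def should_page(p99_series_ms, threshold_ms=200.0, window=5):
    """Fire only when the last `window` consecutive per-minute p99
    samples all exceed the threshold, so a single transient spike does
    not page on-call."""
    if len(p99_series_ms) < window:
        return False
    return all(v > threshold_ms for v in p99_series_ms[-window:])
```

Requiring every sample in the window to breach the threshold is what turns "above 200ms for 5 minutes" into a sustained-condition alert rather than a spike detector.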
Design Notes
We chose event sourcing for the audit trail to maintain a complete, immutable history of all state changes. This supports both compliance requirements and debugging distributed transactions.
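The event-sourcing idea can be shown in miniature: state changes are appended as immutable events, and current state is rebuilt by replaying them in order. The entity and event names below are hypothetical; the real audit trail lives in Kafka-backed storage, not an in-memory list:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """One immutable audit-trail entry; history is only ever appended,
    never rewritten."""
    entity_id: str
    kind: str
    payload: dict


@dataclass
class EventStore:
    log: list = field(default_factory=list)

    def append(self, event: Event) -> None:
        self.log.append(event)

    def replay(self, entity_id: str) -> dict:
        """Rebuild current state by folding an entity's events in order,
        the same mechanism used to reconstruct what a distributed
        transaction did after the fact."""
        state = {}
        for e in self.log:
            if e.entity_id == entity_id:
                state.update(e.payload)
        return state
```

Because the log is append-only, the full history stays available for compliance audits even after the derived state has moved on.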
The polyglot persistence approach — PostgreSQL for relational data, Redis for caching, Elasticsearch for search — lets each service choose the best storage engine for its access patterns rather than forcing everything through a single database.
Last updated: March 2026 — Maintained by the Platform Team