Bug Investigation: Intermittent 502s on /api/checkout
Severity: P1 — ~8% of checkout requests returned 502 errors over an 18-hour window. Root cause: PostgreSQL connection pool leak in the inventory reservation path. Estimated revenue impact: ~$47,000.
Incident Timeline
Observable Symptoms
All four symptoms were correlated — the connection pool leak was the single upstream cause, with Redis timeouts being a downstream effect of request queue backpressure.
Root Cause Chain
The Buggy Code
The reserveInventory() function acquires a connection from the pool but fails to release it when an error occurs in the catch block:
```javascript
async function reserveInventory(orderId, items) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    for (const item of items) {
      const res = await client.query(
        'UPDATE inventory SET reserved = reserved + $1 WHERE sku = $2 AND available >= $1 RETURNING *',
        [item.quantity, item.sku]
      );
      if (res.rowCount === 0) {
        throw new Error(`Insufficient inventory for SKU ${item.sku}`);
      }
    }
    await client.query('COMMIT');
    client.release(); // ✓ released on success
  } catch (err) {
    await client.query('ROLLBACK');
    // ✗ BUG: client.release() is never called here
    // → the connection stays checked out and is never returned to the pool
    throw err;
  }
}
```

Every time an inventory reservation fails (out of stock, constraint violation, or any other error), a pool connection is permanently leaked. Once all 20 connections are consumed, the entire checkout path goes down.
The Fix
The fix moves client.release() into a finally block, guaranteeing the connection is returned regardless of outcome:
```javascript
async function reserveInventory(orderId, items) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    for (const item of items) {
      const res = await client.query(
        'UPDATE inventory SET reserved = reserved + $1 WHERE sku = $2 AND available >= $1 RETURNING *',
        [item.quantity, item.sku]
      );
      if (res.rowCount === 0) {
        throw new Error(`Insufficient inventory for SKU ${item.sku}`);
      }
    }
    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release(); // ✓ always released
  }
}
```

This is a one-line fix, but it eliminates the entire failure cascade.
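A follow-up worth considering: the acquire/begin/commit/rollback/release pattern can be centralized so no individual call site can forget the `finally`. The helper below is a sketch, not code from this incident; `pool` is assumed to expose the node-postgres `Pool` API (`connect()` returning a client with `query()` and `release()`):

```javascript
// Hypothetical helper: wraps a callback in a transaction and guarantees
// the client is returned to the pool on every path.
async function withTransaction(pool, fn) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    const result = await fn(client); // caller's queries run here
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err; // propagate after cleanup
  } finally {
    client.release(); // always returned to the pool
  }
}
```

With this in place, `reserveInventory` shrinks to `withTransaction(pool, async (client) => { /* UPDATE loop */ })`, and the release invariant lives in exactly one function.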
Impact Summary
Takeaways
- Always use `finally` for resource cleanup: `try/catch` alone is not sufficient when acquiring pooled resources like database connections
- Set pool timeouts: `pg` supports `idleTimeoutMillis` and `connectionTimeoutMillis`. Note that `idleTimeoutMillis` only reclaims idle clients sitting in the pool, not leaked (checked-out) clients; a connection-acquisition timeout would at least have turned indefinite hangs into fast failures, limiting blast radius
- Add connection pool metrics: pool utilization should be a first-class dashboard metric with alerts at 80% capacity
- Reproduce in staging: the leak only manifests under inventory-failure conditions, which weren't covered by the existing load test suite
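The timeout and metrics takeaways can be combined into a small configuration sketch. This assumes node-postgres, whose `Pool` exposes `totalCount`, `idleCount`, and `waitingCount` directly; the specific values and logging hook are illustrative, not taken from this incident:

```javascript
// Configuration sketch — values are illustrative.
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                        // matches the incident's pool size
  idleTimeoutMillis: 30_000,      // close idle pooled clients after 30s
  connectionTimeoutMillis: 5_000, // fail fast instead of queueing forever
});

// Minimal pool-utilization metric using pg's built-in counters.
setInterval(() => {
  const used = pool.totalCount - pool.idleCount;
  console.log(`pool: used=${used}/${pool.options.max} waiting=${pool.waitingCount}`);
  if (used / pool.options.max >= 0.8) {
    // hook your alerting system here (80% threshold from the takeaway above)
    console.warn('pool utilization above 80%');
  }
}, 10_000);
```

Exporting these three counters to a dashboard would have made this leak visible as a steadily climbing `used` count long before the pool was exhausted.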