Scenario · Connections & Pooling
App retry connection storm
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L2 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-03/08-app-retry-connection-storm
Type: incident simulation · Topic: Connections & Pooling · Level: L2 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)

Root cause: when requests started failing, the application retried and reconnected aggressively with no backoff. Each failure spawned more connection attempts, so the app amplified its own incident: the connection count pinned near max_connections and new work couldn't get in. The database was healthy; the retry behaviour was the fault.

How it was found: the connection count in pg_stat_activity was near max_connections; grouping by application_name and state showed one app (retry_worker) holding most of the slots.

The mitigation: shed the retry workers to relieve the storm. The connection count drops back below the threshold and new work proceeds.

Lesson: aggressive retries without backoff turn a blip into an outage. The fix is client-side (jittered exponential backoff, a circuit breaker, bounded retries, connection reuse and a pool), NOT raising max_connections (which only lets the storm consume more) and not indexing.

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. New work is failing and the connection count is pinned near max_connections, but the queries themselves are fine. This is a connection storm: the app is retrying/reconnecting aggressively without backoff and amplifying the incident itself.

2. Find the culprit:
   SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC;
   One app (retry_worker) is holding a huge share of the slots. Compare the total count(*) to the value from SHOW max_connections.

3. Shed the retry workers to relieve the storm (pg_terminate_backend for application_name = 'retry_worker'). Do NOT raise max_connections; that just lets the storm grow. The durable fix is jittered backoff, a circuit breaker, bounded retries and a connection pool.
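
The durable client-side fix named above (jittered exponential backoff, bounded retries, a circuit breaker) can be sketched in Python roughly as follows. This is a minimal illustration, not the scenario's actual application code: CircuitBreaker, CircuitOpenError, backoff_delay, call_with_retry and all default values are hypothetical names and numbers chosen for the example, and op stands in for whatever function acquires a connection (ideally from a pool) and does the work.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected untried."""


class CircuitBreaker:
    """Hypothetical minimal breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, let calls through again
        # (a crude half-open state; the next failure re-opens the breaker).
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return rng.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retry(op, breaker, max_attempts=4, sleep=time.sleep, rng=random):
    """Run op() with bounded retries, jittered backoff and a circuit breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("circuit open; not attempting a connection")
        try:
            result = op()
        except ConnectionError:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise  # bounded: give up instead of retrying forever
            sleep(backoff_delay(attempt, rng=rng))
        else:
            breaker.record_success()
            return result
```

The jitter spreads reconnect attempts out so that many workers failing at once do not all retry in the same instant, the attempt limit caps how much load one request can add, and the breaker stops a worker from hammering the database while it is clearly unavailable.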