Scenario · Connections & Pooling

App retry connection storm

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L2 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-03/08-app-retry-connection-storm

Part of these paths

Connection Pool Mastery


Type: incident simulation · Topic: Connections & Pooling · Level: L2 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: when requests started failing, the application retried and
reconnected aggressively with no backoff. Each failure spawned more connection
attempts, so the app amplified its own incident: the connection count pinned
near max_connections and new work couldn't get in. The database was healthy —
the retry behaviour was the fault.

How it was found: the connection count in pg_stat_activity was near
max_connections; grouping by application_name and state showed one app
(retry_worker) holding most of the slots.
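
The comparison described above can be sketched in a few lines of Python. This is an illustration only: it assumes the grouped rows have already been fetched as (application_name, state, count) tuples, the helper name slot_usage is hypothetical, and the sample numbers are invented rather than taken from the scenario.

```python
from collections import Counter

# Invented sample rows in the shape (application_name, state, count).
# These are NOT real output from the scenario.
SAMPLE_ROWS = [
    ("retry_worker", "idle", 60),
    ("retry_worker", "active", 25),
    ("web", "active", 10),
    ("", "idle", 3),
]


def slot_usage(rows, max_connections):
    """Summarise grouped connection counts against max_connections.

    Returns (total, fraction_of_max_used, top_app, top_app_share_of_total).
    """
    per_app = Counter()
    for application_name, _state, count in rows:
        # Sessions that never set application_name show up as an empty string.
        per_app[application_name or "(unnamed)"] += count
    total = sum(per_app.values())
    if total == 0:
        return 0, 0.0, None, 0.0
    top_app, top_count = per_app.most_common(1)[0]
    return total, total / max_connections, top_app, top_count / total


total, used, top_app, share = slot_usage(SAMPLE_ROWS, max_connections=100)
```

A fraction close to 1.0 combined with one application owning most of the total is the pattern the postmortem describes: the slots are exhausted by a single client, not by slow queries.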

The mitigation: shed the retry workers by terminating their backends, which
relieves the storm. The connection count drops back below the threshold and
new work proceeds.

Lesson: aggressive retries without backoff turn a blip into an outage. The fix
is client-side: jittered exponential backoff, a circuit breaker, bounded
retries, connection reuse and a pool. It is NOT raising max_connections (which
only lets the storm consume more slots), and it is not adding an index.
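
As a rough illustration of the client-side fix, here is a minimal Python sketch of bounded retries with jittered exponential backoff (the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing ceiling). The function names, default values, and the choice of ConnectionError as the retryable exception are assumptions for the example; the scenario does not show the application's actual retry code.

```python
import random
import time


def backoff_delays(max_attempts=5, base=0.1, cap=5.0, rng=random.random):
    """Yield one delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts - 1):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_backoff(operation, max_attempts=5, base=0.1, cap=5.0,
                      sleep=time.sleep, rng=random.random):
    """Run operation(); on ConnectionError wait a jittered delay and retry.

    Gives up and re-raises after max_attempts tries in total.
    """
    delays = backoff_delays(max_attempts, base, cap, rng)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # bounded retries: surface the failure, do not loop forever
            sleep(next(delays))
```

The sleep and rng parameters exist only so the behaviour can be exercised without real waiting or randomness. The jitter matters as much as the exponent: without it, many clients that failed together retry together and recreate the spike.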

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. New work is failing and the connection count is pinned near max_connections, but the queries themselves are fine. This is a connection storm: the app is retrying/reconnecting aggressively without backoff and amplifying the incident itself.
2. Find the culprit: SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY application_name, state ORDER BY count(*) DESC; One app (retry_worker) is holding a huge share of the slots. Compare the total connection count with the value returned by SHOW max_connections.
3. Shed the retry workers to relieve the storm (pg_terminate_backend for application_name = 'retry_worker'). Do NOT raise max_connections — that just lets the storm grow. The durable fix is jittered backoff, a circuit breaker, bounded retries and a connection pool.
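
The circuit breaker named in hint 3 can be sketched as follows. This is a minimal, single-threaded Python illustration under stated assumptions: the class names, the thresholds, and the use of ConnectionError as the failure signal are hypothetical, and a production breaker would also need locking and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused up front."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast without touching the database at all.
                raise CircuitOpenError("circuit open; not attempting a connection")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate CircuitOpenError instead of opening another connection, which is what stops the application from feeding its own storm.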
