Scenario · Compound Incidents

PgBouncer saturation and retry storm

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 15–20 min · runs locally in Docker

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-11/06-pgbouncer-saturation-and-retry-storm

Part of these paths

Incident Response Readiness

Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause (compound, layered): a runaway session (`runaway_pool_holder`) sat idle in
transaction, holding its connection open and pinning a pool slot. An app retry storm
(`app_retry%`) then piled on more connections until the pool looked saturated. The
storm is an amplification layer; the held transaction is the root cause.

How it was found: pg_stat_activity (ordered by xact_start) showed the long-lived
idle-in-transaction holder; the flood of `app_retry%` sessions was the amplification.
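
A minimal sketch of that lookup, assuming it is run in psql against the sandbox database. The column list, the 60-character query preview, and the LIMIT are illustrative choices, not part of the scenario:

```sql
-- List sessions with an open transaction, oldest first.
-- A session in state 'idle in transaction' with a large xact_age is the
-- kind of holder the postmortem describes.
SELECT pid,
       application_name,
       state,
       xact_start,
       now() - xact_start AS xact_age,      -- how long the transaction has been open
       left(query, 60)    AS last_query     -- preview of the most recent statement
FROM pg_stat_activity
WHERE xact_start IS NOT NULL                -- only sessions inside a transaction
  AND pid <> pg_backend_pid()               -- skip this psql session
ORDER BY xact_start
LIMIT 10;
```

Ordering by xact_start puts the longest-open transaction on the first row, which is why the holder stands out even when the retry sessions far outnumber it.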

The fix (both steps, holder first):
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name = 'runaway_pool_holder';
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'app_retry%';
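
One way to check the result afterwards, sketched under the assumption that both application names are still visible in pg_stat_activity as described above:

```sql
-- Count any sessions still attached under the two application names.
-- No rows back means both layers are gone for now.
SELECT application_name,
       state,
       count(*) AS sessions
FROM pg_stat_activity
WHERE application_name = 'runaway_pool_holder'
   OR application_name LIKE 'app_retry%'
GROUP BY application_name, state
ORDER BY sessions DESC;
```

If `app_retry%` rows reappear, the application is probably still retrying, so re-run the check a few seconds later before treating the incident as cleared.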

Lesson: pool saturation is usually an amplifier, not the cause. First terminate the
runaway transaction that is pinning a connection, then shed the retry storm. Raising
max_connections, bumping the pool size, killing random clients, or adding an
index all leave the root-cause holder in place.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The pool is saturated and the app can't get connections. There are two layers: a runaway session holding a connection (application_name = 'runaway_pool_holder', state idle in transaction) and an app retry storm (app_retry%). Check pg_stat_activity ORDER BY xact_start in PostgreSQL, and SHOW POOLS in the PgBouncer admin console.
2. The retry storm is amplification, not the root cause. Find the long-held transaction pinning a connection and clear it first, then shed the retry connections.
3. Terminate the runaway holder (application_name = 'runaway_pool_holder') AND reduce the retry storm (terminate app_retry%). Don't just raise max_connections, don't stop at killing random clients, and don't add an index.
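
The two layers in the hints can be seen side by side with a grouped view of pg_stat_activity. This is a sketch, assuming PostgreSQL 10 or later (for the backend_type column); the aliases are illustrative:

```sql
-- Summarise client sessions per application and state.
-- Expect one long-lived 'idle in transaction' group (the holder) and a
-- much larger group of short-lived retry sessions (the amplification).
SELECT application_name,
       state,
       count(*)                 AS sessions,
       max(now() - xact_start)  AS oldest_xact_age
FROM pg_stat_activity
WHERE backend_type = 'client backend'       -- ignore background workers
  AND pid <> pg_backend_pid()               -- skip this psql session
GROUP BY application_name, state
ORDER BY oldest_xact_age DESC NULLS LAST,   -- longest-open transaction first
         sessions DESC;
```

Sorting by transaction age rather than session count keeps the single holder above the retry flood, which matches the order in which the hints say to clear them.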
