Scenario · Connections & Pooling

Connection storm after deploy

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L2 · 10–15 min · runs locally in Docker

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-03/03-connection-storm-after-deploy

Part of these paths

Connection Pool Mastery · SRE On-Call Path · Production Readiness

Type: incident simulation · Topic: Connections & Pooling · Level: L2 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: a deploy restarted the app fleet and every worker reconnected at the
same moment — a thundering herd. The synchronized burst of connections pushed
the database to max_connections, so new connections failed and latency spiked.
Unlike a slow leak, this is bursty: triggered by the restart.

How it was found: pg_stat_activity showed a sudden mass of one application
(deploy_worker) near max_connections right after the deploy.

The mitigation: terminate the excess idle deploy_worker connections with
pg_terminate_backend(pid) to free slots and let the database recover.

Lesson: a reconnect storm is a client-behavior problem. Fix it with a connection
pool, jittered reconnects with exponential backoff, a smaller app pool size, and
readiness gating on restart — not with indexes, and not by blindly raising
max_connections.
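
The "jittered reconnects with exponential backoff" part of the lesson can be sketched roughly as follows. This is a minimal illustration, not the scenario's reference solution: `connect` stands in for whatever callable your driver uses to open a connection, the `base`, `cap` and `max_attempts` defaults are made-up values, and catching `ConnectionError` is an assumption (real drivers usually raise their own exception type).

```python
import random
import time


def backoff_ceiling(attempt, base=0.5, cap=30.0):
    """Upper bound (seconds) for the wait after a failed attempt: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def connect_with_backoff(connect, max_attempts=6, base=0.5, cap=30.0,
                         sleep=time.sleep, rng=random):
    """Call the hypothetical `connect` callable, retrying with "full jitter" backoff.

    Each wait is drawn uniformly from [0, ceiling], so workers that were
    restarted at the same instant do not all retry at the same instant.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError as exc:  # assumption: swap in your driver's error type
            last_error = exc
            if attempt + 1 < max_attempts:
                # Randomized wait; the ceiling doubles on every failed attempt.
                sleep(rng.uniform(0.0, backoff_ceiling(attempt, base, cap)))
    raise last_error
```

Drawing the whole delay at random (rather than adding a small random offset to a fixed delay) is one common choice because it spreads a synchronized fleet across the entire window; it complements a pool and a smaller app pool size rather than replacing them.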

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. This spiked right after a deploy: a wave of workers all reconnected at once. Group pg_stat_activity by application_name — a burst of one app (deploy_worker) is eating the slots.
2. Compare the count to SHOW max_connections. A thundering-herd reconnect after restart looks like 'too many clients' but is bursty, not a steady leak.
3. Shed the excess workers to recover now: pg_terminate_backend(pid) for that application_name where state = 'idle'. The real fix is pooling + jittered reconnect with backoff and a smaller app pool size.
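
The three hints above can be sketched as a small triage helper. This is an illustration under stated assumptions, not the scenario's grader or expected answer: the SQL strings are queries one might paste into psql (the `backend_type` filter assumes PostgreSQL 10 or newer), and the Python functions operate on rows already fetched from pg_stat_activity as plain dicts, so the names `summarize` and `idle_pids_to_shed` are hypothetical.

```python
from collections import Counter

# Hint 1: which application holds the connection slots?
COUNT_BY_APP_SQL = """
SELECT application_name, count(*) AS connections
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY application_name
ORDER BY connections DESC;
"""

# Hint 3: terminate only the idle backends of the offending application,
# never the session that is running this query.
SHED_IDLE_SQL = """
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE application_name = 'deploy_worker'
  AND state = 'idle'
  AND pid <> pg_backend_pid();
"""


def summarize(rows, max_connections):
    """Return (per-application counts, largest first) and the fraction of slots in use."""
    per_app = Counter(row["application_name"] for row in rows)
    return per_app.most_common(), len(rows) / max_connections


def idle_pids_to_shed(rows, app_name):
    """Pick the pids that the shed query would target: idle backends of one application."""
    return [row["pid"] for row in rows
            if row["application_name"] == app_name and row["state"] == "idle"]
```

Restricting the shed step to `state = 'idle'` leaves backends that are in the middle of a query alone, which keeps the mitigation from aborting real work while still freeing slots.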
