Scenario · Incident Control

Checkout meltdown

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L5 · 20–30 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-12/01-checkout-meltdown

Part of these paths

Capstone: Incident Control

Show the postmortem & investigation hints spoilers

Checkout meltdown
Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 20–30 min
Launch: ride postgres start stage-12/01-checkout-meltdown

POSTMORTEM (root cause · how it was found · the fix · lesson)
INCIDENT TIMELINE (Incident Control capstone)

Triage: checkout was melting down under peak traffic. pg_stat_statements/EXPLAIN
showed the checkout query running as a Seq Scan; pg_stat_activity showed a flood of
`checkout_retry%` connections and a `checkout_hot_row_holder` transaction holding a
checkout row with a writer blocked behind it; SHOW POOLS showed the pool saturated.
Several connected causes, not one knob.

Stabilize: shed the retry storm and end the runaway transaction holding the hot row
so blocked writes drain and the pool frees:
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity
   WHERE application_name LIKE 'checkout_hot_row_holder%'
      OR application_name LIKE 'checkout_retry%';

Root cause: the checkout query had no supporting index — build it without blocking:
  CREATE INDEX CONCURRENTLY idx_orders_checkout ON orders (user_id, status, created_at DESC);

Recovery & validation: the checkout plan now uses the index, the retry storm is gone,
no session is blocked on the hot row, and the pool is no longer saturated.

FINAL CHECKLIST
- [x] System stabilized (storm shed, blocker cleared, writes unblocked)
- [x] Root cause fixed (checkout index in place, efficient plan)
- [x] Dangerous shortcuts avoided (no raise max_connections / pool bump / wrong index)
- [x] App path validated (checkout query fast, no blocked writers)

Lesson: a checkout outage is rarely one knob. Stabilize the pressure (storm + blocker),
fix the database root cause (the right index), and validate the app path. Raising
max_connections, bumping the pool, killing random clients, or indexing the wrong
column each leave the meltdown in place.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Peak-traffic checkout meltdown — several things are wrong at once. Triage broadly first: pg_stat_statements + EXPLAIN (slow checkout query), pg_stat_activity grouped by application_name/state (retry storm + a held transaction), SHOW POOLS (pool pressure).
2. Stabilize, then fix root cause, then validate. There's a runaway transaction holding a checkout row (checkout_hot_row_holder) with a writer blocked behind it, a retry storm (checkout_retry%), and the checkout query is a Seq Scan for lack of an index.
3. Clear the blocker and the storm, then build the index CONCURRENTLY: terminate checkout_hot_row_holder% and checkout_retry%, then CREATE INDEX CONCURRENTLY idx_orders_checkout ON orders (user_id, status, created_at DESC). Don't just raise max_connections, bump the pool, or index status alone.

Start now →← All scenarios