Scenario · Compound Incidents

Checkout slow query and connection storm

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 15–20 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-11/01-checkout-slow-query-and-connection-storm

Part of these paths

SRE On-Call Path Incident Response Readiness

Show the postmortem & investigation hints spoilers

Checkout slow query and connection storm
Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min
Launch: ride postgres start stage-11/01-checkout-slow-query-and-connection-storm

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause (a chain, not one fault): the checkout query
`... WHERE user_id = ? AND status = 'paid' ORDER BY created_at DESC LIMIT 20` had no
supporting index and ran as a Seq Scan. As it slowed, the application retried and
opened more and more connections (`checkout_retry%`), saturating the pool — which
*looked* like a pure connection/capacity problem.

How it was found: pg_stat_statements / EXPLAIN exposed the Seq Scan; pg_stat_activity
showed the retry storm; the two were linked (slow query → retries → pool pressure).

The fix (both, in order):
CREATE INDEX CONCURRENTLY idx_orders_checkout ON orders (user_id, status, created_at DESC);
-- then shed the storm the slowness created:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'checkout_retry%';

Lesson: a connection storm is often a *symptom*. Treating it as only a connection
problem (raise max_connections, kill backends, restart the pooler) leaves the slow
query to refill the pool. Fix the database root cause, then relieve the pressure.
Adding an index on `status` alone (low selectivity) is the wrong index.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Checkout is degraded and you see a flood of connections (application_name LIKE 'checkout_retry%'). Don't stop at the connections — find why they pile up. Check pg_stat_statements and EXPLAIN the checkout query; check pg_stat_activity / SHOW POOLS for the pressure.
2. Two connected problems: the checkout query is doing a Seq Scan (missing composite index), and that slowness drives an app retry storm that saturates the pool. Fix the DB root cause first, then shed the storm.
3. Build the right index — CREATE INDEX CONCURRENTLY idx_orders_checkout ON orders (user_id, status, created_at DESC) — then terminate the retry-storm sessions. Raising max_connections, killing connections without the index, or adding an index on status alone won't pass.

Start now →← All scenarios