Scenario · Compound Incidents
Bad migration and lock pileup
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L4 · 15–20 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-11/03-bad-migration-and-lock-pileupPart of these paths
Show the postmortem & investigation hints spoilers
Bad migration and lock pileup Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min Launch: ride postgres start stage-11/03-bad-migration-and-lock-pileup POSTMORTEM (root cause · how it was found · the fix · lesson) Root cause (a chain): a release was mid-flight on `app_db.orders`. A migration session held a write-blocking lock (a SHARE lock, like a non-concurrent index build — writes blocked, reads still allowed) and production writers piled up behind it (a lock pileup), AND the release had only half-applied `release_id` — the column was committed but every row was NULL with no default. Killing random app sessions or rolling back the column are both wrong moves. How it was found: pg_stat_activity showed `migration_runner` holding the lock with `app_writer` blocked (wait_event_type = 'Lock'); information_schema.columns showed release_id nullable with no default and all-NULL data. The fix (both, in order): -- end the blocker so writes resume (not the app writers) SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'migration_runner%'; -- then finish the rollout safely UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL; ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown'; Lesson: a migration-caused lock pileup is a who-holds-the-lock problem plus an unfinished-release problem. End the migration blocker (not the app writers), then complete the rollout (backfill + default). Don't drop the column as a "rollback" and don't reach for an index. INVESTIGATION HINTS (the staged path to diagnose and fix) 1. Writes to app_db.orders are piling up behind a migration. Two problems: a migration_runner session holds a write-blocking lock (like a non-concurrent index build — writers blocked, reads OK), and the release left orders.release_id half-applied. Check pg_stat_activity / pg_locks and information_schema.columns. 2. Clear the right blocker — terminate the migration session (application_name LIKE 'migration_runner%'), not the app writers — so writes resume. Then finish the rollout: release_id is committed but NULL with no default. 3. After unblocking: UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL; ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown'; Don't kill the app writers, don't DROP the column as a rollback, and don't add an index.