← All scenarios

Scenario · Compound Incidents

Bad migration and lock pileup

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 15–20 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-11/03-bad-migration-and-lock-pileup

Part of these paths

Show the postmortem & investigation hints spoilers
Bad migration and lock pileup
Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min
Launch: ride postgres start stage-11/03-bad-migration-and-lock-pileup

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause (a chain): a release was mid-flight on `app_db.orders`. A migration
session held a write-blocking lock (a SHARE lock, like a non-concurrent index build —
writes blocked, reads still allowed) and production writers piled up behind it
(a lock pileup), AND the release had only half-applied `release_id` — the column was
committed but every row was NULL with no default. Killing random app sessions or
rolling back the column are both wrong moves.

How it was found: pg_stat_activity showed `migration_runner` holding the lock with
`app_writer` blocked (wait_event_type = 'Lock'); information_schema.columns showed
release_id nullable with no default and all-NULL data.

The fix (both, in order):
  -- end the blocker so writes resume (not the app writers)
  SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'migration_runner%';
  -- then finish the rollout safely
  UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL;
  ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown';

Lesson: a migration-caused lock pileup is a who-holds-the-lock problem plus an
unfinished-release problem. End the migration blocker (not the app writers), then
complete the rollout (backfill + default). Don't drop the column as a "rollback" and
don't reach for an index.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Writes to app_db.orders are piling up behind a migration. Two problems: a migration_runner session holds a write-blocking lock (like a non-concurrent index build — writers blocked, reads OK), and the release left orders.release_id half-applied. Check pg_stat_activity / pg_locks and information_schema.columns.
2. Clear the right blocker — terminate the migration session (application_name LIKE 'migration_runner%'), not the app writers — so writes resume. Then finish the rollout: release_id is committed but NULL with no default.
3. After unblocking: UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL; ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown'; Don't kill the app writers, don't DROP the column as a rollback, and don't add an index.