Scenario · Incident Control

Final production incident

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L5 · 25–35 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-12/05-final-production-incident

Part of these paths

Capstone: Incident Control

Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 25–35 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
INCIDENT TIMELINE (Incident Control — the final capstone)

Triage: a release left production broken on four connected fronts:
  1. a migration session held a write-blocking lock on app_db.orders, with a
     production writer piled up behind it;
  2. the `release_id` rollout was half-applied (the column was committed, but
     existing rows were NULL and it had no default);
  3. the tenant rollout reached tenant_a/tenant_c but skipped tenant_b;
  4. app_user lost access to the release's billing_db.new_invoices.
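
The first two fronts can be confirmed from inside app_db with catalog queries. The following is a minimal sketch, not the scenario's reference solution. It assumes the session can see other sessions' rows in pg_stat_activity (for example a superuser or a member of pg_read_all_stats) and that `orders` is the only table of that name in the database.

```sql
-- Run in app_db. Lists every session that is waiting on a lock, together
-- with the PIDs of the sessions that block it.
SELECT a.pid,
       a.application_name,
       a.state,
       a.wait_event_type,
       pg_blocking_pids(a.pid) AS blocked_by,
       now() - a.xact_start    AS xact_age,
       left(a.query, 60)       AS query
FROM pg_stat_activity AS a
WHERE a.datname = current_database()
  AND cardinality(pg_blocking_pids(a.pid)) > 0
ORDER BY a.xact_start;

-- Shows whether release_id exists and whether it has a default yet.
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'orders'
  AND column_name = 'release_id';
```

Looking up each PID in `blocked_by` against pg_stat_activity shows which application_name owns the blocking session.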

Stabilize: end the migration blocker (not the app writers) so writes drain:
  SELECT pg_terminate_backend(pid)
  FROM pg_stat_activity
  WHERE application_name LIKE 'migration_runner%';
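
Before terminating anything, it can help to preview exactly which sessions the filter matches, and afterwards to confirm that nothing is still waiting on a lock. A hedged sketch, assuming the blocker is the only session whose application_name starts with `migration_runner`:

```sql
-- Preview: which sessions would the terminate statement hit?
-- The pg_backend_pid() guard excludes the session running this query.
SELECT pid, usename, application_name, state, xact_start
FROM pg_stat_activity
WHERE application_name LIKE 'migration_runner%'
  AND pid <> pg_backend_pid();

-- After terminating: sessions still stuck behind a heavyweight lock.
-- An empty result suggests the queued writes have drained.
SELECT pid, application_name, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```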

Recover, layer by layer (compatibility + completeness + least-privilege):
  -- app_db rollout
  \connect app_db
  UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL;
  ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown';
  -- the skipped tenant
  \connect tenant_b
  ALTER TABLE orders ADD COLUMN release_id text;
  INSERT INTO schema_migrations (version) VALUES ('20260531_add_order_release_id');
  -- billing access, least-privilege
  GRANT CONNECT ON DATABASE billing_db TO app_user;
  \connect billing_db
  GRANT SELECT ON new_invoices TO app_user;
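
Each recovery layer above can be checked with a read-only query. This is an illustrative sketch rather than the scenario's own scoring logic. It relies only on the object names used in the fix and on the built-in functions has_database_privilege and has_table_privilege.

```sql
-- app_db: the backfill should leave no NULLs, and the column should
-- now report a default.
\connect app_db
SELECT count(*) AS null_release_ids
FROM orders
WHERE release_id IS NULL;

SELECT column_default
FROM information_schema.columns
WHERE table_name = 'orders'
  AND column_name = 'release_id';

-- billing_db: both checks should return true for app_user.
\connect billing_db
SELECT has_database_privilege('app_user', 'billing_db', 'CONNECT') AS can_connect,
       has_table_privilege('app_user', 'new_invoices', 'SELECT')   AS can_select;
```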

FINAL CHECKLIST
- [x] Blocker gone; app writes unblocked
- [x] app_db rollout safe (release_id backfilled + default)
- [x] Every tenant migrated and recorded (tenant_b caught up)
- [x] app_user has the required billing access AND no superpowers
- [x] No shortcuts: no superuser, no single-tenant-only validation, no DROP-column rollback, no index
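
The "no superpowers" item can be checked against the role catalogs. A minimal sketch, assuming app_user is a plain login role; which attributes and memberships count as acceptable is a judgment the scenario itself defines.

```sql
-- Role attributes: every flag should be false for a least-privilege app role.
SELECT rolname, rolsuper, rolcreaterole, rolcreatedb, rolbypassrls
FROM pg_roles
WHERE rolname = 'app_user';

-- Roles that app_user is a direct member of; unexpected entries here can
-- grant privileges indirectly.
SELECT r.rolname AS member_of
FROM pg_auth_members AS m
JOIN pg_roles AS r ON r.oid = m.roleid
JOIN pg_roles AS u ON u.oid = m.member
WHERE u.rolname = 'app_user';
```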

Lesson: incident response is not one fix. Stabilize, find every connected cause,
recover safely across databases and tenants, and validate the whole production path.
Any shortcut — superuser, validating one tenant, a destructive rollback — leaves the
incident half-resolved.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The final exam: one release broke production on several fronts. Triage broadly — pg_stat_activity (a migration_runner holds a lock, app_writer blocked), information_schema.columns for app_db.orders (release_id half-applied), schema_migrations across tenant_a/b/c, and app_user's access to billing_db.new_invoices.
2. Stabilize then recover every layer: end the migration blocker (not the app) so writes drain; backfill + default release_id; finish the rollout on the tenant that was skipped; restore app_user's billing access with least-privilege grants.
3. Don't take shortcuts: terminate migration_runner%, fix release_id (backfill + default), migrate tenant_b (column + schema_migrations version), GRANT CONNECT ON DATABASE billing_db + SELECT on new_invoices. Never grant superuser, never validate only one tenant, never DROP the column.
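
Validating every tenant, rather than only one, can be done by repeating the same check in each tenant database. The sketch below assumes tenant_a and tenant_c are databases named the same way as tenant_b, and that each holds its own `orders` and `schema_migrations` tables.

```sql
-- Run in psql. Each query reports whether the column exists and whether
-- the migration version is recorded for that tenant.
\connect tenant_a
SELECT current_database() AS tenant,
       EXISTS (SELECT 1 FROM information_schema.columns
               WHERE table_name = 'orders'
                 AND column_name = 'release_id') AS has_column,
       EXISTS (SELECT 1 FROM schema_migrations
               WHERE version = '20260531_add_order_release_id') AS recorded;

\connect tenant_b
SELECT current_database() AS tenant,
       EXISTS (SELECT 1 FROM information_schema.columns
               WHERE table_name = 'orders'
                 AND column_name = 'release_id') AS has_column,
       EXISTS (SELECT 1 FROM schema_migrations
               WHERE version = '20260531_add_order_release_id') AS recorded;

\connect tenant_c
SELECT current_database() AS tenant,
       EXISTS (SELECT 1 FROM information_schema.columns
               WHERE table_name = 'orders'
                 AND column_name = 'release_id') AS has_column,
       EXISTS (SELECT 1 FROM schema_migrations
               WHERE version = '20260531_add_order_release_id') AS recorded;
```

A tenant where either flag is false is one the rollout skipped or failed to record.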
