Scenario · Incident Control
Final production incident
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L5 · 25–35 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-12/05-final-production-incident
Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 25–35 min
POSTMORTEM (root cause · how it was found · the fix · lesson)
INCIDENT TIMELINE (Incident Control — the final capstone)
Triage: a release left production broken on four connected fronts:
1. a migration session held a write-blocking lock on app_db.orders, with a
production writer piled up behind it;
2. the `release_id` rollout was half-applied (column committed, existing rows NULL, no default);
3. the tenant rollout reached tenant_a/tenant_c but skipped tenant_b;
4. app_user lost access to the release's billing_db.new_invoices.
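The first two fronts can be confirmed straight from the system catalogs before touching anything. This is a minimal triage sketch, assuming `orders` lives in app_db as described above and that your session is allowed to see other roles' rows in pg_stat_activity. The column aliases are illustrative only.

```sql
-- Front 1: who is waiting on a lock, and which session is blocking them?
-- pg_blocking_pids() is built in (PostgreSQL 9.6+).
SELECT blocked.pid              AS blocked_pid,
       blocked.application_name AS blocked_app,
       blocker.pid              AS blocker_pid,
       blocker.application_name AS blocker_app,
       blocker.state            AS blocker_state,
       blocker.xact_start       AS blocker_xact_start
FROM pg_stat_activity AS blocked
CROSS JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid)
JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid;

-- Front 2: is release_id present, nullable, and missing a default?
\connect app_db
SELECT column_name, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'orders'
  AND column_name = 'release_id';

-- How many existing rows still need a backfill?
SELECT count(*) AS null_release_ids
FROM orders
WHERE release_id IS NULL;
```

Reading the blocker's application_name before acting is what separates the migration session from the application writers queued behind it.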
Stabilize: end the migration blocker (not the app writers) so writes drain:
SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'migration_runner%';
Recover, layer by layer (compatibility + completeness + least-privilege):
-- app_db rollout: backfill existing rows, then set a default for new ones
\connect app_db
UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL;
ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown';
-- the skipped tenant
\connect tenant_b
ALTER TABLE orders ADD COLUMN release_id text;
INSERT INTO schema_migrations (version) VALUES ('20260531_add_order_release_id');
-- billing access, least-privilege
GRANT CONNECT ON DATABASE billing_db TO app_user;
\connect billing_db
GRANT SELECT ON new_invoices TO app_user;
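One possible hardening of the tenant_b step, offered as a sketch and not as the scenario's reference fix, is to apply the column change and the migration record in a single transaction with a lock timeout, so the catch-up cannot itself become a long-lived blocker. The 5-second timeout and the NOT EXISTS guard are assumptions. The guard is used because the constraints on schema_migrations are not shown in the source.

```sql
\connect tenant_b
BEGIN;
-- Fail fast instead of queueing behind other sessions (5s is an arbitrary choice).
SET LOCAL lock_timeout = '5s';

-- Safe to re-run: does nothing if the column is already there.
ALTER TABLE orders ADD COLUMN IF NOT EXISTS release_id text;

-- Record the migration only if it has not been recorded yet.
INSERT INTO schema_migrations (version)
SELECT '20260531_add_order_release_id'
WHERE NOT EXISTS (
    SELECT 1
    FROM schema_migrations
    WHERE version = '20260531_add_order_release_id'
);
COMMIT;
```

If the lock timeout fires, the transaction is aborted and the COMMIT rolls it back, so the column and its schema_migrations row either both land or neither does.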
FINAL CHECKLIST
- [x] Blocker gone; app writes unblocked
- [x] app_db rollout safe (release_id backfilled + default)
- [x] Every tenant migrated and recorded (tenant_b caught up)
- [x] app_user has the required billing access AND no superpowers
- [x] No shortcuts: no superuser, no single-tenant-only validation, no DROP-column rollback, no index
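Each checklist item can be checked with a read-only query. This is a hedged sketch, not the scenario's scoring logic. It assumes the database, table, and role names used in the postmortem, and the tenant section has to be repeated for tenant_a, tenant_b, and tenant_c.

```sql
-- Blocker gone: no session should still be waiting on another session's lock.
SELECT pid, application_name, wait_event_type
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- app_db rollout safe: no NULLs left, and a default is set.
\connect app_db
SELECT count(*) AS null_release_ids
FROM orders
WHERE release_id IS NULL;
SELECT column_default
FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'release_id';

-- Every tenant migrated and recorded (shown for tenant_b; repeat per tenant).
\connect tenant_b
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'release_id';
SELECT version
FROM schema_migrations
WHERE version = '20260531_add_order_release_id';

-- Billing access restored without superpowers.
\connect billing_db
SELECT has_database_privilege('app_user', 'billing_db', 'CONNECT') AS can_connect,
       has_table_privilege('app_user', 'new_invoices', 'SELECT')   AS can_select;
SELECT rolname, rolsuper
FROM pg_roles
WHERE rolname = 'app_user';
```

After a complete fix, the first query should return no rows, each tenant query should return one row, both privilege checks should be true, and rolsuper should be false.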
Lesson: incident response is not one fix. Stabilize, find every connected cause,
recover safely across databases and tenants, and validate the whole production path.
Any shortcut — superuser, validating one tenant, a destructive rollback — leaves the
incident half-resolved.
INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The final exam: one release broke production on several fronts. Triage broadly — pg_stat_activity (a migration_runner holds a lock, app_writer blocked), information_schema.columns for app_db.orders (release_id half-applied), schema_migrations across tenant_a/b/c, and app_user's access to billing_db.new_invoices.
2. Stabilize then recover every layer: end the migration blocker (not the app) so writes drain; backfill + default release_id; finish the rollout on the tenant that was skipped; restore app_user's billing access with least-privilege grants.
3. Don't take shortcuts: terminate the migration_runner% sessions, fix release_id (backfill + default), migrate tenant_b (column + schema_migrations version), and GRANT CONNECT ON DATABASE billing_db plus SELECT ON new_invoices. Never grant superuser, never validate only one tenant, never DROP the column.