Scenario · Incident Control
Release fallout
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L5 · 20–30 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-12/02-release-fallout
Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 20–30 min
Launch: ride postgres start stage-12/02-release-fallout
POSTMORTEM (root cause · how it was found · the fix · lesson)
INCIDENT TIMELINE (Incident Control capstone)
Triage: the release left several connected problems. A migration_runner session held a
write-blocking lock on app_db.orders, with a production writer queued behind it; the
`release_id` rollout was half-applied (the column was committed, but its values were NULL
and it had no default) while the old app version (old_app) was still connected; and the
tenant rollout had reached tenant_a and tenant_c but not tenant_b. A blind rollback would
have made it worse.
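The three triage findings above can be checked with read-only queries. What follows is a minimal sketch, not the scenario's reference solution. It assumes that app_db is the database name seen in pg_stat_activity and that schema_migrations has a `version` column, as the fix statements suggest.

```sql
-- 1. Who is blocked, and by whom? pg_blocking_pids() returns the PIDs that
--    hold locks the given backend is waiting on (empty array if not blocked).
--    Assumption: app_db is the database name reported in datname.
SELECT a.pid,
       a.usename,
       a.application_name,
       a.state,
       a.wait_event_type,
       pg_blocking_pids(a.pid) AS blocked_by,
       left(a.query, 80)       AS query
FROM pg_stat_activity AS a
WHERE a.datname = 'app_db'
ORDER BY a.query_start;

-- 2. Is release_id present, nullable, and missing a default? (run in app_db)
SELECT column_name, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'orders'
  AND column_name = 'release_id';

-- 3. Did this tenant record the migration? Repeat after \connect to each of
--    tenant_a, tenant_b and tenant_c; a tenant that returns no row was skipped.
SELECT version
FROM schema_migrations
WHERE version = '20260531_add_order_release_id';
```

A session with a non-empty `blocked_by` is a victim. The PIDs listed there are the ones to inspect before terminating anything.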
Stabilize: end the migration blocker so writes drain (terminate the sessions matching
migration_runner%, not the app writers).
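One way to express the stabilize step is to preview the matching sessions and then terminate only those. This is a sketch under an assumption: the source gives only the pattern migration_runner%, so whether it applies to the role name or the application name is a guess, and the filter below checks both.

```sql
-- Preview: which sessions match the migration_runner% pattern?
-- Assumption: the pattern applies to usename and/or application_name.
SELECT pid, usename, application_name, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE (usename LIKE 'migration_runner%'
       OR application_name LIKE 'migration_runner%')
  AND pid <> pg_backend_pid();   -- never target the current session

-- Terminate only those sessions; app writers and old_app are left untouched.
SELECT pid, pg_terminate_backend(pid) AS terminated
FROM pg_stat_activity
WHERE (usename LIKE 'migration_runner%'
       OR application_name LIKE 'migration_runner%')
  AND pid <> pg_backend_pid();
```

Running the preview first makes it harder to terminate an app writer by mistake, which is the trap this step warns about.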
Recover (compatibility, not rollback): backfill and default the half-applied column
so the old app keeps working:
UPDATE orders SET release_id = 'unknown' WHERE release_id IS NULL;
ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown';
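After the backfill and default are in place, a quick check can confirm that the compatibility fix landed. This is a sketch to run in app_db; the `\d` line is a psql meta-command, not SQL.

```sql
-- Should return 0 once the backfill has committed.
SELECT count(*) AS remaining_nulls
FROM orders
WHERE release_id IS NULL;

-- psql meta-command: the Default column for release_id should now show 'unknown'.
\d orders
```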
Finish the rollout on the tenant that was skipped:
\connect tenant_b
ALTER TABLE orders ADD COLUMN release_id text;
INSERT INTO schema_migrations (version) VALUES ('20260531_add_order_release_id');
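If the tenant_b catch-up might be run more than once, a re-runnable variant of the two statements above could look like the following. This is an illustrative alternative, not the scenario's graded fix, and it assumes nothing about constraints on schema_migrations.

```sql
-- Run while connected to tenant_b (psql: \connect tenant_b).

-- Adds the column only if it is missing, so a second run does not fail.
ALTER TABLE orders ADD COLUMN IF NOT EXISTS release_id text;

-- Records the migration only if it has not been recorded yet.
INSERT INTO schema_migrations (version)
SELECT '20260531_add_order_release_id'
WHERE NOT EXISTS (
    SELECT 1
    FROM schema_migrations
    WHERE version = '20260531_add_order_release_id'
);

-- Verify: exactly one row should come back.
SELECT version
FROM schema_migrations
WHERE version = '20260531_add_order_release_id';
```

The `WHERE NOT EXISTS` guard is used instead of `ON CONFLICT` because the latter only helps when a unique constraint on `version` exists, and the source does not say whether one does.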
FINAL CHECKLIST
- [x] System stabilized (migration blocker gone, app writes unblocked)
- [x] Compatibility restored (release_id backfilled + default; old app keeps working)
- [x] Every intended tenant migrated and recorded (tenant_b caught up)
- [x] Dangerous shortcuts avoided (no DROP-column rollback, no superuser, no index)
Lesson: a failed release is a compatibility-recovery problem, not a blind rollback.
Stabilize the lock pileup, restore backward compatibility, and complete the rollout
across every tenant. Validating only the healthy tenant, dropping the column,
killing the app, or granting superuser all leave the fallout in place.
INVESTIGATION HINTS (the staged path to diagnose and fix)
1. A release went wrong on several fronts at once. Triage: pg_stat_activity (a migration_runner holds a lock, app_writer blocked, old_app still connected), information_schema.columns for app_db.orders (release_id half-applied), and schema_migrations across tenant_a/b/c (one tenant left behind).
2. Stabilize then recover — don't blind-rollback. End the migration blocker (not the app), restore backward compatibility (backfill release_id + default so the old app keeps working), and finish the rollout on the tenant that was skipped.
3. Terminate the migration_runner% sessions, then in app_db: UPDATE orders SET release_id='unknown' WHERE release_id IS NULL; ALTER TABLE orders ALTER COLUMN release_id SET DEFAULT 'unknown'; then on tenant_b: ALTER TABLE orders ADD COLUMN release_id text; INSERT INTO schema_migrations (version) VALUES ('20260531_add_order_release_id'); Don't validate only tenant_a, don't DROP the column, and don't grant superuser.