Scenario · Incident Control

Recovery crisis

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L5 · 20–30 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-12/04-recovery-crisis

Part of these paths

Capstone: Incident Control


Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 20–30 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
INCIDENT TIMELINE (Incident Control capstone)

Triage: mid-recovery, nothing had actually been proven. There were two independent gaps.
First, there was no complete multi-database backup (billing_db wasn't covered). Second,
restore_target still held stale data (its db_identity read 'stale' rather than 'app_db',
so it did not contain the app_db snapshot it should). Under pressure it's tempting to
glance at the healthy source and call it done.

Recover (via the safe action layer):
  pgpg action take-complete-multidb-backup   -- cover app_db AND billing_db
  pgpg action restore-app-to-target          -- put app_db's snapshot in restore_target

Validate: the backup covers every critical database and the restore target holds the
correct production marker.
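
The two validation conditions can be written down as a small check. This is a minimal
sketch, not the scenario's scorer: the function name, the sample file names, and the idea
that each backup artifact carries its database name in its file name are assumptions made
for illustration. Only the database names and the 'app_db' / 'stale' markers come from the
postmortem.

```python
from typing import Iterable, List

# Critical databases named in the postmortem.
CRITICAL_DATABASES = ("app_db", "billing_db")
# Marker the restore target should report once app_db's snapshot is in place.
EXPECTED_TARGET_MARKER = "app_db"


def recovery_validated(backup_entries: Iterable[str], target_marker: str) -> List[str]:
    """Return the unmet conditions; an empty list means both checks pass.

    Assumption (hypothetical naming, not confirmed by the scenario): a backup
    artifact for a database has that database's name in its file name.
    """
    entries = list(backup_entries)
    problems = []
    # Condition 1: every critical database has at least one backup artifact.
    for db in CRITICAL_DATABASES:
        if not any(db in name for name in entries):
            problems.append(f"no backup artifact found for {db}")
    # Condition 2: the restore target reports the production marker, not 'stale'.
    if target_marker != EXPECTED_TARGET_MARKER:
        problems.append(
            f"restore target marker is {target_marker!r}, "
            f"expected {EXPECTED_TARGET_MARKER!r}"
        )
    return problems


if __name__ == "__main__":
    # Hypothetical pre-fix state: only app_db backed up, target still stale.
    for problem in recovery_validated(["app_db.dump"], "stale"):
        print(problem)
```

An empty result means both conditions hold. Anything else names what still has to be
proven before the crisis can be declared over.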

FINAL CHECKLIST
- [x] Backup coverage proven (app_db AND billing_db)
- [x] Restore target holds the correct production snapshot (not stale)
- [x] Source-only / app-only validation rejected
- [x] No blind shortcuts (no adding an index, no deleting backup artifacts)

Lesson: recovery is not complete until backup coverage AND the restored target are
validated. A healthy-looking source database proves nothing; validate the restored
target and every critical database before declaring the crisis over.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. This is a recovery crisis: you must prove recoverability, not just 'run a restore'. For triage, check current_database() and db_identity in each of app_db, billing_db and restore_target, and use pg_ls_dir('/tmp/pgpg_backup') to see backup coverage.
2. Two things must hold: the backup covers EVERY critical database (app_db AND billing_db), and the restore target holds the right production snapshot (restore_target.db_identity should read 'app_db', not 'stale'). Validating the source database proves nothing.
3. Recover and verify: `pgpg action take-complete-multidb-backup` then `pgpg action restore-app-to-target`. Don't validate only the source or only app_db, and don't add an index.
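
The triage pass in hint 1 can be sketched as a list of psql invocations, one per database
plus one directory listing. This is an illustrative sketch under stated assumptions: it
treats db_identity as a table you can SELECT from (the hints do not say what kind of object
it is), it relies on connection settings coming from the environment, and the function name
is hypothetical. Note that pg_ls_dir is normally restricted to superusers or roles that have
been granted EXECUTE on it.

```python
from typing import List

# Databases and backup directory named in the hints.
DATABASES = ("app_db", "billing_db", "restore_target")
BACKUP_DIR = "/tmp/pgpg_backup"


def triage_commands(identity_query: str = "SELECT * FROM db_identity") -> List[List[str]]:
    """Build argv lists for the triage checks (suitable for subprocess.run).

    Assumption: db_identity is a queryable table; pass a different
    identity_query if the sandbox exposes it another way.
    """
    commands = []
    for db in DATABASES:
        commands.append([
            "psql", "-d", db,
            "-At",                                # unaligned, tuples-only output
            "-c", "SELECT current_database()",    # confirm where we are connected
            "-c", identity_query,                 # read this database's marker
        ])
    # Server-side listing of the backup directory to judge coverage.
    commands.append([
        "psql", "-d", "app_db", "-At",
        "-c", f"SELECT pg_ls_dir('{BACKUP_DIR}')",
    ])
    return commands


if __name__ == "__main__":
    for argv in triage_commands():
        print(argv)
```

Building argv lists rather than shell strings avoids quoting problems with the single
quotes inside the pg_ls_dir call.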
