Scenario · HA & Failover
Post-failover validation
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L4 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-08/10-post-failover-validation
Postmortem & investigation hints (spoilers)
Type: incident simulation · Topic: HA & Failover · Level: L4 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)

Root cause: a failover was only half-done. The primary was down and the standby needed promoting, but the job isn't finished at "promote"; it's finished when the new primary is writable, the old primary is not a write target, and every critical database is validated on the new primary.

How it was found: pg_is_in_recovery() on the standby was still true (not promoted); the failover markers in app_db AND billing_db confirmed both critical databases survived once promotion completed.

The mitigation: promote the standby and run the full post-failover checklist: writable new primary, old primary fenced, app_db and billing_db both validated.

Lesson: promotion is not the end of a failover; validation is part of it. Check write-readiness on the new primary, that the old primary isn't a write target, and EVERY critical database (don't stop at the default one or just one app database). An index is irrelevant.

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. The primary is down and the standby hasn't been promoted yet. Promotion is only step one; finish the whole failover checklist. On the standby, SELECT pg_is_in_recovery(); is true (still read-only).

2. Promote, then validate EVERYTHING: `pgpg action <session> promote-replica`, then confirm the new primary is writable and BOTH critical databases survived: \connect app_db then SELECT * FROM failover_markers; and the same for billing_db. Don't forget billing_db.

3. Failover is done when: the promoted node is writable, the old primary is not a write target, and app_db AND billing_db markers are present on the new primary. Validating only the default database (or only app_db) is incomplete; an index is irrelevant.
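The completion criteria in the hints can be sketched as a small checklist function. This is a minimal illustration, not part of the scenario tooling: the `FailoverObservations` container and `remaining_steps` helper are hypothetical names, and the sketch assumes you have already gathered each fact by hand with psql (for example, the result of `SELECT pg_is_in_recovery();` and the row count of `failover_markers` in each database). It also simplifies "writable" to "no longer in recovery".

```python
from dataclasses import dataclass, field

# Databases the scenario treats as critical (taken from the hints).
CRITICAL_DATABASES = ("app_db", "billing_db")


@dataclass
class FailoverObservations:
    """Hypothetical container for facts collected manually with psql."""
    # Result of SELECT pg_is_in_recovery(); on the node meant to be the new primary.
    new_primary_in_recovery: bool
    # True if clients could still write to the old primary (i.e. it is not fenced).
    old_primary_accepts_writes: bool
    # Database name -> number of rows seen in failover_markers on the new primary.
    marker_rows: dict = field(default_factory=dict)


def remaining_steps(obs):
    """Return the checklist items that are still open; an empty list means done."""
    steps = []
    if obs.new_primary_in_recovery:
        steps.append("promote the standby (pg_is_in_recovery() is still true)")
    if obs.old_primary_accepts_writes:
        steps.append("fence the old primary so it is not a write target")
    for db in CRITICAL_DATABASES:
        # A database that was never checked counts as not validated.
        if obs.marker_rows.get(db, 0) < 1:
            steps.append(f"validate failover_markers in {db}")
    return steps


if __name__ == "__main__":
    # Promoted and fenced, but only app_db was checked: the failover is incomplete.
    partial = FailoverObservations(False, False, {"app_db": 1})
    print(remaining_steps(partial))
```

Treating an unchecked database the same as a failed check is deliberate: it mirrors the scenario's point that validating only app_db leaves the failover unfinished.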