Scenario · HA & Failover

Post-failover validation

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-08/10-post-failover-validation



Post-failover validation
Type: incident simulation · Topic: HA & Failover · Level: L4 · Duration: 10–15 min
Launch: ride postgres start stage-08/10-post-failover-validation

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: the failover was only half-done. The primary was down and the standby
still needed promoting, but promotion alone does not finish the job. A failover is
complete only when the new primary is writable, the old primary is no longer a write
target, and every critical database has been validated on the new primary.

How it was found: SELECT pg_is_in_recovery(); on the standby still returned true, so
it had not been promoted. Once promotion completed, the failover_markers rows in
app_db AND billing_db confirmed that both critical databases survived.

The mitigation: promote the standby and run the full post-failover checklist —
writable new primary, old primary fenced, app_db and billing_db both validated.

Lesson: promotion is not the end of a failover; validation is part of it. Confirm
that the new primary accepts writes, that the old primary isn't a write target, and
that EVERY critical database is validated (don't stop at the default one or just one
app database). Adding an index is irrelevant to this incident.
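
The completion criteria above can be written down as a small checklist evaluator. This is only a sketch in Python: the FailoverState class, its field names, and missing_checks are hypothetical names invented for illustration and are not part of the scenario tooling. Only the two database names come from the postmortem.

```python
from dataclasses import dataclass, field

# Databases the postmortem names as critical; everything else here is hypothetical.
CRITICAL_DATABASES = ("app_db", "billing_db")


@dataclass
class FailoverState:
    """Observed facts about the cluster after a failover attempt (illustrative model)."""
    new_primary_in_recovery: bool      # result of SELECT pg_is_in_recovery() on the new primary
    old_primary_accepts_writes: bool   # True if clients can still write to the old primary
    markers_present: dict = field(default_factory=dict)  # database name -> markers found?


def missing_checks(state: FailoverState) -> list:
    """Return the checklist items still unmet; an empty list means the failover is done."""
    missing = []
    if state.new_primary_in_recovery:
        missing.append("new primary is not writable (still in recovery)")
    if state.old_primary_accepts_writes:
        missing.append("old primary is still a write target")
    for db in CRITICAL_DATABASES:
        # A database that was never checked counts as not validated.
        if not state.markers_present.get(db, False):
            missing.append(f"{db} not validated on the new primary")
    return missing
```

A state where only app_db was checked still reports billing_db as outstanding, which is the half-finished validation the lesson warns about.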

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The primary is down and the standby hasn't been promoted yet. Promotion is only step one; finish the whole failover checklist. On the standby, SELECT pg_is_in_recovery(); returns true (still read-only).
2. Promote, then validate EVERYTHING: `pgpg action <session> promote-replica`, then confirm the new primary is writable and BOTH critical databases survived — \connect app_db then SELECT * FROM failover_markers; and the same for billing_db. Don't forget billing_db.
3. Failover is done when: the promoted node is writable, the old primary is not a write target, and app_db AND billing_db markers are present on the new primary. Validating only the default database (or only app_db) is incomplete; adding an index is irrelevant.
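
The validation queries in the hints can also be scripted, so that no database gets skipped. The sketch below only builds psql argument lists and interprets one result; it does not connect to anything. The helper names and the default database name "postgres" are assumptions made for illustration. The failover_markers table and the two database names come from the hints.

```python
# Databases the hints name as critical.
CRITICAL_DATABASES = ("app_db", "billing_db")


def psql_command(database, sql):
    """Build a psql argument list: no psqlrc (-X), unaligned (-A), tuples only (-t)."""
    return ["psql", "-X", "-A", "-t", "-d", database, "-c", sql]


def validation_commands(default_db="postgres"):
    """Return one recovery check plus one marker check per critical database.

    default_db is an assumption; use whichever database the sandbox exposes.
    """
    commands = [psql_command(default_db, "SELECT pg_is_in_recovery();")]
    for db in CRITICAL_DATABASES:
        commands.append(psql_command(db, "SELECT count(*) FROM failover_markers;"))
    return commands


def is_promoted(psql_output):
    """With -A -t, pg_is_in_recovery() prints 't' on a standby and 'f' once promoted."""
    return psql_output.strip() == "f"
```

Each list can be passed to subprocess.run(cmd, capture_output=True, text=True), with connection settings taken from the usual PGHOST, PGPORT, and PGUSER environment variables. Because the loop runs over CRITICAL_DATABASES, billing_db is checked just like app_db.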
