Scenario · HA & Failover

Manual replica promotion

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L3 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-08/02-manual-replica-promotion

Part of these paths

HA & Failover On-Call SRE On-Call Path Zero-Downtime Operations



POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: the primary was lost for good, so the streaming standby had to be
promoted to a new writable primary. A failover isn't done when the command runs —
it's done when the promoted node is writable AND the critical databases are
verified intact.

How it was found: on the standby, pg_is_in_recovery() returned true, confirming
it was still a standby and therefore a promotion candidate (it had caught up
before the crash). After promotion the same call returned false, and the
failover markers in app_db and billing_db confirmed the critical data survived.
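
The candidate check could be scripted roughly as below. This is a minimal sketch, not part of the scenario tooling: the function names and the connection-string argument are hypothetical, and it assumes psql can reach the node. Note that matching receive and replay positions only show the standby has replayed everything it received; they cannot prove it received everything the lost primary wrote.

```shell
# Sketch only: node_role and replay_positions are illustrative helper names.
# $1 is a hypothetical connection string or database name for the node.

# Print "standby" if the node is in recovery, "primary" if it is not.
node_role() {
  # -X skips psqlrc; -A (unaligned) and -t (tuples only) make a boolean
  # print as a bare "t" or "f".
  state=$(psql -X -At -d "$1" -c "SELECT pg_is_in_recovery();") || return 2
  case "$state" in
    t) echo standby ;;
    f) echo primary ;;
    *) return 2 ;;
  esac
}

# Print the last received and last replayed WAL positions, separated by "|".
replay_positions() {
  psql -X -At -d "$1" -c "SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();"
}
```

Called against the replica, `node_role` should print `standby` before promotion and `primary` afterwards.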

The mitigation: promote the standby (`pgpg action <session> promote-replica`),
confirm it's out of recovery, and verify the markers in the critical
(non-default) databases.

Lesson: promotion changes which node is writable. Verify that
pg_is_in_recovery() returns false on the new primary and validate the databases
your app actually uses, not just the default one. Adding an index is irrelevant
to a failover.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The primary is gone and won't come back, so you need to fail over. First confirm the standby is a good candidate: on the replica, SELECT pg_is_in_recovery(); returns true (it is a standby), and it had caught up before the crash.
2. Promote the standby: `pgpg action <session> promote-replica`. Afterwards SELECT pg_is_in_recovery(); returns false: the node is now a writable primary.
3. After promotion, validate the CRITICAL databases, not just the default one: \connect billing_db then SELECT * FROM failover_markers; (and do the same in app_db). Don't add an index and don't only check the default database.
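
The post-promotion checks in steps 2 and 3 can be sketched as one script. This is a rough illustration under stated assumptions: `verify_failover` is a hypothetical helper, the default database is assumed to be named `postgres`, connection details are assumed to come from the standard libpq variables (PGHOST, PGPORT, PGUSER), and "at least one row in failover_markers" is an assumed stand-in for "data intact", since the scenario does not describe the table's contents.

```shell
# Sketch only: a failover counts as done when the node is writable AND every
# critical database still holds its failover markers.
CRITICAL_DBS="app_db billing_db"

verify_failover() {
  # pg_is_in_recovery() is cluster-wide, so any database can answer it;
  # "postgres" as the default database name is an assumption.
  in_recovery=$(psql -X -At -d postgres -c "SELECT pg_is_in_recovery();") || return 1
  if [ "$in_recovery" != "f" ]; then
    echo "FAIL: node is still in recovery (not promoted)"
    return 1
  fi

  for db in $CRITICAL_DBS; do
    rows=$(psql -X -At -d "$db" -c "SELECT count(*) FROM failover_markers;") || return 1
    # Treat empty, non-numeric, or zero counts as a failed check
    # (assumed criterion: at least one marker row must be present).
    case "$rows" in
      ''|*[!0-9]*|0)
        echo "FAIL: $db has no rows in failover_markers"
        return 1
        ;;
    esac
    echo "ok: $db has $rows marker row(s)"
  done

  echo "ok: promoted node is writable and critical databases verified"
}
```

Running `verify_failover` before promotion should stop at the recovery check; after promotion it should walk both critical databases and never stop at the default one.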
