Scenario · HA & Failover
Failed failover due to lag
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L4 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-08/03-failed-failover-due-to-lag
Postmortem & investigation hints (spoilers)
Type: incident simulation · Topic: HA & Failover · Level: L4 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)

Root cause: the standby's WAL replay was paused, so it lagged behind the primary. Recent committed data (the 'latest_order' marker) had been received but not replayed. Promoting the standby in that state would have lost that data: a failover that silently drops recent transactions.

How it was found: pg_is_wal_replay_paused() was true and the replay LSN trailed the received LSN; the latest marker was missing on the replica.

The mitigation: resume replay and let the standby catch up (`pgpg action wait-for-replica-catchup`) BEFORE any promotion. No blind failover while lagging.

Lesson: never promote a replica blindly. Check replay lag / LSN position (and critical markers) first; promote only a caught-up standby. Promoting a lagging replica loses data; an index is irrelevant.

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. Before promoting, check whether the standby is safe to promote. On the replica, SELECT pg_is_wal_replay_paused(); is true, and SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(); shows that replay is behind what the standby has received: it is lagging.
2. Recent data hasn't been replayed: SELECT * FROM failover_markers WHERE id='latest_order'; returns no row on the replica. Promoting it now would lose that committed data.
3. Make the standby safe first: resume replay and let it catch up with `pgpg action <session> wait-for-replica-catchup` (or SELECT pg_wal_replay_resume();). Do NOT promote a lagging replica, and don't add an index.
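The pre-promotion check described in the hints can be sketched as a small decision function. This is an illustrative sketch, not part of the scenario's tooling. It assumes you have already run the three queries on the replica and copied their results in by hand. The names `StandbyState`, `lsn_to_int`, `replay_gap_bytes` and `promotion_blockers` are hypothetical, and the LSN values in the demo are made up. The only PostgreSQL fact it relies on is the LSN text format: two hexadecimal numbers separated by a slash, giving the high and low 32 bits of a 64-bit byte position.

```python
from __future__ import annotations

from dataclasses import dataclass


def lsn_to_int(lsn: str) -> int:
    """Convert an LSN such as '0/3000148' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


@dataclass
class StandbyState:
    # Values copied from queries run on the replica (hypothetical container).
    replay_paused: bool   # SELECT pg_is_wal_replay_paused();
    receive_lsn: str      # SELECT pg_last_wal_receive_lsn();
    replay_lsn: str       # SELECT pg_last_wal_replay_lsn();
    marker_present: bool  # does the 'latest_order' row exist on the replica?


def replay_gap_bytes(state: StandbyState) -> int:
    """Bytes of WAL the standby has received but not yet replayed."""
    return lsn_to_int(state.receive_lsn) - lsn_to_int(state.replay_lsn)


def promotion_blockers(state: StandbyState) -> list[str]:
    """Return the reasons promotion is unsafe; an empty list means none found."""
    blockers = []
    if state.replay_paused:
        blockers.append("WAL replay is paused")
    gap = replay_gap_bytes(state)
    if gap > 0:
        blockers.append(f"replay trails received WAL by {gap} bytes")
    if not state.marker_present:
        blockers.append("'latest_order' marker is missing on the replica")
    return blockers


if __name__ == "__main__":
    # Made-up values resembling the incident: paused, behind, marker missing.
    lagging = StandbyState(True, "0/3000148", "0/3000060", False)
    for reason in promotion_blockers(lagging):
        print("BLOCKED:", reason)
```

Note that this compares only what the standby has received against what it has replayed. It says nothing about WAL the primary generated but the standby never received, which would need a comparison against the primary's current LSN.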