Scenario · HA & Failover

Failed failover due to lag

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-08/03-failed-failover-due-to-lag

Part of these paths

HA & Failover On-Call SRE On-Call Path


Type: incident simulation · Topic: HA & Failover · Level: L4 · Duration: 10–15 min
Launch: ride postgres start stage-08/03-failed-failover-due-to-lag

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: the standby's WAL replay was paused, so it lagged behind the primary —
recent committed data (the 'latest_order' marker) had been received but not
replayed. Promoting it in that state would have lost that data (a failover that
silently drops recent transactions).

How it was found: pg_is_wal_replay_paused() was true and the replay LSN trailed the
received LSN; the latest marker was missing on the replica.
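
The check described above can be run as a single query. This is a minimal sketch, assuming PostgreSQL 10 or later and a psql session on the standby (`pg_is_wal_replay_paused()` raises an error on a server that is not in recovery). The column aliases are illustrative and not part of the scenario.

```sql
-- Run on the standby.
SELECT pg_is_wal_replay_paused()  AS replay_paused,
       pg_last_wal_receive_lsn()  AS receive_lsn,
       pg_last_wal_replay_lsn()   AS replay_lsn,
       -- Bytes of WAL received but not yet applied; 0 means fully replayed.
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS unreplayed_bytes;
```

In the state this postmortem describes, `replay_paused` would be true and `unreplayed_bytes` greater than zero.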

The mitigation: resume replay and let the standby catch up
(`pgpg action <session> wait-for-replica-catchup`) BEFORE any promotion. No blind
failover while the standby is lagging.
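
Done by hand, the same mitigation might look like the sketch below. It assumes a psql session on the standby with a role allowed to call `pg_wal_replay_resume()` (superuser by default). What the `pgpg` action does internally is not shown in this scenario, so treat this as a manual equivalent rather than its implementation.

```sql
-- Run on the standby: lift the replay pause.
SELECT pg_wal_replay_resume();

-- Re-run until caught_up is true. In psql, ending the query with \watch 1
-- instead of the semicolon repeats it every second.
SELECT pg_is_wal_replay_paused() AS replay_paused,
       pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn() AS caught_up;
```

While the primary is still writing, the receive LSN keeps advancing, so `caught_up` can briefly flip back to false between samples.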

Lesson: never promote a replica blindly. Check replay lag / LSN position (and
critical markers) first; promote only a caught-up standby. Promoting a lagging
replica loses data, and adding an index does nothing to close replication lag.
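
One way to turn this lesson into a pre-promotion gate is a single yes/no query. This is a sketch under assumptions: it reuses the scenario's `failover_markers` table and `latest_order` id, and the `safe_to_promote` alias is hypothetical. It only proves that the standby has applied everything it received; if the primary is still reachable, also compare against `pg_current_wal_lsn()` there.

```sql
-- Run on the standby before any promotion.
SELECT COALESCE(
         NOT pg_is_wal_replay_paused()
         -- Everything received has been applied.
         AND pg_last_wal_replay_lsn() = pg_last_wal_receive_lsn()
         -- The critical marker is visible on this standby.
         AND EXISTS (SELECT 1
                     FROM failover_markers
                     WHERE id = 'latest_order'),
         false) AS safe_to_promote;
```

The `COALESCE` treats an unknown result (for example, a NULL receive LSN) as not safe, so the gate fails closed.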

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Before promoting, check whether the standby is safe to promote. On the replica, SELECT pg_is_wal_replay_paused(); returns true, and SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(); shows the replay LSN behind the receive LSN: the standby is lagging.
2. Recent data hasn't been replayed: SELECT * FROM failover_markers WHERE id='latest_order'; returns no row on the replica. Promoting it now would lose that committed data.
3. Make the standby safe first: resume replay and let it catch up with `pgpg action <session> wait-for-replica-catchup` (or SELECT pg_wal_replay_resume();). Do NOT promote a lagging replica, and don't add an index: an index has no effect on replay lag.
