Scenario · Incident Control

Failover pressure

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L5 · 20–30 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-12/03-failover-pressure

Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 20–30 min
Launch: ride postgres start stage-12/03-failover-pressure

POSTMORTEM (root cause · how it was found · the fix · lesson)
INCIDENT TIMELINE (Incident Control capstone)

Triage: under failover pressure, the standby looked promotable but was lagging. Its
WAL replay was paused, so a recent `latest` marker it had RECEIVED was not yet
applied: pg_last_wal_replay_lsn() trailed pg_last_wal_receive_lsn(). Promoting a
lagging standby risks losing the most recent committed state.
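
The gap between received and replayed WAL can be measured in a single query. This is a minimal sketch, assuming it runs on the standby from psql (pg_is_wal_replay_paused() raises an error on a server that is not in recovery). The column aliases are illustrative and not part of the scenario.

```sql
-- Run on the standby. Column aliases are illustrative, not scenario names.
SELECT
  pg_is_in_recovery()             AS in_recovery,    -- true on a standby
  pg_is_wal_replay_paused()       AS replay_paused,  -- true while replay is paused
  pg_last_wal_receive_lsn()       AS received_lsn,   -- last WAL received via streaming
  pg_last_wal_replay_lsn()        AS replayed_lsn,   -- last WAL actually applied
  pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                  pg_last_wal_replay_lsn())
                                  AS received_not_replayed_bytes,
  pg_last_xact_replay_timestamp() AS last_replayed_commit_at;
```

A non-zero received_not_replayed_bytes together with replay_paused = true would match the state described here: the WAL is on the standby but has not been applied.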

Stabilize (not blind-promote): resume replay so the standby applies the WAL it already
received and becomes consistent:
  -- on the standby
  SELECT pg_wal_replay_resume();

Validate: confirm replay is active again, the `latest` marker is now present on the
standby, and the critical databases (app_db, billing_db) are intact — while the
standby is still a safe standby (not prematurely promoted).
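
One way to run these checks from a psql session on the standby. This is a sketch under assumptions: `incident` is taken to be the scenario's default database, failover_markers is assumed to exist under that name in each database (as the hints suggest), and its columns are not given in the scenario, so the queries select every column rather than guess at names.

```sql
-- On the standby, after pg_wal_replay_resume().
-- Expect still_standby = true (not promoted) and replay_paused = false.
SELECT pg_is_in_recovery()       AS still_standby,
       pg_is_wal_replay_paused() AS replay_paused;

-- \connect is a psql meta-command, not SQL.
-- Check every database, not only the default one.
\connect incident
SELECT current_database() AS db, * FROM failover_markers;

\connect app_db
SELECT current_database() AS db, * FROM failover_markers;

\connect billing_db
SELECT current_database() AS db, * FROM failover_markers;
```

Look for the `latest` marker in the output, and confirm that the app_db and billing_db markers are still present.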

FINAL CHECKLIST
- [x] Lag understood (replay paused; latest marker received but not replayed)
- [x] Standby made consistent (replay resumed, latest marker applied)
- [x] Critical databases validated (app_db + billing_db markers present)
- [x] No premature promotion of a lagging standby; default-DB-only validation rejected

Lesson: failover under pressure does not mean "promote fast". A failover is not complete
until data freshness is restored and the critical databases are verified. Resume replay
and make the standby consistent before any promotion; promoting a lagging replica, or
checking only the default database, hides the risk.
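
The lesson can be turned into a quick pre-promotion gate. This is a hedged sketch and not part of the scenario's scoring: the verdict strings and the promotion_gate alias are invented for illustration. A passing result only says the standby has replayed what it received, not that it holds everything the primary committed.

```sql
-- Run on the standby. CASE branches are evaluated in order, so the
-- recovery-only function is reached only when the server is in recovery.
SELECT
  CASE
    WHEN NOT pg_is_in_recovery()
      THEN 'not a standby: already promoted or a primary'
    WHEN pg_is_wal_replay_paused()
      THEN 'NOT SAFE: WAL replay is paused'
    WHEN pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                         pg_last_wal_replay_lsn()) > 0
      THEN 'NOT SAFE: received WAL has not been replayed yet'
    -- Also reached when pg_last_wal_receive_lsn() is NULL (no streaming
    -- since startup); this sketch does not treat that case separately.
    ELSE 'replay has caught up with received WAL; validate each database next'
  END AS promotion_gate;
```

The gate covers the cluster as a whole; marker validation still has to be repeated in each critical database.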

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Failover pressure: the team wants to promote the standby NOW, but it's lagging. Triage first: SELECT pg_is_in_recovery(); SELECT pg_is_wal_replay_paused(); SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(); and read failover_markers in incident / app_db / billing_db.
2. The standby's replay is paused, so a recent 'latest' marker it has RECEIVED hasn't been applied — promoting now is a premature-failover risk. Resume replay so it catches up and applies the received WAL; validate the critical databases.
3. Stabilize, don't blind-promote: SELECT pg_wal_replay_resume(); then confirm the standby is caught up (replay active, the 'latest' marker present) and app_db/billing_db markers are intact. Don't promote the lagging replica and don't validate only the default database.