Scenario · Incident Control
Failover pressure
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L5 · 20–30 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-12/03-failover-pressure
Postmortem and investigation hints (spoilers)
Failover pressure
Type: incident simulation · Topic: Incident Control · Level: L5 · Duration: 20–30 min
Launch: ride postgres start stage-12/03-failover-pressure

POSTMORTEM (root cause · how it was found · the fix · lesson)

INCIDENT TIMELINE (Incident Control capstone)

Triage: under failover pressure, the standby looked promotable but was lagging. Its WAL replay was paused, so a recent `latest` marker it had RECEIVED was not yet applied. Promoting a lagging standby risks losing the most recent committed state.

Stabilize (not blind-promote): resume replay so the standby applies the WAL it already received and becomes consistent:

    -- on the standby
    SELECT pg_wal_replay_resume();

Validate: confirm replay is active again, the `latest` marker is now present on the standby, and the critical databases (app_db, billing_db) are intact, while the standby is still a safe standby (not prematurely promoted).

FINAL CHECKLIST
- [x] Lag understood (replay paused; latest marker received but not replayed)
- [x] Standby made consistent (replay resumed, latest marker applied)
- [x] Critical databases validated (app_db + billing_db markers present)
- [x] No premature promotion of a lagging standby; default-DB-only validation rejected

Lesson: a failover under pressure is not "promote fast". It's not complete until data freshness is restored and the critical databases are verified. Resume replay and make the standby consistent before any promotion; promoting a lagging replica or checking only the default database hides the risk.

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. Failover pressure: the team wants to promote the standby NOW, but it's lagging. Triage first:

       SELECT pg_is_in_recovery();
       SELECT pg_is_wal_replay_paused();
       SELECT pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn();

   and read failover_markers in incident / app_db / billing_db.

2. The standby's replay is paused, so a recent 'latest' marker it has RECEIVED hasn't been applied. Promoting now is a premature-failover risk. Resume replay so it catches up and applies the received WAL; validate the critical databases.

3. Stabilize, don't blind-promote:

       SELECT pg_wal_replay_resume();

   then confirm the standby is caught up (replay active, the 'latest' marker present) and app_db/billing_db markers are intact. Don't promote the lagging replica and don't validate only the default database.
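
The triage queries from hint 1 can be folded into a single row so the gap between received and replayed WAL is visible at a glance. What follows is a minimal sketch, not the scenario's reference solution. It assumes a psql session on the standby and a PostgreSQL version that provides pg_wal_lsn_diff (10 or later). The column aliases are illustrative, and incident, app_db and billing_db are assumed to be database names, as hint 1 suggests. Because this page does not show the schema of failover_markers, the marker check simply selects every row.

```sql
-- Run on the standby. pg_is_wal_replay_paused() raises an error on a
-- server that is not in recovery, so this is not meant for the primary.
SELECT pg_is_in_recovery()       AS in_recovery,
       pg_is_wal_replay_paused() AS replay_paused,
       pg_last_wal_receive_lsn() AS receive_lsn,
       pg_last_wal_replay_lsn()  AS replay_lsn,
       -- Bytes of WAL received but not yet replayed; 0 means caught up.
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS unreplayed_bytes;

-- Read the markers in every database, not only the default one.
-- \c is a psql meta-command that reconnects to the named database.
\c incident
SELECT * FROM failover_markers;
\c app_db
SELECT * FROM failover_markers;
\c billing_db
SELECT * FROM failover_markers;
```

With replay paused, replay_paused should read t and unreplayed_bytes should be greater than zero. After SELECT pg_wal_replay_resume(); rerunning the first query should show replay_paused as f and unreplayed_bytes falling toward 0 as the received WAL is applied.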