Scenario · HA & Failover
Timeline divergence detected
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L4 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-08/08-timeline-divergence-detected
Postmortem & investigation hints (spoilers)
Type: incident simulation · Topic: HA & Failover · Level: L4 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)

Root cause: After promotion, the new primary advanced on a new timeline and accepted writes (recorded as a 'post_promotion' marker). The old primary was down at the time and never saw those writes, so its history diverged. Reattaching it naively would lose or conflict with post-promotion data.

How it was found: Comparing failover_markers across nodes showed that the promoted primary had the 'post_promotion' marker while the returned old primary did not, which indicates a divergent, stale timeline.

The fix: Keep the old primary fenced and mark it for rebuild (a controlled re-clone). Never reuse it as-is.

Lesson: Promotion changes topology history. A returned old primary is on a stale timeline and must be rebuilt (in real systems, re-cloned or resynchronized with pg_rewind) before it can rejoin. Until then it must not be trusted or written to. (This scenario simulates the divergence with markers; it does not perform a real rejoin.)

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. A failover happened and the old primary is back, but it is on a divergent timeline. Compare state on both nodes with `SELECT pg_is_in_recovery();` and `SELECT * FROM failover_markers ORDER BY key;`. The promoted primary has a 'post_promotion' marker that the old one is missing.
2. The old primary diverged: it never saw the writes made after the promotion. Reattaching it as-is would lose or conflict with those writes. Treat it as stale until it is rebuilt.
3. Keep the old primary fenced and mark it for rebuild: `pgpg action <session> fence-old-primary`. Don't reuse the old primary, don't attempt an unsafe rejoin, and don't add an index.
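The marker comparison described in the hints can be sketched as a plain set difference. This is only an illustration, not the scenario's scoring logic: the function names, the returned labels, and the 'pre_failover' key are hypothetical, and the key lists stand in for the `key` column you would read from failover_markers on each node.

```python
from typing import Dict, Iterable, List


def compare_markers(promoted_keys: Iterable[str], old_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Return the marker keys that differ between the two nodes."""
    promoted, old = set(promoted_keys), set(old_keys)
    return {
        # Writes the promoted primary took that the old primary never saw.
        "missing_on_old": sorted(promoted - old),
        # Writes that exist only on the old primary (also a divergence).
        "only_on_old": sorted(old - promoted),
    }


def old_primary_action(promoted_keys: Iterable[str], old_keys: Iterable[str]) -> str:
    """Suggest how to treat the returned old primary (hypothetical labels)."""
    diff = compare_markers(promoted_keys, old_keys)
    diverged = bool(diff["missing_on_old"] or diff["only_on_old"])
    # An empty diff only means the markers match; it does not prove the
    # node is safe to rejoin.
    return "fence-and-rebuild" if diverged else "no-divergence-seen"


if __name__ == "__main__":
    # 'post_promotion' comes from the scenario; 'pre_failover' is made up.
    promoted = ["pre_failover", "post_promotion"]
    old = ["pre_failover"]
    print(compare_markers(promoted, old))
    print(old_primary_action(promoted, old))
```

The check is deliberately symmetric: a marker that exists only on the old primary is treated as divergence too, since either direction means the two histories no longer agree.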