
Scenario · HA & Failover

Timeline divergence detected

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-08/08-timeline-divergence-detected

Type: incident simulation · Topic: HA & Failover · Level: L4 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: after promotion, the new primary advanced on a new timeline and took
writes (a 'post_promotion' marker). The old primary, having been down, never saw
those writes — its history diverged. Reattaching it naively would lose or conflict
with post-promotion data.

How it was found: comparing failover_markers across nodes showed the promoted
primary had the post-promotion marker while the returned old primary did not — a
divergent/stale timeline.
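
The comparison above can be sketched in a few lines of Python. This is only an illustration, assuming the marker keys from each node have already been fetched (for example with the failover_markers query) into plain lists. The function names and the 'pre_failover' key are hypothetical; only 'post_promotion' comes from the scenario.

```python
def find_missing_markers(promoted_keys, returned_keys):
    """Return marker keys the promoted primary has but the returned node lacks."""
    return sorted(set(promoted_keys) - set(returned_keys))


def is_stale(promoted_keys, returned_keys):
    """A returned node is treated as stale if it is missing any promoted-side marker."""
    return bool(find_missing_markers(promoted_keys, returned_keys))


# Hypothetical sample data: 'pre_failover' is an invented key for illustration.
promoted_primary = ["pre_failover", "post_promotion"]
old_primary = ["pre_failover"]

missing = find_missing_markers(promoted_primary, old_primary)
print("missing on old primary:", missing)
print("old primary stale:", is_stale(promoted_primary, old_primary))
```

A non-empty result means the returned node never saw post-promotion writes and should stay fenced.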

The fix: keep the old primary fenced and mark it for rebuild (a controlled
re-clone). Never reuse it as-is.

Lesson: promotion changes topology history. A returned old primary is on a stale
timeline and must be rebuilt or resynchronized (e.g. re-cloned, or rewound with
pg_rewind in real systems) before it can rejoin. It must never be trusted or
written to as-is. (This scenario simulates the divergence with markers; it does
not perform a real rejoin.)
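
Outside the sandbox, divergence is usually reasoned about in terms of WAL positions rather than markers. The sketch below is a rough illustration, not what this scenario or pg_rewind actually runs. It assumes you have the text of the new primary's timeline history file (one line per parent timeline: timeline ID, switch-point LSN, reason) and the old primary's last WAL position. The sample history line, the LSN values, and all function names are invented for the example.

```python
def parse_lsn(text):
    """Convert an LSN such as '0/3000060' into a single integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def fork_point(history_text, old_timeline):
    """Find the LSN at which old_timeline was left, per the history file text."""
    for line in history_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(None, 2)
        if int(fields[0]) == old_timeline:
            return parse_lsn(fields[1])
    return None


def wrote_past_fork(old_last_lsn, history_text, old_timeline):
    """True if the old primary has WAL beyond the point where the timelines forked."""
    fork = fork_point(history_text, old_timeline)
    if fork is None:
        raise ValueError("old timeline not found in history")
    return parse_lsn(old_last_lsn) > fork


# Invented sample: timeline 1 ended at 0/3000060 when the standby was promoted.
HISTORY = "1\t0/3000060\tno recovery target specified\n"
print(wrote_past_fork("0/3000148", HISTORY, 1))
```

Whatever the check reports, the returned node still must not take writes. The result only helps decide how it gets resynchronized.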

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. A failover happened and the old primary is back, but it is on a divergent timeline. Compare state by running SELECT pg_is_in_recovery(); and then SELECT * FROM failover_markers ORDER BY key; on both nodes. The promoted primary has a 'post_promotion' marker that the old one is missing.
2. The old primary diverged — it never saw writes made after the promotion. Reattaching it as-is would lose or conflict with those writes. Treat it as stale until rebuilt.
3. Keep the old primary fenced / mark it for rebuild: `pgpg action <session> fence-old-primary`. Don't reuse the old primary, don't attempt an unsafe rejoin, and don't add an index.