Scenario · HA & Failover
Failover with slot cleanup
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L3 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-08/07-failover-with-slot-cleanup
Failover with slot cleanup
Type: incident simulation · Topic: HA & Failover · Level: L3 · Duration: 10–15 min
Launch: ride postgres start stage-08/07-failover-with-slot-cleanup
POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: after a failover, an obsolete physical replication slot from the old
topology (old_primary_slot) lingered on the current primary. An inactive slot
never advances its restart_lsn, so the primary keeps retaining WAL for it (and it
clutters monitoring). It must be cleaned up, but a still-needed slot
(standby_slot, used by the current standby) must NOT be dropped.
How it was found: pg_replication_slots on the current primary listed both the
obsolete slot and the one still in use.
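A query along these lines could make that distinction visible. It is a sketch, not the scenario's reference check: it assumes PostgreSQL 10 or later (for pg_current_wal_lsn and pg_wal_lsn_diff) and that it runs on the current primary. The join through active_pid only finds a walsender row for slots that are in use at that moment.

```sql
-- List every slot with the amount of WAL it is holding back and,
-- if a walsender is attached, which client is consuming it.
SELECT s.slot_name,
       s.slot_type,
       s.active,
       s.restart_lsn,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), s.restart_lsn)
       ) AS retained_wal,
       r.application_name,
       r.client_addr,
       r.state
FROM pg_replication_slots AS s
LEFT JOIN pg_stat_replication AS r
       ON r.pid = s.active_pid   -- NULL columns mean nothing is attached
ORDER BY s.slot_name;
```

An active = false flag alone does not prove a slot is obsolete, since a standby that is briefly disconnected shows the same value. Match the slot name against the current topology before deciding; here, standby_slot is the one to keep.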
The fix: drop only the obsolete slot
(pg_drop_replication_slot('old_primary_slot')), leaving the needed one
(standby_slot) in place.
Lesson: failover cleanup includes obsolete replication slots, because they retain
WAL. Drop the stale ones by name; never drop all slots blindly (you'd break the
current standby). Adding an index is unrelated and releases no WAL.
INVESTIGATION HINTS (the staged path to diagnose and fix)
1. After the failover, inspect replication slots on the current primary: SELECT slot_name, active, restart_lsn FROM pg_replication_slots; You will see an obsolete slot (old_primary_slot) left over from the old topology and a slot the current standby still needs (standby_slot).
2. An inactive obsolete slot keeps pinning WAL, so clean it up. But do not drop the slot the current standby relies on.
3. Drop ONLY the obsolete one: SELECT pg_drop_replication_slot('old_primary_slot'); Don't drop all slots blindly, and don't add an index.
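The three steps above can be combined into a guarded cleanup. This is a sketch rather than the scenario's reference solution: it names the slot explicitly, skips it if it is unexpectedly still active (pg_drop_replication_slot raises an error on an active slot), and then re-checks what remains.

```sql
-- Drop the obsolete slot by name, and only while nothing is attached to it.
-- If the WHERE clause matches no row, nothing is dropped and no error is raised.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'old_primary_slot'
  AND NOT active;

-- Re-check: old_primary_slot should be gone and standby_slot should remain.
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
ORDER BY slot_name;
```

Filtering on an exact slot_name rather than on NOT active alone is deliberate: a filter on inactivity only would also remove a needed slot whose standby happens to be disconnected at that moment.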