Scenario · Replication & WAL
Unmonitored replication slot lag
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L2 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-04/09-unmonitored-replication-slot-lag
Type: incident simulation · Topic: Replication & WAL · Level: L2 · Duration: 10–15 min
Launch: ride postgres start stage-04/09-unmonitored-replication-slot-lag
POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: a leftover physical replication slot (`old_analytics_slot`) from a
decommissioned consumer was never dropped. With no one advancing it, its
restart_lsn stayed put while the primary kept generating WAL, so the WAL it
retained grew steadily. That growth was an early warning of disk pressure, caught
before it became an outage.
How it was found: pg_replication_slots showed an inactive slot whose
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) — the retained WAL — was large
and growing.
The fix: drop the stale slot so its retained WAL can be recycled.
Lesson: monitor per-slot retained WAL, not just disk usage. An unused slot silently
accumulates WAL; drop it (or have its consumer advance it). A CHECKPOINT can't
recycle WAL that a slot still pins, and adding an index has no effect on WAL retention.
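The lesson above can be turned into a recurring check. The query below is a minimal sketch, assuming it runs on the primary and that anything retaining more than 1 GB deserves attention; that threshold is an arbitrary placeholder, not a value taken from this scenario.

```sql
-- Sketch of a per-slot retention check, run on the primary.
-- The 1 GB threshold is an assumed placeholder; size it to your pg_wal volume.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL  -- slots that have never reserved WAL pin nothing
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 1024 * 1024 * 1024
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```

An empty result means no slot is over the threshold. A row with active = false is the pattern described in this postmortem.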
INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Not an outage yet, but a warning sign: a replication slot with no consumer is holding WAL, and pg_wal is creeping up. List slots with their retained WAL on the PRIMARY: SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained FROM pg_replication_slots;
2. An inactive slot (active = false) with a large retained value is the one to worry about: its restart_lsn is far behind the current WAL position and pins every segment written since.
3. Drop the stale, unused slot: SELECT pg_drop_replication_slot('old_analytics_slot'); WAL can then be recycled. A CHECKPOINT won't release a slot's hold, and adding an index has no effect on WAL retention.
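The steps above can be sketched as one short session on the primary. This is an illustrative sequence rather than the scenario's scored answer; the `NOT active` guard and the final verification query are additions made here for safety.

```sql
-- 1. Confirm the slot exists and has no connected consumer.
SELECT slot_name, slot_type, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'old_analytics_slot';

-- 2. Drop it only if it is inactive. The guard is an added precaution:
--    pg_drop_replication_slot raises an error on a slot that is in use.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'old_analytics_slot'
  AND NOT active;

-- 3. Verify the slot is gone (expect remaining = 0).
SELECT count(*) AS remaining
FROM pg_replication_slots
WHERE slot_name = 'old_analytics_slot';
```

Once the slot is gone, the segments it pinned become eligible for removal or recycling at the next checkpoint, so pg_wal may not shrink immediately.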