Scenario · Replication & WAL

WAL explosion from a replication slot

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L3 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-04/02-wal-explosion-replication-slot

Type: incident simulation · Topic: Replication & WAL · Level: L3 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: an inactive physical replication slot held a restart_lsn far in the
past. A slot guarantees the primary keeps every WAL segment a consumer might
still need, so with no consumer draining it, pg_wal grew without bound and the
disk filled — even though the database was otherwise healthy.
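
To confirm the symptom from inside the database rather than from the filesystem, the size of pg_wal can be summed with pg_ls_waldir(). This is a minimal sketch, assuming PostgreSQL 10 or later and a role that is a superuser or a member of pg_monitor; the column aliases are illustrative.

```sql
-- Count the WAL segment files and total their size.
-- pg_ls_waldir() returns one row per file in pg_wal (name, size, modification).
SELECT count(*)                  AS wal_files,
       pg_size_pretty(sum(size)) AS wal_total
FROM pg_ls_waldir();
```

Running it twice a few minutes apart shows whether the directory is still growing.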

How it was found: pg_replication_slots showed a slot with active = false and a
large pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) — i.e. a lot of WAL
retained on its behalf.
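
The check described above can be sketched as a single query that lists the oldest restart_lsn first, so the slot pinning the most WAL appears at the top. It assumes it is run on the primary, because pg_current_wal_lsn() is not available during recovery. The retained alias and the choice of columns are illustrative.

```sql
-- Rank slots by how far back they pin WAL; the first row holds the most.
SELECT slot_name,
       slot_type,
       active,
       restart_lsn,
       pg_size_pretty(
           pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained
FROM pg_replication_slots
WHERE restart_lsn IS NOT NULL   -- slots that have not reserved WAL pin nothing
ORDER BY restart_lsn ASC;
```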

The fix: drop the stale slot with pg_drop_replication_slot. Once nothing
pins the old restart_lsn, subsequent checkpoints recycle the backlog.
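
A slightly more defensive form of the drop is sketched below. It assumes the slot is named stale_replica_slot, as in this scenario's hints, and drops it only if it is still reported as inactive. The second query is an optional check that the slot is gone.

```sql
-- Drop the slot only if it exists and has no connected consumer.
-- Returns zero rows (and drops nothing) if the slot is active or absent.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'stale_replica_slot'
  AND NOT active;

-- Verify: the dropped slot should no longer be listed.
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots;
```

Filtering on active makes the statement a no-op rather than an error when a consumer has reconnected in the meantime.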

Lesson: unbounded pg_wal growth on an otherwise healthy server points at a
replication slot (or a failing archiver). Find the slot pinning the oldest
restart_lsn and drop it if it is truly unused. Never rm WAL files by hand, and
don't rely on CHECKPOINT: it can't recycle WAL that a slot still needs.
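
To rule the archiver in or out, pg_stat_archiver can be inspected alongside the slot check. This is a sketch, not part of the scenario's scored path. It assumes WAL archiving may be configured on the server; if archive_mode is off, the counters simply stay at zero.

```sql
-- Is archiving enabled at all?
SHOW archive_mode;

-- A rising failed_count, with last_failed_time newer than
-- last_archived_time, suggests the archiver is stuck and WAL is piling up.
SELECT archived_count,
       failed_count,
       last_archived_wal,
       last_archived_time,
       last_failed_wal,
       last_failed_time
FROM pg_stat_archiver;
```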

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. pg_wal is growing and the disk is filling, but the database is up. A replication slot with no consumer pins WAL so that it cannot be recycled. List the slots: SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
2. Measure what each slot is holding back: SELECT slot_name, active, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained FROM pg_replication_slots; An inactive slot with a large retained value is the culprit.
3. Drop the stale, inactive slot: SELECT pg_drop_replication_slot('stale_replica_slot'); WAL can then be recycled at the next checkpoint. Do NOT delete files from pg_wal by hand, and don't reach for CHECKPOINT as the fix: neither releases a slot's hold.