Scenario · Compound Incidents

WAL pressure and replica lag

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 15–20 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-11/02-wal-pressure-and-replica-lag

Part of these paths

Incident Response Readiness

Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause (connected): the replica's WAL replay was paused, so its replay position
stopped advancing and it fell behind. At the same time, an obsolete, inactive
replication slot (`obsolete_slot`) pinned WAL on the primary. Together they drove
replica lag *and* WAL retention pressure; fixing only one leaves the other.

How it was found: pg_stat_replication / pg_replication_slots on the primary showed the
lag and the inactive slot; pg_is_wal_replay_paused() on the replica showed replay was
paused.
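
The checks above can be sketched as three queries. This is a minimal sketch, assuming PostgreSQL 10 or later (the `pg_wal_*` function names) and a psql session on each node; the column aliases are illustrative and not part of the scenario.

```sql
-- On the primary: per-standby lag, in bytes and as an interval.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;

-- On the primary: which slots are inactive, and how much WAL each one holds back.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY restart_lsn;

-- On the replica: is replay paused, and how far apart are receive and replay?
SELECT pg_is_wal_replay_paused()       AS replay_paused,
       pg_last_wal_receive_lsn()       AS received_lsn,
       pg_last_wal_replay_lsn()        AS replayed_lsn,
       pg_last_xact_replay_timestamp() AS last_replayed_commit;
```

A slot with `active = false` and a large `retained_wal` is the likely pin on the primary; `replay_paused = true` with `received_lsn` ahead of `replayed_lsn` points to the stalled replica.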

The fix (both):
  -- on the replica: resume replay so it catches up
  SELECT pg_wal_replay_resume();
  -- on the primary: drop the obsolete slot pinning WAL
  SELECT pg_drop_replication_slot('obsolete_slot');
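
Before dropping the slot, it may be worth confirming that nothing is connected to it. The following is a hedged sketch of a guarded drop, not the scenario's required form; it assumes the slot name from the postmortem and runs on the primary.

```sql
-- Inspect the slot first: it should exist and show active = false.
SELECT slot_name, active, active_pid, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'obsolete_slot';

-- Drop it only if it exists and no walsender is using it.
-- If the filter matches no row, this statement does nothing.
SELECT pg_drop_replication_slot(slot_name)
FROM pg_replication_slots
WHERE slot_name = 'obsolete_slot'
  AND NOT active;
```

PostgreSQL already refuses to drop a slot that is in use, so the `NOT active` filter mainly turns that error into a no-op and makes the intent explicit.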

Lesson: replica lag and WAL pressure often share one cause, a consumer that isn't
advancing. An inactive slot makes the primary retain WAL; paused replay stalls the
replica while the WAL it keeps receiving piles up unreplayed.
Don't delete WAL files by hand, don't drop a slot a live standby still needs, and
don't promote a lagging replica to "make the lag go away".

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Two connected problems: the replica is lagging AND WAL is piling up on the primary. Check both ends — pg_stat_replication and pg_replication_slots on the primary; pg_is_wal_replay_paused() / pg_last_wal_replay_lsn() on the replica.
2. The replica's replay is paused (so it lags and can't release WAL), and an obsolete inactive slot ('obsolete_slot') is also pinning WAL. Don't delete WAL files by hand and don't promote the lagging replica.
3. Resume replay on the replica (SELECT pg_wal_replay_resume();) so it catches up, and drop the obsolete slot (SELECT pg_drop_replication_slot('obsolete_slot');). Don't drop the slot the live standby needs, and don't promote.
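
After applying both fixes, a quick check on each end can confirm they took effect. This is a sketch under the same assumptions as the hints (PostgreSQL 10 or later, slot named `obsolete_slot`); it is not the scenario's scoring logic, and the aliases are illustrative.

```sql
-- On the replica: replay should no longer be paused, and the gap between
-- received and replayed WAL should shrink toward zero on repeated runs.
SELECT pg_is_wal_replay_paused() AS replay_paused,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                       pg_last_wal_replay_lsn()) AS bytes_left_to_replay;

-- On the primary: the obsolete slot should be gone (expect zero rows).
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_name = 'obsolete_slot';

-- On the primary: the live standby should still be connected and streaming.
SELECT application_name, state, replay_lag
FROM pg_stat_replication;
```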
