← All scenarios

Scenario · Replication & WAL

WAL generation spike

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L3 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-04/10-wal-generation-spike

Part of these paths

Show the postmortem & investigation hints spoilers
WAL generation spike
Type: incident simulation · Topic: Replication & WAL · Level: L3 · Duration: 10–15 min
Launch: ride postgres start stage-04/10-wal-generation-spike

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: a write-heavy job on the primary (a tight insert/commit loop) was
generating WAL far faster than normal. The symptoms — surging WAL, growing
pg_wal, replica replay lag — were write amplification from that one workload, not
a query-plan problem and not a fault on the replica.

How it was found: pg_stat_wal / a moving pg_current_wal_lsn showed a WAL spike;
pg_stat_activity showed a backend (wal_generator) in a continuous write loop.

The mitigation: terminate the runaway writer; WAL generation dropped back to
baseline and the replica stopped falling behind.

Lesson: trace WAL spikes to the workload producing them (batch writes, throttle
backfills/migrations, avoid per-row commits). A CHECKPOINT only writes more WAL,
and killing the replica treats the symptom on the wrong node.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. WAL is being generated fast and the replica may be lagging, but it's not a missing index and not the replica's fault — a write-heavy job on the primary is amplifying writes. Look at WAL: SELECT * FROM pg_stat_wal; and watch pg_current_wal_lsn() move.
2. Find the writer: SELECT pid, application_name, state, query FROM pg_stat_activity WHERE application_name LIKE 'wal_generator%'; one backend is in a tight write loop.
3. Stop the runaway writer: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'wal_generator%'; WAL generation drops. Don't CHECKPOINT (it adds WAL) and don't kill the replica — the cause is on the primary.