← All scenarios

Scenario · Replication & WAL

Replica lag from a stopped replica

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L2 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-04/01-replica-lag-stopped-replica

Part of these paths

Show the postmortem & investigation hints spoilers
Replica lag from a stopped replica
Type: incident simulation · Topic: Replication & WAL · Level: L2 · Duration: 10–15 min
Launch: ride postgres start stage-04/01-replica-lag-stopped-replica

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: the standby stopped applying WAL (replay was paused), so while the
primary kept accepting writes the replica fell further and further behind.
Anything reading from the replica saw stale data even though the primary was
perfectly healthy. This is a replication-lag incident, not a query-plan problem.

How it was found: on the primary, pg_stat_replication showed the standby's
replay_lsn frozen well behind sent_lsn; on the replica, pg_last_wal_replay_lsn
wasn't advancing and trailed the primary's pg_current_wal_lsn by a growing gap.

The mitigation: resume WAL replay on the replica (pg_wal_replay_resume); it then
applied the backlog and caught up.

Lesson: lag means "look at both ends" — pg_stat_replication on the primary and
the replay LSN on the replica. Fix the replication path; don't reach for indexes
or kill the primary's writers, which does nothing about a paused standby.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Reads from the replica are stale but the primary is healthy — this is replication lag, not a slow query. On the PRIMARY check the senders: SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;
2. On the REPLICA, see how far behind replay is: SELECT pg_is_in_recovery(); SELECT pg_last_wal_replay_lsn(); and compare to the primary's pg_current_wal_lsn(). The replay LSN isn't advancing — WAL replay is paused.
3. Resume replay on the REPLICA: SELECT pg_wal_replay_resume(); then watch the replica catch up. Don't add indexes or touch the primary's sessions — the replica simply stopped applying WAL.