Scenario · Replication & WAL
Replica query conflict
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L3 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-04/04-replica-query-conflictPart of these paths
Show the postmortem & investigation hints spoilers
Replica query conflict Type: incident simulation · Topic: Replication & WAL · Level: L3 · Duration: 10–15 min Launch: ride postgres start stage-04/04-replica-query-conflict POSTMORTEM (root cause · how it was found · the fix · lesson) Root cause: a long-running read on the standby held an AccessShare lock on a table while the primary ran a change that needs an exclusive lock there (an ALTER). The standby's recovery process couldn't apply that WAL until the query released its lock, so — with max_standby_streaming_delay high — replay stalled and lag grew behind one analytics query. This is a recovery conflict on the replica, not a primary problem. How it was found: pg_stat_replication on the primary showed replay_lsn frozen; pg_stat_activity on the replica showed an old, still-open query; conflicts/lag pointed at the standby. The mitigation: terminate the offending replica query. Recovery then took its lock and caught up. Lesson: long reads on a hot standby fight WAL replay. Tune max_standby_streaming_delay / hot_standby_feedback for the workload, or run heavy analytics elsewhere. The fix is on the replica — don't kill primary writers or add indexes. INVESTIGATION HINTS (the staged path to diagnose and fix) 1. Replay on the replica has stalled and lag is growing, but the primary is fine. The cause is on the REPLICA: a long-running read is holding a lock that the standby's recovery needs. On the REPLICA: SELECT pid, state, now() - query_start AS age, query FROM pg_stat_activity ORDER BY query_start; and SELECT * FROM pg_stat_database_conflicts; 2. On the PRIMARY, pg_stat_replication shows the standby's replay_lsn frozen behind sent_lsn — recovery is waiting on that replica query (max_standby_streaming_delay). 3. Terminate the long-running query on the REPLICA (pg_terminate_backend for its pid / application_name). Recovery then acquires its lock and replays the backlog. Don't kill the primary's sessions and don't add indexes.