Scenario · Replication & WAL

Replica query conflict

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L3 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-04/04-replica-query-conflict

Part of these paths

Replication & WAL

Show the postmortem & investigation hints spoilers

Replica query conflict
Type: incident simulation · Topic: Replication & WAL · Level: L3 · Duration: 10–15 min
Launch: ride postgres start stage-04/04-replica-query-conflict

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: a long-running read on the standby held an AccessShare lock on a
table while the primary ran a change that needs an exclusive lock there (an
ALTER). The standby's recovery process couldn't apply that WAL until the query
released its lock, so — with max_standby_streaming_delay high — replay stalled
and lag grew behind one analytics query. This is a recovery conflict on the
replica, not a primary problem.

How it was found: pg_stat_replication on the primary showed replay_lsn frozen;
pg_stat_activity on the replica showed an old, still-open query; conflicts/lag
pointed at the standby.

The mitigation: terminate the offending replica query. Recovery then took its
lock and caught up.

Lesson: long reads on a hot standby fight WAL replay. Tune
max_standby_streaming_delay / hot_standby_feedback for the workload, or run heavy
analytics elsewhere. The fix is on the replica — don't kill primary writers or
add indexes.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Replay on the replica has stalled and lag is growing, but the primary is fine. The cause is on the REPLICA: a long-running read is holding a lock that the standby's recovery needs. On the REPLICA: SELECT pid, state, now() - query_start AS age, query FROM pg_stat_activity ORDER BY query_start; and SELECT * FROM pg_stat_database_conflicts;
2. On the PRIMARY, pg_stat_replication shows the standby's replay_lsn frozen behind sent_lsn — recovery is waiting on that replica query (max_standby_streaming_delay).
3. Terminate the long-running query on the REPLICA (pg_terminate_backend for its pid / application_name). Recovery then acquires its lock and replays the backlog. Don't kill the primary's sessions and don't add indexes.

Start now →← All scenarios