Scenario · Replication & WAL

Broken replication credentials

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L3 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-04/05-broken-replication-credentials

Postmortem & investigation hints (spoilers)

Type: incident simulation · Topic: Replication & WAL · Level: L3 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: the role the standby streams as (`replicator`) was set to NOLOGIN.
Once the live connection dropped, every reconnect attempt by the walreceiver was
rejected at login. The primary kept serving traffic, but no standby was attached:
replication was down and the replica fell behind and went stale.

How it was found: pg_stat_replication on the primary was empty; on the replica
pg_stat_wal_receiver showed no healthy receiver and the logs showed the role
could not log in.
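
The two checks described above could be run in psql roughly as follows. This is a sketch against the standard statistics views; the column selection is illustrative and not taken from the scenario. On a stock PostgreSQL server, a connection attempt by a NOLOGIN role is rejected with a message of the form `FATAL: role "replicator" is not permitted to log in`; the sandbox's exact log output may differ.

```sql
-- On the PRIMARY: one row per connected standby.
-- Zero rows means nothing is streaming from this server.
SELECT pid, usename, application_name, client_addr, state
FROM pg_stat_replication;

-- On the REPLICA: confirm this node really is a standby (expects true).
SELECT pg_is_in_recovery();

-- On the REPLICA: look for a connected WAL receiver.
-- Zero rows means no receiver currently holds a connection to the primary.
SELECT pid, status, conninfo
FROM pg_stat_wal_receiver;
```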

The fix: restore the role's LOGIN attribute on the primary (ALTER ROLE replicator
LOGIN). The walreceiver reconnected on its own and streaming resumed.
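
A hedged sketch of that change as a psql session on the primary, assuming the role is named `replicator` as in the postmortem. The pg_roles lookups before and after are added for illustration and are not part of the scenario text.

```sql
-- On the PRIMARY: inspect the role's attributes.
-- rolcanlogin = false corresponds to NOLOGIN.
SELECT rolname, rolcanlogin, rolreplication
FROM pg_roles
WHERE rolname = 'replicator';

-- Restore the ability to log in (the statement named in the postmortem).
ALTER ROLE replicator LOGIN;

-- Re-check: rolcanlogin should now be true.
SELECT rolname, rolcanlogin
FROM pg_roles
WHERE rolname = 'replicator';
```

ALTER ROLE needs a sufficiently privileged session, typically a superuser. No restart or reload is involved; the change applies to the next connection attempt.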

Lesson: "primary healthy but no standby attached" usually points to a
connectivity or auth problem. Check pg_stat_replication and pg_stat_wal_receiver
first, then the role, pg_hba.conf, and primary_conninfo that streaming depends on.
Don't drop replication slots, add indexes, or rebuild the whole replica for an
auth fix.
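
The three pieces the lesson names can each be inspected from psql. A minimal sketch, assuming PostgreSQL 12 or later (where primary_conninfo is a regular setting) and a privileged session, since pg_hba_file_rules and primary_conninfo are restricted by default:

```sql
-- On the PRIMARY: which roles carry the REPLICATION attribute, and may they log in?
SELECT rolname, rolcanlogin, rolreplication
FROM pg_roles
WHERE rolreplication;

-- On the PRIMARY: pg_hba.conf rules that match replication connections,
-- plus any lines that failed to parse (reported in the error column).
SELECT line_number, type, database, user_name, address, auth_method, error
FROM pg_hba_file_rules
WHERE 'replication' = ANY (database)
   OR error IS NOT NULL;

-- On the REPLICA: the connection string the walreceiver uses.
SHOW primary_conninfo;
```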

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The replica stopped streaming. On the PRIMARY: SELECT * FROM pg_stat_replication; — it's empty, no standby connected. The primary is healthy; the replica just can't attach.
2. On the REPLICA: SELECT * FROM pg_stat_wal_receiver; — no active receiver, and the logs show the replication role can't log in. This is an auth/credentials problem, not slots or queries.
3. Restore the replication role's ability to log in on the PRIMARY: ALTER ROLE replicator LOGIN; the standby's walreceiver reconnects on its own. Don't drop slots, add indexes, or rebuild the replica.
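
After applying the fix in hint 3, a quick check that streaming has actually resumed closes the loop. This is a sketch using standard functions available in PostgreSQL 10 and later; the lag expressions are illustrative and are not taken from the scenario's scoring.

```sql
-- On the PRIMARY: the standby should be listed again, ideally with state = 'streaming'.
-- replay_lag_bytes is how far the standby's replay position trails the
-- primary's current WAL write position.
SELECT application_name,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the REPLICA: time since the commit of the last replayed transaction.
-- This also grows while the primary is idle, so read it together with the byte lag.
SELECT now() - pg_last_xact_replay_timestamp() AS time_since_last_replay;
```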
