Scenario · HA & Failover

Primary crash detection

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L2 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-08/01-primary-crash-detection

Type: incident simulation · Topic: HA & Failover · Level: L2 · Duration: 10–15 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: the primary crashed (its process/container stopped) and no longer
accepted connections. The streaming replica kept running, but only as a read-only
standby (pg_is_in_recovery() = true), so writes failed on both nodes: the primary
was unreachable and the standby rejects writes.

How it was found: connecting to the primary failed. On the replica,
pg_is_in_recovery() returned true and pg_stat_wal_receiver showed the stream down,
so the primary, not the replica, was the problem.
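
The first observation above ("the primary connection failed") can be scripted as a plain TCP probe. This is a minimal sketch, not part of the scenario tooling: the function name `accepts_tcp` is hypothetical, and the host and port you pass are assumptions about where the sandbox exposes each node. A successful TCP connect only shows that something is listening on the port; it does not prove PostgreSQL would authenticate a session, so treat it as a first triage signal (PostgreSQL's own `pg_isready` gives a more precise answer).

```python
import socket


def accepts_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for triage: it checks reachability only, not
    whether PostgreSQL would accept a login on that port.
    """
    try:
        # create_connection raises OSError (refused, unreachable, timed out)
        # when nothing accepts the connection.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run it once against the primary's address and once against the replica's: a crashed primary is expected to return False while the standby still returns True.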

The fix: restart the primary (`pgpg action <session> start-primary`); it accepted
connections again and the standby resumed streaming.

Lesson: first triage which node is down. If the primary can simply be restarted,
do that — promotion/failover is a bigger, riskier step you don't always need. An
index or a config change doesn't bring a crashed node back.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The primary won't accept connections. Investigate from the replica (its own connection string from `pgpg start`): SELECT pg_is_in_recovery(); returns true (it's a read-only standby), and SELECT status FROM pg_stat_wal_receiver; shows the stream to the primary is down (no row at all, or a status other than streaming).
2. The primary is down and the replica is healthy but read-only. This is a node-availability incident, not a query/index problem — and you don't necessarily need to fail over.
3. Bring the primary back: `pgpg action <session> start-primary`. It accepts connections again and the standby resumes streaming. Don't promote the replica unnecessarily and don't add an index.
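
The three hints above amount to a small decision rule, sketched here as a pure function. Everything in it is illustrative: the function name `triage`, its parameters, and the returned labels are hypothetical and are not part of the scenario's scoring or CLI. The inputs are the observations you collect by hand: whether the primary accepts connections, the result of pg_is_in_recovery() on the replica, and the status column of pg_stat_wal_receiver (None standing in for "no row returned").

```python
from typing import Optional


def triage(
    primary_reachable: bool,
    replica_in_recovery: bool,
    wal_receiver_status: Optional[str],
) -> str:
    """Map the three observations from the hints to a suggested next step.

    wal_receiver_status is the status column of pg_stat_wal_receiver on
    the replica, or None when the view returned no row.
    """
    streaming = wal_receiver_status == "streaming"

    if not primary_reachable and replica_in_recovery:
        # The replica is alive but read-only, so the primary is the failed
        # node. Try a restart before considering promotion.
        return "primary down: restart the primary"
    if primary_reachable and replica_in_recovery and streaming:
        return "healthy: primary up, standby streaming"
    if primary_reachable and replica_in_recovery:
        # Primary is up but the standby is not receiving WAL.
        return "replication stalled: inspect the standby"
    # Any other combination is outside this scenario.
    return "unclear: investigate further"
```

For this incident the observations are `triage(False, True, None)`, which lands on the restart branch rather than on promotion or a query-level change.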