Scenario · HA & Failover
Primary crash detection
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L2 · 10–15 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-08/01-primary-crash-detection
Postmortem & investigation hints (spoilers)
Primary crash detection
Type: incident simulation · Topic: HA & Failover · Level: L2 · Duration: 10–15 min
Launch: ride postgres start stage-08/01-primary-crash-detection

POSTMORTEM (root cause · how it was found · the fix · lesson)

Root cause: the primary crashed (its process/container stopped) and stopped accepting connections. The streaming replica kept running, but only as a read-only standby (pg_is_in_recovery() = true), so writes failed everywhere.

How it was found: the connection to the primary failed; on the replica, pg_is_in_recovery() returned true and pg_stat_wal_receiver showed the stream down. The primary, not the replica, was the problem.

The mitigation: restart the primary (`pgpg action start-primary`). It accepted connections again and the standby resumed streaming.

Lesson: first triage which node is down. If the primary can simply be restarted, do that; promotion/failover is a bigger, riskier step you don't always need. An index or a config change doesn't bring a crashed node back.

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. The primary won't accept connections. Investigate from the replica (its own connection string from `pgpg start`): SELECT pg_is_in_recovery(); returns true (it's a read-only standby), and SELECT status FROM pg_stat_wal_receiver; shows the stream to the primary is down.

2. The primary is down and the replica is healthy but read-only. This is a node-availability incident, not a query/index problem, and you don't necessarily need to fail over.

3. Bring the primary back: `pgpg action <session> start-primary`. It accepts connections again and the standby resumes streaming. Don't promote the replica unnecessarily and don't add an index.
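
The triage order described above (check the primary connection, then the replica's recovery state, then the WAL receiver) can be sketched as a small decision function. This is an illustrative sketch, not part of the scenario tooling: the function name `triage` and the returned labels are hypothetical. It assumes you gathered the three observations yourself, for example with psql, and that a healthy stream reports `streaming` in the `status` column of pg_stat_wal_receiver. A missing row in that view is treated here as "stream down".

```python
from typing import Optional


def triage(primary_reachable: bool,
           replica_in_recovery: Optional[bool],
           wal_receiver_status: Optional[str]) -> str:
    """Classify the incident from three observations (labels are hypothetical).

    primary_reachable    -- did a connection attempt to the primary succeed?
    replica_in_recovery  -- result of SELECT pg_is_in_recovery() on the replica,
                            or None if the replica could not be reached
    wal_receiver_status  -- status from pg_stat_wal_receiver on the replica,
                            or None if the view returned no row
    """
    # Assumption: only the "streaming" status counts as a live stream.
    streaming = wal_receiver_status == "streaming"

    if replica_in_recovery is None:
        # Without the replica there is nothing to investigate from.
        return "replica-unreachable"
    if not primary_reachable and replica_in_recovery and not streaming:
        # The pattern in this scenario: the standby is alive but read-only and
        # its stream is down, so the primary is the failed node.
        return "primary-down"
    if primary_reachable and replica_in_recovery and streaming:
        return "healthy"
    # Any other combination needs a closer look before acting.
    return "unclear"


if __name__ == "__main__":
    # Observations matching the postmortem: primary refuses connections,
    # replica is in recovery, WAL receiver shows no active stream.
    print(triage(False, True, None))
```

For the observations in the postmortem the sketch returns `primary-down`, which points at restarting the primary rather than promoting the replica.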