Scenario · Compound Incidents
Primary crash and stale read endpoint
A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.
L4 · 15–20 min · runs locally in Docker
Launch
Start this scenario
Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.
ride postgres start stage-11/04-primary-crash-and-stale-read-endpoint
Postmortem & investigation hints (spoilers)
Primary crash and stale read endpoint
Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min
Launch: ride postgres start stage-11/04-primary-crash-and-stale-read-endpoint

POSTMORTEM (root cause · how it was found · the fix · lesson)

Root cause (compound): the primary crashed, and the app's write endpoint (`app_endpoints.write_endpoint`) still pointed at it, so even with a healthy standby the app was aimed at a dead/stale node. Failover here is not one action: the standby must be promoted AND the endpoint repointed to the new primary, with the critical databases validated.

How it was found: pg_is_in_recovery() showed the standby still in recovery; the app_endpoints registry still read `write_endpoint = primary`; the app_db / billing_db failover_markers were present on the standby (streamed before the crash).

The fix (both):
  pgpg action promote-replica    -- standby becomes the new primary
  UPDATE app_endpoints SET value = 'replica' WHERE key = 'write_endpoint';

Lesson: failover isn't complete until endpoints and critical-data freshness are validated on the new primary. Promoting without repointing leaves writes aimed at the dead node; validating only the default database hides per-database gaps. Don't keep the old primary as the source of truth, and don't reach for an index.

INVESTIGATION HINTS (the staged path to diagnose and fix)

1. The primary crashed. The standby is available but the app's write endpoint still points at the dead primary. Check recovery state (SELECT pg_is_in_recovery();) and the endpoint registry (SELECT * FROM app_endpoints;).
2. Recovery needs two things, not one: promote the standby AND repoint the write endpoint to the new primary. Validate the critical databases on the promoted node (app_db, billing_db failover_markers).
3. Promote with `pgpg action promote-replica`, then repoint: UPDATE app_endpoints SET value = 'replica' WHERE key = 'write_endpoint'; Don't leave the endpoint on the old primary, don't validate only the default database, and don't add an index.
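
The "failover is not one action" lesson can be pictured as a small checklist. The sketch below is illustrative only and is not part of the scenario or its scoring: the `FailoverObservation` class, its field names, and `remaining_steps` are hypothetical, and the inputs are values you would read off by hand with the queries from the hints. It does not connect to PostgreSQL.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Assumption: these are the two critical databases named in the postmortem.
CRITICAL_DATABASES = ("app_db", "billing_db")
# The value the scenario's fix writes into app_endpoints for write_endpoint.
NEW_PRIMARY = "replica"


@dataclass
class FailoverObservation:
    """Hand-collected observations (hypothetical structure, not a scenario API)."""
    # Result of SELECT pg_is_in_recovery(); run on the standby.
    standby_in_recovery: bool
    # Value of the 'write_endpoint' row in app_endpoints.
    write_endpoint: str
    # Per database: were the failover_markers confirmed on the standby?
    markers_confirmed: Dict[str, bool] = field(default_factory=dict)


def remaining_steps(obs: FailoverObservation) -> List[str]:
    """Return the failover steps still outstanding, in the order the hints give them."""
    steps: List[str] = []
    if obs.standby_in_recovery:
        steps.append("promote the standby")
    if obs.write_endpoint != NEW_PRIMARY:
        steps.append("repoint write_endpoint to 'replica'")
    # A database that was never checked counts as unvalidated.
    for db in CRITICAL_DATABASES:
        if not obs.markers_confirmed.get(db, False):
            steps.append(f"validate failover_markers in {db}")
    return steps


if __name__ == "__main__":
    # Promoted, but the endpoint was never repointed and billing_db was never checked.
    half_done = FailoverObservation(
        standby_in_recovery=False,
        write_endpoint="primary",
        markers_confirmed={"app_db": True},
    )
    for step in remaining_steps(half_done):
        print("TODO:", step)
```

An empty list is the only state this sketch treats as a finished failover; promoting alone still leaves the repoint and any unchecked database outstanding, which mirrors the traps the postmortem describes.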