
Scenario · Compound Incidents

Primary crash and stale read endpoint

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 15–20 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-11/04-primary-crash-and-stale-read-endpoint

Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause (compound): the primary crashed, and the app's write endpoint
(the `write_endpoint` row in `app_endpoints`) still pointed at it. Even with a healthy
standby, the app was aimed at a dead node. Failover here is not one action: the standby
must be promoted AND the endpoint repointed to the new primary, and the critical
databases must then be validated on that node.

How it was found: `SELECT pg_is_in_recovery();` returned true on the standby, so it
had not been promoted; the app_endpoints registry still read `write_endpoint = primary`;
and the app_db / billing_db failover_markers were present on the standby (streamed
before the crash).

The fix (both):
  pgpg action promote-replica            -- standby becomes the new primary
  UPDATE app_endpoints SET value = 'replica' WHERE key = 'write_endpoint';

Lesson: failover isn't complete until endpoints and critical-data freshness are
validated on the new primary. Promoting without repointing leaves writes aimed at the
dead node; validating only the default database hides per-database gaps. Don't keep
the old primary as the source of truth, and don't reach for an index.
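The lesson can be read as a checklist: failover is done only when no gap remains. The sketch below encodes that checklist as a pure Python function. It is an illustration under stated assumptions, not part of the sandbox or its scoring: `FailoverState`, `failover_gaps` and the field names are hypothetical, and the inputs are assumed to come from the checks named in this postmortem (`pg_is_in_recovery()`, the `write_endpoint` row in `app_endpoints`, and the failover_markers in each critical database).

```python
from dataclasses import dataclass, field

# Hypothetical snapshot of the checks described in the postmortem.
# These names are illustrative; they are not part of any tool's API.
@dataclass
class FailoverState:
    standby_in_recovery: bool   # SELECT pg_is_in_recovery(); on the standby
    write_endpoint: str         # app_endpoints.value WHERE key = 'write_endpoint'
    markers_present: dict = field(default_factory=dict)  # database name -> marker found

# The two critical databases named in this scenario.
CRITICAL_DATABASES = ("app_db", "billing_db")


def failover_gaps(state: FailoverState) -> list:
    """Return the remaining gaps; an empty list means failover is complete."""
    gaps = []
    if state.standby_in_recovery:
        gaps.append("standby not promoted")
    if state.write_endpoint != "replica":
        gaps.append("write endpoint not repointed")
    # Check every critical database, not only the default one.
    for db in CRITICAL_DATABASES:
        if not state.markers_present.get(db, False):
            gaps.append(f"failover marker not validated in {db}")
    return gaps


if __name__ == "__main__":
    markers = {"app_db": True, "billing_db": True}
    at_incident = FailoverState(True, "primary", markers)
    promoted_only = FailoverState(False, "primary", markers)
    fixed = FailoverState(False, "replica", markers)
    for label, state in [("at incident", at_incident),
                         ("promoted only", promoted_only),
                         ("promoted and repointed", fixed)]:
        print(label, "->", failover_gaps(state))
```

The "promoted only" case still reports a gap, which is the trap the lesson describes: the new primary is healthy, but writes remain aimed at the dead node.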

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. The primary crashed. The standby is available but the app's write endpoint still points at the dead primary. Check recovery state (SELECT pg_is_in_recovery();) and the endpoint registry (SELECT * FROM app_endpoints;).
2. Recovery needs two things, not one: promote the standby AND repoint the write endpoint to the new primary. Validate the critical databases on the promoted node (app_db, billing_db failover_markers).
3. Promote with `pgpg action promote-replica`, then repoint: UPDATE app_endpoints SET value = 'replica' WHERE key = 'write_endpoint'; Don't leave the endpoint on the old primary, don't validate only the default database, and don't add an index.