← All scenarios

Scenario · Compound Incidents

Storage pressure and autovacuum failure

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L4 · 15–20 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-11/10-storage-pressure-and-autovacuum-failure

Part of these paths

Show the postmortem & investigation hints spoilers
Storage pressure and autovacuum failure
Type: incident simulation · Topic: Compound Incidents · Level: L4 · Duration: 15–20 min
Launch: ride postgres start stage-11/10-storage-pressure-and-autovacuum-failure

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause (compound): storage pressure on analytics_db with an underlying cleanup
failure. Autovacuum was disabled on `events` and update/delete churn left a large
dead-tuple load, so the table kept growing and space was never reclaimed. The storage
symptom and the vacuum failure are the same problem — and reaching for VACUUM FULL
(a heavy exclusive-lock rewrite) or an index is the wrong move. app_db was a healthy
decoy.

How it was found: pg_stat_user_tables showed high n_dead_tup with no recent
autovacuum; pg_class.reloptions showed autovacuum_enabled = false on events.

The fix (correct database, correct tool):
  \connect analytics_db
  ALTER TABLE events RESET (autovacuum_enabled);
  VACUUM ANALYZE events;

Lesson: storage incidents often hide a cleanup problem. Don't just delete files or
run VACUUM once and move on — re-enable autovacuum so it keeps reclaiming, and use a
plain VACUUM (not VACUUM FULL) during a live incident. Fix the right database; an
index doesn't reclaim bloat.

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. analytics_db.events is consuming space and degrading. Don't just delete files or VACUUM once — look at table health: SELECT relname, n_live_tup, n_dead_tup, last_autovacuum FROM pg_stat_user_tables; and SELECT relname, reloptions FROM pg_class WHERE relname = 'events';
2. The bloat IS the storage pressure: autovacuum is disabled on events and churn left a huge dead-tuple load, so cleanup never reclaims space. app_db is a healthy decoy. An index won't help, and VACUUM FULL is the wrong reflex (heavy lock).
3. Re-enable autovacuum and run a plain VACUUM: \connect analytics_db; ALTER TABLE events RESET (autovacuum_enabled); VACUUM ANALYZE events; Don't VACUUM FULL, don't only add an index, and fix the right database (analytics_db, not app_db).