Scenario · Storage & Backup

Archive directory backlog

A sandboxed PostgreSQL incident — investigate with your own tools, submit a fix, and get deterministic Detect / Fix / Trap scoring.

L3 · 10–15 min · runs locally in Docker

Launch

Start this scenario

Boot it in a real PostgreSQL sandbox and investigate with psql, EXPLAIN and pg_stat_statements.

ride postgres start stage-05/03-archive-directory-backlog

Part of these paths

Backup & Recovery Drills Storage Pressure DBA Path

Type: incident simulation · Topic: Storage & Backup · Level: L3 · Duration: 10–15 min
Launch: ride postgres start stage-05/03-archive-directory-backlog

POSTMORTEM (root cause · how it was found · the fix · lesson)
Root cause: archiving succeeded, but a write-heavy workload produced WAL segments
faster than the archive destination was being offloaded, so the archive area kept
growing. This is a slow-burn risk to backups/PITR and disk space, even though
queries were fine. (Here the archive dir is a size-capped tmpfs, so the sandbox can
never fill the host disk.)
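
To make the "faster than it was being offloaded" point concrete, here is a minimal sketch of the arithmetic: the archive area fills at the difference between the WAL production rate and the offload rate. The function name and all figures are hypothetical and not taken from the sandbox.

```python
MIB = 1024 * 1024


def seconds_until_full(capacity_bytes, used_bytes, produce_bps, offload_bps):
    """Estimate seconds until the archive area is full.

    Returns None when the area is not growing (offload keeps up).
    All inputs are illustrative; measure them on your own system.
    """
    net_growth = produce_bps - offload_bps
    if net_growth <= 0:
        return None
    free = max(capacity_bytes - used_bytes, 0)
    return free / net_growth


# Hypothetical numbers: a 256 MiB archive area, 64 MiB already used,
# 2 MiB/s of WAL arriving and nothing being offloaded.
eta = seconds_until_full(256 * MIB, 64 * MIB, 2 * MIB, 0)
print(eta)  # 96.0
```

With any offload rate at or above the production rate the function returns None, which is the state a durable fix should reach.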

How it was found: pg_stat_archiver.archived_count climbed continuously with no
failures; WAL turnover in pg_ls_waldir() and the sessions in pg_stat_activity
pointed to a single workload driving it.
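
A rough way to turn "climbed continuously" into a number is to sample archived_count twice and compute a rate. The sketch below assumes the default 16 MiB wal_segment_size and uncompressed archived segments; the helper name and the sample values are made up for illustration.

```python
from datetime import datetime

# Assumption: default PostgreSQL wal_segment_size (16 MiB).
WAL_SEGMENT_BYTES = 16 * 1024 * 1024


def archive_rate(t0, count0, t1, count1, segment_bytes=WAL_SEGMENT_BYTES):
    """Return (segments per minute, bytes per second) between two samples
    of pg_stat_archiver.archived_count."""
    elapsed = (t1 - t0).total_seconds()
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    if count1 < count0:
        raise ValueError("counter went backwards; were the stats reset?")
    segments = count1 - count0
    return segments / elapsed * 60, segments * segment_bytes / elapsed


# Hypothetical samples taken one minute apart.
per_min, bytes_per_sec = archive_rate(
    datetime(2024, 1, 1, 12, 0, 0), 100,
    datetime(2024, 1, 1, 12, 1, 0), 130,
)
print(per_min, bytes_per_sec)  # 30.0 8388608.0
```

A steady non-zero rate with no archiver failures is the signature described above: the archiver is healthy and the pressure is on the destination.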

The mitigation: stop the archive-filling workload. Once it was stopped, the archive stopped growing.

Lesson: monitor archive-destination size and growth, not just archiver success. The
durable fix is retention/offload of archived WAL and rate-limiting bulk writes.
Never "fix" it by turning archive_mode off: that silently destroys your PITR
chain. Adding an index or forcing a checkpoint does nothing for this problem.
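
As one possible shape for the retention side of that fix, the sketch below picks archived segments that logically precede the oldest segment you still need. That segment is typically the first one required by the oldest base backup you want to keep. The comparison ignores the timeline prefix, similar in spirit to pg_archivecleanup. The function name is hypothetical, and this only selects candidates; it deletes nothing.

```python
import re

# A plain WAL segment file name is 24 uppercase hex characters:
# 8 for the timeline, then 16 for the segment position.
_SEGMENT = re.compile(r"^[0-9A-F]{24}$")


def removable_segments(names, oldest_kept):
    """Return archived segment names that precede oldest_kept.

    Only plain segment files are considered; .backup, .partial and
    unrelated files are left alone. The timeline part is ignored.
    """
    if not _SEGMENT.match(oldest_kept):
        raise ValueError("oldest_kept is not a WAL segment file name")
    cutoff = oldest_kept[8:]
    return sorted(n for n in names if _SEGMENT.match(n) and n[8:] < cutoff)


# Hypothetical archive listing.
listing = [
    "000000010000000000000003",
    "000000010000000000000004",
    "000000010000000000000005",
    "000000010000000000000004.00000028.backup",
    "README",
]
print(removable_segments(listing, "000000010000000000000004"))
# ['000000010000000000000003']
```

Offloading the selected files to cheaper storage instead of deleting them keeps older restore points available while still relieving the archive area.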

INVESTIGATION HINTS (the staged path to diagnose and fix)
1. Archiving is working, but the archive destination keeps filling — a backup/PITR storage risk, not a query outage. Check the archiver: SELECT archived_count, last_archived_wal, last_failed_wal FROM pg_stat_archiver; archived_count climbs steadily, and pg_ls_waldir() shows constant WAL turnover.
2. A write-heavy workload (archive_writer) is forcing segment after segment to be archived into a bounded archive area that nobody is offloading. This is destination pressure, not a streaming-replication problem.
3. Stop the workload filling the archive: SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE application_name LIKE 'archive_writer%'; the archive then stops growing. The real fix is retention/offload of old archives — do NOT turn archiving off (you'd lose PITR).
