Most teams notice WAL only when something goes wrong.
The disk fills with files in pg_wal.
A replica falls behind.
Backups stop completing.
Checkpoints create latency spikes.
A bulk update generates far more IO than expected.
A restart takes longer than the team is comfortable with.
Until then, WAL and checkpoints feel like internal Postgres details.
They are not.
WAL and checkpoints are part of the contract between Postgres, storage, replication, backups, recovery, and application latency. If you operate Postgres in production, you do not need to become a storage engine developer, but you do need a practical reliability model of how this machinery behaves under pressure.
PostgreSQL uses Write-Ahead Logging to preserve data integrity: changes to data files must be logged first, and WAL records are flushed to durable storage before the corresponding data-file changes are considered safe. (PostgreSQL)
That is the foundation. The incidents come from everything around it.
The basic idea of WAL
When a transaction changes data, Postgres does not rely only on immediately updating table and index files.
It first records the change in WAL.
A simplified write path looks like this:
flowchart TD
A[Client sends write] --> B[Postgres modifies pages in memory]
B --> C[Postgres writes WAL records]
C --> D[WAL is flushed according to durability settings]
D --> E[COMMIT returns]
E --> F[Data pages are written later]
This separation is crucial.
The data page may not be written to the table file immediately. It can remain dirty in shared buffers. If the server crashes, Postgres can use WAL during recovery to bring data files back to a consistent state. PostgreSQL keeps WAL in the pg_wal/ directory, and the documentation describes WAL replay after the last checkpoint as the mechanism used to restore consistency after a crash. (PostgreSQL)
That is why WAL is not just logging.
It is recovery infrastructure.
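The stages of the write path can be observed on a running primary. This is a minimal sketch using built-in functions; it assumes a primary, because these functions raise an error on a standby that is in recovery.

```sql
-- Run on a primary; these functions are not available during recovery.
SELECT
    pg_current_wal_insert_lsn() AS insert_lsn,  -- WAL inserted into WAL buffers
    pg_current_wal_lsn()        AS write_lsn,   -- WAL written out to the operating system
    pg_current_wal_flush_lsn()  AS flush_lsn,   -- WAL flushed to durable storage
    pg_walfile_name(pg_current_wal_lsn()) AS current_wal_file;  -- segment file in pg_wal
```

The three positions are ordered: insert_lsn is at or ahead of write_lsn, which is at or ahead of flush_lsn. A persistent gap between them under load is a hint that WAL writes or flushes are struggling to keep up.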
COMMIT does not mean “every table page is already on disk”
A common misconception:
COMMIT means all changed table and index pages were written to disk.
Not exactly.
A committed transaction means Postgres has made the transaction durable according to its WAL and commit settings. The actual table and index pages may be written later.
This is one reason Postgres can perform well. It does not need to synchronously rewrite every affected data page before returning every commit.
But it also means that the health of WAL IO is critical.
If WAL writes or WAL fsync become slow, commits can become slow.
A user-visible symptom may be:
INSERT/UPDATE latency increases
API writes slow down
background jobs fall behind
replication lag grows
WAL directory grows
The application may report “database is slow,” but the specific mechanism may be commit-path pressure.
synchronous_commit changes the durability/latency trade-off
One setting that directly affects commit behavior is:
SHOW synchronous_commit;
The default (on) is appropriate for most production systems, but the operational model matters.
With stronger commit guarantees, the client waits for more durability work before COMMIT returns. With weaker settings, commits can return earlier, but the system accepts a larger risk window in the event of a crash.
This is not a generic performance knob.
It is a business and reliability decision.
For example, it may be acceptable to relax durability for:
ephemeral analytics events;
rebuildable caches;
non-critical metrics;
temporary ingestion buffers.
It may be unacceptable for:
payments;
orders;
ledger entries;
identity changes;
permissions;
security-sensitive writes.
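When relaxed durability is acceptable for one category of writes, it can be scoped to a single transaction instead of the whole server. The sketch below assumes a hypothetical analytics_events table that is not part of this article; only SET LOCAL and synchronous_commit are real Postgres features here.

```sql
-- Hypothetical table holding rebuildable analytics events (illustration only).
CREATE TABLE IF NOT EXISTS analytics_events (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    payload    jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL synchronous_commit = off;
INSERT INTO analytics_events (payload) VALUES ('{"event": "page_view"}');
COMMIT;  -- may return before the WAL flush; a crash shortly after can lose this row

-- The session is back to its previous value here.
SHOW synchronous_commit;
```

Scoping the change this way keeps payments, orders, and other critical writes on the stronger guarantee while only the chosen transactions accept the larger risk window.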
A dangerous incident response is changing durability settings during pressure without understanding what data can be lost and what the product guarantees.
The question is not:
Can this reduce latency?
The better question is:
What durability contract are we changing, and who owns that risk?
What checkpoints do
If WAL can recover data after a crash, why do checkpoints exist?
Because recovery cannot start from the beginning of time.
A checkpoint is a known safe point in the WAL sequence. At checkpoint time, dirty data pages are flushed to disk, and Postgres writes a checkpoint record to WAL. PostgreSQL documentation describes checkpoints as points where heap and index data files are guaranteed to have been updated with all information written before that checkpoint. (PostgreSQL)
A simplified model:
flowchart TD
A[WAL records accumulate] --> B[Dirty pages accumulate in memory]
B --> C[Checkpoint begins]
C --> D[Dirty pages are written to disk]
D --> E[Checkpoint record is written]
E --> F([Crash recovery can start from a later point])
Checkpoints reduce crash recovery work.
But they also create IO work.
That trade-off is central to Postgres reliability.
Checkpoints are not free
During a checkpoint, Postgres must write dirty buffers to disk.
If many pages are dirty, that can create significant IO pressure. If the storage system is already busy, checkpoint activity can appear as latency spikes.
Symptoms may include:
periodic write latency spikes;
higher commit latency;
slow queries during checkpoint periods;
replica lag increasing during write bursts;
backend processes writing buffers directly;
checkpoint warnings in logs;
storage saturation without one obvious query.
This is why checkpoint behavior should be understood as part of workload management, not only configuration.
A checkpoint problem is often a workload-shape problem:
many writes in a short period;
bulk updates;
large deletes;
index builds;
backfills;
ETL jobs;
maintenance tasks;
write-heavy deploys;
checkpoints happening too frequently.
The database may be working correctly while still creating unacceptable latency for the product.
Time-based and WAL-volume-based checkpoints
Checkpoints happen for different reasons.
Four settings shape this behavior:
SHOW checkpoint_timeout;
SHOW max_wal_size;
SHOW checkpoint_completion_target;
SHOW checkpoint_warning;
Postgres can checkpoint because enough time has passed, or because WAL volume has grown enough. The documentation describes checkpoint_timeout, max_wal_size, checkpoint_completion_target, and checkpoint_warning as key WAL/checkpoint configuration parameters. (PostgreSQL)
A useful mental model:
checkpoint_timeout:
how long Postgres may go between automatic checkpoints.
max_wal_size:
a soft limit on WAL growth between automatic checkpoints; approaching it triggers a checkpoint.
checkpoint_completion_target:
how much of the checkpoint interval Postgres tries to use
to spread checkpoint writes.
checkpoint_warning:
log a warning if checkpoints triggered by WAL volume happen closer together than this interval.
Frequent requested checkpoints are usually a warning sign.
They often mean WAL is being generated faster than the current checkpoint configuration expects.
That can happen during normal growth, but it can also reveal an unsafe backfill, bulk update, migration, or retry storm.
The classic warning: checkpoints are happening too often
Postgres can log warnings when checkpoints caused by WAL segment pressure happen too close together. The documentation notes that checkpoint_warning exists to log when checkpoints caused by WAL filling occur closer together than the configured threshold, suggesting max_wal_size may need to be increased. (PostgreSQL)
A log message like this should not be ignored:
checkpoints are occurring too frequently
It does not automatically mean “increase max_wal_size and move on.”
It means:
The workload is generating WAL fast enough
to force more checkpoint activity than expected.
The next question is workload-oriented:
What changed?
A migration?
A bulk update?
A new write-heavy endpoint?
A data import?
A queue retry storm?
A new index?
A replica or archive issue?
Changing a setting may be appropriate. But if the WAL spike came from a bad release or uncontrolled job, the real fix may be outside postgresql.conf.
Measuring WAL generation
A basic WAL snapshot:
SELECT
wal_records,
wal_fpi,
pg_size_pretty(wal_bytes) AS wal_bytes,
wal_buffers_full,
stats_reset
FROM pg_stat_wal;
PostgreSQL’s cumulative statistics system exposes server activity through statistics views, including WAL-related and replication-related views. (PostgreSQL)
The most useful number is not only total WAL generated.
It is the rate.
You can sample WAL position:
SELECT now(), pg_current_wal_lsn();
Then sample again later and subtract the two positions. The LSNs below are placeholders for the two samples:
SELECT
pg_size_pretty(
pg_wal_lsn_diff(
'0/70000000'::pg_lsn,
'0/60000000'::pg_lsn
)
) AS wal_generated;
In monitoring, this becomes:
WAL bytes generated per second
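For an ad hoc measurement without a monitoring system, the two samples and the subtraction can be done in one anonymous block. This is a sketch under stated assumptions: it runs on a primary, and the 10-second window is an arbitrary illustrative value.

```sql
-- Measures WAL generated over a fixed window. Run on a primary.
DO $$
DECLARE
    window_seconds CONSTANT integer := 10;  -- illustrative window, adjust as needed
    start_lsn pg_lsn      := pg_current_wal_lsn();
    start_ts  timestamptz := clock_timestamp();
    wal_bytes numeric;
    elapsed   numeric;
BEGIN
    PERFORM pg_sleep(window_seconds);
    -- Bytes of WAL written between the two samples.
    wal_bytes := pg_wal_lsn_diff(pg_current_wal_lsn(), start_lsn);
    -- clock_timestamp() advances inside a transaction, unlike now().
    elapsed   := extract(epoch FROM clock_timestamp() - start_ts);
    RAISE NOTICE 'WAL generated: %, rate: % per second',
        pg_size_pretty(wal_bytes),
        pg_size_pretty(round(wal_bytes / elapsed));
END
$$;
```

A short window shows bursts and a longer one shows the sustained rate; both are worth knowing before a large backfill.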
Why this matters:
WAL must be written locally.
WAL may need to be archived.
WAL may need to be streamed to replicas.
WAL may be retained for replication slots.
WAL volume affects checkpoint pressure.
WAL volume affects recovery time.
A system can have acceptable query latency and still be heading toward a WAL-related incident.
Finding WAL-heavy queries
Since PostgreSQL 13, pg_stat_statements exposes per-statement WAL counters (wal_records, wal_fpi, wal_bytes), provided the extension is installed and loaded through shared_preload_libraries.
A useful query shape:
SELECT
calls,
pg_size_pretty(wal_bytes) AS total_wal,
pg_size_pretty((wal_bytes / greatest(calls, 1))::numeric) AS wal_per_call,
mean_exec_time,
rows,
left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE wal_bytes > 0
ORDER BY wal_bytes DESC
LIMIT 20;
This helps identify statements that generate large amounts of WAL.
Typical WAL-heavy operations include:
large UPDATEs;
large DELETEs;
bulk INSERTs;
index creation;
table rewrites;
VACUUM FULL;
CLUSTER;
backfills;
high-churn queue updates;
touching indexed columns repeatedly.
The important distinction:
A query can be acceptable from a latency perspective
and still dangerous from a WAL perspective.
For example, a backfill may run efficiently but generate enough WAL to delay replicas, overload archiving, and force frequent checkpoints.
That is a reliability problem, even if the SQL itself is “fast.”
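To estimate the WAL cost of one statement before running it at scale, EXPLAIN accepts a WAL option (PostgreSQL 13 and later). The sketch below assumes a users table with id, email, and normalized_email columns, and the id range is an arbitrary sample.

```sql
BEGIN;

-- ANALYZE really executes the statement; WAL adds a line with records, fpi, and bytes.
EXPLAIN (ANALYZE, WAL, BUFFERS)
UPDATE users
SET normalized_email = lower(email)
WHERE id BETWEEN 1 AND 1000;  -- sample slice, not the full backfill

-- Undoes the row changes. The WAL for this sample has still been written.
ROLLBACK;
```

Dividing the reported bytes by the number of rows gives a rough per-row figure. Treat it as a lower bound: the share of full-page images depends on checkpoint timing and will differ in a long-running job.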
Full-page images and WAL volume
After a checkpoint, the first modification to a data page may include a full-page image in WAL when full_page_writes is enabled.
Check:
SHOW full_page_writes;
full_page_writes protects against torn pages after crashes. It can increase WAL volume, especially after checkpoints and during write-heavy workloads.
This creates an important interaction:
Frequent checkpoints
↓
More pages modified for the first time after each checkpoint
↓
More full-page images
↓
More WAL generated
↓
More pressure on WAL, archiving, replication, and checkpoints
This is one reason overly frequent checkpoints can amplify IO pressure.
A dangerous conclusion would be:
Full-page writes generate WAL, so disable them.
That is usually the wrong instinct. This setting exists for crash safety.
A better conclusion:
If full-page images are high,
understand checkpoint frequency, write patterns, and storage behavior.
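One rough way to see how prominent full-page images are is to compare the counters in pg_stat_wal (PostgreSQL 14 and later). This sketch reports a share of records, not of bytes, so read it as a trend signal rather than a precise measurement.

```sql
SELECT
    wal_records,
    wal_fpi,
    -- Share of WAL records that carried a full-page image.
    round(100.0 * wal_fpi / nullif(wal_records, 0), 2) AS fpi_pct_of_records,
    -- Average record size; full-page images push this up.
    pg_size_pretty(wal_bytes / nullif(wal_records, 0)) AS avg_bytes_per_record,
    stats_reset
FROM pg_stat_wal;
```

Because the counters are cumulative since stats_reset, comparing two samples taken some time apart is more informative than a single reading.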
WAL compression
Postgres supports WAL compression:
SHOW wal_compression;
Enabling WAL compression can reduce WAL volume for some workloads, especially where full-page images dominate. But it may increase CPU usage.
This is a trade-off:
Less WAL volume
More CPU work
Potentially lower replication/archive pressure
Potentially higher CPU pressure
It is not universally good or bad.
It should be evaluated against the actual bottleneck:
Is the system WAL-volume bound?
Storage bound?
Network bound?
Archive bound?
Replica catch-up bound?
CPU bound?
A reliability mistake is tuning WAL without knowing which resource is constrained.
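If the evaluation points toward trying compression, the setting can be changed without a restart. This is a minimal sketch, assuming superuser access; it uses the plain on value, while PostgreSQL 15 and later also accept specific algorithms when the server was built with support for them.

```sql
-- Check the current value and confirm the setting does not need a restart.
SELECT name, setting, context
FROM pg_settings
WHERE name = 'wal_compression';

-- Enable it cluster-wide; takes effect after a configuration reload.
ALTER SYSTEM SET wal_compression = on;
SELECT pg_reload_conf();
```

Measuring WAL bytes per second and CPU usage before and after the change shows whether the trade paid off for this workload.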
WAL and replication lag
Replication depends on WAL movement.
A write-heavy event on the primary can become a replica incident:
Bulk update generates WAL
↓
Primary writes WAL locally
↓
WAL is streamed to standby
↓
Standby writes, flushes, replays WAL
↓
Replica falls behind
↓
Read traffic sees stale data
↓
Failover safety decreases
Primary-side check:
SELECT
application_name,
state,
sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS send_lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn)) AS write_lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn)) AS flush_lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
Standby-side check:
SELECT
pg_is_in_recovery() AS is_standby,
pg_last_wal_receive_lsn() AS receive_lsn,
pg_last_wal_replay_lsn() AS replay_lsn,
now() - pg_last_xact_replay_timestamp() AS replay_delay;
The WAL question during an incident is not only:
How much WAL did we generate?
It is:
Can every downstream system consume it fast enough?
That includes standbys, archives, logical replication consumers, backup systems, and change-data-capture pipelines.
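On a standby, it helps to separate WAL that has not arrived from WAL that has arrived but not been replayed. The sketch below also guards against a common false alarm: when the primary is idle, the time since the last replayed transaction grows even though the standby is fully caught up.

```sql
-- Run on a standby.
SELECT
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn()  AS replay_lsn,
    -- WAL already on the standby that still has to be applied.
    pg_size_pretty(
        pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
    ) AS received_not_replayed,
    -- Report zero delay when everything received has been replayed.
    CASE
        WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN interval '0'
        ELSE now() - pg_last_xact_replay_timestamp()
    END AS replay_delay;
```

A large received_not_replayed value points at replay speed on the standby; a small one combined with high lag on the primary side points at the network or at the WAL rate itself.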
WAL archiving and backup risk
WAL is also central to point-in-time recovery.
If WAL archiving fails, backups may no longer support the recovery objectives the team believes they have.
Postgres continuous archiving relies on saving WAL files so that the database can be restored by replaying WAL from a base backup to a desired point in time. (PostgreSQL)
A common failure chain:
flowchart TD
A[Archive command starts failing] --> B[WAL files accumulate]
B --> C[pg_wal grows]
C --> D[Disk fills]
D --> E([Primary becomes unstable or stops])
Check archiver status:
SELECT
archived_count,
last_archived_wal,
last_archived_time,
failed_count,
last_failed_wal,
last_failed_time,
stats_reset
FROM pg_stat_archiver;
This view should be part of production monitoring when archiving is enabled.
A healthy primary with broken archiving is not healthy from a disaster recovery perspective.
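For alerting, the raw counters are easier to act on when reduced to "is it failing now" and "how much is waiting". This is a sketch assuming PostgreSQL 12 or later and a role allowed to call pg_ls_archive_statusdir() (superuser or pg_monitor by default).

```sql
SELECT
    failed_count,
    last_archived_time,
    last_failed_time,
    now() - last_archived_time AS time_since_last_archive,
    -- True when the most recent archive attempt was a failure.
    coalesce(last_failed_time > last_archived_time,
             last_failed_time IS NOT NULL) AS failing_now,
    -- Segments marked ready but not yet archived.
    (SELECT count(*)
     FROM pg_ls_archive_statusdir()
     WHERE name LIKE '%.ready') AS wal_files_waiting
FROM pg_stat_archiver;
```

A growing wal_files_waiting count with failing_now set to false suggests the archive is working but too slow for the current WAL rate.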
WAL retention and replication slots
Replication slots can retain WAL required by a replica or logical consumer.
That is useful.
It is also dangerous.
SELECT
slot_name,
slot_type,
active,
restart_lsn,
confirmed_flush_lsn,
wal_status,
safe_wal_size,
pg_size_pretty(
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
A disconnected consumer leaves its slot behind (it shows active = false), and that slot still forces the primary to retain WAL.
The incident can look like:
Logical consumer stops
↓
Replication slot remains
↓
Primary keeps old WAL
↓
Disk usage grows
↓
Emergency cleanup decision required
The dangerous command:
SELECT pg_drop_replication_slot('slot_name');
This can be correct if the slot is abandoned. It can also break a consumer that still needs the WAL.
WAL retention is not just a database metric.
It is ownership information:
Who owns this slot?
What system consumes it?
How far behind is it allowed to get?
What alert fires?
What is the reinitialization procedure?
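One way to encode "how far behind is it allowed to get" is max_slot_wal_keep_size (PostgreSQL 13 and later). The sketch below uses 50GB purely as an illustrative assumption; the right value depends on the size of the pg_wal volume and on how long a consumer outage should be tolerated.

```sql
-- Cap how much WAL any slot may retain. The value is an example, not a recommendation.
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';
SELECT pg_reload_conf();

-- Slots that are about to lose, or have already lost, the WAL they need.
SELECT
    slot_name,
    active,
    wal_status,
    pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
WHERE wal_status IN ('unreserved', 'lost');
```

The cap protects the primary's disk at the cost of the consumer: a slot that exceeds it is invalidated, and the replica or logical consumer behind it has to be reinitialized.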
Monitoring checkpoint behavior
Starting with PostgreSQL 17, checkpoint-related statistics are exposed separately through pg_stat_checkpointer; on PostgreSQL 16 and earlier, similar counters are found in pg_stat_bgwriter. The exact view and column names vary by version, so monitoring queries should match the Postgres version you operate. PostgreSQL’s monitoring documentation describes these cumulative statistics views as the place to inspect server activity. (PostgreSQL)
For PostgreSQL 17 and later (num_done was added in PostgreSQL 18; omit it on 17):
SELECT
num_timed,
num_requested,
num_done,
write_time,
sync_time,
buffers_written,
stats_reset
FROM pg_stat_checkpointer;
For PostgreSQL 16 and earlier:
SELECT
checkpoints_timed,
checkpoints_req,
checkpoint_write_time,
checkpoint_sync_time,
buffers_checkpoint,
buffers_backend,
buffers_backend_fsync,
stats_reset
FROM pg_stat_bgwriter;
The operational interpretation:
Many requested checkpoints:
WAL volume may be forcing checkpoints.
High checkpoint write/sync time:
storage may be struggling with checkpoint work.
High backend buffer writes:
foreground sessions may be doing writes themselves,
which can increase user-visible latency.
Frequent checkpoint warnings:
checkpoint/WAL sizing may not match workload.
The goal is not to obsess over one counter.
The goal is to detect whether checkpoint work is smooth and predictable or bursty and user-visible.
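The interpretation above can be condensed into a few derived numbers. This sketch targets pg_stat_checkpointer (PostgreSQL 17 and later); on older versions the same ratio can be computed from checkpoints_timed and checkpoints_req in pg_stat_bgwriter.

```sql
SELECT
    num_timed,
    num_requested,
    -- Share of checkpoints that were requested rather than triggered by the timeout.
    round(100.0 * num_requested / nullif(num_timed + num_requested, 0), 1) AS requested_pct,
    -- write_time and sync_time are reported in milliseconds.
    round((write_time / 1000.0)::numeric, 1) AS write_seconds,
    round((sync_time / 1000.0)::numeric, 1)  AS sync_seconds,
    stats_reset
FROM pg_stat_checkpointer;
```

A requested_pct that climbs between two samples is a stronger signal than its absolute value, because the counters accumulate since stats_reset.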
Logging checkpoints
Check whether checkpoint logging is enabled (it is on by default since PostgreSQL 15):
SHOW log_checkpoints;
To enable:
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
Checkpoint logs can show:
when checkpoints start and finish;
how much was written;
how long writing took;
how long syncing took;
whether checkpoints are requested or timed;
whether the system is checkpointing too often.
This is useful during investigation because checkpoint problems are often temporal.
A graph may show latency spikes every few minutes.
Checkpoint logs can confirm whether those spikes correlate with checkpoint activity.
WAL and disk-full incidents
pg_wal filling the disk is one of the most direct WAL-related outages.
Possible causes:
archive failures;
replication slot retention;
replica disconnected;
logical replication consumer stopped;
long base backup;
too much WAL generated too quickly;
max_wal_size or wal_keep_size set larger than the volume can hold;
storage capacity too low;
unexpected bulk operation.
A useful filesystem-level check:
du -sh "$PGDATA/pg_wal"
From SQL, you can inspect WAL directory files if permissions allow:
SELECT
count(*) AS wal_files,
pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
Disk-full incidents are dangerous because once Postgres cannot write WAL, it cannot accept writes and typically shuts down with a PANIC.
At that point, this is not a tuning issue. It is an availability incident.
The immediate question becomes:
Why is WAL being retained or generated faster than expected,
and what can be safely removed, advanced, paused, or fixed?
Deleting files manually from pg_wal is not a safe normal operation. It can corrupt recovery assumptions and break the cluster.
WAL-heavy migrations
Some migrations generate much more WAL than teams expect.
Examples:
UPDATE users
SET normalized_email = lower(email);
DELETE FROM events
WHERE created_at < now() - interval '180 days';
CREATE INDEX CONCURRENTLY idx_events_tenant_created
ON events (tenant_id, created_at);
ALTER TABLE orders
ADD COLUMN total_cents bigint DEFAULT 0;
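Depending on version, table structure, defaults, and operation type, schema changes may be metadata-only or may rewrite substantial data. For example, since PostgreSQL 11 adding a column with a constant default is metadata-only, while older versions rewrite the whole table. Large updates and deletes generate WAL, create dead tuples, pressure autovacuum, and increase replication lag.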
A safer operational pattern for backfills:
process in small batches;
sleep between batches;
measure WAL rate;
watch replication lag;
watch archive status;
watch checkpoint frequency;
keep transactions short;
make progress resumable;
stop quickly if pressure rises.
Example batch shape:
WITH batch AS (
SELECT id
FROM users
WHERE normalized_email IS NULL
ORDER BY id
LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;
The exact batch size is workload-specific.
The reliability principle is stable:
A migration should have a pressure budget,
not just a correctness test.
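The batch shape above can be wrapped in a loop that commits each batch and pauses between them. This is a sketch, not a finished tool: the procedure name and default values are hypothetical, it requires PostgreSQL 11 or later for COMMIT inside a procedure, and it adds an email IS NOT NULL filter so that rows with a NULL email are not selected again forever.

```sql
-- Hypothetical procedure name and defaults, for illustration only.
CREATE OR REPLACE PROCEDURE backfill_normalized_email(
    batch_size    integer          DEFAULT 1000,
    pause_seconds double precision DEFAULT 0.5
)
LANGUAGE plpgsql
AS $$
DECLARE
    rows_updated integer;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id
            FROM users
            WHERE normalized_email IS NULL
              AND email IS NOT NULL      -- NULL emails would otherwise never leave the queue
            ORDER BY id
            LIMIT batch_size
        )
        UPDATE users u
        SET normalized_email = lower(u.email)
        FROM batch
        WHERE u.id = batch.id;

        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        COMMIT;                          -- each batch is its own short transaction
        EXIT WHEN rows_updated = 0;      -- nothing left to backfill
        PERFORM pg_sleep(pause_seconds); -- leave room for replicas, archiver, checkpointer
    END LOOP;
END;
$$;

-- Must be called outside an explicit transaction block.
CALL backfill_normalized_email(1000, 0.5);
```

Because progress is derived from the data itself, cancelling the CALL keeps every committed batch and a later CALL resumes where it stopped. A fuller version would check replication lag or archive backlog inside the loop and lengthen the pause when pressure rises.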
Why “fast on staging” is not enough
WAL behavior depends on production realities:
table size;
index count;
row width;
update pattern;
checkpoint timing;
full-page writes;
storage latency;
replica speed;
archive bandwidth;
logical consumers;
autovacuum state;
concurrent workload.
A staging database with small tables and no replicas cannot reveal the true WAL cost of a production backfill.
A migration may pass every functional test and still be operationally unsafe.
The better pre-flight question:
How much WAL will this generate,
and what systems must absorb that WAL?
That question changes how teams design migrations.
Crash recovery time is part of reliability
Checkpoints influence crash recovery.
If checkpoints are very far apart, there may be more WAL to replay after a crash. If checkpoints are too frequent, normal operation may suffer from excessive checkpoint IO.
This is a trade-off:
Less frequent checkpoints:
potentially smoother normal operation,
more WAL to replay after crash.
More frequent checkpoints:
less WAL to replay,
more frequent checkpoint IO,
potentially more full-page image WAL.
The right balance depends on recovery objectives, write workload, storage capacity, and latency requirements.
A database that is fast during normal operation but takes too long to recover may not satisfy the business reliability target.
A database that checkpoints too aggressively may create latency incidents during normal traffic.
Reliability is the balance, not one extreme.
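The recovery side of this trade-off can be sampled at any moment by asking how much WAL lies between the last checkpoint's redo point and the current write position. This is a minimal sketch for a primary, using the built-in pg_control_checkpoint() function.

```sql
SELECT
    checkpoint_time,
    now() - checkpoint_time AS since_last_checkpoint,
    redo_lsn,
    -- Approximate WAL that crash recovery would have to replay right now.
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn)
    ) AS wal_to_replay_if_crash_now
FROM pg_control_checkpoint();
```

The byte count does not translate directly into seconds, since replay speed depends on storage and on the kind of changes in the WAL, but tracking its peak gives a concrete input for recovery-time discussions.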
A practical WAL and checkpoint health snapshot
This is not a complete runbook, but it is a useful investigation snapshot.
WAL settings:
SELECT
name,
setting,
unit,
context
FROM pg_settings
WHERE name IN (
'wal_level',
'synchronous_commit',
'full_page_writes',
'wal_compression',
'checkpoint_timeout',
'checkpoint_completion_target',
'checkpoint_warning',
'max_wal_size',
'min_wal_size',
'archive_mode',
'archive_command',
'max_slot_wal_keep_size'
)
ORDER BY name;
WAL generation:
SELECT
wal_records,
wal_fpi,
pg_size_pretty(wal_bytes) AS wal_bytes,
wal_buffers_full,
stats_reset
FROM pg_stat_wal;
WAL directory size:
SELECT
count(*) AS wal_files,
pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
Archiving:
SELECT
archived_count,
last_archived_wal,
last_archived_time,
failed_count,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
Replication lag:
SELECT
application_name,
state,
sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
Replication slots:
SELECT
slot_name,
slot_type,
active,
wal_status,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
Checkpoint stats, PostgreSQL 17 and later (omit num_done on 17):
SELECT
num_timed,
num_requested,
num_done,
write_time,
sync_time,
buffers_written,
stats_reset
FROM pg_stat_checkpointer;
Checkpoint stats, PostgreSQL 16 and earlier:
SELECT
checkpoints_timed,
checkpoints_req,
checkpoint_write_time,
checkpoint_sync_time,
buffers_checkpoint,
buffers_backend,
buffers_backend_fsync,
stats_reset
FROM pg_stat_bgwriter;
WAL-heavy statements:
SELECT
calls,
pg_size_pretty(wal_bytes) AS total_wal,
pg_size_pretty((wal_bytes / greatest(calls, 1))::numeric) AS wal_per_call,
mean_exec_time,
left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE wal_bytes > 0
ORDER BY wal_bytes DESC
LIMIT 20;
The purpose of this snapshot is to connect symptoms:
high WAL generation;
frequent checkpoints;
archive failures;
replica lag;
slot retention;
storage growth;
write latency;
migration activity.
A WAL incident is rarely visible through one metric alone.
Common anti-patterns
Treating WAL as a storage nuisance
pg_wal is not garbage. It is required for crash recovery, replication, and backups.
Increasing max_wal_size without understanding the workload
This may reduce checkpoint frequency, but it does not explain why WAL generation changed.
Ignoring archiver failures
A database can keep serving traffic while silently losing point-in-time recovery capability.
Letting replication slots have no owner
An abandoned slot can retain WAL until the primary disk is in danger.
Running large backfills without a WAL budget
A backfill should be planned around WAL rate, replica lag, archive capacity, and checkpoint pressure.
Using staging to estimate production WAL cost
Small data, fewer indexes, and missing replicas make staging a poor predictor of WAL impact.
Manually deleting WAL files
This is not a safe incident response pattern. It can destroy recovery guarantees.
Why WAL and checkpoint incidents are good simulation material
WAL/checkpoint incidents are excellent for training because the symptoms are distributed across the system.
The application may show write latency.
The database may show frequent checkpoints.
The replica may show lag.
The backup system may show archive failures.
The disk may show pg_wal growth.
The migration system may show a “successful” backfill.
The team may be tempted to change settings without understanding the pressure chain.
A realistic simulation can force decisions like:
Is the primary overloaded by WAL writes or normal query IO?
Is checkpoint activity causing latency spikes?
Is a bulk operation generating too much WAL?
Is the replica behind because it is slow or because the primary is producing too much WAL?
Is archiving broken or merely delayed?
Is a replication slot safe to drop?
Should the team pause a migration, throttle a job, increase WAL capacity, tune checkpoints, or protect user traffic first?
This is not about memorizing pg_stat_wal.
It is about understanding the system consequences of writes.
Articles can explain WAL mechanics. Dashboards can expose WAL rates. Simulations teach teams how WAL pressure changes operational decisions.
Conclusion
WAL and checkpoints are invisible when healthy and unavoidable when they fail.
WAL protects durability and enables crash recovery, replication, archiving, and point-in-time recovery. Checkpoints bound recovery work and move dirty data pages to disk. Together, they form the storage reliability backbone of Postgres.
But that backbone has operational limits.
Write-heavy workloads generate WAL. WAL must be written, archived, streamed, retained, and replayed. Checkpoints must flush dirty data. Storage must absorb bursts. Replicas and backup systems must keep up. Operators must understand when a “database slowdown” is really WAL pressure.
The dangerous phrase is:
It is just WAL.
The better reliability question is:
What is generating this WAL, what systems must consume it, and what happens if they cannot keep up?
That question turns WAL and checkpoints from internal Postgres machinery into practical production reliability signals.