Postgres replication: when a standby exists but does not save you

A standby database is comforting.

It appears in architecture diagrams as a safety net. The primary fails, the standby takes over, and the product survives. Read traffic can be moved away from the primary. Backups can be isolated. Disaster recovery looks solved.

But Postgres replication does not automatically mean high availability.

A standby can be too far behind. A replica can faithfully reproduce bad writes. A failover can create split-brain. A replication slot can fill the primary disk with retained WAL. Read queries on a standby can conflict with recovery. A promoted replica can break downstream consumers. An application can keep writing to the wrong node after failover.

Replication is not a guarantee. It is a mechanism.

And like every reliability mechanism, it creates new failure modes.

PostgreSQL streaming replication keeps a standby up to date by sending WAL records from the primary as they are generated; it is asynchronous by default, meaning there can be a delay between commit on the primary and visibility on the standby. (PostgreSQL)

That small sentence contains an entire class of incidents.


The false sense of safety

Many teams say:

We have a replica.

But that statement is incomplete.

A more useful operational version is:

We have a replica.
We know how far behind it is.
We know whether it can be promoted.
We know what data loss window is acceptable.
We know how applications reconnect.
We know how to prevent the old primary from coming back.
We know what happens to replication slots, read traffic, jobs, and logical consumers after failover.

A replica is not a disaster recovery plan by itself.

It is a component inside a larger recovery process.

PostgreSQL’s own failover documentation is explicit about the need to prevent the old primary from continuing as primary after a standby is promoted, because two systems believing they are primary can lead to data loss; this is the classic split-brain problem. (PostgreSQL)

That is why replication reliability is not just about lag.

It is about control.


Replication lag is not one number

The first mistake is treating replication lag as a single metric.

In practice, there are several different “lags”:

WAL generated on primary but not sent
WAL sent but not written by standby
WAL written but not flushed
WAL flushed but not replayed
Changes replayed but application still reading stale data

On the primary, pg_stat_replication is the main view for directly connected standbys. The PostgreSQL statistics documentation describes it as one row per WAL sender process, with information about replication to the connected standby. (PostgreSQL)

A useful primary-side query:

SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;

This query separates the pipeline.

If sent_lsn is far behind, the primary is not sending fast enough or the connection is impaired.

If write_lsn lags behind sent_lsn, the standby is receiving but not writing fast enough.

If flush_lsn is behind, WAL is not durable on the standby yet.

If replay_lsn is behind, the standby has received WAL but has not applied it.

Those are not the same problem.

A standby can be connected and still not be useful for failover if it is too far behind the primary.
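
To make the pipeline concrete, here is a minimal Python sketch that turns the five LSN positions from the query above into per-stage gaps and names the widest one. The helper names (lsn_to_int, stage_gaps, widest_stage) are illustrative, not part of any Postgres client library, and the sketch assumes none of the LSN columns are NULL.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def stage_gaps(current: str, sent: str, write: str, flush: str, replay: str) -> dict:
    """Return the WAL bytes waiting at each stage of the replication pipeline."""
    positions = [lsn_to_int(x) for x in (current, sent, write, flush, replay)]
    names = ["not_sent", "not_written", "not_flushed", "not_replayed"]
    # Each gap is the distance between two neighbouring positions.
    return {n: max(a - b, 0) for n, a, b in zip(names, positions, positions[1:])}


def widest_stage(gaps: dict) -> str:
    """Name the stage holding the most WAL, i.e. the likely bottleneck."""
    return max(gaps, key=gaps.get)
```

With a standby that has flushed almost everything but replayed little, for example stage_gaps("0/50000000", "0/50000000", "0/50000000", "0/4F000000", "0/40000000"), the widest stage is not_replayed. That points at replay on the standby rather than at the network.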


Checking the standby from the standby

On the standby itself:

SELECT pg_is_in_recovery();

A standby returns true. After promotion, it returns false.

To inspect receive and replay positions:

SELECT
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn()  AS replay_lsn,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
    ) AS receive_replay_gap,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;

This helps answer a different question:

Is this standby receiving WAL?
Is it replaying WAL?
How stale is the data visible to queries?

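The replay_delay value is especially important for read replicas. It tells you how far behind visible database state may be. One caveat: pg_last_xact_replay_timestamp() reports the commit time of the last replayed transaction, so on a quiet primary replay_delay keeps growing even when the standby is fully caught up. If receive_replay_gap is zero and the WAL receiver is streaming, a large replay_delay usually means the primary is idle rather than the standby being behind.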

For example, if the application writes an order to the primary and immediately reads from a standby, it may not see its own write.

That is not a Postgres bug. It is a read-after-write consistency problem.


Read replicas can serve stale data

A common architecture sends writes to the primary and reads to replicas:

flowchart TD
    A[Application writes order to primary] --> B[Application reads order from standby]
    B --> C([Order is missing])

The write committed successfully. The replica simply has not replayed the WAL yet.

This is one of the most common ways replication leaks into product behavior.

The user sees:

I saved the setting, but the UI still shows the old value.

The backend sees:

INSERT succeeded.
SELECT returned old state.

The database sees:

Primary is correct.
Standby is behind by 800 ms.

That may be acceptable for dashboards, analytics, or eventually consistent feeds. It may be unacceptable for checkout, authentication, permissions, billing, or anything requiring read-your-writes behavior.

A basic mitigation pattern is application-level routing:

Fresh reads after writes → primary
Stale-tolerant reads → replica
Long analytics queries → dedicated reporting replica

This decision belongs in system design, not in a panic during an incident.
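
One way to implement that routing is to compare LSNs instead of guessing at a time window. The sketch below is a hedged illustration, not a library API: it assumes the application captures pg_current_wal_lsn() on the primary right after its commit and can learn the replica's pg_last_wal_replay_lsn(). The function name choose_node and the stale_ok flag are hypothetical.

```python
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '0/50000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def choose_node(last_write_lsn: Optional[str],
                replica_replay_lsn: Optional[str],
                stale_ok: bool = False) -> str:
    """Decide where a read should go: 'primary' or 'replica'."""
    # Stale-tolerant reads, or sessions that never wrote, can use the replica.
    if stale_ok or last_write_lsn is None:
        return "replica"
    # Without a known replay position we cannot prove freshness.
    if replica_replay_lsn is None:
        return "primary"
    # The replica is fresh enough once it has replayed past our write.
    if lsn_to_int(replica_replay_lsn) >= lsn_to_int(last_write_lsn):
        return "replica"
    return "primary"
```

The design choice here is to fall back to the primary whenever freshness cannot be proven. That costs some primary load but never shows a user a missing write.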


Replication protects availability, not correctness of bad changes

Replication copies changes.

That includes bad changes.

If an application deploy runs:

UPDATE users
SET plan = 'free';

without a WHERE clause, the standby will not save you. It will replay the same change.

If a migration drops the wrong column, the standby will follow.

If an application bug deletes valid data, physical streaming replication reproduces the deletion.

This is why replication is not a replacement for backups, point-in-time recovery, access controls, safer migrations, or staged rollouts.

A standby helps when the primary node, disk, VM, container, or availability zone fails.

It does not magically distinguish good WAL from bad WAL.

A good reliability review asks:

Which failure mode are we defending against?
Primary host failure?
Storage failure?
Human error?
Bad deploy?
Region outage?
Silent corruption?
Accidental DELETE?

A replica is useful for some of these. It is insufficient for others.


Replication slots: safety mechanism with sharp edges

Replication slots are designed to help prevent the primary from removing WAL that a replica or logical consumer still needs. PostgreSQL documents pg_replication_slots as the view listing replication slots and their current state. (PostgreSQL)

That is useful. It is also dangerous if nobody monitors it.

Inspect slots:

SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    wal_status,
    safe_wal_size,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

The risk is simple:

flowchart TD
    A[A replica disconnects] --> B[Its replication slot remains]
    B --> C[The primary keeps WAL needed by that slot]
    C --> D[WAL accumulates]
    D --> E[Disk fills]
    E --> F([The primary goes down])

The original problem may have been a failed standby.

The actual production outage may be the primary running out of disk because the slot kept retaining WAL.

Replication infrastructure can therefore take down the primary it was supposed to protect.

Operationally, slots need ownership:

Who owns this slot?
Which process consumes it?
Is it expected to be active?
How much WAL can it retain?
What alert fires before disk pressure becomes dangerous?
Can this slot be safely dropped?

Dropping a slot is not a casual action. If the consumer still needs that WAL, dropping the slot may force reinitialization or data loss for that consumer.

SELECT pg_drop_replication_slot('slot_name');

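That command can be correct. It can also be destructive. Postgres refuses to drop a slot that is currently active, but it will not stop you from dropping an inactive slot that a consumer still needs. The hard part is knowing which situation you are in.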
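
The alerting side of slot ownership can be sketched as a small classifier. This is an illustration under stated assumptions: retained_bytes is the raw pg_wal_lsn_diff value from the slot query (not the pg_size_pretty text), the thresholds are placeholders you would derive from your own disk headroom, and SlotRow and slot_risk are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SlotRow:
    slot_name: str
    active: bool
    wal_status: Optional[str]      # reserved, extended, unreserved, lost
    retained_bytes: Optional[int]  # raw pg_wal_lsn_diff(...) value


def slot_risk(slot: SlotRow, warn_bytes: int, crit_bytes: int) -> str:
    """Classify one slot as ok, warning, critical, broken, or unknown."""
    # 'lost' means the WAL the slot needed is already gone.
    if slot.wal_status == "lost":
        return "broken"
    if slot.retained_bytes is None:
        return "unknown"
    if slot.retained_bytes >= crit_bytes:
        return "critical"
    if slot.retained_bytes >= warn_bytes:
        # An inactive slot has no consumer draining it, so escalate.
        return "warning" if slot.active else "critical"
    if slot.wal_status == "unreserved":
        return "warning"
    return "ok"
```

Escalating inactive slots is a deliberate choice: an active slot above the warning line may still be catching up, while an inactive one only grows until someone intervenes.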


WAL volume can break your assumptions

Replication lag is not only about network speed.

A primary can suddenly generate more WAL than usual:

Large UPDATE
Bulk import
Index creation
VACUUM FULL
High-write deploy
Backfill job
Large DELETE
Migration touching many rows

A replica that keeps up during normal traffic may fall behind during a backfill.

A simple way to inspect WAL generation rate is to sample LSN movement over time.

Manual example:

SELECT pg_current_wal_lsn();

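Run it again later, then subtract the first value from the second. The two LSNs below are placeholders for the values you captured: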

SELECT
    pg_size_pretty(
        pg_wal_lsn_diff('0/50000000'::pg_lsn, '0/40000000'::pg_lsn)
    ) AS wal_generated;

In a monitoring system, this becomes a rate:

WAL bytes generated per second

That metric matters because replication capacity is about throughput over time, not just whether the standby is connected.

The standby may be healthy and still unable to keep up with a temporary WAL storm.
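
As a rough sketch of the arithmetic a monitoring system would do, the Python below turns two LSN samples into a bytes-per-second rate and estimates whether a standby can close a given lag. The function names and the idea of comparing a measured replay rate against the generation rate are illustrative assumptions, not a Postgres feature.

```python
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '0/40000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_rate(lsn_before: str, lsn_after: str, seconds: float) -> float:
    """WAL bytes generated per second between two pg_current_wal_lsn() samples."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (lsn_to_int(lsn_after) - lsn_to_int(lsn_before)) / seconds


def catch_up_seconds(lag_bytes: float, replay_rate: float,
                     generation_rate: float) -> Optional[float]:
    """Estimate how long the standby needs to close lag_bytes.

    Returns None when the standby replays no faster than the primary
    generates WAL, because the lag will not shrink in that case.
    """
    surplus = replay_rate - generation_rate
    if surplus <= 0:
        return None
    return lag_bytes / surplus
```

A None from catch_up_seconds is the WAL-storm case in numbers: the standby is healthy, but the lag cannot shrink until generation slows down.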


Hot standby query conflicts

A hot standby can serve read-only queries while it replays WAL.

That sounds perfect until long read queries on the standby conflict with recovery.

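A reporting query might hold a snapshot that conflicts with WAL replay. Postgres then has a choice: delay replay or cancel the query. max_standby_streaming_delay (and max_standby_archive_delay for WAL restored from an archive) sets how long replay waits before canceling; hot_standby_feedback can avoid some snapshot conflicts at the cost of delaying cleanup on the primary.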

You can inspect standby conflicts with:

SELECT
    datname,
    confl_tablespace,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;

The monitoring stats documentation includes pg_stat_database_conflicts for database-wide query cancels due to conflicts with recovery on standby servers. (PostgreSQL)
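
The counters in that view are cumulative, so an alert usually wants the change between two samples rather than the raw totals. A minimal sketch of that delta, assuming each sample is a dict keyed by the column names from the query above (conflict_deltas is a hypothetical helper):

```python
CONFLICT_COLUMNS = (
    "confl_tablespace",
    "confl_lock",
    "confl_snapshot",
    "confl_bufferpin",
    "confl_deadlock",
)


def conflict_deltas(previous: dict, current: dict) -> dict:
    """Return canceled-query counts accumulated since the previous sample."""
    deltas = {}
    for col in CONFLICT_COLUMNS:
        before = previous.get(col, 0)
        now = current.get(col, 0)
        # A smaller value than before suggests the statistics were reset,
        # so count everything seen since the reset.
        deltas[col] = now if now < before else now - before
    return deltas
```

A steady rise in confl_snapshot between samples is typically the signature of long reads colliding with replay.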

This matters because a replica often has two competing jobs:

Stay close to primary for failover
Serve long-running read queries

Those goals can conflict.

If the standby prioritizes replay, analytical queries may be canceled.

If the standby delays replay to satisfy long queries, replication lag may grow.

You can reduce pain by separating roles:

HA standby: optimized for promotion, minimal lag
Reporting replica: accepts staleness, runs heavy reads
Logical/ETL replica: feeds downstream systems

Using one standby for every purpose is cheap architecturally and expensive operationally.


Synchronous replication: stronger durability, different failure mode

Asynchronous replication has a data loss window.

Synchronous replication can reduce that window, but it changes the write path. The primary may wait for standby acknowledgement depending on synchronous_commit and synchronous replication configuration. The PostgreSQL replication settings documentation warns that with synchronous_commit = remote_apply, commits wait for the change to be applied on the standby. (PostgreSQL)

That means synchronous replication can turn standby problems into primary write latency.

The trade-off is not “sync is better” or “async is better.”

The trade-off is:

Async replication:
lower write latency,
possible data loss during failover.

Sync replication:
stronger durability guarantees,
standby health can affect primary commits.

A useful query:

SELECT
    application_name,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;

Pay attention to sync_state.

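Values such as sync, quorum, potential, or async tell you how the standby participates in synchronous replication behavior. sync and quorum standbys are part of synchronous commit; a potential standby would take over if a current synchronous standby drops out; async standbys never hold up commits.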

A synchronous standby is not just a backup target. It is part of the commit path.

If it becomes slow, user-facing writes may slow down too.
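
A rough check for the worst case, where no synchronous standby is available and commits can stall, might look like the sketch below. It assumes rows shaped like the query above (dicts with application_name, state, sync_state) plus the configured synchronous_standby_names string, and that synchronous_commit is at a level that waits for standbys. The function names are hypothetical.

```python
def synchronous_standbys(rows: list) -> list:
    """Names of standbys currently counted toward synchronous commit."""
    return [
        r["application_name"]
        for r in rows
        if r["sync_state"] in ("sync", "quorum") and r["state"] == "streaming"
    ]


def commits_may_block(rows: list, synchronous_standby_names: str) -> bool:
    """True when sync replication is configured but nobody can acknowledge."""
    # An empty setting means synchronous replication is not in use.
    if not synchronous_standby_names.strip():
        return False
    return len(synchronous_standbys(rows)) == 0
```

This only answers "is anyone in the commit path right now". It says nothing about how slow that standby is, which is what the lag columns are for.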


Failover is a process, not a command

Promotion is technically simple:

SELECT pg_promote();

or from a shell on the standby host (pg_ctl needs the data directory, via -D or PGDATA):

pg_ctl promote

PostgreSQL documents these as ways to trigger failover for a log-shipping standby. (PostgreSQL)

But promotion is only one step.

A real failover involves many decisions:

Is the primary truly dead?
Could it still accept writes?
Which standby is the best candidate?
How much WAL has it replayed?
What data loss is acceptable?
How will applications reconnect?
What happens to connection pools?
How is the old primary fenced?
What happens to read replicas following the old primary?
What happens to logical replication slots?
What happens to scheduled jobs and workers?
Who declares the incident phase complete?

The dangerous failover is not the one that fails loudly.

The dangerous failover is the one that half-succeeds.

For example:

Standby promoted successfully.
Some app instances still write to old primary.
A background worker reconnects to the wrong host.
Read replicas still follow the old timeline.
Logical consumers lose their slots.
Monitoring shows green because one node is healthy.
Data diverges.

This is why failover must be rehearsed.

Not discussed. Not documented once. Rehearsed.
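
Two of those decisions, which standby to promote and how much data might be lost, reduce to LSN arithmetic that is easy to rehearse. The sketch below is an illustration under assumptions: it takes each standby's pg_last_wal_receive_lsn() and, if available, the last LSN observed on the primary before it failed. pick_candidate is a hypothetical helper, not a replacement for a cluster manager.

```python
from typing import Optional, Tuple


def lsn_to_int(lsn: str) -> int:
    """Convert a textual pg_lsn such as '0/4FFFFF00' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def pick_candidate(receive_lsns: dict,
                   last_primary_lsn: Optional[str] = None) -> Tuple[str, Optional[int]]:
    """Pick the standby that has received the most WAL.

    Returns (standby_name, estimated_lost_bytes). The loss estimate is
    None when the primary's last position is unknown.
    """
    if not receive_lsns:
        raise ValueError("no standbys to choose from")
    best = max(receive_lsns, key=lambda name: lsn_to_int(receive_lsns[name]))
    lost = None
    if last_primary_lsn is not None:
        lost = max(lsn_to_int(last_primary_lsn) - lsn_to_int(receive_lsns[best]), 0)
    return best, lost
```

The sketch ranks by received rather than replayed WAL on the assumption that a standby finishes replaying what it already holds before promotion completes. It deliberately ignores fencing, client routing, and slots, which are the parts a rehearsal exists to exercise.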


Timeline changes matter

After promotion, the new primary continues on a new timeline.

That matters for replicas, WAL archives, backup chains, and recovery procedures.

PostgreSQL documentation notes that standbys used for high availability should follow timeline changes after failover, with recovery_target_timeline set to latest, which is the default. (PostgreSQL)

This detail sounds small until a replica fails to follow the new primary after failover.

The operational symptom may be confusing:

New primary accepts writes.
Old standby does not catch up.
A recreated replica follows the wrong history.
Archive restore behaves unexpectedly.

During calm periods, timeline mechanics feel like internal implementation detail.

During failover, they become part of the recovery path.
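
One practical place timelines show up is in WAL segment file names: a segment name is 24 hexadecimal characters, and the first eight are the timeline ID. A small sketch for reading it off, useful when checking which history an archive or a replica is actually on (parse_wal_segment_name is an illustrative helper):

```python
import string


def parse_wal_segment_name(name: str) -> dict:
    """Split a 24-character WAL segment file name into its three parts."""
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError(f"not a WAL segment file name: {name!r}")
    return {
        "timeline": int(name[0:8], 16),   # changes after each promotion
        "log": int(name[8:16], 16),
        "segment": int(name[16:24], 16),
    }
```

For example, parse_wal_segment_name("000000020000000A000000FF") reports timeline 2, meaning the segment was written after one promotion.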


Logical replication adds another layer

Logical replication is often used for:

CDC pipelines
Search indexing
Data warehouses
Event streaming
Cross-version migrations
Selective table replication
Zero-downtime migration workflows

Its failure modes are different from physical streaming replication.

A logical slot can fall behind and retain WAL. A subscriber can stop applying changes. Schema drift can break replication. A failover can strand logical slots if they are not handled correctly.

Recent PostgreSQL versions (17 and later) include mechanisms for logical failover slot synchronization. The current documentation describes sync_replication_slots as enabling a physical standby to synchronize logical failover slots from the primary so logical subscribers can resume from the new primary after failover. (PostgreSQL)

The practical lesson is simple:

If downstream systems depend on logical replication,
failover planning must include those systems.

It is not enough that the database comes back.

The data platform around it must continue correctly.


A practical replication health snapshot

This is not a full runbook, but these queries make a useful health snapshot.

Primary-side replication status:

SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;

Replication slots:

SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

Standby freshness:

SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;

Standby conflicts:

SELECT
    datname,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;

WAL receiver on standby:

SELECT
    status,
    receive_start_lsn,
    written_lsn,
    flushed_lsn,
    received_tli,
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn,
    latest_end_time,
    conninfo
FROM pg_stat_wal_receiver;

These queries do not tell you what to do automatically. They help you ask better questions:

Is the standby connected?
Is it catching up or falling behind?
Is lag measured in bytes, time, or user-visible staleness?
Is WAL retention becoming dangerous?
Are standby reads conflicting with recovery?
Is failover currently safe, risky, or impossible?
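
The last question can be turned into a deliberately simple verdict. The sketch below is a toy under stated assumptions: lag_bytes is the raw replay lag for the promotion candidate, loss_budget_bytes is a threshold your team has agreed on, and failover_readiness is a hypothetical name. Real readiness also depends on fencing, routing, and downstream consumers.

```python
from typing import Optional


def failover_readiness(standby_reachable: bool,
                       lag_bytes: Optional[int],
                       loss_budget_bytes: int) -> str:
    """Summarize a snapshot as 'safe', 'risky', or 'impossible'."""
    # No reachable standby means there is nothing to promote.
    if not standby_reachable:
        return "impossible"
    # Unknown lag, or lag beyond the agreed data loss budget, is a risk
    # that a human should accept explicitly.
    if lag_bytes is None or lag_bytes > loss_budget_bytes:
        return "risky"
    return "safe"
```

Writing the budget down as a number is most of the value here. It forces the "what data loss is acceptable" conversation before the incident.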

Common anti-patterns

One replica for every purpose

A standby used for HA, reporting, backups, ad hoc analytics, and read scaling will eventually disappoint one of those use cases.

HA wants low lag. Analytics wants long queries. Backups want predictable throughput. Read scaling wants availability and acceptable staleness.

Those goals are not identical.

No explicit read consistency model

If the application casually sends reads to replicas, product behavior may become inconsistent.

Use replicas deliberately:

Can this read be stale?
Does this user need to read their own write?
Can this endpoint tolerate lag?
Should this workflow force primary reads?

Ignoring slots until disk pressure

Replication slots should be treated like production resources with owners, alerts, and lifecycle management.

An abandoned slot is not harmless metadata.

Treating failover as infrastructure-only

Failover affects database clients, application routing, workers, caches, queues, jobs, observability, and people.

A database promotion that the application does not understand is not recovery.

Never testing promotion

A failover process that has never been practiced is an assumption.

Assumptions do not become reliable because they are written in a document.


What a good incident review should ask

After a replication incident, avoid stopping at:

The replica lagged.

That is only the symptom.

Better questions:

What created the WAL spike?
Was the standby under-provisioned or overloaded by read traffic?
Did a long query on the standby delay recovery?
Did a slot retain more WAL than expected?
Were alerts based on bytes, time, or disk risk?
Did application reads tolerate the actual staleness?
Was failover considered? If not, why?
Would promotion have caused data loss?
Could the old primary have reappeared?
Did downstream logical consumers survive the event?

The goal is to understand the system’s recovery posture, not just the replication metric that turned red.


Why replication incidents are excellent simulation material

Replication incidents are perfect for training because they combine database internals with distributed systems behavior.

A realistic scenario can involve:

WAL generation spike from a migration
Replica lag crossing the read-staleness budget
Replication slot retaining dangerous WAL volume
Read queries conflicting with recovery
Application reads returning stale data
A failover decision under uncertainty
Old primary fencing
Connection string and DNS behavior
Downstream logical replication consumers

The hard part is not running pg_stat_replication.

The hard part is deciding what the evidence means.

Is the replica unhealthy, or is the primary generating too much WAL? Is lag acceptable for read traffic but unacceptable for failover? Is the slot protecting data or threatening disk? Would promotion reduce impact or create split-brain? Should traffic be moved, throttled, failed over, or left alone while the standby catches up?

Those decisions require practice.

Articles can explain the mechanism. Monitoring can expose the symptoms. Simulation builds the judgment needed to act safely.


Conclusion

A standby does not automatically save you.

Postgres replication is powerful, but it is not magic. It improves availability only when the surrounding operational system is mature enough to use it correctly.

You need to know:

how far behind replicas are;
which reads can tolerate staleness;
how much WAL slots retain;
whether standby queries conflict with replay;
what data loss window is acceptable;
how failover is triggered;
how split-brain is prevented;
how applications reconnect;
how downstream consumers continue;
how the cluster returns to a healthy topology after promotion.

Replication is not just a database feature.

It is a reliability contract between Postgres, infrastructure, applications, operators, and product expectations.

The dangerous phrase is:

“We have a replica, so we are safe.”

The better phrase is:

“We know exactly what our replica can and cannot save us from.”