Streaming Replication and Failover
Build a real streaming-replication cluster in Docker: two Postgres nodes, live WAL shipping, measurable lag, failover by promoting the standby, and logical replication for selective table sync. The goal is to understand the mechanics before any orchestrator abstracts them away.
Most engineers operate replication through an orchestrator and never look at what is happening underneath. This course strips the abstraction away. We build a two-node cluster from scratch (a primary and a replica in Docker, connected by streaming WAL) and measure what each node reports at every step. We watch pg_stat_replication report the replica's lag in real time, insert on the primary and immediately read on the standby, then stop the primary and promote the replica into a writable node. We follow that with logical replication: a publication on one table, a subscription on a separate cluster, and rows flowing across in seconds. Everything runs on a real Postgres test environment with real data; nothing is simulated.
What you'll build
- Configure wal_level, max_wal_senders, and a replication slot from scratch
- Build a standby with pg_basebackup -R and verify it enters streaming state
- Read sent_lsn, write_lsn, flush_lsn, replay_lsn and explain what each measures
- Observe and measure replication lag under write load
- Promote a standby with pg_promote() and understand the split-brain risk
- Create a publication and subscription for logical replication of a single table
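As a small preview of the lag measurements listed above, here is a minimal sketch in Python of the arithmetic behind an LSN comparison. An LSN such as `0/5000148` is a 64-bit WAL byte position written as two hexadecimal halves, so subtracting two LSNs gives a distance in bytes, which is what the server-side function pg_wal_lsn_diff() returns. The helper names and the sample values below are illustrative assumptions, not output from the course cluster.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN like '0/5000148' to an absolute WAL byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the upper 32 bits, the part after is the lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Bytes of WAL between the primary's position and a position reported for the replica."""
    return lsn_to_int(primary_lsn) - lsn_to_int(replica_lsn)


# Hypothetical snapshot of one pg_stat_replication row (values invented for illustration).
current_wal_lsn = "0/5000148"      # what pg_current_wal_lsn() might return on the primary
row = {
    "sent_lsn": "0/5000148",       # last WAL position sent to the standby
    "write_lsn": "0/5000148",      # last position the standby wrote to disk
    "flush_lsn": "0/5000060",      # last position the standby flushed to disk
    "replay_lsn": "0/4FFFF00",     # last position the standby replayed into the database
}

for column, value in row.items():
    print(f"{column:>10}: {lag_bytes(current_wal_lsn, value)} bytes behind")
```

With these invented values the replay position trails the primary by a few hundred bytes while the sent position is fully caught up, which is the kind of gap the course measures under write load.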
Contents
- Two nodes, one goal
- Two services in one compose file
- WAL settings that make replication possible
- Who can connect — and as what role
- Mount the config into the container
- Start both containers
- Connect to the primary
- A dedicated role for replication
- Two views that show the replication state
- Both views are empty — that is correct
- A slot to protect the replica's WAL position
- The slot appears with active = false
- Two tables, two purposes
- Fifty thousand orders, deterministic
- Create the tables on the primary
- Load the orders
- Clone the primary with pg_basebackup
- Seed the replica from the primary
- Read from the replica
- Confirming standby state from the inside
- Watching the replica from the primary
- Insert on the primary, read on the replica
- The new row is on the replica
- The replica refuses writes
- A query to measure lag precisely
- Generating a burst of writes
- Catching the replica mid-stream
- Lag drains when writes stop
- Promoting the replica — and why order matters
- Executing the failover
- The standby is now writable
- Logical replication — a different model
- A separate cluster for the logical subscriber
- Create the publication on the primary
- Schema must exist on the subscriber
- Connect the subscriber to the publisher
- Data flows from publisher to subscriber
- Three rows on the subscriber
- Inspecting the subscription internals
- A recovery checklist as live queries
- Baseline before the accidental delete
- Logical dump before the drill
- The accidental delete
- Restoring from the logical dump
- Verifying the restore
- Cleaning up the test environment