Postgres documentation is excellent.
It explains MVCC, locks, WAL, indexes, replication, vacuum, isolation levels, planner behavior, configuration, backup, recovery, and hundreds of other details. If you operate Postgres seriously, you should read it.
But documentation is not the same as operational readiness.
Documentation teaches mechanisms. Incidents test judgment.
A production incident does not usually announce itself as:
This is a lock queue caused by an ACCESS EXCLUSIVE lock waiting behind an idle transaction.
It looks more like this:
API latency is rising.
The pool is full.
Some queries are slow.
A migration started recently.
CPU is not that high.
Replica lag is increasing.
Users are reporting timeouts.
The team is not sure whether to cancel, wait, kill, rollback, or fail over.
That is the gap.
Postgres reliability is not only about knowing how Postgres works. It is about making safe decisions when Postgres, the application, infrastructure, traffic, and human pressure interact.
Documentation explains components, but incidents combine them
Postgres documentation is organized by topics:
Locks
Transactions
Indexes
VACUUM
WAL
Replication
Configuration
Monitoring
Backup and restore
Query planning
That structure is necessary for learning.
But real incidents rarely respect that structure.
A single production problem can involve:
flowchart TD
A[A new release] --> B[A query plan regression]
B --> C[Longer transaction time]
C --> D[Connection pool saturation]
D --> E[Aggressive retries]
E --> F[Higher database concurrency]
F --> G[Autovacuum falling behind]
G --> H[Replica lag]
H --> I([User-visible timeouts])
Which chapter is that?
It is not one chapter. It is the interaction of many systems.
That is why reading about locks does not automatically prepare someone for a migration incident. Reading about VACUUM does not automatically prepare someone for a bloat-driven latency degradation. Reading about replication does not automatically prepare someone to decide whether failover is safe.
The hard part is synthesis.
Knowing the command is not the same as knowing when to use it
Many Postgres incident actions are simple at the command level.
Cancel a query:
SELECT pg_cancel_backend(12345);
Terminate a backend:
SELECT pg_terminate_backend(12345);
Analyze a table:
ANALYZE invoices;
Create an index concurrently:
CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC);
Promote a standby:
SELECT pg_promote();
Drop a replication slot:
SELECT pg_drop_replication_slot('old_slot');
None of these commands are hard to type.
The difficult questions are operational:
Is this backend safe to terminate?
Will cancellation cause retries that make the incident worse?
Is ANALYZE enough, or is the query slow because of locks?
Can we afford the IO of a concurrent index right now?
Is the standby fresh enough to promote?
Could the old primary still accept writes?
Is this replication slot abandoned, or does a downstream system still need it?
Documentation can tell you what a command does.
It cannot decide whether using it right now reduces risk.
That decision depends on context.
Incidents are full of partial evidence
In a calm environment, you can investigate carefully.
During an incident, the evidence is incomplete and changing.
You may see:
SELECT
state,
wait_event_type,
wait_event,
count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
And get something like:
active | Lock | transactionid | 47
active | IO | DataFileRead | 12
idle | | | 180
This does not automatically tell you what to do.
You need to ask:
Are lock waits the cause or a consequence?
Who is blocking whom?
Did a migration start?
Are retries increasing concurrency?
Are the IO waits caused by the same workload?
Are idle connections normal or part of pool exhaustion?
Then you inspect blockers:
SELECT
blocked.pid AS blocked_pid,
blocked.application_name AS blocked_app,
now() - blocked.query_start AS blocked_duration,
left(blocked.query, 120) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.application_name AS blocking_app,
blocking.state AS blocking_state,
now() - blocking.query_start AS blocking_duration,
left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
Now you find a blocker.
But even then, you still need judgment.
Killing the blocker may fix the incident. It may also roll back important work, trigger retries, break a migration, or create more load.
The database gives evidence. It does not give certainty.
Runbooks help, but they are not enough
Runbooks are valuable.
A good runbook can say:
If lock waits are high:
1. Identify blocked sessions.
2. Identify blockers.
3. Check whether the blocker is a migration, application query, or idle transaction.
4. Check user impact.
5. Prefer cancellation before termination where possible.
6. Escalate before terminating unknown critical sessions.
This is useful.
But production incidents often violate the clean path.
For example:
The blocker is a migration.
The migration is important.
The migration has already partially completed.
Application retries are increasing pressure.
The team cannot immediately tell whether canceling is safe.
The service owner is offline.
A background job is also holding connections.
Replica lag is rising.
The incident commander wants a decision now.
The runbook can guide thinking.
It cannot replace thinking.
A weak reliability culture treats runbooks as scripts. A strong reliability culture treats runbooks as decision support.
The dangerous middle: when several actions are plausible
Many Postgres incidents are hard because multiple actions seem reasonable.
Imagine this situation:
API latency is high.
Connection pool wait time is rising.
Postgres has many active sessions.
Top queries show one expensive query shape.
Replica lag is increasing.
A backfill started ten minutes ago.
Possible actions:
Pause the backfill.
Reduce application concurrency.
Cancel slow queries.
Increase pool size.
Add an index.
Move reads to a replica.
Disable retries.
Scale the database.
Wait and observe.
Several of these may be valid in different scenarios.
The wrong action can amplify the incident.
Increasing the pool may push more work into Postgres. Moving reads to a lagging replica may serve stale data. Adding an index may create more IO and WAL during an already overloaded period. Canceling queries may trigger retries. Waiting may be correct if the system is recovering, or disastrous if pressure is still growing.
The skill is not knowing a list of actions.
The skill is understanding the likely consequence of each action under current conditions.
Documentation teaches normal behavior; incidents expose edge behavior
Most engineers learn Postgres features in their normal form.
A transaction groups work:
BEGIN;
UPDATE accounts
SET balance = balance - 100
WHERE id = 1;
UPDATE accounts
SET balance = balance + 100
WHERE id = 2;
COMMIT;
An index speeds up access:
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
A replica provides another copy of data:
primary → standby
Autovacuum cleans old row versions.
WAL protects durability.
All of this is true.
But incidents live in the edge behavior.
A transaction becomes dangerous when it stays open for 45 minutes.
An index build becomes dangerous when it competes with peak traffic.
A replica becomes dangerous when the application assumes it is always fresh.
Autovacuum becomes dangerous when it cannot keep up with write churn.
WAL becomes dangerous when a backfill generates more than archiving and replication can consume.
The feature is not the problem. The production interaction is the problem.
Example: documentation says CREATE INDEX CONCURRENTLY, production asks “when?”
A team finds a slow query:
SELECT *
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50;
The likely index is obvious:
CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC);
Documentation can explain why CONCURRENTLY avoids blocking writes.
But production readiness requires more questions:
How large is the table?
How many indexes already exist?
How much WAL will this generate?
Will replica lag violate read expectations?
Is storage already under pressure?
Is autovacuum currently behind?
Are we in peak traffic?
Can the migration framework run this outside a transaction?
What happens if the index build fails?
Who will clean up an invalid index?
The SQL is technically correct.
That does not make the timing safe.
Reliability depends on knowing when the right command is wrong for the current system state.
Example: documentation says failover is possible, production asks “is it safe?”
Promotion can be simple:
SELECT pg_promote();
But the operational question is not whether you can promote.
It is whether promotion improves the situation.
Before failover, you need to know:
Is the primary truly unavailable?
Can the old primary still accept writes?
How far behind is the standby?
What data loss is acceptable?
How will applications reconnect?
What happens to connection pools?
Will background workers follow the new primary?
What happens to read replicas?
What happens to logical replication slots?
Can the old primary be fenced?
A standby that exists but is 20 minutes behind may not be a safe target.
A standby that is current but cannot accept the full write workload may fail shortly after promotion.
A failover that leaves the old primary alive can create data divergence.
Documentation explains promotion.
Practice teaches hesitation, verification, and controlled execution.
Example: documentation says vacuum cleans dead tuples, production asks “why is it behind?”
A table shows many dead tuples:
SELECT
relname,
n_live_tup,
n_dead_tup,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
A beginner may think:
Run VACUUM.
A more experienced operator asks:
Why did dead tuples accumulate?
Is autovacuum blocked by a long transaction?
Is the table too hot for default thresholds?
Are there too many indexes?
Is a backfill creating churn?
Is the table design queue-like?
Are we seeing bloat or just temporary dead tuple pressure?
Will manual vacuum compete with user traffic?
Then they check old transactions:
SELECT
pid,
application_name,
state,
now() - xact_start AS transaction_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
The command VACUUM is easy.
Understanding why cleanup failed is the reliability work.
Operational skill means recognizing patterns
In real incidents, the exact details vary.
But patterns repeat.
A lock incident has a recognizable shape:
Latency rises.
Connections increase.
Many sessions wait on Lock.
A migration or transaction is in the blocking chain.
The pool fills behind blocked work.
A connection storm has a recognizable shape:
App instances increase.
Total connections rise sharply.
Many sessions are active or waiting.
Pool timeouts appear.
Retries multiply traffic.
Postgres slows under concurrency.
A WAL pressure incident has a recognizable shape:
A bulk operation starts.
WAL generation spikes.
Checkpoints become more frequent.
Replica lag grows.
Archiving may fall behind.
Write latency worsens.
A vacuum starvation incident has a recognizable shape:
Dead tuples trend upward.
Old transactions exist.
Autovacuum runs but does not catch up.
Table/index size grows.
Query performance degrades gradually.
Pattern recognition does not come only from reading.
It comes from seeing scenarios, making decisions, and observing consequences.
The hardest part is prioritization
During a Postgres incident, there are usually too many possible investigations.
You can inspect:
pg_stat_activity
pg_locks
pg_stat_statements
pg_stat_replication
pg_replication_slots
pg_stat_user_tables
pg_stat_progress_vacuum
pg_stat_progress_create_index
pg_stat_wal
pg_stat_archiver
application pool metrics
request traces
deployment history
OS IO metrics
All of them may be relevant.
But you cannot investigate everything at once.
Operational readiness means knowing what to check first based on symptoms.
For example:
If users wait for DB connections:
start at application pool metrics and pg_stat_activity.
If sessions wait on Lock:
identify blockers and recent migrations.
If writes are slow and replica lag grows:
inspect WAL generation, checkpoints, storage, and recent bulk operations.
If queries became slow gradually:
inspect query plans, table statistics, dead tuples, bloat signals, and data growth.
If reads from replicas are inconsistent:
inspect replay delay and read-routing assumptions.
The skill is navigation.
Documentation gives the map. Incidents require route selection.
Good operators know what not to touch
A major difference between junior and senior incident response is restraint.
During a database incident, doing something feels better than doing nothing.
But some actions are dangerous without enough evidence:
Increasing max_connections
Increasing pool size
Killing random backends
Dropping replication slots
Running VACUUM FULL
Creating emergency indexes
Failing over prematurely
Restarting Postgres
Disabling autovacuum
Changing durability settings
Deleting files from pg_wal
Some of these actions can be correct in specific situations.
The danger is using them as reflexes.
Reliability is not only the ability to act.
It is the ability to delay unsafe action long enough to understand the system, while still acting quickly enough to reduce impact.
That balance cannot be learned from syntax alone.
Documentation does not teach team coordination
Postgres incidents are rarely solved by one person silently running SQL.
They involve communication:
Who is incident commander?
Who owns the application?
Who owns the database?
Who can pause workers?
Who can rollback deploys?
Who can approve failover?
Who communicates customer impact?
Who records the timeline?
Who verifies recovery?
Technical evidence must be translated into operational decisions.
For example:
“We have 60 sessions waiting on a migration lock.
The blocker is an app transaction idle for 18 minutes.
Canceling the migration will stop new queue growth.
Terminating the idle transaction appears safe, but it belongs to the billing service.
We need billing owner approval or incident commander decision.”
That is not just database knowledge.
That is incident communication.
A technically correct action performed without coordination can still create organizational failure.
Documentation does not create muscle memory
In a quiet learning environment, an engineer can search, read, think, and test.
In an incident, the environment is different:
Users are affected.
Dashboards are noisy.
Logs are incomplete.
People are asking for updates.
The system is changing while you investigate.
Some actions are irreversible.
Time pressure is real.
Under pressure, people fall back to practiced behavior.
If the only practiced behavior is reading documentation, the team may move too slowly or choose familiar but unsafe actions.
Simulation creates muscle memory:
Notice the symptom.
Form a hypothesis.
Choose the next inspection.
Interpret evidence.
Communicate uncertainty.
Take a bounded action.
Observe the result.
Revise the hypothesis.
That loop is the core of operational reliability.
A useful maturity model
Postgres reliability maturity can be described in four levels.
Level 1: Vocabulary
The team knows terms:
locks;
VACUUM;
WAL;
replica lag;
connection pool;
EXPLAIN;
checkpoint;
transaction;
index.
This is necessary, but not enough.
Level 2: Mechanism understanding
The team understands how things work:
why MVCC creates dead tuples;
why locks protect consistency;
why WAL enables recovery;
why replicas can lag;
why indexes help some queries and hurt writes;
why pool size controls concurrency.
This is where documentation is very strong.
Level 3: Diagnostic reasoning
The team can connect symptoms to mechanisms:
pool saturation may be caused by slow queries;
slow queries may be caused by locks;
locks may be caused by migrations;
replica lag may be caused by WAL spikes;
bad plans may be caused by stale statistics;
vacuum lag may be caused by old transactions.
This requires experience and practice.
Level 4: Operational judgment
The team can act safely under pressure:
cancel the right thing;
pause the right workload;
avoid unsafe failover;
reduce concurrency;
communicate risk;
choose rollback vs roll-forward;
protect user traffic;
recover without creating a second incident.
This is where simulation matters most.
What articles can teach well
Articles are valuable.
They can explain:
mental models;
common failure modes;
diagnostic queries;
dangerous anti-patterns;
technical vocabulary;
incident patterns;
review questions;
safe design principles.
An article can show why this query matters:
SELECT
pid,
application_name,
state,
wait_event_type,
wait_event,
now() - query_start AS query_age,
now() - xact_start AS transaction_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC;
It can explain that wait_event_type = 'Lock' points toward contention.
It can explain that idle in transaction is dangerous.
It can explain that high connection count is not the same as useful throughput.
But an article cannot reproduce the stress of deciding whether to terminate a real backend while customer requests are failing.
That is the boundary.
What simulations teach better
Simulations are useful because they train behavior, not only knowledge.
A good Postgres incident simulation can force the team to experience:
unclear symptoms;
conflicting metrics;
misleading first hypotheses;
actions with side effects;
pressure from user impact;
coordination between roles;
the cost of waiting too long;
the cost of acting too early;
post-incident analysis.
For example, a simulation can show what happens when someone increases pool size during database saturation.
It can show how retries amplify load.
It can show how canceling the wrong migration changes the incident.
It can show how a replica exists but is not safe for failover.
It can show how a long transaction prevents vacuum cleanup and creates delayed consequences.
That feedback loop is difficult to get from documentation.
The goal is not to replace documentation
This is not an argument against documentation.
Strong Postgres reliability requires documentation, source knowledge, and practical experience.
The right relationship is:
Documentation explains mechanisms.
Runbooks organize known responses.
Monitoring provides evidence.
Simulations build judgment.
Production experience validates assumptions.
Post-incident reviews improve the system.
Each layer has a role.
The mistake is expecting one layer to do all the work.
Documentation alone produces theoretical understanding. Monitoring alone produces noise. Runbooks alone produce mechanical responses. Production alone is too expensive as a training environment.
Simulation connects them.
A practical way to use documentation better
Documentation becomes more valuable when read through incident questions.
Instead of reading the lock chapter as theory, ask:
Which lock modes can block normal traffic?
How would I recognize a lock queue?
Which DDL operations need strong locks?
What would make cancellation unsafe?
Instead of reading about WAL as internals, ask:
What happens if WAL generation spikes?
How does WAL affect replication lag?
How can archiving failure fill disk?
How would checkpoints appear in latency graphs?
Instead of reading about autovacuum, ask:
What prevents cleanup?
How do old transactions affect vacuum?
Which tables need different settings?
How would vacuum failure show up as query latency?
Instead of reading about replication, ask:
What does this replica protect us from?
How stale can reads be?
What happens during promotion?
How do we prevent split-brain?
What downstream systems depend on replication state?
This turns documentation from reference material into operational training material.
A good Postgres reliability training loop
A strong learning process looks like this:
1. Study the mechanism.
2. Observe the metric in a healthy system.
3. Trigger or simulate a controlled failure.
4. Diagnose using real tools.
5. Choose a mitigation.
6. Observe side effects.
7. Review the decision.
8. Update dashboards, runbooks, code, or process.
For example, with locks:
Study lock modes.
Observe normal pg_stat_activity.
Simulate a migration waiting behind a transaction.
Identify blockers.
Try canceling migration vs terminating blocker.
Observe pool behavior.
Discuss which action was safest.
Update migration policy.
With replication:
Study WAL streaming.
Observe normal replay lag.
Simulate a WAL spike.
Watch replica delay.
Route stale-sensitive reads.
Discuss failover safety.
Update read-routing rules.
This is how knowledge becomes readiness.
The business reason this matters
Postgres reliability is not only a technical concern.
Database incidents affect product behavior:
users cannot log in;
payments fail;
orders timeout;
dashboards show stale data;
workers fall behind;
notifications duplicate;
customers lose trust;
engineers lose sleep;
teams become afraid of migrations.
The cost is not just downtime.
It is slower engineering velocity.
When teams fear the database, they avoid necessary changes. They delay migrations, postpone cleanup, over-index defensively, under-invest in schema evolution, and treat every production change as risky.
Reliability training reduces that fear.
Not by pretending incidents will not happen, but by making the team more competent when they do.
Why this matters specifically for Postgres
Postgres is powerful because it gives teams many capabilities:
transactions;
rich indexing;
constraints;
JSON;
extensions;
replication;
partitioning;
concurrent index builds;
foreign keys;
materialized views;
stored procedures;
logical decoding;
advanced SQL.
Those capabilities allow teams to build serious systems.
They also create operational complexity.
A database that supports strong correctness, flexible queries, and rich workloads requires disciplined operation.
Postgres will often do exactly what you asked.
The reliability question is whether you understood what you asked it to do under production conditions.
Common anti-patterns in learning Postgres reliability
Learning only through local experiments
Local databases hide production realities: data volume, concurrency, locks, replicas, WAL volume, autovacuum pressure, and real traffic.
Memorizing diagnostic queries without hypotheses
A query is useful only when you know what question it answers.
Treating every incident as a missing index
Indexes matter, but not every latency problem is an indexing problem.
Treating failover as a button
Promotion is easy. Safe recovery is not.
Treating runbooks as scripts
Runbooks guide decisions. They do not remove context.
Treating monitoring as truth
Metrics are evidence. They require interpretation.
Waiting for production to teach the team
Production is the most expensive classroom.
What a simulation-ready team looks like
A team ready for Postgres incidents can do more than quote documentation.
It can say:
We know what normal looks like.
We know which symptoms are user-impacting.
We know where database pressure appears first.
We know how to inspect active sessions.
We know how to identify blockers.
We know which workloads can be paused.
We know who owns migrations.
We know how replicas are used.
We know our acceptable data loss window.
We know which actions are dangerous.
We have practiced decisions before production forced them.
That is operational maturity.
It does not mean the team never has incidents.
It means incidents are shorter, less chaotic, and less likely to produce secondary failures.
Conclusion
Postgres documentation is necessary.
But it is not sufficient.
It teaches what locks are, how WAL works, why vacuum exists, how replication functions, what indexes do, and how configuration parameters behave.
Production incidents test something different:
Can the team recognize the pattern?
Can it connect database symptoms to application behavior?
Can it choose the safest next action?
Can it avoid making the incident worse?
Can it communicate uncertainty?
Can it recover the system without creating a second failure?
Can it learn afterward?
That is database reliability.
The dangerous phrase is:
We read the docs, so we know Postgres.
The better phrase is:
We understand the mechanisms, and we have practiced applying them under incident conditions.
Documentation builds knowledge. Simulation builds judgment. Reliable Postgres operations need both.