Postgres incidents rarely start with "Postgres broke"

2026-05-29T00:00:00+00:00
When a production system starts degrading, Postgres often becomes the first suspect.
The application is slow. Requests are timing out. Background jobs are piling up. Dashboards are turning red. The database CPU is higher than usual. Someone opens the incident channel and says:
“Looks like Postgres is having problems.”
Sometimes that is true. But very often, Postgres is not the original cause. It is the place where multiple system problems finally become visible.
A Postgres incident usually starts somewhere else: a release, a schema migration, a query pattern change, a sudden traffic spike, a connection pool misconfiguration, a long-running transaction, a reporting job, a replica falling behind, or an application retry storm.
The database becomes the pressure point.
That is why Postgres reliability is not only about knowing SQL or database internals. It is about understanding how Postgres behaves inside a living production system.
The misleading phrase: “the database is slow”
“The database is slow” sounds like a diagnosis, but it is usually only a symptom.
A slow query can be caused by many different mechanisms:
an expensive migration running at the wrong time.
The external symptom may look the same:
HTTP latency increased
API requests timing out
Worker queue length growing
Database connections rising
Postgres CPU and IO elevated

But the correct response depends entirely on the mechanism.
This is where many teams get into trouble. They treat the symptom as the cause.

A typical incident chain
A Postgres incident often looks like this:
flowchart TD
    A[Small application change] --> B[New or more frequent query pattern]
    B --> C[Higher database load]
    C --> D[Longer query execution time]
    D --> E[Connections held for longer]
    E --> F[Connection pool saturation]
    F --> G[Application timeouts]
    G --> H[Retries]
    H --> I[Even more database load]
    I --> J([Production incident])

From the outside, this may look like “Postgres became slow.”
But Postgres did not randomly become slow. The system changed around it.
That distinction matters because the wrong mitigation can make the incident worse.

Example 1: a harmless release that doubles database pressure
Imagine a backend service has an endpoint like this:
SELECT id, email, status
FROM users
WHERE id = $1;

It is fast. It uses the primary key. No problem.
Then a release adds a feature flag check based on recent user activity:
SELECT id
FROM user_events
WHERE user_id = $1
  AND event_type = 'purchase'
ORDER BY created_at DESC
LIMIT 1;

On staging, this query is fast. In production, user_events has hundreds of millions of rows.
If the index is not aligned with the query, Postgres may need to scan far more data than expected.
A better supporting index might look like:
CREATE INDEX CONCURRENTLY idx_user_events_user_type_created
ON user_events (user_id, event_type, created_at DESC);

But the incident is not just “missing index.”
The real incident chain may be:
New release adds one extra query per request
        ↓
Query is cheap for some users, expensive for others
        ↓
Average DB time per request increases
        ↓
Application holds connections longer
        ↓
Pool reaches max size
        ↓
Requests queue inside the app
        ↓
Timeouts trigger retries
        ↓
Postgres receives even more work

A useful first question is not:

“Which query is slow?”

A better first question is:

“What changed in the system right before the database started showing pressure?”

Example 2: connection pool exhaustion is not always a pool problem
When an application starts timing out while waiting for a database connection, the instinctive response is often:

“Increase the pool size.”

That can help in some cases. But it can also make the incident worse.
A connection pool is not just a performance tool. It is a pressure regulator.
If Postgres is already overloaded, increasing the number of concurrent database sessions may increase CPU contention, memory pressure, lock contention, and IO saturation.
A useful mental model:
Small pool:
Application queues before Postgres

Huge pool:
Postgres receives too much concurrent work directly

You can inspect active database sessions with:
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*)
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY count(*) DESC;

This tells you whether sessions are actively running, waiting on locks, waiting on IO, idle in transaction, or simply connected.
But the query alone is not the solution. The important part is interpretation.
For example, a large number of sessions in the idle in transaction state is a major warning sign. This query lists them, oldest transaction first:
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    state,
    query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY transaction_age DESC;

An idle in transaction session may keep old row versions alive, block vacuum progress, hold locks, or distort the behavior of other parts of the system.
In an incident, this may appear as “Postgres is slow,” while the actual trigger is an application code path that opened a transaction and failed to close it correctly.

Example 3: a schema migration that blocks production traffic
Schema migrations are one of the most common sources of Postgres incidents.
A migration can be syntactically correct and still operationally dangerous.
For example:
ALTER TABLE orders ADD COLUMN processed_at timestamptz;
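
Adding a nullable column without a default is normally a fast, metadata-only change, and since Postgres 11 the same holds for a column with a non-volatile default. But not every ALTER TABLE is harmless, and even operations that are usually fast still need locks: this one takes an ACCESS EXCLUSIVE lock, so it can queue behind a long-running transaction and block every query on the table that arrives after it.
A more dangerous example: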
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);

Or:
CREATE INDEX idx_orders_created_at
ON orders (created_at);

Creating a normal index blocks writes to the table for the duration of the build. In production, you usually want:
CREATE INDEX CONCURRENTLY idx_orders_created_at
ON orders (created_at);

But even CONCURRENTLY is not magic. It takes longer, consumes CPU and IO, cannot run inside a transaction block, and can fail if there are conflicting operations.
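One concrete failure mode: if a concurrent build fails or is cancelled, Postgres leaves the index behind marked as invalid. A minimal sketch for spotting such leftovers, using only the system catalogs. The index name in the final statement is the one from the example above, used here as a stand-in:

```sql
-- List indexes that are currently marked invalid, for example after a
-- failed CREATE INDEX CONCURRENTLY. An index that is still being built
-- concurrently also shows up here until the build finishes.
SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;

-- An invalid index is ignored by queries but still maintained on writes.
-- Assumed cleanup: drop it without blocking writes, then retry the build.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_created_at;
```

Whether to retry immediately or wait for a quieter period depends on why the first build failed.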
During a suspected lock-related incident, this type of query can help identify blockers:
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    now() - blocking.query_start AS blocking_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked
    ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
   AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
   AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
   AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
   AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
   AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
   AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
   AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
   AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
   AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;
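On Postgres 9.6 and later, roughly the same question can be asked more compactly with the built-in pg_blocking_pids() function. This is a sketch, not a replacement for the full join above, which exposes more detail about the lock itself:

```sql
-- For every session that is currently blocked, show which pids block it.
SELECT
    a.pid AS blocked_pid,
    pg_blocking_pids(a.pid) AS blocking_pids,
    a.wait_event_type,
    a.wait_event,
    now() - a.query_start AS waiting_for,
    left(a.query, 120) AS blocked_query
FROM pg_stat_activity a
WHERE cardinality(pg_blocking_pids(a.pid)) > 0
ORDER BY waiting_for DESC;
```

The function briefly needs exclusive access to the lock manager's shared state, so it is better suited to ad hoc investigation than to high-frequency polling.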
This is useful, but it is still only one piece of the incident.
The deeper questions are:

Why was this migration run during this traffic pattern?
Was there a rollback plan?
Were lock timeouts configured?
Were long transactions checked before the migration?
Did the application have retry behavior that amplified the issue?

A mature team does not only ask “which process blocked us?”
It asks “why was the system vulnerable to this class of failure?”
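Two of those questions, lock timeouts and long transactions, can be turned into a pre-flight routine. The following is a sketch under assumptions rather than a universal recipe: the 5 minute and 2 second thresholds are arbitrary examples, and the constraint is the foreign key from the earlier example, split into two steps with NOT VALID.

```sql
-- Pre-flight: look for transactions old enough to make the migration queue.
SELECT pid, application_name, state, now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE xact_start < now() - interval '5 minutes'
ORDER BY xact_start;

-- Fail fast instead of queueing behind other sessions and blocking traffic.
SET lock_timeout = '2s';

-- Step 1: add the constraint without checking existing rows.
-- The lock is still required, but it is held only briefly.
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;

-- Step 2: check existing rows under a weaker lock that does not block
-- normal reads and writes on orders.
ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;
```

If the lock timeout fires, the statement fails and can be retried later, which is usually preferable to a migration silently waiting while application queries pile up behind it.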

Example 4: a slow query is not always a query problem
A query can become slow without changing the SQL text.
For example:
SELECT *
FROM invoices
WHERE account_id = $1
  AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;

This may work well when each account has a small number of invoices.
But as the product grows, one enterprise account may accumulate millions of rows. The query becomes highly sensitive to data distribution.
You can inspect the execution plan:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE account_id = 123
  AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;

The plan might reveal:

sequential scans;
high buffer reads;
unexpected nested loops;
bad row estimates;
sort operations spilling to disk;
index scans that are technically used but still inefficient.

A possible supporting index could be:
CREATE INDEX CONCURRENTLY idx_invoices_account_status_due
ON invoices (account_id, status, due_date);

But again, the point is not “add this index.”
The real reliability lesson is that production data shape changes over time. A query that was safe six months ago can become dangerous after customer growth, product changes, or new usage patterns.
Reliability is not only about fixing bad queries. It is about detecting when previously good assumptions have expired.
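One inexpensive way to check whether a distribution assumption still holds is to read the planner's own statistics instead of counting rows on a large table. A minimal sketch, assuming the invoices table from the example lives in the public schema:

```sql
-- Planner statistics for invoices.account_id, as collected by the last ANALYZE.
-- most_common_freqs holds the estimated fraction of rows for each value
-- listed in most_common_vals.
SELECT
    n_distinct,
    most_common_vals,
    most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'   -- assumption: adjust to the real schema
  AND tablename = 'invoices'
  AND attname = 'account_id';
```

If a single account_id carries a large fraction of the table, queries filtered on that account behave very differently from the typical case. These numbers are sampled estimates and only as fresh as the last ANALYZE.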

The difference between trigger, mechanism, and amplifier
A useful way to reason about Postgres incidents is to separate three things.
1. Trigger
The event that started the incident.
Examples:
New release
Schema migration
Traffic spike
Batch job
Analytics query
Configuration change
Failover
New customer onboarded

2. Mechanism
The technical process through which the system degraded.
Examples:
Lock contention
Connection saturation
Query plan regression
Disk IO saturation
WAL pressure
Autovacuum lag
Replication lag
Memory pressure
Transaction buildup

3. Amplifier
The thing that made the incident worse.
Examples:
Aggressive retries
Oversized connection pools
No statement timeout
No lock timeout
Long-running transactions
Missing dashboards
No migration safety process
Manual panic actions

A poor incident review says:

“The database was slow because of a bad query.”

A better incident review says:

“The trigger was a release that introduced a new query pattern. The mechanism was inefficient index access under production data distribution. The amplifier was application retries combined with a pool size that allowed too much concurrent pressure on Postgres.”

That second version teaches the team something reusable.

Useful diagnostic queries are not the same as an incident response skill
It is good to know queries like these.
Current activity:
SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_age DESC;

Long transactions:
SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS xact_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;
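
Top queries with pg_stat_statements (the extension must be installed; the timing columns are named total_exec_time and mean_exec_time since Postgres 13, total_time and mean_time before that):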
SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Replication lag:
SELECT
    application_name,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

Approximate table bloat and dead tuple pressure:
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

Index usage (least-scanned indexes first):
SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;

These queries are useful. But they are not enough.
During a real incident, the challenge is not just running SQL. The challenge is knowing which hypothesis you are testing.
For example:
Are we overloaded because queries are slower?
Are queries slower because of locks?
Are locks caused by a migration?
Is the pool full because Postgres is slow, or is Postgres slow because the pool allows too much concurrency?
Is replication lag a cause, a symptom, or a separate issue?
Are retries protecting the system or attacking it?

This is where operational skill matters.

Why Postgres incidents often look similar
Many different failure modes produce similar symptoms.
Symptom                 | Possible causes
High latency            | slow queries, locks, IO saturation, pool wait, CPU pressure
Many active connections | slow DB, oversized pool, retry storm, long transactions
High CPU                | query plan regression, too much concurrency, missing index
High IO                 | sequential scans, checkpoints, vacuum, index creation, bad plans
Timeouts                | pool exhaustion, locks, network, overloaded DB, application retries
Replica lag             | WAL volume, slow replica IO, long queries on standby, replication slot issues

This is why “dashboard watching” is not enough.
Metrics do not tell you what to do by themselves. They only become useful when connected to a hypothesis.
A metric says:
Connections are high.

An engineer has to ask:
Are connections high because requests increased?
Because queries are slower?
Because transactions are stuck?
Because the pool was reconfigured?
Because the app is retrying?
Because background jobs started?

The same metric can point to different actions depending on context.

Dangerous reactions during Postgres incidents
Some actions feel helpful but can be dangerous when done without understanding the mechanism.
Increasing the connection pool
May help if the pool is too small and Postgres has spare capacity.
May hurt if Postgres is already saturated.
Killing random queries
May help if a clearly harmful query is blocking critical work.
May hurt if you kill the wrong backend, interrupt a migration, or cause application-level retries.
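When stopping a backend really is the right call, it helps to know that Postgres offers two levels of intervention. A short sketch, where 12345 is a placeholder pid that would come from pg_stat_activity:

```sql
-- Cancel only the statement currently running in that backend.
-- The session stays connected and the pool keeps its connection.
SELECT pg_cancel_backend(12345);

-- Terminate the whole backend. The connection is closed and any open
-- transaction is rolled back, so the client sees a dropped connection.
SELECT pg_terminate_backend(12345);
```

Starting with the cancel is usually the less disruptive choice. A session that is idle in transaction has no running statement to cancel, so there termination is the only one of the two that releases its locks.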
Restarting the application
May help if the app is stuck.
May hurt if every instance reconnects at once and creates a connection storm.
Failing over to a replica
May help if the primary is unhealthy.
May hurt if the issue is caused by application behavior, bad queries, or a migration that will continue after failover.
Running emergency indexes
May help if the cause is well understood.
May hurt if index creation adds IO pressure during an already overloaded period.
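If an emergency index build is already running, its progress can at least be observed rather than guessed. A minimal sketch, assuming Postgres 12 or later, where the pg_stat_progress_create_index view exists:

```sql
-- One row per backend that is currently running CREATE INDEX or REINDEX.
SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_done,
    p.blocks_total,
    p.tuples_done,
    p.tuples_total
FROM pg_stat_progress_create_index p;
```

The phase column shows whether the build is scanning the table, sorting, or waiting for other transactions, which helps decide whether to let it finish or cancel it.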
The operational question is not:

“What can we do?”

It is:

“Which action reduces pressure without increasing uncertainty?”

Reliability requires practicing the messy middle
Most educational material explains clean concepts:

how MVCC works;
how indexes work;
how locks work;
how autovacuum works;
how replication works;
how query planning works.

That knowledge is necessary.
But incidents do not arrive as clean textbook chapters.
They arrive as noisy combinations:
A migration is waiting on a lock.
A long transaction is preventing cleanup.
The application pool is saturated.
Retries are increasing traffic.
A reporting query is consuming IO.
Replication lag is rising.
The team is debating rollback.
Customers are already affected.

The difficult part is the messy middle: forming hypotheses, rejecting bad assumptions, choosing safe mitigations, and communicating clearly while the system is degraded.
This cannot be learned fully from documentation.
It has to be practiced.

What incident simulations teach that articles cannot
An article can explain the concepts.
A checklist can remind you what to inspect.
A dashboard can show symptoms.
But a simulation trains the actual operational behavior:

noticing weak signals early;
distinguishing trigger from mechanism;
avoiding attractive but dangerous actions;
reading database symptoms in application context;
understanding how one mitigation changes system pressure;
coordinating investigation under time pressure;
learning from mistakes without damaging production.

In a good Postgres incident simulation, the goal is not to memorize one magic query.
The goal is to experience the chain:
flowchart LR
    S[Symptom] --> H[Hypothesis] --> I[Inspection] --> D[Decision] --> C[Consequence]

That loop is the core of database reliability work.

Conclusion
Postgres incidents rarely begin with “Postgres broke.”
More often, they begin with a normal engineering action:
a release
a migration
a new query
a batch job
a traffic spike
a retry policy
a pool configuration change

Postgres becomes the place where the consequences accumulate.
That is why reliable Postgres operations require more than database knowledge. They require system thinking.
You need to understand queries, locks, transactions, WAL, vacuum, replication, and indexes. But you also need to understand application behavior, deployment practices, connection pools, retries, traffic patterns, and human decision-making during incidents.
Documentation teaches mechanisms.
Monitoring shows symptoms.
Simulations build operational judgment.
And in production, judgment is often the difference between a short degradation and a serious incident.


Connection pools and Postgres: why more connections do not mean more performance
2026-05-20T00:00:00+00:00
When an application starts timing out while talking to Postgres, one of the most tempting reactions is:
Increase the connection pool size.

It feels reasonable.
Requests are waiting for a database connection.
The pool is full.
The application needs more throughput.
So the team gives it more connections.
Sometimes that helps.
But in many Postgres incidents, increasing the pool size turns a controlled slowdown into a larger failure.
A connection pool is not just a performance optimization. It is a pressure valve between the application and the database.
When configured well, it protects Postgres from too much concurrent work.
When configured badly, it allows the application to overload the database faster.

The wrong mental model
A common mental model looks like this:
More connections = more parallelism = more throughput

That is only true up to a point.
Postgres does not become infinitely faster just because more clients connect to it. Each active connection competes for shared resources:
CPU
memory
locks
shared buffers
disk IO
WAL bandwidth
temporary file space
autovacuum capacity
checkpoint pressure
planner and executor overhead

If the database is already saturated, adding more active sessions usually increases contention.
A better mental model is:
Connections are not throughput.
Connections are concurrency.
Concurrency must be limited to what the database can actually serve.

The pool should protect the database from excessive concurrency, not blindly maximize it.

The hidden multiplication problem
Connection incidents often start with innocent numbers.
One service has a pool size of 20.
That sounds small.
Then production reality looks like this:
20 connections per application instance
× 30 application instances
= 600 possible database connections

Now add:
background workers;
admin jobs;
migration runners;
BI tools;
cron scripts;
read replicas;
multiple services;
autoscaling;
deployment overlap during rolling releases.

Suddenly, max_connections = 500 no longer looks large.
The dangerous part is that each team may only see its own service:
Our pool is only 20.

But Postgres sees the total:
Hundreds of clients competing for one database.

A simple inventory query:
SELECT
    application_name,
    usename,
    client_addr,
    state,
    count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr, state
ORDER BY connections DESC;

This often reveals surprises:
old app versions still connected;
workers using separate pools;
BI tools holding sessions;
idle clients consuming slots;
one service with far more connections than expected;
deployment overlap doubling connection count temporarily.

The incident is not always caused by one bad query. Sometimes the system simply permits too many concurrent conversations with the database.
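One possible guardrail against the multiplication problem is a per-role connection limit, so that no single service can consume the whole budget. The sketch below assumes each service connects with its own role; app_user and the limit of 100 are hypothetical placeholders.

```sql
-- Current usage per login role, next to any configured limit
-- (rolconnlimit = -1 means no limit).
SELECT
    r.rolname,
    r.rolconnlimit,
    count(a.pid) AS current_connections
FROM pg_roles r
LEFT JOIN pg_stat_activity a ON a.usename = r.rolname
WHERE r.rolcanlogin
GROUP BY r.rolname, r.rolconnlimit
ORDER BY current_connections DESC;

-- Hypothetical cap for one service role.
ALTER ROLE app_user CONNECTION LIMIT 100;
```

A limit like this turns an invisible shared budget into an explicit one: the service that exceeds its share gets connection errors instead of starving everyone else.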

Idle connections are not free
An idle connection is less dangerous than an active query, but it is not free.
Each connection is represented by a backend process. It consumes memory and a connection slot. It also increases operational complexity during spikes, failovers, restarts, and deployments.
Inspect idle connections:
SELECT
    application_name,
    usename,
    client_addr,
    count(*) AS idle_connections
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY application_name, usename, client_addr
ORDER BY idle_connections DESC;

A large number of idle sessions may indicate:
oversized pools;
too many application instances;
poor pool lifecycle management;
clients that connect and do not reuse efficiently;
services holding capacity they do not need.

Idle connections may not be the immediate cause of latency, but they reduce headroom.
During an incident, headroom matters.

Active connections are the real pressure
The more important question is not just how many sessions exist.
It is how many sessions are actively doing work or waiting on something.
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;

This gives a better view of database pressure:
active sessions consuming CPU;
sessions waiting on locks;
sessions waiting on IO;
sessions idle in transaction;
sessions waiting on client reads or writes.

A saturated pool with many active database sessions means one thing.
A saturated pool with many sessions waiting on locks means another.
A saturated pool with many idle-in-transaction sessions means something else entirely.
The number of connections is only the surface.
The wait state tells you what kind of pressure the database is experiencing.

Pool saturation can be a symptom, not the root cause
When an application pool is full, it is easy to blame the pool.
But a pool usually fills because connections are being held longer than expected.
That can happen because:
queries became slower;
transactions became longer;
locks caused sessions to wait;
the database started waiting on IO;
the application opened transactions too early;
external service calls happened inside transactions;
retries increased traffic;
a deployment created more concurrent workers;
background jobs started competing with user requests.

A typical chain:
flowchart TD
    A[Query latency increases] --> B[Application holds DB connections longer]
    B --> C[Pool reaches max size]
    C --> D[New requests wait for a connection]
    D --> E[HTTP latency increases]
    E --> F[Requests time out]
    F --> G[Application retries]
    G --> H[More work reaches Postgres]
    H --> I([The pool stays saturated])

The pool is not the original failure. It is the place where the failure becomes visible.
Increasing the pool size may only move the queue from the application into Postgres.
That can make the database less stable.

Queuing in the application is often safer than queuing in Postgres
A small pool can be frustrating because requests wait before reaching the database.
But that waiting can be protective.
Application-side queue:
limits database concurrency.

Database-side queue:
lets too much work enter Postgres.

If too many requests enter Postgres, they can compete for locks, memory, CPU, and IO. Once the database is overloaded, every query can become slower, which makes connections stay busy even longer.
This feedback loop is dangerous:
flowchart TD
    A[More concurrent queries] --> B[More contention]
    B --> C[Slower queries]
    C --> D[Connections held longer]
    D --> E[More pool pressure]
    E --> F[More retries]
    F --> A

A pool should create backpressure.
Backpressure is not failure. It is a controlled refusal to overload the most critical shared component.

The database pool is part of your traffic control system
A mature production system usually has multiple layers of traffic control:
load balancer limits;
application worker limits;
request timeouts;
queue depth limits;
connection pool limits;
statement timeouts;
retry budgets;
rate limits;
circuit breakers;
background job concurrency limits.

The database pool is one of those layers.
If all other layers are loose, the database pool becomes the final gate before Postgres.
That is risky.
For example:
API accepts too much traffic.
Workers retry aggressively.
Each worker can open many DB connections.
Background jobs are unconstrained.
Pool size is high.
Postgres receives the full blast.

This is how a traffic spike becomes a database incident.
The database did not “break.” It was used as the only effective limiter in the system.

Inspecting connection pressure in Postgres
Start with total connection usage:
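SELECT
    count(*) AS current_connections,
    setting::int AS max_connections,
    round(100.0 * count(*) / setting::int, 2) AS percent_used
FROM pg_stat_activity
CROSS JOIN pg_settings
WHERE name = 'max_connections'
  -- Count only client sessions (Postgres 10+); background processes also
  -- appear in pg_stat_activity but do not use max_connections slots.
  AND backend_type = 'client backend'
GROUP BY setting;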

Break it down by application:
SELECT
    application_name,
    count(*) AS total,
    count(*) FILTER (WHERE state = 'active') AS active,
    count(*) FILTER (WHERE state = 'idle') AS idle,
    count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name
ORDER BY total DESC;

Look for old sessions:
SELECT
    pid,
    application_name,
    usename,
    client_addr,
    state,
    now() - backend_start AS connection_age,
    now() - state_change AS state_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
ORDER BY backend_start ASC
LIMIT 30;

Look for long-running active queries:
SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start ASC
LIMIT 30;

Look for idle transactions:
SELECT
    pid,
    application_name,
    usename,
    client_addr,
    now() - xact_start AS transaction_age,
    now() - state_change AS idle_age,
    left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

These queries help separate different problems:
too many idle sessions;
too many active sessions;
long-running queries;
sessions waiting on locks;
idle transactions;
connection leaks;
unexpected clients;
deployment overlap.

The goal is not just to count connections.
The goal is to understand why they exist and what they are doing.

idle in transaction: small bug, large blast radius
An application can open a transaction, run a query, and then wait.
Example:
BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application waits on an external API before COMMIT

From Postgres, this may appear as:
idle in transaction

That session is not actively running SQL, but the transaction is still open.
This can:
hold locks;
prevent vacuum cleanup;
keep old row versions visible;
increase bloat;
block migrations;
hold a pool connection indefinitely;
create confusing incident symptoms.

A useful protection:
SHOW idle_in_transaction_session_timeout;

You can set it at the role or database level:
ALTER ROLE app_user
SET idle_in_transaction_session_timeout = '60s';

This is not a substitute for fixing application code, but it can reduce blast radius.
Application transactions should be short and explicit.
A dangerous pattern:
BEGIN
  read from database
  call external service
  perform business logic
  write to database
COMMIT

A safer pattern is usually:
call external services before opening the transaction;
open the transaction late;
perform only the required database work;
commit quickly;
avoid user or network waits inside the transaction.

Postgres can handle concurrency. It cannot make long application transactions short.

Pool timeouts and statement timeouts are different
Application pool timeout:
How long a request waits to get a database connection.

Postgres statement_timeout:
How long a SQL statement may run before Postgres cancels it.

Postgres lock_timeout:
How long a statement waits to acquire a lock.

Postgres idle_in_transaction_session_timeout:
How long a session may remain idle while inside a transaction.

These protect different parts of the system.
Inspect settings:
SHOW statement_timeout;
SHOW lock_timeout;
SHOW idle_in_transaction_session_timeout;

Example role-level guardrails:
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user SET lock_timeout = '2s';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';

These values are examples, not universal defaults.
Different workloads need different limits:
OLTP API queries need strict latency control.
Background jobs may need longer statement timeouts.
Migrations need careful lock timeouts.
Analytics should often run on separate infrastructure.

Timeouts do not fix bad architecture, but they prevent some failures from growing without bounds.
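One way to express different limits per workload is to give each workload its own role. The sketch below is illustrative only: api_user and batch_user are hypothetical role names and the values are placeholders. The second statement lists what is actually configured, which is worth checking because role-level settings only apply to sessions opened after the change.

```sql
-- Hypothetical roles: api_user for OLTP traffic, batch_user for background jobs.
ALTER ROLE api_user SET statement_timeout = '5s';
ALTER ROLE batch_user SET statement_timeout = '15min';

-- List per-role and per-database overrides stored in the catalog.
-- A NULL rolname means the setting applies to every role in that database;
-- a NULL datname means it applies to the role in every database.
SELECT
    r.rolname,
    d.datname,
    s.setconfig
FROM pg_db_role_setting s
LEFT JOIN pg_roles r ON r.oid = s.setrole
LEFT JOIN pg_database d ON d.oid = s.setdatabase
ORDER BY r.rolname, d.datname;
```

Long-lived pooled connections keep the settings they started with, so a changed timeout may not take effect until the pool recycles its connections.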

PgBouncer is useful, but not magic
Many Postgres systems use PgBouncer or another external pooler.
PgBouncer can reduce the number of server connections and allow many client connections to share fewer Postgres backends.
But the pooling mode matters.
The common modes are:
session pooling;
transaction pooling;
statement pooling.

In session pooling, a client keeps the same server connection for the whole client session.
In transaction pooling, a client gets a server connection only for the duration of a transaction.
Transaction pooling can dramatically reduce pressure on Postgres, but it changes what application behavior is safe.
Features that depend on session state may become problematic:
temporary tables;
session-level SET commands;
session-level advisory locks;
LISTEN / NOTIFY patterns;
some prepared statement assumptions;
stateful connection behavior in application frameworks.
</code></pre>
For example, this is session state:</p>
SET search_path = tenant_42, public;
</code></pre>
If an application assumes this setting remains attached to a session, transaction pooling can break that assumption.</p>
A safer approach is to make state explicit:</p>
SET LOCAL statement_timeout = '5s';
</code></pre>
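Run this inside a transaction: SET LOCAL lasts only until that transaction ends, so the setting cannot leak onto a server connection that another client reuses later. Alternatively, avoid relying on session state for request behavior.</p>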
The reliability lesson:</p>
A pooler changes the contract between the application and Postgres.
</code></pre>
It must be tested as part of the application architecture, not added only during an emergency.</p>

App-level pools and external poolers can fight each other</h2>
A common architecture has both:</p>
Application connection pool
        ↓
PgBouncer
        ↓
Postgres
</code></pre>
That can work well.</p>
But it can also create confusion.</p>
Example:</p>
50 application instances
× app pool size 20
= 1000 client connections to PgBouncer

PgBouncer pool size 100
= only 100 server connections to Postgres
</code></pre>
That may be fine if PgBouncer queues safely.</p>
But application metrics may say:</p>
Database pool is healthy.
</code></pre>
while PgBouncer is saturated.</p>
Or PgBouncer may be healthy while Postgres is overloaded by 100 expensive active queries.</p>
The important operational question is:</p>
Where is the queue?
</code></pre>
Possible answers:</p>
inside the application pool;
inside PgBouncer;
inside Postgres lock waits;
inside disk IO;
inside the application request queue;
inside a background job system.
</code></pre>
The location of the queue tells you where backpressure is happening.</p>
During an incident, moving the queue from one layer to another may improve or worsen the system.</p>

Pool size should be based on database capacity, not hope</h2>
A poor pool-sizing strategy:</p>
Set pool size high enough that application requests rarely wait.
</code></pre>
That optimizes for hiding pressure.</p>
A better strategy:</p>
Set pool size low enough that Postgres remains stable under expected and degraded conditions.
</code></pre>
A rough capacity-oriented approach:</p>
How many active queries can Postgres serve with acceptable latency?
How many services share this database?
How many app instances can exist during autoscaling or rolling deploys?
How many background jobs run concurrently?
What is reserved for migrations, admin access, replication, monitoring, and emergency operations?
</code></pre>
The total possible connection count matters:</p>
total_possible_connections =
    service_count
  × instances_per_service
  × pool_size_per_instance
  + workers
  + admin clients
  + migrations
  + monitoring
</code></pre>
That number should not accidentally exceed what Postgres can handle.</p>
More importantly, the number of active queries should not exceed what the database can serve efficiently.</p>
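The connection budget above can be sketched as a small calculation. This is a minimal sketch, not a sizing tool: the `Service` type, the service names, and the `reserved` figure are hypothetical, and the numbers assume every instance connects directly to Postgres. With an external pooler in between, the server-side count is bounded by the pooler's pool size instead.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    max_instances: int  # peak count, including autoscaling and rolling deploys
    pool_size: int      # connections each instance may open


def total_possible_connections(services, extra_clients=0):
    """Worst case: every instance of every service fills its pool."""
    return sum(s.max_instances * s.pool_size for s in services) + extra_clients


def connection_headroom(services, max_connections, reserved, extra_clients=0):
    """Slots left after the worst case and the operator reserve.

    A negative result means the configuration can exceed max_connections.
    """
    worst_case = total_possible_connections(services, extra_clients)
    return max_connections - reserved - worst_case


# Hypothetical fleet: 50 API instances with 20 connections each,
# 10 worker instances with 5 each, plus 10 admin/monitoring clients.
services = [Service("api", 50, 20), Service("worker", 10, 5)]

print(total_possible_connections(services, extra_clients=10))  # 1060
print(connection_headroom(services, max_connections=500,
                          reserved=20, extra_clients=10))      # -580
```

A negative headroom like this one is the situation described above: the fleet can ask for more connections than Postgres allows, and the gap only shows up during a scale-out or a deploy.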

max_connections</code> is not a performance target</h2>
max_connections</code> is a limit, not a goal.</p>
If Postgres has max_connections = 500</code>, that does not mean the system should normally run with 500 active sessions.</p>
Check the setting:</p>
SHOW max_connections;
</code></pre>
When connection count approaches the limit, new clients may fail to connect. That can block application traffic, migrations, admin access, and incident response.</p>
You do not want to discover during an outage that there is no free connection left for an operator.</p>
A useful connection headroom query:</p>
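-- Count only client backends. Background processes also appear in
-- pg_stat_activity (PostgreSQL 10+) but do not occupy max_connections slots.
SELECT
    count(*) AS used_connections,
    setting::int AS max_connections,
    setting::int - count(*) AS remaining_connections
FROM pg_stat_activity
CROSS JOIN pg_settings
WHERE name = 'max_connections'
  AND backend_type = 'client backend'
GROUP BY setting;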
</code></pre>
Running near the maximum is usually a sign of poor control, not high efficiency.</p>
A stable Postgres system should have connection headroom.</p>

Retries can turn pool pressure into a storm</h2>
Retries are meant to make systems more resilient.</p>
Under database saturation, they can do the opposite.</p>
A bad retry pattern:</p>
Request times out waiting for DB
        ↓
Application retries immediately
        ↓
Retry also waits for DB
        ↓
More requests accumulate
        ↓
Pool remains saturated
        ↓
Database receives duplicate work
</code></pre>
A better retry strategy includes:</p>
bounded attempts;
exponential backoff;
jitter;
request deadlines;
idempotency keys;
retry budgets;
different policies for reads and writes;
no retry for known non-transient errors;
load shedding when the database is saturated.
</code></pre>
The pool and retry policy must be designed together.</p>
A small pool with aggressive retries can still overload the system.
A large pool with aggressive retries can overload it faster.</p>
Retries should not be allowed to attack a struggling database.</p>
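Several items from the list above (bounded attempts, exponential backoff, jitter, a request deadline, and no retry for non-transient errors) can be sketched in a few lines. This is an illustrative sketch rather than a production library: `TransientError` and `retry_with_backoff` are hypothetical names, and the `sleep`, `clock`, and `rng` parameters exist only so the behavior can be tested without real waiting.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. a pool checkout timeout."""


def retry_with_backoff(operation, *, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, deadline_seconds=5.0,
                       sleep=time.sleep, clock=time.monotonic,
                       rng=random.random):
    """Call operation(), retrying only TransientError within fixed bounds."""
    started = clock()
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            # Bounded attempts: the last failure is passed to the caller.
            if attempt == max_attempts:
                raise
            # Exponential backoff with full jitter: wait a random fraction
            # of min(max_delay, base_delay * 2^(attempt - 1)).
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            delay = rng() * cap
            # Request deadline: give up instead of retrying too late.
            if clock() - started + delay > deadline_seconds:
                raise
            sleep(delay)
        # Any other exception type is treated as non-transient and
        # propagates immediately without a retry.
```

A retry budget and load shedding are not shown here. They usually need state shared across requests, which depends on the framework in use.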

Background workers need separate limits</h2>
User-facing requests and background jobs should not always share the same database capacity.</p>
A background worker can be useful during normal operation and harmful during an incident.</p>
Examples:</p>
email jobs;
billing reconciliation;
search indexing;
analytics sync;
cleanup tasks;
data backfills;
report generation;
cache warming;
webhook reprocessing.
</code></pre>
If these workers use the same database pool limits as API traffic, they can starve critical paths.</p>
A better architecture often separates:</p>
API pool;
worker pool;
migration/admin access;
analytics/reporting access;
maintenance jobs.
</code></pre>
This allows operational decisions such as:</p>
pause non-critical workers;
reduce backfill concurrency;
reserve capacity for user traffic;
run reporting on a replica;
prevent cleanup jobs from overwhelming the primary.
</code></pre>
In a database incident, not all work is equally important.</p>
The pool configuration should reflect that.</p>

Connection leaks</h2>
A connection leak happens when the application checks out a database connection and does not return it to the pool.</p>
Symptoms:</p>
pool usage grows over time;
database queries are not necessarily slow;
application instances require restart to recover;
idle connections accumulate;
a specific code path correlates with pool exhaustion.
</code></pre>
Database-side symptoms may not be obvious.</p>
You can inspect session age and state:</p>
SELECT
    pid,
    application_name,
    client_addr,
    state,
    now() - backend_start AS backend_age,
    now() - state_change AS state_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
ORDER BY state_age DESC NULLS LAST
LIMIT 50;
</code></pre>
But connection leaks are often easier to detect with application metrics:</p>
pool connections in use;
pool idle connections;
pool wait time;
pool checkout timeout count;
connection acquisition latency;
connections opened and closed per second.
</code></pre>
Postgres can show you the sessions.</p>
The application usually tells you whether the pool is leaking.</p>
Both views are needed.</p>

Long queries can masquerade as pool problems</h2>
Suppose an endpoint usually takes 50 ms of database time.</p>
Then one query starts taking 5 seconds.</p>
Even without more traffic, pool usage rises because each request holds a connection longer.</p>
A simple relationship:</p>
required concurrency ≈ request rate × connection hold time
</code></pre>
If request rate is 100 requests per second and each request holds a DB connection for 50 ms:</p>
100 × 0.05 = 5 active connections
</code></pre>
If the same path now holds a connection for 5 seconds:</p>
100 × 5 = 500 active connections
</code></pre>
The pool did not become too small.</p>
The connection hold time exploded.</p>
This is why pool metrics should be read together with query latency, transaction duration, and application request traces.</p>
The pool is a mirror of database time.</p>
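The relationship above is simple enough to keep as a helper for capacity reviews. The sketch below restates the same arithmetic and adds the inverse question: how much throughput a fixed pool can sustain at a given hold time. The function names are illustrative, and hold time is passed in milliseconds only to keep the arithmetic exact.

```python
def required_connections(requests_per_second, hold_ms):
    """Average connections in use = request rate x connection hold time."""
    return requests_per_second * hold_ms / 1000


def max_sustainable_rps(pool_size, hold_ms):
    """Highest request rate a pool can serve before requests start queueing."""
    return pool_size * 1000 / hold_ms


print(required_connections(100, 50))    # 5.0
print(required_connections(100, 5000))  # 500.0

# A pool of 20 connections at a 5 second hold time serves 4 requests/second.
print(max_sustainable_rps(20, 5000))    # 4.0
```

The second function shows the other side of the same incident: when hold time grows from 50 ms to 5 seconds, a 20-connection pool drops from 400 to 4 requests per second without any configuration change.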

Transactions should not wrap too much application logic</h2>
A transaction should protect a small unit of database consistency.</p>
It should not wrap an entire business workflow unless absolutely necessary.</p>
Risky pattern:</p>
BEGIN
  select user
  call payment provider
  update order
  send webhook
  insert audit log
COMMIT
</code></pre>
This holds a database connection while waiting for external systems.</p>
Safer pattern:</p>
prepare required data;
call external systems outside transaction when possible;
open transaction;
perform minimal database changes;
commit;
emit async follow-up work.
</code></pre>
There are exceptions. Some workflows need careful transactional boundaries.</p>
But as a reliability default:</p>
Keep transactions short.
Keep connection hold time predictable.
Do not wait on the network while holding scarce database capacity.
</code></pre>
This is one of the most important application-level rules for Postgres reliability.</p>
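The safer pattern above might look like the following sketch. It uses Python's built-in `sqlite3` module purely so the example runs without a server; with Postgres the same shape applies through whatever driver the application uses. The `charge_payment_provider` function, the table names, and the column names are hypothetical.

```python
import sqlite3


def charge_payment_provider(order_id):
    """Hypothetical stand-in for a slow external API call."""
    return f"charge-{order_id}"


def complete_order(conn, order_id):
    # 1. The external call happens before any transaction is open, so no
    #    database connection is pinned while waiting on the network.
    charge_id = charge_payment_provider(order_id)

    # 2. Short transaction: only the database writes.
    #    "with conn" commits on success and rolls back on an exception.
    with conn:
        conn.execute(
            "UPDATE orders SET status = 'paid', charge_id = ? WHERE id = ?",
            (charge_id, order_id),
        )
        conn.execute(
            "INSERT INTO audit_log (order_id, event) VALUES (?, ?)",
            (order_id, "paid"),
        )

    # 3. Follow-up work such as webhooks would be queued here, after commit.
    return charge_id


# Demo setup on an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, charge_id TEXT)")
conn.execute("CREATE TABLE audit_log (order_id INTEGER, event TEXT)")
conn.execute("INSERT INTO orders (id, status) VALUES (1, 'pending')")
conn.commit()

complete_order(conn, 1)
```

Moving the external call outside the transaction means the charge can succeed while the later write fails. That is one reason idempotency keys appear in the retry list earlier, and why some workflows still need more careful boundaries.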

What to measure in the application</h2>
Postgres views are necessary, but not sufficient.</p>
The application should expose pool metrics:</p>
maximum pool size;
connections currently in use;
idle connections;
pending connection requests;
connection acquisition latency;
connection checkout timeout count;
query duration;
transaction duration;
request duration while holding DB connection;
retries by reason;
errors by SQLSTATE or exception type.
</code></pre>
The most useful metric is often not just query time.</p>
It is:</p>
time spent waiting for a connection
</code></pre>
If this grows, the application is experiencing backpressure.</p>
That may be healthy if Postgres is protected and the system degrades gracefully.</p>
It may be dangerous if requests time out and retry aggressively.</p>
Metrics should distinguish:</p>
waiting for a pool connection;
executing SQL;
waiting on a database lock;
waiting on network;
waiting on an external service while holding a connection.
</code></pre>
Without that separation, every database incident looks like “Postgres is slow.”</p>
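To make the first distinction concrete, here is a minimal sketch of a pool wrapper that records time spent waiting for a connection separately from time spent holding one. `TimedPool` is a hypothetical class built on the standard library's `queue.Queue`. Real pool libraries usually expose similar counters, and those should be preferred where they exist.

```python
import queue
import time
from contextlib import contextmanager


class TimedPool:
    """Toy pool that separates 'waiting for a connection' from 'holding it'."""

    def __init__(self, connections):
        self._idle = queue.Queue()
        for conn in connections:
            self._idle.put(conn)
        self.wait_seconds = []      # time spent waiting for a checkout
        self.hold_seconds = []      # time each checkout held a connection
        self.checkout_timeouts = 0  # checkouts that gave up waiting

    @contextmanager
    def connection(self, timeout):
        started = time.monotonic()
        try:
            conn = self._idle.get(timeout=timeout)
        except queue.Empty:
            self.checkout_timeouts += 1
            raise TimeoutError("timeout acquiring connection from pool") from None
        acquired = time.monotonic()
        self.wait_seconds.append(acquired - started)
        try:
            yield conn
        finally:
            # Hold time covers SQL execution plus anything else the caller
            # did while keeping the connection checked out.
            self.hold_seconds.append(time.monotonic() - acquired)
            self._idle.put(conn)
```

If `wait_seconds` grows while query durations stay flat, the pressure is in connection hold time or pool sizing rather than in SQL execution.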

What to measure in Postgres</h2>
Useful database-side signals:</p>
-- Connections by state and wait type
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
</code></pre>
-- Connections by application
SELECT
    application_name,
    count(*) AS total,
    count(*) FILTER (WHERE state = 'active') AS active,
    count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_on_lock,
    count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name
ORDER BY total DESC;
</code></pre>
-- Oldest transactions
SELECT
    pid,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
</code></pre>
-- Long active queries
SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start ASC
LIMIT 20;
</code></pre>
-- Blocked sessions
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
</code></pre>
These queries are not a runbook by themselves.</p>
They help answer one central question:</p>
Is Postgres doing too much work, waiting on something, or being held hostage by client behavior?
</code></pre>

Why connection incidents are often misdiagnosed</h2>
Connection pool incidents are confusing because the first visible error is often outside the database.</p>
The app logs may say:</p>
timeout acquiring connection from pool
</code></pre>
or:</p>
remaining connection slots are reserved
</code></pre>
or:</p>
too many clients already
</code></pre>
or:</p>
context deadline exceeded
</code></pre>
Teams then debate:</p>
Is this an app issue?
Is this a database issue?
Is the pool too small?
Is max_connections too low?
Is PgBouncer broken?
Is a query slow?
Is the network slow?
</code></pre>
The answer may be “yes” to several of these.</p>
The pool is the boundary between application behavior and database capacity. Boundary failures usually have causes on both sides.</p>

Common anti-patterns</h2>
Pool size copied from another system</h3>
A pool size that worked for one service may be wrong for another.</p>
Workload shape matters:</p>
short OLTP queries;
long reporting queries;
bursty writes;
background jobs;
tenant skew;
transaction-heavy workflows;
read-after-write patterns.
</code></pre>
Pool size configured per instance without considering total instances</h3>
Autoscaling can silently multiply database pressure.</p>
One shared pool for critical and non-critical work</h3>
A reporting job should not be able to starve checkout.</p>
Long external calls inside transactions</h3>
This turns network latency into database connection pressure.</p>
No timeout hierarchy</h3>
Without clear request, pool, statement, lock, and transaction timeouts, failures linger too long.</p>
Aggressive retries</h3>
Retries without budgets and backoff can turn a small slowdown into a storm.</p>
Treating PgBouncer as a universal fix</h3>
A pooler helps manage connections. It does not remove query cost, lock contention, IO saturation, or bad transaction design.</p>

A healthier operating model</h2>
A good connection strategy is explicit.</p>
It defines:</p>
which services may connect to Postgres;
how many connections each service may use;
how many instances may exist during normal and deploy conditions;
which work is allowed on the primary;
which work should use replicas;
which jobs can be paused;
which timeouts protect the system;
which retries are allowed;
which metrics indicate backpressure;
which actions reduce pressure safely.
</code></pre>
This is not only DBA work.</p>
It requires cooperation between:</p>
backend engineers;
SREs;
DBAs;
platform engineers;
application owners;
incident responders.
</code></pre>
Connection reliability lives at the boundary between application design and database operations.</p>
That is why it often falls through organizational cracks.</p>

Why connection pool incidents are good simulation material</h2>
Connection pool incidents are excellent for practice because they create misleading symptoms.</p>
The application says it cannot get a connection.
The database says it has too many clients.
The query dashboard shows slower SQL.
The lock dashboard may show waiting sessions.
The autoscaler adds more application instances.
Retries increase traffic.
Someone proposes increasing max_connections</code>.
Someone else proposes restarting the app.</p>
All of these may be plausible.</p>
A realistic simulation can force the team to reason through:</p>
where the queue is forming;
whether the pool is protecting or harming Postgres;
whether increasing pool size would help or amplify the incident;
which workload should be shed first;
whether long transactions are holding connections;
whether retries are multiplying demand;
whether background workers should be paused;
whether the safest mitigation is in SQL, app config, infrastructure, or traffic control.
</code></pre>
The goal is not to memorize a perfect pool size.</p>
The goal is to build judgment around database pressure.</p>
Articles can explain the mechanics.
Dashboards can show saturation.
Simulations teach what it feels like to choose under pressure.</p>

Conclusion</h2>
More Postgres connections do not automatically mean more performance.</p>
They mean more concurrency.</p>
Concurrency is useful only while the database has capacity to serve it. Past that point, additional connections create contention, longer waits, more timeouts, more retries, and a larger incident.</p>
A connection pool should not be treated as a bucket that must be as large as possible.</p>
It should be treated as a control surface.</p>
Good pooling protects Postgres.
Bad pooling exposes Postgres to uncontrolled application demand.</p>
Reliable Postgres systems need:</p>
bounded connection counts;
short transactions;
clear timeout policies;
safe retry behavior;
separate limits for critical and background work;
visibility into pool wait time;
visibility into database wait states;
enough headroom for operations and incidents.
</code></pre>
The dangerous phrase is:</p>
The pool is full, so increase it.
</code></pre>
The better question is:</p>
Why are connections being held longer than expected, and where should backpressure happen?
</code></pre>
That question turns connection pooling from a configuration detail into a database reliability practice.</p>


Postgres replication: when a standby exists but does not save you
2026-05-12T00:00:00+00:00
A standby database is comforting.</p>
It appears in architecture diagrams as a safety net. The primary fails, the standby takes over, and the product survives. Read traffic can be moved away from the primary. Backups can be isolated. Disaster recovery looks solved.</p>
But Postgres replication does not automatically mean high availability.</p>
A standby can be too far behind.
A replica can faithfully reproduce bad writes.
A failover can create split-brain.
A replication slot can fill the primary disk with retained WAL.
Read queries on a standby can conflict with recovery.
A promoted replica can break downstream consumers.
An application can keep writing to the wrong node after failover.</p>
Replication is not a guarantee. It is a mechanism.</p>
And like every reliability mechanism, it creates new failure modes.</p>
PostgreSQL streaming replication keeps a standby up to date by sending WAL records from the primary as they are generated; it is asynchronous by default, meaning there can be a delay between commit on the primary and visibility on the standby. (PostgreSQL</a>)</p>
That small sentence contains an entire class of incidents.</p>

The false sense of safety</h2>
Many teams say:</p>
We have a replica.
</code></pre>
But that statement is incomplete.</p>
A more useful operational version is:</p>
We have a replica.
We know how far behind it is.
We know whether it can be promoted.
We know what data loss window is acceptable.
We know how applications reconnect.
We know how to prevent the old primary from coming back.
We know what happens to replication slots, read traffic, jobs, and logical consumers after failover.
</code></pre>
A replica is not a disaster recovery plan by itself.</p>
It is a component inside a larger recovery process.</p>
PostgreSQL’s own failover documentation is explicit about the need to prevent the old primary from continuing as primary after a standby is promoted, because two systems believing they are primary can lead to data loss; this is the classic split-brain problem. (PostgreSQL</a>)</p>
That is why replication reliability is not just about lag.</p>
It is about control.</p>

Replication lag is not one number</h2>
The first mistake is treating replication lag as a single metric.</p>
In practice, there are several different “lags”:</p>
WAL generated on primary but not sent
WAL sent but not written by standby
WAL written but not flushed
WAL flushed but not replayed
Changes replayed but application still reading stale data
</code></pre>
On the primary, pg_stat_replication</code> is the main view for directly connected standbys. The PostgreSQL statistics documentation describes it as one row per WAL sender process, with information about replication to the connected standby. (PostgreSQL</a>)</p>
A useful primary-side query:</p>
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
</code></pre>
This query separates the pipeline.</p>
If sent_lsn</code> is far behind, the primary is not sending fast enough or the connection is impaired.</p>
If write_lsn</code> lags behind sent_lsn</code>, the standby is receiving but not writing fast enough.</p>
If flush_lsn</code> is behind, WAL is not durable on the standby yet.</p>
If replay_lsn</code> is behind, the standby has received WAL but has not applied it.</p>
Those are not the same problem.</p>
A standby can be connected and still not be useful for failover if it is too far behind the primary.</p>

Checking the standby from the standby</h2>
On the standby itself:</p>
SELECT pg_is_in_recovery();
</code></pre>
A standby returns true</code>. After promotion, it returns false</code>.</p>
To inspect receive and replay positions:</p>
SELECT
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn()  AS replay_lsn,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
    ) AS receive_replay_gap,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
</code></pre>
This helps answer a different question:</p>
Is this standby receiving WAL?
Is it replaying WAL?
How stale is the data visible to queries?
</code></pre>
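The replay_delay</code> value is especially important for read replicas. It tells you how far behind visible database state may be. When the primary is idle, no new transactions arrive to replay, so this value keeps growing even though the standby is fully caught up; read it together with the LSN gap.</p>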
For example, if the application writes an order to the primary and immediately reads from a standby, it may not see its own write.</p>
That is not a Postgres bug. It is a read-after-write consistency problem.</p>

Read replicas can serve stale data</h2>
A common architecture sends writes to the primary and reads to replicas:</p>
flowchart TD
    A[Application writes order to primary] --> B[Application reads order from standby]
    B --> C([Order is missing])
</code></pre>
The write committed successfully. The replica simply has not replayed the WAL yet.</p>
This is one of the most common ways replication leaks into product behavior.</p>
The user sees:</p>
I saved the setting, but the UI still shows the old value.
</code></pre>
The backend sees:</p>
INSERT succeeded.
SELECT returned old state.
</code></pre>
The database sees:</p>
Primary is correct.
Standby is behind by 800 ms.
</code></pre>
That may be acceptable for dashboards, analytics, or eventually consistent feeds. It may be unacceptable for checkout, authentication, permissions, billing, or anything requiring read-your-writes behavior.</p>
A basic mitigation pattern is application-level routing:</p>
Fresh reads after writes → primary
Stale-tolerant reads → replica
Long analytics queries → dedicated reporting replica
</code></pre>
This decision belongs in system design, not in a panic during an incident.</p>
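One way to encode that routing rule is a small decision function. This is a sketch under stated assumptions, not a prescribed design: the function name, its parameters, and the one-second safety margin are all hypothetical, and it assumes the application tracks when the current user last wrote and has a recent replica lag measurement.

```python
from typing import Optional


def choose_node(stale_tolerant: bool,
                seconds_since_own_write: Optional[float],
                replica_lag_seconds: float,
                safety_margin_seconds: float = 1.0) -> str:
    """Return "primary" or "replica" for a read.

    seconds_since_own_write is None when the caller has not written recently.
    """
    # Reads that cannot tolerate staleness always go to the primary.
    if not stale_tolerant:
        return "primary"
    # Read-your-writes: if the caller wrote more recently than the replica
    # could have replayed, the replica may not have the change yet.
    if (seconds_since_own_write is not None
            and seconds_since_own_write <= replica_lag_seconds + safety_margin_seconds):
        return "primary"
    return "replica"


print(choose_node(False, None, 0.8))  # primary
print(choose_node(True, 0.5, 0.8))    # primary
print(choose_node(True, 30.0, 0.8))   # replica
```

The lag measurement itself can be stale, which is why the sketch adds a margin instead of comparing against the raw value.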

Replication protects availability, not correctness of bad changes</h2>
Replication copies changes.</p>
That includes bad changes.</p>
If an application deploy runs:</p>
UPDATE users
SET plan = 'free';
</code></pre>
without a WHERE</code> clause, the standby will not save you. It will replay the same change.</p>
If a migration drops the wrong column, the standby will follow.</p>
If an application bug deletes valid data, physical streaming replication reproduces the deletion.</p>
This is why replication is not a replacement for backups, point-in-time recovery, access controls, safer migrations, or staged rollouts.</p>
A standby helps when the primary node, disk, VM, container, or availability zone fails.</p>
It does not magically distinguish good WAL from bad WAL.</p>
A good reliability review asks:</p>
Which failure mode are we defending against?
Primary host failure?
Storage failure?
Human error?
Bad deploy?
Region outage?
Silent corruption?
Accidental DELETE?
</code></pre>
A replica is useful for some of these. It is insufficient for others.</p>

Replication slots: safety mechanism with sharp edges</h2>
Replication slots are designed to help prevent the primary from removing WAL that a replica or logical consumer still needs. PostgreSQL documents pg_replication_slots</code> as the view listing replication slots and their current state. (PostgreSQL</a>)</p>
That is useful. It is also dangerous if nobody monitors it.</p>
Inspect slots:</p>
SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    wal_status,
    safe_wal_size,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
</code></pre>
The risk is simple:</p>
flowchart TD
    A[A replica disconnects] --> B[Its replication slot remains]
    B --> C[The primary keeps WAL needed by that slot]
    C --> D[WAL accumulates]
    D --> E[Disk fills]
    E --> F([The primary goes down])
</code></pre>
The original problem may have been a failed standby.</p>
The actual production outage may be the primary running out of disk because the slot kept retaining WAL.</p>
Replication infrastructure can therefore take down the primary it was supposed to protect.</p>
Operationally, slots need ownership:</p>
Who owns this slot?
Which process consumes it?
Is it expected to be active?
How much WAL can it retain?
What alert fires before disk pressure becomes dangerous?
Can this slot be safely dropped?
</code></pre>
Dropping a slot is not a casual action. If the consumer still needs that WAL, dropping the slot may force reinitialization or data loss for that consumer.</p>
SELECT pg_drop_replication_slot('slot_name');
</code></pre>
That command can be correct. It can also be destructive. The hard part is knowing which situation you are in.</p>

WAL volume can break your assumptions</h2>
Replication lag is not only about network speed.</p>
A primary can suddenly generate more WAL than usual:</p>
Large UPDATE
Bulk import
Index creation
VACUUM FULL
High-write deploy
Backfill job
Large DELETE
Migration touching many rows
</code></pre>
A replica that keeps up during normal traffic may fall behind during a backfill.</p>
A simple way to inspect WAL generation rate is to sample LSN movement over time.</p>
Manual example:</p>
SELECT pg_current_wal_lsn();
</code></pre>
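Run it again later, then subtract the first value from the second. The LSNs below are placeholders for the two values you sampled:</p>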
SELECT
    pg_size_pretty(
        pg_wal_lsn_diff('0/50000000'::pg_lsn, '0/40000000'::pg_lsn)
    ) AS wal_generated;
</code></pre>
In a monitoring system, this becomes a rate:</p>
WAL bytes generated per second
</code></pre>
That metric matters because replication capacity is about throughput over time, not just whether the standby is connected.</p>
The standby may be healthy and still unable to keep up with a temporary WAL storm.</p>
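For a monitoring script that samples `pg_current_wal_lsn()` as text, the same subtraction can be done client-side. The sketch below assumes the standard textual LSN format (two hexadecimal numbers separated by a slash, where the first is the high 32 bits). The function names are illustrative, and the sample values are the placeholders from the query above.

```python
def lsn_to_bytes(lsn):
    """Convert a textual LSN such as '0/50000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_bytes_per_second(lsn_before, lsn_after, elapsed_seconds):
    """Average WAL generation rate between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    generated = lsn_to_bytes(lsn_after) - lsn_to_bytes(lsn_before)
    return generated / elapsed_seconds


generated = lsn_to_bytes("0/50000000") - lsn_to_bytes("0/40000000")
print(generated)  # 268435456 bytes, i.e. 256 MiB

# The same two samples taken 60 seconds apart, as a rate.
print(wal_bytes_per_second("0/40000000", "0/50000000", 60))
```

Comparing this rate against the throughput a standby sustained in past incidents gives an early signal that a backfill or migration will outrun replication.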

Hot standby query conflicts</h2>
A hot standby can serve read-only queries while it replays WAL.</p>
That sounds perfect until long read queries on the standby conflict with recovery.</p>
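A reporting query might hold a snapshot that conflicts with WAL replay. Postgres then has a choice: delay replay or cancel the query. max_standby_streaming_delay bounds how long replay waits before cancelling conflicting queries, and hot_standby_feedback reduces snapshot conflicts at the cost of delayed cleanup on the primary.</p>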
You can inspect standby conflicts with:</p>
SELECT
    datname,
    confl_tablespace,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;
</code></pre>
The monitoring stats documentation includes pg_stat_database_conflicts</code> for database-wide query cancels due to conflicts with recovery on standby servers. (PostgreSQL</a>)</p>
This matters because a replica often has two competing jobs:</p>
Stay close to primary for failover
Serve long-running read queries
</code></pre>
Those goals can conflict.</p>
If the standby prioritizes replay, analytical queries may be canceled.</p>
If the standby delays replay to satisfy long queries, replication lag may grow.</p>
You can reduce pain by separating roles:</p>
HA standby: optimized for promotion, minimal lag
Reporting replica: accepts staleness, runs heavy reads
Logical/ETL replica: feeds downstream systems
</code></pre>
Using one standby for every purpose is cheap architecturally and expensive operationally.</p>

Synchronous replication: stronger durability, different failure mode</h2>
Asynchronous replication has a data loss window.</p>
Synchronous replication can reduce that window, but it changes the write path. The primary may wait for standby acknowledgement depending on synchronous_commit</code> and synchronous replication configuration. The PostgreSQL replication settings documentation warns that with synchronous_commit = remote_apply</code>, commits wait for the change to be applied on the standby. (PostgreSQL</a>)</p>
That means synchronous replication can turn standby problems into primary write latency.</p>
The trade-off is not “sync is better” or “async is better.”</p>
The trade-off is:</p>
Async replication:
lower write latency,
possible data loss during failover.

Sync replication:
stronger durability guarantees,
standby health can affect primary commits.
</code></pre>
A useful query:</p>
SELECT
    application_name,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
</code></pre>
Pay attention to sync_state</code>.</p>
Values such as sync</code>, potential</code>, or async</code> tell you how the standby participates in synchronous replication behavior.</p>
A synchronous standby is not just a backup target. It is part of the commit path.</p>
If it becomes slow, user-facing writes may slow down too.</p>

Failover is a process, not a command</h2>
Promotion is technically simple:</p>
SELECT pg_promote();
</code></pre>
or from the server:</p>
pg_ctl promote
</code></pre>
PostgreSQL documents these as ways to trigger failover for a log-shipping standby. (PostgreSQL</a>)</p>
But promotion is only one step.</p>
A real failover involves many decisions:</p>
Is the primary truly dead?
Could it still accept writes?
Which standby is the best candidate?
How much WAL has it replayed?
What data loss is acceptable?
How will applications reconnect?
What happens to connection pools?
How is the old primary fenced?
What happens to read replicas following the old primary?
What happens to logical replication slots?
What happens to scheduled jobs and workers?
Who declares the incident phase complete?
</code></pre>
The dangerous failover is not the one that fails loudly.</p>
The dangerous failover is the one that half-succeeds.</p>
For example:</p>
Standby promoted successfully.
Some app instances still write to old primary.
A background worker reconnects to the wrong host.
Read replicas still follow the old timeline.
Logical consumers lose their slots.
Monitoring shows green because one node is healthy.
Data diverges.
</code></pre>
This is why failover must be rehearsed.</p>
Not discussed.
Not documented once.
Rehearsed.</p>

Timeline changes matter</h2>
After promotion, the new primary continues on a new timeline.</p>
That matters for replicas, WAL archives, backup chains, and recovery procedures.</p>
PostgreSQL documentation notes that standbys used for high availability should follow timeline changes after failover, with recovery_target_timeline</code> set to latest</code>, which is the default. (PostgreSQL</a>)</p>
This detail sounds small until a replica fails to follow the new primary after failover.</p>
The operational symptom may be confusing:</p>
New primary accepts writes.
Old standby does not catch up.
A recreated replica follows the wrong history.
Archive restore behaves unexpectedly.
</code></pre>
During calm periods, timeline mechanics feel like internal implementation detail.</p>
During failover, they become part of the recovery path.</p>
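The timeline is visible in WAL segment file names, which can help when checking what an archive or replica is actually following. A minimal sketch, assuming standard 24-hex-digit segment names such as the value of last_archived_wal in pg_stat_archiver. The function name is hypothetical.

```python
def wal_file_timeline(wal_file_name: str) -> int:
    """Return the timeline ID encoded in a WAL segment file name.

    Segment names are 24 hex digits; the first 8 are the timeline ID.
    """
    if len(wal_file_name) < 24:
        raise ValueError(f"not a WAL segment name: {wal_file_name!r}")
    return int(wal_file_name[:8], 16)


print(wal_file_timeline("000000010000000000000070"))  # 1
print(wal_file_timeline("000000020000000000000071"))  # 2
```

If the newest archived segment is on timeline 2 while a replica still reports received_tli = 1 in pg_stat_wal_receiver, that replica has not followed the promotion.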

Logical replication adds another layer</h2>
Logical replication is often used for:</p>
CDC pipelines
Search indexing
Data warehouses
Event streaming
Cross-version migrations
Selective table replication
Zero-downtime migration workflows
</code></pre>
Its failure modes are different from physical streaming replication.</p>
A logical slot can fall behind and retain WAL.
A subscriber can stop applying changes.
Schema drift can break replication.
A failover can strand logical slots if they are not handled correctly.</p>
PostgreSQL 17 and later include mechanisms for logical failover slot synchronization. The current documentation describes sync_replication_slots</code> as enabling a physical standby to synchronize logical failover slots from the primary so logical subscribers can resume from the new primary after failover. (PostgreSQL</a>)</p>
The practical lesson is simple:</p>
If downstream systems depend on logical replication,
failover planning must include those systems.
</code></pre>
It is not enough that the database comes back.</p>
The data platform around it must continue correctly.</p>

A practical replication health snapshot</h2>
This is not a full runbook, but these queries make a useful health snapshot.</p>
Primary-side replication status:</p>
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
</code></pre>
Replication slots:</p>
SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
</code></pre>
Standby freshness:</p>
SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    -- Keeps growing while the primary is idle, even if the standby is fully caught up.
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
</code></pre>
Standby conflicts:</p>
SELECT
    datname,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;
</code></pre>
WAL receiver on standby:</p>
SELECT
    status,
    receive_start_lsn,
    written_lsn,
    flushed_lsn,
    received_tli,
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn,
    latest_end_time,
    conninfo
FROM pg_stat_wal_receiver;
</code></pre>
These queries do not tell you what to do automatically. They help you ask better questions:</p>
Is the standby connected?
Is it catching up or falling behind?
Is lag measured in bytes, time, or user-visible staleness?
Is WAL retention becoming dangerous?
Are standby reads conflicting with recovery?
Is failover currently safe, risky, or impossible?
</code></pre>

Common anti-patterns</h2>
One replica for every purpose</h3>
A standby used for HA, reporting, backups, ad hoc analytics, and read scaling will eventually disappoint one of those use cases.</p>
HA wants low lag.
Analytics wants long queries.
Backups want predictable throughput.
Read scaling wants availability and acceptable staleness.</p>
Those goals are not identical.</p>
No explicit read consistency model</h3>
If the application casually sends reads to replicas, product behavior may become inconsistent.</p>
Use replicas deliberately:</p>
Can this read be stale?
Does this user need to read their own write?
Can this endpoint tolerate lag?
Should this workflow force primary reads?
</code></pre>
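One way to make these questions explicit is to route each read based on WAL positions. This is a sketch under assumptions, not a prescribed design: it assumes the application records pg_current_wal_lsn() after each write it must later read back, and can cheaply learn the replica's pg_last_wal_replay_lsn(). The function and parameter names are hypothetical.

```python
from typing import Optional


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/70000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def route_read(
    can_be_stale: bool,
    last_write_lsn: Optional[str],  # LSN captured after the session's last write
    replica_replay_lsn: str,        # replica's current replay position
) -> str:
    """Return 'replica' or 'primary' for a single read."""
    if can_be_stale or last_write_lsn is None:
        return "replica"
    # Read-your-writes: the replica must have replayed past the session's write.
    if lsn_to_int(replica_replay_lsn) >= lsn_to_int(last_write_lsn):
        return "replica"
    return "primary"


print(route_read(False, "0/70000000", "0/6FFFFF00"))  # primary
print(route_read(False, "0/70000000", "0/70000010"))  # replica
```

The point is that the fallback to the primary is a decision written in code, not an accident of load balancing.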
Ignoring slots until disk pressure</h3>
Replication slots should be treated like production resources with owners, alerts, and lifecycle management.</p>
An abandoned slot is not harmless metadata.</p>
Treating failover as infrastructure-only</h3>
Failover affects database clients, application routing, workers, caches, queues, jobs, observability, and people.</p>
A database promotion that the application does not understand is not recovery.</p>
Never testing promotion</h3>
A failover process that has never been practiced is an assumption.</p>
Assumptions do not become reliable because they are written in a document.</p>

What a good incident review should ask</h2>
After a replication incident, avoid stopping at:</p>
The replica lagged.
</code></pre>
That is only the symptom.</p>
Better questions:</p>
What created the WAL spike?
Was the standby under-provisioned or overloaded by read traffic?
Did a long query on the standby delay recovery?
Did a slot retain more WAL than expected?
Were alerts based on bytes, time, or disk risk?
Did application reads tolerate the actual staleness?
Was failover considered? If not, why?
Would promotion have caused data loss?
Could the old primary have reappeared?
Did downstream logical consumers survive the event?
</code></pre>
The goal is to understand the system’s recovery posture, not just the replication metric that turned red.</p>

Why replication incidents are excellent simulation material</h2>
Replication incidents are perfect for training because they combine database internals with distributed systems behavior.</p>
A realistic scenario can involve:</p>
WAL generation spike from a migration
Replica lag crossing the read-staleness budget
Replication slot retaining dangerous WAL volume
Read queries conflicting with recovery
Application reads returning stale data
A failover decision under uncertainty
Old primary fencing
Connection string and DNS behavior
Downstream logical replication consumers
</code></pre>
The hard part is not querying pg_stat_replication</code>.</p>
The hard part is deciding what the evidence means.</p>
Is the replica unhealthy, or is the primary generating too much WAL?
Is lag acceptable for read traffic but unacceptable for failover?
Is the slot protecting data or threatening disk?
Would promotion reduce impact or create split-brain?
Should traffic be moved, throttled, failed over, or left alone while the standby catches up?</p>
Those decisions require practice.</p>
Articles can explain the mechanism.
Monitoring can expose the symptoms.
Simulation builds the judgment needed to act safely.</p>

Conclusion</h2>
A standby does not automatically save you.</p>
Postgres replication is powerful, but it is not magic. It improves availability only when the surrounding operational system is mature enough to use it correctly.</p>
You need to know:</p>
how far behind replicas are;
which reads can tolerate staleness;
how much WAL slots retain;
whether standby queries conflict with replay;
what data loss window is acceptable;
how failover is triggered;
how split-brain is prevented;
how applications reconnect;
how downstream consumers continue;
how the cluster returns to a healthy topology after promotion.
</code></pre>
Replication is not just a database feature.</p>
It is a reliability contract between Postgres, infrastructure, applications, operators, and product expectations.</p>
The dangerous phrase is:</p>
“We have a replica, so we are safe.”
</code></pre>
The better phrase is:</p>
“We know exactly what our replica can and cannot save us from.”
</code></pre>


WAL and checkpoints: the invisible machinery behind Postgres durability
2026-05-04T00:00:00+00:00
Most teams notice WAL only when something goes wrong.</p>
The disk fills with files in pg_wal</code>.
A replica falls behind.
Backups stop completing.
Checkpoints create latency spikes.
A bulk update generates far more IO than expected.
A restart takes longer than the team is comfortable with.</p>
Until then, WAL and checkpoints feel like internal Postgres details.</p>
They are not.</p>
WAL and checkpoints are part of the contract between Postgres, storage, replication, backups, recovery, and application latency. If you operate Postgres in production, you do not need to become a storage engine developer, but you do need a practical reliability model of how this machinery behaves under pressure.</p>
PostgreSQL uses Write-Ahead Logging to preserve data integrity: changes to data files must be logged first, and WAL records are flushed to durable storage before the corresponding data-file changes are considered safe. (PostgreSQL</a>)</p>
That is the foundation. The incidents come from everything around it.</p>

The basic idea of WAL</h2>
When a transaction changes data, Postgres does not rely only on immediately updating table and index files.</p>
It first records the change in WAL.</p>
A simplified write path looks like this:</p>
flowchart TD
    A[Client sends write] --> B[Postgres modifies pages in memory]
    B --> C[Postgres writes WAL records]
    C --> D[WAL is flushed according to durability settings]
    D --> E[COMMIT returns]
    E --> F[Data pages are written later]
</code></pre>
This separation is crucial.</p>
The data page may not be written to the table file immediately. It can remain dirty in shared buffers. If the server crashes, Postgres can use WAL during recovery to bring data files back to a consistent state. PostgreSQL keeps WAL in the pg_wal/</code> directory, and the documentation describes WAL replay after the last checkpoint as the mechanism used to restore consistency after a crash. (PostgreSQL</a>)</p>
That is why WAL is not just logging.</p>
It is recovery infrastructure.</p>

COMMIT does not mean “every table page is already on disk”</h2>
A common misconception:</p>
COMMIT means all changed table and index pages were written to disk.
</code></pre>
Not exactly.</p>
A committed transaction means Postgres has made the transaction durable according to its WAL and commit settings. The actual table and index pages may be written later.</p>
This is one reason Postgres can perform well. It does not need to synchronously rewrite every affected data page before returning every commit.</p>
But it also means that the health of WAL IO is critical.</p>
If WAL writes or WAL fsync become slow, commits can become slow.</p>
A user-visible symptom may be:</p>
INSERT/UPDATE latency increases
API writes slow down
background jobs fall behind
replication lag grows
WAL directory grows
</code></pre>
The application may report “database is slow,” but the specific mechanism may be commit-path pressure.</p>

synchronous_commit</code> changes the durability/latency trade-off</h2>
One setting that directly affects commit behavior is:</p>
SHOW synchronous_commit;
</code></pre>
The default, on</code>, is appropriate for most production systems, but the operational model matters.</p>
With stronger commit guarantees, the client waits for more durability work before COMMIT</code> returns. With weaker settings, commits can return earlier, but the system accepts a larger risk window in the event of a crash.</p>
This is not a generic performance knob.</p>
It is a business and reliability decision.</p>
For example, it may be acceptable to relax durability for:</p>
ephemeral analytics events;
rebuildable caches;
non-critical metrics;
temporary ingestion buffers.
</code></pre>
It may be unacceptable for:</p>
payments;
orders;
ledger entries;
identity changes;
permissions;
security-sensitive writes.
</code></pre>
A dangerous incident response is changing durability settings during pressure without understanding what data can be lost and what the product guarantees.</p>
The question is not:</p>
Can this reduce latency?
</code></pre>
The better question is:</p>
What durability contract are we changing, and who owns that risk?
</code></pre>

What checkpoints do</h2>
If WAL can recover data after a crash, why do checkpoints exist?</p>
Because recovery cannot start from the beginning of time.</p>
A checkpoint is a known safe point in the WAL sequence. At checkpoint time, dirty data pages are flushed to disk, and Postgres writes a checkpoint record to WAL. PostgreSQL documentation describes checkpoints as points where heap and index data files are guaranteed to have been updated with all information written before that checkpoint. (PostgreSQL</a>)</p>
A simplified model:</p>
flowchart TD
    A[WAL records accumulate] --> B[Dirty pages accumulate in memory]
    B --> C[Checkpoint begins]
    C --> D[Dirty pages are written to disk]
    D --> E[Checkpoint record is written]
    E --> F([Crash recovery can start from a later point])
</code></pre>
Checkpoints reduce crash recovery work.</p>
But they also create IO work.</p>
That trade-off is central to Postgres reliability.</p>

Checkpoints are not free</h2>
During a checkpoint, Postgres must write dirty buffers to disk.</p>
If many pages are dirty, that can create significant IO pressure. If the storage system is already busy, checkpoint activity can appear as latency spikes.</p>
Symptoms may include:</p>
periodic write latency spikes;
higher commit latency;
slow queries during checkpoint periods;
replica lag increasing during write bursts;
backend processes writing buffers directly;
checkpoint warnings in logs;
storage saturation without one obvious query.
</code></pre>
This is why checkpoint behavior should be understood as part of workload management, not only configuration.</p>
A checkpoint problem is often a workload-shape problem:</p>
many writes in a short period;
bulk updates;
large deletes;
index builds;
backfills;
ETL jobs;
maintenance tasks;
write-heavy deploys;
checkpoints happening too frequently.
</code></pre>
The database may be working correctly while still creating unacceptable latency for the product.</p>

Time-based and WAL-volume-based checkpoints</h2>
Checkpoints happen for different reasons.</p>
Four important controls are:</p>
SHOW checkpoint_timeout;
SHOW max_wal_size;
SHOW checkpoint_completion_target;
SHOW checkpoint_warning;
</code></pre>
Postgres can checkpoint because enough time has passed, or because WAL volume has grown enough. The documentation describes checkpoint_timeout</code>, max_wal_size</code>, checkpoint_completion_target</code>, and checkpoint_warning</code> as key WAL/checkpoint configuration parameters. (PostgreSQL</a>)</p>
A useful mental model:</p>
checkpoint_timeout:
how long Postgres may go between automatic checkpoints.

max_wal_size:
how much WAL growth can push Postgres toward a checkpoint.

checkpoint_completion_target:
how much of the checkpoint interval Postgres tries to use
to spread checkpoint writes.

checkpoint_warning:
log a warning if checkpoints happen too close together.
</code></pre>
Frequent requested checkpoints are usually a warning sign.</p>
They often mean WAL is being generated faster than the current checkpoint configuration expects.</p>
That can happen during normal growth, but it can also reveal an unsafe backfill, bulk update, migration, or retry storm.</p>

The classic warning: checkpoints are happening too often</h2>
Postgres can log warnings when checkpoints caused by WAL segment pressure happen too close together. The documentation notes that checkpoint_warning</code> exists to log when checkpoints caused by WAL filling occur closer together than the configured threshold, suggesting max_wal_size</code> may need to be increased. (PostgreSQL</a>)</p>
A log message like this should not be ignored:</p>
checkpoints are occurring too frequently
</code></pre>
It does not automatically mean “increase max_wal_size</code> and move on.”</p>
It means:</p>
The workload is generating WAL fast enough
to force more checkpoint activity than expected.
</code></pre>
The next question is workload-oriented:</p>
What changed?
A migration?
A bulk update?
A new write-heavy endpoint?
A data import?
A queue retry storm?
A new index?
A replica or archive issue?
</code></pre>
Changing a setting may be appropriate. But if the WAL spike came from a bad release or uncontrolled job, the real fix may be outside postgresql.conf</code>.</p>

Measuring WAL generation</h2>
A basic WAL snapshot:</p>
SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;
</code></pre>
PostgreSQL’s cumulative statistics system exposes server activity through statistics views, including WAL-related and replication-related views. (PostgreSQL</a>)</p>
The most useful number is not only total WAL generated.</p>
It is the rate.</p>
You can sample WAL position:</p>
SELECT now(), pg_current_wal_lsn();
</code></pre>
Then sample again later:</p>
SELECT
    pg_size_pretty(
        pg_wal_lsn_diff(
            '0/70000000'::pg_lsn,
            '0/60000000'::pg_lsn
        )
    ) AS wal_generated;
</code></pre>
In monitoring, this becomes:</p>
WAL bytes generated per second
</code></pre>
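The arithmetic behind that metric is small enough to sketch. The example below assumes two samples of now() and pg_current_wal_lsn() taken 60 seconds apart, reusing the LSN values from the query above. The function names are hypothetical.

```python
from datetime import datetime


def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/70000000' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_bytes_per_second(t1: datetime, lsn1: str, t2: datetime, lsn2: str) -> float:
    """WAL generation rate between two (timestamp, LSN) samples."""
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0:
        raise ValueError("second sample must be later than the first")
    return (lsn_to_int(lsn2) - lsn_to_int(lsn1)) / elapsed


rate = wal_bytes_per_second(
    datetime(2026, 5, 4, 12, 0, 0), "0/60000000",
    datetime(2026, 5, 4, 12, 1, 0), "0/70000000",
)
print(f"{rate / 1024 / 1024:.2f} MiB/s")  # 4.27 MiB/s
```

Here 256 MiB of WAL over 60 seconds works out to roughly 4.27 MiB/s. Whether that is a lot depends on what the replicas, archive, and storage can absorb.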
Why this matters:</p>
WAL must be written locally.
WAL may need to be archived.
WAL may need to be streamed to replicas.
WAL may be retained for replication slots.
WAL volume affects checkpoint pressure.
WAL volume affects recovery time.
</code></pre>
A system can have acceptable query latency and still be heading toward a WAL-related incident.</p>

Finding WAL-heavy queries</h2>
From PostgreSQL 13 onward, pg_stat_statements</code> exposes WAL-related metrics for statements, provided the extension is installed and loaded.</p>
A useful query shape:</p>
SELECT
    calls,
    pg_size_pretty(wal_bytes) AS total_wal,
    pg_size_pretty((wal_bytes / greatest(calls, 1))::numeric) AS wal_per_call,
    mean_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE wal_bytes > 0
ORDER BY wal_bytes DESC
LIMIT 20;
</code></pre>
This helps identify statements that generate large amounts of WAL.</p>
Typical WAL-heavy operations include:</p>
large UPDATEs;
large DELETEs;
bulk INSERTs;
index creation;
table rewrites;
VACUUM FULL;
CLUSTER;
backfills;
high-churn queue updates;
touching indexed columns repeatedly.
</code></pre>
The important distinction:</p>
A query can be acceptable from a latency perspective
and still dangerous from a WAL perspective.
</code></pre>
For example, a backfill may run efficiently but generate enough WAL to delay replicas, overload archiving, and force frequent checkpoints.</p>
That is a reliability problem, even if the SQL itself is “fast.”</p>

Full-page images and WAL volume</h2>
After a checkpoint, the first modification to a data page may include a full-page image in WAL when full_page_writes</code> is enabled.</p>
Check:</p>
SHOW full_page_writes;
</code></pre>
full_page_writes</code> protects against torn pages after crashes. It can increase WAL volume, especially after checkpoints and during write-heavy workloads.</p>
This creates an important interaction:</p>
Frequent checkpoints
        ↓
More pages modified for the first time after each checkpoint
        ↓
More full-page images
        ↓
More WAL generated
        ↓
More pressure on WAL, archiving, replication, and checkpoints
</code></pre>
This is one reason overly frequent checkpoints can amplify IO pressure.</p>
A dangerous conclusion would be:</p>
Full-page writes generate WAL, so disable them.
</code></pre>
That is usually the wrong instinct. This setting exists for crash safety.</p>
A better conclusion:</p>
If full-page images are high,
understand checkpoint frequency, write patterns, and storage behavior.
</code></pre>

WAL compression</h2>
Postgres supports WAL compression:</p>
SHOW wal_compression;
</code></pre>
WAL compression applies to the full-page images written to WAL, so it reduces WAL volume most for workloads where those images dominate. The cost is additional CPU usage.</p>
This is a trade-off:</p>
Less WAL volume
More CPU work
Potentially lower replication/archive pressure
Potentially higher CPU pressure
</code></pre>
It is not universally good or bad.</p>
It should be evaluated against the actual bottleneck:</p>
Is the system WAL-volume bound?
Storage bound?
Network bound?
Archive bound?
Replica catch-up bound?
CPU bound?
</code></pre>
A reliability mistake is tuning WAL without knowing which resource is constrained.</p>

WAL and replication lag</h2>
Replication depends on WAL movement.</p>
A write-heavy event on the primary can become a replica incident:</p>
Bulk update generates WAL
        ↓
Primary writes WAL locally
        ↓
WAL is streamed to standby
        ↓
Standby writes, flushes, replays WAL
        ↓
Replica falls behind
        ↓
Read traffic sees stale data
        ↓
Failover safety decreases
</code></pre>
Primary-side check:</p>
SELECT
    application_name,
    state,
    sync_state,
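    -- Byte-based lag uses *_bytes aliases so it does not collide with the
    -- time-based write_lag, flush_lag, and replay_lag columns below.
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,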
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
</code></pre>
Standby-side check:</p>
SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    -- Keeps growing while the primary is idle, even if the standby is fully caught up.
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
</code></pre>
The WAL question during an incident is not only:</p>
How much WAL did we generate?
</code></pre>
It is:</p>
Can every downstream system consume it fast enough?
</code></pre>
That includes standbys, archives, logical replication consumers, backup systems, and change-data-capture pipelines.</p>

WAL archiving and backup risk</h2>
WAL is also central to point-in-time recovery.</p>
If WAL archiving fails, backups may no longer support the recovery objectives the team believes they have.</p>
Postgres continuous archiving relies on saving WAL files so that the database can be restored by replaying WAL from a base backup to a desired point in time. (PostgreSQL</a>)</p>
A common failure chain:</p>
flowchart TD
    A[Archive command starts failing] --> B[WAL files accumulate]
    B --> C[pg_wal grows]
    C --> D[Disk fills]
    D --> E([Primary becomes unstable or stops])
</code></pre>
Check archiver status:</p>
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time,
    stats_reset
FROM pg_stat_archiver;
</code></pre>
This view should be part of production monitoring when archiving is enabled.</p>
A healthy primary with broken archiving is not healthy from a disaster recovery perspective.</p>
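An alert on this view can be as simple as comparing two timestamps. A minimal sketch, assuming last_archived_time and last_failed_time have been read from pg_stat_archiver. The function name, state labels, and the 10-minute threshold are arbitrary choices for illustration. On a quiet system without archive_timeout, a long gap between archived segments can be legitimate.

```python
from datetime import datetime, timedelta
from typing import Optional


def archiver_state(
    last_archived_time: Optional[datetime],
    last_failed_time: Optional[datetime],
    now: datetime,
    max_age: timedelta = timedelta(minutes=10),  # hypothetical threshold
) -> str:
    """Classify archiving as 'failing', 'stale', or 'ok'."""
    # The most recent archiver event was a failure.
    if last_failed_time is not None and (
        last_archived_time is None or last_failed_time > last_archived_time
    ):
        return "failing"
    # Nothing has been archived recently (or ever).
    if last_archived_time is None or now - last_archived_time > max_age:
        return "stale"
    return "ok"


now = datetime(2026, 5, 4, 12, 0, 0)
print(archiver_state(now - timedelta(minutes=30), now - timedelta(minutes=1), now))  # failing
```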

WAL retention and replication slots</h2>
Replication slots can retain WAL required by a replica or logical consumer.</p>
That is useful.</p>
It is also dangerous.</p>
SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    wal_status,
    safe_wal_size,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
</code></pre>
A disconnected consumer leaves its slot behind (shown with active = false), and that slot still forces the primary to retain WAL.</p>
The incident can look like:</p>
Logical consumer stops
        ↓
Replication slot remains
        ↓
Primary keeps old WAL
        ↓
Disk usage grows
        ↓
Emergency cleanup decision required
</code></pre>
The dangerous command:</p>
SELECT pg_drop_replication_slot('slot_name');
</code></pre>
This can be correct if the slot is abandoned. It can also break a consumer that still needs the WAL.</p>
WAL retention is not just a database metric.</p>
It is ownership information:</p>
Who owns this slot?
What system consumes it?
How far behind is it allowed to get?
What alert fires?
What is the reinitialization procedure?
</code></pre>
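Those ownership questions can be encoded as a small registry that monitoring checks against. The sketch below is hypothetical throughout: the SlotPolicy structure, the slot names, the team name, and the budgets are invented for illustration. It assumes retained bytes per slot come from pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn).

```python
from typing import NamedTuple


class SlotPolicy(NamedTuple):
    owner: str               # team to notify (hypothetical)
    max_retained_bytes: int  # how far behind the slot is allowed to get


def review_slots(
    retained: dict[str, int],            # slot_name -> retained WAL in bytes
    policies: dict[str, SlotPolicy],     # slot_name -> registered policy
) -> list[str]:
    """Report slots with no registered owner or over their WAL budget."""
    findings = []
    for name, retained_bytes in sorted(retained.items()):
        policy = policies.get(name)
        if policy is None:
            findings.append(f"{name}: no registered owner")
        elif retained_bytes > policy.max_retained_bytes:
            findings.append(f"{name}: over budget, notify {policy.owner}")
    return findings


GIB = 1024 ** 3
policies = {"cdc_orders": SlotPolicy("data-platform", 2 * GIB)}
print(review_slots({"cdc_orders": 5 * GIB, "old_migration": 100}, policies))
```

A slot that appears in pg_replication_slots but not in the registry is a finding in its own right, regardless of how little WAL it currently retains.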

Monitoring checkpoint behavior</h2>
On newer PostgreSQL versions, checkpoint-related statistics are exposed separately through pg_stat_checkpointer</code>; on older versions, similar counters are found in pg_stat_bgwriter</code>. The exact view and column names vary by version, so monitoring queries should match the Postgres version you operate. PostgreSQL’s monitoring documentation describes these cumulative statistics views as the place to inspect server activity. (PostgreSQL</a>)</p>
For PostgreSQL 17 and later:</p>
SELECT
    num_timed,
    num_requested,
    num_done,       -- added in PostgreSQL 18; omit this column on 17
    write_time,
    sync_time,
    buffers_written,
    stats_reset
FROM pg_stat_checkpointer;
</code></pre>
For PostgreSQL 16 and earlier:</p>
SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time,
    buffers_checkpoint,
    buffers_backend,
    buffers_backend_fsync,
    stats_reset
FROM pg_stat_bgwriter;
</code></pre>
The operational interpretation:</p>
Many requested checkpoints:
WAL volume may be forcing checkpoints.

High checkpoint write/sync time:
storage may be struggling with checkpoint work.

High backend buffer writes:
foreground sessions may be doing writes themselves,
which can increase user-visible latency.

Frequent checkpoint warnings:
checkpoint/WAL sizing may not match workload.
</code></pre>
The goal is not to obsess over one counter.</p>
The goal is to detect whether checkpoint work is smooth and predictable or bursty and user-visible.</p>
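Because these counters are cumulative, the useful signal is the change between two snapshots. A minimal sketch, assuming each snapshot is a dict built from num_timed and num_requested (or checkpoints_timed and checkpoints_req on older versions). The dict keys and function name are hypothetical.

```python
def requested_checkpoint_share(before: dict, after: dict) -> float:
    """Fraction of checkpoints between two snapshots that were requested.

    Each snapshot has the keys 'timed' and 'requested', copied from the
    cumulative checkpoint counters.
    """
    timed = after["timed"] - before["timed"]
    requested = after["requested"] - before["requested"]
    if timed < 0 or requested < 0:
        raise ValueError("counters went backwards; statistics were probably reset")
    total = timed + requested
    return requested / total if total > 0 else 0.0


share = requested_checkpoint_share(
    {"timed": 100, "requested": 4},
    {"timed": 106, "requested": 22},
)
print(f"{share:.0%} of recent checkpoints were requested")  # 75% ...
```

A share that jumps from near zero to a majority during a deploy or backfill is a stronger signal than the absolute counter values.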

Logging checkpoints</h2>
You can check whether checkpoint logging is enabled:</p>
SHOW log_checkpoints;
</code></pre>
To enable it (it is on by default since PostgreSQL 15):</p>
ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
</code></pre>
Checkpoint logs can show:</p>
when checkpoints start and finish;
how much was written;
how long writing took;
how long syncing took;
whether checkpoints are requested or timed;
whether the system is checkpointing too often.
</code></pre>
This is useful during investigation because checkpoint problems are often temporal.</p>
A graph may show latency spikes every few minutes.</p>
Checkpoint logs can confirm whether those spikes correlate with checkpoint activity.</p>

WAL and disk-full incidents</h2>
pg_wal</code> filling the disk is one of the most direct WAL-related outages.</p>
Possible causes:</p>
archive failures;
replication slot retention;
replica disconnected;
logical replication consumer stopped;
long base backup;
too much WAL generated too quickly;
max_wal_size or wal_keep_size set too high for the available disk;
storage capacity too low;
unexpected bulk operation.
</code></pre>
A useful filesystem-level check:</p>
du -sh "$PGDATA/pg_wal"
</code></pre>
From SQL, you can inspect WAL directory files if permissions allow:</p>
SELECT
    count(*) AS wal_files,
    pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
</code></pre>
Disk-full incidents are dangerous because Postgres may be unable to continue writing WAL.</p>
At that point, this is not a tuning issue. It is an availability incident.</p>
The immediate question becomes:</p>
Why is WAL being retained or generated faster than expected,
and what can be safely removed, advanced, paused, or fixed?
</code></pre>
Deleting files manually from pg_wal</code> is not a safe operation. It can make crash recovery impossible and break the cluster.</p>

WAL-heavy migrations</h2>
Some migrations generate much more WAL than teams expect.</p>
Examples:</p>
UPDATE users
SET normalized_email = lower(email);
</code></pre>
DELETE FROM events
WHERE created_at < now() - interval '180 days';
</code></pre>
CREATE INDEX CONCURRENTLY idx_events_tenant_created
ON events (tenant_id, created_at);
</code></pre>
ALTER TABLE orders
ADD COLUMN total_cents bigint DEFAULT 0;
</code></pre>
Depending on version, table structure, defaults, and operation type, schema changes may be metadata-only or may rewrite substantial data. Large updates and deletes can generate WAL, create dead tuples, pressure autovacuum, and increase replication lag.</p>
A safer operational pattern for backfills:</p>
process in small batches;
sleep between batches;
measure WAL rate;
watch replication lag;
watch archive status;
watch checkpoint frequency;
keep transactions short;
make progress resumable;
stop quickly if pressure rises.
</code></pre>
Example batch shape:</p>
WITH batch AS (
    SELECT id
    FROM users
    WHERE normalized_email IS NULL
      AND email IS NOT NULL  -- rows with a NULL email would otherwise be reselected forever
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;
</code></pre>
The exact batch size is workload-specific.</p>
The reliability principle is stable:</p>
A migration should have a pressure budget,
not just a correctness test.
</code></pre>
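A pressure budget can be enforced by the loop that drives the batches. The sketch below is one possible shape, not a definitive implementation. The two callables are hypothetical hooks: run_batch would execute one batch statement and return the number of rows updated, and replica_lag_bytes would read the current replay lag from pg_stat_replication. Database access is deliberately left out so the control flow stays visible.

```python
import time
from typing import Callable


def run_backfill(
    run_batch: Callable[[], int],          # runs one batch, returns rows updated
    replica_lag_bytes: Callable[[], int],  # current replay lag in bytes
    max_lag_bytes: int,                    # the pressure budget
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    max_waits: int = 60,
) -> int:
    """Run batches until none remain, pausing while lag exceeds the budget."""
    total = 0
    waits = 0
    while True:
        if replica_lag_bytes() > max_lag_bytes:
            waits += 1
            if waits > max_waits:
                # Stop quickly instead of waiting forever under pressure.
                raise RuntimeError("lag budget exceeded for too long; stopping")
            sleep(pause_seconds)
            continue
        waits = 0
        updated = run_batch()
        if updated == 0:
            return total
        total += updated
        sleep(pause_seconds)  # pace batches even when lag is fine
```

Because each batch commits on its own and selects only unprocessed rows, the loop can be stopped and restarted without losing progress. The same check could be extended to archive failures or checkpoint frequency.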

Why “fast on staging” is not enough</h2>
WAL behavior depends on production realities:</p>
table size;
index count;
row width;
update pattern;
checkpoint timing;
full-page writes;
storage latency;
replica speed;
archive bandwidth;
logical consumers;
autovacuum state;
concurrent workload.
</code></pre>
A staging database with small tables and no replicas cannot reveal the true WAL cost of a production backfill.</p>
A migration may pass every functional test and still be operationally unsafe.</p>
The better pre-flight question:</p>
How much WAL will this generate,
and what systems must absorb that WAL?
</code></pre>
That question changes how teams design migrations.</p>

Crash recovery time is part of reliability</h2>
Checkpoints influence crash recovery.</p>
If checkpoints are very far apart, there may be more WAL to replay after a crash. If checkpoints are too frequent, normal operation may suffer from excessive checkpoint IO.</p>
This is a trade-off:</p>
Less frequent checkpoints:
potentially smoother normal operation,
more WAL to replay after crash.

More frequent checkpoints:
less WAL to replay,
more frequent checkpoint IO,
potentially more full-page image WAL.
</code></pre>
The right balance depends on recovery objectives, write workload, storage capacity, and latency requirements.</p>
A database that is fast during normal operation but takes too long to recover may not satisfy the business reliability target.</p>
A database that checkpoints too aggressively may create latency incidents during normal traffic.</p>
Reliability is the balance, not one extreme.</p>

A practical WAL and checkpoint health snapshot</h2>
This is not a complete runbook, but it is a useful investigation snapshot.</p>
WAL settings:</p>
SELECT
    name,
    setting,
    unit,
    context
FROM pg_settings
WHERE name IN (
    'wal_level',
    'synchronous_commit',
    'full_page_writes',
    'wal_compression',
    'checkpoint_timeout',
    'checkpoint_completion_target',
    'checkpoint_warning',
    'max_wal_size',
    'min_wal_size',
    'archive_mode',
    'archive_command',
    'max_slot_wal_keep_size'
)
ORDER BY name;
</code></pre>
WAL generation:</p>
SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;
</code></pre>
WAL directory size:</p>
SELECT
    count(*) AS wal_files,
    pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
</code></pre>
Archiving:</p>
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
</code></pre>
Replication lag:</p>
SELECT
    application_name,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
</code></pre>
Replication slots:</p>
SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
</code></pre>
Checkpoint stats, PostgreSQL 17 and later:</p>
SELECT
    num_timed,
    num_requested,
    num_done,       -- added in PostgreSQL 18; omit this column on 17
    write_time,
    sync_time,
    buffers_written,
    stats_reset
FROM pg_stat_checkpointer;
</code></pre>
Checkpoint stats, PostgreSQL 16 and earlier:</p>
SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time,
    buffers_checkpoint,
    buffers_backend,
    buffers_backend_fsync,
    stats_reset
FROM pg_stat_bgwriter;
</code></pre>
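Raw checkpoint counters are easier to read as a ratio. The sketch below assumes PostgreSQL 17 or newer; on older versions the same calculation works with checkpoints_timed and checkpoints_req from pg_stat_bgwriter. A high share of requested checkpoints often means WAL volume is forcing checkpoints before checkpoint_timeout expires, although manual CHECKPOINT commands and base backups also count as requested.

```sql
SELECT
    num_timed,
    num_requested,
    -- Share of checkpoints that were forced rather than scheduled by the timer.
    round(100.0 * num_requested / nullif(num_timed + num_requested, 0), 1) AS pct_requested,
    stats_reset
FROM pg_stat_checkpointer;
```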
WAL-heavy statements (requires the pg_stat_statements</code> extension on PostgreSQL 13 or newer):</p>
SELECT
    calls,
    pg_size_pretty(wal_bytes) AS total_wal,
    pg_size_pretty((wal_bytes / greatest(calls, 1))::numeric) AS wal_per_call,
    mean_exec_time,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE wal_bytes > 0
ORDER BY wal_bytes DESC
LIMIT 20;
</code></pre>
The purpose of this snapshot is to connect symptoms:</p>
high WAL generation;
frequent checkpoints;
archive failures;
replica lag;
slot retention;
storage growth;
write latency;
migration activity.
</code></pre>
A WAL incident is rarely visible through one metric alone.</p>

Common anti-patterns</h2>
Treating WAL as a storage nuisance</h3>
pg_wal</code> is not garbage. It is required for crash recovery, replication, and backups.</p>
Increasing max_wal_size</code> without understanding the workload</h3>
This may reduce checkpoint frequency, but it does not explain why WAL generation changed.</p>
Ignoring archiver failures</h3>
A database can keep serving traffic while silently losing point-in-time recovery capability.</p>
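One way to make archiver failures visible is to check whether the most recent attempt failed and how many segments are waiting. This is a sketch rather than a finished alert; pg_ls_archive_statusdir() is available from PostgreSQL 12 and is restricted by default to superusers and members of pg_monitor.

```sql
-- Is the archiver failing right now, and how stale is the last success?
SELECT
    archived_count,
    failed_count,
    last_archived_time,
    last_failed_time,
    COALESCE(last_failed_time > last_archived_time,
             last_failed_time IS NOT NULL) AS failing_now,
    now() - last_archived_time AS since_last_success
FROM pg_stat_archiver;

-- How many WAL segments are still waiting to be archived?
SELECT count(*) AS segments_waiting
FROM pg_ls_archive_statusdir()
WHERE name LIKE '%.ready';
```

A steadily growing segments_waiting value while traffic looks healthy is the silent failure mode described above.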
Letting replication slots have no owner</h3>
An abandoned slot can retain WAL until the primary disk is in danger.</p>
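A hedged sketch of two guardrails for this, assuming PostgreSQL 13 or newer. The 50GB cap is an illustrative value, not a recommendation, and it comes with a trade-off: a slot that exceeds the cap is invalidated and its consumer has to be rebuilt.

```sql
-- Cap how much WAL any slot may retain (the default of -1 means unlimited).
ALTER SYSTEM SET max_slot_wal_keep_size = '50GB';  -- assumed value
SELECT pg_reload_conf();

-- List slots that nobody is currently consuming.
SELECT
    slot_name,
    slot_type,
    wal_status,
    pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
WHERE NOT active;
```

Every slot returned by the second query should map to a known owner who can say whether it is still needed.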
Running large backfills without a WAL budget</h3>
A backfill should be planned around WAL rate, replica lag, archive capacity, and checkpoint pressure.</p>
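A WAL budget usually translates into small batches with pauses between them. The sketch below assumes a hypothetical orders table with a positive bigint id primary key and created_at and processed_at columns; the batch size and pause are placeholders to tune against observed WAL rate and replica lag. It relies on transaction control inside DO (PostgreSQL 11 or newer) and must not be run inside an explicit transaction block.

```sql
DO $$
DECLARE
    batch_size CONSTANT int := 5000;                  -- assumed batch size
    pause_seconds CONSTANT double precision := 1.0;   -- assumed pause
    last_id bigint := 0;                              -- assumes ids are positive
    batch_max_id bigint;
BEGIN
    LOOP
        -- Walk the table in primary-key order so every pass makes progress.
        WITH batch AS (
            SELECT id
            FROM orders
            WHERE id > last_id
            ORDER BY id
            LIMIT batch_size
        ),
        updated AS (
            UPDATE orders o
            SET processed_at = o.created_at
            FROM batch
            WHERE o.id = batch.id
              AND o.processed_at IS NULL
            RETURNING 1
        )
        SELECT max(id) INTO batch_max_id FROM batch;

        EXIT WHEN batch_max_id IS NULL;   -- no rows left to visit
        last_id := batch_max_id;

        COMMIT;                           -- release locks after each batch
        PERFORM pg_sleep(pause_seconds);  -- let replicas and the archiver catch up
    END LOOP;
END
$$;
```

Committing per batch keeps each transaction short, and the pause is the knob that trades backfill duration for WAL pressure.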
Using staging to estimate production WAL cost</h3>
Small data, fewer indexes, and missing replicas make staging a poor predictor of WAL impact.</p>
Manually deleting WAL files</h3>
This is not a safe incident response pattern. It can destroy recovery guarantees.</p>

Why WAL and checkpoint incidents are good simulation material</h2>
WAL/checkpoint incidents are excellent for training because the symptoms are distributed across the system.</p>
The application may show write latency.
The database may show frequent checkpoints.
The replica may show lag.
The backup system may show archive failures.
The disk may show pg_wal</code> growth.
The migration system may show a “successful” backfill.
The team may be tempted to change settings without understanding the pressure chain.</p>
A realistic simulation can force decisions like:</p>
Is the primary overloaded by WAL writes or normal query IO?
Is checkpoint activity causing latency spikes?
Is a bulk operation generating too much WAL?
Is the replica behind because it is slow or because the primary is producing too much WAL?
Is archiving broken or merely delayed?
Is a replication slot safe to drop?
Should the team pause a migration, throttle a job, increase WAL capacity, tune checkpoints, or protect user traffic first?
</code></pre>
This is not about memorizing pg_stat_wal</code>.</p>
It is about understanding the system consequences of writes.</p>
Articles can explain WAL mechanics.
Dashboards can expose WAL rates.
Simulations teach teams how WAL pressure changes operational decisions.</p>

Conclusion</h2>
WAL and checkpoints are invisible when healthy and unavoidable when they fail.</p>
WAL protects durability and enables crash recovery, replication, archiving, and point-in-time recovery. Checkpoints bound recovery work and move dirty data pages to disk. Together, they form the storage reliability backbone of Postgres.</p>
But that backbone has operational limits.</p>
Write-heavy workloads generate WAL.
WAL must be written, archived, streamed, retained, and replayed.
Checkpoints must flush dirty data.
Storage must absorb bursts.
Replicas and backup systems must keep up.
Operators must understand when a “database slowdown” is really WAL pressure.</p>
The dangerous phrase is:</p>
It is just WAL.
</code></pre>
The better reliability question is:</p>
What is generating this WAL, what systems must consume it, and what happens if they cannot keep up?
</code></pre>
That question turns WAL and checkpoints from internal Postgres machinery into practical production reliability signals.</p>


Postgres locks: how one ALTER TABLE can stop your product
2026-04-25T00:00:00+00:00
Postgres locks are not a bug.</p>
They are one of the reasons Postgres can safely protect your data while many users, services, jobs, migrations, and background processes are touching the same database at the same time.</p>
The problem is that locks are often invisible until they are not.</p>
A migration that looked harmless in staging can freeze production traffic.
A long-running transaction can block a schema change.
A background job can hold a lock longer than expected.
A single ALTER TABLE</code> can create a queue of blocked queries behind it.</p>
From the outside, this often looks like:</p>
The application is slow.
Requests are timing out.
Postgres has many active connections.
CPU is not necessarily high.
The database “looks stuck”.
</code></pre>
But Postgres may not be stuck at all. It may be doing exactly what it was designed to do: preserving consistency.</p>

The dangerous misunderstanding</h2>
Many teams think about locks only when they explicitly run something like:</p>
LOCK TABLE users;
</code></pre>
But most Postgres locks are not written manually. They are acquired automatically by normal SQL operations.</p>
For example:</p>
SELECT * FROM orders WHERE id = 42;
UPDATE orders SET status = 'paid' WHERE id = 42;
ALTER TABLE orders ADD COLUMN processed_at timestamptz;
CREATE INDEX orders_created_at_idx ON orders(created_at);
DELETE FROM sessions WHERE expires_at < now();
</code></pre>
All of these can involve locks.</p>
Usually, that is fine. Most locks are short-lived and harmless. The incident starts when a lock is held longer than expected, or when a lock request waits behind another transaction while new queries pile up behind it.</p>
This is the part that surprises people: the most damaging session is not always the one using the most CPU. Sometimes it is just waiting.</p>

A simple lock queue scenario</h2>
Imagine a busy table:</p>
CREATE TABLE accounts (
    id bigint PRIMARY KEY,
    email text NOT NULL,
    status text NOT NULL
);
</code></pre>
The application constantly runs queries like:</p>
SELECT *
FROM accounts
WHERE id = $1;
</code></pre>
Now a migration starts:</p>
ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
</code></pre>
Depending on the operation and Postgres version, this may be fast. But it still needs a table lock. If another transaction is already touching the table in a way that conflicts, the ALTER TABLE</code> waits.</p>
That sounds safe: the migration is waiting, not blocking, right?</p>
Not quite.</p>
Once the ALTER TABLE</code> is waiting for a strong lock, later application queries may queue behind it. The result can look like the whole table is frozen.</p>
A simplified chain:</p>
flowchart TD
    A[Long transaction touches accounts] --> B[ALTER TABLE waits for lock]
    B --> C[New application queries arrive]
    C --> D[They queue behind the pending ALTER TABLE]
    D --> E[Connection pool fills]
    E --> F[Requests time out]
    F --> G([Incident])
</code></pre>
The migration may not be consuming CPU. It may not be doing heavy IO. It may simply be waiting.</p>
But its position in the lock queue can still damage production traffic.</p>
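The queue is easy to reproduce on a disposable database with the accounts table above. The following sketch uses four separate sessions, shown as comments; run each part in its own connection and never against production.

```sql
-- Session 1: open a transaction, read the table, and leave it open.
BEGIN;
SELECT * FROM accounts WHERE id = 1;

-- Session 2: the migration. It waits for session 1 to finish.
ALTER TABLE accounts ADD COLUMN deleted_at timestamptz;

-- Session 3: an ordinary read. It now waits behind session 2.
SELECT * FROM accounts WHERE id = 1;

-- Session 4: inspect granted and pending locks on the table.
SELECT
    l.pid,
    l.mode,
    l.granted,
    a.state,
    left(a.query, 60) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'relation'
  AND l.relation = 'accounts'::regclass
ORDER BY l.granted DESC, l.pid;
```

Session 1 holds a granted AccessShareLock, session 2 has an ungranted AccessExclusiveLock, and session 3 has an ungranted AccessShareLock. Committing or rolling back session 1 releases the whole queue.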

Locks are about compatibility</h2>
Postgres has different lock modes. They are not all equal.</p>
A normal SELECT</code> does not block another normal SELECT</code>. Many operations can safely happen together. The problem appears when two operations require incompatible locks.</p>
You do not need to memorize the entire lock matrix to respond well to incidents, but you do need the mental model:</p>
Weak locks allow many operations to continue.
Strong locks conflict with more operations.
Some schema changes require very strong locks.
A waiting strong lock can cause later queries to queue.
</code></pre>
For product engineers, the important lesson is this:</p>

“This query is fast locally” does not mean “this operation is operationally safe in production.”</p>
</blockquote>
Lock behavior depends on concurrency, transaction duration, table size, workload, and timing.</p>

The classic villain: idle in transaction</code></h2>
One of the most common lock-related problems is not a dramatic query. It is a transaction that started, did some work, and then remained open.</p>
For example, application code does something like:</p>
BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application waits on network, external API, user input, or crashes before COMMIT
</code></pre>
From the database side, the session may become:</p>
idle in transaction
</code></pre>
That means it is not actively running a query, but the transaction is still open.</p>
You can find old transactions with:</p>
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
</code></pre>
And specifically:</p>
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;
</code></pre>
An idle in transaction</code> session can be harmful because it may:</p>
hold locks;
prevent vacuum cleanup;
keep old row versions visible;
interfere with migrations;
increase table and index bloat over time;
confuse incident responders because it looks inactive.
</code></pre>
The session is “idle”, but the transaction is not harmless.</p>
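Postgres can bound this risk with idle_in_transaction_session_timeout, which terminates a session that stays idle inside an open transaction for longer than the limit. A minimal sketch, assuming a hypothetical application role named app_user and an illustrative 60-second limit:

```sql
-- Applies to new sessions opened by this role.
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';

-- Confirm the per-role setting.
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'app_user';
```

The application sees a dropped connection when the limit is hit, so the value should be chosen together with the team that owns the connection handling code.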

A safer way to inspect blockers</h2>
Modern Postgres gives you a very useful function:</p>
pg_blocking_pids(pid)
</code></pre>
You can use it to see which sessions are blocking others:</p>
SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocked.application_name AS blocked_app,
    blocked.state AS blocked_state,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
</code></pre>
This is often easier and safer than writing a large manual join over pg_locks</code>.</p>
The result can tell you:</p>
Who is blocked?
Who is blocking them?
How long has each query been running?
Which application opened the session?
Is the blocker active or idle in transaction?
</code></pre>
But this query is not a complete incident response plan. It only answers one question: “Who is blocking whom?”</p>
The harder question is: “What is the safest action now?”</p>
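A first step toward that answer is sizing each blocker's blast radius. The sketch below groups the same pg_blocking_pids() output by blocker. Note that a session waiting for a strong lock, such as a pending ALTER TABLE, can itself appear as a blocker of the sessions queued behind it.

```sql
SELECT
    blocker_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.xact_start AS blocking_xact_age,
    count(*) AS blocked_sessions,
    max(now() - blocked.query_start) AS longest_wait
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
GROUP BY blocker_pid, blocking.application_name, blocking.state, blocking.xact_start
ORDER BY blocked_sessions DESC;
```

A blocker with many blocked sessions and an old transaction is usually the first one to understand, which is not the same as the first one to kill.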

Why killing the blocker is not always the right move</h2>
When you find a blocking session, the tempting move is:</p>
SELECT pg_terminate_backend(<pid>);
</code></pre>
That can be the correct action in some incidents. But it is dangerous as a reflex.</p>
There are two related functions:</p>
SELECT pg_cancel_backend(<pid>);
SELECT pg_terminate_backend(<pid>);
</code></pre>
The difference matters.</p>
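pg_cancel_backend</code> asks Postgres to cancel the current query. The connection stays alive. It has no effect on a session that is idle in transaction, because there is no running query to cancel.</p>
pg_terminate_backend</code> terminates the whole backend connection. If it is inside a transaction, the transaction is rolled back.</p>
In Postgres the rollback itself is usually quick, but the work it discards is not free: it leaves dead rows for vacuum to clean up and may have to be redone. Termination can also trigger application retries, break a migration, or cause a thundering herd of reconnects.</p>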
A better incident question is:</p>
Is this blocker safe to cancel?
Is it part of a migration?
Is it user traffic?
Is it a background job?
Is it already rolling back?
Will the application retry immediately?
Will killing it unblock the critical path or create more load?
</code></pre>
The existence of a blocker tells you where pressure is accumulating. It does not automatically tell you what to kill.</p>

Schema migrations and lock risk</h2>
Schema migrations deserve special respect in Postgres.</p>
Consider:</p>
ALTER TABLE users
ADD COLUMN last_seen_at timestamptz;
</code></pre>
This can be quick. But “quick” is not the same as “risk-free”.</p>
Now consider:</p>
ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
</code></pre>
Or:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);
</code></pre>
Or:</p>
CREATE INDEX orders_created_at_idx
ON orders(created_at);
</code></pre>
Some operations scan data. Some require stronger locks. Some block writes. Some interact badly with long transactions. Some are safe on small tables and dangerous on large ones.</p>
For indexes, the production-safe form is often:</p>
CREATE INDEX CONCURRENTLY orders_created_at_idx
ON orders(created_at);
</code></pre>
But CONCURRENTLY</code> is not magic. It reduces blocking, but it can still:</p>
take a long time;
consume CPU and IO;
fail and leave an invalid index;
conflict with other schema changes;
increase load during an already sensitive period.
</code></pre>
You can check invalid indexes with:</p>
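SELECT
    schemaname,
    tablename,
    indexname
FROM pg_indexes
WHERE (schemaname, indexname) IN (
    -- Match on schema as well as name, so that identically named
    -- indexes in other schemas are not reported by mistake.
    SELECT n.nspname, c.relname
    FROM pg_index ix
    JOIN pg_class c ON c.oid = ix.indexrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE NOT ix.indisvalid
);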
</code></pre>
A cleaner version using catalog tables:</p>
SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    i.relname AS index_name,
    ix.indisvalid,
    ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
   OR ix.indisready = false;
</code></pre>
This is useful after a failed concurrent index build.</p>

Use timeouts as guardrails</h2>
One of the simplest ways to reduce lock-related blast radius is to use timeouts during migrations.</p>
For example:</p>
SET lock_timeout = '2s';
SET statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
</code></pre>
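lock_timeout</code> means: do not wait forever to acquire a lock. statement_timeout</code> means: abort any statement that runs longer than the limit, including the time it spends waiting for locks.</p>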
This is valuable because the worst migration is often not the one that fails. It is the one that waits silently and causes application traffic to queue behind it.</p>
A common migration pattern is:</p>
BEGIN;
SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;

COMMIT;
</code></pre>
However, be careful with commands like CREATE INDEX CONCURRENTLY</code>: they cannot run inside a normal transaction block, so SET LOCAL</code> does not apply to them.</p>
Set the timeouts at the session level instead:</p>
SET lock_timeout = '2s';
SET statement_timeout = '30min';

CREATE INDEX CONCURRENTLY idx_accounts_email
ON accounts(email);
</code></pre>
Timeouts do not make a migration safe by themselves. They are guardrails. They help a risky operation fail early instead of becoming an incident.</p>
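Guardrails are more reliable when they do not depend on every migration author remembering them. One option, sketched here with a hypothetical migrator role and illustrative values, is to attach the timeouts to the role the migration tool connects as:

```sql
-- "migrator" is a hypothetical role used only by the migration tool.
ALTER ROLE migrator SET lock_timeout = '2s';
ALTER ROLE migrator SET statement_timeout = '15min';

-- Confirm what new sessions for that role will inherit.
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname = 'migrator';
```

Individual migrations can still override these defaults with SET or SET LOCAL when a specific operation needs a different limit.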

Detecting lock pressure before users notice</h2>
During normal operation, you can inspect waiting sessions:</p>
SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
</code></pre>
You can also summarize wait events:</p>
SELECT
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
</code></pre>
This helps separate lock waits from other kinds of waits.</p>
For example:</p>
Lock waits suggest contention.
IO waits suggest disk or storage pressure.
Client waits may indicate application behavior.
LWLock waits may indicate internal contention.
</code></pre>
But again, this is not enough by itself. You still need context:</p>
Did a migration just start?
Did a deploy just happen?
Is a background job running?
Did traffic increase?
Are blocked sessions all from one service?
Are blockers idle in transaction?
</code></pre>
Locks become understandable only when connected to system events.</p>
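Ad hoc queries only show the present moment. To leave a trail for later review, Postgres can log every lock wait that lasts longer than deadlock_timeout. A minimal sketch, assuming superuser access:

```sql
-- Log a message whenever a session waits longer than deadlock_timeout for a lock.
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- The threshold that log_lock_waits uses (1s by default).
SHOW deadlock_timeout;
```

The log lines include the waiting statement and the process IDs holding the lock, which makes it easier to line up lock waits with deploys and job schedules after the fact.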

Row locks can also cause incidents</h2>
Not all dangerous locks are table-level migration locks.</p>
Application-level transactions can block each other on rows.</p>
For example:</p>
BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 1;

-- transaction remains open
</code></pre>
Another transaction tries:</p>
UPDATE accounts
SET balance = balance + 100
WHERE id = 1;
</code></pre>
The second transaction waits.</p>
This is normal. But if the first transaction waits on an external API before committing, you have created database contention from application behavior.</p>
A dangerous pattern:</p>
BEGIN
  update database row
  call external service
  wait for response
  update another row
COMMIT
</code></pre>
A safer pattern is often:</p>
do external work before opening transaction;
keep the transaction small;
avoid user/network waits inside transactions;
commit quickly;
make retry behavior explicit.
</code></pre>
Postgres can handle concurrency well, but it cannot make long business transactions short.</p>
That is an application architecture problem, not just a database problem.</p>

SELECT ... FOR UPDATE</code> is powerful and dangerous</h2>
Many systems use row-level locking intentionally:</p>
SELECT *
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE;
</code></pre>
This can be correct, but under concurrency it can create contention.</p>
For job queues, a better pattern is often:</p>
SELECT *
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
</code></pre>
SKIP LOCKED</code> allows workers to skip rows already locked by other workers.</p>
But this changes semantics. It is useful for queues and work distribution, not for every business operation.</p>
The reliability lesson is that lock behavior is part of application design. It is not just a database implementation detail.</p>
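In a queue, the row lock only protects the job while the claiming transaction is open, so workers commonly record the claim in the same statement. A sketch of that pattern, assuming the jobs table has an id primary key and that 'running' is a valid status value (both are assumptions for illustration):

```sql
WITH next_job AS (
    SELECT id
    FROM jobs
    WHERE status = 'pending'
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED        -- skip rows another worker has already locked
)
UPDATE jobs j
SET status = 'running'
FROM next_job
WHERE j.id = next_job.id
RETURNING j.id;
```

If no row comes back, there was nothing claimable at that moment. The worker can then commit immediately and do the actual work outside the claiming transaction.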

Advisory locks: useful, but easy to forget</h2>
Postgres also supports advisory locks:</p>
SELECT pg_advisory_lock(12345);
</code></pre>
And:</p>
SELECT pg_advisory_unlock(12345);
</code></pre>
These are application-defined locks. They are useful for leader election, scheduled jobs, migration coordination, or preventing duplicate work.</p>
But they can also create mysterious incidents if not visible in normal application logs.</p>
You can inspect advisory locks with:</p>
SELECT
    a.pid,
    a.usename,
    a.application_name,
    l.locktype,
    l.mode,
    l.granted,
    now() - a.query_start AS query_age,
    left(a.query, 160) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
ORDER BY query_age DESC;
</code></pre>
Advisory locks are not bad. Hidden coordination is bad.</p>
If your system uses advisory locks, they should be named, documented, observable, and included in incident thinking.</p>
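Two variants are worth knowing because they fail more gracefully than a bare pg_advisory_lock call. The sketch reuses the illustrative key 12345 from above.

```sql
-- Non-blocking attempt: returns true if acquired, false if another session holds it.
SELECT pg_try_advisory_lock(12345) AS acquired;
-- ... do the work only if acquired is true ...
SELECT pg_advisory_unlock(12345);

-- Transaction-scoped lock: released automatically at COMMIT or ROLLBACK.
BEGIN;
SELECT pg_advisory_xact_lock(12345);
-- ... short critical section ...
COMMIT;
```

Session-level advisory locks stay held until they are explicitly unlocked or the connection closes, which is easy to get wrong behind a connection pooler. The transaction-scoped form avoids forgotten unlocks.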

A practical migration safety checklist</h2>
This is not a full migration playbook, but these questions catch many common problems.</p>
Before running a migration on a hot table, ask:</p>
How large is the table?
What lock level does this operation need?
Can it run concurrently?
Can it be split into smaller phases?
Does it scan or rewrite the table?
Can it fail quickly with lock_timeout?
Is there a rollback plan?
Are there long-running transactions right now?
Is traffic normal or elevated?
Will application retries amplify the problem?
Are dashboards ready for lock waits and pool saturation?
</code></pre>
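Two of those questions can be answered with a quick pre-flight check just before the migration runs. The sketch assumes the target table is accounts and uses an arbitrary 30-second threshold for an old transaction.

```sql
-- How large is the table, with and without indexes and TOAST?
SELECT
    pg_size_pretty(pg_total_relation_size('accounts'::regclass)) AS total_size,
    pg_size_pretty(pg_relation_size('accounts'::regclass)) AS heap_size;

-- Are there transactions that have been open for a while?
SELECT
    pid,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE xact_start < now() - interval '30 seconds'
  AND pid <> pg_backend_pid()
ORDER BY xact_start;
```

The second query lists every old transaction, not only those touching the target table, so it is a prompt for investigation rather than a definitive list of blockers.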
For large tables, prefer phased changes.</p>
For example, instead of immediately adding a strict constraint:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0);
</code></pre>
You may use:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;
</code></pre>
Then validate later:</p>
ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
</code></pre>
This pattern can reduce operational risk because adding the constraint metadata and validating existing rows are separated.</p>
Again, the point is not to memorize one trick. The point is to treat schema changes as production operations, not just code changes.</p>
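The same phased idea can be applied to SET NOT NULL on a large table. On PostgreSQL 12 and later, SET NOT NULL can skip its table scan when a valid CHECK constraint already proves the column contains no NULLs. A sketch for the users.email example, with a hypothetical constraint name:

```sql
-- 1. Add the rule without scanning existing rows.
ALTER TABLE users
ADD CONSTRAINT users_email_present_chk
CHECK (email IS NOT NULL) NOT VALID;

-- 2. Validate existing rows without blocking normal reads and writes.
ALTER TABLE users
VALIDATE CONSTRAINT users_email_present_chk;

-- 3. The validated CHECK lets this step avoid a full scan.
ALTER TABLE users
ALTER COLUMN email SET NOT NULL;

-- 4. The helper constraint is now redundant.
ALTER TABLE users
DROP CONSTRAINT users_email_present_chk;
```

Steps 1, 3, and 4 still take a brief strong lock, so they benefit from a lock_timeout like any other DDL.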

What teams often get wrong</h2>
They test migrations only on empty or tiny databases</h3>
A migration that takes 100 ms on staging may behave very differently on a 500 GB production table.</p>
They ignore concurrent workload</h3>
The table is not sitting idle in production. It is being read, written, vacuumed, indexed, and queried by multiple services.</p>
They forget old transactions</h3>
One forgotten transaction can turn a safe migration into a production incident.</p>
They run DDL without timeouts</h3>
A migration that waits forever can become a silent lock queue.</p>
They treat the database as isolated</h3>
The real incident may involve the app pool, retries, background jobs, dashboards, and human decisions.</p>

Why lock incidents are good simulation material</h2>
Lock incidents are especially valuable to practice because they are deceptive.</p>
They often do not look dramatic at first.</p>
CPU may be fine.
Memory may be fine.
The migration may appear to be “just waiting.”
The blocker may be “idle.”
The application may report generic timeout errors.</p>
A good simulation teaches the operational loop:</p>
flowchart TD
    A[Notice latency] --> B[Inspect active sessions]
    B --> C[Identify lock waits]
    C --> D[Find blockers]
    D --> E[Understand application context]
    E --> F[Choose safe mitigation]
    F --> G[Observe consequences]
    G --> H[Review why the system was vulnerable]
</code></pre>
The hard part is not running a query against pg_stat_activity</code>.</p>
The hard part is deciding what the result means under pressure.</p>
Should you cancel the migration?
Terminate the blocker?
Reduce application concurrency?
Disable a worker?
Roll back a deploy?
Wait?
Communicate impact?
Prevent retries?</p>
Those choices are where reliability skill is built.</p>

Conclusion</h2>
Postgres locks are not the enemy. They are part of how Postgres protects correctness.</p>
The incident happens when lock behavior meets production reality:</p>
large tables;
long transactions;
busy applications;
schema migrations;
connection pools;
background jobs;
retry storms;
unclear ownership;
time pressure.
</code></pre>
A single ALTER TABLE</code> can stop a product not because Postgres is fragile, but because production systems are concurrent.</p>
The right lesson is not “avoid locks.”
The right lesson is “understand the lock behavior of your changes before production does.”</p>
Articles and checklists can teach the concepts.
Queries can reveal symptoms.
But lock incidents require practiced judgment.</p>
Because in the middle of an incident, the question is rarely:</p>
Is there a lock?
</code></pre>
The real question is:</p>
Which action reduces risk without making the system worse?
</code></pre>


Schema migrations in Postgres: why safe SQL can be dangerous in production
2026-04-19T00:00:00+00:00
Schema migrations are one of the most common ways teams accidentally create Postgres incidents.</p>
The migration passes code review.
It works locally.
It runs instantly on staging.
The SQL is syntactically correct.
The change looks small.</p>
Then production traffic slows down, the connection pool fills, requests time out, and the incident channel starts with a familiar sentence:</p>
The database is stuck.
</code></pre>
Usually, Postgres is not stuck.</p>
It is enforcing the rules that keep data consistent while many transactions touch the same tables concurrently.</p>
The mistake is treating a schema migration as “just a code change.”</p>
In production, a migration is an operational event.</p>

The core problem: DDL changes concurrency</h2>
A normal application query changes data or reads data.</p>
A schema migration changes the shape of the database itself.</p>
That difference matters because Postgres must protect the table definition while other sessions are reading or writing rows. ALTER TABLE</code> has many subforms, and the official documentation notes that lock levels differ by subform; unless explicitly noted, ALTER TABLE</code> acquires an ACCESS EXCLUSIVE</code> lock, and when several subcommands are combined, Postgres uses the strictest required lock. (PostgreSQL</a>)</p>
That is the reliability risk.</p>
A migration may not be CPU-heavy.
It may not read much data.
It may not write many rows.
It may simply need a lock that conflicts with normal traffic.</p>
A migration incident often looks like this:</p>
flowchart TD
    A[Long-running transaction touches a hot table] --> B[Migration waits for a table lock]
    B --> C[New application queries arrive]
    C --> D[They queue behind the waiting migration]
    D --> E[Application pool fills]
    E --> F[Requests time out]
    F --> G[Retries increase pressure]
    G --> H([Production incident])
</code></pre>
The migration may be “waiting.”</p>
But waiting in the wrong place can still stop the product.</p>

Lock compatibility is the hidden part of migration safety</h2>
Postgres locks are not all equal.</p>
A regular SELECT</code> acquires an ACCESS SHARE</code> lock. INSERT</code>, UPDATE</code>, DELETE</code>, and MERGE</code> acquire ROW EXCLUSIVE</code> locks on the target table. CREATE INDEX</code> without CONCURRENTLY</code> acquires a SHARE</code> lock. ACCESS EXCLUSIVE</code> conflicts with every table-level lock mode and is the only table-level lock that blocks a plain SELECT</code>. (PostgreSQL</a>)</p>
This is why a schema change can have a much larger blast radius than expected.</p>
For example:</p>
ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
</code></pre>
This may be fast in many situations. But “fast” is not the same as “risk-free.”</p>
Even a short lock can be dangerous when:</p>
the table is hot;
transactions are long;
traffic is high;
the migration waits behind another session;
application timeouts are short;
retries are aggressive;
the deploy starts many app instances at once.
</code></pre>
The operational risk is not only how long the migration takes after it starts.</p>
It is also how long it waits before it can safely start.</p>
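The lock modes described above can be observed directly. This sketch assumes the accounts table from the earlier example and runs in a single session on a test database. The UPDATE matches no rows but still takes its table lock.

```sql
BEGIN;

SELECT 1 FROM accounts LIMIT 1;                     -- takes ACCESS SHARE
UPDATE accounts SET status = status WHERE false;    -- takes ROW EXCLUSIVE

-- Table-level locks this session holds on accounts.
SELECT mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid()
  AND locktype = 'relation'
  AND relation = 'accounts'::regclass
ORDER BY mode;

ROLLBACK;
```

The result should list AccessShareLock and RowExclusiveLock, both granted, and both are held until the transaction ends. That is why an open transaction that only ran a SELECT can still make an ALTER TABLE wait.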

The migration that waits can be worse than the migration that runs</h2>
A migration can damage traffic before it does any meaningful work.</p>
Suppose this transaction is open:</p>
BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application stays idle before COMMIT
</code></pre>
Now a migration runs:</p>
ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
</code></pre>
If the migration waits for a strong lock, later queries against accounts</code> can queue behind it.</p>
That queue can grow quickly:</p>
Session A: old transaction is still open
Session B: ALTER TABLE waits for lock
Session C: SELECT from application waits
Session D: UPDATE from application waits
Session E: SELECT from application waits
...
</code></pre>
This is why “the migration is only waiting” is not comforting.</p>
A waiting migration can become a traffic barrier.</p>
During an incident, look for blocked and blocking sessions:</p>
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 160) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 160) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
</code></pre>
And inspect sessions waiting on locks:</p>
SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
</code></pre>
The goal is not only to identify the blocking PID.</p>
The goal is to understand whether the migration has created a queue in front of production traffic.</p>

Use lock timeouts as blast-radius control</h2>
A migration should not wait forever for a lock on a hot table.</p>
Use a lock timeout:</p>
SET lock_timeout = '2s';
SET statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
</code></pre>
Inside a transaction:</p>
BEGIN;

SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;

COMMIT;
</code></pre>
This does not make the migration safe.</p>
It makes failure faster.</p>
That is valuable.</p>
A failed migration with a clear timeout is usually better than a migration that silently waits and causes traffic to queue behind it.</p>
The operational principle:</p>
Migrations should fail before they become incidents.
</code></pre>
But timeouts need to be chosen carefully. Too short, and safe migrations fail constantly. Too long, and the timeout no longer protects production traffic.</p>
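A short timeout combined with a few spaced retries is one way to balance those two failure modes. The sketch below is an illustration, not a standard tool: the attempt count, the 2-second timeout, and the backoff are assumptions, and many teams implement the retry in their migration runner instead.

```sql
DO $$
DECLARE
    max_attempts CONSTANT int := 5;   -- assumed retry budget
    succeeded boolean := false;
BEGIN
    FOR attempt IN 1..max_attempts LOOP
        BEGIN
            SET LOCAL lock_timeout = '2s';
            ALTER TABLE accounts ADD COLUMN archived_at timestamptz;
            succeeded := true;
            EXIT;                     -- leave the retry loop on success
        EXCEPTION WHEN lock_not_available THEN
            -- The lock wait timed out; the failed attempt has been rolled back.
            RAISE NOTICE 'attempt % of %: lock not acquired within 2s',
                attempt, max_attempts;
            PERFORM pg_sleep(attempt * 2);   -- simple linear backoff
        END;
    END LOOP;

    IF NOT succeeded THEN
        RAISE EXCEPTION 'could not acquire lock after % attempts', max_attempts;
    END IF;
END
$$;
```

Each failed attempt releases its place in the lock queue, so application traffic gets through between retries instead of piling up behind a single long wait.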

CREATE INDEX</code> is not always online</h2>
A classic migration:</p>
CREATE INDEX idx_orders_customer_id
ON orders (customer_id);
</code></pre>
This is a normal index build. It blocks writes to the table for the duration of the build, while reads continue.</p>
For production systems, teams often use:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
</code></pre>
Postgres documents CREATE INDEX CONCURRENTLY</code> as a way to create an index without locking out writes to the table. The same documentation also notes important caveats: concurrent index builds cannot run inside a transaction block, only one concurrent index build can run on a table at a time, and failed concurrent builds can leave an invalid index behind. (PostgreSQL</a>)</p>
That means this is invalid:</p>
BEGIN;

CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);

COMMIT;
</code></pre>
You need to run it outside a normal transaction block:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
</code></pre>
Many migration frameworks wrap migrations in transactions by default. That default is good for many schema changes, but it conflicts with CREATE INDEX CONCURRENTLY</code>.</p>
This is not just a syntax issue. It is a deployment-system issue.</p>
Your migration tooling must understand which changes need transactional execution and which changes need to run outside a transaction.</p>

Concurrent index creation can still hurt</h2>
CONCURRENTLY</code> reduces blocking. It does not make index creation free.</p>
A concurrent index build can still:</p>
scan a large table;
consume CPU;
consume disk IO;
generate WAL;
increase replication lag;
compete with autovacuum;
take a long time;
fail and leave an invalid index;
wait for old transactions;
interact badly with other maintenance.
</code></pre>
Monitor progress:</p>
SELECT
    p.pid,
    p.datname,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_total,
    p.blocks_done,
    round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2) AS blocks_pct,
    p.tuples_total,
    p.tuples_done,
    now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
</code></pre>
Check for invalid indexes after failure:</p>
SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    i.relname AS index_name,
    ix.indisvalid,
    ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
   OR ix.indisready = false
ORDER BY schema_name, table_name, index_name;
</code></pre>
An invalid index is easy to forget.</p>
The planner ignores an invalid index, so it never helps queries, but Postgres still maintains it on every write.</p>
That is exactly the kind of “cleanup later” detail that becomes reliability debt.</p>
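Cleaning up is usually a short, deliberate step. A minimal sketch, assuming the failed build left behind the idx_orders_customer_id index from the earlier example:

```sql
-- Drop the invalid leftover without blocking writes on the table.
-- Like the concurrent build, this cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;

-- Retry the build once the cause of the failure is understood.
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
```

On PostgreSQL 12 and later, REINDEX INDEX CONCURRENTLY is an alternative to dropping and recreating the index.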

Adding constraints safely</h2>
A constraint can be both logically correct and operationally expensive.</p>
For example:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0);
</code></pre>
On a large table, Postgres scans the existing rows to verify that they satisfy the new constraint, and the statement holds its exclusive table lock for the whole scan.</p>
A safer phased pattern:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;
</code></pre>
Then later:</p>
ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
</code></pre>
Postgres documentation explains that NOT VALID</code> skips the potentially lengthy scan of existing rows when adding foreign-key, CHECK</code>, or not-null constraints, while still applying the constraint to subsequent inserts or updates; the constraint is not considered valid for all existing rows until VALIDATE CONSTRAINT</code> is run. (PostgreSQL</a>)</p>
The validation step still scans the table, so it is not free. But it separates two concerns:</p>
Start enforcing the rule for new data
        ↓
Validate old data later under controlled conditions
</code></pre>
That separation is often the difference between a safe rollout and a production incident.</p>
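A phased rollout also creates a follow-up task that is easy to lose track of. As a rough sketch, this catalog query lists table constraints that were added with NOT VALID and have not been validated yet:

```sql
SELECT
    conrelid::regclass AS table_name,
    conname AS constraint_name,
    contype AS constraint_type  -- 'c' = check, 'f' = foreign key
FROM pg_constraint
WHERE NOT convalidated
  AND conrelid <> 0             -- skip domain constraints
ORDER BY table_name, constraint_name;
```

An empty result means no validation step is still pending in the current database.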

Foreign keys are operational changes too</h2>
Foreign keys are valuable. They protect data integrity.</p>
But adding one to a large, hot table is not just a metadata change.</p>
Example:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);
</code></pre>
A phased version:</p>
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;
</code></pre>
Then:</p>
ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;
</code></pre>
Postgres notes that adding a foreign key with NOT VALID</code> can reduce impact, and validation later does not need to lock out concurrent updates because new rows are already checked. Validation takes only a SHARE UPDATE EXCLUSIVE</code> lock on the altered table, and for a foreign key it also takes a ROW SHARE</code> lock on the referenced table. (PostgreSQL</a>)</p>
The important part is that a foreign key touches two tables operationally:</p>
the table that contains the foreign key;
the table being referenced.
</code></pre>
That matters during incidents.</p>
A migration on orders</code> can affect customers</code>.</p>
A team that only looks at one table may miss the real blocking chain.</p>
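One way to look at both sides of the relationship at once is to read pg_locks for both tables. This is only a sketch that reuses the orders and customers names from the example above:

```sql
SELECT
    l.pid,
    l.relation::regclass AS table_name,
    l.mode,
    l.granted,
    a.state,
    now() - a.xact_start AS transaction_age,
    left(a.query, 200) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'relation'
  -- pg_locks covers every database, so restrict to the current one
  AND l.database = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND l.relation IN ('orders'::regclass, 'customers'::regclass)
ORDER BY l.granted, a.xact_start;
```

Rows with granted = false sort first, which makes waiting sessions on either table easy to spot.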

SET NOT NULL</code> can be a table scan</h2>
This looks simple:</p>
ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
</code></pre>
But Postgres must know that no existing row violates the constraint.</p>
The documentation says SET NOT NULL</code> is ordinarily checked by scanning the whole table, unless a valid check constraint proves no nulls can exist or NOT VALID</code> is used in supported cases. (PostgreSQL</a>)</p>
A common safer pattern is:</p>
ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
</code></pre>
Then validate:</p>
ALTER TABLE users
VALIDATE CONSTRAINT users_email_not_null;
</code></pre>
Then apply the not-null marker when appropriate:</p>
ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
</code></pre>
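Once the column itself is marked NOT NULL, the helper check constraint is redundant. A possible final step, assuming the constraint name used above and PostgreSQL 12 or later (where a valid check constraint lets SET NOT NULL skip its own scan):

```sql
-- The column-level NOT NULL now enforces the rule, so the helper can go.
ALTER TABLE users
DROP CONSTRAINT users_email_not_null;
```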
The exact sequence depends on Postgres version, table structure, and whether you need a true column-level NOT NULL</code> constraint or a check constraint is enough for your use case.</p>
The reliability point is stable:</p>
Do not assume a one-line constraint change is operationally small.
</code></pre>

Adding a column is not always the risky part</h2>
Many teams focus on ADD COLUMN</code>.</p>
But the dangerous part is often what follows.</p>
Example:</p>
ALTER TABLE users
ADD COLUMN normalized_email text;
</code></pre>
Then:</p>
UPDATE users
SET normalized_email = lower(email);
</code></pre>
The first statement may be quick.</p>
The second statement may be a production event:</p>
large table scan;
many row updates;
large WAL generation;
replication lag;
autovacuum pressure;
index maintenance;
long transaction;
lock contention;
cache churn;
connection pool pressure.
</code></pre>
A safer backfill pattern uses batches:</p>
WITH batch AS (
    SELECT id
    FROM users
    WHERE normalized_email IS NULL
      AND email IS NOT NULL  -- otherwise rows with a NULL email are picked again forever
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;
</code></pre>
Repeat in a worker with:</p>
small batches;
short transactions;
sleep between batches;
progress tracking;
replication lag monitoring;
statement timeout;
ability to stop quickly.
</code></pre>
The batch size is not universal. It should be chosen based on production pressure.</p>
A backfill is not just SQL. It is a controlled workload.</p>
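The worker can live in application code or in the database. As one hedged sketch, the loop below uses a PL/pgSQL procedure, which requires PostgreSQL 11 or later for COMMIT inside a procedure. The procedure name, parameters, and default values are hypothetical:

```sql
CREATE PROCEDURE backfill_normalized_email(
    batch_size integer DEFAULT 1000,
    pause_seconds double precision DEFAULT 0.5
)
LANGUAGE plpgsql
AS $$
DECLARE
    rows_updated bigint;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id
            FROM users
            WHERE normalized_email IS NULL
              AND email IS NOT NULL
            ORDER BY id
            LIMIT batch_size
        )
        UPDATE users u
        SET normalized_email = lower(u.email)
        FROM batch
        WHERE u.id = batch.id;

        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;       -- nothing left to backfill

        COMMIT;                           -- end the short transaction for this batch
        PERFORM pg_sleep(pause_seconds);  -- leave room for replicas and autovacuum
    END LOOP;
END;
$$;

-- Must be called outside an explicit transaction block.
CALL backfill_normalized_email(1000, 0.5);
```

Cancelling the CALL stops the loop, and every batch that was already committed stays in place. That gives the "ability to stop quickly" without losing progress.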

The expand-and-contract pattern</h2>
For application-visible schema changes, the safest migrations are often multi-step.</p>
Suppose you want to rename a column from name</code> to full_name</code>.</p>
A dangerous migration:</p>
ALTER TABLE users RENAME COLUMN name TO full_name;
</code></pre>
If old application code still expects name</code>, it breaks.</p>
A safer pattern:</p>
1. Expand schema
2. Deploy code that writes both old and new shapes
3. Backfill old data into new shape
4. Deploy code that reads new shape
5. Stop using old shape
6. Contract schema later
</code></pre>
Example:</p>
ALTER TABLE users
ADD COLUMN full_name text;
</code></pre>
Application writes both:</p>
name = input.name
full_name = input.name
</code></pre>
Backfill:</p>
WITH batch AS (
    SELECT id
    FROM users
    WHERE full_name IS NULL
      AND name IS NOT NULL  -- otherwise rows with a NULL name are picked again forever
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET full_name = u.name
FROM batch
WHERE u.id = batch.id;
</code></pre>
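Before the contract step, it helps to confirm that the two columns really agree. A simple check, assuming the name and full_name columns from this example:

```sql
-- IS DISTINCT FROM treats NULL as a comparable value,
-- so rows where only one side is NULL are counted too.
SELECT count(*) AS mismatched_rows
FROM users
WHERE full_name IS DISTINCT FROM name;
```

A non-zero count suggests that some code path still writes only one of the columns, or that the backfill is incomplete.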
Later, after all code reads full_name</code> and old versions are gone:</p>
ALTER TABLE users
DROP COLUMN name;
</code></pre>
This is slower than a one-step migration.</p>
It is also much safer.</p>
Reliability often means accepting more deployment steps to reduce coupling between code and schema.</p>

Backward compatibility matters during rolling deploys</h2>
Many production systems deploy gradually.</p>
For some period, old and new application versions run at the same time.</p>
That means schema migrations must be compatible with both versions.</p>
Risky sequence:</p>
Migration removes column
        ↓
Old app instance still reads column
        ↓
Requests fail
</code></pre>
Safer sequence:</p>
New app stops depending on column
        ↓
Rollout completes
        ↓
Old app versions are gone
        ↓
Column is removed later
</code></pre>
This is not a database-only issue.</p>
It is a deployment architecture issue.</p>
A schema migration must be designed for:</p>
rolling deploys;
failed deploys;
rollbacks;
background workers;
cron jobs;
admin scripts;
BI tools;
old application instances;
read replicas;
migration retries.
</code></pre>
A migration that is safe in a single-process mental model may be unsafe in a distributed production system.</p>

Beware of defaults and rewrites</h2>
This migration looks innocent:</p>
ALTER TABLE events
ADD COLUMN source text DEFAULT 'web';
</code></pre>
Depending on Postgres version and the exact default expression, adding a column with a default may be metadata-only or may rewrite the table. Since PostgreSQL 11, a constant or other non-volatile default is stored in the catalog and applied without a rewrite. A volatile default such as clock_timestamp() still rewrites every row, and other forms of schema change can still be expensive.</p>
A safer mental model is:</p>
Do not judge by syntax.
Check the operational behavior for your exact Postgres version and exact command.
</code></pre>
When in doubt, use a phased approach:</p>
ALTER TABLE events
ADD COLUMN source text;
</code></pre>
Deploy code to write source</code>.</p>
Backfill existing rows in batches.</p>
Then add a default for future rows:</p>
ALTER TABLE events
ALTER COLUMN source SET DEFAULT 'web';
</code></pre>
This is often more verbose, but it gives you control over when the large data change happens.</p>

Large deletes are migrations too</h2>
Retention changes often appear as simple cleanup:</p>
DELETE FROM events
WHERE created_at < now() - interval '180 days';
</code></pre>
On a large table, this can be a serious write workload.</p>
It can:</p>
generate huge WAL;
create many dead tuples;
increase autovacuum pressure;
block or slow other queries;
increase replica lag;
hold locks for too long;
fill disk temporarily;
cause checkpoint pressure.
</code></pre>
A batched delete:</p>
WITH batch AS (
    SELECT id
    FROM events
    WHERE created_at < now() - interval '180 days'
    ORDER BY id
    LIMIT 5000
)
DELETE FROM events e
USING batch
WHERE e.id = batch.id;
</code></pre>
For very large time-series data, partitioning may be a better retention mechanism:</p>
DROP TABLE events_2025_01;
</code></pre>
Dropping an old partition is mostly a catalog change plus file removal, while deleting millions of rows from a single table writes WAL and leaves a dead tuple for every row.</p>
But partitioning has its own complexity. It should match the data lifecycle, query patterns, and operational ownership.</p>
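For illustration only, here is what that retention flow could look like if events had been created as a range-partitioned table. The column list is hypothetical:

```sql
CREATE TABLE events (
    id bigint NOT NULL,
    created_at timestamptz NOT NULL,
    payload jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2025_01 PARTITION OF events
FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Retention: detach first, then drop once nothing reads the old data.
ALTER TABLE events DETACH PARTITION events_2025_01;
DROP TABLE events_2025_01;
```

Detaching still takes a strong lock on the parent table for a moment. PostgreSQL 14 and later also offer DETACH PARTITION ... CONCURRENTLY, which cannot run inside a transaction block.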

Migrations should have observability</h2>
A production migration should not be a black box.</p>
Before running a risky migration, decide how you will observe it.</p>
Useful checks include active migration sessions:</p>
SELECT
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE pid <> pg_backend_pid()  -- this monitoring query matches its own text otherwise
  AND (query ILIKE '%alter table%'
       OR query ILIKE '%create index%'
       OR query ILIKE '%validate constraint%')
ORDER BY query_start ASC;
</code></pre>
Lock waits:</p>
SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
</code></pre>
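Knowing that a session waits is only half of the picture. A sketch that pairs each waiting session with the sessions blocking it, using pg_blocking_pids (available since PostgreSQL 9.6):

```sql
SELECT
    w.pid AS waiting_pid,
    left(w.query, 100) AS waiting_query,
    b.pid AS blocking_pid,
    b.state AS blocking_state,
    now() - b.xact_start AS blocking_transaction_age,
    left(b.query, 100) AS blocking_query
FROM pg_stat_activity w
CROSS JOIN LATERAL unnest(pg_blocking_pids(w.pid)) AS blocker(pid)
JOIN pg_stat_activity b ON b.pid = blocker.pid
WHERE w.wait_event_type = 'Lock'
ORDER BY b.xact_start ASC;
```

A blocker in state idle in transaction with a large transaction age is a common root of a lock queue behind DDL.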
Index build progress:</p>
SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_done,
    p.blocks_total,
    round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2) AS pct_done,
    now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid;
</code></pre>
Replication lag:</p>
SELECT
    application_name,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
</code></pre>
Dead tuple pressure after backfills or deletes:</p>
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
</code></pre>
The key question:</p>
How will we know the migration is becoming unsafe before users tell us?
</code></pre>

Migration tooling can create risk</h2>
Migration frameworks are useful. They provide ordering, history, repeatability, and deployment integration.</p>
But they can also create hazards.</p>
Common tooling problems:</p>
all migrations wrapped in one transaction;
no support for CREATE INDEX CONCURRENTLY;
no lock_timeout by default;
no statement_timeout policy;
no distinction between schema change and data backfill;
no pause/resume mechanism;
no progress visibility;
automatic retries of unsafe migrations;
running migrations during app startup;
running migrations from multiple app instances;
no clear owner during incidents.
</code></pre>
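If the framework does not set timeouts, the migration itself can. A minimal sketch with illustrative values; the application_name label is hypothetical and the right limits depend on the table and workload:

```sql
SET application_name = 'migration_orders_amount_check';  -- makes the session easy to find
SET lock_timeout = '5s';          -- give up instead of queueing behind a blocker
SET statement_timeout = '15min';  -- upper bound on the work itself

ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;
```

When lock_timeout fires, the statement fails fast and can be retried later, instead of silently holding up application traffic behind it.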
A particularly dangerous pattern:</p>
flowchart TD
    A[App instance starts] --> B[It runs migrations automatically]
    B --> C[Many instances start during deploy]
    C --> D[Multiple migration attempts compete]
    D --> E([Production traffic is already increasing])
</code></pre>
Migration execution should be controlled.</p>
For serious systems, migrations are not just part of application boot.</p>
They are operational tasks with ownership and observability.</p>
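One way to keep several instances from racing each other is an advisory lock around the migration run. This is a sketch of the idea, not a feature of any particular framework; the key 872341 is an arbitrary, hypothetical value:

```sql
-- Returns true for exactly one session at a time; others get false immediately.
SELECT pg_try_advisory_lock(872341) AS got_migration_lock;

-- Run migrations only if got_migration_lock is true, then release the lock.
SELECT pg_advisory_unlock(872341);
```

Session-level advisory locks assume the same database session is used throughout, so they do not combine well with transaction-level connection pooling.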

Rollback is not always the inverse migration</h2>
Application rollbacks are often easier than database rollbacks.</p>
If a deploy fails, you can roll back code.</p>
But after this runs:</p>
ALTER TABLE users
DROP COLUMN legacy_id;
</code></pre>
the old column is gone.</p>
After this runs:</p>
UPDATE accounts
SET status = 'inactive'
WHERE last_seen_at < now() - interval '2 years';
</code></pre>
the old values are not automatically recoverable unless you prepared for that.</p>
After this runs:</p>
ALTER TABLE orders
ALTER COLUMN total_cents TYPE numeric;
</code></pre>
returning to the old type may be lossy, slow, or impossible without careful planning.</p>
A good migration plan distinguishes:</p>
code rollback;
schema rollback;
data rollback;
roll-forward fix;
restore from backup;
point-in-time recovery;
manual correction.
</code></pre>
Many database changes are not safely reversible.</p>
For those, the safer strategy is often:</p>
make the change additive;
delay destructive steps;
keep old data until confidence is high;
roll forward instead of rolling back;
test recovery before production.
</code></pre>
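"Keep old data until confidence is high" can be as simple as saving the previous values before a bulk change. A hedged sketch for the accounts example above, assuming a primary key column named id and a hypothetical backup table name:

```sql
BEGIN;

CREATE TABLE accounts_status_backup AS
SELECT id, status AS old_status
FROM accounts
WHERE last_seen_at < now() - interval '2 years';

-- Drive the change from the saved rows so both statements touch the same set.
UPDATE accounts a
SET status = 'inactive'
FROM accounts_status_backup b
WHERE a.id = b.id;

COMMIT;

-- Data rollback, if needed later:
-- UPDATE accounts a
-- SET status = b.old_status
-- FROM accounts_status_backup b
-- WHERE a.id = b.id;
```

On a large table the update itself should still be batched. The point here is only that the old values exist somewhere before they are overwritten.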
A rollback plan that says “run the down migration” is not enough.</p>

A practical pre-flight checklist</h2>
Before running a migration on a large or hot table, ask:</p>
What lock level does this operation need?
Can it wait behind an old transaction?
Can it cause later application traffic to queue?
Will it scan the table?
Will it rewrite the table?
Will it generate large WAL?
Will it increase replication lag?
Will it create many dead tuples?
Will it affect autovacuum?
Can it run inside a transaction?
Does the migration framework support the required mode?
Can it fail quickly with lock_timeout?
Can it be paused or resumed?
Is the change backward-compatible with old code?
Is there a safe rollback or roll-forward plan?
Who is watching it?
What metric tells us to stop?
</code></pre>
This checklist does not replace practice.</p>
It helps expose which migrations deserve deeper planning.</p>

Common anti-patterns</h2>
Testing only on tiny staging data</h3>
A migration that takes 200 ms on staging can take hours or block production on a large table.</p>
Combining too much into one migration</h3>
Schema change, backfill, constraint validation, index creation, and cleanup are different operational phases. They should often be separated.</p>
Running destructive changes too early</h3>
Dropping columns, constraints, tables, or indexes before all code paths are ready creates rollback traps.</p>
No lock timeout</h3>
A migration that waits forever can silently create a production queue.</p>
Treating CONCURRENTLY</code> as harmless</h3>
It reduces blocking, but still consumes resources and has caveats.</p>
Ignoring old transactions</h3>
Long transactions can turn a safe migration into a lock incident.</p>
Backfilling in one huge transaction</h3>
This creates WAL, dead tuples, replication lag, and rollback risk.</p>
Forgetting replicas and downstream systems</h3>
A migration may succeed on the primary while breaking read replicas, CDC consumers, ETL jobs, or analytics systems.</p>

Why migration incidents are excellent simulation material</h2>
Schema migration incidents are some of the best reliability training scenarios because they involve both technical mechanics and human pressure.</p>
A realistic simulation can include:</p>
a migration waiting on a lock;
a long idle transaction;
application queries piling up behind DDL;
a connection pool filling;
a concurrent index build consuming IO;
a failed index leaving an invalid artifact;
a backfill increasing replication lag;
a rollback that is not actually safe;
a team debating whether to cancel, wait, kill a blocker, pause traffic, or roll forward.
</code></pre>
The hard part is not knowing that locks exist.</p>
The hard part is choosing the safest action while production is degrading.</p>
Should the team cancel the migration?
Terminate the blocker?
Pause workers?
Reduce traffic?
Disable retries?
Let the migration finish?
Roll forward?
Roll back application code?
Validate later?
Drop an invalid index?
Leave the system alone and collect more evidence?</p>
These are operational decisions, not syntax questions.</p>
Articles can teach the patterns.
Checklists can reduce obvious mistakes.
Simulations train the judgment needed when a migration interacts with real production load.</p>

Conclusion</h2>
A Postgres migration is not safe because the SQL is valid.</p>
It is safe only if its production behavior is understood.</p>
A one-line ALTER TABLE</code> can create a lock queue.
A normal CREATE INDEX</code> can block writes.
A concurrent index can still consume enough resources to hurt.
A constraint can require a large validation scan.
A backfill can generate WAL, dead tuples, and replica lag.
A rollback can be impossible after destructive data changes.</p>
The dangerous phrase is:</p>
It worked on staging.
</code></pre>
The better reliability question is:</p>
What will this migration do to locks, WAL, replicas, autovacuum, connection pools, old application versions, and rollback options in production?
</code></pre>
That question turns migrations from hidden deployment risk into a deliberate reliability practice.</p>


Autovacuum: the quiet Postgres process that becomes a loud reliability problem
2026-04-14T00:00:00+00:00
Autovacuum is easy to ignore when everything is healthy.</p>
It runs in the background.
It does not usually appear in product discussions.
It is rarely mentioned in feature planning.
It does not look like an application dependency.</p>
Then one day the database gets slower, storage grows unexpectedly, query plans become unstable, or Postgres starts warning about transaction ID wraparound.</p>
At that point, autovacuum is no longer background maintenance.</p>
It is part of the incident.</p>
Postgres autovacuum exists because MVCC creates old row versions that must eventually be cleaned up, and because the planner needs fresh table statistics to choose good query plans. The PostgreSQL documentation describes routine vacuuming as necessary to recover or reuse storage occupied by updated or deleted rows, update planner statistics, and protect against transaction ID wraparound. (PostgreSQL</a>)</p>
The reliability lesson is simple:</p>
Autovacuum is not an optional optimization.
It is part of Postgres survival.
</code></pre>

Why Postgres needs vacuum at all</h2>
Postgres uses MVCC: multi-version concurrency control.</p>
When a row is updated, Postgres does not simply overwrite the old row in place. It creates a new row version. When a row is deleted, the old version is not immediately removed from the table file.</p>
That design allows concurrent transactions to see a consistent view of data without blocking each other unnecessarily.</p>
But it creates a maintenance problem.</p>
Old row versions eventually become unnecessary. Once no active transaction can still see them, they can be cleaned up. That cleanup is one of the main jobs of VACUUM</code>.</p>
A simplified update chain:</p>
UPDATE users SET status = 'active' WHERE id = 42;

Old row version remains for older snapshots.
New row version becomes visible to newer transactions.
VACUUM can later remove the old version when safe.
</code></pre>
If cleanup does not keep up, dead tuples accumulate.</p>
The table may become physically larger.
Indexes may become less efficient.
Sequential scans may touch more pages.
Index scans may visit more dead entries.
Autovacuum may need to do more work later under worse conditions.</p>
This is how a quiet maintenance lag becomes user-visible latency.</p>

Autovacuum also runs ANALYZE</h2>
Autovacuum is not only about removing dead tuples.</p>
It also triggers ANALYZE</code>, which refreshes planner statistics. PostgreSQL documentation notes that the autovacuum daemon automatically issues ANALYZE</code> when table contents have changed sufficiently. (PostgreSQL</a>)</p>
That matters because the query planner depends on statistics.</p>
Consider this query:</p>
SELECT *
FROM invoices
WHERE account_id = $1
  AND status = 'open'
ORDER BY due_date
LIMIT 50;
</code></pre>
The planner needs to estimate:</p>
How many rows match this account_id?
How selective is status = 'open'?
Is an index scan cheaper than a sequential scan?
Will sorting be expensive?
</code></pre>
If statistics are stale, Postgres may choose a bad plan.</p>
A table with poor vacuum behavior often has poor analyze behavior too. The incident may appear as a slow query, but the underlying issue may be maintenance starvation.</p>
You can inspect recent vacuum and analyze activity:</p>
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
</code></pre>
This does not prove bloat by itself, but it shows where cleanup pressure and statistics freshness deserve attention.</p>

The most common autovacuum misconception</h2>
A dangerous sentence:</p>
Autovacuum is using IO, so let’s disable it.
</code></pre>
Autovacuum can absolutely create load. It reads pages, cleans dead tuples, updates visibility information, and may generate WAL.</p>
But disabling it usually converts visible maintenance cost into hidden future debt.</p>
That debt comes back as:</p>
larger tables;
larger indexes;
worse cache efficiency;
slower scans;
stale statistics;
unstable query plans;
wraparound risk;
emergency anti-wraparound vacuum;
operational panic.
</code></pre>
PostgreSQL’s autovacuum</code> setting controls whether the server runs the autovacuum launcher, and it is on by default; the docs also note that track_counts</code> must be enabled for autovacuum to work. (PostgreSQL</a>)</p>
You can check the basics:</p>
SHOW autovacuum;
SHOW track_counts;
</code></pre>
And inspect relevant settings:</p>
SELECT
    name,
    setting,
    unit,
    context,
    short_desc
FROM pg_settings
WHERE name LIKE 'autovacuum%'
   OR name IN (
       'track_counts',
       'vacuum_cost_delay',
       'vacuum_cost_limit',
       'maintenance_work_mem',
       'autovacuum_work_mem'
   )
ORDER BY name;
</code></pre>
The goal is not to turn autovacuum off.</p>
The goal is to make sure it can keep up with the workload.</p>

Dead tuples are a signal, not the whole diagnosis</h2>
A common starting point:</p>
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
</code></pre>
This is useful, but it has limits.</p>
n_dead_tup</code> is an estimate. It does not directly equal “bloat”. A table can have many dead tuples and still be manageable if autovacuum is keeping up. Another table can have fewer dead tuples but be operationally sensitive because it is large, hot, heavily indexed, or latency-critical.</p>
Better questions:</p>
Is the number of dead tuples growing over time?
Does autovacuum run but fail to catch up?
Is the table write-heavy?
Are long transactions preventing cleanup?
Are indexes growing faster than expected?
Did query latency change as dead tuples accumulated?
</code></pre>
For incident response, trend is often more important than a single snapshot.</p>
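Trend needs history, and pg_stat_user_tables only shows the present. If no metrics system already collects these counters, a small snapshot table is one option. The table name below is hypothetical, and the INSERT is meant to run on a schedule:

```sql
CREATE TABLE IF NOT EXISTS dead_tuple_snapshots (
    captured_at timestamptz NOT NULL DEFAULT now(),
    relid oid NOT NULL,
    relname name NOT NULL,
    n_live_tup bigint,
    n_dead_tup bigint
);

INSERT INTO dead_tuple_snapshots (relid, relname, n_live_tup, n_dead_tup)
SELECT relid, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables;
```

Comparing snapshots taken a few minutes apart shows whether dead tuples are growing or whether autovacuum is keeping up.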

Long transactions can block cleanup</h2>
Autovacuum cannot remove row versions that might still be visible to an old transaction.</p>
That means one old transaction can keep dead tuples alive across the database.</p>
Find old transactions:</p>
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
</code></pre>
Find sessions idle inside a transaction:</p>
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;
</code></pre>
An idle in transaction</code> session may look harmless because it is not actively consuming CPU. But it can prevent cleanup, hold locks, and keep old snapshots alive.</p>
A classic reliability failure chain:</p>
flowchart TD
    A[Application opens transaction] --> B[Transaction becomes idle and remains open]
    B --> C[Updates and deletes continue elsewhere]
    C --> D[Dead tuples cannot be fully cleaned]
    D --> E[Tables and indexes grow]
    E --> F[Queries touch more pages]
    F --> G[Latency increases]
    G --> H([Connection pool saturates])
</code></pre>
The visible symptom may be slow queries.</p>
The mechanism may be vacuum being unable to clean because the application is holding old snapshots.</p>
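Two things help here: seeing which sessions hold back the cleanup horizon, and limiting how long a session may sit idle inside a transaction. A sketch, where app_user is a hypothetical role name and 60s is an illustrative value:

```sql
-- Sessions holding the oldest transaction IDs or snapshots.
SELECT
    pid,
    application_name,
    state,
    age(backend_xid) AS xid_age,
    age(backend_xmin) AS xmin_age,
    now() - xact_start AS transaction_age
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
   OR backend_xmin IS NOT NULL
ORDER BY greatest(age(backend_xid), age(backend_xmin)) DESC
LIMIT 10;

-- Terminate sessions that stay idle in a transaction for too long.
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';
```

The role setting applies to new sessions only. Replication slots and prepared transactions can hold the horizon back as well, and they do not appear in this query.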

The table that autovacuum cannot catch</h2>
Some tables are much harder for autovacuum than others.</p>
Examples:</p>
high-update tables;
queue-like tables;
session tables;
event status tables;
tables with frequent DELETE;
tables with many indexes;
tables with very large row counts;
tables with hot tenants or skewed access patterns.
</code></pre>
A queue table is a common example:</p>
CREATE TABLE jobs (
    id bigserial PRIMARY KEY,
    status text NOT NULL,
    payload jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);
</code></pre>
Workers constantly do:</p>
UPDATE jobs
SET status = 'running',
    updated_at = now()
WHERE id = $1;
</code></pre>
Then:</p>
UPDATE jobs
SET status = 'done',
    updated_at = now()
WHERE id = $1;
</code></pre>
Or:</p>
DELETE FROM jobs
WHERE status = 'done'
  AND updated_at < now() - interval '7 days';
</code></pre>
This table may generate dead tuples continuously.</p>
A default autovacuum configuration may be too conservative for it, especially if the table is large and the scale factor means vacuum starts only after a large number of changes.</p>
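The trigger point is autovacuum_vacuum_threshold plus autovacuum_vacuum_scale_factor times the estimated row count. With the defaults of 50 and 0.2, a table with 100 million rows is vacuumed only after about 20,000,050 dead tuples. The sketch below estimates that point per table from the global settings and ignores any per-table overrides:

```sql
SELECT
    s.relname,
    c.reltuples::bigint AS estimated_rows,
    s.n_dead_tup,
    (
        current_setting('autovacuum_vacuum_threshold')::bigint
        + current_setting('autovacuum_vacuum_scale_factor')::float8
          * greatest(c.reltuples, 0)  -- reltuples can be -1 before the first analyze
    )::bigint AS dead_tuples_before_autovacuum
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 20;
```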
You can inspect per-table autovacuum options:</p>
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.reloptions IS NOT NULL
ORDER BY n.nspname, c.relname;
</code></pre>
For a hot table, per-table tuning may be more appropriate than changing global settings:</p>
ALTER TABLE jobs SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 5000,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_analyze_threshold = 5000
);
</code></pre>
This is only an example, not a universal recommendation. The right values depend on write rate, table size, IO capacity, latency goals, and how much maintenance work the system can absorb.</p>
PostgreSQL exposes autovacuum thresholds, scale factors, cost delay settings, and worker limits as configuration parameters; these settings control when and how autovacuum runs. (PostgreSQL</a>)</p>

Autovacuum workers are limited</h2>
Autovacuum is not an infinite background army.</p>
It has a launcher and a limited number of workers. If several large or busy tables need cleanup at the same time, some tables wait.</p>
Check running autovacuum activity:</p>
SELECT
    pid,
    datname,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS runtime,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE query ILIKE 'autovacuum:%'
ORDER BY query_start ASC;
</code></pre>
You can also inspect active vacuum progress. PostgreSQL provides pg_stat_progress_vacuum</code>, with one row for each backend, including autovacuum workers, currently running VACUUM</code>. (PostgreSQL</a>)</p>
SELECT
    p.pid,
    a.datname,
    a.application_name,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
</code></pre>
This helps answer:</p>
Is vacuum currently running?
Which table is it working on?
Is it scanning heap pages?
Is it vacuuming indexes?
Is it spending a long time on one relation?
</code></pre>
If autovacuum is always running but dead tuples continue rising, the system may be under-provisioned for its write workload, misconfigured for specific hot tables, blocked by old transactions, or overloaded by competing IO.</p>
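A quick way to see whether the worker pool itself is the bottleneck is to compare running workers with the configured limit. This sketch relies on the backend_type column, available since PostgreSQL 10:

```sql
SELECT
    count(*) AS running_workers,
    current_setting('autovacuum_max_workers')::int AS max_workers
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';
```

If running_workers sits at max_workers for long periods, other tables are probably waiting for their turn.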

Cost-based delay: autovacuum can be too polite</h2>
Autovacuum is designed not to overwhelm the system.</p>
That politeness can become a problem.</p>
Cost-based vacuum delay allows vacuum to pause during work so it does not consume too many resources at once. PostgreSQL exposes cost-based vacuum settings and progress/verbose reporting related to this behavior. (PostgreSQL</a>)</p>
In a write-heavy system, autovacuum can be so gentle that it never catches up.</p>
The symptom is not that autovacuum is absent.</p>
The symptom is that it is always behind.</p>
You may see:</p>
autovacuum runs frequently;
dead tuples remain high;
table size keeps growing;
indexes grow disproportionately;
query latency slowly worsens;
manual VACUUM helps temporarily;
the problem returns.
</code></pre>
This is a capacity mismatch.</p>
The database is generating cleanup work faster than the maintenance system is allowed to process it.</p>
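One possible response is to let autovacuum work harder on the specific table that falls behind. The values below are illustrative only, applied to the jobs table from the earlier example:

```sql
ALTER TABLE jobs SET (
    autovacuum_vacuum_cost_delay = 1,     -- milliseconds to sleep when the cost limit is reached
    autovacuum_vacuum_cost_limit = 1000   -- work allowed before each sleep
);
```

A shorter delay and a higher limit make vacuum faster and more IO-hungry, so the change deserves the same monitoring as any other workload change.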

Vacuum and indexes</h2>
Vacuum is not only about heap tuples.</p>
Indexes also matter.</p>
When tables are updated and deleted, indexes can accumulate dead entries. Vacuum has to deal with those too.</p>
A table with many indexes creates more maintenance work per row change.</p>
Example:</p>
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_region ON orders(region);
CREATE INDEX idx_orders_status_created ON orders(status, created_at);
</code></pre>
Every update that changes indexed columns can increase write and maintenance cost.</p>
A useful index review query:</p>
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, idx_tup_read ASC
LIMIT 50;
</code></pre>
Low usage does not automatically mean an index is safe to drop. It may support rare but critical queries, constraints, or incident workflows.</p>
But unused indexes are not free.</p>
They increase write amplification and vacuum work. In reliability terms, unnecessary indexes are permanent background cost.</p>
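To narrow a review down to candidates, it can help to add index size and leave out indexes that enforce uniqueness. A sketch:

```sql
SELECT
    s.schemaname,
    s.relname AS table_name,
    s.indexrelname AS index_name,
    s.idx_scan,
    pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique
  AND NOT i.indisprimary
ORDER BY pg_relation_size(s.indexrelid) DESC
LIMIT 50;
```

These counters are tracked per server and restart from zero when statistics are reset, so an index that looks unused on the primary may still serve queries on a replica.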

Wraparound: the autovacuum incident you really do not want</h2>
Postgres transaction IDs are finite. To prevent transaction ID wraparound, tables must be vacuumed so old transaction IDs can be frozen. Routine vacuuming documentation explicitly includes protection against transaction ID wraparound as one of the reasons vacuuming is necessary. (PostgreSQL</a>)</p>
Inspect transaction ID age by table:</p>
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;
</code></pre>
Also inspect database age:</p>
SELECT
    datname,
    age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
</code></pre>
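When wraparound risk grows, Postgres becomes increasingly aggressive about vacuuming. Once a table’s transaction ID age passes autovacuum_freeze_max_age, autovacuum is forced on that table even if autovacuum is otherwise disabled. An occasional anti-wraparound vacuum on a healthy system is routine. Transaction ID age that keeps climbing toward the wraparound limit is not a normal tuning issue. It is a reliability emergency.</p>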
The worst version of this incident looks like:</p>
Autovacuum was disabled or starved.
Old transactions prevented cleanup.
Large tables were not frozen in time.
Wraparound warnings appeared.
Emergency vacuum consumed IO.
Critical workload slowed down.
Operators had limited safe options.
</code></pre>
The best time to care about transaction age is long before those warnings appear.</p>
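One way to watch transaction age before warnings appear is to compare it with the configured freeze limit and with the size of the transaction ID space. The sketch below assumes the server-wide autovacuum_freeze_max_age applies; per-table overrides are not considered. The percentage is a rough early-warning number, not an exact distance to shutdown.

```sql
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    -- Age at which autovacuum forces an anti-wraparound vacuum.
    current_setting('autovacuum_freeze_max_age')::bigint AS freeze_max_age,
    -- Share of the roughly 2 billion usable transaction IDs already consumed.
    round(100.0 * age(datfrozenxid) / 2147483647, 2) AS percent_toward_wraparound
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
```

A database that sits well above freeze_max_age for a long time suggests that anti-wraparound vacuums are not finishing, which is worth investigating long before the percentage becomes large.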

Multixact age: the less famous cousin</h2>
Postgres also tracks multixact IDs, which are relevant for row locking scenarios such as foreign keys and shared row locks.</p>
You can inspect multixact age:</p>
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    mxid_age(c.relminmxid) AS mxid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY mxid_age(c.relminmxid) DESC
LIMIT 30;
</code></pre>
This is especially relevant in systems with heavy foreign key activity, concurrent locking, or queue-like patterns.</p>
Many teams monitor transaction ID age but forget multixact age.</p>
That blind spot can turn into a surprise maintenance emergency.</p>
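A database-level check can sit next to the transaction ID one. This sketch compares multixact age with the server-wide autovacuum_multixact_freeze_max_age setting and ignores per-table overrides.

```sql
SELECT
    datname,
    mxid_age(datminmxid) AS mxid_age,
    -- Age at which autovacuum forces a multixact freeze.
    current_setting('autovacuum_multixact_freeze_max_age')::bigint
        AS multixact_freeze_max_age
FROM pg_database
ORDER BY mxid_age(datminmxid) DESC;
```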

Manual VACUUM is not a magic button</h2>
You can run:</p>
VACUUM VERBOSE orders;
</code></pre>
Or:</p>
VACUUM (VERBOSE, ANALYZE) orders;
</code></pre>
PostgreSQL’s VACUUM</code> command reports progress through pg_stat_progress_vacuum</code> for regular vacuum operations; VACUUM FULL</code> is different because it rewrites the table and reports through cluster progress views. (PostgreSQL</a>)</p>
The distinction is important.</p>
Regular VACUUM</code> cleans up dead tuples and makes space reusable inside the table. It usually does not shrink the table file on disk, apart from truncating empty pages at the end of the table.</p>
VACUUM FULL</code> rewrites the table into a new file and can return disk space to the operating system, but it holds an ACCESS EXCLUSIVE lock for the duration, which blocks reads as well as writes, and it needs enough free disk for the new copy while it runs.</p>
That means this command is not a casual production fix:</p>
VACUUM FULL orders;
</code></pre>
It may block access in ways your product cannot tolerate.</p>
A reliability-minded approach asks:</p>
Do we need to improve query performance?
Do we need to recover disk to the OS?
Can regular VACUUM catch up?
Is bloat severe enough to justify a rewrite?
Can we use online rebuild strategies instead?
What is the lock impact?
What is the rollback plan?
</code></pre>
Manual vacuuming can help, but it does not replace understanding why autovacuum fell behind.</p>
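If a rewrite is already running, its progress does not show up in pg_stat_progress_vacuum. A minimal sketch for watching a VACUUM FULL or CLUSTER through the cluster progress view, assuming PostgreSQL 12 or later:

```sql
SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.command,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_tuples_scanned,
    p.heap_tuples_written,
    now() - a.query_start AS runtime
FROM pg_stat_progress_cluster p
JOIN pg_stat_activity a ON a.pid = p.pid;
```

The phase column shows whether the command is still scanning the heap or already rebuilding indexes, which helps when deciding whether to wait or cancel.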

Logging autovacuum activity</h2>
Autovacuum can be made more observable through logging.</p>
PostgreSQL provides log_autovacuum_min_duration</code>, which logs autovacuum actions exceeding the configured duration; the documentation notes this can help track autovacuum activity. (PostgreSQL</a>)</p>
Example:</p>
ALTER SYSTEM SET log_autovacuum_min_duration = '5s';
SELECT pg_reload_conf();
</code></pre>
Or in postgresql.conf</code>:</p>
log_autovacuum_min_duration = '5s'
</code></pre>
In noisy systems, you may choose a higher value. In an investigation, lowering it temporarily can provide evidence.</p>
Autovacuum logs can reveal:</p>
which tables are vacuumed often;
which vacuums take a long time;
whether dead tuple cleanup is effective;
whether vacuum is skipped or delayed;
whether index cleanup dominates;
whether analyze is happening regularly.
</code></pre>
The goal is not to log everything forever.</p>
The goal is to make background maintenance visible enough to reason about it.</p>
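Logging can also be scoped to a single table, because log_autovacuum_min_duration is available as a storage parameter. The sketch below uses the orders table from earlier examples and a value of 0, which logs every autovacuum action on that table; both are illustrative choices.

```sql
-- Log every autovacuum action on this one table (0 = no minimum duration).
ALTER TABLE orders SET (log_autovacuum_min_duration = 0);

-- After the investigation, fall back to the server-wide setting.
ALTER TABLE orders RESET (log_autovacuum_min_duration);
```

This keeps log volume low on the rest of the system while one suspicious table is under observation.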

Autovacuum and partitioning</h2>
Partitioning can make vacuum behavior more manageable, but it does not remove the need for vacuum.</p>
For event-like data, partitioning by time can help because old partitions become mostly static.</p>
Example:</p>
CREATE TABLE events (
    id bigint NOT NULL,
    tenant_id bigint NOT NULL,
    event_type text NOT NULL,
    created_at timestamptz NOT NULL,
    payload jsonb NOT NULL
) PARTITION BY RANGE (created_at);
</code></pre>
Monthly partitions:</p>
CREATE TABLE events_2026_06
PARTITION OF events
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');
</code></pre>
For append-mostly workloads, old partitions may need less frequent vacuuming after they stop changing.</p>
For hot current partitions, autovacuum still matters.</p>
Partitioning helps when it matches the data lifecycle. It hurts when it is used as a substitute for understanding write patterns.</p>
Bad partitioning can create:</p>
too many relations;
planning overhead;
operational complexity;
uneven hot partitions;
forgotten per-table settings;
maintenance surprises.
</code></pre>
The reliability question is not “Should we partition?”</p>
It is:</p>
Does partitioning align with how data is written, updated, queried, retained, and vacuumed?
</code></pre>
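To see whether partitioning matches the write pattern, it can help to look at vacuum statistics per partition. This sketch lists the direct partitions of the events table from the example above; sub-partitions would need a recursive query.

```sql
SELECT
    i.inhrelid::regclass AS partition_name,
    s.n_live_tup,
    s.n_dead_tup,
    s.last_autovacuum,
    s.last_autoanalyze,
    pg_size_pretty(pg_total_relation_size(i.inhrelid)) AS total_size
FROM pg_inherits i
JOIN pg_stat_user_tables s ON s.relid = i.inhrelid
WHERE i.inhparent = 'events'::regclass
ORDER BY s.n_dead_tup DESC;
```

If dead tuples concentrate in the current partition and old partitions stay quiet, the layout follows the data lifecycle. Autovacuum processes the partitions, not the partitioned parent itself, so statistics on the parent may need a manual ANALYZE.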

A practical autovacuum health snapshot</h2>
This is not a full runbook, but it gives a useful operational snapshot.</p>
Largest dead tuple estimates:</p>
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
</code></pre>
Tables not recently vacuumed:</p>
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY last_autovacuum NULLS FIRST, n_dead_tup DESC
LIMIT 30;
</code></pre>
Oldest transaction IDs:</p>
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;
</code></pre>
Current vacuum progress:</p>
SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
</code></pre>
Old transactions that can hold cleanup back:</p>
SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
</code></pre>
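Open transactions are not the only thing that can pin the cleanup horizon. A sketch that lists the common holders side by side, ordered by how far back each one reaches:

```sql
-- Sessions, replication slots, and prepared transactions can all
-- hold back the oldest transaction ID that vacuum may clean up to.
SELECT
    'backend' AS holder_type,
    pid::text AS holder,
    age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
SELECT
    'replication slot',
    slot_name::text,
    age(xmin)
FROM pg_replication_slots
WHERE xmin IS NOT NULL
UNION ALL
SELECT
    'prepared transaction',
    gid,
    age(transaction)
FROM pg_prepared_xacts
ORDER BY xmin_age DESC;
```

Logical replication slots also carry a catalog_xmin, which is left out here to keep the sketch short.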
Per-table autovacuum overrides:</p>
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.reloptions IS NOT NULL
ORDER BY n.nspname, c.relname;
</code></pre>
These queries help build a picture. They do not replace interpretation.</p>
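Part of that interpretation is knowing when autovacuum would start at all. By default a table becomes a vacuum candidate when dead tuples exceed autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × estimated row count. The sketch below applies the server-wide settings only, so tables with per-table overrides will show a misleading trigger value.

```sql
SELECT
    s.schemaname,
    s.relname,
    s.n_dead_tup,
    round(
        current_setting('autovacuum_vacuum_threshold')::numeric
        + current_setting('autovacuum_vacuum_scale_factor')::numeric
          -- reltuples can be -1 on a table that was never analyzed.
          * greatest(c.reltuples, 0)::numeric
    ) AS approx_vacuum_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 30;
```

A table whose n_dead_tup sits far above the trigger for a long time is a table where autovacuum is blocked, starved, or too slow.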

What teams often get wrong</h2>
They only notice autovacuum when it hurts</h3>
If the first time you discuss autovacuum is during an incident, the system has already been running on assumptions.</p>
They use global settings for table-specific problems</h3>
A single hot table often needs specific tuning. Changing global autovacuum settings can help one table while causing unnecessary maintenance pressure elsewhere.</p>
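A per-table override might look like the sketch below. The table name comes from earlier examples and the numbers are illustrative, not recommendations; suitable values depend on table size and write rate.

```sql
-- Vacuum and analyze this hot table earlier than the global defaults would.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,
    autovacuum_vacuum_threshold = 1000,
    autovacuum_analyze_scale_factor = 0.02
);
```

The override shows up in pg_class.reloptions, so it stays visible in the per-table overrides query instead of hiding in someone's memory.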
They ignore application transaction behavior</h3>
No autovacuum configuration fully compensates for application code that holds transactions open for too long.</p>
They treat bloat as a one-time cleanup task</h3>
Bloat cleanup without workload change is temporary. If the write pattern remains the same, the problem returns.</p>
They forget that indexes multiply maintenance cost</h3>
Every unnecessary index makes writes and vacuum more expensive.</p>
They use VACUUM FULL</code> too casually</h3>
It can reclaim disk, but it rewrites the table and can create serious locking impact. It is a maintenance operation, not a harmless cleanup command.</p>

A better mental model</h2>
Autovacuum is a feedback system.</p>
flowchart TD
    A[Application writes] --> B[Updates and deletes create dead tuples]
    B --> C[Autovacuum cleans old versions]
    C --> D[ANALYZE refreshes statistics]
    D --> E[Planner makes better decisions]
    E --> F[Queries stay predictable]
    F --> G([Storage growth remains controlled])
</code></pre>
When that feedback loop breaks, the symptoms appear elsewhere:</p>
slow queries;
bad plans;
growing storage;
high IO;
replication pressure;
pool saturation;
wraparound warnings;
long incident calls.
</code></pre>
That is why autovacuum problems are often misdiagnosed.</p>
They do not always announce themselves as “autovacuum failed.”</p>
They appear as system degradation.</p>

Why autovacuum incidents are strong simulation material</h2>
Autovacuum incidents are excellent for training because they develop slowly and then become urgent.</p>
A realistic simulation might include:</p>
a high-update table;
a long idle transaction;
dead tuples accumulating;
query plans becoming unstable;
autovacuum workers running but not catching up;
storage growth;
an engineer proposing to disable autovacuum;
a manual VACUUM that helps only partially;
a wraparound risk warning later in the scenario.
</code></pre>
The hard part is not knowing that VACUUM</code> exists.</p>
The hard part is connecting weak signals before they become a major incident.</p>
Is autovacuum absent, blocked, too slow, or just overloaded?
Are stale statistics causing bad plans?
Is a long transaction preventing cleanup?
Is the table design generating too much churn?
Is the immediate risk latency, storage, or wraparound?
Should the response be tuning, traffic reduction, transaction cleanup, manual vacuum, index review, or application change?</p>
These decisions require operational judgment.</p>
An article can explain the mechanism.
A dashboard can show the counters.
A simulation forces the team to make decisions while the system is degrading.</p>

Conclusion</h2>
Autovacuum is not background noise.</p>
It is one of the core processes that keeps a Postgres system healthy over time.</p>
When it works well, nobody notices.
When it falls behind, the symptoms can appear as slow queries, unstable plans, growing tables, bloated indexes, IO pressure, storage incidents, or transaction ID wraparound risk.</p>
The right lesson is not “autovacuum is good” or “autovacuum is bad.”</p>
The right lesson is:</p>
Autovacuum is part of the workload.
It needs capacity, observability, and tuning.
</code></pre>
Reliable Postgres operations require knowing which tables generate cleanup pressure, which transactions prevent cleanup, which indexes amplify maintenance cost, and which alerts reveal trouble early enough to act safely.</p>
Autovacuum is quiet by design.</p>
Database reliability means hearing it before it has to become loud.</p>


Postgres monitoring: which metrics help, and which ones create noise
2026-04-09T00:00:00+00:00
Most Postgres monitoring starts with good intentions and slowly turns into noise.</p>
A team adds dashboards for CPU, memory, connections, replication lag, locks, slow queries, disk usage, cache hit ratio, autovacuum, checkpoints, WAL, dead tuples, table sizes, index usage, and dozens of other signals.</p>
Then an incident happens.</p>
The dashboard is full of red panels.
Everyone sees something different.
One engineer points at CPU.
Another points at connections.
Someone else sees slow queries.
A replica lag alert fires.
The application shows timeouts.
The team has many metrics, but no clear direction.</p>
That is the core monitoring problem:</p>
More metrics do not automatically create better reliability.
</code></pre>
Good monitoring helps a team form and test hypotheses. Bad monitoring creates panic, false confidence, and alert fatigue.</p>
Postgres reliability monitoring is not about collecting every possible number. It is about knowing which signal answers which operational question.</p>

Monitoring should start with user impact</h2>
A database can look unhealthy while the product is fine.</p>
A database can also look mostly healthy while users are already suffering.</p>
That is why the top layer of monitoring should not be Postgres internals. It should be user-visible behavior.</p>
Examples:</p>
API latency
API error rate
checkout failures
login failures
background job delay
queue age
request timeout rate
successful writes per second
customer-facing read latency
</code></pre>
These are not Postgres metrics, but they are the reason Postgres reliability matters.</p>
A useful monitoring hierarchy looks like this:</p>
flowchart TD
    A[User symptoms] --> B[Application behavior]
    B --> C[Connection pool pressure]
    C --> D[Postgres activity]
    D --> E[Storage / OS / infrastructure]
    E --> F[Replication / backup / recovery systems]
</code></pre>
If you start from the bottom, you may optimize the wrong thing.</p>
If you start from user impact, you can ask:</p>
Which database symptom explains the product symptom?
</code></pre>
That question is more useful than:</p>
Which graph is red?
</code></pre>

A metric is useful only when it supports a decision</h2>
A weak alert says:</p>
Database connections are high.
</code></pre>
A better alert says:</p>
User-facing requests are waiting for database connections,
and Postgres active sessions are also elevated.
</code></pre>
A weak dashboard says:</p>
CPU is 90%.
</code></pre>
A better dashboard helps answer:</p>
Is CPU high because useful work increased,
because a bad query plan appeared,
because concurrency exploded,
or because retries are multiplying traffic?
</code></pre>
The value of a metric is not the number itself. The value is the decision it helps with.</p>
For Postgres incidents, useful decisions include:</p>
Should we reduce traffic?
Should we pause background workers?
Should we cancel a query?
Should we cancel a migration?
Should we add capacity?
Should we fail over?
Should we stop retries?
Should we run ANALYZE?
Should we let recovery continue?
Should we avoid touching the database until we know more?
</code></pre>
Monitoring should make those decisions safer.</p>

Postgres has many statistics views, but they are not a diagnosis</h2>
PostgreSQL exposes a cumulative statistics system that reports server activity, including table and index access, row counts, and vacuum/analyze activity. The official monitoring documentation also reminds operators not to ignore OS-level tools such as ps</code>, top</code>, iostat</code>, and vmstat</code>, and to use EXPLAIN</code> for deeper query investigation after identifying a poorly performing query. (PostgreSQL</a>)</p>
That means Postgres gives you evidence.</p>
It does not give you the incident narrative automatically.</p>
For example, this query summarizes sessions:</p>
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
</code></pre>
This can tell you:</p>
many sessions are active;
many sessions are waiting on locks;
many sessions are idle;
many sessions are idle in transaction;
many sessions are waiting on IO.
</code></pre>
But the next step is human reasoning:</p>
Why are they active?
Why are they waiting?
What changed?
Which workload owns them?
Are they a cause or a symptom?
</code></pre>
A statistics view is not an incident response plan.</p>

Start with “what changed?”</h2>
Many Postgres incidents are triggered by change:</p>
new deploy;
new query shape;
schema migration;
new index;
data import;
traffic spike;
autoscaling event;
background job;
customer onboarding;
configuration change;
replica issue;
storage degradation.
</code></pre>
Monitoring should make change visible.</p>
A good incident dashboard should correlate database symptoms with:</p>
deploy markers;
migration start/finish events;
feature flag changes;
traffic volume;
worker concurrency;
autoscaling events;
database failover events;
backup windows;
maintenance jobs;
large imports or backfills.
</code></pre>
Without change context, metrics are easier to misread.</p>
Example:</p>
Connections increased at 12:05.
</code></pre>
That could mean:</p>
traffic increased;
queries became slower;
pool size changed;
a deploy doubled app instances;
a connection leak started;
retries increased;
a lock queue formed.
</code></pre>
If the dashboard also shows a deployment at 12:03, the investigation starts differently.</p>
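Postgres itself can contribute a small part of that change context. A sketch that shows when the server last started, when configuration was last loaded, and which settings differ from built-in defaults:

```sql
-- A recent start or reload timestamp near the incident start is a lead.
SELECT
    pg_postmaster_start_time() AS server_started_at,
    pg_conf_load_time() AS config_loaded_at;

-- Settings that come from a file, ALTER SYSTEM, a role, or a database.
SELECT name, setting, unit, source
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;
```

Deploys, feature flags, and traffic changes still have to come from outside the database.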

Monitor the connection boundary</h2>
Connection pools are where application behavior becomes database pressure.</p>
Postgres-side connection snapshot:</p>
SELECT
    application_name,
    usename,
    client_addr,
    count(*) AS total,
    count(*) FILTER (WHERE state = 'active') AS active,
    count(*) FILTER (WHERE state = 'idle') AS idle,
    count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr
ORDER BY total DESC;
</code></pre>
This answers:</p>
Which applications are connected?
Who owns the sessions?
How many are active?
How many are idle?
Are any idle inside transactions?
</code></pre>
But Postgres cannot fully explain pool behavior. The application must expose:</p>
pool size;
connections in use;
idle pool connections;
pending checkout count;
connection acquisition latency;
pool checkout timeout count;
transaction duration;
query duration;
request duration while holding a connection.
</code></pre>
A critical distinction:</p>
Waiting for a pool connection is not the same as executing a slow SQL query.
</code></pre>
If you do not separate those, every incident looks like “Postgres is slow.”</p>
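On the Postgres side, a simple capacity check helps tell "the pool is full" apart from "the server is full". A minimal sketch, assuming PostgreSQL 10 or later for the backend_type column:

```sql
SELECT
    count(*) AS client_connections,
    current_setting('max_connections')::int AS max_connections,
    round(
        100.0 * count(*) / current_setting('max_connections')::int,
        1
    ) AS percent_used
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```

If the application reports checkout timeouts while this percentage is low, the limit being hit is in the pool configuration, not in Postgres.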

Monitor active sessions, not just total connections</h2>
Total connections can be misleading.</p>
A database with 300 mostly idle sessions may be healthier than a database with 60 active sessions all fighting over locks or disk.</p>
Useful live activity query:</p>
SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    now() - xact_start AS transaction_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC;
</code></pre>
This helps distinguish:</p>
active CPU work;
lock waiting;
IO waiting;
long-running transactions;
stuck migrations;
slow queries;
idle transactions;
client-related waits.
</code></pre>
During incidents, the wait state is often more useful than the raw connection count.</p>
A good dashboard should not only ask:</p>
How many connections exist?
</code></pre>
It should ask:</p>
What are those connections doing?
</code></pre>

Long transactions deserve their own panel</h2>
Long transactions are behind many Postgres reliability problems:</p>
vacuum cannot clean old row versions;
schema migrations wait;
row locks remain held;
bloat grows;
replicas can be affected;
connection pools lose capacity;
query behavior becomes harder to explain.
</code></pre>
Monitor them directly:</p>
SELECT
    pid,
    application_name,
    usename,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 30;
</code></pre>
And specifically:</p>
SELECT
    pid,
    application_name,
    usename,
    client_addr,
    now() - xact_start AS transaction_age,
    now() - state_change AS idle_age,
    left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;
</code></pre>
A mature alert is not simply:</p>
There is an idle transaction.
</code></pre>
It is more like:</p>
An app-owned transaction has been idle for longer than expected
on a database with high write activity or pending migrations.
</code></pre>
Context turns noise into signal.</p>
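Monitoring can be paired with a guardrail. The sketch below sets idle_in_transaction_session_timeout for one role, so sessions that sit idle inside a transaction for too long are terminated. The role name app_user and the 60 second value are hypothetical.

```sql
-- Applies to new sessions of this role only.
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';

-- Check the value in effect for the current session.
SHOW idle_in_transaction_session_timeout;
```

Scoping the timeout to an application role leaves migrations and admin sessions unaffected, which is usually safer than a server-wide value.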

Query monitoring: total cost, latency, frequency, and variance</h2>
pg_stat_statements</code> is one of the most important extensions for Postgres workload visibility. The official documentation describes it as a module for tracking planning and execution statistics of SQL statements executed by a server. (PostgreSQL</a>)</p>
The mistake is using only one ranking.</p>
Highest total time:</p>
SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
</code></pre>
This finds queries that consume the most total database time.</p>
Highest average latency:</p>
SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>
This finds consistently slow queries.</p>
Highest call count:</p>
SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
</code></pre>
This finds queries that may be cheap individually but expensive in aggregate.</p>
High variance:</p>
SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time DESC
LIMIT 20;
</code></pre>
This finds unstable queries.</p>
Each view answers a different question:</p>
Total time: what consumes the database?
Mean time: what is consistently expensive?
Max time: what occasionally explodes?
Call count: what happens too often?
Variance: what depends heavily on parameters or data shape?
</code></pre>
If your dashboard only shows “top slow queries,” it may miss high-frequency queries that quietly dominate database load.</p>
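One way to make that visible is to express each statement as a share of all tracked execution time. A sketch using a window function over pg_stat_statements, assuming PostgreSQL 13 or later column names:

```sql
SELECT
    calls,
    round(total_exec_time::numeric, 1) AS total_ms,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(
        (100.0 * total_exec_time
            / nullif(sum(total_exec_time) OVER (), 0))::numeric,
        2
    ) AS percent_of_total,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
```

A statement with a mean of a few milliseconds and a large share of the total is a frequency problem, not a latency problem.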

Slow-query logs are useful, but can become noise</h2>
Postgres logging can capture slow statements through settings such as log_min_duration_statement</code>, and the logging system supports multiple destinations such as stderr, csvlog, jsonlog, syslog, and eventlog depending on platform and configuration. (PostgreSQL</a>)</p>
A common setting:</p>
SHOW log_min_duration_statement;
</code></pre>
Example:</p>
ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();
</code></pre>
This can help identify expensive statements.</p>
But slow-query logs have limitations:</p>
They show completed statements, not necessarily currently stuck ones.
They can become extremely noisy under incidents.
They may miss high-frequency fast queries that cause aggregate load.
They need application context to be useful.
They can increase log volume significantly.
</code></pre>
Slow-query logging is evidence, not a complete monitoring strategy.</p>
For production reliability, combine it with:</p>
pg_stat_statements;
application tracing;
pool metrics;
lock monitoring;
wait events;
deployment markers;
request-level latency.
</code></pre>
The goal is to connect a slow query to user impact and system pressure.</p>

Lock monitoring must show blockers and victims</h2>
A lock alert that says “lock wait exists” is often too vague.</p>
During production incidents, you need to know:</p>
Who is blocked?
Who is blocking?
How long has the blocker been running?
Is the blocker active or idle in transaction?
Which application owns it?
Is the blocked query user traffic, migration, worker, or admin?
</code></pre>
Useful query:</p>
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    blocked.usename AS blocked_user,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 160) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.usename AS blocking_user,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 160) AS blocking_query
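FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
-- Only sessions waiting on a heavyweight lock can have blockers; filtering
-- first avoids calling pg_blocking_pids() for every session.
WHERE blocked.wait_event_type = 'Lock'
ORDER BY blocked_duration DESC;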
</code></pre>
Good lock monitoring separates:</p>
one harmless short lock wait;
a growing lock queue behind a migration;
a long idle transaction blocking DDL;
row-level contention in a hot workflow;
application workers fighting over the same rows.
</code></pre>
The metric is not “number of locks.”</p>
The signal is the shape of the blocking chain.</p>
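Live queries only show blocking chains that exist right now. To keep evidence of chains that have already cleared, lock wait logging can be enabled. A minimal sketch:

```sql
-- Log a message when a session waits longer than deadlock_timeout for a lock.
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- The wait threshold is the deadlock_timeout setting.
SHOW deadlock_timeout;
```

The log entries name the waiting process and the processes holding the lock, which helps reconstruct the chain after the incident.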

Autovacuum monitoring should focus on whether cleanup keeps up</h2>
Autovacuum is noisy if monitored incorrectly.</p>
A graph showing “autovacuum is running” may look scary, but it can be completely normal.</p>
Better questions:</p>
Are dead tuples growing over time?
Are hot tables vacuumed often enough?
Are old transactions preventing cleanup?
Are analyze runs keeping statistics fresh?
Are tables approaching transaction ID age risk?
Is autovacuum always running but still not catching up?
</code></pre>
Useful table maintenance snapshot:</p>
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
</code></pre>
Current vacuum progress:</p>
SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
</code></pre>
Transaction ID age:</p>
SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;
</code></pre>
Autovacuum monitoring should tell you whether maintenance is keeping up with write workload.</p>
If it only tells you that autovacuum exists, it is not enough.</p>

WAL and checkpoint monitoring should reveal pressure chains</h2>
WAL is involved in durability, crash recovery, replication, backups, archiving, and logical decoding.</p>
A WAL incident rarely stays isolated.</p>
Watch:</p>
WAL generation rate;
pg_wal directory growth;
archiver failures;
replication slot retention;
replica replay lag;
checkpoint frequency;
checkpoint write/sync time;
WAL-heavy statements;
large backfills or migrations.
</code></pre>
WAL generation snapshot:</p>
SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;
</code></pre>
Replication slots:</p>
SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
</code></pre>
Archiver status:</p>
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
</code></pre>
The alert should not be only:</p>
WAL directory is large.
</code></pre>
It should help identify the mechanism:</p>
WAL is growing because archiving is failing.
WAL is retained by an inactive replication slot.
WAL generation spiked after a backfill.
Replica replay lag is increasing because the primary is producing WAL too quickly.
</code></pre>
That is the difference between symptom monitoring and reliability monitoring.</p>
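Checkpoint frequency is on the watch list above but has no query yet. Where the counters live depends on the version, so the sketch below is written for PostgreSQL 16 and earlier and notes the newer location in a comment.

```sql
-- PostgreSQL 16 and earlier: checkpoint counters are in pg_stat_bgwriter.
SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time,
    stats_reset
FROM pg_stat_bgwriter;

-- PostgreSQL 17 and later: the counters moved to pg_stat_checkpointer.
-- SELECT num_timed, num_requested, write_time, sync_time, stats_reset
-- FROM pg_stat_checkpointer;
```

A high share of requested checkpoints compared with timed ones suggests that WAL volume, not the schedule, is triggering checkpoints.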

Replication monitoring: bytes, time, and product semantics</h2>
Replication lag is not one metric.</p>
Primary-side view:</p>
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
</code></pre>
Standby-side view:</p>
SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
</code></pre>
But the product question is:</p>
Can this read be stale?
</code></pre>
Replication monitoring should connect to read-routing behavior.</p>
A replica that is 5 seconds behind may be fine for analytics. It may be unacceptable for permissions, checkout, authentication, or user settings.</p>
Good monitoring distinguishes:</p>
replica connected;
replica receiving WAL;
replica replaying WAL;
replica serving stale reads;
replica safe for failover;
replica safe for reporting;
replica threatening primary disk through slot retention.
</code></pre>
One “replication lag” graph is rarely enough.</p>
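The time-based standby calculation has a known trap: when the primary is idle, no new transactions arrive, so now() minus the last replayed commit timestamp keeps growing even though the replica is fully caught up. A commonly used adjustment, sketched here, reports zero delay when everything received has been replayed.

```sql
SELECT
    CASE
        -- Not a standby: replay delay does not apply.
        WHEN NOT pg_is_in_recovery() THEN NULL
        -- Everything received has been replayed: treat as caught up.
        WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
            THEN interval '0'
        ELSE now() - pg_last_xact_replay_timestamp()
    END AS replay_delay;
```

This still cannot see WAL that the primary has produced but the standby has not yet received, so the primary-side byte lag remains the complementary signal.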

Disk monitoring must distinguish capacity from performance</h2>
Disk incidents come in two forms:</p>
capacity problem: storage is filling;
performance problem: storage is too slow for current workload.
</code></pre>
Both hurt Postgres, but the response differs.</p>
Capacity signals:</p>
data directory size;
pg_wal size;
temporary file growth;
table/index growth;
backup/archive accumulation;
replication slot retention;
available filesystem space.
</code></pre>
Performance signals:</p>
read latency;
write latency;
fsync latency;
IOPS saturation;
queue depth;
checkpoint sync time;
query wait events;
temporary file spills;
backend writes.
</code></pre>
Postgres metrics alone are not enough here. The official monitoring chapter explicitly points operators toward OS-level tools in addition to PostgreSQL’s internal statistics. (PostgreSQL</a>)</p>
A database graph may show slow queries.</p>
The storage graph may reveal the real mechanism.</p>
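Some of the capacity signals can be read from inside Postgres. The sketch below assumes a superuser or a member of pg_monitor; filesystem free space and latency still have to come from OS-level tools.

```sql
-- Current size of the WAL directory.
SELECT pg_size_pretty(sum(size)) AS wal_directory_size
FROM pg_ls_waldir();

-- Database size and cumulative temporary file usage since the last stats reset.
SELECT
    datname,
    pg_size_pretty(pg_database_size(datname)) AS database_size,
    temp_files,
    pg_size_pretty(temp_bytes) AS temp_bytes_written
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY temp_bytes DESC;
```

The temp_bytes counter is cumulative, so it is the growth between two readings that matters, not the absolute number.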

Cache hit ratio is often overrated</h2>
Many dashboards show buffer cache hit ratio.</p>
It can be useful, but it is often overinterpreted.</p>
A high cache hit ratio does not prove the database is healthy.</p>
A low cache hit ratio does not automatically identify the cause of an incident.</p>
Problems:</p>
large sequential scans can distort the number;
some workloads naturally read cold data;
a high ratio can hide CPU or lock contention;
the metric says little about query shape;
it does not show whether users are impacted.
</code></pre>
A better approach is to pair cache-related metrics with:</p>
query plans;
buffer reads from EXPLAIN;
IO wait events;
storage latency;
table/index scan patterns;
top queries by shared blocks read;
application latency.
</code></pre>
Cache hit ratio is context, not a primary incident diagnosis.</p>
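Ranking statements by shared blocks read is usually more actionable than a global ratio, because it points at specific queries. A sketch over pg_stat_statements:

```sql
SELECT
    calls,
    shared_blks_hit,
    shared_blks_read,
    round(
        100.0 * shared_blks_hit
            / nullif(shared_blks_hit + shared_blks_read, 0),
        2
    ) AS hit_percent,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;
```

A read counted here may still be served from the operating system page cache, so it is an upper bound on physical IO, not a measurement of it.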

Alert on symptoms, investigate with causes</h2>
A common mistake is alerting on too many internal causes.</p>
Examples:</p>
CPU > 80%
connections > 300
dead tuples > threshold
replica lag > threshold
cache hit ratio < threshold
autovacuum running too long
</code></pre>
Some of these are useful. But if every internal metric pages someone, the team learns to ignore alerts.</p>
A healthier alerting model:</p>
Page on user impact and imminent risk.
Ticket on trends and maintenance debt.
Dashboard internal signals for investigation.
</code></pre>
Page-worthy examples:</p>
user-facing error rate high;
API latency SLO burn;
database unavailable;
disk close to full;
primary cannot write WAL;
replica lag violates product read semantics;
connection exhaustion blocking traffic;
transaction ID age approaching dangerous thresholds;
backup/archive pipeline broken beyond recovery objective.
</code></pre>
Ticket-worthy examples:</p>
dead tuples trending up on hot table;
index bloat suspected;
unused indexes accumulating;
autovacuum not keeping up on one table;
slow query variance increasing;
connection usage slowly approaching capacity;
replica lag occasionally above normal but not user-impacting.
</code></pre>
Not every red graph deserves a page.</p>
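As an illustration of an imminent-risk signal, transaction ID age can be read per database. This is a minimal sketch; comparing the age against autovacuum_freeze_max_age is an assumption about what a useful reference point is, and the paging threshold itself is left to the team.

```sql
-- Sketch: transaction ID age per database.
-- autovacuum_freeze_max_age is the point where anti-wraparound autovacuum is forced,
-- not the wraparound limit itself, so values above 100 percent are possible.
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    round(
        100.0 * age(datfrozenxid)
        / current_setting('autovacuum_freeze_max_age')::bigint,
        1
    ) AS percent_of_freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;
```

A value that keeps climbing across several samples is usually more informative than any single reading.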

Use trend and rate, not only absolute values</h2>
A single value can be misleading.</p>
Examples:</p>
1000 dead tuples on a small table may matter.
10 million dead tuples on a huge table may be normal temporarily.

200 connections may be normal for one system.
50 active connections may overload another.

1 GB of WAL may be fine.
1 GB per minute may be alarming.

Replica lag of 2 seconds may be acceptable for reporting.
Replica lag of 2 seconds may be unacceptable for read-after-write flows.
</code></pre>
Prefer metrics that show:</p>
rate of change;
baseline deviation;
duration;
affected workload;
relation to user symptoms;
relation to known changes.
</code></pre>
A good alert is rarely “value > threshold.”</p>
It is more often:</p>
value is above threshold for long enough,
during user-impacting traffic,
and is moving in the wrong direction.
</code></pre>
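The WAL example can be turned into a rate with two samples. This is a rough sketch that assumes it runs on the primary and that both statements run in the same session; the temporary table name wal_sample is hypothetical.

```sql
-- Step 1: record the current WAL position (primary only; this function errors on a standby).
CREATE TEMP TABLE wal_sample AS
SELECT clock_timestamp() AS sampled_at,
       pg_current_wal_lsn() AS lsn;

-- Step 2: after waiting, for example one minute, in the same session:
SELECT
    clock_timestamp() - sampled_at AS elapsed,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), lsn)) AS wal_generated
FROM wal_sample;
```

A monitoring agent would normally do this sampling continuously; the manual version is mainly useful for checking what a dashboard panel claims.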

Version-specific monitoring matters</h2>
Postgres monitoring changes across versions.</p>
Views, columns, and statistics capabilities evolve. For example, modern PostgreSQL versions expose more detailed IO and WAL-related statistics than older versions. Settings such as track_io_timing</code> and track_wal_io_timing</code> add timing information, but they can introduce overhead because they repeatedly query the operating system clock. (PostgreSQL documentation)</p>
This creates a practical rule:</p>
Do not blindly copy monitoring SQL from another Postgres version.
</code></pre>
For every dashboard query, know:</p>
which Postgres versions it supports;
whether required extensions are enabled;
whether timing settings add overhead;
whether statistics reset affects interpretation;
whether managed database providers restrict access;
whether replicas expose the same views.
</code></pre>
Monitoring should be treated like production code.</p>
It can break, lie, or become outdated.</p>
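Several of the checklist items can be answered from SQL before trusting a dashboard query. The statements below are a sketch of such a preflight check; which settings and extensions matter depends on the queries the dashboard actually uses.

```sql
-- Which server version is this? (for example 150004 for 15.4)
SELECT current_setting('server_version_num')::int AS server_version_num;

-- Are the timing settings enabled? A setting that does not exist
-- in this version simply returns no row.
SELECT name, setting
FROM pg_settings
WHERE name IN ('track_io_timing', 'track_wal_io_timing');

-- Is the extension the dashboard depends on installed in this database?
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'pg_stat_statements';

-- When were the cumulative statistics for this database last reset?
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();

-- Is this server a replica? Some functions and views behave differently in recovery.
SELECT pg_is_in_recovery() AS is_replica;
```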

A minimal reliability-oriented Postgres dashboard</h2>
A useful dashboard does not need hundreds of panels.</p>
It should answer the main incident questions quickly.</p>
1. User impact</h3>
request latency;
error rate;
timeout rate;
business operation success rate;
queue age;
job delay.
</code></pre>
2. Application/database boundary</h3>
pool usage;
pool wait time;
pool checkout timeouts;
query duration;
transaction duration;
retry rate;
database errors by type.
</code></pre>
3. Postgres live activity</h3>
active sessions;
sessions by wait_event_type;
long queries;
long transactions;
idle in transaction;
blocked sessions and blockers.
</code></pre>
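A panel for sessions by wait event type could be backed by a query along these lines. It is a sketch that assumes PostgreSQL 10 or later, where pg_stat_activity exposes backend_type.

```sql
-- Sketch: non-idle client sessions grouped by what they are waiting on.
-- A NULL wait_event on an active session means it is not currently waiting.
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND pid <> pg_backend_pid()      -- exclude this monitoring session
  AND state <> 'idle'
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
```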
4. Workload shape</h3>
top queries by total time;
top queries by calls;
top queries by mean time;
queries with high variance;
WAL-heavy statements.
</code></pre>
5. Maintenance health</h3>
dead tuples;
last autovacuum/analyze;
vacuum progress;
transaction ID age;
table and index growth.
</code></pre>
6. WAL, checkpoints, storage</h3>
WAL generation rate;
pg_wal size;
checkpoint frequency;
checkpoint write/sync time;
archiver failures;
disk capacity;
disk latency.
</code></pre>
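Two of these items, pg_wal size and archiver failures, can be read directly from SQL. The sketch below assumes PostgreSQL 10 or later and a role with superuser or pg_monitor privileges; managed providers may restrict pg_ls_waldir.

```sql
-- Sketch: current size of the pg_wal directory.
SELECT
    count(*) AS wal_files,
    pg_size_pretty(sum(size)) AS pg_wal_size
FROM pg_ls_waldir();

-- Sketch: archiver health. A growing failed_count or a recent last_failed_time
-- means WAL is accumulating on the primary.
SELECT
    archived_count,
    failed_count,
    last_archived_time,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
```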
7. Replication and recovery</h3>
replication lag by stage;
standby replay delay;
replication slot retained WAL;
backup/archive status;
failover readiness indicators.
</code></pre>
The dashboard should be organized around questions, not around PostgreSQL catalog names.</p>
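For example, the question "how far behind is each replica, and at which stage?" might be answered with queries like the following. This is a sketch assuming PostgreSQL 10 or later; the column aliases are illustrative.

```sql
-- On the primary: lag by stage for each connected standby.
SELECT
    application_name,
    state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS not_yet_sent_bytes,
    pg_wal_lsn_diff(sent_lsn, replay_lsn)           AS sent_not_replayed_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- On the primary: WAL retained on behalf of each replication slot.
SELECT
    slot_name,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- On a standby: time since the last replayed transaction.
-- This also grows when the primary is simply idle, so read it with write activity in mind.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```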

Good monitoring supports hypothesis-driven debugging</h2>
During an incident, an engineer should be able to move through a chain like this:</p>
flowchart TD
    A[Users see timeouts] --> B[Application pool wait time is rising]
    B --> C[Postgres active sessions are elevated]
    C --> D[Most active sessions wait on Lock]
    D --> E[Blocking query is a migration]
    E --> F[Migration is waiting behind an idle transaction]
    F --> G[Retries are increasing request volume]
    G --> H([Stop retries, cancel migration, or terminate a known-safe blocker])
</code></pre>
Or:</p>
Writes are slow.
        ↓
CPU is normal.
        ↓
WAL generation spiked.
        ↓
Checkpoint warnings started.
        ↓
Replica lag is increasing.
        ↓
A backfill began five minutes earlier.
        ↓
Safest mitigation is to pause or throttle the backfill.
</code></pre>
This is what monitoring is for.</p>
Not to show everything.</p>
To help the team move from symptom to mechanism to decision.</p>

Common monitoring anti-patterns</h2>
Dashboard as decoration</h3>
A dashboard nobody uses during incidents is not observability. It is wallpaper.</p>
Too many panels, no hierarchy</h3>
If every graph has equal visual importance, the dashboard cannot guide attention.</p>
Alerts without ownership</h3>
Every alert should have an owner, expected action, and reason for existence.</p>
Internal metrics without user impact</h3>
A database can look noisy without affecting customers. A page should usually be tied to impact or imminent risk.</p>
User impact without database detail</h3>
Knowing users are affected is not enough. You need fast paths into database evidence.</p>
No deployment or migration markers</h3>
Without change context, incidents take longer to explain.</p>
Averaging away important behavior</h3>
Mean latency hides outliers. Total time hides variance. Aggregate database metrics hide one bad tenant or one hot table.</p>
Ignoring application metrics</h3>
Postgres cannot show pool checkout time, retry storms, request deadlines, or business operation failures by itself.</p>

Why monitoring incidents are good simulation material</h2>
Monitoring failures are often human failures.</p>
The metrics were there, but nobody knew which ones mattered.
The dashboard showed the answer, but it was buried under noise.
The alert fired, but it was not actionable.
The team watched CPU while the real problem was locks.
The team watched slow queries while the real problem was pool saturation.
The team watched the primary while the replica was serving stale reads.
The team watched database metrics while application retries amplified the incident.</p>
A realistic simulation can train:</p>
reading dashboards under pressure;
separating symptoms from causes;
forming hypotheses from weak signals;
rejecting misleading metrics;
connecting application and database behavior;
deciding when a metric is actionable;
communicating uncertainty clearly;
choosing mitigations based on evidence.
</code></pre>
This is the gap articles cannot fully close.</p>
A written guide can explain which metrics exist.
A dashboard can display the signals.
A simulation teaches the team how to reason when ten signals change at once.</p>

Conclusion</h2>
Postgres monitoring is not about collecting every metric.</p>
It is about building an evidence system for production decisions.</p>
Good monitoring starts with user impact, connects that impact to application behavior, then follows pressure into Postgres internals, storage, replication, and maintenance systems.</p>
Useful metrics answer operational questions:</p>
Are users affected?
Where is the queue?
What changed?
What is Postgres waiting on?
Which workload owns the pressure?
Is this a query, lock, IO, WAL, vacuum, replication, or pool problem?
Is the system getting worse?
Which mitigation reduces risk?
</code></pre>
The dangerous phrase is:</p>
We have dashboards, so we are covered.
</code></pre>
The better reliability question is:</p>
Can our monitoring help an engineer make the right decision during a confusing incident?
</code></pre>
That is the difference between metric collection and Postgres database reliability.</p>


A slow Postgres query is a symptom, not a diagnosis
2026-04-04T00:00:00+00:00
A slow query is one of the easiest Postgres problems to notice and one of the easiest to misunderstand.</p>
The application times out.
The endpoint gets slower.
The dashboard shows high database time.
Someone finds a query in logs and says:</p>

“This query is the problem.”</p>
</blockquote>
Maybe it is.</p>
But a slow query is rarely a complete diagnosis. It is a symptom produced by a specific mechanism: a bad plan, missing index, stale statistics, lock contention, IO saturation, parameter sensitivity, table bloat, too much concurrency, or a data distribution change that made yesterday’s assumptions false.</p>
The SQL text is only one part of the story.</p>
A query can become slow without changing at all.</p>

The same query can be fast yesterday and dangerous today</h2>
Consider a simple query:</p>
SELECT *
FROM invoices
WHERE account_id = $1
  AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;
</code></pre>
This query may be perfectly fine when most accounts have a few hundred invoices.</p>
Then the product grows. One enterprise customer imports millions of invoices. Suddenly, the same query behaves differently for different accounts.</p>
For small accounts, it is still fast.</p>
For one large account, it becomes expensive.</p>
That is not a different query. It is a different data shape.</p>
A useful index might be:</p>
CREATE INDEX CONCURRENTLY idx_invoices_account_status_due_date
ON invoices (account_id, status, due_date);
</code></pre>
But the reliability lesson is not simply “add an index.”</p>
The deeper lesson is:</p>
Query performance depends on data distribution,
not just on SQL syntax.
</code></pre>
A query that was safe when the product was small may become a production risk as the data changes.</p>

“Slow query” hides multiple failure modes</h2>
From the outside, several very different problems can look identical.</p>
API latency increased
Database time increased
Requests started timing out
Connection pool is full
The same query appears in logs repeatedly
</code></pre>
But the underlying cause could be:</p>
Missing index
Wrong index order
Stale planner statistics
Bad row estimate
Lock contention
Disk IO saturation
Sort spilling to disk
Too much concurrency
Autovacuum falling behind
Table or index bloat
Parameter-sensitive query plan
Application retry storm
</code></pre>
The same symptom requires different mitigations depending on the mechanism.</p>
That is why “find the slow query” is not enough.</p>
You need to understand why it is slow now.</p>

Start with the query shape, not just the query text</h2>
A useful first step is to identify the shape of the query.</p>
For example:</p>
SELECT *
FROM events
WHERE tenant_id = $1
  AND event_type = $2
  AND created_at >= $3
ORDER BY created_at DESC
LIMIT 100;
</code></pre>
This query shape tells you several important things:</p>
It is tenant-scoped.
It filters by event type.
It uses a time range.
It needs rows in descending time order.
It has a LIMIT.
It may be called frequently.
</code></pre>
An index that supports this access pattern may look like:</p>
CREATE INDEX CONCURRENTLY idx_events_tenant_type_created_desc
ON events (tenant_id, event_type, created_at DESC);
</code></pre>
But index design depends on real workload. For example, if event_type</code> is not selective, or if most queries do not filter by it, a different index may be better:</p>
CREATE INDEX CONCURRENTLY idx_events_tenant_created_desc
ON events (tenant_id, created_at DESC);
</code></pre>
The key question is not:</p>
Does this query have an index?
</code></pre>
The better question is:</p>
Does the index match the actual access pattern?
</code></pre>

Use EXPLAIN</code>, but do not worship it</h2>
The most common tool for investigating a slow query is:</p>
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE account_id = 123
  AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;
</code></pre>
This can show:</p>
Which plan Postgres chose
How many rows it expected
How many rows it actually processed
Whether it used an index
How many buffers were read or hit
Whether sorting happened
Whether the query touched much more data than expected
</code></pre>
For example, a suspicious plan may show:</p>
Rows expected: 50
Rows actual: 850000
</code></pre>
That is not just “slow.” That is a planner estimate problem.</p>
A query with BUFFERS</code> may show heavy reads:</p>
Buffers: shared hit=1200 read=95000
</code></pre>
That suggests most of the blocks the query needed were not in shared buffers and had to be read from the operating system page cache or from disk.</p>
But EXPLAIN ANALYZE</code> has an important property:</p>
It actually runs the query.
</code></pre>
For SELECT</code>, that is usually acceptable in a safe environment, though it can still be expensive.</p>
For writes, be careful. This executes the write:</p>
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'expired'
WHERE expires_at < now();
</code></pre>
A safer pattern for investigation is:</p>
BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'expired'
WHERE expires_at < now();

ROLLBACK;
</code></pre>
Even then, the database still performs work and may take locks while the statement runs. Do not treat diagnostic queries as harmless in production.</p>
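One way to bound that risk is to set timeouts that apply only to the diagnostic transaction. The values below are illustrative assumptions, not recommendations.

```sql
BEGIN;

-- Give up quickly if the statement has to wait for a lock,
-- and cap total runtime. SET LOCAL lasts only until the transaction ends.
SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '30s';

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'expired'
WHERE expires_at < now();

ROLLBACK;
```

If either timeout fires, the statement fails and the transaction is aborted, so the ROLLBACK still leaves the data unchanged.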

Slow because of a missing index</h2>
The simplest case is a query that has no useful index.</p>
Example:</p>
SELECT *
FROM users
WHERE lower(email) = lower($1);
</code></pre>
An ordinary index on email</code> may not help because the query applies a function:</p>
CREATE INDEX CONCURRENTLY idx_users_email
ON users (email);
</code></pre>
Postgres may need an expression index instead:</p>
CREATE INDEX CONCURRENTLY idx_users_lower_email
ON users (lower(email));
</code></pre>
Another example:</p>
SELECT *
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 20;
</code></pre>
A single-column index on only customer_id</code> may help filtering but not ordering:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
</code></pre>
A better index for this query shape may be:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_created_desc
ON orders (customer_id, created_at DESC);
</code></pre>
But even here, the right fix depends on the workload.</p>
If the table is write-heavy, every new index has a cost. It slows down writes, consumes disk, increases vacuum work, and adds operational risk during creation.</p>
The index may fix one query and harm the system elsewhere.</p>
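Part of the operational risk is that a failed or cancelled CREATE INDEX CONCURRENTLY leaves an invalid index behind, which still adds write overhead until it is dropped. A sketch for finding such indexes:

```sql
-- Sketch: indexes marked invalid in the current database.
-- An index that is still being built concurrently also appears here,
-- so check for a running build before dropping anything.
SELECT
    n.nspname            AS schema_name,
    c.relname            AS index_name,
    i.indrelid::regclass AS table_name
FROM pg_index i
JOIN pg_class c     ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
```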

Slow because of stale statistics</h2>
Postgres uses statistics to choose query plans.</p>
If statistics are stale or too coarse, the planner may choose a bad plan.</p>
You can inspect table statistics freshness:</p>
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_analyze,
    last_autoanalyze,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'invoices';
</code></pre>
If a table changed significantly and has not been analyzed recently, Postgres may make poor estimates.</p>
You can manually refresh statistics:</p>
ANALYZE invoices;
</code></pre>
Sometimes a specific column needs better statistics because values are highly skewed:</p>
ALTER TABLE invoices
ALTER COLUMN account_id SET STATISTICS 1000;

ANALYZE invoices;
</code></pre>
This does not make the query faster directly. It gives the planner better information.</p>
The incident pattern often looks like this:</p>
Data distribution changes
        ↓
Planner estimates become inaccurate
        ↓
Postgres chooses a bad plan
        ↓
Query latency increases
        ↓
Application holds connections longer
        ↓
Pool saturates
</code></pre>
The SQL did not change. The planner’s model of the data became wrong.</p>
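To see what the planner currently believes about a column, the pg_stats view can be inspected. The sketch below assumes the invoices table lives in the public schema.

```sql
-- Sketch: the planner's view of value distribution for the filter columns.
-- most_common_vals and most_common_freqs show whether skew
-- (for example one very large account) is captured at all.
SELECT
    attname,
    n_distinct,
    null_frac,
    most_common_vals,
    most_common_freqs
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'invoices'
  AND attname IN ('account_id', 'status');
```

If a known heavy account is missing from most_common_vals, the planner is likely estimating it like an average account.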

Slow because of parameter sensitivity</h2>
Some queries behave very differently depending on parameter values.</p>
Example:</p>
SELECT *
FROM messages
WHERE workspace_id = $1
  AND created_at >= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;
</code></pre>
For most workspaces, this returns a few rows.</p>
For one very large workspace, it may scan millions.</p>
This becomes especially tricky when prepared statements are involved. Postgres may switch a prepared statement to a generic plan, which is built without looking at the actual parameter values. That plan can be “reasonable on average” but bad for important parameter values.</p>
The query is not universally slow. It is selectively slow.</p>
That distinction matters.</p>
Averages hide this problem. You need to look for variance.</p>
With pg_stat_statements</code>, this kind of query may have a moderate mean but a terrible max:</p>
SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY max_exec_time DESC
LIMIT 20;
</code></pre>
A query with high standard deviation may be more interesting than a query with the highest average time.</p>
A reliability-minded question is:</p>
Is this query always slow,
or only slow for certain tenants, users, statuses, or time ranges?
</code></pre>
That question often changes the fix.</p>
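One way to test the generic-plan hypothesis is to compare the plans directly. This sketch assumes PostgreSQL 12 or later, where plan_cache_mode exists, and that workspace_id is a bigint; the statement name and the workspace ID 42 are hypothetical.

```sql
PREPARE recent_messages(bigint) AS
SELECT *
FROM messages
WHERE workspace_id = $1
  AND created_at >= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;

-- Plan built for this specific parameter value.
SET plan_cache_mode = force_custom_plan;
EXPLAIN EXECUTE recent_messages(42);

-- Plan built without looking at the parameter value.
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE recent_messages(42);

-- Clean up the session.
RESET plan_cache_mode;
DEALLOCATE recent_messages;
```

Plain EXPLAIN does not execute the statement, so this comparison is relatively cheap. Repeating it with a small and a very large workspace shows whether the two plans diverge.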

Slow because of locks</h2>
A query may appear slow even when its execution plan is fine.</p>
It may simply be waiting.</p>
For example:</p>
UPDATE accounts
SET status = 'disabled'
WHERE id = $1;
</code></pre>
This can be fast in normal conditions. But if another transaction holds a row lock on the same account, the update waits.</p>
You can inspect lock waits:</p>
SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
</code></pre>
To find blockers:</p>
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
</code></pre>
This is a very different failure mode from a missing index.</p>
Adding an index will not fix a lock wait.</p>
Running EXPLAIN ANALYZE</code> later may show a fast plan, because the lock contention is gone.</p>
That is why incident context matters. The query plan after the incident may not reproduce the incident.</p>
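While the contention is still happening, pg_locks can show which lock requests are not yet granted and on which relation. A minimal sketch:

```sql
-- Sketch: lock requests that are currently waiting.
-- relation is NULL for waits on a transaction ID, which is how row-level
-- lock waits usually appear.
SELECT
    l.pid,
    l.locktype,
    l.mode,
    l.relation::regclass AS relation,
    a.application_name,
    now() - a.query_start AS waiting_for
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted
ORDER BY a.query_start ASC;
```

Capturing this output during the incident preserves evidence that a later EXPLAIN cannot reproduce.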

Slow because of IO saturation</h2>
A query can be slow because it is doing too much disk work.</p>
But it can also be slow because some other operation is saturating disk.</p>
For example:</p>
A concurrent index build
A large vacuum
A checkpoint spike
A reporting query
A backup process
A sequential scan on another table
</code></pre>
The query you see in logs may be a victim, not the cause.</p>
EXPLAIN (ANALYZE, BUFFERS)</code> can show whether the query reads many blocks:</p>
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM events
WHERE tenant_id = 42
ORDER BY created_at DESC
LIMIT 100;
</code></pre>
But to understand system-wide pressure, you also need to look at active queries:</p>
SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start ASC;
</code></pre>
A query waiting on IO may show wait events related to data file reads or writes, depending on Postgres version and workload.</p>
The important operational distinction:</p>
Is this query slow because it performs too much IO,
or because the database storage is already saturated by something else?
</code></pre>
The mitigation is different.</p>
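To check whether a vacuum or an index build is competing for storage, the progress views can help. This sketch assumes PostgreSQL 12 or later for pg_stat_progress_create_index.

```sql
-- Sketch: vacuums currently running in this database.
SELECT
    pid,
    relid::regclass AS table_name,
    phase,
    heap_blks_scanned,
    heap_blks_total
FROM pg_stat_progress_vacuum
WHERE datname = current_database();

-- Sketch: index builds currently running in this database.
SELECT
    pid,
    relid::regclass       AS table_name,
    index_relid::regclass AS index_name,
    command,
    phase,
    blocks_done,
    blocks_total
FROM pg_stat_progress_create_index
WHERE datname = current_database();
```

Backups and checkpoints do not show up in these views, so an empty result does not rule out background IO.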

Slow because of sorting or memory pressure</h2>
A query may use an index for filtering but still sort a large result set.</p>
Example:</p>
SELECT *
FROM audit_log
WHERE organization_id = $1
ORDER BY created_at DESC
LIMIT 100;
</code></pre>
If the index does not support the order, Postgres may need to sort.</p>
A useful index:</p>
CREATE INDEX CONCURRENTLY idx_audit_log_org_created_desc
ON audit_log (organization_id, created_at DESC);
</code></pre>
In plans, watch for:</p>
Sort
Sort Method: external merge  Disk: NNNNkB
</code></pre>
That means the sort spilled to disk.</p>
A simplified example:</p>
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM audit_log
WHERE organization_id = 123
ORDER BY created_at DESC
LIMIT 100;
</code></pre>
If you see disk-based sorting, increasing work_mem</code> might help in some cases. But changing work_mem</code> globally can be dangerous because the limit applies to each sort or hash operation, not to a whole query, session, or server.</p>
A query with multiple sort/hash nodes across many concurrent sessions can multiply memory usage quickly.</p>
This is why “just increase memory” is often a risky incident response.</p>
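A narrower approach is to find the statements that actually spill and, if needed, raise the limit for one transaction only. The sketch assumes pg_stat_statements is installed; the 64MB value and the organization ID are illustrative assumptions.

```sql
-- Sketch: statements that wrote temporary blocks (sorts or hashes that spilled).
SELECT
    calls,
    temp_blks_written,
    temp_blks_read,
    left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 20;

-- Sketch: raise work_mem for a single transaction instead of globally.
BEGIN;
SET LOCAL work_mem = '64MB';

SELECT *
FROM audit_log
WHERE organization_id = 123
ORDER BY created_at DESC
LIMIT 100;

COMMIT;
```

SET LOCAL reverts when the transaction ends, so other sessions keep the configured default.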

Slow because of bloat</h2>
Postgres uses MVCC. Updates and deletes leave old row versions behind until vacuum can clean them up.</p>
If vacuum falls behind, tables and indexes can become bloated.</p>
A bloated table means Postgres may need to scan more pages to get the same useful data.</p>
You can inspect dead tuple pressure:</p>
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
</code></pre>
This is not a perfect bloat measurement, but it is a useful signal.</p>
A common incident chain:</p>
flowchart TD
    A[Long transaction remains open] --> B[Vacuum cannot clean old row versions]
    B --> C[Dead tuples accumulate]
    C --> D[Table and index scans become more expensive]
    D --> E[Query latency increases]
    E --> F[More connections remain busy]
    F --> G([System degrades])
</code></pre>
The slow query is only the visible symptom.</p>
The root issue may be a long transaction or vacuum starvation.</p>
You can inspect old transactions:</p>
SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
</code></pre>
Again, a slow query may be downstream of a completely different operational failure.</p>
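Open transactions are not the only thing that can hold back cleanup. A sketch for checking which sessions and replication slots are holding the oldest snapshots:

```sql
-- Sketch: sessions holding the oldest snapshots.
SELECT
    pid,
    application_name,
    state,
    age(backend_xmin) AS xmin_age,
    now() - xact_start AS transaction_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- Sketch: replication slots can also prevent vacuum from removing old row versions.
SELECT
    slot_name,
    active,
    age(xmin)         AS xmin_age,
    age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```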

Slow because of too much concurrency</h2>
A query can be individually acceptable but collectively harmful.</p>
Example:</p>
SELECT *
FROM product_recommendations
WHERE user_id = $1
ORDER BY score DESC
LIMIT 20;
</code></pre>
One execution is fine.
Ten executions are fine.
Five thousand concurrent executions during a traffic spike are not fine.</p>
This is the difference between query latency and system throughput.</p>
A query does not have to be “bad” to cause an incident. It only has to be too frequent, too concurrent, or too poorly bounded.</p>
This often happens with retries.</p>
Database gets slower
        ↓
Application requests time out
        ↓
Application retries
        ↓
Database receives more duplicate work
        ↓
Database gets even slower
</code></pre>
At that point, optimizing the query may help later, but the immediate mitigation might be reducing concurrency, disabling a worker, rate-limiting retries, or shedding non-critical load.</p>
A database incident is often a traffic-shaping problem, not just a SQL problem.</p>
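A quick way to see whether one statement dominates current concurrency is to group active sessions by query text. This is a rough sketch; grouping on a text prefix can split statements that embed literal values differently.

```sql
-- Sketch: which statements are running most often right now.
SELECT
    left(query, 120) AS query_preview,
    count(*) AS concurrent_sessions,
    max(now() - query_start) AS oldest_running
FROM pg_stat_activity
WHERE state = 'active'
  AND pid <> pg_backend_pid()      -- exclude this monitoring session
GROUP BY left(query, 120)
ORDER BY concurrent_sessions DESC
LIMIT 10;
```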

Finding important queries with pg_stat_statements</code></h2>
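pg_stat_statements</code> is one of the most useful Postgres extensions for understanding workload. It has to be loaded through shared_preload_libraries</code> and created with CREATE EXTENSION</code>. The column names used here, such as total_exec_time</code> and mean_exec_time</code>, apply to PostgreSQL 13 and later; older versions use total_time</code> and mean_time</code>.</p>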
A basic view of expensive queries:</p>
SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
</code></pre>
But different orderings answer different questions.</p>
Highest total time:</p>
SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
</code></pre>
This finds queries that consume the most database time overall.</p>
Highest mean time:</p>
SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
</code></pre>
This finds consistently expensive queries.</p>
Highest call count:</p>
SELECT
    calls,
    mean_exec_time,
    total_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
</code></pre>
This finds queries that may be cheap individually but expensive in aggregate.</p>
High variance:</p>
SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time DESC
LIMIT 20;
</code></pre>
This finds queries that behave unpredictably.</p>
The important part is choosing the right question.</p>
Total time asks: what consumes the database?
Mean time asks: what is slow on average?
Max time asks: what occasionally explodes?
Calls asks: what is happening too often?
Variance asks: what behaves differently across inputs?
</code></pre>
A production investigation needs all of these perspectives.</p>

Why “add an index” is sometimes the wrong fix</h2>
Indexes are powerful. Many incidents are fixed by adding or changing an index.</p>
But indexes are not free.</p>
Every index has costs:</p>
More disk usage
Slower inserts
Slower updates
Slower deletes
More WAL generation
More vacuum work
More memory pressure
Longer backup/restore times
Operational risk during creation
</code></pre>
An index can also be technically used but not useful enough.</p>
For example:</p>
CREATE INDEX CONCURRENTLY idx_orders_status
ON orders (status);
</code></pre>
If status = 'active'</code> matches 80% of the table, this index may not be very selective. Postgres may correctly choose a sequential scan.</p>
A more useful partial index might be:</p>
CREATE INDEX CONCURRENTLY idx_orders_pending_created
ON orders (created_at)
WHERE status = 'pending';
</code></pre>
This can be valuable if pending</code> is rare and frequently queried.</p>
But partial indexes require discipline. The query must match the predicate well enough for the planner to use it.</p>
A slow query investigation should ask:</p>
What exact access pattern are we optimizing?
How often does it run?
How many rows does it usually return?
How selective are the filters?
Does the query need ordering?
Is the index worth the write cost?
Can the index be created safely under current load?
</code></pre>
Without those questions, indexing becomes guesswork.</p>
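Evidence about existing indexes helps answer the cost question. The sketch below lists rarely used indexes by size; idx_scan counts scans since the last statistics reset, and replicas keep their own counters, so a low number is a prompt for investigation rather than proof that an index is unused.

```sql
-- Sketch: largest indexes with the fewest scans in the current database.
SELECT
    schemaname,
    relname      AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, pg_relation_size(indexrelid) DESC
LIMIT 20;
```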

Query performance is part of application design</h2>
A database can only do so much if the application asks expensive questions.</p>
For example, this pattern is dangerous:</p>
SELECT *
FROM events
WHERE tenant_id = $1
ORDER BY created_at DESC;
</code></pre>
No limit. No time bound. Potentially huge result set.</p>
A safer shape:</p>
SELECT *
FROM events
WHERE tenant_id = $1
  AND created_at < $2
ORDER BY created_at DESC
LIMIT 100;
</code></pre>
This supports pagination and gives the database a bounded amount of work.</p>
Another dangerous pattern is N+1 queries:</p>
Load 100 orders
For each order, query customer
For each customer, query latest invoice
For each invoice, query payment status
</code></pre>
Individually, each query may be fast.</p>
Together, they create a database pressure pattern.</p>
A better approach may use joins, batching, caching, or precomputed views, depending on the system.</p>
The database is not just a storage layer. It is part of the application’s execution model.</p>
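As a sketch of the batching option, the per-order lookups can be collapsed into one query per entity type. The table and column names here (customers, invoices.customer_id, invoices.created_at) are hypothetical, and $1 is an array of IDs collected by the application.

```sql
-- One query for all customers instead of one query per order.
SELECT id, name
FROM customers
WHERE id = ANY($1::bigint[]);

-- One query for the latest invoice of each customer.
-- DISTINCT ON keeps the first row per customer_id in the given sort order.
SELECT DISTINCT ON (customer_id)
    customer_id,
    id AS invoice_id,
    created_at
FROM invoices
WHERE customer_id = ANY($1::bigint[])
ORDER BY customer_id, created_at DESC;
```

This replaces hundreds of round trips with two, at the price of larger individual queries, so the batch size is worth bounding as well.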

The post-incident question should be deeper than “which query was slow?”</h2>
After a query-related incident, a weak review says:</p>
A query was slow.
We added an index.
The incident is resolved.
</code></pre>
A stronger review asks:</p>
Why did this query become slow now?
Was the data distribution different from staging?
Did the query pattern change in a release?
Did we have statistics drift?
Was the index missing, wrong, or too expensive to maintain?
Did retries amplify the load?
Did the connection pool hide early symptoms?
Did dashboards show query variance or only averages?
Could we have detected this before users did?
</code></pre>
The goal is not to blame a query.</p>
The goal is to improve the system’s ability to survive workload changes.</p>

A useful mental model</h2>
When you see a slow query, do not stop at the SQL text.</p>
Walk through the layers:</p>
flowchart TD
    A[SQL shape] --> B[Planner estimates] --> C[Chosen plan] --> D[Index access]
    D --> E[Rows scanned vs rows returned] --> F[Buffers hit vs read] --> G[Sort / hash behavior]
    G --> H[Lock waits] --> I[Transaction age] --> J[Concurrency]
    J --> K[Connection pool behavior] --> L[Application retries] --> M([User-visible impact])
</code></pre>
This does not mean every incident requires checking everything manually.</p>
It means the query is part of a system.</p>
The diagnosis is the mechanism, not the symptom.</p>

Why slow-query incidents are good simulation material</h2>
Slow-query incidents are excellent for training because they are deceptively familiar.</p>
Most engineers know how to read a query.
Many know how to run EXPLAIN</code>.
Some know how to add an index.</p>
But production incidents are harder than that.</p>
A realistic simulation forces questions like:</p>
Is this query the cause or a victim?
Is the plan bad or is it waiting on a lock?
Is the index missing or are statistics wrong?
Is the database overloaded by this query or by retries?
Should we add an index now or reduce traffic first?
Is the safest action in SQL, application config, or deployment rollback?
</code></pre>
That is the skill gap.</p>
Articles can explain the mechanics.
Queries can reveal evidence.
But operational judgment comes from practicing the loop:</p>
flowchart LR
    S[Symptom] --> H[Hypothesis] --> I[Inspection] --> D[Decision] --> C[Consequence]
</code></pre>
In production, every decision has side effects.</p>
A simulation lets teams experience those side effects before they are dealing with real customers, real data, and real pressure.</p>

Conclusion</h2>
A slow Postgres query is not a diagnosis.</p>
It is a signal.</p>
Sometimes the fix is an index.
Sometimes it is ANALYZE</code>.
Sometimes it is rewriting the query.
Sometimes it is reducing concurrency.
Sometimes it is stopping retries.
Sometimes it is killing a blocking transaction.
Sometimes it is changing application behavior.
Sometimes it is doing nothing immediately and collecting better evidence first.</p>
The hard part is not finding a slow query.</p>
The hard part is understanding why it became slow, why it became slow now, and what action will reduce risk without making the system worse.</p>
That is the core of Postgres database reliability: not just knowing how queries work, but understanding how query behavior emerges from data, workload, concurrency, and operational decisions.</p>


Why Postgres reliability cannot be learned from documentation alone
2026-03-27T00:00:00+00:00
Postgres documentation is excellent.</p>
It explains MVCC, locks, WAL, indexes, replication, vacuum, isolation levels, planner behavior, configuration, backup, recovery, and hundreds of other details. If you operate Postgres seriously, you should read it.</p>
But documentation is not the same as operational readiness.</p>
Documentation teaches mechanisms.
Incidents test judgment.</p>
A production incident does not usually announce itself as:</p>
This is a lock queue caused by an ACCESS EXCLUSIVE lock waiting behind an idle transaction.
</code></pre>
It looks more like this:</p>
API latency is rising.
The pool is full.
Some queries are slow.
A migration started recently.
CPU is not that high.
Replica lag is increasing.
Users are reporting timeouts.
The team is not sure whether to cancel, wait, kill, roll back, or fail over.
</code></pre>
That is the gap.</p>
Postgres reliability is not only about knowing how Postgres works. It is about making safe decisions when Postgres, the application, infrastructure, traffic, and human pressure interact.</p>

Documentation explains components, but incidents combine them</h2>
Postgres documentation is organized by topics:</p>
Locks
Transactions
Indexes
VACUUM
WAL
Replication
Configuration
Monitoring
Backup and restore
Query planning
</code></pre>
That structure is necessary for learning.</p>
But real incidents rarely respect that structure.</p>
A single production problem can involve:</p>
flowchart TD
    A[A new release] --> B[A query plan regression]
    B --> C[Longer transaction time]
    C --> D[Connection pool saturation]
    D --> E[Aggressive retries]
    E --> F[Higher database concurrency]
    F --> G[Autovacuum falling behind]
    G --> H[Replica lag]
    H --> I([User-visible timeouts])
</code></pre>
Which chapter is that?</p>
It is not one chapter. It is the interaction of many systems.</p>
That is why reading about locks does not automatically prepare someone for a migration incident. Reading about VACUUM</code> does not automatically prepare someone for a bloat-driven latency degradation. Reading about replication does not automatically prepare someone to decide whether failover is safe.</p>
The hard part is synthesis.</p>

Knowing the command is not the same as knowing when to use it</h2>
Many Postgres incident actions are simple at the command level.</p>
Cancel a query:</p>
SELECT pg_cancel_backend(12345);
</code></pre>
Terminate a backend:</p>
SELECT pg_terminate_backend(12345);
</code></pre>
Analyze a table:</p>
ANALYZE invoices;
</code></pre>
Create an index concurrently:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC);
</code></pre>
Promote a standby:</p>
SELECT pg_promote();
</code></pre>
Drop a replication slot:</p>
SELECT pg_drop_replication_slot('old_slot');
</code></pre>
None of these commands are hard to type.</p>
The difficult questions are operational:</p>
Is this backend safe to terminate?
Will cancellation cause retries that make the incident worse?
Is ANALYZE enough, or is the query slow because of locks?
Can we afford the IO of a concurrent index right now?
Is the standby fresh enough to promote?
Could the old primary still accept writes?
Is this replication slot abandoned, or does a downstream system still need it?
</code></pre>
Documentation can tell you what a command does.</p>
It cannot decide whether using it right now reduces risk.</p>
That decision depends on context.</p>
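Part of that context is visible in the database itself. As a minimal sketch, before cancelling or terminating a backend, one might first look at who owns it and what it has been doing. The pid 12345 is the same placeholder used in the commands above, and the column choice is only a suggestion.

```sql
-- Inspect a single backend before deciding to cancel or terminate it.
-- 12345 is a placeholder pid, not a real session.
SELECT
    pid,
    usename,
    application_name,
    client_addr,
    backend_type,
    state,
    wait_event_type,
    wait_event,
    now() - xact_start   AS transaction_age,
    now() - state_change AS time_in_state,
    left(query, 160)     AS query_preview
FROM pg_stat_activity
WHERE pid = 12345;
```

If the session turns out to be idle in transaction, cancellation is unlikely to release anything, because there is no running statement to cancel. That is one reason the state column is worth reading before choosing between the two commands.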

Incidents are full of partial evidence</h2>
In a calm environment, you can investigate carefully.</p>
During an incident, the evidence is incomplete and changing.</p>
You may see:</p>
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
</code></pre>
And get something like:</p>
active | Lock | transactionid | 47
active | IO   | DataFileRead  | 12
idle   |      |               | 180
</code></pre>
This does not automatically tell you what to do.</p>
You need to ask:</p>
Are lock waits the cause or a consequence?
Who is blocking whom?
Did a migration start?
Are retries increasing concurrency?
Are the IO waits caused by the same workload?
Are idle connections normal or part of pool exhaustion?
</code></pre>
Then you inspect blockers:</p>
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
</code></pre>
Now you find a blocker.</p>
But even then, you still need judgment.</p>
Killing the blocker may fix the incident.
It may also roll back important work, trigger retries, break a migration, or create more load.</p>
The database gives evidence. It does not give certainty.</p>

Runbooks help, but they are not enough</h2>
Runbooks are valuable.</p>
A good runbook can say:</p>
If lock waits are high:
1. Identify blocked sessions.
2. Identify blockers.
3. Check whether the blocker is a migration, application query, or idle transaction.
4. Check user impact.
5. Prefer cancellation before termination where possible.
6. Escalate before terminating unknown critical sessions.
</code></pre>
This is useful.</p>
But production incidents often violate the clean path.</p>
For example:</p>
The blocker is a migration.
The migration is important.
The migration has already partially completed.
Application retries are increasing pressure.
The team cannot immediately tell whether canceling is safe.
The service owner is offline.
A background job is also holding connections.
Replica lag is rising.
The incident commander wants a decision now.
</code></pre>
The runbook can guide thinking.</p>
It cannot replace thinking.</p>
A weak reliability culture treats runbooks as scripts.
A strong reliability culture treats runbooks as decision support.</p>

The dangerous middle: when several actions are plausible</h2>
Many Postgres incidents are hard because multiple actions seem reasonable.</p>
Imagine this situation:</p>
API latency is high.
Connection pool wait time is rising.
Postgres has many active sessions.
Top queries show one expensive query shape.
Replica lag is increasing.
A backfill started ten minutes ago.
</code></pre>
Possible actions:</p>
Pause the backfill.
Reduce application concurrency.
Cancel slow queries.
Increase pool size.
Add an index.
Move reads to a replica.
Disable retries.
Scale the database.
Wait and observe.
</code></pre>
Several of these may be valid in different scenarios.</p>
The wrong action can amplify the incident.</p>
Increasing the pool may push more work into Postgres.
Moving reads to a lagging replica may serve stale data.
Adding an index may create more IO and WAL during an already overloaded period.
Canceling queries may trigger retries.
Waiting may be correct if the system is recovering, or disastrous if pressure is still growing.</p>
The skill is not knowing a list of actions.</p>
The skill is understanding the likely consequence of each action under current conditions.</p>
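For one of those actions, increasing the pool, a rough first check is how much of the existing connection budget is already doing work. The query below is a sketch of that check, not a sizing rule. It assumes PostgreSQL 10 or later, where backend_type is available.

```sql
-- How are client connections currently distributed by state,
-- and how close is the total to max_connections?
SELECT
    count(*)                                                  AS client_backends,
    count(*) FILTER (WHERE state = 'active')                  AS active,
    count(*) FILTER (WHERE state = 'idle')                    AS idle,
    count(*) FILTER (WHERE state LIKE 'idle in transaction%') AS idle_in_transaction,
    current_setting('max_connections')::int                   AS max_connections
FROM pg_stat_activity
WHERE backend_type = 'client backend';
```

Many active sessions with the pool already full suggests that a larger pool would add concurrency to a database that is already busy. Many idle sessions with a full pool points more toward the application holding connections it is not using.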

Documentation teaches normal behavior; incidents expose edge behavior</h2>
Most engineers learn Postgres features in their normal form.</p>
A transaction groups work:</p>
BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 1;

UPDATE accounts
SET balance = balance + 100
WHERE id = 2;

COMMIT;
</code></pre>
An index speeds up access:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
</code></pre>
A replica provides another copy of data:</p>
primary → standby
</code></pre>
Autovacuum cleans old row versions.</p>
WAL protects durability.</p>
All of this is true.</p>
But incidents live in the edge behavior.</p>
A transaction becomes dangerous when it stays open for 45 minutes.</p>
An index build becomes dangerous when it competes with peak traffic.</p>
A replica becomes dangerous when the application assumes it is always fresh.</p>
Autovacuum becomes dangerous when it cannot keep up with write churn.</p>
WAL becomes dangerous when a backfill generates more than archiving and replication can consume.</p>
The feature is not the problem.
The production interaction is the problem.</p>

Example: documentation says CREATE INDEX CONCURRENTLY</code>, production asks “when?”</h2>
A team finds a slow query:</p>
SELECT *
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50;
</code></pre>
The likely index is obvious:</p>
CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC);
</code></pre>
Documentation can explain why CONCURRENTLY</code> avoids blocking writes.</p>
But production readiness requires more questions:</p>
How large is the table?
How many indexes already exist?
How much WAL will this generate?
Will replica lag violate read expectations?
Is storage already under pressure?
Is autovacuum currently behind?
Are we in peak traffic?
Can the migration framework run this outside a transaction?
What happens if the index build fails?
Who will clean up an invalid index?
</code></pre>
The SQL is technically correct.</p>
That does not make the timing safe.</p>
Reliability depends on knowing when the right command is wrong for the current system state.</p>
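Some of the questions above can be answered from the catalog before the build starts. The following is a sketch using the orders table from the example. It assumes PostgreSQL 12 or later for the progress view, and it does not answer the WAL, replica, or traffic questions.

```sql
-- How large is the table, and how many indexes does it already carry?
SELECT
    pg_size_pretty(pg_table_size('orders'))   AS table_size,
    pg_size_pretty(pg_indexes_size('orders')) AS indexes_size,
    (SELECT count(*)
     FROM pg_index
     WHERE indrelid = 'orders'::regclass)     AS index_count;

-- Is an index build already running, and how far along is it?
SELECT
    pid,
    relid::regclass AS table_name,
    phase,
    blocks_done,
    blocks_total
FROM pg_stat_progress_create_index;

-- Did an earlier concurrent build fail and leave an invalid index behind?
SELECT
    indexrelid::regclass AS index_name,
    indrelid::regclass   AS table_name
FROM pg_index
WHERE NOT indisvalid;
```

An invalid index still has to be maintained on every write, so finding one here is a reason to drop it (DROP INDEX CONCURRENTLY is the usual tool) before trying again.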

Example: documentation says failover is possible, production asks “is it safe?”</h2>
Promotion can be simple:</p>
SELECT pg_promote();
</code></pre>
But the operational question is not whether you can promote.</p>
It is whether promotion improves the situation.</p>
Before failover, you need to know:</p>
Is the primary truly unavailable?
Can the old primary still accept writes?
How far behind is the standby?
What data loss is acceptable?
How will applications reconnect?
What happens to connection pools?
Will background workers follow the new primary?
What happens to read replicas?
What happens to logical replication slots?
Can the old primary be fenced?
</code></pre>
A standby that exists but is 20 minutes behind may not be a safe target.</p>
A standby that is current but cannot accept the full write workload may fail shortly after promotion.</p>
A failover that leaves the old primary alive can create data divergence.</p>
Documentation explains promotion.</p>
Practice teaches hesitation, verification, and controlled execution.</p>
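The "how far behind is the standby?" question can be partly checked on the standby itself. This is a minimal sketch, assuming PostgreSQL 10 or later function names and streaming replication. It says nothing about fencing, client routing, or capacity.

```sql
-- Run on the standby that is the promotion candidate.
SELECT
    pg_is_in_recovery()                     AS is_standby,
    pg_last_wal_receive_lsn()               AS received_lsn,
    pg_last_wal_replay_lsn()                AS replayed_lsn,
    pg_wal_lsn_diff(
        pg_last_wal_receive_lsn(),
        pg_last_wal_replay_lsn()
    )                                       AS received_not_replayed_bytes,
    now() - pg_last_xact_replay_timestamp() AS time_since_last_replayed_commit;
```

The time-based column needs care. If the primary has been idle, the value keeps growing even when the standby has replayed everything, so it is best read together with the LSN columns and with what the primary reports, if the primary is still reachable.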

Example: documentation says vacuum cleans dead tuples, production asks “why is it behind?”</h2>
A table shows many dead tuples:</p>
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
</code></pre>
A beginner may think:</p>
Run VACUUM.
</code></pre>
A more experienced operator asks:</p>
Why did dead tuples accumulate?
Is autovacuum blocked by a long transaction?
Is the table too hot for default thresholds?
Are there too many indexes?
Is a backfill creating churn?
Is the table design queue-like?
Are we seeing bloat or just temporary dead tuple pressure?
Will manual vacuum compete with user traffic?
</code></pre>
Then they check old transactions:</p>
SELECT
    pid,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
</code></pre>
The command VACUUM</code> is easy.</p>
Understanding why cleanup failed is the reliability work.</p>
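Open transactions are not the only thing that can hold cleanup back. As a sketch, the three queries below look at the usual candidates for pinning the oldest visible transaction: sessions with an old snapshot, replication slots, and forgotten prepared transactions. Whether any of them is the actual cause still has to be judged against the workload.

```sql
-- Sessions holding the oldest snapshots.
SELECT
    pid,
    application_name,
    state,
    age(backend_xmin)  AS xmin_age,
    now() - xact_start AS transaction_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- Replication slots that pin old row versions.
SELECT
    slot_name,
    slot_type,
    active,
    age(xmin)         AS xmin_age,
    age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;

-- Prepared transactions that were never committed or rolled back.
SELECT
    gid,
    owner,
    prepared,
    age(transaction) AS xid_age
FROM pg_prepared_xacts
ORDER BY prepared;
```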

Operational skill means recognizing patterns</h2>
In real incidents, the exact details vary.</p>
But patterns repeat.</p>
A lock incident has a recognizable shape:</p>
Latency rises.
Connections increase.
Many sessions wait on Lock.
A migration or transaction is in the blocking chain.
The pool fills behind blocked work.
</code></pre>
A connection storm has a recognizable shape:</p>
App instances increase.
Total connections rise sharply.
Many sessions are active or waiting.
Pool timeouts appear.
Retries multiply traffic.
Postgres slows under concurrency.
</code></pre>
A WAL pressure incident has a recognizable shape:</p>
A bulk operation starts.
WAL generation spikes.
Checkpoints become more frequent.
Replica lag grows.
Archiving may fall behind.
Write latency worsens.
</code></pre>
A vacuum starvation incident has a recognizable shape:</p>
Dead tuples trend upward.
Old transactions exist.
Autovacuum runs but does not catch up.
Table/index size grows.
Query performance degrades gradually.
</code></pre>
Pattern recognition does not come only from reading.</p>
It comes from seeing scenarios, making decisions, and observing consequences.</p>
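As an illustration of turning one of these shapes into evidence, the WAL pressure pattern can be checked with two views on the primary. This is a sketch, assuming PostgreSQL 10 or later and streaming replication. Thresholds for what counts as too far behind are left to the team.

```sql
-- Run on the primary: how far behind is each connected standby?
SELECT
    application_name,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_gap_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- Is WAL archiving keeping up, or failing?
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
```

A replay gap that keeps growing while a bulk operation runs fits the WAL pressure shape. A recent last_failed_time that is newer than last_archived_time suggests the archiver is the part falling behind.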

The hardest part is prioritization</h2>
During a Postgres incident, there are usually too many possible investigations.</p>
You can inspect:</p>
pg_stat_activity
pg_locks
pg_stat_statements
pg_stat_replication
pg_replication_slots
pg_stat_user_tables
pg_stat_progress_vacuum
pg_stat_progress_create_index
pg_stat_wal
pg_stat_archiver
application pool metrics
request traces
deployment history
OS IO metrics
</code></pre>
All of them may be relevant.</p>
But you cannot investigate everything at once.</p>
Operational readiness means knowing what to check first based on symptoms.</p>
For example:</p>
If users wait for DB connections:
    start at application pool metrics and pg_stat_activity.

If sessions wait on Lock:
    identify blockers and recent migrations.

If writes are slow and replica lag grows:
    inspect WAL generation, checkpoints, storage, and recent bulk operations.

If queries became slow gradually:
    inspect query plans, table statistics, dead tuples, bloat signals, and data growth.

If reads from replicas are inconsistent:
    inspect replay delay and read-routing assumptions.
</code></pre>
The skill is navigation.</p>
Documentation gives the map.
Incidents require route selection.</p>
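When the symptom points at query cost rather than locks or connections, a common first stop is pg_stat_statements. The sketch below assumes the extension is installed and preloaded, and that the server is PostgreSQL 13 or later, where the timing columns are named total_exec_time and mean_exec_time.

```sql
-- Which statement shapes account for the most execution time
-- since statistics were last reset?
SELECT
    queryid,
    calls,
    round(total_exec_time::numeric, 1) AS total_ms,
    round(mean_exec_time::numeric, 2)  AS mean_ms,
    rows,
    left(query, 120)                   AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

The numbers are cumulative, so a statement that became slow ten minutes ago can hide behind one that has been moderately expensive for weeks. Comparing two snapshots taken a few minutes apart gives a better picture of what is expensive right now.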

Good operators know what not to touch</h2>
A major difference between junior and senior incident response is restraint.</p>
During a database incident, doing something feels better than doing nothing.</p>
But some actions are dangerous without enough evidence:</p>
Increasing max_connections
Increasing pool size
Killing random backends
Dropping replication slots
Running VACUUM FULL
Creating emergency indexes
Failing over prematurely
Restarting Postgres
Disabling autovacuum
Changing durability settings
Deleting files from pg_wal
</code></pre>
Some of these actions can be correct in specific situations.</p>
The danger is using them as reflexes.</p>
Reliability is not only the ability to act.</p>
It is the ability to delay unsafe action long enough to understand the system, while still acting quickly enough to reduce impact.</p>
That balance cannot be learned from syntax alone.</p>
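Taking one item from that list as an example, dropping a replication slot is easier to judge after looking at what each slot is holding. This is a minimal sketch to run on the primary. It shows retained WAL and whether a consumer is connected, but it cannot tell whether a downstream system still depends on the slot.

```sql
-- Run on the primary: how much WAL does each slot retain,
-- and is anything currently connected to it?
SELECT
    slot_name,
    slot_type,
    database,
    active,
    active_pid,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```

An inactive slot with a large retained_wal value is a candidate for removal, not a confirmed one. Ownership still has to be established outside the database.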

Documentation does not teach team coordination</h2>
Postgres incidents are rarely solved by one person silently running SQL.</p>
They involve communication:</p>
Who is incident commander?
Who owns the application?
Who owns the database?
Who can pause workers?
Who can roll back deploys?
Who can approve failover?
Who communicates customer impact?
Who records the timeline?
Who verifies recovery?
</code></pre>
Technical evidence must be translated into operational decisions.</p>
For example:</p>
“We have 60 sessions waiting on a migration lock.
The blocker is an app transaction idle for 18 minutes.
Canceling the migration will stop new queue growth.
Terminating the idle transaction appears safe, but it belongs to the billing service.
We need billing owner approval or incident commander decision.”
</code></pre>
That is not just database knowledge.</p>
That is incident communication.</p>
A technically correct action performed without coordination can still create organizational failure.</p>

Documentation does not create muscle memory</h2>
In a quiet learning environment, an engineer can search, read, think, and test.</p>
In an incident, the environment is different:</p>
Users are affected.
Dashboards are noisy.
Logs are incomplete.
People are asking for updates.
The system is changing while you investigate.
Some actions are irreversible.
Time pressure is real.
</code></pre>
Under pressure, people fall back to practiced behavior.</p>
If the only practiced behavior is reading documentation, the team may move too slowly or choose familiar but unsafe actions.</p>
Simulation creates muscle memory:</p>
Notice the symptom.
Form a hypothesis.
Choose the next inspection.
Interpret evidence.
Communicate uncertainty.
Take a bounded action.
Observe the result.
Revise the hypothesis.
</code></pre>
That loop is the core of operational reliability.</p>

A useful maturity model</h2>
Postgres reliability maturity can be described in four levels.</p>
Level 1: Vocabulary</h3>
The team knows terms:</p>
locks;
VACUUM;
WAL;
replica lag;
connection pool;
EXPLAIN;
checkpoint;
transaction;
index.
</code></pre>
This is necessary, but not enough.</p>
Level 2: Mechanism understanding</h3>
The team understands how things work:</p>
why MVCC creates dead tuples;
why locks protect consistency;
why WAL enables recovery;
why replicas can lag;
why indexes help some queries and hurt writes;
why pool size controls concurrency.
</code></pre>
This is where documentation is very strong.</p>
Level 3: Diagnostic reasoning</h3>
The team can connect symptoms to mechanisms:</p>
pool saturation may be caused by slow queries;
slow queries may be caused by locks;
locks may be caused by migrations;
replica lag may be caused by WAL spikes;
bad plans may be caused by stale statistics;
vacuum lag may be caused by old transactions.
</code></pre>
This requires experience and practice.</p>
Level 4: Operational judgment</h3>
The team can act safely under pressure:</p>
cancel the right thing;
pause the right workload;
avoid unsafe failover;
reduce concurrency;
communicate risk;
choose rollback vs roll-forward;
protect user traffic;
recover without creating a second incident.
</code></pre>
This is where simulation matters most.</p>

What articles can teach well</h2>
Articles are valuable.</p>
They can explain:</p>
mental models;
common failure modes;
diagnostic queries;
dangerous anti-patterns;
technical vocabulary;
incident patterns;
review questions;
safe design principles.
</code></pre>
An article can show why this query matters:</p>
SELECT
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC;
</code></pre>
It can explain that wait_event_type = 'Lock'</code> points toward contention.</p>
It can explain that idle in transaction</code> is dangerous.</p>
It can explain that high connection count is not the same as useful throughput.</p>
But an article cannot reproduce the stress of deciding whether to terminate a real backend while customer requests are failing.</p>
That is the boundary.</p>

What simulations teach better</h2>
Simulations are useful because they train behavior, not only knowledge.</p>
A good Postgres incident simulation can force the team to experience:</p>
unclear symptoms;
conflicting metrics;
misleading first hypotheses;
actions with side effects;
pressure from user impact;
coordination between roles;
the cost of waiting too long;
the cost of acting too early;
post-incident analysis.
</code></pre>
For example, a simulation can show what happens when someone increases pool size during database saturation.</p>
It can show how retries amplify load.</p>
It can show how canceling the wrong migration changes the incident.</p>
It can show how a replica can exist and still be unsafe to promote.</p>
It can show how a long transaction prevents vacuum cleanup and creates delayed consequences.</p>
That feedback loop is difficult to get from documentation.</p>

The goal is not to replace documentation</h2>
This is not an argument against documentation.</p>
Strong Postgres reliability requires documentation, source knowledge, and practical experience.</p>
The right relationship is:</p>
Documentation explains mechanisms.
Runbooks organize known responses.
Monitoring provides evidence.
Simulations build judgment.
Production experience validates assumptions.
Post-incident reviews improve the system.
</code></pre>
Each layer has a role.</p>
The mistake is expecting one layer to do all the work.</p>
Documentation alone produces theoretical understanding.
Monitoring alone produces noise.
Runbooks alone produce mechanical responses.
Production alone is too expensive as a training environment.</p>
Simulation connects them.</p>

A practical way to use documentation better</h2>
Documentation becomes more valuable when read through incident questions.</p>
Instead of reading the lock chapter as theory, ask:</p>
Which lock modes can block normal traffic?
How would I recognize a lock queue?
Which DDL operations need strong locks?
What would make cancellation unsafe?
</code></pre>
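The "how would I recognize a lock queue?" question has a fairly direct answer in pg_locks. The sketch below lists granted and waiting lock requests on a single table. The orders table is reused from the earlier examples and stands in for whichever table is under suspicion.

```sql
-- Granted and waiting relation-level locks on one table.
-- A waiting AccessExclusiveLock with ordinary queries waiting as well
-- is the typical shape of a DDL lock queue.
SELECT
    l.pid,
    l.mode,
    l.granted,
    a.state,
    a.application_name,
    now() - a.query_start AS query_age,
    left(a.query, 120)    AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'relation'
  AND l.relation = 'orders'::regclass
  AND l.database = (SELECT oid FROM pg_database
                    WHERE datname = current_database())
ORDER BY l.granted DESC, a.query_start;
```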
Instead of reading about WAL as internals, ask:</p>
What happens if WAL generation spikes?
How does WAL affect replication lag?
How can archiving failure fill disk?
How would checkpoints appear in latency graphs?
</code></pre>
Instead of reading about autovacuum, ask:</p>
What prevents cleanup?
How do old transactions affect vacuum?
Which tables need different settings?
How would vacuum failure show up as query latency?
</code></pre>
Instead of reading about replication, ask:</p>
What does this replica protect us from?
How stale can reads be?
What happens during promotion?
How do we prevent split-brain?
What downstream systems depend on replication state?
</code></pre>
This turns documentation from reference material into operational training material.</p>

A good Postgres reliability training loop</h2>
A strong learning process looks like this:</p>
1. Study the mechanism.
2. Observe the metric in a healthy system.
3. Trigger or simulate a controlled failure.
4. Diagnose using real tools.
5. Choose a mitigation.
6. Observe side effects.
7. Review the decision.
8. Update dashboards, runbooks, code, or process.
</code></pre>
For example, with locks:</p>
Study lock modes.
Observe normal pg_stat_activity.
Simulate a migration waiting behind a transaction.
Identify blockers.
Try canceling migration vs terminating blocker.
Observe pool behavior.
Discuss which action was safest.
Update migration policy.
</code></pre>
With replication:</p>
Study WAL streaming.
Observe normal replay lag.
Simulate a WAL spike.
Watch replica delay.
Route stale-sensitive reads.
Discuss failover safety.
Update read-routing rules.
</code></pre>
This is how knowledge becomes readiness.</p>

The business reason this matters</h2>
Postgres reliability is not only a technical concern.</p>
Database incidents affect product behavior:</p>
users cannot log in;
payments fail;
orders time out;
dashboards show stale data;
workers fall behind;
notifications duplicate;
customers lose trust;
engineers lose sleep;
teams become afraid of migrations.
</code></pre>
The cost is not just downtime.</p>
It is slower engineering velocity.</p>
When teams fear the database, they avoid necessary changes. They delay migrations, postpone cleanup, over-index defensively, under-invest in schema evolution, and treat every production change as risky.</p>
Reliability training reduces that fear.</p>
Not by pretending incidents will not happen, but by making the team more competent when they do.</p>

Why this matters specifically for Postgres</h2>
Postgres is powerful because it gives teams many capabilities:</p>
transactions;
rich indexing;
constraints;
JSON;
extensions;
replication;
partitioning;
concurrent index builds;
foreign keys;
materialized views;
stored procedures;
logical decoding;
advanced SQL.
</code></pre>
Those capabilities allow teams to build serious systems.</p>
They also create operational complexity.</p>
A database that supports strong correctness, flexible queries, and rich workloads requires disciplined operation.</p>
Postgres will often do exactly what you asked.</p>
The reliability question is whether you understood what you asked it to do under production conditions.</p>

Common anti-patterns in learning Postgres reliability</h2>
Learning only through local experiments</h3>
Local databases hide production realities: data volume, concurrency, locks, replicas, WAL volume, autovacuum pressure, and real traffic.</p>
Memorizing diagnostic queries without hypotheses</h3>
A query is useful only when you know what question it answers.</p>
Treating every incident as a missing index</h3>
Indexes matter, but not every latency problem is an indexing problem.</p>
Treating failover as a button</h3>
Promotion is easy. Safe recovery is not.</p>
Treating runbooks as scripts</h3>
Runbooks guide decisions. They do not remove context.</p>
Treating monitoring as truth</h3>
Metrics are evidence. They require interpretation.</p>
Waiting for production to teach the team</h3>
Production is the most expensive classroom.</p>

What a simulation-ready team looks like</h2>
A team ready for Postgres incidents can do more than quote documentation.</p>
It can say:</p>
We know what normal looks like.
We know which symptoms are user-impacting.
We know where database pressure appears first.
We know how to inspect active sessions.
We know how to identify blockers.
We know which workloads can be paused.
We know who owns migrations.
We know how replicas are used.
We know our acceptable data loss window.
We know which actions are dangerous.
We have practiced decisions before production forced them.
</code></pre>
That is operational maturity.</p>
It does not mean the team never has incidents.</p>
It means incidents are shorter, less chaotic, and less likely to produce secondary failures.</p>

Conclusion</h2>
Postgres documentation is necessary.</p>
But it is not sufficient.</p>
It teaches what locks are, how WAL works, why vacuum exists, how replication functions, what indexes do, and how configuration parameters behave.</p>
Production incidents test something different:</p>
Can the team recognize the pattern?
Can it connect database symptoms to application behavior?
Can it choose the safest next action?
Can it avoid making the incident worse?
Can it communicate uncertainty?
Can it recover the system without creating a second failure?
Can it learn afterward?
</code></pre>
That is database reliability.</p>
The dangerous phrase is:</p>
We read the docs, so we know Postgres.
</code></pre>
The better phrase is:</p>
We understand the mechanisms, and we have practiced applying them under incident conditions.
</code></pre>
Documentation builds knowledge.
Simulation builds judgment.
Reliable Postgres operations need both.</p>
Rillence

Postgres incidents rarely start with "Postgres broke"

The difference between trigger, mechanism, and amplifier</h2> A useful way to reason about Postgres incidents is to separate three things.</p>

Dangerous reactions during Postgres incidents</h2> Some actions feel helpful but can be dangerous when done without understanding the mechanism.</p>

Increasing the connection pool</h3> May help if the pool is too small and Postgres has spare capacity.</p> May hurt if Postgres is already saturated.</p>

Killing random queries</h3> May help if a clearly harmful query is blocking critical work.</p> May hurt if you kill the wrong backend, interrupt a migration, or cause application-level retries.</p>

Restarting the application</h3> May help if the app is stuck.</p> May hurt if every instance reconnects at once and creates a connection storm.</p>

Failing over to a replica</h3> May help if the primary is unhealthy.</p> May hurt if the issue is caused by application behavior, bad queries, or a migration that will continue after failover.</p>

Connection pools and Postgres: why more connections do not mean more performance

Common anti-patterns</h2>

Pool size configured per instance without considering total instances</h3> Autoscaling can silently multiply database pressure.</p>

One shared pool for critical and non-critical work</h3> A reporting job should not be able to starve checkout.</p>

Long external calls inside transactions</h3> This turns network latency into database connection pressure.</p>

No timeout hierarchy</h3> Without clear request, pool, statement, lock, and transaction timeouts, failures linger too long.</p>

Aggressive retries</h3> Retries without budgets and backoff can turn a small slowdown into a storm.</p>

Treating PgBouncer as a universal fix</h3> A pooler helps manage connections. It does not remove query cost, lock contention, IO saturation, or bad transaction design.</p>

Postgres replication: when a standby exists but does not save you

The false sense of safety</h2> Many teams say:</p>

Replication lag is not one number</h2> The first mistake is treating replication lag as a single metric.</p> In practice, there are several different “lags”:</p>

Checking the standby from the standby</h2> On the standby itself:</p>

Read replicas can serve stale data</h2> A common architecture sends writes to the primary and reads to replicas:</p>

WAL volume can break your assumptions</h2> Replication lag is not only about network speed.</p> A primary can suddenly generate more WAL than usual:</p>

Synchronous replication: stronger durability, different failure mode</h2> Asynchronous replication has a data loss window.</p>

Failover is a process, not a command</h2> Promotion is technically simple:</p>

Timeline changes matter</h2> After promotion, the new primary continues on a new timeline.</p> That matters for replicas, WAL archives, backup chains, and recovery procedures.</p>

Logical replication adds another layer</h2> Logical replication is often used for:</p>

Common anti-patterns</h2>

Ignoring slots until disk pressure</h3> Replication slots should be treated like production resources with owners, alerts, and lifecycle management.</p> An abandoned slot is not harmless metadata.</p>

Treating failover as infrastructure-only</h3> Failover affects database clients, application routing, workers, caches, queues, jobs, observability, and people.</p> A database promotion that the application does not understand is not recovery.</p>

Never testing promotion</h3> A failover process that has never been practiced is an assumption.</p> Assumptions do not become reliable because they are written in a document.</p>

WAL and checkpoints: the invisible machinery behind Postgres durability

The basic idea of WAL</h2> When a transaction changes data, Postgres does not rely only on immediately updating table and index files.</p> It first records the change in WAL.</p> A simplified write path looks like this:</p>

COMMIT does not mean “every table page is already on disk”</h2> A common misconception:</p>

Checkpoints are not free</h2> During a checkpoint, Postgres must write dirty buffers to disk.</p> If many pages are dirty, that can create significant IO pressure. If the storage system is already busy, checkpoint activity can appear as latency spikes.</p> Symptoms may include:</p>

Time-based and WAL-volume-based checkpoints</h2> Checkpoints happen for different reasons.</p> Two important controls are:</p>

Measuring WAL generation</h2> A basic WAL snapshot:</p>

Finding WAL-heavy queries</h2> In modern Postgres versions, pg_stat_statements</code> can expose WAL-related metrics for statements, depending on version and configuration.</p> A useful query shape:</p>

WAL retention and replication slots</h2> Replication slots can retain WAL required by a replica or logical consumer.</p> That is useful.</p> It is also dangerous.</p>

Common anti-patterns</h2>

Treating WAL as a storage nuisance</h3> pg_wal</code> is not garbage. It is required for crash recovery, replication, and backups.</p>

Ignoring archiver failures</h3> A database can keep serving traffic while silently losing point-in-time recovery capability.</p>

Letting replication slots have no owner</h3> An abandoned slot can retain WAL until the primary disk is in danger.</p>

Running large backfills without a WAL budget</h3> A backfill should be planned around WAL rate, replica lag, archive capacity, and checkpoint pressure.</p>

Using staging to estimate production WAL cost</h3>
Small data, fewer indexes, and missing replicas make staging a poor predictor of WAL impact.</p>

Manually deleting WAL files</h3>
This is not a safe incident response pattern. It can destroy recovery guarantees.</p>

Postgres locks: how one ALTER TABLE can stop your product

What teams often get wrong</h2>

They test migrations only on empty or tiny databases</h3>
A migration that takes 100 ms on staging may behave very differently on a 500 GB production table.</p>

They ignore concurrent workload</h3>
The table is not sitting idle in production. It is being read, written, vacuumed, indexed, and queried by multiple services.</p>

They forget old transactions</h3>
One forgotten transaction can turn a safe migration into a production incident.</p>

They run DDL without timeouts</h3>
A migration that waits forever can become a silent lock queue.</p>
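A migration runner can wrap each DDL statement so that it gives up quickly instead of queueing behind a long transaction. The sketch below only builds the statement list; `guarded_ddl` is a hypothetical helper, the default values are placeholders, and the `ALTER TABLE` in the usage is an invented example. `lock_timeout` and `statement_timeout` are the real Postgres settings.

```python
from typing import List


def guarded_ddl(ddl: str, lock_timeout_ms: int = 3000,
                statement_timeout_ms: int = 60000) -> List[str]:
    """Return the statements to run, in order, for one transactional DDL change.

    SET LOCAL scopes both timeouts to this transaction only, so they do not
    leak into other work done on the same connection.
    """
    if lock_timeout_ms <= 0 or statement_timeout_ms <= 0:
        raise ValueError("timeouts must be positive; 0 disables them in Postgres")
    return [
        "BEGIN",
        f"SET LOCAL lock_timeout = '{lock_timeout_ms}ms'",
        f"SET LOCAL statement_timeout = '{statement_timeout_ms}ms'",
        ddl.strip().rstrip(";"),
        "COMMIT",
    ]


if __name__ == "__main__":
    for statement in guarded_ddl("ALTER TABLE orders ADD COLUMN note text;"):
        print(statement + ";")
```

If the lock is not granted in time, the statement fails and the runner can retry later, which is usually cheaper than letting application queries pile up behind a waiting lock request. This transactional wrapper does not apply to `CREATE INDEX CONCURRENTLY`, which cannot run inside a transaction block.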

They treat the database as isolated</h3>
The real incident may involve the app pool, retries, background jobs, dashboards, and human decisions.</p>

Schema migrations in Postgres: why safe SQL can be dangerous in production

The core problem: DDL changes concurrency</h2>
A normal application query changes data or reads data.</p>
A schema migration changes the shape of the database itself.</p>

Lock compatibility is the hidden part of migration safety</h2>
Postgres locks are not all equal.</p>

The migration that waits can be worse than the migration that runs</h2>
A migration can damage traffic before it does any meaningful work.</p>
Suppose this transaction is open:</p>

Concurrent index creation can still hurt</h2>
CONCURRENTLY</code> reduces blocking. It does not make index creation free.</p>
A concurrent index build can still:</p>

Adding constraints safely</h2>
A constraint can be both logically correct and operationally expensive.</p>
For example:</p>

Foreign keys are operational changes too</h2>
Foreign keys are valuable. They protect data integrity.</p>
But adding one to a large, hot table is not just a metadata change.</p>
Example:</p>

The difference between trigger, mechanism, and amplifier</h2>
A useful way to reason about Postgres incidents is to separate three things.</p>

Dangerous reactions during Postgres incidents</h2>
Some actions feel helpful but can be dangerous when done without understanding the mechanism.</p>

Increasing the connection pool</h3>
May help if the pool is too small and Postgres has spare capacity.</p>
May hurt if Postgres is already saturated.</p>

Killing random queries</h3>
May help if a clearly harmful query is blocking critical work.</p>
May hurt if you kill the wrong backend, interrupt a migration, or cause application-level retries.</p>

Restarting the application</h3>
May help if the app is stuck.</p>
May hurt if every instance reconnects at once and creates a connection storm.</p>

Failing over to a replica</h3>
May help if the primary is unhealthy.</p>
May hurt if the issue is caused by application behavior, bad queries, or a migration that will continue after failover.</p>

Pool size configured per instance without considering total instances</h3>
Autoscaling can silently multiply database pressure.</p>
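The multiplication is easy to check with arithmetic. The sketch below assumes each application instance opens its own pool directly against Postgres; the function names and the figures are illustrative, and `reserved_connections` stands for whatever headroom is kept for admin sessions, replication, and maintenance.

```python
def total_pool_connections(pool_size_per_instance: int, instance_count: int) -> int:
    """Connections the fleet can open if every instance fills its pool."""
    return pool_size_per_instance * instance_count


def max_safe_pool_size(max_connections: int, reserved_connections: int,
                       max_instances: int) -> int:
    """Largest per-instance pool that still fits at the autoscaling ceiling."""
    if max_instances <= 0:
        raise ValueError("max_instances must be positive")
    usable = max_connections - reserved_connections
    return max(0, usable // max_instances)


if __name__ == "__main__":
    # A pool of 20 looks modest until autoscaling reaches 40 instances.
    print(total_pool_connections(20, 40))
    # With max_connections = 500 and 20 reserved, each of 40 instances gets:
    print(max_safe_pool_size(500, 20, 40))
```

Sizing against the autoscaling maximum rather than the current instance count is the point: the pool that fits today may not fit during the traffic spike that triggers scale-out.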

One shared pool for critical and non-critical work</h3>
A reporting job should not be able to starve checkout.</p>

Long external calls inside transactions</h3>
This turns network latency into database connection pressure.</p>

No timeout hierarchy</h3>
Without clear request, pool, statement, lock, and transaction timeouts, failures linger too long.</p>
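A timeout hierarchy can be checked mechanically. The sketch below assumes one plausible convention: inner timeouts fire before outer ones (lock, then statement, then transaction, then request), and waiting for a pooled connection never takes longer than the request allows. The ordering, the key names, and the sample values are assumptions, not a rule from Postgres itself.

```python
from typing import Dict, List

# Assumed ordering, innermost to outermost.
ORDER = ["lock", "statement", "transaction", "request"]


def timeout_violations(timeouts_ms: Dict[str, int]) -> List[str]:
    """Return human-readable problems; an empty list means the hierarchy holds."""
    problems = []
    for inner, outer in zip(ORDER, ORDER[1:]):
        if timeouts_ms[inner] > timeouts_ms[outer]:
            problems.append(
                f"{inner} timeout ({timeouts_ms[inner]} ms) exceeds "
                f"{outer} timeout ({timeouts_ms[outer]} ms)"
            )
    if timeouts_ms["pool"] > timeouts_ms["request"]:
        problems.append(
            f"pool timeout ({timeouts_ms['pool']} ms) exceeds "
            f"request timeout ({timeouts_ms['request']} ms)"
        )
    return problems


if __name__ == "__main__":
    config = {"lock": 6000, "statement": 5000, "transaction": 8000,
              "pool": 1000, "request": 10000}
    for problem in timeout_violations(config):
        print(problem)
```

A check like this fits naturally into configuration review or CI, where a mismatch is cheap to fix.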

Aggressive retries</h3>
Retries without budgets and backoff can turn a small slowdown into a storm.</p>
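A minimal sketch of both ideas, assuming an in-process client: exponential backoff with full jitter spreads retries out in time, and a retry budget caps retries as a fraction of original requests. The class, its `ratio` parameter, and the defaults are hypothetical and not taken from any particular library.

```python
import random
from typing import Callable


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


class RetryBudget:
    """Allow retries only while they stay under `ratio` of original requests."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Return True and count the retry if the budget allows it."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


if __name__ == "__main__":
    budget = RetryBudget(ratio=0.5)
    for _ in range(4):
        budget.record_request()
    # With 4 requests and a 0.5 ratio, only 2 retries are allowed.
    print([budget.try_spend() for _ in range(3)])
```

When the budget is exhausted, the caller fails fast instead of adding load. The counters here never reset; a real implementation would likely use a sliding window.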

Treating PgBouncer as a universal fix</h3>
A pooler helps manage connections. It does not remove query cost, lock contention, IO saturation, or bad transaction design.</p>

Ignoring slots until disk pressure</h3>
Replication slots should be treated like production resources with owners, alerts, and lifecycle management.</p>
An abandoned slot is not harmless metadata.</p>
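An alert on retained WAL per slot is one way to give slots that kind of attention. The sketch below assumes rows shaped like a few columns of `pg_replication_slots` (`slot_name`, `active`, `restart_lsn`) and a current LSN reading, all already fetched as text. The threshold and sample rows are made up.

```python
from typing import Dict, List, Tuple


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN such as '1/100' (hex high/low) to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def slots_over_threshold(slots: List[Dict], current_lsn: str,
                         max_retained_bytes: int) -> List[Tuple[str, bool, int]]:
    """Return (slot_name, active, retained_bytes) for slots holding too much WAL."""
    current = lsn_to_bytes(current_lsn)
    flagged = []
    for slot in slots:
        if slot["restart_lsn"] is None:
            continue  # the slot has not reserved any WAL yet
        retained = current - lsn_to_bytes(slot["restart_lsn"])
        if retained > max_retained_bytes:
            flagged.append((slot["slot_name"], slot["active"], retained))
    return flagged


if __name__ == "__main__":
    rows = [
        {"slot_name": "replica_a", "active": True, "restart_lsn": "1/0"},
        {"slot_name": "old_cdc", "active": False, "restart_lsn": "0/0"},
    ]
    for name, active, retained in slots_over_threshold(rows, "1/100", 1024):
        print(name, "active" if active else "inactive", retained)
```

An inactive slot that keeps appearing in this list is a candidate for an ownership question before it becomes a disk question.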

Treating failover as infrastructure-only</h3>
Failover affects database clients, application routing, workers, caches, queues, jobs, observability, and people.</p>
A database promotion that the application does not understand is not recovery.</p>


