<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
    <title>Rillence</title>
    <subtitle>Practice realistic PostgreSQL incidents in sandboxed environments with deterministic feedback and zero production risk.</subtitle>
    <link rel="self" type="application/atom+xml" href="https://rillence.com/atom.xml"/>
    <link rel="alternate" type="text/html" href="https://rillence.com"/>
    <generator uri="https://www.getzola.org/">Zola</generator>
    <updated>2026-05-29T00:00:00+00:00</updated>
    <id>https://rillence.com/atom.xml</id>
    <entry xml:lang="en">
        <title>Postgres incidents rarely start with &quot;Postgres broke&quot;</title>
        <published>2026-05-29T00:00:00+00:00</published>
        <updated>2026-05-29T00:00:00+00:00</updated>
        
        <author>
          <name>Unknown</name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/incidents-start-before-postgres/"/>
        <id>https://rillence.com/notes/incidents-start-before-postgres/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/incidents-start-before-postgres/">&lt;p&gt;When a production system starts degrading, Postgres often becomes the first suspect.&lt;&#x2F;p&gt;
&lt;p&gt;The application is slow.
Requests are timing out.
Background jobs are piling up.
Dashboards are turning red.
The database CPU is higher than usual.
Someone opens the incident channel and says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Looks like Postgres is having problems.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Sometimes that is true. But very often, Postgres is not the original cause. It is the place where multiple system problems finally become visible.&lt;&#x2F;p&gt;
&lt;p&gt;A Postgres incident usually starts somewhere else: a release, a schema migration, a query pattern change, a sudden traffic spike, a connection pool misconfiguration, a long-running transaction, a reporting job, a replica falling behind, or an application retry storm.&lt;&#x2F;p&gt;
&lt;p&gt;The database becomes the pressure point.&lt;&#x2F;p&gt;
&lt;p&gt;That is why Postgres reliability is not only about knowing SQL or database internals. It is about understanding how Postgres behaves inside a living production system.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-misleading-phrase-the-database-is-slow&quot;&gt;The misleading phrase: “the database is slow”&lt;&#x2F;h2&gt;
&lt;p&gt;“The database is slow” sounds like a diagnosis, but it is usually only a symptom.&lt;&#x2F;p&gt;
&lt;p&gt;A slow query can be caused by many different mechanisms:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;a bad execution plan;&lt;&#x2F;li&gt;
&lt;li&gt;missing or ineffective indexes;&lt;&#x2F;li&gt;
&lt;li&gt;outdated table statistics;&lt;&#x2F;li&gt;
&lt;li&gt;table bloat;&lt;&#x2F;li&gt;
&lt;li&gt;lock contention;&lt;&#x2F;li&gt;
&lt;li&gt;disk saturation;&lt;&#x2F;li&gt;
&lt;li&gt;too many concurrent connections;&lt;&#x2F;li&gt;
&lt;li&gt;long-running transactions;&lt;&#x2F;li&gt;
&lt;li&gt;autovacuum falling behind;&lt;&#x2F;li&gt;
&lt;li&gt;replication lag;&lt;&#x2F;li&gt;
&lt;li&gt;application retry storms;&lt;&#x2F;li&gt;
&lt;li&gt;connection pool exhaustion;&lt;&#x2F;li&gt;
&lt;li&gt;an expensive migration running at the wrong time.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;The external symptom may look the same:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;HTTP latency increased
API requests timing out
Worker queue length growing
Database connections rising
Postgres CPU and IO elevated
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the correct response depends entirely on the mechanism.&lt;&#x2F;p&gt;
&lt;p&gt;This is where many teams get into trouble. They treat the symptom as the cause.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-typical-incident-chain&quot;&gt;A typical incident chain&lt;&#x2F;h2&gt;
&lt;p&gt;A Postgres incident often looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Small application change] --&amp;gt; B[New or more frequent query pattern]
    B --&amp;gt; C[Higher database load]
    C --&amp;gt; D[Longer query execution time]
    D --&amp;gt; E[Connections held for longer]
    E --&amp;gt; F[Connection pool saturation]
    F --&amp;gt; G[Application timeouts]
    G --&amp;gt; H[Retries]
    H --&amp;gt; I[Even more database load]
    I --&amp;gt; J([Production incident])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;From the outside, this may look like “Postgres became slow.”&lt;&#x2F;p&gt;
&lt;p&gt;But Postgres did not randomly become slow. The system changed around it.&lt;&#x2F;p&gt;
&lt;p&gt;That distinction matters because the wrong mitigation can make the incident worse.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-1-a-harmless-release-that-doubles-database-pressure&quot;&gt;Example 1: a harmless release that doubles database pressure&lt;&#x2F;h2&gt;
&lt;p&gt;Imagine a backend service has an endpoint like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT id, email, status
FROM users
WHERE id = $1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is fast. It uses the primary key. No problem.&lt;&#x2F;p&gt;
&lt;p&gt;Then a release adds a feature flag check based on recent user activity:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT id
FROM user_events
WHERE user_id = $1
  AND event_type = &amp;#39;purchase&amp;#39;
ORDER BY created_at DESC
LIMIT 1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On staging, this query is fast. In production, &lt;code&gt;user_events&lt;&#x2F;code&gt; has hundreds of millions of rows.&lt;&#x2F;p&gt;
&lt;p&gt;If no index matches both the filter columns and the sort order of this query, Postgres may need to scan far more data than expected.&lt;&#x2F;p&gt;
&lt;p&gt;A better supporting index might look like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_user_events_user_type_created
ON user_events (user_id, event_type, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the incident is not just “missing index.”&lt;&#x2F;p&gt;
&lt;p&gt;The real incident chain may be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;New release adds one extra query per request
        ↓
Query is cheap for some users, expensive for others
        ↓
Average DB time per request increases
        ↓
Application holds connections longer
        ↓
Pool reaches max size
        ↓
Requests queue inside the app
        ↓
Timeouts trigger retries
        ↓
Postgres receives even more work
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A useful first question is not:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Which query is slow?”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A better first question is:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“What changed in the system right before the database started showing pressure?”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-2-connection-pool-exhaustion-is-not-always-a-pool-problem&quot;&gt;Example 2: connection pool exhaustion is not always a pool problem&lt;&#x2F;h2&gt;
&lt;p&gt;When an application starts timing out while waiting for a database connection, the instinctive response is often:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Increase the pool size.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;That can help in some cases. But it can also make the incident worse.&lt;&#x2F;p&gt;
&lt;p&gt;A connection pool is not just a performance tool. It is a pressure regulator.&lt;&#x2F;p&gt;
&lt;p&gt;If Postgres is already overloaded, increasing the number of concurrent database sessions may increase CPU contention, memory pressure, lock contention, and IO saturation.&lt;&#x2F;p&gt;
&lt;p&gt;A useful mental model:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Small pool:
Application queues before Postgres

Huge pool:
Postgres receives too much concurrent work directly
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can inspect active database sessions with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    state,
    wait_event_type,
    wait_event,
    count(*)
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY count(*) DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This tells you whether sessions are actively running, waiting on locks, waiting on IO, idle in transaction, or simply connected.&lt;&#x2F;p&gt;
&lt;p&gt;But the query alone is not the solution. The important part is interpretation.&lt;&#x2F;p&gt;
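&lt;p&gt;For example, many sessions in the &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; state are a major warning sign. You can list them, oldest transaction first, with:&lt;&#x2F;p&gt;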
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    state,
    query
FROM pg_stat_activity
WHERE state = &amp;#39;idle in transaction&amp;#39;
ORDER BY transaction_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; session may keep old row versions alive, block vacuum progress, hold locks, or distort the behavior of other parts of the system.&lt;&#x2F;p&gt;
&lt;p&gt;In an incident, this may appear as “Postgres is slow,” while the actual trigger is an application code path that opened a transaction and failed to close it correctly.&lt;&#x2F;p&gt;
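&lt;p&gt;As a guard against this failure mode, Postgres can end sessions that sit idle inside an open transaction for too long. A minimal sketch, assuming a hypothetical application role named &lt;code&gt;app_user&lt;&#x2F;code&gt; and an illustrative 60-second limit that would need tuning for the real workload:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- app_user and the 60s value are assumptions for illustration.
-- The setting applies to new sessions opened by this role.
ALTER ROLE app_user
    SET idle_in_transaction_session_timeout = &amp;#39;60s&amp;#39;;

-- Check the effective value from a session of that role.
SHOW idle_in_transaction_session_timeout;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The timeout limits the damage, but the application still has to handle the terminated session, so the leaking code path remains the thing to fix.&lt;&#x2F;p&gt;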
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-3-a-schema-migration-that-blocks-production-traffic&quot;&gt;Example 3: a schema migration that blocks production traffic&lt;&#x2F;h2&gt;
&lt;p&gt;Schema migrations are one of the most common sources of Postgres incidents.&lt;&#x2F;p&gt;
&lt;p&gt;A migration can be syntactically correct and still operationally dangerous.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders ADD COLUMN processed_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
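&lt;p&gt;Adding a nullable column without a default is a metadata-only change and is usually fast. But not every &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; is harmless, and even operations that are usually fast still need a brief &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; lock. If that lock request queues behind a long-running transaction, every query on the table queues behind it.&lt;&#x2F;p&gt;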
&lt;p&gt;A more dangerous example, because adding a foreign key checks every existing row while holding locks on both tables:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX idx_orders_created_at
ON orders (created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Creating an index without &lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; blocks writes to the table until the build finishes. In production, you usually want:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_created_at
ON orders (created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But even &lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; is not magic. It takes longer, consumes resources, and can fail if there are conflicting operations.&lt;&#x2F;p&gt;
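&lt;p&gt;One concrete consequence: a failed &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt; leaves behind an invalid index that still adds write overhead until it is dropped or rebuilt. A short check for such leftovers, offered as a sketch rather than a complete migration audit (an index that is still being built concurrently also shows up here):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- List indexes currently marked invalid,
-- for example after a failed concurrent build.
SELECT
    n.nspname AS schema_name,
    c.relname AS index_name,
    i.indrelid::regclass AS table_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE NOT i.indisvalid;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;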
&lt;p&gt;During a suspected lock-related incident, this type of query can help identify blockers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    now() - blocking.query_start AS blocking_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked
    ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
   AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
   AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
   AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
   AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
   AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
   AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
   AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
   AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
   AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
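&lt;p&gt;On PostgreSQL 9.6 and later, the built-in &lt;code&gt;pg_blocking_pids()&lt;&#x2F;code&gt; function offers a shorter route to similar information. A sketch that also shows the state and transaction age of each blocker, which helps spot an &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; session holding everyone up:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- One row per (blocked session, blocking session) pair.
SELECT
    blocked.pid AS blocked_pid,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.state AS blocking_state,
    now() - blocking.xact_start AS blocking_xact_age,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity blocking ON blocking.pid = b.pid
ORDER BY blocking_xact_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;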
&lt;p&gt;This is useful, but it is still only one piece of the incident.&lt;&#x2F;p&gt;
&lt;p&gt;The deeper questions are:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;Why was this migration run during this traffic pattern?&lt;&#x2F;li&gt;
&lt;li&gt;Was there a rollback plan?&lt;&#x2F;li&gt;
&lt;li&gt;Were lock timeouts configured?&lt;&#x2F;li&gt;
&lt;li&gt;Were long transactions checked before the migration?&lt;&#x2F;li&gt;
&lt;li&gt;Did the application have retry behavior that amplified the issue?&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
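&lt;p&gt;The lock timeout question translates directly into migration code. The following is a sketch of a more defensive version of the foreign key migration shown earlier, assuming an illustrative timeout value. It is one common pattern, not the only safe one:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Fail fast instead of queueing behind a long transaction
-- while other sessions pile up behind this lock request.
-- The 5s value is an assumption; pick one that fits the workload.
SET lock_timeout = &amp;#39;5s&amp;#39;;

-- Step 1: add the constraint without checking existing rows.
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;

-- Step 2: check existing rows separately, under a weaker lock
-- on orders that does not block normal reads and writes.
ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the timeout fires, the migration fails cleanly and can be retried after the blocking transaction is gone.&lt;&#x2F;p&gt;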
&lt;p&gt;A mature team does not only ask “which process blocked us?”
It asks “why was the system vulnerable to this class of failure?”&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-4-a-slow-query-is-not-always-a-query-problem&quot;&gt;Example 4: a slow query is not always a query problem&lt;&#x2F;h2&gt;
&lt;p&gt;A query can become slow without changing the SQL text.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM invoices
WHERE account_id = $1
  AND status = &amp;#39;open&amp;#39;
ORDER BY due_date ASC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This may work well when each account has a small number of invoices.&lt;&#x2F;p&gt;
&lt;p&gt;But as the product grows, one enterprise account may accumulate millions of rows. The query becomes highly sensitive to data distribution.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect the execution plan:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE account_id = 123
  AND status = &amp;#39;open&amp;#39;
ORDER BY due_date ASC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The plan might reveal:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;sequential scans;&lt;&#x2F;li&gt;
&lt;li&gt;high buffer reads;&lt;&#x2F;li&gt;
&lt;li&gt;unexpected nested loops;&lt;&#x2F;li&gt;
&lt;li&gt;bad row estimates;&lt;&#x2F;li&gt;
&lt;li&gt;sort operations spilling to disk;&lt;&#x2F;li&gt;
&lt;li&gt;index scans that are technically used but still inefficient.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;A possible supporting index could be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_invoices_account_status_due
ON invoices (account_id, status, due_date);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But again, the point is not “add this index.”&lt;&#x2F;p&gt;
&lt;p&gt;The real reliability lesson is that production data shape changes over time. A query that was safe six months ago can become dangerous after customer growth, product changes, or new usage patterns.&lt;&#x2F;p&gt;
&lt;p&gt;Reliability is not only about fixing bad queries. It is about detecting when previously good assumptions have expired.&lt;&#x2F;p&gt;
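&lt;p&gt;One way to check whether the data shape has shifted is to look at the planner statistics for the filtered column. A sketch, assuming the &lt;code&gt;invoices&lt;&#x2F;code&gt; table lives in the &lt;code&gt;public&lt;&#x2F;code&gt; schema:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- The public schema name is an assumption for illustration.
-- Shows how skewed account_id is according to the last ANALYZE.
SELECT
    n_distinct,
    most_common_vals,
    most_common_freqs
FROM pg_stats
WHERE schemaname = &amp;#39;public&amp;#39;
  AND tablename = &amp;#39;invoices&amp;#39;
  AND attname = &amp;#39;account_id&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If a single account dominates &lt;code&gt;most_common_freqs&lt;&#x2F;code&gt;, the same SQL text can have very different costs depending on which account it runs for.&lt;&#x2F;p&gt;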
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-difference-between-trigger-mechanism-and-amplifier&quot;&gt;The difference between trigger, mechanism, and amplifier&lt;&#x2F;h2&gt;
&lt;p&gt;A useful way to reason about Postgres incidents is to separate three things.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-trigger&quot;&gt;1. Trigger&lt;&#x2F;h3&gt;
&lt;p&gt;The event that started the incident.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;New release
Schema migration
Traffic spike
Batch job
Analytics query
Configuration change
Failover
New customer onboarded
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;2-mechanism&quot;&gt;2. Mechanism&lt;&#x2F;h3&gt;
&lt;p&gt;The technical process through which the system degraded.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Lock contention
Connection saturation
Query plan regression
Disk IO saturation
WAL pressure
Autovacuum lag
Replication lag
Memory pressure
Transaction buildup
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;3-amplifier&quot;&gt;3. Amplifier&lt;&#x2F;h3&gt;
&lt;p&gt;The thing that made the incident worse.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Aggressive retries
Oversized connection pools
No statement timeout
No lock timeout
Long-running transactions
Missing dashboards
No migration safety process
Manual panic actions
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A poor incident review says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The database was slow because of a bad query.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;A better incident review says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“The trigger was a release that introduced a new query pattern. The mechanism was inefficient index access under production data distribution. The amplifier was application retries combined with a pool size that allowed too much concurrent pressure on Postgres.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;That second version teaches the team something reusable.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;useful-diagnostic-queries-are-not-the-same-as-an-incident-response-skill&quot;&gt;Useful diagnostic queries are not the same as an incident response skill&lt;&#x2F;h2&gt;
&lt;p&gt;It is good to know queries like these.&lt;&#x2F;p&gt;
&lt;p&gt;Current activity:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE state != &amp;#39;idle&amp;#39;
ORDER BY query_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Long transactions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS xact_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
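&lt;p&gt;Top queries with &lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt; (column names as of PostgreSQL 13; older versions use &lt;code&gt;total_time&lt;&#x2F;code&gt; and &lt;code&gt;mean_time&lt;&#x2F;code&gt;):&lt;&#x2F;p&gt;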
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication lag:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Dead tuple pressure, a rough proxy for table bloat:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Index usage, least-scanned indexes first:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These queries are useful. But they are not enough.&lt;&#x2F;p&gt;
&lt;p&gt;During a real incident, the challenge is not just running SQL. The challenge is knowing which hypothesis you are testing.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Are we overloaded because queries are slower?
Are queries slower because of locks?
Are locks caused by a migration?
Is the pool full because Postgres is slow, or is Postgres slow because the pool allows too much concurrency?
Is replication lag a cause, a symptom, or a separate issue?
Are retries protecting the system or attacking it?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is where operational skill matters.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-postgres-incidents-often-look-similar&quot;&gt;Why Postgres incidents often look similar&lt;&#x2F;h2&gt;
&lt;p&gt;Many different failure modes produce similar symptoms.&lt;&#x2F;p&gt;
&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Symptom&lt;&#x2F;th&gt;&lt;th&gt;Possible causes&lt;&#x2F;th&gt;&lt;&#x2F;tr&gt;&lt;&#x2F;thead&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;High latency&lt;&#x2F;td&gt;&lt;td&gt;slow queries, locks, IO saturation, pool wait, CPU pressure&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Many active connections&lt;&#x2F;td&gt;&lt;td&gt;slow DB, oversized pool, retry storm, long transactions&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;High CPU&lt;&#x2F;td&gt;&lt;td&gt;query plan regression, too much concurrency, missing index&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;High IO&lt;&#x2F;td&gt;&lt;td&gt;sequential scans, checkpoints, vacuum, index creation, bad plans&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Timeouts&lt;&#x2F;td&gt;&lt;td&gt;pool exhaustion, locks, network, overloaded DB, application retries&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;tr&gt;&lt;td&gt;Replica lag&lt;&#x2F;td&gt;&lt;td&gt;WAL volume, slow replica IO, long queries on standby, replication slot issues&lt;&#x2F;td&gt;&lt;&#x2F;tr&gt;
&lt;&#x2F;tbody&gt;&lt;&#x2F;table&gt;
&lt;p&gt;This is why “dashboard watching” is not enough.&lt;&#x2F;p&gt;
&lt;p&gt;Metrics do not tell you what to do by themselves. They only become useful when connected to a hypothesis.&lt;&#x2F;p&gt;
&lt;p&gt;A metric says:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Connections are high.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An engineer has to ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Are connections high because requests increased?
Because queries are slower?
Because transactions are stuck?
Because the pool was reconfigured?
Because the app is retrying?
Because background jobs started?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The same metric can point to different actions depending on context.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;dangerous-reactions-during-postgres-incidents&quot;&gt;Dangerous reactions during Postgres incidents&lt;&#x2F;h2&gt;
&lt;p&gt;Some actions feel helpful but can be dangerous when done without understanding the mechanism.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;increasing-the-connection-pool&quot;&gt;Increasing the connection pool&lt;&#x2F;h3&gt;
&lt;p&gt;May help if the pool is too small and Postgres has spare capacity.&lt;&#x2F;p&gt;
&lt;p&gt;May hurt if Postgres is already saturated.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;killing-random-queries&quot;&gt;Killing random queries&lt;&#x2F;h3&gt;
&lt;p&gt;May help if a clearly harmful query is blocking critical work.&lt;&#x2F;p&gt;
&lt;p&gt;May hurt if you kill the wrong backend, interrupt a migration, or cause application-level retries.&lt;&#x2F;p&gt;
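&lt;p&gt;When a specific backend has been identified as the problem, a gentler first step is to cancel its current query rather than terminate the whole session. A sketch, where the pid &lt;code&gt;12345&lt;&#x2F;code&gt; is a placeholder taken from &lt;code&gt;pg_stat_activity&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- 12345 is a placeholder pid, not a real value.
-- Cancel only the running query; the session stays connected.
SELECT pg_cancel_backend(12345);

-- Terminate the whole session if cancelling is not enough,
-- for example when the backend is idle in transaction.
SELECT pg_terminate_backend(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;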
&lt;h3 id=&quot;restarting-the-application&quot;&gt;Restarting the application&lt;&#x2F;h3&gt;
&lt;p&gt;May help if the app is stuck.&lt;&#x2F;p&gt;
&lt;p&gt;May hurt if every instance reconnects at once and creates a connection storm.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;failing-over-to-a-replica&quot;&gt;Failing over to a replica&lt;&#x2F;h3&gt;
&lt;p&gt;May help if the primary is unhealthy.&lt;&#x2F;p&gt;
&lt;p&gt;May hurt if the issue is caused by application behavior, bad queries, or a migration that will continue after failover.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;running-emergency-indexes&quot;&gt;Running emergency indexes&lt;&#x2F;h3&gt;
&lt;p&gt;May help if the cause is well understood.&lt;&#x2F;p&gt;
&lt;p&gt;May hurt if index creation adds IO pressure during an already overloaded period.&lt;&#x2F;p&gt;
&lt;p&gt;The operational question is not:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“What can we do?”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;It is:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“Which action reduces pressure without increasing uncertainty?”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;reliability-requires-practicing-the-messy-middle&quot;&gt;Reliability requires practicing the messy middle&lt;&#x2F;h2&gt;
&lt;p&gt;Most educational material explains clean concepts:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;how MVCC works;&lt;&#x2F;li&gt;
&lt;li&gt;how indexes work;&lt;&#x2F;li&gt;
&lt;li&gt;how locks work;&lt;&#x2F;li&gt;
&lt;li&gt;how autovacuum works;&lt;&#x2F;li&gt;
&lt;li&gt;how replication works;&lt;&#x2F;li&gt;
&lt;li&gt;how query planning works.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;That knowledge is necessary.&lt;&#x2F;p&gt;
&lt;p&gt;But incidents do not arrive as clean textbook chapters.&lt;&#x2F;p&gt;
&lt;p&gt;They arrive as noisy combinations:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A migration is waiting on a lock.
A long transaction is preventing cleanup.
The application pool is saturated.
Retries are increasing traffic.
A reporting query is consuming IO.
Replication lag is rising.
The team is debating rollback.
Customers are already affected.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The difficult part is the messy middle: forming hypotheses, rejecting bad assumptions, choosing safe mitigations, and communicating clearly while the system is degraded.&lt;&#x2F;p&gt;
&lt;p&gt;This cannot be learned fully from documentation.&lt;&#x2F;p&gt;
&lt;p&gt;It has to be practiced.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-incident-simulations-teach-that-articles-cannot&quot;&gt;What incident simulations teach that articles cannot&lt;&#x2F;h2&gt;
&lt;p&gt;An article can explain the concepts.
A checklist can remind you what to inspect.
A dashboard can show symptoms.&lt;&#x2F;p&gt;
&lt;p&gt;But a simulation trains the actual operational behavior:&lt;&#x2F;p&gt;
&lt;ul&gt;
&lt;li&gt;noticing weak signals early;&lt;&#x2F;li&gt;
&lt;li&gt;distinguishing trigger from mechanism;&lt;&#x2F;li&gt;
&lt;li&gt;avoiding attractive but dangerous actions;&lt;&#x2F;li&gt;
&lt;li&gt;reading database symptoms in application context;&lt;&#x2F;li&gt;
&lt;li&gt;understanding how one mitigation changes system pressure;&lt;&#x2F;li&gt;
&lt;li&gt;coordinating investigation under time pressure;&lt;&#x2F;li&gt;
&lt;li&gt;learning from mistakes without damaging production.&lt;&#x2F;li&gt;
&lt;&#x2F;ul&gt;
&lt;p&gt;In a good Postgres incident simulation, the goal is not to memorize one magic query.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to experience the chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart LR
    S[Symptom] --&amp;gt; H[Hypothesis] --&amp;gt; I[Inspection] --&amp;gt; D[Decision] --&amp;gt; C[Consequence]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That loop is the core of database reliability work.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres incidents rarely begin with “Postgres broke.”&lt;&#x2F;p&gt;
&lt;p&gt;More often, they begin with a normal engineering action:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;a release
a migration
a new query
a batch job
a traffic spike
a retry policy
a pool configuration change
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres becomes the place where the consequences accumulate.&lt;&#x2F;p&gt;
&lt;p&gt;That is why reliable Postgres operations require more than database knowledge. They require system thinking.&lt;&#x2F;p&gt;
&lt;p&gt;You need to understand queries, locks, transactions, WAL, vacuum, replication, and indexes. But you also need to understand application behavior, deployment practices, connection pools, retries, traffic patterns, and human decision-making during incidents.&lt;&#x2F;p&gt;
&lt;p&gt;Documentation teaches mechanisms.
Monitoring shows symptoms.
Simulations build operational judgment.&lt;&#x2F;p&gt;
&lt;p&gt;And in production, judgment is often the difference between a short degradation and a serious incident.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Connection pools and Postgres: why more connections do not mean more performance</title>
        <published>2026-05-20T00:00:00+00:00</published>
        <updated>2026-05-20T00:00:00+00:00</updated>
        
        <author>
          <name>Unknown</name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/connection-pools-more-is-not-more/"/>
        <id>https://rillence.com/notes/connection-pools-more-is-not-more/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/connection-pools-more-is-not-more/">&lt;p&gt;When an application starts timing out while talking to Postgres, one of the most tempting reactions is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Increase the connection pool size.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It feels reasonable.&lt;&#x2F;p&gt;
&lt;p&gt;Requests are waiting for a database connection.
The pool is full.
The application needs more throughput.
So the team gives it more connections.&lt;&#x2F;p&gt;
&lt;p&gt;Sometimes that helps.&lt;&#x2F;p&gt;
&lt;p&gt;But in many Postgres incidents, increasing the pool size turns a controlled slowdown into a larger failure.&lt;&#x2F;p&gt;
&lt;p&gt;A connection pool is not just a performance optimization. It is a pressure valve between the application and the database.&lt;&#x2F;p&gt;
&lt;p&gt;When configured well, it protects Postgres from too much concurrent work.
When configured badly, it allows the application to overload the database faster.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-wrong-mental-model&quot;&gt;The wrong mental model&lt;&#x2F;h2&gt;
&lt;p&gt;A common mental model looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;More connections = more parallelism = more throughput
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is only true up to a point.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres does not become infinitely faster just because more clients connect to it. Each active connection competes for shared resources:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;CPU
memory
locks
shared buffers
disk IO
WAL bandwidth
temporary file space
autovacuum capacity
checkpoint pressure
planner and executor overhead
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the database is already saturated, adding more active sessions usually increases contention.&lt;&#x2F;p&gt;
&lt;p&gt;A better mental model is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Connections are not throughput.
Connections are concurrency.
Concurrency must be limited to what the database can actually serve.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The pool should protect the database from excessive concurrency, not blindly maximize it.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-hidden-multiplication-problem&quot;&gt;The hidden multiplication problem&lt;&#x2F;h2&gt;
&lt;p&gt;Connection incidents often start with innocent numbers.&lt;&#x2F;p&gt;
&lt;p&gt;One service has a pool size of 20.&lt;&#x2F;p&gt;
&lt;p&gt;That sounds small.&lt;&#x2F;p&gt;
&lt;p&gt;Then production reality looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;20 connections per application instance
× 30 application instances
= 600 possible database connections
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now add:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;background workers;
admin jobs;
migration runners;
BI tools;
cron scripts;
read replicas;
multiple services;
autoscaling;
deployment overlap during rolling releases.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Suddenly, &lt;code&gt;max_connections = 500&lt;&#x2F;code&gt; no longer looks large.&lt;&#x2F;p&gt;
&lt;p&gt;The dangerous part is that each team may only see its own service:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Our pool is only 20.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But Postgres sees the total:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Hundreds of clients competing for one database.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A simple inventory query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    usename,
    client_addr,
    state,
    count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr, state
ORDER BY connections DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This often reveals surprises:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;old app versions still connected;
workers using separate pools;
BI tools holding sessions;
idle clients consuming slots;
one service with far more connections than expected;
deployment overlap doubling connection count temporarily.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The incident is not always caused by one bad query. Sometimes the system simply permits too many concurrent conversations with the database.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;idle-connections-are-not-free&quot;&gt;Idle connections are not free&lt;&#x2F;h2&gt;
&lt;p&gt;An idle connection is less dangerous than an active query, but it is not free.&lt;&#x2F;p&gt;
&lt;p&gt;Each connection is represented by a backend process. It consumes memory and a connection slot. It also increases operational complexity during spikes, failovers, restarts, and deployments.&lt;&#x2F;p&gt;
&lt;p&gt;Inspect idle connections:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    usename,
    client_addr,
    count(*) AS idle_connections
FROM pg_stat_activity
WHERE state = &amp;#39;idle&amp;#39;
GROUP BY application_name, usename, client_addr
ORDER BY idle_connections DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A large number of idle sessions may indicate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;oversized pools;
too many application instances;
poor pool lifecycle management;
clients that connect and do not reuse efficiently;
services holding capacity they do not need.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Idle connections may not be the immediate cause of latency, but they reduce headroom.&lt;&#x2F;p&gt;
&lt;p&gt;During an incident, headroom matters.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;active-connections-are-the-real-pressure&quot;&gt;Active connections are the real pressure&lt;&#x2F;h2&gt;
&lt;p&gt;The more important question is not just how many sessions exist.&lt;&#x2F;p&gt;
&lt;p&gt;It is how many sessions are actively doing work or waiting on something.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This gives a better view of database pressure:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;active sessions consuming CPU;
sessions waiting on locks;
sessions waiting on IO;
sessions idle in transaction;
sessions waiting on client reads or writes.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A saturated pool with many active database sessions means the database itself is busy executing work.&lt;&#x2F;p&gt;
&lt;p&gt;A saturated pool with many sessions waiting on locks means the work is blocked behind another transaction.&lt;&#x2F;p&gt;
&lt;p&gt;A saturated pool with many idle-in-transaction sessions means the application is holding connections without using them.&lt;&#x2F;p&gt;
&lt;p&gt;The number of connections is only the surface.&lt;&#x2F;p&gt;
&lt;p&gt;The wait state tells you what kind of pressure the database is experiencing.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;pool-saturation-can-be-a-symptom-not-the-root-cause&quot;&gt;Pool saturation can be a symptom, not the root cause&lt;&#x2F;h2&gt;
&lt;p&gt;When an application pool is full, it is easy to blame the pool.&lt;&#x2F;p&gt;
&lt;p&gt;But a pool usually fills because connections are being held longer than expected.&lt;&#x2F;p&gt;
&lt;p&gt;That can happen because:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;queries became slower;
transactions became longer;
locks caused sessions to wait;
the database started waiting on IO;
the application opened transactions too early;
external service calls happened inside transactions;
retries increased traffic;
a deployment created more concurrent workers;
background jobs started competing with user requests.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A typical chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Query latency increases] --&amp;gt; B[Application holds DB connections longer]
    B --&amp;gt; C[Pool reaches max size]
    C --&amp;gt; D[New requests wait for a connection]
    D --&amp;gt; E[HTTP latency increases]
    E --&amp;gt; F[Requests time out]
    F --&amp;gt; G[Application retries]
    G --&amp;gt; H[More work reaches Postgres]
    H --&amp;gt; I([The pool stays saturated])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The pool is not the original failure. It is the place where the failure becomes visible.&lt;&#x2F;p&gt;
&lt;p&gt;Increasing the pool size may only move the queue from the application into Postgres.&lt;&#x2F;p&gt;
&lt;p&gt;That can make the database less stable.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;queuing-in-the-application-is-often-safer-than-queuing-in-postgres&quot;&gt;Queuing in the application is often safer than queuing in Postgres&lt;&#x2F;h2&gt;
&lt;p&gt;A small pool can be frustrating because requests wait before reaching the database.&lt;&#x2F;p&gt;
&lt;p&gt;But that waiting can be protective.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Application-side queue:
limits database concurrency.

Database-side queue:
lets too much work enter Postgres.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If too many requests enter Postgres, they can compete for locks, memory, CPU, and IO. Once the database is overloaded, every query can become slower, which makes connections stay busy even longer.&lt;&#x2F;p&gt;
&lt;p&gt;This feedback loop is dangerous:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[More concurrent queries] --&amp;gt; B[More contention]
    B --&amp;gt; C[Slower queries]
    C --&amp;gt; D[Connections held longer]
    D --&amp;gt; E[More pool pressure]
    E --&amp;gt; F[More retries]
    F --&amp;gt; A
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A pool should create backpressure.&lt;&#x2F;p&gt;
&lt;p&gt;Backpressure is not failure. It is a controlled refusal to overload the most critical shared component.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-database-pool-is-part-of-your-traffic-control-system&quot;&gt;The database pool is part of your traffic control system&lt;&#x2F;h2&gt;
&lt;p&gt;A mature production system usually has multiple layers of traffic control:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;load balancer limits;
application worker limits;
request timeouts;
queue depth limits;
connection pool limits;
statement timeouts;
retry budgets;
rate limits;
circuit breakers;
background job concurrency limits.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The database pool is one of those layers.&lt;&#x2F;p&gt;
&lt;p&gt;If all other layers are loose, the database pool becomes the final gate before Postgres.&lt;&#x2F;p&gt;
&lt;p&gt;That is risky.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;API accepts too much traffic.
Workers retry aggressively.
Each worker can open many DB connections.
Background jobs are unconstrained.
Pool size is high.
Postgres receives the full blast.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is how a traffic spike becomes a database incident.&lt;&#x2F;p&gt;
&lt;p&gt;The database did not “break.” It was used as the only effective limiter in the system.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;inspecting-connection-pressure-in-postgres&quot;&gt;Inspecting connection pressure in Postgres&lt;&#x2F;h2&gt;
&lt;p&gt;Start with total connection usage:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    count(*) AS current_connections,
    setting::int AS max_connections,
    round(100.0 * count(*) &#x2F; setting::int, 2) AS percent_used
FROM pg_stat_activity
CROSS JOIN pg_settings
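WHERE name = &amp;#39;max_connections&amp;#39;
  AND backend_type = &amp;#39;client backend&amp;#39; -- count client sessions only, not background processes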
GROUP BY setting;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Break it down by application:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    count(*) AS total,
    count(*) FILTER (WHERE state = &amp;#39;active&amp;#39;) AS active,
    count(*) FILTER (WHERE state = &amp;#39;idle&amp;#39;) AS idle,
    count(*) FILTER (WHERE state = &amp;#39;idle in transaction&amp;#39;) AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name
ORDER BY total DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Look for old sessions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    client_addr,
    state,
    now() - backend_start AS connection_age,
    now() - state_change AS state_age,
    left(query, 160) AS query_preview
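FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39; -- skip background processes, which start with the server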
ORDER BY backend_start ASC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Look for long-running active queries:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state = &amp;#39;active&amp;#39;
ORDER BY query_start ASC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Look for idle transactions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    client_addr,
    now() - xact_start AS transaction_age,
    now() - state_change AS idle_age,
    left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = &amp;#39;idle in transaction&amp;#39;
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These queries help separate different problems:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;too many idle sessions;
too many active sessions;
long-running queries;
sessions waiting on locks;
idle transactions;
connection leaks;
unexpected clients;
deployment overlap.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not just to count connections.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to understand why they exist and what they are doing.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;idle-in-transaction-small-bug-large-blast-radius&quot;&gt;&lt;code&gt;idle in transaction&lt;&#x2F;code&gt;: small bug, large blast radius&lt;&#x2F;h2&gt;
&lt;p&gt;An application can open a transaction, run a query, and then wait.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application waits on an external API before COMMIT
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;From Postgres, this may appear as:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;idle in transaction
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That session is not actively running SQL, but the transaction is still open.&lt;&#x2F;p&gt;
&lt;p&gt;This can:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;hold locks;
prevent vacuum cleanup;
keep old row versions visible;
increase bloat;
block migrations;
hold a pool connection indefinitely;
create confusing incident symptoms.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
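&lt;p&gt;One way to check the vacuum side of this list is to look at &lt;code&gt;backend_xmin&lt;&#x2F;code&gt; in &lt;code&gt;pg_stat_activity&lt;&#x2F;code&gt;. This is a minimal sketch: the 10-row limit is arbitrary, and a session that currently holds no snapshot has a NULL value here and is skipped:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Sessions whose snapshot may be holding back vacuum cleanup, oldest first
SELECT
    pid,
    application_name,
    state,
    age(backend_xmin) AS xmin_age,
    now() - xact_start AS transaction_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The rows at the top are the sessions most likely to be preventing dead row versions from being removed.&lt;&#x2F;p&gt;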
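&lt;p&gt;A useful protection is &lt;code&gt;idle_in_transaction_session_timeout&lt;&#x2F;code&gt;. Check its current value; &lt;code&gt;0&lt;&#x2F;code&gt;, the default, means the timeout is disabled:&lt;&#x2F;p&gt;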
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW idle_in_transaction_session_timeout;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can set it at the role or database level:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER ROLE app_user
SET idle_in_transaction_session_timeout = &amp;#39;60s&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
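&lt;p&gt;This is not a substitute for fixing application code, but it can reduce blast radius. A role-level setting applies to sessions opened after the change, so pooled connections that are already open keep their old value until they are recycled.&lt;&#x2F;p&gt;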
&lt;p&gt;Application transactions should be short and explicit.&lt;&#x2F;p&gt;
&lt;p&gt;A dangerous pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;BEGIN
  read from database
  call external service
  perform business logic
  write to database
COMMIT
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A safer pattern is usually:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;call external services before opening the transaction;
open the transaction late;
perform only the required database work;
commit quickly;
avoid user or network waits inside the transaction.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres can handle concurrency. It cannot make long application transactions short.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;pool-timeouts-and-statement-timeouts-are-different&quot;&gt;Pool timeouts and statement timeouts are different&lt;&#x2F;h2&gt;
&lt;p&gt;Application pool timeout:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How long a request waits to get a database connection.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres &lt;code&gt;statement_timeout&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How long a SQL statement may run before Postgres cancels it.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres &lt;code&gt;lock_timeout&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How long a statement waits to acquire a lock.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres &lt;code&gt;idle_in_transaction_session_timeout&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How long a session may remain idle while inside a transaction.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These protect different parts of the system.&lt;&#x2F;p&gt;
&lt;p&gt;Inspect settings:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW statement_timeout;
SHOW lock_timeout;
SHOW idle_in_transaction_session_timeout;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Example role-level guardrails:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER ROLE app_user SET statement_timeout = &amp;#39;30s&amp;#39;;
ALTER ROLE app_user SET lock_timeout = &amp;#39;2s&amp;#39;;
ALTER ROLE app_user SET idle_in_transaction_session_timeout = &amp;#39;60s&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These values are examples, not universal defaults.&lt;&#x2F;p&gt;
&lt;p&gt;Different workloads need different limits:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;OLTP API queries need strict latency control.
Background jobs may need longer statement timeouts.
Migrations need careful lock timeouts.
Analytics should often run on separate infrastructure.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
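&lt;p&gt;As a sketch of how those differences might be expressed, the roles &lt;code&gt;api_user&lt;&#x2F;code&gt; and &lt;code&gt;worker_user&lt;&#x2F;code&gt; and the &lt;code&gt;orders&lt;&#x2F;code&gt; table below are hypothetical, and the values are placeholders rather than recommendations:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Hypothetical roles: strict limit for the API, looser limit for workers.
ALTER ROLE api_user SET statement_timeout = &amp;#39;5s&amp;#39;;
ALTER ROLE worker_user SET statement_timeout = &amp;#39;5min&amp;#39;;

-- A migration can bound its own lock wait for one transaction only.
BEGIN;
SET LOCAL lock_timeout = &amp;#39;3s&amp;#39;;
ALTER TABLE orders ADD COLUMN note text; -- hypothetical table and column
COMMIT;

-- Confirm which settings are attached to each role.
SELECT rolname, rolconfig
FROM pg_roles
WHERE rolname IN (&amp;#39;api_user&amp;#39;, &amp;#39;worker_user&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Separate roles per workload make it possible to give each one its own limits without touching application code.&lt;&#x2F;p&gt;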
&lt;p&gt;Timeouts do not fix bad architecture, but they prevent some failures from growing without bounds.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;pgbouncer-is-useful-but-not-magic&quot;&gt;PgBouncer is useful, but not magic&lt;&#x2F;h2&gt;
&lt;p&gt;Many Postgres systems use PgBouncer or another external pooler.&lt;&#x2F;p&gt;
&lt;p&gt;PgBouncer can reduce the number of server connections and allow many client connections to share fewer Postgres backends.&lt;&#x2F;p&gt;
&lt;p&gt;But the pooling mode matters.&lt;&#x2F;p&gt;
&lt;p&gt;The common modes are:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;session pooling;
transaction pooling;
statement pooling.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In session pooling, a client keeps the same server connection for the whole client session.&lt;&#x2F;p&gt;
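&lt;p&gt;In transaction pooling, a client gets a server connection only for the duration of a transaction.&lt;&#x2F;p&gt;
&lt;p&gt;In statement pooling, the server connection is released after every statement, so multi-statement transactions are not allowed.&lt;&#x2F;p&gt;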
&lt;p&gt;Transaction pooling can dramatically reduce pressure on Postgres, but it changes what application behavior is safe.&lt;&#x2F;p&gt;
&lt;p&gt;Features that depend on session state may become problematic:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;temporary tables;
session-level SET commands;
session-level advisory locks;
LISTEN &#x2F; NOTIFY patterns;
some prepared statement assumptions;
stateful connection behavior in application frameworks.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, this is session state:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SET search_path = tenant_42, public;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If an application assumes this setting remains attached to a session, transaction pooling can break that assumption.&lt;&#x2F;p&gt;
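&lt;p&gt;A safer approach is to scope a setting to a single transaction with &lt;code&gt;SET LOCAL&lt;&#x2F;code&gt;, so it cannot leak onto a server connection that another client will reuse:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Only has an effect inside a transaction block; reverts at COMMIT or ROLLBACK.
SET LOCAL statement_timeout = &amp;#39;5s&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The alternative is to avoid relying on session state for request behavior at all.&lt;&#x2F;p&gt;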
&lt;p&gt;The reliability lesson:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A pooler changes the contract between the application and Postgres.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It must be tested as part of the application architecture, not added only during an emergency.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;app-level-pools-and-external-poolers-can-fight-each-other&quot;&gt;App-level pools and external poolers can fight each other&lt;&#x2F;h2&gt;
&lt;p&gt;A common architecture has both:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Application connection pool
        ↓
PgBouncer
        ↓
Postgres
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That can work well.&lt;&#x2F;p&gt;
&lt;p&gt;But it can also create confusion.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;50 application instances
× app pool size 20
= 1000 client connections to PgBouncer

PgBouncer pool size 100
= only 100 server connections to Postgres
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That may be fine if PgBouncer queues safely.&lt;&#x2F;p&gt;
&lt;p&gt;But application metrics may say:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Database pool is healthy.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;while PgBouncer is saturated.&lt;&#x2F;p&gt;
&lt;p&gt;Or PgBouncer may be healthy while Postgres is overloaded by 100 expensive active queries.&lt;&#x2F;p&gt;
&lt;p&gt;The important operational question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Where is the queue?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Possible answers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;inside the application pool;
inside PgBouncer;
inside Postgres lock waits;
inside disk IO;
inside the application request queue;
inside a background job system.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The location of the queue tells you where backpressure is happening.&lt;&#x2F;p&gt;
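&lt;p&gt;If PgBouncer is in the path, its admin console is one place to look for the queue. The sketch below assumes the default listen port 6432 and an admin user allowed to connect to the special &lt;code&gt;pgbouncer&lt;&#x2F;code&gt; database; the host and user names are placeholders for the actual deployment:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Connect to the PgBouncer admin console first, for example:
--   psql -h PGBOUNCER_HOST -p 6432 -U ADMIN_USER pgbouncer

-- Per-pool view: cl_waiting counts clients queued for a server connection,
-- maxwait is how long the oldest of them has been waiting, in seconds.
SHOW POOLS;

-- Per-client view, useful for finding which application is queuing.
SHOW CLIENTS;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A nonzero &lt;code&gt;cl_waiting&lt;&#x2F;code&gt; together with a growing &lt;code&gt;maxwait&lt;&#x2F;code&gt; suggests the queue is inside PgBouncer rather than inside Postgres.&lt;&#x2F;p&gt;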
&lt;p&gt;During an incident, moving the queue from one layer to another can help or hurt, depending on whether the new layer can hold the queue without degrading.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;pool-size-should-be-based-on-database-capacity-not-hope&quot;&gt;Pool size should be based on database capacity, not hope&lt;&#x2F;h2&gt;
&lt;p&gt;A poor pool-sizing strategy:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Set pool size high enough that application requests rarely wait.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That optimizes for hiding pressure.&lt;&#x2F;p&gt;
&lt;p&gt;A better strategy:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Set pool size low enough that Postgres remains stable under expected and degraded conditions.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A rough capacity-oriented approach:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How many active queries can Postgres serve with acceptable latency?
How many services share this database?
How many app instances can exist during autoscaling or rolling deploys?
How many background jobs run concurrently?
What is reserved for migrations, admin access, replication, monitoring, and emergency operations?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The total possible connection count matters:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;total_possible_connections =
    service_count
  × instances_per_service
  × pool_size_per_instance
  + workers
  + admin clients
  + migrations
  + monitoring
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That number should not accidentally exceed what Postgres can handle.&lt;&#x2F;p&gt;
&lt;p&gt;More importantly, the number of active queries should not exceed what the database can serve efficiently.&lt;&#x2F;p&gt;
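&lt;p&gt;The budget above can be checked against the server limit. The counts in this sketch are hypothetical (3 services, 10 instances each, pool size 20, plus a handful of other clients) and should be replaced with a real inventory:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Hypothetical inventory: 3 services x 10 instances x pool size 20,
-- plus 40 worker, 5 admin, 2 migration, and 5 monitoring connections.
WITH budget AS (
    SELECT 3 * 10 * 20 + 40 + 5 + 2 + 5 AS total_possible_connections
)
SELECT
    total_possible_connections,
    current_setting(&amp;#39;max_connections&amp;#39;)::int AS max_connections,
    total_possible_connections &amp;gt; current_setting(&amp;#39;max_connections&amp;#39;)::int AS over_limit
FROM budget;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If &lt;code&gt;over_limit&lt;&#x2F;code&gt; is true, the configured pools can collectively ask for more connections than the server will accept.&lt;&#x2F;p&gt;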
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;max-connections-is-not-a-performance-target&quot;&gt;&lt;code&gt;max_connections&lt;&#x2F;code&gt; is not a performance target&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;max_connections&lt;&#x2F;code&gt; is a limit, not a goal.&lt;&#x2F;p&gt;
&lt;p&gt;If Postgres has &lt;code&gt;max_connections = 500&lt;&#x2F;code&gt;, that does not mean the system should normally run with 500 active sessions.&lt;&#x2F;p&gt;
&lt;p&gt;Check the setting:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW max_connections;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When connection count approaches the limit, new clients may fail to connect. That can block application traffic, migrations, admin access, and incident response.&lt;&#x2F;p&gt;
&lt;p&gt;You do not want to discover during an outage that there is no free connection left for an operator.&lt;&#x2F;p&gt;
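&lt;p&gt;Postgres keeps some slots back for superusers through &lt;code&gt;superuser_reserved_connections&lt;&#x2F;code&gt;, which is one reason an operator may still get in when applications cannot. A rough way to see how many slots ordinary roles can use:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Slots left for non-superuser roles after the superuser reservation
SELECT
    current_setting(&amp;#39;max_connections&amp;#39;)::int AS max_connections,
    current_setting(&amp;#39;superuser_reserved_connections&amp;#39;)::int AS reserved_for_superusers,
    current_setting(&amp;#39;max_connections&amp;#39;)::int
      - current_setting(&amp;#39;superuser_reserved_connections&amp;#39;)::int AS usable_by_other_roles;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The reservation only helps if the operator connects as a superuser and the application roles do not.&lt;&#x2F;p&gt;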
&lt;p&gt;A useful connection headroom query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    count(*) AS used_connections,
    setting::int AS max_connections,
    setting::int - count(*) AS remaining_connections
FROM pg_stat_activity
CROSS JOIN pg_settings
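WHERE name = &amp;#39;max_connections&amp;#39;
  AND backend_type = &amp;#39;client backend&amp;#39; -- count client sessions only, not background processes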
GROUP BY setting;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Running near the maximum is usually a sign of poor control, not high efficiency.&lt;&#x2F;p&gt;
&lt;p&gt;A stable Postgres system should have connection headroom.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;retries-can-turn-pool-pressure-into-a-storm&quot;&gt;Retries can turn pool pressure into a storm&lt;&#x2F;h2&gt;
&lt;p&gt;Retries are meant to make systems more resilient.&lt;&#x2F;p&gt;
&lt;p&gt;Under database saturation, they can do the opposite.&lt;&#x2F;p&gt;
&lt;p&gt;A bad retry pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Request times out waiting for DB
        ↓
Application retries immediately
        ↓
Retry also waits for DB
        ↓
More requests accumulate
        ↓
Pool remains saturated
        ↓
Database receives duplicate work
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A better retry strategy includes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;bounded attempts;
exponential backoff;
jitter;
request deadlines;
idempotency keys;
retry budgets;
different policies for reads and writes;
no retry for known non-transient errors;
load shedding when the database is saturated.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The pool and retry policy must be designed together.&lt;&#x2F;p&gt;
&lt;p&gt;A small pool with aggressive retries can still overload the system.
A large pool with aggressive retries can overload it faster.&lt;&#x2F;p&gt;
&lt;p&gt;Retries should not be allowed to attack a struggling database.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;background-workers-need-separate-limits&quot;&gt;Background workers need separate limits&lt;&#x2F;h2&gt;
&lt;p&gt;User-facing requests and background jobs should not always share the same database capacity.&lt;&#x2F;p&gt;
&lt;p&gt;A background worker can be useful during normal operation and harmful during an incident.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;email jobs;
billing reconciliation;
search indexing;
analytics sync;
cleanup tasks;
data backfills;
report generation;
cache warming;
webhook reprocessing.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If these workers use the same database pool limits as API traffic, they can starve critical paths.&lt;&#x2F;p&gt;
&lt;p&gt;A better architecture often separates:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;API pool;
worker pool;
migration&#x2F;admin access;
analytics&#x2F;reporting access;
maintenance jobs.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This allows operational decisions such as:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pause non-critical workers;
reduce backfill concurrency;
reserve capacity for user traffic;
run reporting on a replica;
prevent cleanup jobs from overwhelming the primary.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In a database incident, not all work is equally important.&lt;&#x2F;p&gt;
&lt;p&gt;The pool configuration should reflect that.&lt;&#x2F;p&gt;
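&lt;p&gt;One server-side way to express such limits is a per-role connection cap. This sketch assumes a separate, hypothetical role for each workload; the numbers are placeholders, not recommendations:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Hypothetical roles, one per workload class
ALTER ROLE api_user CONNECTION LIMIT 120;
ALTER ROLE worker_user CONNECTION LIMIT 30;
ALTER ROLE reporting_user CONNECTION LIMIT 5;

-- rolconnlimit = -1 means the role has no per-role limit.
SELECT rolname, rolconnlimit
FROM pg_roles
WHERE rolname IN (&amp;#39;api_user&amp;#39;, &amp;#39;worker_user&amp;#39;, &amp;#39;reporting_user&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With caps like these, a runaway worker fleet is refused at connect time instead of crowding out API traffic.&lt;&#x2F;p&gt;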
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;connection-leaks&quot;&gt;Connection leaks&lt;&#x2F;h2&gt;
&lt;p&gt;A connection leak happens when the application checks out a database connection and does not return it to the pool.&lt;&#x2F;p&gt;
&lt;p&gt;Symptoms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pool usage grows over time;
database queries are not necessarily slow;
application instances require restart to recover;
idle connections accumulate;
a specific code path correlates with pool exhaustion.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Database-side symptoms may not be obvious.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect session age and state:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    client_addr,
    state,
    now() - backend_start AS backend_age,
    now() - state_change AS state_age,
    left(query, 160) AS query_preview
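FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39; -- skip background processes, which have no session state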
ORDER BY state_age DESC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But connection leaks are often easier to detect with application metrics:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pool connections in use;
pool idle connections;
pool wait time;
pool checkout timeout count;
connection acquisition latency;
connections opened and closed per second.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres can show you the sessions.&lt;&#x2F;p&gt;
&lt;p&gt;The application usually tells you whether the pool is leaking.&lt;&#x2F;p&gt;
&lt;p&gt;Both views are needed.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;long-queries-can-masquerade-as-pool-problems&quot;&gt;Long queries can masquerade as pool problems&lt;&#x2F;h2&gt;
&lt;p&gt;Suppose an endpoint usually takes 50 ms of database time.&lt;&#x2F;p&gt;
&lt;p&gt;Then one query starts taking 5 seconds.&lt;&#x2F;p&gt;
&lt;p&gt;Even without more traffic, pool usage rises because each request holds a connection longer.&lt;&#x2F;p&gt;
&lt;p&gt;A simple relationship, known as Little’s law:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;required concurrency ≈ request rate × connection hold time
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If request rate is 100 requests per second and each request holds a DB connection for 50 ms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;100 × 0.05 = 5 active connections
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the same path now holds a connection for 5 seconds:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;100 × 5 = 500 active connections
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The pool did not become too small.&lt;&#x2F;p&gt;
&lt;p&gt;The connection hold time exploded.&lt;&#x2F;p&gt;
&lt;p&gt;This is why pool metrics should be read together with query latency, transaction duration, and application request traces.&lt;&#x2F;p&gt;
&lt;p&gt;The pool is a mirror of database time.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;transactions-should-not-wrap-too-much-application-logic&quot;&gt;Transactions should not wrap too much application logic&lt;&#x2F;h2&gt;
&lt;p&gt;A transaction should protect a small unit of database consistency.&lt;&#x2F;p&gt;
&lt;p&gt;It should not wrap an entire business workflow unless absolutely necessary.&lt;&#x2F;p&gt;
&lt;p&gt;Risky pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;BEGIN
  select user
  call payment provider
  update order
  send webhook
  insert audit log
COMMIT
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This holds a database connection while waiting for external systems.&lt;&#x2F;p&gt;
&lt;p&gt;Safer pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;prepare required data;
call external systems outside transaction when possible;
open transaction;
perform minimal database changes;
commit;
emit async follow-up work.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There are exceptions. Some workflows need careful transactional boundaries.&lt;&#x2F;p&gt;
&lt;p&gt;But as a reliability default:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Keep transactions short.
Keep connection hold time predictable.
Do not wait on the network while holding scarce database capacity.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is one of the most important application-level rules for Postgres reliability.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-to-measure-in-the-application&quot;&gt;What to measure in the application&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres views are necessary, but not sufficient.&lt;&#x2F;p&gt;
&lt;p&gt;The application should expose pool metrics:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;maximum pool size;
connections currently in use;
idle connections;
pending connection requests;
connection acquisition latency;
connection checkout timeout count;
query duration;
transaction duration;
request duration while holding DB connection;
retries by reason;
errors by SQLSTATE or exception type.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The most useful metric is often not just query time.&lt;&#x2F;p&gt;
&lt;p&gt;It is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;time spent waiting for a connection
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If this grows, the application is experiencing backpressure.&lt;&#x2F;p&gt;
&lt;p&gt;That may be healthy if Postgres is protected and the system degrades gracefully.&lt;&#x2F;p&gt;
&lt;p&gt;It may be dangerous if requests time out and retry aggressively.&lt;&#x2F;p&gt;
&lt;p&gt;Metrics should distinguish:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;waiting for a pool connection;
executing SQL;
waiting on a database lock;
waiting on network;
waiting on an external service while holding a connection.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Without that separation, every database incident looks like “Postgres is slow.”&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-to-measure-in-postgres&quot;&gt;What to measure in Postgres&lt;&#x2F;h2&gt;
&lt;p&gt;Useful database-side signals:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Connections by state and wait type
SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Connections by application
SELECT
    application_name,
    count(*) AS total,
    count(*) FILTER (WHERE state = &amp;#39;active&amp;#39;) AS active,
    count(*) FILTER (WHERE wait_event_type = &amp;#39;Lock&amp;#39;) AS waiting_on_lock,
    count(*) FILTER (WHERE state = &amp;#39;idle in transaction&amp;#39;) AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name
ORDER BY total DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Oldest transactions
SELECT
    pid,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Long active queries
SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state = &amp;#39;active&amp;#39;
  AND pid &amp;lt;&amp;gt; pg_backend_pid()  -- exclude this monitoring session
ORDER BY query_start ASC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Blocked sessions
SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These queries are not a runbook by themselves.&lt;&#x2F;p&gt;
&lt;p&gt;They help answer one central question:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is Postgres doing too much work, waiting on something, or being held hostage by client behavior?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-connection-incidents-are-often-misdiagnosed&quot;&gt;Why connection incidents are often misdiagnosed&lt;&#x2F;h2&gt;
&lt;p&gt;Connection pool incidents are confusing because the first visible error is often outside the database.&lt;&#x2F;p&gt;
&lt;p&gt;The app logs may say:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;timeout acquiring connection from pool
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;remaining connection slots are reserved
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;too many clients already
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;context deadline exceeded
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Teams then debate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this an app issue?
Is this a database issue?
Is the pool too small?
Is max_connections too low?
Is PgBouncer broken?
Is a query slow?
Is the network slow?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The answer may be “yes” to several of these.&lt;&#x2F;p&gt;
&lt;p&gt;The pool is the boundary between application behavior and database capacity. Boundary failures usually have causes on both sides.&lt;&#x2F;p&gt;
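&lt;p&gt;One way to ground that debate is to check how close the server actually is to its connection limit. The query below is a minimal sketch: the result is approximate, because other reserved slots and non-client backends are not modeled.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Approximate connection headroom against configured limits
SELECT
    count(*) AS client_connections,
    current_setting(&amp;#39;max_connections&amp;#39;)::int AS max_connections,
    current_setting(&amp;#39;superuser_reserved_connections&amp;#39;)::int AS reserved_for_superuser,
    current_setting(&amp;#39;max_connections&amp;#39;)::int
        - current_setting(&amp;#39;superuser_reserved_connections&amp;#39;)::int
        - count(*) AS approx_slots_left
FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If plenty of slots remain while the application still times out acquiring connections, the queue is forming in the pool, not in Postgres.&lt;&#x2F;p&gt;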
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;common-anti-patterns&quot;&gt;Common anti-patterns&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;pool-size-copied-from-another-system&quot;&gt;Pool size copied from another system&lt;&#x2F;h3&gt;
&lt;p&gt;A pool size that worked for one service may be wrong for another.&lt;&#x2F;p&gt;
&lt;p&gt;Workload shape matters:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;short OLTP queries;
long reporting queries;
bursty writes;
background jobs;
tenant skew;
transaction-heavy workflows;
read-after-write patterns.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;pool-size-configured-per-instance-without-considering-total-instances&quot;&gt;Pool size configured per instance without considering total instances&lt;&#x2F;h3&gt;
&lt;p&gt;Autoscaling can silently multiply database pressure. A pool of 20 connections per instance is 200 potential connections at 10 instances and 800 at 40.&lt;&#x2F;p&gt;
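&lt;p&gt;A hedged way to see the multiplication from the database side is to count distinct client addresses per application. This assumes each instance connects directly from its own address; behind a pooler, proxy, or NAT, &lt;code&gt;client_addr&lt;&#x2F;code&gt; reflects the intermediary instead.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Connections per application, with the number of distinct client hosts
SELECT
    application_name,
    count(DISTINCT client_addr) AS client_hosts,
    count(*) AS connections,
    round(count(*)::numeric &#x2F; NULLIF(count(DISTINCT client_addr), 0), 1) AS avg_per_host
FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39;
GROUP BY application_name
ORDER BY connections DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;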
&lt;h3 id=&quot;one-shared-pool-for-critical-and-non-critical-work&quot;&gt;One shared pool for critical and non-critical work&lt;&#x2F;h3&gt;
&lt;p&gt;A reporting job should not be able to starve checkout.&lt;&#x2F;p&gt;
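&lt;p&gt;One possible guard, sketched here with hypothetical role names and illustrative limits, is a per-role connection cap so that non-critical work cannot consume every slot.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Hypothetical roles; the limits are illustrative, not recommendations
ALTER ROLE reporting_jobs CONNECTION LIMIT 10;
ALTER ROLE checkout_api   CONNECTION LIMIT 60;

-- Current limits (-1 means no limit)
SELECT rolname, rolconnlimit
FROM pg_roles
WHERE rolname IN (&amp;#39;reporting_jobs&amp;#39;, &amp;#39;checkout_api&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This only works when critical and non-critical workloads connect as different roles.&lt;&#x2F;p&gt;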
&lt;h3 id=&quot;long-external-calls-inside-transactions&quot;&gt;Long external calls inside transactions&lt;&#x2F;h3&gt;
&lt;p&gt;This turns network latency into database connection pressure.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-timeout-hierarchy&quot;&gt;No timeout hierarchy&lt;&#x2F;h3&gt;
&lt;p&gt;Without clear request, pool, statement, lock, and transaction timeouts, failures linger too long.&lt;&#x2F;p&gt;
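&lt;p&gt;The database-side part of such a hierarchy can be sketched as role-level settings. The role name and values are assumptions for illustration; request and pool timeouts still have to be configured in the application, and should be longer than the statement timeout so the database gives up first.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Hypothetical role; values are illustrative starting points
ALTER ROLE checkout_api SET lock_timeout = &amp;#39;2s&amp;#39;;
ALTER ROLE checkout_api SET statement_timeout = &amp;#39;5s&amp;#39;;
ALTER ROLE checkout_api SET idle_in_transaction_session_timeout = &amp;#39;30s&amp;#39;;
-- PostgreSQL 17 and later also provide transaction_timeout
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Role-level settings apply to new sessions only, so long-lived pooled connections keep their old values until they reconnect.&lt;&#x2F;p&gt;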
&lt;h3 id=&quot;aggressive-retries&quot;&gt;Aggressive retries&lt;&#x2F;h3&gt;
&lt;p&gt;Retries without budgets and backoff can turn a small slowdown into a storm.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;treating-pgbouncer-as-a-universal-fix&quot;&gt;Treating PgBouncer as a universal fix&lt;&#x2F;h3&gt;
&lt;p&gt;A pooler helps manage connections. It does not remove query cost, lock contention, IO saturation, or bad transaction design.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-healthier-operating-model&quot;&gt;A healthier operating model&lt;&#x2F;h2&gt;
&lt;p&gt;A good connection strategy is explicit.&lt;&#x2F;p&gt;
&lt;p&gt;It defines:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;which services may connect to Postgres;
how many connections each service may use;
how many instances may exist during normal and deploy conditions;
which work is allowed on the primary;
which work should use replicas;
which jobs can be paused;
which timeouts protect the system;
which retries are allowed;
which metrics indicate backpressure;
which actions reduce pressure safely.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is not only DBA work.&lt;&#x2F;p&gt;
&lt;p&gt;It requires cooperation between:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;backend engineers;
SREs;
DBAs;
platform engineers;
application owners;
incident responders.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Connection reliability lives at the boundary between application design and database operations.&lt;&#x2F;p&gt;
&lt;p&gt;That is why it often falls through organizational cracks.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-connection-pool-incidents-are-good-simulation-material&quot;&gt;Why connection pool incidents are good simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Connection pool incidents are excellent for practice because they create misleading symptoms.&lt;&#x2F;p&gt;
&lt;p&gt;The application says it cannot get a connection.
The database says it has too many clients.
The query dashboard shows slower SQL.
The lock dashboard may show waiting sessions.
The autoscaler adds more application instances.
Retries increase traffic.
Someone proposes increasing &lt;code&gt;max_connections&lt;&#x2F;code&gt;.
Someone else proposes restarting the app.&lt;&#x2F;p&gt;
&lt;p&gt;All of these may be plausible.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic simulation can force the team to reason through:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;where the queue is forming;
whether the pool is protecting or harming Postgres;
whether increasing pool size would help or amplify the incident;
which workload should be shed first;
whether long transactions are holding connections;
whether retries are multiplying demand;
whether background workers should be paused;
whether the safest mitigation is in SQL, app config, infrastructure, or traffic control.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not to memorize a perfect pool size.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to build judgment around database pressure.&lt;&#x2F;p&gt;
&lt;p&gt;Articles can explain the mechanics.
Dashboards can show saturation.
Simulations teach what it feels like to choose under pressure.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;More Postgres connections do not automatically mean more performance.&lt;&#x2F;p&gt;
&lt;p&gt;They mean more concurrency.&lt;&#x2F;p&gt;
&lt;p&gt;Concurrency is useful only while the database has capacity to serve it. Past that point, additional connections create contention, longer waits, more timeouts, more retries, and a larger incident.&lt;&#x2F;p&gt;
&lt;p&gt;A connection pool should not be treated as a bucket that must be as large as possible.&lt;&#x2F;p&gt;
&lt;p&gt;It should be treated as a control surface.&lt;&#x2F;p&gt;
&lt;p&gt;Good pooling protects Postgres.
Bad pooling exposes Postgres to uncontrolled application demand.&lt;&#x2F;p&gt;
&lt;p&gt;Reliable Postgres systems need:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;bounded connection counts;
short transactions;
clear timeout policies;
safe retry behavior;
separate limits for critical and background work;
visibility into pool wait time;
visibility into database wait states;
enough headroom for operations and incidents.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The dangerous phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;The pool is full, so increase it.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Why are connections being held longer than expected, and where should backpressure happen?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That question turns connection pooling from a configuration detail into a database reliability practice.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Postgres replication: when a standby exists but does not save you</title>
        <published>2026-05-12T00:00:00+00:00</published>
        <updated>2026-05-12T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/standby-does-not-save-you/"/>
        <id>https://rillence.com/notes/standby-does-not-save-you/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/standby-does-not-save-you/">&lt;p&gt;A standby database is comforting.&lt;&#x2F;p&gt;
&lt;p&gt;It appears in architecture diagrams as a safety net. The primary fails, the standby takes over, and the product survives. Read traffic can be moved away from the primary. Backups can be isolated. Disaster recovery looks solved.&lt;&#x2F;p&gt;
&lt;p&gt;But Postgres replication does not automatically mean high availability.&lt;&#x2F;p&gt;
&lt;p&gt;A standby can be too far behind.
A replica can faithfully reproduce bad writes.
A failover can create split-brain.
A replication slot can fill the primary disk with retained WAL.
Read queries on a standby can conflict with recovery.
A promoted replica can break downstream consumers.
An application can keep writing to the wrong node after failover.&lt;&#x2F;p&gt;
&lt;p&gt;Replication is not a guarantee. It is a mechanism.&lt;&#x2F;p&gt;
&lt;p&gt;And like every reliability mechanism, it creates new failure modes.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL streaming replication keeps a standby up to date by sending WAL records from the primary as they are generated; it is asynchronous by default, meaning there can be a delay between commit on the primary and visibility on the standby. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 26.2. Log-Shipping Standby Servers&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;warm-standby.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That small sentence contains an entire class of incidents.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-false-sense-of-safety&quot;&gt;The false sense of safety&lt;&#x2F;h2&gt;
&lt;p&gt;Many teams say:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;We have a replica.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But that statement is incomplete.&lt;&#x2F;p&gt;
&lt;p&gt;A more useful operational version is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;We have a replica.
We know how far behind it is.
We know whether it can be promoted.
We know what data loss window is acceptable.
We know how applications reconnect.
We know how to prevent the old primary from coming back.
We know what happens to replication slots, read traffic, jobs, and logical consumers after failover.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A replica is not a disaster recovery plan by itself.&lt;&#x2F;p&gt;
&lt;p&gt;It is a component inside a larger recovery process.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL’s own failover documentation is explicit about the need to prevent the old primary from continuing as primary after a standby is promoted, because two systems believing they are primary can lead to data loss; this is the classic split-brain problem. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 26.3. Failover&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;warm-standby-failover.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That is why replication reliability is not just about lag.&lt;&#x2F;p&gt;
&lt;p&gt;It is about control.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;replication-lag-is-not-one-number&quot;&gt;Replication lag is not one number&lt;&#x2F;h2&gt;
&lt;p&gt;The first mistake is treating replication lag as a single metric.&lt;&#x2F;p&gt;
&lt;p&gt;In practice, there are several different “lags”:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL generated on primary but not sent
WAL sent but not written by standby
WAL written but not flushed
WAL flushed but not replayed
Changes replayed but application still reading stale data
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On the primary, &lt;code&gt;pg_stat_replication&lt;&#x2F;code&gt; is the main view for directly connected standbys. The PostgreSQL statistics documentation describes it as one row per WAL sender process, with information about replication to the connected standby. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 27.2. The Cumulative Statistics System&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;monitoring-stats.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A useful primary-side query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    sent_lsn,
    write_lsn,
    flush_lsn,
    replay_lsn,
    -- pg_size_pretty returns human-readable text such as &amp;#39;16 MB&amp;#39;, not raw bytes
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_size,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_size,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_size,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This query separates the pipeline.&lt;&#x2F;p&gt;
&lt;p&gt;If &lt;code&gt;sent_lsn&lt;&#x2F;code&gt; is far behind the current WAL position, the primary is not sending fast enough or the connection is impaired.&lt;&#x2F;p&gt;
&lt;p&gt;If &lt;code&gt;write_lsn&lt;&#x2F;code&gt; lags behind &lt;code&gt;sent_lsn&lt;&#x2F;code&gt;, the standby is receiving but not writing fast enough.&lt;&#x2F;p&gt;
&lt;p&gt;If &lt;code&gt;flush_lsn&lt;&#x2F;code&gt; is behind, WAL is not durable on the standby yet.&lt;&#x2F;p&gt;
&lt;p&gt;If &lt;code&gt;replay_lsn&lt;&#x2F;code&gt; is behind, the standby has received WAL but has not applied it.&lt;&#x2F;p&gt;
&lt;p&gt;Those are not the same problem.&lt;&#x2F;p&gt;
&lt;p&gt;A standby can be connected and still not be useful for failover if it is too far behind the primary.&lt;&#x2F;p&gt;
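&lt;p&gt;To see which stage of the pipeline holds the backlog, the gaps between adjacent positions can be computed directly. This is a minimal sketch to run on the primary; the column aliases are illustrative.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Backlog per pipeline stage, in bytes
SELECT
    application_name,
    pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS not_yet_sent_bytes,
    pg_wal_lsn_diff(sent_lsn, write_lsn)            AS sent_not_written_bytes,
    pg_wal_lsn_diff(write_lsn, flush_lsn)           AS written_not_flushed_bytes,
    pg_wal_lsn_diff(flush_lsn, replay_lsn)          AS flushed_not_replayed_bytes
FROM pg_stat_replication
ORDER BY application_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A large value in the last column with small values elsewhere suggests slow replay on the standby rather than a network problem.&lt;&#x2F;p&gt;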
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;checking-the-standby-from-the-standby&quot;&gt;Checking the standby from the standby&lt;&#x2F;h2&gt;
&lt;p&gt;On the standby itself:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_is_in_recovery();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A standby returns &lt;code&gt;true&lt;&#x2F;code&gt;. After promotion, it returns &lt;code&gt;false&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;To inspect receive and replay positions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn()  AS replay_lsn,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn())
    ) AS receive_replay_gap,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This helps answer a different question:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this standby receiving WAL?
Is it replaying WAL?
How stale is the data visible to queries?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
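&lt;p&gt;The &lt;code&gt;replay_delay&lt;&#x2F;code&gt; value is especially important for read replicas. It estimates how far behind the visible database state may be. It is only meaningful while the primary is writing: when the primary is idle, no new transactions are replayed, so the value keeps growing even though the standby is fully caught up.&lt;&#x2F;p&gt;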
&lt;p&gt;For example, if the application writes an order to the primary and immediately reads from a standby, it may not see its own write.&lt;&#x2F;p&gt;
&lt;p&gt;That is not a Postgres bug. It is a read-after-write consistency problem.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;read-replicas-can-serve-stale-data&quot;&gt;Read replicas can serve stale data&lt;&#x2F;h2&gt;
&lt;p&gt;A common architecture sends writes to the primary and reads to replicas:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Application writes order to primary] --&amp;gt; B[Application reads order from standby]
    B --&amp;gt; C([Order is missing])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The write committed successfully. The replica simply has not replayed the WAL yet.&lt;&#x2F;p&gt;
&lt;p&gt;This is one of the most common ways replication leaks into product behavior.&lt;&#x2F;p&gt;
&lt;p&gt;The user sees:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;I saved the setting, but the UI still shows the old value.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The backend sees:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;INSERT succeeded.
SELECT returned old state.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The database sees:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Primary is correct.
Standby is behind by 800 ms.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That may be acceptable for dashboards, analytics, or eventually consistent feeds. It may be unacceptable for checkout, authentication, permissions, billing, or anything requiring read-your-writes behavior.&lt;&#x2F;p&gt;
&lt;p&gt;A basic mitigation pattern is application-level routing:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Fresh reads after writes → primary
Stale-tolerant reads → replica
Long analytics queries → dedicated reporting replica
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This decision belongs in system design, not in a panic during an incident.&lt;&#x2F;p&gt;
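&lt;p&gt;Where a replica read must observe a specific write, one option is to compare WAL positions before reading. The sketch below assumes the application can carry the LSN from the write path to the read path; the literal LSN is a placeholder.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- On the primary, right after the write commits: capture the WAL position
SELECT pg_current_wal_lsn() AS write_lsn;

-- On the standby, before a read that must see that write
-- (&amp;#39;0&#x2F;50000000&amp;#39; stands in for the value captured above)
SELECT pg_last_wal_replay_lsn() &amp;gt;= &amp;#39;0&#x2F;50000000&amp;#39;::pg_lsn AS caught_up;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If &lt;code&gt;caught_up&lt;&#x2F;code&gt; is false, the application can fall back to the primary or retry after a short wait.&lt;&#x2F;p&gt;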
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;replication-protects-availability-not-correctness-of-bad-changes&quot;&gt;Replication protects availability, not correctness of bad changes&lt;&#x2F;h2&gt;
&lt;p&gt;Replication copies changes.&lt;&#x2F;p&gt;
&lt;p&gt;That includes bad changes.&lt;&#x2F;p&gt;
&lt;p&gt;If an application deploy runs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE users
SET plan = &amp;#39;free&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;without a &lt;code&gt;WHERE&lt;&#x2F;code&gt; clause, the standby will not save you. It will replay the same change.&lt;&#x2F;p&gt;
&lt;p&gt;If a migration drops the wrong column, the standby will follow.&lt;&#x2F;p&gt;
&lt;p&gt;If an application bug deletes valid data, physical streaming replication reproduces the deletion.&lt;&#x2F;p&gt;
&lt;p&gt;This is why replication is not a replacement for backups, point-in-time recovery, access controls, safer migrations, or staged rollouts.&lt;&#x2F;p&gt;
&lt;p&gt;A standby helps when the primary node, disk, VM, container, or availability zone fails.&lt;&#x2F;p&gt;
&lt;p&gt;It does not magically distinguish good WAL from bad WAL.&lt;&#x2F;p&gt;
&lt;p&gt;A good reliability review asks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which failure mode are we defending against?
Primary host failure?
Storage failure?
Human error?
Bad deploy?
Region outage?
Silent corruption?
Accidental DELETE?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A replica is useful for some of these. It is insufficient for others.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;replication-slots-safety-mechanism-with-sharp-edges&quot;&gt;Replication slots: safety mechanism with sharp edges&lt;&#x2F;h2&gt;
&lt;p&gt;Replication slots are designed to help prevent the primary from removing WAL that a replica or logical consumer still needs. PostgreSQL documents &lt;code&gt;pg_replication_slots&lt;&#x2F;code&gt; as the view listing replication slots and their current state. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 53.20. pg_replication_slots&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;view-pg-replication-slots.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That is useful. It is also dangerous if nobody monitors it.&lt;&#x2F;p&gt;
&lt;p&gt;Inspect slots:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    wal_status,
    safe_wal_size,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The risk is simple:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[A replica disconnects] --&amp;gt; B[Its replication slot remains]
    B --&amp;gt; C[The primary keeps WAL needed by that slot]
    C --&amp;gt; D[WAL accumulates]
    D --&amp;gt; E[Disk fills]
    E --&amp;gt; F([The primary goes down])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The original problem may have been a failed standby.&lt;&#x2F;p&gt;
&lt;p&gt;The actual production outage may be the primary running out of disk because the slot kept retaining WAL.&lt;&#x2F;p&gt;
&lt;p&gt;Replication infrastructure can therefore take down the primary it was supposed to protect.&lt;&#x2F;p&gt;
&lt;p&gt;Operationally, slots need ownership:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Who owns this slot?
Which process consumes it?
Is it expected to be active?
How much WAL can it retain?
What alert fires before disk pressure becomes dangerous?
Can this slot be safely dropped?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
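&lt;p&gt;Some of those questions can be turned into an alert. The following is a sketch to run on the primary; the 10 GB threshold is an arbitrary illustration, and &lt;code&gt;wal_status&lt;&#x2F;code&gt; and &lt;code&gt;max_slot_wal_keep_size&lt;&#x2F;code&gt; require PostgreSQL 13 or later.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Inactive slots retaining more than 10 GB of WAL (threshold is illustrative)
SELECT
    slot_name,
    slot_type,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE NOT active
  AND pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) &amp;gt; pg_size_bytes(&amp;#39;10 GB&amp;#39;);

-- Upper bound on WAL a slot may retain (-1 means unlimited)
SHOW max_slot_wal_keep_size;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Setting a finite &lt;code&gt;max_slot_wal_keep_size&lt;&#x2F;code&gt; trades one failure for another: the primary disk is protected, but a slot that falls too far behind is invalidated and its consumer has to be rebuilt.&lt;&#x2F;p&gt;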
&lt;p&gt;Dropping a slot is not a casual action. If the consumer still needs that WAL, dropping the slot may force reinitialization or data loss for that consumer.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_drop_replication_slot(&amp;#39;slot_name&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That command can be correct. It can also be destructive. The hard part is knowing which situation you are in.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-volume-can-break-your-assumptions&quot;&gt;WAL volume can break your assumptions&lt;&#x2F;h2&gt;
&lt;p&gt;Replication lag is not only about network speed.&lt;&#x2F;p&gt;
&lt;p&gt;A primary can suddenly generate more WAL than usual:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Large UPDATE
Bulk import
Index creation
VACUUM FULL
High-write deploy
Backfill job
Large DELETE
Migration touching many rows
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A replica that keeps up during normal traffic may fall behind during a backfill.&lt;&#x2F;p&gt;
&lt;p&gt;A simple way to inspect WAL generation rate is to sample LSN movement over time.&lt;&#x2F;p&gt;
&lt;p&gt;Manual example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_current_wal_lsn();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Run it again later, then diff the two values. The LSNs below are placeholders for the two samples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pg_size_pretty(
        pg_wal_lsn_diff(&amp;#39;0&#x2F;50000000&amp;#39;::pg_lsn, &amp;#39;0&#x2F;40000000&amp;#39;::pg_lsn)
    ) AS wal_generated;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In a monitoring system, this becomes a rate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL bytes generated per second
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That metric matters because replication capacity is about throughput over time, not just whether the standby is connected.&lt;&#x2F;p&gt;
&lt;p&gt;The standby may be healthy and still unable to keep up with a temporary WAL storm.&lt;&#x2F;p&gt;
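&lt;p&gt;On PostgreSQL 14 and later, &lt;code&gt;pg_stat_wal&lt;&#x2F;code&gt; exposes a cumulative byte counter that avoids manual LSN arithmetic. The sketch below shows a long-run average plus a monotonically increasing position that a monitoring system could sample and turn into a rate; the aliases are illustrative.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Average WAL generation since the statistics were last reset (PostgreSQL 14+)
SELECT
    wal_bytes,
    stats_reset,
    pg_size_pretty(
        (wal_bytes &#x2F; GREATEST(extract(epoch FROM now() - stats_reset), 1))::bigint
    ) AS avg_wal_per_second
FROM pg_stat_wal;

-- Current WAL position as a plain byte counter, suitable for periodic sampling
SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), &amp;#39;0&#x2F;0&amp;#39;::pg_lsn) AS wal_position_bytes;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The average hides short spikes, so the sampled counter is the more useful signal during a backfill or bulk import.&lt;&#x2F;p&gt;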
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;hot-standby-query-conflicts&quot;&gt;Hot standby query conflicts&lt;&#x2F;h2&gt;
&lt;p&gt;A hot standby can serve read-only queries while it replays WAL.&lt;&#x2F;p&gt;
&lt;p&gt;That sounds perfect until long read queries on the standby conflict with recovery.&lt;&#x2F;p&gt;
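&lt;p&gt;A reporting query might hold a snapshot that conflicts with WAL replay. Postgres then has a choice: delay replay or cancel the query, depending on timing and on configuration such as &lt;code&gt;max_standby_streaming_delay&lt;&#x2F;code&gt; and &lt;code&gt;max_standby_archive_delay&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;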
&lt;p&gt;You can inspect standby conflicts with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    datname,
    confl_tablespace,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The monitoring stats documentation includes &lt;code&gt;pg_stat_database_conflicts&lt;&#x2F;code&gt; for database-wide query cancels due to conflicts with recovery on standby servers. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 27.2. The Cumulative Statistics System&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;monitoring-stats.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;This matters because a replica often has two competing jobs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Stay close to primary for failover
Serve long-running read queries
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Those goals can conflict.&lt;&#x2F;p&gt;
&lt;p&gt;If the standby prioritizes replay, analytical queries may be canceled.&lt;&#x2F;p&gt;
&lt;p&gt;If the standby delays replay to satisfy long queries, replication lag may grow.&lt;&#x2F;p&gt;
&lt;p&gt;You can reduce pain by separating roles:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;HA standby: optimized for promotion, minimal lag
Reporting replica: accepts staleness, runs heavy reads
Logical&#x2F;ETL replica: feeds downstream systems
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Using one standby for every purpose is cheap architecturally and expensive operationally.&lt;&#x2F;p&gt;
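&lt;p&gt;Which way a given standby leans is visible in its settings. A minimal check, to be run on the standby itself:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Settings that govern how the standby resolves replay conflicts
SELECT name, setting, unit
FROM pg_settings
WHERE name IN (
    &amp;#39;max_standby_streaming_delay&amp;#39;,
    &amp;#39;max_standby_archive_delay&amp;#39;,
    &amp;#39;hot_standby_feedback&amp;#39;
)
ORDER BY name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Short delays favor an HA standby. Long delays or &lt;code&gt;hot_standby_feedback = on&lt;&#x2F;code&gt; favor reporting queries, at the cost of replay lag or delayed cleanup on the primary.&lt;&#x2F;p&gt;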
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;synchronous-replication-stronger-durability-different-failure-mode&quot;&gt;Synchronous replication: stronger durability, different failure mode&lt;&#x2F;h2&gt;
&lt;p&gt;Asynchronous replication has a data loss window.&lt;&#x2F;p&gt;
&lt;p&gt;Synchronous replication can reduce that window, but it changes the write path. The primary may wait for standby acknowledgement depending on &lt;code&gt;synchronous_commit&lt;&#x2F;code&gt; and synchronous replication configuration. The PostgreSQL replication settings documentation warns that with &lt;code&gt;synchronous_commit = remote_apply&lt;&#x2F;code&gt;, commits wait for the change to be applied on the standby. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.6. Replication&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-replication.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That means synchronous replication can turn standby problems into primary write latency.&lt;&#x2F;p&gt;
&lt;p&gt;The trade-off is not “sync is better” or “async is better.”&lt;&#x2F;p&gt;
&lt;p&gt;The trade-off is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Async replication:
lower write latency,
possible data loss during failover.

Sync replication:
stronger durability guarantees,
standby health can affect primary commits.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A useful query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Pay attention to &lt;code&gt;sync_state&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Values such as &lt;code&gt;sync&lt;&#x2F;code&gt;, &lt;code&gt;quorum&lt;&#x2F;code&gt;, &lt;code&gt;potential&lt;&#x2F;code&gt;, or &lt;code&gt;async&lt;&#x2F;code&gt; tell you how the standby participates in synchronous replication.&lt;&#x2F;p&gt;
&lt;p&gt;A synchronous standby is not just a backup target. It is part of the commit path.&lt;&#x2F;p&gt;
&lt;p&gt;If it becomes slow, user-facing writes may slow down too.&lt;&#x2F;p&gt;
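&lt;p&gt;Whether a standby sits in the commit path is determined by two settings on the primary. The check below is a sketch; the commented-out change uses hypothetical standby names and is shown only to illustrate the quorum syntax.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Current synchronous replication configuration on the primary
SELECT name, setting
FROM pg_settings
WHERE name IN (&amp;#39;synchronous_commit&amp;#39;, &amp;#39;synchronous_standby_names&amp;#39;);

-- Hypothetical standby names: commits wait for any one of the two
-- ALTER SYSTEM SET synchronous_standby_names = &amp;#39;ANY 1 (standby_a, standby_b)&amp;#39;;
-- SELECT pg_reload_conf();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An empty &lt;code&gt;synchronous_standby_names&lt;&#x2F;code&gt; means replication is asynchronous regardless of &lt;code&gt;synchronous_commit&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;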
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;failover-is-a-process-not-a-command&quot;&gt;Failover is a process, not a command&lt;&#x2F;h2&gt;
&lt;p&gt;Promotion is technically simple:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_promote();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;or from the server:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;bash&quot;&gt;pg_ctl promote
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;PostgreSQL documents these as ways to trigger failover for a log-shipping standby. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 26.3. Failover&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;warm-standby-failover.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;But promotion is only one step.&lt;&#x2F;p&gt;
&lt;p&gt;A real failover involves many decisions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is the primary truly dead?
Could it still accept writes?
Which standby is the best candidate?
How much WAL has it replayed?
What data loss is acceptable?
How will applications reconnect?
What happens to connection pools?
How is the old primary fenced?
What happens to read replicas following the old primary?
What happens to logical replication slots?
What happens to scheduled jobs and workers?
Who declares the incident phase complete?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The dangerous failover is not the one that fails loudly.&lt;&#x2F;p&gt;
&lt;p&gt;The dangerous failover is the one that half-succeeds.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Standby promoted successfully.
Some app instances still write to old primary.
A background worker reconnects to the wrong host.
Read replicas still follow the old timeline.
Logical consumers lose their slots.
Monitoring shows green because one node is healthy.
Data diverges.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
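&lt;p&gt;One hedged way to catch a half-finished failover is to ask every node what role it believes it has. The sketch below assumes you can connect to each node directly rather than through a load balancer or virtual IP. The timeline comes from the last checkpoint record, so it may trail a very recent promotion:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Run on every node, including the old primary if it is reachable.
SELECT
    pg_is_in_recovery() AS in_recovery,
    (pg_control_checkpoint()).timeline_id AS checkpoint_timeline,
    CASE
        -- pg_current_wal_lsn() raises an error on a standby,
        -- so use the replay position there instead.
        WHEN pg_is_in_recovery() THEN pg_last_wal_replay_lsn()
        ELSE pg_current_wal_lsn()
    END AS current_position;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;More than one node reporting &lt;code&gt;in_recovery = false&lt;&#x2F;code&gt; is a signal to stop and fence before accepting more writes.&lt;&#x2F;p&gt;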
&lt;p&gt;This is why failover must be rehearsed.&lt;&#x2F;p&gt;
&lt;p&gt;Not discussed.
Not documented once.
Rehearsed.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;timeline-changes-matter&quot;&gt;Timeline changes matter&lt;&#x2F;h2&gt;
&lt;p&gt;After promotion, the new primary continues on a new timeline.&lt;&#x2F;p&gt;
&lt;p&gt;That matters for replicas, WAL archives, backup chains, and recovery procedures.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL documentation notes that standbys used for high availability should follow timeline changes after failover, with &lt;code&gt;recovery_target_timeline&lt;&#x2F;code&gt; set to &lt;code&gt;latest&lt;&#x2F;code&gt;, which is the default. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 26.2. Log-Shipping Standby Servers&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;warm-standby.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;This detail sounds small until a replica fails to follow the new primary after failover.&lt;&#x2F;p&gt;
&lt;p&gt;The operational symptom may be confusing:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;New primary accepts writes.
Old standby does not catch up.
A recreated replica follows the wrong history.
Archive restore behaves unexpectedly.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;During calm periods, timeline mechanics feel like internal implementation detail.&lt;&#x2F;p&gt;
&lt;p&gt;During failover, they become part of the recovery path.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;logical-replication-adds-another-layer&quot;&gt;Logical replication adds another layer&lt;&#x2F;h2&gt;
&lt;p&gt;Logical replication is often used for:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;CDC pipelines
Search indexing
Data warehouses
Event streaming
Cross-version migrations
Selective table replication
Zero-downtime migration workflows
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Its failure modes are different from physical streaming replication.&lt;&#x2F;p&gt;
&lt;p&gt;A logical slot can fall behind and retain WAL.
A subscriber can stop applying changes.
Schema drift can break replication.
A failover can strand logical slots if they are not handled correctly.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL 17 and later include mechanisms for logical failover slot synchronization. The current documentation describes &lt;code&gt;sync_replication_slots&lt;&#x2F;code&gt; as enabling a physical standby to synchronize logical failover slots from the primary so logical subscribers can resume from the new primary after failover. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.6. Replication&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-replication.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
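&lt;p&gt;As a sketch, on PostgreSQL 17 or later you can check which logical slots are marked for failover and whether a standby has synchronized them. The &lt;code&gt;failover&lt;&#x2F;code&gt; and &lt;code&gt;synced&lt;&#x2F;code&gt; columns do not exist in older versions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Run on the primary and on the standby that may be promoted.
SELECT
    slot_name,
    plugin,
    active,
    failover,   -- true if the slot was created with failover enabled
    synced,     -- true on a standby for slots copied from the primary
    confirmed_flush_lsn
FROM pg_replication_slots
WHERE slot_type = &amp;#39;logical&amp;#39;
ORDER BY slot_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A logical slot with &lt;code&gt;failover = false&lt;&#x2F;code&gt; is not a candidate for synchronization, so its consumer will likely need manual recovery after promotion.&lt;&#x2F;p&gt;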
&lt;p&gt;The practical lesson is simple:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;If downstream systems depend on logical replication,
failover planning must include those systems.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is not enough that the database comes back.&lt;&#x2F;p&gt;
&lt;p&gt;The data platform around it must continue correctly.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-practical-replication-health-snapshot&quot;&gt;A practical replication health snapshot&lt;&#x2F;h2&gt;
&lt;p&gt;This is not a full runbook, but these queries make a useful health snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;Primary-side replication status:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication slots:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Standby freshness:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
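&lt;p&gt;One caveat: &lt;code&gt;now() - pg_last_xact_replay_timestamp()&lt;&#x2F;code&gt; keeps growing when the primary is simply idle, because no new transactions arrive to replay. A hedged variant, assuming a streaming standby, reports zero delay when everything received has already been replayed:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    CASE
        -- Not a standby: replay delay does not apply.
        WHEN NOT pg_is_in_recovery() THEN NULL
        -- Everything received has been replayed: treat as caught up.
        WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN interval &amp;#39;0&amp;#39;
        -- Otherwise fall back to the age of the last replayed transaction.
        ELSE now() - pg_last_xact_replay_timestamp()
    END AS replay_delay;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This still cannot see WAL that the primary has generated but not yet sent, so it complements the primary-side view rather than replacing it.&lt;&#x2F;p&gt;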
&lt;p&gt;Standby conflicts:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    datname,
    confl_lock,
    confl_snapshot,
    confl_bufferpin,
    confl_deadlock
FROM pg_stat_database_conflicts
ORDER BY datname;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;WAL receiver on standby:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    status,
    receive_start_lsn,
    written_lsn,
    flushed_lsn,
    received_tli,
    last_msg_send_time,
    last_msg_receipt_time,
    latest_end_lsn,
    latest_end_time,
    conninfo
FROM pg_stat_wal_receiver;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These queries do not tell you what to do automatically. They help you ask better questions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is the standby connected?
Is it catching up or falling behind?
Is lag measured in bytes, time, or user-visible staleness?
Is WAL retention becoming dangerous?
Are standby reads conflicting with recovery?
Is failover currently safe, risky, or impossible?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;common-anti-patterns&quot;&gt;Common anti-patterns&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;one-replica-for-every-purpose&quot;&gt;One replica for every purpose&lt;&#x2F;h3&gt;
&lt;p&gt;A standby used for HA, reporting, backups, ad hoc analytics, and read scaling will eventually disappoint one of those use cases.&lt;&#x2F;p&gt;
&lt;p&gt;HA wants low lag.
Analytics wants long queries.
Backups want predictable throughput.
Read scaling wants availability and acceptable staleness.&lt;&#x2F;p&gt;
&lt;p&gt;Those goals are not identical.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-explicit-read-consistency-model&quot;&gt;No explicit read consistency model&lt;&#x2F;h3&gt;
&lt;p&gt;If the application casually sends reads to replicas, product behavior may become inconsistent.&lt;&#x2F;p&gt;
&lt;p&gt;Use replicas deliberately:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Can this read be stale?
Does this user need to read their own write?
Can this endpoint tolerate lag?
Should this workflow force primary reads?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
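&lt;p&gt;For flows that must read their own writes, one possible approach is to compare WAL positions. This is only a sketch: the LSN literal is a placeholder for the value captured after the write, and how the application carries that value between requests is left open:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- On the primary, right after the write commits:
SELECT pg_current_wal_lsn() AS write_lsn;

-- On the replica, before serving the follow-up read.
-- Replace the placeholder LSN with the value captured above.
SELECT pg_last_wal_replay_lsn() &amp;gt;= &amp;#39;0&#x2F;70000000&amp;#39;::pg_lsn AS has_replayed_write;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the result is false, the application can wait briefly, retry, or route that read to the primary.&lt;&#x2F;p&gt;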
&lt;h3 id=&quot;ignoring-slots-until-disk-pressure&quot;&gt;Ignoring slots until disk pressure&lt;&#x2F;h3&gt;
&lt;p&gt;Replication slots should be treated like production resources with owners, alerts, and lifecycle management.&lt;&#x2F;p&gt;
&lt;p&gt;An abandoned slot is not harmless metadata.&lt;&#x2F;p&gt;
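&lt;p&gt;A small sketch for finding slots that nobody is consuming, assuming PostgreSQL 13 or later for &lt;code&gt;wal_status&lt;&#x2F;code&gt;, &lt;code&gt;safe_wal_size&lt;&#x2F;code&gt;, and &lt;code&gt;max_slot_wal_keep_size&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Slots with no connected consumer still retain WAL.
SELECT
    slot_name,
    slot_type,
    wal_status,
    pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
WHERE NOT active
ORDER BY slot_name;

-- -1 means slots may retain an unlimited amount of WAL.
SHOW max_slot_wal_keep_size;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Setting a finite &lt;code&gt;max_slot_wal_keep_size&lt;&#x2F;code&gt; trades disk safety for the risk of invalidating a slow consumer, so the limit deserves an owner as well.&lt;&#x2F;p&gt;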
&lt;h3 id=&quot;treating-failover-as-infrastructure-only&quot;&gt;Treating failover as infrastructure-only&lt;&#x2F;h3&gt;
&lt;p&gt;Failover affects database clients, application routing, workers, caches, queues, jobs, observability, and people.&lt;&#x2F;p&gt;
&lt;p&gt;A database promotion that the application does not understand is not recovery.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;never-testing-promotion&quot;&gt;Never testing promotion&lt;&#x2F;h3&gt;
&lt;p&gt;A failover process that has never been practiced is an assumption.&lt;&#x2F;p&gt;
&lt;p&gt;Assumptions do not become reliable because they are written in a document.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-a-good-incident-review-should-ask&quot;&gt;What a good incident review should ask&lt;&#x2F;h2&gt;
&lt;p&gt;After a replication incident, avoid stopping at:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;The replica lagged.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is only the symptom.&lt;&#x2F;p&gt;
&lt;p&gt;Better questions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What created the WAL spike?
Was the standby under-provisioned or overloaded by read traffic?
Did a long query on the standby delay recovery?
Did a slot retain more WAL than expected?
Were alerts based on bytes, time, or disk risk?
Did application reads tolerate the actual staleness?
Was failover considered? If not, why?
Would promotion have caused data loss?
Could the old primary have reappeared?
Did downstream logical consumers survive the event?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is to understand the system’s recovery posture, not just the replication metric that turned red.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-replication-incidents-are-excellent-simulation-material&quot;&gt;Why replication incidents are excellent simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Replication incidents are perfect for training because they combine database internals with distributed systems behavior.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic scenario can involve:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL generation spike from a migration
Replica lag crossing the read-staleness budget
Replication slot retaining dangerous WAL volume
Read queries conflicting with recovery
Application reads returning stale data
A failover decision under uncertainty
Old primary fencing
Connection string and DNS behavior
Downstream logical replication consumers
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The hard part is not running &lt;code&gt;pg_stat_replication&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is deciding what the evidence means.&lt;&#x2F;p&gt;
&lt;p&gt;Is the replica unhealthy, or is the primary generating too much WAL?
Is lag acceptable for read traffic but unacceptable for failover?
Is the slot protecting data or threatening disk?
Would promotion reduce impact or create split-brain?
Should traffic be moved, throttled, failed over, or left alone while the standby catches up?&lt;&#x2F;p&gt;
&lt;p&gt;Those decisions require practice.&lt;&#x2F;p&gt;
&lt;p&gt;Articles can explain the mechanism.
Monitoring can expose the symptoms.
Simulation builds the judgment needed to act safely.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;A standby does not automatically save you.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres replication is powerful, but it is not magic. It improves availability only when the surrounding operational system is mature enough to use it correctly.&lt;&#x2F;p&gt;
&lt;p&gt;You need to know:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;how far behind replicas are;
which reads can tolerate staleness;
how much WAL slots retain;
whether standby queries conflict with replay;
what data loss window is acceptable;
how failover is triggered;
how split-brain is prevented;
how applications reconnect;
how downstream consumers continue;
how the cluster returns to a healthy topology after promotion.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication is not just a database feature.&lt;&#x2F;p&gt;
&lt;p&gt;It is a reliability contract between Postgres, infrastructure, applications, operators, and product expectations.&lt;&#x2F;p&gt;
&lt;p&gt;The dangerous phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;“We have a replica, so we are safe.”
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;“We know exactly what our replica can and cannot save us from.”
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>WAL and checkpoints: the invisible machinery behind Postgres durability</title>
        <published>2026-05-04T00:00:00+00:00</published>
        <updated>2026-05-04T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/wal-and-checkpoints/"/>
        <id>https://rillence.com/notes/wal-and-checkpoints/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/wal-and-checkpoints/">&lt;p&gt;Most teams notice WAL only when something goes wrong.&lt;&#x2F;p&gt;
&lt;p&gt;The disk fills with files in &lt;code&gt;pg_wal&lt;&#x2F;code&gt;.
A replica falls behind.
Backups stop completing.
Checkpoints create latency spikes.
A bulk update generates far more IO than expected.
A restart takes longer than the team is comfortable with.&lt;&#x2F;p&gt;
&lt;p&gt;Until then, WAL and checkpoints feel like internal Postgres details.&lt;&#x2F;p&gt;
&lt;p&gt;They are not.&lt;&#x2F;p&gt;
&lt;p&gt;WAL and checkpoints are part of the contract between Postgres, storage, replication, backups, recovery, and application latency. If you operate Postgres in production, you do not need to become a storage engine developer, but you do need a practical reliability model of how this machinery behaves under pressure.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL uses Write-Ahead Logging to preserve data integrity: changes to data files may be written only after the WAL records describing those changes have been flushed to durable storage. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 28.3. Write-Ahead Logging (WAL)&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;wal-intro.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That is the foundation. The incidents come from everything around it.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-basic-idea-of-wal&quot;&gt;The basic idea of WAL&lt;&#x2F;h2&gt;
&lt;p&gt;When a transaction changes data, Postgres does not rely only on immediately updating table and index files.&lt;&#x2F;p&gt;
&lt;p&gt;It first records the change in WAL.&lt;&#x2F;p&gt;
&lt;p&gt;A simplified write path looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Client sends write] --&amp;gt; B[Postgres modifies pages in memory]
    B --&amp;gt; C[Postgres writes WAL records]
    C --&amp;gt; D[WAL is flushed according to durability settings]
    D --&amp;gt; E[COMMIT returns]
    E --&amp;gt; F[Data pages are written later]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This separation is crucial.&lt;&#x2F;p&gt;
&lt;p&gt;The data page may not be written to the table file immediately. It can remain dirty in shared buffers. If the server crashes, Postgres can use WAL during recovery to bring data files back to a consistent state. PostgreSQL keeps WAL in the &lt;code&gt;pg_wal&#x2F;&lt;&#x2F;code&gt; directory, and the documentation describes WAL replay after the last checkpoint as the mechanism used to restore consistency after a crash. (&lt;a rel=&quot;external&quot; title=&quot;25.3. Continuous Archiving and Point-in-Time Recovery ...&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;continuous-archiving.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That is why WAL is not just logging.&lt;&#x2F;p&gt;
&lt;p&gt;It is recovery infrastructure.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;commit-does-not-mean-every-table-page-is-already-on-disk&quot;&gt;COMMIT does not mean “every table page is already on disk”&lt;&#x2F;h2&gt;
&lt;p&gt;A common misconception:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;COMMIT means all changed table and index pages were written to disk.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Not exactly.&lt;&#x2F;p&gt;
&lt;p&gt;A committed transaction means Postgres has made the transaction durable according to its WAL and commit settings. The actual table and index pages may be written later.&lt;&#x2F;p&gt;
&lt;p&gt;This is one reason Postgres can perform well. It does not need to synchronously rewrite every affected data page before returning every commit.&lt;&#x2F;p&gt;
&lt;p&gt;But it also means that the health of WAL IO is critical.&lt;&#x2F;p&gt;
&lt;p&gt;If WAL writes or WAL fsync become slow, commits can become slow.&lt;&#x2F;p&gt;
&lt;p&gt;A user-visible symptom may be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;INSERT&#x2F;UPDATE latency increases
API writes slow down
background jobs fall behind
replication lag grows
WAL directory grows
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The application may report “database is slow,” but the specific mechanism may be commit-path pressure.&lt;&#x2F;p&gt;
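&lt;p&gt;A rough way to see whether backends are waiting on the commit path is to group active sessions by wait event. This is a sketch: exact wait event names for WAL writes and syncs vary between PostgreSQL versions, so the query lists all of them rather than filtering on specific names:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    wait_event_type,
    wait_event,
    count(*) AS backends
FROM pg_stat_activity
WHERE state = &amp;#39;active&amp;#39;
  AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY backends DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Many backends waiting on WAL-related IO events at the same moment points toward commit-path pressure rather than a single slow query.&lt;&#x2F;p&gt;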
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;synchronous-commit-changes-the-durability-latency-trade-off&quot;&gt;&lt;code&gt;synchronous_commit&lt;&#x2F;code&gt; changes the durability&#x2F;latency trade-off&lt;&#x2F;h2&gt;
&lt;p&gt;One setting that directly affects commit behavior is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW synchronous_commit;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The default, &lt;code&gt;on&lt;&#x2F;code&gt;, is appropriate for most production systems, but the operational model matters.&lt;&#x2F;p&gt;
&lt;p&gt;With stronger commit guarantees, the client waits for more durability work before &lt;code&gt;COMMIT&lt;&#x2F;code&gt; returns. With weaker settings, commits can return earlier, but the system accepts a larger risk window in the event of a crash.&lt;&#x2F;p&gt;
&lt;p&gt;This is not a generic performance knob.&lt;&#x2F;p&gt;
&lt;p&gt;It is a business and reliability decision.&lt;&#x2F;p&gt;
&lt;p&gt;For example, it may be acceptable to relax durability for:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;ephemeral analytics events;
rebuildable caches;
non-critical metrics;
temporary ingestion buffers.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It may be unacceptable for:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;payments;
orders;
ledger entries;
identity changes;
permissions;
security-sensitive writes.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
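&lt;p&gt;When relaxing durability is acceptable, it does not have to be a server-wide change. A minimal sketch, assuming a hypothetical &lt;code&gt;analytics_events&lt;&#x2F;code&gt; table that holds rebuildable data:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- analytics_events is a hypothetical table used for illustration.
BEGIN;
SET LOCAL synchronous_commit = off;  -- applies to this transaction only
INSERT INTO analytics_events (kind) VALUES (&amp;#39;page_view&amp;#39;);
COMMIT;  -- may return before the WAL flush completes
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A crash shortly after such a commit can lose the transaction, but it does not leave the database inconsistent. Scoping the setting per transaction keeps the stronger guarantee for everything else.&lt;&#x2F;p&gt;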
&lt;p&gt;A dangerous incident response is changing durability settings during pressure without understanding what data can be lost and what the product guarantees.&lt;&#x2F;p&gt;
&lt;p&gt;The question is not:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Can this reduce latency?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What durability contract are we changing, and who owns that risk?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-checkpoints-do&quot;&gt;What checkpoints do&lt;&#x2F;h2&gt;
&lt;p&gt;If WAL can recover data after a crash, why do checkpoints exist?&lt;&#x2F;p&gt;
&lt;p&gt;Because recovery cannot start from the beginning of time.&lt;&#x2F;p&gt;
&lt;p&gt;A checkpoint is a known safe point in the WAL sequence. At checkpoint time, dirty data pages are flushed to disk, and Postgres writes a checkpoint record to WAL. PostgreSQL documentation describes checkpoints as points where heap and index data files are guaranteed to have been updated with all information written before that checkpoint. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 28.5. WAL Configuration&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;wal-configuration.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A simplified model:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[WAL records accumulate] --&amp;gt; B[Dirty pages accumulate in memory]
    B --&amp;gt; C[Checkpoint begins]
    C --&amp;gt; D[Dirty pages are written to disk]
    D --&amp;gt; E[Checkpoint record is written]
    E --&amp;gt; F([Crash recovery can start from a later point])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Checkpoints reduce crash recovery work.&lt;&#x2F;p&gt;
&lt;p&gt;But they also create IO work.&lt;&#x2F;p&gt;
&lt;p&gt;That trade-off is central to Postgres reliability.&lt;&#x2F;p&gt;
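&lt;p&gt;To make the model concrete, the last completed checkpoint can be inspected directly. The sketch below assumes it runs on a primary, because &lt;code&gt;pg_current_wal_lsn()&lt;&#x2F;code&gt; is not available during recovery:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    checkpoint_time,
    checkpoint_lsn,
    redo_lsn,
    -- Approximate WAL that crash recovery would replay right now.
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), redo_lsn)) AS wal_since_redo
FROM pg_control_checkpoint();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A large &lt;code&gt;wal_since_redo&lt;&#x2F;code&gt; suggests a longer crash recovery; a consistently tiny one may mean checkpoints run more often than the workload needs.&lt;&#x2F;p&gt;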
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;checkpoints-are-not-free&quot;&gt;Checkpoints are not free&lt;&#x2F;h2&gt;
&lt;p&gt;During a checkpoint, Postgres must write dirty buffers to disk.&lt;&#x2F;p&gt;
&lt;p&gt;If many pages are dirty, that can create significant IO pressure. If the storage system is already busy, checkpoint activity can appear as latency spikes.&lt;&#x2F;p&gt;
&lt;p&gt;Symptoms may include:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;periodic write latency spikes;
higher commit latency;
slow queries during checkpoint periods;
replica lag increasing during write bursts;
backend processes writing buffers directly;
checkpoint warnings in logs;
storage saturation without one obvious query.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is why checkpoint behavior should be understood as part of workload management, not only configuration.&lt;&#x2F;p&gt;
&lt;p&gt;A checkpoint problem is often a workload-shape problem:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;many writes in a short period;
bulk updates;
large deletes;
index builds;
backfills;
ETL jobs;
maintenance tasks;
write-heavy deploys;
checkpoints happening too frequently.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The database may be working correctly while still creating unacceptable latency for the product.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;time-based-and-wal-volume-based-checkpoints&quot;&gt;Time-based and WAL-volume-based checkpoints&lt;&#x2F;h2&gt;
&lt;p&gt;Checkpoints happen for different reasons.&lt;&#x2F;p&gt;
&lt;p&gt;Four settings shape checkpoint behavior:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW checkpoint_timeout;
SHOW max_wal_size;
SHOW checkpoint_completion_target;
SHOW checkpoint_warning;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres can checkpoint because enough time has passed, or because WAL volume has grown enough. The documentation describes &lt;code&gt;checkpoint_timeout&lt;&#x2F;code&gt;, &lt;code&gt;max_wal_size&lt;&#x2F;code&gt;, &lt;code&gt;checkpoint_completion_target&lt;&#x2F;code&gt;, and &lt;code&gt;checkpoint_warning&lt;&#x2F;code&gt; as key WAL&#x2F;checkpoint configuration parameters. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.5. Write Ahead Log&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-wal.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A useful mental model:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;checkpoint_timeout:
how long Postgres may go between automatic checkpoints.

max_wal_size:
how much WAL growth can push Postgres toward a checkpoint.

checkpoint_completion_target:
how much of the checkpoint interval Postgres tries to use
to spread checkpoint writes.

checkpoint_warning:
log a warning if checkpoints happen too close together.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Frequent requested checkpoints are usually a warning sign.&lt;&#x2F;p&gt;
&lt;p&gt;They often mean WAL is being generated faster than the current checkpoint configuration expects.&lt;&#x2F;p&gt;
&lt;p&gt;That can happen during normal growth, but it can also reveal an unsafe backfill, bulk update, migration, or retry storm.&lt;&#x2F;p&gt;
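&lt;p&gt;A sketch for comparing timed and requested checkpoints, assuming PostgreSQL 17 or later, where these counters live in &lt;code&gt;pg_stat_checkpointer&lt;&#x2F;code&gt;. Older versions expose similar counters as &lt;code&gt;checkpoints_timed&lt;&#x2F;code&gt; and &lt;code&gt;checkpoints_req&lt;&#x2F;code&gt; in &lt;code&gt;pg_stat_bgwriter&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    num_timed,
    num_requested,
    -- Share of checkpoints that were requested rather than timed.
    round(100.0 * num_requested &#x2F; greatest(num_timed + num_requested, 1), 1) AS requested_pct,
    write_time,
    sync_time,
    stats_reset
FROM pg_stat_checkpointer;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Requested checkpoints also include manual &lt;code&gt;CHECKPOINT&lt;&#x2F;code&gt; commands and similar triggers, so read the percentage as a trend since &lt;code&gt;stats_reset&lt;&#x2F;code&gt; rather than as an exact measure of WAL pressure.&lt;&#x2F;p&gt;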
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-classic-warning-checkpoints-are-happening-too-often&quot;&gt;The classic warning: checkpoints are happening too often&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres can log warnings when checkpoints caused by WAL segment pressure happen too close together. The documentation notes that &lt;code&gt;checkpoint_warning&lt;&#x2F;code&gt; exists to log when checkpoints caused by WAL filling occur closer together than the configured threshold, suggesting &lt;code&gt;max_wal_size&lt;&#x2F;code&gt; may need to be increased. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.5. Write Ahead Log&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-wal.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A log message like this should not be ignored:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;checkpoints are occurring too frequently
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It does not automatically mean “increase &lt;code&gt;max_wal_size&lt;&#x2F;code&gt; and move on.”&lt;&#x2F;p&gt;
&lt;p&gt;It means:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;The workload is generating WAL fast enough
to force more checkpoint activity than expected.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The next question is workload-oriented:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What changed?
A migration?
A bulk update?
A new write-heavy endpoint?
A data import?
A queue retry storm?
A new index?
A replica or archive issue?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Changing a setting may be appropriate. But if the WAL spike came from a bad release or uncontrolled job, the real fix may be outside &lt;code&gt;postgresql.conf&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;measuring-wal-generation&quot;&gt;Measuring WAL generation&lt;&#x2F;h2&gt;
&lt;p&gt;A basic WAL snapshot:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;PostgreSQL’s cumulative statistics system exposes server activity through statistics views, including WAL-related and replication-related views. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 27.2. The Cumulative Statistics System&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;monitoring-stats.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;The most useful number is not only total WAL generated.&lt;&#x2F;p&gt;
&lt;p&gt;It is the rate.&lt;&#x2F;p&gt;
&lt;p&gt;You can sample WAL position:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT now(), pg_current_wal_lsn();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then sample again later and compute the difference between the two positions (the LSNs below are placeholders):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pg_size_pretty(
        pg_wal_lsn_diff(
            &amp;#39;0&#x2F;70000000&amp;#39;::pg_lsn,
            &amp;#39;0&#x2F;60000000&amp;#39;::pg_lsn
        )
    ) AS wal_generated;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In monitoring, this becomes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL bytes generated per second
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
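&lt;p&gt;For a quick manual estimate without a monitoring stack, both samples can be taken in one anonymous block. This is a sketch that assumes it runs on the primary; the 10-second window is an arbitrary choice:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DO $$
DECLARE
    start_lsn  pg_lsn := pg_current_wal_lsn();
    start_time timestamptz := clock_timestamp();
BEGIN
    -- Wait for the sampling window, then measure WAL written meanwhile.
    PERFORM pg_sleep(10);
    RAISE NOTICE &amp;#39;WAL rate: % per second&amp;#39;,
        pg_size_pretty(
            (pg_wal_lsn_diff(pg_current_wal_lsn(), start_lsn)
             &#x2F; extract(epoch FROM clock_timestamp() - start_time))::bigint
        );
END
$$;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A single window only describes that moment. Comparing a quiet period with a suspected burst is usually more informative than one number.&lt;&#x2F;p&gt;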
&lt;p&gt;Why this matters:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL must be written locally.
WAL may need to be archived.
WAL may need to be streamed to replicas.
WAL may be retained for replication slots.
WAL volume affects checkpoint pressure.
WAL volume affects recovery time.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A system can have acceptable query latency and still be heading toward a WAL-related incident.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;finding-wal-heavy-queries&quot;&gt;Finding WAL-heavy queries&lt;&#x2F;h2&gt;
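&lt;p&gt;Since PostgreSQL 13, &lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt; exposes per-statement WAL metrics (&lt;code&gt;wal_records&lt;&#x2F;code&gt;, &lt;code&gt;wal_fpi&lt;&#x2F;code&gt;, and &lt;code&gt;wal_bytes&lt;&#x2F;code&gt;), provided the extension is installed and loaded.&lt;&#x2F;p&gt;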
&lt;p&gt;A useful query shape:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    pg_size_pretty(wal_bytes) AS total_wal,
    pg_size_pretty((wal_bytes &#x2F; greatest(calls, 1))::numeric) AS wal_per_call,
    mean_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE wal_bytes &amp;gt; 0
ORDER BY wal_bytes DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This helps identify statements that generate large amounts of WAL.&lt;&#x2F;p&gt;
&lt;p&gt;Typical WAL-heavy operations include:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;large UPDATEs;
large DELETEs;
bulk INSERTs;
index creation;
table rewrites;
VACUUM FULL;
CLUSTER;
backfills;
high-churn queue updates;
touching indexed columns repeatedly.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The important distinction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A query can be acceptable from a latency perspective
and still dangerous from a WAL perspective.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, a backfill may run efficiently but generate enough WAL to delay replicas, overload archiving, and force frequent checkpoints.&lt;&#x2F;p&gt;
&lt;p&gt;That is a reliability problem, even if the SQL itself is “fast.”&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;full-page-images-and-wal-volume&quot;&gt;Full-page images and WAL volume&lt;&#x2F;h2&gt;
&lt;p&gt;After a checkpoint, the first modification to a data page may include a full-page image in WAL when &lt;code&gt;full_page_writes&lt;&#x2F;code&gt; is enabled.&lt;&#x2F;p&gt;
&lt;p&gt;Check:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW full_page_writes;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;full_page_writes&lt;&#x2F;code&gt; protects against torn pages after crashes. It can increase WAL volume, especially after checkpoints and during write-heavy workloads.&lt;&#x2F;p&gt;
&lt;p&gt;This creates an important interaction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Frequent checkpoints
        ↓
More pages modified for the first time after each checkpoint
        ↓
More full-page images
        ↓
More WAL generated
        ↓
More pressure on WAL, archiving, replication, and checkpoints
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is one reason overly frequent checkpoints can amplify IO pressure.&lt;&#x2F;p&gt;
&lt;p&gt;A dangerous conclusion would be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Full-page writes generate WAL, so disable them.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is usually the wrong instinct. This setting exists for crash safety.&lt;&#x2F;p&gt;
&lt;p&gt;A better conclusion:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;If full-page images are high,
understand checkpoint frequency, write patterns, and storage behavior.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
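&lt;p&gt;One hedged way to judge whether full-page images are high is to relate them to the total number of WAL records, using &lt;code&gt;pg_stat_wal&lt;&#x2F;code&gt; (available since PostgreSQL 14):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    wal_records,
    wal_fpi,
    -- Full-page images per 100 WAL records since stats_reset.
    round(100.0 * wal_fpi &#x2F; greatest(wal_records, 1), 2) AS fpi_per_100_records,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    stats_reset
FROM pg_stat_wal;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;There is no universal threshold. The ratio is most useful when compared before and after a change in checkpoint settings or workload.&lt;&#x2F;p&gt;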
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-compression&quot;&gt;WAL compression&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres supports WAL compression:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW wal_compression;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Enabling WAL compression can reduce WAL volume for some workloads, especially where full-page images dominate. But it may increase CPU usage.&lt;&#x2F;p&gt;
&lt;p&gt;This is a trade-off:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Less WAL volume
More CPU work
Potentially lower replication&#x2F;archive pressure
Potentially higher CPU pressure
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is not universally good or bad.&lt;&#x2F;p&gt;
&lt;p&gt;It should be evaluated against the actual bottleneck:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is the system WAL-volume bound?
Storage bound?
Network bound?
Archive bound?
Replica catch-up bound?
CPU bound?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A reliability mistake is tuning WAL without knowing which resource is constrained.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-and-replication-lag&quot;&gt;WAL and replication lag&lt;&#x2F;h2&gt;
&lt;p&gt;Replication depends on WAL movement.&lt;&#x2F;p&gt;
&lt;p&gt;A write-heavy event on the primary can become a replica incident:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Bulk update generates WAL
        ↓
Primary writes WAL locally
        ↓
WAL is streamed to standby
        ↓
Standby writes, flushes, replays WAL
        ↓
Replica falls behind
        ↓
Read traffic sees stale data
        ↓
Failover safety decreases
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Primary-side check:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    state,
    sync_state,
    -- Lag in bytes, computed from LSN positions.
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    -- Lag as time intervals, reported by the view itself.
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
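&lt;p&gt;Standby-side check. Note that &lt;code&gt;replay_delay&lt;&#x2F;code&gt; also grows while the primary is idle, because there are no new transactions to replay, so compare it with the LSN values before treating it as lag:&lt;&#x2F;p&gt;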
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The WAL question during an incident is not only:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How much WAL did we generate?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Can every downstream system consume it fast enough?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That includes standbys, archives, logical replication consumers, backup systems, and change-data-capture pipelines.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-archiving-and-backup-risk&quot;&gt;WAL archiving and backup risk&lt;&#x2F;h2&gt;
&lt;p&gt;WAL is also central to point-in-time recovery.&lt;&#x2F;p&gt;
&lt;p&gt;If WAL archiving fails, backups may no longer support the recovery objectives the team believes they have.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres continuous archiving relies on saving WAL files so that the database can be restored by replaying WAL from a base backup to a desired point in time. (&lt;a rel=&quot;external&quot; title=&quot;25.3. Continuous Archiving and Point-in-Time Recovery ...&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;continuous-archiving.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A common failure chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Archive command starts failing] --&amp;gt; B[WAL files accumulate]
    B --&amp;gt; C[pg_wal grows]
    C --&amp;gt; D[Disk fills]
    D --&amp;gt; E([Primary becomes unstable or stops])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Check archiver status:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time,
    stats_reset
FROM pg_stat_archiver;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This view should be part of production monitoring when archiving is enabled.&lt;&#x2F;p&gt;
&lt;p&gt;A healthy primary with broken archiving is not healthy from a disaster recovery perspective.&lt;&#x2F;p&gt;
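&lt;p&gt;As a sketch of what such monitoring might look like, the query below derives two alertable signals from the same view: whether the most recent archive attempt failed, and how many segments are still waiting to be archived. It assumes PostgreSQL 12 or later for &lt;code&gt;pg_ls_archive_statusdir()&lt;&#x2F;code&gt; and a role that is allowed to call it; the column aliases are illustrative:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    failed_count,
    last_failed_wal,
    last_failed_time,
    last_archived_time,
    -- True when the latest failure is newer than the latest success.
    (last_failed_time IS NOT NULL
     AND (last_archived_time IS NULL
          OR last_failed_time &amp;gt; last_archived_time)) AS currently_failing,
    now() - last_archived_time AS time_since_last_archive,
    -- Segments marked .ready have not been archived yet.
    (SELECT count(*)
     FROM pg_ls_archive_statusdir()
     WHERE name LIKE &amp;#39;%.ready&amp;#39;) AS segments_waiting
FROM pg_stat_archiver;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On a quiet system &lt;code&gt;time_since_last_archive&lt;&#x2F;code&gt; can be large without anything being wrong, so it is usually read together with &lt;code&gt;segments_waiting&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;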
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-retention-and-replication-slots&quot;&gt;WAL retention and replication slots&lt;&#x2F;h2&gt;
&lt;p&gt;Replication slots can retain WAL required by a replica or logical consumer.&lt;&#x2F;p&gt;
&lt;p&gt;That is useful.&lt;&#x2F;p&gt;
&lt;p&gt;It is also dangerous.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    confirmed_flush_lsn,
    wal_status,
    safe_wal_size,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
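&lt;p&gt;When a consumer disconnects, its slot becomes inactive but stays in place, and it keeps forcing the primary to retain WAL from its &lt;code&gt;restart_lsn&lt;&#x2F;code&gt; onward.&lt;&#x2F;p&gt;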
&lt;p&gt;The incident can look like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Logical consumer stops
        ↓
Replication slot remains
        ↓
Primary keeps old WAL
        ↓
Disk usage grows
        ↓
Emergency cleanup decision required
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The dangerous command:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_drop_replication_slot(&amp;#39;slot_name&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can be correct if the slot is abandoned. It can also break a consumer that still needs the WAL.&lt;&#x2F;p&gt;
&lt;p&gt;WAL retention is not just a database metric.&lt;&#x2F;p&gt;
&lt;p&gt;It is ownership information:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Who owns this slot?
What system consumes it?
How far behind is it allowed to get?
What alert fires?
What is the reinitialization procedure?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
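&lt;p&gt;One possible guardrail, sketched here under the assumption of PostgreSQL 13 or later, is to cap slot retention with &lt;code&gt;max_slot_wal_keep_size&lt;&#x2F;code&gt; and to watch slots that are not currently streaming. The &lt;code&gt;50GB&lt;&#x2F;code&gt; value is a placeholder, not a recommendation:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Cap how much WAL any slot may retain (placeholder value; size it to your disk headroom).
ALTER SYSTEM SET max_slot_wal_keep_size = &amp;#39;50GB&amp;#39;;
SELECT pg_reload_conf();

-- Slots with no connected consumer, and how much more WAL can be written
-- before each one is at risk of losing the WAL it needs.
SELECT
    slot_name,
    slot_type,
    wal_status,
    pg_size_pretty(safe_wal_size) AS safe_wal_size
FROM pg_replication_slots
WHERE NOT active;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The cap trades one failure for another: the primary disk is protected, but a slot that falls too far behind is invalidated (&lt;code&gt;wal_status&lt;&#x2F;code&gt; becomes &lt;code&gt;lost&lt;&#x2F;code&gt;) and its consumer has to be reinitialized. That is why the ownership questions above still need answers.&lt;&#x2F;p&gt;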
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;monitoring-checkpoint-behavior&quot;&gt;Monitoring checkpoint behavior&lt;&#x2F;h2&gt;
&lt;p&gt;On PostgreSQL 17 and later, checkpoint-related statistics are exposed separately through &lt;code&gt;pg_stat_checkpointer&lt;&#x2F;code&gt;; on PostgreSQL 16 and earlier, similar counters are found in &lt;code&gt;pg_stat_bgwriter&lt;&#x2F;code&gt;. The exact view and column names vary by version, so monitoring queries should match the Postgres version you operate. PostgreSQL’s monitoring documentation describes these cumulative statistics views as the place to inspect server activity. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 27.2. The Cumulative Statistics System&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;monitoring-stats.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;For PostgreSQL 17 and later (the &lt;code&gt;num_done&lt;&#x2F;code&gt; column was added in PostgreSQL 18, so omit it on 17):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    num_timed,
    num_requested,
    num_done,
    write_time,
    sync_time,
    buffers_written,
    stats_reset
FROM pg_stat_checkpointer;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For PostgreSQL 16 and earlier:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time,
    buffers_checkpoint,
    buffers_backend,
    buffers_backend_fsync,
    stats_reset
FROM pg_stat_bgwriter;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The operational interpretation:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Many requested checkpoints:
WAL volume may be forcing checkpoints.

High checkpoint write&#x2F;sync time:
storage may be struggling with checkpoint work.

High backend buffer writes:
foreground sessions may be doing writes themselves,
which can increase user-visible latency.

Frequent checkpoint warnings:
checkpoint&#x2F;WAL sizing may not match workload.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not to obsess over one counter.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to detect whether checkpoint work is smooth and predictable or bursty and user-visible.&lt;&#x2F;p&gt;
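&lt;p&gt;A small derived view of the same counters can make the first two signals easier to read. This is a sketch for PostgreSQL 17 and later; the aliases are illustrative, and because &lt;code&gt;num_timed&lt;&#x2F;code&gt; can include scheduled checkpoints that were skipped on an idle server, the percentage is approximate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    num_timed,
    num_requested,
    -- Share of checkpoints that were requested (for example by WAL volume) rather than timed.
    round(100.0 * num_requested &#x2F; nullif(num_timed + num_requested, 0), 1) AS pct_requested,
    -- write_time and sync_time are reported in milliseconds.
    round((write_time &#x2F; 1000.0)::numeric, 1) AS write_seconds,
    round((sync_time &#x2F; 1000.0)::numeric, 1) AS sync_seconds,
    stats_reset
FROM pg_stat_checkpointer;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because the counters are cumulative, the trend between two samples says more than the absolute values.&lt;&#x2F;p&gt;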
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;logging-checkpoints&quot;&gt;Logging checkpoints&lt;&#x2F;h2&gt;
&lt;p&gt;Check whether checkpoint logging is enabled (it is on by default since PostgreSQL 15):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW log_checkpoints;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To enable:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER SYSTEM SET log_checkpoints = on;
SELECT pg_reload_conf();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Checkpoint logs can show:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;when checkpoints start and finish;
how much was written;
how long writing took;
how long syncing took;
whether checkpoints are requested or timed;
whether the system is checkpointing too often.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is useful during investigation because checkpoint problems are often temporal.&lt;&#x2F;p&gt;
&lt;p&gt;A graph may show latency spikes every few minutes.&lt;&#x2F;p&gt;
&lt;p&gt;Checkpoint logs can confirm whether those spikes correlate with checkpoint activity.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-and-disk-full-incidents&quot;&gt;WAL and disk-full incidents&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;pg_wal&lt;&#x2F;code&gt; filling the disk is one of the most direct WAL-related outages.&lt;&#x2F;p&gt;
&lt;p&gt;Possible causes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;archive failures;
replication slot retention;
replica disconnected;
logical replication consumer stopped;
long base backup;
too much WAL generated too quickly;
WAL volume exceeding max_wal_size, which is only a soft limit;
storage capacity too low;
unexpected bulk operation.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A useful filesystem-level check:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;bash&quot;&gt;du -sh &amp;quot;$PGDATA&#x2F;pg_wal&amp;quot;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;From SQL, you can inspect the WAL directory if your role is a superuser or has the privileges of &lt;code&gt;pg_monitor&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    count(*) AS wal_files,
    pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Disk-full incidents are dangerous because Postgres may be unable to continue writing WAL.&lt;&#x2F;p&gt;
&lt;p&gt;At that point, this is not a tuning issue. It is an availability incident.&lt;&#x2F;p&gt;
&lt;p&gt;The immediate question becomes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Why is WAL being retained or generated faster than expected,
and what can be safely removed, advanced, paused, or fixed?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Manually deleting files from &lt;code&gt;pg_wal&lt;&#x2F;code&gt; is never a safe routine operation. It can remove WAL that crash recovery, standbys, or backups still need, and it can break the cluster.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-heavy-migrations&quot;&gt;WAL-heavy migrations&lt;&#x2F;h2&gt;
&lt;p&gt;Some migrations generate much more WAL than teams expect.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE users
SET normalized_email = lower(email);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DELETE FROM events
WHERE created_at &amp;lt; now() - interval &amp;#39;180 days&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_events_tenant_created
ON events (tenant_id, created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD COLUMN total_cents bigint DEFAULT 0;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
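&lt;p&gt;Depending on version, table structure, defaults, and operation type, schema changes may be metadata-only or may rewrite substantial data. For example, since PostgreSQL 11, adding a column with a constant default is metadata-only, while a volatile default still rewrites the table. Large updates and deletes can generate WAL, create dead tuples, pressure autovacuum, and increase replication lag.&lt;&#x2F;p&gt;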
&lt;p&gt;A safer operational pattern for backfills:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;process in small batches;
sleep between batches;
measure WAL rate;
watch replication lag;
watch archive status;
watch checkpoint frequency;
keep transactions short;
make progress resumable;
stop quickly if pressure rises.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Example batch shape:&lt;&#x2F;p&gt;
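&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Pick the next slice of rows that still need the backfill.
WITH batch AS (
    SELECT id
    FROM users
    WHERE normalized_email IS NULL
      AND email IS NOT NULL  -- rows with no email would stay NULL and be selected again forever
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;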
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The exact batch size is workload-specific.&lt;&#x2F;p&gt;
&lt;p&gt;The reliability principle is stable:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A migration should have a pressure budget,
not just a correctness test.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
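&lt;p&gt;To make the batching pattern concrete, here is a minimal sketch of a driver loop for the &lt;code&gt;users&lt;&#x2F;code&gt; backfill. It assumes PostgreSQL 11 or later (transaction control inside &lt;code&gt;DO&lt;&#x2F;code&gt; blocks) and that the block is not run inside an outer transaction. The batch size and sleep interval are placeholders, and a real migration would also check replication lag and archive status between batches:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DO $$
DECLARE
    rows_updated bigint;
BEGIN
    LOOP
        WITH batch AS (
            SELECT id
            FROM users
            WHERE normalized_email IS NULL
              AND email IS NOT NULL
            ORDER BY id
            LIMIT 1000              -- placeholder batch size
        )
        UPDATE users u
        SET normalized_email = lower(u.email)
        FROM batch
        WHERE u.id = batch.id;

        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;  -- nothing left to backfill

        COMMIT;                      -- each batch is its own short transaction
        PERFORM pg_sleep(0.5);       -- placeholder pause so downstream consumers can absorb the WAL
    END LOOP;
END
$$;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Because progress is recorded in the data itself (&lt;code&gt;normalized_email IS NULL&lt;&#x2F;code&gt;), the loop can be cancelled at any point and started again without losing committed batches.&lt;&#x2F;p&gt;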
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-fast-on-staging-is-not-enough&quot;&gt;Why “fast on staging” is not enough&lt;&#x2F;h2&gt;
&lt;p&gt;WAL behavior depends on production realities:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;table size;
index count;
row width;
update pattern;
checkpoint timing;
full-page writes;
storage latency;
replica speed;
archive bandwidth;
logical consumers;
autovacuum state;
concurrent workload.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A staging database with small tables and no replicas cannot reveal the true WAL cost of a production backfill.&lt;&#x2F;p&gt;
&lt;p&gt;A migration may pass every functional test and still be operationally unsafe.&lt;&#x2F;p&gt;
&lt;p&gt;The better pre-flight question:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How much WAL will this generate,
and what systems must absorb that WAL?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That question changes how teams design migrations.&lt;&#x2F;p&gt;
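&lt;p&gt;One way to get a first estimate is to measure the WAL written while a single representative batch runs. The sketch below assumes it is run from &lt;code&gt;psql&lt;&#x2F;code&gt;, since &lt;code&gt;\gset&lt;&#x2F;code&gt; is a psql meta-command; the variable name &lt;code&gt;start_lsn&lt;&#x2F;code&gt; is arbitrary. The result counts all WAL generated by the cluster in that window, including concurrent workload:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- psql only: \gset stores the result column in the psql variable start_lsn.
SELECT pg_current_wal_lsn() AS start_lsn \gset

-- ... run one representative batch of the migration here ...

-- WAL written between the two points, cluster-wide.
SELECT pg_size_pretty(
    pg_wal_lsn_diff(pg_current_wal_lsn(), :&amp;#39;start_lsn&amp;#39;::pg_lsn)
) AS wal_generated;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Multiplying by the expected number of batches gives an order-of-magnitude figure. Batches that run right after a checkpoint tend to cost more because of full-page images, so a single sample can understate or overstate the total.&lt;&#x2F;p&gt;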
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;crash-recovery-time-is-part-of-reliability&quot;&gt;Crash recovery time is part of reliability&lt;&#x2F;h2&gt;
&lt;p&gt;Checkpoints influence crash recovery.&lt;&#x2F;p&gt;
&lt;p&gt;If checkpoints are very far apart, there may be more WAL to replay after a crash. If checkpoints are too frequent, normal operation may suffer from excessive checkpoint IO.&lt;&#x2F;p&gt;
&lt;p&gt;This is a trade-off:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Less frequent checkpoints:
potentially smoother normal operation,
more WAL to replay after crash.

More frequent checkpoints:
less WAL to replay,
more frequent checkpoint IO,
potentially more full-page image WAL.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The right balance depends on recovery objectives, write workload, storage capacity, and latency requirements.&lt;&#x2F;p&gt;
&lt;p&gt;A database that is fast during normal operation but takes too long to recover may not satisfy the business reliability target.&lt;&#x2F;p&gt;
&lt;p&gt;A database that checkpoints too aggressively may create latency incidents during normal traffic.&lt;&#x2F;p&gt;
&lt;p&gt;Reliability is the balance, not one extreme.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-practical-wal-and-checkpoint-health-snapshot&quot;&gt;A practical WAL and checkpoint health snapshot&lt;&#x2F;h2&gt;
&lt;p&gt;This is not a complete runbook, but it is a useful investigation snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;WAL settings:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    name,
    setting,
    unit,
    context
FROM pg_settings
WHERE name IN (
    &amp;#39;wal_level&amp;#39;,
    &amp;#39;synchronous_commit&amp;#39;,
    &amp;#39;full_page_writes&amp;#39;,
    &amp;#39;wal_compression&amp;#39;,
    &amp;#39;checkpoint_timeout&amp;#39;,
    &amp;#39;checkpoint_completion_target&amp;#39;,
    &amp;#39;checkpoint_warning&amp;#39;,
    &amp;#39;max_wal_size&amp;#39;,
    &amp;#39;min_wal_size&amp;#39;,
    &amp;#39;archive_mode&amp;#39;,
    &amp;#39;archive_command&amp;#39;,
    &amp;#39;max_slot_wal_keep_size&amp;#39;
)
ORDER BY name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;WAL generation:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;WAL directory size:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    count(*) AS wal_files,
    pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Archiving:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication lag:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication slots:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Checkpoint stats, PostgreSQL 17 and later (omit &lt;code&gt;num_done&lt;&#x2F;code&gt; before PostgreSQL 18):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    num_timed,
    num_requested,
    num_done,
    write_time,
    sync_time,
    buffers_written,
    stats_reset
FROM pg_stat_checkpointer;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Checkpoint stats, PostgreSQL 16 and earlier:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    checkpoints_timed,
    checkpoints_req,
    checkpoint_write_time,
    checkpoint_sync_time,
    buffers_checkpoint,
    buffers_backend,
    buffers_backend_fsync,
    stats_reset
FROM pg_stat_bgwriter;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;WAL-heavy statements:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    pg_size_pretty(wal_bytes) AS total_wal,
    pg_size_pretty((wal_bytes &#x2F; greatest(calls, 1))::numeric) AS wal_per_call,
    mean_exec_time,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE wal_bytes &amp;gt; 0
ORDER BY wal_bytes DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The purpose of this snapshot is to connect symptoms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;high WAL generation;
frequent checkpoints;
archive failures;
replica lag;
slot retention;
storage growth;
write latency;
migration activity.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A WAL incident is rarely visible through one metric alone.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;common-anti-patterns&quot;&gt;Common anti-patterns&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;treating-wal-as-a-storage-nuisance&quot;&gt;Treating WAL as a storage nuisance&lt;&#x2F;h3&gt;
&lt;p&gt;&lt;code&gt;pg_wal&lt;&#x2F;code&gt; is not garbage. It is required for crash recovery, replication, and backups.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;increasing-max-wal-size-without-understanding-the-workload&quot;&gt;Increasing &lt;code&gt;max_wal_size&lt;&#x2F;code&gt; without understanding the workload&lt;&#x2F;h3&gt;
&lt;p&gt;This may reduce checkpoint frequency, but it does not explain why WAL generation changed.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ignoring-archiver-failures&quot;&gt;Ignoring archiver failures&lt;&#x2F;h3&gt;
&lt;p&gt;A database can keep serving traffic while silently losing point-in-time recovery capability.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;letting-replication-slots-have-no-owner&quot;&gt;Letting replication slots have no owner&lt;&#x2F;h3&gt;
&lt;p&gt;An abandoned slot can retain WAL until the primary disk is in danger.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;running-large-backfills-without-a-wal-budget&quot;&gt;Running large backfills without a WAL budget&lt;&#x2F;h3&gt;
&lt;p&gt;A backfill should be planned around WAL rate, replica lag, archive capacity, and checkpoint pressure.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;using-staging-to-estimate-production-wal-cost&quot;&gt;Using staging to estimate production WAL cost&lt;&#x2F;h3&gt;
&lt;p&gt;Small data, fewer indexes, and missing replicas make staging a poor predictor of WAL impact.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;manually-deleting-wal-files&quot;&gt;Manually deleting WAL files&lt;&#x2F;h3&gt;
&lt;p&gt;This is not a safe incident response pattern. It can destroy recovery guarantees.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-wal-and-checkpoint-incidents-are-good-simulation-material&quot;&gt;Why WAL and checkpoint incidents are good simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;WAL&#x2F;checkpoint incidents are excellent for training because the symptoms are distributed across the system.&lt;&#x2F;p&gt;
&lt;p&gt;The application may show write latency.
The database may show frequent checkpoints.
The replica may show lag.
The backup system may show archive failures.
The disk may show &lt;code&gt;pg_wal&lt;&#x2F;code&gt; growth.
The migration system may show a “successful” backfill.
The team may be tempted to change settings without understanding the pressure chain.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic simulation can force decisions like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is the primary overloaded by WAL writes or normal query IO?
Is checkpoint activity causing latency spikes?
Is a bulk operation generating too much WAL?
Is the replica behind because it is slow or because the primary is producing too much WAL?
Is archiving broken or merely delayed?
Is a replication slot safe to drop?
Should the team pause a migration, throttle a job, increase WAL capacity, tune checkpoints, or protect user traffic first?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is not about memorizing &lt;code&gt;pg_stat_wal&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;It is about understanding the system consequences of writes.&lt;&#x2F;p&gt;
&lt;p&gt;Articles can explain WAL mechanics.
Dashboards can expose WAL rates.
Simulations teach teams how WAL pressure changes operational decisions.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;WAL and checkpoints are invisible when healthy and impossible to ignore when they fail.&lt;&#x2F;p&gt;
&lt;p&gt;WAL protects durability and enables crash recovery, replication, archiving, and point-in-time recovery. Checkpoints bound recovery work and move dirty data pages to disk. Together, they form the storage reliability backbone of Postgres.&lt;&#x2F;p&gt;
&lt;p&gt;But that backbone has operational limits.&lt;&#x2F;p&gt;
&lt;p&gt;Write-heavy workloads generate WAL.
WAL must be written, archived, streamed, retained, and replayed.
Checkpoints must flush dirty data.
Storage must absorb bursts.
Replicas and backup systems must keep up.
Operators must understand when a “database slowdown” is really WAL pressure.&lt;&#x2F;p&gt;
&lt;p&gt;The dangerous phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;It is just WAL.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better reliability question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What is generating this WAL, what systems must consume it, and what happens if they cannot keep up?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That question turns WAL and checkpoints from internal Postgres machinery into practical production reliability signals.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Postgres locks: how one ALTER TABLE can stop your product</title>
        <published>2026-04-25T00:00:00+00:00</published>
        <updated>2026-04-25T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/postgres-locks-alter-table/"/>
        <id>https://rillence.com/notes/postgres-locks-alter-table/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/postgres-locks-alter-table/">&lt;p&gt;Postgres locks are not a bug.&lt;&#x2F;p&gt;
&lt;p&gt;They are one of the reasons Postgres can safely protect your data while many users, services, jobs, migrations, and background processes are touching the same database at the same time.&lt;&#x2F;p&gt;
&lt;p&gt;The problem is that locks are often invisible until they are not.&lt;&#x2F;p&gt;
&lt;p&gt;A migration that looked harmless in staging can freeze production traffic.
A long-running transaction can block a schema change.
A background job can hold a lock longer than expected.
A single &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; can create a queue of blocked queries behind it.&lt;&#x2F;p&gt;
&lt;p&gt;From the outside, this often looks like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;The application is slow.
Requests are timing out.
Postgres has many active connections.
CPU is not necessarily high.
The database “looks stuck”.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But Postgres may not be stuck at all. It may be doing exactly what it was designed to do: preserving consistency.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-dangerous-misunderstanding&quot;&gt;The dangerous misunderstanding&lt;&#x2F;h2&gt;
&lt;p&gt;Many teams think about locks only when they explicitly run something like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;LOCK TABLE users;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But most Postgres locks are not written manually. They are acquired automatically by normal SQL operations.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT * FROM orders WHERE id = 42;
UPDATE orders SET status = &amp;#39;paid&amp;#39; WHERE id = 42;
ALTER TABLE orders ADD COLUMN processed_at timestamptz;
CREATE INDEX orders_created_at_idx ON orders(created_at);
DELETE FROM sessions WHERE expires_at &amp;lt; now();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All of these can involve locks.&lt;&#x2F;p&gt;
&lt;p&gt;Usually, that is fine. Most locks are short-lived and harmless. The incident starts when a lock is held longer than expected, or when a lock request waits behind another transaction while new queries pile up behind it.&lt;&#x2F;p&gt;
&lt;p&gt;This is the part that surprises people: the most damaging session is not always the one using the most CPU. Sometimes it is just waiting.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-simple-lock-queue-scenario&quot;&gt;A simple lock queue scenario&lt;&#x2F;h2&gt;
&lt;p&gt;Imagine a busy table:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE TABLE accounts (
    id bigint PRIMARY KEY,
    email text NOT NULL,
    status text NOT NULL
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The application constantly runs queries like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM accounts
WHERE id = $1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now a migration starts:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
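&lt;p&gt;Depending on the operation and Postgres version, this may be fast. But it still needs an &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; lock on the table, which conflicts with every other lock mode, including the one taken by a plain &lt;code&gt;SELECT&lt;&#x2F;code&gt;. If another transaction already holds any lock on the table, the &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; waits.&lt;&#x2F;p&gt;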
&lt;p&gt;That sounds safe: the migration is waiting, not blocking, right?&lt;&#x2F;p&gt;
&lt;p&gt;Not quite.&lt;&#x2F;p&gt;
&lt;p&gt;Once the &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; is waiting for a strong lock, later application queries may queue behind it. The result can look like the whole table is frozen.&lt;&#x2F;p&gt;
&lt;p&gt;A simplified chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Long transaction touches accounts] --&amp;gt; B[ALTER TABLE waits for lock]
    B --&amp;gt; C[New application queries arrive]
    C --&amp;gt; D[They queue behind the pending ALTER TABLE]
    D --&amp;gt; E[Connection pool fills]
    E --&amp;gt; F[Requests time out]
    F --&amp;gt; G([Incident])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The migration may not be consuming CPU. It may not be doing heavy IO. It may simply be waiting.&lt;&#x2F;p&gt;
&lt;p&gt;But its position in the lock queue can still damage production traffic.&lt;&#x2F;p&gt;
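&lt;p&gt;A common mitigation is to bound how long the migration is allowed to wait, so that it fails fast instead of holding a place in the queue. A minimal sketch, reusing the &lt;code&gt;accounts&lt;&#x2F;code&gt; example; the &lt;code&gt;3s&lt;&#x2F;code&gt; value is a placeholder:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;
-- Give up if the lock cannot be acquired quickly (placeholder value).
-- SET LOCAL limits the setting to this transaction.
SET LOCAL lock_timeout = &amp;#39;3s&amp;#39;;

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the timeout fires, the &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; fails with an error, the transaction is rolled back, and the queries waiting behind it continue. The migration can then be retried later, ideally after the long transaction has been found.&lt;&#x2F;p&gt;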
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;locks-are-about-compatibility&quot;&gt;Locks are about compatibility&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres has different lock modes. They are not all equal.&lt;&#x2F;p&gt;
&lt;p&gt;A normal &lt;code&gt;SELECT&lt;&#x2F;code&gt; does not block another normal &lt;code&gt;SELECT&lt;&#x2F;code&gt;. Many operations can safely happen together. The problem appears when two operations require incompatible locks.&lt;&#x2F;p&gt;
&lt;p&gt;You do not need to memorize the entire lock matrix to respond well to incidents, but you do need the mental model:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Weak locks allow many operations to continue.
Strong locks conflict with more operations.
Some schema changes require very strong locks.
A waiting strong lock can cause later queries to queue.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For product engineers, the important lesson is this:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“This query is fast locally” does not mean “this operation is operationally safe in production.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Lock behavior depends on concurrency, transaction duration, table size, workload, and timing.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-classic-villain-idle-in-transaction&quot;&gt;The classic villain: &lt;code&gt;idle in transaction&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;One of the most common lock-related problems is not a dramatic query. It is a transaction that started, did some work, and then remained open.&lt;&#x2F;p&gt;
&lt;p&gt;For example, application code does something like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application waits on network, external API, user input, or crashes before COMMIT
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;From the database side, the session may become:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;idle in transaction
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That means it is not actively running a query, but the transaction is still open.&lt;&#x2F;p&gt;
&lt;p&gt;You can find old transactions with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And specifically:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE state = &amp;#39;idle in transaction&amp;#39;
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; session can be harmful because it may:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;hold locks;
prevent vacuum cleanup;
keep old row versions visible;
interfere with migrations;
increase table and index bloat over time;
confuse incident responders because it looks inactive.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The session is “idle”, but the transaction is not harmless.&lt;&#x2F;p&gt;
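&lt;p&gt;One guardrail for this failure mode is the &lt;code&gt;idle_in_transaction_session_timeout&lt;&#x2F;code&gt; setting, which makes Postgres terminate a session that stays idle inside an open transaction for longer than the limit. A minimal sketch, assuming a hypothetical application role named &lt;code&gt;app_user&lt;&#x2F;code&gt; and an illustrative limit of 60 seconds:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- app_user and the 60s value are assumptions for illustration.
-- The setting applies to new sessions opened by this role.
ALTER ROLE app_user
SET idle_in_transaction_session_timeout = &amp;#39;60s&amp;#39;;

-- Check the value in the current session.
SHOW idle_in_transaction_session_timeout;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The application has to tolerate the terminated connection, so the limit is best chosen together with the team that owns the code.&lt;&#x2F;p&gt;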
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-safer-way-to-inspect-blockers&quot;&gt;A safer way to inspect blockers&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres 9.6 and later provide a very useful function:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;pg_blocking_pids(pid)
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can use it to see which sessions are blocking others:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocked.application_name AS blocked_app,
    blocked.state AS blocked_state,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is often easier and safer than writing a large manual join over &lt;code&gt;pg_locks&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The result can tell you:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Who is blocked?
Who is blocking them?
How long has each query been running?
Which application opened the session?
Is the blocker active or idle in transaction?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But this query is not a complete incident response plan. It only answers one question: “Who is blocking whom?”&lt;&#x2F;p&gt;
&lt;p&gt;The harder question is: “What is the safest action now?”&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-killing-the-blocker-is-not-always-the-right-move&quot;&gt;Why killing the blocker is not always the right move&lt;&#x2F;h2&gt;
&lt;p&gt;When you find a blocking session, the tempting move is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_terminate_backend(&amp;lt;pid&amp;gt;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That can be the correct action in some incidents. But it is dangerous as a reflex.&lt;&#x2F;p&gt;
&lt;p&gt;There are two related functions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_cancel_backend(&amp;lt;pid&amp;gt;);
SELECT pg_terminate_backend(&amp;lt;pid&amp;gt;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The difference matters.&lt;&#x2F;p&gt;
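&lt;p&gt;&lt;code&gt;pg_cancel_backend&lt;&#x2F;code&gt; asks Postgres to cancel the current query. The connection stays alive. If the session is &lt;code&gt;idle in transaction&lt;&#x2F;code&gt;, there is no running query to cancel, so the transaction and its locks remain.&lt;&#x2F;p&gt;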
&lt;p&gt;&lt;code&gt;pg_terminate_backend&lt;&#x2F;code&gt; terminates the whole backend connection. If it is inside a transaction, the transaction is rolled back.&lt;&#x2F;p&gt;
&lt;p&gt;That rollback can itself be expensive. It can also trigger application retries, break a migration, or cause a thundering herd of reconnects.&lt;&#x2F;p&gt;
&lt;p&gt;A better incident question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this blocker safe to cancel?
Is it part of a migration?
Is it user traffic?
Is it a background job?
Is it already rolling back?
Will the application retry immediately?
Will killing it unblock the critical path or create more load?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The existence of a blocker tells you where pressure is accumulating. It does not automatically tell you what to kill.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;schema-migrations-and-lock-risk&quot;&gt;Schema migrations and lock risk&lt;&#x2F;h2&gt;
&lt;p&gt;Schema migrations deserve special respect in Postgres.&lt;&#x2F;p&gt;
&lt;p&gt;Consider:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ADD COLUMN last_seen_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
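&lt;p&gt;This can be quick, because adding a nullable column with no default does not rewrite the table. It still needs an &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; lock, though, so “quick” is not the same as “risk-free”.&lt;&#x2F;p&gt;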
&lt;p&gt;Now consider:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX orders_created_at_idx
ON orders(created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Some operations scan data. Some require stronger locks. Some block writes. Some interact badly with long transactions. Some are safe on small tables and dangerous on large ones.&lt;&#x2F;p&gt;
&lt;p&gt;For indexes, the production-safe form is often:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY orders_created_at_idx
ON orders(created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But &lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; is not magic. It reduces blocking, but it can still:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;take a long time;
consume CPU and IO;
fail and leave an invalid index;
conflict with other schema changes;
increase load during an already sensitive period.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can check invalid indexes with:&lt;&#x2F;p&gt;
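&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    schemaname,
    tablename,
    indexname
FROM pg_indexes
-- Match on schema and name: index names are only unique within a schema.
WHERE (schemaname, indexname) IN (
    SELECT n.nspname, c.relname
    FROM pg_index ix
    JOIN pg_class c ON c.oid = ix.indexrelid
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE NOT ix.indisvalid
);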
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A cleaner version that reads the catalog tables directly and also reports indexes that are not yet ready:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    i.relname AS index_name,
    ix.indisvalid,
    ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
   OR ix.indisready = false;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is useful after a failed concurrent index build.&lt;&#x2F;p&gt;
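&lt;p&gt;An invalid index is ignored by queries but is still maintained on every write, so it is worth cleaning up. A hedged sketch of one common cleanup, reusing the index name from the earlier example; both statements have to run outside a transaction block:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Remove the leftover invalid index without blocking writes.
DROP INDEX CONCURRENTLY IF EXISTS orders_created_at_idx;

-- Retry the build once the cause of the failure is understood.
CREATE INDEX CONCURRENTLY orders_created_at_idx
ON orders(created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;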
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;use-timeouts-as-guardrails&quot;&gt;Use timeouts as guardrails&lt;&#x2F;h2&gt;
&lt;p&gt;One of the simplest ways to reduce lock-related blast radius is to use timeouts during migrations.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SET lock_timeout = &amp;#39;2s&amp;#39;;
SET statement_timeout = &amp;#39;5min&amp;#39;;

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;lock_timeout&lt;&#x2F;code&gt; means: do not wait forever to acquire a lock. &lt;code&gt;statement_timeout&lt;&#x2F;code&gt; caps the total run time of each statement.&lt;&#x2F;p&gt;
&lt;p&gt;This is valuable because the worst migration is often not the one that fails. It is the one that waits silently and causes application traffic to queue behind it.&lt;&#x2F;p&gt;
&lt;p&gt;A common migration pattern is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;
SET LOCAL lock_timeout = &amp;#39;2s&amp;#39;;
SET LOCAL statement_timeout = &amp;#39;5min&amp;#39;;

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;

COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;However, be careful with commands like &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt;: they cannot run inside a normal transaction block, so &lt;code&gt;SET LOCAL&lt;&#x2F;code&gt; does not apply.&lt;&#x2F;p&gt;
&lt;p&gt;Set the timeouts at the session level instead:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SET lock_timeout = &amp;#39;2s&amp;#39;;
SET statement_timeout = &amp;#39;30min&amp;#39;;

CREATE INDEX CONCURRENTLY idx_accounts_email
ON accounts(email);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Timeouts do not make a migration safe by themselves. They are guardrails. They help a risky operation fail early instead of becoming an incident.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;detecting-lock-pressure-before-users-notice&quot;&gt;Detecting lock pressure before users notice&lt;&#x2F;h2&gt;
&lt;p&gt;During normal operation, you can inspect waiting sessions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = &amp;#39;Lock&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You can also summarize wait events:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This helps separate lock waits from other kinds of waits.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Lock waits suggest contention.
IO waits suggest disk or storage pressure.
Client waits may indicate application behavior.
LWLock waits may indicate internal contention.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But again, this is not enough by itself. You still need context:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Did a migration just start?
Did a deploy just happen?
Is a background job running?
Did traffic increase?
Are blocked sessions all from one service?
Are blockers idle in transaction?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Locks become understandable only when connected to system events.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;row-locks-can-also-cause-incidents&quot;&gt;Row locks can also cause incidents&lt;&#x2F;h2&gt;
&lt;p&gt;Not all dangerous locks are table-level migration locks.&lt;&#x2F;p&gt;
&lt;p&gt;Application-level transactions can block each other on rows.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 1;

-- transaction remains open
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Another transaction tries:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE accounts
SET balance = balance + 100
WHERE id = 1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The second transaction waits.&lt;&#x2F;p&gt;
&lt;p&gt;This is normal. But if the first transaction waits on an external API before committing, you have created database contention from application behavior.&lt;&#x2F;p&gt;
&lt;p&gt;A dangerous pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;BEGIN
  update database row
  call external service
  wait for response
  update another row
COMMIT
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A safer pattern is often:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;do external work before opening the transaction;
keep the transaction small;
avoid user&#x2F;network waits inside transactions;
commit quickly;
make retry behavior explicit.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres can handle concurrency well, but it cannot make long business transactions short.&lt;&#x2F;p&gt;
&lt;p&gt;That is an application architecture problem, not just a database problem.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;select-for-update-is-powerful-and-dangerous&quot;&gt;&lt;code&gt;SELECT ... FOR UPDATE&lt;&#x2F;code&gt; is powerful and dangerous&lt;&#x2F;h2&gt;
&lt;p&gt;Many systems use row-level locking intentionally:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM jobs
WHERE status = &amp;#39;pending&amp;#39;
ORDER BY created_at
LIMIT 1
FOR UPDATE;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
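&lt;p&gt;This can be correct, but under concurrency it creates contention: every worker tries to lock the same oldest pending row, so workers wait on each other instead of running in parallel.&lt;&#x2F;p&gt;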
&lt;p&gt;For job queues, a better pattern is often:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM jobs
WHERE status = &amp;#39;pending&amp;#39;
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;&lt;code&gt;SKIP LOCKED&lt;&#x2F;code&gt; allows workers to skip rows already locked by other workers.&lt;&#x2F;p&gt;
&lt;p&gt;But this changes semantics. It is useful for queues and work distribution, not for every business operation.&lt;&#x2F;p&gt;
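&lt;p&gt;In a queue worker, the locking read is often folded into a single claiming statement so the row lock is held only briefly. A minimal sketch, assuming the &lt;code&gt;jobs&lt;&#x2F;code&gt; table has an &lt;code&gt;id&lt;&#x2F;code&gt; primary key and a &lt;code&gt;running&lt;&#x2F;code&gt; status value (neither appears in the examples above):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Claim the oldest pending job that no other worker has locked.
-- The id column and the &amp;#39;running&amp;#39; status are assumptions for illustration.
UPDATE jobs
SET status = &amp;#39;running&amp;#39;
WHERE id = (
    SELECT id
    FROM jobs
    WHERE status = &amp;#39;pending&amp;#39;
    ORDER BY created_at
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the statement returns no row, there was nothing to claim and the worker can back off before polling again.&lt;&#x2F;p&gt;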
&lt;p&gt;The reliability lesson is that lock behavior is part of application design. It is not just a database implementation detail.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;advisory-locks-useful-but-easy-to-forget&quot;&gt;Advisory locks: useful, but easy to forget&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres also supports advisory locks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_advisory_lock(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_advisory_unlock(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
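&lt;p&gt;These are application-defined locks. They are useful for leader election, scheduled jobs, migration coordination, or preventing duplicate work. A lock taken with &lt;code&gt;pg_advisory_lock&lt;&#x2F;code&gt; is held until it is explicitly released or the session ends; a commit or rollback does not release it.&lt;&#x2F;p&gt;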
&lt;p&gt;But they can also create mysterious incidents if not visible in normal application logs.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect advisory locks with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    a.pid,
    a.usename,
    a.application_name,
    l.locktype,
    l.mode,
    l.granted,
    now() - a.query_start AS query_age,
    left(a.query, 160) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = &amp;#39;advisory&amp;#39;
ORDER BY query_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Advisory locks are not bad. Hidden coordination is bad.&lt;&#x2F;p&gt;
&lt;p&gt;If your system uses advisory locks, they should be named, documented, observable, and included in incident thinking.&lt;&#x2F;p&gt;
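&lt;p&gt;Two variants make advisory locks harder to forget: a non-blocking attempt, and a transaction-scoped lock that Postgres releases automatically. A short sketch, reusing the arbitrary example key &lt;code&gt;12345&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Non-blocking attempt: returns true if acquired, false right away otherwise.
SELECT pg_try_advisory_lock(12345);

-- Transaction-scoped lock: released automatically at COMMIT or ROLLBACK.
BEGIN;
SELECT pg_advisory_xact_lock(12345);
-- ... do the coordinated work here ...
COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The transaction-scoped form cannot outlive its transaction, but it inherits the risks of long transactions described earlier.&lt;&#x2F;p&gt;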
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-practical-migration-safety-checklist&quot;&gt;A practical migration safety checklist&lt;&#x2F;h2&gt;
&lt;p&gt;This is not a full migration playbook, but these questions catch many common problems.&lt;&#x2F;p&gt;
&lt;p&gt;Before running a migration on a hot table, ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How large is the table?
What lock level does this operation need?
Can it run concurrently?
Can it be split into smaller phases?
Does it scan or rewrite the table?
Can it fail quickly with lock_timeout?
Is there a rollback plan?
Are there long-running transactions right now?
Is traffic normal or elevated?
Will application retries amplify the problem?
Are dashboards ready for lock waits and pool saturation?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For large tables, prefer phased changes.&lt;&#x2F;p&gt;
&lt;p&gt;For example, instead of immediately adding a strict constraint:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount &amp;gt; 0);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You may use:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount &amp;gt; 0) NOT VALID;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then validate later:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
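&lt;p&gt;This pattern can reduce operational risk because adding the constraint metadata and validating existing rows are separated. The &lt;code&gt;NOT VALID&lt;&#x2F;code&gt; constraint is enforced for new and updated rows immediately, and &lt;code&gt;VALIDATE CONSTRAINT&lt;&#x2F;code&gt; scans existing rows under a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;&#x2F;code&gt; lock, which does not block normal reads and writes.&lt;&#x2F;p&gt;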
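&lt;p&gt;The same two-phase approach works for foreign keys. A sketch using the constraint from the earlier &lt;code&gt;orders&lt;&#x2F;code&gt; example; running the two statements in separate transactions is a suggestion here, not something the syntax requires:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Phase 1: add the constraint without checking existing rows.
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;

-- Phase 2: check existing rows later, ideally in a separate transaction.
ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;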
&lt;p&gt;Again, the point is not to memorize one trick. The point is to treat schema changes as production operations, not just code changes.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-teams-often-get-wrong&quot;&gt;What teams often get wrong&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;they-test-migrations-only-on-empty-or-tiny-databases&quot;&gt;They test migrations only on empty or tiny databases&lt;&#x2F;h3&gt;
&lt;p&gt;A migration that takes 100 ms on staging may behave very differently on a 500 GB production table.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-ignore-concurrent-workload&quot;&gt;They ignore concurrent workload&lt;&#x2F;h3&gt;
&lt;p&gt;The table is not sitting idle in production. It is being read, written, vacuumed, indexed, and queried by multiple services.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-forget-old-transactions&quot;&gt;They forget old transactions&lt;&#x2F;h3&gt;
&lt;p&gt;One forgotten transaction can turn a safe migration into a production incident.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-run-ddl-without-timeouts&quot;&gt;They run DDL without timeouts&lt;&#x2F;h3&gt;
&lt;p&gt;A migration that waits forever can become a silent lock queue.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-treat-the-database-as-isolated&quot;&gt;They treat the database as isolated&lt;&#x2F;h3&gt;
&lt;p&gt;The real incident may involve the app pool, retries, background jobs, dashboards, and human decisions.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-lock-incidents-are-good-simulation-material&quot;&gt;Why lock incidents are good simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Lock incidents are especially valuable to practice because they are deceptive.&lt;&#x2F;p&gt;
&lt;p&gt;They often do not look dramatic at first.&lt;&#x2F;p&gt;
&lt;p&gt;CPU may be fine.
Memory may be fine.
The migration may appear to be “just waiting.”
The blocker may be “idle.”
The application may report generic timeout errors.&lt;&#x2F;p&gt;
&lt;p&gt;A good simulation teaches the operational loop:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Notice latency] --&amp;gt; B[Inspect active sessions]
    B --&amp;gt; C[Identify lock waits]
    C --&amp;gt; D[Find blockers]
    D --&amp;gt; E[Understand application context]
    E --&amp;gt; F[Choose safe mitigation]
    F --&amp;gt; G[Observe consequences]
    G --&amp;gt; H[Review why the system was vulnerable]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The hard part is not running a query against &lt;code&gt;pg_stat_activity&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is deciding what the result means under pressure.&lt;&#x2F;p&gt;
&lt;p&gt;Should you cancel the migration?
Terminate the blocker?
Reduce application concurrency?
Disable a worker?
Roll back a deploy?
Wait?
Communicate impact?
Prevent retries?&lt;&#x2F;p&gt;
&lt;p&gt;Those choices are where reliability skill is built.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres locks are not the enemy. They are part of how Postgres protects correctness.&lt;&#x2F;p&gt;
&lt;p&gt;The incident happens when lock behavior meets production reality:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;large tables;
long transactions;
busy applications;
schema migrations;
connection pools;
background jobs;
retry storms;
unclear ownership;
time pressure.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A single &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; can stop a product not because Postgres is fragile, but because production systems are concurrent.&lt;&#x2F;p&gt;
&lt;p&gt;The right lesson is not “avoid locks.”
The right lesson is “understand the lock behavior of your changes before production does.”&lt;&#x2F;p&gt;
&lt;p&gt;Articles and checklists can teach the concepts.
Queries can reveal symptoms.
But lock incidents require practiced judgment.&lt;&#x2F;p&gt;
&lt;p&gt;Because in the middle of an incident, the question is rarely:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is there a lock?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The real question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which action reduces risk without making the system worse?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Schema migrations in Postgres: why safe SQL can be dangerous in production</title>
        <published>2026-04-19T00:00:00+00:00</published>
        <updated>2026-04-19T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/schema-migrations-in-production/"/>
        <id>https://rillence.com/notes/schema-migrations-in-production/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/schema-migrations-in-production/">&lt;p&gt;Schema migrations are one of the most common ways teams accidentally create Postgres incidents.&lt;&#x2F;p&gt;
&lt;p&gt;The migration passes code review.
It works locally.
It runs instantly on staging.
The SQL is syntactically correct.
The change looks small.&lt;&#x2F;p&gt;
&lt;p&gt;Then production traffic slows down, the connection pool fills, requests time out, and the incident channel starts with a familiar sentence:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;The database is stuck.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Usually, Postgres is not stuck.&lt;&#x2F;p&gt;
&lt;p&gt;It is enforcing the rules that keep data consistent while many transactions touch the same tables concurrently.&lt;&#x2F;p&gt;
&lt;p&gt;The mistake is treating a schema migration as “just a code change.”&lt;&#x2F;p&gt;
&lt;p&gt;In production, a migration is an operational event.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-core-problem-ddl-changes-concurrency&quot;&gt;The core problem: DDL changes concurrency&lt;&#x2F;h2&gt;
&lt;p&gt;A normal application query changes data or reads data.&lt;&#x2F;p&gt;
&lt;p&gt;A schema migration changes the shape of the database itself.&lt;&#x2F;p&gt;
&lt;p&gt;That difference matters because Postgres must protect the table definition while other sessions are reading or writing rows. &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; has many subforms, and the official documentation notes that lock levels differ by subform; unless explicitly noted, &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; acquires an &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; lock, and when several subcommands are combined, Postgres uses the strictest required lock. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: ALTER TABLE&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;sql-altertable.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That is the reliability risk.&lt;&#x2F;p&gt;
&lt;p&gt;A migration may not be CPU-heavy.
It may not read much data.
It may not write many rows.
It may simply need a lock that conflicts with normal traffic.&lt;&#x2F;p&gt;
&lt;p&gt;A migration incident often looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Long-running transaction touches a hot table] --&amp;gt; B[Migration waits for a table lock]
    B --&amp;gt; C[New application queries arrive]
    C --&amp;gt; D[They queue behind the waiting migration]
    D --&amp;gt; E[Application pool fills]
    E --&amp;gt; F[Requests time out]
    F --&amp;gt; G[Retries increase pressure]
    G --&amp;gt; H([Production incident])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The migration may be “waiting.”&lt;&#x2F;p&gt;
&lt;p&gt;But waiting in the wrong place can still stop the product.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;lock-compatibility-is-the-hidden-part-of-migration-safety&quot;&gt;Lock compatibility is the hidden part of migration safety&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres locks are not all equal.&lt;&#x2F;p&gt;
&lt;p&gt;A regular &lt;code&gt;SELECT&lt;&#x2F;code&gt; acquires an &lt;code&gt;ACCESS SHARE&lt;&#x2F;code&gt; lock. &lt;code&gt;INSERT&lt;&#x2F;code&gt;, &lt;code&gt;UPDATE&lt;&#x2F;code&gt;, &lt;code&gt;DELETE&lt;&#x2F;code&gt;, and &lt;code&gt;MERGE&lt;&#x2F;code&gt; acquire &lt;code&gt;ROW EXCLUSIVE&lt;&#x2F;code&gt; locks on the target table. &lt;code&gt;CREATE INDEX&lt;&#x2F;code&gt; without &lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; acquires a &lt;code&gt;SHARE&lt;&#x2F;code&gt; lock. &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; conflicts with every table-level lock mode and is the only table-level lock that blocks a plain &lt;code&gt;SELECT&lt;&#x2F;code&gt;. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 13.3. Explicit Locking&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;explicit-locking.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;This is why a schema change can have a much larger blast radius than expected.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This may be fast in many situations. But “fast” is not the same as “risk-free.”&lt;&#x2F;p&gt;
&lt;p&gt;Even a short lock can be dangerous when:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;the table is hot;
transactions are long;
traffic is high;
the migration waits behind another session;
application timeouts are short;
retries are aggressive;
the deploy starts many app instances at once.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The operational risk is not only how long the migration takes after it starts.&lt;&#x2F;p&gt;
&lt;p&gt;It is also how long it waits before it can safely start.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-migration-that-waits-can-be-worse-than-the-migration-that-runs&quot;&gt;The migration that waits can be worse than the migration that runs&lt;&#x2F;h2&gt;
&lt;p&gt;A migration can damage traffic before it does any meaningful work.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose this transaction is open:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application stays idle before COMMIT
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now a migration runs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the migration waits for a strong lock, later queries against &lt;code&gt;accounts&lt;&#x2F;code&gt; can queue behind it.&lt;&#x2F;p&gt;
&lt;p&gt;That queue can grow quickly:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Session A: old transaction is still open
Session B: ALTER TABLE waits for lock
Session C: SELECT from application waits
Session D: UPDATE from application waits
Session E: SELECT from application waits
...
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is why “the migration is only waiting” is not comforting.&lt;&#x2F;p&gt;
&lt;p&gt;A waiting migration can become a traffic barrier.&lt;&#x2F;p&gt;
&lt;p&gt;During an incident, look for blocked and blocking sessions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 160) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 160) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And inspect sessions waiting on locks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = &amp;#39;Lock&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not only to identify the blocking PID.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to understand whether the migration has created a queue in front of production traffic.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;use-lock-timeouts-as-blast-radius-control&quot;&gt;Use lock timeouts as blast-radius control&lt;&#x2F;h2&gt;
&lt;p&gt;A migration should not wait forever for a lock on a hot table.&lt;&#x2F;p&gt;
&lt;p&gt;Use a lock timeout:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SET lock_timeout = &amp;#39;2s&amp;#39;;
SET statement_timeout = &amp;#39;5min&amp;#39;;

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Inside a transaction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

SET LOCAL lock_timeout = &amp;#39;2s&amp;#39;;
SET LOCAL statement_timeout = &amp;#39;5min&amp;#39;;

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;

COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This does not make the migration safe.&lt;&#x2F;p&gt;
&lt;p&gt;It makes failure faster.&lt;&#x2F;p&gt;
&lt;p&gt;That is valuable.&lt;&#x2F;p&gt;
&lt;p&gt;A failed migration with a clear timeout is usually better than a migration that silently waits and causes traffic to queue behind it.&lt;&#x2F;p&gt;
&lt;p&gt;The operational principle:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Migrations should fail before they become incidents.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But timeouts need to be chosen carefully. Too short, and safe migrations fail constantly. Too long, and the timeout no longer protects production traffic.&lt;&#x2F;p&gt;
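&lt;p&gt;A short timeout works best when paired with a bounded retry, so that a failed attempt releases its place in the lock queue and tries again later. A rough sketch in PL&#x2F;pgSQL, assuming five attempts and a growing pause are acceptable; the attempt count, the sleep times, and &lt;code&gt;IF NOT EXISTS&lt;&#x2F;code&gt; are illustrative choices:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DO $$
BEGIN
    FOR attempt IN 1..5 LOOP
        BEGIN
            -- Give up on the lock after 2 seconds instead of queueing.
            SET LOCAL lock_timeout = &amp;#39;2s&amp;#39;;

            ALTER TABLE accounts
            ADD COLUMN IF NOT EXISTS archived_at timestamptz;

            RETURN;  -- success: stop retrying
        EXCEPTION WHEN lock_not_available THEN
            -- The lock timeout fired; the failed attempt is rolled back.
            RAISE NOTICE &amp;#39;attempt % timed out waiting for the lock&amp;#39;, attempt;
            PERFORM pg_sleep(attempt);  -- wait a little longer each time
        END;
    END LOOP;

    RAISE EXCEPTION &amp;#39;could not acquire the lock after 5 attempts&amp;#39;;
END
$$;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Many teams implement this loop in their migration tooling rather than in SQL; the shape of the retry is the point, not the language.&lt;&#x2F;p&gt;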
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;create-index-is-not-always-online&quot;&gt;&lt;code&gt;CREATE INDEX&lt;&#x2F;code&gt; is not always online&lt;&#x2F;h2&gt;
&lt;p&gt;A classic migration:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX idx_orders_customer_id
ON orders (customer_id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is a normal index build. It blocks writes to the table for as long as the build runs, while reads continue.&lt;&#x2F;p&gt;
&lt;p&gt;For production systems, teams often use:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres documents &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt; as a way to create an index without locking out writes to the table. The same documentation also notes important caveats: concurrent index builds cannot run inside a transaction block, only one concurrent index build can run on a table at a time, and failed concurrent builds can leave an invalid index behind. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: CREATE INDEX&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;sql-createindex.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That means this fails with an error:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);

COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;You need to run it outside a normal transaction block:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Many migration frameworks wrap migrations in transactions by default. That default is good for many schema changes, but it conflicts with &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;This is not just a syntax issue. It is a deployment-system issue.&lt;&#x2F;p&gt;
&lt;p&gt;Your migration tooling must understand which changes need transactional execution and which changes need to run outside a transaction.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;concurrent-index-creation-can-still-hurt&quot;&gt;Concurrent index creation can still hurt&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; reduces blocking. It does not make index creation free.&lt;&#x2F;p&gt;
&lt;p&gt;A concurrent index build can still:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;scan a large table;
consume CPU;
consume disk IO;
generate WAL;
increase replication lag;
compete with autovacuum;
take a long time;
fail and leave an invalid index;
wait for old transactions;
interact badly with other maintenance.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Monitor progress:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    p.pid,
    p.datname,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_total,
    p.blocks_done,
    round(100.0 * p.blocks_done &#x2F; nullif(p.blocks_total, 0), 2) AS blocks_pct,
    p.tuples_total,
    p.tuples_done,
    now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
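&lt;p&gt;The same progress view also reports which session a concurrent build is currently waiting on, which helps with the “wait for old transactions” case. A minimal sketch, assuming Postgres 12 or later, where &lt;code&gt;pg_stat_progress_create_index&lt;&#x2F;code&gt; is available:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Show the session (if any) that each index build is waiting for.
SELECT
    p.pid AS index_build_pid,
    p.phase,
    p.lockers_done,
    p.lockers_total,
    p.current_locker_pid,
    a.state AS locker_state,
    now() - a.xact_start AS locker_xact_age,
    left(a.query, 200) AS locker_query
FROM pg_stat_progress_create_index p
LEFT JOIN pg_stat_activity a ON a.pid = p.current_locker_pid
ORDER BY locker_xact_age DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A locker that sits in &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; with a large transaction age is usually the first thing to investigate.&lt;&#x2F;p&gt;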
&lt;p&gt;Check for invalid indexes after failure:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    i.relname AS index_name,
    ix.indisvalid,
    ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
   OR ix.indisready = false
ORDER BY schema_name, table_name, index_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An invalid index is easy to forget.&lt;&#x2F;p&gt;
&lt;p&gt;It may not help queries, but it can still create maintenance and write overhead.&lt;&#x2F;p&gt;
&lt;p&gt;That is exactly the kind of “cleanup later” detail that becomes reliability debt.&lt;&#x2F;p&gt;
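&lt;p&gt;Cleaning up is usually a matter of dropping the leftover and retrying. A sketch using the index name from the example above; both statements must run outside a transaction block:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Remove the invalid leftover without blocking writes to the table.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_id;

-- Retry the build once the cause of the failure is understood.
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On Postgres 12 and later, &lt;code&gt;REINDEX INDEX CONCURRENTLY&lt;&#x2F;code&gt; is another way to rebuild an invalid index in place.&lt;&#x2F;p&gt;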
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;adding-constraints-safely&quot;&gt;Adding constraints safely&lt;&#x2F;h2&gt;
&lt;p&gt;A constraint can be both logically correct and operationally expensive.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount &amp;gt; 0);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On a large table, Postgres scans the existing rows to verify that they satisfy the new constraint, and it holds an &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; lock on the table while it does so.&lt;&#x2F;p&gt;
&lt;p&gt;A safer phased pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount &amp;gt; 0) NOT VALID;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then later:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres documentation explains that &lt;code&gt;NOT VALID&lt;&#x2F;code&gt; skips the potentially lengthy scan of existing rows when adding foreign-key, &lt;code&gt;CHECK&lt;&#x2F;code&gt;, or not-null constraints, while still applying the constraint to subsequent inserts or updates; the constraint is not considered valid for all existing rows until &lt;code&gt;VALIDATE CONSTRAINT&lt;&#x2F;code&gt; is run. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: ALTER TABLE&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;sql-altertable.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;The validation step still scans the table, so it is not free. But it separates two concerns:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Start enforcing the rule for new data
        ↓
Validate old data later under controlled conditions
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That separation is often the difference between a safe rollout and a production incident.&lt;&#x2F;p&gt;
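&lt;p&gt;The risk of the phased pattern is that the second step never happens. A small catalog query, shown here as a sketch, lists constraints that were added with &lt;code&gt;NOT VALID&lt;&#x2F;code&gt; and have not been validated yet:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Constraints that are enforced for new rows but not yet validated for old rows.
SELECT
    conrelid::regclass AS table_name,
    conname AS constraint_name,
    contype AS constraint_type,  -- c = check, f = foreign key
    convalidated
FROM pg_constraint
WHERE NOT convalidated
ORDER BY conrelid::regclass::text, conname;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;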
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;foreign-keys-are-operational-changes-too&quot;&gt;Foreign keys are operational changes too&lt;&#x2F;h2&gt;
&lt;p&gt;Foreign keys are valuable. They protect data integrity.&lt;&#x2F;p&gt;
&lt;p&gt;But adding one to a large, hot table is not just a metadata change.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A phased version:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres notes that adding a foreign key with &lt;code&gt;NOT VALID&lt;&#x2F;code&gt; can reduce impact, and that validating it later does not need to lock out concurrent updates because new rows are already checked. Validation takes only a &lt;code&gt;SHARE UPDATE EXCLUSIVE&lt;&#x2F;code&gt; lock on the altered table, and foreign-key validation also takes a &lt;code&gt;ROW SHARE&lt;&#x2F;code&gt; lock on the referenced table. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: ALTER TABLE&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;sql-altertable.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;The important part is that a foreign key touches two tables operationally:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;the table that contains the foreign key;
the table being referenced.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That matters during incidents.&lt;&#x2F;p&gt;
&lt;p&gt;A migration on &lt;code&gt;orders&lt;&#x2F;code&gt; can affect &lt;code&gt;customers&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A team that only looks at one table may miss the real blocking chain.&lt;&#x2F;p&gt;
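&lt;p&gt;One way to look at both tables at once is to query &lt;code&gt;pg_locks&lt;&#x2F;code&gt; for the pair. A minimal sketch, using the &lt;code&gt;orders&lt;&#x2F;code&gt; and &lt;code&gt;customers&lt;&#x2F;code&gt; tables from the example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Locks held or awaited on either side of the foreign key.
SELECT
    l.relation::regclass AS table_name,
    l.mode,
    l.granted,
    a.pid,
    a.state,
    now() - a.xact_start AS xact_age,
    left(a.query, 200) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = &amp;#39;relation&amp;#39;
  AND l.database = (SELECT oid FROM pg_database WHERE datname = current_database())
  AND l.relation IN (&amp;#39;orders&amp;#39;::regclass, &amp;#39;customers&amp;#39;::regclass)
ORDER BY l.granted, xact_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Rows with &lt;code&gt;granted = false&lt;&#x2F;code&gt; sort first and show who is waiting; the granted rows on the same table show who they may be waiting behind.&lt;&#x2F;p&gt;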
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;set-not-null-can-be-a-table-scan&quot;&gt;&lt;code&gt;SET NOT NULL&lt;&#x2F;code&gt; can be a table scan&lt;&#x2F;h2&gt;
&lt;p&gt;This looks simple:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But Postgres must know that no existing row violates the constraint.&lt;&#x2F;p&gt;
&lt;p&gt;The documentation says &lt;code&gt;SET NOT NULL&lt;&#x2F;code&gt; is ordinarily checked by scanning the whole table, unless a valid check constraint proves no nulls can exist or &lt;code&gt;NOT VALID&lt;&#x2F;code&gt; is used in supported cases. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: ALTER TABLE&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;sql-altertable.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A common safer pattern is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then validate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
VALIDATE CONSTRAINT users_email_not_null;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then apply the not-null marker when appropriate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
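&lt;p&gt;Since Postgres 12, &lt;code&gt;SET NOT NULL&lt;&#x2F;code&gt; can skip its own scan when a valid check constraint already proves that no nulls exist. Once the column-level marker is set, the helper constraint is redundant. A sketch of the cleanup, assuming the previous step succeeded:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- The check constraint has done its job and can go.
ALTER TABLE users
DROP CONSTRAINT users_email_not_null;

-- Confirm the column-level marker is in place.
SELECT attnotnull
FROM pg_attribute
WHERE attrelid = &amp;#39;users&amp;#39;::regclass
  AND attname = &amp;#39;email&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;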
&lt;p&gt;The exact sequence depends on Postgres version, table structure, and whether you need a true column-level &lt;code&gt;NOT NULL&lt;&#x2F;code&gt; constraint or a check constraint is enough for your use case.&lt;&#x2F;p&gt;
&lt;p&gt;The reliability point is stable:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Do not assume a one-line constraint change is operationally small.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;adding-a-column-is-not-always-the-risky-part&quot;&gt;Adding a column is not always the risky part&lt;&#x2F;h2&gt;
&lt;p&gt;Many teams focus on &lt;code&gt;ADD COLUMN&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;But the dangerous part is often what follows.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ADD COLUMN normalized_email text;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE users
SET normalized_email = lower(email);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The first statement may be quick.&lt;&#x2F;p&gt;
&lt;p&gt;The second statement may be a production event:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;large table scan;
many row updates;
large WAL generation;
replication lag;
autovacuum pressure;
index maintenance;
long transaction;
lock contention;
cache churn;
connection pool pressure.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A safer backfill pattern uses batches:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;WITH batch AS (
    SELECT id
    FROM users
    WHERE normalized_email IS NULL
      AND email IS NOT NULL  -- rows with a NULL email would otherwise be picked again forever
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Repeat in a worker with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;small batches;
short transactions;
sleep between batches;
progress tracking;
replication lag monitoring;
statement timeout;
ability to stop quickly.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The batch size is not universal. It should be chosen based on production pressure.&lt;&#x2F;p&gt;
&lt;p&gt;A backfill is not just SQL. It is a controlled workload.&lt;&#x2F;p&gt;
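&lt;p&gt;The worker loop can live in application code or in the database. As a rough sketch of the latter, assuming Postgres 11 or later and that the block is run outside an explicit transaction (so that &lt;code&gt;COMMIT&lt;&#x2F;code&gt; is allowed inside it), with a placeholder batch size and sleep:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DO $$
DECLARE
    rows_updated bigint;
BEGIN
    LOOP
        -- One small batch per transaction.
        WITH batch AS (
            SELECT id
            FROM users
            WHERE normalized_email IS NULL
              AND email IS NOT NULL
            ORDER BY id
            LIMIT 1000
        )
        UPDATE users u
        SET normalized_email = lower(u.email)
        FROM batch
        WHERE u.id = batch.id;

        GET DIAGNOSTICS rows_updated = ROW_COUNT;
        EXIT WHEN rows_updated = 0;  -- nothing left to backfill

        COMMIT;                 -- release row locks and end the transaction
        PERFORM pg_sleep(0.5);  -- give replicas and autovacuum room to keep up
    END LOOP;
END
$$;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An external worker is usually easier to pause, throttle on replication lag, and stop quickly. The loop above only illustrates the shape: small batches, a commit after each one, and a pause in between.&lt;&#x2F;p&gt;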
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-expand-and-contract-pattern&quot;&gt;The expand-and-contract pattern&lt;&#x2F;h2&gt;
&lt;p&gt;For application-visible schema changes, the safest migrations are often multi-step.&lt;&#x2F;p&gt;
&lt;p&gt;Suppose you want to rename a column from &lt;code&gt;name&lt;&#x2F;code&gt; to &lt;code&gt;full_name&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A dangerous migration:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users RENAME COLUMN name TO full_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If old application code still expects &lt;code&gt;name&lt;&#x2F;code&gt;, it breaks.&lt;&#x2F;p&gt;
&lt;p&gt;A safer pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;1. Expand schema
2. Deploy code that writes both old and new shapes
3. Backfill old data into new shape
4. Deploy code that reads new shape
5. Stop using old shape
6. Contract schema later
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
ADD COLUMN full_name text;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Application writes both:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;name = input.name
full_name = input.name
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Backfill:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;WITH batch AS (
    SELECT id
    FROM users
    WHERE full_name IS NULL
      AND name IS NOT NULL  -- rows with a NULL name would otherwise be picked again forever
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET full_name = u.name
FROM batch
WHERE u.id = batch.id;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Later, after all code reads &lt;code&gt;full_name&lt;&#x2F;code&gt; and old versions are gone:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
DROP COLUMN name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is slower than a one-step migration.&lt;&#x2F;p&gt;
&lt;p&gt;It is also much safer.&lt;&#x2F;p&gt;
&lt;p&gt;Reliability often means accepting more deployment steps to reduce coupling between code and schema.&lt;&#x2F;p&gt;
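&lt;p&gt;Before the contract step, it can help to confirm that the two columns really agree. A minimal check, using the column names from the example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Counts rows where the old and new columns differ, treating NULLs as comparable values.
SELECT count(*) AS mismatched_rows
FROM users
WHERE full_name IS DISTINCT FROM name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This scans the whole table, so it is itself a workload to schedule. A result of zero is a reasonable precondition for &lt;code&gt;DROP COLUMN&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;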
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;backward-compatibility-matters-during-rolling-deploys&quot;&gt;Backward compatibility matters during rolling deploys&lt;&#x2F;h2&gt;
&lt;p&gt;Many production systems deploy gradually.&lt;&#x2F;p&gt;
&lt;p&gt;For some period, old and new application versions run at the same time.&lt;&#x2F;p&gt;
&lt;p&gt;That means schema migrations must be compatible with both versions.&lt;&#x2F;p&gt;
&lt;p&gt;Risky sequence:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Migration removes column
        ↓
Old app instance still reads column
        ↓
Requests fail
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Safer sequence:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;New app stops depending on column
        ↓
Rollout completes
        ↓
Old app versions are gone
        ↓
Column is removed later
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is not a database-only issue.&lt;&#x2F;p&gt;
&lt;p&gt;It is a deployment architecture issue.&lt;&#x2F;p&gt;
&lt;p&gt;A schema migration must be designed for:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;rolling deploys;
failed deploys;
rollbacks;
background workers;
cron jobs;
admin scripts;
BI tools;
old application instances;
read replicas;
migration retries.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A migration that is safe in a single-process mental model may be unsafe in a distributed production system.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;beware-of-defaults-and-rewrites&quot;&gt;Beware of defaults and rewrites&lt;&#x2F;h2&gt;
&lt;p&gt;This migration looks innocent:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE events
ADD COLUMN source text DEFAULT &amp;#39;web&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
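&lt;p&gt;Depending on Postgres version and the exact default expression, adding a column with a default may be metadata-only or may rewrite the whole table. Since PostgreSQL 11, a non-volatile default such as the constant above is stored in the catalog and existing rows are not rewritten; older versions rewrote every row. A volatile default, such as &lt;code&gt;clock_timestamp()&lt;&#x2F;code&gt;, still forces a rewrite of the table and its indexes, and other forms of schema change can still be expensive.&lt;&#x2F;p&gt;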
&lt;p&gt;A safer mental model is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Do not judge by syntax.
Check the operational behavior for your exact Postgres version and exact command.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
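&lt;p&gt;One way to check is to run the exact command against a staging copy and compare the table file before and after. A sketch, assuming a staging database with the same Postgres version and an &lt;code&gt;events&lt;&#x2F;code&gt; table that does not yet have the column:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Run on a staging copy, not on production.
SELECT pg_relation_filenode(&amp;#39;events&amp;#39;) AS filenode_before;

ALTER TABLE events
ADD COLUMN source text DEFAULT &amp;#39;web&amp;#39;;

SELECT pg_relation_filenode(&amp;#39;events&amp;#39;) AS filenode_after;
-- If the two values differ, the table was rewritten into a new file.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;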
&lt;p&gt;When in doubt, use a phased approach:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE events
ADD COLUMN source text;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Deploy code to write &lt;code&gt;source&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;Backfill existing rows in batches.&lt;&#x2F;p&gt;
&lt;p&gt;Then add a default for future rows:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE events
ALTER COLUMN source SET DEFAULT &amp;#39;web&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is often more verbose, but it gives you control over when the large data change happens.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;large-deletes-are-migrations-too&quot;&gt;Large deletes are migrations too&lt;&#x2F;h2&gt;
&lt;p&gt;Retention changes often appear as simple cleanup:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DELETE FROM events
WHERE created_at &amp;lt; now() - interval &amp;#39;180 days&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On a large table, this can be a serious write workload.&lt;&#x2F;p&gt;
&lt;p&gt;It can:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;generate huge WAL;
create many dead tuples;
increase autovacuum pressure;
block or slow other queries;
increase replica lag;
hold locks for too long;
fill disk temporarily;
cause checkpoint pressure.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A batched delete:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;WITH batch AS (
    SELECT id
    FROM events
    WHERE created_at &amp;lt; now() - interval &amp;#39;180 days&amp;#39;
    ORDER BY id
    LIMIT 5000
)
DELETE FROM events e
USING batch
WHERE e.id = batch.id;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For very large time-series data, partitioning may be a better retention mechanism:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DROP TABLE events_2025_01;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Dropping an old partition can be dramatically different from deleting millions of rows from a single table.&lt;&#x2F;p&gt;
&lt;p&gt;But partitioning has its own complexity. It should match the data lifecycle, query patterns, and operational ownership.&lt;&#x2F;p&gt;
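&lt;p&gt;If the table is already partitioned, the old partition can be detached before it is dropped. A sketch, assuming Postgres 14 or later, that &lt;code&gt;events&lt;&#x2F;code&gt; is range-partitioned by &lt;code&gt;created_at&lt;&#x2F;code&gt;, and that &lt;code&gt;events_2025_01&lt;&#x2F;code&gt; is one of its partitions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Must run outside a transaction block.
-- Detaches the partition without taking the strongest lock on the parent.
ALTER TABLE events
DETACH PARTITION events_2025_01 CONCURRENTLY;

-- The detached table is now independent and can be archived or dropped.
DROP TABLE events_2025_01;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;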
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;migrations-should-have-observability&quot;&gt;Migrations should have observability&lt;&#x2F;h2&gt;
&lt;p&gt;A production migration should not be a black box.&lt;&#x2F;p&gt;
&lt;p&gt;Before running a risky migration, decide how you will observe it.&lt;&#x2F;p&gt;
&lt;p&gt;Useful checks include active migration sessions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
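WHERE pid &amp;lt;&amp;gt; pg_backend_pid()  -- skip this session, whose own query text matches the patterns
  AND (query ILIKE &amp;#39;%alter table%&amp;#39;
   OR query ILIKE &amp;#39;%create index%&amp;#39;
   OR query ILIKE &amp;#39;%validate constraint%&amp;#39;)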
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Lock waits:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = &amp;#39;Lock&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
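&lt;p&gt;Knowing that a session is waiting is only half of the picture. A sketch that pairs each waiting session with the sessions blocking it, using &lt;code&gt;pg_blocking_pids()&lt;&#x2F;code&gt; (available since Postgres 9.6):&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- One row per (waiting session, blocking session) pair.
SELECT
    waiting.pid AS waiting_pid,
    left(waiting.query, 100) AS waiting_query,
    blocker.pid AS blocking_pid,
    blocker.state AS blocking_state,
    now() - blocker.xact_start AS blocking_xact_age,
    left(blocker.query, 100) AS blocking_query
FROM pg_stat_activity AS waiting
CROSS JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid)
JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
WHERE waiting.wait_event_type = &amp;#39;Lock&amp;#39;
ORDER BY blocking_xact_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;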
&lt;p&gt;Index build progress:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_done,
    p.blocks_total,
    round(100.0 * p.blocks_done &#x2F; nullif(p.blocks_total, 0), 2) AS pct_done,
    now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication lag:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_size,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Dead tuple pressure after backfills or deletes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The key question:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How will we know the migration is becoming unsafe before users tell us?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;migration-tooling-can-create-risk&quot;&gt;Migration tooling can create risk&lt;&#x2F;h2&gt;
&lt;p&gt;Migration frameworks are useful. They provide ordering, history, repeatability, and deployment integration.&lt;&#x2F;p&gt;
&lt;p&gt;But they can also create hazards.&lt;&#x2F;p&gt;
&lt;p&gt;Common tooling problems:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;all migrations wrapped in one transaction;
no support for CREATE INDEX CONCURRENTLY;
no lock_timeout by default;
no statement_timeout policy;
no distinction between schema change and data backfill;
no pause&#x2F;resume mechanism;
no progress visibility;
automatic retries of unsafe migrations;
running migrations during app startup;
running migrations from multiple app instances;
no clear owner during incidents.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A particularly dangerous pattern:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[App instance starts] --&amp;gt; B[It runs migrations automatically]
    B --&amp;gt; C[Many instances start during deploy]
    C --&amp;gt; D[Multiple migration attempts compete]
    D --&amp;gt; E([Production traffic is already increasing])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Migration execution should be controlled.&lt;&#x2F;p&gt;
&lt;p&gt;For serious systems, migrations are not just part of application boot.&lt;&#x2F;p&gt;
&lt;p&gt;They are operational tasks with ownership and observability.&lt;&#x2F;p&gt;
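&lt;p&gt;If migrations must be triggered from several places, an advisory lock can at least ensure that only one runner proceeds. A minimal sketch; the key &lt;code&gt;727001&lt;&#x2F;code&gt; is an arbitrary, hypothetical value that every runner would have to agree on:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Returns true for exactly one session at a time; the others get false immediately.
SELECT pg_try_advisory_lock(727001) AS got_migration_lock;

-- Run migrations only if got_migration_lock is true.
-- When finished, release the lock from the same session:
SELECT pg_advisory_unlock(727001);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The lock is tied to the database session, so the lock and unlock calls must run on the same connection. It also disappears if that connection drops, which is usually the desired behavior for a crashed runner.&lt;&#x2F;p&gt;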
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;rollback-is-not-always-the-inverse-migration&quot;&gt;Rollback is not always the inverse migration&lt;&#x2F;h2&gt;
&lt;p&gt;Application rollbacks are often easier than database rollbacks.&lt;&#x2F;p&gt;
&lt;p&gt;If a deploy fails, you can roll back code.&lt;&#x2F;p&gt;
&lt;p&gt;But after this runs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE users
DROP COLUMN legacy_id;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;the old column is gone.&lt;&#x2F;p&gt;
&lt;p&gt;After this runs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE accounts
SET status = &amp;#39;inactive&amp;#39;
WHERE last_seen_at &amp;lt; now() - interval &amp;#39;2 years&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;the old values are not automatically recoverable unless you prepared for that.&lt;&#x2F;p&gt;
&lt;p&gt;After this runs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE orders
ALTER COLUMN total_cents TYPE numeric;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;returning to the old type may be lossy, slow, or impossible without careful planning.&lt;&#x2F;p&gt;
&lt;p&gt;A good migration plan distinguishes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;code rollback;
schema rollback;
data rollback;
roll-forward fix;
restore from backup;
point-in-time recovery;
manual correction.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Many database changes are not safely reversible.&lt;&#x2F;p&gt;
&lt;p&gt;For those, the safer strategy is often:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;make the change additive;
delay destructive steps;
keep old data until confidence is high;
roll forward instead of rolling back;
test recovery before production.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A rollback plan that says “run the down migration” is not enough.&lt;&#x2F;p&gt;
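&lt;p&gt;Keeping old data can be as simple as copying the affected values aside before changing them. A sketch for the &lt;code&gt;accounts&lt;&#x2F;code&gt; update above, assuming a primary key column named &lt;code&gt;id&lt;&#x2F;code&gt; and a hypothetical backup table name:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

-- now() is fixed for the whole transaction, so both statements use the same cutoff.
CREATE TABLE accounts_status_backup AS
SELECT id, status AS old_status
FROM accounts
WHERE last_seen_at &amp;lt; now() - interval &amp;#39;2 years&amp;#39;;

UPDATE accounts
SET status = &amp;#39;inactive&amp;#39;
WHERE last_seen_at &amp;lt; now() - interval &amp;#39;2 years&amp;#39;;

COMMIT;

-- Later, if the change has to be undone:
UPDATE accounts a
SET status = b.old_status
FROM accounts_status_backup b
WHERE a.id = b.id
  AND a.status = &amp;#39;inactive&amp;#39;;  -- leave rows alone if they were changed again since
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;On a large table the update itself should still be batched. The point of the sketch is only that the old values exist somewhere before they are overwritten.&lt;&#x2F;p&gt;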
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-practical-pre-flight-checklist&quot;&gt;A practical pre-flight checklist&lt;&#x2F;h2&gt;
&lt;p&gt;Before running a migration on a large or hot table, ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What lock level does this operation need?
Can it wait behind an old transaction?
Can it cause later application traffic to queue?
Will it scan the table?
Will it rewrite the table?
Will it generate large WAL?
Will it increase replication lag?
Will it create many dead tuples?
Will it affect autovacuum?
Can it run inside a transaction?
Does the migration framework support the required mode?
Can it fail quickly with lock_timeout?
Can it be paused or resumed?
Is the change backward-compatible with old code?
Is there a safe rollback or roll-forward plan?
Who is watching it?
What metric tells us to stop?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This checklist does not replace practice.&lt;&#x2F;p&gt;
&lt;p&gt;It helps expose which migrations deserve deeper planning.&lt;&#x2F;p&gt;
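&lt;p&gt;Some of these questions can be answered with a query just before the migration starts. For example, a sketch that lists the oldest open transactions a DDL statement might end up waiting behind:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Oldest open transactions, excluding this session.
SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS xact_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid &amp;lt;&amp;gt; pg_backend_pid()
ORDER BY xact_start ASC
LIMIT 10;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;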
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;common-anti-patterns&quot;&gt;Common anti-patterns&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;testing-only-on-tiny-staging-data&quot;&gt;Testing only on tiny staging data&lt;&#x2F;h3&gt;
&lt;p&gt;A migration that takes 200 ms on staging can take hours or block production on a large table.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;combining-too-much-into-one-migration&quot;&gt;Combining too much into one migration&lt;&#x2F;h3&gt;
&lt;p&gt;Schema change, backfill, constraint validation, index creation, and cleanup are different operational phases. They should often be separated.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;running-destructive-changes-too-early&quot;&gt;Running destructive changes too early&lt;&#x2F;h3&gt;
&lt;p&gt;Dropping columns, constraints, tables, or indexes before all code paths are ready creates rollback traps.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-lock-timeout&quot;&gt;No lock timeout&lt;&#x2F;h3&gt;
&lt;p&gt;A migration that waits forever can silently create a production queue.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;treating-concurrently-as-harmless&quot;&gt;Treating &lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; as harmless&lt;&#x2F;h3&gt;
&lt;p&gt;It reduces blocking, but still consumes resources and has caveats.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ignoring-old-transactions&quot;&gt;Ignoring old transactions&lt;&#x2F;h3&gt;
&lt;p&gt;Long transactions can turn a safe migration into a lock incident.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;backfilling-in-one-huge-transaction&quot;&gt;Backfilling in one huge transaction&lt;&#x2F;h3&gt;
&lt;p&gt;This creates WAL, dead tuples, replication lag, and rollback risk.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;forgetting-replicas-and-downstream-systems&quot;&gt;Forgetting replicas and downstream systems&lt;&#x2F;h3&gt;
&lt;p&gt;A migration may succeed on the primary while breaking read replicas, CDC consumers, ETL jobs, or analytics systems.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-migration-incidents-are-excellent-simulation-material&quot;&gt;Why migration incidents are excellent simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Schema migration incidents are some of the best reliability training scenarios because they involve both technical mechanics and human pressure.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic simulation can include:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;a migration waiting on a lock;
a long idle transaction;
application queries piling up behind DDL;
a connection pool filling;
a concurrent index build consuming IO;
a failed index leaving an invalid artifact;
a backfill increasing replication lag;
a rollback that is not actually safe;
a team debating whether to cancel, wait, kill a blocker, pause traffic, or roll forward.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The hard part is not knowing that locks exist.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is choosing the safest action while production is degrading.&lt;&#x2F;p&gt;
&lt;p&gt;Should the team cancel the migration?
Terminate the blocker?
Pause workers?
Reduce traffic?
Disable retries?
Let the migration finish?
Roll forward?
Roll back application code?
Validate later?
Drop an invalid index?
Leave the system alone and collect more evidence?&lt;&#x2F;p&gt;
&lt;p&gt;These are operational decisions, not syntax questions.&lt;&#x2F;p&gt;
&lt;p&gt;Articles can teach the patterns.
Checklists can reduce obvious mistakes.
Simulations train the judgment needed when a migration interacts with real production load.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;A Postgres migration is not safe because the SQL is valid.&lt;&#x2F;p&gt;
&lt;p&gt;It is safe only if its production behavior is understood.&lt;&#x2F;p&gt;
&lt;p&gt;A one-line &lt;code&gt;ALTER TABLE&lt;&#x2F;code&gt; can create a lock queue.
A normal &lt;code&gt;CREATE INDEX&lt;&#x2F;code&gt; can block writes.
A concurrent index can still consume enough resources to hurt.
A constraint can require a large validation scan.
A backfill can generate WAL, dead tuples, and replica lag.
A rollback can be impossible after destructive data changes.&lt;&#x2F;p&gt;
&lt;p&gt;The dangerous phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;It worked on staging.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better reliability question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What will this migration do to locks, WAL, replicas, autovacuum, connection pools, old application versions, and rollback options in production?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That question turns migrations from hidden deployment risk into a deliberate reliability practice.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Autovacuum: the quiet Postgres process that becomes a loud reliability problem</title>
        <published>2026-04-14T00:00:00+00:00</published>
        <updated>2026-04-14T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/autovacuum-quiet-process/"/>
        <id>https://rillence.com/notes/autovacuum-quiet-process/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/autovacuum-quiet-process/">&lt;p&gt;Autovacuum is easy to ignore when everything is healthy.&lt;&#x2F;p&gt;
&lt;p&gt;It runs in the background.
It does not usually appear in product discussions.
It is rarely mentioned in feature planning.
It does not look like an application dependency.&lt;&#x2F;p&gt;
&lt;p&gt;Then one day the database gets slower, storage grows unexpectedly, query plans become unstable, or Postgres starts warning about transaction ID wraparound.&lt;&#x2F;p&gt;
&lt;p&gt;At that point, autovacuum is no longer background maintenance.&lt;&#x2F;p&gt;
&lt;p&gt;It is part of the incident.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres autovacuum exists because MVCC creates old row versions that must eventually be cleaned up, and because the planner needs fresh table statistics to choose good query plans. The PostgreSQL documentation describes routine vacuuming as necessary to recover or reuse storage occupied by updated or deleted rows, update planner statistics, and protect against transaction ID wraparound. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 24.1. Routine Vacuuming&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;routine-vacuuming.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;The reliability lesson is simple:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Autovacuum is not an optional optimization.
It is part of Postgres survival.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-postgres-needs-vacuum-at-all&quot;&gt;Why Postgres needs vacuum at all&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres uses MVCC: multi-version concurrency control.&lt;&#x2F;p&gt;
&lt;p&gt;When a row is updated, Postgres does not simply overwrite the old row in place. It creates a new row version. When a row is deleted, the old version is not immediately removed from the table file.&lt;&#x2F;p&gt;
&lt;p&gt;That design allows concurrent transactions to see a consistent view of data without blocking each other unnecessarily.&lt;&#x2F;p&gt;
&lt;p&gt;But it creates a maintenance problem.&lt;&#x2F;p&gt;
&lt;p&gt;Old row versions eventually become unnecessary. Once no active transaction can still see them, they can be cleaned up. That cleanup is one of the main jobs of &lt;code&gt;VACUUM&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;p&gt;A simplified update chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;UPDATE users SET status = &amp;#39;active&amp;#39; WHERE id = 42;

Old row version remains for older snapshots.
New row version becomes visible to newer transactions.
VACUUM can later remove the old version when safe.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
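&lt;p&gt;Row versions can be observed through the system columns &lt;code&gt;ctid&lt;&#x2F;code&gt;, &lt;code&gt;xmin&lt;&#x2F;code&gt;, and &lt;code&gt;xmax&lt;&#x2F;code&gt;. A minimal sketch, assuming a &lt;code&gt;users&lt;&#x2F;code&gt; table with &lt;code&gt;id&lt;&#x2F;code&gt; and &lt;code&gt;status&lt;&#x2F;code&gt; columns like the one in the example above:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Physical location (ctid) and creating transaction (xmin) before the update.
SELECT ctid, xmin, xmax, id, status
FROM users
WHERE id = 42;

UPDATE users SET status = &amp;#39;active&amp;#39; WHERE id = 42;

-- The visible row is now a different version: ctid and xmin have changed.
SELECT ctid, xmin, xmax, id, status
FROM users
WHERE id = 42;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The old version still occupies space at its previous location until vacuum removes it.&lt;&#x2F;p&gt;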
&lt;p&gt;If cleanup does not keep up, dead tuples accumulate.&lt;&#x2F;p&gt;
&lt;p&gt;The table may become physically larger.
Indexes may become less efficient.
Sequential scans may touch more pages.
Index scans may visit more dead entries.
Autovacuum may need to do more work later under worse conditions.&lt;&#x2F;p&gt;
&lt;p&gt;This is how a quiet maintenance lag becomes user-visible latency.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;autovacuum-also-runs-analyze&quot;&gt;Autovacuum also runs ANALYZE&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum is not only about removing dead tuples.&lt;&#x2F;p&gt;
&lt;p&gt;It also triggers &lt;code&gt;ANALYZE&lt;&#x2F;code&gt;, which refreshes planner statistics. PostgreSQL documentation notes that the autovacuum daemon automatically issues &lt;code&gt;ANALYZE&lt;&#x2F;code&gt; when table contents have changed sufficiently. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 24.1. Routine Vacuuming&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;routine-vacuuming.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That matters because the query planner depends on statistics.&lt;&#x2F;p&gt;
&lt;p&gt;Consider this query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM invoices
WHERE account_id = $1
  AND status = &amp;#39;open&amp;#39;
ORDER BY due_date
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The planner needs to estimate:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How many rows match this account_id?
How selective is status = &amp;#39;open&amp;#39;?
Is an index scan cheaper than a sequential scan?
Will sorting be expensive?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If statistics are stale, Postgres may choose a bad plan.&lt;&#x2F;p&gt;
&lt;p&gt;A table with poor vacuum behavior often has poor analyze behavior too. The incident may appear as a slow query, but the underlying issue may be maintenance starvation.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect recent vacuum and analyze activity:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This does not prove bloat by itself, but it shows where cleanup pressure and statistics freshness deserve attention.&lt;&#x2F;p&gt;
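&lt;p&gt;For statistics freshness specifically, &lt;code&gt;pg_stat_user_tables&lt;&#x2F;code&gt; also tracks how many rows changed since the last analyze. A small sketch that ranks tables by that counter; how much change is too much depends on the table and is not something this query decides:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Tables with the most row changes since their statistics were last refreshed.
SELECT
    schemaname,
    relname,
    n_live_tup,
    n_mod_since_analyze,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_mod_since_analyze DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;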
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-most-common-autovacuum-misconception&quot;&gt;The most common autovacuum misconception&lt;&#x2F;h2&gt;
&lt;p&gt;A dangerous sentence:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Autovacuum is using IO, so let’s disable it.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Autovacuum can absolutely create load. It reads pages, cleans dead tuples, updates visibility information, and may generate WAL.&lt;&#x2F;p&gt;
&lt;p&gt;But disabling it usually converts visible maintenance cost into hidden future debt.&lt;&#x2F;p&gt;
&lt;p&gt;That debt comes back as:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;larger tables;
larger indexes;
worse cache efficiency;
slower scans;
stale statistics;
unstable query plans;
wraparound risk;
emergency anti-wraparound vacuum;
operational panic.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;autovacuum&lt;&#x2F;code&gt; setting controls whether the server runs the autovacuum launcher, and it is on by default; the docs also note that &lt;code&gt;track_counts&lt;&#x2F;code&gt; must be enabled for autovacuum to work. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.10. Vacuuming&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-vacuum.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;You can check the basics:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW autovacuum;
SHOW track_counts;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And inspect relevant settings:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    name,
    setting,
    unit,
    context,
    short_desc
FROM pg_settings
WHERE name LIKE &amp;#39;autovacuum%&amp;#39;
   OR name IN (
       &amp;#39;track_counts&amp;#39;,
       &amp;#39;vacuum_cost_delay&amp;#39;,
       &amp;#39;vacuum_cost_limit&amp;#39;,
       &amp;#39;maintenance_work_mem&amp;#39;,
       &amp;#39;autovacuum_work_mem&amp;#39;
   )
ORDER BY name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not to turn autovacuum off.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to make sure it can keep up with the workload.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;dead-tuples-are-a-signal-not-the-whole-diagnosis&quot;&gt;Dead tuples are a signal, not the whole diagnosis&lt;&#x2F;h2&gt;
&lt;p&gt;A common starting point:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup &#x2F; greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is useful, but it has limits.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;n_dead_tup&lt;&#x2F;code&gt; is an estimate. It does not directly equal “bloat”. A table can have many dead tuples and still be manageable if autovacuum is keeping up. Another table can have fewer dead tuples but be operationally sensitive because it is large, hot, heavily indexed, or latency-critical.&lt;&#x2F;p&gt;
&lt;p&gt;Better questions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is the number of dead tuples growing over time?
Does autovacuum run but fail to catch up?
Is the table write-heavy?
Are long transactions preventing cleanup?
Are indexes growing faster than expected?
Did query latency change as dead tuples accumulated?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For incident response, trend is often more important than a single snapshot.&lt;&#x2F;p&gt;
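&lt;p&gt;One way to get a trend is to record snapshots of the counters over time. The following is a sketch only: the &lt;code&gt;dead_tuple_history&lt;&#x2F;code&gt; table name is hypothetical, and the &lt;code&gt;INSERT&lt;&#x2F;code&gt; is assumed to be run periodically by whatever scheduler you already use.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Hypothetical history table with the same column types as the statistics view.
CREATE TABLE IF NOT EXISTS dead_tuple_history AS
SELECT now() AS captured_at, relid, schemaname, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WITH NO DATA;

-- Run periodically to capture one snapshot per table.
INSERT INTO dead_tuple_history
SELECT now(), relid, schemaname, relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables;

-- Change in dead tuples between consecutive snapshots of each table.
SELECT
    relname,
    captured_at,
    n_dead_tup,
    n_dead_tup - lag(n_dead_tup) OVER (
        PARTITION BY relid ORDER BY captured_at
    ) AS dead_tup_change
FROM dead_tuple_history
ORDER BY relname, captured_at;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;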
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;long-transactions-can-block-cleanup&quot;&gt;Long transactions can block cleanup&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum cannot remove row versions that might still be visible to an old transaction.&lt;&#x2F;p&gt;
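&lt;p&gt;That means a single old transaction can hold back cleanup across the whole database: row versions that died after its snapshot was taken stay in place in every table, including tables the transaction never touched.&lt;&#x2F;p&gt;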
&lt;p&gt;Find old transactions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Find sessions idle inside a transaction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE state = &amp;#39;idle in transaction&amp;#39;
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; session may look harmless because it is not actively consuming CPU. But it can prevent cleanup, hold locks, and keep old snapshots alive.&lt;&#x2F;p&gt;
&lt;p&gt;A classic reliability failure chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Application opens transaction] --&amp;gt; B[Transaction becomes idle and remains open]
    B --&amp;gt; C[Updates and deletes continue elsewhere]
    C --&amp;gt; D[Dead tuples cannot be fully cleaned]
    D --&amp;gt; E[Tables and indexes grow]
    E --&amp;gt; F[Queries touch more pages]
    F --&amp;gt; G[Latency increases]
    G --&amp;gt; H([Connection pool saturates])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The visible symptom may be slow queries.&lt;&#x2F;p&gt;
&lt;p&gt;The mechanism may be vacuum being unable to clean because the application is holding old snapshots.&lt;&#x2F;p&gt;
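&lt;p&gt;Transaction age in wall-clock time is only a proxy. A closer view of which sessions are holding old snapshots is the age of &lt;code&gt;backend_xmin&lt;&#x2F;code&gt; in &lt;code&gt;pg_stat_activity&lt;&#x2F;code&gt;, sketched below. It covers regular sessions only; other things that can pin old row versions are outside this query.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Sessions ordered by how far back their snapshot reaches, in transactions.
SELECT
    pid,
    usename,
    application_name,
    state,
    age(backend_xmin) AS xmin_age,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;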
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-table-that-autovacuum-cannot-catch&quot;&gt;The table that autovacuum cannot catch&lt;&#x2F;h2&gt;
&lt;p&gt;Some tables are much harder for autovacuum than others.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;high-update tables;
queue-like tables;
session tables;
event status tables;
tables with frequent DELETE;
tables with many indexes;
tables with very large row counts;
tables with hot tenants or skewed access patterns.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A queue table is a common example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE TABLE jobs (
    id bigserial PRIMARY KEY,
    status text NOT NULL,
    payload jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Workers constantly do:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE jobs
SET status = &amp;#39;running&amp;#39;,
    updated_at = now()
WHERE id = $1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE jobs
SET status = &amp;#39;done&amp;#39;,
    updated_at = now()
WHERE id = $1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;DELETE FROM jobs
WHERE status = &amp;#39;done&amp;#39;
  AND updated_at &amp;lt; now() - interval &amp;#39;7 days&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This table may generate dead tuples continuously.&lt;&#x2F;p&gt;
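&lt;p&gt;A default autovacuum configuration may be too conservative for it. Autovacuum starts on a table once dead tuples exceed &lt;code&gt;autovacuum_vacuum_threshold&lt;&#x2F;code&gt; plus &lt;code&gt;autovacuum_vacuum_scale_factor&lt;&#x2F;code&gt; times the table’s row count. With the defaults of 50 and 0.2, a table with 100 million rows accumulates roughly 20 million dead tuples before vacuum begins.&lt;&#x2F;p&gt;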
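&lt;p&gt;The point at which autovacuum starts on a table can be approximated as the vacuum threshold plus the scale factor multiplied by the table’s estimated row count. The sketch below compares that figure with the current dead tuple estimate. It assumes the global settings apply (per-table overrides are ignored) and relies on &lt;code&gt;pg_class.reltuples&lt;&#x2F;code&gt;, which is itself an estimate.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Approximate dead tuple count at which autovacuum would start, per table.
SELECT
    s.schemaname,
    s.relname,
    s.n_dead_tup,
    round(
        current_setting(&amp;#39;autovacuum_vacuum_threshold&amp;#39;)::numeric
        + current_setting(&amp;#39;autovacuum_vacuum_scale_factor&amp;#39;)::numeric
          * greatest(c.reltuples, 0)::numeric
    ) AS approx_vacuum_trigger
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;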
&lt;p&gt;You can inspect per-table autovacuum options:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = &amp;#39;r&amp;#39;
  AND c.reloptions IS NOT NULL
ORDER BY n.nspname, c.relname;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For a hot table, per-table tuning may be more appropriate than changing global settings:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE jobs SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 5000,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_analyze_threshold = 5000
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is only an example, not a universal recommendation. The right values depend on write rate, table size, IO capacity, latency goals, and how much maintenance work the system can absorb.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL exposes autovacuum thresholds, scale factors, cost delay settings, and worker limits as configuration parameters; these settings control when and how autovacuum runs. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.10. Vacuuming&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-vacuum.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;autovacuum-workers-are-limited&quot;&gt;Autovacuum workers are limited&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum is not an infinite background army.&lt;&#x2F;p&gt;
&lt;p&gt;It has a launcher and a limited number of workers, capped by &lt;code&gt;autovacuum_max_workers&lt;&#x2F;code&gt; (3 by default). If several large or busy tables need cleanup at the same time, some tables wait.&lt;&#x2F;p&gt;
&lt;p&gt;Check running autovacuum activity:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    datname,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS runtime,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE query ILIKE &amp;#39;autovacuum:%&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
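&lt;p&gt;To see whether the worker pool itself is the bottleneck, a quick sketch is to compare the number of running workers with the configured limit. If the two numbers are equal most of the time, tables are probably queuing for a worker.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Running autovacuum workers versus the configured maximum.
SELECT
    current_setting(&amp;#39;autovacuum_max_workers&amp;#39;)::int AS max_workers,
    count(*) AS running_workers
FROM pg_stat_activity
WHERE backend_type = &amp;#39;autovacuum worker&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;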
&lt;p&gt;You can also inspect active vacuum progress. PostgreSQL provides &lt;code&gt;pg_stat_progress_vacuum&lt;&#x2F;code&gt;, with one row for each backend, including autovacuum workers, currently running &lt;code&gt;VACUUM&lt;&#x2F;code&gt;. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 27.4. Progress Reporting&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;progress-reporting.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    p.pid,
    a.datname,
    a.application_name,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This helps answer:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is vacuum currently running?
Which table is it working on?
Is it scanning heap pages?
Is it vacuuming indexes?
Is it spending a long time on one relation?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If autovacuum is always running but dead tuples continue rising, the system may be under-provisioned for its write workload, misconfigured for specific hot tables, blocked by old transactions, or overloaded by competing IO.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;cost-based-delay-autovacuum-can-be-too-polite&quot;&gt;Cost-based delay: autovacuum can be too polite&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum is designed not to overwhelm the system.&lt;&#x2F;p&gt;
&lt;p&gt;That politeness can become a problem.&lt;&#x2F;p&gt;
&lt;p&gt;Cost-based vacuum delay makes vacuum pause during work so it does not consume too many resources at once: vacuum accumulates a cost for each page it reads or dirties, and sleeps for the configured delay whenever the accumulated cost reaches the limit. PostgreSQL exposes cost-based vacuum settings and progress&#x2F;verbose reporting related to this behavior. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.9. Run-time Statistics&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-statistics.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;In a write-heavy system, autovacuum can be so gentle that it never catches up.&lt;&#x2F;p&gt;
&lt;p&gt;The symptom is not that autovacuum is absent.&lt;&#x2F;p&gt;
&lt;p&gt;The symptom is that it is always behind.&lt;&#x2F;p&gt;
&lt;p&gt;You may see:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;autovacuum runs frequently;
dead tuples remain high;
table size keeps growing;
indexes grow disproportionately;
query latency slowly worsens;
manual VACUUM helps temporarily;
the problem returns.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is a capacity mismatch.&lt;&#x2F;p&gt;
&lt;p&gt;The database is generating cleanup work faster than the maintenance system is allowed to process it.&lt;&#x2F;p&gt;
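&lt;p&gt;One possible response is to let autovacuum work faster on the specific table that is falling behind, rather than globally. The sketch below uses the &lt;code&gt;jobs&lt;&#x2F;code&gt; queue table from earlier; the values are illustrative assumptions, not recommendations, and should be checked against the IO headroom the system actually has.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Shorter sleep and larger cost budget for autovacuum on this one table.
ALTER TABLE jobs SET (
    autovacuum_vacuum_cost_delay = 1,
    autovacuum_vacuum_cost_limit = 1000
);

-- Return to the global settings if the extra IO is not acceptable.
ALTER TABLE jobs RESET (
    autovacuum_vacuum_cost_delay,
    autovacuum_vacuum_cost_limit
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;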
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;vacuum-and-indexes&quot;&gt;Vacuum and indexes&lt;&#x2F;h2&gt;
&lt;p&gt;Vacuum is not only about heap tuples.&lt;&#x2F;p&gt;
&lt;p&gt;Indexes also matter.&lt;&#x2F;p&gt;
&lt;p&gt;When tables are updated and deleted, indexes can accumulate dead entries. Vacuum has to deal with those too.&lt;&#x2F;p&gt;
&lt;p&gt;A table with many indexes creates more maintenance work per row change.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_region ON orders(region);
CREATE INDEX idx_orders_status_created ON orders(status, created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
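&lt;p&gt;Every update that changes an indexed column has to add new entries to the table’s indexes, which increases write and maintenance cost. Updates that leave all indexed columns unchanged can often avoid this through heap-only tuple (HOT) updates, provided the page has free space for the new row version.&lt;&#x2F;p&gt;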
&lt;p&gt;A useful index review query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, idx_tup_read ASC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Low usage does not automatically mean an index is safe to drop. It may support rare but critical queries, constraints, or incident workflows.&lt;&#x2F;p&gt;
&lt;p&gt;But unused indexes are not free.&lt;&#x2F;p&gt;
&lt;p&gt;They increase write amplification and vacuum work. In reliability terms, unnecessary indexes are permanent background cost.&lt;&#x2F;p&gt;
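&lt;p&gt;Usage counters are easier to weigh next to size. A small sketch that lists the largest indexes together with their scan counts; a large index with few scans is a candidate for review, not an automatic drop:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Largest indexes with their scan counts since statistics were last reset.
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;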
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wraparound-the-autovacuum-incident-you-really-do-not-want&quot;&gt;Wraparound: the autovacuum incident you really do not want&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres transaction IDs are 32-bit counters, so they are finite and eventually wrap around. To prevent transaction ID wraparound, tables must be vacuumed so old transaction IDs can be frozen. Routine vacuuming documentation explicitly includes protection against transaction ID wraparound as one of the reasons vacuuming is necessary. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 24.1. Routine Vacuuming&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;routine-vacuuming.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;Inspect transaction ID age by table:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN (&amp;#39;r&amp;#39;, &amp;#39;m&amp;#39;)
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Also inspect database age:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    datname,
    age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
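&lt;p&gt;When wraparound risk grows, Postgres becomes increasingly aggressive about vacuuming: once a table’s age passes &lt;code&gt;autovacuum_freeze_max_age&lt;&#x2F;code&gt;, autovacuum forces a vacuum on it even if autovacuum is otherwise disabled. A forced anti-wraparound vacuum that keeps falling behind is not a normal tuning issue. It is a reliability emergency.&lt;&#x2F;p&gt;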
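&lt;p&gt;Raw age numbers are hard to judge on their own. One way to put them in context, sketched below, is to express each database’s age as a percentage of &lt;code&gt;autovacuum_freeze_max_age&lt;&#x2F;code&gt;, the setting that controls when autovacuum forces a vacuum to prevent wraparound. Values well above 100 suggest that forced vacuums are not completing.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Database transaction ID age relative to the forced-vacuum setting.
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    current_setting(&amp;#39;autovacuum_freeze_max_age&amp;#39;)::bigint AS freeze_max_age,
    round(
        100.0 * age(datfrozenxid)
        &#x2F; current_setting(&amp;#39;autovacuum_freeze_max_age&amp;#39;)::bigint,
        1
    ) AS pct_of_freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;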
&lt;p&gt;The worst version of this incident looks like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Autovacuum was disabled or starved.
Old transactions prevented cleanup.
Large tables were not frozen in time.
Wraparound warnings appeared.
Emergency vacuum consumed IO.
Critical workload slowed down.
Operators had limited safe options.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The best time to care about transaction age is long before those warnings appear.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;multixact-age-the-less-famous-cousin&quot;&gt;Multixact age: the less famous cousin&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres also tracks multixact IDs, which are relevant for row locking scenarios such as foreign keys and shared row locks.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect multixact age:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    mxid_age(c.relminmxid) AS mxid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN (&amp;#39;r&amp;#39;, &amp;#39;m&amp;#39;)
ORDER BY mxid_age(c.relminmxid) DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is especially relevant in systems with heavy foreign key activity, concurrent locking, or queue-like patterns.&lt;&#x2F;p&gt;
&lt;p&gt;Many teams monitor transaction ID age but forget multixact age.&lt;&#x2F;p&gt;
&lt;p&gt;That blind spot can turn into a surprise maintenance emergency.&lt;&#x2F;p&gt;
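&lt;p&gt;The same check exists at database level, which makes it easy to add next to an existing transaction ID age monitor. A minimal sketch:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Multixact age per database, oldest first.
SELECT
    datname,
    mxid_age(datminmxid) AS mxid_age
FROM pg_database
ORDER BY mxid_age(datminmxid) DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;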
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;manual-vacuum-is-not-a-magic-button&quot;&gt;Manual VACUUM is not a magic button&lt;&#x2F;h2&gt;
&lt;p&gt;You can run:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;VACUUM VERBOSE orders;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;VACUUM (VERBOSE, ANALYZE) orders;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;PostgreSQL’s &lt;code&gt;VACUUM&lt;&#x2F;code&gt; command reports progress through &lt;code&gt;pg_stat_progress_vacuum&lt;&#x2F;code&gt; for regular vacuum operations; &lt;code&gt;VACUUM FULL&lt;&#x2F;code&gt; is different because it rewrites the table and reports through cluster progress views. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: VACUUM&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;sql-vacuum.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;The distinction is important.&lt;&#x2F;p&gt;
&lt;p&gt;Regular &lt;code&gt;VACUUM&lt;&#x2F;code&gt; cleans up dead tuples and makes space reusable inside the table. It returns space to the operating system only when it can truncate empty pages at the end of the table, so the file on disk usually does not shrink much.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;VACUUM FULL&lt;&#x2F;code&gt; rewrites the table into a new file and can return disk space to the operating system, but it holds an &lt;code&gt;ACCESS EXCLUSIVE&lt;&#x2F;code&gt; lock for the duration, needs enough free disk for the new copy, and is operationally disruptive.&lt;&#x2F;p&gt;
&lt;p&gt;That means this command is not a casual production fix:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;VACUUM FULL orders;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It may block access in ways your product cannot tolerate.&lt;&#x2F;p&gt;
&lt;p&gt;A reliability-minded approach asks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Do we need to improve query performance?
Do we need to recover disk to the OS?
Can regular VACUUM catch up?
Is bloat severe enough to justify a rewrite?
Can we use online rebuild strategies instead?
What is the lock impact?
What is the rollback plan?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Manual vacuuming can help, but it does not replace understanding why autovacuum fell behind.&lt;&#x2F;p&gt;
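&lt;p&gt;As one example of an online rebuild, and assuming PostgreSQL 12 or later and that the bloat sits mostly in an index, &lt;code&gt;REINDEX ... CONCURRENTLY&lt;&#x2F;code&gt; rebuilds an index without blocking normal reads and writes for the whole operation. The index name below is taken from the &lt;code&gt;orders&lt;&#x2F;code&gt; example earlier and is only illustrative. This addresses index bloat, not bloat in the table itself.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Rebuild one index online; it takes longer than a plain REINDEX.
REINDEX INDEX CONCURRENTLY idx_orders_status_created;

-- A failed concurrent rebuild can leave an invalid index behind; check for any.
SELECT indexrelid::regclass AS index_name
FROM pg_index
WHERE NOT indisvalid;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;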
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;logging-autovacuum-activity&quot;&gt;Logging autovacuum activity&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum can be made more observable through logging.&lt;&#x2F;p&gt;
&lt;p&gt;PostgreSQL provides &lt;code&gt;log_autovacuum_min_duration&lt;&#x2F;code&gt;, which logs autovacuum actions exceeding the configured duration; the documentation notes this can help track autovacuum activity. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 19.8. Error Reporting and Logging&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-logging.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER SYSTEM SET log_autovacuum_min_duration = &amp;#39;5s&amp;#39;;
SELECT pg_reload_conf();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or in &lt;code&gt;postgresql.conf&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;conf&quot;&gt;log_autovacuum_min_duration = &amp;#39;5s&amp;#39;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
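&lt;p&gt;In noisy systems, you may choose a higher value. In an investigation, lowering it temporarily can provide evidence: a value of &lt;code&gt;0&lt;&#x2F;code&gt; logs every autovacuum action, and &lt;code&gt;-1&lt;&#x2F;code&gt; disables this logging.&lt;&#x2F;p&gt;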
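&lt;p&gt;The same setting is also available as a per-table storage parameter, which keeps an investigation from flooding the log with entries for every table. A sketch, using the &lt;code&gt;jobs&lt;&#x2F;code&gt; table from earlier as the suspect table:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Log every autovacuum and autoanalyze action on this one table.
ALTER TABLE jobs SET (log_autovacuum_min_duration = 0);

-- Return to the server-wide setting when the investigation is over.
ALTER TABLE jobs RESET (log_autovacuum_min_duration);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;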
&lt;p&gt;Autovacuum logs can reveal:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;which tables are vacuumed often;
which vacuums take a long time;
whether dead tuple cleanup is effective;
whether vacuum is skipped or delayed;
whether index cleanup dominates;
whether analyze is happening regularly.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not to log everything forever.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to make background maintenance visible enough to reason about it.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;autovacuum-and-partitioning&quot;&gt;Autovacuum and partitioning&lt;&#x2F;h2&gt;
&lt;p&gt;Partitioning can make vacuum behavior more manageable, but it does not remove the need for vacuum.&lt;&#x2F;p&gt;
&lt;p&gt;For event-like data, partitioning by time can help because old partitions become mostly static.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE TABLE events (
    id bigint NOT NULL,
    tenant_id bigint NOT NULL,
    event_type text NOT NULL,
    created_at timestamptz NOT NULL,
    payload jsonb NOT NULL
) PARTITION BY RANGE (created_at);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Monthly partitions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE TABLE events_2026_06
PARTITION OF events
FOR VALUES FROM (&amp;#39;2026-06-01&amp;#39;) TO (&amp;#39;2026-07-01&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For append-mostly workloads, old partitions may need less frequent vacuuming after they stop changing.&lt;&#x2F;p&gt;
&lt;p&gt;For hot current partitions, autovacuum still matters.&lt;&#x2F;p&gt;
&lt;p&gt;Partitioning helps when it matches the data lifecycle. It hurts when it is used as a substitute for understanding write patterns.&lt;&#x2F;p&gt;
&lt;p&gt;Bad partitioning can create:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;too many relations;
planning overhead;
operational complexity;
uneven hot partitions;
forgotten per-table settings;
maintenance surprises.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
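&lt;p&gt;Forgotten per-table settings are easy to check for. Autovacuum storage parameters are set on individual partitions, so a new partition starts with the global defaults unless something sets them again. The sketch below assumes the &lt;code&gt;events&lt;&#x2F;code&gt; table and the &lt;code&gt;events_2026_06&lt;&#x2F;code&gt; partition from the example above; the scale factor is illustrative.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Direct partitions of events and any per-table overrides they carry.
SELECT
    c.relname AS partition_name,
    c.reloptions
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = &amp;#39;events&amp;#39;::regclass
ORDER BY c.relname;

-- Apply an override to the hot current partition only.
ALTER TABLE events_2026_06 SET (
    autovacuum_vacuum_scale_factor = 0.05
);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Whatever creates new partitions is usually the right place to apply these settings, so they do not depend on someone remembering.&lt;&#x2F;p&gt;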
&lt;p&gt;The reliability question is not “Should we partition?”&lt;&#x2F;p&gt;
&lt;p&gt;It is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Does partitioning align with how data is written, updated, queried, retained, and vacuumed?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-practical-autovacuum-health-snapshot&quot;&gt;A practical autovacuum health snapshot&lt;&#x2F;h2&gt;
&lt;p&gt;This is not a full runbook, but it gives a useful operational snapshot.&lt;&#x2F;p&gt;
&lt;p&gt;Largest dead tuple estimates:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup &#x2F; greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Tables not recently vacuumed:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup &amp;gt; 0
ORDER BY last_autovacuum NULLS FIRST, n_dead_tup DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Oldest transaction IDs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN (&amp;#39;r&amp;#39;, &amp;#39;m&amp;#39;)
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Current vacuum progress:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Old transactions that can hold cleanup back:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Per-table autovacuum overrides:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = &amp;#39;r&amp;#39;
  AND c.reloptions IS NOT NULL
ORDER BY n.nspname, c.relname;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These queries help build a picture. They do not replace interpretation.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-teams-often-get-wrong&quot;&gt;What teams often get wrong&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;they-only-notice-autovacuum-when-it-hurts&quot;&gt;They only notice autovacuum when it hurts&lt;&#x2F;h3&gt;
&lt;p&gt;If the first time you discuss autovacuum is during an incident, the system has already been running on assumptions.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-use-global-settings-for-table-specific-problems&quot;&gt;They use global settings for table-specific problems&lt;&#x2F;h3&gt;
&lt;p&gt;A single hot table often needs specific tuning. Changing global autovacuum settings can help one table while causing unnecessary maintenance pressure elsewhere.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-ignore-application-transaction-behavior&quot;&gt;They ignore application transaction behavior&lt;&#x2F;h3&gt;
&lt;p&gt;No autovacuum configuration fully compensates for application code that holds transactions open for too long.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-treat-bloat-as-a-one-time-cleanup-task&quot;&gt;They treat bloat as a one-time cleanup task&lt;&#x2F;h3&gt;
&lt;p&gt;Bloat cleanup without workload change is temporary. If the write pattern remains the same, the problem returns.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-forget-that-indexes-multiply-maintenance-cost&quot;&gt;They forget that indexes multiply maintenance cost&lt;&#x2F;h3&gt;
&lt;p&gt;Every unnecessary index makes writes and vacuum more expensive.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;they-use-vacuum-full-too-casually&quot;&gt;They use &lt;code&gt;VACUUM FULL&lt;&#x2F;code&gt; too casually&lt;&#x2F;h3&gt;
&lt;p&gt;It can reclaim disk, but it rewrites the table and can create serious locking impact. It is a maintenance operation, not a harmless cleanup command.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-better-mental-model&quot;&gt;A better mental model&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum is a feedback system.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Application writes] --&amp;gt; B[Updates and deletes create dead tuples]
    B --&amp;gt; C[Autovacuum cleans old versions]
    C --&amp;gt; D[ANALYZE refreshes statistics]
    D --&amp;gt; E[Planner makes better decisions]
    E --&amp;gt; F[Queries stay predictable]
    F --&amp;gt; G([Storage growth remains controlled])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;When that feedback loop breaks, the symptoms appear elsewhere:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;slow queries;
bad plans;
growing storage;
high IO;
replication pressure;
pool saturation;
wraparound warnings;
long incident calls.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is why autovacuum problems are often misdiagnosed.&lt;&#x2F;p&gt;
&lt;p&gt;They do not always announce themselves as “autovacuum failed.”&lt;&#x2F;p&gt;
&lt;p&gt;They appear as system degradation.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-autovacuum-incidents-are-strong-simulation-material&quot;&gt;Why autovacuum incidents are strong simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum incidents are excellent for training because they develop slowly and then become urgent.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic simulation might include:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;a high-update table;
a long idle transaction;
dead tuples accumulating;
query plans becoming unstable;
autovacuum workers running but not catching up;
storage growth;
an engineer proposing to disable autovacuum;
a manual VACUUM that helps only partially;
a wraparound risk warning later in the scenario.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The hard part is not knowing that &lt;code&gt;VACUUM&lt;&#x2F;code&gt; exists.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is connecting weak signals before they become a major incident.&lt;&#x2F;p&gt;
&lt;p&gt;Is autovacuum absent, blocked, too slow, or just overloaded?
Are stale statistics causing bad plans?
Is a long transaction preventing cleanup?
Is the table design generating too much churn?
Is the immediate risk latency, storage, or wraparound?
Should the response be tuning, traffic reduction, transaction cleanup, manual vacuum, index review, or application change?&lt;&#x2F;p&gt;
&lt;p&gt;These decisions require operational judgment.&lt;&#x2F;p&gt;
&lt;p&gt;An article can explain the mechanism.
A dashboard can show the counters.
A simulation forces the team to make decisions while the system is degrading.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum is not background noise.&lt;&#x2F;p&gt;
&lt;p&gt;It is one of the core processes that keeps a Postgres system healthy over time.&lt;&#x2F;p&gt;
&lt;p&gt;When it works well, nobody notices.
When it falls behind, the symptoms can appear as slow queries, unstable plans, growing tables, bloated indexes, IO pressure, storage incidents, or transaction ID wraparound risk.&lt;&#x2F;p&gt;
&lt;p&gt;The right lesson is not “autovacuum is good” or “autovacuum is bad.”&lt;&#x2F;p&gt;
&lt;p&gt;The right lesson is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Autovacuum is part of the workload.
It needs capacity, observability, and tuning.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Reliable Postgres operations require knowing which tables generate cleanup pressure, which transactions prevent cleanup, which indexes amplify maintenance cost, and which alerts reveal trouble early enough to act safely.&lt;&#x2F;p&gt;
&lt;p&gt;Autovacuum is quiet by design.&lt;&#x2F;p&gt;
&lt;p&gt;Database reliability means hearing it before it has to become loud.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Postgres monitoring: which metrics help, and which ones create noise</title>
        <published>2026-04-09T00:00:00+00:00</published>
        <updated>2026-04-09T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/postgres-monitoring-signal-vs-noise/"/>
        <id>https://rillence.com/notes/postgres-monitoring-signal-vs-noise/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/postgres-monitoring-signal-vs-noise/">&lt;p&gt;Most Postgres monitoring starts with good intentions and slowly turns into noise.&lt;&#x2F;p&gt;
&lt;p&gt;A team adds dashboards for CPU, memory, connections, replication lag, locks, slow queries, disk usage, cache hit ratio, autovacuum, checkpoints, WAL, dead tuples, table sizes, index usage, and dozens of other signals.&lt;&#x2F;p&gt;
&lt;p&gt;Then an incident happens.&lt;&#x2F;p&gt;
&lt;p&gt;The dashboard is full of red panels.
Everyone sees something different.
One engineer points at CPU.
Another points at connections.
Someone else sees slow queries.
A replica lag alert fires.
The application shows timeouts.
The team has many metrics, but no clear direction.&lt;&#x2F;p&gt;
&lt;p&gt;That is the core monitoring problem:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;More metrics do not automatically create better reliability.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Good monitoring helps a team form and test hypotheses. Bad monitoring creates panic, false confidence, and alert fatigue.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres reliability monitoring is not about collecting every possible number. It is about knowing which signal answers which operational question.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;monitoring-should-start-with-user-impact&quot;&gt;Monitoring should start with user impact&lt;&#x2F;h2&gt;
&lt;p&gt;A database can look unhealthy while the product is fine.&lt;&#x2F;p&gt;
&lt;p&gt;A database can also look mostly healthy while users are already suffering.&lt;&#x2F;p&gt;
&lt;p&gt;That is why the top layer of monitoring should not be Postgres internals. It should be user-visible behavior.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;API latency
API error rate
checkout failures
login failures
background job delay
queue age
request timeout rate
successful writes per second
customer-facing read latency
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;These are not Postgres metrics, but they are the reason Postgres reliability matters.&lt;&#x2F;p&gt;
&lt;p&gt;A useful monitoring hierarchy looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[User symptoms] --&amp;gt; B[Application behavior]
    B --&amp;gt; C[Connection pool pressure]
    C --&amp;gt; D[Postgres activity]
    D --&amp;gt; E[Storage &#x2F; OS &#x2F; infrastructure]
    E --&amp;gt; F[Replication &#x2F; backup &#x2F; recovery systems]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you start from the bottom, you may optimize the wrong thing.&lt;&#x2F;p&gt;
&lt;p&gt;If you start from user impact, you can ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which database symptom explains the product symptom?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That question is more useful than:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which graph is red?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-metric-is-useful-only-when-it-supports-a-decision&quot;&gt;A metric is useful only when it supports a decision&lt;&#x2F;h2&gt;
&lt;p&gt;A weak alert says:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Database connections are high.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A better alert says:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;User-facing requests are waiting for database connections,
and Postgres active sessions are also elevated.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A weak dashboard says:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;CPU is 90%.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A better dashboard helps answer:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is CPU high because useful work increased,
because a bad query plan appeared,
because concurrency exploded,
or because retries are multiplying traffic?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The value of a metric is not the number itself. The value is the decision it helps with.&lt;&#x2F;p&gt;
&lt;p&gt;For Postgres incidents, useful decisions include:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Should we reduce traffic?
Should we pause background workers?
Should we cancel a query?
Should we cancel a migration?
Should we add capacity?
Should we fail over?
Should we stop retries?
Should we run ANALYZE?
Should we let recovery continue?
Should we avoid touching the database until we know more?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Monitoring should make those decisions safer.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;postgres-has-many-statistics-views-but-they-are-not-a-diagnosis&quot;&gt;Postgres has many statistics views, but they are not a diagnosis&lt;&#x2F;h2&gt;
&lt;p&gt;PostgreSQL exposes a cumulative statistics system that reports server activity, including table and index access, row counts, and vacuum&#x2F;analyze activity. The official monitoring documentation also reminds operators not to ignore OS-level tools such as &lt;code&gt;ps&lt;&#x2F;code&gt;, &lt;code&gt;top&lt;&#x2F;code&gt;, &lt;code&gt;iostat&lt;&#x2F;code&gt;, and &lt;code&gt;vmstat&lt;&#x2F;code&gt;, and to use &lt;code&gt;EXPLAIN&lt;&#x2F;code&gt; for deeper query investigation after identifying a poorly performing query. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 27.2. The Cumulative Statistics System&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;monitoring-stats.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;That means Postgres gives you evidence.&lt;&#x2F;p&gt;
&lt;p&gt;It does not give you the incident narrative automatically.&lt;&#x2F;p&gt;
&lt;p&gt;For example, this query summarizes sessions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can tell you:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;many sessions are active;
many sessions are waiting on locks;
many sessions are idle;
many sessions are idle in transaction;
many sessions are waiting on IO.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the next step is human reasoning:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Why are they active?
Why are they waiting?
What changed?
Which workload owns them?
Are they a cause or a symptom?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A statistics view is not an incident response plan.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;start-with-what-changed&quot;&gt;Start with “what changed?”&lt;&#x2F;h2&gt;
&lt;p&gt;Many Postgres incidents are triggered by change:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;new deploy;
new query shape;
schema migration;
new index;
data import;
traffic spike;
autoscaling event;
background job;
customer onboarding;
configuration change;
replica issue;
storage degradation.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Monitoring should make change visible.&lt;&#x2F;p&gt;
&lt;p&gt;A good incident dashboard should correlate database symptoms with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;deploy markers;
migration start&#x2F;finish events;
feature flag changes;
traffic volume;
worker concurrency;
autoscaling events;
database failover events;
backup windows;
maintenance jobs;
large imports or backfills.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Without change context, metrics are easier to misread.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Connections increased at 12:05.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That could mean:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;traffic increased;
queries became slower;
pool size changed;
a deploy doubled app instances;
a connection leak started;
retries increased;
a lock queue formed.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the dashboard also shows a deployment at 12:03, the investigation starts differently.&lt;&#x2F;p&gt;
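&lt;p&gt;Some change context lives in Postgres itself. A small sketch, assuming the goal is to spot a recent restart, a configuration reload, or settings that differ from the built-in defaults; deploy markers and feature flags still have to come from outside the database:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- When did the server start, and when was the configuration last reloaded?
SELECT
    pg_postmaster_start_time() AS server_started_at,
    pg_conf_load_time() AS config_loaded_at;

-- Settings that differ from built-in defaults,
-- and whether any change is still waiting for a restart.
SELECT name, setting, unit, source, pending_restart
FROM pg_settings
WHERE source NOT IN (&amp;#39;default&amp;#39;, &amp;#39;override&amp;#39;)
ORDER BY name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;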
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;monitor-the-connection-boundary&quot;&gt;Monitor the connection boundary&lt;&#x2F;h2&gt;
&lt;p&gt;Connection pools are where application behavior becomes database pressure.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres-side connection snapshot:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    usename,
    client_addr,
    count(*) AS total,
    count(*) FILTER (WHERE state = &amp;#39;active&amp;#39;) AS active,
    count(*) FILTER (WHERE state = &amp;#39;idle&amp;#39;) AS idle,
    count(*) FILTER (WHERE state = &amp;#39;idle in transaction&amp;#39;) AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr
ORDER BY total DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This answers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which applications are connected?
Who owns the sessions?
How many are active?
How many are idle?
Are any idle inside transactions?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But Postgres cannot fully explain pool behavior. The application must expose:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pool size;
connections in use;
idle pool connections;
pending checkout count;
connection acquisition latency;
pool checkout timeout count;
transaction duration;
query duration;
request duration while holding a connection.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A critical distinction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Waiting for a pool connection is not the same as executing a slow SQL query.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you do not separate those, every incident looks like “Postgres is slow.”&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;monitor-active-sessions-not-just-total-connections&quot;&gt;Monitor active sessions, not just total connections&lt;&#x2F;h2&gt;
&lt;p&gt;Total connections can be misleading.&lt;&#x2F;p&gt;
&lt;p&gt;A database with 300 mostly idle sessions may be healthier than a database with 60 active sessions all fighting over locks or disk.&lt;&#x2F;p&gt;
&lt;p&gt;Useful live activity query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    now() - xact_start AS transaction_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state &amp;lt;&amp;gt; &amp;#39;idle&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This helps distinguish:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;active CPU work;
lock waiting;
IO waiting;
long-running transactions;
stuck migrations;
slow queries;
idle transactions;
client-related waits.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;During incidents, the wait state is often more useful than the raw connection count.&lt;&#x2F;p&gt;
&lt;p&gt;A good dashboard should not only ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How many connections exist?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It should ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What are those connections doing?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;long-transactions-deserve-their-own-panel&quot;&gt;Long transactions deserve their own panel&lt;&#x2F;h2&gt;
&lt;p&gt;Long transactions are behind many Postgres reliability problems:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;vacuum cannot clean old row versions;
schema migrations wait;
row locks remain held;
bloat grows;
replicas can be affected;
connection pools lose capacity;
query behavior becomes harder to explain.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Monitor them directly:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And specifically:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    usename,
    client_addr,
    now() - xact_start AS transaction_age,
    now() - state_change AS idle_age,
    left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = &amp;#39;idle in transaction&amp;#39;
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A mature alert is not simply:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;There is an idle transaction.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It is more like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;An app-owned transaction has been idle for longer than expected
on a database with high write activity or pending migrations.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Context turns noise into signal.&lt;&#x2F;p&gt;
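&lt;p&gt;An alert can be paired with a guardrail. A hedged sketch, assuming a hypothetical application role named &lt;code&gt;app_user&lt;&#x2F;code&gt; and a 60-second limit chosen purely for illustration; the right value depends on the longest legitimate pause inside a transaction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- app_user and the 60s value are illustrative assumptions, not recommendations.
-- Sessions of this role that stay idle inside a transaction
-- longer than the limit are terminated by the server.
ALTER ROLE app_user SET idle_in_transaction_session_timeout = &amp;#39;60s&amp;#39;;

-- Verify the per-role override (it applies to new sessions only).
SELECT r.rolname, s.setconfig
FROM pg_db_role_setting s
JOIN pg_roles r ON r.oid = s.setrole
WHERE r.rolname = &amp;#39;app_user&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The application sees a terminated connection when the limit fires, so it needs to handle reconnects cleanly.&lt;&#x2F;p&gt;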
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;query-monitoring-total-cost-latency-frequency-and-variance&quot;&gt;Query monitoring: total cost, latency, frequency, and variance&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt; is one of the most important extensions for Postgres workload visibility. The official documentation describes it as a module for tracking planning and execution statistics of SQL statements executed by a server. (&lt;a rel=&quot;external&quot; title=&quot;F.32. pg_stat_statements — track statistics of SQL planning ...&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;pgstatstatements.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;The mistake is using only one ranking.&lt;&#x2F;p&gt;
&lt;p&gt;Highest total time:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds queries that consume the most total database time.&lt;&#x2F;p&gt;
&lt;p&gt;Highest average latency:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE calls &amp;gt; 100
ORDER BY mean_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds consistently slow queries.&lt;&#x2F;p&gt;
&lt;p&gt;Highest call count:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds queries that may be cheap individually but expensive in aggregate.&lt;&#x2F;p&gt;
&lt;p&gt;High variance:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE calls &amp;gt; 100
ORDER BY stddev_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds unstable queries.&lt;&#x2F;p&gt;
&lt;p&gt;Each view answers a different question:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Total time: what consumes the database?
Mean time: what is consistently expensive?
Max time: what occasionally explodes?
Call count: what happens too often?
Variance: what depends heavily on parameters or data shape?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If your dashboard only shows “top slow queries,” it may miss high-frequency queries that quietly dominate database load.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-query-logs-are-useful-but-can-become-noise&quot;&gt;Slow-query logs are useful, but can become noise&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres logging can capture slow statements through settings such as &lt;code&gt;log_min_duration_statement&lt;&#x2F;code&gt;, and the logging system supports multiple destinations such as stderr, csvlog, jsonlog, syslog, and eventlog depending on platform and configuration. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: 19.8. Error Reporting and Logging&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-logging.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;Check the current threshold:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SHOW log_min_duration_statement;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Example that logs every statement running longer than 500 ms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER SYSTEM SET log_min_duration_statement = &amp;#39;500ms&amp;#39;;
SELECT pg_reload_conf();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can help identify expensive statements.&lt;&#x2F;p&gt;
&lt;p&gt;But slow-query logs have limitations:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;They show completed statements, not necessarily currently stuck ones.
They can become extremely noisy under incidents.
They may miss high-frequency fast queries that cause aggregate load.
They need application context to be useful.
They can increase log volume significantly.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Slow-query logging is evidence, not a complete monitoring strategy.&lt;&#x2F;p&gt;
&lt;p&gt;For production reliability, combine it with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pg_stat_statements;
application tracing;
pool metrics;
lock monitoring;
wait events;
deployment markers;
request-level latency.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is to connect a slow query to user impact and system pressure.&lt;&#x2F;p&gt;
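&lt;p&gt;Log volume can be kept manageable by sampling statements below the hard threshold. A sketch with illustrative values, assuming PostgreSQL 13 or later (where &lt;code&gt;log_min_duration_sample&lt;&#x2F;code&gt; and &lt;code&gt;log_statement_sample_rate&lt;&#x2F;code&gt; are available) and a session allowed to run &lt;code&gt;ALTER SYSTEM&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Thresholds and rate are illustrative assumptions.
-- Always log statements slower than 2 seconds.
ALTER SYSTEM SET log_min_duration_statement = &amp;#39;2s&amp;#39;;
-- Log roughly 10% of statements that take between 100 ms and 2 s.
ALTER SYSTEM SET log_min_duration_sample = &amp;#39;100ms&amp;#39;;
ALTER SYSTEM SET log_statement_sample_rate = 0.1;
SELECT pg_reload_conf();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Sampling trades completeness for volume, so it suits trend analysis better than reconstructing a single request.&lt;&#x2F;p&gt;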
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;lock-monitoring-must-show-blockers-and-victims&quot;&gt;Lock monitoring must show blockers and victims&lt;&#x2F;h2&gt;
&lt;p&gt;A lock alert that says “lock wait exists” is often too vague.&lt;&#x2F;p&gt;
&lt;p&gt;During production incidents, you need to know:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Who is blocked?
Who is blocking?
How long has the blocker been running?
Is the blocker active or idle in transaction?
Which application owns it?
Is the blocked query user traffic, migration, worker, or admin?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Useful query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    blocked.usename AS blocked_user,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 160) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.usename AS blocking_user,
    blocking.state AS blocking_state,
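    now() - blocking.query_start AS blocking_duration,
    -- locks are held until the transaction ends, so its age matters too
    now() - blocking.xact_start AS blocking_xact_age,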
    left(blocking.query, 160) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Good lock monitoring separates:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;one harmless short lock wait;
a growing lock queue behind a migration;
a long idle transaction blocking DDL;
row-level contention in a hot workflow;
application workers fighting over the same rows.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The metric is not “number of locks.”&lt;&#x2F;p&gt;
&lt;p&gt;The signal is the shape of the blocking chain.&lt;&#x2F;p&gt;
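&lt;p&gt;One way to see that shape is to aggregate by blocker. A minimal sketch that counts only direct victims and approximates wait time by the age of the blocked statement:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- One row per blocker: how many sessions wait directly behind it, and for how long.
SELECT
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.xact_start AS blocking_xact_age,
    count(*) AS blocked_sessions,
    max(now() - blocked.query_start) AS longest_wait
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
GROUP BY blocking.pid, blocking.application_name, blocking.state, blocking.xact_start
ORDER BY blocked_sessions DESC, longest_wait DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A single blocker with many victims points to a different response than many blockers with one victim each.&lt;&#x2F;p&gt;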
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;autovacuum-monitoring-should-focus-on-whether-cleanup-keeps-up&quot;&gt;Autovacuum monitoring should focus on whether cleanup keeps up&lt;&#x2F;h2&gt;
&lt;p&gt;Autovacuum monitoring turns noisy when it tracks the wrong thing.&lt;&#x2F;p&gt;
&lt;p&gt;A graph showing “autovacuum is running” may look scary, but it can be completely normal.&lt;&#x2F;p&gt;
&lt;p&gt;Better questions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Are dead tuples growing over time?
Are hot tables vacuumed often enough?
Are old transactions preventing cleanup?
Are analyze runs keeping statistics fresh?
Are tables approaching transaction ID age risk?
Is autovacuum always running but still not catching up?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Useful table maintenance snapshot:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup &#x2F; greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Current vacuum progress:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Transaction ID age:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
-- include TOAST tables, which carry their own relfrozenxid
WHERE c.relkind IN (&amp;#39;r&amp;#39;, &amp;#39;m&amp;#39;, &amp;#39;t&amp;#39;)
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Autovacuum monitoring should tell you whether maintenance is keeping up with write workload.&lt;&#x2F;p&gt;
&lt;p&gt;If it only tells you that autovacuum exists, it is not enough.&lt;&#x2F;p&gt;
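&lt;p&gt;The question of whether old transactions are preventing cleanup can be approached by listing what holds back the xmin horizon. A sketch that combines the three usual holders; note that a slot&#x27;s &lt;code&gt;catalog_xmin&lt;&#x2F;code&gt; only affects cleanup of system catalogs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Sessions whose snapshot holds back cleanup.
SELECT
    &amp;#39;session&amp;#39; AS holder_type,
    pid::text AS holder,
    age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
UNION ALL
-- Replication slots pinning an old xmin or catalog_xmin.
SELECT
    &amp;#39;replication slot&amp;#39;,
    slot_name::text,
    greatest(age(xmin), age(catalog_xmin))
FROM pg_replication_slots
WHERE xmin IS NOT NULL OR catalog_xmin IS NOT NULL
UNION ALL
-- Prepared transactions that were never committed or rolled back.
SELECT
    &amp;#39;prepared transaction&amp;#39;,
    gid,
    age(transaction)
FROM pg_prepared_xacts
ORDER BY xmin_age DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;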
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;wal-and-checkpoint-monitoring-should-reveal-pressure-chains&quot;&gt;WAL and checkpoint monitoring should reveal pressure chains&lt;&#x2F;h2&gt;
&lt;p&gt;WAL is involved in durability, crash recovery, replication, backups, archiving, and logical decoding.&lt;&#x2F;p&gt;
&lt;p&gt;A WAL incident rarely stays isolated.&lt;&#x2F;p&gt;
&lt;p&gt;Watch:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL generation rate;
pg_wal directory growth;
archiver failures;
replication slot retention;
replica replay lag;
checkpoint frequency;
checkpoint write&#x2F;sync time;
WAL-heavy statements;
large backfills or migrations.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;WAL generation snapshot:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication slots:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Archiver status:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The alert should not be only:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL directory is large.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It should help identify the mechanism:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL is growing because archiving is failing.
WAL is retained by an inactive replication slot.
WAL generation spiked after a backfill.
Replica replay lag is increasing because the primary is producing WAL too quickly.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is the difference between symptom monitoring and reliability monitoring.&lt;&#x2F;p&gt;
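&lt;p&gt;Checkpoint frequency and write&#x2F;sync time from the watch list can be read from the cumulative statistics. A sketch assuming PostgreSQL 17 or later, where these counters live in &lt;code&gt;pg_stat_checkpointer&lt;&#x2F;code&gt;; on 16 and earlier the equivalent columns (&lt;code&gt;checkpoints_timed&lt;&#x2F;code&gt;, &lt;code&gt;checkpoints_req&lt;&#x2F;code&gt;, &lt;code&gt;checkpoint_write_time&lt;&#x2F;code&gt;, &lt;code&gt;checkpoint_sync_time&lt;&#x2F;code&gt;) are in &lt;code&gt;pg_stat_bgwriter&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Counters are cumulative since stats_reset; graph the rate of change, not the raw value.
SELECT
    num_timed,
    num_requested,
    write_time,
    sync_time,
    buffers_written,
    stats_reset
FROM pg_stat_checkpointer;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A rising share of requested checkpoints relative to timed ones often suggests that WAL volume is reaching &lt;code&gt;max_wal_size&lt;&#x2F;code&gt; before &lt;code&gt;checkpoint_timeout&lt;&#x2F;code&gt; elapses.&lt;&#x2F;p&gt;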
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;replication-monitoring-bytes-time-and-product-semantics&quot;&gt;Replication monitoring: bytes, time, and product semantics&lt;&#x2F;h2&gt;
&lt;p&gt;Replication lag is not one metric.&lt;&#x2F;p&gt;
&lt;p&gt;Primary-side view:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Standby-side view:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the product question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Can this read be stale?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Replication monitoring should connect to read-routing behavior.&lt;&#x2F;p&gt;
&lt;p&gt;A replica that is 5 seconds behind may be fine for analytics. It may be unacceptable for permissions, checkout, authentication, or user settings.&lt;&#x2F;p&gt;
&lt;p&gt;Good monitoring distinguishes:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;replica connected;
replica receiving WAL;
replica replaying WAL;
replica serving stale reads;
replica safe for failover;
replica safe for reporting;
replica threatening primary disk through slot retention.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One “replication lag” graph is rarely enough.&lt;&#x2F;p&gt;
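&lt;p&gt;One caveat with the standby-side &lt;code&gt;replay_delay&lt;&#x2F;code&gt; expression: &lt;code&gt;pg_last_xact_replay_timestamp()&lt;&#x2F;code&gt; only advances when transactions are replayed, so on a quiet primary the value keeps growing even though the standby is fully caught up. A common workaround, sketched here, reports zero when everything received has already been replayed:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Run on the standby.
SELECT
    CASE
        WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()
            THEN interval &amp;#39;0&amp;#39;
        ELSE now() - pg_last_xact_replay_timestamp()
    END AS replay_delay;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This does not detect a standby that has stopped receiving WAL altogether, which is one reason the primary-side view still matters.&lt;&#x2F;p&gt;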
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;disk-monitoring-must-distinguish-capacity-from-performance&quot;&gt;Disk monitoring must distinguish capacity from performance&lt;&#x2F;h2&gt;
&lt;p&gt;Disk incidents come in two forms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;capacity problem: storage is filling;
performance problem: storage is too slow for current workload.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Both hurt Postgres, but the response differs.&lt;&#x2F;p&gt;
&lt;p&gt;Capacity signals:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;data directory size;
pg_wal size;
temporary file growth;
table&#x2F;index growth;
backup&#x2F;archive accumulation;
replication slot retention;
available filesystem space.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Performance signals:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;read latency;
write latency;
fsync latency;
IOPS saturation;
queue depth;
checkpoint sync time;
query wait events;
temporary file spills;
backend writes.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres metrics alone are not enough here. The official monitoring chapter explicitly points operators toward OS-level tools in addition to PostgreSQL’s internal statistics. (&lt;a rel=&quot;external&quot; title=&quot;Documentation: 18: Chapter 27. Monitoring Database Activity&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;monitoring.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;A database graph may show slow queries.&lt;&#x2F;p&gt;
&lt;p&gt;The storage graph may reveal the real mechanism.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;cache-hit-ratio-is-often-overrated&quot;&gt;Cache hit ratio is often overrated&lt;&#x2F;h2&gt;
&lt;p&gt;Many dashboards show buffer cache hit ratio.&lt;&#x2F;p&gt;
&lt;p&gt;It can be useful, but it is often overinterpreted.&lt;&#x2F;p&gt;
&lt;p&gt;A high cache hit ratio does not prove the database is healthy.&lt;&#x2F;p&gt;
&lt;p&gt;A low cache hit ratio does not automatically identify the cause of an incident.&lt;&#x2F;p&gt;
&lt;p&gt;Problems:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;large sequential scans can distort the number;
some workloads naturally read cold data;
a high ratio can hide CPU or lock contention;
the metric says little about query shape;
it does not show whether users are impacted.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A better approach is to pair cache-related metrics with:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;query plans;
buffer reads from EXPLAIN;
IO wait events;
storage latency;
table&#x2F;index scan patterns;
top queries by shared blocks read;
application latency.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Cache hit ratio is context, not a primary incident diagnosis.&lt;&#x2F;p&gt;
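&lt;p&gt;One way to turn the ratio into something actionable is to attribute reads to statements. This sketch assumes the &lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt; extension is installed and enabled. The &lt;code&gt;hit_pct&lt;&#x2F;code&gt; alias is an illustrative name:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Statements that pulled the most blocks from outside shared buffers.
SELECT
    calls,
    shared_blks_hit,
    shared_blks_read,
    round(100.0 * shared_blks_hit &#x2F; nullif(shared_blks_hit + shared_blks_read, 0), 1) AS hit_pct,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The counters are cumulative since the last reset, so comparing two snapshots usually says more than one reading.&lt;&#x2F;p&gt;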
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;alert-on-symptoms-investigate-with-causes&quot;&gt;Alert on symptoms, investigate with causes&lt;&#x2F;h2&gt;
&lt;p&gt;A common mistake is alerting on too many internal causes.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;CPU &amp;gt; 80%
connections &amp;gt; 300
dead tuples &amp;gt; threshold
replica lag &amp;gt; threshold
cache hit ratio &amp;lt; threshold
autovacuum running too long
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Some of these are useful. But if every internal metric pages someone, the team learns to ignore alerts.&lt;&#x2F;p&gt;
&lt;p&gt;A healthier alerting model:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Page on user impact and imminent risk.
Ticket on trends and maintenance debt.
Dashboard internal signals for investigation.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Page-worthy examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;user-facing error rate high;
API latency SLO burn;
database unavailable;
disk close to full;
primary cannot write WAL;
replica lag violates product read semantics;
connection exhaustion blocking traffic;
transaction ID age approaching dangerous thresholds;
backup&#x2F;archive pipeline broken beyond recovery objective.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
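&lt;p&gt;For the transaction ID item above, a hedged starting point is the age of each database’s frozen XID. The 2 billion reference value below is a rough assumption close to the wraparound limit, not an official threshold. Choose paging thresholds well below it:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Oldest unfrozen transaction ID age per database.
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    round(100.0 * age(datfrozenxid) &#x2F; 2000000000, 1) AS pct_of_2_billion
FROM pg_database
ORDER BY xid_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;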
&lt;p&gt;Ticket-worthy examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;dead tuples trending up on hot table;
index bloat suspected;
unused indexes accumulating;
autovacuum not keeping up on one table;
slow query variance increasing;
connection usage slowly approaching capacity;
replica lag occasionally above normal but not user-impacting.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Not every red graph deserves a page.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;use-trend-and-rate-not-only-absolute-values&quot;&gt;Use trend and rate, not only absolute values&lt;&#x2F;h2&gt;
&lt;p&gt;A single value can be misleading.&lt;&#x2F;p&gt;
&lt;p&gt;Examples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;1000 dead tuples on a small table may matter.
10 million dead tuples on a huge table may be normal temporarily.

200 connections may be normal for one system.
50 active connections may overload another.

1 GB of WAL may be fine.
1 GB per minute may be alarming.

Replica lag of 2 seconds may be acceptable for reporting.
Replica lag of 2 seconds may be unacceptable for read-after-write flows.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Prefer metrics that show:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;rate of change;
baseline deviation;
duration;
affected workload;
relation to user symptoms;
relation to known changes.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A good alert is rarely “value &amp;gt; threshold.”&lt;&#x2F;p&gt;
&lt;p&gt;It is more often:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;value is above threshold for long enough,
during user-impacting traffic,
and is moving in the wrong direction.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
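&lt;p&gt;The WAL example above can be made concrete by sampling the WAL position twice and dividing by the elapsed time. This is a minimal sketch for a primary. The LSN literal is a placeholder for whatever the first sample returned:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Sample 1: record the time and the current WAL position.
SELECT now() AS sampled_at, pg_current_wal_lsn() AS lsn;

-- Sample 2, some seconds later: bytes of WAL generated since sample 1.
-- Replace the placeholder LSN with the value returned by sample 1.
SELECT
    now() AS sampled_at,
    pg_wal_lsn_diff(pg_current_wal_lsn(), &amp;#39;0&#x2F;1000000&amp;#39;::pg_lsn) AS wal_bytes_since_sample;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A metrics agent would normally do this subtraction for you. The point is that the graph should show bytes per interval, not the absolute position.&lt;&#x2F;p&gt;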
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;version-specific-monitoring-matters&quot;&gt;Version-specific monitoring matters&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres monitoring changes across versions.&lt;&#x2F;p&gt;
&lt;p&gt;Views, columns, and statistics capabilities evolve. For example, modern PostgreSQL versions expose more detailed IO and WAL-related statistics than older versions. Settings such as &lt;code&gt;track_io_timing&lt;&#x2F;code&gt; and &lt;code&gt;track_wal_io_timing&lt;&#x2F;code&gt; add timing information, but they can also add overhead because they repeatedly query the operating system clock. (&lt;a rel=&quot;external&quot; title=&quot;PostgreSQL: Documentation: 18: 19.9. Run-time Statistics&quot; href=&quot;https:&#x2F;&#x2F;www.postgresql.org&#x2F;docs&#x2F;current&#x2F;runtime-config-statistics.html&quot;&gt;PostgreSQL&lt;&#x2F;a&gt;)&lt;&#x2F;p&gt;
&lt;p&gt;This creates a practical rule:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Do not blindly copy monitoring SQL from another Postgres version.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For every dashboard query, know:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;which Postgres versions it supports;
whether required extensions are enabled;
whether timing settings add overhead;
whether statistics reset affects interpretation;
whether managed database providers restrict access;
whether replicas expose the same views.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
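&lt;p&gt;Most of these questions can be answered with a few catalog lookups before a dashboard query is trusted. A minimal sketch:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Numeric server version, for example 160004 for 16.4.
SHOW server_version_num;

-- Timing settings. A setting that does not exist on this version returns no row.
SELECT name, setting
FROM pg_settings
WHERE name IN (&amp;#39;track_io_timing&amp;#39;, &amp;#39;track_wal_io_timing&amp;#39;);

-- Installed extensions in the current database.
SELECT extname, extversion
FROM pg_extension;

-- When cumulative statistics for this database were last reset.
SELECT datname, stats_reset
FROM pg_stat_database
WHERE datname = current_database();

-- True on a standby, where some views and functions behave differently.
SELECT pg_is_in_recovery();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;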
&lt;p&gt;Monitoring should be treated like production code.&lt;&#x2F;p&gt;
&lt;p&gt;It can break, lie, or become outdated.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-minimal-reliability-oriented-postgres-dashboard&quot;&gt;A minimal reliability-oriented Postgres dashboard&lt;&#x2F;h2&gt;
&lt;p&gt;A useful dashboard does not need hundreds of panels.&lt;&#x2F;p&gt;
&lt;p&gt;It should answer the main incident questions quickly.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;1-user-impact&quot;&gt;1. User impact&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;request latency;
error rate;
timeout rate;
business operation success rate;
queue age;
job delay.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;2-application-database-boundary&quot;&gt;2. Application-database boundary&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pool usage;
pool wait time;
pool checkout timeouts;
query duration;
transaction duration;
retry rate;
database errors by type.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;3-postgres-live-activity&quot;&gt;3. Postgres live activity&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;active sessions;
sessions by wait_event_type;
long queries;
long transactions;
idle in transaction;
blocked sessions and blockers.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
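&lt;p&gt;The live-activity panels can be backed by two small queries against &lt;code&gt;pg_stat_activity&lt;&#x2F;code&gt;. This is a sketch, assuming PostgreSQL 10 or later for the &lt;code&gt;backend_type&lt;&#x2F;code&gt; column:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Client sessions grouped by state and wait type.
SELECT state, wait_event_type, count(*) AS sessions
FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39;
GROUP BY state, wait_event_type
ORDER BY sessions DESC;

-- Oldest open transactions, including sessions that are idle in transaction.
SELECT
    pid,
    application_name,
    state,
    now() - xact_start AS xact_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND pid != pg_backend_pid()
ORDER BY xact_start ASC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;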
&lt;h3 id=&quot;4-workload-shape&quot;&gt;4. Workload shape&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;top queries by total time;
top queries by calls;
top queries by mean time;
queries with high variance;
WAL-heavy statements.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;5-maintenance-health&quot;&gt;5. Maintenance health&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;dead tuples;
last autovacuum&#x2F;analyze;
vacuum progress;
transaction ID age;
table and index growth.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;h3 id=&quot;6-wal-checkpoints-storage&quot;&gt;6. WAL, checkpoints, storage&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;WAL generation rate;
pg_wal size;
checkpoint frequency;
checkpoint write&#x2F;sync time;
archiver failures;
disk capacity;
disk latency.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
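&lt;p&gt;Checkpoint and archiver panels are a good illustration of version drift. As a sketch: the checkpoint counters live in &lt;code&gt;pg_stat_bgwriter&lt;&#x2F;code&gt; up to PostgreSQL 16 and in &lt;code&gt;pg_stat_checkpointer&lt;&#x2F;code&gt; from PostgreSQL 17, so only one of the first two statements will run on a given server:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- PostgreSQL 16 and earlier.
SELECT checkpoints_timed, checkpoints_req, checkpoint_write_time, checkpoint_sync_time
FROM pg_stat_bgwriter;

-- PostgreSQL 17 and later.
SELECT num_timed, num_requested, write_time, sync_time
FROM pg_stat_checkpointer;

-- Archiver health.
SELECT archived_count, failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A rising share of requested checkpoints relative to timed ones is usually more informative than either counter alone.&lt;&#x2F;p&gt;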
&lt;h3 id=&quot;7-replication-and-recovery&quot;&gt;7. Replication and recovery&lt;&#x2F;h3&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;replication lag by stage;
standby replay delay;
replication slot retained WAL;
backup&#x2F;archive status;
failover readiness indicators.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The dashboard should be organized around questions, not around PostgreSQL catalog names.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;good-monitoring-supports-hypothesis-driven-debugging&quot;&gt;Good monitoring supports hypothesis-driven debugging&lt;&#x2F;h2&gt;
&lt;p&gt;During an incident, an engineer should be able to move through a chain like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Users see timeouts] --&amp;gt; B[Application pool wait time is rising]
    B --&amp;gt; C[Postgres active sessions are elevated]
    C --&amp;gt; D[Most active sessions wait on Lock]
    D --&amp;gt; E[Blocking query is a migration]
    E --&amp;gt; F[Migration is waiting behind an idle transaction]
    F --&amp;gt; G[Retries are increasing request volume]
    G --&amp;gt; H([Stop retries, cancel migration, or terminate a known-safe blocker])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Or:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Writes are slow.
        ↓
CPU is normal.
        ↓
WAL generation spiked.
        ↓
Checkpoint warnings started.
        ↓
Replica lag is increasing.
        ↓
A backfill began five minutes earlier.
        ↓
Safest mitigation is to pause or throttle the backfill.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is what monitoring is for.&lt;&#x2F;p&gt;
&lt;p&gt;Not to show everything.&lt;&#x2F;p&gt;
&lt;p&gt;To help the team move from symptom to mechanism to decision.&lt;&#x2F;p&gt;
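&lt;p&gt;The final step of the first chain maps to two built-in functions. This is only a sketch: &lt;code&gt;12345&lt;&#x2F;code&gt; is a hypothetical pid standing in for a blocker the team has already confirmed is safe to stop:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Cancel the running statement but keep the session connected.
SELECT pg_cancel_backend(12345);

-- End the session entirely; its open transaction is rolled back.
SELECT pg_terminate_backend(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A session that is idle in transaction has no running statement to cancel, so releasing its locks generally means terminating it.&lt;&#x2F;p&gt;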
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;common-monitoring-anti-patterns&quot;&gt;Common monitoring anti-patterns&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;dashboard-as-decoration&quot;&gt;Dashboard as decoration&lt;&#x2F;h3&gt;
&lt;p&gt;A dashboard nobody uses during incidents is not observability. It is wallpaper.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;too-many-panels-no-hierarchy&quot;&gt;Too many panels, no hierarchy&lt;&#x2F;h3&gt;
&lt;p&gt;If every graph has equal visual importance, the dashboard cannot guide attention.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;alerts-without-ownership&quot;&gt;Alerts without ownership&lt;&#x2F;h3&gt;
&lt;p&gt;Every alert should have an owner, expected action, and reason for existence.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;internal-metrics-without-user-impact&quot;&gt;Internal metrics without user impact&lt;&#x2F;h3&gt;
&lt;p&gt;A database can look noisy without affecting customers. A page should usually be tied to impact or imminent risk.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;user-impact-without-database-detail&quot;&gt;User impact without database detail&lt;&#x2F;h3&gt;
&lt;p&gt;Knowing users are affected is not enough. You need fast paths into database evidence.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;no-deployment-or-migration-markers&quot;&gt;No deployment or migration markers&lt;&#x2F;h3&gt;
&lt;p&gt;Without change context, incidents take longer to explain.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;averaging-away-important-behavior&quot;&gt;Averaging away important behavior&lt;&#x2F;h3&gt;
&lt;p&gt;Mean latency hides outliers. Total time hides variance. Aggregate database metrics hide one bad tenant or one hot table.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;ignoring-application-metrics&quot;&gt;Ignoring application metrics&lt;&#x2F;h3&gt;
&lt;p&gt;Postgres cannot show pool checkout time, retry storms, request deadlines, or business operation failures by itself.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-monitoring-incidents-are-good-simulation-material&quot;&gt;Why monitoring incidents are good simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Monitoring failures are often human failures.&lt;&#x2F;p&gt;
&lt;p&gt;The metrics were there, but nobody knew which ones mattered.
The dashboard showed the answer, but it was buried under noise.
The alert fired, but it was not actionable.
The team watched CPU while the real problem was locks.
The team watched slow queries while the real problem was pool saturation.
The team watched the primary while the replica was serving stale reads.
The team watched database metrics while application retries amplified the incident.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic simulation can train:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;reading dashboards under pressure;
separating symptoms from causes;
forming hypotheses from weak signals;
rejecting misleading metrics;
connecting application and database behavior;
deciding when a metric is actionable;
communicating uncertainty clearly;
choosing mitigations based on evidence.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is the gap articles cannot fully close.&lt;&#x2F;p&gt;
&lt;p&gt;A written guide can explain which metrics exist.
A dashboard can display the signals.
A simulation teaches the team how to reason when ten signals change at once.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres monitoring is not about collecting every metric.&lt;&#x2F;p&gt;
&lt;p&gt;It is about building an evidence system for production decisions.&lt;&#x2F;p&gt;
&lt;p&gt;Good monitoring starts with user impact, connects that impact to application behavior, then follows pressure into Postgres internals, storage, replication, and maintenance systems.&lt;&#x2F;p&gt;
&lt;p&gt;Useful metrics answer operational questions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Are users affected?
Where is the queue?
What changed?
What is Postgres waiting on?
Which workload owns the pressure?
Is this a query, lock, IO, WAL, vacuum, replication, or pool problem?
Is the system getting worse?
Which mitigation reduces risk?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The dangerous phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;We have dashboards, so we are covered.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better reliability question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Can our monitoring help an engineer make the right decision during a confusing incident?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is the difference between metric collection and Postgres database reliability.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>A slow Postgres query is a symptom, not a diagnosis</title>
        <published>2026-04-04T00:00:00+00:00</published>
        <updated>2026-04-04T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/slow-query-is-a-symptom/"/>
        <id>https://rillence.com/notes/slow-query-is-a-symptom/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/slow-query-is-a-symptom/">&lt;p&gt;A slow query is one of the easiest Postgres problems to notice and one of the easiest to misunderstand.&lt;&#x2F;p&gt;
&lt;p&gt;The application times out.
The endpoint gets slower.
The dashboard shows high database time.
Someone finds a query in logs and says:&lt;&#x2F;p&gt;
&lt;blockquote&gt;
&lt;p&gt;“This query is the problem.”&lt;&#x2F;p&gt;
&lt;&#x2F;blockquote&gt;
&lt;p&gt;Maybe it is.&lt;&#x2F;p&gt;
&lt;p&gt;But a slow query is rarely a complete diagnosis. It is a symptom produced by a specific mechanism: a bad plan, missing index, stale statistics, lock contention, IO saturation, parameter sensitivity, table bloat, too much concurrency, or a data distribution change that made yesterday’s assumptions false.&lt;&#x2F;p&gt;
&lt;p&gt;The SQL text is only one part of the story.&lt;&#x2F;p&gt;
&lt;p&gt;A query can become slow without changing at all.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-same-query-can-be-fast-yesterday-and-dangerous-today&quot;&gt;The same query can be fast yesterday and dangerous today&lt;&#x2F;h2&gt;
&lt;p&gt;Consider a simple query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM invoices
WHERE account_id = $1
  AND status = &amp;#39;open&amp;#39;
ORDER BY due_date ASC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This query may be perfectly fine when most accounts have a few hundred invoices.&lt;&#x2F;p&gt;
&lt;p&gt;Then the product grows. One enterprise customer imports millions of invoices. Suddenly, the same query behaves differently for different accounts.&lt;&#x2F;p&gt;
&lt;p&gt;For small accounts, it is still fast.&lt;&#x2F;p&gt;
&lt;p&gt;For one large account, it becomes expensive.&lt;&#x2F;p&gt;
&lt;p&gt;That is not a different query. It is a different data shape.&lt;&#x2F;p&gt;
&lt;p&gt;A useful index might be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_invoices_account_status_due_date
ON invoices (account_id, status, due_date);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the reliability lesson is not simply “add an index.”&lt;&#x2F;p&gt;
&lt;p&gt;The deeper lesson is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Query performance depends on data distribution,
not just on SQL syntax.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A query that was safe when the product was small may become a production risk as the data changes.&lt;&#x2F;p&gt;
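&lt;p&gt;A quick way to see whether the data shape has become skewed is to count rows per account. This sketch uses the &lt;code&gt;invoices&lt;&#x2F;code&gt; table from the example. It scans the whole table, so it is better run on a replica or outside peak traffic:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Accounts with the most invoices.
SELECT account_id, count(*) AS invoice_count
FROM invoices
GROUP BY account_id
ORDER BY invoice_count DESC
LIMIT 10;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;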
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-query-hides-multiple-failure-modes&quot;&gt;“Slow query” hides multiple failure modes&lt;&#x2F;h2&gt;
&lt;p&gt;From the outside, several very different problems can look identical.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;API latency increased
Database time increased
Requests started timing out
Connection pool is full
The same query appears in logs repeatedly
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the underlying cause could be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Missing index
Wrong index order
Stale planner statistics
Bad row estimate
Lock contention
Disk IO saturation
Sort spilling to disk
Too much concurrency
Autovacuum falling behind
Table or index bloat
Parameter-sensitive query plan
Application retry storm
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The same symptom requires different mitigations depending on the mechanism.&lt;&#x2F;p&gt;
&lt;p&gt;That is why “find the slow query” is not enough.&lt;&#x2F;p&gt;
&lt;p&gt;You need to understand why it is slow now.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;start-with-the-query-shape-not-just-the-query-text&quot;&gt;Start with the query shape, not just the query text&lt;&#x2F;h2&gt;
&lt;p&gt;A useful first step is to identify the shape of the query.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM events
WHERE tenant_id = $1
  AND event_type = $2
  AND created_at &amp;gt;= $3
ORDER BY created_at DESC
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This query shape tells you several important things:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;It is tenant-scoped.
It filters by event type.
It uses a time range.
It needs rows in descending time order.
It has a LIMIT.
It may be called frequently.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An index that supports this access pattern may look like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_events_tenant_type_created_desc
ON events (tenant_id, event_type, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But index design depends on real workload. For example, if &lt;code&gt;event_type&lt;&#x2F;code&gt; is not selective, or if most queries do not filter by it, a different index may be better:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_events_tenant_created_desc
ON events (tenant_id, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The key question is not:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Does this query have an index?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Does the index match the actual access pattern?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
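&lt;p&gt;Index usage statistics give a first, imperfect answer to that question. A minimal sketch for the &lt;code&gt;events&lt;&#x2F;code&gt; table from the example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- How often each index on the table has been scanned since the last stats reset.
SELECT
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = &amp;#39;events&amp;#39;
ORDER BY idx_scan DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A large index with very few scans is a candidate for review, although low usage on one server does not prove the index is unused on replicas.&lt;&#x2F;p&gt;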
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;use-explain-but-do-not-worship-it&quot;&gt;Use &lt;code&gt;EXPLAIN&lt;&#x2F;code&gt;, but do not worship it&lt;&#x2F;h2&gt;
&lt;p&gt;The most common tool for investigating a slow query is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE account_id = 123
  AND status = &amp;#39;open&amp;#39;
ORDER BY due_date ASC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can show:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which plan Postgres chose
How many rows it expected
How many rows it actually processed
Whether it used an index
How many buffers were read or hit
Whether sorting happened
Whether the query touched much more data than expected
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, a suspicious plan may show:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Rows expected: 50
Rows actual: 850000
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is not just “slow.” That is a planner estimate problem.&lt;&#x2F;p&gt;
&lt;p&gt;A query with &lt;code&gt;BUFFERS&lt;&#x2F;code&gt; may show heavy reads:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;shared hit blocks: 1200
shared read blocks: 95000
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That suggests the query found few of its pages in shared buffers and had to request most of them from the operating system, which means either the OS page cache or disk.&lt;&#x2F;p&gt;
&lt;p&gt;But &lt;code&gt;EXPLAIN ANALYZE&lt;&#x2F;code&gt; has an important property:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;It actually runs the query.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For &lt;code&gt;SELECT&lt;&#x2F;code&gt;, that is usually acceptable in a safe environment, though it can still be expensive.&lt;&#x2F;p&gt;
&lt;p&gt;For writes, be careful. This executes the write:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = &amp;#39;expired&amp;#39;
WHERE expires_at &amp;lt; now();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A safer pattern for investigation is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = &amp;#39;expired&amp;#39;
WHERE expires_at &amp;lt; now();

ROLLBACK;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Even then, the database still performs work and may take locks while the statement runs. Do not treat diagnostic queries as harmless in production.&lt;&#x2F;p&gt;
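&lt;p&gt;One way to bound that risk is to give the diagnostic transaction its own timeouts. This is a sketch, and the timeout values are assumptions to adjust per system:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

-- SET LOCAL applies only until this transaction ends.
SET LOCAL statement_timeout = &amp;#39;5s&amp;#39;;
SET LOCAL lock_timeout = &amp;#39;1s&amp;#39;;

EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = &amp;#39;expired&amp;#39;
WHERE expires_at &amp;lt; now();

ROLLBACK;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If either timeout fires, the statement fails and the transaction is aborted. The final &lt;code&gt;ROLLBACK&lt;&#x2F;code&gt; still cleans it up.&lt;&#x2F;p&gt;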
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-a-missing-index&quot;&gt;Slow because of a missing index&lt;&#x2F;h2&gt;
&lt;p&gt;The simplest case is a query that has no useful index.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM users
WHERE lower(email) = lower($1);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An ordinary index on &lt;code&gt;email&lt;&#x2F;code&gt; may not help because the query applies a function:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_users_email
ON users (email);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Postgres may need an expression index instead:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_users_lower_email
ON users (lower(email));
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Another example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A single-column index on &lt;code&gt;customer_id&lt;&#x2F;code&gt; may help filtering but not ordering:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A better index for this query shape may be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_created_desc
ON orders (customer_id, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But even here, the right fix depends on the workload.&lt;&#x2F;p&gt;
&lt;p&gt;If the table is write-heavy, every new index has a cost. It slows down writes, consumes disk, increases vacuum work, and adds operational risk during creation.&lt;&#x2F;p&gt;
&lt;p&gt;The index may fix one query and harm the system elsewhere.&lt;&#x2F;p&gt;
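&lt;p&gt;One concrete creation risk is that a failed or cancelled &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt; leaves an invalid index behind. It is not used by queries but is still maintained on writes. A sketch for finding such leftovers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Indexes currently marked invalid.
-- An index that is still being built concurrently also appears here.
SELECT
    t.relname AS table_name,
    c.relname AS index_name
FROM pg_index i
JOIN pg_class c ON c.oid = i.indexrelid
JOIN pg_class t ON t.oid = i.indrelid
WHERE NOT i.indisvalid;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;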
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-stale-statistics&quot;&gt;Slow because of stale statistics&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres uses statistics to choose query plans.&lt;&#x2F;p&gt;
&lt;p&gt;If statistics are stale or too coarse, the planner may choose a bad plan.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect table statistics freshness:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_analyze,
    last_autoanalyze,
    last_vacuum,
    last_autovacuum
FROM pg_stat_user_tables
WHERE relname = &amp;#39;invoices&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If a table changed significantly and has not been analyzed recently, Postgres may make poor estimates.&lt;&#x2F;p&gt;
&lt;p&gt;You can manually refresh statistics:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ANALYZE invoices;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Sometimes a specific column needs better statistics because values are highly skewed:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ALTER TABLE invoices
ALTER COLUMN account_id SET STATISTICS 1000;

ANALYZE invoices;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This does not make the query faster directly. It gives the planner better information.&lt;&#x2F;p&gt;
&lt;p&gt;The incident pattern often looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Data distribution changes
        ↓
Planner estimates become inaccurate
        ↓
Postgres chooses a bad plan
        ↓
Query latency increases
        ↓
Application holds connections longer
        ↓
Pool saturates
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The SQL did not change. The planner’s model of the data became wrong.&lt;&#x2F;p&gt;
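&lt;p&gt;To see the planner’s current model directly, the &lt;code&gt;pg_stats&lt;&#x2F;code&gt; view exposes what the last &lt;code&gt;ANALYZE&lt;&#x2F;code&gt; recorded. A minimal sketch for two columns of the &lt;code&gt;invoices&lt;&#x2F;code&gt; example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- What the planner believes about value distribution in these columns.
SELECT
    attname,
    null_frac,
    n_distinct,
    most_common_vals,
    most_common_freqs
FROM pg_stats
WHERE tablename = &amp;#39;invoices&amp;#39;
  AND attname IN (&amp;#39;account_id&amp;#39;, &amp;#39;status&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If a known heavy account is missing from &lt;code&gt;most_common_vals&lt;&#x2F;code&gt;, the planner is likely to underestimate its row count.&lt;&#x2F;p&gt;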
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-parameter-sensitivity&quot;&gt;Slow because of parameter sensitivity&lt;&#x2F;h2&gt;
&lt;p&gt;Some queries behave very differently depending on parameter values.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM messages
WHERE workspace_id = $1
  AND created_at &amp;gt;= now() - interval &amp;#39;7 days&amp;#39;
ORDER BY created_at DESC
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For most workspaces, this returns a few rows.&lt;&#x2F;p&gt;
&lt;p&gt;For one very large workspace, it may scan millions.&lt;&#x2F;p&gt;
&lt;p&gt;This becomes especially tricky when prepared statements or generic plans are involved. The planner may choose a plan that is “reasonable on average” but bad for important parameter values.&lt;&#x2F;p&gt;
&lt;p&gt;The query is not universally slow. It is selectively slow.&lt;&#x2F;p&gt;
&lt;p&gt;That distinction matters.&lt;&#x2F;p&gt;
&lt;p&gt;Averages hide this problem. You need to look for variance.&lt;&#x2F;p&gt;
&lt;p&gt;With &lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt;, this kind of query may have a moderate mean but a terrible max:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY max_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A query with high standard deviation may be more interesting than a query with the highest average time.&lt;&#x2F;p&gt;
&lt;p&gt;A reliability-minded question is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this query always slow,
or only slow for certain tenants, users, statuses, or time ranges?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That question often changes the fix.&lt;&#x2F;p&gt;
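&lt;p&gt;To compare the plan chosen for a specific value against the generic plan, one option on PostgreSQL 12 or later is &lt;code&gt;plan_cache_mode&lt;&#x2F;code&gt;. This is a sketch: the statement name, the &lt;code&gt;bigint&lt;&#x2F;code&gt; parameter type, and the value &lt;code&gt;42&lt;&#x2F;code&gt; are assumptions for illustration:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;PREPARE recent_messages(bigint) AS
SELECT *
FROM messages
WHERE workspace_id = $1
  AND created_at &amp;gt;= now() - interval &amp;#39;7 days&amp;#39;
ORDER BY created_at DESC
LIMIT 100;

-- Plan built for this specific parameter value.
SET plan_cache_mode = force_custom_plan;
EXPLAIN EXECUTE recent_messages(42);

-- Plan built without looking at the parameter value.
SET plan_cache_mode = force_generic_plan;
EXPLAIN EXECUTE recent_messages(42);

RESET plan_cache_mode;
DEALLOCATE recent_messages;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Plain &lt;code&gt;EXPLAIN&lt;&#x2F;code&gt; does not execute the statement, so this comparison is cheap. Repeating it with a small and a large workspace shows whether the two plans diverge.&lt;&#x2F;p&gt;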
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-locks&quot;&gt;Slow because of locks&lt;&#x2F;h2&gt;
&lt;p&gt;A query may appear slow even when its execution plan is fine.&lt;&#x2F;p&gt;
&lt;p&gt;It may simply be waiting.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;UPDATE accounts
SET status = &amp;#39;disabled&amp;#39;
WHERE id = $1;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can be fast in normal conditions. But if another transaction holds a row lock on the same account, the update waits.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect lock waits:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = &amp;#39;Lock&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;To find blockers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is a very different failure mode from a missing index.&lt;&#x2F;p&gt;
&lt;p&gt;Adding an index will not fix a lock wait.&lt;&#x2F;p&gt;
&lt;p&gt;Running &lt;code&gt;EXPLAIN ANALYZE&lt;&#x2F;code&gt; later may show a fast plan, because the lock contention is gone.&lt;&#x2F;p&gt;
&lt;p&gt;That is why incident context matters. The query plan after the incident may not reproduce the incident.&lt;&#x2F;p&gt;
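&lt;p&gt;Because the evidence disappears once the blocker finishes, it can help to have Postgres log lock waits as they happen. This is a sketch that assumes superuser access. Managed providers often expose the same setting through their own configuration interface instead:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Log a message when a session waits longer than deadlock_timeout for a lock.
ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();

-- The wait threshold that triggers the log message.
SHOW deadlock_timeout;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;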
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-io-saturation&quot;&gt;Slow because of IO saturation&lt;&#x2F;h2&gt;
&lt;p&gt;A query can be slow because it is doing too much disk work.&lt;&#x2F;p&gt;
&lt;p&gt;But it can also be slow because some other operation is saturating disk.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A concurrent index build
A large vacuum
A checkpoint spike
A reporting query
A backup process
A sequential scan on another table
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The query you see in logs may be a victim, not the cause.&lt;&#x2F;p&gt;
&lt;p&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS)&lt;&#x2F;code&gt; can show whether the query reads many blocks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM events
WHERE tenant_id = 42
ORDER BY created_at DESC
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But to understand system-wide pressure, you also need to look at active queries:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE state != &amp;#39;idle&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
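&lt;p&gt;A query waiting on IO typically shows &lt;code&gt;wait_event_type = IO&lt;&#x2F;code&gt; with events such as &lt;code&gt;DataFileRead&lt;&#x2F;code&gt; or &lt;code&gt;DataFileWrite&lt;&#x2F;code&gt;; the exact names depend on Postgres version and workload.&lt;&#x2F;p&gt;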
&lt;p&gt;The important operational distinction:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this query slow because it performs too much IO,
or because the database storage is already saturated by something else?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The mitigation is different.&lt;&#x2F;p&gt;
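&lt;p&gt;To check whether a maintenance operation is the one competing for disk, the built-in progress views are a reasonable first look. This is a sketch rather than a complete IO diagnosis: it only covers vacuums and index builds, and the &lt;code&gt;regclass&lt;&#x2F;code&gt; casts resolve names only for tables in the database you are connected to.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Vacuums currently running, with rough progress through the heap.
SELECT
    pid,
    relid::regclass AS table_name,
    phase,
    heap_blks_scanned,
    heap_blks_total
FROM pg_stat_progress_vacuum;

-- Index builds currently running (Postgres 12 or later).
SELECT
    pid,
    relid::regclass AS table_name,
    index_relid::regclass AS index_name,
    phase,
    blocks_done,
    blocks_total
FROM pg_stat_progress_create_index;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If either view returns rows for a large table during the slowdown, the slow query in the logs is more likely a victim than the cause.&lt;&#x2F;p&gt;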
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-sorting-or-memory-pressure&quot;&gt;Slow because of sorting or memory pressure&lt;&#x2F;h2&gt;
&lt;p&gt;A query may use an index for filtering but still sort a large result set.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM audit_log
WHERE organization_id = $1
ORDER BY created_at DESC
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If the index does not support the order, Postgres has to fetch every matching row and sort them before it can return the first 100.&lt;&#x2F;p&gt;
&lt;p&gt;A useful index:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_audit_log_org_created_desc
ON audit_log (organization_id, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In plans, watch for:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Sort
Sort Method: external merge Disk
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
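&lt;p&gt;That means the sort ran out of &lt;code&gt;work_mem&lt;&#x2F;code&gt; and spilled to disk. It mostly shows up on sorts with no &lt;code&gt;LIMIT&lt;&#x2F;code&gt; or a large one; with a small &lt;code&gt;LIMIT&lt;&#x2F;code&gt;, Postgres usually keeps only the top rows in memory (&lt;code&gt;top-N heapsort&lt;&#x2F;code&gt;), although it still has to read every matching row.&lt;&#x2F;p&gt;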
&lt;p&gt;A simplified example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM audit_log
WHERE organization_id = 123
ORDER BY created_at DESC
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If you see disk-based sorting, increasing &lt;code&gt;work_mem&lt;&#x2F;code&gt; might help in some cases. But changing &lt;code&gt;work_mem&lt;&#x2F;code&gt; globally can be dangerous because the limit applies to each sort or hash operation in each session, not to the server as a whole.&lt;&#x2F;p&gt;
&lt;p&gt;A query with multiple sort&#x2F;hash nodes across many concurrent sessions can multiply memory usage quickly.&lt;&#x2F;p&gt;
&lt;p&gt;This is why “just increase memory” is often a risky incident response.&lt;&#x2F;p&gt;
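&lt;p&gt;A narrower alternative is to raise &lt;code&gt;work_mem&lt;&#x2F;code&gt; for a single transaction only. The sketch below assumes the query is a known heavy sort that you control; the &lt;code&gt;64MB&lt;&#x2F;code&gt; value is illustrative, not a recommendation.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

-- SET LOCAL lasts only until COMMIT or ROLLBACK; other sessions keep the default.
SET LOCAL work_mem = &amp;#39;64MB&amp;#39;;

SELECT *
FROM audit_log
WHERE organization_id = 123
ORDER BY created_at DESC;

COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This keeps the extra memory scoped to one statement instead of multiplying it across every connection.&lt;&#x2F;p&gt;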
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-bloat&quot;&gt;Slow because of bloat&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres uses MVCC. Updates and deletes leave old row versions behind until vacuum can clean them up.&lt;&#x2F;p&gt;
&lt;p&gt;If vacuum falls behind, tables and indexes can become bloated.&lt;&#x2F;p&gt;
&lt;p&gt;A bloated table means Postgres may need to scan more pages to get the same useful data.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect dead tuple pressure:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup &#x2F; greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
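&lt;p&gt;This is not a perfect bloat measurement: the counters are estimates, and they say nothing about empty space left behind in pages that vacuum has already cleaned. It is still a useful signal.&lt;&#x2F;p&gt;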
&lt;p&gt;A common incident chain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[Long transaction remains open] --&amp;gt; B[Vacuum cannot clean old row versions]
    B --&amp;gt; C[Dead tuples accumulate]
    C --&amp;gt; D[Table and index scans become more expensive]
    D --&amp;gt; E[Query latency increases]
    E --&amp;gt; F[More connections remain busy]
    F --&amp;gt; G([System degrades])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The slow query is only the visible symptom.&lt;&#x2F;p&gt;
&lt;p&gt;The root issue may be a long transaction or vacuum starvation.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect old transactions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Again, a slow query may be downstream of a completely different operational failure.&lt;&#x2F;p&gt;
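&lt;p&gt;Transaction age alone does not show which sessions are actually holding back cleanup. A sketch that ranks sessions by how old their snapshot is, using the &lt;code&gt;backend_xmin&lt;&#x2F;code&gt; column of &lt;code&gt;pg_stat_activity&lt;&#x2F;code&gt;:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    state,
    -- Number of transactions started since this session&amp;#39;s snapshot was taken.
    age(backend_xmin) AS xmin_age_in_transactions,
    now() - xact_start AS transaction_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Sessions are not the only possible cause: replication slots and prepared transactions can also prevent vacuum from removing old row versions, and they do not appear in this query.&lt;&#x2F;p&gt;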
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;slow-because-of-too-much-concurrency&quot;&gt;Slow because of too much concurrency&lt;&#x2F;h2&gt;
&lt;p&gt;A query can be individually acceptable but collectively harmful.&lt;&#x2F;p&gt;
&lt;p&gt;Example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM product_recommendations
WHERE user_id = $1
ORDER BY score DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;One execution is fine.
Ten executions are fine.
Five thousand concurrent executions during a traffic spike are not fine.&lt;&#x2F;p&gt;
&lt;p&gt;This is the difference between query latency and system throughput.&lt;&#x2F;p&gt;
&lt;p&gt;A query does not have to be “bad” to cause an incident. It only has to be too frequent, too concurrent, or too poorly bounded.&lt;&#x2F;p&gt;
&lt;p&gt;This often happens with retries.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Database gets slower
        ↓
Application requests time out
        ↓
Application retries
        ↓
Database receives more duplicate work
        ↓
Database gets even slower
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;At that point, optimizing the query may help later, but the immediate mitigation might be reducing concurrency, disabling a worker, rate-limiting retries, or shedding non-critical load.&lt;&#x2F;p&gt;
&lt;p&gt;A database incident is often a traffic-shaping problem, not just a SQL problem.&lt;&#x2F;p&gt;
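&lt;p&gt;One way to see this pattern in the database itself is to count active sessions per application and query prefix. This is a rough sketch: grouping on the first 80 characters of the query text is an approximation, and it assumes clients set &lt;code&gt;application_name&lt;&#x2F;code&gt;.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    application_name,
    left(query, 80) AS query_preview,
    count(*) AS active_sessions
FROM pg_stat_activity
WHERE state = &amp;#39;active&amp;#39;
  AND backend_type = &amp;#39;client backend&amp;#39;
GROUP BY application_name, left(query, 80)
ORDER BY active_sessions DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Hundreds of sessions running the same cheap query point toward load shaping, not toward tuning that query.&lt;&#x2F;p&gt;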
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;finding-important-queries-with-pg-stat-statements&quot;&gt;Finding important queries with &lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt;&lt;&#x2F;h2&gt;
&lt;p&gt;&lt;code&gt;pg_stat_statements&lt;&#x2F;code&gt; is one of the most useful Postgres extensions for understanding workload.&lt;&#x2F;p&gt;
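&lt;p&gt;A basic view of expensive queries (the column names below are from Postgres 13 and later; older versions use &lt;code&gt;total_time&lt;&#x2F;code&gt;, &lt;code&gt;mean_time&lt;&#x2F;code&gt;, and so on):&lt;&#x2F;p&gt;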
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    max_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But different orderings answer different questions.&lt;&#x2F;p&gt;
&lt;p&gt;Highest total time:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds queries that consume the most database time overall.&lt;&#x2F;p&gt;
&lt;p&gt;Highest mean time:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE calls &amp;gt; 100
ORDER BY mean_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds consistently expensive queries.&lt;&#x2F;p&gt;
&lt;p&gt;Highest call count:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    mean_exec_time,
    total_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds queries that may be cheap individually but expensive in aggregate.&lt;&#x2F;p&gt;
&lt;p&gt;High variance:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE calls &amp;gt; 100
ORDER BY stddev_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This finds queries that behave unpredictably.&lt;&#x2F;p&gt;
&lt;p&gt;The important part is choosing the right question.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Total time asks: what consumes the database?
Mean time asks: what is slow on average?
Max time asks: what occasionally explodes?
Calls asks: what is happening too often?
Variance asks: what behaves differently across inputs?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A production investigation needs all of these perspectives.&lt;&#x2F;p&gt;
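&lt;p&gt;Raw totals are hard to compare at a glance. A possible variation, sketched here with Postgres 13+ column names, expresses each statement as a share of all recorded execution time:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    calls,
    round(total_exec_time::numeric, 1) AS total_ms,
    round(
        (100.0 * total_exec_time
            &#x2F; nullif(sum(total_exec_time) OVER (), 0))::numeric,
        2
    ) AS percent_of_total,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The window function runs before &lt;code&gt;LIMIT&lt;&#x2F;code&gt;, so the percentage is relative to every statement in the view. The counters are cumulative since the last &lt;code&gt;pg_stat_statements_reset()&lt;&#x2F;code&gt;, so they describe the whole period, not only the incident window.&lt;&#x2F;p&gt;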
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-add-an-index-is-sometimes-the-wrong-fix&quot;&gt;Why “add an index” is sometimes the wrong fix&lt;&#x2F;h2&gt;
&lt;p&gt;Indexes are powerful. Many incidents are fixed by adding or changing an index.&lt;&#x2F;p&gt;
&lt;p&gt;But indexes are not free.&lt;&#x2F;p&gt;
&lt;p&gt;Every index has costs:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;More disk usage
Slower inserts
Slower updates
Slower deletes
More WAL generation
More vacuum work
More memory pressure
Longer backup&#x2F;restore times
Operational risk during creation
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An index can also be technically used but not useful enough.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_status
ON orders (status);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;If &lt;code&gt;status = &#x27;active&#x27;&lt;&#x2F;code&gt; matches 80% of the table, this index may not be very selective. Postgres may correctly choose a sequential scan.&lt;&#x2F;p&gt;
&lt;p&gt;A more useful partial index might be:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_pending_created
ON orders (created_at)
WHERE status = &amp;#39;pending&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This can be valuable if &lt;code&gt;pending&lt;&#x2F;code&gt; is rare and frequently queried.&lt;&#x2F;p&gt;
&lt;p&gt;But partial indexes require discipline. The planner uses one only when it can prove that the query&#x27;s &lt;code&gt;WHERE&lt;&#x2F;code&gt; clause implies the index predicate.&lt;&#x2F;p&gt;
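&lt;p&gt;To illustrate with the partial index above, here is a sketch of one query that can use it and one that cannot. The &lt;code&gt;id&lt;&#x2F;code&gt; column and the &lt;code&gt;failed&lt;&#x2F;code&gt; status are assumptions made for the example.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Can use idx_orders_pending_created: the filter matches the index predicate.
SELECT id, created_at
FROM orders
WHERE status = &amp;#39;pending&amp;#39;
  AND created_at &amp;lt; now() - interval &amp;#39;1 hour&amp;#39;
ORDER BY created_at
LIMIT 100;

-- Cannot use it: rows with status = &amp;#39;failed&amp;#39; are not in the index.
SELECT id, created_at
FROM orders
WHERE status IN (&amp;#39;pending&amp;#39;, &amp;#39;failed&amp;#39;)
ORDER BY created_at
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A parameterized filter such as &lt;code&gt;status = $1&lt;&#x2F;code&gt; deserves the same care: when Postgres plans the statement without knowing the parameter value, it cannot prove the match and will not choose the partial index.&lt;&#x2F;p&gt;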
&lt;p&gt;A slow query investigation should ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What exact access pattern are we optimizing?
How often does it run?
How many rows does it usually return?
How selective are the filters?
Does the query need ordering?
Is the index worth the write cost?
Can the index be created safely under current load?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Without those questions, indexing becomes guesswork.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;query-performance-is-part-of-application-design&quot;&gt;Query performance is part of application design&lt;&#x2F;h2&gt;
&lt;p&gt;A database can only do so much if the application asks expensive questions.&lt;&#x2F;p&gt;
&lt;p&gt;For example, this pattern is dangerous:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM events
WHERE tenant_id = $1
ORDER BY created_at DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;No limit. No time bound. Potentially huge result set.&lt;&#x2F;p&gt;
&lt;p&gt;A safer shape:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM events
WHERE tenant_id = $1
  AND created_at &amp;lt; $2
ORDER BY created_at DESC
LIMIT 100;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
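&lt;p&gt;This supports keyset pagination, where &lt;code&gt;$2&lt;&#x2F;code&gt; is the &lt;code&gt;created_at&lt;&#x2F;code&gt; of the last row on the previous page, and it gives the database a bounded amount of work.&lt;&#x2F;p&gt;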
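&lt;p&gt;If several events can share the same &lt;code&gt;created_at&lt;&#x2F;code&gt;, paging on the timestamp alone can skip rows at a page boundary. A common refinement, sketched here assuming &lt;code&gt;events&lt;&#x2F;code&gt; has a unique &lt;code&gt;id&lt;&#x2F;code&gt; column, adds a tie-breaker:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- $2 and $3 are created_at and id of the last row on the previous page.
SELECT *
FROM events
WHERE tenant_id = $1
  AND (created_at, id) &amp;lt; ($2, $3)
ORDER BY created_at DESC, id DESC
LIMIT 100;

-- A hypothetical index that matches this access pattern.
CREATE INDEX CONCURRENTLY idx_events_tenant_created_id
ON events (tenant_id, created_at DESC, id DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The row comparison keeps the page boundary exact, and each page costs roughly the same no matter how deep the user scrolls.&lt;&#x2F;p&gt;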
&lt;p&gt;Another dangerous pattern is N+1 queries:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Load 100 orders
For each order, query customer
For each customer, query latest invoice
For each invoice, query payment status
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Individually, each query may be fast.&lt;&#x2F;p&gt;
&lt;p&gt;Together, they create a database pressure pattern.&lt;&#x2F;p&gt;
&lt;p&gt;A better approach may use joins, batching, caching, or precomputed views, depending on the system.&lt;&#x2F;p&gt;
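&lt;p&gt;As one possible shape for the join-based option, the four steps above can collapse into a single round trip. The table and column names here (&lt;code&gt;customers.name&lt;&#x2F;code&gt;, &lt;code&gt;invoices.customer_id&lt;&#x2F;code&gt;, &lt;code&gt;payments.invoice_id&lt;&#x2F;code&gt;) are hypothetical, and the sketch assumes at most one payment row per invoice.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- $1 is an array of order ids, for example the 100 orders on the current page.
SELECT
    o.id AS order_id,
    c.name AS customer_name,
    i.id AS latest_invoice_id,
    p.status AS payment_status
FROM orders o
JOIN customers c ON c.id = o.customer_id
-- Pick the newest invoice per customer without a separate query per row.
LEFT JOIN LATERAL (
    SELECT inv.id
    FROM invoices inv
    WHERE inv.customer_id = c.id
    ORDER BY inv.created_at DESC
    LIMIT 1
) i ON true
LEFT JOIN payments p ON p.invoice_id = i.id
WHERE o.id = ANY($1);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Whether this beats batching or caching depends on the data volumes; the point is that the database sees one bounded query instead of several hundred small ones.&lt;&#x2F;p&gt;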
&lt;p&gt;The database is not just a storage layer. It is part of the application’s execution model.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-post-incident-question-should-be-deeper-than-which-query-was-slow&quot;&gt;The post-incident question should be deeper than “which query was slow?”&lt;&#x2F;h2&gt;
&lt;p&gt;After a query-related incident, a weak review says:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A query was slow.
We added an index.
The incident is resolved.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A stronger review asks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Why did this query become slow now?
Was the data distribution different from staging?
Did the query pattern change in a release?
Did we have statistics drift?
Was the index missing, wrong, or too expensive to maintain?
Did retries amplify the load?
Did the connection pool hide early symptoms?
Did dashboards show query variance or only averages?
Could we have detected this before users did?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The goal is not to blame a query.&lt;&#x2F;p&gt;
&lt;p&gt;The goal is to improve the system’s ability to survive workload changes.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-useful-mental-model&quot;&gt;A useful mental model&lt;&#x2F;h2&gt;
&lt;p&gt;When you see a slow query, do not stop at the SQL text.&lt;&#x2F;p&gt;
&lt;p&gt;Walk through the layers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[SQL shape] --&amp;gt; B[Planner estimates] --&amp;gt; C[Chosen plan] --&amp;gt; D[Index access]
    D --&amp;gt; E[Rows scanned vs rows returned] --&amp;gt; F[Buffers hit vs read] --&amp;gt; G[Sort &#x2F; hash behavior]
    G --&amp;gt; H[Lock waits] --&amp;gt; I[Transaction age] --&amp;gt; J[Concurrency]
    J --&amp;gt; K[Connection pool behavior] --&amp;gt; L[Application retries] --&amp;gt; M([User-visible impact])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This does not mean every incident requires checking everything manually.&lt;&#x2F;p&gt;
&lt;p&gt;It means the query is part of a system.&lt;&#x2F;p&gt;
&lt;p&gt;The diagnosis is the mechanism, not the symptom.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-slow-query-incidents-are-good-simulation-material&quot;&gt;Why slow-query incidents are good simulation material&lt;&#x2F;h2&gt;
&lt;p&gt;Slow-query incidents are excellent for training because they are deceptively familiar.&lt;&#x2F;p&gt;
&lt;p&gt;Most engineers know how to read a query.
Many know how to run &lt;code&gt;EXPLAIN&lt;&#x2F;code&gt;.
Some know how to add an index.&lt;&#x2F;p&gt;
&lt;p&gt;But production incidents are harder than that.&lt;&#x2F;p&gt;
&lt;p&gt;A realistic simulation forces questions like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this query the cause or a victim?
Is the plan bad or is it waiting on a lock?
Is the index missing or are statistics wrong?
Is the database overloaded by this query or by retries?
Should we add an index now or reduce traffic first?
Is the safest action in SQL, application config, or deployment rollback?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is the skill gap.&lt;&#x2F;p&gt;
&lt;p&gt;Articles can explain the mechanics.
Queries can reveal evidence.
But operational judgment comes from practicing the loop:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart LR
    S[Symptom] --&amp;gt; H[Hypothesis] --&amp;gt; I[Inspection] --&amp;gt; D[Decision] --&amp;gt; C[Consequence]
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;In production, every decision has side effects.&lt;&#x2F;p&gt;
&lt;p&gt;A simulation lets teams experience those side effects before they are dealing with real customers, real data, and real pressure.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;A slow Postgres query is not a diagnosis.&lt;&#x2F;p&gt;
&lt;p&gt;It is a signal.&lt;&#x2F;p&gt;
&lt;p&gt;Sometimes the fix is an index.
Sometimes it is &lt;code&gt;ANALYZE&lt;&#x2F;code&gt;.
Sometimes it is rewriting the query.
Sometimes it is reducing concurrency.
Sometimes it is stopping retries.
Sometimes it is killing a blocking transaction.
Sometimes it is changing application behavior.
Sometimes it is doing nothing immediately and collecting better evidence first.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is not finding a slow query.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is understanding why it became slow, why it became slow now, and what action will reduce risk without making the system worse.&lt;&#x2F;p&gt;
&lt;p&gt;That is the core of Postgres database reliability: not just knowing how queries work, but understanding how query behavior emerges from data, workload, concurrency, and operational decisions.&lt;&#x2F;p&gt;
</content>
        
    </entry>
    <entry xml:lang="en">
        <title>Why Postgres reliability cannot be learned from documentation alone</title>
        <published>2026-03-27T00:00:00+00:00</published>
        <updated>2026-03-27T00:00:00+00:00</updated>
        
        <author>
          <name>
            
              Unknown
            
          </name>
        </author>
        
        <link rel="alternate" type="text/html" href="https://rillence.com/notes/reliability-not-learned-from-docs/"/>
        <id>https://rillence.com/notes/reliability-not-learned-from-docs/</id>
        
        <content type="html" xml:base="https://rillence.com/notes/reliability-not-learned-from-docs/">&lt;p&gt;Postgres documentation is excellent.&lt;&#x2F;p&gt;
&lt;p&gt;It explains MVCC, locks, WAL, indexes, replication, vacuum, isolation levels, planner behavior, configuration, backup, recovery, and hundreds of other details. If you operate Postgres seriously, you should read it.&lt;&#x2F;p&gt;
&lt;p&gt;But documentation is not the same as operational readiness.&lt;&#x2F;p&gt;
&lt;p&gt;Documentation teaches mechanisms.
Incidents test judgment.&lt;&#x2F;p&gt;
&lt;p&gt;A production incident does not usually announce itself as:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;This is a lock queue caused by an ACCESS EXCLUSIVE lock waiting behind an idle transaction.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It looks more like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;API latency is rising.
The pool is full.
Some queries are slow.
A migration started recently.
CPU is not that high.
Replica lag is increasing.
Users are reporting timeouts.
The team is not sure whether to cancel, wait, kill, rollback, or fail over.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is the gap.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres reliability is not only about knowing how Postgres works. It is about making safe decisions when Postgres, the application, infrastructure, traffic, and human pressure interact.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;documentation-explains-components-but-incidents-combine-them&quot;&gt;Documentation explains components, but incidents combine them&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres documentation is organized by topics:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Locks
Transactions
Indexes
VACUUM
WAL
Replication
Configuration
Monitoring
Backup and restore
Query planning
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That structure is necessary for learning.&lt;&#x2F;p&gt;
&lt;p&gt;But real incidents rarely respect that structure.&lt;&#x2F;p&gt;
&lt;p&gt;A single production problem can involve:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;mermaid&quot;&gt;flowchart TD
    A[A new release] --&amp;gt; B[A query plan regression]
    B --&amp;gt; C[Longer transaction time]
    C --&amp;gt; D[Connection pool saturation]
    D --&amp;gt; E[Aggressive retries]
    E --&amp;gt; F[Higher database concurrency]
    F --&amp;gt; G[Autovacuum falling behind]
    G --&amp;gt; H[Replica lag]
    H --&amp;gt; I([User-visible timeouts])
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Which chapter is that?&lt;&#x2F;p&gt;
&lt;p&gt;It is not one chapter. It is the interaction of many systems.&lt;&#x2F;p&gt;
&lt;p&gt;That is why reading about locks does not automatically prepare someone for a migration incident. Reading about &lt;code&gt;VACUUM&lt;&#x2F;code&gt; does not automatically prepare someone for a bloat-driven latency degradation. Reading about replication does not automatically prepare someone to decide whether failover is safe.&lt;&#x2F;p&gt;
&lt;p&gt;The hard part is synthesis.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;knowing-the-command-is-not-the-same-as-knowing-when-to-use-it&quot;&gt;Knowing the command is not the same as knowing when to use it&lt;&#x2F;h2&gt;
&lt;p&gt;Many Postgres incident actions are simple at the command level.&lt;&#x2F;p&gt;
&lt;p&gt;Cancel a query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_cancel_backend(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Terminate a backend:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_terminate_backend(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Analyze a table:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;ANALYZE invoices;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Create an index concurrently:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Promote a standby:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_promote();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Drop a replication slot:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_drop_replication_slot(&amp;#39;old_slot&amp;#39;);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;None of these commands are hard to type.&lt;&#x2F;p&gt;
&lt;p&gt;The difficult questions are operational:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is this backend safe to terminate?
Will cancellation cause retries that make the incident worse?
Is ANALYZE enough, or is the query slow because of locks?
Can we afford the IO of a concurrent index right now?
Is the standby fresh enough to promote?
Could the old primary still accept writes?
Is this replication slot abandoned, or does a downstream system still need it?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Documentation can tell you what a command does.&lt;&#x2F;p&gt;
&lt;p&gt;It cannot decide whether using it right now reduces risk.&lt;&#x2F;p&gt;
&lt;p&gt;That decision depends on context.&lt;&#x2F;p&gt;
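&lt;p&gt;Some of that context can be gathered before acting. For the replication slot case, a sketch like the following (run on the primary, Postgres 10 or later) shows whether a slot is in use and how much WAL it is holding back:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    slot_name,
    slot_type,
    active,
    restart_lsn,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An inactive slot with a large amount of retained WAL is a candidate for removal, but the query cannot tell you whether a downstream consumer is merely paused. That still has to be confirmed with whoever owns the consumer.&lt;&#x2F;p&gt;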
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;incidents-are-full-of-partial-evidence&quot;&gt;Incidents are full of partial evidence&lt;&#x2F;h2&gt;
&lt;p&gt;In a calm environment, you can investigate carefully.&lt;&#x2F;p&gt;
&lt;p&gt;During an incident, the evidence is incomplete and changing.&lt;&#x2F;p&gt;
&lt;p&gt;You may see:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;And get something like:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;active | Lock | transactionid | 47
active | IO   | DataFileRead  | 12
idle   |      |               | 180
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This does not automatically tell you what to do.&lt;&#x2F;p&gt;
&lt;p&gt;You need to ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Are lock waits the cause or a consequence?
Who is blocking whom?
Did a migration start?
Are retries increasing concurrency?
Are the IO waits caused by the same workload?
Are idle connections normal or part of pool exhaustion?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then you inspect blockers:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Now you find a blocker.&lt;&#x2F;p&gt;
&lt;p&gt;But even then, you still need judgment.&lt;&#x2F;p&gt;
&lt;p&gt;Killing the blocker may fix the incident.
It may also roll back important work, trigger retries, break a migration, or create more load.&lt;&#x2F;p&gt;
&lt;p&gt;The database gives evidence. It does not give certainty.&lt;&#x2F;p&gt;
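&lt;p&gt;Part of that judgment is looking at the blocker itself before touching it. A small sketch, where &lt;code&gt;12345&lt;&#x2F;code&gt; is a placeholder for the &lt;code&gt;blocking_pid&lt;&#x2F;code&gt; found above:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    now() - state_change AS time_in_state,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE pid = 12345;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The &lt;code&gt;state&lt;&#x2F;code&gt; column matters for the next step. If the blocker is &lt;code&gt;idle in transaction&lt;&#x2F;code&gt;, there is no running query for &lt;code&gt;pg_cancel_backend&lt;&#x2F;code&gt; to cancel, so the lock is released only when the client finishes the transaction or the backend is terminated.&lt;&#x2F;p&gt;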
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;runbooks-help-but-they-are-not-enough&quot;&gt;Runbooks help, but they are not enough&lt;&#x2F;h2&gt;
&lt;p&gt;Runbooks are valuable.&lt;&#x2F;p&gt;
&lt;p&gt;A good runbook can say:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;If lock waits are high:
1. Identify blocked sessions.
2. Identify blockers.
3. Check whether the blocker is a migration, application query, or idle transaction.
4. Check user impact.
5. Prefer cancellation before termination where possible.
6. Escalate before terminating unknown critical sessions.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is useful.&lt;&#x2F;p&gt;
&lt;p&gt;But production incidents often violate the clean path.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;The blocker is a migration.
The migration is important.
The migration has already partially completed.
Application retries are increasing pressure.
The team cannot immediately tell whether canceling is safe.
The service owner is offline.
A background job is also holding connections.
Replica lag is rising.
The incident commander wants a decision now.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The runbook can guide thinking.&lt;&#x2F;p&gt;
&lt;p&gt;It cannot replace thinking.&lt;&#x2F;p&gt;
&lt;p&gt;A weak reliability culture treats runbooks as scripts.
A strong reliability culture treats runbooks as decision support.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-dangerous-middle-when-several-actions-are-plausible&quot;&gt;The dangerous middle: when several actions are plausible&lt;&#x2F;h2&gt;
&lt;p&gt;Many Postgres incidents are hard because multiple actions seem reasonable.&lt;&#x2F;p&gt;
&lt;p&gt;Imagine this situation:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;API latency is high.
Connection pool wait time is rising.
Postgres has many active sessions.
Top queries show one expensive query shape.
Replica lag is increasing.
A backfill started ten minutes ago.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Possible actions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Pause the backfill.
Reduce application concurrency.
Cancel slow queries.
Increase pool size.
Add an index.
Move reads to a replica.
Disable retries.
Scale the database.
Wait and observe.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Several of these may be valid in different scenarios.&lt;&#x2F;p&gt;
&lt;p&gt;The wrong action can amplify the incident.&lt;&#x2F;p&gt;
&lt;p&gt;Increasing the pool may push more work into Postgres.
Moving reads to a lagging replica may serve stale data.
Adding an index may create more IO and WAL during an already overloaded period.
Canceling queries may trigger retries.
Waiting may be correct if the system is recovering, or disastrous if pressure is still growing.&lt;&#x2F;p&gt;
&lt;p&gt;The skill is not knowing a list of actions.&lt;&#x2F;p&gt;
&lt;p&gt;The skill is understanding the likely consequence of each action under current conditions.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;documentation-teaches-normal-behavior-incidents-expose-edge-behavior&quot;&gt;Documentation teaches normal behavior; incidents expose edge behavior&lt;&#x2F;h2&gt;
&lt;p&gt;Most engineers learn Postgres features in their normal form.&lt;&#x2F;p&gt;
&lt;p&gt;A transaction groups work:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 1;

UPDATE accounts
SET balance = balance + 100
WHERE id = 2;

COMMIT;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An index speeds up access:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A replica provides another copy of data:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;primary → standby
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Autovacuum cleans old row versions.&lt;&#x2F;p&gt;
&lt;p&gt;WAL protects durability.&lt;&#x2F;p&gt;
&lt;p&gt;All of this is true.&lt;&#x2F;p&gt;
&lt;p&gt;But incidents live in the edge behavior.&lt;&#x2F;p&gt;
&lt;p&gt;A transaction becomes dangerous when it stays open for 45 minutes.&lt;&#x2F;p&gt;
&lt;p&gt;An index build becomes dangerous when it competes with peak traffic.&lt;&#x2F;p&gt;
&lt;p&gt;A replica becomes dangerous when the application assumes it is always fresh.&lt;&#x2F;p&gt;
&lt;p&gt;Autovacuum becomes dangerous when it cannot keep up with write churn.&lt;&#x2F;p&gt;
&lt;p&gt;WAL becomes dangerous when a backfill generates more than archiving and replication can consume.&lt;&#x2F;p&gt;
&lt;p&gt;The feature is not the problem.
The production interaction is the problem.&lt;&#x2F;p&gt;
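&lt;p&gt;Some of these edges can be bounded with configuration. For the long-open transaction case, one hedged sketch is a per-role timeout; &lt;code&gt;app_user&lt;&#x2F;code&gt; is a hypothetical role name and the values are illustrative only.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- End sessions that sit idle inside an open transaction for more than 60 seconds.
ALTER ROLE app_user SET idle_in_transaction_session_timeout = &amp;#39;60s&amp;#39;;

-- Cancel any single statement that runs longer than 30 seconds.
ALTER ROLE app_user SET statement_timeout = &amp;#39;30s&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Role-level settings apply to new sessions only, and a timeout that is too aggressive can break legitimate batch work, so migrations and reporting jobs usually need their own role or an explicit override.&lt;&#x2F;p&gt;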
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-documentation-says-create-index-concurrently-production-asks-when&quot;&gt;Example: documentation says &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt;, production asks “when?”&lt;&#x2F;h2&gt;
&lt;p&gt;A team finds a slow query:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT *
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 50;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The likely index is obvious:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;CREATE INDEX CONCURRENTLY idx_orders_customer_created
ON orders (customer_id, created_at DESC);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Documentation can explain why &lt;code&gt;CONCURRENTLY&lt;&#x2F;code&gt; avoids blocking writes.&lt;&#x2F;p&gt;
&lt;p&gt;But production readiness requires more questions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;How large is the table?
How many indexes already exist?
How much WAL will this generate?
Will replica lag violate read expectations?
Is storage already under pressure?
Is autovacuum currently behind?
Are we in peak traffic?
Can the migration framework run this outside a transaction?
What happens if the index build fails?
Who will clean up an invalid index?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The SQL is technically correct.&lt;&#x2F;p&gt;
&lt;p&gt;That does not make the timing safe.&lt;&#x2F;p&gt;
&lt;p&gt;Reliability depends on knowing when the right command is wrong for the current system state.&lt;&#x2F;p&gt;
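&lt;p&gt;The last two questions in that list have a concrete follow-up. A failed or canceled &lt;code&gt;CREATE INDEX CONCURRENTLY&lt;&#x2F;code&gt; leaves an invalid index behind, which still adds write overhead. A sketch for finding and removing it:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Indexes that exist but are not usable by queries.
SELECT
    indexrelid::regclass AS index_name,
    indrelid::regclass AS table_name
FROM pg_index
WHERE NOT indisvalid;

-- Remove the leftover before retrying the build.
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_customer_created;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An index that is still being built concurrently also shows up as invalid, so check that no build is running before dropping anything.&lt;&#x2F;p&gt;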
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-documentation-says-failover-is-possible-production-asks-is-it-safe&quot;&gt;Example: documentation says failover is possible, production asks “is it safe?”&lt;&#x2F;h2&gt;
&lt;p&gt;Promotion can be simple:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT pg_promote();
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;But the operational question is not whether you can promote.&lt;&#x2F;p&gt;
&lt;p&gt;It is whether promotion improves the situation.&lt;&#x2F;p&gt;
&lt;p&gt;Before failover, you need to know:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Is the primary truly unavailable?
Can the old primary still accept writes?
How far behind is the standby?
What data loss is acceptable?
How will applications reconnect?
What happens to connection pools?
Will background workers follow the new primary?
What happens to read replicas?
What happens to logical replication slots?
Can the old primary be fenced?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A standby that exists but is 20 minutes behind may not be a safe target.&lt;&#x2F;p&gt;
&lt;p&gt;A standby that is current but cannot accept the full write workload may fail shortly after promotion.&lt;&#x2F;p&gt;
&lt;p&gt;A failover that leaves the old primary alive can create data divergence.&lt;&#x2F;p&gt;
&lt;p&gt;Documentation explains promotion.&lt;&#x2F;p&gt;
&lt;p&gt;Practice teaches hesitation, verification, and controlled execution.&lt;&#x2F;p&gt;
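&lt;p&gt;The question “How far behind is the standby?” has a measurable answer. The following is a sketch that assumes physical streaming replication and the function names used since PostgreSQL 10. The first query runs on the primary, the second on the standby.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- On the primary: per-standby lag in bytes and as time intervals.
SELECT
    application_name,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

-- On the standby: received vs replayed WAL position,
-- and the time since the last replayed commit.
SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS received_lsn,
    pg_last_wal_replay_lsn() AS replayed_lsn,
    now() - pg_last_xact_replay_timestamp() AS time_since_last_replayed_commit;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The last column also grows when the primary is simply idle, so it should be read together with the byte lag rather than on its own. If the primary is unreachable, only the standby query is available, which is one reason to record normal lag before an incident.&lt;&#x2F;p&gt;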
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;example-documentation-says-vacuum-cleans-dead-tuples-production-asks-why-is-it-behind&quot;&gt;Example: documentation says vacuum cleans dead tuples, production asks “why is it behind?”&lt;&#x2F;h2&gt;
&lt;p&gt;A table shows many dead tuples:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A beginner may think:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Run VACUUM.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A more experienced operator asks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Why did dead tuples accumulate?
Is cleanup held back by a long transaction?
Is the table too hot for default thresholds?
Are there too many indexes?
Is a backfill creating churn?
Is the table design queue-like?
Are we seeing bloat or just temporary dead tuple pressure?
Will manual vacuum compete with user traffic?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Then they check old transactions:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The command &lt;code&gt;VACUUM&lt;&#x2F;code&gt; is easy.&lt;&#x2F;p&gt;
&lt;p&gt;Understanding why cleanup failed is the reliability work.&lt;&#x2F;p&gt;
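&lt;p&gt;Long transactions are not the only thing that can hold cleanup back. The sketch below looks at three common holders of the xmin horizon: sessions, replication slots, and prepared transactions. It only reads standard system views. Which age counts as “too old” is an assumption each team has to set for its own workload.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Sessions whose snapshot holds back the xmin horizon, oldest first.
SELECT
    pid,
    application_name,
    state,
    age(backend_xmin) AS xmin_age,
    now() - xact_start AS transaction_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC;

-- Replication slots that pin old row versions.
SELECT
    slot_name,
    slot_type,
    active,
    age(xmin) AS xmin_age,
    age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;

-- Prepared transactions that were never committed or rolled back.
SELECT
    gid,
    prepared,
    owner,
    age(transaction) AS xid_age
FROM pg_prepared_xacts;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;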
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;operational-skill-means-recognizing-patterns&quot;&gt;Operational skill means recognizing patterns&lt;&#x2F;h2&gt;
&lt;p&gt;In real incidents, the exact details vary.&lt;&#x2F;p&gt;
&lt;p&gt;But patterns repeat.&lt;&#x2F;p&gt;
&lt;p&gt;A lock incident has a recognizable shape:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Latency rises.
Connections increase.
Many sessions wait on Lock.
A migration or transaction is in the blocking chain.
The pool fills behind blocked work.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A connection storm has a recognizable shape:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;App instances increase.
Total connections rise sharply.
Many sessions are active or waiting.
Pool timeouts appear.
Retries multiply traffic.
Postgres slows under concurrency.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A WAL pressure incident has a recognizable shape:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;A bulk operation starts.
WAL generation spikes.
Checkpoints become more frequent.
Replica lag grows.
Archiving may fall behind.
Write latency worsens.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A vacuum starvation incident has a recognizable shape:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Dead tuples trend upward.
Old transactions exist.
Autovacuum runs but does not catch up.
Table&#x2F;index size grows.
Query performance degrades gradually.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Pattern recognition does not come only from reading.&lt;&#x2F;p&gt;
&lt;p&gt;It comes from seeing scenarios, making decisions, and observing consequences.&lt;&#x2F;p&gt;
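&lt;p&gt;Each shape can be checked against evidence. As one illustration, the connection storm shape can be compared with a quick count of client sessions. This is a sketch, assuming PostgreSQL 10 or later for the &lt;code&gt;backend_type&lt;&#x2F;code&gt; column. If a pooler sits in front of the database, &lt;code&gt;application_name&lt;&#x2F;code&gt; may identify the pooler rather than the service.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Who holds connections, and what are those sessions doing?
SELECT
    application_name,
    state,
    count(*) AS sessions
FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39;
GROUP BY application_name, state
ORDER BY sessions DESC;

-- How close is the total to the configured limit?
SELECT
    count(*) AS client_sessions,
    current_setting(&amp;#39;max_connections&amp;#39;)::int AS max_connections
FROM pg_stat_activity
WHERE backend_type = &amp;#39;client backend&amp;#39;;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;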
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-hardest-part-is-prioritization&quot;&gt;The hardest part is prioritization&lt;&#x2F;h2&gt;
&lt;p&gt;During a Postgres incident, there are usually too many possible investigations.&lt;&#x2F;p&gt;
&lt;p&gt;You can inspect:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pg_stat_activity
pg_locks
pg_stat_statements
pg_stat_replication
pg_replication_slots
pg_stat_user_tables
pg_stat_progress_vacuum
pg_stat_progress_create_index
pg_stat_wal
pg_stat_archiver
application pool metrics
request traces
deployment history
OS IO metrics
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;All of them may be relevant.&lt;&#x2F;p&gt;
&lt;p&gt;But you cannot investigate everything at once.&lt;&#x2F;p&gt;
&lt;p&gt;Operational readiness means knowing what to check first based on symptoms.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;If users wait for DB connections:
    start at application pool metrics and pg_stat_activity.

If sessions wait on Lock:
    identify blockers and recent migrations.

If writes are slow and replica lag grows:
    inspect WAL generation, checkpoints, storage, and recent bulk operations.

If queries became slow gradually:
    inspect query plans, table statistics, dead tuples, bloat signals, and data growth.

If reads from replicas are inconsistent:
    inspect replay delay and read-routing assumptions.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The skill is navigation.&lt;&#x2F;p&gt;
&lt;p&gt;Documentation gives the map.
Incidents require route selection.&lt;&#x2F;p&gt;
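&lt;p&gt;For the lock route, “identify blockers” can start with a query like the one below. It is a sketch that relies on &lt;code&gt;pg_blocking_pids()&lt;&#x2F;code&gt;, available since PostgreSQL 9.6. The column aliases are illustrative.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Pair every session waiting on a lock with the sessions blocking it.
SELECT
    waiting.pid AS waiting_pid,
    left(waiting.query, 80) AS waiting_query,
    blocker.pid AS blocking_pid,
    blocker.application_name AS blocking_application,
    blocker.state AS blocking_state,
    now() - blocker.xact_start AS blocking_transaction_age,
    left(blocker.query, 80) AS blocking_query
FROM pg_stat_activity AS waiting
CROSS JOIN LATERAL unnest(pg_blocking_pids(waiting.pid)) AS b(pid)
JOIN pg_stat_activity AS blocker ON blocker.pid = b.pid
WHERE waiting.wait_event_type = &amp;#39;Lock&amp;#39;
ORDER BY blocking_transaction_age DESC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A blocker in state &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; with a large transaction age usually points to application code that opened a transaction and never finished it, not to the migration that is waiting behind it.&lt;&#x2F;p&gt;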
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;good-operators-know-what-not-to-touch&quot;&gt;Good operators know what not to touch&lt;&#x2F;h2&gt;
&lt;p&gt;A major difference between junior and senior incident response is restraint.&lt;&#x2F;p&gt;
&lt;p&gt;During a database incident, doing something feels better than doing nothing.&lt;&#x2F;p&gt;
&lt;p&gt;But some actions are dangerous without enough evidence:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Increasing max_connections
Increasing pool size
Killing random backends
Dropping replication slots
Running VACUUM FULL
Creating emergency indexes
Failing over prematurely
Restarting Postgres
Disabling autovacuum
Changing durability settings
Deleting files from pg_wal
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Some of these actions can be correct in specific situations.&lt;&#x2F;p&gt;
&lt;p&gt;The danger is using them as reflexes.&lt;&#x2F;p&gt;
&lt;p&gt;Reliability is not only the ability to act.&lt;&#x2F;p&gt;
&lt;p&gt;It is the ability to delay unsafe action long enough to understand the system, while still acting quickly enough to reduce impact.&lt;&#x2F;p&gt;
&lt;p&gt;That balance cannot be learned from syntax alone.&lt;&#x2F;p&gt;
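&lt;p&gt;Even “killing a backend” comes in different strengths. The sketch below contrasts canceling a query with terminating a session. The pid &lt;code&gt;12345&lt;&#x2F;code&gt; is a placeholder, not a value from a real system.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- First confirm what the session is and who owns it.
SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE pid = 12345;

-- Milder: cancel only the running query. The session stays connected.
SELECT pg_cancel_backend(12345);

-- Stronger: end the whole session. Its open transaction is rolled back.
SELECT pg_terminate_backend(12345);
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Canceling has no effect on a session that is idle in transaction, because no query is running. Releasing its locks requires termination, which is why that step deserves an explicit decision.&lt;&#x2F;p&gt;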
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;documentation-does-not-teach-team-coordination&quot;&gt;Documentation does not teach team coordination&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres incidents are rarely solved by one person silently running SQL.&lt;&#x2F;p&gt;
&lt;p&gt;They involve communication:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Who is incident commander?
Who owns the application?
Who owns the database?
Who can pause workers?
Who can roll back deploys?
Who can approve failover?
Who communicates customer impact?
Who records the timeline?
Who verifies recovery?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Technical evidence must be translated into operational decisions.&lt;&#x2F;p&gt;
&lt;p&gt;For example:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;“We have 60 sessions waiting on a migration lock.
The blocker is an app transaction idle for 18 minutes.
Canceling the migration will stop new queue growth.
Terminating the idle transaction appears safe, but it belongs to the billing service.
We need billing owner approval or incident commander decision.”
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is not just database knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;That is incident communication.&lt;&#x2F;p&gt;
&lt;p&gt;A technically correct action performed without coordination can still create organizational failure.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;documentation-does-not-create-muscle-memory&quot;&gt;Documentation does not create muscle memory&lt;&#x2F;h2&gt;
&lt;p&gt;In a quiet learning environment, an engineer can search, read, think, and test.&lt;&#x2F;p&gt;
&lt;p&gt;In an incident, the environment is different:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Users are affected.
Dashboards are noisy.
Logs are incomplete.
People are asking for updates.
The system is changing while you investigate.
Some actions are irreversible.
Time pressure is real.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Under pressure, people fall back to practiced behavior.&lt;&#x2F;p&gt;
&lt;p&gt;If the only practiced behavior is reading documentation, the team may move too slowly or choose familiar but unsafe actions.&lt;&#x2F;p&gt;
&lt;p&gt;Simulation creates muscle memory:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Notice the symptom.
Form a hypothesis.
Choose the next inspection.
Interpret evidence.
Communicate uncertainty.
Take a bounded action.
Observe the result.
Revise the hypothesis.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That loop is the core of operational reliability.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-useful-maturity-model&quot;&gt;A useful maturity model&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres reliability maturity can be described in four levels.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;level-1-vocabulary&quot;&gt;Level 1: Vocabulary&lt;&#x2F;h3&gt;
&lt;p&gt;The team knows terms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;locks;
VACUUM;
WAL;
replica lag;
connection pool;
EXPLAIN;
checkpoint;
transaction;
index.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is necessary, but not enough.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;level-2-mechanism-understanding&quot;&gt;Level 2: Mechanism understanding&lt;&#x2F;h3&gt;
&lt;p&gt;The team understands how things work:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;why MVCC creates dead tuples;
why locks protect consistency;
why WAL enables recovery;
why replicas can lag;
why indexes help some queries and hurt writes;
why pool size controls concurrency.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is where documentation is very strong.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;level-3-diagnostic-reasoning&quot;&gt;Level 3: Diagnostic reasoning&lt;&#x2F;h3&gt;
&lt;p&gt;The team can connect symptoms to mechanisms:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;pool saturation may be caused by slow queries;
slow queries may be caused by locks;
locks may be caused by migrations;
replica lag may be caused by WAL spikes;
bad plans may be caused by stale statistics;
vacuum lag may be caused by old transactions.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This requires experience and practice.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;level-4-operational-judgment&quot;&gt;Level 4: Operational judgment&lt;&#x2F;h3&gt;
&lt;p&gt;The team can act safely under pressure:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;cancel the right thing;
pause the right workload;
avoid unsafe failover;
reduce concurrency;
communicate risk;
choose rollback vs roll-forward;
protect user traffic;
recover without creating a second incident.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This is where simulation matters most.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-articles-can-teach-well&quot;&gt;What articles can teach well&lt;&#x2F;h2&gt;
&lt;p&gt;Articles are valuable.&lt;&#x2F;p&gt;
&lt;p&gt;They can explain:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;mental models;
common failure modes;
diagnostic queries;
dangerous anti-patterns;
technical vocabulary;
incident patterns;
review questions;
safe design principles.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;An article can show why this query matters:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;SELECT
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE state &amp;lt;&amp;gt; &amp;#39;idle&amp;#39;
ORDER BY query_start ASC;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;It can explain that &lt;code&gt;wait_event_type = &#x27;Lock&#x27;&lt;&#x2F;code&gt; points toward contention.&lt;&#x2F;p&gt;
&lt;p&gt;It can explain that an &lt;code&gt;idle in transaction&lt;&#x2F;code&gt; session is dangerous because it keeps its locks and can hold back vacuum cleanup.&lt;&#x2F;p&gt;
&lt;p&gt;It can explain that high connection count is not the same as useful throughput.&lt;&#x2F;p&gt;
&lt;p&gt;But an article cannot reproduce the stress of deciding whether to terminate a real backend while customer requests are failing.&lt;&#x2F;p&gt;
&lt;p&gt;That is the boundary.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-simulations-teach-better&quot;&gt;What simulations teach better&lt;&#x2F;h2&gt;
&lt;p&gt;Simulations are useful because they train behavior, not only knowledge.&lt;&#x2F;p&gt;
&lt;p&gt;A good Postgres incident simulation can force the team to experience:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;unclear symptoms;
conflicting metrics;
misleading first hypotheses;
actions with side effects;
pressure from user impact;
coordination between roles;
the cost of waiting too long;
the cost of acting too early;
post-incident analysis.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, a simulation can show what happens when someone increases pool size during database saturation.&lt;&#x2F;p&gt;
&lt;p&gt;It can show how retries amplify load.&lt;&#x2F;p&gt;
&lt;p&gt;It can show how canceling the wrong migration changes the incident.&lt;&#x2F;p&gt;
&lt;p&gt;It can show how a replica can exist and still be unsafe as a failover target.&lt;&#x2F;p&gt;
&lt;p&gt;It can show how a long transaction prevents vacuum cleanup and creates delayed consequences.&lt;&#x2F;p&gt;
&lt;p&gt;That feedback loop is difficult to get from documentation.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-goal-is-not-to-replace-documentation&quot;&gt;The goal is not to replace documentation&lt;&#x2F;h2&gt;
&lt;p&gt;This is not an argument against documentation.&lt;&#x2F;p&gt;
&lt;p&gt;Strong Postgres reliability requires documentation, source knowledge, and practical experience.&lt;&#x2F;p&gt;
&lt;p&gt;The right relationship is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Documentation explains mechanisms.
Runbooks organize known responses.
Monitoring provides evidence.
Simulations build judgment.
Production experience validates assumptions.
Post-incident reviews improve the system.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Each layer has a role.&lt;&#x2F;p&gt;
&lt;p&gt;The mistake is expecting one layer to do all the work.&lt;&#x2F;p&gt;
&lt;p&gt;Documentation alone produces theoretical understanding.
Monitoring alone produces noise.
Runbooks alone produce mechanical responses.
Production alone is too expensive as a training environment.&lt;&#x2F;p&gt;
&lt;p&gt;Simulation connects them.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-practical-way-to-use-documentation-better&quot;&gt;A practical way to use documentation better&lt;&#x2F;h2&gt;
&lt;p&gt;Documentation becomes more valuable when read through incident questions.&lt;&#x2F;p&gt;
&lt;p&gt;Instead of reading the lock chapter as theory, ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Which lock modes can block normal traffic?
How would I recognize a lock queue?
Which DDL operations need strong locks?
What would make cancellation unsafe?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Instead of reading about WAL as internals, ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What happens if WAL generation spikes?
How does WAL affect replication lag?
How can archiving failure fill disk?
How would checkpoints appear in latency graphs?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Instead of reading about autovacuum, ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What prevents cleanup?
How do old transactions affect vacuum?
Which tables need different settings?
How would vacuum failure show up as query latency?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Instead of reading about replication, ask:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;What does this replica protect us from?
How stale can reads be?
What happens during promotion?
How do we prevent split-brain?
What downstream systems depend on replication state?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;This turns documentation from reference material into operational training material.&lt;&#x2F;p&gt;
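&lt;p&gt;As one example, the question “How can archiving failure fill disk?” maps onto two views. The sketch below assumes WAL archiving is configured and that the second query runs on the primary, since &lt;code&gt;pg_current_wal_lsn()&lt;&#x2F;code&gt; is not available during recovery.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Is the archiver keeping up, or failing repeatedly?
SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;

-- How much WAL does each replication slot force the primary to keep?
SELECT
    slot_name,
    slot_type,
    active,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;A rising &lt;code&gt;failed_count&lt;&#x2F;code&gt; or an inactive slot with a large amount of retained WAL are both ways for &lt;code&gt;pg_wal&lt;&#x2F;code&gt; to grow until the disk fills.&lt;&#x2F;p&gt;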
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;a-good-postgres-reliability-training-loop&quot;&gt;A good Postgres reliability training loop&lt;&#x2F;h2&gt;
&lt;p&gt;A strong learning process looks like this:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;1. Study the mechanism.
2. Observe the metric in a healthy system.
3. Trigger or simulate a controlled failure.
4. Diagnose using real tools.
5. Choose a mitigation.
6. Observe side effects.
7. Review the decision.
8. Update dashboards, runbooks, code, or process.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;For example, with locks:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Study lock modes.
Observe normal pg_stat_activity.
Simulate a migration waiting behind a transaction.
Identify blockers.
Try canceling migration vs terminating blocker.
Observe pool behavior.
Discuss which action was safest.
Update migration policy.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;With replication:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Study WAL streaming.
Observe normal replay lag.
Simulate a WAL spike.
Watch replica delay.
Route stale-sensitive reads.
Discuss failover safety.
Update read-routing rules.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
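&lt;p&gt;The lock exercise can be reproduced in a sandbox with three sessions. This is a sketch only: it assumes a disposable database with an &lt;code&gt;orders&lt;&#x2F;code&gt; table, and the &lt;code&gt;note&lt;&#x2F;code&gt; column and the three second timeout are illustrative choices.&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;sql&quot;&gt;-- Session A: open a transaction, read the table, then leave it open.
BEGIN;
SELECT 1 FROM orders LIMIT 1;
-- No COMMIT yet. The session is now idle in transaction
-- and still holds an ACCESS SHARE lock on orders.

-- Session B: the migration. ALTER TABLE needs ACCESS EXCLUSIVE,
-- so it waits behind session A.
ALTER TABLE orders ADD COLUMN note text;

-- Session C: ordinary traffic. It queues behind the pending
-- lock request from session B, even though it only reads.
SELECT 1 FROM orders LIMIT 1;

-- Variation to compare: cancel session B and rerun it with a bounded wait.
-- The migration now fails after three seconds instead of stalling session C.
SET lock_timeout = &amp;#39;3s&amp;#39;;
ALTER TABLE orders ADD COLUMN note text;
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Running the blocker inspection while session C is waiting, and then comparing the two variations, gives the discussion in the later steps something concrete to review.&lt;&#x2F;p&gt;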
&lt;p&gt;This is how knowledge becomes readiness.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;the-business-reason-this-matters&quot;&gt;The business reason this matters&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres reliability is not only a technical concern.&lt;&#x2F;p&gt;
&lt;p&gt;Database incidents affect product behavior:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;users cannot log in;
payments fail;
orders time out;
dashboards show stale data;
workers fall behind;
notifications duplicate;
customers lose trust;
engineers lose sleep;
teams become afraid of migrations.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The cost is not just downtime.&lt;&#x2F;p&gt;
&lt;p&gt;It is slower engineering velocity.&lt;&#x2F;p&gt;
&lt;p&gt;When teams fear the database, they avoid necessary changes. They delay migrations, postpone cleanup, over-index defensively, under-invest in schema evolution, and treat every production change as risky.&lt;&#x2F;p&gt;
&lt;p&gt;Reliability training reduces that fear.&lt;&#x2F;p&gt;
&lt;p&gt;Not by pretending incidents will not happen, but by making the team more competent when they do.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;why-this-matters-specifically-for-postgres&quot;&gt;Why this matters specifically for Postgres&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres is powerful because it gives teams many capabilities:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;transactions;
rich indexing;
constraints;
JSON;
extensions;
replication;
partitioning;
concurrent index builds;
foreign keys;
materialized views;
stored procedures;
logical decoding;
advanced SQL.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Those capabilities allow teams to build serious systems.&lt;&#x2F;p&gt;
&lt;p&gt;They also create operational complexity.&lt;&#x2F;p&gt;
&lt;p&gt;A database that supports strong correctness, flexible queries, and rich workloads requires disciplined operation.&lt;&#x2F;p&gt;
&lt;p&gt;Postgres will often do exactly what you asked.&lt;&#x2F;p&gt;
&lt;p&gt;The reliability question is whether you understood what you asked it to do under production conditions.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;common-anti-patterns-in-learning-postgres-reliability&quot;&gt;Common anti-patterns in learning Postgres reliability&lt;&#x2F;h2&gt;
&lt;h3 id=&quot;learning-only-through-local-experiments&quot;&gt;Learning only through local experiments&lt;&#x2F;h3&gt;
&lt;p&gt;Local databases hide production realities: data volume, concurrency, locks, replicas, WAL volume, autovacuum pressure, and real traffic.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;memorizing-diagnostic-queries-without-hypotheses&quot;&gt;Memorizing diagnostic queries without hypotheses&lt;&#x2F;h3&gt;
&lt;p&gt;A query is useful only when you know what question it answers.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;treating-every-incident-as-a-missing-index&quot;&gt;Treating every incident as a missing index&lt;&#x2F;h3&gt;
&lt;p&gt;Indexes matter, but not every latency problem is an indexing problem.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;treating-failover-as-a-button&quot;&gt;Treating failover as a button&lt;&#x2F;h3&gt;
&lt;p&gt;Promotion is easy. Safe recovery is not.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;treating-runbooks-as-scripts&quot;&gt;Treating runbooks as scripts&lt;&#x2F;h3&gt;
&lt;p&gt;Runbooks guide decisions. They do not remove context.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;treating-monitoring-as-truth&quot;&gt;Treating monitoring as truth&lt;&#x2F;h3&gt;
&lt;p&gt;Metrics are evidence. They require interpretation.&lt;&#x2F;p&gt;
&lt;h3 id=&quot;waiting-for-production-to-teach-the-team&quot;&gt;Waiting for production to teach the team&lt;&#x2F;h3&gt;
&lt;p&gt;Production is the most expensive classroom.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;what-a-simulation-ready-team-looks-like&quot;&gt;What a simulation-ready team looks like&lt;&#x2F;h2&gt;
&lt;p&gt;A team ready for Postgres incidents can do more than quote documentation.&lt;&#x2F;p&gt;
&lt;p&gt;It can say:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;We know what normal looks like.
We know which symptoms are user-impacting.
We know where database pressure appears first.
We know how to inspect active sessions.
We know how to identify blockers.
We know which workloads can be paused.
We know who owns migrations.
We know how replicas are used.
We know our acceptable data loss window.
We know which actions are dangerous.
We have practiced decisions before production forced them.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is operational maturity.&lt;&#x2F;p&gt;
&lt;p&gt;It does not mean the team never has incidents.&lt;&#x2F;p&gt;
&lt;p&gt;It means incidents are shorter, less chaotic, and less likely to produce secondary failures.&lt;&#x2F;p&gt;
&lt;hr &#x2F;&gt;
&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;&#x2F;h2&gt;
&lt;p&gt;Postgres documentation is necessary.&lt;&#x2F;p&gt;
&lt;p&gt;But it is not sufficient.&lt;&#x2F;p&gt;
&lt;p&gt;It teaches what locks are, how WAL works, why vacuum exists, how replication functions, what indexes do, and how configuration parameters behave.&lt;&#x2F;p&gt;
&lt;p&gt;Production incidents test something different:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;Can the team recognize the pattern?
Can it connect database symptoms to application behavior?
Can it choose the safest next action?
Can it avoid making the incident worse?
Can it communicate uncertainty?
Can it recover the system without creating a second failure?
Can it learn afterward?
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;That is database reliability.&lt;&#x2F;p&gt;
&lt;p&gt;The dangerous phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;We read the docs, so we know Postgres.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;The better phrase is:&lt;&#x2F;p&gt;
&lt;pre&gt;&lt;code data-lang=&quot;text&quot;&gt;We understand the mechanisms, and we have practiced applying them under incident conditions.
&lt;&#x2F;code&gt;&lt;&#x2F;pre&gt;
&lt;p&gt;Documentation builds knowledge.
Simulation builds judgment.
Reliable Postgres operations need both.&lt;&#x2F;p&gt;
</content>
        
    </entry>
</feed>
