← All posts

Incidents 11 min read

Postgres incidents rarely start with "Postgres broke"

When a production system starts degrading, Postgres often becomes the first suspect.

The application is slow. Requests are timing out. Background jobs are piling up. Dashboards are turning red. The database CPU is higher than usual. Someone opens the incident channel and says:

“Looks like Postgres is having problems.”

Sometimes that is true. But very often, Postgres is not the original cause. It is the place where multiple system problems finally become visible.

A Postgres incident usually starts somewhere else: a release, a schema migration, a query pattern change, a sudden traffic spike, a connection pool misconfiguration, a long-running transaction, a reporting job, a replica falling behind, or an application retry storm.

The database becomes the pressure point.

That is why Postgres reliability is not only about knowing SQL or database internals. It is about understanding how Postgres behaves inside a living production system.


The misleading phrase: “the database is slow”

“The database is slow” sounds like a diagnosis, but it is usually only a symptom.

A slow query can be caused by many different mechanisms:

  • a bad execution plan;
  • missing or ineffective indexes;
  • outdated table statistics;
  • table bloat;
  • lock contention;
  • disk saturation;
  • too many concurrent connections;
  • long-running transactions;
  • autovacuum falling behind;
  • replication lag;
  • application retry storms;
  • connection pool exhaustion;
  • an expensive migration running at the wrong time.

The external symptom may look the same:

HTTP latency increased
API requests timing out
Worker queue length growing
Database connections rising
Postgres CPU and IO elevated

But the correct response depends entirely on the mechanism.

This is where many teams get into trouble. They treat the symptom as the cause.


A typical incident chain

A Postgres incident often looks like this:

flowchart TD
    A[Small application change] --> B[New or more frequent query pattern]
    B --> C[Higher database load]
    C --> D[Longer query execution time]
    D --> E[Connections held for longer]
    E --> F[Connection pool saturation]
    F --> G[Application timeouts]
    G --> H[Retries]
    H --> I[Even more database load]
    I --> J([Production incident])

From the outside, this may look like “Postgres became slow.”

But Postgres did not randomly become slow. The system changed around it.

That distinction matters because the wrong mitigation can make the incident worse.


Example 1: a harmless release that doubles database pressure

Imagine a backend service has an endpoint like this:

SELECT id, email, status
FROM users
WHERE id = $1;

It is fast. It uses the primary key. No problem.

Then a release adds a feature flag check based on recent user activity:

SELECT id
FROM user_events
WHERE user_id = $1
  AND event_type = 'purchase'
ORDER BY created_at DESC
LIMIT 1;

On staging, this query is fast. In production, user_events has hundreds of millions of rows.

If the index is not aligned with the query, Postgres may need to scan far more data than expected.

A better supporting index might look like:

CREATE INDEX CONCURRENTLY idx_user_events_user_type_created
ON user_events (user_id, event_type, created_at DESC);

But the incident is not just “missing index.”

The real incident chain may be:

New release adds one extra query per request
        ↓
Query is cheap for some users, expensive for others
        ↓
Average DB time per request increases
        ↓
Application holds connections longer
        ↓
Pool reaches max size
        ↓
Requests queue inside the app
        ↓
Timeouts trigger retries
        ↓
Postgres receives even more work

A useful first question is not:

“Which query is slow?”

A better first question is:

“What changed in the system right before the database started showing pressure?”


Example 2: connection pool exhaustion is not always a pool problem

When an application starts timing out while waiting for a database connection, the instinctive response is often:

“Increase the pool size.”

That can help in some cases. But it can also make the incident worse.

A connection pool is not just a performance tool. It is a pressure regulator.

If Postgres is already overloaded, increasing the number of concurrent database sessions may increase CPU contention, memory pressure, lock contention, and IO saturation.

A useful mental model:

Small pool:
Application queues before Postgres

Huge pool:
Postgres receives too much concurrent work directly

You can inspect active database sessions with:

SELECT
    state,
    wait_event_type,
    wait_event,
    count(*)
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY count(*) DESC;

This tells you whether sessions are actively running, waiting on locks, waiting on IO, idle in transaction, or simply connected.

But the query alone is not the solution. The important part is interpretation.

For example, many sessions in this state are a major warning sign:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    state,
    query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY transaction_age DESC;

An idle in transaction session may keep old row versions alive, block vacuum progress, hold locks, or distort the behavior of other parts of the system.

In an incident, this may appear as “Postgres is slow,” while the actual trigger is an application code path that opened a transaction and failed to close it correctly.


Example 3: a schema migration that blocks production traffic

Schema migrations are one of the most common sources of Postgres incidents.

A migration can be syntactically correct and still operationally dangerous.

For example:

ALTER TABLE orders ADD COLUMN processed_at timestamptz;

This may be safe and fast in many modern Postgres versions. But not every ALTER TABLE is harmless, and even operations that are usually fast still need locks.

A more dangerous example:

ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);

Or:

CREATE INDEX idx_orders_created_at
ON orders (created_at);

Creating a normal index can block writes. In production, you usually want:

CREATE INDEX CONCURRENTLY idx_orders_created_at
ON orders (created_at);

But even CONCURRENTLY is not magic. It takes longer, consumes resources, and can fail if there are conflicting operations.

During a suspected lock-related incident, this type of query can help identify blockers:

SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query,
    now() - blocking.query_start AS blocking_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked
    ON blocked.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
    ON blocking_locks.locktype = blocked_locks.locktype
   AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
   AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
   AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
   AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
   AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
   AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
   AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
   AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
   AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
   AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking
    ON blocking.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted
  AND blocking_locks.granted;

This is useful, but it is still only one piece of the incident.

The deeper questions are:

  • Why was this migration run during this traffic pattern?
  • Was there a rollback plan?
  • Were lock timeouts configured?
  • Were long transactions checked before the migration?
  • Did the application have retry behavior that amplified the issue?

A mature team does not only ask “which process blocked us?” It asks “why was the system vulnerable to this class of failure?”


Example 4: a slow query is not always a query problem

A query can become slow without changing the SQL text.

For example:

SELECT *
FROM invoices
WHERE account_id = $1
  AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;

This may work well when each account has a small number of invoices.

But as the product grows, one enterprise account may accumulate millions of rows. The query becomes highly sensitive to data distribution.

You can inspect the execution plan:

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE account_id = 123
  AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;

The plan might reveal:

  • sequential scans;
  • high buffer reads;
  • unexpected nested loops;
  • bad row estimates;
  • sort operations spilling to disk;
  • index scans that are technically used but still inefficient.

A possible supporting index could be:

CREATE INDEX CONCURRENTLY idx_invoices_account_status_due
ON invoices (account_id, status, due_date);

But again, the point is not “add this index.”

The real reliability lesson is that production data shape changes over time. A query that was safe six months ago can become dangerous after customer growth, product changes, or new usage patterns.

Reliability is not only about fixing bad queries. It is about detecting when previously good assumptions have expired.


The difference between trigger, mechanism, and amplifier

A useful way to reason about Postgres incidents is to separate three things.

1. Trigger

The event that started the incident.

Examples:

New release
Schema migration
Traffic spike
Batch job
Analytics query
Configuration change
Failover
New customer onboarded

2. Mechanism

The technical process through which the system degraded.

Examples:

Lock contention
Connection saturation
Query plan regression
Disk IO saturation
WAL pressure
Autovacuum lag
Replication lag
Memory pressure
Transaction buildup

3. Amplifier

The thing that made the incident worse.

Examples:

Aggressive retries
Oversized connection pools
No statement timeout
No lock timeout
Long-running transactions
Missing dashboards
No migration safety process
Manual panic actions

A poor incident review says:

“The database was slow because of a bad query.”

A better incident review says:

“The trigger was a release that introduced a new query pattern. The mechanism was inefficient index access under production data distribution. The amplifier was application retries combined with a pool size that allowed too much concurrent pressure on Postgres.”

That second version teaches the team something reusable.


Useful diagnostic queries are not the same as an incident response skill

It is good to know queries like these.

Current activity:

SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_age DESC;

Long transactions:

SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS xact_age,
    left(query, 120) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;

Top queries with pg_stat_statements:

SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

Replication lag:

SELECT
    application_name,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

Approximate table bloat and dead tuple pressure:

SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

Index usage:

SELECT
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC
LIMIT 20;

These queries are useful. But they are not enough.

During a real incident, the challenge is not just running SQL. The challenge is knowing which hypothesis you are testing.

For example:

Are we overloaded because queries are slower?
Are queries slower because of locks?
Are locks caused by a migration?
Is the pool full because Postgres is slow, or is Postgres slow because the pool allows too much concurrency?
Is replication lag a cause, a symptom, or a separate issue?
Are retries protecting the system or attacking it?

This is where operational skill matters.


Why Postgres incidents often look similar

Many different failure modes produce similar symptoms.

SymptomPossible causes
High latencyslow queries, locks, IO saturation, pool wait, CPU pressure
Many active connectionsslow DB, oversized pool, retry storm, long transactions
High CPUquery plan regression, too much concurrency, missing index
High IOsequential scans, checkpoints, vacuum, index creation, bad plans
Timeoutspool exhaustion, locks, network, overloaded DB, application retries
Replica lagWAL volume, slow replica IO, long queries on standby, replication slot issues

This is why “dashboard watching” is not enough.

Metrics do not tell you what to do by themselves. They only become useful when connected to a hypothesis.

A metric says:

Connections are high.

An engineer has to ask:

Are connections high because requests increased?
Because queries are slower?
Because transactions are stuck?
Because the pool was reconfigured?
Because the app is retrying?
Because background jobs started?

The same metric can point to different actions depending on context.


Dangerous reactions during Postgres incidents

Some actions feel helpful but can be dangerous when done without understanding the mechanism.

Increasing the connection pool

May help if the pool is too small and Postgres has spare capacity.

May hurt if Postgres is already saturated.

Killing random queries

May help if a clearly harmful query is blocking critical work.

May hurt if you kill the wrong backend, interrupt a migration, or cause application-level retries.

Restarting the application

May help if the app is stuck.

May hurt if every instance reconnects at once and creates a connection storm.

Failing over to a replica

May help if the primary is unhealthy.

May hurt if the issue is caused by application behavior, bad queries, or a migration that will continue after failover.

Running emergency indexes

May help if the cause is well understood.

May hurt if index creation adds IO pressure during an already overloaded period.

The operational question is not:

“What can we do?”

It is:

“Which action reduces pressure without increasing uncertainty?”


Reliability requires practicing the messy middle

Most educational material explains clean concepts:

  • how MVCC works;
  • how indexes work;
  • how locks work;
  • how autovacuum works;
  • how replication works;
  • how query planning works.

That knowledge is necessary.

But incidents do not arrive as clean textbook chapters.

They arrive as noisy combinations:

A migration is waiting on a lock.
A long transaction is preventing cleanup.
The application pool is saturated.
Retries are increasing traffic.
A reporting query is consuming IO.
Replication lag is rising.
The team is debating rollback.
Customers are already affected.

The difficult part is the messy middle: forming hypotheses, rejecting bad assumptions, choosing safe mitigations, and communicating clearly while the system is degraded.

This cannot be learned fully from documentation.

It has to be practiced.


What incident simulations teach that articles cannot

An article can explain the concepts. A checklist can remind you what to inspect. A dashboard can show symptoms.

But a simulation trains the actual operational behavior:

  • noticing weak signals early;
  • distinguishing trigger from mechanism;
  • avoiding attractive but dangerous actions;
  • reading database symptoms in application context;
  • understanding how one mitigation changes system pressure;
  • coordinating investigation under time pressure;
  • learning from mistakes without damaging production.

In a good Postgres incident simulation, the goal is not to memorize one magic query.

The goal is to experience the chain:

flowchart LR
    S[Symptom] --> H[Hypothesis] --> I[Inspection] --> D[Decision] --> C[Consequence]

That loop is the core of database reliability work.


Conclusion

Postgres incidents rarely begin with “Postgres broke.”

More often, they begin with a normal engineering action:

a release
a migration
a new query
a batch job
a traffic spike
a retry policy
a pool configuration change

Postgres becomes the place where the consequences accumulate.

That is why reliable Postgres operations require more than database knowledge. They require system thinking.

You need to understand queries, locks, transactions, WAL, vacuum, replication, and indexes. But you also need to understand application behavior, deployment practices, connection pools, retries, traffic patterns, and human decision-making during incidents.

Documentation teaches mechanisms. Monitoring shows symptoms. Simulations build operational judgment.

And in production, judgment is often the difference between a short degradation and a serious incident.

Newsletter

Stay in the loop

New incident tracks, psql+ features and hard-won PostgreSQL tips — delivered to your inbox now and then.

No spam. Unsubscribe anytime.