A slow query is one of the easiest Postgres problems to notice and one of the easiest to misunderstand.
The application times out. The endpoint gets slower. The dashboard shows high database time. Someone finds a query in logs and says:
“This query is the problem.”
Maybe it is.
But a slow query is rarely a complete diagnosis. It is a symptom produced by a specific mechanism: a bad plan, missing index, stale statistics, lock contention, IO saturation, parameter sensitivity, table bloat, too much concurrency, or a data distribution change that made yesterday’s assumptions false.
The SQL text is only one part of the story.
A query can become slow without changing at all.
The same query can be fast yesterday and dangerous today
Consider a simple query:
SELECT *
FROM invoices
WHERE account_id = $1
AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;
This query may be perfectly fine when most accounts have a few hundred invoices.
Then the product grows. One enterprise customer imports millions of invoices. Suddenly, the same query behaves differently for different accounts.
For small accounts, it is still fast.
For one large account, it becomes expensive.
That is not a different query. It is a different data shape.
A useful index might be:
CREATE INDEX CONCURRENTLY idx_invoices_account_status_due_date
ON invoices (account_id, status, due_date);
But the reliability lesson is not simply “add an index.”
The deeper lesson is:
Query performance depends on data distribution,
not just on SQL syntax.
A query that was safe when the product was small may become a production risk as the data changes.
“Slow query” hides multiple failure modes
From the outside, several very different problems can look identical.
API latency increased
Database time increased
Requests started timing out
Connection pool is full
The same query appears in logs repeatedly
But the underlying cause could be:
Missing index
Wrong index order
Stale planner statistics
Bad row estimate
Lock contention
Disk IO saturation
Sort spilling to disk
Too much concurrency
Autovacuum falling behind
Table or index bloat
Parameter-sensitive query plan
Application retry storm
The same symptom requires different mitigations depending on the mechanism.
That is why “find the slow query” is not enough.
You need to understand why it is slow now.
Start with the query shape, not just the query text
A useful first step is to identify the shape of the query.
For example:
SELECT *
FROM events
WHERE tenant_id = $1
AND event_type = $2
AND created_at >= $3
ORDER BY created_at DESC
LIMIT 100;
This query shape tells you several important things:
It is tenant-scoped.
It filters by event type.
It uses a time range.
It needs rows in descending time order.
It has a LIMIT.
It may be called frequently.
An index that supports this access pattern may look like:
CREATE INDEX CONCURRENTLY idx_events_tenant_type_created_desc
ON events (tenant_id, event_type, created_at DESC);
But index design depends on real workload. For example, if event_type is not selective, or if most queries do not filter by it, a different index may be better:
CREATE INDEX CONCURRENTLY idx_events_tenant_created_desc
ON events (tenant_id, created_at DESC);
The key question is not:
Does this query have an index?
The better question is:
Does the index match the actual access pattern?
Use EXPLAIN, but do not worship it
The most common tool for investigating a slow query is:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM invoices
WHERE account_id = 123
AND status = 'open'
ORDER BY due_date ASC
LIMIT 50;
This can show:
Which plan Postgres chose
How many rows it expected
How many rows it actually processed
Whether it used an index
How many buffers were read or hit
Whether sorting happened
Whether the query touched much more data than expected
For example, a suspicious plan may show:
Rows expected: 50
Rows actual: 850000
That is not just “slow.” That is a planner estimate problem.
A query with BUFFERS may show heavy reads:
shared hit blocks: 1200
shared read blocks: 95000
That suggests the query is reading a lot from disk or pulling a large amount of data through shared buffers.
But EXPLAIN ANALYZE has an important property:
It actually runs the query.
For SELECT, that is usually acceptable in a safe environment, though it can still be expensive.
For writes, be careful. This executes the write:
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'expired'
WHERE expires_at < now();
A safer pattern for investigation is:
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders
SET status = 'expired'
WHERE expires_at < now();
ROLLBACK;
Even then, the database still performs work and may take locks while the statement runs. Do not treat diagnostic queries as harmless in production.
Slow because of a missing index
The simplest case is a query that has no useful index.
Example:
SELECT *
FROM users
WHERE lower(email) = lower($1);
An ordinary index on email may not help because the query applies a function:
CREATE INDEX CONCURRENTLY idx_users_email
ON users (email);
Postgres may need an expression index instead:
CREATE INDEX CONCURRENTLY idx_users_lower_email
ON users (lower(email));
Another example:
SELECT *
FROM orders
WHERE customer_id = $1
ORDER BY created_at DESC
LIMIT 20;
A partial index on only customer_id may help filtering but not ordering:
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
A better index for this query shape may be:
CREATE INDEX CONCURRENTLY idx_orders_customer_created_desc
ON orders (customer_id, created_at DESC);
But even here, the right fix depends on the workload.
If the table is write-heavy, every new index has a cost. It slows down writes, consumes disk, increases vacuum work, and adds operational risk during creation.
The index may fix one query and harm the system elsewhere.
Slow because of stale statistics
Postgres uses statistics to choose query plans.
If statistics are stale or too coarse, the planner may choose a bad plan.
You can inspect table statistics freshness:
SELECT
relname,
n_live_tup,
n_dead_tup,
last_analyze,
last_autoanalyze,
last_vacuum,
last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'invoices';
If a table changed significantly and has not been analyzed recently, Postgres may make poor estimates.
You can manually refresh statistics:
ANALYZE invoices;
Sometimes a specific column needs better statistics because values are highly skewed:
ALTER TABLE invoices
ALTER COLUMN account_id SET STATISTICS 1000;
ANALYZE invoices;
This does not make the query faster directly. It gives the planner better information.
The incident pattern often looks like this:
Data distribution changes
↓
Planner estimates become inaccurate
↓
Postgres chooses a bad plan
↓
Query latency increases
↓
Application holds connections longer
↓
Pool saturates
The SQL did not change. The planner’s model of the data became wrong.
Slow because of parameter sensitivity
Some queries behave very differently depending on parameter values.
Example:
SELECT *
FROM messages
WHERE workspace_id = $1
AND created_at >= now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;
For most workspaces, this returns a few rows.
For one very large workspace, it may scan millions.
This becomes especially tricky when prepared statements or generic plans are involved. The planner may choose a plan that is “reasonable on average” but bad for important parameter values.
The query is not universally slow. It is selectively slow.
That distinction matters.
Averages hide this problem. You need to look for variance.
With pg_stat_statements, this kind of query may have a moderate mean but a terrible max:
SELECT
calls,
mean_exec_time,
max_exec_time,
stddev_exec_time,
rows,
left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY max_exec_time DESC
LIMIT 20;
A query with high standard deviation may be more interesting than a query with the highest average time.
A reliability-minded question is:
Is this query always slow,
or only slow for certain tenants, users, statuses, or time ranges?
That question often changes the fix.
Slow because of locks
A query may appear slow even when its execution plan is fine.
It may simply be waiting.
For example:
UPDATE accounts
SET status = 'disabled'
WHERE id = $1;
This can be fast in normal conditions. But if another transaction holds a row lock on the same account, the update waits.
You can inspect lock waits:
SELECT
pid,
usename,
application_name,
state,
wait_event_type,
wait_event,
now() - query_start AS waiting_for,
left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
To find blockers:
SELECT
blocked.pid AS blocked_pid,
blocked.application_name AS blocked_app,
now() - blocked.query_start AS blocked_duration,
left(blocked.query, 120) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.application_name AS blocking_app,
blocking.state AS blocking_state,
now() - blocking.query_start AS blocking_duration,
left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
This is a very different failure mode from a missing index.
Adding an index will not fix a lock wait.
Running EXPLAIN ANALYZE later may show a fast plan, because the lock contention is gone.
That is why incident context matters. The query plan after the incident may not reproduce the incident.
Slow because of IO saturation
A query can be slow because it is doing too much disk work.
But it can also be slow because some other operation is saturating disk.
For example:
A concurrent index build
A large vacuum
A checkpoint spike
A reporting query
A backup process
A sequential scan on another table
The query you see in logs may be a victim, not the cause.
EXPLAIN (ANALYZE, BUFFERS) can show whether the query reads many blocks:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM events
WHERE tenant_id = 42
ORDER BY created_at DESC
LIMIT 100;
But to understand system-wide pressure, you also need to look at active queries:
SELECT
pid,
application_name,
wait_event_type,
wait_event,
now() - query_start AS query_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start ASC;
A query waiting on IO may show wait events related to data file reads or writes, depending on Postgres version and workload.
The important operational distinction:
Is this query slow because it performs too much IO,
or because the database storage is already saturated by something else?
The mitigation is different.
Slow because of sorting or memory pressure
A query may use an index for filtering but still sort a large result set.
Example:
SELECT *
FROM audit_log
WHERE organization_id = $1
ORDER BY created_at DESC
LIMIT 100;
If the index does not support the order, Postgres may need to sort.
A useful index:
CREATE INDEX CONCURRENTLY idx_audit_log_org_created_desc
ON audit_log (organization_id, created_at DESC);
In plans, watch for:
Sort
Sort Method: external merge Disk
That means the sort spilled to disk.
A simplified example:
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM audit_log
WHERE organization_id = 123
ORDER BY created_at DESC
LIMIT 100;
If you see disk-based sorting, increasing work_mem might help in some cases. But changing work_mem globally can be dangerous because it applies per operation, not per database.
A query with multiple sort/hash nodes across many concurrent sessions can multiply memory usage quickly.
This is why “just increase memory” is often a risky incident response.
Slow because of bloat
Postgres uses MVCC. Updates and deletes leave old row versions behind until vacuum can clean them up.
If vacuum falls behind, tables and indexes can become bloated.
A bloated table means Postgres may need to scan more pages to get the same useful data.
You can inspect dead tuple pressure:
SELECT
relname,
n_live_tup,
n_dead_tup,
round(
100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
2
) AS dead_tuple_percent,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
This is not a perfect bloat measurement, but it is a useful signal.
A common incident chain:
flowchart TD
A[Long transaction remains open] --> B[Vacuum cannot clean old row versions]
B --> C[Dead tuples accumulate]
C --> D[Table and index scans become more expensive]
D --> E[Query latency increases]
E --> F[More connections remain busy]
F --> G([System degrades])
The slow query is only the visible symptom.
The root issue may be a long transaction or vacuum starvation.
You can inspect old transactions:
SELECT
pid,
usename,
application_name,
state,
now() - xact_start AS transaction_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;
Again, a slow query may be downstream of a completely different operational failure.
Slow because of too much concurrency
A query can be individually acceptable but collectively harmful.
Example:
SELECT *
FROM product_recommendations
WHERE user_id = $1
ORDER BY score DESC
LIMIT 20;
One execution is fine. Ten executions are fine. Five thousand concurrent executions during a traffic spike are not fine.
This is the difference between query latency and system throughput.
A query does not have to be “bad” to cause an incident. It only has to be too frequent, too concurrent, or too poorly bounded.
This often happens with retries.
Database gets slower
↓
Application requests timeout
↓
Application retries
↓
Database receives more duplicate work
↓
Database gets even slower
At that point, optimizing the query may help later, but the immediate mitigation might be reducing concurrency, disabling a worker, rate-limiting retries, or shedding non-critical load.
A database incident is often a traffic-shaping problem, not just a SQL problem.
Finding important queries with pg_stat_statements
pg_stat_statements is one of the most useful Postgres extensions for understanding workload.
A basic view of expensive queries:
SELECT
calls,
total_exec_time,
mean_exec_time,
max_exec_time,
rows,
left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
But different orderings answer different questions.
Highest total time:
SELECT
calls,
total_exec_time,
mean_exec_time,
left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
This finds queries that consume the most database time overall.
Highest mean time:
SELECT
calls,
mean_exec_time,
max_exec_time,
left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_exec_time DESC
LIMIT 20;
This finds consistently expensive queries.
Highest call count:
SELECT
calls,
mean_exec_time,
total_exec_time,
left(query, 160) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;
This finds queries that may be cheap individually but expensive in aggregate.
High variance:
SELECT
calls,
mean_exec_time,
max_exec_time,
stddev_exec_time,
left(query, 160) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time DESC
LIMIT 20;
This finds queries that behave unpredictably.
The important part is choosing the right question.
Total time asks: what consumes the database?
Mean time asks: what is slow on average?
Max time asks: what occasionally explodes?
Calls asks: what is happening too often?
Variance asks: what behaves differently across inputs?
A production investigation needs all of these perspectives.
Why “add an index” is sometimes the wrong fix
Indexes are powerful. Many incidents are fixed by adding or changing an index.
But indexes are not free.
Every index has costs:
More disk usage
Slower inserts
Slower updates
Slower deletes
More WAL generation
More vacuum work
More memory pressure
Longer backup/restore times
Operational risk during creation
An index can also be technically used but not useful enough.
For example:
CREATE INDEX CONCURRENTLY idx_orders_status
ON orders (status);
If status = 'active' matches 80% of the table, this index may not be very selective. Postgres may correctly choose a sequential scan.
A more useful partial index might be:
CREATE INDEX CONCURRENTLY idx_orders_pending_created
ON orders (created_at)
WHERE status = 'pending';
This can be valuable if pending is rare and frequently queried.
But partial indexes require discipline. The query must match the predicate well enough for the planner to use it.
A slow query investigation should ask:
What exact access pattern are we optimizing?
How often does it run?
How many rows does it usually return?
How selective are the filters?
Does the query need ordering?
Is the index worth the write cost?
Can the index be created safely under current load?
Without those questions, indexing becomes guesswork.
Query performance is part of application design
A database can only do so much if the application asks expensive questions.
For example, this pattern is dangerous:
SELECT *
FROM events
WHERE tenant_id = $1
ORDER BY created_at DESC;
No limit. No time bound. Potentially huge result set.
A safer shape:
SELECT *
FROM events
WHERE tenant_id = $1
AND created_at < $2
ORDER BY created_at DESC
LIMIT 100;
This supports pagination and gives the database a bounded amount of work.
Another dangerous pattern is N+1 queries:
Load 100 orders
For each order, query customer
For each customer, query latest invoice
For each invoice, query payment status
Individually, each query may be fast.
Together, they create a database pressure pattern.
A better approach may use joins, batching, caching, or precomputed views, depending on the system.
The database is not just a storage layer. It is part of the application’s execution model.
The post-incident question should be deeper than “which query was slow?”
After a query-related incident, a weak review says:
A query was slow.
We added an index.
The incident is resolved.
A stronger review asks:
Why did this query become slow now?
Was the data distribution different from staging?
Did the query pattern change in a release?
Did we have statistics drift?
Was the index missing, wrong, or too expensive to maintain?
Did retries amplify the load?
Did the connection pool hide early symptoms?
Did dashboards show query variance or only averages?
Could we have detected this before users did?
The goal is not to blame a query.
The goal is to improve the system’s ability to survive workload changes.
A useful mental model
When you see a slow query, do not stop at the SQL text.
Walk through the layers:
flowchart TD
A[SQL shape] --> B[Planner estimates] --> C[Chosen plan] --> D[Index access]
D --> E[Rows scanned vs rows returned] --> F[Buffers hit vs read] --> G[Sort / hash behavior]
G --> H[Lock waits] --> I[Transaction age] --> J[Concurrency]
J --> K[Connection pool behavior] --> L[Application retries] --> M([User-visible impact])
This does not mean every incident requires checking everything manually.
It means the query is part of a system.
The diagnosis is the mechanism, not the symptom.
Why slow-query incidents are good simulation material
Slow-query incidents are excellent for training because they are deceptively familiar.
Most engineers know how to read a query.
Many know how to run EXPLAIN.
Some know how to add an index.
But production incidents are harder than that.
A realistic simulation forces questions like:
Is this query the cause or a victim?
Is the plan bad or is it waiting on a lock?
Is the index missing or are statistics wrong?
Is the database overloaded by this query or by retries?
Should we add an index now or reduce traffic first?
Is the safest action in SQL, application config, or deployment rollback?
That is the skill gap.
Articles can explain the mechanics. Queries can reveal evidence. But operational judgment comes from practicing the loop:
flowchart LR
S[Symptom] --> H[Hypothesis] --> I[Inspection] --> D[Decision] --> C[Consequence]
In production, every decision has side effects.
A simulation lets teams experience those side effects before they are dealing with real customers, real data, and real pressure.
Conclusion
A slow Postgres query is not a diagnosis.
It is a signal.
Sometimes the fix is an index.
Sometimes it is ANALYZE.
Sometimes it is rewriting the query.
Sometimes it is reducing concurrency.
Sometimes it is stopping retries.
Sometimes it is killing a blocking transaction.
Sometimes it is changing application behavior.
Sometimes it is doing nothing immediately and collecting better evidence first.
The hard part is not finding a slow query.
The hard part is understanding why it became slow, why it became slow now, and what action will reduce risk without making the system worse.
That is the core of Postgres database reliability: not just knowing how queries work, but understanding how query behavior emerges from data, workload, concurrency, and operational decisions.