← All posts

Monitoring 16 min read

Postgres monitoring: which metrics help, and which ones create noise

Most Postgres monitoring starts with good intentions and slowly turns into noise.

A team adds dashboards for CPU, memory, connections, replication lag, locks, slow queries, disk usage, cache hit ratio, autovacuum, checkpoints, WAL, dead tuples, table sizes, index usage, and dozens of other signals.

Then an incident happens.

The dashboard is full of red panels. Everyone sees something different. One engineer points at CPU. Another points at connections. Someone else sees slow queries. A replica lag alert fires. The application shows timeouts. The team has many metrics, but no clear direction.

That is the core monitoring problem:

More metrics do not automatically create better reliability.

Good monitoring helps a team form and test hypotheses. Bad monitoring creates panic, false confidence, and alert fatigue.

Postgres reliability monitoring is not about collecting every possible number. It is about knowing which signal answers which operational question.


Monitoring should start with user impact

A database can look unhealthy while the product is fine.

A database can also look mostly healthy while users are already suffering.

That is why the top layer of monitoring should not be Postgres internals. It should be user-visible behavior.

Examples:

API latency
API error rate
checkout failures
login failures
background job delay
queue age
request timeout rate
successful writes per second
customer-facing read latency

These are not Postgres metrics, but they are the reason Postgres reliability matters.

A useful monitoring hierarchy looks like this:

flowchart TD
    A[User symptoms] --> B[Application behavior]
    B --> C[Connection pool pressure]
    C --> D[Postgres activity]
    D --> E[Storage / OS / infrastructure]
    E --> F[Replication / backup / recovery systems]

If you start from the bottom, you may optimize the wrong thing.

If you start from user impact, you can ask:

Which database symptom explains the product symptom?

That question is more useful than:

Which graph is red?

A metric is useful only when it supports a decision

A weak alert says:

Database connections are high.

A better alert says:

User-facing requests are waiting for database connections,
and Postgres active sessions are also elevated.

A weak dashboard says:

CPU is 90%.

A better dashboard helps answer:

Is CPU high because useful work increased,
because a bad query plan appeared,
because concurrency exploded,
or because retries are multiplying traffic?

The value of a metric is not the number itself. The value is the decision it helps with.

For Postgres incidents, useful decisions include:

Should we reduce traffic?
Should we pause background workers?
Should we cancel a query?
Should we cancel a migration?
Should we add capacity?
Should we fail over?
Should we stop retries?
Should we run ANALYZE?
Should we let recovery continue?
Should we avoid touching the database until we know more?

Monitoring should make those decisions safer.


Postgres has many statistics views, but they are not a diagnosis

PostgreSQL exposes a cumulative statistics system that reports server activity, including table and index access, row counts, and vacuum/analyze activity. The official monitoring documentation also reminds operators not to ignore OS-level tools such as ps, top, iostat, and vmstat, and to use EXPLAIN for deeper query investigation after identifying a poorly performing query. (PostgreSQL)

That means Postgres gives you evidence.

It does not give you the incident narrative automatically.

For example, this query summarizes sessions:

SELECT
    state,
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;

This can tell you:

many sessions are active;
many sessions are waiting on locks;
many sessions are idle;
many sessions are idle in transaction;
many sessions are waiting on IO.

But the next step is human reasoning:

Why are they active?
Why are they waiting?
What changed?
Which workload owns them?
Are they a cause or a symptom?

A statistics view is not an incident response plan.


Start with “what changed?”

Many Postgres incidents are triggered by change:

new deploy;
new query shape;
schema migration;
new index;
data import;
traffic spike;
autoscaling event;
background job;
customer onboarding;
configuration change;
replica issue;
storage degradation.

Monitoring should make change visible.

A good incident dashboard should correlate database symptoms with:

deploy markers;
migration start/finish events;
feature flag changes;
traffic volume;
worker concurrency;
autoscaling events;
database failover events;
backup windows;
maintenance jobs;
large imports or backfills.

Without change context, metrics are easier to misread.

Example:

Connections increased at 12:05.

That could mean:

traffic increased;
queries became slower;
pool size changed;
a deploy doubled app instances;
a connection leak started;
retries increased;
a lock queue formed.

If the dashboard also shows a deployment at 12:03, the investigation starts differently.


Monitor the connection boundary

Connection pools are where application behavior becomes database pressure.

Postgres-side connection snapshot:

SELECT
    application_name,
    usename,
    client_addr,
    count(*) AS total,
    count(*) FILTER (WHERE state = 'active') AS active,
    count(*) FILTER (WHERE state = 'idle') AS idle,
    count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr
ORDER BY total DESC;

This answers:

Which applications are connected?
Who owns the sessions?
How many are active?
How many are idle?
Are any idle inside transactions?

But Postgres cannot fully explain pool behavior. The application must expose:

pool size;
connections in use;
idle pool connections;
pending checkout count;
connection acquisition latency;
pool checkout timeout count;
transaction duration;
query duration;
request duration while holding a connection.

A critical distinction:

Waiting for a pool connection is not the same as executing a slow SQL query.

If you do not separate those, every incident looks like “Postgres is slow.”


Monitor active sessions, not just total connections

Total connections can be misleading.

A database with 300 mostly idle sessions may be healthier than a database with 60 active sessions all fighting over locks or disk.

Useful live activity query:

SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    now() - xact_start AS transaction_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY query_start ASC;

This helps distinguish:

active CPU work;
lock waiting;
IO waiting;
long-running transactions;
stuck migrations;
slow queries;
idle transactions;
client-related waits.

During incidents, the wait state is often more useful than the raw connection count.

A good dashboard should not only ask:

How many connections exist?

It should ask:

What are those connections doing?

Long transactions deserve their own panel

Long transactions are behind many Postgres reliability problems:

vacuum cannot clean old row versions;
schema migrations wait;
row locks remain held;
bloat grows;
replicas can be affected;
connection pools lose capacity;
query behavior becomes harder to explain.

Monitor them directly:

SELECT
    pid,
    application_name,
    usename,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 30;

And specifically:

SELECT
    pid,
    application_name,
    usename,
    client_addr,
    now() - xact_start AS transaction_age,
    now() - state_change AS idle_age,
    left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

A mature alert is not simply:

There is an idle transaction.

It is more like:

An app-owned transaction has been idle for longer than expected
on a database with high write activity or pending migrations.

Context turns noise into signal.


Query monitoring: total cost, latency, frequency, and variance

pg_stat_statements is one of the most important extensions for Postgres workload visibility. The official documentation describes it as a module for tracking planning and execution statistics of SQL statements executed by a server. (PostgreSQL)

The mistake is using only one ranking.

Highest total time:

SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;

This finds queries that consume the most total database time.

Highest average latency:

SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY mean_exec_time DESC
LIMIT 20;

This finds consistently slow queries.

Highest call count:

SELECT
    calls,
    total_exec_time,
    mean_exec_time,
    left(query, 180) AS query_preview
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 20;

This finds queries that may be cheap individually but expensive in aggregate.

High variance:

SELECT
    calls,
    mean_exec_time,
    max_exec_time,
    stddev_exec_time,
    rows,
    left(query, 180) AS query_preview
FROM pg_stat_statements
WHERE calls > 100
ORDER BY stddev_exec_time DESC
LIMIT 20;

This finds unstable queries.

Each view answers a different question:

Total time: what consumes the database?
Mean time: what is consistently expensive?
Max time: what occasionally explodes?
Call count: what happens too often?
Variance: what depends heavily on parameters or data shape?

If your dashboard only shows “top slow queries,” it may miss high-frequency queries that quietly dominate database load.


Slow-query logs are useful, but can become noise

Postgres logging can capture slow statements through settings such as log_min_duration_statement, and the logging system supports multiple destinations such as stderr, csvlog, jsonlog, syslog, and eventlog depending on platform and configuration. (PostgreSQL)

A common setting:

SHOW log_min_duration_statement;

Example:

ALTER SYSTEM SET log_min_duration_statement = '500ms';
SELECT pg_reload_conf();

This can help identify expensive statements.

But slow-query logs have limitations:

They show completed statements, not necessarily currently stuck ones.
They can become extremely noisy under incidents.
They may miss high-frequency fast queries that cause aggregate load.
They need application context to be useful.
They can increase log volume significantly.

Slow-query logging is evidence, not a complete monitoring strategy.

For production reliability, combine it with:

pg_stat_statements;
application tracing;
pool metrics;
lock monitoring;
wait events;
deployment markers;
request-level latency.

The goal is to connect a slow query to user impact and system pressure.


Lock monitoring must show blockers and victims

A lock alert that says “lock wait exists” is often too vague.

During production incidents, you need to know:

Who is blocked?
Who is blocking?
How long has the blocker been running?
Is the blocker active or idle in transaction?
Which application owns it?
Is the blocked query user traffic, migration, worker, or admin?

Useful query:

SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    blocked.usename AS blocked_user,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 160) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.usename AS blocking_user,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 160) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;

Good lock monitoring separates:

one harmless short lock wait;
a growing lock queue behind a migration;
a long idle transaction blocking DDL;
row-level contention in a hot workflow;
application workers fighting over the same rows.

The metric is not “number of locks.”

The signal is the shape of the blocking chain.


Autovacuum monitoring should focus on whether cleanup keeps up

Autovacuum is noisy if monitored incorrectly.

A graph showing “autovacuum is running” may look scary, but it can be completely normal.

Better questions:

Are dead tuples growing over time?
Are hot tables vacuumed often enough?
Are old transactions preventing cleanup?
Are analyze runs keeping statistics fresh?
Are tables approaching transaction ID age risk?
Is autovacuum always running but still not catching up?

Useful table maintenance snapshot:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;

Current vacuum progress:

SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;

Transaction ID age:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;

Autovacuum monitoring should tell you whether maintenance is keeping up with write workload.

If it only tells you that autovacuum exists, it is not enough.


WAL and checkpoint monitoring should reveal pressure chains

WAL is involved in durability, crash recovery, replication, backups, archiving, and logical decoding.

A WAL incident rarely stays isolated.

Watch:

WAL generation rate;
pg_wal directory growth;
archiver failures;
replication slot retention;
replica replay lag;
checkpoint frequency;
checkpoint write/sync time;
WAL-heavy statements;
large backfills or migrations.

WAL generation snapshot:

SELECT
    wal_records,
    wal_fpi,
    pg_size_pretty(wal_bytes) AS wal_bytes,
    wal_buffers_full,
    stats_reset
FROM pg_stat_wal;

Replication slots:

SELECT
    slot_name,
    slot_type,
    active,
    wal_status,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;

Archiver status:

SELECT
    archived_count,
    last_archived_wal,
    last_archived_time,
    failed_count,
    last_failed_wal,
    last_failed_time
FROM pg_stat_archiver;

The alert should not be only:

WAL directory is large.

It should help identify the mechanism:

WAL is growing because archiving is failing.
WAL is retained by an inactive replication slot.
WAL generation spiked after a backfill.
Replica replay lag is increasing because the primary is producing WAL too quickly.

That is the difference between symptom monitoring and reliability monitoring.


Replication monitoring: bytes, time, and product semantics

Replication lag is not one metric.

Primary-side view:

SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn))   AS send_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn))  AS write_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), flush_lsn))  AS flush_lag_bytes,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication
ORDER BY application_name;

Standby-side view:

SELECT
    pg_is_in_recovery() AS is_standby,
    pg_last_wal_receive_lsn() AS receive_lsn,
    pg_last_wal_replay_lsn() AS replay_lsn,
    now() - pg_last_xact_replay_timestamp() AS replay_delay;

But the product question is:

Can this read be stale?

Replication monitoring should connect to read-routing behavior.

A replica that is 5 seconds behind may be fine for analytics. It may be unacceptable for permissions, checkout, authentication, or user settings.

Good monitoring distinguishes:

replica connected;
replica receiving WAL;
replica replaying WAL;
replica serving stale reads;
replica safe for failover;
replica safe for reporting;
replica threatening primary disk through slot retention.

One “replication lag” graph is rarely enough.


Disk monitoring must distinguish capacity from performance

Disk incidents come in two forms:

capacity problem: storage is filling;
performance problem: storage is too slow for current workload.

Both hurt Postgres, but the response differs.

Capacity signals:

data directory size;
pg_wal size;
temporary file growth;
table/index growth;
backup/archive accumulation;
replication slot retention;
available filesystem space.

Performance signals:

read latency;
write latency;
fsync latency;
IOPS saturation;
queue depth;
checkpoint sync time;
query wait events;
temporary file spills;
backend writes.

Postgres metrics alone are not enough here. The official monitoring chapter explicitly points operators toward OS-level tools in addition to PostgreSQL’s internal statistics. (PostgreSQL)

A database graph may show slow queries.

The storage graph may reveal the real mechanism.


Cache hit ratio is often overrated

Many dashboards show buffer cache hit ratio.

It can be useful, but it is often overinterpreted.

A high cache hit ratio does not prove the database is healthy.

A low cache hit ratio does not automatically identify the cause of an incident.

Problems:

large sequential scans can distort the number;
some workloads naturally read cold data;
a high ratio can hide CPU or lock contention;
the metric says little about query shape;
it does not show whether users are impacted.

A better approach is to pair cache-related metrics with:

query plans;
buffer reads from EXPLAIN;
IO wait events;
storage latency;
table/index scan patterns;
top queries by shared blocks read;
application latency.

Cache hit ratio is context, not a primary incident diagnosis.


Alert on symptoms, investigate with causes

A common mistake is alerting on too many internal causes.

Examples:

CPU > 80%
connections > 300
dead tuples > threshold
replica lag > threshold
cache hit ratio < threshold
autovacuum running too long

Some of these are useful. But if every internal metric pages someone, the team learns to ignore alerts.

A healthier alerting model:

Page on user impact and imminent risk.
Ticket on trends and maintenance debt.
Dashboard internal signals for investigation.

Page-worthy examples:

user-facing error rate high;
API latency SLO burn;
database unavailable;
disk close to full;
primary cannot write WAL;
replica lag violates product read semantics;
connection exhaustion blocking traffic;
transaction ID age approaching dangerous thresholds;
backup/archive pipeline broken beyond recovery objective.

Ticket-worthy examples:

dead tuples trending up on hot table;
index bloat suspected;
unused indexes accumulating;
autovacuum not keeping up on one table;
slow query variance increasing;
connection usage slowly approaching capacity;
replica lag occasionally above normal but not user-impacting.

Not every red graph deserves a page.


Use trend and rate, not only absolute values

A single value can be misleading.

Examples:

1000 dead tuples on a small table may matter.
10 million dead tuples on a huge table may be normal temporarily.

200 connections may be normal for one system.
50 active connections may overload another.

1 GB of WAL may be fine.
1 GB per minute may be alarming.

Replica lag of 2 seconds may be acceptable for reporting.
Replica lag of 2 seconds may be unacceptable for read-after-write flows.

Prefer metrics that show:

rate of change;
baseline deviation;
duration;
affected workload;
relation to user symptoms;
relation to known changes.

A good alert is rarely “value > threshold.”

It is more often:

value is above threshold for long enough,
during user-impacting traffic,
and is moving in the wrong direction.

Version-specific monitoring matters

Postgres monitoring changes across versions.

Views, columns, and statistics capabilities evolve. For example, modern PostgreSQL versions expose more detailed IO and WAL-related statistics than older versions, and settings such as track_io_timing and track_wal_io_timing can provide timing information with potential overhead because they repeatedly query the operating system clock. (PostgreSQL)

This creates a practical rule:

Do not blindly copy monitoring SQL from another Postgres version.

For every dashboard query, know:

which Postgres versions it supports;
whether required extensions are enabled;
whether timing settings add overhead;
whether statistics reset affects interpretation;
whether managed database providers restrict access;
whether replicas expose the same views.

Monitoring should be treated like production code.

It can break, lie, or become outdated.


A minimal reliability-oriented Postgres dashboard

A useful dashboard does not need hundreds of panels.

It should answer the main incident questions quickly.

1. User impact

request latency;
error rate;
timeout rate;
business operation success rate;
queue age;
job delay.

2. Application database boundary

pool usage;
pool wait time;
pool checkout timeouts;
query duration;
transaction duration;
retry rate;
database errors by type.

3. Postgres live activity

active sessions;
sessions by wait_event_type;
long queries;
long transactions;
idle in transaction;
blocked sessions and blockers.

4. Workload shape

top queries by total time;
top queries by calls;
top queries by mean time;
queries with high variance;
WAL-heavy statements.

5. Maintenance health

dead tuples;
last autovacuum/analyze;
vacuum progress;
transaction ID age;
table and index growth.

6. WAL, checkpoints, storage

WAL generation rate;
pg_wal size;
checkpoint frequency;
checkpoint write/sync time;
archiver failures;
disk capacity;
disk latency.

7. Replication and recovery

replication lag by stage;
standby replay delay;
replication slot retained WAL;
backup/archive status;
failover readiness indicators.

The dashboard should be organized around questions, not around PostgreSQL catalog names.


Good monitoring supports hypothesis-driven debugging

During an incident, an engineer should be able to move through a chain like this:

flowchart TD
    A[Users see timeouts] --> B[Application pool wait time is rising]
    B --> C[Postgres active sessions are elevated]
    C --> D[Most active sessions wait on Lock]
    D --> E[Blocking query is a migration]
    E --> F[Migration is waiting behind an idle transaction]
    F --> G[Retries are increasing request volume]
    G --> H([Stop retries, cancel migration, or terminate a known-safe blocker])

Or:

Writes are slow.
        ↓
CPU is normal.
        ↓
WAL generation spiked.
        ↓
Checkpoint warnings started.
        ↓
Replica lag is increasing.
        ↓
A backfill began five minutes earlier.
        ↓
Safest mitigation is to pause or throttle the backfill.

This is what monitoring is for.

Not to show everything.

To help the team move from symptom to mechanism to decision.


Common monitoring anti-patterns

Dashboard as decoration

A dashboard nobody uses during incidents is not observability. It is wallpaper.

Too many panels, no hierarchy

If every graph has equal visual importance, the dashboard cannot guide attention.

Alerts without ownership

Every alert should have an owner, expected action, and reason for existence.

Internal metrics without user impact

A database can look noisy without affecting customers. A page should usually be tied to impact or imminent risk.

User impact without database detail

Knowing users are affected is not enough. You need fast paths into database evidence.

No deployment or migration markers

Without change context, incidents take longer to explain.

Averaging away important behavior

Mean latency hides outliers. Total time hides variance. Aggregate database metrics hide one bad tenant or one hot table.

Ignoring application metrics

Postgres cannot show pool checkout time, retry storms, request deadlines, or business operation failures by itself.


Why monitoring incidents are good simulation material

Monitoring failures are often human failures.

The metrics were there, but nobody knew which ones mattered. The dashboard showed the answer, but it was buried under noise. The alert fired, but it was not actionable. The team watched CPU while the real problem was locks. The team watched slow queries while the real problem was pool saturation. The team watched the primary while the replica was serving stale reads. The team watched database metrics while application retries amplified the incident.

A realistic simulation can train:

reading dashboards under pressure;
separating symptoms from causes;
forming hypotheses from weak signals;
rejecting misleading metrics;
connecting application and database behavior;
deciding when a metric is actionable;
communicating uncertainty clearly;
choosing mitigations based on evidence.

This is the gap articles cannot fully close.

A written guide can explain which metrics exist. A dashboard can display the signals. A simulation teaches the team how to reason when ten signals change at once.


Conclusion

Postgres monitoring is not about collecting every metric.

It is about building an evidence system for production decisions.

Good monitoring starts with user impact, connects that impact to application behavior, then follows pressure into Postgres internals, storage, replication, and maintenance systems.

Useful metrics answer operational questions:

Are users affected?
Where is the queue?
What changed?
What is Postgres waiting on?
Which workload owns the pressure?
Is this a query, lock, IO, WAL, vacuum, replication, or pool problem?
Is the system getting worse?
Which mitigation reduces risk?

The dangerous phrase is:

We have dashboards, so we are covered.

The better reliability question is:

Can our monitoring help an engineer make the right decision during a confusing incident?

That is the difference between metric collection and Postgres database reliability.

Newsletter

Stay in the loop

New incident tracks, psql+ features and hard-won PostgreSQL tips — delivered to your inbox now and then.

No spam. Unsubscribe anytime.