Vacuum Apr 14, 2026 15 min read

Autovacuum: the quiet Postgres process that becomes a loud reliability problem

Video walkthrough coming soon

Autovacuum is easy to ignore when everything is healthy.

It runs in the background. It does not usually appear in product discussions. It is rarely mentioned in feature planning. It does not look like an application dependency.

Then one day the database gets slower, storage grows unexpectedly, query plans become unstable, or Postgres starts warning about transaction ID wraparound.

At that point, autovacuum is no longer background maintenance.

It is part of the incident.

Postgres autovacuum exists because MVCC creates old row versions that must eventually be cleaned up, and because the planner needs fresh table statistics to choose good query plans. The PostgreSQL documentation describes routine vacuuming as necessary to recover or reuse storage occupied by updated or deleted rows, update planner statistics, and protect against transaction ID wraparound. (PostgreSQL)

The reliability lesson is simple:

Autovacuum is not an optional optimization.
It is part of Postgres survival.

Why Postgres needs vacuum at all

Postgres uses MVCC: multi-version concurrency control.

When a row is updated, Postgres does not simply overwrite the old row in place. It creates a new row version. When a row is deleted, the old version is not immediately removed from the table file.

That design allows concurrent transactions to see a consistent view of data without blocking each other unnecessarily.

But it creates a maintenance problem.

Old row versions eventually become unnecessary. Once no active transaction can still see them, they can be cleaned up. That cleanup is one of the main jobs of VACUUM.

A simplified update chain:

UPDATE users SET status = 'active' WHERE id = 42;

Old row version remains for older snapshots.
New row version becomes visible to newer transactions.
VACUUM can later remove the old version when safe.

If cleanup does not keep up, dead tuples accumulate.

The table may become physically larger. Indexes may become less efficient. Sequential scans may touch more pages. Index scans may visit more dead entries. Autovacuum may need to do more work later under worse conditions.

This is how a quiet maintenance lag becomes user-visible latency.

Autovacuum also runs ANALYZE

Autovacuum is not only about removing dead tuples.

It also triggers ANALYZE, which refreshes planner statistics. PostgreSQL documentation notes that the autovacuum daemon automatically issues ANALYZE when table contents have changed sufficiently. (PostgreSQL)

That matters because the query planner depends on statistics.

Consider this query:

SELECT *
FROM invoices
WHERE account_id = $1
  AND status = 'open'
ORDER BY due_date
LIMIT 50;

The planner needs to estimate:

How many rows match this account_id?
How selective is status = 'open'?
Is an index scan cheaper than a sequential scan?
Will sorting be expensive?

If statistics are stale, Postgres may choose a bad plan.

A table with poor vacuum behavior often has poor analyze behavior too. The incident may appear as a slow query, but the underlying issue may be maintenance starvation.

You can inspect recent vacuum and analyze activity:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    last_autoanalyze,
    vacuum_count,
    autovacuum_count,
    analyze_count,
    autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;

This does not prove bloat by itself, but it shows where cleanup pressure and statistics freshness deserve attention.

The most common autovacuum misconception

A dangerous sentence:

Autovacuum is using IO, so let’s disable it.

Autovacuum can absolutely create load. It reads pages, cleans dead tuples, updates visibility information, and may generate WAL.

But disabling it usually converts visible maintenance cost into hidden future debt.

That debt comes back as:

larger tables;
larger indexes;
worse cache efficiency;
slower scans;
stale statistics;
unstable query plans;
wraparound risk;
emergency anti-wraparound vacuum;
operational panic.

PostgreSQL’s autovacuum setting controls whether the server runs the autovacuum launcher, and it is on by default; the docs also note that track_counts must be enabled for autovacuum to work. (PostgreSQL)

You can check the basics:

SHOW autovacuum;
SHOW track_counts;

And inspect relevant settings:

SELECT
    name,
    setting,
    unit,
    context,
    short_desc
FROM pg_settings
WHERE name LIKE 'autovacuum%'
   OR name IN (
       'track_counts',
       'vacuum_cost_delay',
       'vacuum_cost_limit',
       'maintenance_work_mem',
       'autovacuum_work_mem'
   )
ORDER BY name;

The goal is not to turn autovacuum off.

The goal is to make sure it can keep up with the workload.

Dead tuples are a signal, not the whole diagnosis

A common starting point:

SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;

This is useful, but it has limits.

n_dead_tup is an estimate. It does not directly equal “bloat”. A table can have many dead tuples and still be manageable if autovacuum is keeping up. Another table can have fewer dead tuples but be operationally sensitive because it is large, hot, heavily indexed, or latency-critical.

Better questions:

Is the number of dead tuples growing over time?
Does autovacuum run but fail to catch up?
Is the table write-heavy?
Are long transactions preventing cleanup?
Are indexes growing faster than expected?
Did query latency change as dead tuples accumulated?

For incident response, trend is often more important than a single snapshot.

Long transactions can block cleanup

Autovacuum cannot remove row versions that might still be visible to an old transaction.

That means one old transaction can keep dead tuples alive across the database.

Find old transactions:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;

Find sessions idle inside a transaction:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

An idle in transaction session may look harmless because it is not actively consuming CPU. But it can prevent cleanup, hold locks, and keep old snapshots alive.

A classic reliability failure chain:

flowchart TD
    A[Application opens transaction] --> B[Transaction becomes idle and remains open]
    B --> C[Updates and deletes continue elsewhere]
    C --> D[Dead tuples cannot be fully cleaned]
    D --> E[Tables and indexes grow]
    E --> F[Queries touch more pages]
    F --> G[Latency increases]
    G --> H([Connection pool saturates])

The visible symptom may be slow queries.

The mechanism may be vacuum being unable to clean because the application is holding old snapshots.

The table that autovacuum cannot catch

Some tables are much harder for autovacuum than others.

Examples:

high-update tables;
queue-like tables;
session tables;
event status tables;
tables with frequent DELETE;
tables with many indexes;
tables with very large row counts;
tables with hot tenants or skewed access patterns.

A queue table is a common example:

CREATE TABLE jobs (
    id bigserial PRIMARY KEY,
    status text NOT NULL,
    payload jsonb NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

Workers constantly do:

UPDATE jobs
SET status = 'running',
    updated_at = now()
WHERE id = $1;

Then:

UPDATE jobs
SET status = 'done',
    updated_at = now()
WHERE id = $1;

Or:

DELETE FROM jobs
WHERE status = 'done'
  AND updated_at < now() - interval '7 days';

This table may generate dead tuples continuously.

A default autovacuum configuration may be too conservative for it, especially if the table is large and the scale factor means vacuum starts only after a large number of changes.

You can inspect per-table autovacuum options:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.reloptions IS NOT NULL
ORDER BY n.nspname, c.relname;

For a hot table, per-table tuning may be more appropriate than changing global settings:

ALTER TABLE jobs SET (
    autovacuum_vacuum_scale_factor = 0.02,
    autovacuum_vacuum_threshold = 5000,
    autovacuum_analyze_scale_factor = 0.01,
    autovacuum_analyze_threshold = 5000
);

This is only an example, not a universal recommendation. The right values depend on write rate, table size, IO capacity, latency goals, and how much maintenance work the system can absorb.

PostgreSQL exposes autovacuum thresholds, scale factors, cost delay settings, and worker limits as configuration parameters; these settings control when and how autovacuum runs. (PostgreSQL)

Autovacuum workers are limited

Autovacuum is not an infinite background army.

It has a launcher and a limited number of workers. If several large or busy tables need cleanup at the same time, some tables wait.

Check running autovacuum activity:

SELECT
    pid,
    datname,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS runtime,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE query ILIKE 'autovacuum:%'
ORDER BY query_start ASC;

You can also inspect active vacuum progress. PostgreSQL provides pg_stat_progress_vacuum, with one row for each backend, including autovacuum workers, currently running VACUUM. (PostgreSQL)

SELECT
    p.pid,
    a.datname,
    a.application_name,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;

This helps answer:

Is vacuum currently running?
Which table is it working on?
Is it scanning heap pages?
Is it vacuuming indexes?
Is it spending a long time on one relation?

If autovacuum is always running but dead tuples continue rising, the system may be under-provisioned for its write workload, misconfigured for specific hot tables, blocked by old transactions, or overloaded by competing IO.

Cost-based delay: autovacuum can be too polite

Autovacuum is designed not to overwhelm the system.

That politeness can become a problem.

Cost-based vacuum delay allows vacuum to pause during work so it does not consume too many resources at once. PostgreSQL exposes cost-based vacuum settings and progress/verbose reporting related to this behavior. (PostgreSQL)

In a write-heavy system, autovacuum can be so gentle that it never catches up.

The symptom is not that autovacuum is absent.

The symptom is that it is always behind.

You may see:

autovacuum runs frequently;
dead tuples remain high;
table size keeps growing;
indexes grow disproportionately;
query latency slowly worsens;
manual VACUUM helps temporarily;
the problem returns.

This is a capacity mismatch.

The database is generating cleanup work faster than the maintenance system is allowed to process it.

Vacuum and indexes

Vacuum is not only about heap tuples.

Indexes also matter.

When tables are updated and deleted, indexes can accumulate dead entries. Vacuum has to deal with those too.

A table with many indexes creates more maintenance work per row change.

Example:

CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_orders_status ON orders(status);
CREATE INDEX idx_orders_created_at ON orders(created_at);
CREATE INDEX idx_orders_region ON orders(region);
CREATE INDEX idx_orders_status_created ON orders(status, created_at);

Every update that changes indexed columns can increase write and maintenance cost.

A useful index review query:

SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC, idx_tup_read ASC
LIMIT 50;

Low usage does not automatically mean an index is safe to drop. It may support rare but critical queries, constraints, or incident workflows.

But unused indexes are not free.

They increase write amplification and vacuum work. In reliability terms, unnecessary indexes are permanent background cost.

Wraparound: the autovacuum incident you really do not want

Postgres transaction IDs are finite. To prevent transaction ID wraparound, tables must be vacuumed so old transaction IDs can be frozen. Routine vacuuming documentation explicitly includes protection against transaction ID wraparound as one of the reasons vacuuming is necessary. (PostgreSQL)

Inspect transaction ID age by table:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;

Also inspect database age:

SELECT
    datname,
    age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY age(datfrozenxid) DESC;

When wraparound risk grows, Postgres becomes increasingly aggressive about vacuuming. Anti-wraparound vacuum is not a normal tuning issue. It is a reliability emergency.

The worst version of this incident looks like:

Autovacuum was disabled or starved.
Old transactions prevented cleanup.
Large tables were not frozen in time.
Wraparound warnings appeared.
Emergency vacuum consumed IO.
Critical workload slowed down.
Operators had limited safe options.

The best time to care about transaction age is long before those warnings appear.

Multixact age: the less famous cousin

Postgres also tracks multixact IDs, which are relevant for row locking scenarios such as foreign keys and shared row locks.

You can inspect multixact age:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    mxid_age(c.relminmxid) AS mxid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY mxid_age(c.relminmxid) DESC
LIMIT 30;

This is especially relevant in systems with heavy foreign key activity, concurrent locking, or queue-like patterns.

Many teams monitor transaction ID age but forget multixact age.

That blind spot can turn into a surprise maintenance emergency.

Manual VACUUM is not a magic button

You can run:

VACUUM VERBOSE orders;

Or:

VACUUM (VERBOSE, ANALYZE) orders;

PostgreSQL’s VACUUM command reports progress through pg_stat_progress_vacuum for regular vacuum operations; VACUUM FULL is different because it rewrites the table and reports through cluster progress views. (PostgreSQL)

The distinction is important.

Regular VACUUM cleans up dead tuples and makes space reusable inside the table. It does not usually shrink the table file on disk dramatically.

VACUUM FULL rewrites the table and can return disk space to the operating system, but it requires much stronger locking and is operationally disruptive.

That means this command is not a casual production fix:

VACUUM FULL orders;

It may block access in ways your product cannot tolerate.

A reliability-minded approach asks:

Do we need to improve query performance?
Do we need to recover disk to the OS?
Can regular VACUUM catch up?
Is bloat severe enough to justify a rewrite?
Can we use online rebuild strategies instead?
What is the lock impact?
What is the rollback plan?

Manual vacuuming can help, but it does not replace understanding why autovacuum fell behind.

Logging autovacuum activity

Autovacuum can be made more observable through logging.

PostgreSQL provides log_autovacuum_min_duration, which logs autovacuum actions exceeding the configured duration; the documentation notes this can help track autovacuum activity. (PostgreSQL)

Example:

ALTER SYSTEM SET log_autovacuum_min_duration = '5s';
SELECT pg_reload_conf();

Or in postgresql.conf:

log_autovacuum_min_duration = '5s'

In noisy systems, you may choose a higher value. In an investigation, lowering it temporarily can provide evidence.

Autovacuum logs can reveal:

which tables are vacuumed often;
which vacuums take a long time;
whether dead tuple cleanup is effective;
whether vacuum is skipped or delayed;
whether index cleanup dominates;
whether analyze is happening regularly.

The goal is not to log everything forever.

The goal is to make background maintenance visible enough to reason about it.

Autovacuum and partitioning

Partitioning can make vacuum behavior more manageable, but it does not remove the need for vacuum.

For event-like data, partitioning by time can help because old partitions become mostly static.

Example:

CREATE TABLE events (
    id bigint NOT NULL,
    tenant_id bigint NOT NULL,
    event_type text NOT NULL,
    created_at timestamptz NOT NULL,
    payload jsonb NOT NULL
) PARTITION BY RANGE (created_at);

Monthly partitions:

CREATE TABLE events_2026_06
PARTITION OF events
FOR VALUES FROM ('2026-06-01') TO ('2026-07-01');

For append-mostly workloads, old partitions may need less frequent vacuuming after they stop changing.

For hot current partitions, autovacuum still matters.

Partitioning helps when it matches the data lifecycle. It hurts when it is used as a substitute for understanding write patterns.

Bad partitioning can create:

too many relations;
planning overhead;
operational complexity;
uneven hot partitions;
forgotten per-table settings;
maintenance surprises.

The reliability question is not “Should we partition?”

It is:

Does partitioning align with how data is written, updated, queried, retained, and vacuumed?

A practical autovacuum health snapshot

This is not a full runbook, but it gives a useful operational snapshot.

Largest dead tuple estimates:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    round(
        100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1),
        2
    ) AS dead_tuple_percent,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;

Tables not recently vacuumed:

SELECT
    schemaname,
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_vacuum
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY last_autovacuum NULLS FIRST, n_dead_tup DESC
LIMIT 30;

Oldest transaction IDs:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    age(c.relfrozenxid) AS xid_age,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'm')
ORDER BY age(c.relfrozenxid) DESC
LIMIT 30;

Current vacuum progress:

SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.phase,
    p.heap_blks_total,
    p.heap_blks_scanned,
    p.heap_blks_vacuumed,
    p.index_vacuum_count,
    now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;

Old transactions that can hold cleanup back:

SELECT
    pid,
    usename,
    application_name,
    state,
    now() - xact_start AS transaction_age,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;

Per-table autovacuum overrides:

SELECT
    n.nspname AS schema_name,
    c.relname AS table_name,
    c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.reloptions IS NOT NULL
ORDER BY n.nspname, c.relname;

These queries help build a picture. They do not replace interpretation.

What teams often get wrong

They only notice autovacuum when it hurts

If the first time you discuss autovacuum is during an incident, the system has already been running on assumptions.

They use global settings for table-specific problems

A single hot table often needs specific tuning. Changing global autovacuum settings can help one table while causing unnecessary maintenance pressure elsewhere.

They ignore application transaction behavior

No autovacuum configuration fully compensates for application code that holds transactions open for too long.

They treat bloat as a one-time cleanup task

Bloat cleanup without workload change is temporary. If the write pattern remains the same, the problem returns.

They forget that indexes multiply maintenance cost

Every unnecessary index makes writes and vacuum more expensive.

They use `VACUUM FULL` too casually

It can reclaim disk, but it rewrites the table and can create serious locking impact. It is a maintenance operation, not a harmless cleanup command.

A better mental model

Autovacuum is a feedback system.

flowchart TD
    A[Application writes] --> B[Updates and deletes create dead tuples]
    B --> C[Autovacuum cleans old versions]
    C --> D[ANALYZE refreshes statistics]
    D --> E[Planner makes better decisions]
    E --> F[Queries stay predictable]
    F --> G([Storage growth remains controlled])

When that feedback loop breaks, the symptoms appear elsewhere:

slow queries;
bad plans;
growing storage;
high IO;
replication pressure;
pool saturation;
wraparound warnings;
long incident calls.

That is why autovacuum problems are often misdiagnosed.

They do not always announce themselves as “autovacuum failed.”

They appear as system degradation.

Why autovacuum incidents are strong simulation material

Autovacuum incidents are excellent for training because they develop slowly and then become urgent.

A realistic simulation might include:

a high-update table;
a long idle transaction;
dead tuples accumulating;
query plans becoming unstable;
autovacuum workers running but not catching up;
storage growth;
an engineer proposing to disable autovacuum;
a manual VACUUM that helps only partially;
a wraparound risk warning later in the scenario.

The hard part is not knowing that VACUUM exists.

The hard part is connecting weak signals before they become a major incident.

Is autovacuum absent, blocked, too slow, or just overloaded? Are stale statistics causing bad plans? Is a long transaction preventing cleanup? Is the table design generating too much churn? Is the immediate risk latency, storage, or wraparound? Should the response be tuning, traffic reduction, transaction cleanup, manual vacuum, index review, or application change?

These decisions require operational judgment.

An article can explain the mechanism. A dashboard can show the counters. A simulation forces the team to make decisions while the system is degrading.

Conclusion

Autovacuum is not background noise.

It is one of the core processes that keeps a Postgres system healthy over time.

When it works well, nobody notices. When it falls behind, the symptoms can appear as slow queries, unstable plans, growing tables, bloated indexes, IO pressure, storage incidents, or transaction ID wraparound risk.

The right lesson is not “autovacuum is good” or “autovacuum is bad.”

The right lesson is:

Autovacuum is part of the workload.
It needs capacity, observability, and tuning.

Reliable Postgres operations require knowing which tables generate cleanup pressure, which transactions prevent cleanup, which indexes amplify maintenance cost, and which alerts reveal trouble early enough to act safely.

Autovacuum is quiet by design.

Database reliability means hearing it before it has to become loud.