← All posts

Locks 12 min read

Postgres locks: how one ALTER TABLE can stop your product

Postgres locks are not a bug.

They are one of the reasons Postgres can safely protect your data while many users, services, jobs, migrations, and background processes are touching the same database at the same time.

The problem is that locks are often invisible until they are not.

A migration that looked harmless in staging can freeze production traffic. A long-running transaction can block a schema change. A background job can hold a lock longer than expected. A single ALTER TABLE can create a queue of blocked queries behind it.

From the outside, this often looks like:

The application is slow.
Requests are timing out.
Postgres has many active connections.
CPU is not necessarily high.
The database “looks stuck”.

But Postgres may not be stuck at all. It may be doing exactly what it was designed to do: preserving consistency.


The dangerous misunderstanding

Many teams think about locks only when they explicitly run something like:

LOCK TABLE users;

But most Postgres locks are not written manually. They are acquired automatically by normal SQL operations.

For example:

SELECT * FROM orders WHERE id = 42;
UPDATE orders SET status = 'paid' WHERE id = 42;
ALTER TABLE orders ADD COLUMN processed_at timestamptz;
CREATE INDEX orders_created_at_idx ON orders(created_at);
DELETE FROM sessions WHERE expires_at < now();

All of these can involve locks.

Usually, that is fine. Most locks are short-lived and harmless. The incident starts when a lock is held longer than expected, or when a lock request waits behind another transaction while new queries pile up behind it.

This is the part that surprises people: the most damaging session is not always the one using the most CPU. Sometimes it is just waiting.


A simple lock queue scenario

Imagine a busy table:

CREATE TABLE accounts (
    id bigint PRIMARY KEY,
    email text NOT NULL,
    status text NOT NULL
);

The application constantly runs queries like:

SELECT *
FROM accounts
WHERE id = $1;

Now a migration starts:

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;

Depending on the operation and Postgres version, this may be fast. But it still needs a table lock. If another transaction is already touching the table in a way that conflicts, the ALTER TABLE waits.

That sounds safe: the migration is waiting, not blocking, right?

Not quite.

Once the ALTER TABLE is waiting for a strong lock, later application queries may queue behind it. The result can look like the whole table is frozen.

A simplified chain:

flowchart TD
    A[Long transaction touches accounts] --> B[ALTER TABLE waits for lock]
    B --> C[New application queries arrive]
    C --> D[They queue behind the pending ALTER TABLE]
    D --> E[Connection pool fills]
    E --> F[Requests time out]
    F --> G([Incident])

The migration may not be consuming CPU. It may not be doing heavy IO. It may simply be waiting.

But its position in the lock queue can still damage production traffic.


Locks are about compatibility

Postgres has different lock modes. They are not all equal.

A normal SELECT does not block another normal SELECT. Many operations can safely happen together. The problem appears when two operations require incompatible locks.

You do not need to memorize the entire lock matrix to respond well to incidents, but you do need the mental model:

Weak locks allow many operations to continue.
Strong locks conflict with more operations.
Some schema changes require very strong locks.
A waiting strong lock can cause later queries to queue.

For product engineers, the important lesson is this:

“This query is fast locally” does not mean “this operation is operationally safe in production.”

Lock behavior depends on concurrency, transaction duration, table size, workload, and timing.


The classic villain: idle in transaction

One of the most common lock-related problems is not a dramatic query. It is a transaction that started, did some work, and then remained open.

For example, application code does something like:

BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application waits on network, external API, user input, or crashes before COMMIT

From the database side, the session may become:

idle in transaction

That means it is not actively running a query, but the transaction is still open.

You can find old transactions with:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    state,
    now() - xact_start AS transaction_age,
    wait_event_type,
    wait_event,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC;

And specifically:

SELECT
    pid,
    usename,
    application_name,
    client_addr,
    now() - xact_start AS transaction_age,
    left(query, 160) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;

An idle in transaction session can be harmful because it may:

hold locks;
prevent vacuum cleanup;
keep old row versions visible;
interfere with migrations;
increase table and index bloat over time;
confuse incident responders because it looks inactive.

The session is “idle”, but the transaction is not harmless.


A safer way to inspect blockers

Modern Postgres gives you a very useful function:

pg_blocking_pids(pid)

You can use it to see which sessions are blocking others:

SELECT
    blocked.pid AS blocked_pid,
    blocked.usename AS blocked_user,
    blocked.application_name AS blocked_app,
    blocked.state AS blocked_state,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 120) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.usename AS blocking_user,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;

This is often easier and safer than writing a large manual join over pg_locks.

The result can tell you:

Who is blocked?
Who is blocking them?
How long has each query been running?
Which application opened the session?
Is the blocker active or idle in transaction?

But this query is not a complete incident response plan. It only answers one question: “Who is blocking whom?”

The harder question is: “What is the safest action now?”


Why killing the blocker is not always the right move

When you find a blocking session, the tempting move is:

SELECT pg_terminate_backend(<pid>);

That can be the correct action in some incidents. But it is dangerous as a reflex.

There are two related functions:

SELECT pg_cancel_backend(<pid>);
SELECT pg_terminate_backend(<pid>);

The difference matters.

pg_cancel_backend asks Postgres to cancel the current query. The connection stays alive.

pg_terminate_backend terminates the whole backend connection. If it is inside a transaction, the transaction is rolled back.

That rollback can itself be expensive. It can also trigger application retries, break a migration, or cause a thundering herd of reconnects.

A better incident question is:

Is this blocker safe to cancel?
Is it part of a migration?
Is it user traffic?
Is it a background job?
Is it already rolling back?
Will the application retry immediately?
Will killing it unblock the critical path or create more load?

The existence of a blocker tells you where pressure is accumulating. It does not automatically tell you what to kill.


Schema migrations and lock risk

Schema migrations deserve special respect in Postgres.

Consider:

ALTER TABLE users
ADD COLUMN last_seen_at timestamptz;

This can be quick. But “quick” is not the same as “risk-free”.

Now consider:

ALTER TABLE users
ALTER COLUMN email SET NOT NULL;

Or:

ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);

Or:

CREATE INDEX orders_created_at_idx
ON orders(created_at);

Some operations scan data. Some require stronger locks. Some block writes. Some interact badly with long transactions. Some are safe on small tables and dangerous on large ones.

For indexes, the production-safe form is often:

CREATE INDEX CONCURRENTLY orders_created_at_idx
ON orders(created_at);

But CONCURRENTLY is not magic. It reduces blocking, but it can still:

take a long time;
consume CPU and IO;
fail and leave an invalid index;
conflict with other schema changes;
increase load during an already sensitive period.

You can check invalid indexes with:

SELECT
    schemaname,
    tablename,
    indexname
FROM pg_indexes
WHERE indexname IN (
    SELECT relname
    FROM pg_class
    WHERE oid IN (
        SELECT indexrelid
        FROM pg_index
        WHERE NOT indisvalid
    )
);

A cleaner version using catalog tables:

SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    i.relname AS index_name,
    ix.indisvalid,
    ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
   OR ix.indisready = false;

This is useful after a failed concurrent index build.


Use timeouts as guardrails

One of the simplest ways to reduce lock-related blast radius is to use timeouts during migrations.

For example:

SET lock_timeout = '2s';
SET statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;

lock_timeout means: do not wait forever to acquire a lock.

This is valuable because the worst migration is often not the one that fails. It is the one that waits silently and causes application traffic to queue behind it.

A common migration pattern is:

BEGIN;
SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;

COMMIT;

However, be careful with commands like CREATE INDEX CONCURRENTLY: they cannot run inside a normal transaction block.

For example:

SET lock_timeout = '2s';
SET statement_timeout = '30min';

CREATE INDEX CONCURRENTLY idx_accounts_email
ON accounts(email);

Timeouts do not make a migration safe by themselves. They are guardrails. They help a risky operation fail early instead of becoming an incident.


Detecting lock pressure before users notice

During normal operation, you can inspect waiting sessions:

SELECT
    pid,
    usename,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;

You can also summarize wait events:

SELECT
    wait_event_type,
    wait_event,
    count(*) AS sessions
FROM pg_stat_activity
WHERE wait_event_type IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;

This helps separate lock waits from other kinds of waits.

For example:

Lock waits suggest contention.
IO waits suggest disk or storage pressure.
Client waits may indicate application behavior.
LWLock waits may indicate internal contention.

But again, this is not enough by itself. You still need context:

Did a migration just start?
Did a deploy just happen?
Is a background job running?
Did traffic increase?
Are blocked sessions all from one service?
Are blockers idle in transaction?

Locks become understandable only when connected to system events.


Row locks can also cause incidents

Not all dangerous locks are table-level migration locks.

Application-level transactions can block each other on rows.

For example:

BEGIN;

UPDATE accounts
SET balance = balance - 100
WHERE id = 1;

-- transaction remains open

Another transaction tries:

UPDATE accounts
SET balance = balance + 100
WHERE id = 1;

The second transaction waits.

This is normal. But if the first transaction waits on an external API before committing, you have created database contention from application behavior.

A dangerous pattern:

BEGIN
  update database row
  call external service
  wait for response
  update another row
COMMIT

A safer pattern is often:

do external work before opening transaction;
keep the transaction small;
avoid user/network waits inside transactions;
commit quickly;
make retry behavior explicit.

Postgres can handle concurrency well, but it cannot make long business transactions short.

That is an application architecture problem, not just a database problem.


SELECT ... FOR UPDATE is powerful and dangerous

Many systems use row-level locking intentionally:

SELECT *
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE;

This can be correct, but under concurrency it can create contention.

For job queues, a better pattern is often:

SELECT *
FROM jobs
WHERE status = 'pending'
ORDER BY created_at
LIMIT 1
FOR UPDATE SKIP LOCKED;

SKIP LOCKED allows workers to skip rows already locked by other workers.

But this changes semantics. It is useful for queues and work distribution, not for every business operation.

The reliability lesson is that lock behavior is part of application design. It is not just a database implementation detail.


Advisory locks: useful, but easy to forget

Postgres also supports advisory locks:

SELECT pg_advisory_lock(12345);

And:

SELECT pg_advisory_unlock(12345);

These are application-defined locks. They are useful for leader election, scheduled jobs, migration coordination, or preventing duplicate work.

But they can also create mysterious incidents if not visible in normal application logs.

You can inspect advisory locks with:

SELECT
    a.pid,
    a.usename,
    a.application_name,
    l.locktype,
    l.mode,
    l.granted,
    now() - a.query_start AS query_age,
    left(a.query, 160) AS query_preview
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
ORDER BY query_age DESC;

Advisory locks are not bad. Hidden coordination is bad.

If your system uses advisory locks, they should be named, documented, observable, and included in incident thinking.


A practical migration safety checklist

This is not a full migration playbook, but these questions catch many common problems.

Before running a migration on a hot table, ask:

How large is the table?
What lock level does this operation need?
Can it run concurrently?
Can it be split into smaller phases?
Does it scan or rewrite the table?
Can it fail quickly with lock_timeout?
Is there a rollback plan?
Are there long-running transactions right now?
Is traffic normal or elevated?
Will application retries amplify the problem?
Are dashboards ready for lock waits and pool saturation?

For large tables, prefer phased changes.

For example, instead of immediately adding a strict constraint:

ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0);

You may use:

ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;

Then validate later:

ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;

This pattern can reduce operational risk because adding the constraint metadata and validating existing rows are separated.

Again, the point is not to memorize one trick. The point is to treat schema changes as production operations, not just code changes.


What teams often get wrong

They test migrations only on empty or tiny databases

A migration that takes 100 ms on staging may behave very differently on a 500 GB production table.

They ignore concurrent workload

The table is not sitting idle in production. It is being read, written, vacuumed, indexed, and queried by multiple services.

They forget old transactions

One forgotten transaction can turn a safe migration into a production incident.

They run DDL without timeouts

A migration that waits forever can become a silent lock queue.

They treat the database as isolated

The real incident may involve the app pool, retries, background jobs, dashboards, and human decisions.


Why lock incidents are good simulation material

Lock incidents are especially valuable to practice because they are deceptive.

They often do not look dramatic at first.

CPU may be fine. Memory may be fine. The migration may appear to be “just waiting.” The blocker may be “idle.” The application may report generic timeout errors.

A good simulation teaches the operational loop:

flowchart TD
    A[Notice latency] --> B[Inspect active sessions]
    B --> C[Identify lock waits]
    C --> D[Find blockers]
    D --> E[Understand application context]
    E --> F[Choose safe mitigation]
    F --> G[Observe consequences]
    G --> H[Review why the system was vulnerable]

The hard part is not running a query against pg_stat_activity.

The hard part is deciding what the result means under pressure.

Should you cancel the migration? Terminate the blocker? Reduce application concurrency? Disable a worker? Rollback a deploy? Wait? Communicate impact? Prevent retries?

Those choices are where reliability skill is built.


Conclusion

Postgres locks are not the enemy. They are part of how Postgres protects correctness.

The incident happens when lock behavior meets production reality:

large tables;
long transactions;
busy applications;
schema migrations;
connection pools;
background jobs;
retry storms;
unclear ownership;
time pressure.

A single ALTER TABLE can stop a product not because Postgres is fragile, but because production systems are concurrent.

The right lesson is not “avoid locks.” The right lesson is “understand the lock behavior of your changes before production does.”

Articles and checklists can teach the concepts. Queries can reveal symptoms. But lock incidents require practiced judgment.

Because in the middle of an incident, the question is rarely:

Is there a lock?

The real question is:

Which action reduces risk without making the system worse?
Newsletter

Stay in the loop

New incident tracks, psql+ features and hard-won PostgreSQL tips — delivered to your inbox now and then.

No spam. Unsubscribe anytime.