Migrations Apr 19, 2026 16 min read

Schema migrations in Postgres: why safe SQL can be dangerous in production

Video walkthrough coming soon

Schema migrations are one of the most common ways teams accidentally create Postgres incidents.

The migration passes code review. It works locally. It runs instantly on staging. The SQL is syntactically correct. The change looks small.

Then production traffic slows down, the connection pool fills, requests time out, and the incident channel starts with a familiar sentence:

The database is stuck.

Usually, Postgres is not stuck.

It is enforcing the rules that keep data consistent while many transactions touch the same tables concurrently.

The mistake is treating a schema migration as “just a code change.”

In production, a migration is an operational event.

The core problem: DDL changes concurrency

A normal application query changes data or reads data.

A schema migration changes the shape of the database itself.

That difference matters because Postgres must protect the table definition while other sessions are reading or writing rows. ALTER TABLE has many subforms, and the official documentation notes that lock levels differ by subform; unless explicitly noted, ALTER TABLE acquires an ACCESS EXCLUSIVE lock, and when several subcommands are combined, Postgres uses the strictest required lock. (PostgreSQL)

That is the reliability risk.

A migration may not be CPU-heavy. It may not read much data. It may not write many rows. It may simply need a lock that conflicts with normal traffic.

A migration incident often looks like this:

flowchart TD
    A[Long-running transaction touches a hot table] --> B[Migration waits for a table lock]
    B --> C[New application queries arrive]
    C --> D[They queue behind the waiting migration]
    D --> E[Application pool fills]
    E --> F[Requests time out]
    F --> G[Retries increase pressure]
    G --> H([Production incident])

The migration may be “waiting.”

But waiting in the wrong place can still stop the product.

Lock compatibility is the hidden part of migration safety

Postgres locks are not all equal.

A regular SELECT acquires an ACCESS SHARE lock. INSERT, UPDATE, DELETE, and MERGE acquire ROW EXCLUSIVE locks on the target table. CREATE INDEX without CONCURRENTLY acquires a SHARE lock. ACCESS EXCLUSIVE conflicts with every table-level lock mode and is the only table-level lock that blocks a plain SELECT. (PostgreSQL)

This is why a schema change can have a much larger blast radius than expected.

For example:

ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;

This may be fast in many situations. But “fast” is not the same as “risk-free.”

Even a short lock can be dangerous when:

the table is hot;
transactions are long;
traffic is high;
the migration waits behind another session;
application timeouts are short;
retries are aggressive;
the deploy starts many app instances at once.

The operational risk is not only how long the migration takes after it starts.

It is also how long it waits before it can safely start.

The migration that waits can be worse than the migration that runs

A migration can damage traffic before it does any meaningful work.

Suppose this transaction is open:

BEGIN;

SELECT *
FROM accounts
WHERE id = 42;

-- application stays idle before COMMIT

Now a migration runs:

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;

If the migration waits for a strong lock, later queries against accounts can queue behind it.

That queue can grow quickly:

Session A: old transaction is still open
Session B: ALTER TABLE waits for lock
Session C: SELECT from application waits
Session D: UPDATE from application waits
Session E: SELECT from application waits
...

This is why “the migration is only waiting” is not comforting.

A waiting migration can become a traffic barrier.

During an incident, look for blocked and blocking sessions:

SELECT
    blocked.pid AS blocked_pid,
    blocked.application_name AS blocked_app,
    now() - blocked.query_start AS blocked_duration,
    left(blocked.query, 160) AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.application_name AS blocking_app,
    blocking.state AS blocking_state,
    now() - blocking.query_start AS blocking_duration,
    left(blocking.query, 160) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;

And inspect sessions waiting on locks:

SELECT
    pid,
    application_name,
    usename,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;

The goal is not only to identify the blocking PID.

The goal is to understand whether the migration has created a queue in front of production traffic.

Use lock timeouts as blast-radius control

A migration should not wait forever for a lock on a hot table.

Use a lock timeout:

SET lock_timeout = '2s';
SET statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;

Inside a transaction:

BEGIN;

SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '5min';

ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;

COMMIT;

This does not make the migration safe.

It makes failure faster.

That is valuable.

A failed migration with a clear timeout is usually better than a migration that silently waits and causes traffic to queue behind it.

The operational principle:

Migrations should fail before they become incidents.

But timeouts need to be chosen carefully. Too short, and safe migrations fail constantly. Too long, and the timeout no longer protects production traffic.

`CREATE INDEX` is not always online

A classic migration:

CREATE INDEX idx_orders_customer_id
ON orders (customer_id);

This is a normal index build. It can block writes to the table.

For production systems, teams often use:

CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);

Postgres documents CREATE INDEX CONCURRENTLY as a way to create an index without locking out writes to the table. The same documentation also notes important caveats: concurrent index builds cannot run inside a transaction block, only one concurrent index build can run on a table at a time, and failed concurrent builds can leave an invalid index behind. (PostgreSQL)

That means this is invalid:

BEGIN;

CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);

COMMIT;

You need to run it outside a normal transaction block:

CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);

Many migration frameworks wrap migrations in transactions by default. That default is good for many schema changes, but it conflicts with CREATE INDEX CONCURRENTLY.

This is not just a syntax issue. It is a deployment-system issue.

Your migration tooling must understand which changes need transactional execution and which changes need to run outside a transaction.

Concurrent index creation can still hurt

CONCURRENTLY reduces blocking. It does not make index creation free.

A concurrent index build can still:

scan a large table;
consume CPU;
consume disk IO;
generate WAL;
increase replication lag;
compete with autovacuum;
take a long time;
fail and leave an invalid index;
wait for old transactions;
interact badly with other maintenance.

Monitor progress:

SELECT
    p.pid,
    p.datname,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_total,
    p.blocks_done,
    round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2) AS blocks_pct,
    p.tuples_total,
    p.tuples_done,
    now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;

Check for invalid indexes after failure:

SELECT
    n.nspname AS schema_name,
    t.relname AS table_name,
    i.relname AS index_name,
    ix.indisvalid,
    ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
   OR ix.indisready = false
ORDER BY schema_name, table_name, index_name;

An invalid index is easy to forget.

It may not help queries, but it can still create maintenance and write overhead.

That is exactly the kind of “cleanup later” detail that becomes reliability debt.

Adding constraints safely

A constraint can be both logically correct and operationally expensive.

For example:

ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0);

On a large table, Postgres may need to scan existing rows to verify that they satisfy the new constraint.

A safer phased pattern:

ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;

Then later:

ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;

Postgres documentation explains that NOT VALID skips the potentially lengthy scan of existing rows when adding foreign-key, CHECK, or not-null constraints, while still applying the constraint to subsequent inserts or updates; the constraint is not considered valid for all existing rows until VALIDATE CONSTRAINT is run. (PostgreSQL)

The validation step scans the table later:

ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;

This still does work. It is not free. But it separates two concerns:

Start enforcing the rule for new data
        ↓
Validate old data later under controlled conditions

That separation is often the difference between a safe rollout and a production incident.

Foreign keys are operational changes too

Foreign keys are valuable. They protect data integrity.

But adding one to a large, hot table is not just a metadata change.

Example:

ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);

A phased version:

ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;

Then:

ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;

Postgres notes that adding a foreign key with NOT VALID can reduce impact, and validation later does not need to lock out concurrent updates because new rows are already checked; validation uses a lighter lock on the altered table, and foreign-key validation also requires a lock on the referenced table. (PostgreSQL)

The important part is that a foreign key touches two tables operationally:

the table that contains the foreign key;
the table being referenced.

That matters during incidents.

A migration on orders can affect customers.

A team that only looks at one table may miss the real blocking chain.

`SET NOT NULL` can be a table scan

This looks simple:

ALTER TABLE users
ALTER COLUMN email SET NOT NULL;

But Postgres must know that no existing row violates the constraint.

The documentation says SET NOT NULL is ordinarily checked by scanning the whole table, unless a valid check constraint proves no nulls can exist or NOT VALID is used in supported cases. (PostgreSQL)

A common safer pattern is:

ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;

Then validate:

ALTER TABLE users
VALIDATE CONSTRAINT users_email_not_null;

Then apply the not-null marker when appropriate:

ALTER TABLE users
ALTER COLUMN email SET NOT NULL;

The exact sequence depends on Postgres version, table structure, and whether you need a true column-level NOT NULL constraint or a check constraint is enough for your use case.

The reliability point is stable:

Do not assume a one-line constraint change is operationally small.

Adding a column is not always the risky part

Many teams focus on ADD COLUMN.

But the dangerous part is often what follows.

Example:

ALTER TABLE users
ADD COLUMN normalized_email text;

Then:

UPDATE users
SET normalized_email = lower(email);

The first statement may be quick.

The second statement may be a production event:

large table scan;
many row updates;
large WAL generation;
replication lag;
autovacuum pressure;
index maintenance;
long transaction;
lock contention;
cache churn;
connection pool pressure.

A safer backfill pattern uses batches:

WITH batch AS (
    SELECT id
    FROM users
    WHERE normalized_email IS NULL
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;

Repeat in a worker with:

small batches;
short transactions;
sleep between batches;
progress tracking;
replication lag monitoring;
statement timeout;
ability to stop quickly.

The batch size is not universal. It should be chosen based on production pressure.

A backfill is not just SQL. It is a controlled workload.

The expand-and-contract pattern

For application-visible schema changes, the safest migrations are often multi-step.

Suppose you want to rename a column from name to full_name.

A dangerous migration:

ALTER TABLE users RENAME COLUMN name TO full_name;

If old application code still expects name, it breaks.

A safer pattern:

1. Expand schema
2. Deploy code that writes both old and new shapes
3. Backfill old data into new shape
4. Deploy code that reads new shape
5. Stop using old shape
6. Contract schema later

Example:

ALTER TABLE users
ADD COLUMN full_name text;

Application writes both:

name = input.name
full_name = input.name

Backfill:

WITH batch AS (
    SELECT id
    FROM users
    WHERE full_name IS NULL
    ORDER BY id
    LIMIT 1000
)
UPDATE users u
SET full_name = u.name
FROM batch
WHERE u.id = batch.id;

Later, after all code reads full_name and old versions are gone:

ALTER TABLE users
DROP COLUMN name;

This is slower than a one-step migration.

It is also much safer.

Reliability often means accepting more deployment steps to reduce coupling between code and schema.

Backward compatibility matters during rolling deploys

Many production systems deploy gradually.

For some period, old and new application versions run at the same time.

That means schema migrations must be compatible with both versions.

Risky sequence:

Migration removes column
        ↓
Old app instance still reads column
        ↓
Requests fail

Safer sequence:

New app stops depending on column
        ↓
Rollout completes
        ↓
Old app versions are gone
        ↓
Column is removed later

This is not a database-only issue.

It is a deployment architecture issue.

A schema migration must be designed for:

rolling deploys;
failed deploys;
rollbacks;
background workers;
cron jobs;
admin scripts;
BI tools;
old application instances;
read replicas;
migration retries.

A migration that is safe in a single-process mental model may be unsafe in a distributed production system.

Beware of defaults and rewrites

This migration looks innocent:

ALTER TABLE events
ADD COLUMN source text DEFAULT 'web';

Depending on Postgres version and the exact default expression, adding a column with a default may be metadata-only or may require more work. Stable constant defaults have become much safer in modern Postgres than they were in older versions, but volatile defaults or other forms of schema change can still be expensive.

A safer mental model is:

Do not judge by syntax.
Check the operational behavior for your exact Postgres version and exact command.

When in doubt, use a phased approach:

ALTER TABLE events
ADD COLUMN source text;

Deploy code to write source.

Backfill existing rows in batches.

Then add a default for future rows:

ALTER TABLE events
ALTER COLUMN source SET DEFAULT 'web';

This is often more verbose, but it gives you control over when the large data change happens.

Large deletes are migrations too

Retention changes often appear as simple cleanup:

DELETE FROM events
WHERE created_at < now() - interval '180 days';

On a large table, this can be a serious write workload.

It can:

generate huge WAL;
create many dead tuples;
increase autovacuum pressure;
block or slow other queries;
increase replica lag;
hold locks for too long;
fill disk temporarily;
cause checkpoint pressure.

A batched delete:

WITH batch AS (
    SELECT id
    FROM events
    WHERE created_at < now() - interval '180 days'
    ORDER BY id
    LIMIT 5000
)
DELETE FROM events e
USING batch
WHERE e.id = batch.id;

For very large time-series data, partitioning may be a better retention mechanism:

DROP TABLE events_2025_01;

Dropping an old partition can be dramatically different from deleting millions of rows from a single table.

But partitioning has its own complexity. It should match the data lifecycle, query patterns, and operational ownership.

Migrations should have observability

A production migration should not be a black box.

Before running a risky migration, decide how you will observe it.

Useful checks include active migration sessions:

SELECT
    pid,
    application_name,
    state,
    wait_event_type,
    wait_event,
    now() - query_start AS query_age,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE query ILIKE '%alter table%'
   OR query ILIKE '%create index%'
   OR query ILIKE '%validate constraint%'
ORDER BY query_start ASC;

Lock waits:

SELECT
    pid,
    application_name,
    wait_event_type,
    wait_event,
    now() - query_start AS waiting_for,
    left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;

Index build progress:

SELECT
    p.pid,
    p.relid::regclass AS table_name,
    p.index_relid::regclass AS index_name,
    p.phase,
    p.blocks_done,
    p.blocks_total,
    round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2) AS pct_done,
    now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid;

Replication lag:

SELECT
    application_name,
    state,
    sync_state,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
    write_lag,
    flush_lag,
    replay_lag
FROM pg_stat_replication;

Dead tuple pressure after backfills or deletes:

SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    last_autovacuum,
    last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;

The key question:

How will we know the migration is becoming unsafe before users tell us?

Migration tooling can create risk

Migration frameworks are useful. They provide ordering, history, repeatability, and deployment integration.

But they can also create hazards.

Common tooling problems:

all migrations wrapped in one transaction;
no support for CREATE INDEX CONCURRENTLY;
no lock_timeout by default;
no statement_timeout policy;
no distinction between schema change and data backfill;
no pause/resume mechanism;
no progress visibility;
automatic retries of unsafe migrations;
running migrations during app startup;
running migrations from multiple app instances;
no clear owner during incidents.

A particularly dangerous pattern:

flowchart TD
    A[App instance starts] --> B[It runs migrations automatically]
    B --> C[Many instances start during deploy]
    C --> D[Multiple migration attempts compete]
    D --> E([Production traffic is already increasing])

Migration execution should be controlled.

For serious systems, migrations are not just part of application boot.

They are operational tasks with ownership and observability.

Rollback is not always the inverse migration

Application rollbacks are often easier than database rollbacks.

If a deploy fails, you can roll back code.

But after this runs:

ALTER TABLE users
DROP COLUMN legacy_id;

the old column is gone.

After this runs:

UPDATE accounts
SET status = 'inactive'
WHERE last_seen_at < now() - interval '2 years';

the old values are not automatically recoverable unless you prepared for that.

After this runs:

ALTER TABLE orders
ALTER COLUMN total_cents TYPE numeric;

returning to the old type may be lossy, slow, or impossible without careful planning.

A good migration plan distinguishes:

code rollback;
schema rollback;
data rollback;
roll-forward fix;
restore from backup;
point-in-time recovery;
manual correction.

Many database changes are not safely reversible.

For those, the safer strategy is often:

make the change additive;
delay destructive steps;
keep old data until confidence is high;
roll forward instead of rolling back;
test recovery before production.

A rollback plan that says “run the down migration” is not enough.

A practical pre-flight checklist

Before running a migration on a large or hot table, ask:

What lock level does this operation need?
Can it wait behind an old transaction?
Can it cause later application traffic to queue?
Will it scan the table?
Will it rewrite the table?
Will it generate large WAL?
Will it increase replication lag?
Will it create many dead tuples?
Will it affect autovacuum?
Can it run inside a transaction?
Does the migration framework support the required mode?
Can it fail quickly with lock_timeout?
Can it be paused or resumed?
Is the change backward-compatible with old code?
Is there a safe rollback or roll-forward plan?
Who is watching it?
What metric tells us to stop?

This checklist does not replace practice.

It helps expose which migrations deserve deeper planning.

Common anti-patterns

Testing only on tiny staging data

A migration that takes 200 ms on staging can take hours or block production on a large table.

Combining too much into one migration

Schema change, backfill, constraint validation, index creation, and cleanup are different operational phases. They should often be separated.

Running destructive changes too early

Dropping columns, constraints, tables, or indexes before all code paths are ready creates rollback traps.

No lock timeout

A migration that waits forever can silently create a production queue.

Treating `CONCURRENTLY` as harmless

It reduces blocking, but still consumes resources and has caveats.

Ignoring old transactions

Long transactions can turn a safe migration into a lock incident.

Backfilling in one huge transaction

This creates WAL, dead tuples, replication lag, and rollback risk.

Forgetting replicas and downstream systems

A migration may succeed on the primary while breaking read replicas, CDC consumers, ETL jobs, or analytics systems.

Why migration incidents are excellent simulation material

Schema migration incidents are some of the best reliability training scenarios because they involve both technical mechanics and human pressure.

A realistic simulation can include:

a migration waiting on a lock;
a long idle transaction;
application queries piling up behind DDL;
a connection pool filling;
a concurrent index build consuming IO;
a failed index leaving an invalid artifact;
a backfill increasing replication lag;
a rollback that is not actually safe;
a team debating whether to cancel, wait, kill a blocker, pause traffic, or roll forward.

The hard part is not knowing that locks exist.

The hard part is choosing the safest action while production is degrading.

Should the team cancel the migration? Terminate the blocker? Pause workers? Reduce traffic? Disable retries? Let the migration finish? Roll forward? Roll back application code? Validate later? Drop an invalid index? Leave the system alone and collect more evidence?

These are operational decisions, not syntax questions.

Articles can teach the patterns. Checklists can reduce obvious mistakes. Simulations train the judgment needed when a migration interacts with real production load.

Conclusion

A Postgres migration is not safe because the SQL is valid.

It is safe only if its production behavior is understood.

A one-line ALTER TABLE can create a lock queue. A normal CREATE INDEX can block writes. A concurrent index can still consume enough resources to hurt. A constraint can require a large validation scan. A backfill can generate WAL, dead tuples, and replica lag. A rollback can be impossible after destructive data changes.

The dangerous phrase is:

It worked on staging.

The better reliability question is:

What will this migration do to locks, WAL, replicas, autovacuum, connection pools, old application versions, and rollback options in production?

That question turns migrations from hidden deployment risk into a deliberate reliability practice.

Incidents Postgres incidents rarely start with "Postgres broke" May 29, 2026 11 min read

Connections Connection pools and Postgres: why more connections do not mean more performance May 20, 2026 17 min read

Replication Postgres replication: when a standby exists but does not save you May 12, 2026 13 min read

WAL WAL and checkpoints: the invisible machinery behind Postgres durability May 4, 2026 16 min read

Locks Postgres locks: how one ALTER TABLE can stop your product Apr 25, 2026 12 min read

Vacuum Autovacuum: the quiet Postgres process that becomes a loud reliability problem Apr 14, 2026 15 min read

Monitoring Postgres monitoring: which metrics help, and which ones create noise Apr 9, 2026 16 min read

Performance A slow Postgres query is a symptom, not a diagnosis Apr 4, 2026 14 min read

Reliability Why Postgres reliability cannot be learned from documentation alone Mar 27, 2026 16 min read

The core problem: DDL changes concurrency

Lock compatibility is the hidden part of migration safety

The migration that waits can be worse than the migration that runs

Use lock timeouts as blast-radius control

CREATE INDEX is not always online

Concurrent index creation can still hurt

Adding constraints safely

Foreign keys are operational changes too

SET NOT NULL can be a table scan

Adding a column is not always the risky part

The expand-and-contract pattern

Backward compatibility matters during rolling deploys

Beware of defaults and rewrites

Large deletes are migrations too

Migrations should have observability

Migration tooling can create risk

Rollback is not always the inverse migration

A practical pre-flight checklist

Common anti-patterns

Testing only on tiny staging data

Combining too much into one migration

Running destructive changes too early

No lock timeout

Treating CONCURRENTLY as harmless

Ignoring old transactions

Backfilling in one huge transaction

Forgetting replicas and downstream systems

Why migration incidents are excellent simulation material

Conclusion

Stay in the loop

`CREATE INDEX` is not always online

`SET NOT NULL` can be a table scan

Treating `CONCURRENTLY` as harmless