Schema migrations are one of the most common ways teams accidentally create Postgres incidents.
The migration passes code review. It works locally. It runs instantly on staging. The SQL is syntactically correct. The change looks small.
Then production traffic slows down, the connection pool fills, requests time out, and the incident channel starts with a familiar sentence:
The database is stuck.
Usually, Postgres is not stuck.
It is enforcing the rules that keep data consistent while many transactions touch the same tables concurrently.
The mistake is treating a schema migration as “just a code change.”
In production, a migration is an operational event.
The core problem: DDL changes concurrency
A normal application query changes data or reads data.
A schema migration changes the shape of the database itself.
That difference matters because Postgres must protect the table definition while other sessions are reading or writing rows. ALTER TABLE has many subforms, and the official documentation notes that lock levels differ by subform; unless explicitly noted, ALTER TABLE acquires an ACCESS EXCLUSIVE lock, and when several subcommands are combined, Postgres uses the strictest required lock. (PostgreSQL)
That is the reliability risk.
A migration may not be CPU-heavy. It may not read much data. It may not write many rows. It may simply need a lock that conflicts with normal traffic.
A migration incident often looks like this:
flowchart TD
A[Long-running transaction touches a hot table] --> B[Migration waits for a table lock]
B --> C[New application queries arrive]
C --> D[They queue behind the waiting migration]
D --> E[Application pool fills]
E --> F[Requests time out]
F --> G[Retries increase pressure]
G --> H([Production incident])
The migration may be “waiting.”
But waiting in the wrong place can still stop the product.
Lock compatibility is the hidden part of migration safety
Postgres locks are not all equal.
A regular SELECT acquires an ACCESS SHARE lock. INSERT, UPDATE, DELETE, and MERGE acquire ROW EXCLUSIVE locks on the target table. CREATE INDEX without CONCURRENTLY acquires a SHARE lock. ACCESS EXCLUSIVE conflicts with every table-level lock mode and is the only table-level lock that blocks a plain SELECT. (PostgreSQL)
This is why a schema change can have a much larger blast radius than expected.
For example:
ALTER TABLE accounts
ADD COLUMN deleted_at timestamptz;
This may be fast in many situations. But “fast” is not the same as “risk-free.”
Even a short lock can be dangerous when:
the table is hot;
transactions are long;
traffic is high;
the migration waits behind another session;
application timeouts are short;
retries are aggressive;
the deploy starts many app instances at once.
The operational risk is not only how long the migration takes after it starts.
It is also how long it waits before it can safely start.
The migration that waits can be worse than the migration that runs
A migration can damage traffic before it does any meaningful work.
Suppose this transaction is open:
BEGIN;
SELECT *
FROM accounts
WHERE id = 42;
-- application stays idle before COMMIT
Now a migration runs:
ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
If the migration waits for a strong lock, later queries against accounts can queue behind it.
That queue can grow quickly:
Session A: old transaction is still open
Session B: ALTER TABLE waits for lock
Session C: SELECT from application waits
Session D: UPDATE from application waits
Session E: SELECT from application waits
...
This is why “the migration is only waiting” is not comforting.
A waiting migration can become a traffic barrier.
During an incident, look for blocked and blocking sessions:
SELECT
blocked.pid AS blocked_pid,
blocked.application_name AS blocked_app,
now() - blocked.query_start AS blocked_duration,
left(blocked.query, 160) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.application_name AS blocking_app,
blocking.state AS blocking_state,
now() - blocking.query_start AS blocking_duration,
left(blocking.query, 160) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
And inspect sessions waiting on locks:
SELECT
pid,
application_name,
usename,
state,
wait_event_type,
wait_event,
now() - query_start AS waiting_for,
left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
The goal is not only to identify the blocking PID.
The goal is to understand whether the migration has created a queue in front of production traffic.
Use lock timeouts as blast-radius control
A migration should not wait forever for a lock on a hot table.
Use a lock timeout:
SET lock_timeout = '2s';
SET statement_timeout = '5min';
ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
Inside a transaction:
BEGIN;
SET LOCAL lock_timeout = '2s';
SET LOCAL statement_timeout = '5min';
ALTER TABLE accounts
ADD COLUMN archived_at timestamptz;
COMMIT;
This does not make the migration safe.
It makes failure faster.
That is valuable.
A failed migration with a clear timeout is usually better than a migration that silently waits and causes traffic to queue behind it.
The operational principle:
Migrations should fail before they become incidents.
But timeouts need to be chosen carefully. Too short, and safe migrations fail constantly. Too long, and the timeout no longer protects production traffic.
CREATE INDEX is not always online
A classic migration:
CREATE INDEX idx_orders_customer_id
ON orders (customer_id);
This is a normal index build. It can block writes to the table.
For production systems, teams often use:
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
Postgres documents CREATE INDEX CONCURRENTLY as a way to create an index without locking out writes to the table. The same documentation also notes important caveats: concurrent index builds cannot run inside a transaction block, only one concurrent index build can run on a table at a time, and failed concurrent builds can leave an invalid index behind. (PostgreSQL)
That means this is invalid:
BEGIN;
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
COMMIT;
You need to run it outside a normal transaction block:
CREATE INDEX CONCURRENTLY idx_orders_customer_id
ON orders (customer_id);
Many migration frameworks wrap migrations in transactions by default. That default is good for many schema changes, but it conflicts with CREATE INDEX CONCURRENTLY.
This is not just a syntax issue. It is a deployment-system issue.
Your migration tooling must understand which changes need transactional execution and which changes need to run outside a transaction.
Concurrent index creation can still hurt
CONCURRENTLY reduces blocking. It does not make index creation free.
A concurrent index build can still:
scan a large table;
consume CPU;
consume disk IO;
generate WAL;
increase replication lag;
compete with autovacuum;
take a long time;
fail and leave an invalid index;
wait for old transactions;
interact badly with other maintenance.
Monitor progress:
SELECT
p.pid,
p.datname,
p.relid::regclass AS table_name,
p.index_relid::regclass AS index_name,
p.phase,
p.blocks_total,
p.blocks_done,
round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2) AS blocks_pct,
p.tuples_total,
p.tuples_done,
now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid
ORDER BY runtime DESC;
Check for invalid indexes after failure:
SELECT
n.nspname AS schema_name,
t.relname AS table_name,
i.relname AS index_name,
ix.indisvalid,
ix.indisready
FROM pg_index ix
JOIN pg_class i ON i.oid = ix.indexrelid
JOIN pg_class t ON t.oid = ix.indrelid
JOIN pg_namespace n ON n.oid = t.relnamespace
WHERE ix.indisvalid = false
OR ix.indisready = false
ORDER BY schema_name, table_name, index_name;
An invalid index is easy to forget.
It may not help queries, but it can still create maintenance and write overhead.
That is exactly the kind of “cleanup later” detail that becomes reliability debt.
Adding constraints safely
A constraint can be both logically correct and operationally expensive.
For example:
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0);
On a large table, Postgres may need to scan existing rows to verify that they satisfy the new constraint.
A safer phased pattern:
ALTER TABLE orders
ADD CONSTRAINT orders_amount_positive
CHECK (amount > 0) NOT VALID;
Then later:
ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
Postgres documentation explains that NOT VALID skips the potentially lengthy scan of existing rows when adding foreign-key, CHECK, or not-null constraints, while still applying the constraint to subsequent inserts or updates; the constraint is not considered valid for all existing rows until VALIDATE CONSTRAINT is run. (PostgreSQL)
The validation step scans the table later:
ALTER TABLE orders
VALIDATE CONSTRAINT orders_amount_positive;
This still does work. It is not free. But it separates two concerns:
Start enforcing the rule for new data
↓
Validate old data later under controlled conditions
That separation is often the difference between a safe rollout and a production incident.
Foreign keys are operational changes too
Foreign keys are valuable. They protect data integrity.
But adding one to a large, hot table is not just a metadata change.
Example:
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id);
A phased version:
ALTER TABLE orders
ADD CONSTRAINT orders_customer_id_fkey
FOREIGN KEY (customer_id)
REFERENCES customers(id)
NOT VALID;
Then:
ALTER TABLE orders
VALIDATE CONSTRAINT orders_customer_id_fkey;
Postgres notes that adding a foreign key with NOT VALID can reduce impact, and validation later does not need to lock out concurrent updates because new rows are already checked; validation uses a lighter lock on the altered table, and foreign-key validation also requires a lock on the referenced table. (PostgreSQL)
The important part is that a foreign key touches two tables operationally:
the table that contains the foreign key;
the table being referenced.
That matters during incidents.
A migration on orders can affect customers.
A team that only looks at one table may miss the real blocking chain.
SET NOT NULL can be a table scan
This looks simple:
ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
But Postgres must know that no existing row violates the constraint.
The documentation says SET NOT NULL is ordinarily checked by scanning the whole table, unless a valid check constraint proves no nulls can exist or NOT VALID is used in supported cases. (PostgreSQL)
A common safer pattern is:
ALTER TABLE users
ADD CONSTRAINT users_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
Then validate:
ALTER TABLE users
VALIDATE CONSTRAINT users_email_not_null;
Then apply the not-null marker when appropriate:
ALTER TABLE users
ALTER COLUMN email SET NOT NULL;
The exact sequence depends on Postgres version, table structure, and whether you need a true column-level NOT NULL constraint or a check constraint is enough for your use case.
The reliability point is stable:
Do not assume a one-line constraint change is operationally small.
Adding a column is not always the risky part
Many teams focus on ADD COLUMN.
But the dangerous part is often what follows.
Example:
ALTER TABLE users
ADD COLUMN normalized_email text;
Then:
UPDATE users
SET normalized_email = lower(email);
The first statement may be quick.
The second statement may be a production event:
large table scan;
many row updates;
large WAL generation;
replication lag;
autovacuum pressure;
index maintenance;
long transaction;
lock contention;
cache churn;
connection pool pressure.
A safer backfill pattern uses batches:
WITH batch AS (
SELECT id
FROM users
WHERE normalized_email IS NULL
ORDER BY id
LIMIT 1000
)
UPDATE users u
SET normalized_email = lower(u.email)
FROM batch
WHERE u.id = batch.id;
Repeat in a worker with:
small batches;
short transactions;
sleep between batches;
progress tracking;
replication lag monitoring;
statement timeout;
ability to stop quickly.
The batch size is not universal. It should be chosen based on production pressure.
A backfill is not just SQL. It is a controlled workload.
The expand-and-contract pattern
For application-visible schema changes, the safest migrations are often multi-step.
Suppose you want to rename a column from name to full_name.
A dangerous migration:
ALTER TABLE users RENAME COLUMN name TO full_name;
If old application code still expects name, it breaks.
A safer pattern:
1. Expand schema
2. Deploy code that writes both old and new shapes
3. Backfill old data into new shape
4. Deploy code that reads new shape
5. Stop using old shape
6. Contract schema later
Example:
ALTER TABLE users
ADD COLUMN full_name text;
Application writes both:
name = input.name
full_name = input.name
Backfill:
WITH batch AS (
SELECT id
FROM users
WHERE full_name IS NULL
ORDER BY id
LIMIT 1000
)
UPDATE users u
SET full_name = u.name
FROM batch
WHERE u.id = batch.id;
Later, after all code reads full_name and old versions are gone:
ALTER TABLE users
DROP COLUMN name;
This is slower than a one-step migration.
It is also much safer.
Reliability often means accepting more deployment steps to reduce coupling between code and schema.
Backward compatibility matters during rolling deploys
Many production systems deploy gradually.
For some period, old and new application versions run at the same time.
That means schema migrations must be compatible with both versions.
Risky sequence:
Migration removes column
↓
Old app instance still reads column
↓
Requests fail
Safer sequence:
New app stops depending on column
↓
Rollout completes
↓
Old app versions are gone
↓
Column is removed later
This is not a database-only issue.
It is a deployment architecture issue.
A schema migration must be designed for:
rolling deploys;
failed deploys;
rollbacks;
background workers;
cron jobs;
admin scripts;
BI tools;
old application instances;
read replicas;
migration retries.
A migration that is safe in a single-process mental model may be unsafe in a distributed production system.
Beware of defaults and rewrites
This migration looks innocent:
ALTER TABLE events
ADD COLUMN source text DEFAULT 'web';
Depending on Postgres version and the exact default expression, adding a column with a default may be metadata-only or may require more work. Stable constant defaults have become much safer in modern Postgres than they were in older versions, but volatile defaults or other forms of schema change can still be expensive.
A safer mental model is:
Do not judge by syntax.
Check the operational behavior for your exact Postgres version and exact command.
When in doubt, use a phased approach:
ALTER TABLE events
ADD COLUMN source text;
Deploy code to write source.
Backfill existing rows in batches.
Then add a default for future rows:
ALTER TABLE events
ALTER COLUMN source SET DEFAULT 'web';
This is often more verbose, but it gives you control over when the large data change happens.
Large deletes are migrations too
Retention changes often appear as simple cleanup:
DELETE FROM events
WHERE created_at < now() - interval '180 days';
On a large table, this can be a serious write workload.
It can:
generate huge WAL;
create many dead tuples;
increase autovacuum pressure;
block or slow other queries;
increase replica lag;
hold locks for too long;
fill disk temporarily;
cause checkpoint pressure.
A batched delete:
WITH batch AS (
SELECT id
FROM events
WHERE created_at < now() - interval '180 days'
ORDER BY id
LIMIT 5000
)
DELETE FROM events e
USING batch
WHERE e.id = batch.id;
For very large time-series data, partitioning may be a better retention mechanism:
DROP TABLE events_2025_01;
Dropping an old partition can be dramatically different from deleting millions of rows from a single table.
But partitioning has its own complexity. It should match the data lifecycle, query patterns, and operational ownership.
Migrations should have observability
A production migration should not be a black box.
Before running a risky migration, decide how you will observe it.
Useful checks include active migration sessions:
SELECT
pid,
application_name,
state,
wait_event_type,
wait_event,
now() - query_start AS query_age,
left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE query ILIKE '%alter table%'
OR query ILIKE '%create index%'
OR query ILIKE '%validate constraint%'
ORDER BY query_start ASC;
Lock waits:
SELECT
pid,
application_name,
wait_event_type,
wait_event,
now() - query_start AS waiting_for,
left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start ASC;
Index build progress:
SELECT
p.pid,
p.relid::regclass AS table_name,
p.index_relid::regclass AS index_name,
p.phase,
p.blocks_done,
p.blocks_total,
round(100.0 * p.blocks_done / nullif(p.blocks_total, 0), 2) AS pct_done,
now() - a.query_start AS runtime
FROM pg_stat_progress_create_index p
JOIN pg_stat_activity a ON a.pid = p.pid;
Replication lag:
SELECT
application_name,
state,
sync_state,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
Dead tuple pressure after backfills or deletes:
SELECT
relname,
n_live_tup,
n_dead_tup,
last_autovacuum,
last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 30;
The key question:
How will we know the migration is becoming unsafe before users tell us?
Migration tooling can create risk
Migration frameworks are useful. They provide ordering, history, repeatability, and deployment integration.
But they can also create hazards.
Common tooling problems:
all migrations wrapped in one transaction;
no support for CREATE INDEX CONCURRENTLY;
no lock_timeout by default;
no statement_timeout policy;
no distinction between schema change and data backfill;
no pause/resume mechanism;
no progress visibility;
automatic retries of unsafe migrations;
running migrations during app startup;
running migrations from multiple app instances;
no clear owner during incidents.
A particularly dangerous pattern:
flowchart TD
A[App instance starts] --> B[It runs migrations automatically]
B --> C[Many instances start during deploy]
C --> D[Multiple migration attempts compete]
D --> E([Production traffic is already increasing])
Migration execution should be controlled.
For serious systems, migrations are not just part of application boot.
They are operational tasks with ownership and observability.
Rollback is not always the inverse migration
Application rollbacks are often easier than database rollbacks.
If a deploy fails, you can roll back code.
But after this runs:
ALTER TABLE users
DROP COLUMN legacy_id;
the old column is gone.
After this runs:
UPDATE accounts
SET status = 'inactive'
WHERE last_seen_at < now() - interval '2 years';
the old values are not automatically recoverable unless you prepared for that.
After this runs:
ALTER TABLE orders
ALTER COLUMN total_cents TYPE numeric;
returning to the old type may be lossy, slow, or impossible without careful planning.
A good migration plan distinguishes:
code rollback;
schema rollback;
data rollback;
roll-forward fix;
restore from backup;
point-in-time recovery;
manual correction.
Many database changes are not safely reversible.
For those, the safer strategy is often:
make the change additive;
delay destructive steps;
keep old data until confidence is high;
roll forward instead of rolling back;
test recovery before production.
A rollback plan that says “run the down migration” is not enough.
A practical pre-flight checklist
Before running a migration on a large or hot table, ask:
What lock level does this operation need?
Can it wait behind an old transaction?
Can it cause later application traffic to queue?
Will it scan the table?
Will it rewrite the table?
Will it generate large WAL?
Will it increase replication lag?
Will it create many dead tuples?
Will it affect autovacuum?
Can it run inside a transaction?
Does the migration framework support the required mode?
Can it fail quickly with lock_timeout?
Can it be paused or resumed?
Is the change backward-compatible with old code?
Is there a safe rollback or roll-forward plan?
Who is watching it?
What metric tells us to stop?
This checklist does not replace practice.
It helps expose which migrations deserve deeper planning.
Common anti-patterns
Testing only on tiny staging data
A migration that takes 200 ms on staging can take hours or block production on a large table.
Combining too much into one migration
Schema change, backfill, constraint validation, index creation, and cleanup are different operational phases. They should often be separated.
Running destructive changes too early
Dropping columns, constraints, tables, or indexes before all code paths are ready creates rollback traps.
No lock timeout
A migration that waits forever can silently create a production queue.
Treating CONCURRENTLY as harmless
It reduces blocking, but still consumes resources and has caveats.
Ignoring old transactions
Long transactions can turn a safe migration into a lock incident.
Backfilling in one huge transaction
This creates WAL, dead tuples, replication lag, and rollback risk.
Forgetting replicas and downstream systems
A migration may succeed on the primary while breaking read replicas, CDC consumers, ETL jobs, or analytics systems.
Why migration incidents are excellent simulation material
Schema migration incidents are some of the best reliability training scenarios because they involve both technical mechanics and human pressure.
A realistic simulation can include:
a migration waiting on a lock;
a long idle transaction;
application queries piling up behind DDL;
a connection pool filling;
a concurrent index build consuming IO;
a failed index leaving an invalid artifact;
a backfill increasing replication lag;
a rollback that is not actually safe;
a team debating whether to cancel, wait, kill a blocker, pause traffic, or roll forward.
The hard part is not knowing that locks exist.
The hard part is choosing the safest action while production is degrading.
Should the team cancel the migration? Terminate the blocker? Pause workers? Reduce traffic? Disable retries? Let the migration finish? Roll forward? Roll back application code? Validate later? Drop an invalid index? Leave the system alone and collect more evidence?
These are operational decisions, not syntax questions.
Articles can teach the patterns. Checklists can reduce obvious mistakes. Simulations train the judgment needed when a migration interacts with real production load.
Conclusion
A Postgres migration is not safe because the SQL is valid.
It is safe only if its production behavior is understood.
A one-line ALTER TABLE can create a lock queue.
A normal CREATE INDEX can block writes.
A concurrent index can still consume enough resources to hurt.
A constraint can require a large validation scan.
A backfill can generate WAL, dead tuples, and replica lag.
A rollback can be impossible after destructive data changes.
The dangerous phrase is:
It worked on staging.
The better reliability question is:
What will this migration do to locks, WAL, replicas, autovacuum, connection pools, old application versions, and rollback options in production?
That question turns migrations from hidden deployment risk into a deliberate reliability practice.