When an application starts timing out while talking to Postgres, one of the most tempting reactions is:
Increase the connection pool size.
It feels reasonable.
Requests are waiting for a database connection. The pool is full. The application needs more throughput. So the team gives it more connections.
Sometimes that helps.
But in many Postgres incidents, increasing the pool size turns a controlled slowdown into a larger failure.
A connection pool is not just a performance optimization. It is a pressure valve between the application and the database.
When configured well, it protects Postgres from too much concurrent work. When configured badly, it allows the application to overload the database faster.
The wrong mental model
A common mental model looks like this:
More connections = more parallelism = more throughput
That is only true up to a point.
Postgres does not become infinitely faster just because more clients connect to it. Each active connection competes for shared resources:
CPU
memory
locks
shared buffers
disk IO
WAL bandwidth
temporary file space
autovacuum capacity
checkpoint pressure
planner and executor overhead
If the database is already saturated, adding more active sessions usually increases contention.
A better mental model is:
Connections are not throughput.
Connections are concurrency.
Concurrency must be limited to what the database can actually serve.
The pool should protect the database from excessive concurrency, not blindly maximize it.
The hidden multiplication problem
Connection incidents often start with innocent numbers.
One service has a pool size of 20.
That sounds small.
Then production reality looks like this:
20 connections per application instance
× 30 application instances
= 600 possible database connections
Now add:
background workers;
admin jobs;
migration runners;
BI tools;
cron scripts;
read replicas;
multiple services;
autoscaling;
deployment overlap during rolling releases.
Suddenly, max_connections = 500 no longer looks large.
The dangerous part is that each team may only see its own service:
Our pool is only 20.
But Postgres sees the total:
Hundreds of clients competing for one database.
A simple inventory query:
SELECT
application_name,
usename,
client_addr,
state,
count(*) AS connections
FROM pg_stat_activity
GROUP BY application_name, usename, client_addr, state
ORDER BY connections DESC;
This often reveals surprises:
old app versions still connected;
workers using separate pools;
BI tools holding sessions;
idle clients consuming slots;
one service with far more connections than expected;
deployment overlap doubling connection count temporarily.
The incident is not always caused by one bad query. Sometimes the system simply permits too many concurrent conversations with the database.
Idle connections are not free
An idle connection is less dangerous than an active query, but it is not free.
Each connection is represented by a backend process. It consumes memory and a connection slot. It also increases operational complexity during spikes, failovers, restarts, and deployments.
Inspect idle connections:
SELECT
application_name,
usename,
client_addr,
count(*) AS idle_connections
FROM pg_stat_activity
WHERE state = 'idle'
GROUP BY application_name, usename, client_addr
ORDER BY idle_connections DESC;
A large number of idle sessions may indicate:
oversized pools;
too many application instances;
poor pool lifecycle management;
clients that connect and do not reuse efficiently;
services holding capacity they do not need.
Idle connections may not be the immediate cause of latency, but they reduce headroom.
During an incident, headroom matters.
Active connections are the real pressure
The more important question is not just how many sessions exist.
It is how many sessions are actively doing work or waiting on something.
SELECT
state,
wait_event_type,
wait_event,
count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
This gives a better view of database pressure:
active sessions consuming CPU;
sessions waiting on locks;
sessions waiting on IO;
sessions idle in transaction;
sessions waiting on client reads or writes;
A saturated pool with many active database sessions means one thing.
A saturated pool with many sessions waiting on locks means another.
A saturated pool with many idle-in-transaction sessions means something else entirely.
The number of connections is only the surface.
The wait state tells you what kind of pressure the database is experiencing.
Pool saturation can be a symptom, not the root cause
When an application pool is full, it is easy to blame the pool.
But a pool usually fills because connections are being held longer than expected.
That can happen because:
queries became slower;
transactions became longer;
locks caused sessions to wait;
the database started waiting on IO;
the application opened transactions too early;
external service calls happened inside transactions;
retries increased traffic;
a deployment created more concurrent workers;
background jobs started competing with user requests.
A typical chain:
flowchart TD
A[Query latency increases] --> B[Application holds DB connections longer]
B --> C[Pool reaches max size]
C --> D[New requests wait for a connection]
D --> E[HTTP latency increases]
E --> F[Requests time out]
F --> G[Application retries]
G --> H[More work reaches Postgres]
H --> I([The pool stays saturated])
The pool is not the original failure. It is the place where the failure becomes visible.
Increasing the pool size may only move the queue from the application into Postgres.
That can make the database less stable.
Queuing in the application is often safer than queuing in Postgres
A small pool can be frustrating because requests wait before reaching the database.
But that waiting can be protective.
Application-side queue:
limits database concurrency.
Database-side queue:
lets too much work enter Postgres.
If too many requests enter Postgres, they can compete for locks, memory, CPU, and IO. Once the database is overloaded, every query can become slower, which makes connections stay busy even longer.
This feedback loop is dangerous:
flowchart TD
A[More concurrent queries] --> B[More contention]
B --> C[Slower queries]
C --> D[Connections held longer]
D --> E[More pool pressure]
E --> F[More retries]
F --> A
A pool should create backpressure.
Backpressure is not failure. It is a controlled refusal to overload the most critical shared component.
The database pool is part of your traffic control system
A mature production system usually has multiple layers of traffic control:
load balancer limits;
application worker limits;
request timeouts;
queue depth limits;
connection pool limits;
statement timeouts;
retry budgets;
rate limits;
circuit breakers;
background job concurrency limits.
The database pool is one of those layers.
If all other layers are loose, the database pool becomes the final gate before Postgres.
That is risky.
For example:
API accepts too much traffic.
Workers retry aggressively.
Each worker can open many DB connections.
Background jobs are unconstrained.
Pool size is high.
Postgres receives the full blast.
This is how a traffic spike becomes a database incident.
The database did not “break.” It was used as the only effective limiter in the system.
Inspecting connection pressure in Postgres
Start with total connection usage:
SELECT
count(*) AS current_connections,
setting::int AS max_connections,
round(100.0 * count(*) / setting::int, 2) AS percent_used
FROM pg_stat_activity
CROSS JOIN pg_settings
WHERE name = 'max_connections'
GROUP BY setting;
Break it down by application:
SELECT
application_name,
count(*) AS total,
count(*) FILTER (WHERE state = 'active') AS active,
count(*) FILTER (WHERE state = 'idle') AS idle,
count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name
ORDER BY total DESC;
Look for old sessions:
SELECT
pid,
application_name,
usename,
client_addr,
state,
now() - backend_start AS connection_age,
now() - state_change AS state_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
ORDER BY backend_start ASC
LIMIT 30;
Look for long-running active queries:
SELECT
pid,
application_name,
usename,
state,
wait_event_type,
wait_event,
now() - query_start AS query_age,
left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start ASC
LIMIT 30;
Look for idle transactions:
SELECT
pid,
application_name,
usename,
client_addr,
now() - xact_start AS transaction_age,
now() - state_change AS idle_age,
left(query, 200) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start ASC;
These queries help separate different problems:
too many idle sessions;
too many active sessions;
long-running queries;
sessions waiting on locks;
idle transactions;
connection leaks;
unexpected clients;
deployment overlap.
The goal is not just to count connections.
The goal is to understand why they exist and what they are doing.
idle in transaction: small bug, large blast radius
An application can open a transaction, run a query, and then wait.
Example:
BEGIN;
SELECT *
FROM accounts
WHERE id = 42;
-- application waits on an external API before COMMIT
From Postgres, this may appear as:
idle in transaction
That session is not actively running SQL, but the transaction is still open.
This can:
hold locks;
prevent vacuum cleanup;
keep old row versions visible;
increase bloat;
block migrations;
hold a pool connection indefinitely;
create confusing incident symptoms.
A useful protection:
SHOW idle_in_transaction_session_timeout;
You can set it at the role or database level:
ALTER ROLE app_user
SET idle_in_transaction_session_timeout = '60s';
This is not a substitute for fixing application code, but it can reduce blast radius.
Application transactions should be short and explicit.
A dangerous pattern:
BEGIN
read from database
call external service
perform business logic
write to database
COMMIT
A safer pattern is usually:
call external services before opening the transaction;
open the transaction late;
perform only the required database work;
commit quickly;
avoid user or network waits inside the transaction.
Postgres can handle concurrency. It cannot make long application transactions short.
Pool timeouts and statement timeouts are different
Application pool timeout:
How long a request waits to get a database connection.
Postgres statement_timeout:
How long a SQL statement may run before Postgres cancels it.
Postgres lock_timeout:
How long a statement waits to acquire a lock.
Postgres idle_in_transaction_session_timeout:
How long a session may remain idle while inside a transaction.
These protect different parts of the system.
Inspect settings:
SHOW statement_timeout;
SHOW lock_timeout;
SHOW idle_in_transaction_session_timeout;
Example role-level guardrails:
ALTER ROLE app_user SET statement_timeout = '30s';
ALTER ROLE app_user SET lock_timeout = '2s';
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '60s';
These values are examples, not universal defaults.
Different workloads need different limits:
OLTP API queries need strict latency control.
Background jobs may need longer statement timeouts.
Migrations need careful lock timeouts.
Analytics should often run on separate infrastructure.
Timeouts do not fix bad architecture, but they prevent some failures from growing without bounds.
PgBouncer is useful, but not magic
Many Postgres systems use PgBouncer or another external pooler.
PgBouncer can reduce the number of server connections and allow many client connections to share fewer Postgres backends.
But the pooling mode matters.
The common modes are:
session pooling;
transaction pooling;
statement pooling.
In session pooling, a client keeps the same server connection for the whole client session.
In transaction pooling, a client gets a server connection only for the duration of a transaction.
Transaction pooling can dramatically reduce pressure on Postgres, but it changes what application behavior is safe.
Features that depend on session state may become problematic:
temporary tables;
session-level SET commands;
session-level advisory locks;
LISTEN / NOTIFY patterns;
some prepared statement assumptions;
stateful connection behavior in application frameworks.
For example, this is session state:
SET search_path = tenant_42, public;
If an application assumes this setting remains attached to a session, transaction pooling can break that assumption.
A safer approach is to make state explicit:
SET LOCAL statement_timeout = '5s';
inside a transaction, or avoid relying on session state for request behavior.
The reliability lesson:
A pooler changes the contract between the application and Postgres.
It must be tested as part of the application architecture, not added only during an emergency.
App-level pools and external poolers can fight each other
A common architecture has both:
Application connection pool
↓
PgBouncer
↓
Postgres
That can work well.
But it can also create confusion.
Example:
50 application instances
× app pool size 20
= 1000 client connections to PgBouncer
PgBouncer pool size 100
= only 100 server connections to Postgres
That may be fine if PgBouncer queues safely.
But application metrics may say:
Database pool is healthy.
while PgBouncer is saturated.
Or PgBouncer may be healthy while Postgres is overloaded by 100 expensive active queries.
The important operational question is:
Where is the queue?
Possible answers:
inside the application pool;
inside PgBouncer;
inside Postgres lock waits;
inside disk IO;
inside the application request queue;
inside a background job system.
The location of the queue tells you where backpressure is happening.
During an incident, moving the queue from one layer to another may improve or worsen the system.
Pool size should be based on database capacity, not hope
A poor pool-sizing strategy:
Set pool size high enough that application requests rarely wait.
That optimizes for hiding pressure.
A better strategy:
Set pool size low enough that Postgres remains stable under expected and degraded conditions.
A rough capacity-oriented approach:
How many active queries can Postgres serve with acceptable latency?
How many services share this database?
How many app instances can exist during autoscaling or rolling deploys?
How many background jobs run concurrently?
What is reserved for migrations, admin access, replication, monitoring, and emergency operations?
The total possible connection count matters:
total_possible_connections =
service_count
× instances_per_service
× pool_size_per_instance
+ workers
+ admin clients
+ migrations
+ monitoring
That number should not accidentally exceed what Postgres can handle.
More importantly, the number of active queries should not exceed what the database can serve efficiently.
max_connections is not a performance target
max_connections is a limit, not a goal.
If Postgres has max_connections = 500, that does not mean the system should normally run with 500 active sessions.
Check the setting:
SHOW max_connections;
When connection count approaches the limit, new clients may fail to connect. That can block application traffic, migrations, admin access, and incident response.
You do not want to discover during an outage that there is no free connection left for an operator.
A useful connection headroom query:
SELECT
count(*) AS used_connections,
setting::int AS max_connections,
setting::int - count(*) AS remaining_connections
FROM pg_stat_activity
CROSS JOIN pg_settings
WHERE name = 'max_connections'
GROUP BY setting;
Running near the maximum is usually a sign of poor control, not high efficiency.
A stable Postgres system should have connection headroom.
Retries can turn pool pressure into a storm
Retries are meant to make systems more resilient.
Under database saturation, they can do the opposite.
A bad retry pattern:
Request times out waiting for DB
↓
Application retries immediately
↓
Retry also waits for DB
↓
More requests accumulate
↓
Pool remains saturated
↓
Database receives duplicate work
A better retry strategy includes:
bounded attempts;
exponential backoff;
jitter;
request deadlines;
idempotency keys;
retry budgets;
different policies for reads and writes;
no retry for known non-transient errors;
load shedding when the database is saturated.
The pool and retry policy must be designed together.
A small pool with aggressive retries can still overload the system. A large pool with aggressive retries can overload it faster.
Retries should not be allowed to attack a struggling database.
Background workers need separate limits
User-facing requests and background jobs should not always share the same database capacity.
A background worker can be useful during normal operation and harmful during an incident.
Examples:
email jobs;
billing reconciliation;
search indexing;
analytics sync;
cleanup tasks;
data backfills;
report generation;
cache warming;
webhook reprocessing.
If these workers use the same database pool limits as API traffic, they can starve critical paths.
A better architecture often separates:
API pool;
worker pool;
migration/admin access;
analytics/reporting access;
maintenance jobs.
This allows operational decisions such as:
pause non-critical workers;
reduce backfill concurrency;
reserve capacity for user traffic;
run reporting on a replica;
prevent cleanup jobs from overwhelming primary.
In a database incident, not all work is equally important.
The pool configuration should reflect that.
Connection leaks
A connection leak happens when the application checks out a database connection and does not return it to the pool.
Symptoms:
pool usage grows over time;
database queries are not necessarily slow;
application instances require restart to recover;
idle connections accumulate;
a specific code path correlates with pool exhaustion.
Database-side symptoms may not be obvious.
You can inspect session age and state:
SELECT
pid,
application_name,
client_addr,
state,
now() - backend_start AS backend_age,
now() - state_change AS state_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
ORDER BY state_age DESC
LIMIT 50;
But connection leaks are often easier to detect with application metrics:
pool connections in use;
pool idle connections;
pool wait time;
pool checkout timeout count;
connection acquisition latency;
connections opened and closed per second.
Postgres can show you the sessions.
The application usually tells you whether the pool is leaking.
Both views are needed.
Long queries can masquerade as pool problems
Suppose an endpoint usually takes 50 ms of database time.
Then one query starts taking 5 seconds.
Even without more traffic, pool usage rises because each request holds a connection longer.
A simple relationship:
required concurrency ≈ request rate × connection hold time
If request rate is 100 requests per second and each request holds a DB connection for 50 ms:
100 × 0.05 = 5 active connections
If the same path now holds a connection for 5 seconds:
100 × 5 = 500 active connections
The pool did not become too small.
The connection hold time exploded.
This is why pool metrics should be read together with query latency, transaction duration, and application request traces.
The pool is a mirror of database time.
Transactions should not wrap too much application logic
A transaction should protect a small unit of database consistency.
It should not wrap an entire business workflow unless absolutely necessary.
Risky pattern:
BEGIN
select user
call payment provider
update order
send webhook
insert audit log
COMMIT
This holds a database connection while waiting for external systems.
Safer pattern:
prepare required data;
call external systems outside transaction when possible;
open transaction;
perform minimal database changes;
commit;
emit async follow-up work.
There are exceptions. Some workflows need careful transactional boundaries.
But as a reliability default:
Keep transactions short.
Keep connection hold time predictable.
Do not wait on the network while holding scarce database capacity.
This is one of the most important application-level rules for Postgres reliability.
What to measure in the application
Postgres views are necessary, but not sufficient.
The application should expose pool metrics:
maximum pool size;
connections currently in use;
idle connections;
pending connection requests;
connection acquisition latency;
connection checkout timeout count;
query duration;
transaction duration;
request duration while holding DB connection;
retries by reason;
errors by SQLSTATE or exception type.
The most useful metric is often not just query time.
It is:
time spent waiting for a connection
If this grows, the application is experiencing backpressure.
That may be healthy if Postgres is protected and the system degrades gracefully.
It may be dangerous if requests timeout and retry aggressively.
Metrics should distinguish:
waiting for a pool connection;
executing SQL;
waiting on a database lock;
waiting on network;
waiting on an external service while holding a connection.
Without that separation, every database incident looks like “Postgres is slow.”
What to measure in Postgres
Useful database-side signals:
-- Connections by state and wait type
SELECT
state,
wait_event_type,
wait_event,
count(*) AS sessions
FROM pg_stat_activity
GROUP BY state, wait_event_type, wait_event
ORDER BY sessions DESC;
-- Connections by application
SELECT
application_name,
count(*) AS total,
count(*) FILTER (WHERE state = 'active') AS active,
count(*) FILTER (WHERE wait_event_type = 'Lock') AS waiting_on_lock,
count(*) FILTER (WHERE state = 'idle in transaction') AS idle_in_transaction
FROM pg_stat_activity
GROUP BY application_name
ORDER BY total DESC;
-- Oldest transactions
SELECT
pid,
application_name,
state,
now() - xact_start AS transaction_age,
left(query, 160) AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start ASC
LIMIT 20;
-- Long active queries
SELECT
pid,
application_name,
wait_event_type,
wait_event,
now() - query_start AS query_age,
left(query, 200) AS query_preview
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY query_start ASC
LIMIT 20;
-- Blocked sessions
SELECT
blocked.pid AS blocked_pid,
blocked.application_name AS blocked_app,
now() - blocked.query_start AS blocked_duration,
left(blocked.query, 120) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.application_name AS blocking_app,
blocking.state AS blocking_state,
now() - blocking.query_start AS blocking_duration,
left(blocking.query, 120) AS blocking_query
FROM pg_stat_activity blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS blocker_pid ON true
JOIN pg_stat_activity blocking ON blocking.pid = blocker_pid
ORDER BY blocked_duration DESC;
These queries are not a runbook by themselves.
They help answer one central question:
Is Postgres doing too much work, waiting on something, or being held hostage by client behavior?
Why connection incidents are often misdiagnosed
Connection pool incidents are confusing because the first visible error is often outside the database.
The app logs may say:
timeout acquiring connection from pool
or:
remaining connection slots are reserved
or:
too many clients already
or:
context deadline exceeded
Teams then debate:
Is this an app issue?
Is this a database issue?
Is the pool too small?
Is max_connections too low?
Is PgBouncer broken?
Is a query slow?
Is the network slow?
The answer may be “yes” to several of these.
The pool is the boundary between application behavior and database capacity. Boundary failures usually have causes on both sides.
Common anti-patterns
Pool size copied from another system
A pool size that worked for one service may be wrong for another.
Workload shape matters:
short OLTP queries;
long reporting queries;
bursty writes;
background jobs;
tenant skew;
transaction-heavy workflows;
read-after-write patterns.
Pool size configured per instance without considering total instances
Autoscaling can silently multiply database pressure.
One shared pool for critical and non-critical work
A reporting job should not be able to starve checkout.
Long external calls inside transactions
This turns network latency into database connection pressure.
No timeout hierarchy
Without clear request, pool, statement, lock, and transaction timeouts, failures linger too long.
Aggressive retries
Retries without budgets and backoff can turn a small slowdown into a storm.
Treating PgBouncer as a universal fix
A pooler helps manage connections. It does not remove query cost, lock contention, IO saturation, or bad transaction design.
A healthier operating model
A good connection strategy is explicit.
It defines:
which services may connect to Postgres;
how many connections each service may use;
how many instances may exist during normal and deploy conditions;
which work is allowed on the primary;
which work should use replicas;
which jobs can be paused;
which timeouts protect the system;
which retries are allowed;
which metrics indicate backpressure;
which actions reduce pressure safely.
This is not only DBA work.
It requires cooperation between:
backend engineers;
SREs;
DBAs;
platform engineers;
application owners;
incident responders.
Connection reliability lives at the boundary between application design and database operations.
That is why it often falls through organizational cracks.
Why connection pool incidents are good simulation material
Connection pool incidents are excellent for practice because they create misleading symptoms.
The application says it cannot get a connection.
The database says it has too many clients.
The query dashboard shows slower SQL.
The lock dashboard may show waiting sessions.
The autoscaler adds more application instances.
Retries increase traffic.
Someone proposes increasing max_connections.
Someone else proposes restarting the app.
All of these may be plausible.
A realistic simulation can force the team to reason through:
where the queue is forming;
whether the pool is protecting or harming Postgres;
whether increasing pool size would help or amplify the incident;
which workload should be shed first;
whether long transactions are holding connections;
whether retries are multiplying demand;
whether background workers should be paused;
whether the safest mitigation is in SQL, app config, infrastructure, or traffic control.
The goal is not to memorize a perfect pool size.
The goal is to build judgment around database pressure.
Articles can explain the mechanics. Dashboards can show saturation. Simulations teach what it feels like to choose under pressure.
Conclusion
More Postgres connections do not automatically mean more performance.
They mean more concurrency.
Concurrency is useful only while the database has capacity to serve it. Past that point, additional connections create contention, longer waits, more timeouts, more retries, and a larger incident.
A connection pool should not be treated as a bucket that must be as large as possible.
It should be treated as a control surface.
Good pooling protects Postgres. Bad pooling exposes Postgres to uncontrolled application demand.
Reliable Postgres systems need:
bounded connection counts;
short transactions;
clear timeout policies;
safe retry behavior;
separate limits for critical and background work;
visibility into pool wait time;
visibility into database wait states;
enough headroom for operations and incidents.
The dangerous phrase is:
The pool is full, so increase it.
The better question is:
Why are connections being held longer than expected, and where should backpressure happen?
That question turns connection pooling from a configuration detail into a database reliability practice.