Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm
There’s a special kind of outage where everything is green and still on fire. We rolled out a routine release, watched all pods go Ready, and 30 seconds later Postgres yelled FATAL: too many connections for role "app".
This was the day we scaled from 5 replicas to 20. Nothing in the app changed; only the rollout dynamics did. That is a connection storm: healthy pods doing normal things, all at once.
Tested on: k3d 5.x, PostgreSQL 15-16, PgBouncer 1.21+, Node.js 20+ and Java 21 applications.
What is Connection Storm
During a Kubernetes rolling update deployment:
- Rolling update - gradually terminates old pods and starts new ones
- All new pods try to connect to database simultaneously
- Old pods still hold connections (graceful shutdown)
- Result: 2x more connections than normal
Normal state: 20 pods × 10 connections = 200 connections
During rollout: 20 old + 20 new = 400 connections
PostgreSQL limit: max_connections = 200
Why It Worked Before
With 5 replicas:
- Normal: 5 × 10 = 50 connections
- Rollout: 10 × 10 = 100 connections (under limit)
With 20 replicas:
- Normal: 20 × 10 = 200 connections (at limit)
- Rollout: 40 × 10 = 400 connections BOOM
Reproducible Lab
Setup with k3d
# Create cluster
k3d cluster create connection-storm --agents 3
# Install PostgreSQL
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install postgres bitnami/postgresql \
--set auth.postgresPassword=secret \
--set primary.resources.limits.memory=512Mi \
--set primary.configuration="max_connections=50"
Test Application
# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: api
spec:
replicas: 20
strategy:
type: RollingUpdate
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
template:
spec:
containers:
- name: api
image: your-app:v1
env:
- name: DATABASE_URL
value: "postgres://postgres:secret@postgres:5432/app"
- name: POOL_SIZE
value: "5" # 20 pods × 5 = 100 connections
Trigger Rollout
# Watch connections
watch -n1 "kubectl exec -it postgres-postgresql-0 -- \
psql -U postgres -c 'SELECT count(*) FROM pg_stat_activity'"
# Trigger rollout
kubectl set image deployment/api api=your-app:v2
Result: Connection count spikes to 150-200 and application starts failing.
Benchmark: Rollout Without PgBouncer
Test Scenario
20 replicas, pool_size=5 per pod
max_connections=100
Rolling update 25% maxSurge
Measurement
# Connection count during rollout (every second)
while true; do
kubectl exec -it postgres-postgresql-0 -- \
psql -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'" \
2>/dev/null | grep -E "^\s*[0-9]+"
sleep 1
done
Results
| Phase | Active connections | Status |
|---|---|---|
| Before rollout | 48 | OK |
| Rollout start | 73 | OK |
| Peak (t+15s) | 127 | FAIL (limit 100) |
| Stabilization | 48 | OK |
Connection refused errors: 23 during entire rollout
Solution 1: PgBouncer
PgBouncer is a connection pooler that:
- Maintains persistent connections to PostgreSQL
- Multiplexes application connections
- Significantly reduces real DB connections
PgBouncer Deployment
# pgbouncer-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: pgbouncer-config
data:
pgbouncer.ini: |
[databases]
app = host=postgres-postgresql port=5432 dbname=app
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
min_pool_size = 5
reserve_pool_size = 5
reserve_pool_timeout = 3
server_lifetime = 3600
server_idle_timeout = 600
log_connections = 1
log_disconnections = 1
log_pooler_errors = 1
userlist.txt: |
"postgres" "secret"
---
# pgbouncer-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: pgbouncer
spec:
replicas: 2
template:
spec:
containers:
- name: pgbouncer
image: edoburu/pgbouncer:1.21.0
ports:
- containerPort: 5432
volumeMounts:
- name: config
mountPath: /etc/pgbouncer
resources:
limits:
memory: 128Mi
cpu: 100m
livenessProbe:
tcpSocket:
port: 5432
initialDelaySeconds: 10
readinessProbe:
exec:
command:
- /bin/sh
- -c
- "psql -h localhost -U postgres -c 'SHOW STATS' pgbouncer"
initialDelaySeconds: 5
volumes:
- name: config
configMap:
name: pgbouncer-config
Benchmark with PgBouncer
| Phase | Client connections | Server connections | Status |
|---|---|---|---|
| Before rollout | 100 | 20 | OK |
| Rollout start | 150 | 25 | OK |
| Peak (t+15s) | 200 | 30 | OK |
| Stabilization | 100 | 20 | OK |
Connection refused errors: 0
Solution 2: PreStop Hook with Jitter
Even with PgBouncer, there can be issues if all pods start simultaneously. Solution: spread starts over time.
PreStop Hook
spec:
containers:
- name: api
lifecycle:
preStop:
exec:
command:
- /bin/sh
- -c
- "sleep 5" # Give time for drain
terminationGracePeriodSeconds: 30
Startup Jitter
// In application - random delay before first DB query
const jitter = Math.random() * 5000; // 0-5 seconds
await new Promise(resolve => setTimeout(resolve, jitter));
await db.connect();
Or in Kubernetes:
spec:
containers:
- name: api
command:
- /bin/sh
- -c
- |
# Random delay 0-10 seconds
sleep $((RANDOM % 10))
exec node server.js
Benchmark with Jitter
| Metric | Without jitter | With jitter (0-5s) |
|---|---|---|
| Peak connections | 200 | 140 |
| Connection errors | 3 | 0 |
| Rollout duration | 45s | 52s |
Solution 3: Connection Pool Sizing
Calculating Correct Pool Size
Rule of thumb:
pool_size = (2 × CPU cores) + disk_spindles
For typical cloud workload:
pool_size = 5-10 per pod
Dynamic Pool Sizing
// Connection count based on environment
const poolConfig = {
development: { min: 2, max: 5 },
staging: { min: 5, max: 10 },
production: { min: 10, max: 20 }
};
const pool = new Pool({
...poolConfig[process.env.NODE_ENV],
idleTimeoutMillis: 30000,
connectionTimeoutMillis: 5000,
});
Connection Limiting per Pod
# HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api
minReplicas: 5
maxReplicas: 50
metrics:
- type: Pods
pods:
metric:
name: db_connections_active
target:
type: AverageValue
averageValue: 8 # Scale when avg > 8 connections per pod
Production Checklist
## DB-Safe Kubernetes Deployment Checklist
### Connection Management
- [ ] PgBouncer or similar pooler in front of database
- [ ] Pool mode: transaction (not session)
- [ ] max_client_conn > (max_pods × pool_size_per_pod × 2)
- [ ] Server-side pool limit: default_pool_size < max_connections / expected_apps
### Deployment Strategy
- [ ] maxSurge: 25% or less for DB-heavy apps
- [ ] maxUnavailable: 25% (not 0 - causes accumulation)
- [ ] terminationGracePeriodSeconds: min 30s
- [ ] preStop hook: sleep 5-10s
### Application
- [ ] Connection pool with min/max limits
- [ ] Connection timeout: max 5s
- [ ] Idle timeout: 30-60s
- [ ] Startup jitter: random 0-5s delay
### Monitoring
- [ ] Alert: pg_stat_activity count > 80% max_connections
- [ ] Alert: pgbouncer waiting clients > 0 sustained
- [ ] Dashboard: connections over time during rollouts
- [ ] Metric: connection_acquire_time_seconds
### Testing
- [ ] Load test with production replica count
- [ ] Chaos test: rollout during peak traffic
- [ ] Verify: zero connection errors during rollout
Monitoring Setup
PostgreSQL Metrics
-- Current connections per user/database
SELECT usename, datname, count(*)
FROM pg_stat_activity
GROUP BY usename, datname;
-- Waiting queries (connection starvation)
SELECT count(*) FROM pg_stat_activity
WHERE wait_event_type = 'Client';
PgBouncer Metrics
# Connect to PgBouncer admin
psql -p 6432 -U postgres pgbouncer
# Pool statistics
SHOW POOLS;
# Current clients
SHOW CLIENTS;
# Servers (real DB connections)
SHOW SERVERS;
Prometheus Queries
# Connection usage %
pg_stat_activity_count / pg_settings_max_connections * 100
# Spike detection
increase(pg_stat_activity_count[1m]) > 50
# PgBouncer waiting clients
pgbouncer_pools_cl_waiting
Alternatives to PgBouncer
Odyssey
Yandex fork, better for large scale:
# Pros
- Multi-threaded (PgBouncer is single-threaded)
- Better for 10k+ connections
- TLS passthrough
# Cons
- Less documentation
- Smaller community
pgcat
Cloudflare project:
# Pros
- Rust (memory safe)
- Built-in sharding support
- Native Prometheus metrics
# Cons
- Relatively new
- Breaking changes between versions
Conclusion
Connection storm is an insidious problem - everything works until you start scaling. Key measures:
- PgBouncer - never connect directly to PostgreSQL from application pods
- Transaction pooling - not session pooling
- Jitter - spread starts over time
- Monitoring - watch connections during every rollout
- Load testing - test with production replica count
Investment in proper connection management pays off with every deploy.
FAQ
Why can’t I just increase max_connections?
PostgreSQL isn’t designed for thousands of active connections. Each connection consumes memory (~10MB) and context switching degrades performance. PgBouncer is more efficient.
Can I use application-level connection pooler instead of PgBouncer?
Application pool doesn’t solve the problem - each pod has its own pool, which totals the same connection count. You need a centralized pooler.
How many PgBouncer replicas do I need?
For most workloads, 2 replicas (HA) are sufficient. PgBouncer is extremely efficient - one process can handle thousands of client connections.
What if I use managed DB (RDS, Cloud SQL)?
Managed databases have stricter limits. RDS default is 100-200 connections. Pooler is even more important.
Related Articles
- Zero-Downtime PostgreSQL Migrations - How to safely migrate during rolling deployments
- CI/CD for Monorepo - Testing database connection limits in pipeline
Related posts
Zero-Downtime PostgreSQL Migrations: Expand/Contract, Backfill and Rollback Strategies
A practical playbook for safe database migrations in production. From expand/contract pattern through online indexes to monitoring and rollback.
When Prepared Statements Make PostgreSQL 10× Slower: Generic Plan Trap
Same query, same params, but prod is slow and staging works fine. I'll show how to reproduce the generic plan problem with pgBouncer, Java/Go and how to fix it.
CI/CD for Monorepo: Speed, Caching, Selective Tests and Supply-Chain Security
A complete blueprint for efficient CI/CD pipelines in monorepo - from path filters through remote cache to SBOM and SLSA. Practical solutions, not theory.
Connection Pool Sizing with Little's Law: Mathematical Approach to HikariCP and PgBouncer
Pool size 50 because that's how it's always been? I'll show how to use Little's Law to calculate optimal pool size and prove it with load tests.
Cite this article
If you reference this post, please link to the original URL and credit the author.