Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm

There’s a special kind of outage where everything is green and still on fire. We rolled out a routine release, watched all pods go Ready, and 30 seconds later Postgres yelled FATAL: too many connections for role "app".

This was the day we scaled from 5 replicas to 20. Nothing in the app changed; only the rollout dynamics did. That is a connection storm: healthy pods doing normal things, all at once.

Tested on: k3d 5.x, PostgreSQL 15-16, PgBouncer 1.21+, Node.js 20+ and Java 21 applications.

What is Connection Storm

During a Kubernetes rolling update deployment:

Rolling update - gradually terminates old pods and starts new ones
All new pods try to connect to database simultaneously
Old pods still hold connections (graceful shutdown)
Result: 2x more connections than normal

Normal state:      20 pods × 10 connections = 200 connections
During rollout:    20 old + 20 new = 400 connections
PostgreSQL limit:  max_connections = 200

Why It Worked Before

With 5 replicas:

Normal: 5 × 10 = 50 connections
Rollout: 10 × 10 = 100 connections (under limit)

With 20 replicas:

Normal: 20 × 10 = 200 connections (at limit)
Rollout: 40 × 10 = 400 connections BOOM

Reproducible Lab

Setup with k3d

# Create cluster
k3d cluster create connection-storm --agents 3

# Install PostgreSQL
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install postgres bitnami/postgresql \
  --set auth.postgresPassword=secret \
  --set primary.resources.limits.memory=512Mi \
  --set primary.configuration="max_connections=50"

Test Application

# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      containers:
      - name: api
        image: your-app:v1
        env:
        - name: DATABASE_URL
          value: "postgres://postgres:secret@postgres:5432/app"
        - name: POOL_SIZE
          value: "5"  # 20 pods × 5 = 100 connections

Trigger Rollout

# Watch connections
watch -n1 "kubectl exec -it postgres-postgresql-0 -- \
  psql -U postgres -c 'SELECT count(*) FROM pg_stat_activity'"

# Trigger rollout
kubectl set image deployment/api api=your-app:v2

Result: Connection count spikes to 150-200 and application starts failing.

Benchmark: Rollout Without PgBouncer

Test Scenario

20 replicas, pool_size=5 per pod
max_connections=100
Rolling update 25% maxSurge

Measurement

# Connection count during rollout (every second)
while true; do
  kubectl exec -it postgres-postgresql-0 -- \
    psql -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'" \
    2>/dev/null | grep -E "^\s*[0-9]+"
  sleep 1
done

Results

Phase	Active connections	Status
Before rollout	48	OK
Rollout start	73	OK
Peak (t+15s)	127	FAIL (limit 100)
Stabilization	48	OK

Connection refused errors: 23 during entire rollout

Solution 1: PgBouncer

PgBouncer is a connection pooler that:

Maintains persistent connections to PostgreSQL
Multiplexes application connections
Significantly reduces real DB connections

PgBouncer Deployment

# pgbouncer-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    app = host=postgres-postgresql port=5432 dbname=app

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 5432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 20
    min_pool_size = 5
    reserve_pool_size = 5
    reserve_pool_timeout = 3
    server_lifetime = 3600
    server_idle_timeout = 600
    log_connections = 1
    log_disconnections = 1
    log_pooler_errors = 1

  userlist.txt: |
    "postgres" "secret"
---
# pgbouncer-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: pgbouncer
        image: edoburu/pgbouncer:1.21.0
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: config
          mountPath: /etc/pgbouncer
        resources:
          limits:
            memory: 128Mi
            cpu: 100m
        livenessProbe:
          tcpSocket:
            port: 5432
          initialDelaySeconds: 10
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "psql -h localhost -U postgres -c 'SHOW STATS' pgbouncer"
          initialDelaySeconds: 5
      volumes:
      - name: config
        configMap:
          name: pgbouncer-config

Benchmark with PgBouncer

Phase	Client connections	Server connections	Status
Before rollout	100	20	OK
Rollout start	150	25	OK
Peak (t+15s)	200	30	OK
Stabilization	100	20	OK

Connection refused errors: 0

Solution 2: PreStop Hook with Jitter

Even with PgBouncer, there can be issues if all pods start simultaneously. Solution: spread starts over time.

PreStop Hook

spec:
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - "sleep 5"  # Give time for drain
  terminationGracePeriodSeconds: 30

Startup Jitter

// In application - random delay before first DB query
const jitter = Math.random() * 5000; // 0-5 seconds
await new Promise(resolve => setTimeout(resolve, jitter));
await db.connect();

Or in Kubernetes:

spec:
  containers:
  - name: api
    command:
    - /bin/sh
    - -c
    - |
      # Random delay 0-10 seconds
      sleep $((RANDOM % 10))
      exec node server.js

Benchmark with Jitter

Metric	Without jitter	With jitter (0-5s)
Peak connections	200	140
Connection errors	3	0
Rollout duration	45s	52s

Solution 3: Connection Pool Sizing

Calculating Correct Pool Size

Rule of thumb:
pool_size = (2 × CPU cores) + disk_spindles

For typical cloud workload:
pool_size = 5-10 per pod

Dynamic Pool Sizing

// Connection count based on environment
const poolConfig = {
  development: { min: 2, max: 5 },
  staging: { min: 5, max: 10 },
  production: { min: 10, max: 20 }
};

const pool = new Pool({
  ...poolConfig[process.env.NODE_ENV],
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});

Connection Limiting per Pod

# HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: db_connections_active
      target:
        type: AverageValue
        averageValue: 8  # Scale when avg > 8 connections per pod

Production Checklist

## DB-Safe Kubernetes Deployment Checklist

### Connection Management
- [ ] PgBouncer or similar pooler in front of database
- [ ] Pool mode: transaction (not session)
- [ ] max_client_conn > (max_pods × pool_size_per_pod × 2)
- [ ] Server-side pool limit: default_pool_size < max_connections / expected_apps

### Deployment Strategy
- [ ] maxSurge: 25% or less for DB-heavy apps
- [ ] maxUnavailable: 25% (not 0 - causes accumulation)
- [ ] terminationGracePeriodSeconds: min 30s
- [ ] preStop hook: sleep 5-10s

### Application
- [ ] Connection pool with min/max limits
- [ ] Connection timeout: max 5s
- [ ] Idle timeout: 30-60s
- [ ] Startup jitter: random 0-5s delay

### Monitoring
- [ ] Alert: pg_stat_activity count > 80% max_connections
- [ ] Alert: pgbouncer waiting clients > 0 sustained
- [ ] Dashboard: connections over time during rollouts
- [ ] Metric: connection_acquire_time_seconds

### Testing
- [ ] Load test with production replica count
- [ ] Chaos test: rollout during peak traffic
- [ ] Verify: zero connection errors during rollout

Monitoring Setup

PostgreSQL Metrics

-- Current connections per user/database
SELECT usename, datname, count(*)
FROM pg_stat_activity
GROUP BY usename, datname;

-- Waiting queries (connection starvation)
SELECT count(*) FROM pg_stat_activity
WHERE wait_event_type = 'Client';

PgBouncer Metrics

# Connect to PgBouncer admin
psql -p 6432 -U postgres pgbouncer

# Pool statistics
SHOW POOLS;

# Current clients
SHOW CLIENTS;

# Servers (real DB connections)
SHOW SERVERS;

Prometheus Queries

# Connection usage %
pg_stat_activity_count / pg_settings_max_connections * 100

# Spike detection
increase(pg_stat_activity_count[1m]) > 50

# PgBouncer waiting clients
pgbouncer_pools_cl_waiting

Alternatives to PgBouncer

Odyssey

Yandex fork, better for large scale:

# Pros
- Multi-threaded (PgBouncer is single-threaded)
- Better for 10k+ connections
- TLS passthrough

# Cons
- Less documentation
- Smaller community

pgcat

Cloudflare project:

# Pros
- Rust (memory safe)
- Built-in sharding support
- Native Prometheus metrics

# Cons
- Relatively new
- Breaking changes between versions

Conclusion

Connection storm is an insidious problem - everything works until you start scaling. Key measures:

PgBouncer - never connect directly to PostgreSQL from application pods
Transaction pooling - not session pooling
Jitter - spread starts over time
Monitoring - watch connections during every rollout
Load testing - test with production replica count

Investment in proper connection management pays off with every deploy.

FAQ

Why can’t I just increase max_connections?

PostgreSQL isn’t designed for thousands of active connections. Each connection consumes memory (~10MB) and context switching degrades performance. PgBouncer is more efficient.

Can I use application-level connection pooler instead of PgBouncer?

Application pool doesn’t solve the problem - each pod has its own pool, which totals the same connection count. You need a centralized pooler.

How many PgBouncer replicas do I need?

For most workloads, 2 replicas (HA) are sufficient. PgBouncer is extremely efficient - one process can handle thousands of client connections.

What if I use managed DB (RDS, Cloud SQL)?

Managed databases have stricter limits. RDS default is 100-200 connections. Pooler is even more important.

Zero-Downtime PostgreSQL Migrations - How to safely migrate during rolling deployments
CI/CD for Monorepo - Testing database connection limits in pipeline

Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm

What is Connection Storm

Why It Worked Before

Reproducible Lab

Setup with k3d

Test Application

Trigger Rollout

Benchmark: Rollout Without PgBouncer

Test Scenario

Measurement

Results

Solution 1: PgBouncer

PgBouncer Deployment

Benchmark with PgBouncer

Solution 2: PreStop Hook with Jitter

PreStop Hook

Startup Jitter

Benchmark with Jitter

Solution 3: Connection Pool Sizing

Calculating Correct Pool Size

Dynamic Pool Sizing

Connection Limiting per Pod

Production Checklist

Monitoring Setup

PostgreSQL Metrics

PgBouncer Metrics

Prometheus Queries

Alternatives to PgBouncer

Odyssey

pgcat

Conclusion

FAQ

Why can’t I just increase max_connections?

Can I use application-level connection pooler instead of PgBouncer?

How many PgBouncer replicas do I need?

What if I use managed DB (RDS, Cloud SQL)?

Related posts

Cite this article

What is Connection Storm

Why It Worked Before

Reproducible Lab

Setup with k3d

Test Application

Trigger Rollout

Benchmark: Rollout Without PgBouncer

Test Scenario

Measurement

Results

Solution 1: PgBouncer

PgBouncer Deployment

Benchmark with PgBouncer

Solution 2: PreStop Hook with Jitter

PreStop Hook

Startup Jitter

Benchmark with Jitter

Solution 3: Connection Pool Sizing

Calculating Correct Pool Size

Dynamic Pool Sizing

Connection Limiting per Pod

Production Checklist

Monitoring Setup

PostgreSQL Metrics

PgBouncer Metrics

Prometheus Queries

Alternatives to PgBouncer

Odyssey

pgcat

Conclusion

FAQ

Why can’t I just increase max_connections?

Can I use application-level connection pooler instead of PgBouncer?

How many PgBouncer replicas do I need?

What if I use managed DB (RDS, Cloud SQL)?

Related Articles

Related posts

Cite this article