Back to blog

Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm

There’s a special kind of outage where everything is green and still on fire. We rolled out a routine release, watched all pods go Ready, and 30 seconds later Postgres yelled FATAL: too many connections for role "app".

This was the day we scaled from 5 replicas to 20. Nothing in the app changed; only the rollout dynamics did. That is a connection storm: healthy pods doing normal things, all at once.

Tested on: k3d 5.x, PostgreSQL 15-16, PgBouncer 1.21+, Node.js 20+ and Java 21 applications.

What is Connection Storm

During a Kubernetes rolling update deployment:

  1. Rolling update - gradually terminates old pods and starts new ones
  2. All new pods try to connect to database simultaneously
  3. Old pods still hold connections (graceful shutdown)
  4. Result: 2x more connections than normal
Normal state:      20 pods × 10 connections = 200 connections
During rollout:    20 old + 20 new = 400 connections
PostgreSQL limit:  max_connections = 200

Why It Worked Before

With 5 replicas:

  • Normal: 5 × 10 = 50 connections
  • Rollout: 10 × 10 = 100 connections (under limit)

With 20 replicas:

  • Normal: 20 × 10 = 200 connections (at limit)
  • Rollout: 40 × 10 = 400 connections BOOM

Reproducible Lab

Setup with k3d

# Create cluster
k3d cluster create connection-storm --agents 3

# Install PostgreSQL
helm repo add bitnami https://charts.bitnami.com/bitnami
helm install postgres bitnami/postgresql \
  --set auth.postgresPassword=secret \
  --set primary.resources.limits.memory=512Mi \
  --set primary.configuration="max_connections=50"

Test Application

# app-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 20
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
  template:
    spec:
      containers:
      - name: api
        image: your-app:v1
        env:
        - name: DATABASE_URL
          value: "postgres://postgres:secret@postgres:5432/app"
        - name: POOL_SIZE
          value: "5"  # 20 pods × 5 = 100 connections

Trigger Rollout

# Watch connections
watch -n1 "kubectl exec -it postgres-postgresql-0 -- \
  psql -U postgres -c 'SELECT count(*) FROM pg_stat_activity'"

# Trigger rollout
kubectl set image deployment/api api=your-app:v2

Result: Connection count spikes to 150-200 and application starts failing.

Benchmark: Rollout Without PgBouncer

Test Scenario

20 replicas, pool_size=5 per pod
max_connections=100
Rolling update 25% maxSurge

Measurement

# Connection count during rollout (every second)
while true; do
  kubectl exec -it postgres-postgresql-0 -- \
    psql -U postgres -c "SELECT count(*) FROM pg_stat_activity WHERE state != 'idle'" \
    2>/dev/null | grep -E "^\s*[0-9]+"
  sleep 1
done

Results

PhaseActive connectionsStatus
Before rollout48OK
Rollout start73OK
Peak (t+15s)127FAIL (limit 100)
Stabilization48OK

Connection refused errors: 23 during entire rollout

Solution 1: PgBouncer

PgBouncer is a connection pooler that:

  • Maintains persistent connections to PostgreSQL
  • Multiplexes application connections
  • Significantly reduces real DB connections

PgBouncer Deployment

# pgbouncer-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pgbouncer-config
data:
  pgbouncer.ini: |
    [databases]
    app = host=postgres-postgresql port=5432 dbname=app

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 5432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    max_client_conn = 1000
    default_pool_size = 20
    min_pool_size = 5
    reserve_pool_size = 5
    reserve_pool_timeout = 3
    server_lifetime = 3600
    server_idle_timeout = 600
    log_connections = 1
    log_disconnections = 1
    log_pooler_errors = 1

  userlist.txt: |
    "postgres" "secret"
---
# pgbouncer-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: pgbouncer
        image: edoburu/pgbouncer:1.21.0
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: config
          mountPath: /etc/pgbouncer
        resources:
          limits:
            memory: 128Mi
            cpu: 100m
        livenessProbe:
          tcpSocket:
            port: 5432
          initialDelaySeconds: 10
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - "psql -h localhost -U postgres -c 'SHOW STATS' pgbouncer"
          initialDelaySeconds: 5
      volumes:
      - name: config
        configMap:
          name: pgbouncer-config

Benchmark with PgBouncer

PhaseClient connectionsServer connectionsStatus
Before rollout10020OK
Rollout start15025OK
Peak (t+15s)20030OK
Stabilization10020OK

Connection refused errors: 0

Solution 2: PreStop Hook with Jitter

Even with PgBouncer, there can be issues if all pods start simultaneously. Solution: spread starts over time.

PreStop Hook

spec:
  containers:
  - name: api
    lifecycle:
      preStop:
        exec:
          command:
          - /bin/sh
          - -c
          - "sleep 5"  # Give time for drain
  terminationGracePeriodSeconds: 30

Startup Jitter

// In application - random delay before first DB query
const jitter = Math.random() * 5000; // 0-5 seconds
await new Promise(resolve => setTimeout(resolve, jitter));
await db.connect();

Or in Kubernetes:

spec:
  containers:
  - name: api
    command:
    - /bin/sh
    - -c
    - |
      # Random delay 0-10 seconds
      sleep $((RANDOM % 10))
      exec node server.js

Benchmark with Jitter

MetricWithout jitterWith jitter (0-5s)
Peak connections200140
Connection errors30
Rollout duration45s52s

Solution 3: Connection Pool Sizing

Calculating Correct Pool Size

Rule of thumb:
pool_size = (2 × CPU cores) + disk_spindles

For typical cloud workload:
pool_size = 5-10 per pod

Dynamic Pool Sizing

// Connection count based on environment
const poolConfig = {
  development: { min: 2, max: 5 },
  staging: { min: 5, max: 10 },
  production: { min: 10, max: 20 }
};

const pool = new Pool({
  ...poolConfig[process.env.NODE_ENV],
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});

Connection Limiting per Pod

# HPA with custom metrics
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: db_connections_active
      target:
        type: AverageValue
        averageValue: 8  # Scale when avg > 8 connections per pod

Production Checklist

## DB-Safe Kubernetes Deployment Checklist

### Connection Management
- [ ] PgBouncer or similar pooler in front of database
- [ ] Pool mode: transaction (not session)
- [ ] max_client_conn > (max_pods × pool_size_per_pod × 2)
- [ ] Server-side pool limit: default_pool_size < max_connections / expected_apps

### Deployment Strategy
- [ ] maxSurge: 25% or less for DB-heavy apps
- [ ] maxUnavailable: 25% (not 0 - causes accumulation)
- [ ] terminationGracePeriodSeconds: min 30s
- [ ] preStop hook: sleep 5-10s

### Application
- [ ] Connection pool with min/max limits
- [ ] Connection timeout: max 5s
- [ ] Idle timeout: 30-60s
- [ ] Startup jitter: random 0-5s delay

### Monitoring
- [ ] Alert: pg_stat_activity count > 80% max_connections
- [ ] Alert: pgbouncer waiting clients > 0 sustained
- [ ] Dashboard: connections over time during rollouts
- [ ] Metric: connection_acquire_time_seconds

### Testing
- [ ] Load test with production replica count
- [ ] Chaos test: rollout during peak traffic
- [ ] Verify: zero connection errors during rollout

Monitoring Setup

PostgreSQL Metrics

-- Current connections per user/database
SELECT usename, datname, count(*)
FROM pg_stat_activity
GROUP BY usename, datname;

-- Waiting queries (connection starvation)
SELECT count(*) FROM pg_stat_activity
WHERE wait_event_type = 'Client';

PgBouncer Metrics

# Connect to PgBouncer admin
psql -p 6432 -U postgres pgbouncer

# Pool statistics
SHOW POOLS;

# Current clients
SHOW CLIENTS;

# Servers (real DB connections)
SHOW SERVERS;

Prometheus Queries

# Connection usage %
pg_stat_activity_count / pg_settings_max_connections * 100

# Spike detection
increase(pg_stat_activity_count[1m]) > 50

# PgBouncer waiting clients
pgbouncer_pools_cl_waiting

Alternatives to PgBouncer

Odyssey

Yandex fork, better for large scale:

# Pros
- Multi-threaded (PgBouncer is single-threaded)
- Better for 10k+ connections
- TLS passthrough

# Cons
- Less documentation
- Smaller community

pgcat

Cloudflare project:

# Pros
- Rust (memory safe)
- Built-in sharding support
- Native Prometheus metrics

# Cons
- Relatively new
- Breaking changes between versions

Conclusion

Connection storm is an insidious problem - everything works until you start scaling. Key measures:

  1. PgBouncer - never connect directly to PostgreSQL from application pods
  2. Transaction pooling - not session pooling
  3. Jitter - spread starts over time
  4. Monitoring - watch connections during every rollout
  5. Load testing - test with production replica count

Investment in proper connection management pays off with every deploy.

FAQ

Why can’t I just increase max_connections?

PostgreSQL isn’t designed for thousands of active connections. Each connection consumes memory (~10MB) and context switching degrades performance. PgBouncer is more efficient.

Can I use application-level connection pooler instead of PgBouncer?

Application pool doesn’t solve the problem - each pod has its own pool, which totals the same connection count. You need a centralized pooler.

How many PgBouncer replicas do I need?

For most workloads, 2 replicas (HA) are sufficient. PgBouncer is extremely efficient - one process can handle thousands of client connections.

What if I use managed DB (RDS, Cloud SQL)?

Managed databases have stricter limits. RDS default is 100-200 connections. Pooler is even more important.


Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Kubernetes Rollout Without DB Outage: How to Stop PostgreSQL Connection Storm". https://www.michal-drozd.com/en/blog/k8s-postgresql-connection-storm/ (Published April 1, 2025).