Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag
This is one of the most counterintuitive Kafka incidents:
- Consumer lag is rising.
- You scale consumers from 20 → 40.
- Lag rises faster.
- Throughput drops.
- Logs show constant “revoking partitions” / “rebalancing”.
I’ve caused this myself by “helpfully” scaling out during an incident. The system didn’t get more capacity. It got more rebalances, and rebalances are a stop-the-world event for consumption.
Tested on: Apache Kafka 3.6–3.8, Java consumers 3.6–3.8, high-throughput topics with DB-backed processing.
Incident narrative (anonymized)
We had a pipeline: Kafka → consumer → Postgres write. A production deploy accidentally increased per-message work (one extra DB call per record). Processing slowed. Lag rose.
I scaled consumers. That made it worse:
- DB connection pool started thrashing
- processing time per poll increased
- consumers missed heartbeats / exceeded max.poll.interval.ms
- the group started rebalancing constantly
- each rebalance revoked partitions, interrupted work, and created duplicates (because we weren’t fully idempotent yet)
Blast radius: delayed event processing + duplicate side effects + backpressure to downstream services.
Constraint: We needed a mitigation that stabilized the group quickly, without relying on “just add more nodes”.
Timeline
- T-0: lag alert fires; processing latency up.
- T+10m: scale-out to more consumers; lag rises faster.
- T+20m: consumer logs show frequent rebalances and partition revocations.
- T+30m: kafka-consumer-groups.sh shows group oscillating between STABLE and REBALANCING.
- T+45m: mitigation: reduce per-poll workload, tune consumer configs, switch assignor, add static membership.
- T+90m: group stabilizes; lag starts decaying.
- T+1d: we add “rebalance budgets” and idempotency guardrails.
Mechanism: why rebalances collapse throughput
A rebalance pauses consumption
During a rebalance, partitions are revoked and assigned again. Depending on your client and assignor, this can be very disruptive:
- consumers stop fetching
- in-flight processing may be aborted or duplicated
- caches warm up again
- commits can fail
If you rebalance continuously, your effective consumption time approaches zero.
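To put rough numbers on that claim, here is a minimal sketch of the arithmetic. It assumes every rebalance stalls the whole group for a fixed duration, which is a simplification (cooperative rebalancing stalls only the partitions that move). The class name and all input values are hypothetical; substitute figures from your own logs.

```java
// Back-of-the-envelope model: what share of wall-clock time is left for
// consuming if every rebalance stalls the whole group for a fixed duration?
// All inputs are hypothetical illustrations, not measurements from the incident.
final class RebalanceCost {

    /** Fraction of each hour left for consumption, clamped to [0, 1]. */
    static double usableFraction(int rebalancesPerHour, double stallSecondsPerRebalance) {
        double stalledSeconds = rebalancesPerHour * stallSecondsPerRebalance;
        return Math.max(0.0, 1.0 - stalledSeconds / 3600.0);
    }

    public static void main(String[] args) {
        // Rare rebalances (deploy windows only): almost all time is usable.
        System.out.println(usableFraction(2, 20));
        // A storm: one rebalance every 30 seconds, each stalling for 20 seconds.
        System.out.println(usableFraction(120, 20));
        // Back-to-back rebalances: nothing is left for consumption.
        System.out.println(usableFraction(180, 20));
    }
}
```

With a 20-second stall, 120 rebalances per hour already leave only about a third of the hour for real work, before counting cache warm-up or reprocessing of duplicates.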
The two classic triggers
Trigger 1: max.poll.interval.ms exceeded
If your app doesn’t call poll() frequently enough (because processing is slow or blocked), Kafka considers the consumer “stuck” and kicks it out of the group → rebalance.
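A quick way to see how close you are to this trigger is to multiply the batch size by the per-record processing time. The sketch below assumes records are processed serially on the poll thread. It uses the max.poll.records and max.poll.interval.ms values from the "Before" config later in this post, but the per-record latencies are hypothetical, as is the class name.

```java
// Worst-case gap between two poll() calls when one thread processes a full
// batch serially. Per-record latencies below are hypothetical examples.
final class PollBudget {

    /** Milliseconds needed to work through one full batch. */
    static long worstCaseBatchMillis(int maxPollRecords, long perRecordMillis) {
        return (long) maxPollRecords * perRecordMillis;
    }

    /** True if a full batch finishes before max.poll.interval.ms expires. */
    static boolean fitsWithin(int maxPollRecords, long perRecordMillis, long maxPollIntervalMs) {
        return worstCaseBatchMillis(maxPollRecords, perRecordMillis) < maxPollIntervalMs;
    }

    public static void main(String[] args) {
        long maxPollIntervalMs = 300_000; // 5 minutes
        // Healthy DB: 100 ms per record, so 500 records take 50 s.
        System.out.println(fitsWithin(500, 100, maxPollIntervalMs));
        // Contended DB: 700 ms per record, so 500 records take 350 s.
        System.out.println(fitsWithin(500, 700, maxPollIntervalMs));
        // Same contended DB with a smaller batch of 50 records: 35 s.
        System.out.println(fitsWithin(50, 700, maxPollIntervalMs));
    }
}
```

The point of the exercise: a modest rise in per-record latency can push a large batch over the limit, while a smaller batch keeps the same slow backend well inside it.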
Trigger 2: heartbeat/session timeout issues
GC pauses, network hiccups, or overloaded consumers can miss heartbeats.
Scaling out can worsen both triggers because it:
- increases DB contention (longer per-message time)
- increases group churn during deploys
- increases the probability that some member is slow at any time
Runbook: confirm you’re in a rebalance storm
What to check first
- Consumer logs. Look for:
- “Revoked partitions”
- “Rebalance in progress”
- “Max poll interval exceeded”
- “Commit failed: rebalance in progress”
- Group state
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group>
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group> --members
I’m looking for:
- group state not staying STABLE
- members flapping
- partitions moving constantly
- Lag shape. If lag rises in a “sawtooth” with periodic resets, that’s often a rebalance loop.
How to confirm the hypothesis
A. Check whether processing blocks poll()
A classic smell is doing heavy work on the poll thread.
If you can’t inspect code immediately, infer it:
- large processing time per batch
- commits delayed
- max.poll.interval.ms exceeded logs
B. Verify DB/backends are the real bottleneck
In my incident the DB pool was exhausted and consumers were mostly waiting.
If you have it:
- DB pool metrics
- DB latency
- CPU not high but throughput low
Safe mitigations (what I do in order)
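- Stop making the group bigger. Don’t scale more consumers until the group is stable.
- Reduce work per poll:
  - reduce max.poll.records
  - move heavy processing to a worker pool so the poll thread keeps polling
  - add backpressure: when your work queue is full, pause() the assigned partitions and keep calling poll(), then resume() once the queue drains. Simply not polling would itself trip max.poll.interval.ms.
- Increase max.poll.interval.ms (carefully). This buys time for slow processing without being kicked out of the group.
- Adopt cooperative rebalancing + static membership. This reduces stop-the-world rebalances during deploys and flapping. Moving a live group from an eager assignor to CooperativeStickyAssignor takes two rolling restarts: first list both assignors, then drop the eager one.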
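One way to apply backpressure without stopping poll() altogether (which would itself exceed max.poll.interval.ms) is to pause fetching when the work queue fills up and resume once it drains. The sketch below covers only the decision logic, using the JDK alone. BackpressureGate and its watermarks are hypothetical names, not part of the Kafka client. In a real poll loop, PAUSE would map to consumer.pause(consumer.assignment()) and RESUME to consumer.resume(consumer.paused()), while the loop keeps calling poll().

```java
// Decides when a poll loop should pause or resume fetching, based on the depth
// of the hand-off queue between the poll thread and the worker pool.
// Two watermarks give hysteresis so the consumer does not flap between states.
final class BackpressureGate {

    enum Action { PAUSE, RESUME, NONE }

    private final int highWatermark; // pause at or above this depth
    private final int lowWatermark;  // resume at or below this depth
    private boolean paused = false;

    BackpressureGate(int highWatermark, int lowWatermark) {
        if (lowWatermark >= highWatermark) {
            throw new IllegalArgumentException("lowWatermark must be below highWatermark");
        }
        this.highWatermark = highWatermark;
        this.lowWatermark = lowWatermark;
    }

    /** Call once per poll-loop iteration with the current queue depth. */
    Action onQueueDepth(int depth) {
        if (!paused && depth >= highWatermark) {
            paused = true;
            return Action.PAUSE;   // real loop: consumer.pause(consumer.assignment())
        }
        if (paused && depth <= lowWatermark) {
            paused = false;
            return Action.RESUME;  // real loop: consumer.resume(consumer.paused())
        }
        return Action.NONE;        // either way, keep calling consumer.poll(...)
    }

    public static void main(String[] args) {
        BackpressureGate gate = new BackpressureGate(100, 20);
        int[] depths = {10, 100, 150, 60, 20, 5};
        for (int depth : depths) {
            System.out.println("depth=" + depth + " action=" + gate.onQueueDepth(depth));
        }
    }
}
```

The gap between the two watermarks is a design choice: too narrow and the consumer toggles on every iteration, too wide and workers sit idle before fetching resumes.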
Risky mitigations
- Set timeouts extremely high → slow failover during real crashes
- Disable commits/ack discipline → duplicates and data loss
- “Fix lag” by enabling aggressive retries without idempotency
What we changed (concrete config diff)
We made two changes: stabilize polling and reduce rebalance disruption.
Before (too optimistic for our processing time):
enable.auto.commit=true
max.poll.records=500
max.poll.interval.ms=300000
session.timeout.ms=10000
heartbeat.interval.ms=3000
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor
After (more stable, deploy-friendly):
enable.auto.commit=false
max.poll.records=50
max.poll.interval.ms=1800000
session.timeout.ms=30000
heartbeat.interval.ms=10000
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
group.instance.id=${HOSTNAME}
In code, we also ensured:
- poll thread only polls and enqueues work
- worker pool processes records
- offsets are committed only after successful processing
- processing is idempotent (or has idempotency keys)
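Once a worker pool is involved, records finish out of order, so "commit only after successful processing" needs a little bookkeeping: the committed offset may only advance past records whose predecessors are also done. Below is a minimal single-partition sketch of that bookkeeping using only the JDK. CommitTracker is a hypothetical helper, not a Kafka API. The actual commit would still happen on the poll thread (KafkaConsumer is not thread-safe), for example via commitSync with the offset this tracker reports.

```java
import java.util.TreeMap;

// Tracks in-flight records for ONE partition and reports the offset that is
// safe to commit: every dispatched record below it has finished successfully.
final class CommitTracker {

    private final TreeMap<Long, Boolean> inFlight = new TreeMap<>(); // offset -> done?
    private long committable; // next offset to consume after a restart

    CommitTracker(long startOffset) {
        this.committable = startOffset;
    }

    /** Poll thread: register a record before handing it to a worker. */
    synchronized void dispatched(long offset) {
        inFlight.put(offset, false);
    }

    /** Worker: mark a record as successfully processed (any order). */
    synchronized void markDone(long offset) {
        if (inFlight.replace(offset, true) == null) {
            return; // never dispatched, or already folded into committable
        }
        // Advance over the leading run of completed records.
        while (!inFlight.isEmpty() && inFlight.firstEntry().getValue()) {
            committable = inFlight.pollFirstEntry().getKey() + 1;
        }
    }

    /** Poll thread: offset to commit ("last processed + 1"). */
    synchronized long committableOffset() {
        return committable;
    }

    public static void main(String[] args) {
        CommitTracker tracker = new CommitTracker(10);
        tracker.dispatched(10);
        tracker.dispatched(11);
        tracker.markDone(11);                            // finished first, but 10 is pending
        System.out.println(tracker.committableOffset()); // still 10
        tracker.markDone(10);
        System.out.println(tracker.committableOffset()); // now 12
    }
}
```

Registering offsets at dispatch time, rather than assuming they are consecutive, keeps the tracker correct on topics where offsets have gaps (compaction, transaction markers). On partition revocation you would commit the current value and discard the tracker; anything still in flight will be redelivered, which is why the idempotency bullet matters.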
How to verify (measurable)
- Group stays STABLE. Run --describe --members multiple times across 10–15 minutes. Membership should stop flapping.
- Rebalance frequency drops. In logs, “revoked partitions” messages should become rare (deploy windows only).
- Lag decays monotonically. Once stable, lag should trend down. If it oscillates, you still have instability.
- Downstream pressure improves. DB pool stops thrashing, latency stabilizes.
Prevention / guardrails
- Rebalance budget
- alert if rebalances exceed N/hour
- Poll-time budget
- define max processing time per poll cycle
- Idempotency contract
- duplicates are inevitable under some failures; make them safe
- Backpressure contract
- consumer must stop fetching new records (pause partitions while still calling poll()) if downstream is saturated
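The rebalance budget from the list above could be tracked in-process with a small sliding-window counter, sketched here with the JDK only. RebalanceBudget is a hypothetical helper. In a real consumer you might call it from a ConsumerRebalanceListener callback such as onPartitionsAssigned and export the result as a metric or alert. The threshold and window are placeholders for whatever N/hour you settle on.

```java
import java.time.Duration;
import java.util.ArrayDeque;

// Sliding-window counter: flags when more than maxPerWindow rebalances
// have been observed within the most recent window.
final class RebalanceBudget {

    private final int maxPerWindow;
    private final long windowMillis;
    private final ArrayDeque<Long> events = new ArrayDeque<>(); // timestamps, oldest first

    RebalanceBudget(int maxPerWindow, Duration window) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = window.toMillis();
    }

    /** Record one rebalance at nowMillis; returns true if the budget is exceeded. */
    synchronized boolean recordAndCheck(long nowMillis) {
        events.addLast(nowMillis);
        // Drop events that have aged out of the window.
        while (!events.isEmpty() && nowMillis - events.peekFirst() >= windowMillis) {
            events.removeFirst();
        }
        return events.size() > maxPerWindow;
    }

    public static void main(String[] args) {
        // Hypothetical budget: at most 2 rebalances per hour.
        RebalanceBudget budget = new RebalanceBudget(2, Duration.ofHours(1));
        long start = System.currentTimeMillis();
        System.out.println(budget.recordAndCheck(start));           // first rebalance
        System.out.println(budget.recordAndCheck(start + 60_000));  // second, one minute later
        System.out.println(budget.recordAndCheck(start + 120_000)); // third: over budget
    }
}
```

A per-instance counter only sees rebalances that reach that instance, so a group-wide alert would more likely aggregate this across members or rely on the client's own rebalance metrics.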
Related reading
- One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production
- API Idempotency: Designing Endpoints Resistant to Retries
- Transactional Outbox: Solving the Dual Write Problem Without 2PC
- Protobuf Event Evolution: Why buf breaking Isn’t Enough
- Schema Evolution Contracts: Catch Schema Evolution Bugs Before Production
- Database Connection Pool Exhaustion: The Silent Outage Trigger
- Connection Pool Sizing with Little’s Law: Mathematical Approach to HikariCP and PgBouncer
Cite this article
If you reference this post, please link to the original URL and credit the author.