Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag

This is one of the most counterintuitive Kafka incidents:

- Consumer lag is rising.
- You scale consumers from 20 → 40.
- Lag rises faster.
- Throughput drops.
- Logs show constant “revoking partitions” / “rebalancing”.

I’ve caused this myself by “helpfully” scaling out during an incident. The system didn’t get more capacity. It got more rebalances, and with the eager rebalance protocol every rebalance is a stop-the-world event for the whole group.

Tested on: Apache Kafka 3.6–3.8, Java consumers 3.6–3.8, high-throughput topics with DB-backed processing.

Incident narrative (anonymized)

We had a pipeline: Kafka → consumer → Postgres write. A production deploy accidentally increased per-message work (one extra DB call per record). Processing slowed. Lag rose.

I scaled consumers. That made it worse:

- the DB connection pool started thrashing
- processing time per poll increased
- consumers missed heartbeats or exceeded max.poll.interval.ms
- the group started rebalancing constantly
- each rebalance revoked partitions, interrupted work, and created duplicates (because we weren’t fully idempotent yet)

Blast radius: delayed event processing + duplicate side effects + backpressure to downstream services.

Constraint: We needed a mitigation that stabilized the group quickly, without relying on “just add more nodes”.

Timeline

T-0: lag alert fires; processing latency up.
T+10m: scale-out to more consumers; lag rises faster.
T+20m: consumer logs show frequent rebalances and partition revocations.
T+30m: kafka-consumer-groups.sh --describe --state shows the group oscillating between Stable and PreparingRebalance/CompletingRebalance.
T+45m: mitigation: reduce per-poll workload, tune consumer configs, switch assignor, add static membership.
T+90m: group stabilizes; lag starts decaying.
T+1d: we add “rebalance budgets” and idempotency guardrails.

Mechanism: why rebalances collapse throughput

A rebalance pauses consumption

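During a rebalance, partitions are revoked and assigned again. How disruptive that is depends on the assignor. With the eager assignors (RangeAssignor, RoundRobinAssignor, StickyAssignor), every member gives up all of its partitions until the new assignment arrives. CooperativeStickyAssignor revokes only the partitions that actually move. In the eager case:

- consumers stop fetching
- in-flight processing may be aborted or duplicated
- caches warm up again
- commits can fail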

If you rebalance continuously, your effective consumption time approaches zero.
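To put a rough number on that, here is a back-of-the-envelope model. It assumes every rebalance pauses the whole group for a fixed number of seconds, which is a simplification. The function name and the figures are illustrative, not measurements from the incident.

```python
def consuming_fraction(rebalances_per_hour: float, pause_seconds: float) -> float:
    """Fraction of wall-clock time the group spends actually consuming.

    Assumes each rebalance stops the whole group for `pause_seconds`
    (eager protocol; cooperative rebalancing pauses only moved partitions).
    """
    paused_seconds = rebalances_per_hour * pause_seconds
    return max(0.0, 1.0 - paused_seconds / 3600.0)


# Illustrative numbers only: a 25 s pause at different rebalance rates.
for per_hour in (2, 30, 120):
    print(per_hour, round(consuming_fraction(per_hour, pause_seconds=25), 3))
```

Under these assumptions, a couple of rebalances per hour is noise, while one every 30 seconds leaves the group consuming for only a small share of the hour.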

The two classic triggers

Trigger 1: max.poll.interval.ms exceeded
If your app doesn’t call poll() frequently enough (because processing is slow or blocked), the consumer is treated as “stuck”: once max.poll.interval.ms elapses without a poll(), it is considered failed and drops out of the group → rebalance.
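A quick way to see how close you are to this trigger is to multiply it out. The sketch below assumes records are processed one by one on the poll thread. The per-record times (400 ms and 700 ms) are hypothetical, while 500 records and 300000 ms match the “Before” config later in this post.

```python
def worst_case_batch_ms(max_poll_records: int, per_record_ms: float) -> float:
    """Time between two poll() calls when a full batch is processed serially."""
    return max_poll_records * per_record_ms


def exceeds_poll_interval(max_poll_records: int, per_record_ms: float,
                          max_poll_interval_ms: int) -> bool:
    """True if a full batch takes longer than max.poll.interval.ms."""
    return worst_case_batch_ms(max_poll_records, per_record_ms) > max_poll_interval_ms


# 500 records against a 300 s interval leave a budget of 600 ms per record.
print(exceeds_poll_interval(500, 400, 300_000))  # 200 s per batch
print(exceeds_poll_interval(500, 700, 300_000))  # 350 s per batch
```

The first call prints False and the second True: one extra slow DB call per record can be enough to push a full batch over the interval.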

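Trigger 2: heartbeat/session timeout issues
GC pauses, network hiccups, or overloaded consumers can miss heartbeats. If the broker sees no heartbeat from a member within session.timeout.ms, it removes the member and rebalances the group.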

Scaling out can worsen both triggers because it:

- increases DB contention (longer per-message time)
- increases group churn during deploys
- increases the probability that some member is slow at any time
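The last point is plain probability. As a sketch, assume each member is independently “slow enough to drop out” in a given window with probability p_slow. Both the independence assumption and the 2% figure are illustrative, and shared DB contention makes real members correlated, which is worse.

```python
def p_any_member_slow(members: int, p_slow: float) -> float:
    """Probability that at least one of `members` is slow in a given window,
    assuming each is slow independently with probability `p_slow`."""
    return 1.0 - (1.0 - p_slow) ** members


# Hypothetical 2% chance per member per window.
for n in (20, 40):
    print(n, round(p_any_member_slow(n, 0.02), 3))
```

With these numbers the chance of at least one slow member goes from about 0.33 at 20 members to about 0.55 at 40, and one slow member is enough to rebalance everyone.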

Runbook: confirm you’re in a rebalance storm

What to check first

Consumer logs

Look for:

- “Revoked partitions”
- “Rebalance in progress”
- “Max poll interval exceeded”
- “Commit failed: rebalance in progress”

Group state

kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group>
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group> --members
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group> --state

I’m looking for:

- group state not staying Stable
- members flapping
- partitions moving constantly

Lag shape

If lag rises in a “sawtooth” with periodic resets, that’s often a rebalance loop.

How to confirm the hypothesis

A. Check whether processing blocks poll()

A classic smell is doing heavy work on the poll thread.

If you can’t inspect code immediately, infer it:

- large processing time per batch
- commits delayed
- “max.poll.interval.ms exceeded” messages in the logs

B. Verify DB/backends are the real bottleneck

In my incident the DB pool was exhausted and consumers were mostly waiting.

If you have them, check:

- DB pool metrics
- DB latency
- consumer CPU: low CPU with low throughput means the consumers are waiting, not working

Safe mitigations (what I do in order)

Stop making the group bigger
- Don’t scale out further until the group is stable.
Reduce work per poll
- reduce max.poll.records
- move heavy processing to a worker pool so the poll thread keeps polling
- add backpressure: when your work queue is full, pause() the assigned partitions and keep calling poll(), then resume() once the queue drains
Increase max.poll.interval.ms (carefully)
- This buys time for slow processing without being kicked out of the group, at the cost of slower detection of a genuinely stuck consumer.
Adopt cooperative rebalancing + static membership
- This reduces stop-the-world rebalances during deploys and flapping.
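The “poll thread only polls, workers do the work” shape is easier to see in code. Below is a minimal stdlib-only simulation, not our production code: FakeConsumer is a hypothetical stand-in for a real client, and its paused flag plays the role of pausing the assigned partitions. The point it illustrates is that the loop never stops calling poll(); it only stops fetching while the bounded queue is full.

```python
import queue
import threading
import time
from collections import deque


class FakeConsumer:
    """Stand-in for a Kafka consumer (hypothetical, not a real client API)."""

    def __init__(self, records):
        self._records = deque(records)
        self.paused = False
        self.polls = 0

    def poll(self, max_records):
        # Every call counts as liveness, whether or not we are paused.
        self.polls += 1
        if self.paused:
            return []
        count = min(max_records, len(self._records))
        return [self._records.popleft() for _ in range(count)]

    def exhausted(self):
        return not self._records


def run(consumer, handle, workers=4, queue_size=8, max_records=5):
    work = queue.Queue(maxsize=queue_size)

    def worker():
        while True:
            record = work.get()
            if record is None:  # shutdown sentinel
                return
            handle(record)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    pending = deque()  # polled but not yet handed to a worker
    while pending or not consumer.exhausted():
        # Keep calling poll() even while paused so the member stays in the group.
        pending.extend(consumer.poll(max_records))
        # Single producer: if the queue is not full, put() cannot block.
        while pending and not work.full():
            work.put(pending.popleft())
        # Backpressure: pause fetching while records still wait for queue space.
        consumer.paused = bool(pending)
        if consumer.paused:
            time.sleep(0.001)  # stands in for the poll timeout

    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()


if __name__ == "__main__":
    seen = []
    run(FakeConsumer(range(100)), seen.append)
    print(len(seen))
```

With the Java client, the corresponding calls are KafkaConsumer.pause() and resume() on the partitions returned by assignment(). The queue size and worker count here are arbitrary and would need tuning against your downstream capacity.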

Risky mitigations

- Set timeouts extremely high → slow failover during real crashes
- Disable commits/ack discipline → duplicates and data loss
- “Fix lag” by enabling aggressive retries without idempotency → duplicate side effects and extra load on a backend that is already struggling

What we changed (concrete config diff)

We made two changes: stabilize polling and reduce rebalance disruption.

Before (too optimistic for our processing time):

enable.auto.commit=true
max.poll.records=500
max.poll.interval.ms=300000
session.timeout.ms=10000
heartbeat.interval.ms=3000
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor

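After (more stable, deploy-friendly). Two caveats: group.instance.id must be unique per consumer instance and stay the same across restarts, and a live group moving from RangeAssignor to CooperativeStickyAssignor needs two rolling restarts (first with both assignors listed, then with RangeAssignor removed):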

enable.auto.commit=false
max.poll.records=50
max.poll.interval.ms=1800000
session.timeout.ms=30000
heartbeat.interval.ms=10000
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
group.instance.id=${HOSTNAME}

In code, we also ensured:

- the poll thread only polls and enqueues work
- a worker pool processes records
- offsets are committed only after successful processing
- processing is idempotent (or uses idempotency keys)
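One subtlety with a worker pool is that records finish out of order, so “commit only after successful processing” means committing the contiguous prefix of finished offsets, not the highest one. Here is a minimal per-partition sketch. OffsetTracker is a hypothetical helper, it is not thread-safe as written, and it is an illustration rather than the exact code we shipped.

```python
class OffsetTracker:
    """Tracks completed offsets for one partition and reports the offset that
    is safe to commit (everything below it has been processed)."""

    def __init__(self, first_offset: int):
        self._next = first_offset  # lowest offset not yet known to be complete
        self._done = set()         # completed offsets above the contiguous prefix

    def mark_done(self, offset: int) -> None:
        if offset < self._next:
            return  # already covered, e.g. a duplicate completion
        self._done.add(offset)
        # Advance over every offset that is now contiguous with the prefix.
        while self._next in self._done:
            self._done.remove(self._next)
            self._next += 1

    def committable(self) -> int:
        """Offset to commit. Kafka commits the next offset to read, not the last one processed."""
        return self._next


tracker = OffsetTracker(first_offset=100)
tracker.mark_done(101)
tracker.mark_done(102)
print(tracker.committable())  # still 100: offset 100 has not finished
tracker.mark_done(100)
print(tracker.committable())  # 103: the gap is closed
```

If the process dies with a gap open, the records after the gap are redelivered, which is exactly why the idempotency point above matters.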

How to verify (measurable)

Group stays Stable
- Run --describe --state and --describe --members several times over 10–15 minutes. Membership should stop flapping.
Rebalance frequency drops
- In logs, “revoked partitions” messages should become rare (deploy windows only).
Lag decays monotonically
- Once the group is stable, lag should trend down. If it oscillates, you still have instability.
Downstream pressure improves
- The DB pool stops thrashing and latency stabilizes.
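“Trending down” versus “oscillating” can be checked mechanically if you have periodic lag samples. The sketch below assumes the samples come from whatever lag exporter you already run. The function names and the tolerance of one reversal are hypothetical choices.

```python
def lag_reversals(samples):
    """Count how many times the lag series switches between rising and falling."""
    reversals = 0
    previous_direction = 0
    for a, b in zip(samples, samples[1:]):
        direction = (b > a) - (b < a)  # +1 rising, -1 falling, 0 flat
        if direction == 0:
            continue
        if previous_direction and direction != previous_direction:
            reversals += 1
        previous_direction = direction
    return reversals


def is_draining(samples, max_reversals=1):
    """True if lag ends lower than it started without sawtoothing."""
    return samples[-1] < samples[0] and lag_reversals(samples) <= max_reversals


print(is_draining([9000, 8200, 7400, 6900, 6100]))   # steady decay
print(is_draining([9000, 9800, 8500, 9900, 8000]))   # sawtooth
```

The first series counts as draining and the second does not, even though its last sample is lower than its first.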

Prevention / guardrails

Rebalance budget
- alert if rebalances exceed N/hour
Poll-time budget
- define max processing time per poll cycle
Idempotency contract
- duplicates are inevitable under some failures; make them safe
Backpressure contract
- consumer must stop fetching when downstream is saturated (pause partitions, but keep calling poll())
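A rebalance budget is just a sliding-window counter. Here is a minimal sketch, assuming you can feed it a timestamp each time a rebalance is observed (for example from a rebalance listener or from log parsing). The class name and the budget of 3 per hour are hypothetical.

```python
from collections import deque


class RebalanceBudget:
    """Sliding-window counter that flags when rebalances exceed a budget."""

    def __init__(self, max_per_window: int, window_seconds: float = 3600.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of rebalances inside the window

    def record(self, now: float) -> bool:
        """Record a rebalance at time `now` (seconds); return True if over budget."""
        self._events.append(now)
        cutoff = now - self.window_seconds
        # Drop rebalances that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.max_per_window


budget = RebalanceBudget(max_per_window=3)
for second in (0, 10, 20, 30):
    print(second, budget.record(second))
```

Only the fourth rebalance inside the hour trips the budget. With the Java client, one natural place to call record() is ConsumerRebalanceListener.onPartitionsRevoked, with the return value wired to whatever alerting you already have.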
