
Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag

This is one of the most counterintuitive Kafka incidents:

  • Consumer lag is rising.
  • You scale consumers from 20 → 40.
  • Lag rises faster.
  • Throughput drops.
  • Logs show constant “revoking partitions” / “rebalancing”.

I’ve caused this myself by “helpfully” scaling out during an incident. The system didn’t get more capacity. It got more rebalances, and with the eager rebalance protocol every rebalance is a stop-the-world event for the whole group’s consumption.

Tested on: Apache Kafka 3.6–3.8, Java consumers 3.6–3.8, high-throughput topics with DB-backed processing.

Incident narrative (anonymized)

We had a pipeline: Kafka → consumer → Postgres write. A production deploy accidentally increased per-message work (one extra DB call per record). Processing slowed. Lag rose.

I scaled consumers. That made it worse:

  • DB connection pool started thrashing
  • processing time per poll increased
  • consumers missed heartbeats / exceeded max.poll.interval.ms
  • the group started rebalancing constantly
  • each rebalance revoked partitions, interrupted work, and created duplicates (because we weren’t fully idempotent yet)

Blast radius: delayed event processing + duplicate side effects + backpressure to downstream services.

Constraint: We needed a mitigation that stabilized the group quickly, without relying on “just add more nodes”.

Timeline

  • T-0: lag alert fires; processing latency up.
  • T+10m: scale-out to more consumers; lag rises faster.
  • T+20m: consumer logs show frequent rebalances and partition revocations.
  • T+30m: kafka-consumer-groups.sh shows the group oscillating between Stable and a rebalancing state (PreparingRebalance / CompletingRebalance).
  • T+45m: mitigation: reduce per-poll workload, tune consumer configs, switch assignor, add static membership.
  • T+90m: group stabilizes; lag starts decaying.
  • T+1d: we add “rebalance budgets” and idempotency guardrails.

Mechanism: why rebalances collapse throughput

A rebalance pauses consumption

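During a rebalance, partitions are revoked and assigned again. How disruptive this is depends on the assignor: with the eager assignors (RangeAssignor, RoundRobinAssignor, StickyAssignor) every member gives up all of its partitions before reassignment, while CooperativeStickyAssignor revokes only the partitions that actually move. In the eager case: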

  • consumers stop fetching
  • in-flight processing may be aborted or duplicated
  • caches warm up again
  • commits can fail

If you rebalance continuously, your effective consumption time approaches zero.
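To put a rough number on that, here is a minimal sketch of the arithmetic. It assumes a deliberately simple model (every rebalance fully pauses the group for a fixed time, and rebalances start at a fixed interval); the function name and the 30-second figure are illustrative, not measurements from the incident.

```python
def consuming_fraction(rebalance_seconds: float, seconds_between_rebalances: float) -> float:
    """Fraction of wall-clock time left for consuming.

    Simplified model: each rebalance pauses the whole group for
    rebalance_seconds, and a new rebalance starts every
    seconds_between_rebalances (measured start to start).
    """
    if seconds_between_rebalances <= 0:
        raise ValueError("seconds_between_rebalances must be positive")
    # A pause cannot last longer than the gap to the next rebalance.
    paused = min(rebalance_seconds, seconds_between_rebalances)
    return 1.0 - paused / seconds_between_rebalances


# Hypothetical numbers: a 30 s rebalance every 10 minutes vs. every 40 s.
calm = consuming_fraction(30, 600)   # about 0.95
storm = consuming_fraction(30, 40)   # 0.25
```

The model ignores cache warm-up and reprocessing after each rebalance, so real throughput during a storm is usually worse than this fraction suggests.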

The two classic triggers

Trigger 1: max.poll.interval.ms exceeded
If your app doesn’t call poll() frequently enough (because processing is slow or blocked), the consumer is treated as “stuck”: once more than max.poll.interval.ms passes between poll() calls, it drops out of the group and its partitions are reassigned → rebalance.
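A quick way to reason about this trigger is a per-record time budget. The sketch below assumes records are processed sequentially on the poll thread and that poll() returns a full batch; the helper names and the 700 ms per-record figure are hypothetical.

```python
def per_record_budget_ms(max_poll_interval_ms: int, max_poll_records: int) -> float:
    """Average time each record may take before the poll interval is exceeded."""
    return max_poll_interval_ms / max_poll_records


def exceeds_poll_interval(per_record_ms: float, max_poll_records: int,
                          max_poll_interval_ms: int) -> bool:
    """True if a full batch would take longer than max.poll.interval.ms."""
    return per_record_ms * max_poll_records > max_poll_interval_ms


# With the Kafka defaults (max.poll.records=500, max.poll.interval.ms=300000)
# the budget is 600 ms per record.
budget = per_record_budget_ms(300_000, 500)  # 600.0

# A hypothetical 700 ms per record (say, after one extra DB call) blows it.
too_slow = exceeds_poll_interval(700, 500, 300_000)  # True
```

The budget is an average, so a few slow records are fine as long as the whole batch finishes in time.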

Trigger 2: heartbeat/session timeout issues
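GC pauses, network hiccups, or CPU-starved consumers can miss heartbeats. If the group coordinator receives no heartbeat from a member within session.timeout.ms, it removes that member and triggers a rebalance.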

Scaling out can worsen both triggers because it:

  • increases DB contention (longer per-message time)
  • increases group churn during deploys
  • increases the probability that some member is slow at any time
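The last point is plain probability. As a sketch, assume each member independently has some small chance p of being slow enough to trigger a rebalance in a given window; p = 1% is a made-up figure, and independence is optimistic when all members share one database.

```python
def prob_any_member_slow(members: int, p_slow: float) -> float:
    """Probability that at least one of `members` consumers is slow,
    assuming each is slow independently with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** members


# Hypothetical p = 1% per member per window.
p20 = prob_any_member_slow(20, 0.01)  # roughly 0.18
p40 = prob_any_member_slow(40, 0.01)  # roughly 0.33
```

Doubling the group nearly doubles the chance that some member trips the group in any given window, and with a shared bottleneck p itself goes up as you add consumers.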

Runbook: confirm you’re in a rebalance storm

What to check first

  1. Consumer logs. Look for messages along these lines (exact wording varies by client version):
  • “Revoked partitions”
  • “Rebalance in progress”
  • “Max poll interval exceeded”
  • “Commit failed: rebalance in progress”
  2. Group state
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group>
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group> --members
kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group <group> --state

I’m looking for:

  • group state not staying STABLE
  • members flapping
  • partitions moving constantly
  3. Lag shape. If lag rises in a “sawtooth” with periodic resets, that’s often a rebalance loop.
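To make “partitions moving constantly” measurable, one option is to diff two assignment snapshots. This is a minimal sketch: it assumes you have already turned two runs of the --describe output into partition → member mappings by hand or with your own parser (parsing is not shown because the column layout differs between versions). The member names are placeholders.

```python
def moved_partitions(before: dict, after: dict) -> set:
    """Partitions whose owner differs between two snapshots.

    Each snapshot maps a partition number to a member id; a partition
    missing from a snapshot counts as unassigned.
    """
    return {
        partition
        for partition in before.keys() | after.keys()
        if before.get(partition) != after.get(partition)
    }


# Hypothetical snapshots taken a minute apart.
snapshot_1 = {0: "consumer-a", 1: "consumer-a", 2: "consumer-b"}
snapshot_2 = {0: "consumer-a", 1: "consumer-b", 2: "consumer-b"}
churn = moved_partitions(snapshot_1, snapshot_2)  # {1}
```

In a healthy group the result is empty outside deploy windows; a non-empty set on most comparisons points to a storm.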

How to confirm the hypothesis

A. Check whether processing blocks poll(). A classic smell is doing heavy work on the poll thread.

If you can’t inspect code immediately, infer it:

  • large processing time per batch
  • commits delayed
  • max.poll.interval.ms exceeded in logs

B. Verify DB/backends are the real bottleneck. In my incident the DB pool was exhausted and consumers were mostly waiting.

If you have it:

  • DB pool metrics
  • DB latency
  • CPU not high but throughput low

Safe mitigations (what I do in order)

  1. Stop making the group bigger. Don’t add more consumers until the group is stable.

  2. Reduce work per poll

  • reduce max.poll.records
  • move heavy processing to a worker pool so the poll thread keeps polling
  • add backpressure: don’t poll if your work queue is full
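  3. Increase max.poll.interval.ms (carefully). This buys time for slow processing without being kicked out of the group, at the cost of slower detection of a consumer that is genuinely stuck.

  4. Adopt cooperative rebalancing + static membership. This reduces stop-the-world rebalances during deploys and flapping. Static membership only helps if each instance keeps the same group.instance.id across restarts and comes back within session.timeout.ms.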
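The “reduce work per poll” step (poll thread only polls, workers process, stop fetching when the queue is full) can be sketched without a broker. Everything below is an illustration under assumptions: FakeConsumer and run_pipeline are hypothetical names, the fake is an in-memory stand-in rather than a client API, and pause()/resume() take no arguments here, whereas the Java client’s KafkaConsumer.pause() and resume() take a collection of partitions.

```python
import collections
import queue
import threading
import time


class FakeConsumer:
    """Hypothetical in-memory stand-in for a Kafka consumer."""

    def __init__(self, records):
        self._records = collections.deque(records)
        self.paused = False
        self.poll_calls = 0

    def poll(self, max_records):
        # Polling while paused returns nothing but still counts as a poll.
        self.poll_calls += 1
        if self.paused:
            return []
        batch = []
        while self._records and len(batch) < max_records:
            batch.append(self._records.popleft())
        return batch

    def pause(self):
        self.paused = True

    def resume(self):
        self.paused = False

    def exhausted(self):
        return not self._records


def run_pipeline(consumer, handle, workers=4, queue_size=10, max_records=5):
    work = queue.Queue(maxsize=queue_size)  # bounded queue = backpressure

    def worker():
        while True:
            record = work.get()
            try:
                if record is None:  # shutdown sentinel
                    return
                handle(record)
            finally:
                work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    backlog = collections.deque()  # polled but not yet handed to a worker
    while True:
        backlog.extend(consumer.poll(max_records))  # always keep polling
        while backlog:
            try:
                work.put_nowait(backlog[0])
            except queue.Full:
                break
            backlog.popleft()
        if backlog:
            consumer.pause()   # queue is full: stop fetching, keep polling
            time.sleep(0.001)  # avoid a busy loop in this sketch
        else:
            consumer.resume()
            if consumer.exhausted():
                break

    work.join()            # wait for in-flight records
    for _ in threads:
        work.put(None)
    for t in threads:
        t.join()
```

The point of the design is that poll() is called on every loop iteration no matter how slow the workers are, so slow processing shows up as a paused consumer and growing lag rather than as a member dropping out of the group.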

Risky mitigations

  • Set timeouts extremely high → slow failover during real crashes
  • Disable commits/ack discipline → duplicates and data loss
  • “Fix lag” by enabling aggressive retries without idempotency → more duplicate side effects and more load on the backend that is already the bottleneck

What we changed (concrete config diff)

We made two changes: stabilize polling and reduce rebalance disruption.

Before (too optimistic for our processing time):

enable.auto.commit=true
max.poll.records=500
max.poll.interval.ms=300000
session.timeout.ms=10000
heartbeat.interval.ms=3000
partition.assignment.strategy=org.apache.kafka.clients.consumer.RangeAssignor

After (more stable, deploy-friendly):

enable.auto.commit=false
max.poll.records=50
max.poll.interval.ms=1800000
session.timeout.ms=30000
heartbeat.interval.ms=10000
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
group.instance.id=${HOSTNAME}
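A small lint over the two configs makes the differences explicit. This is a sketch, not a complete validator: config_warnings is a hypothetical helper, it assumes the timeout keys are present, and it checks only three things (heartbeat at most a third of the session timeout, as the Kafka docs recommend; auto-commit combined with a worker pool; eager vs. cooperative assignor).

```python
def config_warnings(cfg: dict, uses_worker_pool: bool = True) -> list:
    """Return human-readable warnings for a consumer config given as a dict."""
    warnings = []

    session_ms = int(cfg["session.timeout.ms"])
    heartbeat_ms = int(cfg["heartbeat.interval.ms"])
    if heartbeat_ms * 3 > session_ms:
        warnings.append("heartbeat.interval.ms is more than 1/3 of session.timeout.ms")

    # enable.auto.commit defaults to true when unset.
    if uses_worker_pool and cfg.get("enable.auto.commit", "true") == "true":
        warnings.append("auto-commit may commit offsets before workers finish")

    assignors = [a.strip() for a in cfg.get("partition.assignment.strategy", "").split(",")]
    if not all(a.endswith("CooperativeStickyAssignor") for a in assignors):
        warnings.append("assignor is eager: every rebalance revokes all partitions")

    return warnings


before = {
    "enable.auto.commit": "true",
    "session.timeout.ms": "10000",
    "heartbeat.interval.ms": "3000",
    "partition.assignment.strategy": "org.apache.kafka.clients.consumer.RangeAssignor",
}
after = {
    "enable.auto.commit": "false",
    "session.timeout.ms": "30000",
    "heartbeat.interval.ms": "10000",
    "partition.assignment.strategy": "org.apache.kafka.clients.consumer.CooperativeStickyAssignor",
}
```

With these inputs the “before” config yields the auto-commit and eager-assignor warnings, and the “after” config yields none.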

In code, we also ensured:

  • poll thread only polls and enqueues work
  • worker pool processes records
  • offsets are committed only after successful processing
  • processing is idempotent (or has idempotency keys)
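Once a worker pool is involved, records finish out of order, so “commit only after successful processing” needs a bit of bookkeeping. One common approach, sketched here, is to commit only the contiguous prefix of finished offsets. OffsetTracker is a hypothetical helper for a single partition, and it assumes offsets are consecutive, which does not hold on compacted or transactional topics (track the offsets you actually polled in that case).

```python
class OffsetTracker:
    """Tracks which offsets of one partition are fully processed."""

    def __init__(self, first_offset: int):
        self._next = first_offset  # lowest offset not yet known to be done
        self._done = set()         # finished offsets above self._next

    def mark_done(self, offset: int) -> None:
        self._done.add(offset)
        # Advance over every finished offset that is now contiguous.
        while self._next in self._done:
            self._done.remove(self._next)
            self._next += 1

    def committable(self) -> int:
        """Offset that is safe to commit (Kafka commits the NEXT offset to read)."""
        return self._next


tracker = OffsetTracker(first_offset=10)
tracker.mark_done(12)   # finished early; 10 and 11 still running
safe = tracker.committable()  # 10: nothing can be committed yet
tracker.mark_done(10)
tracker.mark_done(11)
safe = tracker.committable()  # 13: offsets 10..12 are all done
```

After a rebalance or crash, anything above the committed offset is processed again, which is why the idempotency point in the list still matters.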

How to verify (measurable)

  1. Group stays STABLE. Run --describe --members multiple times across 10–15 minutes. Membership should stop flapping.

  2. Rebalance frequency drops. In logs, “revoked partitions” messages should become rare (deploy windows only).

  3. Lag decays monotonically. Once stable, lag should trend down. If it oscillates, you still have instability.

  4. Downstream pressure improves. DB pool stops thrashing, latency stabilizes.
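The lag check can be automated over a handful of samples. A minimal sketch, assuming you collect total group lag at a fixed interval; lag_is_decaying and the 5% tolerance for sampling noise are my own choices for illustration.

```python
def lag_is_decaying(samples: list, tolerance: float = 0.05) -> bool:
    """True if lag trends down: no sample rises more than `tolerance`
    above the previous one, and the last sample is below the first."""
    if len(samples) < 2:
        raise ValueError("need at least two lag samples")
    for previous, current in zip(samples, samples[1:]):
        if current > previous * (1 + tolerance):
            return False  # a real jump: still oscillating
    return samples[-1] < samples[0]


# Hypothetical total-lag samples, one per minute.
recovering = lag_is_decaying([1000, 900, 905, 700, 400])   # True
oscillating = lag_is_decaying([1000, 600, 1100, 500])      # False
```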

Prevention / guardrails

  • Rebalance budget
    • alert if rebalances exceed N/hour
  • Poll-time budget
    • define max processing time per poll cycle
  • Idempotency contract
    • duplicates are inevitable under some failures; make them safe
  • Backpressure contract
    • consumer must stop polling if downstream is saturated
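The rebalance budget is the easiest guardrail to prototype: count rebalances in a sliding window and flag when the count goes over N. In this sketch RebalanceBudget is a hypothetical class, timestamps are plain seconds, and how you detect a rebalance (log line, rebalance listener callback, client metric) is left to your setup.

```python
from collections import deque


class RebalanceBudget:
    """Sliding-window counter: more than max_per_window rebalances is over budget."""

    def __init__(self, max_per_window: int, window_seconds: float = 3600.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._events = deque()  # timestamps of recent rebalances, oldest first

    def record(self, timestamp: float) -> bool:
        """Record one rebalance; return True if the budget is now exceeded."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop rebalances that fell out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.max_per_window


# Hypothetical budget: at most 3 rebalances per hour.
budget = RebalanceBudget(max_per_window=3)
alerts = [budget.record(t) for t in (0, 600, 1200, 1800)]
# The fourth rebalance within the hour is the first one over budget.
```

Pick N from what a normal deploy produces, so that routine rollouts stay under the threshold and only flapping trips it.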

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Kafka Consumer Rebalance Storms: Why Scaling Consumers Can Increase Lag". https://www.michal-drozd.com/en/blog/kafka-consumer-rebalance-storm/ (Published November 10, 2025).