One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production
Kafka partition skew taught me that “evenly distributed” is a lie until you prove it with data. “Why is consumer lag spiking? All our metrics look normal.” We were staring at Grafana dashboards showing consumer group lag creeping up, rebalances happening, and occasional timeout errors. CPU across Kafka brokers looked fine—except for one broker handling partition 23, which was at 99% CPU while the others were at 15%.
The incident unfolded over three days. First day: “Maybe it’s a weird traffic spike.” Second day: “Rebalancing should fix it.” Third day: “Oh no, it’s not load balancing at all—one partition is getting 60% of all events.” The root cause was embarrassingly simple: we changed the partition key from userId (millions of values) to countryCode (12 values). Nobody caught it in code review because the key generation logic was buried in a shared library.
What made this particularly frustrating was that our testing looked perfect. Unit tests passed. Integration tests passed. Load tests showed “evenly distributed” traffic. But the load tests used generated UUIDs, not real production data distribution. The moment actual user traffic hit with its real country distribution (80% from one country), partition 23 became the bottleneck and everything downstream started timing out.
This is the kind of bug that makes you question your entire testing strategy. The Kafka partition key is effectively an API contract—change it and you can silently break your entire streaming pipeline. But unlike REST API changes, there’s no schema validation, no compile-time check, and no standard way to catch cardinality problems before they reach production.
Environment: Kafka 3.6+, Java producers, 48 partitions, multi-region user base
Understanding Partition Skew
How Kafka Partitioning Works (and Fails)
Expected behavior (good key):
Topic: user-events (48 partitions)
Key: userId (millions of unique values)
partition_id = hash(key) % 48
Distribution:
Partition 0: ████░░░░░░ 2.1%
Partition 1: ███░░░░░░░ 2.0%
Partition 2: ████░░░░░░ 2.2%
...
Partition 47: ███░░░░░░░ 2.0%
✓ Even CPU load
✓ Even consumer lag
✓ Predictable latency
What actually happens (bad key):
Topic: user-events (48 partitions)
Key: countryCode (12 unique values... but not evenly distributed)
Real production distribution:
- US: 80% of traffic
- UK: 10%
- DE: 5%
- Others: 5%
partition_id = hash("US") % 48 = 23 ← 80% goes here!
partition_id = hash("UK") % 48 = 7 ← 10% goes here
partition_id = hash("DE") % 48 = 41 ← 5% goes here
Distribution:
Partition 23: ████████████████████████████████ 80% ← MELTDOWN
Partition 7: ████░░░░░░ 10%
Partition 41: ██░░░░░░░░ 5%
Others: ░░░░░░░░░░ 5%
❌ One consumer stuck
❌ Lag spikes
❌ Rebalance timeouts
❌ Downstream cascading failures
Why Standard Testing Misses This
// This test looks fine but proves nothing
@Test
void testPartitionDistribution() {
for (int i = 0; i < 100000; i++) {
String key = UUID.randomUUID().toString(); // ← FAKE DATA!
producer.send(new ProducerRecord<>("events", key, event));
}
// Distribution is perfect... because UUIDs are random
}
The test passes because random UUIDs have perfect cardinality. Production fails because real keys don’t.
Common Skew Scenarios (I’ve Seen All of These)
1. Low-Cardinality Key
// Before: good
record.setKey(userId); // millions of users
// After: disaster
record.setKey(tenantId); // 50 tenants, 1 has 80% of users
2. Normalization Kills Distribution
// Looks innocent
String normalizedEmail = email.trim().toLowerCase();
record.setKey(normalizedEmail);
// But if your emails are like:
// "[email protected] " (with trailing spaces)
// "[email protected]"
// Both hash to same partition after normalization
// Worse: null handling
String key = userId != null ? userId : ""; // All nulls → one partition!
3. Composite Key Prefix Dominance
// Intention: distribute by both tenant and user
String key = String.format("%s:%s", tenantId, userId);
// Reality: hash of "TENANT_A:user123" is dominated by prefix
// If TENANT_A has 80% of users → skew
4. The “Clever” Optimization That Backfires
// Someone "optimized" key size
String key = userId.substring(0, 8); // "Save bytes!"
// Now collisions: user "12345678999" and "12345678000" → same partition
The Skew Contract Approach
Instead of hoping your key is good, define an explicit contract:
# skew_contract.yml
version: 1
topics:
user-events:
partitions: 48
max_partition_share: 0.08 # No partition should get >8% of traffic
max_gini: 0.15 # Gini coefficient measures inequality
min_unique_keys: 20000 # Ensure high cardinality
sample_size: 100000 # Test with realistic sample size
This contract says:
- max_partition_share: In a perfect world, each of 48 partitions gets 2.08%. Allow up to 8% (some variance is OK).
- max_gini: Gini coefficient of 0 = perfect equality, 1 = complete inequality. 0.15 allows some natural variance.
- min_unique_keys: Your sample must have at least 20k unique keys (catches low-cardinality early).
Implementation: Testing Keys Without Kafka
The trick: you don’t need a Kafka cluster. You just need to test your key generation logic with realistic data.
Step 1: Key Sampler Test (Java)
// src/test/java/com/yourapp/kafka/KeySamplerTest.java
import org.junit.jupiter.api.Test;
import java.nio.file.*;
import java.util.*;
class KeySamplerTest {
@Test
void generateRepresentativeKeys() throws Exception {
// Load realistic test data (anonymized production sample, fixtures, etc.)
List<UserEvent> events = loadRealisticEvents(); // Your implementation
List<String> keys = new ArrayList<>(100_000);
for (UserEvent event : events) {
// THIS IS THE CRITICAL PART: use your REAL key logic
String key = KeyFactory.keyFor(event);
keys.add(key);
}
// Write to file for validation
Path output = Paths.get("build/kafka_keys_user_events.txt");
Files.createDirectories(output.getParent());
Files.write(output, keys, UTF_8);
System.out.printf("Generated %d keys with %d unique values\n",
keys.size(), new HashSet<>(keys).size());
}
private List<UserEvent> loadRealisticEvents() {
// IMPORTANT: This should reflect production distribution
// - 80% US users, 10% UK, etc. if that's your reality
// - Mix of tenant sizes if you're B2B
// - Include edge cases: null handling, special characters
// Example (replace with your real data):
return IntStream.range(0, 100_000)
.mapToObj(i -> {
String country = selectCountryRealistic(); // Weighted distribution
String userId = "user-" + (i % 50000); // Realistic cardinality
return new UserEvent(country, userId, "action");
})
.collect(Collectors.toList());
}
}
Step 2: Skew Contract Validator (Go CLI)
This tool uses the same murmur2 hash that Kafka Java clients use:
# cmd/skewcontract/main.go
go run ./cmd/skewcontract \
-topic user-events \
-keys build/kafka_keys_user_events.txt \
-partitions 48 \
-max-share 0.08 \
-max-gini 0.15 \
-min-unique 20000
The validator:
- Reads your generated keys
- Applies
murmur2(key) % partitions(Kafka’s default) - Calculates distribution metrics
- Fails if skew exceeds thresholds
Key insight: This is deterministic. Same keys → same partition assignment. No flakiness.
CI Integration (GitHub Actions)
name: kafka-skew-contract
on: [pull_request]
jobs:
skew:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-java@v4
with:
distribution: temurin
java-version: '21'
- uses: actions/setup-go@v5
with:
go-version: '1.22'
- name: Generate partition keys from realistic sample
run: ./gradlew test --tests KeySamplerTest
- name: Verify skew contract
run: |
go run ./cmd/skewcontract \
-topic user-events \
-keys build/kafka_keys_user_events.txt \
-partitions 48 \
-max-share 0.08 \
-max-gini 0.15 \
-min-unique 20000 \
-report skew_report.json
- name: Upload report (for debugging failures)
if: always()
uses: actions/upload-artifact@v4
with:
name: skew-report
path: skew_report.json
- name: Comment on PR with skew stats
if: failure()
uses: actions/github-script@v7
with:
script: |
const fs = require('fs');
const report = JSON.parse(fs.readFileSync('skew_report.json', 'utf8'));
const body = `## ⚠️ Partition Skew Detected
- **Max partition share**: ${(report.max_partition_share * 100).toFixed(1)}% (threshold: 8%)
- **Gini coefficient**: ${report.gini.toFixed(3)} (threshold: 0.15)
- **Unique keys**: ${report.unique_keys.toLocaleString()} (min: 20,000)
**Top hotspot partitions**:
${report.top_partitions.slice(0, 5).map(p =>
`- Partition ${p.partition}: ${(p.share * 100).toFixed(1)}%`
).join('\n')}
**Suggestions**:
${report.suggestions.join('\n')}
`;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.name,
body
});
This gives you a fast, deterministic gate that runs in ~10 seconds and catches skew before merge.
The Go Validator (Simplified Core)
Here’s the essential logic (full implementation in appendix):
package main
import (
"bufio"
"fmt"
"os"
"sort"
)
func main() {
keys := readKeys("keys.txt")
partitions := 48
// Count keys per partition
counts := make([]int, partitions)
unique := make(map[string]bool)
for _, key := range keys {
unique[key] = true
partition := kafkaPartition(key, partitions)
counts[partition]++
}
// Calculate metrics
maxCount := max(counts)
totalKeys := len(keys)
maxShare := float64(maxCount) / float64(totalKeys)
gini := giniCoefficient(counts)
// Validate
if maxShare > 0.08 {
fmt.Printf("FAIL: max share %.2f%% exceeds 8%%\n", maxShare*100)
os.Exit(1)
}
if gini > 0.15 {
fmt.Printf("FAIL: gini %.3f exceeds 0.15\n", gini)
os.Exit(1)
}
fmt.Println("PASS: Partition distribution is acceptable")
}
// Kafka's DefaultPartitioner logic (murmur2)
func kafkaPartition(key string, partitions int) int {
h := murmur2([]byte(key))
return int((h & 0x7fffffff) % uint32(partitions))
}
func murmur2(data []byte) uint32 {
// Kafka-compatible murmur2 implementation
// (full implementation omitted for brevity - see appendix)
const seed = 0x9747b28c
// ... hash calculation ...
return hash
}
Runtime Monitoring (Because CI Isn’t Enough)
Even with CI gates, production can surprise you:
- New traffic patterns
- Data migration changes distribution
- A/B tests shift user segments
Monitor these metrics:
# Partition lag skew
max by (partition) (kafka_consumer_lag{topic="user-events"})
/ avg by (partition) (kafka_consumer_lag{topic="user-events"})
> 3 # One partition has 3x average lag
# Partition size skew
max by (partition) (kafka_log_size{topic="user-events"})
/ avg by (partition) (kafka_log_size{topic="user-events"})
> 2
# Consumer processing time skew
max by (partition) (kafka_consumer_processing_time_p99)
/ avg by (partition) (kafka_consumer_processing_time_p99)
> 2
Alert when these ratios spike—it means a hotspot is forming.
When Skew Happens in Production
If you detect skew after deployment:
Option 1: Fix the Key (Preferred)
// Bad: low cardinality
String key = countryCode;
// Good: add high-cardinality component
String key = countryCode + ":" + userId;
// or better: just use userId if it's globally unique
String key = userId;
Then re-publish events with new key (or accept temporary skew during transition).
Option 2: Increase Partitions (Risky)
kafka-topics --alter --topic user-events --partitions 96
Caution: This only helps if your key has enough cardinality. If you have 12 country codes and 96 partitions, you still only use 12 partitions. Also, increasing partitions is irreversible and affects consumer rebalancing.
Option 3: Split the Topic
If one dimension dominates (e.g., 80% US users), consider:
user-events-us (handles US traffic)
user-events-intl (handles rest)
Downside: more topics to manage, more consumer groups.
Checklist
## Partition Key E-E-A-T Checklist
### Before Changing Keys
- [ ] Run key sampler test with realistic production data distribution
- [ ] Verify cardinality: `SELECT COUNT(DISTINCT key_column) FROM events`
- [ ] Check for null handling: `WHERE key_column IS NULL`
- [ ] Review normalization logic (trim, lowercase, substring)
### CI Contract
- [ ] `max_partition_share` set based on `1/partitions * acceptable_variance`
- [ ] `min_unique_keys` ≥ `partitions * 100` (rule of thumb)
- [ ] Sample data reflects production distribution (not random UUIDs!)
### Production Monitoring
- [ ] Alert on partition lag skew ratio > 3x
- [ ] Alert on partition size skew ratio > 2x
- [ ] Dashboard showing per-partition metrics
### When Skew Detected
- [ ] Identify key cardinality: too low? normalization bug? null handling?
- [ ] Fix key generation logic (prefer high-cardinality stable IDs)
- [ ] If unfixable, consider topic split or accept variance
- [ ] Only increase partitions if key has sufficient cardinality
Conclusion
The Kafka partition key is an API contract that most teams don’t treat as one. Change it carelessly and you get silent hotspots, consumer lag spikes, and mysterious timeout errors—all while your dashboards show “everything is fine.”
Skew Contracts turn “pick a good key” from tribal knowledge into a testable, enforceable policy. The implementation is surprisingly simple: generate realistic keys in a test, hash them the same way Kafka does, measure distribution, fail if it’s skewed. No Kafka cluster needed, no flaky tests, just deterministic math.
The key insight: your partition key is not an implementation detail, it’s an API. Treat it like one.
Key principles:
- Test keys with realistic data—random UUIDs prove nothing
- Cardinality matters more than uniformity—a million slightly-skewed keys beats 12 perfectly-distributed ones
- Validate in CI before production—catching skew in code review is nearly impossible
- Monitor skew in production—traffic patterns change
- Fix the key, not the cluster—increasing partitions doesn’t help if your key has low cardinality
The next time someone suggests changing a Kafka partition key, ask: “Did we run the skew contract?”
Appendix: Full Go Validator
See github.com/yourorg/kafka-skew-contracts for complete implementation with:
- YAML config loading
- Multi-topic support
- SARIF report for GitHub Code Scanning
- Prometheus metrics export
Related Articles
- Kubernetes conntrack Exhaustion - Another “looks fine until production” failure mode
- gRPC Load Balancing in Kubernetes - Client-side load balancing challenges
Related posts
5000 Unacked Messages and Climbing: Stop RabbitMQ Consumer Meltdowns in CI
Queue looks healthy until deployment, then messages_unacknowledged explodes, memory spikes, and redelivery storms start. The culprit: your prefetch is too high and nobody tested actual ack behavior.
Fields Disappeared But Nothing Crashed: Catch Schema Evolution Bugs Before Production
Producer upgraded Protobuf, consumer still on old version. No errors, no warnings—just silent data loss in production. Your schema evolution broke backward compatibility and CI didn't notice.
Kubernetes APF Starvation: When One Controller Makes kubectl Hang
APF can starve your Kubernetes API: kubectl hangs, controllers timeout, and 429s spike. Runbook to isolate the noisy client, fix FlowSchemas, and prove it.
Kubernetes OOM Killer: Why Your Container Dies at 50% Memory
Container memory limit is 4GB but OOM kills at 2GB used. Kernel buffers, page cache, and cgroup accounting tricks cause early OOMKills. Here's the full picture.
Cite this article
If you reference this post, please link to the original URL and credit the author.