Back to blog

One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production

|
| kafka, debugging, testing, ci, partition-key

Kafka partition skew taught me that “evenly distributed” is a lie until you prove it with data. “Why is consumer lag spiking? All our metrics look normal.” We were staring at Grafana dashboards showing consumer group lag creeping up, rebalances happening, and occasional timeout errors. CPU across Kafka brokers looked fine—except for one broker handling partition 23, which was at 99% CPU while the others were at 15%.

The incident unfolded over three days. First day: “Maybe it’s a weird traffic spike.” Second day: “Rebalancing should fix it.” Third day: “Oh no, it’s not load balancing at all—one partition is getting 60% of all events.” The root cause was embarrassingly simple: we changed the partition key from userId (millions of values) to countryCode (12 values). Nobody caught it in code review because the key generation logic was buried in a shared library.

What made this particularly frustrating was that our testing looked perfect. Unit tests passed. Integration tests passed. Load tests showed “evenly distributed” traffic. But the load tests used generated UUIDs, not real production data distribution. The moment actual user traffic hit with its real country distribution (80% from one country), partition 23 became the bottleneck and everything downstream started timing out.

This is the kind of bug that makes you question your entire testing strategy. The Kafka partition key is effectively an API contract—change it and you can silently break your entire streaming pipeline. But unlike REST API changes, there’s no schema validation, no compile-time check, and no standard way to catch cardinality problems before they reach production.

Environment: Kafka 3.6+, Java producers, 48 partitions, multi-region user base

Understanding Partition Skew

How Kafka Partitioning Works (and Fails)

Expected behavior (good key):
Topic: user-events (48 partitions)
Key: userId (millions of unique values)

partition_id = hash(key) % 48

Distribution:
Partition 0:  ████░░░░░░  2.1%
Partition 1:  ███░░░░░░░  2.0%
Partition 2:  ████░░░░░░  2.2%
...
Partition 47: ███░░░░░░░  2.0%

✓ Even CPU load
✓ Even consumer lag
✓ Predictable latency
What actually happens (bad key):
Topic: user-events (48 partitions)
Key: countryCode (12 unique values... but not evenly distributed)

Real production distribution:
- US: 80% of traffic
- UK: 10%
- DE: 5%
- Others: 5%

partition_id = hash("US") % 48 = 23  ← 80% goes here!
partition_id = hash("UK") % 48 = 7   ← 10% goes here
partition_id = hash("DE") % 48 = 41  ← 5% goes here

Distribution:
Partition 23: ████████████████████████████████  80% ← MELTDOWN
Partition 7:  ████░░░░░░  10%
Partition 41: ██░░░░░░░░   5%
Others:       ░░░░░░░░░░   5%

❌ One consumer stuck
❌ Lag spikes
❌ Rebalance timeouts
❌ Downstream cascading failures

Why Standard Testing Misses This

// This test looks fine but proves nothing
@Test
void testPartitionDistribution() {
    for (int i = 0; i < 100000; i++) {
        String key = UUID.randomUUID().toString();  // ← FAKE DATA!
        producer.send(new ProducerRecord<>("events", key, event));
    }
    // Distribution is perfect... because UUIDs are random
}

The test passes because random UUIDs have perfect cardinality. Production fails because real keys don’t.

Common Skew Scenarios (I’ve Seen All of These)

1. Low-Cardinality Key

// Before: good
record.setKey(userId);  // millions of users

// After: disaster
record.setKey(tenantId);  // 50 tenants, 1 has 80% of users

2. Normalization Kills Distribution

// Looks innocent
String normalizedEmail = email.trim().toLowerCase();
record.setKey(normalizedEmail);

// But if your emails are like:
// "[email protected]  " (with trailing spaces)
// "[email protected]"
// Both hash to same partition after normalization

// Worse: null handling
String key = userId != null ? userId : "";  // All nulls → one partition!

3. Composite Key Prefix Dominance

// Intention: distribute by both tenant and user
String key = String.format("%s:%s", tenantId, userId);

// Reality: hash of "TENANT_A:user123" is dominated by prefix
// If TENANT_A has 80% of users → skew

4. The “Clever” Optimization That Backfires

// Someone "optimized" key size
String key = userId.substring(0, 8);  // "Save bytes!"

// Now collisions: user "12345678999" and "12345678000" → same partition

The Skew Contract Approach

Instead of hoping your key is good, define an explicit contract:

# skew_contract.yml
version: 1

topics:
  user-events:
    partitions: 48
    max_partition_share: 0.08   # No partition should get >8% of traffic
    max_gini: 0.15              # Gini coefficient measures inequality
    min_unique_keys: 20000      # Ensure high cardinality
    sample_size: 100000         # Test with realistic sample size

This contract says:

  • max_partition_share: In a perfect world, each of 48 partitions gets 2.08%. Allow up to 8% (some variance is OK).
  • max_gini: Gini coefficient of 0 = perfect equality, 1 = complete inequality. 0.15 allows some natural variance.
  • min_unique_keys: Your sample must have at least 20k unique keys (catches low-cardinality early).

Implementation: Testing Keys Without Kafka

The trick: you don’t need a Kafka cluster. You just need to test your key generation logic with realistic data.

Step 1: Key Sampler Test (Java)

// src/test/java/com/yourapp/kafka/KeySamplerTest.java
import org.junit.jupiter.api.Test;
import java.nio.file.*;
import java.util.*;

class KeySamplerTest {

    @Test
    void generateRepresentativeKeys() throws Exception {
        // Load realistic test data (anonymized production sample, fixtures, etc.)
        List<UserEvent> events = loadRealisticEvents();  // Your implementation

        List<String> keys = new ArrayList<>(100_000);

        for (UserEvent event : events) {
            // THIS IS THE CRITICAL PART: use your REAL key logic
            String key = KeyFactory.keyFor(event);
            keys.add(key);
        }

        // Write to file for validation
        Path output = Paths.get("build/kafka_keys_user_events.txt");
        Files.createDirectories(output.getParent());
        Files.write(output, keys, UTF_8);

        System.out.printf("Generated %d keys with %d unique values\n",
            keys.size(), new HashSet<>(keys).size());
    }

    private List<UserEvent> loadRealisticEvents() {
        // IMPORTANT: This should reflect production distribution
        // - 80% US users, 10% UK, etc. if that's your reality
        // - Mix of tenant sizes if you're B2B
        // - Include edge cases: null handling, special characters

        // Example (replace with your real data):
        return IntStream.range(0, 100_000)
            .mapToObj(i -> {
                String country = selectCountryRealistic();  // Weighted distribution
                String userId = "user-" + (i % 50000);  // Realistic cardinality
                return new UserEvent(country, userId, "action");
            })
            .collect(Collectors.toList());
    }
}

Step 2: Skew Contract Validator (Go CLI)

This tool uses the same murmur2 hash that Kafka Java clients use:

# cmd/skewcontract/main.go
go run ./cmd/skewcontract \
  -topic user-events \
  -keys build/kafka_keys_user_events.txt \
  -partitions 48 \
  -max-share 0.08 \
  -max-gini 0.15 \
  -min-unique 20000

The validator:

  1. Reads your generated keys
  2. Applies murmur2(key) % partitions (Kafka’s default)
  3. Calculates distribution metrics
  4. Fails if skew exceeds thresholds

Key insight: This is deterministic. Same keys → same partition assignment. No flakiness.

CI Integration (GitHub Actions)

name: kafka-skew-contract

on: [pull_request]

jobs:
  skew:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '21'

      - uses: actions/setup-go@v5
        with:
          go-version: '1.22'

      - name: Generate partition keys from realistic sample
        run: ./gradlew test --tests KeySamplerTest

      - name: Verify skew contract
        run: |
          go run ./cmd/skewcontract \
            -topic user-events \
            -keys build/kafka_keys_user_events.txt \
            -partitions 48 \
            -max-share 0.08 \
            -max-gini 0.15 \
            -min-unique 20000 \
            -report skew_report.json

      - name: Upload report (for debugging failures)
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: skew-report
          path: skew_report.json

      - name: Comment on PR with skew stats
        if: failure()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = JSON.parse(fs.readFileSync('skew_report.json', 'utf8'));

            const body = `## ⚠️ Partition Skew Detected

            - **Max partition share**: ${(report.max_partition_share * 100).toFixed(1)}% (threshold: 8%)
            - **Gini coefficient**: ${report.gini.toFixed(3)} (threshold: 0.15)
            - **Unique keys**: ${report.unique_keys.toLocaleString()} (min: 20,000)

            **Top hotspot partitions**:
            ${report.top_partitions.slice(0, 5).map(p =>
              `- Partition ${p.partition}: ${(p.share * 100).toFixed(1)}%`
            ).join('\n')}

            **Suggestions**:
            ${report.suggestions.join('\n')}
            `;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.name,
              body
            });

This gives you a fast, deterministic gate that runs in ~10 seconds and catches skew before merge.

The Go Validator (Simplified Core)

Here’s the essential logic (full implementation in appendix):

package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
)

func main() {
    keys := readKeys("keys.txt")
    partitions := 48

    // Count keys per partition
    counts := make([]int, partitions)
    unique := make(map[string]bool)

    for _, key := range keys {
        unique[key] = true
        partition := kafkaPartition(key, partitions)
        counts[partition]++
    }

    // Calculate metrics
    maxCount := max(counts)
    totalKeys := len(keys)
    maxShare := float64(maxCount) / float64(totalKeys)
    gini := giniCoefficient(counts)

    // Validate
    if maxShare > 0.08 {
        fmt.Printf("FAIL: max share %.2f%% exceeds 8%%\n", maxShare*100)
        os.Exit(1)
    }

    if gini > 0.15 {
        fmt.Printf("FAIL: gini %.3f exceeds 0.15\n", gini)
        os.Exit(1)
    }

    fmt.Println("PASS: Partition distribution is acceptable")
}

// Kafka's DefaultPartitioner logic (murmur2)
func kafkaPartition(key string, partitions int) int {
    h := murmur2([]byte(key))
    return int((h & 0x7fffffff) % uint32(partitions))
}

func murmur2(data []byte) uint32 {
    // Kafka-compatible murmur2 implementation
    // (full implementation omitted for brevity - see appendix)
    const seed = 0x9747b28c
    // ... hash calculation ...
    return hash
}

Runtime Monitoring (Because CI Isn’t Enough)

Even with CI gates, production can surprise you:

  • New traffic patterns
  • Data migration changes distribution
  • A/B tests shift user segments

Monitor these metrics:

# Partition lag skew
max by (partition) (kafka_consumer_lag{topic="user-events"})
/ avg by (partition) (kafka_consumer_lag{topic="user-events"})
> 3  # One partition has 3x average lag

# Partition size skew
max by (partition) (kafka_log_size{topic="user-events"})
/ avg by (partition) (kafka_log_size{topic="user-events"})
> 2

# Consumer processing time skew
max by (partition) (kafka_consumer_processing_time_p99)
/ avg by (partition) (kafka_consumer_processing_time_p99)
> 2

Alert when these ratios spike—it means a hotspot is forming.

When Skew Happens in Production

If you detect skew after deployment:

Option 1: Fix the Key (Preferred)

// Bad: low cardinality
String key = countryCode;

// Good: add high-cardinality component
String key = countryCode + ":" + userId;
// or better: just use userId if it's globally unique
String key = userId;

Then re-publish events with new key (or accept temporary skew during transition).

Option 2: Increase Partitions (Risky)

kafka-topics --alter --topic user-events --partitions 96

Caution: This only helps if your key has enough cardinality. If you have 12 country codes and 96 partitions, you still only use 12 partitions. Also, increasing partitions is irreversible and affects consumer rebalancing.

Option 3: Split the Topic

If one dimension dominates (e.g., 80% US users), consider:

user-events-us   (handles US traffic)
user-events-intl (handles rest)

Downside: more topics to manage, more consumer groups.

Checklist

## Partition Key E-E-A-T Checklist

### Before Changing Keys
- [ ] Run key sampler test with realistic production data distribution
- [ ] Verify cardinality: `SELECT COUNT(DISTINCT key_column) FROM events`
- [ ] Check for null handling: `WHERE key_column IS NULL`
- [ ] Review normalization logic (trim, lowercase, substring)

### CI Contract
- [ ] `max_partition_share` set based on `1/partitions * acceptable_variance`
- [ ] `min_unique_keys``partitions * 100` (rule of thumb)
- [ ] Sample data reflects production distribution (not random UUIDs!)

### Production Monitoring
- [ ] Alert on partition lag skew ratio > 3x
- [ ] Alert on partition size skew ratio > 2x
- [ ] Dashboard showing per-partition metrics

### When Skew Detected
- [ ] Identify key cardinality: too low? normalization bug? null handling?
- [ ] Fix key generation logic (prefer high-cardinality stable IDs)
- [ ] If unfixable, consider topic split or accept variance
- [ ] Only increase partitions if key has sufficient cardinality

Conclusion

The Kafka partition key is an API contract that most teams don’t treat as one. Change it carelessly and you get silent hotspots, consumer lag spikes, and mysterious timeout errors—all while your dashboards show “everything is fine.”

Skew Contracts turn “pick a good key” from tribal knowledge into a testable, enforceable policy. The implementation is surprisingly simple: generate realistic keys in a test, hash them the same way Kafka does, measure distribution, fail if it’s skewed. No Kafka cluster needed, no flaky tests, just deterministic math.

The key insight: your partition key is not an implementation detail, it’s an API. Treat it like one.

Key principles:

  1. Test keys with realistic data—random UUIDs prove nothing
  2. Cardinality matters more than uniformity—a million slightly-skewed keys beats 12 perfectly-distributed ones
  3. Validate in CI before production—catching skew in code review is nearly impossible
  4. Monitor skew in production—traffic patterns change
  5. Fix the key, not the cluster—increasing partitions doesn’t help if your key has low cardinality

The next time someone suggests changing a Kafka partition key, ask: “Did we run the skew contract?”


Appendix: Full Go Validator

See github.com/yourorg/kafka-skew-contracts for complete implementation with:

  • YAML config loading
  • Multi-topic support
  • SARIF report for GitHub Code Scanning
  • Prometheus metrics export

Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "One Partition at 99% CPU: Stop Kafka Hotspots Before They Reach Production". https://www.michal-drozd.com/en/blog/kafka-partition-skew-contracts/ (Published November 15, 2025).