Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage

CPU limits felt safe until throttling showed up in the latency charts. “CPU usage is 40%, but p99 latency jumped from 50ms to 800ms.” We stared at the dashboard for hours. The service was clearly struggling—tail latency was terrible, users were complaining—but every metric said we had plenty of headroom. CPU was at 40%. Memory was fine. There was no contention, no obvious bottleneck. Yet our p99 was 16 times worse than normal.

The answer turned out to be invisible in standard Kubernetes dashboards: CPU throttling. The “40% CPU” was an average over time, but our workload was bursty. When a request arrived, the handler would spike to 100% CPU for 80ms, then idle. When several requests overlapped, their threads together burned through the container’s CFS quota (CFS is the Linux Completely Fair Scheduler) partway through the 100ms scheduling period, and the kernel throttled the whole container until the next period began. The throttling added latency that didn’t appear in CPU metrics because the metric showed average utilization, not burst behavior.

This is one of the most counterintuitive aspects of Kubernetes resource management. CPU limits are enforced via a quota mechanism that operates on 100ms periods. If your workload uses its entire quota in a burst, it’s throttled for the rest of the period—even if the “average” CPU usage is low. Bursty workloads like web servers and APIs are especially vulnerable.

Tested on: Kubernetes 1.28, Java 21, Go 1.22, Prometheus + Grafana

How CFS Throttling Works

CPU Limits in Kubernetes

resources:
  requests:
    cpu: "500m"   # Guaranteed 0.5 CPU
  limits:
    cpu: "1000m"  # Maximum 1 CPU

CFS Quota Mechanism

CFS Period: 100ms (default)
CPU Limit 1000m = 100ms CPU time per 100ms period,
                  summed across all threads and cores in the container

Timeline (two threads bursting in parallel):
|----100ms period----|----100ms period----|
|▓▓▓▓▓▓▓▓▓▓          |▓▓▓▓▓▓              |
 ↑ quota spent after  ↑ quota refills; both
   50ms, then 50ms      threads finish at
   of throttling        t=130ms

Problem: Bursts

Scenario: Web request handler
- Most time: idle (0% CPU)
- On request: 100% CPU for 80ms
- CFS quota: 100ms per 100ms period

One request alone: 80ms CPU → OK (20ms quota remaining)
Two requests at once, one thread each:
         → together they use 2 × 50ms = 100ms of quota by t=50ms
         → THROTTLED until the next period starts at t=100ms
         → each still needs 30ms of CPU, so both finish at t=130ms

Result: Each request has +50ms latency (throttling)

Diagnostics: CFS Metrics

Prometheus Query

# Throttled seconds per second of wall time (rate() is per-second, averaged over 5m)
sum(rate(container_cpu_cfs_throttled_seconds_total{
    namespace="production",
    pod=~"api-.*"
}[5m])) by (pod)

Correlation with Latency

# Throttling vs P99 latency
# Panel 1: Throttled periods
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)

# Panel 2: Request latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
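
The same counters can also be read from inside a container when Prometheus is not at hand. On cgroup v2 nodes the kernel exposes them in cpu.stat (usually /sys/fs/cgroup/cpu.stat, though the path is an assumption about your setup). The Go sketch below parses a hard-coded sample with invented numbers and computes the same throttled-period ratio as Panel 1; throttleRatio is a hypothetical helper name.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// throttleRatio parses cpu.stat text ("key value" per line) and returns
// nr_throttled / nr_periods, i.e. the share of periods that hit the quota.
func throttleRatio(cpuStat string) (float64, error) {
	fields := map[string]uint64{}
	sc := bufio.NewScanner(strings.NewReader(cpuStat))
	for sc.Scan() {
		parts := strings.Fields(sc.Text())
		if len(parts) != 2 {
			continue // skip blank or unexpected lines
		}
		v, err := strconv.ParseUint(parts[1], 10, 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		fields[parts[0]] = v
	}
	periods := fields["nr_periods"]
	if periods == 0 {
		return 0, nil // no CPU limit set, or no periods elapsed yet
	}
	return float64(fields["nr_throttled"]) / float64(periods), nil
}

func main() {
	// Invented sample; in a real container you would read the file instead.
	sample := "usage_usec 5000000\nnr_periods 200\nnr_throttled 90\nthrottled_usec 4100000\n"
	ratio, err := throttleRatio(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("throttled in %.0f%% of periods\n", ratio*100)
}
```

These counters are cumulative since the cgroup was created, so a single reading gives a lifetime average; compare two readings to see what is happening now.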

Reproducible Lab

Java Service

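// ThrottlingDemo.java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ThrottlingDemo {

    @GetMapping("/cpu-burst")
    public String cpuBurst() {
        // Simulate CPU-intensive work on the request thread
        long start = System.nanoTime();
        double result = 0;
        for (int i = 0; i < 10_000_000; i++) {
            result += Math.sin(i) * Math.cos(i);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        // Returning the result keeps the JIT from eliminating the loop
        return "Computed in " + elapsed + "ms, result: " + result;
    }
}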

Kubernetes Deployment

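apiVersion: apps/v1
kind: Deployment
metadata:
  name: throttling-demo
spec:
  replicas: 1
  selector:                  # Required for apps/v1 Deployments
    matchLabels:
      app: throttling-demo
  template:
    metadata:
      labels:
        app: throttling-demo # Must match the selector
    spec:
      containers:
      - name: app
        image: throttling-demo:latest
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "500m"      # Tight limit!
            memory: "512Mi"
        ports:
        - containerPort: 8080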

Load Test

# Send 1000 requests, 10 at a time
hey -n 1000 -c 10 http://throttling-demo:8080/cpu-burst

Results

# With CPU limit 500m:
Summary:
  Requests/sec: 12.3
  Latency distribution:
    50%: 120ms
    90%: 450ms
    99%: 890ms  ← Throttling!

# Without CPU limit (only request):
Summary:
  Requests/sec: 45.2
  Latency distribution:
    50%: 45ms
    90%: 52ms
    99%: 68ms   ← 13× better!

Solutions

1. Remove CPU Limits (Controversial!)

resources:
  requests:
    cpu: "500m"    # Keep request for scheduling
  # limits:        # NO CPU limit!
  #   cpu: "1000m"

Why it works:

- No limit = no CFS quota
- Bursts can use free CPU on node
- Request guarantees minimum CPU

Risks:

- “Noisy neighbor” - one pod can consume all CPU
- Less predictable behavior
- Need good monitoring

2. Set Higher Limit

resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # 4× request = room for bursts

Rule of thumb: limit = 2-4 × request for bursty workloads.

3. Burstable vs Guaranteed QoS

# Guaranteed QoS (request == limit for both CPU and memory, in every container)
resources:
  requests:
    cpu: "1000m"
  limits:
    cpu: "1000m"  # Same as request; memory must match too for Guaranteed

# Burstable QoS (request < limit)
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # Higher = Burstable

4. Java: GOMAXPROCS Equivalent

# The JVM detects the container CPU limit automatically,
# but may then set too few GC threads.
# To set the CPU count explicitly, pass these flags
# in the Dockerfile or deployment:

java -XX:ActiveProcessorCount=4 \
     -XX:ParallelGCThreads=4 \
     -jar app.jar

5. Go: GOMAXPROCS

// automaxprocs sets GOMAXPROCS from the container's CFS quota at startup
package main

import _ "go.uber.org/automaxprocs"

func main() {
    // GOMAXPROCS now reflects the CPU limit,
    // not the node's CPU count
}
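
To make the idea concrete, here is a minimal Go sketch of deriving a thread count from the quota by hand. It is not automaxprocs' actual implementation; it assumes a cgroup v2 node, where cpu.max holds "<quota> <period>" in microseconds or "max <period>" when no limit is set, and it rounds down with a floor of 1. procsFromCPUMax is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax turns the content of a cgroup v2 cpu.max file into a
// GOMAXPROCS value. It returns fallback when there is no limit or the
// content cannot be parsed.
func procsFromCPUMax(cpuMax string, fallback int) int {
	parts := strings.Fields(cpuMax)
	if len(parts) != 2 || parts[0] == "max" {
		return fallback // unlimited or unexpected format
	}
	quota, err1 := strconv.ParseFloat(parts[0], 64)
	period, err2 := strconv.ParseFloat(parts[1], 64)
	if err1 != nil || err2 != nil || period <= 0 {
		return fallback
	}
	n := int(quota / period) // round down: 1500m becomes 1
	if n < 1 {
		n = 1 // never go below one thread
	}
	return n
}

func main() {
	// Hard-coded sample for a 2000m limit; a real program would read the file.
	n := procsFromCPUMax("200000 100000", runtime.NumCPU())
	runtime.GOMAXPROCS(n)
	fmt.Println("GOMAXPROCS set to", n)
}
```

Fewer runnable threads means the process drains its quota more slowly, which is why matching GOMAXPROCS to the limit tends to reduce throttling.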

Monitoring Dashboard

Grafana Panel: Throttling Overview

# Throttled percentage
100 * (
  sum(rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace"}[5m])) by (pod)
  /
  sum(rate(container_cpu_cfs_periods_total{namespace="$namespace"}[5m])) by (pod)
)

Alert Rule

# prometheus_rules.yml
groups:
- name: cpu_throttling
  rules:
  - alert: HighCPUThrottling
    expr: |
      sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod, namespace)
      /
      sum(rate(container_cpu_cfs_periods_total[5m])) by (pod, namespace)
      > 0.25
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU throttling on {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} is being throttled >25% of the time"
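
Because both counters are measured over the same window, the ratio of the two rates is roughly the ratio of their increases between scrapes. The Go sketch below shows that arithmetic with two hypothetical samples (the type, the function name, and the numbers are invented for illustration; Prometheus itself also extrapolates at the window edges).

```go
package main

import "fmt"

// cfsSample holds the two cumulative counters from one scrape.
type cfsSample struct {
	periods   uint64 // container_cpu_cfs_periods_total
	throttled uint64 // container_cpu_cfs_throttled_periods_total
}

// throttledFraction returns the share of periods between two scrapes in
// which the container was throttled. ok is false when no periods elapsed
// or a counter went backwards (for example after a container restart).
func throttledFraction(prev, cur cfsSample) (frac float64, ok bool) {
	if cur.periods <= prev.periods || cur.throttled < prev.throttled {
		return 0, false
	}
	return float64(cur.throttled-prev.throttled) / float64(cur.periods-prev.periods), true
}

func main() {
	// Five minutes at one period per 100ms is 3000 periods.
	prev := cfsSample{periods: 12000, throttled: 1500}
	cur := cfsSample{periods: 15000, throttled: 2400}

	frac, ok := throttledFraction(prev, cur)
	fmt.Printf("throttled fraction: %.2f, alert condition met: %v\n", frac, ok && frac > 0.25)
}
```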

Benchmark: Limit vs No Limit

Configuration   p50     p90     p99     Throttled %
limit=500m      120ms   450ms   890ms   45%
limit=1000m     65ms    85ms    180ms   12%
limit=2000m     48ms    58ms    72ms    2%
no limit        45ms    52ms    68ms    0%

Gotchas

1. GC Spikes

Java GC (G1) needs CPU burst
- Minor GC: 10-50ms CPU spike
- With tight limit: GC takes longer due to throttling
- → Longer GC pauses → Worse latency

2. JIT Compilation

JVM JIT compiler runs in background
- Needs CPU for compilation
- With throttling: slower compilation
- → Longer warm-up → Worse performance at start

3. Sidecar Containers

# The Istio sidecar has its own CPU limit and is throttled independently!
containers:
- name: app
  resources:
    limits:
      cpu: "1000m"
- name: istio-proxy  # Automatically injected
  resources:
    limits:
      cpu: "100m"   # Might be too little; check the value your mesh injects

Checklist

## CPU Throttling Diagnostics

### Identification
- [ ] Add container_cpu_cfs_throttled_seconds_total metric to dashboard
- [ ] Correlate throttling with p99 latency
- [ ] Check if GC spikes correlate with throttling

### Resolution
- [ ] Increase CPU limit to 2-4× request
- [ ] Or remove limit entirely (with monitoring)
- [ ] Set GOMAXPROCS / ActiveProcessorCount explicitly

### Monitoring
- [ ] Alert on throttled_periods > 25%
- [ ] Dashboard with throttling vs latency correlation
- [ ] Track trends after limit changes

Conclusion

CPU throttling exposes a fundamental tension in Kubernetes resource management. CPU limits exist for good reasons—isolation, fair sharing, cost control. But the mechanism used to enforce them (CFS quota) interacts badly with bursty workloads. The result is latency that doesn’t appear in your CPU metrics but dramatically affects user experience.

The controversial solution—removing CPU limits entirely—works because it trades isolation for performance. Without limits, your bursty workload can use whatever CPU is available on the node. The request guarantee ensures you get scheduled with sufficient resources. The risk is that a misbehaving pod can affect neighbors, which is why this approach requires good monitoring and trust in your workloads.

The key insight is that CPU utilization is the wrong metric for detecting throttling. You need to monitor container_cpu_cfs_throttled_seconds_total directly. Many teams add this metric to their dashboards only after experiencing throttling issues; it should be there from the start.

Key takeaways:

- 40% average CPU doesn’t mean “OK”: bursts within a period are throttled
- Removing CPU limits can paradoxically improve stability for bursty workloads
- Monitor container_cpu_cfs_throttled_seconds_total: this is the metric that matters
- Go needs GOMAXPROCS set from the CPU limit (for example via automaxprocs); the JVM detects the limit but may still need explicit thread counts
- Set limits 2-4× requests if you must have limits on bursty workloads

For web servers, APIs, and any workload that idles between requests then spikes during processing, tight CPU limits are an anti-pattern. Consider this before defaulting to limit=request.

