
Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage


CPU limits felt safe until throttling showed up in the latency charts. “CPU usage is 40%, but p99 latency jumped from 50ms to 800ms.” We stared at the dashboard for hours. The service was clearly struggling—tail latency was terrible, users were complaining—but every metric said we had plenty of headroom. CPU was at 40%. Memory was fine. There was no contention, no obvious bottleneck. Yet our p99 was 16 times worse than normal.

The answer turned out to be invisible in standard Kubernetes dashboards: CPU throttling. The “40% CPU” was an average over time, but our workload was bursty. When a request arrived, the handler would spike to 100% CPU for 80ms, then idle. The CFS quota is shared by every thread in the container, so when several requests (or a request plus a GC cycle) burst at the same moment, the container burned through its quota partway into the period. CFS (the Linux Completely Fair Scheduler) then throttled every thread until the next period began, and in-flight requests simply waited. The throttling added latency that didn’t appear in CPU metrics because the metric showed average utilization, not burst behavior.

This is one of the most counterintuitive aspects of Kubernetes resource management. CPU limits are enforced via a quota mechanism that operates on 100ms periods. If your workload uses its entire quota in a burst, it’s throttled for the rest of the period—even if the “average” CPU usage is low. Bursty workloads like web servers and APIs are especially vulnerable.

Tested on: Kubernetes 1.28, Java 21, Go 1.22, Prometheus + Grafana

How CFS Throttling Works

CPU Limits in Kubernetes

resources:
  requests:
    cpu: "500m"   # Guaranteed 0.5 CPU
  limits:
    cpu: "1000m"  # Maximum 1 CPU

CFS Quota Mechanism

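CFS Period: 100ms (default)
CPU Limit 1000m = 100ms CPU time per 100ms period (shared by all threads)

Timeline (two handler threads, each needing 80ms CPU; ▓ = running, ░ = throttled):
|----100ms period----|----100ms period----|
|▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░|▓▓▓▓▓▓              |
 ↑ quota used up      ↑ quota refills, both
   after 50ms           finish 30ms later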
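
The relationship between a CPU limit and the per-period quota is plain arithmetic. The sketch below is illustrative only: the names `quotaUs` and `defaultPeriodUs` are made up for this example, and it assumes the default 100ms period stated above.

```go
package main

import "fmt"

// defaultPeriodUs is the default CFS period (100ms) in microseconds.
const defaultPeriodUs int64 = 100_000

// quotaUs converts a CPU limit in millicores into the CPU time, in
// microseconds, that the whole container may consume per period.
// (Hypothetical helper for illustration, not a Kubernetes API.)
func quotaUs(limitMillicores, periodUs int64) int64 {
	return limitMillicores * periodUs / 1000
}

func main() {
	for _, limit := range []int64{500, 1000, 2000} {
		fmt.Printf("limit=%dm -> %dus CPU time per %dus period\n",
			limit, quotaUs(limit, defaultPeriodUs), defaultPeriodUs)
	}
}
```

A 500m limit therefore leaves only 50ms of CPU time per period for all threads combined, which is why tight limits bite so quickly on multi-threaded services.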

Problem: Bursts

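Scenario: Web request handler
- Most time: idle (0% CPU)
- On request: 100% CPU for 80ms on one thread
- CFS quota: 100ms per 100ms period, shared by all threads

One request: 80ms CPU → OK (20ms quota remaining)
Two concurrent requests: 160ms CPU needed → THROTTLED! (60ms quota missing)
         → Both run in parallel and exhaust the 100ms quota after 50ms
         → Both wait 50ms for the next period, then run their last 30ms

Result: each request takes 130ms instead of 80ms (+50ms from throttling)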

Diagnostics: CFS Metrics

Prometheus Query

# Throttled seconds per second of wall time (5m average), per pod
sum(rate(container_cpu_cfs_throttled_seconds_total{
    namespace="production",
    pod=~"api-.*"
}[5m])) by (pod)
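
When Prometheus is not available, the same counters can be read from the container's cgroup. The sketch below assumes cgroup v2, where `cpu.stat` exposes `nr_periods`, `nr_throttled`, and `throttled_usec`; the path `/sys/fs/cgroup/cpu.stat` and the type and function names are assumptions for illustration (cgroup v1 uses a different path and reports `throttled_time` in nanoseconds).

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuStat holds the throttling counters from a cgroup v2 cpu.stat file.
type cpuStat struct {
	Periods       uint64 // nr_periods: enforcement periods elapsed
	Throttled     uint64 // nr_throttled: periods in which the group was throttled
	ThrottledUsec uint64 // throttled_usec: total time spent throttled
}

// parseCPUStat extracts the throttling counters from cpu.stat content.
// Unknown keys are ignored.
func parseCPUStat(content string) (cpuStat, error) {
	var st cpuStat
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return st, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		switch fields[0] {
		case "nr_periods":
			st.Periods = v
		case "nr_throttled":
			st.Throttled = v
		case "throttled_usec":
			st.ThrottledUsec = v
		}
	}
	return st, sc.Err()
}

func main() {
	// Assumed cgroup v2 location as seen from inside the container.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		fmt.Println("cpu.stat not readable:", err)
		return
	}
	st, err := parseCPUStat(string(data))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("periods=%d throttled=%d throttled_time=%.3fs\n",
		st.Periods, st.Throttled, float64(st.ThrottledUsec)/1e6)
}
```

Running this inside the pod (for example via `kubectl exec`) gives a quick sanity check that the dashboard numbers match what the kernel reports.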

Correlation with Latency

# Throttling vs P99 latency
# Panel 1: Fraction of CFS periods in which the pod was throttled
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)

# Panel 2: Request latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Reproducible Lab

Java Service

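// ThrottlingDemo.java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController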
public class ThrottlingDemo {

    @GetMapping("/cpu-burst")
    public String cpuBurst() {
        // Simulate CPU-intensive work
        long start = System.nanoTime();
        double result = 0;
        for (int i = 0; i < 10_000_000; i++) {
            result += Math.sin(i) * Math.cos(i);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        return "Computed in " + elapsed + "ms, result: " + result;
    }
}

Kubernetes Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: throttling-demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: throttling-demo
  template:
    metadata:
      labels:
        app: throttling-demo
    spec:
      containers:
      - name: app
        image: throttling-demo:latest
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
          limits:
            cpu: "500m"      # Tight limit!
            memory: "512Mi"
        ports:
        - containerPort: 8080

Load Test

# Send 1000 requests total, 10 at a time
hey -n 1000 -c 10 http://throttling-demo:8080/cpu-burst

Results

# With CPU limit 500m:
Summary:
  Requests/sec: 12.3
  Latency distribution:
    50%: 120ms
    90%: 450ms
    99%: 890ms  ← Throttling!

# Without CPU limit (request only):
Summary:
  Requests/sec: 45.2
  Latency distribution:
    50%: 45ms
    90%: 52ms
    99%: 68ms   ← 13× better!

Solutions

1. Remove CPU Limits (Controversial!)

resources:
  requests:
    cpu: "500m"    # Keep request for scheduling
  # limits:        # NO CPU limit!
  #   cpu: "1000m"

Why it works:

  • No limit = no CFS quota
  • Bursts can use free CPU on node
  • Request guarantees minimum CPU

Risks:

  • “Noisy neighbor” - one pod can consume all CPU
  • Less predictable behavior
  • Need good monitoring

2. Set Higher Limit

resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # 4× request = room for bursts

Rule of thumb: limit = 2-4 × request for bursty workloads.

3. Burstable vs Guaranteed QoS

# Guaranteed QoS (request == limit for CPU and memory, in every container)
resources:
  requests:
    cpu: "1000m"
  limits:
    cpu: "1000m"  # Same as request: no headroom for bursts above 1 CPU

# Burstable QoS (request < limit)
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"  # Higher than request: bursts can use up to 2 CPUs
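
The QoS class follows mechanically from the requests and limits. The following is a simplified sketch of those rules, not the kubelet's implementation: it only looks at CPU and memory, treats zero as "not set", and the `resources` type and `qosClass` function are hypothetical names.

```go
package main

import "fmt"

// resources holds one container's settings; zero means "not set".
type resources struct {
	CPURequest, CPULimit int64 // millicores
	MemRequest, MemLimit int64 // bytes
}

// qosClass classifies a pod from its containers' requests and limits:
// Guaranteed when every container has CPU and memory limits equal to its
// requests, BestEffort when nothing is set, Burstable otherwise.
func qosClass(containers []resources) string {
	anySet := false
	guaranteed := len(containers) > 0
	for _, c := range containers {
		if c.CPURequest != 0 || c.CPULimit != 0 || c.MemRequest != 0 || c.MemLimit != 0 {
			anySet = true
		}
		// A missing request defaults to the limit.
		cpuReq, memReq := c.CPURequest, c.MemRequest
		if cpuReq == 0 {
			cpuReq = c.CPULimit
		}
		if memReq == 0 {
			memReq = c.MemLimit
		}
		if c.CPULimit == 0 || c.MemLimit == 0 ||
			cpuReq != c.CPULimit || memReq != c.MemLimit {
			guaranteed = false
		}
	}
	switch {
	case !anySet:
		return "BestEffort"
	case guaranteed:
		return "Guaranteed"
	default:
		return "Burstable"
	}
}

func main() {
	const mi = 1 << 20
	tight := []resources{{CPURequest: 1000, CPULimit: 1000, MemRequest: 512 * mi, MemLimit: 512 * mi}}
	roomy := []resources{{CPURequest: 500, CPULimit: 2000, MemRequest: 512 * mi, MemLimit: 512 * mi}}
	fmt.Println("request == limit:", qosClass(tight))
	fmt.Println("request <  limit:", qosClass(roomy))
}
```

Note the trade-off this makes visible: dropping or raising the CPU limit to avoid throttling moves the pod from Guaranteed to Burstable.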

4. Java: GOMAXPROCS Equivalent

# The JVM detects the container CPU limit automatically,
# but may size its GC thread pool too small as a result.
# Set the CPU count explicitly in the Dockerfile or deployment.
# Keep it close to the CPU limit: extra threads burn the quota faster.

java -XX:ActiveProcessorCount=4 \
     -XX:ParallelGCThreads=4 \
     -jar app.jar

5. Go: GOMAXPROCS

// automaxprocs sets GOMAXPROCS from the CFS quota at startup
package main

import _ "go.uber.org/automaxprocs"

func main() {
    // GOMAXPROCS automatically set based on CPU limit
    // Not based on node CPU count
}
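
For readers who want to see what deriving GOMAXPROCS from the quota involves, here is a hand-rolled sketch. It is not automaxprocs' actual code: it assumes cgroup v2, where `cpu.max` contains `<quota> <period>` or `max <period>`, rounds down with a floor of 1, and uses a hypothetical function name `procsFromCPUMax`.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// procsFromCPUMax derives a GOMAXPROCS value from cgroup v2 cpu.max content.
// limited is false when no quota is set ("max").
func procsFromCPUMax(content string) (procs int, limited bool, err error) {
	fields := strings.Fields(content)
	if len(fields) != 2 {
		return 0, false, fmt.Errorf("unexpected cpu.max content %q", content)
	}
	if fields[0] == "max" {
		return 0, false, nil // no CPU limit configured
	}
	quota, err := strconv.ParseInt(fields[0], 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parse quota: %w", err)
	}
	period, err := strconv.ParseInt(fields[1], 10, 64)
	if err != nil {
		return 0, false, fmt.Errorf("parse period: %w", err)
	}
	if period <= 0 {
		return 0, false, fmt.Errorf("invalid period %d", period)
	}
	procs = int(quota / period) // round down to whole CPUs
	if procs < 1 {
		procs = 1 // never go below one scheduler thread
	}
	return procs, true, nil
}

func main() {
	// Assumed cgroup v2 location as seen from inside the container.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.max")
	if err != nil {
		fmt.Println("cpu.max not readable; GOMAXPROCS stays at", runtime.GOMAXPROCS(0))
		return
	}
	procs, limited, err := procsFromCPUMax(string(data))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if limited {
		runtime.GOMAXPROCS(procs)
	}
	fmt.Println("GOMAXPROCS =", runtime.GOMAXPROCS(0))
}
```

With a 500m limit this yields 1, so the Go scheduler stops spreading goroutines across every core on the node and exhausts the quota less abruptly.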

Monitoring Dashboard

Grafana Panel: Throttling Overview

# Throttled percentage
100 * (
  sum(rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace"}[5m])) by (pod)
  /
  sum(rate(container_cpu_cfs_periods_total{namespace="$namespace"}[5m])) by (pod)
)
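
The query above boils down to a ratio of counter deltas. As a sketch of that arithmetic outside Prometheus, the snippet below computes the throttled percentage from two snapshots of the period counters; the `sample` type, the `throttledPercent` function, and the numbers in `main` are illustrative assumptions, not measured data.

```go
package main

import "fmt"

// sample is one reading of the cumulative CFS counters.
type sample struct {
	Periods   uint64 // total enforcement periods so far
	Throttled uint64 // total throttled periods so far
}

// throttledPercent returns the share of periods between two samples in which
// the container was throttled. ok is false when no periods elapsed or a
// counter went backwards (for example after a container restart).
func throttledPercent(prev, cur sample) (pct float64, ok bool) {
	if cur.Periods <= prev.Periods || cur.Throttled < prev.Throttled {
		return 0, false
	}
	periods := float64(cur.Periods - prev.Periods)
	throttled := float64(cur.Throttled - prev.Throttled)
	return 100 * throttled / periods, true
}

func main() {
	// Made-up readings taken five minutes apart (3000 periods of 100ms).
	prev := sample{Periods: 1000, Throttled: 100}
	cur := sample{Periods: 4000, Throttled: 1450}

	if pct, ok := throttledPercent(prev, cur); ok {
		fmt.Printf("throttled in %.1f%% of periods\n", pct)
		if pct > 25 {
			fmt.Println("above the 25% warning threshold")
		}
	}
}
```

Using deltas rather than raw totals matters because the counters are cumulative since container start.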

Alert Rule

# prometheus_rules.yml
groups:
- name: cpu_throttling
  rules:
  - alert: HighCPUThrottling
    expr: |
      sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod, namespace)
      /
      sum(rate(container_cpu_cfs_periods_total[5m])) by (pod, namespace)
      > 0.25
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High CPU throttling on {{ $labels.pod }}"
      description: "Pod {{ $labels.pod }} is being throttled >25% of the time"

Benchmark: Limit vs No Limit

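| Configuration | p50 | p90 | p99 | Throttled % |
|---|---|---|---|---|
| limit=500m | 120ms | 450ms | 890ms | 45% |
| limit=1000m | 65ms | 85ms | 180ms | 12% |
| limit=2000m | 48ms | 58ms | 72ms | 2% |
| no limit | 45ms | 52ms | 68ms | 0% |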

Gotchas

1. GC Spikes

Java GC (G1) needs CPU burst
- Minor GC: 10-50ms CPU spike
- With tight limit: GC takes longer due to throttling
- → Longer GC pauses → Worse latency

2. JIT Compilation

JVM JIT compiler runs in background
- Needs CPU for compilation
- With throttling: slower compilation
- → Longer warm-up → Worse performance at start

3. Sidecar Containers

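# The Istio sidecar has its own CPU limit and is throttled independently,
# but it sits on the request path, so its throttling adds latency too.
containers:
- name: app
  resources:
    limits:
      cpu: "1000m"
- name: istio-proxy  # Automatically added
  resources:
    limits:
      cpu: "100m"   # Example of a tight value; check what your mesh injects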

Checklist

## CPU Throttling Diagnostics

### Identification
- [ ] Add container_cpu_cfs_throttled_seconds_total metric to dashboard
- [ ] Correlate throttling with p99 latency
- [ ] Check if GC spikes correlate with throttling

### Resolution
- [ ] Increase CPU limit to 2-4× request
- [ ] Or remove limit entirely (with monitoring)
- [ ] Set GOMAXPROCS / ActiveProcessorCount explicitly

### Monitoring
- [ ] Alert on throttled_periods > 25%
- [ ] Dashboard with throttling vs latency correlation
- [ ] Track trends after limit changes

Conclusion

CPU throttling exposes a fundamental tension in Kubernetes resource management. CPU limits exist for good reasons—isolation, fair sharing, cost control. But the mechanism used to enforce them (CFS quota) interacts badly with bursty workloads. The result is latency that doesn’t appear in your CPU metrics but dramatically affects user experience.

The controversial solution—removing CPU limits entirely—works because it trades isolation for performance. Without limits, your bursty workload can use whatever CPU is available on the node. The request guarantee ensures you get scheduled with sufficient resources. The risk is that a misbehaving pod can affect neighbors, which is why this approach requires good monitoring and trust in your workloads.

The key insight is that CPU utilization is the wrong metric for detecting throttling. You need to monitor container_cpu_cfs_throttled_seconds_total directly. Many teams add this metric to their dashboards only after experiencing throttling issues; it should be there from the start.

Key takeaways:

  1. 40% average CPU doesn’t mean “OK” - bursts within a period are throttled
  2. Removing CPU limits can paradoxically improve stability for bursty workloads
  3. Monitor container_cpu_cfs_throttled_seconds_total - this is the actual metric that matters
  4. Java/Go may need explicit CPU count configuration (ActiveProcessorCount, GOMAXPROCS) so thread counts match the CPU limit
  5. Set limits 2-4× requests if you must have limits on bursty workloads

For web servers, APIs, and any workload that idles between requests then spikes during processing, tight CPU limits are an anti-pattern. Consider this before defaulting to limit=request.



Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage". https://www.michal-drozd.com/en/blog/k8s-cpu-throttling/ (Published October 19, 2025).