Kubernetes CPU Throttling Autopsy: Why p99 Latency Explodes at 40% CPU Usage
CPU limits felt safe until throttling showed up in the latency charts.
“CPU usage is 40%, but p99 latency jumped from 50ms to 800ms.” We stared at the dashboard for hours. The service was clearly struggling (tail latency was terrible, users were complaining), but every metric said we had plenty of headroom. CPU was at 40%. Memory was fine. There was no contention, no obvious bottleneck. Yet our p99 was 16 times worse than normal.
The answer turned out to be invisible in standard Kubernetes dashboards: CPU throttling. The “40% CPU” was an average over time, but our workload was bursty. When a request arrived, the handler would spike to 100% CPU for 80ms, then idle. With several handler threads bursting at once, the container used up its CFS (Linux Completely Fair Scheduler) quota partway through a period and was throttled, so in-flight and newly arriving requests had to wait for the next scheduling period. The throttling added latency that didn’t appear in CPU metrics because the metric showed average utilization, not burst behavior.
This is one of the most counterintuitive aspects of Kubernetes resource management. CPU limits are enforced via a quota mechanism that operates on 100ms periods. If your workload uses its entire quota in a burst, it’s throttled for the rest of the period—even if the “average” CPU usage is low. Bursty workloads like web servers and APIs are especially vulnerable.
Tested on: Kubernetes 1.28, Java 21, Go 1.22, Prometheus + Grafana
How CFS Throttling Works
CPU Limits in Kubernetes
resources:
  requests:
    cpu: "500m"    # Guaranteed 0.5 CPU
  limits:
    cpu: "1000m"   # Maximum 1 CPU
CFS Quota Mechanism
CFS Period: 100ms (default)
CPU Limit 1000m = 100ms CPU time per 100ms period
Timeline:
|----100ms period----|----100ms period----|
|▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓| |
↑ 80ms CPU burst ↑ 20ms waiting ↑ Next request waits!
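The conversion from a CPU limit to a per-period quota can be sketched in a few lines of Go. This is an illustration, not the kubelet's code: the function name `quotaUs` and both constants are hypothetical, and the 1ms floor for very small limits is an assumption.

```go
package main

import "fmt"

// cfsPeriodUs is the default CFS period (100ms) in microseconds.
const cfsPeriodUs = 100_000

// minQuotaUs is an assumed floor (1ms) for very small limits.
const minQuotaUs = 1_000

// quotaUs converts a CPU limit in millicores into the CPU time, in
// microseconds, that the container may consume per CFS period.
func quotaUs(limitMillicores int64) int64 {
	q := limitMillicores * cfsPeriodUs / 1000
	if q < minQuotaUs {
		q = minQuotaUs
	}
	return q
}

func main() {
	for _, m := range []int64{500, 1000, 2000} {
		fmt.Printf("limit=%dm -> %dus of CPU time per %dus period\n",
			m, quotaUs(m), int64(cfsPeriodUs))
	}
}
```

Note that the quota is shared by every thread in the container: a 1000m limit spread over four busy threads is gone after 25ms of wall-clock time.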
Problem: Bursts
Scenario: Web request handler
- Most time: idle (0% CPU)
- On request: 100% CPU for 80ms
- CFS quota: 100ms per 100ms period
Request 1: 80ms CPU → OK (20ms quota remaining)
Request 2 (same period, on another thread): 80ms CPU → THROTTLED! (60ms of quota missing)
→ Request 2 stalls until the next period starts, then runs its remaining 60ms
Result: Request 2's latency grows by however long it sits throttled
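To make the arithmetic concrete, here is a small simulation of the quota accounting, written as a sketch under simplifying assumptions: all requests start at t=0, each runs on its own core, time advances in 1ms ticks, and the container shares one quota that refills at every period boundary. The function `simulate` is hypothetical and does not model the real scheduler.

```go
package main

import "fmt"

// simulate returns the wall-clock completion time (in ms) of each request.
// cpuNeedMs[i] is the CPU time request i needs; quotaMs is the CPU time the
// container may use per period of periodMs.
func simulate(cpuNeedMs []int, quotaMs, periodMs int) []int {
	done := make([]int, len(cpuNeedMs))
	if quotaMs <= 0 || periodMs <= 0 {
		return done // invalid input: nothing is simulated
	}
	remaining := append([]int(nil), cpuNeedMs...)
	left := 0
	for _, need := range remaining {
		if need > 0 {
			left++
		}
	}
	quota := 0
	for t := 0; left > 0; t++ {
		if t%periodMs == 0 {
			quota = quotaMs // quota refills at each period boundary
		}
		for i := range remaining {
			if remaining[i] <= 0 || quota == 0 {
				continue // finished, or throttled until the next refill
			}
			remaining[i]--
			quota--
			if remaining[i] == 0 {
				done[i] = t + 1
				left--
			}
		}
	}
	return done
}

func main() {
	fmt.Println(simulate([]int{80, 80}, 100, 100)) // limit=1000m: [130 130]
	fmt.Println(simulate([]int{80, 80}, 200, 100)) // limit=2000m: [80 80]
}
```

Under these assumptions, two concurrent 80ms requests exhaust a 100ms quota after 50ms of wall-clock time, sit throttled for the remaining 50ms of the period, and finish at 130ms instead of 80ms. The exact penalty in a real system depends on when in the period the quota runs out.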
Diagnostics: CFS Metrics
Prometheus Query
# Throttled seconds per second of wall-clock time (5m average)
sum(rate(container_cpu_cfs_throttled_seconds_total{
  namespace="production",
  pod=~"api-.*"
}[5m])) by (pod)
Correlation with Latency
# Throttling vs P99 latency
# Panel 1: Throttled periods
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)
# Panel 2: Request latency P99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
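When Prometheus is not at hand, the same counters can be read from inside the container. The sketch below assumes cgroup v2, where `cpu.stat` exposes `nr_periods`, `nr_throttled`, and `throttled_usec`; the path `/sys/fs/cgroup/cpu.stat` and the helper names `parseCPUStat` and `throttledRatio` are assumptions for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// cpuStat holds the throttling counters from a cgroup v2 cpu.stat file.
type cpuStat struct {
	Periods       uint64 // nr_periods: enforcement periods elapsed
	Throttled     uint64 // nr_throttled: periods in which the group was throttled
	ThrottledUsec uint64 // throttled_usec: total time spent throttled
}

// parseCPUStat extracts the throttling counters and ignores other keys.
func parseCPUStat(content string) (cpuStat, error) {
	var s cpuStat
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) != 2 {
			continue
		}
		var target *uint64
		switch fields[0] {
		case "nr_periods":
			target = &s.Periods
		case "nr_throttled":
			target = &s.Throttled
		case "throttled_usec":
			target = &s.ThrottledUsec
		default:
			continue
		}
		v, err := strconv.ParseUint(fields[1], 10, 64)
		if err != nil {
			return s, fmt.Errorf("parse %q: %w", sc.Text(), err)
		}
		*target = v
	}
	return s, sc.Err()
}

// throttledRatio returns the fraction of periods throttled between two snapshots.
func throttledRatio(before, after cpuStat) float64 {
	if after.Periods <= before.Periods || after.Throttled < before.Throttled {
		return 0
	}
	return float64(after.Throttled-before.Throttled) / float64(after.Periods-before.Periods)
}

func main() {
	// Assumed path: cgroup v2 as seen from inside the container.
	data, err := os.ReadFile("/sys/fs/cgroup/cpu.stat")
	if err != nil {
		fmt.Println("cannot read cpu.stat:", err)
		return
	}
	s, err := parseCPUStat(string(data))
	if err != nil {
		fmt.Println("cannot parse cpu.stat:", err)
		return
	}
	fmt.Printf("throttled in %.1f%% of periods since start\n", 100*throttledRatio(cpuStat{}, s))
}
```

Taking two snapshots a few seconds apart and passing both to `throttledRatio` gives roughly what the ratio of the two `rate()` expressions above reports.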
Reproducible Lab
Java Service
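// ThrottlingDemo.java
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ThrottlingDemo {

    // Burns CPU in a short burst to provoke CFS throttling under a tight limit
    @GetMapping("/cpu-burst")
    public String cpuBurst() {
        // Simulate CPU-intensive work
        long start = System.nanoTime();
        double result = 0;
        for (int i = 0; i < 10_000_000; i++) {
            result += Math.sin(i) * Math.cos(i);
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        // Returning the result keeps the JIT from eliminating the loop
        return "Computed in " + elapsed + "ms, result: " + result;
    }
}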
Kubernetes Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: throttling-demo
spec:
  replicas: 1
  selector:                 # Required by apps/v1; must match the pod labels
    matchLabels:
      app: throttling-demo
  template:
    metadata:
      labels:
        app: throttling-demo
    spec:
      containers:
        - name: app
          image: throttling-demo:latest
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "500m"   # Tight limit!
              memory: "512Mi"
          ports:
            - containerPort: 8080
Load Test
# Send 1000 requests, 10 at a time
hey -n 1000 -c 10 http://throttling-demo:8080/cpu-burst
Results
# With CPU limit 500m:
Summary:
Requests/sec: 12.3
Latency distribution:
50%: 120ms
90%: 450ms
99%: 890ms ← Throttling!
# Without CPU limit (only request):
Summary:
Requests/sec: 45.2
Latency distribution:
50%: 45ms
90%: 52ms
99%: 68ms ← 13× better!
Solutions
1. Remove CPU Limits (Controversial!)
resources:
  requests:
    cpu: "500m"    # Keep request for scheduling
  # limits:        # NO CPU limit!
  #   cpu: "1000m"
Why it works:
- No limit = no CFS quota
- Bursts can use free CPU on node
- Request guarantees minimum CPU
Risks:
- “Noisy neighbor” - one pod can consume all CPU
- Less predictable behavior
- Need good monitoring
2. Set Higher Limit
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"   # 4× request = room for bursts
Rule of thumb: limit = 2-4 × request for bursty workloads.
3. Burstable vs Guaranteed QoS
# Guaranteed QoS (request == limit for both CPU and memory, in every container)
resources:
  requests:
    cpu: "1000m"
  limits:
    cpu: "1000m"   # Same = Guaranteed (if memory matches too)
# Burstable QoS (request < limit)
resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "2000m"   # Higher = Burstable
4. Java: GOMAXPROCS Equivalent
# The JVM detects the container's CPU limit automatically,
# but may then set too few GC threads.
# In the Dockerfile or deployment, set the CPU count for the JVM explicitly:
java -XX:ActiveProcessorCount=4 \
     -XX:ParallelGCThreads=4 \
     -jar app.jar
5. Go: GOMAXPROCS
package main

// automaxprocs sets GOMAXPROCS from the CFS quota at startup
import _ "go.uber.org/automaxprocs"

func main() {
	// GOMAXPROCS is now set based on the CPU limit,
	// not on the node's CPU count
}
Monitoring Dashboard
Grafana Panel: Throttling Overview
# Throttled percentage
100 * (
  sum(rate(container_cpu_cfs_throttled_periods_total{namespace="$namespace"}[5m])) by (pod)
  /
  sum(rate(container_cpu_cfs_periods_total{namespace="$namespace"}[5m])) by (pod)
)
Alert Rule
# prometheus_rules.yml
groups:
  - name: cpu_throttling
    rules:
      - alert: HighCPUThrottling
        expr: |
          sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod, namespace)
          /
          sum(rate(container_cpu_cfs_periods_total[5m])) by (pod, namespace)
          > 0.25
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU throttling on {{ $labels.pod }}"
          description: "Pod {{ $labels.pod }} was throttled in more than 25% of CFS periods"
Benchmark: Limit vs No Limit
| Configuration | p50 | p90 | p99 | Throttled % |
|---|---|---|---|---|
| limit=500m | 120ms | 450ms | 890ms | 45% |
| limit=1000m | 65ms | 85ms | 180ms | 12% |
| limit=2000m | 48ms | 58ms | 72ms | 2% |
| no limit | 45ms | 52ms | 68ms | 0% |
Gotchas
1. GC Spikes
Java GC (G1) needs CPU burst
- Minor GC: 10-50ms CPU spike
- With tight limit: GC takes longer due to throttling
- → Longer GC pauses → Worse latency
2. JIT Compilation
JVM JIT compiler runs in background
- Needs CPU for compilation
- With throttling: slower compilation
- → Longer warm-up → Worse performance at start
3. Sidecar Containers
# The Istio sidecar has its own CPU limit and can be throttled too!
containers:
  - name: app
    resources:
      limits:
        cpu: "1000m"
  - name: istio-proxy   # Automatically added
    resources:
      limits:
        cpu: "100m"     # A limit this low might be too little!
Checklist
## CPU Throttling Diagnostics
### Identification
- [ ] Add container_cpu_cfs_throttled_seconds_total metric to dashboard
- [ ] Correlate throttling with p99 latency
- [ ] Check if GC spikes correlate with throttling
### Resolution
- [ ] Increase CPU limit to 2-4× request
- [ ] Or remove limit entirely (with monitoring)
- [ ] Set GOMAXPROCS / ActiveProcessorCount explicitly
### Monitoring
- [ ] Alert on throttled_periods > 25%
- [ ] Dashboard with throttling vs latency correlation
- [ ] Track trends after limit changes
Conclusion
CPU throttling exposes a fundamental tension in Kubernetes resource management. CPU limits exist for good reasons—isolation, fair sharing, cost control. But the mechanism used to enforce them (CFS quota) interacts badly with bursty workloads. The result is latency that doesn’t appear in your CPU metrics but dramatically affects user experience.
The controversial solution—removing CPU limits entirely—works because it trades isolation for performance. Without limits, your bursty workload can use whatever CPU is available on the node. The request guarantee ensures you get scheduled with sufficient resources. The risk is that a misbehaving pod can affect neighbors, which is why this approach requires good monitoring and trust in your workloads.
The key insight is that CPU utilization is the wrong metric for detecting throttling. You need to monitor container_cpu_cfs_throttled_seconds_total directly. Many teams add this metric to their dashboards only after experiencing throttling issues; it should be there from the start.
Key takeaways:
- 40% average CPU doesn’t mean “OK” - bursts within a period are throttled
- Removing CPU limits can paradoxically improve stability for bursty workloads
- Monitor cfs_throttled_seconds_total - this is the actual metric that matters
- Java/Go need explicit CPU count configuration to avoid scheduler issues
- Set limits 2-4× requests if you must have limits on bursty workloads
For web servers, APIs, and any workload that idles between requests then spikes during processing, tight CPU limits are an anti-pattern. Consider this before defaulting to limit=request.