eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck
When flame graphs said ‘CPU is fine’, eBPF told a very different story.
“CPU is at 30% but p99 latency is terrible.” We had a service that should have been fast: simple HTTP requests, efficient code, and plenty of CPU headroom according to monitoring. But p99 latency was 200ms when it should have been 20ms. Nothing in the flame graphs explained where the time was going.
The problem was that traditional CPU profilers only tell you what your code does while it’s running. They sample the call stack periodically and show you where CPU time is spent. But they’re blind to the time your code spends waiting to run—queued up in the Linux scheduler, waiting for a CPU to become available.
This is off-CPU time, and in containerized environments it’s often the majority of request latency. Your threads are ready to run, but they’re stuck in the run-queue behind other threads, or throttled by CFS bandwidth limiting, or bouncing between NUMA nodes. None of this shows up in a normal flame graph. It requires a different kind of analysis.
The tools for this are eBPF-based: runqlat, offcputime, runqslower. They hook into the Linux scheduler and measure the time between a thread becoming runnable and actually getting CPU time. When I ran runqlat on our service, the problem was immediately obvious: 45% of thread lifetime was spent waiting in the run-queue, not executing.
Environment: Linux with kernel 5.x+, high-contention multi-threaded applications, overcommitted containers
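The runnable-to-running gap described above can also be read without eBPF. On kernels with scheduler statistics enabled, /proc/<pid>/task/<tid>/schedstat reports three numbers per thread: nanoseconds spent on a CPU, nanoseconds spent waiting on a run queue, and the number of timeslices run. The Go sketch below is an illustration, not part of the original investigation. It parses two snapshots of that line and computes the share of on-CPU plus run-queue time that was spent queueing. The names SchedStat, parseSchedStat, and waitShare, and the sample values, are made up for the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SchedStat mirrors the three space-separated fields of a schedstat line.
type SchedStat struct {
	OnCPUNs    uint64 // nanoseconds spent executing on a CPU
	RunQueueNs uint64 // nanoseconds spent runnable but waiting on a run queue
	Timeslices uint64 // number of timeslices run
}

// parseSchedStat parses one line such as "1300000000 650000000 900".
func parseSchedStat(line string) (SchedStat, error) {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return SchedStat{}, fmt.Errorf("schedstat: want 3 fields, got %d", len(fields))
	}
	var vals [3]uint64
	for i, f := range fields {
		n, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return SchedStat{}, fmt.Errorf("schedstat field %d: %w", i, err)
		}
		vals[i] = n
	}
	return SchedStat{OnCPUNs: vals[0], RunQueueNs: vals[1], Timeslices: vals[2]}, nil
}

// waitShare returns run-queue wait as a fraction of on-CPU plus run-queue
// time accumulated between two snapshots of the same thread.
func waitShare(prev, cur SchedStat) float64 {
	if cur.OnCPUNs < prev.OnCPUNs || cur.RunQueueNs < prev.RunQueueNs {
		return 0 // counters went backwards: snapshots are from different threads
	}
	run := cur.OnCPUNs - prev.OnCPUNs
	wait := cur.RunQueueNs - prev.RunQueueNs
	if run+wait == 0 {
		return 0
	}
	return float64(wait) / float64(run+wait)
}

func main() {
	// Hypothetical snapshots taken a few seconds apart.
	prev, err := parseSchedStat("1000000000 200000000 500")
	if err != nil {
		panic(err)
	}
	cur, err := parseSchedStat("1300000000 650000000 900")
	if err != nil {
		panic(err)
	}
	fmt.Printf("run-queue share: %.0f%%\n", waitShare(prev, cur)*100)
}
```

Time spent blocked on I/O or locks is in neither counter, so this ratio only compares running against queueing. It is a quick first check, not a replacement for the eBPF tools.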
The Problem
The Invisible Wait Time
Traditional profiling shows:
┌─────────────────────────────────────────────────────────────┐
│ Function            │ CPU %  │ Samples                      │
├─────────────────────┼────────┼──────────────────────────────┤
│ processRequest()    │ 25%    │ 2500                         │
│ parseJSON()         │ 15%    │ 1500                         │
│ dbQuery()           │ 10%    │ 1000                         │
│ Total               │ 50%    │ 5000                         │
└─────────────────────┴────────┴──────────────────────────────┘
"CPU is only 50% utilized, plenty of headroom!"
Reality with off-CPU analysis:
┌─────────────────────────────────────────────────────────────┐
│ State               │ Time %  │ Where                       │
├─────────────────────┼─────────┼─────────────────────────────┤
│ ON-CPU (executing)  │ 30%     │ Your code running           │
│ OFF-CPU (run-queue) │ 45%     │ Waiting for CPU!            │
│ OFF-CPU (I/O wait)  │ 15%     │ Disk/network                │
│ OFF-CPU (locks)     │ 10%     │ Mutex contention            │
└─────────────────────┴─────────┴─────────────────────────────┘
"45% of time spent just waiting for a CPU!"
When Run-Queue Latency Spikes
Scenarios that cause run-queue wait:
1. Container CPU limits (most common):
┌─────────────────────────────────────────────┐
│ Container: 2 CPU limit, 8 threads           │
│ 8 busy threads share 2 CPUs of quota        │
│ CFS throttling adds more queue time         │
└─────────────────────────────────────────────┘
2. CPU overcommitment:
┌─────────────────────────────────────────────┐
│ 64 vCPUs on 8 physical cores                │
│ Context switches between all VMs            │
│ Each switch = run-queue time                │
└─────────────────────────────────────────────┘
3. NUMA misplacement:
┌─────────────────────────────────────────────┐
│ Thread scheduled on remote NUMA node        │
│ Migration bouncing between nodes            │
│ Each migration = re-queue                   │
└─────────────────────────────────────────────┘
Root Cause
CFS Scheduler Behavior
// Linux CFS (Completely Fair Scheduler) maintains per-CPU run queues
// When a thread wakes up, it goes to a run queue
struct cfs_rq {
    struct rb_root_cached tasks_timeline; // Red-black tree of runnable tasks
    u64 min_vruntime;                     // Minimum virtual runtime
    // ...
};
// Run-queue latency = time from becoming runnable to actually running
// This is INVISIBLE to perf/flamegraphs unless you specifically measure it
// Key insight: A thread with 1ms of actual CPU work
// might have 10ms of run-queue latency in overloaded systems
Container CFS Bandwidth Throttling
# Containers use CFS bandwidth control
# cpu.cfs_quota_us / cpu.cfs_period_us = CPU limit
# Example: 2 CPU limit
/sys/fs/cgroup/cpu/container/cpu.cfs_quota_us: 200000 # 200ms
/sys/fs/cgroup/cpu/container/cpu.cfs_period_us: 100000 # per 100ms
# If your 8 threads try to use 8 CPUs worth of time in 100ms:
# They get 200ms of CPU time, then THROTTLED for rest of period
# Throttled threads sit in run-queue, adding latency
# Check throttling:
cat /sys/fs/cgroup/cpu/container/cpu.stat
# nr_throttled 15234 <- Times throttled
# throttled_time 892347123 <- Nanoseconds spent throttled
# (cgroup v2 reports throttled_usec, in microseconds, instead of throttled_time)
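To make the quota arithmetic above concrete, here is a minimal Go sketch of a single CFS period. It assumes every thread is fully CPU-bound and that there are enough cores for all of them to run at once. The function name throttleWindow is made up for the example. The real kernel hands quota out in small slices to per-CPU queues, so treat the numbers as an approximation.

```go
package main

import "fmt"

// throttleWindow models one CFS bandwidth period for `threads` CPU-bound
// threads that run in parallel until the quota is spent. It returns how long
// they run and how long they are stalled, in microseconds of wall time.
func throttleWindow(quotaUs, periodUs int64, threads int) (runUs, stallUs int64) {
	if periodUs <= 0 || threads <= 0 {
		return 0, 0
	}
	if quotaUs <= 0 {
		return periodUs, 0 // a quota of -1 means "no limit" in cgroup v1
	}
	runUs = quotaUs / int64(threads)
	if runUs >= periodUs {
		return periodUs, 0 // quota outlasts the period: never throttled
	}
	return runUs, periodUs - runUs
}

func main() {
	// The 2-CPU limit from the example: 200ms of quota per 100ms period.
	for _, n := range []int{2, 4, 8} {
		run, stall := throttleWindow(200000, 100000, n)
		fmt.Printf("%d threads: run %dms, throttled %dms per period\n", n, run/1000, stall/1000)
	}
}
```

Under these assumptions, 8 busy threads burn the 200ms quota in 25ms of wall time and then sit throttled for the remaining 75ms of every period, which is where the tail latency comes from.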
Diagnosis
Step 1: Check Run-Queue Length
# Simple check: runnable processes per CPU
sar -q 1 5
# runq-sz = average run queue length
# If runq-sz > num_cpus consistently, you have queuing
# More detail with vmstat
vmstat 1
# r column = runnable processes (running or waiting for run time)
# Should be <= number of CPUs
# Per-CPU view
mpstat -P ALL 1
# Check %idle per CPU - if a few CPUs show low %idle while the average suggests headroom, work is queuing on them
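A similar instantaneous count is available programmatically: /proc/stat contains a procs_running line with the number of tasks currently in the runnable state. The Go sketch below parses that line and compares it with a CPU count. The helper names runnableTasks and queuedTasks and the sample text are assumptions for illustration. Inside a container, /proc/stat usually reflects the whole host.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// runnableTasks extracts the procs_running counter from the text of /proc/stat.
func runnableTasks(procStat string) (int, error) {
	for _, line := range strings.Split(procStat, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 2 && fields[0] == "procs_running" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, fmt.Errorf("procs_running not found")
}

// queuedTasks estimates how many runnable tasks cannot be on a CPU right now.
func queuedTasks(runnable, cpus int) int {
	if runnable > cpus {
		return runnable - cpus
	}
	return 0
}

func main() {
	// Trimmed, made-up sample. On Linux, pass the contents of /proc/stat
	// (for example from os.ReadFile) instead.
	sample := "cpu  4705 150 1120 16250 520 0 45 0 0 0\nprocs_running 11\nprocs_blocked 1\n"
	runnable, err := runnableTasks(sample)
	if err != nil {
		panic(err)
	}
	const cpus = 8 // hypothetical core count; runtime.NumCPU() on a real host
	fmt.Printf("runnable=%d cpus=%d queued=%d\n", runnable, cpus, queuedTasks(runnable, cpus))
}
```

A single reading is noisy, so sample it repeatedly, as sar and vmstat do, before drawing conclusions.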
Step 2: eBPF Run-Queue Latency Histogram
# Using bcc-tools: runqlat
sudo runqlat -m 10
# Prints a histogram of run-queue latency in milliseconds every 10 seconds (the interval argument)
# Healthy: most samples < 1ms
# Problem: significant samples > 10ms
# Output example (problematic):
msecs      : count    distribution
 0 -> 1    : 1523     |**************        |
 2 -> 3    : 892      |********              |
 4 -> 7    : 2341     |**********************| <- Too many!
 8 -> 15   : 1876     |******************    | <- Way too many!
16 -> 31   : 543      |*****                 |
32 -> 63   : 121      |*                     |
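Eyeballing the bars works, but a single number is easier to track over time. The following Go sketch parses histogram rows in the format shown above and reports the fraction of samples at or above a threshold. It is illustrative only: parseHistogram and shareAtOrAbove are made-up helper names, and the parser assumes the row layout "lo -> hi : count |bar|".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// bucket is one histogram row: samples whose latency fell in [lo, hi] ms.
type bucket struct{ lo, hi, count int }

// parseHistogram extracts rows shaped like "4 -> 7 : 2341 |*****|" and
// skips the header and anything else that does not match.
func parseHistogram(out string) []bucket {
	var buckets []bucket
	for _, line := range strings.Split(out, "\n") {
		f := strings.Fields(line)
		if len(f) < 5 || f[1] != "->" || f[3] != ":" {
			continue
		}
		lo, err1 := strconv.Atoi(f[0])
		hi, err2 := strconv.Atoi(f[2])
		count, err3 := strconv.Atoi(f[4])
		if err1 != nil || err2 != nil || err3 != nil {
			continue
		}
		buckets = append(buckets, bucket{lo, hi, count})
	}
	return buckets
}

// shareAtOrAbove returns the fraction of samples in buckets whose lower
// bound is at least thresholdMs.
func shareAtOrAbove(buckets []bucket, thresholdMs int) float64 {
	total, slow := 0, 0
	for _, b := range buckets {
		total += b.count
		if b.lo >= thresholdMs {
			slow += b.count
		}
	}
	if total == 0 {
		return 0
	}
	return float64(slow) / float64(total)
}

func main() {
	// The problematic histogram from the example output.
	out := `msecs : count distribution
0 -> 1 : 1523 |**************        |
2 -> 3 : 892 |********              |
4 -> 7 : 2341 |**********************|
8 -> 15 : 1876 |******************    |
16 -> 31 : 543 |*****                 |
32 -> 63 : 121 |*                     |`
	share := shareAtOrAbove(parseHistogram(out), 8)
	fmt.Printf("%.1f%% of samples waited 8ms or longer\n", share*100)
}
```

For the example histogram this comes to 2540 of 7296 samples, roughly 35%, waiting 8ms or more for a CPU.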
Step 3: Per-Process Run-Queue Time
# runqlat with process filter (-p takes a single PID; pgrep -o picks the oldest match)
sudo runqlat -p $(pgrep -o java) -m 10
# Or use runqslower for individual events
sudo runqslower 10000
# Shows every instance where run-queue wait > 10ms
# Output: TIME COMM PID LAT(us)
# 12:34:56 java 1234 15234
Step 4: Off-CPU Flame Graph
# Capture off-CPU stacks with bcc
sudo offcputime -df -p $(pgrep -o myapp) 30 > out.stacks
# Generate flame graph
./flamegraph.pl --color=io --title="Off-CPU Time" out.stacks > offcpu.svg
# Look for:
# - schedule() with no blocking call (futex, I/O, sleep) in the same stack = preempted, i.e. run-queue wait
# - futex_wait = lock contention
# - io_schedule = I/O wait
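Before opening the SVG, a rough per-category total can be pulled straight from the folded stacks. The Go sketch below applies the frame-name heuristics listed above to lines in the folded format (semicolon-separated frames followed by a microsecond total). The category names, the helper names classify and summarize, and the sample stacks are all made up for illustration. Substring matching on frame names is a coarse heuristic, not a precise attribution.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// classify maps one folded stack to a wait category. io_schedule and
// futex_wait are tested first because those stacks also contain schedule.
func classify(stack string) string {
	switch {
	case strings.Contains(stack, "io_schedule"):
		return "io"
	case strings.Contains(stack, "futex_wait"):
		return "lock"
	case strings.Contains(stack, "schedule"):
		return "sched-other" // reached schedule() without a futex or I/O frame
	default:
		return "unknown"
	}
}

// summarize adds up the trailing microsecond value of every folded line
// ("frame;frame;frame 1234") per category. Malformed lines are skipped.
func summarize(folded string) map[string]int64 {
	totals := make(map[string]int64)
	for _, line := range strings.Split(folded, "\n") {
		line = strings.TrimSpace(line)
		i := strings.LastIndexByte(line, ' ')
		if i < 0 {
			continue
		}
		us, err := strconv.ParseInt(line[i+1:], 10, 64)
		if err != nil {
			continue
		}
		totals[classify(line[:i])] += us
	}
	return totals
}

func main() {
	// Made-up folded stacks; real input would be the contents of out.stacks.
	sample := "myapp;main;readFile;vfs_read;io_schedule;schedule 300\n" +
		"myapp;main;lockMutex;futex_wait_queue;schedule 200\n" +
		"myapp;main;handle;-;exit_to_user_mode_loop;schedule 500\n"
	totals := summarize(sample)
	for _, category := range []string{"sched-other", "lock", "io", "unknown"} {
		fmt.Printf("%-12s %d us\n", category, totals[category])
	}
}
```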
The Fix
Option 1: Right-Size Container CPU Limits
# BEFORE: Limit that causes constant throttling
resources:
  limits:
    cpu: "2" # 2 CPUs
  requests:
    cpu: "2"
# With 8 busy worker threads, the quota covers only 2 CPUs' worth of work per period
# AFTER: Match limit to actual parallelism
resources:
  limits:
    cpu: "4" # Allow more parallel execution
  requests:
    cpu: "2" # Still request 2 for scheduling
# Or reduce parallelism to match limit:
# Set thread pool size = CPU limit
WORKER_THREADS=2
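Rather than hard-coding the thread count, the pool size can be derived from the same quota and period values the kernel enforces. The Go sketch below is one way to do it, assuming the two numbers have already been read from cpu.cfs_quota_us and cpu.cfs_period_us. The name workersForQuota is hypothetical. It rounds down so the pool never asks for more CPU than the quota grants.

```go
package main

import (
	"fmt"
	"runtime"
)

// workersForQuota converts a CFS quota/period pair (microseconds) into a
// worker count. A non-positive quota means "no limit", so the fallback
// (typically the host CPU count) is returned instead.
func workersForQuota(quotaUs, periodUs int64, fallback int) int {
	if quotaUs <= 0 || periodUs <= 0 {
		return fallback
	}
	n := int(quotaUs / periodUs) // round down to stay within the quota
	if n < 1 {
		n = 1 // fractional limits such as 0.5 CPU still need one worker
	}
	return n
}

func main() {
	// Values as they would be read from the cgroup files for a 2-CPU limit.
	workers := workersForQuota(200000, 100000, runtime.NumCPU())
	runtime.GOMAXPROCS(workers) // cap the threads that run Go code in parallel
	fmt.Println("worker threads:", workers)
}
```

Rounding down is a deliberate choice here: a limit of 2.5 CPUs yields 2 workers, which leaves some quota unused but avoids the throttling this article is about.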
Option 2: Tune CFS Parameters
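# Increase CFS period for less frequent throttling
# (trades latency variance for throughput)
# Per-container (Kubernetes doesn't expose this directly)
# Scale the quota with the period, or the limit drops from 2 CPUs to 0.2
echo 1000000 > /sys/fs/cgroup/cpu/container/cpu.cfs_period_us
echo 2000000 > /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us
# System-wide tuning
# (kernels 5.13+ expose these under /sys/kernel/debug/sched/ instead of sysctl)
sysctl kernel.sched_min_granularity_ns=3000000
sysctl kernel.sched_wakeup_granularity_ns=4000000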
Option 3: CPU Pinning for Latency-Critical Threads
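// Pin critical threads to specific CPUs
import (
	"log"
	"runtime"
	"golang.org/x/sys/unix"
)
// pinToCPU binds the calling OS thread to cpuID. The goroutine is locked to
// its thread first; otherwise the Go scheduler may move it to an unpinned thread.
func pinToCPU(cpuID int) error {
	runtime.LockOSThread()
	var mask unix.CPUSet
	mask.Set(cpuID)
	return unix.SchedSetaffinity(0, &mask) // pid 0 = the calling thread
}
// Latency-critical handler pinned to dedicated CPU
go func() {
	if err := pinToCPU(0); err != nil { // CPU 0 reserved for this
		log.Printf("pinToCPU failed, running unpinned: %v", err)
	}
	for req := range criticalRequests {
		handleCritical(req)
	}
}()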
# Kubernetes: Use CPU manager for guaranteed pinning
kubelet config:
  cpuManagerPolicy: static
# Pod spec for guaranteed QoS (enables pinning)
resources:
  limits:
    cpu: "2"
    memory: "4Gi"
  requests:
    cpu: "2"
    memory: "4Gi"
Option 4: NUMA-Aware Scheduling
# Check NUMA topology
numactl --hardware
# Pin process to single NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp
# In Kubernetes, use topology manager
kubelet config:
  topologyManagerPolicy: single-numa-node
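When pinning from inside the application instead of through numactl, the CPUs that belong to a NUMA node can be read from /sys/devices/system/node/node<N>/cpulist, which uses a compact range syntax such as "0-7,16-23". The Go sketch below expands that syntax into individual CPU IDs that could then be placed in an affinity mask. The function name parseCPUList and the sample string are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel CPU list such as "0-3,8,10-11" into the
// individual CPU IDs it names.
func parseCPUList(list string) ([]int, error) {
	var cpus []int
	list = strings.TrimSpace(list)
	if list == "" {
		return cpus, nil
	}
	for _, part := range strings.Split(list, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad CPU %q: %w", part, err)
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad CPU range %q: %w", part, err)
			}
		}
		if end < start {
			return nil, fmt.Errorf("descending CPU range %q", part)
		}
		for cpu := start; cpu <= end; cpu++ {
			cpus = append(cpus, cpu)
		}
	}
	return cpus, nil
}

func main() {
	// Hypothetical contents of /sys/devices/system/node/node0/cpulist.
	cpus, err := parseCPUList("0-3,8,10-11\n")
	if err != nil {
		panic(err)
	}
	fmt.Println("node0 CPUs:", cpus)
}
```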
Monitoring
groups:
  - name: scheduler-latency
    rules:
      - alert: HighRunQueueLatency
        expr: |
          histogram_quantile(0.99,
            rate(scheduler_runq_latency_bucket[5m])
          ) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 run-queue latency > 10ms"
      - alert: CPUThrottlingHigh
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container being CPU throttled > 10% of time"
      - alert: RunQueueOverloaded
        expr: |
          node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "1-minute load average > 2x CPU count"
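The throttling alert above measures throttled seconds per second. A complementary signal is the share of CFS periods in which the container hit its quota at all, which can be computed from the nr_periods and nr_throttled counters in the cgroup's cpu.stat file. The Go sketch below does that for two snapshots. It assumes the cgroup v1 "key value" line format, and the helper names parseCPUStat and throttledRatio and the sample values are made up.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUStat reads the "key value" lines of a cgroup cpu.stat file.
func parseCPUStat(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) != 2 {
			continue
		}
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			stats[fields[0]] = v
		}
	}
	return stats
}

// throttledRatio returns the fraction of enforcement periods, between two
// snapshots, in which the cgroup was throttled.
func throttledRatio(prev, cur map[string]uint64) float64 {
	if cur["nr_periods"] <= prev["nr_periods"] || cur["nr_throttled"] < prev["nr_throttled"] {
		return 0 // no elapsed periods, or the counters were reset
	}
	periods := cur["nr_periods"] - prev["nr_periods"]
	throttled := cur["nr_throttled"] - prev["nr_throttled"]
	return float64(throttled) / float64(periods)
}

func main() {
	// Hypothetical snapshots taken one minute apart.
	prev := parseCPUStat("nr_periods 1000\nnr_throttled 100\nthrottled_time 5000000000\n")
	cur := parseCPUStat("nr_periods 1600\nnr_throttled 250\nthrottled_time 9000000000\n")
	fmt.Printf("throttled in %.0f%% of periods\n", throttledRatio(prev, cur)*100)
}
```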
Checklist
## Run-Queue Latency Investigation
### Symptoms
- [ ] Low CPU utilization but high latency
- [ ] Latency spikes not correlated with traffic
- [ ] Traditional profilers show "nothing wrong"
- [ ] Performance degrades under moderate load
### Diagnosis
- [ ] Check run-queue length with sar/vmstat
- [ ] Use runqlat for latency histogram
- [ ] Generate off-CPU flame graph
- [ ] Check CFS throttling statistics
### Fixes
- [ ] Right-size container CPU limits
- [ ] Reduce thread count to match CPU limit
- [ ] Consider CPU pinning for critical paths
- [ ] Enable NUMA-aware scheduling
- [ ] Monitor CFS throttling metrics
Conclusion
The lesson: CPU profilers only show what happens while your code runs. Off-CPU analysis reveals where your code waits - and run-queue time is often the hidden majority of request latency in containerized environments.
Key tools:
- runqlat - Histogram of run-queue wait times
- offcputime - Off-CPU stack traces
- runqslower - Individual long waits
- CFS throttle metrics - Container-specific queueing
Related Articles
- Java Profiling in Hardened K8s - Profiling in containers
- Go Timer Heap Pressure - Another latency investigation