eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck

When flame graphs said ‘CPU is fine’, eBPF told a very different story.

“CPU is at 30% but p99 latency is terrible.”

We had a service that should have been fast: simple HTTP requests, efficient code, and plenty of CPU headroom according to monitoring. But p99 latency was 200ms when it should have been 20ms. Nothing in the flame graphs explained where the time was going.

The problem was that traditional CPU profilers only tell you what your code does while it’s running. They sample the call stack periodically and show you where CPU time is spent. But they’re blind to the time your code spends waiting to run—queued up in the Linux scheduler, waiting for a CPU to become available.

This is off-CPU time, and in containerized environments it’s often the majority of request latency. Your threads are ready to run, but they’re stuck in the run-queue behind other threads, or throttled by CFS bandwidth limiting, or bouncing between NUMA nodes. None of this shows up in a normal flame graph. It requires a different kind of analysis.

The tools for this are eBPF-based: runqlat, offcputime, runqslower. They hook into the Linux scheduler and measure the time between a thread becoming runnable and actually getting CPU time. When I ran runqlat on our service, the problem was immediately obvious: 45% of thread lifetime was spent waiting in the run-queue, not executing.

Environment: Linux with kernel 5.x+, high-contention multi-threaded applications, overcommitted containers

The Problem

The Invisible Wait Time

Traditional profiling shows:
┌─────────────────────────────────────────────────────────────┐
│ Function            │ CPU %  │ Samples                      │
├─────────────────────┼────────┼──────────────────────────────┤
│ processRequest()    │ 25%    │ 2500                         │
│ parseJSON()         │ 15%    │ 1500                         │
│ dbQuery()           │ 10%    │ 1000                         │
│ Total               │ 50%    │ 5000                         │
└─────────────────────┴────────┴──────────────────────────────┘
"CPU is only 50% utilized, plenty of headroom!"

Reality with off-CPU analysis:
┌─────────────────────────────────────────────────────────────┐
│ State               │ Time %  │ Where                       │
├─────────────────────┼─────────┼─────────────────────────────┤
│ ON-CPU (executing)  │ 30%     │ Your code running           │
│ OFF-CPU (run-queue) │ 45%     │ Waiting for CPU!            │
│ OFF-CPU (I/O wait)  │ 15%     │ Disk/network                │
│ OFF-CPU (locks)     │ 10%     │ Mutex contention            │
└─────────────────────┴─────────┴─────────────────────────────┘
"45% of time spent just waiting for a CPU!"

When Run-Queue Latency Spikes

Scenarios that cause run-queue wait:

1. Container CPU limits (most common):
   ┌─────────────────────────────────────────────┐
   │ Container: 2 CPU limit, 8 threads           │
   │ Quota covers 2 busy threads, not 8          │
   │ CFS throttling adds more queue time         │
   └─────────────────────────────────────────────┘

2. CPU overcommitment:
   ┌─────────────────────────────────────────────┐
   │ 64 vCPUs on 8 physical cores               │
   │ Context switches between all VMs           │
   │ Each switch = run-queue time               │
   └─────────────────────────────────────────────┘

3. NUMA misplacement:
   ┌─────────────────────────────────────────────┐
   │ Thread scheduled on remote NUMA node       │
   │ Migration bouncing between nodes           │
   │ Each migration = re-queue                  │
   └─────────────────────────────────────────────┘
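
Scenario 1 can be made concrete with a little arithmetic. The sketch below is an illustration rather than a model of the real scheduler: it assumes every thread stays runnable for the whole period and that the host has a free CPU for each of them. The function name `throttledPerPeriod` is made up for this example.

```go
package main

import "fmt"

// throttledPerPeriod returns how long (in microseconds of wall-clock time)
// a cgroup sits throttled in each CFS period, assuming `threads` threads stay
// runnable the whole time and each one has a CPU to run on.
func throttledPerPeriod(quotaUs, periodUs int64, threads int) int64 {
	if quotaUs <= 0 || threads <= 0 {
		return 0 // no limit configured, or nothing to run
	}
	// N threads running in parallel consume quota N times faster than
	// wall-clock time passes.
	wallUntilExhausted := quotaUs / int64(threads)
	if wallUntilExhausted >= periodUs {
		return 0 // the quota outlasts the period: never throttled
	}
	return periodUs - wallUntilExhausted
}

func main() {
	// 2-CPU limit (200ms quota per 100ms period) with 8 busy threads.
	t := throttledPerPeriod(200000, 100000, 8)
	fmt.Printf("throttled %dms of every 100ms period\n", t/1000)
}
```

Under those assumptions the eight threads run freely for 25ms and then stall for 75ms, which is why a request that arrives just after the quota runs out can wait most of a period before it is even scheduled.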

Root Cause

CFS Scheduler Behavior

// Linux CFS (Completely Fair Scheduler) maintains per-CPU run queues
// When a thread wakes up, it goes to a run queue

struct cfs_rq {
    struct rb_root_cached tasks_timeline;  // Red-black tree of runnable tasks
    u64 min_vruntime;                      // Minimum virtual runtime
    // ...
};

// Run-queue latency = time from becoming runnable to actually running
// This is INVISIBLE to perf/flamegraphs unless you specifically measure it

// Key insight: A thread with 1ms of actual CPU work
// might have 10ms of run-queue latency in overloaded systems
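
Before reaching for eBPF, the kernel already keeps a coarse version of this number per task. On kernels built with scheduler accounting, `/proc/<pid>/schedstat` holds three counters: nanoseconds on CPU, nanoseconds waiting on a run queue, and the number of timeslices. The sketch below parses that line; the type and function names are my own, and the file reflects a single task, so a multi-threaded process would need the per-thread files under `/proc/<pid>/task/` summed.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// schedStat holds the three counters from a schedstat line.
type schedStat struct {
	OnCPUNs    uint64 // time spent executing
	RunQueueNs uint64 // time spent runnable but waiting for a CPU
	Timeslices uint64 // number of times the task was scheduled in
}

// parseSchedStat parses a line of the form "<on-cpu ns> <wait ns> <slices>".
func parseSchedStat(line string) (schedStat, error) {
	f := strings.Fields(line)
	if len(f) != 3 {
		return schedStat{}, fmt.Errorf("expected 3 fields, got %d", len(f))
	}
	var v [3]uint64
	for i, s := range f {
		n, err := strconv.ParseUint(s, 10, 64)
		if err != nil {
			return schedStat{}, err
		}
		v[i] = n
	}
	return schedStat{OnCPUNs: v[0], RunQueueNs: v[1], Timeslices: v[2]}, nil
}

// waitFraction is run-queue wait as a share of on-CPU plus run-queue time.
func (s schedStat) waitFraction() float64 {
	total := s.OnCPUNs + s.RunQueueNs
	if total == 0 {
		return 0
	}
	return float64(s.RunQueueNs) / float64(total)
}

func main() {
	data, err := os.ReadFile("/proc/self/schedstat")
	if err != nil {
		fmt.Println("schedstat not available:", err)
		return
	}
	s, err := parseSchedStat(string(data))
	if err != nil {
		fmt.Println("unexpected format:", err)
		return
	}
	fmt.Printf("on-CPU %dns, run-queue %dns (%.1f%% waiting)\n",
		s.OnCPUNs, s.RunQueueNs, 100*s.waitFraction())
}
```

The counters are cumulative since the task started, so sampling them twice and taking the difference gives the wait share over an interval.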

Container CFS Bandwidth Throttling

# Containers use CFS bandwidth control
# cpu.cfs_quota_us / cpu.cfs_period_us = CPU limit

# Example: 2 CPU limit
/sys/fs/cgroup/cpu/container/cpu.cfs_quota_us: 200000  # 200ms
/sys/fs/cgroup/cpu/container/cpu.cfs_period_us: 100000 # per 100ms

# If your 8 threads all run at once (8 CPUs' worth of demand):
# They use up the 200ms quota in about 25ms of wall-clock time,
# then are THROTTLED for the rest of the 100ms period
# Throttled threads cannot run until the next period starts, adding latency

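# Check throttling (cgroup v1 path; on cgroup v2 the file reports throttled_usec instead):
cat /sys/fs/cgroup/cpu/container/cpu.stat
# nr_throttled 15234        <- Number of periods in which the group was throttled
# throttled_time 892347123  <- Total nanoseconds spent throttled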
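
The raw counters are easier to judge as a ratio. The sketch below parses the `key value` lines of a cgroup v1 `cpu.stat` file and reports the share of enforcement periods in which the group was throttled. `nr_periods` is a real field in that file, but the value used here is an illustrative number I picked, and the helper names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseCPUStat reads "key value" lines as found in a cgroup cpu.stat file.
// Lines that do not match that shape are skipped.
func parseCPUStat(content string) map[string]uint64 {
	stats := make(map[string]uint64)
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) != 2 {
			continue
		}
		if n, err := strconv.ParseUint(f[1], 10, 64); err == nil {
			stats[f[0]] = n
		}
	}
	return stats
}

// throttledRatio is the share of periods in which the cgroup hit its quota.
func throttledRatio(stats map[string]uint64) float64 {
	if stats["nr_periods"] == 0 {
		return 0
	}
	return float64(stats["nr_throttled"]) / float64(stats["nr_periods"])
}

func main() {
	// nr_periods is an illustrative value; the other two are the numbers
	// from the cpu.stat output shown above.
	sample := "nr_periods 60936\nnr_throttled 15234\nthrottled_time 892347123\n"
	s := parseCPUStat(sample)
	fmt.Printf("throttled in %.0f%% of periods, %.2fs throttled in total\n",
		100*throttledRatio(s), float64(s["throttled_time"])/1e9)
}
```

As with any cumulative counter, the useful signal is the change between two reads taken some seconds apart, not the absolute value.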

Diagnosis

Step 1: Check Run-Queue Length

# Simple check: runnable processes per CPU
sar -q 1 5
# runq-sz = average run queue length
# If runq-sz > num_cpus consistently, you have queuing

# More detail with vmstat
vmstat 1
# r column = processes waiting for run time
# Should be <= number of CPUs

# Per-CPU view
mpstat -P ALL 1
# Check %idle per CPU - if some CPUs show low %idle while the average suggests headroom, threads are queuing on those CPUs
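
The same check can be scripted. The fourth field of `/proc/loadavg` has the form `runnable/total`, where the first number is how many scheduling entities are runnable right now. The sketch below compares that to the CPU count. It is a point-in-time sample, so treat a single reading with suspicion, and note that `runtime.NumCPU()` reflects the CPU affinity mask, not a cgroup quota. The function names are my own.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// runnableFromLoadavg extracts the number of currently runnable scheduling
// entities from a /proc/loadavg line such as "0.20 0.18 0.12 3/480 11206".
func runnableFromLoadavg(line string) (int, error) {
	f := strings.Fields(line)
	if len(f) < 4 {
		return 0, fmt.Errorf("unexpected loadavg format: %q", line)
	}
	running, _, ok := strings.Cut(f[3], "/")
	if !ok {
		return 0, fmt.Errorf("unexpected runnable field: %q", f[3])
	}
	return strconv.Atoi(running)
}

// queued estimates how many runnable tasks have no CPU to run on.
func queued(runnable, cpus int) int {
	if runnable <= cpus {
		return 0
	}
	return runnable - cpus
}

func main() {
	data, err := os.ReadFile("/proc/loadavg")
	if err != nil {
		fmt.Println("cannot read /proc/loadavg:", err)
		return
	}
	runnable, err := runnableFromLoadavg(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	cpus := runtime.NumCPU()
	fmt.Printf("runnable=%d cpus=%d queued=%d\n", runnable, cpus, queued(runnable, cpus))
}
```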

Step 2: eBPF Run-Queue Latency Histogram

# Using bcc-tools: runqlat
sudo runqlat -m 10
# Prints a histogram of run-queue latency in milliseconds every 10 seconds
# Healthy: most samples < 1ms
# Problem: significant samples > 10ms

# Output example (problematic):
     msecs           : count    distribution
       0 -> 1        : 1523    |**************          |
       2 -> 3        : 892     |********                |
       4 -> 7        : 2341    |**********************  |  <- Too many!
       8 -> 15       : 1876    |******************      |  <- Way too many!
      16 -> 31       : 543     |*****                   |
      32 -> 63       : 121     |*                       |
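
To turn a histogram like this into a single number you can track, sum the buckets above a threshold. The sketch below hard-codes the counts from the example output; the `bucket` type and `shareAtOrAbove` function are hypothetical names for this illustration, and because the buckets are power-of-two ranges the threshold should sit on a bucket boundary.

```go
package main

import "fmt"

// bucket is one row of a runqlat histogram: an inclusive range and a count.
type bucket struct {
	LoMs, HiMs int
	Count      int
}

// shareAtOrAbove returns the fraction of samples that fall in buckets whose
// lower bound is at least thresholdMs.
func shareAtOrAbove(h []bucket, thresholdMs int) float64 {
	var total, slow int
	for _, b := range h {
		total += b.Count
		if b.LoMs >= thresholdMs {
			slow += b.Count
		}
	}
	if total == 0 {
		return 0
	}
	return float64(slow) / float64(total)
}

func main() {
	// Counts copied from the problematic example output above.
	h := []bucket{
		{0, 1, 1523}, {2, 3, 892}, {4, 7, 2341},
		{8, 15, 1876}, {16, 31, 543}, {32, 63, 121},
	}
	fmt.Printf("%.1f%% of wakeups waited 4ms or longer\n", 100*shareAtOrAbove(h, 4))
	fmt.Printf("%.1f%% of wakeups waited 8ms or longer\n", 100*shareAtOrAbove(h, 8))
}
```

For the example data roughly two thirds of wakeups waited 4ms or more, which is far from the "most samples < 1ms" shape of a healthy system.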

Step 3: Per-Process Run-Queue Time

# runqlat with process filter (-p takes one PID; pgrep -o picks the oldest match)
sudo runqlat -p $(pgrep -o java) -m 10

# Or use runqslower for individual events
sudo runqslower 10000
# Shows every instance where run-queue wait > 10ms
# Output: TIME     COMM             PID    LAT(us)
#         12:34:56 java             1234   15234

Step 4: Off-CPU Flame Graph

# Capture off-CPU stacks with bcc for 30 seconds (-p takes one PID)
sudo offcputime -df -p $(pgrep -o myapp) 30 > out.stacks

# Generate flame graph
./flamegraph.pl --color=io --title="Off-CPU Time" out.stacks > offcpu.svg

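# Look for:
# - stacks that end in schedule() without a blocking call such as
#   futex_wait or io_schedule = likely preemption / run-queue wait
#   (offcputime --state 0 keeps only threads switched out while still runnable)
# - futex_wait = lock contention
# - io_schedule = I/O wait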

The Fix

Option 1: Right-Size Container CPU Limits

# BEFORE: Limit that causes constant throttling
resources:
  limits:
    cpu: "2"      # 2 CPUs
  requests:
    cpu: "2"
# With 8 busy worker threads the quota only covers 2; the rest queue or get throttled

# AFTER: Match limit to actual parallelism
resources:
  limits:
    cpu: "4"      # Allow more parallel execution
  requests:
    cpu: "2"      # Still request 2 for scheduling

# Or reduce parallelism to match limit:
# Set thread pool size = CPU limit
WORKER_THREADS=2
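
Hard-coding the thread count works until someone changes the limit. A service can derive it at startup instead. The sketch below reads the cgroup v1 quota files and rounds up to a whole number of workers. The path is an assumption about how the cgroup is mounted inside the container (cgroup v2 uses a single `cpu.max` file), and `workersForQuota` is a name I made up for this example.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// workersForQuota returns ceil(quota/period): the number of threads that can
// run continuously without exhausting the CFS quota. A non-positive quota
// means "no limit", in which case fallback is returned.
func workersForQuota(quotaUs, periodUs int64, fallback int) int {
	if quotaUs <= 0 || periodUs <= 0 {
		return fallback
	}
	return int((quotaUs + periodUs - 1) / periodUs)
}

// readInt reads a file containing a single integer.
func readInt(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

func main() {
	// Assumed cgroup v1 mount point as seen from inside the container.
	const dir = "/sys/fs/cgroup/cpu"
	quota, err1 := readInt(dir + "/cpu.cfs_quota_us")
	period, err2 := readInt(dir + "/cpu.cfs_period_us")
	if err1 != nil || err2 != nil {
		quota = -1 // files missing: behave as if unlimited
	}
	n := workersForQuota(quota, period, runtime.NumCPU())
	runtime.GOMAXPROCS(n) // cap the Go scheduler at the same parallelism
	fmt.Println("worker threads:", n)
}
```

Rounding up means a 1.5-CPU limit gets two workers, which can still be throttled under sustained load; rounding down is the more conservative choice if tail latency matters more than throughput.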

Option 2: Tune CFS Parameters

# Increase CFS period for less frequent throttling
# (trades latency variance for throughput)

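# Per-container (Kubernetes doesn't expose this directly)
# Scale the quota along with the period, or the effective CPU limit changes:
# a 2 CPU limit at a 1s period is 2000000 / 1000000
echo 1000000 > /sys/fs/cgroup/cpu/container/cpu.cfs_period_us
echo 2000000 > /sys/fs/cgroup/cpu/container/cpu.cfs_quota_us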

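# System-wide tuning (sysctl names apply to kernels before 5.13;
# later kernels moved these knobs to /sys/kernel/debug/sched/)
sysctl kernel.sched_min_granularity_ns=3000000
sysctl kernel.sched_wakeup_granularity_ns=4000000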

Option 3: CPU Pinning for Latency-Critical Threads

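// Pin critical threads to specific CPUs
import (
    "log"
    "runtime"

    "golang.org/x/sys/unix"
)

// pinToCPU binds the calling OS thread to cpuID. The goroutine is locked
// to its thread first; otherwise the Go scheduler may move it to another
// thread that does not carry the affinity mask.
func pinToCPU(cpuID int) error {
    runtime.LockOSThread()
    var mask unix.CPUSet
    mask.Set(cpuID)
    return unix.SchedSetaffinity(0, &mask) // pid 0 = the calling thread
}

// Latency-critical handler pinned to dedicated CPU
go func() {
    if err := pinToCPU(0); err != nil { // CPU 0 reserved for this
        log.Printf("CPU pinning failed: %v", err)
    }
    for req := range criticalRequests {
        handleCritical(req)
    }
}()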

# Kubernetes: Use CPU manager for guaranteed pinning
kubelet config:
  cpuManagerPolicy: static

# Pod spec for guaranteed QoS (enables pinning)
resources:
  limits:
    cpu: "2"
    memory: "4Gi"
  requests:
    cpu: "2"
    memory: "4Gi"

Option 4: NUMA-Aware Scheduling

# Check NUMA topology
numactl --hardware

# Pin process to single NUMA node
numactl --cpunodebind=0 --membind=0 ./myapp

# In Kubernetes, use topology manager
kubelet config:
  topologyManagerPolicy: single-numa-node
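
To confirm which CPUs belong to a node before pinning anything, the kernel exposes a list per node under `/sys/devices/system/node/nodeN/cpulist` in a compact range format such as `0-7,16-23`. The sketch below expands that format into individual CPU IDs; `parseCPUList` is a hypothetical helper name, and the node path in `main` assumes node 0 exists on the machine.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel cpulist string such as "0-3,8-11" into
// the individual CPU IDs it covers.
func parseCPUList(s string) ([]int, error) {
	s = strings.TrimSpace(s)
	if s == "" {
		return nil, nil
	}
	var cpus []int
	for _, part := range strings.Split(s, ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		start, err := strconv.Atoi(lo)
		if err != nil {
			return nil, err
		}
		end := start
		if isRange {
			if end, err = strconv.Atoi(hi); err != nil {
				return nil, err
			}
		}
		if end < start {
			return nil, fmt.Errorf("invalid range %q", part)
		}
		for c := start; c <= end; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

func main() {
	data, err := os.ReadFile("/sys/devices/system/node/node0/cpulist")
	if err != nil {
		fmt.Println("NUMA topology not available:", err)
		return
	}
	cpus, err := parseCPUList(string(data))
	if err != nil {
		fmt.Println("unexpected cpulist format:", err)
		return
	}
	fmt.Println("node0 CPUs:", cpus)
}
```

The resulting IDs are the ones to compare against a thread's affinity mask when checking whether it can be scheduled off its home node.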

Monitoring

groups:
  - name: scheduler-latency
    rules:
      - alert: HighRunQueueLatency
        expr: |
          histogram_quantile(0.99,
            rate(scheduler_runq_latency_bucket[5m])
          ) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p99 run-queue latency > 10ms"

      - alert: CFSThrottlingHigh
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container being CPU throttled > 10% of time"

      - alert: RunQueueOverloaded
        expr: |
          node_load1
            / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
            > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "1-minute load average > 2x CPU count"

Checklist

## Run-Queue Latency Investigation

### Symptoms
- [ ] Low CPU utilization but high latency
- [ ] Latency spikes not correlated with traffic
- [ ] Traditional profilers show "nothing wrong"
- [ ] Performance degrades under moderate load

### Diagnosis
- [ ] Check run-queue length with sar/vmstat
- [ ] Use runqlat for latency histogram
- [ ] Generate off-CPU flame graph
- [ ] Check CFS throttling statistics

### Fixes
- [ ] Right-size container CPU limits
- [ ] Reduce thread count to match CPU limit
- [ ] Consider CPU pinning for critical paths
- [ ] Enable NUMA-aware scheduling
- [ ] Monitor CFS throttling metrics

Conclusion

The lesson: CPU profilers only show what happens while your code runs. Off-CPU analysis reveals where your code waits, and run-queue time is often the hidden majority of request latency in containerized environments.

Key tools:

runqlat - Histogram of run-queue wait times
offcputime - Off-CPU stack traces
runqslower - Individual long waits
CFS throttle metrics - Container-specific queueing

