eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

The CPU was idle, the requests were slow, and eBPF finally explained the gap. “CPU usage is 20% but p99 latency is 500ms.” Standard profilers only show CPU time. Your application is waiting - for locks, I/O, or other processes. eBPF off-CPU analysis reveals where time actually goes.

Tested on: Linux 5.15, bcc-tools 0.28, Go 1.21 application with mutex contention

Understanding Off-CPU Time

CPU Time vs Wall Time

Request lifecycle:

Wall time: 500ms (what user experiences)
├── CPU time: 100ms (what standard profilers see)
│   ├── 50ms: JSON parsing
│   ├── 30ms: Business logic
│   └── 20ms: Response serialization
│
└── Off-CPU time: 400ms (INVISIBLE to CPU profilers!)
    ├── 200ms: Waiting for database
    ├── 150ms: Blocked on mutex
    └── 50ms: Network I/O wait

CPU profiler says: "JSON parsing is 50% of CPU time"
Reality: "Mutex contention is 30% of request time"

Why Standard Tools Miss This

What pprof shows (CPU profiler):
┌─────────────────────────────────────────────────────────────┐
│ Function              │ CPU Time │  % Total                 │
├─────────────────────────────────────────────────────────────┤
│ encoding/json.Unmarshal │  50ms  │   50%                    │
│ main.processOrder       │  30ms  │   30%                    │
│ net/http.writeResponse  │  20ms  │   20%                    │
└─────────────────────────────────────────────────────────────┘
Total: 100ms (but request took 500ms!)

What eBPF off-CPU shows:
┌─────────────────────────────────────────────────────────────┐
│ Wait Reason           │ Off-CPU  │  % Total                 │
├─────────────────────────────────────────────────────────────┤
│ futex (mutex)         │  150ms   │   30%                    │
│ epoll_wait (network)  │  200ms   │   40%                    │
│ read (disk)           │   50ms   │   10%                    │
│ On-CPU                │  100ms   │   20%                    │
└─────────────────────────────────────────────────────────────┘
Total: 500ms (the full picture!)

Using BCC Tools

Installation

# Ubuntu/Debian
apt-get install bpfcc-tools linux-headers-$(uname -r)

# RHEL/CentOS
yum install bcc-tools

# Tools are in /usr/share/bcc/tools/
# Or run with: python3 /usr/share/bcc/tools/offcputime

Basic Off-CPU Analysis

# Find PID of your application
pgrep -f "myapp"
# Example: 12345

# Capture off-CPU stacks for 30 seconds
/usr/share/bcc/tools/offcputime -df -p 12345 30 > offcpu.stacks

# Output format:
# stack_trace;stack_trace;... time_in_microseconds
# Example:
# main.handler;sync.Mutex.Lock;runtime.futex 150000
# main.handler;database/sql.Query;net.Read 200000

Visualize with Flame Graph

# Install flamegraph tools
git clone https://github.com/brendangregg/FlameGraph

# Generate flame graph
./FlameGraph/flamegraph.pl --color=io \
    --title="Off-CPU Time" \
    offcpu.stacks > offcpu.svg

# Open in browser - RED/ORANGE shows off-CPU time

Live Analysis

# Real-time off-CPU summary
/usr/share/bcc/tools/offcputime -p 12345 --state 2

# --state 2 = TASK_INTERRUPTIBLE (waiting for I/O or locks)

# Sample output:
# Tracing off-CPU time (us) of PID 12345... Ctrl-C to end.
#
# [30 second trace]
#
#     waker:  mysql-server
#     runtime.futexsleep
#     runtime.notesleep
#     runtime.mPark
#     ...
#     database/sql.(*DB).Query
#     main.handleRequest
#         - 5234521 us (5.2 seconds total waiting for MySQL)

Common Patterns

Mutex Contention

# Find futex waits (mutex/condition variable)
/usr/share/bcc/tools/offcputime -p 12345 -K 30 2>&1 | grep futex

# Example output:
# runtime.futexsleep
# sync.(*Mutex).Lock
# main.(*Cache).Get        <- Contention here!
# main.handleRequest
#     12345678 us

// Problematic code (found by eBPF):
type Cache struct {
    mu   sync.Mutex  // Single lock for entire cache
    data map[string][]byte
}

func (c *Cache) Get(key string) []byte {
    c.mu.Lock()         // All requests serialize here
    defer c.mu.Unlock()
    return c.data[key]
}

// Solution: Sharded locks
type ShardedCache struct {
    shards [256]struct {
        mu   sync.RWMutex
        data map[string][]byte
    }
}

func (c *ShardedCache) Get(key string) []byte {
    shard := &c.shards[hash(key)%256]
    shard.mu.RLock()  // Read lock, less contention
    defer shard.mu.RUnlock()
    return shard.data[key]
}

Database Connection Pool Exhaustion

# Find waits on network sockets
/usr/share/bcc/tools/offcputime -p 12345 30 | grep -E "epoll|read|write"

# If seeing long waits in database driver:
# net.(*netFD).Read
# database/sql.(*DB).Query
#     25000000 us  <- Waiting 25s for DB connection!

File I/O Blocking

# Trace file system waits
/usr/share/bcc/tools/filetop -p 12345 30

# Or with offcputime
/usr/share/bcc/tools/offcputime -p 12345 30 | grep -E "read|write|fsync"

# Common issue: synchronous file writes
# os.(*File).Write
# main.logToFile
#     5000000 us  <- 5s blocked on file I/O

Advanced: Profile Specific Functions

Using bpftrace

# Install bpftrace
apt-get install bpftrace

# Trace time spent in specific function
bpftrace -e '
uprobe:/path/to/myapp:main.slowFunction {
    @start[tid] = nsecs;
}
uretprobe:/path/to/myapp:main.slowFunction /@start[tid]/ {
    @latency = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
'

# Output: histogram of function execution time

Trace Lock Contention

# Trace futex contention
bpftrace -e '
tracepoint:syscalls:sys_enter_futex {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /@start[tid]/ {
    $dur = nsecs - @start[tid];
    if ($dur > 1000000) {  // > 1ms
        @slow[ustack, $dur / 1000000] = count();
    }
    delete(@start[tid]);
}
'

Continuous Profiling Integration

Parca Integration

# Parca agent config for off-CPU profiling
apiVersion: v1
kind: ConfigMap
metadata:
  name: parca-agent-config
data:
  parca-agent.yaml: |
    profiling:
      cpu: true
      offcpu: true  # Enable off-CPU profiling
      offcpu_threshold_ms: 10  # Capture waits > 10ms

Pyroscope Integration

// Enable off-CPU profiling in Pyroscope
import "github.com/grafana/pyroscope-go"

pyroscope.Start(pyroscope.Config{
    ApplicationName: "myapp",
    ServerAddress:   "http://pyroscope:4040",
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileBlockProfile,    // Blocking/mutex
        pyroscope.ProfileGoroutineProfile,
    },
})

Monitoring and Alerting

Prometheus Metrics from eBPF

# Custom eBPF exporter for off-CPU metrics
from bcc import BPF
from prometheus_client import Histogram, start_http_server

offcpu_duration = Histogram(
    'offcpu_duration_seconds',
    'Time spent off-CPU',
    ['reason'],
    buckets=[0.001, 0.01, 0.1, 0.5, 1, 5, 10]
)

# BPF program to track off-CPU time
bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
    u32 pid = prev->pid;
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}
"""

Alert on High Off-CPU Time

groups:
- name: offcpu
  rules:
  - alert: HighOffCPURatio
    expr: |
      (rate(process_offcpu_seconds_total[5m]) /
       rate(process_cpu_seconds_total[5m])) > 4
    for: 10m
    annotations:
      summary: "App spending 4x more time waiting than computing"

  - alert: MutexContention
    expr: |
      rate(go_sync_mutex_wait_total_seconds[5m]) > 0.1
    for: 5m
    annotations:
      summary: "High mutex contention detected"

Checklist

## Off-CPU Analysis

### Investigation
- [ ] Check if CPU usage << latency suggests off-CPU time
- [ ] Run offcputime for 30-60 seconds
- [ ] Generate flame graph for visualization
- [ ] Identify top off-CPU stacks

### Common Causes
- [ ] Mutex contention (futex waits)
- [ ] Database connection pool exhaustion
- [ ] Synchronous file I/O
- [ ] Network I/O waits
- [ ] Channel blocking in Go

### Solutions
- [ ] Shard locks for reduced contention
- [ ] Increase connection pool size
- [ ] Use async I/O where possible
- [ ] Add buffering to reduce syscalls

### Continuous Monitoring
- [ ] Enable off-CPU profiling in Parca/Pyroscope
- [ ] Alert on high off-CPU ratio
- [ ] Dashboard showing CPU vs wall time

Conclusion

When CPU is low but latency is high, use eBPF:

Standard profilers miss waiting time - they only see CPU
offcputime reveals the full picture - locks, I/O, everything
Flame graphs make it visual - spot bottlenecks instantly
Continuous profiling catches regressions - before users notice

Your application spends more time waiting than working. Find out what it’s waiting for.

K8s CPU Throttling Autopsy - CPU performance
Prometheus Cardinality Explosion - Observability

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

Understanding Off-CPU Time

CPU Time vs Wall Time

Why Standard Tools Miss This

Using BCC Tools

Installation

Basic Off-CPU Analysis

Visualize with Flame Graph

Live Analysis

Common Patterns

Mutex Contention

Database Connection Pool Exhaustion

File I/O Blocking

Advanced: Profile Specific Functions

Using bpftrace

Trace Lock Contention

Continuous Profiling Integration

Parca Integration

Pyroscope Integration

Monitoring and Alerting

Prometheus Metrics from eBPF

Alert on High Off-CPU Time

Checklist

Conclusion

Related posts

Cite this article

Understanding Off-CPU Time

CPU Time vs Wall Time

Why Standard Tools Miss This

Using BCC Tools

Installation

Basic Off-CPU Analysis

Visualize with Flame Graph

Live Analysis

Common Patterns

Mutex Contention

Database Connection Pool Exhaustion

File I/O Blocking

Advanced: Profile Specific Functions

Using bpftrace

Trace Lock Contention

Continuous Profiling Integration

Parca Integration

Pyroscope Integration

Monitoring and Alerting

Prometheus Metrics from eBPF

Alert on High Off-CPU Time

Checklist

Conclusion

Related Articles

Related posts

Cite this article