eBPF Off-CPU Analysis: Finding Latency That Metrics Miss
The CPU was idle, the requests were slow, and eBPF finally explained the gap. “CPU usage is 20% but p99 latency is 500ms.” Standard profilers only show CPU time. Your application is waiting - for locks, I/O, or other processes. eBPF off-CPU analysis reveals where time actually goes.
Tested on: Linux 5.15, bcc-tools 0.28, Go 1.21 application with mutex contention
Understanding Off-CPU Time
CPU Time vs Wall Time
Request lifecycle:
Wall time: 500ms (what user experiences)
├── CPU time: 100ms (what standard profilers see)
│ ├── 50ms: JSON parsing
│ ├── 30ms: Business logic
│ └── 20ms: Response serialization
│
└── Off-CPU time: 400ms (INVISIBLE to CPU profilers!)
├── 200ms: Waiting for database
├── 150ms: Blocked on mutex
└── 50ms: Network I/O wait
CPU profiler says: "JSON parsing is 50% of CPU time"
Reality: "Mutex contention is 30% of request time"
Why Standard Tools Miss This
What pprof shows (CPU profiler):
┌─────────────────────────────────────────────────────────────┐
│ Function │ CPU Time │ % Total │
├─────────────────────────────────────────────────────────────┤
│ encoding/json.Unmarshal │ 50ms │ 50% │
│ main.processOrder │ 30ms │ 30% │
│ net/http.writeResponse │ 20ms │ 20% │
└─────────────────────────────────────────────────────────────┘
Total: 100ms (but request took 500ms!)
What eBPF off-CPU shows:
┌─────────────────────────────────────────────────────────────┐
│ Wait Reason │ Off-CPU │ % Total │
├─────────────────────────────────────────────────────────────┤
│ futex (mutex) │ 150ms │ 30% │
│ epoll_wait (network) │ 200ms │ 40% │
│ read (disk) │ 50ms │ 10% │
│ On-CPU │ 100ms │ 20% │
└─────────────────────────────────────────────────────────────┘
Total: 500ms (the full picture!)
Using BCC Tools
Installation
# Ubuntu/Debian
apt-get install bpfcc-tools linux-headers-$(uname -r)
# RHEL/CentOS
yum install bcc-tools
# Tools are in /usr/share/bcc/tools/
# Or run with: python3 /usr/share/bcc/tools/offcputime
Basic Off-CPU Analysis
# Find PID of your application
pgrep -f "myapp"
# Example: 12345
# Capture off-CPU stacks for 30 seconds
/usr/share/bcc/tools/offcputime -df -p 12345 30 > offcpu.stacks
# Output format:
# stack_trace;stack_trace;... time_in_microseconds
# Example:
# main.handler;sync.Mutex.Lock;runtime.futex 150000
# main.handler;database/sql.Query;net.Read 200000
Visualize with Flame Graph
# Install flamegraph tools
git clone https://github.com/brendangregg/FlameGraph
# Generate flame graph
./FlameGraph/flamegraph.pl --color=io \
--title="Off-CPU Time" \
offcpu.stacks > offcpu.svg
# Open in browser - RED/ORANGE shows off-CPU time
Live Analysis
# Real-time off-CPU summary
/usr/share/bcc/tools/offcputime -p 12345 --state 2
# --state 2 = TASK_INTERRUPTIBLE (waiting for I/O or locks)
# Sample output:
# Tracing off-CPU time (us) of PID 12345... Ctrl-C to end.
#
# [30 second trace]
#
# waker: mysql-server
# runtime.futexsleep
# runtime.notesleep
# runtime.mPark
# ...
# database/sql.(*DB).Query
# main.handleRequest
# - 5234521 us (5.2 seconds total waiting for MySQL)
Common Patterns
Mutex Contention
# Find futex waits (mutex/condition variable)
/usr/share/bcc/tools/offcputime -p 12345 -K 30 2>&1 | grep futex
# Example output:
# runtime.futexsleep
# sync.(*Mutex).Lock
# main.(*Cache).Get <- Contention here!
# main.handleRequest
# 12345678 us
// Problematic code (found by eBPF):
type Cache struct {
mu sync.Mutex // Single lock for entire cache
data map[string][]byte
}
func (c *Cache) Get(key string) []byte {
c.mu.Lock() // All requests serialize here
defer c.mu.Unlock()
return c.data[key]
}
// Solution: Sharded locks
type ShardedCache struct {
shards [256]struct {
mu sync.RWMutex
data map[string][]byte
}
}
func (c *ShardedCache) Get(key string) []byte {
shard := &c.shards[hash(key)%256]
shard.mu.RLock() // Read lock, less contention
defer shard.mu.RUnlock()
return shard.data[key]
}
Database Connection Pool Exhaustion
# Find waits on network sockets
/usr/share/bcc/tools/offcputime -p 12345 30 | grep -E "epoll|read|write"
# If seeing long waits in database driver:
# net.(*netFD).Read
# database/sql.(*DB).Query
# 25000000 us <- Waiting 25s for DB connection!
File I/O Blocking
# Trace file system waits
/usr/share/bcc/tools/filetop -p 12345 30
# Or with offcputime
/usr/share/bcc/tools/offcputime -p 12345 30 | grep -E "read|write|fsync"
# Common issue: synchronous file writes
# os.(*File).Write
# main.logToFile
# 5000000 us <- 5s blocked on file I/O
Advanced: Profile Specific Functions
Using bpftrace
# Install bpftrace
apt-get install bpftrace
# Trace time spent in specific function
bpftrace -e '
uprobe:/path/to/myapp:main.slowFunction {
@start[tid] = nsecs;
}
uretprobe:/path/to/myapp:main.slowFunction /@start[tid]/ {
@latency = hist(nsecs - @start[tid]);
delete(@start[tid]);
}
'
# Output: histogram of function execution time
Trace Lock Contention
# Trace futex contention
bpftrace -e '
tracepoint:syscalls:sys_enter_futex {
@start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /@start[tid]/ {
$dur = nsecs - @start[tid];
if ($dur > 1000000) { // > 1ms
@slow[ustack, $dur / 1000000] = count();
}
delete(@start[tid]);
}
'
Continuous Profiling Integration
Parca Integration
# Parca agent config for off-CPU profiling
apiVersion: v1
kind: ConfigMap
metadata:
name: parca-agent-config
data:
parca-agent.yaml: |
profiling:
cpu: true
offcpu: true # Enable off-CPU profiling
offcpu_threshold_ms: 10 # Capture waits > 10ms
Pyroscope Integration
// Enable off-CPU profiling in Pyroscope
import "github.com/grafana/pyroscope-go"
pyroscope.Start(pyroscope.Config{
ApplicationName: "myapp",
ServerAddress: "http://pyroscope:4040",
ProfileTypes: []pyroscope.ProfileType{
pyroscope.ProfileCPU,
pyroscope.ProfileBlockProfile, // Blocking/mutex
pyroscope.ProfileGoroutineProfile,
},
})
Monitoring and Alerting
Prometheus Metrics from eBPF
# Custom eBPF exporter for off-CPU metrics
from bcc import BPF
from prometheus_client import Histogram, start_http_server
offcpu_duration = Histogram(
'offcpu_duration_seconds',
'Time spent off-CPU',
['reason'],
buckets=[0.001, 0.01, 0.1, 0.5, 1, 5, 10]
)
# BPF program to track off-CPU time
bpf_text = """
#include <uapi/linux/ptrace.h>
BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);
int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
u32 pid = prev->pid;
u64 ts = bpf_ktime_get_ns();
start.update(&pid, &ts);
return 0;
}
"""
Alert on High Off-CPU Time
groups:
- name: offcpu
rules:
- alert: HighOffCPURatio
expr: |
(rate(process_offcpu_seconds_total[5m]) /
rate(process_cpu_seconds_total[5m])) > 4
for: 10m
annotations:
summary: "App spending 4x more time waiting than computing"
- alert: MutexContention
expr: |
rate(go_sync_mutex_wait_total_seconds[5m]) > 0.1
for: 5m
annotations:
summary: "High mutex contention detected"
Checklist
## Off-CPU Analysis
### Investigation
- [ ] Check if CPU usage << latency suggests off-CPU time
- [ ] Run offcputime for 30-60 seconds
- [ ] Generate flame graph for visualization
- [ ] Identify top off-CPU stacks
### Common Causes
- [ ] Mutex contention (futex waits)
- [ ] Database connection pool exhaustion
- [ ] Synchronous file I/O
- [ ] Network I/O waits
- [ ] Channel blocking in Go
### Solutions
- [ ] Shard locks for reduced contention
- [ ] Increase connection pool size
- [ ] Use async I/O where possible
- [ ] Add buffering to reduce syscalls
### Continuous Monitoring
- [ ] Enable off-CPU profiling in Parca/Pyroscope
- [ ] Alert on high off-CPU ratio
- [ ] Dashboard showing CPU vs wall time
Conclusion
When CPU is low but latency is high, use eBPF:
- Standard profilers miss waiting time - they only see CPU
- offcputime reveals the full picture - locks, I/O, everything
- Flame graphs make it visual - spot bottlenecks instantly
- Continuous profiling catches regressions - before users notice
Your application spends more time waiting than working. Find out what it’s waiting for.
Related Articles
- K8s CPU Throttling Autopsy - CPU performance
- Prometheus Cardinality Explosion - Observability
Related posts
eBPF Run-Queue Latency: Finding the Off-CPU Bottleneck
CPU utilization is low but requests are slow. The hidden culprit: time spent waiting in the scheduler run-queue, invisible to traditional profilers but visible with eBPF off-CPU analysis.
tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap
Clients time out, tcpdump shows SYNs (sometimes even SYN-ACK), yet your app logs nothing. The culprit is often the Linux listen/accept queues overflowing under load or CPU starvation.
Prometheus WAL Replay Hell: Slow Restarts and Missing Alerts
When Prometheus takes minutes or hours to restart, WAL replay is often the culprit. Prove it from logs and disk, recover safely, and prevent it.
Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms
Periodic latency spikes that look like network jitter. The real cause: nested timeouts creating thousands of timers that pressure the Go runtime timer heap and trigger GC scanning.
Cite this article
If you reference this post, please link to the original URL and credit the author.