Back to blog

eBPF Off-CPU Analysis: Finding Latency That Metrics Miss

The CPU was idle, the requests were slow, and eBPF finally explained the gap. “CPU usage is 20% but p99 latency is 500ms.” Standard profilers only show CPU time. Your application is waiting - for locks, I/O, or other processes. eBPF off-CPU analysis reveals where time actually goes.

Tested on: Linux 5.15, bcc-tools 0.28, Go 1.21 application with mutex contention

Understanding Off-CPU Time

CPU Time vs Wall Time

Request lifecycle:

Wall time: 500ms (what user experiences)
├── CPU time: 100ms (what standard profilers see)
│   ├── 50ms: JSON parsing
│   ├── 30ms: Business logic
│   └── 20ms: Response serialization

└── Off-CPU time: 400ms (INVISIBLE to CPU profilers!)
    ├── 200ms: Waiting for database
    ├── 150ms: Blocked on mutex
    └── 50ms: Network I/O wait

CPU profiler says: "JSON parsing is 50% of CPU time"
Reality: "Mutex contention is 30% of request time"

Why Standard Tools Miss This

What pprof shows (CPU profiler):
┌─────────────────────────────────────────────────────────────┐
│ Function              │ CPU Time │  % Total                 │
├─────────────────────────────────────────────────────────────┤
│ encoding/json.Unmarshal │  50ms  │   50%                    │
│ main.processOrder       │  30ms  │   30%                    │
│ net/http.writeResponse  │  20ms  │   20%                    │
└─────────────────────────────────────────────────────────────┘
Total: 100ms (but request took 500ms!)

What eBPF off-CPU shows:
┌─────────────────────────────────────────────────────────────┐
│ Wait Reason           │ Off-CPU  │  % Total                 │
├─────────────────────────────────────────────────────────────┤
│ futex (mutex)         │  150ms   │   30%                    │
│ epoll_wait (network)  │  200ms   │   40%                    │
│ read (disk)           │   50ms   │   10%                    │
│ On-CPU                │  100ms   │   20%                    │
└─────────────────────────────────────────────────────────────┘
Total: 500ms (the full picture!)

Using BCC Tools

Installation

# Ubuntu/Debian
apt-get install bpfcc-tools linux-headers-$(uname -r)

# RHEL/CentOS
yum install bcc-tools

# Tools are in /usr/share/bcc/tools/
# Or run with: python3 /usr/share/bcc/tools/offcputime

Basic Off-CPU Analysis

# Find PID of your application
pgrep -f "myapp"
# Example: 12345

# Capture off-CPU stacks for 30 seconds
/usr/share/bcc/tools/offcputime -df -p 12345 30 > offcpu.stacks

# Output format:
# stack_trace;stack_trace;... time_in_microseconds
# Example:
# main.handler;sync.Mutex.Lock;runtime.futex 150000
# main.handler;database/sql.Query;net.Read 200000

Visualize with Flame Graph

# Install flamegraph tools
git clone https://github.com/brendangregg/FlameGraph

# Generate flame graph
./FlameGraph/flamegraph.pl --color=io \
    --title="Off-CPU Time" \
    offcpu.stacks > offcpu.svg

# Open in browser - RED/ORANGE shows off-CPU time

Live Analysis

# Real-time off-CPU summary
/usr/share/bcc/tools/offcputime -p 12345 --state 2

# --state 2 = TASK_INTERRUPTIBLE (waiting for I/O or locks)

# Sample output:
# Tracing off-CPU time (us) of PID 12345... Ctrl-C to end.
#
# [30 second trace]
#
#     waker:  mysql-server
#     runtime.futexsleep
#     runtime.notesleep
#     runtime.mPark
#     ...
#     database/sql.(*DB).Query
#     main.handleRequest
#         - 5234521 us (5.2 seconds total waiting for MySQL)

Common Patterns

Mutex Contention

# Find futex waits (mutex/condition variable)
/usr/share/bcc/tools/offcputime -p 12345 -K 30 2>&1 | grep futex

# Example output:
# runtime.futexsleep
# sync.(*Mutex).Lock
# main.(*Cache).Get        <- Contention here!
# main.handleRequest
#     12345678 us
// Problematic code (found by eBPF):
type Cache struct {
    mu   sync.Mutex  // Single lock for entire cache
    data map[string][]byte
}

func (c *Cache) Get(key string) []byte {
    c.mu.Lock()         // All requests serialize here
    defer c.mu.Unlock()
    return c.data[key]
}

// Solution: Sharded locks
type ShardedCache struct {
    shards [256]struct {
        mu   sync.RWMutex
        data map[string][]byte
    }
}

func (c *ShardedCache) Get(key string) []byte {
    shard := &c.shards[hash(key)%256]
    shard.mu.RLock()  // Read lock, less contention
    defer shard.mu.RUnlock()
    return shard.data[key]
}

Database Connection Pool Exhaustion

# Find waits on network sockets
/usr/share/bcc/tools/offcputime -p 12345 30 | grep -E "epoll|read|write"

# If seeing long waits in database driver:
# net.(*netFD).Read
# database/sql.(*DB).Query
#     25000000 us  <- Waiting 25s for DB connection!

File I/O Blocking

# Trace file system waits
/usr/share/bcc/tools/filetop -p 12345 30

# Or with offcputime
/usr/share/bcc/tools/offcputime -p 12345 30 | grep -E "read|write|fsync"

# Common issue: synchronous file writes
# os.(*File).Write
# main.logToFile
#     5000000 us  <- 5s blocked on file I/O

Advanced: Profile Specific Functions

Using bpftrace

# Install bpftrace
apt-get install bpftrace

# Trace time spent in specific function
bpftrace -e '
uprobe:/path/to/myapp:main.slowFunction {
    @start[tid] = nsecs;
}
uretprobe:/path/to/myapp:main.slowFunction /@start[tid]/ {
    @latency = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
'

# Output: histogram of function execution time

Trace Lock Contention

# Trace futex contention
bpftrace -e '
tracepoint:syscalls:sys_enter_futex {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_futex /@start[tid]/ {
    $dur = nsecs - @start[tid];
    if ($dur > 1000000) {  // > 1ms
        @slow[ustack, $dur / 1000000] = count();
    }
    delete(@start[tid]);
}
'

Continuous Profiling Integration

Parca Integration

# Parca agent config for off-CPU profiling
apiVersion: v1
kind: ConfigMap
metadata:
  name: parca-agent-config
data:
  parca-agent.yaml: |
    profiling:
      cpu: true
      offcpu: true  # Enable off-CPU profiling
      offcpu_threshold_ms: 10  # Capture waits > 10ms

Pyroscope Integration

// Enable off-CPU profiling in Pyroscope
import "github.com/grafana/pyroscope-go"

pyroscope.Start(pyroscope.Config{
    ApplicationName: "myapp",
    ServerAddress:   "http://pyroscope:4040",
    ProfileTypes: []pyroscope.ProfileType{
        pyroscope.ProfileCPU,
        pyroscope.ProfileBlockProfile,    // Blocking/mutex
        pyroscope.ProfileGoroutineProfile,
    },
})

Monitoring and Alerting

Prometheus Metrics from eBPF

# Custom eBPF exporter for off-CPU metrics
from bcc import BPF
from prometheus_client import Histogram, start_http_server

offcpu_duration = Histogram(
    'offcpu_duration_seconds',
    'Time spent off-CPU',
    ['reason'],
    buckets=[0.001, 0.01, 0.1, 0.5, 1, 5, 10]
)

# BPF program to track off-CPU time
bpf_text = """
#include <uapi/linux/ptrace.h>

BPF_HASH(start, u32);
BPF_HISTOGRAM(dist);

int oncpu(struct pt_regs *ctx, struct task_struct *prev) {
    u32 pid = prev->pid;
    u64 ts = bpf_ktime_get_ns();
    start.update(&pid, &ts);
    return 0;
}
"""

Alert on High Off-CPU Time

groups:
- name: offcpu
  rules:
  - alert: HighOffCPURatio
    expr: |
      (rate(process_offcpu_seconds_total[5m]) /
       rate(process_cpu_seconds_total[5m])) > 4
    for: 10m
    annotations:
      summary: "App spending 4x more time waiting than computing"

  - alert: MutexContention
    expr: |
      rate(go_sync_mutex_wait_total_seconds[5m]) > 0.1
    for: 5m
    annotations:
      summary: "High mutex contention detected"

Checklist

## Off-CPU Analysis

### Investigation
- [ ] Check if CPU usage << latency suggests off-CPU time
- [ ] Run offcputime for 30-60 seconds
- [ ] Generate flame graph for visualization
- [ ] Identify top off-CPU stacks

### Common Causes
- [ ] Mutex contention (futex waits)
- [ ] Database connection pool exhaustion
- [ ] Synchronous file I/O
- [ ] Network I/O waits
- [ ] Channel blocking in Go

### Solutions
- [ ] Shard locks for reduced contention
- [ ] Increase connection pool size
- [ ] Use async I/O where possible
- [ ] Add buffering to reduce syscalls

### Continuous Monitoring
- [ ] Enable off-CPU profiling in Parca/Pyroscope
- [ ] Alert on high off-CPU ratio
- [ ] Dashboard showing CPU vs wall time

Conclusion

When CPU is low but latency is high, use eBPF:

  1. Standard profilers miss waiting time - they only see CPU
  2. offcputime reveals the full picture - locks, I/O, everything
  3. Flame graphs make it visual - spot bottlenecks instantly
  4. Continuous profiling catches regressions - before users notice

Your application spends more time waiting than working. Find out what it’s waiting for.


Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "eBPF Off-CPU Analysis: Finding Latency That Metrics Miss". https://www.michal-drozd.com/en/blog/ebpf-off-cpu-debugging/ (Published September 7, 2025).