Back to blog

Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms

|
| golang, performance, debugging, context, gc, latency

Timers feel free until you have millions of them. Every 30 seconds our p99 jumped and, of course, the network got blamed. Spoiler: the network was innocent.

The real culprit was in our middleware stack. Each layer wrapped the request with context.WithTimeout “just in case”—HTTP server, auth, service client, database driver. Four layers deep means four timers per request. At 10k RPS that’s ~40,000 timers per second, pressure on Go’s runtime timer heap, and extra GC scanning.

In this post I’ll show how to recognize this pattern, why it happens, and how I now structure deadlines so timeouts protect me without creating a timer storm.

Environment: Go 1.21, high-throughput HTTP service, 10k RPS

The Problem

Symptoms

// Latency pattern:
// p50: 5ms (stable)
// p95: 15ms (stable)
// p99: 50ms normally, but spikes to 500ms every 30 seconds

// CPU and memory look normal
// Network latency is stable
// No obvious bottleneck

The Hidden Timer Explosion

// Common pattern - looks innocent:
func handleRequest(w http.ResponseWriter, r *http.Request) {
    // Layer 1: HTTP server timeout
    ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
    defer cancel()

    // Layer 2: Service call timeout
    ctx2, cancel2 := context.WithTimeout(ctx, 10*time.Second)
    defer cancel2()

    result, err := callExternalService(ctx2)
    // ...
}

func callExternalService(ctx context.Context) (Result, error) {
    // Layer 3: gRPC client timeout
    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()

    // Layer 4: Individual RPC timeout
    ctx2, cancel2 := context.WithTimeout(ctx, 2*time.Second)
    defer cancel2()

    return client.Call(ctx2)
}

// At 10k RPS, this creates:
// 10,000 × 4 timers = 40,000 timers PER SECOND

// Timers live for their full duration even if:
// - Request completes early
// - Context is cancelled
// Until Go 1.23, timer cleanup was delayed!

Root Cause

How context.WithTimeout Creates Timers

// Every WithTimeout creates a timer:
func WithTimeout(parent Context, timeout time.Duration) (Context, CancelFunc) {
    return WithDeadline(parent, time.Now().Add(timeout))
}

func WithDeadline(parent Context, d time.Time) (Context, CancelFunc) {
    // ...
    c := &timerCtx{...}
    // THIS CREATES A RUNTIME TIMER
    c.timer = time.AfterFunc(dur, func() {
        c.cancel(true, DeadlineExceeded, cause)
    })
    return c, func() { c.cancel(true, Canceled, nil) }
}

// The timer is added to the global timer heap
// Timer heap operations are O(log n)
// With 40,000 timers, each operation is slower

Timer Heap Pressure

Timer heap under pressure:

Normal (few timers):
┌────────────────────────────┐
│ Timer Heap (100 timers)    │
│ Insert: O(log 100) = fast  │
│ Pop: O(log 100) = fast     │
└────────────────────────────┘

Under load (many timers):
┌────────────────────────────┐
│ Timer Heap (40,000 timers) │
│ Insert: O(log 40000) = 15x │
│ Pop: O(log 40000) = 15x    │
│ GC scan: scanning 40k ptrs │
└────────────────────────────┘

Every timer is a goroutine waiting
GC must scan all of them
Timer operations hold locks
Result: periodic latency spikes

GC Interaction

// Timers contain pointers that GC must scan
type timer struct {
    when   int64         // when timer expires
    period int64         // period for repeating timer
    f      func(any, uintptr)  // function pointer - GC must scan
    arg    any           // argument - GC must scan
    seq    uintptr
}

// With 40,000 timers:
// - 40,000 function pointers to scan
// - 40,000 argument pointers to scan
// - All in a heap structure

// GC STW phases increase with timer count
// More timers = longer GC pauses

Diagnosis

Step 1: Check Timer Count

// Add to your metrics/debugging endpoint
import "runtime"

func timerMetrics() map[string]int {
    var m runtime.MemStats
    runtime.ReadMemStats(&m)

    return map[string]int{
        "num_goroutines": runtime.NumGoroutine(),
        "gc_pause_ns":    int(m.PauseNs[(m.NumGC+255)%256]),
        // Timer count not directly exposed, but goroutines correlate
    }
}

Step 2: Profile Timer Activity

# Capture execution trace
curl "http://localhost:6060/debug/pprof/trace?seconds=30" > trace.out
go tool trace trace.out

# Look for:
# - Timer creation/expiration patterns
# - GC pause durations
# - Scheduling latency

# CPU profile for timer operations
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Look for time.* and runtime.* timer functions

Step 3: Count Context Layers

// Instrument context creation
var timerContextCount atomic.Int64

func WithTimeoutCounted(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
    timerContextCount.Add(1)
    ctx, cancel := context.WithTimeout(parent, timeout)
    return ctx, func() {
        timerContextCount.Add(-1)
        cancel()
    }
}

The Fix

Option 1: Use Single Propagated Deadline

// BEFORE: Nested timeouts
func handleRequest(w http.ResponseWriter, r *http.Request) {
    ctx1, cancel1 := context.WithTimeout(r.Context(), 30*time.Second)
    defer cancel1()

    ctx2, cancel2 := context.WithTimeout(ctx1, 10*time.Second)
    defer cancel2()

    callService(ctx2) // creates more timeouts inside
}

// AFTER: Single deadline, propagated
func handleRequest(w http.ResponseWriter, r *http.Request) {
    // One timeout at the edge
    ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
    defer cancel()

    // Pass through without new timers
    callService(ctx)
}

func callService(ctx context.Context) {
    // Don't add another timeout - parent already has deadline
    // Just check if deadline exists
    if _, ok := ctx.Deadline(); !ok {
        // Only add timeout if none exists
        var cancel context.CancelFunc
        ctx, cancel = context.WithTimeout(ctx, 10*time.Second)
        defer cancel()
    }

    // Use existing deadline
    client.Call(ctx)
}

Option 2: Reuse Timers with Pools

// For custom timeout handling, pool timers
var timerPool = sync.Pool{
    New: func() interface{} {
        return time.NewTimer(time.Hour) // placeholder duration
    },
}

func withPooledTimeout(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
    timer := timerPool.Get().(*time.Timer)
    timer.Reset(timeout)

    ctx, cancel := context.WithCancel(ctx)

    go func() {
        select {
        case <-timer.C:
            cancel()
        case <-ctx.Done():
        }
        timer.Stop()
        timerPool.Put(timer)
    }()

    return ctx, cancel
}

Option 3: Use Deadline Instead of Timeout

// BEFORE: Multiple WithTimeout calls
ctx1, _ := context.WithTimeout(ctx, 30*time.Second)
ctx2, _ := context.WithTimeout(ctx1, 10*time.Second)
ctx3, _ := context.WithTimeout(ctx2, 5*time.Second)
// Creates 3 timers, all tracking the same request

// AFTER: Calculate and use single deadline
deadline := time.Now().Add(5 * time.Second) // smallest timeout
ctx, cancel := context.WithDeadline(ctx, deadline)
defer cancel()
// Only 1 timer

Option 4: Reduce Retry Amplification

// BEFORE: Retries create more timeouts
func callWithRetry(ctx context.Context) error {
    for i := 0; i < 3; i++ {
        ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
        err := call(ctx)
        cancel()
        if err == nil {
            return nil
        }
    }
    return err
}
// 3 timers per request

// AFTER: Single timeout across retries
func callWithRetry(ctx context.Context) error {
    ctx, cancel := context.WithTimeout(ctx, 15*time.Second) // total budget
    defer cancel()

    for i := 0; i < 3; i++ {
        err := call(ctx) // reuses same timer
        if err == nil {
            return nil
        }
        if ctx.Err() != nil {
            return ctx.Err()
        }
    }
    return err
}
// 1 timer per request

Monitoring

groups:
  - name: go-timers
    rules:
      - alert: HighGoroutineCount
        expr: go_goroutines > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High goroutine count (possible timer leak)"

      - alert: GCPauseTooLong
        expr: go_gc_pause_seconds{quantile="1"} > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GC pauses exceeding 50ms"

Checklist

## Go Timer Heap Pressure

### Symptoms
- [ ] Periodic p99 latency spikes
- [ ] Spikes correlate with GC pauses
- [ ] High goroutine count
- [ ] Network looks fine during spikes

### Diagnosis
- [ ] Count context.WithTimeout calls per request
- [ ] Check execution trace for timer patterns
- [ ] Profile for time.*/runtime.* timer functions
- [ ] Monitor GC pause durations

### Fixes
- [ ] Use single propagated deadline
- [ ] Reduce nested timeout layers
- [ ] Eliminate retry amplification
- [ ] Consider timer pooling for custom use cases

Conclusion

This failure mode reveals an uncomfortable truth about Go’s context timeout pattern: it’s easy to misuse in ways that create performance problems. The pattern looks like defense in depth—more timeouts means more protection against hanging requests. But each context.WithTimeout creates a timer that consumes runtime resources. At high throughput, those resources add up.

The counter-intuitive lesson is that fewer timeouts are often better than more. A single timeout at the edge of your system, propagated through the call chain via the context, creates one timer per request. Nested timeouts create N timers per request, where N is your middleware depth. The request still times out at the same point, but the runtime cost is multiplied.

Go 1.23 improved timer handling to clean up resources faster after cancellation, which helps somewhat. But the fundamental issue remains: if you’re creating tens of thousands of timers per second, you’re putting pressure on the runtime that will eventually manifest as latency spikes.

Key takeaways:

  1. Fewer timeouts is often better - one propagated deadline beats nested timeouts
  2. Timer count matters at scale - 40k timers creates measurable GC and heap pressure
  3. The 30-second pattern is a GC signature - look for it when debugging periodic latency
  4. Check deadline existence before adding another - use ctx.Deadline() to avoid redundant timers

Related posts

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms". https://www.michal-drozd.com/en/blog/go-timer-heap-pressure/ (Published January 15, 2025).