Go p99 Latency Cliffs: Nested context.WithTimeout Timer Storms
Timers feel free until you have millions of them. Every 30 seconds our p99 jumped and, of course, the network got blamed. Spoiler: the network was innocent.
The real culprit was in our middleware stack. Each layer wrapped the request with context.WithTimeout “just in case”—HTTP server, auth, service client, database driver. Four layers deep means four timers per request. At 10k RPS that’s ~40,000 timers per second, pressure on Go’s runtime timer heap, and extra GC scanning.
In this post I’ll show how to recognize this pattern, why it happens, and how I now structure deadlines so timeouts protect me without creating a timer storm.
Environment: Go 1.21, high-throughput HTTP service, 10k RPS
The Problem
Symptoms
// Latency pattern:
// p50: 5ms (stable)
// p95: 15ms (stable)
// p99: 50ms normally, but spikes to 500ms every 30 seconds
// CPU and memory look normal
// Network latency is stable
// No obvious bottleneck
The Hidden Timer Explosion
// Common pattern - looks innocent:
func handleRequest(w http.ResponseWriter, r *http.Request) {
// Layer 1: HTTP server timeout
ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel()
// Layer 2: Service call timeout
ctx2, cancel2 := context.WithTimeout(ctx, 10*time.Second)
defer cancel2()
result, err := callExternalService(ctx2)
// ...
}
func callExternalService(ctx context.Context) (Result, error) {
// Layer 3: gRPC client timeout
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
// Layer 4: Individual RPC timeout
ctx2, cancel2 := context.WithTimeout(ctx, 2*time.Second)
defer cancel2()
return client.Call(ctx2)
}
// At 10k RPS, this creates:
// 10,000 × 4 timers = 40,000 timers PER SECOND
// Timers live for their full duration even if:
// - Request completes early
// - Context is cancelled
// Until Go 1.23, timer cleanup was delayed!
Root Cause
How context.WithTimeout Creates Timers
// Every WithTimeout creates a timer:
func WithTimeout(parent Context, timeout time.Duration) (Context, CancelFunc) {
return WithDeadline(parent, time.Now().Add(timeout))
}
func WithDeadline(parent Context, d time.Time) (Context, CancelFunc) {
// ...
c := &timerCtx{...}
// THIS CREATES A RUNTIME TIMER
c.timer = time.AfterFunc(dur, func() {
c.cancel(true, DeadlineExceeded, cause)
})
return c, func() { c.cancel(true, Canceled, nil) }
}
// The timer is added to the global timer heap
// Timer heap operations are O(log n)
// With 40,000 timers, each operation is slower
Timer Heap Pressure
Timer heap under pressure:
Normal (few timers):
┌────────────────────────────┐
│ Timer Heap (100 timers) │
│ Insert: O(log 100) = fast │
│ Pop: O(log 100) = fast │
└────────────────────────────┘
Under load (many timers):
┌────────────────────────────┐
│ Timer Heap (40,000 timers) │
│ Insert: O(log 40000) = 15x │
│ Pop: O(log 40000) = 15x │
│ GC scan: scanning 40k ptrs │
└────────────────────────────┘
Every timer is a goroutine waiting
GC must scan all of them
Timer operations hold locks
Result: periodic latency spikes
GC Interaction
// Timers contain pointers that GC must scan
type timer struct {
when int64 // when timer expires
period int64 // period for repeating timer
f func(any, uintptr) // function pointer - GC must scan
arg any // argument - GC must scan
seq uintptr
}
// With 40,000 timers:
// - 40,000 function pointers to scan
// - 40,000 argument pointers to scan
// - All in a heap structure
// GC STW phases increase with timer count
// More timers = longer GC pauses
Diagnosis
Step 1: Check Timer Count
// Add to your metrics/debugging endpoint
import "runtime"
func timerMetrics() map[string]int {
var m runtime.MemStats
runtime.ReadMemStats(&m)
return map[string]int{
"num_goroutines": runtime.NumGoroutine(),
"gc_pause_ns": int(m.PauseNs[(m.NumGC+255)%256]),
// Timer count not directly exposed, but goroutines correlate
}
}
Step 2: Profile Timer Activity
# Capture execution trace
curl "http://localhost:6060/debug/pprof/trace?seconds=30" > trace.out
go tool trace trace.out
# Look for:
# - Timer creation/expiration patterns
# - GC pause durations
# - Scheduling latency
# CPU profile for timer operations
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
# Look for time.* and runtime.* timer functions
Step 3: Count Context Layers
// Instrument context creation
var timerContextCount atomic.Int64
func WithTimeoutCounted(parent context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
timerContextCount.Add(1)
ctx, cancel := context.WithTimeout(parent, timeout)
return ctx, func() {
timerContextCount.Add(-1)
cancel()
}
}
The Fix
Option 1: Use Single Propagated Deadline
// BEFORE: Nested timeouts
func handleRequest(w http.ResponseWriter, r *http.Request) {
ctx1, cancel1 := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel1()
ctx2, cancel2 := context.WithTimeout(ctx1, 10*time.Second)
defer cancel2()
callService(ctx2) // creates more timeouts inside
}
// AFTER: Single deadline, propagated
func handleRequest(w http.ResponseWriter, r *http.Request) {
// One timeout at the edge
ctx, cancel := context.WithTimeout(r.Context(), 30*time.Second)
defer cancel()
// Pass through without new timers
callService(ctx)
}
func callService(ctx context.Context) {
// Don't add another timeout - parent already has deadline
// Just check if deadline exists
if _, ok := ctx.Deadline(); !ok {
// Only add timeout if none exists
var cancel context.CancelFunc
ctx, cancel = context.WithTimeout(ctx, 10*time.Second)
defer cancel()
}
// Use existing deadline
client.Call(ctx)
}
Option 2: Reuse Timers with Pools
// For custom timeout handling, pool timers
var timerPool = sync.Pool{
New: func() interface{} {
return time.NewTimer(time.Hour) // placeholder duration
},
}
func withPooledTimeout(ctx context.Context, timeout time.Duration) (context.Context, context.CancelFunc) {
timer := timerPool.Get().(*time.Timer)
timer.Reset(timeout)
ctx, cancel := context.WithCancel(ctx)
go func() {
select {
case <-timer.C:
cancel()
case <-ctx.Done():
}
timer.Stop()
timerPool.Put(timer)
}()
return ctx, cancel
}
Option 3: Use Deadline Instead of Timeout
// BEFORE: Multiple WithTimeout calls
ctx1, _ := context.WithTimeout(ctx, 30*time.Second)
ctx2, _ := context.WithTimeout(ctx1, 10*time.Second)
ctx3, _ := context.WithTimeout(ctx2, 5*time.Second)
// Creates 3 timers, all tracking the same request
// AFTER: Calculate and use single deadline
deadline := time.Now().Add(5 * time.Second) // smallest timeout
ctx, cancel := context.WithDeadline(ctx, deadline)
defer cancel()
// Only 1 timer
Option 4: Reduce Retry Amplification
// BEFORE: Retries create more timeouts
func callWithRetry(ctx context.Context) error {
for i := 0; i < 3; i++ {
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
err := call(ctx)
cancel()
if err == nil {
return nil
}
}
return err
}
// 3 timers per request
// AFTER: Single timeout across retries
func callWithRetry(ctx context.Context) error {
ctx, cancel := context.WithTimeout(ctx, 15*time.Second) // total budget
defer cancel()
for i := 0; i < 3; i++ {
err := call(ctx) // reuses same timer
if err == nil {
return nil
}
if ctx.Err() != nil {
return ctx.Err()
}
}
return err
}
// 1 timer per request
Monitoring
groups:
- name: go-timers
rules:
- alert: HighGoroutineCount
expr: go_goroutines > 10000
for: 5m
labels:
severity: warning
annotations:
summary: "High goroutine count (possible timer leak)"
- alert: GCPauseTooLong
expr: go_gc_pause_seconds{quantile="1"} > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "GC pauses exceeding 50ms"
Checklist
## Go Timer Heap Pressure
### Symptoms
- [ ] Periodic p99 latency spikes
- [ ] Spikes correlate with GC pauses
- [ ] High goroutine count
- [ ] Network looks fine during spikes
### Diagnosis
- [ ] Count context.WithTimeout calls per request
- [ ] Check execution trace for timer patterns
- [ ] Profile for time.*/runtime.* timer functions
- [ ] Monitor GC pause durations
### Fixes
- [ ] Use single propagated deadline
- [ ] Reduce nested timeout layers
- [ ] Eliminate retry amplification
- [ ] Consider timer pooling for custom use cases
Conclusion
This failure mode reveals an uncomfortable truth about Go’s context timeout pattern: it’s easy to misuse in ways that create performance problems. The pattern looks like defense in depth—more timeouts means more protection against hanging requests. But each context.WithTimeout creates a timer that consumes runtime resources. At high throughput, those resources add up.
The counter-intuitive lesson is that fewer timeouts are often better than more. A single timeout at the edge of your system, propagated through the call chain via the context, creates one timer per request. Nested timeouts create N timers per request, where N is your middleware depth. The request still times out at the same point, but the runtime cost is multiplied.
Go 1.23 improved timer handling to clean up resources faster after cancellation, which helps somewhat. But the fundamental issue remains: if you’re creating tens of thousands of timers per second, you’re putting pressure on the runtime that will eventually manifest as latency spikes.
Key takeaways:
- Fewer timeouts is often better - one propagated deadline beats nested timeouts
- Timer count matters at scale - 40k timers creates measurable GC and heap pressure
- The 30-second pattern is a GC signature - look for it when debugging periodic latency
- Check deadline existence before adding another - use
ctx.Deadline()to avoid redundant timers
Related Articles
- Go Memory Profiling - Understanding Go memory
- Go cgo DNS Thread Explosion - Another Go runtime trap
Related posts
Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads
Go application suddenly has 10,000 threads consuming all memory. The cause: cgo-based DNS resolution blocking in slow DNS environments, bypassing Go's goroutine scheduler.
eBPF Off-CPU Analysis: Finding Latency That Metrics Miss
CPU is at 20% but latency is 500ms. Standard profilers show nothing. The app is waiting, not computing. I show how to use eBPF to find what it's waiting for.
gRPC Keepalive Mismatch: Transport Closing After Idle
gRPC connections randomly close with 'transport is closing'. The cause: client and server keepalive settings don't match, causing the server to terminate idle connections.
PostgreSQL Read Replica Conflicts: Why Your Queries Get Canceled
Queries on read replicas fail with 'canceling statement due to conflict with recovery'. The fix depends on which of the 5 conflict types you have - here's how to diagnose and solve each one.
Cite this article
If you reference this post, please link to the original URL and credit the author.