Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads
The first time I saw thousands of DNS threads, I thought Go was broken. “Our Go service suddenly has 10,000 OS threads and is OOMKilled.” The Kubernetes pod kept getting killed despite having plenty of heap memory available. Go’s memory profiler showed normal heap usage. But pprof showed thread count growing unbounded—5,000, 8,000, 10,000 OS threads until the container hit its memory limit and died.
The cause was one of Go’s hidden gotchas: when DNS lookups go through the cgo resolver (libc’s getaddrinfo()), each in-flight lookup occupies a real OS thread that stays blocked until DNS responds. With slow DNS (>100ms latency) and high concurrency (5,000 concurrent requests), you suddenly have 5,000 blocked OS threads, each with its own stack. The goroutine model’s efficiency disappears, and you pay the full cost of OS threads.
What made this especially confusing was that Go is supposed to be good at concurrency. Goroutines are lightweight. You can run millions of them. But that’s only true when they stay on the Go scheduler. The moment a goroutine calls into cgo, it needs a real OS thread, and that thread is blocked until the cgo call returns. With slow DNS, the threads pile up faster than they drain.
The insidious part is that this only happens under specific conditions: cgo-based DNS resolver (not pure Go), slow DNS responses, and high concurrency. In development, with fast local DNS and lower concurrency, everything works fine. In production, with corporate DNS servers and thousands of concurrent requests, your service explodes.
Environment: Go 1.21, Kubernetes with CoreDNS, high-concurrency HTTP client, corporate network with slow DNS
The Problem
Thread Explosion Timeline
Normal operation:
- 50 goroutines doing HTTP requests
- 10 OS threads (GOMAXPROCS + runtime threads)
- Memory: 100MB
During DNS slowdown:
- 5000 goroutines doing HTTP requests
- 5000+ OS threads (!!)
- Memory: 2GB+ and climbing
- Eventually: OOMKilled
# Check thread count:
cat /proc/$(pgrep myapp)/status | grep Threads
# Threads: 5234 <- Way too many!
The Innocent-Looking Code
// Simple HTTP client - what could go wrong?
// (needs "io" and "net/http")
func fetchURL(url string) ([]byte, error) {
	// http.Get resolves the hostname internally before it connects
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	// ...
	return io.ReadAll(resp.Body)
}
// High concurrency
for i := 0; i < 5000; i++ {
	go fetchURL("https://api.example.com/data")
}
// If DNS is slow and the cgo resolver is active, each of these can tie up an OS thread!
Root Cause
Go Has Two DNS Resolvers
// Go can resolve DNS two ways:
// 1. Pure Go resolver (default on most Linux)
// - Uses goroutines
// - Non-blocking
// - Respects GOMAXPROCS
// - Reads /etc/resolv.conf directly
// 2. cgo resolver (uses system libc)
// - Each in-flight lookup occupies a real OS thread
// - BLOCKS the thread until DNS responds
// - Thread count grows with the number of concurrent lookups
// - Uses getaddrinfo()
// Check which resolver is being used:
// Set GODEBUG=netdns=2 to see logs such as
// "go package net: using cgo DNS resolver"
// "go package net: using Go's DNS resolver"
// (exact wording varies between Go versions)
When Does Go Use cgo Resolver?
// Go uses cgo resolver when ANY of these are true:
// 1. CGO_ENABLED=1 and:
// - /etc/nsswitch.conf has complex configuration
// - /etc/resolv.conf has unsupported options
// - System uses mDNS or custom NSS modules
// 2. Force cgo with:
// export GODEBUG=netdns=cgo
// Common triggers:
// /etc/nsswitch.conf:
hosts: files mdns4_minimal [NOTFOUND=return] dns // mdns and other non-standard sources can trigger cgo!
// /etc/resolv.conf:
options rotate // Supported by the Go resolver in current releases; options it does not recognize can trigger cgo
options edns0 // This one doesn't, but check your config
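To review a host's configuration against the triggers above, a small parser can list what the hosts line actually contains. This is a minimal sketch: hostsSources and unusualSources are hypothetical helper names, and flagging everything other than files and dns is a coarse review aid of my own, not the decision logic in package net.

```go
package main

import (
	"fmt"
	"strings"
)

// hostsSources returns the lookup sources on the "hosts:" line of an
// nsswitch.conf-style document, skipping action items such as [NOTFOUND=return].
func hostsSources(conf string) []string {
	for _, line := range strings.Split(conf, "\n") {
		// Drop trailing comments and surrounding whitespace.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		line = strings.TrimSpace(line)
		if !strings.HasPrefix(line, "hosts:") {
			continue
		}
		var sources []string
		inAction := false
		for _, field := range strings.Fields(strings.TrimPrefix(line, "hosts:")) {
			if strings.HasPrefix(field, "[") {
				inAction = true
			}
			if inAction {
				// Action items may span several fields; skip until the closing bracket.
				if strings.HasSuffix(field, "]") {
					inAction = false
				}
				continue
			}
			sources = append(sources, field)
		}
		return sources
	}
	return nil
}

// unusualSources reports every source other than "files" and "dns".
// Assumption: anything else deserves a manual look; it may or may not force cgo.
func unusualSources(sources []string) []string {
	var out []string
	for _, s := range sources {
		if s != "files" && s != "dns" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	conf := "passwd: files\nhosts: files mdns4_minimal [NOTFOUND=return] dns myhostname\n"
	fmt.Println(unusualSources(hostsSources(conf))) // prints [mdns4_minimal myhostname]
}
```

An empty result means the hosts line is the simple "files dns" form; anything listed is worth checking against GODEBUG=netdns=2 output.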
The Thread Explosion Mechanism
Pure Go resolver (safe):
┌──────────────────────────────────────────────────┐
│ Goroutine 1 ──► DNS query ──► epoll wait │
│ Goroutine 2 ──► DNS query ──► epoll wait │
│ Goroutine 3 ──► DNS query ──► epoll wait │
│ │
│ All share same OS threads via netpoller │
│ 5000 goroutines = still ~10 OS threads │
└──────────────────────────────────────────────────┘
cgo resolver (dangerous):
┌──────────────────────────────────────────────────┐
│ Goroutine 1 ──► cgo ──► getaddrinfo() ──► BLOCK │
│ (needs own OS thread!) │
│ Goroutine 2 ──► cgo ──► getaddrinfo() ──► BLOCK │
│ (needs own OS thread!) │
│ ... │
│ Goroutine 5000 ──► cgo ──► getaddrinfo() ──► ... │
│ │
│ 5000 goroutines = 5000+ OS threads! │
│ Each thread reserves its own stack │
│ (commonly 8MB virtual on Linux/glibc) │
│ 5000 × 8MB ≈ 40GB reserved address space │
│ Plus all the other per-thread overhead... │
└──────────────────────────────────────────────────┘
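The one-thread-per-blocked-call effect can be reproduced without a C toolchain. The sketch below uses runtime.LockOSThread as a stand-in for a blocking cgo call (an assumption made for portability; it is not what the resolver does internally), because a pinned, blocked goroutine also keeps an OS thread to itself. threadsCreated and pinBlockedGoroutines are hypothetical helper names.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
	"sync"
)

// threadsCreated reports how many OS threads the runtime has created so far,
// using the built-in "threadcreate" profile.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

// pinBlockedGoroutines starts n goroutines that each pin themselves to an OS
// thread and then block. It returns once all n are pinned, along with a
// release function that unblocks them and waits for them to exit.
func pinBlockedGoroutines(n int) (release func()) {
	var ready, done sync.WaitGroup
	stop := make(chan struct{})
	ready.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer done.Done()
			runtime.LockOSThread() // stand-in for entering a blocking cgo call
			defer runtime.UnlockOSThread()
			ready.Done()
			<-stop // while we wait, this thread cannot run any other goroutine
		}()
	}
	ready.Wait()
	return func() {
		close(stop)
		done.Wait()
	}
}

func main() {
	before := threadsCreated()
	release := pinBlockedGoroutines(200)
	fmt.Printf("threads before: %d, with 200 pinned goroutines: %d\n", before, threadsCreated())
	release()
}
```

The same 200 goroutines blocked on a channel without pinning would leave the thread count essentially unchanged, which is the contrast the two diagrams describe.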
Diagnosis
Check Which Resolver Is Used
# Run with debug logging
GODEBUG=netdns=2 ./myapp 2>&1 | head -20
# Look for:
# "go package net: using Go's DNS resolver" <- Good
# "go package net: using cgo DNS resolver" <- Danger!
# Check why cgo was selected:
GODEBUG=netdns=2 ./myapp 2>&1 | grep -i "cgo\|dns"
Monitor Thread Count
# Watch thread count in real-time
watch -n 1 'cat /proc/$(pgrep myapp)/status | grep Threads'
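# Or with pprof (quote the URL so the shell leaves the "?" alone)
curl 'http://localhost:6060/debug/pprof/threadcreate?debug=1'
# Prometheus metric
process_threads # Should be stable, not climbing (with the Prometheus Go client's default collectors, look at go_threads)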
Check DNS Resolution Time
# From inside container/pod
time nslookup api.example.com
# If DNS takes > 1 second, you'll see thread explosion
# under high concurrency with cgo resolver
# Check CoreDNS latency
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i slow
The Fix
Option 1: Force Pure Go Resolver
# Environment variable (most reliable)
export GODEBUG=netdns=go
# Or at build time (compile without cgo)
CGO_ENABLED=0 go build -o myapp .
# Kubernetes deployment
env:
  - name: GODEBUG
    value: "netdns=go"
Option 2: Simplify nsswitch.conf
# Check current config
cat /etc/nsswitch.conf | grep hosts
# Problematic (triggers cgo):
hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname
# Simple (allows pure Go):
hosts: files dns
# In container, you control the base image
# Alpine uses musl libc - different behavior!
# Debian/Ubuntu with simplified nsswitch.conf works with pure Go
Option 3: DNS Caching and Connection Pooling
// Reduce DNS lookups with proper HTTP client configuration
import (
	"net"
	"net/http"
	"time"
)
var client = &http.Client{
	Transport: &http.Transport{
		// Reuse connections to avoid repeated DNS lookups
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100,
		IdleConnTimeout:     90 * time.Second,
		// Custom dialer that pins this client to the pure Go resolver
		// (it does not cache answers; connection reuse is what cuts lookups)
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
			Resolver: &net.Resolver{
				PreferGo: true, // Force pure Go resolver
			},
		}).DialContext,
	},
}
Option 4: Limit Concurrent DNS Lookups
// Use semaphore to limit concurrent DNS operations
var dnsSem = make(chan struct{}, 100) // Max 100 concurrent lookups
func lookupWithLimit(host string) ([]net.IP, error) {
	dnsSem <- struct{}{}        // Acquire
	defer func() { <-dnsSem }() // Release
	return net.LookupIP(host)
}
Option 5: Use External DNS Cache
# Deploy node-local DNS cache (Kubernetes)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
spec:
  # ... NodeLocal DNSCache configuration
  # Reduces DNS latency significantly
Monitoring
groups:
  - name: go-dns-threads
    rules:
      - alert: GoThreadExplosion
        expr: |
          process_threads{job="myapp"} > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Go app has {{ $value }} threads (expected < 100)"
      - alert: HighDNSLatency
        expr: |
          histogram_quantile(0.99,
            rate(dns_lookup_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS lookup p99 > 1 second"
Checklist
## Go cgo DNS Thread Explosion
### Symptoms
- [ ] Thread count climbing unbounded
- [ ] Memory usage growing with thread count
- [ ] OOMKilled despite low Go heap usage
- [ ] Correlates with DNS latency or high concurrency
### Diagnosis
- [ ] Check resolver: GODEBUG=netdns=2
- [ ] Monitor thread count via /proc or pprof
- [ ] Measure DNS resolution latency
- [ ] Check /etc/nsswitch.conf for cgo triggers
### Fixes
- [ ] Set GODEBUG=netdns=go
- [ ] Or build with CGO_ENABLED=0
- [ ] Simplify nsswitch.conf if possible
- [ ] Implement DNS caching
- [ ] Limit concurrent DNS lookups
- [ ] Deploy node-local DNS cache
Conclusion
This failure mode is a perfect example of how abstractions can leak in unexpected ways. Go’s goroutine model is brilliantly efficient—until you hit a code path that escapes the scheduler. DNS resolution via cgo is one of those escape paths, and it’s surprisingly easy to trigger accidentally.
The fundamental issue is that Go’s standard library uses heuristics to choose between pure Go DNS and cgo DNS. If your nsswitch.conf has certain entries, or your resolv.conf has certain options, Go silently switches to cgo—and your lightweight goroutines become heavyweight OS threads.
The fix is straightforward once you know to look: force pure Go DNS with GODEBUG=netdns=go or build with CGO_ENABLED=0. But the real lesson is about monitoring. Thread count should be a core metric for any Go service. If it climbs without bound, goroutines are blocking outside the Go scheduler, most likely in cgo calls (DNS or otherwise) or in blocking system calls.
Key principles:
- Force pure Go resolver with GODEBUG=netdns=go in production
- Build without cgo when possible (CGO_ENABLED=0) for maximum predictability
- Monitor thread count - it should be stable around GOMAXPROCS, not climbing
- Cache DNS at application level (connection pooling) or infrastructure level (local DNS cache)
- Limit concurrent DNS lookups with semaphores if you can’t eliminate cgo
The Go runtime is designed for millions of goroutines. But the operating system is not designed for thousands of OS threads. Know where your code might escape the goroutine scheduler, and plan accordingly.