Go cgo DNS Resolution Thread Explosion: When net.LookupHost Spawns Thousands of Threads

The first time I saw thousands of DNS threads, I thought Go was broken. “Our Go service suddenly has 10,000 OS threads and is OOMKilled.” The Kubernetes pod kept getting killed despite having plenty of heap memory available. Go’s memory profiler showed normal heap usage. But pprof showed thread count growing unbounded—5,000, 8,000, 10,000 OS threads until the container hit its memory limit and died.

The cause was one of Go’s hidden gotchas: when DNS lookups use the cgo resolver (via libc’s getaddrinfo()), each lookup spawns a real OS thread that blocks until DNS responds. With slow DNS (>100ms latency) and high concurrency (5,000 concurrent requests), you suddenly have 5,000 blocked OS threads, each consuming stack memory. The goroutine model’s efficiency disappears—you’re paying the full cost of OS threads.

What made this especially confusing was that Go is supposed to be good at concurrency. Goroutines are lightweight. You can run millions of them. But that’s only true when they stay on the Go scheduler. The moment a goroutine calls into cgo, it needs a real OS thread, and that thread is blocked until the cgo call returns. With slow DNS, the threads pile up faster than they drain.

The insidious part is that this only happens under specific conditions: cgo-based DNS resolver (not pure Go), slow DNS responses, and high concurrency. In development, with fast local DNS and lower concurrency, everything works fine. In production, with corporate DNS servers and thousands of concurrent requests, your service explodes.

Environment: Go 1.21, Kubernetes with CoreDNS, high-concurrency HTTP client, corporate network with slow DNS

The Problem

Thread Explosion Timeline

Normal operation:
- 50 goroutines doing HTTP requests
- 10 OS threads (GOMAXPROCS + runtime threads)
- Memory: 100MB

During DNS slowdown:
- 5000 goroutines doing HTTP requests
- 5000+ OS threads (!!)
- Memory: 2GB+ and climbing
- Eventually: OOMKilled

# Check thread count:
cat /proc/$(pgrep myapp)/status | grep Threads
# Threads: 5234  <- Way too many!
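
The same check can be done from inside the process. Here is a minimal Go sketch that reads the Threads line from /proc/self/status. It assumes Linux, and the helper name parseThreads is mine, not part of any library.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseThreads extracts the value of the "Threads:" line from the text of a
// /proc/<pid>/status file.
func parseThreads(status string) (int, error) {
	for _, line := range strings.Split(status, "\n") {
		if strings.HasPrefix(line, "Threads:") {
			value := strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
			return strconv.Atoi(value)
		}
	}
	return 0, fmt.Errorf("no Threads line found")
}

func main() {
	// /proc/self/status exists only on Linux; elsewhere this prints an error.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	n, err := parseThreads(string(data))
	fmt.Println("OS threads:", n, "err:", err)
}
```

Logging this value periodically next to runtime.NumGoroutine() makes it easy to see whether threads are tracking goroutines one-to-one.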

The Innocent-Looking Code

// Simple HTTP client - what could go wrong?
func fetchURL(url string) (*Response, error) {
    // This does DNS resolution internally
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    // ...
}

// High concurrency
for i := 0; i < 5000; i++ {
    go fetchURL("https://api.example.com/data")
}

// If DNS is slow, each of these can spawn an OS thread!

Root Cause

Go Has Two DNS Resolvers

// Go can resolve DNS two ways:

// 1. Pure Go resolver (default on most Linux)
//    - Uses goroutines
//    - Non-blocking
//    - Respects GOMAXPROCS
//    - Reads /etc/resolv.conf directly

// 2. cgo resolver (uses system libc)
//    - Runs each lookup on a real OS thread
//    - BLOCKS the thread until DNS responds
//    - Holds one thread per in-flight lookup
//    - Uses getaddrinfo()

// Check which resolver is being used:
// Set GODEBUG=netdns=2 to see logs
// "go package net: using cgo DNS resolver"
// "go package net: using Go's DNS resolver"

When Does Go Use cgo Resolver?

// Go uses cgo resolver when ANY of these are true:

// 1. CGO_ENABLED=1 and:
//    - /etc/nsswitch.conf lists sources Go does not implement itself
//    - /etc/resolv.conf has options Go does not recognize
//    - System uses mDNS or custom NSS modules

// 2. Force cgo with:
//    export GODEBUG=netdns=cgo

// Common triggers (the exact rules vary between Go versions):
// /etc/nsswitch.conf:
hosts: files mdns4_minimal [NOTFOUND=return] dns  // mdns can trigger cgo!

// /etc/resolv.conf:
options rotate  // Recognized by recent Go versions; unrecognized options trigger cgo
options edns0   // This one doesn't, but check your config
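
To spot these entries quickly, a small helper can list everything on the hosts line other than files and dns. This is only a rough heuristic under my own assumptions: the function name nonStandardSources is hypothetical, and Go's real selection logic considers more than this. Still, anything it prints is worth a closer look.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// nonStandardSources returns the entries on the "hosts:" line of an
// nsswitch.conf that are neither "files" nor "dns". Bracketed action
// items such as [NOTFOUND=return] are skipped.
func nonStandardSources(conf string) []string {
	var out []string
	for _, line := range strings.Split(conf, "\n") {
		// Drop trailing comments before splitting into fields.
		if i := strings.IndexByte(line, '#'); i >= 0 {
			line = line[:i]
		}
		fields := strings.Fields(line)
		if len(fields) == 0 || fields[0] != "hosts:" {
			continue
		}
		inAction := false
		for _, f := range fields[1:] {
			if strings.HasPrefix(f, "[") {
				inAction = true
			}
			if inAction {
				// An action may span several fields; it ends at "]".
				if strings.HasSuffix(f, "]") {
					inAction = false
				}
				continue
			}
			if f == "files" || f == "dns" {
				continue
			}
			out = append(out, f)
		}
	}
	return out
}

func main() {
	data, err := os.ReadFile("/etc/nsswitch.conf")
	if err != nil {
		fmt.Println("cannot read nsswitch.conf:", err)
		return
	}
	fmt.Println("sources to review:", nonStandardSources(string(data)))
}
```

An empty result does not prove the pure Go resolver is in use; GODEBUG=netdns=2 remains the authoritative check.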

The Thread Explosion Mechanism

Pure Go resolver (safe):
┌──────────────────────────────────────────────────┐
│ Goroutine 1 ──► DNS query ──► epoll wait         │
│ Goroutine 2 ──► DNS query ──► epoll wait         │
│ Goroutine 3 ──► DNS query ──► epoll wait         │
│                                                  │
│ All share same OS threads via netpoller          │
│ 5000 goroutines = still ~10 OS threads           │
└──────────────────────────────────────────────────┘

cgo resolver (dangerous):
┌──────────────────────────────────────────────────┐
│ Goroutine 1 ──► cgo ──► getaddrinfo() ──► BLOCK  │
│                         (needs own OS thread!)   │
│ Goroutine 2 ──► cgo ──► getaddrinfo() ──► BLOCK  │
│                         (needs own OS thread!)   │
│ ...                                              │
│ Goroutine 5000 ──► cgo ──► getaddrinfo() ──► ... │
│                                                  │
│ 5000 goroutines = 5000+ OS threads!              │
│ Each thread gets its own OS-allocated stack      │
│ (often 8 MB virtual by default on glibc), plus   │
│ kernel structures and other per-thread overhead  │
└──────────────────────────────────────────────────┘
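
A back-of-the-envelope way to see how the threads pile up is Little's law: lookups in flight = arrival rate × lookup latency. The sketch below applies it to an assumed load of 5000 new lookups per second. The function name and the numbers are illustrative, not measurements from the incident.

```go
package main

import "fmt"

// inFlightLookups applies Little's law: the average number of lookups in
// progress equals the arrival rate multiplied by the time each one takes.
// With the cgo resolver, each in-progress lookup holds one OS thread.
func inFlightLookups(lookupsPerSec, latencySec float64) float64 {
	return lookupsPerSec * latencySec
}

func main() {
	// Hypothetical load: 5000 new lookups per second at three latencies.
	for _, latency := range []float64{0.002, 0.1, 1.0} {
		fmt.Printf("latency %.3fs -> ~%.0f blocked threads\n",
			latency, inFlightLookups(5000, latency))
	}
}
```

At a couple of milliseconds per lookup the thread count stays near ten; at one second per lookup the same request rate needs thousands of threads.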

Diagnosis

Check Which Resolver Is Used

# Run with debug logging
GODEBUG=netdns=2 ./myapp 2>&1 | head -20

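# Look for:
# "go package net: using Go's DNS resolver"  <- Good
# "go package net: using cgo DNS resolver"   <- Danger!
# With netdns=2, recent Go versions also print the order chosen per lookup,
# e.g. "hostLookupOrder(api.example.com) = cgo" (exact wording varies by version)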

# Check why cgo was selected:
GODEBUG=netdns=2 ./myapp 2>&1 | grep -i "cgo\|dns"

Monitor Thread Count

# Watch thread count in real-time
watch -n 1 'cat /proc/$(pgrep myapp)/status | grep Threads'

# Or with pprof
curl 'http://localhost:6060/debug/pprof/threadcreate?debug=1'

# Prometheus metric
process_threads  # Should be stable, not climbing
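
If you would rather export the number from the service itself, the threadcreate profile is available through runtime/pprof. A minimal sketch follows; threadsCreated is a name I made up. Note that it counts threads the runtime has created, which may differ slightly from the live count in /proc.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// threadsCreated reports how many OS threads the Go runtime has created,
// using the same "threadcreate" profile that the pprof endpoint serves.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

func main() {
	// A healthy service has many goroutines and few threads.
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("threads created:", threadsCreated())
}
```

When the two numbers start moving together, goroutines are no longer multiplexed onto a small thread pool.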

Check DNS Resolution Time

# From inside container/pod
time nslookup api.example.com

# If DNS takes > 1 second, you'll see thread explosion
# under high concurrency with cgo resolver

# Check CoreDNS latency
kubectl logs -n kube-system -l k8s-app=kube-dns | grep -i slow

The Fix

Option 1: Force Pure Go Resolver

# Environment variable (most reliable)
export GODEBUG=netdns=go

# Or at build time (compile without cgo)
CGO_ENABLED=0 go build -o myapp .

# Kubernetes deployment
env:
  - name: GODEBUG
    value: "netdns=go"

Option 2: Simplify nsswitch.conf

# Check current config
cat /etc/nsswitch.conf | grep hosts

# Potentially problematic (can trigger cgo, depending on Go version and the name being resolved):
hosts: files mdns4_minimal [NOTFOUND=return] dns myhostname

# Simple (allows pure Go):
hosts: files dns

# In container, you control the base image
# Alpine uses musl libc - different behavior!
# Debian/Ubuntu with simplified nsswitch.conf works with pure Go

Option 3: DNS Caching and Connection Pooling

// Reduce DNS lookups with proper HTTP client configuration
import (
    "net"
    "net/http"
    "time"
)

var client = &http.Client{
    Transport: &http.Transport{
        // Reuse connections to avoid repeated DNS lookups
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100,
        IdleConnTimeout:     90 * time.Second,

        // Custom dialer that uses the pure Go resolver for this client only.
        // Note: this does not cache results; it only avoids the cgo path.
        DialContext: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
            Resolver: &net.Resolver{
                PreferGo: true,  // Force pure Go resolver
            },
        }).DialContext,
    },
}

Option 4: Limit Concurrent DNS Lookups

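// Use semaphore to limit concurrent DNS operations.
// This only covers lookups that go through lookupWithLimit; lookups made
// internally by http.Get are not affected.
var dnsSem = make(chan struct{}, 100)  // Max 100 concurrent lookups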

func lookupWithLimit(host string) ([]net.IP, error) {
    dnsSem <- struct{}{}        // Acquire
    defer func() { <-dnsSem }() // Release

    return net.LookupIP(host)
}

Option 5: Use External DNS Cache

# Deploy node-local DNS cache (Kubernetes)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-local-dns
spec:
  # ... NodeLocal DNSCache configuration
  # Reduces DNS latency significantly

Monitoring

groups:
  - name: go-dns-threads
    rules:
      - alert: GoThreadExplosion
        expr: |
          process_threads{job="myapp"} > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Go app has {{ $value }} threads (expected < 100)"

      - alert: HighDNSLatency
        expr: |
          histogram_quantile(0.99,
            rate(dns_lookup_duration_seconds_bucket[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DNS lookup p99 > 1 second"

Checklist

## Go cgo DNS Thread Explosion

### Symptoms
- [ ] Thread count climbing unbounded
- [ ] Memory usage growing with thread count
- [ ] OOMKilled despite low Go heap usage
- [ ] Correlates with DNS latency or high concurrency

### Diagnosis
- [ ] Check resolver: GODEBUG=netdns=2
- [ ] Monitor thread count via /proc or pprof
- [ ] Measure DNS resolution latency
- [ ] Check /etc/nsswitch.conf for cgo triggers

### Fixes
- [ ] Set GODEBUG=netdns=go
- [ ] Or build with CGO_ENABLED=0
- [ ] Simplify nsswitch.conf if possible
- [ ] Implement DNS caching
- [ ] Limit concurrent DNS lookups
- [ ] Deploy node-local DNS cache

Conclusion

This failure mode is a perfect example of how abstractions can leak in unexpected ways. Go’s goroutine model is brilliantly efficient—until you hit a code path that escapes the scheduler. DNS resolution via cgo is one of those escape paths, and it’s surprisingly easy to trigger accidentally.

The fundamental issue is that Go’s standard library uses heuristics to choose between pure Go DNS and cgo DNS. If your nsswitch.conf has certain entries, or your resolv.conf has certain options, Go silently switches to cgo—and your lightweight goroutines become heavyweight OS threads.

The fix is straightforward once you know to look: force pure Go DNS with GODEBUG=netdns=go or build with CGO_ENABLED=0. But the real lesson is about monitoring. Thread count should be a core metric for any Go service. If it’s climbing unbounded, goroutines are most likely blocked outside the scheduler, either in cgo calls (DNS or otherwise) or in blocking system calls.

Key principles:

- Force pure Go resolver with GODEBUG=netdns=go in production
- Build without cgo when possible (CGO_ENABLED=0) for maximum predictability
- Monitor thread count - it should be stable around GOMAXPROCS, not climbing
- Cache DNS at application level (connection pooling) or infrastructure level (local DNS cache)
- Limit concurrent DNS lookups with semaphores if you can’t eliminate cgo

The Go runtime is designed for millions of goroutines. But the operating system is not designed for thousands of OS threads. Know where your code might escape the goroutine scheduler, and plan accordingly.
