CoreDNS vs NodeLocal DNS Cache: Cutting Kubernetes DNS Latency by 10x

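We benchmarked NodeLocal DNSCache after a DNS incident we don’t want to repeat. The question that started it: “Why does every HTTP call add 5ms of latency?” Every service call that opens a new connection starts with a DNS lookup, and by default your pods send those lookups to CoreDNS over the network. With NodeLocal DNSCache answering from a cache on the same node, median lookup latency in our tests dropped to about 0.2ms.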

Tested on: Kubernetes 1.28, CoreDNS 1.11, NodeLocal DNSCache 1.22, 50-node cluster

The DNS Bottleneck

How Kubernetes DNS Works

Without NodeLocal DNS Cache:

Pod → kube-dns Service (ClusterIP) → CoreDNS Pod
     └─ Network hop (5-20ms)        └─ Possibly on different node

DNS path:
1. Pod makes DNS query (UDP)
2. Query goes to kube-dns ClusterIP (10.96.0.10)
3. kube-proxy/iptables routes to CoreDNS pod
4. CoreDNS resolves (cache hit or upstream query)
5. Response returns through same path

The Problem

Typical web request DNS lookups:
1. Service discovery: api.default.svc.cluster.local
2. Database: postgres.db.svc.cluster.local
3. Cache: redis.cache.svc.cluster.local
4. External API: api.stripe.com

4 sequential, uncached DNS lookups × 5ms = 20ms of added latency per request!

At 1000 RPS:
- 4000 DNS queries/sec to CoreDNS
- CoreDNS becomes a bottleneck
- Tail latency increases
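The arithmetic above can be written down as a small Go sketch. It assumes the lookups run one after another and none is served from a cache, which is the worst case; the function names are ours, for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// dnsOverhead returns the DNS time added to one request when its lookups
// run sequentially and none is answered from a cache.
func dnsOverhead(lookups int, perLookup time.Duration) time.Duration {
	return time.Duration(lookups) * perLookup
}

// dnsQPS returns the DNS query rate produced by a given request rate.
func dnsQPS(requestsPerSec, lookupsPerRequest int) int {
	return requestsPerSec * lookupsPerRequest
}

func main() {
	fmt.Println(dnsOverhead(4, 5*time.Millisecond)) // 20ms
	fmt.Println(dnsQPS(1000, 4))                    // 4000
}
```

Connection reuse and client-side caching lower the real number, so treat this as an upper bound rather than a measurement.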

NodeLocal DNS Cache

How It Works

With NodeLocal DNS Cache:

Pod → NodeLocal DaemonSet → CoreDNS (only on cache miss)
     └─ Local (0.2ms)      └─ Network (5ms, rare)

NodeLocal runs as a DaemonSet:
- One pod per node
- Listens on link-local IP (169.254.20.10)
- Caches responses locally
- Falls back to CoreDNS on miss

Installation

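# Download the NodeLocal DNS manifest
curl -LO https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/dns/nodelocaldns/nodelocaldns.yaml

# The manifest contains __PILLAR__ placeholders; fill them in before applying
# (values shown are for kube-proxy in iptables mode)
kubedns=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')
sed -i "s/__PILLAR__LOCAL__DNS__/169.254.20.10/g; s/__PILLAR__DNS__DOMAIN__/cluster.local/g; s/__PILLAR__DNS__SERVER__/$kubedns/g" nodelocaldns.yaml
kubectl apply -f nodelocaldns.yaml

# Or with Helm (the old "stable" chart repository is archived; verify the chart
# source and value names for the chart you actually use)
helm install nodelocaldns stable/nodelocaldns \
  --set config.localDNS=169.254.20.10 \
  --set config.clusterDNS=10.96.0.10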

Configuration

# nodelocaldns-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-local-dns
  namespace: kube-system
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
            success 9984 30  # Up to 9984 entries, TTL capped at 30 seconds
            denial 9984 5    # Negative answers (e.g. NXDOMAIN), TTL capped at 5 seconds
        }
        reload
        loop
        bind 169.254.20.10
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
        prometheus :9253
        health 169.254.20.10:8080
    }
    in-addr.arpa:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
        }
        prometheus :9253
    }
    .:53 {
        errors
        cache 30
        reload
        loop
        bind 169.254.20.10
        forward . __PILLAR__UPSTREAM__SERVERS__
        prometheus :9253
    }

Pod Configuration

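# Option 1: Modify kubelet to use NodeLocal
# /var/lib/kubelet/config.yaml (restart kubelet; only newly created pods pick this up)
clusterDNS:
  - 169.254.20.10  # NodeLocal first
  - 10.96.0.10     # CoreDNS as second resolver; failover behavior depends on the pod's libc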

# Option 2: Per-pod dnsConfig
apiVersion: v1
kind: Pod
spec:
  dnsPolicy: None
  dnsConfig:
    nameservers:
      - 169.254.20.10
    searches:
      - default.svc.cluster.local
      - svc.cluster.local
      - cluster.local
    options:
      - name: ndots
        value: "5"

Benchmark Results

Test Setup

// dns_benchmark_test.go (run with: go test -bench=DNSLookup)
package main

import (
    "net"
    "testing"
)

func BenchmarkDNSLookup(b *testing.B) {
    hosts := []string{
        "kubernetes.default.svc.cluster.local",
        "kube-dns.kube-system.svc.cluster.local",
    }

    for i := 0; i < b.N; i++ {
        for _, host := range hosts {
            _, err := net.LookupHost(host)
            if err != nil {
                b.Fatal(err)
            }
        }
    }
}
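A `testing.B` benchmark reports mean time per iteration, not percentiles. To get p50/p99 figures like the ones in the results, you need to record one sample per lookup and rank them. The following is a minimal sketch of that step using the nearest-rank method; the sample values are invented and the helper name is ours.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (0 < p <= 100) of
// latency samples given in milliseconds. It sorts a copy, so the caller's
// slice is left untouched. An empty input yields 0.
func percentile(samplesMs []float64, p float64) float64 {
	if len(samplesMs) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Invented samples; in practice, wrap each net.LookupHost call with
	// time.Now() / time.Since() and append the elapsed time.
	samples := []float64{0.20, 0.18, 5.0, 0.19, 0.21}
	fmt.Println("p50:", percentile(samples, 50)) // p50: 0.2
	fmt.Println("p99:", percentile(samples, 99)) // p99: 5
}
```

With only five samples the p99 is simply the slowest lookup. Tail percentiles only become meaningful with thousands of samples.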

Results

CoreDNS Only (network path):
  Latency p50:    5.2ms
  Latency p99:    28.4ms
  Latency p999:   89.2ms
  Queries/sec:    8,500

NodeLocal DNS Cache (local path):
  Latency p50:    0.18ms  (29x faster)
  Latency p99:    0.45ms  (63x faster)
  Latency p999:   1.2ms   (74x faster)
  Queries/sec:    45,000  (5x higher)

Cache hit rate: 92% (typical production)
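The speedup figures are simple ratios, and the hit rate suggests a rough estimate of mean latency that the median hides. The sketch below assumes a miss costs about as much as the CoreDNS-only p50. That is our simplification, not something we measured, since a real miss also pays the local hop.

```go
package main

import "fmt"

// speedup returns how many times faster "after" is than "before".
func speedup(beforeMs, afterMs float64) float64 {
	return beforeMs / afterMs
}

// expectedLatency blends hit and miss latency by cache hit rate.
// It models the mean only; it says nothing about tail behavior.
func expectedLatency(hitRate, hitMs, missMs float64) float64 {
	return hitRate*hitMs + (1-hitRate)*missMs
}

func main() {
	fmt.Printf("p50 speedup: %.0fx\n", speedup(5.2, 0.18))
	fmt.Printf("modeled mean: %.2fms\n", expectedLatency(0.92, 0.18, 5.2))
}
```

Under that assumption the mean is roughly 0.58ms: the 8% of queries that miss account for most of it, which is why cache TTLs and hit rate matter as much as the local hop itself.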

Load Test

# Using dnsperf
dnsperf -s 169.254.20.10 -d queries.txt -l 60 -c 100

# Results with NodeLocal:
# Queries sent:       2,812,456
# Queries completed:  2,812,456
# Queries lost:       0 (0.00%)
# Response codes:     NOERROR 2,812,456 (100.00%)
# Average latency:    0.21ms
# Maximum latency:    2.34ms

Monitoring

Prometheus Metrics

# Cache hit rate (adjust the server label to match what your instances export)
sum(rate(coredns_cache_hits_total{server="dns://:53"}[5m]))
/
sum(rate(coredns_dns_requests_total{server="dns://:53"}[5m]))

# DNS latency (NodeLocal)
histogram_quantile(0.99,
  sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le)
)

# Upstream forward latency (CoreDNS)
histogram_quantile(0.99,
  sum(rate(coredns_forward_request_duration_seconds_bucket[5m])) by (le)
)

Alert Rules

groups:
- name: dns
  rules:
  - alert: DNSLatencyHigh
    expr: |
      histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[5m])) by (le))
      > 0.01
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "DNS p99 latency > 10ms"

  - alert: NodeLocalDNSDown
    expr: |
      up{job="nodelocaldns"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "NodeLocal DNS not running on {{ $labels.node }}"

Troubleshooting

DNS Not Using NodeLocal

# Check resolv.conf in pod
kubectl exec -it mypod -- cat /etc/resolv.conf

# Should show:
# nameserver 169.254.20.10

# If it shows 10.96.0.10, check the kubelet config
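If you want to run this check from inside an application or a startup probe instead of by hand, a small parser is enough. This is a sketch assuming standard resolv.conf syntax; the function names are ours, and in a real pod you would pass in the contents of /etc/resolv.conf (for example via os.ReadFile).

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// nameservers extracts the nameserver addresses from resolv.conf content,
// in the order they appear. Comment lines and other directives are skipped.
func nameservers(resolvConf string) []string {
	var out []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "nameserver" {
			out = append(out, fields[1])
		}
	}
	return out
}

// usesNodeLocal reports whether the first nameserver is the NodeLocal address.
func usesNodeLocal(resolvConf, localAddr string) bool {
	ns := nameservers(resolvConf)
	return len(ns) > 0 && ns[0] == localAddr
}

func main() {
	conf := "search default.svc.cluster.local svc.cluster.local cluster.local\n" +
		"nameserver 169.254.20.10\n" +
		"options ndots:5\n"
	fmt.Println(usesNodeLocal(conf, "169.254.20.10")) // true
}
```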

NodeLocal Pod Crashing

# Check logs
kubectl logs -n kube-system -l k8s-app=node-local-dns

# Common issues:
# - Port conflict (another process on 53)
# - Link-local IP already in use
# - Insufficient permissions (needs NET_ADMIN)

Cache Not Working

# Check cache stats
kubectl exec -n kube-system node-local-dns-xxxxx -- \
  wget -qO- http://localhost:9253/metrics | grep cache

# Look for:
# coredns_cache_hits_total
# coredns_cache_misses_total
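The two counters turn into a hit ratio with a little parsing. The sketch below sums each counter across all label sets in the text returned by the metrics endpoint. It assumes sample lines carry no trailing timestamp, and the sample text in `main` is invented for illustration.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sumMetric adds up every sample of the named metric in Prometheus text
// exposition output, across all label sets. Lines it cannot parse are skipped.
func sumMetric(metricsText, name string) float64 {
	var total float64
	sc := bufio.NewScanner(strings.NewReader(metricsText))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // blank line or HELP/TYPE comment
		}
		// Sample lines look like `name{labels} value` or `name value`.
		if !strings.HasPrefix(line, name+"{") && !strings.HasPrefix(line, name+" ") {
			continue
		}
		fields := strings.Fields(line)
		if v, err := strconv.ParseFloat(fields[len(fields)-1], 64); err == nil {
			total += v
		}
	}
	return total
}

// hitRatio returns hits / (hits + misses), or 0 when there is no traffic.
func hitRatio(metricsText string) float64 {
	hits := sumMetric(metricsText, "coredns_cache_hits_total")
	misses := sumMetric(metricsText, "coredns_cache_misses_total")
	if hits+misses == 0 {
		return 0
	}
	return hits / (hits + misses)
}

func main() {
	// Invented sample output, not real measurements.
	sample := `# TYPE coredns_cache_hits_total counter
coredns_cache_hits_total{type="success"} 900
coredns_cache_hits_total{type="denial"} 20
coredns_cache_misses_total 80
`
	fmt.Printf("hit ratio: %.2f\n", hitRatio(sample)) // hit ratio: 0.92
}
```

A ratio near zero on a busy node usually means queries are not reaching NodeLocal at all, which points back to the resolv.conf check above.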

Production Configuration

Optimized Settings

# nodelocaldns-configmap.yaml
data:
  Corefile: |
    cluster.local:53 {
        errors
        cache {
            success 9984 60    # Up to 9984 entries, TTL capped at 60 seconds
            denial 9984 10     # Negative answers (e.g. NXDOMAIN), TTL capped at 10 seconds
            prefetch 10 1m 10% # Prefetch popular entries
        }
        reload
        loop
        bind 169.254.20.10
        forward . __PILLAR__CLUSTER__DNS__ {
            force_tcp
            max_concurrent 1000  # Higher concurrency
        }
        prometheus :9253
        health 169.254.20.10:8080
        ready 169.254.20.10:8181
    }

Resource Limits

# DaemonSet resource limits
resources:
  requests:
    cpu: 25m
    memory: 32Mi
  limits:
    cpu: 100m
    memory: 128Mi

ndots Optimization

The Problem

# Default ndots=5 in Kubernetes
# Query: api.stripe.com

# DNS search order:
1. api.stripe.com.default.svc.cluster.local (NXDOMAIN)
2. api.stripe.com.svc.cluster.local (NXDOMAIN)
3. api.stripe.com.cluster.local (NXDOMAIN)
4. api.stripe.com. (SUCCESS)

# 4 DNS queries for one external name!

Solution

# Pod spec with reduced ndots
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"  # Reduced from 5

# Or add trailing dot for external names
# api.stripe.com. (absolute name, no search)
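The search-list expansion is easy to reproduce in code, which helps when deciding on an ndots value. This sketch follows the usual stub-resolver rule as we understand it: a name with fewer dots than ndots goes through the search list first, otherwise it is tried as-is first, and a trailing dot skips the search list entirely. The function name is ours, and real resolvers stop at the first name that resolves.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates lists the fully qualified names a stub resolver would try,
// in order, for the given name, search list, and ndots setting.
func candidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // absolute name: search list is skipped
	}
	absolute := name + "."
	var expanded []string
	for _, domain := range search {
		expanded = append(expanded, name+"."+domain+".")
	}
	if strings.Count(name, ".") >= ndots {
		return append([]string{absolute}, expanded...) // try as-is first
	}
	return append(expanded, absolute) // search list first, as-is last
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}

	fmt.Println("ndots=5:")
	for _, c := range candidates("api.stripe.com", search, 5) {
		fmt.Println("  ", c)
	}
	fmt.Println("ndots=2, first try:", candidates("api.stripe.com", search, 2)[0])
	fmt.Println("trailing dot:", candidates("api.stripe.com.", search, 5))
}
```

With ndots=2 or a trailing dot, api.stripe.com resolves on the first attempt instead of the fourth. The trade-off of a lower ndots is that partially qualified in-cluster names with two or more dots are tried as external names first.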

Checklist

## NodeLocal DNS Cache Setup

### Installation
- [ ] Deploy NodeLocal DaemonSet
- [ ] Configure kubelet clusterDNS
- [ ] Verify pods use 169.254.20.10

### Configuration
- [ ] Set appropriate cache TTLs
- [ ] Enable prefetch for popular entries
- [ ] Configure resource limits

### Monitoring
- [ ] Dashboard with cache hit rate
- [ ] Alert on DNS latency > 10ms
- [ ] Alert on NodeLocal pod failures

### Optimization
- [ ] Consider reducing ndots
- [ ] Use absolute DNS names for external services
- [ ] Monitor cache hit rates

Conclusion

DNS is a hidden Kubernetes bottleneck:

- Every service call needs a DNS lookup
- CoreDNS over the network adds 5-20ms per query
- NodeLocal cache reduces that to about 0.2ms at p50 (29x faster)
- 92% cache hit rate in production

Install NodeLocal DNS Cache and cut your tail latency.
