Ephemeral Port Exhaustion: The Node That 'Goes Bad'

The first time I saw ‘cannot assign requested address’ in Kubernetes, I thought it was DNS. “One node randomly goes bad—connections to external APIs fail, but only from that node.” The pattern was maddening. Three identical nodes in an autoscaling group. Same pods, same configuration, same traffic distribution. But periodically, one node would “go bad”—all its pods would fail to connect to external services while the other nodes worked fine.

Restarting pods on the bad node would temporarily fix the problem. Moving pods to other nodes helped. But within 30 minutes, another node would start failing the same way. We chased application bugs, DNS issues, and network policies before discovering the actual cause: ephemeral port exhaustion.

Every outbound TCP connection needs a source port. The kernel allocates these from a pool of about 28,000 ports (32768-60999 by default). When traffic goes through SNAT, as it does for pods connecting to external services, all pods on a node share the same pool of source ports. Strictly, the limit applies per destination IP and port, so the pressure comes from many connections to the same external endpoint. If you create connections to that endpoint faster than old ones are released, you eventually run out.

What made this particularly confusing was that pod metrics looked completely normal. CPU, memory, network I/O—all fine. The resource that was exhausted (ephemeral ports) isn’t tracked by standard container metrics. And because it’s a node-level resource, it affects all pods on the node equally, making it look like a “node going bad” rather than a resource exhaustion problem.

The service mesh made things worse by multiplying connections. Each request through the sidecar proxy creates two connections (app→sidecar→external) instead of one. Add retry logic and you can have 6 connections per application request. At high traffic volumes, that’s thousands of ports consumed per second.

Environment: Kubernetes 1.27, Istio service mesh, high-traffic API gateway pods

The Problem

Symptoms

What we observed:

Node A (healthy):
  Pod 1 → External API → 200 OK
  Pod 2 → External API → 200 OK

Node B (bad):
  Pod 3 → External API → connection refused / timeout
  Pod 4 → External API → connection refused / timeout

But:
- All pods showed Ready
- No OOM, no CPU throttling
- Same image, same config
- Restarting pods temporarily helped
- Problem returned after ~30 minutes

Why Standard Debugging Misses This

# Pod metrics look fine
kubectl top pod -n api-gateway
# All pods: CPU 20%, Memory 40%

# Logs show connection failures but no root cause
kubectl logs api-gateway-xyz
# ERROR: connection refused to external-api.com:443

# Even node metrics look okay
kubectl top node problem-node
# CPU: 45%, Memory: 60%

# The problem is invisible until you check ephemeral ports

Root Cause

The Ephemeral Port Problem

How TCP connections work:

┌─────────────┐                    ┌─────────────┐
│   Client    │                    │   Server    │
│             │   Source Port      │             │
│             │   (ephemeral)      │             │
│   :34567 ───┼──────────────────▶ ├──── :443    │
│   :34568 ───┼──────────────────▶ ├──── :443    │
│   :34569 ───┼──────────────────▶ ├──── :443    │
└─────────────┘                    └─────────────┘

Default ephemeral port range: 32768-60999 = ~28,000 ports

When SNAT is involved (NodePort, external traffic):
All pods on a node share the node's ephemeral ports!

┌──────────────────────────────────────────────────┐
│                    Node                          │
│  ┌─────┐ ┌─────┐ ┌─────┐                        │
│  │Pod 1│ │Pod 2│ │Pod 3│  All share             │
│  └──┬──┘ └──┬──┘ └──┬──┘  same ports            │
│     │       │       │     for SNAT              │
│     └───────┴───────┴────▶ iptables MASQUERADE  │
│                           32768-60999           │
└──────────────────────────────────────────────────┘

Service Mesh Amplification

Without mesh:
  Pod → External API
  = 1 connection per request

With sidecar proxy:
  Pod → Envoy sidecar → External API
  = 2 connections per request (internal + external)

With aggressive retries:
  Pod → Envoy (retry 3x) → External API
  = Up to 6 connections per request

With connection pooling disabled/broken:
  Each request = new TCP connection
  High RPS = port exhaustion in minutes
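
To put a number on "port exhaustion in minutes", here is a rough sketch of the arithmetic. The function name and the input values are illustrative assumptions, and the model is deliberately pessimistic: every request opens fresh connections to the same destination and nothing is reused.

```shell
# Sketch: estimate how quickly a node burns through its source ports.
# All inputs are illustrative assumptions, not measured values.

# seconds_to_exhaustion POOL_SIZE RPS CONNS_PER_REQUEST
seconds_to_exhaustion() {
  local pool=$1 rps=$2 per_request=$3
  local rate=$(( rps * per_request ))   # new connections per second
  echo $(( pool / rate ))               # whole seconds, rounded down
}

seconds_to_exhaustion 28232 500 1   # prints 56
seconds_to_exhaustion 28232 500 6   # prints 9
```

A result below 60 means the pool empties before the first TIME_WAIT entries expire. A result of 60 or more means ports are recycled in time and this simple model predicts no exhaustion.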

TIME_WAIT Makes It Worse

# Check TIME_WAIT connections on the node (-H drops the header line)
ss -Htan state time-wait | wc -l
# 25000  <-- Almost at the limit!

# TIME_WAIT lasts 60 seconds on Linux (fixed in the kernel, not a sysctl)
# If you create connections faster than they expire:

Rate: 500 new connections/second
TIME_WAIT duration: 60 seconds
Steady state: 500 * 60 = 30,000 ports in TIME_WAIT

Available ports: ~28,000
Result: Port exhaustion!
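
The same steady-state reasoning can be turned around to ask what connection rate a given pool can sustain. This is a minimal sketch; the function name is made up for illustration, and it assumes every connection goes to one destination and spends the full 60 seconds in TIME_WAIT.

```shell
# Sketch: highest rate of new connections that a port pool can absorb
# when each closed connection holds its port for TIME_WAIT seconds.

# max_new_conn_rate POOL_SIZE TIME_WAIT_SECONDS
max_new_conn_rate() {
  local pool=$1 tw=$2
  echo $(( pool / tw ))   # connections per second, rounded down
}

max_new_conn_rate 28232 60   # default range 32768-60999: prints 470
max_new_conn_rate 64512 60   # expanded range 1024-65535: prints 1075
```

Anything above that rate for more than a minute eats into the pool, which is why 500 connections per second was enough to tip the node over.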

Diagnosis

Step 1: Check Port Usage

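# On the affected node
# Note: ss reports the network namespace it runs in. For pods that are not
# on hostNetwork, run it inside the pod's namespace (for example via nsenter)
# or rely on conntrack in Step 2.
# Count connections by state (-H drops the header line)
ss -Htan | awk '{print $1}' | sort | uniq -c | sort -rn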

# Sample output from exhausted node:
# 25432 TIME-WAIT
#   523 ESTABLISHED
#   156 FIN-WAIT-2
#    42 SYN-SENT

# Check connections to a specific destination port
ss -Htan state time-wait '( dport = :443 )' | wc -l

Step 2: Identify the Culprit Pod

# Check which source (pod) IP owns the most NAT'd connections
# conntrack lists the original source address before SNAT

conntrack -L -p tcp -d <external-api-ip> 2>/dev/null | \
  awk '{print $5}' | sort | uniq -c | sort -rn | head -10

# Or count TIME-WAIT sockets to port 443 by local address
ss -Htan | awk '$1 == "TIME-WAIT" && $5 ~ /:443$/ {sub(/:[0-9]+$/, "", $4); print $4}' | \
  sort | uniq -c | sort -rn
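
Picking a fixed column out of conntrack output is fragile, because the column position shifts when the state field is absent. A slightly more defensive sketch is to look for the first src= field on each line, which is the original (pre-NAT) source. The function name and the sample lines below are illustrative, not captured from the incident.

```shell
# Sketch: count conntrack entries per original source IP.
# Reads `conntrack -L` style lines on stdin, prints "count ip" sorted by count.
count_by_original_src() {
  awk '{
    for (i = 1; i <= NF; i++) {
      # The first src= field belongs to the original direction (before SNAT).
      if ($i ~ /^src=/) { print substr($i, 5); break }
    }
  }' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Illustrative input (addresses are made up):
count_by_original_src <<'EOF'
tcp      6 86 TIME_WAIT src=10.244.1.15 dst=203.0.113.10 sport=41234 dport=443 src=203.0.113.10 dst=192.168.1.20 sport=443 dport=41234 [ASSURED] mark=0 use=1
tcp      6 91 TIME_WAIT src=10.244.1.15 dst=203.0.113.10 sport=41240 dport=443 src=203.0.113.10 dst=192.168.1.20 sport=443 dport=41240 [ASSURED] mark=0 use=1
tcp      6 431990 ESTABLISHED src=10.244.1.22 dst=203.0.113.10 sport=50112 dport=443 src=203.0.113.10 dst=192.168.1.20 sport=443 dport=50112 [ASSURED] mark=0 use=1
EOF
# 2 10.244.1.15
# 1 10.244.1.22
```

On a real node the input would come from something like `conntrack -L -d <external-api-ip> 2>/dev/null | count_by_original_src`, and the IPs can be mapped back to pods with `kubectl get pods -o wide`.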

Step 3: Verify Port Range

# Check current ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range
# 32768	60999  (default, about 28,000 ports)

# Check net.ipv4.tcp_tw_reuse setting
cat /proc/sys/net/ipv4/tcp_tw_reuse
# 0 = disabled, 1 = enabled, 2 = loopback only (the default on newer kernels)
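
The "about 28,000" figure follows directly from the two numbers in ip_local_port_range, since both ends are inclusive. A small sketch of that calculation, with a hypothetical helper name:

```shell
# Sketch: number of ports in an inclusive LOW..HIGH range.

# port_range_size LOW HIGH
port_range_size() {
  local low=$1 high=$2
  echo $(( high - low + 1 ))   # +1 because both ends are usable
}

port_range_size 32768 60999   # default range:  prints 28232
port_range_size 1024 65535    # expanded range: prints 64512
```

On a node it can be fed straight from the kernel with `port_range_size $(cat /proc/sys/net/ipv4/ip_local_port_range)`, leaving the substitution unquoted so the two values split into separate arguments.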

The Fix

Option 1: Expand Port Range

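# On all nodes (via DaemonSet or node config)
# Caution: with a range starting at 1024, a service that starts later can
# find its listening port already taken by an outbound connection.
# Reserve such ports with net.ipv4.ip_local_reserved_ports.
sysctl -w net.ipv4.ip_local_port_range="1024 65535"

# Persistent via /etc/sysctl.d/
echo "net.ipv4.ip_local_port_range = 1024 65535" > /etc/sysctl.d/99-ephemeral-ports.conf

# DaemonSet to apply sysctl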
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuning
  template:
    metadata:
      labels:
        app: sysctl-tuning
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.tcp_fin_timeout=30
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9

Option 2: Enable TCP Connection Reuse

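# Allow reusing TIME_WAIT sockets for new outbound connections
# (needs net.ipv4.tcp_timestamps enabled, which is the default)
sysctl -w net.ipv4.tcp_tw_reuse=1

# Shorten the FIN-WAIT-2 timeout for orphaned connections.
# Note: this does not change TIME_WAIT, which is fixed at 60 seconds on Linux.
sysctl -w net.ipv4.tcp_fin_timeout=30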
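
Setting tcp_tw_reuse is only part of the picture: reuse for external destinations needs the value 1 (the value 2 covers loopback traffic only) and it relies on TCP timestamps being enabled. The helper below is a hypothetical sketch that encodes just those two conditions.

```shell
# Sketch: would TIME_WAIT reuse apply to outbound connections to external hosts?

# tw_reuse_applies TCP_TW_REUSE TCP_TIMESTAMPS
tw_reuse_applies() {
  local tw_reuse=$1 timestamps=$2
  if [ "$tw_reuse" = "1" ] && [ "$timestamps" != "0" ]; then
    echo yes
  else
    echo no
  fi
}

tw_reuse_applies 1 1   # prints yes
tw_reuse_applies 2 1   # prints no (mode 2 is loopback only)
tw_reuse_applies 1 0   # prints no (timestamps disabled)
```

On a node the inputs would be the contents of /proc/sys/net/ipv4/tcp_tw_reuse and /proc/sys/net/ipv4/tcp_timestamps. The remote side also has to negotiate timestamps for reuse to take effect, which this check cannot see.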

Option 3: Fix the Application

# Istio: Enable connection pooling
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: external-api.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 10s
      http:
        h2UpgradePolicy: UPGRADE  # Use HTTP/2 multiplexing
        maxRequestsPerConnection: 1000
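
// Go: Reuse one HTTP client with connection pooling
import (
    "net/http"
    "time"
)

var httpClient = &http.Client{
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 100, // default is 2, too low for one busy upstream
        IdleConnTimeout:     90 * time.Second,
        // Critical: keep connections alive
        DisableKeepAlives:   false,
    },
    Timeout: 30 * time.Second,
}

// DON'T do this inside the request path:
// client := &http.Client{Transport: &http.Transport{}}  // New pool each time!
// Also read and close resp.Body, or the connection cannot be reused.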

Option 4: Use HTTP/2 or gRPC

HTTP/1.1: One request at a time per connection (reusable with keep-alive)
HTTP/2:   Multiple concurrent requests per connection (multiplexing)

With HTTP/2:
  1000 RPS = could use just 10-50 connections
  vs HTTP/1.1 without keep-alive = 1000 new connections per second

Switch to HTTP/2 for external APIs where possible
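
One way to estimate how many long-lived connections a workload needs is Little's law: requests in flight equal request rate times latency. The sketch below applies that and then divides by the number of concurrent streams a connection can carry. The function name, the 100 ms latency, and the 100-stream limit are assumptions chosen for illustration.

```shell
# Sketch: persistent connections needed for a given load.

# connections_needed RPS LATENCY_MS STREAMS_PER_CONNECTION
connections_needed() {
  local rps=$1 latency_ms=$2 streams=$3
  # Little's law: requests in flight = arrival rate x time in system (rounded up).
  local in_flight=$(( (rps * latency_ms + 999) / 1000 ))
  # Round up: a partly used connection still occupies a port.
  echo $(( (in_flight + streams - 1) / streams ))
}

connections_needed 1000 100 1     # HTTP/1.1 keep-alive: prints 100
connections_needed 1000 100 100   # HTTP/2, 100 streams: prints 1
```

Either way, these connections are opened once and kept, so they occupy a fixed handful of ports instead of consuming a new port per request.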

Monitoring

Prometheus Metrics

groups:
  - name: ephemeral-ports
    rules:
      - alert: EphemeralPortExhaustion
        expr: |
          node_sockstat_TCP_tw > 20000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TIME_WAIT connections on {{ $labels.instance }}"
          description: "{{ $value }} connections in TIME_WAIT, risk of port exhaustion"

      - alert: HighConnectionRate
        expr: |
          rate(node_netstat_Tcp_PassiveOpens[5m]) +
          rate(node_netstat_Tcp_ActiveOpens[5m]) > 1000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "High TCP connection rate on {{ $labels.instance }}"

Quick Health Check Script

#!/bin/bash
# port-exhaustion-check.sh

echo "=== Ephemeral Port Status ==="

# Current port range (inclusive on both ends)
read -r LOW HIGH < /proc/sys/net/ipv4/ip_local_port_range
echo "Port range: $LOW $HIGH"
TOTAL=$((HIGH - LOW + 1))
echo "Total available: $TOTAL"

# Ports in use (-H suppresses the header so it is not counted)
TIME_WAIT=$(ss -Htan state time-wait | wc -l)
ESTABLISHED=$(ss -Htan state established | wc -l)
echo "TIME_WAIT: $TIME_WAIT"
echo "ESTABLISHED: $ESTABLISHED"

# Utilization
USED=$((TIME_WAIT + ESTABLISHED))
PERCENT=$((USED * 100 / TOTAL))
echo "Utilization: $PERCENT%"

if [ "$PERCENT" -gt 80 ]; then
  echo "WARNING: Port exhaustion risk!"
fi

# Top destinations in TIME_WAIT
echo -e "\n=== Top TIME_WAIT destinations ==="
ss -Htan state time-wait | awk '{print $4}' | sort | uniq -c | sort -rn | head -5

Checklist

## Ephemeral Port Exhaustion

### Symptoms
- [ ] One node fails while others work
- [ ] Connection refused to external services
- [ ] Pods look healthy (CPU/memory OK)
- [ ] Problem returns after pod restarts

### Diagnosis
- [ ] Check TIME_WAIT count: ss -tan state time-wait | wc -l
- [ ] Check port range: cat /proc/sys/net/ipv4/ip_local_port_range
- [ ] Identify top connection destinations
- [ ] Look for missing connection pooling

### Fixes
- [ ] Expand port range (1024-65535)
- [ ] Enable tcp_tw_reuse
- [ ] Enable connection pooling in app/mesh
- [ ] Switch to HTTP/2 where possible
- [ ] Reduce connection-per-request patterns

Conclusion

This failure mode is a perfect example of how container abstractions can hide system-level problems. Kubernetes does an excellent job of isolating pods from each other, but ephemeral ports are fundamentally a node-level resource that gets shared via SNAT. No amount of pod isolation can change that.

The pattern of “one node goes bad” is the key diagnostic clue. If the problem were application-level, you’d expect it to affect specific pods regardless of which node they’re on. If it were network-level, you’d expect it to affect traffic patterns consistently. But when an entire node fails to make outbound connections while other nodes are fine, you’re looking at a node-level resource exhaustion—and ephemeral ports are the most common culprit.

The service mesh amplification is particularly insidious. Without a mesh, you might have comfortable headroom on ephemeral ports. Add a sidecar proxy, and suddenly every connection is doubled. Add retries and circuit breaker probes, and you might be consuming 5-10x as many ports per logical request. The system that was fine in pre-mesh testing breaks under the same traffic with mesh enabled.

Key takeaways:

  1. Pod metrics look fine - ephemeral ports are a node-level resource invisible to container metrics
  2. Only affects SNAT traffic - pod-to-pod traffic is not SNAT'd, so each pod draws source ports from its own network namespace
  3. Temporary fixes (restart) work - which hides the pattern and delays root cause analysis
  4. Service mesh amplifies - each proxy hop doubles the connection count

The fix is usually a combination of:

  • System tuning (expand port range to 1024-65535, enable tcp_tw_reuse)
  • Application fixes (connection pooling, HTTP keep-alive, HTTP/2 multiplexing)
  • Architecture changes (reduce short-lived connections, batch operations)

When deploying a service mesh, proactively monitor TIME_WAIT connection counts. It’s much easier to tune the system before you hit exhaustion than to debug it during an incident.


Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "Ephemeral Port Exhaustion: The Node That 'Goes Bad'". https://www.michal-drozd.com/en/blog/ephemeral-port-exhaustion-kubernetes/ (Published November 11, 2024).