Ephemeral Port Exhaustion: The Node That 'Goes Bad'
The first time I saw ‘cannot assign requested address’ in Kubernetes, I thought it was DNS. “One node randomly goes bad—connections to external APIs fail, but only from that node.” The pattern was maddening. Three identical nodes in an autoscaling group. Same pods, same configuration, same traffic distribution. But periodically, one node would “go bad”—all its pods would fail to connect to external services while the other nodes worked fine.
Restarting pods on the bad node would temporarily fix the problem. Moving pods to other nodes helped. But within 30 minutes, another node would start failing the same way. We chased application bugs, DNS issues, and network policies before discovering the actual cause: ephemeral port exhaustion.
Every outbound TCP connection needs a source port. The kernel allocates these from a pool of about 28,000 ports (32768-60999 by default). When traffic goes through SNAT—as it does for pods connecting to external services—all pods on a node share the same pool of source ports. If you’re creating connections faster than they close, you eventually run out.
What made this particularly confusing was that pod metrics looked completely normal. CPU, memory, network I/O—all fine. The resource that was exhausted (ephemeral ports) isn’t tracked by standard container metrics. And because it’s a node-level resource, it affects all pods on the node equally, making it look like a “node going bad” rather than a resource exhaustion problem.
The service mesh made things worse by multiplying connections. Each request through the sidecar proxy creates two connections (app→sidecar→external) instead of one. Add retry logic and you can have 6 connections per application request. At high traffic volumes, that’s thousands of ports consumed per second.
Environment: Kubernetes 1.27, Istio service mesh, high-traffic API gateway pods
The Problem
Symptoms
What we observed:
Node A (healthy):
Pod 1 → External API → 200 OK
Pod 2 → External API → 200 OK
Node B (bad):
Pod 3 → External API → connection refused / timeout
Pod 4 → External API → connection refused / timeout
But:
- All pods showed Ready
- No OOM, no CPU throttling
- Same image, same config
- Restarting pods temporarily helped
- Problem returned after ~30 minutes
Why Standard Debugging Misses This
# Pod metrics look fine
kubectl top pod -n api-gateway
# All pods: CPU 20%, Memory 40%
# Logs show connection failures but no root cause
kubectl logs api-gateway-xyz
# ERROR: connection refused to external-api.com:443
# Even node metrics look okay
kubectl top node problem-node
# CPU: 45%, Memory: 60%
# The problem is invisible until you check ephemeral ports
Root Cause
The Ephemeral Port Problem
How TCP connections work:
┌─────────────┐                    ┌─────────────┐
│   Client    │                    │   Server    │
│             │    Source Port     │             │
│             │    (ephemeral)     │             │
│   :34567 ───┼───────────────────▶├──── :443    │
│   :34568 ───┼───────────────────▶├──── :443    │
│   :34569 ───┼───────────────────▶├──── :443    │
└─────────────┘                    └─────────────┘
Default ephemeral port range: 32768-60999 = ~28,000 ports
When SNAT is involved (NodePort, external traffic):
All pods on a node share the node's ephemeral ports!
┌──────────────────────────────────────────────────┐
│                       Node                       │
│  ┌─────┐ ┌─────┐ ┌─────┐                         │
│  │Pod 1│ │Pod 2│ │Pod 3│    All share            │
│  └──┬──┘ └──┬──┘ └──┬──┘    same ports           │
│     │       │       │       for SNAT             │
│     └───────┴───────┴────▶ iptables MASQUERADE   │
│                            32768-60999           │
└──────────────────────────────────────────────────┘
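The "~28,000" figure comes straight from the two numbers in the port range, counted inclusively. A minimal sketch of that arithmetic; `port_range_size` is a helper name I made up for illustration, not part of any tool:

```shell
# port_range_size LOW HIGH
# Prints how many ports the inclusive range LOW..HIGH contains.
port_range_size() {
  local low=$1 high=$2
  echo $(( high - low + 1 ))
}

# Default Linux range, as quoted above.
port_range_size 32768 60999   # prints 28232
```

On a node, the same helper can be fed from the kernel directly: `port_range_size $(cat /proc/sys/net/ipv4/ip_local_port_range)` (left unquoted on purpose so the two fields become two arguments).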
Service Mesh Amplification
Without mesh:
Pod → External API
= 1 connection per request
With sidecar proxy:
Pod → Envoy sidecar → External API
= 2 connections per request (internal + external)
With aggressive retries:
Pod → Envoy (retry 3x) → External API
= Up to 6 connections per request
With connection pooling disabled/broken:
Each request = new TCP connection
High RPS = port exhaustion in minutes
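The multiplication above is easy to put into numbers. This is only a back-of-the-envelope sketch: both function names are hypothetical, the 1000 RPS figure is illustrative rather than measured, and the result is a worst case that assumes no connection is ever reused.

```shell
# connections_per_request HOPS ATTEMPTS
# Worst case: every attempt opens a fresh connection on every hop.
connections_per_request() {
  echo $(( $1 * $2 ))
}

# new_connections_per_second RPS HOPS ATTEMPTS
# Upper bound on connections opened per second when nothing is pooled.
new_connections_per_second() {
  local rps=$1 per_request
  per_request=$(connections_per_request "$2" "$3")
  echo $(( rps * per_request ))
}

connections_per_request 2 3            # sidecar (2 hops), 3 attempts: prints 6
new_connections_per_second 1000 2 3    # illustrative 1000 RPS: prints 6000
```

Pooling on either hop brings the real number down, which is why the fixes later in this post focus on connection reuse.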
TIME_WAIT Makes It Worse
# Check TIME_WAIT connections on the node (-H drops the header line so it is not counted)
ss -Htan state time-wait | wc -l
# 25000 <-- Almost at the limit!
# TIME_WAIT lasts 60 seconds on Linux (a kernel constant, not a sysctl)
# If you create connections faster than they expire:
Rate: 500 new connections/second
TIME_WAIT duration: 60 seconds
Steady state: 500 * 60 = 30,000 ports in TIME_WAIT
Available ports: ~28,000
Result: Port exhaustion!
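The same calculation can be turned around to ask what connection rate a given pool can sustain. A small sketch, with hypothetical function names; it assumes every connection is short-lived and ends up in TIME_WAIT on this side:

```shell
TIME_WAIT_SECONDS=60   # TIME_WAIT length on Linux

# time_wait_steady_state RATE
# Sockets sitting in TIME_WAIT once things settle, given RATE
# short-lived connections closed per second.
time_wait_steady_state() {
  echo $(( $1 * TIME_WAIT_SECONDS ))
}

# max_sustainable_rate POOL
# Highest connection rate (per second, rounded down) that keeps the
# steady-state TIME_WAIT count within POOL ports.
max_sustainable_rate() {
  echo $(( $1 / TIME_WAIT_SECONDS ))
}

time_wait_steady_state 500     # the example above: prints 30000
max_sustainable_rate 28232     # default range 32768-60999: prints 470
```

So with the default range, anything much beyond 470 new short-lived connections per second cannot be sustained. As far as I understand the kernel's behavior, a source port only has to be unique per destination IP and port, so this ceiling bites hardest when most traffic goes to a single external endpoint.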
Diagnosis
Step 1: Check Port Usage
# On the affected node
# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
# Sample output from exhausted node:
# 25432 TIME-WAIT
# 523 ESTABLISHED
# 156 FIN-WAIT-2
# 42 SYN-SENT
# Check connections to specific destination
ss -tan state time-wait dst :443 | wc -l
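Counting with `ss` walks every socket, which gets slow on a node holding tens of thousands of them. The kernel also publishes a TIME_WAIT total on the `TCP:` line of `/proc/net/sockstat`, for whichever network namespace the file is read from. A sketch of pulling that number out; `tw_count` is a hypothetical helper, and the sample input is illustrative (only the 523 and 25432 values are borrowed from the output above):

```shell
# tw_count
# Reads /proc/net/sockstat-formatted text on stdin and prints the
# value that follows the "tw" key on the "TCP:" line.
tw_count() {
  awk '$1 == "TCP:" {
         for (i = 2; i < NF; i += 2)
           if ($i == "tw") { print $(i + 1); exit }
       }'
}

# On a node:  tw_count < /proc/net/sockstat
# Illustrative input in the same format:
tw_count <<'EOF'
sockets: used 1200
TCP: inuse 523 orphan 0 tw 25432 alloc 600 mem 300
UDP: inuse 4 mem 2
EOF
```

This is the same counter that shows up in Prometheus as `node_sockstat_TCP_tw`.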
Step 2: Identify the Culprit Pod
# Count NAT'd connections per source (pod) IP
# conntrack works at the IP level, so this points at a pod, not a process
conntrack -L -d <external-api-ip> 2>/dev/null | \
awk '{print $5}' | sort | uniq -c | sort -rn | head -10
# Or check by source port range usage
ss -tan | awk -F: '/TIME-WAIT.*:443/ {print $2}' | \
cut -d' ' -f1 | sort -n | uniq -c | sort -rn
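Picking the source by column number is fragile, because the position of the first `src=` field depends on the protocol and on whether a state is printed. A slightly more defensive sketch that searches for the field instead; `snat_sources` is a hypothetical helper and every address in the sample input is made up:

```shell
# snat_sources
# Reads `conntrack -L` output on stdin and prints "count source-ip"
# pairs, busiest first. Only the first src= field on each line is
# used, which is the original (pre-NAT) sender.
snat_sources() {
  awk '{
         for (i = 1; i <= NF; i++)
           if ($i ~ /^src=/) { print substr($i, 5); break }
       }' | sort | uniq -c | sort -rn
}

# On a node:  conntrack -L -d <external-api-ip> 2>/dev/null | snat_sources
# Synthetic input for illustration:
snat_sources <<'EOF'
tcp      6 110 TIME_WAIT src=10.244.1.5 dst=203.0.113.10 sport=41000 dport=443 src=203.0.113.10 dst=192.0.2.7 sport=443 dport=41000 [ASSURED] mark=0 use=1
tcp      6 112 TIME_WAIT src=10.244.1.5 dst=203.0.113.10 sport=41001 dport=443 src=203.0.113.10 dst=192.0.2.7 sport=443 dport=41001 [ASSURED] mark=0 use=1
tcp      6 115 TIME_WAIT src=10.244.1.9 dst=203.0.113.10 sport=52000 dport=443 src=203.0.113.10 dst=192.0.2.7 sport=443 dport=52000 [ASSURED] mark=0 use=1
EOF
```

Once an IP stands out, `kubectl get pods -A -o wide --field-selector status.podIP=<ip>` should map it back to a pod.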
Step 3: Verify Port Range
# Check current ephemeral port range
cat /proc/sys/net/ipv4/ip_local_port_range
# 32768 60999 (default, about 28,000 ports)
# Check net.ipv4.tcp_tw_reuse setting
cat /proc/sys/net/ipv4/tcp_tw_reuse
# 0 = disabled, 1 = enabled; recent kernels default to 2 (reuse for loopback traffic only)
The Fix
Option 1: Expand Port Range
# On all nodes (via DaemonSet or node config)
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# Persistent via /etc/sysctl.d/
echo "net.ipv4.ip_local_port_range = 1024 65535" > /etc/sysctl.d/99-ephemeral-ports.conf
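One caveat before rolling this out: a range that starts at 1024 overlaps Kubernetes' default NodePort range (30000-32767), and a host service that starts later can find its fixed port already taken by an outbound connection. A sketch for listing the listeners that sit inside the widened range; `listeners_in_range` is a hypothetical helper and the sample input is illustrative:

```shell
# listeners_in_range LOW HIGH
# Reads `ss -Htln` output on stdin and prints each listening port that
# falls inside LOW..HIGH, one per line, sorted and de-duplicated.
listeners_in_range() {
  awk -v low="$1" -v high="$2" '{
         n = split($4, parts, ":")     # local address is column 4
         port = parts[n] + 0           # port is the text after the last colon
         if (port >= low && port <= high) print port
       }' | sort -n | uniq
}

# On a node:  ss -Htln | listeners_in_range 1024 32767
# Illustrative input:
listeners_in_range 1024 32767 <<'EOF'
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*
LISTEN 0 4096 127.0.0.1:10248 0.0.0.0:*
LISTEN 0 4096 *:10250 *:*
LISTEN 0 4096 [::]:10250 [::]:*
EOF
```

Ports found this way, plus the NodePort range, can be excluded from automatic assignment with the `net.ipv4.ip_local_reserved_ports` sysctl, for example `sysctl -w net.ipv4.ip_local_reserved_ports="10248,10250,30000-32767"`. Treat that list as a starting point to adapt, not a recommendation.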
# DaemonSet to apply sysctl
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sysctl-tuning
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: sysctl-tuning
  template:
    metadata:
      labels:
        app: sysctl-tuning
    spec:
      hostNetwork: true
      hostPID: true
      initContainers:
        - name: sysctl
          image: busybox
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            - |
              sysctl -w net.ipv4.ip_local_port_range="1024 65535"
              sysctl -w net.ipv4.tcp_tw_reuse=1
              sysctl -w net.ipv4.tcp_fin_timeout=30
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
Option 2: Enable TCP Connection Reuse
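# Allow new outbound connections to reuse sockets in TIME_WAIT
sysctl -w net.ipv4.tcp_tw_reuse=1
# Shorten how long orphaned sockets stay in FIN_WAIT_2 (careful - can break some scenarios)
# Note: this does not shorten TIME_WAIT, which is fixed at 60 seconds in the Linux kernel
sysctl -w net.ipv4.tcp_fin_timeout=30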
Option 3: Fix the Application
# Istio: Enable connection pooling
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: external-api
spec:
  host: external-api.com
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 10s
      http:
        h2UpgradePolicy: UPGRADE  # Use HTTP/2 multiplexing
        maxRequestsPerConnection: 1000
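// Go: Reuse one HTTP client (and its connection pool) for the whole process
import (
	"net/http"
	"time"
)

var httpClient = &http.Client{
	Transport: &http.Transport{
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 100, // the default is 2, which forces churn under load
		IdleConnTimeout:     90 * time.Second,
		// Critical: keep connections alive
		DisableKeepAlives: false,
	},
	Timeout: 30 * time.Second,
}

// DON'T build a new http.Client or http.Transport per request:
// each Transport has its own pool, so nothing is ever reused.
// (http.Get itself does pool, because it uses the shared
// http.DefaultClient, but that client has no timeout.)
// Also read each response body to the end and close it, or the
// connection cannot go back into the pool.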
Option 4: Use HTTP/2 or gRPC
HTTP/1.1: One request in flight per connection (keep-alive reuses it, but requests queue)
HTTP/2: Multiple concurrent requests per connection (multiplexing)
With HTTP/2:
1000 RPS = could use just 10-50 connections
vs HTTP/1.1 without keep-alive = 1000 new connections per second
Switch to HTTP/2 for external APIs where possible
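A rough way to size this is Little's law: requests in flight = request rate × latency, and the number of connections is that figure divided by how many requests one connection can carry at once. The sketch below is only an estimate. `connections_needed` is a hypothetical helper, and the one-second latency and the 100 or 20 streams per connection are assumptions chosen to reproduce the figures above, not measurements.

```shell
# connections_needed RPS LATENCY_MS STREAMS_PER_CONN
# Persistent connections needed to carry RPS requests per second when
# each request takes LATENCY_MS and one connection can carry
# STREAMS_PER_CONN requests at once (1 for HTTP/1.1 keep-alive).
connections_needed() {
  local in_flight=$(( ($1 * $2 + 999) / 1000 ))   # ceil(rps * latency in seconds)
  echo $(( (in_flight + $3 - 1) / $3 ))            # ceil(in_flight / streams)
}

connections_needed 1000 1000 1      # HTTP/1.1: prints 1000
connections_needed 1000 1000 100    # HTTP/2, 100 streams: prints 10
connections_needed 1000 1000 20     # HTTP/2, 20 streams: prints 50
```

Lower latency shrinks both numbers proportionally. What matters for ephemeral ports is that these are long-lived connections, so they stop feeding TIME_WAIT.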
Monitoring
Prometheus Metrics
groups:
  - name: ephemeral-ports
    rules:
      - alert: EphemeralPortExhaustion
        expr: |
          node_sockstat_TCP_tw > 20000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High TIME_WAIT connections on {{ $labels.instance }}"
          description: "{{ $value }} connections in TIME_WAIT, risk of port exhaustion"
      - alert: HighConnectionRate
        expr: |
          rate(node_netstat_Tcp_PassiveOpens[5m]) +
          rate(node_netstat_Tcp_ActiveOpens[5m]) > 1000
        for: 5m
        labels:
          severity: info
        annotations:
          summary: "High TCP connection rate on {{ $labels.instance }}"
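The `node_netstat_Tcp_*` metrics are read by node_exporter from the `Tcp:` lines of `/proc/net/snmp`, so the same counters can be spot-checked by hand on a node that is not scraped yet. A sketch, with hypothetical helper names; the 5-second default window is an arbitrary choice:

```shell
# tcp_counter NAME
# Reads /proc/net/snmp-formatted text on stdin and prints the value of
# the Tcp counter NAME (for example ActiveOpens). The file has a line
# of names followed by a line of values, both prefixed "Tcp:".
tcp_counter() {
  awk -v name="$1" '
    $1 == "Tcp:" && !col { for (i = 2; i <= NF; i++) if ($i == name) col = i; next }
    $1 == "Tcp:" && col  { print $col; exit }
  '
}

# active_open_rate [SECONDS]
# Outbound connections opened per second, averaged over SECONDS (default 5).
active_open_rate() {
  local secs=${1:-5} before after
  before=$(tcp_counter ActiveOpens < /proc/net/snmp)
  sleep "$secs"
  after=$(tcp_counter ActiveOpens < /proc/net/snmp)
  echo $(( (after - before) / secs ))
}

# On a node:  active_open_rate 10
```

`ActiveOpens` counts outbound connects, which are the ones that consume ephemeral ports. `PassiveOpens` counts accepted inbound connections, which do not.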
Quick Health Check Script
#!/bin/bash
# port-exhaustion-check.sh
echo "=== Ephemeral Port Status ==="
# Current port range (inclusive on both ends)
read -r LOW HIGH < /proc/sys/net/ipv4/ip_local_port_range
TOTAL=$((HIGH - LOW + 1))
echo "Port range: $LOW $HIGH"
echo "Total available: $TOTAL"
# Ports in use (-H suppresses the header line so it is not counted)
TIME_WAIT=$(ss -Htan state time-wait | wc -l)
ESTABLISHED=$(ss -Htan state established | wc -l)
echo "TIME_WAIT: $TIME_WAIT"
echo "ESTABLISHED: $ESTABLISHED"
# Utilization
USED=$((TIME_WAIT + ESTABLISHED))
PERCENT=$((USED * 100 / TOTAL))
echo "Utilization: $PERCENT%"
if [ "$PERCENT" -gt 80 ]; then
  echo "WARNING: Port exhaustion risk!"
fi
# Top destinations in TIME_WAIT (with a single state filter, column 4 is the peer address)
echo -e "\n=== Top TIME_WAIT destinations ==="
ss -Htan state time-wait | awk '{print $4}' | sort | uniq -c | sort -rn | head -5
Checklist
## Ephemeral Port Exhaustion
### Symptoms
- [ ] One node fails while others work
- [ ] Connection refused to external services
- [ ] Pods look healthy (CPU/memory OK)
- [ ] Problem returns after pod restarts
### Diagnosis
- [ ] Check TIME_WAIT count: ss -tan state time-wait | wc -l
- [ ] Check port range: cat /proc/sys/net/ipv4/ip_local_port_range
- [ ] Identify top connection destinations
- [ ] Look for missing connection pooling
### Fixes
- [ ] Expand port range (1024-65535)
- [ ] Enable tcp_tw_reuse
- [ ] Enable connection pooling in app/mesh
- [ ] Switch to HTTP/2 where possible
- [ ] Reduce connection-per-request patterns
Conclusion
This failure mode is a perfect example of how container abstractions can hide system-level problems. Kubernetes does an excellent job of isolating pods from each other, but ephemeral ports are fundamentally a node-level resource that gets shared via SNAT. No amount of pod isolation can change that.
The pattern of “one node goes bad” is the key diagnostic clue. If the problem were application-level, you’d expect it to affect specific pods regardless of which node they’re on. If it were network-level, you’d expect it to affect traffic patterns consistently. But when an entire node fails to make outbound connections while other nodes are fine, you’re looking at a node-level resource exhaustion—and ephemeral ports are the most common culprit.
The service mesh amplification is particularly insidious. Without a mesh, you might have comfortable headroom on ephemeral ports. Add a sidecar proxy, and suddenly every connection is doubled. Add retries and circuit breaker probes, and you might be consuming 5-10x as many ports per logical request. The system that was fine in pre-mesh testing breaks under the same traffic with mesh enabled.
Key takeaways:
- Pod metrics look fine - ephemeral ports are a node-level resource invisible to container metrics
- Only affects SNAT traffic - internal pod-to-pod traffic uses different port allocation
- Temporary fixes (restart) work - which hides the pattern and delays root cause analysis
- Service mesh amplifies - each proxy hop doubles the connection count
The fix is usually a combination of:
- System tuning (expand port range to 1024-65535, enable tcp_tw_reuse)
- Application fixes (connection pooling, HTTP keep-alive, HTTP/2 multiplexing)
- Architecture changes (reduce short-lived connections, batch operations)
When deploying a service mesh, proactively monitor TIME_WAIT connection counts. It’s much easier to tune the system before you hit exhaustion than to debug it during an incident.