gRPC in Kubernetes: Why Service Round-Robin Lies
gRPC plus Kubernetes looked simple until load balancing started lying. “Why do you have 5 replicas and only 1 has 90% of traffic?” This was the question from our on-call engineer, staring at a Grafana dashboard that made no sense. We’d deployed a new gRPC service with five replicas, expecting even load distribution. Instead, one pod was drowning in requests while the others sat nearly idle. The overloaded pod was hitting its CPU limit and driving up latency for everyone.
The Kubernetes Service was configured correctly. The pods were healthy. kube-proxy was spreading connections across the pods exactly as designed. But none of that mattered, because gRPC uses HTTP/2, and HTTP/2 keeps a long-lived connection open and multiplexes all requests over it. Kubernetes Service load balancing works at the connection level (L4), not the request level. Our client connected once and held that connection, so every subsequent request went to the same pod for as long as the connection lived.
This is one of those “works on my laptop” problems that bites teams hard in production. HTTP/1.1 clients open a new connection for each request (or a small pool that cycles), so Kubernetes Service load balancing distributes requests naturally. HTTP/2 clients open one connection and reuse it for thousands of requests, concentrating all traffic on whichever pod happened to receive that initial connection.
The really insidious part is that this gets worse over time. As pods restart and connections reconnect, they tend to cluster—new connections after a rolling deployment all go to the fresh pods. Traffic distribution becomes increasingly skewed. Without proper client-side load balancing, gRPC in Kubernetes is fundamentally broken for any serious workload.
Tested on: Kubernetes 1.28+, gRPC-Go 1.60+, Istio 1.20+. Reproduced on GKE, EKS and bare metal.
Why Service Round-Robin Doesn’t Work
HTTP/1.1 (works)
Client → K8s Service → Pod A (request 1)
Client → K8s Service → Pod B (request 2)
Client → K8s Service → Pod C (request 3)
Each request = new connection = new pod.
gRPC/HTTP/2 (doesn’t work)
Client → K8s Service → Pod A (connection established)
                       Pod A (request 1, 2, 3, 4, 5...)
                       Pod A (all requests)
One connection = multiplexed requests = one pod.
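The difference can be reproduced without a cluster. The following Go sketch is a toy model, not kube-proxy's actual algorithm: `simulate` is a hypothetical helper that picks a backend either once per connection or once per request and counts where the requests land. Backend selection by simple rotation is an assumption made for determinism.

```go
package main

import "fmt"

// simulate returns how many requests each pod receives.
// Toy model: backends are chosen by simple rotation, which is an assumption
// for illustration (kube-proxy's real selection depends on the proxy mode).
func simulate(pods, clients, requestsPerClient int, perRequest bool) []int {
	counts := make([]int, pods)
	next := 0 // rotation cursor shared by all picks
	for c := 0; c < clients; c++ {
		// Connection-level balancing: one pick when the client connects.
		pinned := next % pods
		if !perRequest {
			next++
		}
		for r := 0; r < requestsPerClient; r++ {
			if perRequest {
				// Request-level balancing: a fresh pick for every request.
				counts[next%pods]++
				next++
			} else {
				counts[pinned]++
			}
		}
	}
	return counts
}

func main() {
	fmt.Println("per-connection:", simulate(5, 1, 10000, false))
	fmt.Println("per-request:   ", simulate(5, 1, 10000, true))
}
```

With one client and five pods, the per-connection run puts all 10,000 requests on a single pod, while the per-request run gives each pod 2,000. Adding more clients softens the per-connection skew but never removes it, because each client stays pinned to its pod.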
Reproducible Lab
Server
// server/main.go
package main

import (
	"context"
	"log"
	"net"
	"os"

	pb "example/grpc/proto"

	"google.golang.org/grpc"
)

// server implements the Greeter service and tags every reply with the pod name.
type server struct {
	pb.UnimplementedGreeterServer
	podName string
}

func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
	log.Printf("Received request on pod: %s", s.podName)
	return &pb.HelloReply{Message: "Hello from " + s.podName}, nil
}

func main() {
	podName := os.Getenv("POD_NAME") // injected via the Downward API in deployment.yaml
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("Failed to listen: %v", err)
	}
	s := grpc.NewServer()
	pb.RegisterGreeterServer(s, &server{podName: podName})
	log.Printf("Server started on pod: %s", podName)
	if err := s.Serve(lis); err != nil {
		log.Fatalf("Failed to serve: %v", err)
	}
}
Kubernetes Manifests
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: grpc-server
  template:
    metadata:
      labels:
        app: grpc-server
    spec:
      containers:
      - name: server
        image: grpc-server:latest
        ports:
        - containerPort: 50051
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: grpc-server
spec:
  selector:
    app: grpc-server
  ports:
  - port: 50051
    targetPort: 50051
Load Test
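# ghz - gRPC benchmarking tool
# Note: ghz needs either --proto <file> or server reflection to resolve the method;
# the lab server above does not register reflection.
ghz --insecure \
  --call helloworld.Greeter/SayHello \
  --total 10000 \
  --concurrency 50 \
  --data '{"name":"test"}' \
  grpc-server:50051
# Result: 90%+ requests on one pod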
Solution 1: Headless Service + Client-Side LB
The simplest and most reliable solution is to make the client responsible for load balancing. This requires two changes: a headless Service that exposes individual pod IPs via DNS, and a client configuration that uses those IPs with round-robin.
A headless Service (clusterIP: None) doesn’t create a virtual IP. Instead, DNS queries for the Service name return A records for all healthy pod IPs. The gRPC client can then maintain connections to multiple pods and distribute requests across them.
Headless Service
apiVersion: v1
kind: Service
metadata:
  name: grpc-server-headless
spec:
  clusterIP: None # Headless!
  selector:
    app: grpc-server
  ports:
  - port: 50051
Client with DNS Resolver
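// client/main.go
package main

import (
	"context"
	"log"
	"time"

	pb "example/grpc/proto"

	"google.golang.org/grpc"
	_ "google.golang.org/grpc/balancer/roundrobin"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// DNS resolver + round robin balancer.
	// The dns:/// scheme makes the client resolve every pod IP behind the headless Service.
	conn, err := grpc.Dial(
		"dns:///grpc-server-headless:50051",
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		// Plaintext for the lab; replaces the deprecated grpc.WithInsecure().
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("Failed to dial: %v", err)
	}
	defer conn.Close()
	client := pb.NewGreeterClient(conn)
	// Now requests go to different pods
	for i := 0; i < 10; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), time.Second)
		reply, err := client.SayHello(ctx, &pb.HelloRequest{Name: "test"})
		cancel()
		if err != nil {
			log.Fatalf("SayHello failed: %v", err)
		}
		log.Print(reply.GetMessage())
	}
}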
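Conceptually, the `round_robin` policy keeps a connection to each resolved address and rotates across them for every RPC. The sketch below is a simplified model of that rotation, not gRPC-Go's actual balancer code; `rrPicker` and the pod addresses are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// rrPicker is a hypothetical, simplified stand-in for a round-robin balancer:
// it rotates over the current address list and tolerates list updates.
type rrPicker struct {
	mu    sync.Mutex
	addrs []string
	next  int
}

// Update replaces the address list, as a resolver would after a DNS refresh.
func (p *rrPicker) Update(addrs []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.addrs = append([]string(nil), addrs...) // copy so callers cannot mutate it
	p.next = 0
}

// Pick returns the backend for the next request, or false if none are known.
func (p *rrPicker) Pick() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.addrs) == 0 {
		return "", false
	}
	addr := p.addrs[p.next%len(p.addrs)]
	p.next++
	return addr, true
}

func main() {
	p := &rrPicker{}
	p.Update([]string{"10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"})
	for i := 0; i < 6; i++ {
		addr, _ := p.Pick()
		fmt.Println("request", i, "->", addr)
	}
}
```

The important property is that the pick happens per request, over a list that changes when the resolver reports new addresses. A real balancer additionally skips backends whose connections are not ready.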
Results
| Metric | ClusterIP Service | Headless + Client LB |
|---|---|---|
| Pod distribution | 90/5/5/0/0 | 20/20/20/20/20 |
| Latency P99 | 45ms | 12ms |
| Throughput | 5k RPS | 25k RPS |
Solution 2: Service Mesh (Istio)
If you can’t modify your clients—perhaps they’re third-party or deployed by other teams—a service mesh can intercept traffic and apply proper load balancing. The mesh sidecar proxy understands HTTP/2 and balances at the request level, not the connection level.
This is the most transparent solution: no code changes required. But it comes with operational complexity. You’re deploying a sidecar to every pod, adding latency, and taking on a significant new piece of infrastructure.
Istio DestinationRule
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-server
spec:
  host: grpc-server
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
Istio Benefits
- No code changes
- mTLS automatic
- Observability (tracing, metrics)
- Traffic management (canary, circuit breaker)
Drawbacks
- Overhead (sidecar)
- Operational complexity
- Latency (+1-3ms)
Solution 3: Linkerd
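# Linkerd annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-server
spec:
  template:
    metadata:
      annotations:
        # Must be set on the pod template (or the namespace), not on the Deployment itself
        linkerd.io/inject: enabled
    # ...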
Linkerd automatically detects gRPC and applies per-request load balancing.
Solution 4: Envoy as Sidecar
# envoy-sidecar.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
data:
  envoy.yaml: |
    static_resources:
      listeners:
      - address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              codec_type: AUTO
              stat_prefix: ingress_http
              route_config:
                virtual_hosts:
                - name: backend
                  domains: ["*"]
                  routes:
                  - match: { prefix: "/" }
                    route:
                      cluster: grpc_backend
              http_filters:
              - name: envoy.filters.http.router
                # Recent Envoy versions require an explicit typed_config for the router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
      - name: grpc_backend
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        http2_protocol_options: {}
        load_assignment:
          cluster_name: grpc_backend
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: grpc-server-headless
                    port_value: 50051
Monitoring gRPC Distribution
Prometheus Metrics
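# Requests per pod
sum(rate(grpc_server_handled_total[5m])) by (pod)
# Distribution (fraction of total per pod; multiply by 100 for %)
# group_left is required because many per-pod series match one total series
sum(rate(grpc_server_handled_total[5m])) by (pod)
  / ignoring(pod) group_left
sum(rate(grpc_server_handled_total[5m]))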
Expected vs Actual
5 pods, even load:
- Expected: 20% / 20% / 20% / 20% / 20%
- Without client LB: 85% / 5% / 5% / 3% / 2%
- With client LB: 19% / 21% / 20% / 20% / 20%
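The same comparison can be turned into a single number for alerting. The sketch below is one possible way to do it, not an established metric: `shares` and `skew` are hypothetical helpers, the pod names are placeholders, and the input is assumed to be per-pod request counts or rates scraped from the queries above.

```go
package main

import (
	"fmt"
	"sort"
)

// shares converts per-pod request counts (or rates) into fractions of the total.
func shares(counts map[string]float64) map[string]float64 {
	total := 0.0
	for _, c := range counts {
		total += c
	}
	out := make(map[string]float64, len(counts))
	if total == 0 {
		return out // no traffic: nothing to report
	}
	for pod, c := range counts {
		out[pod] = c / total
	}
	return out
}

// skew is the busiest pod's share divided by the even share (1/N).
// 1.0 means perfectly even; the alert threshold is left to the operator.
func skew(counts map[string]float64) float64 {
	top := 0.0
	for _, s := range shares(counts) {
		if s > top {
			top = s
		}
	}
	return top * float64(len(counts))
}

func main() {
	// Placeholder pod names with the "without client LB" split from above.
	counts := map[string]float64{"pod-a": 85, "pod-b": 5, "pod-c": 5, "pod-d": 3, "pod-e": 2}
	s := shares(counts)
	pods := make([]string, 0, len(s))
	for pod := range s {
		pods = append(pods, pod)
	}
	sort.Strings(pods)
	for _, pod := range pods {
		fmt.Printf("%s: %.0f%%\n", pod, s[pod]*100)
	}
	fmt.Printf("skew: %.2f\n", skew(counts))
}
```

For the unbalanced split the busiest pod carries a bit over four times its fair share, while the balanced split stays close to 1. Alerting on a ratio like this scales with replica count better than a fixed percentage threshold.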
Production Checklist
## gRPC Load Balancing Checklist
### Basics
- [ ] Headless Service for gRPC
- [ ] Client-side load balancing or mesh
- [ ] Connection pooling with max lifetime
- [ ] Keepalive settings
### Client Config
- [ ] `loadBalancingPolicy: round_robin`
- [ ] DNS resolver (`dns:///`)
- [ ] Keepalive: 30s interval, 10s timeout
- [ ] Max connection age: 5m (enforced by the server's MaxConnectionAge; the gRPC-Go client has no equivalent setting)
### Server Config
- [ ] MaxConnectionAge: 5m
- [ ] MaxConnectionAgeGrace: 10s
- [ ] Keepalive enforcement
### Monitoring
- [ ] Per-pod request distribution
- [ ] Connection count per pod
- [ ] Latency per pod
- [ ] Alert: uneven distribution
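The keepalive and connection-age items in the checklist map onto gRPC-Go's `keepalive` package roughly as follows. This is a configuration sketch using the values from the checklist, not a tuned recommendation: `newServer` and `clientKeepalive` are hypothetical helper names, and the `MinTime` of 20s is an assumption chosen so that the server accepts the client's 30s ping interval.

```go
package main

import (
	"log"
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

// newServer applies the server-side checklist values.
func newServer() *grpc.Server {
	return grpc.NewServer(
		grpc.KeepaliveParams(keepalive.ServerParameters{
			MaxConnectionAge:      5 * time.Minute,  // send GOAWAY so clients reconnect and re-resolve
			MaxConnectionAgeGrace: 10 * time.Second, // let in-flight RPCs finish
			Time:                  30 * time.Second, // ping idle clients
			Timeout:               10 * time.Second, // close if the ping is not acknowledged
		}),
		grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
			MinTime:             20 * time.Second, // assumption: must not exceed the client's ping interval
			PermitWithoutStream: true,
		}),
	)
}

// clientKeepalive returns the matching client-side dial option.
func clientKeepalive() grpc.DialOption {
	return grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                30 * time.Second,
		Timeout:             10 * time.Second,
		PermitWithoutStream: true,
	})
}

func main() {
	lis, err := net.Listen("tcp", ":50051")
	if err != nil {
		log.Fatalf("Failed to listen: %v", err)
	}
	s := newServer()
	// Register services on s here before serving.
	log.Fatal(s.Serve(lis))
}
```

On the client, `clientKeepalive()` would be passed alongside the other dial options. If the client pings more often than the server's `MinTime` allows, the server closes the connection, so the two sides need to be changed together.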
Conclusion
This is one of the most common gotchas when moving from HTTP/1.1 REST APIs to gRPC. The assumption that “Kubernetes handles load balancing” is true for HTTP/1.1 but completely false for gRPC/HTTP/2. If you’re not explicitly addressing this, you’re running an accidentally broken system.
The solution you choose depends on your constraints. Client-side load balancing with headless Services is the most efficient—no sidecars, no extra latency, just smarter clients. But it requires code changes and every client must be updated. Service mesh solutions are more transparent but add operational complexity and latency.
Either way, monitoring is non-negotiable. You need visibility into per-pod request distribution to catch this problem quickly. An even 20/20/20/20/20 split is healthy; a 90/5/3/1/1 split is a problem waiting to escalate.
Key principles:
- Kubernetes Service doesn’t work for gRPC round-robin—it balances connections, not requests
- Headless Service + client LB is the simplest and most efficient solution
- Service mesh (Istio/Linkerd) when you can’t modify clients
- Monitor distribution continuously—it’s the only way to catch the problem
- Set connection max age so clients periodically reconnect and rebalance
FAQ
What if I can’t change the client?
Use a service mesh (Istio/Linkerd) or Envoy as a sidecar proxy.
Is client-side LB safe?
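Yes, but the client only sees new pods when it re-resolves DNS. gRPC-Go re-resolves when a connection drops, so set a server-side max connection age to force periodic reconnects and pick up scaled-up or replaced pods.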
How many connections per pod?
Typically 1-5 for modern gRPC clients. More = overhead without benefit.