
gRPC in Kubernetes: Why Service Round-Robin Lies

Tags: grpc, kubernetes, load-balancing, performance, microservices

gRPC plus Kubernetes looked simple until load balancing started lying. “Why do you have five replicas and only one has 90% of the traffic?” That was the question from our on-call engineer, staring at a Grafana dashboard that made no sense. We’d deployed a new gRPC service with five replicas, expecting even load distribution. Instead, one pod was drowning in requests while the others sat nearly idle. The overloaded pod was hitting its CPU limit and driving up latency for everyone.

The Kubernetes Service was configured correctly. The pods were healthy. The deployment was supposed to get round-robin load balancing. None of that mattered, because gRPC runs over HTTP/2, and HTTP/2 keeps one long-lived connection open and multiplexes every request over it. A Kubernetes Service balances at the connection level, not the request level: kube-proxy picks a backend pod when the TCP connection is opened and never looks at individual requests. Our client connected once and held that connection, so every subsequent request went to the same pod for as long as the connection stayed open.

This is one of those “works on my laptop” problems that bites teams hard in production. An HTTP/1.1 connection carries one request at a time, so clients open a new connection per request or cycle through a small pool, and Kubernetes Service load balancing ends up distributing requests reasonably well. HTTP/2 clients open one connection and reuse it for thousands of concurrent requests, concentrating all traffic on whichever pod happened to receive that initial connection.

The really insidious part is that this can get worse over time. As pods restart and clients reconnect, connections tend to cluster: during a rolling deployment, reconnecting clients land on whichever new pods are ready first, and pods that become ready later pick up few connections. Traffic distribution becomes increasingly skewed. Without client-side load balancing or a proxy that balances per request, gRPC behind a plain Kubernetes Service is fundamentally broken for any serious workload.

Tested on: Kubernetes 1.28+, gRPC-Go 1.60+, Istio 1.20+. Reproduced on GKE, EKS and bare metal.

Why Service Round-Robin Doesn’t Work

HTTP/1.1 (works)

Client → K8s Service → Pod A (request 1)
Client → K8s Service → Pod B (request 2)
Client → K8s Service → Pod C (request 3)

Each request = new connection = new pod.

gRPC/HTTP/2 (doesn’t work)

Client → K8s Service → Pod A (connection established)
                       Pod A (request 1, 2, 3, 4, 5...)
                       Pod A (all requests)

One connection = multiplexed requests = one pod.
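
The two diagrams above can be reduced to a few lines of Go. This is only an illustrative model: `connectionLevel` and `requestLevel` are hypothetical names, and the real decision is made by kube-proxy (per connection) or by an L7-aware balancer (per request).

```go
package main

import "fmt"

// connectionLevel models a ClusterIP Service: the backend is chosen once,
// when the connection is opened, and every request multiplexed on that
// connection lands on the same pod.
func connectionLevel(pods, requests, chosenPod int) []int {
	counts := make([]int, pods)
	counts[chosenPod] = requests
	return counts
}

// requestLevel models a balancer that understands HTTP/2: the backend is
// chosen again for every request, here with plain round-robin.
func requestLevel(pods, requests int) []int {
	counts := make([]int, pods)
	for i := 0; i < requests; i++ {
		counts[i%pods]++
	}
	return counts
}

func main() {
	fmt.Println("connection-level:", connectionLevel(5, 10000, 0)) // [10000 0 0 0 0]
	fmt.Println("request-level:   ", requestLevel(5, 10000))       // [2000 2000 2000 2000 2000]
}
```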

Reproducible Lab

Server

// server/main.go
package main

import (
    "context"
    "log"
    "net"
    "os"

    pb "example/grpc/proto"
    "google.golang.org/grpc"
)

type server struct {
    pb.UnimplementedGreeterServer
    podName string
}

func (s *server) SayHello(ctx context.Context, in *pb.HelloRequest) (*pb.HelloReply, error) {
    log.Printf("Received request on pod: %s", s.podName)
    return &pb.HelloReply{Message: "Hello from " + s.podName}, nil
}

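func main() {
    podName := os.Getenv("POD_NAME")
    lis, err := net.Listen("tcp", ":50051")
    if err != nil {
        log.Fatalf("Failed to listen: %v", err)
    }
    s := grpc.NewServer()
    pb.RegisterGreeterServer(s, &server{podName: podName})
    log.Printf("Server started on pod: %s", podName)
    // Serve blocks until the listener fails or the server is stopped.
    if err := s.Serve(lis); err != nil {
        log.Fatalf("Failed to serve: %v", err)
    }
}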

Kubernetes Manifests

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-server
spec:
  replicas: 5
  selector:
    matchLabels:
      app: grpc-server
  template:
    metadata:
      labels:
        app: grpc-server
    spec:
      containers:
      - name: server
        image: grpc-server:latest
        ports:
        - containerPort: 50051
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
---
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: grpc-server
spec:
  selector:
    app: grpc-server
  ports:
  - port: 50051
    targetPort: 50051

Load Test

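# ghz - gRPC benchmarking tool; run it from a pod inside the cluster.
# By default ghz opens a single connection (--connections 1).
# Without --proto, ghz looks up the method through server reflection,
# which the lab server above does not register.
ghz --insecure \
    --call helloworld.Greeter/SayHello \
    --total 10000 \
    --concurrency 50 \
    --data '{"name":"test"}' \
    grpc-server:50051

# Result: 90%+ requests on one pod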

Solution 1: Headless Service + Client-Side LB

The simplest and most reliable solution is to make the client responsible for load balancing. This requires two changes: a headless Service that exposes individual pod IPs via DNS, and a client configuration that uses those IPs with round-robin.

A headless Service (clusterIP: None) has no virtual IP. Instead, a DNS query for the Service name returns one A record per ready pod. The gRPC client can then hold a connection to each pod and distribute requests across them.
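
As a rough sketch of what the client-side resolver gets to work with, the following turns a DNS answer into one dial target per pod. The lookup is injected so the example runs outside a cluster; `podTargets`, `lookupFunc`, and the pod IPs are hypothetical, and inside a pod you would pass `net.LookupHost` instead of the fake.

```go
package main

import (
	"fmt"
	"net"
)

// lookupFunc matches the signature of net.LookupHost so that a fake can be
// substituted when no cluster DNS is available.
type lookupFunc func(host string) ([]string, error)

// podTargets resolves a Service name and returns one host:port target per
// address in the DNS answer.
func podTargets(lookup lookupFunc, service, port string) ([]string, error) {
	ips, err := lookup(service)
	if err != nil {
		return nil, fmt.Errorf("resolve %s: %w", service, err)
	}
	targets := make([]string, 0, len(ips))
	for _, ip := range ips {
		targets = append(targets, net.JoinHostPort(ip, port))
	}
	return targets, nil
}

func main() {
	// Made-up pod IPs standing in for the headless Service's DNS answer.
	fake := func(string) ([]string, error) {
		return []string{"10.0.1.4", "10.0.2.7", "10.0.3.9"}, nil
	}
	targets, err := podTargets(fake, "grpc-server-headless", "50051")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(targets) // [10.0.1.4:50051 10.0.2.7:50051 10.0.3.9:50051]
}
```

Run against a regular ClusterIP Service, the same lookup returns a single virtual IP, which is why the client has nothing to balance across.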

Headless Service

apiVersion: v1
kind: Service
metadata:
  name: grpc-server-headless
spec:
  clusterIP: None  # Headless!
  selector:
    app: grpc-server
  ports:
  - port: 50051

Client with DNS Resolver

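// client/main.go
package main

import (
    "context"
    "log"
    "time"

    pb "example/grpc/proto"
    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
)

func main() {
    // The dns:/// scheme makes the client resolve every pod IP behind the
    // headless Service; round_robin then spreads RPCs across those addresses.
    conn, err := grpc.Dial(
        "dns:///grpc-server-headless:50051",
        grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
        // Plaintext for the lab only; replaces the deprecated grpc.WithInsecure().
        grpc.WithTransportCredentials(insecure.NewCredentials()),
    )
    if err != nil {
        log.Fatalf("Failed to dial: %v", err)
    }
    defer conn.Close()

    client := pb.NewGreeterClient(conn)
    // Now requests go to different pods
    for i := 0; i < 10; i++ {
        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        reply, err := client.SayHello(ctx, &pb.HelloRequest{Name: "test"})
        cancel()
        if err != nil {
            log.Fatalf("SayHello failed: %v", err)
        }
        log.Print(reply.GetMessage())
    }
}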
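
One quick way to check the effect from the client side is to count replies per pod. The helper below is a minimal sketch that assumes every reply has the "Hello from <pod>" shape produced by the lab server; `tallyByPod` and the pod names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// tallyByPod counts replies per pod, assuming each reply looks like
// "Hello from <pod>" as returned by the lab server's SayHello.
func tallyByPod(replies []string) map[string]int {
	counts := make(map[string]int)
	for _, r := range replies {
		pod := strings.TrimPrefix(r, "Hello from ")
		counts[pod]++
	}
	return counts
}

func main() {
	// Made-up replies; in the lab these would be collected from SayHello calls.
	replies := []string{
		"Hello from grpc-server-a", "Hello from grpc-server-b",
		"Hello from grpc-server-c", "Hello from grpc-server-a",
	}
	counts := tallyByPod(replies)

	pods := make([]string, 0, len(counts))
	for p := range counts {
		pods = append(pods, p)
	}
	sort.Strings(pods) // map iteration order is random; sort for stable output
	for _, p := range pods {
		fmt.Printf("%s: %d\n", p, counts[p])
	}
}
```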

Results

Metric             ClusterIP Service    Headless + Client LB
Pod distribution   90/5/5/0/0           20/20/20/20/20
Latency P99        45ms                 12ms
Throughput         5k RPS               25k RPS

Solution 2: Service Mesh (Istio)

If you can’t modify your clients—perhaps they’re third-party or deployed by other teams—a service mesh can intercept traffic and apply proper load balancing. The mesh sidecar proxy understands HTTP/2 and balances at the request level, not the connection level.

This is the most transparent solution: no code changes required. But it comes with operational complexity. You’re deploying a sidecar to every pod, adding latency, and taking on a significant new piece of infrastructure.

Istio DestinationRule

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: grpc-server
spec:
  host: grpc-server
  trafficPolicy:
    loadBalancer:
      simple: ROUND_ROBIN
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE

Istio Benefits

  • No code changes
  • mTLS automatic
  • Observability (tracing, metrics)
  • Traffic management (canary, circuit breaker)

Drawbacks

  • Overhead (sidecar)
  • Operational complexity
  • Latency (+1-3ms)

Solution 3: Linkerd

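# Linkerd annotations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grpc-server
spec:
  template:
    metadata:
      annotations:
        # The injector reads this from the pod template (or the namespace),
        # not from the Deployment's own metadata.
        linkerd.io/inject: enabled
  # ...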

Linkerd automatically detects gRPC and applies per-request load balancing.

Solution 4: Envoy as Sidecar

# envoy-sidecar.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: envoy-config
data:
  envoy.yaml: |
    static_resources:
      listeners:
      - address:
          socket_address:
            address: 0.0.0.0
            port_value: 8080
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              codec_type: AUTO
              stat_prefix: ingress_http
              route_config:
                virtual_hosts:
                - name: backend
                  domains: ["*"]
                  routes:
                  - match: { prefix: "/" }
                    route:
                      cluster: grpc_backend
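              http_filters:
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
      - name: grpc_backend
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        # Speak HTTP/2 to the backends (replaces the deprecated cluster-level http2_protocol_options)
        typed_extension_protocol_options:
          envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
            "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
            explicit_http_config:
              http2_protocol_options: {}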
        load_assignment:
          cluster_name: grpc_backend
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: grpc-server-headless
                    port_value: 50051

Monitoring gRPC Distribution

Prometheus Metrics

# Requests per pod
sum(rate(grpc_server_handled_total[5m])) by (pod)

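# Distribution (share of total per pod, 0-1)
# group_left is required: many per-pod series divide by a single total.
sum(rate(grpc_server_handled_total[5m])) by (pod)
/ ignoring(pod) group_left
sum(rate(grpc_server_handled_total[5m]))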

Expected vs Actual

5 pods, even load:
- Expected: 20% / 20% / 20% / 20% / 20%
- Without client LB: 85% / 5% / 5% / 3% / 2%
- With client LB: 19% / 21% / 20% / 20% / 20%
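
The same comparison can be turned into a simple check, for example in a smoke test after a deployment. This is a sketch, not a replacement for a Prometheus alert: `shares`, `skewed`, and the 1.5x threshold are assumptions chosen for illustration.

```go
package main

import "fmt"

// shares converts per-pod request counts into fractions of the total.
func shares(counts []float64) []float64 {
	var total float64
	for _, c := range counts {
		total += c
	}
	out := make([]float64, len(counts))
	if total == 0 {
		return out
	}
	for i, c := range counts {
		out[i] = c / total
	}
	return out
}

// skewed reports whether any pod carries more than factor times its fair
// share (1/len(counts)) of the traffic.
func skewed(counts []float64, factor float64) bool {
	fair := 1.0 / float64(len(counts))
	for _, s := range shares(counts) {
		if s > factor*fair {
			return true
		}
	}
	return false
}

func main() {
	withoutLB := []float64{85, 5, 5, 3, 2}
	withLB := []float64{19, 21, 20, 20, 20}
	fmt.Println(skewed(withoutLB, 1.5)) // true: 0.85 exceeds 1.5 * 0.20
	fmt.Println(skewed(withLB, 1.5))    // false: the largest share is 0.21
}
```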

Production Checklist

## gRPC Load Balancing Checklist

### Basics
- [ ] Headless Service for gRPC
- [ ] Client-side load balancing or mesh
- [ ] Connection pooling with max lifetime
- [ ] Keepalive settings

### Client Config
- [ ] `loadBalancingPolicy: round_robin`
- [ ] DNS resolver (`dns:///`)
- [ ] Keepalive: 30s interval, 10s timeout
- [ ] Reconnect and re-resolve DNS when the server closes an aged connection (max connection age is a server-side setting)

### Server Config
- [ ] MaxConnectionAge: 5m
- [ ] MaxConnectionAgeGrace: 10s
- [ ] Keepalive enforcement

### Monitoring
- [ ] Per-pod request distribution
- [ ] Connection count per pod
- [ ] Latency per pod
- [ ] Alert: uneven distribution

Conclusion

This is one of the most common gotchas when moving from HTTP/1.1 REST APIs to gRPC. The assumption that “Kubernetes handles load balancing” is true for HTTP/1.1 but completely false for gRPC/HTTP/2. If you’re not explicitly addressing this, you’re running an accidentally broken system.

The solution you choose depends on your constraints. Client-side load balancing with headless Services is the most efficient—no sidecars, no extra latency, just smarter clients. But it requires code changes and every client must be updated. Service mesh solutions are more transparent but add operational complexity and latency.

Either way, monitoring is non-negotiable. You need visibility into per-pod request distribution to catch this problem quickly. An even 20/20/20/20/20 split is healthy; a 90/5/3/1/1 split is a problem waiting to escalate.

Key principles:

  1. Kubernetes Service doesn’t work for gRPC round-robin—it balances connections, not requests
  2. Headless Service + client LB is the simplest and most efficient solution
  3. Service mesh (Istio/Linkerd) when you can’t modify clients
  4. Monitor distribution continuously—it’s the only way to catch the problem
  5. Set connection max age so clients periodically reconnect and rebalance

FAQ

What if I can’t change the client?

Use a service mesh (Istio/Linkerd) or Envoy as a sidecar proxy.

Is client-side LB safe?

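Yes, but clients need to re-resolve DNS regularly so they discover new pods. Setting a max connection age on the server forces periodic reconnects, and each reconnect gives the client a chance to resolve the Service name again.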

How many connections per pod?

Typically 1-5 for modern gRPC clients. More = overhead without benefit.



Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "gRPC in Kubernetes: Why Service Round-Robin Lies". https://www.michal-drozd.com/en/blog/grpc-load-balancing-k8s/ (Published August 11, 2025).