gRPC Keepalive Mismatch: Transport Closing After Idle
gRPC keepalive bit us right after we scaled connections. The symptom: random “transport is closing” errors after periods of low traffic. The pattern was unmistakable: Monday mornings, right after lunch, early evenings. Anytime traffic picked up after a quiet period, we’d see a burst of gRPC errors. The errors would clear after a few retries, but the initial failures caused user-visible issues.
The debugging was frustrating because everything looked correct in isolation. The server was healthy. The client was healthy. Network connectivity was fine. But the gRPC connections were dying. The “transport is closing” error message gave no indication of why.
The root cause turned out to be a timing mismatch. Our server had MaxConnectionIdle set to 5 minutes—if a connection had no traffic for 5 minutes, close it. Our client had keepalive pings set to 10 minutes—ping the server every 10 minutes to keep the connection alive. The math doesn’t work: the server closes at 5 minutes, but the client doesn’t ping until 10 minutes. During idle periods, the server terminates connections that the client expects to be healthy.
What made this worse was the multi-layer nature of the problem. We had the application’s gRPC settings, an Envoy sidecar with its own timeouts, and an AWS ALB with yet another idle timeout (60 seconds by default). The client’s keepalive had to be shorter than the minimum of all these timeouts. We’d tuned one layer but missed the others.
Environment: gRPC 1.40+, Go/Java/Python clients, long-lived connections, bursty traffic patterns
The Problem
The Intermittent Connection Deaths
Traffic pattern and failures:
00:00 - 00:15 High traffic, many requests, no errors
00:15 - 00:45 Low traffic, few requests
00:46 Burst of requests → "transport is closing" errors
00:47 New connections established, requests succeed
Errors appear:
- After idle periods (lunch, nights, weekends)
- During traffic bursts following quiet periods
- Only on long-lived connections
- Not on fresh connections
The Error Messages
// Client side errors
rpc error: code = Unavailable desc = transport is closing
// Server side logs (if verbose)
grpc: Server.Serve failed to complete security handshake
connection closed before server preface received
// Or just silent connection termination
// No error on server, client gets RST
Root Cause
Keepalive Timing Mismatch
Server configuration:
MaxConnectionIdle: 5 minutes
  "Close connections that have no activity for 5 minutes"
MaxConnectionAge: 30 minutes
  "Close connections older than 30 minutes regardless"
MaxConnectionAgeGrace: 10 seconds
  "Give 10s for in-flight RPCs before force close"
Client configuration:
KeepAliveTime: 10 minutes
  "Send ping every 10 minutes if no activity"
KeepAliveTimeout: 20 seconds
  "Wait 20s for ping response before marking dead"
Timeline:
T+0:00 Last RPC completes
T+5:00 Server: "Connection idle for 5 min, closing" → RST
T+5:01 Client tries RPC → "transport is closing"
T+10:00 Client would have sent keepalive ping (too late!)
Why This Happens
// Common mistake: Relying on client keepalives alone
conn, err := grpc.Dial(target,
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                10 * time.Minute, // Too long!
		Timeout:             20 * time.Second,
		PermitWithoutStream: true,
	}),
)
// Server has stricter settings (often defaults or load balancer)
server := grpc.NewServer(
	grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle: 5 * time.Minute, // Server kills first!
	}),
)
// Result: Server closes before client pings
Diagnosis
Check Server Keepalive Settings
// Go server - check what's configured
func printServerKeepalive(s *grpc.Server) {
// Unfortunately no direct way to read settings
// Check your server initialization code
// Common defaults:
// - MaxConnectionIdle: infinity
// - MaxConnectionAge: infinity
// - But load balancers often have their own!
}
# Check if load balancer is terminating connections
# Look at connection ages when errors occur
# On client, track connection lifetimes
# Add logging for connection state changes
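One way to use those connection lifetimes: compare the idle gap you observe (time from the last completed RPC until the connection dropped) against each layer's idle timeout and see which one it lines up with. The sketch below is a hypothetical helper, not part of gRPC; the observed gap and the tolerance are made-up sample values, and the two candidate timeouts are the ones named in this post.

```go
package main

import (
	"fmt"
	"time"
)

// idleTimeout pairs a layer name with the idle timeout it enforces.
type idleTimeout struct {
	layer string
	limit time.Duration
}

// candidates holds the two idle timeouts named in this post.
var candidates = []idleTimeout{
	{"aws-alb idle timeout", 60 * time.Second},
	{"server MaxConnectionIdle", 5 * time.Minute},
}

// Hypothetical sample values: an idle gap taken from client logs and
// how far from a timeout it may be while still counting as a match.
var (
	observedGap = 5*time.Minute + 2*time.Second
	tolerance   = 5 * time.Second
)

// suspect returns the layer whose idle timeout is closest to the observed
// gap, or false if no timeout is within the tolerance.
func suspect(gap, tol time.Duration) (string, bool) {
	best, bestDiff := "", tol+1
	for _, c := range candidates {
		diff := gap - c.limit
		if diff < 0 {
			diff = -diff
		}
		if diff < bestDiff {
			best, bestDiff = c.layer, diff
		}
	}
	return best, best != ""
}

func main() {
	if layer, ok := suspect(observedGap, tolerance); ok {
		fmt.Printf("gap %s matches %s\n", observedGap, layer)
	} else {
		fmt.Printf("gap %s matches no known timeout\n", observedGap)
	}
}
```

If the gaps cluster tightly around one value, that layer is the first place to look; if they are scattered, the cause is probably not an idle timeout at all.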
Monitor Connection State
// Go client - monitor connection state changes
import "google.golang.org/grpc/connectivity"
func monitorConnection(conn *grpc.ClientConn) {
	state := conn.GetState()
	for {
		changed := conn.WaitForStateChange(context.Background(), state)
		if !changed {
			return
		}
		newState := conn.GetState()
		log.Printf("gRPC connection state: %s → %s", state, newState)
		if newState == connectivity.TransientFailure {
			log.Printf("Connection entered TransientFailure - will reconnect")
		}
		state = newState
	}
}
Capture Connection Metrics
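// Enable gRPC channelz for debugging
import channelzsvc "google.golang.org/grpc/channelz/service"
// Register the channelz service on an existing *grpc.Server (here: server).
// Note: grpc.EnableTracing only controls golang.org/x/net/trace output;
// it does not start channelz.
channelzsvc.RegisterChannelzServiceToServer(server)
// Then query via grpc_cli or channelz web UI
// Shows: connection ages, states, last activity times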
The Fix
Option 1: Align Keepalive Times
// Client keepalive MUST be shorter than server MaxConnectionIdle
// Server configuration
server := grpc.NewServer(
	grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionIdle:     15 * time.Minute,
		MaxConnectionAge:      30 * time.Minute,
		MaxConnectionAgeGrace: 5 * time.Second,
		Time:                  5 * time.Minute, // Server pings
		Timeout:               1 * time.Second,
	}),
	grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
		MinTime:             1 * time.Minute, // Allow client pings
		PermitWithoutStream: true,
	}),
)
// Client configuration - ping before server closes
conn, err := grpc.Dial(target,
	grpc.WithKeepaliveParams(keepalive.ClientParameters{
		Time:                5 * time.Minute, // < MaxConnectionIdle
		Timeout:             10 * time.Second,
		PermitWithoutStream: true, // Important!
	}),
)
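The alignment rules in Option 1 can be written down as a pre-deploy sanity check. The sketch below uses local stand-in structs rather than the real `keepalive` types so it runs without the gRPC module; the struct and function names are assumptions for this example. It encodes this post's rule (client Time < server MaxConnectionIdle) plus the server's enforcement policy: in gRPC-Go the default `MinTime` is 5 minutes and the default `PermitWithoutStream` is false, and a client that pings more aggressively than the policy allows gets disconnected with a too_many_pings GOAWAY.

```go
package main

import (
	"fmt"
	"time"
)

// ClientKA is a local stand-in for the client keepalive fields that matter here.
type ClientKA struct {
	Time                time.Duration
	PermitWithoutStream bool
}

// ServerKA is a local stand-in combining the server's idle limit with its
// enforcement policy fields.
type ServerKA struct {
	MaxConnectionIdle   time.Duration
	MinTime             time.Duration // EnforcementPolicy.MinTime
	PermitWithoutStream bool          // EnforcementPolicy.PermitWithoutStream
}

// The mismatched pair from "Why This Happens" (server policy left at defaults)
// and the aligned pair from Option 1.
var (
	brokenClient = ClientKA{Time: 10 * time.Minute, PermitWithoutStream: true}
	brokenServer = ServerKA{MaxConnectionIdle: 5 * time.Minute, MinTime: 5 * time.Minute}
	fixedClient  = ClientKA{Time: 5 * time.Minute, PermitWithoutStream: true}
	fixedServer  = ServerKA{MaxConnectionIdle: 15 * time.Minute, MinTime: time.Minute, PermitWithoutStream: true}
)

// check returns a human-readable list of mismatches; empty means aligned.
func check(c ClientKA, s ServerKA) []string {
	var problems []string
	if c.Time >= s.MaxConnectionIdle {
		problems = append(problems, "client Time is not shorter than server MaxConnectionIdle")
	}
	if c.Time < s.MinTime {
		problems = append(problems, "client pings more often than server MinTime allows (too_many_pings)")
	}
	if c.PermitWithoutStream && !s.PermitWithoutStream {
		problems = append(problems, "client pings without active streams but server policy forbids it")
	}
	return problems
}

func main() {
	fmt.Println("broken config:", check(brokenClient, brokenServer))
	fmt.Println("fixed config: ", check(fixedClient, fixedServer))
}
```

A check like this is cheap to run in CI against the values you deploy, though it only covers the two gRPC endpoints and says nothing about proxies in between.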
Option 2: Handle Reconnection Gracefully
// Use retry with backoff for transient failures
import "google.golang.org/grpc/codes"
import "google.golang.org/grpc/status"
func callWithRetry(ctx context.Context, client pb.ServiceClient) error {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		// The response is discarded here; only the error drives the retry decision
		_, err := client.SomeMethod(ctx, &pb.Request{})
		if err == nil {
			return nil
		}
		lastErr = err
		st, ok := status.FromError(err)
		if !ok {
			return err // Not a gRPC error
		}
		switch st.Code() {
		case codes.Unavailable:
			// Transport closing - pause briefly so the channel can reconnect, then retry
			log.Printf("Connection unavailable, retrying (attempt %d)", attempt+1)
			time.Sleep(100 * time.Millisecond)
			continue
		case codes.DeadlineExceeded, codes.ResourceExhausted:
			// Linear backoff for these: 1s, 2s, 3s
			time.Sleep(time.Duration(attempt+1) * time.Second)
			continue
		default:
			return err
		}
	}
	return lastErr
}
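The retry loop above sleeps for fixed or linearly growing intervals. If you would rather use capped exponential backoff, the delay calculation could look roughly like this. The function name and the base and ceiling values are assumptions chosen for illustration, and real code would normally add random jitter on top so that many clients do not retry in lockstep.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values, not recommendations.
var (
	base    = 100 * time.Millisecond
	ceiling = 2 * time.Second
)

// backoff returns start doubled once per prior attempt, capped at limit.
// attempt is zero-based: attempt 0 waits exactly start.
func backoff(attempt int, start, limit time.Duration) time.Duration {
	d := start
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= limit {
			// Stop early so large attempt counts cannot overflow.
			return limit
		}
	}
	if d > limit {
		return limit
	}
	return d
}

func main() {
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoff(attempt, base, ceiling))
	}
}
```

In `callWithRetry`, the `time.Sleep` calls could then take `backoff(attempt, base, ceiling)` instead of a hard-coded duration.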
Option 3: Configure Service Mesh Properly
# Istio DestinationRule - control connection pool
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
        tcpKeepalive:
          time: 300s # 5 minutes
          interval: 75s
      http:
        h2UpgradePolicy: UPGRADE
        idleTimeout: 900s # 15 minutes - longer than client keepalive
Option 4: Load Balancer Configuration
# AWS ALB - increase idle timeout
# Default is 60 seconds - often too short for gRPC!
# Terraform
resource "aws_lb_target_group" "grpc" {
  protocol         = "HTTP"
  protocol_version = "GRPC"
  health_check {
    protocol = "HTTP"
    path     = "/grpc.health.v1.Health/Check"
  }
}
resource "aws_lb_listener" "grpc" {
  load_balancer_arn = aws_lb.main.arn
  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.grpc.arn
  }
}
# Set idle timeout on ALB
resource "aws_lb" "main" {
  idle_timeout = 900 # 15 minutes for gRPC
}
Monitoring
groups:
  - name: grpc-connections
    rules:
      - alert: GRPCTransportClosing
        expr: |
          rate(grpc_client_handled_total{grpc_code="Unavailable"}[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of gRPC Unavailable errors"
      - alert: GRPCConnectionChurn
        expr: |
          rate(grpc_client_connections_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High gRPC connection churn - check keepalive settings"
Checklist
## gRPC Keepalive Mismatch
### Symptoms
- [ ] "transport is closing" errors after idle periods
- [ ] Errors correlate with low traffic periods
- [ ] Fresh connections work fine
- [ ] Problem worse on weekends/nights
### Diagnosis
- [ ] Check server MaxConnectionIdle setting
- [ ] Check client KeepAliveTime setting
- [ ] Verify load balancer idle timeout
- [ ] Check service mesh connection pool settings
- [ ] Monitor connection state transitions
### Fixes
- [ ] Client keepalive < Server MaxConnectionIdle
- [ ] Enable PermitWithoutStream on client
- [ ] Set server EnforcementPolicy to allow pings
- [ ] Configure load balancer idle timeout > keepalive
- [ ] Add retry logic for Unavailable errors
Conclusion
This problem is a perfect example of how distributed systems create emergent complexity. Each component—client, server, sidecar, load balancer—has reasonable default settings. But when you combine them, the interactions create failure modes that none of the individual components would exhibit alone.
The gRPC keepalive dance requires coordination across all layers. Your client’s keepalive time must be shorter than the server’s MaxConnectionIdle, shorter than the load balancer’s idle timeout, shorter than the service mesh’s connection pool timeout. Miss any one of these, and you get “transport is closing” errors during idle periods.
The frustrating part is that these settings are often invisible. The AWS ALB’s default idle timeout of 60 seconds isn’t prominently documented in gRPC contexts. Envoy’s default connection idle timeout isn’t obvious unless you look at its configuration. You have to trace through every component in your request path and verify their timeout settings against each other.
The fix is conceptually simple: make the client ping faster than anything can time out. It does, however, require surveying your entire infrastructure. And you should add retry logic for Unavailable errors regardless, because connections will occasionally fail even with perfect keepalive settings.
Key principles:
- Client keepalive time < server MaxConnectionIdle - the fundamental rule
- Load balancers have their own timeouts - AWS ALB defaults to 60s, often the shortest
- Service meshes add another layer - Envoy, Istio, Linkerd all have connection pool settings
- PermitWithoutStream = true for idle keepalives - without this, pings only happen during active RPCs
- Retry Unavailable errors - connection will auto-reconnect, just retry the failed RPC