gRPC Keepalive Mismatch: Transport Closing After Idle

gRPC keepalive bit us right after we scaled connections. The symptom was random “transport is closing” errors after periods of low traffic. The pattern was unmistakable: Monday mornings, right after lunch, early evenings. Any time traffic picked up after a quiet period, we’d see a burst of gRPC errors. The errors would clear after a few retries, but the initial failures caused user-visible issues.

The debugging was frustrating because everything looked correct in isolation. The server was healthy. The client was healthy. Network connectivity was fine. But the gRPC connections were dying. The “transport is closing” error message gave no indication of why.

The root cause turned out to be a timing mismatch. Our server had MaxConnectionIdle set to 5 minutes—if a connection had no traffic for 5 minutes, close it. Our client had keepalive pings set to 10 minutes—ping the server every 10 minutes to keep the connection alive. The math doesn’t work: the server closes at 5 minutes, but the client doesn’t ping until 10 minutes. During idle periods, the server terminates connections that the client expects to be healthy.

What made this worse was the multi-layer nature of the problem. We had the application’s gRPC settings, an Envoy sidecar with its own timeouts, and an AWS ALB with yet another idle timeout (60 seconds by default). The client’s keepalive had to be shorter than the minimum of all these timeouts. We’d tuned one layer but missed the others.
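The "shorter than the minimum of all these timeouts" rule can be checked mechanically. The sketch below is illustrative only: the layer names and the `violatedLayers` helper are invented for this example, and the two timeout values are the ones quoted in this post. No Envoy value is included because it depends on how your sidecar is configured.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// layerTimeouts maps each hop in the request path to its idle timeout.
// The names are made up; the values are the ones quoted in this post.
var layerTimeouts = map[string]time.Duration{
	"grpc-server MaxConnectionIdle": 5 * time.Minute,
	"aws-alb idle_timeout":          60 * time.Second,
}

// violatedLayers returns, sorted by name, every layer whose idle timeout is
// not strictly longer than the client's keepalive interval.
func violatedLayers(keepalive time.Duration, layers map[string]time.Duration) []string {
	var bad []string
	for name, timeout := range layers {
		if keepalive >= timeout {
			bad = append(bad, name)
		}
	}
	sort.Strings(bad)
	return bad
}

func main() {
	for _, ka := range []time.Duration{10 * time.Minute, 2 * time.Minute, 30 * time.Second} {
		fmt.Printf("keepalive=%v violates: %v\n", ka, violatedLayers(ka, layerTimeouts))
	}
}
```

Listing the violating layers, rather than only the minimum, shows at a glance which hops were missed when only one layer was tuned.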

Environment: gRPC 1.40+, Go/Java/Python clients, long-lived connections, bursty traffic patterns

The Problem

The Intermittent Connection Deaths

Traffic pattern and failures:

00:00 - 00:15  High traffic, many requests, no errors
00:15 - 00:45  Low traffic, few requests
00:46         Burst of requests → "transport is closing" errors
00:47         New connections established, requests succeed

Errors appear:
- After idle periods (lunch, nights, weekends)
- During traffic bursts following quiet periods
- Only on long-lived connections
- Not on fresh connections

The Error Messages

// Client side errors
rpc error: code = Unavailable desc = transport is closing

// Server side logs (if verbose)
grpc: Server.Serve failed to complete security handshake
connection closed before server preface received

// Or just silent connection termination
// No error on server, client gets RST

Root Cause

Keepalive Timing Mismatch

Server configuration:
┌─────────────────────────────────────────────────────────────┐
│ MaxConnectionIdle: 5 minutes                                │
│ "Close connections that have no activity for 5 minutes"    │
│                                                             │
│ MaxConnectionAge: 30 minutes                                │
│ "Close connections older than 30 minutes regardless"        │
│                                                             │
│ MaxConnectionAgeGrace: 10 seconds                           │
│ "Give 10s for in-flight RPCs before force close"           │
└─────────────────────────────────────────────────────────────┘

Client configuration:
┌─────────────────────────────────────────────────────────────┐
│ KeepAliveTime: 10 minutes                                   │
│ "Send ping every 10 minutes if no activity"                │
│                                                             │
│ KeepAliveTimeout: 20 seconds                                │
│ "Wait 20s for ping response before marking dead"           │
└─────────────────────────────────────────────────────────────┘

Timeline:
T+0:00    Last RPC completes
T+5:00    Server: "Connection idle for 5 min, closing" → GOAWAY, then close
T+5:01    Client tries RPC → "transport is closing"
T+10:00   Client would have sent keepalive ping (too late!)

Why This Happens

// Common mistake: Relying on client keepalives alone
conn, err := grpc.Dial(target,
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                10 * time.Minute,  // Too long!
        Timeout:             20 * time.Second,
        PermitWithoutStream: true,
    }),
)

// Server has stricter settings (often defaults or load balancer)
server := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle: 5 * time.Minute,  // Server kills first!
    }),
)

// Result: Server closes before client pings

Diagnosis

Check Server Keepalive Settings

// Go server - check what's configured
func printServerKeepalive(s *grpc.Server) {
    // Unfortunately no direct way to read settings
    // Check your server initialization code
    // Common defaults:
    // - MaxConnectionIdle: infinity
    // - MaxConnectionAge: infinity
    // - But load balancers often have their own!
}

# Check if load balancer is terminating connections
# Look at connection ages when errors occur

# On client, track connection lifetimes
# Add logging for connection state changes
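One way to act on these notes is to log, for each RPC on a long-lived connection, how long the connection had been idle and whether the call failed. The longest idle gap that still succeeded and the shortest gap that failed then bracket the effective idle timeout, which can be compared against each layer's setting. This is a rough sketch: the `sample` type, the observations, and the "mesh idleTimeout" candidate are hypothetical, and sparse data will give a wide bracket.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// sample is one RPC observed on a long-lived connection.
type sample struct {
	idle   time.Duration // time since the previous RPC on the same connection
	failed bool          // true if the RPC failed with "transport is closing"
}

// Hypothetical observations and candidate timeouts, for illustration only.
var observed = []sample{
	{30 * time.Second, false},
	{4 * time.Minute, false},
	{5*time.Minute + 5*time.Second, true},
	{20 * time.Minute, true},
}

var candidates = map[string]time.Duration{
	"grpc-server MaxConnectionIdle": 5 * time.Minute,
	"mesh idleTimeout":              15 * time.Minute,
}

// timeoutBracket returns the longest idle gap that still succeeded (lo) and
// the shortest idle gap that failed (hi). The effective idle timeout should
// lie in (lo, hi]. ok is false when no failure was observed.
func timeoutBracket(samples []sample) (lo, hi time.Duration, ok bool) {
	for _, s := range samples {
		if s.failed && (!ok || s.idle < hi) {
			hi, ok = s.idle, true
		}
	}
	for _, s := range samples {
		// Successes with longer gaps than hi ran on fresh connections; skip them.
		if ok && !s.failed && s.idle < hi && s.idle > lo {
			lo = s.idle
		}
	}
	return lo, hi, ok
}

// layersInBracket lists the candidate layers whose timeout falls in (lo, hi].
func layersInBracket(lo, hi time.Duration, layers map[string]time.Duration) []string {
	var names []string
	for name, t := range layers {
		if t > lo && t <= hi {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	lo, hi, ok := timeoutBracket(observed)
	if !ok {
		fmt.Println("no failures observed")
		return
	}
	fmt.Printf("effective idle timeout is between %v and %v\n", lo, hi)
	fmt.Println("suspects:", layersInBracket(lo, hi, candidates))
}
```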

Monitor Connection State

// Go client - monitor connection state changes
import (
    "context"
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/connectivity"
)

// monitorConnection logs every state transition until ctx is cancelled.
// Run it in its own goroutine: go monitorConnection(ctx, conn)
func monitorConnection(ctx context.Context, conn *grpc.ClientConn) {
    state := conn.GetState()
    for {
        // WaitForStateChange returns false only once ctx is done
        if !conn.WaitForStateChange(ctx, state) {
            return
        }
        newState := conn.GetState()
        log.Printf("gRPC connection state: %s → %s", state, newState)

        if newState == connectivity.TransientFailure {
            log.Printf("Connection entered TransientFailure - will reconnect")
        }
        state = newState
    }
}

Capture Connection Metrics

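// Enable gRPC channelz for debugging
import channelzsvc "google.golang.org/grpc/channelz/service"

// Register the channelz service on your *grpc.Server before calling Serve
// (a blank import alone does not register it)
channelzsvc.RegisterChannelzServiceToServer(server)

// Then query via grpc_cli or channelz web UI
// Shows: connection ages, states, last activity times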

The Fix

Option 1: Align Keepalive Times

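// Client keepalive MUST be shorter than server MaxConnectionIdle
// Caveat: grpc-go documents MaxConnectionIdle as time since the number of
// outstanding RPCs dropped to zero, so also set it above your longest
// expected quiet period rather than relying on pings alone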

// Server configuration
server := grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle:     15 * time.Minute,
        MaxConnectionAge:      30 * time.Minute,
        MaxConnectionAgeGrace: 5 * time.Second,
        Time:                  5 * time.Minute,   // Server pings
        Timeout:               1 * time.Second,
    }),
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             1 * time.Minute,  // Allow client pings
        PermitWithoutStream: true,
    }),
)

// Client configuration - ping before server closes
conn, err := grpc.Dial(target,
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:                5 * time.Minute,   // < MaxConnectionIdle
        Timeout:             10 * time.Second,
        PermitWithoutStream: true,              // Important!
    }),
)
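The client's keepalive interval has a lower bound as well as an upper one. If the client pings more often than the server's EnforcementPolicy.MinTime allows, a grpc-go server treats the pings as abusive and closes the connection with a GOAWAY carrying "too_many_pings". The sketch below checks both bounds; `checkClientKeepalive` is a hypothetical helper written for this post, not part of the gRPC API.

```go
package main

import (
	"fmt"
	"time"
)

// checkClientKeepalive validates the client's keepalive Time against two bounds:
//   - lower: the server's EnforcementPolicy.MinTime
//   - upper: the shortest idle timeout anywhere on the path
func checkClientKeepalive(clientTime, serverMinTime, shortestIdle time.Duration) error {
	if clientTime < serverMinTime {
		return fmt.Errorf("client Time %v < server MinTime %v: pings will be rejected",
			clientTime, serverMinTime)
	}
	if clientTime >= shortestIdle {
		return fmt.Errorf("client Time %v >= shortest idle timeout %v: no ping arrives in time",
			clientTime, shortestIdle)
	}
	return nil
}

func main() {
	// Values from the Option 1 configuration: inside the window, so no error.
	fmt.Println(checkClientKeepalive(5*time.Minute, 1*time.Minute, 15*time.Minute))
	// Pinging every 30s against a MinTime of 1 minute breaks the lower bound.
	fmt.Println(checkClientKeepalive(30*time.Second, 1*time.Minute, 15*time.Minute))
	// The original 10-minute keepalive against a 5-minute idle timeout breaks the upper bound.
	fmt.Println(checkClientKeepalive(10*time.Minute, 1*time.Minute, 5*time.Minute))
}
```

Running a check like this at startup, with the values read from the same configuration the client and server use, turns a silent mismatch into an immediate error.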

Option 2: Handle Reconnection Gracefully

// Use retry with backoff for transient failures
import (
    "context"
    "log"
    "time"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

func callWithRetry(ctx context.Context, client pb.ServiceClient) error {
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
        // The response is discarded here; return it in real code
        _, err := client.SomeMethod(ctx, &pb.Request{})
        if err == nil {
            return nil
        }

        lastErr = err
        st, ok := status.FromError(err)
        if !ok {
            return err // Not a gRPC error
        }

        switch st.Code() {
        case codes.Unavailable:
            // Transport closing - pause briefly, then retry; the channel reconnects on its own
            log.Printf("Connection unavailable, retrying (attempt %d)", attempt+1)
            time.Sleep(100 * time.Millisecond)
        case codes.DeadlineExceeded, codes.ResourceExhausted:
            // Linear backoff for these
            time.Sleep(time.Duration(attempt+1) * time.Second)
        default:
            return err
        }

        if ctx.Err() != nil {
            // The caller's deadline or cancellation has hit; another attempt cannot succeed
            return lastErr
        }
    }
    return lastErr
}
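The fixed sleeps above are fine for a handful of clients, but many clients retrying in lockstep after the same idle period can produce a second burst. A common refinement is exponential backoff with jitter that also stops as soon as the context ends. The sketch below uses only the standard library; `errUnavailable` is a stand-in for a gRPC status with codes.Unavailable, and `retry` and `demo` are hypothetical names, not gRPC APIs.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// errUnavailable stands in for a gRPC status with codes.Unavailable.
var errUnavailable = errors.New("unavailable")

// retry calls fn up to attempts times. Between tries it sleeps for an
// exponentially growing interval with jitter, and gives up early if ctx ends.
func retry(ctx context.Context, attempts int, base time.Duration, fn func(context.Context) error) error {
	var err error
	for n := 0; n < attempts; n++ {
		err = fn(ctx)
		if err == nil || !errors.Is(err, errUnavailable) {
			return err // success, or an error that is not worth retrying
		}
		if n == attempts-1 {
			break // out of attempts; skip the final sleep
		}
		backoff := base << uint(n) // base, 2*base, 4*base, ...
		// Jitter: sleep somewhere in [backoff/2, backoff].
		sleep := backoff/2 + time.Duration(rand.Int63n(int64(backoff/2)+1))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return err
}

// demo simulates a call that fails twice on a dead connection, then succeeds.
func demo() (calls int, err error) {
	err = retry(context.Background(), 3, 10*time.Millisecond, func(context.Context) error {
		calls++
		if calls < 3 {
			return errUnavailable
		}
		return nil
	})
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Printf("calls=%d err=%v\n", calls, err)
}
```

In a real client the `errors.Is` check would be replaced by inspecting the status code, as `callWithRetry` does.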

Option 3: Configure Service Mesh Properly

# Istio DestinationRule - control connection pool
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: my-service
spec:
  host: my-service.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 5s
        tcpKeepalive:
          time: 300s      # 5 minutes
          interval: 75s
      http:
        h2UpgradePolicy: UPGRADE
        idleTimeout: 900s  # 15 minutes - longer than client keepalive

Option 4: Load Balancer Configuration

# AWS ALB - increase idle timeout
# Default is 60 seconds - often too short for gRPC!

# Terraform
resource "aws_lb_target_group" "grpc" {
  protocol         = "HTTP"
  protocol_version = "GRPC"

  health_check {
    protocol = "HTTP"
    path     = "/grpc.health.v1.Health/Check"
  }
}

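resource "aws_lb_listener" "grpc" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"  # ALB only supports GRPC target groups behind HTTPS listeners
  certificate_arn   = var.grpc_certificate_arn  # hypothetical variable; supply your own certificate

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.grpc.arn
  }
}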

# Set idle timeout on ALB
resource "aws_lb" "main" {
  idle_timeout = 900  # 15 minutes for gRPC
}

Monitoring

groups:
  - name: grpc-connections
    rules:
      - alert: GRPCTransportClosing
        expr: |
          rate(grpc_client_handled_total{grpc_code="Unavailable"}[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High rate of gRPC Unavailable errors"

      - alert: GRPCConnectionChurn
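        # grpc_client_connections_total is assumed to be a counter you export yourself;
        # it is not one of the go-grpc-prometheus client metrics
        expr: |
          rate(grpc_client_connections_total[5m]) > 10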
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High gRPC connection churn - check keepalive settings"

Checklist

## gRPC Keepalive Mismatch

### Symptoms
- [ ] "transport is closing" errors after idle periods
- [ ] Errors correlate with low traffic periods
- [ ] Fresh connections work fine
- [ ] Problem worse on weekends/nights

### Diagnosis
- [ ] Check server MaxConnectionIdle setting
- [ ] Check client KeepAliveTime setting
- [ ] Verify load balancer idle timeout
- [ ] Check service mesh connection pool settings
- [ ] Monitor connection state transitions

### Fixes
- [ ] Client keepalive < Server MaxConnectionIdle
- [ ] Enable PermitWithoutStream on client
- [ ] Set server EnforcementPolicy to allow pings
- [ ] Configure load balancer idle timeout > keepalive
- [ ] Add retry logic for Unavailable errors

Conclusion

This problem is a perfect example of how distributed systems create emergent complexity. Each component—client, server, sidecar, load balancer—has reasonable default settings. But when you combine them, the interactions create failure modes that none of the individual components would exhibit alone.

The gRPC keepalive dance requires coordination across all layers. Your client’s keepalive time must be shorter than the server’s MaxConnectionIdle, shorter than the load balancer’s idle timeout, shorter than the service mesh’s connection pool timeout. Miss any one of these, and you get “transport is closing” errors during idle periods.

The frustrating part is that these settings are often invisible. The AWS ALB’s default idle timeout of 60 seconds isn’t prominently documented in gRPC contexts. Envoy’s default connection idle timeout isn’t obvious unless you look at its configuration. You have to trace through every component in your request path and verify their timeout settings against each other.

The fix is conceptually simple: make the client ping more often than anything can time out. But it requires surveying your entire infrastructure. And you should add retry logic for Unavailable errors regardless, because connections will occasionally fail even with perfect keepalive settings.

Key principles:

- Client keepalive time < server MaxConnectionIdle: the fundamental rule
- Load balancers have their own timeouts: AWS ALB defaults to 60s, often the shortest
- Service meshes add another layer: Envoy, Istio, Linkerd all have connection pool settings
- PermitWithoutStream = true for idle keepalives: without this, pings only happen during active RPCs
- Retry Unavailable errors: the connection will auto-reconnect, just retry the failed RPC

