HTTP Keep-Alive Connection Reset: Why Your Requests Fail with 'Connection Reset by Peer'
I keep a sticky note from that night: “why do only 0.1% fail?” Some bugs are loud. This one is a whisper: 0.1% of requests fail, only in prod, and never in local tests. I first saw it as a trickle of "connection reset by peer" errors.
This is almost always a keep-alive timeout mismatch: the server's idle timeout is shorter than the client's, so the server closes an idle connection just as the client reuses it for a new request.
Tested on: nginx 1.24, Go 1.22, Java 21, Kubernetes 1.28
The Race Condition
How Keep-Alive Works
Without Keep-Alive:
Client                Server
|--- TCP SYN ------------->|
|<-- TCP SYN-ACK ----------|
|--- TCP ACK ------------->|
|--- HTTP Request -------->|
|<-- HTTP Response --------|
|--- TCP FIN ------------->| ← Connection closed
|<-- TCP FIN-ACK ----------|
With Keep-Alive:
Client                 Server
|--- TCP Connect ---------->|
|--- HTTP Request 1 ------->|
|<-- HTTP Response 1 -------|
|--- HTTP Request 2 ------->| ← Same connection!
|<-- HTTP Response 2 -------|
...
(connection stays open)
The Race Condition
Server timeout: 60 seconds
Client timeout: 90 seconds
Timeline:
T+0s: Request 1 completes
T+59s: Client prepares new request
T+60s: Server closes connection (timeout!)
T+60s: Client sends request on "open" connection
→ "Connection reset by peer"
Packet Level
T+60.000s: Server sends FIN
T+60.001s: Client sends HTTP request (doesn't know about FIN yet)
T+60.002s: Server receives request on closed socket
T+60.002s: Server sends RST (reset)
T+60.003s: Client receives RST → Error!
Diagnosing the Problem
Symptoms
Error patterns:
- "connection reset by peer"
- "ECONNRESET"
- "broken pipe"
- java.net.SocketException: Connection reset
- Go: EOF, unexpected EOF, or http: server closed idle connection
Characteristics:
- Sporadic (0.01% - 1% of requests)
- More common under low load (connections idle longer)
- Cannot reproduce locally
- Happens after a period of inactivity
Finding the Timeouts
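# nginx (default if unset: 75s)
grep -R keepalive_timeout /etc/nginx/
# keepalive_timeout 65;
# AWS ALB
# Idle timeout default: 60 seconds (configurable via the idle_timeout.timeout_seconds attribute)
# Kubernetes Ingress (nginx); the ingress-nginx namespace is an assumption, adjust to yours
kubectl -n ingress-nginx get configmap ingress-nginx-controller -o yaml | grep -i keep
# Go server
# http.Server.IdleTimeout (if zero, ReadTimeout is used; if both are zero, no timeout!)
# Java / Tomcat
# server.tomcat.keep-alive-timeout (if unset, falls back to server.tomcat.connection-timeout, default: 20s)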
Solutions
Rule: Client Timeout < Server Timeout
Server keeps connection open: 60s
Client closes connection after: 55s ← 5s safety margin
Client always closes first → No race condition
Go Client
// http_client.go
import (
	"net"
	"net/http"
	"time"
)

func newHTTPClient() *http.Client {
	transport := &http.Transport{
		// Idle connection pool limits
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		// CRITICAL: Close idle connections before the server does
		IdleConnTimeout: 55 * time.Second, // Server: 60s
		// Dial timeout and TCP keep-alive probes
		// (TCP keep-alive is unrelated to HTTP keep-alive)
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // Total request timeout
	}
}
Go Server
// server.go
server := &http.Server{
	Addr:    ":8080",
	Handler: handler,
	// Keep-alive timeout
	IdleTimeout: 60 * time.Second,
	// Read/Write timeouts
	ReadTimeout:  30 * time.Second,
	WriteTimeout: 30 * time.Second,
}
Java Client (Apache HttpClient 5)
// HttpClientConfig.java
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.core5.util.TimeValue;

public CloseableHttpClient createHttpClient() {
    PoolingHttpClientConnectionManager connectionManager =
        new PoolingHttpClientConnectionManager();
    connectionManager.setMaxTotal(100);
    connectionManager.setDefaultMaxPerRoute(10);
    // Close idle connections before the server timeout (server: 60s, client: 55s).
    // evictIdleConnections() starts a background thread that sweeps the pool
    // periodically, so treat 55s as approximate: a connection can sit idle
    // longer than that between sweeps.
    // A one-off connectionManager.closeIdle(...) call here would do nothing
    // useful, because the pool is still empty at construction time.
    return HttpClients.custom()
        .setConnectionManager(connectionManager)
        .evictExpiredConnections()
        .evictIdleConnections(TimeValue.ofSeconds(55))
        .build();
}
Java Server (Spring Boot)
# application.yml
server:
  tomcat:
    # Keep-alive timeout
    keep-alive-timeout: 60s
    # Max keep-alive requests per connection
    max-keep-alive-requests: 100
    # Connection timeout
    connection-timeout: 30s
nginx Configuration
http {
    # Keep-alive to clients
    keepalive_timeout 65s;
    # Keep-alive to upstream (backends)
    upstream backend {
        server app:8080;
        # Reuse connections to backend
        keepalive 100;
        # CRITICAL: Close before backend timeout
        keepalive_timeout 55s;  # Backend: 60s
    }
    server {
        location / {
            proxy_pass http://backend;
            # Enable keep-alive to upstream
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
Kubernetes Ingress (nginx)
# ingress-nginx ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
data:
  # Client keep-alive
  keep-alive: "65"
  # Upstream keep-alive
  upstream-keepalive-connections: "100"
  upstream-keepalive-timeout: "55"  # Backend: 60s
  # Requests per connection
  upstream-keepalive-requests: "1000"
AWS ALB + Target Groups
ALB idle timeout: 60s (default)
Application keep-alive on the targets: 65s ← Target HIGHER than ALB
# Terraform
resource "aws_lb_target_group" "app" {
  # ...
  deregistration_delay = 30
  stickiness {
    enabled = true
    type    = "lb_cookie"
  }
}
resource "aws_lb" "app" {
  # ...
  idle_timeout = 60 # ALB timeout
}
# The keep-alive timeout is set in the application running on the targets,
# not on the target group itself, and must be > 60s
Retry Strategy
Idempotent Request Retry
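// retry.go
import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"syscall"
	"time"
)

// Note: net/http's Transport already retries some idempotent requests that
// fail on a reused connection; this wrapper covers the cases it does not.
func doWithRetry(client *http.Client, req *http.Request) (*http.Response, error) {
	const maxAttempts = 3
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			// Back off before each retry: 100ms, then 200ms
			time.Sleep(time.Duration(attempt*100) * time.Millisecond)
			// A body can be read only once, so rewind it before resending
			if req.GetBody != nil {
				body, err := req.GetBody()
				if err != nil {
					return nil, err
				}
				req.Body = body
			} else if req.Body != nil && req.Body != http.NoBody {
				return nil, fmt.Errorf("body cannot be replayed: %w", lastErr)
			}
		}
		resp, err := client.Do(req)
		if err == nil {
			return resp, nil
		}
		// Only retry idempotent requests that failed with a connection error
		if !isConnectionError(err) || !isIdempotent(req.Method) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
}

func isConnectionError(err error) bool {
	return errors.Is(err, syscall.ECONNRESET) || // "connection reset by peer"
		errors.Is(err, syscall.EPIPE) || // "broken pipe"
		errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF)
}

func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions,
		http.MethodPut, http.MethodDelete:
		return true
	}
	return false
}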
Spring Retry
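@Configuration
@EnableRetry
public class RetryConfig {
    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

@Service
public class ApiClient {
    private static final Logger log = LoggerFactory.getLogger(ApiClient.class);
    private final RestTemplate restTemplate;

    public ApiClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // RestTemplate wraps I/O failures, including connection resets,
    // in ResourceAccessException
    @Retryable(
        value = {ResourceAccessException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100)
    )
    public String callApi() {
        // Assumes the RestTemplate has a root URI configured;
        // otherwise pass an absolute URL here
        return restTemplate.getForObject("/api/data", String.class);
    }

    @Recover
    public String fallback(ResourceAccessException e) {
        log.error("All retries failed", e);
        throw new ServiceUnavailableException("API unavailable");
    }
}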
Monitoring
Prometheus Metrics
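# Metric names depend on your exporter; adjust them to what you actually scrape.
# Requests the client abandoned (nginx logs 499 when the client closes first;
# resets from an upstream usually surface as 502 instead)
sum(rate(nginx_http_requests_total{status="499"}[5m]))
# Connection errors (client side)
sum(rate(http_client_errors_total{error="connection_reset"}[5m]))
# Keep-alive connections
sum(nginx_http_connections{state="kept"})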
Alerts
groups:
  - name: connection_errors
    rules:
      - alert: HighConnectionResetRate
        expr: |
          sum(rate(http_client_errors_total{error=~"connection_reset|broken_pipe"}[5m]))
          /
          sum(rate(http_client_requests_total[5m]))
          > 0.001
        for: 10m
        annotations:
          summary: "Connection reset rate >0.1%"
          description: "Check keep-alive timeout alignment"
Checklist
## Keep-Alive Configuration
### Timeout Alignment
- [ ] Document all keep-alive timeouts in chain
- [ ] Ensure: Client < Proxy < Server
- [ ] Add 5-10 second safety margin
### Configuration
Client:
- [ ] Go: IdleConnTimeout < server timeout
- [ ] Java: evictIdleConnections() < server timeout
Proxy (nginx):
- [ ] keepalive_timeout (client side)
- [ ] upstream keepalive_timeout (backend side)
Server:
- [ ] IdleTimeout configured
- [ ] Documented for clients
### Testing
- [ ] Load test with idle periods
- [ ] Monitor connection reset rate
- [ ] Test under various load patterns
### Monitoring
- [ ] Alert on connection reset rate > 0.1%
- [ ] Track keep-alive connection count
- [ ] Log connection errors with context
Conclusion
Keep-alive connection resets are a configuration problem:
- Client timeout must be < server timeout - Always close first
- Add 5-10s safety margin - Account for network latency and timer granularity
- Configure all layers - Client, proxy, server
- Retry idempotent requests - Handle unavoidable resets
Chain: Client (55s) → nginx (60s) → App (65s)
Related Articles
- Connection Pool Sizing with Little’s Law - Connection management
- Circuit Breaker vs Rate Limiter vs Bulkhead - Resilience patterns