HTTP Keep-Alive Connection Reset: Why Your Requests Fail with 'Connection Reset by Peer'

Tags: http, keep-alive, kubernetes, networking, troubleshooting, nginx, go, java

I keep a sticky note from that night: “why do only 0.1% fail?” Some bugs are loud. This one is a whisper: 0.1% of requests fail, only in prod, and never in local tests. I first saw it as a trickle of "connection reset by peer" errors.

This is almost always a keep-alive timeout mismatch. The server closes the connection just as the client sends a new request.

Tested on: nginx 1.24, Go 1.22, Java 21, Kubernetes 1.28

The Race Condition

How Keep-Alive Works

Without Keep-Alive:
Client                    Server
  |--- TCP SYN ------------->|
  |<-- TCP SYN-ACK ----------|
  |--- TCP ACK ------------->|
  |--- HTTP Request -------->|
  |<-- HTTP Response --------|
  |--- TCP FIN ------------->|  ← Connection closed
  |<-- TCP FIN-ACK ----------|

With Keep-Alive:
Client                    Server
  |--- TCP Connect ---------->|
  |--- HTTP Request 1 ------->|
  |<-- HTTP Response 1 -------|
  |--- HTTP Request 2 ------->|  ← Same connection!
  |<-- HTTP Response 2 -------|
  ...
  (connection stays open)

The Race Condition

Server timeout: 60 seconds
Client timeout: 90 seconds

Timeline:
T+0s:   Request 1 completes
T+59s:  Client prepares new request
T+60s:  Server closes connection (timeout!)
T+60s:  Client sends request on "open" connection
        → "Connection reset by peer"

Packet Level

T+60.000s: Server sends FIN
T+60.001s: Client sends HTTP request (doesn't know about FIN yet)
T+60.002s: Server receives request on closed socket
T+60.002s: Server sends RST (reset)
T+60.003s: Client receives RST → Error!
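
The timeline above can be turned into a small model. The sketch below is a simplification, not a packet-accurate simulation: it assumes both sides start their idle timers at the same instant and that one-way latency is constant. The `classify` function and the `Outcome` names are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Outcome describes what happens to a request sent after an idle period.
type Outcome string

const (
	ReusedSafely  Outcome = "reused safely"
	RaceWindow    Outcome = "race window"
	NewConnection Outcome = "new connection"
)

// classify models a request sent after the pooled connection has been idle
// for `idle`. serverIdle and clientIdle are the two keep-alive timeouts and
// latency is the assumed one-way network delay.
func classify(idle, serverIdle, clientIdle, latency time.Duration) Outcome {
	switch {
	case idle >= clientIdle:
		// The client already discarded the pooled connection itself.
		return NewConnection
	case idle+latency < serverIdle:
		// The request reaches the server before its idle timer fires.
		return ReusedSafely
	case idle >= serverIdle+latency:
		// The server's FIN has already reached the client, so it dials again.
		return NewConnection
	default:
		// The FIN and the request cross on the wire.
		return RaceWindow
	}
}

func main() {
	const latency = time.Millisecond
	for _, idle := range []time.Duration{59 * time.Second, 60 * time.Second, 61 * time.Second} {
		fmt.Printf("idle %v: mismatched=%q aligned=%q\n", idle,
			classify(idle, 60*time.Second, 90*time.Second, latency),
			classify(idle, 60*time.Second, 55*time.Second, latency))
	}
}
```

In this model the dangerous window is only about twice the one-way latency wide, which is why so few requests fail. With the client timeout at 55s the `RaceWindow` branch is never reached.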

Diagnosing the Problem

Symptoms

Error patterns:
- "connection reset by peer"
- "ECONNRESET"
- "broken pipe"
- java.net.SocketException: Connection reset
- Go: "EOF" or "http: server closed idle connection"

Characteristics:
- Sporadic (0.01% - 1% of requests)
- More common under low load (connections idle longer)
- Cannot reproduce locally
- Happens after period of inactivity
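
One way to confirm that failures line up with reused idle connections is to log connection reuse on the client. The sketch below uses Go's `net/http/httptrace` package. `tracedGet` and `demoReuse` are hypothetical helper names, and the local test server only stands in for a real backend.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// tracedGet performs a GET and reports how the connection was obtained:
// whether it was reused, and how long it had been idle in the pool.
func tracedGet(client *http.Client, url string) (httptrace.GotConnInfo, error) {
	var info httptrace.GotConnInfo
	trace := &httptrace.ClientTrace{
		GotConn: func(ci httptrace.GotConnInfo) { info = ci },
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return info, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))

	resp, err := client.Do(req)
	if err != nil {
		// info is still populated if a connection was obtained before the failure.
		return info, err
	}
	defer resp.Body.Close()
	// Drain the body so the connection can go back to the idle pool.
	_, err = io.Copy(io.Discard, resp.Body)
	return info, err
}

// demoReuse sends two requests to a local test server with the same client.
func demoReuse() (first, second httptrace.GotConnInfo, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()

	client := srv.Client()
	if first, err = tracedGet(client, srv.URL); err != nil {
		return
	}
	second, err = tracedGet(client, srv.URL)
	return
}

func main() {
	first, second, err := demoReuse()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("first:  reused=%v\n", first.Reused)
	fmt.Printf("second: reused=%v wasIdle=%v idleTime=%v\n",
		second.Reused, second.WasIdle, second.IdleTime)
}
```

If the failing requests consistently show `Reused=true` with an `IdleTime` close to the server's keep-alive timeout, a timeout mismatch is the likely cause.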

Finding the Timeouts

# nginx
grep keepalive_timeout /etc/nginx/nginx.conf
# keepalive_timeout 65;

# AWS ALB
# Default: 60 seconds (configurable via the idle_timeout.timeout_seconds load balancer attribute)

# Kubernetes Ingress (nginx)
kubectl get configmap ingress-nginx-controller -n ingress-nginx -o yaml | grep -E 'keep-?alive'  # adjust the namespace to your install

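# Go server
# http.Server.IdleTimeout (if zero, ReadTimeout is used; if both are zero, idle connections never time out)

# Java / Tomcat (Spring Boot)
# server.tomcat.keep-alive-timeout (when unset, Tomcat falls back to server.tomcat.connection-timeout)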
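
When the configuration is not accessible, the idle timeout can also be measured from outside. The sketch below assumes a plain-text HTTP/1.1 endpoint (no TLS) that answers `GET /` and keeps the connection alive. `measureIdleTimeout` and `demoMeasure` are hypothetical names, and the demo runs against a local test server with a 200ms idle timeout.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"time"
)

// measureIdleTimeout sends one request to addr, then waits (up to maxWait)
// for the server to close the idle connection and returns the elapsed time.
func measureIdleTimeout(addr string, maxWait time.Duration) (time.Duration, error) {
	conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	if _, err := fmt.Fprintf(conn, "GET / HTTP/1.1\r\nHost: %s\r\n\r\n", addr); err != nil {
		return 0, err
	}
	br := bufio.NewReader(conn)
	resp, err := http.ReadResponse(br, nil)
	if err != nil {
		return 0, err
	}
	if _, err := io.Copy(io.Discard, resp.Body); err != nil {
		return 0, err
	}
	resp.Body.Close()

	// The connection is idle from here on; block until the server closes it.
	start := time.Now()
	if err := conn.SetReadDeadline(start.Add(maxWait)); err != nil {
		return 0, err
	}
	_, err = br.ReadByte()
	switch {
	case err == nil:
		return 0, errors.New("unexpected data on idle connection")
	case errors.Is(err, os.ErrDeadlineExceeded):
		return 0, fmt.Errorf("server kept the connection open for more than %v", maxWait)
	default:
		// io.EOF (FIN) or a reset: either way the server closed the connection.
		return time.Since(start), nil
	}
}

// demoMeasure probes a local test server configured with a 200ms idle timeout.
func demoMeasure() (time.Duration, error) {
	srv := httptest.NewUnstartedServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	srv.Config.IdleTimeout = 200 * time.Millisecond
	srv.Start()
	defer srv.Close()
	return measureIdleTimeout(srv.Listener.Addr().String(), 5*time.Second)
}

func main() {
	d, err := demoMeasure()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("server closed the idle connection after about %v\n", d.Round(10*time.Millisecond))
}
```

A result near zero usually means the server answered with `Connection: close` rather than keeping the connection alive.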

Solutions

Rule: Client Timeout < Server Timeout

Server keeps connection open: 60s
Client closes connection after: 55s  ← 5s safety margin

Client always closes first → No race condition
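
The rule is easy to check mechanically once the timeouts are written down. The sketch below is one possible shape for such a check. The `Link` type and `checkLinks` function are invented for illustration, and each link is described from the point of view of the side that sends requests (the caller) and the side that accepts them (the callee).

```go
package main

import (
	"fmt"
	"time"
)

// Link is one hop in the request path, e.g. "client -> nginx".
type Link struct {
	Name       string
	CallerIdle time.Duration // how long the sending side keeps idle connections
	CalleeIdle time.Duration // how long the accepting side keeps idle connections
}

// checkLinks returns one message per link where the caller does not give up
// idle connections at least `margin` before the callee closes them.
func checkLinks(links []Link, margin time.Duration) []string {
	var problems []string
	for _, l := range links {
		if l.CallerIdle+margin > l.CalleeIdle {
			problems = append(problems, fmt.Sprintf(
				"%s: caller idles for %v but callee closes after %v (want at least %v margin)",
				l.Name, l.CallerIdle, l.CalleeIdle, margin))
		}
	}
	return problems
}

func main() {
	links := []Link{
		{Name: "client -> server (mismatched)", CallerIdle: 90 * time.Second, CalleeIdle: 60 * time.Second},
		{Name: "client -> server (aligned)", CallerIdle: 55 * time.Second, CalleeIdle: 60 * time.Second},
	}
	for _, p := range checkLinks(links, 5*time.Second) {
		fmt.Println("PROBLEM:", p)
	}
}
```

Only the mismatched link is reported. A load balancer or proxy appears twice in such a list, once as the callee for its clients and once as the caller for its backends.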

Go Client

// http_client.go
import (
    "net"
    "net/http"
    "time"
)

func newHTTPClient() *http.Client {
    transport := &http.Transport{
        // Max idle connections
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,

        // CRITICAL: Close connection before server does
        IdleConnTimeout:     55 * time.Second,  // Server: 60s

        // Connection timeout
        DialContext: (&net.Dialer{
            Timeout:   30 * time.Second,
            KeepAlive: 30 * time.Second,
        }).DialContext,
    }

    return &http.Client{
        Transport: transport,
        Timeout:   30 * time.Second,  // Total request timeout
    }
}

Go Server

// server.go
server := &http.Server{
    Addr:         ":8080",
    Handler:      handler,

    // Keep-alive timeout
    IdleTimeout:  60 * time.Second,

    // Read/Write timeouts
    ReadTimeout:  30 * time.Second,
    WriteTimeout: 30 * time.Second,
}

Java Client (Apache HttpClient 5)

// HttpClientConfig.java
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.core5.util.TimeValue;

public CloseableHttpClient createHttpClient() {
    PoolingHttpClientConnectionManager connectionManager =
        new PoolingHttpClientConnectionManager();

    connectionManager.setMaxTotal(100);
    connectionManager.setDefaultMaxPerRoute(10);

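    // Close idle connections before server timeout
    // Server: 60s, Client: 55s
    // closeIdle() is a one-shot sweep, so calling it here on an empty pool
    // has no lasting effect. The background eviction configured below is
    // what closes idle connections before the server does.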

    return HttpClients.custom()
        .setConnectionManager(connectionManager)
        .evictExpiredConnections()
        .evictIdleConnections(TimeValue.ofSeconds(55))
        .build();
}

Java Server (Spring Boot)

# application.yml
server:
  tomcat:
    # Keep-alive timeout
    keep-alive-timeout: 60s

    # Max keep-alive requests per connection
    max-keep-alive-requests: 100

    # Connection timeout
    connection-timeout: 30s

nginx Configuration

http {
    # Keep-alive to clients
    keepalive_timeout 65s;

    # Keep-alive to upstream (backends)
    upstream backend {
        server app:8080;

        # Reuse connections to backend
        keepalive 100;

        # CRITICAL: Close before backend timeout
        keepalive_timeout 55s;  # Backend: 60s
    }

    server {
        location / {
            proxy_pass http://backend;

            # Enable keep-alive to upstream
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}

Kubernetes Ingress (nginx)

# ingress-nginx ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
data:
  # Client keep-alive
  keep-alive: "65"

  # Upstream keep-alive
  upstream-keepalive-connections: "100"
  upstream-keepalive-timeout: "55"  # Backend: 60s

  # Requests per connection
  upstream-keepalive-requests: "1000"

AWS ALB + Target Groups

ALB idle timeout: 60s (default)
Target (application) keep-alive timeout: 65s  ← Target HIGHER than ALB, because the ALB is the client here

# Terraform
resource "aws_lb_target_group" "app" {
  # ...
  deregistration_delay = 30

  stickiness {
    enabled = true
    type    = "lb_cookie"
  }
}

resource "aws_lb" "app" {
  # ...
  idle_timeout = 60  # ALB timeout
}

# Application must have timeout > 60s

Retry Strategy

Idempotent Request Retry

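// retry.go
import (
    "errors"
    "fmt"
    "io"
    "net/http"
    "syscall"
    "time"
)

const maxAttempts = 3

// doWithRetry retries idempotent requests that fail with a connection error.
// It assumes the request has no body, or that req.GetBody is set so the body
// can be replayed on a later attempt.
func doWithRetry(client *http.Client, req *http.Request) (*http.Response, error) {
    var lastErr error

    for attempt := 1; attempt <= maxAttempts; attempt++ {
        resp, err := client.Do(req)
        if err == nil {
            return resp, nil
        }

        // Only retry connection errors on idempotent methods
        if !isConnectionError(err) || !isIdempotent(req.Method) {
            return nil, err
        }
        lastErr = err

        if attempt < maxAttempts {
            // Rewind the body, if any, before the next attempt
            if req.GetBody != nil {
                body, bodyErr := req.GetBody()
                if bodyErr != nil {
                    return nil, bodyErr
                }
                req.Body = body
            }
            // Linear backoff: 100ms, then 200ms
            time.Sleep(time.Duration(attempt*100) * time.Millisecond)
        }
    }

    return nil, fmt.Errorf("after %d attempts: %w", maxAttempts, lastErr)
}

// isConnectionError reports whether err is a reset or a premature close.
func isConnectionError(err error) bool {
    return errors.Is(err, syscall.ECONNRESET) ||
        errors.Is(err, syscall.EPIPE) ||
        errors.Is(err, io.EOF) ||
        errors.Is(err, io.ErrUnexpectedEOF)
}

func isIdempotent(method string) bool {
    switch method {
    case http.MethodGet, http.MethodHead, http.MethodOptions,
        http.MethodPut, http.MethodDelete:
        return true
    }
    return false
}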

Spring Retry

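@Configuration
@EnableRetry
public class RetryConfig {

    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

@Service
public class ApiClient {

    private static final Logger log = LoggerFactory.getLogger(ApiClient.class);

    private final RestTemplate restTemplate;

    public ApiClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }

    // RestTemplate wraps I/O failures such as connection resets
    // in ResourceAccessException
    @Retryable(
        value = {ResourceAccessException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100)
    )
    public String callApi() {
        // A plain RestTemplate needs an absolute URL or a configured root URI
        return restTemplate.getForObject("/api/data", String.class);
    }

    // Called once all attempts have failed
    @Recover
    public String fallback(ResourceAccessException e) {
        log.error("All retries failed", e);
        // ServiceUnavailableException is an application-defined exception
        throw new ServiceUnavailableException("API unavailable");
    }
}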

Monitoring

Prometheus Metrics

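# Metric and label names below depend on the exporter in use
# Client-aborted requests seen by nginx (499 = client closed the connection before the response)
# Resets on upstream connections usually show up as 502s and in the nginx error log instead
sum(rate(nginx_http_requests_total{status="499"}[5m]))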

# Connection errors (client side)
sum(rate(http_client_errors_total{error="connection_reset"}[5m]))

# Keep-alive connections
sum(nginx_http_connections{state="kept"})
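
The client-side series has to be produced by the application itself. The sketch below shows one way to classify and count transport errors with a wrapping `http.RoundTripper`, using only the standard library. `countingTransport`, `errorLabel`, and the label strings are illustrative names, and exporting the counters to Prometheus would need a client library that is not shown here.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"os"
	"sync"
	"syscall"
)

// errorLabel maps a transport error to a metric label ("" means no error).
func errorLabel(err error) string {
	switch {
	case err == nil:
		return ""
	case errors.Is(err, syscall.ECONNRESET):
		return "connection_reset"
	case errors.Is(err, syscall.EPIPE):
		return "broken_pipe"
	case errors.Is(err, io.EOF), errors.Is(err, io.ErrUnexpectedEOF):
		return "eof"
	default:
		return "other"
	}
}

// countingTransport wraps another RoundTripper and counts requests and failures.
type countingTransport struct {
	next http.RoundTripper

	mu       sync.Mutex
	requests int64
	failures map[string]int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := t.next.RoundTrip(req)

	t.mu.Lock()
	defer t.mu.Unlock()
	t.requests++
	if label := errorLabel(err); label != "" {
		if t.failures == nil {
			t.failures = make(map[string]int64)
		}
		t.failures[label]++
	}
	return resp, err
}

// roundTripFunc lets a plain function act as a RoundTripper in the demo.
type roundTripFunc func(*http.Request) (*http.Response, error)

func (f roundTripFunc) RoundTrip(r *http.Request) (*http.Response, error) { return f(r) }

// demoCounts sends three requests through a fake transport whose first call
// fails with ECONNRESET, and returns the resulting counters.
func demoCounts() (int64, map[string]int64) {
	calls := 0
	fake := roundTripFunc(func(*http.Request) (*http.Response, error) {
		calls++
		if calls == 1 {
			return nil, &net.OpError{Op: "read", Net: "tcp",
				Err: os.NewSyscallError("read", syscall.ECONNRESET)}
		}
		return &http.Response{StatusCode: http.StatusOK, Body: http.NoBody}, nil
	})
	ct := &countingTransport{next: fake}
	client := &http.Client{Transport: ct}
	for i := 0; i < 3; i++ {
		if resp, err := client.Get("http://example.invalid/"); err == nil {
			resp.Body.Close()
		}
	}
	return ct.requests, ct.failures
}

func main() {
	requests, failures := demoCounts()
	fmt.Println("requests:", requests, "failures:", failures)
}
```

In a real client the wrapper would sit around the `http.Transport` configured earlier, so every outgoing request is counted in one place.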

Alerts

groups:
- name: connection_errors
  rules:
  - alert: HighConnectionResetRate
    expr: |
      sum(rate(http_client_errors_total{error=~"connection_reset|broken_pipe"}[5m]))
      /
      sum(rate(http_client_requests_total[5m]))
      > 0.001
    for: 10m
    annotations:
      summary: "Connection reset rate >0.1%"
      description: "Check keep-alive timeout alignment"

Checklist

## Keep-Alive Configuration

### Timeout Alignment
- [ ] Document all keep-alive timeouts in chain
- [ ] Ensure: Client < Proxy < Server
- [ ] Add 5-10 second safety margin

### Configuration
Client:
- [ ] Go: IdleConnTimeout < server timeout
- [ ] Java: evictIdleConnections() < server timeout

Proxy (nginx):
- [ ] keepalive_timeout (client side)
- [ ] upstream keepalive_timeout (backend side)

Server:
- [ ] IdleTimeout configured
- [ ] Documented for clients

### Testing
- [ ] Load test with idle periods
- [ ] Monitor connection reset rate
- [ ] Test under various load patterns

### Monitoring
- [ ] Alert on connection reset rate > 0.1%
- [ ] Track keep-alive connection count
- [ ] Log connection errors with context

Conclusion

Keep-alive connection resets are a configuration problem:

  1. Client timeout must be < server timeout - Always close first
  2. Add 5-10s safety margin - Account for network latency and timer granularity
  3. Configure all layers - Client, proxy, server
  4. Retry idempotent requests - Handle unavoidable resets

Chain: Client (55s) → nginx (60s) → App (65s)


Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "HTTP Keep-Alive Connection Reset: Why Your Requests Fail with 'Connection Reset by Peer'". https://www.michal-drozd.com/en/blog/http-keepalive-connection-reset/ (Published July 16, 2025).