HTTP Keep-Alive Connection Reset: Why Your Requests Fail with 'Connection Reset by Peer'
I still have a note from that night: "why does only 0.1% fail?" Some bugs scream. This one only whispered: 0.1% of requests failed, only in production, and never locally. All I could see was a trickle of connection reset by peer errors.
This is almost always a mismatch between keep-alive timeouts: the server closes an idle connection at the very moment the client sends a new request on it.
Tested on: nginx 1.24, Go 1.22, Java 21, Kubernetes 1.28
Race Condition
How Keep-Alive Works
Without Keep-Alive:
Client                     Server
|--- TCP SYN ------------->|
|<-- TCP SYN-ACK ----------|
|--- TCP ACK ------------->|
|--- HTTP Request -------->|
|<-- HTTP Response --------|
|--- TCP FIN ------------->| ← Connection closed
|<-- TCP FIN-ACK ----------|
With Keep-Alive:
Client                      Server
|--- TCP Connect ---------->|
|--- HTTP Request 1 ------->|
|<-- HTTP Response 1 -------|
|--- HTTP Request 2 ------->| ← Same connection!
|<-- HTTP Response 2 -------|
...
(the connection stays open)
Race Condition
Server timeout: 60 seconds
Client timeout: 90 seconds
Timeline:
T+0s: Request 1 completes
T+59s: Client prepares a new request
T+60s: Server closes the connection (timeout!)
T+60s: Client sends the request on the "open" connection
→ "Connection reset by peer"
Packet Level
T+60.000s: Server sends FIN
T+60.001s: Client sends the HTTP request (the FIN has not reached it yet)
T+60.002s: Server receives the request on a closed socket
T+60.002s: Server sends RST (reset)
T+60.003s: Client receives RST → error!
Diagnosing the Problem
Symptoms
Error patterns:
- "connection reset by peer"
- "ECONNRESET"
- "broken pipe"
- java.net.SocketException: Connection reset
- Go net/http: unexpected "EOF" or "http: server closed idle connection"
Characteristics:
- Sporadic (0.01% - 1% of requests)
- More frequent under low load (connections sit idle longer)
- Cannot be reproduced locally
- Happens after a period of inactivity
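Finding the Timeouts
# nginx
grep keepalive_timeout /etc/nginx/nginx.conf
# keepalive_timeout 65;
# AWS ALB
# Idle timeout default: 60 seconds (configurable through the load balancer's idle timeout attribute)
# Kubernetes Ingress (nginx); the pattern matches both "keep-alive" and "keepalive" keys
kubectl get configmap ingress-nginx-controller -o yaml | grep -iE 'keep-?alive'
# Go server
# http.Server.IdleTimeout (zero falls back to ReadTimeout; if both are zero there is no timeout!)
# Java / Tomcat
# server.tomcat.keep-alive-timeout (when unset, Tomcat uses the connection-timeout value; verify the default for your version)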
Solutions
The Rule: Client Timeout < Server Timeout
Server keeps the connection open for: 60s
Client closes the connection after: 55s ← 5s safety margin
The client always closes first → no race condition
Go Client
// http_client.go
import (
	"net"
	"net/http"
	"time"
)
func newHTTPClient() *http.Client {
	transport := &http.Transport{
		// Limits for the idle connection pool
		MaxIdleConns:        100,
		MaxIdleConnsPerHost: 10,
		// CRITICAL: close idle connections before the server does
		IdleConnTimeout: 55 * time.Second, // Server: 60s
		// Connect timeout and TCP keep-alive probe interval
		DialContext: (&net.Dialer{
			Timeout:   30 * time.Second,
			KeepAlive: 30 * time.Second,
		}).DialContext,
	}
	return &http.Client{
		Transport: transport,
		Timeout:   30 * time.Second, // Overall request timeout
	}
}
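To check that a client really reuses connections, and how long a reused connection sat idle, the standard `net/http/httptrace` package can report this per request. The sketch below assumes nothing beyond the standard library; `connInfo`, `getWithTrace` and `reuseDemo` are illustrative names, and the demo runs against a local test server rather than a real backend.

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
	"time"
)

// connInfo records what the transport reported for one request.
type connInfo struct {
	Reused   bool          // the connection had been used for an earlier request
	WasIdle  bool          // the connection came from the idle pool
	IdleTime time.Duration // how long it sat idle (only meaningful if WasIdle)
}

// getWithTrace performs a GET and reports which kind of connection served it.
func getWithTrace(client *http.Client, url string) (connInfo, error) {
	var info connInfo
	trace := &httptrace.ClientTrace{
		GotConn: func(g httptrace.GotConnInfo) {
			info = connInfo{Reused: g.Reused, WasIdle: g.WasIdle, IdleTime: g.IdleTime}
		},
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return info, err
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	resp, err := client.Do(req)
	if err != nil {
		return info, err
	}
	defer resp.Body.Close()
	// The body must be read to the end for the connection to return to the pool.
	_, err = io.Copy(io.Discard, resp.Body)
	return info, err
}

// reuseDemo issues two requests against a local test server.
func reuseDemo() (first, second connInfo, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()
	client := &http.Client{Transport: &http.Transport{IdleConnTimeout: 55 * time.Second}}
	defer client.CloseIdleConnections()
	if first, err = getWithTrace(client, srv.URL); err != nil {
		return first, second, err
	}
	second, err = getWithTrace(client, srv.URL)
	return first, second, err
}
```

Logging `IdleTime` next to each connection error is a cheap way to confirm the diagnosis: if failures cluster just below the server's idle timeout, the timeouts are misaligned.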
Go Server
// server.go
server := &http.Server{
	Addr:    ":8080",
	Handler: handler,
	// Keep-alive (idle) timeout
	IdleTimeout: 60 * time.Second,
	// Read/write timeouts
	ReadTimeout:  30 * time.Second,
	WriteTimeout: 30 * time.Second,
}
Java Client (Apache HttpClient 5)
// HttpClientConfig.java
import org.apache.hc.client5.http.impl.classic.CloseableHttpClient;
import org.apache.hc.client5.http.impl.classic.HttpClients;
import org.apache.hc.client5.http.impl.io.PoolingHttpClientConnectionManager;
import org.apache.hc.core5.util.TimeValue;
public CloseableHttpClient createHttpClient() {
    PoolingHttpClientConnectionManager connectionManager =
        new PoolingHttpClientConnectionManager();
    connectionManager.setMaxTotal(100);
    connectionManager.setDefaultMaxPerRoute(10);
    // Evict idle connections before the server timeout
    // Server: 60s, Client: 55s
    // evictIdleConnections() runs a background eviction thread; a one-off
    // connectionManager.closeIdle(...) call here would only see an empty pool.
    return HttpClients.custom()
        .setConnectionManager(connectionManager)
        .evictExpiredConnections()
        .evictIdleConnections(TimeValue.ofSeconds(55))
        .build();
}
Java Server (Spring Boot)
# application.yml
server:
  tomcat:
    # Keep-alive timeout
    keep-alive-timeout: 60s
    # Max keep-alive requests per connection
    max-keep-alive-requests: 100
    # Connection timeout
    connection-timeout: 30s
nginx Configuration
http {
    # Keep-alive towards clients
    keepalive_timeout 65s;
    # Keep-alive towards upstreams (backends)
    upstream backend {
        server app:8080;
        # Reuse connections to the backend (idle connections cached per worker)
        keepalive 100;
        # CRITICAL: close before the backend timeout
        keepalive_timeout 55s; # Backend: 60s
    }
    server {
        location / {
            proxy_pass http://backend;
            # Enable keep-alive to the upstream
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }
}
Kubernetes Ingress (nginx)
# ingress-nginx ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
data:
  # Client keep-alive
  keep-alive: "65"
  # Upstream keep-alive
  upstream-keepalive-connections: "100"
  upstream-keepalive-timeout: "55" # Backend: 60s
  # Requests per connection
  upstream-keepalive-requests: "1000"
AWS ALB + Target Groups
ALB idle timeout: 60s (default)
Target (application) keep-alive timeout: 65s ← target HIGHER than the ALB
# Terraform
resource "aws_lb_target_group" "app" {
  # ...
  deregistration_delay = 30
  stickiness {
    enabled = true
    type    = "lb_cookie"
  }
}
resource "aws_lb" "app" {
  # ...
  idle_timeout = 60 # ALB timeout
}
# The application must have a keep-alive timeout > 60s
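Retry Strategy
Idempotent Request Retry
// retry.go
import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"syscall"
	"time"
)
func doWithRetry(client *http.Client, req *http.Request) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		if attempt > 0 {
			// Linear backoff before each retry: 100ms, then 200ms
			time.Sleep(time.Duration(attempt*100) * time.Millisecond)
			// A consumed body must be rewound before the request is sent again
			if req.Body != nil && req.Body != http.NoBody {
				if req.GetBody == nil {
					return nil, lastErr // body cannot be replayed
				}
				body, err := req.GetBody()
				if err != nil {
					return nil, err
				}
				req.Body = body
			}
		}
		resp, err := client.Do(req)
		if err == nil {
			return resp, nil
		}
		// Retry only connection errors, and only for idempotent methods
		if !isConnectionError(err) || !isIdempotent(req.Method) {
			return nil, err
		}
		lastErr = err
	}
	return nil, fmt.Errorf("after 3 attempts: %w", lastErr)
}
func isConnectionError(err error) bool {
	// errors.Is unwraps *url.Error and *net.OpError down to the underlying errno
	return errors.Is(err, syscall.ECONNRESET) ||
		errors.Is(err, syscall.EPIPE) ||
		errors.Is(err, io.EOF) ||
		errors.Is(err, io.ErrUnexpectedEOF)
}
func isIdempotent(method string) bool {
	switch method {
	case http.MethodGet, http.MethodHead, http.MethodOptions,
		http.MethodPut, http.MethodDelete:
		return true
	}
	return false
}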
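Spring Retry
@Configuration
@EnableRetry
public class RetryConfig {
    @Bean
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}
@Service
public class ApiClient {
    private static final Logger log = LoggerFactory.getLogger(ApiClient.class);
    private final RestTemplate restTemplate;
    public ApiClient(RestTemplate restTemplate) {
        this.restTemplate = restTemplate;
    }
    // RestTemplate wraps I/O failures such as connection resets in ResourceAccessException
    @Retryable(
        value = {ResourceAccessException.class},
        maxAttempts = 3,
        backoff = @Backoff(delay = 100)
    )
    public String callApi() {
        return restTemplate.getForObject("/api/data", String.class);
    }
    // Invoked once all attempts have failed
    @Recover
    public String fallback(ResourceAccessException e) {
        log.error("All retries failed", e);
        // ServiceUnavailableException is an application-defined exception
        throw new ServiceUnavailableException("API unavailable");
    }
}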
Monitoring
Prometheus Metrics
# Metric and label names depend on your exporter and client instrumentation
# Requests the client aborted before nginx responded (status 499)
sum(rate(nginx_http_requests_total{status="499"}[5m]))
# Connection errors (client side)
sum(rate(http_client_errors_total{error="connection_reset"}[5m]))
# Keep-alive connections
sum(nginx_http_connections{state="kept"})
Alerts
groups:
  - name: connection_errors
    rules:
      - alert: HighConnectionResetRate
        expr: |
          sum(rate(http_client_errors_total{error=~"connection_reset|broken_pipe"}[5m]))
          /
          sum(rate(http_client_requests_total[5m]))
          > 0.001
        for: 10m
        annotations:
          summary: "Connection reset rate >0.1%"
          description: "Check keep-alive timeout alignment"
Checklist
## Keep-Alive Configuration
### Timeout Alignment
- [ ] Document every keep-alive timeout in the chain
- [ ] Ensure: Client < Proxy < Server
- [ ] Add a 5-10 second safety margin
### Configuration
Client:
- [ ] Go: IdleConnTimeout < server timeout
- [ ] Java: evictIdleConnections() < server timeout
Proxy (nginx):
- [ ] keepalive_timeout (client side)
- [ ] upstream keepalive_timeout (backend side)
Server:
- [ ] IdleTimeout configured
- [ ] Documented for clients
### Testing
- [ ] Load test with idle periods
- [ ] Monitor the connection reset rate
- [ ] Test under different load patterns
### Monitoring
- [ ] Alert on connection reset rate > 0.1%
- [ ] Track the number of keep-alive connections
- [ ] Log connection errors with context
Conclusion
Keep-alive connection resets are a configuration problem:
- Client timeout must be < server timeout - always be the side that closes first
- Add a 5-10s safety margin - it absorbs network latency and timer imprecision
- Configure every layer - client, proxy, server
- Retry idempotent requests - handle the resets you cannot avoid
Chain: Client (55s) → nginx (60s) → App (65s)
Related Articles
- Connection Pool Sizing with Little's Law - Connection management
- Circuit Breaker vs Rate Limiter vs Bulkhead - Resilience patterns