ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn
The symptom pattern is annoyingly periodic:
- 502/504 errors spike for a few seconds… every few minutes.
- gRPC and keep-alive heavy clients reconnect and “heal on retry”.
- Backends are healthy (readiness is green).
- ingress-nginx pods are not crashing.
If you see spiky 502s with healthy backends, ask one question first:
Are we reloading NGINX too often?
ingress-nginx generates NGINX config and reloads when Kubernetes objects change. Under churn (Ingress updates, cert renewals, ExternalDNS, canary flips), reloads stop being a configuration step and become a micro-outage generator.
Tested on: Kubernetes 1.29–1.31, ingress-nginx 1.10–1.11, NGINX 1.25+, HTTP/1.1 + HTTP/2, cloud and on‑prem LBs in front.
Incident narrative (anonymized)
We had 502 spikes that always went away on retry. The backends were stable and p99 on the application side didn’t explain it.
The clue was synchronization: multiple unrelated services had 502 spikes at the same timestamps. That points to a shared edge component, not an individual backend.
Root cause:
- GitOps was continuously reconciling Ingress objects.
- The reconciler produced “no-op” updates (annotation ordering, rewritten defaults).
- ingress-nginx treated them as config changes and reloaded each time.
- Reload time grew as the generated config grew.
Constraint: we could not stop deployments. We needed an operational budget: reduce reload frequency first, then make reload less disruptive.
Timeline
- T-0: 502 spikes observed, mostly on keep-alive connections and gRPC.
- T+10m: ingress-nginx logs show reload events aligned with spikes.
- T+20m: reload frequency measured at multiple reloads per minute during deploy windows.
- T+30m: mitigation: pause non-essential churn and increase graceful shutdown budgets.
- T+60m: 502 spikes drop significantly.
- T+1d: fixes shipped: stop no-op updates, consolidate Ingress resources, alert on reload rate.
Mechanism: why reload can cause 502 with healthy backends
NGINX reload is not a no-op:
- the master validates the new config
- new workers start with the new config
- old workers drain and eventually exit
Even with a graceful reload, some connections will be closed:
- idle keep-alives can be dropped
- long-lived HTTP/2 streams can be reset depending on timing
- client retries mask the failures until tail latency (p99) climbs
This gets worse when:
- reloads happen again before old workers finish draining
- the config is large (parsing/validation time)
- the controller is CPU-throttled, stretching reload time windows
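A toy model helps show why back-to-back reloads compound. The sketch below assumes each reload leaves one generation of old workers draining for a fixed window; `drain_seconds` is a hypothetical parameter standing in for the worst case allowed by your shutdown timeout. It counts how many old generations are alive at once and is an illustration of the overlap, not a model of NGINX internals.

```python
from typing import Iterable


def max_draining_generations(reload_times: Iterable[float], drain_seconds: float) -> int:
    """Return the peak number of old worker generations draining at the same time.

    reload_times: reload timestamps in seconds (any epoch).
    drain_seconds: assumed worst-case time an old generation needs to exit.
    """
    events = []
    for t in reload_times:
        events.append((t, 1))                   # an old generation starts draining
        events.append((t + drain_seconds, -1))  # that generation exits
    # At equal timestamps, process exits (-1) before starts (+1) so that
    # windows which merely touch are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak
```

With reloads at 0, 10 and 20 seconds and a 30-second drain window, three old generations coexist, each still holding memory and client connections.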
Runbook: prove reload is the cause
1) Correlate spikes with controller reload logs
Start with logs. Different versions log different phrases, but the intent is always visible.
kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --since=60m | \
rg -n "Reloading|reloaded|Configuration changes detected|backend reload" | tail -n 50
If timestamps line up with 502 spikes, you have a strong signal.
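Eyeballing timestamps works for one incident; for a longer window it can help to quantify the overlap. The following is a minimal sketch, assuming you have already extracted two lists of timestamps (reload log lines and 502 spike starts) by whatever means you have. The function names and the 10-second default window are illustrative choices, not anything ingress-nginx provides.

```python
from bisect import bisect_left
from datetime import datetime, timedelta
from typing import List, Sequence


def spikes_near_reload(spikes: Sequence[datetime], reloads: Sequence[datetime],
                       window: timedelta = timedelta(seconds=10)) -> List[bool]:
    """For each spike, report whether any reload happened within +/- window."""
    ordered = sorted(reloads)
    result = []
    for spike in spikes:
        # First reload at or after the start of the tolerance window.
        i = bisect_left(ordered, spike - window)
        result.append(i < len(ordered) and ordered[i] <= spike + window)
    return result


def matched_fraction(spikes: Sequence[datetime], reloads: Sequence[datetime],
                     window: timedelta = timedelta(seconds=10)) -> float:
    """Fraction of spikes that have a reload nearby (0.0 when there are no spikes)."""
    flags = spikes_near_reload(spikes, reloads, window)
    return sum(flags) / len(flags) if flags else 0.0
```

A fraction close to 1.0 supports the reload hypothesis. A low fraction suggests looking elsewhere before changing ingress settings.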
2) Measure reload rate (don’t guess)
If you scrape ingress-nginx metrics, search its /metrics output for reload counters rather than assuming metric names:
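# port-forward blocks, so run it in the background (or use a second terminal).
kubectl -n ingress-nginx port-forward deploy/ingress-nginx-controller 10254:10254 &
sleep 2; curl -s http://127.0.0.1:10254/metrics | rg -n "reload" | head -n 50
kill $!  # stop the background port-forward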
Your goal is a budget: reloads per minute under steady state should be near zero.
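Once you know which counter tracks reloads in your version, the rate is the difference between two scrapes. The sketch below assumes you saved two `/metrics` outputs some seconds apart and that the reload counter name is passed in by you. The name `example_reload_total` in the usage note is a placeholder, not a real ingress-nginx metric. The parser handles only the simple text exposition lines needed here.

```python
def counter_total(metrics_text: str, metric_name: str) -> float:
    """Sum every series of metric_name in Prometheus text exposition output."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        if "{" in line:
            name, rest = line.split("{", 1)
            if "}" not in rest:
                continue  # malformed line, ignore it
            value_part = rest.rsplit("}", 1)[1]
        else:
            name, _, value_part = line.partition(" ")
        tokens = value_part.split()
        if name.strip() != metric_name or not tokens:
            continue
        total += float(tokens[0])  # first token is the value; a timestamp may follow
    return total


def reloads_per_minute(before: float, after: float, elapsed_seconds: float) -> float:
    """Convert two counter readings into a per-minute rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    delta = after - before
    if delta < 0:
        # Counter went backwards: assume the controller restarted and counted from zero.
        delta = after
    return delta * 60.0 / elapsed_seconds
```

For example, `reloads_per_minute(counter_total(first, "example_reload_total"), counter_total(second, "example_reload_total"), 120)` gives the rate across a two-minute gap between scrapes.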
3) Find the churn source
Common churn generators:
- GitOps tools doing frequent patch loops
- cert-manager updates (new TLS secrets, renewed certs)
- ExternalDNS updates (annotations/records)
- canary tooling toggling annotations frequently
During the incident window, look for “who touched Ingress”:
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
kubectl get ingress -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,GEN:.metadata.generation --sort-by=.metadata.generation | tail -n 20
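If you see generation bumps without real route changes, that’s often a no-op churn bug. Keep in mind that metadata.generation tracks spec changes; annotation-only rewrites typically leave it untouched and show up instead as a changing metadata.resourceVersion or fresh timestamps in metadata.managedFields.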
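To check whether two versions of an Ingress differ in anything that matters, you can compare them after stripping server-managed fields. The sketch below assumes you have both versions as parsed JSON (for example from `kubectl get ingress <name> -o json` captured at two points in time). The list of fields treated as volatile is an assumption you should adjust, and the function only says whether the objects are semantically equal, not whether the controller would reload.

```python
import copy
import json

# Server-managed metadata that changes on every write (assumed safe to ignore).
VOLATILE_METADATA = ("resourceVersion", "generation", "managedFields",
                     "uid", "creationTimestamp", "selfLink")
# Bookkeeping annotation written by kubectl apply (assumed not to affect routing).
IGNORED_ANNOTATIONS = ("kubectl.kubernetes.io/last-applied-configuration",)


def canonical(ingress: dict) -> str:
    """Serialize an Ingress with volatile fields removed and keys sorted."""
    obj = copy.deepcopy(ingress)
    obj.pop("status", None)
    meta = obj.get("metadata", {})
    for key in VOLATILE_METADATA:
        meta.pop(key, None)
    annotations = {k: v for k, v in (meta.get("annotations") or {}).items()
                   if k not in IGNORED_ANNOTATIONS}
    if annotations:
        meta["annotations"] = annotations
    else:
        meta.pop("annotations", None)  # treat missing, null and empty alike
    # sort_keys removes differences that come only from key ordering.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))


def is_noop_update(old: dict, new: dict) -> bool:
    """True when two versions differ only in volatile fields or key order."""
    return canonical(old) == canonical(new)
```

If `is_noop_update` returns True for consecutive versions while reloads keep firing, the writer of those updates is a good first suspect.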
4) Confirm backends are healthy (avoid false attribution)
kubectl -n <app-ns> get pods -o wide
kubectl -n <app-ns> get endpoints <svc> -o wide
If backend pods are stable and only the edge spikes, you’re debugging ingress behavior, not the app.
Safe mitigations (during incident)
1) Stop touching the config
Pause what creates churn:
- pause GitOps sync for Ingress resources
- avoid canary flips
- delay DNS/cert changes during peak traffic
2) Give reloads more drain budget
ingress-nginx exposes NGINX knobs via ConfigMap (validate exact keys for your version in your deployment docs). The intent:
- allow old workers to drain longer
- avoid dropping keep-alives during tight reload windows
Representative example:
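apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  # Compare against your version's default before lowering this
  # (the upstream ConfigMap docs list 240s); a smaller value shortens the drain window.
  worker-shutdown-timeout: "30s"
  keep-alive: "75"
  keep-alive-requests: "10000"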
And ensure the controller pod has a termination grace period that can cover drain:
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
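The two settings only help if they agree with each other. A small sanity check you could run in CI is sketched below, assuming the timeout is written as a plain number with an optional `s`, `m` or `h` suffix (other NGINX time formats are not handled). The 10-second margin is an arbitrary illustrative default.

```python
import re

_UNIT_SECONDS = {"": 1, "s": 1, "m": 60, "h": 3600}


def parse_seconds(value: str) -> int:
    """Parse values such as "30", "30s", "2m" or "1h" into seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smh]?)\s*", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def grace_covers_drain(worker_shutdown_timeout: str,
                       termination_grace_seconds: int,
                       margin_seconds: int = 10) -> bool:
    """True when the pod grace period exceeds the worker drain timeout plus a margin."""
    return termination_grace_seconds >= parse_seconds(worker_shutdown_timeout) + margin_seconds
```

With the values from the examples above, `grace_covers_drain("30s", 60)` holds. Raising the timeout without raising the grace period would not.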
3) Remove CPU throttling from the controller
If the controller is CPU-throttled, reloads take longer and disruption windows widen. Give it CPU headroom and consider horizontal scaling.
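Throttling can be confirmed from the container's cgroup statistics rather than inferred. The sketch below parses the contents of a `cpu.stat` file and relies on the `nr_periods` and `nr_throttled` fields. Where that file lives depends on your cgroup version and runtime, so the path in the usage note is an assumption to verify on your nodes.

```python
def throttled_ratio(cpu_stat_text: str) -> float:
    """Return nr_throttled / nr_periods from cgroup cpu.stat contents."""
    fields = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            fields[parts[0]] = int(parts[1])
    periods = fields.get("nr_periods", 0)
    if periods == 0:
        return 0.0  # no CPU limit enforced yet, or no data
    return fields.get("nr_throttled", 0) / periods
```

The counters are cumulative, so a single reading (for example from `/sys/fs/cgroup/cpu.stat` inside the controller container on a cgroup v2 node) reflects the container's lifetime. To judge an incident window, read the file twice and compare the deltas.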
What we changed (concrete)
1) Stop no-op updates at the source
Our GitOps pipeline was rewriting annotations even when the rendered config didn’t change. Fixing that reduced reloads by an order of magnitude.
2) Consolidate and de-churn Ingress resources
Hundreds of tiny Ingress objects create a huge generated config and lots of update events. Consolidating routes reduced:
- config size
- reload time
- change frequency
3) Budget reload and make it observable
We added:
- dashboard panel: reloads per minute
- alert: reload rate above budget
- alert: reload “start without success” for more than N seconds (log-based if needed)
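The two alerts reduce to simple conditions over reload events. The sketch below is one way to express them, assuming you can turn log lines or metrics into timestamps and `(timestamp, kind)` pairs; the `"start"` and `"success"` labels and both function names are hypothetical. In practice these conditions would more likely be written as rules in your alerting system.

```python
from typing import Iterable, Optional, Sequence, Tuple


def over_budget(reload_timestamps: Sequence[float], now: float,
                window_seconds: float, budget: int) -> bool:
    """True when more than `budget` reloads happened in the trailing window."""
    recent = [t for t in reload_timestamps if now - window_seconds < t <= now]
    return len(recent) > budget


def stuck_reload_since(events: Iterable[Tuple[float, str]], now: float,
                       max_seconds: float) -> Optional[float]:
    """Return the start time of a reload pending longer than max_seconds, else None.

    events must be in time order; kind is "start" or "success".
    """
    pending = None
    for ts, kind in events:
        if kind == "start":
            if pending is None:
                pending = ts  # keep the earliest unconfirmed start
        elif kind == "success":
            pending = None
    if pending is not None and now - pending > max_seconds:
        return pending
    return None
```

Keeping the budget and the window as explicit parameters makes it easier to use a stricter budget at steady state and a looser one during deploy windows.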
How to verify
- 502 spikes no longer align with reload log lines.
- Reload rate stays low under steady state.
- Long-lived clients (gRPC streams, keep-alive heavy clients) stop reconnecting periodically.
Prevention / guardrails
- Treat Ingress updates as a production change; no-op updates are outages waiting to happen.
- Put a budget on reload rate (reloads at steady state should be rare and explainable).
- Keep the ingress controller out of CPU throttling territory.
- Coordinate cert/DNS churn to avoid peak-traffic reload storms.
Related reading
- HTTP Keep-Alive Connection Reset: Why Your Requests Fail with ‘Connection Reset by Peer’
- Kubernetes Graceful Shutdown as a Contract: Zero 502s During Rollouts (HTTP + gRPC)
- tcpdump Sees SYNs, but the Service Times Out: The Listen Backlog Trap
- PMTU Blackholes: When Only Large Responses Hang
- kube-proxy Micro-Outages: The xtables Lock Contention Problem
- Structured Logging Performance: When Your Logger Becomes the Bottleneck
- Ephemeral Port Exhaustion: The Node That ‘Goes Bad’