ingress-nginx Reload Storms: Why 502 Spikes Track Ingress Churn

The symptom pattern is annoyingly periodic:

502/504 errors spike for a few seconds… every few minutes.
gRPC and keep-alive heavy clients reconnect and “heal on retry”.
Backends are healthy (readiness is green).
ingress-nginx pods are not crashing.

If you see spiky 502s with healthy backends, ask one question first:

Are we reloading NGINX too often?

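ingress-nginx generates the NGINX config and reloads NGINX when Kubernetes objects change in ways that affect that config. Endpoint-only changes are applied without a reload, but Ingress rule, annotation, and TLS Secret changes generally are not. Under churn (Ingress updates, cert renewals, ExternalDNS, canary flips), reloads stop being a configuration step and start acting as a micro-outage generator.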

Tested on: Kubernetes 1.29–1.31, ingress-nginx 1.10–1.11, NGINX 1.25+, HTTP/1.1 + HTTP/2, cloud and on‑prem LBs in front.

Incident narrative (anonymized)

We had 502 spikes that always went away on retry. The backends were stable and p99 on the application side didn’t explain it.

The clue was synchronization: multiple unrelated services had 502 spikes at the same timestamps. That points to a shared edge component, not an individual backend.

Root cause:

GitOps was continuously reconciling Ingress objects.
The reconciler produced “no-op” updates (annotation ordering, rewritten defaults).
ingress-nginx treated them as config changes and reloaded each time.
Reload time grew as the generated config grew.

Constraint: we could not stop deployments. We needed an operational budget: reduce reload frequency first, then make reload less disruptive.

Timeline

T-0: 502 spikes observed, mostly on keep-alive connections and gRPC.
T+10m: ingress-nginx logs show reload events aligned with spikes.
T+20m: reload frequency measured at multiple reloads per minute during deploy windows.
T+30m: mitigation: pause non-essential churn and increase graceful shutdown budgets.
T+60m: 502 spikes drop significantly.
T+1d: fixes shipped: stop no-op updates, consolidate Ingress resources, alert on reload rate.

Mechanism: why reload can cause 502 with healthy backends

NGINX reload is not a no-op:

the master validates the new config
new workers start with the new config
old workers drain and eventually exit

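Even with a graceful reload, some connections will be closed:

idle keep-alive connections are closed when old workers exit; a load balancer in front that reuses one at that moment reports a 502
long-lived HTTP/2 streams (including gRPC) can be reset if they outlive the old worker's drain window
client retries mask the errors until tail latency (p99) degrades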

This gets worse when:

reloads happen again before old workers finish draining
the config is large (parsing/validation time)
the controller is CPU-throttled, stretching reload time windows
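
Overlapping drains can be observed directly by counting workers that are still shutting down. The following is a minimal sketch, not a definitive check: it assumes the controller image ships a `ps` binary and that `<controller-pod>` is replaced with a real pod name. NGINX sets the process title of a draining worker to `nginx: worker process is shutting down`.

```shell
# Count NGINX workers that are still draining, given `ps` output on stdin.
# Prints 0 when no worker is draining.
count_draining_workers() {
  grep -c 'nginx: worker process is shutting down' || true
}

# Usage (assumes `ps` exists in the image; the pod name is a placeholder):
#   kubectl -n ingress-nginx exec <controller-pod> -- ps -ef | count_draining_workers
```

A count that keeps growing across consecutive reloads suggests new reloads are arriving before old workers finish draining.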

Runbook: prove reload is the cause

1) Correlate spikes with controller reload logs

Start with logs. Different versions log different phrases, but the intent is always visible.

kubectl -n ingress-nginx logs deploy/ingress-nginx-controller --since=60m | \
  rg -n "Reloading|reloaded|Configuration changes detected|backend reload" | tail -n 50

If timestamps line up with 502 spikes, you have a strong signal.
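
To line reload activity up against an error graph, it helps to bucket the log lines per minute. This is a rough sketch: it assumes logs are fetched with `kubectl logs --timestamps` (so every line starts with an RFC 3339 timestamp) and that `Configuration changes detected` is the phrase your controller version logs before a reload; pass a different pattern if yours differs.

```shell
# Count reload-related log lines per minute.
# stdin: `kubectl logs --timestamps` output, each line starting with a
#        timestamp such as 2024-05-01T10:00:01.123456789Z
# $1:    pattern to count (default is an assumption; adjust per version)
reloads_per_minute() {
  awk -v pat="${1:-Configuration changes detected}" '
    $0 ~ pat { count[substr($1, 1, 16)]++ }   # first 16 chars = YYYY-MM-DDTHH:MM
    END { for (minute in count) print minute, count[minute] }
  ' | sort
}

# Usage:
#   kubectl -n ingress-nginx logs deploy/ingress-nginx-controller \
#     --since=60m --timestamps | reloads_per_minute
```

Each output line is a minute followed by the number of matching log lines in that minute, which can be compared with the minutes where 502s spiked.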

2) Measure reload rate (don’t guess)

If you scrape ingress-nginx metrics, search its /metrics output for reload counters rather than assuming metric names:

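# Terminal 1: forward the controller metrics port (blocks until interrupted)
kubectl -n ingress-nginx port-forward deploy/ingress-nginx-controller 10254:10254
# Terminal 2: search the metrics output for reload-related series
curl -s http://127.0.0.1:10254/metrics | rg -n "reload|nginx.*reload" | head -n 50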

Your goal is a budget: reloads per minute under steady state should be near zero.
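
Once you know which counter tracks reloads, two samples are enough to turn it into a rate. The sketch below assumes the port-forward above is running; `nginx_ingress_controller_success` appears only as an example name and should be confirmed against your own /metrics output before you rely on it.

```shell
# Sum every series of one metric from Prometheus text-format output on stdin.
# $1: metric name (confirm the real name in your /metrics output first)
metric_total() {
  awk -v name="$1" '
    $1 == name || index($1, name "{") == 1 { total += $NF }
    END { printf "%d\n", total }
  '
}

# Reloads per minute between two counter samples.
# $1: first sample, $2: second sample, $3: seconds between the samples
reload_rate_per_minute() {
  awk -v a="$1" -v b="$2" -v secs="$3" \
    'BEGIN { printf "%.2f\n", (b - a) * 60 / secs }'
}

# Usage (metric name is an assumption):
#   m=nginx_ingress_controller_success
#   before=$(curl -s http://127.0.0.1:10254/metrics | metric_total "$m")
#   sleep 120
#   after=$(curl -s http://127.0.0.1:10254/metrics | metric_total "$m")
#   reload_rate_per_minute "$before" "$after" 120
```

A port-forward to a Deployment reaches a single pod, so with several controller replicas this measures one replica at a time.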

3) Find the churn source

Common churn generators:

GitOps tools doing frequent patch loops
cert-manager updates (new TLS secrets, renewed certs)
ExternalDNS updates (annotations/records)
canary tooling toggling annotations frequently

During the incident window, look for “who touched Ingress”:

kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
kubectl get ingress -A -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,GEN:.metadata.generation --sort-by=.metadata.generation | tail -n 20

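If you see generation bumps without real route changes, that’s often a no-op churn bug. Note that generation only increments on spec changes, so annotation-only churn will not show up there. For that case, compare metadata.resourceVersion across two snapshots, or inspect metadata.managedFields (kubectl get ... -o yaml --show-managed-fields) to see which manager wrote last and when.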
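
A simple way to catch silent rewrites is to snapshot a version column twice and diff the snapshots. This is a sketch under stated assumptions: the snapshot files are produced by you (for example with the `custom-columns` command in the usage comment), and `resourceVersion` is used because it changes on any write to the object, including annotation-only writes.

```shell
# Print Ingresses whose version column changed between two snapshot files.
# Each snapshot line has the form: NAMESPACE NAME VERSION
# $1: earlier snapshot file, $2: later snapshot file
changed_ingresses() {
  awk '
    NR == FNR { before[$1 "/" $2] = $3; next }          # load first snapshot
    ($1 "/" $2) in before && before[$1 "/" $2] != $3 {  # compare second one
      print $1 "/" $2, before[$1 "/" $2], "->", $3
    }
  ' "$1" "$2"
}

# Usage (file names are placeholders):
#   cols=NS:.metadata.namespace,NAME:.metadata.name,RV:.metadata.resourceVersion
#   kubectl get ingress -A --no-headers -o custom-columns="$cols" > before.txt
#   sleep 300
#   kubectl get ingress -A --no-headers -o custom-columns="$cols" > after.txt
#   changed_ingresses before.txt after.txt
```

Objects that show up here during a window with no intended route changes are the first candidates for a no-op writer.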

4) Confirm backends are healthy (avoid false attribution)

kubectl -n <app-ns> get pods -o wide
kubectl -n <app-ns> get endpoints <svc> -o wide

If backend pods are stable and only the edge spikes, you’re debugging ingress behavior, not the app.
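
When a namespace has many pods, filtering the listing down to the ones that are not fully ready makes the "backends are fine" claim quicker to check. A small sketch, assuming the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS first):

```shell
# List pods that are not fully ready or not Running.
# stdin: `kubectl get pods --no-headers` output (NAME READY STATUS ...)
not_ready_pods() {
  awk '{
    split($2, ready, "/")                      # READY looks like 1/1 or 0/2
    if (ready[1] != ready[2] || $3 != "Running") print $1, $2, $3
  }'
}

# Usage:
#   kubectl -n <app-ns> get pods --no-headers | not_ready_pods
```

Empty output means every listed pod is Running with all containers ready. Completed Job pods will also be printed, since they are neither Running nor ready.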

Safe mitigations (during incident)

1) Stop touching the config

Pause what creates churn:

pause GitOps sync for Ingress resources
avoid canary flips
delay DNS/cert changes during peak traffic

2) Give reloads more drain budget

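ingress-nginx exposes NGINX knobs via its ConfigMap (validate the exact keys and defaults for your version against the ingress-nginx ConfigMap documentation). Check the value you currently run before changing it: setting worker-shutdown-timeout below the current value shortens the drain window instead of extending it. The intent: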

allow old workers to drain longer
avoid dropping keep-alives during tight reload windows

Representative example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  worker-shutdown-timeout: "30s"
  keep-alive: "75"
  keep-alive-requests: "10000"

And ensure the controller pod has a termination grace period that can cover drain:

spec:
  template:
    spec:
      terminationGracePeriodSeconds: 60
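
The relationship between the two settings can be checked mechanically before rollout. The helper below is a hypothetical sketch, not part of ingress-nginx: it only understands durations written as `Ns`, `Nm`, or bare seconds, and it treats "grace period strictly greater than drain timeout" as the rule, which is an assumption about how much headroom you want.

```shell
# Convert a duration ("30s", "4m", or bare seconds) to seconds.
# Only the s and m suffixes are handled in this sketch.
to_seconds() {
  case "$1" in
    *m) echo $(( ${1%m} * 60 )) ;;
    *s) echo "${1%s}" ;;
    *)  echo "$1" ;;
  esac
}

# Succeed when the pod grace period exceeds the worker drain timeout.
# $1: worker-shutdown-timeout value, $2: terminationGracePeriodSeconds
grace_covers_drain() {
  [ "$2" -gt "$(to_seconds "$1")" ]
}

# Usage:
#   grace_covers_drain 30s 60 && echo "ok" || echo "grace period too short"
```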

3) Remove CPU throttling from the controller

If the controller is CPU-throttled, reloads take longer and disruption windows widen. Give it CPU headroom and consider horizontal scaling.
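
Throttling can be read straight from the container's cgroup counters. This is a sketch, assuming a CPU limit is set on the controller (otherwise the counters stay at zero) and that `cpu.stat` is readable at `/sys/fs/cgroup/cpu.stat` (cgroup v2) or `/sys/fs/cgroup/cpu/cpu.stat` (cgroup v1) inside the container.

```shell
# Percentage of CPU enforcement periods in which the container was throttled.
# stdin: contents of the cgroup cpu.stat file (nr_periods / nr_throttled)
throttled_percent() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%.1f\n", 100 * throttled / periods
      else print "0.0"
    }
  '
}

# Usage (path depends on cgroup version; pod name is a placeholder):
#   kubectl -n ingress-nginx exec <controller-pod> -- \
#     cat /sys/fs/cgroup/cpu.stat | throttled_percent
```

The counters are cumulative since the container started, so sample twice around a reload storm if you want the ratio for that window only.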

What we changed (concrete)

1) Stop no-op updates at the source

Our GitOps pipeline was rewriting annotations even when the rendered config didn’t change. Fixing that reduced reloads by an order of magnitude.
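
For script-driven pipelines, one generic guard is to apply only when the server reports a real difference. This is an illustration of the idea, not the fix we shipped, and GitOps controllers have their own diffing settings that may be the better place to solve it. The function name is hypothetical; the exit codes are the ones `kubectl diff` documents (0 for no differences, 1 for differences, greater than 1 for errors).

```shell
# Apply a manifest only when `kubectl diff` reports a difference.
# $1: path to the manifest file
apply_if_changed() {
  rc=0
  kubectl diff -f "$1" >/dev/null 2>&1 || rc=$?
  case "$rc" in
    0) echo "unchanged: $1" ;;                 # nothing to apply, no write
    1) kubectl apply -f "$1" ;;                # real difference, apply it
    *) echo "kubectl diff failed for $1" >&2   # diff itself errored
       return 2 ;;
  esac
}

# Usage:
#   apply_if_changed rendered/ingress.yaml
```

Skipping the write entirely keeps the object's resourceVersion stable, so the controller has nothing to react to.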

2) Consolidate and de-churn Ingress resources

Hundreds of tiny Ingress objects create a huge generated config and lots of update events. Consolidating routes reduced:

config size
reload time
change frequency
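
Before consolidating, it is worth measuring where the Ingress objects actually live. A minimal sketch, assuming the default output of `kubectl get ingress -A --no-headers`, where the namespace is the first column:

```shell
# Count Ingress objects per namespace, largest first.
# stdin: `kubectl get ingress -A --no-headers` output
ingress_count_by_namespace() {
  awk '{ n[$1]++ } END { for (ns in n) print n[ns], ns }' | sort -rn
}

# Usage:
#   kubectl get ingress -A --no-headers | ingress_count_by_namespace
#
# Generated config size (the path is an assumption about the controller image):
#   kubectl -n ingress-nginx exec <controller-pod> -- wc -c /etc/nginx/nginx.conf
```

Tracking both numbers before and after consolidation shows whether the change reduced config size as intended.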

3) Budget reload and make it observable

We added:

dashboard panel: reloads per minute
alert: reload rate above budget
alert: reload “start without success” for more than N seconds (log-based if needed)
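
Where no metrics pipeline is available yet, the budget check can be approximated from per-minute counts. This is a sketch only: the input format (`<minute> <count>` per line) and the budget value are assumptions you would adapt to your own tooling.

```shell
# Print the minutes whose reload count exceeds a budget.
# stdin: lines of the form "<minute> <count>"
# $1:    allowed reloads per minute
minutes_over_budget() {
  awk -v budget="$1" '$2 > budget { print $1, $2 }'
}

# Usage (counts.txt is a placeholder file of per-minute reload counts):
#   minutes_over_budget 2 < counts.txt
```

Any output at all is a budget violation, which makes the function easy to wire into a cron job or CI check that fails on non-empty output.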

How to verify

502 spikes no longer align with reload log lines.
Reload rate stays low under steady state.
Long-lived clients (gRPC streams, keep-alive heavy clients) stop reconnecting periodically.
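
The first check can be made less subjective by intersecting the minutes that had 502 spikes with the minutes that had reloads. A rough sketch, assuming you can export both as text files whose first column is a minute key in the same format (the file names are placeholders):

```shell
# Print minutes that appear in both files.
# $1: file of minutes with 502 spikes (minute key in column 1)
# $2: file of minutes with reloads   (minute key in column 1)
overlapping_minutes() {
  awk '
    NR == FNR { spike[$1] = 1; next }   # remember spike minutes
    $1 in spike { print $1 }            # reload minute that also spiked
  ' "$1" "$2" | sort -u
}

# Usage:
#   overlapping_minutes spike_minutes.txt reload_minutes.txt
```

After the fixes, this output should be empty or close to it; persistent overlap means reloads are still disrupting traffic.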

Prevention / guardrails

Treat Ingress updates as a production change; no-op updates are outages waiting to happen.
Put a budget on reload rate (reloads in steady state should be rare and explainable).
Keep the ingress controller out of CPU throttling territory.
Coordinate cert/DNS churn to avoid peak-traffic reload storms.
