Kubernetes Graceful Shutdown as a Contract: Zero 502s During Rollouts (HTTP + gRPC)

If you’ve ever rolled out a Deployment and watched:

a burst of 502/504 from an ingress,
ECONNRESET / “connection reset by peer” in clients,
gRPC UNAVAILABLE spikes,
and then everything “stabilizes”…

…you already know the uncomfortable truth: “graceful shutdown” is not a boolean feature. It’s a contract between:

the client (keepalive, retries, connection reuse),
the LB/ingress/sidecar (draining behavior),
Kubernetes endpoint propagation (EndpointSlice → kube-proxy),
and your application (SIGTERM handling, refusing new work, finishing in-flight work).

This post lays out a production-minded, reproducible approach that makes rollouts boring by design.

Tested on: Kubernetes 1.27–1.30, NGINX Ingress and Envoy-based proxies, Go HTTP servers and gRPC services.

What “graceful shutdown” must guarantee

Define a Drain Contract with explicit invariants:

1. Stop new traffic first (from Kubernetes routing)
2. Stop accepting new work (inside the process)
3. Finish or cancel in-flight work within a bounded time
4. Only then exit, before Kubernetes sends SIGKILL

If any of these is missing, you get rollout errors even if you handle SIGTERM.

How Kubernetes termination actually plays out

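When a Pod is deleted, Kubernetes (simplified) does the following:

1. Sets a deletion timestamp on the Pod and starts the terminationGracePeriodSeconds countdown.
2. Runs each container’s preStop hook (if configured).
3. Sends SIGTERM to the container once its preStop hook returns.
4. Sends SIGKILL to anything still running when the grace period expires.

The grace period covers the preStop hook and SIGTERM handling together; time spent in the hook is not added on top.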

Separately (and importantly), how quickly traffic actually stops arriving depends on:

the EndpointSlice controller marking the Pod as not ready (triggered when the Pod starts terminating or fails its readiness probe),
kube-proxy / dataplane propagation delays,
and client connection reuse (keepalive pools).

This means: your process might still be receiving requests after termination started, unless you intentionally drain.

The core idea: readiness-driven draining

The simplest reliable pattern is:

Your app exposes a readiness endpoint that returns not ready once draining starts.
On termination, you flip the app into draining mode before stopping the server.

You can trigger draining via:

preStop hook calling a local endpoint (recommended for consistency),
or handling SIGTERM and toggling a drain flag immediately (also fine).

Drain budget math (don’t guess)

You need a grace period large enough for:

grace >= endpoint_propagation + drain_delay + worst_case_request_time + safety_margin

Where:

endpoint_propagation: time for EndpointSlice update + dataplane to stop routing
drain_delay: a small wait after becoming NotReady (to let routing converge)
worst_case_request_time: your real upper bound (or enforced deadline)
safety_margin: buffer for jitter

You don’t need perfect numbers. You need measured numbers.
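To make the formula concrete, here is a minimal Go sketch that adds up the four terms and checks them against a grace period. The `DrainBudget` type and the durations in `main` are illustrative assumptions, not measurements; substitute the numbers you observe in your own cluster.

```go
package main

import (
	"fmt"
	"time"
)

// DrainBudget holds the measured inputs of the drain budget formula.
// The type and its field names are hypothetical, chosen to mirror the formula.
type DrainBudget struct {
	EndpointPropagation time.Duration // EndpointSlice update + dataplane convergence
	DrainDelay          time.Duration // wait after becoming NotReady
	WorstCaseRequest    time.Duration // real upper bound or enforced deadline
	SafetyMargin        time.Duration // buffer for jitter
}

// MinGrace returns the smallest grace period that satisfies the formula.
func (b DrainBudget) MinGrace() time.Duration {
	return b.EndpointPropagation + b.DrainDelay + b.WorstCaseRequest + b.SafetyMargin
}

// Fits reports whether a terminationGracePeriodSeconds value covers the budget.
func (b DrainBudget) Fits(graceSeconds int) bool {
	return time.Duration(graceSeconds)*time.Second >= b.MinGrace()
}

func main() {
	// Assumed example values only.
	b := DrainBudget{
		EndpointPropagation: 3 * time.Second,
		DrainDelay:          5 * time.Second,
		WorstCaseRequest:    30 * time.Second,
		SafetyMargin:        10 * time.Second,
	}
	fmt.Println(b.MinGrace(), b.Fits(60), b.Fits(30)) // 48s true false
}
```

With these assumed inputs the budget is 48 seconds, so a 60-second grace period fits and a 30-second one does not.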

Reference implementation: Kubernetes YAML

Below is a minimal but production-grade Pod contract for HTTP or gRPC services.

1) Readiness probe (must reflect draining)

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  periodSeconds: 2
  timeoutSeconds: 1
  failureThreshold: 1

Key points:

Keep the probe quick.
failureThreshold: 1 makes readiness react fast (don’t do this for liveness unless you want restart storms).

2) preStop hook: trigger draining

lifecycle:
  preStop:
    httpGet:
      path: /admin/drain
      port: 8080

Then set terminationGracePeriodSeconds (a Pod-level field, next to containers) large enough to cover the whole drain budget:

terminationGracePeriodSeconds: 60

Where does the “wait” happen? Prefer the app to own the drain timing:

/admin/drain flips draining mode, so the readiness endpoint starts returning not ready.
SIGTERM follows as soon as the hook returns; the process then waits out the drain delay, attempts a graceful stop, and only then exits.

Why not sleep in preStop?

Sleeping outside the app doesn’t stop the app from accepting new requests.
You lose observability and control.
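One way to let the app own the drain timing, whichever trigger fires first, is to route both the drain endpoint and the SIGTERM handler through a single idempotent entry point. The sketch below is an assumption about how that could look: `Drainer`, `NewDrainer`, and the `stop` callback are hypothetical names, and `stop` stands in for a bounded server shutdown.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Drainer is a hypothetical helper: the first Start call wins, later calls are no-ops.
type Drainer struct {
	once     sync.Once
	draining atomic.Bool
	delay    time.Duration // routing convergence wait
	stop     func()        // bounded graceful stop supplied by the caller
	done     chan struct{}
}

func NewDrainer(delay time.Duration, stop func()) *Drainer {
	return &Drainer{delay: delay, stop: stop, done: make(chan struct{})}
}

// Start flips the drain flag immediately, then waits and stops in the background.
func (d *Drainer) Start() {
	d.once.Do(func() {
		d.draining.Store(true)
		go func() {
			defer close(d.done)
			time.Sleep(d.delay)
			d.stop()
		}()
	})
}

// Draining is what a readiness handler would consult.
func (d *Drainer) Draining() bool { return d.draining.Load() }

// Wait blocks until the stop callback has returned; call it after Start.
func (d *Drainer) Wait() { <-d.done }

func main() {
	var stops atomic.Int32
	d := NewDrainer(50*time.Millisecond, func() { stops.Add(1) })
	d.Start() // e.g. from the /admin/drain handler
	d.Start() // e.g. from the SIGTERM handler
	d.Wait()
	fmt.Println(d.Draining(), stops.Load()) // true 1
}
```

Because `sync.Once` guards the sequence, it does not matter whether the hook or the signal arrives first: the delay and the stop run exactly once.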

Reference implementation: app-level draining (Go examples)

You can implement this in any language/runtime. Here’s the logic.

HTTP server: stop accepting new connections, finish in-flight requests

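// Runnable sketch: readiness follows a drain flag, and SIGTERM triggers shutdown with a hard deadline.
package main

import (
  "context"
  "errors"
  "log"
  "net/http"
  "os"
  "os/signal"
  "sync/atomic"
  "syscall"
  "time"
)

var draining atomic.Bool

// readyHandler returns 503 once draining starts, so the readiness probe fails.
func readyHandler(w http.ResponseWriter, r *http.Request) {
  if draining.Load() {
    w.WriteHeader(http.StatusServiceUnavailable)
    return
  }
  w.WriteHeader(http.StatusOK)
}

// drainHandler is the preStop target; it only flips the flag.
func drainHandler(w http.ResponseWriter, r *http.Request) {
  draining.Store(true)
  w.WriteHeader(http.StatusOK)
}

func main() {
  mux := http.NewServeMux()
  mux.HandleFunc("/health/ready", readyHandler)
  mux.HandleFunc("/admin/drain", drainHandler)
  // Register application handlers on mux here.
  srv := &http.Server{Addr: ":8080", Handler: mux}

  sigterm := make(chan os.Signal, 1)
  signal.Notify(sigterm, syscall.SIGTERM)

  done := make(chan struct{})
  go func() {
    defer close(done)
    <-sigterm
    draining.Store(true)

    // Allow time for routing to converge before closing.
    time.Sleep(5 * time.Second)

    ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()
    if err := srv.Shutdown(ctx); err != nil {
      log.Printf("graceful shutdown did not finish: %v", err)
    }
  }()

  if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
    log.Fatalf("listen: %v", err)
  }
  // ListenAndServe returns as soon as Shutdown is called; wait for in-flight requests.
  <-done
}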

Important details:

Readiness flips first.
You wait a short “routing convergence” period.
You shut down with a bounded timeout.
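Keepalive pools can keep reusing an open connection during the convergence wait, regardless of what the endpoints say. A possible complement, sketched below, is a middleware that sets `Connection: close` on responses while draining, which makes Go's `net/http` close an HTTP/1.x connection after the current response. The names `closeWhenDraining` and `connectionHeader` are hypothetical, and the sketch makes no claim about HTTP/2 or about how a given proxy reacts to the header.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// closeWhenDraining asks HTTP/1.x clients to drop the connection after the
// current response once draining has started.
func closeWhenDraining(draining *atomic.Bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if draining.Load() {
			w.Header().Set("Connection", "close")
		}
		next.ServeHTTP(w, r)
	})
}

// connectionHeader runs one request through the middleware and returns the
// Connection header that was set on the response.
func connectionHeader(isDraining bool) string {
	var flag atomic.Bool
	flag.Store(isDraining)
	h := closeWhenDraining(&flag, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Header().Get("Connection")
}

func main() {
	fmt.Printf("%q %q\n", connectionHeader(false), connectionHeader(true)) // "" "close"
}
```

In-flight requests still complete normally; the only change is that the client has to open a new connection for its next request, which then goes through routing again.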

gRPC server: `GracefulStop` with a hard cap

GracefulStop blocks until all pending RPCs have finished, so a single long-lived stream can hold it open indefinitely. You want:

attempt graceful stop,
but enforce a max drain time (then force close).

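// Fragment: assumes draining (atomic.Bool) and sigterm (chan os.Signal) from the
// HTTP example, and grpcServer (*grpc.Server) created elsewhere.
go func() {
  <-sigterm
  draining.Store(true)
  // Allow time for routing to converge before refusing new RPCs.
  time.Sleep(5 * time.Second)

  done := make(chan struct{})
  go func() {
    // Stops accepting new connections and RPCs, then blocks until pending RPCs finish.
    grpcServer.GracefulStop()
    close(done)
  }()

  select {
  case <-done:
    // Graceful stop finished within the budget.
  case <-time.After(45 * time.Second):
    // Hard stop: closes remaining connections so the process exits before SIGKILL.
    grpcServer.Stop()
  }
}()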

Repro lab: prove it with numbers (before/after)

Don’t ship this as theory. Prove it.

Step 1: Generate load

hey -z 2m -c 50 http://my-service.default.svc.cluster.local/

For gRPC, a common tool is ghz (run it wherever you normally run load).

Step 2: Roll out repeatedly

kubectl rollout restart deploy/my-service
kubectl rollout status deploy/my-service

Step 3: Watch endpoints propagate

kubectl get endpointslices -l kubernetes.io/service-name=my-service -w

What you’re looking for:

do endpoints drop quickly after draining starts?
does your load generator still see errors during the window?

Step 4: Compare error rates

Track:

HTTP 5xx / resets
gRPC UNAVAILABLE / CANCELLED
p95/p99 latency during rollout window

If you can’t graph it, at least log it in the load generator output and in app metrics.

Common failure modes (and how to recognize them)

“We have preStop sleep” but we still drop requests

Symptom:

errors persist
readiness stays OK during sleep

Cause:

app continues accepting traffic while the hook sleeps

Fix:

use readiness-driven drain, not external sleep

Grace period too short → SIGKILL → partial work

Symptom:

containers are killed with SIGKILL (exit code 137)
in-flight requests fail near the end of grace period

Fix:

compute your drain budget and increase terminationGracePeriodSeconds
enforce request deadlines so you have a real upper bound
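For plain request/response HTTP handlers, one way to get an enforced upper bound is the standard library's `http.TimeoutHandler`, which replies with 503 once the limit passes and cancels the request context. The sketch below assumes hypothetical helper names (`withDeadline`, `statusFor`) and arbitrary durations; it is not suited to streaming handlers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// withDeadline caps handler time; net/http answers 503 when the limit passes.
func withDeadline(next http.Handler, limit time.Duration) http.Handler {
	return http.TimeoutHandler(next, limit, "request deadline exceeded")
}

// statusFor runs a handler that needs `work` to finish under a `limit` deadline
// and returns the status code the client would see.
func statusFor(work, limit time.Duration) int {
	slow := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-time.After(work):
			w.WriteHeader(http.StatusOK)
		case <-r.Context().Done(): // cancelled by TimeoutHandler
		}
	})
	rec := httptest.NewRecorder()
	withDeadline(slow, limit).ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Code
}

func main() {
	fmt.Println(statusFor(10*time.Millisecond, time.Second)) // 200
	fmt.Println(statusFor(time.Second, 20*time.Millisecond)) // 503
}
```

Whatever limit you choose becomes the worst_case_request_time term in the drain budget, which is what turns that term from a guess into a guarantee.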

Long-lived streams (gRPC streaming, WebSockets)

Symptom:

graceful stop takes forever
pods hit SIGKILL during rollouts

Fix:

define a max drain window
enforce server-side stream limits / keepalive policies
version your clients so they reconnect cleanly
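A max drain window is easier to meet when stream handlers cooperate instead of waiting for the hard stop. The following sketch shows the idea with the standard library only: `serveStream`, `errDraining`, and the `send` callback are hypothetical stand-ins for a real streaming handler, which would translate the sentinel error into whatever "please reconnect" status its protocol uses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errDraining is a hypothetical sentinel that a handler can map to a
// "reconnect elsewhere" status for its protocol.
var errDraining = errors.New("server draining: reconnect")

// serveStream sends a message per tick until the send fails (client gone)
// or the server starts draining, whichever comes first.
func serveStream(drain <-chan struct{}, interval time.Duration, send func(seq int) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for seq := 0; ; seq++ {
		select {
		case <-drain:
			return errDraining // end the stream instead of waiting for the hard stop
		case <-ticker.C:
			if err := send(seq); err != nil {
				return err
			}
		}
	}
}

func main() {
	drain := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(drain) // what the SIGTERM or drain-endpoint path would do
	}()

	sent := 0
	err := serveStream(drain, 10*time.Millisecond, func(int) error { sent++; return nil })
	fmt.Println(errors.Is(err, errDraining), sent)
}
```

Closing one shared channel ends every stream that selects on it, so the graceful stop has a chance to finish well before the hard cap.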

Retries amplify rollouts

Symptom:

upstream load spikes during rollout
error rate triggers a retry storm

Fix:

align retries with deadlines (retry budget)
bounded retries + backoff
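As a sketch of bounded retries that respect the caller's deadline, the Go function below retries with exponential backoff and gives up early when the next wait would not fit before the deadline. The names `retry` and `simulate` and all durations are assumptions for illustration; production code would usually also add jitter to the backoff.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls op up to maxAttempts times with exponential backoff and stops
// early when the caller's deadline would pass during the next wait.
// It returns the number of attempts made and the final error.
func retry(ctx context.Context, maxAttempts int, base time.Duration, op func(context.Context) error) (int, error) {
	var err error
	backoff := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(ctx); err == nil {
			return attempt, nil
		}
		if attempt == maxAttempts {
			break
		}
		if dl, ok := ctx.Deadline(); ok && time.Until(dl) < backoff {
			return attempt, fmt.Errorf("deadline too close to retry: %w", err)
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return attempt, ctx.Err()
		}
	}
	return maxAttempts, err
}

// simulate runs retry against an operation that fails `failures` times first.
func simulate(failures, maxAttempts int) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	calls := 0
	return retry(ctx, maxAttempts, 5*time.Millisecond, func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New("unavailable") // e.g. a 502 during a rollout
		}
		return nil
	})
}

func main() {
	fmt.Println(simulate(2, 4))  // 3 <nil>
	fmt.Println(simulate(10, 3)) // 3 unavailable
}
```

The attempt cap bounds the extra load a single caller can add during a rollout, and the deadline check keeps retries from outliving the request they belong to.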

What I’d do in prod

If I had to set a default “production termination contract” today:

Make readiness reflect draining (never stay Ready while shutting down)
preStop triggers draining (HTTP call or exec), not sleeping
Wait a short, measured routing convergence delay
Shut down servers with bounded timeouts
Set terminationGracePeriodSeconds using the drain budget formula
Add a rollout SLO: “error rate during deploy window” must be near zero
Rehearse it: run the rollout lab on at least one critical service

This turns graceful shutdown from a belief into an enforced contract.

FAQ

Why do I still see errors even though my app handles SIGTERM?

Because routing does not stop instantly: EndpointSlice updates, kube-proxy (or other dataplane) propagation, and client connection reuse can keep sending requests to the Pod for a short time.

Should I fail readiness immediately on SIGTERM?

Usually yes, provided your readiness check truly represents “safe to receive new requests.” It should become false during draining.

Is `preStop: sleep 10` ever acceptable?

Only as a last resort and only if your app also refuses new work immediately. Otherwise it’s just “wait while still accepting traffic.”

What about sidecars/ingress that keep connections open?

Then you must measure where draining happens (ingress/sidecar vs app). The contract stays the same; the draining hop changes.

How do I pick the drain delay (routing convergence time)?

Measure it: watch EndpointSlice changes and observe when new requests stop hitting the terminating pod. Use that as your baseline.
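One way to observe this from inside the app is to record when the last request arrives after draining started. The sketch below is an assumed helper, not part of any library: `lateTraffic` and its methods are hypothetical, and it should wrap application routes only so readiness probes are not counted as stragglers.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// lateTraffic records requests that arrive after draining started.
// Timestamps are unix nanoseconds; a zero drainStart means "not draining".
type lateTraffic struct {
	drainStart, lastLate, count atomic.Int64
}

// startDrainAt marks the start of draining; only the first call takes effect.
func (l *lateTraffic) startDrainAt(nanos int64) { l.drainStart.CompareAndSwap(0, nanos) }

// observeAt notes one request; it is ignored unless draining has started.
func (l *lateTraffic) observeAt(nanos int64) {
	if l.drainStart.Load() == 0 {
		return
	}
	l.count.Add(1)
	l.lastLate.Store(nanos)
}

// Wrap records the arrival time of every request passing through next.
func (l *lateTraffic) Wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		l.observeAt(time.Now().UnixNano())
		next.ServeHTTP(w, r)
	})
}

// Convergence returns how long after drain start the latest request arrived,
// and how many requests arrived after drain start.
func (l *lateTraffic) Convergence() (time.Duration, int64) {
	start, last := l.drainStart.Load(), l.lastLate.Load()
	if start == 0 || last == 0 {
		return 0, 0
	}
	return time.Duration(last - start), l.count.Load()
}

func main() {
	var lt lateTraffic
	h := lt.Wrap(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))
	hit := func() {
		h.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest(http.MethodGet, "/", nil))
	}

	hit() // before drain: not counted
	lt.startDrainAt(time.Now().UnixNano())
	time.Sleep(20 * time.Millisecond)
	hit() // straggler routed after drain began
	d, n := lt.Convergence()
	fmt.Println(n, d)
}
```

Logging the result just before exit over several rollouts gives a distribution; a high percentile of it, plus some margin, is a reasonable starting point for the drain delay.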

Related reading

/en/blog/conntrack-stale-nat-mapping/ (deploy 503s that aren’t graceful shutdown)
/en/blog/kubernetes-ghost-pod-conntrack/ (why traffic can still hit dead Pods)
/en/blog/k8s-postgresql-connection-storm/ (rollouts as a system-wide event)
