Kubernetes APF Starvation: When One Controller Makes kubectl Hang
The symptom pattern is weird the first time you see it:
- kubectl get pods hangs or times out.
- Deployments stop progressing (“waiting for rollout” forever).
- Controllers start logging errors like “Too Many Requests (429)” and retrying.
- Nodes are fine, etcd is fine, and CPU on the API server isn’t pegged.
I’ve hit this in real clusters more than once. The root cause wasn’t etcd, and it wasn’t “the API server is slow”. It was API Priority and Fairness (APF) doing exactly what it was designed to do — except our configuration made critical control-plane traffic compete with a noisy controller.
Tested on: Kubernetes 1.29–1.31, managed control planes + self-hosted kube-apiserver, Prometheus scraping apiserver metrics.
Why this matters in 2026
APF is no longer an exotic feature. Multi-tenant clusters, GitOps, operators, and “everything is a controller” mean the API server is under constant load. When APF isn’t isolating traffic correctly, you get a failure mode that looks like “random Kubernetes flakiness” until you graph APF metrics, at which point it becomes obvious.
Incident narrative (anonymized)
I rolled out a new internal controller that watched a CRD and reconciled related resources. A small bug made it do this:
- list pods cluster-wide
- list secrets cluster-wide
- every reconciliation loop
- with aggressive retries
At the same time, we had previously introduced a custom FlowSchema for “platform automation” that matched most authenticated service accounts and shoved them into a single priority level with low concurrency shares.
Blast radius:
- GitOps drifted
- HPA stopped reacting
- kubectl became unreliable for everyone (including on-call)
- a few workloads hit pod restarts because their controllers couldn’t update objects
Constraint:
- I couldn’t “just scale the API server” (managed control plane).
- I needed an in-cluster mitigation: isolate the noisy controller and protect system traffic.
Timeline (what I actually did)
- T-0: On-call report: “kubectl is hanging”, rollouts stuck.
- T+5m: I see a wave of 429s in controller logs and GitOps retries.
- T+10m: I graph APF rejections and see a clear step change after my controller rollout.
- T+20m: I find a FlowSchema that acts like a catch-all for most service accounts.
- T+30m: I scale down the noisy controller (immediate relief).
- T+45m: I add a FlowSchema + PriorityLevelConfiguration to isolate that controller permanently.
- T+60m: APF rejections drop to near-zero, control plane stabilizes, kubectl works again.
Mechanism: how APF starvation happens
APF is a traffic router + concurrency limiter
APF matches each request to a FlowSchema, which maps it to a PriorityLevelConfiguration.
That priority level controls:
- how many “seats” (concurrent requests) the priority level gets, set by nominalConcurrencyShares (named assuredConcurrencyShares in API versions before v1beta3)
- whether requests are queued or rejected when overloaded (queuing vs reject)
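To make the “seats” idea concrete, here is a rough sketch of how shares turn into concurrency limits. It follows the formula from the upstream APF documentation: each Limited priority level gets ceil(server limit × its shares / sum of all shares). The share values, the level names, and the 600-seat server limit are made-up illustration values, not settings from the incident cluster, and borrowing between levels is ignored.

```python
import math


def nominal_concurrency_limits(server_limit, shares):
    """Split the server-wide concurrency limit across priority levels.

    server_limit: total seats (max-requests-inflight plus
                  max-mutating-requests-inflight on the kube-apiserver).
    shares:       priority level name -> concurrency shares.
    Borrowing/lending between levels is deliberately ignored here.
    """
    total = sum(shares.values())
    return {
        name: math.ceil(server_limit * value / total)
        for name, value in shares.items()
    }


# Hypothetical numbers for illustration only.
example_shares = {
    "system": 30,
    "workload-high": 40,
    "platform-automation": 10,  # a shared low-share "bucket"
    "catch-all": 5,
}
limits = nominal_concurrency_limits(600, example_shares)
for level, seats in sorted(limits.items()):
    print(f"{level}: {seats} seats")
```

The point of the arithmetic: shares are relative, so a level's seat count depends on every other level's shares too, and a low-share level that collects a lot of traffic has very little room before it starts queuing.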
Starvation happens when “everything matches the same flow”
The common misconfig is not “APF is broken”.
It’s:
- a FlowSchema matches more traffic than you intended (often because it targets system:authenticated broadly)
- that FlowSchema has higher precedence than you think
- it maps critical and bulk traffic into the same priority bucket
Once the bucket saturates, clients see:
- queued latency (kubectl hangs)
- 429 rejections (controllers retry → even more load)
- unstable watch/list behavior (more retries, more LISTs, more API churn)
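The precedence trap is easier to see in a toy model. The sketch below is a heavily simplified stand-in for FlowSchema matching (subjects only, no resource rules); the schema names, priority levels, and service account are hypothetical. It relies on one documented behavior: schemas are tried in ascending matchingPrecedence, with ties broken by name, and the first match wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schema:
    """A heavily simplified FlowSchema: subject matchers plus a target level."""
    name: str
    matching_precedence: int
    priority_level: str
    groups: frozenset = frozenset()            # matches any of these groups
    service_accounts: frozenset = frozenset()  # matches "namespace/name"


def classify(schemas, user_groups, service_account=None):
    """Return the schema that wins for a request, or None if nothing matches.

    Schemas are tried in ascending matchingPrecedence (lower number first);
    ties are broken by name. The first match wins.
    """
    ordered = sorted(schemas, key=lambda s: (s.matching_precedence, s.name))
    for schema in ordered:
        if schema.groups & set(user_groups):
            return schema
        if service_account is not None and service_account in schema.service_accounts:
            return schema
    return None


# Hypothetical setup: a broad schema with a LOWER number than the specific one.
schemas = [
    Schema("platform-automation", 1000, "automation",
           groups=frozenset({"system:serviceaccounts"})),
    Schema("gitops-agent", 5000, "gitops",
           service_accounts=frozenset({"gitops/agent"})),
]
winner = classify(
    schemas,
    user_groups=["system:serviceaccounts", "system:authenticated"],
    service_account="gitops/agent",
)
print(winner.name, "->", winner.priority_level)  # the broad schema wins
```

Even though a dedicated schema exists for the GitOps agent in this toy setup, it never gets a chance to match, because the broad schema is evaluated first.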
Why retries make APF incidents spiral
429s are “polite”. Many controllers treat them as transient and retry aggressively. That creates a feedback loop:
- APF rejects (429)
- controller retries → increases request rate
- APF rejects more
- the control plane becomes unusable for everything sharing that priority level
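One way to break that loop on the client side is capped exponential backoff with jitter that also respects a server-provided Retry-After hint. This is a minimal sketch of the technique, not what any particular controller does; the function name and the default values for `base` and `cap` are assumptions.

```python
import random


def retry_delay(attempt, retry_after=None, base=0.5, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based) after a 429.

    Uses capped exponential backoff with full jitter, and never waits less
    than the server-provided Retry-After value when one is given.
    """
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng.uniform(0, ceiling)
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay


# Example: print one possible schedule for the first few retries.
for attempt in range(5):
    print(f"attempt {attempt}: wait {retry_delay(attempt, retry_after=1):.2f}s")
```

The jitter matters as much as the backoff: without it, many replicas rejected at the same moment retry at the same moment too.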
Runbook: diagnosing APF starvation fast
What to check first
- Confirm 429s and timeouts are real and clustered in time:
  - Controller logs (GitOps, operators, custom controllers)
  - API client errors in apps that talk to the API (rare but happens)
- Check APF rejection metrics. If you scrape kube-apiserver metrics, these are the fastest signals.
PromQL examples (paste as-is; metric labels may vary slightly by distro):
# Overall APF rejections
sum(rate(apiserver_flowcontrol_rejected_requests_total[5m]))
# Who is getting rejected (by priority level)
sum by (priority_level) (rate(apiserver_flowcontrol_rejected_requests_total[5m]))
# Which FlowSchema is rejecting
topk(10, sum by (flow_schema) (rate(apiserver_flowcontrol_rejected_requests_total[5m])))
If you don’t have these metrics, your “first fix” is to enable scraping — but during an incident, you can still proceed with object inspection + log correlation.
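If the metrics are available and you would rather script the check than eyeball a graph, a small helper can rank the offenders from a Prometheus instant-query response. This is a sketch that only parses the standard JSON returned by the /api/v1/query endpoint; fetching it (and the Prometheus address) is left out, and the sample payload is invented for illustration.

```python
import json


def top_rejected(response_body, label="flow_schema", limit=5):
    """Rank series from a Prometheus instant-query response by value.

    response_body: JSON text returned by GET /api/v1/query for a query
                   grouped by `label` (for example by flow_schema).
    Returns a list of (label_value, rejections_per_second), highest first.
    """
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = [
        (series["metric"].get(label, "<none>"), float(series["value"][1]))
        for series in payload["data"]["result"]
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]


# Invented sample payload in the shape Prometheus returns for a vector result.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"flow_schema": "service-accounts"}, "value": [1700000000, "0.2"]},
            {"metric": {"flow_schema": "platform-automation"}, "value": [1700000000, "12.5"]},
        ],
    },
})
for name, rate in top_rejected(sample):
    print(f"{name}: {rate:.1f} rejections/s")
```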
- Inspect FlowSchemas and precedence. I start here because I’ve been burned by “catch-all” FlowSchemas more than once.
kubectl get flowschemas.flowcontrol.apiserver.k8s.io
kubectl get prioritylevelconfigurations.flowcontrol.apiserver.k8s.io
Then:
kubectl get flowschemas.flowcontrol.apiserver.k8s.io -o yaml | grep -n "matchingPrecedence"
I look for:
- unusually low matchingPrecedence (a lower number is matched first, so it has higher priority)
- rules matching broad groups like system:authenticated
- rules matching “all namespaces, all resources” unintentionally
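The same inspection can be scripted against the JSON form of the FlowSchema list. The sketch below flags schemas whose subjects include a very broad group; the set of groups treated as “broad” is my own choice, and the sample input is invented. Expect the built-in low-priority schemas to show up as well; the interesting findings are custom schemas with a low precedence number.

```python
BROAD_GROUPS = {
    "*",
    "system:authenticated",
    "system:unauthenticated",
    "system:serviceaccounts",
}


def broad_flowschemas(flowschema_list):
    """Find FlowSchemas whose subjects include a very broad group.

    flowschema_list: parsed output of
        kubectl get flowschemas.flowcontrol.apiserver.k8s.io -o json
    Returns (matchingPrecedence, name, priority level, groups) tuples,
    lowest precedence number (matched first) at the top.
    """
    findings = []
    for item in flowschema_list.get("items", []):
        spec = item.get("spec", {})
        groups = set()
        for rule in spec.get("rules", []):
            for subject in rule.get("subjects", []):
                if subject.get("kind") != "Group":
                    continue
                group_name = subject.get("group", {}).get("name")
                if group_name in BROAD_GROUPS:
                    groups.add(group_name)
        if groups:
            findings.append((
                spec.get("matchingPrecedence"),
                item["metadata"]["name"],
                spec.get("priorityLevelConfiguration", {}).get("name"),
                sorted(groups),
            ))
    return sorted(findings, key=lambda f: (f[0], f[1]))


# Invented sample with one broad and one narrowly scoped schema.
sample_list = {"items": [
    {"metadata": {"name": "platform-automation"},
     "spec": {"matchingPrecedence": 1000,
              "priorityLevelConfiguration": {"name": "automation"},
              "rules": [{"subjects": [
                  {"kind": "Group", "group": {"name": "system:authenticated"}}]}]}},
    {"metadata": {"name": "fs-one-controller"},
     "spec": {"matchingPrecedence": 500,
              "priorityLevelConfiguration": {"name": "isolated"},
              "rules": [{"subjects": [
                  {"kind": "ServiceAccount",
                   "serviceAccount": {"name": "c", "namespace": "ns"}}]}]}},
]}
for finding in broad_flowschemas(sample_list):
    print(finding)
```

To run it against a live cluster, pipe the kubectl JSON output into a script that passes `json.load(sys.stdin)` to `broad_flowschemas`.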
How to confirm the hypothesis
A. Identify the FlowSchema / priority level under pressure
From metrics, pick the top flow_schema or priority_level involved.
Then open it:
kubectl get flowschema <name> -o yaml
kubectl get prioritylevelconfiguration <name> -o yaml
You want to answer:
- Who matches this FlowSchema? (service accounts, groups, namespaces)
- Does it include your controller’s service account?
- Is queuing enabled, and what are the queue lengths?
- How many concurrency shares does it get?
B. Correlate with a known noisy client
In my case it was the controller I had just deployed. If it’s not obvious, the fastest way is:
- temporarily scale down suspected controllers one by one (GitOps, custom operators)
- watch APF rejection rate drop
- stop once you have found the offender
This is crude, but it works when you don’t have audit logs or per-client metrics handy.
Safe mitigations (in order)
- Scale down / pause the noisy controller. This is the quickest way to break the retry loop.
kubectl -n <ns> scale deploy/<controller> --replicas=0
- Lower its client-side QPS/burst. If you own the controller code, this is usually a one-line client-go change. If you don’t, see if it has env/config for QPS.
- Isolate it in APF. Create a FlowSchema matching only that controller’s service account, and map it to a low-share priority level.
- Protect critical system traffic. If your APF config accidentally down-prioritized system flows, fix precedence and restore sensible shares for system priority levels.
Risky mitigations (avoid unless you have no choice)
- Disabling APF (often not possible on managed control planes, and it can turn “fair throttling” into “total meltdown”).
- Blindly raising concurrency shares everywhere: you can remove fairness and spike etcd load hard.
- Restarting controllers to “fix it”: restarts often trigger fresh LISTs and re-established watches, which makes the storm worse.
What we changed (concrete)
1) We isolated the controller with a dedicated PriorityLevelConfiguration
Before, our controller matched a broad FlowSchema that shared capacity with other automation.
After, we created a priority level with explicit low shares:
apiVersion: flowcontrol.apiserver.k8s.io/v1  # GA since Kubernetes 1.29
kind: PriorityLevelConfiguration
metadata:
  name: plc-noisy-controller
spec:
  type: Limited
  limited:
    nominalConcurrencyShares: 5  # named assuredConcurrencyShares before v1beta3
    limitResponse:
      type: Queuing
      queuing:
        queues: 16
        handSize: 4
        queueLengthLimit: 50
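The queues and handSize settings control shuffle sharding: each flow is dealt a small, stable “hand” of candidate queues and its requests go to the shortest one, so a single noisy flow can only fill a few of the queues. The sketch below illustrates the idea only. The hashing and dealing are my own simplification, not the kube-apiserver's actual algorithm, and the flow identifiers are hypothetical.

```python
import hashlib


def candidate_queues(flow_id, queues=16, hand_size=4):
    """Deal `hand_size` distinct queue indices for a flow, deterministically.

    Simplified illustration: the real apiserver uses its own dealing
    algorithm, but the idea is the same. Each flow only ever lands in a
    small, stable subset ("hand") of the priority level's queues.
    """
    digest = hashlib.sha256(flow_id.encode()).digest()
    seed = int.from_bytes(digest[:8], "big")
    remaining = list(range(queues))
    hand = []
    for _ in range(hand_size):
        seed, index = divmod(seed, len(remaining))
        hand.append(remaining.pop(index))
    return hand


def pick_queue(flow_id, queue_lengths, hand_size=4):
    """Enqueue into the shortest queue among the flow's candidates."""
    hand = candidate_queues(flow_id, len(queue_lengths), hand_size)
    return min(hand, key=lambda q: queue_lengths[q])


# With 16 queues of length 50, at most 16 * 50 = 800 requests can wait.
lengths = [0] * 16
for flow in ("platform/noisy-controller", "platform/other-client"):
    print(flow, "->", sorted(candidate_queues(flow)))
```

With the values above, one flow touches at most 4 of the 16 queues, which is what keeps a second client in the same priority level from being stuck behind the noisy one in most cases.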
2) We created a FlowSchema that matches only that service account
apiVersion: flowcontrol.apiserver.k8s.io/v1  # GA since Kubernetes 1.29
kind: FlowSchema
metadata:
  name: fs-noisy-controller
spec:
  # Must be numerically lower than the broad FlowSchema's value,
  # otherwise the broad schema still matches first.
  matchingPrecedence: 2000
  priorityLevelConfiguration:
    name: plc-noisy-controller
  rules:
    - subjects:
        - kind: ServiceAccount
          serviceAccount:
            name: noisy-controller
            namespace: platform
      resourceRules:
        - apiGroups: ["*"]
          resources: ["*"]
          verbs: ["*"]
          namespaces: ["*"]
          clusterScope: true
3) We fixed the controller bug and added a client-side rate limit
The “real fix” was to stop doing cluster-wide LISTs in a tight loop. But the APF isolation meant one bug could no longer starve the whole cluster.
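For readers who have not tuned client-side limits before: client-go exposes them as the QPS and Burst fields on rest.Config, backed by a token-bucket limiter. The sketch below shows the token-bucket idea itself in a few lines. It is an illustration of the technique with a hypothetical class name, not client-go's implementation and not the change we shipped.

```python
import time


class TokenBucket:
    """Minimal token-bucket limiter: `qps` sustained rate, `burst` capacity."""

    def __init__(self, qps, burst, clock=time.monotonic):
        self.qps = float(qps)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)  # start full, so a burst is allowed at once
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return False instead of blocking."""
        now = self.clock()
        refill = (now - self.updated) * self.qps
        self.tokens = min(self.burst, self.tokens + refill)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Example: 5 requests/s sustained, bursts of up to 10.
bucket = TokenBucket(qps=5, burst=10)
allowed = sum(bucket.try_acquire() for _ in range(50))
print(f"{allowed} of 50 back-to-back requests allowed")
```

A limiter like this caps what a buggy reconcile loop can send no matter how often it retries, which complements the server-side isolation.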
How to verify (what I look at after the fix)
- APF rejections drop
sum(rate(apiserver_flowcontrol_rejected_requests_total[5m]))
Expected: the rate returns to near baseline.
- kubectl responsiveness returns. I literally measure it:
time kubectl get ns >/dev/null
time kubectl get pods -A --request-timeout=10s >/dev/null
- Controllers stop retry-storming. GitOps and kube-controller-manager logs should stop spamming 429s.
Prevention / guardrails
- API QPS budgets per controller
  - treat “API requests/sec” as an SLO just like DB connections
- APF guardrails
  - no FlowSchema allowed to match system:authenticated without explicit review
  - precedence changes require a diff-based review
- Alerts
  - APF rejections > 0 for more than N minutes
  - queue depth non-zero for critical priority levels
- Game day
  - intentionally scale a noisy controller in staging and prove the cluster still works