CSI VolumeAttachment Stuck: Pods in ContainerCreating and Drains That Never Finish

This is the Kubernetes storage incident that eats hours because the symptoms look like “Kubernetes is slow”:

  • Pods sit in ContainerCreating forever.
  • kubectl drain hangs on a node because volumes won’t detach.
  • Events show AttachVolume.Attach failed or Multi-Attach error.
  • The application isn’t even starting, so app-level debugging is useless.

When CSI is in the picture, the real unit of progress is often the VolumeAttachment object and its finalizers — not the Pod.

Tested on: Kubernetes 1.29–1.31, CSI drivers for cloud block storage and on-prem, StatefulSets, managed and self-managed clusters.

Incident narrative (anonymized)

We lost a node (hard failure). A StatefulSet pod rescheduled to a new node, but it never came up.

What I saw:

  • Pod stayed in ContainerCreating.
  • Events alternated between “waiting for volume attachment” and “multi-attach”.
  • VolumeAttachment objects piled up with old node references.

The actual root cause was not “the disk is broken”. The CSI controller path was degraded: the external-attacher was running with a single replica and got evicted during the chaos. That left finalizers stuck, so attachments/detachments didn’t converge.

Constraint: this was a stateful workload. A wrong “force” action can cause data corruption. I needed a runbook that makes “safe vs risky” explicit.

Timeline

  • T-0: Node fails; StatefulSet pod reschedules.
  • T+10m: Pod stuck in ContainerCreating; I check Pod events.
  • T+20m: I identify the PV/PVC and the related VolumeAttachment objects.
  • T+30m: VolumeAttachment shows old node attachment state; finalizer not progressing.
  • T+45m: Mitigation: restore CSI controller health (external-attacher back up) and wait for clean detach/attach.
  • T+90m: Pod starts; volume shows attached to the new node.
  • T+1d: Fix: make CSI controllers HA + add alerts on stuck VolumeAttachments.

Mechanism: why VolumeAttachment is the “truth” during CSI incidents

Pods don’t attach volumes; controllers do

For CSI, the attach/detach flow is coordinated by controllers and tracked as objects:

  • PVC/PV describe what you want.
  • VolumeAttachment represents the attach intent and state for a specific node.
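  • The CSI sidecars and kube-controller-manager drive the state machine: the attach/detach controller in kube-controller-manager creates and deletes VolumeAttachment objects, and external-attacher acts on them by calling the driver. (external-provisioner usually runs in the same controller pod, but it handles provisioning, not attach.)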

Finalizers exist so Kubernetes doesn’t “forget” about an attachment before the driver confirms detach. That’s good — but when the controller path is unhealthy, finalizers become a wedge.
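
A quick way to see this wedge is to list VolumeAttachments that already have a deletion timestamp but still exist, which means a finalizer is holding them. The sketch below is a small filter, not part of any CSI tooling: the function name pending_deletion is my own, and it assumes the input is two whitespace-separated columns (name, deletion timestamp) as produced by the kubectl command in the comment.

```shell
# Reads rows of "NAME DELETION_TIMESTAMP" on stdin and prints the names of
# VolumeAttachments whose deletion was requested but which still exist.
# kubectl prints "<none>" in a custom column when the field is unset.
pending_deletion() {
  awk '$2 != "" && $2 != "<none>" { print $1 }'
}

# Assumed usage against a live cluster:
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=NAME:.metadata.name,DELETED:.metadata.deletionTimestamp \
#     | pending_deletion
```

Anything this prints is waiting on the attacher to confirm detach. A name that stays in the list for many minutes is the signal to go look at the controller path rather than at the Pod.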

Common failure modes

  1. Multi-Attach
     • Many block volumes support a single attach.
     • If a node dies and the volume is still marked attached, the new node can’t attach.
  2. CSI controller path degraded
     • external-attacher not running / stuck / no leader
     • RBAC or cloud API errors
     • control-plane congestion
  3. Node is NotReady but not really gone
     • Kubernetes still thinks the node exists; detach can take a long time.
     • Force detach too early risks two nodes writing the same volume.

Runbook: from Pod symptom to safe recovery

What to check first

1) Pod events (they usually tell you which volume)

kubectl -n <ns> describe pod <pod> | sed -n '/Events:/,$p'

Look for lines like:

  • AttachVolume.Attach failed
  • Multi-Attach error
  • timed out waiting for the condition

2) Identify PVC and PV

kubectl -n <ns> get pod <pod> -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}'
kubectl -n <ns> get pvc <pvc> -o wide
kubectl get pv <pv> -o wide

3) Inspect VolumeAttachment objects (cluster-scoped)

kubectl get volumeattachment
kubectl describe volumeattachment <name>

If you have many, filter by PV name (the default table output includes PV and NODE columns):

kubectl get volumeattachment | grep -F "<pv>"

What I look for in describe:

  • target node name
  • attached: true/false
  • errors from the CSI driver
  • finalizers that aren’t being removed
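
When there are dozens of attachments, the fastest multi-attach check is to ask which PVs have VolumeAttachments pointing at more than one node. This is a rough sketch under assumptions: multi_node_pvs is a hypothetical helper name, and it expects four whitespace-separated columns (name, PV, node, attached) as produced by the custom-columns command in the comment.

```shell
# Reads rows of "NAME PV NODE ATTACHED" on stdin and prints every PV that has
# VolumeAttachments for more than one distinct node, with the nodes involved.
multi_node_pvs() {
  awk '
    { key = $2 SUBSEP $3 }
    !(key in seen) { seen[key] = 1; count[$2]++; nodes[$2] = nodes[$2] " " $3 }
    END { for (pv in count) if (count[pv] > 1) print pv ":" nodes[pv] }
  ' | sort
}

# Assumed usage against a live cluster:
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
#     | multi_node_pvs
```

A PV that shows up here with the failed node and the new node side by side matches the pattern from the incident: the old attachment has not been cleaned up yet.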

4) Check the CSI controller components

The exact names vary by driver, but you’re looking for:

  • external-attacher
  • external-provisioner
  • driver controller pods

kubectl -n kube-system get pods | grep -E 'csi|attacher|provisioner' | head -n 50

How to confirm the hypothesis

You have a “VolumeAttachment stuck” incident if:

  • Pod is blocked on volume attach
  • VolumeAttachment references an old node or sits in error
  • CSI controller components are unhealthy or the underlying volume is still attached elsewhere

A strong confirmation is when restoring the controller health (or fixing cloud API errors) causes VolumeAttachments to progress without force.

Safe mitigations

1) Make the CSI controller path healthy again

If external-attacher is down or wedged, fix that first:

  • restore replicas
  • fix RBAC
  • fix cloud API rate limits
  • restart only the controller components (not the whole cluster)

This is usually safe because it allows the designed state machine to complete.

2) Ensure the old node is truly dead before any “force”

If you suspect multi-attach:

  • confirm the old node is NotReady and not coming back (powered off or deleted at the provider, not just partitioned from the control plane)
  • ensure the filesystem isn’t mounted anywhere else
  • only then consider provider-side detach actions

3) Drain in the correct order

If you’re draining a node:

  • cordon first
  • delete pods that hold the volume if needed
  • wait for detach to complete before proceeding
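
The "wait for detach" step is the one people skip, so here is a minimal sketch of how I would check it. attachments_on_node is a hypothetical helper name, and it assumes four whitespace-separated columns (name, PV, node, attached) from the custom-columns command shown in the comment. The cordon and drain lines are ordinary kubectl commands; adjust the timeout to your own budget.

```shell
# Reads rows of "NAME PV NODE ATTACHED" on stdin and prints the names of
# VolumeAttachments that still target the given node.
attachments_on_node() {
  awk -v node="$1" '$3 == node { print $1 }'
}

# Assumed sequence against a live cluster:
#   kubectl cordon <node>
#   kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --timeout=10m
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=NAME:.metadata.name,PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
#     | attachments_on_node <node>
```

An empty result means nothing is still attached to the node from Kubernetes' point of view, which is the point where it is reasonable to proceed.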

Risky mitigations (high data-loss potential)

  • Deleting VolumeAttachment objects or stripping finalizers by hand
    • you can trick Kubernetes into thinking a volume is safe to reattach while it’s still mounted
  • Force detach in the storage provider without ensuring the node is dead
    • you can create split-brain at the filesystem level
  • Restarting everything
    • increases chaos; may hide the root cause

What we changed (concrete)

1) We made CSI controllers highly available

Before: one replica, no PDB, vulnerable to eviction during cluster stress.

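After: 2 replicas + PDB + priority class (representative example). A second replica only helps if the sidecars run with leader election enabled (external-attacher has a --leader-election flag), so that one instance acts at a time and the other can take over: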

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: csi-controller
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: csi-controller

And a deployment tweak (sketch):

spec:
  replicas: 2
  template:
    spec:
      priorityClassName: system-cluster-critical

2) We added an alert on stuck VolumeAttachments

Using kube-state-metrics (metric names can vary), the intent is:

  • alert if a VolumeAttachment exists for more than 15 minutes and is not progressing
  • alert if a volume is attached to a NotReady node for too long

Example query shape:

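# VolumeAttachments older than 15 minutes that still report attached=false
# (age alone would also match every healthy long-lived attachment)
(time() - kube_volumeattachment_created > 900)
  and on (volumeattachment) (kube_volumeattachment_status_attached == 0)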

3) We documented a “safe detach checklist”

We turned “tribal knowledge” into a checklist:

  • confirm node state
  • confirm mount state
  • confirm VolumeAttachment target
  • only then consider force detach
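
To make the ordering hard to skip, the checklist can be written as a gate that refuses to continue until each item has been confirmed. This is only an illustration: safe_detach_gate and its argument values ("gone", "unmounted") are names I made up, and every argument is a fact the operator has confirmed by hand, not something the script discovers.

```shell
# Hypothetical gate for the checklist above. Arguments, in order:
#   $1 node state   - pass "gone" only if the machine is confirmed off or deleted
#   $2 mount state  - pass "unmounted" only if no host still mounts the filesystem
#   $3 node named in the VolumeAttachment spec
#   $4 node you believe has failed
safe_detach_gate() {
  if [ "$1" != "gone" ]; then
    echo "STOP: node state not confirmed"
    return 1
  fi
  if [ "$2" != "unmounted" ]; then
    echo "STOP: mount state not confirmed"
    return 1
  fi
  if [ "$3" != "$4" ]; then
    echo "STOP: VolumeAttachment targets $3, not $4"
    return 1
  fi
  echo "OK: force detach may be considered"
}
```

For example, `safe_detach_gate gone unmounted node-a node-a` prints the OK line, while any unconfirmed item stops at the first failed check with a non-zero exit code.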

How to verify (measurable)

1) VolumeAttachment converges

kubectl get volumeattachment
kubectl describe volumeattachment <name>

Expected:

  • attachment errors stop
  • attached matches reality
  • finalizers get removed when appropriate
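
To turn "converges" into something measurable, I put a time budget on it instead of watching by eye. The sketch below is one way to do that. poll_until and va_attached are hypothetical helper names; va_attached assumes kubectl access and reads the VolumeAttachment's status.attached field.

```shell
# poll_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds; returns 1 once the time budget is used up.
poll_until() {
  timeout=$1
  interval=$2
  shift 2
  elapsed=0
  while ! "$@"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  return 0
}

# Succeeds when the named VolumeAttachment reports status.attached=true.
va_attached() {
  [ "$(kubectl get volumeattachment "$1" -o jsonpath='{.status.attached}')" = "true" ]
}

# Assumed usage: allow 10 minutes, checking every 15 seconds.
#   poll_until 600 15 va_attached <name> || echo "attach did not converge in time"
```

The same budget can later become the alert threshold, so the runbook and the alerting agree on what "too long" means.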

2) Pod transitions to Running

kubectl -n <ns> get pod <pod> -w

Expected: ContainerCreating → Running without repeated attach events.

3) Stateful workload passes a basic integrity check

For databases, I always run a lightweight integrity check or a read-only query that touches the data directory. The goal is to ensure we didn’t “recover” by corrupting.

Prevention / guardrails

  • Treat CSI controllers like control-plane
    • HA, PDB, priority class, observability
  • Time budgets
    • “volume detach must complete within N minutes” as an SLO
  • Alerting
    • stuck VolumeAttachment, repeated attach errors, multi-attach
  • Runbooks
    • explicit “safe vs risky” actions for storage incidents

Cite this article

If you reference this post, please link to the original URL and credit the author.

Michal Drozd. "CSI VolumeAttachment Stuck: Pods in ContainerCreating and Drains That Never Finish". https://www.michal-drozd.com/en/blog/kubernetes-volumeattachment-stuck-csi/ (Published November 30, 2025).