
K8s Pod Evictions: My Debugging Nightmare & Solution
r5yn1r4143
2h ago
Ah, the sweet symphony of kubectl logs and kubectl describe pod – the lullaby of every Kubernetes developer, right? Well, usually. But then there are those moments, usually at 3 AM, when your carefully crafted deployments suddenly decide to take an unscheduled nap, leaving you staring at a pod status of Evicted. Cue the existential dread. Just last week, I was wrestling with a microservice that kept disappearing from the cluster, and after a few rounds of frantic debugging, I realized I was staring down the barrel of pod eviction. It's like your cluster just decided, "Nah, you're not welcome here anymore, buddy."
TL;DR: Pod Eviction Woes
So, your pods are getting kicked out of the Kubernetes party? Most likely, it's one of these culprits:
- Resource Starvation: Your pod is hungry and the node can't feed it. Think OOMKilled or nodes running out of disk space.
- Node Pressure: The node itself is having a bad day (disk, memory, or PIDs).
- Taints and Tolerations: You've accidentally alienated a node with a nasty taint.
- Eviction Thresholds: The cluster's safety nets are doing their job, maybe a little too well.
The Usual Suspects: Resource Limits & Node Pressure
The most common reason I’ve seen for pod eviction is, surprisingly, the pod itself being a resource hog or the node being a bit… fragile. When a Kubernetes node starts running low on critical resources like memory, disk space, or even process IDs (PIDs), the kubelet on that node has to make some tough decisions. It needs to free up resources to keep the node stable, and it does this by evicting pods.
Error Message Alert! You'll often see something like this in your kubectl describe pod <pod-name> output, or even in the node's events:
```
Reason:  Evicted
Message: The node was low on resource: memory.
```
Or, if it's disk pressure:
```
Reason:  Evicted
Message: The node was low on resource: ephemeral-storage.
```
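Where do these messages come from? The kubelet's eviction thresholds. As a rough sketch (the values below are illustrative, not recommendations), they live in the KubeletConfiguration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: crossing one triggers immediate eviction.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
# Soft thresholds: eviction happens only if the signal stays
# below the threshold for the whole grace period.
evictionSoft:
  memory.available: "300Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```

If your managed cluster (EKS, GKE, AKS) is evicting pods "too eagerly", these thresholds are usually the knobs to look at — though on managed platforms you may not be able to change them.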
What to do:
Here’s a typical pod spec with requests and limits:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
  - name: app-container
    image: my-awesome-app:latest
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"   # 0.1 CPU core
      limits:
        memory: "128Mi"
        cpu: "200m"   # 0.2 CPU core
```
Pro-tip: Start with reasonable requests. Monitor your application’s actual resource usage. Tools like Prometheus and Grafana are your best friends here. If your pod is constantly hitting its memory limit, you might get OOMKilled (Out Of Memory killed), which also leads to eviction. If the node is low on memory, it might evict other pods to survive.
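One related knob: when the kubelet does have to evict, a pod's QoS class influences the order. A pod whose requests equal its limits for every resource lands in the Guaranteed class and is among the last to go. A hedged sketch (the pod name and image are placeholders):

```yaml
# Hypothetical pod in the Guaranteed QoS class:
# requests == limits for every container and resource,
# so the kubelet treats it as a last-resort eviction candidate.
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-pod
spec:
  containers:
  - name: app-container
    image: my-awesome-app:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "200m"
      limits:
        memory: "128Mi"
        cpu: "200m"
```

The trade-off: Guaranteed pods reserve their full limit up front, so you pay for that headroom whether you use it or not.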
Run kubectl describe node <node-name> and check the Conditions section. Look for MemoryPressure, DiskPressure, or PIDPressure set to True:

```
kubectl describe node worker-node-1 | grep Pressure
```
If any of these are True, it’s a strong indicator the node is struggling. That might mean you need to:
- Add more resources to the node: if it's a VM, scale it up.
- Add more nodes to your cluster: distribute the load.
- Identify the noisy neighbor: is another pod on that node hogging resources? kubectl top pod --sort-by=memory is a quick way to find out.
The Uninvited Guest: Taints and Tolerations
Another sneaky culprit is taints. Nodes can be "tainted" to repel pods that don't explicitly "tolerate" them. This is often used for dedicated nodes (e.g., GPU nodes) or for nodes that are undergoing maintenance. If your pod lands on a tainted node without the right toleration, it’s basically told to leave.
Error Message Example: You might not see an explicit "Evicted" message immediately, but the pod will likely be stuck in Pending or Evicted with a message pointing to the taint.
```
kubectl describe pod <pod-name>
...
Events:
  Type     Reason            Age              Message
  ----     ------            ----             -------
  Warning  FailedScheduling  1m (x5 over 5m)  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
```
What to do:
First, check the node for taints:

```
kubectl describe node <node-name> | grep Taints
```
You might see something like:
```
Taints: node-role.kubernetes.io/master:NoSchedule
Taints: gpu=true:NoSchedule
```
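For reference, here's roughly how that second taint appears in the node object itself (node name hypothetical), via kubectl get node worker-node-1 -o yaml:

```yaml
# Excerpt from the node spec. A pod's toleration must match
# this key/value/effect to be allowed onto the node.
spec:
  taints:
  - key: "gpu"
    value: "true"
    effect: "NoSchedule"
```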
If your pod genuinely belongs on that node, add a tolerations section to your pod definition.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-app-pod
spec:
  containers:
  - name: gpu-container
    image: my-gpu-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # Requesting a GPU
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
Important: Understand the effect (NoSchedule, PreferNoSchedule, NoExecute). NoSchedule means pods without the toleration won't be scheduled onto the node. NoExecute means pods already running on the node without the toleration will be evicted. This is a common cause of unexpected evictions if a NoExecute taint is added to a node that already has pods running on it.
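For NoExecute taints there's a middle ground: tolerationSeconds lets a pod stay on a freshly tainted node for a bounded time before being evicted. A sketch (the 300s value is illustrative — Kubernetes injects similar default tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable):

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # tolerate the taint for 5 minutes, then evict
```

This is why pods on a node that loses its heartbeat typically survive for about five minutes before being evicted, rather than disappearing instantly.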