
K8s Pod Evictions: My Debugging Nightmare & Solution
r5yn1r4143
2h ago
Ah, the sweet symphony of kubectl logs and kubectl describe pod – the lullaby of every Kubernetes developer, right? Well, usually. But then there are those moments, usually at 3 AM, when your carefully crafted deployments suddenly decide to take an unscheduled nap, leaving you staring at a pod status of Evicted. Cue the existential dread. Just last week, I was wrestling with a microservice that kept disappearing from the cluster, and after a few rounds of frantic debugging, I realized I was staring down the barrel of pod eviction. It's like your cluster just decided, "Nah, you're not welcome here anymore, buddy."
TL;DR: Pod Eviction Woes
So, your pods are getting kicked out of the Kubernetes party? Most likely, it's one of these culprits:
- Resource Starvation: Your pod is hungry and the node can't feed it. Think OOMKilled or nodes running out of disk space.
- Node Pressure: The node itself is having a bad day (disk, memory, or PIDs).
- Taints and Tolerations: You've accidentally alienated a node with a nasty taint.
- Eviction Thresholds: The cluster's safety nets are doing their job, maybe a little too well.
The Usual Suspects: Resource Limits & Node Pressure
The most common reason I’ve seen for pod eviction is, surprisingly, the pod itself being a resource hog or the node being a bit… fragile. When a Kubernetes node starts running low on critical resources like memory, disk space, or even process IDs (PIDs), the kubelet on that node has to make some tough decisions. It needs to free up resources to keep the node stable, and it does this by evicting pods.
Error Message Alert! You'll often see something like this in your kubectl describe pod <pod-name> output, or even in the node's events:
```
Reason:  Evicted
Message: The node was low on resource: memory.
```
Or, if it's disk pressure:
```
Reason:  Evicted
Message: The node was low on resource: ephemeral-storage.
```
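Where do these messages come from? The kubelet's eviction thresholds. As a rough sketch (the values below are illustrative, not recommendations), they live in the KubeletConfiguration:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Hard thresholds: crossing one triggers immediate eviction.
evictionHard:
  memory.available: "100Mi"
  nodefs.available: "10%"
  imagefs.available: "15%"
# Soft thresholds: eviction happens only if the signal stays
# below the threshold for the whole grace period.
evictionSoft:
  memory.available: "300Mi"
evictionSoftGracePeriod:
  memory.available: "1m30s"
```

If your managed cluster (EKS, GKE, AKS) is evicting pods "too eagerly", these thresholds are usually the knobs to look at — though on managed platforms you may not be able to change them.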
What to do:
Here’s a typical pod spec with requests and limits:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app-pod
spec:
  containers:
  - name: app-container
    image: my-awesome-app:latest
    resources:
      requests:
        memory: "64Mi"
        cpu: "100m"   # 0.1 CPU core
      limits:
        memory: "128Mi"
        cpu: "200m"   # 0.2 CPU core
```
Pro-tip: Start with reasonable requests. Monitor your application’s actual resource usage. Tools like Prometheus and Grafana are your best friends here. If your pod is constantly hitting its memory limit, you might get OOMKilled (Out Of Memory killed), which also leads to eviction. If the node is low on memory, it might evict other pods to survive.
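One related knob: when the kubelet does have to evict, a pod's QoS class influences the order. A pod whose requests equal its limits for every resource lands in the Guaranteed class and is among the last to go. A hedged sketch (the pod name and image are placeholders):

```yaml
# Hypothetical pod in the Guaranteed QoS class:
# requests == limits for every container and resource,
# so the kubelet treats it as a last-resort eviction candidate.
apiVersion: v1
kind: Pod
metadata:
  name: my-critical-pod
spec:
  containers:
  - name: app-container
    image: my-awesome-app:latest
    resources:
      requests:
        memory: "128Mi"
        cpu: "200m"
      limits:
        memory: "128Mi"
        cpu: "200m"
```

The trade-off: Guaranteed pods reserve their full limit up front, so you pay for that headroom whether you use it or not.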
Run kubectl describe node <node-name> and check the Conditions section. Look for MemoryPressure, DiskPressure, or PIDPressure set to True:

```
kubectl describe node worker-node-1 | grep Pressure
```
If any of these are True, it’s a strong indicator the node is struggling. That might mean you need to:
- Add more resources to the node: if it's a VM, scale it up.
- Add more nodes to your cluster: distribute the load.
- Identify the noisy neighbor: is another pod on that node hogging resources? kubectl top pod --sort-by=memory is a quick way to find out.
The Uninvited Guest: Taints and Tolerations
Another sneaky culprit is taints. Nodes can be "tainted" to repel pods that don't explicitly "tolerate" them. This is often used for dedicated nodes (e.g., GPU nodes) or for nodes that are undergoing maintenance. If your pod lands on a tainted node without the right toleration, it’s basically told to leave.
Error Message Example: You might not see an explicit "Evicted" message immediately, but the pod will likely be stuck in Pending or Evicted with a message pointing to the taint.
```
kubectl describe pod <pod-name>
...
Events:
  Type     Reason            Age              Message
  ----     ------            ----             -------
  Warning  FailedScheduling  1m (x5 over 5m)  0/3 nodes are available: 3 node(s) had taints that the pod didn't tolerate.
```
What to do:
First, check the node for taints:

```
kubectl describe node <node-name> | grep Taints
```
You might see something like:
```
Taints: node-role.kubernetes.io/master:NoSchedule
Taints: gpu=true:NoSchedule
```
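For reference, here's roughly how that second taint appears in the node object itself (node name hypothetical), via kubectl get node worker-node-1 -o yaml:

```yaml
# Excerpt from the node spec. A pod's toleration must match
# this key/value/effect to be allowed onto the node.
spec:
  taints:
  - key: "gpu"
    value: "true"
    effect: "NoSchedule"
```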
If your pod genuinely belongs on that node, add a tolerations section to your pod definition.
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-app-pod
spec:
  containers:
  - name: gpu-container
    image: my-gpu-app:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # Requesting a GPU
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
Important: Understand the effect (NoSchedule, PreferNoSchedule, NoExecute). NoSchedule means pods without the toleration won't be scheduled onto the node. NoExecute means pods already running on the node without the toleration will be evicted. This is a common cause of unexpected evictions if a NoExecute taint is added to a node that already has pods running on it.
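For NoExecute taints there's a middle ground: tolerationSeconds lets a pod stay on a freshly tainted node for a bounded time before being evicted. A sketch (the 300s value is illustrative — Kubernetes injects similar default tolerations for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable):

```yaml
tolerations:
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300  # tolerate the taint for 5 minutes, then evict
```

This is why pods on a node that loses its heartbeat typically survive for about five minutes before being evicted, rather than disappearing instantly.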