
Kubernetes - Debugging Pods and Nodes
Kubernetes is a robust platform for deploying and managing applications, but like any complex system, things can go wrong. Pods may crash unexpectedly, nodes can become unresponsive, and networking issues might disrupt communication between services. Instead of guessing what went wrong, we need a structured approach to troubleshooting.
In this chapter, we'll walk through practical debugging techniques to identify and resolve issues with Kubernetes Pods and Nodes, helping us maintain a stable and reliable cluster.
Understanding Kubernetes Debugging
Before jumping into specific debugging techniques, let's define what debugging in Kubernetes means. Debugging involves identifying and resolving issues in our cluster, such as:
- Application crashes due to misconfigurations or resource constraints.
- Networking problems that prevent communication between services.
- Node failures causing Pods to be unschedulable.
- PersistentVolume issues where data is inaccessible.
- Misconfigured deployments leading to unexpected behavior.
By understanding common failure points, we can systematically approach debugging in a structured way.
Debugging Kubernetes Pods
Pods are the smallest deployable units in Kubernetes, and most issues start at this level. Let's look at various ways to diagnose and fix Pod-related problems.
Checking Pod Status
The first step in debugging a Pod is to check its status. We use:
$ kubectl get pods
Output
NAME            READY   STATUS             RESTARTS      AGE
crashloop-pod   0/1     CrashLoopBackOff   3 (37s ago)   88s
The STATUS column tells us if the Pod is running, pending, or in an error state. If a Pod is in CrashLoopBackOff, it means the application is repeatedly crashing.
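Two other flags are worth knowing at this step: -w watches status changes in real time, and -o wide shows which node each Pod was scheduled on:
$ kubectl get pods -w
$ kubectl get pods -o wide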
Check the Logs of the Pod
Since the container is failing, logs provide valuable insights:
$ kubectl logs crashloop-pod
If there are multiple containers in the Pod, specify the container name:
$ kubectl logs crashloop-pod -c faulty-container
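If the container has already restarted, the current log stream may be empty. The logs of the previous, crashed instance are often more informative:
$ kubectl logs crashloop-pod --previous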
Describe the Pod for More Details
This command will show events and reasons for failure:
$ kubectl describe pod crashloop-pod
Output
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  31m                 default-scheduler  Successfully assigned default/crashloop-pod to node01
  Normal   Pulled     31m                 kubelet            Successfully pulled image "busybox" in 2.602s (2.602s including waiting). Image size: 2156519 bytes.
  Normal   Pulled     31m                 kubelet            Successfully pulled image "busybox" in 823ms (823ms including waiting). Image size: 2156519 bytes.
  Normal   Pulled     30m                 kubelet
  Normal   Created    30m (x4 over 31m)   kubelet            Created container: faulty-container
  Normal   Started    30m (x4 over 31m)   kubelet            Started container faulty-container
  Warning  BackOff    21m (x47 over 31m)  kubelet            Back-off restarting failed container faulty-container...
Look for messages under the Events section; this often gives hints about why the Pod is crashing, such as an image pull error, insufficient CPU/memory, or a missing secret.
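Describe output can be long; to pull out just the reason for the container's last termination, a jsonpath query is a handy shortcut (this assumes a single-container Pod):
$ kubectl get pod crashloop-pod -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'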
Possible Fixes
If it's an image issue, ensure the correct image name is used:
$ kubectl get pod crashloop-pod -o yaml | grep image
Output
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"crashloop-pod","namespace":"default"}, "spec":{"containers":[{"command":["sh","-c","exit 1"],"image":"busybox","name":"faulty-container"}]}} image: busybox imagePullPolicy: Always - image: busybox imagePullPolicy: Always image: busybox imagePullPolicy: Always image: docker.io/library/busybox:latest imageID:
From the above output, the image itself is fine; the problem is the container's command (exit 1), which exits immediately and causes the Pod to keep crashing. To fix this, update the Pod's YAML definition and recreate it.
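For reference, here is the Pod's full manifest, reconstructed from the last-applied-configuration annotation in the output above:
apiVersion: v1
kind: Pod
metadata:
  name: crashloop-pod
  namespace: default
spec:
  containers:
  - name: faulty-container
    image: busybox
    command: ["sh", "-c", "exit 1"]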
Fix the CrashLoopBackOff Issue
Most fields of a running Pod's spec, including the container command, are immutable, so kubectl edit pod would reject this change. Instead, open the Pod's manifest file in an editor:
$ vi crashloop-pod.yaml
Then, modify the command section under spec.containers so the container runs a long-lived process:
containers:
- name: faulty-container
  image: busybox
  command: ["sh", "-c", "sleep infinity"]
Save and exit the editor.
Delete and Recreate the Pod
Since the running Pod cannot pick up the new command in place, delete it and recreate it from the updated manifest:
$ kubectl delete pod crashloop-pod
$ kubectl apply -f crashloop-pod.yaml
Now, verify the status:
$ kubectl get pods
NAME            READY   STATUS    RESTARTS   AGE
crashloop-pod   0/1     Running   0          25s
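To confirm the Pod settles into a healthy state rather than just catching it mid-restart, kubectl wait can block until its Ready condition is met:
$ kubectl wait --for=condition=Ready pod/crashloop-pod --timeout=60s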
Debugging Kubernetes Nodes
If a Pod issue isn't resolved, the problem might be at the Node level. Kubernetes nodes run workloads, and if they fail, Pods may become unavailable.
Checking Node Status
List the nodes and their conditions:
$ kubectl get nodes
Output
NAME           STATUS   ROLES           AGE   VERSION
controlplane   Ready    control-plane   94m   v1.31.6
node01         Ready    <none>          94m   v1.31.6
A NotReady node indicates a problem. One common cause is a missing or broken Pod network: without a working CNI plugin, a node never becomes Ready. In such cases, installing a network plugin like Flannel can restore connectivity:
$ kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
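To pinpoint why a node is NotReady, inspect its reported conditions and recent events, which usually name the exact problem (for example, NetworkUnavailable, MemoryPressure, or DiskPressure):
$ kubectl describe node node01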
Investigating Node Issues
Checking Kubelet Logs
Kubelet manages Pods on a node. If a node is failing, check its logs:
$ journalctl -u kubelet -f
Look for errors like failed to start container or out of memory issues.
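If the log stream is silent, first confirm the kubelet service is running at all:
$ sudo systemctl status kubelet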
Checking Disk Space
Nodes may fail if they run out of disk space. Verify with:
$ df -h
If disk usage is high, clean up logs and unused containers:
$ sudo docker system prune -a
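Note that docker system prune only applies to Docker-based nodes. On clusters that use containerd as the runtime (the default in recent Kubernetes versions), the equivalent image cleanup goes through crictl, assuming it is installed on the node:
$ sudo crictl rmi --prune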
Restarting Node Services
If debugging doesn't solve the issue, restarting the node's core services (the kubelet and the container runtime) may help:
$ sudo systemctl restart kubelet containerd
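Afterwards, watch the node list until the affected node reports Ready again:
$ kubectl get nodes -w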
Checking Network Connectivity
If a Pod cannot communicate with another service, test networking using:
$ kubectl exec -it $(kubectl get pod -l app=web-app-xyz -o jsonpath="{.items[0].metadata.name}") -- curl http://my-service:8080
Output
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
...
The kubectl exec command ran curl inside the web-app-xyz Pod and made a request to my-service:8080. The response confirms that Nginx is reachable through the Service and serving its default welcome page.
If network policies are blocking communication, review them with:
$ kubectl get networkpolicy -A
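To see exactly which ingress and egress rules a given policy enforces, describe it directly (substitute your own policy name and namespace):
$ kubectl describe networkpolicy <policy-name> -n <namespace>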
Debugging Kubernetes Cluster Issues
Checking Kubernetes API Server
If kubectl commands are slow or failing, the API server may be down. Verify with:
$ kubectl cluster-info
Output
Kubernetes control plane is running at https://172.16.32.5:6443
CoreDNS is running at https://172.16.32.5:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
To further debug and diagnose cluster problems, use:
$ kubectl cluster-info dump
If the API server is unreachable, check its logs on the control plane node. On clusters where the API server runs as a systemd service, use journalctl:
$ journalctl -u kube-apiserver -f
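On kubeadm-based clusters the API server runs as a static Pod rather than a systemd service, so the journalctl unit above will not exist. In that case, inspect the container directly on the control plane node with crictl (the container ID placeholder is yours to fill in):
$ sudo crictl ps -a | grep kube-apiserver
$ sudo crictl logs <container-id>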
Checking Control Plane Components
Run:
$ kubectl get pods -n kube-system
Output
NAME                       READY   STATUS    RESTARTS   AGE
coredns-7c65d6cfc9-k4f4d   1/1     Running   0          48m
coredns-7c65d6cfc9-pp729   1/1     Running   0          48m
etcd-controlplane          1/1     Running   0          48m
...
If key components like kube-controller-manager or etcd are failing, restart them. On kubeadm clusters these run as static Pods managed by the kubelet, so restarting the kubelet recreates them:
$ sudo systemctl restart kubelet
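The API server also exposes an aggregated health endpoint that reports the status of individual control plane checks, which is a quick way to spot a failing component:
$ kubectl get --raw='/readyz?verbose'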
Checking DNS Issues
If services fail due to DNS problems, test resolution inside a Pod:
$ kubectl run -it --rm --image=busybox dns-test -- nslookup my-service
Output
If you don't see a command prompt, try pressing enter.
Server:    10.96.0.10
Address:   10.96.0.10:53

Name:      my-service.default.svc.cluster.local
Address:   10.103.45.32
If it fails, restart CoreDNS:
$ kubectl rollout restart deployment coredns -n kube-system
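Then wait for the restart to complete and re-run the nslookup test to confirm resolution works:
$ kubectl rollout status deployment coredns -n kube-system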
Conclusion
Debugging Kubernetes requires a systematic approach. By analyzing Pod logs, inspecting events, checking node status, and verifying networking, we can efficiently diagnose and resolve issues across the cluster.