Kubernetes abstracts away infrastructure until something breaks. Then you need to peel back the layers. These debugging patterns will help you find problems fast.

First Steps: Get the Lay of the Land

# Cluster health
kubectl cluster-info
kubectl get nodes
kubectl top nodes

# Namespace overview
kubectl get all -n myapp

# Events (recent issues surface here)
kubectl get events -n myapp --sort-by='.lastTimestamp'

Pod Debugging

Check Pod Status

# List pods with status
kubectl get pods -n myapp

# Detailed pod info
kubectl describe pod mypod -n myapp

# Common status meanings:
# Pending      - Waiting for scheduling or image pull
# Running      - At least one container running
# Succeeded    - All containers completed successfully
# Failed       - All containers terminated, at least one failed
# CrashLoopBackOff - Container crashing repeatedly (a container waiting reason, not a pod phase)
# ImagePullBackOff - Can't pull container image (also a waiting reason)

View Logs

# Current logs
kubectl logs mypod -n myapp

# Previous container (after crash)
kubectl logs mypod -n myapp --previous

# Follow logs
kubectl logs -f mypod -n myapp

# Specific container (multi-container pod)
kubectl logs mypod -n myapp -c mycontainer

# Last N lines
kubectl logs mypod -n myapp --tail=100

# Since timestamp
kubectl logs mypod -n myapp --since=1h

Execute Commands in Pod

# Shell into running container
kubectl exec -it mypod -n myapp -- /bin/sh

# Run specific command
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Specific container
kubectl exec -it mypod -n myapp -c mycontainer -- /bin/sh

Debug Crashed Containers

# Check why it crashed
kubectl describe pod mypod -n myapp | grep -A 10 "Last State"

# View previous logs
kubectl logs mypod -n myapp --previous

# Attach an ephemeral debug container (ephemeral containers are GA in K8s 1.25)
kubectl debug mypod -n myapp -it --image=busybox --target=mycontainer

Common Pod Issues

ImagePullBackOff

# Check events for details
kubectl describe pod mypod -n myapp | grep -A 5 Events

# Common causes:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Registry rate limiting (Docker Hub)

# Verify image exists
docker pull myimage:tag

# Check imagePullSecrets
kubectl get pod mypod -n myapp -o jsonpath='{.spec.imagePullSecrets}'

CrashLoopBackOff

# Get exit code
kubectl describe pod mypod -n myapp | grep "Exit Code"

# Exit codes:
# 0   - Clean exit (under restartPolicy: Always, even success restarts the container, causing a loop)
# 1   - Application error
# 137 - SIGKILL (128 + 9), most often OOMKilled
# 139 - SIGSEGV (128 + 11), segmentation fault
# 143 - SIGTERM (128 + 15), graceful shutdown

# Check resource limits
kubectl describe pod mypod -n myapp | grep -A 5 Limits
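The 128 + signal arithmetic above generalizes to any exit code over 128. A small sketch (the function name `explain_exit` is mine) that turns a container exit code into a readable cause:

```shell
#!/bin/sh
# Map a container exit code to a likely cause.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit() {
  code="$1"
  case "$code" in
    0)   echo "clean exit" ;;
    137) echo "SIGKILL (often OOMKilled)" ;;
    139) echo "SIGSEGV (segmentation fault)" ;;
    143) echo "SIGTERM (graceful shutdown requested)" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application exit code $code"
      fi ;;
  esac
}

explain_exit 137   # SIGKILL (often OOMKilled)
explain_exit 130   # killed by signal 2
```

Feed it the code from `kubectl describe pod ... | grep "Exit Code"` to skip a round of signal-table lookups.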

Pending Pods

# Check why not scheduled
kubectl describe pod mypod -n myapp | grep -A 10 Events

# Common causes:
# - Insufficient resources
# - Node selector/affinity not matched
# - Taints without tolerations
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"

Resource Issues

Memory Problems

# Check pod resource usage
kubectl top pod mypod -n myapp

# Check for OOMKilled
kubectl describe pod mypod -n myapp | grep OOMKilled

# View memory limits
kubectl get pod mypod -n myapp -o jsonpath='{.spec.containers[*].resources}'

CPU Throttling

# Check CPU usage vs limits
kubectl top pod mypod -n myapp

# In container, check throttling (cgroup v1 path; on cgroup v2 the file is /sys/fs/cgroup/cpu.stat)
kubectl exec mypod -n myapp -- cat /sys/fs/cgroup/cpu/cpu.stat

Networking Debugging

Service Connectivity

# Check service exists
kubectl get svc -n myapp

# Check endpoints (are pods backing the service?)
kubectl get endpoints myservice -n myapp

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Then: wget -qO- http://myservice.myapp.svc.cluster.local

# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup myservice.myapp.svc.cluster.local

Pod-to-Pod Communication

# Get pod IPs
kubectl get pods -n myapp -o wide

# Test connectivity from one pod to another
kubectl exec mypod1 -n myapp -- wget -qO- http://10.0.0.5:8080

# Check network policies
kubectl get networkpolicies -n myapp

Ingress Issues

# Check ingress configuration
kubectl describe ingress myingress -n myapp

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc myservice -n myapp

ConfigMaps and Secrets

# Verify ConfigMap exists and has expected data
kubectl get configmap myconfig -n myapp -o yaml

# Check if mounted correctly
kubectl exec mypod -n myapp -- ls -la /etc/config/
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Verify Secret
kubectl get secret mysecret -n myapp -o jsonpath='{.data.password}' | base64 -d

# Check environment variables
kubectl exec mypod -n myapp -- env | grep MY_VAR

Persistent Volumes

# Check PVC status
kubectl get pvc -n myapp

# Describe for binding issues
kubectl describe pvc mypvc -n myapp

# Check PV
kubectl get pv

# Verify mount in pod
kubectl exec mypod -n myapp -- df -h
kubectl exec mypod -n myapp -- ls -la /data

Node Issues

# Node status
kubectl get nodes
kubectl describe node mynode

# Check conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,CONDITIONS:.status.conditions[*].type

# Node resource pressure
kubectl describe node mynode | grep -A 5 Conditions

# Pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=mynode

# Drain node for maintenance
kubectl drain mynode --ignore-daemonsets --delete-emptydir-data

Control Plane Debugging

# API server health (/healthz is deprecated since 1.16; prefer /livez and /readyz)
kubectl get --raw='/healthz'

# Component status (deprecated but useful)
kubectl get componentstatuses

# Check system pods
kubectl get pods -n kube-system

# API server logs (if accessible)
kubectl logs -n kube-system kube-apiserver-master

# etcd health (usually needs cert flags, e.g. --cacert/--cert/--key under /etc/kubernetes/pki/etcd)
kubectl exec -n kube-system etcd-master -- etcdctl endpoint health

Useful Debug Containers

# Network debugging
kubectl run netdebug --rm -it --image=nicolaka/netshoot -- /bin/bash

# DNS debugging
kubectl run dnsdebug --rm -it --image=tutum/dnsutils -- /bin/bash

# General debugging
kubectl run debug --rm -it --image=busybox -- /bin/sh

Systematic Debugging Checklist

  1. Events first: kubectl get events --sort-by='.lastTimestamp'
  2. Describe the resource: kubectl describe <resource> <name>
  3. Check logs: kubectl logs <pod> (and --previous)
  4. Verify dependencies: ConfigMaps, Secrets, Services, PVCs
  5. Check resources: CPU, memory limits and usage
  6. Test connectivity: DNS, service endpoints, network policies
  7. Compare with working: Diff against known good configuration
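The first few steps of this checklist can be chained into one helper. A sketch (`triage_pod` is my name for it, and it assumes `kubectl` is already configured for the cluster):

```shell
# triage_pod <pod> <namespace>: run the first checklist steps in order.
triage_pod() {
  pod="$1"; ns="$2"
  echo "=== Recent events ==="
  kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20
  echo "=== Describe ==="
  kubectl describe pod "$pod" -n "$ns"
  echo "=== Current logs ==="
  kubectl logs "$pod" -n "$ns" --tail=50
  echo "=== Previous logs (if it crashed) ==="
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null || echo "(no previous container)"
}

# usage: triage_pod mypod myapp
```

One scroll through its output covers events, describe, and both log streams before you start guessing.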

Quick Reference

# Pod not starting
kubectl describe pod <name>
kubectl get events

# Pod crashing
kubectl logs <pod> --previous
kubectl describe pod <name> | grep "Exit Code"

# Can't connect to service
kubectl get endpoints <service>
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>

# Resource issues
kubectl top pods
kubectl describe node | grep -A 5 "Allocated"

# Config issues
kubectl exec <pod> -- env
kubectl exec <pod> -- cat /path/to/config

Kubernetes debugging is methodical. Start with events, drill into describe output, check logs, and verify each dependency. Most issues are configuration mismatches—wrong image tags, missing secrets, insufficient resources.

When stuck, compare against something that works. The diff usually reveals the problem.
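diff plus process substitution does that comparison in one step. A minimal illustration with inline specs (with live objects you would substitute `kubectl get deployment <name> -o yaml` for each printf):

```shell
#!/bin/bash
# Compare two (abbreviated) deployment specs; only the differing lines print.
diff <(printf 'image: myapp:v1\nreplicas: 3\n') \
     <(printf 'image: myapp:v2\nreplicas: 3\n') || true
# 1c1
# < image: myapp:v1
# ---
# > image: myapp:v2
```

Identical lines (the replica count) vanish from the output, so the mismatched image tag is the only thing left to read.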