Kubernetes Troubleshooting: A Practical Field Guide
When your pods won't start or your services won't connect, here's how to systematically diagnose what's wrong.
March 5, 2026 · 6 min · 1234 words · Rob Washington
Kubernetes failures are rarely mysterious once you know where to look. The problem is knowing where to look. This guide covers the systematic approach to diagnosing common Kubernetes issues.
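Before diving into any specific failure mode, a quick first-pass triage usually narrows things down; this is a minimal sketch, with pod and namespace names as placeholders:

```shell
# Broad sweep: anything not Running or Completed?
kubectl get pods --all-namespaces | grep -v -E 'Running|Completed'

# Recent cluster events, newest last
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

# Zoom in on a suspect pod
kubectl describe pod <pod> -n <namespace>
```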
Pending — Pod can’t be scheduled

```shell
# Check events for why
kubectl describe pod <pod> | grep -A 20 Events

# Common causes:
# - Insufficient CPU/memory (node capacity)
# - Node selector/affinity can't be satisfied
# - PersistentVolumeClaim pending
```
ImagePullBackOff — Can’t pull container image
```shell
# Check image name for typos
kubectl describe pod <pod> | grep Image

# Check pull secrets
kubectl get secrets -n <namespace>
kubectl describe pod <pod> | grep -A 5 "Image Pull Secrets"

# Test image pull manually
docker pull <image>
```
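If the image lives in a private registry, the usual fix is to create a pull secret and wire it up; the secret name, registry URL, and credentials below are placeholders:

```shell
# Create a registry credential secret (values are placeholders)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password> \
  -n <namespace>

# Attach it to the namespace's default service account
# so pods pick it up without spec changes
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```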
CrashLoopBackOff — Container starts, then crashes
```shell
# Check container logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous  # logs from last crash

# Check exit code
kubectl describe pod <pod> | grep -A 10 "Last State"
```
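The exit code narrows the cause quickly: 137 means the kernel killed the container (usually OOM), while low codes come from the application itself. A small helper sketch using the common conventions (the mapping is a rule of thumb, not a kubectl feature):

```shell
# Rough translation of common container exit codes (convention, not an API)
explain_exit() {
  case "$1" in
    0)   echo "completed normally" ;;
    1)   echo "application error" ;;
    126) echo "command not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)   echo "unknown; check application logs" ;;
  esac
}

# Pull the code straight from pod status, then interpret it:
# explain_exit "$(kubectl get pod <pod> -n <namespace> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```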
CreateContainerConfigError — ConfigMap or Secret missing
```shell
# List what the pod expects
kubectl describe pod <pod> | grep -A 20 "Volumes"

# Verify they exist
kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>
```
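When the referenced object is simply missing, creating it in the right namespace unblocks the container; the names and keys below are placeholders for your app:

```shell
# Create the missing ConfigMap (name and keys are placeholders)
kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=info -n <namespace>

# Or the missing Secret
kubectl create secret generic app-secrets \
  --from-literal=DB_PASSWORD=<password> -n <namespace>

# The kubelet retries on its own, but deleting the pod forces
# an immediate restart once the object exists
kubectl delete pod <pod> -n <namespace>
```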
```shell
# Check node resources
kubectl describe nodes | grep -A 10 "Allocated resources"

# Check if pod requests exceed available
kubectl describe pod <pod> | grep -A 5 "Requests"
kubectl top nodes
```
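For an at-a-glance view of headroom, custom columns print each node's allocatable capacity in one table (the fields shown are standard node status fields):

```shell
# Compact view of what each node can still hand out
kubectl get nodes -o custom-columns=\
NAME:.metadata.name,\
CPU:.status.allocatable.cpu,\
MEMORY:.status.allocatable.memory
```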
Check the service’s endpoints with kubectl get endpoints <service> -n <namespace>. If the list is empty, the selector doesn’t match any ready pods:
```shell
# Check pod labels
kubectl get pods -n <namespace> --show-labels

# Compare to service selector
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'
```
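Once you spot the mismatch, fix whichever side is wrong. For example, assuming the pods carry app=web but the service selects something else (labels here are hypothetical):

```shell
# Option 1: point the service at the label the pods actually have
kubectl patch svc <service> -n <namespace> \
  -p '{"spec": {"selector": {"app": "web"}}}'

# Option 2: fix the pod labels at the source instead (better long-term),
# by editing the Deployment's pod template
kubectl edit deployment <deployment> -n <namespace>
```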
```shell
# From inside the cluster
kubectl run debug --rm -it --image=busybox -- sh
wget -qO- http://<service>.<namespace>.svc.cluster.local:<port>

# Check DNS resolution
nslookup <service>.<namespace>.svc.cluster.local
```
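If DNS itself is failing, check the cluster DNS pods. In most clusters CoreDNS runs in kube-system under the k8s-app=kube-dns label, though managed platforms may label it differently:

```shell
# Is cluster DNS healthy?
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Any errors in its logs?
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```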
```shell
# List policies affecting the namespace
kubectl get networkpolicies -n <namespace>

# Describe to see ingress/egress rules
kubectl describe networkpolicy <policy> -n <namespace>
```
Network policies are deny-by-default once they apply: as soon as any policy selects a pod, traffic to that pod is blocked unless an explicit allow rule permits it.
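A minimal allow rule looks like the sketch below; the pod labels and port are placeholders for your workload:

```shell
# Hypothetical policy: allow ingress to app=web pods on port 8080
kubectl apply -n <namespace> -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-ingress
spec:
  podSelector:
    matchLabels:
      app: web
  ingress:
    - ports:
        - port: 8080
EOF
```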
OOMKilled — Container exceeded its memory limit

```shell
# Check memory limits vs actual usage
kubectl describe pod <pod> | grep -A 3 "Limits"
kubectl top pod <pod>

# Check node memory pressure
kubectl describe node <node> | grep -A 5 "Conditions"
```
Fix: Increase memory limits or fix the memory leak.
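Raising the limit is a one-liner against the controller; the values here are placeholders to size for your workload:

```shell
# Bump the container's memory request and limit on its Deployment
kubectl set resources deployment <deployment> -n <namespace> \
  --requests=memory=256Mi --limits=memory=512Mi
```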
Useful debugging commands

```shell
# Get all resources in a namespace
kubectl get all -n <namespace>

# Watch resources change
kubectl get pods -n <namespace> -w

# Get events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Debug networking
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

# Copy files to/from a pod
kubectl cp <pod>:/path/to/file ./local-file
kubectl cp ./local-file <pod>:/path/to/file

# Port forward for local testing
kubectl port-forward svc/<service> 8080:80 -n <namespace>
```
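On Kubernetes 1.23+, kubectl debug can attach an ephemeral container to a running pod, which is handy when the app image has no shell; the busybox image here is just one convenient choice:

```shell
# Attach a throwaway shell to a running pod (ephemeral container)
kubectl debug <pod> -n <namespace> -it --image=busybox --target=<container>

# Or clone the pod with an added debug container for safer poking
kubectl debug <pod> -n <namespace> -it --image=busybox --copy-to=<pod>-debug
```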
App runs → Configs mounted → Secrets available → Connections work
When something breaks, identify which stage failed and inspect that component. The error messages are usually accurate; the skill is knowing where to find them.