Kubernetes hides complexity until something breaks. Then you need to know where to look. Here’s a systematic approach to debugging production issues.
## The Debugging Hierarchy
Start broad, narrow down:
- Cluster level: Nodes healthy? Resources available?
- Namespace level: Deployments running? Services configured?
- Pod level: Containers starting? Logs clean?
- Container level: Process running? Resources sufficient?
## Quick Health Check

```bash
# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A

# Pods not running
kubectl get pods -A | grep -v Running | grep -v Completed

# Events (recent issues)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
```
## Pod Troubleshooting

### Pod States

| State | Meaning | Check |
|---|---|---|
| Pending | Can’t be scheduled | Resources, node selectors, taints |
| ContainerCreating | Image pulling or volume mounting | Image name, pull secrets, PVCs |
| CrashLoopBackOff | Container crashing repeatedly | Logs, resource limits, probes |
| ImagePullBackOff | Can’t pull image | Image name, registry auth |
| Error | Container exited with error | Logs |
### Pending Pods

```bash
# Why is it pending?
kubectl describe pod my-pod

# Look for:
# - Insufficient cpu/memory
# - No nodes match nodeSelector
# - Taints not tolerated
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"

# Check PVC status
kubectl get pvc
```
### CrashLoopBackOff

```bash
# Get logs from current container
kubectl logs my-pod

# Get logs from previous (crashed) container
kubectl logs my-pod --previous

# Get logs from specific container
kubectl logs my-pod -c my-container

# Follow logs
kubectl logs -f my-pod

# Last N lines
kubectl logs --tail=100 my-pod
```
Common causes:
- Application error (check logs)
- OOMKilled (increase memory limit)
- Liveness probe failing
- Missing config/secrets
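When a liveness probe is the culprit, the usual fix is to give the application more startup headroom before the kubelet starts restarting it. A sketch with placeholder path, port, and timing values:

```yaml
livenessProbe:
  httpGet:
    path: /health          # hypothetical health endpoint
    port: 8080
  initialDelaySeconds: 30  # let the app finish booting before the first check
  periodSeconds: 10
  failureThreshold: 3      # 3 consecutive failures before a restart
```

Too-aggressive values here (a short `initialDelaySeconds` against a slow-booting app) produce exactly the restart loop described above.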
### Exec Into Pod

```bash
# Interactive shell
kubectl exec -it my-pod -- /bin/sh

# Run command
kubectl exec my-pod -- cat /app/config.yaml

# Specific container
kubectl exec -it my-pod -c my-container -- /bin/sh
```
## Resource Issues

### OOMKilled

```bash
# Check if OOMKilled
kubectl describe pod my-pod | grep -i oom

# Check resource usage
kubectl top pod my-pod

# Check limits vs requests
kubectl get pod my-pod -o yaml | grep -A10 resources
```
Fix: Increase memory limits or optimize application.
```yaml
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Increase this
```
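If the workload also must not be evicted under node memory pressure, another option is the Guaranteed QoS class, which Kubernetes assigns when every container's requests equal its limits. A sketch with placeholder values:

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"  # equal to the request -> Guaranteed QoS
    cpu: "500m"
```

Guaranteed pods are the last to be evicted when a node runs low on memory.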
### CPU Throttling

```bash
# Check CPU usage
kubectl top pod my-pod

# High CPU but pod is slow = throttling
```
Fix: Raise the CPU limit, or drop it entirely and keep only the request (the pod stays in the Burstable QoS class and can use idle CPU without being throttled).
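A sketch of the no-CPU-limit approach, with placeholder values: keep the request so the scheduler reserves capacity, and omit the CPU limit so the container can burst without CFS throttling.

```yaml
resources:
  requests:
    cpu: "250m"        # scheduler guarantee; also the pod's fair-share weight
  # no cpu limit: the container may burst into idle CPU, avoiding throttling
  limits:
    memory: "512Mi"    # keep a memory limit - memory is not compressible
```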
## Networking Issues

### Service Not Reachable

```bash
# Check service endpoints
kubectl get endpoints my-service

# No endpoints = selector doesn't match pods
kubectl get svc my-service -o yaml | grep -A5 selector
kubectl get pods --show-labels

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Inside: wget -qO- my-service:8080/health
```
### DNS Issues

```bash
# Test DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup my-service

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
```
### Network Policies

```bash
# Check if network policies exist
kubectl get networkpolicies

# Describe policy
kubectl describe networkpolicy my-policy

# Test connectivity
kubectl exec my-pod -- wget -qO- --timeout=2 other-service:8080
```
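If a policy is silently dropping traffic, it helps to know what an allow rule looks like. A minimal sketch (all names and labels here are hypothetical): permit ingress to pods labeled `app=api` only from pods labeled `app=frontend`, on port 8080.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
spec:
  podSelector:
    matchLabels:
      app: api          # policy applies to these pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Remember that once any policy selects a pod, all traffic not explicitly allowed to it is denied.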
## Storage Issues

### PVC Pending

```bash
# Check PVC status
kubectl describe pvc my-pvc

# Look for:
# - No matching StorageClass
# - Insufficient storage
# - Wrong access mode
```
### Volume Mount Failures

```bash
# Check pod events
kubectl describe pod my-pod | grep -A10 Events

# Common issues:
# - PVC not bound
# - Wrong mount path
# - Permission denied
```
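For the "permission denied" case, a common fix is `fsGroup` in the pod's security context, which makes the kubelet set group ownership on the volume's contents so a non-root container can write to it. A sketch (the GID is a placeholder):

```yaml
spec:
  securityContext:
    fsGroup: 2000   # volume files are group-owned by this GID at mount time
```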
## Deployment Issues

### Rollout Stuck

```bash
# Check rollout status
kubectl rollout status deployment/my-app

# Check deployment
kubectl describe deployment my-app

# Check replica sets
kubectl get rs

# Check why new pods aren't starting
kubectl get pods -l app=my-app
kubectl describe pod <new-pod>
```
### Rollback

```bash
# Rollback to previous version
kubectl rollout undo deployment/my-app

# Rollback to specific revision
kubectl rollout history deployment/my-app
kubectl rollout undo deployment/my-app --to-revision=2
```
## Node Issues

### Node NotReady

```bash
# Check node status
kubectl describe node problematic-node

# Look for:
# - Conditions (MemoryPressure, DiskPressure)
# - Taints
# - Kubelet logs

# SSH to node and check
journalctl -u kubelet -f
```
### Drain Node

```bash
# Safely remove pods before maintenance
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

# Bring back
kubectl uncordon node-name
```
## Ephemeral Debug Container

```bash
# Attach debug container to running pod (K8s 1.23+)
kubectl debug -it my-pod --image=busybox --target=my-container

# Debug node
kubectl debug node/my-node -it --image=ubuntu
```
## Port Forward

```bash
# Access pod directly
kubectl port-forward pod/my-pod 8080:8080

# Access service
kubectl port-forward svc/my-service 8080:80

# Access in background
kubectl port-forward pod/my-pod 8080:8080 &
```
## Copy Files

```bash
# Copy from pod
kubectl cp my-pod:/app/logs/app.log ./app.log

# Copy to pod
kubectl cp ./config.yaml my-pod:/app/config.yaml
```
## Useful One-Liners

```bash
# Pods sorted by restart count
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'

# Pods by memory usage
kubectl top pods --sort-by=memory

# Warning events across all namespaces
kubectl get events --field-selector type=Warning -A

# All images in cluster
kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u

# Pods on specific node
kubectl get pods -A --field-selector spec.nodeName=node-1

# Delete all evicted pods
kubectl get pods -A | grep Evicted | awk '{print $2 " -n " $1}' | xargs -L1 kubectl delete pod

# Force delete stuck pod
kubectl delete pod my-pod --grace-period=0 --force
```
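The Evicted-pods one-liner is easier to trust once you see what its text-processing stages emit. Here they are run against fabricated `kubectl get pods -A` output, so no cluster is needed:

```shell
# Fabricated `kubectl get pods -A` output: NAMESPACE NAME READY STATUS RESTARTS AGE
sample='default   cache-7d9f   0/1   Evicted   0   3h
prod      web-5c8b    0/1   Evicted   0   1h
prod      api-6f2a    1/1   Running   0   1h'

# grep keeps the Evicted rows; awk rearranges each into delete-pod arguments
printf '%s\n' "$sample" | grep Evicted | awk '{print $2 " -n " $1}'
# cache-7d9f -n default
# web-5c8b -n prod
```

Each emitted line then becomes one `kubectl delete pod <name> -n <namespace>` invocation via `xargs -L1`.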
## Common Patterns

### Health Check Debug

```bash
# Test liveness endpoint manually
kubectl exec my-pod -- wget -qO- localhost:8080/health

# Check probe config
kubectl get pod my-pod -o yaml | grep -A10 livenessProbe
```
### Secret/ConfigMap Issues

```bash
# Verify secret exists
kubectl get secret my-secret -o yaml

# Check if mounted correctly
kubectl exec my-pod -- ls -la /etc/secrets/
kubectl exec my-pod -- cat /etc/secrets/password
```
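Secret values in `.data` are base64-encoded, so pulling one key with something like `kubectl get secret my-secret -o jsonpath='{.data.password}'` and piping it through `base64 -d` recovers the plaintext (secret and key names here are hypothetical). The round-trip itself looks like this:

```shell
# Encode and decode a sample value - the same `base64 -d` step applies
# to any key pulled out of a secret's .data map.
printf 's3cr3t' | base64        # prints: czNjcjN0
printf 'czNjcjN0' | base64 -d   # prints: s3cr3t
```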
### Init Container Failures

```bash
# Check init container status
kubectl describe pod my-pod | grep -A20 "Init Containers"

# Get init container logs
kubectl logs my-pod -c init-container-name
```
## The Checklist

When something’s broken:

1. `kubectl get pods` - What state?
2. `kubectl describe pod` - Events section
3. `kubectl logs` - Application errors
4. `kubectl logs --previous` - If crashing
5. `kubectl get events` - Cluster-wide issues
6. `kubectl top` - Resource problems
7. `kubectl exec` - Debug from inside
Kubernetes troubleshooting is pattern recognition. Learn the common failure modes, and you’ll fix most issues in minutes.