When a Kubernetes deployment goes sideways at 3am, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents.
## The First Three Commands
Before diving deep, these three commands tell you 80% of what you need:
```bash
# What's not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# What happened recently?
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Resource pressure?
kubectl top nodes
```
Run these first. Always.
## Pod Troubleshooting Ladder

### Level 1: Pod Won’t Start
```bash
# Check pod status and events
kubectl describe pod $POD -n $NS

# Look for:
# - ImagePullBackOff → registry auth or image doesn't exist
# - CrashLoopBackOff → app crashing immediately
# - Pending → scheduler can't place it
# - Init:Error → init container failing
```
For ImagePullBackOff:
```bash
# Test registry access
kubectl run test --image=$IMAGE --restart=Never -it --rm -- sh

# Check pull secrets
kubectl get secrets -n $NS | grep -i pull
kubectl describe secret $PULL_SECRET -n $NS
```
For Pending pods:
```bash
# Why can't it schedule?
kubectl get events -n $NS --field-selector involvedObject.name=$POD

# Common causes:
# - Insufficient CPU/memory
# - Node selector/affinity doesn't match
# - PVC can't bind
# - Taints with no tolerations
```
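When the blocker is a taint, the pod spec needs a matching toleration. A minimal sketch, assuming a hypothetical node tainted with `dedicated=gpu:NoSchedule`:

```yaml
# Hypothetical taint: kubectl taint nodes $NODE dedicated=gpu:NoSchedule
# The toleration must match the taint's key, value, and effect.
spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
```

Compare against the actual taint from `kubectl describe node $NODE | grep Taints` before copying values.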
### Level 2: Container Crashing
```bash
# Get the exit code
kubectl get pod $POD -n $NS -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Common exit codes:
# 0 = clean exit (if it keeps restarting anyway, check the restartPolicy)
# 1 = application error
# 137 = OOMKilled or SIGKILL (exit code = 128 + signal 9)
# 143 = SIGTERM, graceful shutdown (128 + signal 15)

# Check the previous container's logs
kubectl logs $POD -n $NS --previous

# Memory investigation
kubectl describe pod $POD -n $NS | grep -A5 "Last State"
```
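The 128 + signal arithmetic is easy to verify locally without a cluster:

```shell
# A process killed by signal N exits with code 128 + N.
# SIGKILL is signal 9, so an OOM kill surfaces as 137:
sh -c 'kill -KILL $$'
echo "exit code: $?"   # prints: exit code: 137
```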
For OOMKilled:
```yaml
# Option 1: Increase limits
resources:
  limits:
    memory: "512Mi"   # Was probably too low
  requests:
    memory: "256Mi"

# Option 2: Fix the memory leak in your app
```
### Level 3: Running But Unhealthy
```bash
# Check readiness/liveness probes
kubectl describe pod $POD -n $NS | grep -A10 "Liveness\|Readiness"

# Watch probe failures in real time
kubectl get events -n $NS -w | grep $POD
```
Probe debugging:
```bash
# Exec into the pod and test the probe manually
kubectl exec -it $POD -n $NS -- curl -v localhost:8080/health

# Common issues:
# - Wrong port
# - Path returns 500
# - Probe timeout too short for slow startup
# - TCP probe on an HTTPS port
```
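For slow-starting apps, a `startupProbe` stops the liveness probe from killing the pod before it has finished booting. A sketch with illustrative values only; the path, port, and thresholds are assumptions to tune for your app:

```yaml
# Illustrative probe config; /health on 8080 is a placeholder endpoint
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3        # raise if the endpoint is slow under load
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 5
  failureThreshold: 30     # tolerates up to 150s of startup time
```

While the startup probe is running, liveness and readiness checks are suspended, so the app gets its full startup window.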
## Service Troubleshooting
```bash
# Does the service have endpoints?
kubectl get endpoints $SVC -n $NS

# Empty endpoints = selectors don't match any pods
kubectl get svc $SVC -n $NS -o yaml | grep -A5 selector
kubectl get pods -n $NS --show-labels | grep -E "$(kubectl get svc $SVC -n $NS -o jsonpath='{.spec.selector}' | tr -d '{}"' | tr ':' '=')"
```
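Here's what that selector-to-label pipeline does, run against sample jsonpath output (the labels are made up):

```shell
# kubectl prints the selector map as JSON, e.g. {"app":"myapp","tier":"web"}.
# Strip braces and quotes, turn colons into '=' and commas into newlines:
echo '{"app":"myapp","tier":"web"}' | tr -d '{}"' | tr ':' '=' | tr ',' '\n'
# → app=myapp
# → tier=web
```

Those `key=value` pairs are exactly what must appear in the pod labels for the service to pick them up.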
Testing service connectivity:
```bash
# From another pod in the cluster
kubectl run debug --image=nicolaka/netshoot --restart=Never -it --rm -- \
  curl -v $SVC.$NS.svc.cluster.local:$PORT

# DNS resolution
kubectl run debug --image=nicolaka/netshoot --restart=Never -it --rm -- \
  nslookup $SVC.$NS.svc.cluster.local
```
## Ingress Debugging
```bash
# Check ingress status
kubectl describe ingress $ING -n $NS

# Verify backend service
kubectl get ingress $ING -n $NS -o jsonpath='{.spec.rules[*].http.paths[*].backend}'

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep $HOST
```
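Most ingress 404s and 502s trace back to a backend mismatch: the service name and port in the ingress must match an existing Service exactly. A sketch of the `networking.k8s.io/v1` backend shape (host, service name, and port are placeholders):

```yaml
# Hypothetical rule: "web" and port 80 must match a real Service in the
# same namespace as the Ingress.
spec:
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port:
                  number: 80
```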
## Node Problems
```bash
# Node conditions
kubectl describe node $NODE | grep -A10 Conditions

# Taints blocking scheduling
kubectl describe node $NODE | grep Taints

# Resource allocation
kubectl describe node $NODE | grep -A5 "Allocated resources"

# System pods on the problematic node
kubectl get pods -A --field-selector spec.nodeName=$NODE | grep -v Running
```
## Network Debugging
The Swiss Army knife for network issues:

```bash
# Deploy netshoot for debugging
kubectl run netdebug --image=nicolaka/netshoot -it --rm -- bash

# Inside the debug pod:

# Test DNS
nslookup kubernetes.default.svc.cluster.local

# Test service connectivity
curl -v $SERVICE:$PORT

# Check network policy effects
iptables -L -n | head -50

# Trace route to an external address
traceroute 8.8.8.8
```
## Storage Issues
```bash
# PVC stuck in Pending
kubectl describe pvc $PVC -n $NS

# Check storage classes
kubectl get storageclass

# PV status
kubectl get pv | grep $PVC

# Common issues:
# - Storage class doesn't exist
# - Storage class has no provisioner
# - PV/PVC access mode mismatch
# - Capacity mismatch
```
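A PVC only binds when storage class, access mode, and capacity all line up with what the provisioner can supply. A minimal sketch; the class name is a placeholder, so check `kubectl get storageclass` for what your cluster actually offers:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  storageClassName: standard   # must exist in the cluster; hypothetical name
  accessModes:
    - ReadWriteOnce            # must be supported by the provisioner
  resources:
    requests:
      storage: 10Gi
```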
I keep this debug-script ConfigMap in every cluster:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: debug-scripts
  namespace: default
data:
  cluster-health.sh: |
    #!/bin/bash
    echo "=== Unhealthy Pods ==="
    kubectl get pods -A | grep -v Running | grep -v Completed
    echo ""
    echo "=== Recent Events ==="
    kubectl get events -A --sort-by='.lastTimestamp' | tail -10
    echo ""
    echo "=== Node Resources ==="
    kubectl top nodes
    echo ""
    echo "=== PVC Issues ==="
    kubectl get pvc -A | grep -v Bound
```
## Golden Rules
- Check events first — they tell you what Kubernetes thinks went wrong
- Previous logs exist — use `--previous` for crashed containers
- Describe is your friend — `describe` gives more info than `get`
- Test from inside — `kubectl exec` or debug pods eliminate external factors
- Labels matter — service selectors and pod labels must match exactly
The difference between a 30-minute outage and a 3-hour one is usually having a systematic approach. Start broad, narrow down, verify each step.