When a Kubernetes deployment goes sideways at 3am, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents.

The First Three Commands

Before diving deep, these three commands tell you 80% of what you need:

# What's not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# What happened recently?
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Resource pressure?
kubectl top nodes

Run these first. Always.

Pod Troubleshooting Ladder

Level 1: Pod Won’t Start

# Check pod status and events
kubectl describe pod $POD -n $NS

# Look for:
# - ImagePullBackOff → registry auth or image doesn't exist
# - CrashLoopBackOff → app crashing immediately
# - Pending → scheduler can't place it
# - Init:Error → init container failing

For ImagePullBackOff:

# Test registry access
kubectl run test --image=$IMAGE --restart=Never -it --rm -- sh

# Check secrets
kubectl get secrets -n $NS | grep -i pull
kubectl describe secret $PULL_SECRET -n $NS
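If the pull secret is missing or simply not referenced by the pod, attaching one usually clears ImagePullBackOff. A minimal sketch — the secret name `my-registry-key`, the registry URL, and the image are placeholder values, not anything from your cluster:

```yaml
# Create the secret first (placeholder credentials):
#   kubectl create secret docker-registry my-registry-key \
#     --docker-server=registry.example.com \
#     --docker-username=deploy --docker-password='...' -n $NS
# Then reference it from the pod spec:
spec:
  imagePullSecrets:
    - name: my-registry-key                        # hypothetical secret name
  containers:
    - name: app
      image: registry.example.com/team/app:1.2.3   # placeholder image
```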

For Pending pods:

# Why can't it schedule?
kubectl get events -n $NS --field-selector involvedObject.name=$POD

# Common causes:
# - Insufficient CPU/memory
# - Node selector/affinity doesn't match
# - PVC can't bind
# - Taints with no tolerations
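For the taint and selector cases, the fix lives in the pod spec. A hedged sketch — the taint key `dedicated=infra` and the node label `disktype: ssd` are made-up examples, not standard keys:

```yaml
spec:
  # Matches a taint applied like: kubectl taint nodes $NODE dedicated=infra:NoSchedule
  tolerations:
    - key: "dedicated"      # hypothetical taint key
      operator: "Equal"
      value: "infra"
      effect: "NoSchedule"
  # Only schedule onto nodes carrying this (hypothetical) label
  nodeSelector:
    disktype: ssd
```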

Level 2: Container Crashing

# Get the exit code
kubectl get pod $POD -n $NS -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Common exit codes:
# 0   = normal exit (container finished; restartPolicy: Always restarts it anyway)
# 1   = application error
# 137 = OOMKilled (exit code = 128 + signal 9)
# 143 = SIGTERM (graceful shutdown)

# Check previous logs
kubectl logs $POD -n $NS --previous

# Memory investigation
kubectl describe pod $POD -n $NS | grep -A5 "Last State"
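The signal arithmetic above (exit code = 128 + signal number) is easy to fumble at 3am. A small pure-shell helper — no cluster required — that decodes the common cases:

```shell
# Decode a container exit code into its likely cause.
# Codes above 128 mean the process was killed by a signal: signal = code - 128.
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error (exit $code)"
  fi
}

decode_exit 137   # → killed by signal 9    (SIGKILL; the classic OOM kill)
decode_exit 143   # → killed by signal 15   (SIGTERM; graceful shutdown)
```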

For OOMKilled:

# Option 1: Increase limits
resources:
  limits:
    memory: "512Mi"  # Was probably too low
  requests:
    memory: "256Mi"

# Option 2: Fix the memory leak in your app
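Before raising the limit, compare it against what `kubectl top pod` reports. Since top prints Mi/Gi quantities, a tiny normalizer helps — this sketch handles only the Mi and Gi suffixes, which covers most limits:

```shell
# Convert a Kubernetes memory quantity (Mi/Gi) to bytes so usage and
# limits can be compared numerically. Anything else passes through as-is.
to_bytes() {
  local qty=$1
  local num=${qty%[MG]i}   # strip the Mi/Gi suffix
  case $qty in
    *Gi) echo $((num * 1024 * 1024 * 1024)) ;;
    *Mi) echo $((num * 1024 * 1024)) ;;
    *)   echo "$qty" ;;
  esac
}

to_bytes 512Mi   # 536870912
to_bytes 2Gi     # 2147483648
```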

Level 3: Running But Unhealthy

# Check readiness/liveness probes
kubectl describe pod $POD -n $NS | grep -A10 "Liveness\|Readiness"

# Watch probe failures in real-time
kubectl get events -n $NS -w | grep $POD

Probe debugging:

# Exec into pod and test the probe manually
kubectl exec -it $POD -n $NS -- curl -v localhost:8080/health

# Common issues:
# - Wrong port
# - Path returns 500
# - Probe timeout too short for slow startup
# - TCP probe on HTTPS port
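For the slow-startup case in particular, a startupProbe keeps liveness from killing the app before it finishes booting. A hedged example — the port, path, and timings are placeholders to tune for your app:

```yaml
containers:
  - name: app
    # Gives the app up to 30 * 5s = 150s to come up before liveness applies
    startupProbe:
      httpGet:
        path: /health        # must match what the app actually serves
        port: 8080           # the HTTP port, not the HTTPS one
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      timeoutSeconds: 3      # raise this if the endpoint is slow under load
```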

Service Troubleshooting

# Does the service have endpoints?
kubectl get endpoints $SVC -n $NS

# Empty endpoints = selectors don't match any pods
kubectl get svc $SVC -n $NS -o yaml | grep -A5 selector
kubectl get pods -n $NS --show-labels -l "$(kubectl get svc $SVC -n $NS -o jsonpath='{.spec.selector}' | tr -d '{}" ' | tr ':' '=')"
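The selector-to-label transform in that one-liner is worth having as a function. Shown here on a sample selector so it can be sanity-checked without a cluster — `app=web` and `tier=frontend` are made-up labels:

```shell
# Turn a service selector, as printed by kubectl's jsonpath (e.g.
# {"app":"web","tier":"frontend"}), into a label selector for kubectl -l.
selector_to_labels() {
  echo "$1" | tr -d '{}" ' | tr ':' '='
}

selector_to_labels '{"app":"web","tier":"frontend"}'   # app=web,tier=frontend
selector_to_labels '{"app":"web"}'                     # app=web
```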

Testing service connectivity:

# From another pod in the cluster
kubectl run debug --image=nicolaka/netshoot --restart=Never -it --rm -- \
  curl -v $SVC.$NS.svc.cluster.local:$PORT

# DNS resolution
kubectl run debug --image=nicolaka/netshoot --restart=Never -it --rm -- \
  nslookup $SVC.$NS.svc.cluster.local

Ingress Debugging

# Check ingress status
kubectl describe ingress $ING -n $NS

# Verify backend service
kubectl get ingress $ING -n $NS -o jsonpath='{.spec.rules[*].http.paths[*].backend}'

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx --tail=100 | grep $HOST

Node Problems

# Node conditions
kubectl describe node $NODE | grep -A10 Conditions

# Taints blocking scheduling
kubectl describe node $NODE | grep Taints

# Resource allocation
kubectl describe node $NODE | grep -A5 "Allocated resources"

# System pods on problematic node
kubectl get pods -A --field-selector spec.nodeName=$NODE | grep -v Running

Network Debugging

The Swiss Army knife for network issues:

# Deploy netshoot for debugging
kubectl run netdebug --image=nicolaka/netshoot -it --rm -- bash

# Inside the debug pod:
# Test DNS
nslookup kubernetes.default.svc.cluster.local

# Test service connectivity
curl -v $SERVICE:$PORT

# Check iptables (note: NetworkPolicy is enforced on the node,
# so its rules may not be visible from inside the pod)
iptables -L -n | head -50

# Trace route to external
traceroute 8.8.8.8
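If the debug pod can't reach a service that the application's own pods can, suspect a NetworkPolicy. A minimal sketch of the kind of policy that causes exactly this symptom — `app: api` and `app: frontend` are placeholder labels:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend    # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: api                # placeholder: the pods being protected
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend   # only pods with this label get in; a fresh
                              # debug pod won't match, so it is blocked
```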

Storage Issues

# PVC stuck in Pending
kubectl describe pvc $PVC -n $NS

# Check storage class
kubectl get storageclass

# PV status (the CLAIM column shows namespace/pvc-name)
kubectl get pv | grep "$NS/$PVC"

# Common issues:
# - Storage class doesn't exist
# - Storage class has no provisioner
# - PV/PVC access mode mismatch
# - Capacity mismatch
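The class and access-mode mismatches above usually show up in the PVC spec itself. A sketch of a PVC that binds cleanly, assuming a StorageClass named `standard` exists in your cluster:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim             # hypothetical name
spec:
  storageClassName: standard   # must exist: kubectl get storageclass
  accessModes:
    - ReadWriteOnce            # must be supported by the provisioner/PV
  resources:
    requests:
      storage: 10Gi            # must fit within an available PV's capacity
```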

My Debugging Toolkit Configmap

I keep this in every cluster:

apiVersion: v1
kind: ConfigMap
metadata:
  name: debug-scripts
  namespace: default
data:
  cluster-health.sh: |
    #!/bin/bash
    echo "=== Unhealthy Pods ==="
    kubectl get pods -A | grep -v Running | grep -v Completed
    echo ""
    echo "=== Recent Events ==="
    kubectl get events -A --sort-by='.lastTimestamp' | tail -10
    echo ""
    echo "=== Node Resources ==="
    kubectl top nodes
    echo ""
    echo "=== PVC Issues ==="
    kubectl get pvc -A | grep -v Bound

Golden Rules

  1. Check events first — they tell you what Kubernetes thinks went wrong
  2. Previous logs exist — use --previous for crashed containers
  3. Describe is your friend — more info than get
  4. Test from inside — kubectl exec or debug pods eliminate external factors
  5. Labels matter — service selectors and pod labels must match exactly

The difference between a 30-minute outage and a 3-hour one is usually having a systematic approach. Start broad, narrow down, verify each step.