Kubernetes hides complexity until something breaks. Then you need to know where to look. Here’s a systematic approach to debugging production issues.

The Debugging Hierarchy

Start broad, narrow down:

  1. Cluster level: Nodes healthy? Resources available?
  2. Namespace level: Deployments running? Services configured?
  3. Pod level: Containers starting? Logs clean?
  4. Container level: Process running? Resources sufficient?

Quick Health Check

# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A

# Pods not running
kubectl get pods -A | grep -v Running | grep -v Completed

# Events (recent issues)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

Pod Troubleshooting

Pod States

State             | Meaning                          | Check
------------------|----------------------------------|----------------------------------
Pending           | Can't be scheduled               | Resources, node selectors, taints
ContainerCreating | Image pulling or volume mounting | Image name, pull secrets, PVCs
CrashLoopBackOff  | Container crashing repeatedly    | Logs, resource limits, probes
ImagePullBackOff  | Can't pull image                 | Image name, registry auth
Error             | Container exited with error      | Logs
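
When scanning for problem pods, the status column can be filtered with awk; a minimal sketch, assuming the default column order of `kubectl get pods -A` (NAMESPACE, NAME, READY, STATUS, ...):

```shell
# Print namespace/name and state for pods not Running or Completed.
# Assumes default `kubectl get pods -A` columns: NAMESPACE NAME READY STATUS RESTARTS AGE
unhealthy_pods() {
  awk 'NR > 1 && $4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Demo on captured sample output; in real use: kubectl get pods -A | unhealthy_pods
unhealthy_pods <<'EOF'
NAMESPACE     NAME        READY   STATUS             RESTARTS   AGE
default       web-1       1/1     Running            0          2d
default       worker-1    0/1     CrashLoopBackOff   12         3h
kube-system   coredns-1   1/1     Running            1          9d
EOF
# → default/worker-1: CrashLoopBackOff
```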

Pending Pods

# Why is it pending?
kubectl describe pod my-pod

# Look for:
# - Insufficient cpu/memory
# - No nodes match nodeSelector
# - Taints not tolerated
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"

# Check PVC status
kubectl get pvc

CrashLoopBackOff

# Get logs from current container
kubectl logs my-pod

# Get logs from previous (crashed) container
kubectl logs my-pod --previous

# Get logs from specific container
kubectl logs my-pod -c my-container

# Follow logs
kubectl logs -f my-pod

# Last N lines
kubectl logs --tail=100 my-pod

Common causes:

  • Application error (check logs)
  • OOMKilled (increase memory limit)
  • Liveness probe failing
  • Missing config/secrets
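
The container's exit code (shown in `kubectl describe pod` under Last State) narrows the cause: codes above 128 mean the process died from a signal, where exit code = 128 + signal number.

```shell
# Exit code = 128 + signal number when a container is killed by a signal.
echo $((128 + 9))    # SIGKILL: the classic OOMKilled signature
echo $((128 + 15))   # SIGTERM: e.g. a failed liveness probe triggering a restart
```

So exit code 137 almost always points at the OOM killer, while 143 indicates a requested shutdown.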

Exec Into Pod

# Interactive shell
kubectl exec -it my-pod -- /bin/sh

# Run command
kubectl exec my-pod -- cat /app/config.yaml

# Specific container
kubectl exec -it my-pod -c my-container -- /bin/sh

Resource Issues

OOMKilled

# Check if OOMKilled
kubectl describe pod my-pod | grep -i oom

# Check resource usage
kubectl top pod my-pod

# Check limits vs requests
kubectl get pod my-pod -o yaml | grep -A10 resources

Fix: Increase the memory limit or reduce the application's memory footprint.

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"  # Increase this

CPU Throttling

# Check CPU usage
kubectl top pod my-pod

# Usage pinned at the limit while the app is slow usually means throttling.
# Confirm with the container_cpu_cfs_throttled_seconds_total metric if you scrape cAdvisor.

Fix: Raise the CPU limit, or drop it and keep only the request (Burstable QoS) so the pod can use idle CPU.
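
One common shape for that fix: keep a memory limit but set only a CPU request, so the pod schedules predictably yet can burst into idle CPU. A sketch with illustrative values:

```yaml
resources:
  requests:
    cpu: "250m"        # guaranteed share used for scheduling
    memory: "256Mi"
  limits:
    memory: "512Mi"    # keep a memory limit; no cpu limit means no CFS throttling
```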

Networking Issues

Service Not Reachable

# Check service endpoints
kubectl get endpoints my-service

# No endpoints = selector doesn't match pods
kubectl get svc my-service -o yaml | grep -A5 selector
kubectl get pods --show-labels

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Inside: wget -qO- my-service:8080/health

DNS Issues

# Test DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup my-service

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns
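
If the short name fails, try the fully qualified name; in-cluster service DNS follows `<service>.<namespace>.svc.<cluster-domain>`, where the cluster domain defaults to `cluster.local`:

```shell
# Build a service's in-cluster FQDN (cluster domain defaults to cluster.local)
service_fqdn() {
  echo "$1.$2.svc.cluster.local"   # $1 = service, $2 = namespace
}

service_fqdn my-service default
# → my-service.default.svc.cluster.local
```

A bare `nslookup my-service` only resolves from the same namespace; cross-namespace lookups need at least `my-service.<namespace>`.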

Network Policies

# Check if network policies exist
kubectl get networkpolicies

# Describe policy
kubectl describe networkpolicy my-policy

# Test connectivity
kubectl exec my-pod -- wget -qO- --timeout=2 other-service:8080

Storage Issues

PVC Pending

# Check PVC status
kubectl describe pvc my-pvc

# Look for:
# - No matching StorageClass
# - Insufficient storage
# - Wrong access mode
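
For reference, a minimal PVC sketch; `storageClassName` must name a class that actually exists (`kubectl get storageclass`) and the access mode must be one the provisioner supports:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
spec:
  accessModes:
    - ReadWriteOnce          # single-node read/write; check provisioner support
  storageClassName: standard # placeholder; must match an existing StorageClass
  resources:
    requests:
      storage: 10Gi
```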

Volume Mount Failures

# Check pod events
kubectl describe pod my-pod | grep -A10 Events

# Common issues:
# - PVC not bound
# - Wrong mount path
# - Permission denied

Deployment Issues

Rollout Stuck

# Check rollout status
kubectl rollout status deployment/my-app

# Check deployment
kubectl describe deployment my-app

# Check replica sets
kubectl get rs

# Check why new pods aren't starting
kubectl get pods -l app=my-app
kubectl describe pod <new-pod>

Rollback

# Rollback to previous version
kubectl rollout undo deployment/my-app

# Rollback to specific revision
kubectl rollout history deployment/my-app
kubectl rollout undo deployment/my-app --to-revision=2

Node Issues

Node NotReady

# Check node status
kubectl describe node problematic-node

# Look for:
# - Conditions (MemoryPressure, DiskPressure, PIDPressure)
# - Taints added by the node controller (e.g. node.kubernetes.io/not-ready)
# - Last heartbeat / condition transition times

# SSH to the node and check kubelet logs
journalctl -u kubelet -f

Drain Node

# Safely remove pods before maintenance
kubectl drain node-name --ignore-daemonsets --delete-emptydir-data

# Bring back
kubectl uncordon node-name

Debugging Tools

Ephemeral Debug Container

# Attach debug container to running pod (K8s 1.23+)
kubectl debug -it my-pod --image=busybox --target=my-container

# Debug node
kubectl debug node/my-node -it --image=ubuntu

Port Forward

# Access pod directly
kubectl port-forward pod/my-pod 8080:8080

# Access service
kubectl port-forward svc/my-service 8080:80

# Access in background
kubectl port-forward pod/my-pod 8080:8080 &
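
When backgrounding, capture the PID so the tunnel can be torn down cleanly; a sketch where `sleep` stands in for the port-forward process:

```shell
# 'sleep 30' stands in for: kubectl port-forward pod/my-pod 8080:8080
sleep 30 &
PF_PID=$!

# ... run requests against localhost:8080 here ...

kill "$PF_PID"
wait "$PF_PID" 2>/dev/null || true
echo "port-forward stopped"
```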

Copy Files

# Copy from pod
kubectl cp my-pod:/app/logs/app.log ./app.log

# Copy to pod
kubectl cp ./config.yaml my-pod:/app/config.yaml

Useful One-Liners

# Pods sorted by restart count
kubectl get pods --sort-by='.status.containerStatuses[0].restartCount'

# Pods by memory usage
kubectl top pods --sort-by=memory

# Warning events across all namespaces
kubectl get events --field-selector type=Warning -A

# All images in cluster
kubectl get pods -A -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u

# Pods on specific node
kubectl get pods -A --field-selector spec.nodeName=node-1

# Delete all evicted pods
kubectl get pods -A | grep Evicted | awk '{print $2 " -n " $1}' | xargs -L1 kubectl delete pod

# Force delete stuck pod
kubectl delete pod my-pod --grace-period=0 --force
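
The evicted-pod one-liner depends on `kubectl get pods -A` printing namespace first and name second; you can dry-run the awk step on captured output before piping it into `kubectl delete`:

```shell
# Dry-run: extract "<name> -n <namespace>" pairs for Evicted pods
awk '/Evicted/ { print $2 " -n " $1 }' <<'EOF'
NAMESPACE   NAME      READY   STATUS    RESTARTS   AGE
default     web-1     1/1     Running   0          2d
prod        job-abc   0/1     Evicted   0          6h
EOF
# → job-abc -n prod  (each line becomes one `kubectl delete pod` call via xargs -L1)
```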

Common Patterns

Health Check Debug

# Test liveness endpoint manually
kubectl exec my-pod -- wget -qO- localhost:8080/health

# Check probe config
kubectl get pod my-pod -o yaml | grep -A10 livenessProbe
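
For reference, a typical livenessProbe stanza; the values are illustrative and should be tuned to the app's real startup and response times:

```yaml
livenessProbe:
  httpGet:
    path: /health          # must match the endpoint tested above
    port: 8080
  initialDelaySeconds: 10  # grace period before the first probe
  periodSeconds: 10        # probe interval
  failureThreshold: 3      # consecutive failures before the container restarts
```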

Secret/ConfigMap Issues

# Verify secret exists
kubectl get secret my-secret -o yaml

# Check if mounted correctly
kubectl exec my-pod -- ls -la /etc/secrets/
kubectl exec my-pod -- cat /etc/secrets/password
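
Values under a Secret's `.data` are base64-encoded, so decode before comparing with what the container sees. The round-trip itself (`my-secret`/`password` are placeholders):

```shell
# In real use: kubectl get secret my-secret -o jsonpath='{.data.password}' | base64 -d
encoded=$(printf 'hunter2' | base64)   # sample value, not a real secret
echo "$encoded"                        # → aHVudGVyMg==
printf '%s' "$encoded" | base64 -d     # → hunter2
echo
```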

Init Container Failures

# Check init container status
kubectl describe pod my-pod | grep -A20 "Init Containers"

# Get init container logs
kubectl logs my-pod -c init-container-name

The Checklist

When something’s broken:

  1. kubectl get pods - What state?
  2. kubectl describe pod - Events section
  3. kubectl logs - Application errors
  4. kubectl logs --previous - If crashing
  5. kubectl get events - Cluster-wide issues
  6. kubectl top - Resource problems
  7. kubectl exec - Debug from inside

Kubernetes troubleshooting is pattern recognition. Learn the common failure modes, and you’ll fix most issues in minutes.