Kubernetes failures are rarely mysterious once you know where to look. The problem is knowing where to look. This guide covers the systematic approach to diagnosing common Kubernetes issues.

The Diagnostic Hierarchy

Start broad, drill down:

Cluster → Node → Pod → Container → Application

At each level, the same questions apply:

  1. What’s the current state?
  2. What’s the desired state?
  3. What changed recently?

Pod Won’t Start

The most common issue. Work through this checklist:

Step 1: Get Pod Status

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

The describe output tells you almost everything. Look for:

  • Events at the bottom (most recent issues)
  • Conditions (Ready, Initialized, ContainersReady)
  • Container State (Waiting, Running, Terminated)

Step 2: Decode the Status

Pending — Scheduler can’t place the pod

# Check events for why
kubectl describe pod <pod> | grep -A 20 Events

# Common causes:
# - Insufficient CPU/memory (node capacity)
# - Node selector/affinity can't be satisfied
# - PersistentVolumeClaim pending
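For example, a pod whose requests exceed any single node's allocatable capacity stays Pending indefinitely; the values below are deliberately oversized to illustrate (names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: greedy
spec:
  containers:
    - name: app
      image: busybox
      resources:
        requests:
          cpu: "64"        # more CPU than a typical node can allocate
          memory: 512Gi    # scheduler events will report Insufficient memory
```

The Events section of `kubectl describe pod` will show a FailedScheduling message naming the unsatisfiable resource.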

ImagePullBackOff — Can’t pull container image

# Check image name for typos
kubectl describe pod <pod> | grep Image

# Check pull secrets
kubectl get secrets -n <namespace>
kubectl describe pod <pod> | grep -A 5 "Image Pull Secrets"

# Test image pull manually
docker pull <image>
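If the image lives in a private registry, the pod must also reference a pull secret; a sketch with hypothetical names (`regcred`, registry placeholders left as-is):

```yaml
# Create the secret first:
#   kubectl create secret docker-registry regcred \
#     --docker-server=<registry> --docker-username=<user> \
#     --docker-password=<password> -n <namespace>
apiVersion: v1
kind: Pod
metadata:
  name: private-app
spec:
  imagePullSecrets:
    - name: regcred          # must exist in the same namespace
  containers:
    - name: app
      image: <registry>/<image>:<tag>
```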

CrashLoopBackOff — Container starts, then crashes

# Check container logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous  # logs from last crash

# Check exit code
kubectl describe pod <pod> | grep -A 10 "Last State"

CreateContainerConfigError — ConfigMap or Secret missing

# List what the pod expects
kubectl describe pod <pod> | grep -A 20 "Volumes"

# Verify they exist
kubectl get configmaps -n <namespace>
kubectl get secrets -n <namespace>
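The error typically originates in a reference like the following; if `app-config` or `app-secret` (hypothetical names) does not exist in the namespace, the kubelet cannot create the container:

```yaml
spec:
  containers:
    - name: app
      image: <image>
      envFrom:
        - configMapRef:
            name: app-config   # must exist, unless marked optional: true
      env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: app-secret # must exist and contain the named key
              key: api-key
```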

Step 3: Resource Issues

# Check node resources
kubectl describe nodes | grep -A 10 "Allocated resources"

# Check if pod requests exceed available
kubectl describe pod <pod> | grep -A 5 "Requests"
kubectl top nodes

Service Won’t Connect

Traffic not reaching your pods:

Step 1: Verify the Chain

Client → Service → Endpoints → Pod → Container

Each link can break independently.

Step 2: Check Service

kubectl get svc <service> -n <namespace>
kubectl describe svc <service> -n <namespace>

Look for:

  • Selector — Does it match your pods?
  • Endpoints — Are pods listed?
  • Port/TargetPort — Correct mapping?
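A sketch of how those three fields must line up, with hypothetical names: the Service's selector must equal labels on the pods, and targetPort must be the port the container actually listens on:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web            # must match the pod template's labels exactly
  ports:
    - port: 80          # port clients use when calling the Service
      targetPort: 8080  # port the container listens on
---
# Fragment of the matching Deployment pod template:
# template:
#   metadata:
#     labels:
#       app: web
```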

Step 3: Check Endpoints

kubectl get endpoints <service> -n <namespace>

If empty, the selector doesn’t match any ready pods:

# Check pod labels
kubectl get pods -n <namespace> --show-labels

# Compare to service selector
kubectl get svc <service> -n <namespace> -o jsonpath='{.spec.selector}'

Step 4: Test Connectivity

# From inside the cluster
kubectl run debug --rm -it --image=busybox -- sh
# then, from the shell inside the debug pod:
wget -qO- http://<service>.<namespace>.svc.cluster.local:<port>

# Check DNS resolution (also from inside the debug pod)
nslookup <service>.<namespace>.svc.cluster.local

Step 5: Network Policies

# List policies affecting the namespace
kubectl get networkpolicies -n <namespace>

# Describe to see ingress/egress rules
kubectl describe networkpolicy <policy> -n <namespace>

Network policies are additive and deny-by-default once applied: as soon as any policy selects a pod, all traffic of the type that policy covers (ingress or egress) is dropped unless some policy explicitly allows it.
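A minimal allow rule, assuming hypothetical labels: this permits ingress to pods labeled `app: web` only from pods labeled `app: frontend`, on port 8080:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend
spec:
  podSelector:
    matchLabels:
      app: web          # pods this policy applies to
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```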

Container Keeps Crashing

The app starts but dies:

Step 1: Check Logs

# Current logs
kubectl logs <pod> -c <container>

# Previous crash logs
kubectl logs <pod> -c <container> --previous

# Follow logs in real-time
kubectl logs <pod> -c <container> -f

Step 2: Check Exit Codes

kubectl describe pod <pod> | grep -A 5 "Last State"

Common exit codes:

  • 0: Clean exit (but why did it exit?)
  • 1: Application error
  • 137: SIGKILL (128 + 9), usually OOMKilled
  • 143: SIGTERM (128 + 15), graceful shutdown requested
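The convention behind the large numbers: any exit code above 128 is 128 plus the number of the signal that killed the process. A small sketch you can run anywhere, no cluster needed:

```shell
# Decode a container exit code: values above 128 mean 128 + signal number.
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  elif [ "$code" -eq 0 ]; then
    echo "clean exit"
  else
    echo "application error ($code)"
  fi
}

decode_exit 137   # killed by signal 9  (SIGKILL, typically the OOM killer)
decode_exit 143   # killed by signal 15 (SIGTERM)
```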

Step 3: OOMKilled Investigation

# Check memory limits vs actual usage
kubectl describe pod <pod> | grep -A 3 "Limits"
kubectl top pod <pod>

# Check node memory pressure
kubectl describe node <node> | grep -A 5 "Conditions"

Fix: Increase memory limits or fix the memory leak.
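Raising the limit is a small spec change; the numbers below are illustrative and should come from observed `kubectl top` usage plus headroom:

```yaml
resources:
  requests:
    memory: 256Mi   # what the scheduler reserves on the node
  limits:
    memory: 512Mi   # exceeding this gets the container OOMKilled
```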

Step 4: Exec Into Container

# If container is running (even briefly)
kubectl exec -it <pod> -c <container> -- sh

# Check process, ports, filesystem
ps aux
netstat -tlnp
df -h

Deployment Won’t Roll Out

Step 1: Check Rollout Status

kubectl rollout status deployment/<name> -n <namespace>
kubectl rollout history deployment/<name> -n <namespace>

Step 2: Check ReplicaSets

kubectl get rs -n <namespace>
kubectl describe rs <replicaset> -n <namespace>

A stuck rollout usually means the new pods can’t become ready.

Step 3: Common Causes

Readiness probe failing:

kubectl describe pod <pod> | grep -A 10 "Readiness"
kubectl logs <pod>  # Check if app is actually ready
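A common failure mode is a probe pointed at the wrong path or port, or an initial delay shorter than the app's startup time; path and port below are hypothetical:

```yaml
readinessProbe:
  httpGet:
    path: /healthz        # must be an endpoint the app actually serves
    port: 8080            # must match the container's listening port
  initialDelaySeconds: 10 # give the app time to start before probing
  periodSeconds: 5
  failureThreshold: 3
```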

Resource quota exceeded:

kubectl describe resourcequota -n <namespace>

PodDisruptionBudget blocking:

kubectl get pdb -n <namespace>
kubectl describe pdb <pdb> -n <namespace>

Step 4: Rollback

# Rollback to previous version
kubectl rollout undo deployment/<name> -n <namespace>

# Rollback to specific revision
kubectl rollout undo deployment/<name> --to-revision=2 -n <namespace>

DNS Issues

Step 1: Test Resolution

kubectl run debug --rm -it --image=busybox -- sh
# then, from the shell inside the debug pod:
nslookup kubernetes.default
nslookup <service>.<namespace>.svc.cluster.local

Step 2: Check CoreDNS

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns

Step 3: Verify DNS Config

kubectl exec <pod> -- cat /etc/resolv.conf

The nameserver should be the cluster DNS Service’s ClusterIP.
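The file typically has this shape; the nameserver IP varies by cluster (`10.96.0.10` is a common kubeadm default, used here only as an example):

```
nameserver 10.96.0.10
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```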

Persistent Volume Issues

Step 1: Check PVC Status

kubectl get pvc -n <namespace>
kubectl describe pvc <pvc> -n <namespace>

A PVC stuck in Pending means no existing PV matches its request (and, if dynamic provisioning is expected, that the provisioner hasn’t created one).

Step 2: Check PV Availability

kubectl get pv
kubectl describe pv <pv>

Look at:

  • Status: Available, Bound, Released
  • StorageClass: Must match PVC
  • Capacity: Must meet PVC request
  • AccessModes: Must include what PVC requests
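As a sketch of how those fields must line up, here is a hypothetical PVC alongside a PV it can bind to (names, sizes, and the hostPath backing are illustrative):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  storageClassName: standard      # must match the PV's storageClassName
  accessModes: ["ReadWriteOnce"]  # must be offered by the PV
  resources:
    requests:
      storage: 5Gi                # PV capacity must be >= this
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  storageClassName: standard
  accessModes: ["ReadWriteOnce"]
  capacity:
    storage: 10Gi
  hostPath:                       # illustrative backing store only
    path: /mnt/data
```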

Step 3: StorageClass Issues

kubectl get storageclass
kubectl describe storageclass <class>

If using dynamic provisioning, check the provisioner is working.

Quick Reference Commands

# Get all resources in a namespace
kubectl get all -n <namespace>

# Watch resources change
kubectl get pods -n <namespace> -w

# Get events sorted by time
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Check resource usage
kubectl top pods -n <namespace>
kubectl top nodes

# Debug networking
kubectl run debug --rm -it --image=nicolaka/netshoot -- bash

# Copy files to/from pod
kubectl cp <pod>:/path/to/file ./local-file
kubectl cp ./local-file <pod>:/path/to/file

# Port forward for local testing
kubectl port-forward svc/<service> 8080:80 -n <namespace>

The Mindset

Kubernetes troubleshooting is about following the data flow:

  1. Request comes in → Ingress → Service → Endpoints → Pod
  2. Pod starts → Image pull → Container create → Probes pass → Ready
  3. App runs → Configs mounted → Secrets available → Connections work

When something breaks, identify which stage failed and inspect that component. The error messages are usually accurate—the skill is knowing where to find them.