DNS Troubleshooting: When dig Is Your Best Friend

DNS issues are deceptively simple. “It’s always DNS” is a meme because it’s true. Here’s how to actually debug it.

The Essential Commands

dig: Your Primary Tool

```
# Basic lookup
dig example.com

# Short answer only
dig +short example.com

# Specific record type
dig example.com MX
dig example.com TXT
dig example.com CNAME

# Query specific nameserver
dig @8.8.8.8 example.com

# Trace the full resolution path
dig +trace example.com
```

Understanding dig Output

```
;; QUESTION SECTION:
;example.com.			IN	A

;; ANSWER SECTION:
example.com.		IN	A	93.184.216.34

;; SERVER: 8.8.8.8#53
```

Key fields: ...
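One practical use of `+short` is checking whether resolvers agree during a DNS change. A minimal sketch: the `check_propagation` helper and the stubbed `lookup` function are mine, not from the post; in real use `lookup` would call `dig +short @resolver name A`.

```shell
#!/bin/sh
# Compare the A records several resolvers return for one name.
# Real version: lookup() { dig +short @"$1" "$2" A | sort; }
# Stubbed with fixed (invented) data so the sketch is self-contained.
lookup() {
  case "$1" in
    8.8.8.8) printf '93.184.216.34\n' ;;  # new record
    1.1.1.1) printf '93.184.216.34\n' ;;  # new record
    9.9.9.9) printf '203.0.113.7\n' ;;    # stale cache
  esac
}

check_propagation() {
  name=$1; shift
  first=$(lookup "$1" "$name")
  for r in "$@"; do
    if [ "$(lookup "$r" "$name")" != "$first" ]; then
      echo "MISMATCH at $r"
      return 1
    fi
  done
  echo "consistent"
}

check_propagation example.com 8.8.8.8 1.1.1.1   # prints: consistent
```

A mismatch usually means a resolver is still serving the old record from cache; compare its TTL against when you made the change.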

February 28, 2026 · 5 min · 1057 words · Rob Washington

Linux Performance Troubleshooting: The First Five Minutes

When a server is slow and people are yelling, you need a systematic approach. Here’s what to run in the first five minutes.

The Checklist

```
uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 3
pidstat 1 3
iostat -xz 1 3
free -h
sar -n DEV 1 3
```

Let’s break down what each tells you.

1. uptime

```
$ uptime
 16:30:01 up 45 days, 3:22, 2 users, load average: 8.42, 6.31, 5.12
```

Load averages: 1-minute, 5-minute, 15-minute. ...
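Load averages only mean something relative to CPU count: load per core above 1 means tasks are queueing. A sketch of that check; the `load_verdict` helper and the sample numbers are illustrative, not from the post.

```shell
#!/bin/sh
# Flag a box whose 1-minute load exceeds its CPU count.
# awk does the comparison because shell arithmetic is integer-only.
load_verdict() {  # usage: load_verdict <load1> <ncpus>
  awk -v l="$1" -v c="$2" 'BEGIN { print (l > c) ? "overloaded" : "ok" }'
}

# Live usage would be:
#   load_verdict "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
load_verdict 8.42 4   # prints: overloaded
load_verdict 3.10 8   # prints: ok
```

Note that on Linux the load average also counts tasks in uninterruptible sleep, so a high load with idle CPUs often points at disk or NFS, not compute.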

February 28, 2026 · 5 min · 1007 words · Rob Washington

Kubernetes Troubleshooting Patterns for Production

Kubernetes hides complexity until something breaks. Then you need to know where to look. Here’s a systematic approach to debugging production issues.

The Debugging Hierarchy

Start broad, narrow down:

Cluster level: Nodes healthy? Resources available?
Namespace level: Deployments running? Services configured?
Pod level: Containers starting? Logs clean?
Container level: Process running? Resources sufficient?

Quick Health Check

```
# Node status
kubectl get nodes -o wide

# All pods across namespaces
kubectl get pods -A

# Pods not running
kubectl get pods -A | grep -v Running | grep -v Completed

# Events (recent issues)
kubectl get events -A --sort-by='.lastTimestamp' | tail -20
```

Pod Troubleshooting

Pod States

State              Meaning                            Check
Pending            Can’t be scheduled                 Resources, node selectors, taints
ContainerCreating  Image pulling or volume mounting   Image name, pull secrets, PVCs
CrashLoopBackOff   Container crashing repeatedly      Logs, resource limits, probes
ImagePullBackOff   Can’t pull image                   Image name, registry auth
Error              Container exited with error        Logs

Pending Pods

```
# Why is it pending?
kubectl describe pod my-pod

# Look for:
# - Insufficient cpu/memory
# - No nodes match nodeSelector
# - Taints not tolerated
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A5 "Allocated resources"

# Check PVC status
kubectl get pvc
```

CrashLoopBackOff

```
# Get logs from current container
kubectl logs my-pod

# Get logs from previous (crashed) container
kubectl logs my-pod --previous

# Get logs from specific container
kubectl logs my-pod -c my-container

# Follow logs
kubectl logs -f my-pod

# Last N lines
kubectl logs --tail=100 my-pod
```

Common causes: ...
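The `grep -v Running | grep -v Completed` pipeline above is worth internalizing. A self-contained sketch of what it does; the pod names and statuses are invented sample data so it runs without a cluster.

```shell
#!/bin/sh
# Simulate `kubectl get pods -A` output so the filter can run anywhere.
pods() {
cat <<'EOF'
NAMESPACE   NAME         READY   STATUS             RESTARTS   AGE
web         frontend-1   1/1     Running            0          3d
web         frontend-2   0/1     CrashLoopBackOff   12         3d
jobs        backup-xyz   0/1     Completed          0          1h
web         api-1        0/1     ImagePullBackOff   0          10m
EOF
}

# Keep only problem pods (the header survives, which is convenient).
pods | grep -v Running | grep -v Completed
```

This prints the header plus the CrashLoopBackOff and ImagePullBackOff rows: exactly the pods that need attention.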

February 28, 2026 · 6 min · 1185 words · Rob Washington

Kubernetes Troubleshooting: A Practical Field Guide

When a Kubernetes deployment goes sideways at 3am, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents.

The First Three Commands

Before diving deep, these three commands tell you 80% of what you need:

```
# What's not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# What happened recently?
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Resource pressure?
kubectl top nodes
```

Run these first. Always. ...
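The events command leans on `--sort-by='.lastTimestamp'`; the same newest-last triage can be done client-side because ISO-8601 timestamps sort lexically. A sketch with canned event lines (timestamps, pod names, and messages are all invented):

```shell
#!/bin/sh
# Invented event listing standing in for `kubectl get events`.
events() {
cat <<'EOF'
2026-02-28T03:01:12Z Warning BackOff pod/frontend-2 restarting failed container
2026-02-28T02:58:40Z Normal Pulled pod/api-1 image already present
2026-02-28T03:02:05Z Warning FailedScheduling pod/batch-7 insufficient cpu
EOF
}

# ISO-8601 sorts chronologically as plain text, so sort | tail
# mimics --sort-by='.lastTimestamp' | tail.
events | sort | tail -2
```

The last line printed is the most recent event, which is where an incident investigation usually starts.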

February 28, 2026 · 5 min · 995 words · Rob Washington

Kubernetes Debugging: From Pod Failures to Cluster Issues

Kubernetes abstracts away infrastructure until something breaks. Then you need to peel back the layers. These debugging patterns will help you find problems fast.

First Steps: Get the Lay of the Land

```
# Cluster health
kubectl cluster-info
kubectl get nodes
kubectl top nodes

# Namespace overview
kubectl get all -n myapp

# Events (recent issues surface here)
kubectl get events -n myapp --sort-by='.lastTimestamp'
```

Pod Debugging

Check Pod Status

```
# List pods with status
kubectl get pods -n myapp

# Detailed pod info
kubectl describe pod mypod -n myapp

# Common status meanings:
# Pending - Waiting for scheduling or image pull
# Running - At least one container running
# Succeeded - All containers completed successfully
# Failed - All containers terminated, at least one failed
# CrashLoopBackOff - Container crashing repeatedly
# ImagePullBackOff - Can't pull container image
```

View Logs

```
# Current logs
kubectl logs mypod -n myapp

# Previous container (after crash)
kubectl logs mypod -n myapp --previous

# Follow logs
kubectl logs -f mypod -n myapp

# Specific container (multi-container pod)
kubectl logs mypod -n myapp -c mycontainer

# Last N lines
kubectl logs mypod -n myapp --tail=100

# Since timestamp
kubectl logs mypod -n myapp --since=1h
```

Execute Commands in Pod

```
# Shell into running container
kubectl exec -it mypod -n myapp -- /bin/sh

# Run specific command
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Specific container
kubectl exec -it mypod -n myapp -c mycontainer -- /bin/sh
```

Debug Crashed Containers

```
# Check why it crashed
kubectl describe pod mypod -n myapp | grep -A 10 "Last State"

# View previous logs
kubectl logs mypod -n myapp --previous

# Run debug container (K8s 1.25+)
kubectl debug mypod -n myapp -it --image=busybox --target=mycontainer
```

Common Pod Issues

ImagePullBackOff

```
# Check events for details
kubectl describe pod mypod -n myapp | grep -A 5 Events

# Common causes:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Registry rate limiting (Docker Hub)

# Verify image exists
docker pull myimage:tag

# Check imagePullSecrets
kubectl get pod mypod -n myapp -o jsonpath='{.spec.imagePullSecrets}'
```

CrashLoopBackOff

```
# Get exit code
kubectl describe pod mypod -n myapp | grep "Exit Code"

# Exit codes:
# 0 - Success (shouldn't be crashing)
# 1 - Application error
# 137 - OOMKilled (out of memory)
# 139 - Segmentation fault
# 143 - SIGTERM (graceful shutdown)

# Check resource limits
kubectl describe pod mypod -n myapp | grep -A 5 Limits
```

Pending Pods

```
# Check why not scheduled
kubectl describe pod mypod -n myapp | grep -A 10 Events

# Common causes:
# - Insufficient resources
# - Node selector/affinity not matched
# - Taints without tolerations
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
```

Resource Issues

Memory Problems

```
# Check pod resource usage
kubectl top pod mypod -n myapp

# Check for OOMKilled
kubectl describe pod mypod -n myapp | grep OOMKilled

# View memory limits
kubectl get pod mypod -n myapp -o jsonpath='{.spec.containers[*].resources}'
```

CPU Throttling

```
# Check CPU usage vs limits
kubectl top pod mypod -n myapp

# In container, check throttling
kubectl exec mypod -n myapp -- cat /sys/fs/cgroup/cpu/cpu.stat
```

Networking Debugging

Service Connectivity

```
# Check service exists
kubectl get svc -n myapp

# Check endpoints (are pods backing the service?)
kubectl get endpoints myservice -n myapp

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Then: wget -qO- http://myservice.myapp.svc.cluster.local

# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup myservice.myapp.svc.cluster.local
```

Pod-to-Pod Communication

```
# Get pod IPs
kubectl get pods -n myapp -o wide

# Test connectivity from one pod to another
kubectl exec mypod1 -n myapp -- wget -qO- http://10.0.0.5:8080

# Check network policies
kubectl get networkpolicies -n myapp
```

Ingress Issues

```
# Check ingress configuration
kubectl describe ingress myingress -n myapp

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc myservice -n myapp
```

ConfigMaps and Secrets

```
# Verify ConfigMap exists and has expected data
kubectl get configmap myconfig -n myapp -o yaml

# Check if mounted correctly
kubectl exec mypod -n myapp -- ls -la /etc/config/
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Verify Secret
kubectl get secret mysecret -n myapp -o jsonpath='{.data.password}' | base64 -d

# Check environment variables
kubectl exec mypod -n myapp -- env | grep MY_VAR
```

Persistent Volumes

```
# Check PVC status
kubectl get pvc -n myapp

# Describe for binding issues
kubectl describe pvc mypvc -n myapp

# Check PV
kubectl get pv

# Verify mount in pod
kubectl exec mypod -n myapp -- df -h
kubectl exec mypod -n myapp -- ls -la /data
```

Node Issues

```
# Node status
kubectl get nodes
kubectl describe node mynode

# Check conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,CONDITIONS:.status.conditions[*].type

# Node resource pressure
kubectl describe node mynode | grep -A 5 Conditions

# Pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=mynode

# Drain node for maintenance
kubectl drain mynode --ignore-daemonsets --delete-emptydir-data
```

Control Plane Debugging

```
# API server health
kubectl get --raw='/healthz'

# Component status (deprecated but useful)
kubectl get componentstatuses

# Check system pods
kubectl get pods -n kube-system

# API server logs (if accessible)
kubectl logs -n kube-system kube-apiserver-master

# etcd health
kubectl exec -n kube-system etcd-master -- etcdctl endpoint health
```

Useful Debug Containers

```
# Network debugging
kubectl run netdebug --rm -it --image=nicolaka/netshoot -- /bin/bash

# DNS debugging
kubectl run dnsdebug --rm -it --image=tutum/dnsutils -- /bin/bash

# General debugging
kubectl run debug --rm -it --image=busybox -- /bin/sh
```

Systematic Debugging Checklist

1. Events first: kubectl get events --sort-by='.lastTimestamp'
2. Describe the resource: kubectl describe <resource> <name>
3. Check logs: kubectl logs <pod> (and --previous)
4. Verify dependencies: ConfigMaps, Secrets, Services, PVCs
5. Check resources: CPU, memory limits and usage
6. Test connectivity: DNS, service endpoints, network policies
7. Compare with working: Diff against known good configuration

Quick Reference

```
# Pod not starting
kubectl describe pod <name>
kubectl get events

# Pod crashing
kubectl logs <pod> --previous
kubectl describe pod <name> | grep "Exit Code"

# Can't connect to service
kubectl get endpoints <service>
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>

# Resource issues
kubectl top pods
kubectl describe node | grep -A 5 "Allocated"

# Config issues
kubectl exec <pod> -- env
kubectl exec <pod> -- cat /path/to/config
```

Kubernetes debugging is methodical. Start with events, drill into describe output, check logs, and verify each dependency. Most issues are configuration mismatches: wrong image tags, missing secrets, insufficient resources. ...
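Those container exit codes come up constantly in `kubectl describe` output, and the mapping is easy to script. A small sketch; the `explain_exit` helper name is mine, the code meanings are from the post.

```shell
#!/bin/sh
# Map a container exit code to its likely cause.
explain_exit() {
  case "$1" in
    0)   echo "success (should not be crash-looping)" ;;
    1)   echo "application error" ;;
    137) echo "OOMKilled (out of memory)" ;;       # 128 + SIGKILL(9)
    139) echo "segmentation fault" ;;              # 128 + SIGSEGV(11)
    143) echo "SIGTERM (graceful shutdown)" ;;     # 128 + SIGTERM(15)
    *)   echo "unknown; check container logs" ;;
  esac
}

explain_exit 137   # prints: OOMKilled (out of memory)
```

The pattern behind 137/139/143 is that codes above 128 mean "killed by signal (code - 128)", which is why 137 almost always means the kernel OOM killer or an enforced memory limit.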

February 25, 2026 · 7 min · 1300 words · Rob Washington

Linux Process Management: From ps to Process Trees

Understanding processes is fundamental to Linux troubleshooting. These tools and techniques will help you find what’s running, what’s stuck, and what needs to die.

Viewing Processes

ps - Process Snapshot

```
# All processes (BSD style)
ps aux

# All processes (Unix style)
ps -ef

# Process tree
ps auxf

# Specific columns
ps -eo pid,ppid,user,%cpu,%mem,stat,cmd

# Find specific process
ps aux | grep nginx

# By exact name (no grep needed)
ps -C nginx

# By user
ps -u www-data
```

Understanding ps Output

```
USER       PID %CPU %MEM    VSZ   RSS STAT START   TIME COMMAND
root         1  0.0  0.1    ...   ... Ss     ...   0:23 /sbin/init
www-data  1234  0.2  1.2    ...   ... S      ...   0:45 nginx: worker
```

Key columns:

PID: Process ID
%CPU: CPU usage
%MEM: Memory usage
VSZ: Virtual memory size
RSS: Resident set size (actual RAM)
STAT: Process state
TIME: CPU time consumed

Process States (STAT)

```
R    Running
S    Sleeping (interruptible)
D    Uninterruptible sleep (usually I/O)
Z    Zombie
T    Stopped
N    Low priority (niced)
s    Session leader
l    Multi-threaded
```

top - Real-time View

```
# Basic
top

# Sort by memory
top -o %MEM

# Specific user
top -u www-data

# Batch mode (for scripts)
top -b -n 1
```

Inside top: ...
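Sorting by a ps column is a common follow-up to the snapshot above (with procps-ng, `ps -eo pid,%mem,cmd --sort=-%mem | head` does it directly). A self-contained sketch of the same sort over canned ps-style rows; the PIDs, percentages, and command names are invented.

```shell
#!/bin/sh
# Invented ps-style listing so the pipeline runs anywhere.
snapshot() {
cat <<'EOF'
PID %MEM CMD
901 12.4 java
123 0.5 nginx
456 3.2 postgres
EOF
}

# Drop the header, sort numerically on %MEM (field 2) descending,
# keep the biggest memory consumer.
snapshot | awk 'NR>1' | sort -k2 -rn | head -1
```

This prints the java row, the top memory consumer in the sample: the same triage you would do with `top -o %MEM`, but scriptable.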

February 25, 2026 · 9 min · 1733 words · Rob Washington