Logging Levels: A Practical Guide to What Goes Where

Logging seems simple until you’re debugging production at 2 AM, scrolling through millions of lines trying to find the one that matters. Good logging practices make that experience less painful. Here’s how to think about log levels.

The Levels

Most logging frameworks use these standard levels:

DEBUG < INFO < WARN < ERROR < FATAL

In production, you typically run at INFO or WARN. Setting a level captures everything at that level and above (running at INFO captures INFO, WARN, ERROR, and FATAL). ...
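The threshold behavior is easy to see with Python's logging module (which names the top two levels WARNING and CRITICAL rather than WARN and FATAL); a minimal sketch, with an illustrative logger name:

```python
import logging

# "payments" is an illustrative logger name.
logger = logging.getLogger("payments")
logger.setLevel(logging.INFO)  # typical production threshold

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
logger.addHandler(handler)

logger.debug("cache miss for user 42")  # suppressed: below the INFO threshold
logger.info("payment accepted")         # emitted
logger.warning("retrying charge")       # emitted
logger.error("charge failed")           # emitted

# The numeric ordering that makes the threshold work:
assert logging.DEBUG < logging.INFO < logging.WARNING < logging.ERROR < logging.CRITICAL
```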

March 1, 2026 · 4 min · 836 words · Rob Washington

API Error Handling That Helps Instead of Frustrates

Bad error handling wastes everyone’s time. A cryptic “Error 500” sends developers on a debugging odyssey. A well-designed error response tells them exactly what went wrong and how to fix it. Here’s how to build the latter.

The Anatomy of a Good Error

Every error response should answer three questions:

1. What happened? (error code/type)
2. Why? (human-readable message)
3. How do I fix it? (actionable guidance)

```json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Request validation failed",
    "details": [
      {
        "field": "email",
        "message": "Invalid email format",
        "received": "not-an-email"
      },
      {
        "field": "age",
        "message": "Must be a positive integer",
        "received": "-5"
      }
    ],
    "documentation_url": "https://api.example.com/docs/errors#VALIDATION_ERROR"
  },
  "request_id": "req_abc123"
}
```

Always include: ...
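A response in that shape is easy to generate consistently from a small helper; a sketch in Python, with illustrative names (`validation_error`, the docs base URL) not taken from any particular framework:

```python
import json

def validation_error(details, request_id,
                     doc_base="https://api.example.com/docs/errors"):
    """Assemble an error body that answers all three questions."""
    code = "VALIDATION_ERROR"
    return {
        "error": {
            "code": code,                            # what happened
            "message": "Request validation failed",  # why
            "details": details,                      # how to fix it
            "documentation_url": f"{doc_base}#{code}",
        },
        "request_id": request_id,
    }

body = validation_error(
    [{"field": "email", "message": "Invalid email format", "received": "not-an-email"}],
    request_id="req_abc123",
)
print(json.dumps(body, indent=2))
```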

March 1, 2026 · 6 min · 1214 words · Rob Washington

Kubernetes Troubleshooting: A Practical Field Guide

When a Kubernetes deployment goes sideways at 3 AM, you need a systematic approach. Here’s the troubleshooting playbook I’ve developed from watching countless production incidents.

The First Three Commands

Before diving deep, these three commands tell you 80% of what you need:

```bash
# What's not running?
kubectl get pods -A | grep -v Running | grep -v Completed

# What happened recently?
kubectl get events -A --sort-by='.lastTimestamp' | tail -20

# Resource pressure?
kubectl top nodes
```

Run these first. Always. ...
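The first pipeline is simple enough to mirror in Python if you want it inside a monitoring script; a sketch that assumes the default `kubectl get pods -A` column layout (STATUS is the fourth column), with fabricated sample output:

```python
def not_running(kubectl_lines):
    """Mimic: kubectl get pods -A | grep -v Running | grep -v Completed.

    Takes the text lines from `kubectl get pods -A` (header included) and
    returns rows whose STATUS is neither Running nor Completed."""
    bad = []
    for line in kubectl_lines[1:]:  # skip the NAMESPACE/NAME/... header
        cols = line.split()
        if len(cols) >= 4 and cols[3] not in ("Running", "Completed"):
            bad.append(line)
    return bad

sample = [
    "NAMESPACE  NAME     READY  STATUS            RESTARTS  AGE",
    "web        api-1    1/1    Running           0         2d",
    "web        api-2    0/1    CrashLoopBackOff  12        2d",
    "jobs       backup   0/1    Completed         0         1h",
]
for row in not_running(sample):
    print(row)
```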

February 28, 2026 · 5 min · 995 words · Rob Washington

Debugging Production Issues Without Breaking Things

Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how.

Rule Zero: Don’t Make It Worse

Before touching anything:

- Don’t restart services until you understand the problem
- Don’t deploy fixes without knowing the root cause
- Don’t clear logs you might need for investigation
- Don’t scale down what might be handling load

Stabilize first, investigate second, fix third.

Start With Observability

Check Dashboards

Before SSH-ing anywhere: ...

February 28, 2026 · 6 min · 1168 words · Rob Washington

YAML Gotchas: The Traps That Bite Every Developer

YAML looks simple until it isn’t. These gotchas have broken production configs and wasted countless debugging hours. Learn them once, avoid them forever.

The Norway Problem

```yaml
# These are ALL booleans in YAML 1.1
country: NO    # false
answer: yes    # true
enabled: on    # true
disabled: off  # false
```

Fix: Always quote strings that could be interpreted as booleans.

```yaml
country: "NO"
answer: "yes"
```

YAML 1.2 fixed this, but many parsers (including PyYAML by default) still use 1.1 rules. ...
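The rule behind the Norway problem is small enough to sketch by hand; this follows the YAML 1.1 bool specification (including the single-letter y/n forms, which some parsers skip), so it runs without any YAML library:

```python
# Sketch of YAML 1.1 boolean resolution, hand-rolled for illustration.
YAML11_TRUE = {"y", "Y", "yes", "Yes", "YES", "true", "True", "TRUE", "on", "On", "ON"}
YAML11_FALSE = {"n", "N", "no", "No", "NO", "false", "False", "FALSE", "off", "Off", "OFF"}

def resolve_scalar(raw, quoted=False):
    """Return what a YAML 1.1 parser would hand your application."""
    if quoted:  # quoting forces a string
        return raw
    if raw in YAML11_TRUE:
        return True
    if raw in YAML11_FALSE:
        return False
    return raw

print(resolve_scalar("NO"))                # the Norway problem: False
print(resolve_scalar("NO", quoted=True))   # quoting keeps the string
```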

February 25, 2026 · 5 min · 1032 words · Rob Washington

Kubernetes Debugging: From Pod Failures to Cluster Issues

Kubernetes abstracts away infrastructure until something breaks. Then you need to peel back the layers. These debugging patterns will help you find problems fast.

First Steps: Get the Lay of the Land

```bash
# Cluster health
kubectl cluster-info
kubectl get nodes
kubectl top nodes

# Namespace overview
kubectl get all -n myapp

# Events (recent issues surface here)
kubectl get events -n myapp --sort-by='.lastTimestamp'
```

Pod Debugging

Check Pod Status

```bash
# List pods with status
kubectl get pods -n myapp

# Detailed pod info
kubectl describe pod mypod -n myapp

# Common status meanings:
# Pending          - Waiting for scheduling or image pull
# Running          - At least one container running
# Succeeded        - All containers completed successfully
# Failed           - All containers terminated, at least one failed
# CrashLoopBackOff - Container crashing repeatedly
# ImagePullBackOff - Can't pull container image
```

View Logs

```bash
# Current logs
kubectl logs mypod -n myapp

# Previous container (after crash)
kubectl logs mypod -n myapp --previous

# Follow logs
kubectl logs -f mypod -n myapp

# Specific container (multi-container pod)
kubectl logs mypod -n myapp -c mycontainer

# Last N lines
kubectl logs mypod -n myapp --tail=100

# Logs from the last hour
kubectl logs mypod -n myapp --since=1h
```

Execute Commands in Pod

```bash
# Shell into running container
kubectl exec -it mypod -n myapp -- /bin/sh

# Run specific command
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Specific container
kubectl exec -it mypod -n myapp -c mycontainer -- /bin/sh
```

Debug Crashed Containers

```bash
# Check why it crashed
kubectl describe pod mypod -n myapp | grep -A 10 "Last State"

# View previous logs
kubectl logs mypod -n myapp --previous

# Run debug container (K8s 1.25+)
kubectl debug mypod -n myapp -it --image=busybox --target=mycontainer
```

Common Pod Issues

ImagePullBackOff

```bash
# Check events for details
kubectl describe pod mypod -n myapp | grep -A 5 Events

# Common causes:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Registry rate limiting (Docker Hub)

# Verify image exists
docker pull myimage:tag

# Check imagePullSecrets
kubectl get pod mypod -n myapp -o jsonpath='{.spec.imagePullSecrets}'
```

CrashLoopBackOff

```bash
# Get exit code
kubectl describe pod mypod -n myapp | grep "Exit Code"

# Exit codes:
# 0   - Success (shouldn't be crashing)
# 1   - Application error
# 137 - OOMKilled (out of memory)
# 139 - Segmentation fault
# 143 - SIGTERM (graceful shutdown)

# Check resource limits
kubectl describe pod mypod -n myapp | grep -A 5 Limits
```

Pending Pods

```bash
# Check why not scheduled
kubectl describe pod mypod -n myapp | grep -A 10 Events

# Common causes:
# - Insufficient resources
# - Node selector/affinity not matched
# - Taints without tolerations
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
```

Resource Issues

Memory Problems

```bash
# Check pod resource usage
kubectl top pod mypod -n myapp

# Check for OOMKilled
kubectl describe pod mypod -n myapp | grep OOMKilled

# View memory limits
kubectl get pod mypod -n myapp -o jsonpath='{.spec.containers[*].resources}'
```

CPU Throttling

```bash
# Check CPU usage vs limits
kubectl top pod mypod -n myapp

# In container, check throttling
kubectl exec mypod -n myapp -- cat /sys/fs/cgroup/cpu/cpu.stat
```

Networking Debugging

Service Connectivity

```bash
# Check service exists
kubectl get svc -n myapp

# Check endpoints (are pods backing the service?)
kubectl get endpoints myservice -n myapp

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Then: wget -qO- http://myservice.myapp.svc.cluster.local

# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup myservice.myapp.svc.cluster.local
```

Pod-to-Pod Communication

```bash
# Get pod IPs
kubectl get pods -n myapp -o wide

# Test connectivity from one pod to another
kubectl exec mypod1 -n myapp -- wget -qO- http://10.0.0.5:8080

# Check network policies
kubectl get networkpolicies -n myapp
```

Ingress Issues

```bash
# Check ingress configuration
kubectl describe ingress myingress -n myapp

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc myservice -n myapp
```

ConfigMaps and Secrets

```bash
# Verify ConfigMap exists and has expected data
kubectl get configmap myconfig -n myapp -o yaml

# Check if mounted correctly
kubectl exec mypod -n myapp -- ls -la /etc/config/
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Verify Secret
kubectl get secret mysecret -n myapp -o jsonpath='{.data.password}' | base64 -d

# Check environment variables
kubectl exec mypod -n myapp -- env | grep MY_VAR
```

Persistent Volumes

```bash
# Check PVC status
kubectl get pvc -n myapp

# Describe for binding issues
kubectl describe pvc mypvc -n myapp

# Check PV
kubectl get pv

# Verify mount in pod
kubectl exec mypod -n myapp -- df -h
kubectl exec mypod -n myapp -- ls -la /data
```

Node Issues

```bash
# Node status
kubectl get nodes
kubectl describe node mynode

# Check conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,CONDITIONS:.status.conditions[*].type

# Node resource pressure
kubectl describe node mynode | grep -A 5 Conditions

# Pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=mynode

# Drain node for maintenance
kubectl drain mynode --ignore-daemonsets --delete-emptydir-data
```

Control Plane Debugging

```bash
# API server health
kubectl get --raw='/healthz'

# Component status (deprecated but useful)
kubectl get componentstatuses

# Check system pods
kubectl get pods -n kube-system

# API server logs (if accessible)
kubectl logs -n kube-system kube-apiserver-master

# etcd health
kubectl exec -n kube-system etcd-master -- etcdctl endpoint health
```

Useful Debug Containers

```bash
# Network debugging
kubectl run netdebug --rm -it --image=nicolaka/netshoot -- /bin/bash

# DNS debugging
kubectl run dnsdebug --rm -it --image=tutum/dnsutils -- /bin/bash

# General debugging
kubectl run debug --rm -it --image=busybox -- /bin/sh
```

Systematic Debugging Checklist

1. Events first: kubectl get events --sort-by='.lastTimestamp'
2. Describe the resource: kubectl describe <resource> <name>
3. Check logs: kubectl logs <pod> (and --previous)
4. Verify dependencies: ConfigMaps, Secrets, Services, PVCs
5. Check resources: CPU, memory limits and usage
6. Test connectivity: DNS, service endpoints, network policies
7. Compare with working: Diff against known good configuration

Quick Reference

```bash
# Pod not starting
kubectl describe pod <name>
kubectl get events

# Pod crashing
kubectl logs <pod> --previous
kubectl describe pod <name> | grep "Exit Code"

# Can't connect to service
kubectl get endpoints <service>
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>

# Resource issues
kubectl top pods
kubectl describe node | grep -A 5 "Allocated"

# Config issues
kubectl exec <pod> -- env
kubectl exec <pod> -- cat /path/to/config
```

Kubernetes debugging is methodical. Start with events, drill into describe output, check logs, and verify each dependency. Most issues are configuration mismatches—wrong image tags, missing secrets, insufficient resources. ...
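The exit-code table above is worth encoding once and reusing in tooling; a sketch of a hypothetical helper (the mapping mirrors the CrashLoopBackOff list, plus the general convention that codes of 128+N mean death by signal N):

```python
import signal

# Mapping from the CrashLoopBackOff section above.
KNOWN = {
    0: "success (shouldn't be crashing)",
    1: "application error",
    137: "OOMKilled / SIGKILL (out of memory)",
    139: "segmentation fault (SIGSEGV)",
    143: "SIGTERM (graceful shutdown)",
}

def explain_exit_code(code):
    """Best-effort explanation of a container exit code."""
    if code in KNOWN:
        return KNOWN[code]
    if code >= 128:  # killed by signal (code - 128)
        try:
            return f"killed by signal {signal.Signals(code - 128).name}"
        except ValueError:
            return f"killed by signal {code - 128}"
    return "application-defined exit code"

print(explain_exit_code(137))
print(explain_exit_code(129))
```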

February 25, 2026 · 7 min · 1300 words · Rob Washington

strace: Tracing System Calls for Debugging

strace intercepts and records system calls made by a process. When a program hangs, crashes, or behaves mysteriously, strace reveals what it’s actually doing at the kernel level.

Basic Usage

```bash
# Trace a command
strace ls

# Trace running process
strace -p 1234

# Trace with timestamps
strace -t ls

# Trace with relative timestamps
strace -r ls
```

Filtering System Calls

```bash
# Only file operations
strace -e trace=file ls

# Only network operations
strace -e trace=network curl example.com

# Only process operations
strace -e trace=process bash -c 'sleep 1'

# Specific syscalls
strace -e open,read,write cat file.txt

# Exclude syscalls (quoted so the shell doesn't expand "!")
strace -e 'trace=!mmap' ls
```

Trace Categories

```
file    # open, stat, chmod, etc.
process # fork, exec, exit, etc.
network # socket, connect, send, etc.
signal  # signal, kill, etc.
ipc     # shmget, semop, etc.
desc    # read, write, close, etc.
memory  # mmap, brk, etc.
```

Output Options

```bash
# Write to file
strace -o output.txt ls

# Append to file
strace -o output.txt -A ls

# With timestamps (wall clock)
strace -t ls

# With microseconds
strace -tt ls

# Relative timestamps
strace -r ls
```

Following Children

```bash
# Follow forked processes
strace -f bash -c 'ls; echo done'

# Follow forks and write separate files
strace -ff -o trace ls
# Creates trace.1234, trace.1235, etc.
```
String Output

```bash
# Show full strings (default truncates at 32 chars)
strace -s 1000 cat file.txt

# Show full strings for specific calls
strace -e read -s 10000 cat file.txt
```

Statistics

```bash
# Summary of syscalls
strace -c ls

# Sample output:
# % time     seconds  usecs/call     calls    errors syscall
# ------ ----------- ----------- --------- --------- ----------------
#  45.00    0.000045          45         1           execve
#  25.00    0.000025           3         8           mmap
#  15.00    0.000015           2         6           openat
```

```bash
# Sorting the summary
strace -c -S time ls   # Sort by time
strace -c -S calls ls  # Sort by call count
```

Practical Examples

Debug “File Not Found”

```bash
# See what files the program is trying to open
strace -e openat ./myprogram 2>&1 | grep -i "no such file"
```

Find Config File Locations

```bash
# See all files a program tries to read
strace -e openat nginx -t 2>&1 | grep -E "openat.*O_RDONLY"
```

Debug Connection Issues

```bash
# Watch network connections
strace -e connect curl https://example.com

# See DNS lookups
strace -e socket,connect,sendto,recvfrom dig example.com
```

Debug Hanging Process

```bash
# Attach to hung process
strace -p $(pgrep hung-process)

# Common findings:
# - Waiting on read()      = blocked on input
# - Waiting on futex()     = waiting for lock
# - Waiting on poll/select = waiting for I/O
```

Find Why a Program Is Slow

```bash
# Time each syscall
strace -T ls

# Shows time spent in each call:
# openat(AT_FDCWD, ".", ...) = 3 <0.000015>

# Summary to find slow operations
strace -c -S time slow-program
```

Debug Permission Issues

```bash
# See access denials
strace -e openat,access ./program 2>&1 | grep -i denied

# Sample output:
# openat(AT_FDCWD, "/etc/secret", O_RDONLY) = -1 EACCES (Permission denied)
```

Watch File I/O

```bash
# See all reads and writes
strace -e read,write -s 100 cat file.txt

# Count I/O operations
strace -c -e read,write dd if=/dev/zero of=/dev/null bs=1M count=100
```

Debug Signal Handling

```bash
# Trace signals
strace -e signal,kill ./program

# See what signal killed a process
strace -e trace=signal -p 1234
```

Find Library Loading Issues

```bash
# See shared library loading
strace -e openat ./program 2>&1 | grep "\.so"

# Common issues:
# - Library not found
# - Wrong library version loaded
```

Advanced Usage

Inject Faults

```bash
# Make openat() fail with ENOENT
strace -e fault=openat:error=ENOENT ls

# Fail every 3rd call
strace -e fault=read:error=EIO:when=3 cat file.txt
```

Decode Arguments

```bash
# Decode socket addresses
strace -yy curl example.com

# Decode file descriptors
strace -y cat file.txt
```

Trace Specific Syscall Return

```bash
# Only show failed syscalls
strace -Z ls /nonexistent

# Filter by syscall status
strace -e status=failed ls /nonexistent
```

Reading strace Output

```
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
  │       │            │            │      └ Return value (fd 3)
  │       │            │            └ Flags/arguments
  │       │            └ Pathname
  │       └ Directory fd (AT_FDCWD = current directory)
  └ System call name
```

Common Return Values

```
 = 0 or positive  Success (fd, byte count, etc.)
 = -1 ENOENT      No such file or directory
 = -1 EACCES      Permission denied
 = -1 EAGAIN      Resource temporarily unavailable
 = -1 EEXIST      File exists
 = -1 EINTR       Interrupted system call
```

Alternatives

```bash
# ltrace - trace library calls (not syscalls)
ltrace ./program

# perf trace - lower overhead
perf trace ls

# bpftrace - more powerful, requires setup
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", str(args->filename)); }'
```

Performance Note

strace adds significant overhead — programs run much slower when traced. For production debugging: ...
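When you capture `strace -c` output to a file, the summary table is easy to post-process; a sketch that assumes the column layout shown above (the sample text here is fabricated for illustration):

```python
def parse_strace_summary(text):
    """Parse the table printed by `strace -c` into dicts.

    Assumes the columns shown above: % time, seconds, usecs/call,
    calls, errors (may be blank), syscall."""
    rows = []
    for line in text.splitlines():
        cols = line.split()
        # Data rows start with a percentage; headers/dividers don't.
        if len(cols) in (5, 6) and cols[0].replace(".", "").isdigit():
            rows.append({
                "pct_time": float(cols[0]),
                "seconds": float(cols[1]),
                "calls": int(cols[3]),
                "errors": int(cols[4]) if len(cols) == 6 else 0,
                "syscall": cols[-1],
            })
    return rows

sample = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 45.00    0.000045          45         1           execve
 15.00    0.000015           2         6         1 openat
"""
rows = parse_strace_summary(sample)
print(rows[1]["syscall"], rows[1]["errors"])
```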

February 25, 2026 · 7 min · 1297 words · Rob Washington

lsof: Finding What's Using Files and Ports

lsof (list open files) shows which processes have which files open. Since Unix treats everything as a file — including network connections, devices, and pipes — lsof is incredibly powerful for system debugging.

Basic Usage

```bash
# List all open files (overwhelming)
lsof

# List open files for specific file
lsof /var/log/syslog

# List open files for user
lsof -u username

# List open files for process
lsof -p 1234
```

Find What’s Using a File

```bash
# Who has this file open?
lsof /path/to/file

# Who has files open in this directory (top level)?
lsof +d /var/log

# Recursive directory search
lsof +D /var/log
```

“Device or resource busy” Debugging

```bash
# Can't unmount? Find what's using it
lsof +D /mnt/usb

# Find open files on filesystem
lsof /dev/sda1
```

Find What’s Using a Port

```bash
# What's on port 8080?
lsof -i :8080

# What's on port 80 or 443?
lsof -i :80 -i :443

# All network connections
lsof -i

# TCP only
lsof -i TCP

# UDP only
lsof -i UDP
```

Specific Protocol and Port

```bash
# TCP port 22
lsof -i TCP:22

# UDP port 53
lsof -i UDP:53

# Port range
lsof -i :1-1024
```

Filter by Process

```bash
# By PID
lsof -p 1234

# By process name
lsof -c nginx

# By multiple process names
lsof -c nginx -c apache

# Exclude process
lsof -c ^nginx
```

Filter by User

```bash
# By username
lsof -u www-data

# By UID
lsof -u 1000

# Multiple users
lsof -u user1,user2

# Exclude user
lsof -u ^root
```

Network Connections

List All Connections

```bash
# All internet connections
lsof -i

# Established connections only
lsof -i | grep ESTABLISHED

# Listening sockets
lsof -i | grep LISTEN
```

Connections to Specific Host

```bash
# Connections to specific IP
lsof -i @192.168.1.100

# Connections to hostname
lsof -i @example.com

# Connections to specific host and port
lsof -i @192.168.1.100:22
```

IPv4 vs IPv6

```bash
# IPv4 only
lsof -i 4

# IPv6 only
lsof -i 6
```

Output Formatting

```bash
# Terse output (parseable)
lsof -t /path/to/file  # Returns just PIDs

# No header
lsof +D /var/log | tail -n +2

# Specific fields
lsof -F p /path/to/file  # PID only
lsof -F c /path/to/file  # Command only
```

Combining Filters

By default, filters are OR’d. Use -a to AND them: ...
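The `-F` field output exists precisely so scripts don't have to scrape the human-readable table; a sketch of a parser for `lsof -F pcn` output (process sections start with a `p` line; the sample string here is made up):

```python
def parse_lsof_fields(output):
    """Group `lsof -F pcn` output into per-process records.

    Each line carries one field, prefixed with its tag:
    p = PID (starts a new process section), c = command, n = file name."""
    procs, current = [], None
    for line in output.splitlines():
        if not line:
            continue
        tag, value = line[0], line[1:]
        if tag == "p":
            current = {"pid": int(value), "command": None, "files": []}
            procs.append(current)
        elif tag == "c" and current:
            current["command"] = value
        elif tag == "n" and current:
            current["files"].append(value)
    return procs

sample = ("p1234\ncnginx\n"
          "n/var/log/nginx/access.log\nn/var/log/nginx/error.log\n"
          "p5678\ncpython\nn/tmp/app.sock\n")
for proc in parse_lsof_fields(sample):
    print(proc["pid"], proc["command"], len(proc["files"]))
```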

February 25, 2026 · 6 min · 1241 words · Rob Washington

netcat (nc): The Swiss Army Knife of Networking

netcat (nc) does one thing: move bytes over a network. That simplicity makes it incredibly versatile — port scanning, file transfers, chat servers, proxies, and network debugging all become one-liners.

Basic Usage

```bash
# Connect to host:port
nc hostname 80

# Listen on port
nc -l 8080

# Listen (keep listening after disconnect)
nc -lk 8080
```

Test if Port is Open

```bash
# Quick port check
nc -zv hostname 22
# Connection to hostname 22 port [tcp/ssh] succeeded!

# Scan port range
nc -zv hostname 20-25

# With timeout
nc -zv -w 3 hostname 443
```

Simple Client-Server Communication

Server (listener):

```bash
nc -l 9000
```

Client:

```bash
nc localhost 9000
```

Now type in either terminal — text flows both ways. Ctrl+C to exit. ...
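The `nc -zv` check maps directly onto a plain TCP connect; a rough Python equivalent, sketched for scripts where shelling out to nc is awkward:

```python
import socket

def is_port_open(host, port, timeout=3.0):
    """Rough equivalent of `nc -zv -w 3 host port`: attempt a TCP
    connection, report the result, send no data."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False

# Scan a small range, like `nc -zv localhost 20-25`
for port in range(20, 26):
    state = "open" if is_port_open("localhost", port, timeout=0.5) else "closed"
    print(f"port {port}: {state}")
```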

February 25, 2026 · 5 min · 905 words · Rob Washington

YAML Gotchas: The Traps Everyone Falls Into

YAML is everywhere — Kubernetes, Docker Compose, Ansible, GitHub Actions, CI/CD pipelines. It looks friendly until you spend an hour debugging why on became true or your port number turned into octal. Here are the traps and how to avoid them.

The Norway Problem

```yaml
# What you wrote
country: NO

# What YAML parsed
country: false
```

YAML 1.1 interprets NO, no, No, OFF, off, Off as boolean false, and YES, yes, Yes, ON, on, On as true. ...
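The "port number turned into octal" trap follows the YAML 1.1 integer rules; a hand-rolled sketch of the common cases (hex and leading-zero octal — real parsers handle more forms, such as sexagesimal), runnable without any YAML library:

```python
def yaml11_int(raw):
    """What a YAML 1.1 parser does to an unquoted integer-looking
    scalar: a leading 0 means octal, 0x means hex (sketch of the
    common cases only)."""
    if raw.startswith(("0x", "0X")):
        return int(raw, 16)
    if raw.startswith("0") and len(raw) > 1 and raw.isdigit():
        return int(raw, 8)  # the file-mode / port-number trap
    return int(raw)

print(yaml11_int("0755"))  # 493, not 755
print(yaml11_int("755"))   # 755 as expected
```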

February 24, 2026 · 5 min · 965 words · Rob Washington