Kubernetes Debugging: From Pod Failures to Cluster Issues

Kubernetes abstracts away infrastructure until something breaks. Then you need to peel back the layers. These debugging patterns will help you find problems fast.

First Steps: Get the Lay of the Land

```bash
# Cluster health
kubectl cluster-info
kubectl get nodes
kubectl top nodes

# Namespace overview
kubectl get all -n myapp

# Events (recent issues surface here)
kubectl get events -n myapp --sort-by='.lastTimestamp'
```

Pod Debugging

Check Pod Status

```bash
# List pods with status
kubectl get pods -n myapp

# Detailed pod info
kubectl describe pod mypod -n myapp

# Common status meanings:
# Pending - Waiting for scheduling or image pull
# Running - At least one container running
# Succeeded - All containers completed successfully
# Failed - All containers terminated, at least one failed
# CrashLoopBackOff - Container crashing repeatedly
# ImagePullBackOff - Can't pull container image
```

View Logs

```bash
# Current logs
kubectl logs mypod -n myapp

# Previous container (after crash)
kubectl logs mypod -n myapp --previous

# Follow logs
kubectl logs -f mypod -n myapp

# Specific container (multi-container pod)
kubectl logs mypod -n myapp -c mycontainer

# Last N lines
kubectl logs mypod -n myapp --tail=100

# Since timestamp
kubectl logs mypod -n myapp --since=1h
```

Execute Commands in Pod

```bash
# Shell into running container
kubectl exec -it mypod -n myapp -- /bin/sh

# Run specific command
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Specific container
kubectl exec -it mypod -n myapp -c mycontainer -- /bin/sh
```

Debug Crashed Containers

```bash
# Check why it crashed
kubectl describe pod mypod -n myapp | grep -A 10 "Last State"

# View previous logs
kubectl logs mypod -n myapp --previous

# Run debug container (K8s 1.25+)
kubectl debug mypod -n myapp -it --image=busybox --target=mycontainer
```

Common Pod Issues

ImagePullBackOff

```bash
# Check events for details
kubectl describe pod mypod -n myapp | grep -A 5 Events

# Common causes:
# - Wrong image name/tag
# - Private registry without imagePullSecrets
# - Registry rate limiting (Docker Hub)

# Verify image exists
docker pull myimage:tag

# Check imagePullSecrets
kubectl get pod mypod -n myapp -o jsonpath='{.spec.imagePullSecrets}'
```

CrashLoopBackOff

```bash
# Get exit code
kubectl describe pod mypod -n myapp | grep "Exit Code"

# Exit codes:
# 0 - Success (shouldn't be crashing)
# 1 - Application error
# 137 - OOMKilled (out of memory)
# 139 - Segmentation fault
# 143 - SIGTERM (graceful shutdown)

# Check resource limits
kubectl describe pod mypod -n myapp | grep -A 5 Limits
```

Pending Pods

```bash
# Check why not scheduled
kubectl describe pod mypod -n myapp | grep -A 10 Events

# Common causes:
# - Insufficient resources
# - Node selector/affinity not matched
# - Taints without tolerations
# - PVC not bound

# Check node resources
kubectl describe nodes | grep -A 5 "Allocated resources"
```

Resource Issues

Memory Problems

```bash
# Check pod resource usage
kubectl top pod mypod -n myapp

# Check for OOMKilled
kubectl describe pod mypod -n myapp | grep OOMKilled

# View memory limits
kubectl get pod mypod -n myapp -o jsonpath='{.spec.containers[*].resources}'
```

CPU Throttling

```bash
# Check CPU usage vs limits
kubectl top pod mypod -n myapp

# In container, check throttling (cgroup v1 path; on cgroup v2 it's /sys/fs/cgroup/cpu.stat)
kubectl exec mypod -n myapp -- cat /sys/fs/cgroup/cpu/cpu.stat
```

Networking Debugging

Service Connectivity

```bash
# Check service exists
kubectl get svc -n myapp

# Check endpoints (are pods backing the service?)
kubectl get endpoints myservice -n myapp

# Test from within cluster
kubectl run debug --rm -it --image=busybox -- /bin/sh
# Then: wget -qO- http://myservice.myapp.svc.cluster.local

# DNS resolution
kubectl run debug --rm -it --image=busybox -- nslookup myservice.myapp.svc.cluster.local
```

Pod-to-Pod Communication

```bash
# Get pod IPs
kubectl get pods -n myapp -o wide

# Test connectivity from one pod to another
kubectl exec mypod1 -n myapp -- wget -qO- http://10.0.0.5:8080

# Check network policies
kubectl get networkpolicies -n myapp
```

Ingress Issues

```bash
# Check ingress configuration
kubectl describe ingress myingress -n myapp

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc myservice -n myapp
```

ConfigMaps and Secrets

```bash
# Verify ConfigMap exists and has expected data
kubectl get configmap myconfig -n myapp -o yaml

# Check if mounted correctly
kubectl exec mypod -n myapp -- ls -la /etc/config/
kubectl exec mypod -n myapp -- cat /etc/config/app.yaml

# Verify Secret
kubectl get secret mysecret -n myapp -o jsonpath='{.data.password}' | base64 -d

# Check environment variables
kubectl exec mypod -n myapp -- env | grep MY_VAR
```

Persistent Volumes

```bash
# Check PVC status
kubectl get pvc -n myapp

# Describe for binding issues
kubectl describe pvc mypvc -n myapp

# Check PV
kubectl get pv

# Verify mount in pod
kubectl exec mypod -n myapp -- df -h
kubectl exec mypod -n myapp -- ls -la /data
```

Node Issues

```bash
# Node status
kubectl get nodes
kubectl describe node mynode

# Check conditions
kubectl get nodes -o custom-columns=NAME:.metadata.name,CONDITIONS:.status.conditions[*].type

# Node resource pressure
kubectl describe node mynode | grep -A 5 Conditions

# Pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=mynode

# Drain node for maintenance
kubectl drain mynode --ignore-daemonsets --delete-emptydir-data
```

Control Plane Debugging

```bash
# API server health
kubectl get --raw='/healthz'

# Component status (deprecated but useful)
kubectl get componentstatuses

# Check system pods
kubectl get pods -n kube-system

# API server logs (if accessible)
kubectl logs -n kube-system kube-apiserver-master

# etcd health
kubectl exec -n kube-system etcd-master -- etcdctl endpoint health
```

Useful Debug Containers

```bash
# Network debugging
kubectl run netdebug --rm -it --image=nicolaka/netshoot -- /bin/bash

# DNS debugging
kubectl run dnsdebug --rm -it --image=tutum/dnsutils -- /bin/bash

# General debugging
kubectl run debug --rm -it --image=busybox -- /bin/sh
```

Systematic Debugging Checklist

1. Events first: kubectl get events --sort-by='.lastTimestamp'
2. Describe the resource: kubectl describe <resource> <name>
3. Check logs: kubectl logs <pod> (and --previous)
4. Verify dependencies: ConfigMaps, Secrets, Services, PVCs
5. Check resources: CPU, memory limits and usage
6. Test connectivity: DNS, service endpoints, network policies
7. Compare with working: Diff against known good configuration

Quick Reference

```bash
# Pod not starting
kubectl describe pod <name>
kubectl get events

# Pod crashing
kubectl logs <pod> --previous
kubectl describe pod <name> | grep "Exit Code"

# Can't connect to service
kubectl get endpoints <service>
kubectl run debug --rm -it --image=busybox -- wget -qO- http://<service>

# Resource issues
kubectl top pods
kubectl describe node | grep -A 5 "Allocated"

# Config issues
kubectl exec <pod> -- env
kubectl exec <pod> -- cat /path/to/config
```

Kubernetes debugging is methodical. Start with events, drill into describe output, check logs, and verify each dependency. Most issues are configuration mismatches—wrong image tags, missing secrets, insufficient resources. ...
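The checklist above chains together naturally. Here is a minimal triage script in that spirit (my sketch, not from the post; the script name and argument handling are hypothetical):

```bash
#!/usr/bin/env bash
# k8s-triage.sh <namespace> <pod> -- first-pass pod triage (sketch)
set -euo pipefail
ns="$1" pod="$2"

echo "=== Recent events ==="
kubectl get events -n "$ns" --sort-by='.lastTimestamp' | tail -20

echo "=== Describe (crash/scheduling details) ==="
kubectl describe pod "$pod" -n "$ns" | grep -A 10 -E "Last State|Events"

echo "=== Logs (current and previous) ==="
kubectl logs "$pod" -n "$ns" --tail=50 || true
kubectl logs "$pod" -n "$ns" --previous --tail=50 || true

echo "=== Resource usage (requires metrics-server) ==="
kubectl top pod "$pod" -n "$ns" || true
```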

February 25, 2026 · 7 min · 1300 words · Rob Washington

strace: Tracing System Calls for Debugging

strace intercepts and records system calls made by a process. When a program hangs, crashes, or behaves mysteriously, strace reveals what it’s actually doing at the kernel level.

Basic Usage

```bash
# Trace a command
strace ls

# Trace running process
strace -p 1234

# Trace with timestamps
strace -t ls

# Trace with relative timestamps
strace -r ls
```

Filtering System Calls

```bash
# Only file operations
strace -e trace=file ls

# Only network operations
strace -e trace=network curl example.com

# Only process operations
strace -e trace=process bash -c 'sleep 1'

# Specific syscalls
strace -e open,read,write cat file.txt

# Exclude syscalls (quoted so the shell doesn't expand !)
strace -e 'trace=!mmap' ls
```

Trace Categories

```
file     # open, stat, chmod, etc.
process  # fork, exec, exit, etc.
network  # socket, connect, send, etc.
signal   # signal, kill, etc.
ipc      # shmget, semop, etc.
desc     # read, write, close, etc.
memory   # mmap, brk, etc.
```

Output Options

```bash
# Write to file
strace -o output.txt ls

# Append to file
strace -o output.txt -A ls

# With timestamps (wall clock)
strace -t ls

# With microseconds
strace -tt ls

# Relative timestamps
strace -r ls
```

Following Children

```bash
# Follow forked processes
strace -f bash -c 'ls; echo done'

# Follow forks and write separate files
strace -ff -o trace ls
# Creates trace.1234, trace.1235, etc.
```

String Output

```bash
# Show full strings (default truncates at 32 chars)
strace -s 1000 cat file.txt

# Show full strings for specific calls
strace -e read -s 10000 cat file.txt
```

Statistics

```bash
# Summary of syscalls
strace -c ls

# Sample output:
# % time     seconds  usecs/call     calls    errors syscall
# ------ ----------- ----------- --------- --------- ----------------
#  45.00    0.000045          45         1           execve
#  25.00    0.000025           3         8           mmap
#  15.00    0.000015           2         6           openat
```

```bash
# Summary with detailed trace
strace -c -S time ls    # Sort by time
strace -c -S calls ls   # Sort by call count
```

Practical Examples

Debug “File Not Found”

```bash
# See what files the program is trying to open
strace -e openat ./myprogram 2>&1 | grep -i "no such file"
```

Find Config File Locations

```bash
# See all files a program tries to read
strace -e openat nginx -t 2>&1 | grep -E "openat.*O_RDONLY"
```

Debug Connection Issues

```bash
# Watch network connections
strace -e connect curl https://example.com

# See DNS lookups
strace -e socket,connect,sendto,recvfrom dig example.com
```

Debug Hanging Process

```bash
# Attach to hung process
strace -p $(pgrep hung-process)

# Common findings:
# - Waiting on read() = blocked on input
# - Waiting on futex() = waiting for lock
# - Waiting on poll/select = waiting for I/O
```

Find Why Program is Slow

```bash
# Time each syscall
strace -T ls
# Shows time spent in each call:
# openat(AT_FDCWD, ".", ...) = 3 <0.000015>

# Summary to find slow operations
strace -c -S time slow-program
```

Debug Permission Issues

```bash
# See access denials
strace -e openat,access ./program 2>&1 | grep -i denied

# Sample output:
# openat(AT_FDCWD, "/etc/secret", O_RDONLY) = -1 EACCES (Permission denied)
```

Watch File I/O

```bash
# See all reads and writes
strace -e read,write -s 100 cat file.txt

# Count I/O operations
strace -c -e read,write dd if=/dev/zero of=/dev/null bs=1M count=100
```

Debug Signal Handling

```bash
# Trace signals
strace -e signal,kill ./program

# See what signal killed a process
strace -e trace=signal -p 1234
```

Find Library Loading Issues

```bash
# See shared library loading
strace -e openat ./program 2>&1 | grep "\.so"

# Common issues:
# - Library not found
# - Wrong library version loaded
```

Advanced Usage

Inject Faults

```bash
# Make open() fail with ENOENT
strace -e fault=openat:error=ENOENT ls

# Fail the 3rd read call (when=3; use when=3+3 to fail every 3rd)
strace -e fault=read:error=EIO:when=3 cat file.txt
```

Decode Arguments

```bash
# Decode socket addresses
strace -yy curl example.com

# Decode file descriptors
strace -y cat file.txt
```

Trace Specific Syscall Return

```bash
# Only show failed syscalls
strace -Z ls /nonexistent

# Show syscalls that return specific value
strace -e status=failed ls /nonexistent
```

Reading strace Output

```
openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3
│      │         │              │           └ Return value (fd 3)
│      │         │              └ Flags/arguments
│      │         └ Path name
│      └ Directory fd (AT_FDCWD = current directory)
└ System call name
```

Common Return Values

```
= 0             Success
= N             Success (returned resource, e.g. fd number)
= -1 ENOENT     No such file or directory (file/dir missing)
= -1 EACCES     Permission denied
= -1 EAGAIN     Resource temporarily unavailable (try again)
= -1 EEXIST     File exists
= -1 EINTR      Interrupted system call
```

Alternatives

```bash
# ltrace - trace library calls (not syscalls)
ltrace ./program

# perf trace - lower overhead
perf trace ls

# bpftrace - more powerful, requires setup
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s\n", str(args->filename)); }'
```

Performance Note

strace adds significant overhead — programs run much slower when traced. For production debugging: ...

February 25, 2026 · 7 min · 1297 words · Rob Washington

lsof: Finding What's Using Files and Ports

lsof (list open files) shows which processes have which files open. Since Unix treats everything as a file — including network connections, devices, and pipes — lsof is incredibly powerful for system debugging.

Basic Usage

```bash
# List all open files (overwhelming)
lsof

# List open files for specific file
lsof /var/log/syslog

# List open files for user
lsof -u username

# List open files for process
lsof -p 1234
```

Find What’s Using a File

```bash
# Who has this file open?
lsof /path/to/file

# Who has files open in this directory (top level only)?
lsof +d /var/log

# Recursive directory search
lsof +D /var/log/
```

“Device or resource busy” Debugging

```bash
# Can't unmount? Find what's using it
lsof +D /mnt/usb

# Find open files on filesystem
lsof /dev/sda1
```

Find What’s Using a Port

```bash
# What's on port 8080?
lsof -i :8080

# What's on port 80 or 443?
lsof -i :80 -i :443

# All network connections
lsof -i

# TCP only
lsof -i TCP

# UDP only
lsof -i UDP
```

Specific Protocol and Port

```bash
# TCP port 22
lsof -i TCP:22

# UDP port 53
lsof -i UDP:53

# Port range
lsof -i :1-1024
```

Filter by Process

```bash
# By PID
lsof -p 1234

# By process name
lsof -c nginx

# By multiple process names
lsof -c nginx -c apache

# Exclude process
lsof -c ^nginx
```

Filter by User

```bash
# By username
lsof -u www-data

# By UID
lsof -u 1000

# Multiple users
lsof -u user1,user2

# Exclude user
lsof -u ^root
```

Network Connections

List All Connections

```bash
# All internet connections
lsof -i

# Established connections only
lsof -i | grep ESTABLISHED

# Listening sockets
lsof -i | grep LISTEN
```

Connections to Specific Host

```bash
# Connections to specific IP
lsof -i @192.168.1.100

# Connections to hostname
lsof -i @example.com

# Connections to specific host and port
lsof -i @192.168.1.100:22
```

IPv4 vs IPv6

```bash
# IPv4 only
lsof -i 4

# IPv6 only
lsof -i 6
```

Output Formatting

```bash
# Terse output (parseable)
lsof -t /path/to/file
# Returns just PIDs

# No header
lsof +D /var/log | tail -n +2

# Specific fields
lsof -F p /path/to/file  # PID only
lsof -F c /path/to/file  # Command only
```

Combining Filters

By default, filters are OR’d. Use -a to AND them: ...
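A quick illustration of what -a changes (the full post's examples are cut off here): without it, lsof lists files matching any selection; with it, a file must match all of them.

```bash
# OR (default): files owned by www-data, PLUS every network file on the system
lsof -u www-data -i

# AND (-a): only network files that belong to www-data
lsof -a -u www-data -i

# AND with a port: www-data's connections on port 8080
lsof -a -u www-data -i :8080
```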

February 25, 2026 · 6 min · 1241 words · Rob Washington

netcat (nc): The Swiss Army Knife of Networking

netcat (nc) does one thing: move bytes over a network. That simplicity makes it incredibly versatile — port scanning, file transfers, chat servers, proxies, and network debugging all become one-liners.

Basic Usage

```bash
# Connect to host:port
nc hostname 80

# Listen on port
nc -l 8080

# Listen (keep listening after disconnect)
nc -lk 8080
```

Test if Port is Open

```bash
# Quick port check
nc -zv hostname 22
# Connection to hostname 22 port [tcp/ssh] succeeded!

# Scan port range
nc -zv hostname 20-25

# With timeout
nc -zv -w 3 hostname 443
```

Simple Client-Server Communication

Server (listener)

```bash
nc -l 9000
```

Client

```bash
nc localhost 9000
```

Now type in either terminal — text flows both ways. Ctrl+C to exit. ...
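The same pattern moves files. The intro mentions file transfers as one-liners; here is a minimal sketch of that recipe (my assumption, with hypothetical file and host names):

```bash
# Receiver: listen and write incoming bytes to disk
nc -l 9000 > received.tar.gz

# Sender: connect and stream the file
# (-w 3 quits a few seconds after EOF; some nc variants use -q or -N instead)
nc -w 3 receiver-host 9000 < backup.tar.gz

# Verify: compare checksums on each end
sha256sum backup.tar.gz        # on the sender
sha256sum received.tar.gz      # on the receiver
```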

February 25, 2026 · 5 min · 905 words · Rob Washington

YAML Gotchas: The Traps Everyone Falls Into

YAML is everywhere — Kubernetes, Docker Compose, Ansible, GitHub Actions, CI/CD pipelines. It looks friendly until you spend an hour debugging why on became true or your port number turned into octal. Here are the traps and how to avoid them.

The Norway Problem

```yaml
# What you wrote
country: NO

# What YAML parsed
country: false
```

YAML interprets NO, no, No, OFF, off, Off as boolean false. Same with YES, yes, Yes, ON, on, On as true. ...
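The defense is simple (a minimal sketch; the rest of the post's advice is cut off in this excerpt): quote any scalar that a YAML 1.1 parser could reinterpret.

```yaml
# Quoting forces string interpretation
country: "NO"       # stays the string "NO", not false
answer: "yes"       # stays a string, not true
version: "3.10"     # not the float 3.1
zip_code: "01234"   # not an octal number
```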

February 24, 2026 · 5 min · 965 words · Rob Washington

Git Bisect: Finding the Commit That Broke Everything

Something’s broken. It worked last week. Somewhere in the 47 commits since then, someone introduced a bug. You could check each commit manually, or you could let git do the work.

git bisect performs a binary search through your commit history to find the exact commit that introduced a problem. Instead of checking 47 commits, you check about 6.

The Basic Workflow

```bash
# Start bisecting
git bisect start

# Mark current commit as bad (has the bug)
git bisect bad

# Mark a known good commit (before the bug existed)
git bisect good v1.2.0
# or
git bisect good abc123

# Git checks out a commit halfway between good and bad
# Test it, then tell git the result:
git bisect good  # This commit doesn't have the bug
# or
git bisect bad   # This commit has the bug

# Git narrows down and checks out another commit
# Repeat until git finds the first bad commit

# When done, return to original state
git bisect reset
```

Example Session

```bash
$ git bisect start
$ git bisect bad           # HEAD is broken
$ git bisect good v2.0.0   # v2.0.0 worked fine
Bisecting: 23 revisions left to test after this (roughly 5 steps)
[abc123...] Add caching layer

$ ./run-tests.sh
Tests pass!
$ git bisect good
Bisecting: 11 revisions left to test after this (roughly 4 steps)
[def456...] Refactor auth module

$ ./run-tests.sh
FAIL!
$ git bisect bad
Bisecting: 5 revisions left to test after this (roughly 3 steps)
[ghi789...] Update dependencies

# ... continue until:
abc123def456789 is the first bad commit
commit abc123def456789
Author: Someone <someone@example.com>
Date:   Mon Feb 20 14:30:00 2024

    Fix edge case in login flow

# This commit introduced the bug!
$ git bisect reset
```

Automated Bisecting

If you have a script that can test for the bug: ...
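The excerpt cuts off here, but the standard mechanism is git bisect run, which drives the whole search from your test script's exit code (a minimal sketch; run-tests.sh is the hypothetical script from the session above):

```bash
git bisect start
git bisect bad            # current commit is broken
git bisect good v2.0.0    # last known good commit

# git checks out each candidate and runs the script automatically:
# exit 0 = good, exit 1-127 (except 125) = bad, exit 125 = skip this commit
git bisect run ./run-tests.sh

git bisect reset          # back to where you started
```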

February 24, 2026 · 6 min · 1110 words · Rob Washington

DNS Debugging: Finding Why Your Domain Isn't Working

“It’s always DNS” is a meme because it’s usually true. When something network-related breaks, DNS is the first suspect. Here’s how to investigate.

The Essential Tools

dig - The DNS Swiss Army Knife

```bash
# Basic lookup
dig example.com

# Specific record type
dig example.com MX
dig example.com TXT
dig example.com CNAME

# Short output (just the answer)
dig +short example.com

# Query a specific nameserver
dig @8.8.8.8 example.com

# Trace the full resolution path
dig +trace example.com
```

nslookup - Quick and Simple

```bash
# Basic lookup
nslookup example.com

# Query specific server
nslookup example.com 8.8.8.8

# Interactive mode
nslookup
> set type=MX
> example.com
```

host - Even Simpler

```bash
# Basic lookup
host example.com

# Specific record type
host -t MX example.com
```

Common DNS Problems

Problem: Domain Not Resolving

Symptoms: NXDOMAIN or SERVFAIL errors ...
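The full diagnosis is cut off in this excerpt, but a reasonable first pass (my sketch, assuming ns1.example.com is one of the delegated nameservers) checks each link in the resolution chain:

```bash
# 1. Walk the delegation from the root down
dig +trace example.com | tail -20

# 2. What NS records does the world see?
dig +short NS example.com

# 3. Ask an authoritative server directly (bypasses all caches)
dig @ns1.example.com example.com A

# 4. Compare against a public resolver; a mismatch points at
#    stale caches or a broken delegation at the registrar
dig @8.8.8.8 +short example.com A
```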

February 24, 2026 · 6 min · 1132 words · Rob Washington

Linux Performance Debugging: Finding What's Slow and Why

Something’s slow. Users are complaining. Your monitoring shows high latency but not why. You SSH into the server and need to figure out what’s wrong — fast. This is a systematic approach to Linux performance debugging.

Start with the Big Picture

Before diving deep, get an overview:

```bash
# System load and uptime
uptime
# 17:00:00 up 45 days, load average: 8.52, 4.23, 2.15
# Load average: 1/5/15 minute averages
# Compare to CPU count: load 8 on 4 CPUs = overloaded
```

```bash
# Quick health check
top -bn1 | head -20
```

CPU: Who’s Using It?

```bash
# Real-time CPU usage
top
# Sort by CPU: press 'P'
# Sort by memory: press 'M'
# Show threads: press 'H'
```

```bash
# CPU usage by process (snapshot)
ps aux --sort=-%cpu | head -10

# CPU time accumulated
ps aux --sort=-time | head -10
```

```bash
# Per-CPU breakdown
mpstat -P ALL 1 5
# Shows each CPU core's utilization
# Look for: one core at 100% (single-threaded bottleneck)
```

```bash
# What's the CPU actually doing?
vmstat 1 5
#  r  b   swpd   free   buff   cache  si  so  bi  bo  in  cs us sy id wa st
#  3  0      0 245612 128940 2985432   0   0   8  24 312 892 45 12 40  3  0
# r: processes waiting for CPU (high = CPU-bound)
# b: processes blocked on I/O (high = I/O-bound)
# us: user CPU time
# sy: system CPU time
# wa: waiting for I/O
# id: idle
```

Memory: Running Out?

```bash
# Memory overview
free -h
#        total  used  free   shared  buff/cache  available
# Mem:    31Gi  12Gi  2.1Gi   1.2Gi        17Gi       17Gi
# "available" is what matters, not "free"
# Linux uses free memory for cache — that's good
```

```bash
# Top memory consumers
ps aux --sort=-%mem | head -10

# Memory details per process
pmap -x <pid>

# Or from /proc
cat /proc/<pid>/status | grep -i mem
```

```bash
# Check for OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
```

```bash
# Swap usage (if swapping, you're in trouble)
swapon -s
vmstat 1 5  # si/so columns show swap in/out
```

Disk I/O: The Hidden Bottleneck

```bash
# I/O wait in top
top  # Look at %wa in the CPU line

# Detailed I/O stats
iostat -xz 1 5
# Device    r/s     w/s   rkB/s   wkB/s  await  %util
# sda    150.00  200.00    4800    8000   12.5   85.0
# await: average I/O wait time (ms) - high = slow disk
# %util: disk utilization - 100% = saturated
```

```bash
# Which processes are doing I/O?
iotop -o  # Shows only processes actively doing I/O

# Or without iotop
pidstat -d 1 5
```

```bash
# Check for disk errors
dmesg | grep -i error
smartctl -a /dev/sda  # SMART data for disk health
```

Network: Connections and Throughput

```bash
# Network interface stats
ip -s link show
# Watch for errors, drops, overruns

# Real-time bandwidth
iftop
# Or
nload
```

```bash
# Connection states
ss -s
# Total: 1523 (kernel 1847)
# TCP: 892 (estab 654, closed 89, orphaned 12, timewait 127)
# Many TIME_WAIT: connection churn
# Many ESTABLISHED: legitimate load or connection leak
```

```bash
# Connections per state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Connections by remote IP
ss -tn | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
```

```bash
# Check for packet loss
ping -c 100 <gateway>
# Any loss = network problem

# TCP retransmits
netstat -s | grep -i retrans
```

Process-Level Debugging

```bash
# What's a specific process doing?
strace -p <pid> -c  # Summary of system calls
strace -p <pid> -f -e trace=network  # Network-related calls only
```

```bash
# Open files and connections
lsof -p <pid>

# Just network connections (-a ANDs the filters)
lsof -a -i -p <pid>

# Files in a directory
lsof +D /var/log
```

```bash
# Process threads
ps -T -p <pid>

# Thread CPU usage
top -H -p <pid>
```

System-Wide Tracing

```bash
# What's happening system-wide?
perf top
# Real-time view of where CPU cycles go

# Record and analyze
perf record -g -a sleep 10
perf report
```

```bash
# BPF-based tools (if available)
# CPU usage by function
profile-bpfcc 10

# Disk latency histogram
biolatency-bpfcc

# TCP connection latency
tcpconnlat-bpfcc
```

The Checklist Approach

When debugging, go through systematically: ...
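The checklist itself is cut off in this excerpt, but the commands above chain into a quick first pass (a sketch in the spirit of the classic 60-second analysis; adjust intervals and device names to your system):

```bash
#!/usr/bin/env bash
# perf-triage.sh -- one-minute first pass using the tools above (sketch)
uptime                  # load vs CPU count
vmstat 1 5              # run queue, swap activity, CPU split
mpstat -P ALL 1 2       # per-core hotspots (single-threaded bottleneck?)
free -h                 # available memory, not "free"
iostat -xz 1 3          # disk await / %util
ss -s                   # connection-state summary
dmesg | tail -20        # recent kernel complaints (OOM, disk errors)
```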

February 23, 2026 · 6 min · 1095 words · Rob Washington

Distributed Tracing Essentials: Following Requests Across Services

A request hits your API gateway, bounces through five microservices, touches two databases, and returns an error. Logs say “something failed.” Which service? Which call? What was the payload? Distributed tracing answers these questions by connecting the dots across service boundaries.

The Core Concepts

Traces and Spans

A trace represents a complete request journey. A span represents a single operation within that journey.

```
Trace
└── Span: API Gateway
    ├── Span: Auth service
    ├── Span: User service
    │   └── Span: DB query
    └── Span: Order service
        ├── Span: Inventory check
        └── Span: Payment processing
```

Each span has: ...
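To make the nesting concrete, here is how parent and child spans look in code. A minimal sketch using the OpenTelemetry Python API as one common option (the excerpt does not say which library the post uses; service and attribute names are illustrative):

```python
from opentelemetry import trace

# Without an SDK/exporter configured this is a no-op tracer,
# but the span structure is the same once one is wired up.
tracer = trace.get_tracer("order-service")

def handle_order(order_id: str) -> None:
    # Parent span: one operation in the request's journey
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)

        # Child span: nested operation, linked automatically via context
        with tracer.start_as_current_span("db.query") as db_span:
            db_span.set_attribute("db.statement", "SELECT ... WHERE id = ?")
            # ... run the query ...
```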

February 19, 2026 · 9 min · 1805 words · Rob Washington

Structured Logging: Stop Grepping, Start Querying

Unstructured logs are a liability. When your application writes User 12345 logged in from 192.168.1.1, you’re creating text that’s easy to read but impossible to query at scale. Structured logging changes the game: logs become data you can filter, aggregate, and analyze.

The Problem with Text Logs

```python
# Traditional logging
import logging
logging.info(f"User {user_id} logged in from {ip_address}")
# Output: INFO:root:User 12345 logged in from 192.168.1.1
```

Want to find all logins from a specific IP? You need regex. Want to count logins per user? Good luck. Want to correlate with other events? Hope your timestamp parsing is solid. ...
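The structured alternative emits key-value events instead of interpolated strings. A minimal sketch using structlog with a JSON renderer (one popular option; the excerpt cuts off before the post names its library):

```python
import structlog

# Render every event as one JSON object per line
structlog.configure(
    processors=[structlog.processors.JSONRenderer()]
)

log = structlog.get_logger()
log.info("user_logged_in", user_id=12345, ip_address="192.168.1.1")
# {"user_id": 12345, "ip_address": "192.168.1.1", "event": "user_logged_in"}
```

Now "all logins from this IP" is a field filter, and "logins per user" is a group-by, no regex required.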

February 11, 2026 · 6 min · 1083 words · Rob Washington