Something’s slow. Users are complaining. Your monitoring shows high latency but not why. You SSH into the server and need to figure out what’s wrong — fast.
This is a systematic approach to Linux performance debugging.
## Start with the Big Picture

Before diving deep, get an overview:

```bash
# System load and uptime
uptime
# 17:00:00 up 45 days, load average: 8.52, 4.23, 2.15
# Load average: 1/5/15 minute averages
# Compare to CPU count: load 8 on 4 CPUs = overloaded

# Quick health check
top -bn1 | head -20
```

## CPU: Who’s Using It?

```bash
# Real-time CPU usage
top
# Sort by CPU: press 'P'
# Sort by memory: press 'M'
# Show threads: press 'H'
```

```bash
# CPU usage by process (snapshot)
ps aux --sort=-%cpu | head -10

# CPU time accumulated
ps aux --sort=-time | head -10
```

```bash
# Per-CPU breakdown
mpstat -P ALL 1 5
# Shows each CPU core's utilization
# Look for: one core at 100% (single-threaded bottleneck)
```

```bash
# What's the CPU actually doing?
vmstat 1 5
#  r  b swpd   free   buff   cache si so bi bo  in  cs us sy id wa st
#  3  0    0 245612 128940 2985432  0  0  8 24 312 892 45 12 40  3  0
# r:  processes waiting for CPU (high = CPU-bound)
# b:  processes blocked on I/O (high = I/O-bound)
# us: user CPU time
# sy: system CPU time
# wa: waiting for I/O
# id: idle
```

## Memory: Running Out?

```bash
# Memory overview
free -h
#       total  used  free   shared  buff/cache  available
# Mem:   31Gi  12Gi  2.1Gi    1.2Gi        17Gi       17Gi
# "available" is what matters, not "free"
# Linux uses free memory for cache — that's good
```

```bash
# Top memory consumers
ps aux --sort=-%mem | head -10

# Memory details per process
pmap -x <pid>

# Or from /proc
cat /proc/<pid>/status | grep -i mem
```

```bash
# Check for OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
```

```bash
# Swap usage (if swapping, you're in trouble)
swapon -s
vmstat 1 5   # si/so columns show swap in/out
```

## Disk I/O: The Hidden Bottleneck

```bash
# I/O wait in top
top
# Look at %wa in the CPU line

# Detailed I/O stats
iostat -xz 1 5
# Device   r/s     w/s     rkB/s  wkB/s  await  %util
# sda      150.00  200.00  4800   8000   12.5   85.0
# await: average I/O wait time (ms) - high = slow disk
# %util: disk utilization - 100% = saturated
```

```bash
# Which processes are doing I/O?
iotop -o    # shows only processes actively doing I/O

# Or without iotop
pidstat -d 1 5
```

```bash
# Check for disk errors
dmesg | grep -i error
smartctl -a /dev/sda   # SMART data for disk health
```

## Network: Connections and Throughput

```bash
# Network interface stats
ip -s link show
# Watch for errors, drops, overruns

# Real-time bandwidth
iftop
# Or
nload
```

```bash
# Connection states
ss -s
# Total: 1523 (kernel 1847)
# TCP: 892 (estab 654, closed 89, orphaned 12, timewait 127)
# Many TIME_WAIT: connection churn
# Many ESTABLISHED: legitimate load or connection leak
```

```bash
# Connections per state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Connections by remote IP
ss -tn | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
```

```bash
# Check for packet loss
ping -c 100 <gateway>
# Any loss = network problem

# TCP retransmits
netstat -s | grep -i retrans
```

## Process-Level Debugging

```bash
# What's a specific process doing?
# (note: strace slows the target process significantly — use briefly)
strace -p <pid> -c                    # summary of system calls
strace -p <pid> -f -e trace=network   # network-related calls only
```

```bash
# Open files and connections
lsof -p <pid>

# Just network connections
lsof -a -i -p <pid>   # -a ANDs the filters; without it, lsof ORs them

# Files in a directory
lsof +D /var/log
```

```bash
# Process threads
ps -T -p <pid>

# Thread CPU usage
top -H -p <pid>
```

## System-Wide Tracing

```bash
# What's happening system-wide?
perf top
# Real-time view of where CPU cycles go

# Record and analyze
perf record -g -a sleep 10
perf report
```

```bash
# BPF-based tools (if available)
# CPU usage by function
profile-bpfcc 10

# Disk latency histogram
biolatency-bpfcc

# TCP connection latency
tcpconnlat-bpfcc
```

## The Checklist Approach

When debugging, go through systematically:
...
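The first-minute overview can be bundled into a small triage script. This is my own sketch, not part of the article: it strings together the commands from the sections above (`uptime`, `nproc`, `free`, `ps`, `vmstat`), and the section labels and ordering are illustrative assumptions.

```shell
#!/bin/sh
# triage.sh — first-minute health snapshot (sketch; commands from the article,
# layout and labels are my own)

triage() {
    echo "== Load average vs CPU count =="
    uptime                  # sustained load above nproc = CPU saturated
    echo "CPUs: $(nproc)"

    echo "== Memory ('available' is what matters, not 'free') =="
    free -h

    echo "== Top CPU consumers =="
    ps aux --sort=-%cpu | head -6

    echo "== Top memory consumers =="
    ps aux --sort=-%mem | head -6

    echo "== Blocked processes / swap activity =="
    # b column: blocked on I/O; si/so columns: swap in/out
    command -v vmstat >/dev/null && vmstat 1 2 | tail -3
}

triage
```

Run it as the very first step of an incident; the output tells you which of the deeper sections above to go to next.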