Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how.
## Rule Zero: Don’t Make It Worse
Before touching anything:
- Don’t restart services until you understand the problem
- Don’t deploy fixes without knowing the root cause
- Don’t clear logs you might need for investigation
- Don’t scale down what might be handling load
Stabilize first, investigate second, fix third.
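Before investigating anything, it is cheap to snapshot current state so evidence survives whatever you do next. A minimal sketch (Linux-only since it reads /proc; the output directory is arbitrary):

```shell
# Snapshot basic system state into a timestamped evidence directory
# before changing anything.
dir="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$dir"
date              > "$dir/when.txt"
cat /proc/loadavg > "$dir/loadavg.txt"
df -h             > "$dir/df.txt"
ps aux            > "$dir/ps.txt"
echo "evidence saved to $dir"
```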
## Start With Observability

### Check Dashboards

Before SSH-ing into anything, start with your dashboards: they show the scope and blast radius without any risk of side effects on the affected host.
### Log Aggregation

```shell
# Recent errors
grep -i error /var/log/app/app.log | tail -100

# Error frequency over time
grep -i error /var/log/app/app.log | cut -d' ' -f1-2 | uniq -c | tail -20

# With structured logs (JSON)
jq 'select(.level == "error")' app.log | tail -20
```
### Metrics First
If you have Prometheus/Grafana:
- Check if this is new or recurring
- Look for correlation with deployments
- Compare to same time yesterday/last week
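With Prometheus, `offset` does the yesterday comparison directly. A sketch assuming a standard request counter named `http_requests_total` (illustrative; your metric names will differ):

```promql
# 5xx rate over the last 5 minutes...
sum(rate(http_requests_total{status=~"5.."}[5m]))

# ...versus the same window at this time yesterday
sum(rate(http_requests_total{status=~"5.."}[5m] offset 1d))
```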
## Safe Investigation Commands

### System Overview

```shell
# Load and uptime
uptime

# Memory
free -h

# Disk
df -h

# Top processes
top -bn1 | head -20

# IO wait (3 samples, 1 second apart)
iostat -x 1 3
```
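Load averages only mean something relative to core count. A quick comparison (a sketch; Linux-only since it reads /proc/loadavg):

```shell
# Compare the 1-minute load average to the core count.
# Sustained load above the core count suggests CPU or IO saturation.
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
echo "load=${load} cores=${cores}"
if awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "load exceeds core count - check run queue and iowait"
else
  echo "load within core count"
fi
```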
### Process Investigation

```shell
# Find your process
pgrep -a myapp

# Process details (the [m] trick keeps grep from matching itself)
ps aux | grep [m]yapp

# Open files (pgrep -o picks the oldest match if there are several)
lsof -p "$(pgrep -o myapp)" | head -50

# File descriptor count
ls /proc/"$(pgrep -o myapp)"/fd | wc -l

# Network connections
ss -tunap | grep myapp
```
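A raw descriptor count matters mostly relative to the process's limit. This sketch compares the two (Linux-only; it defaults to the current shell's PID so it runs anywhere, but pass your app's PID in practice):

```shell
# Compare open file descriptors to the soft limit for a PID.
pid="${1:-$$}"                       # default to this shell for demo
open=$(ls "/proc/${pid}/fd" | wc -l)
limit=$(awk '/Max open files/ {print $4}' "/proc/${pid}/limits")
echo "pid=${pid} open_fds=${open} soft_limit=${limit}"
```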
### Network Investigation

```shell
# Connection states
ss -s

# Connections to a specific port, grouped by local address
ss -tan | grep :8080 | awk '{print $4}' | sort | uniq -c | sort -rn

# DNS resolution
dig api.example.com

# Connectivity and timing test
curl -w "\ntime_total: %{time_total}s\n" -o /dev/null -s https://api.example.com/health
```
### Application Logs

```shell
# Follow logs
tail -f /var/log/app/app.log

# Recent errors with context
grep -B5 -A5 "ERROR" /var/log/app/app.log | tail -100

# Time-bounded search (works because ISO timestamps compare lexically)
awk '$0 >= "2024-01-15 10:00" && $0 <= "2024-01-15 11:00"' app.log

# Count by error type
grep ERROR app.log | sed 's/.*ERROR: //' | cut -d: -f1 | sort | uniq -c | sort -rn
```
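To tell a spike from a steady drip, bucket errors per minute. A sketch against a generated sample log (the timestamp layout is an assumption; adjust the `cut` range to match your format):

```shell
# Generate a small sample log (stand-in for /var/log/app/app.log).
log=$(mktemp)
cat > "$log" <<'EOF'
2024-01-15 10:00:01 ERROR: db timeout
2024-01-15 10:00:30 INFO: ok
2024-01-15 10:01:02 ERROR: db timeout
2024-01-15 10:01:45 ERROR: cache miss storm
EOF

# Errors per minute: keep the first 16 chars (date + hh:mm), count runs.
grep ERROR "$log" | cut -c1-16 | uniq -c
```

`uniq -c` works here without a `sort` because log timestamps already arrive in order.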
### Strace for System Calls

When you need to see what a process is actually doing:

```shell
# Attach to a running process, network syscalls only
strace -p "$(pgrep -o myapp)" -f -e trace=network

# Just file operations (openat covers modern open calls)
strace -p "$(pgrep -o myapp)" -e trace=open,openat,read,write

# With per-call timing
strace -p "$(pgrep -o myapp)" -T -e trace=all

# Output to a file
strace -p "$(pgrep -o myapp)" -o /tmp/strace.log
```

Warning: strace adds significant overhead to the traced process. Attach briefly, detach promptly, and think twice on latency-critical services.
## Database Investigation

### PostgreSQL

```sql
-- Active queries running longer than 5 seconds
SELECT pid, now() - query_start AS duration, query, state
FROM pg_stat_activity
WHERE state <> 'idle'
  AND (now() - query_start) > interval '5 seconds';

-- Lock waits: each blocked backend and who is blocking it (9.6+)
SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Table stats: heavy sequential scans can mean missing indexes
SELECT relname, seq_scan, idx_scan, n_live_tup
FROM pg_stat_user_tables
ORDER BY seq_scan DESC LIMIT 10;
```
### MySQL

```sql
-- Running queries
SHOW PROCESSLIST;

-- InnoDB status (deadlocks, lock waits, buffer pool)
SHOW ENGINE INNODB STATUS\G

-- Table lock waits
SHOW OPEN TABLES WHERE In_use > 0;
```
### Redis

```shell
# Memory usage
redis-cli INFO memory

# Slow log
redis-cli SLOWLOG GET 10

# Connected clients
redis-cli CLIENT LIST

# Big keys (scans the keyspace; adds some load)
redis-cli --bigkeys
```
## Container Debugging

### Docker

```shell
# Container stats
docker stats --no-stream

# Container logs
docker logs --tail 100 -f container_name

# Exec into a container
docker exec -it container_name /bin/sh

# Container processes
docker top container_name

# Inspect container state
docker inspect container_name | jq '.[0].State'
```
### Kubernetes

```shell
# Pod status
kubectl get pods -o wide

# Pod events
kubectl describe pod pod_name

# Pod logs
kubectl logs pod_name --tail=100 -f
kubectl logs pod_name --previous   # previous (crashed) container

# Exec into a pod
kubectl exec -it pod_name -- /bin/sh

# Resource usage
kubectl top pods

# Cluster events, newest last
kubectl get events --sort-by='.lastTimestamp' | tail -20
```
## Memory Issues

```shell
# Memory by process
ps aux --sort=-%mem | head -10

# Heap dump (Java; pgrep -o picks the oldest match)
jmap -dump:live,format=b,file=/tmp/heap.hprof "$(pgrep -o java)"

# Memory map
pmap -x "$(pgrep -o myapp)"

# OOM killer logs
dmesg | grep -i "killed process"
journalctl -k | grep -i "out of memory"
```
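When a full heap dump is too heavy, one process's resident memory can be read straight from /proc. A Linux-only sketch (defaults to the current shell's PID so it is runnable anywhere):

```shell
# Resident set size (RSS) for a PID, from the VmRSS field of /proc status.
pid="${1:-$$}"
rss_kb=$(awk '/^VmRSS/ {print $2}' "/proc/${pid}/status")
echo "pid=${pid} rss=${rss_kb} kB"
```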
## Thread/Goroutine Issues

```shell
# Thread count (the [m] trick keeps grep from matching itself)
ps -eLf | grep [m]yapp | wc -l

# Thread states (Java)
jstack "$(pgrep -o java)" > /tmp/threads.txt

# Go pprof (if enabled)
curl http://localhost:6060/debug/pprof/goroutine?debug=2
```
## Network Debugging

```shell
# Packet capture (stop after 1000 packets)
tcpdump -i any port 8080 -w /tmp/capture.pcap -c 1000

# HTTP traffic in plain text (-l line-buffers so grep sees output live)
tcpdump -l -i any -A port 80 | grep -E 'GET|POST|HTTP'

# Connection timing
curl -w "@curl-format.txt" -o /dev/null -s https://example.com

# curl-format.txt:
#     time_namelookup: %{time_namelookup}s\n
#        time_connect: %{time_connect}s\n
#     time_appconnect: %{time_appconnect}s\n
#       time_redirect: %{time_redirect}s\n
#    time_pretransfer: %{time_pretransfer}s\n
#  time_starttransfer: %{time_starttransfer}s\n
#          ----------\n
#          time_total: %{time_total}s\n
```
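The curl-format.txt file can be written in one step with a quoted heredoc so the `%{...}` variables stay literal (a sketch writing to /tmp; the actual request is left commented since it needs network access):

```shell
# Write a curl timing-format file for use with curl -w "@file".
cat > /tmp/curl-format.txt <<'EOF'
   time_namelookup:  %{time_namelookup}s\n
      time_connect:  %{time_connect}s\n
   time_appconnect:  %{time_appconnect}s\n
     time_redirect:  %{time_redirect}s\n
  time_pretransfer:  %{time_pretransfer}s\n
time_starttransfer:  %{time_starttransfer}s\n
        ----------\n
        time_total:  %{time_total}s\n
EOF
echo "wrote /tmp/curl-format.txt"

# curl -w "@/tmp/curl-format.txt" -o /dev/null -s https://example.com
```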
## What NOT To Do

### Don’t

- Run `rm`, `kill -9`, or restarts without understanding why
- Deploy a “fix” under pressure without code review
- Clear logs before saving them
- Make multiple changes at once
- Debug in production with IDE debuggers attached
- Add print statements to production code
- Ignore the metrics and rely on intuition

### Do

- Screenshot/save the current state
- Communicate with your team
- Check if this happened before
- Make one change at a time
- Document what you find
- Create a timeline of events
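A scratch file started at detection time keeps the timeline and notes consistent under pressure. A sketch (the path and section names are just suggestions):

```shell
# Start an incident notes file with a timeline skeleton.
notes="/tmp/incident-$(date +%Y%m%d-%H%M%S).md"
cat > "$notes" <<'EOF'
# Incident notes

## Timeline (UTC)
- HH:MM detected:
- HH:MM first action:

## Observations

## Hypotheses
EOF
echo "notes at $notes"
```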
## Post-Incident

After stabilizing:

- Save evidence: logs, metrics, screenshots
- Write a timeline: what happened when
- Root cause analysis: why did it happen
- Prevention: how do we stop it recurring
- Detection: how do we catch it faster next time
The best debugging skill is knowing what questions to ask before touching anything.