Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how.

Rule Zero: Don’t Make It Worse

Before touching anything:

  • Don’t restart services until you understand the problem
  • Don’t deploy fixes without knowing the root cause
  • Don’t clear logs you might need for investigation
  • Don’t scale down what might be handling load

Stabilize first, investigate second, fix third.
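One cheap way to honor Rule Zero is to snapshot system state before you change anything, so evidence survives any later restarts. A minimal sketch; the directory location and file names are illustrative, not a standard:

```shell
# Capture basic system state into a timestamped directory.
snap="/tmp/incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$snap"
uptime  > "$snap/uptime.txt"
df -h   > "$snap/disk.txt"
ps aux  > "$snap/processes.txt"
free -h > "$snap/memory.txt" 2>/dev/null || true   # Linux-only
echo "state saved to $snap"
```

Thirty seconds spent here costs nothing; the same data after a restart is gone forever.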

Start With Observability

Check Dashboards

Before SSH-ing anywhere:

  • Error rate elevated?
  • Latency percentiles? (p50, p95, p99)
  • Recent deployments?
  • Resource utilization?
  • Dependency health?

Log Aggregation

# Recent errors
grep -i error /var/log/app/app.log | tail -100

# Error frequency
grep -i error /var/log/app/app.log | cut -d' ' -f1-2 | uniq -c | tail -20

# With structured logs (JSON)
jq 'select(.level == "error")' app.log | tail -20

Metrics First

If you have Prometheus/Grafana:

  • Check if this is new or recurring
  • Look for correlation with deployments
  • Compare to same time yesterday/last week
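Those checks can also be run from the command line against the Prometheus HTTP API. A hedged sketch, assuming Prometheus listens on localhost:9090 and that a counter named http_requests_total with a status label exists (adjust both to your setup):

```shell
# Query the current 5xx rate and the same window yesterday via the
# Prometheus HTTP API. Endpoint and metric name are assumptions.
PROM="http://localhost:9090"
Q_NOW='sum(rate(http_requests_total{status=~"5.."}[5m]))'
Q_YESTERDAY='sum(rate(http_requests_total{status=~"5.."}[5m] offset 1d))'

curl -s "$PROM/api/v1/query" --data-urlencode "query=$Q_NOW" || true
curl -s "$PROM/api/v1/query" --data-urlencode "query=$Q_YESTERDAY" || true
```

The offset 1d modifier shifts the range selector back 24 hours, which answers "compare to same time yesterday" in a single query.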

Safe Investigation Commands

System Overview

# Load and uptime
uptime

# Memory
free -h

# Disk
df -h

# Top processes
top -bn1 | head -20

# IO wait
iostat -x 1 3

Process Investigation

# Find your process
pgrep -a myapp

# Process details
ps aux | grep myapp

# Open files
lsof -p $(pgrep myapp) | head -50

# File descriptors count
ls /proc/$(pgrep myapp)/fd | wc -l

# Network connections
ss -tunapl | grep myapp

Network Investigation

# Connection states
ss -s

# Connections to specific port
ss -tan | grep :8080 | awk '{print $4}' | sort | uniq -c | sort -rn

# DNS resolution
dig api.example.com

# Connectivity test
curl -w "\ntime_total: %{time_total}s\n" -o /dev/null -s https://api.example.com/health
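A single probe can mislead: one request may hit a cold cache or a single bad backend. Sampling a few times shows the variance. A sketch with a placeholder URL:

```shell
# Take several latency samples; -m bounds each request at 5 seconds.
url="https://api.example.com/health"   # placeholder, substitute your endpoint
for i in 1 2 3; do
  curl -o /dev/null -s -m 5 -w "%{http_code} %{time_total}s\n" "$url" || echo "request failed"
done
```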

Application Logs

# Follow logs
tail -f /var/log/app/app.log

# Recent errors with context
grep -B5 -A5 "ERROR" /var/log/app/app.log | tail -100

# Time-bounded search
awk '$0 >= "2024-01-15 10:00" && $0 <= "2024-01-15 11:00"' app.log

# Count by error type
grep ERROR app.log | sed 's/.*ERROR: //' | cut -d: -f1 | sort | uniq -c | sort -rn

Strace for System Calls

When you need to see what a process is actually doing:

# Attach to running process
strace -p $(pgrep myapp) -f -e trace=network

# Just file operations (modern libc calls openat rather than open)
strace -p $(pgrep myapp) -e trace=open,openat,read,write

# With timing
strace -p $(pgrep myapp) -T -e trace=all

# Output to file
strace -p $(pgrep myapp) -o /tmp/strace.log

Warning: strace adds overhead. Use briefly on production.
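One way to enforce that warning is to cap the trace with timeout(1) so a forgotten session can't degrade production indefinitely. A sketch; the 30-second cap and output path are arbitrary choices, and myapp is a placeholder process name:

```shell
# Attach for at most 30 seconds, then detach automatically; -c prints a
# per-syscall count/latency summary instead of logging every call.
pid="$(pgrep -n myapp || true)"
if [ -n "$pid" ]; then
  timeout 30 strace -c -p "$pid" -o /tmp/strace-summary.txt
else
  echo "myapp is not running"
fi
```

The -c summary is usually the right first step anyway: it tells you which syscall dominates before you drown in a full trace.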

Database Investigation

PostgreSQL

-- Active queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds';

-- Lock waits (PostgreSQL 9.6+)
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       query AS blocked_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Table stats
SELECT relname, seq_scan, idx_scan, n_live_tup 
FROM pg_stat_user_tables 
ORDER BY seq_scan DESC LIMIT 10;

MySQL

-- Running queries
SHOW PROCESSLIST;

-- InnoDB status
SHOW ENGINE INNODB STATUS\G

-- Table lock waits
SHOW OPEN TABLES WHERE In_use > 0;

Redis

# Memory usage
redis-cli INFO memory

# Slow log
redis-cli SLOWLOG GET 10

# Connected clients
redis-cli CLIENT LIST

# Big keys
redis-cli --bigkeys

Container Debugging

Docker

# Container stats
docker stats --no-stream

# Container logs
docker logs --tail 100 -f container_name

# Exec into container
docker exec -it container_name /bin/sh

# Container processes
docker top container_name

# Inspect container
docker inspect container_name | jq '.[0].State'

Kubernetes

# Pod status
kubectl get pods -o wide

# Pod events
kubectl describe pod pod_name

# Pod logs
kubectl logs pod_name --tail=100 -f
kubectl logs pod_name --previous  # Crashed container

# Exec into pod
kubectl exec -it pod_name -- /bin/sh

# Resource usage
kubectl top pods

# Events
kubectl get events --sort-by='.lastTimestamp' | tail -20

Memory Issues

# Memory by process
ps aux --sort=-%mem | head -10

# Heap dump (Java)
jmap -dump:live,format=b,file=/tmp/heap.hprof $(pgrep java)

# Memory map
pmap -x $(pgrep myapp)

# OOM killer logs
dmesg | grep -i "killed process"
journalctl -k | grep -i "out of memory"

Thread/Goroutine Issues

# Thread count
ps -o nlwp= -p $(pgrep -n myapp)

# Thread states (Java)
jstack $(pgrep java) > /tmp/threads.txt

# Go pprof (if enabled)
curl http://localhost:6060/debug/pprof/goroutine?debug=2

Network Debugging

# Packet capture
tcpdump -i any port 8080 -w /tmp/capture.pcap -c 1000

# HTTP traffic (plain)
tcpdump -i any -A -l port 80 | grep -E 'GET|POST|HTTP'

# Connection timing
curl -w "@curl-format.txt" -o /dev/null -s https://example.com

# curl-format.txt:
#    time_namelookup:  %{time_namelookup}s\n
#       time_connect:  %{time_connect}s\n
#    time_appconnect:  %{time_appconnect}s\n
#      time_redirect:  %{time_redirect}s\n
#   time_pretransfer:  %{time_pretransfer}s\n
# time_starttransfer:  %{time_starttransfer}s\n
#                      ----------\n
#         time_total:  %{time_total}s\n

What NOT To Do

Don’t

  • Run rm, kill -9, or restarts without understanding why
  • Deploy a “fix” under pressure without code review
  • Clear logs before saving them
  • Make multiple changes at once
  • Debug in production with IDE debuggers attached
  • Add print statements to production code
  • Ignore the metrics and rely on intuition

Do

  • Screenshot/save the current state
  • Communicate with your team
  • Check if this happened before
  • Make one change at a time
  • Document what you find
  • Create a timeline of events

Post-Incident

After stabilizing:

  1. Save evidence: logs, metrics, screenshots
  2. Write timeline: what happened when
  3. Root cause analysis: why did this happen
  4. Prevention: how do we stop it recurring
  5. Detection: how do we catch it faster next time
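Step 1 is worth scripting so it happens before memories fade and log rotation kicks in. A minimal sketch; the paths and log locations are illustrative:

```shell
# Bundle logs, kernel messages, and basic state into one timestamped archive.
ev="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$ev"
cp /var/log/app/app.log "$ev/" 2>/dev/null || true   # app logs, if present
dmesg > "$ev/dmesg.txt" 2>/dev/null || true          # kernel ring buffer (may need root)
uptime > "$ev/uptime.txt"
tar czf "$ev.tar.gz" -C /tmp "$(basename "$ev")"
echo "evidence archived at $ev.tar.gz"
```

Attach the archive to the incident ticket; it anchors the timeline and the root cause analysis in facts rather than recollection.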

The best debugging skill is knowing what questions to ask before touching anything.