Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how.

Rule Zero: Don’t Make It Worse

Before touching anything:

  • Don’t restart services until you understand the problem
  • Don’t deploy fixes without knowing the root cause
  • Don’t clear logs you might need for investigation
  • Don’t scale down what might be handling load

Stabilize first, investigate second, fix third.
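One cheap way to honor Rule Zero is to snapshot system state before you change anything, so evidence survives any later restarts. A minimal sketch; the directory location and file names are illustrative, not a standard:

```shell
# Capture basic system state into a timestamped directory.
snap="/tmp/incident-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$snap"
uptime  > "$snap/uptime.txt"
df -h   > "$snap/disk.txt"
ps aux  > "$snap/processes.txt"
free -h > "$snap/memory.txt" 2>/dev/null || true   # Linux-only
echo "state saved to $snap"
```

Thirty seconds spent here costs nothing; the same data after a restart is gone forever.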

Start With Observability

Check Dashboards

Before SSH-ing anywhere:

  • Error rate elevated?
  • Latency percentiles? (p50, p95, p99)
  • Recent deployments?
  • Resource utilization?
  • Dependency health?

Log Aggregation

# Recent errors
grep -i error /var/log/app/app.log | tail -100

# Error frequency
grep -i error /var/log/app/app.log | cut -d' ' -f1-2 | uniq -c | tail -20

# With structured logs (JSON)
jq 'select(.level == "error")' app.log | tail -20

Metrics First

If you have Prometheus/Grafana:

  • Check if this is new or recurring
  • Look for correlation with deployments
  • Compare to same time yesterday/last week
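Those checks can also be run from the command line against the Prometheus HTTP API. A hedged sketch, assuming Prometheus listens on localhost:9090 and that a counter named http_requests_total with a status label exists (adjust both to your setup):

```shell
# Query the current 5xx rate and the same window yesterday via the
# Prometheus HTTP API. Endpoint and metric name are assumptions.
PROM="http://localhost:9090"
Q_NOW='sum(rate(http_requests_total{status=~"5.."}[5m]))'
Q_YESTERDAY='sum(rate(http_requests_total{status=~"5.."}[5m] offset 1d))'

curl -s "$PROM/api/v1/query" --data-urlencode "query=$Q_NOW" || true
curl -s "$PROM/api/v1/query" --data-urlencode "query=$Q_YESTERDAY" || true
```

The offset 1d modifier shifts the range selector back 24 hours, which answers "compare to same time yesterday" in a single query.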

Safe Investigation Commands

System Overview

# Load and uptime
uptime

# Memory
free -h

# Disk
df -h

# Top processes
top -bn1 | head -20

# IO wait
iostat -x 1 3

Process Investigation

# Find your process
pgrep -a myapp

# Process details
ps aux | grep myapp

# Open files
lsof -p $(pgrep myapp) | head -50

# File descriptors count
ls /proc/$(pgrep myapp)/fd | wc -l

# Network connections
ss -tunapl | grep myapp

Network Investigation

# Connection states
ss -s

# Connections to specific port
ss -tan | grep :8080 | awk '{print $4}' | sort | uniq -c | sort -rn

# DNS resolution
dig api.example.com

# Connectivity test
curl -w "\ntime_total: %{time_total}s\n" -o /dev/null -s https://api.example.com/health
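A single probe can mislead: one request may hit a cold cache or a single bad backend. Sampling a few times shows the variance. A sketch with a placeholder URL:

```shell
# Take several latency samples; -m bounds each request at 5 seconds.
url="https://api.example.com/health"   # placeholder, substitute your endpoint
for i in 1 2 3; do
  curl -o /dev/null -s -m 5 -w "%{http_code} %{time_total}s\n" "$url" || echo "request failed"
done
```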

Application Logs

# Follow logs
tail -f /var/log/app/app.log

# Recent errors with context
grep -B5 -A5 "ERROR" /var/log/app/app.log | tail -100

# Time-bounded search
awk '$0 >= "2024-01-15 10:00" && $0 <= "2024-01-15 11:00"' app.log

# Count by error type
grep ERROR app.log | sed 's/.*ERROR: //' | cut -d: -f1 | sort | uniq -c | sort -rn

Strace for System Calls

When you need to see what a process is actually doing:

# Attach to running process
strace -p $(pgrep myapp) -f -e trace=network

# Just file operations (modern libc calls openat rather than open)
strace -p $(pgrep myapp) -e trace=open,openat,read,write

# With timing
strace -p $(pgrep myapp) -T -e trace=all

# Output to file
strace -p $(pgrep myapp) -o /tmp/strace.log

Warning: strace adds overhead. Use briefly on production.
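One way to enforce that warning is to cap the trace with timeout(1) so a forgotten session can't degrade production indefinitely. A sketch; the 30-second cap and output path are arbitrary choices, and myapp is a placeholder process name:

```shell
# Attach for at most 30 seconds, then detach automatically; -c prints a
# per-syscall count/latency summary instead of logging every call.
pid="$(pgrep -n myapp || true)"
if [ -n "$pid" ]; then
  timeout 30 strace -c -p "$pid" -o /tmp/strace-summary.txt
else
  echo "myapp is not running"
fi
```

The -c summary is usually the right first step anyway: it tells you which syscall dominates before you drown in a full trace.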

Database Investigation

PostgreSQL

-- Active queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query, state
FROM pg_stat_activity
WHERE (now() - pg_stat_activity.query_start) > interval '5 seconds';

-- Lock waits (PostgreSQL 9.6+)
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       query AS blocked_query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Table stats
SELECT relname, seq_scan, idx_scan, n_live_tup 
FROM pg_stat_user_tables 
ORDER BY seq_scan DESC LIMIT 10;

MySQL

-- Running queries
SHOW PROCESSLIST;

-- InnoDB status
SHOW ENGINE INNODB STATUS\G

-- Table lock waits
SHOW OPEN TABLES WHERE In_use > 0;

Redis

# Memory usage
redis-cli INFO memory

# Slow log
redis-cli SLOWLOG GET 10

# Connected clients
redis-cli CLIENT LIST

# Big keys
redis-cli --bigkeys

Container Debugging

Docker

# Container stats
docker stats --no-stream

# Container logs
docker logs --tail 100 -f container_name

# Exec into container
docker exec -it container_name /bin/sh

# Container processes
docker top container_name

# Inspect container
docker inspect container_name | jq '.[0].State'

Kubernetes

# Pod status
kubectl get pods -o wide

# Pod events
kubectl describe pod pod_name

# Pod logs
kubectl logs pod_name --tail=100 -f
kubectl logs pod_name --previous  # Crashed container

# Exec into pod
kubectl exec -it pod_name -- /bin/sh

# Resource usage
kubectl top pods

# Events
kubectl get events --sort-by='.lastTimestamp' | tail -20

Memory Issues

# Memory by process
ps aux --sort=-%mem | head -10

# Heap dump (Java)
jmap -dump:live,format=b,file=/tmp/heap.hprof $(pgrep java)

# Memory map
pmap -x $(pgrep myapp)

# OOM killer logs
dmesg | grep -i "killed process"
journalctl -k | grep -i "out of memory"

Thread/Goroutine Issues

# Thread count
ps -o nlwp= -p $(pgrep -n myapp)

# Thread states (Java)
jstack $(pgrep java) > /tmp/threads.txt

# Go pprof (if enabled)
curl http://localhost:6060/debug/pprof/goroutine?debug=2

Network Debugging

# Packet capture
tcpdump -i any port 8080 -w /tmp/capture.pcap -c 1000

# HTTP traffic (plain)
tcpdump -i any -A -l port 80 | grep -E 'GET|POST|HTTP'

# Connection timing
curl -w "@curl-format.txt" -o /dev/null -s https://example.com

# curl-format.txt:
#    time_namelookup:  %{time_namelookup}s\n
#       time_connect:  %{time_connect}s\n
#    time_appconnect:  %{time_appconnect}s\n
#      time_redirect:  %{time_redirect}s\n
#   time_pretransfer:  %{time_pretransfer}s\n
# time_starttransfer:  %{time_starttransfer}s\n
#                      ----------\n
#         time_total:  %{time_total}s\n

What NOT To Do

Don’t

  • Run rm, kill -9, or restarts without understanding why
  • Deploy a “fix” under pressure without code review
  • Clear logs before saving them
  • Make multiple changes at once
  • Debug in production with IDE debuggers attached
  • Add print statements to production code
  • Ignore the metrics and rely on intuition

Do

  • Screenshot/save the current state
  • Communicate with your team
  • Check if this happened before
  • Make one change at a time
  • Document what you find
  • Create a timeline of events

Post-Incident

After stabilizing:

  1. Save evidence: logs, metrics, screenshots
  2. Write timeline: what happened when
  3. Root cause analysis: why did this happen
  4. Prevention: how do we stop it recurring
  5. Detection: how do we catch it faster next time
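Step 1 is worth scripting so it happens before memories fade and log rotation kicks in. A minimal sketch; the paths and log locations are illustrative:

```shell
# Bundle logs, kernel messages, and basic state into one timestamped archive.
ev="/tmp/evidence-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$ev"
cp /var/log/app/app.log "$ev/" 2>/dev/null || true   # app logs, if present
dmesg > "$ev/dmesg.txt" 2>/dev/null || true          # kernel ring buffer (may need root)
uptime > "$ev/uptime.txt"
tar czf "$ev.tar.gz" -C /tmp "$(basename "$ev")"
echo "evidence archived at $ev.tar.gz"
```

Attach the archive to the incident ticket; it anchors the timeline and the root cause analysis in facts rather than recollection.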

The best debugging skill is knowing what questions to ask before touching anything.