Something’s slow. Users are complaining. Your monitoring shows high latency but not why. You SSH into the server and need to figure out what’s wrong — fast.
This is a systematic approach to Linux performance debugging.
## Start with the Big Picture

Before diving deep, get an overview:

```bash
# System load and uptime
uptime
# 17:00:00 up 45 days, load average: 8.52, 4.23, 2.15
# Load average: 1/5/15 minute averages
# Compare to CPU count: load 8 on 4 CPUs = overloaded

# Quick health check
top -bn1 | head -20
```

## CPU: Who’s Using It?

```bash
# Real-time CPU usage
top
# Sort by CPU: press 'P'
# Sort by memory: press 'M'
# Show threads: press 'H'
```

```bash
# CPU usage by process (snapshot)
ps aux --sort=-%cpu | head -10

# CPU time accumulated
ps aux --sort=-time | head -10
```

```bash
# Per-CPU breakdown
mpstat -P ALL 1 5
# Shows each CPU core's utilization
# Look for: one core at 100% (single-threaded bottleneck)
```

```bash
# What's the CPU actually doing?
vmstat 1 5
#  r  b swpd   free   buff   cache si so bi bo  in  cs us sy id wa st
#  3  0    0 245612 128940 2985432  0  0  8 24 312 892 45 12 40  3  0
# r:  processes waiting for CPU (high = CPU-bound)
# b:  processes blocked on I/O (high = I/O-bound)
# us: user CPU time
# sy: system CPU time
# wa: waiting for I/O
# id: idle
```

## Memory: Running Out?

```bash
# Memory overview
free -h
#       total  used  free   shared  buff/cache  available
# Mem:   31Gi  12Gi  2.1Gi    1.2Gi        17Gi       17Gi
# "available" is what matters, not "free"
# Linux uses free memory for cache — that's good
```

```bash
# Top memory consumers
ps aux --sort=-%mem | head -10

# Memory details per process
pmap -x <pid>

# Or from /proc
cat /proc/<pid>/status | grep -i mem
```

```bash
# Check for OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
```

```bash
# Swap usage (if swapping, you're in trouble)
swapon -s
vmstat 1 5   # si/so columns show swap in/out
```

## Disk I/O: The Hidden Bottleneck

```bash
# I/O wait in top
top
# Look at %wa in the CPU line

# Detailed I/O stats
iostat -xz 1 5
# Device   r/s     w/s     rkB/s  wkB/s  await  %util
# sda      150.00  200.00  4800   8000   12.5   85.0
# await: average I/O wait time (ms) - high = slow disk
# %util: disk utilization - 100% = saturated
```

```bash
# Which processes are doing I/O?
iotop -o    # shows only processes actively doing I/O

# Or without iotop
pidstat -d 1 5
```

```bash
# Check for disk errors
dmesg | grep -i error
smartctl -a /dev/sda   # SMART data for disk health
```

## Network: Connections and Throughput

```bash
# Network interface stats
ip -s link show
# Watch for errors, drops, overruns

# Real-time bandwidth
iftop
# Or
nload
```

```bash
# Connection states
ss -s
# Total: 1523 (kernel 1847)
# TCP: 892 (estab 654, closed 89, orphaned 12, timewait 127)
# Many TIME_WAIT: connection churn
# Many ESTABLISHED: legitimate load or connection leak
```

```bash
# Connections per state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Connections by remote IP
ss -tn | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
```

```bash
# Check for packet loss
ping -c 100 <gateway>
# Any loss = network problem

# TCP retransmits
netstat -s | grep -i retrans
```

## Process-Level Debugging

```bash
# What's a specific process doing?
# (note: strace slows the target process significantly — use briefly)
strace -p <pid> -c                    # summary of system calls
strace -p <pid> -f -e trace=network   # network-related calls only
```

```bash
# Open files and connections
lsof -p <pid>

# Just network connections
lsof -a -i -p <pid>   # -a ANDs the filters; without it, lsof ORs them

# Files in a directory
lsof +D /var/log
```

```bash
# Process threads
ps -T -p <pid>

# Thread CPU usage
top -H -p <pid>
```

## System-Wide Tracing

```bash
# What's happening system-wide?
perf top
# Real-time view of where CPU cycles go

# Record and analyze
perf record -g -a sleep 10
perf report
```

```bash
# BPF-based tools (if available)
# CPU usage by function
profile-bpfcc 10

# Disk latency histogram
biolatency-bpfcc

# TCP connection latency
tcpconnlat-bpfcc
```

## The Checklist Approach

When debugging, go through systematically:
...
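The first-minute overview can be bundled into a small triage script. This is my own sketch, not part of the article: it strings together the commands from the sections above (`uptime`, `nproc`, `free`, `ps`, `vmstat`), and the section labels and ordering are illustrative assumptions.

```shell
#!/bin/sh
# triage.sh — first-minute health snapshot (sketch; commands from the article,
# layout and labels are my own)

triage() {
    echo "== Load average vs CPU count =="
    uptime                  # sustained load above nproc = CPU saturated
    echo "CPUs: $(nproc)"

    echo "== Memory ('available' is what matters, not 'free') =="
    free -h

    echo "== Top CPU consumers =="
    ps aux --sort=-%cpu | head -6

    echo "== Top memory consumers =="
    ps aux --sort=-%mem | head -6

    echo "== Blocked processes / swap activity =="
    # b column: blocked on I/O; si/so columns: swap in/out
    command -v vmstat >/dev/null && vmstat 1 2 | tail -3
}

triage
```

Run it as the very first step of an incident; the output tells you which of the deeper sections above to go to next.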