Something’s slow. Users are complaining. Your monitoring shows high latency but not why. You SSH into the server and need to figure out what’s wrong — fast.

This is a systematic approach to Linux performance debugging.

Start with the Big Picture

Before diving deep, get an overview:

# System load and uptime
uptime
# 17:00:00 up 45 days, load average: 8.52, 4.23, 2.15

# Load average: 1/5/15 minute averages
# Compare to CPU count: load 8 on 4 CPUs = overloaded
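That comparison can be scripted; a minimal sketch, assuming a Linux box where /proc/loadavg and nproc are available:

```shell
# Read the 1-minute load average and divide by core count
load=$(cut -d' ' -f1 /proc/loadavg)
cores=$(nproc)
awk -v l="$load" -v c="$cores" \
    'BEGIN { printf "load/core = %.2f %s\n", l/c, (l > c) ? "(overloaded)" : "(ok)" }'
```

A ratio well above 1.0 sustained over the 5- and 15-minute averages is the signal; a brief spike in the 1-minute number usually isn't.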
# Quick health check
top -bn1 | head -20

CPU: Who’s Using It?

# Real-time CPU usage
top

# Sort by CPU: press 'P'
# Sort by memory: press 'M'
# Show threads: press 'H'
# CPU usage by process (snapshot)
ps aux --sort=-%cpu | head -10

# CPU time accumulated
ps aux --sort=-time | head -10
# Per-CPU breakdown
mpstat -P ALL 1 5
# Shows each CPU core's utilization
# Look for: one core at 100% (single-threaded bottleneck)
# What's the CPU actually doing?
vmstat 1 5
#  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
#  3  0      0 245612 128940 2985432    0    0     8    24  312  892 45 12 40  3  0

# r: processes waiting for CPU (high = CPU-bound)
# b: processes blocked on I/O (high = I/O-bound)
# us: user CPU time
# sy: system CPU time
# wa: waiting for I/O
# id: idle
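If vmstat isn't installed, the same counters are readable directly; a rough sketch using procs_running and procs_blocked from /proc/stat, which back vmstat's r and b columns:

```shell
# procs_running ~ vmstat's r, procs_blocked ~ vmstat's b
r=$(awk '/^procs_running/ {print $2}' /proc/stat)
b=$(awk '/^procs_blocked/ {print $2}' /proc/stat)
cores=$(nproc)
if [ "$b" -gt 0 ]; then
    echo "I/O pressure: $b tasks blocked"
elif [ "$r" -gt "$cores" ]; then
    echo "CPU pressure: $r runnable tasks on $cores cores"
else
    echo "no obvious saturation"
fi
```

This is a single sample; take a few over several seconds before drawing conclusions, just as vmstat's interval mode does.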

Memory: Running Out?

# Memory overview
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           31Gi        12Gi       2.1Gi       1.2Gi        17Gi        17Gi

# "available" is what matters, not "free"
# Linux uses free memory for cache — that's good
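free's "available" column is derived from MemAvailable in /proc/meminfo; a quick sketch that reports the same number as a percentage:

```shell
# Percentage of RAM actually available to new allocations
awk '/^MemTotal:/ {t=$2} /^MemAvailable:/ {a=$2}
     END { printf "available: %.0f%% of %.1f GiB\n", 100*a/t, t/1048576 }' /proc/meminfo
```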
# Top memory consumers
ps aux --sort=-%mem | head -10

# Memory details per process
pmap -x <pid>

# Or from /proc
cat /proc/<pid>/status | grep -i mem
# Check for OOM killer activity
dmesg | grep -i "out of memory"
journalctl -k | grep -i oom
# Swap usage (if swapping, you're in trouble)
swapon -s
vmstat 1 5  # si/so columns show swap in/out

Disk I/O: The Hidden Bottleneck

# I/O wait in top
top
# Look at %wa in the CPU line

# Detailed I/O stats
iostat -xz 1 5
# Device   r/s     w/s    rkB/s   wkB/s  await  %util
# sda     150.00  200.00  4800    8000   12.5   85.0

# await: average I/O wait time (ms) - high = slow disk
# %util: disk utilization - 100% = saturated
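If iostat isn't installed, the raw counters it reads live in /proc/diskstats; a sketch printing the fields behind %util (field 13 is milliseconds spent doing I/O):

```shell
# Fields: $3=device, $12=I/Os currently in flight, $13=ms spent doing I/O
awk '{ printf "%-10s in_flight=%s ms_doing_io=%s\n", $3, $12, $13 }' /proc/diskstats
```

Sampling this twice and diffing ms_doing_io over the interval gives utilization, which is essentially what iostat computes for you.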
# Which processes are doing I/O?
iotop -o
# Shows only processes actively doing I/O

# Or without iotop
pidstat -d 1 5
# Check for disk errors
dmesg | grep -i error
smartctl -a /dev/sda  # SMART data for disk health

Network: Connections and Throughput

# Network interface stats
ip -s link show
# Watch for errors, drops, overruns

# Real-time bandwidth
iftop
# Or
nload
# Connection states
ss -s
# Total: 1523 (kernel 1847)
# TCP:   892 (estab 654, closed 89, orphaned 12, timewait 127)

# Many TIME_WAIT: connection churn
# Many ESTABLISHED: legitimate load or connection leak
# Connections per state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Connections by remote IP
ss -tn | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head
# Check for packet loss
ping -c 100 <gateway>
# Any loss = network problem

# TCP retransmits
netstat -s | grep -i retrans

Process-Level Debugging

# What's a specific process doing?
strace -p <pid> -c
# Summary of system calls

strace -p <pid> -f -e trace=network
# Network-related calls only
# Open files and connections
lsof -p <pid>

# Just network connections (-a ANDs the filters; without it, -i and -p are ORed)
lsof -a -i -p <pid>

# Files in a directory
lsof +D /var/log
# Process threads
ps -T -p <pid>

# Thread CPU usage
top -H -p <pid>

System-Wide Tracing

# What's happening system-wide?
perf top
# Real-time view of where CPU cycles go

# Record and analyze
perf record -g -a sleep 10
perf report
# BPF-based tools (if available)
# CPU usage by function
profile-bpfcc 10

# Disk latency histogram
biolatency-bpfcc

# TCP connection latency
tcpconnlat-bpfcc

The Checklist Approach

When debugging, go through systematically:

# 1. Load and saturation
uptime
vmstat 1 5

# 2. CPU
mpstat -P ALL 1 5
pidstat 1 5

# 3. Memory
free -h
vmstat 1 5  # si/so columns
ps aux --sort=-%mem | head

# 4. Disk I/O
iostat -xz 1 5
iotop -o

# 5. Network
ss -s
iftop

# 6. Errors
dmesg | tail -50
journalctl -p err -n 50

Common Patterns

High load, low CPU utilization: → I/O bound. Check iostat, iotop.

One CPU at 100%, others idle: → Single-threaded bottleneck. Find the process, consider parallelization.

High memory usage, no swap, still slow: → Might be page cache pressure forcing re-reads from disk. Check I/O.

Network timeouts, no packet loss: → Application-level issue. Check connection pools, file descriptors.
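A descriptor or connection-pool leak shows up as a climbing fd count; a sketch comparing open descriptors to the soft limit (the pid here is illustrative — point it at the suspect process):

```shell
pid=$$  # example: the current shell; substitute the suspect pid
open=$(ls /proc/$pid/fd | wc -l)
limit=$(awk '/^Max open files/ {print $4}' /proc/$pid/limits)
echo "pid $pid: $open open fds (soft limit $limit)"
```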

Gradual degradation over time: → Memory leak, connection leak, or file descriptor leak. Compare metrics to last week.
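Comparing to last week only works if you kept a baseline; a minimal sketch that snapshots a few counters to a timestamped file (the path is illustrative — extend the block with free, ss -s, and iostat output, and store it somewhere persistent):

```shell
# Capture a baseline to diff against later
snap=$(mktemp /tmp/perf-baseline.XXXXXX)
{
    date
    cat /proc/loadavg
    grep -E '^(MemTotal|MemAvailable):' /proc/meminfo
} > "$snap"
echo "wrote $snap"
```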

Quick Reference

Symptom          First Check           Deep Dive
High load        uptime, vmstat        mpstat, iostat
Slow app         top, ps               strace, perf
Out of memory    free, dmesg           pmap, /proc/<pid>/status
Disk slow        iostat                iotop, blktrace
Network issues   ss -s, ip -s link     tcpdump, iftop

Performance debugging is detective work. Start broad (uptime, vmstat, top), identify the resource under pressure (CPU, memory, disk, network), then drill down to the specific process or operation.

The tools are simple. The skill is knowing which to reach for and how to interpret the output. Practice on healthy systems so you’re ready when things break.