When a server is slow and people are yelling, you need a systematic approach. Here’s what to run in the first five minutes.

The Checklist

uptime
dmesg | tail
vmstat 1 5
mpstat -P ALL 1 3
pidstat 1 3
iostat -xz 1 3
free -h
sar -n DEV 1 3

Let’s break down what each tells you.

1. uptime

$ uptime
 16:30:01 up 45 days,  3:22,  2 users,  load average: 8.42, 6.31, 5.12

Load averages: 1-minute, 5-minute, 15-minute.

  • Load increasing (8.42 > 6.31 > 5.12 above): the problem is recent and still building
  • Load decreasing: the problem may be resolving
  • Load equals CPU count: fully utilized
  • Load ≫ CPU count: something is waiting

Quick rule: if 1-min load > (2 × CPU cores), investigate immediately.
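
That rule is easy to script. A minimal sketch using /proc/loadavg and nproc — the 2× threshold is the heuristic above, not a kernel constant:

```shell
#!/bin/sh
# Compare the 1-minute load average against 2x the core count.
cores=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)

# awk handles the float comparison; exit 0 means "over threshold"
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > 2 * c) }'; then
    echo "ALERT: load $load1 > 2 x $cores cores"
else
    echo "OK: load $load1 within 2 x $cores cores"
fi
```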

2. dmesg | tail

$ dmesg | tail
[3214.523] Out of memory: Kill process 12345 (java) score 892
[3214.601] oom-killer: Killed process 12345

Kernel messages reveal:

  • OOM kills
  • Hardware errors
  • Network issues
  • Disk problems

If you see OOM kills, you found your problem.
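
The victim PID and name can be pulled out of that line with a quick awk pass. The sample line below is the one from the dmesg output above; in practice you would pipe dmesg itself:

```shell
# Parse the victim PID and process name out of an OOM-kill line.
line='[3214.523] Out of memory: Kill process 12345 (java) score 892'
echo "$line" | awk '{
    for (i = 1; i <= NF; i++)
        if ($i == "process")
            print "killed PID", $(i+1), "name", $(i+2)
}'
# prints: killed PID 12345 name (java)
```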

3. vmstat 1 5

$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  2  10240  12340  45678 234567    0    5   102  2048 1523 2341 45 12 38  5  0

Key columns:

  • r: runnable processes, running or waiting for CPU (> CPU count = saturated)
  • b: processes blocked on I/O
  • si/so: swap in/out (should be 0, any value = memory pressure)
  • us: user CPU %
  • sy: system CPU %
  • wa: I/O wait % (high = disk bottleneck)
  • id: idle %

If wa is high, it’s disk. If us+sy is high, it’s CPU.
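
vmstat gets those numbers from /proc/stat. If sysstat isn't installed, a rough since-boot breakdown can be read directly — these are cumulative counters, so it's not an interval measurement like vmstat 1, and irq/softirq/steal are left out of this sketch:

```shell
#!/bin/sh
# First line of /proc/stat: cpu user nice system idle iowait irq softirq ...
# (values are cumulative clock ticks since boot)
read -r _ us ni sy id wa _rest < /proc/stat
total=$((us + ni + sy + id + wa))
echo "user+sys: $(((us + ni + sy) * 100 / total))%"
echo "iowait:   $((wa * 100 / total))%"
echo "idle:     $((id * 100 / total))%"
```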

4. mpstat -P ALL 1 3

$ mpstat -P ALL 1 3
CPU    %usr   %sys   %iowait   %idle
all    45.2   12.1      5.3     37.4
  0    98.1    1.2      0.0      0.7
  1     2.3    0.8      0.0     96.9
  2     2.1    0.5      0.0     97.4
  3    78.2   45.6      0.0      0.0

Per-CPU breakdown reveals:

  • Single-threaded bottleneck (one core maxed)
  • Kernel/interrupt storms (high %sys on one core)
  • Even distribution (good parallelization)
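
When one core is maxed (CPU 0 above), it is usually one hot thread, and ps can confirm which. A sketch — $$ (this shell) is used only so the line is runnable as-is; substitute the PID you found:

```shell
# List one process's threads, hottest first (Linux procps).
# tid=,pcpu=,comm= suppresses headers so sort sees only data rows.
ps -L -p $$ -o tid=,pcpu=,comm= | sort -k2 -rn | head -5
```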

5. pidstat 1 3

$ pidstat 1 3
PID    %usr  %system  %CPU   Command
12345  89.2     8.1   97.3   java
12346  12.3     2.1   14.4   nginx

Which process is eating resources? Now you have a target.

Add -d for disk I/O per process:

$ pidstat -d 1 3
PID   kB_rd/s   kB_wr/s   Command
12345  102400    51200    postgres

6. iostat -xz 1 3

$ iostat -xz 1 3
Device  r/s    w/s    rkB/s   wkB/s  await  %util
sda    156.2  234.1   8234   12456   12.3   89.2

Key columns:

  • await: average I/O wait time in ms (>10ms = slow)
  • %util: device utilization (>80% = saturated)
  • r/s, w/s: read/write operations per second

If %util is high and await is high, disk is the bottleneck.
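
iostat reads these counters from /proc/diskstats, so a rough fallback exists when sysstat isn't installed — note the values are cumulative since boot, not per-second; the device-name pattern is an assumption covering common names:

```shell
# /proc/diskstats fields: 3=device, 4=reads completed, 8=writes completed
awk '$3 ~ /^(sd|nvme|vd|xvd)/ {
    printf "%-10s reads: %-12s writes: %s\n", $3, $4, $8
}' /proc/diskstats
```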

7. free -h

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           16Gi        12Gi       512Mi       256Mi        3.2Gi        3.0Gi
Swap:          4.0Gi       2.1Gi       1.9Gi

Focus on available, not free. Linux uses free memory for cache.

If available is low AND swap is being used, you need more RAM.
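
The same check, scripted from /proc/meminfo. MemAvailable is the kernel's own estimate behind free's available column; the 10% cutoff below is illustrative, not a standard:

```shell
# Report available memory as a percentage of total RAM.
awk '/^MemTotal:/     { total = $2 }
     /^MemAvailable:/ { avail = $2 }
     END {
         pct = avail * 100 / total
         printf "available: %.0f%% of RAM\n", pct
         if (pct < 10) print "WARNING: memory pressure likely"
     }' /proc/meminfo
```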

8. sar -n DEV 1 3

$ sar -n DEV 1 3
IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s
eth0    12345.2    8901.3   102400.1   51200.5

Network saturation check:

  • Compare to link speed (1Gbps ≈ 125,000 kB/s)
  • Look for packet drops: sar -n EDEV 1 3
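
The denominator for that comparison lives in sysfs, in Mbit/s. eth0 is an assumption — substitute your interface — and virtual interfaces often report -1:

```shell
#!/bin/sh
IFACE=eth0    # substitute your interface (see: ls /sys/class/net)
if [ -r "/sys/class/net/$IFACE/speed" ]; then
    speed=$(cat "/sys/class/net/$IFACE/speed")
    # 1 Mbit/s = 125 kB/s, so a 1000 Mbit/s link tops out near 125,000 kB/s
    echo "$IFACE: ${speed} Mbit/s (~$((speed * 125)) kB/s)"
else
    echo "$IFACE not found"
fi
```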

Quick Diagnosis Tree

Is load high?
├─ No  → look beyond this host: network, external service, or the application
└─ Yes → Is iowait (wa) high?
    ├─ Yes → disk bottleneck: iostat -xz, then pidstat -d for the process
    └─ No  → Is us+sy high?
        ├─ Yes → CPU bound: pidstat for the process, then profile
        └─ No  → Is swap active (si/so > 0)?
            ├─ Yes → memory pressure: free -h, find the big consumers
            └─ No  → check the network: sar -n DEV, sar -n EDEV (drops)

Going Deeper

Once you’ve identified the bottleneck:

CPU bound:

perf top                    # Live CPU profiling
perf record -g -p $PID      # Record profile
perf report                 # Analyze

Memory issues:

ps aux --sort=-%mem | head  # Top memory users
pmap -x $PID                # Process memory map

Disk issues:

iotop -aoP                  # Live I/O by process
lsof +D /path               # What's using this path?

Network issues:

ss -tunapl                  # Active connections
tcpdump -i eth0 port 80     # Packet capture

The One-Liner

If you only have 30 seconds:

uptime; vmstat 1 3; iostat -xz 1 1; free -h

Load, CPU, memory, disk, swap. Covers 90% of cases.

The goal isn’t to memorize everything—it’s to have a systematic approach. Start broad, identify the resource under pressure, then dig in.