You can grep for lines and cut for columns. But what about “show me the third column of lines containing ERROR, but only if the second column is greater than 100”?
That’s awk territory.
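Spelled out in awk, it is a one-liner (a sketch against made-up sample data; the column layout here is purely for illustration):

```shell
# Hypothetical log lines: time, numeric code, message, level
printf '%s\n' \
  '10:01 42 boot-ok INFO' \
  '10:02 503 disk-full ERROR' \
  '10:03 99 retry ERROR' |
  awk '$2 > 100 && /ERROR/ {print $3}'
# disk-full
```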
## The Basics
awk processes text line by line, splitting each into fields:
```shell
# Print second column (space-delimited by default)
echo "hello world" | awk '{print $2}'
# world

# Print first and third columns
awk '{print $1, $3}' data.txt

# Print entire line
awk '{print $0}' file.txt
```
$1, $2, etc. are fields. $0 is the whole line. NF is the number of fields. NR is the line number.
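A quick demonstration of NR, NF, and the $NF idiom (standard awk):

```shell
# NR: line number; NF: field count; $NF: last field
printf 'a b c\nd e\n' | awk '{print NR, NF, $NF}'
# 1 3 c
# 2 2 e
```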
## Custom Delimiters
```shell
# Colon-separated (like /etc/passwd)
awk -F: '{print $1, $3}' /etc/passwd

# CSV (careful with quoted fields)
awk -F, '{print $2}' data.csv

# Multiple delimiters
awk -F'[,;:]' '{print $1}' mixed.txt

# Tab-delimited
awk -F'\t' '{print $1}' data.tsv
```
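The quoted-field caveat matters: `-F,` happily splits inside quotes. If you have GNU awk, its FPAT variable (a gawk extension, not POSIX) defines what a field looks like rather than what separates fields:

```shell
# gawk only: a field is either a comma-free run or a quoted string
echo '1,"Smith, Jane",42' |
  gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print $2 }'
# "Smith, Jane"
```

For fully general CSV (escaped quotes, newlines inside fields), reach for csvkit or Python's csv module instead.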
## Conditions
```shell
# Lines where column 3 > 100
awk '$3 > 100' data.txt

# Lines containing "ERROR"
awk '/ERROR/' logfile.txt

# Combine conditions
awk '$3 > 100 && /ERROR/' data.txt

# Column equals specific value
awk '$2 == "active"' status.txt

# Regex match on column
awk '$1 ~ /^server/' config.txt
```
## BEGIN and END Blocks
```shell
# Header and footer
awk 'BEGIN {print "Name\tScore"} {print $1, $2} END {print "---done---"}' data.txt

# Initialize variables
awk 'BEGIN {count=0} /ERROR/ {count++} END {print count " errors"}' log.txt

# Set delimiter in BEGIN
awk 'BEGIN {FS=":"} {print $1}' /etc/passwd
```
## Built-in Variables
| Variable | Meaning |
|---|---|
| $0 | Entire line |
| $1, $2... | Fields |
| NF | Number of fields |
| NR | Line number (cumulative across all files) |
| FNR | Line number (current file) |
| FS | Field separator |
| OFS | Output field separator |
| RS | Record separator |
```shell
# Print line numbers
awk '{print NR, $0}' file.txt

# Print last column
awk '{print $NF}' file.txt

# Print second-to-last column
awk '{print $(NF-1)}' file.txt

# Skip header line
awk 'NR > 1 {print $1}' data.csv
```
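Two table entries the examples above don't exercise: OFS only shows up when print joins multiple fields, and setting RS to the empty string switches awk into paragraph mode, where blank lines separate records (both standard awk):

```shell
# OFS: join printed fields with a comma
printf 'a b\nc d\n' | awk 'BEGIN { OFS="," } { print $1, $2 }'
# a,b
# c,d

# RS="": blank-line-separated records; newline also splits fields
printf 'name: a\nrole: x\n\nname: b\nrole: y\n' |
  awk 'BEGIN { RS="" } { print "record " NR ": " $2 }'
# record 1: a
# record 2: b
```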
## Calculations
```shell
# Sum a column
awk '{sum += $3} END {print sum}' sales.txt

# Average
awk '{sum += $3; count++} END {print sum/count}' data.txt

# Max (for min, seed the same way and flip the comparison)
awk 'NR==1 {max=$3} $3>max {max=$3} END {print max}' data.txt

# Running total
awk '{sum += $2; print $1, sum}' transactions.txt
```
## String Functions
```shell
# Length of field
awk '{print $1, length($1)}' names.txt

# Substring: first 3 characters
awk '{print substr($1, 1, 3)}' data.txt

# Find position
awk '{print index($0, "error")}' log.txt

# Split into array
awk '{split($0, parts, ":"); print parts[1]}' data.txt

# Uppercase/lowercase (POSIX awk)
awk '{print toupper($1)}' names.txt
```
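Two more that belong on this list: sub() replaces the first match in place, and gsub() replaces every match and returns the count (both POSIX awk):

```shell
# gsub replaces all matches and returns how many it made
echo 'foo bar foo' | awk '{ n = gsub(/foo/, "baz"); print n, $0 }'
# 2 baz bar baz

# sub replaces only the first match
echo 'foo bar foo' | awk '{ sub(/foo/, "baz"); print }'
# baz bar foo
```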
## Formatted Output
```shell
# Aligned columns
awk '{printf "%-20s %10.2f\n", $1, $2}' data.txt

# Fixed-width output
awk '{printf "%05d %s\n", NR, $0}' file.txt
```
Format specifiers: %s (string), %d (integer), %f (float). A - flag, as in %-20s, left-aligns; a width like %10.2f pads to 10 characters with 2 decimal places.
## Real-World Patterns
### Log Analysis
```shell
# Count HTTP status codes
awk '{print $9}' access.log | sort | uniq -c | sort -rn

# Better: do it all in awk
awk '{codes[$9]++} END {for (c in codes) print codes[c], c}' access.log | sort -rn

# Average response time by endpoint
awk '{times[$7] += $10; counts[$7]++} END {for (e in times) print e, times[e]/counts[e]}' access.log
```
### Process Monitoring
```shell
# Memory usage by process name
ps aux | awk '{mem[$11] += $6} END {for (p in mem) print mem[p], p}' | sort -rn | head

# CPU > 50%
ps aux | awk '$3 > 50 {print $11, $3"%"}'
```
### CSV Processing
```shell
# Sum column 3 where column 1 is "sales"
awk -F, '$1 == "sales" {sum += $3} END {print sum}' data.csv

# Filter and reformat
awk -F, 'NR > 1 && $4 > 1000 {print $1 ": $" $4}' transactions.csv
```
### Disk Usage
```shell
# Count files by extension
find . -type f -name "*.*" | awk -F. '{ext[$NF]++} END {for (e in ext) print ext[e], e}' | sort -rn

# Or sum actual sizes (in MB)
ls -l *.log | awk '{sum += $5} END {print sum/1024/1024 " MB"}'
```
### Configuration Parsing
```shell
# Extract values from key=value files
awk -F= '/^database_host/ {print $2}' config.ini

# Ignore comments and blank lines
awk '!/^#/ && !/^$/ {print}' config.txt
```
## Multi-file Processing
```shell
# Print filename with each line
awk '{print FILENAME, $0}' *.log

# Reset line count per file
awk 'FNR == 1 {print "--- " FILENAME " ---"} {print}' file1.txt file2.txt

# Compare files: lines of file2 whose first field appears in file1
awk 'NR==FNR {a[$1]; next} $1 in a' file1.txt file2.txt
```
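That last NR==FNR trick is worth unpacking: NR equals FNR only while awk reads the first file, so the first action loads its keys into an array and next skips the rest of the program; for the second file, only the `$1 in a` test runs. A sketch with hypothetical file names:

```shell
# Build a lookup set from the first file, filter the second with it
dir=$(mktemp -d)
printf '1\n3\n' > "$dir/ids.txt"
printf '1 apple\n2 banana\n3 cherry\n' > "$dir/records.txt"

awk 'NR==FNR { want[$1]; next } $1 in want' "$dir/ids.txt" "$dir/records.txt"
# 1 apple
# 3 cherry
```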
## Arrays
```shell
# Count occurrences
awk '{counts[$1]++} END {for (k in counts) print k, counts[k]}' data.txt

# Deduplicate
awk '!seen[$0]++' file.txt

# Join lines by key
awk '{data[$1] = data[$1] " " $2} END {for (k in data) print k, data[k]}' pairs.txt
```
## One-Liners Worth Memorizing
```shell
# Print unique lines (preserving order)
awk '!seen[$0]++' file.txt

# Print lines between patterns
awk '/START/,/END/' file.txt

# Remove blank lines
awk 'NF' file.txt

# Print every nth line
awk 'NR % 5 == 0' file.txt

# Reverse columns
awk '{for (i=NF; i>0; i--) printf "%s ", $i; print ""}' file.txt

# Sum column and print with total
awk '{sum += $2; print} END {print "Total:", sum}' data.txt
```
## When to Use awk vs. Alternatives
| Task | Best Tool |
|---|---|
| Simple pattern search | grep |
| Extract single column | cut |
| Column operations with conditions | awk |
| Complex transformations | awk or Python |
| JSON processing | jq |
| CSV with proper parsing | csvkit or Python |
awk hits the sweet spot: more powerful than grep/cut, simpler than a full script.
## Quick Reference
```shell
# Structure
awk 'pattern {action}' file

# Common patterns
/regex/            # Lines matching regex
$1 == "value"      # Field equals value
$2 > 100           # Numeric comparison
NR > 1             # Skip first line
NR == 1, NR == 10  # Lines 1-10

# Common actions
{print $1, $2}     # Print fields
{sum += $1}        # Accumulate
{count++}          # Count
{arr[$1]++}        # Count by key
```
awk is a complete programming language disguised as a CLI tool. You don’t need to master all of it — just enough to solve the problem grep can’t.
Computing Arts is CLI fluency for practitioners. More at computingarts.com.