Monitoring

Monitoring and Alerting: Best Practices That Won't Burn You Out

Bad monitoring means missing real problems. Bad alerting means 3 AM pages for things that don’t matter. Let’s do both right.

What to Monitor

The Four Golden Signals

From Google’s SRE book: if you monitor nothing else, monitor these.

1. Latency: How long requests take

    # p95 latency over 5 minutes
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )

2. Traffic: Request volume

    # Requests per second
    sum(rate(http_requests_total[5m]))

3. Errors: Failure rate ...
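
To make the traffic query less opaque, here is a minimal Python sketch of what a per-second rate over a counter means. It is a simplification under stated assumptions: the sample data is invented, `per_second_rate` is a hypothetical helper, and Prometheus's real `rate()` additionally extrapolates to the edges of the window, which this sketch omits.

```python
def per_second_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second increase of a counter.

    `samples` is a time-sorted list of (unix_timestamp, counter_value)
    pairs that fall inside the query window (for example, 5 minutes).
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A counter only goes up; a drop means the process restarted and
        # the counter began again from zero, so count the new value as-is.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Invented samples: one scrape per minute, with a counter reset at t=180.
samples = [(0, 1000), (60, 1300), (120, 1600), (180, 100), (240, 400), (300, 700)]
requests_per_second = per_second_rate(samples)
print(f"{requests_per_second:.2f} req/s")
```

The reset handling is the reason to query counters through a rate function rather than subtracting raw values by hand.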

The Three Pillars of Observability: Logs, Metrics, and Traces

When your service goes down at 3 AM, you need answers fast. Observability, the ability to understand what’s happening inside your systems from their external outputs, is what separates a 5-minute fix from a 3-hour nightmare.

The three pillars of observability are logs, metrics, and traces. Each tells a different part of the story.

Logs: The Narrative

Logs are discrete events. They tell you what happened in human-readable terms.

    {
      "timestamp": "2026-03-03T12:34:56Z",
      "level": "error",
      "service": "payment-api",
      "message": "Payment processing failed",
      "user_id": "12345",
      "error_code": "CARD_DECLINED",
      "request_id": "abc-123"
    }

Best Practices for Logging

Structure your logs. JSON is your friend. Unstructured logs like "Payment failed for user 12345" are hard to search and aggregate. ...
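
One way to produce log entries shaped like the JSON above is a custom formatter for Python's standard `logging` module. This is a sketch, not a recommended library: the logger name, the hard-coded service name, and the `extra={"fields": ...}` convention are all assumptions made for illustration.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "service": "payment-api",  # assumed: hard-coded for this sketch
            "message": record.getMessage(),
        }
        # Convention of this sketch: structured fields arrive through
        # extra={"fields": {...}} and are merged into the top level.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("payment-api-sketch")  # hypothetical name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buf = io.StringIO()
log = build_logger(buf)
log.error(
    "Payment processing failed",
    extra={"fields": {"user_id": "12345", "error_code": "CARD_DECLINED",
                      "request_id": "abc-123"}},
)
print(buf.getvalue(), end="")
```

Because every entry is one JSON object per line, a log pipeline can filter on `error_code` or `request_id` without any pattern matching.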

Structured Logging: Stop Parsing Log Lines

Unstructured logs are technical debt. Structured logs are queryable, parseable, and actually useful when things break.

The Problem

    # Unstructured: good luck parsing this
    2026-02-28 10:15:23 INFO User alice logged in from 192.168.1.1
    2026-02-28 10:15:24 ERROR Failed to process order 12345: connection timeout
    2026-02-28 10:15:25 INFO Request completed in 234ms

Regex hell when you need to extract user, IP, order ID, or duration. ...
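
The cost shows up as soon as you try to pull fields out of those lines. The sketch below parses the three sample lines with one regular expression per message shape, then parses a structured equivalent with a single `json.loads` call. The pattern names, event names, and the structured line are hypothetical, written only for this comparison.

```python
import json
import re

# Sample lines as they appear in the excerpt.
UNSTRUCTURED = [
    "2026-02-28 10:15:23 INFO User alice logged in from 192.168.1.1",
    "2026-02-28 10:15:24 ERROR Failed to process order 12345: connection timeout",
    "2026-02-28 10:15:25 INFO Request completed in 234ms",
]

# One pattern per message shape: every new log message needs a new regex.
PATTERNS = {
    "login": re.compile(r"User (?P<user>\S+) logged in from (?P<ip>\d+(?:\.\d+){3})"),
    "order_failed": re.compile(r"Failed to process order (?P<order_id>\d+): (?P<error>.+)"),
    "request_done": re.compile(r"Request completed in (?P<duration_ms>\d+)ms"),
}


def parse_unstructured(line: str) -> dict:
    """Split off date, time, and level, then try each pattern in turn."""
    date, time, level, message = line.split(" ", 3)
    base = {"timestamp": f"{date}T{time}", "level": level.lower()}
    for event, pattern in PATTERNS.items():
        match = pattern.search(message)
        if match:
            return {**base, "event": event, **match.groupdict()}
    return {**base, "event": "unknown", "message": message}


# Hypothetical structured version of the second line.
STRUCTURED = (
    '{"timestamp": "2026-02-28T10:15:24Z", "level": "error", '
    '"event": "order_failed", "order_id": "12345", "error": "connection timeout"}'
)


def parse_structured(line: str) -> dict:
    """No patterns to maintain: the producer already named every field."""
    return json.loads(line)


parsed = [parse_unstructured(line) for line in UNSTRUCTURED]
structured = parse_structured(STRUCTURED)
for entry in parsed:
    print(entry)
print(structured)
```

Note that the regex version also returns every value as a string and silently falls back to `"unknown"` when a message is reworded, which is exactly the kind of breakage structured logs avoid.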

Prometheus Alerting Rules That Won't Wake You Up at 3am

The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter.

The Golden Rule: Alert on Symptoms, Not Causes

    # Bad: alerts on a cause
    # (node_cpu_seconds_total is a cumulative counter, so CPU usage has to be
    # derived from the idle rate before comparing it with a percentage)
    - alert: HighCPU
      expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
      for: 5m

    # Good: alerts on user-facing symptom
    - alert: HighLatency
      expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "95th percentile latency above 500ms"

Users don’t care if CPU is high. They care if the site is slow. ...
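
The latency alert depends on how `histogram_quantile` turns bucket counts into a single number. The Python sketch below illustrates the core idea for classic cumulative buckets: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration, not Prometheus's implementation; the bucket data is invented, and edge cases such as negative bounds and native histograms are left out.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the (math.inf, total_count) bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # first bucket is assumed to start at 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Linear interpolation: assume observations are spread evenly
            # between the bucket's lower and upper bounds.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound


# Invented request-duration buckets (seconds): 100 observations in total.
buckets = [(0.1, 50), (0.25, 80), (0.5, 94), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(f"p95 = {p95:.3f}s, alert condition met: {p95 > 0.5}")
```

Because the result is interpolated, its accuracy depends on bucket boundaries. A threshold such as 0.5 is most trustworthy when it coincides with a bucket bound.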

Debugging Production Issues Without Breaking Things

Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how.

Rule Zero: Don’t Make It Worse

Before touching anything:

- Don’t restart services until you understand the problem
- Don’t deploy fixes without knowing the root cause
- Don’t clear logs you might need for investigation
- Don’t scale down what might be handling load

Stabilize first, investigate second, fix third.

Start With Observability

Check Dashboards

Before SSH-ing anywhere: ...

htop: Process Monitoring for Humans

top works. htop works better. It’s colorful, interactive, and actually pleasant to use. Here’s how to get the most from it.

Installation

    # Debian/Ubuntu
    sudo apt install htop

    # RHEL/CentOS/Fedora
    sudo dnf install htop

    # macOS
    brew install htop

The Interface

Launch with htop. You’ll see:

Top section:

- CPU bars (one per core)
- Memory and swap usage
- Tasks, load average, uptime

Process list: ...

watch: Repeat Commands and See Changes

watch runs a command every N seconds and displays the output. It’s the simplest form of real-time monitoring: no setup, no configuration, just instant feedback loops.

Basic Usage

    # Run command every 2 seconds (default)
    watch date

    # Run every 5 seconds
    watch -n 5 date

    # Run every 0.5 seconds
    watch -n 0.5 date

Highlight Changes

    # Highlight differences between updates
    watch -d df -h

    # Highlight changes permanently (cumulative)
    watch -d=cumulative df -h

Common Options

    -n, --interval      Seconds between updates (default: 2)
    -d, --differences   Highlight changes
    -t, --no-title      Hide the header
    -b, --beep          Beep on command error
    -e, --errexit       Exit on command error
    -c, --color         Interpret ANSI color sequences
    -x, --exec          Pass command to exec instead of sh -c

Practical Examples

Disk Space

    # Watch disk usage
    watch df -h

    # Watch specific mount
    watch 'df -h | grep /dev/sda1'

    # Watch directory size
    watch 'du -sh /var/log'

Memory

    # Memory stats
    watch free -h

    # Memory with buffers/cache detail
    watch 'free -h && echo && cat /proc/meminfo | head -10'

Processes

    # Process count
    watch 'ps aux | wc -l'

    # Specific process
    watch 'ps aux | grep nginx'

    # Process memory
    watch 'ps aux --sort=-%mem | head -10'

Network

    # Network connections
    watch 'netstat -tuln'

    # Connection count
    watch 'netstat -an | wc -l'

    # Active connections
    watch 'ss -s'

    # Interface stats
    watch 'cat /proc/net/dev'

Files and Directories

    # File list
    watch ls -la

    # Directory size changes
    watch 'du -sh *'

    # File count
    watch 'find . -type f | wc -l'

    # Recent files
    watch 'ls -lt | head -10'

Docker

    # Container status
    watch docker ps

    # Container stats
    watch 'docker stats --no-stream'

    # Image list
    watch docker images

Kubernetes

    # Pod status
    watch kubectl get pods

    # All resources
    watch 'kubectl get pods,svc,deploy'

    # Pod logs (last 5 lines)
    watch 'kubectl logs -l app=myapp --tail=5'

Git

    # Branch status
    watch git status

    # Log (one line per commit)
    watch 'git log --oneline -10'

    # Diff stats
    watch git diff --stat

Logs

    # Last log lines
    watch 'tail -5 /var/log/syslog'

    # Error count
    watch 'grep -c ERROR /var/log/app.log'

    # Recent errors
    watch 'grep ERROR /var/log/app.log | tail -5'

APIs and Services

    # HTTP health check
    watch 'curl -s localhost:8080/health'

    # API response time
    watch 'curl -s -w "%{time_total}\n" -o /dev/null http://localhost/api'

    # Service status
    watch systemctl status nginx

Database

    # PostgreSQL connections
    watch 'psql -c "SELECT count(*) FROM pg_stat_activity"'

    # MySQL process list
    watch 'mysql -e "SHOW PROCESSLIST"'

    # Table row count
    watch 'psql -c "SELECT count(*) FROM users"'

Quoting and Complex Commands

For commands with pipes or special characters, quote the entire command: ...

htop: Interactive Process Monitoring

htop is an interactive process viewer, a better top. It shows CPU, memory, running processes, and lets you kill or renice processes without typing PIDs.

Installation

    # Debian/Ubuntu
    sudo apt install htop

    # RHEL/CentOS
    sudo yum install htop

    # macOS
    brew install htop

The Interface

[Sample htop screen: per-core CPU bars, Mem and Swp bars, the Tasks / Load average / Uptime summary, and the process table.]

Top Section

- CPU bars: Usage per core (user, system, nice, IRQ, etc.)
- Memory bar: Used/total RAM
- Swap bar: Used/total swap
- Tasks: Process count, threads, running processes
- Load average: 1, 5, 15 minute averages
- Uptime: System uptime

Process Columns

- PID: Process ID
- USER: Owner
- PRI: Priority
- NI: Nice value (-20 to 19, lower = higher priority)
- VIRT: Virtual memory
- RES: Resident (physical) memory
- SHR: Shared memory
- S: State (R=running, S=sleeping, Z=zombie, D=disk wait)
- CPU%: CPU usage
- MEM%: Memory usage
- TIME+: Total CPU time
- Command: Process command

Navigation

    ↑/↓         Move up/down
    PgUp/PgDn   Page up/down
    Home/End    First/last process
    Space       Tag process
    u           Filter by user
    /           Search

Function Keys

    F1    Help
    F2    Setup (configuration)
    F3    Search
    F4    Filter
    F5    Tree view
    F6    Sort by column
    F7    Nice - (higher priority)
    F8    Nice + (lower priority)
    F9    Kill process
    F10   Quit

Sorting

    F6    Open sort menu
    P     Sort by CPU%
    M     Sort by MEM%
    T     Sort by TIME
    [key lost in source]   Sort by next column
    [key lost in source]   Sort by previous column
    I     Invert sort order

Process Actions

Kill a Process

1. Navigate to process
2. Press F9
3. Select signal:
   - 15 SIGTERM (graceful)
   - 9 SIGKILL (force)
   - 2 SIGINT (interrupt)
4. Press Enter

Kill Multiple Processes

1. Space to tag processes
2. F9 to kill
3. All tagged processes receive signal

Change Priority (Nice)

    F7    Decrease nice (higher priority, needs root)
    F8    Increase nice (lower priority)

Tree View

Press F5 to toggle tree view, which shows parent/child relationships: ...

Alerting That Doesn't Suck: From Noise to Signal

The worst oncall shift I ever had wasn’t the one with the outage. It was the one with 47 alerts, none of which mattered, followed by one that did — which I almost missed because I’d stopped paying attention. Alert fatigue is real, and it’s a systems problem, not a discipline problem. If your alerts are noisy, the fix isn’t “try harder to pay attention.” The fix is better alerts. ...

Observability: Logs, Metrics, and Traces Working Together

Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am.

The three pillars of observability (logs, metrics, and traces) each provide different perspectives. Together, they create a complete picture of system behavior.

Logs: The Narrative

Logs tell you what happened, in order:

    {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
    {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
    {"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}

Good for:

- Debugging specific requests
- Understanding error context
- Audit trails
- Ad-hoc investigation

Challenges: ...
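
The shared `request_id` is what makes "debugging specific requests" practical. As a minimal sketch, the Python below groups JSON log lines by `request_id` and rebuilds each request's timeline. The first three lines come from the excerpt; the `def456` line and the `timeline_by_request` helper are hypothetical, added only to show that interleaved requests separate cleanly.

```python
import json
from collections import defaultdict

RAW_LOGS = """\
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "def456", "path": "/api/orders"}
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
{"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}
"""


def timeline_by_request(lines) -> dict:
    """Group parsed log entries by request_id, ordered by timestamp."""
    timelines = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        timelines[entry["request_id"]].append(entry)
    for entries in timelines.values():
        # ISO 8601 UTC timestamps sort chronologically as plain strings.
        # The sort is stable, so same-second entries keep their input order.
        entries.sort(key=lambda e: e["timestamp"])
    return dict(timelines)


timelines = timeline_by_request(RAW_LOGS.splitlines())
failed_request_events = [entry["event"] for entry in timelines["abc123"]]
print(failed_request_events)
```

In a real system the same grouping is usually done by the log backend's query language rather than in application code, but the principle is the same: one identifier threaded through every entry.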