Observability

Monitoring and Alerting: Best Practices That Won't Burn You Out

Bad monitoring means missing real problems. Bad alerting means 3 AM pages for things that don’t matter. Let’s do both right. What to Monitor The Four Golden Signals From Google’s SRE book — if you monitor nothing else, monitor these: 1. Latency: How long requests take 1 2 3 4 # p95 latency over 5 minutes histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le) ) 2. Traffic: Request volume 1 2 # Requests per second sum(rate(http_requests_total[5m])) 3. Errors: Failure rate ...

The Three Pillars of Observability: Logs, Metrics, and Traces

When your service goes down at 3 AM, you need answers fast. Observability—the ability to understand what’s happening inside your systems from their external outputs—is what separates a 5-minute fix from a 3-hour nightmare. The three pillars of observability are logs, metrics, and traces. Each tells a different part of the story. Logs: The Narrative Logs are discrete events. They tell you what happened in human-readable terms. 1 2 3 4 5 6 7 8 9 { "timestamp": "2026-03-03T12:34:56Z", "level": "error", "service": "payment-api", "message": "Payment processing failed", "user_id": "12345", "error_code": "CARD_DECLINED", "request_id": "abc-123" } Best Practices for Logging Structure your logs. JSON is your friend. Unstructured logs like Payment failed for user 12345 are hard to search and aggregate. ...

Testing in Production: Because Staging Never Tells the Whole Story

“We don’t test in production” sounds responsible until you realize: production is the only environment that’s actually production. Staging lies to you. Here’s how to test in production safely. Why Staging Fails Staging environments differ from production in ways that matter: Data: Sanitized, outdated, or synthetic Scale: 1% of production traffic Integrations: Sandbox APIs with different behavior Users: Developers clicking around, not real usage patterns Infrastructure: Smaller instances, shared resources That bug that only appears under real load with real data? Staging won’t catch it. ...

Health Check Endpoints: More Than Just 200 OK

Every modern service needs health check endpoints. Load balancers probe them. Kubernetes uses them. Monitoring systems scrape them. But a naive implementation—returning 200 OK if the process is running—tells you almost nothing useful. Here’s how to build health checks that actually help. Two Types of Health Liveness: Is the process alive and not deadlocked? Readiness: Can this instance handle requests right now? These are different questions with different answers: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # Liveness: Am I alive? @app.get("/health/live") def liveness(): # If this returns, the process is alive return {"status": "alive"} # Readiness: Can I serve traffic? @app.get("/health/ready") def readiness(): checks = { "database": check_database(), "cache": check_cache(), "disk_space": check_disk_space(), } all_healthy = all(c["healthy"] for c in checks.values()) return JSONResponse( status_code=200 if all_healthy else 503, content={"status": "ready" if all_healthy else "not_ready", "checks": checks} ) Why separate them? ...

Logging Levels: A Practical Guide to What Goes Where

Logging seems simple until you’re debugging production at 2 AM, scrolling through millions of lines trying to find the one that matters. Good logging practices make that experience less painful. Here’s how to think about log levels. The Levels Most logging frameworks use these standard levels: D E B U G < I N F O < W A R N < E R R O R < F A T A L In production, you typically run at INFO or WARN. Lower levels include all higher levels (INFO includes WARN, ERROR, and FATAL). ...

Structured Logging: Stop Parsing Log Lines

Unstructured logs are technical debt. Structured logs are queryable, parseable, and actually useful when things break. The Problem # 2 2 2 0 0 0 U 2 2 2 n 6 6 6 s - - - t 0 0 0 r 2 2 2 u - - - c 2 2 2 t 8 8 8 u r 1 1 1 e 0 0 0 d : : : : 1 1 1 5 5 5 g : : : o 2 2 2 o 3 4 5 d I E I l N R N u F R F c O O O k R U R p s F e a e a q r r i u s l e i a e s n l d t g i c t c t e o o h m i l p p s o r l g o e g c t e e e d s d s i i n o n r f d 2 r e 3 o r 4 m m 1 s 1 2 9 3 2 4 . 5 1 : 6 8 c . o 1 n . n 1 e c t i o n t i m e o u t Regex hell when you need to extract user, IP, order ID, or duration. ...

Prometheus Alerting Rules That Won't Wake You Up at 3am

The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter. The Golden Rule: Alert on Symptoms, Not Causes 1 2 3 4 5 6 7 8 9 10 11 12 13 # Bad: alerts on a cause - alert: HighCPU expr: node_cpu_seconds_total > 80 for: 5m # Good: alerts on user-facing symptom - alert: HighLatency expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5 for: 5m labels: severity: warning annotations: summary: "95th percentile latency above 500ms" Users don’t care if CPU is high. They care if the site is slow. ...

Debugging Production Issues Without Breaking Things

Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how. Rule Zero: Don’t Make It Worse Before touching anything: Don’t restart services until you understand the problem Don’t deploy fixes without knowing the root cause Don’t clear logs you might need for investigation Don’t scale down what might be handling load Stabilize first, investigate second, fix third. Start With Observability Check Dashboards Before SSH-ing anywhere: ...

Structured Logging: Stop Grepping, Start Querying

Unstructured logs are a trap. They look simple until you need to find something. [ [ [ 2 2 2 0 0 0 2 2 2 6 6 6 - - - 0 0 0 2 2 2 - - - 2 2 2 7 7 7 0 0 0 5 5 5 : : : 3 3 3 0 0 0 : : : 1 1 1 5 6 7 ] ] ] I E W N R A F R R O O N R U H s F i e a g r i h l j e m o d e h m n t o @ o r e y x p a r u m o s p c a l e g e s e . s c d o o e m r t d e l e c o r t g e g 1 d e 2 : d 3 4 8 i 5 7 n : % f c r o o n m n e 1 c 9 t 2 i . o 1 n 6 8 t . i 1 m . e 5 o 0 u t Quick: find all login failures from a specific IP range in the last hour. Now try parsing the order ID from error messages. Hope you enjoy regex. ...

Log Aggregation Pipelines: From Scattered Files to Searchable Insights

When you have one server, you SSH in and grep the logs. When you have fifty servers, that stops working. Log aggregation is how you make “what happened?” answerable at scale. The Pipeline Architecture Every log aggregation system follows the same basic pattern: ┌ │ └ ─ ─ ─ S ─ ─ o ─ ─ u ─ ─ r ─ │ │ └ ─ c ─ ─ ─ e ─ ─ ─ s ─ ─ ─ ─ ─ ┐ │ ┘ ─ ─ ─ ─ ─ ─ ─ ▶ ▶ ┌ │ └ ┌ │ └ ─ ─ ─ ─ ─ C ─ ─ ─ ─ o ─ ─ Q ─ ─ l ─ ─ u ─ ─ l ─ ─ e ─ ─ e ─ ─ r ─ ─ c ─ ─ y ─ ─ t ─ ─ ─ ─ ─ ─ ─ ┐ │ ┘ ┐ │ ┘ ─ ◀ ─ ─ ─ ─ ▶ ─ ┌ │ └ ─ ─ ─ ─ ─ P ─ ─ ─ r ─ ─ ─ o ─ ─ ─ c ─ ─ ─ e ─ ─ ─ s ─ ─ ─ s ─ ─ ─ ─ ─ ┐ │ ┘ ─ ─ ─ ─ ─ ─ ─ ▶ ─ ┌ │ └ ─ ─ ─ ─ ─ ─ ─ ─ S ─ ─ ─ t ─ ─ ─ o ─ │ ┘ ─ r ─ │ ─ e ─ ─ ─ ─ ─ ┐ │ ┘ Each stage has choices. Let’s walk through them. ...