Testing in Production: Because Staging Never Tells the Whole Story

β€œWe don’t test in production” sounds responsible until you realize: production is the only environment that’s actually production. Staging lies to you. Here’s how to test in production safely. Why Staging Fails Staging environments differ from production in ways that matter: Data: Sanitized, outdated, or synthetic Scale: 1% of production traffic Integrations: Sandbox APIs with different behavior Users: Developers clicking around, not real usage patterns Infrastructure: Smaller instances, shared resources That bug that only appears under real load with real data? Staging won’t catch it. ...

March 1, 2026 Β· 4 min Β· 836 words Β· Rob Washington

Health Check Endpoints: More Than Just 200 OK

Every modern service needs health check endpoints. Load balancers probe them. Kubernetes uses them. Monitoring systems scrape them. But a naive implementation—returning 200 OK if the process is running—tells you almost nothing useful. Here’s how to build health checks that actually help.

Two Types of Health

- Liveness: Is the process alive and not deadlocked?
- Readiness: Can this instance handle requests right now?

These are different questions with different answers:

```python
# Liveness: Am I alive?
@app.get("/health/live")
def liveness():
    # If this returns, the process is alive
    return {"status": "alive"}

# Readiness: Can I serve traffic?
@app.get("/health/ready")
def readiness():
    checks = {
        "database": check_database(),
        "cache": check_cache(),
        "disk_space": check_disk_space(),
    }
    all_healthy = all(c["healthy"] for c in checks.values())
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={"status": "ready" if all_healthy else "not_ready", "checks": checks},
    )
```

Why separate them? ...
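The readiness endpoint above leans on helpers like `check_database()` that the excerpt doesn’t show. A sketch of one, under the assumption that each helper returns a `{"healthy": ..., "detail": ...}` dict and, crucially, bounds its own runtime so a slow dependency can’t make the health endpoint itself hang (the host and port are hypothetical):

```python
import socket

def check_tcp(host: str, port: int, timeout: float = 1.0) -> dict:
    """Generic dependency probe: can we open a TCP connection quickly?

    The timeout matters: a readiness check that blocks on a dead
    dependency turns one outage into two.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return {"healthy": True, "detail": f"{host}:{port} reachable"}
    except OSError as exc:
        return {"healthy": False, "detail": str(exc)}

def check_database() -> dict:
    # Hypothetical address; in practice read host/port from config
    return check_tcp("db.internal", 5432)
```

A real `check_database()` would run a trivial query (`SELECT 1`) rather than just opening a socket, but the shape — fast, bounded, structured result — is the same.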

March 1, 2026 Β· 5 min Β· 920 words Β· Rob Washington

Logging Levels: A Practical Guide to What Goes Where

Logging seems simple until you’re debugging production at 2 AM, scrolling through millions of lines trying to find the one that matters. Good logging practices make that experience less painful. Here’s how to think about log levels.

The Levels

Most logging frameworks use these standard levels:

DEBUG < INFO < WARN < ERROR < FATAL

In production, you typically run at INFO or WARN. Running at a given level emits everything at that level and above (running at INFO includes WARN, ERROR, and FATAL). ...

March 1, 2026 Β· 4 min Β· 836 words Β· Rob Washington

Structured Logging: Stop Parsing Log Lines

Unstructured logs are technical debt. Structured logs are queryable, parseable, and actually useful when things break.

The Problem

```
# Unstructured: good luck
2026-02-28 10:15:23 INFO User logged in from 192.168.1.1
2026-02-28 10:15:24 ERROR Request failed: connection timeout
2026-02-28 10:15:25 INFO Order 2345 completed in 234ms
```

Regex hell when you need to extract user, IP, order ID, or duration. ...
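The structured alternative can be sketched with nothing but the stdlib; the field names here are illustrative, not prescribed by the post:

```python
import json
import datetime

def log_event(event: str, **fields) -> str:
    """Emit one JSON object per line; every field becomes queryable."""
    record = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    line = json.dumps(record)
    print(line)
    return line

line = log_event(
    "request_failed",
    user="jsmith",          # illustrative values
    ip="192.168.1.1",
    order_id=2345,
    duration_ms=234,
    error="connection timeout",
)
```

Now extracting the order ID is `json.loads(line)["order_id"]` — a dictionary lookup, not a regex.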

February 28, 2026 Β· 4 min Β· 744 words Β· Rob Washington

Prometheus Alerting Rules That Won't Wake You Up at 3am

The difference between good alerting and bad alerting is whether you still trust your pager after six months. Here’s how to build alerts that matter.

The Golden Rule: Alert on Symptoms, Not Causes

```yaml
# Bad: alerts on a cause
- alert: HighCPU
  expr: node_cpu_seconds_total > 80
  for: 5m

# Good: alerts on user-facing symptom
- alert: HighLatency
  expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "95th percentile latency above 500ms"
```

Users don’t care if CPU is high. They care if the site is slow. ...

February 28, 2026 Β· 4 min Β· 754 words Β· Rob Washington

Debugging Production Issues Without Breaking Things

Production is sacred. When something breaks, you need to investigate without making it worse. Here’s how.

Rule Zero: Don’t Make It Worse

Before touching anything:

- Don’t restart services until you understand the problem
- Don’t deploy fixes without knowing the root cause
- Don’t clear logs you might need for investigation
- Don’t scale down what might be handling load

Stabilize first, investigate second, fix third.

Start With Observability

Check Dashboards

Before SSH-ing anywhere: ...

February 28, 2026 Β· 6 min Β· 1168 words Β· Rob Washington

Structured Logging: Stop Grepping, Start Querying

Unstructured logs are a trap. They look simple until you need to find something.

```
[2026-02-27 05:30:15] INFO User jeremy@example.com logged in from 192.168.1.50
[2026-02-27 05:30:16] ERROR Failed to process order 12345: connection timeout
[2026-02-27 05:30:17] WARN High memory usage detected: 87%
```

Quick: find all login failures from a specific IP range in the last hour. Now try parsing the order ID from error messages. Hope you enjoy regex. ...
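Once logs are JSON, the IP-range challenge above becomes a filter instead of a regex. A sketch using the stdlib `ipaddress` module; the records and field names are illustrative:

```python
import json
from ipaddress import ip_address, ip_network

# Pretend these came out of your log store as JSON lines
logs = [
    '{"event": "login_failed", "ip": "192.168.1.50", "user": "jeremy@example.com"}',
    '{"event": "login_ok", "ip": "192.168.1.51", "user": "ana@example.com"}',
    '{"event": "login_failed", "ip": "10.0.0.7", "user": "sam@example.com"}',
]

subnet = ip_network("192.168.1.0/24")

# "All login failures from this IP range" is one comprehension
failures = [
    rec for rec in map(json.loads, logs)
    if rec["event"] == "login_failed" and ip_address(rec["ip"]) in subnet
]
```

The time-window half of the query works the same way: compare a parsed timestamp field instead of an IP. No regex either way.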

February 27, 2026 Β· 6 min Β· 1228 words Β· Rob Washington

Log Aggregation Pipelines: From Scattered Files to Searchable Insights

When you have one server, you SSH in and grep the logs. When you have fifty servers, that stops working. Log aggregation is how you make “what happened?” answerable at scale.

The Pipeline Architecture

Every log aggregation system follows the same basic pattern:

```
Sources ──▶ Collect ──▶ Process ──▶ Store ◀── Query
```

Each stage has choices. Let’s walk through them. ...
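The four stages can be sketched end to end in a few lines. Real systems swap each function for an agent, a queue, and an index (Fluent Bit, Kafka, Elasticsearch, and so on), but the shape is the same; everything below is illustrative:

```python
import json

def collect(raw_lines):
    """Collect: tail files / receive from agents. Here: a list of lines."""
    yield from raw_lines

def process(lines):
    """Process: parse, enrich, and drop noise before storage."""
    for line in lines:
        rec = json.loads(line)
        rec["env"] = "prod"              # enrichment: tag the source
        if rec.get("level") != "debug":  # filtering: debug never reaches storage
            yield rec

store = []  # Store: stand-in for a search index

def query(pred):
    """Query: search the store with an arbitrary predicate."""
    return [r for r in store if pred(r)]

store.extend(process(collect([
    '{"level": "info", "msg": "checkout ok"}',
    '{"level": "debug", "msg": "cache miss"}',
    '{"level": "error", "msg": "db timeout"}',
])))
errors = query(lambda r: r["level"] == "error")
```

Filtering in the process stage is where most cost savings live: every line dropped there is a line you never pay to index.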

February 24, 2026 Β· 10 min Β· 1999 words Β· Rob Washington

Alerting That Doesn't Suck: From Noise to Signal

The worst oncall shift I ever had wasn’t the one with the outage. It was the one with 47 alerts, none of which mattered, followed by one that did β€” which I almost missed because I’d stopped paying attention. Alert fatigue is real, and it’s a systems problem, not a discipline problem. If your alerts are noisy, the fix isn’t β€œtry harder to pay attention.” The fix is better alerts. ...

February 24, 2026 Β· 5 min Β· 1018 words Β· Rob Washington

Observability: Logs, Metrics, and Traces Working Together

Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am.

The three pillars of observability — logs, metrics, and traces — each provide different perspectives. Together, they create a complete picture of system behavior.

Logs: The Narrative

Logs tell you what happened, in order:

```json
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
{"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}
```

Good for:

- Debugging specific requests
- Understanding error context
- Audit trails
- Ad-hoc investigation

Challenges: ...
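The `request_id` in those records is what ties the narrative together: filter every log line for one request and you recover its full story, even when thousands of requests interleave. A minimal sketch, reusing the example records above (`trace_of` is a hypothetical helper, not from the post):

```python
import json

lines = [
    '{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}',
    '{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}',
    '{"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}',
]

def trace_of(request_id, lines):
    """Reassemble one request's story from interleaved structured logs."""
    return [r for r in map(json.loads, lines) if r.get("request_id") == request_id]

story = trace_of("abc123", lines)
```

This is the same correlation a tracing system does with span and trace IDs; logs with a shared `request_id` are the poor man’s trace.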

February 23, 2026 Β· 6 min Β· 1172 words Β· Rob Washington