Observability

Alerting That Doesn't Suck: From Noise to Signal

The worst oncall shift I ever had wasn’t the one with the outage. It was the one with 47 alerts, none of which mattered, followed by one that did — which I almost missed because I’d stopped paying attention. Alert fatigue is real, and it’s a systems problem, not a discipline problem. If your alerts are noisy, the fix isn’t “try harder to pay attention.” The fix is better alerts. ...

Observability: Logs, Metrics, and Traces Working Together

Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am. The three pillars of observability (logs, metrics, and traces) each provide different perspectives. Together, they create a complete picture of system behavior.

Logs: The Narrative

Logs tell you what happened, in order:

    {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
    {"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
    {"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}

Good for:

- Debugging specific requests
- Understanding error context
- Audit trails
- Ad-hoc investigation

Challenges: ...
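As a rough illustration of why a shared request_id makes logs useful for debugging a specific request, the sketch below groups the three sample lines from the excerpt by request_id and reads back the ordered events. The function name timeline_by_request is hypothetical, not something from the article.

```python
import json
from collections import defaultdict

# Sample lines copied from the excerpt above.
RAW_LOGS = """\
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
{"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}
"""


def timeline_by_request(raw: str) -> dict:
    """Group JSON log lines by request_id, preserving input order."""
    grouped = defaultdict(list)
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        grouped[record["request_id"]].append(record)
    return dict(grouped)


timelines = timeline_by_request(RAW_LOGS)
events = [record["event"] for record in timelines["abc123"]]
print(events)  # ['request_started', 'db_query', 'request_failed']
```

In a real system this grouping would normally be a query in the log store rather than application code; the point is only that the request_id field is what makes the narrative recoverable.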

Structured Logging: Stop Grepping, Start Querying

You’ve seen this log line before:

    2026-02-23 05:30:00 INFO User john@example.com logged in from 192.168.1.100 after 2 failed attempts

Human readable. Grep-able. And completely useless for answering questions like “how many users had failed login attempts yesterday?” or “what’s the P95 response time for requests from the EU region?” ...
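To make the contrast concrete, here is a minimal sketch of the first question answered against structured records. The JSON field names (ts, event, user, failed_attempts) and the second user are assumptions made up for illustration; the excerpt does not show the article's own schema.

```python
import json

# Hypothetical structured equivalents of the text line above.
# Field names and the extra records are assumptions, not the article's schema.
LINES = [
    '{"ts": "2026-02-23T05:30:00Z", "level": "INFO", "event": "login", '
    '"user": "john@example.com", "ip": "192.168.1.100", "failed_attempts": 2}',
    '{"ts": "2026-02-23T06:10:00Z", "level": "INFO", "event": "login", '
    '"user": "ana@example.com", "ip": "192.168.1.101", "failed_attempts": 0}',
    '{"ts": "2026-02-23T07:45:00Z", "level": "INFO", "event": "login", '
    '"user": "john@example.com", "ip": "192.168.1.100", "failed_attempts": 1}',
]


def users_with_failed_logins(lines, day):
    """Return the set of users who logged in on `day` after at least one failure."""
    users = set()
    for line in lines:
        rec = json.loads(line)
        if (
            rec["event"] == "login"
            and rec["ts"].startswith(day)
            and rec["failed_attempts"] > 0
        ):
            users.add(rec["user"])
    return users


result = users_with_failed_logins(LINES, "2026-02-23")
print(len(result))  # 1
```

The same question against the plain-text line would need a regular expression tuned to that exact sentence; here it is a filter on named fields.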

Monitoring Anti-Patterns: When Alerts Become Noise

Good monitoring saves you from outages. Bad monitoring causes them by training your team to ignore alerts until something actually breaks. Here’s how to avoid the most common anti-patterns.

Anti-Pattern 1: Alerting on Symptoms, Not Impact

    # ❌ BAD: CPU is high
    - alert: HighCPU
      expr: node_cpu_usage > 80
      for: 5m
      labels:
        severity: critical

High CPU isn’t a problem. Slow responses are a problem. Users don’t care about your CPU graphs. ...
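One way to picture an impact-based check is to decide on what users experience, such as 95th-percentile response time, instead of CPU. The sketch below is tool-agnostic Python rather than an alerting rule; the 500 ms threshold and the function names are assumptions chosen for illustration.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def should_page(latencies_ms, p95_threshold_ms=500):
    """Page only when user-facing p95 latency exceeds the threshold.

    The 500 ms default is an assumed service-level target, not a recommendation.
    """
    return percentile(latencies_ms, 95) > p95_threshold_ms


# CPU could be at 90% in both cases; only the second one hurts users.
healthy = [120] * 20
degraded = [120] * 18 + [900] * 2

print(should_page(healthy))   # False
print(should_page(degraded))  # True
```

The design choice is that CPU never appears in the decision at all: it stays on a dashboard as diagnostic context for whoever gets paged.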

Infrastructure Observability for LLM Agents

When you deploy an LLM-powered agent in production, traditional APM dashboards only tell half the story. You can track latency, error rates, and throughput, but what about what the agent actually did? Did it hallucinate? Did it spiral into an infinite retry loop? Did it spend $47 on tokens chasing a dead end? Here’s how to build observability for autonomous agents that actually helps.

The Three Pillars of Agent Observability

Standard observability (logs, metrics, traces) still matters. But agents need three additional dimensions: ...

Distributed Tracing Essentials: Following Requests Across Services

A request hits your API gateway, bounces through five microservices, touches two databases, and returns an error. Logs say “something failed.” Which service? Which call? What was the payload? Distributed tracing answers these questions by connecting the dots across service boundaries.

The Core Concepts

Traces and Spans

A trace represents a complete request journey. A span represents a single operation within that journey.

    Trace: abc123
    ├── Span: API Gateway (parent)
    │   ├── Span: Auth Service
    │   ├── Span: User Service
    │   │   └── Span: Database Query
    │   └── Span: Order Service
    │       ├── Span: Inventory Check
    │       └── Span: Payment Processing

Each span has: ...
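The trace and span relationship above can be sketched as records that each point at their parent. The span names follow the example tree; the fields span_id and parent_id are illustrative assumptions and are not meant to stand in for the attribute list the article goes on to give.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str               # assumed field: unique within the trace
    parent_id: Optional[str]   # assumed field: None marks the root span
    name: str


# Spans for the example trace; IDs are made up for illustration.
SPANS = [
    Span("1", None, "API Gateway"),
    Span("2", "1", "Auth Service"),
    Span("3", "1", "User Service"),
    Span("4", "3", "Database Query"),
    Span("5", "1", "Order Service"),
    Span("6", "5", "Inventory Check"),
    Span("7", "5", "Payment Processing"),
]


def walk(spans):
    """Return (depth, name) pairs in depth-first order from the root span."""
    children = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    ordered = []

    def visit(parent_id, depth):
        for span in children.get(parent_id, []):
            ordered.append((depth, span.name))
            visit(span.span_id, depth + 1)

    visit(None, 0)
    return ordered


for depth, name in walk(SPANS):
    print("  " * depth + name)
```

Because every span carries only its own ID and its parent's, each service can emit spans independently and the tree is assembled later by whatever collects them.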

Observability Pipelines: From Logs to Insights

Raw logs are noise. Processed telemetry is intelligence. The difference between them is your observability pipeline. Modern distributed systems generate enormous amounts of data: logs, metrics, traces, events. But data isn’t insight. The challenge isn’t collection; it’s transformation. How do you turn a firehose of JSON lines into something a human (or an AI) can actually act on?

The Three Pillars, Unified

You’ve heard the “three pillars of observability”: logs, metrics, and traces. What’s often missing from that conversation is how these pillars should connect. ...

Structured Logging: Making Logs Queryable and Actionable

Plain text logs are for humans. Structured logs are for machines. In production, machines need to read your logs before humans do. When your service handles thousands of requests per second, grep stops working. You need logs that can be indexed, queried, aggregated, and alerted on. That means structure.

The Problem with Text Logs

    [2026-02-16 08:30:15] INFO: User john@example.com logged in from 192.168.1.50
    [2026-02-16 08:30:16] ERROR: Payment failed for order 12345 - insufficient funds
    [2026-02-16 08:30:17] WARN: High memory usage detected: 87%

Looks readable. But try answering: ...

Distributed Tracing: The Missing Piece of Your Observability Stack

When a request fails in a distributed system, the question isn’t if something went wrong; it’s where. Logs tell you what happened. Metrics tell you how often. But tracing tells you the story.

The Problem with Logs and Metrics Alone

You’ve got 15 microservices. A user reports slow checkout. You check the logs: thousands of entries. You check the metrics: latency is up, but which service? You’re playing detective without a map. This is where distributed tracing shines. It connects the dots across service boundaries, showing you the exact path a request takes and where time is spent. ...

Structured Logging for Distributed Systems

When your application spans multiple services, containers, and regions, print("something went wrong") doesn’t cut it anymore. Structured logging transforms your logs from walls of text into queryable data.

Why Structured Logging?

Traditional logs are strings meant for humans:

    [2026-02-13 14:00:00] ERROR: Failed to process order 12345 for user john@example.com

Structured logs are data meant for machines (and humans): ...
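For readers who want to try the idea, here is a minimal sketch using only Python's standard logging module. It is one possible approach, not the article's own example: the JsonFormatter class, the "fields" key passed through extra, and the event name are all assumptions made for illustration.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line (illustrative sketch)."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "fields" is a hypothetical convention: a dict passed via `extra`.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


# Write to an in-memory stream so the output can be inspected below.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders_example")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.error(
    "order_processing_failed",
    extra={"fields": {"order_id": 12345, "user": "john@example.com"}},
)

entry = json.loads(stream.getvalue())
print(entry["order_id"])  # 12345
```

The order ID and user are now separate fields that an indexer can filter on, instead of substrings buried in a sentence.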