Monitoring and observability get used interchangeably. They shouldn’t. The distinction isn’t pedantic—it determines whether you can debug problems you’ve never seen before.

Monitoring answers: “Is the thing I expected to break, broken?”

Observability answers: “What is happening, even if I didn’t anticipate it?”

One is verification. The other is exploration.

The Dashboard Trap

Most teams start with dashboards. CPU usage, memory, request latency, error rates. Green means good, red means bad.

This works until it doesn’t.

Dashboard: "All green."
Users: "But the app is broken..."

The dashboard monitors what you thought would break. But production breaks in ways you didn’t imagine. A green dashboard with broken users means your mental model of failure was incomplete.

Monitoring is necessary. But monitoring alone leaves you blind to unknown failure modes.

What Observability Actually Means

A system is observable if you can understand its internal state from its external outputs. The key word is understand—not just detect.

Observable systems emit rich telemetry:

  • Logs: What happened, in sequence
  • Metrics: How much, how often, how long
  • Traces: The path a request took through services

But the telemetry alone isn’t observability. Observability is the ability to ask arbitrary questions of your system and get answers.

Monitoring question: "Is latency above 500ms for the /checkout endpoint?"

Observability question: "Why is latency higher for users in Germany with more than 10 items?"

The first is a threshold check. The second is an investigation. You couldn’t have written an alert for the second—you didn’t know to look for it until the incident started.

The Three Pillars (And Why They’re Not Enough)

Everyone talks about logs, metrics, and traces. But having all three doesn’t make you observable.

Logs Without Context Are Noise

[ERROR] Failed to process request
[ERROR] Database timeout
[ERROR] Failed to process request

Okay, but which request? Which user? What was the request trying to do? What else was happening at the same time?

Observable logs:

{
  "level": "error",
  "message": "Database timeout",
  "trace_id": "abc123",
  "user_id": "user_456",
  "endpoint": "/api/orders",
  "db_query": "SELECT * FROM orders WHERE user_id = ?",
  "query_duration_ms": 30000,
  "db_pool_active": 50,
  "db_pool_max": 50
}

Now you can ask: “Show me all errors for user_456” or “What was the database pool state during timeouts?”
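Once logs are structured, ad-hoc questions become filters. A minimal sketch in plain Python (the log lines and field names echo the example above; any log aggregator's query language does the same thing):

```python
import json

# Structured log lines, as they might appear in an aggregator.
raw_logs = """
{"level": "error", "message": "Database timeout", "user_id": "user_456", "db_pool_active": 50, "db_pool_max": 50}
{"level": "info", "message": "Order created", "user_id": "user_123", "db_pool_active": 12, "db_pool_max": 50}
{"level": "error", "message": "Database timeout", "user_id": "user_456", "db_pool_active": 50, "db_pool_max": 50}
""".strip().splitlines()

logs = [json.loads(line) for line in raw_logs]

# "Show me all errors for user_456"
errors_456 = [l for l in logs if l["level"] == "error" and l["user_id"] == "user_456"]

# "What was the database pool state during timeouts?"
pool_state = [(l["db_pool_active"], l["db_pool_max"])
              for l in logs if l["message"] == "Database timeout"]

print(len(errors_456))  # 2
print(pool_state)       # [(50, 50), (50, 50)] -- pool exhausted both times
```

The same filters that answer today's question answer tomorrow's, because the fields were there before you knew you'd need them.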

Metrics Without Cardinality Are Averages

http_request_duration_seconds{quantile="0.99"} 0.5

Great, p99 is 500ms. But is that all endpoints? All users? All regions?

Observable metrics have dimensions:

http_request_duration_seconds{method="POST", endpoint="/api/orders", tenant_id="t-123", region="us-west-2"} 0.5

High cardinality is expensive. But without it, you’re averaging away the signal.
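A toy illustration of what averaging hides (all numbers hypothetical): one slow tenant vanishes inside the aggregate median but is obvious once the metric carries a tenant dimension.

```python
from statistics import median

# Hypothetical request durations in ms, tagged by tenant:
# 95 fast requests from tenant t-a, 5 very slow ones from tenant t-b.
durations = [("t-a", 40)] * 95 + [("t-b", 2000)] * 5

overall = median(d for _, d in durations)
print(overall)  # 40.0 -- looks perfectly healthy

# Group by the tenant dimension instead of averaging it away.
by_tenant = {}
for tenant, d in durations:
    by_tenant.setdefault(tenant, []).append(d)

per_tenant = {t: median(ds) for t, ds in by_tenant.items()}
print(per_tenant)  # {'t-a': 40, 't-b': 2000} -- tenant t-b is drowning
```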

Traces Without Spans Are Just IDs

A trace ID that appears in logs is better than nothing. But a trace that shows the actual flow—which services, which operations, how long each took—is what lets you find the slow span in a 20-service chain.

Trace abc123:
  api-gateway (350ms)
  ├── auth-service (5ms)
  └── order-service (320ms)             ← HERE?
      ├── inventory-check (20ms)
      └── payment-service (280ms)       ← ACTUALLY HERE
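The mechanics behind spans fit in a few lines: each unit of work records its name, parent, and duration under a shared trace ID. A toy tracer, not any particular SDK (service names and sleep durations are made up):

```python
import time
import uuid
from contextlib import contextmanager

spans = []    # collected spans for one process
_stack = []   # currently open spans, for parent tracking
trace_id = uuid.uuid4().hex

@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

with span("checkout"):
    with span("inventory-check"):
        time.sleep(0.01)
    with span("payment-service"):
        time.sleep(0.05)  # the slow one

# Find the slowest non-root span -- the "ACTUALLY HERE" in the chain.
slowest = max(spans, key=lambda s: s["duration_ms"] if s["parent"] else -1)
print(slowest["name"])  # payment-service
```

Real tracing SDKs add context propagation across process boundaries, but the data model is this simple: name, parent, duration, trace ID.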

Building for Observability

1. Structured Logging Everywhere

No more print(f"Error: {e}"). Every log should be structured JSON with:

  • Timestamp
  • Level
  • Message
  • Trace/correlation ID
  • Relevant context (user, endpoint, operation)
logger.error(
    "Payment failed",
    extra={
        "trace_id": trace_id,
        "user_id": user_id,
        "payment_method": method,
        "amount_cents": amount,
        "error_code": e.code,
        "error_message": str(e)
    }
)

2. Propagate Context

Every request gets a trace ID at the edge. That ID propagates through every service, every queue, every async job.

# At the edge
trace_id = request.headers.get("X-Trace-ID") or generate_trace_id()

# In every outgoing request
response = httpx.post(
    url,
    headers={"X-Trace-ID": trace_id}
)

# In every log
logger = logger.bind(trace_id=trace_id)

# In every metric (note: trace IDs are very high-cardinality;
# exemplars, covered below, are the cheaper way to attach them)
metrics.histogram(
    "request_duration",
    duration,
    tags={"trace_id": trace_id}
)
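For the async-job case, Python's stdlib contextvars is one way to carry the trace ID without threading it through every function signature. A sketch (function names are illustrative):

```python
import asyncio
import contextvars

trace_id_var = contextvars.ContextVar("trace_id", default=None)

def log(message):
    # Every log line picks up the ambient trace ID automatically.
    return {"message": message, "trace_id": trace_id_var.get()}

async def charge_card():
    # Deep in the call stack: no trace_id parameter needed.
    return log("charging card")

async def handle_request(incoming_trace_id):
    trace_id_var.set(incoming_trace_id)  # set once, at the edge
    return await charge_card()

entry = asyncio.run(handle_request("abc123"))
print(entry)  # {'message': 'charging card', 'trace_id': 'abc123'}
```

Libraries like structlog and the OpenTelemetry SDK use the same mechanism under the hood.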

3. High-Cardinality When It Matters

You can’t afford high cardinality everywhere. Choose where it matters:

  • Customer/tenant ID (for debugging customer issues)
  • Endpoint/operation (for finding slow paths)
  • Error codes (for categorizing failures)

Sample aggressively for volume, but keep 100% of errors and slow requests.
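That policy fits in a single decision function. The rules (keep every error, keep every slow request, sample the rest) follow the text; the 1% rate and 1-second threshold are made-up numbers:

```python
import random

SLOW_MS = 1000      # always keep requests slower than this (assumed threshold)
SAMPLE_RATE = 0.01  # keep 1% of ordinary traffic (assumed rate)

def should_keep(status_code, duration_ms, rng=random.random):
    if status_code >= 500:        # keep 100% of errors
        return True
    if duration_ms >= SLOW_MS:    # keep 100% of slow requests
        return True
    return rng() < SAMPLE_RATE    # sample the boring majority

print(should_keep(500, 20))                    # True  (error)
print(should_keep(200, 3000))                  # True  (slow)
print(should_keep(200, 20, rng=lambda: 0.99))  # False (sampled out)
```

The interesting telemetry survives in full; only the healthy, fast majority gets thinned.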

4. Exemplars

A spike in error rate is a metric. But which requests errored?

Exemplars attach trace IDs to metric data points:

# Prometheus exemplar format
http_request_duration_seconds_bucket{le="0.5"} 100 # {trace_id="abc123"}

Now you can click from “error spike” to “actual traces that errored.”
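A hand-rolled sketch of the concept (one histogram bucket that remembers the last trace it saw; real exemplar support lives in the metrics client, this just shows the shape of the data):

```python
class BucketWithExemplar:
    """One histogram bucket that keeps an exemplar trace ID alongside its count."""

    def __init__(self, upper_bound_s):
        self.le = upper_bound_s
        self.count = 0
        self.exemplar = None

    def observe(self, duration_s, trace_id):
        if duration_s <= self.le:
            self.count += 1
            # Remember one concrete request that landed in this bucket.
            self.exemplar = {"trace_id": trace_id}

bucket = BucketWithExemplar(0.5)
bucket.observe(0.12, "abc123")
bucket.observe(0.31, "def456")

# The aggregate now links to a concrete trace you can open.
print(bucket.count, bucket.exemplar)  # 2 {'trace_id': 'def456'}
```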

The Observability Workflow

When something breaks:

  1. Detect (monitoring): Alert fires, or users report issues
  2. Scope (metrics): Which service? Which endpoints? Which users?
  3. Hypothesize (exploration): What could cause this pattern?
  4. Investigate (traces + logs): Find example requests, trace their path
  5. Confirm (back to metrics): Does the fix change the pattern?

Monitoring gets you to step 1. Observability gets you through steps 2-5.

Practical Starting Points

If you have nothing:

  1. Add structured logging with trace IDs
  2. Emit request duration and error count metrics by endpoint
  3. Use a log aggregator that supports querying (Loki, CloudWatch Logs Insights, etc.)
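The second item can start as a tiny in-process wrapper before you reach for a metrics library; everything here (names, storage) is illustrative:

```python
import time
from collections import defaultdict

durations = defaultdict(list)    # endpoint -> request durations (seconds)
error_counts = defaultdict(int)  # endpoint -> error count

def instrumented(endpoint):
    """Record duration and error count per endpoint (hypothetical helper)."""
    def wrap(handler):
        def inner(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            except Exception:
                error_counts[endpoint] += 1
                raise
            finally:
                durations[endpoint].append(time.perf_counter() - start)
        return inner
    return wrap

@instrumented("/api/orders")
def get_orders(fail=False):
    if fail:
        raise RuntimeError("boom")
    return ["order-1"]

get_orders()
try:
    get_orders(fail=True)
except RuntimeError:
    pass

print(len(durations["/api/orders"]), error_counts["/api/orders"])  # 2 1
```

Swapping the dicts for a real metrics client later doesn't change the call sites.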

If you have basic monitoring:

  1. Add distributed tracing (Jaeger, Zipkin, or cloud-native options)
  2. Increase metric cardinality for problem areas
  3. Link traces to logs via trace ID

If you have all three pillars:

  1. Add exemplars to connect metrics → traces
  2. Build runbooks that use all three together
  3. Practice “observability-driven debugging” in post-mortems

The Mindset Shift

Monitoring is defensive: “Alert me when known things break.”

Observability is exploratory: “Let me ask questions I haven’t thought of yet.”

Both are necessary. But if your production debugging workflow is “look at dashboard, see green, shrug at users,” you have monitoring without observability.

Build systems that answer questions you haven’t asked yet. That’s what observability actually means.