Monitoring answers “is it working?” Observability answers “why isn’t it working?” The difference matters when you’re debugging a production incident at 3am.

The three pillars of observability — logs, metrics, and traces — each provide different perspectives. Together, they create a complete picture of system behavior.

Logs: The Narrative

Logs tell you what happened, in order:

{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "request_started", "request_id": "abc123", "path": "/api/users"}
{"timestamp": "2026-02-23T13:00:01Z", "level": "info", "event": "db_query", "request_id": "abc123", "duration_ms": 45}
{"timestamp": "2026-02-23T13:00:02Z", "level": "error", "event": "request_failed", "request_id": "abc123", "error": "connection timeout"}

Good for:

  • Debugging specific requests
  • Understanding error context
  • Audit trails
  • Ad-hoc investigation

Challenges:

  • High volume, high cost
  • Searching requires indexing
  • Context scattered across services

Structured Logging

Always log structured data:

# Bad - unstructured
logger.info(f"User {user_id} purchased {product} for ${price}")

# Good - structured
logger.info("purchase_completed", 
    user_id=user_id, 
    product_id=product, 
    price=price,
    currency="USD")

Structured logs are queryable: “show all purchases over $100 in the last hour.”
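In practice that query runs in your log backend (LogQL, SQL, Splunk's query language), but the idea is easy to see in plain code. A minimal sketch over JSON-lines logs, stdlib only; the `purchases_over` helper is hypothetical, and field names follow the purchase example above:

```python
import json
from datetime import datetime, timezone

def purchases_over(lines, min_price, since):
    """Filter JSON-lines logs: purchase events above min_price at/after `since`."""
    matches = []
    for line in lines:
        rec = json.loads(line)
        ts = datetime.fromisoformat(rec["timestamp"].replace("Z", "+00:00"))
        if (rec.get("event") == "purchase_completed"
                and rec.get("price", 0) > min_price
                and ts >= since):
            matches.append(rec)
    return matches

logs = [
    '{"timestamp": "2026-02-23T13:00:01Z", "event": "purchase_completed", "user_id": "u1", "price": 250}',
    '{"timestamp": "2026-02-23T13:00:02Z", "event": "purchase_completed", "user_id": "u2", "price": 40}',
    '{"timestamp": "2026-02-23T09:00:00Z", "event": "purchase_completed", "user_id": "u3", "price": 500}',
]
since = datetime(2026, 2, 23, 12, 30, tzinfo=timezone.utc)
matches = purchases_over(logs, 100, since)  # only u1: over $100 AND in the window
```

With unstructured strings, the same question would require fragile regex work; with structured fields it's a simple predicate.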

Metrics: The Numbers

Metrics are numerical measurements over time:

http_requests_total{method="GET", status="200"} 1024
http_requests_total{method="GET", status="500"} 532
http_request_duration_seconds{quantile="0.99"} 0.245

Good for:

  • Dashboards and visualization
  • Alerting on thresholds
  • Capacity planning
  • Trend analysis

Challenges:

  • High cardinality kills performance
  • Aggregation loses detail
  • Choosing what to measure
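The cardinality problem is multiplicative: every distinct combination of label values becomes its own time series, so one unbounded label (user IDs, request IDs) blows up storage. A back-of-the-envelope illustration (label values are made up):

```python
from itertools import product

# Hypothetical label values for a single counter
methods = ["GET", "POST", "PUT", "DELETE", "PATCH"]    # 5 values
statuses = ["200", "201", "301", "400", "404", "500"]  # 6 values
user_ids = [f"user_{i}" for i in range(1000)]          # unbounded in real traffic

# Every distinct label combination is a separate time series
series = list(product(methods, statuses, user_ids))
print(len(series))  # 5 * 6 * 1000 = 30,000 series from one metric
```

This is why high-cardinality identifiers belong in logs and traces, not metric labels.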

The RED Method (Request-focused)

For services:

  • Rate: Requests per second
  • Errors: Failed requests per second
  • Duration: Request latency distribution
# Prometheus metrics
import time

from prometheus_client import Counter, Histogram

requests = Counter('http_requests_total', 'Total requests', ['method', 'status'])
latency = Histogram('http_request_duration_seconds', 'Request latency')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    start = time.time()
    response = await call_next(request)

    requests.labels(method=request.method, status=response.status_code).inc()
    latency.observe(time.time() - start)

    return response

The USE Method (Resource-focused)

For infrastructure:

  • Utilization: Percentage of resource busy
  • Saturation: Queue length, backlog
  • Errors: Error count
# Example alerts
- alert: HighCPUUtilization
  expr: cpu_usage_percent > 80
  for: 5m

- alert: DiskNearlyFull
  expr: disk_usage_percent > 90
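The same three USE signals can be read off almost any resource, not just hosts. A toy sketch for an in-process worker pool, with illustrative names and numbers:

```python
from dataclasses import dataclass

@dataclass
class PoolStats:
    workers: int  # total workers in the pool
    busy: int     # workers currently handling a task
    queued: int   # tasks waiting for a free worker
    errors: int   # failed tasks since the last scrape

def use_snapshot(s: PoolStats) -> dict:
    """USE metrics for a worker pool: Utilization, Saturation, Errors."""
    return {
        "utilization": s.busy / s.workers,  # fraction of capacity in use
        "saturation": s.queued,             # backlog depth: work that cannot start
        "errors": s.errors,
    }

snap = use_snapshot(PoolStats(workers=10, busy=8, queued=3, errors=1))
```

Note that saturation is the earlier warning: a pool can sit at 100% utilization and be healthy, but a growing queue means demand already exceeds capacity.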

Traces: The Journey

Traces follow a request across service boundaries:

Trace ID: xyz789

api-gateway
  user-service
    db-query (45ms)
    preferences-service (40ms)

Good for:

  • Understanding distributed latency
  • Finding bottlenecks
  • Debugging cross-service issues
  • Dependency mapping

Challenges:

  • Sampling required at scale
  • Instrumentation overhead
  • Context propagation across boundaries

Implementing Tracing

from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

tracer = trace.get_tracer(__name__)

@app.get("/api/users/{user_id}")
async def get_user(user_id: int):
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user_id", user_id)
        
        with tracer.start_as_current_span("db_query"):
            user = await db.get_user(user_id)
        
        with tracer.start_as_current_span("enrich_user"):
            preferences = await preferences_service.get(user_id)
        
        return {**user, **preferences}

Propagate context to downstream services:

# Inject trace context into outgoing requests
headers = {}
TraceContextTextMapPropagator().inject(headers)
response = requests.get(url, headers=headers)
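On the receiving side, the downstream service reads the same context back out, typically with `TraceContextTextMapPropagator().extract(headers)`. Under the hood, that parses the W3C `traceparent` header. A stdlib-only sketch of the format, for intuition; real services should use the propagator rather than hand-rolled parsing:

```python
def parse_traceparent(header: str) -> dict:
    """Parse a W3C traceparent header: version-trace_id-parent_id-flags."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent")
    return {
        "version": version,
        "trace_id": trace_id,          # 16-byte trace ID, hex-encoded
        "parent_span_id": parent_id,   # 8-byte ID of the caller's span
        "sampled": bool(int(flags, 16) & 0x01),  # bit 0 = sampled flag
    }

tp = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

The new span becomes a child of `parent_span_id`, which is what stitches the per-service spans into one trace.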

Connecting the Pillars

The magic happens when logs, metrics, and traces connect:

{
  "timestamp": "2026-02-23T13:00:01Z",
  "level": "error",
  "message": "Payment failed",
  "trace_id": "xyz789",
  "span_id": "abc123",
  "user_id": "user_456"
}

From an alert (metric), jump to the trace. From the trace, find the relevant logs. Each provides context the others lack.

Correlation IDs

Every request gets a unique ID that flows everywhere:

import structlog
from uuid import uuid4

@app.middleware("http")
async def add_request_id(request, call_next):
    request_id = request.headers.get('X-Request-ID', str(uuid4()))

    # Add to logging context
    structlog.contextvars.bind_contextvars(request_id=request_id)

    response = await call_next(request)
    response.headers['X-Request-ID'] = request_id

    return response

Now every log, metric label, and trace span can include request_id.
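structlog does this binding for you; a dependency-free sketch of the same idea, using only the stdlib (a `contextvars` variable plus a `logging.Filter` that stamps each record), shows there's no magic involved:

```python
import logging
from contextvars import ContextVar
from uuid import uuid4

# Holds the current request's ID; contextvars keep it isolated per async task
request_id_var: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every log record."""
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("app")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter('{"request_id": "%(request_id)s", "msg": "%(message)s"}'))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id_var.set(str(uuid4()))   # done once per request, e.g. in middleware
logger.info("payment_started")     # emits JSON carrying the bound request_id
```

Set the variable once in middleware and every log line in that request's code path is automatically correlated.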

Sampling Strategies

At scale, you can’t keep everything:

Head-based sampling: Decide at the start whether to trace this request.

# Sample 10% of requests
if random.random() < 0.1:
    start_trace()

Tail-based sampling: Decide after the request completes.

# Keep all errors and slow requests
if response.status >= 500 or duration > 1.0:
    keep_trace()

Adaptive sampling: Adjust rate based on traffic.

# More sampling during low traffic
sample_rate = 0.1 if requests_per_second > 1000 else 0.5
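In practice, head-based sampling is usually made deterministic rather than random: the keep/drop decision is computed from the trace ID itself (OpenTelemetry's `TraceIdRatioBased` sampler takes this approach), so every service that sees the same trace makes the same decision and traces are never half-kept. A sketch of the idea:

```python
def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic head-based sampling: a pure function of the trace ID,
    so all services agree on keep vs. drop without coordination."""
    bound = int(rate * (1 << 64))
    # Compare the low 64 bits of the trace ID against the rate threshold
    return (trace_id & ((1 << 64) - 1)) < bound

tid = 0x0AF7651916CD43DD8448EB211C80319C
keep = should_sample(tid, 0.1)  # identical result on every service in the call chain
```

Random per-service sampling, by contrast, would leave you with fragments: the gateway keeps its span while a downstream service drops its half of the same trace.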

Alerting Philosophy

Alert on symptoms, not causes:

# Bad - alerts on cause
- alert: DatabaseConnectionPoolExhausted
  expr: db_connections_available == 0

# Good - alerts on symptom
- alert: HighErrorRate
  expr: rate(http_requests_total{status="500"}[5m]) > 0.01

# Then investigate: why are requests failing? (maybe connection pool)

Symptoms tell you users are affected. Causes require investigation.
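One refinement worth noting: the error-rate expression above uses an absolute threshold, so the same alert behaves differently at low and high traffic. Alerting on the error *ratio* is often more robust. A PromQL sketch, with an illustrative 1% threshold:

```yaml
- alert: HighErrorRatio
  expr: |
    sum(rate(http_requests_total{status="500"}[5m]))
      / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
```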

Alert Fatigue

Every alert should be:

  • Actionable: Someone can do something about it
  • Urgent: It needs attention now (or use a ticket, not an alert)
  • Real: Low false positive rate

If you’re ignoring alerts, you have too many or they’re misconfigured.

Tool Choices

Metrics: Prometheus, InfluxDB, Datadog, CloudWatch
Logs: Elasticsearch, Loki, Splunk, CloudWatch Logs
Traces: Jaeger, Zipkin, Datadog APM, AWS X-Ray

Unified platforms: Datadog, Grafana Cloud, New Relic, Honeycomb

Starting out? Prometheus + Loki + Tempo with Grafana covers all three pillars, is fully open source, and the pieces integrate well.

The Observability Mindset

Observability isn’t a product you buy. It’s a property of your system:

  • Instrument everything: If it’s not measured, it doesn’t exist
  • Correlate across signals: Connect logs, metrics, traces
  • Ask arbitrary questions: Can you debug issues you’ve never seen before?
  • Reduce MTTR: The goal is faster incident resolution

The system that can explain its own behavior is the system you can actually operate.


Logs tell the story. Metrics show the trends. Traces reveal the journey. Separately, each is useful. Together, they’re essential.

Build observability in from the start. Instrument before you need to debug. Connect your signals with correlation IDs. The incident you resolve in minutes instead of hours will justify every line of instrumentation code.