When your service goes down at 3 AM, you need answers fast. Observability—the ability to understand what’s happening inside your systems from their external outputs—is what separates a 5-minute fix from a 3-hour nightmare.

The three pillars of observability are logs, metrics, and traces. Each tells a different part of the story.

Logs: The Narrative

Logs are discrete events. They tell you what happened in human-readable terms.

{
  "timestamp": "2026-03-03T12:34:56Z",
  "level": "error",
  "service": "payment-api",
  "message": "Payment processing failed",
  "user_id": "12345",
  "error_code": "CARD_DECLINED",
  "request_id": "abc-123"
}

Best Practices for Logging

Structure your logs. JSON is your friend. Unstructured logs like "Payment failed for user 12345" are hard to search and aggregate.

Include context. Every log should answer: who, what, when, and ideally why. Always include request IDs for correlation.
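
One stdlib-only way to get structured output like the example above is a custom formatter; here's a minimal sketch (the logger name and the list of context fields are illustrative):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    converter = time.gmtime  # UTC timestamps, to match the "Z" suffix

    # Context fields to lift from `extra=` when the caller supplied them
    CONTEXT_FIELDS = ("request_id", "user_id", "service", "error_code")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"request_id": "abc-123", "user_id": "12345",
                    "error_code": "CARD_DECLINED"})
```

In practice you'd reach for a library like `python-json-logger` instead, but the idea is the same: context goes in as fields, not interpolated into the message string.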

Use appropriate levels:

  • DEBUG: Development details
  • INFO: Normal operations worth noting
  • WARN: Something unexpected, but handled
  • ERROR: Something failed
  • FATAL: Service can’t continue

Sample high-volume logs. If you’re logging every request, you’ll drown in data. Sample at ingestion time, but keep 100% of errors.

import logging
import random

logger = logging.getLogger(__name__)

def should_log_request():
    # Sample 10% of successful requests
    return random.random() < 0.1

@app.route('/api/data')
def get_data():
    result = process_request()
    if result.error or should_log_request():
        # Errors are always logged, and at the appropriate level
        level = logging.ERROR if result.error else logging.INFO
        logger.log(level, "Request processed", extra={
            "status": result.status,
            "duration_ms": result.duration
        })
    return result

Metrics: The Numbers

Metrics are numeric measurements over time. They tell you how your system is performing.

The Four Golden Signals

Google’s SRE book identified four critical metrics:

  1. Latency: How long requests take
  2. Traffic: How many requests you’re handling
  3. Errors: How many requests fail
  4. Saturation: How “full” your system is
from prometheus_client import Counter, Histogram, Gauge

# Traffic
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Latency
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.01, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Saturation
connection_pool_usage = Gauge(
    'db_connection_pool_usage_ratio',
    'Database connection pool utilization'
)

Metrics vs Logs

A common question: if I have logs, why do I need metrics?

Metrics are aggregatable. “What’s the 99th percentile latency over the last hour?” is easy with metrics, painful with logs.
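
To see why that aggregation is cheap, here's a rough sketch of how a histogram-backed p99 works: the store keeps only one counter per bucket, and the quantile is read off the cumulative counts. The bucket bounds below mirror the Prometheus example above; the quantile logic is simplified for illustration.

```python
import bisect

# Bucket upper bounds in seconds, as in the Prometheus histogram above.
# (A real histogram also keeps a +Inf bucket; omitted here.)
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

def observe(counts, duration_s):
    """Increment the first bucket whose bound covers the observation."""
    idx = bisect.bisect_left(BUCKETS, duration_s)
    if idx < len(counts):
        counts[idx] += 1

def estimate_quantile(counts, q):
    """Return the upper bound of the bucket containing the q-th quantile."""
    target = q * sum(counts)
    cumulative = 0
    for bound, count in zip(BUCKETS, counts):
        cumulative += count
        if cumulative >= target:
            return bound
    return BUCKETS[-1]

counts = [0] * len(BUCKETS)
for d in [0.02] * 98 + [3.0, 3.0]:  # 98 fast requests, 2 slow ones
    observe(counts, d)

estimate_quantile(counts, 0.99)  # → 5.0: the slow requests land in the 5s bucket
```

Prometheus's `histogram_quantile()` additionally interpolates within the bucket, but the principle holds: a fixed handful of counters answers percentile questions over any number of requests, with no per-request records kept.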

Metrics are cheap. Storing a counter increment costs almost nothing. Storing a full log line costs orders of magnitude more.

Metrics enable alerting. “Alert if error rate exceeds 1%” requires metrics. You can’t efficiently scan logs in real-time for this.

Use logs for debugging individual requests. Use metrics for understanding system behavior.

Traces: The Journey

Traces follow a request as it travels through your distributed system. They tell you where time is spent.

[Trace waterfall: nested spans across services with per-span durations in milliseconds; the slowest span is flagged "🔥 Slow!"]

Implementing Tracing

Modern tracing follows the OpenTelemetry standard:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracer
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        # Each child operation creates a child span
        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order_id)
        
        with tracer.start_as_current_span("charge_payment"):
            await process_payment(order_id)

Context Propagation

The magic of distributed tracing is context propagation. When Service A calls Service B, it passes trace context in headers:

traceparent:00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

This links all spans from all services into a single trace.
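
The header format comes from the W3C Trace Context spec: version, trace-id, parent span id, and flags, hex-encoded and dash-separated. A minimal parser, just to make the fields concrete:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,    # 16-byte trace id, hex-encoded
        "parent_id": parent_id,  # 8-byte id of the calling span
        "sampled": int(flags, 16) & 0x01 == 1,  # flag bit 0: sampled
    }

parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

In practice you never hand-roll this: OpenTelemetry's propagators inject the header on outgoing calls and extract it on incoming ones automatically.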

Connecting the Pillars

The real power comes from connecting all three:

import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request(request):
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    
    # Include trace_id in logs for correlation
    logger.info(
        "Processing request",
        extra={
            "trace_id": format(trace_id, '032x'),
            "user_id": request.user_id,
            "endpoint": request.path
        }
    )
    
    # Increment metrics with the same labels; the counter declared
    # earlier also carries a 'status' label
    request_count.labels(
        method=request.method,
        endpoint=request.path,
        status="200"  # in practice, the actual response status
    ).inc()

Now when you see an error spike in your metrics dashboard, you can:

  1. Click through to see related logs
  2. Find the trace_id in those logs
  3. Open the trace to see exactly where things went wrong
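
Step 2 is just a filter on the shared trace_id field, which structured logs make trivial. A small sketch (the log lines are illustrative):

```python
import json

def logs_for_trace(log_lines, trace_id):
    """Keep only the structured log entries tagged with the given trace."""
    entries = (json.loads(line) for line in log_lines)
    return [e for e in entries if e.get("trace_id") == trace_id]

lines = [
    '{"trace_id": "0af7651916cd43dd8448eb211c80319c", "level": "error",'
    ' "message": "Payment processing failed"}',
    '{"trace_id": "c3a415269e6c1dd0b1f9a8f0de8e5a21", "level": "info",'
    ' "message": "Request processed"}',
]
logs_for_trace(lines, "0af7651916cd43dd8448eb211c80319c")
```

Log backends like Loki or Datadog run the same kind of filter for you when you click through from a dashboard.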

Tool Recommendations

Logs:

  • Self-hosted: Loki + Grafana
  • Managed: Datadog, Splunk, AWS CloudWatch

Metrics:

  • Self-hosted: Prometheus + Grafana
  • Managed: Datadog, New Relic, AWS CloudWatch

Traces:

  • Self-hosted: Jaeger, Tempo
  • Managed: Datadog APM, Honeycomb, AWS X-Ray

All-in-one: Grafana Cloud, Datadog, and Elastic all offer unified observability platforms.

Start Small, Grow Incrementally

You don’t need all three pillars on day one:

  1. Week 1: Add structured logging with request IDs
  2. Week 2: Add the four golden signals as metrics
  3. Week 3: Add tracing to your most critical path
  4. Week 4: Connect them with shared identifiers

The goal isn’t perfect observability—it’s enough observability to debug problems quickly. Start with what hurts most, and expand from there.


Good observability is an investment that pays dividends at 3 AM. Your on-call self will thank your past self.