When your service goes down at 3 AM, you need answers fast. Observability—the ability to understand what’s happening inside your systems from their external outputs—is what separates a 5-minute fix from a 3-hour nightmare.

The three pillars of observability are logs, metrics, and traces. Each tells a different part of the story.

Logs: The Narrative

Logs are discrete events. They tell you what happened in human-readable terms.

{
  "timestamp": "2026-03-03T12:34:56Z",
  "level": "error",
  "service": "payment-api",
  "message": "Payment processing failed",
  "user_id": "12345",
  "error_code": "CARD_DECLINED",
  "request_id": "abc-123"
}

Best Practices for Logging

Structure your logs. JSON is your friend. Unstructured logs like "Payment failed for user 12345" are hard to search and aggregate.

Include context. Every log should answer: who, what, when, and ideally why. Always include request IDs for correlation.
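
One stdlib-only way to get structured output like the example above is a custom formatter; here's a minimal sketch (the logger name and the list of context fields are illustrative):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    converter = time.gmtime  # UTC timestamps, to match the "Z" suffix

    # Context fields to lift from `extra=` when the caller supplied them
    CONTEXT_FIELDS = ("request_id", "user_id", "service", "error_code")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Payment processing failed",
             extra={"request_id": "abc-123", "user_id": "12345",
                    "error_code": "CARD_DECLINED"})
```

In practice you'd reach for a library like `python-json-logger` instead, but the idea is the same: context goes in as fields, not interpolated into the message string.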

Use appropriate levels:

  • DEBUG: Development details
  • INFO: Normal operations worth noting
  • WARN: Something unexpected, but handled
  • ERROR: Something failed
  • FATAL: Service can’t continue

Sample high-volume logs. If you’re logging every request, you’ll drown in data. Sample at ingestion time, but keep 100% of errors.

import logging
import random

logger = logging.getLogger(__name__)

def should_log_request():
    # Sample 10% of successful requests
    return random.random() < 0.1

@app.route('/api/data')
def get_data():
    result = process_request()
    if result.error or should_log_request():
        # Errors are always logged, and at the appropriate level
        level = logging.ERROR if result.error else logging.INFO
        logger.log(level, "Request processed", extra={
            "status": result.status,
            "duration_ms": result.duration
        })
    return result

Metrics: The Numbers

Metrics are numeric measurements over time. They tell you how your system is performing.

The Four Golden Signals

Google’s SRE book identified four critical metrics:

  1. Latency: How long requests take
  2. Traffic: How many requests you’re handling
  3. Errors: How many requests fail
  4. Saturation: How “full” your system is
from prometheus_client import Counter, Histogram, Gauge

# Traffic
request_count = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Latency
request_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.01, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Saturation
connection_pool_usage = Gauge(
    'db_connection_pool_usage_ratio',
    'Database connection pool utilization'
)

Metrics vs Logs

A common question: if I have logs, why do I need metrics?

Metrics are aggregatable. “What’s the 99th percentile latency over the last hour?” is easy with metrics, painful with logs.
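
To see why that aggregation is cheap, here's a rough sketch of how a histogram-backed p99 works: the store keeps only one counter per bucket, and the quantile is read off the cumulative counts. The bucket bounds below mirror the Prometheus example above; the quantile logic is simplified for illustration.

```python
import bisect

# Bucket upper bounds in seconds, as in the Prometheus histogram above.
# (A real histogram also keeps a +Inf bucket; omitted here.)
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

def observe(counts, duration_s):
    """Increment the first bucket whose bound covers the observation."""
    idx = bisect.bisect_left(BUCKETS, duration_s)
    if idx < len(counts):
        counts[idx] += 1

def estimate_quantile(counts, q):
    """Return the upper bound of the bucket containing the q-th quantile."""
    target = q * sum(counts)
    cumulative = 0
    for bound, count in zip(BUCKETS, counts):
        cumulative += count
        if cumulative >= target:
            return bound
    return BUCKETS[-1]

counts = [0] * len(BUCKETS)
for d in [0.02] * 98 + [3.0, 3.0]:  # 98 fast requests, 2 slow ones
    observe(counts, d)

estimate_quantile(counts, 0.99)  # → 5.0: the slow requests land in the 5s bucket
```

Prometheus's `histogram_quantile()` additionally interpolates within the bucket, but the principle holds: a fixed handful of counters answers percentile questions over any number of requests, with no per-request records kept.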

Metrics are cheap. Storing a counter increment costs almost nothing. Storing a full log line costs orders of magnitude more.

Metrics enable alerting. “Alert if error rate exceeds 1%” requires metrics. You can’t efficiently scan logs in real-time for this.

Use logs for debugging individual requests. Use metrics for understanding system behavior.

Traces: The Journey

Traces follow a request as it travels through your distributed system. They tell you where time is spent.

[Trace waterfall: nested spans across services with per-span durations in milliseconds; the slowest span is flagged "🔥 Slow!"]

Implementing Tracing

Modern tracing follows the OpenTelemetry standard:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize tracer
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        
        # Each child operation creates a child span
        with tracer.start_as_current_span("validate_inventory"):
            await check_inventory(order_id)
        
        with tracer.start_as_current_span("charge_payment"):
            await process_payment(order_id)

Context Propagation

The magic of distributed tracing is context propagation. When Service A calls Service B, it passes trace context in headers:

traceparent:00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01

This links all spans from all services into a single trace.
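
The header format comes from the W3C Trace Context spec: version, trace-id, parent span id, and flags, hex-encoded and dash-separated. A minimal parser, just to make the fields concrete:

```python
def parse_traceparent(header: str) -> dict:
    """Split a W3C traceparent header into its four fields."""
    version, trace_id, parent_id, flags = header.split("-")
    if len(trace_id) != 32 or len(parent_id) != 16:
        raise ValueError("malformed traceparent header")
    return {
        "version": version,
        "trace_id": trace_id,    # 16-byte trace id, hex-encoded
        "parent_id": parent_id,  # 8-byte id of the calling span
        "sampled": int(flags, 16) & 0x01 == 1,  # flag bit 0: sampled
    }

parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

In practice you never hand-roll this: OpenTelemetry's propagators inject the header on outgoing calls and extract it on incoming ones automatically.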

Connecting the Pillars

The real power comes from connecting all three:

import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request(request):
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id
    
    # Include trace_id in logs for correlation
    logger.info(
        "Processing request",
        extra={
            "trace_id": format(trace_id, '032x'),
            "user_id": request.user_id,
            "endpoint": request.path
        }
    )
    
    # Increment metrics with the same labels; the counter declared
    # earlier also carries a 'status' label
    request_count.labels(
        method=request.method,
        endpoint=request.path,
        status="200"  # in practice, the actual response status
    ).inc()

Now when you see an error spike in your metrics dashboard, you can:

  1. Click through to see related logs
  2. Find the trace_id in those logs
  3. Open the trace to see exactly where things went wrong
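
Step 2 is just a filter on the shared trace_id field, which structured logs make trivial. A small sketch (the log lines are illustrative):

```python
import json

def logs_for_trace(log_lines, trace_id):
    """Keep only the structured log entries tagged with the given trace."""
    entries = (json.loads(line) for line in log_lines)
    return [e for e in entries if e.get("trace_id") == trace_id]

lines = [
    '{"trace_id": "0af7651916cd43dd8448eb211c80319c", "level": "error",'
    ' "message": "Payment processing failed"}',
    '{"trace_id": "c3a415269e6c1dd0b1f9a8f0de8e5a21", "level": "info",'
    ' "message": "Request processed"}',
]
logs_for_trace(lines, "0af7651916cd43dd8448eb211c80319c")
```

Log backends like Loki or Datadog run the same kind of filter for you when you click through from a dashboard.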

Tool Recommendations

Logs:

  • Self-hosted: Loki + Grafana
  • Managed: Datadog, Splunk, AWS CloudWatch

Metrics:

  • Self-hosted: Prometheus + Grafana
  • Managed: Datadog, New Relic, AWS CloudWatch

Traces:

  • Self-hosted: Jaeger, Tempo
  • Managed: Datadog APM, Honeycomb, AWS X-Ray

All-in-one: Grafana Cloud, Datadog, and Elastic all offer unified observability platforms.

Start Small, Grow Incrementally

You don’t need all three pillars on day one:

  1. Week 1: Add structured logging with request IDs
  2. Week 2: Add the four golden signals as metrics
  3. Week 3: Add tracing to your most critical path
  4. Week 4: Connect them with shared identifiers

The goal isn’t perfect observability—it’s enough observability to debug problems quickly. Start with what hurts most, and expand from there.


Good observability is an investment that pays dividends at 3 AM. Your on-call self will thank your past self.