When your application spans multiple services, containers, and regions, print("something went wrong") doesn’t cut it anymore. Structured logging transforms your logs from walls of text into queryable data.

Why Structured Logging?

Traditional logs are strings meant for humans:

[2026-02-13 14:00:00] ERROR: Failed to process order 12345 for user john@example.com

Structured logs are data meant for machines (and humans):

{
  "timestamp": "2026-02-13T14:00:00Z",
  "level": "error",
  "message": "Failed to process order",
  "order_id": "12345",
  "user_email": "john@example.com",
  "service": "order-processor",
  "trace_id": "abc123",
  "duration_ms": 2340
}

The second version lets you query: “Show me all errors for order 12345 across all services” or “What’s the p99 duration for failed orders?”
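To make that concrete, here is a minimal, dependency-free sketch of the kind of query the JSON form enables. The sample records and field names mirror the example above; with only two data points it computes a median rather than a p99, but the shape of the query is the same.

```python
import json
import statistics

# A few JSON log lines as they might appear in a log file (illustrative data)
raw_logs = """
{"level": "error", "message": "Failed to process order", "order_id": "12345", "duration_ms": 2340}
{"level": "info", "message": "Order processed", "order_id": "67890", "duration_ms": 120}
{"level": "error", "message": "Failed to process order", "order_id": "12345", "duration_ms": 1800}
""".strip().splitlines()

records = [json.loads(line) for line in raw_logs]

# "Show me all errors for order 12345"
errors = [r for r in records if r["level"] == "error" and r["order_id"] == "12345"]
print(len(errors))  # 2

# "What's the typical duration for failed orders?"
durations = [r["duration_ms"] for r in errors]
print(statistics.median(durations))  # 2070
```

In production a log aggregator runs this kind of query for you, but the principle is identical: every field in the event dict becomes a queryable dimension.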

Python Implementation

Using structlog for clean, structured output:

import logging
import uuid

import structlog

# structlog's stdlib integration needs the standard logging module configured
logging.basicConfig(format="%(message)s", level=logging.INFO)

# Configure structlog
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,  # include contextvars-bound fields
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
)

log = structlog.get_logger()

# Bind context that persists across log calls
def process_request(request):
    # Create a trace ID for this request
    trace_id = str(uuid.uuid4())[:8]
    
    # Bind context - all subsequent logs include these fields
    logger = log.bind(
        trace_id=trace_id,
        user_id=request.user_id,
        endpoint=request.path
    )
    
    logger.info("request_started")
    
    try:
        result = do_work(request, logger)
        logger.info("request_completed", status="success")
        return result
    except Exception as e:
        logger.error("request_failed", 
                    error=str(e), 
                    error_type=type(e).__name__)
        raise

The Trace ID Pattern

In distributed systems, a single user action might touch 10 services. The trace ID ties them all together:

import uuid

import httpx
import structlog

log = structlog.get_logger()

# Service A: API Gateway
@app.middleware("http")
async def add_trace_id(request, call_next):
    trace_id = request.headers.get("X-Trace-ID") or str(uuid.uuid4())
    
    # Store in context for this request
    structlog.contextvars.bind_contextvars(trace_id=trace_id)
    
    response = await call_next(request)
    response.headers["X-Trace-ID"] = trace_id
    return response

# Service B: Order Processor
async def process_order(order_data: dict, trace_id: str):
    logger = log.bind(trace_id=trace_id, order_id=order_data["id"])
    
    logger.info("order_processing_started")
    
    # Call downstream service, passing trace ID
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://inventory-service/reserve",
            json=order_data,
            headers={"X-Trace-ID": trace_id}
        )
    
    logger.info("inventory_reserved", 
                response_status=response.status_code)

Now when something fails, you search for the trace ID and see the complete journey across all services.
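The propagate-or-create step both services rely on can be factored into a tiny, stdlib-only helper (the function name is my own; the `X-Trace-ID` header matches the examples above):

```python
import uuid

def get_or_create_trace_id(headers: dict) -> str:
    """Reuse the caller's trace ID if present, otherwise start a new trace."""
    return headers.get("X-Trace-ID") or str(uuid.uuid4())

# An incoming request from another service keeps its trace ID...
print(get_or_create_trace_id({"X-Trace-ID": "abc123"}))  # abc123

# ...while a request arriving at the edge gets a fresh one
fresh = get_or_create_trace_id({})
print(len(fresh))  # 36 (canonical UUID string)
```

Whatever the helper looks like, the invariant is the important part: a trace ID is minted exactly once, at the edge, and every downstream hop forwards it unchanged.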

Log Levels That Mean Something

Define clear semantics for each level:

# DEBUG: Detailed diagnostic info (off in production)
logger.debug("cache_lookup", key=cache_key, hit=True)

# INFO: Normal operations worth recording
logger.info("order_created", order_id=order_id, total=total)

# WARNING: Unexpected but handled situations
logger.warning("rate_limit_approaching", 
               current=950, limit=1000, user_id=user_id)

# ERROR: Failures that need attention
logger.error("payment_failed", 
             order_id=order_id, 
             error_code=error_code,
             retry_count=3)

# CRITICAL: System-level failures
logger.critical("database_connection_lost", 
                host=db_host, 
                reconnect_attempts=5)

Shipping to a Log Aggregator

Structured logs shine when aggregated. Here’s a Docker setup shipping to Elasticsearch:

# docker-compose.yml
services:
  app:
    image: myapp
    logging:
      driver: "fluentd"
      options:
        fluentd-address: "localhost:24224"
        tag: "app.{{.Name}}"

  fluentd:
    image: fluent/fluentd-kubernetes-daemonset
    volumes:
      - ./fluent.conf:/fluentd/etc/fluent.conf
    ports:
      - "24224:24224"
# fluent.conf
<source>
  @type forward
  port 24224
</source>

<match app.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  index_name logs
  type_name _doc
</match>

Querying in Practice

With structured logs in Elasticsearch or similar:

# All errors for a single order
level:error AND order_id:"12345"

# Everything a request touched, across services, in the last hour
trace_id:"abc123" AND @timestamp:[now-1h TO now]

# Slow failed orders from one service
level:error AND service:"order-processor" AND duration_ms:>1000

Key Principles

  1. Always include a trace/correlation ID — the single most valuable field
  2. Use consistent field names — user_id everywhere, not userId sometimes
  3. Log events, not sentences — "order_created", not "An order was created"
  4. Include relevant context — IDs, durations, counts, states
  5. Keep sensitive data out — no passwords, tokens, or PII in logs
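For the last point, one option is a scrubbing step in the logging pipeline itself. structlog processors are plain callables that receive and return the event dict, so a sketch might look like this (the field list and function name are illustrative, not a standard):

```python
SENSITIVE_KEYS = {"password", "token", "authorization", "user_email", "ssn"}

def scrub_sensitive(logger, method_name, event_dict):
    """Processor that masks sensitive fields before the log line is rendered."""
    for key in list(event_dict):
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict

# The same callable works on a bare event dict:
event = {"event": "login_failed", "user_email": "john@example.com", "attempts": 3}
print(scrub_sensitive(None, "info", event))
# {'event': 'login_failed', 'user_email': '[REDACTED]', 'attempts': 3}
```

Placed in the `processors` list before `JSONRenderer`, it guarantees redaction happens before anything is serialized, no matter which call site logged the field.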

The Payoff

Structured logging takes more thought upfront, but when it’s 3 AM and something’s broken, being able to query “show me the exact sequence of events for this failed request across all services” is worth every minute invested.

Your future on-call self will thank you.