Your logs are lying to you. Not because they’re wrong, but because they’re formatted for humans who will never read them.
That stack trace you carefully formatted? It’ll be searched by a machine. Those helpful debug messages? They’ll be filtered by a regex that breaks on the first edge case. The log line that would have saved you three hours of debugging? Buried in 10GB of unstructured text.
Structured logging fixes this. Here’s how to do it without making your codebase worse.
What Structured Logging Actually Means
Structured logging means emitting logs as data, not strings. Instead of interpolating values into a message, you emit a JSON object with named fields.
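A minimal before/after sketch (timestamps, field names, and values here are illustrative, not a standard):

```python
import json

# Before: one interpolated string. Getting the values back out means regex.
print("2024-05-01T12:00:00Z ERROR Connection timeout for order 12345 ($180.00)")

# After: the same event as data. Every field is directly queryable.
entry = {
    "timestamp": "2024-05-01T12:00:00Z",
    "level": "ERROR",
    "message": "connection_timeout",
    "order_id": "12345",
    "amount": 180.00,
}
print(json.dumps(entry))
```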
The difference isn’t cosmetic. The first format requires regex parsing that breaks when message formats change. The second format lets you query: “show me all connection timeouts for orders over $100 in the last hour.”
The Minimum Viable Log Entry
Every log entry should have:
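Sketched as JSON, a baseline entry might look like this (the field names are a common convention, not a standard):

```python
import json
from datetime import datetime, timezone

entry = {
    "timestamp": datetime.now(timezone.utc).isoformat(),  # always UTC, ISO 8601
    "level": "INFO",
    "message": "order_created",   # short, static string; variables go in their own fields
    "service": "checkout",
    "trace_id": "a1b2c3d4",       # propagated from the incoming request
}
print(json.dumps(entry))
```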
That’s the baseline. Everything else is context you add based on what you’re logging.
Timestamp: Always UTC, always ISO 8601. Your log aggregator will thank you.
Level: Be honest. INFO means “normal operation.” WARN means “something’s off but we’re handling it.” ERROR means “we failed to do what we were asked.”
Message: Short, static string. Not a template with variables. The variables go in separate fields.
Service: Which component. Essential when you have more than one thing running.
Trace ID: The key to distributed debugging. Pass this through every service in a request chain.
Context Is Everything
The log message tells you what happened. Context tells you why it matters.
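A sketch of the two versions — first the bare message, then the same event with context (field names and values are illustrative):

```python
import json

# First version: the message alone.
print(json.dumps({"level": "ERROR", "message": "payment failed"}))

# Second version: the same event, plus the context that makes it queryable.
rich = {
    "level": "ERROR",
    "message": "payment_failed",
    "user_id": "u_4821",
    "payment_provider": "stripe",
    "retries": 3,
    "amount": 149.99,
    "currency": "USD",
}
print(json.dumps(rich))
```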
The second version lets you:
- Query all failures for a specific user
- Calculate failure rates by payment provider
- Find orders that failed after N retries
- Correlate with amount thresholds
The first version lets you grep and hope.
What to Log at Each Level
DEBUG: Internal state useful for development. Request/response bodies (sanitized), cache hits/misses, branch decisions. Never in production unless you’re actively debugging.
INFO: Normal operations worth recording. Request received, job started, job completed. The audit trail of your system doing its job.
WARN: Something unexpected that we handled. Rate limit approaching, deprecated endpoint called, fallback activated. Things that deserve attention but didn’t break anything.
ERROR: We failed to complete the requested operation. User impact occurred or was narrowly avoided. These should trigger alerts.
The Performance Question
“But structured logging is slower than string logging!”
Measure it. In most applications, the difference is microseconds. If you’re logging so much that JSON serialization is your bottleneck, you’re logging too much.
That said, some practical guidelines:
- Don’t log in hot loops. Aggregate, then log once.
- Lazy evaluation for expensive context. Don’t compute debug info if debug logging is disabled.
- Sample high-volume events. Log 1% of health checks, 100% of errors.
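The last two guidelines can be sketched with the standard library (`expensive_dump` and the 1% rate are placeholders):

```python
import logging
import random

log = logging.getLogger("worker")

def expensive_dump() -> dict:
    # Stand-in for costly introspection you only want to pay for at DEBUG.
    return {"cache_entries": 10_000}

# Lazy evaluation: build expensive context only when DEBUG is actually on.
if log.isEnabledFor(logging.DEBUG):
    log.debug("cache_state", extra={"state": expensive_dump()})

# Sampling: ~1% of routine health checks, 100% of failures.
def log_health_check(ok: bool) -> None:
    if not ok:
        log.error("health_check_failed")
    elif random.random() < 0.01:
        log.info("health_check_ok")
```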
Sensitive Data
Structured logging makes it easy to accidentally log secrets. Some rules:
- Never log passwords, tokens, or API keys. Not even partially.
- Hash or truncate identifiers. Log `user_id: abc...xyz`, not full emails.
- Sanitize request bodies. Have an allowlist of fields that are safe to log.
- Review your log output. Grep your logs for patterns that look like secrets.
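An allowlist sanitizer is a few lines; the safe field names here are assumptions you'd replace with your own:

```python
# Allowlist, not blocklist: unknown fields are dropped by default,
# so a new secret field can never leak by omission.
SAFE_FIELDS = {"order_id", "amount", "currency", "status"}

def sanitize(body: dict) -> dict:
    """Keep only fields known to be safe to log; drop everything else."""
    return {k: v for k, v in body.items() if k in SAFE_FIELDS}

raw = {"order_id": "o_991", "card_number": "4242424242424242", "amount": 50}
print(sanitize(raw))  # card_number never reaches the logs
```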
Correlation Across Services
In a distributed system, a single user request touches multiple services. Without correlation, debugging is archaeology.
The pattern:
- Generate a trace ID at the edge (API gateway, load balancer)
- Pass it in headers through every internal call
- Include it in every log entry
- Include it in error responses to users
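A minimal sketch of the propagation step, assuming an `X-Trace-Id` header (a common choice, not a standard — W3C Trace Context uses `traceparent`):

```python
import json
import uuid

TRACE_HEADER = "X-Trace-Id"

def log(entry: dict) -> None:
    print(json.dumps(entry))

def handle_request(headers: dict) -> dict:
    # Reuse the ID from upstream, or mint one if we're the edge.
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    log({"level": "INFO", "message": "request_received", "trace_id": trace_id})
    # Forward the same ID on every internal call and in error responses.
    return {TRACE_HEADER: trace_id}
```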
Now when a user reports “something went wrong,” they can give you the trace ID from the error page, and you can find every log entry from every service that touched their request.
Log Aggregation
Structured logs are only useful if you can query them. The stack that works:
- Emit JSON to stdout. Let the infrastructure handle collection.
- Ship to a log aggregator. Loki, Elasticsearch, CloudWatch, Datadog — pick one.
- Build dashboards for common queries. Error rates by service, latency percentiles, top error messages.
- Set up alerts on patterns. Error rate spikes, new error types, specific high-severity events.
The key insight: your logs are a time-series database of events. Treat them that way.
Practical Migration
If you’re stuck with printf-style logging, migrate incrementally:
- Add a JSON handler alongside your existing one. Emit both formats temporarily.
- Start with error logs. These are most valuable and lowest volume.
- Add trace IDs to new code. Retrofit critical paths.
- Build one dashboard. Prove value before migrating everything.
- Kill the text logs. Once you’re querying JSON, you won’t go back.
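The first migration step — both formats side by side — can be sketched with Python's `logging` (the `JsonFormatter` and service name are illustrative; libraries like `python-json-logger` do this for you):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": "checkout",  # illustrative service name
        })

log = logging.getLogger("app")
log.setLevel(logging.INFO)

text_handler = logging.StreamHandler(sys.stderr)   # existing text logs, unchanged
json_handler = logging.StreamHandler(sys.stdout)   # new JSON logs alongside them
json_handler.setFormatter(JsonFormatter())
log.addHandler(text_handler)
log.addHandler(json_handler)

log.info("order_created")  # emitted in both formats during the migration
```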
The Debugging Payoff
Here’s what good structured logging enables: filter by any field, follow one trace ID across every service, and compute error rates with a query instead of a grep.
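As an illustrative sketch, even a plain JSON-lines file answers questions a text log can't (entries and field names follow the earlier examples):

```python
import json

lines = [
    '{"level": "ERROR", "service": "payments", "trace_id": "a1", "message": "payment_failed"}',
    '{"level": "INFO",  "service": "payments", "trace_id": "a1", "message": "retry_scheduled"}',
    '{"level": "ERROR", "service": "orders",   "trace_id": "b2", "message": "db_timeout"}',
]
events = [json.loads(line) for line in lines]

# Every entry, from every service, for one user's request:
trace = [e for e in events if e["trace_id"] == "a1"]

# Error counts by service, no regex in sight:
errors: dict[str, int] = {}
for e in events:
    if e["level"] == "ERROR":
        errors[e["service"]] = errors.get(e["service"], 0) + 1
print(trace, errors)
```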
Compare that to grepping through files, guessing at timestamps, and manually correlating between services.
The investment in structured logging pays off the first time you debug a production incident in minutes instead of hours. Your logs stop being write-only archives and become queryable operational data.
That’s the goal: logs as a feature, not a formality.