Your logs are probably useless. Not because you’re not logging, but because you’re logging wrong. Here’s how to make logs that actually help when things break.

The Problem with Unstructured Logs

[2024-03-12 10:23:45] INFO: User logged in
[2024-03-12 10:23:46] ERROR: Something went wrong
[2024-03-12 10:23:47] Processing order for user 12345

Try answering these questions:

  • Which user logged in?
  • What went wrong? Where?
  • How do I find all errors for user 12345?

You can’t. At least not without regex gymnastics and hope.

Structured Logging

Same events, structured:

{"timestamp": "2024-03-12T10:23:45Z", "level": "info", "event": "user_login", "user_id": "u_789", "ip": "192.168.1.100"}
{"timestamp": "2024-03-12T10:23:46Z", "level": "error", "event": "payment_failed", "user_id": "u_789", "error": "card_declined", "amount": 99.99}
{"timestamp": "2024-03-12T10:23:47Z", "level": "info", "event": "order_processing", "user_id": "u_12345", "order_id": "ord_456"}

Now you can:

  • Query: user_id = "u_789" AND level = "error"
  • Aggregate: Count errors by event type
  • Alert: When payment_failed events spike
  • Correlate: Find all events for a user/request
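To make "query" and "aggregate" concrete, here is a minimal sketch in plain Python that filters and counts JSON-lines entries the way a log backend would (no log store required; the entries are the three from above):

```python
import json
from collections import Counter

# The three JSON-lines entries from the example above
raw_logs = """\
{"timestamp": "2024-03-12T10:23:45Z", "level": "info", "event": "user_login", "user_id": "u_789", "ip": "192.168.1.100"}
{"timestamp": "2024-03-12T10:23:46Z", "level": "error", "event": "payment_failed", "user_id": "u_789", "error": "card_declined", "amount": 99.99}
{"timestamp": "2024-03-12T10:23:47Z", "level": "info", "event": "order_processing", "user_id": "u_12345", "order_id": "ord_456"}
"""

entries = [json.loads(line) for line in raw_logs.splitlines()]

# Query: user_id = "u_789" AND level = "error"
errors = [e for e in entries if e.get("user_id") == "u_789" and e["level"] == "error"]

# Aggregate: count entries by event type
by_event = Counter(e["event"] for e in entries)

print(errors[0]["event"])      # payment_failed
print(by_event["user_login"])  # 1
```

A real backend (Elasticsearch, Loki, CloudWatch) does exactly this at scale, with indexes instead of list comprehensions.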

Implementation

Python (structlog)

import structlog

# Configure once at startup
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

# Usage
log.info("user_login", user_id="u_789", ip="192.168.1.100")
log.error("payment_failed", user_id="u_789", error="card_declined", amount=99.99)

# Bind context for the request
log = log.bind(request_id="req_123", user_id="u_789")
log.info("order_created", order_id="ord_456")  # Includes request_id and user_id

Node.js (pino)

const pino = require('pino');

const log = pino({
  level: 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Usage
log.info({ user_id: 'u_789', ip: '192.168.1.100' }, 'user_login');
log.error({ user_id: 'u_789', error: 'card_declined', amount: 99.99 }, 'payment_failed');

// Child logger with bound context
const reqLog = log.child({ request_id: 'req_123', user_id: 'u_789' });
reqLog.info({ order_id: 'ord_456' }, 'order_created');

Go (zerolog)

package main

import (
    "os"
    "github.com/rs/zerolog"
    "github.com/rs/zerolog/log"
)

func main() {
    zerolog.TimeFieldFormat = zerolog.TimeFormatUnix
    log.Logger = zerolog.New(os.Stdout).With().Timestamp().Logger()

    // Usage
    log.Info().
        Str("user_id", "u_789").
        Str("ip", "192.168.1.100").
        Msg("user_login")

    log.Error().
        Str("user_id", "u_789").
        Str("error", "card_declined").
        Float64("amount", 99.99).
        Msg("payment_failed")

    // With context
    reqLog := log.With().
        Str("request_id", "req_123").
        Str("user_id", "u_789").
        Logger()
    
    reqLog.Info().Str("order_id", "ord_456").Msg("order_created")
}

Essential Fields

Every log entry should have:

{
  "timestamp": "2024-03-12T10:23:45.123Z",
  "level": "info",
  "service": "payment-api",
  "version": "1.2.3",
  "environment": "production",
  "trace_id": "abc123",
  "span_id": "def456",
  "message": "payment_processed"
}
  • timestamp: ISO 8601, UTC
  • level: debug, info, warn, error, fatal
  • service: Which service generated this
  • version: Deployed version (helps with “did the deploy break it?”)
  • environment: prod, staging, dev
  • trace_id/span_id: For distributed tracing correlation
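The static fields (service, version, environment) shouldn't be repeated at every call site. With structlog they can be attached once by a custom processor; shown here standalone so the mechanics are visible (the metadata values are placeholders):

```python
# Hypothetical static metadata; in practice, read from env vars or build info
SERVICE_INFO = {"service": "payment-api", "version": "1.2.3", "environment": "production"}

def add_service_info(logger, method_name, event_dict):
    # A structlog-style processor: receives the event dict, mutates it,
    # and returns it for the next processor in the chain
    event_dict.update(SERVICE_INFO)
    return event_dict

# Plugged into structlog.configure(processors=[..., add_service_info, ...]),
# every entry gains these fields. Standalone demonstration:
entry = add_service_info(None, "info", {"event": "payment_processed", "level": "info"})
print(entry["service"])  # payment-api
```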

Request Context

Inject request context into every log within a request:

# Python middleware example
import uuid
from contextvars import ContextVar

request_context: ContextVar[dict] = ContextVar('request_context', default={})

def logging_middleware(request, call_next):
    ctx = {
        'request_id': str(uuid.uuid4()),
        'user_id': request.user.id if request.user else None,
        'path': request.path,
        'method': request.method,
    }
    request_context.set(ctx)
    
    response = call_next(request)
    return response

# In your logger config, include context
def add_request_context(logger, method_name, event_dict):
    ctx = request_context.get()
    event_dict.update(ctx)
    return event_dict

Now every log in that request automatically includes request_id, user_id, etc.
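To wire this in, add add_request_context to the processors list in structlog.configure. The ContextVar mechanics are worth seeing in isolation: each request runs in its own context, so concurrent requests can't leak fields into each other. A self-contained sketch, independent of any web framework:

```python
import uuid
from contextvars import ContextVar, copy_context

request_context: ContextVar[dict] = ContextVar("request_context", default={})

def add_request_context(logger, method_name, event_dict):
    # Merge whatever the current request bound into every log entry
    event_dict.update(request_context.get())
    return event_dict

def handle_request(user_id):
    # Each request sets its own context dict
    request_context.set({"request_id": str(uuid.uuid4()), "user_id": user_id})
    return add_request_context(None, "info", {"event": "order_created"})

# Run two "requests" in separate contexts, as an async server would
e1 = copy_context().run(handle_request, "u_789")
e2 = copy_context().run(handle_request, "u_555")
print(e1["user_id"], e2["user_id"])  # u_789 u_555
```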

Log Levels

Use them consistently:

  • debug: Detailed info for developers. Off in production.
  • info: Normal operations: requests, jobs, state changes
  • warn: Something unexpected but handled: retries, fallbacks
  • error: Something failed and needs attention
  • fatal: Application cannot continue
# Good level choices
log.debug("cache_lookup", key="user:123", hit=True)
log.info("order_created", order_id="ord_456", total=99.99)
log.warn("rate_limit_approaching", current=95, limit=100)
log.error("payment_failed", error="timeout", retry_count=3)
log.fatal("database_connection_failed", host="db.example.com")

What NOT to Log

Sensitive Data

# NEVER
log.info("user_login", password=user.password, ssn=user.ssn)

# DO
log.info("user_login", user_id=user.id)

Scrub: passwords, tokens, credit cards, SSNs, API keys, PII.
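Scrubbing is safer enforced in the logging pipeline than trusted to every call site. A minimal structlog-style redaction processor; the key list is illustrative, not exhaustive:

```python
# Keys to redact; extend this set for your domain (illustrative, not exhaustive)
SENSITIVE_KEYS = {"password", "ssn", "token", "api_key", "credit_card", "authorization"}

def scrub_sensitive(logger, method_name, event_dict):
    # Replace values rather than dropping keys, so the redaction
    # itself is visible in the output
    for key in list(event_dict):
        if key.lower() in SENSITIVE_KEYS:
            event_dict[key] = "[REDACTED]"
    return event_dict

entry = scrub_sensitive(None, "info", {"event": "user_login", "user_id": "u_789", "password": "hunter2"})
print(entry["password"])  # [REDACTED]
print(entry["user_id"])   # u_789
```

Placed before the JSONRenderer in the processor chain, it catches mistakes at every call site at once.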

High-Volume Noise

# Don't log every cache hit in production
if settings.DEBUG:
    log.debug("cache_hit", key=key)

Use debug level for high-frequency events.

Centralized Aggregation

ELK Stack (Elasticsearch, Logstash, Kibana)

11
# Filebeat config
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    json.keys_under_root: true
    json.add_error_key: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "logs-%{+yyyy.MM.dd}"

Loki (Grafana)

# Promtail config
scrape_configs:
  - job_name: containers
    static_configs:
      - targets:
          - localhost
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
      - labels:
          level:
          service:

CloudWatch Logs Insights

fields @timestamp, @message
| filter level = "error"
| filter service = "payment-api"
| stats count(*) as error_count by error
| sort error_count desc
| limit 10

Alerting on Logs

# Loki ruler rule; fires to Prometheus Alertmanager
groups:
  - name: logs
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate({job="app"} |= "error" [5m])) by (service)
          / sum(rate({job="app"} [5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate in {{ $labels.service }}"

The Debugging Workflow

When something breaks:

  1. Find the error: level = "error" AND timestamp > "10 minutes ago"
  2. Get context: Find the request_id or trace_id
  3. See the journey: request_id = "req_123" shows all logs for that request
  4. Find patterns: error = "timeout" | stats count() by service
  5. Correlate: Link to traces and metrics with the same IDs
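Steps 1 through 3 amount to a filter over JSON lines. A sketch in plain Python with hypothetical entries (a real log backend runs the same logic at scale):

```python
import json

# Hypothetical entries: one request's journey plus an unrelated request
raw_logs = """\
{"timestamp": "2024-03-12T10:23:45Z", "level": "info", "request_id": "req_123", "event": "user_login"}
{"timestamp": "2024-03-12T10:23:46Z", "level": "info", "request_id": "req_123", "event": "order_created"}
{"timestamp": "2024-03-12T10:23:47Z", "level": "error", "request_id": "req_123", "event": "payment_failed", "error": "timeout"}
{"timestamp": "2024-03-12T10:23:47Z", "level": "info", "request_id": "req_999", "event": "user_login"}
"""
entries = [json.loads(line) for line in raw_logs.splitlines()]

# 1. Find the error
error = next(e for e in entries if e["level"] == "error")

# 2-3. Pull its request_id, then see the whole journey
journey = [e["event"] for e in entries if e["request_id"] == error["request_id"]]
print(journey)  # ['user_login', 'order_created', 'payment_failed']
```

Without the shared request_id, the error on line three would be an isolated fact; with it, you see the login and order that led up to it.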

Quick Wins

  1. Today: Switch one service to JSON logging
  2. This week: Add request_id to all logs
  3. This month: Set up centralized log aggregation
  4. This quarter: Build dashboards and alerts on log patterns

The goal: when something breaks at 3am, you can find out what happened in minutes, not hours.


Good logs tell a story. Great logs let you search that story.