Everyone logs. Few log well. The difference between “we have logs” and “we can debug with logs” comes down to discipline in what you capture, how you structure it, and where you send it.

The Logging Hierarchy

Not all log levels are created equal. Use them intentionally:

  • TRACE: Extremely fine-grained detail. Never enabled in production.
  • DEBUG: Diagnostic detail for developers. Off by default outside development.
  • INFO: Noteworthy events. Normal, expected operation.
  • WARN: Expected problems, handled but unusual. Needs attention eventually.
  • ERROR: Something broke. Needs investigation.
  • FATAL: The application is unusable. Wake someone up.

The key insight: INFO should tell a story. If you read only INFO logs, you should understand what the application did.

Structured Logging: The Non-Negotiable

Plain text logs are for humans reading terminals. Structured logs are for machines processing millions of events:

# Bad: grep-hostile, unparseable
logger.info(f"User {user_id} purchased {item_id} for ${price}")

# Good: structured, queryable
logger.info("purchase_completed", extra={
    "user_id": user_id,
    "item_id": item_id,
    "price_cents": price_cents,
    "currency": "USD",
    "payment_method": payment_method
})
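How the extra fields reach the output depends on your logging setup; with the standard library, one way is a small custom formatter (a sketch — the reserved-attribute check is simplified, and the logger name is arbitrary):

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record, plus any extra= fields, as one JSON line."""

    # Attributes present on every LogRecord; anything else came from extra=
    _RESERVED = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for key, value in vars(record).items():
            if key not in self._RESERVED:
                payload[key] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```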

Output as JSON:

{
  "timestamp": "2026-03-05T10:15:32.123Z",
  "level": "INFO",
  "message": "purchase_completed",
  "user_id": "u_abc123",
  "item_id": "item_xyz",
  "price_cents": 4999,
  "currency": "USD",
  "payment_method": "card"
}

Now you can query: “Show me all purchases over $100 using PayPal in the last hour.”
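Against a store of JSON lines, even a throwaway script can answer that question (a sketch — the field names match the example above; the function name is made up):

```python
import json
from datetime import datetime, timedelta, timezone


def recent_large_paypal_purchases(lines, now=None):
    """Yield purchase events over $100 via PayPal from the last hour."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=1)
    for line in lines:
        event = json.loads(line)
        # .replace() handles the trailing "Z" on older Pythons
        ts = datetime.fromisoformat(event["timestamp"].replace("Z", "+00:00"))
        if (event.get("message") == "purchase_completed"
                and event.get("price_cents", 0) > 10000
                and event.get("payment_method") == "paypal"
                and ts >= cutoff):
            yield event
```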

What to Log

Always Log

Request boundaries

@app.middleware("http")
async def log_requests(request, call_next):
    request_id = str(uuid4())
    start = time.monotonic()
    logger.info("request_started", extra={
        "request_id": request_id,
        "method": request.method,
        "path": request.url.path,
        "client_ip": request.client.host
    })
    
    response = await call_next(request)
    
    logger.info("request_completed", extra={
        "request_id": request_id,
        "status_code": response.status_code,
        "duration_ms": int((time.monotonic() - start) * 1000)
    })
    return response

State transitions

def update_order_status(order_id: str, new_status: str):
    old_status = get_order_status(order_id)
    
    logger.info("order_status_changed", extra={
        "order_id": order_id,
        "old_status": old_status,
        "new_status": new_status,
        "changed_by": current_user()
    })
    
    # Actually update...

External system calls

def call_payment_api(payload):
    logger.info("payment_api_request", extra={
        "provider": "stripe",
        "amount_cents": payload["amount"],
        "idempotency_key": payload["idempotency_key"]
    })
    start = time.monotonic()
    
    try:
        response = stripe.charges.create(**payload)
        logger.info("payment_api_success", extra={
            "charge_id": response.id,
            "duration_ms": int((time.monotonic() - start) * 1000)
        })
        return response
    except StripeError as e:
        logger.error("payment_api_failed", extra={
            "error_type": type(e).__name__,
            "error_message": str(e),
            "duration_ms": int((time.monotonic() - start) * 1000)
        })
        raise

Authentication events

logger.info("login_attempt", extra={"username": username, "success": False, "reason": "invalid_password"})
logger.info("login_success", extra={"user_id": user.id, "mfa_used": True})
logger.warning("suspicious_login", extra={"user_id": user.id, "new_ip": ip, "country": geo.country})

Never Log

Secrets and credentials

# NEVER
logger.debug(f"Connecting with password: {password}")
logger.info(f"API key: {api_key}")

# Instead, log that you're using credentials, not what they are
logger.info("database_connecting", extra={"host": host, "user": user})

Full PII without redaction

# Bad
logger.info(f"Processing card {card_number}")

# Better
logger.info("processing_payment", extra={
    "card_last_four": card_number[-4:],
    "card_brand": detect_brand(card_number)
})
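A shared redaction helper keeps this consistent instead of trusting every call site to remember (a sketch; the sensitive field names and masking rules are assumptions):

```python
SENSITIVE_FIELDS = {"password", "api_key", "card_number", "ssn", "token"}


def redact(fields: dict) -> dict:
    """Mask known-sensitive keys before they reach the logger."""
    cleaned = {}
    for key, value in fields.items():
        if key in SENSITIVE_FIELDS:
            cleaned[key] = "[REDACTED]"
        elif key == "email" and isinstance(value, str) and "@" in value:
            # Keep first character and domain: alice@example.com -> a***@example.com
            local, _, domain = value.partition("@")
            cleaned[key] = f"{local[:1]}***@{domain}"
        else:
            cleaned[key] = value
    return cleaned
```

Then every log call routes its fields through it: `logger.info("user_updated", extra=redact(fields))`.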

Huge payloads

# Bad: logs 10MB of data
logger.debug(f"Response body: {response.text}")

# Better: log metadata, store payload elsewhere if needed
logger.debug("response_received", extra={
    "content_length": len(response.text),
    "content_type": response.headers.get("content-type")
})

Correlation: Tying It All Together

A single user action can touch dozens of services. Without correlation, you’re searching haystacks:

# Generate at the edge, propagate everywhere
REQUEST_ID_HEADER = "X-Request-ID"

def get_correlation_id():
    return context.get("request_id") or str(uuid4())

# Every log includes it
class CorrelatedLogger:
    def info(self, msg, **kwargs):
        kwargs["request_id"] = get_correlation_id()
        kwargs["service"] = SERVICE_NAME
        underlying_logger.info(msg, extra=kwargs)
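One way to back that context lookup in Python is contextvars, which survives async task switches (a sketch; the middleware hookup at the edge is assumed):

```python
import uuid
from contextvars import ContextVar
from typing import Optional

_request_id: ContextVar[Optional[str]] = ContextVar("request_id", default=None)


def set_correlation_id(value: Optional[str] = None) -> str:
    """Call at the service edge: reuse the inbound X-Request-ID or mint one."""
    rid = value or str(uuid.uuid4())
    _request_id.set(rid)
    return rid


def get_correlation_id() -> str:
    """Read anywhere downstream; mints a fresh ID if none was set."""
    return _request_id.get() or set_correlation_id()
```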

Now tracing a request across services is a single query:

SELECT * FROM logs 
WHERE request_id = 'abc-123' 
ORDER BY timestamp;

Error Logging Done Right

Errors need context. The stack trace alone tells you where, not why:

try:
    result = process_order(order)
except ValidationError as e:
    logger.error("order_validation_failed", extra={
        "order_id": order.id,
        "user_id": order.user_id,
        "validation_errors": e.errors,
        "order_total": order.total,
        "item_count": len(order.items)
    })
    raise

except ExternalServiceError as e:
    logger.error("order_processing_failed", extra={
        "order_id": order.id,
        "service": e.service_name,
        "error_code": e.code,
        "retry_count": attempt_number,
        "will_retry": attempt_number < MAX_RETRIES
    }, exc_info=True)  # Include stack trace
    raise

The goal: someone reading this log at 3am should understand what happened without reading code.

Log Sampling for High-Volume Events

Some events happen millions of times. Log strategically:

import random

def should_sample(event_type: str, rate: float = 0.01) -> bool:
    """Sample 1% of events by default."""
    return random.random() < rate

# Always log errors
if response.status >= 500:
    logger.error("request_failed", extra=context)

# Sample successful requests
elif should_sample("request_success"):
    logger.info("request_completed", extra={**context, "sampled": True})

Or sample intelligently:

def smart_sample(context: dict) -> bool:
    # Always log slow requests
    if context["duration_ms"] > 1000:
        return True
    # Always log new users' first requests
    if context.get("is_first_request"):
        return True
    # Sample the rest
    return random.random() < 0.01
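One caveat: random.random() makes every log line an independent coin flip, so a single request's lines can end up half-kept, half-dropped. Hashing the correlation ID makes the decision consistent per request (a sketch):

```python
import hashlib


def sample_by_request(request_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: a given request_id is always kept or always dropped."""
    digest = hashlib.sha256(request_id.encode()).digest()
    # Map the first 8 bytes of the hash to a uniform value in [0, 1)
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Every service that sees the same request_id makes the same keep/drop decision, so sampled requests stay fully traceable end to end.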

Log Aggregation Patterns

Local logs are useless at scale. Ship them somewhere queryable:

Simple: stdout + container orchestrator

# Log to stdout
logging.basicConfig(
    format='%(message)s',
    stream=sys.stdout,
    level=logging.INFO
)

Let Kubernetes/Docker capture and forward.

Medium: Fluent Bit sidecar

# fluent-bit.conf
[INPUT]
    Name tail
    Path /var/log/app/*.log
    Parser json

[OUTPUT]
    Name  es
    Match *
    Host  elasticsearch
    Index app-logs

Full: Dedicated logging infrastructure

  • Loki + Grafana (lightweight, label-based)
  • Elasticsearch + Kibana (powerful, resource-hungry)
  • Datadog/Splunk (managed, expensive)

Pick based on query patterns and budget.

Retention and Costs

Logs grow fast. Plan for it:

# Example retention policy
hot_storage:    7 days   # Fast queries, expensive
warm_storage:   30 days  # Slower queries, cheaper
cold_storage:   90 days  # Compliance/audit only
deleted:        90+ days # Gone

Cost optimization:

  • Sample high-volume, low-value events
  • Compress before shipping
  • Drop DEBUG/TRACE in production
  • Index only fields you query
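Dropping DEBUG in production doesn't need a code change if the level comes from the environment (a sketch; LOG_LEVEL is an assumed variable name):

```python
import logging
import os


def configure_logging() -> logging.Logger:
    """Read the log level from the environment; default to INFO."""
    level_name = os.environ.get("LOG_LEVEL", "INFO").upper()
    level = getattr(logging, level_name, logging.INFO)
    logging.basicConfig(level=level, format="%(message)s")
    return logging.getLogger("app")
```

Production sets LOG_LEVEL=INFO (or WARNING); a developer chasing a bug sets LOG_LEVEL=DEBUG locally without touching the code.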

The Debugging Workflow

Good logs enable this flow:

  1. Alert fires: “Error rate > 5% on checkout service”
  2. Quick filter: service=checkout level=ERROR last 15m
  3. Find pattern: “All errors have payment_provider=stripe”
  4. Correlate: Pick one request_id, trace across services
  5. Root cause: Stripe API returning 503, retry logic not working
  6. Fix and verify: Deploy fix, watch error rate drop in real-time

If your logs don’t support this workflow, they’re not earning their storage costs.

Quick Wins

If you do nothing else:

  1. Use structured logging — JSON or key-value pairs
  2. Add request IDs — Correlate across services
  3. Log state transitions — Before and after
  4. Include context in errors — What was the app trying to do?
  5. Never log secrets — Seems obvious, still happens

Logs are the most reliable witness to what your code actually did. Treat them accordingly.