When your application runs on 50 containers across 10 servers, SSH’ing into each one to grep logs doesn’t scale. Centralized logging gives you one place to search everything.

The Log Aggregation Pipeline

Apps (stdout, files, syslog)
    → Collectors (Fluentd / Logstash / Vector)
    → Storage (Elasticsearch / Loki / S3)
    → Search & Visualization (Kibana / Grafana)

Stack Options

ELK (Elasticsearch, Logstash, Kibana)

The classic choice. Powerful but resource-hungry.

# docker-compose.yml
services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data
    
  logstash:
    image: logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch
      
  kibana:
    image: kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  esdata:
# logstash.conf
input {
  beats { port => 5044 }
  tcp { port => 5000 codec => json }
}

filter {
  if [message] =~ /^\{/ {
    json { source => "message" }
  }
  
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
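The tcp/json input above accepts newline-delimited JSON from any application. A minimal sketch of a sender, assuming the hostname `logstash` and port 5000 match the compose service and the input above:

```python
import json
import socket

def encode_log(event: dict) -> bytes:
    """Frame one event as a newline-delimited JSON line,
    the format the tcp { codec => json } input expects."""
    return (json.dumps(event) + "\n").encode("utf-8")

def ship(event: dict, host: str = "logstash", port: int = 5000) -> None:
    # Host and port match the tcp input in logstash.conf above.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(encode_log(event))
```

In practice you would keep the connection open (or use a local collector such as Filebeat) rather than reconnecting per event.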

Loki + Grafana

Lightweight alternative. Doesn’t index log content—just labels.

# docker-compose.yml
services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
      
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'

Vector

Modern, fast collector written in Rust:

# vector.toml
[sources.docker]
type = "docker_logs"

[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '''
. = parse_json!(.message)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels = { app = "{{ container_name }}" }

Structured Logging

Make logs machine-parseable:

# Bad: unstructured
logger.info(f"User {user_id} purchased {item_id} for ${price}")

# Good: structured JSON
logger.info("purchase_completed", extra={
    "user_id": user_id,
    "item_id": item_id,
    "price": price,
    "currency": "USD"
})

Output:

{
  "timestamp": "2026-03-04T16:30:00Z",
  "level": "info",
  "message": "purchase_completed",
  "user_id": "u-12345",
  "item_id": "item-789",
  "price": 29.99,
  "currency": "USD",
  "service": "checkout",
  "trace_id": "abc123"
}

Python Structured Logging

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()
logger.info("user_login", user_id="12345", method="oauth")

Node.js with Pino

const pino = require('pino');
const logger = pino({ level: 'info' });

logger.info({ userId: '12345', action: 'purchase' }, 'Order completed');

Log Levels

Use consistently across services:

| Level | When to Use                                |
|-------|--------------------------------------------|
| DEBUG | Detailed debugging info (disabled in prod) |
| INFO  | Normal operations worth recording          |
| WARN  | Unexpected but handled situations          |
| ERROR | Failures that need attention               |
| FATAL | Service cannot continue                    |
logger.debug("Cache lookup", key=key)
logger.info("Request processed", duration_ms=150)
logger.warning("Retry attempt", attempt=3, max_attempts=5)
logger.error("Payment failed", error=str(e), user_id=user_id)
logger.critical("Database connection lost", host=db_host)

Essential Fields

Always include:

{
  "timestamp": "ISO8601 format",
  "level": "info|warn|error",
  "service": "service-name",
  "environment": "prod|staging",
  "trace_id": "for distributed tracing",
  "request_id": "unique per request",
  "message": "human readable description"
}
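One way to guarantee those fields with the stdlib `logging` module is a custom JSON formatter. A sketch; the `service` and `environment` values are placeholders you would pull from config:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with the baseline fields."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": "checkout",      # placeholder: read from config
            "environment": "prod",      # placeholder: read from config
            "message": record.getMessage(),
        }
        # Copy correlation IDs passed via `extra=` when present
        for key in ("trace_id", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("purchase_completed", extra={"trace_id": "abc123"})
```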

Query Patterns

Elasticsearch/Kibana (KQL)

# Errors in the checkout service
service:"checkout" AND level:"error"

# Slow requests, excluding health checks
duration_ms > 1000 AND NOT url:"/healthcheck"

# One user's activity since a date
user_id:"u-12345" AND @timestamp >= "2026-03-04"

# Server errors across services
level:"error" AND status >= 500

Loki (LogQL)

# Errors in the checkout service
{service="checkout"} |= "error"

# Parse JSON and filter on a field
{service="api"} | json | duration > 1000

# Regex match on user IDs
{service="checkout"} |~ "user_id=u-\\d+"

# Error rate over 5 minutes
rate({service="checkout"} |= "error" [5m])

Retention and Storage

Logs grow fast. Plan for it:

# Elasticsearch ILM policy
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Cost Optimization

  • Hot storage: Recent logs (7 days), fast SSDs
  • Warm storage: Older logs (30 days), cheaper disks
  • Cold/Archive: Compliance (S3 Glacier), rarely accessed
  • Delete: After retention period
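The cold/archive tier can live in the same ILM policy as the hot and warm phases above. A sketch of a cold phase backed by a searchable snapshot; the `s3-archive` repository name is an assumption, and the phase ages must be reconciled with the 30-day delete phase shown earlier:

```json
"cold": {
  "min_age": "14d",
  "actions": {
    "searchable_snapshot": { "snapshot_repository": "s3-archive" }
  }
}
```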
# Loki retention
limits_config:
  retention_period: 720h  # 30 days
  
compactor:
  retention_enabled: true

Alerting on Logs

Elasticsearch Watcher

PUT _watcher/watch/error_spike
{
  "trigger": { "schedule": { "interval": "5m" }},
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "level": "error" }},
                { "range": { "@timestamp": { "gte": "now-5m" }}}
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 100 }}
  },
  "actions": {
    "slack": {
      "webhook": {
        "url": "https://hooks.slack.com/..."
      }
    }
  }
}

Loki + Alertmanager

# Loki ruler alert rule (Prometheus-style, routed to Alertmanager)
- alert: HighErrorRate
  expr: |
    sum(rate({level="error"}[5m])) by (service) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate in {{ $labels.service }}"

Security Considerations

Sensitive Data

Never log:

  • Passwords or tokens
  • Credit card numbers
  • Personal health information
  • Full social security numbers
# Redact sensitive fields (recurses into nested dicts)
SENSITIVE = {'password', 'token', 'ssn', 'credit_card'}

def sanitize_log(data):
    return {k: '***' if k in SENSITIVE
            else sanitize_log(v) if isinstance(v, dict) else v
            for k, v in data.items()}

logger.info("user_data", **sanitize_log(user_dict))
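Field-level redaction misses secrets embedded in free-text messages. A supplementary sketch that masks card-number-shaped digit runs; the pattern is illustrative, not PCI-grade:

```python
import re

# 13-16 digits, optionally separated by single spaces or hyphens
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def redact_text(message: str) -> str:
    """Mask anything that looks like a card number in a log message."""
    return CARD_RE.sub("[REDACTED]", message)
```

Run this over the message field before the record leaves the process, not in the aggregation pipeline, so the raw value is never shipped.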

Access Control

  • Restrict who can search logs
  • Audit log access
  • Encrypt in transit and at rest

Quick Start Checklist

  • All services output structured JSON logs
  • Consistent timestamp format (ISO8601)
  • Request/trace IDs for correlation
  • Collector deployed (Promtail/Filebeat/Vector)
  • Retention policy configured
  • Basic alerting on error spikes
  • Sensitive data redacted

Tool Comparison

|                  | ELK         | Loki              | CloudWatch |
|------------------|-------------|-------------------|------------|
| Full-text search | ✅ Fast     | ⚠️ Slow           | ⚠️ Slow    |
| Cost at scale    | 💰💰💰      | 💰                | 💰💰       |
| Setup complexity | High        | Low               | None       |
| Query language   | KQL         | LogQL             | Insights   |
| Best for         | Large teams | K8s/Grafana users | AWS native |

Centralized logging turns “where did that error happen?” from a 30-minute hunt into a 30-second query. Set it up before you need it.