When your application runs on 50 containers across 10 servers, SSH’ing into each one to grep logs doesn’t scale. Centralized logging gives you one place to search everything.

The Log Aggregation Pipeline

Apps (stdout, files, syslog)
    → Collectors (Fluentd / Logstash / Vector)
    → Storage (Elasticsearch / Loki / S3)
    → Search & Visualization (Kibana / Grafana)

Stack Options

ELK (Elasticsearch, Logstash, Kibana)

The classic choice. Powerful but resource-hungry.

# docker-compose.yml
services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data
    
  logstash:
    image: logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch
      
  kibana:
    image: kibana:8.12.0
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  esdata:
# logstash.conf
input {
  beats { port => 5044 }
  tcp { port => 5000 codec => json }
}

filter {
  if [message] =~ /^\{/ {
    json { source => "message" }
  }
  
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
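The tcp/json input above accepts newline-delimited JSON from any application. A minimal sketch of a sender, assuming the hostname `logstash` and port 5000 match the compose service and the input above:

```python
import json
import socket

def encode_log(event: dict) -> bytes:
    """Frame one event as a newline-delimited JSON line,
    the format the tcp { codec => json } input expects."""
    return (json.dumps(event) + "\n").encode("utf-8")

def ship(event: dict, host: str = "logstash", port: int = 5000) -> None:
    # Host and port match the tcp input in logstash.conf above.
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(encode_log(event))
```

In practice you would keep the connection open (or use a local collector such as Filebeat) rather than reconnecting per event.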

Loki + Grafana

Lightweight alternative. Doesn’t index log content—just labels.

# docker-compose.yml
services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
      
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
# promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'

Vector

Modern, fast collector written in Rust:

# vector.toml
[sources.docker]
type = "docker_logs"

[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '''
. = parse_json!(.message)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "http://loki:3100"
encoding.codec = "json"
labels = { app = "{{ container_name }}" }

Structured Logging

Make logs machine-parseable:

# Bad: unstructured
logger.info(f"User {user_id} purchased {item_id} for ${price}")

# Good: structured JSON
logger.info("purchase_completed", extra={
    "user_id": user_id,
    "item_id": item_id,
    "price": price,
    "currency": "USD"
})

Output:

{
  "timestamp": "2026-03-04T16:30:00Z",
  "level": "info",
  "message": "purchase_completed",
  "user_id": "u-12345",
  "item_id": "item-789",
  "price": 29.99,
  "currency": "USD",
  "service": "checkout",
  "trace_id": "abc123"
}

Python Structured Logging

import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()
logger.info("user_login", user_id="12345", method="oauth")

Node.js with Pino

const pino = require('pino');
const logger = pino({ level: 'info' });

logger.info({ userId: '12345', action: 'purchase' }, 'Order completed');

Log Levels

Use consistently across services:

| Level | When to Use                                |
|-------|--------------------------------------------|
| DEBUG | Detailed debugging info (disabled in prod) |
| INFO  | Normal operations worth recording          |
| WARN  | Unexpected but handled situations          |
| ERROR | Failures that need attention               |
| FATAL | Service cannot continue                    |
logger.debug("Cache lookup", key=key)
logger.info("Request processed", duration_ms=150)
logger.warning("Retry attempt", attempt=3, max_attempts=5)
logger.error("Payment failed", error=str(e), user_id=user_id)
logger.critical("Database connection lost", host=db_host)

Essential Fields

Always include:

{
  "timestamp": "ISO8601 format",
  "level": "info|warn|error",
  "service": "service-name",
  "environment": "prod|staging",
  "trace_id": "for distributed tracing",
  "request_id": "unique per request",
  "message": "human readable description"
}
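One way to guarantee those fields with the stdlib `logging` module is a custom JSON formatter. A sketch; the `service` and `environment` values are placeholders you would pull from config:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line with the baseline fields."""

    def format(self, record):
        entry = {
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ",
                                       time.gmtime(record.created)),
            "level": record.levelname.lower(),
            "service": "checkout",      # placeholder: read from config
            "environment": "prod",      # placeholder: read from config
            "message": record.getMessage(),
        }
        # Copy correlation IDs passed via `extra=` when present
        for key in ("trace_id", "request_id"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info("purchase_completed", extra={"trace_id": "abc123"})
```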

Query Patterns

Elasticsearch/Kibana (KQL)

# Errors in the checkout service
service:"checkout" AND level:"error"

# Slow requests, excluding health checks
duration_ms > 1000 AND NOT url:"/healthcheck"

# One user's activity since a date
user_id:"u-12345" AND @timestamp >= "2026-03-04"

# Server errors across services
level:"error" AND status >= 500

Loki (LogQL)

# Errors in the checkout service
{service="checkout"} |= "error"

# Parse JSON and filter on a field
{service="api"} | json | duration > 1000

# Regex match on user IDs
{service="checkout"} |~ "user_id=u-\\d+"

# Error rate over 5 minutes
rate({service="checkout"} |= "error" [5m])

Retention and Storage

Logs grow fast. Plan for it:

# Elasticsearch ILM policy
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Cost Optimization

  • Hot storage: Recent logs (7 days), fast SSDs
  • Warm storage: Older logs (30 days), cheaper disks
  • Cold/Archive: Compliance (S3 Glacier), rarely accessed
  • Delete: After retention period
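The cold/archive tier can live in the same ILM policy as the hot and warm phases above. A sketch of a cold phase backed by a searchable snapshot; the `s3-archive` repository name is an assumption, and the phase ages must be reconciled with the 30-day delete phase shown earlier:

```json
"cold": {
  "min_age": "14d",
  "actions": {
    "searchable_snapshot": { "snapshot_repository": "s3-archive" }
  }
}
```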
# Loki retention
limits_config:
  retention_period: 720h  # 30 days
  
compactor:
  retention_enabled: true

Alerting on Logs

Elasticsearch Watcher

PUT _watcher/watch/error_spike
{
  "trigger": { "schedule": { "interval": "5m" }},
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "level": "error" }},
                { "range": { "@timestamp": { "gte": "now-5m" }}}
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 100 }}
  },
  "actions": {
    "slack": {
      "webhook": {
        "url": "https://hooks.slack.com/..."
      }
    }
  }
}

Loki + Alertmanager

# Loki ruler alert rule (Prometheus-style, routed to Alertmanager)
- alert: HighErrorRate
  expr: |
    sum(rate({level="error"}[5m])) by (service) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate in {{ $labels.service }}"

Security Considerations

Sensitive Data

Never log:

  • Passwords or tokens
  • Credit card numbers
  • Personal health information
  • Full social security numbers
# Redact sensitive fields (recurses into nested dicts)
SENSITIVE = {'password', 'token', 'ssn', 'credit_card'}

def sanitize_log(data):
    return {k: '***' if k in SENSITIVE
            else sanitize_log(v) if isinstance(v, dict) else v
            for k, v in data.items()}

logger.info("user_data", **sanitize_log(user_dict))
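Field-level redaction misses secrets embedded in free-text messages. A supplementary sketch that masks card-number-shaped digit runs; the pattern is illustrative, not PCI-grade:

```python
import re

# 13-16 digits, optionally separated by single spaces or hyphens
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def redact_text(message: str) -> str:
    """Mask anything that looks like a card number in a log message."""
    return CARD_RE.sub("[REDACTED]", message)
```

Run this over the message field before the record leaves the process, not in the aggregation pipeline, so the raw value is never shipped.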

Access Control

  • Restrict who can search logs
  • Audit log access
  • Encrypt in transit and at rest

Quick Start Checklist

  • All services output structured JSON logs
  • Consistent timestamp format (ISO8601)
  • Request/trace IDs for correlation
  • Collector deployed (Promtail/Filebeat/Vector)
  • Retention policy configured
  • Basic alerting on error spikes
  • Sensitive data redacted

Tool Comparison

|                  | ELK         | Loki              | CloudWatch |
|------------------|-------------|-------------------|------------|
| Full-text search | ✅ Fast     | ⚠️ Slow           | ⚠️ Slow    |
| Cost at scale    | 💰💰💰      | 💰                | 💰💰       |
| Setup complexity | High        | Low               | None       |
| Query language   | KQL         | LogQL             | Insights   |
| Best for         | Large teams | K8s/Grafana users | AWS native |

Centralized logging turns “where did that error happen?” from a 30-minute hunt into a 30-second query. Set it up before you need it.