When you have one server, you SSH in and grep the logs. When you have fifty servers, that stops working. Log aggregation is how you make “what happened?” answerable at scale.

The Pipeline Architecture

Every log aggregation system follows the same basic pattern:

Sources → Collect → Process → Store → Query

Each stage has choices. Let’s walk through them.

Collection: Getting Logs Out

File-Based Collection

The classic: applications write to files, agents ship them.

# Filebeat configuration
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log
    json.keys_under_root: true
    json.add_error_key: true
    
  - type: container
    paths:
      - /var/lib/docker/containers/*/*.log

output.elasticsearch:
  hosts: ["elasticsearch:9200"]

Pros: works with any application, no code changes.
Cons: disk I/O, log rotation complexity, potential data loss.

Direct Shipping

Applications send logs directly to the aggregation system.

import logging
from fluent import handler

# Send structured records to Fluentd
fluent_handler = handler.FluentHandler('myapp', host='fluentd', port=24224)
fluent_handler.setFormatter(handler.FluentRecordFormatter())

logger = logging.getLogger()
logger.addHandler(fluent_handler)
logger.setLevel(logging.INFO)  # the root logger defaults to WARNING

logger.info("User logged in", extra={"user_id": 123, "ip": "1.2.3.4"})

Pros: no file I/O, immediate delivery, structured from the source.
Cons: requires code changes, coupling to log infrastructure.

Sidecar Pattern (Kubernetes)

A logging sidecar reads from stdout/stderr or shared volumes.

apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      # Writes logs to /var/log on the shared volume
      volumeMounts:
        - name: varlog
          mountPath: /var/log

    - name: fluentd-sidecar
      image: fluent/fluentd:latest
      volumeMounts:
        - name: varlog
          mountPath: /var/log
  volumes:
    - name: varlog
      emptyDir: {}

In Kubernetes, the standard is logging to stdout and letting the node-level agent (Fluentd DaemonSet, Promtail, etc.) collect from the container runtime.
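A node-level agent is typically deployed as a DaemonSet so one collector runs per node. A minimal sketch of that alternative (image tag and host paths are illustrative; a real manifest also needs RBAC, tolerations, and a config volume):

```yaml
# One agent per node, tailing every container's stdout/stderr
# from the runtime's log directory on the host.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluentd
spec:
  selector:
    matchLabels:
      app: fluentd
  template:
    metadata:
      labels:
        app: fluentd
    spec:
      containers:
        - name: fluentd
          image: fluent/fluentd:latest
          volumeMounts:
            - name: containerlogs
              mountPath: /var/log/containers
              readOnly: true
      volumes:
        - name: containerlogs
          hostPath:
            path: /var/log/containers
```

Compared with the sidecar, this costs one agent per node instead of one per pod.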

Processing: Making Logs Useful

Raw logs are rarely ready for storage. Processing adds structure, enrichment, and filtering.

Parsing

Turn unstructured text into structured data:

# Logstash grok pattern for nginx
filter {
  grok {
    match => { 
      "message" => '%{IPORHOST:client_ip} - %{USER:ident} \[%{HTTPDATE:timestamp}\] "%{WORD:method} %{URIPATHPARAM:request} HTTP/%{NUMBER:http_version}" %{NUMBER:status} %{NUMBER:bytes}'
    }
  }
  
  date {
    match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"]
    target => "@timestamp"
  }
}

Better: log in JSON from the start.

{"timestamp":"2024-02-03T10:30:00Z","level":"info","message":"Request completed","method":"GET","path":"/api/users","status":200,"duration_ms":45,"request_id":"abc123"}
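If your logger can't emit that shape natively, a small stdlib-only formatter gets you most of the way. A sketch (this `JsonFormatter` is a hand-rolled class mirroring the example's fields, not a library one):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line."""

    # Structured fields we pass through from `extra=` (illustrative set)
    EXTRA_FIELDS = ("method", "path", "status", "duration_ms", "request_id")

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

logger = logging.getLogger("myapp")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Request completed",
            extra={"method": "GET", "path": "/api/users",
                   "status": 200, "duration_ms": 45, "request_id": "abc123"})
```

Every line is now machine-parseable, and the pipeline needs no grok patterns at all.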

Enrichment

Add context that wasn’t in the original log:

filter {
  # Add geographic info from IP
  geoip {
    source => "client_ip"
    target => "geo"
  }
  
  # Add Kubernetes metadata (community filter plugin;
  # not bundled with core Logstash)
  kubernetes {
    source => "kubernetes"
  }
  
  # Add deployment info
  mutate {
    add_field => {
      "environment" => "${ENVIRONMENT}"
      "version" => "${APP_VERSION}"
    }
  }
}

Filtering and Sampling

Not all logs are worth keeping:

filter {
  # Drop health check noise
  if [path] == "/health" and [status] == 200 {
    drop { }
  }
  
  # Sample debug logs (keep 10%)
  if [level] == "debug" {
    ruby {
      code => "event.cancel if rand > 0.1"
    }
  }
  
  # Redact sensitive data
  mutate {
    gsub => [
      "message", "\b\d{16}\b", "[REDACTED_CARD]",
      "message", "password=\S+", "password=[REDACTED]"
    ]
  }
}
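The same drop/sample/redact logic translates directly to any pre-processing hook you run before shipping. A sketch in Python (the `process` function and field names are illustrative, not a real library API):

```python
import random
import re
from typing import Optional

# 16 consecutive digits (naive card-number match) and password=... pairs
CARD_RE = re.compile(r"\b\d{16}\b")
PASSWORD_RE = re.compile(r"password=\S+")

def process(event: dict) -> Optional[dict]:
    """Drop, sample, and redact one log event; None means discard."""
    # Drop health check noise
    if event.get("path") == "/health" and event.get("status") == 200:
        return None

    # Sample debug logs: keep roughly 10%
    if event.get("level") == "debug" and random.random() > 0.1:
        return None

    # Redact sensitive data in the message body
    msg = event.get("message", "")
    msg = CARD_RE.sub("[REDACTED_CARD]", msg)
    msg = PASSWORD_RE.sub("password=[REDACTED]", msg)
    event["message"] = msg
    return event
```

Redacting this early, before logs ever reach storage, is what keeps sensitive data out of indices and backups.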

Storage: Where Logs Live

Elasticsearch

The standard for full-text search over logs.

// Index template for logs
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "logs-policy"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" },
        "duration_ms": { "type": "integer" }
      }
    }
  }
}

Index lifecycle management:

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Loki (Prometheus-style)

Loki indexes labels, not log content. Cheaper, but different tradeoffs.

# Promtail config
scrape_configs:
  - job_name: kubernetes
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace

Query with LogQL:

{app="myservice", namespace="production"} |= "error" | json | duration > 1s

Pros: 10x cheaper storage, simple operations.
Cons: no full-text indexing, grep-style queries.

ClickHouse

Column-oriented database, excellent for log analytics.

CREATE TABLE logs (
    timestamp DateTime,
    level LowCardinality(String),
    service LowCardinality(String),
    message String,
    trace_id String,
    duration_ms UInt32
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (service, timestamp);

-- Fast aggregations
SELECT 
    service,
    count() as errors,
    avg(duration_ms) as avg_duration
FROM logs
WHERE level = 'error' AND timestamp > now() - INTERVAL 1 HOUR
GROUP BY service;

Pros: blazing fast analytics, excellent compression.
Cons: not designed for full-text search.

Querying: Finding What Matters

Structured Queries

With proper fields, queries are precise:

{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "service": "checkout" } },
        { "term":  { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "range": { "duration_ms": { "gt": 1000 } } }
      ]
    }
  }
}

Correlation

The killer feature: following a request across services.

10:30:00.001 [gateway]          Received request        trace_id=abc123
10:30:00.015 [auth-service]     Token valid             trace_id=abc123
10:30:00.089 [user-service]     Fetched user            trace_id=abc123
10:30:00.234 [payment-service]  ERROR: Payment failed   trace_id=abc123

This requires consistent trace ID propagation across all services.
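One way to get that consistency is a `contextvars` variable plus a logging filter, so every record in a request's scope carries the same ID. A sketch (`trace_id_var` and `handle_request` are illustrative names, not a framework API):

```python
import contextvars
import logging
import sys
import uuid

# Holds the current request's trace ID per thread / async task
trace_id_var = contextvars.ContextVar("trace_id", default="-")

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the current trace ID."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter(
    "%(asctime)s [%(name)s] %(message)s trace_id=%(trace_id)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request(incoming_trace_id=None):
    # Reuse the upstream ID if one arrived (e.g. via a request header);
    # otherwise mint a fresh one at the edge.
    trace_id_var.set(incoming_trace_id or uuid.uuid4().hex[:12])
    logger.info("Payment failed")

handle_request(incoming_trace_id="abc123")
```

The same filter installed in every service, with the ID forwarded on all outbound calls, is what makes the cross-service query above possible.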

Dashboards and Alerts

# Grafana alert rule
- alert: HighErrorRate
  expr: |
    sum(rate({app="checkout"} |= "error" [5m])) 
    / sum(rate({app="checkout"} [5m])) 
    > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Checkout error rate above 5%"

Scaling Considerations

Volume Management

Logs grow fast. Plan for it:

1000 requests/sec × 1KB/log × 86400 sec/day = 86GB/day
86GB/day × 30 days retention = 2.6TB
2.6TB × 3 replicas = 7.8TB
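The same back-of-envelope arithmetic as a reusable helper, in decimal units and ignoring compression (function names are illustrative):

```python
def daily_log_volume_gb(requests_per_sec, kb_per_log=1.0):
    """Daily ingest in GB: rate x log size x seconds per day."""
    return requests_per_sec * kb_per_log * 86_400 / 1_000_000

def total_storage_tb(daily_gb, retention_days=30, replicas=3):
    """Raw retained storage in TB, across all replicas."""
    return daily_gb * retention_days * replicas / 1_000

daily = daily_log_volume_gb(1_000)   # about 86 GB/day
total = total_storage_tb(daily)      # about 7.8 TB
```

Run it with your own rate and retention before sizing a cluster; compression typically recovers a large fraction, but plan for the raw number.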

Strategies:

  • Sample verbose logs (debug, access logs)
  • Shorter retention for high-volume, low-value logs
  • Tiered storage (hot/warm/cold)
  • Aggressive compression

Backpressure

When the pipeline can’t keep up:

# Filebeat backpressure
queue.mem:
  events: 4096
  flush.min_events: 2048
  flush.timeout: 1s

# If queue fills, slow down reading

Better to drop logs than crash the application. But know when it’s happening:

# Alert on queue depth
- alert: LogBackpressure
  expr: filebeat_libbeat_output_events_dropped_total > 0

Multi-Tenancy

In shared clusters, isolate tenants:

# Elasticsearch: separate indices per tenant
output.elasticsearch:
  index: "logs-%{[tenant]}-%{+yyyy.MM.dd}"

# Loki: tenant header
clients:
  - url: http://loki:3100/loki/api/v1/push
    tenant_id: "${TENANT_ID}"

Common Pitfalls

  1. Logging too much: Debug logs in production, logging inside tight loops
  2. Logging too little: Missing context that would have explained the incident
  3. Unstructured logs: logger.info(f"User {user} did {action}") — impossible to query
  4. No retention policy: Storage fills up, cluster dies
  5. Ignoring cardinality: High-cardinality labels (user IDs, request IDs as labels) kill Loki/Prometheus

The Practical Stack

For most teams starting out:

  1. Collection: Promtail or Fluent Bit (lightweight)
  2. Processing: Minimal — structure logs at source
  3. Storage: Loki (simple) or Elasticsearch (powerful)
  4. Query: Grafana

Add complexity (Kafka buffer, Logstash processing) only when you need it.


Log aggregation isn’t glamorous, but it’s what lets you answer “what happened?” when things go wrong. Invest in structured logging from day one, pick a stack that matches your scale, and remember: the best log pipeline is the one you can actually operate.