When your application runs on 50 containers across 10 servers, SSH’ing into each one to grep logs doesn’t scale. Centralized logging gives you one place to search everything.
The Log Aggregation Pipeline#
Apps (stdout, files, syslog)
    ↓
Collectors (Fluentd, Logstash, Vector)
    ↓
Storage (Elasticsearch, Loki, S3)
    ↓
Search & Visualization (Kibana, Grafana, CLI)
Stack Options# ELK (Elasticsearch, Logstash, Kibana)# The classic choice. Powerful but resource-hungry.
# docker-compose.yml
services:
  elasticsearch:
    image: elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
    volumes:
      - esdata:/usr/share/elasticsearch/data
  logstash:
    image: logstash:8.12.0
    volumes:
      - ./logstash.conf:/usr/share/logstash/pipeline/logstash.conf
    depends_on:
      - elasticsearch
  kibana:
    image: kibana:8.12.0
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch

volumes:
  esdata:
# logstash.conf
input {
  beats { port => 5044 }
  tcp { port => 5000 codec => json }
}
filter {
  if [message] =~ /^\{/ {
    json { source => "message" }
  }
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
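The tcp input above accepts newline-delimited JSON on port 5000. As a sketch of what a client would send, here is a minimal stdlib-only shipper; the host/port and the example fields are assumptions taken from the config above, not part of any library API:

```python
import json
import socket

def encode_event(event):
    """Serialize one event as newline-delimited JSON, the framing the
    tcp/json input expects (one object per line)."""
    return (json.dumps(event) + "\n").encode("utf-8")

def send_log(event, host="localhost", port=5000):
    """Ship a single event to the Logstash tcp input. Assumes the
    stack above is running and port 5000 is reachable."""
    payload = encode_event(event)
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)

# Example usage (needs a running Logstash):
# send_log({"timestamp": "2026-03-04T16:30:00Z", "level": "info",
#           "service": "checkout", "message": "purchase_completed"})
```

Note the compose file above doesn't publish port 5000 on the logstash service; add a `ports` entry if you want to reach the tcp input from outside the compose network.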
Loki + Grafana# Lightweight alternative. Doesn’t index log content—just labels.
# docker-compose.yml
services:
  loki:
    image: grafana/loki:2.9.0
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
  promtail:
    image: grafana/promtail:2.9.0
    volumes:
      - /var/log:/var/log:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - ./promtail-config.yml:/etc/promtail/config.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
# promtail-config.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: containers
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
    relabel_configs:
      - source_labels: ['__meta_docker_container_name']
        target_label: 'container'
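Promtail is just one producer: anything that can POST to the push endpoint in `clients` above can ship logs. A stdlib-only sketch of the payload shape (the labels and log line are illustrative):

```python
import json
import time
import urllib.request

def loki_payload(labels, line, ts_ns=None):
    """Build a Loki push-API body: one stream, one line. Loki expects
    nanosecond epoch timestamps serialized as strings."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    return {"streams": [{"stream": labels, "values": [[str(ts_ns), line]]}]}

def push(payload, url="http://loki:3100/loki/api/v1/push"):
    """POST the payload; assumes the Loki from the compose file is reachable."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Example usage (needs a running Loki):
# push(loki_payload({"container": "checkout"}, "purchase_completed"))
```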
Vector# Modern, fast collector written in Rust:
# vector.toml
[sources.docker]
type = "docker_logs"

[transforms.parse_json]
type = "remap"
inputs = ["docker"]
source = '''
. = parse_json!(.message)
'''

[sinks.loki]
type = "loki"
inputs = ["parse_json"]
endpoint = "http://loki:3100"
labels = { app = "{{ container_name }}" }
Structured Logging# Make logs machine-parseable:
# Bad: unstructured
logger.info(f"User {user_id} purchased {item_id} for ${price}")

# Good: structured JSON
logger.info("purchase_completed", extra={
    "user_id": user_id,
    "item_id": item_id,
    "price": price,
    "currency": "USD"
})
Output:
{
  "timestamp": "2026-03-04T16:30:00Z",
  "level": "info",
  "message": "purchase_completed",
  "user_id": "u-12345",
  "item_id": "item-789",
  "price": 29.99,
  "currency": "USD",
  "service": "checkout",
  "trace_id": "abc123"
}
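With plain `logging`, the `extra=` pattern above needs a formatter that emits JSON. A minimal stdlib-only sketch (field names copy the example output; any record attribute beyond the stock LogRecord set is treated as structured data):

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    # Attributes every LogRecord has; anything else arrived via `extra=`.
    STANDARD = set(logging.makeLogRecord({}).__dict__)
    # Stamp in UTC so the trailing "Z" is honest.
    converter = time.gmtime

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge the extra fields into the top-level JSON object.
        entry.update({k: v for k, v in record.__dict__.items()
                      if k not in self.STANDARD})
        return json.dumps(entry)
```

Attach it with `handler.setFormatter(JsonFormatter())` and every record comes out as one JSON object per line.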
Python Structured Logging#
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()
logger.info("user_login", user_id="12345", method="oauth")
Node.js with Pino#
const pino = require('pino');
const logger = pino({ level: 'info' });

logger.info({ userId: '12345', action: 'purchase' }, 'Order completed');
Log Levels# Use consistently across services:
| Level | When to Use |
|-------|-------------|
| DEBUG | Detailed debugging info (disabled in prod) |
| INFO | Normal operations worth recording |
| WARN | Unexpected but handled situations |
| ERROR | Failures that need attention |
| FATAL | Service cannot continue |
logger.debug("Cache lookup", key=key)
logger.info("Request processed", duration_ms=150)
logger.warning("Retry attempt", attempt=3, max_attempts=5)
logger.error("Payment failed", error=str(e), user_id=user_id)
logger.critical("Database connection lost", host=db_host)
Essential Fields# Always include:
{
  "timestamp": "ISO8601 format",
  "level": "info|warn|error",
  "service": "service-name",
  "environment": "prod|staging",
  "trace_id": "for distributed tracing",
  "request_id": "unique per request",
  "message": "human readable description"
}
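One way to guarantee these fields appear on every record, without repeating them at each call site, is a logging filter. A stdlib `logging` sketch (the service and environment values are illustrative):

```python
import logging
import uuid

class ContextFilter(logging.Filter):
    """Stamp every record with the always-include fields listed above."""

    def __init__(self, service, environment):
        super().__init__()
        self.service = service
        self.environment = environment

    def filter(self, record):
        record.service = self.service
        record.environment = self.environment
        # Keep a request_id set upstream (e.g. by middleware); otherwise mint one.
        if not hasattr(record, "request_id"):
            record.request_id = str(uuid.uuid4())
        return True
```

Attach it with `logging.getLogger().addFilter(ContextFilter("checkout", "prod"))`; a JSON formatter then picks the attributes up automatically.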
Query Patterns# Elasticsearch/Kibana (KQL)#
# service errors
service: "checkout" AND level: "error"

# user activity
user_id: "u-12345"

# date range
@timestamp >= "2026-03-04"

# NOT excludes a service
NOT service: "checkout" AND level: "error"
Loki (LogQL)#
# Exact substring match
{service="checkout"} |= "error"

# Parse JSON, filter a field
{service="checkout"} | json | user_id="u-12345"

# Regex match
{service="checkout"} |~ "error|timeout"

# Rate of error lines over 5 minutes
rate({level="error"}[5m])
Retention and Storage# Logs grow fast. Plan for it:
# Elasticsearch ILM policy
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50gb",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
Cost Optimization#
- Hot storage: Recent logs (7 days), fast SSDs
- Warm storage: Older logs (30 days), cheaper disks
- Cold/Archive: Compliance (S3 Glacier), rarely accessed
- Delete: After retention period
# Loki retention
limits_config:
  retention_period: 720h  # 30 days
compactor:
  retention_enabled: true
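The tiering decision above is ultimately arithmetic on daily volume. A back-of-envelope sketch (tier durations and per-GB rates are made-up placeholders, not vendor prices):

```python
def tier_sizes_gb(daily_gb, hot_days=7, warm_days=23, archive_days=60):
    """Steady-state data per tier, assuming logs spend hot_days on SSD,
    then warm_days on cheap disk, then archive_days in cold storage."""
    return {
        "hot": daily_gb * hot_days,
        "warm": daily_gb * warm_days,
        "archive": daily_gb * archive_days,
    }

def monthly_cost(daily_gb, rates=None):
    """Sum per-tier storage cost; rates are $/GB-month placeholders."""
    rates = rates or {"hot": 0.10, "warm": 0.03, "archive": 0.004}
    sizes = tier_sizes_gb(daily_gb)
    return sum(sizes[tier] * rates[tier] for tier in sizes)
```

At 10 GB/day the hot tier holds 70 GB while the archive holds 600 GB, yet the archive contributes the least cost, which is why aggressive tiering pays off.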
Alerting on Logs# Elasticsearch Watcher#
PUT _watcher/watch/error_spike
{
  "trigger": { "schedule": { "interval": "5m" }},
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "level": "error" }},
                { "range": { "@timestamp": { "gte": "now-5m" }}}
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total": { "gt": 100 }}
  },
  "actions": {
    "slack": {
      "webhook": {
        "url": "https://hooks.slack.com/..."
      }
    }
  }
}
Loki + Alertmanager#
# Grafana alert rule
- alert: HighErrorRate
  expr: |
    sum(rate({level="error"}[5m])) by (service) > 10
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate in {{ $labels.service }}"
Security Considerations# Sensitive Data# Never log:
- Passwords or tokens
- Credit card numbers
- Personal health information
- Full social security numbers
# Redact sensitive fields
def sanitize_log(data):
    sensitive = ['password', 'token', 'ssn', 'credit_card']
    return {k: '***' if k in sensitive else v for k, v in data.items()}

logger.info("user_data", **sanitize_log(user_dict))
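The flat version above only checks top-level keys, so secrets nested inside sub-objects (a request body, say) slip through. A recursive variant, using the same illustrative field list:

```python
SENSITIVE = {'password', 'token', 'ssn', 'credit_card'}

def sanitize_deep(data):
    """Redact sensitive keys at any nesting depth, walking dicts and lists."""
    if isinstance(data, dict):
        return {k: '***' if k in SENSITIVE else sanitize_deep(v)
                for k, v in data.items()}
    if isinstance(data, list):
        return [sanitize_deep(v) for v in data]
    return data
```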
Access Control#
- Restrict who can search logs
- Audit log access
- Encrypt in transit and at rest

Quick Start Checklist#

| | ELK | Loki | CloudWatch |
|---|---|---|---|
| Full-text search | ✅ Fast | ⚠️ Slow | ⚠️ Slow |
| Cost at scale | 💰💰💰 | 💰 | 💰💰 |
| Setup complexity | High | Low | None |
| Query language | KQL | LogQL | Insights |
| Best for | Large teams | K8s/Grafana users | AWS native |
Centralized logging turns “where did that error happen?” from a 30-minute hunt into a 30-second query. Set it up before you need it.