Raw logs are noise. Processed telemetry is intelligence. The difference between them is your observability pipeline.

Modern distributed systems generate enormous amounts of data—logs, metrics, traces, events. But data isn’t insight. The challenge isn’t collection; it’s transformation. How do you turn a firehose of JSON lines into something a human (or an AI) can actually act on?

The Three Pillars, Unified

You’ve heard the “three pillars of observability”: logs, metrics, and traces. What’s often missing from that conversation is how these pillars should connect.

[Diagram: logs, metrics, and traces converging through correlation into insight]

A user reports slowness at 14:32. Your trace shows a 3-second database call. Your metrics confirm CPU spike on db-primary-01. Your logs reveal a missing index causing a full table scan. Each pillar alone tells part of the story. Correlated, they tell the whole thing.
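In practice, correlation is just a join on shared identifiers: a trace ID links spans to log lines, and a hostname or service label links both to metrics. A toy sketch (every field name and value here is invented for illustration):

```python
# Toy illustration: a shared trace_id joins trace and log; a shared host
# joins log and metric. All names/values are made up.
trace_evt = {"trace_id": "abc123", "span": "db.query", "duration_ms": 3000}
metric_evt = {"host": "db-primary-01", "cpu_percent": 97, "at": "14:32"}
log_evt = {"trace_id": "abc123", "host": "db-primary-01",
           "msg": "full table scan: missing index"}

# Trace <-> log join on trace_id; log <-> metric join on host
correlated = (trace_evt["trace_id"] == log_evt["trace_id"]
              and log_evt["host"] == metric_evt["host"])
```

This is why trace context propagation matters: without the shared IDs, the join is impossible and each pillar stays a silo.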

Pipeline Architecture

An effective observability pipeline has distinct stages:

Stage 1: Collection

Use agents that speak OpenTelemetry. The days of proprietary collectors are over—OTel has won. Whether you’re using the OpenTelemetry Collector, Fluent Bit, or Vector, standardize on OTLP as your transport protocol.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 10000

  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

exporters:
  otlphttp:
    # Base URL only; the exporter appends /v1/traces etc. per signal
    endpoint: https://telemetry.internal

  prometheusremotewrite:
    endpoint: https://metrics.internal/api/v1/write

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [attributes, batch]
      exporters: [prometheusremotewrite]

Stage 2: Enrichment

Raw telemetry lacks context. Enrichment adds meaning:

  • Service metadata: What team owns this? What’s the on-call rotation?
  • Business context: Is this a premium customer? What feature flag is enabled?
  • Infrastructure context: What availability zone? What Kubernetes node?
# Vector enrichment transform (assumes "services" and "customers"
# enrichment tables are configured elsewhere in vector.toml)
[transforms.enrich_logs]
type = "remap"
inputs = ["raw_logs"]
source = '''
  .team = get_env_var!("TEAM_NAME")

  svc, err = get_enrichment_table_record("services", {"service": .service})
  if err == null { .oncall = svc.oncall_email }

  if .customer_id != null {
    customer, err = get_enrichment_table_record("customers", {"id": .customer_id})
    if err == null {
      .customer_tier = customer.tier
      .is_premium = customer.tier == "enterprise"
    }
  }
'''

Stage 3: Aggregation and Sampling

You can’t store everything. You shouldn’t try. Smart sampling preserves signal while reducing cost:

  • Head-based sampling: Decide at trace start (fast but less intelligent)
  • Tail-based sampling: Decide after trace completes (captures errors and outliers)
  • Adaptive sampling: Adjust rates based on traffic patterns
# Tail-based sampling configuration
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors-always
        type: status_code
        status_code: {status_codes: [ERROR]}
      
      - name: slow-traces
        type: latency
        latency: {threshold_ms: 1000}
      
      - name: probabilistic-sample
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

The magic is in policy ordering. Errors always get captured. Slow requests always get captured. Everything else gets sampled at 10%.
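The same decision order can be sketched in a few lines of Python. This is a toy model with an invented trace shape, not the collector's actual internals:

```python
import random

# Toy tail-sampling decision mirroring the policy order above.
# The trace dict shape here is invented for illustration.
def keep_trace(trace, rng):
    if any(s["status"] == "ERROR" for s in trace["spans"]):
        return True                    # errors-always
    if trace["duration_ms"] > 1000:
        return True                    # slow-traces
    return rng.random() < 0.10         # probabilistic 10%

rng = random.Random(0)
errored = {"duration_ms": 50, "spans": [{"status": "ERROR"}]}
slow = {"duration_ms": 2500, "spans": [{"status": "OK"}]}
```

Because the error and latency checks run first, the 10% dice roll only ever applies to traces that are boring by every other policy.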

Stage 4: Storage

Different telemetry types need different storage engines:

| Type    | Storage                               | Retention  | Query pattern        |
|---------|---------------------------------------|------------|----------------------|
| Metrics | Time-series DB (Prometheus, Mimir)    | 13 months  | Aggregation over time |
| Logs    | Log aggregation (Loki, Elasticsearch) | 30-90 days | Full-text search     |
| Traces  | Trace backend (Tempo, Jaeger)         | 7-14 days  | ID-based lookup      |

Don’t fight the storage engine. Prometheus is not a log store. Elasticsearch is not a metrics database. Use the right tool.
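The query patterns make the mismatch concrete. These example queries are illustrative, not from a real deployment:

```python
# Illustrative query shapes per backend (queries and IDs are made up):
# aggregation for metrics, text filtering for logs, ID lookup for traces.
query_shapes = {
    "metrics": 'sum(rate(http_requests_total[5m])) by (service)',  # PromQL
    "logs": '{app="api"} |= "timeout"',                            # LogQL
    "traces": "trace_id=abc123",                                   # ID lookup
}
```

Try running the logs query against a metrics store, or vice versa, and the "right tool" advice stops being abstract.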

Stage 5: Analysis and Alerting

This is where value emerges. Good alerting is:

  1. Symptom-based: Alert on user-facing impact, not internal metrics
  2. SLO-driven: “Error budget at 50%” beats “5xx rate > 1%”
  3. Actionable: Every alert should have a runbook link
# Symptom-based alerting
groups:
  - name: slo_alerts
    rules:
      - alert: ErrorBudgetBurnRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))
          > 14.4 * (1 - 0.999)  # 14.4x burn rate against a 99.9% SLO
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning fast"
          runbook: "https://wiki.internal/runbooks/error-budget"
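The threshold arithmetic, spelled out (assuming the 99.9% SLO from the rule):

```python
# Burn-rate threshold arithmetic for a 99.9% SLO
slo = 0.999
error_budget = 1 - slo            # 0.1% of requests may fail
burn_rate = 14.4                  # exhausts a 30-day budget in ~2 days
threshold = burn_rate * error_budget
# Alert fires when more than ~1.44% of requests are errors
```

A sustained 14.4x burn consumes a month's error budget in roughly two days, which is why it warrants a critical page rather than a ticket.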

Anti-Patterns to Avoid

Log everything, analyze nothing: Storage is cheap. Attention is not. If nobody reads it, stop collecting it.

Alert on causes, not symptoms: “Disk at 80%” is a cause. “User uploads failing” is a symptom. Alert on symptoms, investigate causes.

Silo your pillars: Separate tools for logs, metrics, and traces that can’t cross-reference one another. Modern platforms (Grafana stack, Datadog, Honeycomb) solve this.

Ignore cardinality: That user_id label on your metric? If you have a million users, you now have a million time series. Cardinality explosions kill metrics systems.
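The multiplication is easy to underestimate, because cardinality compounds across every label on the metric (the counts below are illustrative):

```python
# Series count multiplies across label values (numbers are illustrative)
users, endpoints, status_codes = 1_000_000, 50, 5
series = users * endpoints * status_codes
# 250,000,000 distinct time series from one innocent-looking metric
```

One `user_id` label turned a 250-series metric into a 250-million-series one. High-cardinality identifiers belong in logs or trace attributes, not metric labels.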

The AI Angle

Here’s where it gets interesting. AI systems have unique observability needs:

  • Prompt/response logging: What did the model receive? What did it return?
  • Token usage tracking: Cost attribution per feature, per customer
  • Latency by model: Different models have different performance profiles
  • Semantic analysis: Did the response actually answer the question?
# AI-specific telemetry. Sketch only: `client`, `count_tokens`, and
# `calculate_cost` are your own helpers, not library calls.
from opentelemetry import trace

tracer = trace.get_tracer("llm")

def call_model(prompt: str, model: str) -> str:
    with tracer.start_as_current_span("llm_inference") as span:
        span.set_attribute("model", model)
        span.set_attribute("prompt_tokens", count_tokens(prompt))

        response = client.complete(prompt, model=model)

        span.set_attribute("completion_tokens", count_tokens(response))
        span.set_attribute("total_cost", calculate_cost(model, prompt, response))

        return response

The observability pipeline for AI isn’t fundamentally different—it’s the same collect → enrich → aggregate → store → analyze flow. But the attributes you care about are different, and the alerting thresholds need tuning for stochastic systems.
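For cost attribution, a `calculate_cost` helper like the one referenced above might look like this. The price table and model names are entirely made up, and this version takes token counts rather than raw strings:

```python
# Hypothetical per-model pricing for cost attribution.
# Model names and prices are invented for illustration.
PRICE_PER_1K_TOKENS = {
    "small-model": (0.0005, 0.0015),   # (input, output) USD per 1K tokens
    "big-model": (0.01, 0.03),
}

def calculate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    p_in, p_out = PRICE_PER_1K_TOKENS[model]
    return prompt_tokens / 1000 * p_in + completion_tokens / 1000 * p_out

cost = calculate_cost("big-model", 1000, 500)
```

Attach the result as a span attribute and cost-per-feature or cost-per-customer becomes an ordinary aggregation query.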

Practical Starting Point

If you’re building from scratch:

  1. Week 1: Deploy OpenTelemetry Collector, export to Grafana Cloud (or self-hosted Grafana stack)
  2. Week 2: Add structured logging with trace context propagation
  3. Week 3: Define your SLOs. Three to five, max. Keep them simple.
  4. Week 4: Build dashboards that answer “is the system healthy?” in one glance
  5. Week 5: Create your first symptom-based alerts with runbooks
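Week 2’s “trace context propagation” can be prototyped with nothing but the standard library; a production system would pull the IDs from OpenTelemetry’s current span instead. All names here are illustrative:

```python
import contextvars
import json
import uuid

# Stdlib-only sketch of trace-context propagation in structured logs.
# Production code would read trace_id from OpenTelemetry's active span.
_trace_id = contextvars.ContextVar("trace_id", default=None)

def start_trace() -> None:
    _trace_id.set(uuid.uuid4().hex)

def log(msg: str, **fields) -> str:
    record = {"msg": msg, "trace_id": _trace_id.get(), **fields}
    return json.dumps(record)

start_trace()
line = log("upload received", user_id="u-123")
```

Every log line emitted inside the context now carries the same `trace_id`, which is exactly the join key the correlation story depends on.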

Observability isn’t a destination—it’s a practice. Start simple, iterate based on what you actually need to know.


The goal isn’t to collect data. It’s to reduce time-to-understanding when something goes wrong. Every pipeline decision should serve that goal. If it doesn’t, cut it.