Monitoring

Health Check Patterns: Liveness, Readiness, and Startup Probes

Your load balancer routes traffic to a pod that’s crashed. Or kills a pod that’s just slow. Or restarts a pod that’s still initializing. Health checks prevent these failures — when configured correctly. Most teams get them wrong. Here’s how to get them right. The Three Probe Types Kubernetes offers three distinct probes, each with a different purpose: Probe Question Failure Action Liveness Is the process alive? Restart container Readiness Can it handle traffic? Remove from Service Startup Has it finished starting? Delay other probes Liveness: “Should I restart this?” Detects when your app is stuck — deadlocked, infinite loop, unrecoverable state. ...

Observability Pipelines: From Logs to Insights

Raw logs are noise. Processed telemetry is intelligence. The difference between them is your observability pipeline. Modern distributed systems generate enormous amounts of data—logs, metrics, traces, events. But data isn’t insight. The challenge isn’t collection; it’s transformation. How do you turn a firehose of JSON lines into something a human (or an AI) can actually act on? The Three Pillars, Unified You’ve heard the “three pillars of observability”: logs, metrics, and traces. What’s often missing from that conversation is how these pillars should connect. ...

Health Checks: Readiness, Liveness, and Startup Probes Explained

Your application says it’s running. But is it actually working? Health checks answer that question. They’re the difference between “process exists” and “service is functional.” Get them wrong, and your orchestrator will either route traffic to broken instances or restart healthy ones. Three Types of Probes Liveness: “Is this process stuck?” Liveness probes detect deadlocks, infinite loops, and zombie processes. If liveness fails, the container gets killed and restarted. 1 2 3 4 5 6 7 livenessProbe: httpGet: path: /healthz port: 8080 initialDelaySeconds: 30 periodSeconds: 10 failureThreshold: 3 What to check: ...

Observability: Beyond Monitoring with Metrics, Logs, and Traces

Monitoring tells you when something is wrong. Observability helps you understand why. In distributed systems, you can’t predict every failure mode—you need systems that let you ask arbitrary questions about their behavior. The Three Pillars Metrics: What’s Happening Now Numeric time-series data. Fast to query, cheap to store. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 from prometheus_client import Counter, Histogram, Gauge, start_http_server # Counter - only goes up requests_total = Counter( 'http_requests_total', 'Total HTTP requests', ['method', 'endpoint', 'status'] ) # Histogram - distribution of values request_duration = Histogram( 'http_request_duration_seconds', 'Request duration in seconds', ['method', 'endpoint'], buckets=[.01, .05, .1, .25, .5, 1, 2.5, 5, 10] ) # Gauge - can go up or down active_connections = Gauge( 'active_connections', 'Number of active connections' ) # Usage @app.route("/api/<endpoint>") def handle_request(endpoint): active_connections.inc() with request_duration.labels( method=request.method, endpoint=endpoint ).time(): result = process_request() requests_total.labels( method=request.method, endpoint=endpoint, status=200 ).inc() active_connections.dec() return result # Expose metrics endpoint start_http_server(9090) Logs: What Happened Discrete events with context. Rich detail, expensive at scale. ...

Structured Logging: Stop Grepping, Start Querying

Unstructured logs are a liability. When your application writes User 12345 logged in from 192.168.1.1, you’re creating text that’s easy to read but impossible to query at scale. Structured logging changes the game: logs become data you can filter, aggregate, and analyze. The Problem with Text Logs 1 2 3 4 # Traditional logging import logging logging.info(f"User {user_id} logged in from {ip_address}") # Output: INFO:root:User 12345 logged in from 192.168.1.1 Want to find all logins from a specific IP? You need regex. Want to count logins per user? Good luck. Want to correlate with other events? Hope your timestamp parsing is solid. ...

Health Checks Done Right: Liveness, Readiness, and Startup Probes

A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails. Let’s build health checks that actually tell you when something’s wrong. The Three Types of Probes Kubernetes defines three probe types, each serving a distinct purpose: Liveness Probe: “Is this process stuck?” If it fails, Kubernetes kills and restarts the container. Readiness Probe: “Can this instance handle traffic?” If it fails, the instance is removed from load balancing but keeps running. ...

Monitoring vs Observability: What's the Difference and Why It Matters

Monitoring tells you when something is wrong. Observability helps you understand why. Here’s how to build systems you can actually debug.