Retry Patterns: Exponential Backoff and Beyond

Networks fail. Services go down. Databases get overwhelmed. The question isn't whether your requests will fail; it's how gracefully you handle it when they do. Naive retry logic can turn a minor hiccup into a catastrophic cascade. Smart retry logic can make your system resilient to transient failures. The difference is in the details.

The Naive Approach (Don't Do This)

```python
# Bad: Immediate retry loop
def fetch_data(url):
    for attempt in range(5):
        try:
            response = requests.get(url, timeout=5)
            return response.json()
        except requests.RequestException:
            continue
    raise Exception("Failed after 5 attempts")
```

This code has several problems: ...
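The direction the post argues for is to space retries out and randomize them. A minimal sketch of exponential backoff with full jitter (function and parameter names here are illustrative, not the article's implementation):

```python
import random
import time

def fetch_with_backoff(do_request, max_attempts=5, base=0.5, cap=30.0):
    """Retry do_request, sleeping an exponentially growing, jittered delay
    between attempts so failing clients don't all hammer the server in sync."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception as exc:  # real code should catch only transient errors
            last_error = exc
            if attempt < max_attempts - 1:
                # full jitter: random delay in [0, min(cap, base * 2**attempt)]
                time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error
```

The jitter is the important part: without it, every client that failed at the same moment retries at the same moment, recreating the spike that caused the failure.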

February 12, 2026 · 8 min · 1546 words · Rob Washington

Graceful Shutdown: The Art of Dying Well in Production

Your container is about to die. It has 30 seconds to live. What happens next determines whether your users see a clean transition or a wall of 502 errors. Graceful shutdown is one of those things that seems obvious until you realize most applications do it wrong.

The Problem

When Kubernetes (or Docker, or systemd) decides to stop your application, it sends a SIGTERM signal. Your application has a grace period, usually 30 seconds, to finish what it's doing and exit cleanly. After that, it gets SIGKILL. No negotiation. ...
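The SIGTERM-then-SIGKILL sequence described above implies a specific shape for the handler: catch the signal, stop taking new work, drain what's in flight, then exit. A minimal sketch (the function names are illustrative):

```python
import signal
import threading

shutdown_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Don't exit here: just record the request so the main loop can
    # finish in-flight work and clean up within the grace period.
    shutdown_requested.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def serve_until_shutdown(process_one_item):
    """Process work until a shutdown is requested, then fall through to cleanup."""
    while not shutdown_requested.is_set():
        process_one_item()
    # cleanup phase: stop accepting new work, flush buffers,
    # close connection pools, deregister from the load balancer
```

The key property is that the handler itself does almost nothing; the loop checks the flag at a safe point, so no request is cut off mid-flight.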

February 12, 2026 · 6 min · 1203 words · Rob Washington

Chaos Engineering: Breaking Your Systems to Make Them Stronger

You don't know if your system handles failures gracefully until failures happen. Chaos engineering lets you find out on your terms: in controlled conditions, during business hours, with engineers ready to respond.

The Principles

1. Define steady state: What does "healthy" look like?
2. Hypothesize: "The system will continue serving traffic if one pod dies"
3. Inject failure: Kill the pod
4. Observe: Did steady state hold?
5. Learn: Fix what broke, update runbooks

Start Simple: Kill Things

Random Pod Termination

```python
# chaos_pod_killer.py
import logging
import random
import time
from datetime import datetime

from kubernetes import client, config

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class ChaosPodKiller:
    def __init__(self, namespace: str, label_selector: str):
        config.load_incluster_config()  # or load_kube_config() for local
        self.v1 = client.CoreV1Api()
        self.namespace = namespace
        self.label_selector = label_selector

    def get_targets(self) -> list:
        """Get pods matching selector."""
        pods = self.v1.list_namespaced_pod(
            namespace=self.namespace,
            label_selector=self.label_selector
        )
        return [pod.metadata.name for pod in pods.items
                if pod.status.phase == "Running"]

    def kill_random_pod(self, dry_run: bool = True) -> str:
        """Kill a random pod from targets."""
        targets = self.get_targets()
        if len(targets) <= 1:
            logger.warning("Only one pod running, skipping kill")
            return None

        victim = random.choice(targets)
        if dry_run:
            logger.info(f"DRY RUN: Would delete pod {victim}")
        else:
            logger.info(f"Deleting pod {victim}")
            self.v1.delete_namespaced_pod(
                name=victim,
                namespace=self.namespace,
                grace_period_seconds=0
            )
        return victim

    def run_experiment(self, duration_minutes: int = 30, interval_seconds: int = 300):
        """Run chaos experiment for duration."""
        end_time = datetime.now().timestamp() + (duration_minutes * 60)
        logger.info(f"Starting chaos experiment for {duration_minutes} minutes")

        while datetime.now().timestamp() < end_time:
            victim = self.kill_random_pod(dry_run=False)
            if victim:
                logger.info(f"Killed {victim}, waiting {interval_seconds}s")
            time.sleep(interval_seconds)

        logger.info("Chaos experiment completed")

# Usage
if __name__ == "__main__":
    killer = ChaosPodKiller(
        namespace="production",
        label_selector="app=api-service"
    )
    killer.run_experiment(duration_minutes=30, interval_seconds=300)
```

Kubernetes CronJob for Continuous Chaos

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: chaos-pod-killer
  namespace: chaos-system
spec:
  schedule: "*/10 9-17 * * 1-5"  # Every 10 min, 9-5, weekdays only
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: chaos-runner
          containers:
          - name: chaos
            image: chaos-toolkit:latest
            command:
            - python
            - /scripts/chaos_pod_killer.py
            env:
            - name: TARGET_NAMESPACE
              value: "production"
            - name: LABEL_SELECTOR
              value: "chaos-enabled=true"
          restartPolicy: Never
```

Network Chaos

Introduce Latency

```yaml
# Using Chaos Mesh
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-delay
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
    - production
    labelSelectors:
      app: api-service
  delay:
    latency: "100ms"
    jitter: "20ms"
    correlation: "50"
  duration: "5m"
  scheduler:
    cron: "@every 1h"
```

Packet Loss

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-loss
spec:
  action: loss
  mode: one
  selector:
    labelSelectors:
      app: payment-service
  loss:
    loss: "10"  # 10% packet loss
    correlation: "50"
  duration: "2m"
```

Network Partition

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: network-partition
spec:
  action: partition
  mode: all
  selector:
    namespaces:
    - production
    labelSelectors:
      app: api-service
  direction: both
  target:
    selector:
      namespaces:
      - production
      labelSelectors:
        app: database
  duration: "30s"
```

Resource Stress

CPU Stress

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: cpu-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: api-service
  stressors:
    cpu:
      workers: 2
      load: 80  # 80% CPU usage
  duration: "5m"
```

Memory Stress

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    labelSelectors:
      app: api-service
  stressors:
    memory:
      workers: 1
      size: "512MB"  # Allocate 512MB
  duration: "3m"
```

Application-Level Chaos

HTTP Fault Injection with Istio

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-chaos
spec:
  hosts:
  - api-service
  http:
  - fault:
      abort:
        percentage:
          value: 5
        httpStatus: 503
      delay:
        percentage:
          value: 10
        fixedDelay: 2s
    route:
    - destination:
        host: api-service
```

Custom Failure Injection

```python
# chaos_middleware.py
import os
import random
import time
from functools import wraps

CHAOS_ENABLED = os.getenv("CHAOS_ENABLED", "false").lower() == "true"
CHAOS_FAILURE_RATE = float(os.getenv("CHAOS_FAILURE_RATE", "0.0"))
CHAOS_LATENCY_MS = int(os.getenv("CHAOS_LATENCY_MS", "0"))

def chaos_middleware(f):
    """Inject chaos into function calls."""
    @wraps(f)
    def wrapper(*args, **kwargs):
        if not CHAOS_ENABLED:
            return f(*args, **kwargs)

        # Random failure
        if random.random() < CHAOS_FAILURE_RATE:
            raise Exception("Chaos injection: random failure")

        # Random latency
        if CHAOS_LATENCY_MS > 0:
            delay = random.randint(0, CHAOS_LATENCY_MS) / 1000
            time.sleep(delay)

        return f(*args, **kwargs)
    return wrapper

# Usage
@chaos_middleware
def call_payment_service(order):
    return requests.post(PAYMENT_URL, json=order.to_dict())
```

Experiment Framework

```python
# chaos_experiment.py
import time
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SteadyStateCheck:
    name: str
    check: Callable[[], bool]

@dataclass
class ChaosExperiment:
    name: str
    hypothesis: str
    steady_state_checks: List[SteadyStateCheck]
    inject_failure: Callable[[], None]
    rollback: Callable[[], None]
    duration_seconds: int = 60

def run_experiment(experiment: ChaosExperiment) -> dict:
    """Run a chaos experiment with proper controls."""
    result = {
        "name": experiment.name,
        "hypothesis": experiment.hypothesis,
        "success": False,
        "steady_state_before": {},
        "steady_state_after": {},
        "errors": []
    }

    # 1. Verify steady state before
    print("Checking steady state before experiment...")
    for check in experiment.steady_state_checks:
        try:
            passed = check.check()
            result["steady_state_before"][check.name] = passed
            if not passed:
                result["errors"].append(f"Pre-check failed: {check.name}")
                return result
        except Exception as e:
            result["errors"].append(f"Pre-check error: {check.name}: {e}")
            return result

    # 2. Inject failure
    print(f"Injecting failure: {experiment.name}")
    try:
        experiment.inject_failure()
    except Exception as e:
        result["errors"].append(f"Injection failed: {e}")
        experiment.rollback()
        return result

    # 3. Wait for duration
    print(f"Waiting {experiment.duration_seconds}s...")
    time.sleep(experiment.duration_seconds)

    # 4. Check steady state during/after
    print("Checking steady state after experiment...")
    for check in experiment.steady_state_checks:
        try:
            passed = check.check()
            result["steady_state_after"][check.name] = passed
        except Exception as e:
            result["steady_state_after"][check.name] = False
            result["errors"].append(f"Post-check error: {check.name}: {e}")

    # 5. Rollback
    print("Rolling back...")
    experiment.rollback()

    # 6. Evaluate hypothesis
    result["success"] = all(result["steady_state_after"].values())
    return result

# Example experiment
def check_api_responding():
    response = requests.get("http://api-service/health", timeout=5)
    return response.status_code == 200

def check_error_rate_low():
    # Query Prometheus
    result = prometheus.query(
        'sum(rate(http_requests_total{status=~"5.."}[1m]))'
        ' / sum(rate(http_requests_total[1m]))'
    )
    return float(result) < 0.01

def kill_one_api_pod():
    killer = ChaosPodKiller("production", "app=api-service")
    killer.kill_random_pod(dry_run=False)

def noop_rollback():
    pass  # Kubernetes will restart the pod

experiment = ChaosExperiment(
    name="api-pod-failure",
    hypothesis="API continues serving traffic when one pod dies",
    steady_state_checks=[
        SteadyStateCheck("api_responding", check_api_responding),
        SteadyStateCheck("error_rate_low", check_error_rate_low),
    ],
    inject_failure=kill_one_api_pod,
    rollback=noop_rollback,
    duration_seconds=60
)

result = run_experiment(experiment)
print(f"Experiment {'PASSED' if result['success'] else 'FAILED'}")
```

Game Days

Schedule regular chaos exercises: ...

February 11, 2026 · 8 min · 1552 words · Rob Washington

Incident Response: On-Call Practices That Don't Burn Out Your Team

On-call doesn't have to mean sleepless nights and burnout. With the right practices, you can maintain reliable systems while keeping your team healthy. Here's how.

Alert Design

Only Alert on Actionable Items

```yaml
# Bad: Noisy, leads to alert fatigue
- alert: HighCPU
  expr: cpu_usage > 80
  for: 1m

# Good: Actionable, symptom-based
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
    runbook: "https://runbooks.example.com/high-error-rate"
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} exceeds 1%"
    impact: "Users experiencing failures"
    action: "Check application logs and recent deployments"
```

Severity Levels

```yaml
# severity-definitions.yaml
severities:
  critical:
    description: "Service completely down, immediate action required"
    response_time: "5 minutes"
    notification: "page + phone call"
    examples:
    - "Total service outage"
    - "Data loss occurring"
    - "Security breach detected"
  warning:
    description: "Degraded service, action needed soon"
    response_time: "30 minutes"
    notification: "page"
    examples:
    - "Error rate elevated but service functional"
    - "Approaching resource limits"
    - "Single replica down"
  info:
    description: "Notable event, no immediate action"
    response_time: "Next business day"
    notification: "Slack"
    examples:
    - "Deployment completed"
    - "Scheduled maintenance starting"
```

Runbooks

Every alert needs a runbook: ...

February 11, 2026 · 16 min · 3302 words · Rob Washington

Webhook Patterns: Building Reliable Event-Driven Integrations

Webhooks seem simple: receive an HTTP POST, do something. In practice, they're a minefield of security issues, reliability problems, and edge cases. Let's build webhooks that actually work in production.

Receiving Webhooks

Basic Receiver

```python
import hashlib
import hmac
from typing import Optional

from fastapi import FastAPI, Request, HTTPException, Header

app = FastAPI()

WEBHOOK_SECRET = "your-webhook-secret"

@app.post("/webhooks/stripe")
async def stripe_webhook(
    request: Request,
    stripe_signature: Optional[str] = Header(None, alias="Stripe-Signature")
):
    payload = await request.body()

    # Verify signature
    if not verify_stripe_signature(payload, stripe_signature, WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    event = await request.json()

    # Process event
    event_type = event.get("type")
    if event_type == "payment_intent.succeeded":
        await handle_payment_success(event["data"]["object"])
    elif event_type == "customer.subscription.deleted":
        await handle_subscription_cancelled(event["data"]["object"])

    return {"received": True}

def verify_stripe_signature(payload: bytes, signature: str, secret: str) -> bool:
    """Verify Stripe webhook signature."""
    if not signature:
        return False

    # Parse signature header
    elements = dict(item.split("=") for item in signature.split(","))
    timestamp = elements.get("t")
    expected_sig = elements.get("v1")

    # Compute expected signature
    signed_payload = f"{timestamp}.{payload.decode()}"
    computed_sig = hmac.new(
        secret.encode(),
        signed_payload.encode(),
        hashlib.sha256
    ).hexdigest()

    return hmac.compare_digest(computed_sig, expected_sig)
```

Idempotent Processing

Webhooks can be delivered multiple times. Make your handlers idempotent: ...
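The excerpt cuts off at idempotent processing, but the usual shape is to deduplicate on the provider's event ID before running any side effects. A minimal sketch with an in-memory seen-set (production would back this with a database table or Redis; the names are illustrative):

```python
processed_event_ids = set()  # in production: a DB table or Redis SET with a TTL

def handle_event_once(event: dict, handler) -> bool:
    """Run handler only the first time an event ID is seen.
    Returns True if the handler ran, False for a duplicate delivery."""
    event_id = event.get("id")
    if event_id is None:
        # No ID to deduplicate on; the handler itself must be idempotent.
        handler(event)
        return True
    if event_id in processed_event_ids:
        return False
    handler(event)
    # Record only after the handler succeeds, so a crash mid-handler
    # lets the provider's retry reprocess the event.
    processed_event_ids.add(event_id)
    return True
```

Recording the ID after the side effect (not before) trades "at most once" for "at least once", which is usually the right call for payments: a duplicate you can detect beats a dropped event you can't.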

February 11, 2026 · 7 min · 1378 words · Rob Washington

Zero-Downtime Deployments: Blue-Green and Canary Strategies

Deploying new code shouldn't mean crossing your fingers. Blue-green and canary deployments let you release changes with confidence, validate them with real traffic, and roll back in seconds if something goes wrong.

Blue-Green Deployments

Blue-green maintains two identical production environments. One serves traffic (blue), while the other stands ready (green). To deploy, you push to green, test it, then switch traffic over.

[Diagram: a router directs user traffic to the blue environment (stable, v1.0) while the green environment (candidate, v1.1) receives the new deployment; after validation, the router switches users to green.]

Kubernetes Implementation

```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
      - name: app
        image: myapp:1.0.0
        ports:
        - containerPort: 8080
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
      - name: app
        image: myapp:1.1.0
        ports:
        - containerPort: 8080
---
# service.yaml - switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
  - port: 80
    targetPort: 8080
```

Deployment Script

```bash
#!/bin/bash
set -e

NEW_VERSION=$1
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" == "blue" ]; then
  TARGET="green"
else
  TARGET="blue"
fi

echo "Current: $CURRENT, Deploying to: $TARGET"

# Update the standby deployment
kubectl set image deployment/app-$TARGET app=myapp:$NEW_VERSION

# Wait for rollout
kubectl rollout status deployment/app-$TARGET --timeout=300s

# Run smoke tests against standby
kubectl run smoke-test --rm -it --image=curlimages/curl \
  --restart=Never -- curl -f http://app-$TARGET:8080/health

# Switch traffic
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}"

echo "Switched to $TARGET (v$NEW_VERSION)"
```

Instant Rollback

```bash
#!/bin/bash
# rollback.sh - switch back to previous version
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" == "blue" ]; then
  PREVIOUS="green"
else
  PREVIOUS="blue"
fi

kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$PREVIOUS\"}}}"
echo "Rolled back to $PREVIOUS"
```

Canary Deployments

Canary deployments route a small percentage of traffic to the new version, gradually increasing if metrics look good. ...

February 11, 2026 · 8 min · 1669 words · Rob Washington

Health Checks Done Right: Liveness, Readiness, and Startup Probes

A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails. Let's build health checks that actually tell you when something's wrong.

The Three Types of Probes

Kubernetes defines three probe types, each serving a distinct purpose:

Liveness Probe: "Is this process stuck?" If it fails, Kubernetes kills and restarts the container.

Readiness Probe: "Can this instance handle traffic?" If it fails, the instance is removed from load balancing but keeps running. ...
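The liveness/readiness distinction above maps directly onto two differently scoped handlers. A minimal sketch as plain functions returning a status code and body (framework wiring omitted; the names and check structure are illustrative):

```python
def liveness():
    """Liveness answers only: can this process respond at all?
    Never check dependencies here, or a database outage restarts every pod."""
    return 200, {"status": "alive"}

def readiness(dependency_checks):
    """Readiness answers: can THIS instance serve traffic right now?
    A failing dependency pulls the pod out of load balancing without restarting it."""
    results = {name: bool(check()) for name, check in dependency_checks.items()}
    code = 200 if all(results.values()) else 503
    return code, {
        "status": "ready" if code == 200 else "not ready",
        "checks": results,
    }
```

Keeping dependency checks out of liveness is the point the intro makes: a check that restarts healthy pods because the database is down turns one failure into two.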

February 11, 2026 · 6 min · 1174 words · Rob Washington

Circuit Breakers: Building Systems That Fail Gracefully

In distributed systems, failures are inevitable. A single slow or failing service can cascade through your entire architecture, turning a minor issue into a major outage. Circuit breakers prevent this by detecting failures and stopping the cascade before it spreads.

The Problem: Cascading Failures

Imagine Service A calls Service B, which calls Service C. If Service C becomes slow:

1. Requests to C start timing out
2. Service B's thread pool fills up waiting for C
3. Service B becomes slow
4. Service A's threads fill up waiting for B
5. Your entire system grinds to a halt

One slow service just took down everything. ...
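The cascade above is broken by failing fast instead of waiting. A minimal closed/open/half-open breaker sketch (thresholds and names are illustrative, not the post's implementation; the injectable clock is just for testability):

```python
import time

class CircuitBreaker:
    """Closed -> (failures reach threshold) -> Open -> (after reset_timeout) -> Half-open."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail immediately instead of tying up a thread waiting
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call through
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # success closes the circuit fully
        return result
```

While open, callers get an instant error they can handle (fallback, cached value, degraded response) rather than a slow timeout, so B's thread pool never fills up waiting on C.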

February 11, 2026 · 8 min · 1677 words · Rob Washington