Incident Response: On-Call Practices That Don't Burn Out Your Team

On-call doesn't have to mean sleepless nights and burnout. With the right practices, you can maintain reliable systems while keeping your team healthy. Here's how.

Alert Design

Only Alert on Actionable Items

```yaml
# Bad: Noisy, leads to alert fatigue
- alert: HighCPU
  expr: cpu_usage > 80
  for: 1m

# Good: Actionable, symptom-based
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) > 0.01
  for: 5m
  labels:
    severity: warning
    runbook: "https://runbooks.example.com/high-error-rate"
  annotations:
    summary: "Error rate {{ $value | humanizePercentage }} exceeds 1%"
    impact: "Users experiencing failures"
    action: "Check application logs and recent deployments"
```

Severity Levels

```yaml
# severity-definitions.yaml
severities:
  critical:
    description: "Service completely down, immediate action required"
    response_time: "5 minutes"
    notification: "page + phone call"
    examples:
      - "Total service outage"
      - "Data loss occurring"
      - "Security breach detected"
  warning:
    description: "Degraded service, action needed soon"
    response_time: "30 minutes"
    notification: "page"
    examples:
      - "Error rate elevated but service functional"
      - "Approaching resource limits"
      - "Single replica down"
  info:
    description: "Notable event, no immediate action"
    response_time: "Next business day"
    notification: "Slack"
    examples:
      - "Deployment completed"
      - "Scheduled maintenance starting"
```

Runbooks

Every alert needs a runbook: ...
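The severity table above can also be enforced in code so that routing never drifts from the policy. A minimal sketch (function and dictionary names are mine, not from the post), which deliberately escalates unknown severities so a misconfigured alert still pages someone:

```python
# Routing policy mirroring severity-definitions.yaml (values copied from it)
SEVERITY_ROUTES = {
    "critical": {"notify": "page + phone call", "response_minutes": 5},
    "warning": {"notify": "page", "response_minutes": 30},
    "info": {"notify": "Slack", "response_minutes": None},  # next business day
}

def route_alert(severity: str) -> dict:
    """Look up the routing policy for an alert's severity label.

    Unknown severities are treated as critical on purpose: a typo in an
    alert definition should over-notify, never silently drop.
    """
    return SEVERITY_ROUTES.get(severity, SEVERITY_ROUTES["critical"])
```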

February 11, 2026 · 16 min · 3302 words · Rob Washington

Observability: Beyond Monitoring with Metrics, Logs, and Traces

Monitoring tells you when something is wrong. Observability helps you understand why. In distributed systems, you can't predict every failure mode, so you need systems that let you ask arbitrary questions about their behavior.

The Three Pillars

Metrics: What's Happening Now

Numeric time-series data. Fast to query, cheap to store.

```python
from flask import Flask, request  # imports added for completeness
from prometheus_client import Counter, Histogram, Gauge, start_http_server

app = Flask(__name__)

# Counter - only goes up
requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

# Histogram - distribution of values
request_duration = Histogram(
    'http_request_duration_seconds',
    'Request duration in seconds',
    ['method', 'endpoint'],
    buckets=[.01, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

# Gauge - can go up or down
active_connections = Gauge(
    'active_connections',
    'Number of active connections'
)

# Usage
@app.route("/api/<endpoint>")
def handle_request(endpoint):
    active_connections.inc()
    with request_duration.labels(
        method=request.method,
        endpoint=endpoint
    ).time():
        result = process_request()
    requests_total.labels(
        method=request.method,
        endpoint=endpoint,
        status=200
    ).inc()
    active_connections.dec()
    return result

# Expose metrics endpoint
start_http_server(9090)
```

Logs: What Happened

Discrete events with context. Rich detail, expensive at scale. ...
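One detail worth internalizing about the `Histogram` above: Prometheus buckets are cumulative, so each observation increments every bucket whose upper bound is at or above the value. A rough pure-Python sketch of that bookkeeping (the bucket list is copied from the excerpt; the implementation is illustrative, not the client library's actual code):

```python
# Cumulative bucket counting, as Prometheus histograms expose it.
# A +Inf bucket always exists and counts every observation.
BUCKETS = [.01, .05, .1, .25, .5, 1, 2.5, 5, 10, float("inf")]

def observe(counts: dict, value: float) -> None:
    """Increment every bucket whose upper bound is >= value."""
    for upper in BUCKETS:
        if value <= upper:
            counts[upper] = counts.get(upper, 0) + 1

counts = {}
observe(counts, 0.3)   # counted in the 0.5 bucket and all larger ones
observe(counts, 0.03)  # counted in the 0.05 bucket and all larger ones
```

This is why `histogram_quantile()` can estimate percentiles from bucket counts alone.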

February 11, 2026 · 7 min · 1291 words · Rob Washington

The Twelve-Factor App: Building Cloud-Native Applications That Scale

The twelve-factor methodology emerged from Heroku's experience running millions of apps. These principles create applications that deploy cleanly, scale effortlessly, and minimize divergence between development and production. Let's walk through each factor with practical examples.

1. Codebase: One Repo, Many Deploys

One codebase tracked in version control, many deploys (dev, staging, prod).

```
# Good: Single repo, branch-based environments
main      → production
staging   → staging
feature/* → development

# Bad: Separate repos for each environment
myapp-dev/
myapp-staging/
myapp-prod/
```

```python
# config.py - Same code, different configs
import os

ENVIRONMENT = os.getenv("ENVIRONMENT", "development")
DATABASE_URL = os.getenv("DATABASE_URL")
```

2. Dependencies: Explicitly Declare and Isolate

Never rely on system-wide packages. Declare everything. ...
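The `config.py` pattern above pairs naturally with failing fast: a required setting that is missing should crash at startup, not at first use. A small sketch of that idea (the `require_env` helper is mine, not from the post):

```python
import os

def require_env(name, default=None):
    """Read a config value from the environment.

    Returns the default when one is provided; raises immediately when a
    required value is absent, so a misconfigured deploy fails at boot.
    """
    value = os.getenv(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

os.environ["ENVIRONMENT"] = "staging"  # simulate a deploy-time setting
ENVIRONMENT = require_env("ENVIRONMENT", "development")
```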

February 11, 2026 · 6 min · 1237 words · Rob Washington

Testing Infrastructure: From Terraform Plans to Production Validation

Infrastructure code deserves the same testing rigor as application code. A typo in Terraform can delete a database. An untested Ansible role can break production. Let's build confidence with proper testing.

The Testing Pyramid for Infrastructure

(Diagram: the infrastructure testing pyramid. End-to-end tests with full deploys at the top, integration tests against real resources in the middle, and unit tests at the base: static analysis and plan validation.)

Unit Testing: Static Analysis

Terraform Validation

```bash
# Built-in validation
terraform init
terraform validate
terraform fmt -check

# Custom validation rules
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
```

```python
# tests/test_terraform_plan.py
import json
import pytest

@pytest.fixture
def plan():
    with open('plan.json') as f:
        return json.load(f)

def test_no_resources_destroyed(plan):
    """Ensure no resources are being destroyed."""
    changes = plan.get('resource_changes', [])
    destroyed = [c for c in changes
                 if 'delete' in c.get('change', {}).get('actions', [])]
    assert len(destroyed) == 0, \
        f"Resources being destroyed: {[d['address'] for d in destroyed]}"

def test_no_public_s3_buckets(plan):
    """Ensure S3 buckets aren't public."""
    changes = plan.get('resource_changes', [])
    for change in changes:
        if change['type'] == 'aws_s3_bucket':
            after = change.get('change', {}).get('after', {})
            acl = after.get('acl', 'private')
            assert acl == 'private', \
                f"Bucket {change['address']} has public ACL: {acl}"

def test_instances_have_tags(plan):
    """Ensure EC2 instances have required tags."""
    required_tags = {'Environment', 'Owner', 'Project'}
    changes = plan.get('resource_changes', [])
    for change in changes:
        if change['type'] == 'aws_instance':
            after = change.get('change', {}).get('after', {})
            tags = set(after.get('tags', {}).keys())
            missing = required_tags - tags
            assert not missing, \
                f"Instance {change['address']} missing tags: {missing}"
```

Policy as Code with OPA

```rego
# policy/terraform.rego
package terraform

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_security_group_rule"
    resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
    resource.change.after.from_port == 22
    msg := sprintf("SSH open to world in %v", [resource.address])
}

deny[msg] {
    resource := input.resource_changes[_]
    resource.type == "aws_db_instance"
    resource.change.after.publicly_accessible == true
    msg := sprintf("RDS instance %v is publicly accessible", [resource.address])
}
```

```bash
# Run OPA checks
terraform show -json tfplan | opa eval -i - -d policy/ "data.terraform.deny"
```

Integration Testing with Terratest

Terratest deploys real infrastructure, validates it, then tears it down: ...
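If you cannot adopt OPA yet, the same guardrail can live in plain Python over the plan JSON. A sketch mirroring the first `deny` rule above (the plan fragment and function name are illustrative; the keys match the pytest examples):

```python
def ssh_open_to_world(plan: dict) -> list:
    """Return addresses of security group rules exposing port 22 to 0.0.0.0/0."""
    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_security_group_rule":
            continue
        after = change.get("change", {}).get("after", {}) or {}
        if "0.0.0.0/0" in after.get("cidr_blocks", []) and after.get("from_port") == 22:
            violations.append(change["address"])
    return violations

# Hypothetical plan fragment with one violating rule
plan = {"resource_changes": [{
    "address": "aws_security_group_rule.ssh",
    "type": "aws_security_group_rule",
    "change": {"after": {"cidr_blocks": ["0.0.0.0/0"], "from_port": 22}},
}]}
```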

February 11, 2026 · 7 min · 1358 words · Rob Washington

Container Security: Hardening Your Docker Images and Kubernetes Deployments

Containers aren't inherently secure. A default Docker image runs as root, includes unnecessary packages, and exposes more attack surface than needed. Let's fix that.

Secure Dockerfile Practices

Use Minimal Base Images

```dockerfile
# BAD: Full OS with unnecessary packages
FROM ubuntu:latest

# BETTER: Slim variant
FROM python:3.11-slim

# BEST: Distroless (no shell, no package manager)
FROM gcr.io/distroless/python3-debian11
```

Size comparison:

- ubuntu:latest: ~77MB
- python:3.11-slim: ~45MB
- distroless/python3: ~16MB

Smaller image = smaller attack surface. ...
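Hardening rules like these are easy to check mechanically in CI before an image is ever built. A toy linter sketch for two common misses (both the checks and their wording are mine; real tools like hadolint cover far more):

```python
def lint_dockerfile(text: str) -> list:
    """Flag unpinned base images and missing USER directives in Dockerfile text."""
    problems = []
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    for line in lines:
        parts = line.split()
        if parts[0].upper() == "FROM" and len(parts) > 1:
            image = parts[1]
            # An image with no tag, or tagged :latest, is not reproducible
            if ":latest" in image or ":" not in image:
                problems.append(f"Unpinned base image: {image}")
    if not any(l.split()[0].upper() == "USER" for l in lines):
        problems.append("No USER directive: container runs as root")
    return problems
```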

February 11, 2026 · 5 min · 1030 words · Rob Washington

Structured Logging: Stop Grepping, Start Querying

Unstructured logs are a liability. When your application writes `User 12345 logged in from 192.168.1.1`, you're creating text that's easy to read but impossible to query at scale. Structured logging changes the game: logs become data you can filter, aggregate, and analyze.

The Problem with Text Logs

```python
# Traditional logging
import logging
logging.info(f"User {user_id} logged in from {ip_address}")
# Output: INFO:root:User 12345 logged in from 192.168.1.1
```

Want to find all logins from a specific IP? You need regex. Want to count logins per user? Good luck. Want to correlate with other events? Hope your timestamp parsing is solid. ...
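For contrast, the stdlib can emit structured output with nothing but a custom formatter. A minimal sketch (the `fields` convention and formatter name are mine; libraries like structlog give you this and more):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, merging in any dict
    attached to the record via extra={"fields": {...}}."""
    def format(self, record):
        payload = {"level": record.levelname, "event": record.getMessage()}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

stream = io.StringIO()  # stand-in for stdout so the demo is self-contained
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("structured-demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# user_id and ip become queryable fields, not text to grep
logger.info("user_login", extra={"fields": {"user_id": 12345, "ip": "192.168.1.1"}})
```

Each line is now valid JSON: filter by `ip`, group by `user_id`, no regex required.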

February 11, 2026 · 6 min · 1083 words · Rob Washington

Zero-Downtime Deployments: Blue-Green and Canary Strategies

Deploying new code shouldn't mean crossing your fingers. Blue-green and canary deployments let you release changes with confidence, validate them with real traffic, and roll back in seconds if something goes wrong.

Blue-Green Deployments

Blue-green maintains two identical production environments. One serves traffic (blue), while the other stands ready (green). To deploy, you push to green, test it, then switch traffic over.

(Diagram: before the switch, a router sends all user traffic to the active Blue environment (v1.0) while Green stands by; after deploying v1.1 to Green, the router switches user traffic to Green.)

Kubernetes Implementation

```yaml
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: app
          image: myapp:1.0.0
          ports:
            - containerPort: 8080
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: app
          image: myapp:1.1.0
          ports:
            - containerPort: 8080
---
# service.yaml - switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
    - port: 80
      targetPort: 8080
```

Deployment Script

```bash
#!/bin/bash
set -e

NEW_VERSION=$1
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" == "blue" ]; then
  TARGET="green"
else
  TARGET="blue"
fi

echo "Current: $CURRENT, Deploying to: $TARGET"

# Update the standby deployment
kubectl set image deployment/app-$TARGET app=myapp:$NEW_VERSION

# Wait for rollout
kubectl rollout status deployment/app-$TARGET --timeout=300s

# Run smoke tests against standby
kubectl run smoke-test --rm -it --image=curlimages/curl \
  --restart=Never -- curl -f http://app-$TARGET:8080/health

# Switch traffic
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}"

echo "Switched to $TARGET (v$NEW_VERSION)"
```

Instant Rollback

```bash
#!/bin/bash
# rollback.sh - switch back to previous version
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT" == "blue" ]; then
  PREVIOUS="green"
else
  PREVIOUS="blue"
fi

kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$PREVIOUS\"}}}"
echo "Rolled back to $PREVIOUS"
```

Canary Deployments

Canary deployments route a small percentage of traffic to the new version, gradually increasing if metrics look good. ...
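The canary loop at the end can be sketched as a gated rollout: walk a step schedule and abort the moment the observed error rate breaks the budget. A minimal sketch (step percentages, the 1% budget, and `error_rate_at` are illustrative assumptions, not from the post):

```python
CANARY_STEPS = [1, 10, 50, 100]  # percent of traffic sent to the new version

def run_canary(error_rate_at):
    """Advance through the rollout steps, checking the error rate at each.

    `error_rate_at(percent)` is assumed to query your metrics backend for the
    new version's error rate at that traffic level. Any breach of the 1%
    budget aborts the rollout and returns traffic to the stable version.
    """
    for percent in CANARY_STEPS:
        if error_rate_at(percent) > 0.01:
            return {"status": "rolled_back", "traffic": 0}
    return {"status": "promoted", "traffic": 100}
```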

February 11, 2026 · 8 min · 1669 words · Rob Washington

Health Checks Done Right: Liveness, Readiness, and Startup Probes

A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails. Let's build health checks that actually tell you when something's wrong.

The Three Types of Probes

Kubernetes defines three probe types, each serving a distinct purpose:

Liveness Probe: "Is this process stuck?" If it fails, Kubernetes kills and restarts the container.

Readiness Probe: "Can this instance handle traffic?" If it fails, the instance is removed from load balancing but keeps running. ...
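A readiness endpoint that earns its keep actually exercises dependencies rather than returning a constant 200. A minimal handler sketch (the check names and return shape are mine, not a specific framework's API):

```python
def readiness(checks: dict):
    """Run each named dependency check and return (http_status, details).

    Any failing or raising check marks the instance not-ready (503), so the
    load balancer drains it while the process keeps running.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failing"
        except Exception as exc:
            results[name] = f"error: {exc}"
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results
```

Wire it to `/ready` in your web framework and pass real checks (database ping, cache ping) instead of the lambdas used in testing.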

February 11, 2026 · 6 min · 1174 words · Rob Washington

Feature Flags: Ship Fast, Control Risk, and Test in Production

Deploying code and releasing features don't have to be the same thing. Feature flags let you ship code to production while controlling who sees it, when they see it, and how quickly you roll it out. This separation is transformative for both velocity and safety.

Why Feature Flags?

Traditional deployment: code goes live, everyone gets it immediately, and if something breaks, you redeploy. With feature flags:

- Deploy anytime: code exists in production but isn't active
- Release gradually: 1% of users, then 10%, then 50%, then everyone
- Instant rollback: flip a switch, no deployment needed
- Test in production: real traffic, real conditions, controlled exposure
- A/B testing: compare variants with actual user behavior

Basic Implementation

Start simple. A feature flag is just a conditional: ...
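The "1%, then 10%, then 50%" rollout in the list above usually rests on stable hashing: bucket each user deterministically so the same user always gets the same answer as the percentage grows. A sketch of that core (not any particular flag library's API):

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into 0-99 and enable the flag when
    the bucket falls below the rollout percentage.

    Hashing flag and user together means different flags roll out to
    different (uncorrelated) slices of the user base.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Raising `rollout_percent` from 10 to 50 keeps every already-enabled user enabled, which is what makes the ramp-up safe to observe.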

February 11, 2026 · 7 min · 1365 words · Rob Washington

Backup and Disaster Recovery: Because Hope Is Not a Strategy

A practical guide to backups and disaster recovery: automated backup strategies, testing your restores, and building systems that survive the worst.
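One small building block of the automated strategies mentioned above is retention pruning: decide which backups to keep rather than hoarding all of them. A sketch of a daily-plus-weekly policy (the policy numbers and function name are mine, not from the article):

```python
from datetime import date, timedelta

def backups_to_keep(dates, daily=7, weekly=4):
    """Select backup dates to retain: the `daily` most recent, plus the
    newest backup from each of the `weekly` most recent ISO weeks.
    Everything else is a candidate for deletion."""
    ordered = sorted(dates, reverse=True)  # newest first
    keep = set(ordered[:daily])
    seen_weeks = []
    for d in ordered:
        week = d.isocalendar()[:2]  # (year, week number)
        if week not in seen_weeks:
            seen_weeks.append(week)
            if len(seen_weeks) <= weekly:
                keep.add(d)
    return keep
```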

February 11, 2026 · 9 min · 1879 words · Rob Washington