Posts

Service Mesh: Traffic Management, Security, and Observability with Istio

When you have dozens of microservices talking to each other, managing traffic, security, and observability becomes complex. A service mesh handles this at the infrastructure layer, so your applications don’t have to. What Problems Does a Service Mesh Solve? Without a mesh, every service needs to implement: Retries and timeouts Circuit breakers Load balancing TLS certificates Metrics and tracing Access control With a mesh, the sidecar proxy handles all of this: ...

Testing Infrastructure: From Terraform Plans to Production Validation

Infrastructure code deserves the same testing rigor as application code. A typo in Terraform can delete a database. An untested Ansible role can break production. Let’s build confidence with proper testing. The Testing Pyramid for Infrastructure E ( I ( 2 F n R E u t e U l e a n ( T l g l i S e r t t s s a c a t t t l T t s a i o e i c o u s c k n d t s a d T r n e e e a p s s l l t o y o s u s y r i m c s e e , n s t ) p ) l a n v a l i d a t i o n ) Unit Testing: Static Analysis Terraform Validation 1 2 3 4 5 6 7 8 # Built-in validation terraform init terraform validate terraform fmt -check # Custom validation rules terraform plan -out=tfplan terraform show -json tfplan > plan.json 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 # tests/test_terraform_plan.py import json import pytest @pytest.fixture def plan(): with open('plan.json') as f: return json.load(f) def test_no_resources_destroyed(plan): """Ensure no resources are being destroyed.""" changes = plan.get('resource_changes', []) destroyed = [c for c in changes if 'delete' in c.get('change', {}).get('actions', [])] assert len(destroyed) == 0, f"Resources being destroyed: {[d['address'] for d in destroyed]}" def test_no_public_s3_buckets(plan): """Ensure S3 buckets aren't public.""" changes = plan.get('resource_changes', []) for change in changes: if change['type'] == 'aws_s3_bucket': after = change.get('change', {}).get('after', {}) acl = after.get('acl', 'private') assert acl == 'private', f"Bucket {change['address']} has public ACL: {acl}" def test_instances_have_tags(plan): """Ensure EC2 instances have required tags.""" required_tags = {'Environment', 'Owner', 'Project'} changes = plan.get('resource_changes', []) for change in changes: if change['type'] == 'aws_instance': after = change.get('change', {}).get('after', {}) tags = set(after.get('tags', {}).keys()) missing = required_tags - tags assert not missing, f"Instance {change['address']} missing tags: {missing}" Policy as Code with OPA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # policy/terraform.rego package terraform deny[msg] { resource := input.resource_changes[_] resource.type == "aws_security_group_rule" resource.change.after.cidr_blocks[_] == "0.0.0.0/0" resource.change.after.from_port == 22 msg := sprintf("SSH open to world in %v", [resource.address]) } deny[msg] { resource := input.resource_changes[_] resource.type == "aws_db_instance" resource.change.after.publicly_accessible == true msg := sprintf("RDS instance %v is publicly accessible", [resource.address]) } 1 2 # Run OPA checks terraform show -json tfplan | opa eval -i - -d policy/ "data.terraform.deny" Integration Testing with Terratest Terratest deploys real infrastructure, validates it, then tears it down: ...

Container Security: Hardening Your Docker Images and Kubernetes Deployments

Containers aren’t inherently secure. A default Docker image runs as root, includes unnecessary packages, and exposes more attack surface than needed. Let’s fix that. Secure Dockerfile Practices Use Minimal Base Images 1 2 3 4 5 6 7 8 # BAD: Full OS with unnecessary packages FROM ubuntu:latest # BETTER: Slim variant FROM python:3.11-slim # BEST: Distroless (no shell, no package manager) FROM gcr.io/distroless/python3-debian11 Size comparison: ubuntu:latest: ~77MB python:3.11-slim: ~45MB distroless/python3: ~16MB Smaller image = smaller attack surface. ...

API Versioning: Strategies That Won't Break Your Clients

APIs evolve. Fields get renamed, endpoints change, entire resources get restructured. The question isn’t whether to version your API—it’s how to do it without breaking every client that depends on you. Versioning Strategies 1. URL Path Versioning The most visible approach—version is part of the URL: G G E E T T / / a a p p i i / / v 1 2 / / u u s s e e r r s s / / 1 1 2 2 3 3 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 from fastapi import FastAPI, APIRouter app = FastAPI() # Version 1 v1_router = APIRouter(prefix="/api/v1") @v1_router.get("/users/{user_id}") async def get_user_v1(user_id: int): return { "id": user_id, "name": "John Doe", "email": "john@example.com" } # Version 2 - restructured response v2_router = APIRouter(prefix="/api/v2") @v2_router.get("/users/{user_id}") async def get_user_v2(user_id: int): return { "id": user_id, "profile": { "name": "John Doe", "email": "john@example.com" }, "metadata": { "created_at": "2026-01-01T00:00:00Z", "updated_at": "2026-02-11T00:00:00Z" } } app.include_router(v1_router) app.include_router(v2_router) Pros: Explicit, easy to understand, easy to route Cons: URLs change between versions, harder to maintain multiple versions ...

Message Queues: Decoupling Services for Scale and Reliability

When Service A needs to tell Service B something happened, the simplest approach is a direct HTTP call. But what happens when Service B is slow? Or down? Or overwhelmed? Message queues decouple your services, letting them communicate reliably even when things go wrong. Why Queues? Without a queue: U s e r R e q u e s t → A P I → P ( ( a i i y f f m e s d n l o t o w w n S , , e r u r v s e i e q c r u e e w s → a t i E t f m s a a ) i i l l s ) S e r v i c e → R e s p o n s e With a queue: ...

Webhook Patterns: Building Reliable Event-Driven Integrations

Webhooks seem simple: receive an HTTP POST, do something. In practice, they’re a minefield of security issues, reliability problems, and edge cases. Let’s build webhooks that actually work in production. Receiving Webhooks Basic Receiver 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 from fastapi import FastAPI, Request, HTTPException, Header from typing import Optional import hmac import hashlib app = FastAPI() WEBHOOK_SECRET = "your-webhook-secret" @app.post("/webhooks/stripe") async def stripe_webhook( request: Request, stripe_signature: Optional[str] = Header(None, alias="Stripe-Signature") ): payload = await request.body() # Verify signature if not verify_stripe_signature(payload, stripe_signature, WEBHOOK_SECRET): raise HTTPException(status_code=401, detail="Invalid signature") event = await request.json() # Process event event_type = event.get("type") if event_type == "payment_intent.succeeded": await handle_payment_success(event["data"]["object"]) elif event_type == "customer.subscription.deleted": await handle_subscription_cancelled(event["data"]["object"]) return {"received": True} def verify_stripe_signature(payload: bytes, signature: str, secret: str) -> bool: """Verify Stripe webhook signature.""" if not signature: return False # Parse signature header elements = dict(item.split("=") for item in signature.split(",")) timestamp = elements.get("t") expected_sig = elements.get("v1") # Compute expected signature signed_payload = f"{timestamp}.{payload.decode()}" computed_sig = hmac.new( secret.encode(), signed_payload.encode(), hashlib.sha256 ).hexdigest() return hmac.compare_digest(computed_sig, expected_sig) Idempotent Processing Webhooks can be delivered multiple times. Make your handlers idempotent: ...

Structured Logging: Stop Grepping, Start Querying

Unstructured logs are a liability. When your application writes User 12345 logged in from 192.168.1.1, you’re creating text that’s easy to read but impossible to query at scale. Structured logging changes the game: logs become data you can filter, aggregate, and analyze. The Problem with Text Logs 1 2 3 4 # Traditional logging import logging logging.info(f"User {user_id} logged in from {ip_address}") # Output: INFO:root:User 12345 logged in from 192.168.1.1 Want to find all logins from a specific IP? You need regex. Want to count logins per user? Good luck. Want to correlate with other events? Hope your timestamp parsing is solid. ...

GitOps with ArgoCD: Your Infrastructure as Code, Actually

GitOps takes “Infrastructure as Code” literally: your Git repository becomes the single source of truth for what should be running. ArgoCD watches your repo and automatically synchronizes your cluster to match. No more kubectl apply from laptops, no more “what’s actually deployed?” mysteries. GitOps Principles Declarative: Describe the desired state, not the steps to get there Versioned: All changes go through Git (audit trail, rollback) Automated: Changes are applied automatically when Git changes Self-healing: Drift from desired state is automatically corrected Installing ArgoCD 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # Create namespace kubectl create namespace argocd # Install ArgoCD kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml # Wait for pods kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s # Get initial admin password kubectl -n argocd get secret argocd-initial-admin-secret \ -o jsonpath="{.data.password}" | base64 -d # Port forward to access UI kubectl port-forward svc/argocd-server -n argocd 8080:443 Access the UI at https://localhost:8080 with username admin. ...

Zero-Downtime Deployments: Blue-Green and Canary Strategies

Deploying new code shouldn’t mean crossing your fingers. Blue-green and canary deployments let you release changes with confidence, validate them with real traffic, and roll back in seconds if something goes wrong. Blue-Green Deployments Blue-green maintains two identical production environments. One serves traffic (blue), while the other stands ready (green). To deploy, you push to green, test it, then switch traffic over. B ┌ │ │ │ └ ┌ │ │ │ └ A ┌ │ │ │ └ ┌ │ │ │ └ e ─ ─ ─ ─ f ─ ─ ─ ─ f ─ ─ ─ ─ t ─ ─ ─ ─ o ─ ─ ─ S ─ e ─ S ─ ─ ─ r ─ A ─ ─ ( T ─ r ─ T ─ ─ A ─ e ─ B v C ─ ─ G i A ─ ─ B v A ─ ─ G v C ─ ─ l 1 T ─ ─ r d N ─ d ─ l 1 N ─ ─ r 1 T ─ d ─ u . I ─ ─ e l D ─ e ─ u . D ─ ─ e . I ─ e ─ e 0 V ─ ─ e e B ─ p ─ e 0 B ─ ─ e 1 V ─ p ─ ) E ─ ─ n ) Y ─ l ─ ) Y ─ ─ n ) E ─ l ─ ─ ─ ─ o ─ ─ ─ ─ o ─ ─ ─ ─ y ─ ─ ─ ─ y ─ ─ ─ ─ m ─ ─ ─ ─ m ─ ─ ─ ─ e ─ ─ ─ ─ e ┐ │ │ │ ┘ ┐ │ │ │ ┘ n ┐ │ │ │ ┘ ┐ │ │ │ ┘ n t t ◄ : ◄ : ─ ─ ─ ─ ┌ │ └ ┌ │ └ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ R ─ ─ R ─ ─ o ─ ─ o ─ ─ u ─ ─ u ─ ─ t ─ ─ t ─ ─ e ─ ─ e ─ ─ r ─ ─ r ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ ┘ ┐ │ ┘ ◄ ◄ ─ ─ ─ ─ U U s s e e r r s s Kubernetes Implementation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 # blue-deployment.yaml apiVersion: apps/v1 kind: Deployment metadata: name: app-blue labels: app: myapp version: blue spec: replicas: 3 selector: matchLabels: app: myapp version: blue template: metadata: labels: app: myapp version: blue spec: containers: - name: app image: myapp:1.0.0 ports: - containerPort: 8080 --- # green-deployment.yaml apiVersion: apps/v1 kind: Deployment metadata: name: app-green labels: app: myapp version: green spec: replicas: 3 selector: matchLabels: app: myapp version: green template: metadata: labels: app: myapp version: green spec: containers: - name: app image: myapp:1.1.0 ports: - containerPort: 8080 --- # service.yaml - switch by changing selector apiVersion: v1 kind: Service metadata: name: myapp spec: selector: app: myapp version: blue # Change to 'green' to switch ports: - port: 80 targetPort: 8080 Deployment Script 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 #!/bin/bash set -e NEW_VERSION=$1 CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}') if [ "$CURRENT" == "blue" ]; then TARGET="green" else TARGET="blue" fi echo "Current: $CURRENT, Deploying to: $TARGET" # Update the standby deployment kubectl set image deployment/app-$TARGET app=myapp:$NEW_VERSION # Wait for rollout kubectl rollout status deployment/app-$TARGET --timeout=300s # Run smoke tests against standby kubectl run smoke-test --rm -it --image=curlimages/curl \ --restart=Never -- curl -f http://app-$TARGET:8080/health # Switch traffic kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}" echo "Switched to $TARGET (v$NEW_VERSION)" Instant Rollback 1 2 3 4 5 6 7 8 9 10 11 12 13 #!/bin/bash # rollback.sh - switch back to previous version CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}') if [ "$CURRENT" == "blue" ]; then PREVIOUS="green" else PREVIOUS="blue" fi kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$PREVIOUS\"}}}" echo "Rolled back to $PREVIOUS" Canary Deployments Canary deployments route a small percentage of traffic to the new version, gradually increasing if metrics look good. ...

Health Checks Done Right: Liveness, Readiness, and Startup Probes

A health check that always returns 200 OK is worse than no health check at all. It gives you false confidence while your application silently fails. Let’s build health checks that actually tell you when something’s wrong. The Three Types of Probes Kubernetes defines three probe types, each serving a distinct purpose: Liveness Probe: “Is this process stuck?” If it fails, Kubernetes kills and restarts the container. Readiness Probe: “Can this instance handle traffic?” If it fails, the instance is removed from load balancing but keeps running. ...