Semantic Caching for LLM Applications

Every LLM API call costs money and takes time. When users ask variations of the same question, you’re paying for the same computation repeatedly. Semantic caching solves this by recognizing that “What’s the weather in NYC?” and “How’s the weather in New York City?” are functionally identical.

The Problem with Traditional Caching

Standard key-value caching uses exact string matching:

```python
cache_key = hash(prompt)
if cache_key in cache:
    return cache[cache_key]
```

This fails for LLM applications because: ...
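To make the idea concrete, here is a toy sketch of a semantic cache. It assumes some `embed_fn` that maps text to a vector (any embedding model would do); the function names and the 0.9 threshold are illustrative, not from the original post:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed_fn, threshold=0.9):
        self.embed = embed_fn       # maps text -> list[float]
        self.threshold = threshold  # similarity required for a hit
        self.entries = []           # (embedding, response) pairs

    def get(self, prompt):
        """Return a cached response for any semantically similar prompt."""
        query = self.embed(prompt)
        for emb, response in self.entries:
            if cosine_similarity(query, emb) >= self.threshold:
                return response     # semantic hit
        return None                 # miss: caller invokes the LLM

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

A real deployment would use an approximate-nearest-neighbor index rather than the linear scan shown here, but the lookup logic is the same.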

February 14, 2026 · 4 min · 830 words · Rob Washington

DNS Management: Infrastructure as Code for Your Domain Records

DNS is the foundation everything else depends on. A misconfigured record can take down your entire infrastructure. Yet DNS is often managed through web consoles with no version control, no review process, and no automation. Let’s fix that.

Terraform for DNS

Route53 Basics

```hcl
# dns.tf
resource "aws_route53_zone" "main" {
  name = "example.com"

  tags = {
    Environment = "production"
  }
}

# A record
resource "aws_route53_record" "www" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "www.example.com"
  type    = "A"
  ttl     = 300
  records = ["203.0.113.10"]
}

# CNAME record
resource "aws_route53_record" "app" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "app.example.com"
  type    = "CNAME"
  ttl     = 300
  records = ["app-lb-123456.us-east-1.elb.amazonaws.com"]
}

# Alias to ALB (no TTL, resolved at edge)
resource "aws_route53_record" "api" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.api.dns_name
    zone_id                = aws_lb.api.zone_id
    evaluate_target_health = true
  }
}

# MX records
resource "aws_route53_record" "mx" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "example.com"
  type    = "MX"
  ttl     = 3600
  records = [
    "10 mail1.example.com",
    "20 mail2.example.com"
  ]
}

# TXT for SPF
resource "aws_route53_record" "spf" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "example.com"
  type    = "TXT"
  ttl     = 3600
  records = ["v=spf1 include:_spf.google.com ~all"]
}
```

Dynamic Records from Infrastructure

```hcl
# Generate records from other resources
locals {
  services = {
    "api"   = aws_lb.api.dns_name
    "admin" = aws_lb.admin.dns_name
    "docs"  = aws_cloudfront_distribution.docs.domain_name
  }
}

resource "aws_route53_record" "services" {
  for_each = local.services

  zone_id = aws_route53_zone.main.zone_id
  name    = "${each.key}.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [each.value]
}

# From Kubernetes ingresses
data "kubernetes_ingress_v1" "all" {
  for_each = toset(["api", "web", "admin"])

  metadata {
    name      = each.key
    namespace = "production"
  }
}

resource "aws_route53_record" "k8s_services" {
  for_each = data.kubernetes_ingress_v1.all

  zone_id = aws_route53_zone.main.zone_id
  name    = "${each.key}.example.com"
  type    = "CNAME"
  ttl     = 300
  records = [each.value.status[0].load_balancer[0].ingress[0].hostname]
}
```

DNS Failover

Health Check Based Routing

```hcl
# Health check for primary
resource "aws_route53_health_check" "primary" {
  fqdn              = "primary-api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30

  tags = {
    Name = "primary-api-health"
  }
}

# Health check for secondary
resource "aws_route53_health_check" "secondary" {
  fqdn              = "secondary-api.example.com"
  port              = 443
  type              = "HTTPS"
  resource_path     = "/health"
  failure_threshold = 3
  request_interval  = 30

  tags = {
    Name = "secondary-api-health"
  }
}

# Primary record with failover
resource "aws_route53_record" "api_primary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60
  records = ["203.0.113.10"]

  set_identifier  = "primary"
  health_check_id = aws_route53_health_check.primary.id

  failover_routing_policy {
    type = "PRIMARY"
  }
}

# Secondary record (used when primary fails)
resource "aws_route53_record" "api_secondary" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60
  records = ["203.0.113.20"]

  set_identifier  = "secondary"
  health_check_id = aws_route53_health_check.secondary.id

  failover_routing_policy {
    type = "SECONDARY"
  }
}
```

Weighted Routing for Gradual Migration

```hcl
# 90% to current, 10% to new
resource "aws_route53_record" "api_current" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60
  records = ["203.0.113.10"]

  set_identifier = "current"

  weighted_routing_policy {
    weight = 90
  }
}

resource "aws_route53_record" "api_new" {
  zone_id = aws_route53_zone.main.zone_id
  name    = "api.example.com"
  type    = "A"
  ttl     = 60
  records = ["203.0.113.20"]

  set_identifier = "new"

  weighted_routing_policy {
    weight = 10
  }
}
```

External DNS for Kubernetes

Automatically create DNS records from Kubernetes resources. ...

February 12, 2026 · 9 min · 1841 words · Rob Washington

Infrastructure Drift Detection: Keeping Your Terraform in Sync with Reality

You define infrastructure in Terraform. Someone clicks around in the AWS console. Now your code says one thing and reality says another. This is drift, and it will bite you when you least expect it.

What Causes Drift?

- Manual changes — “Just this once” console modifications
- Emergency fixes — Hotfixes applied directly to production
- Other tools — Scripts, CLI commands, other IaC tools
- AWS itself — Auto-scaling, managed service updates
- Incomplete imports — Resources created outside Terraform

Detecting Drift

Basic: Terraform Plan

```bash
#!/bin/bash
# drift-check.sh
set -e

# Initialize without backend changes
terraform init -input=false

# Generate plan. -detailed-exitcode returns 2 when changes exist, so
# disable set -e around the plan and read terraform's exit code from
# PIPESTATUS (plain $? would give us tee's exit code instead).
set +e
terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
EXIT_CODE=${PIPESTATUS[0]}
set -e

if [ $EXIT_CODE -eq 0 ]; then
    echo "✅ No drift detected"
elif [ $EXIT_CODE -eq 2 ]; then
    echo "⚠️ Drift detected! Changes needed:"
    terraform show tfplan
    exit 1
else
    echo "❌ Error running plan"
    exit 1
fi
```

Automated Drift Detection

```python
# drift_detector.py
import subprocess
import json
import os
from datetime import datetime
from typing import List, Dict

class DriftDetector:
    def __init__(self, workspace_dir: str):
        self.workspace_dir = workspace_dir

    def check_drift(self) -> Dict:
        """Run terraform plan and parse results."""
        os.chdir(self.workspace_dir)

        # Initialize
        subprocess.run(
            ["terraform", "init", "-input=false"],
            capture_output=True, check=True
        )

        # Plan with JSON output
        result = subprocess.run(
            ["terraform", "plan", "-json", "-out=tfplan"],
            capture_output=True, text=True
        )

        changes = self._parse_plan_output(result.stdout)

        return {
            "workspace": self.workspace_dir,
            "timestamp": datetime.utcnow().isoformat(),
            "has_drift": len(changes) > 0,
            "changes": changes
        }

    def _parse_plan_output(self, output: str) -> List[Dict]:
        """Parse terraform plan JSON output."""
        changes = []
        for line in output.strip().split('\n'):
            try:
                entry = json.loads(line)
                if entry.get('@level') == 'info':
                    msg = entry.get('@message', '')
                    # Look for resource changes
                    if 'will be' in msg:
                        changes.append({
                            'message': msg,
                            'type': entry.get('type', 'unknown'),
                            'change': entry.get('change', {})
                        })
            except json.JSONDecodeError:
                continue
        return changes


def check_all_workspaces(workspaces: List[str]) -> Dict:
    """Check drift across multiple Terraform workspaces."""
    results = []
    for workspace in workspaces:
        detector = DriftDetector(workspace)
        try:
            result = detector.check_drift()
            results.append(result)
        except Exception as e:
            results.append({
                "workspace": workspace,
                "error": str(e),
                "has_drift": None
            })

    drifted = [r for r in results if r.get('has_drift')]

    return {
        "total_workspaces": len(workspaces),
        "drifted_count": len(drifted),
        "results": results
    }


# Usage
workspaces = [
    "/infrastructure/terraform/networking",
    "/infrastructure/terraform/compute",
    "/infrastructure/terraform/databases",
]

report = check_all_workspaces(workspaces)
print(json.dumps(report, indent=2))
```

CI Pipeline for Drift Detection

````yaml
# .github/workflows/drift-detection.yml
name: Drift Detection

on:
  schedule:
    - cron: '0 */6 * * *'  # Every 6 hours
  workflow_dispatch:

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        workspace:
          - networking
          - compute
          - databases
          - monitoring

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.7.0

      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-east-1

      - name: Check Drift
        id: drift
        working-directory: terraform/${{ matrix.workspace }}
        run: |
          terraform init -input=false
          set +e
          terraform plan -detailed-exitcode -out=tfplan 2>&1 | tee plan.txt
          EXIT_CODE=${PIPESTATUS[0]}
          set -e
          if [ $EXIT_CODE -eq 2 ]; then
            echo "drift_detected=true" >> $GITHUB_OUTPUT
            echo "### ⚠️ Drift Detected in ${{ matrix.workspace }}" >> $GITHUB_STEP_SUMMARY
            echo '```' >> $GITHUB_STEP_SUMMARY
            terraform show -no-color tfplan >> $GITHUB_STEP_SUMMARY
            echo '```' >> $GITHUB_STEP_SUMMARY
          else
            echo "drift_detected=false" >> $GITHUB_OUTPUT
            echo "### ✅ No drift in ${{ matrix.workspace }}" >> $GITHUB_STEP_SUMMARY
          fi

      - name: Alert on Drift
        if: steps.drift.outputs.drift_detected == 'true'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Infrastructure drift detected in ${{ matrix.workspace }}",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "*Infrastructure Drift Detected*\n\nWorkspace: `${{ matrix.workspace }}`\nRun: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
````

Continuous Reconciliation

Atlantis for GitOps

```yaml
# atlantis.yaml
version: 3
projects:
  - name: networking
    dir: terraform/networking
    autoplan:
      when_modified:
        - "*.tf"
        - "*.tfvars"
      enabled: true
    apply_requirements:
      - approved
      - mergeable
  - name: compute
    dir: terraform/compute
    autoplan:
      enabled: true
    apply_requirements:
      - approved
```

Drift Auto-Remediation

```python
# drift_remediation.py
import subprocess
import json
from datetime import datetime
from enum import Enum
from typing import Optional

class RemediationStrategy(Enum):
    ALERT_ONLY = "alert"      # Just notify
    AUTO_PLAN = "plan"        # Create PR with plan
    AUTO_APPLY = "apply"      # Automatically apply (dangerous!)
    REFRESH_ONLY = "refresh"  # Update state to match reality

class DriftRemediator:
    def __init__(self, workspace: str, strategy: RemediationStrategy):
        self.workspace = workspace
        self.strategy = strategy

    def remediate(self, drift_report: dict) -> dict:
        """Remediate detected drift based on strategy."""
        if not drift_report.get('has_drift'):
            return {"action": "none", "reason": "no drift"}

        if self.strategy == RemediationStrategy.ALERT_ONLY:
            return self._alert(drift_report)
        elif self.strategy == RemediationStrategy.REFRESH_ONLY:
            return self._refresh_state()
        elif self.strategy == RemediationStrategy.AUTO_PLAN:
            return self._create_remediation_pr(drift_report)
        elif self.strategy == RemediationStrategy.AUTO_APPLY:
            return self._auto_apply()

        return {"action": "unknown", "error": "invalid strategy"}

    def _alert(self, drift_report: dict) -> dict:
        """Send alert without taking action."""
        # Send to Slack, PagerDuty, etc.
        return {
            "action": "alert",
            "changes": len(drift_report['changes'])
        }

    def _refresh_state(self) -> dict:
        """Refresh state to match reality (accept drift)."""
        result = subprocess.run(
            ["terraform", "apply", "-refresh-only", "-auto-approve"],
            capture_output=True, text=True,
            cwd=self.workspace
        )
        return {
            "action": "refresh",
            "success": result.returncode == 0,
            "output": result.stdout
        }

    def _create_remediation_pr(self, drift_report: dict) -> dict:
        """Create a PR to remediate drift."""
        # Generate plan
        subprocess.run(
            ["terraform", "plan", "-out=remediation.tfplan"],
            cwd=self.workspace
        )

        # Create branch and PR (pseudo-code)
        branch = f"fix/drift-{self.workspace}-{datetime.now().strftime('%Y%m%d')}"
        # In practice, you'd use GitHub API or similar

        return {
            "action": "pr_created",
            "branch": branch,
            "changes": len(drift_report['changes'])
        }

    def _auto_apply(self) -> dict:
        """Automatically apply to fix drift. USE WITH CAUTION."""
        result = subprocess.run(
            ["terraform", "apply", "-auto-approve"],
            capture_output=True, text=True,
            cwd=self.workspace
        )
        return {
            "action": "auto_apply",
            "success": result.returncode == 0,
            "output": result.stdout if result.returncode == 0 else result.stderr
        }


# Configuration per workspace
WORKSPACE_STRATEGIES = {
    "networking": RemediationStrategy.ALERT_ONLY,    # Critical, manual only
    "compute": RemediationStrategy.AUTO_PLAN,        # Create PRs
    "monitoring": RemediationStrategy.REFRESH_ONLY,  # Accept drift
    "dev-sandbox": RemediationStrategy.AUTO_APPLY,   # Auto-fix okay
}
```

State Locking and Protection

Prevent Manual Changes

```hcl
# terraform.tf
terraform {
  backend "s3" {
    bucket         = "terraform-state-prod"
    key            = "networking/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"
    encrypt        = true
  }
}

# Prevent destroy of critical resources
resource "aws_db_instance" "main" {
  # ...

  lifecycle {
    prevent_destroy = true
  }
}
```

AWS Service Control Policies

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTerraformTag",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "ec2:CreateVpc",
        "rds:CreateDBInstance"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/ManagedBy": "true"
        }
      }
    },
    {
      "Sid": "PreventManualModification",
      "Effect": "Deny",
      "Action": [
        "ec2:ModifyInstanceAttribute",
        "ec2:ModifyVpcAttribute"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:ResourceTag/ManagedBy": "terraform"
        },
        "StringNotEquals": {
          "aws:PrincipalTag/Role": "terraform-automation"
        }
      }
    }
  ]
}
```

Drift Metrics

```python
# drift_metrics.py
from prometheus_client import Counter, Gauge, Histogram

# Metrics
drift_detected = Counter(
    'terraform_drift_detected_total',
    'Number of drift detection events',
    ['workspace', 'severity']
)

drift_resources = Gauge(
    'terraform_drift_resources',
    'Number of resources with drift',
    ['workspace']
)

drift_check_duration = Histogram(
    'terraform_drift_check_seconds',
    'Time to check for drift',
    ['workspace']
)

def record_drift_check(workspace: str, result: dict, duration: float):
    """Record drift check metrics."""
    drift_check_duration.labels(workspace=workspace).observe(duration)

    if result['has_drift']:
        drift_detected.labels(
            workspace=workspace,
            severity='warning' if len(result['changes']) < 5 else 'critical'
        ).inc()
        drift_resources.labels(workspace=workspace).set(len(result['changes']))
    else:
        drift_resources.labels(workspace=workspace).set(0)
```

Best Practices

- Check drift frequently — At least daily, ideally every few hours
- Alert immediately — Drift compounds; catch it early
- Lock down production — Use SCPs to prevent manual changes
- Tag everything — ManagedBy: terraform enables enforcement
- Use workspaces — Isolate environments, check each independently
- Automate remediation — PRs for drift fixes, not manual applies
- Track metrics — Know your drift frequency and sources

Quick Start Checklist

- Set up scheduled drift detection (GitHub Actions, Jenkins, etc.)
- Configure alerts for detected drift
- Add ManagedBy tags to all Terraform resources
- Implement SCPs to prevent manual changes in production
- Create runbook for drift remediation
- Track drift metrics over time

Drift isn’t just annoying—it’s a reliability risk. Your next terraform apply might have unintended consequences because the state doesn’t match reality. Detect it early, fix it fast, and prevent it where possible. ...
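Triage matters as much as detection: a drifted tag is not a drifted database. As a sketch (assuming the `resource_changes` shape that `terraform show -json tfplan` emits, where each change carries an `address` and a `change.actions` list), drift can be grouped by the riskiest planned action:

```python
# Risk ranking for triage: destroys and replaces outrank updates and creates.
RISK_ORDER = {"delete": 3, "replace": 3, "update": 2, "create": 1}

def summarize_drift(resource_changes):
    """Group plan resource changes by action and flag the riskiest kind.

    Follows the `terraform show -json` plan format: Terraform encodes a
    replacement as the action pair ["delete", "create"] (or reversed).
    """
    summary = {"create": [], "update": [], "delete": [], "replace": []}
    for rc in resource_changes:
        actions = rc.get("change", {}).get("actions", [])
        if sorted(actions) == ["create", "delete"]:
            summary["replace"].append(rc["address"])
        else:
            for action in actions:
                if action in summary:
                    summary[action].append(rc["address"])
    # Riskiest non-empty category, or None when nothing changed.
    worst = max(
        (k for k in summary if summary[k]),
        key=lambda k: RISK_ORDER[k],
        default=None,
    )
    return summary, worst
```

A severity like this can feed the `severity` label on the Prometheus counter above instead of a raw change count.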

February 11, 2026 · 8 min · 1606 words · Rob Washington

Testing Infrastructure: From Terraform Plans to Production Validation

Infrastructure code deserves the same testing rigor as application code. A typo in Terraform can delete a database. An untested Ansible role can break production. Let’s build confidence with proper testing.

The Testing Pyramid for Infrastructure

(Figure: the testing pyramid — end-to-end tests at the top, integration tests against real resources in the middle, and unit tests — static analysis and plan validation — at the base.)

Unit Testing: Static Analysis

Terraform Validation

```bash
# Built-in validation
terraform init
terraform validate
terraform fmt -check

# Custom validation rules
terraform plan -out=tfplan
terraform show -json tfplan > plan.json
```

```python
# tests/test_terraform_plan.py
import json
import pytest

@pytest.fixture
def plan():
    with open('plan.json') as f:
        return json.load(f)

def test_no_resources_destroyed(plan):
    """Ensure no resources are being destroyed."""
    changes = plan.get('resource_changes', [])
    destroyed = [c for c in changes
                 if 'delete' in c.get('change', {}).get('actions', [])]
    assert len(destroyed) == 0, \
        f"Resources being destroyed: {[d['address'] for d in destroyed]}"

def test_no_public_s3_buckets(plan):
    """Ensure S3 buckets aren't public."""
    changes = plan.get('resource_changes', [])
    for change in changes:
        if change['type'] == 'aws_s3_bucket':
            after = change.get('change', {}).get('after', {})
            acl = after.get('acl', 'private')
            assert acl == 'private', \
                f"Bucket {change['address']} has public ACL: {acl}"

def test_instances_have_tags(plan):
    """Ensure EC2 instances have required tags."""
    required_tags = {'Environment', 'Owner', 'Project'}
    changes = plan.get('resource_changes', [])
    for change in changes:
        if change['type'] == 'aws_instance':
            after = change.get('change', {}).get('after', {})
            tags = set(after.get('tags', {}).keys())
            missing = required_tags - tags
            assert not missing, \
                f"Instance {change['address']} missing tags: {missing}"
```

Policy as Code with OPA

```rego
# policy/terraform.rego
package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
  resource.change.after.from_port == 22
  msg := sprintf("SSH open to world in %v", [resource.address])
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_db_instance"
  resource.change.after.publicly_accessible == true
  msg := sprintf("RDS instance %v is publicly accessible", [resource.address])
}
```

```bash
# Run OPA checks
terraform show -json tfplan | opa eval -i - -d policy/ "data.terraform.deny"
```

Integration Testing with Terratest

Terratest deploys real infrastructure, validates it, then tears it down: ...
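If you would rather not run a separate OPA step, the same kind of rule can live alongside the pytest checks. A sketch (the helper name is ours; the plan shape matches the earlier tests) mirroring the SSH-open-to-world deny rule:

```python
def find_open_ssh(plan):
    """Return addresses of security group rules that open port 22 to the world.

    Mirrors an OPA-style deny rule in plain Python, using the
    `terraform show -json` plan shape (resource_changes / change.after).
    """
    offenders = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "aws_security_group_rule":
            continue
        after = change.get("change", {}).get("after", {}) or {}
        open_world = "0.0.0.0/0" in (after.get("cidr_blocks") or [])
        if open_world and after.get("from_port") == 22:
            offenders.append(change["address"])
    return offenders
```

In a test you would simply `assert not find_open_ssh(plan)` so the failure message lists every offending rule.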

February 11, 2026 · 7 min · 1358 words · Rob Washington

GitOps with ArgoCD: Your Infrastructure as Code, Actually

GitOps takes “Infrastructure as Code” literally: your Git repository becomes the single source of truth for what should be running. ArgoCD watches your repo and automatically synchronizes your cluster to match. No more kubectl apply from laptops, no more “what’s actually deployed?” mysteries.

GitOps Principles

- Declarative: Describe the desired state, not the steps to get there
- Versioned: All changes go through Git (audit trail, rollback)
- Automated: Changes are applied automatically when Git changes
- Self-healing: Drift from desired state is automatically corrected

Installing ArgoCD

```bash
# Create namespace
kubectl create namespace argocd

# Install ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Wait for pods
kubectl wait --for=condition=Ready pods --all -n argocd --timeout=300s

# Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

# Port forward to access UI
kubectl port-forward svc/argocd-server -n argocd 8080:443
```

Access the UI at https://localhost:8080 with username admin. ...
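The self-healing principle is easiest to see as a reconcile loop: diff the desired state (Git) against the live state (the cluster) and correct the difference. A toy sketch of that idea, not ArgoCD's actual implementation:

```python
def reconcile(desired, live, apply_fn, delete_fn):
    """One pass of a GitOps-style reconcile loop.

    `desired` and `live` map resource names to their specs. Anything
    missing or different in `live` is (re)applied; anything running in
    the cluster but absent from Git is pruned.
    """
    actions = []
    for name, spec in desired.items():
        if live.get(name) != spec:
            apply_fn(name, spec)         # create or correct drift
            actions.append(("apply", name))
    for name in live:
        if name not in desired:
            delete_fn(name)              # prune resources removed from Git
            actions.append(("prune", name))
    return actions
```

ArgoCD runs essentially this comparison continuously, which is why a manual `kubectl edit` gets reverted on the next sync.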

February 11, 2026 · 8 min · 1579 words · Rob Washington

Home Automation for Developers: Beyond Smart Plugs

A developer’s guide to home automation — from simple scripts to full infrastructure, with patterns that actually work.

February 10, 2026 · 8 min · 1515 words · Rob Washington

Secrets Management: Stop Hardcoding Your API Keys

A practical guide to managing secrets in modern applications — from environment variables to HashiCorp Vault, with patterns that actually work.

February 10, 2026 · 5 min · 977 words · Rob Washington

Monitoring vs Observability: What's the Difference and Why It Matters

Monitoring tells you when something is wrong. Observability helps you understand why. Here’s how to build systems you can actually debug.

February 10, 2026 · 5 min · 1058 words · Rob Washington

Ansible: Configuration Management Without the Headache

Why Ansible became my go-to tool for managing servers, and how you can start using it today.

February 10, 2026 · 4 min · 810 words · Rob Washington

Infrastructure as Code: Why I Stopped Clicking Around in AWS Console

How Infrastructure as Code transformed my deployments from anxiety-inducing clicking sessions to repeatable, version-controlled confidence.

February 10, 2026 · 4 min · 710 words · Rob Washington