Terraform State Management: Keep Your Infrastructure Sane

Terraform state is the source of truth for your infrastructure. Mess it up and you’ll be manually reconciling resources at 2 AM. Here’s how to manage state properly from day one. What Is State? Terraform state maps your configuration to real resources: 1 2 3 4 5 # main.tf resource "aws_instance" "web" { ami = "ami-12345" instance_type = "t3.micro" } 1 2 3 4 5 6 7 8 9 10 11 12 13 14 // terraform.tfstate (simplified) { "resources": [{ "type": "aws_instance", "name": "web", "instances": [{ "attributes": { "id": "i-0abc123def456", "ami": "ami-12345", "instance_type": "t3.micro" } }] }] } Without state, Terraform doesn’t know aws_instance.web corresponds to i-0abc123def456. It would try to create a new instance every time. ...

March 1, 2026 Â· 7 min Â· 1336 words Â· Rob Washington

Health Check Endpoints: More Than Just 200 OK

Every modern service needs health check endpoints. Load balancers probe them. Kubernetes uses them. Monitoring systems scrape them. But a naive implementation—returning 200 OK if the process is running—tells you almost nothing useful. Here’s how to build health checks that actually help. Two Types of Health Liveness: Is the process alive and not deadlocked? Readiness: Can this instance handle requests right now? These are different questions with different answers: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # Liveness: Am I alive? @app.get("/health/live") def liveness(): # If this returns, the process is alive return {"status": "alive"} # Readiness: Can I serve traffic? @app.get("/health/ready") def readiness(): checks = { "database": check_database(), "cache": check_cache(), "disk_space": check_disk_space(), } all_healthy = all(c["healthy"] for c in checks.values()) return JSONResponse( status_code=200 if all_healthy else 503, content={"status": "ready" if all_healthy else "not_ready", "checks": checks} ) Why separate them? ...

March 1, 2026 Â· 5 min Â· 920 words Â· Rob Washington

Database Connection Pooling: Stop Opening Connections for Every Query

Opening a database connection is expensive. TCP handshake, SSL negotiation, authentication, session setup—it all adds up. Do that for every query and your application crawls. Connection pooling fixes this by reusing connections. Here’s how to do it right. The Problem Without pooling, every request opens a new connection: 1 2 3 4 5 6 7 8 # BAD: New connection per request def get_user(user_id): conn = psycopg2.connect(DATABASE_URL) # ~50-100ms cursor = conn.cursor() cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,)) user = cursor.fetchone() conn.close() return user At 100 requests per second, that’s 100 connections opening and closing per second. Your database server has a connection limit (typically 100-500). You’ll exhaust it fast. ...

March 1, 2026 Â· 5 min Â· 1021 words Â· Rob Washington

Configuration Management Patterns for Reliable Deployments

Configuration is where deployments go to die. A typo in an environment variable, a missing secret, a config file that works in staging but breaks in production. Here’s how to make configuration boring and reliable. The Hierarchy of Configuration Not all config is created equal. Layer it: 1 2 3 4 5 . . . . . D C E C F e o n o e f n v m a a f i m t u i r a u l g o n r t n d e s f m - i e l f ( l n i l i e t n a n s e g v s c ( a f o p r l ( d e i a r e r a g u ) b s n e l t n e i v s m i e r ) o n m e n t ) S O D a R n y f E u e n e n n - a s v t o m t i i f i r m f c f o e a n c l m v h l e v e a b n e r n a t r r g c - r i e k s i d s s p d e e e s c s i f i c Later layers override earlier ones. This lets you: ...

March 1, 2026 Â· 7 min Â· 1386 words Â· Rob Washington

Rate Limiting Strategies That Protect Without Frustrating

Rate limiting is the bouncer at your API’s door. Too strict, and legitimate users get frustrated. Too loose, and one bad actor can take down your service. Here’s how to find the balance. Why Rate Limit? Without limits, a single client can: Exhaust your database connections Burn through your third-party API quotas Inflate your cloud bill Deny service to everyone else Rate limiting isn’t about being restrictive—it’s about being fair. ...

March 1, 2026 Â· 5 min Â· 1047 words Â· Rob Washington

Background Job Patterns That Actually Scale

Every production system eventually needs background jobs. Email notifications, report generation, data syncing, webhook processing—the work that can’t (or shouldn’t) happen during a user request. Here’s what I’ve learned about making them reliable. The Naive Approach (And Why It Breaks) Most developers start with something like this: 1 2 3 4 5 @app.route('/signup') def signup(): user = create_user(request.form) send_welcome_email(user) # Blocks the response return redirect('/dashboard') This works until it doesn’t. The email service has a 5-second timeout. Now your signup page feels broken. Or the email service is down, and signups fail entirely. ...

March 1, 2026 Â· 4 min Â· 831 words Â· Rob Washington

DNS Troubleshooting: When dig Is Your Best Friend

DNS issues are deceptively simple. “It’s always DNS” is a meme because it’s true. Here’s how to actually debug it. The Essential Commands dig: Your Primary Tool 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # Basic lookup dig example.com # Short answer only dig +short example.com # Specific record type dig example.com MX dig example.com TXT dig example.com CNAME # Query specific nameserver dig @8.8.8.8 example.com # Trace the full resolution path dig +trace example.com Understanding dig Output ; ; ; ; e ; ; ; ; ; e ; x ; ; ; ; x a Q a A m Q S W M U m N p u E H S E p S l e R E G S l W e r V N D T e E . y E : S i I . R c R I G O c o t : S Z N o S m i a E 9 m E . m 1 t . S . C e 9 1 E T : 2 F r 8 C I . e c . T O 2 1 b v 1 I N 3 6 d O : 8 2 : N m . 8 : s 1 5 e . 2 6 c 1 0 # : e 5 3 x 3 3 0 a 6 ( : m 0 1 0 p 0 9 0 l 2 e . E . 1 S c 6 T o I I 8 m N N . 2 1 0 . 2 1 6 ) A A 9 3 . 1 8 4 . 2 1 6 . 3 4 Key fields: ...

February 28, 2026 Â· 5 min Â· 1057 words Â· Rob Washington

Load Balancing: Beyond Round Robin

Round robin is the default, but it’s rarely the best choice. Here’s when to use each algorithm and why. The Algorithms Round Robin 1 2 3 4 5 upstream backend { server 192.168.1.1:8080; server 192.168.1.2:8080; server 192.168.1.3:8080; } Requests go 1→2→3→1→2→3. Simple, fair, ignores server load. Use when: All servers are identical and requests are uniform. Problem: A slow server gets the same traffic as a fast one. Weighted Round Robin 1 2 3 4 5 upstream backend { server 192.168.1.1:8080 weight=5; server 192.168.1.2:8080 weight=3; server 192.168.1.3:8080 weight=2; } Server 1 gets 50%, server 2 gets 30%, server 3 gets 20%. ...

February 28, 2026 Â· 4 min Â· 811 words Â· Rob Washington

Terraform Module Patterns for Reusable Infrastructure

Terraform modules turn infrastructure code from scripts into libraries. Here’s how to design them well. Module Structure m └ o ─ d ─ u l v ├ ├ ├ ├ └ e p ─ ─ ─ ─ ─ s c ─ ─ ─ ─ ─ / / m v o v R a a u e E i r t r A n i p s D . a u i M t b t o E f l s n . e . s m s t . d . f t t f f # # # # # R I O P D e n u r o s p t o c o u p v u u t u i m r t d e c v e n e a v r t s r a a i l r t a u e i b e q o l s u n e i s r e m e n t s Basic Module variables.tf 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 variable "name" { description = "Name prefix for resources" type = string } variable "cidr_block" { description = "CIDR block for VPC" type = string default = "10.0.0.0/16" } variable "azs" { description = "Availability zones" type = list(string) } variable "private_subnets" { description = "Private subnet CIDR blocks" type = list(string) default = [] } variable "public_subnets" { description = "Public subnet CIDR blocks" type = list(string) default = [] } variable "tags" { description = "Tags to apply to resources" type = map(string) default = {} } main.tf 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 resource "aws_vpc" "this" { cidr_block = var.cidr_block enable_dns_hostnames = true enable_dns_support = true tags = merge(var.tags, { Name = var.name }) } resource "aws_subnet" "private" { count = length(var.private_subnets) vpc_id = aws_vpc.this.id cidr_block = var.private_subnets[count.index] availability_zone = var.azs[count.index % length(var.azs)] tags = merge(var.tags, { Name = "${var.name}-private-${count.index + 1}" Type = "private" }) } resource "aws_subnet" "public" { count = length(var.public_subnets) vpc_id = aws_vpc.this.id cidr_block = var.public_subnets[count.index] availability_zone = var.azs[count.index % length(var.azs)] map_public_ip_on_launch = true tags = merge(var.tags, { Name = "${var.name}-public-${count.index + 1}" Type = "public" }) } outputs.tf 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 output "vpc_id" { description = "VPC ID" value = aws_vpc.this.id } output "private_subnet_ids" { description = "Private subnet IDs" value = aws_subnet.private[*].id } output "public_subnet_ids" { description = "Public subnet IDs" value = aws_subnet.public[*].id } output "vpc_cidr_block" { description = "VPC CIDR block" value = aws_vpc.this.cidr_block } versions.tf 1 2 3 4 5 6 7 8 9 10 terraform { required_version = ">= 1.0" required_providers { aws = { source = "hashicorp/aws" version = ">= 4.0" } } } Using Modules Local Module 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 module "vpc" { source = "./modules/vpc" name = "production" cidr_block = "10.0.0.0/16" azs = ["us-east-1a", "us-east-1b", "us-east-1c"] private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"] public_subnets = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"] tags = { Environment = "production" Project = "myapp" } } # Use outputs resource "aws_instance" "app" { subnet_id = module.vpc.private_subnet_ids[0] # ... } Registry Module 1 2 3 4 5 6 7 8 9 10 11 12 13 14 module "vpc" { source = "terraform-aws-modules/vpc/aws" version = "5.0.0" name = "my-vpc" cidr = "10.0.0.0/16" azs = ["us-east-1a", "us-east-1b"] private_subnets = ["10.0.1.0/24", "10.0.2.0/24"] public_subnets = ["10.0.101.0/24", "10.0.102.0/24"] enable_nat_gateway = true single_nat_gateway = true } Git Module 1 2 3 4 module "vpc" { source = "git::https://github.com/org/terraform-modules.git//vpc?ref=v1.2.0" # ... } Variable Validation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 variable "environment" { description = "Environment name" type = string validation { condition = contains(["dev", "staging", "production"], var.environment) error_message = "Environment must be dev, staging, or production." } } variable "instance_type" { description = "EC2 instance type" type = string default = "t3.micro" validation { condition = can(regex("^t3\\.", var.instance_type)) error_message = "Instance type must be t3 family." } } variable "cidr_block" { description = "CIDR block" type = string validation { condition = can(cidrhost(var.cidr_block, 0)) error_message = "Must be a valid CIDR block." } } Complex Types 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 variable "services" { description = "Service configurations" type = list(object({ name = string port = number health_path = optional(string, "/health") replicas = optional(number, 1) environment = optional(map(string), {}) })) } # Usage services = [ { name = "api" port = 8080 replicas = 3 environment = { LOG_LEVEL = "info" } }, { name = "worker" port = 9090 } ] Conditional Resources 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 variable "create_nat_gateway" { description = "Create NAT gateway" type = bool default = true } resource "aws_nat_gateway" "this" { count = var.create_nat_gateway ? 1 : 0 allocation_id = aws_eip.nat[0].id subnet_id = aws_subnet.public[0].id } resource "aws_eip" "nat" { count = var.create_nat_gateway ? 1 : 0 domain = "vpc" } Dynamic Blocks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 variable "ingress_rules" { description = "Ingress rules" type = list(object({ port = number protocol = string cidr_blocks = list(string) })) default = [] } resource "aws_security_group" "this" { name = var.name description = var.description vpc_id = var.vpc_id dynamic "ingress" { for_each = var.ingress_rules content { from_port = ingress.value.port to_port = ingress.value.port protocol = ingress.value.protocol cidr_blocks = ingress.value.cidr_blocks } } egress { from_port = 0 to_port = 0 protocol = "-1" cidr_blocks = ["0.0.0.0/0"] } } Module Composition Root Module 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 # main.tf module "vpc" { source = "./modules/vpc" # ... } module "security_groups" { source = "./modules/security-groups" vpc_id = module.vpc.vpc_id # ... } module "ecs_cluster" { source = "./modules/ecs-cluster" vpc_id = module.vpc.vpc_id private_subnet_ids = module.vpc.private_subnet_ids security_group_ids = [module.security_groups.app_sg_id] # ... } Shared Data 1 2 3 4 5 6 7 8 9 10 11 12 13 14 # data.tf - Shared data sources data "aws_caller_identity" "current" {} data "aws_region" "current" {} locals { account_id = data.aws_caller_identity.current.account_id region = data.aws_region.current.name common_tags = { Environment = var.environment Project = var.project ManagedBy = "terraform" } } For_each vs Count Count (Index-based) 1 2 3 4 5 # Problematic: Adding/removing items shifts indices resource "aws_subnet" "private" { count = length(var.private_subnets) cidr_block = var.private_subnets[count.index] } For_each (Key-based) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 # Better: Keys are stable variable "subnets" { type = map(object({ cidr_block = string az = string })) } resource "aws_subnet" "private" { for_each = var.subnets cidr_block = each.value.cidr_block availability_zone = each.value.az tags = { Name = each.key } } # Usage subnets = { "private-1a" = { cidr_block = "10.0.1.0/24", az = "us-east-1a" } "private-1b" = { cidr_block = "10.0.2.0/24", az = "us-east-1b" } } Sensitive Values 1 2 3 4 5 6 7 8 9 10 11 variable "database_password" { description = "Database password" type = string sensitive = true } output "connection_string" { description = "Database connection string" value = "postgres://user:${var.database_password}@${aws_db_instance.this.endpoint}/db" sensitive = true } Module Testing Terraform Test (1.6+) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 # tests/vpc_test.tftest.hcl run "vpc_creates_successfully" { command = plan variables { name = "test-vpc" cidr_block = "10.0.0.0/16" azs = ["us-east-1a"] } assert { condition = aws_vpc.this.cidr_block == "10.0.0.0/16" error_message = "VPC CIDR block incorrect" } } Run tests: ...

February 28, 2026 Â· 8 min Â· 1602 words Â· Rob Washington

Ansible Patterns for Maintainable Infrastructure

Ansible is simple until it isn’t. Here’s how to structure projects that remain maintainable as they grow. Project Structure a ├ ├ │ │ │ │ │ │ │ │ ├ │ │ │ ├ │ │ │ └ n ─ ─ ─ ─ ─ s ─ ─ ─ ─ ─ i b a i ├ │ │ │ │ └ p ├ ├ └ r ├ ├ └ c └ l n n ─ ─ l ─ ─ ─ o ─ ─ ─ o ─ e s v ─ ─ a ─ ─ ─ l ─ ─ ─ l ─ / i e y e l b n p ├ └ s ├ └ b s w d s c n p e r l t r ─ ─ t ─ ─ o i e a / o g o c e e o o ─ ─ a ─ ─ o t b t m i s t q . r d g k e s a m n t i u c y u h g ├ └ i h g s . e b o x g o i f / c o r ─ ─ n o r y r a n r n r g t s o ─ ─ g s m v s e s e i t u / t u l e e s / m o s p a w s p r s q e n . _ l e . _ s . l n y v l b y v . y / t m a . s m a y m s l r y e l r m l . s m r s l y / l v / m e l r s . y m l Inventory Best Practices YAML Inventory 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 # inventory/production/hosts.yml all: children: webservers: hosts: web1.example.com: web2.example.com: vars: http_port: 80 databases: hosts: db1.example.com: postgresql_version: 15 db2.example.com: postgresql_version: 14 loadbalancers: hosts: lb1.example.com: Group Variables 1 2 3 4 5 6 7 8 9 10 11 12 13 # inventory/production/group_vars/all.yml --- ansible_user: deploy ansible_python_interpreter: /usr/bin/python3 ntp_servers: - 0.pool.ntp.org - 1.pool.ntp.org common_packages: - vim - htop - curl 1 2 3 4 5 # inventory/production/group_vars/webservers.yml --- nginx_worker_processes: auto nginx_worker_connections: 1024 app_port: 3000 Role Structure r ├ │ ├ │ ├ │ ├ │ ├ │ ├ │ ├ │ └ o ─ ─ ─ ─ ─ ─ ─ ─ l ─ ─ ─ ─ ─ ─ ─ ─ e s d └ v └ t └ h └ t └ f └ m └ R / e ─ a ─ a ─ a ─ e ─ i ─ e ─ E n f ─ r ─ s ─ n ─ m ─ l ─ t ─ A g a s k d p e a D i u m / m s m l m l n s s / m M n l a a / a e a a g / s a E x t i i i r i t i l i . / s n n n s n e n - n m / . . . / . s x p . d y y y y / . a y m m m m c r m l l l l o a l n m f s . . j c 2 o n # # # # # f # D R T H J # R e o a a i o f l s n n S l a e k d j t e u s l a a l v e 2 t m t a t r i e r o s t c t v i e a a a e ( m f d r b x r p i a i l e e l l t a e c s a e a b s u t t s l t a e a e ( e r s n s h t d i ( g s d l h e e o e r p w r v e e i n s p c d t r e e i s n p o , c r r i i i e e o t t s r y c i ) . t ) y ) Role Defaults 1 2 3 4 5 6 7 # roles/nginx/defaults/main.yml --- nginx_user: www-data nginx_worker_processes: auto nginx_worker_connections: 1024 nginx_keepalive_timeout: 65 nginx_sites: [] Role Tasks 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 # roles/nginx/tasks/main.yml --- - name: Install nginx ansible.builtin.apt: name: nginx state: present update_cache: true become: true - name: Configure nginx ansible.builtin.template: src: nginx.conf.j2 dest: /etc/nginx/nginx.conf owner: root group: root mode: '0644' become: true notify: Reload nginx - name: Configure sites ansible.builtin.template: src: site.conf.j2 dest: "/etc/nginx/sites-available/{{ item.name }}" owner: root group: root mode: '0644' loop: "{{ nginx_sites }}" become: true notify: Reload nginx - name: Enable sites ansible.builtin.file: src: "/etc/nginx/sites-available/{{ item.name }}" dest: "/etc/nginx/sites-enabled/{{ item.name }}" state: link loop: "{{ nginx_sites }}" become: true notify: Reload nginx - name: Start nginx ansible.builtin.service: name: nginx state: started enabled: true become: true Handlers 1 2 3 4 5 6 7 8 9 10 11 12 13 # roles/nginx/handlers/main.yml --- - name: Reload nginx ansible.builtin.service: name: nginx state: reloaded become: true - name: Restart nginx ansible.builtin.service: name: nginx state: restarted become: true Playbook Patterns Main Playbook 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 # playbooks/site.yml --- - name: Apply common configuration hosts: all roles: - common - name: Configure web servers hosts: webservers roles: - nginx - app - name: Configure databases hosts: databases roles: - postgresql Tagged Playbook 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 # playbooks/webservers.yml --- - name: Configure web servers hosts: webservers become: true tasks: - name: Install packages ansible.builtin.apt: name: "{{ item }}" state: present loop: - nginx - python3 tags: - packages - name: Deploy application ansible.builtin.git: repo: "{{ app_repo }}" dest: /var/www/app version: "{{ app_version }}" tags: - deploy - name: Configure nginx ansible.builtin.template: src: nginx.conf.j2 dest: /etc/nginx/nginx.conf notify: Reload nginx tags: - config Run specific tags: ...

February 28, 2026 Â· 9 min Â· 1911 words Â· Rob Washington