You test your application code. Why not your infrastructure? Here’s how to build confidence that your Terraform, Ansible, and Kubernetes configs actually work.
Why Test Infrastructure?
Infrastructure code has the same problems as application code:
- Typos break things
- Logic errors cause outages
- Refactoring introduces regressions
- “It works on my machine” applies to Terraform too

The difference: infrastructure mistakes often cost more. A bad deployment can take down production, corrupt data, or rack up cloud bills.
The Testing Pyramid for Infrastructure
The pyramid, from top to bottom:
- Production Smoke Tests
- Integration Tests
- Contract Tests
- Static Analysis / Unit Tests

Start at the bottom. Move up as you gain confidence.
Level 1: Static Analysis
Catch errors without running anything.
terraform init
terraform validate
Catches syntax errors and type mismatches. Free and fast.
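In CI it can help to turn the validation result into one line per problem. The sketch below assumes the report produced by `terraform validate -json`, with its top-level `diagnostics` list; the `summarize_validate` helper and the sample report are made up for illustration.

```python
import json


def summarize_validate(report: dict) -> list[str]:
    """Return one human-readable line per diagnostic in a validate report."""
    lines = []
    for diag in report.get("diagnostics", []):
        # "range" may be missing for diagnostics not tied to a source location.
        location = diag.get("range") or {}
        filename = location.get("filename", "<unknown>")
        line_no = location.get("start", {}).get("line", "?")
        severity = diag.get("severity", "error").upper()
        lines.append(f"{severity} {filename}:{line_no} {diag.get('summary', '')}")
    return lines


# Hypothetical sample, shaped like the output of `terraform validate -json`.
sample = json.loads("""
{
  "valid": false,
  "error_count": 1,
  "diagnostics": [
    {
      "severity": "error",
      "summary": "Unsupported argument",
      "range": {"filename": "main.tf", "start": {"line": 12, "column": 3}}
    }
  ]
}
""")

for line in summarize_validate(sample):
    print(line)
```

In a pipeline the report would come from the command's standard output, and a non-empty summary would fail the build.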
TFLint
# Install
brew install tflint
# Run
tflint --init
tflint
Catches:
- Deprecated syntax
- Invalid resource types
- Best practice violations

# TFLint catches this
resource "aws_instance" "bad" {
  instance_type = "t2.micro123" # Invalid instance type!
}
Checkov / tfsec
Security-focused static analysis:
# Checkov
pip install checkov
checkov -d .
# tfsec
brew install tfsec
tfsec .
Catches:
- Unencrypted storage
- Open security groups
- Missing logging
- Public S3 buckets

# tfsec flags this
resource "aws_s3_bucket" "bad" {
  bucket = "my-bucket"
  # Missing encryption!
  # Missing versioning!
  # Missing logging!
}
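To make the idea concrete, here is a toy version of the kind of rule such scanners apply. It is not how Checkov or tfsec are implemented. The `find_unencrypted_buckets` function and the simplified resource dictionaries are hypothetical stand-ins for parsed Terraform configuration.

```python
def find_unencrypted_buckets(resources: list[dict]) -> list[str]:
    """Return names of S3 bucket resources with no encryption configuration."""
    # Collect the buckets that an encryption configuration resource points at.
    encrypted = {
        r["bucket"]
        for r in resources
        if r["type"] == "aws_s3_bucket_server_side_encryption_configuration"
    }
    return [
        r["name"]
        for r in resources
        if r["type"] == "aws_s3_bucket" and r["name"] not in encrypted
    ]


# Simplified, hand-written stand-in for parsed configuration (an assumption).
resources = [
    {"type": "aws_s3_bucket", "name": "bad"},
    {"type": "aws_s3_bucket", "name": "good"},
    {
        "type": "aws_s3_bucket_server_side_encryption_configuration",
        "name": "good_sse",
        "bucket": "good",  # name of the bucket resource this applies to
    },
]

print(find_unencrypted_buckets(resources))
```

Real scanners ship hundreds of such rules and parse the configuration for you, which is why they are worth adopting rather than rebuilding.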
Ansible Lint
pip install ansible-lint
ansible-lint playbook.yml
Catches deprecated modules, bad practices, and style issues.
Level 2: Unit Tests
Test individual modules in isolation.
Terraform 1.6+ has native testing:
# tests/vpc.tftest.hcl
run "vpc_creates_subnets" {
  command = plan

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "Expected 3 private subnets"
  }
}

run "cidr_blocks_are_valid" {
  command = plan

  assert {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "VPC CIDR must be valid"
  }
}
Run with:

terraform test
Terratest (Go)
More powerful testing with real infrastructure:

package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
	terraformOptions := &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"vpc_cidr": "10.0.0.0/16",
			"azs":      []string{"us-east-1a", "us-east-1b"},
		},
	}

	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcId)

	subnetIds := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
	assert.Equal(t, 2, len(subnetIds))
}
Terratest actually creates resources, so use a dedicated test account.
Molecule (Ansible)
Test Ansible roles in containers:

# molecule/default/molecule.yml
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: ubuntu:22.04
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible

# molecule/default/verify.yml
- name: Verify
  hosts: all
  tasks:
    - name: Check nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present
      check_mode: true
      register: nginx_check

    - name: Assert nginx installed
      ansible.builtin.assert:
        that: not nginx_check.changed
Run with:

molecule test
Level 3: Contract Tests
Verify that modules work together correctly.
Pact for Infrastructure
Define expected outputs and verify consumers get what they expect:
# Test that the VPC module outputs what the EKS module needs.
# terraform_output() is assumed to be a project helper that returns
# a module's outputs as a dict.
def test_vpc_contract():
    vpc_outputs = terraform_output("modules/vpc")

    # EKS module requires these
    assert "vpc_id" in vpc_outputs
    assert "private_subnet_ids" in vpc_outputs
    assert len(vpc_outputs["private_subnet_ids"]) >= 2
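The contract test above calls a `terraform_output` helper without defining it. One possible sketch follows, assuming the module has already been applied and that `terraform` is on the PATH. The helper names and the sample JSON are illustrative, and `parse_terraform_output` is split out so the parsing can be checked without running Terraform.

```python
import json
import subprocess


def parse_terraform_output(raw_json: str) -> dict:
    """Flatten the JSON printed by `terraform output -json` into {name: value}."""
    parsed = json.loads(raw_json)
    # Each output is wrapped as {"value": ..., "type": ..., "sensitive": ...}.
    return {name: entry["value"] for name, entry in parsed.items()}


def terraform_output(module_dir: str) -> dict:
    """Read the outputs of an already-applied module directory."""
    result = subprocess.run(
        ["terraform", f"-chdir={module_dir}", "output", "-json"],
        check=True,           # raise if terraform exits with a non-zero status
        capture_output=True,
        text=True,
    )
    return parse_terraform_output(result.stdout)


# Hypothetical sample of what `terraform output -json` prints.
sample = """
{
  "vpc_id": {"sensitive": false, "type": "string", "value": "vpc-0abc"},
  "private_subnet_ids": {
    "sensitive": false,
    "type": ["list", "string"],
    "value": ["subnet-1", "subnet-2"]
  }
}
"""

print(parse_terraform_output(sample))
```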
Schema Validation

from jsonschema import validate

VPC_OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["vpc_id", "private_subnet_ids", "public_subnet_ids"],
    "properties": {
        "vpc_id": {"type": "string", "pattern": "^vpc-"},
        "private_subnet_ids": {
            "type": "array",
            "minItems": 2,
            "items": {"type": "string", "pattern": "^subnet-"},
        },
    },
}


def test_vpc_outputs_match_schema():
    outputs = terraform_output("modules/vpc")
    validate(instance=outputs, schema=VPC_OUTPUT_SCHEMA)
Level 4: Integration Tests
Test the full stack in a real environment.
Ephemeral Environments
Spin up complete environments for testing:
# .github/workflows/integration.yml
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Create test environment
        run: |
          cd environments/test
          terraform init
          terraform apply -auto-approve

      - name: Run integration tests
        run: |
          # Each step starts in the repository root, so point Terraform at the test environment
          export API_URL=$(terraform -chdir=environments/test output -raw api_url)
          pytest tests/integration/

      - name: Destroy test environment
        if: always()
        run: |
          cd environments/test
          terraform destroy -auto-approve
LocalStack for AWS
Test AWS infrastructure locally:

# docker-compose.yml
services:
  localstack:
    image: localstack/localstack
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3,dynamodb,sqs,lambda

# Override provider for testing
provider "aws" {
  access_key                  = "test"
  secret_key                  = "test"
  region                      = "us-east-1"
  skip_credentials_validation = true
  skip_requesting_account_id  = true

  endpoints {
    s3       = "http://localhost:4566"
    dynamodb = "http://localhost:4566"
  }
}
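Before pointing Terraform at the container, it can be worth confirming that the emulated services are actually up. The sketch below assumes LocalStack's health endpoint at `/_localstack/health` and its `services` map; the set of states treated as ready, and the `missing_services` and `fetch_health` names, are assumptions for illustration.

```python
import json
import urllib.request

# States treated as usable. This set is an assumption, adjust to your LocalStack version.
READY_STATES = {"available", "running"}


def missing_services(health: dict, required: list[str]) -> list[str]:
    """Return the required services that the health report does not show as ready."""
    services = health.get("services", {})
    return [name for name in required if services.get(name) not in READY_STATES]


def fetch_health(base_url: str = "http://localhost:4566") -> dict:
    """Fetch the health report from a running LocalStack container."""
    with urllib.request.urlopen(f"{base_url}/_localstack/health", timeout=5) as resp:
        return json.load(resp)


# Hypothetical health report, so the check can be shown without a container.
sample_health = {"services": {"s3": "running", "dynamodb": "available", "sqs": "disabled"}}

print(missing_services(sample_health, ["s3", "dynamodb", "sqs", "lambda"]))
```

A test setup step could poll `fetch_health()` until `missing_services` returns an empty list, then run `terraform apply`.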
Level 5: Production Smoke Tests
Verify production is working after deploy.
Health Checks

#!/bin/bash
# post-deploy-smoke.sh
ENDPOINTS=(
  "https://api.example.com/health"
  "https://app.example.com"
  "https://admin.example.com/health"
)

for endpoint in "${ENDPOINTS[@]}"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "$endpoint")
  if [ "$status" != "200" ]; then
    echo "FAIL: $endpoint returned $status"
    exit 1
  fi
  echo "OK: $endpoint"
done
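The same check can live alongside Python-based tests. This is a minimal sketch of the loop above rather than a drop-in replacement: `fetch_status` and `failed_endpoints` are illustrative names, and the status fetcher is passed in as a parameter so the logic can be exercised without network access. Unlike the shell script, it reports every failing endpoint instead of stopping at the first.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with a 4xx/5xx status
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def failed_endpoints(
    endpoints: list[str], fetch: Callable[[str], int] = fetch_status
) -> dict[str, int]:
    """Map each endpoint that did not return 200 to the status it returned."""
    statuses = {url: fetch(url) for url in endpoints}
    return {url: code for url, code in statuses.items() if code != 200}


# Canned responses stand in for real HTTP calls in this demonstration.
fake_statuses = {
    "https://api.example.com/health": 200,
    "https://app.example.com": 503,
}

print(failed_endpoints(list(fake_statuses), fetch=fake_statuses.get))
```

In a deploy pipeline, a non-empty result would be printed and turned into a non-zero exit code.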
Synthetic Monitoring

# Run continuously after deploy
import os

import requests

API_URL = os.environ["API_URL"]  # assumed to be provided by the environment


def test_critical_user_journey():
    # Login
    response = requests.post(f"{API_URL}/auth/login", json={
        "email": "smoke-test@example.com",
        "password": os.environ["SMOKE_TEST_PASSWORD"],
    })
    assert response.status_code == 200
    token = response.json()["token"]

    # Create order
    response = requests.post(
        f"{API_URL}/orders",
        headers={"Authorization": f"Bearer {token}"},
        json={"product_id": "test-product", "quantity": 1},
    )
    assert response.status_code == 201

    # Verify order
    order_id = response.json()["id"]
    response = requests.get(
        f"{API_URL}/orders/{order_id}",
        headers={"Authorization": f"Bearer {token}"},
    )
    assert response.status_code == 200
CI/CD Integration
GitHub Actions Example

name: Infrastructure Tests

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'ansible/**'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: TFLint
        uses: terraform-linters/setup-tflint@v4
      - run: |
          tflint --init
          tflint terraform/

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/

  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: |
          cd terraform/modules/vpc
          terraform init
          terraform test

  integration:
    needs: [lint, security, unit]
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - name: Run Terratest
        run: |
          cd test
          go test -v -timeout 30m
The Confidence Equation
Each testing level adds confidence:

| Level | Catches | Speed | Cost |
| --- | --- | --- | --- |
| Static Analysis | 60% of issues | Seconds | Free |
| Unit Tests | 25% more | Minutes | Low |
| Contract Tests | 10% more | Minutes | Low |
| Integration | 4% more | 10-30 min | Medium |
| Smoke Tests | Last 1% | Minutes | Low |
Don’t skip the cheap, fast tests. They catch most issues.
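Read cumulatively, the figures for the levels above show how quickly the cheap levels add up. A quick check of the arithmetic, using the percentages exactly as given:

```python
from itertools import accumulate

levels = ["Static Analysis", "Unit Tests", "Contract Tests", "Integration", "Smoke Tests"]
added = [60, 25, 10, 4, 1]  # share of issues each level adds, in percent

# Running total of issues caught once each level is in place.
cumulative = list(accumulate(added))

for level, total in zip(levels, cumulative):
    print(f"{level:<16} {total:>3}%")
```

On these numbers, static analysis and unit tests together already account for 85% of issues, and adding contract tests brings the total to 95%.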
Start Here
- Today: Add terraform validate to CI
- This week: Add tfsec or Checkov
- This month: Write unit tests for critical modules
- This quarter: Build ephemeral test environments

The goal: catch infrastructure bugs before they reach production. Every test you add is an outage you prevent.
The best infrastructure change is one you deployed with confidence because the tests said it would work.