You test your application code. Why not your infrastructure? Here’s how to build confidence that your Terraform, Ansible, and Kubernetes configs actually work.

Why Test Infrastructure?

Infrastructure code has the same problems as application code:

  • Typos break things
  • Logic errors cause outages
  • Refactoring introduces regressions
  • “It works on my machine” applies to Terraform too

The difference: infrastructure mistakes often cost more. A bad deployment can take down production, corrupt data, or rack up cloud bills.

The Testing Pyramid for Infrastructure

   Production Smoke Tests
     Integration Tests
       Contract Tests
Static Analysis / Unit Tests

Start at the bottom. Move up as you gain confidence.

Level 1: Static Analysis

Catch errors without running anything.

Terraform Validate

terraform init
terraform validate

Catches syntax errors and type mismatches. Free and fast.
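The `-json` flag makes validate output machine-readable, which is handy for CI gates. A minimal sketch of parsing that report into a one-line summary; the field names mirror Terraform's JSON output format, but treat the exact shape as an assumption:

```python
import json
import subprocess

def run_validate(directory: str) -> dict:
    """Run `terraform validate -json` and parse the report (requires terraform)."""
    result = subprocess.run(
        ["terraform", "validate", "-json"],
        cwd=directory, capture_output=True, text=True,
    )
    return json.loads(result.stdout)

def summarize(report: dict) -> str:
    """Reduce a validate report to a one-line summary for CI logs."""
    if report.get("valid"):
        return "OK: configuration is valid"
    issues = [
        f"{d.get('severity', 'error')}: {d.get('summary', '')}"
        for d in report.get("diagnostics", [])
    ]
    return f"FAIL ({report.get('error_count', 0)} errors): " + "; ".join(issues)

# A report shaped like `terraform validate -json` output
sample = {
    "valid": False,
    "error_count": 1,
    "diagnostics": [{"severity": "error", "summary": "Unsupported argument"}],
}
print(summarize(sample))
```

Fail the pipeline whenever the summary starts with `FAIL`.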

TFLint

# Install
brew install tflint

# Run
tflint --init
tflint

Catches:

  • Deprecated syntax
  • Invalid resource types
  • Best practice violations
# TFLint catches this
resource "aws_instance" "bad" {
  instance_type = "t2.micro123"  # Invalid instance type!
}

Checkov / tfsec

Security-focused static analysis:

# Checkov
pip install checkov
checkov -d .

# tfsec
brew install tfsec
tfsec .

Catches:

  • Unencrypted storage
  • Open security groups
  • Missing logging
  • Public S3 buckets
# tfsec flags this
resource "aws_s3_bucket" "bad" {
  bucket = "my-bucket"
  # Missing encryption!
  # Missing versioning!
  # Missing logging!
}

Ansible Lint

pip install ansible-lint
ansible-lint playbook.yml

Catches deprecated modules, bad practices, and style issues.

Level 2: Unit Tests

Test individual modules in isolation.

Terraform Test (Built-in)

Terraform 1.6+ has native testing:

# tests/vpc.tftest.hcl
run "vpc_creates_subnets" {
  command = plan

  assert {
    condition     = length(aws_subnet.private) == 3
    error_message = "Expected 3 private subnets"
  }
}

run "cidr_blocks_are_valid" {
  command = plan

  assert {
    condition     = can(cidrhost(var.vpc_cidr, 0))
    error_message = "VPC CIDR must be valid"
  }
}

Run with:

terraform test

Terratest (Go)

More powerful testing with real infrastructure:

package test

import (
    "testing"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "vpc_cidr": "10.0.0.0/16",
            "azs":      []string{"us-east-1a", "us-east-1b"},
        },
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)

    subnetIds := terraform.OutputList(t, terraformOptions, "private_subnet_ids")
    assert.Equal(t, 2, len(subnetIds))
}

Terratest actually creates resources, so use a dedicated test account.

Molecule (Ansible)

Test Ansible roles in containers:

# molecule/default/molecule.yml
dependency:
  name: galaxy
driver:
  name: docker
platforms:
  - name: instance
    image: ubuntu:22.04
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible
# molecule/default/verify.yml
- name: Verify
  hosts: all
  tasks:
    - name: Check nginx is installed
      ansible.builtin.package:
        name: nginx
        state: present
      check_mode: true
      register: nginx_check
      
    - name: Assert nginx installed
      ansible.builtin.assert:
        that: not nginx_check.changed

Run with:

molecule test

Level 3: Contract Tests

Verify that modules work together correctly.

Pact for Infrastructure

Define expected outputs and verify consumers get what they expect:

# Test that the VPC module outputs what the EKS module needs
def test_vpc_contract():
    vpc_outputs = terraform_output("modules/vpc")
    
    # EKS module requires these
    assert "vpc_id" in vpc_outputs
    assert "private_subnet_ids" in vpc_outputs
    assert len(vpc_outputs["private_subnet_ids"]) >= 2
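The `terraform_output` helper in these snippets is assumed rather than shown; one way to implement it is to shell out to `terraform output -json` (which nests each value under a `value` key) and flatten the result. A sketch, assuming the module directory has already been applied:

```python
import json
import subprocess

def parse_outputs(raw: str) -> dict:
    """Flatten `terraform output -json` into a plain {name: value} dict."""
    return {name: entry["value"] for name, entry in json.loads(raw).items()}

def terraform_output(directory: str) -> dict:
    """Read the outputs of an already-applied Terraform directory."""
    result = subprocess.run(
        ["terraform", "output", "-json"],
        cwd=directory, capture_output=True, text=True, check=True,
    )
    return parse_outputs(result.stdout)

# Example of the nested shape terraform emits, flattened for assertions
raw = json.dumps({
    "vpc_id": {"sensitive": False, "type": "string", "value": "vpc-123"},
    "private_subnet_ids": {
        "sensitive": False,
        "type": ["list", "string"],
        "value": ["subnet-a", "subnet-b"],
    },
})
outputs = parse_outputs(raw)
```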

Schema Validation

from jsonschema import validate

VPC_OUTPUT_SCHEMA = {
    "type": "object",
    "required": ["vpc_id", "private_subnet_ids", "public_subnet_ids"],
    "properties": {
        "vpc_id": {"type": "string", "pattern": "^vpc-"},
        "private_subnet_ids": {
            "type": "array",
            "minItems": 2,
            "items": {"type": "string", "pattern": "^subnet-"}
        }
    }
}

def test_vpc_outputs_match_schema():
    outputs = terraform_output("modules/vpc")
    validate(instance=outputs, schema=VPC_OUTPUT_SCHEMA)

Level 4: Integration Tests

Test the full stack in a real environment.

Ephemeral Environments

Spin up complete environments for testing:

# .github/workflows/integration.yml
jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        
      - name: Create test environment
        run: |
          cd environments/test
          terraform init
          terraform apply -auto-approve
          
      - name: Run integration tests
        run: |
          export API_URL=$(cd environments/test && terraform output -raw api_url)
          pytest tests/integration/
          
      - name: Destroy test environment
        if: always()
        run: |
          cd environments/test
          terraform destroy -auto-approve

Localstack for AWS

Test AWS infrastructure locally:

# docker-compose.yml
services:
  localstack:
    image: localstack/localstack
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3,dynamodb,sqs,lambda
# Override provider for testing
provider "aws" {
  access_key                  = "test"
  secret_key                  = "test"
  region                      = "us-east-1"
  skip_credentials_validation = true
  skip_requesting_account_id  = true

  endpoints {
    s3       = "http://localhost:4566"
    dynamodb = "http://localhost:4566"
  }
}

Level 5: Production Smoke Tests

Verify production is working after deploy.

Health Checks

#!/bin/bash
# post-deploy-smoke.sh

ENDPOINTS=(
  "https://api.example.com/health"
  "https://app.example.com"
  "https://admin.example.com/health"
)

for endpoint in "${ENDPOINTS[@]}"; do
  status=$(curl -s -o /dev/null -w "%{http_code}" "$endpoint")
  if [ "$status" != "200" ]; then
    echo "FAIL: $endpoint returned $status"
    exit 1
  fi
  echo "OK: $endpoint"
done
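Endpoints can flap for a few seconds right after a deploy, so one-shot checks like the script above sometimes raise false alarms. A sketch of the same check with retries; the HTTP call is injected as a callable (e.g. a lambda around `requests.get`) so the retry logic stays testable:

```python
import time

def check_with_retry(probe, attempts=3, delay=2.0):
    """Call probe() until it returns HTTP 200 or attempts are exhausted.

    probe: zero-argument callable returning a status code, for example
    lambda: requests.get("https://api.example.com/health").status_code
    """
    for attempt in range(attempts):
        if probe() == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # back off before retrying
    return False

# Simulate an endpoint that is briefly unhealthy right after a deploy
statuses = iter([503, 503, 200])
ok = check_with_retry(lambda: next(statuses), attempts=3, delay=0)
print("OK" if ok else "FAIL")
```

A check that fails all attempts should still exit non-zero, exactly like the shell version.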

Synthetic Monitoring

# Run continuously after deploy
import os
import requests

API_URL = os.environ["API_URL"]  # e.g. set by the deploy pipeline

def test_critical_user_journey():
    # Login
    response = requests.post(f"{API_URL}/auth/login", json={
        "email": "smoke-test@example.com",
        "password": os.environ["SMOKE_TEST_PASSWORD"]
    })
    assert response.status_code == 200
    token = response.json()["token"]
    
    # Create order
    response = requests.post(f"{API_URL}/orders", 
        headers={"Authorization": f"Bearer {token}"},
        json={"product_id": "test-product", "quantity": 1}
    )
    assert response.status_code == 201
    
    # Verify order
    order_id = response.json()["id"]
    response = requests.get(f"{API_URL}/orders/{order_id}",
        headers={"Authorization": f"Bearer {token}"})
    assert response.status_code == 200

CI/CD Integration

GitHub Actions Example

name: Infrastructure Tests

on:
  pull_request:
    paths:
      - 'terraform/**'
      - 'ansible/**'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: TFLint
        uses: terraform-linters/setup-tflint@v4
      - run: |
          cd terraform/
          tflint --init
          tflint

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/

  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: |
          cd terraform/modules/vpc
          terraform init
          terraform test

  integration:
    needs: [lint, security, unit]
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - name: Run Terratest
        run: |
          cd test
          go test -v -timeout 30m

The Confidence Equation

Each testing level adds confidence:

Level             Catches          Speed      Cost
Static Analysis   60% of issues    Seconds    Free
Unit Tests        25% more         Minutes    Low
Contract Tests    10% more         Minutes    Low
Integration       4% more          10-30 min  Medium
Smoke Tests       Last 1%          Minutes    Low
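Taking the table's illustrative percentages at face value, coverage compounds by simple addition as you add levels; a quick tally:

```python
# Illustrative catch rates from the table above, as percentages of all issues
levels = [
    ("Static Analysis", 60),
    ("Unit Tests", 25),
    ("Contract Tests", 10),
    ("Integration", 4),
    ("Smoke Tests", 1),
]

cumulative = 0
for name, pct in levels:
    cumulative += pct
    print(f"through {name}: {cumulative}% of issues caught")
```

Static analysis plus unit tests alone account for most of the total, which is why the cheap levels come first.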

Don’t skip the cheap, fast tests. They catch most issues.

Start Here

  1. Today: Add terraform validate to CI
  2. This week: Add tfsec or Checkov
  3. This month: Write unit tests for critical modules
  4. This quarter: Build ephemeral test environments

The goal: catch infrastructure bugs before they reach production. Every test you add is an outage you prevent.


The best infrastructure change is one you deployed with confidence because the tests said it would work.