You test your application code. Why not your infrastructure code?

Infrastructure as Code (IaC) has the same failure modes as any software: bugs, regressions, unintended side effects. Yet most teams treat Terraform and Ansible like configuration files rather than code that deserves tests.

Why Infrastructure Testing Matters

A Terraform plan looks correct until it:

  • Creates a security group that’s too permissive
  • Deploys to the wrong availability zone
  • Sets instance types that exceed your budget
  • Breaks networking in ways that only manifest at runtime

Manual review catches some issues. Automated testing catches more.

Testing Levels

Static Analysis

Catch problems before terraform plan:

# Terraform validate - syntax and internal consistency
terraform validate

# tflint - best practices and cloud-specific rules
tflint --recursive

# checkov - security and compliance scanning
checkov -d .

# tfsec - security-focused static analysis
tfsec .

Example tflint rule:

# .tflint.hcl
rule "aws_instance_invalid_type" {
  enabled = true
}

rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}

Plan Testing

Validate the plan output programmatically:

# conftest policy (Open Policy Agent)
# policy/terraform.rego

package terraform

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_security_group_rule"
  resource.change.after.cidr_blocks[_] == "0.0.0.0/0"
  resource.change.after.from_port == 22
  msg := "SSH open to the world is not allowed"
}

deny[msg] {
  resource := input.resource_changes[_]
  resource.type == "aws_instance"
  not startswith(resource.change.after.instance_type, "t3.")
  msg := sprintf("Instance type %s not allowed, use t3 family", [resource.change.after.instance_type])
}
# Run policy against plan
terraform plan -out=tfplan
terraform show -json tfplan > tfplan.json
conftest test tfplan.json
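conftest is one way to assert on the plan; the same JSON can be checked by any script. A minimal sketch in Go — the struct mirrors only the `resource_changes` fields this check needs, following the `terraform show -json` output format:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan models just the slice of the `terraform show -json` output
// that the security check below inspects.
type plan struct {
	ResourceChanges []struct {
		Type   string `json:"type"`
		Change struct {
			After struct {
				CidrBlocks []string `json:"cidr_blocks"`
				FromPort   int      `json:"from_port"`
			} `json:"after"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// openSSHViolations returns one message per security group rule
// that opens port 22 to the world.
func openSSHViolations(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, err
	}
	var msgs []string
	for _, rc := range p.ResourceChanges {
		if rc.Type != "aws_security_group_rule" {
			continue
		}
		for _, cidr := range rc.Change.After.CidrBlocks {
			if cidr == "0.0.0.0/0" && rc.Change.After.FromPort == 22 {
				msgs = append(msgs, "SSH open to the world is not allowed")
			}
		}
	}
	return msgs, nil
}

func main() {
	sample := []byte(`{"resource_changes":[{"type":"aws_security_group_rule",
		"change":{"after":{"cidr_blocks":["0.0.0.0/0"],"from_port":22}}}]}`)
	msgs, err := openSSHViolations(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(msgs)) // one violation in the sample plan
}
```

The same pattern extends to any plan attribute: unmarshal only what you need, loop, and fail the build on the first violation.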

Integration Testing with Terratest

Terratest deploys real infrastructure, validates it, then destroys it:

package test

import (
    "fmt"
    "testing"

    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "vpc_cidr":     "10.0.0.0/16",
            "environment":  "test",
            "az_count":     2,
        },
    }

    // Clean up after test
    defer terraform.Destroy(t, terraformOptions)

    // Deploy infrastructure
    terraform.InitAndApply(t, terraformOptions)

    // Get outputs
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    publicSubnets := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
    privateSubnets := terraform.OutputList(t, terraformOptions, "private_subnet_ids")

    // Validate VPC exists
    vpc := aws.GetVpcById(t, vpcId, "us-east-1")
    assert.Equal(t, "10.0.0.0/16", vpc.CidrBlock)

    // Validate subnet count
    assert.Equal(t, 2, len(publicSubnets))
    assert.Equal(t, 2, len(privateSubnets))

    // Validate subnets are in different AZs
    for i, subnetId := range publicSubnets {
        subnet := aws.GetSubnetById(t, subnetId, "us-east-1")
        assert.Contains(t, subnet.AvailabilityZone, fmt.Sprintf("us-east-1%c", 'a'+i))
    }
}
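The per-index AZ assertion above (`'a'+i`) is brittle: it assumes subnets come back in alphabetical AZ order. Asserting that the zones are merely distinct is more robust; a sketch:

```go
package main

import "fmt"

// allDistinct reports whether every string in the slice is unique —
// useful for asserting subnets landed in different availability zones
// without caring which zones, or in what order.
func allDistinct(values []string) bool {
	seen := make(map[string]bool, len(values))
	for _, v := range values {
		if seen[v] {
			return false
		}
		seen[v] = true
	}
	return true
}

func main() {
	fmt.Println(allDistinct([]string{"us-east-1a", "us-east-1b"})) // true
	fmt.Println(allDistinct([]string{"us-east-1a", "us-east-1a"})) // false
}
```

In the test you would collect each subnet's AvailabilityZone into a slice and `assert.True(t, allDistinct(zones))`.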

HTTP Endpoint Testing

Validate deployed services actually work:

// Requires additionally: "time" and
// http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
func TestWebServer(t *testing.T) {
    t.Parallel()

    terraformOptions := &terraform.Options{
        TerraformDir: "../modules/web-server",
    }

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Get the public URL
    url := terraform.Output(t, terraformOptions, "public_url")

    // Retry with backoff - infrastructure needs time to stabilize
    maxRetries := 30
    timeBetweenRetries := 10 * time.Second

    http_helper.HttpGetWithRetry(
        t,
        url,
        nil,
        200,
        "Welcome",
        maxRetries,
        timeBetweenRetries,
    )
}

Testing Patterns

Test Fixtures

Create minimal, isolated test environments:

# test/fixtures/minimal-vpc/main.tf
module "vpc" {
  source = "../../../modules/vpc"
  
  vpc_cidr    = "10.99.0.0/16"  # Non-production CIDR
  environment = "test-${random_id.suffix.hex}"
  az_count    = 1  # Minimize cost
}

resource "random_id" "suffix" {
  byte_length = 4
}
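The `random_id` resource gives Terraform-side uniqueness. When the Go test itself needs a unique name before apply — say, to pass a bucket name as a variable — the same trick works in-process. A sketch (Terratest's `random` module also ships a helper for this):

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// uniqueSuffix returns n random bytes hex-encoded — the Go-side
// equivalent of the random_id resource, so parallel test runs never
// collide on resource names.
func uniqueSuffix(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

func main() {
	name := fmt.Sprintf("test-%s", uniqueSuffix(4))
	fmt.Println(len(name)) // "test-" plus 8 hex chars = 13
}
```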

Parallel Testing

Run tests in parallel with isolated state:

func TestModule1(t *testing.T) {
    t.Parallel()  // This test runs concurrently
    // ...
}

func TestModule2(t *testing.T) {
    t.Parallel()  // This test runs concurrently
    // ...
}

Cost Control

Infrastructure tests cost money. Minimize it:

// Use smallest instance types
Vars: map[string]interface{}{
    "instance_type": "t3.micro",
    "min_size":      1,
    "max_size":      1,
}

// Always destroy, even on failure
defer terraform.Destroy(t, terraformOptions)

// Cap retries so a failing apply gives up quickly instead of burning spend
terraformOptions.MaxRetries = 3
terraformOptions.TimeBetweenRetries = 5 * time.Second
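The retry cap limits time; an allowlist limits the instance bill. A pre-flight check on the Vars map (the names here are illustrative, not a Terratest API) fails fast before `terraform apply` spends anything:

```go
package main

import "fmt"

// cheapTypes is a hypothetical allowlist of instance types small
// enough for test runs.
var cheapTypes = map[string]bool{
	"t3.micro": true,
	"t3.small": true,
}

// checkVars rejects test variable maps that request expensive
// instances, catching budget mistakes before any money is spent.
func checkVars(vars map[string]interface{}) error {
	v, ok := vars["instance_type"]
	if !ok {
		return nil // no instance type requested
	}
	if s, isString := v.(string); !isString || !cheapTypes[s] {
		return fmt.Errorf("instance type %v not in test allowlist", v)
	}
	return nil
}

func main() {
	ok := checkVars(map[string]interface{}{"instance_type": "t3.micro"})
	bad := checkVars(map[string]interface{}{"instance_type": "m5.24xlarge"})
	fmt.Println(ok == nil, bad != nil) // true true
}
```

Call it at the top of each test, right after building terraformOptions.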

CI/CD Integration

Run infrastructure tests in your pipeline:

# .github/workflows/infrastructure-test.yml
name: Infrastructure Tests

on:
  pull_request:
    paths:
      - 'terraform/**'

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
          go-version: '1.21'
          
      # tflint, tfsec, and conftest are not installed by the steps
      # above — add install steps (e.g. terraform-linters/setup-tflint)
      # before this one
      - name: Static Analysis
        run: |
          terraform init
          terraform validate
          tflint --recursive
          tfsec .
          
      - name: Plan Tests
        run: |
          terraform plan -out=tfplan
          terraform show -json tfplan > tfplan.json
          conftest test tfplan.json
          
      - name: Integration Tests
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          cd test
          go test -v -timeout 30m ./...

What to Test

Always test:

  • Security group rules (no accidental 0.0.0.0/0)
  • IAM policies (principle of least privilege)
  • Encryption settings (at rest and in transit)
  • Network connectivity (can services reach each other?)
  • Outputs match expected values

Consider testing:

  • Scaling behavior (does autoscaling actually work?)
  • Failover (does multi-AZ actually failover?)
  • Performance (response times under load)

Skip testing:

  • Exact resource counts (too brittle)
  • Provider implementation details
  • Things that change frequently

The Testing Pyramid for Infrastructure

            E2E tests (full deployment, validation)
          Integration tests (Terratest)
        Plan tests (policy checks)
      Static analysis (lint, validate, security scan)

More static analysis (fast, cheap), fewer E2E tests (slow, expensive).

The Mental Model

Infrastructure tests answer the question: “If I apply this change, will production still work?”

Static analysis catches typos and policy violations. Plan testing catches logical errors. Integration testing catches runtime issues.

You wouldn’t ship application code without tests. Your infrastructure deserves the same discipline.

The cost of testing is hours and cloud spend. The cost of not testing is a 3 AM incident when that security group rule you didn’t review takes down production.

Test your infrastructure. Sleep better.