Terraform state is where infrastructure-as-code meets reality. It’s also where most Terraform disasters originate. Here’s how to manage state without losing sleep.

The Problem

Terraform tracks what it’s created in a state file. This file maps your HCL resources to real infrastructure. Without it, Terraform can’t update or destroy anything — it doesn’t know what exists.
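
Concretely, state is a JSON document. A heavily trimmed, illustrative excerpt (IDs and versions here are made up):

```json
{
  "version": 4,
  "terraform_version": "1.7.0",
  "serial": 42,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "web",
      "instances": [
        { "attributes": { "id": "i-0abc123def456", "instance_type": "t3.micro" } }
      ]
    }
  ]
}
```

The `resources` array is the mapping: the HCL address `aws_instance.web` on one side, the real instance ID on the other.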

The default is a local file called terraform.tfstate. This works fine until:

  • Someone else needs to run Terraform
  • Your laptop dies
  • Two people run apply simultaneously
  • You accidentally commit secrets to Git

Rule 1: Remote State from Day One

Never use local state for anything beyond experiments:

# backend.tf
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

Why S3 + DynamoDB?

  • S3 stores the state (versioned, encrypted)
  • DynamoDB provides locking (prevents concurrent applies)
  • Both are cheap and managed

Create the backend resources first (yes, this is a chicken-and-egg problem):

# Bootstrap script - run once manually
# Note: outside us-east-1, add --create-bucket-configuration LocationConstraint=<region>
aws s3api create-bucket \
  --bucket mycompany-terraform-state \
  --region us-east-1

aws s3api put-bucket-versioning \
  --bucket mycompany-terraform-state \
  --versioning-configuration Status=Enabled

aws dynamodb create-table \
  --table-name terraform-locks \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

Rule 2: One State Per Environment

Don’t share state between environments. A bad apply in dev shouldn’t affect prod.

terraform/
├── dev/
│   └── backend.tf      # key = "dev/terraform.tfstate"
├── staging/
│   └── backend.tf      # key = "staging/terraform.tfstate"
└── prod/
    └── backend.tf      # key = "prod/terraform.tfstate"

Each environment has its own state file, its own lock, its own blast radius.

Rule 3: Use Workspaces Sparingly

Terraform workspaces let you reuse configuration with different state:

terraform workspace new staging
terraform workspace select staging
terraform apply

The problem: Workspaces share everything except state. You can’t have different backend configs, different providers, or different module versions per workspace.

Workspaces work for simple multi-tenant scenarios. For environments with different requirements, use separate directories.
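
Within a workspace-based setup, the one lever you do get is the workspace name itself, exposed as `terraform.workspace`. A sketch (the resource and sizing values are illustrative):

```hcl
# Vary sizing by workspace inside shared configuration
resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = terraform.workspace == "prod" ? "m5.large" : "t3.micro"

  tags = {
    Environment = terraform.workspace
  }
}
```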

Rule 4: Never Edit State Manually

The state file is JSON. You can read it. You shouldn’t edit it.

When you need to modify state, use the CLI:

# Remove a resource from state (without destroying it)
terraform state rm aws_instance.legacy_server

# Move a resource (after refactoring)
terraform state mv aws_instance.web aws_instance.web_server

# Import existing infrastructure
terraform import aws_instance.imported i-1234567890abcdef0

# Pull state for inspection
terraform state pull > state.json

# Push state (dangerous, use rarely)
terraform state push state.json
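
Since Terraform 1.1, a rename like the `state mv` above can also be recorded declaratively with a `moved` block, which is reviewable in Git and applied automatically on the next plan:

```hcl
# Equivalent to: terraform state mv aws_instance.web aws_instance.web_server
moved {
  from = aws_instance.web
  to   = aws_instance.web_server
}
```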

Rule 5: Handle State Drift

Infrastructure changes outside Terraform. Someone clicks in the console, an auto-scaling event occurs, a security patch updates something. Now state doesn’t match reality.

Detect drift regularly:

# See what changed
terraform plan -detailed-exitcode
# Exit code 2 = changes detected

In CI/CD:

- name: Check for drift
  run: |
    set +e
    terraform plan -detailed-exitcode -out=plan.tfplan
    exit_code=$?
    set -e
    if [ "$exit_code" -eq 1 ]; then exit 1; fi  # real errors still fail the job
    if [ "$exit_code" -eq 2 ]; then
      echo "::warning::Infrastructure drift detected"
      # Optionally: post to Slack, create ticket
    fi

When drift happens:

  1. If Terraform should own it: terraform apply to fix
  2. If it was intentional: terraform apply -refresh-only to accept the new values into state (the standalone terraform refresh command is deprecated)
  3. If it’s out of scope: consider lifecycle { ignore_changes = [...] }
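
The third case can look like this — a sketch for a service where something other than Terraform owns the task count (resource names are illustrative):

```hcl
resource "aws_ecs_service" "app" {
  name          = "app"
  cluster       = aws_ecs_cluster.main.id
  desired_count = 2  # initial value only; autoscaling owns it afterwards

  lifecycle {
    ignore_changes = [desired_count]
  }
}
```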

Rule 6: Protect Sensitive Data

State contains secrets. Database passwords, API keys, anything you pass to a resource ends up in state.

Mitigations:

# 1. Encrypt state at rest (S3)
terraform {
  backend "s3" {
    encrypt = true
    kms_key_id = "arn:aws:kms:us-east-1:123456789:key/xxx"
  }
}

# 2. Use environment variables for secrets
variable "db_password" {
  sensitive = true  # Hides in logs, but still in state
}

# 3. Better: Reference secrets from a vault
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/db/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

The secret is still in state, but now it’s a reference you can rotate without changing Terraform.

Rule 7: Lock State in CI/CD

Concurrent Terraform runs can corrupt state. Always lock:

# GitHub Actions example
jobs:
  terraform:
    concurrency: 
      group: terraform-${{ github.ref }}
      cancel-in-progress: false  # Don't cancel running applies
    steps:
      - name: Terraform Apply
        run: terraform apply -auto-approve

The DynamoDB lock handles Terraform-level concurrency. The CI concurrency group handles pipeline-level concurrency.

Rule 8: Plan for Recovery

State files get corrupted. Backends have outages. Plan for it:

# Enable versioning on S3 (you did this, right?)
aws s3api list-object-versions \
  --bucket mycompany-terraform-state \
  --prefix prod/terraform.tfstate

# Recover a previous version
aws s3api get-object \
  --bucket mycompany-terraform-state \
  --key prod/terraform.tfstate \
  --version-id "abc123" \
  recovered-state.json

Backup state before risky operations:

terraform state pull > backup-$(date +%Y%m%d-%H%M%S).json
terraform apply  # the risky operation

Rule 9: Split Large State Files

Monolithic state files are slow and risky. A single bad apply affects everything.

Split by:

  • Lifecycle: Networking rarely changes, app infra changes often
  • Team ownership: Platform team owns VPC, app teams own their services
  • Blast radius: Prod database separate from prod compute

Reference across state files with terraform_remote_state:

# In app/main.tf
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "mycompany-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  subnet_id = data.terraform_remote_state.network.outputs.private_subnet_ids[0]
}

The Minimum Checklist

Starting a new Terraform project? Do these first:

  1. ✅ Create S3 bucket with versioning
  2. ✅ Create DynamoDB lock table
  3. ✅ Configure remote backend
  4. ✅ Set up CI/CD with concurrency controls
  5. ✅ Add drift detection to scheduled pipeline
  6. ✅ Document recovery procedures

State management isn’t exciting, but it’s the foundation. Get it right early, and you’ll avoid the 3 AM “who deleted production” incidents.