Terraform state is both the source of its power and the cause of most Terraform disasters. Get it wrong and you’re recreating production resources at 2 AM. Get it right and infrastructure changes become boring (the good kind).

What State Actually Is

Terraform state is a JSON file that maps your configuration to real resources. When you write aws_instance.web, Terraform needs to know which actual EC2 instance that refers to. State is that mapping.

{
  "resources": [{
    "type": "aws_instance",
    "name": "web",
    "instances": [{
      "attributes": {
        "id": "i-0abc123def456789",
        "ami": "ami-0123456789abcdef0"
      }
    }]
  }]
}

Without state, Terraform would create a new instance on every apply. With state, it knows whether to update the existing instance or leave it alone.

Remote State: Not Optional

Local state files work for learning. For anything else, use remote state.

terraform {
  backend "s3" {
    bucket         = "my-terraform-state"
    key            = "prod/infrastructure.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

The DynamoDB table provides locking — critical when multiple people run Terraform. Without it, two simultaneous applies can corrupt your state.

State Isolation Patterns

Per-Environment

environments/
  dev/main.tf      # backend key: "dev/terraform.tfstate"
  staging/main.tf  # backend key: "staging/terraform.tfstate"
  prod/main.tf     # backend key: "prod/terraform.tfstate"

Separate state per environment means a bad terraform destroy in dev can’t touch prod.

Per-Component

networking/main.tf   # VPC, subnets, routes
database/main.tf     # RDS, ElastiCache
compute/main.tf      # EC2, ASG, ECS

Blast radius reduction. Database changes don’t risk compute resources.

Workspaces: Use With Caution

terraform workspace new staging
terraform workspace select prod

Workspaces share code but separate state. Sounds elegant, but:

  • All environments use identical code (no per-env tweaks)
  • Easy to forget which workspace you’re in
  • terraform destroy in wrong workspace is catastrophic

I’ve seen teams succeed with workspaces. I’ve seen more teams abandon them after incidents.
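The usual workaround for the identical-code limitation is branching on terraform.workspace. That interpolation is real; the sizing values below are placeholders:

```hcl
# Vary sizing by workspace; instance types here are illustrative
locals {
  instance_type = terraform.workspace == "prod" ? "m5.large" : "t3.micro"
}
```

Every such conditional is another place a workspace mix-up can bite, which is part of why many teams prefer a directory per environment instead.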

State Operations You’ll Need

Import Existing Resources

terraform import aws_instance.web i-0abc123def456789

Then write the matching configuration and iterate with terraform plan until it reports no changes.
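A minimal matching block might look like this (the AMI echoes the state example above; the instance type is a placeholder to reconcile against the real instance):

```hcl
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # from the imported instance
  instance_type = "t3.micro"              # placeholder: match the real instance
}
```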

Move Resources Between Modules

terraform state mv 'aws_instance.web' 'module.compute.aws_instance.web'

Essential during refactoring. Without it, Terraform would destroy the resource at the old address and recreate it at the new one.
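Since Terraform 1.1, the same move can also be declared in configuration with a moved block, which ships with the code instead of being a one-off CLI step:

```hcl
# Declarative equivalent of the state mv above
moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}
```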

Remove Without Destroying

terraform state rm aws_instance.web

Removes from state but leaves the actual resource. Useful when handing off resources to another Terraform config or manual management.

Replace Tainted Resources

terraform taint aws_instance.web
# or, since Terraform 0.15.2 (taint is now deprecated):
terraform apply -replace="aws_instance.web"

Forces recreation on next apply.

Recovering from State Disasters

State Got Corrupted

# Pull last good state from S3 versioning
aws s3api list-object-versions \
  --bucket my-terraform-state \
  --prefix prod/infrastructure.tfstate

aws s3api get-object \
  --bucket my-terraform-state \
  --key prod/infrastructure.tfstate \
  --version-id "ABC123" \
  restored.tfstate

terraform state push restored.tfstate

Enable S3 versioning on your state bucket. It’s saved me twice.
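Versioning itself can be codified. A sketch using the AWS provider v4+ resource, with the bucket name borrowed from the backend example:

```hcl
# Turn on versioning for the state bucket so old state can be restored
resource "aws_s3_bucket_versioning" "state" {
  bucket = "my-terraform-state"

  versioning_configuration {
    status = "Enabled"
  }
}
```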

Lost State Entirely

Import everything. It’s painful but possible:

# List what exists
aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId'

# Import each one
terraform import aws_instance.web i-0abc123
terraform import aws_instance.api i-0abc456

For complex infrastructure, consider terraformer to generate config from existing resources.

Sensitive Data in State

State contains everything — including secrets:

{
  "attributes": {
    "password": "hunter2"
  }
}

Protect it:

  • S3 encryption: encrypt = true in backend config
  • Bucket policies: Restrict who can read
  • No git: Never commit state files (add *.tfstate to .gitignore)
  • Audit logs: Enable S3 access logging
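Note that encrypt = true only encrypts the state objects Terraform writes; default bucket encryption covers everything in the bucket. A sketch, assuming the bucket name from the backend example:

```hcl
# Default server-side encryption for all objects in the state bucket
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = "my-terraform-state"

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```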

For database credentials, generate them with random_password and publish them to Secrets Manager, so other systems retrieve them without ever reading state:

resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_secretsmanager_secret" "db" {
  name = "prod/db-password" # example name
}

resource "aws_secretsmanager_secret_version" "db" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}

resource "aws_db_instance" "main" {
  # engine, instance_class, and other required arguments omitted
  password = random_password.db.result
}

Password still in state, but retrieval goes through Secrets Manager.
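Downstream consumers then read through Secrets Manager. One caveat: a data source like this stores the secret in the reading configuration's state as well (the secret name is hypothetical):

```hcl
# Read the password in another configuration without touching this state
data "aws_secretsmanager_secret_version" "db" {
  secret_id = "prod/db-password" # hypothetical secret name
}

# reference it as: data.aws_secretsmanager_secret_version.db.secret_string
```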

State Locking Deep Dive

DynamoDB locking for S3 backend:

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

If a lock gets stuck (Terraform crashed mid-apply):

terraform force-unlock LOCK_ID

Use sparingly. Only when you’re certain no other process is running.

Best Practices Summary

  1. Remote state from day one — local state doesn’t scale
  2. Enable versioning — your safety net for corruption
  3. Isolate state — per environment at minimum
  4. Lock state — DynamoDB for S3, built-in for Terraform Cloud
  5. Encrypt state — it contains secrets
  6. Don’t edit state manually — use terraform state commands
  7. Backup before risky operations — terraform state pull > backup.tfstate

State management isn’t exciting, but getting it right means infrastructure changes stay boring. And in operations, boring is beautiful.