As AI workloads become central to business operations, managing the infrastructure that powers them requires the same rigor we apply to traditional applications. Infrastructure as Code (IaC) isn’t just nice-to-have for AI—it’s essential for cost control, reproducibility, and scaling.
## The AI Infrastructure Challenge
AI workloads have unique requirements that traditional IaC patterns don’t always address:
- GPU instances that cost $3-10/hour and need careful lifecycle management
- Model artifacts that can be gigabytes in size and need versioning
- Auto-scaling that must consider both compute load and model warming time
- Spot instance strategies to reduce costs by 60-90%
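That last point is worth quantifying before writing any HCL. A quick back-of-envelope for a single always-on GPU worker, using assumed prices (illustrative, not quoted AWS rates):

```shell
# Rough monthly cost of one always-on g4dn.xlarge, on-demand vs. spot.
# Prices are assumptions for illustration; check current AWS pricing.
ON_DEMAND=0.526 # $/hour, assumed
SPOT=0.16       # $/hour, assumed (~70% discount)
HOURS=730       # hours in an average month

awk -v od="$ON_DEMAND" -v sp="$SPOT" -v h="$HOURS" 'BEGIN {
  printf "on-demand: $%.0f/mo, spot: $%.0f/mo, saved: $%.0f/mo\n",
         od * h, sp * h, (od - sp) * h
}'
# -> on-demand: $384/mo, spot: $117/mo, saved: $267/mo
```

Multiply by a fleet of ten workers and the automation below pays for itself quickly.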
Let’s build a Terraform + Ansible solution that handles these challenges.
Start with a GPU-enabled instance that can scale based on queue depth:
```hcl
# main.tf
resource "aws_launch_template" "ai_worker" {
  name_prefix   = "ai-worker-"
  image_id      = "ami-0c02fb55956c7d316" # Deep Learning AMI
  instance_type = "g4dn.xlarge"           # NVIDIA T4 GPU

  vpc_security_group_ids = [aws_security_group.ai_worker.id]

  user_data = base64encode(templatefile("${path.module}/user_data.sh", {
    s3_bucket = aws_s3_bucket.models.bucket
  }))

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name = "ai-worker"
      Role = "inference"
    }
  }
}

resource "aws_autoscaling_group" "ai_workers" {
  name                = "ai-workers"
  vpc_zone_identifier = [aws_subnet.private.id]
  target_group_arns   = [aws_lb_target_group.ai.arn]
  min_size            = 0
  max_size            = 10
  desired_capacity    = 1

  launch_template {
    id      = aws_launch_template.ai_worker.id
    version = "$Latest"
  }
}

# Scale up when the SQS job queue backs up
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "ai-scale-up"
  scaling_adjustment     = 2
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.ai_workers.name
}

resource "aws_cloudwatch_metric_alarm" "queue_depth_high" {
  alarm_name          = "ai-queue-depth-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ApproximateNumberOfMessagesVisible"
  namespace           = "AWS/SQS"
  period              = 120
  statistic           = "Average"
  threshold           = 10
  alarm_description   = "SQS queue depth is high; scale out AI workers"
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]

  dimensions = {
    QueueName = aws_sqs_queue.ai_jobs.name
  }
}
```
## Spot Instance Strategy
AI training and batch inference are perfect for spot instances. Add this to your launch template:
```hcl
resource "aws_launch_template" "ai_spot_worker" {
  name_prefix   = "ai-spot-worker-"
  image_id      = "ami-0c02fb55956c7d316"
  instance_type = "g4dn.xlarge"

  instance_market_options {
    market_type = "spot"
    spot_options {
      # Bid cap in USD/hour; keep it at or below the on-demand rate.
      # Omit max_price entirely to default to the on-demand price.
      max_price = "0.50"
    }
  }

  # Handle spot interruptions gracefully
  user_data = base64encode(templatefile("${path.module}/spot_handler.sh", {
    s3_bucket = aws_s3_bucket.models.bucket
  }))
}
```
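The template above references a `spot_handler.sh` that this post doesn't show, so here is one possible sketch. The IMDSv2 `spot/instance-action` endpoint is real and returns HTTP 200 roughly two minutes before reclamation; the `inference` container name and `/opt/checkpoints` path are hypothetical placeholders, and `${s3_bucket}` is filled in by Terraform's `templatefile`:

```shell
#!/bin/bash
# spot_handler.sh (sketch) -- watch IMDSv2 for a spot interruption
# notice and drain gracefully. Container name and checkpoint path
# are placeholders; substitute your own.

imds_status() {
  # Print the HTTP status of the spot interruption endpoint (200 = notice issued)
  local token
  token=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  curl -s -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $token" \
    "http://169.254.169.254/latest/meta-data/spot/instance-action"
}

drain() {
  # ~2 minutes remain: stop accepting work, checkpoint, let the ASG replace us
  docker stop --time 90 inference || true
  aws s3 sync /opt/checkpoints "s3://${s3_bucket}/checkpoints/" || true
}

# Start the watcher loop only when invoked as "spot_handler.sh watch"
if [ $# -gt 0 ] && [ "$1" = "watch" ]; then
  while sleep 5; do
    if [ "$(imds_status)" = "200" ]; then
      drain
      break
    fi
  done
fi
```

In user data you would launch the watcher in the background (`nohup /usr/local/bin/spot_handler.sh watch &`) before starting the inference service.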
## Ansible: Configuration and Deployment
Use Ansible to handle the complex setup that Terraform can’t:
```yaml
# playbooks/ai-setup.yml
---
- hosts: ai_workers
  become: yes
  vars:
    model_version: "{{ lookup('env', 'MODEL_VERSION') | default('latest', true) }}"
  tasks:
    - name: Install the NVIDIA container runtime
      ansible.builtin.package:
        name: nvidia-docker2
        state: present

    - name: Download model from S3
      amazon.aws.s3_object:
        bucket: "{{ s3_bucket }}"
        object: "models/{{ model_name }}/{{ model_version }}/model.tar.gz"
        dest: "/opt/models/{{ model_name }}.tar.gz"
        mode: get
      register: model_download

    - name: Extract model
      ansible.builtin.unarchive:
        src: "/opt/models/{{ model_name }}.tar.gz"
        dest: /opt/models/
        remote_src: yes
      when: model_download.changed

    - name: Start inference service
      community.docker.docker_container:
        name: "{{ model_name }}-inference"
        image: "pytorch/pytorch:latest"
        volumes:
          - "/opt/models:/models"
        ports:
          - "8080:8080"
        env:
          MODEL_PATH: "/models/{{ model_name }}"
          GPU_MEMORY_FRACTION: "0.8"
        device_requests:
          - driver: nvidia
            count: -1 # expose all GPUs to the container
            capabilities: [["gpu"]]
        restart_policy: always
```
## Cost Optimization Patterns

### 1. Scheduled Shutdown
```hcl
# Note: recurrence times are UTC unless time_zone is set on the schedule
resource "aws_autoscaling_schedule" "scale_down_evening" {
  scheduled_action_name  = "scale-down-evening"
  min_size               = 0
  max_size               = 2
  desired_capacity       = 0
  recurrence             = "0 22 * * *" # 10 PM daily
  autoscaling_group_name = aws_autoscaling_group.ai_workers.name
}

resource "aws_autoscaling_schedule" "scale_up_morning" {
  scheduled_action_name  = "scale-up-morning"
  min_size               = 1
  max_size               = 10
  desired_capacity       = 2
  recurrence             = "0 8 * * 1-5" # 8 AM weekdays
  autoscaling_group_name = aws_autoscaling_group.ai_workers.name
}
```
### 2. Model Caching Strategy
```yaml
- name: Cache models in EFS for faster startup
  ansible.posix.mount:
    path: /opt/model-cache
    src: "{{ efs_dns_name }}:/"
    fstype: efs
    opts: tls
    state: mounted

- name: Warm model cache
  ansible.builtin.shell: |
    if [ ! -f "/opt/model-cache/{{ model_name }}/ready" ]; then
      mkdir -p "/opt/model-cache/{{ model_name }}"
      cp -r "/opt/models/{{ model_name }}/." "/opt/model-cache/{{ model_name }}/"
      touch "/opt/model-cache/{{ model_name }}/ready"
    fi
```
## Monitoring and Observability
```hcl
resource "aws_cloudwatch_log_group" "ai_inference" {
  name              = "/ai/inference"
  retention_in_days = 7
}

resource "aws_cloudwatch_metric_alarm" "gpu_utilization_low" {
  alarm_name          = "ai-gpu-utilization-low"
  comparison_operator = "LessThanThreshold"
  evaluation_periods  = 3
  metric_name         = "GPUUtilization"
  namespace           = "AI/Custom" # custom metric published from the instances
  period              = 300
  statistic           = "Average"
  threshold           = 20
  alarm_description   = "GPU utilization is consistently low"

  # Trigger scale-in when GPUs are underutilized; scale_down mirrors the
  # scale_up policy above with scaling_adjustment = -1
  alarm_actions = [aws_autoscaling_policy.scale_down.arn]
}
```
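Because `GPUUtilization` lives in a custom namespace, something on each instance has to publish it or the alarm will sit in `INSUFFICIENT_DATA`. A minimal sketch, assuming `nvidia-smi` and the AWS CLI are available (both ship with the Deep Learning AMI), run from cron or a systemd timer:

```shell
#!/bin/bash
# publish-gpu-metric.sh (sketch) -- push GPU utilization to CloudWatch
# so the gpu_utilization_low alarm has data to evaluate.

publish_gpu_metric() {
  local util
  # Percent utilization of the first GPU as a bare number, e.g. "37"
  util=$(nvidia-smi --query-gpu=utilization.gpu \
    --format=csv,noheader,nounits | head -n1)
  # Published without dimensions so it matches the alarm as written
  aws cloudwatch put-metric-data \
    --namespace "AI/Custom" \
    --metric-name GPUUtilization \
    --value "$util" \
    --unit Percent
}

# Invoke directly, e.g. cron: * * * * * /usr/local/bin/publish-gpu-metric.sh publish
if [ $# -gt 0 ] && [ "$1" = "publish" ]; then
  publish_gpu_metric
fi
```

If you want per-instance alarms instead of a fleet-wide average, add an `InstanceId` dimension here and matching `dimensions` on the alarm.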
## The Deployment Pipeline
Tie it all together with a CI/CD pipeline that handles model updates:
```bash
#!/bin/bash
# deploy-ai-infrastructure.sh
set -euo pipefail

# Deploy infrastructure; apply the exact plan that was reviewed
terraform init
terraform plan -out=tfplan
terraform apply tfplan

# Wait for instances to boot and pass health checks
sleep 300

# Deploy configuration via the aws_ec2 dynamic inventory
ansible-playbook -i aws_ec2.yml playbooks/ai-setup.yml \
  -e "model_version=${MODEL_VERSION}" \
  -e "s3_bucket=${S3_BUCKET}"

# Health check
curl -f "http://$(terraform output -raw load_balancer_dns)/health" || exit 1
echo "AI infrastructure deployed successfully"
```
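The playbook run above assumes an `aws_ec2.yml` dynamic inventory that resolves the `ai_workers` group. One possible sketch, keyed off the `Role = inference` tag from the launch template (requires the `amazon.aws` collection; the region is an assumption):

```yaml
# aws_ec2.yml -- dynamic inventory (sketch)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1
filters:
  tag:Role: inference
  instance-state-name: running
groups:
  # Put matching hosts into the ai_workers group the playbook targets
  ai_workers: tags.Role == 'inference'
```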
## Key Takeaways
- Treat AI infrastructure like any other workload: apply IaC principles
- Leverage spot instances aggressively: 60-90% cost savings for batch workloads
- Monitor GPU utilization closely: idle GPUs are expensive mistakes
- Cache models intelligently: startup time matters for auto-scaling
- Plan for spot interruptions: graceful degradation is essential
Infrastructure as Code isn’t just about reproducibility—for AI workloads, it’s about survival. GPU costs can spiral out of control without proper automation. Start with these patterns and adapt them to your specific models and workload patterns.
The future of AI is automated infrastructure. Build it right from the start.