Running AI workloads in containers presents unique challenges that traditional web application patterns don’t address. GPU scheduling, model caching, and bursty inference traffic all require thoughtful architecture. Here’s what actually works in production.

The GPU Scheduling Problem

Standard Kubernetes scheduling assumes CPU and memory are your primary constraints. When you add GPUs to the mix, everything changes.

apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: model
    image: my-registry/llm-server:v1.2
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "32Gi"
      requests:
        nvidia.com/gpu: 1
        memory: "24Gi"

The naive approach—one GPU per pod—works until you realize GPUs cost $2-4/hour and sit idle between requests. MIG (Multi-Instance GPU) and time-slicing help, but they introduce complexity:

# Time-slicing config for NVIDIA device plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

Time-slicing shares both compute and memory: tenants take turns on the same CUDA cores, and nothing stops one tenant from exhausting GPU memory for the rest (MIG, by contrast, partitions the hardware into isolated slices). For inference workloads with variable load, this often makes sense. For training? Almost never.

Model Caching: The Hidden Bottleneck

Loading a 7B parameter model takes 10-30 seconds. Loading a 70B model? Minutes. If your pods restart frequently or scale often, you’re spending more time loading than inferring.
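A back-of-the-envelope model makes the gap concrete: load time is roughly weight bytes divided by storage throughput. The bandwidth figures below are illustrative assumptions, not measurements:

```python
def load_time_seconds(params_billion: float,
                      bytes_per_param: float,
                      throughput_gb_s: float) -> float:
    """Rough model load-time estimate: weight bytes / storage throughput."""
    size_gb = params_billion * bytes_per_param  # e.g. fp16 = 2 bytes/param
    return size_gb / throughput_gb_s

# 7B fp16 model (~14 GB) from local NVMe at an assumed ~2 GB/s: ~7 s
print(round(load_time_seconds(7, 2, 2.0), 1))   # 7.0
# 70B fp16 model (~140 GB) over an assumed ~0.5 GB/s network mount: ~280 s
print(round(load_time_seconds(70, 2, 0.5), 1))  # 280.0
```

If your pods restart a few times an hour, that 280 seconds is pure waste repeated on every restart, which is why the caching patterns below exist.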

Pattern 1: Init Container Pre-warming

initContainers:
- name: model-loader
  image: my-registry/model-loader:v1
  command: ["python", "download_model.py", "--model", "$(MODEL_ID)"]
  volumeMounts:
  - name: model-cache
    mountPath: /models
  env:
  - name: MODEL_ID
    value: "mistral-7b-instruct"

Pattern 2: Shared PVC with ReadWriteMany

volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: shared-models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-models
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 500Gi

EFS or equivalent networked storage lets all pods share downloaded models. The trade-off: network latency on model load versus storage costs of duplicating models per-node.

Pattern 3: Node-Local DaemonSet Cache

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-cache-warmer
spec:
  selector:
    matchLabels:
      app: model-cache
  template:
    metadata:
      labels:
        app: model-cache  # must match spec.selector.matchLabels
    spec:
      nodeSelector:
        gpu: "true"
      containers:
      - name: warmer
        image: my-registry/cache-warmer:v1
        volumeMounts:
        - name: local-cache
          mountPath: /models
      volumes:
      - name: local-cache
        hostPath:
          path: /mnt/nvme/models

This keeps models warm on NVMe local storage. Faster than network, but you’re managing state on “cattle” nodes.

Handling Bursty Traffic

AI inference traffic is rarely smooth. A viral tweet mentioning your product can 10x traffic in minutes. Traditional HPA based on CPU won’t react fast enough.

KEDA with Custom Metrics:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: inference_queue_depth
      threshold: '10'
      query: sum(inference_requests_pending)

Scale on queue depth, not CPU. When pending requests exceed threshold, spin up more pods before latency degrades.
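Under the hood, KEDA hands the metric to the HPA, and for a threshold-style trigger like this the arithmetic reduces to roughly ceil(metric / threshold), clamped to the configured bounds. A minimal sketch of that calculation (a simplification of the real controller, not its implementation):

```python
import math

def desired_replicas(pending: int, threshold: int,
                     min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Approximate replica target: ceil(metric / threshold), clamped to
    the minReplicaCount/maxReplicaCount bounds from the ScaledObject."""
    want = math.ceil(pending / threshold)
    return max(min_replicas, min(max_replicas, want))

print(desired_replicas(5, 10))    # 2  (quiet: floor at minReplicaCount)
print(desired_replicas(95, 10))   # 10
print(desired_replicas(500, 10))  # 20 (spike: capped at maxReplicaCount)
```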

The Cold Start Problem:

Even with KEDA, new pods take time to become ready. For true responsiveness, maintain a “warm pool”:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-deployment
  minReplicas: 3  # Always keep 3 warm
  maxReplicas: 50
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5min before scaling down
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60  # Remove 1 pod per minute max

Scale up aggressively, scale down slowly. Those extra 2-3 pods during quiet periods cost less than cold-start latency during traffic spikes.
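The effect of that scaleDown policy can be sketched with a toy calculation (a simplified model of the HPA behavior, not the real controller):

```python
def scale_down_duration_s(current: int, target: int,
                          pods_per_period: int = 1, period_s: int = 60,
                          stabilization_s: int = 300) -> int:
    """Seconds until replicas drain from current to target, given the
    stabilization window and the per-period removal limit above."""
    if current <= target:
        return 0
    periods = -(-(current - target) // pods_per_period)  # ceiling division
    return stabilization_s + periods * period_s

# Draining 10 surge pods at 1 pod/min after the 5-min window: 15 minutes
print(scale_down_duration_s(13, 3))  # 900
```

Fifteen minutes of gradual drain means a second traffic spike shortly after the first still finds capacity in place.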

Request Batching at the Ingress

Individual inference requests are inefficient. Batching improves GPU utilization dramatically:

# Simplified batching logic
import asyncio
import time

class BatchingServer:
    def __init__(self, max_batch=32, max_wait_ms=50):
        self.queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
    
    async def collect_batch(self):
        batch = []
        deadline = time.time() + self.max_wait
        
        while len(batch) < self.max_batch:
            timeout = max(0, deadline - time.time())
            try:
                item = await asyncio.wait_for(
                    self.queue.get(), 
                    timeout=timeout
                )
                batch.append(item)
            except asyncio.TimeoutError:
                break
        
        return batch

The trade-off: batching adds latency (up to max_wait_ms) but improves throughput. For chatbots where users expect instant responses, keep max_wait low. For batch processing, crank it up.
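The trade-off can be made concrete with a toy throughput model (the per-batch service times below are assumed numbers for illustration, not measurements):

```python
def effective_throughput(batch_size: int, wait_s: float, service_s: float) -> float:
    """Requests/second when a full batch waits up to wait_s to form,
    then takes service_s to run on the GPU -- a toy model only."""
    return batch_size / (wait_s + service_s)

# Assume single requests run in ~0.05 s and a batch of 32 in ~0.10 s
print(round(effective_throughput(1, 0.0, 0.05), 1))    # 20.0 req/s unbatched
print(round(effective_throughput(32, 0.05, 0.10), 1))  # ~213 req/s batched
```

Under those assumed numbers, 50 ms of added wait buys an order of magnitude in throughput, which is why even small batching windows pay off.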

Practical Takeaways

  1. Don’t over-isolate GPUs — time-slicing works well for inference; save dedicated GPUs for training
  2. Cache models aggressively — loading is often slower than inference itself
  3. Scale on queue depth, not CPU — GPU workloads don’t correlate with CPU usage
  4. Batch requests when possible — even 10ms of batching dramatically improves throughput
  5. Scale down slowly — cold starts hurt more than idle capacity costs

The infrastructure patterns that work for traditional web apps often fail for AI. GPU constraints, model loading times, and bursty traffic all require different thinking. Start simple, measure everything, and optimize the actual bottlenecks you observe.


What patterns have you found useful for AI workloads? The field is evolving fast—today’s best practices might be tomorrow’s anti-patterns.