Running AI workloads in containers presents unique challenges that traditional web application patterns don’t address. GPU scheduling, model caching, and bursty inference traffic all require thoughtful architecture. Here’s what actually works in production.
The GPU Scheduling Problem
Standard Kubernetes scheduling assumes CPU and memory are your primary constraints. When you add GPUs to the mix, everything changes.
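A minimal pod sketch requesting a dedicated GPU (assumes the NVIDIA device plugin is installed; image and names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  containers:
  - name: model-server
    image: my-registry/model-server:latest   # hypothetical image
    resources:
      limits:
        nvidia.com/gpu: 1   # GPUs are requested whole; no fractional values here
```

Note that `nvidia.com/gpu` only accepts integers, which is exactly why a single pod ends up monopolizing hardware it uses a fraction of the time.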
The naive approach—one GPU per pod—works until you realize GPUs cost $2-4/hour and sit idle between requests. MIG (Multi-Instance GPU) and time-slicing help, but they introduce complexity:
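A sketch of a time-slicing config for the NVIDIA device plugin, which makes each physical GPU advertise multiple schedulable replicas (namespace and config-key names are illustrative; check your device-plugin version for the exact format):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4   # each physical GPU shows up as 4 schedulable GPUs
```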
Time-slicing interleaves tenants on the same CUDA cores and provides no memory isolation; unlike MIG, nothing is partitioned, so one greedy tenant can exhaust GPU memory for everyone. For inference workloads with variable load, this trade-off often makes sense. For training? Almost never.
Model Caching: The Hidden Bottleneck
Loading a 7B parameter model takes 10-30 seconds. Loading a 70B model? Minutes. If your pods restart frequently or scale often, you’re spending more time loading than inferring.
Pattern 1: Init Container Pre-warming
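The idea: an init container downloads the model into a shared volume before the inference container starts, so a crash-looping server never re-downloads. A sketch, with a hypothetical fetcher image and command:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-pod
spec:
  initContainers:
  - name: model-download
    image: my-registry/model-fetcher:latest            # hypothetical image
    command: ["sh", "-c", "fetch-model llama-7b /models"]  # illustrative command
    volumeMounts:
    - name: model-cache
      mountPath: /models
  containers:
  - name: inference
    image: my-registry/model-server:latest
    volumeMounts:
    - name: model-cache
      mountPath: /models
  volumes:
  - name: model-cache
    emptyDir: {}   # lives as long as the pod; survives container restarts
```

The `emptyDir` survives container restarts within the pod, but not pod rescheduling — which is where the next two patterns come in.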
Pattern 2: Shared PVC with ReadWriteMany
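A sketch of the claim, assuming an EFS CSI driver with a storage class named `efs-sc`:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-models
spec:
  accessModes:
  - ReadWriteMany           # every inference pod mounts the same model store
  storageClassName: efs-sc  # assumes an EFS (or equivalent RWX) storage class
  resources:
    requests:
      storage: 500Gi
```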
EFS or equivalent networked storage lets every pod share one downloaded copy of each model. The trade-off: network latency on model load versus the storage cost of duplicating models per node.
Pattern 3: Node-Local DaemonSet Cache
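A sketch of the DaemonSet approach: a per-node cache warmer populates a `hostPath` directory on local NVMe, and inference pods on that node mount the same path (image and path are illustrative):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: model-cache
spec:
  selector:
    matchLabels:
      app: model-cache
  template:
    metadata:
      labels:
        app: model-cache
    spec:
      containers:
      - name: cache-warmer
        image: my-registry/model-fetcher:latest   # hypothetical warmer image
        volumeMounts:
        - name: nvme-cache
          mountPath: /models
      volumes:
      - name: nvme-cache
        hostPath:
          path: /mnt/nvme/models    # node-local NVMe; assumes this mount exists
          type: DirectoryOrCreate
```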
This keeps models warm on NVMe local storage. Faster than network, but you’re managing state on “cattle” nodes.
Handling Bursty Traffic
AI inference traffic is rarely smooth. A viral tweet mentioning your product can 10x traffic in minutes. Traditional HPA based on CPU won’t react fast enough.
KEDA with Custom Metrics:
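A sketch of a KEDA `ScaledObject` driven by a Prometheus query (the metric name and server address are assumptions; substitute whatever your inference server exposes):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-server            # Deployment to scale
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090  # assumed endpoint
      query: sum(inference_queue_depth)                 # hypothetical metric
      threshold: "10"             # roughly one pod per 10 pending requests
```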
Scale on queue depth, not CPU. When pending requests exceed threshold, spin up more pods before latency degrades.
The Cold Start Problem:
Even with KEDA, new pods take time to become ready. For true responsiveness, maintain a “warm pool”:
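One way to express a warm pool with KEDA: set a floor with `minReplicaCount` and pass asymmetric scaling behavior through to the underlying HPA (values are illustrative starting points, not recommendations):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
spec:
  scaleTargetRef:
    name: model-server
  minReplicaCount: 3              # warm pool: never drop below 3 ready pods
  maxReplicaCount: 20
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 0     # react to spikes immediately
        scaleDown:
          stabilizationWindowSeconds: 600   # wait 10 min before shrinking
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090  # assumed endpoint
      query: sum(inference_queue_depth)                 # hypothetical metric
      threshold: "10"
```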
Scale up aggressively, scale down slowly. Those extra 2-3 pods during quiet periods cost less than cold-start latency during traffic spikes.
Request Batching at the Ingress
Individual inference requests are inefficient. Batching improves GPU utilization dramatically:
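The knobs vary by server — a hypothetical config sketch of the two that matter (for comparison, Triton's dynamic batcher calls the wait knob `max_queue_delay_microseconds`):

```yaml
batching:
  enabled: true
  max_batch_size: 32   # cap at what fits in GPU memory for your model
  max_wait_ms: 10      # hold a request at most 10 ms waiting for batchmates
```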
The trade-off: batching adds latency (up to max_wait_ms) but improves throughput. For chatbots where users expect instant responses, keep max_wait low. For batch processing, crank it up.
Practical Takeaways
- Don’t over-isolate GPUs — time-slicing works well for inference; save dedicated GPUs for training
- Cache models aggressively — loading is often slower than inference itself
- Scale on queue depth, not CPU — GPU workloads don’t correlate with CPU usage
- Batch requests when possible — even 10ms of batching dramatically improves throughput
- Scale down slowly — cold starts hurt more than idle capacity costs
The infrastructure patterns that work for traditional web apps often fail for AI. GPU constraints, model loading times, and bursty traffic all require different thinking. Start simple, measure everything, and optimize the actual bottlenecks you observe.
What patterns have you found useful for AI workloads? The field is evolving fast—today’s best practices might be tomorrow’s anti-patterns.