Building toy demos with LLM APIs is easy. Building production systems that handle real traffic, fail gracefully, and don’t bankrupt you? That’s where it gets interesting.
## The Reality of Production LLM Integration
Most tutorials show you how to curl an API and celebrate. Real systems need to handle:
- API rate limits and throttling
- Transient failures and retries
- Cost explosion from runaway loops
- Latency variance (100ms to 30s responses)
- Model version changes breaking prompts
- Inputs exceeding token limits
Let’s look at patterns that actually work.
## Pattern 1: Circuit Breakers for LLM Calls
LLM APIs have bad days. When they do, you don’t want every request in your system waiting 60 seconds to timeout.
```python
from circuitbreaker import circuit  # PyPI package: circuitbreaker
import openai

@circuit(failure_threshold=5, recovery_timeout=30)
def call_llm(prompt: str, model: str = "gpt-4") -> str:
    """LLM call with circuit breaker protection."""
    response = openai.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        timeout=30,
    )
    return response.choices[0].message.content
```
After 5 failures, the circuit opens. Calls fail fast for 30 seconds, giving the API time to recover. Your users get quick errors instead of hanging requests.
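Circuit breakers pair naturally with retries for the transient failures listed earlier: retry smooths over blips, the breaker stops you from retrying into an outage. A minimal backoff sketch — the helper name and delay constants are illustrative, not from any particular library:

```python
import random
import time

def call_with_backoff(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry a flaky callable with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the real error
            # 1s, 2s, 4s... plus jitter so clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Wrap the circuit-protected call in this, not the other way around, so repeated retries still count as breaker failures.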
## Pattern 2: Tiered Model Fallback
Not every request needs GPT-4. Build a cascade:
```python
MODELS = [
    {"name": "gpt-4", "cost": 0.03, "quality": "high"},
    {"name": "gpt-3.5-turbo", "cost": 0.002, "quality": "medium"},
    {"name": "local-llama", "cost": 0.0, "quality": "basic"},
]

QUALITY_RANKS = {"basic": 0, "medium": 1, "high": 2}

def quality_rank(quality: str) -> int:
    return QUALITY_RANKS[quality]

# call_model, RateLimitError, APIError, AllModelsExhausted, and log
# are defined elsewhere in the application.
async def smart_completion(prompt: str, min_quality: str = "basic") -> str:
    """Try models in order, falling back on failure or rate limits."""
    for model in MODELS:
        if quality_rank(model["quality"]) < quality_rank(min_quality):
            continue
        try:
            return await call_model(model["name"], prompt)
        except RateLimitError:
            continue
        except APIError as e:
            log.warning(f"{model['name']} failed: {e}")
            continue
    raise AllModelsExhausted("No available models could handle this request")
```
This gives you graceful degradation. If your primary model is rate-limited, fall back to cheaper options rather than failing completely.
## Pattern 3: Token Budget Management
The fastest way to burn money is an infinite loop calling GPT-4. Protect yourself:
```python
from dataclasses import dataclass
from datetime import datetime, timedelta

class BudgetExceeded(Exception):
    pass

@dataclass
class TokenBudget:
    daily_limit: int = 1_000_000
    hourly_limit: int = 100_000
    per_request_limit: int = 4000
    _daily_used: int = 0
    _hourly_used: int = 0
    _hour_start: datetime | None = None

    def _maybe_reset_hour(self):
        now = datetime.now()
        if self._hour_start is None or now - self._hour_start >= timedelta(hours=1):
            self._hour_start = now
            self._hourly_used = 0

    def can_spend(self, tokens: int) -> bool:
        self._maybe_reset_hour()
        return (
            tokens <= self.per_request_limit and
            self._daily_used + tokens <= self.daily_limit and
            self._hourly_used + tokens <= self.hourly_limit
        )

    def spend(self, tokens: int):
        if not self.can_spend(tokens):
            raise BudgetExceeded(f"{tokens} tokens would exceed budget")
        self._daily_used += tokens
        self._hourly_used += tokens
```
Track budgets at the application level. Don’t rely on API-side limits alone—by the time you hit those, you’ve already spent the money.
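A sketch of what application-level gating might look like, assuming a `TokenBudget`-style object with `can_spend`/`spend` as above. The 4-characters-per-token estimate is a rough heuristic for English text, not a real tokenizer; `guarded_call` is an illustrative name:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def guarded_call(budget, prompt: str) -> int:
    """Check the budget before calling; record usage after."""
    est = estimate_tokens(prompt)
    if not budget.can_spend(est):
        raise RuntimeError("token budget exhausted")
    # response = call_llm(prompt)  # the actual LLM call goes here
    budget.spend(est)  # in practice, spend the real count from the API response
    return est
```

For accurate counts before sending, use your provider's tokenizer rather than the character heuristic.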
## Pattern 4: Prompt Versioning
Your prompts are code. Version them:
```yaml
# prompts/summarize/v2.yaml
version: 2
model: gpt-4
temperature: 0.3
system: |
  You are a precise summarizer. Extract key points only.
  Format: bullet points, max 5 items.
  Never add information not in the source.
user_template: |
  Summarize this document:

  {document}

# prompts/summarize/v1.yaml (deprecated)
version: 1
model: gpt-3.5-turbo
# ... old prompt
```
Load prompts dynamically. A/B test versions. Roll back when something breaks. Your prompts evolve faster than your code—treat them accordingly.
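A minimal loader sketch for specs shaped like the YAML above, assuming PyYAML and the `system`/`user_template` keys shown; `load_prompt` and `render_messages` are illustrative names, not an established API:

```python
def load_prompt(path: str) -> dict:
    """Load a versioned prompt spec from YAML."""
    import yaml  # lazy import; PyPI package: PyYAML
    with open(path) as f:
        return yaml.safe_load(f)

def render_messages(spec: dict, **variables) -> list[dict]:
    """Turn a prompt spec into chat messages, filling the user template."""
    return [
        {"role": "system", "content": spec["system"]},
        {"role": "user", "content": spec["user_template"].format(**variables)},
    ]
```

Switching versions then means changing a path (`prompts/summarize/v2.yaml` vs `v1.yaml`), which is easy to A/B test or roll back.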
## Pattern 5: Structured Output Validation
LLMs lie about their output format. Validate aggressively:
```python
from pydantic import BaseModel, ValidationError
import json

class ParseError(Exception):
    pass

class ExtractedData(BaseModel):
    name: str
    confidence: float
    categories: list[str]

def extract_with_validation(text: str, retries: int = 3) -> ExtractedData:
    """Extract structured data with retry on parse failure."""
    prompt = f"Extract data as JSON: {text}"
    for attempt in range(retries):
        response = call_llm(prompt)
        try:
            # Strip markdown code fences if present
            clean = response.strip().removeprefix("```json").removesuffix("```")
            data = json.loads(clean)
            return ExtractedData(**data)
        except (json.JSONDecodeError, ValidationError):
            if attempt == retries - 1:
                raise ParseError(f"Failed to parse after {retries} attempts")
            # Retry with an explicit format reminder
            prompt = f"Return ONLY valid JSON, no prose. Extract data as JSON: {text}"
```
Assume every response might be malformed. Parse, validate, retry. Log failures for prompt improvement.
## Pattern 6: Request Coalescing
If multiple users request the same thing simultaneously, don’t pay for it twice:
```python
import asyncio

_inflight: dict[str, asyncio.Future] = {}

async def coalesced_completion(prompt_hash: str, prompt: str) -> str:
    """Coalesce identical concurrent requests into one upstream call."""
    if prompt_hash in _inflight:
        # Another caller already started this request; wait for its result
        return await _inflight[prompt_hash]
    future = asyncio.get_running_loop().create_future()
    _inflight[prompt_hash] = future
    try:
        result = await call_llm(prompt)
        future.set_result(result)
        return result
    except Exception as e:
        future.set_exception(e)
        raise
    finally:
        del _inflight[prompt_hash]
```
For non-personalized requests (summaries, translations of the same text), this can cut costs significantly.
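One way to derive the coalescing key is to hash the model and prompt together, so identical requests collide and everything else stays distinct; `prompt_key` here is an illustrative helper, not part of any library:

```python
import hashlib

def prompt_key(model: str, prompt: str) -> str:
    """Stable key: same model + prompt always maps to the same digest."""
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
```

The same key also works for a persistent response cache, which extends the savings beyond concurrent duplicates.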
## Pattern 7: Async Streaming for UX
Users hate waiting. Stream responses:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    """Stream LLM response for better perceived performance."""
    stream = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
First token in 200ms feels faster than complete response in 3 seconds, even if total time is the same.
## Monitoring That Matters
Track these metrics:
```python
from prometheus_client import Counter, Histogram

# Per-model metrics
llm_request_duration_seconds = Histogram(
    "llm_request_duration_seconds",
    "Time spent waiting for LLM response",
    ["model", "endpoint"],
)
llm_tokens_used = Counter(
    "llm_tokens_used_total",
    "Total tokens consumed",
    ["model", "type"],  # type: prompt|completion
)
llm_request_cost_dollars = Counter(
    "llm_request_cost_dollars",
    "Estimated cost in USD",
    ["model"],
)
```
Alert on cost anomalies, not just errors. A bug that works perfectly but calls GPT-4 in a loop will kill your budget before it kills your uptime.
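Feeding the cost counter takes a price table keyed by model. The numbers below are illustrative only — check your provider's current price sheet before relying on them:

```python
# Illustrative per-1K-token prices in USD; these are assumptions, not current rates.
PRICES_PER_1K = {
    "gpt-4": {"prompt": 0.03, "completion": 0.06},
    "gpt-3.5-turbo": {"prompt": 0.0015, "completion": 0.002},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of one request, for the cost counter above."""
    p = PRICES_PER_1K[model]
    return (prompt_tokens / 1000) * p["prompt"] + \
           (completion_tokens / 1000) * p["completion"]
```

Record this per request and alert when the hourly sum deviates sharply from baseline.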
## The Bottom Line
Production LLM integration is infrastructure work, not AI magic. The same patterns that make any external API reliable—circuit breakers, retries, fallbacks, budgets—apply here with extra emphasis on cost control.
Build boring, defensive systems. Let the AI be the exciting part.