Building toy demos with LLM APIs is easy. Building production systems that handle real traffic, fail gracefully, and don’t bankrupt you? That’s where it gets interesting.

The Reality of Production LLM Integration

Most tutorials show you how to curl an API and celebrate. Real systems need to handle:

  • API rate limits and throttling
  • Transient failures and retries
  • Cost explosion from runaway loops
  • Latency variance (100ms to 30s responses)
  • Model version changes breaking prompts
  • Inputs that exceed model token limits

Let’s look at patterns that actually work.

Pattern 1: Circuit Breakers for LLM Calls

LLM APIs have bad days. When they do, you don’t want every request in your system waiting 60 seconds to timeout.

from circuitbreaker import circuit
from openai import OpenAI

client = OpenAI()

@circuit(failure_threshold=5, recovery_timeout=30)
def call_llm(prompt: str, model: str = "gpt-4") -> str:
    """LLM call with circuit breaker protection."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        timeout=30,
    )
    return response.choices[0].message.content

After 5 failures, the circuit opens. Calls fail fast for 30 seconds, giving the API time to recover. Your users get quick errors instead of hanging requests.
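If you'd rather not pull in a library, the mechanics are small enough to hand-roll. A minimal stdlib-only sketch of the same state machine (closed, open, half-open after the recovery window):

```python
import time

class SimpleBreaker:
    """Minimal circuit breaker: closed -> open after N consecutive
    failures, then one trial call (half-open) after the recovery timeout."""

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise RuntimeError("circuit open: failing fast")
            # Recovery window elapsed: allow one trial call (half-open).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        else:
            # Success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return result
```

The library version adds per-exception filtering and decorator sugar, but the core is just this counter and timestamp.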

Pattern 2: Tiered Model Fallback

Not every request needs GPT-4. Build a cascade:

MODELS = [
    {"name": "gpt-4", "cost": 0.03, "quality": "high"},
    {"name": "gpt-3.5-turbo", "cost": 0.002, "quality": "medium"},
    {"name": "local-llama", "cost": 0.0, "quality": "basic"},
]

async def smart_completion(prompt: str, min_quality: str = "basic") -> str:
    """Try models in order, falling back on failure or rate limits."""
    for model in MODELS:
        if quality_rank(model["quality"]) < quality_rank(min_quality):
            continue
        try:
            return await call_model(model["name"], prompt)
        except RateLimitError:
            continue
        except APIError as e:
            log.warning(f"{model['name']} failed: {e}")
            continue
    
    raise AllModelsExhausted("No available models could handle request")

This gives you graceful degradation. If your primary model is rate-limited, fall back to cheaper options rather than failing completely.
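The `quality_rank` helper above is assumed rather than shown; a minimal version just maps tiers to comparable integers:

```python
# Ordered worst-to-best so that index position doubles as rank.
QUALITY_ORDER = ["basic", "medium", "high"]

def quality_rank(quality: str) -> int:
    """Map a quality tier to a comparable integer (higher = better)."""
    return QUALITY_ORDER.index(quality)
```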

Pattern 3: Token Budget Management

The fastest way to burn money is an infinite loop calling GPT-4. Protect yourself:

from dataclasses import dataclass, field
from datetime import datetime, timedelta

class BudgetExceeded(Exception):
    pass

@dataclass
class TokenBudget:
    daily_limit: int = 1_000_000
    hourly_limit: int = 100_000
    per_request_limit: int = 4000
    
    _daily_used: int = 0
    _hourly_used: int = 0
    _hour_start: datetime = field(default_factory=datetime.utcnow)
    
    def _maybe_reset_hour(self):
        """Roll the hourly counter over when the hour window elapses."""
        if datetime.utcnow() - self._hour_start >= timedelta(hours=1):
            self._hourly_used = 0
            self._hour_start = datetime.utcnow()
    
    def can_spend(self, tokens: int) -> bool:
        self._maybe_reset_hour()
        return (
            tokens <= self.per_request_limit and
            self._daily_used + tokens <= self.daily_limit and
            self._hourly_used + tokens <= self.hourly_limit
        )
    
    def spend(self, tokens: int):
        if not self.can_spend(tokens):
            raise BudgetExceeded("Would exceed token budget")
        self._daily_used += tokens
        self._hourly_used += tokens

Track budgets at the application level. Don’t rely on API-side limits alone—by the time you hit those, you’ve already spent the money.
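Wiring the budget into the call path means estimating tokens before the request goes out. A rough pre-call gate, sketched with a crude ~4-characters-per-token heuristic (use a real tokenizer like tiktoken for accurate counts; `guarded_call` and its arguments are illustrative names):

```python
def estimate_tokens(text: str) -> int:
    """Very rough heuristic: ~4 characters per token for English text.
    Swap in a real tokenizer (e.g. tiktoken) for accurate counts."""
    return max(1, len(text) // 4)

def guarded_call(budget, call_fn, prompt: str, max_completion: int = 500) -> str:
    """Check the budget before calling; record actual usage after.
    `budget` is anything with the can_spend/spend interface above."""
    estimate = estimate_tokens(prompt) + max_completion
    if not budget.can_spend(estimate):
        raise RuntimeError("token budget exhausted")
    result = call_fn(prompt)
    # Record what was actually used, not the pessimistic estimate.
    budget.spend(estimate_tokens(prompt) + estimate_tokens(result))
    return result
```

Estimating pessimistically on the way in and recording actual usage on the way out keeps the gate conservative without over-charging the budget.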

Pattern 4: Prompt Versioning

Your prompts are code. Version them:

# prompts/summarize/v2.yaml
version: 2
model: gpt-4
temperature: 0.3
system: |
  You are a precise summarizer. Extract key points only.
  Format: bullet points, max 5 items.
  Never add information not in the source.
user_template: |
  Summarize this document:
  
  {document}

# prompts/summarize/v1.yaml (deprecated)
version: 1
model: gpt-3.5-turbo
# ... old prompt

Load prompts dynamically. A/B test versions. Roll back when something breaks. Your prompts evolve faster than your code—treat them accordingly.
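What "load dynamically" looks like in code: the YAML files above parse (e.g. via yaml.safe_load) into dicts, and a small registry renders them into request kwargs. A stdlib-only sketch (the registry shape and `render_prompt` name are assumptions):

```python
# In practice this dict would be built by parsing prompts/*/*.yaml.
PROMPTS = {
    ("summarize", 2): {
        "model": "gpt-4",
        "temperature": 0.3,
        "system": "You are a precise summarizer. Extract key points only.",
        "user_template": "Summarize this document:\n\n{document}",
    },
}

def render_prompt(name: str, version: int, **kwargs) -> dict:
    """Look up a versioned prompt and fill in its template variables,
    returning kwargs ready for the chat completions API."""
    spec = PROMPTS[(name, version)]
    return {
        "model": spec["model"],
        "temperature": spec["temperature"],
        "messages": [
            {"role": "system", "content": spec["system"]},
            {"role": "user", "content": spec["user_template"].format(**kwargs)},
        ],
    }
```

Because the version is an explicit argument, rolling back is a one-line change, and A/B testing is a coin flip over two version numbers.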

Pattern 5: Structured Output Validation

LLMs lie about their output format. Validate aggressively:

from pydantic import BaseModel, ValidationError
import json

class ParseError(Exception):
    pass

class ExtractedData(BaseModel):
    name: str
    confidence: float
    categories: list[str]

def extract_with_validation(text: str, retries: int = 3) -> ExtractedData:
    """Extract structured data with retry on parse failure."""
    prompt = f"Extract data as JSON: {text}"
    for attempt in range(retries):
        response = call_llm(prompt)
        try:
            # Strip markdown code fences if present
            clean = response.strip().removeprefix("```json").removesuffix("```")
            data = json.loads(clean)
            return ExtractedData(**data)
        except (json.JSONDecodeError, ValidationError):
            if attempt == retries - 1:
                raise ParseError(f"Failed to parse after {retries} attempts")
            # Retry with an explicit format reminder appended to the prompt
            prompt += "\nRespond with raw JSON only, no markdown fences."

Assume every response might be malformed. Parse, validate, retry. Log failures for prompt improvement.
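Stripping fences by prefix is brittle (models add whitespace, bare ``` fences, or chatter around the JSON). A slightly more tolerant extractor, as a sketch:

```python
import json
import re

def extract_json(response: str):
    """Pull the first JSON object out of an LLM response, tolerating
    markdown fences and surrounding prose."""
    # Prefer the contents of a fenced block if one exists.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", response, re.DOTALL)
    candidate = fenced.group(1) if fenced else response
    # Fall back to the first {...} span in the candidate text.
    brace = re.search(r"\{.*\}", candidate, re.DOTALL)
    if brace:
        candidate = brace.group(0)
    return json.loads(candidate)
```

This still fails on genuinely malformed output, which is the point: fail loudly, then retry.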

Pattern 6: Request Coalescing

If multiple users request the same thing simultaneously, don’t pay for it twice:

import asyncio

_inflight: dict[str, asyncio.Future] = {}

async def coalesced_completion(prompt_hash: str, prompt: str) -> str:
    """Coalesce identical concurrent requests."""
    if prompt_hash in _inflight:
        # An identical request is already in flight; wait on its
        # result instead of paying for a duplicate call.
        return await _inflight[prompt_hash]
    
    future = asyncio.get_running_loop().create_future()
    _inflight[prompt_hash] = future
    
    try:
        result = await call_llm(prompt)
        future.set_result(result)
        return result
    except Exception as e:
        future.set_exception(e)
        raise
    finally:
        del _inflight[prompt_hash]

For non-personalized requests (summaries, translations of the same text), this can cut costs significantly.
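The `prompt_hash` key is assumed to be a stable digest of everything that determines the response; for example:

```python
import hashlib

def prompt_key(model: str, prompt: str) -> str:
    """Stable coalescing key: same model + prompt => same key.
    Include anything else that changes the output (temperature, version)."""
    return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()
```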

Pattern 7: Async Streaming for UX

Users hate waiting. Stream responses:

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def stream_response(prompt: str):
    """Stream LLM response for better perceived performance."""
    stream = await client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    async for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

First token in 200ms feels faster than complete response in 3 seconds, even if total time is the same.
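On the consumer side, the pattern is the same regardless of framework: forward each chunk as it arrives instead of buffering the full reply. A stdlib-only sketch with a stubbed stream standing in for the API:

```python
import asyncio

async def fake_stream(text: str, chunk_size: int = 4):
    """Stand-in for the real API stream: yields text in small chunks."""
    for i in range(0, len(text), chunk_size):
        await asyncio.sleep(0)  # simulate waiting on the network
        yield text[i:i + chunk_size]

async def consume(prompt: str) -> str:
    """Forward chunks as they arrive instead of waiting for the full reply."""
    parts = []
    async for chunk in fake_stream(f"echo: {prompt}"):
        parts.append(chunk)  # in a real app: flush this to the client here
    return "".join(parts)
```

In a web service the `parts.append` line becomes an SSE or WebSocket write; the structure of the loop doesn't change.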

Monitoring That Matters

Track these metrics:

from prometheus_client import Counter, Histogram

# Per-model metrics
llm_request_duration_seconds = Histogram(
    "llm_request_duration_seconds",
    "Time spent waiting for LLM response",
    ["model", "endpoint"]
)

llm_tokens_used = Counter(
    "llm_tokens_used_total",
    "Total tokens consumed",
    ["model", "type"]  # type: prompt|completion
)

llm_request_cost_dollars = Counter(
    "llm_request_cost_dollars",
    "Estimated cost in USD",
    ["model"]
)

Alert on cost anomalies, not just errors. A bug that works perfectly but calls GPT-4 in a loop will kill your budget before it kills your uptime.
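Anomaly alerting can start simple. A crude in-process spike detector, as a sketch (the window size and spike factor are made-up thresholds to tune):

```python
from collections import deque

class SpendMonitor:
    """Track recent per-minute spend and flag sudden spikes
    against the rolling baseline."""

    def __init__(self, window: int = 60, spike_factor: float = 5.0):
        self.samples = deque(maxlen=window)  # dollars spent per minute
        self.spike_factor = spike_factor

    def record(self, dollars: float) -> bool:
        """Record one minute of spend; return True if it looks anomalous."""
        baseline = sum(self.samples) / len(self.samples) if self.samples else None
        self.samples.append(dollars)
        return baseline is not None and dollars > baseline * self.spike_factor
```

In production you'd express the same check as an alerting rule over the cost counter above, but the logic is identical: compare current spend rate to a recent baseline, not to a fixed number.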

The Bottom Line

Production LLM integration is infrastructure work, not AI magic. The same patterns that make any external API reliable—circuit breakers, retries, fallbacks, budgets—apply here with extra emphasis on cost control.

Build boring, defensive systems. Let the AI be the exciting part.