You’ve got your OpenAI or Anthropic API key. The hello-world example works. Now you need to put it in production and suddenly everything is different.

LLM APIs have characteristics that break standard integration patterns: high latency, unpredictable response times, token-based billing, and outputs that can vary wildly for the same input. Here’s what actually works.

The Unique Challenges

Traditional API calls return in milliseconds. LLM calls can take 5-30 seconds. Traditional APIs have predictable costs per call. LLM costs depend on input and output length — and you don’t control the output.

This changes everything about how you build.

Pattern 1: Streaming by Default

Waiting 20 seconds for a response feels broken. Streaming the same response token-by-token feels responsive.

import anthropic

client = anthropic.Anthropic()

def stream_response(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            yield text

# Usage: show tokens as they arrive
for chunk in stream_response("Explain quantum computing"):
    print(chunk, end="", flush=True)

For web apps, pipe this directly to Server-Sent Events or WebSockets. Users see progress immediately instead of staring at a spinner.
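A minimal sketch of that SSE framing, framework-agnostic: `fake_stream` stands in for the `stream_response()` generator above, and in a real app you would hand `sse_events()` to your framework's streaming response (e.g. `StreamingResponse` in FastAPI or `stream_with_context` in Flask — both assumptions, not from this post).

```python
from typing import Iterable, Iterator

def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    # Each SSE frame is "data: <payload>\n\n"; a sentinel frame tells
    # the client the stream is finished.
    for token in tokens:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

def fake_stream():
    # Stand-in for stream_response(); yields tokens as they arrive.
    yield from ["Quantum ", "computing ", "explained."]

frames = list(sse_events(fake_stream()))
```

The sentinel frame matters: without it, clients can't distinguish a finished response from a dropped connection.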

Pattern 2: Timeout + Retry with Exponential Backoff

LLM providers have rate limits and occasional outages. Build for it.

import time
import random
from functools import wraps

from anthropic import RateLimitError, APIConnectionError

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except (RateLimitError, APIConnectionError) as e:
                    last_exception = e
                    if attempt < max_retries - 1:
                        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                        time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def call_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

Key insight: only retry on transient errors (rate limits, connection issues). Don’t retry on validation errors or content policy violations.
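To make that split explicit, here's a sketch of a retry predicate. The exception classes are stand-ins for the SDK's `RateLimitError` / `APIConnectionError` / `BadRequestError` (assumption: your SDK raises distinct types for each failure mode); the classification logic is the point.

```python
class RateLimitError(Exception): ...
class APIConnectionError(Exception): ...
class BadRequestError(Exception): ...

# Errors that can resolve on their own with time.
TRANSIENT = (RateLimitError, APIConnectionError)

def should_retry(exc: Exception) -> bool:
    # A rate limit or dropped connection may succeed on the next try;
    # a malformed request or policy violation fails identically every time.
    return isinstance(exc, TRANSIENT)
```

Wiring this into the decorator means permanent failures surface immediately instead of burning three retries first.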

Pattern 3: Budget Guards

LLM costs can spike unexpectedly. A runaway loop or a prompt injection that generates maximum tokens can get expensive fast.

import time

class BudgetExceededError(Exception):
    """Raised when a call would push usage past the hourly cap."""

class BudgetGuard:
    def __init__(self, max_tokens_per_hour: int = 100000):
        self.max_tokens = max_tokens_per_hour
        self.tokens_used = 0
        self.window_start = time.time()
    
    def check_and_record(self, input_tokens: int, output_tokens: int):
        # Reset window if needed
        if time.time() - self.window_start > 3600:
            self.tokens_used = 0
            self.window_start = time.time()
        
        total = input_tokens + output_tokens
        if self.tokens_used + total > self.max_tokens:
            raise BudgetExceededError(
                f"Would exceed hourly budget: {self.tokens_used + total} > {self.max_tokens}"
            )
        self.tokens_used += total

budget = BudgetGuard(max_tokens_per_hour=50000)

def guarded_call(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    # Recording after the call means the budget can overshoot by at most
    # one response; check before the call with an estimate if that matters.
    budget.check_and_record(
        response.usage.input_tokens,
        response.usage.output_tokens
    )
    return response.content[0].text

Pattern 4: Structured Output Validation

LLMs can return anything. If you need JSON, validate it.

import json
from pydantic import BaseModel, ValidationError

class ExtractedData(BaseModel):
    name: str
    email: str
    sentiment: str

def extract_with_validation(text: str, max_attempts: int = 3) -> ExtractedData:
    prompt = f"""Extract the following from this text and return valid JSON:
- name: string
- email: string  
- sentiment: positive, negative, or neutral

Text: {text}

Return ONLY valid JSON, no other text."""

    for attempt in range(max_attempts):
        response = call_llm(prompt)
        try:
            # Try to parse JSON from response
            data = json.loads(response.strip())
            return ExtractedData(**data)
        except (json.JSONDecodeError, ValidationError) as e:
            if attempt == max_attempts - 1:
                raise ValueError(f"Failed to get valid output after {max_attempts} attempts")
            # Retry with more explicit instruction
            prompt += "\n\nPrevious response was invalid. Return ONLY the JSON object."
    
    raise ValueError("Extraction failed")

Pattern 5: Fallback Chains

Don’t depend on a single provider. Build fallback chains.

import logging

logger = logging.getLogger(__name__)

class LLMChain:
    def __init__(self):
        self.providers = [
            ("anthropic", self._call_anthropic),
            ("openai", self._call_openai),
        ]
    
    def _call_anthropic(self, prompt: str) -> str:
        # Primary provider
        ...
    
    def _call_openai(self, prompt: str) -> str:
        # Fallback provider
        ...
    
    def call(self, prompt: str) -> str:
        last_error = None
        for name, provider_fn in self.providers:
            try:
                return provider_fn(prompt)
            except Exception as e:
                last_error = e
                logger.warning(f"{name} failed: {e}, trying next")
                continue
        raise last_error

chain = LLMChain()
response = chain.call("Summarize this document...")

Pattern 6: Async for Parallelism

When you need multiple LLM calls, run them in parallel.

import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()

async def analyze_batch(texts: list[str]) -> list[str]:
    async def analyze_one(text: str) -> str:
        response = await async_client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": f"Summarize: {text}"}]
        )
        return response.content[0].text
    
    return await asyncio.gather(*[analyze_one(t) for t in texts])

# All calls run concurrently instead of one at a time
results = asyncio.run(analyze_batch(documents))
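
Unbounded `gather` over a large batch will trip the rate limits from Pattern 2, so in practice you want to cap concurrency. Here's a sketch using a semaphore: `fake_analyze` stands in for the `analyze_one()` coroutine above, and the limit of 5 is an arbitrary default — tune it to your provider's rate tier.

```python
import asyncio

async def gather_capped(texts: list[str], worker, limit: int = 5) -> list[str]:
    sem = asyncio.Semaphore(limit)

    async def one(text: str) -> str:
        async with sem:  # at most `limit` coroutines past this point
            return await worker(text)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*[one(t) for t in texts])

async def fake_analyze(text: str) -> str:
    await asyncio.sleep(0)  # stands in for the real API call
    return text.upper()

results = asyncio.run(gather_capped(["a", "b", "c"], fake_analyze, limit=2))
```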

The Meta-Pattern: Observability

Log everything. Token counts, latencies, retry counts, costs. You can’t optimize what you can’t measure.

import logging
import time

logger = logging.getLogger(__name__)

def observed_call(prompt: str) -> str:
    start = time.time()
    try:
        response = client.messages.create(...)
        latency = time.time() - start
        logger.info(
            "llm_call",
            extra={
                "latency_ms": latency * 1000,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "model": "claude-sonnet-4-20250514",
                "cost_usd": calculate_cost(response.usage),
            }
        )
        return response.content[0].text
    except Exception as e:
        logger.error("llm_call_failed", extra={"error": str(e)})
        raise
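
A sketch of the `calculate_cost()` helper referenced above. The per-million-token prices are illustrative placeholders — check your provider's current pricing page before hardcoding numbers like these.

```python
INPUT_PRICE_PER_MTOK = 3.00    # USD per million input tokens (assumed rate)
OUTPUT_PRICE_PER_MTOK = 15.00  # USD per million output tokens (assumed rate)

def calculate_cost(usage) -> float:
    # `usage` carries input_tokens / output_tokens, mirroring the
    # response.usage object returned by the SDK.
    return (
        usage.input_tokens * INPUT_PRICE_PER_MTOK
        + usage.output_tokens * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000
```

Keeping prices in named constants (or config) makes it trivial to update them when the provider changes rates, and lets dashboards aggregate `cost_usd` across models.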

Reality Check

LLM APIs are powerful but different. Treat them like you’d treat any unreliable external dependency: with timeouts, retries, fallbacks, and budgets. The patterns above aren’t optional hardening — they’re table stakes for production use.

Start with streaming and budget guards. Add the rest as you scale.


Ship it, measure it, iterate. Your integration will thank you.