You’ve got your OpenAI or Anthropic API key. The hello-world example works. Now you need to put it in production and suddenly everything is different.
LLM APIs have characteristics that break standard integration patterns: high latency, unpredictable response times, token-based billing, and outputs that can vary wildly for the same input. Here’s what actually works.
The Unique Challenges
Traditional API calls return in milliseconds. LLM calls can take 5-30 seconds. Traditional APIs have predictable costs per call. LLM costs depend on input and output length — and you don’t control the output.
This changes everything about how you build.
Pattern 1: Streaming by Default
Waiting 20 seconds for a response feels broken. Streaming the same response token-by-token feels responsive.
```python
import anthropic

client = anthropic.Anthropic()

def stream_response(prompt: str):
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    ) as stream:
        for text in stream.text_stream:
            yield text

# Usage: show tokens as they arrive
for chunk in stream_response("Explain quantum computing"):
    print(chunk, end="", flush=True)
```
For web apps, pipe this directly to Server-Sent Events or WebSockets. Users see progress immediately instead of staring at a spinner.
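The SSE framing itself is simple enough to sketch without a framework: each message is one or more `data:` lines followed by a blank line. This generator-wrapping helper is illustrative; the `[DONE]` sentinel is a common convention in streaming LLM APIs, not part of the SSE spec.

```python
def sse_frames(token_stream):
    """Wrap a token generator in Server-Sent Events framing.

    Each SSE message is one or more `data:` lines followed by a blank
    line; a final `[DONE]` sentinel tells the client to stop listening.
    """
    for token in token_stream:
        # Newlines inside a token must become separate data: lines,
        # otherwise they would terminate the SSE message early.
        for line in token.split("\n"):
            yield f"data: {line}\n"
        yield "\n"
    yield "data: [DONE]\n\n"

# Any framework that accepts an iterator of strings can serve this
# with Content-Type: text/event-stream.
```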
Pattern 2: Timeout + Retry with Exponential Backoff
LLM providers have rate limits and occasional outages. Build for it.
```python
import time
import random
from functools import wraps

from anthropic import APIConnectionError, RateLimitError

def retry_with_backoff(max_retries=3, base_delay=1):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except (RateLimitError, APIConnectionError) as e:
                    last_exception = e
                    if attempt < max_retries - 1:
                        # Exponential backoff with jitter to avoid
                        # synchronized retry storms
                        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
                        time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def call_llm(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Key insight: only retry on transient errors (rate limits, connection issues). Don’t retry on validation errors or content policy violations.
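If your error handling works at the HTTP layer rather than with SDK exception types, the same rule can be expressed over status codes. The specific codes below are a common convention, not a provider guarantee:

```python
def should_retry(status_code: int) -> bool:
    """Retry only statuses that can plausibly succeed on a second attempt."""
    # 408 request timeout and 429 rate limited are transient by definition;
    # 5xx means the failure was on the server's side.
    return status_code in (408, 429) or 500 <= status_code < 600

# 400 (malformed request) and 403 (blocked content) will fail identically
# on every retry -- repeating them only burns budget.
```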
Pattern 3: Budget Guards
LLM costs can spike unexpectedly. A runaway loop or a prompt injection that generates maximum tokens can get expensive fast.
```python
import time

class BudgetExceededError(Exception):
    pass

class BudgetGuard:
    def __init__(self, max_tokens_per_hour: int = 100000):
        self.max_tokens = max_tokens_per_hour
        self.tokens_used = 0
        self.window_start = time.time()

    def check_and_record(self, input_tokens: int, output_tokens: int):
        # Reset the window if an hour has passed
        if time.time() - self.window_start > 3600:
            self.tokens_used = 0
            self.window_start = time.time()
        total = input_tokens + output_tokens
        if self.tokens_used + total > self.max_tokens:
            raise BudgetExceededError(
                f"Would exceed hourly budget: {self.tokens_used + total} > {self.max_tokens}"
            )
        self.tokens_used += total

budget = BudgetGuard(max_tokens_per_hour=50000)

def guarded_call(prompt: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )
    budget.check_and_record(
        response.usage.input_tokens,
        response.usage.output_tokens
    )
    return response.content[0].text
```
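Note that this guard records usage after the response returns, so a single oversized call can still slip past the limit. A stricter variant reserves a pessimistic estimate before the call and settles up afterwards. The sketch below is one way to do that; the four-characters-per-token heuristic is an assumption, not a tokenizer guarantee:

```python
import time

class PreflightBudgetGuard:
    """Reserve an estimated token count before the call, correct it after."""

    def __init__(self, max_tokens_per_hour: int = 100_000):
        self.max_tokens = max_tokens_per_hour
        self.tokens_used = 0
        self.window_start = time.time()

    def _maybe_reset(self):
        if time.time() - self.window_start > 3600:
            self.tokens_used = 0
            self.window_start = time.time()

    def reserve(self, prompt: str, max_output_tokens: int) -> int:
        self._maybe_reset()
        # Rough heuristic: ~4 characters per input token, plus the
        # worst-case output length you asked for.
        estimate = len(prompt) // 4 + max_output_tokens
        if self.tokens_used + estimate > self.max_tokens:
            raise RuntimeError("would exceed hourly token budget")
        self.tokens_used += estimate
        return estimate

    def settle(self, estimate: int, actual_total: int):
        # Replace the reservation with the real usage from the response
        self.tokens_used += actual_total - estimate
```

Call `reserve()` before `client.messages.create(...)` and `settle()` with the real usage afterwards; a call that would blow the budget never leaves the building.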
Pattern 4: Structured Output Validation
LLMs can return anything. If you need JSON, validate it.
```python
import json
from pydantic import BaseModel, ValidationError

class ExtractedData(BaseModel):
    name: str
    email: str
    sentiment: str

def extract_with_validation(text: str, max_attempts: int = 3) -> ExtractedData:
    prompt = f"""Extract the following from this text and return valid JSON:
- name: string
- email: string
- sentiment: positive, negative, or neutral

Text: {text}

Return ONLY valid JSON, no other text."""
    for attempt in range(max_attempts):
        response = call_llm(prompt)
        try:
            # Try to parse JSON from the response
            data = json.loads(response.strip())
            return ExtractedData(**data)
        except (json.JSONDecodeError, ValidationError):
            if attempt == max_attempts - 1:
                raise ValueError(f"Failed to get valid output after {max_attempts} attempts")
            # Retry with a more explicit instruction
            prompt += "\n\nPrevious response was invalid. Return ONLY the JSON object."
    raise ValueError("Extraction failed")
```
Pattern 5: Fallback Chains
Don’t depend on a single provider. Build fallback chains.
```python
import logging

logger = logging.getLogger(__name__)

class LLMChain:
    def __init__(self):
        # Ordered by preference: the first provider that succeeds wins
        self.providers = [
            ("anthropic", self._call_anthropic),
            ("openai", self._call_openai),
        ]

    def _call_anthropic(self, prompt: str) -> str:
        # Primary provider
        ...

    def _call_openai(self, prompt: str) -> str:
        # Fallback provider
        ...

    def call(self, prompt: str) -> str:
        last_error = None
        for name, provider_fn in self.providers:
            try:
                return provider_fn(prompt)
            except Exception as e:
                last_error = e
                logger.warning(f"{name} failed: {e}, trying next")
                continue
        raise last_error

chain = LLMChain()
response = chain.call("Summarize this document...")
```
Pattern 6: Async for Parallelism
When you need multiple independent LLM calls, run them in parallel instead of sequentially.
```python
import asyncio
import anthropic

async_client = anthropic.AsyncAnthropic()

async def analyze_batch(texts: list[str]) -> list[str]:
    async def analyze_one(text: str) -> str:
        response = await async_client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=256,
            messages=[{"role": "user", "content": f"Summarize: {text}"}]
        )
        return response.content[0].text

    return await asyncio.gather(*[analyze_one(t) for t in texts])

# 10 calls in parallel instead of sequential
results = asyncio.run(analyze_batch(documents))
```
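One caveat: `asyncio.gather` fires every request at once, which can trip the very rate limits Pattern 2 guards against. A common refinement is to cap in-flight calls with a semaphore. This sketch uses a stand-in worker so it runs anywhere; swap in `analyze_one` from above:

```python
import asyncio

async def bounded_gather(items, worker, limit: int = 5):
    """Run `worker` over `items`, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*[run_one(i) for i in items])

# Stand-in worker simulating an API call
async def demo_worker(n: int) -> int:
    await asyncio.sleep(0)
    return n * 2

results = asyncio.run(bounded_gather([1, 2, 3], demo_worker, limit=2))
print(results)  # [2, 4, 6]
```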
Pattern 7: Observability
Log everything: token counts, latencies, retry counts, costs. You can’t optimize what you can’t measure.
```python
import logging
import time

logger = logging.getLogger(__name__)

def observed_call(prompt: str) -> str:
    start = time.time()
    try:
        response = client.messages.create(...)  # full call elided
        latency = time.time() - start
        logger.info(
            "llm_call",
            extra={
                "latency_ms": latency * 1000,
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "model": "claude-sonnet-4-20250514",
                "cost_usd": calculate_cost(response.usage),
            },
        )
        return response.content[0].text
    except Exception as e:
        logger.error("llm_call_failed", extra={"error": str(e)})
        raise
```
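The `calculate_cost` helper above is assumed to exist elsewhere. A minimal sketch might look like this; the per-million-token prices are placeholders (check your provider's pricing page for real rates), and `Usage` is a stand-in for the SDK's usage object:

```python
from dataclasses import dataclass

# Placeholder prices in USD per million tokens -- substitute your
# model's actual rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00

@dataclass
class Usage:
    # Minimal stand-in for the SDK's response.usage object
    input_tokens: int
    output_tokens: int

def calculate_cost(usage: Usage) -> float:
    """USD cost of one call, derived from its token usage."""
    return round(
        usage.input_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK
        + usage.output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MTOK,
        6,
    )

print(calculate_cost(Usage(input_tokens=1200, output_tokens=300)))
```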
Reality Check
LLM APIs are powerful but different. Treat them like you’d treat any unreliable external dependency: with timeouts, retries, fallbacks, and budgets. The patterns above aren’t optional hardening — they’re table stakes for production use.
Start with streaming and budget guards. Add the rest as you scale.
Ship it, measure it, iterate. Your integration will thank you.