Adding an LLM to your application sounds simple: call the API, get a response, display it. In practice, you’re dealing with rate limits, token costs, latency spikes, and outputs that occasionally make no sense.

These patterns help build LLM features that are reliable, cost-effective, and actually useful.

The Basic Call

Every LLM integration starts here:

from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content

This works for demos. Production needs more.

Structured Output

LLMs return strings. You need data:

from pydantic import BaseModel
import json

class ProductAnalysis(BaseModel):
    sentiment: str  # positive, negative, neutral
    key_points: list[str]
    confidence: float

def analyze_review(review: str) -> ProductAnalysis:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"""Analyze this product review and return JSON:
            {{
                "sentiment": "positive|negative|neutral",
                "key_points": ["point1", "point2"],
                "confidence": 0.0-1.0
            }}
            
            Review: {review}"""
        }],
        response_format={"type": "json_object"}
    )
    
    data = json.loads(response.choices[0].message.content)
    return ProductAnalysis(**data)

response_format={"type": "json_object"} guarantees syntactically valid JSON, but not your schema. Validate with Pydantic to catch missing or malformed fields.

Retry with Exponential Backoff

APIs fail. Rate limits hit. Networks hiccup:

import openai
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=60),
    retry=retry_if_exception_type((openai.RateLimitError, openai.InternalServerError)),
)
def call_llm_with_retry(messages: list) -> str:
    # Rate limits and 5xx server errors are retried with backoff;
    # client errors (bad request, auth) propagate immediately.
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages
    )
    return response.choices[0].message.content
Respect Retry-After headers when present.
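A minimal sketch of honoring that header: `make_request` is any zero-argument callable, and the `.response.headers` access matches how the OpenAI SDK attaches the HTTP response to its errors (adapt the attribute lookup for other clients).

```python
import time

def call_respecting_retry_after(make_request, max_attempts: int = 3):
    """Retry `make_request`, sleeping for Retry-After seconds when the
    raised error carries a response that exposes that header."""
    for attempt in range(max_attempts):
        try:
            return make_request()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise
            delay = 2 ** attempt  # exponential fallback when no header is present
            response = getattr(exc, "response", None)
            if response is not None:
                header = response.headers.get("Retry-After")
                if header and header.isdigit():
                    delay = int(header)
            time.sleep(delay)
```

The server's hint is usually more accurate than a blind backoff schedule, so prefer it when it's available.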

Caching Responses

Identical prompts don’t need fresh API calls:

import hashlib
import json
import redis

cache = redis.Redis()
CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt: str, **kwargs) -> str:
    # Create cache key from prompt and parameters
    cache_key = hashlib.sha256(
        f"{prompt}:{json.dumps(kwargs, sort_keys=True)}".encode()
    ).hexdigest()
    
    # Check cache
    cached = cache.get(f"llm:{cache_key}")
    if cached:
        return cached.decode()
    
    # Make API call
    response = call_llm_with_retry([{"role": "user", "content": prompt}])
    
    # Cache result
    cache.setex(f"llm:{cache_key}", CACHE_TTL, response)
    
    return response

Hit rates of 30-50% are common when users send repeated or templated queries; note that only byte-identical prompts hit this exact-match cache.

Token Management

Tokens cost money. Track and limit them:

import logging
import tiktoken

logger = logging.getLogger(__name__)

def count_tokens(text: str, model: str = "gpt-4") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

def truncate_to_tokens(text: str, max_tokens: int, model: str = "gpt-4") -> str:
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])

def call_with_budget(prompt: str, max_input_tokens: int = 2000) -> str:
    truncated = truncate_to_tokens(prompt, max_input_tokens)
    
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": truncated}],
        max_tokens=500  # Limit output too
    )
    
    # Log usage for monitoring
    usage = response.usage
    logger.info(f"Tokens used: {usage.prompt_tokens} in, {usage.completion_tokens} out")
    
    return response.choices[0].message.content

Streaming for Better UX

Users hate waiting. Stream responses:

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True
    )
    
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# In a web framework (FastAPI-style sketch; StreamingResponse comes from fastapi.responses)
@app.get("/chat")
async def chat(prompt: str):
    async def generate():
        for token in stream_response(prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
    
    return StreamingResponse(generate(), media_type="text/event-stream")

First token appears in ~200ms instead of waiting 2-5 seconds for the full response.

Fallback Models

Don’t let one provider’s outage break your feature:

MODELS = [
    ("openai", "gpt-4"),
    ("anthropic", "claude-3-sonnet"),
    ("openai", "gpt-3.5-turbo"),  # Cheaper fallback
]

def call_with_fallback(prompt: str) -> str:
    for provider, model in MODELS:
        try:
            if provider == "openai":
                return call_openai(prompt, model)
            elif provider == "anthropic":
                return call_anthropic(prompt, model)
        except Exception as e:
            logger.warning(f"{provider}/{model} failed: {e}")
            continue
    
    raise RuntimeError("All LLM providers failed")

Prompt Templates

Don’t embed prompts in code. Manage them separately:

# prompts/analyze_code.txt
You are a code reviewer. Analyze the following code for:
1. Security vulnerabilities
2. Performance issues
3. Code style problems

Code:
```{language}
{code}
```

Respond in JSON format: { "security": [...], "performance": [...], "style": [...] }

from pathlib import Path

def load_prompt(name: str, **kwargs) -> str:
    template = Path(f"prompts/{name}.txt").read_text()
    return template.format(**kwargs)

prompt = load_prompt("analyze_code", code=code, language=language)

Version control your prompts. A/B test variations.
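One way to keep A/B assignments stable, sketched here with hypothetical variant file names: hash the user id so each user is pinned to the same variant across sessions.

```python
import hashlib

PROMPT_VARIANTS = ["analyze_code_v1", "analyze_code_v2"]  # hypothetical variant files

def pick_variant(user_id: str) -> str:
    # Deterministic bucketing: the same user always lands in the same
    # variant, keeping experiment groups stable without storing state.
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return PROMPT_VARIANTS[digest % len(PROMPT_VARIANTS)]
```

Log the chosen variant alongside quality metrics so you can compare outcomes per prompt version.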

Guardrails

LLMs hallucinate. Validate outputs:

import re

def extract_urls(text: str) -> list[str]:
    prompt = f"Extract all URLs from this text. Return a JSON array: {text}"
    response = call_llm_with_retry([{"role": "user", "content": prompt}])
    
    urls = json.loads(response)
    
    # Validate: are these actually URLs?
    url_pattern = re.compile(r'https?://\S+')
    validated = [u for u in urls if url_pattern.match(u)]
    
    if len(validated) != len(urls):
        logger.warning(f"LLM returned {len(urls) - len(validated)} invalid URLs")
    
    return validated

Never trust LLM output for:

  • SQL queries (injection risk)
  • File paths (traversal risk)
  • Code execution (obvious risk)
  • Anything security-sensitive
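When the model's output has to drive behavior in any of those areas, map it onto an allowlist instead of using it directly. A minimal sketch, with a hypothetical set of permitted actions:

```python
ALLOWED_ACTIONS = {"summarize", "translate", "classify"}  # hypothetical allowlist

def safe_action(llm_choice: str) -> str:
    # Normalize, then check membership -- never pass raw model output
    # to anything that executes, queries, or touches the filesystem.
    action = llm_choice.strip().lower()
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"LLM suggested disallowed action: {action!r}")
    return action
```

Rejecting loudly beats silently sanitizing: a spike in ValueError rates tells you the prompt has drifted.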

Cost Monitoring

Track spend before it surprises you:

# Approximate costs per 1K tokens (check current pricing)
COSTS = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "claude-3-sonnet": {"input": 0.003, "output": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = COSTS.get(model, COSTS["gpt-3.5-turbo"])
    return (input_tokens / 1000 * prices["input"] + 
            output_tokens / 1000 * prices["output"])

# After each call
cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)
metrics.increment("llm_cost_cents", int(cost * 100), tags=[f"model:{model}"])

Set up alerts for daily spend thresholds.
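A minimal in-process sketch of such an alert; the budget value is an assumption, and in production the running total belongs in Redis or your metrics store rather than a module-level dict.

```python
from datetime import date

DAILY_BUDGET_USD = 50.0  # assumption: set your own threshold
_spend_by_day = {}  # date string -> running total; use shared storage in production

def record_spend(cost_usd: float, alert=print) -> float:
    # Accumulate today's total and fire the alert hook exactly once,
    # when the running total first crosses the budget.
    today = date.today().isoformat()
    previous = _spend_by_day.get(today, 0.0)
    _spend_by_day[today] = previous + cost_usd
    if previous < DAILY_BUDGET_USD <= _spend_by_day[today]:
        alert(f"LLM spend crossed ${DAILY_BUDGET_USD:.2f} today")
    return _spend_by_day[today]
```

Swap `alert=print` for your paging or Slack hook; the once-per-crossing logic keeps it from firing on every subsequent call.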


LLM APIs are powerful but unpredictable. Wrap them in retries, caches, and validation. Monitor costs obsessively. Have fallbacks ready.

The goal isn’t to call the API — it’s to build a reliable feature that happens to use an LLM. Every pattern here exists because something went wrong in production. Learn from others’ incidents.