LLM API Integration Patterns for Production Applications

Integrating LLMs into production applications is deceptively simple. Call an API, get text back. But building reliable, cost-effective systems requires more thought. Here are patterns that work at scale.

The Basic Call

Every LLM integration starts here:

```python
import openai

def complete(prompt: str) -> str:
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```

This works for prototypes. Production needs more.

Retry with Exponential Backoff

LLM APIs have rate limits and occasional failures: ...
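The excerpt cuts off before the retry code, but the pattern can be sketched with the standard library alone. A minimal, illustrative backoff wrapper (the names `with_backoff`, `base_delay`, and so on are illustrative, not from the post):

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Call fn(), retrying on failure with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error
            # delay doubles each attempt, capped, with random jitter
            delay = min(base_delay * 2 ** attempt, max_delay)
            time.sleep(delay * random.uniform(0.5, 1.5))
```

In real use you would catch only retryable errors (HTTP 429, 5xx, timeouts) rather than bare `Exception`, so that bad requests fail fast instead of being retried.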

March 1, 2026 · 5 min · 1002 words · Rob Washington

Practical Patterns for Building Autonomous AI Agents

The gap between “AI demo” and “AI that runs reliably” is enormous. Here are patterns that emerge when you actually deploy autonomous agents.

The Heartbeat Pattern

Agents need periodic check-ins, not just reactive responses. A heartbeat system provides:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class HeartbeatState:
    last_email_check: datetime
    last_calendar_check: datetime
    last_service_health: datetime

async def heartbeat(state: HeartbeatState):
    now = datetime.now()
    if now - state.last_service_health >= timedelta(hours=2):
        await check_services()
        state.last_service_health = now
    if now - state.last_email_check >= timedelta(hours=4):
        await check_inbox()
        state.last_email_check = now
```

The key insight: batch periodic tasks into a single heartbeat rather than creating dozens of scheduled jobs. This reduces API calls and keeps context coherent. ...
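The interval checks above generalize to a small table of tasks, so adding a new periodic job means adding a row rather than another scheduler entry. A sketch (the helper name and data shape are illustrative, not from the post):

```python
from datetime import datetime, timedelta

def due_tasks(last_run: dict, intervals: dict, now: datetime) -> list:
    """Return the names of tasks whose interval has elapsed since their
    last run; a task never seen before is always due."""
    return [
        name for name, interval in intervals.items()
        if now - last_run.get(name, datetime.min) >= interval
    ]
```

The heartbeat loop then dispatches whatever `due_tasks` returns and stamps each task's `last_run` afterwards.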

February 28, 2026 · 4 min · 642 words · Rob Washington

Getting Structured Data from LLMs: JSON Mode and Beyond

The biggest challenge with LLMs in production isn’t getting good responses—it’s getting parseable responses. When you need JSON for your pipeline, “Here’s the data you requested:” followed by markdown-wrapped output breaks everything. Here’s how to reliably extract structured data.

The Problem

```python
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": "Extract the person's name and age from: 'John Smith is 34 years old'"
    }]
)

print(response.choices[0].message.content)
# "The person's name is John Smith and their age is 34."
# ... not what we needed
```

You wanted {"name": "John Smith", "age": 34}. You got prose. ...
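Before reaching for JSON mode, a common stopgap is to tolerate the prose and markdown fences and dig the JSON out of the reply. A rough sketch (the helper name is illustrative):

```python
import json
import re

def extract_json(text: str):
    """Pull the first JSON object out of an LLM reply, tolerating
    markdown fences and surrounding prose."""
    # strip a ```json ... ``` fence if present
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # fall back to the first {...} span in the remaining text
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        return json.loads(match.group(0))
    raise ValueError("no JSON object found")
```

This is a band-aid, not a fix: it breaks on nested prose containing braces, and the structured-output features discussed later in the post are the sturdier answer.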

February 26, 2026 · 6 min · 1074 words · Rob Washington

Container Orchestration Patterns for AI Workloads

Running AI workloads in containers presents unique challenges that traditional web application patterns don’t address. GPU scheduling, model caching, and bursty inference traffic all require thoughtful architecture. Here’s what actually works in production.

The GPU Scheduling Problem

Standard Kubernetes scheduling assumes CPU and memory are your primary constraints. When you add GPUs to the mix, everything changes.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-server
spec:
  containers:
  - name: model
    image: my-registry/llm-server:v1.2
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "32Gi"
      requests:
        nvidia.com/gpu: 1
        memory: "24Gi"
```

The naive approach—one GPU per pod—works until you realize GPUs cost $2-4/hour and sit idle between requests. MIG (Multi-Instance GPU) and time-slicing help, but they introduce complexity: ...
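The idle-GPU argument is worth making concrete. A back-of-envelope calculator (all numbers purely illustrative):

```python
def cost_per_request(gpu_hourly_usd: float,
                     requests_per_hour: float,
                     seconds_per_request: float) -> tuple:
    """Effective cost per request and utilization for a dedicated GPU."""
    busy_seconds = requests_per_hour * seconds_per_request
    utilization = min(busy_seconds / 3600.0, 1.0)
    return gpu_hourly_usd / requests_per_hour, utilization

# At $3/hour serving 60 requests/hour at 2s each, the GPU sits ~97% idle
cost, util = cost_per_request(3.0, 60, 2.0)
```

Numbers like these are what motivate sharing a GPU across pods via MIG or time-slicing, despite the added complexity.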

February 26, 2026 · 4 min · 850 words · Rob Washington

LLM API Integration Patterns: Building Reliable AI-Powered Applications

Integrating LLM APIs into production applications requires more than just making API calls. These patterns address the real challenges: rate limits, token costs, latency, and reliability.

Basic Client Setup

```python
import os

from anthropic import Anthropic

client = Anthropic(
    api_key=os.environ.get("ANTHROPIC_API_KEY"),
    timeout=60.0,
    max_retries=3,
)

def chat(message: str, system: str | None = None) -> str:
    """Simple completion with sensible defaults."""
    messages = [{"role": "user", "content": message}]
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=system or "You are a helpful assistant.",
        messages=messages,
    )
    return response.content[0].text
```

Retry with Exponential Backoff

Built-in retries help, but custom logic handles edge cases: ...
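One edge case that fixed built-in retries often miss is honoring the server's suggested wait time. A simulated sketch of the idea — `RateLimited` here is a stand-in for whatever rate-limit error your SDK raises, and how you read the Retry-After value varies by client, so treat the names as assumptions:

```python
import time

class RateLimited(Exception):
    """Stand-in for an SDK rate-limit error carrying a Retry-After hint."""
    def __init__(self, retry_after: float):
        self.retry_after = retry_after

def call_with_retry_after(fn, max_attempts=4, fallback_delay=2.0,
                          sleep=time.sleep):
    """Retry fn(), preferring the server's Retry-After hint over a
    fixed fallback delay."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise
            sleep(err.retry_after or fallback_delay)
```

Injecting `sleep` keeps the wrapper testable; production code passes the default.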

February 25, 2026 · 7 min · 1291 words · Rob Washington

LLM API Integration Patterns: Building Reliable AI Features

LLM APIs are deceptively simple: send a prompt, get text back. But building reliable AI features requires handling rate limits, managing costs, structuring outputs, and gracefully degrading when things go wrong. Here are the patterns that work in production.

The Basic Client

Start with a wrapper that handles common concerns:

```python
import os
import time
from typing import Optional

import anthropic
from tenacity import retry, stop_after_attempt, wait_exponential

class LLMClient:
    def __init__(self):
        self.client = anthropic.Anthropic()
        self.default_model = "claude-sonnet-4-20250514"
        self.max_tokens = 4096

    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=4, max=60)
    )
    def complete(
        self,
        prompt: str,
        system: Optional[str] = None,
        model: Optional[str] = None,
        max_tokens: Optional[int] = None
    ) -> str:
        messages = [{"role": "user", "content": prompt}]
        response = self.client.messages.create(
            model=model or self.default_model,
            max_tokens=max_tokens or self.max_tokens,
            system=system or "",
            messages=messages
        )
        return response.content[0].text
```

The tenacity library handles retries with exponential backoff — essential for rate limits and transient errors. ...

February 24, 2026 · 6 min · 1104 words · Rob Washington

LLM API Integration Patterns: Building Reliable AI-Powered Features

Adding an LLM to your application sounds simple: call the API, get a response, display it. In practice, you’re dealing with rate limits, token costs, latency spikes, and outputs that occasionally make no sense. These patterns help build LLM features that are reliable, cost-effective, and actually useful.

The Basic Call

Every LLM integration starts here:

```python
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7
    )
    return response.choices[0].message.content
```

This works for demos. Production needs more. ...
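Part of what “more” usually means is graceful degradation: try the primary model, fall back to a cheaper one, then to a canned reply, so the feature degrades instead of erroring out. A sketch with injected callables (all names illustrative, not from the post):

```python
def ask_with_fallback(prompt: str, primary, fallback,
                      default="Sorry, I can't answer right now."):
    """Try the primary model, then a cheaper fallback, then a canned
    response, so users never see a raw stack trace."""
    for model_call in (primary, fallback):
        try:
            return model_call(prompt)
        except Exception:
            continue  # in real code: log, and catch only retryable errors
    return default
```

The `primary` and `fallback` arguments would each be a thin wrapper around an actual API call, like the `ask_llm` above.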

February 23, 2026 · 7 min · 1302 words · Rob Washington

Infrastructure Observability for LLM Agents

When you deploy an LLM-powered agent in production, traditional APM dashboards only tell half the story. You can track latency, error rates, and throughput — but what about what the agent actually did? Did it hallucinate? Did it spiral into an infinite retry loop? Did it spend $47 on tokens chasing a dead end? Here’s how to build observability for autonomous agents that actually helps.

The Three Pillars of Agent Observability

Standard observability (logs, metrics, traces) still matters. But agents need three additional dimensions: ...
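The cost problem hinted at above ($47 chasing a dead end) can be guarded with a simple running budget. A sketch — the class name and per-token prices are placeholders, not real rates:

```python
class TokenBudget:
    """Track cumulative token spend for an agent run and flag overruns."""
    def __init__(self, limit_usd: float,
                 input_per_1k: float = 0.003,
                 output_per_1k: float = 0.015):
        self.limit_usd = limit_usd
        self.input_per_1k = input_per_1k
        self.output_per_1k = output_per_1k
        self.spent_usd = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> float:
        """Add one API call's usage; return total spend so far."""
        self.spent_usd += (input_tokens / 1000 * self.input_per_1k
                           + output_tokens / 1000 * self.output_per_1k)
        return self.spent_usd

    def exceeded(self) -> bool:
        return self.spent_usd >= self.limit_usd
```

The agent loop checks `exceeded()` before each step and halts (or asks a human) when the run goes over budget, turning a silent cost spiral into a visible event.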

February 21, 2026 · 4 min · 778 words · Rob Washington

Building Resilient LLM API Integrations

When you’re building production systems that rely on LLM APIs, you quickly learn that “it works in development” doesn’t mean much. Rate limits hit at the worst times, APIs go down, and costs can spiral if you’re not careful. Here’s how to build integrations that actually survive the real world.

The Problem with Naive Integrations

Most tutorials show you something like this:

```python
import anthropic

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}]
)
```

This works great until: ...
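One way costs spiral is paying repeatedly for identical requests. Caching completions by prompt and parameters is a cheap guard; a sketch (the wrapper name is illustrative, and this only makes sense for deterministic settings such as temperature 0):

```python
import hashlib
import json

class CachedLLM:
    """Cache completions keyed by prompt + parameters so identical
    requests are billed only once."""
    def __init__(self, complete_fn):
        self.complete_fn = complete_fn  # e.g. a thin wrapper over the SDK
        self.cache = {}
        self.hits = 0

    def complete(self, prompt: str, **params) -> str:
        key = hashlib.sha256(
            json.dumps([prompt, params], sort_keys=True).encode()
        ).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        result = self.complete_fn(prompt, **params)
        self.cache[key] = result
        return result
```

In production you would bound the cache (LRU, TTL) and skip it entirely for non-deterministic or user-specific prompts.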

February 20, 2026 · 6 min · 1181 words · Rob Washington

AI Coding Assistants: Beyond Autocomplete

The landscape of AI coding assistants has shifted dramatically. What started as glorified autocomplete has matured into something far more interesting: collaborative coding partners that can reason, refactor, and even architect.

The Evolution

Early tools like GitHub Copilot impressed by completing your current line. Useful, but limited. Today’s assistants—Claude Code, Cursor, Codex CLI—operate at a different level:

- Multi-file awareness: They understand project context, not just the current buffer
- Reasoning: They can explain why code should change, not just what to change
- Tool use: They run tests, check linting, execute commands
- Iteration: They refine solutions based on feedback

Patterns That Work

After months of heavy use, here’s what actually moves the needle: ...

February 20, 2026 · 3 min · 532 words · Rob Washington