LLM API Integration Patterns: Building Reliable AI-Powered Features
Integrating LLMs into production systems requires more than API calls. Here are patterns that actually work.
February 23, 2026 · 7 min · 1302 words · Rob Washington
Adding an LLM to your application sounds simple: call the API, get a response, display it. In practice, you’re dealing with rate limits, token costs, latency spikes, and outputs that occasionally make no sense.
These patterns help build LLM features that are reliable, cost-effective, and actually useful.
Start with retries. Transient failures are the norm with LLM APIs, so wrap every call in exponential backoff, and only retry the errors a second attempt can actually fix: rate limits and 5xx server errors.

```python
import openai
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

client = openai.OpenAI()

def _is_retryable(exc: BaseException) -> bool:
    # Retry rate limits and 5xx server errors; 4xx client errors
    # won't succeed on a second attempt, so fail fast
    if isinstance(exc, openai.RateLimitError):
        return True
    if isinstance(exc, openai.APIError):
        return getattr(exc, "status_code", 0) >= 500
    return False

@retry(
    retry=retry_if_exception(_is_retryable),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=60),
)
def call_llm_with_retry(messages: list) -> str:
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return response.choices[0].message.content
```

Note the predicate: a bare `raise` inside an `except` block isn't enough, because tenacity retries on any exception by default. The `retry_if_exception` filter is what actually stops client errors from being retried.
Identical prompts don't need a second API call. Cache responses keyed on a hash of the prompt and parameters:

```python
import hashlib
import json

import redis

cache = redis.Redis()
CACHE_TTL = 3600  # 1 hour

def cached_llm_call(prompt: str, **kwargs) -> str:
    # Create cache key from prompt and parameters
    cache_key = hashlib.sha256(
        f"{prompt}:{json.dumps(kwargs, sort_keys=True)}".encode()
    ).hexdigest()

    # Check cache
    cached = cache.get(f"llm:{cache_key}")
    if cached:
        return cached.decode()

    # Make API call
    response = call_llm_with_retry([{"role": "user", "content": prompt}])

    # Cache result
    cache.setex(f"llm:{cache_key}", CACHE_TTL, response)
    return response
```
For workloads with repeated or near-duplicate queries, cache hit rates of 30-50% are common.
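One way to push hit rates higher is normalizing prompts before hashing, so trivially different phrasings of the same request share a cache entry. A minimal sketch, where the helper name and normalization rules are illustrative, not from any library:

```python
import hashlib
import json

def normalized_cache_key(prompt: str, **kwargs) -> str:
    # Collapse whitespace and lowercase before hashing so superficially
    # different prompts map to the same cache entry (illustrative sketch)
    normalized = " ".join(prompt.lower().split())
    payload = f"{normalized}:{json.dumps(kwargs, sort_keys=True)}"
    return hashlib.sha256(payload.encode()).hexdigest()

# Two superficially different prompts now share one key
k1 = normalized_cache_key("Summarize  this article", temperature=0)
k2 = normalized_cache_key("summarize this article ", temperature=0)
```

How aggressively to normalize is workload-dependent; for case-sensitive inputs like code, keep the original casing.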
Full responses take seconds; streaming gets tokens in front of the user immediately. Yield tokens as they arrive and relay them as server-sent events:

```python
import json

def stream_response(prompt: str):
    stream = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content

# In a web framework (FastAPI shown; StreamingResponse also accepts
# sync generators, which it iterates in a threadpool)
@app.get("/chat")
async def chat(prompt: str):
    def generate():
        for token in stream_response(prompt):
            yield f"data: {json.dumps({'token': token})}\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
First token appears in ~200ms instead of waiting 2-5 seconds for the full response.
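Time-to-first-token is the metric worth tracking here. A small sketch for measuring it against any token generator, with a hypothetical helper name and a stand-in stream:

```python
import time

def time_to_first_token(tokens):
    # Returns the first token and the latency until it arrived.
    # `tokens` can be any generator, e.g. stream_response(prompt).
    start = time.monotonic()
    first = next(tokens)
    return first, time.monotonic() - start

# Usage with a stand-in generator:
def fake_stream():
    yield "Hello"
    yield " world"

first, latency = time_to_first_token(fake_stream())
```

Logging this per request makes regressions visible: a prompt that grows past the model's fast path shows up as a jump in first-token latency long before users complain.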
Never trust structured output blindly. Parse it, then validate it against what you actually asked for:

```python
import json
import logging
import re

logger = logging.getLogger(__name__)
url_pattern = re.compile(r"https?://\S+")

def extract_urls(text: str) -> list[str]:
    prompt = f"Extract all URLs from this text. Return JSON array: {text}"
    response = call_llm_with_retry([{"role": "user", "content": prompt}])
    urls = json.loads(response)

    # Validate: are these actually URLs?
    validated = [u for u in urls if url_pattern.match(u)]
    if len(validated) != len(urls):
        logger.warning(f"LLM returned {len(urls) - len(validated)} invalid URLs")
    return validated
```
Finally, track what every call costs:

```python
# Approximate costs per 1K tokens (check current pricing)
COSTS = {
    "gpt-4": {"input": 0.03, "output": 0.06},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
    "claude-3-sonnet": {"input": 0.003, "output": 0.015},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    prices = COSTS.get(model, COSTS["gpt-3.5-turbo"])
    return (
        input_tokens / 1000 * prices["input"]
        + output_tokens / 1000 * prices["output"]
    )

# After each call
cost = estimate_cost(model, usage.prompt_tokens, usage.completion_tokens)
metrics.increment("llm_cost_cents", int(cost * 100), tags=[f"model:{model}"])
```
Set up alerts for daily spend thresholds.
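A sketch of the threshold check itself, assuming spend is accumulated in cents per day. In production the counter would live in Redis or your metrics backend; the in-memory dict and all names here are illustrative:

```python
from datetime import date

DAILY_BUDGET_CENTS = 5_000  # e.g. alert at $50/day

_spend = {"day": date.today(), "cents": 0}

def record_spend(cost_cents: int) -> bool:
    """Accumulate today's spend; return True exactly once, when the
    daily threshold is first crossed (so the alert doesn't re-fire)."""
    today = date.today()
    if _spend["day"] != today:
        _spend["day"], _spend["cents"] = today, 0  # new day, reset counter
    before = _spend["cents"]
    _spend["cents"] += cost_cents
    return before < DAILY_BUDGET_CENTS <= _spend["cents"]

fired = [record_spend(c) for c in (4_000, 2_000, 100)]
```

The crossing check (`before < threshold <= after`) is the part worth copying: it fires once per day instead of on every call after the budget is blown.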
LLM APIs are powerful but unpredictable. Wrap them in retries, caches, and validation. Monitor costs obsessively. Have fallbacks ready.
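"Fallbacks ready" can be as simple as a model chain: try the primary model, fall through to a cheaper or alternative one on failure. A sketch under the assumption that `call(messages, model)` wraps your client (the signature and stub below are illustrative, not a real client API):

```python
def call_with_fallback(call, messages, models=("gpt-4", "gpt-3.5-turbo")):
    """Try each model in order, falling through on any error.
    `call(messages, model)` is an assumed wrapper around your client."""
    last_error = None
    for model in models:
        try:
            return call(messages, model)
        except Exception as exc:
            last_error = exc
    raise last_error

# Usage with a stub whose primary model is down:
def flaky_call(messages, model):
    if model == "gpt-4":
        raise RuntimeError("primary model unavailable")
    return f"answer from {model}"

result = call_with_fallback(flaky_call, [{"role": "user", "content": "hi"}])
```

In practice you would also log which model served each request, since silent fallback to a weaker model can mask quality regressions.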
The goal isn’t to call the API — it’s to build a reliable feature that happens to use an LLM. Every pattern here exists because something went wrong in production. Learn from others’ incidents.