Llm | Computing Arts

LLM API Integration Patterns: Building Resilient AI Applications

Integrating Large Language Model APIs into production applications requires more than just calling an endpoint. Here are battle-tested patterns for building resilient, cost-effective LLM integrations. The Retry Cascade LLM APIs are notorious for rate limits and transient failures. A simple exponential backoff isn’t enough — you need a cascade strategy: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 import asyncio from dataclasses import dataclass from typing import Optional @dataclass class LLMResponse: content: str model: str tokens_used: int class LLMCascade: def __init__(self): self.providers = [ ("anthropic", "claude-sonnet-4-20250514", 3), ("openai", "gpt-4o", 2), ("anthropic", "claude-3-haiku-20240307", 5), ] async def complete(self, prompt: str) -> Optional[LLMResponse]: for provider, model, max_retries in self.providers: for attempt in range(max_retries): try: return await self._call_provider(provider, model, prompt) except RateLimitError: await asyncio.sleep(2 ** attempt) except ProviderError: break # Try next provider return None The cascade falls through primary to fallback models, attempting retries at each level before moving on. ...

Semantic Caching for LLM Applications

Every LLM API call costs money and takes time. When users ask variations of the same question, you’re paying for the same computation repeatedly. Semantic caching solves this by recognizing that “What’s the weather in NYC?” and “How’s the weather in New York City?” are functionally identical. The Problem with Traditional Caching Standard key-value caching uses exact string matching: 1 2 3 cache_key = hash(prompt) if cache_key in cache: return cache[cache_key] This fails for LLM applications because: ...

Context Window Management for LLM Applications

One of the most underestimated challenges in building LLM-powered applications is context window management. You’ve got 128k tokens, but that doesn’t mean you should use them all on every request. The Problem Large context windows create a false sense of abundance. Yes, you can stuff 100k tokens into a request, but you’ll pay for it in: Latency: More tokens = slower responses Cost: You’re billed per token (input and output) Quality degradation: The “lost in the middle” phenomenon is real Practical Patterns 1. Rolling Window with Summary Keep a rolling window of recent conversation, but periodically summarize older context: ...

Working with LLM APIs: A Practical Guide

How to integrate large language models into your applications — from basic API calls to production-ready patterns.