Integrating LLM APIs into production systems is harder than the tutorials suggest. The API call works in development. Then you hit rate limits, latency spikes, context limits, and costs that scale faster than your revenue.
Here’s how to build LLM integrations that actually work.
The Basics Nobody Mentions

Always Stream

Non-streaming API calls block until complete. For a 500-token response, that’s 5-15 seconds of your user staring at nothing.
```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = "Explain streaming in one paragraph."  # placeholder prompt

# Bad: User waits forever
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)
print(response.choices[0].message.content)

# Good: Tokens appear as generated
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    stream=True
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard before printing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

Streaming also lets you abort early if the response is going off the rails.
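One way to sketch the early-abort idea without tying it to a particular SDK is a small helper that consumes text deltas and stops as soon as a check fails. Everything named here is hypothetical and for illustration only: `collect_until`, the `should_abort` callback, the `max_chars` budget, and the `fake_stream` generator that stands in for deltas pulled from real API chunks.

```python
from typing import Callable, Iterable, Optional, Tuple


def collect_until(
    deltas: Iterable[str],
    should_abort: Callable[[str], bool],
    max_chars: Optional[int] = None,
) -> Tuple[str, bool]:
    """Accumulate streamed text, stopping early when a check fails.

    Returns (text_so_far, aborted). `should_abort` sees the full text
    received so far after every delta.
    """
    text = ""
    for delta in deltas:
        text += delta
        over_budget = max_chars is not None and len(text) > max_chars
        if should_abort(text) or over_budget:
            # Release the underlying resource if the stream exposes close().
            close = getattr(deltas, "close", None)
            if callable(close):
                close()
            return text, True
    return text, False


def fake_stream():
    # Stand-in for the text deltas extracted from streamed API chunks.
    yield from ["Sure, ", "here is ", "the answer. ", "As an AI ", "language model..."]


# Abort as soon as an unwanted phrase shows up in the output.
text, aborted = collect_until(fake_stream(), lambda t: "As an AI" in t)
print(aborted, repr(text))
```

In a real integration the deltas would come from the `chunk.choices[0].delta.content` values of the streaming loop, and the check could be anything cheap enough to run per chunk: a banned phrase, a malformed-JSON detector, or a length budget. Stopping the read early avoids waiting for tokens you already know you will discard.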
...