February 13, 2026 · 3 min · 618 words · Rob Washington
One of the most underestimated challenges in building LLM-powered applications is context window management. You’ve got 128k tokens, but that doesn’t mean you should use them all on every request.
A simple pattern is a conversation manager that compresses older history once it crosses a token threshold. Token counting and summarization are stubbed out here; in practice you'd use your model's tokenizer (e.g. tiktoken) and an LLM summarization call:

```python
class ConversationManager:
    def __init__(self, max_tokens=8000, summary_threshold=6000):
        self.messages = []
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold

    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        if self._count_tokens() > self.summary_threshold:
            self._compress_history()

    def _count_tokens(self):
        # Rough estimate (~4 chars per token); swap in your model's
        # tokenizer for accurate counts
        return sum(len(m["content"]) for m in self.messages) // 4

    def _summarize(self, messages):
        # Placeholder summarizer: in practice, send these messages
        # back to the LLM and ask for a concise summary
        return " ".join(m["content"][:100] for m in messages)

    def _compress_history(self):
        # Keep system message and last N exchanges
        old_messages = self.messages[1:-4]  # Exclude recent
        summary = self._summarize(old_messages)
        self.messages = [
            self.messages[0],  # System
            {"role": "system", "content": f"Previous context: {summary}"},
            *self.messages[-4:],  # Recent exchanges
        ]
```
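In use, the compression is transparent to the caller. A quick sketch (the message contents are just illustrative):

```python
manager = ConversationManager(max_tokens=8000, summary_threshold=6000)
manager.add_message("system", "You are a helpful support agent.")
manager.add_message("user", "My deploy failed with a 502.")
manager.add_message("assistant", "Let's check the gateway logs first.")
# After enough exchanges, older turns get folded into a summary
# message automatically, keeping the window under the threshold.
```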
Don’t load everything upfront. Retrieve relevant chunks based on the current query:
```python
def build_context(query, knowledge_base, max_chunks=5):
    # Embed the query
    query_embedding = embed(query)

    # Find relevant chunks
    relevant = knowledge_base.similarity_search(query_embedding, limit=max_chunks)

    # Build context with only what's needed
    return "\n\n".join([chunk.content for chunk in relevant])
```
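Wiring that into a request might look like this; `embed`, `knowledge_base`, and `llm.complete` are stand-ins for whatever embedding model, vector store, and client you're actually using:

```python
def answer(query, knowledge_base):
    context = build_context(query, knowledge_base, max_chunks=5)
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )
    return llm.complete(prompt)  # stand-in for your LLM client call
```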
Smaller, well-curated context often outperforms massive context dumps. I’ve seen applications improve by removing context that was technically relevant but not immediately useful.
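One concrete way to enforce that curation: filter retrieved chunks by similarity score instead of always taking the top k. The `min_score` cutoff and the `similarity_search_with_scores` method returning `(chunk, score)` pairs are assumptions about your vector store's API:

```python
def build_context_filtered(query, knowledge_base, max_chunks=5, min_score=0.75):
    query_embedding = embed(query)
    # Assumed API: the store returns (chunk, similarity_score) pairs
    scored = knowledge_base.similarity_search_with_scores(
        query_embedding, limit=max_chunks
    )
    # Drop chunks below the cutoff; better to send nothing than noise
    relevant = [chunk for chunk, score in scored if score >= min_score]
    return "\n\n".join(chunk.content for chunk in relevant)
```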
The goal isn’t to give the model everything it might need. It’s to give it exactly what it needs for this request.