One of the most underestimated challenges in building LLM-powered applications is context window management. You’ve got 128k tokens, but that doesn’t mean you should use them all on every request.

The Problem

Large context windows create a false sense of abundance. Yes, you can stuff 100k tokens into a request, but you’ll pay for it in:

  • Latency: More tokens = slower responses
  • Cost: You’re billed per token (input and output)
  • Quality degradation: The “lost in the middle” phenomenon is real
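To make the cost point concrete, here is a rough back-of-the-envelope sketch. The per-token prices are illustrative placeholders, not any provider's actual rates:

```python
# Hypothetical prices, chosen only to illustrate the scaling.
INPUT_PRICE_PER_1K = 0.0025   # assumed $ per 1k input tokens
OUTPUT_PRICE_PER_1K = 0.0100  # assumed $ per 1k output tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in dollars at the assumed prices above."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# A 100k-token prompt costs ~25x more than a 2k-token prompt
# for the exact same 500-token answer.
full_dump = estimate_cost(100_000, 500)
curated = estimate_cost(2_000, 500)
```

The output side is identical in both cases; the difference is pure input bloat.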

Practical Patterns

1. Rolling Window with Summary

Keep a rolling window of recent conversation, but periodically summarize older context:

class ConversationManager:
    def __init__(self, max_tokens=8000, summary_threshold=6000):
        self.messages = []
        self.max_tokens = max_tokens  # hard ceiling for the window
        self.summary_threshold = summary_threshold  # compress before hitting it
    
    def add_message(self, role, content):
        self.messages.append({"role": role, "content": content})
        
        if self._count_tokens() > self.summary_threshold:
            self._compress_history()
    
    def _compress_history(self):
        # Keep the system message and the last two exchanges;
        # fold everything in between into a summary.
        old_messages = self.messages[1:-4]
        summary = self._summarize(old_messages)  # e.g. a cheap LLM call
        
        self.messages = [
            self.messages[0],  # original system prompt
            {"role": "system", "content": f"Previous context: {summary}"},
            *self.messages[-4:]  # most recent exchanges, verbatim
        ]

2. Semantic Retrieval (RAG) for Deep Context

Don’t load everything upfront. Retrieve relevant chunks based on the current query:

def build_context(query, knowledge_base, max_chunks=5):
    # Embed the query (embed() and knowledge_base are whatever
    # embedding model and vector store you already use)
    query_embedding = embed(query)
    
    # Find the most relevant chunks
    relevant = knowledge_base.similarity_search(
        query_embedding,
        limit=max_chunks
    )
    
    # Build context with only what's needed
    return "\n\n".join(chunk.content for chunk in relevant)

3. Tiered Memory Architecture

Implement different storage tiers like an operating system’s memory hierarchy:

  • Hot (in-context): current conversation, key task details
  • Warm (summary): compressed conversation history, decisions
  • Cold (retrieval): full conversation logs, documents, knowledge base
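A minimal sketch of those three tiers. The class name and the truncated "summary" stand-in are illustrative assumptions; a real implementation would summarize with an LLM call and back the cold tier with a database or vector store:

```python
class TieredMemory:
    """Toy three-tier memory: hot (verbatim), warm (summaries), cold (archive)."""

    def __init__(self):
        self.hot = []   # in-context: recent messages, sent verbatim
        self.warm = []  # compressed summaries of older exchanges
        self.cold = []  # full logs, fetched on demand via retrieval

    def remember(self, message, max_hot=6):
        self.hot.append(message)
        if len(self.hot) > max_hot:
            evicted = self.hot.pop(0)
            self.cold.append(evicted)  # keep the verbatim copy retrievable
            # Stand-in for a real LLM summarization call:
            self.warm.append(f"[summary] {evicted[:40]}")

    def build_prompt_context(self):
        # Warm summaries first, then the verbatim hot window
        return "\n".join(self.warm + self.hot)
```

The key property is that nothing is ever lost, only demoted: evicted messages stay retrievable from cold storage while a cheap summary keeps them visible in the prompt.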

4. Context Budgeting

Allocate your token budget intentionally:

CONTEXT_BUDGET = {
    "system_prompt": 500,
    "retrieved_context": 2000,
    "conversation_history": 3000,
    "user_message": 1000,
    "buffer_for_output": 1500
}

# Total: 8,000 tokens allocated, including a 1,500-token buffer for the response
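One way to enforce such a budget is to clip each section to its allotment before assembling the prompt. A sketch, assuming a whitespace split as a crude stand-in for a real tokenizer:

```python
CONTEXT_BUDGET = {
    "system_prompt": 500,
    "retrieved_context": 2000,
    "conversation_history": 3000,
    "user_message": 1000,
}

def count_tokens(text: str) -> int:
    # Placeholder: swap in a real tokenizer (e.g. tiktoken) in practice
    return len(text.split())

def fit_to_budget(sections: dict) -> dict:
    """Hard-truncate each section to its budgeted token count."""
    fitted = {}
    for name, text in sections.items():
        budget = CONTEXT_BUDGET[name]
        words = text.split()
        fitted[name] = " ".join(words[:budget])
    return fitted
```

Hard truncation is the bluntest policy; truncating history from the front, or re-summarizing overflow instead of dropping it, are gentler variants of the same idea.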

The Counterintuitive Truth

Smaller, well-curated context often outperforms massive context dumps. I’ve seen applications improve by removing context that was technically relevant but not immediately useful.

The goal isn’t to give the model everything it might need. It’s to give it exactly what it needs for this request.

Monitoring Context Usage

Track your actual usage patterns:

def log_context_metrics(request, response):
    input_tokens = count_tokens(request)
    output_tokens = count_tokens(response)
    metrics = {
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "context_utilization": input_tokens / MAX_CONTEXT,
        "cost": calculate_cost(input_tokens, output_tokens)
    }
    
    # Alert if context utilization exceeds 80%
    if metrics["context_utilization"] > 0.8:
        logger.warning("High context utilization", extra=metrics)

Takeaways

  1. Budget your context like you budget compute
  2. Summarize aggressively — the model doesn’t need verbatim history
  3. Retrieve dynamically — RAG beats static loading
  4. Monitor and iterate — your ideal context size will vary by use case

The 128k context window is a safety net, not a target. Use it wisely.