Every LLM API call costs money and takes time. When users ask variations of the same question, you’re paying for the same computation repeatedly. Semantic caching solves this by recognizing that “What’s the weather in NYC?” and “How’s the weather in New York City?” are functionally identical.
## The Problem with Traditional Caching
Standard key-value caching uses exact string matching:
```python
cache_key = hash(prompt)
if cache_key in cache:
    return cache[cache_key]
```
This fails for LLM applications because:
- “Explain recursion” ≠ “What is recursion?”
- Minor typos invalidate the cache
- Rephrased questions never hit
You need a cache that understands meaning, not just strings.
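To make the contrast concrete, here is a rough sketch (assuming the OpenAI embeddings API, which the next section sets up properly): two paraphrases hash to completely different keys, yet their embeddings sit close together.

```python
import hashlib
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

a, b = "Explain recursion", "What is recursion?"

# Exact-match keys: entirely different strings, so the hashes never collide
print(hashlib.sha256(a.encode()).hexdigest() == hashlib.sha256(b.encode()).hexdigest())  # False

# Embedding cosine similarity: far higher than for unrelated text pairs
va, vb = embed(a), embed(b)
print(float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))))
```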
## How Semantic Caching Works
The core idea: convert queries to embeddings, then find the nearest cached neighbor.
```python
import numpy as np
from openai import OpenAI
from redis import Redis
from redis.commands.search.query import Query

client = OpenAI()
redis = Redis(host='localhost', port=6379)

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def semantic_cache_lookup(query: str, threshold: float = 0.92):
    query_embedding = get_embedding(query)

    # Search for the nearest cached query
    # using Redis with RediSearch vector similarity (KNN)
    results = redis.ft("llm_cache").search(
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("response", "score")
        .dialect(2),
        query_params={"vec": np.array(query_embedding, dtype=np.float32).tobytes()}
    )

    if results.docs:
        # RediSearch reports cosine *distance* (lower is closer),
        # so convert to similarity before comparing against the threshold
        similarity = 1 - float(results.docs[0].score)
        if similarity >= threshold:
            return results.docs[0].response
    return None  # Cache miss
```
## Choosing the Right Threshold
The similarity threshold controls the tradeoff:
| Threshold | Behavior |
|---|---|
| 0.98+ | Nearly exact matches only |
| 0.92-0.97 | Similar questions, safe for most use cases |
| 0.85-0.91 | Broader matching, review carefully |
| < 0.85 | Dangerous, likely false positives |
Start conservative (0.95+) and adjust based on your data.
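One practical way to tune it is to log the best similarity score on every lookup, whether or not it clears the threshold; reviewing near-misses tells you whether to loosen or tighten. A minimal sketch, assuming a hypothetical `semantic_cache_lookup_scored` variant of the lookup above that also returns the best similarity it found:

```python
import logging

logger = logging.getLogger("semantic_cache")

def lookup_with_logging(query: str, threshold: float = 0.95):
    # semantic_cache_lookup_scored is assumed to return (response, best_similarity),
    # with best_similarity = None when the cache is empty
    response, best_similarity = semantic_cache_lookup_scored(query)

    hit = best_similarity is not None and best_similarity >= threshold
    logger.info(
        "semantic_cache %s query=%r best_similarity=%s threshold=%s",
        "HIT" if hit else "MISS", query, best_similarity, threshold,
    )
    return response if hit else None
```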
## Architecture Patterns
### Pattern 1: Check-then-compute
```python
def ask_llm(prompt: str) -> str:
    # Try cache first
    cached = semantic_cache_lookup(prompt)
    if cached:
        return cached

    # Cache miss — call LLM
    response = call_llm(prompt)

    # Store for future hits
    store_in_cache(prompt, response)
    return response
```
### Pattern 2: Async warm-up
Pre-populate the cache with common query variations:
```python
base_queries = [
    "How do I reset my password?",
    "What are your business hours?",
    "How do I contact support?"
]

for query in base_queries:
    # Generate variations
    variations = generate_paraphrases(query)
    response = call_llm(query)

    # Cache the response for all variations
    for var in [query] + variations:
        store_in_cache(var, response)
```
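`generate_paraphrases` is left undefined above; one way to sketch it is to ask a cheap chat model for rewordings, reusing the `client` from earlier (the model name and prompt here are illustrative, not a fixed recommendation):

```python
def generate_paraphrases(query: str, n: int = 5) -> list[str]:
    # Ask a chat model for n rewordings, one per line
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice; any capable chat model works
        messages=[{
            "role": "user",
            "content": f"Rewrite the following question {n} different ways, "
                       f"one per line, with no numbering:\n\n{query}"
        }],
    )
    text = completion.choices[0].message.content or ""
    return [line.strip() for line in text.splitlines() if line.strip()]
```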
### Pattern 3: Tiered caching
Use multiple cache layers with different TTLs:
```python
def tiered_lookup(query: str):
    # L1: Exact match (fast, short TTL)
    exact = exact_cache.get(hash(query))
    if exact:
        return exact

    # L2: Semantic match (slower, longer TTL)
    semantic = semantic_cache_lookup(query, threshold=0.93)
    if semantic:
        # Promote to L1 for future exact hits
        exact_cache.set(hash(query), semantic, ttl=3600)
        return semantic

    return None
```
## Vector Database Options
Several databases support efficient vector similarity search:
**Redis with RediSearch**: Great if you’re already using Redis. Supports HNSW indexes.
```python
from redis.commands.search.field import TextField, VectorField

# Create index
redis.ft("llm_cache").create_index([
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 1536,  # dimension of text-embedding-3-small
        "DISTANCE_METRIC": "COSINE"
    }),
    TextField("response")
])
```
**PostgreSQL with pgvector**: Keep everything in Postgres. Excellent for smaller datasets.
```sql
CREATE EXTENSION vector;

CREATE TABLE llm_cache (
    id SERIAL PRIMARY KEY,
    query TEXT,
    embedding vector(1536),
    response TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON llm_cache
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
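The lookup itself is a single query ordered by cosine distance. A rough sketch using psycopg2, where the connection string and the textual vector cast are assumptions (the pgvector Python adapters offer cleaner bindings):

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # placeholder connection string

def pgvector_lookup(query: str, threshold: float = 0.92):
    vec = get_embedding(query)
    # pgvector accepts the textual form '[0.1,0.2,...]' cast to vector;
    # <=> is cosine distance, so similarity = 1 - distance
    vec_literal = "[" + ",".join(str(x) for x in vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT response, 1 - (embedding <=> %s::vector) AS similarity
            FROM llm_cache
            ORDER BY embedding <=> %s::vector
            LIMIT 1
            """,
            (vec_literal, vec_literal),
        )
        row = cur.fetchone()
    if row and row[1] >= threshold:
        return row[0]
    return None
```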
**Pinecone/Weaviate/Qdrant**: Purpose-built vector databases. Best for large scale.
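For flavor, the same lookup against a local Qdrant instance might look roughly like this (collection name and wiring are illustrative, not prescriptive):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

qdrant = QdrantClient(url="http://localhost:6333")

# One-time setup: a collection sized for text-embedding-3-small vectors
qdrant.create_collection(
    collection_name="llm_cache",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def qdrant_store(query: str, response: str, point_id: int):
    qdrant.upsert(
        collection_name="llm_cache",
        points=[PointStruct(id=point_id, vector=get_embedding(query),
                            payload={"response": response})],
    )

def qdrant_lookup(query: str, threshold: float = 0.92):
    hits = qdrant.search(
        collection_name="llm_cache",
        query_vector=get_embedding(query),
        limit=1,
    )
    # With the cosine metric, Qdrant scores are similarities (higher is closer)
    if hits and hits[0].score >= threshold:
        return hits[0].payload["response"]
    return None
```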
## Handling Cache Invalidation
Semantic caches need thoughtful invalidation:
```python
def invalidate_related(query: str, radius: float = 0.90):
    """Remove all cached items semantically similar to query"""
    query_embedding = get_embedding(query)

    # Find all similar entries
    similar = vector_search(query_embedding, threshold=radius)

    # Delete them
    for entry in similar:
        cache.delete(entry.id)
```
Time-based TTLs also help:
```python
import time
from uuid import uuid4

def store_in_cache(query: str, response: str, ttl: int = 86400):
    embedding = get_embedding(query)
    key = f"cache:{uuid4()}"
    redis.hset(key, mapping={
        "query": query,
        "response": response,
        # Serialize to float32 bytes to match the FLOAT32 vector index
        "embedding": np.array(embedding, dtype=np.float32).tobytes(),
        "expires": time.time() + ttl
    })
    # Let Redis expire the whole hash after the TTL
    redis.expire(key, ttl)
```
## Real-World Savings
Numbers from production systems:
- 40-60% cache hit rate typical for customer support bots
- 70-80% cost reduction when combined with response caching
- 50-200ms faster response times on cache hits
The embedding call adds ~50ms overhead, but that’s negligible compared to LLM inference.
## When NOT to Use Semantic Caching
- Personalized responses: If the answer depends on user context
- Real-time data: Weather, stock prices, current events
- Creative generation: Where variation is the point
- Low-volume applications: Setup cost exceeds savings
## Quick Implementation Checklist
- ✅ Choose a vector store (Redis, Postgres, or dedicated)
- ✅ Pick an embedding model (text-embedding-3-small is cost-effective)
- ✅ Start with threshold 0.94-0.96
- ✅ Implement TTL-based expiration
- ✅ Monitor hit rate and adjust threshold (a minimal counter sketch follows this list)
- ✅ Log cache misses to identify patterns
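Hit-rate monitoring can be as simple as two Redis counters incremented in the lookup path, plus a short list of recent misses to review (the key names here are arbitrary):

```python
def monitored_lookup(query: str):
    result = semantic_cache_lookup(query)
    if result is not None:
        redis.incr("metrics:semantic_cache:hits")
    else:
        redis.incr("metrics:semantic_cache:misses")
        # Keep the last 1000 misses so recurring phrasings can feed the warm-up list
        redis.lpush("metrics:semantic_cache:recent_misses", query)
        redis.ltrim("metrics:semantic_cache:recent_misses", 0, 999)
    return result

def hit_rate() -> float:
    hits = int(redis.get("metrics:semantic_cache:hits") or 0)
    misses = int(redis.get("metrics:semantic_cache:misses") or 0)
    total = hits + misses
    return hits / total if total else 0.0
```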
Semantic caching is one of the highest-ROI optimizations for LLM applications. A few hours of implementation can cut your API costs in half and make your app feel faster.
Start small, measure everything, and adjust the threshold until it feels right.