Every LLM API call costs money and takes time. When users ask variations of the same question, you’re paying for the same computation repeatedly. Semantic caching solves this by recognizing that “What’s the weather in NYC?” and “How’s the weather in New York City?” are functionally identical.

The Problem with Traditional Caching

Standard key-value caching uses exact string matching:

cache_key = hash(prompt)
if cache_key in cache:
    return cache[cache_key]

This fails for LLM applications because:

  • “Explain recursion” ≠ “What is recursion?”
  • Minor typos invalidate the cache
  • Rephrased questions never hit

You need a cache that understands meaning, not just strings.

How Semantic Caching Works

The core idea: convert queries to embeddings, then find the nearest cached neighbor.

import numpy as np
from openai import OpenAI
from redis import Redis
from redis.commands.search.query import Query

client = OpenAI()
redis = Redis(host='localhost', port=6379)

def get_embedding(text: str) -> list[float]:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def semantic_cache_lookup(query: str, threshold: float = 0.92):
    query_embedding = get_embedding(query)
    
    # Search for the nearest cached query
    # (Redis with RediSearch vector similarity; KNN 1 = single nearest neighbor)
    results = redis.ft("llm_cache").search(
        Query("*=>[KNN 1 @embedding $vec AS score]")
        .return_fields("response", "score")
        .dialect(2),
        query_params={"vec": np.array(query_embedding, dtype=np.float32).tobytes()}
    )
    
    # With a COSINE index, the returned score is a distance (lower = closer),
    # so convert it back to similarity before comparing to the threshold
    if results.docs:
        similarity = 1 - float(results.docs[0].score)
        if similarity >= threshold:
            return results.docs[0].response
    
    return None  # Cache miss

Choosing the Right Threshold

The similarity threshold controls the tradeoff:

Threshold    Behavior
0.98+        Nearly exact matches only
0.92-0.97    Similar questions, safe for most use cases
0.85-0.91    Broader matching, review carefully
< 0.85       Dangerous, likely false positives

Start conservative (0.95+) and adjust based on your data.
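
To calibrate, it helps to embed a few pairs of questions you consider equivalent and inspect their cosine similarity directly. A minimal sketch (the toy vectors below stand in for real embeddings from `get_embedding`):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (1.0 = identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of a paraphrase pair
# versus two unrelated questions:
pair_similar = cosine_similarity([0.9, 0.1, 0.4], [0.85, 0.15, 0.42])
pair_unrelated = cosine_similarity([0.9, 0.1, 0.4], [0.1, 0.9, -0.3])
print(pair_similar, pair_unrelated)
```

Running this over a sample of real query pairs shows where your equivalent questions actually cluster, which is a better guide than picking a threshold blind.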

Architecture Patterns

Pattern 1: Check-then-compute

def ask_llm(prompt: str) -> str:
    # Try cache first
    cached = semantic_cache_lookup(prompt)
    if cached:
        return cached
    
    # Cache miss — call LLM
    response = call_llm(prompt)
    
    # Store for future hits
    store_in_cache(prompt, response)
    return response

Pattern 2: Async warm-up

Pre-populate the cache with common query variations:

base_queries = [
    "How do I reset my password?",
    "What are your business hours?",
    "How do I contact support?"
]

for query in base_queries:
    # Generate variations
    variations = generate_paraphrases(query)
    response = call_llm(query)
    
    # Cache the response for all variations
    for var in [query] + variations:
        store_in_cache(var, response)
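
`generate_paraphrases` is left undefined above; one common approach is to ask the LLM itself for rewordings. A hedged sketch, with the completion call injected as a plain callable so the parsing logic stays independent of any specific client:

```python
def generate_paraphrases(query: str, complete, n: int = 3) -> list[str]:
    """Ask an LLM for up to n rewordings of `query`.

    `complete` is any callable mapping a prompt string to the model's text
    reply (e.g. a thin wrapper around your chat-completions client);
    injecting it keeps this helper testable offline.
    """
    prompt = (
        f"Rewrite the following question {n} different ways, "
        f"one per line, preserving its meaning:\n{query}"
    )
    reply = complete(prompt)
    # Strip list markers the model may prepend, drop empty lines
    lines = [line.strip("-• ").strip() for line in reply.splitlines()]
    return [line for line in lines if line][:n]

# Offline demo with a canned reply instead of a real API call:
fake_reply = (
    "How can I reset my password?\n"
    "Password reset steps?\n"
    "I forgot my password, what now?"
)
print(generate_paraphrases("How do I reset my password?", lambda p: fake_reply))
```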

Pattern 3: Tiered caching

Use multiple cache layers with different TTLs:

import hashlib

def cache_key(query: str) -> str:
    # Stable across processes — Python's built-in hash() is randomized per run
    return hashlib.sha256(query.encode()).hexdigest()

def tiered_lookup(query: str):
    # L1: Exact match (fast, short TTL)
    exact = exact_cache.get(cache_key(query))
    if exact:
        return exact
    
    # L2: Semantic match (slower, longer TTL)
    semantic = semantic_cache_lookup(query, threshold=0.93)
    if semantic:
        # Promote to L1 for future exact hits
        exact_cache.set(cache_key(query), semantic, ttl=3600)
        return semantic
    
    return None

Vector Database Options

Several databases support efficient vector similarity search:

Redis with RediSearch: Great if you’re already using Redis. Supports HNSW indexes.

# Create index (field classes live in redis-py's search module)
from redis.commands.search.field import TextField, VectorField

redis.ft("llm_cache").create_index([
    VectorField("embedding", "HNSW", {
        "TYPE": "FLOAT32",
        "DIM": 1536,  # must match the embedding model's output size
        "DISTANCE_METRIC": "COSINE"
    }),
    TextField("response")
])

PostgreSQL with pgvector: Keep everything in Postgres. Excellent for smaller datasets.

CREATE EXTENSION vector;

CREATE TABLE llm_cache (
    id SERIAL PRIMARY KEY,
    query TEXT,
    embedding vector(1536),
    response TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX ON llm_cache 
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
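
Lookups against that table are then one query away. A sketch of the Python side, assuming a psycopg-style connection and the `llm_cache` schema above (pgvector's `<=>` operator returns cosine distance, so similarity is 1 minus it):

```python
LOOKUP_SQL = """
    SELECT response, 1 - (embedding <=> %s::vector) AS similarity
    FROM llm_cache
    ORDER BY embedding <=> %s::vector
    LIMIT 1
"""

def pgvector_lookup(conn, query_embedding: list[float], threshold: float = 0.92):
    # pgvector accepts '[0.1, 0.2, ...]' text literals cast to vector
    vec = str(list(query_embedding))
    with conn.cursor() as cur:
        cur.execute(LOOKUP_SQL, (vec, vec))
        row = cur.fetchone()
    if row and row[1] >= threshold:
        return row[0]
    return None
```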

Pinecone/Weaviate/Qdrant: Purpose-built vector databases. Best for large scale.

Handling Cache Invalidation

Semantic caches need thoughtful invalidation:

def invalidate_related(query: str, radius: float = 0.90):
    """Remove all cached items semantically similar to query"""
    query_embedding = get_embedding(query)
    
    # Find all similar entries (vector_search and cache stand for your
    # store's search and delete primitives)
    similar = vector_search(query_embedding, threshold=radius)
    
    # Delete them
    for entry in similar:
        cache.delete(entry.id)

Time-based TTLs also help:

from uuid import uuid4

def store_in_cache(query: str, response: str, ttl: int = 86400):
    embedding = get_embedding(query)
    key = f"cache:{uuid4()}"
    redis.hset(key, mapping={
        "query": query,
        "response": response,
        # RediSearch expects the vector as raw float32 bytes
        "embedding": np.array(embedding, dtype=np.float32).tobytes(),
    })
    redis.expire(key, ttl)  # let Redis evict the entry itself

Real-World Savings

Representative numbers reported from production systems:

  • 40-60% cache hit rate typical for customer support bots
  • 70-80% cost reduction when combined with response caching
  • 50-200ms faster response times on cache hits

The embedding call adds ~50ms overhead, but that’s negligible compared to LLM inference.
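
A quick back-of-envelope model makes the tradeoff concrete (the per-call prices below are placeholders, not current rates):

```python
def monthly_savings(requests: int, cost_per_llm_call: float,
                    cost_per_embedding: float, hit_rate: float) -> float:
    """Net savings: avoided LLM calls minus embedding overhead on every request."""
    saved = requests * hit_rate * cost_per_llm_call
    overhead = requests * cost_per_embedding  # every lookup pays for one embedding
    return saved - overhead

# 1M requests/month, $0.01 per LLM call, $0.0001 per embedding, 50% hit rate:
print(monthly_savings(1_000_000, 0.01, 0.0001, 0.5))
```

Because the embedding cost is paid on every request, savings go negative only when the hit rate falls below the embedding/LLM price ratio — which is why low-volume or low-repeat workloads may not be worth caching.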

When NOT to Use Semantic Caching

  • Personalized responses: If the answer depends on user context
  • Real-time data: Weather, stock prices, current events
  • Creative generation: Where variation is the point
  • Low-volume applications: Setup cost exceeds savings

Quick Implementation Checklist

  1. ✅ Choose a vector store (Redis, Postgres, or dedicated)
  2. ✅ Pick an embedding model (text-embedding-3-small is cost-effective)
  3. ✅ Start with threshold 0.94-0.96
  4. ✅ Implement TTL-based expiration
  5. ✅ Monitor hit rate and adjust threshold
  6. ✅ Log cache misses to identify patterns
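
Item 5 needs little machinery; a minimal in-process sketch of hit-rate tracking (a production system would export these counters to its real metrics pipeline):

```python
class CacheMetrics:
    """Tracks hit rate so threshold changes can be judged against real traffic."""
    def __init__(self):
        self.hits = 0
        self.misses = 0

    def record(self, hit: bool):
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

metrics = CacheMetrics()
for outcome in [True, True, False, True]:  # e.g. cached lookup results
    metrics.record(outcome)
print(metrics.hit_rate)
```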

Semantic caching is one of the highest-ROI optimizations for LLM applications. A few hours of implementation can cut your API costs in half and make your app feel faster.

Start small, measure everything, and adjust the threshold until it feels right.