Caching for LLM Systems Is Harder Than It Looks
The economics of LLM applications are brutal. A single GPT-5 or Claude call can cost $0.03–$0.12 depending on context length. Multiply that by millions of users, and suddenly your AI product has a cost structure that would make a SaaS CFO weep.
The obvious answer? Caching.
But here's the problem: caching LLM systems isn't like caching web requests. Traditional cache strategies—built for deterministic systems with predictable keys—break down when dealing with semantic similarity, multi-step reasoning, dynamic context, and non-deterministic outputs.
Most teams discover this the hard way. They bolt on Redis, implement a simple key-value store, and watch their cache hit rates hover around 12%. Then they realize: LLM caching is a fundamentally different problem.
Why Traditional Caching Fails for LLMs
Traditional HTTP caching works because requests are exact and repeatable. If user A requests /api/products/123, and user B requests the same endpoint, the cache serves identical data. The key is deterministic. The output is stable.
LLMs destroy these assumptions:
1. Semantic equivalence ≠ string equivalence
"What's the weather in San Francisco?" and "Tell me the SF weather" are semantically identical but string-different. A naive cache misses this entirely.
2. Context windows are enormous and variable
A prompt might include 50KB of retrieved documents. Two prompts could be 98% identical but differ in one paragraph—should they share a cache entry?
3. Outputs are non-deterministic
Even with temperature=0, LLM outputs can vary slightly. Caching must handle "close enough" responses, not just perfect matches.
4. Multi-step workflows complicate invalidation
RAG pipelines, tool-calling agents, and chain-of-thought systems involve multiple cached layers. Invalidating one layer can cascade unpredictably.
As Chip Huyen noted in her analysis of LLM system design, "The hardest part of building LLM applications isn't the model—it's the infrastructure around it."
The Five Caching Layers Every LLM System Needs
Sophisticated LLM products don't use one cache—they use five, each solving a different problem.
1. Semantic Caching: Beyond String Matching
Semantic caching answers the question: "Have we seen a prompt that means the same thing?"
Instead of exact string matches, semantic caches use embedding-based similarity search. The flow:
- Embed the incoming prompt using a lightweight model (e.g., text-embedding-3-small)
- Query a vector database (Pinecone, Weaviate, Qdrant) for similar embeddings
- If similarity > threshold (e.g., 0.95 cosine similarity), return the cached response
The trick: Choosing the right similarity threshold. Too high (0.98+), and cache hits plummet. Too low (0.85), and you serve irrelevant responses.
Anthropic's prompt engineering guide suggests starting at 0.92–0.95 and adjusting based on domain specificity.
Real-world example:
A legal research tool might cache "What are the penalties for breach of contract in California?" and serve it for "CA breach of contract penalties?" Even though the strings differ, the semantic intent is identical.
Implementation pattern:
```python
def semantic_cache_lookup(prompt: str, threshold: float = 0.93):
    # embed_prompt, vector_db, and llm are placeholders for your
    # embedding model, vector store, and LLM client
    embedding = embed_prompt(prompt)
    results = vector_db.query(embedding, top_k=1)
    if results and results[0].score >= threshold:
        return results[0].cached_response
    # Cache miss - call the LLM and store the result
    response = llm.generate(prompt)
    vector_db.insert(embedding, response)
    return response
```
Cost impact: Semantic caching can reduce LLM calls by 40–60% in high-traffic scenarios with repetitive user intent.
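To make the lookup pattern concrete without external services, here is a self-contained, in-memory sketch. The embedding function is a stand-in you would replace with a real model; the threshold default follows the 0.93 used above.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

class InMemorySemanticCache:
    """Toy semantic cache: swap `embed` for a real embedding model in production."""

    def __init__(self, embed, threshold=0.93):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def lookup(self, prompt):
        query = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(query, e[0]), default=None)
        if best and cosine(query, best[0]) >= self.threshold:
            return best[1]  # cache hit
        return None  # cache miss - caller invokes the LLM, then store()

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

A linear scan is fine for a sketch; at scale this is exactly the job a vector database does with approximate nearest-neighbor indexes.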
2. Prompt-Template Caching: Amortizing Context Costs
OpenAI and Anthropic now support prompt caching at the API level—but most teams aren't using it correctly.
The concept: if your prompt has a large, static prefix (system instructions, few-shot examples, retrieved documents), the LLM provider caches the processed representation of that prefix. Subsequent requests reuse it, slashing costs and latency.
Anthropic's prompt caching (launched mid-2024) reduces costs by 90% for cached prefixes and cuts latency by ~85%. But there's a catch: the cached portion must be identical across requests.
Design pattern:
Structure prompts so static content comes first:
```
[SYSTEM INSTRUCTIONS - 2000 tokens] ← cached
[RETRIEVED CONTEXT   - 5000 tokens] ← cached
[USER QUERY          -   50 tokens] ← not cached
```
If the retrieved context changes per request, caching fails. The solution? Batch similar queries or pre-compute common retrieval sets.
Anti-pattern:
Inserting user-specific data (user ID, timestamps) early in the prompt breaks caching. Move dynamic content to the end.
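A sketch of what this ordering looks like as a request body, loosely following Anthropic's prompt-caching API shape (`cache_control` markers on static system blocks): the model name is a placeholder, and you should confirm the exact fields against the provider's current documentation.

```python
def build_request(system_instructions: str, retrieved_context: str, user_query: str) -> dict:
    """Assemble a request with static, cacheable content first
    and the dynamic user query last."""
    return {
        "model": "claude-sonnet-4",  # placeholder model name
        "max_tokens": 1024,
        "system": [
            # Static prefix: marked cacheable, must be byte-identical across requests
            {"type": "text", "text": system_instructions,
             "cache_control": {"type": "ephemeral"}},
            # Retrieved context: only cacheable if reused unchanged across requests
            {"type": "text", "text": retrieved_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Dynamic content goes last so it never breaks the cached prefix
            {"role": "user", "content": user_query},
        ],
    }
```

The key point is structural: anything above a `cache_control` marker must not vary, so user IDs and timestamps belong in the final message, never in the system prefix.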
As Simon Willison explored in his caching experiments, "The structure of your prompt is now a performance optimization problem, not just a quality problem."
3. Retrieval-Result Caching: The RAG Bottleneck
RAG (Retrieval-Augmented Generation) systems fetch relevant documents before generating responses. But retrieval is expensive—embedding queries, searching vector databases, and re-ranking results can add 200–500ms per request.
The insight: Many user queries retrieve the same documents.
"What's our refund policy?" and "How do I get a refund?" likely retrieve identical knowledge base articles. Caching retrieval results avoids redundant vector searches.
Implementation:
```python
def normalize(query: str) -> str:
    # Collapse case and whitespace so trivial variants share a key
    return " ".join(query.lower().split())

def cached_retrieval(query: str):
    cache_key = normalize(query)
    if cache_key in retrieval_cache:
        return retrieval_cache[cache_key]
    # Perform retrieval (embed and vector_db are placeholders)
    docs = vector_db.search(embed(query), top_k=5)
    retrieval_cache[cache_key] = docs
    return docs
```
Gotcha: Retrieval caches must be content-aware. If the underlying knowledge base updates (new docs, edited articles), the cache must invalidate. Use event-driven invalidation:
```python
@event_handler("knowledge_base.updated")  # illustrative event-bus hook
def invalidate_retrieval_cache(doc_id: str):
    # Invalidate all cache entries that included this doc
    retrieval_cache.invalidate_by_doc(doc_id)
```
Latency impact: Retrieval caching can cut RAG response time by 30–40%.
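The `invalidate_by_doc` call above implies the cache knows which documents back each entry. A minimal sketch of that bookkeeping, using plain dictionaries (the structure and method names are illustrative, not a library API):

```python
class DocAwareRetrievalCache:
    """Retrieval cache that records which docs back each cached entry,
    so a document update invalidates only the affected queries."""

    def __init__(self):
        self.entries = {}    # query_key -> list of doc dicts
        self.doc_index = {}  # doc_id -> set of query_keys that used it

    def get(self, query_key):
        return self.entries.get(query_key)

    def put(self, query_key, docs):
        self.entries[query_key] = docs
        for doc in docs:
            self.doc_index.setdefault(doc["id"], set()).add(query_key)

    def invalidate_by_doc(self, doc_id):
        # Drop every cached query result that included this document
        for key in self.doc_index.pop(doc_id, set()):
            self.entries.pop(key, None)
```

The reverse index is the whole trick: without it, a doc update forces you to either flush everything or serve stale retrievals.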
4. Tool-Result Caching: For Agentic Systems
LLM agents call external tools—APIs, databases, calculators. If an agent calls get_weather("San Francisco") 1000 times in an hour, why hit the API 1000 times?
Tool-result caching is conceptually simple but requires TTL (time-to-live) tuning:
- Weather data: 30–60 minutes
- Stock prices: 1–5 minutes
- User profile data: 5–15 minutes
- Historical data: hours or days
The challenge: Different tools have different freshness requirements. A one-size-fits-all TTL doesn't work.
Pattern:
```python
@cache(ttl=3600)  # 1 hour
def get_weather(location: str):
    return weather_api.fetch(location)

@cache(ttl=300)  # 5 minutes
def get_stock_price(ticker: str):
    return market_api.fetch(ticker)
```
Advanced: Implement conditional caching based on tool response metadata. If an API returns Cache-Control: max-age=600, respect it.
LangChain's tool caching documentation provides useful patterns, though most teams need custom logic for production use.
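The `@cache(ttl=...)` decorator used above is not a standard-library feature; a minimal implementation might look like this (in-process, positional-args-only, no eviction — a sketch, not a production cache):

```python
import time
import functools

def cache(ttl: float):
    """Decorator factory: cache a function's results for `ttl` seconds."""
    def decorator(fn):
        store = {}  # args tuple -> (expires_at, value)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]  # still fresh: serve cached value
            value = fn(*args)  # expired or missing: recompute
            store[args] = (now + ttl, value)
            return value

        return wrapper
    return decorator
```

A production version would also need keyword-argument handling, size bounds, and (per the Cache-Control point above) per-call TTL overrides driven by response metadata.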
5. Tenant-Aware Invalidation: The Multi-Tenancy Trap
SaaS LLM products serve multiple customers (tenants). Each tenant has isolated data, but they share infrastructure.
The problem: Cache poisoning and stale data across tenants.
If Tenant A updates their knowledge base, Tenant B's cache shouldn't serve Tenant A's old data. Yet naive caching strategies leak data across tenants or fail to invalidate properly.
Solution: Namespace caches by tenant:
```python
cache_key = f"tenant:{tenant_id}:query:{hash(query)}"
```
But that's not enough. When Tenant A updates data, you must:
- Invalidate all cache entries that depend on that data
- Avoid invalidating Tenant B's unrelated caches
- Handle cascading invalidation (retrieval → generation → tool results)
Real-world pattern:
```python
class TenantAwareCache:
    def invalidate_tenant_data(self, tenant_id: str, data_type: str):
        # Invalidate semantic cache entries
        self.semantic_cache.delete_by_tenant(tenant_id)
        # Invalidate retrieval results
        if data_type == "knowledge_base":
            self.retrieval_cache.delete_by_tenant(tenant_id)
        # Invalidate prompt template caches
        self.prompt_cache.delete_by_tenant(tenant_id)
```
Observability is critical. Track cache hit rates per tenant. A sudden drop signals either a data update or a cache poisoning bug.
As Shreya Shankar discusses in her work on LLM observability, "Multi-tenant LLM systems need tenant-level metrics, not just system-level metrics."
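A self-contained sketch of the namespacing idea, showing how tenant-prefixed keys make per-tenant invalidation safe (the class and method names are illustrative):

```python
class TenantNamespacedCache:
    """Key-value cache where every key is namespaced by tenant,
    so invalidating one tenant cannot touch another's entries."""

    def __init__(self):
        self.store = {}

    def _key(self, tenant_id: str, query: str) -> str:
        return f"tenant:{tenant_id}:query:{hash(query)}"

    def get(self, tenant_id: str, query: str):
        return self.store.get(self._key(tenant_id, query))

    def put(self, tenant_id: str, query: str, response):
        self.store[self._key(tenant_id, query)] = response

    def invalidate_tenant(self, tenant_id: str):
        # Delete only keys under this tenant's namespace
        prefix = f"tenant:{tenant_id}:"
        for key in [k for k in self.store if k.startswith(prefix)]:
            del self.store[key]
```

Note that Python's built-in `hash` is salted per process; a shared cache like Redis would need a stable digest (e.g. SHA-256 of the normalized query) instead.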
The Invalidation Problem: When to Flush What
Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things."
LLM systems make invalidation exponentially harder because caches are interdependent:
- A knowledge base update invalidates retrieval caches
- Which invalidates semantic caches (since they reference old retrievals)
- Which might invalidate tool-result caches (if tools use retrieved data)
Naive strategy: Invalidate everything. Result: Cache hit rates drop to near-zero.
Better strategy: Build a cache dependency graph:
```
Knowledge Base Update
        ↓
Retrieval Cache (affected docs)
        ↓
Semantic Cache (queries using those docs)
        ↓
Tool Results (if tools depend on those docs)
```
Implementation:
```python
from typing import List

class CacheInvalidationGraph:
    def on_knowledge_update(self, doc_ids: List[str]):
        # Find retrieval cache entries using these docs
        affected_retrievals = self.retrieval_cache.find_by_docs(doc_ids)
        # Invalidate semantic caches that used those retrievals
        for retrieval in affected_retrievals:
            self.semantic_cache.invalidate_by_retrieval(retrieval.id)
        # Invalidate tool results if they reference these docs
        self.tool_cache.invalidate_by_docs(doc_ids)
```
Monitoring: Track invalidation cascades. If a single doc update invalidates 10,000 cache entries, something's wrong.
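To show the cascade end to end, here is a toy, dictionary-backed version (hypothetical structure, not a library): it tracks doc → retrieval and retrieval → semantic dependencies and returns the cascade size so it can be monitored.

```python
class SimpleInvalidationGraph:
    """Toy cascade: doc update -> retrieval entries -> semantic entries."""

    def __init__(self):
        self.doc_to_retrievals = {}      # doc_id -> set of retrieval keys
        self.retrieval_to_semantic = {}  # retrieval key -> set of semantic keys
        self.retrieval_cache = {}
        self.semantic_cache = {}

    def on_knowledge_update(self, doc_ids):
        invalidated = 0
        for doc_id in doc_ids:
            for r_key in self.doc_to_retrievals.pop(doc_id, set()):
                self.retrieval_cache.pop(r_key, None)
                invalidated += 1
                # Cascade: semantic entries built on this retrieval are stale too
                for s_key in self.retrieval_to_semantic.pop(r_key, set()):
                    self.semantic_cache.pop(s_key, None)
                    invalidated += 1
        return invalidated  # emit this as a metric to catch runaway cascades
```

Returning the cascade size is exactly the monitoring hook mentioned above: alert when one doc update invalidates thousands of entries.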
Cost-Latency-Accuracy Tradeoffs
Every caching decision is a three-way tradeoff:
| Strategy | Cost Savings | Latency Reduction | Accuracy Risk |
|---|---|---|---|
| Semantic caching (high threshold) | 40–60% | 60–80% | Low |
| Semantic caching (low threshold) | 60–80% | 70–90% | Medium-High |
| Prompt-template caching | 70–90% | 80–90% | None |
| Retrieval caching | 20–40% | 30–50% | Medium (staleness) |
| Tool-result caching | 30–70% | 50–80% | High (staleness) |
The insight: Different parts of your product tolerate different accuracy risks.
- Customer support chatbot: High tolerance for semantic caching (users ask similar questions)
- Financial analysis tool: Low tolerance (precision matters, data changes frequently)
- Code generation assistant: Medium tolerance (caching common patterns works, but edge cases need fresh responses)
Design principle: Tune caching aggressiveness per feature, not per system.
Actionable Takeaways for CTOs and CPOs
1. Start with prompt-template caching—it's the highest ROI
If you're using Anthropic or OpenAI, enable prompt caching immediately. Restructure prompts to front-load static content. This alone can cut costs 50–70% with zero accuracy loss.
2. Build semantic caching for high-traffic, repetitive use cases
Don't try to cache everything semantically. Focus on:
- FAQ-style queries
- Common user intents
- Repetitive workflows
3. Implement cache observability from day one
Track:
- Cache hit rate (overall and per layer)
- Cache hit rate by tenant
- Invalidation frequency and cascade size
- Cost savings vs. LLM baseline
Use tools like Helicone, LangSmith, or build custom dashboards.
4. Design for graceful degradation
Caches will fail. LLMs should still work. Never block a user request because a cache is down.
5. Treat caching as a product problem, not just an infra problem
Cache strategy affects user experience. CPOs should be involved in decisions like:
- How stale can responses be?
- Which features tolerate lower accuracy?
- What's the cost ceiling per user?
6. Invest in cache invalidation infrastructure early
The teams that scale LLM products successfully build sophisticated invalidation logic before they need it. Retrofitting invalidation into a live system is brutal.
The Bottom Line
LLM caching isn't a nice-to-have—it's a requirement for economic viability. But it's also not a solved problem. The tools are immature. The patterns are still emerging. The tradeoffs are nuanced.
The teams winning in production LLM systems are treating caching as a first-class architectural concern, not an afterthought. They're building multi-layer strategies, investing in observability, and tuning aggressively based on real usage patterns.
As Andrej Karpathy noted on the evolution of LLM systems, "The next wave of LLM innovation won't be in model quality—it'll be in the systems around the models."
Caching is one of those systems. And it's harder than it looks.