Caching for LLM Systems Is Harder Than It Looks

Tags: LLMs, caching, system-architecture, cost-optimization, RAG, AI-infrastructure

Why caching is hard for LLM applications, and the strategies that actually work.

The economics of LLM applications are brutal. A single GPT-5 or Claude call can cost $0.03–$0.12 depending on context length. Multiply that by millions of users, and suddenly your AI product has a cost structure that would make a SaaS CFO weep.

The obvious answer? Caching.

But here's the problem: caching for LLM systems isn't like caching web requests. Traditional cache strategies—built for deterministic systems with predictable keys—break down when dealing with semantic similarity, multi-step reasoning, dynamic context, and non-deterministic outputs.

Most teams discover this the hard way. They bolt on Redis, implement a simple key-value store, and watch their cache hit rates hover around 12%. Then they realize: LLM caching is a fundamentally different problem.

Why Traditional Caching Fails for LLMs

Traditional HTTP caching works because requests are exact and repeatable. If user A requests /api/products/123, and user B requests the same endpoint, the cache serves identical data. The key is deterministic. The output is stable.

LLMs destroy these assumptions:

1. Semantic equivalence ≠ string equivalence

"What's the weather in San Francisco?" and "Tell me the SF weather" are semantically identical but string-different. A naive cache misses this entirely.

2. Context windows are enormous and variable

A prompt might include 50KB of retrieved documents. Two prompts could be 98% identical but differ in one paragraph—should they share a cache entry?

3. Outputs are non-deterministic

Even with temperature=0, LLM outputs can vary slightly. Caching must handle "close enough" responses, not just perfect matches.

4. Multi-step workflows complicate invalidation

RAG pipelines, tool-calling agents, and chain-of-thought systems involve multiple cached layers. Invalidating one layer can cascade unpredictably.

As Chip Huyen noted in her analysis of LLM system design, "The hardest part of building LLM applications isn't the model—it's the infrastructure around it."

The Five Caching Layers Every LLM System Needs

Sophisticated LLM products don't use one cache—they use five, each solving a different problem.

1. Semantic Caching: Beyond String Matching

Semantic caching answers the question: "Have we seen a prompt that means the same thing?"

Instead of exact string matches, semantic caches use embedding-based similarity search. The flow:

  1. Embed the incoming prompt using a lightweight model (e.g., text-embedding-3-small)
  2. Query a vector database (Pinecone, Weaviate, Qdrant) for similar embeddings
  3. If similarity > threshold (e.g., 0.95 cosine similarity), return the cached response

The trick: Choosing the right similarity threshold. Too high (0.98+), and cache hits plummet. Too low (0.85), and you serve irrelevant responses.

Anthropic's prompt engineering guide suggests starting at 0.92–0.95 and adjusting based on domain specificity.

Real-world example:
A legal research tool might cache "What are the penalties for breach of contract in California?" and serve it for "CA breach of contract penalties?" Even though the strings differ, the semantic intent is identical.

Implementation pattern:

def semantic_cache_lookup(prompt: str, threshold: float = 0.93):
    embedding = embed_prompt(prompt)
    results = vector_db.query(embedding, top_k=1)

    # Guard against an empty index before checking similarity
    if results and results[0].score >= threshold:
        return results[0].cached_response

    # Cache miss: call the LLM and store the new entry
    response = llm.generate(prompt)
    vector_db.insert(embedding, response)
    return response

Cost impact: Semantic caching can reduce LLM calls by 40–60% in high-traffic scenarios with repetitive user intent.

2. Prompt-Template Caching: Amortizing Context Costs

OpenAI and Anthropic now support prompt caching at the API level—but most teams aren't using it correctly.

The concept: if your prompt has a large, static prefix (system instructions, few-shot examples, retrieved documents), the LLM provider caches the processed representation of that prefix. Subsequent requests reuse it, slashing costs and latency.

Anthropic's prompt caching (launched mid-2024) reduces costs by 90% for cached prefixes and cuts latency by ~85%. But there's a catch: the cached portion must be identical across requests.

Design pattern:

Structure prompts so static content comes first:

[SYSTEM INSTRUCTIONS - 2000 tokens] ← cached
[RETRIEVED CONTEXT - 5000 tokens]   ← cached
[USER QUERY - 50 tokens]            ← not cached

If the retrieved context changes per request, caching fails. The solution? Batch similar queries or pre-compute common retrieval sets.

Anti-pattern:
Inserting user-specific data (user ID, timestamps) early in the prompt breaks caching. Move dynamic content to the end.
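
Concretely, the prefix layout above maps to cache_control markers in the style of Anthropic's Messages API. The sketch below builds a request payload with static blocks first and dynamic content last (the model name and helper functions are illustrative, not a definitive implementation), and fingerprints the cacheable prefix so you can verify it stays identical across requests:

```python
import hashlib


def build_payload(system_instructions: str, retrieved_context: str,
                  user_query: str) -> dict:
    """Assemble a request so static content leads and dynamic content trails."""
    return {
        "model": "claude-sonnet-4",  # placeholder model name
        "system": [
            # Static prefix: identical across requests, so the provider can cache it
            {"type": "text", "text": system_instructions,
             "cache_control": {"type": "ephemeral"}},
            {"type": "text", "text": retrieved_context,
             "cache_control": {"type": "ephemeral"}},
        ],
        # Dynamic suffix: varies per request, never cached
        "messages": [{"role": "user", "content": user_query}],
    }


def prefix_fingerprint(payload: dict) -> str:
    """Hash only the cacheable prefix; equal fingerprints can share a cache."""
    prefix = "".join(block["text"] for block in payload["system"])
    return hashlib.sha256(prefix.encode()).hexdigest()
```

If the fingerprint differs between two requests that you expected to share a prefix, provider-side caching is silently failing—a cheap invariant to assert in tests.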

As Simon Willison explored in his caching experiments, "The structure of your prompt is now a performance optimization problem, not just a quality problem."

3. Retrieval-Result Caching: The RAG Bottleneck

RAG (Retrieval-Augmented Generation) systems fetch relevant documents before generating responses. But retrieval is expensive—embedding queries, searching vector databases, and re-ranking results can add 200–500ms per request.

The insight: Many user queries retrieve the same documents.

"What's our refund policy?" and "How do I get a refund?" likely retrieve identical knowledge base articles. Caching retrieval results avoids redundant vector searches.

Implementation:

import hashlib

def cached_retrieval(query: str):
    # Python's built-in hash() is salted per process; use a stable digest
    cache_key = hashlib.sha256(normalize(query).encode()).hexdigest()

    if cache_key in retrieval_cache:
        return retrieval_cache[cache_key]

    # Cache miss: perform the retrieval and store the result
    docs = vector_db.search(embed(query), top_k=5)
    retrieval_cache[cache_key] = docs
    return docs

Gotcha: Retrieval caches must be content-aware. If the underlying knowledge base updates (new docs, edited articles), the cache must invalidate. Use event-driven invalidation:

@event_handler("knowledge_base.updated")
def invalidate_retrieval_cache(doc_id: str):
    # Invalidate all cache entries that included this doc
    retrieval_cache.invalidate_by_doc(doc_id)
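
The invalidate_by_doc call above assumes the cache knows which documents back each entry. A minimal sketch of that bookkeeping, using plain dicts as a stand-in for a real cache store:

```python
class RetrievalCache:
    """Retrieval cache that tracks which docs back each entry, so a single
    doc update invalidates exactly the affected entries and nothing else."""

    def __init__(self):
        self._entries = {}  # cache_key -> list of retrieved docs
        self._by_doc = {}   # doc_id -> set of cache_keys that include it

    def put(self, cache_key: str, docs: list):
        self._entries[cache_key] = docs
        for doc in docs:
            self._by_doc.setdefault(doc["id"], set()).add(cache_key)

    def get(self, cache_key: str):
        return self._entries.get(cache_key)

    def invalidate_by_doc(self, doc_id: str):
        # Drop only the entries that actually included this doc
        for key in self._by_doc.pop(doc_id, set()):
            self._entries.pop(key, None)
```

The reverse index (doc_id → cache keys) is what makes targeted invalidation cheap; without it you're left scanning every entry on each knowledge-base update.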

Latency impact: Retrieval caching can cut RAG response time by 30–40%.

4. Tool-Result Caching: For Agentic Systems

LLM agents call external tools—APIs, databases, calculators. If an agent calls get_weather("San Francisco") 1000 times in an hour, why hit the API 1000 times?

Tool-result caching is conceptually simple but requires TTL (time-to-live) tuning:

  • Weather data: 30–60 minutes
  • Stock prices: 1–5 minutes
  • User profile data: 5–15 minutes
  • Historical data: hours or days

The challenge: Different tools have different freshness requirements. A one-size-fits-all TTL doesn't work.

Pattern:

@cache(ttl=3600)  # 1 hour
def get_weather(location: str):
    return weather_api.fetch(location)

@cache(ttl=300)  # 5 minutes
def get_stock_price(ticker: str):
    return market_api.fetch(ticker)
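
The @cache decorator above is shorthand; a minimal in-process TTL version might look like the sketch below (no eviction, no size bound, and args must be hashable—illustrative, not production code):

```python
import functools
import time


def cache(ttl: float):
    """Minimal per-argument TTL cache decorator."""
    def decorator(fn):
        store = {}  # args -> (expires_at, value)

        @functools.wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit and hit[0] > now:
                return hit[1]  # still fresh: serve cached value
            value = fn(*args)
            store[args] = (now + ttl, value)
            return value
        return wrapper
    return decorator
```

In production you would typically back this with Redis or similar so the TTL survives process restarts and is shared across workers.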

Advanced: Implement conditional caching based on tool response metadata. If an API returns Cache-Control: max-age=600, respect it.

LangChain's tool caching documentation provides useful patterns, though most teams need custom logic for production use.

5. Tenant-Aware Invalidation: The Multi-Tenancy Trap

SaaS LLM products serve multiple customers (tenants). Each tenant has isolated data, but they share infrastructure.

The problem: Cache poisoning and stale data across tenants.

If Tenant A updates their knowledge base, Tenant B's cache shouldn't serve Tenant A's old data. Yet naive caching strategies leak data across tenants or fail to invalidate properly.

Solution: Namespace caches by tenant:

cache_key = f"tenant:{tenant_id}:query:{hash(query)}"

But that's not enough. When Tenant A updates data, you must:

  1. Invalidate all cache entries that depend on that data
  2. Avoid invalidating Tenant B's unrelated caches
  3. Handle cascading invalidation (retrieval → generation → tool results)

Real-world pattern:

class TenantAwareCache:
    def invalidate_tenant_data(self, tenant_id: str, data_type: str):
        # Invalidate semantic cache entries
        self.semantic_cache.delete_by_tenant(tenant_id)
        
        # Invalidate retrieval results
        if data_type == "knowledge_base":
            self.retrieval_cache.delete_by_tenant(tenant_id)
        
        # Invalidate prompt template caches
        self.prompt_cache.delete_by_tenant(tenant_id)

Observability is critical. Track cache hit rates per tenant. A sudden drop signals either a data update or a cache poisoning bug.

As Shreya Shankar discusses in her work on LLM observability, "Multi-tenant LLM systems need tenant-level metrics, not just system-level metrics."

The Invalidation Problem: When to Flush What

Phil Karlton famously said, "There are only two hard things in Computer Science: cache invalidation and naming things."

LLM systems make invalidation exponentially harder because caches are interdependent:

  • A knowledge base update invalidates retrieval caches
  • Which invalidates semantic caches (since they reference old retrievals)
  • Which might invalidate tool-result caches (if tools use retrieved data)

Naive strategy: Invalidate everything. Result: Cache hit rates drop to near-zero.

Better strategy: Build a cache dependency graph:

Knowledge Base Update
  ↓
Retrieval Cache (affected docs)
  ↓
Semantic Cache (queries using those docs)
  ↓
Tool Results (if tools depend on those docs)

Implementation:

from typing import List

class CacheInvalidationGraph:
    def on_knowledge_update(self, doc_ids: List[str]):
        # Find retrieval cache entries using these docs
        affected_retrievals = self.retrieval_cache.find_by_docs(doc_ids)
        
        # Invalidate semantic caches that used those retrievals
        for retrieval in affected_retrievals:
            self.semantic_cache.invalidate_by_retrieval(retrieval.id)
        
        # Invalidate tool results if they reference these docs
        self.tool_cache.invalidate_by_docs(doc_ids)

Monitoring: Track invalidation cascades. If a single doc update invalidates 10,000 cache entries, something's wrong.

Cost-Latency-Accuracy Tradeoffs

Every caching decision is a three-way tradeoff:

Strategy                          | Cost Savings | Latency Reduction | Accuracy Risk
Semantic caching (high threshold) | 40–60%       | 60–80%            | Low
Semantic caching (low threshold)  | 60–80%       | 70–90%            | Medium-High
Prompt-template caching           | 70–90%       | 80–90%            | None
Retrieval caching                 | 20–40%       | 30–50%            | Medium (staleness)
Tool-result caching               | 30–70%       | 50–80%            | High (staleness)

The insight: Different parts of your product tolerate different accuracy risks.

  • Customer support chatbot: High tolerance for semantic caching (users ask similar questions)
  • Financial analysis tool: Low tolerance (precision matters, data changes frequently)
  • Code generation assistant: Medium tolerance (caching common patterns works, but edge cases need fresh responses)

Design principle: Tune caching aggressiveness per feature, not per system.

Actionable Takeaways for CTOs and CPOs

1. Start with prompt-template caching—it's the highest ROI

If you're using Anthropic or OpenAI, enable prompt caching immediately. Restructure prompts to front-load static content. This alone can cut costs 50–70% with zero accuracy loss.

2. Build semantic caching for high-traffic, repetitive use cases

Don't try to cache everything semantically. Focus on:

  • FAQ-style queries
  • Common user intents
  • Repetitive workflows

3. Implement cache observability from day one

Track:

  • Cache hit rate (overall and per layer)
  • Cache hit rate by tenant
  • Invalidation frequency and cascade size
  • Cost savings vs. LLM baseline

Use tools like Helicone, LangSmith, or build custom dashboards.
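
If you build a custom dashboard, per-layer and per-tenant hit rates are the core signal. A minimal counter (names illustrative) is enough to start:

```python
from collections import defaultdict


class CacheMetrics:
    """Track cache hit rates per (layer, tenant) pair."""

    def __init__(self):
        self._hits = defaultdict(int)    # (layer, tenant) -> hits
        self._total = defaultdict(int)   # (layer, tenant) -> lookups

    def record(self, layer: str, tenant_id: str, hit: bool):
        key = (layer, tenant_id)
        self._total[key] += 1
        if hit:
            self._hits[key] += 1

    def hit_rate(self, layer: str, tenant_id: str) -> float:
        key = (layer, tenant_id)
        return self._hits[key] / self._total[key] if self._total[key] else 0.0
```

Alert on sudden per-tenant drops, not just the global average—a system-wide rate can look healthy while one tenant's cache is being invalidated on every request.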

4. Design for graceful degradation

Caches will fail. LLMs should still work. Never block a user request because a cache is down.

5. Treat caching as a product problem, not just an infra problem

Cache strategy affects user experience. CPOs should be involved in decisions like:

  • How stale can responses be?
  • Which features tolerate lower accuracy?
  • What's the cost ceiling per user?

6. Invest in cache invalidation infrastructure early

The teams that scale LLM products successfully build sophisticated invalidation logic before they need it. Retrofitting invalidation into a live system is brutal.


The Bottom Line

LLM caching isn't a nice-to-have—it's a requirement for economic viability. But it's also not a solved problem. The tools are immature. The patterns are still emerging. The tradeoffs are nuanced.

The teams winning in production LLM systems are treating caching as a first-class architectural concern, not an afterthought. They're building multi-layer strategies, investing in observability, and tuning aggressively based on real usage patterns.

As Andrej Karpathy noted on the evolution of LLM systems, "The next wave of LLM innovation won't be in model quality—it'll be in the systems around the models."

Caching is one of those systems. And it's harder than it looks.


If any of this resonates, you should subscribe.

No spam. No fluff. Just honest reflections on building products, leading teams, and staying curious.