Module 3: Caching — The Art of Not Repeating Work
🚀 Problem Statement
A typical RAG query to a GenAI system costs roughly $0.03 (embedding + LLM inference). If 40% of queries are near-identical variations ("What is the system protocol?" vs "Tell me about system protocols"), significant monthly costs can be incurred due to redundant LLM calls.
🧠 The Engineering Story
The Villain: "The Stateless Amnesiac." Every request is treated as brand new. The system re-embeds the same query, re-retrieves the same documents, and re-generates the same answer — 100 times per day.
The Hero: "The Semantic Memory Layer." A multi-tier cache that remembers at every level: exact query matches, semantic near-matches, retrieved context, and generated responses.
The Plot:
- L1 — Browser/Client Cache: Cache static assets, previously rendered answers
- L2 — CDN/Edge: Cache common API responses geographically
- L3 — Application Cache (Redis): Exact query → response mapping
- L4 — Semantic Cache: Embedding similarity for near-duplicate queries
- L5 — Database Query Cache: Materialized views for common aggregations
The Twist (Failure): Stale Semantic Cache. An answer to "What PPE is required for Task 7?" might be cached, but then the enterprise document is revised. The cache could serve outdated information. Cache invalidation for semantic similarity remains a significant challenge.
Interview Signal: Can articulate cache invalidation strategies beyond simple TTL.
🧠 Key Concepts to Master
| Pattern | Description | GenAI Application |
|---|---|---|
| Cache-Aside | App checks cache first, loads from DB on miss | Standard pattern for metadata |
| Write-Through | Write to cache + DB simultaneously | Session state consistency |
| Write-Behind | Write to cache, async flush to DB | High-throughput logging |
| Semantic Caching | Match queries by embedding similarity (cosine > 0.95) | Deduplicate similar LLM queries |
| Cache Stampede Prevention | Lock/lease on cache misses | Prevent 1000 concurrent LLM calls for same query |
| Versioned Cache Keys | Include document version in cache key | Automatic invalidation on document revisions |
📝 Design Exercise
Design a caching strategy for a RAG pipeline that handles: exact query dedup, semantic near-match dedup, document revision invalidation, and cost tracking of cache hit vs miss savings.