Module 13: Search & Retrieval Systems (Hybrid Search)

🚀 Problem Statement

A RAG system that uses pure vector search might struggle with specific queries. For instance, when a user asks "What is the RCD for Task 3 in DOC-2024-0087?", the system might retrieve chunks about general "task management" — semantically similar but factually incorrect. In such cases, exact keyword matching for document reference numbers is essential.

🧠 The Engineering Story

The Villain: "The Semantic Zealot." Vector search finds "meaning-similar" content but often fails on exact identifiers, code references, acronyms, and proper nouns. A reference like "DOC-2024-0087" may have little semantic meaning to an embedding model.

The Hero: "The Hybrid Retriever." Combines BM25 keyword search (exact matches) with vector similarity (semantic understanding), using Reciprocal Rank Fusion to merge results.

The Plot:

Understand inverted indexes (BM25/TF-IDF) vs vector indexes (HNSW/IVF)
Implement Hybrid Search: keyword + semantic with RRF or weighted fusion
Add metadata filtering: filter by document type, date range, project
Implement re-ranking with a cross-encoder for final result quality

The Twist (Failure): The Re-ranking Bottleneck. A cross-encoder re-ranker (e.g., a 400M parameter model) might take 500ms to score 100 candidates. At high query volumes (e.g., 50 QPS), this stage can become a significant bottleneck, potentially requiring more GPU resources than the actual LLM generation.

Interview Signal: Can explain the retrieval pipeline stages (retrieve → filter → rerank) and justify choices at each stage.

🧠 Retrieval Pipeline

Stage	Method	Candidates	Latency
Stage 1: Sparse Retrieval	BM25 on inverted index	10,000 → 1,000	~5ms
Stage 2: Dense Retrieval	ANN on HNSW index	10M → 1,000	~10ms
Stage 3: Fusion	RRF or weighted merge	2,000 → 200	~1ms
Stage 4: Metadata Filter	SQL/filter predicates	200 → 50	~2ms
Stage 5: Re-rank	Cross-encoder model	50 → 10	~100ms
Stage 6: Generation	LLM with top-k context	10 → 1 response	~2s

🔗 Case Study References

S3 Lite Architecture — For distributed storage and retrieval patterns.