Module 6: Consistent Hashing & Data Partitioning

🚀 Problem Statement

A vector database holds 200M embeddings across 20 nodes. Adding a 21st node might cause 95% of the data to reshuffle if hash(key) % N is used. During reshuffling, the RAG system might return incomplete results for several hours.

🧠 The Engineering Story

The Villain: "The Modulo Fallacy." Simple hash-mod partitioning means adding or removing a single node remaps almost every key.

The Hero: "The Hash Ring." Consistent hashing maps both keys and nodes onto a ring, so adding a node only remaps K/N keys (where K = total keys, N = total nodes).

The Plot:

Understand the hash ring with virtual nodes for uniform distribution
Implement range-based vs hash-based partitioning trade-offs
Design a rebalancing strategy that maintains availability during scaling
Handle hot partitions (popular documents have 100x more chunks)

The Twist (Failure): The Hot Partition. A viral compliance document generates 50% of all queries. All its chunks live on 2 nodes. Consistent hashing guarantees even data distribution but not necessarily even load distribution.

Interview Signal: Can explain consistent hashing and knows when range partitioning is better (e.g., for time-series data or range scans).

📝 Design Exercise

Design a partitioning strategy for a vector store where:

embeddings must be co-located by document for batch retrieval

the system scales from 5 to 50 nodes gracefully

popular documents don't create hot spots.