Deep Dive: NoSQL Database Internals
This document compares the internal read and write paths of three foundational NoSQL architectures: Dynamo, Cassandra, and BigTable.
1. Amazon Dynamo (Key–Value)
Dynamo is a highly available, decentralized key–value store designed to be "always writeable": writes are accepted even during node or network failures.
Write Path
- The coordinator node generates a new data version and a Vector Clock, a list of (node, counter) pairs, to track causality.
- The request is then sent to the next N−1 healthy nodes, aiming for a Sloppy Quorum rather than a strict one.
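The (node, counter) mechanics can be sketched as follows. This is an illustrative model, not Dynamo's actual code; the function names are hypothetical, and a clock is represented as a plain dict:

```python
# Hypothetical sketch of Dynamo-style vector clocks: each object version
# carries a dict of {node_id: counter} pairs.

def vc_increment(clock, node):
    """Return a new clock with `node`'s counter bumped (coordinator write)."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_descends(a, b):
    """True if clock `a` dominates `b`, i.e. `a` is a newer version of `b`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def vc_concurrent(a, b):
    """Concurrent updates: neither clock descends from the other."""
    return not vc_descends(a, b) and not vc_descends(b, a)
```

Two versions that branch from a common ancestor (e.g. one incremented by node `Sy`, the other by `Sz`) compare as concurrent, which is exactly the case that triggers conflict handling below.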
Hinted Handoff
- If a target node is down, a healthy node temporarily buffers the write locally.
- Once the failed node recovers, the buffered write is delivered.
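A minimal sketch of this buffering behavior, assuming a hypothetical in-memory hint store (real systems persist hints to disk):

```python
# Illustrative hinted-handoff buffer: a healthy node stores writes meant
# for a down replica, tagged with a "hint" naming the intended recipient,
# and replays them when that replica recovers.

from collections import defaultdict

class HintedHandoffBuffer:
    def __init__(self):
        self._hints = defaultdict(list)   # intended_node -> [(key, value), ...]

    def store_hint(self, intended_node, key, value):
        """Buffer a write locally because `intended_node` is unreachable."""
        self._hints[intended_node].append((key, value))

    def replay(self, recovered_node, send):
        """Deliver buffered writes once the node is back, then drop the hints."""
        for key, value in self._hints.pop(recovered_node, []):
            send(recovered_node, key, value)
```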
Read Path
- The coordinator requests the data from the other N−1 replica nodes (it serves one copy itself).
- It waits for R−1 replies before responding, so that R replicas in total have answered.
Conflict Handling
- If Vector Clocks indicate concurrent updates, Dynamo returns all versions to the client.
- The client performs semantic reconciliation.
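Semantic reconciliation is application-specific. A sketch in the spirit of Dynamo's well-known shopping-cart use case, with a hypothetical merge function:

```python
# Client-side reconciliation example: when the store returns concurrent
# versions ("siblings"), the client merges them with a set union.

def reconcile_carts(siblings):
    """Merge concurrent shopping-cart versions by unioning their items.
    Note: items deleted in one branch may resurface after the merge --
    a known trade-off of union-based reconciliation."""
    merged = set()
    for cart in siblings:
        merged |= set(cart)
    return merged
```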
Anti-Entropy
Background synchronization occurs using Merkle Trees.
- A Merkle Tree is a binary tree of hashes.
- It allows replicas to quickly detect divergence.
- Only the inconsistent data ranges are synchronized, minimizing network transfer.
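The idea can be sketched as below. This toy version assumes a power-of-two number of leaves and compares roots first, then leaf hashes, to find divergent ranges without shipping the data itself:

```python
# Illustrative Merkle-tree anti-entropy: each leaf hashes one key range;
# each parent hashes the concatenation of its two children.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaf_hashes):
    """Return a list of levels, leaves first, root level last."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diverging_leaves(tree_a, tree_b):
    """Indices of leaf ranges whose hashes differ between two replicas."""
    if tree_a[-1] == tree_b[-1]:      # identical roots: replicas are in sync
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
```

If the roots match, a single hash comparison proves the replicas agree; otherwise only the differing leaf ranges need to be exchanged.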
2. Apache Cassandra (Wide-Column)
Cassandra combines:
- Dynamo’s decentralized cluster architecture
- LSM-Tree storage engine
Architecture Overview
```mermaid
graph TD
  subgraph Cassandra_Node
    W[Write Request] --> WAL[Commit Log on Disk]
    W --> MT[In-Memory MemTable]
    MT -->|Flushes when full| SST[Immutable SSTable on Disk]
    R[Read Request] --> RC[Row Cache]
    R --> BF[Bloom Filter]
    BF -->|Positive| PI[Partition Index]
    PI --> SST_R[SSTable Read]
    R --> MT_R[MemTable Read]
    SST_R --> Merge[Merge Data & Return Latest]
    MT_R --> Merge
  end
```
Write Path
- The coordinator forwards the write to replicas based on the configured consistency level.
- If a replica node is unavailable, Hinted Handoff may temporarily store the write on another node.
- On each replica, the write is immediately appended to the Commit Log (on disk) to guarantee durability across crashes.
- The data is written to an in-memory MemTable.
- When the MemTable reaches a threshold, it is flushed to disk as an immutable SSTable.
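The steps above can be sketched as a minimal LSM write path. This is a toy model, not Cassandra's implementation; lists stand in for on-disk structures:

```python
# Minimal LSM write-path sketch: append to a commit log for durability,
# insert into an in-memory memtable, and flush the memtable to an
# immutable, sorted "SSTable" when it reaches a size threshold.

class LSMWritePath:
    def __init__(self, memtable_limit=2):
        self.commit_log = []          # stand-in for the on-disk commit log
        self.memtable = {}
        self.sstables = []            # each flush yields one immutable SSTable
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append
        self.memtable[key] = value             # 2. fast in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # 3. persist the memtable as a sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
```

Note that writes never modify data in place: SSTables are only appended, which is why reads must later merge results across files.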
Read Path
- The system first checks the Row Cache (in-memory).
- If the data is not found, Bloom Filters are used to probabilistically determine which SSTables might contain the data.
  - This avoids unnecessary disk seeks on SSTables that definitely lack the key.
- The system reads data from:
- MemTable
- Relevant SSTables
- The results are merged to return the most recent value.
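The Bloom filter step can be illustrated with a minimal implementation (not Cassandra's; the sizes and hash scheme here are arbitrary choices for the sketch):

```python
# Illustrative Bloom filter: k hash functions set k bits per key; a read
# probes the same bits. Any zero bit means the SSTable definitely lacks
# the key (no disk seek needed); all ones means "maybe present" --
# false positives are possible, false negatives are not.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, k=3):
        self.size, self.k = size, k
        self.bits = bytearray(size)

    def _positions(self, key: str):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))
```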
Read Repair
- The coordinator hashes responses from multiple replicas.
- If mismatches are detected, the coordinator updates the out-of-sync replicas.
- This ensures eventual consistency across nodes.
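A simplified sketch of the repair decision: Cassandra actually compares response digests and resolves by write timestamp; here, timestamped values stand in for both:

```python
# Toy read-repair logic: compare replica responses, pick the value with
# the newest timestamp, and report which replicas hold stale data so
# the coordinator can push the winning value back to them.

def read_repair(responses):
    """responses: {replica: (timestamp, value)}.
    Returns (winning_value, replicas_to_repair)."""
    winner = max(responses.values())                    # newest (timestamp, value)
    stale = [r for r, tv in responses.items() if tv != winner]
    return winner[1], stale
```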
3. Practical Implementation
Explore the low-level implementations of LSM-Trees and distributed storage internals: