Deep Dive: NoSQL Database Internals
This document compares the internal read and write paths of three foundational NoSQL architectures: Dynamo, Cassandra, and BigTable.
1. Amazon Dynamo (Key–Value)
Dynamo is a highly available, decentralized key–value store designed to be "always writeable": writes are accepted even during node or network failures.
Write Path
- The coordinator node generates a new data version and a Vector Clock, a list of (node, counter) pairs, to track causality.
- The request is then sent to the next N−1 healthy nodes, aiming for a Sloppy Quorum rather than a strict one.
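The (node, counter) mechanics can be sketched as follows. This is an illustrative model, not Dynamo's actual code; the function names are hypothetical, and a clock is represented as a plain dict:

```python
# Hypothetical sketch of Dynamo-style vector clocks: each object version
# carries a dict of {node_id: counter} pairs.

def vc_increment(clock, node):
    """Return a new clock with `node`'s counter bumped (coordinator write)."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_descends(a, b):
    """True if clock `a` dominates `b`, i.e. `a` is a newer version of `b`."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def vc_concurrent(a, b):
    """Concurrent updates: neither clock descends from the other."""
    return not vc_descends(a, b) and not vc_descends(b, a)
```

Two versions that branch from a common ancestor (e.g. one incremented by node `Sy`, the other by `Sz`) compare as concurrent, which is exactly the case that triggers conflict handling below.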
Hinted Handoff
- If a target node is down, a healthy node temporarily buffers the write locally.
- Once the failed node recovers, the buffered write is delivered.
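A minimal sketch of this buffering behavior, assuming a hypothetical in-memory hint store (real systems persist hints to disk):

```python
# Illustrative hinted-handoff buffer: a healthy node stores writes meant
# for a down replica, tagged with a "hint" naming the intended recipient,
# and replays them when that replica recovers.

from collections import defaultdict

class HintedHandoffBuffer:
    def __init__(self):
        self._hints = defaultdict(list)   # intended_node -> [(key, value), ...]

    def store_hint(self, intended_node, key, value):
        """Buffer a write locally because `intended_node` is unreachable."""
        self._hints[intended_node].append((key, value))

    def replay(self, recovered_node, send):
        """Deliver buffered writes once the node is back, then drop the hints."""
        for key, value in self._hints.pop(recovered_node, []):
            send(recovered_node, key, value)
```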
Read Path
- The coordinator requests the data from the other N−1 replica nodes (it serves one copy itself).
- It waits for R−1 replies before responding, so that R replicas in total have answered.
Conflict Handling
- If Vector Clocks indicate concurrent updates, Dynamo returns all versions to the client.
- The client performs semantic reconciliation.
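Semantic reconciliation is application-specific. A sketch in the spirit of Dynamo's well-known shopping-cart use case, with a hypothetical merge function:

```python
# Client-side reconciliation example: when the store returns concurrent
# versions ("siblings"), the client merges them with a set union.

def reconcile_carts(siblings):
    """Merge concurrent shopping-cart versions by unioning their items.
    Note: items deleted in one branch may resurface after the merge --
    a known trade-off of union-based reconciliation."""
    merged = set()
    for cart in siblings:
        merged |= set(cart)
    return merged
```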
Anti-Entropy
Background synchronization occurs using Merkle Trees.
- A Merkle Tree is a binary tree of hashes.
- It allows replicas to quickly detect divergence.
- Only the inconsistent data ranges are synchronized, minimizing network transfer.
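The idea can be sketched as below. This toy version assumes a power-of-two number of leaves and compares roots first, then leaf hashes, to find divergent ranges without shipping the data itself:

```python
# Illustrative Merkle-tree anti-entropy: each leaf hashes one key range;
# each parent hashes the concatenation of its two children.

import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaf_hashes):
    """Return a list of levels, leaves first, root level last."""
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diverging_leaves(tree_a, tree_b):
    """Indices of leaf ranges whose hashes differ between two replicas."""
    if tree_a[-1] == tree_b[-1]:      # identical roots: replicas are in sync
        return []
    return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]
```

If the roots match, a single hash comparison proves the replicas agree; otherwise only the differing leaf ranges need to be exchanged.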
2. Apache Cassandra (Wide-Column)
Cassandra combines:
- Dynamo’s decentralized cluster architecture
- LSM-Tree storage engine
Architecture Overview
```mermaid
graph TD
  subgraph Cassandra_Node
    W[Write Request] --> WAL[Commit Log on Disk]
    W --> MT[In-Memory MemTable]
    MT -->|Flushes when full| SST[Immutable SSTable on Disk]
    R[Read Request] --> RC[Row Cache]
    R --> BF[Bloom Filter]
    BF -->|Positive| PI[Partition Index]
    PI --> SST_R[SSTable Read]
    R --> MT_R[MemTable Read]
    SST_R --> Merge[Merge Data & Return Latest]
    MT_R --> Merge
  end
```
Write Path
- The coordinator forwards the write to replicas based on the configured consistency level.
- If a replica node is unavailable, Hinted Handoff may temporarily store the write on another node.
- On each replica, the write is immediately appended to the Commit Log (on disk) to guarantee durability across crashes.
- The data is written to an in-memory MemTable.
- When the MemTable reaches a threshold, it is flushed to disk as an immutable SSTable.
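The steps above can be sketched as a minimal LSM write path. This is a toy model, not Cassandra's implementation; lists stand in for on-disk structures:

```python
# Minimal LSM write-path sketch: append to a commit log for durability,
# insert into an in-memory memtable, and flush the memtable to an
# immutable, sorted "SSTable" when it reaches a size threshold.

class LSMWritePath:
    def __init__(self, memtable_limit=2):
        self.commit_log = []          # stand-in for the on-disk commit log
        self.memtable = {}
        self.sstables = []            # each flush yields one immutable SSTable
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. durable append
        self.memtable[key] = value             # 2. fast in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # 3. persist the memtable as a sorted, immutable SSTable
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
```

Note that writes never modify data in place: SSTables are only appended, which is why reads must later merge results across files.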
Read Path
- The system first checks the Row Cache (in-memory).
- If the data is not found, Bloom Filters are used to probabilistically determine which SSTables might contain the data.
  - This avoids unnecessary disk seeks on SSTables that definitely lack the key.
- The system reads data from:
- MemTable
- Relevant SSTables
- The results are merged to return the most recent value.
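The Bloom filter step can be illustrated with a minimal implementation (not Cassandra's; the sizes and hash scheme here are arbitrary choices for the sketch):

```python
# Illustrative Bloom filter: k hash functions set k bits per key; a read
# probes the same bits. Any zero bit means the SSTable definitely lacks
# the key (no disk seek needed); all ones means "maybe present" --
# false positives are possible, false negatives are not.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, k=3):
        self.size, self.k = size, k
        self.bits = bytearray(size)

    def _positions(self, key: str):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))
```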
Read Repair
- The coordinator hashes responses from multiple replicas.
- If mismatches are detected, the coordinator updates the out-of-sync replicas.
- This ensures eventual consistency across nodes.
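A simplified sketch of the repair decision: Cassandra actually compares response digests and resolves by write timestamp; here, timestamped values stand in for both:

```python
# Toy read-repair logic: compare replica responses, pick the value with
# the newest timestamp, and report which replicas hold stale data so
# the coordinator can push the winning value back to them.

def read_repair(responses):
    """responses: {replica: (timestamp, value)}.
    Returns (winning_value, replicas_to_repair)."""
    winner = max(responses.values())                    # newest (timestamp, value)
    stale = [r for r, tv in responses.items() if tv != winner]
    return winner[1], stale
```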
3. Practical Implementation
Explore the low-level implementations of LSM-Trees and distributed storage internals: