Module 12: Data Pipelines & Stream Processing

🚀 Problem Statement

An organization might upload 500 compliance documents per day. Each must be: parsed (PDF/DOCX), chunked (semantic splitting), embedded, indexed in a vector DB, and made searchable — ideally within 5 minutes of upload. If a batch job runs only every 6 hours, significant delays occur.

🧠 The Engineering Story

The Villain: "The Nightly Batch." Documents uploaded at 9 AM might not be searchable until 3 PM. An end user cannot find updated data records because the pipeline hasn't run yet.

The Hero: "The Streaming Ingestion." Each document upload triggers a real-time pipeline: parse → chunk → embed → index. Documents are searchable within minutes, not hours.

The Plot:

Design a streaming pipeline: source (upload event) → transform (chunk + embed) → sink (vector DB + search index)
Handle backpressure: what happens when GPU embedding is slower than document uploads
Implement exactly-once processing with checkpointing
Design a dead-letter queue for documents that fail processing

The Twist (Failure): The Ordering Problem. A user uploads v1.0, then immediately uploads v1.1. If the streaming pipeline processes v1.1 first (e.g., because it is a smaller file), v1.0 might subsequently overwrite it, causing the latest version to be lost.

Interview Signal: Can design a pipeline that handles ordering, backpressure, and failure recovery.

🧠 Pipeline Architecture

graph LR
    A["Document Upload"] --> B["Event Bus (Kafka)"]
    B --> C["Parser Worker"]
    C --> D["Chunker"]
    D --> E["Embedding Worker (GPU)"]
    E --> F["Vector DB"]
    E --> G["Search Index"]
    B --> H["Metadata DB"]

    C -- "Parse Failure" --> I["Dead Letter Queue"]
    E -- "GPU OOM" --> J["Retry Queue"]

🔗 Case Study References

Dockerized Job Scheduler — Real-world implementation of async background processing.