Module 4: Load Balancing & Reverse Proxies

🚀 Problem Statement

A GenAI application has 10 inference worker pods. Short queries (simple lookups) complete in 200ms but may get stuck behind long-running RAG queries (15s). P99 latency can spike to 30s even though average CPU utilization is only 40%.

🧠 The Engineering Story

The Villain: "The Round-Robin Roulette." A load balancer sends requests in strict rotation. If Worker 3 receives 5 heavy RAG queries in a row, they are queued while workers 7-10 sit idle.

The Hero: "The Aware Dispatcher." Uses health-aware, latency-weighted routing with separate queues for fast vs slow operations.

The Plot:

Understand L4 (TCP) vs L7 (HTTP) load balancing trade-offs
Implement health checks that go beyond "is the port open" — check GPU memory, queue depth
Separate traffic lanes: lightweight API calls vs heavy inference requests
Use consistent hashing for stateful workloads (user sessions, model shards)

The Twist (Failure): The Thundering Herd. All 10 workers health-check as "healthy" simultaneously. A traffic spike hits, all workers get overwhelmed, all fail health checks, and the load balancer has zero healthy backends.

Interview Signal: Can design a load balancing strategy that accounts for heterogeneous request costs.

🧠 Key Concepts

Strategy	Best For	GenAI Scenario
Round Robin	Homogeneous, fast requests	Static API serving
Least Connections	Variable processing times	Mixed query complexity
Weighted	Heterogeneous hardware	GPU vs CPU workers
Consistent Hashing	Stateful routing	Model shard affinity, KV-cache locality
Queue-based	Expensive async work	Batch LLM inference jobs