Module 16: LLM Orchestration Platforms
🚀 Problem Statement
A GenAI workflow may involve multiple LLM calls per user request: query classification, query rewriting, parallel sub-question answers, synthesis, and quality checks. Managing prompts, model routing, fallbacks, token budgets, and cost tracking across these calls using raw HTTP clients can become a maintenance challenge.
🧠The Engineering Story
The Villain: "The Prompt Spaghetti." Multiple LLM calls scattered across services, each with hardcoded prompts, lacking retry logic, token tracking, or fallback mechanisms when primary models are rate-limited.
The Hero: "The Orchestration Layer." A centralized LLM gateway that handles prompt management (versioned templates), model routing (matching complexity to model capability), fallback chains, token budgets, and structured output parsing.
The Plot:
- Design an LLM Gateway: single interface for all LLM calls with unified logging
- Implement prompt versioning and A/B testing
- Build a model router: classify query complexity → route to appropriate model
- Add structured output parsing with retry on format errors
- Implement cost controls: per-user token budgets with graceful degradation
The Twist (Failure): The Cascade Failure. If a primary model hits rate limits, a fallback to a less capable model might occur. If the quality-check mechanism also falls back to a less capable "judge" model, it might approve lower-quality output. Users might then see incorrect information that is nonetheless marked with a "verified" badge.
Interview Signal: Can design an LLM orchestration system with fallback chains, quality gates, and cost controls.
🧠Orchestration Architecture
| Component | Responsibility | Key Feature |
|---|---|---|
| Prompt Registry | Version-controlled prompt templates | A/B test prompt variations |
| Model Router | Classify complexity → select model | GPT-4 for analysis, GPT-3.5 for formatting |
| Token Budget Manager | Track usage per user/org/project | Hard limits + graceful degradation |
| Output Parser | Enforce structured JSON output | Retry with reformatted prompt on failure |
| Guardrail Engine | Check factuality, toxicity, PII leakage | Block or flag before delivery |
| Fallback Chain | Primary → Secondary → Cached → Error | GPT-4 → GPT-3.5 → cached response → "unavailable" |