Module 11: Observability, Reliability & SRE
🚀 Problem Statement
A GenAI platform experienced a 2-hour outage where users reported significant slowness, even though the dashboard showed 99.5% uptime. The root cause was a slow memory leak in the embedding service that gradually degraded response times from 200ms to 12s over three days. No alerts were triggered because the service never crashed; it simply slowed down.
🧠 The Engineering Story
The Villain: "The Dashboard Liar." Uptime monitoring can report 99.9% simply because the service keeps returning HTTP 200. If those responses take 12 seconds, or contain incorrect embeddings because the model ran out of GPU memory, the system is effectively failing from the user's point of view.
The Hero: "The Observability Triad." Metrics (what is happening), logs (why it is happening), and traces (where it is happening), combined with SLIs/SLOs that measure what users actually experience.
The Plot:
- Define SLIs: latency P50/P95/P99, error rate, throughput, embedding quality score
- Set SLOs: "99.5% of RAG queries complete in < 3s with relevance > 0.7"
- Implement distributed tracing across the full RAG pipeline
- Build alerts on error budget consumption (how fast the SLO's allowance of bad requests is being used up), not on individual raw metrics
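The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: `QueryRecord`, `percentile`, `is_good`, and `slo_report` are hypothetical names, the thresholds are taken from the example SLO above (under 3 s, relevance above 0.7, 99.5% target), and the nearest-rank percentile is just one common convention.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """One RAG query observation (hypothetical schema for this sketch)."""
    latency_s: float   # end-to-end latency in seconds
    relevance: float   # relevance score in [0, 1]
    ok: bool = True    # False if the request returned an error


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_good(rec, max_latency_s=3.0, min_relevance=0.7):
    """A query counts as 'good' only if it is fast, relevant, and error-free."""
    return rec.ok and rec.latency_s < max_latency_s and rec.relevance > min_relevance


def slo_report(records, target=0.995):
    """Compute the SLI and the fraction of the error budget consumed."""
    if not records:
        raise ValueError("slo_report() needs at least one record")
    total = len(records)
    good = sum(1 for r in records if is_good(r))
    bad = total - good
    allowed_bad = (1 - target) * total  # error budget, in number of requests
    latencies = [r.latency_s for r in records]
    return {
        "sli": good / total,
        "slo_met": good / total >= target,
        "budget_consumed": bad / allowed_bad if allowed_bad else math.inf,
        "p50_s": percentile(latencies, 50),
        "p95_s": percentile(latencies, 95),
        "p99_s": percentile(latencies, 99),
    }
```

A `budget_consumed` value above 1.0 means the window has used more than its full error budget. Alerting on that ratio, or on how quickly it grows, would have caught the slow embedding service even though every request returned HTTP 200.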
The Twist (Failure): Alert Fatigue. Setting too many alerts (e.g., 200) can lead to critical warnings being ignored. A vital alert regarding embedding quality degradation might be buried among dozens of minor warnings like "disk usage > 70%."
Interview Signal: Can define meaningful SLIs/SLOs for an ML system and explain error budget policies.
🧠 Observability for GenAI Systems
| Signal | What to Measure | Tool |
|---|---|---|
| Latency | P50/P95/P99 per pipeline stage (embed, retrieve, generate) | Prometheus + Grafana |
| Quality | Relevance score (cosine similarity), faithfulness | Custom metrics + LLM-as-judge |
| Cost | Token usage per query, cache hit ratio | Custom counters |
| Errors | Rate limit hits, model timeouts, hallucination detection | Structured logging |
| Traces | Full request path: API → embed → retrieve → rerank → generate | Jaeger / OpenTelemetry |
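To make the per-stage latency and trace rows concrete, here is a minimal standard-library sketch of timing each pipeline stage. It is a stand-in for real tracing spans, not the OpenTelemetry or Jaeger API: `StageTimer` and `run_pipeline` are hypothetical names, and the stage callables are placeholders for the real embed, retrieve, rerank, and generate steps.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Records how long each named pipeline stage takes (a toy 'span' recorder)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock   # injectable so tests can use a fake clock
        self.spans = []       # list of (stage_name, duration_seconds)

    @contextmanager
    def span(self, stage):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the stage raised an exception.
            self.spans.append((stage, self._clock() - start))

    def slowest(self):
        """Return the (stage_name, duration_seconds) pair with the longest duration."""
        return max(self.spans, key=lambda s: s[1])


def run_pipeline(timer, stages):
    """Run (name, callable) stages in order, timing each one."""
    for name, fn in stages:
        with timer.span(name):
            fn()


if __name__ == "__main__":
    timer = StageTimer()
    # Placeholder stages; a real pipeline would call the actual services here.
    run_pipeline(timer, [
        ("embed", lambda: time.sleep(0.02)),
        ("retrieve", lambda: time.sleep(0.01)),
        ("rerank", lambda: None),
        ("generate", lambda: time.sleep(0.01)),
    ])
    for stage, seconds in timer.spans:
        print(f"{stage}: {seconds * 1000:.1f} ms")
```

Breaking latency down by stage is what localizes a problem like the embedding service leak: the end-to-end number shows that requests are slow, while the per-stage durations show which hop is responsible.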