🚦 System Design: API Rate Limiter
📝 Overview
A Rate Limiter is a defensive middleware component that restricts the number of requests an entity (User, IP, or Service) can send to an API within a specified time window. It acts as a shield, preventing DoS attacks, capping expensive infrastructure costs, and ensuring fair resource allocation across distributed systems.
Core Concepts
- Token Bucket / Leaky Bucket: Mathematical algorithms used to throttle traffic efficiently while allowing for controlled bursts.
- Sliding Window: Time-based tracking algorithms designed to prevent request spikes at the exact edges of fixed time windows.
- Fail-Open vs. Fail-Closed: The architectural decision of whether to allow or block all traffic if the rate-limiting infrastructure itself crashes.
🏭 The Scenario & Requirements
😡 The Problem (The Villain)
A malicious script, a brute-force bot, or even a runaway internal microservice sends 100,000 requests per second to your login API. Without limits, the database connection pool is instantly exhausted, CPU spikes to 100%, and the entire platform crashes for all legitimate users.
🦸 The Solution (The Hero)
A distributed middleware layer sitting at the API Gateway that counts incoming requests against predefined tier quotas in a blazing-fast, centralized in-memory datastore (Redis). It instantly intercepts and drops excess traffic (returning an HTTP 429 status) before the load ever touches your delicate backend services.
📜 Requirements
- Functional Requirements:
- The system must Allow or Deny requests based on identifiers (UserID, IP address, or API Key).
- It must return clear
HTTP 429 Too Many Requestsresponses, ideally with aRetry-Afterheader. - It must support different limiting rules for different tiers (e.g., Free users = 100 req/min, Pro users = 10,000 req/min).
- Non-Functional Requirements:
- Ultra-Low Latency: The rate limiter is in the critical path of every request. It must add < 5ms to the response time.
- High Availability: If the rate limiter fails, it should default to "Fail-Open" (allow requests through) rather than taking down the whole API.
- Accuracy & Consistency: Must prevent race conditions across concurrent, distributed API gateways.
Capacity Estimation (Back-of-the-envelope)
- Traffic: 10 Million Active Users generating 50,000 requests per second (QPS) globally.
- Latency Budget: Network hop to Redis + Redis read/write must execute in under 2-3 milliseconds.
- Memory footprint: - Using a Sliding Window Log for 1 million active users with a limit of 500 req/hour requires ~12GB of Redis RAM (storing every timestamp).
- Using a Token Bucket or Fixed Window reduces this to just ~50MB, as it only stores a few integers per user.
📊 API Design & Data Model
GET /api/v1/resource(When limit is exceeded)- Status:
429 Too Many Requests - Headers:
X-Ratelimit-Limit: 100X-Ratelimit-Remaining: 0Retry-After: 36(Seconds until the client can retry)
- Response:
{ "error": "Rate limit exceeded. Please try again later." }
- Status:
- Token Bucket (Hash):
Key:rate_limit:{user_id}(e.g.,rate_limit:u123)Value:{ "tokens": 45, "last_refill_timestamp": 1690000000 }
- Sliding Window Log (Sorted Set):
Key:rate_log:{user_id}Value:SortedSet<UnixTimestamps>(Score and Value are both the exact millisecond timestamp of the request).
- Fixed Window Counter (String/Int):
Key:rate_count:{user_id}:{minute_timestamp}(e.g.,rate_count:u123:1690000100)Value:42(Atomic integer)
🏗️ High-Level Architecture
Architecture Diagram
graph TD
Client --> WAF[Web Application Firewall]
WAF --> LB[Load Balancer]
LB --> API1[API Gateway 1]
LB --> API2[API Gateway 2]
API1 --> RL1[Rate Limiter Middleware]
API2 --> RL2[Rate Limiter Middleware]
RL1 <--> Redis[(Central Redis Cluster)]
RL2 <--> Redis
RL1 -->|Allow| Svc[Backend Microservices]
RL1 -->|Deny| 429[HTTP 429 Too Many Requests]
RL2 -->|Allow| Svc
Component Walkthrough
- API Gateway / Middleware: Intercepts every incoming request. It extracts the unique identifier (IP or API Token) and queries the central rate-limiting datastore.
- Central Redis Cluster: A highly available, low-latency, in-memory datastore. It holds the counters and timestamps for every active user. Centralization is mandatory to prevent a user from bypassing limits by hitting different stateless API gateways round-robin.
- Backend Microservices: Completely shielded from the rate-limiting logic. If a request reaches them, it is guaranteed to be within the allowed quota.
🔬 Deep Dive & Scalability
Handling Bottlenecks
1. Algorithm Selection There are several mathematical approaches to throttling traffic, each with distinct memory and accuracy trade-offs:
-
Token Bucket: A bucket is assigned to a user and holds a maximum of tokens. Tokens are refilled at a constant rate. Highly memory efficient and allows for short bursts of traffic.
-
Fixed Window Counter: Counts requests in fixed time blocks (e.g., 10:00 to 10:01). Flaw: A user could send 100 requests at 10:00:59, and another 100 at 10:01:01, processing 200 requests in 2 seconds and bypassing the load limit entirely.
-
Sliding Window Log: Stores the exact Unix timestamp of every request in a Redis Sorted Set, dropping old timestamps. 100% accurate, but has an extremely high memory footprint.
-
Sliding Window Counter (Hybrid): Combines Fixed Window and the Log. Calculates a weighted average of the previous window's count based on the overlap percentage of the current window. Smooths out edge spikes while using 86% less memory than the pure Log method.
2. Atomicity (Race Conditions) When reading and updating counters in Redis concurrently across dozens of API gateways, race conditions will occur.
- Problem: Two servers read
counter = 10simultaneously and both increment it to11. A request count is lost, allowing the user to bypass the limit. - Solution: Never do a Read-Modify-Write in application code. Use Redis INCR atomic operations, or bundle complex logic (like Token Bucket math) into atomic Lua Scripts executed directly on the Redis server.
3. Latency & Caching Making a synchronous network call to Redis for every single API request adds latency and creates a massive single point of failure.
- Solution: API gateways often cache recent rate-limit states locally in memory for a short duration. They use an asynchronous "Write-Back" strategy to sync the batched counts to the central Redis cluster every few seconds. This sacrifices absolute accuracy for a massive boost in performance.
⚖️ Trade-offs
| Strategy | Pros | Cons / Limitations |
|---|---|---|
| Hard Throttling | Perfectly protects backend infrastructure. No exceptions made. | Harms user experience. A legitimate user performing a valid bulk operation is blocked with a 429. |
| Soft Throttling | Allows a small percentage of requests to exceed the limit temporarily, improving UX. | Requires more complex logic and monitoring to ensure backend queues aren't overwhelmed by the allowed burst. |
| Local Gateway Throttling (No Redis) | Zero network latency. Infinitely scalable. | Inaccurate. If a user is routed to Gateway A, then Gateway B, they effectively double their allowed limit. |
🎤 Interview Toolkit
- Scale Question: "Your Redis cluster is CPU-bound from executing millions of Lua scripts per second. How do you scale this?" -> Implement a two-tier architecture. The API Gateway maintains a fast, local in-memory counter (e.g., Guava Cache). It only syncs with the central Redis cluster asynchronously in batches (e.g., syncing every 100 requests or every 2 seconds). This drastically reduces Redis load at the cost of slight limit imprecision.
- Failure Probe: "The Redis cluster goes down completely. What happens to incoming API traffic?" -> The system must "Fail-Open." The API Gateway's middleware should catch the Redis timeout exception and automatically route the traffic to the backend services anyway. It's better to risk heavy load than to cause a 100% global outage for all users because the defensive shield broke.
- Edge Case: "In a Sliding Window Log, what happens if the clocks on your different API Gateway servers drift by a few seconds?" -> Clock drift will cause false positives and false negatives, as timestamps are evaluated inaccurately. To solve this, the Lua script should rely exclusively on the centralized Redis server's internal clock (
TIMEcommand) rather than the localized gateway clocks.
🔗 Related Architectures
- Infrastructure Challenge: Redis Rate Limiter
- Machine Coding: Distributed Rate Limiter