
API Rate Limiting with Intelligent Caching

December 21, 2025 • 7 min read • API Architecture

Rate limiting protects your APIs from abuse, manages costs for third-party services, and ensures fair resource allocation. But traditional rate limiting implementations are either too simple (inaccurate), too slow (database lookups per request), or too expensive (dedicated infrastructure). Intelligent caching solves all three problems.

Why Rate Limiting Needs Caching

Consider a high-traffic API handling 10,000 requests/second. Every request needs rate limit verification before any real work happens, so that check runs 10,000 times per second on top of your actual workload; even a few milliseconds per lookup would dominate your latency budget.

Caching rate limit state enables sub-millisecond checks while maintaining accuracy across distributed systems.

Rate Limiting Algorithms

1. Token Bucket (Most Common)

Each user has a bucket that fills with tokens at a fixed rate. Requests consume tokens. When the bucket is empty, requests are rejected.

// Redis-backed token bucket (note: this read-modify-write is not atomic;
// see "Race Conditions" below for the Lua-script fix)
async function checkRateLimit(userId, limit, refillRate) {
  const key = `ratelimit:${userId}`;
  const now = Date.now();

  // Get current state
  const data = await cache.get(key);
  let tokens = limit;
  let lastRefill = now;

  if (data) {
    ({ tokens, lastRefill } = JSON.parse(data));

    // Refill tokens based on time elapsed
    const elapsed = (now - lastRefill) / 1000;
    tokens = Math.min(limit, tokens + elapsed * refillRate);
  }

  // Try to consume a token
  if (tokens >= 1) {
    tokens -= 1;
    await cache.set(key, JSON.stringify({
      tokens,
      lastRefill: now
    }), { ttl: 3600 });
    return { allowed: true, remaining: Math.floor(tokens) };
  }

  return { allowed: false, remaining: 0 };
}

2. Sliding Window Counter

More accurate than fixed windows, tracks requests in a rolling time period:

-- Redis Lua script for an atomic sliding-window check
local key = KEYS[1]
local window = tonumber(ARGV[1])  -- seconds
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests in window
local current = redis.call('ZCARD', key)

if current < limit then
  -- Add this request (use a unique member, e.g. now .. ':' .. a request id,
  -- if multiple requests can share a timestamp)
  redis.call('ZADD', key, now, now)
  redis.call('EXPIRE', key, window)
  return { 1, limit - current - 1 }
else
  return { 0, 0 }
end
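For intuition, here is the same sliding-window logic as a plain, in-process JavaScript sketch. It is non-atomic and single-node, for illustration only; the Lua version is what you would actually run so the check-and-add stays atomic inside Redis:

```javascript
// In-process illustration of the sliding-window counter: keep request
// timestamps, drop the ones older than the window, admit if under the limit.
function slidingWindowCheck(timestamps, now, windowSec, limit) {
  // Keep only requests still inside the rolling window
  const live = timestamps.filter(t => t > now - windowSec);

  if (live.length < limit) {
    live.push(now);
    return { allowed: true, remaining: limit - live.length, timestamps: live };
  }
  return { allowed: false, remaining: 0, timestamps: live };
}
```

Old timestamps fall out of the window naturally, which is exactly what the `ZREMRANGEBYSCORE` call does on the Redis side.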

3. Fixed Window Counter

Simplest implementation, but allows bursts at window boundaries:

async function fixedWindowLimit(userId, limit) {
  const key = `ratelimit:${userId}`;
  const window = Math.floor(Date.now() / 60000); // 1-minute windows
  const windowKey = `${key}:${window}`;

  const count = await cache.incr(windowKey);

  if (count === 1) {
    // First request in this window, set expiry
    await cache.expire(windowKey, 60);
  }

  return {
    allowed: count <= limit,
    remaining: Math.max(0, limit - count)
  };
}

Distributed Rate Limiting Challenges

Race Conditions

Multiple servers checking limits simultaneously can exceed quotas:

# Problem: Two servers check simultaneously
Server A: reads count=99, checks (99 < 100), allows request
Server B: reads count=99, checks (99 < 100), allows request
Result: 101 requests allowed (quota exceeded!)

-- Solution: atomic check-and-increment in a Lua script
local current = redis.call('INCR', key)
if current == 1 then
  -- First request in this window: start the expiry clock so the key
  -- doesn't live (and block the user) forever
  redis.call('EXPIRE', key, window)
end
if current > limit then
  return 0
else
  return 1
end

Cache Consistency

Distributed caches offer only eventual consistency across replicas or regions, so nodes reading a stale counter may briefly over-admit requests. Either budget for a small margin of over-admission, or route all checks for a given key to a single primary.
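One standard mitigation (a common pattern, not specific to any product) is to partition the global quota across nodes so each node enforces its share from purely local state, with no cross-node coordination at all:

```javascript
// Sketch of quota partitioning: give each of N nodes a local share of the
// global limit. The safetyFactor slightly over-allocates to tolerate uneven
// load balancing; accuracy degrades when traffic skews toward a few nodes.
function localQuota(globalLimit, nodeCount, safetyFactor = 1.1) {
  return Math.max(1, Math.floor((globalLimit / nodeCount) * safetyFactor));
}
```

Each node then runs its token bucket against `localQuota` instead of the global limit. The trade-off is explicit: a hot node can reject traffic while the global quota still has headroom, so this fits quotas where slight inaccuracy is acceptable.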

Intelligent Rate Limiting Strategies

1. Tiered Rate Limits

Different limits for different user tiers:

const RATE_LIMITS = {
  free: { requests: 100, window: 3600 },
  pro: { requests: 1000, window: 3600 },
  enterprise: { requests: 10000, window: 3600 }
};

async function getRateLimit(userId) {
  const tier = await cache.get(`user:${userId}:tier`);
  return RATE_LIMITS[tier || 'free'];
}

2. Endpoint-Specific Limits

Expensive endpoints get stricter limits:

const ENDPOINT_LIMITS = {
  '/api/search': 10,      // Expensive full-text search
  '/api/export': 5,       // Resource-intensive export
  '/api/users': 100,      // Cheap user lookup
};

async function checkEndpointLimit(userId, endpoint) {
  const limit = ENDPOINT_LIMITS[endpoint] || 50;
  // Reuse the token bucket with a per-endpoint identifier; limits here are
  // per minute, so refill at limit/60 tokens per second
  return checkRateLimit(`${userId}:${endpoint}`, limit, limit / 60);
}

3. Adaptive Rate Limiting

Automatically adjust limits based on system load:

async function getAdaptiveLimit(userId, baseLimit) {
  // Cached metrics come back as strings; parse before comparing.
  // A missing metric yields NaN, which falls through to baseLimit.
  const systemLoad = parseFloat(await cache.get('metrics:cpu_usage'));

  if (systemLoad > 80) {
    // System under stress, reduce limits
    return baseLimit * 0.5;
  } else if (systemLoad < 30) {
    // System idle, allow more requests
    return baseLimit * 1.5;
  }

  return baseLimit;
}

4. Burst Allowance

Allow short bursts while maintaining average rate:

async function checkWithBurst(userId, sustained, burst) {
  // Sustained rate, e.g. 100 req/hour: refill at sustained/3600 tokens/sec
  const sustainedOk = await checkRateLimit(
    `${userId}:hour`, sustained, sustained / 3600
  );

  // Burst rate, e.g. 20 req/minute: refill at burst/60 tokens/sec
  const burstOk = await checkRateLimit(
    `${userId}:minute`, burst, burst / 60
  );

  return sustainedOk.allowed && burstOk.allowed;
}

Performance Optimization

Batch Rate Limit Checks

For internal API calls, batch multiple checks:

async function batchCheckLimits(userIds, limit) {
  const pipeline = cache.pipeline();

  userIds.forEach(id => {
    pipeline.get(`ratelimit:${id}`);
  });

  // One round trip for all users instead of one per user
  const results = await pipeline.exec();
  return results.map(([err, data], i) => ({
    userId: userIds[i],
    // No stored state means the user hasn't been counted yet this window
    allowed: !err && (data ? JSON.parse(data).tokens >= 1 : true)
  }));
}

Local Caching for Quota Information

Cache user tier information locally (not counters):

const tierCache = new Map();

async function getUserTier(userId) {
  if (tierCache.has(userId)) {
    return tierCache.get(userId);
  }

  const rows = await database.query(
    'SELECT tier FROM users WHERE id = ?', [userId]
  );
  const tier = rows[0]?.tier ?? 'free'; // unknown user falls back to free

  // Cache tier for 5 minutes
  tierCache.set(userId, tier);
  setTimeout(() => tierCache.delete(userId), 300000);

  return tier;
}

Rate Limit Response Headers

Inform clients about their rate limit status:

app.use(async (req, res, next) => {
  // Assumes a checkRateLimit variant that also returns total, resetAt,
  // and retryAfter alongside allowed/remaining
  const limit = await checkRateLimit(req.userId);

  res.setHeader('X-RateLimit-Limit', limit.total);
  res.setHeader('X-RateLimit-Remaining', limit.remaining);
  res.setHeader('X-RateLimit-Reset', limit.resetAt);

  if (!limit.allowed) {
    res.setHeader('Retry-After', limit.retryAfter);
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: limit.retryAfter
    });
  }

  next();
});

Advanced: ML-Powered Rate Limiting

Detect abuse patterns using ML instead of simple thresholds:

async function detectAnomalousUsage(userId) {
  const pattern = await getAccessPattern(userId);

  const features = {
    requests_per_hour: pattern.requestRate,
    unique_endpoints: pattern.endpointDiversity,
    error_rate: pattern.errorRate,
    time_distribution: pattern.timeVariance,
    geographic_diversity: pattern.ipDiversity
  };

  const abuseScore = await mlModel.predict(features);

  if (abuseScore > 0.8) {
    // Likely abuse, apply stricter limits
    return reduceRateLimit(userId, 0.1);
  }
}

Monitoring Rate Limits

Track these metrics to optimize rate limiting: the 429 rejection rate per endpoint and per tier, the latency of the limit check itself, how close users run to their quotas, and which keys get limited most often. A rising rejection rate on a single endpoint usually means its limit is miscalibrated, not that traffic is abusive.
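A minimal in-memory sketch of that tracking (names are illustrative; in production you would emit to your metrics pipeline rather than a Map):

```javascript
// Count allowed vs. rejected decisions per endpoint so rejection rates
// can be watched and limits tuned. Purely illustrative in-memory version.
class RateLimitMetrics {
  constructor() {
    this.counts = new Map();
  }

  record(endpoint, allowed) {
    const key = `${endpoint}:${allowed ? 'allowed' : 'rejected'}`;
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }

  rejectionRate(endpoint) {
    const ok = this.counts.get(`${endpoint}:allowed`) ?? 0;
    const rejected = this.counts.get(`${endpoint}:rejected`) ?? 0;
    const total = ok + rejected;
    return total === 0 ? 0 : rejected / total;
  }
}
```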

Conclusion

Effective API rate limiting requires fast, accurate, distributed-ready implementation. Intelligent caching provides the foundation: sub-millisecond checks, atomic operations, and scalable infrastructure. By combining token buckets, sliding windows, and ML-powered anomaly detection, you can build rate limiters that protect your APIs while maintaining excellent user experience.

Start with simple token buckets cached in Redis, then add sophistication as your API scales: tiered limits, endpoint-specific quotas, adaptive throttling, and intelligent abuse detection.

Built-In Intelligent Rate Limiting

Cachee.ai includes distributed rate limiting with automatic tier management and abuse detection.


Related Reading

The Numbers That Matter

Cache performance discussions get philosophical fast. Here are the actual measured numbers from production deployments running on documented hardware, so you can compare against your own infrastructure instead of trusting marketing copy.

The compounding effect matters more than any single number. A 28-nanosecond L0 hit means your application spends almost zero time on cache lookups in the hot path, leaving the CPU free for the actual business logic that generates revenue.

Average Latency Hides The Real Story

Average latency is the most misleading number in cache benchmarking. The percentile distribution is what actually breaks production systems. Tail latency — the slowest 0.1% of requests — is where users notice the lag and where SLAs get violated.

Percentile   Network Redis (same-AZ)   In-process L0
p50          ~85 microseconds          28.9 nanoseconds
p95          ~140 microseconds         ~45 nanoseconds
p99          ~280 microseconds         ~80 nanoseconds
p99.9        ~1.2 milliseconds         ~150 nanoseconds

The p99.9 spike on networked Redis isn't a bug — it's the cost of running a single-threaded event loop that occasionally blocks on background tasks like RDB snapshots, AOF rewrites, and expired-key sweeps. Cachee's L0 stays inside a few hundred nanoseconds because the hot-path read is a lock-free shard lookup with no background work scheduled on the same thread.

If your application is sensitive to tail latency — payments, real-time bidding, fraud detection, trading — the p99.9 number is the one to optimize against. Average latency improvements that don't move the tail are vanity metrics.
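Measuring your own tail is straightforward; a nearest-rank percentile over a latency sample is enough to get started:

```javascript
// Nearest-rank percentile: sort the sample, take the value at rank
// ceil(p/100 * n). Crude but sufficient for spotting tail-latency problems.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

Run it at p50, p99, and p99.9 over the same sample and compare the spread; a wide gap between p50 and p99.9 is the signature of the event-loop stalls described above.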

When Caching Actually Helps

Caching isn't free. It introduces a consistency problem you didn't have before. Before adding any cache layer, the question to answer is whether your workload actually benefits from caching at all.

Caching helps when three conditions hold simultaneously. First, your reads dramatically outnumber your writes — typically a 10:1 ratio or higher. Second, the same keys get read repeatedly within a window where a cached value remains valid. Third, the cost of computing or fetching the underlying value is meaningfully higher than the cost of a cache lookup. Database queries that hit secondary indexes, RPC calls to slow upstream services, expensive computed aggregations, and rendered template fragments all qualify.

Caching hurts when those conditions don't hold. Write-heavy workloads suffer because every write invalidates a cache entry, multiplying your work. Workloads with poor key locality suffer because the cache wastes memory storing entries that never get reused. Workloads where the underlying fetch is already fast — well-indexed primary key lookups against a properly tuned database, for example — gain almost nothing from caching and inherit the consistency complexity for no reason.

The honest first step before any cache deployment is measuring your actual read/write ratio, key access distribution, and underlying fetch latency. If your read/write ratio is below 5:1 or your underlying database is already returning results in single-digit milliseconds, the engineering time is better spent elsewhere.
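Those rules of thumb can be wired into a simple pre-flight check (function and field names are illustrative; the cutoffs come straight from the text):

```javascript
// Decide whether a cache layer is likely worth deploying: a reads-to-writes
// ratio of 10:1 or better helps; below 5:1, or an underlying fetch already
// in single-digit milliseconds, means skip it.
function cacheDecision({ reads, writes, fetchLatencyMs }) {
  const ratio = writes === 0 ? Infinity : reads / writes;
  if (ratio < 5 || fetchLatencyMs < 10) return 'skip';
  return ratio >= 10 ? 'cache' : 'measure-further';
}
```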

Memory Efficiency Is The Hidden Cost Lever

Throughput numbers get the headlines but memory efficiency determines your monthly bill. A cache that stores the same hot data in less RAM lets you run a smaller instance class — and on AWS that's the difference between profitable and breakeven for a lot of services.

Redis stores each key as a Simple Dynamic String with 16 bytes of header overhead, plus dictEntry pointers in the main hashtable, plus embedded TTL metadata. For 1KB values, per-entry overhead lands around 1100-1200 bytes once you account for hashtable load factor and slab fragmentation. At a million keys, that's roughly 1.2 GB spent on overhead alone, on top of the roughly 1 GB of actual values.

Cachee's L1 layer uses sharded DashMap entries with compact packing — a 64-bit key hash, value bytes, an 8-byte expiry timestamp, and a small frequency counter for the CacheeLFU admission filter. Per-entry overhead lands at roughly 40 bytes of structural data on top of the value itself. For the same million-key workload, that's about 13% smaller resident memory. On AWS ElastiCache pricing, that gap is the difference between needing a cache.r7g.large versus a cache.r7g.xlarge for borderline workloads.