
API Rate Limiting with Intelligent Caching

December 21, 2025 • 7 min read • API Architecture

Rate limiting protects your APIs from abuse, manages costs for third-party services, and ensures fair resource allocation. But traditional rate limiting implementations are either too simple (inaccurate), too slow (database lookups per request), or too expensive (dedicated infrastructure). Intelligent caching solves all three problems.

Why Rate Limiting Needs Caching

Consider a high-traffic API handling 10,000 requests/second. Every request needs a rate limit check, and doing that check with a database lookup adds latency and load to every single call.

Caching rate limit state enables sub-millisecond checks while maintaining accuracy across distributed systems.

Rate Limiting Algorithms

1. Token Bucket (Most Common)

Each user has a bucket that fills with tokens at a fixed rate. Requests consume tokens. When the bucket is empty, requests are rejected.

// Redis-backed token bucket. Note: this read-modify-write is not atomic;
// see the race-condition discussion below for an atomic alternative.
async function checkRateLimit(userId, limit, refillRate) {
  const key = `ratelimit:${userId}`;
  const now = Date.now();

  // Get current state
  const data = await cache.get(key);
  let tokens = limit;
  let lastRefill = now;

  if (data) {
    ({ tokens, lastRefill } = JSON.parse(data));

    // Refill tokens based on time elapsed
    const elapsed = (now - lastRefill) / 1000;
    tokens = Math.min(limit, tokens + elapsed * refillRate);
  }

  // Try to consume a token
  if (tokens >= 1) {
    tokens -= 1;
    await cache.set(key, JSON.stringify({
      tokens,
      lastRefill: now
    }), { ttl: 3600 });
    return { allowed: true, remaining: Math.floor(tokens) };
  }

  return { allowed: false, remaining: 0 };
}
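
For example, a sustained allowance of 100 requests per hour corresponds to a refill rate of 100 / 3600 tokens per second. A minimal usage sketch inside an async request handler, assuming the checkRateLimit function and cache client above; the user id is illustrative:

// Hypothetical caller: 100 requests/hour for this user
// (refillRate is expressed in tokens per second)
const result = await checkRateLimit('user-123', 100, 100 / 3600);

if (!result.allowed) {
  // Surface a 429 to the client, or queue and retry
  throw new Error('Rate limit exceeded');
}
console.log(`Request allowed, ${result.remaining} tokens left`);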

2. Sliding Window Counter

More accurate than fixed windows, tracks requests in a rolling time period:

-- Redis Lua script for atomic sliding window
local key = KEYS[1]
local window = tonumber(ARGV[1])  -- seconds
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Remove old entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current requests in window
local current = redis.call('ZCARD', key)

if current < limit then
  -- Add this request (use a unique member such as a request id if two
  -- requests can arrive with the same timestamp)
  redis.call('ZADD', key, now, now)
  redis.call('EXPIRE', key, window)
  return { 1, limit - current - 1 }
else
  return { 0, 0 }
end
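
The script runs atomically on the Redis server, so the application only needs a thin wrapper to call it. A minimal sketch, assuming an ioredis-style client named redis and the script text stored in slidingWindowScript:

// Hypothetical wrapper around the sliding-window script above
async function slidingWindowLimit(userId, windowSeconds, limit) {
  const key = `ratelimit:sw:${userId}`;
  const now = Date.now() / 1000; // seconds, to match the script's window math

  // EVAL executes the whole script as one atomic operation
  const [allowed, remaining] = await redis.eval(
    slidingWindowScript, 1, key, windowSeconds, limit, now
  );

  return { allowed: allowed === 1, remaining };
}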

3. Fixed Window Counter

Simplest implementation, but allows bursts at window boundaries:

async function fixedWindowLimit(userId, limit) {
  const key = `ratelimit:${userId}`;
  const window = Math.floor(Date.now() / 60000); // 1-minute windows
  const windowKey = `${key}:${window}`;

  const count = await cache.incr(windowKey);

  if (count === 1) {
    // First request in this window, set expiry
    await cache.expire(windowKey, 60);
  }

  return {
    allowed: count <= limit,
    remaining: Math.max(0, limit - count)
  };
}

Distributed Rate Limiting Challenges

Race Conditions

Multiple servers checking limits simultaneously can exceed quotas:

# Problem: Two servers check simultaneously
Server A: reads count=99, checks (99 < 100), allows request
Server B: reads count=99, checks (99 < 100), allows request
Result: 101 requests allowed (quota exceeded!)

-- Solution: atomic increment with a Lua script
local key = KEYS[1]
local limit = tonumber(ARGV[1])

local current = redis.call('INCR', key)
if current > limit then
  redis.call('DECR', key)
  return 0
else
  return 1
end
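
One convenient way to ship a script like this is to register it as a custom command, so every call stays atomic on the server. A minimal sketch, assuming an ioredis client named redis; the command name and key are illustrative:

// Hypothetical registration of the atomic increment-and-check script
redis.defineCommand('atomicIncrCheck', {
  numberOfKeys: 1,
  lua: `
    local key = KEYS[1]
    local limit = tonumber(ARGV[1])
    local current = redis.call('INCR', key)
    if current > limit then
      redis.call('DECR', key)
      return 0
    else
      return 1
    end
  `
});

// Returns 1 if the request is allowed, 0 if the quota is exhausted
// (call from an async context)
const allowed = await redis.atomicIncrCheck('ratelimit:user-123', 100);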

Cache Consistency

Distributed caches replicate asynchronously, so rate-limit counters can briefly disagree across nodes. In practice this means limits near the quota boundary are approximate; keeping each user's counters on a single Redis instance avoids most of the drift.

Intelligent Rate Limiting Strategies

1. Tiered Rate Limits

Different limits for different user tiers:

const RATE_LIMITS = {
  free: { requests: 100, window: 3600 },
  pro: { requests: 1000, window: 3600 },
  enterprise: { requests: 10000, window: 3600 }
};

async function getRateLimit(userId) {
  const tier = await cache.get(`user:${userId}:tier`);
  return RATE_LIMITS[tier || 'free'];
}
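
A request handler can then resolve the tier first and feed its numbers into the token-bucket check. A minimal sketch composing getRateLimit with the checkRateLimit function from earlier:

// Hypothetical composition of tier lookup and token-bucket check
async function checkTieredLimit(userId) {
  const { requests, window } = await getRateLimit(userId);

  // Refill `requests` tokens per `window` seconds
  return checkRateLimit(userId, requests, requests / window);
}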

2. Endpoint-Specific Limits

Expensive endpoints get stricter limits:

const ENDPOINT_LIMITS = {
  '/api/search': 10,      // Expensive full-text search
  '/api/export': 5,       // Resource-intensive export
  '/api/users': 100,      // Cheap user lookup
};

async function checkEndpointLimit(userId, endpoint) {
  const limit = ENDPOINT_LIMITS[endpoint] || 50;
  // Per-endpoint bucket: allow `limit` requests per minute,
  // so the bucket refills at limit / 60 tokens per second
  return checkRateLimit(`${userId}:${endpoint}`, limit, limit / 60);
}

3. Adaptive Rate Limiting

Automatically adjust limits based on system load:

async function getAdaptiveLimit(userId, baseLimit) {
  const systemLoad = parseFloat(await cache.get('metrics:cpu_usage'));

  if (Number.isNaN(systemLoad)) {
    // Metric missing or unreadable: fall back to the base limit
    return baseLimit;
  }

  if (systemLoad > 80) {
    // System under stress, reduce limits
    return baseLimit * 0.5;
  } else if (systemLoad < 30) {
    // System idle, allow more requests
    return baseLimit * 1.5;
  }

  return baseLimit;
}
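
The load-adjusted limit then feeds into whatever limiter you already use. A minimal sketch plugging it into the hourly token bucket from earlier (the one-hour window is an assumption):

// Hypothetical: apply the load-adjusted limit to an hourly token bucket
async function checkAdaptiveLimit(userId, baseLimit) {
  const limit = await getAdaptiveLimit(userId, baseLimit);
  return checkRateLimit(userId, limit, limit / 3600);
}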

4. Burst Allowance

Allow short bursts while maintaining average rate:

async function checkWithBurst(userId, sustained, burst) {
  // Sustained rate: e.g. 100 req/hour
  // Burst: e.g. 20 req/minute

  const sustainedOk = await checkRateLimit(
    `${userId}:hour`, sustained, sustained / 3600 // refill over an hour
  );

  const burstOk = await checkRateLimit(
    `${userId}:minute`, burst, burst / 60 // refill over a minute
  );

  return sustainedOk.allowed && burstOk.allowed;
}

Performance Optimization

Batch Rate Limit Checks

For internal API calls, batch multiple checks:

async function batchCheckLimits(userIds, limit) {
  const pipeline = cache.pipeline();

  userIds.forEach(id => {
    pipeline.get(`ratelimit:${id}`);
  });

  const results = await pipeline.exec();
  return results.map(([err, count], i) => ({
    userId: userIds[i],
    allowed: !err && (count === null || parseInt(count, 10) < limit)
  }));
}

Local Caching for Quota Information

Cache user tier information locally (not counters):

const tierCache = new Map();

async function getUserTier(userId) {
  if (tierCache.has(userId)) {
    return tierCache.get(userId);
  }

  // Assumes the driver returns an array of rows
  const rows = await database.query(
    'SELECT tier FROM users WHERE id = ?', [userId]
  );
  const tier = rows.length ? rows[0].tier : 'free';

  // Cache tier for 5 minutes
  tierCache.set(userId, tier);
  setTimeout(() => tierCache.delete(userId), 300000);

  return tier;
}

Rate Limit Response Headers

Inform clients about their rate limit status:

app.use(async (req, res, next) => {
  // Assumes a limiter that looks up the user's tier internally and returns
  // { allowed, total, remaining, resetAt, retryAfter }
  const limit = await checkRateLimit(req.userId);

  res.setHeader('X-RateLimit-Limit', limit.total);
  res.setHeader('X-RateLimit-Remaining', limit.remaining);
  res.setHeader('X-RateLimit-Reset', limit.resetAt);

  if (!limit.allowed) {
    res.setHeader('Retry-After', limit.retryAfter);
    return res.status(429).json({
      error: 'Rate limit exceeded',
      retryAfter: limit.retryAfter
    });
  }

  next();
});

Advanced: ML-Powered Rate Limiting

Detect abuse patterns using ML instead of simple thresholds:

// getAccessPattern, mlModel and reduceRateLimit are application-specific helpers
async function detectAnomalousUsage(userId) {
  const pattern = await getAccessPattern(userId);

  const features = {
    requests_per_hour: pattern.requestRate,
    unique_endpoints: pattern.endpointDiversity,
    error_rate: pattern.errorRate,
    time_distribution: pattern.timeVariance,
    geographic_diversity: pattern.ipDiversity
  };

  const abuseScore = await mlModel.predict(features);

  if (abuseScore > 0.8) {
    // Likely abuse, apply stricter limits
    return reduceRateLimit(userId, 0.1);
  }
}

Monitoring Rate Limits

Track how often clients are throttled (429 responses), how much of their quota typical users consume, and how long each rate-limit check takes. These numbers tell you whether your limits are too strict, too loose, or too slow.
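
A minimal sketch of how those decisions might be exported, assuming the prom-client library; the metric and label names are illustrative:

// Hypothetical Prometheus counter for rate-limit decisions
const client = require('prom-client');

const rateLimitDecisions = new client.Counter({
  name: 'rate_limit_decisions_total',
  help: 'Rate limit checks, labelled by outcome and endpoint',
  labelNames: ['outcome', 'endpoint']
});

function recordDecision(endpoint, allowed) {
  rateLimitDecisions.labels(allowed ? 'allowed' : 'rejected', endpoint).inc();
}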

Conclusion

Effective API rate limiting requires fast, accurate, distributed-ready implementation. Intelligent caching provides the foundation: sub-millisecond checks, atomic operations, and scalable infrastructure. By combining token buckets, sliding windows, and ML-powered anomaly detection, you can build rate limiters that protect your APIs while maintaining excellent user experience.

Start with simple token buckets cached in Redis, then add sophistication as your API scales: tiered limits, endpoint-specific quotas, adaptive throttling, and intelligent abuse detection.

Built-In Intelligent Rate Limiting

Cachee.ai includes distributed rate limiting with automatic tier management and abuse detection.

Start Free Trial