OPTIMIZATION

Claude API Rate Limits in 2026: What to Do When You Hit the Wall

A practical guide to Claude API rate limits — understand tiers, handle 429 errors, implement retry logic, route models intelligently, and use gateways to avoid throttling.

You have been building your product for weeks. The demo is tomorrow. You push to staging, trigger a load test, and within thirty seconds your logs are flooded with HTTP 429 responses. The Claude API has decided you are done for the minute. This post exists so that moment never catches you off guard again. We will walk through exactly how Anthropic's rate limits work, what levers you have, and how to architect a system that degrades gracefully instead of falling over.

How Claude rate limits work

Anthropic enforces rate limits at three levels, and you need to think about all three simultaneously:

  1. Requests per minute (RPM) — the raw number of API calls you can make in a 60-second sliding window. This caps concurrency regardless of how small your prompts are.
  2. Tokens per minute (TPM) — the total input + output tokens the API will process in a 60-second window. A single massive prompt can eat your entire TPM budget in one call (see the budget sketch just after this list).
  3. Tokens per day (TPD) — a daily ceiling on total token throughput. Even if you stay under RPM and TPM, you can exhaust this over a long enough sustained burst.
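
To make the token math concrete, here is a quick budget check. The 400,000 TPM figure is only an illustration (it matches the example headers later in this post); both input and output tokens count against the window.

# Illustrative TPM budget check: real limits come from the
# anthropic-ratelimit-* response headers, not a hard-coded constant.
TPM_LIMIT = 400_000

def fits_in_window(prompt_tokens: list[int], max_output_tokens: int) -> bool:
    """True if combined input + worst-case output tokens stay under the per-minute budget."""
    total = sum(prompt_tokens) + len(prompt_tokens) * max_output_tokens
    return total <= TPM_LIMIT

print(fits_in_window([30_000] * 10, 1_000))   # 310K tokens: fits
print(fits_in_window([100_000] * 4, 4_000))   # 416K tokens: throttled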

Each model has its own limits. Haiku is the most generous because it is the cheapest to serve. Opus is the most restricted because each request consumes significant compute. Sonnet sits in between. This is not a bug — it is Anthropic signalling which model you should use for high-volume workloads.

Rate limits are applied per API key and per model. If you have one key hitting Sonnet at 4,000 RPM and another key hitting Haiku at 4,000 RPM, those are independent buckets. But two services sharing the same key and the same model will compete for the same bucket.

The headers that matter

Every response from the Claude API includes rate limit headers:

anthropic-ratelimit-requests-limit: 4000
anthropic-ratelimit-requests-remaining: 3842
anthropic-ratelimit-requests-reset: 2026-04-29T12:01:00Z
anthropic-ratelimit-tokens-limit: 400000
anthropic-ratelimit-tokens-remaining: 312000
anthropic-ratelimit-tokens-reset: 2026-04-29T12:01:00Z

When you exceed a limit, you get an HTTP 429 Too Many Requests response with a retry-after header telling you how many seconds to wait. The response body will also include an error object with type: "rate_limit_error".

Understanding HTTP 429 errors

A 429 is not a failure — it is a signal. The API is telling you "I heard you, but you need to slow down." The worst thing you can do is immediately retry at full speed, because that will keep you pinned at the limit. The best thing you can do is respect the retry-after header and use exponential backoff.

Here is what a 429 response looks like:

{
  "type": "error",
  "error": {
    "type": "rate_limit_error",
    "message": "Number of request tokens has exceeded your per-model limit..."
  }
}

Common causes of 429 errors in production:

  • Burst traffic — a queue drains all at once after a deploy
  • Large prompts — a single 100K-token context window eats your TPM
  • Missing backoff — retry loops without delays create a feedback spiral
  • Shared keys — multiple services on one key without coordination

Anthropic's tier system

Anthropic segments API access into tiers based on your usage history and spend. Each tier unlocks higher rate limits. As of early 2026, the tiers look roughly like this:

Tier | Monthly spend | Sonnet RPM | Sonnet TPM | Opus RPM | Opus TPM
Free | $0 | 50 | 40,000 | 5 | 10,000
Build (T1) | $5+ | 1,000 | 80,000 | 100 | 40,000
Build (T2) | $40+ | 2,000 | 160,000 | 200 | 80,000
Scale (T3) | $200+ | 4,000 | 400,000 | 400 | 200,000
Scale (T4) | $1,000+ | 8,000 | 800,000 | 1,000 | 400,000
Enterprise | Custom | Custom | Custom | Custom | Custom

A few things to note:

  • Moving from Free to Build is a massive jump — 20× on Sonnet RPM.
  • The gap between T3 and T4 requires serious spend but doubles everything.
  • Enterprise gets custom limits negotiated with Anthropic's sales team, often >80,000 RPM on Sonnet.
  • Haiku limits are typically 2–4× the Sonnet limits at every tier.

The cold start problem

New API keys start at the bottom. If you are building a product that needs T3 limits on day one, you have two options: negotiate with Anthropic (slow, requires a relationship) or use a gateway like Claudexia that already has provisioned capacity across multiple high-tier keys.

Strategies to handle rate limits

1. Exponential backoff with jitter

The gold standard for retry logic. On each 429, wait an exponentially increasing amount of time plus a random jitter to avoid a thundering herd.

Python implementation:

import anthropic
import time
import random

def call_with_backoff(
    client: anthropic.Anthropic,
    max_retries: int = 5,
    base_delay: float = 1.0,
    **kwargs,
) -> anthropic.types.Message:
    for attempt in range(max_retries):
        try:
            return client.messages.create(**kwargs)
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Respect the retry-after header when the SDK exposes the response
            response = getattr(e, "response", None)
            if response is not None and response.headers.get("retry-after"):
                delay = float(response.headers["retry-after"])
            else:
                delay = base_delay * (2 ** attempt)
            # Add jitter: 0–50% of delay
            jitter = delay * random.uniform(0, 0.5)
            time.sleep(delay + jitter)
    raise RuntimeError("Unreachable")

TypeScript implementation:

import Anthropic from "@anthropic-ai/sdk";

async function callWithBackoff(
  client: Anthropic,
  params: Anthropic.MessageCreateParams,
  maxRetries = 5,
  baseDelay = 1000,
): Promise<Anthropic.Message> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await client.messages.create(params);
    } catch (err) {
      if (
        err instanceof Anthropic.RateLimitError &&
        attempt < maxRetries - 1
      ) {
        const retryAfter = err.headers?.["retry-after"];
        const delay = retryAfter
          ? parseFloat(retryAfter) * 1000
          : baseDelay * 2 ** attempt;
        const jitter = delay * Math.random() * 0.5;
        await new Promise((r) => setTimeout(r, delay + jitter));
        continue;
      }
      throw err;
    }
  }
  throw new Error("Unreachable");
}

2. Request queuing

Instead of firing requests as fast as they arrive, put them in a queue and drain at a controlled rate. This is especially important when you have bursty traffic — a user uploads 50 documents and you need to process each one.

import asyncio

class RateLimitedQueue:
    """Spaces out request starts so a burst never exceeds the RPM budget."""

    def __init__(self, rpm: int = 1000, max_concurrency: int = 100):
        self.interval = 60.0 / rpm  # minimum gap between request starts
        self.semaphore = asyncio.Semaphore(max_concurrency)  # cap on in-flight requests
        self._pace_lock = asyncio.Lock()
        self._next_start = 0.0

    async def submit(self, coro):
        async with self.semaphore:
            # Serialise the pacing so request starts stay evenly spaced across tasks
            async with self._pace_lock:
                now = asyncio.get_running_loop().time()
                if now < self._next_start:
                    await asyncio.sleep(self._next_start - now)
                    now = self._next_start
                self._next_start = now + self.interval
            return await coro
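
A usage sketch for the burst case described above: pace 50 document summaries instead of firing them all at once. It assumes the async variant of the SDK client; `documents` is a hypothetical list of strings.

import anthropic
import asyncio

async def process_documents(documents: list[str]) -> list:
    client = anthropic.AsyncAnthropic()
    queue = RateLimitedQueue(rpm=1000)
    tasks = [
        queue.submit(
            client.messages.create(
                model="claude-sonnet-4.5",
                max_tokens=1024,
                messages=[{"role": "user", "content": f"Summarise: {doc}"}],
            )
        )
        for doc in documents
    ]
    # The queue spaces out the request starts; gather collects the results
    return await asyncio.gather(*tasks)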

3. Batching with the Batch API

If your workload can tolerate latency of up to 24 hours, the Anthropic Batch API processes requests at half the cost and with separate rate limits. This is ideal for classification jobs, data extraction, bulk summarisation, and nightly pipelines.

import anthropic

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4.5",
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": f"Summarise: {doc}"}
                ],
            },
        }
        for i, doc in enumerate(documents)
    ]
)
# Poll batch.id for results — up to 24h
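
A minimal sketch of collecting the output once the batch finishes, using the `batch` object created above; the polling interval is arbitrary.

import time

while True:
    batch = client.messages.batches.retrieve(batch.id)
    if batch.processing_status == "ended":
        break
    time.sleep(60)  # a batch can take up to 24 hours; poll at whatever cadence suits you

for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        print(entry.custom_id, entry.result.message.content[0].text)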

Batching is one of the most underused features of the Claude API. If even 30% of your traffic is not latency-sensitive, moving it to batches frees up your real-time RPM and TPM for the requests that matter.

Model routing: the single biggest lever

Not every request needs Opus. In fact, most requests do not even need Sonnet. A well-designed model router can cut your costs by 60–80% and dramatically reduce rate limit pressure by spreading load across models with different limit pools.

The principle is simple:

  • Haiku — classification, extraction, simple Q&A, validation, routing decisions. Fast, cheap, generous limits.
  • Sonnet — summarisation, code generation, multi-step reasoning, content creation. The workhorse.
  • Opus — complex analysis, long-document reasoning, agentic loops, tasks where accuracy is critical and cost is secondary.

Here is a production-grade router:

import anthropic
from enum import Enum

class Complexity(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"

def estimate_complexity(prompt: str, token_count: int) -> Complexity:
    """Route based on heuristics. Replace with a classifier for precision."""
    # Short prompts with clear instructions → Haiku
    if token_count < 2000 and not any(
        kw in prompt.lower()
        for kw in ["analyze", "compare", "reason step by step", "architect"]
    ):
        return Complexity.LOW
    # Very long context or explicit reasoning demands → Opus
    if token_count > 50_000 or "think carefully" in prompt.lower():
        return Complexity.HIGH
    return Complexity.MEDIUM

MODEL_MAP = {
    Complexity.LOW: "claude-haiku-4-20250414",
    Complexity.MEDIUM: "claude-sonnet-4.5",
    Complexity.HIGH: "claude-opus-4.5",
}

def routed_call(client: anthropic.Anthropic, prompt: str, **kwargs):
    token_est = len(prompt) // 4  # rough estimate
    complexity = estimate_complexity(prompt, token_est)
    model = MODEL_MAP[complexity]
    return client.messages.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **kwargs,
    )

TypeScript version:

import Anthropic from "@anthropic-ai/sdk";

type Complexity = "low" | "medium" | "high";

function estimateComplexity(prompt: string): Complexity {
  // Rough token estimate: ~4 characters per token
  const tokenEstimate = prompt.length / 4;
  const lowerPrompt = prompt.toLowerCase();
  if (
    tokenEstimate < 2000 &&
    !["analyze", "compare", "reason step by step"].some((kw) =>
      lowerPrompt.includes(kw),
    )
  ) {
    return "low";
  }
  if (tokenEstimate > 50_000 || lowerPrompt.includes("think carefully")) {
    return "high";
  }
  return "medium";
}

const MODEL_MAP: Record<Complexity, string> = {
  low: "claude-haiku-4-20250414",
  medium: "claude-sonnet-4.5",
  high: "claude-opus-4.5",
};

async function routedCall(
  client: Anthropic,
  prompt: string,
  maxTokens = 4096,
): Promise<Anthropic.Message> {
  const complexity = estimateComplexity(prompt);
  return client.messages.create({
    model: MODEL_MAP[complexity],
    max_tokens: maxTokens,
    messages: [{ role: "user", content: prompt }],
  });
}

The key insight is that model routing does not just save money — it redistributes rate limit pressure. Haiku has 2–4× the RPM of Sonnet. By routing 60% of your traffic to Haiku, you free up Sonnet capacity for the requests that actually need it.

Prompt caching to reduce token consumption

Prompt caching is Anthropic's mechanism for avoiding re-processing the same input tokens across requests. If your system prompt, tool definitions, and few-shot examples are identical across calls, you can mark them as cacheable and pay 10× less for those tokens on subsequent requests.

The impact on rate limits is indirect but significant: cached tokens count toward your TPM at a reduced weight, meaning you can fit more requests into the same window.

import anthropic

client = anthropic.Anthropic()

# The system prompt and tools are cached after the first call
response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for Acme Corp...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
# Check cache performance
print(f"Cache read: {response.usage.cache_read_input_tokens}")
print(f"Cache creation: {response.usage.cache_creation_input_tokens}")

For a system prompt of 6,000 tokens called 10,000 times per month, caching reduces your effective input token spend from 60M tokens at $0.33/M ($19.80) to 60M tokens at $0.05/M ($3). That is roughly an 85% reduction on the cached portion.

Using a gateway for separate rate limit pools

This is the strategy most teams discover too late. Anthropic rate limits are per API key. If you use the Anthropic API directly, you have one key and one set of limits. A gateway like Claudexia maintains multiple high-tier API keys and distributes your requests across them, effectively giving you a separate rate limit pool that is larger than what any single key provides.

Here is what changes in your code:

import anthropic

# Before: direct to Anthropic
client = anthropic.Anthropic(api_key="sk-ant-...")

# After: through Claudexia — same SDK, different base URL
client = anthropic.Anthropic(
    api_key="your-claudexia-key",
    base_url="https://api.claudexia.tech/v1",
)

# Everything else stays identical
response = client.messages.create(
    model="claude-sonnet-4.5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}],
)

TypeScript version:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic({
  apiKey: "your-claudexia-key",
  baseURL: "https://api.claudexia.tech/v1",
});

// Same SDK, same types, higher limits
const response = await client.messages.create({
  model: "claude-sonnet-4.5",
  max_tokens: 1024,
  messages: [{ role: "user", content: "Hello" }],
});

The key advantage is that you get higher effective limits without negotiating an enterprise contract with Anthropic, and you can start immediately. Claudexia passes through Anthropic's pricing 1:1 with no markup on token costs.

Comparison table: rate limits across providers

Factor | Anthropic T1 | Anthropic T3 | Anthropic Enterprise | Claudexia | OpenAI (GPT-4o) | Together.ai
Sonnet/equiv. RPM | 1,000 | 4,000 | Custom | Pooled (higher) | 5,000 (GPT-4o) | 600
Sonnet/equiv. TPM | 80,000 | 400,000 | Custom | Pooled (higher) | 800,000 | 100,000
Opus/equiv. RPM | 100 | 400 | Custom | Pooled (higher) | 500 (o1) | N/A
Batch API | Yes (50% off) | Yes (50% off) | Yes (50% off) | Yes (passthrough) | Yes (50% off) | No
Prompt caching | Yes (10× saving) | Yes (10× saving) | Yes (10× saving) | Yes (passthrough) | Yes (partial) | No
Multi-key pooling | Manual | Manual | Managed | Automatic | Manual | Manual
Time to T3+ limits | Weeks | Weeks | Negotiation | Immediate | Weeks | N/A
SDK compatibility | Native | Native | Native | Drop-in | Separate SDK | OpenAI-compatible

Monitoring and alerting

You should not wait for 429s to learn you are approaching limits. Build monitoring around the rate limit headers:

import anthropic
import logging

logger = logging.getLogger("rate_limits")

def monitored_call(client: anthropic.Anthropic, **kwargs):
    # The raw-response wrapper exposes the HTTP headers alongside the parsed Message
    raw = client.messages.with_raw_response.create(**kwargs)

    remaining_requests = raw.headers.get(
        "anthropic-ratelimit-requests-remaining"
    )
    remaining_tokens = raw.headers.get(
        "anthropic-ratelimit-tokens-remaining"
    )

    if remaining_requests and int(remaining_requests) < 100:
        logger.warning(f"RPM running low: {remaining_requests} remaining")
    if remaining_tokens and int(remaining_tokens) < 50_000:
        logger.warning(f"TPM running low: {remaining_tokens} remaining")

    return raw.parse()

Set alerts when remaining capacity drops below 20% of your limit. This gives you time to throttle, re-route, or scale before you start getting 429s.
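
A minimal sketch of that 20% check as a hypothetical helper; it takes the header mapping from the raw response in the previous snippet and works for both the requests and tokens buckets.

def remaining_fraction(headers, kind: str = "requests") -> float | None:
    """Remaining capacity as a fraction of the limit, or None if the headers are absent."""
    limit = headers.get(f"anthropic-ratelimit-{kind}-limit")
    remaining = headers.get(f"anthropic-ratelimit-{kind}-remaining")
    if not (limit and remaining):
        return None
    return int(remaining) / int(limit)

# Inside monitored_call, once the raw response is available:
for kind, label in (("requests", "RPM"), ("tokens", "TPM")):
    frac = remaining_fraction(raw.headers, kind)
    if frac is not None and frac < 0.2:
        logger.warning(f"{label} budget below 20%: {frac:.0%} remaining")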

Putting it all together: a production architecture

A robust production system combines all of the above:

  1. Model router classifies incoming requests and picks Haiku, Sonnet, or Opus.
  2. Request queue enforces a maximum drain rate per model.
  3. Prompt caching reduces token consumption for repeated system prompts.
  4. Retry with backoff handles transient 429s gracefully.
  5. Gateway (Claudexia) provides pooled limits and removes the cold start problem.
  6. Monitoring alerts before you hit the wall instead of after.
  7. Batch API offloads non-urgent work to a separate limit pool with a 50% cost discount.

This is not over-engineering. Every production Claude deployment of meaningful scale hits rate limits eventually. The question is whether your system handles it transparently or wakes someone up at 3 AM.
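
As a rough sketch of how these pieces compose, here is some hypothetical glue code that reuses the helpers defined earlier in this post (call_with_backoff, RateLimitedQueue, estimate_complexity, and MODEL_MAP):

import asyncio
import anthropic

client = anthropic.Anthropic()      # or point base_url at a gateway
queue = RateLimitedQueue(rpm=1000)  # step 2: controlled drain rate

async def handle(prompt: str) -> anthropic.types.Message:
    # Step 1: the router picks the cheapest model that can do the job
    model = MODEL_MAP[estimate_complexity(prompt, len(prompt) // 4)]
    # Steps 2 and 4: the queue paces the request, the backoff wrapper absorbs 429s
    return await queue.submit(
        asyncio.to_thread(
            call_with_backoff,
            client,
            model=model,
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
    )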

FAQ

What exactly is a rate limit on the Claude API?

A rate limit is a cap on how many requests (RPM), tokens (TPM), or daily tokens (TPD) your API key can consume in a given window. When you exceed any of these, the API returns HTTP 429 until the window resets.

How do I check my current rate limit tier?

Log into the Anthropic Console and navigate to your API key settings. Your tier is displayed alongside your usage. You can also inspect the anthropic-ratelimit-* headers in any API response to see your current limits and remaining capacity.

Can I request a rate limit increase from Anthropic?

Yes. For Build and Scale tiers, limits increase automatically as your monthly spend grows. For custom limits beyond T4, you need to contact Anthropic's sales team and negotiate an Enterprise agreement. This typically requires a minimum annual commitment.

What is the difference between RPM and TPM limits?

RPM (requests per minute) limits how many API calls you can make regardless of size. TPM (tokens per minute) limits total token throughput. You can hit TPM with a single massive prompt even if your RPM is fine. Both must stay under their ceilings.

Does the Batch API have separate rate limits?

Yes. Batch requests run asynchronously with their own limit pool and cost 50% less. They are ideal for any workload that can tolerate up to 24 hours of latency — classification pipelines, nightly report generation, bulk data extraction.

How does prompt caching help with rate limits?

Cached tokens are processed faster and count at reduced weight against your TPM. A 6,000-token system prompt that is cached means the API only processes the dynamic portion of each request, effectively increasing your throughput within the same TPM window.

Will using a gateway like Claudexia violate Anthropic's terms?

No. Claudexia routes requests to the official Anthropic API. It functions as a proxy that provides key management, pooled limits, and monitoring. Your requests still reach the same Claude models through Anthropic's infrastructure.

What is the fastest way to stop getting 429 errors right now?

Three immediate actions: (1) add exponential backoff with jitter to your retry logic, (2) route simple tasks to Haiku to free up Sonnet capacity, and (3) switch to a gateway like Claudexia for higher pooled limits. You can implement all three in under an hour.


Rate limits are a fact of life with any managed API, and Claude is no exception. But they are a solvable problem. The teams that handle them well are the ones that plan for them before they ship, not after the first outage. If you want higher limits without the wait, Claudexia gives you pooled capacity, drop-in SDK compatibility, and the same Anthropic pricing — so you can focus on building instead of watching rate limit dashboards.