Claude API Rate Limit Production Guide
Critical Rate Limit Structure
Three Overlapping Limit Types (All Active Simultaneously)
- Request Limits: 50/minute (enforced as ~1/second, not 50 in first second)
- Token Limits: 30,000 input tokens/minute (Tier 1)
- Weekly Limits: Added August 2025 - can exhaust entire week's quota by Tuesday
Rate Limiting Algorithm: Token Bucket (Not Fixed Windows)
- Refills gradually, not at fixed intervals
- The `retry-after` header shows the minimum wait, not when capacity is actually restored
- Practical wait time: header value + 30-60 seconds
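The buffer rule above can be sketched as a small helper. This is an illustrative function (the name and the 60-second fallback default are assumptions, not part of the API); the jitter also spreads out retries so concurrent workers don't all wake at once:

```python
import random

def safe_wait_seconds(retry_after_header):
    """Turn a retry-after header value into a realistic sleep time.

    The header is treated as a floor, not the truth: the token bucket
    refills gradually, so we add a 30-60 second buffer on top.
    """
    base = float(retry_after_header or 60)  # fall back to 60s if header missing
    return base + random.uniform(30, 60)
```

For example, a `retry-after: 5` response would yield a sleep somewhere between 35 and 65 seconds.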
Critical Failure Modes
Production-Breaking Scenarios
- Database Schema Corruption: Rate limit during migration leaves half-updated schema with broken foreign keys
- Multi-tenant Cascade: One customer's heavy usage locks out all other customers
- Background Job Quota Burn: Scheduled jobs consume daily limits before business hours
- Mid-transaction Cutoffs: Claude stops mid-response during critical operations
August 2025 Breaking Changes
- Weekly limits added on top of per-minute limits
- Can hit weekly quota Tuesday, locked out until Monday regardless of other limits
- Affected all paid tiers including Claude Max ($200/month)
Error Classification
Rate Limit Errors (429)
- "Number of request tokens has exceeded your per-minute rate limit (Tier: 1)" - token volume
- "Number of requests has exceeded your per-minute rate limit" - request frequency
- "Output token limit exceeded" - response too long
Infrastructure Errors (529) - Not Rate Limits
- "529 - overloaded_error: Anthropic's API is temporarily overloaded" - escalate, don't retry
Platform-Specific Errors
- AWS Bedrock: `ThrottlingException` (60+ second waits required)
- Google Vertex: `quota exceeded for concurrent_requests_per_base_model` (5-10 max concurrent)
Production-Tested Solutions
1. Pre-emptive Rate Limiting
```python
class ClaudeRateLimit:
    def __init__(self):
        self.requests_per_minute = 48   # buffer below the 50/min limit
        self.tokens_per_minute = 28000  # buffer below the 30k/min limit

    async def acquire_request_permit(self, estimated_tokens):
        # Implementation handles gradual token bucket refill
        # Prevents hitting API limits before they occur
        ...
```
Critical Implementation Notes:
- Use 48 requests/minute, not 50 (accounts for timing variations)
- Stay under 28k tokens/minute for safety buffer
- Track request history with timestamps for accurate calculation
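The notes above can be turned into a minimal sliding-window sketch. The 60-second window with timestamped history is an approximation of the token bucket, not Anthropic's exact refill curve, and the permit loop simply polls rather than computing the precise wait:

```python
import asyncio
import time

class ClaudeRateLimit:
    def __init__(self):
        self.requests_per_minute = 48   # buffer below the 50/min limit
        self.tokens_per_minute = 28000  # buffer below the 30k/min limit
        self.history = []               # (timestamp, tokens) sliding window

    async def acquire_request_permit(self, estimated_tokens):
        while True:
            now = time.monotonic()
            # Drop entries older than 60s - approximates gradual refill
            self.history = [(t, n) for t, n in self.history if now - t < 60]
            used_tokens = sum(n for _, n in self.history)
            if (len(self.history) < self.requests_per_minute
                    and used_tokens + estimated_tokens <= self.tokens_per_minute):
                self.history.append((now, estimated_tokens))
                return
            await asyncio.sleep(0.5)  # wait for the window to slide
```

Call `await limiter.acquire_request_permit(estimated_tokens)` immediately before each API request; when either budget is exhausted, the caller blocks instead of burning a 429.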
2. Priority Request Queue
```python
CRITICAL = 1    # Customer-facing requests
NORMAL = 2      # Internal tools
BACKGROUND = 3  # Analytics/batch jobs
```
Operational Intelligence:
- Customer requests always processed before internal tools
- Prevents background jobs from blocking user-facing features
- Essential for SaaS applications with multiple customers
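A minimal sketch of the idea using Python's built-in `asyncio.PriorityQueue`, which orders items by the first tuple element (the labels and the single-consumer `drain` loop are illustrative only):

```python
import asyncio

CRITICAL, NORMAL, BACKGROUND = 1, 2, 3  # lower number = higher priority

async def drain(queue):
    """Process queued work strictly in priority order."""
    processed = []
    while not queue.empty():
        priority, label = await queue.get()
        processed.append(label)
        queue.task_done()
    return processed

async def demo():
    queue = asyncio.PriorityQueue()
    await queue.put((BACKGROUND, "nightly-analytics"))
    await queue.put((CRITICAL, "customer-chat"))
    await queue.put((NORMAL, "internal-report"))
    return await drain(queue)

order = asyncio.run(demo())  # customer work comes out first
```

Even though the background job was enqueued first, the customer-facing request is processed ahead of it, which is exactly the behavior that keeps batch jobs from starving user traffic.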
3. Circuit Breaker Pattern
```python
class CircuitBreaker:
    # Opens after 5 consecutive failures
    # Stays open for 5 minutes before retry
    # Prevents retry storms during API outages
    ...
```
Failure Prevention:
- Stops attempting requests when API is consistently failing
- Prevents cascading failures across dependent systems
- Automatically recovers when service returns
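A compact sketch of that behavior - open after 5 consecutive failures, stay open for 5 minutes, then allow one probe request (the method names are placeholders; production libraries such as `pybreaker` offer the same pattern off the shelf):

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=300):
        self.failure_threshold = failure_threshold  # opens after 5 failures
        self.recovery_timeout = recovery_timeout    # stays open for 5 minutes
        self.failures = 0
        self.opened_at = None

    def allow_request(self):
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.recovery_timeout:
            # Half-open: let a single probe through to test recovery
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Wrap each API call in `if breaker.allow_request(): ...` and report the outcome with `record_success()` / `record_failure()`; during an outage, callers fail fast instead of piling up retries.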
Token Optimization Strategies
Context Reduction Techniques
- Conversation History: Keep only last 5 messages, summarize older ones
- Prompt Optimization: Remove politeness terms ("please", "thank you")
- Example Limiting: Maximum 1-2 examples per prompt
- Efficiency Threshold: Response length / tokens used should exceed 0.3
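The history-trimming rule above can be sketched as a small function; the summary placeholder is a stand-in for an actual summarization call (which in production you might route to a cheap model like Haiku):

```python
def compress_history(messages, keep_last=5):
    """Keep only the last `keep_last` messages, collapsing the rest
    into a single summary message to save input tokens."""
    if len(messages) <= keep_last:
        return messages
    older = messages[:-keep_last]
    summary = {"role": "user",
               "content": f"[Summary of {len(older)} earlier messages]"}
    return [summary] + messages[-keep_last:]
```

An eight-message conversation becomes one summary plus the five most recent messages, cutting the repeated context sent with every request.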
Request Batching
```python
class RequestBatcher:
    # Combines up to 5 requests into single API call
    # Max wait time: 2 seconds before processing batch
    # Reduces API calls by 80% in high-volume scenarios
    ...
```
Cost and Tier Management
Tier Advancement Thresholds (September 2025)
- Tier 1: $5 deposit, $100/month limit
- Tier 2: $40 deposit, $500/month limit
- Tier 3: $200 deposit, $1000/month limit
- Tier 4: $400 deposit, $5000/month limit
Model Costs (Per 1k Tokens)
- Claude-3.5-Sonnet: Input $0.003, Output $0.015
- Claude-3.5-Haiku: Input $0.00025, Output $0.00125
- Claude-3-Opus: Input $0.015, Output $0.075
Cost Optimization:
- Use Haiku for simple tasks (roughly 12x cheaper than Sonnet on input tokens, per the table above)
- Monitor daily spend to prevent unexpected tier advancement
- Alert at 80% of tier limits to optimize usage before advancement
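A quick cost calculator built directly from the per-1k-token table above (the model keys are informal labels, not official API model IDs):

```python
# (input, output) price per 1k tokens, from the table above
PRICES = {
    "claude-3.5-sonnet": (0.003, 0.015),
    "claude-3.5-haiku": (0.00025, 0.00125),
    "claude-3-opus": (0.015, 0.075),
}

def request_cost(model, input_tokens, output_tokens):
    """Dollar cost of a single request at the listed rates."""
    inp, out = PRICES[model]
    return (input_tokens / 1000) * inp + (output_tokens / 1000) * out

# A 2,000-token prompt with a 500-token reply on Sonnet
sonnet_cost = request_cost("claude-3.5-sonnet", 2000, 500)  # $0.0135
```

Running the same numbers through Haiku shows why routing simple tasks there matters: the identical request costs about $0.0011.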
Emergency Response Procedures
Crisis Management Checklist
- Stop non-critical API calls (analytics, batch jobs)
- Activate aggressive caching (extend TTL from 1 hour to 24 hours)
- Enable fallback responses for critical user functions
- Alert stakeholders before customers notice service degradation
Multi-Provider Fallback
```python
async def get_ai_response(prompt):
    providers = ['claude', 'openai', 'gemini']
    # Try each provider in sequence
    # Returns generic message if all providers fail
    ...
```
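One way to flesh that stub out, assuming each provider is wrapped in an async callable and a canned string is an acceptable last resort (both are placeholders, not a real multi-provider SDK):

```python
import asyncio

async def get_ai_response(prompt, providers):
    """Try each provider callable in order; fall back to a canned reply."""
    for name, call in providers.items():
        try:
            return await call(prompt)
        except Exception:
            continue  # rate limited or down - move to the next provider
    return "Service temporarily unavailable - please try again shortly."

# Hypothetical provider wrappers for demonstration
async def claude_down(prompt):
    raise RuntimeError("429: rate limited")

async def openai_ok(prompt):
    return f"openai: {prompt}"

result = asyncio.run(
    get_ai_response("hi", {"claude": claude_down, "openai": openai_ok})
)
```

When Claude returns a 429, the call transparently falls through to the next provider; only when every provider fails does the user see the generic message.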
Monitoring and Alerting
Critical Metrics to Track
- Token burn rate: Tokens used per minute over time
- Request frequency: Requests per minute with burst detection
- Weekly quota usage: Daily tracking to prevent mid-week lockouts
- Cost per request: Monitor for efficiency degradation
Alert Thresholds
- 80% token usage: Warning alert
- 90% token usage: Critical alert
- $50 daily spend: Cost warning
- 95% weekly quota: Emergency alert
Response Header Monitoring
```
anthropic-ratelimit-requests-remaining: 23
anthropic-ratelimit-input-tokens-remaining: 15000
anthropic-ratelimit-output-tokens-remaining: 4500
```
Parse these headers after every request - critical for capacity planning
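A minimal parser for those headers (the function name and the `None`-for-missing convention are choices of this sketch; `headers` is any dict-like of response headers):

```python
def parse_rate_limit_headers(headers):
    """Extract remaining-capacity values from an API response.

    Missing headers map to None so callers can distinguish
    'header absent' from 'zero capacity left'.
    """
    keys = {
        "requests": "anthropic-ratelimit-requests-remaining",
        "input_tokens": "anthropic-ratelimit-input-tokens-remaining",
        "output_tokens": "anthropic-ratelimit-output-tokens-remaining",
    }
    return {name: int(headers[h]) if h in headers else None
            for name, h in keys.items()}

remaining = parse_rate_limit_headers({
    "anthropic-ratelimit-requests-remaining": "23",
    "anthropic-ratelimit-input-tokens-remaining": "15000",
})
```

Feed the result into your rate limiter or metrics pipeline after every response so capacity decisions are based on the server's view, not your local estimate.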
Platform-Specific Implementations
AWS Bedrock Differences
- Uses `ThrottlingException` instead of 429 errors
- Requires 60+ second waits (longer than direct API)
- `ModelNotReadyException` for cold starts (30 second wait)
- Different error handling patterns required
Google Vertex AI Limitations
- Concurrent request limits (5-10 max recommended)
- Separate quota system from direct Anthropic API
- Request quota increases take 2-3 business days
- Platform-specific error codes
Production Reliability Patterns
Caching Strategy
```python
class ClaudeResponseCache:
    # MD5 hash of request parameters as cache key
    # 1 hour TTL for normal operations
    # 24 hour TTL during rate limit emergencies
    # Reduces API calls by 60-80% in typical applications
    ...
```
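A minimal in-memory sketch of that cache (production deployments would back this with Redis, per the dependencies section, but the keying and TTL logic are the same):

```python
import hashlib
import json
import time

class ClaudeResponseCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds   # 1h normally; raise to 86400 in emergencies
        self.store = {}          # key -> (expires_at, response)

    def _key(self, params):
        # MD5 of the canonicalized request parameters as the cache key
        blob = json.dumps(params, sort_keys=True).encode()
        return hashlib.md5(blob).hexdigest()

    def get(self, params):
        entry = self.store.get(self._key(params))
        if entry and entry[0] > time.monotonic():
            return entry[1]
        return None  # missing or expired

    def put(self, params, response):
        self.store[self._key(params)] = (time.monotonic() + self.ttl, response)
```

During a rate-limit emergency, constructing the cache with `ttl_seconds=86400` implements the 24-hour emergency TTL from the crisis checklist without any other code changes.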
Request Consolidation
- Batch Size: 5 requests per batch call
- Max Wait: 2 seconds before processing incomplete batch
- Token Savings: 40-60% reduction in token usage
- Complexity Cost: Requires response parsing logic
Architecture Decision Framework
When to Implement Each Strategy
| Strategy | Implementation Difficulty | Reliability Impact | Use Case |
|---|---|---|---|
| Basic Rate Limiting | 2-3 days | High | All production applications |
| Priority Queues | 1 week | Medium | Multi-tenant SaaS |
| Circuit Breakers | 1-2 days | High | Customer-facing features |
| Multi-provider Fallback | 1 week | Medium | Critical business functions |
| Aggressive Caching | 2-3 days | Medium | High-volume applications |
Resource Requirements
- Development Time: 1-2 weeks for complete rate limit handling
- Infrastructure: Redis for caching, monitoring tools for alerts
- Maintenance Overhead: Weekly quota monitoring, monthly cost optimization
Critical Implementation Warnings
What Will Break Your Production
- Trusting retry-after headers: Always add 30-60 seconds buffer
- Fixed 60-second waits: Token bucket refills gradually, not instantly
- Ignoring weekly limits: Can lockout for entire week regardless of daily capacity
- Mixing 429/529 errors: Different root causes require different responses
- No pre-emptive limiting: Reactive error handling causes user-visible failures
Success Criteria
- Zero user-visible rate limit errors during normal operations
- Graceful degradation during API outages (fallback responses)
- Cost predictability with spending alerts before tier advancement
- Recovery time under 5 minutes when rate limits are hit
Technical Dependencies
- Token counting library: tiktoken (Python) or transformers.js (JavaScript)
- Async HTTP client: aiohttp (Python) or axios (Node.js) with retry capabilities
- Caching layer: Redis or in-memory cache with TTL support
- Monitoring stack: Grafana/Prometheus or DataDog for metrics tracking
Emergency Contacts and Escalation
- Anthropic Status: https://status.anthropic.com/
- Enterprise Support: Available for Priority Tier customers
- Community Support: GitHub issues for Claude Code, Stack Overflow for general API questions
- Platform Support: AWS/Google enterprise support for Bedrock/Vertex AI implementations
Useful Links for Further Investigation
Links That Actually Help (No Bullshit)
Link | Description |
---|---|
Claude API Rate Limits | The official docs that explain the token bucket algorithm. Actually useful once you get past the marketing speak |
Claude API Errors | Error codes that you'll memorize by heart after debugging rate limits for a week |
Anthropic Console - Limits Page | Where you go to see how fucked you are right now |
Anthropic Console - Usage Dashboard | Official support guide for monitoring usage and increasing limits |
Anthropic Status Page | Says "all systems operational" even when everything is broken |
Python: tenacity | Actually works for retry logic. Used this in production for 6 months with no issues |
Node.js: axios-retry | Does what it says on the tin. Saved me from writing custom retry logic |
.NET: Polly | If you're stuck in Microsoft hell, this is your best option |
Python: aiohttp | Good for async Python apps. Better than requests for high-concurrency |
tiktoken Python Library | More accurate token counting than rough character estimates, works reasonably well for Claude API |
transformers.js Token Counter | Client-side token counting for web applications to prevent over-limit requests |
Anthropic API Console | Official Anthropic workspace for monitoring token usage and costs |
OpenAI Token Calculator | Cross-reference tool for token estimation (works as approximation for Claude) |
DataDog API Monitoring | Enterprise monitoring that tracks Claude API response times, error rates, and rate limit metrics |
New Relic Application Performance Monitoring | Comprehensive API monitoring with custom dashboards for Claude usage patterns |
Grafana + Prometheus | Open source monitoring stack for building custom Claude API health dashboards |
StatusCake API Monitoring | Affordable external monitoring for Claude API health checks and availability |
PagerDuty Incident Response | Escalation system for rate limit alerts and production API issues |
curl Command Line Tool | Essential for debugging Claude API connection issues with verbose output (-v flag) |
HTTPie | Modern curl alternative with cleaner output for testing Claude API requests and responses |
Postman Claude API Collection | GUI testing environment for debugging rate limit behaviors across different models |
Wireshark Network Analyzer | Deep packet inspection for diagnosing network-level issues with Claude API connections |
mtr Network Route Tracing | Better than traceroute for finding network issues between your servers and Anthropic |
OpenAI API | Your most likely fallback. Different rate limits, similar quality |
Google Gemini API | Free tier is actually decent. Quality varies but it's something |
AWS Bedrock | Same Claude models, different rate limiting. Worth trying if you're already on AWS |
Azure OpenAI Service | If you're deep in Microsoft ecosystem |
vLLM | High-performance open source LLM serving with no rate limits for self-hosted deployments |
Ollama | Easy local LLM deployment tool for running open source models without API restrictions |
Text Generation WebUI | Web interface for local LLM hosting with API endpoints |
LlamaIndex | Framework for building applications with multiple LLM providers and fallback strategies |
LangChain | Development framework with built-in rate limiting and multi-provider support |
Anthropic Console Billing | Official usage tracking with detailed token and cost breakdowns |
AWS Cost Explorer | Cost tracking for Claude API usage through AWS Bedrock |
Google Cloud Billing | Cost monitoring for Claude usage through Google Vertex AI |
CloudZero Cost Intelligence | Third-party cost monitoring that tracks Claude API spending across multiple accounts |
Claude Code GitHub Issues | Other developers sharing the same pain. Good for finding workarounds |
Anthropic Console Support | Submit support tickets and track usage directly through the console |
Stack Overflow Claude Tag | Hit or miss, but sometimes has the exact solution you need |
Anthropic Help Center | Official support that may or may not help you in a reasonable timeframe |
Docker Health Checks | Container health monitoring for services calling Claude API |
Kubernetes Probes | Pod health monitoring and automatic restart for Claude API service failures |
systemd Service Monitoring | Linux service monitoring and automatic restart for Claude API workers |
Redis | High-performance caching for Claude API responses to reduce request volume |
PostgreSQL | Database for logging detailed error patterns and usage analytics |
Anthropic Research Papers | Technical research on AI safety and scaling that informs rate limiting decisions |
Token Bucket Algorithm | Wikipedia reference for understanding Claude's rate limiting algorithm |
2025 Claude Code Rate Limit Analysis | Detailed analysis of August 2025 rate limit changes |
Portkey: Everything We Know About Claude Code Limits | Comprehensive analysis of Claude Code rate limiting changes |
Slack API Notifications | Send rate limit alerts and debugging info to team channels |
Discord Webhooks | Free alternative to Slack for small teams handling API incidents |
OpsGenie | Incident response platform with escalation for critical rate limit failures |
Incident.io | Modern incident management for API outages and rate limiting issues |
AWS Bedrock Documentation | Complete integration guide for Claude through AWS with different rate limiting |
AWS CloudWatch Metrics | Monitor Bedrock Claude usage and throttling events |
Vertex AI Documentation | Google Cloud Claude integration with platform-specific quota management |
Google Cloud Quotas | Request quota increases for Claude on Vertex AI |
Anthropic Enterprise Sales | Contact for custom rate limits and Priority Tier access |
AWS Enterprise Support | Enterprise-level support for Bedrock Claude deployments |
Google Cloud Customer Care | Enterprise support for Vertex AI Claude implementations |