Getting HTTP 429 "Usage limit exceeded" but I barely used anything

Rate limits are fucked. You hit 50 requests per minute on Tier 1, but that's measured as 1 request per second, not 50 in the first second then wait 59 seconds. **Immediate fix:** Add exponential backoff retry logic. Here's what actually works: ```python import time import random def retry_with_backoff(api_call, max_retries=5): for attempt in range(max_retries): try: return api_call() except Exception as e: if "429" in str(e) and attempt < max_retries - 1: wait_time = (2 ** attempt) + random.uniform(0, 1) time.sleep(wait_time) else: raise e ```

API suddenly returns "Provider returned error" through OpenRouter

This one's a nightmare. OpenRouter works fine with other models, but Claude 3.5 Haiku randomly starts failing with zero useful error message. **Nuclear option that works:** Switch to direct Anthropic API temporarily. OpenRouter has some weird interaction with Haiku specifically that breaks intermittently. **Debugging steps:** 1. Test the exact same prompt with direct Anthropic API 2. If it works there, it's OpenRouter's fault 3. Try different model on OpenRouter (like claude-3-5-sonnet) to confirm

Anthropic's API is completely down (like September 10th)

When everything breaks at once, you need backup plans ready. **Immediate workaround:** Switch to GPT-4o Mini as fallback. Yeah, it's dumber, but it's better than your app being completely broken. **What to check first:** 1. [https://status.anthropic.com/](https://status.anthropic.com/) - bookmark this 2. Your error logs for specific 5xx codes 3. If it's just rate limits vs actual outage

Response quality suddenly turned to garbage

Happened August 26 - September 5, 2024. Multiple bugs in production affecting Haiku quality without warning. **This is real:** Anthropic confirmed two separate bugs that degraded Claude Haiku 3.5 responses. It wasn't your imagination. **What to do:** - Check the status page for quality incidents - Switch models temporarily if you notice degradation - Keep historical examples to compare against

Why does Haiku randomly fail tool use but work fine for text?

Tool use in Haiku is brittle as hell. It works most of the time but fails way more often than Sonnet. The failures are usually: **Malformed JSON in function calls:** Haiku occasionally adds trailing commas or extra brackets **Wrong parameter types:** Sends string "123" instead of integer 123 **Hallucinated function names:** Calls `deleteAllUsers()` when you defined `deleteUser()` **Bulletproof fix:** ```python import json from jsonschema import validate def validate_tool_response(response, expected_schema): try: parsed = json.loads(response) validate(instance=parsed, schema=expected_schema) return parsed except (json.JSONDecodeError, ValidationError) as e: # Log the failure and retry with more explicit instructions return None ```

Prompt caching saves 90% but I'm not seeing those savings

The [90% savings claim](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) requires identical system prompts across requests. If you're adding timestamps, request IDs, or dynamic content to your system prompt, caching is useless. **What breaks caching:** - Dynamic timestamps: "Current time: 2024-09-16 15:30:21" - Request IDs: "Request ID: abc123" - User-specific info in system prompts **What enables caching:** - Static system prompts with role instructions - Fixed examples or templates - Consistent code context **Real savings we see:** Around 30-60% for most applications, closer to 90% only for very repetitive tasks.

Getting different responses for identical prompts

Haiku has a temperature of 0.1 by default, not 0. For deterministic responses, you need: ```python response = client.messages.create( model="claude-3-5-haiku-20241022", messages=[{"role": "user", "content": prompt}], temperature=0, # Actually deterministic max_tokens=1000 ) ``` Even at temperature=0, you might see slight variations in: - Code formatting (spaces vs tabs) - Word choice in explanations - Order of listed items

Haiku costs 5x more than GPT-4o Mini but response quality is inconsistent

Yeah, it's expensive. But the math works if: 1. **Developer time costs more than $200/hour** - then the better suggestions save money 2. **Response speed matters** - Haiku is consistently faster 3. **Tool use reliability matters** - Haiku breaks less than GPT-4o Mini **Cost optimization that works:** - Use Haiku for user-facing features where speed matters - Use GPT-4o Mini for batch processing and internal tools - Cache aggressively with identical system prompts - Set hard token limits in code, not just billing alerts

AWS Bedrock vs direct API - what breaks differently?

**Bedrock issues:** - Higher latency (add ~200ms) - Different error format (AWS errors wrapped around Anthropic errors) - VPC configuration can cause mysterious timeouts - Billing is delayed and confusing **Direct API issues:** - More frequent outages during model deployments - Rate limit headers are more accurate - Better error messages for debugging **Our production setup:** Direct API for user-facing features, Bedrock for batch jobs that need VPC.

Why does Claude work fine locally but fail in production?

**Network differences:** - Production often has stricter egress rules - Load balancers can timeout before Claude responds - TLS certificate validation might fail in containers **Resource differences:** - Memory limits can kill processes during large responses - CPU throttling affects JSON parsing of responses - Disk space issues if you're logging full responses **Environment differences:** - Different API keys (sandbox vs production) - Different rate limits based on billing tier - Different model versions if you don't pin exact model ID **Debug with this health check:** ```python def claude_health_check(): try: response = client.messages.create( model="claude-3-5-haiku-20241022", messages=[{"role": "user", "content": "respond with 'ok'"}], max_tokens=10, timeout=5.0 ) return response.content[0].text == "ok" except Exception as e: log.error(f"Claude health check failed: {e}") return False ```

Error: "acceleration limits" - what the hell does that mean?

This happens when your usage spikes suddenly. Anthropic has undocumented acceleration limits that kick in if you go from 10 requests/hour to 100 requests/hour too quickly. **Typical causes:** - Deploying new features that increase API usage - Traffic spikes during marketing campaigns - Batch jobs running during peak hours **How to avoid:** - Ramp up usage gradually over days, not minutes - Spread batch processing across multiple hours - Monitor your request rate and smooth out spikes **When it hits:** Back off for 15-30 minutes, then resume at lower rate.

Currently viewing the AI version

Switch to human version

Claude 3.5 Haiku Production Troubleshooting - AI-Optimized Reference

Critical Emergency Fixes

Model Routing Bug - Immediate Impact

Issue: Claude Code automatically routes to Sonnet when prompt contains "think", "thinking", "thoughts"
Error: 400 "claude-3-5-haiku-20241022 does not support thinking"
Severity: Breaks user-facing features instantly
Fix Time: 30 seconds
Solution: Replace trigger words with "consider", "believe", "assume", "figure"

TRIGGER_REPLACEMENTS = {
    "think": "consider",
    "thinking": "considering",
    "thoughts": "ideas",
    "I think": "I believe"
}

Rate Limiting - Token Bucket Algorithm

Limit: 50 requests/minute (Tier 1)
Reality: Token bucket, not simple counter - burst requests exhaust capacity
Critical: 50 requests in first 10 seconds = rate limited for remaining 50 seconds
Production Fix: Request pacing with 45 RPM headroom

class RequestPacer:
    def __init__(self, requests_per_minute=45):  # Leave headroom
        self.rpm = requests_per_minute
        self.last_requests = []

    async def wait_if_needed(self):
        # Remove requests older than 1 minute
        self.last_requests = [r for r in self.last_requests
                            if now - r < timedelta(minutes=1)]
        if len(self.last_requests) >= self.rpm:
            wait_time = 60 - (now - self.last_requests[0]).total_seconds()
            await asyncio.sleep(max(0, wait_time))

Outage Patterns

Frequency: 2-3 outages per month
Duration: 15 minutes to 4 hours
Detection: DownDetector often faster than official status page
Fallback: GPT-4o Mini (1/6th cost, lower quality)

Production Performance Reality

Latency Benchmarks vs Reality

Metric	Benchmark	Production (AWS us-east-1)
Best case	0.52s	0.7-0.9s
Typical	N/A	1-2s
P99	N/A	3+ seconds

UX Impact: Budget 2 seconds for user-facing features, show loading states immediately

Cost Analysis

Price: $4 per million output tokens (5x GPT-4o Mini)
Break-even: When developer time costs >$200/hour
Reality: 30-60% savings from caching (not 90% claimed)
Budget killer: $100 monthly limit reached in first week without controls

Quality Degradation Detection

Historical Incidents

August 26 - September 5, 2024: Two confirmed quality bugs

Code suggestions significantly worse
Tool use failure rate increased
Responses became generic

Detection Strategy:

def log_response_quality(prompt, response):
    metrics = {
        "response_length": len(response),
        "has_code_block": "```" in response,
        "has_specific_examples": count_specific_examples(response),
        "follows_instructions": rate_instruction_following(prompt, response)
    }
    log_metrics("claude_quality", metrics)

Tool Use Failures - High Frequency Issue

Common Failures

Malformed JSON: Trailing commas, extra brackets
Type errors: String "123" instead of integer 123
Hallucinated functions: Calls non-existent function names

Failure Rate: Significantly higher than Sonnet
Mitigation:

def validate_tool_response(response, expected_schema):
    try:
        parsed = json.loads(response)
        validate(instance=parsed, schema=expected_schema)
        return parsed
    except (json.JSONDecodeError, ValidationError):
        return None  # Retry with explicit instructions

Network and Infrastructure Issues

Undocumented Error Types

Error	Real Meaning	Fix
Empty 200 response	Partial outage	Check response length
Malformed JSON errors	Infrastructure issue	Wrap JSON parsing
"Acceleration limits"	Usage spike detection	Back off 15-30 minutes
SSL intermittent	~1/10,000 requests	Retry with fresh connection

OpenRouter vs Direct API

OpenRouter Issues:

Random "Provider returned error" with no details
Haiku-specific compatibility problems
Use direct Anthropic API for debugging

Bedrock Issues:

+200ms latency overhead
AWS error wrapping around Anthropic errors
VPC configuration timeouts

Prompt Caching - Cost Optimization

Requirements for 90% Savings

Identical system prompts across requests
No dynamic content in system prompts (timestamps, request IDs)
Static role instructions and examples

Cache Breakers

Dynamic timestamps: "Current time: 2024-09-16 15:30:21"
Request IDs in system prompts
User-specific information in system context

Realistic Savings

Most applications: 30-60%
Repetitive tasks: Closer to 90%
Mixed usage: Varies significantly

Error Handling - Production Patterns

Retry Logic with Exponential Backoff

def retry_with_backoff(api_call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return api_call()
        except Exception as e:
            if "429" in str(e) and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            else:
                raise e

Health Check Implementation

def claude_health_check():
    try:
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            messages=[{"role": "user", "content": "respond with 'ok'"}],
            max_tokens=10,
            timeout=5.0
        )
        return response.content[0].text == "ok"
    except Exception as e:
        return False

Monitoring Metrics - Production Essential

Critical Metrics

Token usage per hour - Prevent billing surprises
Response latency P99 - Catch performance degradation
Error rate by type - Distinguish code vs API issues
Quality scores - Detect model degradation
Fallback activation rate - Monitor backup usage

Cost Control Implementation

MAX_TOKENS_PER_REQUEST = 1000
DAILY_TOKEN_BUDGET = 50000

def check_budget_before_api_call():
    today_usage = get_daily_token_count()
    if today_usage > DAILY_TOKEN_BUDGET:
        raise Exception("Daily budget exceeded")

Local vs Production Differences

Common Failure Points

Network: Stricter egress rules, load balancer timeouts
Resources: Memory limits during large responses, CPU throttling
Environment: Different API keys, rate limits, model versions

Environment-Specific Issues

Development: Works with sandbox keys
Production: Different billing tier = different rate limits
Containers: TLS certificate validation failures

Alternative Models - Fallback Strategy

Cost-Performance Comparison

Model	Cost Ratio	Use Case	Reliability
Claude 3.5 Haiku	5x	User-facing, speed critical	Moderate
GPT-4o Mini	1x	Batch processing, cost-sensitive	High
Gemini	2x	Backup option	Low
Cohere	1.5x	Stable alternative	Moderate

Implementation Strategy

User-facing: Haiku for speed
Batch jobs: GPT-4o Mini for cost
Emergency fallback: Pre-configured alternatives

Resource Requirements

Time Investment

Initial setup: 2-4 hours for proper error handling
Rate limit implementation: 1-2 hours
Monitoring setup: 4-8 hours
Quality tracking: 2-3 hours

Expertise Requirements

JSON schema validation for tool use
Exponential backoff implementation
Token bucket algorithm understanding
Multi-model fallback architecture

Infrastructure Costs

API costs: $4/1M output tokens
Monitoring: DataDog/equivalent for metrics
Backup models: Additional API accounts
Development time: Quality tracking implementation

Breaking Points and Failure Modes

Hard Limits

Rate limits: 50 RPM (Tier 1), token bucket refill
Response size: Large responses can cause memory issues
Acceleration limits: Sudden usage spikes trigger undocumented limits
Quality degradation: No advance warning, requires monitoring

Deterministic Behavior

Temperature: Default 0.1, not 0 - set explicitly for deterministic responses
Slight variations: Even at temperature=0, formatting and word choice vary
Model versioning: Pin exact model ID to avoid surprise updates

Decision Criteria

When Claude 3.5 Haiku Makes Sense

Developer time >$200/hour
Sub-2-second response time required
Tool use reliability critical
Quality worth 5x cost premium

When to Use Alternatives

Batch processing (use GPT-4o Mini)
Cost-sensitive applications
High-volume, low-complexity tasks
During Anthropic outages

Migration Considerations

Prompt compatibility: Most prompts work across models
Tool use schemas: May need adjustment for different models
Error handling: Different error formats require separate handling
Performance characteristics: Latency and quality trade-offs

Useful Links for Further Investigation

Resources That Actually Help When Things Break

Link	Description
Anthropic Status Page	Check this first. Don't waste time debugging if it's their fault. Historical incident data shows outages happen 2-3 times per month for 15 minutes to 4 hours.
Claude Console Dashboard	See your current rate limits and spending. Check the Dashboard and Settings sections for usage analytics and rate limit monitoring.
Rate Limits Documentation	Essential for understanding why you're getting 429s. The token bucket algorithm explanation is buried but crucial.
Claude API Error Codes	Official error reference. Bookmark this because error messages are often cryptic.
Postman Collection for Claude API	Test requests outside your code to isolate whether it's your implementation or the API.
GitHub: Claude Code Issues	Search for the thinking bug and other CLI-specific issues. Community often finds workarounds before official fixes.
Anthropic SDKs with Retry Logic	Use the official SDKs - they handle retries better than custom implementations. The Python SDK's retry logic is solid.
OpenAI SDK Compatibility	Drop-in replacement if you need to quickly switch between providers during outages.
AWS Bedrock Claude Documentation	If you're using Bedrock, the error formats are different. This helps translate AWS errors to Anthropic errors.
Anthropic Help Center	Official support documentation with troubleshooting guides. More reliable than community forums for emergency issues.
GitHub: Cline Issues	Popular VSCode extension that hits similar API issues. Good source of real-world problems and solutions.
Anthropic Discord	Fastest community response for urgent issues. Staff sometimes respond here faster than support tickets.
Token Cost Calculator	Calculate actual costs before committing budget. Output tokens at $4/1M add up faster than you think.
Prompt Caching Guide	The only reliable way to reduce costs significantly. Must have identical system prompts to work.
Message Batches API	50% cost reduction for non-urgent requests. 24-hour delay but worth it for batch processing.
OpenAI API Documentation	Keep GPT-4o Mini as fallback. Different API format but similar capabilities at 1/6th the cost.
Google Gemini API	Alternative fallback, though their documentation is painful. Cheaper but less reliable.
Cohere API	Another backup option. Less capable but stable and well-documented.
Anthropic Cookbook - Tool Use Examples	Practical examples including error handling and API best practices.
Anthropic Python SDK - Error Handling	Official SDK documentation covering all error types and retry patterns.
Anthropic API Fundamentals Course	Comprehensive course covering rate limits, error handling, and production patterns.
Anthropic Support	Official support for when community resources aren't enough. Response time varies from hours to days.
Anthropic Status Page Incidents	Historical incident data to see patterns and typical resolution times.
Downdetector for Claude	Community-reported outages. Sometimes faster than official status page for detecting issues.

Claude 3.5 Haiku Production Troubleshooting - AI-Optimized Reference

Critical Emergency Fixes

Model Routing Bug - Immediate Impact

Rate Limiting - Token Bucket Algorithm

Outage Patterns

Production Performance Reality

Latency Benchmarks vs Reality

Cost Analysis

Quality Degradation Detection

Historical Incidents

Tool Use Failures - High Frequency Issue

Common Failures

Network and Infrastructure Issues

Undocumented Error Types

OpenRouter vs Direct API

Prompt Caching - Cost Optimization

Requirements for 90% Savings

Cache Breakers

Realistic Savings

Error Handling - Production Patterns

Retry Logic with Exponential Backoff

Health Check Implementation

Monitoring Metrics - Production Essential

Critical Metrics

Cost Control Implementation

Local vs Production Differences

Common Failure Points

Environment-Specific Issues

Alternative Models - Fallback Strategy

Cost-Performance Comparison

Implementation Strategy

Resource Requirements

Time Investment

Expertise Requirements

Infrastructure Costs

Breaking Points and Failure Modes

Hard Limits

Deterministic Behavior

Decision Criteria

When Claude 3.5 Haiku Makes Sense

When to Use Alternatives

Migration Considerations

Useful Links for Further Investigation

Resources That Actually Help When Things Break

Related Tools & Recommendations

jQuery - The Library That Won't Die

Claude 3.5 Sonnet Migration Guide

Claude 3.5 Sonnet - The Model Everyone Actually Used

Stop Stripe from Destroying Your Serverless Performance

Drizzle ORM - The TypeScript ORM That Doesn't Suck

Fix TaxAct When It Breaks at the Worst Possible Time

Slither - Catches the Bugs That Drain Protocols

OP Stack Deployment Guide - So You Want to Run a Rollup

Firebase Started Eating Our Money, So We Switched to Supabase

Twistlock - Container Security That Actually Works (Most of the Time)

CDC Implementation Without The Bullshit

React Error Boundaries Are Lying to You in Production

Git Disaster Recovery - When Everything Goes Wrong

Swift Assist - The AI Tool Apple Promised But Never Delivered

Fix MySQL Error 1045 Access Denied - Real Solutions That Actually Work

How to Stop Your API from Getting Absolutely Destroyed by Script Kiddies

Change Data Capture - Stream Database Changes So Your Data Isn't 6 Hours Behind

OpenAI Browser Developer Integration Guide

Binance API Production Security Hardening - Don't Get Rekt

Stripe Terminal iOS Integration: The Only Way That Actually Works