Been running Claude 3.5 Haiku in production since it shipped in late 2024. Here's what breaks, when it breaks, and what we learned the hard way.
Rate Limits: The Silent Killer
The official docs say you get 50 requests per minute on Tier 1. What they don't tell you is how this actually works. Hit 10 requests in the first 10 seconds? You're probably fine. Hit 50 requests in the first 10 seconds? You're rate limited for the next 50 seconds.
The reality: it's a token bucket algorithm, not a simple counter. Picture a bucket that fills with tokens at a steady rate; each request drains one token. If a burst empties the bucket, you wait until more tokens accumulate. Your capacity refills continuously, but burst requests will exhaust it fast.
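To make the refill behavior concrete, here's a minimal token-bucket sketch. The capacity and refill rate are illustrative, not Anthropic's actual limiter parameters:
import time

class TokenBucket:
    def __init__(self, capacity=50, refill_per_second=50 / 60):
        # Illustrative numbers - a Tier 1-style 50 requests/minute refill
        self.capacity = capacity
        self.tokens = capacity
        self.refill_per_second = refill_per_second
        self.last_refill = time.monotonic()

    def try_acquire(self):
        now = time.monotonic()
        # Tokens accumulate continuously, capped at the bucket size
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # Bucket is empty - this is when you see a 429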
Production lesson: We implemented a request queue that spreads requests evenly across each minute. Went from constant 429s to zero rate limit issues:
import asyncio
from datetime import datetime, timedelta

class RequestPacer:
    def __init__(self, requests_per_minute=45):  # Leave some headroom - learned this the hard way
        self.rpm = requests_per_minute
        self.last_requests = []

    async def wait_if_needed(self):
        now = datetime.now()
        # Remove requests older than 1 minute
        # TODO: this is probably inefficient but it works
        self.last_requests = [r for r in self.last_requests
                              if now - r < timedelta(minutes=1)]
        if len(self.last_requests) >= self.rpm:
            # Wait until the oldest request in the window is a full minute old
            wait_time = 60 - (now - self.last_requests[0]).total_seconds()
            await asyncio.sleep(max(0, wait_time))
        # Record the actual send time (after any sleep), not the pre-sleep timestamp
        self.last_requests.append(datetime.now())
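Usage is a one-liner before each call. The wrapper below is a sketch; call_haiku, client, and the model alias are placeholders for however you invoke the Anthropic SDK in your own code:
pacer = RequestPacer(requests_per_minute=45)

async def call_haiku(client, **kwargs):
    await pacer.wait_if_needed()
    # Placeholder call - substitute your actual SDK invocation and model name
    return await client.messages.create(model="claude-3-5-haiku-latest", **kwargs)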
The Thinking Bug: The Most Annoying Edge Case
We found this one by accident. If you use Claude Code (the CLI tool), certain words trigger automatic routing to Sonnet, which then fails because you've configured Haiku.
Words that break everything: "think", "thinking", "thoughts"
Words that work fine: "consider", "believe", "assume", "figure"
This is Claude Code specific, not the API. But if you're building dev tools that integrate with Claude Code, you'll hit this.
Our workaround: Text preprocessing that replaces trigger words before sending to Claude Code:
import re

# Longer phrases first, so "I think" isn't clobbered by the bare "think" rule
TRIGGER_REPLACEMENTS = {
    "I think": "I believe",
    "thinking": "considering",
    "thoughts": "ideas",
    "think": "consider",
}

def sanitize_for_claude_code(text):
    for trigger, replacement in TRIGGER_REPLACEMENTS.items():
        # Word boundaries keep us from mangling words that merely contain a trigger
        text = re.sub(rf"\b{re.escape(trigger)}\b", replacement, text)
    return text
Quality Degradation: The Silent Production Killer
Between August 26 and September 5, 2025, Anthropic confirmed two separate bugs that degraded Haiku output quality. Users had been reporting it for weeks before the official acknowledgment. The status page's incident history also shows that outages happen two to three times a month, lasting anywhere from 15 minutes to 4 hours.
What degraded:
- Code suggestions became significantly worse
- Tool use started failing more often
- Responses became more generic and less helpful
How we caught it: Started logging response quality metrics in August:
def log_response_quality(prompt, response):
    # This is ugly but it caught the August quality issue
    metrics = {
        "response_length": len(response),
        "has_code_block": "```" in response,
        # count_specific_examples / rate_instruction_following are our own
        # heuristics, defined elsewhere
        "has_specific_examples": count_specific_examples(response),  # TODO: make this smarter
        "follows_instructions": rate_instruction_following(prompt, response),
    }
    # Log to your monitoring system (we use DataDog)
    log_metrics("claude_quality", metrics)
When we graphed this data, we saw a clear drop starting around late August. Having historical data made the difference between "feels wrong" and "definitely broken".
Network Issues: The 1-Second Lie
The benchmarks show a 0.52-second average response time. In production from AWS us-east-1, we see significantly different numbers. Track P99 latency, not the average; here's what we actually see:
- Best case: around 700-900ms (including TLS handshake)
- Typical: somewhere between 1-2 seconds
- Bad days: 3+ seconds before we give up
Critical for UX: Don't promise sub-second responses to users. Budget 2 seconds for user-facing features and show loading states immediately.
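In practice that budget looks something like this. It's a sketch: answer_with_budget and call_haiku are illustrative names, not real API calls:
import asyncio

async def answer_with_budget(call_haiku, prompt, budget_seconds=2.0):
    # call_haiku is a placeholder for however you wrap the Anthropic client
    try:
        return await asyncio.wait_for(call_haiku(prompt), timeout=budget_seconds)
    except asyncio.TimeoutError:
        # Blew the 2-second budget: surface a loading/fallback state instead of blocking
        return None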
Our monitoring shows latency spikes correlate with:
- Anthropic deploying new models (usually 15-30 minute windows)
- High usage periods (US business hours)
- General AWS networking issues in us-east-1
Error Handling: Beyond the Obvious
Most guides tell you to catch 429s and 500s. That covers the basics, but we've hit production edge cases that aren't documented anywhere:
Empty responses with 200 status: Happens during Anthropic outages. The response body is literally "" but the HTTP status is 200. Always check response length.
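A minimal guard we run before parsing; this sketch assumes a requests-style response object and how you handle the failure is up to you:
def check_claude_response(response):
    # A 200 with an empty body means an outage, not a deliberate empty answer -
    # treat it like a 5xx: retry or fall back instead of parsing nothing
    if response.status_code == 200 and not response.text.strip():
        raise RuntimeError("Empty 200 response from Anthropic")
    return response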
Malformed JSON in error responses: During some outages, error responses aren't valid JSON. Wrap your JSON parsing:
import json

try:
    error_data = json.loads(response.text)
except json.JSONDecodeError:
    # Anthropic is having a bad day - fall back to the raw body
    error_data = {"error": {"message": response.text}}
Intermittent SSL errors: Happens maybe once per 10,000 requests. Retrying once on a fresh connection usually fixes it.
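Our retry is roughly this; a sketch assuming the requests library, so adapt it to whatever HTTP client you actually use:
import requests

def post_with_ssl_retry(url, **kwargs):
    try:
        return requests.post(url, **kwargs)
    except requests.exceptions.SSLError:
        # Rare (~1 in 10,000 for us): retry once on a brand-new connection
        with requests.Session() as fresh_session:
            return fresh_session.post(url, **kwargs)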
Monitoring That Actually Helps
We track these metrics in production:
- Token usage per hour - catch runaway usage before billing surprises
- Response latency P99 - catch degraded performance early
- Error rate by type - distinguish between our bugs and Anthropic's
- Quality scores - catch degradation like the August incident
- Fallback activation rate - how often we switch to backup models
DownDetector is often faster than the official status page at surfacing issues. But the metric that saved us: response similarity to expected outputs. When this drops significantly, something's wrong with the model, not our code.
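A rough version of that check. difflib is crude, but it was enough to flag the August drop; the golden-prompt set and call_model are our own, shown here only as an example:
import difflib

def similarity_to_expected(response, expected):
    # Returns 0.0-1.0; we alert when the rolling average dips well below baseline
    return difflib.SequenceMatcher(None, response, expected).ratio()

# Run a small, fixed set of golden prompts on a schedule and track the scores:
# golden = [("Summarize our refund policy in one line", "Refunds within 30 days ..."), ...]
# scores = [similarity_to_expected(call_model(prompt), expected) for prompt, expected in golden]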