
Emergency Room - Fix It Now Questions

Q

Claude returns "'claude-3-5-haiku-20241022' does not support thinking" for normal requests

A

Hit this bug just last week. If your prompt contains the word "think" anywhere, Claude Code switches on extended thinking mode, which Haiku doesn't support, so the request fails. Even shit like "I think this would be cool" triggers it.

Quick fix: Replace "think" with "consider", "believe", "assume" - anything else. Or switch to direct API calls instead of Claude Code if you need that specific wording.

Real example that breaks:

"I think it would be cool to tell me a joke"
→ Error: 400 "claude-3-5-haiku-20241022 does not support thinking"

Works instead:

"I believe it would be cool to tell me a joke"
→ Normal response
Q

Getting HTTP 429 "Usage limit exceeded" but I barely used anything

A

Rate limits are fucked. Tier 1 gives you 50 requests per minute, but it's enforced continuously - roughly one request per second - not as "50 in the first second, then wait 59 seconds."

Immediate fix: Add exponential backoff retry logic. Here's what actually works:

import time
import random

def retry_with_backoff(api_call, max_retries=5):
    for attempt in range(max_retries):
        try:
            return api_call()
        except Exception as e:
            # Retry only on rate limits, and never on the final attempt
            if "429" in str(e) and attempt < max_retries - 1:
                # Exponential backoff plus jitter so parallel workers don't retry in lockstep
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait_time)
            else:
                raise
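
Usage is just wrapping the SDK call in a lambda - client here is an anthropic.Anthropic() instance you've already set up, and the prompt is a placeholder:

result = retry_with_backoff(lambda: client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=100,
    messages=[{"role": "user", "content": "hello"}],
))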
Q

API suddenly returns "Provider returned error" through OpenRouter

A

This one's a nightmare. OpenRouter works fine with other models, but Claude 3.5 Haiku randomly starts failing with zero useful error message.

Nuclear option that works: Switch to direct Anthropic API temporarily. OpenRouter has some weird interaction with Haiku specifically that breaks intermittently.

Debugging steps:

  1. Test the exact same prompt with direct Anthropic API (runnable sketch below)
  2. If it works there, it's OpenRouter's fault
  3. Try different model on OpenRouter (like claude-3-5-sonnet) to confirm
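
Step 1 as a minimal sketch, assuming the anthropic SDK is installed and ANTHROPIC_API_KEY is in your environment:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=100,
    messages=[{"role": "user", "content": "paste the exact prompt that fails on OpenRouter"}],
)
print(msg.content[0].text)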
Q

Anthropic's API is completely down (like September 10th)

A

When everything breaks at once, you need backup plans ready.

Immediate workaround: Switch to GPT-4o Mini as fallback (sketch after the checklist below). Yeah, it's dumber, but it's better than your app being completely broken.

What to check first:

  1. https://status.anthropic.com/ - bookmark this
  2. Your error logs for specific 5xx codes
  3. If it's just rate limits vs actual outage
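
Here's a hedged sketch of that fallback switch - only fail over on 5xx-style errors so you don't mask your own bugs (both SDKs and API keys assumed configured):

import anthropic
import openai

def ask_with_fallback(prompt):
    try:
        resp = anthropic.Anthropic().messages.create(
            model="claude-3-5-haiku-20241022",
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.content[0].text
    except anthropic.APIStatusError as e:
        if e.status_code < 500:
            raise  # 4xx means our bug, not an outage - don't mask it
    except anthropic.APIConnectionError:
        pass  # network-level failure - also fall through to the fallback
    return openai.OpenAI().chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content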
Q

Response quality suddenly turned to garbage

A

Happened August 26 - September 5, 2025. Multiple bugs in production degraded Haiku quality without warning.

This is real: Anthropic confirmed two separate bugs that degraded Claude Haiku 3.5 responses. It wasn't your imagination.

What to do:

  • Check the status page for quality incidents
  • Switch models temporarily if you notice degradation
  • Keep historical examples to compare against
Q

Hit my $100 monthly limit in the first week

A

At $4 per million output tokens, this happens faster than you think. Especially if you're not caching repetitive prompts.

Cost control that works:

# Set hard limits in your code
MAX_TOKENS_PER_REQUEST = 1000
DAILY_TOKEN_BUDGET = 50000

# Track usage in Redis or whatever
def check_budget_before_api_call():
    today_usage = get_daily_token_count()  # see the Redis sketch below
    if today_usage > DAILY_TOKEN_BUDGET:
        raise Exception("Daily budget exceeded")
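
That get_daily_token_count() helper is left undefined above - here's one hedged way to back it with Redis (redis-py assumed installed; the key naming is our convention, not anything standard):

import datetime
import redis

r = redis.Redis()

def _key():
    return f"claude_tokens:{datetime.date.today().isoformat()}"

def record_usage(tokens_used):
    r.incrby(_key(), tokens_used)       # bump today's counter after each API call
    r.expire(_key(), 60 * 60 * 48)      # keep two days of history

def get_daily_token_count():
    return int(r.get(_key()) or 0)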

Production Reality Check - What Actually Breaks

Been running Claude 3.5 Haiku in production since it launched in late 2024. Here's what breaks, when it breaks, and what we learned the hard way.

Rate Limits: The Silent Killer

How Token Bucket Works: Think of a bucket that fills with tokens at a steady rate up to a fixed capacity. Each request drains one token. Burst hard enough to empty the bucket and you wait until tokens accumulate again.

The official docs say you get 50 requests per minute on Tier 1. What they don't tell you is how this actually works. Hit 10 requests in the first 10 seconds? You're probably fine. Hit 50 requests in the first 10 seconds? You're rate limited for the next 50 seconds.

The reality: It's a token bucket algorithm, not a simple counter. Your capacity refills continuously, but burst requests will exhaust it fast.

Production lesson: We implemented a request queue that spreads requests evenly across each minute. Went from constant 429s to zero rate limit issues:

import asyncio
from datetime import datetime, timedelta

class RequestPacer:
    def __init__(self, requests_per_minute=45):  # Leave some headroom - learned this the hard way
        self.rpm = requests_per_minute
        self.last_requests = []

    async def wait_if_needed(self):
        now = datetime.now()
        # Drop timestamps older than 1 minute (sliding window)
        # TODO: this is probably inefficient but it works
        self.last_requests = [r for r in self.last_requests
                              if now - r < timedelta(minutes=1)]

        if len(self.last_requests) >= self.rpm:
            # Sleep until the oldest request ages out of the window
            wait_time = 60 - (now - self.last_requests[0]).total_seconds()
            await asyncio.sleep(max(0, wait_time))
            now = datetime.now()  # refresh the timestamp after sleeping

        self.last_requests.append(now)
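
One shared pacer instance is the point - per-task pacers defeat the purpose. A hedged usage sketch, where async_client is an anthropic.AsyncAnthropic() instance you've already configured:

pacer = RequestPacer()

async def call_claude(prompt):
    await pacer.wait_if_needed()
    return await async_client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=500,
        messages=[{"role": "user", "content": prompt}],
    )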

The Thinking Bug: Most Annoying Edge Case

Found this the hard way. If you use Claude Code (the CLI tool), certain words switch on extended thinking, which then fails because you configured Haiku and Haiku doesn't support thinking.

Words that break everything: "think", "thinking", "thoughts"
Words that work fine: "consider", "believe", "assume", "figure"

This is Claude Code specific, not the API. But if you're building dev tools that integrate with Claude Code, you'll hit this.

Our workaround: Text preprocessing that replaces trigger words before sending to Claude Code:

import re

TRIGGER_REPLACEMENTS = {
    "I think": "I believe",
    "thinking": "considering",
    "thoughts": "ideas",
    "think": "consider",
}

def sanitize_for_claude_code(text):
    # Longest triggers first, with word boundaries, so "think" doesn't
    # clobber "thinking" before its own rule gets a chance to run
    for trigger, replacement in TRIGGER_REPLACEMENTS.items():
        text = re.sub(rf"\b{re.escape(trigger)}\b", replacement, text)
    return text

Quality Degradation: The Silent Production Killer

Between August 26 and September 5, 2025, Anthropic confirmed two separate bugs that degraded Haiku output quality. Users reported it for weeks before official acknowledgment. The status page history shows incidents hit 2-3 times per month, lasting anywhere from 15 minutes to 4 hours.

What degraded:

  • Code suggestions became significantly worse
  • Tool use started failing more often
  • Responses became more generic and less helpful

How we caught it: Started logging response quality metrics in August:

def log_response_quality(prompt, response):
    # This is ugly but it caught the August quality issue
    metrics = {
        "response_length": len(response),
        "has_code_block": "```" in response,
        "has_specific_examples": count_specific_examples(response),  # TODO: make this smarter
        "follows_instructions": rate_instruction_following(prompt, response)  # your own scoring heuristic
    }
    # Log to your monitoring system (we use DataDog)
    log_metrics("claude_quality", metrics)

When we graphed this data, we saw a clear drop starting around late August. Having historical data made the difference between "feels wrong" and "definitely broken".

Network Issues: The 1-Second Lie

The benchmarks show 0.52 seconds average response time. In production from AWS us-east-1, we see significantly different results. Track P99 latency, not averages:

  • Best case: around 700-900ms (including TLS handshake)
  • Typical: somewhere between 1-2 seconds
  • Bad days: 3+ seconds before we give up

Critical for UX: Don't promise sub-second responses to users. Budget 2 seconds for user-facing features and show loading states immediately.
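
A hedged way to enforce that budget with the Python SDK - the timeout and max_retries constructor arguments are real, but the 2-second value is our product call, not a recommendation from Anthropic:

import anthropic

# Fail fast so the UI can show an error state instead of hanging;
# the SDK retries internally, so keep max_retries low when the budget is tight
client = anthropic.Anthropic(timeout=2.0, max_retries=1)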

Our monitoring shows latency spikes correlate with:

  • Anthropic deploying new models (usually 15-30 minute windows)
  • High usage periods (US business hours)
  • General AWS networking issues in us-east-1

Error Handling: Beyond the Obvious


Most guides tell you to catch 429s and 500s. That covers the basics, but we've seen these production edge cases that aren't documented anywhere:

Empty responses with 200 status: Happens during Anthropic outages. Response body is literally "" but HTTP status is 200. Always check response length.

Malformed JSON in error responses: During some outages, error responses aren't valid JSON. Wrap your JSON parsing:

import json

try:
    error_data = json.loads(response.text)
except json.JSONDecodeError:
    # Anthropic is having a bad day - fall back to the raw body
    error_data = {"error": {"message": response.text}}

Intermittent SSL errors: Happens maybe once per 10,000 requests. Retry once with a fresh connection usually fixes it.
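
A hedged sketch of that retry - rebuilding the client discards the pooled connection and forces a fresh TLS handshake (make_request is a hypothetical callable that takes a client):

import anthropic

def call_with_fresh_connection_retry(make_request):
    try:
        return make_request(anthropic.Anthropic())
    except anthropic.APIConnectionError:
        # New client = new connection pool = fresh TLS handshake
        return make_request(anthropic.Anthropic())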

Monitoring That Actually Helps


We track these metrics in production:

  1. Token usage per hour - catch runaway usage before billing surprises
  2. Response latency P99 - catch degraded performance early
  3. Error rate by type - distinguish between our bugs and Anthropic's
  4. Quality scores - catch degradation like the August incident
  5. Fallback activation rate - how often we switch to backup models

The metric that saved us: Response similarity to expected outputs. When this drops significantly, something's wrong with the model, not our code. DownDetector is often faster than the official status page for detecting issues.
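
A minimal sketch of that similarity canary, assuming a fixed probe prompt with a stored golden answer - difflib is stdlib, but the 0.6 threshold and the alert() hook are our assumptions:

import difflib

GOLDEN_ANSWER = "..."  # stored response to the probe prompt from a known-good day

def similarity_to_golden(response):
    return difflib.SequenceMatcher(None, GOLDEN_ANSWER, response).ratio()

def check_quality_canary(latest_response):
    if similarity_to_golden(latest_response) < 0.6:
        alert("Claude responses drifted from the golden answer")  # hypothetical alerting hook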

Advanced Issues - The Shit That Keeps You Up at Night

Q

Why does Haiku randomly fail tool use but work fine for text?

A

Tool use in Haiku is brittle as hell. It works most of the time but fails way more often than Sonnet. The failures are usually:

Malformed JSON in function calls: Haiku occasionally adds trailing commas or extra brackets
Wrong parameter types: Sends string "123" instead of integer 123
Hallucinated function names: Calls deleteAllUsers() when you defined deleteUser()

Bulletproof fix:

import json
from jsonschema import validate, ValidationError

def validate_tool_response(response, expected_schema):
    try:
        parsed = json.loads(response)
        validate(instance=parsed, schema=expected_schema)
        return parsed
    except (json.JSONDecodeError, ValidationError):
        # Log the failure and retry with more explicit instructions
        return None
Q

Prompt caching saves 90% but I'm not seeing those savings

A

The 90% savings claim requires identical system prompts across requests. If you're adding timestamps, request IDs, or dynamic content to your system prompt, caching is useless.

What breaks caching:

  • Dynamic timestamps: "Current time: 2024-09-16 15:30:21"
  • Request IDs: "Request ID: abc123"
  • User-specific info in system prompts

What enables caching:

  • Static system prompts with role instructions
  • Fixed examples or templates
  • Consistent code context

Real savings we see: Around 30-60% for most applications, closer to 90% only for very repetitive tasks.
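
What this looks like in the API, as a hedged sketch - the cache_control block marks the static prefix as cacheable, and it only pays off if that prefix stays byte-identical across requests:

import anthropic

client = anthropic.Anthropic()

STATIC_SYSTEM_PROMPT = "You are a code reviewer. <role instructions, rules, fixed examples>"

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1000,
    system=[{
        "type": "text",
        "text": STATIC_SYSTEM_PROMPT,  # no timestamps, no request IDs
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Review this function..."}],
)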

Q

Getting different responses for identical prompts

A

The API's default temperature is 1, not 0. For deterministic responses, you need:

response = client.messages.create(
    model="claude-3-5-haiku-20241022",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # Actually deterministic
    max_tokens=1000
)

Even at temperature=0, you might see slight variations in:

  • Code formatting (spaces vs tabs)
  • Word choice in explanations
  • Order of listed items
Q

Haiku costs 5x more than GPT-4o Mini but response quality is inconsistent

A

Yeah, it's expensive.

But the math works if:

  1. Developer time costs more than $200/hour - the better suggestions save money
  2. Response speed matters - Haiku is consistently faster
  3. Tool use reliability matters - Haiku breaks less than GPT-4o Mini

Cost optimization that works:

  • Use Haiku for user-facing features where speed matters
  • Use GPT-4o Mini for batch processing and internal tools
  • Cache aggressively with identical system prompts
  • Set hard token limits in code, not just billing alerts
Q

AWS Bedrock vs direct API - what breaks differently?

A

Bedrock issues:

  • Higher latency (add ~200ms)
  • Different error format (AWS errors wrapped around Anthropic errors)
  • VPC configuration can cause mysterious timeouts
  • Billing is delayed and confusing

Direct API issues:

  • More frequent outages during model deployments
  • Rate limit headers are more accurate
  • Better error messages for debugging

Our production setup: Direct API for user-facing features, Bedrock for batch jobs that need VPC.
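
For reference, the same call through Bedrock, sketched under the assumption that your boto3 credentials and region are already set - note the different model ID format and the AWS envelope around the request and response:

import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

resp = bedrock.invoke_model(
    modelId="anthropic.claude-3-5-haiku-20241022-v1:0",  # Bedrock's ID, not Anthropic's
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "messages": [{"role": "user", "content": "hello"}],
    }),
)
print(json.loads(resp["body"].read())["content"][0]["text"])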

Q

Why does Claude work fine locally but fail in production?

A

Network differences:

  • Production often has stricter egress rules
  • Load balancers can timeout before Claude responds
  • TLS certificate validation might fail in containers

Resource differences:

  • Memory limits can kill processes during large responses
  • CPU throttling affects JSON parsing of responses
  • Disk space issues if you're logging full responses

Environment differences:

  • Different API keys (sandbox vs production)
  • Different rate limits based on billing tier
  • Different model versions if you don't pin exact model ID

Debug with this health check:

def claude_health_check():
    try:
        # client: your configured anthropic.Anthropic() instance
        response = client.messages.create(
            model="claude-3-5-haiku-20241022",
            messages=[{"role": "user", "content": "respond with 'ok'"}],
            max_tokens=10,
            timeout=5.0
        )
        # Loose match - the model sometimes adds punctuation or casing
        return "ok" in response.content[0].text.lower()
    except Exception as e:
        log.error(f"Claude health check failed: {e}")
        return False
Q

Error: "acceleration limits" - what the hell does that mean?

A

This happens when your usage spikes suddenly. Anthropic has undocumented acceleration limits that kick in if you go from 10 requests/hour to 100 requests/hour too quickly.

Typical causes:

  • Deploying new features that increase API usage
  • Traffic spikes during marketing campaigns
  • Batch jobs running during peak hours

How to avoid:

  • Ramp up usage gradually over days, not minutes
  • Spread batch processing across multiple hours
  • Monitor your request rate and smooth out spikes

When it hits: Back off for 15-30 minutes, then resume at lower rate.
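
A hedged sketch of the ramp-up idea - cap a batch at a fixed hourly rate instead of blasting it all at once (process_job is a hypothetical stand-in for your actual Claude API call):

import time

def run_batch_smoothly(jobs, per_hour=100):
    delay = 3600 / per_hour  # seconds between requests
    for job in jobs:
        process_job(job)     # hypothetical: one Claude API call per job
        time.sleep(delay)    # boring, steady request rate - no spikes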

Error Types and What Actually Works

| Error Type | What You See | What It Actually Means | Fix That Works | Time to Fix |
|---|---|---|---|---|
| HTTP 429 | "Usage limit exceeded" | Rate limited (requests per minute) | Add exponential backoff retry | 5 minutes |
| HTTP 429 | "Quota exceeded" | Monthly spending limit hit | Add more credits or wait for next month | Immediate/$$ |
| HTTP 400 | "does not support thinking" | Claude Code routing bug | Replace "think" with "consider" | 30 seconds |
| HTTP 500 | "Internal server error" | Anthropic having a bad day | Check status page, use retry logic | 15 mins - 2 hours |
| HTTP 502/503 | "Bad gateway" / "Service unavailable" | Anthropic outage | Switch to fallback model | 30 mins - 4 hours |
| HTTP 401 | "Authentication failed" | Wrong API key or expired | Check console for valid key | 2 minutes |
| Empty Response | 200 status but "" body | Anthropic partial outage | Check response length before parsing | Ongoing |
| Malformed JSON | Parse error on response | Anthropic infrastructure issue | Wrap JSON parsing in try/catch | Ongoing |
| SSL Error | "Certificate verify failed" | Network/TLS issue | Retry once with fresh connection | 1 retry |
| Timeout | No response after N seconds | Network or Claude overloaded | Increase timeout, add retry | Immediate |
| Tool Use JSON | Malformed function calls | Haiku being Haiku | Validate JSON schema before using | Per request |
| Quality Drop | Responses are generic | Model degradation bug | Switch models or wait for fix | Days to weeks |
