Claude API Technical Reference - AI Optimized
Model Selection and Economics
Claude Model Performance Matrix
Model | Intelligence Level | Input Cost/1M tokens | Output Cost/1M tokens | Context Window | Max Output | Production Use Case
---|---|---|---|---|---|---
Opus 4.1 | Maximum reasoning | $15.00 | $75.00 | 200K | 32K | Complex architecture, critical debugging only |
Sonnet 4 | Production-ready | $3.00 | $15.00 | 200K | 8K | 90% of use cases, daily driver |
Haiku 3.5 | Fast & minimal | $0.80 | $4.00 | 200K | 4K | Simple responses, high-volume processing |
Cost Reality Check
- Typical chat message cost: $0.02-0.05 with Sonnet 4
- Complex debugging session: $50+ with Opus 4.1
- Budget killer: Using Opus for routine tasks
- Cost multiplier: Large context windows drive up input-token spend and slow responses; quality stays the same
Critical Production Failures
Rate Limiting Breakdown Points
- Standard tier: 1K requests/minute limit
- Enterprise tier: Up to 4K requests/minute
- Peak-hour failures: Lunchtime and Monday mornings hit limits hardest
- Failure mode: HTTP 429 with a `Retry-After: 60` header
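When you do get throttled, the cheapest fix is to honor the server's hint. A minimal sketch, assuming the Python SDK's `RateLimitError` exposes the underlying HTTP response (recent `anthropic` SDK versions do):

import time
from anthropic import Anthropic, RateLimitError

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def call_honoring_retry_after(max_attempts=5, **create_kwargs):
    for _ in range(max_attempts):
        try:
            return client.messages.create(**create_kwargs)
        except RateLimitError as e:
            # Sleep for the server-suggested interval; fall back to 60s
            time.sleep(int(e.response.headers.get("retry-after", 60)))
    raise RuntimeError("Still rate limited after retries")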
Context Window Gotchas
- The 200K-token limit is real, but responses slow noticeably with large contexts
- User behavior risk: Users paste entire documents (100MB+ logs)
- Failure point: `context_length_exceeded` errors crash services
- Mitigation: Aggressive truncation at 150K tokens required (sketch below)
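A minimal truncation sketch using the same ~4 chars/token heuristic as the pruning code later in this guide; swap in a real tokenizer if you need precision:

MAX_CONTEXT_TOKENS = 150_000   # ceiling recommended above
CHARS_PER_TOKEN = 4            # rough heuristic; real tokenizers vary

def truncate_context(text, max_tokens=MAX_CONTEXT_TOKENS):
    # Keep the tail: the most recent content usually matters most
    budget = max_tokens * CHARS_PER_TOKEN
    return text if len(text) <= budget else text[-budget:]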
Streaming Response Failures
- Infrastructure dependency: nginx `proxy_read_timeout` must exceed 30s
- Failure mode: Responses die mid-sentence with timeout errors
- User impact: AI appears to "have a stroke"
- Solution: Configure proxy timeouts to 120s+
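A sketch of that in nginx terms (the directive names are real; the route and upstream name are placeholders):

location /api/chat {
    proxy_pass          http://app_backend;  # placeholder upstream
    proxy_read_timeout  120s;                # covers slow long-form responses
    proxy_buffering     off;                 # don't buffer streamed chunks
}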
Implementation Failure Modes
Authentication Failures
# Common setup errors that waste hours:
# Missing x-api-key header → 401 with no details
# Wrong model name → Check docs for current names
# Missing anthropic-version header → Cryptic error response
# max_tokens too low → Response cuts off mid-sentence
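A minimal raw-HTTP sketch showing the headers that trip people up; the official SDK sets all of these for you:

import os
import requests

resp = requests.post(
    "https://api.anthropic.com/v1/messages",
    headers={
        "x-api-key": os.environ["ANTHROPIC_API_KEY"],  # missing → bare 401
        "anthropic-version": "2023-06-01",             # missing → cryptic error
        "content-type": "application/json",
    },
    json={
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1000,  # too low → response cuts off mid-sentence
        "messages": [{"role": "user", "content": "Hello"}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["content"][0]["text"])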
Tool Calling Schema Hell
- Error message quality: Cryptic JSON validation failures
- Common failure: Missing comma in JSON schema → 6 hours debugging
- Error format: `{"type": "invalid_request_error", "error": {"message": "function: null"}}`
- Success pattern: Start simple, gradually add complexity
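One way to fail fast instead: validate the schema locally before it ever reaches the API. A sketch assuming the third-party `jsonschema` package (`tool_schema.json` is a hypothetical file):

import json
from jsonschema import Draft202012Validator

with open("tool_schema.json") as f:
    schema = json.load(f)                  # malformed JSON (missing comma) fails here, loudly
Draft202012Validator.check_schema(schema)  # invalid JSON Schema keywords fail here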
File Processing Limitations
- Upload timeout: 500MB files fail after exactly 30 seconds
- Workaround required: Implement chunking and multipart uploads
- Format preference: PDFs > Word documents for accuracy
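The upload endpoint isn't specified here, so this is a generic chunking sketch; `upload_part` stands in for whatever multipart call your stack uses:

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB parts keep each request well under the timeout

def iter_chunks(path, chunk_size=CHUNK_SIZE):
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk

for i, part in enumerate(iter_chunks("big_export.pdf")):
    upload_part(part, part_number=i)  # hypothetical multipart upload call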
Memory and Resource Management
Context Bloat Prevention
def prevent_bankruptcy(conversation, max_tokens=100000):
    # Critical: monitor context growth.
    # Failure: 2GB+ RAM per user session → OOMKilled at 2am.
    # Solution: prune aggressively after 50 messages.
    if not conversation:
        return []
    system_msg = conversation[0] if conversation[0]["role"] == "system" else None
    recent_messages = [m for m in conversation[-8:] if m is not system_msg]  # last 8 messages (~4 exchanges)
    total_chars = sum(len(msg["content"]) for msg in recent_messages)
    estimated_tokens = total_chars // 4  # rough heuristic: ~4 chars per token
    if estimated_tokens > max_tokens:
        return ([system_msg] if system_msg else []) + conversation[-2:]
    return ([system_msg] if system_msg else []) + recent_messages
Cost Control Mechanisms
- Prompt caching: Up to 90% savings for repeated context (sketch after this list)
- Batch processing: 50% discount for non-urgent requests
- Intelligent routing: Haiku for simple → Sonnet for standard → Opus when desperate (see Decision Criteria below)
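Prompt caching hinges on marking the stable prefix of your prompt. A minimal sketch using the API's `cache_control` block; `big_reference_doc` is a placeholder for your repeated context:

big_reference_doc = open("reference.md").read()  # placeholder for the large, stable context

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    system=[{
        "type": "text",
        "text": big_reference_doc,
        "cache_control": {"type": "ephemeral"},  # cache this prefix across calls
    }],
    messages=[{"role": "user", "content": "Question about the doc"}],
)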
Security and Compliance Requirements
Enterprise Compliance Checklist
- ✅ SOC 2 Type II certified
- ✅ HIPAA compliance available
- ✅ GDPR compliance with EU data centers
- ✅ Zero data retention policy
- ✅ SSO integration (SAML/OAuth)
- ✅ Audit trails for all API calls
API Key Security Best Practices
import os
import anthropic

# NEVER hardcode keys
client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
# Rotate keys monthly
# Use cloud secret management (AWS Secrets Manager)
# Monitor for usage spikes (compromise indicator)
Error Handling Patterns
Production-Ready Retry Logic
import time
import random

from anthropic import RateLimitError, APIError

def exponential_backoff_with_jitter(attempt):
    # 1s, 2s, 4s... plus jitter so parallel workers don't retry in lockstep
    wait_time = (2 ** attempt) + random.uniform(0, 1)
    time.sleep(wait_time)

def production_claude_call(prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1000,
                messages=[{"role": "user", "content": prompt}],
            )
            return response.content[0].text
        except RateLimitError:
            if attempt == max_retries - 1:
                raise Exception("Still rate limited after retries")
            exponential_backoff_with_jitter(attempt)
        except APIError as e:
            # Usually a client error - don't retry blindly
            print(f"API error: {e}")
            raise
Performance Characteristics
Response Time Expectations
- Haiku 3.5: Sub-second for simple requests
- Sonnet 4: 1-3 seconds for standard requests
- Opus 4.1: 3-12 seconds for complex reasoning
- Large context penalty: +2-5 seconds with 150K+ token contexts
Reliability Metrics
- Uptime SLA: 99.9% (actually achieved in production)
- Rate limit recovery: Automatic with proper backoff
- Error rate: <0.1% for properly formatted requests
Infrastructure Requirements
Minimum Production Stack
- Redis: Session management and caching
- Queue system: Rate limit handling (RabbitMQ/SQS)
- Monitoring: Cost and error tracking (DataDog/New Relic)
- Load balancer: If exceeding 1K RPM
Monitoring Essentials
class BudgetExceededException(Exception):
    pass

class ProductionMonitoring:
    # $ per 1M tokens, from the matrix above; keys are illustrative — update from current pricing docs
    PRICING = {"opus-4.1": (15.00, 75.00), "sonnet-4": (3.00, 15.00), "haiku-3.5": (0.80, 4.00)}

    def __init__(self, daily_budget=500):
        self.daily_budget = daily_budget
        self.current_spend = 0.0

    def calculate_cost(self, model, input_tokens, output_tokens):
        input_rate, output_rate = self.PRICING[model]
        return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

    def track_request(self, model, input_tokens, output_tokens):
        cost = self.calculate_cost(model, input_tokens, output_tokens)
        self.current_spend += cost
        if self.current_spend > self.daily_budget:
            # Alert before bankruptcy
            raise BudgetExceededException(f"Daily limit: ${self.current_spend:.2f}")
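Typical wiring, with illustrative token counts (the `usage` block on each API response gives you the real numbers):

monitor = ProductionMonitoring(daily_budget=500)
# after each call: monitor.track_request(model_key, response.usage.input_tokens, response.usage.output_tokens)
monitor.track_request("sonnet-4", input_tokens=2_000, output_tokens=500)  # ≈ $0.0135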
Integration Patterns
Streaming Implementation
# Use for user-facing applications
def stream_reply(prompt):
    with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}],
    ) as stream:
        for text in stream.text_stream:
            # Send each chunk to your websocket/SSE connection
            yield text
Tool Calling Success Pattern
# Minimal working example
tools = [{
    "name": "get_user_data",
    "description": "Get user information",  # Keep descriptions clear
    "input_schema": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}]
# Handle the conversation loop manually
messages = [{"role": "user", "content": "Look up user 42"}]  # illustrative prompt
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    tools=tools,
    messages=messages,
)
for block in response.content:  # tool_use isn't always the first block
    if block.type == "tool_use":
        # Execute the function and continue the conversation (see below)
        result = execute_tool(block)
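To actually continue the loop, hand the output back as a `tool_result` block; a sketch assuming `block` and `result` from the loop above:

import json

follow_up = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1000,
    tools=tools,
    messages=messages + [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": [{
            "type": "tool_result",
            "tool_use_id": block.id,      # ties the result to the tool_use request
            "content": json.dumps(result),
        }]},
    ],
)
print(follow_up.content[0].text)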
Decision Criteria
When to Use Each Model
- Can it be answered with basic logic? → Haiku 3.5
- Requires reasoning but not architecture-level thinking? → Sonnet 4
- Complex system design or desperate debugging? → Opus 4.1 (routing sketch below)
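A minimal routing sketch of those three questions; the keyword heuristics and model IDs are assumptions to replace with your own classifier and the current names from the docs:

def route_model(prompt: str) -> str:
    p = prompt.lower()
    if any(k in p for k in ("architecture", "system design", "race condition")):
        return "claude-opus-4-1-20250805"   # assumed Opus 4.1 ID — verify in docs
    if len(prompt) < 200 and not any(k in p for k in ("why", "how", "debug")):
        return "claude-3-5-haiku-20241022"  # assumed Haiku 3.5 ID — verify in docs
    return "claude-sonnet-4-20250514"       # daily driver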
Framework Selection
- Start with raw API: Understand behavior first
- Add LangChain: Only after API patterns are solid
- Use official SDKs: Python/TypeScript are bulletproof
Scaling Thresholds
- <1K RPM: Standard tier sufficient
- 1K-4K RPM: Need higher tier and load balancing
- >4K RPM: Enterprise tier with custom limits
Critical Warnings
Budget Killers
- Context bloat: Users paste entire documents
- Wrong model routing: Using Opus for simple tasks
- No caching: Re-sending identical contexts
- No batch processing: Using the real-time API for bulk operations
Breaking Points
- 1K+ spans in distributed tracing: UI becomes unusable for debugging
- 2GB+ conversation history: Server OOMKilled during peak usage
- 500MB+ file uploads: Timeout failures without chunking
- Peak hour traffic: Rate limits hit harder during business hours
Hidden Costs
- Human debugging time: Tool calling schema errors take hours
- Infrastructure complexity: Streaming requires proper proxy configuration
- Migration pain: Model names change, requiring code updates
- Expertise requirement: Production deployment needs AI engineering knowledge
Success Metrics
Production Readiness Indicators
- Rate limit handling with exponential backoff implemented
- Cost monitoring with daily budget alerts configured
- Context management preventing memory bloat
- Error handling covering all API failure modes
- Monitoring dashboard showing real-time usage and costs
Performance Benchmarks
- 90% of requests complete within model-specific time limits
- <0.1% error rate excluding user input validation failures
- Context window utilization <75% to maintain response speed
- Daily API costs predictable within 20% variance
Useful Links for Further Investigation
Claude API Resources That Don't Suck
Link | Description
---|---
[Console](https://console.anthropic.com) | Test prompts here first, manage API keys, and watch costs effectively.