Multi-Agent System with MCP Architecture - AI-Optimized Technical Reference
Critical System Overview
Technology Stack: MCP (Model Context Protocol) with JSON-RPC 2.0 communication
Architecture Pattern: Coordinator-worker pattern (single point of failure)
Language: Python 3.8+ (3.9 recommended due to asyncio issues in 3.8)
Communication Protocol: JSON-RPC 2.0 over HTTP
Resource Requirements
Hardware Specifications
- Minimum RAM: 16GB (documentation claims 8GB but fails with 4 agents)
- Network: Fast internet required (500MB+ dependency downloads)
- Storage: Monitor Docker disk usage (containers consume significant space)
Time Investment
- Setup Time: 4-6 hours minimum (tutorials underestimate by 50-75%)
- Debugging Allocation: 70% of development time spent on connection issues
- Learning Curve: Expect weeks of debugging distributed systems failures
Core Dependencies and Installation Issues
Required Packages
pip install fastmcp httpx aiofiles
pip install openai anthropic pydantic jsonschema
Note: asyncio is part of the Python standard library; do not pip install the legacy asyncio package, which can shadow the built-in module.
Common Installation Failures
Issue | Solution | Frequency |
---|---|---|
fastmcp fails on Windows | Use WSL2 or abandon Windows | High |
httpx timeout errors | Network issues, retry installation | Medium |
Import errors with asyncio | Upgrade from Python 3.7 | Low |
Architecture Components and Failure Modes
MCP Component Types (Source of Confusion)
- MCP Hosts - Where LLM runs (Claude Desktop, custom app)
- MCP Clients - Translation layer between host and servers
- MCP Servers - Actual agents performing work
Critical Note: Each agent is both client AND server (dual role causes debugging complexity)
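The dual role is easiest to see in code. Below is a minimal worker-agent sketch assuming the fastmcp package installed above; the agent name, tool, and return values are placeholders, not part of any real deployment:
```python
# Minimal worker agent: runs as an MCP server exposing one tool,
# while the coordinator connects to it as an MCP client.
from fastmcp import FastMCP

mcp = FastMCP("research-agent")  # placeholder agent name

@mcp.tool()
def search_web(query: str, max_results: int = 5) -> list[str]:
    """Stub tool; a real agent would call an external search API here."""
    return [f"result {i} for {query}" for i in range(max_results)]

if __name__ == "__main__":
    mcp.run()  # serves tool calls (stdio transport by default)
```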
Coordinator Agent - Single Point of Failure
Function: Task decomposition, agent assignment, result aggregation
Memory Leaks: Python asyncio loops leak memory in long-running services
Timeout Handling: Default 60 seconds (increase for production)
Common Coordinator Failures
- Agents disappear when containers restart
- Task decomposition fails when LLM output quality degrades unpredictably
- Network timeouts kill entire workflow
- Result aggregation crashes on malformed JSON
Configuration That Works
# Realistic timeout settings
timeout_seconds: int = 60
health_check_interval: int = 60 # seconds
max_concurrent_tasks: int = 10 # start low
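To keep one hung agent call from killing the whole workflow (the network-timeout failure listed above), wrap every dispatch in a hard deadline. A minimal sketch using asyncio.wait_for; call_agent is a hypothetical coroutine standing in for your actual agent call:
```python
import asyncio

async def dispatch_with_timeout(call_agent, task, timeout_seconds: int = 60):
    """Run one agent call with a hard deadline instead of letting it hang forever."""
    try:
        return await asyncio.wait_for(call_agent(task), timeout=timeout_seconds)
    except asyncio.TimeoutError:
        # Return a structured failure so result aggregation doesn't crash later.
        return {"status": "timeout", "task": task}
```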
Worker Agent Specifications
Research Agent - Web Scraper
Capabilities: Web search, content extraction
Rate Limits: Google free tier = 100 searches/day
Memory Pattern: Minimal memory usage
Failure Threshold: Breaks after ~20 requests without rate limiting
Critical Settings (combined in the sketch after this list):
- Minimum 1 second between requests
- Cache results for 24 hours
- Use multiple search APIs with fallback
- Implement exponential backoff
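A sketch of how these settings combine in practice, assuming httpx from the dependency list; the search URL, cache layout, and retry count are illustrative choices only:
```python
import asyncio
import time
import httpx

_cache: dict[str, tuple[float, dict]] = {}   # query -> (timestamp, result)
CACHE_TTL = 24 * 3600                        # cache results for 24 hours
MIN_INTERVAL = 1.0                           # minimum 1 second between requests
_last_request = 0.0

async def search(query: str, url: str = "https://example.invalid/search") -> dict:
    global _last_request
    cached = _cache.get(query)
    if cached and time.time() - cached[0] < CACHE_TTL:
        return cached[1]                     # serve from cache, no API call

    for attempt in range(3):                 # exponential backoff: 1s, 2s, 4s
        wait = MIN_INTERVAL - (time.time() - _last_request)
        if wait > 0:
            await asyncio.sleep(wait)        # enforce spacing between requests
        _last_request = time.time()
        try:
            async with httpx.AsyncClient(timeout=10) as client:
                resp = await client.get(url, params={"q": query})
                resp.raise_for_status()
                result = resp.json()
                _cache[query] = (time.time(), result)
                return result
        except httpx.HTTPError:
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"search failed after retries: {query}")
```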
Analysis Agent - Data Processor
Memory Limit: Crashes on datasets > 100MB
Safe Row Limit: 10,000 rows maximum
Technology Constraint: Pandas not suitable for production workloads
Memory Growth: Far outpaces the raw file size (pandas typically needs several times the on-disk size in RAM)
Breaking Points (see the guard sketch after this list):
- 50MB+ datasets cause OOM crashes
- Complex nested data structures break DataFrame conversion
- Correlation analysis fails randomly on large datasets
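One way to stay under these limits is to refuse or truncate oversized input before pandas ever holds it all in memory. A minimal sketch for CSV input; MAX_ROWS mirrors the 10,000-row limit above and the chunk size is arbitrary:
```python
import pandas as pd

MAX_ROWS = 10_000    # safe row limit from above
CHUNK_SIZE = 2_000   # stream the file in pieces so one large upload can't OOM the container

def load_bounded(path: str) -> pd.DataFrame:
    """Read at most MAX_ROWS rows, streaming in chunks instead of loading the whole file."""
    frames = []
    rows = 0
    for chunk in pd.read_csv(path, chunksize=CHUNK_SIZE):
        frames.append(chunk)
        rows += len(chunk)
        if rows >= MAX_ROWS:
            break
    return pd.concat(frames, ignore_index=True).head(MAX_ROWS)
```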
Reporter Agent - Markdown Generator
Function: Data to markdown conversion
Reliability: Highest (simplest component)
Memory Usage: Minimal
Failure Mode: Crashes on complex nested data structures
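Because the one known failure mode is deeply nested input, flattening before rendering removes most of the risk. A sketch; the dotted-key naming and two-column layout are arbitrary choices:
```python
def flatten(data: dict, prefix: str = "") -> dict:
    """Collapse nested dicts into dotted keys so they render as simple table rows."""
    flat = {}
    for key, value in data.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

def to_markdown(data: dict) -> str:
    """Render a flat dict as a two-column markdown table."""
    lines = ["| Field | Value |", "|---|---|"]
    lines += [f"| {k} | {v} |" for k, v in flatten(data).items()]
    return "\n".join(lines)
```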
Error Codes and Debugging
JSON-RPC Error Codes (Memorize These)
Code | Meaning | Common Cause |
---|---|---|
-32601 | Method not found | Tool not registered or agent not loaded |
-32602 | Invalid parameters | Schema validation failure |
-32603 | Internal error | Agent crashed or network issue |
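On the client side the useful distinction is retryable versus fatal. A sketch mapping the codes above to actions; the error envelope follows the JSON-RPC 2.0 spec, the handling policy is just one reasonable option:
```python
RETRYABLE = {-32603}         # internal error: agent crash or transient network issue
FATAL = {-32601, -32602}     # method not found / invalid params: retrying won't help

def classify_error(response: dict) -> str:
    """Decide what to do with a JSON-RPC response."""
    error = response.get("error")
    if error is None:
        return "ok"
    code = error.get("code")
    if code in FATAL:
        return "fix_request"  # re-check tool registration or the parameter schema
    if code in RETRYABLE:
        return "retry"
    return "unknown"
```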
Network Debugging Commands
# Check container connectivity
docker exec -it container ping other_container
# Monitor container resource usage
docker stats
# View container logs
docker logs container_name --follow --tail 100
# Check health endpoints
curl -f http://localhost:8000/health
Production Configuration
Docker Resource Limits
services:
  analyzer:
    mem_limit: 2g  # Prevent memory bombs
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]  # health endpoint assumed from the debugging section
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
Environment Variables
LOG_LEVEL=INFO
MAX_CONCURRENT_TASKS=10 # Start low, increase carefully
REQUEST_TIMEOUT=30 # 30 seconds sufficient
HEALTH_CHECK_INTERVAL=60 # Check agents every minute
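A sketch of loading these variables with the same defaults, using plain os.environ rather than any particular settings library:
```python
import os

LOG_LEVEL = os.environ.get("LOG_LEVEL", "INFO")
MAX_CONCURRENT_TASKS = int(os.environ.get("MAX_CONCURRENT_TASKS", "10"))
REQUEST_TIMEOUT = int(os.environ.get("REQUEST_TIMEOUT", "30"))              # seconds
HEALTH_CHECK_INTERVAL = int(os.environ.get("HEALTH_CHECK_INTERVAL", "60"))  # seconds
```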
Performance Characteristics
Load Testing Results
- Breaking Point: System fails at 20+ concurrent requests
- Success Rate: <80% at failure threshold
- Response Time: Increases linearly with concurrent load
- Memory Usage: Grows with agent count and data size
Scaling Limitations
- Agent Limit: Don't exceed 10 agents (architecture doesn't scale)
- Data Processing: 10K rows maximum per analysis
- Concurrent Tasks: 10 maximum for stability
Critical Warnings and Hidden Costs
What Official Documentation Doesn't Tell You
- Default settings fail in production
- Memory leaks are inevitable with long-running Python services
- Network issues cause cascade failures
- LLM decomposition fails randomly
- Docker containers develop networking problems over time
Operational Costs
- Human Time: 70% debugging, 30% feature development
- Infrastructure: 16GB RAM minimum, fast network required
- API Costs: Rate limiting forces paid API subscriptions
- Monitoring: Essential for production (Prometheus + Grafana recommended)
Breaking Points and Failure Modes
- Container Restarts: Agents disappear from coordinator registry
- Memory Exhaustion: Analysis agent OOMs on realistic datasets
- Rate Limiting: Research agent blocked after minimal usage
- Network Partitions: Coordinator loses track of healthy agents
- Cascade Failures: One agent failure can kill entire workflow
Migration and Scaling Considerations
When to Abandon This Architecture
- More than 10 agents required
- Processing datasets > 100MB
- High availability requirements
- Consistent low-latency needs
Alternative Technologies
- Message Queues: Redis, RabbitMQ for >10 agents
- Microservices: Proper frameworks for scale
- Databases: PostgreSQL instead of pandas for data processing
- Languages: Go for better concurrent performance
Monitoring and Alerting Requirements
Essential Metrics
- Agent health status (binary: healthy/unhealthy)
- Request success rate (target: >95%)
- Average response time (target: <30 seconds)
- Memory usage per container (alert at 80%)
- Error rate by agent type
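If you follow the Prometheus + Grafana recommendation, these metrics map directly onto standard metric types. A sketch using the prometheus_client library; the metric names and port are illustrative:
```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

AGENT_HEALTHY = Gauge("agent_healthy", "1 if the agent is healthy, 0 otherwise", ["agent"])
REQUESTS_TOTAL = Counter("agent_requests_total", "Requests handled", ["agent", "status"])
RESPONSE_TIME = Histogram("agent_response_seconds", "Response time per request", ["agent"])

def record_request(agent: str, ok: bool, seconds: float) -> None:
    """Call this once per handled request from the coordinator."""
    REQUESTS_TOTAL.labels(agent=agent, status="success" if ok else "error").inc()
    RESPONSE_TIME.labels(agent=agent).observe(seconds)

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
```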
Alert Conditions
- Any agent unhealthy > 2 minutes
- Success rate < 80% over 5 minutes
- Memory usage > 80% of container limit
- Response time > 60 seconds
- Error rate > 10% over 10 minutes
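The first condition (agent unhealthy for more than 2 minutes) can be approximated inside the coordinator while a full monitoring stack is still being set up. A sketch, assuming each agent exposes the /health endpoint used in the debugging section; the agent URL list and print-based alert are placeholders:
```python
import asyncio
import time
import httpx

UNHEALTHY_AFTER = 120  # seconds without a passing health check before alerting

async def watch_agents(agent_urls: list[str], interval: int = 60) -> None:
    last_healthy = {url: time.time() for url in agent_urls}
    async with httpx.AsyncClient(timeout=5) as client:
        while True:
            for url in agent_urls:
                try:
                    resp = await client.get(f"{url}/health")
                    if resp.status_code == 200:
                        last_healthy[url] = time.time()
                except httpx.HTTPError:
                    pass  # failed check: timestamp stays stale
                if time.time() - last_healthy[url] > UNHEALTHY_AFTER:
                    print(f"ALERT: {url} unhealthy for over {UNHEALTHY_AFTER}s")
            await asyncio.sleep(interval)
```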
Testing Strategy
Local Testing Limitations
- Perfect local performance creates false confidence
- Network latency absent in local environment
- Resource constraints not realistic
- Real API failures not simulated
Integration Testing Requirements
- Mock all external APIs (prevent rate limiting)
- Test with realistic network delays
- Simulate container restarts
- Test concurrent load scenarios
- Validate error handling paths
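A sketch of one such test, assuming pytest with the pytest-asyncio plugin; the flaky agent class is a stand-in for a real worker so the test stays self-contained and never hits an external API:
```python
import asyncio
import pytest

class FlakyAgent:
    """Stand-in worker: fails once (simulating a container restart), then recovers."""
    def __init__(self, fail_first: int = 1, delay: float = 0.2):
        self.fail_first = fail_first
        self.delay = delay
        self.calls = 0

    async def run(self, task: str) -> dict:
        self.calls += 1
        await asyncio.sleep(self.delay)        # simulate realistic network latency
        if self.calls <= self.fail_first:
            raise ConnectionError("simulated container restart")
        return {"status": "ok", "task": task}

async def call_with_retry(agent: FlakyAgent, task: str, attempts: int = 3) -> dict:
    for attempt in range(attempts):
        try:
            return await agent.run(task)
        except ConnectionError:
            await asyncio.sleep(2 ** attempt)  # exponential backoff between retries
    raise RuntimeError("agent unreachable after retries")

@pytest.mark.asyncio
async def test_retry_survives_simulated_restart():
    agent = FlakyAgent()
    result = await call_with_retry(agent, "summarize findings")
    assert result["status"] == "ok"
    assert agent.calls == 2                    # one failure, then one success
```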
Implementation Decision Tree
Choose MCP When:
- Building demo or prototype system
- Need tool sharing between agents
- Have <10 agents total
- Processing small datasets (<10K rows)
- Can tolerate single point of failure
Avoid MCP When:
- Require high availability
- Need to scale beyond 10 agents
- Processing large datasets (>100MB)
- Cannot tolerate coordinator failures
- Need consistent low latency
Troubleshooting Quick Reference
Agent Registration Issues
- Check the agent is running: curl http://agent:port/health
- Verify coordinator can reach agent network
- Check Docker networking configuration
- Validate agent tool registration completion
Task Execution Failures
- Check coordinator logs for assignment errors
- Verify agent health status in registry
- Test individual agent endpoints directly
- Check for memory or timeout issues
Performance Degradation
- Monitor memory usage across all containers
- Check for network connectivity issues
- Verify no agents stuck in infinite loops
- Review concurrent task limits
This architecture works for demonstrations and small-scale systems but requires significant operational overhead and has fundamental scaling limitations. Plan migration strategy before reaching operational limits.
Useful Links for Further Investigation
Resources That Actually Matter
Link | Description |
---|---|
Model Context Protocol Specification | The actual spec. Read this when you get cryptic JSON-RPC errors and need to understand what's supposed to happen vs what's actually happening. |
Anthropic's MCP Documentation | Better than most docs, includes real examples that sometimes work. Start here for Claude integration. |
MCP GitHub Repository | Source code and examples. More useful than the docs when you need to see how things actually work. |
FastMCP Python Framework | What I used in this tutorial. Lightweight and works for demos. Not sure about production scale. |
MCP TypeScript SDK | Official JavaScript implementation. More mature than the Python stuff if you're building serious applications. |
Go MCP Implementation | For when Python's async performance isn't cutting it and you need something that actually scales. |
Jaeger Tracing | Essential for debugging multi-agent systems. When agents stop talking to each other, this helps figure out where requests are dying. |
Prometheus + Grafana | Standard monitoring stack. Set up dashboards for agent health, request rates, and response times before things break in production. |
Docker Logs Analysis | `docker logs` is your friend. Learn the flags: `--follow`, `--tail`, `--since`. You'll use them constantly. |
MCP Discussions | Official community forum. People post real problems and solutions here. |
MCP Community Discussions | Community discussions about MCP implementations, including war stories and performance tips. |
Stack Overflow - MCP Tag | When you get stuck with specific technical issues, check here for solutions. |
Multi-Agent Systems: A Modern Approach | Academic textbook. Good for understanding the theory behind why multi-agent systems are hard. |
Distributed Systems Course - MIT | Free course materials. Essential for understanding why distributed systems break and how to make them more reliable. |
Building Microservices - Sam Newman | Practical guide to distributed architectures. Many lessons apply to multi-agent systems. |