Celery: AI-Optimized Technical Reference
Critical Context
Problem Severity: Without background tasks, web apps freeze for 30 seconds to 5 minutes during heavy operations (email sending, file processing), causing users to assume the application is broken and generating support tickets.
Breaking Point: Applications become unusable when synchronous operations exceed 3-5 seconds of user wait time.
Production Reality: Celery v5.5.3+ is stable; v4.x had random crashes; v5.4 broke with Python 3.13 async/await changes.
Technical Specifications
Performance Thresholds
- Theoretical Maximum: Millions of tasks per minute
- Practical Performance: Thousands of tasks per second
- Worker Memory Usage: 50-100MB per worker baseline
- Memory Growth: Workers grow from 100MB to 8GB over 3 days without worker_max_tasks_per_child
Critical Failure Modes
- Silent Worker Death: Workers killed by SIGKILL due to memory leaks, appears as "Worker exited prematurely: signal 9"
- Task Hanging: Tasks never complete without time limits configured (see the time-limit sketch after this list)
- Multiple Execution: Tasks run multiple times due to connection losses or multiple workers on same queue
- Redis Connection Loss: "ConnectionError: Error 111" in Docker networking scenarios
- Task Discovery Failure: Import path issues prevent workers from finding task functions
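The "Task Hanging" failure is avoidable with per-task time limits. A minimal sketch, assuming an app = Celery(...) instance like the one configured in the next section; the work and cleanup helpers (do_heavy_work, cleanup_partial_output) are hypothetical:

from celery.exceptions import SoftTimeLimitExceeded

@app.task(bind=True, soft_time_limit=600, time_limit=700)
def process_upload(self, path):
    try:
        do_heavy_work(path)              # hypothetical long-running operation
    except SoftTimeLimitExceeded:
        cleanup_partial_output(path)     # hypothetical cleanup; runs in the window before the hard kill
        raise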
Configuration That Works in Production
Essential Settings
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # example broker URL; adjust for your environment

app.conf.update(
    task_serializer='json',              # NEVER use pickle in production
    accept_content=['json'],
    result_serializer='json',
    timezone='UTC',
    enable_utc=True,
    # Critical for stability
    worker_max_tasks_per_child=1000,     # recycle workers to prevent memory leaks
    task_soft_time_limit=600,            # 10-minute soft limit (raises SoftTimeLimitExceeded)
    task_time_limit=700,                 # hard kill 100 seconds after the soft limit
    # Performance optimization
    task_routes={
        'tasks.send_email': {'queue': 'fast'},
        'tasks.process_video': {'queue': 'slow'},
    },
)
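Because task_routes is keyed by task name, a task defined in tasks.py routes automatically when called. A minimal sketch; the task body and arguments are illustrative:

# tasks.py -- module name matters: this task's full name is 'tasks.send_email'
@app.task
def send_email(to, subject):
    pass  # lands in the 'fast' queue via task_routes

send_email.delay('user@example.com', 'Welcome')                        # uses the routed queue
send_email.apply_async(args=['user@example.com', 'Hi'], queue='slow')  # explicit per-call override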
Broker Selection Decision Matrix
| Scenario | Recommendation | Consequence of Wrong Choice |
|---|---|---|
| Development | Redis | None; easy setup |
| Production (job loss acceptable) | Redis | Faster setup, simpler ops |
| Production (zero job loss) | RabbitMQ | Jobs lost on Redis restart |
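Switching brokers is a one-line change in the app definition. A sketch with placeholder hosts and credentials:

from celery import Celery

# Option 1: Redis -- fastest to set up; accept that queued jobs may be lost on restart
app = Celery('tasks', broker='redis://redis:6379/0')

# Option 2: RabbitMQ -- durable queues for zero-job-loss requirements
app = Celery('tasks', broker='amqp://user:password@rabbitmq:5672//')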
Resource Requirements
Time Investment
- Basic Setup: 1-2 hours (with Redis)
- Production Configuration: 4-8 hours (including monitoring)
- Advanced Workflows (Canvas): 1-2 days learning curve
- Debugging Production Issues: 2-4 hours per incident
Expertise Requirements
- Minimum: Basic Python, understanding of async concepts
- Production: Docker networking, message queue concepts, monitoring setup
- Advanced: RabbitMQ administration, Kubernetes scaling, performance tuning
Infrastructure Costs
- Development: Redis container (~10MB RAM)
- Production: Redis/RabbitMQ + worker instances (50-100MB per worker)
- Monitoring: Flower (development) or Prometheus/Grafana (production)
Critical Warnings
What Documentation Doesn't Tell You
- Default Configuration Fails: Out-of-box settings cause memory leaks and worker death
- Docker Networking Hell: --link is deprecated; explicit Docker networks are required
- Flower Unreliability: The monitoring dashboard crashes during outages, exactly when it is needed most
- Canvas Complexity: Debugging a chord at 3am is a nightmare; a simple loop is often the better design
- Python Version Compatibility: v5.4 breaks with Python 3.13; stick to tested combinations
Production Gotchas
- Workers die silently without proper logging configuration
- Redis in Docker requires careful network setup; otherwise expect connection errors
- Multiple workers + same queue = duplicate task execution
- Memory leaks accumulate without worker_max_tasks_per_child
- Task arguments get logged and stored in broker (security risk for secrets)
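Because arguments are logged and stored in the broker, pass references and resolve secrets inside the worker. A minimal sketch; the task, argument, and environment variable names are hypothetical:

import os

@app.task
def charge_customer(user_id):
    api_key = os.environ['PAYMENT_API_KEY']   # resolved on the worker, never serialized
    ...

# Bad:  charge_customer.delay(user_id, api_key)  -- the key ends up in the broker and in logs
# Good: charge_customer.delay(user_id)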
Decision Support Information
When Celery Makes Sense
- USE: Complex workflows, Django integration, need for Canvas features, enterprise monitoring requirements
- DON'T USE: Simple "send email later" (RQ sufficient), resource-constrained environments, teams without ops experience
Alternatives Comparison
| Tool | Setup Difficulty | Stability | Memory Usage | When to Choose |
|---|---|---|---|---|
| Celery | Hard | Good (when configured) | 50-100MB/worker | Complex workflows, Django apps |
| RQ | Easy | Very Good | 20-50MB/worker | Simple jobs, small teams |
| Dramatiq | Medium | Excellent | 30-70MB/worker | Clean architecture priority |
| Huey | Trivial | Good | 10-30MB/worker | Minimal resource usage |
Migration Considerations
- From synchronous: 1-2 day implementation, significant architecture changes
- Between task queues: 2-5 days depending on Canvas usage
- Breaking changes: v4→v5 requires code updates; monitor for deprecation warnings
Implementation Patterns
Retry Configuration
@app.task(bind=True,
          autoretry_for=(ConnectionError, TimeoutError),
          retry_backoff=True,       # exponential backoff: ~1s, 2s, 4s, ...
          retry_backoff_max=600,    # cap the delay at 10 minutes
          retry_jitter=True,        # randomize delays to avoid thundering herds
          retry_kwargs={'max_retries': 5})
def api_call_task(self, url):
    # Call the external API here; transient network errors are retried automatically
    pass
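Invocation is unchanged; the URL below is a placeholder:

api_call_task.delay('https://api.example.com/resource')   # retried up to 5 times with backoff on network errors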
Worker Scaling Guidelines
- I/O-bound tasks: 2-4x CPU cores
- CPU-bound tasks: 1x CPU cores
- Mixed workload: Start with CPU count, monitor queue lengths
- Auto-scaling: Possible with Kubernetes but requires 4-8 hours setup time
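The concurrency guidance above can be pinned in configuration instead of command-line flags. A sketch assuming an 8-core host:

app.conf.worker_concurrency = 16         # I/O-bound: ~2x cores
app.conf.worker_prefetch_multiplier = 1  # long tasks: stop one worker from hoarding the queue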
Security Requirements
- Never pass secrets in task arguments (they get logged)
- Use JSON serialization only (pickle allows code execution)
- Enable SSL/TLS for broker connections in production
- Message signing required for untrusted task sources
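A hedged sketch of the broker-TLS and message-signing settings; the certificate paths are placeholders, and option names should be checked against your Celery and broker versions:

import ssl

app.conf.update(
    broker_use_ssl={'ssl_cert_reqs': ssl.CERT_REQUIRED},  # TLS on the broker connection
    security_key='/etc/celery/worker.key',                # placeholder paths
    security_certificate='/etc/celery/worker.pem',
    security_cert_store='/etc/celery/certs/*.pem',
)
app.setup_security()  # switches to the signed 'auth' serializer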
Monitoring and Debugging
Essential Metrics
- Queue length per queue type
- Worker memory usage over time
- Task failure rates and retry counts
- Task execution duration by type
- Worker availability and restart frequency
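With the default Redis transport, each queue is a Redis list, so queue length can be read directly from the broker. A sketch assuming a local Redis and the 'fast'/'slow' queues from the routing example:

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
for queue in ('celery', 'fast', 'slow'):   # default queue plus the routed ones
    print(queue, r.llen(queue))            # pending task count per queue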
Debugging Workflow
- Check worker logs for SIGKILL (memory issues)
- Verify broker connectivity (Redis/RabbitMQ)
- Confirm task discovery (import paths)
- Monitor queue lengths (worker capacity)
- Check time limits (hanging tasks)
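Several of these checks (worker reachability, task discovery, hanging tasks) can be run from a Python shell with the inspection API, assuming the app object from above:

insp = app.control.inspect()
print(insp.ping())        # are workers reachable through the broker?
print(insp.registered())  # which tasks each worker actually discovered
print(insp.active())      # long-lived entries here point to hanging tasks
print(insp.stats())       # per-worker process counts and broker details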
Production Monitoring Stack
- Development: Flower web interface (unreliable but functional)
- Production: Prometheus + Grafana (stable, integrates with existing monitoring)
- Alerting: Monitor worker deaths, queue buildup, task failure rates
Cost-Benefit Analysis
Hidden Costs
- Learning Curve: 2-4 weeks for team proficiency
- Operational Overhead: Monitoring setup, worker management, broker maintenance
- Debugging Time: Complex failures can take hours to diagnose
- Infrastructure: Additional Redis/RabbitMQ instances, worker compute resources
Break-Even Point
Celery investment pays off when:
- Background tasks take >5 seconds
- Task volume >100 per hour
- Need for complex workflows exists
- Team has ops capacity for maintenance
ROI Indicators
- Positive: Reduced user complaints about app freezing, ability to handle larger workloads, better user experience
- Negative: Increased operational complexity, additional failure modes, team learning investment