Docker networking fuckery - workers can't find Redis

What happens: Workers start up, then immediately crash with "Connection refused" or "Name resolution failed". Everything works fine on your laptop.

Why it's broken: You used `localhost:6379` in your Django settings because that's what every tutorial shows. Inside a container, `localhost` is the container itself, not the Redis container. Containers on the same Compose network reach each other by service name.

Fix:

```python
# This breaks in Docker (but works locally)
CELERY_BROKER_URL = 'redis://localhost:6379/1'

# This actually works
CELERY_BROKER_URL = 'redis://redis:6379/1'  # 'redis' = service name in docker-compose
```

Time wasted: 4 hours the first time, 30 minutes every time after when I forgot.
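One way to stop re-learning this is to read the broker URL from the environment and fail loudly when it points at the container itself. A minimal sketch, assuming the Compose service is called `redis` and that you set a `RUNNING_IN_DOCKER=1` variable on your containers yourself; both that variable and the `resolve_broker_url` helper are hypothetical names, not part of Celery or Docker:

```python
import os
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def resolve_broker_url(env=None):
    """Return the broker URL, refusing loopback hosts inside a container."""
    env = os.environ if env is None else env
    # Default to the Compose service name, not localhost.
    url = env.get("CELERY_BROKER_URL", "redis://redis:6379/1")
    in_docker = env.get("RUNNING_IN_DOCKER") == "1"  # hypothetical flag you set yourself
    if in_docker and urlparse(url).hostname in LOOPBACK_HOSTS:
        raise ValueError(
            f"{url!r} points at the container itself; use the service name instead"
        )
    return url


# In settings.py:
CELERY_BROKER_URL = resolve_broker_url()
```

The check costs nothing locally (the flag is unset there) and turns a confusing "Connection refused" loop into one explicit error at startup.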

Tasks run synchronously and defeat the whole fucking point

What happens: You queue a task and it executes immediately in the web process instead of in a background worker. Defeats the entire purpose of using Celery.

Why it's broken: Someone set `CELERY_TASK_ALWAYS_EAGER = True` in Django settings. This makes tasks execute synchronously for "easier debugging" but everyone forgets to turn it off.

Fix:

```python
# In settings.py - make sure this is False (or remove it entirely)
CELERY_TASK_ALWAYS_EAGER = False
```

How to check: Queue a slow task (like `time.sleep(10)`). If your web request blocks for 10 seconds, eager mode is on.

Time wasted: 2 hours wondering why performance didn't improve.
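If you do want eager mode for local debugging, driving it from an environment variable makes "forgot to turn it off" much harder. A small sketch; the variable name `CELERY_EAGER` and the `env_flag` helper are assumptions made up for this example:

```python
import os


def env_flag(name, default=False, env=None):
    """Read a boolean flag from the environment ('1', 'true', 'yes', 'on' count as on)."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


# In settings.py: eager mode stays off unless someone opts in explicitly.
CELERY_TASK_ALWAYS_EAGER = env_flag("CELERY_EAGER")  # CELERY_EAGER is a hypothetical variable name
```

Production containers never set the variable, so tasks always go to the workers there.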

Workers grow like cancer and get OOM killed

What happens: Workers start at 100MB, grow to 800MB over a few days, then Docker kills them with "Memory cgroup out of memory". Tasks start failing randomly.

Why it's broken: Python garbage collection isn't perfect. Long-running workers accumulate leaked memory, especially with image processing, file handling, or database connections that don't close properly.

Fixes that work:

```python
# Force worker recycling - nuclear option but it works
CELERY_WORKER_MAX_TASKS_PER_CHILD = 500  # Replace the worker process after 500 tasks

# Don't hoard tasks in memory
CELERY_WORKER_PREFETCH_MULTIPLIER = 1  # Only grab one task at a time

# Don't ack a task until it has finished
CELERY_TASK_ACKS_LATE = True  # Acknowledge only after completion
```

Docker memory limits (be generous or workers die randomly):

```yaml
worker:
  deploy:
    resources:
      limits:
        memory: 1.5G  # Generous limit
      reservations:
        memory: 512M
```

Debug memory usage: `docker stats` to watch memory grow, `docker exec worker ps aux` to see individual processes (swap `worker` for your actual container name).
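To see which task pushes memory up, it can help to log the process high-water mark from inside the worker. A rough sketch using only the standard library `resource` module (Unix only); `peak_rss_mb` and `log_memory` are illustrative helper names, and the value is the peak, not current usage, so it only ever goes up:

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident memory of the current process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(label):
    """Print the high-water mark so growth shows up in the container logs."""
    print(f"[mem] {label}: peak RSS {peak_rss_mb():.1f} MB")


log_memory("after task")  # e.g. call this at the end of a suspect task
```

A jump in the logged peak right after one particular task is a decent hint about where to start profiling.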

ImportError hell - tasks can't find your Django models

What happens: Workers start fine, but tasks fail with `ImportError: No module named 'myapp.models'` or `django.core.exceptions.ImproperlyConfigured`.

Why it's broken: Worker containers have different `PYTHONPATH` or `DJANGO_SETTINGS_MODULE` than web containers. Python can't find your Django app code.

Fix: Make sure both web and worker containers have identical environment:

```python
# celery.py - this file needs to be identical in both containers
import os
from celery import Celery

# Same settings module as web container
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')

app = Celery('myproject')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
```

Dockerfile (must be identical for web and worker):

```dockerfile
WORKDIR /app
COPY . /app/
ENV PYTHONPATH=/app
ENV DJANGO_SETTINGS_MODULE=myproject.settings
```

Debug this: `docker exec worker python -c "import myapp.models; print('works')"` - should not fail

Nuclear option: If imports still break, add this to your task files:

```python
import sys
sys.path.append('/app')
```
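When the one-liner fails and you are not sure why, a slightly longer script that prints the environment next to the import check narrows it down. A sketch using only the standard library; `missing_modules` is a hypothetical helper, and `myproject.settings` / `myapp.models` are the example names from this guide. Note that `find_spec` on a dotted name imports the parent package, so a broken `__init__.py` can still surface here:

```python
import importlib.util
import os
import sys


def missing_modules(names):
    """Return the subset of module names that cannot be located for import."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package is missing or malformed.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("DJANGO_SETTINGS_MODULE =", os.environ.get("DJANGO_SETTINGS_MODULE"))
    print("sys.path =", sys.path)
    print("missing:", missing_modules(["myproject.settings", "myapp.models"]))
```

Run it in both the web and worker containers and diff the output; the difference is usually the bug.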

PostgreSQL "too many connections" nightmare

What happens: App works fine, you add 3 more Celery workers, and suddenly PostgreSQL starts rejecting connections with "FATAL: too many connections for role".

Why it's broken: PostgreSQL defaults to 100 connections total. Your web app uses 8 workers × 5 connections = 40. Add 6 Celery workers × 5 connections = 30 more. Add monitoring, migrations, and admin users on top, and you hit the limit fast.

Fixes:

1. **Increase PostgreSQL connection limit** (if you control the DB). `max_connections` is only read at server start, so restart PostgreSQL afterwards; `pg_reload_conf()` does not apply it:

```sql
ALTER SYSTEM SET max_connections = 200;
-- Restart the server for this to take effect
```

2. **Limit connections per Django process**. The stock `django.db.backends.postgresql` backend has no `MAX_CONNS` option, so the cap comes from how many processes and threads you run (Gunicorn `--workers`, Celery `--concurrency`). What you can set is how long each connection is kept:

```python
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'CONN_MAX_AGE': 300,  # Reuse connections for 5 minutes
    }
}
```

3. **Close database connections in tasks** (nuclear option):

```python
from celery import shared_task
from django.db import connections

@shared_task
def some_task():
    try:
        # Do work (process_data is a placeholder for your own code)
        return process_data()
    finally:
        # Force close all connections, even if the task raised
        connections.close_all()
```

Memory death spiral - Redis grows until server dies

What happens: Redis memory usage grows from 100MB to 2GB+ over weeks, eventually triggers the OOM killer, and everything stops working.

Why it's broken: Celery stores a result for every task unless you tell it not to, and the default expiry is a full day. At high task volume, results pile up faster than they expire.

Fixes that work:

```python
# Don't store results if you don't need them
CELERY_TASK_IGNORE_RESULT = True

# Or expire results aggressively
CELERY_RESULT_EXPIRES = 300  # 5 minutes
# (CELERY_TASK_RESULT_EXPIRES is the pre-4.0 name for the same setting; one is enough)

# Use separate Redis DB for results (can flush without losing queues)
CELERY_BROKER_URL = 'redis://redis:6379/0'      # Queue storage
CELERY_RESULT_BACKEND = 'redis://redis:6379/1'  # Result storage
```

Make sure the broker and the result backend really do point at different DB numbers, or flushing results takes the queue with it.

Monitor Redis memory: `redis-cli info memory` shows used memory, `redis-cli info keyspace` shows key counts per DB.

Django Celery Redis Docker: Production-Ready Background Task Processing

Critical Architecture Overview

Stack Components: Django web servers + Celery workers + Redis message broker + PostgreSQL database + Docker containers

Performance Impact: Response time improvement from 850ms to 45ms for report generation, peak CPU usage reduced from 85% to 35%, eliminated user timeout errors (from ~50/day to 0)

Scaling Capacity: Handles 20k daily active users; concurrent user capacity roughly doubled through background task processing

Configuration Requirements

Redis Production Settings

redis:
  command: redis-server --appendonly yes --save 60 1000 --maxmemory 512mb --maxmemory-policy allkeys-lru
  volumes:
    - redis_data:/data
  restart: unless-stopped

Critical Parameters:

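--appendonly yes: Persists queued messages to disk so a crash or restart does not wipe the queue (learned after losing 2000 queued tasks)
--maxmemory 512mb: Prevents Redis consuming all server memory (crashed production twice)
--maxmemory-policy allkeys-lru: Evicts least recently used keys when memory is full; this applies to every key, so queue data can be evicted too if Redis fills up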

Celery Worker Configuration

# Production settings that prevent memory leaks
CELERY_WORKER_MAX_TASKS_PER_CHILD = 1000  # Workers grow to 800MB+ without this
CELERY_WORKER_PREFETCH_MULTIPLIER = 1     # Prevents worker task hoarding
CELERY_TASK_ACKS_LATE = True              # Acknowledge after completion
CELERY_RESULT_EXPIRES = 3600              # Results expire after 1 hour
CELERY_TASK_IGNORE_RESULT = True          # Don't store results if not needed

Resource Limits (prevents OOM kills):

worker:
  deploy:
    resources:
      limits:
        memory: 1.5G  # Generous limit prevents random deaths
      reservations:
        memory: 512M

Database Connection Management

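DATABASES = {
    'default': {
        'CONN_MAX_AGE': 300,  # Persistent connections, reused for up to 5 minutes
        # No 'MAX_CONNS' here: the stock PostgreSQL backend has no such option.
        # Cap connections via Gunicorn --workers and Celery --concurrency instead.
    }
}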

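Connection Limits: PostgreSQL defaults to 100 connections. Formula: 8 web workers × 5 connections + 6 Celery workers × 5 connections = 70, before monitoring, migrations, and admin sessions. That leaves little headroom under the default, so raise the limit to 200+ or reduce connections per process.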
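The arithmetic is worth keeping as a small script so it gets rerun whenever worker counts change. A sketch; `connection_budget` is a hypothetical helper, the 5-connections-per-process figure comes from the formula above, and the `reserved` allowance of 10 is an assumption:

```python
def connection_budget(pools, max_connections=100, reserved=0):
    """Sum connections per pool and compare against the server limit.

    pools maps a label to (process_count, connections_per_process).
    """
    used = sum(count * per_process for count, per_process in pools.values())
    total = used + reserved
    return {
        "used": used,
        "total_with_reserved": total,
        "headroom": max_connections - total,
    }


budget = connection_budget(
    {"web": (8, 5), "celery": (6, 5)},
    max_connections=100,
    reserved=10,  # assumed allowance for monitoring, migrations, admin sessions
)
print(budget)  # {'used': 70, 'total_with_reserved': 80, 'headroom': 20}
```

With these numbers, four more Celery workers at 5 connections each would use up the remaining headroom entirely.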

Critical Failure Modes and Solutions

Docker Networking Failures

Problem: Workers crash with "Connection refused" - everything works locally but fails in containers
Root Cause: Using localhost:6379 instead of Docker service names
Fix: Use service names: redis://redis:6379/1 not redis://localhost:6379/1
Time Cost: 4 hours first occurrence, 30 minutes every subsequent time

Synchronous Task Execution (Silent Failure)

Problem: Tasks execute in web process instead of background workers
Root Cause: CELERY_TASK_ALWAYS_EAGER = True in settings
Detection: Queue slow task (time.sleep(10)), if web request blocks for 10 seconds, eager mode is active
Fix: Set CELERY_TASK_ALWAYS_EAGER = False or remove setting entirely

Redis Message Loss

Problem: Redis crashes eat all queued tasks, workers lose connection
Root Cause: Default Redis config doesn't persist to disk
Consequence: All background work disappears during restarts/crashes
Solution: Enable AOF persistence and volume mounting as shown in configuration

Worker Memory Growth (OOM Kills)

Problem: Workers start at 100MB, grow to 800MB+ over days, get killed by Docker
Root Cause: Python garbage collection imperfect, memory leaks accumulate
Detection: docker stats to monitor growth, docker exec worker ps aux for processes
Solutions:

Force worker recycling: CELERY_WORKER_MAX_TASKS_PER_CHILD = 500
Prevent task hoarding: CELERY_WORKER_PREFETCH_MULTIPLIER = 1
Set generous Docker memory limits: 1.5G minimum

PostgreSQL Connection Exhaustion

Problem: "FATAL: too many connections for role" errors
Calculation: 8 web workers × 5 connections + 6 workers × 5 connections + monitoring = 70+ connections (PostgreSQL default: 100)
Solutions:

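Increase PostgreSQL limit: ALTER SYSTEM SET max_connections = 200; (takes effect only after a PostgreSQL restart)
Limit per-process connections: reduce Gunicorn --workers and Celery --concurrency; the stock Django backend has no MAX_CONNS option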
Nuclear option: connections.close_all() in tasks

Task Import Failures

Problem: Tasks fail with ImportError: No module named 'myapp.models'
Root Cause: Different PYTHONPATH or DJANGO_SETTINGS_MODULE between containers
Solution: Ensure identical environment variables in web and worker containers
Debug: docker exec worker python -c "import myapp.models; print('works')"

Infinite Task Pending State

Problem: Tasks show as queued but never execute, no error messages
Cause: Task routing doesn't match worker queue configuration
Debug Commands:

docker exec worker celery -A myproject inspect ping
docker exec worker celery -A myproject inspect active_queues
docker exec worker celery -A myproject inspect active
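Once `inspect active_queues` tells you what each worker listens on, the mismatch is a simple set difference against your routing table. A minimal sketch; `unconsumed_queues` is a hypothetical helper, the task names are made up, and the queue names follow the `-Q` examples used in this guide. The default queue name `celery` is what Celery uses when a route names no queue:

```python
def unconsumed_queues(task_routes, worker_queues, default_queue="celery"):
    """Return queues that tasks are routed to but no worker listens on.

    task_routes: mapping of task name -> {"queue": name}, as in CELERY_TASK_ROUTES.
    worker_queues: mapping of worker label -> list of queues from its -Q flag.
    """
    routed = {route.get("queue", default_queue) for route in task_routes.values()}
    consumed = {queue for queues in worker_queues.values() for queue in queues}
    return sorted(routed - consumed)


# Hypothetical task names for illustration.
routes = {
    "reports.tasks.build_report": {"queue": "cpu_intensive"},
    "mail.tasks.send_welcome": {"queue": "emails"},  # no worker below consumes 'emails'
}
workers = {"worker-fast": ["fast"], "worker-cpu": ["cpu_intensive"]}
print(unconsumed_queues(routes, workers))  # ['emails']
```

Anything this prints is a queue where tasks will sit forever without an error.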

Production Scaling Patterns

Queue-Based Auto Scaling

Trigger Points:

Queue length > 500: Scale to 10 workers (heavy load)
Queue length > 100: Scale to 5 workers (medium load)
Queue length < 10: Scale to 2 workers (light load)
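The trigger points translate into a small decision function. A sketch, not a full autoscaler: `target_workers` is a hypothetical helper, and since the thresholds say nothing about queue lengths between 10 and 100, this version assumes you hold the current count there:

```python
def target_workers(queue_length, current_workers):
    """Map queue depth to a worker count using the trigger points above."""
    if queue_length > 500:
        return 10  # heavy load
    if queue_length > 100:
        return 5   # medium load
    if queue_length < 10:
        return 2   # light load
    # Between 10 and 100 the thresholds are silent; hold steady to avoid flapping.
    return current_workers


print(target_workers(650, current_workers=5))  # 10
print(target_workers(50, current_workers=5))   # 5
```

With a Redis broker, the queue length can usually be read with `redis-cli -n 1 llen celery` (default queue name, DB 1 as in the broker URL above), and the result applied with `docker compose up -d --scale worker=N`.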

Specialized Worker Pools

# Fast tasks (email, notifications) - high concurrency
worker-fast:
  command: celery -A core worker -Q fast --concurrency=8

# CPU intensive (image processing) - low concurrency, high CPU
worker-cpu:
  command: celery -A core worker -Q cpu_intensive --concurrency=2
  deploy:
    resources:
      limits:
        cpus: '4.0'

# IO intensive (downloads, API calls) - very high concurrency
worker-io:
  command: celery -A core worker -Q io_intensive --concurrency=20

Monitoring Requirements

Critical Metrics:

Queue depth: Scale workers when > 100 tasks for > 5 minutes
Worker memory usage: Alert when > 1GB per worker
Task failure rate: Alert when > 5% failure rate
Redis memory usage: Alert when > 80% of limit
Database connections: Alert when > 80% of max_connections
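These thresholds can be checked in one place. A sketch under stated assumptions: `evaluate_alerts` and all the metric key names are hypothetical, ratios are fractions between 0 and 1, and collecting the numbers is left to whatever monitoring you already run:

```python
def evaluate_alerts(metrics):
    """Return the names of metrics that breach the thresholds listed above."""
    alerts = []
    if metrics["queue_depth"] > 100 and metrics["queue_depth_minutes"] > 5:
        alerts.append("queue_depth")
    if metrics["worker_memory_gb"] > 1.0:
        alerts.append("worker_memory")
    if metrics["task_failure_rate"] > 0.05:
        alerts.append("task_failure_rate")
    if metrics["redis_memory_ratio"] > 0.80:
        alerts.append("redis_memory")
    if metrics["db_connection_ratio"] > 0.80:
        alerts.append("db_connections")
    return alerts


sample = {
    "queue_depth": 150,
    "queue_depth_minutes": 7,
    "worker_memory_gb": 0.6,
    "task_failure_rate": 0.02,
    "redis_memory_ratio": 0.85,
    "db_connection_ratio": 0.70,
}
print(evaluate_alerts(sample))  # ['queue_depth', 'redis_memory']
```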

Health Check Commands:

# Worker health
celery -A core inspect ping

# Active tasks
celery -A core inspect active

# Redis memory
redis-cli info memory

# Database connections (SQL, run inside psql)
SELECT count(*) FROM pg_stat_activity;
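The `redis-cli info memory` output is plain `key:value` text, so turning it into the "80% of limit" check takes only a few lines. A sketch; `parse_redis_info` and `memory_ratio` are hypothetical helpers, and the sample text is a trimmed, made-up excerpt rather than real output:

```python
def parse_redis_info(text):
    """Parse `redis-cli info` output into a dict of string values."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        info[key] = value
    return info


def memory_ratio(info):
    """Fraction of maxmemory in use, or None when no limit is configured."""
    limit = int(info.get("maxmemory", 0))
    if limit == 0:
        return None
    return int(info["used_memory"]) / limit


sample = """# Memory
used_memory:429496730
maxmemory:536870912
maxmemory_policy:allkeys-lru
"""
ratio = memory_ratio(parse_redis_info(sample))
print(f"{ratio:.0%}")  # 80%
```

A `maxmemory` of 0 means Redis has no limit set, which is why that case returns `None` instead of dividing by zero.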

Resource Investment Reality

Time Costs:

Initial setup: 3 months to get production-stable configuration
Docker networking issues: 4 hours first time, 30 minutes recurring
Memory leak debugging: 2-3 hours per incident
Import/path issues: 1-2 hours typical resolution

Infrastructure Costs (monthly):

Redis instance: ~$15 (t3.small)
Additional monitoring: ~$10-25
Increased database capacity: ~$12+ (connection limits)

Expertise Requirements:

Docker networking knowledge (critical)
Redis persistence and memory management
PostgreSQL connection pooling
Python memory profiling for leak detection

When NOT to Use This Stack

Skip for:

Basic CRUD applications with < 1000 daily users
Operations completing in < 1 second
Prototypes and MVPs (add complexity later)
Simple blogs, portfolios, basic CMSes

Complexity Threshold: Only worth it when hitting actual performance walls, not theoretical scaling concerns.

Docker Compose Production Template

version: '3.8'
services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: ${DB_NAME:-djangodb}
      POSTGRES_USER: ${DB_USER:-postgres}
      POSTGRES_PASSWORD: ${DB_PASSWORD:-postgres}
    volumes:
      - postgres_data:/var/lib/postgresql/data/
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER:-postgres}"]
      interval: 10s
      timeout: 5s
      retries: 5
    restart: unless-stopped

  redis:
    image: redis:7.4-alpine
    command: redis-server --appendonly yes --maxmemory 512mb --maxmemory-policy allkeys-lru
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
    restart: unless-stopped

  web:
    build: .
    command: gunicorn core.wsgi:application --bind 0.0.0.0:8000 --workers 3
    environment:
      - DATABASE_URL=postgres://${DB_USER:-postgres}:${DB_PASSWORD:-postgres}@db:5432/${DB_NAME:-djangodb}
      - CELERY_BROKER_URL=redis://redis:6379/1
      - DEBUG=False
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    restart: unless-stopped

  worker:
    build: .
    command: celery -A core worker --loglevel=info --concurrency=4 --max-tasks-per-child=1000
    environment:
      - DATABASE_URL=postgres://${DB_USER:-postgres}:${DB_PASSWORD:-postgres}@db:5432/${DB_NAME:-djangodb}
      - CELERY_BROKER_URL=redis://redis:6379/1
    depends_on:
      - db
      - redis
    restart: unless-stopped
    deploy:
      replicas: 2

volumes:
  postgres_data:
  redis_data:

Troubleshooting Decision Tree

Workers won't start: Check Docker service names in CELERY_BROKER_URL
Tasks execute synchronously: Verify CELERY_TASK_ALWAYS_EAGER = False
Tasks disappear: Enable Redis persistence (--appendonly yes)
Workers crash randomly: Set memory limits and max-tasks-per-child
Database connection errors: Increase max_connections, add connection pooling
Import errors: Ensure identical PYTHONPATH in all containers
Tasks stuck pending: Check queue routing vs worker queue configuration

Operational Intelligence Summary

This stack requires significant operational overhead but provides substantial performance benefits for applications processing heavy background work. The configuration complexity is front-loaded: once properly configured, it scales reliably. Most failures stem from Docker networking, memory management, and database connection limits rather than the core technologies. The investment is worthwhile for applications with genuine background processing needs that go beyond simple request/response work.
