Why did my DAG randomly stop running?

Usually the scheduler crashed silently. Check `docker logs airflow-scheduler` or the scheduler logs. Common causes: ran out of memory (scheduler OOMs around 300+ DAGs), database connection died, or someone deployed broken DAG code that crashed the parser. Nuclear option: restart everything.

Why is Airflow so fucking slow?

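Because you have 500 DAGs and one scheduler trying to re-parse them all every 30 seconds. That 30 seconds is `min_file_process_interval`, so raise it (300 seconds is a reasonable start). `dag_dir_list_interval`, which controls how often the DAGs folder is scanned for new files, already defaults to 300 seconds. Also set `dag_file_processor_timeout` higher if slow files are being killed mid-parse, or add more memory. Or just accept that Airflow isn't fast and plan accordingly.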
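Airflow reads any config option from an environment variable named `AIRFLOW__{SECTION}__{KEY}`, which is often the easiest way to change parsing settings in Docker. A minimal sketch that renders those variables; the section names assume the Airflow 2.x layout (newer releases move some parsing options to a `dag_processor` section), and the values are illustrative, not recommendations.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Return the environment variable name Airflow reads for a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Illustrative values only; tune them against your own parse times.
# Section names assume Airflow 2.x.
PARSING_OVERRIDES = {
    ("scheduler", "min_file_process_interval"): 300,  # seconds between re-parses of a file
    ("scheduler", "dag_dir_list_interval"): 600,      # seconds between scans for new files
    ("core", "dag_file_processor_timeout"): 120,      # seconds before a slow parse is killed
}


def render_env(overrides: dict) -> dict:
    """Map (section, key) -> value pairs to environment variable assignments."""
    return {airflow_env_var(s, k): str(v) for (s, k), v in overrides.items()}


if __name__ == "__main__":
    for name, value in render_env(PARSING_OVERRIDES).items():
        print(f"{name}={value}")
```

The printed lines can be pasted into an `.env` file or the `environment:` block of a compose file.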

My tasks are stuck in "queued" state forever

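Your executor is probably dead. With CeleryExecutor, check that the broker (usually Redis) and the workers are still up. With KubernetesExecutor, check that your cluster has free resources. With LocalExecutor, the worker processes have died. Restart whichever piece failed first. Last resort: `docker system prune -a && docker-compose up`, but be aware that the prune also deletes every unused image on the host, so everything gets pulled again.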
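Before restarting anything, it helps to confirm which dependency is actually unreachable. A small sketch using only the standard library; the host names and ports below are placeholders for a typical local setup, not values taken from any Airflow config.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


if __name__ == "__main__":
    # Hypothetical endpoints; replace with your broker and metadata database.
    targets = [("redis", "localhost", 6379), ("postgres", "localhost", 5432)]
    for name, host, port in targets:
        state = "reachable" if can_connect(host, port) else "UNREACHABLE"
        print(f"{name} ({host}:{port}): {state}")
```

A successful TCP connect only proves the port is open; a wedged Redis can still accept connections, so treat this as a first filter rather than a diagnosis.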

What executor should I actually use?

LocalExecutor if you're on one machine and it's not going to prod. CeleryExecutor if you want distributed workers and enjoy debugging Redis connection issues at 2am. KubernetesExecutor if you hate your infrastructure team and want them to hate you back.

How do I handle secrets without hardcoding passwords in my DAGs?

Use [Airflow Connections](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html) through the UI (Admin > Connections). For real production setup, integrate with [AWS Secrets Manager](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/secrets-backends/aws-secrets-manager.html) or [HashiCorp Vault](https://airflow.apache.org/docs/apache-airflow-providers-hashicorp/stable/secrets-backends/hashiCorp-vault.html). Don't put passwords in DAG files - they're visible to everyone.
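Besides the UI and secrets backends, Airflow can also read a connection from an environment variable named `AIRFLOW_CONN_{CONN_ID}` holding a connection URI, which lets a deploy script inject credentials from wherever they are stored. A rough sketch of building that pair; the connection id, host, and credentials are made up for illustration, and the DAG itself would only ever reference the connection id.

```python
from urllib.parse import quote


def connection_env(conn_id, conn_type, host, login, password, port=None, schema=""):
    """Build the (env var name, URI) pair for an Airflow connection.

    Login, password, and schema are percent-encoded so characters such as
    '@' or '/' in a password do not corrupt the URI.
    """
    name = f"AIRFLOW_CONN_{conn_id.upper()}"
    netloc = f"{quote(login, safe='')}:{quote(password, safe='')}@{host}"
    if port is not None:
        netloc += f":{port}"
    return name, f"{conn_type}://{netloc}/{quote(schema, safe='')}"


if __name__ == "__main__":
    # Hypothetical values; in practice the password comes from your secret store.
    name, uri = connection_env(
        "warehouse", "postgres", "db.internal", "etl", "p@ss/word", 5432, "analytics"
    )
    print(f"{name}={uri}")
```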

Why aren't my DAGs showing up in the web UI?

Check the scheduler logs first. Common causes: syntax error in your Python code, file isn't in the DAGs folder (`~/airflow/dags` by default), or import error from missing dependencies. Error message: `AIRFLOW__CORE__DAGS_FOLDER not accessible to DagFileProcessor`. Fix permissions or move the file.
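Syntax errors are the cheapest of these causes to catch before deploying. A minimal pre-deploy check, assuming the DAGs folder is a plain directory of `.py` files; it only parses each file, so it will not catch missing dependencies (for those, `airflow dags list-import-errors` on a machine with Airflow installed is the usual tool).

```python
import ast
import sys
from pathlib import Path


def find_syntax_errors(dags_folder) -> dict:
    """Parse every .py file under dags_folder and return {path: error summary}."""
    errors = {}
    for path in sorted(Path(dags_folder).rglob("*.py")):
        try:
            ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except SyntaxError as exc:
            errors[str(path)] = f"line {exc.lineno}: {exc.msg}"
    return errors


if __name__ == "__main__":
    # Defaults to ./dags; pass a different folder as the first argument.
    folder = sys.argv[1] if len(sys.argv) > 1 else "dags"
    problems = find_syntax_errors(folder)
    for path, message in problems.items():
        print(f"{path}: {message}")
    sys.exit(1 if problems else 0)
```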

How do I test DAGs without breaking production?

Use [pytest](https://airflow.apache.org/docs/apache-airflow/stable/best-practices.html#unit-tests) for unit tests. Check that every DAG file imports cleanly by loading it through a `DagBag`, and run a single DAG end to end in one local process with `dag.test()`. For integration tests, spin up a separate Airflow instance with test data. Don't test in production - that's how you take down data pipelines at 3am.
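One common shape for the import check is sketched below. It assumes Airflow is installed in the test environment and that `DagBag` is importable from `airflow.models` (true for 2.x; the path may differ in later releases). The `dags/` folder and the helper names are illustrative.

```python
def load_import_errors(dags_folder: str) -> dict:
    """Load every DAG file the way the scheduler would and return import errors."""
    # Imported lazily so the rest of this module works without Airflow installed.
    from airflow.models import DagBag

    bag = DagBag(dag_folder=dags_folder, include_examples=False)
    return {str(path): str(err) for path, err in bag.import_errors.items()}


def format_import_errors(errors: dict) -> str:
    """Summarise each failing file by the last line of its traceback."""
    if not errors:
        return "all DAG files imported cleanly"
    lines = []
    for path, err in sorted(errors.items()):
        last_line = (err.strip().splitlines() or ["unknown error"])[-1]
        lines.append(f"{path}: {last_line}")
    return "\n".join(lines)


def check_dags_import_cleanly(dags_folder: str = "dags/") -> None:
    """Call this from a pytest test; it fails with a readable per-file summary."""
    errors = load_import_errors(dags_folder)
    assert not errors, format_import_errors(errors)
```

Wiring it into CI is then a one-line test such as `def test_dags(): check_dags_import_cleanly()`.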

Should I use dynamic DAGs?

Only if you enjoy pain. Dynamic DAGs work but make debugging harder and slow down the scheduler. If you have 50+ similar DAGs, consider using [DAG factories](https://airflow.apache.org/docs/apache-airflow/stable/howto/dynamic-dag-generation.html) but be careful not to recreate DAGs on every scheduler parse.

What are the biggest Airflow anti-patterns?

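- Using XComs for large data. XComs live in the metadata database, and the backend caps a single value at roughly 64 KB on MySQL, 1 GB on PostgreSQL, and 2 GB on SQLite. Anything beyond small metadata bloats the database long before you hit those caps.
- SubDAGs (they're [deprecated](https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/subdags.html) in 2.x and removed in 3.0, use TaskGroups)
- Heavy computation in DAG file top-level code (slows scheduler parsing)
- Running data processing in Airflow instead of external systems
- Not using `catchup=False` (unless you want to backfill everything)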
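The usual way around the XCom problem is to write the data somewhere else and pass only a small reference between tasks. A sketch of that pattern with a local directory standing in for object storage (S3, GCS, or similar); the function names are hypothetical, and in a real DAG the upstream task would return the reference and the downstream task would receive it.

```python
import json
import tempfile
import uuid
from pathlib import Path


def stage_payload(rows: list, staging_dir: str) -> dict:
    """Write rows to a file and return a small reference that is safe to pass as an XCom."""
    path = Path(staging_dir) / f"batch-{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(rows), encoding="utf-8")
    return {"path": str(path), "row_count": len(rows)}


def load_payload(ref: dict) -> list:
    """Read the rows back using the reference produced by stage_payload."""
    return json.loads(Path(ref["path"]).read_text(encoding="utf-8"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as staging:
        rows = [{"id": i, "value": i * 2} for i in range(10_000)]
        ref = stage_payload(rows, staging)      # what the upstream task returns
        print("reference size in bytes:", len(json.dumps(ref)))
        print("rows recovered:", len(load_payload(ref)))  # what the downstream task does
```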

Can I use Airflow for real-time streaming?

No. Airflow is for batch workflows, not streaming. Minimum scheduling interval is 1 minute, but realistically you're looking at 5+ minute intervals in production. Use [Kafka](https://kafka.apache.org/), [Pulsar](https://pulsar.apache.org/), or [Storm](https://storm.apache.org/) for streaming, then trigger Airflow DAGs when batches are ready.
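Triggering a DAG once a batch has landed is typically done through Airflow's REST API. The sketch below only builds the request, assuming the Airflow 2.x stable API path `/api/v1/dags/{dag_id}/dagRuns`; Airflow 3 serves its API under `/api/v2` and may require additional body fields, so check the API reference for your version. The host, DAG id, and `conf` keys are made up, and authentication is left out.

```python
import json
from urllib.parse import quote
from urllib.request import Request


def build_trigger_request(base_url: str, dag_id: str, conf: dict,
                          api_prefix: str = "/api/v1") -> Request:
    """Build (but do not send) a POST request that starts a DAG run with the given conf."""
    url = f"{base_url.rstrip('/')}{api_prefix}/dags/{quote(dag_id, safe='')}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Hypothetical host, DAG id, and payload.
    req = build_trigger_request(
        "http://airflow.example.internal:8080",
        "load_batch",
        {"batch_path": "s3://example-bucket/batches/2025-09-01/"},
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Sending it is a matter of adding whatever auth header your deployment uses and passing the request to `urllib.request.urlopen`.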

How do I monitor this thing in production?

- [StatsD metrics](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html) to Datadog/New Relic
- [Prometheus exporter](https://github.com/epoch8/airflow-exporter) for custom dashboards
- Email alerts on DAG failures (set up in DAG `default_args`)
- External health checks on the web UI endpoint
- Monitor scheduler lag - if it's over 30 seconds, you have problems
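For the external health check, the Airflow 2.x webserver exposes a `/health` endpoint that returns JSON with a status per component and the scheduler's latest heartbeat. A sketch of evaluating that payload; the payload shape is assumed from the 2.x docs (Airflow 3 moves the endpoint under its API path), and heartbeat age is used here as a rough stand-in for scheduler lag, with the 30-second threshold borrowed from the rule of thumb above.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen


def health_problems(payload: dict, now: datetime, max_heartbeat_age: float = 30.0) -> list:
    """Return human-readable problems found in a /health payload (empty list = healthy)."""
    problems = []
    for component, info in payload.items():
        status = info.get("status")
        if status is None:
            continue  # component not deployed, e.g. no triggerer
        if status != "healthy":
            problems.append(f"{component} is {status}")
    heartbeat = payload.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if heartbeat:
        age = (now - datetime.fromisoformat(heartbeat)).total_seconds()
        if age > max_heartbeat_age:
            problems.append(f"scheduler heartbeat is {age:.0f}s old")
    return problems


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """Fetch the health payload; the /health path assumes an Airflow 2.x webserver."""
    with urlopen(f"{base_url.rstrip('/')}/health", timeout=timeout) as resp:
        return json.load(resp)


def check(base_url: str) -> list:
    """Convenience wrapper for a cron job or monitoring agent."""
    return health_problems(fetch_health(base_url), datetime.now(timezone.utc))
```

Run `check(...)` from something outside Airflow, since a dead scheduler cannot alert on its own death.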

What hardware do I actually need?

Start with: 2 CPU cores, 8GB RAM for the scheduler. Add workers as needed (2 CPU, 4GB each). Database needs fast storage (SSD) and decent IOPS. Scale up when the scheduler starts falling behind - you'll know because DAGs will be late.

Airflow vs other tools - what should I choose?

Use Airflow if you: need complex dependencies, want Python-based workflows, need detailed monitoring, or are already in the Python ecosystem. Don't use it if you: need sub-minute scheduling, want a simple cron replacement (the lighter, also Python-based [Luigi](https://github.com/spotify/luigi) may be enough), or are primarily Java-based (stick with Jenkins).

Apache Airflow: AI-Optimized Technical Reference

Core Technology Overview

Apache Airflow is a Python-based workflow orchestrator for defining data pipelines as code. It was originally built at Airbnb in 2014, and open-sourced in 2015, to replace hundreds of brittle cron jobs that failed silently.

Production Scale Reality

Netflix: 100k+ workflows daily (requires army of engineers)
Adobe: Thousands of pipelines in production
Industry standard despite operational complexity

Architecture Components

Scheduler

Function: Core component that decides when DAG runs and tasks are due and hands them to the executor
Critical Breaking Point: Fails silently around 300 DAGs without proper tuning
Memory Requirements: 2-4GB minimum
Failure Scenario: Random stops during high load, requires manual restart
Operational Impact: Complete pipeline failure with no alerts

Web Server

Function: UI for monitoring and management
Resource Usage: Standard web server requirements
Common Failure: 500 errors when database connection breaks
Debugging Value: Essential for task log access and manual interventions

Executor Types

| Executor | Use Case | Breaking Points | Operational Overhead |
| --- | --- | --- | --- |
| LocalExecutor | Single machine | Memory limits, no distribution | Minimal |
| CeleryExecutor | Distributed workers | Redis crashes at 2am | High (Redis dependency) |
| KubernetesExecutor | Cloud-native | Cluster resource limits | Very High (K8s complexity) |

Database Requirements

Production Standard: PostgreSQL 12+
Critical Warning: MySQL has encoding issues with task metadata
Failure Mode: SQLite fails with multiple users
MariaDB: Not supported - will break unpredictably

Version Management

Current Version (September 2025)

Stable: Apache Airflow 3.0.6
Minimum Supported: 2.7+ (earlier versions have security issues)
Breaking Changes: Major changes in 3.0 release

Version Migration Impact

Airflow 3.0: Breaking changes require migration effort
Python Support: 3.9-3.12 on Airflow 3.0.x (3.9 is the minimum; 3.7 was dropped back in 2.7.0)
Upgrade Pain: Major version upgrades require significant testing

System Requirements

Minimum Production Requirements

Memory: 8GB minimum (scheduler: 2-4GB, workers: 1-2GB each)
CPU: 2 cores minimum for scheduler
Storage: SSD required for database, fast IOPS critical
OS: POSIX systems only (Windows via WSL2 with networking issues)

Scaling Thresholds

300 DAGs: Scheduler performance degrades
1000+ DAGs: Scheduler becomes bottleneck, requires tuning or splitting
Memory OOM: Occurs on 4GB instances with 200+ DAG files

Installation Methods

Difficulty Assessment

| Method | Setup Time | Failure Probability | Production Readiness |
| --- | --- | --- | --- |
| pip | 30 minutes | High (dependency hell) | Requires expertise |
| Docker | 15 minutes-2 hours | Medium | Good for local dev |
| Kubernetes | 3+ days | Very High | Enterprise ready |

Docker Installation

Command: `curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'`, then `docker compose up airflow-init` once to initialize the database, then `docker compose up`
Size: 2GB download, 7 containers
Common Failure: Port 8080 conflicts (always check first)

Configuration Gotchas

Critical Settings That Will Break

DAGs Folder: Must be ~/airflow/dags or explicitly configured
Scheduler Permissions: Silent failure without read permissions
Log Storage: Defaults to ~/airflow/logs, requires disk space monitoring
Catchup Setting: catchup=False prevents backfill hell

Default Settings That Fail in Production

DAG Parsing Interval: 30 seconds (`min_file_process_interval`; too frequent for large deployments)
Task Timeout: No default (tasks can hang forever)
Retry Logic: Manual configuration required
Connection Management: Secrets in UI visible to all users
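Since timeouts and retries have to be set by hand, one common approach is a shared `default_args` dictionary that every DAG passes to its constructor. A minimal sketch; `retries`, `retry_delay`, and `execution_timeout` are standard operator arguments, while the specific values and the helper name are illustrative assumptions.

```python
from datetime import timedelta

# Shared defaults applied to every task in a DAG that receives this dict.
# Values are illustrative; pick them based on how long your tasks really take.
DEFAULT_ARGS = {
    "retries": 2,                             # retry a failed task twice
    "retry_delay": timedelta(minutes=5),      # wait between attempts
    "execution_timeout": timedelta(hours=1),  # fail the task instead of hanging forever
}


def with_overrides(**overrides) -> dict:
    """Return a copy of the shared defaults with per-DAG overrides applied."""
    return {**DEFAULT_ARGS, **overrides}


if __name__ == "__main__":
    # In a DAG file this would be passed as:
    #   DAG("my_dag", default_args=with_overrides(retries=0), catchup=False, ...)
    print(with_overrides(retries=0))
```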

Common Production Failures

Scheduler Issues

Symptom: DAGs stop running randomly
Root Cause: Memory exhaustion, database disconnection, or parsing errors
Recovery Time: 15 minutes to 2 hours depending on diagnosis
Prevention: External monitoring required (built-in alerts insufficient)

Task Queue Problems

Symptom: Tasks stuck in "queued" state forever
Root Cause: Executor death (Redis crash for Celery, K8s resource limits)
Recovery: Full system restart usually required
Impact: Complete pipeline stoppage

Database Connection Failures

Symptom: Web UI 500 errors, scheduler stops
Root Cause: Database overload, connection limits, network issues
Recovery Time: Minutes to hours depending on root cause
Critical: Single point of failure for entire system

Performance Characteristics

Scheduling Limitations

Minimum Interval: 1 minute theoretical, 5+ minutes practical
Not Suitable For: Real-time streaming, sub-minute workflows
Batch Processing: Designed for ETL, not streaming
Latency: High compared to purpose-built schedulers

Resource Consumption Scaling

Linear Growth: Memory usage scales with DAG count
Parse Overhead: Every DAG file parsed every 30 seconds by default
Database Load: Increases exponentially with task volume
Network I/O: High with distributed executors

Technology Comparison Matrix

| Tool | Setup Difficulty | Failure Complexity | Learning Curve | Production Overhead |
| --- | --- | --- | --- | --- |
| Airflow | Very High | High (silent failures) | 2-3 weeks | High (DevOps team required) |
| Prefect | Low | Low (fails fast) | 3-5 days | Low (managed options) |
| Dagster | Medium | Medium | 1-2 weeks | Medium |
| Luigi | Very Low | Low (obvious breaks) | 2 days | Minimal |

Security Implementation

Enterprise Features

RBAC: Available but requires configuration
OAuth Integration: Supported
SSL/Encryption: Manual setup required
Audit Logging: Built-in but needs external storage

Security Gotchas

Secret Storage: Anything hardcoded in a DAG file is visible to every user who can view DAG code
Connection Strings: Stored in database, visible in UI
Production Integration: Requires external secret management (AWS Secrets Manager, Vault)

Monitoring Requirements

Critical Metrics to Track

Scheduler Lag: >30 seconds indicates problems
DAG Parse Time: Increases with DAG count
Task Duration: Detect hanging tasks
Database Connection Pool: Monitor for exhaustion
Memory Usage: Track for OOM prevention

External Monitoring Necessity

Built-in Alerts: Insufficient for production
Required Tools: Prometheus, Datadog, or similar
Health Check: External monitoring of web UI endpoint
Database Monitoring: Essential for preventing failures

Anti-Patterns and Common Mistakes

Data Processing Anti-Patterns

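XCom Size Limits: Stored in the metadata database; roughly 64 KB on MySQL, 1 GB on PostgreSQL, 2 GB on SQLite, and only small values are practical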
Heavy Computation: Should be in external systems, not Airflow
Large Data Transfer: Use external storage, not task communication

Architecture Anti-Patterns

SubDAGs: Deprecated in 2.x and removed in 3.0, use TaskGroups
Dynamic DAGs: Slows scheduler, complicates debugging
Top-level Code: Heavy computation in DAG files kills performance

Operational Anti-Patterns

Production Testing: Never test in production environment
Manual Secret Management: Use proper secret backends
Single Instance: No high availability without multiple schedulers

Decision Criteria

Use Airflow When

Complex workflow dependencies required
Python ecosystem integration needed
Detailed monitoring and logging required
Team has DevOps expertise available
Scale justifies operational overhead

Avoid Airflow When

Simple cron job replacement needed
Sub-minute scheduling required
Team lacks Python/DevOps expertise
Operational simplicity is priority
Real-time processing required

Resource Requirements for Success

Team Expertise Required

Python Development: Intermediate to advanced
DevOps/Infrastructure: Required for production
Database Administration: PostgreSQL expertise needed
Monitoring/Observability: Essential for operations

Time Investment

Initial Setup: 1-2 weeks for basic installation
Production Readiness: 1-2 months including monitoring
Team Training: 2-3 weeks per developer
Ongoing Maintenance: 20-30% of one engineer's time

Infrastructure Costs

Compute: Higher than simple schedulers due to architecture
Storage: Database and log storage requirements
Monitoring: Additional tooling required
Managed Services: AWS MWAA, Google Composer, Astronomer (expensive but functional)

Troubleshooting Decision Tree

DAGs Not Appearing

Check scheduler logs for syntax/import errors
Verify file location in DAGs folder
Check file permissions
Validate Python syntax

Tasks Stuck in Queue

Verify executor process running
Check Redis/database connectivity
Validate worker resource availability
Restart executor as last resort

Scheduler Performance Issues

Monitor DAG count (300+ threshold)
Increase parsing intervals
Add memory to scheduler
Consider DAG splitting

Database Connection Failures

Check connection limits
Monitor database resource usage
Validate network connectivity
Review connection pool settings

Production Deployment Patterns

High Availability Setup

Multiple Schedulers: Supported since 2.0 in an active-active setup; relies on database row locking, so use PostgreSQL or MySQL 8+
Database: Still single point of failure
Load Balancing: Web server only
Backup Strategy: Database backups critical

Scaling Strategies

Vertical: Add memory/CPU to scheduler
Horizontal: Multiple Airflow instances for large deployments
Resource Isolation: Separate environments for different teams
Performance Tuning: Scheduler configuration optimization

This reference provides the technical foundation and operational reality needed to make informed decisions about Apache Airflow implementation and management in production environments.

Useful Links for Further Investigation

Resources That Don't Suck

| Link | Description |
| --- | --- |
| Apache Airflow Official Docs | Better than most open source docs, actually. Start with the [Quick Start](https://airflow.apache.org/docs/apache-airflow/stable/start.html) if you want to get up and running without reading a novel. |
| Apache Airflow GitHub | Where to file bug reports when things inevitably break. Also good for seeing what's coming in future releases. |
| Provider Packages | 2000+ operators for every system you can think of. If there's no provider for your system, you'll have to write custom operators (good luck). |
| Airflow YouTube Channel | Monthly town halls where maintainers explain why your feature request won't be implemented for another 2 years. |
| Airflow Slack | 15,000+ people who've been through the same pain. Post your error and someone will tell you to check the logs (they're right). |
| Astronomer Academy | Free courses that are actually decent. Better than reading the docs for the 20th time. |
| Stack Overflow #airflow | Where you'll find the answer to your exact problem, posted 3 years ago by someone who never came back to mark it solved. |
| AWS MWAA | Amazon's managed service. Expensive but works. Good if you're already deep in AWS and have money to burn. |
| Google Cloud Composer | Google's take on managed Airflow. Solid integration with GCP data services, but can be slow to upgrade to new Airflow versions. |
| Astronomer | The gold standard for managed Airflow. They actually know what they're doing since they employ several core maintainers. Worth the money if you're serious. |
| Official Docker Images | The easy way to run Airflow locally. Download 2GB of images and pray your ports aren't already taken. |
| Helm Chart | Official Kubernetes deployment. Features 47 configuration options and will definitely break on your first try. |
| Airflow Tutorials | Start here if you're new. The examples actually work (unlike most tutorials on Medium). |
| Built-in Prometheus Metrics | Actually useful metrics. Monitor scheduler lag - if it's over 30 seconds, you have problems. |
| Datadog Integration | For when you want pretty dashboards showing exactly how broken your pipelines are. |
| Great Expectations | Data quality checks that will tell you your data is garbage (which you probably already knew). |
