What Airflow Actually Is (And Why Your Data Team Needs It)

Airflow is a Python-based workflow orchestrator that lets you define data pipelines as code. It was born out of necessity at Airbnb in 2014 (open-sourced in 2015) when their data team got tired of managing hundreds of brittle cron jobs that would fail silently and leave everyone wondering why reports were broken.

Airflow Web Interface: The UI shows your DAGs as interactive graphs where you can see which tasks are running, failed, or completed. It's actually pretty useful once you figure out where everything is.

How This Thing Actually Works

Airflow Architecture Diagram

Here's what you're dealing with when you deploy Airflow:

Scheduler: The brain that decides which tasks run and when. It's supposed to be reliable, but it will quietly fall over around 300 DAGs unless you tune it properly. We learned this during a Black Friday deployment. Scale early.

Web Server: The UI where you'll spend most of your time trying to figure out why tasks failed. It's actually pretty decent - you can see logs, trigger reruns, and pretend you understand why the scheduler decided to skip your task.

Executor: Defines where tasks run. LocalExecutor works for single machines, CeleryExecutor works great until Redis crashes at 2am, and KubernetesExecutor is what you use when you want to make your infrastructure team hate you.

Metadata Database: PostgreSQL is your best bet. MySQL works but has weird edge cases. SQLite is fine for testing until you have more than one user.

Workers: The things that actually run your code. They'll work fine until you run out of memory, then they'll die silently and your tasks will hang forever.

DAGs (Directed Acyclic Graphs)

Workflow Structure: DAGs represent your data pipeline as a graph of connected tasks, each with dependencies and retry logic.

This is Airflow's fancy name for "workflow." You write Python code that defines:

  • What tasks to run (PythonOperator, BashOperator, etc.)
  • When to run them (scheduling and dependencies)
  • What to do when they fail (retry 3 times then wake up the on-call engineer)
  • How long to wait before giving up (task timeouts)

Dependencies look like this: task_a >> task_b >> task_c. Simple enough that even your product manager can understand it.
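
Here's a minimal sketch of what that looks like in practice - the DAG id, task ids, and commands are made up, but the operator arguments (retries, retry_delay, execution_timeout) are the standard ones:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    'extract_transform_load',        ## hypothetical DAG id
    start_date=datetime(2025, 9, 1),
    schedule='@daily',
    catchup=False,
) as dag:
    task_a = BashOperator(task_id='extract', bash_command='echo extract')
    task_b = BashOperator(
        task_id='transform',
        bash_command='echo transform',
        retries=3,                                # retry 3 times...
        retry_delay=timedelta(minutes=5),         # ...5 minutes apart
        execution_timeout=timedelta(minutes=30),  # then give up on the attempt
    )
    task_c = BashOperator(task_id='load', bash_command='echo load')

    task_a >> task_b >> task_c  # extract, then transform, then load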

DAG Workflow Example

DAG Visualization: Tasks show up as colored boxes (green = success, red = failed, yellow = running). You can click on any task to see logs, retry failed tasks, or mark them as successful if you're feeling dangerous.

Production Reality Check

High Availability: You can run multiple schedulers (they're active-active since Airflow 2.0 and genuinely share the work), but the metadata database is still a single point of failure.

Security: Has RBAC, OAuth, SSL, and encrypted connections. Basically enterprise-ready if you spend the time configuring it properly.

Monitoring: Built-in Prometheus metrics and email alerts. You'll still need external monitoring because the scheduler can fail without any alerts.

Scaling: Works until it doesn't. The scheduler becomes a bottleneck around 1000+ DAGs. You'll need to tune scheduler performance or split into multiple Airflow instances.

Version Reality (September 2025)

Apache Airflow 3.0.6 is the current version as of September 2025. Airflow 3.0 was a major release in early 2025 with breaking changes. Don't use anything before 2.7 - too many security issues.

Major 3.0 changes include DAG versioning, a rewritten React-based UI, a new task execution interface that stops workers from talking directly to the metadata database, and the removal of long-deprecated pieces like SubDAGs and execution_date.

Netflix runs 100k+ workflows daily on Airflow, but they also have an army of engineers keeping it running. Adobe manages thousands of pipelines, and both companies contribute heavily to making it less terrible for the rest of us.

Now that you know what you're getting into, let's see how Airflow stacks up against the competition - because there are definitely easier alternatives if your use case doesn't demand Airflow's complexity.

Airflow vs The Competition (What Actually Matters)

| Feature | Apache Airflow | Prefect | Dagster | Luigi |
|---|---|---|---|---|
| Setup Difficulty | Pain in the ass (database, scheduler, workers) | Actually easy (cloud or local) | Moderate (few services) | Dead simple (single process) |
| When It Breaks | Scheduler dies silently, debug for hours | Fails fast with clear errors | Good error messages | Breaks obviously |
| Learning Time | 2-3 weeks to not hate it | 3-5 days to be productive | 1-2 weeks (if you know dbt) | 2 days max |
| Community Help | Stack Overflow has answers | Good docs, smaller community | Modern docs, growing | Good luck |
| Enterprise Ready | Yes (RBAC, encryption, audit) | Getting there | Yes (data lineage is nice) | Lol no |
| Operational Overhead | High (need DevOps team) | Low (managed cloud options) | Medium | None |
| Real Use Case | 100k+ DAGs at Netflix | ML pipelines that change a lot | dbt + data assets | Simple ETL that just works |

Getting Airflow Running (Without Losing Your Weekend)

Airflow installation will fight you. Here's how to win without questioning your career choices.

System Requirements (The Stuff That Will Actually Break)

Airflow runs on POSIX systems (Linux, macOS). Windows works via WSL2, but you'll spend more time debugging networking issues than actually building pipelines.

Python: Supports 3.9, 3.10, 3.11, 3.12, and 3.13 (added in July 2025). Python 3.8 was dropped in Airflow 2.7.0 because it hit end-of-life. Python 3.13 is fully supported as of Airflow 3.0, but check your provider packages for compatibility if you use many third-party operators.

Memory: Start with 8GB minimum. The scheduler will eat 2-4GB by itself, and each worker needs 1-2GB depending on your DAGs. We've seen the scheduler OOM on 4GB instances when parsing 200+ DAG files.

Database: Use PostgreSQL 12+ in production. Period. MySQL 8.0 technically works but has encoding issues with certain task metadata. SQLite is fine for testing until you try to run more than one component.

MariaDB is not supported and will break in weird ways. Just don't.

Installation Methods (Pick Your Poison)

pip Installation (Works until it doesn't):

## This will conflict with your existing packages
pip install 'apache-airflow[postgres,celery,redis]'

Pro tip: Use a virtual environment or pipx to avoid dependency hell, and install with Airflow's published constraint files so pip doesn't resolve you into a broken set of dependencies. The [postgres,celery,redis] extras are what you actually need - don't install the base package alone.

Docker Setup: The easiest way to get Airflow running locally without dealing with dependency hell.

Airflow Docker Compose Setup

Docker Deployment (15 minutes if lucky, 2 hours if not):

curl -LfO 'https://airflow.apache.org/docs/apache-airflow/stable/docker-compose.yaml'
docker-compose up airflow-init   ## one-time metadata DB init and default user
docker-compose up

The compose file itself is tiny, but it pulls roughly 2GB of images and starts 7 containers. Port 8080 will be taken by something else on your machine, guaranteed. Check the Docker setup guide when it breaks.

Kubernetes Deployment (For when you hate your weekend):
The official Helm chart is production-ready if you enjoy spending 3 days configuring YAML files. Your infra team will thank you (after they stop cursing your name).

The Setup Ritual (What Actually Breaks)

Copy these commands and pray:

## Initialize the metadata database (will fail if the DB isn't running)
## "airflow db init" is deprecated since 2.7 - newer versions use migrate
airflow db migrate

## Create admin user (will prompt for a password you'll forget)
airflow users create --username admin --firstname Admin --lastname User --role Admin --email admin@example.com

## Start components (or run "airflow standalone" to get everything in one throwaway local process)
airflow webserver -p 8080 &
airflow scheduler &

Pro tip: The DAGs folder defaults to ~/airflow/dags. If you put files anywhere else, they won't show up and you'll waste an hour wondering why.

Common gotchas:

  • The scheduler needs read permissions on the DAGs folder and will fail silently if it doesn't have them
  • Task logs go to ~/airflow/logs by default - make sure there's disk space
  • Connections for databases/APIs go through the web UI under Admin > Connections

Your First DAG (That Actually Works)

Here's a DAG that won't immediately break:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.bash import BashOperator

## Shared settings applied to every task in this DAG
default_args = {
    'owner': 'your_name_here',
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5)
}

## catchup=False prevents Airflow from running every day since start_date
with DAG(
    'hello_world_that_works',
    default_args=default_args,
    description='A simple tutorial DAG',
    schedule='@daily',
    start_date=datetime(2025, 9, 1),
    catchup=False,
    tags=['example']
) as dag:
    
    hello_task = BashOperator(
        task_id='say_hello',
        bash_command='echo "Hello World from task $(date)"',
    )

Save this as hello_world.py in your DAGs folder. It'll show up in the UI after 30-60 seconds (the scheduler parses DAG files every 30 seconds by default).
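
If you don't want to wait for the scheduler at all, newer Airflow versions (2.5+) can run a DAG file directly with dag.test() - a quick sketch, assuming you append this to the bottom of hello_world.py:

## Run the DAG once, locally, with no scheduler involved
if __name__ == "__main__":
    dag.test()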

DAGs View: Your DAGs appear in the web UI with status indicators. Green = running fine, red = something's broken and you need to fix it before someone notices.

Things That Will Definitely Go Wrong

"My DAG isn't showing up": Check the scheduler logs. It's probably a Python syntax error, import error, or the file isn't in the DAGs folder.

"Tasks are stuck in queued state": The executor died. Restart everything with docker system prune -a && docker-compose up if using Docker.

"Scheduler randomly stops working": This happens around 300+ DAGs. You'll need to tune the scheduler settings or add more memory.

"Web UI shows 500 errors": Database connection is probably broken. Check your database is running and connection string is correct.

Nuclear option: Delete everything and start over: rm -rf ~/airflow && docker system prune -a

Start with the official tutorial once you get the basic setup working. It covers TaskFlow API, dependencies, and the stuff you'll actually use in production.
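
For reference, the TaskFlow API turns decorated Python functions into tasks and infers dependencies from function calls. A minimal sketch (the DAG and function names here are made up):

from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule='@daily', start_date=datetime(2025, 9, 1), catchup=False)
def taskflow_example():

    @task
    def extract():
        return {'rows': 42}

    @task
    def load(payload: dict):
        print(f"loaded {payload['rows']} rows")

    ## Passing extract()'s output into load() creates the dependency
    load(extract())

taskflow_example()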

Even with this guide, you're going to run into weird issues. Below are the questions every Airflow developer asks - usually at 3am when something's broken and deadline pressure is mounting.

Questions People Actually Ask (At 3AM)

Q

Why did my DAG randomly stop running?

A

Usually the scheduler crashed silently. Check docker logs airflow-scheduler or the scheduler logs. Common causes: ran out of memory (scheduler OOMs around 300+ DAGs), database connection died, or someone deployed broken DAG code that crashed the parser. Nuclear option: restart everything.

Q

Why is Airflow so fucking slow?

A

Because you have 500 DAGs and one scheduler trying to parse them all every 30 seconds. Increase dag_dir_list_interval to 300 seconds, set dag_file_processor_timeout higher, or add more memory. Or just accept that Airflow isn't fast and plan accordingly.

Q

My tasks are stuck in "queued" state forever

A

Your executor is dead. If using CeleryExecutor, Redis probably crashed. If using KubernetesExecutor, check your cluster has resources. For LocalExecutor, the worker processes died. Solution: docker system prune -a && docker-compose up

Q

What executor should I actually use?

A

LocalExecutor if you're on one machine and it's not going to prod. CeleryExecutor if you want distributed workers and enjoy debugging Redis connection issues at 2am. KubernetesExecutor if you hate your infrastructure team and want them to hate you back.

Q

How do I handle secrets without hardcoding passwords in my DAGs?

A

Use Airflow Connections through the UI (Admin > Connections).

For a real production setup, integrate with AWS Secrets Manager or HashiCorp Vault as a secrets backend. Don't put passwords in DAG files - they're visible to everyone.
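
Inside a task, you reference the connection by its ID instead of embedding credentials. A rough sketch, assuming the apache-airflow-providers-postgres package is installed and a connection named my_warehouse exists under Admin > Connections:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def load_row_count():
    ## Credentials come from the Airflow connection, not from this file
    hook = PostgresHook(postgres_conn_id='my_warehouse')
    rows = hook.get_records('SELECT count(*) FROM events')
    print(rows)
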
Q

Why aren't my DAGs showing up in the web UI?

A

Check the scheduler logs first. Common causes: syntax error in your Python code, file isn't in the DAGs folder (~/airflow/dags by default), or import error from missing dependencies. Error message: AIRFLOW__CORE__DAGS_FOLDER not accessible to DagFileProcessor. Fix permissions or move the file.

Q

How do I test DAGs without breaking production?

A

Use pytest for unit tests, and dag.test() to run a DAG end-to-end locally. For integration tests, spin up a separate Airflow instance with test data. Don't test in production - that's how you take down data pipelines at 3am.
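
A minimal pytest sketch that catches broken imports before the scheduler does - the dags/ path is an assumption, point DagBag at your own folder:

from airflow.models import DagBag

def test_no_import_errors():
    ## Parse every DAG file the same way the scheduler would
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    assert dag_bag.import_errors == {}, dag_bag.import_errors

def test_hello_world_loads():
    dag_bag = DagBag(dag_folder='dags/', include_examples=False)
    dag = dag_bag.get_dag('hello_world_that_works')
    assert dag is not None
    assert len(dag.tasks) > 0
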
Q

Should I use dynamic DAGs?

A

Only if you enjoy pain. Dynamic DAGs work but make debugging harder and slow down the scheduler. If you have 50+ similar DAGs, consider using DAG factories but be careful not to recreate DAGs on every scheduler parse.
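
If you do go down that road, the usual pattern is a factory function plus module-level registration so each generated DAG gets a stable, unique dag_id. A sketch with made-up table names:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

def build_export_dag(table):
    with DAG(
        dag_id=f'export_{table}',          # unique, stable id per table
        start_date=datetime(2025, 9, 1),
        schedule='@daily',
        catchup=False,
    ) as dag:
        BashOperator(task_id='export', bash_command=f'echo exporting {table}')
    return dag

## Register each DAG at module level so the scheduler's parser can find it
for table in ['users', 'orders', 'payments']:
    globals()[f'export_{table}'] = build_export_dag(table)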

Q

What are the biggest Airflow anti-patterns?

A
  • Using XComs to pass large data between tasks (they live in the metadata database - keep payloads small and push anything big to external storage)
  • SubDAGs (they're deprecated - use TaskGroups, sketched below)
  • Heavy computation in DAG file top-level code (slows scheduler parsing)
  • Running data processing in Airflow instead of external systems
  • Not using catchup=False (unless you want to backfill everything)
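
TaskGroups give you the visual grouping SubDAGs used to provide without spinning up a nested DAG. A minimal sketch:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.task_group import TaskGroup

with DAG(
    'taskgroup_example',
    start_date=datetime(2025, 9, 1),
    schedule='@daily',
    catchup=False,
) as dag:
    start = BashOperator(task_id='start', bash_command='echo start')

    ## The group shows up as one collapsible box in the graph view
    with TaskGroup(group_id='transforms') as transforms:
        clean = BashOperator(task_id='clean', bash_command='echo clean')
        enrich = BashOperator(task_id='enrich', bash_command='echo enrich')
        clean >> enrich

    done = BashOperator(task_id='done', bash_command='echo done')

    start >> transforms >> done
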
Q

Can I use Airflow for real-time streaming?

A

No. Airflow is for batch workflows, not streaming. Minimum scheduling interval is 1 minute, but realistically you're looking at 5+ minute intervals in production. Use Kafka, Pulsar, or Storm for streaming, then trigger Airflow DAGs when batches are ready.

Q

How do I monitor this thing in production?

A
  • StatsD metrics to Datadog/New Relic
  • Prometheus exporter for custom dashboards
  • Email alerts on DAG failures (set up in DAG default_args - see the sketch below)
  • External health checks on the web UI endpoint
  • Monitor scheduler lag - if it's over 30 seconds, you have problems
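
Email-on-failure is just default_args plus working SMTP settings in airflow.cfg - a sketch, with a placeholder address:

from datetime import timedelta

## Pass this as DAG(default_args=...) so every task inherits it
default_args = {
    'owner': 'data-eng',
    'email': ['oncall@example.com'],   # placeholder - use your on-call alias
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

For the external health check, the web server exposes a /health endpoint that reports scheduler and metadata database status - point your uptime monitoring at that instead of trusting Airflow to alert on itself.
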
Q

What hardware do I actually need?

A

Start with: 2 CPU cores, 8GB RAM for the scheduler. Add workers as needed (2 CPU, 4GB each). Database needs fast storage (SSD) and decent IOPS. Scale up when the scheduler starts falling behind - you'll know because DAGs will be late.

Q

Airflow vs other tools - what should I choose?

A

Use Airflow if you: need complex dependencies, want Python-based workflows, need detailed monitoring, or are already in the Python ecosystem. Don't use it if you: need sub-minute scheduling, want simple cron replacement, or are primarily Java-based (use Luigi or stick with Jenkins).
