Here's what nobody tells you about Airflow's scheduler in production: it's a piece of shit that will break in creative ways you never imagined.
My First Production Disaster: The DAG File Parsing Nightmare
We hit our first wall around 180 DAGs. Not 300 like the documentation suggests - 180. The scheduler would choke, CPU would spike to 100%, and tasks would just sit there doing nothing. The web UI would show everything as "queued" while our ETL jobs missed their SLA windows.
The error? Nothing. No fucking error. Just this in the logs:
```
INFO - Loaded 180 DAGs
INFO - Loaded 180 DAGs
INFO - Loaded 180 DAGs
```
Over and over, every 30 seconds. The scheduler was re-parsing every single DAG file on every cycle like a brain-damaged robot, each pass taking longer than the last, until parsing was all it ever did and actual scheduling never happened.
The fix that actually worked: Cranked `dag_dir_list_interval` up to 300 seconds and threw 8GB of RAM at the scheduler. Did it solve the problem? Kind of. Did it feel like applying duct tape to a structural problem? Absolutely. The official troubleshooting guide mentions this exact issue but buries it in paragraph 47.
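For the record, the knobs involved live in the `[scheduler]` section of `airflow.cfg`, or the equivalent `AIRFLOW__SCHEDULER__*` environment variables. Here's roughly what we run now; the values are ours, not universal truths, so tune them against your own DAG count and parse times:

```bash
# Rough sketch of our scheduler parsing config (values are what worked for us, not defaults).
export AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL=300     # rescan the DAGs folder every 5 minutes
export AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL=60  # re-parse each DAG file at most once a minute
export AIRFLOW__SCHEDULER__PARSING_PROCESSES=4           # more parallel file parsers (they eat that extra RAM)
```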
Silent Failures Are Airflow's Specialty
You know what's fun? When your scheduler process shows as "running" in htop but hasn't actually scheduled anything in 3 hours. This happened to us on a Tuesday morning when our overnight ETL jobs just... didn't run.
The actual error message (buried in logs you have to dig for):
```
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory
```
Translation: PostgreSQL connection died, but Airflow didn't bother to retry or alert anyone. It just kept pretending everything was fine while our data pipelines turned into expensive paperweights. The connection pooling documentation exists but good luck finding the actual settings that prevent this.
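For anyone still digging: the two settings that address exactly this are pre-ping and recycle on the SQLAlchemy pool, and you can pair them with a heartbeat check so you find out before Tuesday morning does. A rough sketch, assuming Airflow 2.3+ (the same keys sit under `[core]` on older releases) and with the alerting command as a placeholder:

```bash
# Connection hygiene for the metadata DB (shown as env vars; same keys in airflow.cfg).
export AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_PRE_PING=True  # test each pooled connection before handing it out
export AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_RECYCLE=1800   # retire pooled connections after 30 minutes

# Liveness probe (Airflow 2.1+): exits non-zero when no scheduler has heartbeated recently,
# which is exactly the "running in htop but scheduling nothing" failure mode.
if ! airflow jobs check --job-type SchedulerJob > /dev/null 2>&1; then
    echo "Airflow scheduler heartbeat is stale" | mail -s "Airflow scheduler down" oncall@example.com
fi
```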
The Database Becomes Your Enemy
Here's something I learned the hard way: Airflow treats your metadata database like a punching bag. Every task state change hammers the DB with writes. We started with MySQL because that's what the tutorial used. Big mistake.
At around 500 concurrent tasks, MySQL would shit itself. Connection pool exhaustion, deadlocks, and my personal favorite error:
```
ERROR - Task unable to sync to database: too many connections
```
Switching to PostgreSQL helped, but even Postgres starts crying when you hit 1000+ concurrent executions. And good luck figuring out the optimal connection pool settings - the documentation is about as helpful as a chocolate teapot. This GitHub issue has 200+ comments of people trying to figure out the same damn thing.
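Since that thread still has no canonical answer, here's the sketch we settled on. The numbers are ours, not a recommendation: size them against Postgres's `max_connections`, remember that the scheduler, webserver, workers, and triggerer each hold their own pool, and know that the usual next step beyond knobs like these is putting PgBouncer in front of Postgres:

```bash
# Pool sizing that stopped the exhaustion errors for us (your numbers will differ).
export AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE=10      # persistent connections per Airflow process
export AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW=20   # extra connections allowed under burst load
```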
Memory Leaks: The Weekly Restart Ritual
Every Friday at 2 PM, we restart all Airflow services. Why? Because the scheduler has a memory leak that would make Internet Explorer 6 proud.
Left to its own devices, it starts at 2GB of RAM. By day 5, it's consuming 12GB and the server is swapping like crazy. The OOM killer eventually puts it out of its misery, but by then your pipelines have been dead for hours.
The "solution": Cron job to restart the scheduler every 3 days. Professional? No. Does it work? Unfortunately, yes. Half the internet does this exact same hack.
Airflow 3.0 Won't Save You
Everyone's excited about Airflow 3.0 because it promises to fix the scheduler. I've tested it. Here's the truth: it's faster at being broken.
The new Task SDK is nice in theory, but now you have MORE components to babysit. The web UI is prettier, but still times out when you have actual data volumes. And the breaking changes mean you get to rewrite half your DAGs for the privilege of slightly better performance.
Bottom line: If you're hoping 3.0 will magically solve your production nightmares, prepare for disappointment.
The Real Cost: Your Sanity
Want to know the true cost of running Airflow? It's not the server resources or the database licenses. It's the 2 AM Slack messages, the weekend debugging sessions, and the constant fear that your data platform is held together with prayer and automated restarts.
You need someone who understands Airflow's quirks AND database tuning AND Kubernetes AND monitoring. Good luck finding that unicorn. Most teams end up with someone (usually me) who becomes the designated Airflow whisperer through sheer necessity.
I've seen grown engineers quit over Airflow. I've seen teams scrap months of work to switch to something simpler. And I've personally lost more sleep to Airflow scheduler crashes than I care to admit. Check out this Reddit thread if you want to see 300+ comments of engineers sharing similar pain.
Look, maybe you think I'm being dramatic. "It can't be that bad," you're probably thinking. Let me share the harsh reality of what actually happens when you try to scale this thing.