The Scheduler That Ruined My Weekends

[Image: Apache Airflow DAG overview dashboard]

Here's what nobody tells you about Airflow's scheduler in production: it's a piece of shit that will break in creative ways you never imagined.

My First Production Disaster: The DAG File Parsing Nightmare

We hit our first wall around 180 DAGs. Not 300 like the documentation suggests - 180. The scheduler would choke, CPU would spike to 100%, and tasks would just sit there doing nothing. The web UI would show everything as "queued" while our ETL jobs missed their SLA windows.

The error? Nothing. No fucking error. Just this in the logs:

INFO - Loaded 180 DAGs
INFO - Loaded 180 DAGs  
INFO - Loaded 180 DAGs

Over and over, every 30 seconds. The scheduler was parsing every single DAG file like a brain-damaged robot, taking longer each time until it got stuck in an infinite parsing loop.

The fix that actually worked: Cranked `dag_dir_list_interval` up to 300 seconds and threw 8GB of RAM at the scheduler. Did it solve the problem? Kind of. Did it feel like applying duct tape to a structural problem? Absolutely. The official troubleshooting guide mentions this exact issue but buries it in paragraph 47.
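
If you want to sanity-check that the new interval actually took effect, you can ask Airflow's own config loader. A minimal sketch, assuming Airflow 2.x (section and option names have moved around between versions, so verify against your install):

```python
# Minimal sketch: confirm what the scheduler is actually running with.
# Section/option names below are the Airflow 2.x ones and may differ in your version.
from airflow.configuration import conf

print(conf.getint("scheduler", "dag_dir_list_interval"))      # we cranked this to 300
print(conf.getint("scheduler", "min_file_process_interval"))  # also worth raising if parsing dominates
```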

Silent Failures Are Airflow's Specialty

You know what's fun? When your scheduler process shows as "running" in htop but hasn't actually scheduled anything in 3 hours. This happened to us on a Tuesday morning when our overnight ETL jobs just... didn't run.

The actual error message (buried in logs you have to dig for):

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: No such file or directory

Translation: PostgreSQL connection died, but Airflow didn't bother to retry or alert anyone. It just kept pretending everything was fine while our data pipelines turned into expensive paperweights. The connection pooling documentation exists but good luck finding the actual settings that prevent this.
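
For what it's worth, the settings that finally stopped the silent disconnects for us are just SQLAlchemy pool options that Airflow exposes through airflow.cfg / environment variables. Here's a hedged sketch of the equivalent SQLAlchemy calls - the env var names in the comments assume Airflow 2.3+, where these live under [database]; older versions keep them under [core]:

```python
# Sketch of the pool behavior we wanted from the metadata DB connection.
# The DSN is a placeholder; env vars in comments are the Airflow 2.3+ names.
from sqlalchemy import create_engine

engine = create_engine(
    "postgresql+psycopg2://airflow:***@metadata-db:5432/airflow",  # hypothetical DSN
    pool_size=10,        # AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE
    max_overflow=20,     # AIRFLOW__DATABASE__SQL_ALCHEMY_MAX_OVERFLOW
    pool_recycle=1800,   # recycle connections before Postgres or a firewall kills them
    pool_pre_ping=True,  # test each connection before use instead of failing mid-query
)
```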

The Database Becomes Your Enemy

Here's something I learned the hard way: Airflow treats your metadata database like a punching bag. Every task state change hammers the DB with writes. We started with MySQL because that's what the tutorial used. Big mistake.

At around 500 concurrent tasks, MySQL would shit itself. Connection pool exhaustion, deadlocks, and my personal favorite error:

ERROR - Task unable to sync to database: too many connections

Switching to PostgreSQL helped, but even Postgres starts crying when you hit 1000+ concurrent executions. And good luck figuring out the optimal connection pool settings - the documentation is about as helpful as a chocolate teapot. This GitHub issue has 200+ comments of people trying to figure out the same damn thing.
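
If you want early warning before Postgres hits the wall, the check is boring: compare pg_stat_activity against max_connections. A small sketch (the DSN is hypothetical - point it at your metadata database and wire the output into whatever alerting you have):

```python
# Count metadata-DB connections in use vs. the server limit, so you can yell
# before Airflow starts throwing "too many connections".
import psycopg2

with psycopg2.connect("dbname=airflow user=airflow host=metadata-db") as conn:
    with conn.cursor() as cur:
        cur.execute("SELECT count(*) FROM pg_stat_activity")
        in_use = cur.fetchone()[0]
        cur.execute("SHOW max_connections")
        limit = int(cur.fetchone()[0])

print(f"{in_use}/{limit} connections in use ({in_use / limit:.0%})")
```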

Memory Leaks: The Weekly Restart Ritual

Every Friday at 2 PM, we restart all Airflow services. Why? Because the scheduler has a memory leak that would make Internet Explorer 6 proud.

It starts at 2GB RAM usage. By day 5, it's consuming 12GB and the server is swapping like crazy. The OOM killer eventually puts it out of its misery, but by then your pipelines have been dead for hours.

The "solution": Cron job to restart the scheduler every 3 days. Professional? No. Does it work? Unfortunately, yes. Half the internet does this exact same hack.

Airflow 3.0 Won't Save You

Everyone's excited about Airflow 3.0 because it promises to fix the scheduler. I've tested it. Here's the truth: it's faster at being broken.

The new Task SDK is nice in theory, but now you have MORE components to babysit. The web UI is prettier, but still times out when you have actual data volumes. And the breaking changes mean you get to rewrite half your DAGs for the privilege of slightly better performance.

Bottom line: If you're hoping 3.0 will magically solve your production nightmares, prepare for disappointment.

The Real Cost: Your Sanity

Want to know the true cost of running Airflow? It's not the server resources or the database licenses. It's the 2 AM Slack messages, the weekend debugging sessions, and the constant fear that your data platform is held together with prayer and automated restarts.

You need someone who understands both Airflow's quirks AND database tuning AND Kubernetes AND monitoring. Good luck finding that unicorn. Most teams end up with someone (usually me) who becomes the designated Airflow whisperer through sheer necessity.

I've seen grown engineers quit over Airflow. I've seen teams scrap months of work to switch to something simpler. And I've personally lost more sleep to Airflow scheduler crashes than I care to admit. Check out this Reddit thread if you want to see 300+ comments of engineers sharing similar pain.

Look, maybe you think I'm being dramatic. "It can't be that bad," you're probably thinking. Everything above is the harsh reality of what happens when you try to scale this thing - the rest of this review is about how to avoid living through it yourself.

What I Wish Someone Had Told Me Before We Started

  • Prefect: Like Airflow but designed by people who actually run data pipelines. The cloud version just works. The open source version doesn't make you want to set yourself on fire.
  • AWS Step Functions: Limited but bulletproof. If you're already in AWS and your workflows aren't too complex, this will save your weekends.
  • dbt + cron: For 80% of data teams, this is enough. Seriously. Before you dive into Airflow's complexity circus, ask yourself if you really need orchestration or just data transformations on a schedule.
  • Dagster: The new kid on the block. Better development experience than Airflow, but still young enough to have sharp edges.

Three Simple Questions to Save Your Career

[Image: Airflow DAGs overview]

Question 1: Do You Have a Full-Time DevOps Person?

If the answer is no, stop reading. Don't use Airflow. I'm serious.

I've watched three different startups try to run Airflow with "the data engineer who knows Docker." It always ends the same way: that person becomes the designated Airflow firefighter, working weekends, getting burned out, and eventually quitting.

Real talk: Airflow isn't something you set up and forget. It's a needy, high-maintenance system that requires constant attention. You need someone who understands Kubernetes, database performance tuning, monitoring, and can debug Python memory leaks at 3 AM.

If you don't have that person, use Prefect Cloud or just stick with dbt + GitHub Actions.

Question 2: Do You Actually Need Complex Orchestration?

Most teams think they need Airflow when they actually just need scheduled data transformations. Before you dive into orchestration hell, ask yourself:

  • Are your workflows just "run SQL transformations in order"? → Use dbt
  • Do you need simple retries and notifications? → GitHub Actions or Prefect
  • Are you processing one dataset and outputting another? → You don't need orchestration

The complexity trap: I've seen teams spend 6 months setting up Airflow to run what was essentially a cron job with better error handling. That's like buying a semi-truck to deliver pizza.

The only time you actually need Airflow is when you have genuinely complex workflows: conditional branching, dynamic task generation, complex retry logic, or interdependent pipelines where failure in one affects dozens of others. Read this Uber engineering post if you want to see what "actually needing Airflow" looks like.
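
For calibration, "genuinely complex" looks something like the sketch below: branching on external state plus tasks fanned out at runtime. It's a toy written against the Airflow 2.x TaskFlow API (2.3+ for @task.branch and .expand, 2.4+ for the schedule argument), and names like partner_ingest are made up:

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def partner_ingest():
    @task
    def list_partner_files() -> list[str]:
        # pretend this hits an external API and returns a variable-length list
        return ["partner_a.csv", "partner_b.csv"]

    @task
    def load_file(path: str):
        print(f"loading {path}")

    @task.branch
    def full_or_incremental() -> str:
        # conditional branching: full reload on the 1st of the month, otherwise incremental
        return "full_reload" if datetime.now().day == 1 else "incremental_load"

    @task
    def full_reload():
        print("rebuilding everything")

    @task
    def incremental_load():
        print("loading yesterday only")

    # dynamic task mapping: one load task per file discovered at runtime
    load_file.expand(path=list_partner_files())
    full_or_incremental() >> [full_reload(), incremental_load()]

partner_ingest()
```

If your DAGs don't look anything like this - no branching, no runtime fan-out - that's a strong hint you're in dbt-plus-a-scheduler territory.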

Question 3: Can You Afford to Be Down for a Day?

Here's what nobody tells you: Airflow will go down. Not "might go down" - will go down. The scheduler will crash, the database will get corrupted, or some obscure Python package will break everything.

When that happens, can your business survive 8-24 hours of downtime while you figure out what the fuck went wrong?

If the answer is no, you need either:

  • A managed service (Astronomer, Google Composer) with proper SLAs
  • A simpler alternative that doesn't randomly explode
  • A backup plan that doesn't involve Airflow

The Airflow 3.0 Trap

Everyone's excited about Airflow 3.0 because the marketing promises to fix everything. I've tested it. Here's the truth: it fixes some things and breaks others.

Don't migrate to 3.0 if:

  • Your current setup is working ("barely working but stable" still counts - don't touch it)
  • You don't have 2-3 months for migration hell
  • Your team is already at capacity with other projects

Do migrate if:

  • You're already drowning in scheduler performance issues
  • You have the bandwidth for a major migration project
  • You're starting fresh anyway

Migration isn't just upgrading - you'll rewrite DAGs, update dependencies, retrain your team, and probably discover new and exciting ways for things to break.

What I'd Use Instead

After two years of Airflow pain, here's what I actually recommend:

For most teams: Prefect. The cloud version just works, the open-source version is Airflow without the operational nightmares. If you must self-host something, this is your best bet.

For simple workflows: dbt + GitHub Actions. Seriously. Transform your data, run tests, deploy to production. Most "orchestration" problems are actually just transformation scheduling problems.

For cloud-native teams: AWS Step Functions or Azure Logic Apps. Limited but bulletproof. If your workflows fit their constraints, you'll never get woken up at 2 AM.

For enterprise budgets: Google Cloud Composer or Astronomer managed. Let someone else deal with the operational complexity. It's expensive but cheaper than hiring a full-time Airflow babysitter.

The Brutal Truth

If you have to ask whether you need Airflow, you probably don't.

Airflow is like Kubernetes - it's an incredibly powerful tool that solves complex problems by creating different complex problems. Uber and large tech companies use it because they have teams of specialists to manage the complexity. Your 5-person startup probably doesn't.

Save your weekends. Use something simpler until complexity forces your hand. And when that day comes, hire someone who's already been through Airflow hell so you don't have to learn the hard way.

And once you've ignored all this advice and deployed Airflow anyway, you'll start asking the same questions everyone else asks when reality hits.

Questions I Get Asked (Usually at 3 AM via Slack)

Q: This piece of shit worked fine on my laptop, why is it broken in prod?

A: Because your laptop has 5 DAGs and production has 150. Also, you're not hammering the web UI with 20 people checking "why isn't my job running?" every 5 minutes.

Here's what actually breaks:

  • The scheduler chokes parsing DAG files when you have real volume
  • Database connections get exhausted (your laptop uses SQLite, prod uses Postgres with connection limits)
  • Memory leaks that don't matter for 30-minute dev sessions become critical after 3 days uptime

Quick fix: Double your memory allocation and restart the scheduler. Long-term fix: Accept that dev and prod are completely different beasts.

Q: The scheduler shows as "running" but nothing's happening. WTF?

A: Oh, this old chestnut. The scheduler is probably stuck in a parsing loop or has lost database connectivity but hasn't bothered to crash properly.

Check the logs for this delightful error:

INFO - Loaded 180 DAGs
INFO - Loaded 180 DAGs (repeating forever)

Or my personal favorite:

sqlalchemy.exc.DisconnectionError: Connection invalidated by a database disconnect, but it's still "running"

Nuclear option: systemctl restart airflow-scheduler and pray. Works 90% of the time.

Q: Can't we just pay someone else to deal with this shit? (AWS MWAA, etc.)

A: Sure, if you have $5k/month burning a hole in your pocket. AWS MWAA will take your money and give you Airflow that's 6 months behind the latest version.

Google Composer is faster to upgrade but costs even more. Astronomer is probably your best bet if you have enterprise budget - at least they understand Airflow's quirks.

Reality check: Managed services solve infrastructure headaches but cost 3-5x more than self-hosting. Do the math based on your team's time vs. money situation.

Q: How many DAGs can this thing handle before it dies?

A: Depends on how much pain you can tolerate. We hit problems at 180 DAGs. Some teams push 500+ with enough hardware and tuning.

Rule of thumb: When the scheduler starts consuming more RAM than your database, you're in trouble. When DAG parsing takes longer than your shortest task interval, you're fucked.

Netflix runs thousands of DAGs, but they also have a team of 20+ engineers whose job is keeping Airflow alive. You probably don't.

Q: Should I upgrade to Airflow 3.0?

A: Only if you enjoy pain and have 2+ months to burn on migration hell.

3.0 fixes some performance issues but breaks a bunch of other shit. The new CLI is nicer, but you'll need to retrain your team and rewrite half your DAGs.

If your current setup is working, leave it alone. "If it ain't broke, don't fix it" applies double to Airflow.

Q: How small a team can run this thing?

A: Minimum viable team: 3 engineers, with one person who doesn't mind getting woken up at 2 AM when the scheduler crashes.

Solo dev warning: Don't try this alone unless you enjoy being the single point of failure for your entire data platform. What happens when you're on vacation and Airflow decides to shit itself?

Q: What's this actually going to cost us?

A: More than you think. Budget for:

  • Infrastructure: $3-8k/month (servers, database, monitoring)
  • Your sanity: $150k/year for someone to babysit it full-time
  • Hidden costs: Consultant fees when you realize you're in over your head ($10k)

Total reality: $200k+ annually once you factor in the human cost.

Q: Can we migrate from [other tool] to Airflow?

A: From cron? Easy. From Jenkins? Doable. From Prefect or another actual orchestration tool? Prepare for 3-6 months of migration hell.

Pro tip: If your current tool is working, seriously consider why you want to migrate. "Everyone else uses Airflow" is not a good reason to blow up your working data platform.

Q: What happens when Airflow is down for a day?

A: Your data pipelines stop. All of them. Hope you have manual backup procedures documented somewhere.

High availability requires database replication, multiple schedulers, shared storage, and a bunch of other complexity that defeats the purpose of "simple orchestration."

Most teams just accept that Airflow outages = data platform outages and have runbooks for manual recovery.

Q: How do we keep secrets from showing up in logs?

A: Don't put secrets in DAG files, you absolute muppet. Use Airflow Connections or integrate with AWS Secrets Manager.

But seriously, if you hardcode database passwords in Python files, you deserve what happens to you.
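
The boring, correct pattern is to reference a conn_id and let Airflow (or whatever secrets backend you wire in) resolve it at runtime. A minimal sketch against the Airflow 2.x API - warehouse_postgres is a hypothetical conn_id:

```python
# Pull credentials from an Airflow Connection at runtime instead of hardcoding
# them in the DAG file; the secret stays in the metadata DB or secrets backend.
from airflow.hooks.base import BaseHook

def warehouse_dsn() -> str:
    conn = BaseHook.get_connection("warehouse_postgres")  # hypothetical conn_id
    return f"postgresql://{conn.login}:{conn.password}@{conn.host}:{conn.port}/{conn.schema}"
```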

Q: Is Airflow overkill for what we're doing?

A: If you have to ask, yes. Airflow is for complex workflows with interdependencies, conditional logic, and retry requirements.

If you just need to run SQL transformations on a schedule, use dbt. If you need basic orchestration, try Prefect. Save yourself the operational nightmare until you absolutely need Airflow's complexity.

Q: How do we monitor this clusterfuck?

A: Monitor everything, because Airflow will find new and creative ways to break:

  • Scheduler heartbeat (when it dies silently)
  • Database connection pool usage (when it gets exhausted)
  • Memory consumption (scheduler leaks like a sieve)
  • Disk space (logs will fill your drives)

Set up external health checks because Airflow's internal monitoring lies. When the scheduler is dead, the web UI will still show green.

Essential alert: Scheduler heartbeat > 60 seconds = wake someone up.
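
The external check that has caught the most dead schedulers for us is embarrassingly small: poll the webserver's /health endpoint and alert when the scheduler heartbeat goes stale. A sketch, assuming the Airflow 2.x JSON shape and a reachable webserver URL (both worth verifying against your deployment):

```python
# Out-of-band scheduler probe: treat a stale heartbeat as "down", regardless
# of what the UI claims. Hook the result into your paging system of choice.
from datetime import datetime, timezone
import requests

HEALTH_URL = "http://airflow-webserver:8080/health"  # hypothetical hostname

def scheduler_is_healthy(max_staleness_s: int = 60) -> bool:
    payload = requests.get(HEALTH_URL, timeout=10).json()
    scheduler = payload.get("scheduler", {})
    if scheduler.get("status") != "healthy":
        return False
    heartbeat = scheduler.get("latest_scheduler_heartbeat")
    if not heartbeat:
        return False
    last = datetime.fromisoformat(heartbeat.replace("Z", "+00:00"))
    return (datetime.now(timezone.utc) - last).total_seconds() < max_staleness_s

if not scheduler_is_healthy():
    print("scheduler heartbeat stale - wake someone up")
```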

What I'd Actually Use Today (September 2025)

My Final Advice After Two Years of Hell

[Image: Airflow graph view]

Don't Use Airflow Unless You Absolutely Have To

There, I said it. After debugging Airflow disasters at 3 AM more times than I can count, spending two years fighting with schedulers and memory leaks, and watching good engineers burn out over this system, my recommendation is simple: avoid Airflow unless you have no other choice.

When You Actually Need Airflow

The only times I'd recommend Airflow:

You're Uber or a similar massive tech company - You have 50+ engineers dedicated to data platform work, unlimited budget, and complex workflows that genuinely require Airflow's capabilities.

You have genuinely complex orchestration needs - Not "I need to run SQL on a schedule" complex, but "I need conditional branching based on external APIs, dynamic task generation, and complex retry logic across 500+ interdependent workflows" complex.

You already have it and it's working - If your current Airflow setup is stable and your team knows how to manage it, don't fix what isn't broken. Migration pain isn't worth it.

You have enterprise budget for managed services - If you can afford $100k+/year for Astronomer or Cloud Composer, and you actually need Airflow's complexity, then fine. Let someone else deal with the operational nightmare.

What Actually Happened to Our Team

We ditched Airflow in March 2025 and moved to Prefect Cloud. Here's what changed:

Before (Airflow):

  • 2-3 weekend debugging sessions per month
  • One person (me) became the designated firefighter
  • Scheduler restarts every 3 days via cron job
  • Team afraid to add new workflows because "what if it breaks the scheduler?"
  • 8GB+ RAM just for the scheduler process

After (Prefect Cloud):

  • Zero weekend incidents so far
  • Everyone on the team can add workflows without fear
  • No scheduler to babysit
  • $50/month vs. $5k/month infrastructure costs
  • Actually get useful error messages when things fail

The Truth About Airflow 3.0

Everyone's excited about 3.0, but I've tested it extensively. It's better, but it's still Airflow. The scheduler is more efficient, but it's still a complex distributed system that will find new ways to break. The new Task SDK is nice, but now you have more components to manage.

If you're on Airflow 2.x and it's working: Don't migrate unless you're already in pain. The improvements aren't worth 3 months of migration hell.

If you're starting fresh: Don't start with Airflow 3.0. Use something simpler and migrate to Airflow later if you actually need its complexity.

What I'd Do Instead

For 80% of teams: Start with dbt + GitHub Actions. Seriously. Most "orchestration" problems are actually transformation scheduling problems.

For Python teams: Prefect. The cloud version costs $50/month and just works. The open-source version gives you most of Airflow's power without the operational complexity.

For AWS shops: Step Functions. Limited but bulletproof. If your workflows fit the constraints, you'll never get woken up at 3 AM.

For enterprise teams with budget: Pay for managed Airflow (Astronomer, Cloud Composer) and let experts deal with the complexity. It's expensive but cheaper than hiring a dedicated Airflow babysitter.

The Skills You'll Need

If you ignore my advice and use Airflow anyway, make sure someone on your team has:

  • Kubernetes (or whatever you deploy on) and general infrastructure operations
  • Database performance tuning for the metadata DB
  • Python debugging, including hunting memory leaks
  • Monitoring and alerting that doesn't trust Airflow's own health reporting

Most importantly, that person needs to be able to debug complex distributed systems failures at 2 AM while slightly drunk at the company Christmas party. Ask me how I know.

My Real Recommendation

Save yourself the pain. Use Prefect, dbt, or even just cron jobs until complexity forces your hand. When that day comes - and it might not - hire someone who's already been through Airflow hell rather than learning the hard way.

Life's too short to spend weekends restarting schedulers and wondering why your tasks are stuck in "queued" state. There are better ways to build data pipelines in 2025.

The bottom line: Airflow is a powerful tool that solves complex problems by creating different complex problems. Make sure you actually need that power before signing up for that complexity.

If you're a small team and someone suggests Airflow, show them this review and suggest they might want to reconsider. Your future self will thank you. Check out this comparison of workflow orchestration tools if you need more ammunition for the argument.
