Zuul - The CI System That Actually Tests Changes Together

Why Traditional CI is Broken and How Zuul Actually Fixes It

Picture this: You push a change, tests pass, you merge. Five minutes later, main is broken. Sound familiar? That's because traditional CI tests your change in isolation, pretending the other 47 changes that merged while you were coding don't exist.

The Problem Every Multi-Repo Team Faces

Here's what actually happens in traditional CI:

Developer A pushes a change that breaks when combined with Developer B's pending change
Both changes pass CI individually because they're tested against old main
Both merge in quick succession
Main branch is now fucked
Everyone spends the next 2 hours figuring out whose fault it is

Project gating fixes this by testing what your change looks like AFTER all the pending changes merge. It's like testing the actual future state instead of some fantasy version where your change exists in isolation.

OpenStack learned this the hard way managing 300+ interconnected repositories. Their solution was Zuul, because when you have that many moving pieces, traditional CI becomes a daily exercise in frustration. During the recent Epoxy release cycle, Zuul ran over 1.1 million jobs - that's the scale where this complexity becomes justified.

What Makes Zuul Different (And Why Setup Sucks)

Cross-Project Testing: Unlike Jenkins or GitHub Actions, Zuul can test changes across multiple repositories simultaneously. When your library change affects 12 downstream projects, Zuul tests all of them together. Try doing that with traditional CI - you'll end up with a mess of triggers and dependencies.

Ansible Everything: Every job is an Ansible playbook. This means the same code that tests your application can deploy it. Sounds great until you realize you now need to become an Ansible expert whether you wanted to or not.

Dynamic Infrastructure: Nodepool spins up fresh VMs for every job. No more "works on my machine" because every test runs in a clean environment. Also no more permanent build agents eating resources 24/7. The downside? You now have to manage a cloud infrastructure orchestration layer.

Microservices Hell: Zuul consists of separate services for scheduling (zuul-scheduler), execution (zuul-executor), merging (zuul-merger), and web UI (zuul-web). Plus ZooKeeper for coordination and Nodepool for infrastructure. That's a lot of moving parts that can break at 3 AM.

Zuul 13.0.0 supports Ansible 11 and includes performance improvements, but don't expect the setup complexity to magically disappear. The latest release focused on stability fixes and better error handling, which you'll need when things break at 3 AM.

The Real Cost of Traditional CI Failure

The cost is brutal. OpenStack's research shows that traditional CI systems create a cascade of failures that can cost teams days of productivity. When a broken change merges, it blocks everyone else's work until someone figures out what broke and rolls it back.

Jenkins comparison studies demonstrate why project gating beats traditional CI for multi-repository projects. Jenkins might work fine for single repos, but try coordinating changes across dozens of interdependent projects and you'll quickly understand why OpenStack moved away from Jenkins to Zuul.

Companies like LeBonCoin use Zuul for testing at scale precisely because traditional CI tools fail when you need to coordinate changes across multiple teams and repositories. The Zuul community FAQ specifically addresses why generic automation tools like Jenkins can't handle the complexity of proper project gating.

Zuul vs Every Other CI Tool (Spoiler: You Probably Don't Need Zuul)

Feature	Zuul	Jenkins	GitLab CI	GitHub Actions	CircleCI
Project Gating	✅ Actually works	❌ Plugin nightmare	❌ Merge queues (sort of)	❌ Branch protection theater	❌ Doesn't exist
Multi-Repo Testing	✅ Built for this	🤮 Pipeline from hell	⚠️ If you enjoy pain	⚠️ Dispatch events mess	⚠️ Workflow spaghetti
Setup Time	2-3 weeks minimum	Few hours	30 minutes	15 minutes	10 minutes
Active Jobs Scale	1.1M jobs (OpenStack)	Hundreds of thousands	Millions (hosted)	Millions (hosted)	Thousands
Configuration	YAML + Ansible hell	Groovy nightmares	Clean YAML	Clean YAML	Clean YAML
When It Breaks	Debug 5 microservices	Check plugins	Read logs	Usually works	Usually works
Resource Usage	Heavy (ZooKeeper + Nodepool)	Heavy (permanent agents)	Light (shared runners)	Light (hosted)	Light (cloud)
Learning Curve	Mount Everest	Steep hill	Gentle slope	Gentle slope	Gentle slope
Plugin Ecosystem	What plugins?	Plugin chaos	Built-in features	Marketplace	Extensions
Vendor Lock-in	None (good luck leaving)	None	GitLab-centric	GitHub-centric	CircleCI-centric

Setting Up Zuul: A Journey Through Infrastructure Hell

Setting up Zuul is not a weekend project. Plan for weeks, not hours. If you're expecting a "quick start," prepare for disappointment. Here's what actually happens when you try to implement this thing.

The Architecture That Will Consume Your Life

Zuul consists of several microservices that all need to work together. When they don't (and they won't), debugging becomes a full-time job:

zuul-scheduler: The brain that decides what gets tested when. When this breaks, nothing works. It talks to ZooKeeper constantly and will fail in mysterious ways if connectivity hiccups.

zuul-executor: Runs your Ansible playbooks. Expects a perfect Ansible environment and will throw cryptic errors if anything is slightly wrong. Scales horizontally, which sounds great until you're debugging why executor-03 behaves differently than executor-01.

zuul-merger: Creates the "future state" by merging all pending changes. Works beautifully until it encounters a merge conflict, then everything stops and you get to figure out why.

zuul-web: The React dashboard that shows you what's happening. Usually the only component that actually works reliably.

ZooKeeper: Coordinates everything. When ZooKeeper hiccups (and it will), your entire CI system becomes useless. Hope you enjoy debugging distributed consensus algorithms. The latest ZooKeeper 3.9 is more stable, but split-brain scenarios during network partitions will still ruin your day.

Nodepool: Manages your cloud resources. Will happily consume your entire cloud budget if misconfigured. OpenStack users love this because they can provision unlimited VMs. AWS users discover that unlimited VMs cost unlimited money.

Real Organizations That Actually Use This

Notice a pattern? These are organizations with dedicated DevOps teams and serious engineering budgets.

The Setup Reality

Time Investment: Expect 2-3 weeks for basic setup, 2-3 months for production-ready deployment. The OpenMetal production guide shows what real deployment looks like.

Infrastructure: You'll need ZooKeeper (good luck), Nodepool (cloud orchestration nightmare), and enough compute resources to satisfy Zuul's appetite for fresh VMs.

Expertise Required: Ansible mastery is mandatory. YAML debugging skills are essential. Distributed systems knowledge helps when everything breaks at 3 AM.

Migration Strategy: Start small or regret it. Don't try to migrate everything at once unless you enjoy pain. The OpenDev containerized setup is actually helpful for learning.

If you don't have dedicated infrastructure engineers, consider managed Zuul services instead of torturing yourself with self-hosting.

Learning Resources That Don't Lie

The Software Factory project documentation provides realistic deployment guides without the marketing fluff. They've dealt with the pain so you don't have to discover every gotcha yourself.

Red Hat's OpenStack CI documentation shows how they actually use Zuul in production. This isn't theoretical - it's battle-tested configuration that handles thousands of daily commits.

For the masochists who want to understand every component, the academic analysis of release synchronization in OpenStack shows why project gating became necessary at scale.

Questions Engineers Actually Ask About Zuul

Can small teams use Zuul or is it complete overkill?

Fuck no, unless you enjoy suffering. Zuul is for teams that have so many repositories they can't keep track. If you have fewer than 50 repos that depend on each other, use GitHub Actions and save yourself the headache.

How long does it actually take to set up Zuul?

2-3 weeks minimum if you know what you're doing. 2-3 months for production-ready. The "quick start" guide is lies. Budget for ZooKeeper debugging, Ansible hell, and cloud resource management nightmares. Pro tip: Start with the containerized tutorial. At least Docker containers fail faster than VMs.

What's the current version and should I wait for the next release?

Zuul 13.0.0 includes Ansible 11 support and stability improvements. Don't wait for the next version: the complexity doesn't get better, just different. If you need this level of project gating, the current version works fine.

What happens when ZooKeeper breaks at 3 AM?

Your entire CI system becomes useless. A single ZooKeeper node is a single point of failure, and even a multi-node ensemble goes down once it loses its majority, usually in mysterious ways. Learn to love distributed consensus debugging or pay someone else to deal with it.
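How many ZooKeeper nodes can die before the whole thing stops is plain arithmetic: an ensemble stays available only while a strict majority of its nodes is up. A minimal sketch of that rule (the function names are illustrative, not part of any ZooKeeper or Zuul API):

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many nodes can be lost while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for nodes in (1, 3, 4, 5):
        print(nodes, "nodes tolerate", tolerated_failures(nodes), "failure(s)")
```

One node tolerates zero failures, three tolerate one, and four still tolerate only one, which is why odd ensemble sizes are the usual choice.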

Does the GitHub integration actually work properly?

It works, but GitHub's webhook delays can screw with the gating logic. The GitHub driver exists but Gerrit integration is more mature. Don't expect GitHub pull requests to behave exactly like Gerrit changes.

How much will Nodepool cost me on AWS?

However much you have. Nodepool will happily provision unlimited VMs if misconfigured. Set strict limits (the max-servers setting caps how many nodes a pool will launch) or watch your cloud bill explode. OpenStack users running their own cloud love this because extra VMs cost them nothing.
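Before picking a node limit, it helps to work out the worst case: every allowed node running all month. A back-of-the-envelope sketch, where the hourly rate is a placeholder you must replace with your real instance price and 730 is the average number of hours in a month:

```python
HOURS_PER_MONTH = 730  # average hours in a calendar month


def worst_case_monthly_cost(max_servers: int, hourly_rate_usd: float) -> float:
    """Upper bound on compute spend if every allowed node runs all month."""
    if max_servers < 0 or hourly_rate_usd < 0:
        raise ValueError("max_servers and hourly_rate_usd must be non-negative")
    return max_servers * hourly_rate_usd * HOURS_PER_MONTH


if __name__ == "__main__":
    # Hypothetical numbers: a cap of 20 nodes at $0.10 per node-hour.
    print(worst_case_monthly_cost(20, 0.10))
```

With those hypothetical numbers the ceiling is about $1,460 a month for compute alone; storage and network are extra. No cap means no ceiling.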

Can I migrate from Jenkins without rewriting everything?

No. Jenkins jobs are shell scripts or Groovy. Zuul jobs are Ansible playbooks. You'll rewrite everything. This is actually good long-term but painful short-term.

What's the difference between this Zuul and Netflix Zuul?

Completely different projects. Netflix Zuul is an API gateway. This Zuul is a CI system. The naming collision is unfortunate and confusing.

Why does job configuration break when I change seemingly unrelated things?

Because YAML is hell and Ansible is worse. Zuul's job inheritance is powerful but complex. Change one parent job and watch 50 child jobs break in unexpected ways.
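A toy model shows why a parent edit ripples outward. This is a deliberately simplified sketch, not Zuul's actual inheritance logic (which also handles variants, final attributes, and nested pre/post playbooks). The `resolve` helper and the job names are made up for illustration; `parent`, `timeout`, and `nodeset` are real job attributes, but the values here are hypothetical.

```python
from typing import Any, Dict

Job = Dict[str, Any]


def resolve(name: str, jobs: Dict[str, Job]) -> Job:
    """Flatten a job: start from the parent's settings, then apply
    the child's own settings on top as overrides."""
    job = jobs[name]
    parent = job.get("parent")
    inherited = resolve(parent, jobs) if parent else {}
    own = {key: value for key, value in job.items() if key != "parent"}
    return {**inherited, **own}


if __name__ == "__main__":
    jobs: Dict[str, Job] = {
        "base": {"timeout": 1800, "nodeset": "example-nodeset"},
        "unit": {"parent": "base", "timeout": 600},  # overrides timeout
        "integration": {"parent": "base"},            # inherits everything
    }
    print(resolve("integration", jobs)["timeout"])

    # Edit only the parent: every child that did not override the value changes too.
    jobs["base"]["timeout"] = 300
    print(resolve("integration", jobs)["timeout"])
    print(resolve("unit", jobs)["timeout"])
```

Nobody touched `integration`, yet its timeout dropped from 1800 to 300, while `unit` kept its own 600. Multiply that by 50 children and a few levels of nesting and you get the "seemingly unrelated" breakage.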

Do I need to become an Ansible expert?

Yes. Everything is Ansible. You'll debug playbooks, understand inventory, and curse YAML syntax errors. The pre-built jobs help but you'll still need Ansible skills.

What breaks most often in production?

ZooKeeper connectivity issues, Nodepool resource exhaustion, and executor nodes getting stuck. The logs are spread across multiple services. Have fun debugging. Also, watch out for the scheduler's memory usage: it'll slowly leak memory until a restart is required, usually during your biggest deployment.

Can I run this on Kubernetes instead of VMs?

Yes, but you're trading VM orchestration complexity for Kubernetes complexity. The zuul-operator exists but good luck debugging when pods start crashing. Most production deployments still use dedicated VMs because they're easier to troubleshoot when shit hits the fan.

Is there actually good commercial support?

VEXXHOST offers managed services if you want the benefits without the pain. Red Hat supports it through Software Factory. Consider this unless you have dedicated infrastructure engineers.
