The Three Ways to Deploy Nix in Production

I've deployed Nix to production in three different environments over the past 4 years. Each approach has its place, but they're not interchangeable.


The Simple Way: Direct nixos-rebuild

This is how you start. SSH into your server and run nixos-rebuild switch. Your configuration lives in /etc/nixos/configuration.nix and you edit it directly on the server.

I used this for my first production NixOS server in 2021. It worked fine for a single-server Rails app with low traffic.
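
For reference, the whole workflow fits in a few commands (hostname is a placeholder):

ssh root@your-server.example.com
## Edit the config in place
nano /etc/nixos/configuration.nix
## Build and activate the new generation
nixos-rebuild switch
## If it misbehaves, activate the previous generation
nixos-rebuild switch --rollback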

When this works:

  • Single server or very few servers
  • You don't mind SSHing into production to deploy
  • Configuration changes are infrequent
  • Team is small (1-2 people max)

When this breaks:

  • Multiple servers need identical configs
  • You want deployment history and rollbacks
  • Team growth means multiple people touching production
  • Compliance requires audit trails of who changed what

The moment you have two servers, direct editing becomes a nightmare. Trust me, I've been there. You make a change on server A, forget to apply it to server B, and spend 2 hours debugging why they behave differently.

The Remote Way: nixos-rebuild with --build-host

This is the middle ground. Your configuration lives in version control, and you build remotely but deploy from your local machine:

nixos-rebuild switch \
  --build-host build-server.example.com \
  --target-host prod-server.example.com \
  --use-remote-sudo

The --build-host flag is crucial for production. Building Firefox from source on a 1-CPU production server will kill your site for 3 hours. Build on a separate machine with more cores and push the result.

When this works:

  • 2-10 servers that need coordinated updates
  • You have a beefy build server
  • Manual deployment process is acceptable
  • Want version control for configurations

When this starts sucking:

  • Deploys take forever (hitting servers one by one like it's 2005)
  • More than one person trying to deploy causes chaos
  • Rolling back means manually SSHing into each server
  • Binary cache misconfiguration means you're building Firefox from source during peak traffic

I used this approach for a client with 8 NixOS servers. Deployments took 15 minutes because I had to hit each server sequentially. The binary cache saved us from recompiling, but the serial deployment was painful.

The Production Way: Deploy-rs and Flakes

This is how you do it when you're serious. Deploy-rs treats deployment as a first-class problem with proper tooling.

Your flake.nix defines everything:

{
  deploy.nodes.web-server = {
    hostname = "web01.prod.example.com";
    profiles.system = {
      user = "root";
      path = deploy-rs.lib.x86_64-linux.activate.nixos
        self.nixosConfigurations.web-server;
    };
  };

  deploy.nodes.api-server = {
    hostname = "api01.prod.example.com";
    profiles.system = {
      user = "root";
      path = deploy-rs.lib.x86_64-linux.activate.nixos
        self.nixosConfigurations.api-server;
    };
  };
}

Deploy everything with deploy . and it runs in parallel. Magic rollback means if you break SSH access, the server reverts automatically after 30 seconds.
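
In practice, a few invocations cover most day-to-day use (node names match the flake above):

## Deploy every node defined in the flake, in parallel
deploy .

## Deploy a single node
deploy '.#web-server'

## Skip flake checks when they already ran in CI
deploy . --skip-checks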

Why this is better:

  • Parallel deployments: 20 servers finish as fast as 1 server
  • Atomic rollbacks: If any server fails, everything rolls back
  • Interactive mode: Preview changes before deployment
  • Multi-profile support: Deploy apps without root access
  • Proper error handling: Clear failures, not silent corruption

I've used this for clients with 50+ servers. A full deployment finishes in under 5 minutes, including application updates and OS configuration changes.

Binary Caches: Don't Build in Production

Here's the thing nobody tells you: binary caches are not optional for production. They're mandatory.

Without a cache, every deployment compiles everything from source. I've seen production deployments take 4 hours because someone modified a low-level dependency.

Your options boil down to three: the public cache.nixos.org, a hosted private cache like Cachix, or a self-hosted cache like Attic. For production, I recommend Cachix for the convenience, or self-hosted Attic if you need full control. FlightAware uses self-hosted caches because they need guaranteed availability.

The cache hit rate for standard nixpkgs is usually 90%+. For custom applications, you'll build once and cache forever. This turns 2-hour deployments into 2-minute deployments.
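
As a sketch of the Cachix route (the cache name is a placeholder):

## Add the cache and its public key to your Nix configuration
cachix use your-cache

## Push a build's closure to the cache, typically from CI
nix build '.#nixosConfigurations.web-server.config.system.build.toplevel'
cachix push your-cache ./result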


CI/CD Integration That Actually Works

Don't try to adapt Docker-based CI/CD to Nix. Build a Nix-native pipeline instead.

Our GitHub Actions workflow looks like this:

- uses: actions/checkout@v4
- uses: DeterminateSystems/nix-installer-action@v4
- uses: DeterminateSystems/magic-nix-cache-action@v2
- name: Build system configurations  
  run: nix build '.#nixosConfigurations.web-server.config.system.build.toplevel'
- name: Deploy to production
  run: deploy . --skip-checks
  env:
    SSH_PRIVATE_KEY: ${{ secrets.DEPLOY_SSH_KEY }}

The Magic Nix Cache speeds up CI builds dramatically. Combined with deploy-rs, you get proper deployment automation.

Key insights from production use:

  • Build everything in CI, never on production servers
  • Use --skip-checks in automated deployments (checks already ran in CI)
  • Set up proper SSH key management for deploy access (see the snippet after this list)
  • Monitor deployment times - anything over 10 minutes needs investigation
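
For the SSH key piece, one common option is an ssh-agent action; a sketch (action version and secret name are examples):

- uses: webfactory/ssh-agent@v0.8.0
  with:
    ssh-private-key: ${{ secrets.DEPLOY_SSH_KEY }}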

The whole process from git push to production deployment takes 5-8 minutes for our largest clients. Compare that to Docker-based pipelines that take 20-30 minutes for similar complexity.

Companies like Shopify and Tweag use variations of this approach for hundreds of servers.

Comparison Table

| Deployment Approach | Direct nixos-rebuild | Remote nixos-rebuild | Deploy-rs + Flakes |
|---|---|---|---|
| Learning Curve | 5 minutes to break everything | Weekend to realize you're doing it wrong | 2-3 days to actually understand it |
| Server Limit | 1-2 servers before you lose your mind | 2-10 servers before you want to quit | 100+ servers and you're still sane |
| Deployment Time | 30 seconds to fuck up one server | 5-15 minutes of watching servers fail one by one | 2-5 minutes to deploy everything correctly |
| Rollback Speed | Manual panic, 2-5 minutes of terror | Manual panic, 5-10 minutes of more terror | Automatic magic, 30 seconds of relief |
| CI/CD Integration | Don't even think about it | Bash scripts held together with hope | Actually designed for this |
| Team Scalability | One person who hates their life | 2-3 people stepping on each other | Unlimited people who sleep at night |
| Production Readiness | Dev environment only | Toy production at best | Real production for grown-ups |

Production Gotchas That Will Ruin Your Weekend

I've debugged Nix production issues at 3am more times than I want to remember. Here's every mistake I've made (and seen others make) so you don't have to.


The /nix/store Disk Space Disaster

The Problem: /nix/store fills your root filesystem. Server stops accepting connections. Site goes down.

How It Happens: You deploy a few times, each creating a new system generation. Old generations aren't automatically cleaned up. One day you hit 100% disk usage and everything breaks.

The Fix: Run garbage collection regularly:

## Emergency cleanup (deletes generations older than 3 days)
nix-collect-garbage --delete-older-than 3d

## Scheduled cleanup (run this weekly)
nix-collect-garbage --delete-older-than 30d

Prevention: Add this to your NixOS configuration:

nix.gc = {
  automatic = true;
  dates = "weekly";
  options = "--delete-older-than 30d";
};
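
If you want a harder backstop, Nix can also trigger garbage collection automatically when free space runs low; a sketch with example thresholds:

nix.settings = {
  # Start collecting garbage when free space drops below ~5 GB
  min-free = 5 * 1024 * 1024 * 1024;
  # Keep collecting until ~20 GB are free
  max-free = 20 * 1024 * 1024 * 1024;
};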

I learned this the hard way when a client's API servers ran out of disk space during Black Friday. 45GB of old system generations took down their entire checkout process. The exact error was No space left on device but it took me 20 minutes to figure out it was the /nix/store eating all the space. CEO was pissed. I was more pissed at myself for not setting up automatic garbage collection.

Binary Cache Authentication Hell

The Problem: Private binary cache stops working. Builds fall back to source. Deployments take 3 hours instead of 3 minutes.

How It Happens: Cache authentication tokens expire, network issues, or misconfigured SSH keys. Nix silently falls back to building everything from source.

Debugging: Check if cache is actually being used:

## See what's being fetched vs built
nix build --print-build-logs --verbose '.#nixosConfigurations.server.config.system.build.toplevel'

## Test cache access directly  
nix store ping --store https://cache.nixos.org
nix store ping --store https://your-cache.cachix.org

The Fix: Verify your cache configuration:

nix.settings = {
  substituters = [
    "https://cache.nixos.org"
    "https://your-cache.cachix.org"
  ];
  trusted-public-keys = [
    "cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY="
    "your-cache.cachix.org-1:YOUR_PUBLIC_KEY_HERE"
  ];
};

Pro tip: Set up monitoring for cache hit rates. If your hit rate drops below 80%, something's wrong. See Nix cache debugging guide for more troubleshooting steps.

The "Permission Denied" SSH Deployment Trap

The Problem: deploy-rs fails with Permission denied (publickey) but SSH works fine manually.

How It Happens: Different SSH configurations between your shell and the deploy tool, or missing SSH agent forwarding. I spent 2 hours on this exact issue before realizing the problem.

Debugging: Test the exact SSH command deploy-rs uses:

ssh -o BatchMode=yes root@your-server.com whoami
## This will fail with: Permission denied (publickey).
## But this works fine:
ssh root@your-server.com whoami
## Welcome to production-web-01

If you see this pattern, your SSH agent isn't available to deploy-rs. The BatchMode=yes flag is the giveaway.

The Fix: Ensure SSH agent is running and keys are loaded:

eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
export SSH_AUTH_SOCK

Or configure SSH properly in ~/.ssh/config:

Host production-server
    HostName prod.example.com
    User root
    IdentityFile ~/.ssh/production_key
    IdentitiesOnly yes

Flake Input Pinning Disasters

The Problem: A flake input gets updated, breaking production builds. What worked yesterday fails today.

How It Happens: You don't pin your inputs properly. nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable" means "whatever that branch points to when the lock file is next updated", not "a stable version".

The Fix: Pin everything in flake.lock:

## Pin current working versions
nix flake lock

## Update only when you want to
nix flake update

## Update specific input
nix flake update nixpkgs

Best Practice: Never use unpinned inputs in production. Your flake.nix should specify exact commits:

{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/a1b2c3d4...";  # Exact commit
  inputs.deploy-rs.url = "github:serokell/deploy-rs/v1.6.1";  # Tagged release
}

I've seen production deployments fail because someone's flake inputs updated overnight and pulled in a broken version of systemd. The error was cryptic as hell: Failed to start systemd-networkd.service: Unit systemd-networkd.service has a bad unit file setting - took me 4 hours to trace it back to an unpinned nixpkgs input that updated to a broken commit.

The Activation Script Infinite Loop

The Problem: System activation fails, but the rollback also fails. Server becomes unbootable.

How It Happens: Your activation script has logic that fails both going forward and rolling back. Usually involves external dependencies like network services or databases.

Debugging: Check the activation logs:

journalctl -u nixos-activation.service
systemctl status nixos-activation

The Fix: Keep activation scripts simple and idempotent:

## BAD: Depends on network
system.activationScripts.setup = ''
  curl https://httpbin.org/get || exit 1
'';

## GOOD: Local operations only
system.activationScripts.setup = ''
  mkdir -p /var/lib/myapp
  chown myapp:myapp /var/lib/myapp || true
'';

Prevention: Test activation scripts in development. Use deploy-rs magic rollback to automatically revert failed activations.
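
The rollback behavior can also be pinned in your deploy-rs flake instead of relying on CLI flags; a minimal sketch (values are examples):

deploy.nodes.web-server = {
  # Revert automatically if activation fails or is never confirmed
  autoRollback = true;
  magicRollback = true;
  confirmTimeout = 30;
};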

Build User Exhaustion Under Load

The Problem: Deployments fail with "waiting for build users" during high activity periods.

How It Happens: Nix has a limited number of build users (default: 32). Under heavy load, they're all busy and new builds queue indefinitely.

The Fix: Increase the build user count:

nix.settings.max-jobs = 8;  # Parallel builds
users.users = lib.mkMerge (lib.genList (i: {
  "nixbld${toString (i+33)}" = {
    isSystemUser = true;
    group = "nixbld";
    uid = 30033 + i;
  };
}) 32);  # Add 32 more build users

Monitoring: Check build user usage:

## Count nix-daemon workers and busy nixbld build users
ps aux | grep '[n]ix-daemon' | wc -l
ps -eo user= | grep -c '^nixbld'

Network Partitions During Deployment

The Problem: Network connection drops during deployment. Server is left in an inconsistent state.

How It Happens: Your deploy tool doesn't handle network failures gracefully. Partially applied configuration breaks the system.

The Fix: Use atomic deployments with proper rollback:

## deploy-rs handles this automatically: magic rollback is on by default,
## and --confirm-timeout controls how long it waits for confirmation
deploy . --confirm-timeout 60

## Manual approach: revert to the previous generation with nixos-rebuild
nixos-rebuild switch --rollback

Best Practice: Always test deployments on staging infrastructure that matches production networking conditions. Consider using NixOS containers for development environments that closely match production.
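
If you go the NixOS containers route for a local rehearsal environment, a minimal declarative container looks roughly like this (names and services are examples):

containers.staging-web = {
  autoStart = true;
  config = { pkgs, ... }: {
    services.nginx.enable = true;
    system.stateVersion = "24.05";
  };
};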

Out of Memory During Large Builds

The Problem: Build processes consume all available RAM. System becomes unresponsive.

How It Happens: Building large packages (like browsers or language compilers) on memory-constrained servers.

The Fix: Use a separate build server:

nixos-rebuild switch \
  --build-host powerful-builder.internal \
  --target-host production-server.internal

Or configure build resource limits:

nix.settings = {
  max-jobs = 2;  # Limit parallel builds
  cores = 4;     # Cores per build job
};

## Add swap if needed
swapDevices = [{
  device = "/var/lib/swapfile";
  size = 8192;  # 8GB swap
}];

Monitoring: Watch memory usage during deployments:

watch -n 1 'free -h && ps aux --sort=-%mem | head'

I once had a client whose 2GB server tried to build Firefox from source. The server completely locked up - no SSH, no HTTP responses, nothing. Killed their entire application stack for 4 hours while I frantically tried to figure out why everything was dead. The OOM killer logs showed Out of memory: Kill process 12847 (cc1plus) score 856 - Firefox compilation was trying to use 8GB+ of RAM on a 2GB server. The kernel just gave up and started killing everything.

Monitoring and Alerting for Nix Deployments

Set up proper monitoring so you catch issues before they become disasters:

services.prometheus.exporters.node.enable = true;

## Alert when the filesystem holding /nix/store is nearly full
## (mountpoint and threshold are examples - tune them for your disks)
services.prometheus.rules = [''
  groups:
    - name: nix
      rules:
        - alert: NixStoreDiskFull
          expr: node_filesystem_avail_bytes{mountpoint="/"} < 10 * 1024 * 1024 * 1024
          for: 10m
''];

Questions I Get Asked About Production Nix

Q: Can I deploy Nix to existing Ubuntu/CentOS servers?

A: Yes, but you're missing the point.

Nix as a package manager works on any Linux. But the real value comes from NixOS - the whole system managed declaratively.

If you can't switch to NixOS, at least use Nix for application environments. Deploy your apps with nix-env or home-manager, but know you're only getting 30% of the benefit.

Q: How do I handle secrets and environment variables?

A: Don't put secrets in the Nix store - they're world-readable. Use proper secret management:

For development: sops-nix encrypts secrets in your repository.

For production: External secret stores like HashiCorp Vault, AWS Secrets Manager, or Kubernetes secrets.

Simple approach: systemd environment files that aren't managed by Nix:

systemd.services.myapp = {
  environment = { LOG_LEVEL = "info"; };
  serviceConfig.EnvironmentFile = "/etc/secrets/myapp.env";
};
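
If you go the sops-nix route mentioned above, the wiring looks roughly like this (file paths and names are examples):

sops.defaultSopsFile = ./secrets/myapp.yaml;
sops.secrets."myapp-env" = {
  owner = "myapp";
};

# The decrypted file lives outside the Nix store (under /run/secrets by default)
systemd.services.myapp.serviceConfig.EnvironmentFile =
  config.sops.secrets."myapp-env".path;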

Never commit secrets to your Nix configuration. I've seen this mistake literally destroy startups. One git push and their API keys are on GitHub forever.

Q: What about Docker? Can I use both?

A: You can, but it's usually redundant. Nix has better isolation and reproducibility than Docker.

If you must use Docker, dockerTools.buildImage creates minimal containers:

dockerImage = pkgs.dockerTools.buildImage {
  name = "my-app";
  config.Cmd = [ "${my-app}/bin/my-app" ];
};

These images contain only your app and its dependencies - no Ubuntu base image bloat.
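
Loading the result into Docker is a one-liner (assuming the image is exposed as a flake package named dockerImage):

nix build '.#dockerImage'
docker load < result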

But honestly? If you're using Nix properly, Docker becomes unnecessary complexity.

Q: How fast are Nix deployments compared to Kubernetes?

A: Much faster. Here's real data from a client with 20 services:

  • Kubernetes: 15-25 minutes (image builds, registry pushes, rollouts)
  • Nix + deploy-rs: 3-5 minutes (parallel deployment, binary cache hits)

Nix deployments are atomic. Either everything deploys successfully or nothing changes. Kubernetes has partial failure modes where some pods update but others don't.

Q: Should I use channels or flakes for production?

A: Flakes. Period.

Channels are a legacy system with unpredictable behavior. Flakes give you:

  • Reproducible inputs with flake.lock
  • Explicit dependency management
  • Better composition across repositories
  • Native support in modern deployment tools

Yes, flakes are "experimental" but everyone uses them for serious work. The community moved faster than the official docs.

Q: What happens if cache.nixos.org goes down?

A: Your deployments slow down but don't break. Nix automatically falls back to building from source.

For production, don't rely solely on the public cache:

  • Use Cachix for mission-critical deployments
  • Set up Attic for full control
  • Mirror critical packages to your own cache

FlightAware runs their own caches because they can't afford external dependencies for flight tracking systems.

Q: How do I update production systems safely?

A: Pin everything, test thoroughly, deploy gradually:

  1. Pin inputs: Never use unpinned nixos-unstable
  2. Test in staging: Identical configuration to production
  3. Gradual rollout: Deploy to one server first
  4. Monitor: Watch metrics for 15-20 minutes
  5. Proceed: Deploy to remaining servers if all good

## Good deployment flow
nix flake update nixpkgs  # Explicit updates only
deploy .#staging          # Test first
deploy .#production-web1  # One server
## Wait and monitor...
deploy .#production       # Full rollout

Q: Can I use Terraform with Nix?

A: Yes, it's actually a great combination. Terraform provisions infrastructure, Nix manages the software on it.

resource "aws_instance" "web" {
  ami           = "ami-12345"  # NixOS AMI
  instance_type = "t3.medium"
  
  user_data = <<-EOF
    #cloud-config
    write_files:
      - path: /etc/nixos/configuration.nix
        content: |
          { imports = [ ./hardware-configuration.nix ]; 
            networking.hostName = "web-${count.index}"; }
  EOF
}

Then use deploy-rs to manage the system configuration. Best of both worlds.

Q: How do I handle database migrations?

A: Nix doesn't run database migrations automatically - that would be crazy. Handle them separately:

systemd.services.myapp-migrate = {
  description = "Run database migrations";
  after = [ "postgresql.service" ];
  before = [ "myapp.service" ];
  serviceConfig = {
    Type = "oneshot";
    User = "myapp";
    ExecStart = "${my-app}/bin/migrate";
  };
  wantedBy = [ "myapp.service" ];
};

Better approach: Use Flyway or migrate as separate deployment steps outside of Nix.

Q: What about compliance and security audits?

A: NixOS actually helps with compliance because everything is declarative and auditable:

  • Configuration drift: Impossible - systems match their configuration exactly
  • Patch management: Track exactly what's installed and when
  • Reproducible audits: Auditors can build identical systems to test
  • Change tracking: All changes go through version control

For security:

  • CVE tracking: Vulnix scans for known vulnerabilities
  • Minimal systems: No unnecessary packages installed
  • Atomic updates: Security patches apply atomically, no partial states

Large enterprises like IOHK use NixOS for cryptocurrency infrastructure specifically because of these security properties.

Q: Can new team members learn Nix quickly enough?

A: The learning curve is real but manageable:

  • Week 1: Frustrated and confused
  • Week 2: Starting to understand the concepts
  • Week 3-4: Productive with guidance
  • Month 2: Writing their own configurations

Provide good examples and mentorship. Don't throw people into deep Nix without support.

The payoff is huge - developers who learn Nix never want to go back to traditional package managers.

Q: How do I get started without disrupting production?

A: Start with development environments, not production:

  1. Use nix-shell for project dependencies
  2. Add flake.nix to one repository (see the sketch after this list)
  3. Deploy to staging with deploy-rs
  4. Get comfortable with rollbacks and debugging
  5. Migrate production one service at a time
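
To make steps 1 and 2 concrete, a minimal flake.nix with a dev shell looks like this (package names are examples):

{
  inputs.nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

  outputs = { self, nixpkgs }:
    let pkgs = nixpkgs.legacyPackages.x86_64-linux;
    in {
      devShells.x86_64-linux.default = pkgs.mkShell {
        # Project dependencies go here
        packages = [ pkgs.nodejs pkgs.postgresql ];
      };
    };
}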

Don't try to convert everything at once. I've watched teams crash and burn trying to migrate 50 services to NixOS in one weekend. It never works. Start small.

Q: Is Nix ready for enterprise production?

A: Companies successfully using Nix in production:

  • FlightAware: Flight tracking infrastructure
  • Shopify: Developer environments and tooling
  • IOHK: Cardano blockchain infrastructure
  • Tweag: Client consulting infrastructure

The technology is solid. The ecosystem has enterprise users. The tooling keeps improving.

The question isn't whether Nix is ready for enterprise - it's whether your team is ready for Nix.

Production-Grade Nix Resources