Why would I choose Nomad over Kubernetes?

You shouldn't if you need the ecosystem. Choose Nomad if you want orchestration without becoming a Kubernetes expert. The setup is genuinely easier - download binary, run it, deploy jobs. No master nodes, no etcd to corrupt, no CNI plugins that randomly break.

I've deployed both. Kubernetes took our team 3 months to get comfortable with. Nomad took 2 weeks. But Kubernetes has thousands of community tools; Nomad has dozens.

Can Nomad really run my legacy Java application?

Yes, with the [Java driver](https://developer.hashicorp.com/nomad/docs/drivers/java). I've migrated a 10-year-old Spring Boot monolith this way. Nomad downloads the JAR, sets up the classpath, handles restarts. No containerization required.

The catch: you still need to package your application properly. Environment variables, external configs, health checks - all the same operational concerns as containers.
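
As a rough illustration, a Java driver job might look like the sketch below. The job name, artifact URL, JVM flags, and environment variable are all made up for the example; only the stanza and attribute names (`artifact`, `jar_path`, `jvm_options`, `env`) come from the Java driver and job specification docs.

```hcl
# Hypothetical job: names, URL, and sizes are placeholders, not a real deployment
job "legacy-monolith" {
  datacenters = ["dc1"]
  type        = "service"

  group "app" {
    count = 1

    task "spring-boot" {
      driver = "java"

      # Nomad fetches the JAR into the task's local/ directory before starting it
      artifact {
        source = "https://artifacts.example.com/monolith-1.0.0.jar"
      }

      config {
        jar_path    = "local/monolith-1.0.0.jar"
        jvm_options = ["-Xms256m", "-Xmx512m"]
      }

      # Same operational concerns as containers: config via environment
      env {
        SPRING_PROFILES_ACTIVE = "production"
      }

      resources {
        cpu    = 500 # MHz
        memory = 768 # MB, leaves headroom above the 512 MB heap
      }
    }
  }
}
```

The client node needs a JVM installed for this driver to be usable, which is part of the "package your application properly" work.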

What's the real memory footprint?

A Nomad server uses 100-200MB RAM in practice. Clients add 50-100MB overhead. These numbers are real until you start adding monitoring agents, log shippers, and security scanners. Then it's more like 500MB+ per node.

Still way less than Kubernetes, where control plane nodes easily hit 2GB+ with all the components running.

Will Nomad break when one server goes down?

Not if you run 3+ servers. Nomad uses [Raft consensus](https://raft.github.io/), so a cluster of N servers tolerates (N-1)/2 failures, rounded down. With 3 servers, you can lose 1. With 5 servers, you can lose 2. Losing more than that means losing quorum.

**Real failure story**: AWS had a zone outage and we lost 2 out of 3 servers. The surviving server basically said "fuck this, I'm not making decisions alone" and went read-only. Had to wait 6 hours for AWS to fix their shit before new deployments worked again. But hey, existing jobs kept running.
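
For reference, a three-server setup can be sketched with an agent config like this. The file path, data directory, and IP addresses are assumptions for illustration; the idea is one server per availability zone so a single zone outage costs you only one of three.

```hcl
# /etc/nomad.d/server.hcl (path and addresses are placeholders)
data_dir = "/opt/nomad/data"

server {
  enabled = true

  # Wait for 3 servers before electing a leader; quorum is 2 of 3
  bootstrap_expect = 3

  # One address per availability zone, retried until the peers are reachable
  server_join {
    retry_join = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
  }
}
```

Note that this layout would not have saved us in the story above: with only three zones' worth of servers, losing two still means no quorum. Five servers across three zones is the usual answer if that scenario worries you.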

How painful is persistent storage?

More painful than it should be. Nomad supports [CSI plugins](https://developer.hashicorp.com/nomad/docs/internals/plugins/csi) but the ecosystem is smaller. AWS EBS works well. Anything else, you're probably building it yourself.

For local storage, host paths work but you lose job mobility. I usually stick to stateless applications and put databases outside the cluster.
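
To make the host-path trade-off concrete, here is a minimal sketch of a host volume. The volume name, paths, and image are hypothetical. First the client has to advertise the directory:

```hcl
# Client agent config on the one node that owns the data (path is a placeholder)
client {
  enabled = true

  host_volume "app-data" {
    path      = "/opt/app-data"
    read_only = false
  }
}
```

Then the job claims it and mounts it into the task:

```hcl
job "stateful-app" {
  datacenters = ["dc1"]

  group "app" {
    # Only nodes advertising "app-data" are eligible, which is the lost mobility
    volume "data" {
      type      = "host"
      source    = "app-data"
      read_only = false
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/stateful-app:1.0" # placeholder image
      }

      volume_mount {
        volume      = "data"
        destination = "/var/lib/app"
      }
    }
  }
}
```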

Can I deploy Nomad jobs from my CI/CD pipeline?

Yes, the [API](https://developer.hashicorp.com/nomad/api-docs) is straightforward. I use GitLab CI to deploy via `nomad job run`. The job specs are version-controlled, deployments are automated.

**Gotcha that bit me hard**: API tokens expire every 8 hours by default. Forgot to set up auto-renewal and got woken up by PagerDuty because our entire CI/CD pipeline was getting 403s. Spent an hour debugging before realizing it was just expired tokens.
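
If you run with ACLs enabled, the pipeline's token should be scoped to what it actually needs. A minimal sketch of such a policy, assuming jobs live in the `default` namespace and the file name is your own choice:

```hcl
# ci-deploy.policy.hcl (hypothetical file name)
# Lets the CI token submit and inspect jobs, nothing else
namespace "default" {
  capabilities = ["submit-job", "read-job", "list-jobs"]
}
```

You would load it with `nomad acl policy apply` and mint a token against it with `nomad acl token create`, then hand that token to the pipeline as `NOMAD_TOKEN`. Whatever lifetime the token ends up with, put its expiry on a calendar or automate the rotation.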

What breaks first in production?

1. **Consul outages** - Service discovery fails, health checks stop working
2. **Network partitions** - Clients disconnect, jobs get rescheduled unnecessarily
3. **Resource exhaustion** - One job fills the disk, takes down the whole node
4. **Docker daemon crashes** - All Docker tasks fail until daemon restarts

The good news: these are usually obvious and fixable. No deep debugging of container runtime internals.
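
For the disk-exhaustion case, one partial mitigation is to bound task logs and declare disk needs up front. This is a sketch with placeholder names and sizes. As far as I know `ephemeral_disk` is used for placement accounting and is not a hard quota, so it won't stop a job that writes wherever it likes.

```hcl
job "batch-worker" {
  datacenters = ["dc1"]
  type        = "batch"

  group "worker" {
    # Tells the scheduler how much allocation disk this group expects (MB)
    ephemeral_disk {
      size = 500
    }

    task "crunch" {
      driver = "docker"

      config {
        image = "example/batch-worker:1.0" # placeholder image
      }

      # Caps stdout/stderr retention at roughly 5 x 10 MB per stream
      logs {
        max_files     = 5
        max_file_size = 10
      }
    }
  }
}
```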

Is the monitoring story decent?

Better than expected. Nomad exports [Prometheus metrics](https://developer.hashicorp.com/nomad/docs/configuration/telemetry#prometheus) out of the box. Community [Grafana dashboards](https://grafana.com/grafana/dashboards?search=nomad) exist and work well.

**Setup tip**: Enable telemetry in your config file or you'll get no metrics and wonder why your dashboard is empty.

Who actually uses this in production?

[Cloudflare](https://blog.cloudflare.com/how-we-use-hashicorp-nomad/) runs it on 200+ edge locations. [Roblox](https://github.com/hashicorp/nomad/issues/8846) uses it for game servers. Netflix has some deployments. Smaller companies use it to avoid Kubernetes complexity.

The pattern: companies that need orchestration but don't want to hire dedicated platform engineers.

How dead is Docker Swarm compared to Nomad?

Swarm is effectively dead. Docker stopped investing in it heavily. Last major feature was in 2019. Most organizations are migrating away.

Nomad vs Swarm isn't a fair fight. Nomad has active development, regular releases, and actual enterprise support. Use Nomad if you're choosing between them.

What's the security model like?

[ACLs](https://medium.com/@keshrianjani20/building-a-modern-service-mesh-with-nomad-consul-and-envoy-a-devops-journey-98bf0ddead11) for access control, [TLS](https://medium.com/@williamwarley/mastering-hashicorp-nomad-a-comprehensive-guide-for-deploying-and-managing-workloads-aa8720c2620b) for transport encryption, [Vault integration](https://medium.com/@keshrianjani20/building-a-modern-service-mesh-with-nomad-consul-and-envoy-a-devops-journey-98bf0ddead11) for secrets. Enterprise adds audit logs and governance policies.

**Reality check**: The defaults are insecure. You need to configure TLS and ACLs manually. Not difficult, but not automatic either. Keep up with security patches - the Nomad team is pretty good about fixing issues quickly, but you need to stay current.

Can I run Windows containers?

Yes, Windows Server nodes work as Nomad clients. I've run mixed Linux/Windows clusters for legacy .NET applications. On Windows, the Docker driver runs containers and the `raw_exec` driver runs native executables.

**Windows gotcha**: Path handling is different, networking is weird, and troubleshooting is harder than Linux.

How does service discovery actually work?

You need [Consul](https://www.consul.io/). Nomad registers services automatically, and Consul provides DNS and HTTP APIs for discovery. Works well when both are healthy.

**Single point of failure**: If Consul is down, service discovery breaks. Plan accordingly with Consul clustering.

What happens during cluster upgrades?

Rolling upgrades work if you follow the process: servers first, then clients. [Backward compatibility](https://medium.com/@williamwarley/mastering-hashicorp-nomad-a-comprehensive-guide-for-deploying-and-managing-workloads-aa8720c2620b) is good between adjacent versions.

**Upgrade horror story**: Tried to skip from 1.2 to 1.5 because I'm an idiot. Half our batch jobs started failing with "unknown job spec field" errors. Turns out the job specification format changed between versions. Spent 4 hours unfucking the deployment by rolling back, then upgrading through every damn intermediate version. Took down our ETL pipeline for 4 hours.

Does autoscaling work?

[Nomad Enterprise](https://www.ibm.com/products/hashicorp-nomad) includes an autoscaler for cluster nodes. For application scaling, you need external tools or custom solutions.

The open-source community has built [horizontal autoscalers](https://github.com/hashicorp/nomad-autoscaler) but they're not as mature as Kubernetes HPA.
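
For a sense of what horizontal scaling looks like with the Nomad Autoscaler linked above, here is a sketch of a `scaling` block on a task group. It assumes the autoscaler agent is running with a Prometheus source configured; the job name, image, thresholds, and especially the query string are placeholders you would replace with your own.

```hcl
job "web" {
  datacenters = ["dc1"]

  group "web" {
    count = 2

    scaling {
      enabled = true
      min     = 2
      max     = 10

      policy {
        cooldown            = "2m"
        evaluation_interval = "30s"

        check "avg_cpu" {
          source = "prometheus"
          # Hypothetical query: substitute a metric your Prometheus actually has
          query  = "avg(nomad_client_allocs_cpu_total_percent{task_group=\"web\"})"

          # Add or remove instances to hold the query result near 70
          strategy "target-value" {
            target = 70
          }
        }
      }
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.25"
      }
    }
  }
}
```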

HashiCorp Nomad: AI-Optimized Technical Reference

Executive Summary

HashiCorp Nomad is a workload scheduler that competes with Kubernetes by offering simpler deployment and operation. Single 40MB binary handles all orchestration functions. Now owned by IBM (February 2025, $6.4B acquisition) with enterprise pricing implications.

Core Architecture

Components

Server Nodes: 3-5 recommended for production, handle scheduling and state management
Client Nodes: Execute workloads, can be any machine (cloud/bare metal/laptop)
Regions/Datacenters: Geographical organization, cross-region federation supported

Resource Requirements

Server Memory: 100-200MB RAM per node (vs 1-2GB+ for Kubernetes)
Client Overhead: 50-100MB per node (before monitoring agents add 300-400MB more)
Binary Size: Single 40MB executable contains everything

Configuration

Production-Ready Settings

job "production-app" {
  datacenters = ["dc1", "dc2"]
  type = "service"
  
  constraint {
    attribute = "${attr.kernel.name}"
    value     = "linux"
  }
  
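  group "web" {
    count = 3

    # Declares the "http" port label that the task and service below refer to
    network {
      port "http" {
        to = 80
      }
    }
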
    restart {
      attempts = 3
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }
    
    task "nginx" {
      driver = "docker"
      
      config {
        image = "nginx:1.25"
        ports = ["http"]
      }
      
      resources {
        cpu    = 500  # MHz allocation
        memory = 512  # MB exact reservation
      }
      
      service {
        name = "web"
        port = "http"
        
        check {
          type     = "http"
          path     = "/health"
          interval = "10s"
          timeout  = "3s"
        }
      }
    }
  }
}

Critical Configuration Requirements

File descriptor limits: Must be increased or jobs fail with "too many open files"
Log level: DEBUG or TRACE fills disk in days, keep production agents at INFO (Nomad's default)
Clock synchronization: >100ms drift breaks cluster formation
Firewall ports: 4646 (HTTP), 4647 (RPC), 4648 (Serf) required
Systemd service: Essential for production deployments
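
The agent-side parts of this checklist can be sketched in one config file. The data directory is an assumption, and the file descriptor limit and clock sync live outside Nomad's config, so they appear only as comments.

```hcl
# /etc/nomad.d/nomad.hcl (path is a placeholder)
data_dir  = "/opt/nomad/data"
log_level = "INFO" # pinned explicitly so a debugging session can't leave DEBUG behind

# These are Nomad's standard ports; open them between cluster members
ports {
  http = 4646 # API and UI
  rpc  = 4647 # server/client RPC
  serf = 4648 # gossip between servers (TCP and UDP)
}

# Not configurable here:
# - file descriptor limit: set LimitNOFILE in the systemd unit
# - clock sync: run chrony or another NTP client on every node
```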

Workload Support Matrix

| Workload Type | Driver | Production Ready | Common Issues |
| --- | --- | --- | --- |
| Docker containers | docker | Yes | Registry auth expiry, network security groups |
| Java applications | java | Yes | Classpath complexity, memory tuning |
| Raw binaries | exec | Yes | Environment variable management |
| QEMU VMs | qemu | Limited | Resource overhead, networking complexity |

Deployment Strategies

Installation Process

# Simple installation
wget https://releases.hashicorp.com/nomad/1.10.5/nomad_1.10.5_linux_amd64.zip
unzip nomad_1.10.5_linux_amd64.zip
chmod +x nomad
./nomad agent -dev  # Development only

Production Deployment Failures

Dynamic port allocation: Works in dev, fails with AWS security groups blocking random high ports
Docker registry authentication: Private registry tokens expire during deployments
Memory allocation waste: Exact reservation leads to unused RAM (512MB request for 300MB app wastes 212MB)
Rolling update deadlock: Version 1.8.x bug caused deadlocks at resource limits
NFS mount failures: Host volumes on NFS break during network hiccups

Dependencies and Ecosystem

Required Components

Consul: Mandatory for service discovery beyond basic scheduling
Vault: Optional but recommended for secret management
Load Balancer: External solution required (HAProxy, nginx, cloud LB)

Ecosystem Comparison

| Category | Kubernetes | Nomad | Impact |
| --- | --- | --- | --- |
| Available tools | 1000s | 50-100 | Significant development overhead |
| Storage plugins | Extensive | Limited CSI | Custom solutions often required |
| Networking solutions | Multiple CNI options | Consul Connect, CNI plugins (fewer options) | Limited flexibility |
| Monitoring integration | Native | Prometheus/external | Additional configuration needed |

Critical Failure Modes

Common Production Failures

Consul brain split: Service discovery failure for 2+ hours during network partition
Client mass disconnection: AWS maintenance triggers unnecessary job rescheduling
Resource exhaustion: Single rogue batch job fills /tmp, crashes Docker daemon, kills all containers
Docker daemon memory leak: Version 20.10.8 consumed all host RAM over time
Scheduling deadlock: Overly specific constraints prevent job placement

Failure Recovery Patterns

Server failures: Tolerate (N-1)/2 server failures, rounded down, with Raft consensus (1 of 3, 2 of 5)
Client failures: Jobs automatically rescheduled to healthy nodes
Network partitions: Existing jobs continue, new deployments fail until resolution
Consul outages: Service discovery breaks, health checks stop, manual intervention required
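
How aggressively allocations move after a client failure is tunable per group through the `reschedule` block. The values below are illustrative and, as I understand it, close to the defaults for service jobs; the job and image names are placeholders.

```hcl
job "web" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 3

    # Retry placement on other nodes with growing delays, never giving up
    reschedule {
      delay          = "30s"
      delay_function = "exponential"
      max_delay      = "1h"
      unlimited      = true
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:1.25"
      }
    }
  }
}
```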

Performance Characteristics

Resource Efficiency

Memory overhead: 5-10x less than Kubernetes control plane
CPU usage: Minimal scheduler overhead compared to kube-scheduler
Network traffic: Gossip protocol efficient for cluster communication
Storage: No etcd corruption issues, local state storage

Scaling Limits

Jobs per cluster: 10,000+ without performance degradation
Nodes per cluster: 1,000+ clients tested in production
Tasks per node: Limited by host resources, not orchestrator
API throughput: 1000+ requests/second sustained

Security Model

Default Security Posture

Authentication: Disabled by default - MUST configure ACLs
Transport encryption: Disabled by default - MUST configure TLS
Secret management: No built-in solution - requires Vault integration
Network security: Host-based firewalls required
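
For the Vault integration, a task typically pulls secrets through a `template` block. This sketch assumes a KV v2 secret at a made-up path with a `password` field, and a Vault role named `app-read`; older Nomad versions configured this with a list of policies instead of a role, so check which form your version expects.

```hcl
job "app" {
  datacenters = ["dc1"]

  group "app" {
    task "server" {
      driver = "docker"

      config {
        image = "example/app:1.0" # placeholder image
      }

      # Hypothetical role name; must exist in your Vault setup
      vault {
        role = "app-read"
      }

      # Renders the secret into a file under secrets/ and exports it as env vars
      template {
        data        = <<EOT
DB_PASSWORD={{ with secret "secret/data/app" }}{{ .Data.data.password }}{{ end }}
EOT
        destination = "secrets/app.env"
        env         = true
      }
    }
  }
}
```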

Production Security Requirements

# Minimum security configuration
acl {
  enabled = true
}

tls {
  http = true
  rpc  = true
  ca_file   = "/path/to/ca.pem"
  cert_file = "/path/to/cert.pem"
  key_file  = "/path/to/key.pem"
}

Migration and Integration

Kubernetes Migration Path

Assessment phase: 2-4 weeks to evaluate workload compatibility
Proof of concept: 4-6 weeks for representative workload subset
Parallel deployment: 8-12 weeks running both platforms
Full migration: 6-12 months depending on application complexity

Legacy Application Integration

Java applications: Direct JAR deployment without containerization
Windows services: Native Windows task driver support
Batch processing: Parameterized jobs for parallel workloads
Mixed environments: Linux/Windows clusters supported
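
A parameterized batch job, for example, can be sketched like this. The job name, binary path, and meta key are hypothetical; each dispatch creates a new run with its own input.

```hcl
job "etl-import" {
  datacenters = ["dc1"]
  type        = "batch"

  # Registers a template; nothing runs until the job is dispatched
  parameterized {
    payload       = "forbidden"
    meta_required = ["input_path"]
  }

  group "import" {
    task "run" {
      driver = "exec"

      config {
        command = "/usr/local/bin/etl-import" # placeholder binary on the host
        args    = ["--input", "${NOMAD_META_input_path}"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}
```

Running `nomad job dispatch -meta input_path=/data/2024-01.csv etl-import` once per input file gives you parallel workers without writing a queue.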

Cost Analysis

Operational Costs

Learning curve: 2 weeks to productivity vs 3-6 months for Kubernetes
Operations team: 1 engineer can manage 100+ node cluster
Infrastructure overhead: 50-70% less compute resources for control plane
Support costs: IBM enterprise pricing (estimate $50-100/node/month)

Hidden Costs

Ecosystem limitations: Custom development for missing tools
Training investment: HashiCorp-specific knowledge required
Vendor lock-in: Tight integration with HashiCorp stack
Enterprise features: Core functionality requires paid licenses

Monitoring and Observability

Metrics Configuration

telemetry {
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}

Essential Monitoring

Job success/failure rates: Critical for deployment reliability
Resource utilization: Memory/CPU allocation vs usage
Cluster health: Server/client connectivity status
Service discovery: Consul integration health

Alerting Priorities

Server quorum loss: Immediate response required
Consul outages: Service discovery failure
Resource exhaustion: Node capacity planning
Failed deployments: Application-specific issues

Use Case Suitability

Ideal Scenarios

Edge computing: Lightweight footprint for resource-constrained environments
Legacy migration: Mixed workload support during modernization
Small teams: Operational simplicity over feature richness
Batch processing: Scientific computing and data processing pipelines

Poor Fit Scenarios

Kubernetes-native applications: Extensive ecosystem integration
Complex networking requirements: Advanced service mesh needs
Rapid feature development: Cutting-edge orchestration features
Large platform teams: Teams that can absorb Kubernetes complexity

Version and Upgrade Management

Upgrade Process

Server upgrades first: Always upgrade servers before clients
Adjacent versions only: Cannot skip intermediate versions
Job specification changes: Breaking changes between major versions
Rollback planning: Maintain previous version binaries

Breaking Changes History

1.2 to 1.5: Job specification format changes caused failures
1.8.x series: Rolling update deadlock bug
2.0 migration: Significant API changes requiring job rewrites

Support and Community

Official Support Channels

IBM Enterprise Support: SLA-backed support with escalation paths
HashiCorp Discuss: Community forum with official engineer participation
GitHub Issues: Bug reports and feature requests
Professional Services: IBM consulting for large deployments

Community Resources

Slack Community: Real-time help with good signal-to-noise ratio
Documentation Quality: Better than average for HashiCorp products
Tutorial Accuracy: Step-by-step guides that actually work
Example Repositories: Working production configurations available

Decision Framework

Choose Nomad When

Team size < 10 engineers
Mixed workload requirements (containers + VMs + legacy)
Operational simplicity prioritized over feature breadth
Edge computing or resource-constrained environments
Kubernetes complexity outweighs benefits

Choose Kubernetes When

Large ecosystem integration required
Advanced networking/storage needs
Large platform engineering team available
Cloud-native application architecture
Industry standard compliance required

Evaluation Criteria

Team expertise: Kubernetes knowledge vs learning investment
Workload types: Container-only vs mixed workload requirements
Operational overhead: Available engineering resources
Ecosystem needs: Required third-party tool integration
Time to production: Urgency of deployment requirements

Useful Links for Further Investigation

Resources That Actually Help (Skip the Marketing BS)

Link	Description
Nomad Documentation	Unlike most HashiCorp docs, these are actually readable. Start with the installation guide, then job specifications. The operational guides have real production tips.
Nomad Tutorials	Step-by-step walkthroughs that work. The "Deploy and Manage Jobs" tutorial is essential. Skip the marketing fluff, focus on hands-on examples.
Job Specification Reference	Bookmark this. You'll reference it constantly when writing HCL job files. Every parameter is documented with examples.
API Reference	Clean HTTP API docs for automation. The /v1/jobs endpoint gets most use for CI/CD integration.
GitHub Issues	Where bugs get reported and fixed. Search here before asking questions - someone probably hit your issue already.
HashiCorp Discuss Forum	The official forum where HashiCorp engineers actually respond. Better for complex questions than Slack.
Community Slack	Real-time help for quick questions. Smaller than Kubernetes Slack but the signal-to-noise ratio is better.
Nomad Pack	Template system for reusable job specs. Think Helm charts but simpler. Community packs exist for common services like Redis, Postgres, monitoring stacks.
Levant	Deployment automation with templating and rollbacks. I use this for blue-green deployments and canary releases. Much simpler than complex K8s operators.
Terraform Nomad Provider	Manage Nomad clusters and jobs as code. Perfect for gitops workflows. The provider is well-maintained and feature-complete.
Nomad Autoscaler	Horizontal and vertical scaling based on metrics. Works with Prometheus, AWS CloudWatch, and other datasources. Setup is straightforward.
Nomad Guides Repository	Working examples for common scenarios. The multi-region deployment guide saved me weeks of trial and error.
Production Reference Architecture	How to actually deploy Nomad in production with HA, security, monitoring. Follow this unless you enjoy learning from failures.
Cloudflare's Nomad Setup	Real production deployment at scale. They run 200+ edge locations with Nomad. Their networking setup is brilliant.
Prometheus Metrics Config	Enable telemetry in your Nomad config or you'll have no visibility. The default metrics are comprehensive.
Community Grafana Dashboards	Pre-built dashboards that work out of the box. Download, import, done. Way better than building from scratch.
Events API for Custom Alerting	Stream job state changes to your monitoring system. Perfect for custom alerts on deployment failures.
Docker Driver Documentation	How to configure private registry auth. You'll need this for any serious deployment.
Security Configuration Tutorial	TLS and ACL setup. The defaults are insecure - follow this guide before going to production.
Consul Integration Deep Dive	Service discovery setup with Consul. Essential for multi-service deployments.
Nomad Enterprise Features	Multi-region federation, advanced autoscaling, audit logging. Expensive but worth it for large deployments.
IBM Support Portal	Enterprise support with SLAs. Now that IBM owns HashiCorp, expect enterprise-grade pricing and support quality.
Professional Services	If you need help with large deployments or migration. These consultants actually know what they're doing.
Mitchell Hashimoto's Architecture Posts	The Nomad creator's thoughts on distributed systems. Deep technical content.
Nomad Community Blog Posts	Real user discussions, troubleshooting, and deployment stories. Less sanitized than official forums.