Longhorn Distributed Block Storage for Kubernetes: AI-Optimized Technical Reference
Overview and Positioning
Technology: Longhorn - Distributed block storage for Kubernetes clusters
Maintainer: SUSE (formerly Rancher Labs)
Status: CNCF Incubating project, production-ready as of v1.9.1 (July 2025)
Architecture: Microservices-based with dedicated storage engine per volume
Critical Success Factors
What Actually Works
- Isolated failure domains: Each volume runs dedicated storage engine, preventing cascade failures
- Incremental snapshots: Point-in-time recovery without excessive disk consumption
- Multi-destination backups: S3/NFS integration for cross-cluster restoration (scheduling for both is sketched after this list)
- Thin provisioning: Dynamic disk allocation based on actual usage
- Usable management UI: Functional dashboard for volume monitoring and operations
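Snapshot and backup scheduling is driven by Longhorn's RecurringJob custom resource. The sketch below is illustrative rather than authoritative: the CRD version, field names, and the "default" group follow the commonly documented conventions, so verify them against the docs for your installed version.

```bash
# Hedged sketch: a daily snapshot job applied to all volumes in the "default" group.
# Field names assume the longhorn.io/v1beta2 RecurringJob CRD; verify against your version.
kubectl apply -f - <<'EOF'
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: daily-snapshot
  namespace: longhorn-system
spec:
  cron: "0 2 * * *"    # run at 02:00 every day
  task: snapshot        # switch to "backup" to push to the configured S3/NFS backup target
  groups:
    - default           # applies to volumes in the default group
  retain: 7             # keep the last 7 snapshots
  concurrency: 2        # limit simultaneous jobs per node
EOF
```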
Performance Specifications
- Production IOPS: 4,000-6,000 random 4K read IOPS on SSDs; writes land at roughly 60% of the read figure
- Latency: <10ms for most operations with SSD storage
- HDD penalty: 50% performance reduction compared to SSD
- Rebuild impact: 70% write performance degradation during replica reconstruction
- Memory overhead: 256MB per TB per replica (768MB for default 3-replica 1TB volume)
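To check whether a given volume is in this ballpark, a generic fio run inside any pod that mounts the volume is usually enough. The /data mount path and test file name are assumptions; the fio flags are standard.

```bash
# Assumes you are inside a pod that mounts a Longhorn-backed PVC at /data.
# 4K random read, direct I/O, 60 seconds - compare the reported IOPS to the figures above.
fio --name=randread --filename=/data/fio.test --size=2G --bs=4k --rw=randread \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based --group_reporting

# Same job for writes (expect roughly 60% of the read figure on SSD-backed nodes).
fio --name=randwrite --filename=/data/fio.test --size=2G --bs=4k --rw=randwrite \
    --ioengine=libaio --direct=1 --iodepth=32 --runtime=60 --time_based --group_reporting
```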
Installation Requirements and Failure Points
Hard Requirements
- Minimum cluster size: 3 nodes (2-node clusters fail quorum requirements)
- Kubernetes version: v1.25+ minimum
- Critical dependency: open-iscsi package installed and running on ALL nodes (preflight sketch after this list)
- Resource minimums: 4GB RAM, 2 CPU cores per node
- Network requirement: Low-latency, reliable connectivity between nodes
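A minimal per-node preflight pass catches most of the installation failures in the next section before they cost you hours. This is a generic sketch, not an official Longhorn script; the package commands assume Debian/Ubuntu.

```bash
#!/usr/bin/env bash
# Minimal per-node preflight sketch (Debian/Ubuntu assumed; adapt package checks for other distros).
set -euo pipefail

# open-iscsi must be installed and iscsid running (Ubuntu 20.04 disables it by default).
dpkg -s open-iscsi >/dev/null 2>&1 || { echo "open-iscsi missing"; exit 1; }
systemctl is-active --quiet iscsid  || { echo "iscsid not running"; exit 1; }

# The iscsi_tcp kernel module must be loadable (run as root).
modprobe iscsi_tcp || { echo "cannot load iscsi_tcp"; exit 1; }

# Loop device exhaustion causes silent attach failures - eyeball current usage.
losetup -l

# Rough resource check: at least 4GB RAM and 2 CPU cores per node.
free -g
nproc
echo "preflight OK on $(hostname)"
```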
Installation-Breaking Issues
Issue | Symptom | Solution | Time Cost |
---|---|---|---|
Missing open-iscsi | Pods stuck in ContainerCreating | apt install open-iscsi && systemctl enable iscsid | 2+ hours debugging |
Ubuntu 20.04 default | iscsid disabled by default | Manual service enablement required | 1 hour |
RKE2 kubelet path | Volume mount failures | --set defaultSettings.kubeletRootDir=/var/lib/kubelet | 30 minutes |
Loop device exhaustion | Silent attach failures | Monitor with losetup -l | 1+ hours |
Network packet loss | Hanging replica rebuilds | Switch/port diagnosis required | 4+ hours |
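The RKE2 row is the one most Helm installs trip over. A hedged install sketch follows: the chart repo URL, release name, and namespace are the commonly documented defaults, and the kubeletRootDir override is the flag from the table above; confirm both against the official Helm chart docs for your version.

```bash
# Hedged Helm install sketch for RKE2, pointing Longhorn at the kubelet root dir.
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system --create-namespace \
  --set defaultSettings.kubeletRootDir=/var/lib/kubelet
```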
Version-Specific Gotchas
- v1.8.2: Replica rebuild hangs with mixed disk types (SSD+HDD)
- v1.9.0 RC1: UI breaks with 50+ volumes
- Upgrade requirement: Sequential minor version upgrades only (no skipping)
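In practice the no-skipping rule means walking through each minor release. A sketch of the pattern, assuming a Helm-managed install; the version numbers are illustrative and chart versions should be confirmed against the release notes.

```bash
# Sequential minor-version upgrade sketch (no skipping), assuming a Helm-managed install.
helm repo update
helm upgrade longhorn longhorn/longhorn -n longhorn-system --version 1.8.2   # finish the 1.8.x line first
kubectl -n longhorn-system get pods                                           # confirm everything is Running
helm upgrade longhorn longhorn/longhorn -n longhorn-system --version 1.9.1   # then move to 1.9.x
```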
Operational Intelligence
Scale Limitations
- Official limit: 500 volumes per cluster
- UI degradation: Starts at 100 volumes, unusable at 200+
- API performance: Remains functional beyond UI limits
- Memory scaling: 2GB consumed with 4TB allocated across multiple volumes
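A quick way to see how close a cluster is to these limits is to count the Longhorn volume custom resources directly; the API keeps answering even when the UI bogs down. The resource names below are the standard longhorn.io CRDs, but verify them on your cluster.

```bash
# Count Longhorn volumes (UI slows past ~100, official limit is 500).
kubectl -n longhorn-system get volumes.longhorn.io --no-headers | wc -l

# Rough memory sanity check: instance-manager pods carry most of the per-replica overhead.
# Requires metrics-server to be installed.
kubectl -n longhorn-system top pods | grep instance-manager
```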
Critical Failure Scenarios
Scenario | Impact | Recovery Time | Mitigation |
---|---|---|---|
Single node failure | Read-only until rebuild | 30s detection + rebuild time | Monitor replica health |
Replica rebuild on large volumes | Severe performance degradation | 2+ hours for 100GB | Schedule maintenance windows |
Network partition | Volume "Unknown" state | 30s manager pod restart | Redundant network paths |
All replicas lost | Complete data loss | 6+ hours from backup | Never delete all replicas |
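Catching the first two scenarios early comes down to watching replica and volume health. A hedged sketch against the Longhorn volume CRs; the status.state and status.robustness fields follow the Longhorn CRD conventions, so confirm the exact paths for your version.

```bash
# List volumes whose robustness is not "healthy" (degraded volumes are rebuilding or missing replicas).
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness \
  | grep -v healthy
```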
Backup and Recovery Reality
- Backup speed: Limited by S3 egress bandwidth
- Restoration time: 4 hours for 200GB from S3
- Cross-cluster recovery: Tested and functional during DC migration
- Backup configuration: Requires manual S3/NFS setup, not automatic (setup sketch after this list)
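A hedged sketch of the manual S3 setup: a credentials secret plus the backup-target settings. The secret keys and setting names follow the commonly documented Longhorn conventions, and the bucket, region, and key values are placeholders; verify the exact value paths against the docs for your version.

```bash
# Credentials secret referenced by the backup target (keys and bucket are placeholders).
kubectl -n longhorn-system create secret generic longhorn-backup-creds \
  --from-literal=AWS_ACCESS_KEY_ID=<access-key> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret-key>

# Point Longhorn at the bucket via Helm values (setting names assumed from the chart's defaultSettings).
helm upgrade longhorn longhorn/longhorn -n longhorn-system --reuse-values \
  --set defaultSettings.backupTarget="s3://my-longhorn-backups@us-east-1/" \
  --set defaultSettings.backupTargetCredentialSecret=longhorn-backup-creds
```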
Decision Criteria and Trade-offs
When Longhorn is Worth It
- Scenario: Need "good enough" storage without storage team expertise
- Benefit: Operational simplicity over performance optimization
- Cost: 70% write performance hit during rebuilds
- Alternative avoided: Ceph operational complexity and failure cascades
When to Choose Alternatives
Use Case | Better Option | Reason |
---|---|---|
High-performance databases | Dedicated storage arrays | Consistent low latency required |
Large-scale deployments | Rook-Ceph | Better scaling beyond 500 volumes |
Single-node testing | OpenEBS | Supports single-node clusters |
Enterprise features | StorageOS | Advanced enterprise backup/monitoring |
Resource Investment Requirements
- Initial setup: 5 minutes to 4 hours (depends on Linux storage issues)
- Operational overhead: Minimal once stable (quarterly upgrades)
- Expertise needed: Basic Kubernetes knowledge, Linux storage fundamentals
- Support options: SUSE commercial support available, active community Slack
Production Warnings and Tribal Knowledge
Undocumented Behaviors
- Volume attach debugging: Check kubectl get volumeattachments for stuck states (commands sketched after this list)
- Network diagnosis: Packet loss causes hanging rebuilds (check switch ports)
- Unknown state recovery: Usually networking - restart manager pod first
- Backup timeout: Increase timeout settings for slow S3 connections
- UI performance: Becomes unusable >100 volumes but API remains functional
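The attach-debugging and Unknown-state items boil down to a couple of stock commands. The DaemonSet name longhorn-manager matches the default install; treat it as an assumption if your deployment renames components.

```bash
# Stuck attach: look for VolumeAttachment objects that never reach "attached".
kubectl get volumeattachments
kubectl describe volumeattachment <name>        # <name> taken from the previous output

# Volume stuck in "Unknown": usually networking - restart the manager pods first.
kubectl -n longhorn-system rollout restart daemonset/longhorn-manager
kubectl -n longhorn-system rollout status daemonset/longhorn-manager
```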
Monitoring and Alerting Requirements
- Critical metrics: Replica rebuild status, volume "Unknown" state detection
- Prometheus integration: Available and functional
- Alert thresholds: Memory usage scaling with volume count
- Network monitoring: Essential for rebuild performance diagnosis
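With the Prometheus integration enabled, the critical metrics above map onto queries like the ones below. The metric name longhorn_volume_robustness and its value encoding are assumptions based on Longhorn's published metrics list; check the metric reference for your version before wiring these into alerts.

```bash
# Ad-hoc checks against a Prometheus server (URL is a placeholder).
PROM=http://prometheus.example:9090

# Volumes currently degraded (replica rebuild in progress or replicas missing).
curl -sG "$PROM/api/v1/query" --data-urlencode 'query=longhorn_volume_robustness == 2'

# Volumes in an unknown or faulted state - page on this one.
curl -sG "$PROM/api/v1/query" \
  --data-urlencode 'query=longhorn_volume_robustness == 0 or longhorn_volume_robustness == 3'
```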
Maintenance Patterns
- Upgrade frequency: Every 4 months (stable release cycle)
- Testing requirement: Always test in staging first
- Maintenance windows: Required for large volume rebuilds
- Backup verification: Test restoration before needed (not optional)
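Before each upgrade window, a quick pass over the backup objects is worth scripting; the actual test restore is easiest to drive from the UI or a PVC that references a backup, so this sketch only covers verification. The CRD names are the standard longhorn.io resources, and the status column paths are assumptions to confirm on your cluster.

```bash
# Confirm backups actually exist and completed before relying on them.
kubectl -n longhorn-system get backupvolumes.longhorn.io
kubectl -n longhorn-system get backups.longhorn.io \
  -o custom-columns=NAME:.metadata.name,STATE:.status.state,SIZE:.status.size

# Sanity-check that the backup target setting still points at the right bucket.
kubectl -n longhorn-system get settings.longhorn.io backup-target -o jsonpath='{.value}{"\n"}'
```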
Comparison with Alternatives
Aspect | Longhorn | Rook-Ceph | OpenEBS | StorageOS |
---|---|---|---|---|
Operational complexity | Low | High | Medium | Medium |
Minimum cluster size | 3 nodes | 5+ nodes | 1 node | 3 nodes |
Installation time | 5 min - 4 hours | Days | Hours | Hours |
Memory overhead | 256MB/TB/replica | 2GB+ per node | Variable | Medium |
Performance during failures | Degraded | Complex failure modes | Engine-dependent | Fast recovery |
Learning curve | Minimal | Steep | Moderate | Moderate |
Enterprise support | SUSE | Red Hat/IBM | MayaData | StorageOS |
Bottom Line Assessment
Operational reality: Longhorn delivers "boring infrastructure that just works" - the infrastructure sweet spot where you can focus on applications instead of storage debugging.
Best fit: Organizations needing reliable persistent storage without dedicated storage teams or complex performance requirements.
Risk profile: Low operational risk once running, moderate setup risk due to Linux storage dependencies.
Cost-benefit: Trades peak performance for operational simplicity - worthwhile for most Kubernetes workloads except high-performance databases.
Useful Links for Further Investigation
Resources That Actually Help
Link | Description |
---|---|
Longhorn GitHub Issues | Where you'll end up when things break. Search closed issues first - someone has hit your exact problem before. The maintainers actually respond, which is refreshing. |
Troubleshooting Docs | The troubleshooting section has saved my ass multiple times. Start here when volumes get stuck in "Unknown" state or when replica rebuilds fail silently. |
Community Slack | #longhorn channel is active and people actually help instead of telling you to RTFM. Way better than Stack Overflow for Longhorn-specific questions. |
Longhorn Docs | Actually readable documentation, unlike most Kubernetes project docs. The backup/restore section is solid, installation guide is accurate. |
Release Notes | Read these before upgrading. They actually document breaking changes and migration steps. v1.9.0 release notes saved me from a config migration headache. |
Rancher Longhorn Guide | If you're using Rancher, this one-click install actually works. Better than manually applying YAML. |
SUSE Support | Commercial support if you need someone to call at 3am when production is down. Worth the cost for critical workloads. |
CNCF Project Info | Incubating project status means it's stable but not finished. Good enough for production, just don't expect it to solve world hunger. |
Official Helm Charts | Use these instead of kubectl apply. You get actual configuration options and upgrades that don't break everything. |
Architecture Overview | Read this if you want to understand why replica rebuilds take forever. Helpful for troubleshooting weird performance issues. |