Velero: Kubernetes Backup & Disaster Recovery - AI Knowledge Base
Executive Summary
Velero is an open-source Kubernetes backup and disaster recovery tool maintained by VMware Tanzu. Production-proven by Netflix, MongoDB, and Reddit. Current version v1.17.0 (2025) fixes memory leaks but introduces new failure modes. Setup takes 2-3 days because of IAM permission complexity; the ongoing maintenance burden is moderate.
Core Architecture & Failure Points
Three Critical Components
Velero Server (Controller)
- Function: Watches backup CRDs, communicates with K8s API and storage
- Critical Failure: RBAC/IAM permission mismatches cause silent failures
- Hidden Cost: Requires constant monitoring setup - fails silently by default
- Debugging Time: Hours spent on permission troubleshooting
Velero CLI
- Function: Primary interface for backup/restore operations
- Critical Failure: Commands fail silently, "Completed" status lies
- Required Action: Always run `velero backup describe` after operations
- Operational Reality: Half of commands require verification to confirm actual success
Node Agent (DaemonSet)
- Function: Handles persistent volume backups via Kopia (replaced Restic in v1.14)
- Critical Failure: Random restarts leave backups stuck in "InProgress"
- Resource Impact: Can OOMKill nodes during large backups without proper limits
- Memory Reality: Memory use climbs sharply during large backups and may not be released afterwards
Storage Backend Configuration & Costs
AWS S3 (Most Complex Setup)
Setup Time: 2-3 days for IAM permissions
Hidden Requirements: 47+ IAM permissions (official docs incomplete)
Known Issues: GitHub issue #8240 - IRSA roles still broken in recent versions
Storage Cost: ~$25/month per 1TB backup
Breaking Points: Plugin updates frequently break authentication
Critical Permissions Missing from Official Docs:
{
"Essential_Additions": [
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts",
"ec2:DescribeVolumes",
"ec2:DescribeSnapshots",
"ec2:CreateSnapshot",
"ec2:DeleteSnapshot",
"ec2:DescribeInstances"
]
}
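As a rough illustration, the actions above can be wrapped into an IAM policy document. This is a minimal sketch, not the plugin's official policy: the bucket name `example-velero-backups` and the helper `build_policy` are hypothetical, and a working policy also needs the basic object and bucket permissions (such as `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, and `s3:ListBucket`) that this list does not repeat.

```python
import json

# Actions copied from the list above; everything else here is illustrative.
S3_OBJECT_ACTIONS = ["s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"]
EC2_ACTIONS = [
    "ec2:DescribeVolumes",
    "ec2:DescribeSnapshots",
    "ec2:CreateSnapshot",
    "ec2:DeleteSnapshot",
    "ec2:DescribeInstances",
]


def build_policy(bucket: str) -> dict:
    """Return an IAM policy document covering only the actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Multipart-upload actions apply to objects inside the bucket.
                "Effect": "Allow",
                "Action": S3_OBJECT_ACTIONS,
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {
                # Left unscoped here for simplicity; narrow it if your setup allows.
                "Effect": "Allow",
                "Action": EC2_ACTIONS,
                "Resource": "*",
            },
        ],
    }


policy = build_policy("example-velero-backups")  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Generating the document from code keeps the action list in one reviewable place, which helps when a plugin update changes what is required.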
Google Cloud Storage (Least Problematic)
Setup Time: 0.5-1 day
Authentication: Workload Identity cleaner than service accounts
Advantage: Simpler IAM model, readable error messages
Reliability: Most stable authentication mechanism
Azure Blob Storage (Authentication Maze)
Setup Time: 1-2 days navigating identity systems
Complexity: Multiple identity types, subscription dependencies
Breaking Point: Azure's layered authentication model makes it easy to attach permissions to the wrong identity, subscription, or resource group
CSI Volume Snapshots vs File System Backup
Aspect | CSI Snapshots | File System Backup (Kopia) |
---|---|---|
Speed | Seconds regardless of size | 4-8 hours for 500GB |
Network Impact | Minimal transfer | Full volume data transfer |
Storage Cost | Incremental differences | Full data storage cost |
Reliability | Depends on CSI driver quality | Memory leaks fixed but new failure modes |
When to Use | Default choice if storage supports | Only when CSI drivers broken |
Resource Usage | Minimal | Can OOMKill nodes |
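The "4-8 hours for 500GB" figure in the table implies an effective throughput, which can be used to extrapolate to other volume sizes. The sketch below does only that arithmetic; it assumes 1 GB = 1000 MB and that throughput stays constant as volumes grow, which may not hold in practice. The function names are illustrative.

```python
def throughput_mb_per_s(size_gb: float, hours: float) -> float:
    """Average throughput in MB/s, using 1 GB = 1000 MB."""
    return size_gb * 1000 / (hours * 3600)


def hours_for(size_gb: float, mb_per_s: float) -> float:
    """Hours needed to move size_gb at a sustained rate of mb_per_s."""
    return size_gb * 1000 / mb_per_s / 3600


fast = throughput_mb_per_s(500, 4)  # best case quoted in the table
slow = throughput_mb_per_s(500, 8)  # worst case quoted in the table
print(f"implied throughput: {slow:.1f}-{fast:.1f} MB/s")
print(f"2000 GB at those rates: {hours_for(2000, fast):.0f}-{hours_for(2000, slow):.0f} h")
```

At these rates a multi-terabyte volume does not fit in a nightly backup window, which is the practical argument for preferring CSI snapshots where the driver supports them.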
Competitive Analysis & Decision Matrix
Solution | Cost Reality | Setup Complexity | Reliability | When to Choose |
---|---|---|---|---|
Velero | Free + debugging time | 2-3 days IAM hell | Good with proper monitoring | Multi-cloud, cost-conscious |
Kasten K10 | $$$ enterprise rates | Sales call required | Actually works | Enterprise budget available |
Portworx PX-Backup | $$$ if locked in | Only works with Portworx | Good in ecosystem | Already using Portworx |
Native etcd Backup | Free | `etcdctl snapshot save` | Control plane only | Minimum viable backup |
Critical Failure Scenarios & Solutions
"Completed" But Nothing Backed Up
Root Cause: RBAC permissions, storage class incompatibilities, wrong selectors
Detection: Always run `velero backup describe <backup-name> --details` and check the item counts, errors, and warnings rather than the phase alone
Prevention: Validate selectors before backup creation
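The detection step above can be automated by inspecting the Backup object itself, for example the JSON from `kubectl -n velero get backups.velero.io <backup-name> -o json`. The sketch below is one way to do that. It assumes the `status.phase`, `status.errors`, `status.warnings`, and `status.progress.itemsBackedUp` fields as they appear on Velero Backup resources; the helper name `backup_problems` and the sample object are hypothetical.

```python
def backup_problems(backup: dict) -> list[str]:
    """Return reasons to distrust a backup; an empty list means no red flags."""
    status = backup.get("status", {})
    progress = status.get("progress", {})
    problems = []
    if status.get("phase") != "Completed":
        problems.append(f"phase is {status.get('phase')!r}")
    if status.get("errors", 0) > 0:
        problems.append(f"{status['errors']} errors")
    if status.get("warnings", 0) > 0:
        problems.append(f"{status['warnings']} warnings")
    if progress.get("itemsBackedUp", 0) == 0:
        # A "Completed" backup with zero items usually means a bad selector.
        problems.append("no items backed up (check selectors and namespaces)")
    return problems


# Hypothetical example: reports Completed but captured nothing.
sample = {
    "metadata": {"name": "nightly-001"},
    "status": {"phase": "Completed", "progress": {"totalItems": 0, "itemsBackedUp": 0}},
}
print(backup_problems(sample))
```

An empty result only means the status carries no obvious warning signs; it does not replace an actual test restore.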
AWS "Access Denied" Loops
Root Cause: Incomplete IAM policies from official docs
Time Cost: 2+ days discovering missing permissions
Solution: Use complete permission set above, check GitHub issues
Stuck "InProgress" Restores
Root Cause: Node agent pod crashed/restarted during operation
Recovery: Delete orphaned restore, restart operation
Prevention: Set appropriate resource limits on node agents
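Orphaned restores can be found by looking for Restore objects that have stayed "InProgress" longer than any plausible run. The following is a minimal sketch that assumes each object carries `status.phase` and `status.startTimestamp` in the usual Kubernetes UTC format; the function name, the two-hour threshold, and the sample data are illustrative only.

```python
from datetime import datetime, timedelta, timezone


def stuck_restores(restores: list[dict], now: datetime, max_age: timedelta) -> list[str]:
    """Names of restores that have been InProgress for longer than max_age."""
    stuck = []
    for restore in restores:
        status = restore.get("status", {})
        started = status.get("startTimestamp")
        if status.get("phase") != "InProgress" or started is None:
            continue
        # Kubernetes timestamps look like 2025-01-01T02:00:00Z (UTC).
        started_at = datetime.strptime(started, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
        if now - started_at > max_age:
            stuck.append(restore["metadata"]["name"])
    return stuck


# Hypothetical data: only restore-a has been running for more than two hours.
now = datetime(2025, 1, 1, 6, 0, tzinfo=timezone.utc)
restores = [
    {"metadata": {"name": "restore-a"},
     "status": {"phase": "InProgress", "startTimestamp": "2025-01-01T01:00:00Z"}},
    {"metadata": {"name": "restore-b"},
     "status": {"phase": "InProgress", "startTimestamp": "2025-01-01T05:30:00Z"}},
    {"metadata": {"name": "restore-c"},
     "status": {"phase": "Completed", "startTimestamp": "2025-01-01T00:00:00Z"}},
]
print(stuck_restores(restores, now, timedelta(hours=2)))
```

Each name returned is a candidate for `velero restore delete <name>` followed by a fresh restore, after confirming the node agent pod really did restart.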
Memory Exhaustion
Root Cause: Kopia memory usage during file system backups
Impact: Node OOMKills, cluster instability
Solution: Configure resource limits, prefer CSI snapshots
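One way to apply memory limits is a patch against the node agent DaemonSet. The sketch below only builds the patch body. It assumes the DaemonSet and its container are both named `node-agent` in the `velero` namespace, which is worth confirming in your cluster, and it uses the 2-4GB range from this document's resource planning figures as example values.

```python
import json


def node_agent_patch(mem_request: str, mem_limit: str) -> dict:
    """Build a strategic-merge patch that sets memory request and limit."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "node-agent",  # assumed container name
                            "resources": {
                                "requests": {"memory": mem_request},
                                "limits": {"memory": mem_limit},
                            },
                        }
                    ]
                }
            }
        }
    }


patch = node_agent_patch("2Gi", "4Gi")  # example values, tune per workload
print(json.dumps(patch, indent=2))
```

Saved to a file, the output could be applied with `kubectl -n velero patch daemonset node-agent --patch-file patch.json`. A limit turns a node-wide OOM event into a restart of a single pod, which is the better failure mode even though the backup still fails.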
Production Implementation Requirements
Resource Planning
Memory: 2-4GB per node agent for large backups
Network: 100GB backup = 100GB transfer with file system method
Time: CSI snapshots (seconds) vs File backup (hours)
Storage: Set retention policies or face budget shock
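The retention warning becomes concrete with some arithmetic. This sketch uses the ~$25/month per 1TB figure quoted earlier in this document and assumes every retained backup is stored in full, so it is an upper bound that ignores any deduplication. The function names are illustrative; `--ttl` is the Velero flag that takes a retention duration in hours.

```python
COST_PER_TB_MONTH = 25.0  # USD, the S3 figure quoted earlier in this document


def monthly_cost(backup_tb: float, backups_per_day: float, retention_days: float) -> float:
    """Worst-case steady-state monthly cost: every retained backup stored in full."""
    retained_backups = backups_per_day * retention_days
    return backup_tb * retained_backups * COST_PER_TB_MONTH


def ttl_flag(retention_days: int) -> str:
    """Retention expressed in hours, the form the --ttl flag accepts."""
    return f"{retention_days * 24}h"


for days in (7, 30, 90):
    cost = monthly_cost(backup_tb=1, backups_per_day=1, retention_days=days)
    print(f"--ttl {ttl_flag(days)}: up to ${cost:,.0f}/month")
```

Under these assumptions the cost scales linearly with the retention window, so the TTL is the main lever for controlling the bill.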
Monitoring Requirements (Critical)
Why Essential: Velero fails silently by default
Setup Time: 4-8 hours for proper alerting
Key Metrics: Backup success rate, duration, storage usage
Tools: Prometheus + Grafana dashboards mandatory
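The backup success rate can be derived from the counters Velero exposes on its metrics endpoint. The sketch below parses Prometheus text-format output and computes that rate. The metric names are the ones I understand Velero to export; confirm them against your own `/metrics` output. The sample text is made up, and the parser assumes samples carry no trailing timestamp.

```python
def parse_counters(text: str) -> dict[str, float]:
    """Sum sample values per metric name from Prometheus text-format output."""
    totals: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def success_rate(totals: dict[str, float]) -> float:
    """Share of finished backups that fully succeeded (0.0 if none finished)."""
    ok = totals.get("velero_backup_success_total", 0.0)
    failed = totals.get("velero_backup_failure_total", 0.0)
    partial = totals.get("velero_backup_partial_failure_total", 0.0)
    finished = ok + failed + partial
    return ok / finished if finished else 0.0


# Hypothetical scrape output.
sample_metrics = """
# TYPE velero_backup_success_total counter
velero_backup_success_total{schedule="daily"} 27
velero_backup_failure_total{schedule="daily"} 2
velero_backup_partial_failure_total{schedule="daily"} 1
"""
rate = success_rate(parse_counters(sample_metrics))
print(f"backup success rate: {rate:.0%}")
```

In a real setup the same ratio would normally be a Prometheus alerting rule; the script form is mainly useful for ad-hoc checks or CI gates.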
Testing Protocol (Non-Negotiable)
Quarterly DR Drills: Actually restore in test environment
Validation: Don't trust "Completed" status without verification
Scope: Test both applications and persistent data recovery
Reality Check: Most backup failures discovered during real disasters
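A DR drill needs a pass/fail check that does not depend on the restore's reported status. One simple option is to compare per-kind resource counts between the source namespace and the restored one. The sketch below assumes those counts have already been collected (for example with `kubectl get <kind> -n <namespace>`); the function name and sample numbers are hypothetical.

```python
def missing_resources(expected: dict[str, int], restored: dict[str, int]) -> dict[str, int]:
    """Per-kind shortfall between the source namespace and the restored one."""
    shortfall = {}
    for kind, count in expected.items():
        got = restored.get(kind, 0)
        if got < count:
            shortfall[kind] = count - got
    return shortfall


# Hypothetical drill result: one PVC and all secrets failed to come back.
source = {"deployments": 4, "persistentvolumeclaims": 3, "secrets": 6}
restored = {"deployments": 4, "persistentvolumeclaims": 2}
print(missing_resources(source, restored))
```

Matching counts only show that the objects exist again. Persistent data still needs an application-level check, such as querying a known record from the restored database.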
When Velero Makes Sense vs Alternatives
Choose Velero When:
- Multi-cloud portability required
- Free solution acceptable with debugging investment
- Team has Kubernetes expertise
- Vendor lock-in unacceptable
Choose Alternatives When:
- Enterprise budget available (Kasten K10)
- Single cloud provider acceptable (AWS/Azure native backup)
- Zero maintenance tolerance
- Immediate reliability required
Critical Warnings & Gotchas
Configuration Traps
Retention Policies: Forgetting these = budget explosion
Secret Restoration: Restoring production secrets into staging gives staging live credentials (for example, it can start emailing real customers)
Plugin Versions: Compatibility breaks between Velero versions
CSI Driver Quality: Many claim snapshot support, few deliver
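For the secret restoration trap, one mitigation is to leave Secrets out when restoring a production backup into another environment. The sketch below only assembles the command line; it relies on the `--from-backup`, `--namespace-mappings`, and `--exclude-resources` flags of `velero restore create`, while the backup and namespace names are hypothetical.

```python
import shlex


def staging_restore_command(backup: str, source_ns: str, target_ns: str) -> list[str]:
    """Build a restore command that remaps the namespace and skips Secrets."""
    return [
        "velero", "restore", "create",
        "--from-backup", backup,
        "--namespace-mappings", f"{source_ns}:{target_ns}",
        "--exclude-resources", "secrets",
    ]


cmd = staging_restore_command("prod-nightly", "shop", "shop-staging")
print(shlex.join(cmd))
```

Workloads restored this way start without their Secrets, so staging credentials have to be supplied separately before the pods can run.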
Cost Surprises
File System Backups: 10x more expensive than CSI snapshots
Network Transfer: Massive bandwidth costs for large volumes
Storage Growth: No retention = storage and cost keep growing with every backup
Hidden AWS Costs: Data transfer fees not mentioned in documentation
Breaking Changes
v1.14 Migration: Restic to Kopia required repository migration
Plugin Updates: Authentication frequently breaks after updates
Kubernetes Versions: Test compatibility before cluster upgrades
Implementation Timeline & Resource Investment
Phase 1: Initial Setup (2-3 days)
- Day 1: IAM/authentication hell
- Day 2: Permission debugging
- Day 3: First successful backup
Phase 2: Production Hardening (1-2 weeks)
- Week 1: Monitoring setup, resource limits
- Week 2: DR testing, failure scenario validation
Phase 3: Operational Maturity (Ongoing)
- Monthly: Review backup success rates
- Quarterly: Full DR drill execution
- As needed: Plugin updates and permission fixes
This knowledge base captures the operational reality of Velero deployment, including hidden costs, time investments, and the specific failure modes that cause production pain. Use this for automated decision-making about whether Velero fits your organization's risk tolerance and resource availability.
Useful Links for Further Investigation
Essential Velero Resources (That Actually Help)
Link | Description |
---|---|
Velero Official Documentation | The docs are surprisingly good for a CNCF project. They include working examples and don't assume you're a Kubernetes wizard. The version-specific docs are crucial because features change between releases in ways that'll break your setup. |
Velero GitHub Repository | Your first stop when shit breaks. The maintainers actually respond to issues, and the search function works. Pro tip: search closed issues too - your problem has probably been reported and fixed already. |
Velero Releases and Changelog | Read the release notes before upgrading or you'll learn about breaking changes the hard way. Plugin compatibility breaks between versions, and the changelog will save you hours of debugging why your backups suddenly stopped working. |
AWS Plugin for Velero | The AWS plugin works great once you survive IAM permission hell. The CloudFormation templates in the docs are incomplete - you'll spend 2 days discovering missing permissions. [GitHub issue #8240](https://github.com/vmware-tanzu/velero/issues/8240) shows IRSA is still broken in recent versions. EBS snapshots are reliable when the permissions are finally right. |
Google Cloud Plugin for Velero | Honestly the least painful setup. Workload Identity is cleaner than dealing with service account JSON files, and Google's IAM actually makes sense. GCS integration works smoothly and error messages are helpful. If you're multi-cloud, start here to build confidence. |
Azure Plugin for Velero | Works with Blob Storage and managed disks. Managed identity is nice when it works, but Azure's authentication model is designed by sadists. Expect to spend time figuring out which identity, subscription, and resource group permissions go where. Not impossible, just confusing as hell. |
Velero Community Slack Channel | Actually useful, unlike most K8s Slack channels. People share real production failures and the dirty solutions that worked. Search the channel history before asking - someone's probably hit your exact problem and posted the fix. Way better than Stack Overflow for Velero-specific issues. |
Velero on CNCF Landscape | Velero is listed on the CNCF Landscape and backed by VMware Tanzu, which means it won't disappear next month and the APIs are relatively stable. That matters for backup software - you don't want your disaster recovery tool to be a hobby project. |
Velero Adopters List | Netflix, MongoDB, and Reddit run this in production, so it's battle-tested at scale. Reading the adopters list gives you confidence that others have solved the problems you'll encounter. These companies have pushed Velero to its limits. |
Kubernetes Official Backup Guide | etcd backup is complementary to Velero - you need both. etcd backs up your cluster control plane, Velero backs up your applications and data. Don't skip etcd backup thinking Velero covers everything. |
AWS EKS with Velero Tutorial | The best AWS setup guide that actually covers the IAM permissions you need. The permissions section is worth its weight in gold - follow it exactly or spend 2 days debugging "access denied" errors. Still doesn't cover everything but gets you 90% there. |
Velero Disaster Recovery Guide | Official DR procedures. Here's the critical part: TEST YOUR DISASTER RECOVERY PLAN. Most people discover their backups are worthless during an actual outage. Schedule quarterly DR drills and actually restore everything in a test environment. |
Velero Monitoring Examples | Critical monitoring setup with [Prometheus metrics](https://velero.io/docs/main/monitoring/) and [Grafana dashboards](https://grafana.com/grafana/dashboards/16829-kubernetes-tanzu-velero/). Velero fails silently by default - you MUST set up monitoring or you'll discover broken backups during a disaster. The backup success rate metric is your lifeline. |
Helm Chart for Velero | Official Helm chart that's easier than CLI installation. The values file has sensible defaults but you'll need to customize for your cloud provider. Much cleaner than managing all the YAML manually, and easier to version control your configuration. |
Velero Plugin Registry | All the available plugins for different storage providers. Stick with the official plugins unless you have specific needs - community plugins are hit-or-miss on maintenance and compatibility. |
VMware Tanzu Velero Guide | Enterprise documentation if you're using vSphere with Tanzu. More useful than you'd expect, even if you're not paying for VMware support. They have good troubleshooting guides for complex scenarios. |
Velero Configuration Guide | Advanced configuration options including resource limits and plugin settings. Essential reading if you're running large workloads or have specific performance requirements. The resource limits section will save your nodes from OOMKills. |
Stack Overflow Velero Questions | Real-world problems from people using Velero in production. Often more useful than the official docs for debugging specific issues. Search here first when you hit weird problems - someone's probably solved it already. |