Velero: Kubernetes Backup & Disaster Recovery - AI Knowledge Base

Executive Summary

Velero is an open-source Kubernetes backup tool maintained by VMware Tanzu. Production-proven by Netflix, MongoDB, and Reddit. Current version v1.17.0 (2025) fixes memory leaks but introduces new failure modes. Setup takes 2-3 days, mostly because of IAM permission complexity; the ongoing maintenance burden is moderate.

Core Architecture & Failure Points

Three Critical Components

Velero Server (Controller)

  • Function: Watches backup CRDs, communicates with K8s API and storage
  • Critical Failure: RBAC/IAM permission mismatches cause silent failures
  • Hidden Cost: Requires constant monitoring setup - fails silently by default
  • Debugging Time: Hours spent on permission troubleshooting

Velero CLI

  • Function: Primary interface for backup/restore operations
  • Critical Failure: Commands can fail silently, and a "Completed" status does not prove that anything was backed up
  • Required Action: Always run velero backup describe <backup-name> after operations
  • Operational Reality: Roughly half of commands need a follow-up check to confirm they actually succeeded

Node Agent (DaemonSet)

  • Function: Handles persistent volume backups via Kopia (replaced Restic in v1.14)
  • Critical Failure: Random restarts leave backups stuck in "InProgress"
  • Resource Impact: Without memory limits, large backups can exhaust node memory and trigger OOM kills
  • Memory Reality: Memory usage climbs sharply during backups and may not be released afterwards

Storage Backend Configuration & Costs

AWS S3 (Most Complex Setup)

Setup Time: 2-3 days for IAM permissions
Hidden Requirements: 47+ IAM permissions (official docs incomplete)
Known Issues: GitHub issue #8240 - IRSA roles still broken in recent versions
Storage Cost: ~$25/month per 1TB backup
Breaking Points: Plugin updates frequently break authentication

Critical Permissions Missing from Official Docs:

{
  "Essential_Additions": [
    "s3:AbortMultipartUpload",
    "s3:ListMultipartUploadParts", 
    "ec2:DescribeVolumes",
    "ec2:DescribeSnapshots",
    "ec2:CreateSnapshot",
    "ec2:DeleteSnapshot",
    "ec2:DescribeInstances"
  ]
}
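
The action names above can be folded into an IAM policy document. This is a minimal sketch in Python, assuming a hypothetical bucket name supplied by the caller. The split into object-level S3 actions and unscoped EC2 actions is an illustration, not the plugin's official policy, and it covers only the additions listed here.

```python
import json

# Action names copied from the list above.
S3_OBJECT_ACTIONS = ["s3:AbortMultipartUpload", "s3:ListMultipartUploadParts"]
EC2_ACTIONS = [
    "ec2:DescribeVolumes",
    "ec2:DescribeSnapshots",
    "ec2:CreateSnapshot",
    "ec2:DeleteSnapshot",
    "ec2:DescribeInstances",
]


def build_policy(bucket: str) -> dict:
    """Return an IAM policy granting the extra actions for one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": S3_OBJECT_ACTIONS,
                # Multipart-upload actions apply to objects, hence the /* suffix.
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": EC2_ACTIONS,
                # EC2 Describe* actions only accept "*"; the snapshot actions
                # are left unscoped here for brevity.
                "Resource": "*",
            },
        ],
    }


def policy_json(bucket: str) -> str:
    """Serialize the policy for pasting into the IAM console or a template."""
    return json.dumps(build_policy(bucket), indent=2)
```

Calling policy_json("example-velero-backups") (a made-up bucket name) yields a statement block to merge with whatever base policy the role already carries.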

Google Cloud Storage (Least Problematic)

Setup Time: 0.5-1 day
Authentication: Workload Identity cleaner than service accounts
Advantage: Simpler IAM model, readable error messages
Reliability: Most stable authentication mechanism

Azure Blob Storage (Authentication Maze)

Setup Time: 1-2 days navigating identity systems
Complexity: Multiple identity types, subscription dependencies
Breaking Point: Azure's authentication model is poorly designed and hard to navigate

CSI Volume Snapshots vs File System Backup

| Aspect | CSI Snapshots | File System Backup (Kopia) |
| --- | --- | --- |
| Speed | Seconds regardless of size | 4-8 hours for 500GB |
| Network Impact | Minimal transfer | Full volume data transfer |
| Storage Cost | Incremental differences | Full data storage cost |
| Reliability | Depends on CSI driver quality | Memory leaks fixed but new failure modes |
| When to Use | Default choice if storage supports it | Only when CSI drivers are broken |
| Resource Usage | Minimal | Can OOMKill nodes |

Competitive Analysis & Decision Matrix

| Solution | Cost Reality | Setup Complexity | Reliability | When to Choose |
| --- | --- | --- | --- | --- |
| Velero | Free + debugging time | 2-3 days IAM hell | Good with proper monitoring | Multi-cloud, cost-conscious |
| Kasten K10 | $$$ enterprise rates | Sales call required | Actually works | Enterprise budget available |
| Portworx PX-Backup | $$$ if locked in | Only works with Portworx | Good in ecosystem | Already using Portworx |
| Native etcd Backup | Free | etcdctl snapshot save | Control plane only | Minimum viable backup |

Critical Failure Scenarios & Solutions

"Completed" But Nothing Backed Up

Root Cause: RBAC permissions, storage class incompatibilities, wrong selectors
Detection: Always run velero backup describe <backup-name> --details and check the item counts, errors, and warnings
Prevention: Validate selectors before backup creation
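
The detection step can be automated against the Backup object itself. This is a minimal sketch, assuming the object has been fetched as JSON (for example with kubectl get backups.velero.io <backup-name> -n velero -o json) and that its status carries the phase, errors, progress.itemsBackedUp, and volumeSnapshots* fields. The zero-items rule is this sketch's own heuristic, not a Velero feature.

```python
def backup_problems(backup: dict) -> list[str]:
    """Return reasons to distrust a backup; an empty list means no red flags."""
    status = backup.get("status", {})
    problems = []

    phase = status.get("phase", "Unknown")
    if phase != "Completed":
        problems.append(f"phase is {phase}")

    if status.get("errors", 0) > 0:
        problems.append(f"{status['errors']} errors reported")

    # A "Completed" backup with zero items usually means the selectors matched nothing.
    items = status.get("progress", {}).get("itemsBackedUp", 0)
    if items == 0:
        problems.append("no items were backed up")

    # Compare attempted vs completed volume snapshots when they are reported.
    attempted = status.get("volumeSnapshotsAttempted", 0)
    completed = status.get("volumeSnapshotsCompleted", 0)
    if completed < attempted:
        problems.append(f"only {completed}/{attempted} volume snapshots completed")

    return problems
```

A wrapper script could exit non-zero whenever the returned list is non-empty, which turns a silent failure into a failed pipeline step.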

AWS "Access Denied" Loops

Root Cause: Incomplete IAM policies from official docs
Time Cost: 2+ days discovering missing permissions
Solution: Use complete permission set above, check GitHub issues

Stuck "InProgress" Restores

Root Cause: Node agent pod crashed/restarted during operation
Recovery: Delete orphaned restore, restart operation
Prevention: Set appropriate resource limits on node agents
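
Orphaned operations can be found by age rather than by waiting for someone to notice. This is a minimal sketch, assuming a list of Backup or Restore objects already parsed from JSON, each with status.phase and status.startTimestamp in the usual Kubernetes UTC format. The age threshold is a hypothetical value the caller picks.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a Kubernetes-style UTC timestamp such as 2025-01-01T02:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stuck_operations(items: list[dict], now: datetime, max_age: timedelta) -> list[str]:
    """Return names of operations that have been InProgress longer than max_age."""
    stuck = []
    for item in items:
        status = item.get("status", {})
        started = status.get("startTimestamp")
        # Skip anything that finished or never recorded a start time.
        if status.get("phase") != "InProgress" or started is None:
            continue
        if now - parse_ts(started) > max_age:
            stuck.append(item["metadata"]["name"])
    return stuck
```

The names it returns are candidates for the delete-and-restart recovery described above, once the node agent logs have confirmed that the pod really restarted.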

Memory Exhaustion

Root Cause: Kopia memory usage during file system backups
Impact: Node OOMKills, cluster instability
Solution: Configure resource limits, prefer CSI snapshots
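
Resource limits for the node agent can be expressed as a patch. This is a minimal sketch, assuming the DaemonSet and its container are both named node-agent, which should be checked against the actual install. The 2Gi and 4Gi values are illustrative picks from the 2-4GB range in this document.

```python
import json


def node_agent_patch(mem_request: str, mem_limit: str) -> dict:
    """Build a strategic merge patch that sets memory bounds on the node agent."""
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            # Assumed container name; strategic merge matches on it.
                            "name": "node-agent",
                            "resources": {
                                "requests": {"memory": mem_request},
                                "limits": {"memory": mem_limit},
                            },
                        }
                    ]
                }
            }
        }
    }


# Illustrative values only; size them from observed usage.
patch = json.dumps(node_agent_patch("2Gi", "4Gi"))
```

The resulting JSON string can be passed to kubectl patch daemonset with the -p option. With a limit in place, an oversized backup gets its own pod killed instead of destabilizing the node.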

Production Implementation Requirements

Resource Planning

Memory: 2-4GB per node agent for large backups
Network: 100GB backup = 100GB transfer with file system method
Time: CSI snapshots (seconds) vs File backup (hours)
Storage: Set retention policies or face budget shock
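
The time figures follow from simple throughput arithmetic. This is a minimal sketch using decimal units (1 GB = 1000 MB); the function names are made up for illustration. The "4-8 hours for 500GB" figure quoted in this document implies a sustained rate of roughly 17-35 MB/s.

```python
def backup_hours(size_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to move size_gb at a sustained throughput (decimal units)."""
    seconds = size_gb * 1000 / throughput_mb_s
    return seconds / 3600


def implied_throughput_mb_s(size_gb: float, hours: float) -> float:
    """Sustained MB/s implied by moving size_gb in the given number of hours."""
    return size_gb * 1000 / (hours * 3600)


fast = implied_throughput_mb_s(500, 4)  # about 34.7 MB/s
slow = implied_throughput_mb_s(500, 8)  # about 17.4 MB/s
```

Plugging a measured throughput into backup_hours gives a quick check on whether a file system backup fits inside the backup window.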

Monitoring Requirements (Critical)

Why Essential: Velero fails silently by default
Setup Time: 4-8 hours for proper alerting
Key Metrics: Backup success rate, duration, storage usage
Tools: Prometheus + Grafana dashboards mandatory
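
The backup success rate can be derived from Velero's Prometheus counters. This is a minimal sketch, assuming the metrics endpoint exposes velero_backup_success_total, velero_backup_failure_total, and velero_backup_partial_failure_total in the standard text exposition format. Counting partial failures as failures is this sketch's choice.

```python
import re

# Matches lines such as: velero_backup_success_total{schedule="daily"} 12
SAMPLE_RE = re.compile(r'^(\w+)(?:\{[^}]*\})?\s+([0-9.eE+-]+)$')


def sum_metrics(exposition: str) -> dict[str, float]:
    """Sum every sample of each metric across label sets."""
    totals: dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and # HELP / # TYPE comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


def backup_success_rate(totals: dict[str, float]) -> float:
    """Fraction of finished backups that succeeded; partial failures count as failures."""
    ok = totals.get("velero_backup_success_total", 0.0)
    bad = totals.get("velero_backup_failure_total", 0.0) + totals.get(
        "velero_backup_partial_failure_total", 0.0
    )
    finished = ok + bad
    return ok / finished if finished else 0.0
```

In practice the same ratio would usually live in a Prometheus alert rule; the Python version is handy for ad hoc checks against a scraped metrics dump.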

Testing Protocol (Non-Negotiable)

Quarterly DR Drills: Actually restore in test environment
Validation: Don't trust "Completed" status without verification
Scope: Test both applications and persistent data recovery
Reality Check: Most backup failures discovered during real disasters
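
A drill restore can be scripted so it never lands on the live namespace. This is a minimal sketch that only builds the velero restore create argument list; the backup and namespace names are hypothetical, and the "-drill" suffix is this sketch's own naming convention.

```python
def drill_restore_args(backup: str, source_ns: str, drill_ns: str) -> list[str]:
    """Build a velero CLI invocation that restores one namespace into a drill namespace."""
    return [
        "velero", "restore", "create", f"{backup}-drill",
        "--from-backup", backup,
        "--include-namespaces", source_ns,
        # Remap so the drill never overwrites the live namespace.
        "--namespace-mappings", f"{source_ns}:{drill_ns}",
        # Leave production credentials out of the test environment.
        "--exclude-resources", "secrets",
        # Block until the restore finishes so the caller can inspect the result.
        "--wait",
    ]
```

The list can be handed to subprocess.run on a machine with the velero CLI configured. Because secrets are excluded, the drill environment has to supply its own test credentials before workloads will start.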

When Velero Makes Sense vs Alternatives

Choose Velero When:

  • Multi-cloud portability required
  • Free solution acceptable with debugging investment
  • Team has Kubernetes expertise
  • Vendor lock-in unacceptable

Choose Alternatives When:

  • Enterprise budget available (Kasten K10)
  • Single cloud provider acceptable (AWS/Azure native backup)
  • Zero maintenance tolerance
  • Immediate reliability required

Critical Warnings & Gotchas

Configuration Traps

Retention Policies: Forgetting these = budget explosion
Secret Restoration: Restoring production secrets into staging lets staging act on production systems (for example, sending emails to real customers)
Plugin Versions: Compatibility breaks between Velero versions
CSI Driver Quality: Many claim snapshot support, few deliver
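
Retention is easiest to enforce when it is part of the schedule definition. This is a minimal sketch that builds a Velero Schedule object with an explicit ttl. The schedule name, cron expression, and namespaces are hypothetical inputs, and the velero namespace is assumed to be the install namespace.

```python
import json


def schedule_manifest(name: str, cron: str, namespaces: list[str], ttl_hours: int) -> dict:
    """Build a Velero Schedule whose backups expire after ttl_hours."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,
            "template": {
                "includedNamespaces": namespaces,
                # ttl is a Go duration string; expired backups are garbage-collected.
                "ttl": f"{ttl_hours}h0m0s",
            },
        },
    }


def schedule_json(name: str, cron: str, namespaces: list[str], ttl_hours: int) -> str:
    """Serialize the manifest; kubectl apply accepts JSON as well as YAML."""
    return json.dumps(schedule_manifest(name, cron, namespaces, ttl_hours), indent=2)
```

Keeping the ttl in a version-controlled manifest makes a missing retention policy visible in review instead of on the storage bill.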

Cost Surprises

File System Backups: 10x more expensive than CSI snapshots
Network Transfer: Massive bandwidth costs for large volumes
Storage Growth: No retention = costs that keep climbing with every backup
Hidden AWS Costs: Data transfer fees not mentioned in documentation
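
The effect of retention on the storage bill can be estimated with back-of-the-envelope arithmetic. This is a minimal sketch, using the ~$25/month per 1TB figure quoted earlier and assuming every retained backup is a full copy. That makes it an upper bound, since snapshots and Kopia's deduplication store less. Transfer fees are not included.

```python
def monthly_storage_cost(
    backup_tb: float,
    backups_per_day: int,
    retention_days: int,
    usd_per_tb_month: float = 25.0,
) -> float:
    """Upper-bound steady-state monthly cost, assuming every backup is a full copy."""
    retained_backups = backups_per_day * retention_days
    return backup_tb * retained_backups * usd_per_tb_month


week = monthly_storage_cost(1.0, 1, 7)    # 7 retained copies of 1 TB
month = monthly_storage_cost(1.0, 1, 30)  # 30 retained copies of 1 TB
```

Under these assumptions, moving from 7 to 30 days of daily 1 TB backups raises the ceiling from $175 to $750 per month.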

Breaking Changes

v1.14 Migration: Restic to Kopia required repository migration
Plugin Updates: Authentication frequently breaks after updates
Kubernetes Versions: Test compatibility before cluster upgrades

Implementation Timeline & Resource Investment

Phase 1: Initial Setup (2-3 days)

  • Day 1: IAM/authentication hell
  • Day 2: Permission debugging
  • Day 3: First successful backup

Phase 2: Production Hardening (1-2 weeks)

  • Week 1: Monitoring setup, resource limits
  • Week 2: DR testing, failure scenario validation

Phase 3: Operational Maturity (Ongoing)

  • Monthly: Review backup success rates
  • Quarterly: Full DR drill execution
  • As needed: Plugin updates and permission fixes

This knowledge base captures the operational reality of Velero deployment, including hidden costs, time investments, and the specific failure modes that cause production pain. Use this for automated decision-making about whether Velero fits your organization's risk tolerance and resource availability.

Useful Links for Further Investigation

Essential Velero Resources (That Actually Help)

Velero Official Documentation: The docs are surprisingly good for a CNCF project. They include working examples and don't assume you're a Kubernetes wizard. The version-specific docs are crucial because features change between releases in ways that'll break your setup.
Velero GitHub Repository: Your first stop when shit breaks. The maintainers actually respond to issues, and the search function works. Pro tip: search closed issues too - your problem has probably been reported and fixed already.
Velero Releases and Changelog: Read the release notes before upgrading or you'll learn about breaking changes the hard way. Plugin compatibility breaks between versions, and the changelog will save you hours of debugging why your backups suddenly stopped working.
AWS Plugin for Velero: The AWS plugin works great once you survive IAM permission hell. The CloudFormation templates in the docs are incomplete - you'll spend 2 days discovering missing permissions. [GitHub issue #8240](https://github.com/vmware-tanzu/velero/issues/8240) shows IRSA is still broken in recent versions. EBS snapshots are reliable when the permissions are finally right.
Google Cloud Plugin for Velero: Honestly the least painful setup. Workload Identity is cleaner than dealing with service account JSON files, and Google's IAM actually makes sense. GCS integration works smoothly and error messages are helpful. If you're multi-cloud, start here to build confidence.
Azure Plugin for Velero: Works with Blob Storage and managed disks. Managed identity is nice when it works, but Azure's authentication model is designed by sadists. Expect to spend time figuring out which identity, subscription, and resource group permissions go where. Not impossible, just confusing as hell.
Velero Community Slack Channel: Actually useful, unlike most K8s Slack channels. People share real production failures and the dirty solutions that worked. Search the channel history before asking - someone's probably hit your exact problem and posted the fix. Way better than Stack Overflow for Velero-specific issues.
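Velero on CNCF Landscape: Velero is listed on the CNCF Landscape and has an established maintainer behind it, so it won't disappear next month and the APIs are relatively stable. That matters for backup software - you don't want your disaster recovery tool to be a hobby project. A landscape listing is not the same as CNCF graduated status, so check the entry for the project's current standing.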
Velero Adopters List: Netflix, MongoDB, and Reddit run this in production, so it's battle-tested at scale. Reading the adopters list gives you confidence that others have solved the problems you'll encounter. These companies have pushed Velero to its limits.
Kubernetes Official Backup Guide: etcd backup is complementary to Velero - you need both. etcd backs up your cluster control plane, Velero backs up your applications and data. Don't skip etcd backup thinking Velero covers everything.
AWS EKS with Velero Tutorial: The best AWS setup guide that actually covers the IAM permissions you need. The permissions section is worth its weight in gold - follow it exactly or spend 2 days debugging "access denied" errors. Still doesn't cover everything but gets you 90% there.
Velero Disaster Recovery Guide: Official DR procedures. Here's the critical part: TEST YOUR DISASTER RECOVERY PLAN. Most people discover their backups are worthless during an actual outage. Schedule quarterly DR drills and actually restore everything in a test environment.
Velero Monitoring Examples: Critical monitoring setup with [Prometheus metrics](https://velero.io/docs/main/monitoring/) and [Grafana dashboards](https://grafana.com/grafana/dashboards/16829-kubernetes-tanzu-velero/). Velero fails silently by default - you MUST set up monitoring or you'll discover broken backups during a disaster. The backup success rate metric is your lifeline.
Helm Chart for Velero: Official Helm chart that's easier than CLI installation. The values file has sensible defaults but you'll need to customize for your cloud provider. Much cleaner than managing all the YAML manually, and easier to version control your configuration.
Velero Plugin Registry: All the available plugins for different storage providers. Stick with the official plugins unless you have specific needs - community plugins are hit-or-miss on maintenance and compatibility.
VMware Tanzu Velero Guide: Enterprise documentation if you're using vSphere with Tanzu. More useful than you'd expect, even if you're not paying for VMware support. They have good troubleshooting guides for complex scenarios.
Velero Configuration Guide: Advanced configuration options including resource limits and plugin settings. Essential reading if you're running large workloads or have specific performance requirements. The resource limits section will save your nodes from OOMKills.
Stack Overflow Velero Questions: Real-world problems from people using Velero in production. Often more useful than the official docs for debugging specific issues. Search here first when you hit weird problems - someone's probably solved it already.
