Why does my backup say "Completed" but nothing was actually backed up?

Velero lies. "Completed" doesn't mean the backup captured what you expected; it means the process finished without recording errors. Run `velero backup describe your-backup-name --details` and look at the warnings and the list of included resources. Common culprits: missing RBAC permissions, storage class incompatibilities, or a backup that included zero resources because your selectors were wrong. Always check what was actually included in the backup.
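
A minimal sketch of automating that check, assuming you feed it the JSON from `kubectl get backups.velero.io -n velero -o json`. The `status.phase`, `status.progress.itemsBackedUp`, and `status.warnings` fields come from the Velero Backup resource; the function name and the sample data are made up for illustration.

```python
def suspicious_backups(backup_list):
    """Return (name, reason) pairs for Completed backups that look empty or noisy."""
    findings = []
    for item in backup_list.get("items", []):
        name = item["metadata"]["name"]
        status = item.get("status", {})
        if status.get("phase") != "Completed":
            continue  # Failed and PartiallyFailed backups are already visible as such
        # A missing progress block is treated as zero items, which errs on the loud side.
        backed_up = status.get("progress", {}).get("itemsBackedUp", 0)
        warnings = status.get("warnings", 0)
        if backed_up == 0:
            findings.append((name, "completed with zero items"))
        elif warnings > 0:
            findings.append((name, f"completed with {warnings} warnings"))
    return findings


# Hypothetical sample shaped like `kubectl get backups.velero.io -n velero -o json`.
SAMPLE_BACKUPS = {
    "items": [
        {"metadata": {"name": "nightly-ok"},
         "status": {"phase": "Completed",
                    "progress": {"itemsBackedUp": 412, "totalItems": 412}}},
        {"metadata": {"name": "wrong-selector"},
         "status": {"phase": "Completed",
                    "progress": {"itemsBackedUp": 0, "totalItems": 0}}},
        {"metadata": {"name": "noisy"},
         "status": {"phase": "Completed", "warnings": 3,
                    "progress": {"itemsBackedUp": 97, "totalItems": 97}}},
    ]
}

for name, reason in suspicious_backups(SAMPLE_BACKUPS):
    print(f"{name}: {reason}")
```

A flagged backup is a prompt to go read `velero backup describe`, not a verdict by itself.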

Why is my restore stuck in "InProgress" forever?

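The node agent pod probably crashed or restarted during the operation. Check `kubectl get pods -n velero` for restart counts, then read the node agent's logs with `kubectl logs -n velero <node-agent-pod> --previous` to see what happened before the restart. If it restarted, the restore is orphaned and you'll need to delete it (`velero restore delete <restore-name>`) and try again. This happens a lot with large volume restores.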
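
If you would rather not eyeball the pod list, here is a rough sketch that pulls restart information out of `kubectl get pods -n velero -o json`. The `containerStatuses`, `restartCount`, and `lastState.terminated.reason` fields are standard Kubernetes pod status; the function name and the sample pods are hypothetical.

```python
def restarted_containers(pod_list):
    """Return (pod, container, restarts, last_reason) for containers that have restarted."""
    restarted = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for cs in pod.get("status", {}).get("containerStatuses", []):
            if cs.get("restartCount", 0) > 0:
                terminated = cs.get("lastState", {}).get("terminated", {})
                restarted.append((pod_name, cs["name"], cs["restartCount"],
                                  terminated.get("reason", "unknown")))
    return restarted


# Hypothetical sample shaped like `kubectl get pods -n velero -o json`.
SAMPLE_PODS = {
    "items": [
        {"metadata": {"name": "node-agent-abc12"},
         "status": {"containerStatuses": [
             {"name": "node-agent", "restartCount": 2,
              "lastState": {"terminated": {"reason": "OOMKilled"}}}]}},
        {"metadata": {"name": "node-agent-def34"},
         "status": {"containerStatuses": [
             {"name": "node-agent", "restartCount": 0, "lastState": {}}]}},
    ]
}

for pod, container, count, reason in restarted_containers(SAMPLE_PODS):
    print(f"{pod}/{container}: {count} restarts, last reason {reason}")
```

An `OOMKilled` reason points at memory limits; anything else means reading the previous container's logs.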

My backup is taking 6 hours - is this normal?

If you're using file system backup (Kopia), yes, it's slow as hell. A 500GB volume can take 4-8 hours depending on your network and how many small files you have. CSI snapshots take seconds. Switch to CSI snapshots if your storage supports them, otherwise suffer through the file backup pain.
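
To put the 4-8 hour figure in perspective, here is the back-of-the-envelope arithmetic as a small sketch. It assumes sizes in GiB and a steady effective throughput, which real backups with many small files will not have; the function names are illustrative.

```python
def implied_throughput_mib_s(volume_gib, hours):
    """Effective throughput needed to move volume_gib in the given number of hours."""
    return volume_gib * 1024 / (hours * 3600)


def estimated_hours(volume_gib, throughput_mib_s):
    """Time to move volume_gib at a steady effective throughput."""
    return volume_gib * 1024 / throughput_mib_s / 3600


# 500 GiB in 4 hours and in 8 hours, as effective MiB/s.
print(round(implied_throughput_mib_s(500, 4), 1))
print(round(implied_throughput_mib_s(500, 8), 1))
# The other way round: a 100 GiB volume at an assumed 20 MiB/s.
print(round(estimated_hours(100, 20), 2))
```

In other words, 4-8 hours for 500GB corresponds to an effective rate of only about 18-36 MiB/s, far below what the network alone allows. Measure one real backup and plug in your own number before planning backup windows.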

Why does Velero keep running out of memory?

File system backups using [Kopia](https://kopia.io/) can eat massive amounts of memory, especially with lots of small files. Without limits, the node agent pods take as much memory as the backup needs and can push a node into memory pressure or get OOMKilled mid-backup. Set resource requests and limits on the node agent DaemonSet (not just the Velero server deployment) or your cluster will suffer. v1.17 fixed some of this but it's still a memory hog.
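
One way to set those limits on an existing install is a patch against the node agent DaemonSet. The sketch below only builds the patch body. It assumes the DaemonSet and its container are both named `node-agent` (verify with `kubectl get daemonset -n velero`), and the sizes are placeholder values, not recommendations.

```python
import json


def node_agent_resources_patch(mem_request, mem_limit, cpu_request, cpu_limit,
                               container="node-agent"):
    """Build a strategic merge patch that sets resources on one DaemonSet container."""
    return {
        "spec": {"template": {"spec": {"containers": [{
            "name": container,  # containers are merged by name in a strategic merge patch
            "resources": {
                "requests": {"cpu": cpu_request, "memory": mem_request},
                "limits": {"cpu": cpu_limit, "memory": mem_limit},
            },
        }]}}}
    }


# Placeholder sizes; tune them from observed usage on your own volumes.
patch = node_agent_resources_patch("2Gi", "4Gi", "500m", "2")
print(json.dumps(patch))
```

The printed JSON can be passed to `kubectl -n velero patch daemonset node-agent --patch '<json>'`. On a fresh install, `velero install` also accepts `--node-agent-pod-mem-request`, `--node-agent-pod-mem-limit`, `--node-agent-pod-cpu-request`, and `--node-agent-pod-cpu-limit`.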

How much is this going to cost me on AWS?

More than you think. S3 storage costs add up, especially if you forget retention policies. Budget around $0.023/GB/month for S3 Standard, plus snapshot costs for EBS volumes. A 1TB backup costs about $23.50/month in S3 (1,024 GB × $0.023), before request and data transfer charges. Set retention policies or your first bill will be a shock.
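
The arithmetic, as a small sketch. The $0.023/GB/month rate is the figure quoted above, and the 30-copy case is a deliberately pessimistic assumption (30 retained backups with no deduplication) to show what missing retention looks like. It ignores request, transfer, and EBS snapshot charges.

```python
S3_STANDARD_USD_PER_GB_MONTH = 0.023  # rate quoted in this FAQ; check current pricing


def monthly_storage_cost(stored_gb, usd_per_gb_month=S3_STANDARD_USD_PER_GB_MONTH):
    """Monthly storage cost in USD for the given amount of stored data."""
    return stored_gb * usd_per_gb_month


one_tb = 1024  # GB
print(f"1 TB stored: ${monthly_storage_cost(one_tb):.2f}/month")
print(f"30 full copies of 1 TB: ${monthly_storage_cost(one_tb * 30):.2f}/month")
```

Real file system backups deduplicate between runs, so the second number is an upper bound, not a forecast.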

How do I know if my backups are actually working?

Test them. Seriously. Create a test backup, delete something important in a staging environment, then restore it. Most people discover their backups are broken during a real disaster. Use `velero backup describe` to check for warnings, and actually try restoring to make sure it works. Backup monitoring should alert when backups fail, not just when they complete.
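
A sketch of what a repeatable drill could look like, written as a function that only builds the command lines so you can review them before running anything. It is a variation on the delete-and-restore test above: it restores into a scratch namespace via `--namespace-mappings` instead of deleting the original. The naming scheme (`drill-...`, `<namespace>-drill`) is hypothetical.

```python
import shlex


def dr_drill_commands(namespace, run_id):
    """Build the velero commands for a backup-and-restore drill of one namespace."""
    backup = f"drill-{namespace}-{run_id}"
    restore = f"{backup}-restore"
    scratch = f"{namespace}-drill"  # hypothetical scratch namespace for the restored copy
    return [
        ["velero", "backup", "create", backup,
         "--include-namespaces", namespace, "--wait"],
        ["velero", "backup", "describe", backup, "--details"],
        ["velero", "restore", "create", restore, "--from-backup", backup,
         "--namespace-mappings", f"{namespace}:{scratch}", "--wait"],
        ["velero", "restore", "describe", restore, "--details"],
    ]


for argv in dr_drill_commands("shop", "001"):
    print(shlex.join(argv))
```

The drill only counts once you have checked that the application in the scratch namespace actually starts and sees its data.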

Why are my scheduled backups failing silently?

Velero schedules are not Kubernetes CronJobs. They are `Schedule` custom resources that the Velero server itself acts on, so `kubectl get cronjobs -n velero` shows nothing and nobody tells you when a run fails. Check `velero schedule get` and `velero backup get` for the backups each schedule produced. Failed schedules often happen because of resource limits, storage authentication issues, or quota problems. Set up [Prometheus monitoring](https://velero.io/docs/main/monitoring/) to alert when backups fail.
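
A sketch for spotting schedules that have quietly stopped producing backups, assuming input from `kubectl get schedules.velero.io -n velero -o json`. The `status.lastBackup` timestamp is part of the Velero Schedule resource; the 26-hour threshold, the function names, and the sample data are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2025-06-10T02:00:00Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def stale_schedules(schedule_list, now, max_age):
    """Return (name, reason) for schedules with no backup newer than max_age."""
    stale = []
    for item in schedule_list.get("items", []):
        name = item["metadata"]["name"]
        last = item.get("status", {}).get("lastBackup")
        if last is None:
            stale.append((name, "never ran"))
        elif now - parse_k8s_time(last) > max_age:
            stale.append((name, f"last backup {last}"))
    return stale


# Hypothetical sample shaped like `kubectl get schedules.velero.io -n velero -o json`.
SAMPLE_SCHEDULES = {
    "items": [
        {"metadata": {"name": "nightly"},
         "status": {"phase": "Enabled", "lastBackup": "2025-06-10T02:00:00Z"}},
        {"metadata": {"name": "hourly-db"},
         "status": {"phase": "Enabled", "lastBackup": "2025-06-01T02:00:00Z"}},
        {"metadata": {"name": "new-schedule"}, "status": {}},
    ]
}

now = datetime(2025, 6, 10, 12, 0, tzinfo=timezone.utc)
for name, reason in stale_schedules(SAMPLE_SCHEDULES, now, timedelta(hours=26)):
    print(f"{name}: {reason}")
```

This only tells you a backup was created recently, not that it succeeded, so pair it with a check of each backup's phase.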

Can I use Velero without cloud storage?

Yes, with [MinIO](https://min.io/) or other S3-compatible storage. MinIO is solid for on-premises setups, while Ceph Object Gateway will make you question your life choices. You can run MinIO in a container or as a standalone service. Just make sure your storage is actually reliable - your backups are only as good as the storage they're on.

What's the difference between Restic and Kopia?

[Kopia replaced Restic in v1.14](https://velero.io/docs/main/file-system-backup/) because Restic had memory leaks that would OOMKill nodes during large backups. Kopia is more memory-efficient and faster, but introduced new failure modes. Existing Restic backups are still restorable, but new file system backups use Kopia. The migration was generally worth it.

Why won't my CSI snapshots work?

Your CSI driver probably doesn't support snapshots properly, or it's buggy. Many storage providers claim CSI snapshot support but it's broken or incomplete. Check that a VolumeSnapshotClass exists for your driver with `kubectl get volumesnapshotclass`; its driver has to match the provisioner of your storage class. If snapshots fail, fall back to file system backup with Kopia.
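
A small sketch of that check, assuming JSON from `kubectl get volumesnapshotclass -o json` and the provisioner name from your storage class. The top-level `driver` field is part of the VolumeSnapshotClass API, and `velero.io/csi-volumesnapshot-class` is the label Velero's CSI support uses to pick a class. The function name and sample classes are hypothetical.

```python
VELERO_VSC_LABEL = "velero.io/csi-volumesnapshot-class"


def snapshot_classes_for(provisioner, snapshot_class_list):
    """Return (class_name, has_velero_label) for snapshot classes matching a provisioner."""
    matches = []
    for vsc in snapshot_class_list.get("items", []):
        if vsc.get("driver") != provisioner:
            continue  # this class belongs to a different CSI driver
        labels = vsc["metadata"].get("labels", {})
        matches.append((vsc["metadata"]["name"], labels.get(VELERO_VSC_LABEL) == "true"))
    return matches


# Hypothetical sample shaped like `kubectl get volumesnapshotclass -o json`.
SAMPLE_CLASSES = {
    "items": [
        {"metadata": {"name": "ebs-snapclass",
                      "labels": {VELERO_VSC_LABEL: "true"}},
         "driver": "ebs.csi.aws.com"},
        {"metadata": {"name": "efs-snapclass"}, "driver": "efs.csi.aws.com"},
    ]
}

print(snapshot_classes_for("ebs.csi.aws.com", SAMPLE_CLASSES))
print(snapshot_classes_for("example.csi.vendor.io", SAMPLE_CLASSES))
```

An empty result means no snapshot class exists for that driver, and CSI snapshots cannot work until one is created.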

Will upgrading Velero break my existing backups?

Usually no, but check the [compatibility matrix](https://velero.io/docs/main/supported-providers/) for plugin versions. The v1.14 upgrade from Restic to Kopia was the big breaking change - file system backups needed repository migration. Test upgrades in staging first because plugin incompatibilities can leave you unable to restore backups.

What happens to secrets when I restore to a different cluster?

Secrets get restored but cloud-specific authentication (IAM roles, service accounts) won't work in the new environment. You'll need to manually reconfigure authentication for things like database connections, external APIs, and cloud services. Don't restore production secrets to staging environments - you'll accidentally send emails to real customers.

Does Velero backup my operators and CRDs?

Yes, [CRDs and custom resources](https://velero.io/docs/main/resource-filtering/) get backed up. Most operators resume working after restore, but operators that manage external services (databases, cloud resources) may need manual intervention to reconnect to their managed services. Test operator restore procedures because they often have weird edge cases.

How do I set up monitoring so I know when backups fail?

Set up [Prometheus monitoring](https://velero.io/docs/main/monitoring/) and alerts because Velero doesn't alert you by default. Use the [Grafana dashboards](https://velero.io/docs/main/examples/) for monitoring views. The key metrics are backup success/failure rates and backup duration. Silent failures are common, so monitoring is essential.
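
To make the "key metrics" concrete, here is a sketch that reads Velero's Prometheus text output and flags trouble. `velero_backup_failure_total` and `velero_backup_last_successful_timestamp` are metric names Velero exposes; the simple parser, the thresholds, and the sample scrape are assumptions for illustration. In a real setup you would express this as Prometheus alert rules and look at counter increases over a time window instead of raw totals.

```python
def parse_metric_samples(text):
    """Parse Prometheus text exposition into {series_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def backup_alerts(samples, schedule, now_epoch, max_age_seconds):
    """Return human-readable reasons why a schedule's backups need attention."""
    label = f'{{schedule="{schedule}"}}'
    failures = samples.get("velero_backup_failure_total" + label, 0.0)
    last_ok = samples.get("velero_backup_last_successful_timestamp" + label, 0.0)
    reasons = []
    if failures > 0:
        reasons.append(f"{int(failures)} failed backups recorded")
    age = now_epoch - last_ok
    if age > max_age_seconds:
        reasons.append(f"no successful backup for {age / 3600:.1f} hours")
    return reasons


# Hypothetical scrape of the Velero metrics endpoint.
SAMPLE_SCRAPE = """
# TYPE velero_backup_failure_total counter
velero_backup_failure_total{schedule="nightly"} 2
velero_backup_success_total{schedule="nightly"} 40
velero_backup_last_successful_timestamp{schedule="nightly"} 1749500000
"""

samples = parse_metric_samples(SAMPLE_SCRAPE)
print(backup_alerts(samples, "nightly", now_epoch=1749500000 + 3 * 86400,
                    max_age_seconds=26 * 3600))
```

The "time since last success" check is the more useful of the two, because it also fires when backups stop running altogether.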

Should I use Velero or just rely on cloud provider backups?

Depends on vendor lock-in tolerance. [AWS Backup](https://aws.amazon.com/backup/) and [Azure Backup](https://azure.microsoft.com/services/backup/) are easier to set up and work better with native cloud services, but lock you into that provider. Velero gives you portability and consistency across clouds, but requires more setup and debugging. Choose cloud backups for simplicity, Velero for multi-cloud freedom.


Velero: Kubernetes Backup & Disaster Recovery - AI Knowledge Base

Executive Summary

Velero is an open-source Kubernetes backup tool maintained by VMware Tanzu. Production-proven by Netflix, MongoDB, and Reddit. Current version v1.17.0 (2025) fixes memory leaks but introduces new failure modes. Setup takes 2-3 days because of IAM permission complexity, and the ongoing maintenance burden is moderate.

Core Architecture & Failure Points

Three Critical Components

Velero Server (Controller)

Function: Watches backup CRDs, communicates with K8s API and storage
Critical Failure: RBAC/IAM permission mismatches cause silent failures
Hidden Cost: Requires constant monitoring setup - fails silently by default
Debugging Time: Hours spent on permission troubleshooting

Velero CLI

Function: Primary interface for backup/restore operations
Critical Failure: Commands fail silently, "Completed" status lies
Required Action: Always run velero backup describe after operations
Operational Reality: Half of commands require verification to confirm actual success

Node Agent (DaemonSet)

Function: Handles persistent volume backups via Kopia (replaced Restic in v1.14)
Critical Failure: Random restarts leave backups stuck in "InProgress"
Resource Impact: Can OOMKill nodes during large backups without proper limits
Memory Reality: Massive memory requests that may not be released

Storage Backend Configuration & Costs

AWS S3 (Most Complex Setup)

Setup Time: 2-3 days for IAM permissions
Hidden Requirements: 47+ IAM permissions (official docs incomplete)
Known Issues: GitHub issue #8240 - IRSA roles still broken in recent versions
Storage Cost: ~$25/month per 1TB backup
Breaking Points: Plugin updates frequently break authentication

Critical Permissions Missing from Official Docs:

{
  "Essential_Additions": [
    "s3:AbortMultipartUpload",
    "s3:ListMultipartUploadParts", 
    "ec2:DescribeVolumes",
    "ec2:DescribeSnapshots",
    "ec2:CreateSnapshot",
    "ec2:DeleteSnapshot",
    "ec2:DescribeInstances"
  ]
}
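
The list above is not an IAM policy document by itself. As a rough sketch, this is how those actions could be arranged into policy statements. It covers only the additions listed here, not the full set Velero needs (object reads and writes, bucket listing, and so on), and the bucket name is a placeholder.

```python
import json

# Actions taken from the list above, grouped by the kind of resource they apply to.
EC2_ACTIONS = [
    "ec2:DescribeVolumes",
    "ec2:DescribeSnapshots",
    "ec2:CreateSnapshot",
    "ec2:DeleteSnapshot",
    "ec2:DescribeInstances",
]
S3_OBJECT_ACTIONS = [
    "s3:AbortMultipartUpload",
    "s3:ListMultipartUploadParts",
]


def velero_policy_additions(bucket):
    """Build IAM policy statements for the extra actions, scoped to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            # EC2 Describe* actions do not support resource-level scoping.
            {"Effect": "Allow", "Action": EC2_ACTIONS, "Resource": "*"},
            # Multipart upload actions apply to objects inside the bucket.
            {"Effect": "Allow", "Action": S3_OBJECT_ACTIONS,
             "Resource": f"arn:aws:s3:::{bucket}/*"},
        ],
    }


# "my-velero-backups" is a placeholder bucket name.
print(json.dumps(velero_policy_additions("my-velero-backups"), indent=2))
```

Merge these statements into the policy from the AWS plugin documentation; do not use them in place of it.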

Google Cloud Storage (Least Problematic)

Setup Time: 0.5-1 day
Authentication: Workload Identity cleaner than service accounts
Cost Advantage: Simpler IAM model, readable error messages
Reliability: Most stable authentication mechanism

Azure Blob Storage (Authentication Maze)

Setup Time: 1-2 days navigating identity systems
Complexity: Multiple identity types, subscription dependencies
Breaking Point: Azure's authentication model designed poorly

CSI Volume Snapshots vs File System Backup

| Aspect | CSI Snapshots | File System Backup (Kopia) |
| --- | --- | --- |
| Speed | Seconds regardless of size | 4-8 hours for 500GB |
| Network Impact | Minimal transfer | Full volume data transfer |
| Storage Cost | Incremental differences | Full data storage cost |
| Reliability | Depends on CSI driver quality | Memory leaks fixed but new failure modes |
| When to Use | Default choice if storage supports | Only when CSI drivers broken |
| Resource Usage | Minimal | Can OOMKill nodes |

Competitive Analysis & Decision Matrix

| Solution | Cost Reality | Setup Complexity | Reliability | When to Choose |
| --- | --- | --- | --- | --- |
| Velero | Free + debugging time | 2-3 days IAM hell | Good with proper monitoring | Multi-cloud, cost-conscious |
| Kasten K10 | $$$ enterprise rates | Sales call required | Actually works | Enterprise budget available |
| Portworx PX-Backup | $$$ if locked in | Only works with Portworx | Good in ecosystem | Already using Portworx |
| Native etcd Backup | Free | `etcdctl snapshot save` | Control plane only | Minimum viable backup |

Critical Failure Scenarios & Solutions

"Completed" But Nothing Backed Up

Root Cause: RBAC permissions, storage class incompatibilities, wrong selectors
Detection: Always run velero backup describe backup-name
Prevention: Validate selectors before backup creation

AWS "Access Denied" Loops

Root Cause: Incomplete IAM policies from official docs
Time Cost: 2+ days discovering missing permissions
Solution: Use complete permission set above, check GitHub issues

Stuck "InProgress" Restores

Root Cause: Node agent pod crashed/restarted during operation
Recovery: Delete orphaned restore, restart operation
Prevention: Set appropriate resource limits on node agents

Memory Exhaustion

Root Cause: Kopia memory usage during file system backups
Impact: Node OOMKills, cluster instability
Solution: Configure resource limits, prefer CSI snapshots

Production Implementation Requirements

Resource Planning

Memory: 2-4GB per node agent for large backups
Network: 100GB backup = 100GB transfer with file system method
Time: CSI snapshots (seconds) vs File backup (hours)
Storage: Set retention policies or face budget shock

Monitoring Requirements (Critical)

Why Essential: Velero fails silently by default
Setup Time: 4-8 hours for proper alerting
Key Metrics: Backup success rate, duration, storage usage
Tools: Prometheus + Grafana dashboards mandatory

Testing Protocol (Non-Negotiable)

Quarterly DR Drills: Actually restore in test environment
Validation: Don't trust "Completed" status without verification
Scope: Test both applications and persistent data recovery
Reality Check: Most backup failures discovered during real disasters

When Velero Makes Sense vs Alternatives

Choose Velero When:

Multi-cloud portability required
Free solution acceptable with debugging investment
Team has Kubernetes expertise
Vendor lock-in unacceptable

Choose Alternatives When:

Enterprise budget available (Kasten K10)
Single cloud provider acceptable (AWS/Azure native backup)
Zero maintenance tolerance
Immediate reliability required

Critical Warnings & Gotchas

Configuration Traps

Retention Policies: Forgetting these = budget explosion
Secret Restoration: Production secrets to staging = customer emails sent
Plugin Versions: Compatibility breaks between Velero versions
CSI Driver Quality: Many claim snapshot support, few deliver

Cost Surprises

File System Backups: 10x more expensive than CSI snapshots
Network Transfer: Massive bandwidth costs for large volumes
Storage Growth: No retention = storage costs that climb with every backup
Hidden AWS Costs: Data transfer fees not mentioned in documentation

Breaking Changes

v1.14 Migration: Restic to Kopia required repository migration
Plugin Updates: Authentication frequently breaks after updates
Kubernetes Versions: Test compatibility before cluster upgrades

Implementation Timeline & Resource Investment

Phase 1: Initial Setup (2-3 days)

Day 1: IAM/authentication hell
Day 2: Permission debugging
Day 3: First successful backup

Phase 2: Production Hardening (1-2 weeks)

Week 1: Monitoring setup, resource limits
Week 2: DR testing, failure scenario validation

Phase 3: Operational Maturity (Ongoing)

Monthly: Review backup success rates
Quarterly: Full DR drill execution
As needed: Plugin updates and permission fixes

This knowledge base captures the operational reality of Velero deployment, including hidden costs, time investments, and the specific failure modes that cause production pain. Use this for automated decision-making about whether Velero fits your organization's risk tolerance and resource availability.

Useful Links for Further Investigation

Essential Velero Resources (That Actually Help)

Link	Description
Velero Official Documentation	The docs are surprisingly good for an open-source project. They include working examples and don't assume you're a Kubernetes wizard. The version-specific docs are crucial because features change between releases in ways that'll break your setup.
Velero GitHub Repository	Your first stop when shit breaks. The maintainers actually respond to issues, and the search function works. Pro tip: search closed issues too - your problem has probably been reported and fixed already.
Velero Releases and Changelog	Read the release notes before upgrading or you'll learn about breaking changes the hard way. Plugin compatibility breaks between versions, and the changelog will save you hours of debugging why your backups suddenly stopped working.
AWS Plugin for Velero	The AWS plugin works great once you survive IAM permission hell. The CloudFormation templates in the docs are incomplete - you'll spend 2 days discovering missing permissions. [GitHub issue #8240](https://github.com/vmware-tanzu/velero/issues/8240) shows IRSA is still broken in recent versions. EBS snapshots are reliable when the permissions are finally right.
Google Cloud Plugin for Velero	Honestly the least painful setup. Workload Identity is cleaner than dealing with service account JSON files, and Google's IAM actually makes sense. GCS integration works smoothly and error messages are helpful. If you're multi-cloud, start here to build confidence.
Azure Plugin for Velero	Works with Blob Storage and managed disks. Managed identity is nice when it works, but Azure's authentication model is designed by sadists. Expect to spend time figuring out which identity, subscription, and resource group permissions go where. Not impossible, just confusing as hell.
Velero Community Slack Channel	Actually useful, unlike most K8s Slack channels. People share real production failures and the dirty solutions that worked. Search the channel history before asking - someone's probably hit your exact problem and posted the fix. Way better than Stack Overflow for Velero-specific issues.
Velero on CNCF Landscape	Velero appears on the CNCF Landscape, but being listed there is not the same as being a graduated CNCF project, so check its current status on the page itself. What matters for backup software is a long track record and active maintainers - you don't want your disaster recovery tool to be a hobby project.
Velero Adopters List	Netflix, MongoDB, and Reddit run this in production, so it's battle-tested at scale. Reading the adopters list gives you confidence that others have solved the problems you'll encounter. These companies have pushed Velero to its limits.
Kubernetes Official Backup Guide	etcd backup is complementary to Velero - you need both. etcd backs up your cluster control plane, Velero backs up your applications and data. Don't skip etcd backup thinking Velero covers everything.
AWS EKS with Velero Tutorial	The best AWS setup guide that actually covers the IAM permissions you need. The permissions section is worth its weight in gold - follow it exactly or spend 2 days debugging "access denied" errors. Still doesn't cover everything but gets you 90% there.
Velero Disaster Recovery Guide	Official DR procedures. Here's the critical part: TEST YOUR DISASTER RECOVERY PLAN. Most people discover their backups are worthless during an actual outage. Schedule quarterly DR drills and actually restore everything in a test environment.
Velero Monitoring Examples	Critical monitoring setup with [Prometheus metrics](https://velero.io/docs/main/monitoring/) and [Grafana dashboards](https://grafana.com/grafana/dashboards/16829-kubernetes-tanzu-velero/). Velero fails silently by default - you MUST set up monitoring or you'll discover broken backups during a disaster. The backup success rate metric is your lifeline.
Helm Chart for Velero	Official Helm chart that's easier than CLI installation. The values file has sensible defaults but you'll need to customize for your cloud provider. Much cleaner than managing all the YAML manually, and easier to version control your configuration.
Velero Plugin Registry	All the available plugins for different storage providers. Stick with the official plugins unless you have specific needs - community plugins are hit-or-miss on maintenance and compatibility.
VMware Tanzu Velero Guide	Enterprise documentation if you're using vSphere with Tanzu. More useful than you'd expect, even if you're not paying for VMware support. They have good troubleshooting guides for complex scenarios.
Velero Configuration Guide	Advanced configuration options including resource limits and plugin settings. Essential reading if you're running large workloads or have specific performance requirements. The resource limits section will save your nodes from OOMKills.
Stack Overflow Velero Questions	Real-world problems from people using Velero in production. Often more useful than the official docs for debugging specific issues. Search here first when you hit weird problems - someone's probably solved it already.

