Velero - Save Your Ass When Kubernetes Implodes


What is Velero

Velero is the backup tool you install after experiencing your first major Kubernetes disaster. Originally called Heptio Ark, it's now maintained by VMware Tanzu and has been around long enough that you won't get fired for using it.

v1.17.0 dropped in 2025 with Windows support and micro-service architecture for fs-backup. The big deal is they fixed the memory hogging issues - node-agent pods used to request massive memory and hold it forever. Now they actually release it when done backing up your stuff.

Companies like Netflix, MongoDB, and Reddit run this in production, which tells you it's battle-tested enough for your workloads.

The Three Parts That Actually Matter


Velero has three components that you'll spend way too much time debugging:

Velero Server: The controller that lives in your cluster and watches for backup CRDs. When it works, it talks to the K8s API and your storage backend. When it doesn't work, you'll spend hours checking RBAC permissions and IAM policies. The server handles backup schedules but won't tell you when they fail unless you set up monitoring.

Velero CLI: Your interface for everything. velero backup create, velero schedule create, velero restore create - sounds simple until you realize half the commands fail silently. The CLI has validation but it's optimistic. Always check velero backup describe after creating anything because "Completed" doesn't mean what you think it means.

Node Agent: DaemonSet that does the heavy lifting for persistent volume backups when CSI snapshots aren't available. Since v1.14 it uses Kopia instead of Restic, which fixed the memory leaks but introduced new failure modes. Pro tip: these pods will randomly restart and leave your backups stuck in "InProgress" status.

When You Actually Need This Thing

Velero solves three problems that will ruin your day:

Disaster Recovery: When someone accidentally deletes prod or your cloud provider has an outage. Netflix uses this because they learned the hard way that shit happens. Recovery time is anywhere from 10 minutes if you're lucky to 6 hours if your persistent volumes are massive. Don't ask me how I know.

Cluster Migration: Moving workloads between clusters without wanting to rebuild everything from scratch. The official migration docs make it sound easy, but you'll spend days fixing storage classes, ingress annotations, and whatever custom resource definitions broke between environments. Test this multiple times before doing it in prod.

Environment Replication: Copying prod to staging so developers can debug with real data instead of made-up test fixtures. Works great until you restore production secrets to staging and suddenly your test environment is sending real emails to customers. Always sanitize before restoring.

Storage Options (And Where They'll Bite You)

Velero supports multiple storage backends, each with its own special way of frustrating you:

AWS S3: The AWS plugin works great once you survive the IAM permission hell. You'll need about 47 different permissions, and the documentation lies about half of them. GitHub issue #8240 from September 2024 shows the plugin still fucks up IRSA roles. Budget 2 days for getting the permissions right, then another day when it breaks after an AWS update.

Google Cloud Storage: The GCP plugin is honestly the least painful to set up. Workload Identity is cleaner than dealing with service account keys, and Google's IAM model actually makes sense. Still fails sometimes but at least the error messages are readable.

Azure Blob Storage: The Azure plugin works with managed disks and blob storage. Managed identity is nice when it works, but Azure's authentication model is a maze. Expect to spend time figuring out which identity goes where.

CSI Volume Snapshots: Uses your CSI driver to take volume snapshots instead of copying files. Fast and efficient when it works, but many CSI drivers have bugs. AWS EBS snapshots are reliable, others are hit-or-miss. Always test restoring from snapshots - taking them is the easy part.

S3-Compatible Storage: MinIO and Ceph work for on-premises setups. MinIO is solid, Ceph will make you question your life choices. Good for air-gapped environments where cloud storage isn't an option.

Velero vs The Competition (Honest Assessment)

Feature	Velero	Kasten K10	Portworx PX-Backup	Longhorn	Native etcd Backup
Cost	Free but you pay in debugging time	$$$ per node, costs more than your rent	$$$ Enterprise tax	Free	Free
Setup Hell	AWS IAM will break your soul	Requires enterprise sales call	Only works if you already bought Portworx	Actually easy for once	`etcdctl snapshot save`
Cloud Support	Works everywhere after 2 days of config	Works great if you pay enterprise rates	AWS/Azure if you use Portworx	Basic cloud support	Backup works anywhere
Application Backups	Decent with hooks you'll forget to test	Actually works but costs a fortune	Good if you're locked into Portworx	Block-level only, no app consistency	Control plane only, your apps are fucked
Migration	Cross-cloud works with manual fixes	Enterprise tools that actually work	Portworx-to-Portworx only	Not happening	You rebuild everything
Volume Snapshots	CSI works when drivers aren't buggy	Proprietary magic that actually works	Fast with Portworx, useless otherwise	Built-in snapshots work okay	No volumes backed up
File Backup	Kopia fixed the memory leaks	Commercial engine, actually optimized	Limited unless you use Portworx storage	Block replication, not file backup	None
Scheduling	Cron jobs that fail silently	Enterprise policies with actual alerting	Works if you're in their ecosystem	Basic cron, at least it's simple	Cron + scripts you'll write
Performance	Depends on your storage backend	Fast because you pay for optimization	Fast with Portworx, slow everywhere else	Slow as hell for large datasets	Fastest possible
When It Breaks	Community help on Slack	You call support and they actually answer	Portworx support or you're SOL	GitHub issues and prayer	Standard Unix debugging
Real Usage	Netflix, Reddit use it in anger	Fortune 500s with backup budgets	Companies locked into Portworx	Rancher users and edge deployments	Everyone backs up etcd
Verdict	Free but you'll earn every backup	Works great, costs a fortune	Only if you're already Portworx	Simple setups only	Bare minimum coverage

Getting Velero Running (The Painful Truth)

What You Actually Need

Velero needs Kubernetes v1.20+ - basically any modern cluster. v1.17.0 is current and works with K8s up to 1.31. The hard part isn't the version compatibility, it's the fucking IAM permissions that'll make you question your career choices.

Step-by-Step Installation Hell

Step 1: Get the CLI (This Part Actually Works)

# Download v1.17.0 - check GitHub for latest
wget https://github.com/vmware-tanzu/velero/releases/download/v1.17.0/velero-v1.17.0-linux-amd64.tar.gz
tar -xvf velero-v1.17.0-linux-amd64.tar.gz
sudo mv velero-v1.17.0-linux-amd64/velero /usr/local/bin/
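If you're not on linux-amd64, the same URL pattern should work for other builds. Here's a minimal sketch that rebuilds the link from its parts; the function name `velero_release_url` is made up for illustration, and the URL layout is just copied from the v1.17.0 link above.

```shell
# Build the release tarball URL from a version, OS, and CPU architecture.
# The layout mirrors the v1.17.0 download link used above.
velero_release_url() {
    local version="$1" os="${2:-linux}" arch="${3:-amd64}"
    echo "https://github.com/vmware-tanzu/velero/releases/download/${version}/velero-${version}-${os}-${arch}.tar.gz"
}

# Example (not run automatically):
#   wget "$(velero_release_url v1.17.0 linux arm64)"
#   velero version --client-only   # confirm which binary is on your PATH
```

Run `velero version --client-only` afterwards so you know the binary you just moved is the one your shell actually picks up.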

Step 2: AWS Setup (Where Dreams Go to Die)

Create your S3 bucket first - this part is easy:

aws s3 mb s3://your-velero-bucket-name-here

Now comes the IAM clusterfuck. The official AWS docs give you a policy that's missing half the permissions you actually need. You'll discover the missing ones when backups fail silently. Save yourself hours and use this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject", 
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::your-bucket/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::your-bucket"
        }
    ]
}

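Add EBS snapshot permissions if you want volume snapshots (and you do). ec2:CreateTags and ec2:CreateVolume are in there so Velero can tag the snapshots it takes and turn them back into volumes on restore:

{
    "Effect": "Allow",
    "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateTags",
        "ec2:CreateVolume",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:DescribeInstances"
    ],
    "Resource": "*"
}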

Step 3: Install and Pray

velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.11.0 \
    --bucket your-velero-bucket-name-here \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1 \
    --secret-file ./credentials-velero

This will probably fail the first time with some cryptic error about credentials. Check kubectl logs -n velero deployment/velero to see what's actually broken.
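One common cause of that cryptic credentials error is a missing or malformed ./credentials-velero file. As a rough sketch, assuming you're using a static IAM user key rather than IRSA, the file follows the AWS shared-credentials format with a [default] profile. The helper name `write_velero_credentials` is hypothetical.

```shell
# Write the credentials file that --secret-file points at.
# Format: AWS shared-credentials file with a single [default] profile.
write_velero_credentials() {
    local key_id="$1" secret="$2" out="${3:-./credentials-velero}"
    (
        umask 077   # keep the key readable by the current user only
        printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
            "$key_id" "$secret" > "$out"
    )
}

# Example (not run automatically):
#   write_velero_credentials "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY"
```

Don't commit this file. Velero copies it into a Kubernetes secret during install, so you can delete the local copy once the install works.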

The Parts That Break Most Often

Backup Storage Locations (BSL): This is where your backups actually go - an S3 bucket or compatible object storage. When backups fail, it's usually because the BSL can't authenticate. Check kubectl get backupstoragelocation -n velero and look for a PHASE of "Unavailable". Multiple backup storage locations are possible but add complexity you probably don't need.

Volume Snapshot Locations (VSL): Where volume snapshots live, usually in the same region as your cluster to avoid data transfer costs. When snapshots fail, check if your storage provider supports CSI snapshots properly. Many don't, despite claiming they do.

Custom Resource Definitions: Velero adds these CRDs that you'll debug constantly:

`Backup`: Your backup job. Check .status.phase - "Completed" with warnings usually means it didn't back up what you think it did
`Restore`: Restore job. "InProgress" stuck forever means check the logs with velero restore logs
`Schedule`: Cron-based backup scheduling. Silent failures are common - set up monitoring
`BackupStorageLocation` and `VolumeSnapshotLocation`: Configuration CRDs that break when IAM policies change
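Since a dead BSL is the usual suspect, it's worth scripting the check. The sketch below filters a name/phase listing down to the locations that aren't "Available"; the function name `unavailable_bsls` is invented here, and the kubectl pipeline in the comment is one way to feed it.

```shell
# Print the name of every BSL whose phase is anything other than "Available".
# Reads "name<TAB>phase" lines on stdin; an empty phase counts as broken too.
unavailable_bsls() {
    awk -F'\t' 'NF && $2 != "Available" { print $1 }'
}

# Example (not run automatically):
#   kubectl get backupstoragelocation -n velero \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}' \
#     | unavailable_bsls
```

No output means every location validated. Any name printed is a location Velero can't currently write to.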

Two Ways to Back Up Volumes (One Works, One Doesn't)


CSI Volume Snapshots: Uses your cloud provider's native snapshot API. Works with AWS EBS, GCP Persistent Disks, and Azure Disks. Takes seconds regardless of volume size and costs way less than file backup. This is what you want to use unless your storage provider's CSI driver is broken (many are).

File System Backup: Uses Kopia to copy actual files from your volumes. Replaced Restic in v1.14 which fixed the memory leaks but introduced new ways for backups to fail. Takes forever, uses tons of network bandwidth, and costs more in storage. Use this only when CSI snapshots don't work with your storage setup.
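If you do end up stuck with file system backup, volumes are opted in per pod with the backup.velero.io/backup-volumes annotation (or all at once with --default-volumes-to-fs-backup on the backup). A small sketch of building that annotation; the helper name and the pod and volume names in the example are made up.

```shell
# Build the annotation that opts specific pod volumes into file system backup.
# Arguments are the volume names as they appear in the pod spec.
fs_backup_annotation() {
    local IFS=,   # makes "$*" join the arguments with commas
    echo "backup.velero.io/backup-volumes=$*"
}

# Example (not run automatically, names are placeholders):
#   kubectl -n shop annotate pod/db-0 "$(fs_backup_annotation data logs)"
#   velero backup create shop-fs --include-namespaces shop
```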

Production Reality Check

Resource Usage: File system backups will eat your cluster resources alive. The node agent pods can use massive amounts of memory during Kopia operations. Set resource requests and limits on the node-agent DaemonSet, or watch its pods get OOMKilled (or starve their neighbors) in the middle of a backup.

Network Bandwidth: File system backups upload your entire volume data to object storage. A 100GB database backup will transfer 100GB over the network and take 2 hours on a decent connection. CSI snapshots transfer almost nothing. Plan accordingly.
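The 2-hour figure is just arithmetic: 100GB in 2 hours works out to a sustained rate of roughly 111 Mbit/s. Here's a quick estimator for your own volumes. It assumes decimal gigabytes and a steady link with no compression or dedup, so treat the result as a ballpark.

```shell
# Estimate upload time in hours for a volume of SIZE_GB at MBPS megabits/second.
# hours = (GB * 8000 megabits per GB) / Mbit/s / 3600 seconds
transfer_hours() {
    awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.1f\n", (gb * 8000) / mbps / 3600 }'
}

# Example (not run automatically):
#   transfer_hours 100 111    # the 100GB database from above
#   transfer_hours 500 111    # a 500GB volume on the same link
```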

Storage Costs: Forgot to set retention policies? Your first AWS bill will be a surprise. File system backups cost more because they store actual data. CSI snapshots use incremental differences, so they're cheaper. Set retention or go broke.
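Retention in Velero is the --ttl flag on a backup or schedule, expressed as a duration. A minimal sketch of a daily schedule with an explicit TTL; the function names are hypothetical, and setting VELERO=echo gives you a dry run that prints the command instead of running it.

```shell
# Convert a retention period in days to the duration string --ttl expects.
ttl_for_days() {
    echo "$(( $1 * 24 ))h0m0s"
}

# Create a schedule that runs daily at 02:00 and expires backups after DAYS days.
# VELERO can be overridden (for example VELERO=echo) for a dry run.
create_daily_schedule() {
    local name="$1" days="$2"
    ${VELERO:-velero} schedule create "$name" \
        --schedule "0 2 * * *" \
        --ttl "$(ttl_for_days "$days")"
}

# Example (not run automatically):
#   create_daily_schedule daily-full 30
```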

Questions People Actually Ask (When Things Break)

Why does my backup say "Completed" but nothing was actually backed up?

Velero lies. "Completed" doesn't mean successful, it means the process finished. Check velero backup describe your-backup-name and look for warnings. Common culprits: RBAC permissions missing, storage class incompatibilities, or the backup included zero resources because your selectors were wrong. Always check what was actually included in the backup.
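Here's a rough way to automate that skepticism. The sketch takes the phase, warning count, error count, and items-backed-up count from a Backup's status and refuses to call it good unless all four look sane. `backup_verdict` is a made-up name, and the verdict thresholds are my assumption, not anything Velero defines.

```shell
# Decide whether a backup deserves trust.
# Args: phase, warnings, errors, items backed up (empty values count as 0).
backup_verdict() {
    local phase="$1" warnings="${2:-0}" errors="${3:-0}" items="${4:-0}"
    if [ "$phase" != "Completed" ]; then
        echo "FAIL: phase is ${phase:-unknown}"
    elif [ "$errors" -gt 0 ]; then
        echo "FAIL: $errors errors"
    elif [ "$items" -eq 0 ]; then
        echo "FAIL: zero items backed up"
    elif [ "$warnings" -gt 0 ]; then
        echo "CHECK: $warnings warnings"
    else
        echo "OK: $items items"
    fi
}

# Example (not run automatically; commas keep empty fields from shifting):
#   IFS=, read -r phase warnings errors items < <(kubectl get backups.velero.io \
#     your-backup-name -n velero -o jsonpath='{.status.phase},{.status.warnings},{.status.errors},{.status.progress.itemsBackedUp}')
#   backup_verdict "$phase" "$warnings" "$errors" "$items"
```

A "CHECK" still needs a human to read velero backup describe, but at least the zero-item backup stops passing as a success.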

How do I fix "access denied" errors on AWS?

Your IAM policy is wrong. The official docs are incomplete. You need S3 permissions (obviously) but also `ec2:DescribeVolumes`, `ec2:DescribeSnapshots`, `ec2:CreateSnapshot`, `ec2:DeleteSnapshot`, and `ec2:DescribeInstances` for EBS snapshots. If you're using IRSA, check GitHub issue #8240: the plugin still fucks up role assumptions.

Why is my restore stuck in "InProgress" forever?

The node agent pod probably crashed or restarted during the operation. Check kubectl get pods -n velero for restart counts, then kubectl logs -n velero <node-agent-pod> --previous to see why it died. If it restarted, the restore is orphaned and you'll need to delete it and try again. This happens a lot with large volume restores.
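To find the culprit quickly, something like the following works. It's a sketch that assumes the node agent pods carry the label name=node-agent, so check your own install with kubectl get pods -n velero --show-labels first. `restarted_pods` is an invented helper name.

```shell
# Print every pod that has restarted at least once.
# Reads "pod<TAB>restartCount" lines on stdin.
restarted_pods() {
    awk -F'\t' '$2 + 0 > 0 { print $1 " restarted " $2 " time(s)" }'
}

# Example (not run automatically; the label selector is an assumption):
#   kubectl get pods -n velero -l name=node-agent \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#     | restarted_pods
```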

My backup is taking 6 hours - is this normal?

If you're using file system backup (Kopia), yes, it's slow as hell. A 500GB volume can take 4-8 hours depending on your network and how many small files you have. CSI snapshots take seconds. Switch to CSI snapshots if your storage supports them, otherwise suffer through the file backup pain.

Why does Velero keep running out of memory?

File system backups using Kopia can eat massive amounts of memory, especially with lots of small files. Without limits, the node agent pods grow until they get OOMKilled or push the node into memory pressure. Set resource limits on the node-agent DaemonSet (not just the Velero deployment) or your cluster will suffer. v1.17 fixed some of this but it's still a memory hog.

How much is this going to cost me on AWS?

More than you think. S3 storage costs add up, especially if you forget retention policies. Budget around $0.023/GB/month for S3 Standard, plus snapshot costs for EBS volumes. A 1TB backup costs about $23.50/month in S3. Set retention policies or your first bill will be a shock.
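The math is simple enough to script. This sketch multiplies stored gigabytes by the S3 Standard rate quoted above; the rate is a default argument because it varies by region and tier, and it ignores request and transfer charges.

```shell
# Monthly S3 storage cost in dollars for GB gigabytes at RATE dollars/GB/month.
# RATE defaults to the S3 Standard figure quoted above.
s3_monthly_cost() {
    awk -v gb="$1" -v rate="${2:-0.023}" 'BEGIN { printf "%.2f\n", gb * rate }'
}

# Example (not run automatically):
#   s3_monthly_cost 1024          # 1TB at the default rate
#   s3_monthly_cost 1024 0.0125   # same data at a cheaper, hypothetical rate
```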

How do I know if my backups are actually working?

Test them. Seriously. Create a test backup, delete something important in a staging environment, then restore it. Most people discover their backups are broken during a real disaster. Use velero backup describe to check for warnings, and actually try restoring to make sure it works. Backup monitoring should alert when backups fail, not just when they complete.
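A restore drill can be as small as this: back up one namespace, then restore it into a scratch namespace next to the original so nothing gets overwritten. This is a sketch, not a full DR test. The function name and the "-drill" suffix are made up, and VELERO=echo turns it into a dry run.

```shell
# Back up namespace NS, then restore it into NS-drill for inspection.
# Second argument is an optional suffix for the backup name (defaults to a timestamp).
restore_drill() {
    local ns="$1" stamp="${2:-$(date +%Y%m%d%H%M%S)}"
    local backup="drill-${ns}-${stamp}"
    ${VELERO:-velero} backup create "$backup" --include-namespaces "$ns" --wait &&
    ${VELERO:-velero} restore create "${backup}-restore" --from-backup "$backup" \
        --namespace-mappings "${ns}:${ns}-drill" --wait
}

# Example (not run automatically):
#   restore_drill shop
#   kubectl get all -n shop-drill   # is everything you expected actually there?
```

Don't trust the exit code alone. Run velero backup describe and velero restore describe on the results and read the warnings.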

Why are my scheduled backups failing silently?

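Velero schedules aren't Kubernetes CronJobs. They're Schedule custom resources that the Velero server turns into Backup objects, so kubectl get cronjobs shows nothing and nobody tells you when a run fails. Check velero schedule get for the last backup time and velero backup get for the phase of each run. Failed schedules often happen because of resource limits, storage authentication issues, or quota problems. Set up Prometheus monitoring to alert when backups fail.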

Can I use Velero without cloud storage?

Yes, with MinIO or other S3-compatible storage. MinIO is solid for on-premises setups, while Ceph Object Gateway will make you question your life choices. You can run MinIO in a container or as a standalone service. Just make sure your storage is actually reliable: your backups are only as good as the storage they're on.

What's the difference between Restic and Kopia?

Kopia replaced Restic in v1.14 because Restic had memory leaks that would OOMKill nodes during large backups. Kopia is more memory-efficient and faster, but introduced new failure modes. Existing Restic backups are still restorable, but new file system backups use Kopia. The migration was generally worth it.

Why won't my CSI snapshots work?

Your CSI driver probably doesn't support snapshots properly, or it's buggy. Many storage providers claim CSI snapshot support but it's broken or incomplete. Check if your storage class supports snapshots with kubectl get volumesnapshotclass. If snapshots fail, fall back to file system backup with Kopia.

Will upgrading Velero break my existing backups?

Usually no, but check the compatibility matrix for plugin versions. The v1.14 upgrade from Restic to Kopia was the big breaking change: file system backups needed repository migration. Test upgrades in staging first because plugin incompatibilities can leave you unable to restore backups.

What happens to secrets when I restore to a different cluster?

Secrets get restored but cloud-specific authentication (IAM roles, service accounts) won't work in the new environment. You'll need to manually reconfigure authentication for things like database connections, external APIs, and cloud services. Don't restore production secrets to staging environments: you'll accidentally send emails to real customers.
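One blunt way to keep prod credentials out of staging is to leave Secrets out of the restore entirely with --exclude-resources. A minimal sketch; the function name and the "-nosecrets" suffix are hypothetical, and VELERO=echo prints the command for a dry run.

```shell
# Restore a backup while skipping every Secret object in it.
# VELERO can be overridden (for example VELERO=echo) for a dry run.
restore_without_secrets() {
    local backup="$1"
    ${VELERO:-velero} restore create "${backup}-nosecrets" \
        --from-backup "$backup" \
        --exclude-resources secrets
}

# Example (not run automatically):
#   restore_without_secrets prod-daily
```

Pods that mount those secrets will sit in a failed state until you create staging versions, which is exactly the reminder you want.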

Does Velero back up my operators and CRDs?

Yes, CRDs and custom resources get backed up. Most operators resume working after restore, but operators that manage external services (databases, cloud resources) may need manual intervention to reconnect to their managed services. Test operator restore procedures because they often have weird edge cases.

How do I set up monitoring so I know when backups fail?

Set up Prometheus monitoring and alerts because Velero doesn't alert you by default. Use the Grafana dashboards for monitoring views. The key metrics are backup success/failure rates and backup duration. Silent failures are common, so monitoring is essential.
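Before you have Prometheus wired up, you can eyeball the same numbers straight from the Velero server's metrics endpoint. This sketch sums the failure and partial-failure counters from Prometheus text output; `backup_failures` is a made-up name, and the port-forward in the comment assumes the default metrics port of 8085.

```shell
# Sum velero_backup_failure_total and velero_backup_partial_failure_total
# across all label sets. Reads Prometheus text-format metrics on stdin.
backup_failures() {
    awk '/^velero_backup_(partial_)?failure_total/ { sum += $NF } END { print sum + 0 }'
}

# Example (not run automatically):
#   kubectl port-forward -n velero deploy/velero 8085:8085 &
#   curl -s localhost:8085/metrics | backup_failures
```

Anything above zero deserves a look at velero backup get. A real alert rule on these counters is still the end goal.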

Should I use Velero or just rely on cloud provider backups?

Depends on vendor lock-in tolerance. AWS Backup and Azure Backup are easier to set up and work better with native cloud services, but lock you into that provider. Velero gives you portability and consistency across clouds, but requires more setup and debugging. Choose cloud backups for simplicity, Velero for multi-cloud freedom.

