What is Velero

Velero is the backup tool you install after experiencing your first major Kubernetes disaster. Originally called Heptio Ark, it's now maintained by VMware Tanzu and has graduated from the CNCF, which means it's stable enough that you won't get fired for using it.

v1.17.0 dropped in 2025 with Windows support and micro-service architecture for fs-backup. The big deal is they fixed the memory hogging issues - node-agent pods used to request massive memory and hold it forever. Now they actually release it when done backing up your stuff.

Companies like Netflix, MongoDB, and Reddit run this in production, which tells you it's battle-tested enough for your workloads.

The Three Parts That Actually Matter

Velero has three components that you'll spend way too much time debugging:

Velero Server: The controller that lives in your cluster and watches for backup CRDs. When it works, it talks to the K8s API and your storage backend. When it doesn't work, you'll spend hours checking RBAC permissions and IAM policies. The server handles backup schedules but won't tell you when they fail unless you set up monitoring.

Velero CLI: Your interface for everything. velero backup create, velero schedule create, velero restore create - sounds simple until you realize half the commands fail silently. The CLI has validation but it's optimistic. Always check velero backup describe after creating anything because "Completed" doesn't mean what you think it means.

Node Agent: DaemonSet that does the heavy lifting for persistent volume backups when CSI snapshots aren't available. Since v1.14 it uses Kopia instead of Restic, which fixed the memory leaks but introduced new failure modes. Pro tip: these pods will randomly restart and leave your backups stuck in "InProgress" status.

When You Actually Need This Thing

Velero solves three problems that will ruin your day:

Disaster Recovery: When someone accidentally deletes prod or your cloud provider has an outage. Netflix uses this because they learned the hard way that shit happens. Recovery time is anywhere from 10 minutes if you're lucky to 6 hours if your persistent volumes are massive. Don't ask me how I know.

Cluster Migration: Moving workloads between clusters without wanting to rebuild everything from scratch. The official migration docs make it sound easy, but you'll spend days fixing storage classes, ingress annotations, and whatever custom resource definitions broke between environments. Test this multiple times before doing it in prod.

Environment Replication: Copying prod to staging so developers can debug with real data instead of made-up test fixtures. Works great until you restore production secrets to staging and suddenly your test environment is sending real emails to customers. Always sanitize before restoring.
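One way to keep prod secrets out of staging is to never restore them in the first place. This is a rough sketch, not the official procedure: the function name `restore_to_staging` and its arguments are made up for illustration, and it assumes your backup and both namespaces already exist.

```shell
# Hypothetical helper: restore a backup into another namespace, skipping Secrets.
restore_to_staging() {
    local backup="$1" src_ns="$2" dst_ns="$3"

    # --namespace-mappings rewrites the namespace on restore;
    # --exclude-resources leaves every Secret out of the restore.
    velero restore create "${backup}-to-${dst_ns}" \
        --from-backup "$backup" \
        --namespace-mappings "${src_ns}:${dst_ns}" \
        --exclude-resources secrets \
        --wait
}
```

Called as `restore_to_staging nightly prod staging`. Pods that mount the missing Secrets will sit in a failed state until you create staging-safe replacements, which is the point.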

Storage Options (And Where They'll Bite You)

Velero supports multiple storage backends, each with its own special way of frustrating you:

AWS S3: The AWS plugin works great once you survive the IAM permission hell. You'll need about 47 different permissions, and the documentation lies about half of them. GitHub issue #8240 from September 2024 shows the plugin still fucks up IRSA roles. Budget 2 days for getting the permissions right, then another day when it breaks after an AWS update.

Google Cloud Storage: The GCP plugin is honestly the least painful to set up. Workload Identity is cleaner than dealing with service account keys, and Google's IAM model actually makes sense. Still fails sometimes but at least the error messages are readable.

Azure Blob Storage: The Azure plugin works with managed disks and blob storage. Managed identity is nice when it works, but Azure's authentication model is a maze. Expect to spend time figuring out which identity goes where.

CSI Volume Snapshots: Uses your CSI driver to take volume snapshots instead of copying files. Fast and efficient when it works, but many CSI drivers have bugs. AWS EBS snapshots are reliable, others are hit-or-miss. Always test restoring from snapshots - taking them is the easy part.

S3-Compatible Storage: MinIO and Ceph work for on-premises setups. MinIO is solid, Ceph will make you question your life choices. Good for air-gapped environments where cloud storage isn't an option.

Velero vs The Competition (Honest Assessment)

| Feature | Velero | Kasten K10 | Portworx PX-Backup | Longhorn | Native etcd Backup |
|---|---|---|---|---|---|
| Cost | Free but you pay in debugging time | $$$ per node, costs more than your rent | $$$ Enterprise tax | Free | Free |
| Setup Hell | AWS IAM will break your soul | Requires enterprise sales call | Only works if you already bought Portworx | Actually easy for once | etcdctl snapshot save |
| Cloud Support | Works everywhere after 2 days of config | Works great if you pay enterprise rates | AWS/Azure if you use Portworx | Basic cloud support | Backup works anywhere |
| Application Backups | Decent with hooks you'll forget to test | Actually works but costs a fortune | Good if you're locked into Portworx | Block-level only, no app consistency | Control plane only, your apps are fucked |
| Migration | Cross-cloud works with manual fixes | Enterprise tools that actually work | Portworx-to-Portworx only | Not happening | You rebuild everything |
| Volume Snapshots | CSI works when drivers aren't buggy | Proprietary magic that actually works | Fast with Portworx, useless otherwise | Built-in snapshots work okay | No volumes backed up |
| File Backup | Kopia fixed the memory leaks | Commercial engine, actually optimized | Limited unless you use Portworx storage | Block replication, not file backup | None |
| Scheduling | Cron jobs that fail silently | Enterprise policies with actual alerting | Works if you're in their ecosystem | Basic cron, at least it's simple | Cron + scripts you'll write |
| Performance | Depends on your storage backend | Fast because you pay for optimization | Fast with Portworx, slow everywhere else | Slow as hell for large datasets | Fastest possible |
| When It Breaks | Community help on Slack | You call support and they actually answer | Portworx support or you're SOL | GitHub issues and prayer | Standard Unix debugging |
| Real Usage | Netflix, Reddit use it in anger | Fortune 500s with backup budgets | Companies locked into Portworx | Rancher users and edge deployments | Everyone backs up etcd |
| Verdict | Free but you'll earn every backup | Works great, costs a fortune | Only if you're already Portworx | Simple setups only | Bare minimum coverage |

Getting Velero Running (The Painful Truth)

What You Actually Need

Velero needs Kubernetes v1.20+ - basically any modern cluster. v1.17.0 is current and works with K8s up to 1.31. The hard part isn't the version compatibility, it's the fucking IAM permissions that'll make you question your career choices.

Step-by-Step Installation Hell

Step 1: Get the CLI (This Part Actually Works)

# Download v1.17.0 - check the GitHub releases page for the latest version
wget https://github.com/vmware-tanzu/velero/releases/download/v1.17.0/velero-v1.17.0-linux-amd64.tar.gz
tar -xvf velero-v1.17.0-linux-amd64.tar.gz
sudo mv velero-v1.17.0-linux-amd64/velero /usr/local/bin/

Step 2: AWS Setup (Where Dreams Go to Die)

Create your S3 bucket first - this part is easy:

aws s3 mb s3://your-velero-bucket-name-here

Now comes the IAM clusterfuck. The official AWS docs give you a policy that's missing half the permissions you actually need. You'll discover the missing ones when backups fail silently. Save yourself hours and use this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:DeleteObject",
                "s3:PutObject",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "arn:aws:s3:::your-bucket/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::your-bucket"
        }
    ]
}

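Add EBS snapshot permissions if you want volume snapshots (and you do). This goes in as another entry in the Statement array. ec2:CreateVolume and ec2:CreateTags are there because restoring from a snapshot has to create and tag a new volume:

{
    "Effect": "Allow",
    "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeSnapshots",
        "ec2:CreateSnapshot",
        "ec2:DeleteSnapshot",
        "ec2:CreateVolume",
        "ec2:CreateTags",
        "ec2:DescribeInstances"
    ],
    "Resource": "*"
}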

Step 3: Install and Pray

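# The plugin tag has to match your Velero version - check the plugin's compatibility matrix before copying this
velero install \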
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.11.0 \
    --bucket your-velero-bucket-name-here \
    --backup-location-config region=us-east-1 \
    --snapshot-location-config region=us-east-1 \
    --secret-file ./credentials-velero

This will probably fail the first time with some cryptic error about credentials. Check kubectl logs -n velero deployment/velero to see what's actually broken.
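The `--secret-file` argument expects a file in the AWS shared-credentials format, and a typo in it is a common source of that first cryptic error. Here's a minimal sketch for generating it, assuming you're using static access keys rather than IRSA. The function name `write_velero_credentials` is made up for this example.

```shell
# Hypothetical helper: write the credentials file that --secret-file points at.
write_velero_credentials() {
    local key_id="$1" secret_key="$2" out="${3:-./credentials-velero}"

    # Subshell so the restrictive umask does not leak into the caller's shell.
    (
        umask 077
        printf '[default]\naws_access_key_id=%s\naws_secret_access_key=%s\n' \
            "$key_id" "$secret_key" > "$out"
    )
}
```

Run it as `write_velero_credentials "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY"` before `velero install`, and delete the file once the secret exists in the cluster.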

The Parts That Break Most Often

Backup Storage Locations (BSL): This is where your backups actually go - an S3 bucket or compatible object storage. When backups fail, it's usually because the BSL can't authenticate. Check kubectl get backupstoragelocation -n velero and look for a phase of "Unavailable". Multiple backup storage locations are possible but add complexity you probably don't need.
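If you'd rather script that check than eyeball it, something like this works as a starting point. It's a sketch that assumes Velero lives in the `velero` namespace, and `unavailable_bsls` is just an illustrative name.

```shell
# Hypothetical helper: print every backup storage location that is not "Available".
unavailable_bsls() {
    kubectl get backupstoragelocations.velero.io -n velero \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
        # A location with no phase yet (never validated) is reported too.
        awk '$2 != "Available" { print $1 }'
}
```

Empty output means every location validated. Anything printed is a location Velero currently can't reach or authenticate against.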

Volume Snapshot Locations (VSL): Where volume snapshots live, usually in the same region as your cluster to avoid data transfer costs. When snapshots fail, check if your storage provider supports CSI snapshots properly. Many don't, despite claiming they do.

Custom Resource Definitions: Velero adds these CRDs that you'll debug constantly:

  • `Backup`: Your backup job. Check .status.phase - "Completed" with warnings usually means it didn't back up what you think it did
  • `Restore`: Restore job. "InProgress" stuck forever means check the logs with velero restore logs
  • `Schedule`: Cron-based backup scheduling. Silent failures are common - set up monitoring
  • `BackupStorageLocation` and `VolumeSnapshotLocation`: Configuration CRDs that break when IAM policies change
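Since "Completed" with warnings is the case that bites people, it helps to read the phase and the warning/error counts straight off the Backup resource. This is a minimal sketch assuming the `velero` namespace. `backup_health` is a made-up name, and treating any warning as unhealthy is a deliberately strict choice you may want to loosen.

```shell
# Hypothetical helper: succeed only for a Completed backup with zero warnings and errors.
backup_health() {
    local name="$1" raw phase warnings errors

    raw="$(kubectl get backups.velero.io "$name" -n velero \
        -o jsonpath='{.status.phase}|{.status.warnings}|{.status.errors}')" || return 2

    # The counts are omitted from the resource when they are zero, so default them.
    IFS='|' read -r phase warnings errors <<EOF
$raw
EOF
    warnings="${warnings:-0}"
    errors="${errors:-0}"

    echo "phase=$phase warnings=$warnings errors=$errors"
    [ "$phase" = "Completed" ] && [ "$warnings" -eq 0 ] && [ "$errors" -eq 0 ]
}
```

Use it in a pipeline step like `backup_health nightly || velero backup describe nightly --details` so the detailed output only shows up when something needs reading.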

Two Ways to Back Up Volumes (One Works, One Doesn't)

CSI Volume Snapshots: Uses your cloud provider's native snapshot API. Works with AWS EBS, GCP Persistent Disks, and Azure Disks. Takes seconds regardless of volume size and costs way less than file backup. This is what you want to use unless your storage provider's CSI driver is broken (many are).

File System Backup: Uses Kopia to copy actual files from your volumes. Replaced Restic in v1.14 which fixed the memory leaks but introduced new ways for backups to fail. Takes forever, uses tons of network bandwidth, and costs more in storage. Use this only when CSI snapshots don't work with your storage setup.
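When you do end up on file system backup, there are two ways to opt volumes in: annotate individual pods, or flip it on for a whole backup. A sketch of both, assuming the node agent was installed (`velero install --use-node-agent`). The function names are invented for illustration.

```shell
# Hypothetical helper: opt specific pod volumes in to file system backup.
# $3 is a comma-separated list of volume names from the pod spec.
annotate_for_fs_backup() {
    local namespace="$1" pod="$2" volumes="$3"
    kubectl -n "$namespace" annotate pod "$pod" \
        "backup.velero.io/backup-volumes=${volumes}" --overwrite
}

# Hypothetical helper: back up every pod volume in a namespace with file system backup.
backup_with_fs() {
    local name="$1" namespace="$2"
    velero backup create "$name" \
        --include-namespaces "$namespace" \
        --default-volumes-to-fs-backup \
        --wait
}
```

The annotation route keeps the slow path limited to the volumes that need it. The flag is simpler, but it drags every volume in the namespace through Kopia.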

Production Reality Check

Resource Usage: File system backups will eat your cluster resources alive. The node agent pods can request massive amounts of memory during Kopia operations. Set appropriate resource limits or watch your nodes get OOMKilled during backup operations.

Network Bandwidth: File system backups upload your entire volume data to object storage. A 100GB database backup will transfer 100GB over the network and take 2 hours on a decent connection. CSI snapshots transfer almost nothing. Plan accordingly.
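The 2-hour figure is just arithmetic: 100GB in 2 hours works out to roughly 111 Mbit/s of sustained throughput. Here's a back-of-envelope sketch for plugging in your own numbers. It assumes decimal gigabytes and ignores compression, deduplication, and small-file overhead, so treat the result as a floor. `estimate_transfer_hours` is an illustrative name.

```shell
# Hypothetical helper: hours needed to move size_gb at a sustained mbit_per_s.
estimate_transfer_hours() {
    local size_gb="$1" mbit_per_s="$2"
    # GB -> megabits (x 8 x 1000), divide by rate for seconds, then by 3600 for hours.
    awk -v gb="$size_gb" -v mbps="$mbit_per_s" \
        'BEGIN { printf "%.1f\n", (gb * 8 * 1000) / mbps / 3600 }'
}
```

For example, `estimate_transfer_hours 100 111` prints `2.0`, and `estimate_transfer_hours 500 200` prints `5.6`.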

Storage Costs: Forgot to set retention policies? Your first AWS bill will be a surprise. File system backups cost more because they store actual data. CSI snapshots use incremental differences, so they're cheaper. Set retention or go broke.
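Retention in Velero is a TTL on each backup, and the easiest place to set it is on the schedule so every backup it creates inherits it. A minimal sketch: the 02:00 cron expression and the 720h (30 day) default are arbitrary choices for the example, and `create_daily_schedule` is a made-up name.

```shell
# Hypothetical helper: daily schedule whose backups expire after a TTL.
create_daily_schedule() {
    local name="$1" namespace="$2" ttl="${3:-720h}"
    velero schedule create "$name" \
        --schedule "0 2 * * *" \
        --include-namespaces "$namespace" \
        --ttl "$ttl"
}
```

`create_daily_schedule daily-prod prod` keeps 30 days of backups. Pass a third argument like `168h` for a week.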

Questions People Actually Ask (When Things Break)

Q

Why does my backup say "Completed" but nothing was actually backed up?

A

Velero lies. "Completed" doesn't mean successful, it means the process finished. Check velero backup describe your-backup-name and look for warnings. Common culprits: RBAC permissions missing, storage class incompatibilities, or the backup included zero resources because your selectors were wrong. Always check what was actually included in the backup.

Q

How do I fix "access denied" errors on AWS?

A

Your IAM policy is wrong. The official docs are incomplete. You need S3 permissions (obviously) but also `ec2:DescribeVolumes`, `ec2:DescribeSnapshots`, `ec2:CreateSnapshot`, `ec2:DeleteSnapshot`, and `ec2:DescribeInstances` for EBS snapshots. If you're using IRSA, check GitHub issue #8240 - the plugin still fucks up role assumptions.

Q

Why is my restore stuck in "InProgress" forever?

A

The node agent pod probably crashed or restarted during the operation. Check kubectl get pods -n velero and kubectl logs -n velero on the node agent. If it restarted, the restore is orphaned and you'll need to delete it and try again. This happens a lot with large volume restores.

Q

My backup is taking 6 hours - is this normal?

A

If you're using file system backup (Kopia), yes, it's slow as hell. A 500GB volume can take 4-8 hours depending on your network and how many small files you have. CSI snapshots take seconds. Switch to CSI snapshots if your storage supports them, otherwise suffer through the file backup pain.

Q

Why does Velero keep running out of memory?

A

File system backups using Kopia can eat massive amounts of memory, especially with lots of small files. The node agent pods request memory dynamically and can get OOMKilled or starve everything else on the node. Set resource limits on the node-agent DaemonSet (not the Velero deployment, which doesn't do the file copying) or your cluster will suffer. v1.17 fixed some of this but it's still a memory hog.

Q

How much is this going to cost me on AWS?

A

More than you think. S3 storage costs add up, especially if you forget retention policies. Budget around $0.023/GB/month for S3 Standard, plus snapshot costs for EBS volumes. A 1TB backup costs about $23/month in S3, and that's per retained copy if the data doesn't deduplicate. Set retention policies or your first bill will be a shock.
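The math is simple enough to script. This sketch assumes the $0.023/GB/month S3 Standard rate quoted above and ignores request and transfer charges. `s3_monthly_cost` is an illustrative name, and the second argument lets you swap in your own region's price.

```shell
# Hypothetical helper: monthly S3 storage cost in dollars for size_gb of backups.
s3_monthly_cost() {
    awk -v gb="$1" -v price="${2:-0.023}" 'BEGIN { printf "%.2f\n", gb * price }'
}
```

`s3_monthly_cost 1024` prints `23.55`. Multiply the size by however many copies your retention policy keeps around.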

Q

How do I know if my backups are actually working?

A

Test them. Seriously. Create a test backup, delete something important in a staging environment, then restore it. Most people discover their backups are broken during a real disaster. Use velero backup describe to check for warnings, and actually try restoring to make sure it works. Backup monitoring should alert when backups fail, not just when they complete.

Q

Why are my scheduled backups failing silently?

A

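Velero schedules are not Kubernetes CronJobs. A Schedule is Velero's own custom resource with a cron expression, and the Velero server creates a Backup each time it fires, so kubectl get cronjobs -n velero won't show you anything useful. Check velero schedule get and velero backup get instead. Scheduled backups often fail because of resource limits, storage authentication issues, or quota problems, and nothing alerts you by default. Set up Prometheus monitoring to alert when backups fail.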
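Backups created by a schedule carry a `velero.io/schedule-name` label, which makes it easy to list the ones that didn't finish cleanly. A rough sketch, assuming the `velero` namespace. `failed_backups_for_schedule` is a made-up name, and note that it also reports backups that are still in progress.

```shell
# Hypothetical helper: list a schedule's backups whose phase is not "Completed".
failed_backups_for_schedule() {
    local schedule="$1"
    kubectl get backups.velero.io -n velero \
        -l "velero.io/schedule-name=${schedule}" \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}' |
        awk '$2 != "Completed" { print $1 " " $2 }'
}
```

Run it from a cron job or CI step and alert on non-empty output if you don't have Prometheus wired up yet.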

Q

Can I use Velero without cloud storage?

A

Yes, with MinIO or other S3-compatible storage. MinIO is solid for on-premises setups, while Ceph Object Gateway will make you question your life choices. You can run MinIO in a container or as a standalone service. Just make sure your storage is actually reliable - your backups are only as good as the storage they're on.

Q

What's the difference between Restic and Kopia?

A

Kopia replaced Restic in v1.14 because Restic had memory leaks that would OOMKill nodes during large backups. Kopia is more memory-efficient and faster, but introduced new failure modes. Existing Restic backups are still restorable, but new file system backups use Kopia. The migration was generally worth it.

Q

Why won't my CSI snapshots work?

A

Your CSI driver probably doesn't support snapshots properly, or it's buggy. Many storage providers claim CSI snapshot support but it's broken or incomplete. Check if your storage class supports snapshots with kubectl get volumesnapshotclass. If snapshots fail, fall back to file system backup with Kopia.
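One non-obvious gotcha: even with a working driver, Velero only uses a VolumeSnapshotClass that carries the `velero.io/csi-volumesnapshot-class` label. A minimal sketch for adding it. The function name is invented, and you supply the class name from `kubectl get volumesnapshotclass`.

```shell
# Hypothetical helper: mark a VolumeSnapshotClass as the one Velero should use.
mark_snapshot_class_for_velero() {
    local class="$1"
    kubectl label volumesnapshotclass "$class" \
        velero.io/csi-volumesnapshot-class=true --overwrite
}
```

Label one class per CSI driver. If the label is missing, CSI snapshots fail even though the driver itself is fine.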

Q

Will upgrading Velero break my existing backups?

A

Usually no, but check the compatibility matrix for plugin versions. The v1.14 upgrade from Restic to Kopia was the big breaking change - file system backups needed repository migration. Test upgrades in staging first because plugin incompatibilities can leave you unable to restore backups.

Q

What happens to secrets when I restore to a different cluster?

A

Secrets get restored but cloud-specific authentication (IAM roles, service accounts) won't work in the new environment. You'll need to manually reconfigure authentication for things like database connections, external APIs, and cloud services. Don't restore production secrets to staging environments - you'll accidentally send emails to real customers.

Q

Does Velero back up my operators and CRDs?

A

Yes, CRDs and custom resources get backed up. Most operators resume working after restore, but operators that manage external services (databases, cloud resources) may need manual intervention to reconnect to their managed services. Test operator restore procedures because they often have weird edge cases.

Q

How do I set up monitoring so I know when backups fail?

A

Set up Prometheus monitoring and alerts because Velero doesn't alert you by default. Use the Grafana dashboards for monitoring views. The key metrics are backup success/failure rates and backup duration. Silent failures are common, so monitoring is essential.
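Until real alerting is in place, you can at least scrape the failure counter by hand. This sketch reads Prometheus text format on stdin and sums `velero_backup_failure_total` across schedules. It assumes the Velero server exposes metrics on port 8085 (the default as far as I know, but verify it against your deployment), and `sum_backup_failures` is a made-up name.

```shell
# Hypothetical helper: total failed backups from Prometheus text-format metrics on stdin.
sum_backup_failures() {
    # HELP/TYPE comment lines start with '#', so the anchor skips them.
    awk '/^velero_backup_failure_total/ { total += $NF } END { print total + 0 }'
}

# Usage (assumed metrics port; not run here):
#   kubectl -n velero port-forward deploy/velero 8085:8085 &
#   curl -s http://localhost:8085/metrics | sum_backup_failures
```

A number that goes up between runs means a backup failed since you last looked, which is exactly what a proper Prometheus alert rule should be telling you instead.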

Q

Should I use Velero or just rely on cloud provider backups?

A

Depends on vendor lock-in tolerance. AWS Backup and Azure Backup are easier to set up and work better with native cloud services, but lock you into that provider. Velero gives you portability and consistency across clouds, but requires more setup and debugging. Choose cloud backups for simplicity, Velero for multi-cloud freedom.
