What You're Actually Getting Into

This isn't just four tools working together - it's four different ways for things to break simultaneously. But here's the thing: after you survive the initial pain, you get automated deployments that actually don't suck. The stack becomes worth it when you're deploying 50+ times a day without breaking into a cold sweat.

What Each Tool Actually Does


Docker packages your app into containers. Works great until you hit the layer cache bullshit or Docker Desktop randomly stops working and you spend an hour restarting everything. The build optimization docs help, but multi-stage builds are still your best bet for cache performance.
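Here's the shape of it - a minimal multi-stage sketch, assuming a Node.js app that builds into `dist/` (swap in your own toolchain):

```dockerfile
# Build stage: the heavy toolchain lives here and never ships
FROM node:20-slim AS build
WORKDIR /app
# Copy manifests first so dependency layers cache independently of source edits
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: only production deps and built artifacts
FROM node:20-slim
WORKDIR /app
COPY --from=build /app/package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
CMD ["node", "dist/index.js"]
```

The point of the split: dependency layers only rebuild when the manifests change, so the cache survives your everyday source edits instead of invalidating on every commit.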


Kubernetes orchestrates those containers. It's powerful as hell, but the learning curve is steeper than tax law. Don't use K8s 1.28.3 - there's a networking bug that'll ruin your week. Stick with 1.29.x.

ArgoCD watches your Git repos and deploys changes automatically. The UI is pretty but times out whenever you're trying to debug a failed deployment. ArgoCD 2.11.x randomly stops syncing - upgrade to 2.12.x or deal with mysterious failures.

Prometheus monitors everything and sends you alerts at 3am. It works great until you see your storage bill. Seriously, configure retention policies or you'll be paying for metrics from 2019.

How It's Supposed to Work

You commit code → Docker builds it → ArgoCD sees the change → Kubernetes deploys it → Prometheus monitors it → You sleep peacefully at night (haha, right).

In reality: You commit code → Docker build fails because of some random layer issue → You fix that → ArgoCD doesn't sync because of a webhook timeout → You manually refresh → Kubernetes pod crashes with CrashLoopBackOff → You debug for an hour → Prometheus fills up your disk with metrics → You fix the retention policy → It finally works → You get an alert at 2am anyway.

But here's the thing: after 6 months of fighting this stack, deployments go from taking 2 hours to 5 minutes. And when something breaks, you actually know why. That's the real value - not the automation itself, but the observability and consistency that comes with it.

Now let's talk about what actually happens when you try to implement this beautiful theory in the real world.

The Painful Reality of Actually Implementing This Shit


What The Tutorials Don't Tell You

Setting up this stack takes 3 days minimum, not the 3 hours every tutorial claims. I've done this 5 times now, and it never gets easier - it just finds new ways to break.

The Docker Build Hell: Your builds will randomly fail with `COPY failed: no source files were specified`. This happens because Docker's context handling is garbage. You'll spend 2 hours debugging only to find out you had a trailing space in your `.dockerignore` file, or the build context is wrong, or you're running from the wrong directory.
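When it hits, this is the triage sequence that actually finds it (paths and image names here are placeholders for your own project):

```bash
# Confirm what Docker actually sees in the build context -
# anything excluded by .dockerignore will be invisible to COPY
cat .dockerignore            # look for trailing spaces and over-broad globs
ls -la ./app                 # verify the COPY source path exists relative to the context

# Build with the context spelled out explicitly instead of trusting your cwd
docker build -f Dockerfile -t myapp:debug .

# When the cache is hiding the real failure, rebuild from scratch with full output
docker build --no-cache --progress=plain -t myapp:debug .
```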

Kubernetes YAML Nightmare: You'll write 47 YAML files for a simple app. One indentation error and nothing works. The error messages are about as helpful as a chocolate teapot: `error validating data: ValidationError(Deployment.spec)`. YAML validation tools exist, but a kubectl/cluster version mismatch will still fuck you over. Online validators catch syntax errors; validating against the schema your cluster actually runs is still your problem.
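What's worth running before you trust a manifest - the dry-runs are stock kubectl, kubeconform is an optional extra for CI:

```bash
# Client-side: catches YAML syntax and obvious schema problems
kubectl apply --dry-run=client -f deployment.yaml

# Server-side: validates against the API version your cluster actually runs,
# which is where the version-mismatch errors show up
kubectl apply --dry-run=server -f deployment.yaml

# Offline schema validation in CI, pinned to your cluster version
kubeconform -kubernetes-version 1.29.0 -strict deployment.yaml
```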

ArgoCD Sync Problems: ArgoCD will randomly decide your app is "OutOfSync" even when nothing changed. The sync hooks documentation is wrong half the time. You'll click "Sync" 17 times before giving up and doing kubectl apply manually.

When ArgoCD Shits the Bed at 3AM:
I've been there - production deploy fails, ArgoCD shows "Sync Failed", and the error is some cryptic YAML validation bullshit. Here's what actually works:

  1. `kubectl get events --sort-by=.metadata.creationTimestamp` - see what K8s is actually complaining about
  2. Check ArgoCD controller logs: `kubectl logs -n argocd statefulset/argocd-application-controller` (it runs as a StatefulSet in recent versions)
  3. Force refresh in the UI, then sync with "prune" and "force" checked
  4. If still broken, `kubectl apply -f manifest.yaml` to get specific error messages
  5. Fix the actual issue, commit, pray ArgoCD picks it up
  6. Nuclear option: `kubectl delete app -n argocd your-app-name && argocd app create ...`

The error "failed to sync: rpc error: code = Unknown desc = error validating data" means absolutely fucking nothing. Check the actual Kubernetes events - that's where the real error lives.

Prometheus Storage Apocalypse: Prometheus will happily eat 50GB of disk space in a week if you don't configure retention properly. I learned this when AWS charged me $200 for EBS storage in one month.
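The fix is two flags on the Prometheus server. The numbers here are examples - size them to your disk, not mine:

```yaml
# Prometheus container args (fragment of the server's pod spec) -
# cap retention by age AND size so whichever limit hits first wins
args:
  - --config.file=/etc/prometheus/prometheus.yml
  - --storage.tsdb.path=/prometheus
  - --storage.tsdb.retention.time=30d    # drop samples older than 30 days
  - --storage.tsdb.retention.size=40GB   # hard cap before the disk fills
```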

Multi-Environment Madness

Everyone says "just use different Git branches for different environments." That works great until you need a hotfix in production but not in staging. Then you're cherry-picking commits at 3am while everything's on fire.

ApplicationSets are supposed to solve this, but the templating syntax makes Helm look simple. The mixed Helm/Kustomize support is confusing as hell, and ApplicationSet's `{{...}}` placeholders collide with Helm's own template syntax. Good luck debugging when it generates the wrong namespace.
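For reference, a minimal list-generator ApplicationSet - repo URL and names are placeholders, and those `{{name}}` placeholders are exactly what will fight any Helm templating living in the same files:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: myapp-envs
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - name: staging
          - name: production
  template:
    metadata:
      name: "myapp-{{name}}"       # one Application per list element
    spec:
      project: default
      source:
        repoURL: https://github.com/example/myapp-config
        targetRevision: HEAD
        path: "overlays/{{name}}"  # e.g. Kustomize overlays per environment
      destination:
        server: https://kubernetes.default.svc
        namespace: "myapp-{{name}}"
      syncPolicy:
        automated:
          prune: true
```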

Security: It's Complicated

Git-based deployments are secure in theory. In practice, someone always commits a secret to the repo, and then you're rotating API keys while ArgoCD keeps trying to deploy the old ones.

Secret Management Hell:
Everyone says "don't commit secrets" but nobody explains how to actually deploy them. Here's what I learned after getting burned:

  • Sealed Secrets: Works, but bootstrap is chicken-and-egg hell. You need the controller to encrypt secrets, but you need encrypted secrets to deploy the controller. Solution: manually apply the controller first with kubectl apply (see the sketch below), then never lose the master key or you're fucked.
  • External Secrets Operator: Better but adds complexity and another point of failure. When AWS IAM is broken, your app can't start because it can't fetch secrets.
  • Cloud provider secret managers: Expensive but actually work in production. AWS Secrets Manager costs $0.40/secret/month which adds up fast.
  • Reality: You'll probably commit a secret at least once, so rotate everything regularly and use tools like gitleaks to catch it before push.

The Sealed Secrets bootstrap problem is real: how do you deploy the sealed-secrets controller when you need sealed secrets to deploy it? External Secrets Operator is better but performance sucks at scale. ArgoCD's secret management guide doesn't solve the chicken-and-egg problem.
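The least-bad bootstrap order I've found - the release version and file names below are placeholders, check the sealed-secrets releases page for whatever's current:

```bash
# 1. Bootstrap the controller OUTSIDE GitOps, once, with plain kubectl
kubectl apply -f https://github.com/bitnami-labs/sealed-secrets/releases/download/v0.27.0/controller.yaml

# 2. BACK UP THE MASTER KEY immediately - lose this and every sealed secret is garbage
kubectl get secret -n kube-system \
  -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml > master-key-backup.yaml

# 3. From now on, encrypt locally and commit only the sealed output
kubeseal --format yaml < my-secret.yaml > my-sealed-secret.yaml
git add my-sealed-secret.yaml   # the plaintext my-secret.yaml never touches Git
```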

The honest truth: this stack works, but budget 2 weeks for the initial setup and plan to spend 4 hours every month fixing random sync issues. Once it's stable though, deployments are fucking magical.

So you've survived the implementation nightmare - here's the honest scorecard for each tool, then what this actually costs you in both money and sanity.

Honest Assessment of Each Tool

| Component | What It Actually Does | The Good Shit | The Bad Shit | Should You Use It? |
|-----------|-----------------------|---------------|---------------|--------------------|
| Docker | Packages your app in a box | Works everywhere, decent caching | Desktop randomly dies, layer cache fuckery | Yes, no choice really |
| Kubernetes | Orchestrates containers with 47 YAML files | Scales well, self-healing | Learning curve from hell, networking nightmares | If you hate yourself |
| ArgoCD | Syncs Git to K8s (when it feels like it) | Nice UI, GitOps workflow | Randomly stops syncing, UI timeouts | Better than manual deploys |
| Prometheus | Collects metrics and bankrupts you | Powerful queries, great alerting | Storage costs, memory hungry | Just configure retention FFS |

What This Actually Costs You (Money and Sanity)

Real Resource Requirements


ArgoCD needs way more RAM than they tell you - I'd say 4GB minimum or it crashes during big deployments. Don't believe the official docs saying 2GB is enough. I learned this when ArgoCD died during a production deployment.

Prometheus? That thing eats memory. Start with 8GB, but you'll probably need 16GB+ if you're monitoring anything real. Their "1GB per million samples" calculation is bullshit - budget 2-3x that.
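If you're on the community Helm charts (argo-cd and kube-prometheus-stack - adjust the paths if you're not), the values fragments I'd start with look roughly like this:

```yaml
# Fragment of argo-cd chart values: the application controller
# is the component that OOMs during big syncs
controller:
  resources:
    requests:
      cpu: 500m
      memory: 2Gi
    limits:
      memory: 4Gi
---
# Fragment of kube-prometheus-stack values: start high, watch
# actual usage for a few weeks, then trim
prometheus:
  prometheusSpec:
    retention: 30d
    resources:
      requests:
        cpu: "1"
        memory: 8Gi
      limits:
        memory: 16Gi
```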

A basic 3-node K8s cluster on AWS costs $300-400/month just for the nodes. Add ArgoCD, Prometheus, and storage, and you're looking at $500-700/month before you deploy a single application.

Security: The Pain Points Nobody Mentions

RBAC in Kubernetes is a fucking nightmare. You'll spend a week figuring out why your service account can't create a pod in one namespace but works fine in another. The official RBAC docs are garbage - just use rbac.dev to generate working YAML instead.
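For the record, here's the shape of a namespaced grant that actually works - every name is a placeholder, and the usual footgun is the Role, RoleBinding, and ServiceAccount quietly living in three different namespaces:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-manager
  namespace: staging          # Roles are namespaced - this grants nothing anywhere else
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ci-can-manage-pods
  namespace: staging          # must match the Role's namespace
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: staging        # the SA's own namespace - easy to get wrong
roleRef:
  kind: Role
  name: pod-manager
  apiGroup: rbac.authorization.k8s.io
```

`kubectl auth can-i create pods --as=system:serviceaccount:staging:ci-deployer -n staging` tells you in one line whether the binding actually took.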

ArgoCD SSO setup looks simple in the docs but took me 6 hours to get working with Azure AD. The callback URLs are finicky as hell and the error messages tell you nothing.

Container scanning with Docker Scout sounds great until it flags every base image as vulnerable. You'll waste time fixing CVEs that don't actually matter for your app.

What Actually Works in Production

Environment parity is impossible. Production always has that one special config that breaks everything when you try to replicate it in staging. Just accept it and move on.

Progressive rollouts with ArgoCD sync waves work great when they work. When they don't, you'll be debugging YAML ordering at 2am wondering why your database migration ran after your app deployed.
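Sync waves are just annotations - lower numbers run first. A rough sketch, with placeholder names and images:

```yaml
# Wave 0: migration job runs first (as a sync hook, cleaned up on success)
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migrate
  annotations:
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook: Sync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: myapp-migrations:latest   # placeholder image
---
# Wave 1: the app only syncs after wave 0 is healthy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  replicas: 2
  selector:
    matchLabels: {app: myapp}
  template:
    metadata:
      labels: {app: myapp}
    spec:
      containers:
        - name: myapp
          image: myapp:latest   # placeholder image
```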

Disaster recovery is a joke until you actually need it. Your Git repos are safe, but good luck restoring that Prometheus data when it corrupts itself after a node crash.

Cost Reality Check

Monthly AWS Reality Check:

  • 3x t3.large nodes (24/7): $220/month
  • EBS gp3 storage (500GB total): $40/month
  • Application Load Balancer: $16/month
  • Prometheus storage (growing daily): $50-120/month
  • Data transfer out: $20-60/month
  • NAT Gateway (2 AZs): $64/month

Total: $410-520/month before your first fucking application

Add development clusters that nobody remembers to shut off: +$180/month. Add monitoring for 5+ services: +$80/month storage. Add backup EBS snapshots: +$30/month. Suddenly you're at $700/month wondering why your "simple" GitOps setup costs more than your old monolith on a single t2.large.

Autoscaling sounds amazing but works terribly in practice. Cluster autoscaler takes 5-10 minutes to spin up nodes, so your users get timeout errors while it's thinking. Over-provisioning is the only real solution.

My AWS bill went from $800/month to $400/month, not because of magical GitOps savings, but because I finally learned to shut off the fucking development clusters at night. That $400 savings? It was entirely from automation that turns off non-production shit when humans go home.
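The "automation" is embarrassingly simple if your dev nodes sit in an Auto Scaling group - two scheduled actions, with placeholder names and sizes:

```bash
# Scale the dev node group to zero every weeknight at 20:00 UTC...
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name dev-cluster-nodes \
  --scheduled-action-name nightly-shutdown \
  --recurrence "0 20 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0

# ...and bring it back before the team logs on
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name dev-cluster-nodes \
  --scheduled-action-name morning-startup \
  --recurrence "0 7 * * 1-5" \
  --min-size 3 --max-size 6 --desired-capacity 3
```

This assumes self-managed node groups; EKS managed node groups and other clouds have their own scheduling knobs, but the principle is identical.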

Alright, so you're still with me after hearing about the costs and complexity. You want to know if this is actually better than your current setup. Here are the questions you'll actually be asking - usually at 3am.

The Questions You'll Actually Ask (Usually at 3am)

Q: Why does ArgoCD randomly stop syncing when I didn't change anything?

A: Because ArgoCD is fucking moody. It'll claim your app is "OutOfSync" even when Git and the cluster match perfectly. Usually it's webhook timeouts, network hiccups, or ArgoCD just being dramatic. Solution: click "Refresh" 3 times, sacrifice a goat, then click "Sync" with "Force" checked.

Q: How much will this cost me on AWS?

A: More than you budgeted. Plan for $500-800/month minimum for a basic production setup. That's 3 t3.large nodes, EBS storage, load balancers, and Prometheus eating your disk space. Want HA? Double it. Forgot to turn off dev clusters over the weekend? Add another $200.

Q: Why does Prometheus keep running out of disk space?

A: Because nobody reads the retention documentation. There's no size cap by default, so high-cardinality metrics will happily fill the disk. Add `--storage.tsdb.retention.time=30d` plus a `--storage.tsdb.retention.size` limit, or watch your AWS bill explode. I learned this paying $500 for a month of useless metrics.

Q: How do I actually deploy secrets without committing them to Git?

A: Sealed Secrets or External Secrets Operator. Both are painful to set up because of the chicken-and-egg problem: you need secrets to deploy the secret manager. Start with Sealed Secrets - fewer moving parts, more predictable failure modes.
Q: Why did my deployment work in staging but fail in production?

A: Because production always has that one special snowflake configuration that staging doesn't. Or you're hitting resource limits. Or there's a network policy blocking traffic. Or the moon is in the wrong phase. Use `kubectl describe pod` and pray the error message makes sense.

Q: Can I just use Docker Swarm instead of Kubernetes?

A: Sure, if you want your career to die with Docker Swarm. It's simpler, easier, and actually works, but nobody's hiring Docker Swarm engineers anymore. Kubernetes won - deal with it.
Q: Why does Docker keep saying `COPY failed: no source files`?

A: Check your `.dockerignore` file for trailing spaces or weird characters. Docker's context handling is garbage and the error messages are useless. Or your build context is wrong. Or there's a symlink somewhere. Docker build failures are 50% configuration and 50% dark magic.

Q: How do I debug networking in Kubernetes?

A: You don't. You cry, then deploy netshoot and run nslookup from inside the cluster until something works. K8s networking is designed by people who hate happiness.
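For the record, the incantation (service names are placeholders; the image is the community nicolaka/netshoot):

```bash
# Throwaway debug pod with every network tool you'll want, deleted on exit
kubectl run tmp-netshoot --rm -it --image=nicolaka/netshoot -- bash

# Inside the pod: test DNS and service reachability from the cluster's point of view
nslookup my-service.my-namespace.svc.cluster.local
curl -v http://my-service.my-namespace:8080/healthz
```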

Q: Should I use Helm or Kustomize?

A: Both suck in different ways. Helm has templates that make YAML more complex. Kustomize has patching that makes YAML more complex. Pick your poison. I use Kustomize because at least the complexity is predictable.

Q: How do I make ArgoCD sync faster than every 3 minutes?

A: Set `--app-resync` to something shorter, but now you're polling Git constantly and ArgoCD might OOM under load. Or set up webhooks, which work great until they don't and you're back to polling anyway.
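On newer ArgoCD versions the polling interval actually lives in the argocd-cm ConfigMap rather than a controller flag - a sketch, assuming the standard install (restart the application controller after changing it):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
  labels:
    app.kubernetes.io/part-of: argocd
data:
  # Default is 180s - shorter means fresher syncs but more Git polling
  timeout.reconciliation: 60s
```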


Related Tools & Recommendations

integration
Similar content

GitOps Integration: Docker, Kubernetes, Argo CD, Prometheus Setup

How to Wire Together the Modern DevOps Stack Without Losing Your Sanity

/integration/docker-kubernetes-argocd-prometheus/gitops-workflow-integration
100%
tool
Similar content

ArgoCD - GitOps for Kubernetes That Actually Works

Continuous deployment tool that watches your Git repos and syncs changes to Kubernetes clusters, complete with a web UI you'll actually want to use

Argo CD
/tool/argocd/overview
83%
integration
Similar content

Pulumi Kubernetes Helm GitOps Workflow: Production Integration Guide

Stop fighting with YAML hell and infrastructure drift - here's how to manage everything through Git without losing your sanity

Pulumi
/integration/pulumi-kubernetes-helm-gitops/complete-workflow-integration
72%
tool
Similar content

Istio Service Mesh: Real-World Complexity, Benefits & Deployment

The most complex way to connect microservices, but it actually works (eventually)

Istio
/tool/istio/overview
62%
tool
Similar content

Debug Kubernetes Issues: The 3AM Production Survival Guide

When your pods are crashing, services aren't accessible, and your pager won't stop buzzing - here's how to actually fix it

Kubernetes
/tool/kubernetes/debugging-kubernetes-issues
57%
tool
Similar content

Linkerd Overview: The Lightweight Kubernetes Service Mesh

Actually works without a PhD in YAML

Linkerd
/tool/linkerd/overview
57%
review
Similar content

Container Runtime Security: Prevent Escapes with Falco

I've watched container escapes take down entire production environments. Here's what actually works.

Falco
/review/container-runtime-security/comprehensive-security-assessment
56%
howto
Similar content

Master Microservices Setup: Docker & Kubernetes Guide 2025

Split Your Monolith Into Services That Will Break in New and Exciting Ways

Docker
/howto/setup-microservices-docker-kubernetes/complete-setup-guide
51%
tool
Similar content

Debugging Istio Production Issues: The 3AM Survival Guide

When traffic disappears and your service mesh is the prime suspect

Istio
/tool/istio/debugging-production-issues
49%
howto
Similar content

Deploy Kubernetes in Production: A Complete Step-by-Step Guide

The step-by-step playbook to deploy Kubernetes in production without losing your weekends to certificate errors and networking hell

Kubernetes
/howto/setup-kubernetes-production-deployment/production-deployment-guide
49%
tool
Similar content

Jsonnet Overview: Stop Copy-Pasting YAML Like an Animal

Because managing 50 microservice configs by hand will make you lose your mind

Jsonnet
/tool/jsonnet/overview
48%
tool
Similar content

containerd - The Container Runtime That Actually Just Works

The boring container runtime that Kubernetes uses instead of Docker (and you probably don't need to care about it)

containerd
/tool/containerd/overview
48%
tool
Similar content

Kubernetes Cluster Autoscaler: Automatic Node Scaling Guide

When it works, it saves your ass. When it doesn't, you're manually adding nodes at 3am. Automatically adds nodes when you're desperate, kills them when they're

Cluster Autoscaler
/tool/cluster-autoscaler/overview
46%
pricing
Similar content

Enterprise Kubernetes Platform Pricing: Red Hat, VMware, SUSE Costs

Every "contact sales" button is financial terrorism. Here's what Red Hat, VMware, and SUSE actually charge when the procurement nightmare ends

Nutanix Kubernetes Platform
/pricing/enterprise-kubernetes-platforms/enterprise-k8s-platforms
45%
tool
Similar content

Helm: Simplify Kubernetes Deployments & Avoid YAML Chaos

Package manager for Kubernetes that saves you from copy-pasting deployment configs like a savage. Helm charts beat maintaining separate YAML files for every dam

Helm
/tool/helm/overview
45%
tool
Similar content

KEDA - Kubernetes Event-driven Autoscaling: Overview & Deployment Guide

Explore KEDA (Kubernetes Event-driven Autoscaler), a CNCF project. Understand its purpose, why it's essential, and get practical insights into deploying KEDA ef

KEDA
/tool/keda/overview
45%
tool
Similar content

Change Data Capture (CDC) Integration Patterns for Production

Set up CDC at three companies. Got paged at 2am during Black Friday when our setup died. Here's what keeps working.

Change Data Capture (CDC)
/tool/change-data-capture/integration-deployment-patterns
43%
troubleshoot
Similar content

Kubernetes CrashLoopBackOff: Debug & Fix Pod Restart Issues

Your pod is fucked and everyone knows it - time to fix this shit

Kubernetes
/troubleshoot/kubernetes-pod-crashloopbackoff/crashloopbackoff-debugging
43%
tool
Similar content

TypeScript Compiler Performance: Fix Slow Builds & Optimize Speed

Practical performance fixes that actually work in production, not marketing bullshit

TypeScript Compiler
/tool/typescript/performance-optimization-guide
40%
alternatives
Similar content

Escape Kubernetes Complexity: Simpler Container Orchestration

For teams tired of spending their weekends debugging YAML bullshit instead of shipping actual features

Kubernetes
/alternatives/kubernetes/escape-kubernetes-complexity
40%

Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization