GitOps: Because Manual Deployments Are for Masochists

GitOps means Git controls your deployments - no more logging into servers to run random kubectl commands at 2am when shit breaks. The core stack is Docker + Kubernetes + ArgoCD + Prometheus. When it works, it's actually pretty sweet. When it doesn't, you'll burn 6 hours debugging why ArgoCD is stuck syncing.

The Stack That'll Make You Question Your Life Choices

Look, here's the deal with GitOps: Git is your source of truth, which sounds great until ArgoCD decides it wants to take a coffee break and stops syncing for no fucking reason. I've spent more 3am nights debugging "why isn't this deploying" than I care to admit.

Docker: Containers are supposed to solve "works on my machine" but they just move the problem to "works in my container but not in prod." You'll spend hours debugging why your Alpine Linux container breaks when you need glibc libraries, or why your multi-stage builds work fine locally but fail in CI/CD pipelines.

Kubernetes: K8s is like that friend who's really smart but explains things in the most complicated way possible. Sure, it orchestrates everything beautifully, but try debugging why your pods are stuck in Pending state at 2am. The official troubleshooting guide won't help when you're dealing with resource quotas that somebody forgot to configure properly.
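When pods are stuck in Pending, the events almost always name the real blocker (unschedulable, quota exceeded, no nodes matching selectors). For what it's worth, the first commands to run — pod and namespace names are placeholders:

```shell
# Events on the pod itself usually say why the scheduler gave up
kubectl describe pod <pod-name> -n <namespace> | tail -n 20

# Recent events in the namespace, newest last
kubectl get events -n <namespace> --sort-by=.lastTimestamp | tail

# The resource quota somebody forgot to configure properly
kubectl describe resourcequota -n <namespace>
```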

ArgoCD: The GitOps controller that's supposed to watch your Git repos and deploy changes automatically. Works great until it doesn't sync, shows "OutOfSync" for no reason, or gets stuck on that one namespace deletion that's been running for 3 hours. The ArgoCD troubleshooting docs are helpful until you hit edge cases that require diving into application resource management or understanding sync phases.

Prometheus: The monitoring stack that'll consume more RAM than your actual applications. Great for metrics until you realize you're storing high-cardinality data and your storage costs just doubled.

When It Actually Works (Sometimes)

GitOps automates deployments so you don't have to SSH into production servers and manually run kubectl commands like some kind of caveman. Until ArgoCD breaks, then you're back to manual debugging anyway.

Drift Detection: ArgoCD is supposed to keep your cluster in sync with Git. In theory, this prevents the clusterfuck of "who changed what in production." In practice, ArgoCD sometimes thinks your ServiceMonitor is out of sync even when it's not. Understanding drift detection mechanisms and sync policies becomes essential when dealing with server-side apply conflicts.
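When ArgoCD keeps flagging a ServiceMonitor that hasn't meaningfully changed, the usual escape hatch is enabling server-side apply and telling ArgoCD to ignore the fields controllers mutate after deployment. A minimal sketch, with placeholder names:

```yaml
# Hypothetical Application excerpt: ServerSideApply reduces diff noise,
# ignoreDifferences excludes fields that operators rewrite post-deploy.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  syncPolicy:
    syncOptions:
      - ServerSideApply=true
  ignoreDifferences:
    - group: monitoring.coreos.com
      kind: ServiceMonitor
      jsonPointers:
        - /metadata/annotations
```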

Monitoring Integration: Prometheus scrapes metrics from everything, including ArgoCD itself. Cool until you realize your monitoring stack is using more resources than the apps you're monitoring.

Multi-Cluster Pain: Sure, you can manage multiple clusters with one ArgoCD instance. Just be prepared for network timeouts, authentication issues, and that one cluster that randomly loses connection during your demo. The cluster management docs won't prepare you for debugging RBAC permissions across environments.

Real-World Implementation (AKA Where Dreams Die)

Most teams start with the app-of-apps pattern because it looks clean in diagrams. Then you realize managing 50+ applications through a single ArgoCD UI is like trying to herd cats through molasses.

Secret Management: Never put secrets in Git. Use External Secrets Operator to pull from Vault or AWS Secrets Manager. This works great until your vault is down and nothing can start. Pro tip: your monitoring won't help because the monitoring needs secrets too. Learn about Kubernetes secrets and secret management best practices before you fuck up production.

Repository Structure: Separate your app configs from ArgoCD configs. Sounds obvious until you're 6 months in and your monorepo has become an unmaintainable mess of YAML files that nobody wants to touch.

Advanced Deployments: Argo Rollouts gives you canary deployments and blue-green releases. It's actually pretty sweet when it works. Just don't expect the rollback to work perfectly when your canary deployment takes down production.
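For the record, a canary with Argo Rollouts looks roughly like this — app name and image are placeholders:

```yaml
# Sketch of a Rollout canary: shift 20% of traffic, pause for manual
# promotion, then ramp to 60% with a timed pause before full rollout.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 20
        - pause: {}              # waits for manual promotion
        - setWeight: 60
        - pause: {duration: 10m}
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:v2       # placeholder image
```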

Bottom line: GitOps is better than manually deploying shit, but it's not magic. You'll still spend weekends debugging why your app won't start, except now you get to debug Kubernetes, ArgoCD, AND your application.

The real question isn't whether you should adopt GitOps - it's how to implement it without losing your sanity. Before you jump in, you need to understand the different approaches and what actually works in production environments where uptime matters and stakeholders are watching.

GitOps Stack Implementation Approaches

| Implementation Approach | GitOps Playground | Helm-Based Setup | Custom Manifests | Enterprise Platform |
|---|---|---|---|---|
| Setup Complexity | Single command deployment | Moderate Helm chart management | High (manual YAML creation) | Low (managed service) |
| Initial Setup Time | 15-30 minutes | 2-4 hours | 8-16 hours | 1-2 hours configuration |
| Production Readiness | Development/learning focused | Production-ready with customization | Fully customizable for production | Enterprise-grade out of box |
| Customization Level | Limited to provided options | High via Helm values | Complete control | Platform-specific options |
| Multi-Cluster Support | Single cluster focus | Manual multi-cluster setup | Custom multi-cluster implementation | Built-in multi-cluster |
| Component Versions | Pre-selected stable versions | Latest stable versions | Any version you choose | Vendor-managed versions |
| Repository Structure | Predefined GitOps layout | Flexible Helm structure | Completely custom | Platform conventions |
| Secret Management | Basic External Secrets | External Secrets Operator integration | Custom secret solutions | Enterprise secret management |
| Monitoring Stack | kube-prometheus-stack included | kube-prometheus-stack v77.5.0 | Custom Prometheus setup | Vendor monitoring integration |
| ArgoCD Configuration | Basic ArgoCD setup | ArgoCD v3.1.4 with custom config | Fully customized ArgoCD | Managed ArgoCD service |
| Learning Curve | Beginner-friendly | Intermediate Kubernetes knowledge | Advanced Kubernetes expertise | Platform-specific training |
| Operational Overhead | Minimal (automated setup) | Moderate (Helm maintenance) | High (manual maintenance) | Low (vendor managed) |
| Update Management | Playground script updates | Helm chart version management | Manual component updates | Automated vendor updates |
| Cost Structure | Free (infrastructure only) | Free tools + infrastructure | Free tools + infrastructure | Enterprise licensing + infrastructure |
| Support Options | Community documentation | Community + vendor docs | Community support only | Enterprise support included |
| Best For | Learning and prototyping | Small to medium production | Large enterprise with specific needs | Enterprise with budget |

Production Reality: Where Tutorials Go to Die

Every GitOps tutorial makes this shit look easy. "Just deploy kube-prometheus-stack and you're done!" Sure. Here's what actually happens when you try to run this in production.

The Shit That Actually Breaks

The "Too Long" Annotation Error: kube-prometheus-stack will fail with a "metadata too long" error that tells you absolutely nothing useful. Took me 4 hours to figure out it was the CRD size limit. ArgoCD stores the entire manifest in annotations, and Prometheus CRDs are fucking huge.

Fix: Deploy CRDs separately with Replace=true, then use skipCrds: true for the main chart. This is not documented anywhere obvious.
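A sketch of the split, assuming the prometheus-community charts — verify the CRD chart name and pin real versions for your setup (77.5.0 matches the chart version mentioned elsewhere in this guide):

```yaml
# App 1: CRDs only, synced with Replace so ArgoCD doesn't blow the
# 262144-byte annotation limit doing a client-side apply.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus-operator-crds
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus-operator-crds
    targetRevision: x.y.z        # pin to a released chart version
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    syncOptions:
      - Replace=true             # kubectl replace instead of apply
---
# App 2: the main chart, with CRD installation skipped entirely.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kube-prometheus-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: kube-prometheus-stack
    targetRevision: 77.5.0
    helm:
      skipCrds: true             # CRDs handled by the app above
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```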

Dependency Hell: ArgoCD doesn't care about deployment order by default. Your app will try to start before its ConfigMap exists, then crash in a loop while you wonder what's wrong.

Use sync waves: infrastructure gets -1, core services get 0, apps get 1+. Obvious in hindsight, not so much when you're debugging at midnight.
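Sync waves are just annotations; lower numbers sync first. A minimal sketch with placeholder names:

```yaml
# ConfigMap lands in wave -1, the Deployment that reads it in wave 1,
# so the config exists before the app can crash-loop without it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  annotations:
    argocd.argoproj.io/sync-wave: "-1"
data:
  LOG_LEVEL: info
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
  annotations:
    argocd.argoproj.io/sync-wave: "1"
spec:
  replicas: 1
  selector:
    matchLabels: {app: app}
  template:
    metadata:
      labels: {app: app}
    spec:
      containers:
        - name: app
          image: app:latest      # placeholder image
          envFrom:
            - configMapRef: {name: app-config}
```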

Secrets Are Still A Pain: Never put secrets in Git. Use External Secrets Operator to pull from Vault. This works great until your vault is unreachable and nothing can start because everything needs a secret to initialize.
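A hypothetical ExternalSecret wired to Vault through a ClusterSecretStore — the store name and Vault path are assumptions, not defaults:

```yaml
# ESO pulls the value from Vault and materializes a regular Kubernetes
# Secret in-cluster; nothing sensitive ever touches Git.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend          # assumed store name
  target:
    name: db-credentials         # Secret created in-cluster
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/db      # assumed Vault path
        property: password
```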

Scale Problems You Didn't Expect

Single ArgoCD Gets Slow As Hell: One ArgoCD works fine until you hit 50+ apps, then the UI becomes painfully slow and sync operations start timing out. You'll need to shard or deploy separate ArgoCD instances per environment.

ApplicationSets help template apps across clusters, but good luck debugging when one of your 20 templated applications is broken.
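For reference, an ApplicationSet with the cluster generator stamps out one Application per registered cluster — repo URL and path here are placeholders:

```yaml
# One templated Application per cluster ArgoCD knows about. Debugging
# tip: a broken template breaks every generated app at once.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: guestbook
  namespace: argocd
spec:
  generators:
    - clusters: {}               # all clusters registered with ArgoCD
  template:
    metadata:
      name: 'guestbook-{{name}}'
    spec:
      project: default
      source:
        repoURL: https://example.com/org/apps.git
        targetRevision: HEAD
        path: guestbook
      destination:
        server: '{{server}}'
        namespace: guestbook
```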

Prometheus Will Eat All Your RAM: The monitoring stack uses more resources than the actual apps you're monitoring. Prometheus memory usage scales with cardinality, so avoid labels like user_id or request_id unless you want your monitoring to OOM.

I've seen Prometheus consume 16GB of RAM just to monitor a cluster of 10 applications at default scrape intervals with 30-day retention. Set retention policies and be ruthless about which metrics you actually keep.
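A hedged starting point via kube-prometheus-stack Helm values — the numbers are illustrative, not a recommendation:

```yaml
# values.yaml excerpt: cap retention and memory before Prometheus
# eats the node. Tune against your actual cardinality.
prometheus:
  prometheusSpec:
    retention: 7d                # shorter retention = less disk and RAM
    retentionSize: 50GB          # hard cap on TSDB size
    scrapeInterval: 60s          # default 30s; halves sample volume
    resources:
      requests:
        memory: 4Gi
      limits:
        memory: 8Gi              # OOM here beats OOMing the whole node
```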

Repository Structure Hell: Start with separate repos per environment or you'll hate life. Monorepos become unmaintainable messes of YAML that nobody wants to touch. Use Kustomize for environment-specific configs, Helm for templates.
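If you go the Kustomize route, a hypothetical prod overlay pulling a shared base looks like this — directory names and image tag are made up:

```yaml
# overlays/prod/kustomization.yaml: reuse base manifests, patch in
# prod-only settings instead of copy-pasting YAML per environment.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base                   # shared Deployment/Service manifests
patches:
  - path: replica-patch.yaml     # prod replica count, resources, etc.
images:
  - name: my-app
    newTag: v1.4.2               # placeholder tag
```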

The Production Pain Points

Meta-Monitoring Problems: You need to monitor your monitoring, but what monitors the thing monitoring your monitoring? I've seen entire teams spend a day debugging why alerts weren't firing, only to discover Prometheus was down and Alertmanager couldn't reach anything.

Run separate monitoring for your GitOps infrastructure. Use external services for critical "is my cluster dead" alerts.

Disaster Recovery Is An Afterthought: Your Git repos are backed up, right? What about ArgoCD's configuration? Or the etcd cluster state?

Document your recovery procedures and test them. The 3am outage is not the time to learn that your backups don't actually work.

Security Theatre vs Reality: Default ArgoCD runs with cluster-admin privileges. Cool. Implement RBAC, use OPA for policies, enable audit logging. Your security team will thank you.
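One concrete starting point is ArgoCD's own RBAC ConfigMap: default everyone to read-only and grant sync to a named group. The group name below is an assumption:

```yaml
# argocd-rbac-cm: policy.default applies to any user not matched by
# policy.csv; the csv grants sync on all apps to one group only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.default: role:readonly
  policy.csv: |
    p, role:deployer, applications, sync, */*, allow
    g, platform-team, role:deployer
```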

The dirty secret: most "production-ready" GitOps setups are held together with duct tape and prayers. Plan for failure, because it's not if, it's when.

Speaking of failure - here are the questions you'll be frantically Googling at 3am when everything's broken, your pager is going off, and you need answers that actually work instead of another "have you tried turning it off and on again" response.

FAQ: The Shit Nobody Tells You About GitOps

Q

Why does my kube-prometheus-stack keep failing with some cryptic "too long" error?

A

Because ArgoCD stores your entire manifest in annotations and Prometheus CRDs are massive. Kubernetes caps total annotation size at 256KiB (262144 bytes). You'll get this exact useless error: metadata.annotations: Too long: must have at most 262144 bytes, and waste hours figuring out what the fuck that means.

Fix: Split CRD deployment from the main chart. Deploy CRDs with Replace=true, then deploy the main chart with skipCrds: true. This should be the default but isn't.

Q

Why does my app keep crashing with "ConfigMap not found" even though I deployed it?

A

ArgoCD deploys things in random order by default. Your app starts before its ConfigMap exists.

Use sync waves: argocd.argoproj.io/sync-wave: "-1" for infrastructure, "0" for services, "1" for apps. Should be obvious but apparently isn't.

Q

How do I handle secrets without putting them in Git?

A

Don't be an idiot and put secrets in Git. Use External Secrets Operator for Vault/AWS/Azure integration, or Sealed Secrets if you're lazy.

Both work until your secret provider is down and nothing can start. Always fun at 3am.

Q

Why does ArgoCD think my perfectly fine deployment is "OutOfSync"?

A

ArgoCD gets confused by status fields that controllers add after deployment. It's especially bad with ServiceMonitors and CRDs.

Enable Server-Side Apply with ServerSideApply=true. Should fix most false positives.

Q

ArgoCD is slow as shit with lots of apps. How do I fix it?

A

Single ArgoCD instances choke around 50+ applications. The UI becomes unusable and sync operations time out.

Shard ArgoCD with multiple replicas or deploy separate instances per environment. ApplicationSets help template across clusters.

Q

How much resources does this monitoring nightmare actually need?

A

More than you think:

  • ArgoCD: 2-4 cores, 4-8GB RAM (more with lots of apps)
  • Prometheus: 4-8 cores, 8-16GB RAM (scales with metric cardinality)
  • Grafana: 1-2 cores, 2-4GB RAM
  • Everything else: 2-4 cores, 4-8GB RAM

Expect 15+ cores and 30+ GB RAM just for monitoring on a production cluster with 50+ services and 30-day metric retention. High-cardinality metrics will double this.

Q

ArgoCD is stuck syncing forever. What now?

A

Usual suspects:

  1. Competing operators fighting over resources
  2. Admission webhooks timing out (looking at you, OPA)
  3. RBAC problems - service account can't do shit
  4. Jobs stuck in Running - delete them manually

Try argocd app sync --force, but figure out why it happened or it'll repeat.

Q

Helm or raw YAML manifests?

A

Helm for standard stuff like kube-prometheus-stack. ArgoCD's Helm support works fine.

Raw YAML when you need complete control or Helm charts are broken (which happens).

Reality: Mix of Helm for common components, raw YAML for custom shit, Kustomize for environment differences.

Q

How do I backup this clusterfuck?

A

Your disaster recovery plan better be solid:

  1. Git repos: Multiple remotes, mirror everything
  2. ArgoCD config: Backup the namespace, CRDs, secrets
  3. etcd: Automated backups of cluster state
  4. Prometheus data: Remote write to external storage

Test your recovery procedures. The outage is not the time to learn they don't work.
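A back-of-the-napkin runbook, assuming the stock argocd and etcdctl CLIs — endpoints and certificate paths are placeholders, verify against your own cluster before trusting a 3am restore to this:

```shell
# 1. ArgoCD state: export Applications, Projects, and settings
argocd admin export -n argocd > argocd-backup.yaml

# 2. etcd snapshot (run wherever etcdctl can reach the endpoint)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 3. Restore drill: re-import into a scratch cluster, diff against Git
argocd admin import -n argocd - < argocd-backup.yaml
```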

Q

What's the difference between push-based and pull-based GitOps?

A

Pull-based (ArgoCD): Agents in clusters pull changes from Git repositories. More secure because no external access to clusters is required, but needs an agent in each cluster.

Push-based (traditional CI/CD): External systems push changes to clusters. Simpler for single clusters, but requires secure access to production environments and doesn't give you drift detection.

GitOps traditionally refers to pull-based approaches, which offer a better security posture and drift detection.

Q

How do I handle GitOps with multiple environments and promotion workflows?

A

Implement environment progression through:

  1. Branch-based: Separate branches per environment with promotion PRs
  2. Repository-based: Separate repos per environment with automated promotion
  3. Overlay-based: Kustomize overlays with shared base configurations

Each approach has trade-offs. Most organizations start with branch-based and migrate to repository-based as complexity increases.

Q

Why is Prometheus eating all my RAM?

A

Cardinality is a bitch. Every unique label combination = more memory.

Avoid labels like user_id, request_id, session_id. Set retention policies, reduce scrape frequency, use recording rules.

Or just throw more RAM at it like everyone else.
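If the Prometheus Operator manages your scrape configs, relabeling in the ServiceMonitor is the usual place to drop cardinality before it's ingested. Label and metric names below are examples, not anything your app actually exposes:

```yaml
# metricRelabelings run at ingestion: kill per-request labels and
# whole debug metric families before they hit the TSDB.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      metricRelabelings:
        - action: labeldrop
          regex: "request_id|session_id"   # per-request labels
        - action: drop
          sourceLabels: [__name__]
          regex: "debug_.*"                # drop entire metric families
```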

Q

How do I monitor my monitoring?

A

Meta-monitoring is required but painful:

  • Expose ArgoCD metrics via ServiceMonitor
  • Run separate monitoring for GitOps health
  • Define SLIs/SLOs for sync success rates
  • External alerting for "is my cluster dead" scenarios

Because nothing's worse than discovering your monitoring was down during an outage.
