I've been running ArgoCD in production for over 3 years across 50+ clusters. Here's the shit that actually goes wrong when you scale beyond the demo.
The Memory Leak That Cost Us a Weekend
Our ArgoCD repo server started consuming 12GB of memory during peak deployment hours. The Kubernetes cluster kept killing it, causing sync failures across all applications. The root cause? We had a monorepo with 300+ microservices, each with complex Helm charts that included 50+ templates.
The repo server was caching rendered manifests for every application, every environment, every branch we'd ever deployed. That's thousands of YAML files being held in memory simultaneously. When the cache hit 12GB, the kernel OOMKilled the pod.
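If you suspect the same failure mode, it's easy to confirm before digging further (assuming metrics-server is installed for kubectl top):
## Watch repo-server memory climb (needs metrics-server)
kubectl -n argocd top pod -l app.kubernetes.io/name=argocd-repo-server
## Check whether the last restart was an OOMKill
kubectl -n argocd get pod -l app.kubernetes.io/name=argocd-repo-server \
  -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'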
The fix that actually worked:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  reposerver.max.combined.directory.manifests.size: "200Mi"
  reposerver.repo.cache.expiration: "5m"
  reposerver.parallelism.limit: "2"  # Don't render everything at once
We also split the monorepo into smaller repositories organized by team ownership. Instead of one repo with 300 services, we have 15 repos with ~20 services each. ArgoCD performance improved dramatically.
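If you go the multi-repo route, an ApplicationSet per team repo keeps app registration from becoming toil. A minimal sketch - the repo URL and the services/* layout here are illustrative, not our actual structure:
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: payments-team
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/company/payments-configs  # hypothetical team repo
        revision: HEAD
        directories:
          - path: services/*  # one directory per service
  template:
    metadata:
      name: '{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/company/payments-configs
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'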
The Great Prune Disaster of 2024
Someone enabled auto-sync with prune on our production ArgoCD applications without understanding what prune does. ArgoCD started deleting every resource it considered extraneous - live objects in the apps' namespaces that weren't defined in our Git repositories. This included:
- External secrets created by the External Secrets Operator
- Service mesh sidecars injected by Istio
- Monitoring agents deployed by Datadog
- Certificate resources managed by cert-manager
- Persistent volumes that contained customer data
The incident lasted 4 hours and required manual restoration from backups. We lost customer data that couldn't be recovered from Git.
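The preview that would have shown the blast radius already exists in the CLI - worth wiring into a runbook before prune is ever enabled (my-app is a placeholder):
## Show what a sync with prune would do, without touching the cluster
argocd app sync my-app --prune --dry-run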
The prevention strategy we implemented:
## Default sync policy for all apps
spec:
  syncPolicy:
    automated:
      prune: false    # NEVER enable this globally
      selfHeal: true  # Auto-fix drift, but don't delete
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true  # If prune is enabled, do it last
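For resources that must survive no matter what anyone flips on, there's a per-resource escape hatch - both options below are standard ArgoCD sync options (the PVC name is a placeholder):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: customer-data
  annotations:
    # Never prune this resource, even if it disappears from Git;
    # Delete=false also keeps it when the Application itself is deleted
    argocd.argoproj.io/sync-options: Prune=false,Delete=false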
We also tightened our ArgoCD RBAC policy so that only admins can edit application specs - which is where prune gets enabled:
p, ops-team, applications, *, default/*, allow
p, ops-team, applications, update, default/*, deny
p, admin-only, applications, *, default/*, allow
The GitHub API Rate Limit Hell
With 200+ applications polling GitHub every 3 minutes, we hit GitHub's API rate limits constantly. ArgoCD would stop syncing randomly, deployments would fail, and the logs were full of "API rate limit exceeded" errors.
The math is brutal: 200 apps × 20 polls/hour × 24 hours = 96,000 API requests per day - roughly 4,000 per hour. GitHub allows 5,000 requests per hour per authenticated user (and only 60 per hour unauthenticated), so polling alone ate nearly the entire budget before CI or a human made a single call.
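You can watch the budget drain in real time - GitHub's rate_limit endpoint doesn't count against the quota:
## Check remaining quota for the token ArgoCD is using
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit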
Our multi-pronged solution:
- Webhook-based sync for critical applications:
apiVersion: v1
kind: Service
metadata:
  name: argocd-webhook
spec:
  ports:
    - port: 80
      targetPort: 8080
  selector:
    app.kubernetes.io/name: argocd-server
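The Service alone does nothing - GitHub still needs a webhook pointed at /api/webhook on that endpoint, and ArgoCD verifies the shared secret it reads from argocd-secret (key name per the ArgoCD webhook docs):
apiVersion: v1
kind: Secret
metadata:
  name: argocd-secret
  namespace: argocd
stringData:
  # Must match the secret configured on the GitHub webhook
  webhook.github.secret: <shared-webhook-secret>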
- Increased polling intervals for non-critical apps. There's no per-app polling knob (the argocd.argoproj.io/refresh annotation triggers a one-off refresh, not a schedule), so we raised the global interval and let webhooks keep the critical apps fast:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "900s"  # Poll every 15 minutes instead of the 3-minute default
- Multiple GitHub tokens spread across repositories. One caveat that bit us: REST rate limits are per account, so the tokens must belong to different machine users (or you move to a GitHub App) - five PATs from one user still share a single 5,000/hour budget:
apiVersion: v1
kind: Secret
metadata:
  name: github-repo-1
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository  # registers the repo declaratively
data:
  url: <repo-1-url-base64>
  username: <machine-user-1-base64>
  password: <token-1-base64>
---
apiVersion: v1
kind: Secret
metadata:
  name: github-repo-2
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: repository
data:
  url: <repo-2-url-base64>
  username: <machine-user-2-base64>
  password: <token-2-base64>
We went from 400+ rate limit errors per day to zero.
The Sync Wave Nightmare
We had a complex application with database migrations that needed to run before the API deployment. Simple enough - use sync waves to ensure the migration Job runs first:
## Migration job - wave 0 (runs first)
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook: PreSync
---
## API deployment - wave 1 (runs after)
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "1"
This worked in testing but failed in production because:
- The migration job took 15 minutes to complete
- Our ArgoCD sync timeout was 5 minutes
- The job appeared to hang, so ArgoCD marked the sync as failed
- But the migration was still running in the background
- The next sync attempt created a second migration job
- Now we have two migrations running simultaneously against the same database
The fix required multiple changes:
## Give syncs more headroom globally
## (timeout.hard.reconciliation lives in argocd-cm, not argocd-cmd-params-cm)
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
data:
  timeout.hard.reconciliation: "20m"  # Up from our 5m setting
---
## Job with proper cleanup and timeout
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "0"
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  activeDeadlineSeconds: 1200  # 20 minutes max
  backoffLimit: 0              # Don't retry failed migrations
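Timeouts cap the damage but don't remove the race entirely. If two migration jobs ever do overlap, let the database arbitrate - a sketch of a guard for the job's entrypoint, assuming Postgres, a DATABASE_URL env var, and an arbitrary lock key:
#!/bin/sh
set -e
## Session-level advisory lock serializes concurrent migration jobs;
## it's released automatically when the psql session exits.
psql "$DATABASE_URL" <<'SQL'
\set ON_ERROR_STOP on
SELECT pg_advisory_lock(421);  -- a second job blocks here until the first finishes
-- hypothetical migration file path
\i /migrations/0001_add_orders_index.sql
SQL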
The Multi-Cluster Authentication Nightmare
We had ArgoCD managing 15 different Kubernetes clusters across AWS, GCP, and on-premises. Authentication worked fine initially, but after 6 months, random clusters would start showing "Connection Refused" and stop syncing.
The problem: cluster credentials rot. EKS and GKE rotate certificates over time, so the certificates ArgoCD cached became invalid. Service account tokens expired. Network policies changed.
Our detection and remediation strategy:
- Health monitoring for all registered clusters:
#!/bin/bash
## Check cluster connectivity every 5 minutes
for cluster in $(argocd cluster list -o server); do
  if ! argocd cluster get "$cluster" &>/dev/null; then
    echo "ALERT: Cluster $cluster is unreachable"
    # Auto-remediation: re-register the cluster
    # (assumes one kubeconfig per cluster, with a context named after it)
    argocd cluster add "$cluster" --kubeconfig "/etc/kubeconfig/$cluster" --yes
  fi
done
- Automated token refresh for service accounts:
## CronJob to refresh cluster service account tokens
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-token-refresh
spec:
  schedule: "0 2 * * 0"  # Weekly, Sunday at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          serviceAccountName: cluster-admin
          containers:
            - name: token-refresh
              image: argoproj/argocd:v2.8.4
              # Our own script, baked into a derived image - not part of the stock ArgoCD image
              command: ["/usr/local/bin/refresh-cluster-tokens.sh"]
- Circuit breaker pattern to isolate failing clusters:
## Dedicated AppProject fencing off the critical production clusters
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production-only
spec:
  destinations:
    - namespace: '*'
      server: https://prod-cluster-1.example.com
    - namespace: '*'
      server: https://prod-cluster-2.example.com
  # Isolate from dev/staging cluster failures
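Applications then opt into the fenced project explicitly (names below are illustrative):
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd
spec:
  project: production-only  # can only target the two prod clusters above
  source:
    repoURL: https://github.com/company/payments-configs
    path: production/payments-api
    targetRevision: HEAD
  destination:
    server: https://prod-cluster-1.example.com
    namespace: payments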
The Resource Hooks That Never Finished
We used resource hooks for database schema migrations and cache warming. The hooks would start successfully but never complete, leaving applications stuck in "Progressing" status forever.
The issue: failed hooks don't fail loudly. They just sit there consuming resources while ArgoCD waits indefinitely. After investigating dozens of stuck deployments, we found:
- Database migration scripts that hung on locks
- Network timeouts that weren't handled
- Resource quotas preventing Job pod creation
- RBAC issues preventing hook execution
- Insufficient memory limits causing OOMKilled pods
Our hook reliability improvements:
apiVersion: batch/v1
kind: Job
metadata:
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded,BeforeHookCreation
    argocd.argoproj.io/sync-wave: "-1"
spec:
  activeDeadlineSeconds: 300  # 5 minute timeout
  backoffLimit: 2             # Retry twice max
  completions: 1              # Single run
  parallelism: 1              # No concurrent execution
  template:
    metadata:
      annotations:
        # Prevent Istio sidecar injection that can cause hooks to hang
        sidecar.istio.io/inject: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: migration
          image: migrate/migrate:4
          resources:
            limits:
              memory: 512Mi  # Prevent OOMKill
              cpu: 500m
            requests:
              memory: 256Mi
              cpu: 100m
          # Always include health checks
          livenessProbe:
            exec:
              # pgrep, unlike `ps aux | grep`, doesn't match its own process
              command: ["/bin/sh", "-c", "pgrep migrate"]
            initialDelaySeconds: 30
            periodSeconds: 60
We also added comprehensive logging to track hook execution:
## Monitor hook status across all applications
kubectl get jobs -A -l app.kubernetes.io/managed-by=argocd --watch
## Debug specific hook failures
kubectl logs job/pre-sync-migration-123 -n production --previous
The key insight: hooks need the same reliability patterns as regular applications - timeouts, resource limits, health checks, and proper error handling.
Performance Tuning That Actually Matters
After scaling to 500+ applications across multiple clusters, ArgoCD's default settings become a bottleneck. Here's what we tuned and why:
Repository Server Scaling:
## More repo server replicas for parallel Git operations
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
spec:
  replicas: 5  # Up from default 1
  template:
    spec:
      containers:
        - name: argocd-repo-server
          resources:
            limits:
              memory: 2Gi  # Up from 256Mi default
              cpu: 1000m   # Up from 250m default
            requests:
              memory: 1Gi
              cpu: 500m
          env:
            # Increase Git operation timeout (takes a duration, not a bare number)
            - name: ARGOCD_EXEC_TIMEOUT
              value: "300s"  # 5 minutes for large repos
Application Controller Tuning:
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
data:
  # Increase application processing parallelism
  controller.status.processors: "50"     # Up from the default 20
  controller.operation.processors: "25"  # Up from the default 10
  # Limit concurrent manifest generation per repo server
  reposerver.parallelism.limit: "4"
  # Note: the polling interval (timeout.reconciliation) lives in argocd-cm, not here
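None of the argocd-cmd-params-cm keys take effect until the components restart - easy to forget after an edit:
## Command params are read at startup, not watched
kubectl -n argocd rollout restart deployment argocd-repo-server argocd-server
kubectl -n argocd rollout restart statefulset argocd-application-controller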
Redis Configuration for Scale:
## Redis with persistence and memory optimization
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-redis
spec:
  template:
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          resources:
            limits:
              memory: 4Gi  # Large cache for 500+ apps
            requests:
              memory: 2Gi
          # Redis config optimized for the ArgoCD workload
          args:
            - redis-server
            - --maxmemory
            - 3gb
            - --maxmemory-policy
            - allkeys-lru  # Evict least recently used keys
            - --save
            - "900"  # Persist to disk every 15 minutes...
            - "1"    # ...if at least one key changed
These changes reduced average sync times from 45 seconds to 12 seconds across our application portfolio.
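Measure before and after - the application controller exports reconcile timings that make the improvement visible. A sample PromQL query, assuming Prometheus already scrapes the controller's metrics endpoint:
## 95th percentile reconciliation time across all apps
histogram_quantile(0.95, sum(rate(argocd_app_reconcile_bucket[5m])) by (le))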
The Lessons From Production
After years of ArgoCD incidents, here's what actually prevents problems:
- Start with conservative settings - Auto-sync disabled, prune disabled, manual approval required
- Monitor resource consumption - Memory, CPU, and API rate limits are the first things to break
- Test everything in staging first - But assume production will break differently anyway
- Have rollback plans - Git reverts are your friend, but practice them before you need them
- Don't trust health checks - ArgoCD "Healthy" doesn't mean your app works
- Separate critical from non-critical apps - Use different ArgoCD instances or strict RBAC
- Automate common fixes - Script the solutions to problems you see repeatedly
The reality: ArgoCD is incredibly powerful when it works, but debugging it at scale requires deep Kubernetes knowledge and patience. Plan for things to break in ways you didn't expect.