The GitOps Setup That Finally Stopped Destroying Our Weekends

GitOps CI/CD Architecture

After Jenkins pipelines died randomly and GitLab CI murdered our AWS bill, we finally got a GitOps setup that works. Took six months, three postmortems, and one very awkward all-hands about why checkout was fucked for two hours on Black Friday.

GitOps Finally Solved Our "Who the Hell Deployed This?" Problem

Here's the thing about traditional push-based CI/CD: when shit hits the fan at 3 AM, you're frantically digging through Jenkins logs trying to figure out what got deployed when. With GitOps, every deployment is a Git commit. Need to rollback? git revert. Need to audit who deployed what? git log. When ArgoCD pulls from Git instead of Jenkins pushing random builds, you actually know what's running in production.

The first time our on-call rotation went from weekly pages to maybe one alert in a month, we knew we were onto something. Turns out most outages happen because someone deployed something and forgot to tell anyone.

The Two-Part System That Stopped Our Weekend Debugging Sessions

ArgoCD Architecture

Here's how it works when it's not breaking:

GitHub Actions handles the build stuff (this part usually works):

  • Developer pushes code, Actions runs tests and security scans
  • If tests pass, builds Docker image and shoves it into the registry
  • Updates the deployment manifest with the new image SHA (not tag - learned this the hard way)

ArgoCD handles the deployment nightmare (this is where things get interesting):

  • ArgoCD obsessively watches your Git repo for changes every 3 minutes
  • Spots the new manifest, syncs it to your cluster
  • Rolling deployment happens with health checks (when they're configured right)
  • If your app dies during deploy, ArgoCD automatically rolls back

I learned about the separation after our monolithic Jenkins setup ate itself during a deploy and took both CI and CD down. Two systems means when one shits the bed, the other keeps working. Google's SRE book calls this "blast radius reduction" - we called it "thank god we can still deploy hotfixes."

What You Need Before Starting (Don't Skip This Shit)

Seriously, don't start unless you have:

  • Kubernetes cluster (k3d works for learning, quick sketch below, but you'll need EKS/GKE for anything real)
  • GitHub repo with admin rights (you'll need to add secrets and webhooks)
  • Container registry (Docker Hub rate limits will bite you, use ECR/GCR)
  • Domain + SSL cert (Let's Encrypt is fine, just don't use self-signed certs)
  • You've used kubectl before and know pods from deployments

This took me an entire weekend, and I thought I knew what I was doing. If you're new to this shit, clear your schedule for the next month. Those "5 minute setup" guides are fucking lies written by people who've never seen production.
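
If you just want a local sandbox to follow along, a throwaway k3d cluster is enough. A minimal sketch, assuming you already have Docker, k3d, and kubectl installed:

## Disposable local cluster for learning - not production
k3d cluster create gitops-lab --agents 2
kubectl get nodes   # sanity check that the cluster actually answers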

The Modern Stack We're Using

Modern CI/CD Stack

GitHub Actions handles CI because:

  • Native GitHub integration with zero configuration
  • Powerful workflow syntax that scales from simple to complex
  • Massive ecosystem of pre-built actions
  • Built-in secret management and OIDC authentication
  • Matrix builds for testing across multiple environments
  • Artifact storage and caching built-in

ArgoCD handles CD because:

  • It pulls from Git instead of letting CI push into the cluster, so what's running always traces back to a commit
  • Drift detection and self-heal revert manual cluster changes back to Git state
  • Rollbacks are built in, and the full deployment history lives in Git
  • The UI shows exactly what's deployed and whether it's healthy

Helm manages Kubernetes manifests because:

  • Templating kills the copy-paste YAML between environments
  • Values files capture per-environment differences in one place
  • Releases are versioned, which makes upgrades and rollbacks trackable

Real-World Production Considerations

Production Pipeline Flow

Here's the shit nobody tells you about production:

Security: Every component needs proper authentication. GitHub Actions uses OIDC to connect to cloud providers without storing long-lived credentials. ArgoCD uses Kubernetes RBAC to limit what it can deploy. Implement network policies, pod security standards, and image scanning.

Monitoring: You need observability into every step. GitHub Actions provides build metrics, ArgoCD shows deployment status, Kubernetes gives runtime metrics. Set up AlertManager for failed deployments, not just successful ones.

Compliance: Git history becomes your audit trail. Every production change must go through this pipeline - no manual kubectl commands. GitOps provides complete change traceability which SOC2 and ISO27001 auditors require.

Disaster Recovery: Your entire infrastructure is in Git. If your cluster dies, you can recreate it exactly by applying your Git repository. This is infrastructure as code taken to its logical conclusion.
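
Here's roughly what that recovery looks like in practice, assuming your ArgoCD Application manifests live in an apps/ directory like the config repo later in this guide:

## Fresh cluster + Git = your old environment back
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

## Re-register the Applications; ArgoCD pulls everything else from Git
kubectl apply -n argocd -f apps/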

What Success Looks Like

When properly implemented, this pipeline gives you:

  • Sub-10-minute deployments from commit to production
  • Automatic rollback on deployment failure
  • Complete deployment history in Git
  • Visual deployment tracking in ArgoCD UI
  • Zero-downtime deployments with proper health checks
  • Multi-environment promotion from dev to staging to production

Management loves metrics, so here's what we saw after switching (your mileage will definitely vary):

  • Deployment frequency: Multiple times per day
  • Lead time: Less than 1 hour from commit to production
  • Change failure rate: Less than 5%
  • Recovery time: Less than 1 hour

Common Anti-Patterns to Avoid

GitOps Anti-Patterns

Don't put application code and deployment manifests in the same repository. Separate repositories prevent deployment changes from triggering unnecessary application builds.

Don't use ArgoCD for CI tasks like building images or running tests. ArgoCD is for deployment only. Separation of concerns applies to CI/CD tooling too.

Don't manually edit cluster resources after deployment. Everything must go through Git. Manual changes create configuration drift that breaks future deployments.
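
When you suspect someone has been hand-editing the cluster, ArgoCD will show you the drift directly (assuming your app is named my-app-production like in the examples later):

## Diff between Git (desired state) and the cluster (live state)
argocd app diff my-app-production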

Don't store secrets in Git repositories, even if they're encrypted. Use Kubernetes secrets managed by external secret operators like External Secrets Operator or Sealed Secrets.

Alright, enough theory. Time to build this thing. Next up: GitHub Actions that won't randomly die when you need them most.

Step 1: Build the GitHub Actions CI Pipeline

GitHub Actions Workflow

Let's build a CI pipeline that actually works and doesn't randomly fail when you need it most.

This is the foundation: if CI is broken, everything else is fucked.

Repository Structure That Won't Break

Create this exact directory structure in your main application repository:

my-app/
├── .github/
│   └── workflows/
│       └── ci.yml
├── src/
│   └── (your application code)
├── tests/
│   └── (your test files)
├── Dockerfile
├── package.json  (or equivalent)
└── README.md

The GitHub Actions Workflow That Actually Works

Create .github/workflows/ci.yml with this configuration.

I've run this pattern in production for 2+ years:

name: CI That Won't Die on You  # This one actually works

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run tests
        run: npm run test:coverage

      - name: Run security scan
        run: npm audit --audit-level high
        # Dies on serious vulnerabilities, which is what we want

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        # Management loves their numbers

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'  # Only build on actual pushes, not PRs
    permissions:
      contents: read
      packages: write  # required to push to GHCR with GITHUB_TOKEN
    outputs:
      image: ${{ steps.image.outputs.image }}
      digest: ${{ steps.build.outputs.digest }}  # Use digests, not tags - learned this the hard way

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha  # Saves your sanity on rebuilds
          cache-to: type=gha,mode=max

      - name: Export image name
        id: image
        run: echo "image=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}" >> "$GITHUB_OUTPUT"

  security:
    needs: build
    runs-on: ubuntu-latest
    if: github.event_name == 'push'  # Skip security scans on PRs to save time
    permissions:
      contents: read
      security-events: write  # required to upload SARIF results

    steps:
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ needs.build.outputs.image }}@${{ needs.build.outputs.digest }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'  # Fails the job on serious vulnerabilities, which is what we want

      - name: Upload Trivy scan results
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

Critical Configuration Details

Caching Strategy:

The cache: 'npm' and Docker cache-from/cache-to lines cut build times from 8 minutes to 2 minutes in my experience. GitHub's cache documentation explains the mechanics.

Use actions/cache for node_modules caching and Docker layer caching with BuildKit.

Image Tagging:

We create multiple tags: branch name, SHA, and latest. This gives you deployment flexibility. The SHA tag is crucial for GitOps because it provides immutable image references.

Use docker/metadata-action for consistent tagging strategies.
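
For a push to main, the tag rules above produce something like this (SHA shortened, purely illustrative):

ghcr.io/myorg/my-app:main
ghcr.io/myorg/my-app:main-abc1234
ghcr.io/myorg/my-app:latest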

Security Integration: Trivy scanning catches vulnerabilities before they reach production.

Results upload to GitHub Security tab for tracking. SARIF format enables native GitHub integration.

Also consider Snyk and GitHub Dependabot.

OIDC Authentication Setup (Critical for Production)

Never store cloud credentials in GitHub Secrets.

Use OIDC authentication instead:

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubAction-AssumeRoleWithAction
    aws-region: us-east-1

- name: Login to Amazon ECR
  uses: aws-actions/amazon-ecr-login@v2

AWS trust policy setup is required, and it's a pain in the ass. Took me forever to get the JSON right.
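
For reference, here's roughly the shape that finally worked for me (account ID, org, and repo are placeholders); the Condition block is the part everyone gets wrong. The workflow job also needs permissions: id-token: write, or GitHub never issues the OIDC token at all:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:myorg/my-app:ref:refs/heads/main"
        }
      }
    }
  ]
}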

Step 2: Setup ArgoCD for GitOps Deployment

ArgoCD Dashboard

ArgoCD is your deployment controller.

It watches Git repositories and syncs changes to Kubernetes clusters automatically.

Install ArgoCD in Your Cluster

## Create dedicated namespace
kubectl create namespace argocd

## Install ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

## Wait for deployment
kubectl wait --for=condition=available --timeout=300s deployment/argocd-server -n argocd

## Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Production Note: Use Helm charts for production installations. The raw manifests work for learning, but you need customization for real environments.
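
If you go the Helm route, a minimal sketch looks like this (community argo-helm chart, values left at defaults; you'll want a values file for anything real):

## ArgoCD via the community Helm chart
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm install argocd argo/argo-cd --namespace argocd --create-namespace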

Configure ArgoCD for GitOps

Create a separate repository for your deployment manifests. This is critical: keeping manifests out of the app repo stops config-only changes from triggering application builds, and gives ArgoCD a clean repo to watch.

Repository structure for my-app-config:

my-app-config/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── environments/
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── patches/
│   └── production/
│       ├── kustomization.yaml
│       └── patches/
└── apps/
    ├── staging-app.yaml
    └── production-app.yaml
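
The kustomization.yaml files are the glue here. A minimal sketch of what base/ and environments/production/ might contain (file names match the structure above; adjust for your app):

## base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml

## environments/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
images:
  - name: ghcr.io/myorg/my-app
    newTag: main-abc123   # CI rewrites this line with `kustomize edit set image`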

Base Deployment Configuration

Create base/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/myorg/my-app:main-abc123
          ports:
            - containerPort: 8080
          env:
            - name: NODE_ENV
              value: "production"
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Resource Limits:

Always set resource requests and limits. Kubernetes scheduling depends on requests, and limits prevent resource starvation.

Health Checks: Liveness and readiness probes are mandatory for production.

Without them, Kubernetes can't tell if your app is actually healthy.
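
Before trusting the probes, hit the endpoints yourself. A quick check, assuming the deployment above is running in the production namespace:

## Forward the pod's port locally and poke the endpoints
kubectl port-forward deploy/my-app 8080:8080 -n production &
curl -i http://localhost:8080/health   # expect HTTP 200 while the process is alive
curl -i http://localhost:8080/ready    # expect HTTP 200 once the app can take traffic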

ArgoCD Application Configuration

Create apps/production-app.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/my-app-config.git
    targetRevision: main
    path: environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m0s
        factor: 2

Automated Sync: prune: true removes resources deleted from Git. selfHeal: true reverts manual cluster changes back to Git state.

This maintains GitOps discipline.

Connect CI to CD with Image Updates

The critical piece: when CI builds a new image, it must update the deployment configuration.

Add this job to your GitHub Actions:

  update-deployment:
    needs: [build, security]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    steps:
      - name: Checkout config repo
        uses: actions/checkout@v4
        with:
          repository: myorg/my-app-config
          token: ${{ secrets.CONFIG_REPO_TOKEN }}

      - name: Update image reference
        run: |
          cd environments/production
          # Pin to the digest, not a tag - digests can't be repointed
          kustomize edit set image ghcr.io/myorg/my-app@${{ needs.build.outputs.digest }}

      - name: Commit changes
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add .
          git commit -m "Update image to ${{ needs.build.outputs.digest }}"
          git push

Important:

Use image digests, not tags. Tags are mutable; digests are immutable. OCI image specification guarantees digest uniqueness.
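
Concretely, the rendered manifest ends up pinning the image like this (the digest value here is made up for illustration):

image: ghcr.io/myorg/my-app@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  # can't be repointed, unlike a tag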

Step 3: Implement Production Monitoring and Rollback

ArgoCD Rollback Interface

Production deployments need monitoring and automated rollback capabilities.

Here's how to implement both.

Deployment Health Monitoring

ArgoCD monitors application health automatically, but you need application-level health checks:

## Add to your deployment
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            path: /ready  
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 2
          successThreshold: 1

Your application must implement these endpoints:

  • /health - Returns 200 if the app is alive (liveness probe)
  • /ready - Returns 200 if the app can serve traffic (readiness probe)

Automatic Rollback on Health Check Failure

Configure ArgoCD sync windows and health checks:

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m0s
        factor: 2
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true

Progressive delivery patterns like canary deployments provide additional safety.

Argo Rollouts extends ArgoCD with advanced deployment strategies.

Manual Rollback Process

When you need to rollback manually:

## Via ArgoCD CLI
argocd app rollback my-app-production

## Via Git (preferred GitOps method)
cd my-app-config
git log --oneline  # Find previous good commit
git revert abc123  # Revert to previous version
git push           # ArgoCD syncs automatically

Git-based rollback is preferred because it maintains the GitOps principle: all changes flow through Git.

That's the pipeline built. But production never goes according to plan: next up is the real shit that breaks and how to fix it when you're getting paged at 3 AM.

Q

My GitHub Actions workflow isn't triggering and I'm losing my mind

A

Yeah, this shit happens all the time. Here's what's fucked, in order of obviousness:

  1. You forgot to commit the workflow file to main (90% of the time it's this dumb shit)
  2. YAML syntax is fucked - Run yamllint .github/workflows/ci.yml and fix your indentation. Two spaces, not tabs
  3. Actions disabled - Check Settings → Actions → General, some security-paranoid admin probably disabled them
  4. Branch protection blocking you - Protected branches can block Actions from triggering
  5. Wrong directory name - Must be .github/workflows/ not .github/workflow/. That missing 's' will ruin your day

Spent 2 hours debugging once because I put the workflow in .github/workflow instead of .github/workflows. GitHub doesn't give you an error, just silently ignores it. Also, Node 18.17.0 has a known issue with npm ci - use 18.16.1 or 20.x if you hit ERESOLVE errors.

Q

How do I handle secrets in a GitOps pipeline?

A

Never put secrets in Git, even encrypted ones. Use these patterns:

  • External Secrets Operator - Syncs secrets from AWS Secrets Manager/Azure Key Vault
  • Sealed Secrets - Encrypts secrets that only the cluster can decrypt
  • ArgoCD Vault Plugin - Integrates with HashiCorp Vault

Example with External Secrets Operator:

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
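
The SecretStore alone doesn't create anything in the cluster; you pair it with an ExternalSecret that copies a value into a regular Kubernetes Secret. A minimal sketch (the Secrets Manager key name is made up):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets          # the SecretStore defined above
    kind: SecretStore
  target:
    name: my-app-secrets       # the Kubernetes Secret that gets created
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/my-app/database-url   # hypothetical Secrets Manager entry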
Q

ArgoCD shows "Unknown" health status - what's wrong?

A

ArgoCD can't determine if your application is healthy. Add proper health checks:

## In your Deployment
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /ready
    port: 8080

If you don't have health endpoints, use this minimal check:

livenessProbe:
  tcpSocket:
    port: 8080
Q

My deployment succeeded but nothing changed?

A

Classic GitOps confusion. Check:

  1. Image tag didn't change - ArgoCD won't sync if manifests are identical
  2. Sync policy - Manual sync required if automated isn't configured
  3. Repository URL - ArgoCD watching wrong repo or branch
  4. Path configuration - ArgoCD looking at wrong directory

Run argocd app get myapp to see current sync status.

Q

How do I rollback to a previous version?

A

Git-based rollback (recommended):

cd my-app-config
git log --oneline
git revert <commit-hash>
git push  # ArgoCD syncs automatically

ArgoCD rollback:

argocd app rollback myapp
## Or via UI: click app → History → click previous version → Rollback

Git method is better because it maintains GitOps principles.

Q

GitHub Actions failing with "docker: permission denied"?

A

Docker daemon permission issue. Add this to your workflow:

- name: Setup Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Login to Container Registry
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

For self-hosted runners, add the runner user to docker group:

sudo usermod -aG docker $USER
Q

ArgoCD says "ComparisonError" - how do I fix this?

A

ArgoCD can't compare desired state with actual state. Common causes:

  1. Malformed YAML - Validate with kubectl apply --dry-run
  2. Missing CRDs - Install required Custom Resource Definitions first
  3. Resource name conflicts - Two resources trying to use same name
  4. Namespace issues - Resource references non-existent namespace

Check ArgoCD logs: kubectl logs -n argocd deployment/argocd-application-controller

Q

How do I deploy to multiple environments (dev/staging/prod)?

A

Use Kustomize overlays or separate ArgoCD applications:

Kustomize approach:

environments/
├── base/
│   └── deployment.yaml
├── staging/
│   ├── kustomization.yaml
│   └── replica-patch.yaml
└── production/
    ├── kustomization.yaml
    └── replica-patch.yaml

Separate Applications approach:

## staging-app.yaml
spec:
  source:
    path: environments/staging
## production-app.yaml  
spec:
  source:
    path: environments/production

Both work. Kustomize is simpler for small differences, separate apps better for major environment variations.

Q

Why are my deployments so slow?

A

Common bottlenecks:

  1. Image pull time - Use multi-stage builds and layer caching
  2. Health check delays - Reduce initialDelaySeconds if app starts quickly
  3. Resource limits - Pod can't start due to insufficient cluster resources
  4. Registry throttling - Docker Hub rate limits public image pulls

Add timing to troubleshoot:

readinessProbe:
  initialDelaySeconds: 5  # Start checking after 5s
  periodSeconds: 2        # Check every 2s instead of default 10s
Q

How do I handle database migrations in GitOps?

A

Pre-deployment hooks with ArgoCD:

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-db
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:latest
        command: ["npm", "run", "migrate"]

External migration management:

  • Use Flyway/Liquibase operators
  • Run migrations in separate CI/CD pipeline
  • Handle migrations at application startup (careful with rollbacks)
Q

ArgoCD out of sync but resources look identical?

A

Kubernetes adds default values and metadata that ArgoCD sees as differences. Fix with:

## In your ArgoCD Application
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas  # Ignore if HPA manages replicas
  - group: ""
    kind: Service
    jsonPointers:
    - /spec/clusterIP  # Kubernetes assigns this automatically
Q

My tests pass locally but fail in GitHub Actions?

A

Environment differences. Common issues:

  1. Node.js version - Pin versions: node-version: '20.x'
  2. Dependencies - Use npm ci instead of npm install
  3. Timezone - Tests failing due to date/time assumptions
  4. File system permissions - Linux vs macOS/Windows differences
  5. Environment variables - Missing secrets or config

Add debugging:

- name: Debug environment
  run: |
    node --version
    npm --version
    pwd
    ls -la
    env | sort
Q

How do I secure my GitOps pipeline?

A

Security checklist:

  1. Branch protection - Require reviews for config repo changes
  2. RBAC - Limit ArgoCD permissions with Kubernetes roles
  3. Network policies - Restrict pod-to-pod communication
  4. Image scanning - Scan images for vulnerabilities before deployment
  5. Secrets management - Use external secret operators
  6. Supply chain - Pin action versions to SHAs, not tags
## Pin to specific commit
- uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608  # v4.1.0
Q

OIDC authentication failing with AWS?

A

Trust policy syntax is picky. Use this exact format:

{
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:myorg/myrepo:ref:refs/heads/main"
  }
}

Common mistakes:

  • Using sts.aws.com instead of sts.amazonaws.com
  • Wrong subject format - must be repo:org/repo:ref:refs/heads/branch
  • Typo in repository name (case sensitive)
Q

How much will this setup actually cost?

A

GitHub Actions: $0.008/minute for private repos. Public repos are free.

Container Registry:

  • GitHub Container Registry: Free for public, $0.50/GB/month private
  • AWS ECR: $0.10/GB/month
  • Docker Hub: Free tier, then $5/month

Kubernetes Cluster:

  • Local (k3d/kind): Free but not production-ready
  • EKS/GKE/AKS: ~$72/month for control plane + node costs
  • DigitalOcean Kubernetes: $12/month minimum

ArgoCD: Free and open source. Only costs cluster resources.

This stuff ain't free. Expect around $200/month, maybe more if AWS decides to fuck you with data transfer charges.

Q

Should I use this for a team of 2 developers?

A

Probably not worth the pain. GitOps makes sense when you have:

  • Multiple environments (dev/staging/prod)
  • Multiple team members making changes
  • Compliance requirements for change tracking
  • Complex multi-service applications

For 2-person teams, consider:

  • GitHub Actions + simple deployment (kubectl apply, sketch below)
  • Heroku/Railway/Render for simpler hosting
  • Docker Compose on a VPS

Don't torture yourself with this unless you're already stuck with Kubernetes.
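
If that's your situation, the whole "pipeline" can be one GitHub Actions job. A rough sketch, assuming a KUBECONFIG secret and plain manifests in a k8s/ directory:

  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Apply manifests
        run: |
          echo "${{ secrets.KUBECONFIG }}" > kubeconfig
          KUBECONFIG=./kubeconfig kubectl apply -f k8s/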

Deployment Approach Comparison

| Approach | Setup Complexity | Maintenance Effort | Rollback Speed | Audit Trail | Best For |
|---|---|---|---|---|---|
| GitOps + ArgoCD | High (4-8 hours) | Low once configured | Seconds (Git revert) | Complete Git history | Teams with Kubernetes, compliance needs |
| GitHub Actions Direct Deploy | Low (1-2 hours) | Medium ongoing | Minutes (manual) | GitHub Actions logs | Small teams, simple applications |
| Jenkins + Scripts | Very High (days) | High (plugin maintenance) | Variable | Build logs only | Legacy environments, complex workflows |
| GitLab CI/CD | Medium (2-4 hours) | Low-Medium | Minutes | GitLab pipeline history | Teams already on GitLab |
| Heroku/Railway Deploy | Very Low (minutes) | Very Low | Minutes | Platform logs | Prototypes, simple web apps |
| Traditional FTP/SSH | Low | High (manual process) | Hours (manual) | None | Legacy systems (not recommended) |

🚀 GitOps Case Study: Terraform AKS & EC2 | DevSecOps CI/CD Pipeline with GitHub Actions & ArgoCD by Raghu The Security Expert

GitOps DevSecOps CI/CD Pipeline - Complete Implementation

This 45-minute video actually shows you how to build this stuff instead of just talking about it. Covers Terraform, ArgoCD, and GitHub Actions - basically what I just walked you through but with more clicking around.

Key timestamps:
- 0:00 - GitOps architecture overview and benefits
- 8:30 - Setting up infrastructure with Terraform
- 15:45 - Configuring GitHub Actions for CI
- 25:20 - ArgoCD installation and configuration
- 35:10 - End-to-end deployment demonstration
- 40:15 - Security scanning and compliance considerations

Why this video doesn't suck:
Shows the real implementation details that docs skip over, like dealing with secrets and fixing the shit that breaks during setup.

Watch: GitOps Case Study: Terraform AKS & EC2 | DevSecOps CI/CD Pipeline


The Real Shit That Breaks and How to Fix It Fast

ArgoCD Troubleshooting Dashboard

Look, GitOps isn't magic. Things break. Usually at the worst possible time. Here are the five clusterfucks I've debugged multiple times, along with the nuclear option fixes that actually work when you're getting paged at 3 AM and your CEO is asking why the site is down.

Five Ways GitOps Will Ruin Your Weekend (And How to Fix Them)

1. Registry Secrets Die and Take Everything With Them

Saturday morning, 7 AM. Coffee ready, weekend planned. Then Slack goes fucking ballistic: every pod is ImagePullBackOff. Your GitHub token expired overnight because GitHub's "never expires" tokens actually expire after a year. Liars.

3 AM Emergency Fix:

## First, see which pods are fucked
kubectl get pods -A | grep ImagePullBackOff

## Check what the actual error is (usually "unauthorized" or "forbidden")  
kubectl describe pod <some-broken-pod> -n production

## Generate new registry secret (copy-paste ready)
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=$GITHUB_USERNAME \
  --docker-password=$GITHUB_TOKEN_THAT_ACTUALLY_WORKS \
  --docker-email=whatever@company.com \
  -n production --dry-run=client -o yaml | kubectl apply -f -

Permanent Solution:
Use External Secrets Operator to automatically refresh credentials from AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: github-registry-secret
spec:
  refreshInterval: 24h
  secretStoreRef:
    name: github-secret-store
    kind: SecretStore
  target:
    name: ghcr-secret
    type: kubernetes.io/dockerconfigjson

2. ArgoCD Decides It's Too Tired to Deploy

ArgoCD UI shows "Operation is taking too long" and just... gives up. This happens when your cluster is slower than ArgoCD's patience (default 3 minutes), or when you're trying to deploy a massive app with 47 microservices because your architect read about Netflix once.

Get ArgoCD to Actually Try:

## Make ArgoCD less impatient (increase timeout to 10 minutes)
kubectl patch configmap argocd-cm -n argocd --type merge \
  -p='{"data":{"timeout.reconciliation":"600s","timeout.hard.reconciliation":"0"}}'

## Kick ArgoCD in the ass (restart it)
kubectl rollout restart deployment/argocd-application-controller -n argocd
kubectl rollout restart deployment/argocd-server -n argocd

## Check it actually restarted and isn't stuck
kubectl get pods -n argocd | grep argocd-application-controller

Production Configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "300s"  # 5 minutes instead of default 180s
  timeout.hard.reconciliation: "0"  # Disable hard timeout
  application.operation.timeout: "600s"  # 10 minutes for operations

3. Resource Quota Exceeded

Pods stuck in Pending status because the namespace hit resource limits. This kills deployments silently.

Diagnosis:

## Check resource usage
kubectl describe quota -n <namespace>
kubectl top pods -n <namespace> --sort-by=cpu
kubectl top pods -n <namespace> --sort-by=memory

## Check what's requesting resources
kubectl describe limitrange -n <namespace>

Emergency Fix:

## Temporarily increase quota
kubectl patch resourcequota compute-quota -n <namespace> --type merge -p='{"spec":{"hard":{"requests.cpu":"4","requests.memory":"8Gi","limits.cpu":"8","limits.memory":"16Gi"}}}'

4. GitHub Actions Rate Limiting

Builds fail with 403 Forbidden errors when making API calls to GitHub or pulling from registries.

Check Rate Limits:

## Ask the GitHub API how much budget you have left
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit

## Or eyeball your tokens in GitHub Settings > Developer settings > Personal access tokens

Solutions:

  • Authenticate API calls with GITHUB_TOKEN or a PAT - authenticated requests get a far higher limit than anonymous ones
  • Pull base images from GHCR or an authenticated registry account instead of anonymous Docker Hub
  • Cache dependencies and Docker layers so each run hits external services less often
  • If you're still burning through limits, look at self-hosted runners or a registry pull-through cache

5. Kubernetes Node Resource Exhaustion

The classic "everything was working, now nothing starts" scenario. Nodes ran out of CPU/memory.

Emergency Response:

## Check node status
kubectl top nodes
kubectl describe node <node-name>

## Find resource hogs
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

## Quick cleanup
kubectl delete pod <resource-heavy-pod> -n <namespace>

Advanced Production Configurations

Multi-Cluster ArgoCD Setup

For managing multiple Kubernetes clusters (dev/staging/prod) from single ArgoCD instance using ArgoCD cluster management. This approach scales to hundreds of clusters with proper resource tuning:

apiVersion: v1
kind: Secret
metadata:
  name: staging-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: staging
  server: YOUR_CLUSTER_API_ENDPOINT  # Replace with your cluster API server URL and port 6443
  config: |
    {
      "bearerToken": "...",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "..."
      }
    }

Progressive Delivery with Argo Rollouts

Implement canary deployments for risk reduction using blue-green or progressive delivery patterns. Supports analysis runs with Prometheus, DataDog, or New Relic:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 20      # 20% traffic to new version
      - pause: {duration: 60s}
      - setWeight: 50      # 50% traffic
      - pause: {duration: 60s}
      - setWeight: 100     # Full rollout
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: myapp:latest

Automated Certificate Management

Use cert-manager for TLS certificate automation with Let's Encrypt, HashiCorp Vault, or AWS Certificate Manager. Integrates with ingress controllers for automatic certificate provisioning:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

Monitoring and Alerting That Actually Helps

Essential Metrics to Track:

  • Deployment frequency and lead time (the DORA numbers management asks about)
  • Change failure rate and time to recovery
  • ArgoCD application health and sync status
  • CI pipeline duration and failure rate

AlertManager Rules:

groups:
- name: gitops-alerts
  rules:
  - alert: ArgoCD-App-Degraded
    expr: argocd_app_health_status{health_status!="Healthy"} == 1
    for: 5m
    annotations:
      summary: "ArgoCD application {{ $labels.name }} is degraded"
      
  - alert: GitHub-Actions-Failing
    expr: increase(github_actions_workflow_run_failures_total[15m]) > 3
    annotations:
      summary: "Multiple GitHub Actions workflow failures detected"

Security Hardening for Production

Network Policies for ArgoCD:

Implement Kubernetes Network Policies to restrict traffic flow. Use Calico or Cilium for advanced policy enforcement with egress controls:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-network-policy
  namespace: argocd
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: argocd
    ports:
    - protocol: TCP
      port: 8080

RBAC for ArgoCD Projects:

Configure Role-Based Access Control with AppProjects for multi-tenancy. Integrates with OIDC providers like Active Directory, Google OAuth, or GitHub Teams:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production applications
  sourceRepos:
  - 'https://github.com/myorg/my-app-config'
  destinations:
  - namespace: production
    server: "REPLACE_WITH_YOUR_CLUSTER_API_ENDPOINT"  # Replace with your cluster API server URL
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
  namespaceResourceWhitelist:
  - group: apps
    kind: Deployment
  - group: ''
    kind: Service

Performance Optimization

ArgoCD Performance Tuning:

Optimize ArgoCD for large-scale deployments with horizontal scaling and resource tuning. Monitor with Prometheus metrics and Grafana dashboards:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Increase repo server replicas for faster Git operations
  reposerver.parallelism.limit: "20"
  
  # Enable Git LFS support
  reposerver.enable.git.lfs.support: "true"
  
  # Optimize resource tracking
  application.resourceTrackingMethod: "annotation"

GitHub Actions Performance:

Optimize CI performance with dependency caching, Docker layer caching, and matrix builds. Consider self-hosted runners for consistent performance:

- uses: actions/setup-node@v4
  with:
    node-version: '20'
    cache: 'npm'         # Critical for speed
    
- name: Cache Docker layers
  uses: actions/cache@v3
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-buildx-

The key to production success is proactive monitoring and having runbooks for common issues. Don't wait until 3am to figure out how to rollback a broken deployment - practice these procedures during normal business hours.

GitOps isn't about perfection. It's about having a system that doesn't completely fuck you when things break. Learn these patterns and you might actually sleep through the night occasionally.

Stuff That Doesn't Suck (Mostly)