The GitOps Setup That Finally Stopped Destroying Our Weekends

GitOps CI/CD Architecture

After Jenkins pipelines died randomly and GitLab CI murdered our AWS bill, we finally got a GitOps setup that works. Took six months, three postmortems, and one very awkward all-hands about why checkout was fucked for two hours on Black Friday.

GitOps Finally Solved Our "Who the Hell Deployed This?" Problem

Here's the thing about traditional push-based CI/CD: when shit hits the fan at 3 AM, you're frantically digging through Jenkins logs trying to figure out what got deployed when. With GitOps, every deployment is a Git commit. Need to rollback? git revert. Need to audit who deployed what? git log. When ArgoCD pulls from Git instead of Jenkins pushing random builds, you actually know what's running in production.

The first time our on-call rotation went from weekly pages to maybe one alert in a month, we knew we were onto something. Turns out most outages happen because someone deployed something and forgot to tell anyone.

The Two-Part System That Stopped Our Weekend Debugging Sessions

ArgoCD Architecture

Here's how it works when it's not breaking:

GitHub Actions handles the build stuff (this part usually works):

  • Developer pushes code, Actions runs tests and security scans
  • If tests pass, builds Docker image and shoves it into the registry
  • Updates the deployment manifest with the new image SHA (not tag - learned this the hard way)

ArgoCD handles the deployment nightmare (this is where things get interesting):

  • ArgoCD obsessively watches your Git repo for changes every 3 minutes
  • Spots the new manifest, syncs it to your cluster
  • Rolling deployment happens with health checks (when they're configured right)
  • If your app dies during deploy, ArgoCD automatically rolls back

I learned about the separation after our monolithic Jenkins setup ate itself during a deploy and took both CI and CD down. Two systems means when one shits the bed, the other keeps working. Google's SRE book calls this "blast radius reduction" - we called it "thank god we can still deploy hotfixes."

What You Need Before Starting (Don't Skip This Shit)

Seriously, don't start unless you have:

  • Kubernetes cluster (k3d works for learning, quick sketch below, but you'll need EKS/GKE for anything real)
  • GitHub repo with admin rights (you'll need to add secrets and webhooks)
  • Container registry (Docker Hub rate limits will bite you, use ECR/GCR)
  • Domain + SSL cert (Let's Encrypt is fine, just don't use self-signed certs)
  • You've used kubectl before and know pods from deployments

This took me an entire weekend, and I thought I knew what I was doing. If you're new to this shit, clear your schedule for the next month. Those "5 minute setup" guides are fucking lies written by people who've never seen production.
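
If you just want a local sandbox to follow along, a throwaway k3d cluster is enough. A minimal sketch, assuming you already have Docker, k3d, and kubectl installed:

## Disposable local cluster for learning - not production
k3d cluster create gitops-lab --agents 2
kubectl get nodes   # sanity check that the cluster actually answers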

The Modern Stack We're Using

Modern CI/CD Stack

GitHub Actions handles CI because:

  • Native GitHub integration with zero configuration
  • Powerful workflow syntax that scales from simple to complex
  • Massive ecosystem of pre-built actions
  • Built-in secret management and OIDC authentication
  • Matrix builds for testing across multiple environments
  • Artifact storage and caching built-in

ArgoCD handles CD because:

  • It pulls from Git instead of letting CI push into the cluster, so what's running always traces back to a commit
  • Drift detection and self-heal revert manual cluster changes back to Git state
  • Rollbacks are built in, and the full deployment history lives in Git
  • The UI shows exactly what's deployed and whether it's healthy

Helm manages Kubernetes manifests because:

  • Templating kills the copy-paste YAML between environments
  • Values files capture per-environment differences in one place
  • Releases are versioned, which makes upgrades and rollbacks trackable

Real-World Production Considerations

Production Pipeline Flow

Here's the shit nobody tells you about production:

Security: Every component needs proper authentication. GitHub Actions uses OIDC to connect to cloud providers without storing long-lived credentials. ArgoCD uses Kubernetes RBAC to limit what it can deploy. Implement network policies, pod security standards, and image scanning.

Monitoring: You need observability into every step. GitHub Actions provides build metrics, ArgoCD shows deployment status, Kubernetes gives runtime metrics. Set up AlertManager for failed deployments, not just successful ones.

Compliance: Git history becomes your audit trail. Every production change must go through this pipeline - no manual kubectl commands. GitOps provides complete change traceability which SOC2 and ISO27001 auditors require.

Disaster Recovery: Your entire infrastructure is in Git. If your cluster dies, you can recreate it exactly by applying your Git repository. This is infrastructure as code taken to its logical conclusion.
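
Here's roughly what that recovery looks like in practice, assuming your ArgoCD Application manifests live in an apps/ directory like the config repo later in this guide:

## Fresh cluster + Git = your old environment back
kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

## Re-register the Applications; ArgoCD pulls everything else from Git
kubectl apply -n argocd -f apps/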

What Success Looks Like

When properly implemented, this pipeline gives you:

  • Sub-10-minute deployments from commit to production
  • Automatic rollback on deployment failure
  • Complete deployment history in Git
  • Visual deployment tracking in ArgoCD UI
  • Zero-downtime deployments with proper health checks
  • Multi-environment promotion from dev to staging to production

Management loves metrics, so here's what we saw after switching (your mileage will definitely vary):

  • Deployment frequency: Multiple times per day
  • Lead time: Less than 1 hour from commit to production
  • Change failure rate: Less than 5%
  • Recovery time: Less than 1 hour

Common Anti-Patterns to Avoid

GitOps Anti-Patterns

Don't put application code and deployment manifests in the same repository. Separate repositories prevent deployment changes from triggering unnecessary application builds.

Don't use ArgoCD for CI tasks like building images or running tests. ArgoCD is for deployment only. Separation of concerns applies to CI/CD tooling too.

Don't manually edit cluster resources after deployment. Everything must go through Git. Manual changes create configuration drift that breaks future deployments.
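
When you suspect someone has been hand-editing the cluster, ArgoCD will show you the drift directly (assuming your app is named my-app-production like in the examples later):

## Diff between Git (desired state) and the cluster (live state)
argocd app diff my-app-production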

Don't store secrets in Git repositories, even if they're encrypted. Use Kubernetes secrets managed by external secret operators like External Secrets Operator or Sealed Secrets.

Alright, enough theory. Time to build this thing. Next up: GitHub Actions that won't randomly die when you need them most.

Step 1: Build the GitHub Actions CI Pipeline

GitHub Actions Workflow

Let's build a CI pipeline that actually works and doesn't randomly fail when you need it most.

This is the foundation: if CI is broken, everything else is fucked.

Repository Structure That Won't Break

Create this exact directory structure in your main application repository:

my-app/
├── .github/
│   └── workflows/
│       └── ci.yml
├── src/
│   └── (your application code)
├── tests/
│   └── (your test files)
├── Dockerfile
├── package.json  (or equivalent)
└── README.md

The GitHub Actions Workflow That Actually Works

Create .github/workflows/ci.yml with this configuration.

I've run this pattern in production for 2+ years:

name: CI That Won't Die on You  # This one actually works

on:
  push:
    branches: [ main, develop ]
  pull_request:
    branches: [ main ]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run tests
        run: npm run test:coverage

      - name: Run security scan
        run: npm audit --audit-level high
        # Dies on serious vulnerabilities, which is what we want

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        # Management loves their numbers

  build:
    needs: test
    runs-on: ubuntu-latest
    if: github.event_name == 'push'  # Only build on actual pushes, not PRs
    permissions:
      contents: read
      packages: write  # required to push to GHCR with GITHUB_TOKEN
    outputs:
      image: ${{ steps.image.outputs.image }}
      digest: ${{ steps.build.outputs.digest }}  # Use digests, not tags - learned this the hard way

    steps:
      - uses: actions/checkout@v4

      - name: Log in to Container Registry
        uses: docker/login-action@v3
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=sha,prefix={{branch}}-
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build and push Docker image
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha  # Saves your sanity on rebuilds
          cache-to: type=gha,mode=max

      - name: Export image name
        id: image
        run: echo "image=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}" >> "$GITHUB_OUTPUT"

  security:
    needs: build
    runs-on: ubuntu-latest
    if: github.event_name == 'push'  # Skip security scans on PRs to save time
    permissions:
      contents: read
      security-events: write  # required to upload SARIF results

    steps:
      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ needs.build.outputs.image }}@${{ needs.build.outputs.digest }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'
          exit-code: '1'  # Fails the job on serious vulnerabilities, which is what we want

      - name: Upload Trivy scan results
        uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: 'trivy-results.sarif'

Critical Configuration Details

Caching Strategy:

The cache: 'npm' and Docker cache-from/cache-to lines cut build times from 8 minutes to 2 minutes in my experience. GitHub's cache documentation explains the mechanics.

Use actions/cache for node_modules caching and Docker layer caching with BuildKit.

Image Tagging:

We create multiple tags: branch name, SHA, and latest. This gives you deployment flexibility. The SHA tag is crucial for GitOps because it provides immutable image references.

Use docker/metadata-action for consistent tagging strategies.
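
For a push to main, the tag rules above produce something like this (SHA shortened, purely illustrative):

ghcr.io/myorg/my-app:main
ghcr.io/myorg/my-app:main-abc1234
ghcr.io/myorg/my-app:latest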

Security Integration: Trivy scanning catches vulnerabilities before they reach production.

Results upload to GitHub Security tab for tracking. SARIF format enables native GitHub integration.

Also consider Snyk and GitHub Dependabot.

OIDC Authentication Setup (Critical for Production)

Never store cloud credentials in GitHub Secrets.

Use OIDC authentication instead:

- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/GitHubAction-AssumeRoleWithAction
    aws-region: us-east-1

- name: Login to Amazon ECR
  uses: aws-actions/amazon-ecr-login@v2

AWS trust policy setup is required, and it's a pain in the ass. Took me forever to get the JSON right.
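
For reference, here's roughly the shape that finally worked for me (account ID, org, and repo are placeholders); the Condition block is the part everyone gets wrong. The workflow job also needs permissions: id-token: write, or GitHub never issues the OIDC token at all:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:myorg/my-app:ref:refs/heads/main"
        }
      }
    }
  ]
}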

Step 2: Setup ArgoCD for GitOps Deployment

ArgoCD Dashboard

ArgoCD is your deployment controller.

It watches Git repositories and syncs changes to Kubernetes clusters automatically.

Install ArgoCD in Your Cluster

## Create dedicated namespace
kubectl create namespace argocd

## Install ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

## Wait for deployment
kubectl wait --for=condition=available --timeout=300s deployment/argocd-server -n argocd

## Get initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d

Production Note: Use Helm charts for production installations. The raw manifests work for learning, but you need customization for real environments.
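
If you go the Helm route, a minimal sketch looks like this (community argo-helm chart, values left at defaults; you'll want a values file for anything real):

## ArgoCD via the community Helm chart
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update
helm install argocd argo/argo-cd --namespace argocd --create-namespace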

Configure ArgoCD for GitOps

Create a separate repository for your deployment manifests. This is critical: keeping manifests out of the app repo stops config-only changes from triggering application builds, and gives ArgoCD a clean repo to watch.

Repository structure for my-app-config:

my-app-config/
├── base/
│   ├── deployment.yaml
│   ├── service.yaml
│   └── kustomization.yaml
├── environments/
│   ├── staging/
│   │   ├── kustomization.yaml
│   │   └── patches/
│   └── production/
│       ├── kustomization.yaml
│       └── patches/
└── apps/
    ├── staging-app.yaml
    └── production-app.yaml
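
The kustomization.yaml files are the glue here. A minimal sketch of what base/ and environments/production/ might contain (file names match the structure above; adjust for your app):

## base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml

## environments/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
images:
  - name: ghcr.io/myorg/my-app
    newTag: main-abc123   # CI rewrites this line with `kustomize edit set image`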

Base Deployment Configuration

Create base/deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: ghcr.io/myorg/my-app:main-abc123
          ports:
            - containerPort: 8080
          env:
            - name: NODE_ENV
              value: "production"
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

Resource Limits:

Always set resource requests and limits. Kubernetes scheduling depends on requests, and limits prevent resource starvation.

Health Checks: Liveness and readiness probes are mandatory for production.

Without them, Kubernetes can't tell if your app is actually healthy.
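
Before trusting the probes, hit the endpoints yourself. A quick check, assuming the deployment above is running in the production namespace:

## Forward the pod's port locally and poke the endpoints
kubectl port-forward deploy/my-app 8080:8080 -n production &
curl -i http://localhost:8080/health   # expect HTTP 200 while the process is alive
curl -i http://localhost:8080/ready    # expect HTTP 200 once the app can take traffic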

ArgoCD Application Configuration

Create apps/production-app.yaml:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app-production
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/my-app-config.git
    targetRevision: main
    path: environments/production
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m0s
        factor: 2

Automated Sync: prune: true removes resources deleted from Git. selfHeal: true reverts manual cluster changes back to Git state.

This maintains GitOps discipline.

Connect CI to CD with Image Updates

The critical piece: when CI builds a new image, it must update the deployment configuration.

Add this job to your GitHub Actions:

  update-deployment:
    needs: [build, security]
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'

    steps:
      - name: Checkout config repo
        uses: actions/checkout@v4
        with:
          repository: myorg/my-app-config
          token: ${{ secrets.CONFIG_REPO_TOKEN }}

      - name: Update image reference
        run: |
          cd environments/production
          # Pin to the digest, not a tag - digests can't be repointed
          kustomize edit set image ghcr.io/myorg/my-app@${{ needs.build.outputs.digest }}

      - name: Commit changes
        run: |
          git config --local user.email "action@github.com"
          git config --local user.name "GitHub Action"
          git add .
          git commit -m "Update image to ${{ needs.build.outputs.digest }}"
          git push

Important:

Use image digests, not tags. Tags are mutable; digests are immutable. OCI image specification guarantees digest uniqueness.
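
Concretely, the rendered manifest ends up pinning the image like this (the digest value here is made up for illustration):

image: ghcr.io/myorg/my-app@sha256:9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08  # can't be repointed, unlike a tag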

Step 3: Implement Production Monitoring and Rollback

ArgoCD Rollback Interface

Production deployments need monitoring and automated rollback capabilities.

Here's how to implement both.

Deployment Health Monitoring

ArgoCD monitors application health automatically, but you need application-level health checks:

## Add to your deployment
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 5
        readinessProbe:
          httpGet:
            path: /ready  
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 2
          successThreshold: 1

Your application must implement these endpoints:

  • /health - Returns 200 if the app is alive (liveness probe)
  • /ready - Returns 200 if the app can serve traffic (readiness probe)

Automatic Rollback on Health Check Failure

Configure ArgoCD sync windows and health checks:

  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 5
      backoff:
        duration: 5s
        maxDuration: 3m0s
        factor: 2
    syncOptions:
      - CreateNamespace=true
      - RespectIgnoreDifferences=true

Progressive delivery patterns like canary deployments provide additional safety.

Argo Rollouts extends ArgoCD with advanced deployment strategies.

Manual Rollback Process

When you need to rollback manually:

## Via ArgoCD CLI
argocd app rollback my-app-production

## Via Git (preferred GitOps method)
cd my-app-config
git log --oneline  # Find previous good commit
git revert abc123  # Revert to previous version
git push           # ArgoCD syncs automatically

Git-based rollback is preferred because it maintains the GitOps principle: all changes flow through Git.

That's the pipeline built. But production never goes according to plan: next up is the real shit that breaks and how to fix it when you're getting paged at 3 AM.

Q

My GitHub Actions workflow isn't triggering and I'm losing my mind

A

Yeah, this shit happens all the time. Here's what's fucked, in order of obviousness:

  1. You forgot to commit the workflow file to main (90% of the time it's this dumb shit)
  2. YAML syntax is fucked - Run yamllint .github/workflows/ci.yml and fix your indentation. Two spaces, not tabs
  3. Actions disabled - Check Settings → Actions → General, some security-paranoid admin probably disabled them
  4. Branch protection blocking you - Protected branches can block Actions from triggering
  5. Wrong directory name - Must be .github/workflows/ not .github/workflow/. That missing 's' will ruin your day

Spent 2 hours debugging once because I put the workflow in .github/workflow instead of .github/workflows. GitHub doesn't give you an error, just silently ignores it. Also, Node 18.17.0 has a known issue with npm ci - use 18.16.1 or 20.x if you hit ERESOLVE errors.

Q

How do I handle secrets in a GitOps pipeline?

A

Never put secrets in Git, even encrypted ones. Use these patterns:

  • External Secrets Operator - Syncs secrets from AWS Secrets Manager/Azure Key Vault
  • Sealed Secrets - Encrypts secrets that only the cluster can decrypt
  • ArgoCD Vault Plugin - Integrates with HashiCorp Vault

Example with External Secrets Operator:

apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: aws-secrets
spec:
  provider:
    aws:
      service: SecretsManager
      region: us-east-1
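
The SecretStore alone doesn't create anything in the cluster; you pair it with an ExternalSecret that copies a value into a regular Kubernetes Secret. A minimal sketch (the Secrets Manager key name is made up):

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets          # the SecretStore defined above
    kind: SecretStore
  target:
    name: my-app-secrets       # the Kubernetes Secret that gets created
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/my-app/database-url   # hypothetical Secrets Manager entry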
Q

ArgoCD shows "Unknown" health status - what's wrong?

A

ArgoCD can't determine if your application is healthy. Add proper health checks:

## In your Deployment
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /ready
    port: 8080

If you don't have health endpoints, use this minimal check:

livenessProbe:
  tcpSocket:
    port: 8080
Q

My deployment succeeded but nothing changed?

A

Classic GitOps confusion. Check:

  1. Image tag didn't change - ArgoCD won't sync if manifests are identical
  2. Sync policy - Manual sync required if automated isn't configured
  3. Repository URL - ArgoCD watching wrong repo or branch
  4. Path configuration - ArgoCD looking at wrong directory

Run argocd app get myapp to see current sync status.

Q

How do I rollback to a previous version?

A

Git-based rollback (recommended):

cd my-app-config
git log --oneline
git revert <commit-hash>
git push  # ArgoCD syncs automatically

ArgoCD rollback:

argocd app rollback myapp
## Or via UI: click app → History → click previous version → Rollback

Git method is better because it maintains GitOps principles.

Q

GitHub Actions failing with "docker: permission denied"?

A

Docker daemon permission issue. Add this to your workflow:

- name: Setup Docker Buildx
  uses: docker/setup-buildx-action@v3

- name: Login to Container Registry
  uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}

For self-hosted runners, add the runner user to docker group:

sudo usermod -aG docker $USER
Q

ArgoCD says "ComparisonError" - how do I fix this?

A

ArgoCD can't compare desired state with actual state. Common causes:

  1. Malformed YAML - Validate with kubectl apply --dry-run
  2. Missing CRDs - Install required Custom Resource Definitions first
  3. Resource name conflicts - Two resources trying to use same name
  4. Namespace issues - Resource references non-existent namespace

Check ArgoCD logs: kubectl logs -n argocd deployment/argocd-application-controller

Q

How do I deploy to multiple environments (dev/staging/prod)?

A

Use Kustomize overlays or separate ArgoCD applications:

Kustomize approach:

environments/
├── base/
│   └── deployment.yaml
├── staging/
│   ├── kustomization.yaml
│   └── replica-patch.yaml
└── production/
    ├── kustomization.yaml
    └── replica-patch.yaml

Separate Applications approach:

## staging-app.yaml
spec:
  source:
    path: environments/staging
## production-app.yaml  
spec:
  source:
    path: environments/production

Both work. Kustomize is simpler for small differences, separate apps better for major environment variations.

Q

Why are my deployments so slow?

A

Common bottlenecks:

  1. Image pull time - Use multi-stage builds and layer caching
  2. Health check delays - Reduce initialDelaySeconds if app starts quickly
  3. Resource limits - Pod can't start due to insufficient cluster resources
  4. Registry throttling - Docker Hub rate limits public image pulls

Add timing to troubleshoot:

readinessProbe:
  initialDelaySeconds: 5  # Start checking after 5s
  periodSeconds: 2        # Check every 2s instead of default 10s
Q

How do I handle database migrations in GitOps?

A

Pre-deployment hooks with ArgoCD:

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-db
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: BeforeHookCreation
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:latest
        command: ["npm", "run", "migrate"]

External migration management:

  • Use Flyway/Liquibase operators
  • Run migrations in separate CI/CD pipeline
  • Handle migrations at application startup (careful with rollbacks)
Q

ArgoCD out of sync but resources look identical?

A

Kubernetes adds default values and metadata that ArgoCD sees as differences. Fix with:

## In your ArgoCD Application
spec:
  ignoreDifferences:
  - group: apps
    kind: Deployment
    jsonPointers:
    - /spec/replicas  # Ignore if HPA manages replicas
  - group: ""
    kind: Service
    jsonPointers:
    - /spec/clusterIP  # Kubernetes assigns this automatically
Q

My tests pass locally but fail in GitHub Actions?

A

Environment differences. Common issues:

  1. Node.js version - Pin versions: node-version: '20.x'
  2. Dependencies - Use npm ci instead of npm install
  3. Timezone - Tests failing due to date/time assumptions
  4. File system permissions - Linux vs macOS/Windows differences
  5. Environment variables - Missing secrets or config

Add debugging:

- name: Debug environment
  run: |
    node --version
    npm --version
    pwd
    ls -la
    env | sort
Q

How do I secure my GitOps pipeline?

A

Security checklist:

  1. Branch protection - Require reviews for config repo changes
  2. RBAC - Limit ArgoCD permissions with Kubernetes roles
  3. Network policies - Restrict pod-to-pod communication
  4. Image scanning - Scan images for vulnerabilities before deployment
  5. Secrets management - Use external secret operators
  6. Supply chain - Pin action versions to SHAs, not tags
## Pin to specific commit
- uses: actions/checkout@8ade135a41bc03ea155e62e844d188df1ea18608  # v4.1.0
Q

OIDC authentication failing with AWS?

A

Trust policy syntax is picky. Use this exact format:

{
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:myorg/myrepo:ref:refs/heads/main"
  }
}

Common mistakes:

  • Using sts.aws.com instead of sts.amazonaws.com
  • Wrong subject format - must be repo:org/repo:ref:refs/heads/branch
  • Typo in repository name (case sensitive)
Q

How much will this setup actually cost?

A

GitHub Actions: $0.008/minute for private repos. Public repos are free.

Container Registry:

  • GitHub Container Registry: Free for public, $0.50/GB/month private
  • AWS ECR: $0.10/GB/month
  • Docker Hub: Free tier, then $5/month

Kubernetes Cluster:

  • Local (k3d/kind): Free but not production-ready
  • EKS/GKE/AKS: ~$72/month for control plane + node costs
  • DigitalOcean Kubernetes: $12/month minimum

ArgoCD: Free and open source. Only costs cluster resources.

This stuff ain't free. Expect around $200/month, maybe more if AWS decides to fuck you with data transfer charges.

Q

Should I use this for a team of 2 developers?

A

Probably not worth the pain. GitOps makes sense when you have:

  • Multiple environments (dev/staging/prod)
  • Multiple team members making changes
  • Compliance requirements for change tracking
  • Complex multi-service applications

For 2-person teams, consider:

  • GitHub Actions + simple deployment (kubectl apply, sketch below)
  • Heroku/Railway/Render for simpler hosting
  • Docker Compose on a VPS

Don't torture yourself with this unless you're already stuck with Kubernetes.
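
If that's your situation, the whole "pipeline" can be one GitHub Actions job. A rough sketch, assuming a KUBECONFIG secret and plain manifests in a k8s/ directory:

  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Apply manifests
        run: |
          echo "${{ secrets.KUBECONFIG }}" > kubeconfig
          KUBECONFIG=./kubeconfig kubectl apply -f k8s/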

Deployment Approach Comparison

| Approach | Setup Complexity | Maintenance Effort | Rollback Speed | Audit Trail | Best For |
|---|---|---|---|---|---|
| GitOps + ArgoCD | High (4-8 hours) | Low once configured | Seconds (Git revert) | Complete Git history | Teams with Kubernetes, compliance needs |
| GitHub Actions Direct Deploy | Low (1-2 hours) | Medium ongoing | Minutes (manual) | GitHub Actions logs | Small teams, simple applications |
| Jenkins + Scripts | Very High (days) | High (plugin maintenance) | Variable | Build logs only | Legacy environments, complex workflows |
| GitLab CI/CD | Medium (2-4 hours) | Low-Medium | Minutes | GitLab pipeline history | Teams already on GitLab |
| Heroku/Railway Deploy | Very Low (minutes) | Very Low | Minutes | Platform logs | Prototypes, simple web apps |
| Traditional FTP/SSH | Low | High (manual process) | Hours (manual) | None | Legacy systems (not recommended) |

🚀 GitOps Case Study: Terraform AKS & EC2 | DevSecOps CI/CD Pipeline with GitHub Actions & ArgoCD by Raghu The Security Expert

GitOps DevSecOps CI/CD Pipeline - Complete Implementation

This 45-minute video actually shows you how to build this stuff instead of just talking about it. Covers Terraform, ArgoCD, and GitHub Actions - basically what I just walked you through but with more clicking around.

Key timestamps:
- 0:00 - GitOps architecture overview and benefits
- 8:30 - Setting up infrastructure with Terraform
- 15:45 - Configuring GitHub Actions for CI
- 25:20 - ArgoCD installation and configuration
- 35:10 - End-to-end deployment demonstration
- 40:15 - Security scanning and compliance considerations

Why this video doesn't suck:
Shows the real implementation details that docs skip over, like dealing with secrets and fixing the shit that breaks during setup.

Watch: GitOps Case Study: Terraform AKS & EC2 | DevSecOps CI/CD Pipeline


The Real Shit That Breaks and How to Fix It Fast

ArgoCD Troubleshooting Dashboard

Look, GitOps isn't magic. Things break. Usually at the worst possible time. Here are the five clusterfucks I've debugged multiple times, along with the nuclear option fixes that actually work when you're getting paged at 3 AM and your CEO is asking why the site is down.

Five Ways GitOps Will Ruin Your Weekend (And How to Fix Them)

1. Registry Secrets Die and Take Everything With Them

Saturday morning, 7 AM. Coffee ready, weekend planned. Then Slack goes fucking ballistic: every pod is ImagePullBackOff. Your GitHub token expired overnight because GitHub's "never expires" tokens actually expire after a year. Liars.

3 AM Emergency Fix:

## First, see which pods are fucked
kubectl get pods -A | grep ImagePullBackOff

## Check what the actual error is (usually "unauthorized" or "forbidden")  
kubectl describe pod <some-broken-pod> -n production

## Generate new registry secret (copy-paste ready)
kubectl create secret docker-registry ghcr-secret \
  --docker-server=ghcr.io \
  --docker-username=$GITHUB_USERNAME \
  --docker-password=$GITHUB_TOKEN_THAT_ACTUALLY_WORKS \
  --docker-email=whatever@company.com \
  -n production --dry-run=client -o yaml | kubectl apply -f -

Permanent Solution:
Use External Secrets Operator to automatically refresh credentials from AWS Secrets Manager, HashiCorp Vault, or Azure Key Vault:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: github-registry-secret
spec:
  refreshInterval: 24h
  secretStoreRef:
    name: github-secret-store
    kind: SecretStore
  target:
    name: ghcr-secret
    type: kubernetes.io/dockerconfigjson

2. ArgoCD Decides It's Too Tired to Deploy

ArgoCD UI shows "Operation is taking too long" and just... gives up. This happens when your cluster is slower than ArgoCD's patience (default 3 minutes), or when you're trying to deploy a massive app with 47 microservices because your architect read about Netflix once.

Get ArgoCD to Actually Try:

## Make ArgoCD less impatient (increase timeout to 10 minutes)
kubectl patch configmap argocd-cm -n argocd --type merge \
  -p='{"data":{"timeout.reconciliation":"600s","timeout.hard.reconciliation":"0"}}'

## Kick ArgoCD in the ass (restart it)
kubectl rollout restart deployment/argocd-application-controller -n argocd
kubectl rollout restart deployment/argocd-server -n argocd

## Check it actually restarted and isn't stuck
kubectl get pods -n argocd | grep argocd-application-controller

Production Configuration:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "300s"  # 5 minutes instead of default 180s
  timeout.hard.reconciliation: "0"  # Disable hard timeout
  application.operation.timeout: "600s"  # 10 minutes for operations

3. Resource Quota Exceeded

Pods stuck in Pending status because the namespace hit resource limits. This kills deployments silently.

Diagnosis:

## Check resource usage
kubectl describe quota -n <namespace>
kubectl top pods -n <namespace> --sort-by=cpu
kubectl top pods -n <namespace> --sort-by=memory

## Check what's requesting resources
kubectl describe limitrange -n <namespace>

Emergency Fix:

## Temporarily increase quota
kubectl patch resourcequota compute-quota -n <namespace> --type merge -p='{"spec":{"hard":{"requests.cpu":"4","requests.memory":"8Gi","limits.cpu":"8","limits.memory":"16Gi"}}}'

4. GitHub Actions Rate Limiting

Builds fail with 403 Forbidden errors when making API calls to GitHub or pulling from registries.

Check Rate Limits:

## Ask the GitHub API how much budget you have left
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit

## Or eyeball your tokens in GitHub Settings > Developer settings > Personal access tokens

Solutions:

  • Authenticate API calls with GITHUB_TOKEN or a PAT - authenticated requests get a far higher limit than anonymous ones
  • Pull base images from GHCR or an authenticated registry account instead of anonymous Docker Hub
  • Cache dependencies and Docker layers so each run hits external services less often
  • If you're still burning through limits, look at self-hosted runners or a registry pull-through cache

5. Kubernetes Node Resource Exhaustion

The classic "everything was working, now nothing starts" scenario. Nodes ran out of CPU/memory.

Emergency Response:

## Check node status
kubectl top nodes
kubectl describe node <node-name>

## Find resource hogs
kubectl top pods --all-namespaces --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory

## Quick cleanup
kubectl delete pod <resource-heavy-pod> -n <namespace>

Advanced Production Configurations

Multi-Cluster ArgoCD Setup

For managing multiple Kubernetes clusters (dev/staging/prod) from single ArgoCD instance using ArgoCD cluster management. This approach scales to hundreds of clusters with proper resource tuning:

apiVersion: v1
kind: Secret
metadata:
  name: staging-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster
type: Opaque
stringData:
  name: staging
  server: YOUR_CLUSTER_API_ENDPOINT  # Replace with your cluster API server URL and port 6443
  config: |
    {
      "bearerToken": "...",
      "tlsClientConfig": {
        "insecure": false,
        "caData": "..."
      }
    }

Progressive Delivery with Argo Rollouts

Implement canary deployments for risk reduction using blue-green or progressive delivery patterns. Supports analysis runs with Prometheus, DataDog, or New Relic:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: my-app
spec:
  replicas: 10
  strategy:
    canary:
      steps:
      - setWeight: 20      # 20% traffic to new version
      - pause: {duration: 60s}
      - setWeight: 50      # 50% traffic
      - pause: {duration: 60s}
      - setWeight: 100     # Full rollout
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: myapp:latest

Automated Certificate Management

Use cert-manager for TLS certificate automation with Let's Encrypt, HashiCorp Vault, or AWS Certificate Manager. Integrates with ingress controllers for automatic certificate provisioning:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
    - http01:
        ingress:
          class: nginx

Monitoring and Alerting That Actually Helps

Essential Metrics to Track:

  • Deployment frequency and lead time (the DORA numbers management asks about)
  • Change failure rate and time to recovery
  • ArgoCD application health and sync status
  • CI pipeline duration and failure rate

AlertManager Rules:

groups:
- name: gitops-alerts
  rules:
  - alert: ArgoCD-App-Degraded
    expr: argocd_app_health_status{health_status!="Healthy"} == 1
    for: 5m
    annotations:
      summary: "ArgoCD application {{ $labels.name }} is degraded"
      
  - alert: GitHub-Actions-Failing
    expr: increase(github_actions_workflow_run_failures_total[15m]) > 3
    annotations:
      summary: "Multiple GitHub Actions workflow failures detected"

Security Hardening for Production

Network Policies for ArgoCD:

Implement Kubernetes Network Policies to restrict traffic flow. Use Calico or Cilium for advanced policy enforcement with egress controls:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: argocd-network-policy
  namespace: argocd
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: argocd
    ports:
    - protocol: TCP
      port: 8080

RBAC for ArgoCD Projects:

Configure Role-Based Access Control with AppProjects for multi-tenancy. Integrates with OIDC providers like Active Directory, Google OAuth, or GitHub Teams:

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  description: Production applications
  sourceRepos:
  - 'https://github.com/myorg/my-app-config'
  destinations:
  - namespace: production
    server: "REPLACE_WITH_YOUR_CLUSTER_API_ENDPOINT"  # Replace with your cluster API server URL
  clusterResourceWhitelist:
  - group: ''
    kind: Namespace
  - group: rbac.authorization.k8s.io
    kind: ClusterRole
  namespaceResourceWhitelist:
  - group: apps
    kind: Deployment
  - group: ''
    kind: Service

Performance Optimization

ArgoCD Performance Tuning:

Optimize ArgoCD for large-scale deployments with horizontal scaling and resource tuning. Monitor with Prometheus metrics and Grafana dashboards:

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Increase repo server replicas for faster Git operations
  reposerver.parallelism.limit: "20"
  
  # Enable Git LFS support
  reposerver.enable.git.lfs.support: "true"
  
  # Optimize resource tracking
  application.resourceTrackingMethod: "annotation"

GitHub Actions Performance:

Optimize CI performance with dependency caching, Docker layer caching, and matrix builds. Consider self-hosted runners for consistent performance:

- uses: actions/setup-node@v4
  with:
    node-version: '20'
    cache: 'npm'         # Critical for speed
    
- name: Cache Docker layers
  uses: actions/cache@v3
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-buildx-

The key to production success is proactive monitoring and having runbooks for common issues. Don't wait until 3am to figure out how to rollback a broken deployment - practice these procedures during normal business hours.

GitOps isn't about perfection. It's about having a system that doesn't completely fuck you when things break. Learn these patterns and you might actually sleep through the night occasionally.

Stuff That Doesn't Suck (Mostly)