The Shit That Actually Breaks

Q

Why is the enforcer agent eating all my CPU?

A

Agent goes rogue consuming 80%+ CPU.

Happens most often after Kubernetes node restarts or network hiccups, usually around 3AM because that's when everything breaks.

Quick fix: Restart the DaemonSet pod on the affected node:

kubectl delete pod -n aqua <enforcer-pod-name>

Nuclear option: If it's screwed across multiple nodes:

kubectl rollout restart daemonset/aqua-agent -n aqua

Root cause: Memory leak in the network monitoring thread. Aqua fixed this in 6.2.1, but if you're running an older version (and who isn't, because upgrading means another weekend), you'll still hit it. I've been burned by this three times.

Q

Enforcer agent won't start - "failed to create runtime monitor"

A

Classic ARM64 node issue. The agent tries to load x86 kernel modules on ARM processors and completely craps out.

Immediate workaround:

nodeSelector:
  kubernetes.io/arch: amd64

Better fix: Upgrade to Aqua 6.5+, which has proper ARM64 support, or exclude ARM nodes from the DaemonSet entirely if you don't need them monitored. Note: version 6.2.0 has a memory leak, 6.2.1 fixes it, and 6.3.0 breaks ARM64 support again.

Q

"PANIC: runtime error: invalid memory address" in agent logs

A

Memory corruption, usually from resource limits being too low.

Fix now: Bump the memory limits in your DaemonSet:

resources:
  limits:
    memory: "4Gi"    # Instead of their suggested 2Gi
  requests:
    memory: "2Gi"    # Instead of 1Gi

Time to fix: 2 minutes. Time you spent debugging it while standing in your kitchen at 3AM googling error messages: 3 hours. Lost a weekend to this bug because their documentation doesn't mention memory requirements for shit.

Q

Images failing scan with "connection timed out" errors

A

Registry connectivity from the scanner pods.

Happens a lot with private registries behind a VPN or with poorly configured network policies.

Debug first:

kubectl exec -n aqua <scanner-pod> -- nslookup registry.example.com
kubectl exec -n aqua <scanner-pod> -- curl -I https://docker.io

Common fixes:

  • Add registry URLs to network policy egress rules (see the sketch below)
  • Increase the scanner timeout: AQUA_SCANNER_TIMEOUT=600
  • For AWS ECR: make sure the IAM role has ecr:GetAuthorizationToken
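For the egress rule fix, here's a minimal sketch for the scanner pods. It assumes a plain Kubernetes NetworkPolicy (which matches IPs and ports, not hostnames) and an app: scanner pod label, so adjust both to your install:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aqua-scanner-egress        # name is an assumption
  namespace: aqua
spec:
  podSelector:
    matchLabels:
      app: scanner                 # match your actual scanner pod labels
  policyTypes:
  - Egress
  egress:
  - ports:                         # HTTPS to the registry; pin a destination CIDR if you can
    - protocol: TCP
      port: 443
  - ports:                         # DNS, or the nslookup test above keeps failing
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53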

Q

PostgreSQL connection errors killing scans

A

"dial tcp: connect: connection refused" means the database is either down or rejecting connections.Emergency triage:sqlSELECT count(*) FROM pg_stat_activity WHERE state = 'active';If you see 100+ active connections, PostgreSQL is choking. Default max_connections is 100, which is laughable for production.Fix in postgresql.conf:max_connections = 500shared_buffers = 4GBwork_mem = 64MBRestart PostgreSQL (yes, downtime, deal with it).

Q

Admission controller webhook timing out deployments

A

Your deployments hang in "Pending" state because the webhook takes forever to respond.

Quick bypass (if you need to deploy NOW):

kubectl label namespace <your-namespace> aqua-security=disabled

Proper fix: Tune the webhook timeout and failure policy:

admissionReviewVersions:
- v1
failurePolicy: Ignore    # Instead of Fail
timeoutSeconds: 30       # Instead of the default 10

Yeah, failurePolicy: Ignore defeats the entire fucking point of having security policies, but it beats explaining to your manager why the production deployment failed because the security scanner took a coffee break.

Q

"Error: failed to start profiler" spam in logs

A

The profiler can't bind to the metrics port, usually because of port conflicts or security policies.

Shut it up: Disable profiling if you don't need it:

AQUA_ENABLE_PROFILING=false

Or fix the port conflict: Check what's using port 6060 and kill it:

netstat -tulpn | grep 6060

Q

Memory usage keeps climbing until nodes die

A

Classic memory leak.

Happens with older enforcer versions when scanning lots of large images.

Immediate relief: Set memory limits and a restart policy:

containers:
- name: enforcer
  resources:
    limits:
      memory: "8Gi"
restartPolicy: Always

Monitor it:

kubectl top pod -n aqua --sort-by memory

If a pod is consistently using >4GB, it's leaking. Restart it.

Q

Scans work in dev but fail in prod

A

Usually network policies, security contexts, or resource constraints that don't exist in dev.

Check the security context first:

kubectl describe pod <scanner-pod> | grep -i security

Common prod differences:

  • Pod Security Standards blocking privileged containers (see the namespace label sketch below)
  • Network policies blocking registry access
  • Resource quotas limiting scanner pods
  • Different service account permissions

Copy the working dev config and adapt it; don't try to debug what's different. And don't use Kubernetes 1.24 with Aqua 6.1: the admission controller crashes.
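For the Pod Security Standards case, a minimal sketch assuming you're on the built-in Pod Security Admission and that the Aqua pods genuinely need privileged access (verify that against your own policy first):

## Let the aqua namespace run privileged pods
kubectl label namespace aqua \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged --overwrite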

The Real Performance Problems (And How to Actually Fix Them)

Monitoring Performance Dashboard

Performance issues with Aqua Security aren't like normal app performance problems. When the security layer starts choking, everything downstream gets screwed. Here's what actually breaks and how to fix it.

Resource Starvation: The Silent Killer

Aqua's documentation claims 2GB RAM and 1 CPU core per node. That's complete bullshit - obviously written by someone who's never deployed this thing beyond a toy cluster. I've seen enforcer agents consume 8GB+ when scanning large container images or monitoring high-throughput applications.

Real resource requirements based on our deployments across 3 different environments:

  • Small clusters (10-20 nodes): 4GB RAM, 2 CPU cores per enforcer - learned this after our first deployment OOMKilled every 6 hours. We hit this bug during Black Friday weekend - worst possible timing
  • Medium clusters (50-100 nodes): 6GB RAM, 3 CPU cores - found this out when scan bursts started choking our worker nodes
  • Large clusters (200+ nodes): 8GB RAM, 4 CPU cores - discovered during a particularly brutal Monday morning deployment rush

And that's just the enforcer. The scanner pods need their own resources:

  • Scanner pods: 8GB RAM minimum, 16GB if you're scanning images >2GB (like every Java app ever built)
  • PostgreSQL: Start with 32GB RAM and 8 CPU cores. Yeah, seriously. I learned this the hard way when the "quick" security update took down prod for 6 hours.
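To actually pin those numbers, set them on the enforcer DaemonSet (or in your Helm values). A sketch for a medium cluster; the exact field path depends on your chart:

resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "3"
    memory: 6Gi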

Monitor this stuff continuously using kubectl top and resource monitoring:

## Watch for memory pressure
kubectl top nodes --sort-by memory
kubectl top pods -n aqua --sort-by memory

## Check for CPU throttling
kubectl describe pod -n aqua | grep -A 5 -B 5 throttl

The Database Bottleneck Nobody Talks About

PostgreSQL Performance Tuning

PostgreSQL becomes the chokepoint way before you hit container limits. Aqua stores scan results, policy configurations, and runtime events in Postgres. When your database starts thrashing, everything grinds to a halt.

Signs your database is dying:

  • Scans taking >10 minutes for normal-sized images
  • UI becoming unresponsive during peak scan hours
  • pg_stat_activity showing queries stuck in "waiting" state

PostgreSQL tuning that actually matters:

-- Raise the connection ceiling (real pooling comes from PgBouncer, below)
max_connections = 500
max_prepared_transactions = 500

-- Memory settings for scan workloads
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 256MB

-- Reduce WAL overhead
wal_buffers = 16MB
max_wal_size = 4GB    -- replaces checkpoint_segments, which was removed in PostgreSQL 9.5
checkpoint_completion_target = 0.9

Connection pooling isn't optional - it's the difference between a working system and explaining to your CTO why security scans brought down the entire platform. Use PgBouncer or prepare for a very uncomfortable conversation:

## PgBouncer config for Aqua workloads
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
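For completeness, a sketch of the rest of a pgbouncer.ini pointing at the Aqua database; the host, port, and database name are assumptions, and auth settings are omitted, so fill those in for your environment:

[databases]
aqua = host=postgres.aqua.svc.cluster.local port=5432 dbname=aqua

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
; auth_type / auth_file intentionally left out - set them for your setup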

Network Latency: The 3AM Wake-up Call

Network issues manifest as timeouts, failed scans, and webhook failures. The enforcer agent needs constant connectivity to the management server, and any hiccup causes failures.

Common network problems:

  • Cross-AZ latency: Enforcer in us-east-1a talking to DB in us-east-1c adds 2-5ms per request
  • Egress filtering: Corporate firewalls blocking registry access during image pulls
  • CNI plugin conflicts: Calico + Aqua sometimes clash on network policies

Debug network issues systematically:

## Test basic connectivity
kubectl exec -n aqua <enforcer-pod> -- ping <db-host>
kubectl exec -n aqua <enforcer-pod> -- telnet <registry> 443

## Check DNS resolution time
kubectl exec -n aqua <enforcer-pod> -- nslookup <registry>

## Test registry auth (the Aqua pods usually don't ship a docker CLI, so hit the v2 endpoint instead)
kubectl exec -n aqua <enforcer-pod> -- curl -u <user>:<password> -I https://<registry>/v2/

Scanner Pod Placement Anti-Patterns

Default Kubernetes scheduling spreads scanner pods randomly across nodes. This creates resource contention and inconsistent performance.

Bad: Letting scanner pods land on the same nodes as your application workloads
Good: Dedicated scanner nodes with appropriate resource allocation

Node affinity for scanner pods:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload-type
          operator: In
          values: ["security-scanning"]

Or use node selectors if you have dedicated scanner nodes:

nodeSelector:
  scanner-node: "true"
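If those scanner nodes really are dedicated, taint them too so normal workloads stay off them. The taint key here is an assumption:

## Keep app workloads off the dedicated scanner nodes
kubectl taint nodes <scanner-node> dedicated=security-scanning:NoSchedule

Then give the scanner pods a matching toleration:

tolerations:
- key: dedicated
  operator: Equal
  value: security-scanning
  effect: NoSchedule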

Image Scanning Performance Reality Check

Large images (>5GB) will destroy your scanning performance. Java applications with massive dependency trees, ML model containers, and base images with every tool installed cause scanner pods to consume absurd resources and time.

Performance breakdown by image size (from our production data):

  • <500MB: 30-60 seconds scan time - most base images fall here
  • 500MB-2GB: 2-5 minutes scan time - typical Node.js and Python apps
  • 2GB-5GB: 10-15 minutes scan time - Java apps with every dependency known to mankind
  • >5GB: 20-45 minutes, often times out - ML models and legacy apps that include the entire internet

This whole setup worked fine until we hit around 1000 containers, then everything exploded.

Optimization strategies:

  1. Parallel scanning: Increase scanner replica count for large image volumes
  2. Registry caching: Use a registry cache/proxy to reduce download times (a pull-through cache sketch follows this list)
  3. Image layering: Scan base images separately, use layer caching
  4. Selective scanning: Skip scanning for known-good base images
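For the registry caching option, a minimal sketch using the open-source Docker registry as a pull-through cache in front of Docker Hub; if you run Harbor, ECR, or Artifactory, use their built-in proxy-cache feature instead:

## config.yml for a registry:2 pull-through cache
proxy:
  remoteurl: https://registry-1.docker.io
  # username/password only if you need authenticated pulls

Point the scanner's registry integration at the cache so repeat scans don't re-download every layer.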

Scanner tuning for large images:

env:
- name: AQUA_SCANNER_TIMEOUT
  value: "1800"  # 30 minutes instead of default 10
- name: AQUA_MAX_CONCURRENT_SCANS  
  value: "3"     # Reduce concurrent scans for large images

The Webhook Performance Trap

Admission controller webhooks add latency to every pod deployment. In busy clusters, this becomes a significant bottleneck.

Webhook performance monitoring:

## Check webhook response times
kubectl get validatingwebhookconfigurations aqua-admission-controller -o yaml | grep timeout

## Monitor admission latency
kubectl logs -n aqua <admission-controller-pod> | grep "admission review took"

Performance optimizations:

  • Raise the webhook timeout: timeoutSeconds: 30 (the default is 10; 30 is the maximum Kubernetes allows)
  • Use failurePolicy: Ignore for non-critical environments
  • Exclude system namespaces: kube-system, aqua, and your monitoring namespaces (see the namespaceSelector sketch after this list)
  • Cache policy decisions when possible
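A minimal sketch of the namespace exclusion, assuming the webhook is registered as a standard ValidatingWebhookConfiguration; the label key is an assumption, so use whatever selector your install supports:

webhooks:
- name: aqua-admission-controller
  timeoutSeconds: 30
  failurePolicy: Ignore            # only in non-critical environments
  namespaceSelector:
    matchExpressions:
    - key: aqua-webhook            # assumed label; set aqua-webhook=disabled on kube-system, aqua, monitoring
      operator: NotIn
      values: ["disabled"]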

Resource Quotas: The Hidden Performance Killer

Kubernetes resource quotas can throttle Aqua components without obvious errors. Scanner pods might be stuck in "Pending" state while you're debugging network issues.

Check for quota limitations:

kubectl describe resourcequota -n aqua
kubectl get events -n aqua --sort-by='.lastTimestamp' | grep -i quota

Set realistic quotas for security scanning workloads:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: aqua-security-quota
spec:
  hard:
    requests.cpu: "50"      # 50 CPU cores for scanning workloads
    requests.memory: "200Gi" # 200GB RAM for large image scanning
    persistentvolumeclaims: "10"

Don't just throw resources at the problem - monitor and tune based on actual usage patterns. But also don't be stingy with resource limits when security is involved. A failed security scan is worse than spending an extra $100/month on compute.

Error Codes and What They Actually Mean

| Error Message | Translation | Quick Fix | Time to Fix |
|---|---|---|---|
| failed to create runtime monitor | Kernel module conflicts, usually ARM64 | Add nodeSelector for amd64 nodes | 5 minutes |
| connection timeout scanning image | Registry unreachable or auth failed | Check network policies, verify credentials | 15 minutes |
| PANIC: runtime error: invalid memory | Agent memory limits too low | Increase memory limits to 4Gi+ | 2 minutes |
| admission webhook timed out | Webhook overloaded or network slow | Increase timeout, check webhook logs | 10 minutes |
| dial tcp: connect: connection refused | PostgreSQL rejecting connections | Check max_connections, restart PG | 30 minutes |
| enforcer agent consuming high CPU | Memory leak in network monitoring | Restart DaemonSet, upgrade to 6.2.1+ | 5 minutes |
| scanner pod stuck in Pending | Resource limits or node affinity issues | Check resource quotas, node selectors | 20 minutes |
| failed to pull image for scanning | Registry auth or network policy blocking | Fix image pull secrets, egress rules | 25 minutes |

Monitoring and Alerting: Don't Get Blindsided

Container Security Monitoring Dashboard

Setting up proper monitoring for Aqua Security isn't optional - it's the difference between catching issues before they become disasters and getting woken up at 3AM because nothing's been scanning for 6 hours. When it breaks, you want to know immediately, not when your deployment pipeline starts shitting itself or your security team notices containers aren't getting scanned.

Critical Metrics That Actually Matter

Forget the vendor-suggested metrics. These are the ones that'll save your ass, based on Prometheus best practices and Kubernetes monitoring:

Agent Health Metrics (monitor every 60 seconds) using kubectl top:

## Agent pod status across all nodes
kubectl get pods -n aqua -l app=aqua-agent -o wide | grep -v Running

## Memory usage trending upward (memory leak indicator)
kubectl top pods -n aqua -l app=aqua-agent --sort-by memory

## CPU usage spiking above 80% (performance degradation)
kubectl top pods -n aqua -l app=aqua-agent --sort-by cpu

Database Performance (critical for scan throughput) using PostgreSQL monitoring:

-- Active connection count (alert if >80% of max_connections)
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Long-running queries (alert if queries >5 minutes)
SELECT pid, query, state, query_start 
FROM pg_stat_activity 
WHERE query_start < NOW() - INTERVAL '5 minutes' 
AND state = 'active';

-- Database size growth (scan data accumulation)
SELECT pg_size_pretty(pg_database_size('aqua'));

Scan Pipeline Health:

## Scanner pods failing or stuck
kubectl get pods -n aqua -l app=scanner | grep -E '(Error|Pending|CrashLoop)'

## Registry connectivity from scanner pods
kubectl exec -n aqua <scanner-pod> -- curl -I --max-time 10 <registry-url>

## Admission controller response times
kubectl logs -n aqua <admission-controller> | grep "admission review took" | tail -20

Alerting Rules That Won't Wake You Up for Bullshit

Using Prometheus alerting and AlertManager:

High Priority Alerts (PagerDuty/oncall):

  • Agent pods down on >20% of nodes for >5 minutes
  • Admission controller webhook failing >50% of requests for >2 minutes
  • PostgreSQL connection count >90% for >3 minutes
  • Any pod OOMKilled in the aqua namespace (a sample Prometheus rule for this one is sketched after these lists)

Medium Priority Alerts (Slack/email):

  • Scanner pod stuck in pending for >10 minutes
  • Registry connectivity failures >25% for >5 minutes
  • Database query time >30 seconds average
  • Agent memory usage >6GB for >30 minutes

Low Priority Alerts (daily summary):

  • Scan queue depth trending upward
  • Storage usage growth rate for PostgreSQL
  • Network policy violations in monitored namespaces
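To show how one of these wires up, here's a sketch of the OOMKilled page from the high-priority list. It assumes kube-state-metrics is already scraped by your Prometheus; the rule and label names are otherwise up to you:

groups:
- name: aqua-security
  rules:
  - alert: AquaPodOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{namespace="aqua", reason="OOMKilled"} == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Aqua pod {{ $labels.pod }} was OOMKilled - check its memory limits"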

Emergency Response Runbook

When you get the 3AM page and your deployment pipeline is broken, follow this sequence. Don't think, just execute:

Step 1: Triage (2 minutes max)

## Overall cluster health
kubectl get nodes
kubectl get pods -n aqua

## Quick resource check
kubectl top nodes --sort-by memory
kubectl top pods -n aqua --sort-by memory

Step 2: Identify the blast radius

  • Single agent pod failing? → Node-specific issue
  • Multiple scanner pods failing? → Registry/network issue
  • Admission controller down? → Deployments blocked cluster-wide
  • Database connectivity issues? → Total service degradation

Step 3: Emergency mitigation (choose one)

For agent pod failures:

## Nuclear option: restart all agents
kubectl rollout restart daemonset/aqua-agent -n aqua

## Surgical option: restart specific failing pods
kubectl delete pod -n aqua <failing-pod-name>

For admission controller failures:

## Bypass security (temporary)
kubectl patch validatingwebhookconfiguration aqua-admission-controller \
  --type='json' -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
## (assumes the Aqua webhook is the first entry in the webhooks list)

For database connectivity using PostgreSQL administration:

## Kill long-running queries (active ones only, and not your own session)
sudo -u postgres psql -d aqua -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND pid <> pg_backend_pid() AND query_start < NOW() - INTERVAL '5 minutes';"

## Reload postgresql.conf after bumping max_connections (doesn't drop existing sessions)
sudo -u postgres psql -d aqua -c "SELECT pg_reload_conf();"

Capacity Planning: Avoiding the Next Disaster

Resource planning for security scanning is different from normal application workloads. Scan workloads are bursty, triggered by deployments and CI/CD pipelines.

Daily scan volume estimation:

  • Average image size in your registry × images scanned per day
  • Peak scan concurrency during deployment windows
  • Database storage growth rate (scan results retention)

Resource scaling triggers:

  • Scale scanner pods when scan queue depth >10 for >5 minutes (see the HPA sketch after this list)
  • Scale database when connection utilization >70% consistently
  • Add agent resources when memory usage trending >80% for >1 hour
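Scan queue depth isn't something Kubernetes knows about natively, so autoscaling on it means exporting it as a custom/external metric first. A sketch of the scaler side, assuming an autoscaling/v2 HPA and a hypothetical aqua_scan_queue_depth metric exposed through a metrics adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aqua-scanner
  namespace: aqua
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scanner                    # assumed deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: aqua_scan_queue_depth  # hypothetical metric from your adapter
      target:
        type: Value
        value: "10"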

Quarterly capacity review:

## Historical scan volume
kubectl logs -n aqua <scanner-pod> --since=720h | grep "scan completed" | wc -l

## Database growth rate  
SELECT schemaname,tablename,pg_size_pretty(size) FROM 
(SELECT schemaname,tablename,pg_relation_size(schemaname||'.'||tablename) as size 
FROM pg_tables WHERE schemaname='public') AS sizes ORDER BY size DESC;

## Node pressure during scan windows (kubectl top won't show network bandwidth;
## pull that from node-exporter or your cloud provider's metrics)
kubectl top nodes --sort-by memory

Log Aggregation: Making Debugging Suck Less

Using Kubernetes logging and centralized logging:

Ship all Aqua logs to your central logging system. When troubleshooting multi-component failures, you need correlated logs across agents, scanners, and database.

Essential log sources:

  • Enforcer agent pods (the DaemonSet on every node)
  • Scanner pods
  • Admission controller / webhook pods
  • The Aqua console / management server
  • PostgreSQL logs

Log parsing patterns for common issues:

## Memory leak detection (pod log dirs on the node look like /var/log/pods/<namespace>_<pod>_<uid>)
grep -rE "(killed|OOMKilled|memory)" /var/log/pods/aqua_*

## Network timeout patterns
grep -rE "(timeout|connection refused|dial tcp)" /var/log/pods/aqua_*

## Database connection issues
grep -rE "(connection|postgres|database)" /var/log/pods/aqua_*

The goal is to identify patterns before they become incidents. Memory usage trending upward, increasing scan times, or rising error rates are all early indicators of impending failures.

Don't wait for things to break completely. Set up monitoring, document your runbooks, and test your emergency procedures before you need them at 3AM while standing in your kitchen in your underwear trying to debug why nothing's deploying.
