The Shit That Actually Breaks

Q

Why is the enforcer agent eating all my CPU?

A

Agent goes rogue consuming 80%+ CPU.

Happens most often after Kubernetes node restarts or network hiccups, usually around 3AM because that's when everything breaks.

Quick fix: Restart the DaemonSet pod on the affected node:

kubectl delete pod -n aqua <enforcer-pod-name>

Nuclear option: If it's screwed across multiple nodes:

kubectl rollout restart daemonset/aqua-agent -n aqua

Root cause: Memory leak in the network monitoring thread. Aqua fixed this in 6.2.1, but if you're running an older version (and who isn't, because upgrading means another weekend), you'll still hit it. I've been burned by this three times.

Q

Enforcer agent won't start - "failed to create runtime monitor"

A

Classic ARM64 node issue. The agent tries to load x86 kernel modules on ARM processors and completely craps out.

Immediate workaround:

nodeSelector:
  kubernetes.io/arch: amd64

Better fix: Upgrade to Aqua 6.5+, which has proper ARM64 support, or exclude ARM nodes from the DaemonSet entirely if you don't need them monitored. Note: version 6.2.0 has a memory leak, 6.2.1 fixes it, and 6.3.0 breaks ARM64 support again.

Q

"PANIC: runtime error: invalid memory address" in agent logs

A

Memory corruption, usually from resource limits being too low.

Fix now: Bump the memory limits in your DaemonSet:

resources:
  limits:
    memory: "4Gi"    # Instead of their suggested 2Gi
  requests:
    memory: "2Gi"    # Instead of 1Gi

Time to fix: 2 minutes. Time you spent debugging it while standing in your kitchen at 3AM googling error messages: 3 hours. Lost a weekend to this bug because their documentation doesn't mention memory requirements for shit.

Q

Images failing scan with "connection timed out" errors

A

Registry connectivity from the scanner pods.

Happens a lot with private registries behind a VPN or with poorly configured network policies.

Debug first:

kubectl exec -n aqua <scanner-pod> -- nslookup registry.example.com
kubectl exec -n aqua <scanner-pod> -- curl -I https://docker.io

Common fixes:

  • Add registry URLs to network policy egress rules (see the sketch below)
  • Increase the scanner timeout: AQUA_SCANNER_TIMEOUT=600
  • For AWS ECR: make sure the IAM role has ecr:GetAuthorizationToken
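For the egress rule fix, here's a minimal sketch for the scanner pods. It assumes a plain Kubernetes NetworkPolicy (which matches IPs and ports, not hostnames) and an app: scanner pod label, so adjust both to your install:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: aqua-scanner-egress        # name is an assumption
  namespace: aqua
spec:
  podSelector:
    matchLabels:
      app: scanner                 # match your actual scanner pod labels
  policyTypes:
  - Egress
  egress:
  - ports:                         # HTTPS to the registry; pin a destination CIDR if you can
    - protocol: TCP
      port: 443
  - ports:                         # DNS, or the nslookup test above keeps failing
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53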

Q

PostgreSQL connection errors killing scans

A

"dial tcp: connect: connection refused" means the database is either down or rejecting connections.Emergency triage:sqlSELECT count(*) FROM pg_stat_activity WHERE state = 'active';If you see 100+ active connections, PostgreSQL is choking. Default max_connections is 100, which is laughable for production.Fix in postgresql.conf:max_connections = 500shared_buffers = 4GBwork_mem = 64MBRestart PostgreSQL (yes, downtime, deal with it).

Q

Admission controller webhook timing out deployments

A

Your deployments hang in "Pending" state because the webhook takes forever to respond.

Quick bypass (if you need to deploy NOW):

kubectl label namespace <your-namespace> aqua-security=disabled

Proper fix: Tune the webhook timeout and failure policy:

admissionReviewVersions:
- v1
failurePolicy: Ignore    # Instead of Fail
timeoutSeconds: 30       # Instead of the default 10

Yeah, failurePolicy: Ignore defeats the entire fucking point of having security policies, but it beats explaining to your manager why the production deployment failed because the security scanner took a coffee break.

Q

"Error: failed to start profiler" spam in logs

A

The profiler can't bind to the metrics port, usually because of port conflicts or security policies.

Shut it up: Disable profiling if you don't need it:

AQUA_ENABLE_PROFILING=false

Or fix the port conflict: Check what's using port 6060 and kill it:

netstat -tulpn | grep 6060

Q

Memory usage keeps climbing until nodes die

A

Classic memory leak.

Happens with older enforcer versions when scanning lots of large images.

Immediate relief: Set memory limits and a restart policy:

containers:
- name: enforcer
  resources:
    limits:
      memory: "8Gi"
restartPolicy: Always

Monitor it:

kubectl top pod -n aqua --sort-by memory

If a pod is consistently using >4GB, it's leaking. Restart it.

Q

Scans work in dev but fail in prod

A

Usually network policies, security contexts, or resource constraints that don't exist in dev.

Check the security context first:

kubectl describe pod <scanner-pod> | grep -i security

Common prod differences:

  • Pod Security Standards blocking privileged containers (see the namespace label sketch below)
  • Network policies blocking registry access
  • Resource quotas limiting scanner pods
  • Different service account permissions

Copy the working dev config and adapt it; don't try to debug what's different. And don't use Kubernetes 1.24 with Aqua 6.1: the admission controller crashes.
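For the Pod Security Standards case, a minimal sketch assuming you're on the built-in Pod Security Admission and that the Aqua pods genuinely need privileged access (verify that against your own policy first):

## Let the aqua namespace run privileged pods
kubectl label namespace aqua \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=privileged --overwrite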

The Real Performance Problems (And How to Actually Fix Them)

Monitoring Performance Dashboard

Performance issues with Aqua Security aren't like normal app performance problems. When the security layer starts choking, everything downstream gets screwed. Here's what actually breaks and how to fix it.

Resource Starvation: The Silent Killer

Aqua's documentation claims 2GB RAM and 1 CPU core per node. That's complete bullshit - obviously written by someone who's never deployed this thing beyond a toy cluster. I've seen enforcer agents consume 8GB+ when scanning large container images or monitoring high-throughput applications.

Real resource requirements based on our deployments across 3 different environments:

  • Small clusters (10-20 nodes): 4GB RAM, 2 CPU cores per enforcer - learned this after our first deployment OOMKilled every 6 hours. We hit this bug during Black Friday weekend - worst possible timing
  • Medium clusters (50-100 nodes): 6GB RAM, 3 CPU cores - found this out when scan bursts started choking our worker nodes
  • Large clusters (200+ nodes): 8GB RAM, 4 CPU cores - discovered during a particularly brutal Monday morning deployment rush

And that's just the enforcer. The scanner pods need their own resources:

  • Scanner pods: 8GB RAM minimum, 16GB if you're scanning images >2GB (like every Java app ever built)
  • PostgreSQL: Start with 32GB RAM and 8 CPU cores. Yeah, seriously. I learned this the hard way when the "quick" security update took down prod for 6 hours.
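To actually pin those numbers, set them on the enforcer DaemonSet (or in your Helm values). A sketch for a medium cluster; the exact field path depends on your chart:

resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "3"
    memory: 6Gi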

Monitor this stuff continuously using kubectl top and resource monitoring:

## Watch for memory pressure
kubectl top nodes --sort-by memory
kubectl top pods -n aqua --sort-by memory

## Check for CPU throttling
kubectl describe pod -n aqua | grep -A 5 -B 5 throttl

The Database Bottleneck Nobody Talks About

PostgreSQL Performance Tuning

PostgreSQL becomes the chokepoint way before you hit container limits. Aqua stores scan results, policy configurations, and runtime events in Postgres. When your database starts thrashing, everything grinds to a halt.

Signs your database is dying:

  • Scans taking >10 minutes for normal-sized images
  • UI becoming unresponsive during peak scan hours
  • pg_stat_activity showing queries stuck in "waiting" state

PostgreSQL tuning that actually matters:

-- Raise the connection ceiling (real pooling comes from PgBouncer, below)
max_connections = 500
max_prepared_transactions = 500

-- Memory settings for scan workloads
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 256MB

-- Reduce WAL overhead
wal_buffers = 16MB
max_wal_size = 4GB    -- replaces checkpoint_segments, which was removed in PostgreSQL 9.5
checkpoint_completion_target = 0.9

Connection pooling isn't optional - it's the difference between a working system and explaining to your CTO why security scans brought down the entire platform. Use PgBouncer or prepare for a very uncomfortable conversation:

## PgBouncer config for Aqua workloads
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
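For completeness, a sketch of the rest of a pgbouncer.ini pointing at the Aqua database; the host, port, and database name are assumptions, and auth settings are omitted, so fill those in for your environment:

[databases]
aqua = host=postgres.aqua.svc.cluster.local port=5432 dbname=aqua

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
; auth_type / auth_file intentionally left out - set them for your setup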

Network Latency: The 3AM Wake-up Call

Network issues manifest as timeouts, failed scans, and webhook failures. The enforcer agent needs constant connectivity to the management server, and any hiccup causes failures.

Common network problems:

  • Cross-AZ latency: Enforcer in us-east-1a talking to DB in us-east-1c adds 2-5ms per request
  • Egress filtering: Corporate firewalls blocking registry access during image pulls
  • CNI plugin conflicts: Calico + Aqua sometimes clash on network policies

Debug network issues systematically:

## Test basic connectivity
kubectl exec -n aqua <enforcer-pod> -- ping <db-host>
kubectl exec -n aqua <enforcer-pod> -- telnet <registry> 443

## Check DNS resolution time
kubectl exec -n aqua <enforcer-pod> -- nslookup <registry>

## Test registry auth (the Aqua pods usually don't ship a docker CLI, so hit the v2 endpoint instead)
kubectl exec -n aqua <enforcer-pod> -- curl -u <user>:<password> -I https://<registry>/v2/

Scanner Pod Placement Anti-Patterns

Default Kubernetes scheduling spreads scanner pods randomly across nodes. This creates resource contention and inconsistent performance.

Bad: Letting scanner pods land on the same nodes as your application workloads
Good: Dedicated scanner nodes with appropriate resource allocation

Node affinity for scanner pods:

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload-type
          operator: In
          values: ["security-scanning"]

Or use node selectors if you have dedicated scanner nodes:

nodeSelector:
  scanner-node: "true"
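If those scanner nodes really are dedicated, taint them too so normal workloads stay off them. The taint key here is an assumption:

## Keep app workloads off the dedicated scanner nodes
kubectl taint nodes <scanner-node> dedicated=security-scanning:NoSchedule

Then give the scanner pods a matching toleration:

tolerations:
- key: dedicated
  operator: Equal
  value: security-scanning
  effect: NoSchedule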

Image Scanning Performance Reality Check

Large images (>5GB) will destroy your scanning performance. Java applications with massive dependency trees, ML model containers, and base images with every tool installed cause scanner pods to consume absurd resources and time.

Performance breakdown by image size (from our production data):

  • <500MB: 30-60 seconds scan time - most base images fall here
  • 500MB-2GB: 2-5 minutes scan time - typical Node.js and Python apps
  • 2GB-5GB: 10-15 minutes scan time - Java apps with every dependency known to mankind
  • >5GB: 20-45 minutes, often times out - ML models and legacy apps that include the entire internet

This whole setup worked fine until we hit around 1000 containers, then everything exploded.

Optimization strategies:

  1. Parallel scanning: Increase scanner replica count for large image volumes
  2. Registry caching: Use a registry cache/proxy to reduce download times (a pull-through cache sketch follows this list)
  3. Image layering: Scan base images separately, use layer caching
  4. Selective scanning: Skip scanning for known-good base images
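For the registry caching option, a minimal sketch using the open-source Docker registry as a pull-through cache in front of Docker Hub; if you run Harbor, ECR, or Artifactory, use their built-in proxy-cache feature instead:

## config.yml for a registry:2 pull-through cache
proxy:
  remoteurl: https://registry-1.docker.io
  # username/password only if you need authenticated pulls

Point the scanner's registry integration at the cache so repeat scans don't re-download every layer.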

Scanner tuning for large images:

env:
- name: AQUA_SCANNER_TIMEOUT
  value: "1800"  # 30 minutes instead of default 10
- name: AQUA_MAX_CONCURRENT_SCANS  
  value: "3"     # Reduce concurrent scans for large images

The Webhook Performance Trap

Admission controller webhooks add latency to every pod deployment. In busy clusters, this becomes a significant bottleneck.

Webhook performance monitoring:

## Check webhook response times
kubectl get validatingwebhookconfigurations aqua-admission-controller -o yaml | grep timeout

## Monitor admission latency
kubectl logs -n aqua <admission-controller-pod> | grep "admission review took"

Performance optimizations:

  • Raise the webhook timeout: timeoutSeconds: 30 (the default is 10; 30 is the maximum Kubernetes allows)
  • Use failurePolicy: Ignore for non-critical environments
  • Exclude system namespaces: kube-system, aqua, and your monitoring namespaces (see the namespaceSelector sketch after this list)
  • Cache policy decisions when possible
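A minimal sketch of the namespace exclusion, assuming the webhook is registered as a standard ValidatingWebhookConfiguration; the label key is an assumption, so use whatever selector your install supports:

webhooks:
- name: aqua-admission-controller
  timeoutSeconds: 30
  failurePolicy: Ignore            # only in non-critical environments
  namespaceSelector:
    matchExpressions:
    - key: aqua-webhook            # assumed label; set aqua-webhook=disabled on kube-system, aqua, monitoring
      operator: NotIn
      values: ["disabled"]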

Resource Quotas: The Hidden Performance Killer

Kubernetes resource quotas can throttle Aqua components without obvious errors. Scanner pods might be stuck in "Pending" state while you're debugging network issues.

Check for quota limitations:

kubectl describe resourcequota -n aqua
kubectl get events -n aqua --sort-by='.lastTimestamp' | grep -i quota

Set realistic quotas for security scanning workloads:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: aqua-security-quota
spec:
  hard:
    requests.cpu: "50"      # 50 CPU cores for scanning workloads
    requests.memory: "200Gi" # 200GB RAM for large image scanning
    persistentvolumeclaims: "10"

Don't just throw resources at the problem - monitor and tune based on actual usage patterns. But also don't be stingy with resource limits when security is involved. A failed security scan is worse than spending an extra $100/month on compute.

Error Codes and What They Actually Mean

| Error Message | Translation | Quick Fix | Time to Fix |
|---|---|---|---|
| failed to create runtime monitor | Kernel module conflicts, usually ARM64 | Add nodeSelector for amd64 nodes | 5 minutes |
| connection timeout scanning image | Registry unreachable or auth failed | Check network policies, verify credentials | 15 minutes |
| PANIC: runtime error: invalid memory | Agent memory limits too low | Increase memory limits to 4Gi+ | 2 minutes |
| admission webhook timed out | Webhook overloaded or network slow | Increase timeout, check webhook logs | 10 minutes |
| dial tcp: connect: connection refused | PostgreSQL rejecting connections | Check max_connections, restart PG | 30 minutes |
| enforcer agent consuming high CPU | Memory leak in network monitoring | Restart DaemonSet, upgrade to 6.2.1+ | 5 minutes |
| scanner pod stuck in Pending | Resource limits or node affinity issues | Check resource quotas, node selectors | 20 minutes |
| failed to pull image for scanning | Registry auth or network policy blocking | Fix image pull secrets, egress rules | 25 minutes |

Monitoring and Alerting: Don't Get Blindsided

Container Security Monitoring Dashboard

Setting up proper monitoring for Aqua Security isn't optional - it's the difference between catching issues before they become disasters and getting woken up at 3AM because nothing's been scanning for 6 hours. When it breaks, you want to know immediately, not when your deployment pipeline starts shitting itself or your security team notices containers aren't getting scanned.

Critical Metrics That Actually Matter

Forget the vendor-suggested metrics. These are the ones that'll save your ass, based on Prometheus best practices and Kubernetes monitoring:

Agent Health Metrics (monitor every 60 seconds) using kubectl top:

## Agent pod status across all nodes
kubectl get pods -n aqua -l app=aqua-agent -o wide | grep -v Running

## Memory usage trending upward (memory leak indicator)
kubectl top pods -n aqua -l app=aqua-agent --sort-by memory

## CPU usage spiking above 80% (performance degradation)
kubectl top pods -n aqua -l app=aqua-agent --sort-by cpu

Database Performance (critical for scan throughput) using PostgreSQL monitoring:

-- Active connection count (alert if >80% of max_connections)
SELECT count(*) FROM pg_stat_activity WHERE state = 'active';

-- Long-running queries (alert if queries >5 minutes)
SELECT pid, query, state, query_start 
FROM pg_stat_activity 
WHERE query_start < NOW() - INTERVAL '5 minutes' 
AND state = 'active';

-- Database size growth (scan data accumulation)
SELECT pg_size_pretty(pg_database_size('aqua'));

Scan Pipeline Health:

## Scanner pods failing or stuck
kubectl get pods -n aqua -l app=scanner | grep -E '(Error|Pending|CrashLoop)'

## Registry connectivity from scanner pods
kubectl exec -n aqua <scanner-pod> -- curl -I --max-time 10 <registry-url>

## Admission controller response times
kubectl logs -n aqua <admission-controller> | grep "admission review took" | tail -20

Alerting Rules That Won't Wake You Up for Bullshit

Using Prometheus alerting and AlertManager:

High Priority Alerts (PagerDuty/oncall):

  • Agent pods down on >20% of nodes for >5 minutes
  • Admission controller webhook failing >50% of requests for >2 minutes
  • PostgreSQL connection count >90% for >3 minutes
  • Any pod OOMKilled in the aqua namespace (a sample Prometheus rule for this one is sketched after these lists)

Medium Priority Alerts (Slack/email):

  • Scanner pod stuck in pending for >10 minutes
  • Registry connectivity failures >25% for >5 minutes
  • Database query time >30 seconds average
  • Agent memory usage >6GB for >30 minutes

Low Priority Alerts (daily summary):

  • Scan queue depth trending upward
  • Storage usage growth rate for PostgreSQL
  • Network policy violations in monitored namespaces
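To show how one of these wires up, here's a sketch of the OOMKilled page from the high-priority list. It assumes kube-state-metrics is already scraped by your Prometheus; the rule and label names are otherwise up to you:

groups:
- name: aqua-security
  rules:
  - alert: AquaPodOOMKilled
    expr: kube_pod_container_status_last_terminated_reason{namespace="aqua", reason="OOMKilled"} == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Aqua pod {{ $labels.pod }} was OOMKilled - check its memory limits"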

Emergency Response Runbook

When you get the 3AM page and your deployment pipeline is broken, follow this sequence. Don't think, just execute:

Step 1: Triage (2 minutes max)

## Overall cluster health
kubectl get nodes
kubectl get pods -n aqua

## Quick resource check
kubectl top nodes --sort-by memory
kubectl top pods -n aqua --sort-by memory

Step 2: Identify the blast radius

  • Single agent pod failing? → Node-specific issue
  • Multiple scanner pods failing? → Registry/network issue
  • Admission controller down? → Deployments blocked cluster-wide
  • Database connectivity issues? → Total service degradation

Step 3: Emergency mitigation (choose one)

For agent pod failures:

## Nuclear option: restart all agents
kubectl rollout restart daemonset/aqua-agent -n aqua

## Surgical option: restart specific failing pods
kubectl delete pod -n aqua <failing-pod-name>

For admission controller failures:

## Bypass security (temporary)
kubectl patch validatingwebhookconfiguration aqua-admission-controller \
  --type='json' -p='[{"op":"replace","path":"/webhooks/0/failurePolicy","value":"Ignore"}]'
## (assumes the Aqua webhook is the first entry in the webhooks list)

For database connectivity using PostgreSQL administration:

## Kill long-running queries (active ones only, and not your own session)
sudo -u postgres psql -d aqua -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND pid <> pg_backend_pid() AND query_start < NOW() - INTERVAL '5 minutes';"

## Reload postgresql.conf after bumping max_connections (doesn't drop existing sessions)
sudo -u postgres psql -d aqua -c "SELECT pg_reload_conf();"

Capacity Planning: Avoiding the Next Disaster

Resource planning for security scanning is different from normal application workloads. Scan workloads are bursty, triggered by deployments and CI/CD pipelines.

Daily scan volume estimation:

  • Average image size in your registry × images scanned per day
  • Peak scan concurrency during deployment windows
  • Database storage growth rate (scan results retention)

Resource scaling triggers:

  • Scale scanner pods when scan queue depth >10 for >5 minutes (see the HPA sketch after this list)
  • Scale database when connection utilization >70% consistently
  • Add agent resources when memory usage trending >80% for >1 hour
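Scan queue depth isn't something Kubernetes knows about natively, so autoscaling on it means exporting it as a custom/external metric first. A sketch of the scaler side, assuming an autoscaling/v2 HPA and a hypothetical aqua_scan_queue_depth metric exposed through a metrics adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: aqua-scanner
  namespace: aqua
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: scanner                    # assumed deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: aqua_scan_queue_depth  # hypothetical metric from your adapter
      target:
        type: Value
        value: "10"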

Quarterly capacity review:

## Historical scan volume
kubectl logs -n aqua <scanner-pod> --since=720h | grep "scan completed" | wc -l

## Database growth rate  
SELECT schemaname,tablename,pg_size_pretty(size) FROM 
(SELECT schemaname,tablename,pg_relation_size(schemaname||'.'||tablename) as size 
FROM pg_tables WHERE schemaname='public') AS sizes ORDER BY size DESC;

## Node pressure during scan windows (kubectl top won't show network bandwidth;
## pull that from node-exporter or your cloud provider's metrics)
kubectl top nodes --sort-by memory

Log Aggregation: Making Debugging Suck Less

Using Kubernetes logging and centralized logging:

Ship all Aqua logs to your central logging system. When troubleshooting multi-component failures, you need correlated logs across agents, scanners, and database.

Essential log sources:

  • Enforcer agent pods (the DaemonSet on every node)
  • Scanner pods
  • Admission controller / webhook pods
  • The Aqua console / management server
  • PostgreSQL logs

Log parsing patterns for common issues:

## Memory leak detection (pod log dirs on the node look like /var/log/pods/<namespace>_<pod>_<uid>)
grep -rE "(killed|OOMKilled|memory)" /var/log/pods/aqua_*

## Network timeout patterns
grep -rE "(timeout|connection refused|dial tcp)" /var/log/pods/aqua_*

## Database connection issues
grep -rE "(connection|postgres|database)" /var/log/pods/aqua_*

The goal is to identify patterns before they become incidents. Memory usage trending upward, increasing scan times, or rising error rates are all early indicators of impending failures.

Don't wait for things to break completely. Set up monitoring, document your runbooks, and test your emergency procedures before you need them at 3AM while standing in your kitchen in your underwear trying to debug why nothing's deploying.
