Performance issues with Aqua Security aren't like normal app performance problems. When the security layer starts choking, everything downstream gets screwed. Here's what actually breaks and how to fix it.
Resource Starvation: The Silent Killer
Aqua's documentation claims 2GB RAM and 1 CPU core per node. That's complete bullshit - obviously written by someone who's never deployed this thing beyond a toy cluster. I've seen enforcer agents consume 8GB+ when scanning large container images or monitoring high-throughput applications.
Real resource requirements based on our deployments across 3 different environments:
- Small clusters (10-20 nodes): 4GB RAM, 2 CPU cores per enforcer - learned this after our first deployment got OOMKilled every 6 hours. We hit it during Black Friday weekend - worst possible timing
- Medium clusters (50-100 nodes): 6GB RAM, 3 CPU cores - found this out when scan bursts started choking our worker nodes
- Large clusters (200+ nodes): 8GB RAM, 4 CPU cores - discovered during a particularly brutal Monday morning deployment rush
And that's just the enforcer. The scanner pods need their own resources:
- Scanner pods: 8GB RAM minimum, 16GB if you're scanning images >2GB (like every Java app ever built)
- PostgreSQL: Start with 32GB RAM and 8 CPU cores. Yeah, seriously. I learned this the hard way when the "quick" security update took down prod for 6 hours.
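For reference, here's what the medium-cluster numbers look like as requests and limits on the enforcer containers - a sketch only, since the exact container name and values key depend on how you installed Aqua (Helm chart vs. raw manifests):
resources:
  requests:
    cpu: "3"        # medium-cluster baseline from the list above
    memory: 6Gi
  limits:
    cpu: "4"        # headroom up to the large-cluster numbers for scan bursts
    memory: 8Gi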
Monitor this stuff continuously - kubectl top for quick checks, your metrics stack for trends:
## Watch for memory pressure
kubectl top nodes --sort-by memory
kubectl top pods -n aqua --sort-by memory
## kubectl describe won't show CPU throttling (that lives in cAdvisor metrics like
## container_cpu_cfs_throttled_seconds_total) - but it will show OOMKills:
kubectl describe pods -n aqua | grep -B 5 OOMKilled
The Database Bottleneck Nobody Talks About
PostgreSQL becomes the chokepoint way before you hit container limits. Aqua stores scan results, policy configurations, and runtime events in Postgres. When your database starts thrashing, everything grinds to a halt.
Signs your database is dying:
- Scans taking >10 minutes for normal-sized images
- UI becoming unresponsive during peak scan hours
- pg_stat_activity showing queries stuck in a waiting state (see the query below)
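A quick way to check that from the database side - a sketch assuming you can reach the database with psql; the host, user, and database names are placeholders:
## Long-running and waiting queries (PostgreSQL 9.6+ reports wait_event instead of the old waiting flag)
psql -h <db-host> -U <aqua-db-user> -d <aqua-db> -c "
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;"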
PostgreSQL tuning that actually matters:
# Connection limits (actual pooling happens in PgBouncer below)
max_connections = 500
max_prepared_transactions = 500
# Memory settings for scan workloads
shared_buffers = 8GB
effective_cache_size = 24GB
work_mem = 256MB            # per sort/hash operation, per connection - watch total RAM with 500 connections
# Reduce WAL/checkpoint overhead
wal_buffers = 16MB
max_wal_size = 4GB          # replaces checkpoint_segments, which was removed in PostgreSQL 9.5
checkpoint_completion_target = 0.9
Connection pooling isn't optional - it's the difference between a working system and explaining to your CTO why security scans brought down the entire platform. Use PgBouncer or prepare for a very uncomfortable conversation:
## PgBouncer config for Aqua workloads
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100
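Those three settings live in pgbouncer.ini next to a [databases] entry that points at the real Postgres host - a minimal sketch, with host, database, and auth details as placeholders; Aqua then connects to PgBouncer on 6432 instead of hitting Postgres directly:
[databases]
<aqua-db> = host=<db-host> port=5432 dbname=<aqua-db>

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 100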
Network Latency: The 3AM Wake-up Call
Network issues manifest as timeouts, failed scans, and webhook failures. The enforcer agent needs constant connectivity to the management server, and any hiccup causes failures.
Common network problems:
- Cross-AZ latency: Enforcer in us-east-1a talking to DB in us-east-1c adds 2-5ms per request
- Egress filtering: Corporate firewalls blocking registry access during image pulls
- CNI plugin conflicts: Calico + Aqua sometimes clash on network policies
Debug network issues systematically:
## Test basic connectivity
kubectl exec -n aqua <enforcer-pod> -- ping <db-host>
kubectl exec -n aqua <enforcer-pod> -- telnet <registry> 443
## Check DNS resolution time
kubectl exec -n aqua <enforcer-pod> -- nslookup <registry>
## Test registry reachability/auth (don't count on a docker CLI inside the enforcer - hit the registry API instead)
## A 401 here just means auth is required; connection refused or timeouts are the real problem
kubectl exec -n aqua <enforcer-pod> -- curl -sv https://<registry>/v2/ -o /dev/null
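Hardened enforcer images often ship without ping, telnet, or even curl, so don't be surprised if these execs fail on missing tooling rather than the network. A throwaway debug pod covers that case - a sketch using the community nicolaka/netshoot image (the pod name is arbitrary):
## One-off debug pod in the aqua namespace; netshoot bundles ping, dig, nc, curl, etc.
kubectl run netdebug -n aqua --rm -it --image=nicolaka/netshoot -- bash
## From inside the pod:
nc -zv <db-host> 5432
dig <registry>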
Scanner Pod Placement Anti-Patterns
Default Kubernetes scheduling spreads scanner pods randomly across nodes. This creates resource contention and inconsistent performance.
Bad: Letting scanner pods land on the same nodes as your application workloads
Good: Dedicated scanner nodes with appropriate resource allocation
Node affinity for scanner pods:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      preference:
        matchExpressions:
        - key: workload-type
          operator: In
          values: ["security-scanning"]
Or use node selectors if you have dedicated scanner nodes:
nodeSelector:
  scanner-node: "true"
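A nodeSelector only pulls scanner pods onto those nodes; it doesn't keep application workloads off them. If the nodes really are dedicated, taint them and give the scanner pods a matching toleration - a sketch, with the node name as a placeholder:
## Label and taint the dedicated scanner nodes
kubectl label node <scanner-node-name> scanner-node=true
kubectl taint node <scanner-node-name> scanner-node=true:NoSchedule
And on the scanner pod spec:
tolerations:
- key: "scanner-node"
  operator: "Equal"
  value: "true"
  effect: "NoSchedule"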
Image Scanning Performance Reality Check
Large images (>5GB) will destroy your scanning performance. Java applications with massive dependency trees, ML model containers, and base images with every tool installed cause scanner pods to consume absurd resources and time.
Performance breakdown by image size (from our production data):
- <500MB: 30-60 seconds scan time - most base images fall here
- 500MB-2GB: 2-5 minutes scan time - typical Node.js and Python apps
- 2GB-5GB: 10-15 minutes scan time - Java apps with every dependency known to mankind
- >5GB: 20-45 minutes, often times out - ML models and legacy apps that include the entire internet. The default scanner settings held up fine until we hit 1000 containers, then everything exploded
Optimization strategies:
- Parallel scanning: Increase scanner replica count for large image volumes
- Registry caching: Use a registry cache/proxy to reduce download times
- Image layering: Scan base images separately, use layer caching
- Selective scanning: Skip scanning for known-good base images
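For the parallel-scanning route, scaling out is a one-liner once you know the scanner deployment's name (it varies by install, so look it up first):
kubectl get deploy -n aqua
kubectl scale deploy <scanner-deployment> -n aqua --replicas=6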
Scanner tuning for large images:
env:
- name: AQUA_SCANNER_TIMEOUT
  value: "1800"  # 30 minutes instead of the default 10
- name: AQUA_MAX_CONCURRENT_SCANS
  value: "3"     # reduce concurrent scans for large images
The Webhook Performance Trap
Admission controller webhooks add latency to every pod deployment. In busy clusters, this becomes a significant bottleneck.
Webhook performance monitoring:
## Check the configured webhook timeout
kubectl get validatingwebhookconfigurations aqua-admission-controller -o yaml | grep -i timeout
## Monitor admission latency
kubectl logs -n aqua <admission-controller-pod> | grep "admission review took"
Performance optimizations:
- Set explicit timeouts: timeoutSeconds: 30 (the API server's maximum; the default is 10)
- Use failure policy Ignore for non-critical environments
- Exclude system namespaces: kube-system, aqua, and monitoring namespaces (see the snippet below)
- Cache policy decisions when possible
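Here's what the timeout, failure policy, and namespace exclusions look like on the webhook configuration itself - a fragment only, with the webhook name as a placeholder; the namespaceSelector relies on the kubernetes.io/metadata.name label that exists on clusters running Kubernetes 1.21+:
webhooks:
- name: <aqua-webhook-name>
  failurePolicy: Ignore        # don't block deploys when the webhook is slow or down
  timeoutSeconds: 30
  namespaceSelector:
    matchExpressions:
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values: ["kube-system", "aqua", "monitoring"]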
Resource Quotas: The Hidden Performance Killer
Kubernetes resource quotas can throttle Aqua components without obvious errors. Scanner pods might be stuck in "Pending" state while you're debugging network issues.
Check for quota limitations:
kubectl describe resourcequota -n aqua
kubectl get events -n aqua --sort-by='.lastTimestamp' | grep -i quota
Set realistic quotas for security scanning workloads:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: aqua-security-quota
spec:
  hard:
    requests.cpu: "50"            # 50 CPU cores for scanning workloads
    requests.memory: "200Gi"      # 200GB RAM for large image scanning
    persistentvolumeclaims: "10"
Don't just throw resources at the problem - monitor and tune based on actual usage patterns. But also don't be stingy with resource limits when security is involved. A failed security scan is worse than spending an extra $100/month on compute.