I've been debugging RHACS performance disasters since version 3.68, through 4.8's Scanner V4 improvements that still crash on fat images, and now into RHACS 4.9 which is slightly less broken. Red Hat's sizing guidelines are complete fiction. They assume you're running Hello World containers and scanning them during your lunch break. Check out real deployment horror stories and you'll see why half the GitHub issues are about performance.
What Actually Kills RHACS Performance (From Someone Who's Fixed It 50 Times)
RHACS 4.8's Scanner V4 architecture is better than the legacy scanner, but it still chokes on real production workloads. Check the Red Hat community forums for war stories. Here's what breaks first (in order of pain):
- PostgreSQL Death Spiral: Your Central database will explode from 50GB to 500GB in 6 months because nobody told you about retention policies. Query performance goes to shit after 100GB without proper indexes, and Red Hat's defaults are garbage.
- Scanner V4 OOMKilled: A single 4GB Docker image with 47 layers will spike Scanner memory from 2GB to 14GB instantly, then crash. ClairCore scanning patterns are predictably shitty. I've watched Scanner V4 OOMKill itself trying to scan bloated Node.js images with 200MB of node_modules that nobody cleaned up.
- Network Saturation: Each Sensor phones home every 30 seconds like a needy teenager. With 100 clusters, that's nearly 300K check-ins per day even when absolutely nothing is happening. Watch this with Prometheus or your network will hate you.
- Storage IOPS Starvation: Scanner V4 hammers PostgreSQL for vulnerability data. Put this shit on gp2 storage and watch scan times crawl from 30 seconds to 10 fucking minutes per image (a gp3 StorageClass sketch follows this list).
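If you're on AWS, the cheapest fix for that last one is getting the Central DB volume off gp2. Here's a minimal StorageClass sketch using the EBS CSI driver; the name and the IOPS/throughput numbers are mine, not Red Hat's, so size them against your actual scan volume:

```yaml
# gp3 with provisioned IOPS instead of gp2's burst credits, via the AWS EBS CSI driver.
# Name and numbers are illustrative; tune iops/throughput to what Scanner V4 actually pulls.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rhacs-db-gp3              # hypothetical name; reference it from the central-db PVC
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "6000"                    # gp3 baseline is 3000; vulnerability lookups chew through that
  throughput: "500"               # MiB/s
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```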
The Only RHACS Metrics That Actually Matter (From Someone Who's Been Paged Too Many Times)
RHACS Prometheus integration pukes out 50+ metrics via ServiceMonitors, but only a handful actually predict when shit breaks. The OpenShift monitoring stack works with Grafana dashboards, but the defaults are useless. Set up AlertManager rules for these (a PrometheusRule sketch follows the list) or enjoy getting paged:
```
## Metrics that actually predict outages:
stackrox_central_db_connections_active                   # When this hits 80, you're fucked
stackrox_scanner_queue_length                             # Queue >50 means Scanner is drowning
stackrox_scanner_image_scan_duration_seconds              # >300s per image = storage problem
stackrox_sensor_last_contact_time                         # Sensor disconnects predict Central death
stackrox_admission_controller_request_duration_seconds    # >1s breaks CI/CD
```
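Here's a minimal PrometheusRule sketch wired to those thresholds. The metric names come straight from the list above, so verify them against what your RHACS version actually exposes before trusting the alerts; the alert names, durations, and severities are my own defaults:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhacs-performance           # hypothetical name
  namespace: stackrox
spec:
  groups:
    - name: rhacs.performance
      rules:
        - alert: CentralDBConnectionsHigh
          expr: stackrox_central_db_connections_active > 80
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Central is close to exhausting its PostgreSQL connection pool"
        - alert: ScannerQueueBacklog
          expr: stackrox_scanner_queue_length > 50
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Scanner V4 is drowning; scans will start timing out"
        - alert: ImageScansTooSlow
          # If this metric is exported as a histogram, alert on
          # histogram_quantile over the _bucket series instead.
          expr: stackrox_scanner_image_scan_duration_seconds > 300
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Image scans taking over 5 minutes; check DB storage IOPS"
```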
PostgreSQL Reality Check:
RHACS 4.8 uses PostgreSQL 15, but Red Hat's tuning is optimized for fucking demos, not real workloads that actually matter. Run pgbench and check pg_stat_statements or you're flying blind. Here's what you monitor before the database kills your entire CI/CD pipeline:
```sql
-- Find which tables are eating your disk space
SELECT
    tablename,
    pg_size_pretty(pg_total_relation_size('public.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size('public.'||tablename) DESC
LIMIT 5;
-- alerts table will be 90% of your database

-- Find slow queries before they kill everything
-- (needs the pg_stat_statements extension loaded and created)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000  -- anything over 1s is trouble
ORDER BY mean_exec_time DESC;
```
Scanner V4: Reality vs Red Hat's Marketing Bullshit
I've run dive on thousands of production images to figure out exactly what kills Scanner V4. Docker Hub stats show average image sizes tripled since 2020, but Red Hat still demos with tiny Alpine images. Meanwhile your ML team pushes 8GB Python monsters with 500 layers from conda bullshit:
- RHEL UBI Images: 30s scan time, 2GB RAM ✅ (Red Hat's docs are actually right for once)
- Node.js with node_modules: 300s scan time, 8GB RAM ❌ (Red Hat claims 2-5 minutes, but npm audit reveals why that's horseshit)
- ML/AI Images: 900s+ scan time, 16GB+ RAM ❌ (OOM kills Scanner every fucking time, see TensorFlow base images)
- Multi-arch Images: Scanner V4 tries to scan both architectures at once like an idiot, doubling memory usage per manifest list
Stop Using Red Hat's Capacity Calculator (It's Completely Wrong)
Step 1: Measure Your Actual Workload (Not Red Hat's Examples)
Red Hat's capacity calculator assumes you're scanning Hello World containers during a fucking demo. Your Jenkins builds fat 400MB Spring Boot apps with 80 layers of Maven dependency hell. Slight difference. Here's how to actually baseline without the marketing bullshit, using real Kubernetes monitoring and metrics-server:
```bash
## Check if Central is about to die
kubectl top pods -n stackrox | grep central
## Memory >80% = upgrade time

## Count restart loops (Scanner crashes a lot)
kubectl get pods -n stackrox -o wide | grep -E "(Restart|Error|OOMKilled)"

## Database size reality check
## (central-db runs as a Deployment, so target it by name instead of guessing the pod hash)
kubectl exec -n stackrox deploy/central-db -- psql -U postgres -d central -c "
SELECT
  pg_size_pretty(pg_database_size('central')) AS db_size,
  (SELECT count(*) FROM alerts) AS alerts_total;"
## If alerts > 100K, start deleting old data
```
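If you'd rather not remember to run that by hand, here's a sketch of a CronJob that fires the same size check every morning. The central-db-password secret name and the central-db.stackrox.svc hostname are assumptions based on a standard Helm install; check what your deployment actually uses:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rhacs-db-size-check         # hypothetical name
  namespace: stackrox
spec:
  schedule: "0 6 * * *"             # every morning before standup
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: psql
              image: postgres:15-alpine
              env:
                - name: PGPASSWORD
                  valueFrom:
                    secretKeyRef:
                      name: central-db-password   # assumption: secret name from a Helm install
                      key: password
              command:
                - sh
                - -c
                - |
                  psql -h central-db.stackrox.svc -U postgres -d central -c "
                    SELECT pg_size_pretty(pg_database_size('central')) AS db_size,
                           (SELECT count(*) FROM alerts) AS alerts_total;"
```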
Step 2: Break RHACS Before Production Does
Load testing RHACS is like torture testing: you want it to fail in staging, not during your Tuesday morning deployment. K6 on Kubernetes works well for this, especially with the k6 operator (a TestRun sketch follows the list below). Also try Locust for Python-based testing or Artillery for Node.js workloads:
- Scanner Overload Test: Push 50 parallel image scans and watch Scanner V4 OOM kill itself
- Admission Controller Stress: Deploy 100 pods/minute and measure when latency goes from 50ms to 2s
- Database Death Test: Run compliance scans while Central is processing 1000 policy violations
- Network Saturation: Restart all Sensors simultaneously, measure Central recovery time
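For the scanner overload test, here's a hedged sketch using the k6 operator's TestRun resource. The ConfigMap and the load.js script inside it are yours to write (point it at whatever Central endpoint your pipeline actually exercises); the field names are from the grafana/k6-operator v1alpha1 API as I last used it, and older operator versions call this kind K6 instead of TestRun:

```yaml
# Fan a k6 scan-load script out across 10 runner pods.
apiVersion: k6.io/v1alpha1
kind: TestRun
metadata:
  name: rhacs-scanner-overload
  namespace: stackrox
spec:
  parallelism: 10                   # 10 runner pods, each executing the same script
  script:
    configMap:
      name: rhacs-scan-load         # hypothetical ConfigMap holding your k6 script
      file: load.js
  runner:
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
```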
Step 3: Real Resource Requirements (Not Red Hat Fantasy Numbers)
After breaking RHACS in 12 different ways (from GKE to EKS to bare metal kubeadm clusters), here's what you actually need to not get fired:
- Central CPU: 2 vCPU per 10 active clusters (Red Hat says 50 clusters per CPU - complete clown shit 🤡)
- Central Memory: 4GB baseline + 1GB per 10 clusters + 2GB per 50K policy violations
- Scanner Memory: 8GB baseline + 4GB per concurrent scan (large TensorFlow images need 16GB each)
- Database Storage: 100GB baseline + 5GB per cluster per month (alerts table grows fast)
- Network: 100Kbps per cluster baseline (policy updates spike to 2Mbps per cluster)
RHACS Performance Tuning That Actually Fucking Works
Central Optimization (Because Red Hat's Defaults Are Garbage):
Red Hat's tuning guide is barely adequate, but here's the shit they don't mention. Check PostgreSQL performance docs and Kubernetes resource limits for the real story:
```yaml
## Central config that won't crash in production
apiVersion: apps/v1
kind: Deployment
metadata:
  name: central
  namespace: stackrox
spec:
  selector:
    matchLabels:
      app: central
  template:
    metadata:
      labels:
        app: central
    spec:
      containers:
        - name: central
          resources:
            limits:
              memory: "64Gi"        # 32GB dies under load
              cpu: "32000m"         # CPU is cheap, downtime is expensive
            requests:
              memory: "32Gi"        # Start big, you'll need it
              cpu: "16000m"
          env:
            - name: ROX_POSTGRES_MAX_OPEN_CONNS
              value: "200"          # Default 20 causes connection exhaustion
                                    # (https://www.postgresql.org/docs/15/runtime-config-connection.html)
            - name: ROX_POSTGRES_MAX_IDLE_CONNS
              value: "50"           # More idle connections = better performance
                                    # (https://golang.org/pkg/database/sql/#DB.SetMaxIdleConns)
            - name: ROX_POSTGRES_CONN_MAX_LIFETIME
              value: "900s"         # Shorter lifetime prevents connection leaks
                                    # (https://github.com/lib/pq/issues/766)
```
Scanner V4: Scale Wide, Not Tall (Or It Dies)
Horizontal scaling saves your ass when someone pushes a 6GB PyTorch container image. Use Kubernetes HPA with custom metrics (a sample HPA follows the Deployment below):
```yaml
## Scanner V4 that won't OOM every Tuesday
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner-v4
  namespace: stackrox
spec:
  replicas: 8                       # More replicas = faster recovery from crashes
  selector:
    matchLabels:
      app: scanner-v4
  template:
    metadata:
      labels:
        app: scanner-v4
    spec:
      containers:
        - name: scanner-v4
          resources:
            limits:
              memory: "24Gi"        # Large images need 16GB+ per scan
              cpu: "12000m"         # CPU helps with layer decompression
            requests:
              memory: "12Gi"
              cpu: "6000m"
          env:
            - name: ROX_SCANNER_V4_INDEXER_DATABASE_POOL_SIZE
              value: "50"           # Default 10 creates bottlenecks
            - name: ROX_SCANNER_V4_MATCHER_DATABASE_POOL_SIZE
              value: "30"
```
That's the baseline config that won't embarrass you. Scale up when your CI/CD pipeline starts choking because Scanner can't handle your team's 40 deployments per day and developers start complaining in Slack.