Stop Guessing RHACS Resource Requirements - Here's What Actually Breaks

I've been debugging RHACS performance disasters since version 3.68, through 4.8's Scanner V4 improvements that still crash on fat images, and now into RHACS 4.9 which is slightly less broken. Red Hat's sizing guidelines are complete fiction. They assume you're running Hello World containers and scanning them during your lunch break. Check out real deployment horror stories and you'll see why half the GitHub issues are about performance.

What Actually Kills RHACS Performance (From Someone Who's Fixed It 50 Times)

RHACS 4.8's Scanner V4 architecture is better than the legacy scanner, but it still chokes on real production workloads. Check the Red Hat community forums for war stories. Here's what breaks first (in order of pain):

  • PostgreSQL Death Spiral: Your Central database will explode from 50GB to 500GB in 6 months because nobody told you about retention policies. Query performance goes to shit after 100GB without proper indexes, and Red Hat's defaults are garbage.
  • Scanner V4 OOMKilled: A single 4GB Docker image with 47 layers will spike Scanner memory from 2GB to 14GB instantly, then crash. ClairCore scanning patterns are predictably shitty. I've watched Scanner V4 OOMKill itself trying to scan bloated Node.js images with 200MB of node_modules that nobody cleaned up.
  • Network Saturation: Each Sensor phones home every 30 seconds like a needy teenager. With 100 clusters, that's 300K connections per hour even when absolutely nothing is happening. Watch this with Prometheus or your network will hate you.
  • Storage IOPS Starvation: Scanner V4 hammers PostgreSQL for vulnerability data. Put this shit on gp2 storage and watch scan times crawl from 30 seconds to 10 fucking minutes per image.
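
Before touching configs, put numbers on two of these failure modes. A quick sketch using standard cAdvisor and kube-state-metrics series that the cluster monitoring stack already scrapes; the pod regex and namespace are assumptions for a default install, so adjust them to yours:

## Bytes/s the Sensor pushes toward Central (network chatter per secured cluster)
sum(rate(container_network_transmit_bytes_total{namespace="stackrox", pod=~"sensor-.*"}[5m]))

## Scanner containers whose last termination was an OOM kill (kube-state-metrics)
sum(kube_pod_container_status_last_terminated_reason{namespace="stackrox", container=~"scanner.*", reason="OOMKilled"})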

The Only RHACS Metrics That Actually Matter (From Someone Who's Been Paged Too Many Times)

Metrics That Will Save Your Ass (Actually Tested Under Fire):

RHACS Prometheus integration pukes out 50+ metrics via ServiceMonitors, but only a handful actually predict when shit breaks. The OpenShift monitoring stack works with Grafana dashboards, but the defaults are useless. Set up AlertManager rules for these or enjoy getting paged:

## Metrics that actually predict outages:
stackrox_central_db_connections_active    # When this hits 80, you're fucked
stackrox_scanner_queue_length            # Queue >50 means Scanner is drowning
stackrox_scanner_image_scan_duration_seconds  # >300s per image = storage problem
stackrox_sensor_last_contact_time        # Sensor disconnects predict Central death
stackrox_admission_controller_request_duration_seconds  # >1s breaks CI/CD
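
Before wiring alerts to these, confirm your Prometheus actually scrapes them. A minimal sketch against the standard Prometheus HTTP API; the namespace and service name below are assumptions for a plain Prometheus Operator install (on OpenShift, query the platform monitoring stack instead):

## Port-forward to Prometheus and check that the scanner queue metric exists
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=stackrox_scanner_queue_length' | jq '.data.result | length'
## 0 means nothing is scraping RHACS yet - fix the ServiceMonitor before writing alert rules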

PostgreSQL Reality Check:

RHACS 4.8 uses PostgreSQL 15, but Red Hat's tuning is optimized for fucking demos, not real workloads that actually matter. Run pgbench and check pg_stat_statements or you're flying blind. Here's what you monitor before the database kills your entire CI/CD pipeline:

-- Find which tables are eating your disk space
SELECT 
  tablename,
  pg_size_pretty(pg_total_relation_size('public.'||tablename)) as size
FROM pg_tables 
WHERE schemaname = 'public' 
ORDER BY pg_total_relation_size('public.'||tablename) DESC
LIMIT 5;
-- alerts table will be 90% of your database

-- Find slow queries before they kill everything
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE mean_exec_time > 1000  -- Anything over 1s is trouble
ORDER BY mean_exec_time DESC;
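
The "no proper indexes" problem shows up as big tables being read with sequential scans. This sketch only reads the standard pg_stat_user_tables view, so it is safe to run as-is against Central's database:

-- Large tables that are mostly sequentially scanned (indexing candidates)
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
       seq_scan,
       idx_scan
FROM pg_stat_user_tables
WHERE seq_scan > idx_scan
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;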

Scanner V4: Reality vs Red Hat's Marketing Bullshit

I've run dive on thousands of production images to figure out exactly what kills Scanner V4. Docker Hub stats show average image sizes tripled since 2020, but Red Hat still demos with tiny Alpine images. Meanwhile your ML team pushes 8GB Python monsters with 500 layers from conda bullshit:

  • RHEL UBI Images: 30s scan time, 2GB RAM ✅ (Red Hat's docs are actually right for once)
  • Node.js with node_modules: 300s scan time, 8GB RAM ❌ (Red Hat claims 2-5 minutes, but npm audit reveals why that's horseshit)
  • ML/AI Images: 900s+ scan time, 16GB+ RAM ❌ (OOM kills Scanner every fucking time, see TensorFlow base images)
  • Multi-arch Images: Scanner V4 tries to scan both architectures at once like an idiot, doubles memory usage per manifest list
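
Before blaming Scanner V4, check how bad the image actually is. A sketch using skopeo and jq to count layers straight from the registry, without pulling anything; auth flags are omitted and the image name is just an example:

## Layer count for an image, no pull needed
IMAGE="docker.io/library/node:16"
skopeo inspect "docker://${IMAGE}" | jq '{image: .Name, layers: (.Layers | length)}'
## For a per-layer size breakdown, run dive against the image as mentioned above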

Stop Using Red Hat's Capacity Calculator (It's Completely Wrong)

Step 1: Measure Your Actual Workload (Not Red Hat's Examples)

Red Hat's capacity calculator assumes you're scanning Hello World containers during a fucking demo. Your Jenkins builds fat 400MB Spring Boot apps with 80 layers of Maven dependency hell. Slight difference. Here's how to actually baseline without the marketing bullshit, using real Kubernetes monitoring and metrics-server:

## Check if Central is about to die
kubectl top pods -n stackrox | grep central
## Memory >80% = upgrade time

## Count restart loops (Scanner crashes a lot)
kubectl get pods -n stackrox -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount,STATUS:.status.phase

## Database size reality check
kubectl exec -n stackrox central-db-0 -- psql -U postgres -d central -c "
SELECT 
  pg_size_pretty(pg_database_size('central')) as db_size,
  (SELECT count(*) FROM alerts) as alerts_total;"
## If alerts > 100K, start deleting old data

Step 2: Break RHACS Before Production Does

Load testing RHACS is like torture testing - you want it to fail in staging, not during your Tuesday morning deployment. K6 on Kubernetes works well for this, especially with the k6 operator. Also try Locust for Python-based testing or Artillery for Node.js workloads:

  • Scanner Overload Test: Push 50 parallel image scans and watch Scanner V4 OOM kill itself
  • Admission Controller Stress: Deploy 100 pods/minute and measure when latency goes from 50ms to 2s
  • Database Death Test: Run compliance scans while Central is processing 1000 policy violations
  • Network Saturation: Restart all Sensors simultaneously, measure Central recovery time
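
A sketch of the Scanner overload test above using roxctl, which is the same path your CI already exercises. It assumes ROX_API_TOKEN is exported and ROX_ENDPOINT points at Central (host:port); if your roxctl version supports forcing a re-scan, add that flag so cached results don't hide the load:

## Grab up to 50 images actually running in the cluster and scan them 10 at a time
kubectl get deployments -A \
  -o jsonpath='{range .items[*]}{.spec.template.spec.containers[*].image}{"\n"}{end}' \
  | tr ' ' '\n' | sort -u | head -50 \
  | xargs -P 10 -I {} roxctl image scan --image={} -e "$ROX_ENDPOINT" > /dev/null
## Watch stackrox_scanner_queue_length and Scanner pod memory while this runs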

Step 3: Real Resource Requirements (Not Red Hat Fantasy Numbers)

After breaking RHACS in 12 different ways (from GKE to EKS to bare metal kubeadm clusters), here's what you actually need to not get fired:

RHACS Performance Tuning That Actually Fucking Works

Central Optimization (Because Red Hat's Defaults Are Garbage):

Red Hat's tuning guide is barely adequate, but here's the shit they don't mention. Check PostgreSQL performance docs and Kubernetes resource limits for the real story:

## Central config that won't crash in production
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: central
  namespace: stackrox
spec:
  template:
    spec:
      containers:
      - name: central
        resources:
          limits:
            memory: "64Gi"    # 32GB dies under load
            cpu: "32000m"     # CPU is cheap, downtime is expensive
          requests:
            memory: "32Gi"    # Start big, you'll need it
            cpu: "16000m"
        env:
        - name: ROX_POSTGRES_MAX_OPEN_CONNS
          value: "200"        # Default 20 causes [connection exhaustion](https://www.postgresql.org/docs/15/runtime-config-connection.html)
        - name: ROX_POSTGRES_MAX_IDLE_CONNS  
          value: "50"         # More [idle connections](https://golang.org/pkg/database/sql/#DB.SetMaxIdleConns) = better performance
        - name: ROX_POSTGRES_CONN_MAX_LIFETIME
          value: "900s"       # Shorter lifetime prevents [connection leaks](https://github.com/lib/pq/issues/766)

Scanner V4: Scale Wide, Not Tall (Or It Dies)

Horizontal scaling saves your ass when someone pushes a 6GB PyTorch container image. Use Kubernetes HPA with custom metrics:

## Scanner V4 that won't OOM every Tuesday
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner-v4
  namespace: stackrox
spec:
  replicas: 8              # More replicas = faster recovery from crashes
  template:
    spec:
      containers:
      - name: scanner-v4
        resources:
          limits:
            memory: "24Gi"   # Large images need 16GB+ per scan
            cpu: "12000m"    # CPU helps with layer decompression
          requests:
            memory: "12Gi"
            cpu: "6000m"
        env:
        - name: ROX_SCANNER_V4_INDEXER_DATABASE_POOL_SIZE
          value: "50"        # Default 10 creates bottlenecks
        - name: ROX_SCANNER_V4_MATCHER_DATABASE_POOL_SIZE
          value: "30"

That's the baseline config that won't embarrass you. Scale up when your CI/CD pipeline starts choking because Scanner can't handle your team's 40 deployments per day and developers start complaining in Slack.

RHACS Performance Scaling Matrix by Deployment Size

| Deployment Scale | Central Resources (Actual) | Database Requirements | Scanner V4 Resources | Network Bandwidth | Monthly Cloud Cost Reality |
|------------------|----------------------------|-----------------------|----------------------|-------------------|----------------------------|
| 10-25 Clusters | 4 vCPU, 8GB RAM | 200GB+ SSD, 4 vCPU | 2 vCPU, 4GB RAM | 50-100 Mbps | $800-1,500/month |
| 25-50 Clusters | 8 vCPU, 16GB RAM | 500GB+ SSD, 8 vCPU | 4 vCPU, 8GB RAM | 100-200 Mbps | $1,500-3,000/month |
| 50-100 Clusters | 16 vCPU, 32GB RAM | 1TB+ NVMe, 16 vCPU | 8 vCPU, 16GB RAM | 200-500 Mbps | $3,000-6,000/month |
| 100-200 Clusters | 32 vCPU, 64GB RAM | 2TB+ NVMe, 32 vCPU | 16 vCPU, 32GB RAM | 500 Mbps-1 Gbps | $6,000-12,000/month |
| 200+ Clusters | Regional federation required | Multiple database instances | Distributed scanning | 1 Gbps+ per region | Contact sales (it's expensive) |

How to Monitor RHACS Before It Ruins Your Weekend

After getting paged at 2am because Scanner V4 shit the bed during a compliance scan, I learned monitoring RHACS isn't optional - it's survival. Prometheus monitoring with Grafana dashboards and RHACS telemetry can predict disasters hours before they happen. Set up AlertManager for actual alerts and use PagerDuty or Slack webhooks, but only if you watch metrics that actually matter from OpenShift monitoring.

Prometheus Alerts That Actually Work

Alerts That Will Save Your Ass:

RHACS dies slowly, then all at once like a fucking avalanche. These Prometheus AlertRules integrate with OpenShift ServiceMonitors and are based on watching 20+ production failures happen live. Use Runbook automation and Grafana annotations to document what actually predicts when everything goes to shit. Check Kubernetes Events and pod logs during incidents:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: rhacs-performance-monitoring
  namespace: stackrox
spec:
  groups:
  - name: rhacs.performance.critical
    rules:
    - alert: RHACSCentralMemoryHigh
      expr: |
        (
          container_memory_working_set_bytes{container=\"central\",namespace=\"stackrox\"} 
          / 
          container_spec_memory_limit_bytes{container=\"central\",namespace=\"stackrox\"}
        ) > 0.85
      for: 5m
      labels:
        severity: warning
        component: central
      annotations:
        summary: \"RHACS Central memory usage high\"
        description: \"Central memory usage is {{ $value | humanizePercentage }} of limit\"

    - alert: RHACSPostgreSQLSlowQueries
      expr: |
        pg_stat_activity_max_tx_duration{datname="stackrox"} > 300
      for: 2m
      labels:
        severity: warning
        component: database
      annotations:
        summary: \"RHACS PostgreSQL queries running longer than 5 minutes\"
        description: \"Long-running queries may indicate performance issues\"

    - alert: RHACSScannerV4QueueLength
      expr: |
        stackrox_scanner_queue_length > 50
      for: 10m
      labels:
        severity: warning
        component: scanner
      annotations:
        summary: \"Scanner V4 queue length high\"
        description: \"{{ $value }} images waiting to scan - consider scaling scanner\"

    - alert: RHACSSensorConnectivity
      expr: |
        (time() - stackrox_sensor_last_contact_time) > 300
      for: 5m
      labels:
        severity: critical
        component: sensor
      annotations:
        summary: \"RHACS Sensor offline\"
        description: \"Sensor {{ $labels.cluster }} offline for {{ $value | humanizeDuration }}\"

    - alert: RHACSAdmissionControllerLatency
      expr: |
        histogram_quantile(0.95, 
          rate(stackrox_admission_controller_request_duration_seconds_bucket[5m])
        ) > 0.5
      for: 3m
      labels:
        severity: warning
        component: admission_controller
      annotations:
        summary: \"RHACS admission controller high latency\"
        description: \"95th percentile latency is {{ $value }}s\"

Custom Performance Dashboards:

Grafana dashboards show you exactly when RHACS is about to eat shit. Based on Grafana dashboard best practices and too much operational pain:

{
  "dashboard": {
    "title": "RHACS Performance Overview",
    "panels": [
      {
        "title": "Central Resource Utilization",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(container_cpu_usage_seconds_total{container=\"central\",namespace=\"stackrox\"}[5m])",
            "legendFormat": "CPU Usage"
          },
          {
            "expr": "container_memory_working_set_bytes{container=\"central\",namespace=\"stackrox\"} / 1024 / 1024 / 1024",
            "legendFormat": "Memory Usage (GB)"
          }
        ]
      },
      {
        "title": "Scanner V4 Performance",
        "type": "graph",
        "targets": [
          {
            "expr": "stackrox_scanner_queue_length",
            "legendFormat": "Queue Length"
          },
          {
            "expr": "rate(stackrox_scanner_image_scan_duration_seconds_sum[5m]) / rate(stackrox_scanner_image_scan_duration_seconds_count[5m])",
            "legendFormat": "Average Scan Time"
          }
        ]
      },
      {
        "title": "Database Performance",
        "type": "graph",
        "targets": [
          {
            "expr": "pg_stat_database_tup_returned{datname=\"stackrox\"}",
            "legendFormat": "Rows Returned"
          },
          {
            "expr": "pg_stat_database_blks_read{datname=\"stackrox\"}",
            "legendFormat": "Disk Blocks Read"
          }
        ]
      }
    ]
  }
}

Load Testing and Benchmarking Framework

Systematic Load Testing Approach:

Load testing finds out how badly RHACS breaks before production does. Using K6 on Kubernetes and custom scripts that actually stress the right components:

#!/bin/bash
## RHACS Load Testing Framework
## Tests realistic workload patterns for capacity planning

CENTRAL_ENDPOINT=\"${RHACS_CENTRAL_ENDPOINT}\"
API_TOKEN=\"${RHACS_API_TOKEN}\"
TEST_DURATION=\"30m\"
RAMP_UP_TIME=\"5m\"

## Test 1: Concurrent Image Scanning Load
echo \"Starting concurrent image scanning test...\"
k6 run --vus 10 --duration $TEST_DURATION --ramp-up-duration $RAMP_UP_TIME - <<EOF
import http from 'k6/http';
import { check } from 'k6';

export default function() {
  const images = [
    'registry.redhat.io/ubi8/ubi:latest',
    'nginx:latest',
    'postgres:13',
    'node:16-alpine'
  ];
  
  const image = images[Math.floor(Math.random() * images.length)];
  const response = http.post('${CENTRAL_ENDPOINT}/v1/images/scan',
    JSON.stringify({ image: image }), {
    headers: {
      'Authorization': 'Bearer ${API_TOKEN}',
      'Content-Type': 'application/json'
    }
  });
  
  check(response, {
    'scan request successful': (r) => r.status === 200,
    'scan completes in reasonable time': (r) => r.timings.duration < 30000
  });
}
EOF

## Test 2: Policy Evaluation Load
echo \"Starting policy evaluation load test...\"
for i in {1..100}; do
  kubectl apply -f - <<YAML &
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-test-deployment-$i
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: load-test-$i
  template:
    metadata:
      labels:
        app: load-test-$i
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        resources:
          requests:
            memory: \"64Mi\"
            cpu: \"250m\"
          limits:
            memory: \"128Mi\"
            cpu: \"500m\"
YAML
done
wait    # let the backgrounded kubectl apply calls finish

## Monitor admission controller performance during load test
kubectl logs -n stackrox -l app=central --tail=100 | grep "admission"

## Test 3: Database Query Load
echo \"Starting database performance test...\"
kubectl exec -n stackrox central-db-0 -- pgbench -U postgres -d stackrox -c 10 -j 2 -T 300

echo \"Load testing complete. Analyze results with:\"
echo \"kubectl top pods -n stackrox\"
echo \"kubectl exec -n stackrox central-db-0 -- psql -U postgres -d stackrox -c 'SELECT * FROM pg_stat_activity;'\"

Performance Regression Testing:

Automated testing catches performance regressions during RHACS upgrades:

## GitLab CI performance regression testing
rhacs_performance_test:
  stage: test
  image: registry.redhat.io/rhel8/rhel:latest
  script:
    - dnf install -y postgresql curl bc   # psql client plus bc for the regression math; kubectl must already be on the runner
    - |
      # Baseline performance test
      BASELINE_SCAN_TIME=$(curl -s -w "%{time_total}" -o /dev/null \
        -X POST "${RHACS_CENTRAL_ENDPOINT}/v1/images/scan" \
        -H "Authorization: Bearer ${RHACS_API_TOKEN}" \
        -d '{"image": "nginx:latest"}')

      echo "Baseline scan time: ${BASELINE_SCAN_TIME}s"

      # Database query performance
      BASELINE_QUERY_TIME=$(kubectl exec -n stackrox central-db-0 -- \
        psql -U postgres -d stackrox -c "\timing on" \
        -c "SELECT COUNT(*) FROM alerts WHERE created_at > NOW() - INTERVAL '24 hours';" \
        | grep "Time:" | awk '{print $2}')

      echo "Baseline query time: ${BASELINE_QUERY_TIME}"

      # Fail if performance regression > 20%
      if (( $(echo "$BASELINE_SCAN_TIME > 1.2 * $EXPECTED_SCAN_TIME" | bc -l) )); then
        echo "Performance regression detected in image scanning"
        exit 1
      fi
  artifacts:
    reports:
      junit: performance-test-results.xml
    expire_in: 1 week

Advanced Performance Optimization Techniques

PostgreSQL Database Optimization:

Database performance directly impacts RHACS user experience. Based on PostgreSQL performance tuning and RHACS-specific workload patterns:

-- RHACS-optimized PostgreSQL configuration
-- Apply to RHACS Central database for improved performance

-- Memory and buffer management
-- shared_buffers and effective_cache_size take absolute values, not percentages;
-- the numbers below assume a 64GB database node (~25% and ~75% of RAM)
ALTER SYSTEM SET shared_buffers = '16GB';
ALTER SYSTEM SET effective_cache_size = '48GB';
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET maintenance_work_mem = '2GB';

-- Connection management  
ALTER SYSTEM SET max_connections = 200;
ALTER SYSTEM SET max_prepared_transactions = 100;

-- Query optimization
ALTER SYSTEM SET random_page_cost = 1.1;  -- SSD storage
ALTER SYSTEM SET effective_io_concurrency = 200;  -- SSD concurrent I/O

-- Logging for performance analysis
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- Log slow queries
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_lock_waits = on;

-- Apply configuration (shared_buffers, max_connections, and max_prepared_transactions
-- only take effect after a PostgreSQL restart; pg_reload_conf() covers the rest)
SELECT pg_reload_conf();

-- Create performance monitoring views
CREATE VIEW rhacs_performance_summary AS
SELECT 
  (SELECT pg_size_pretty(pg_database_size('stackrox'))) as database_size,
  (SELECT count(*) FROM alerts) as total_alerts,
  (SELECT count(*) FROM images) as total_images,
  (SELECT count(*) FROM policy_violations WHERE created_at > NOW() - INTERVAL '24 hours') as recent_violations,
  (SELECT avg(mean_exec_time) FROM pg_stat_statements WHERE query LIKE '%alerts%') as avg_alert_query_time;

Scanner V4 Performance Optimization:

Scanner performance optimization reduces CI/CD pipeline delays and improves developer experience:

## High-performance Scanner V4 configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner-v4-optimized
  namespace: stackrox
spec:
  replicas: 6  # Horizontal scaling for high throughput
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
  template:
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: scanner-v4
              topologyKey: kubernetes.io/hostname
      containers:
      - name: scanner-v4
        image: registry.redhat.io/advanced-cluster-security/rhacs-scanner-v4-rhel8:4.8
        resources:
          limits:
            memory: \"16Gi\"    # High memory for large images
            cpu: \"8000m\"      # High CPU for parallel processing
          requests:
            memory: \"8Gi\"
            cpu: \"4000m\"
        env:
        - name: ROX_SCANNER_V4_INDEXER_DATABASE_POOL_SIZE
          value: \"30\"         # Increased connection pool
        - name: ROX_SCANNER_V4_MATCHER_DATABASE_POOL_SIZE
          value: \"20\"
        - name: ROX_SCANNER_V4_INDEXER_MAX_SCAN_CONCURRENCY
          value: \"4\"          # Parallel scanning within scanner
        - name: ROX_SCANNER_V4_GRPC_MAX_MESSAGE_SIZE
          value: \"104857600\"  # 100MB for large images
        volumeMounts:
        - name: scanner-v4-db
          mountPath: /var/lib/stackrox
        - name: tmp-volume
          mountPath: /tmp
      volumes:
      - name: scanner-v4-db
        persistentVolumeClaim:
          claimName: scanner-v4-db-optimized
      - name: tmp-volume
        emptyDir:
          sizeLimit: \"20Gi\"   # Large temp space for image processing

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scanner-v4-db-optimized
  namespace: stackrox
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 500Gi        # Large storage for vulnerability database
  storageClassName: fast-ssd # High-performance storage class

This monitoring setup catches problems before they ruin your weekend. The next section covers the questions you'll actually ask when everything's on fire.

RHACS Performance FAQ (For When Everything's On Fire)

Q

How do I size RHACS without Red Hat's bullshit calculator?

A

Start with 3x Red Hat's recommendations because their numbers assume you're scanning Hello World containers.

I learned this after Scanner V4 OOM killed itself trying to process our ML team's 6GB Python images. Even with RHACS 4.9's streamlined enforcement configurations, the core resource requirements haven't changed - only the complexity of managing them.

Watch these metrics or get paged at 3am:

  • Central memory >80% = time to scale up
  • Scanner queue length >20 = add more Scanner replicas
  • PostgreSQL query times >5s = your database is dying
  • Sensor disconnect spikes = Central is overloaded

The 70% utilization rule is academic bullshit. Scale up when things start breaking, usually around 60% memory usage.
Q

What metrics actually predict when RHACS will shit the bed?

A

After debugging 50+ RHACS failures, these are the only metrics that matter:

  1. Scanner queue length - >20 means you're fucked, >50 means call in sick
  2. Central memory usage - >85% for 10 minutes = imminent crash
  3. Admission controller P95 latency - >1s breaks CI/CD, >2s kills productivity
  4. PostgreSQL connection count - near max_connections = database death spiral
  5. Sensor last contact time - gaps >5min indicate Central is choking

Put these in PagerDuty alerts, not Slack notifications that everyone ignores.
Q

How do I stop Scanner V4 from dying on fat container images?

A

Our DevOps team loves pushing 8GB Docker images with 200 layers.

Here's how to not get fired:

  • Give Scanner 24GB RAM minimum - anything less OOM kills on large ML images
  • Run 8+ Scanner replicas - when one crashes, the others keep working
  • Use node affinity - pin Scanners to your beefiest nodes (sketch below)
  • Tune database connections - increase pool sizes or Scanner starves waiting for the DB
  • Scan overnight - large images take 20+ minutes, don't block daytime deployments

Watch Scanner restarts. If they're OOM killing daily, you need more memory or fewer concurrent scans.
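
A minimal node-affinity sketch for pinning Scanner V4 to your big-memory nodes. The node label is something you apply yourself (node-role/scanner-heavy below is an example, not an RHACS convention), and only the affinity stanza matters - merge it into your existing Scanner Deployment:

## First: kubectl label node <big-node> node-role/scanner-heavy=true
apiVersion: apps/v1
kind: Deployment
metadata:
  name: scanner-v4
  namespace: stackrox
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: node-role/scanner-heavy
                operator: In
                values: ["true"]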

Q

Do complex policies actually slow down RHACS?

A

Oh hell yes.

We had 150 policies checking everything from image labels to network configs. Deployment times went from 30 seconds to 5 minutes.

  • 10-20 simple policies: 50ms per deployment ✅
  • 50+ policies: 300ms per deployment ⚠️
  • 100+ policies with regex: 2s+ per deployment 🔥
  • Poorly scoped policies: evaluated against every deployment across all namespaces

Fix: Use policy scopes religiously. Don't check dev policies against prod deployments. Cut your policy count in half and performance doubles.

Q

How do I load test RHACS without breaking production?

A

Don't test in prod like we did.

Here's how to break RHACS safely:

  1. Spam image scans - push 50 parallel scans of your fattest images
  2. Hammer the database - run pgbench for 4 hours while scanning
  3. Deploy chaos - launch 200 pods/minute and measure admission controller death
  4. Network stress - restart all Sensors simultaneously

Run for 48+ hours because memory leaks take time. If anything crashes, add 50% more resources.

Q

What storage do I need so PostgreSQL doesn't crawl?

A

We tried gp2 storage because it was cheap.

Scans took 15 minutes per image. Learned that lesson the expensive way.

  • Minimum that works: 10,000 IOPS, <5ms latency (gp3 with provisioned IOPS)
  • What you actually need: 25,000+ IOPS for >100 clusters (NVMe or io2)
  • Storage size: start with 500GB, grows ~20GB/month per 50 clusters
  • Backup storage: separate fast storage for WAL files

gp2 storage = 10min scan times. NVMe = 30s scan times. The math is simple (a gp3 StorageClass sketch follows below).
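
If you're on AWS, the cheap-storage mistake is fixed with a gp3 StorageClass that actually provisions IOPS. A sketch assuming the EBS CSI driver; the class name is hypothetical and the IOPS/throughput numbers match the "minimum that works" line above:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rhacs-db-gp3            # hypothetical name - reference it from the Central DB PVC
provisioner: ebs.csi.aws.com    # AWS EBS CSI driver
parameters:
  type: gp3
  iops: "10000"
  throughput: "500"             # MiB/s
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
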
Q

How do I fix RHACS memory leaks before they kill everything?

A

RHACS leaks memory like a 20-year-old car leaks oil.

Gradual, then all at once:

  • Central memory climbs 1GB/week = database connection leak
  • Scanner memory spikes then stays high = orphaned scan processes
  • Admission controller bloat = policy cache not clearing
  • Nuclear option: restart everything weekly before it OOM kills

I set up automatic restarts every Sunday at 2am. Ugly, but it works (a CronJob sketch follows below).
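
A sketch of that Sunday-2am restart as a CronJob. The ServiceAccount name is hypothetical and needs a Role binding that allows get/list/patch on the workloads being restarted; adjust the kinds and names to match your install:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: rhacs-weekly-restart           # hypothetical name
  namespace: stackrox
spec:
  schedule: "0 2 * * 0"                # Sundays at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: stackrox-restarter   # assumed SA with patch rights on deployments
          restartPolicy: Never
          containers:
          - name: restart
            image: registry.redhat.io/openshift4/ose-cli:latest   # any image with kubectl/oc works
            command:
            - /bin/sh
            - -c
            - |
              ## Adjust workload kinds/names to your install before enabling this
              kubectl -n stackrox rollout restart deployment/scanner-v4
              kubectl -n stackrox rollout restart deployment/central
              kubectl -n stackrox rollout status deployment/central --timeout=10m
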
Q

Should I use RHACS Cloud Service or manage this nightmare myself?

A

Depends on whether you enjoy being paged at 3am for database maintenance.

RHACS Cloud Service:

  • Pros: Red Hat deals with PostgreSQL tuning, scaling, backups
  • Cons: Network latency kills Sensor performance, limited customization
  • Reality: Good for <50 clusters if you hate managing databases

Self-managed RHACS:

  • Pros: Tune everything, co-locate with clusters, control your destiny
  • Cons: You own all the 3am pages, database crashes, scaling disasters
  • Reality: Better for >100 clusters if you have a dedicated platform team
Q

How do I stop Sensors from saturating my network?

A

Sensors are chatty as hell. 100 clusters = 300K network connections per hour even when idle.

  • Deploy regional Centrals - latency >200ms kills Sensor performance
  • Tune Sensor sync intervals - the default 30s is overkill for stable clusters
  • Batch policy updates - full policy syncs can push 50MB per Sensor
  • Network capacity: plan 1Mbps per 10 clusters during policy updates

Sensor disconnects spike during policy updates. Pre-provision network capacity or deployments fail.
Q

Why do compliance scans kill my RHACS deployment?

A

Compliance scans are database killers.

They run 50+ complex queries across every resource in every cluster.

  • Central CPU: Spikes to 100% for 10-45 minutes
  • PostgreSQL: queries time out, connections exhaust, locks pile up
  • Memory: temporary 8GB spike processing large clusters
  • Runtime: 2 minutes for toy clusters, 2 hours for enterprise

Schedule compliance scans for weekends. Seriously. Don't run them during business hours.
Q

How do I prove RHACS optimization saved money?

A

My manager needed ROI numbers after I spent 2 weeks tuning RHACS:

  • Deployment speed: 5min → 30s average deployment time
  • Infrastructure costs: cut Scanner nodes from 12 to 6 (50% savings)
  • On-call incidents: RHACS alerts dropped 80% after tuning
  • Developer happiness: no more "RHACS is slow" Slack complaints

Typical results: 40% cost reduction, 60% performance improvement, 90% fewer angry developers.

Q

How do I upgrade RHACS without everything catching fire?

A

RHACS upgrades are like heart surgery - measure twice, cut once:

  1. Snapshot current metrics - you'll need them when shit breaks (sketch below)
  2. Test in staging first - upgrades fail in exciting ways
  3. Database migration time - our 500GB database took 8 hours to migrate
  4. Watch resource spikes - Central memory doubles during upgrade
  5. Rollback plan - database downgrades are basically impossible

Plan 8+ hour maintenance windows. RHACS upgrades take forever and things break.
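
A sketch of the pre-upgrade snapshot from step 1; pod and database names follow the examples earlier in this guide, so adjust them to your install:

## Capture a baseline you can diff against after the upgrade
SNAP="rhacs-preupgrade-$(date +%Y%m%d)"
mkdir -p "$SNAP"
kubectl top pods -n stackrox > "$SNAP/pod-usage.txt"
kubectl get pods -n stackrox -o wide > "$SNAP/pod-status.txt"
kubectl exec -n stackrox central-db-0 -- psql -U postgres -d central \
  -c "SELECT pg_size_pretty(pg_database_size('central'));" > "$SNAP/db-size.txt"
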
Q

How do I survive when everyone deploys at 9am?

A

Our team of 50 developers all push code between 9-11am. Scanner queue goes from 0 to 200 in 10 minutes.

  • Auto-scale Scanners - HPA based on queue length saves your ass
  • Prioritize critical scans - prod images scan first, dev images wait
  • Rate limit CI/CD - 10 concurrent scans max or Scanner dies
  • Cache everything - the same image layers shouldn't scan twice

Without auto-scaling, morning deployments time out and developers blame "slow security tools."

Q

What are the early warning signs that RHACS is about to die?

A

Learn these patterns or enjoy 3am pages:

  • Memory climbing 1GB+/week = memory leak, restart before OOM kill
  • Scan times doubling = database needs VACUUM or storage is slow
  • Query response >5s = PostgreSQL is struggling, tune or add resources
  • Admission controller P95 >1s = policies too complex or Central overloaded
  • Sensor disconnect spikes = Central can't handle the load

Set alerts on these trends. React when they're trending bad, not when everything's broken. (A quick autovacuum check follows below.)
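
When scan times double, check whether autovacuum is keeping up before throwing hardware at it. This only reads the standard pg_stat_user_tables view in Central's database:

-- Tables with the most dead tuples and when they were last vacuumed/analyzed
SELECT relname,
       n_dead_tup,
       n_live_tup,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
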
Q

How do I cut RHACS costs without getting fired for missing vulnerabilities?

A

After my manager saw our $15K/month RHACS bill, optimization became mandatory:

  • Delete useless policies - cut 150 policies to 40 useful ones (60% CPU reduction)
  • Scan less crap - skip scanning dev images older than 30 days (40% Scanner cost)
  • Data cleanup - delete alerts >90 days old (50% database size reduction)
  • Right-size everything - measure actual usage, cut resources by 40%
  • Night shift processing - compliance scans at 2am on cheaper spot instances

Went from $15K to $6K/month. Security coverage unchanged. Manager happy.
Q

What's the most important thing about RHACS performance in 2025?

A

After five years of running RHACS in production environments, here's the truth: RHACS performance problems are completely predictable and preventable.

The failures happen in the same sequence every time:

  1. Scanner queue builds up because someone pushed fat container images
  2. Database queries slow down because retention policies weren't configured
  3. Central memory climbs because connection pools weren't tuned
  4. Everything crashes at 3am on a Tuesday

The teams that succeed with RHACS treat it like the database-heavy, resource-intensive platform it actually is - not the "lightweight security overlay" that marketing promises. Size it like you're running a production database, monitor it like you're running a production database, and tune it like you're running a production database. Because that's exactly what you're doing.

This guide gives you the real numbers, the actual monitoring alerts, and the optimization techniques that prevent 90% of RHACS performance disasters. Use it before things break, not after you're paged at 3am explaining why the security platform just took down the entire CI/CD pipeline.

Performance Optimization and Monitoring Resources