What Breaks in Production (And Why You'll Hate It)

PostgreSQL: Where Performance Goes to Die

Your PostgreSQL database is the heart of Clair, and it's probably your biggest operational headache. When teams first deploy Clair, they slap it on a basic RDS instance and assume it'll scale. Wrong.

Connection pool exhaustion hits first. The default config allows 100 connections, but three indexer instances can easily saturate that during peak scanning. You'll see connection pool exhausted errors in your logs, and new scans just hang. PostgreSQL connection pooling becomes critical - bump max_connections to 200+ and configure PgBouncer if you're serious about scale.
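
If you want to see how close you are to that ceiling before scans start hanging, something like this works. The connection string is a placeholder for your own; ALTER SYSTEM needs a PostgreSQL restart to take effect for max_connections.

```bash
# How many backends are open vs. the configured ceiling?
psql "postgres://clair:CHANGEME@db-host:5432/clair" -c "
  SELECT count(*)                              AS connections_in_use,
         current_setting('max_connections')    AS max_allowed
  FROM pg_stat_activity;"

# Raise the ceiling (requires a PostgreSQL restart to apply)
psql "postgres://clair:CHANGEME@db-host:5432/clair" \
  -c "ALTER SYSTEM SET max_connections = 200;"
```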

Database bloat kills query performance once you hit 500,000+ indexed images. The vulnerability correlation queries that worked fine with 10,000 images take 30+ seconds with real production data. VACUUM operations need to run regularly, especially after vulnerability database updates that touch millions of rows.
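
A rough way to spot bloat before query times fall off a cliff: check which tables are carrying the most dead tuples and when they were last vacuumed. CLAIR_DB_URL is a placeholder; table names will be whatever your Clair schema actually uses.

```bash
# Which tables have the most dead rows, and when were they last vacuumed?
psql "$CLAIR_DB_URL" -c "
  SELECT relname, n_live_tup, n_dead_tup, last_autovacuum, last_vacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 10;"

# Manual cleanup after a big updater run (do this in a quiet window)
psql "$CLAIR_DB_URL" -c "VACUUM (VERBOSE, ANALYZE);"
```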

Memory consumption spirals during updater runs in ways you can't predict. When a big batch of upstream advisories lands and the RHEL VEX updater (new in v4.8.0) rebuilds its correlation data, the run might use 8GB of RAM or might use 16GB - honestly, I think it depends on the phase of the moon. Your database server needs at least 16GB to handle these spikes, but I've personally watched it blow past that and there's no good monitoring for this shit.

Memory Usage: The Silent Container Killer

Clair's memory consumption is unpredictable and absolutely brutal. A basic Ubuntu container might index with 200MB RAM usage, but that TensorFlow container with 73 layers and custom Python packages? I've seen it spike anywhere from 4-6GB to "oh fuck it's eaten our entire node's memory" levels during indexing. There's literally no way to predict this shit ahead of time.

Kubernetes resource limits become a nightmare to tune. Set them too low and your indexer pods get OOMKilled mid-scan. Set them too high and you're wasting cluster resources. I've seen production deployments where 80% of memory allocation goes unused most of the time, but you need those spikes handled.
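
One pattern that keeps the waste tolerable: low requests, high limits, then watch for OOMKills. A sketch using kubectl; the namespace, deployment name, and label are assumptions for your deployment.

```bash
# Low request so the scheduler packs pods, high limit so spikes survive.
# Namespace "clair" and deployment "clair-indexer" are assumed names.
kubectl -n clair set resources deployment/clair-indexer \
  --requests=cpu=500m,memory=2Gi \
  --limits=cpu=2,memory=8Gi

# Confirm nothing is getting OOMKilled after the change
# (label app=clair-indexer is also an assumption)
kubectl -n clair describe pod -l app=clair-indexer | grep -A3 "Last State"
```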

The worst part is memory leaks during malformed container analysis. When Clair encounters a corrupted layer or some weird-ass package metadata that shouldn't exist, memory usage climbs indefinitely until the process crashes. I've personally debugged situations where you just see the pod restart at 3am, lose 45 minutes of scan progress, and have no fucking idea why because there's no decent monitoring for this.

Webhook Delivery: The Async Disaster

Webhook notifications fail silently and spectacularly. Your vulnerability notification system depends on these webhooks, but when they break, you might not notice for weeks.

Timeout configurations cause most webhook failures. The default 30-second timeout seems generous until you realize webhook processing might include Slack notifications, database updates, and policy evaluations. Your receiving endpoint times out, Clair marks it as failed, and you stop getting vulnerability alerts.

Webhook retry logic is primitive. Failed deliveries get retried with exponential backoff, but there's no dead letter queue or manual retry mechanism. If your webhook endpoint is down for an hour, you lose all notifications from that window.

Authentication failures happen constantly with token rotation. Your webhook endpoint expects a valid JWT, but when certificates rotate or tokens expire, webhook delivery just stops. Clair logs show authentication failed but doesn't distinguish between temporary and permanent auth failures.

Vulnerability Database Updates: The Scanning Killer

Database updates lock scanning operations and create unpredictable downtime. When RHEL VEX data updates (the new hotness in v4.8.0), the matchers rebuild correlation indexes that block all vulnerability queries.

The migration from RHEL OVAL to VEX updater in Clair v4.8.0 creates a specific operational nightmare. During the upgrade, there's a period where no Red Hat vulnerabilities exist in the database until the VEX updater completes its first run. Production deployments can go hours without RHEL vulnerability detection.

Network dependency failures cascade across the entire system. Clair needs to fetch updates from NVD, Ubuntu USN, Debian DSA, and now Red Hat VEX endpoints. When any of these services are slow or unavailable, updater runs hang and block the entire scanning pipeline.

Rate limiting from upstream sources breaks update schedules. The NVD API started enforcing rate limits that can delay vulnerability database synchronization by hours. Your scanning pipeline looks healthy, but you're working with day-old vulnerability data.

Production Troubleshooting: When Everything Goes Wrong

Q: Why are my scans stuck in "indexing" forever?

A:
  • Check PostgreSQL connections first - this causes 70% of stuck scans. Run SELECT count(*) FROM pg_stat_activity WHERE state = 'active'; on your database. If you're hitting connection limits, indexer requests just hang (a quick triage script follows this answer).
  • Network timeouts during layer downloads are the second most common cause. Check your logs for context deadline exceeded errors. Large container images (2GB+) can time out on slow networks. Increase the timeout value in your indexer config or improve network bandwidth to your registry.
  • Memory limits kill indexing silently. Check kubectl describe pod for OOMKilled events. If your indexer pods are getting killed mid-scan, increase memory limits to at least 4GB for production workloads - learned this one after debugging why scans kept mysteriously failing halfway through.
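
A minimal triage script covering those three causes. CLAIR_DB_URL, the namespace, and the pod label are placeholders for your environment.

```bash
#!/usr/bin/env bash
# Triage stuck "indexing" scans: connections, layer-download timeouts, OOM kills.

echo "== Active vs. max PostgreSQL connections =="
psql "$CLAIR_DB_URL" -t -c "
  SELECT count(*) FILTER (WHERE state = 'active') AS active,
         count(*)                                 AS total,
         current_setting('max_connections')       AS max
  FROM pg_stat_activity;"

echo "== Layer-download timeouts in the last hour =="
kubectl -n clair logs -l app=clair-indexer --since=1h \
  | grep -c "context deadline exceeded"

echo "== Last termination reason per indexer pod (look for OOMKilled) =="
kubectl -n clair get pods -l app=clair-indexer \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'
```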

Q: My vulnerability database updates are failing constantly

A:
  • Red Hat VEX updater failures dominate post-v4.8.0 issues. The OVAL-to-VEX migration introduced new API endpoints with different rate limits and authentication requirements. Check your logs for VEX update failed messages and verify your Red Hat API access.
  • DNS resolution problems hit air-gapped environments hard. If you're seeing no such host errors for external vulnerability feeds, you need to configure vulnerability database mirroring or provide proper DNS resolution for external endpoints.
  • API rate limiting from NVD causes delayed updates. Request an API key to get higher rate limits, or expect 6+ hour delays for complete vulnerability database synchronization.

Q: Why is PostgreSQL eating all my server resources?

A:
  • Vulnerability correlation queries get expensive with scale. The vuln_affected table grows far faster than your image count - with 100,000+ images, correlation queries can take minutes. Add database indexes manually or your scan reports will time out.
  • VACUUM operations aren't running automatically. After major updater runs, PostgreSQL needs to reclaim space from deleted vulnerability records. Schedule regular VACUUM ANALYZE operations or your database will bloat indefinitely.
  • The autovacuum settings are wrong for Clair's workload. Increase autovacuum_max_workers to 6+ and lower autovacuum_vacuum_scale_factor to 0.1 for better maintenance (a config sketch follows this answer).
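
Those autovacuum changes, applied via psql. Values mirror the recommendations above; tune them for your hardware, and note that autovacuum_max_workers only takes effect after a PostgreSQL restart.

```bash
psql "$CLAIR_DB_URL" <<'SQL'
-- needs a PostgreSQL restart to take effect
ALTER SYSTEM SET autovacuum_max_workers = 6;
-- vacuum once 10% of a table is dead rows instead of the 20% default;
-- reloadable, no restart needed
ALTER SYSTEM SET autovacuum_vacuum_scale_factor = 0.1;
SELECT pg_reload_conf();
SQL
```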

Q: Memory usage spikes are killing my containers

A:
  • TensorFlow and ML containers trigger massive memory usage during package analysis. A single 6GB container with thousands of Python packages can consume 8GB+ RAM during indexing. You can't predict this - just allocate more memory.
  • Memory leaks during malformed package parsing aren't officially acknowledged but happen regularly. When indexer processes hit 10GB+ usage on simple containers, restart the indexer pod and file a bug report with the problematic container manifest.

Q: Webhooks stopped working and nobody noticed

A:
  • Token expiration kills webhook authentication silently. Check your webhook endpoint logs for 401/403 errors. Clair doesn't differentiate between temporary and permanent auth failures - they all look the same in the logs.
  • Network policy changes block webhook delivery. If your security team modified firewall rules or Kubernetes network policies, webhook traffic might be getting dropped. Test webhook connectivity manually with curl (see the example below).
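
A manual delivery test that also tells you which failure mode you're in. The endpoint URL and token are placeholders; use whatever payload shape your receiver expects.

```bash
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  -X POST "https://hooks.example.internal/clair" \
  -H "Authorization: Bearer $WEBHOOK_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"test": "clair-webhook-connectivity"}'
# 2xx: delivery path works. 401/403: auth or token rotation. Hang/timeout: network policy.
```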

Q: How do I monitor this mess properly?

A:
  • Prometheus metrics exist but require careful interpretation. Monitor clair_indexer_queue_size to detect scan backlogs and clair_updater_last_success to catch broken vulnerability updates.
  • Database metrics matter more than application metrics. Monitor PostgreSQL connection usage, query performance, and disk space consumption. Your application might look healthy while the database dies.
  • Custom alerting on scan completion times catches operational issues early. If container scans that normally take 30 seconds start taking 5+ minutes, your infrastructure is degraded even if nothing is technically "down".

Q: Air-gapped deployments are a special hell

A:
  • Vulnerability database synchronization requires manual intervention. You need to mirror NVD, Ubuntu USN, Debian DSA, and Red Hat VEX data locally. Each source has different update frequencies and formats.
  • Certificate chain verification fails in isolation. Clair validates SSL certificates for external sources, but in air-gapped environments you might need to configure custom CA bundles or disable certificate verification (not recommended).
  • The pre-migration command for v4.8.0 helps with the OVAL-to-VEX transition: clairctl -D admin pre v4.8.0 removes deprecated vulnerabilities before the migration runs.

Monitoring Strategies That Actually Work in Production

Metrics That Matter (And Ones That Don't)

Most teams monitor the wrong shit. CPU and memory utilization tell you nothing useful about Clair's operational health. I've watched a Clair instance show 20% CPU usage while the scan queue backs up for hours because PostgreSQL connection pools are fucked.

Clair's Prometheus metrics include clair_indexer_queue_size, but this only shows queued scan requests, not the time those requests have been waiting. You need custom alerting that tracks scan completion times: if a basic Ubuntu container takes more than 2 minutes to index, something's wrong.

Database metrics reveal more operational problems than application metrics. Monitor PostgreSQL `pg_stat_activity` to track connection usage and query duration. When connection counts hit 80% of your limit, you're about to hit the operational cliff where new scans hang indefinitely.

Vulnerability database update success is binary but critical. `clair_updater_last_success` timestamps show when each updater completed successfully. If your RHEL VEX updater (new in v4.8.0) hasn't succeeded in 24+ hours, you're missing critical security data.
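
One way to check this from the command line: ask Prometheus which updaters haven't succeeded in the last 24 hours. The Prometheus URL is a placeholder, and the metric name follows this article - confirm it against your matcher's /metrics output before relying on it.

```bash
# Empty output means every updater has succeeded within the last 24 hours.
curl -s "http://prometheus.monitoring:9090/api/v1/query" \
  --data-urlencode 'query=time() - clair_updater_last_success > 86400' \
  | jq '.data.result[].metric'
```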

Log Analysis for Operational Intelligence

Clair's logs actually tell you what's broken if you know how to read them. Connection pool exhaustion shows up as acquiring connection: timeout before scan failures become visible to users - I learned to alert on this pattern after spending a weekend debugging stuck scans.

Memory allocation failures manifest as runtime: out of memory in indexer logs, but by then it's too late. Monitor for increasing memory usage patterns during specific container types - ML containers with 50+ layers almost always trigger memory spikes.

Webhook delivery failures appear as notification delivery failed with HTTP status codes. Status 5xx errors indicate temporary problems worth retrying; 4xx errors usually mean authentication or configuration problems that need manual intervention.

Database query performance degradation shows up as slow query warnings, but Clair's default log level misses these. Enable query logging in PostgreSQL with log_min_duration_statement = 1000 to catch queries taking longer than 1 second.
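
Enabling that threshold is a one-liner against the database; log_min_duration_statement is reloadable, so no restart is needed. CLAIR_DB_URL is a placeholder.

```bash
psql "$CLAIR_DB_URL" <<'SQL'
-- log any statement that takes longer than 1 second
ALTER SYSTEM SET log_min_duration_statement = 1000;
SELECT pg_reload_conf();
SQL
```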

Alerting Rules That Don't Create Noise

Critical alerts (page someone immediately):

  • PostgreSQL connection pool above 90% utilization for 2+ minutes
  • Any indexer pod OOMKilled in the last 5 minutes
  • Vulnerability database updaters failing for 6+ hours
  • Scan queue size above 100 requests for 10+ minutes

Warning alerts (Slack notification):

  • Individual container scans taking 5+ minutes consistently
  • Memory usage above 3GB for basic container indexing
  • Webhook delivery failure rate above 10% over 1 hour
  • Database query times above 5 seconds average

Informational tracking (metrics only):

  • Daily scan volume and completion times
  • Vulnerability database update frequency and duration
  • Resource usage patterns by container type and size
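
A sketch of two of the critical alerts above as Prometheus rules, validated with promtool. Metric names follow this article and thresholds match the list - check both against your own Clair /metrics output before paging anyone on them.

```bash
cat > clair-alerts.yaml <<'EOF'
groups:
  - name: clair-critical
    rules:
      - alert: ClairScanQueueBackedUp
        expr: clair_indexer_queue_size > 100
        for: 10m
        labels: {severity: page}
      - alert: ClairUpdatersStale
        expr: time() - clair_updater_last_success > 6 * 3600
        labels: {severity: page}
EOF
promtool check rules clair-alerts.yaml
```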

What to Expect When Everything Goes Wrong

Look, you need to know what normal performance looks like before you can tell when shit's broken. A standard Ubuntu base image should index in under 30 seconds. Multi-stage Docker builds with 20+ layers typically take 1-2 minutes. ML containers with custom compiled packages can legitimately take 5+ minutes - and that's not necessarily your problem.

Database growth follows predictable patterns until it doesn't. Every 1,000 indexed containers generates roughly 500MB of PostgreSQL data. Plan for 50GB+ database storage if you're scanning 100,000+ images regularly - the vulnerability and vuln_affected tables are where all your disk space disappears.

Memory usage makes no fucking sense. A 4GB container with simple package metadata might use 500MB during indexing, while a 500MB container with complex Python environments can spike to 6GB. Size alone tells you nothing about what resources you'll need.

Network bandwidth needs scale with container layers and registry distance. Local registries enable 100+ Mbps sustained transfer during indexing. External registries (Docker Hub, ECR across regions) might limit you to 10-20 Mbps, significantly impacting scan throughput.

Health Check Strategies Beyond HTTP 200

HTTP health checks miss most operational problems. Clair's /healthz endpoint returns 200 even when the scan queue is backed up for hours or webhook delivery is failing.

Database connectivity health checks should verify both connection availability and query performance. A simple SELECT 1 proves connectivity but misses performance degradation. Use a query that touches a real table - something like SELECT count(*) FROM vulnerability - to test query responsiveness.
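
A health check along those lines, combining the /healthz probe with a timed query. The introspection port (8089), table name, DB URL, and latency threshold are all assumptions - tune them to your deployment and your normal baseline.

```bash
#!/usr/bin/env bash
# Health check that goes beyond HTTP 200: introspection endpoint + timed query.
set -euo pipefail

THRESHOLD_MS=${THRESHOLD_MS:-2000}   # tune to your baseline query time

# /healthz on the introspection port (default assumed to be 8089)
curl -fsS "http://clair.clair.svc:8089/healthz" > /dev/null

start=$(date +%s%N)
psql "$CLAIR_DB_URL" -qtA -c "SELECT count(*) FROM vulnerability;" > /dev/null
elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))

# Fail the check if the database is technically up but crawling
[ "$elapsed_ms" -lt "$THRESHOLD_MS" ] || { echo "DB slow: ${elapsed_ms}ms"; exit 1; }
echo "OK: healthz reachable, query answered in ${elapsed_ms}ms"
```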

Functional health checks should attempt actual scanning operations. Submit a small, known container for indexing and verify completion within expected time bounds. This catches integration problems that infrastructure health checks miss.

End-to-end monitoring should track the complete vulnerability reporting pipeline. Submit a container with known vulnerabilities, verify scan completion, and confirm webhook delivery to your notification system. This proves the entire system works, not just individual components.

Operational Deep Dive: Advanced Production Issues

Q: How do I handle the v4.8.0 OVAL-to-VEX migration in production?

A:
  • Run the pre-migration command during maintenance windows: clairctl -D admin pre v4.8.0 removes deprecated OVAL vulnerabilities before the upgrade. This prevents the operational gap where no Red Hat vulnerabilities exist during migration.
  • Expect 2-6 hours of degraded Red Hat vulnerability detection during the upgrade. The VEX updater needs to complete its first full run before RHEL/CentOS vulnerabilities appear in reports. Schedule upgrades accordingly.
  • Monitor VEX updater logs for authentication failures with Red Hat's new endpoints. The VEX API uses different authentication than OVAL feeds - verify your Red Hat API access before upgrading.

Q: My PostgreSQL performance is acceptable until it isn't

A:
  • Query performance cliffs happen around 100,000 indexed images. The vuln_affected table correlation queries that worked fine at 10,000 images take 30+ seconds at scale. This isn't gradual degradation - it's a cliff.
  • Create custom indexes to handle Clair's query patterns: CREATE INDEX CONCURRENTLY idx_vuln_affected_package ON vuln_affected (package_id, vulnerability_id); helps correlation queries but adds overhead to updates.
  • Connection pool sizing becomes critical at scale. The default 100 connections work for development, but production needs 200+ connections with PgBouncer for connection pooling and load balancing (see the sketch below).
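
A minimal PgBouncer front-end for that kind of connection churn. Host, database name, pool sizes, and auth file are assumptions - size the pools to your indexer count, and if Clair's driver complains about prepared statements under transaction pooling, fall back to pool_mode = session.

```bash
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
clair = host=db-host port=5432 dbname=clair

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling keeps connection counts low; use session pooling if
; prepared statements cause errors with your Clair build
pool_mode = transaction
default_pool_size = 50
max_client_conn = 500
EOF
```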

Q: What's the real impact of webhook delivery failures?

A:
  • Webhook failures create security blind spots. If your vulnerability notifications stop working, you won't know about new CVEs affecting your production containers until someone manually checks scan reports.
  • Failed webhook deliveries aren't retried intelligently. Clair uses exponential backoff but doesn't distinguish between temporary network issues and permanent endpoint failures. A webhook endpoint down for 30 minutes can lose hours of notifications.
  • Authentication token rotation breaks webhook delivery silently. When your JWT tokens or certificates expire, webhook delivery fails with generic authentication errors. There's no built-in alerting for this failure mode.

Q: How do I size resources for unpredictable workloads?

A:
  • Memory requirements depend on container complexity, not size. A 500MB Python container with complex dependency trees can use 6GB RAM during indexing. A 4GB binary-only container might use 200MB. You can't predict this from container metadata.
  • CPU usage spikes during vulnerability correlation, not package analysis. The indexing phase is I/O bound, but matching vulnerabilities to packages can saturate CPU cores. Plan for CPU spikes during vulnerability database updates.
  • Network bandwidth becomes the bottleneck for registry-distant deployments. Local Harbor registries enable fast layer downloads, but scanning containers from Docker Hub across continents can take 10x longer due to network latency and throughput limits.

Q: Air-gapped environments require special operational procedures

A:
  • Vulnerability database synchronization must be scripted and monitored. You need to fetch updates from NVD, Ubuntu USN, Debian DSA, and Red Hat VEX endpoints, then transfer them to your air-gapped environment.
  • Certificate validation failures are common with custom CA hierarchies. Clair validates SSL certificates for external endpoints, but air-gapped environments often use internal CAs. Configure custom CA bundles or accept the security risk of disabling certificate verification.
  • Database migration procedures need testing in isolation. Upgrading Clair versions in air-gapped environments can't rely on external connectivity for database schema updates. Test migration procedures thoroughly in staging environments.

Q: Container registries impact operational behavior

A:
  • Harbor registry integration provides the best operational experience. Built-in Clair support, webhook management, and scan result storage eliminate many integration headaches.
  • AWS ECR requires careful authentication configuration. IAM roles and cross-account access can create intermittent authentication failures that appear as random scan failures in logs.
  • Docker Hub rate limiting affects vulnerability scanning. Anonymous access has strict rate limits that can delay or block container layer downloads during indexing operations.

Q: Database maintenance procedures for production

A:
  • Regular VACUUM operations are essential for performance. After vulnerability database updates, PostgreSQL needs to reclaim space from deleted records. Schedule weekly VACUUM ANALYZE operations during low-usage periods (a cron sketch follows this answer).
  • Index rebuilding helps with query performance degradation over time. The vuln_affected table grows large and fragmented - monthly REINDEX operations on critical indexes maintain query performance.
  • Backup procedures must account for database size growth. Vulnerability databases can reach 50GB+ in production environments. Plan backup windows and storage accordingly, especially if your backup strategy involves downtime.
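
One way to schedule the weekly VACUUM ANALYZE. The connection string, system user, and log path are placeholders, and remember cron jobs don't inherit your shell environment, so spell the URL out.

```bash
cat > /etc/cron.d/clair-vacuum <<'EOF'
# Sunday 03:00: reclaim space after a week of updater churn
0 3 * * 0  postgres  psql "postgres://clair:CHANGEME@db-host:5432/clair" -c "VACUUM (VERBOSE, ANALYZE);" >> /var/log/clair-vacuum.log 2>&1
EOF
```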
