PostgreSQL: Where Performance Goes to Die
Your PostgreSQL database is the heart of Clair, and it's probably your biggest operational headache. When teams first deploy Clair, they slap it on a basic RDS instance and assume it'll scale. Wrong.
Connection pool exhaustion hits first. The default config allows 100 connections, but three indexer instances can easily saturate that during peak scanning. You'll see "connection pool exhausted" errors in your logs, and new scans just hang. PostgreSQL connection pooling becomes critical - bump max_connections to 200+ and configure PgBouncer if you're serious about scale.
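Before touching max_connections, it helps to know how close you actually run to the ceiling. Here's a minimal sketch, assuming psycopg2 and a hypothetical read-only DSN, that compares active backends against max_connections; the 80% threshold is an arbitrary example, not a Clair default:

```python
# Watch Clair's connection usage against PostgreSQL's max_connections.
import psycopg2

DSN = "host=clair-db.example.internal dbname=clair user=clair_ro"  # hypothetical DSN

def connection_headroom(dsn: str = DSN) -> float:
    """Return the fraction of max_connections currently in use."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute("SHOW max_connections;")
        max_conns = int(cur.fetchone()[0])
        cur.execute("SELECT count(*) FROM pg_stat_activity;")
        in_use = cur.fetchone()[0]
    return in_use / max_conns

if __name__ == "__main__":
    usage = connection_headroom()
    if usage > 0.8:
        print(f"WARNING: {usage:.0%} of max_connections in use - indexers may start hanging")
    else:
        print(f"Connection usage: {usage:.0%}")
```

Wire that same query into your metrics stack and you get a warning well before new scans start hanging.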
Database bloat kills query performance once you hit 500,000+ indexed images. The vulnerability correlation queries that worked fine with 10,000 images take 30+ seconds with real production data. VACUUM operations need to run regularly, especially after vulnerability database updates that touch millions of rows.
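One way to confirm bloat is the culprit after an updater run is to check dead-tuple counts before reaching for a manual VACUUM. A rough sketch, again assuming psycopg2; the DSN, the 10,000-row floor, and the 20% dead-tuple ratio are all made-up thresholds, not Clair recommendations:

```python
# Flag bloated tables after an updater run, then VACUUM (ANALYZE) them.
import psycopg2

DSN = "host=clair-db.example.internal dbname=clair user=clair_admin"  # hypothetical DSN

BLOAT_QUERY = """
SELECT schemaname, relname, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
  AND n_dead_tup > 0.2 * GREATEST(n_live_tup, 1)
ORDER BY n_dead_tup DESC;
"""

conn = psycopg2.connect(DSN)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()
cur.execute(BLOAT_QUERY)
for schema, table, dead, live in cur.fetchall():
    print(f"{schema}.{table}: {dead} dead / {live} live tuples")
    cur.execute(f'VACUUM (ANALYZE) "{schema}"."{table}";')
cur.close()
conn.close()
```

The longer-term fix is tuning autovacuum for the tables the updaters churn, but a targeted manual pass like this buys you breathing room.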
Memory consumption spirals during updater runs in ways you can't predict. When Red Hat publishes a big batch of advisories, the RHEL VEX updater (new in v4.8.0) rebuilds correlation data that might use 8GB of RAM or might use 16GB - there's no reliable way to know in advance. Your database server needs at least 16GB to absorb these spikes, and I've personally watched it blow past even that, with no good built-in monitoring to warn you.
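There's nothing Clair-specific you can hook into here, but even a crude watchdog on the database host beats hearing about it from the OOM killer. A sketch assuming psutil; the 2 GiB floor and 30-second poll interval are arbitrary numbers, not recommendations:

```python
# Crude memory-pressure watchdog for the PostgreSQL host during updater runs.
import time
import psutil

FLOOR_BYTES = 2 * 1024**3  # alert when less than ~2 GiB remains available

while True:
    mem = psutil.virtual_memory()
    if mem.available < FLOOR_BYTES:
        print(f"WARNING: only {mem.available / 1024**3:.1f} GiB available "
              f"({mem.percent}% used) - an updater run may be spiking")
    time.sleep(30)
```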
Memory Usage: The Silent Container Killer
Clair's memory consumption is unpredictable and absolutely brutal. A basic Ubuntu container might index with 200MB of RAM, but that TensorFlow container with 73 layers and custom Python packages? I've seen it spike from 4-6GB all the way to eating an entire node's memory during indexing, and there's no reliable way to predict it ahead of time.
Kubernetes resource limits become a nightmare to tune. Set them too low and your indexer pods get OOMKilled mid-scan. Set them too high and you're wasting cluster resources. I've seen production deployments where 80% of memory allocation goes unused most of the time, but you need those spikes handled.
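Rather than guessing at limits, let the kubelet tell you when you got them wrong. A sketch using the official kubernetes Python client; the "clair" namespace and the indexer label selector are assumptions about your deployment, not Clair defaults:

```python
# Surface indexer pods that were OOMKilled so limit tuning is driven by data.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="clair",
    label_selector="app.kubernetes.io/component=indexer",
)
for pod in pods.items:
    for cs in pod.status.container_statuses or []:
        term = cs.last_state.terminated
        if term and term.reason == "OOMKilled":
            print(f"{pod.metadata.name}/{cs.name} OOMKilled at {term.finished_at}, "
                  f"restarts={cs.restart_count}")
```

Run it after every big scanning batch and you'll know whether the limits or the workload moved.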
The worst part is memory leaks during malformed container analysis. When Clair hits a corrupted layer or package metadata that shouldn't exist, memory usage climbs indefinitely until the process crashes. I've personally debugged incidents where all you see is the pod restarting at 3am and 45 minutes of scan progress gone, with no decent monitoring to tell you why.
Webhook Delivery: The Async Disaster
Webhook notifications fail silently and spectacularly. Your vulnerability notification system depends on these webhooks, but when they break, you might not notice for weeks.
Timeout configurations cause most webhook failures. The default 30-second timeout seems generous until you realize webhook processing might include Slack notifications, database updates, and policy evaluations. Your receiving endpoint times out, Clair marks it as failed, and you stop getting vulnerability alerts.
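The usual fix is to acknowledge the webhook immediately and do the slow work out-of-band. A minimal sketch with Flask; handle_notification is a placeholder for your own Slack/policy/database logic, not a Clair API:

```python
# Acknowledge Clair's webhook instantly; process the notification in the background.
import queue
import threading
from flask import Flask, request

app = Flask(__name__)
work = queue.Queue()

def handle_notification(payload):
    # Placeholder: Slack alerts, policy evaluation, database writes go here.
    print("processing notification", payload.get("notification_id"))

def worker():
    while True:
        payload = work.get()
        try:
            handle_notification(payload)
        finally:
            work.task_done()

threading.Thread(target=worker, daemon=True).start()

@app.route("/clair/webhook", methods=["POST"])
def clair_webhook():
    work.put(request.get_json(force=True))
    return "", 202  # ACK before any heavy lifting so the delivery never times out

if __name__ == "__main__":
    app.run(port=8080)
```

The point is the 202: the delivery succeeds in milliseconds no matter how long your downstream processing takes.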
Webhook retry logic is primitive. Failed deliveries get retried with exponential backoff, but there's no dead letter queue or manual retry mechanism. If your webhook endpoint is down for an hour, you lose all notifications from that window.
Authentication failures happen constantly with token rotation. Your webhook endpoint expects a valid JWT, but when certificates rotate or tokens expire, webhook delivery just stops. Clair's logs show "authentication failed" but don't distinguish between temporary and permanent auth failures.
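At minimum, your endpoint can make that distinction itself, so a rotated token doesn't look the same as a forged one. A sketch with PyJWT; the shared key and HS256 algorithm are assumptions about your issuer, and verify_webhook_jwt is illustrative, not a Clair API:

```python
# Classify webhook auth failures: expired (likely rotation) vs. outright invalid.
import jwt

def verify_webhook_jwt(token: str, key: str) -> str:
    try:
        jwt.decode(token, key, algorithms=["HS256"])  # algorithm depends on your issuer
        return "ok"
    except jwt.ExpiredSignatureError:
        # Temporary: the token aged out, probably mid-rotation - retry or alert softly.
        return "expired"
    except jwt.InvalidTokenError:
        # Permanent: wrong key, wrong issuer, or garbage - page someone.
        return "invalid"
```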
Vulnerability Database Updates: The Scanning Killer
Database updates lock scanning operations and create unpredictable downtime. When RHEL VEX data updates (the new hotness in v4.8.0), the matchers rebuild correlation indexes that block all vulnerability queries.
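You can at least tell "slow" from "blocked" by asking PostgreSQL which backends are waiting on locks while an update is running. A sketch assuming psycopg2 and the same hypothetical DSN as earlier:

```python
# Spot matcher queries stuck behind an updater's locks.
import psycopg2

DSN = "host=clair-db.example.internal dbname=clair user=clair_ro"  # hypothetical DSN

BLOCKED_QUERY = """
SELECT pid,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(BLOCKED_QUERY)
    for pid, wtype, wevent, waiting, query in cur.fetchall():
        print(f"pid={pid} blocked on {wevent} for {waiting}: {query}")
```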
The migration from RHEL OVAL to VEX updater in Clair v4.8.0 creates a specific operational nightmare. During the upgrade, there's a period where no Red Hat vulnerabilities exist in the database until the VEX updater completes its first run. Production deployments can go hours without RHEL vulnerability detection.
Network dependency failures cascade across the entire system. Clair needs to fetch updates from NVD, Ubuntu USN, Debian DSA, and now Red Hat VEX endpoints. When any of these services are slow or unavailable, updater runs hang and block the entire scanning pipeline.
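A pre-flight reachability check before the updater window catches a dead upstream early instead of mid-run. A sketch with requests; the URL list is illustrative only and should be matched against the updaters you actually have enabled:

```python
# Check that vulnerability data sources respond before kicking off an updater run.
import requests

SOURCES = {
    # Illustrative endpoints - confirm against your Clair updater configuration.
    "NVD": "https://services.nvd.nist.gov/rest/json/cves/2.0?resultsPerPage=1",
    "Ubuntu OVAL": "https://security-metadata.canonical.com/oval/",
    "Red Hat VEX": "https://access.redhat.com/security/data/csaf/v2/vex/",
}

for name, url in SOURCES.items():
    try:
        resp = requests.get(url, timeout=10)
        print(f"{name}: HTTP {resp.status_code} in {resp.elapsed.total_seconds():.1f}s")
    except requests.RequestException as exc:
        print(f"{name}: UNREACHABLE ({exc})")
```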
Rate limiting from upstream sources breaks update schedules. The NVD API started enforcing rate limits that can delay vulnerability database synchronization by hours. Your scanning pipeline looks healthy, but you're working with day-old vulnerability data.
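If you pull NVD data yourself - say, to monitor freshness or stage offline updates - pace the requests and back off on throttling responses instead of hammering the API. A sketch with requests; the 5-requests-per-30-seconds budget reflects NVD's published guidance for keyless clients at the time of writing, so check their current docs:

```python
# Rate-limit-aware NVD fetch: pace requests and back off on HTTP 403/429.
import time
import requests

NVD_URL = "https://services.nvd.nist.gov/rest/json/cves/2.0"
MIN_INTERVAL = 30 / 5  # seconds between requests without an API key

def fetch_page(start_index: int, api_key: str = "") -> dict:
    headers = {"apiKey": api_key} if api_key else {}
    backoff = MIN_INTERVAL
    while True:
        resp = requests.get(
            NVD_URL,
            params={"startIndex": start_index, "resultsPerPage": 2000},
            headers=headers,
            timeout=60,
        )
        if resp.status_code in (403, 429):  # throttled - wait and retry
            time.sleep(backoff)
            backoff = min(backoff * 2, 300)
            continue
        resp.raise_for_status()
        time.sleep(MIN_INTERVAL)  # stay under the request budget
        return resp.json()
```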