Clair Production Monitoring: AI-Optimized Knowledge
Critical Failure Scenarios & Production Breaking Points
PostgreSQL Database Failures
Breaking Point: 100 connections (default limit) - three indexer instances saturate connection pool
Impact: New scans hang indefinitely, vulnerability reports stop generating
Resource Requirements: Minimum 200+ connections, 16GB RAM, and PgBouncer for connection pooling (config sketch below)
Hidden Cost: Database bloat after 500,000+ indexed images causes 30+ second query times
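A minimal PgBouncer sketch for sitting in front of Clair's database. The database name, host, and pool sizes are assumptions to tune for your indexer count; session pooling is used because transaction pooling can break drivers that rely on server-side prepared statements.

```ini
; Illustrative pgbouncer.ini - database name, host, and pool sizes are placeholders
[databases]
clair = host=127.0.0.1 port=5432 dbname=clair

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = session        ; safer default when prepared statements are in play
max_client_conn = 500
default_pool_size = 50
```

Point Clair's connection string at port 6432 instead of PostgreSQL directly so scan bursts queue at the pooler rather than exhausting `max_connections`.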
Memory Consumption Spikes
Unpredictable Range: 200MB (basic Ubuntu) to 8GB+ (TensorFlow containers with custom packages)
Critical Failure: Memory leaks while analyzing malformed containers - process crashes at 3am
Kubernetes Impact: Setting limits too low = OOMKilled pods, too high = 80% wasted allocation
No Prediction Method: Container size doesn't correlate with memory usage
Webhook Delivery Silent Failures
Default Timeout: 30 seconds (insufficient for complex processing chains)
Authentication Failure Mode: Token rotation breaks delivery without alerts
Retry Logic Limitation: No dead letter queue, exponential backoff only
Security Impact: Missing vulnerability notifications for weeks
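To separate a dead receiver or rotated token from a Clair-side delivery problem, exercise the notification endpoint directly. The URL, token variable, and payload below are placeholders, not Clair's notification schema.

```sh
# Placeholder endpoint and token - a 401 here after a token rotation is the silent failure described above
curl -sS -o /dev/null -w 'HTTP %{http_code}, total %{time_total}s\n' \
  -X POST "https://hooks.example.internal/clair" \
  -H "Authorization: Bearer ${WEBHOOK_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"test": true}'
```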
Vulnerability Database Update Locks
Scan Blocking: RHEL VEX updates (v4.8.0+) lock all vulnerability queries
Migration Gap: 2-6 hours without Red Hat vulnerability detection during v4.8.0 upgrade
Network Dependencies: NVD, Ubuntu USN, Debian DSA, Red Hat VEX - any failure cascades
Rate Limiting Impact: NVD delays can push updates 6+ hours behind
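When scans freeze during an updater run, standard PostgreSQL catalogs show whether vulnerability queries are actually stuck behind the update transaction. This is generic PostgreSQL (9.6+), nothing Clair-specific:

```sql
-- Who is blocked, and by whom (generic PostgreSQL, not Clair-specific)
SELECT blocked.pid                 AS blocked_pid,
       blocking.pid                AS blocking_pid,
       now() - blocked.query_start AS blocked_for,
       left(blocked.query, 80)     AS blocked_query,
       left(blocking.query, 80)    AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));
```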
Configuration That Actually Works in Production
PostgreSQL Settings
max_connections = 200+
autovacuum_max_workers = 6+
autovacuum_vacuum_scale_factor = 0.1
log_min_duration_statement = 1000ms
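To see how close you are to the `max_connections` ceiling before the pool saturates, a plain `pg_stat_activity` count is enough; nothing here is Clair-specific:

```sql
-- Current connection utilization vs. the configured ceiling
SELECT count(*)                                            AS connections_in_use,
       current_setting('max_connections')::int             AS max_connections,
       round(100.0 * count(*)
             / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;
```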
Memory Allocation Strategy
- Base Containers: 1GB limit minimum
- ML/Complex Containers: 8GB+ limit required
- Database Server: 16GB minimum for VEX update spikes
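As a starting point for the limits above, a Kubernetes container resources block might look like this. The values are illustrative; the 8Gi ceiling only makes sense for fleets that actually scan ML/Python-heavy images.

```yaml
# Illustrative resources block for a Clair indexer container - tune to your workload mix
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "8Gi"   # drop this if you never scan ML/Python-heavy images
```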
Required Indexes for Scale
CREATE INDEX CONCURRENTLY idx_vuln_affected_package
ON vuln_affected (package_id, vulnerability_id);
Monitoring Thresholds
- Critical Alert: PostgreSQL connections >90% for 2+ minutes
- Critical Alert: Scan queue >100 requests for 10+ minutes
- Warning: Individual scans >5 minutes consistently
- Warning: Memory usage >3GB for basic containers
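These thresholds translate into Prometheus alerting rules roughly as follows. The Clair metric name is taken from the metrics section later in this document and the PostgreSQL metrics assume `postgres_exporter` - verify both against your actual `/metrics` output before trusting the alerts.

```yaml
groups:
  - name: clair-production
    rules:
      - alert: ClairScanQueueBacklog
        # metric name assumed from this document; confirm against Clair's /metrics endpoint
        expr: clair_indexer_queue_size > 100
        for: 10m
        labels:
          severity: critical
      - alert: PostgresConnectionSaturation
        # assumes postgres_exporter; 0.9 = 90% of max_connections
        expr: sum(pg_stat_activity_count) / pg_settings_max_connections > 0.9
        for: 2m
        labels:
          severity: critical
```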
Operational Intelligence & Troubleshooting
Performance Baselines
- Standard Ubuntu: <30 seconds indexing
- Multi-stage builds (20+ layers): 1-2 minutes
- ML containers: 5+ minutes (legitimate)
- Database growth: 500MB per 1,000 containers
- Storage planning: 50GB+ for 100,000+ images
Common Failure Patterns
- Connection pool exhaustion (70% of stuck scans) - Check `pg_stat_activity` (query sketch below)
- Network timeouts - Look for `context deadline exceeded` in logs
- Silent OOMKills - Check `kubectl describe pod` for kill events
- Query performance cliff - Happens around 100,000 indexed images
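For the connection-pool case, a minimal `pg_stat_activity` query shows what's running, what's waiting, and for how long (generic PostgreSQL):

```sql
-- Long-running or waiting queries, longest first
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - query_start AS running_for,
       left(query, 80)     AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY running_for DESC NULLS LAST;
```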
Migration Risks (v4.8.0 OVAL to VEX)
Procedure: Run `clairctl -D admin pre v4.8.0` during a maintenance window (runbook sketch below)
Downtime: 2-6 hours of missing Red Hat vulnerability detection
Failure Mode: Authentication issues with new VEX endpoints
Rollback Complexity: High - test migration procedures in staging first
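A hedged sketch of the order of operations, built around the `clairctl` command above. The dump path and database name are placeholders, and flags should be checked against your clairctl version before the maintenance window.

```sh
# 1. Take a rollback point before touching the schema (database name is a placeholder)
pg_dump -Fc -d clair -f clair_pre_v4.8.0.dump

# 2. Pre-migration step from this document, run inside the maintenance window
clairctl -D admin pre v4.8.0

# 3. Upgrade Clair, then confirm Red Hat VEX matches are flowing again before closing the window
```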
Resource Requirements & Scaling
Unpredictable Factors
- Container complexity (not size) determines memory usage
- ML containers with Python packages = highest resource consumption
- Binary-only containers = minimal resources regardless of size
- Network distance to registry = up to 10x difference in scan time
Infrastructure Dependencies
- Local registries (Harbor): 100+ Mbps sustained transfer
- External registries (Docker Hub): 10-20 Mbps limitation
- Air-gapped environments: Require vulnerability database mirroring
Critical Warnings & Known Issues
Silent Failure Modes
- HTTP health checks miss operational problems (`/healthz` returns 200 during queue backup)
- Webhook delivery failures don't distinguish temporary vs permanent failures
- Memory leaks with malformed containers have no reliable detection
Vendor-Specific Issues
- AWS ECR: IAM role issues cause intermittent authentication failures
- Docker Hub: Rate limiting blocks anonymous layer downloads
- Harbor: Provides best operational experience with built-in integration
Air-Gapped Deployment Complexity
- Certificate validation fails with custom CA hierarchies
- Database migration procedures need isolation testing
- Vulnerability feed synchronization requires custom scripting
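For the feed-synchronization piece, clairctl ships export/import updater commands intended for disconnected environments. Exact flags and archive handling vary by version, so treat this as a sketch to verify against your clairctl build:

```sh
# On a host with internet access (archive name is arbitrary)
clairctl export-updaters updates.json.gz

# Move the archive across the air gap by approved means, then on the isolated side:
clairctl import-updaters updates.json.gz
```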
Decision Support Matrix
When to Use Clair
Worth it despite complexity: Large container inventories requiring compliance reporting
Not recommended: Small deployments (<1000 containers) due to operational overhead
Alternative consideration: Harbor integration reduces operational burden significantly
Resource Investment Requirements
- Time: 2-4 weeks for production-ready deployment
- Expertise: PostgreSQL DBA knowledge essential
- Monitoring: Custom Prometheus/Grafana setup required
- Maintenance: Weekly database VACUUM operations
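The weekly maintenance item is plain PostgreSQL. Check dead-tuple counts first, then run a manual pass if autovacuum has fallen behind:

```sql
-- Tables with the heaviest dead-tuple counts
SELECT relname, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

-- Manual pass during a quiet window if autovacuum can't keep up
VACUUM (ANALYZE, VERBOSE);
```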
Breaking Change Impact
- Version upgrades: Require database migrations with rollback testing
- Dependency changes: Upstream rate limiting affects update schedules
- Security patches: May require extended maintenance windows
Monitoring Implementation Guide
Essential Metrics (Not CPU/Memory)
- `clair_indexer_queue_size` - Alert when >100
- `clair_updater_last_success` - Alert when >24 hours old
- PostgreSQL connection count - Alert at 80% utilization
- Scan completion time tracking - Alert when >2x baseline
Log Analysis Patterns
- `acquiring connection: timeout` = Connection pool exhaustion
- `runtime: out of memory` = Memory allocation failure (too late to act)
- `notification delivery failed` = Webhook issues with HTTP status codes
- `slow query` warnings = Query performance degradation
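A quick way to count those signatures in recent logs, assuming Clair runs in Kubernetes behind an `app=clair` label (the label and time window are placeholders):

```sh
# Pull recent logs once, then count each failure signature
kubectl logs -l app=clair --since=1h --tail=-1 > /tmp/clair.log

for pattern in 'acquiring connection: timeout' 'out of memory' \
               'notification delivery failed' 'slow query'; do
  printf '%6d  %s\n' "$(grep -c "$pattern" /tmp/clair.log)" "$pattern"
done
```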
Functional Health Checks
Submit known container for indexing and verify:
- Completion within expected time
- Vulnerability detection accuracy
- Webhook delivery to notification system
- Database query responsiveness
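A minimal functional probe, assuming `clairctl` is configured to reach your deployment; the image is just a small, well-known example and flags vary by clairctl version:

```sh
# Index a known image and time the round trip - a missing or empty report is the real alert condition
time clairctl report ubuntu:22.04
```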
This operational intelligence enables AI systems to make informed decisions about Clair deployment, resource allocation, and troubleshooting procedures based on real-world production experience.
Useful Links for Further Investigation
When Clair Breaks in Production (And How to Fix It)
Link | Description |
---|---|
Clair Prometheus Metrics Reference | Monitor these metrics or you'll be debugging blind: `clair_indexer_queue_depth` (watch for > 100), `clair_updater_success` (should be consistent), and database connection counts. The rest is noise. |
PostgreSQL Performance Monitoring Guide | PostgreSQL is where Clair breaks first. Monitor connection counts, slow queries, and vacuum performance. This doc explains the stats that actually predict problems. |
Grafana Clair Dashboard Examples | The only dashboard that works is `clair-dashboard.json`. Copy it directly - the others are abandoned experiments with missing dependencies. |
PgBouncer Connection Pooling | You NEED this for production. Without connection pooling, Clair will exhaust PostgreSQL connections during scan bursts. PgBouncer saved my ass when we hit 500 concurrent scans. |
Clair v4.8.0 Migration Guide | The OVAL-to-VEX migration breaks everything. Follow the pre-migration steps exactly or you'll corrupt your database. I learned this the hard way during a Friday deployment. |
Red Hat VEX Security Data Documentation | VEX format is newer and more accurate than OVAL, but the migration is a pain. This explains why your RHEL scans broke after v4.8.0. |
Clair GitHub Issues - "memory-leak" label | Memory leaks are common with ML containers. These issues have the actual fixes, not just "restart the pod and hope." |
PostgreSQL Slow Query Analysis | When scans get stuck, it's usually a database performance problem. This shows you how to find the queries that are killing your performance. |
PostgreSQL VACUUM and Maintenance | Clair generates massive database churn. Without proper vacuuming, queries slow to a crawl after a few weeks. Set up autovacuum or suffer. |
Database Migration Procedures | Version upgrades require database migrations. Do this wrong and you'll lose scan history. The docs skip the rollback procedures - test those first. |
Air-Gapped Database Setup | Air-gapped deployments are a special kind of hell. This guide covers vulnerability database mirroring, but expect certificate issues and firewall pain. |
Harbor Clair Integration | Harbor's built-in Clair is easier to manage than standalone deployments. Use this if you're already on Harbor - it handles the networking and database setup. |
NVD API Access and Rate Limits | NVD rate limits will kill your vulnerability updates. Get an API key or your database will fall behind during security events. Takes 2 weeks to get approved. |
Ubuntu Security Notifications | When Ubuntu releases security updates, Clair's matcher locks up while rebuilding indexes. This is the feed that causes those 15-minute scan freezes. |
Webhook Configuration and Debugging | Webhooks fail silently with malformed JSON. The example payloads in this doc are the only ones that work reliably. Copy them exactly. |