Weaviate Production Deployment: AI-Optimized Technical Reference
Critical Failure Scenarios & Consequences
Memory Planning Failures
- Official Formula Limitation: the documented estimate `(objects × dimensions × 4 bytes) + overhead` assumes single-tenant, write-once workloads
- Real-World Multipliers:
  - Multi-tenancy: 2x memory requirement
  - Frequent updates: 3x memory requirement
  - Production traffic: 6GB+ RAM for a theoretical 3GB workload
- Failure Impact: Complete cluster failure when a staging-sized cluster (1M vectors, 3GB) is promoted to serve production traffic
- Cost Impact: Teams blow their entire AWS budget due to inaccurate memory planning
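The multipliers above can be folded into a quick sizing sketch. This is a back-of-the-envelope helper, not official Weaviate guidance; in particular the 1.5x index/metadata overhead factor is an assumption layered on the base formula:

```python
def weaviate_memory_gib(objects: int, dimensions: int,
                        multi_tenant: bool = False,
                        frequent_updates: bool = False,
                        rebuild_headroom: bool = True) -> float:
    """Rule-of-thumb RAM estimate using the real-world multipliers above."""
    base = objects * dimensions * 4      # float32 vectors: objects x dims x 4 bytes
    base *= 1.5                          # rough HNSW graph + metadata overhead (assumption)
    if multi_tenant:
        base *= 2                        # multi-tenancy multiplier
    if frequent_updates:
        base *= 3                        # frequent-update multiplier
    if rebuild_headroom:
        base *= 2                        # index rebuilds temporarily double usage
    return base / 2**30

# 1M x 768-dim vectors, single tenant, with rebuild headroom:
print(f"{weaviate_memory_gib(1_000_000, 768):.1f} GiB")  # → 8.6 GiB
```

The ~3GB "theoretical" figure for 1M 768-dim vectors lands at roughly 8-9GiB once overhead and rebuild headroom are included, which is consistent with the 6GB+ observation above.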
HNSW Index Memory Consumption
- Index Rebuilding: Temporarily doubles memory usage during operations
- Garbage Collection: Causes query timeouts in production
- Memory Fragmentation: Prevents utilization of all allocated RAM
- Multi-tenancy Overhead: Adds 50-100% memory usage per tenant
Configuration Requirements
Production-Ready Resource Allocation
replicas: 5                # Minimum for true high availability
resources:
  requests:
    cpu: "2000m"           # Prevents throttling hell
    memory: "8Gi"          # Doubled from the theoretical calculation
  limits:
    cpu: "4000m"           # Headroom for index rebuilds
    memory: "16Gi"         # Prevents OOMKilled errors
Storage Configuration That Prevents Bankruptcy
persistence:
  storageClass: "gp3"      # Cost-effective; avoid io2 unless actually bottlenecked
  size: "1000Gi"           # Plan for growth; resizing is operationally painful
Storage Cost Reality:
- Provisioned IOPS charges can reach $4,800/month with poor write patterns
- EBS gp3 sufficient until actual bottleneck identification
- Burst credits exhaust faster than deployment patience
Kubernetes High Availability Reality
- 3-node clusters: Single point of failure when one node dies during memory spike
- Minimum requirement: 5+ nodes with proper pod anti-affinity
- Failure mode: "Highly available" cluster becomes single overloaded node
Security Implementation Challenges
Authentication Operational Issues
- API Key Problems:
- Security teams discover hardcoded keys in Git history
- Triggers "urgent security reviews"
- OIDC Integration:
- Can add ~500ms of latency to every request
- Fails during Azure AD outages (especially during product demos)
- Breaks mysteriously when identity provider has issues
Network Policy Disasters
- Implementation Reality: Block legitimate traffic in ways requiring hours to debug
- Recommended Approach: Start without policies, add incrementally after basic functionality works
- Debugging Time: First week spent troubleshooting "connection refused" errors
TLS Certificate Nightmares
- cert-manager Reliability: Works perfectly in staging, stops renewing in production
- Rate Limit Failures: Let's Encrypt limits cause cert-manager to abandon renewal attempts
- Failure Timing: Certificates expire during holidays/weekends (Christmas Eve documented case)
- Prevention: Manual cert rotation scripts tested monthly
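A minimal expiry check can back those monthly rotation tests without depending on cert-manager's own health. This is a standard-library sketch; the hostname is a placeholder, and the alert threshold is yours to choose:

```python
import datetime
import socket
import ssl

def cert_days_remaining(host: str, port: int = 443) -> int:
    """Days until the certificate served at host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # notAfter format, e.g. 'Jun  1 12:00:00 2030 GMT'
    expires = datetime.datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    return (expires - datetime.datetime.utcnow()).days

# e.g. run daily from cron and page when:
#   cert_days_remaining("weaviate.example.com") < 14
```

Running this from a cron job outside the cluster catches the "cert-manager silently stopped renewing" failure mode before the Christmas Eve expiry does.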
Deployment Process Reality
Helm Deployment Expectations vs Reality
# Deployment time expectations:
# Documentation: 5-10 minutes
# Reality: 30 minutes (lucky), 2 hours (networking issues), full day (EKS bugs)
Common Deployment Failures
- Pending Pods: Insufficient cluster resources or broken storage class configuration
- Storage Issues: AWS/GCP storage classes don't exist as expected
- Memory Constraints: Nodes smaller than Weaviate requirements (t2.micro attempting enterprise software)
- EKS 1.28.2 Bug: Ingress controller causes pods to disappear completely
Scaling Operational Intelligence
Sharding Configuration Reality
# Recommended for production growth (Python client v4):
from weaviate.classes.config import Configure

Configure.sharding(
    virtual_per_physical=512,  # Over-provision from day 1
    desired_count=10,          # Plan for growth, not current size
)

# Avoid this pattern:
Configure.sharding(
    virtual_per_physical=64,   # Creates a resharding nightmare later
    desired_count=3,           # Single point of failure
)
Resharding Consequences:
- Requires complete downtime (6 hours documented case)
- Process can fail late in the run (one documented case died at 87% complete)
- Memory exhaustion during resharding process
Async Replication Trade-offs
- Performance Gain: 300-500% write performance improvement
- Consistency Cost: Eventual consistency introduces stale read bugs
- Application Impact: Must handle seconds/minutes of stale data
- Monitoring Requirement: Replication lag monitoring essential
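One way to absorb that staleness window at the application layer is a read-after-write retry. A minimal sketch, assuming your own `fetch` callable that returns the object plus some monotonically increasing version you track on writes — Weaviate does not expose this exact API, so the callable is yours to implement:

```python
import time

def read_after_write(fetch, object_id, expected_version, retries=5, backoff=0.2):
    """Retry a read until the replica catches up to a known write version.

    `fetch` is any callable returning (object, version); hypothetical, not a
    Weaviate client method.
    """
    for attempt in range(retries):
        obj, version = fetch(object_id)
        if version >= expected_version:
            return obj
        time.sleep(backoff * 2**attempt)  # exponential backoff while replication lags
    raise TimeoutError(f"replica still stale after {retries} reads")
```

The point is not this exact helper but the pattern: code paths that must see their own writes need an explicit catch-up mechanism once async replication is on.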
Performance Expectations vs Reality
Query Latency Reality Check
- Marketing Claims: Sub-millisecond latency
- Production Reality: 10-50ms with network overhead, authentication, real query patterns
- Benchmark Limitations: Perfect conditions don't exist in production
- Planning Target: 50-200ms latency for real-world scenarios
Load Testing That Breaks Systems
# Realistic load test parameters:
concurrent_workers = 50 # Real production load
query_count = 200 # Sufficient to expose bottlenecks
result_limit = 1000 # Realistic result set size
timeout = 30 # Realistic timeout expectations
Failure Indicators:
- "connection reset by peer" when cluster can't handle load
- All queries failing indicates cluster failure
- P95 latency > 100ms indicates capacity issues
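The parameters above drop into a small harness like the following. The query call is stubbed with a sleep so the sketch runs standalone — swap `run_query` for a real client call (e.g. a near-vector search with the result limit and timeout above) against your cluster:

```python
import concurrent.futures
import statistics
import time

CONCURRENT_WORKERS = 50   # real production load
QUERY_COUNT = 200         # sufficient to expose bottlenecks
RESULT_LIMIT = 1000       # would be passed to the real query
TIMEOUT = 30              # would bound the real query round trip

def run_query(i: int) -> float:
    """Stub: measures one query's latency. Replace the sleep with a real call."""
    start = time.perf_counter()
    time.sleep(0.001)     # stand-in for the actual query round trip
    return time.perf_counter() - start

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENT_WORKERS) as pool:
    latencies = sorted(pool.map(run_query, range(QUERY_COUNT)))

p50 = statistics.median(latencies)
p95 = latencies[int(0.95 * len(latencies)) - 1]
print(f"P50={p50 * 1000:.1f}ms  P95={p95 * 1000:.1f}ms")
if p95 > 0.1:
    print("WARNING: P95 > 100ms -- capacity issue per the thresholds above")
```

Failures here show up exactly as described: connection resets under load, blanket query failures on cluster death, and a P95 that blows past 100ms when capacity runs out.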
Monitoring Critical Metrics
Essential Production Metrics
- Query Latency: P50, P95, P99 percentiles (averages lie)
- Memory Utilization: Trend monitoring for capacity planning
- Index Operation Rates: Background maintenance impact
- Replication Lag: Consistency impact measurement
Alerting Thresholds
# Proven alerting rules:
- alert: WeaviateHighQueryLatency
  expr: weaviate_query_duration_seconds{quantile="0.95"} > 0.1
  for: 5m
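The same rules file can carry a memory-pressure alert to catch the OOMKilled failure mode early. The metric names below are assumptions (standard cAdvisor metrics scraped by Prometheus) — verify against what your cluster actually exports:

```yaml
- alert: WeaviateMemoryPressure
  # Working set vs. container limit; assumes cAdvisor metrics are scraped
  expr: (container_memory_working_set_bytes{pod=~"weaviate.*"} / container_spec_memory_limit_bytes{pod=~"weaviate.*"}) > 0.85
  for: 10m
```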
Backup and Disaster Recovery
Backup Reality
- Testing Frequency: Monthly restore testing required
- Cross-region Complexity: Split-brain scenarios and data lag issues
- Restore Time: Test actual recovery time, not just backup creation
- Failure Discovery: Usually discovered when restoration is actually needed
Cost Management Intelligence
AWS Cost Optimization
- IOPS Optimization: Check write patterns before upgrading to io2
- Storage Class Selection: Start with gp3, upgrade only when bottlenecked
- Resource Right-sizing: Monitor actual usage vs allocated resources
Memory Cost Management
- Over-allocation Risk: Wasting money on unused RAM
- Under-allocation Risk: OOMKilled errors in production
- Monitoring Approach: Use `kubectl top pods` for actual usage tracking
Migration and Upgrade Risks
Version Upgrade Process
- Zero-downtime Requirements: Replication factor ≥ 2 mandatory
- Rolling Update Strategy: Update one replica at a time
- Validation Steps: Verify each node before proceeding
- Rollback Planning: Prepare rollback procedures before upgrade
Data Migration Challenges
- Downtime Requirements: Plan for extended maintenance windows
- Data Integrity: Verify migration completeness before cutover
- Performance Impact: Expect degraded performance during migration
Troubleshooting Decision Tree
Pod Pending Issues
- Check node resources: `kubectl get nodes -o wide`
- Verify storage class: `kubectl get storageclass`
- Review events: `kubectl get events --sort-by='.lastTimestamp'`
Query Performance Issues
- Memory pressure: Check if working set fits in RAM
- CPU throttling: Monitor CPU limit hits during peak load
- Network latency: Verify ingress and load balancer configuration
- Index optimization: Validate HNSW parameters for data distribution
Connection Refused Errors
- Service discovery: `kubectl get svc` and `kubectl get endpoints`
- Network policies: Check for traffic blocking rules
- Authentication: Verify API key or OIDC configuration
- Load balancer: Health check failure investigation
Resource Requirements by Scale
| Vector Count | Memory Requirement | CPU Requirement | Storage Requirement |
|---|---|---|---|
| 1M vectors | 6GB+ RAM | 2+ cores | 100GB+ SSD |
| 10M vectors | 60GB+ RAM | 4+ cores | 1TB+ SSD |
| 100M vectors | 600GB+ RAM | 8+ cores | 10TB+ SSD |
Scaling Multipliers:
- Multi-tenancy: 2x memory
- Frequent updates: 3x memory
- Index rebuilds: Temporary 2x memory spike
Production Success Metrics
Realistic Success Indicators
- Users stop complaining in Slack
- Queries don't timeout during CEO demos
- No 3am PagerDuty alerts
- Sub-200ms query latency consistently
Unrealistic Expectations
- Sub-100ms latency (marketing bullshit)
- Perfect uptime without dedicated SREs
- Zero operational overhead
- Unlimited AWS credits like case study examples
Implementation Priority Order
- Phase 1: Basic cluster deployment with proper resource allocation
- Phase 2: Monitoring and alerting setup before production traffic
- Phase 3: Security hardening (authentication, TLS, network policies)
- Phase 4: Scaling configuration (sharding, replication)
- Phase 5: Backup and disaster recovery procedures
- Phase 6: Performance optimization and advanced scaling
Key Documentation References
Useful Links for Further Investigation
Essential Resources for Production Weaviate Deployment
| Link | Description |
|---|---|
Weaviate Production Environment Guide | This guide provides comprehensive requirements and best practices for deploying Weaviate in a production environment, ensuring stability and performance. |
Kubernetes Deployment Documentation | Access the official documentation for deploying Weaviate on Kubernetes, including detailed guides and step-by-step tutorials for various setups. |
Horizontal Scaling Configuration | Explore detailed sharding and replication strategies to configure Weaviate for horizontal scaling, optimizing performance and data distribution across your cluster. |
Production Readiness Assessment | Utilize this self-assessment checklist to evaluate your Weaviate deployment's readiness for production, covering critical aspects of stability and reliability. |
Deploy Weaviate on Google GKE | Follow this step-by-step tutorial provided by Google Cloud to successfully deploy your Weaviate instance on Google Kubernetes Engine (GKE). |
AWS EKS with Weaviate | Learn how to deploy Weaviate on Amazon Elastic Kubernetes Service (EKS) using Kubernetes, ensuring a robust and scalable cloud infrastructure. |
Multi-cloud Vector Database Deployments | Discover enterprise security patterns and best practices for multi-account deployment of open-source vector databases like Weaviate on AWS. |
Monitoring Weaviate in Production | Set up a complete monitoring solution for Weaviate in production environments using popular tools like Prometheus and Grafana for observability. |
Weaviate Resource Requirements | Understand the memory, CPU, and storage planning guidelines essential for effectively sizing and provisioning your Weaviate cluster resources. |
Cluster Architecture Overview | Take a deep dive into Weaviate's distributed architecture, understanding how replication and sharding contribute to its scalability and resilience. |
The Art of Scaling a Vector Database | Learn advanced scaling techniques and performance optimization strategies specifically tailored for vector databases like Weaviate to handle high loads. |
Zero-Downtime Upgrades Guide | Implement production upgrade strategies for Weaviate that ensure zero service interruption, maintaining continuous availability during critical updates. |
Async Replication Configuration | Configure high-throughput asynchronous replication settings, a feature introduced in Weaviate v1.29, to enhance data consistency and performance. |
Weaviate Community Forum | Engage with the active Weaviate community forum to find support, participate in discussions, and share knowledge with other users and developers. |
Production Environment Support Category | Find specific help and solutions for challenges related to production Weaviate deployments within this dedicated support category on the community forum. |
Kubernetes Deployment Discussions | Join community discussions focused on multi-node Weaviate setups and Kubernetes deployments, sharing insights and troubleshooting tips with peers. |
Loti AI Production Case Study | Read this real-world case study of Loti AI's production deployment, successfully handling an impressive 9 billion vectors with Weaviate. |
Enterprise AI at Scale Podcast | Gain valuable insights from Box's large-scale Weaviate deployment in this podcast, discussing enterprise AI at scale with industry experts. |
Official Weaviate Helm Chart | Access the official Weaviate Helm chart repository, providing a production-ready solution for deploying and managing Weaviate on Kubernetes. |
Weaviate Docker Images | Find the official Weaviate container images on Docker Hub, optimized and ready for deployment in production environments. |
Configuration Examples | Review sample configurations for various Weaviate deployment scenarios, offering practical examples to guide your setup and customization. |
Python Client v4 Documentation | Explore the documentation for the production-ready Weaviate Python client v4, featuring efficient connection pooling for robust applications. |
JavaScript/TypeScript Client | Integrate Weaviate into your Node.js applications using the JavaScript/TypeScript client, designed for production-grade performance and reliability. |
GraphQL and REST API Reference | Access the complete API documentation for Weaviate, covering both GraphQL and REST interfaces, essential for custom integrations and development. |
Weaviate 1.30 Migration Guide | Follow the migration procedures for the BlockMax WAND algorithm, crucial for upgrading your Weaviate instance to version 1.30. |
Database Migration Between Clusters | Find community guidance and best practices for migrating your Weaviate database from one cluster to another, ensuring data integrity. |
Automated Backup Solutions | Implement automated backup solutions for Weaviate to ensure robust data protection and comprehensive disaster recovery planning for your deployments. |
Weaviate Release History | Review the complete changelog and detailed upgrade notes for all Weaviate releases, available directly on the official GitHub repository. |
Weaviate Development Blog | Stay informed with the latest updates, feature announcements, and technical insights from the official Weaviate development blog. |
Running Vector DBs on Kubernetes - Production Tips | Read this independent guide offering production tips for running vector databases like Qdrant or Weaviate effectively on Kubernetes. |
Installing Weaviate on Kubernetes: In-Depth Guide | Follow this comprehensive, in-depth installation walkthrough for deploying Weaviate on Kubernetes, covering all necessary steps and configurations. |
Scalable Vector Search Architecture | Discover production architecture patterns and effective scaling strategies for building a highly scalable vector search system with Weaviate. |
Vector Database Comparison 2025 | Review a detailed analysis comparing Weaviate against its competitors like Pinecone, Qdrant, Milvus, and Chroma for RAG systems in 2025. |
Production RAG Systems Guide | Learn best practices and discover the latest tools for building robust, production-ready RAG (Retrieval Augmented Generation) systems using Weaviate. |