Tabby Enterprise Deployment: Production Technical Reference
Hardware Requirements - Production Reality
7B Models (CodeLlama-7B, DeepSeek-Coder-7B)
Minimum Production Configuration:
- GPU: 16GB VRAM minimum (RTX 4070 Ti SUPER / RTX 4080 class or better; note the base RTX 4070 Ti has only 12GB)
- RAM: 32GB system RAM
- CPU: 8 cores minimum
- Storage: NVMe SSD with 500GB+ free space
Critical Warning: Official documentation claims 8GB VRAM works - this causes CUDA out of memory crashes every 2 hours in production.
1B Models (StarCoder-1B)
Functional Configuration:
- GPU: RTX 3060 with 12GB VRAM
- RAM: 16GB system RAM sufficient
- Performance trade-off: Fast but mediocre suggestion quality
Failure Scenarios
- RTX 3070 (8GB VRAM) with 7B models = guaranteed crashes every few hours
- Default 4GB RAM allocation = OOMKilled during peak hours
- Model weights + OS overhead consume significantly more memory than documented
Kubernetes Production Configuration
Resource Requirements That Work
resources:
  requests:
    memory: "16Gi"        # Not the 4Gi default
    cpu: "8"              # Not the 2-core default
    nvidia.com/gpu: 1
  limits:
    memory: "24Gi"        # Leave headroom for memory leaks
    cpu: "12"
    nvidia.com/gpu: 1
Storage Requirements
- Default 20GB persistent volume = insufficient
- Production requirement: 100GB minimum
- Breakdown: ~14GB per compressed model (often more than one model in use), plus git caches, indexed code, and logs - see the PVC sketch below
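A PersistentVolumeClaim sketch sized to that requirement; the name, namespace, and storage class are assumptions to swap for your cluster's own:

# Hypothetical PVC sized for production Tabby; storageClassName is an
# assumption - substitute whatever your cluster actually provides.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: tabby-data
  namespace: tabby
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast-nvme   # assumed NVMe-backed class
  resources:
    requests:
      storage: 100Gi            # the production minimum discussed above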
Critical Deployment Issues
Pod Anti-Affinity Required:
podAntiAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: tabby
      topologyKey: kubernetes.io/hostname
Ingress Configuration:
- Standard nginx configurations fail outright when Istio or Traefik sits in front; each proxy layer needs its own config
- WebSocket support is required for the chat interface
- Extended timeouts are needed to survive model loading (see the annotation sketch below)
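For ingress-nginx specifically, a minimal annotation sketch covering the WebSocket and timeout points above; the values are assumptions to tune against your model-load times, and Istio/Traefik need their own equivalents:

# ingress-nginx annotations; timeout values are assumptions - tune them
# to your model-load times. WebSockets work once proxy timeouts are raised.
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
nginx.ingress.kubernetes.io/proxy-body-size: "50m"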
Memory Leak Management
Root Cause
- Indexing jobs don't clean up properly after completion
- Memory usage increases from 8GB baseline to 28GB+ over one week
- Container stays "healthy" while consuming excessive memory
Production Solution
Automated Rolling Restarts:
# Cron job for proactive memory management
kubectl get pods -n tabby --no-headers | while read -r pod _; do
  memory=$(kubectl top pod "$pod" -n tabby --no-headers | awk '{print $3}' | sed 's/Mi//')
  if [ "$memory" -gt 20000 ]; then  # 20GB threshold
    kubectl delete pod "$pod" -n tabby
  fi
done
Schedule: Every 24 hours during off-hours
Impact: 30-second interruption prevents production crashes
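If you prefer to keep this in-cluster rather than in an operator's crontab, a CronJob sketch follows; the ServiceAccount, ConfigMap, and schedule are assumptions, and the account needs RBAC to list and delete pods plus read pod metrics:

# Sketch of an in-cluster CronJob wrapping the restart script above.
# Assumptions: a "tabby-restarter" ServiceAccount with pod list/delete
# and pod-metrics RBAC, and the script stored in a ConfigMap.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tabby-memory-restart
  namespace: tabby
spec:
  schedule: "0 3 * * *"          # 03:00 daily - adjust to your off-hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: tabby-restarter
          restartPolicy: Never
          containers:
            - name: restarter
              image: bitnami/kubectl:latest
              command: ["/bin/sh", "/scripts/restart.sh"]
              volumeMounts:
                - name: script
                  mountPath: /scripts
          volumes:
            - name: script
              configMap:
                name: tabby-restart-script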
Authentication Configuration
LDAP Integration Issues
Common Failures:
- Cryptic "authentication failed" errors without root cause indication
- Nested group membership not detected (most enterprises use nested groups)
- Certificate validation failures with internal CAs
Required Configuration:
# Enable debug logging for troubleshooting
RUST_LOG=debug
# Certificate handling for internal CAs
# Mount CA bundle to /etc/ssl/certs/ in container
Group Search Requirements:
- Active Directory and OpenLDAP need different search filters; expect to configure them by hand
- Nested groups require a custom filter (see the Active Directory sketch below)
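For Active Directory, nested membership can be resolved inside the search filter itself using the matching-rule-in-chain OID; a sketch with placeholder DNs:

# Active Directory: resolve nested membership in the filter via the
# LDAP_MATCHING_RULE_IN_CHAIN OID (DNs below are placeholders).
(&(objectClass=user)(memberOf:1.2.840.113556.1.4.1941:=CN=developers,OU=Groups,DC=company,DC=com))

# OpenLDAP has no equivalent operator; nested groups typically need the
# memberOf overlay plus an application-side group walk.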
Network Configuration
Air-Gapped Deployment
Model Pre-loading Process:
# Internet-connected machine
docker run -v $(pwd)/models:/data tabbyml/tabby download --model CodeLlama-7B-Instruct
# Transfer models directory to production
# Run with --model /data/CodeLlama-7B-Instruct
Proxy Configuration
Required Environment Variables:
HTTP_PROXY=http://proxy.company.com:8080
HTTPS_PROXY=http://proxy.company.com:8080
NO_PROXY=localhost,127.0.0.1,.company.com
Known Issue: HuggingFace model downloads hang despite correct proxy settings
Detection: custom monitoring required - a download running longer than 10 minutes has almost certainly hung (watchdog sketch below)
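A minimal watchdog sketch, assuming downloads run through the container shown in the air-gapped section and that the proxy variables live in a hypothetical proxy.env file:

# Watchdog sketch: kill a model download that exceeds the 10-minute
# threshold instead of letting it hang indefinitely. proxy.env is a
# hypothetical file holding the proxy variables above.
timeout 600 docker run --env-file proxy.env \
  -v $(pwd)/models:/data tabbyml/tabby download --model CodeLlama-7B-Instruct
if [ $? -eq 124 ]; then   # timeout(1) exits 124 when the limit is hit
  echo "Model download exceeded 10 minutes - assume hung, alert and retry" >&2
fi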
Production Failure Modes
IDE Extension Connectivity
Failure Pattern: "Not connected" status in VS Code extension
Root Cause: Extension doesn't retry connections after server restarts
Solution: Enable auto-reconnect in extension settings (not foolproof)
Repository Indexing Failures
Failure Threshold: 500k+ lines of code
Symptoms: Initial indexing times out, completions become generic
Solutions:
- Split large repos into multiple smaller indexes
- Increase indexing timeout to 4+ hours for initial run
- Incremental updates are much faster once the initial index completes
CUDA Version Mismatches
Failure Pattern: Container starts, loads model, crashes on first completion request
Root Cause: Different NVIDIA driver versions across Kubernetes nodes
Solution:
nodeSelector:
  nvidia.com/gpu.driver-version: "535.86.10"
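If nothing in the cluster publishes that label (GPU feature discovery setups vary), it can be applied manually; gpu-node-01 is a placeholder node name:

# Apply the driver-version label by hand on each GPU node; verify what
# your nodes already advertise first with: kubectl describe node
kubectl label node gpu-node-01 nvidia.com/gpu.driver-version=535.86.10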
Monitoring Configuration
Critical Metrics
Essential Prometheus Metrics:
- tabby_memory_usage_bytes - track memory-leak progression
- tabby_model_load_duration_seconds - detect stuck downloads
- tabby_completion_requests_total - usage patterns
- tabby_git_sync_last_success_timestamp - indexing health
The alert sketch below builds on the first of these.
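A Prometheus alerting-rule sketch; it assumes tabby_memory_usage_bytes is exported with a per-pod label, and the threshold mirrors the 20GB rolling-restart trigger above:

# Alerting-rule sketch; assumes tabby_memory_usage_bytes carries a pod
# label. The 20GB threshold mirrors the rolling-restart trigger.
groups:
  - name: tabby
    rules:
      - alert: TabbyMemoryLeak
        expr: tabby_memory_usage_bytes > 20 * 1024 * 1024 * 1024
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Tabby pod {{ $labels.pod }} above 20GB - restart before OOM"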
Health Check Requirements
Built-in /health endpoint limitation: it returns 200 OK even during failures
Custom health checks (sketched below) must verify:
- Model loaded and responding to test prompts
- Git repositories indexed within 4 hours
- Memory usage under operational thresholds
- GPU temperature within safe ranges
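A probe sketch along those lines, assuming Tabby's /v1/health and /v1/completions endpoints (verify the paths against your version's API docs):

#!/bin/sh
# Probe sketch: don't trust /v1/health alone - require the model to
# answer a real test prompt. Endpoint paths assume Tabby's v1 API.
TABBY=http://localhost:8080
curl -sf "$TABBY/v1/health" > /dev/null || exit 1
resp=$(curl -sf -X POST "$TABBY/v1/completions" \
  -H 'Content-Type: application/json' \
  -d '{"language": "python", "segments": {"prefix": "def fib(n):"}}')
# A healthy model returns at least one non-empty completion choice
echo "$resp" | grep -q '"text"' || exit 1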
Scaling Patterns
Multi-Node Architecture
Pattern: Active-Passive with GPU Affinity
- Two instances on separate GPU nodes
- One active, one warm standby
- Failover time: <30 seconds, versus a 5-10 minute cold start
Load Balancing Configuration
For 20+ developers:
nginx.ingress.kubernetes.io/affinity: "cookie"
nginx.ingress.kubernetes.io/session-cookie-name: "tabby-server"
nginx.ingress.kubernetes.io/session-cookie-expires: "7200"
Repository Sharding Strategy
Large monorepo handling:
- Split by service boundaries rather than indexing one massive repo
- 3-5 related repositories per Tabby instance
- Prevents indexing timeouts and improves suggestion relevance
Model Selection Matrix
| Model | VRAM Required | RAM Required | Use Case | Quality |
|---|---|---|---|---|
| CodeLlama-7B-Instruct | 16GB | 32GB | General enterprise | Best balance |
| DeepSeek-Coder-7B | 16GB | 32GB | Math/ML heavy | Good for algorithms |
| StarCoder-1B | 12GB | 16GB | CI/CD, staging | Fast but mediocre |
Enterprise Standardization: Choose one model per organization to avoid operational complexity
Cost Optimization
Instance Selection Strategy
- A100: Expensive but handles larger models and more concurrent users
- RTX 4090: Cheaper per GB of VRAM but limited cloud availability
- H100: Overkill unless running 13B+ models
Development Environment Cost Reduction
- Spot instances: 60-70% cost savings for non-production
- Time-based scaling: Scale down overnight, back up before work hours (CronJob sketch below)
- Model caching: Shared persistent volumes prevent repeated downloads
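Time-based scaling can be two CronJobs wrapping kubectl scale; a sketch assuming the deployment is named tabby and a tabby-scaler ServiceAccount with permission to scale it:

# Sketch: scale the (assumed) "tabby" deployment to zero overnight and
# back up before work hours. Needs a ServiceAccount allowed to scale it.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: tabby-scale-down
  namespace: tabby
spec:
  schedule: "0 20 * * 1-5"       # 20:00 on weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: tabby-scaler
          restartPolicy: Never
          containers:
            - name: scale
              image: bitnami/kubectl:latest
              args: ["scale", "deployment/tabby", "-n", "tabby", "--replicas=0"]
# A mirror job with schedule "0 6 * * 1-5" and --replicas=1 scales back up.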
Security Implementation
Network Segmentation
# Egress fragment from the Tabby NetworkPolicy (full sketch below):
- to:
    - namespaceSelector:
        matchLabels:
          name: monitoring
    - namespaceSelector:
        matchLabels:
          name: developer-tools
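That fragment only parses inside a complete policy; a full-manifest sketch, with the namespace labels assumed to match your cluster's conventions:

# Full-policy sketch around the egress fragment above; the namespace
# labels ("monitoring", "developer-tools") are assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tabby-egress
  namespace: tabby
spec:
  podSelector:
    matchLabels:
      app: tabby
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              name: monitoring
        - namespaceSelector:
            matchLabels:
              name: developer-tools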
Audit Requirements
- Log user IDs, request types, response times, error rates
- Do NOT log code snippets (doing so defeats the privacy purpose of self-hosting)
- Store audit logs separate from application logs
- Enable detailed request logging for compliance
Disaster Recovery
Recovery Testing Requirements
Quarterly testing protocol:
- Intentionally kill entire Tabby deployment during low usage
- Verify failover works within SLA
- Confirm IDE auto-reconnection
- Validate model loading completes within timeframes
- Test persistent volume and config restoration
Recovery Time Objectives
- Failover: <30 seconds with warm standby
- Cold start: 5-10 minutes depending on model size
- Full disaster recovery: <1 hour with proper automation
Critical Warning Indicators
Immediate Action Required
- Memory usage >20GB per container
- Model loading duration >10 minutes
- Git sync timestamp >4 hours old
- GPU temperature >80°C (dcgm-exporter rule sketch below)
- CUDA out-of-memory errors in logs
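For the temperature indicator, if dcgm-exporter is deployed (see Prometheus GPU Metrics in the links below), a rule-fragment sketch; DCGM_FI_DEV_GPU_TEMP is dcgm-exporter's temperature gauge:

# GPU-temperature rule sketch; assumes dcgm-exporter is scraping nodes.
- alert: TabbyGpuOverheating
  expr: DCGM_FI_DEV_GPU_TEMP > 80
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "GPU {{ $labels.gpu }} on {{ $labels.Hostname }} above 80°C"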
Upgrade Risk Factors
- Config format changes between versions without migration scripts
- Container crashes during model loading with insufficient resources
- Single-node deployments have guaranteed downtime during updates
Alternative Comparison Matrix
| Solution | Setup Time | Maintenance | Data Privacy | Cost (50 devs/month) |
|---|---|---|---|---|
| Tabby Self-Hosted | 1-2 weeks | High | Complete | $5-15K |
| GitHub Copilot Enterprise | 30 minutes | Zero | Microsoft servers | $1,950 |
| Sourcegraph Cody | 3-5 days | Medium | Sourcegraph servers | $10-20K |
| Amazon CodeWhisperer | 1-2 hours | Low | AWS processing | $950 |
Production Readiness Checklist
Pre-Deployment Requirements
- GPU nodes with consistent driver versions
- 100GB+ persistent storage per instance
- Network policies configured
- LDAP/SSO integration tested
- Custom health checks implemented
- Monitoring and alerting configured
Post-Deployment Validation
- Memory leak monitoring active
- Rolling restart automation configured
- IDE extension connectivity verified across team
- Repository indexing completed successfully
- Failover procedures tested
- Audit logging compliance verified
Ongoing Maintenance Tasks
- Weekly memory usage review
- Monthly model update evaluation
- Quarterly disaster recovery testing
- Semi-annual security review
- Annual hardware capacity planning
Useful Links for Further Investigation
Essential Resources for Enterprise Tabby Deployment
| Link | Description |
|---|---|
| Tabby Installation Guide | Start here for basic deployment options. The Docker section is solid, but the Kubernetes examples need significant modifications for production use. |
| Models Registry | Complete list of supported models with actual hardware requirements. Ignore the "minimum" specs - use the "recommended" ones for production. |
| Configuration Reference | YAML config documentation. Essential for LDAP, authentication, and custom model paths. |
| Air-Gapped Deployment with Docker | Step-by-step guide for completely offline deployments. Actually works, unlike most vendor tutorials. |
| SkyPilot Deployment Guide | Cloud deployment automation using SkyPilot. Good for multi-cloud setups and spot instance management. |
| Kubernetes Manifests | Official K8s YAML files. Use these as a starting point but expect to modify resource limits, storage, and networking configs. |
| NVIDIA Container Toolkit Setup | Required for GPU access in containers. The setup is finicky - follow the exact steps or waste hours debugging CUDA errors. |
| Prometheus GPU Metrics | Essential for monitoring GPU memory usage and temperature in production deployments. |
| Kubernetes GPU Operator | Automates GPU driver management in K8s clusters. Worth the complexity if you're running multiple GPU workloads. |
| Tabby GitHub Issues | Search here first when things break. Sort by "most commented" to find common production issues. |
| CUDA Troubleshooting Guide | When GPU errors happen (they will), this is your debugging bible. |
| Docker GPU Access Problems | Common Docker + NVIDIA runtime issues and solutions. Essential for containerized deployments. |
| LDAP Configuration Examples | Real-world LDAP integration examples for Active Directory and OpenLDAP. |
| Reverse Proxy Setup | nginx, Traefik, and Istio configuration for enterprise networking environments. |
| SSL Certificate Management | Setting up HTTPS with internal CAs and certificate rotation. |
| HuggingFace Model Hub | Browse available models compatible with Tabby. Filter by text-generation and downloads for proven options. |
| Model Performance Benchmarks | Community performance comparisons across different hardware configurations. |
| Custom Model Integration | How to use your own fine-tuned models with Tabby. Useful for domain-specific codebases. |
| GitHub Copilot Enterprise | Official enterprise alternative. Much easier deployment, higher ongoing costs. |
| Continue.dev | Open-source alternative with better model flexibility but requires more configuration. |
| Sourcegraph Cody Enterprise | Best codebase context understanding but expensive and complex to deploy. |
| Tabby Community Discussions | Active community discussing deployment patterns, performance optimization, and troubleshooting. |
| Tabby Discord Server | Real-time chat with other Tabby users and maintainers. Good for urgent production issues. |
| Self-Hosted Community | Community-maintained list of self-hosted alternatives to cloud services, including AI coding assistants and deployment guides. |
| GPU Cloud Providers Comparison | Cost and performance comparison of GPU cloud instances for AI workloads. |
| NVIDIA RTX Server Hardware | Enterprise GPU server options if you're building on-premises infrastructure. |
| Kubernetes GPU Scheduling | Official K8s documentation for GPU resource management and node affinity. |
| Container Security Best Practices | K8s security hardening for enterprise deployments. |
| NVIDIA Security Advisory | GPU driver security updates and vulnerability management. |
| Open Source License Compliance | Apache 2.0 license terms and enterprise legal considerations. |