LangChain & Hugging Face Production Deployment Guide
Executive Summary
Three viable deployment patterns exist for LangChain + Hugging Face LLMs in production. All other approaches fail under real traffic conditions. Budget $2,400/month minimum for moderate traffic, 3-6 months for self-hosted setup, and expect weekend alerts regardless of approach.
Deployment Architecture Patterns
1. Hugging Face Endpoints (Reliable but Expensive)
Performance Profile:
- Setup time: 5 minutes to working API
- Cost: $2,400/month for moderate traffic
- Failure mode: Rate limits at 1,000 requests/hour
Implementation Requirements:
- Pin dependencies: langchain==0.2.x and huggingface-hub==0.24.x
- Version conflicts occur regularly between releases
- No weekend alerts about infrastructure failures
Decision Criteria:
- Use for: Demos, MVPs, teams without platform expertise
- Avoid when: Cost-sensitive or high-volume applications
2. Self-Hosted Kubernetes (Maximum Control)
Resource Investment:
- Setup time: 3-6 months for production-ready deployment
- Team requirement: Dedicated platform engineering team
- Cost: $1,200/month plus engineer time
- Failure frequency: Expect 2-3 AM alerts weekly
Critical Breaking Points:
- GPU scheduling fails randomly with NVIDIA device plugin memory leaks
- Solution: Restart daemonset weekly via cronjob
- EKS 1.24 broke GPU allocation entirely - required plugin version pinning
Memory Reality:
- Documentation claims: 8GB for 7B models
- Production requirement: 16GB minimum for PyTorch overhead + CUDA contexts
- OOM kills cause silent failures with exit code 137
3. Serverless GPU (Variable Workloads Only)
Performance Characteristics:
- Cold start time: 2-3 minutes for large models
- AWS Lambda timeout: 15 minutes (insufficient for model loading)
- Google Cloud Run: 3-minute cold starts destroy user experience
Use Cases:
- Batch processing workloads
- Variable traffic with acceptable latency
- Cost range: $400-1,200/month
Model Serving Infrastructure
Text Generation Inference (TGI) - Only Viable Option
Performance Features:
- Dynamic batching: 10x throughput improvement (10 → 100 requests/minute)
- GPTQ quantization: 50% memory reduction without quality loss
- Tensor parallelism: Multi-GPU support with network latency constraints
Operational Requirements:
- Memory leaks require container restarts every 6 hours
- Debug logging essential: set RUST_LOG=debug for troubleshooting (see the deployment sketch below)
- Silent failure modes: CUDA OOM, invalid tokens, tensor corruption
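A minimal deployment sketch for TGI under these constraints, assuming the ghcr.io/huggingface/text-generation-inference image, a GPTQ-quantized model repo (the model-id below is illustrative), and the 8080 port used by the health checks later in this guide; treat tags and flags as things to verify against the TGI version you pin:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
  namespace: ai-models
spec:
  replicas: 1
  selector:
    matchLabels:
      app: inference
  template:
    metadata:
      labels:
        app: inference
    spec:
      containers:
      - name: tgi
        image: ghcr.io/huggingface/text-generation-inference:1.4   # pin the tag - upgrades change behavior
        args:
        - "--model-id=TheBloke/Llama-2-7B-GPTQ"   # illustrative GPTQ model
        - "--quantize=gptq"                        # roughly 50% memory reduction
        - "--port=8080"
        env:
        - name: RUST_LOG
          value: debug                             # surfaces the otherwise-silent CUDA OOM and token errors
        ports:
        - containerPort: 8080
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "20Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "16Gi"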
Docker Implementation Reality
Critical Configuration Issues
Base Image Problems:
- NVIDIA base images: 8GB+ size due to unnecessary CUDA libraries
- Model downloads fail randomly - implement retry logic
- CI cache bloat: 50GB over 2 weeks without cleanup
Working Dockerfile Pattern:
# Download models separately - this layer only re-runs when the model changes
FROM python:3.11-slim AS model-downloader
RUN pip install huggingface-hub
RUN huggingface-cli download microsoft/DialoGPT-medium --resume-download
# Runtime stage - still needs the serving stack installed on top of this
FROM nvidia/cuda:12.1.0-runtime-ubuntu20.04
COPY --from=model-downloader /root/.cache/huggingface /app/models
Health Check Reality:
- Use: curl localhost:8080/health (wired into the probe sketch below)
- Avoid: Generic endpoints that always return 200 OK
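A hedged probe sketch that wires this into the pod spec, assuming TGI listening on 8080 and multi-minute model loads (hence the long initial delays):
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 120    # model loading can take minutes - do not mark ready early
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 300
  periodSeconds: 30
  failureThreshold: 3         # three consecutive failures before the kubelet restarts the container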
Kubernetes Production Configuration
GPU Resource Management
Working Configuration:
resources:
  limits:
    nvidia.com/gpu: 1
    memory: "20Gi"
  requests:
    nvidia.com/gpu: 1
    memory: "16Gi"
StatefulSet Alternative:
- Use regular Deployments with emptyDir volumes
- Persistent volume claims never bind reliably
- Accept occasional pod restarts
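A minimal pod-spec fragment for the emptyDir approach, assuming the Hub cache is pointed at /data via HUGGINGFACE_HUB_CACHE; the volume size is an assumption, and the cache is rebuilt from scratch after every pod restart, which is the trade-off accepted above:
volumes:
- name: model-cache
  emptyDir:
    sizeLimit: 50Gi                  # assumption: room for a few 8-15GB model snapshots
containers:
- name: tgi
  image: ghcr.io/huggingface/text-generation-inference:1.4
  env:
  - name: HUGGINGFACE_HUB_CACHE
    value: /data                     # keep downloads on the scratch volume, not the container layer
  volumeMounts:
  - name: model-cache
    mountPath: /data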
Auto-scaling Failure Modes
Horizontal Pod Autoscaler Issues:
- Response time: 8 minutes to spin up instances
- Traffic spike result: 504 Gateway Timeout errors
- Solution: Over-provision 20-30% for predictable performance
Scaling Reality:
- Vertical scaling works better for predictable loads
- Pre-scaling required before traffic events
- HPA metrics collection is slow and conservative
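If you keep an HPA anyway, a hedged sketch that bakes the over-provisioning into minReplicas (the steady-state count of 3 replicas is an assumption; CPU is the scaling signal because GPU metrics need a custom metrics adapter):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference
  minReplicas: 4                     # ~25-30% above an assumed steady state of 3 replicas
  maxReplicas: 8
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60       # scale early - waiting for saturation means 504s during the 8-minute spin-up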
Security Implementation
Container Security Standards
Pod Security Configuration:
apiVersion: v1
kind: Namespace
metadata:
  name: ai-models
  labels:
    pod-security.kubernetes.io/enforce: baseline
Key Management:
- Never use environment variables for API keys
- Use Kubernetes secrets or AWS Secrets Manager
- Restricted profile breaks AI workloads (write access to /tmp required)
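A hedged pod-spec fragment for the file-mount pattern, assuming a secret named hf-api-token (populated out of band from AWS Secrets Manager or similar) with a token key the application reads from /var/run/secrets/hf/token:
volumes:
- name: hf-token
  secret:
    secretName: hf-api-token         # hypothetical secret - populate from your secrets manager, not from a manifest in git
containers:
- name: inference
  volumeMounts:
  - name: hf-token
    mountPath: /var/run/secrets/hf   # the app reads the key from a file, never from an env var
    readOnly: true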
Monitoring and Observability
Critical Metrics
Operational Metrics:
- inference_requests_total: count completed requests, not started ones
- model_memory_usage_bytes: prevents silent OOM kills
- request_duration_p99: averages mislead
- gpu_utilization: ignore if response times are acceptable
LangSmith Limitations:
- Measures requests started, not completed
- Crashes on high throughput
- Use for debugging individual requests only
- Production monitoring requires Prometheus
Effective Monitoring Stack
Components:
- Node Exporter for system metrics
- NVIDIA GPU Prometheus Exporter for GPU statistics
- Custom application metrics for business logic
- AlertManager configured for actionable alerts only
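A hedged Prometheus rule sketch for the metrics above; the metric names assume you export them as described, and the thresholds are starting points tuned to the 20Gi limits used elsewhere in this guide:
groups:
- name: inference-alerts
  rules:
  - alert: InferenceP99LatencyHigh
    expr: request_duration_p99 > 10                             # seconds - assumes a precomputed p99 gauge
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "p99 inference latency above 10s for 10 minutes"
  - alert: ModelMemoryNearLimit
    expr: model_memory_usage_bytes > 18 * 1024 * 1024 * 1024    # ~18Gi against a 20Gi limit
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Model memory within ~2Gi of the container limit - OOM kill likely"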
Cost Analysis and Platform Comparison
Platform | Setup Time | Monthly Cost | Failure Mode | Use Case |
---|---|---|---|---|
HF Endpoints | 5 minutes | $2,400+ | Rate limits | Demo/MVP only |
AWS SageMaker | 2 weeks | $800-3,000 | Instance limits, IAM complexity | Enterprise with budget |
Self-hosted K8s | 3-6 months | $1,200 plus engineer time | Everything breaks | Full control required
GCP Cloud Run | 30 minutes | $400-1,200 | 3-minute cold starts | Batch workloads |
Azure ACI | 1 day | $600-1,800 | Limited GPU options | Microsoft ecosystems |
Common Failure Scenarios and Solutions
Container Restart Issues (Exit Code 137)
Cause: OOM kill due to insufficient memory allocation
Solution: Set 20GB memory limits on containers, with 16GB requests minimum
Prevention: Monitor memory usage patterns during load testing
GPU Scheduling Failures
Cause: NVIDIA device plugin memory leaks
Quick Fix: kubectl delete pod -n kube-system -l name=nvidia-device-plugin-ds
Permanent Fix: Weekly daemonset restart via cronjob
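A hedged sketch of that weekly restart, modelled on the 6-hour inference restart CronJob later in this guide; the daemonset name matches the upstream NVIDIA device plugin manifest (verify against your install), and the service account is a hypothetical one with RBAC to patch daemonsets in kube-system:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-nvidia-device-plugin
  namespace: kube-system
spec:
  schedule: "0 4 * * 0"                             # Sunday 04:00, before the weekly leak bites
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: daemonset-restarter   # hypothetical SA with permission to restart daemonsets
          restartPolicy: Never
          containers:
          - name: restart
            image: bitnami/kubectl
            command: ["/bin/sh", "-c", "kubectl rollout restart daemonset/nvidia-device-plugin-daemonset -n kube-system"]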
Model Loading Performance
Problem: 8-minute loading times
Solutions:
- Pre-download during Docker build (12GB image size)
- Model caching with persistent volumes
- Switch to quantized models (GPTQ)
- Over-provision warm instances
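If you try the persistent-volume route despite the binding caveats earlier, a minimal PVC sketch (storage class and size are assumptions for your cluster); mount it the same way as the emptyDir fragment above and point the Hub cache at the mount path:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache
  namespace: ai-models
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3               # assumption: an EBS-backed class - substitute your cluster's default
  resources:
    requests:
      storage: 100Gi                  # 8-15GB per model snapshot, so leave room for several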
Silent API Failures
Symptoms: HTTP 500 with no logs
Causes: CUDA OOM, invalid tokens, tensor corruption
Debugging: Enable RUST_LOG=debug for TGI
Monitoring: Track request completion, not initiation
Memory Leak Management
Issue: Growing memory usage in long-running processes
Root Cause: PyTorch GPU memory cleanup issues
Solution: Restart containers every 6 hours via cronjob
Configuration:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: restart-inference-pods
spec:
  schedule: "0 */6 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-restarter   # needs RBAC permission to restart the deployment; name is illustrative
          restartPolicy: Never
          containers:
          - name: restart
            image: bitnami/kubectl
            command: ["/bin/sh", "-c", "kubectl rollout restart deployment/inference"]
Version Compatibility Critical Points
Dependency Pinning Requirements:
- PyTorch version changes produce different model outputs
- Pin exact versions: torch==2.0.0, transformers==4.21.0, langchain==0.1.17
- Test version compatibility in staging before production updates
Breaking Change Patterns:
- LangChain releases break integrations regularly
- HuggingFace model format changes require re-downloads
- NVIDIA driver updates break CUDA contexts
Resource Requirements Reality Check
Actual vs Documented Memory:
- Documentation: 8GB for 7B models
- Production reality: 16GB minimum
- Safe allocation: 20GB limits for stability
GPU Performance Debugging:
- Use nvidia-smi dmon -i 0 for real-time bandwidth monitoring
- 100% GPU utilization with low throughput = memory bandwidth bottleneck
- Solution: Larger GPUs or smaller models
Storage Requirements:
- Model cache: 8-15GB per model
- Container images: 12GB with pre-downloaded models
- CI cache management: Prune weekly to avoid storage costs
Service Mesh and Network Considerations
Istio Performance Impact:
- Adds 100-200ms latency per request
- Acceptable for 2-5 second LLM inference
- Skip for real-time applications
Network Policy Implementation:
- Cilium policies work reliably
- Built-in Kubernetes policies depend on CNI plugin quality
- Most CNI plugins implement policies poorly
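A hedged baseline policy, assuming the inference pods run in the ai-models namespace and only an ingress/gateway namespace (the label below is a stand-in) should reach them on the serving port:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: inference-ingress-only
  namespace: ai-models
spec:
  podSelector:
    matchLabels:
      app: inference
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-gateway        # hypothetical label on the namespace that fronts traffic
    ports:
    - protocol: TCP
      port: 8080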
Compliance and Security Framework
Required Certifications for Enterprise:
- SOC 2 compliance for audit requirements
- ISO 27001 for security frameworks
- EU AI Act compliance for European deployments
Container Security:
- Follow Docker CIS Benchmark guidelines
- Implement Falco runtime security (expect false positives)
- Use baseline pod security standards (restricted breaks AI workloads)
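A hedged pod-spec fragment of that compromise: hardened where it does not hurt, with an emptyDir standing in for the writable /tmp that models and tokenizers expect. Some inference images still assume root, so test runAsNonRoot before enforcing it:
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
containers:
- name: inference
  securityContext:
    allowPrivilegeEscalation: false
    readOnlyRootFilesystem: true       # everything writable goes through explicit mounts
  volumeMounts:
  - name: tmp
    mountPath: /tmp                    # writable scratch space AI workloads need
volumes:
- name: tmp
  emptyDir: {}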
Alternative Technologies and Trade-offs
Model Serving Options:
- TGI: Most stable, moderate performance
- vLLM: Higher throughput, complex setup
- TensorRT-LLM: 3x performance improvement, fragile configuration
Optimization Libraries:
- ONNX Runtime: Microsoft's optimization, works when it doesn't crash
- GPTQ: 50% memory reduction, maintains quality
- Quantization: Essential for cost control
Training and Certification Value
Worthwhile Investments:
- NVIDIA Deep Learning Institute: Actual GPU optimization knowledge
- CKAD: Proves Kubernetes competency
- Cloud provider AI certifications: Required for enterprise sales
Community Resources:
- Hugging Face forums: Responsive community and team
- LangChain Discord: Mixed quality, occasional expertise
- CNCF AI/ML SIG: Architecture-focused discussions
This guide represents $8,000+ in production lessons learned and provides actionable intelligence for avoiding common deployment failures in LangChain + Hugging Face implementations.
Useful Links for Further Investigation
Resources That Don't Suck (Mostly)
Link | Description |
---|---|
LangChain Hugging Face Provider Documentation | Actually decent integration guide, which is shocking for LangChain docs. Has working code examples that aren't completely broken. |
LangChain Cloud Deployment Guide | Skip this garbage, it's mostly marketing fluff. The community guides will save your ass instead. |
LangSmith Observability Platform | Looks pretty but crashes harder than Internet Explorer under load. Use for debugging only, not monitoring. |
Hugging Face Inference Endpoints | Works fine but will bankrupt you. Good for demos, terrible for real traffic. |
Text Generation Inference (TGI) | The only model server that doesn't make me want to quit engineering. Use this or suffer. |
Hugging Face Transformers Performance Guide | Actually useful GPU optimization tips. Helped us cut memory usage in half. |
Kubernetes GPU Scheduling Guide | Official docs are missing all the real gotchas that will ruin your weekend. The GitHub issues have the actual solutions. |
NVIDIA GPU Operator | Breaks every other Tuesday for no reason. Pin your version and sacrifice a goat. |
Docker Multi-stage Build Best Practices | Decent advice, but they "forgot" to mention 8GB model downloads will murder your CI cache. |
AWS SageMaker LLM Deployment | Amazon's way to fuck your budget sideways. Works but costs 3x what doing it yourself costs. |
Azure Container Instances GPU Support | Microsoft's GPU offerings. Slower than AWS, more expensive than GCP. |
Google Cloud Run GPU Support | Google's serverless GPU thing. Good luck debugging when it breaks. |
ONNX Runtime Transformers Optimization | Microsoft's optimization toolkit. Works surprisingly well when it doesn't crash. |
TensorRT-LLM GitHub Repository | NVIDIA's black magic optimization. 3x faster but breaks if you look at it wrong. |
vLLM Documentation | Actually good high-throughput serving. Use this instead of TGI if you can handle the setup complexity. |
Prometheus GPU Monitoring | Someone's side project that's better than NVIDIA's official monitoring. Don't ask why. |
Grafana AI Dashboards | Pre-built dashboards that mostly work. Expect to rewrite half of them. |
Jaeger Distributed Tracing | For when you need to figure out which service is making your latency suck. |
Docker CIS Benchmark Guide | Docker's official security recommendations. Most of it is checkbox compliance nonsense, but follow it or get fired. |
Kubernetes Security Best Practices | Pages and pages of security configs that break everything. Start simple, add complexity slowly. |
Falco Runtime Security | Noisy alerting system that will spam your Slack with false positives. Useful once properly tuned. |
EU AI Act Compliance Guide | European regulations that make deployment 10x more complex. Good luck. |
SOC 2 Compliance Framework | Checkbox exercises that auditors love and engineers hate. |
ISO 27001 AI Security Guidelines | More compliance paperwork. Required for enterprise sales. |
LangChain Community Discord | Hit or miss. Lots of "try restarting" advice, but occasionally someone knows what they're talking about. |
Hugging Face Community Forums | Surprisingly helpful community. The HF team actually responds to questions. |
CNCF AI/ML SIG | Kubernetes people who know AI exists. Good for architecture discussions. |
Kubernetes Application Developer (CKAD) | Useful certification that proves you can actually use Kubernetes. |
NVIDIA Deep Learning Institute | GPU optimization training that's worth the money. Rare for vendor training. |
Cloud Provider AI Certifications | Expensive way to prove you know how to click buttons in AWS console. |
Related Tools & Recommendations
Milvus vs Weaviate vs Pinecone vs Qdrant vs Chroma: What Actually Works in Production
I've deployed all five. Here's what breaks at 2AM.
Making LangChain, LlamaIndex, and CrewAI Work Together Without Losing Your Mind
A Real Developer's Guide to Multi-Framework Integration Hell
LangChain vs LlamaIndex vs Haystack vs AutoGen - Which One Won't Ruin Your Weekend
By someone who's actually debugged these frameworks at 3am
OpenAI Gets Sued After GPT-5 Convinced Kid to Kill Himself
Parents want $50M because ChatGPT spent hours coaching their son through suicide methods
LlamaIndex - Document Q&A That Doesn't Suck
Build search over your docs without the usual embedding hell
I Migrated Our RAG System from LangChain to LlamaIndex
Here's What Actually Worked (And What Completely Broke)
Haystack - RAG Framework That Doesn't Explode
competes with Haystack AI Framework
Haystack Editor - Code Editor on a Big Whiteboard
Puts your code on a canvas instead of hiding it in file trees
OpenAI Launches Developer Mode with Custom Connectors - September 10, 2025
ChatGPT gains write actions and custom tool integration as OpenAI adopts Anthropic's MCP protocol
OpenAI Finally Admits Their Product Development is Amateur Hour
$1.1B for Statsig Because ChatGPT's Interface Still Sucks After Two Years
Anthropic Raises $13B at $183B Valuation: AI Bubble Peak or Actual Revenue?
Another AI funding round that makes no sense - $183 billion for a chatbot company that burns through investor money faster than AWS bills in a misconfigured k8s
Don't Get Screwed Buying AI APIs: OpenAI vs Claude vs Gemini
integrates with OpenAI API
Anthropic Just Paid $1.5 Billion to Authors for Stealing Their Books to Train Claude
The free lunch is over - authors just proved training data isn't free anymore
Hugging Face Inference Endpoints Security & Production Guide
Don't get fired for a security breach - deploy AI endpoints the right way
Hugging Face Inference Endpoints Cost Optimization Guide
Stop hemorrhaging money on GPU bills - optimize your deployments before bankruptcy
Hugging Face Inference Endpoints - Skip the DevOps Hell
Deploy models without fighting Kubernetes, CUDA drivers, or container orchestration
Microsoft AutoGen - Multi-Agent Framework (That Won't Crash Your Production Like v0.2 Did)
Microsoft's framework for multi-agent AI that doesn't crash every 20 minutes (looking at you, v0.2)
CrewAI - Python Multi-Agent Framework
Build AI agent teams that actually coordinate and get shit done
Pinecone Production Reality: What I Learned After $3200 in Surprise Bills
Six months of debugging RAG systems in production so you don't have to make the same expensive mistakes I did
Claude + LangChain + Pinecone RAG: What Actually Works in Production
The only RAG stack I haven't had to tear down and rebuild after 6 months
Recommendations combine user behavior, content similarity, research intelligence, and SEO optimization