Kubeflow & Feast Production MLOps Setup: AI-Optimized Technical Reference
Critical Configuration Intelligence
Resource Requirements (Real-World Minimums)
- Minimum viable cluster: 3 nodes, 16 cores/64GB RAM per node, 500GB+ NVMe storage
- Reality check: Cluster uses significant resources at idle; ML training jobs are memory-intensive and will OOM kill other processes
- GPU considerations: Set 4-hour max time limits; use node taints for GPU-only workloads; monitor utilization (cryptocurrency mining risk)
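A minimal sketch of that taint-plus-deadline setup, assuming you taint GPU nodes yourself; the taint key, job name, and image are placeholders:
# Taint GPU nodes so only GPU workloads schedule there (run once per node):
#   kubectl taint nodes <gpu-node> nvidia.com/gpu=present:NoSchedule
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job            # illustrative name
spec:
  activeDeadlineSeconds: 14400      # hard 4-hour cap; the job is killed after this
  template:
    spec:
      restartPolicy: Never
      tolerations:
        - key: nvidia.com/gpu
          operator: Equal
          value: present
          effect: NoSchedule
      containers:
        - name: trainer
          image: your-registry/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1     # requires the NVIDIA device plugin on the node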
Time Investment Reality
- Initial setup: Full weekend minimum (the documentation's 30-minute claim is false)
- Actually working system: Additional week for undocumented edge cases
- Production ready: Minimum 1 month before trusting with business data
- Team onboarding: 2 weeks minimum hand-holding required
Storage Configuration That Prevents Data Loss
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: kubeflow-fast
provisioner: ebs.csi.aws.com  # gp3 requires the EBS CSI driver, not the legacy in-tree provisioner
parameters:
  type: gp3
  iops: "3000"
reclaimPolicy: Retain  # CRITICAL: DO NOT use Delete
allowVolumeExpansion: true
Critical Warning: Using the Delete reclaim policy will cause data loss. The author lost 3 days of training data to this mistake.
Installation Failure Modes & Solutions
Common Kubeflow Installation Failures
- CRD compatibility issues: Kubernetes 1.31 deprecated APIs that worked in 1.30
- Istio OOM kills: Default 512Mi memory limit insufficient, needs 2GB minimum (see the patch sketch after this list)
- MySQL storage failures: Error 1 (HY000) indicates missing persistent storage mount
- Pending pod status: Usually means the storage class was never applied
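One way to apply the 2GB Istio memory bump before the OOM loop starts is a strategic-merge patch against istiod, applied with kubectl patch deployment istiod -n istio-system --patch-file istiod-mem.yaml. Deployment and container names below assume a stock istio-system install; adjust to match yours.
# istiod-mem.yaml -- raise istiod memory above the 512Mi default
spec:
  template:
    spec:
      containers:
        - name: discovery          # istiod's container name in a stock install
          resources:
            requests:
              memory: 1Gi
            limits:
              memory: 2Gi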
Working Installation Pattern:
# Install components sequentially for debugging
kubectl apply -k ./manifests/common/
kubectl wait --for=condition=Ready pods --all -n kubeflow --timeout=600s
# Delete and retry stuck pods
kubectl delete pod <stuck-pod> -n kubeflow
Feast Setup Reality
- Documentation assumes expertise in all storage systems
- Redis setup requires managed service or significant storage allocation
- Default memory limits (1-2Gi) insufficient for production workloads
Production Performance Characteristics
Feature Serving Latency Expectations
- Good day: 150ms typical response time
- Bad day: 300-400ms when AWS routes through suboptimal paths
- Disaster: 2+ seconds during Redis background saves
- Solution: Invest in NVMe storage, or accept that you won't hit a 50ms SLA
Memory Consumption Patterns
- Demo tutorials: Use tiny datasets, unrealistic for production
- Reality: Recommendation models consume 40GB+ RAM, crash co-located services
- Resource allocation strategy: Memory requests at 75% of estimated need, limits at 150% of requests
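Worked example of the 75%/150% rule: estimate 8Gi of RAM and 2 cores, request 6Gi and 1.5 cores, cap at 9Gi and 3 cores. A container-spec fragment with illustrative values:
# Estimated need: 8Gi RAM, 2 CPU cores
resources:
  requests:
    memory: 6Gi       # 75% of the 8Gi estimate
    cpu: "1500m"      # 75% of 2 cores
  limits:
    memory: 9Gi       # 150% of the memory request
    cpu: "3000m"      # 150% of the CPU request... roughly; round to taste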
Cost Analysis & Alternatives Comparison
| Platform | Setup Difficulty | Monthly Cost | Reliability | Kubernetes Integration |
|---|---|---|---|---|
| Kubeflow + Feast | Excruciating (weeks) | $8K-25K | Eventually works after fixes | Native |
| AWS SageMaker | Annoying (days) | $20K-80K+ (surprise bills) | Usually works on happy path | Limited |
| Google Vertex AI | Reasonable (hours) | $15K-50K (TPU costs) | Actually tested by Google | Partial |
| Azure ML | Tolerable (days) | $12K-40K (hidden costs) | Sometimes works | Half-implemented |
| MLflow + DIY | Months of pain | $3K + sanity cost | Define "work" | You implement it |
Critical Monitoring Requirements
Metrics That Prevent 3AM Failures
- Pipeline failure count (last hour) - not success rates
- Feature serving latency at 99th percentile - averages mislead
- Feature store data freshness verification
- Memory usage across all nodes - someone always hits limits
- Disk space monitoring - failures occur at 3AM Sunday
Storage Growth Patterns
- Getting started: 500GB lasts a few weeks
- Small team: 2-5TB realistic need
- Bigger team: 10-20TB with growth management
- Enterprise: 50TB+ requires cleanup automation
Critical: Storage grows rapidly from model artifacts, training data, and pipeline intermediates. Implement automated cleanup or face sudden capacity failures.
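A sketch of that cleanup automation, assuming Kubeflow Pipelines' default MinIO artifact store (minio-service.kubeflow:9000, bucket mlpipeline) and its default credentials secret; endpoint, bucket, retention window, and secret name may differ in your install:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: artifact-cleanup
  namespace: kubeflow
spec:
  schedule: "0 4 * * 0"             # weekly, Sunday 04:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mc
              image: minio/mc:latest
              command: ["/bin/sh", "-c"]
              args:
                - |
                  mc alias set kf http://minio-service.kubeflow:9000 "$ACCESS_KEY" "$SECRET_KEY" &&
                  mc rm --recursive --force --older-than 30d kf/mlpipeline
              env:
                - name: ACCESS_KEY
                  valueFrom: {secretKeyRef: {name: mlpipeline-minio-artifact, key: accesskey}}
                - name: SECRET_KEY
                  valueFrom: {secretKeyRef: {name: mlpipeline-minio-artifact, key: secretkey}}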
Security Implementation (Non-Optional)
- Network policies for pod-to-pod communication restriction (sketch below)
- Kubernetes secrets for credential storage (not environment variables)
- RBAC enablement without universal cluster-admin access
- Regular key rotation implementation
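A minimal sketch of the network-policy item above: default-deny ingress in a namespace, then explicitly allow same-namespace traffic. You will still need to add allowances for istio-system, DNS, and anything else that legitimately crosses the boundary:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: feast-system
spec:
  podSelector: {}          # applies to every pod in the namespace
  policyTypes: [Ingress]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: feast-system
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}  # only pods in this namespace
  policyTypes: [Ingress]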
Backup Strategy (Data Loss Prevention)
# Simple backup that works
kubectl get all -n kubeflow -o yaml > kubeflow-backup-$(date +%Y%m%d).yaml
kubectl get all -n feast-system -o yaml > feast-backup-$(date +%Y%m%d).yaml
Critical: kubectl get all does not capture Secrets, ConfigMaps, PVCs, or custom resources - back those up explicitly. Store backups off-cluster, preferably with a different cloud provider.
Common Production Failure Scenarios
Feature Inconsistency Between Training/Serving
- Root causes: Clock drift between systems, race conditions during materialization, Python version differences in computation
- Detection: Compare online vs offline store values for same entities
- Resolution: Synchronize system clocks, implement atomic feature updates, standardize computation environments
Pipeline Performance Degradation
- Symptoms: Slow execution, resource starvation
- Common causes: I/O bottlenecks from slow storage, network bandwidth limits, inefficient data loading, excessive small pipeline steps
- Solutions: Profile pipeline code, rightsize resource requests, combine related operations
System Unresponsiveness Recovery
- Check cluster health: kubectl get nodes
- Identify failing components: kubectl get pods -n kubeflow | grep -v Running
- Check resource exhaustion: kubectl describe nodes | grep -A 5 "Allocated resources"
- Restart services in dependency order: API server first, then UI
Resource Scaling Guidelines
Concurrent Pipeline Capacity
- Small cluster: 5-10 simple pipelines maximum
- Decent cluster: 20-50 concurrent pipelines (resource dependent)
- Large cluster: 100+ possible with adequate hardware
Limiting factors: Resource availability and pipeline complexity, not just count.
Feature Store Production Considerations
Redis Configuration for High Availability
- Use clustering for availability
- Implement connection pooling
- Monitor cache hit rates
- Set appropriate TTLs (stale data worse than no data)
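A hedged feature_store.yaml fragment covering the clustering and TTL points; redis_type: redis_cluster and key_ttl_seconds exist in recent Feast releases, so verify against your version before relying on them:
project: production
provider: local
online_store:
  type: redis
  redis_type: redis_cluster      # use the clustered client for availability
  connection_string: "redis-node-1:6379,redis-node-2:6379,redis-node-3:6379"
  key_ttl_seconds: 86400         # expire features after 24h; stale data is worse than no data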
Feature Freshness Monitoring
from prometheus_client import Gauge

# Alert when features > 1 hour stale
feature_freshness_gauge = Gauge('feast_feature_freshness_seconds',
                                'Feature freshness in seconds', ['feature_view'])
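If you run the Prometheus Operator (recommended in the links below), a PrometheusRule turns that gauge into the "> 1 hour stale" alert; the namespace, labels, and threshold here are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: feast-freshness
  namespace: monitoring
spec:
  groups:
    - name: feast
      rules:
        - alert: FeastFeaturesStale
          expr: feast_feature_freshness_seconds > 3600   # older than 1 hour
          for: 10m
          labels:
            severity: page
          annotations:
            summary: "Feature view {{ $labels.feature_view }} is more than 1 hour stale"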
Database Recovery Procedures
Corruption Recovery Steps
- Stop pipeline services: kubectl scale deployment ml-pipeline-api-server -n kubeflow --replicas=0
- Check integrity from a MySQL shell: CHECK TABLE pipeline_runs;
- Restore from backup if corrupted
- Restart services after verification
Critical: Maintain automated daily database backups to minimize data loss.
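A sketch of that daily backup as a CronJob, assuming the stock Kubeflow Pipelines MySQL (service mysql in the kubeflow namespace, database mlpipeline); the password secret and backup PVC names are hypothetical, and the dump still needs to be shipped off-cluster afterwards:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mlpipeline-db-backup
  namespace: kubeflow
spec:
  schedule: "0 2 * * *"              # daily at 02:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: mysqldump
              image: mysql:8.0
              command: ["/bin/sh", "-c"]
              args:
                - >-
                  mysqldump -h mysql -u root -p"$MYSQL_PASSWORD" mlpipeline
                  > /backup/mlpipeline-$(date +%Y%m%d).sql
              env:
                - name: MYSQL_PASSWORD
                  valueFrom: {secretKeyRef: {name: mysql-secret, key: password}}  # hypothetical secret
              volumeMounts:
                - {name: backup, mountPath: /backup}
          volumes:
            - name: backup
              persistentVolumeClaim: {claimName: db-backup-pvc}                   # hypothetical PVC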
Cost Control Strategies
- Auto-scaling with scale-to-zero at night
- Automated deletion of old pipeline runs
- Spot instances for training jobs
- Daily bill monitoring (costs compound rapidly)
Performance Optimization Patterns
- Parallel execution of independent pipeline steps
- Expensive feature computation caching
- Smaller Docker images (saves 2-3 minutes per job)
- Pre-pulling images on nodes
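Pre-pulling is usually done with a DaemonSet that pulls the heavy images on every node and then idles; the image names below are placeholders:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: image-prepuller
  namespace: kube-system
spec:
  selector:
    matchLabels: {app: image-prepuller}
  template:
    metadata:
      labels: {app: image-prepuller}
    spec:
      initContainers:
        - name: pull-trainer
          image: your-registry/trainer:latest      # placeholder: your heaviest training image
          command: ["/bin/sh", "-c", "true"]       # exit immediately; the pull is the point
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9         # tiny container that keeps the pod alive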
Troubleshooting Decision Tree
ImagePullBackOff Errors
- Check node internet connectivity
- Verify resource constraints
- Examine network policies for image registry access
Common cause: Nodes too small or network policy blocking downloads
OOMKilled Containers
- Check memory usage patterns in logs
- Monitor container resource consumption
- Adjust resource limits (start with 2x current allocation)
Pattern: Set memory requests to 75% of estimated need, limits to 150% of requests
Service Communication Failures
- Test DNS resolution between services
- Check service endpoints configuration
- Verify network policies allow cross-namespace traffic
Common issue: DNS problems or overly restrictive network policies
This technical reference extracts all actionable intelligence while preserving the operational context that prevents common implementation failures. The original author's hard-won experience provides critical guidance for avoiding expensive mistakes and lengthy debugging cycles.
Useful Links for Further Investigation
Resources That Actually Helped (And Some That Didn't)
| Link | Description |
|---|---|
| Kubeflow manifests repo | This is the actual source code for installing Kubeflow. The docs are pretty but this repo is where the rubber meets the road. Use the v1.10-branch unless you enjoy living dangerously. |
| Stack Overflow Kubeflow tag | Real problems, real solutions. Half the issues I hit were answered here by people who'd already suffered through them. Sort by votes, not recency. |
| This Feast tutorial that actually works | Most Feast tutorials skip the boring parts like "how to actually configure your database." This one doesn't. |
| Kubeflow Slack | The #kubeflow-platform channel has saved my ass more times than I can count. People actually respond here, unlike GitHub issues. |
| Feast Slack | Less traffic than Kubeflow Slack but the answers are better. The maintainers actually hang out here. |
| GitHub issues for when Slack fails you | Last resort. File a bug report and wait 3-6 months for someone to tell you it's "working as designed." |
| Prometheus Operator | Install this first, debug everything else later. The pre-built dashboards will show you exactly where your cluster is dying. |
| This Grafana dashboard for Kubeflow | Forget the fancy ones with 47 panels. This one shows CPU, memory, and "is it actually working" - everything you need at 3 AM. |
| Official Kubeflow docs | Great for understanding concepts, terrible for troubleshooting real problems. The examples work perfectly in their isolated test environments. |
| DataCamp's Kubeflow tutorial | Skip the official training courses. This one walks through practical examples that don't assume you're a Kubernetes expert. |
Related Tools & Recommendations
MLOps Production Pipeline: Kubeflow + MLflow + Feast Integration
How to Connect These Three Tools Without Losing Your Sanity
GitOps Integration Hell: Docker + Kubernetes + ArgoCD + Prometheus
How to Wire Together the Modern DevOps Stack Without Losing Your Sanity
MLflow - Stop Losing Your Goddamn Model Configurations
Experiment tracking for people who've tried everything else and given up.
Stop MLflow from Murdering Your Database Every Time Someone Logs an Experiment
Deploy MLflow tracking that survives more than one data scientist
Databricks vs Snowflake vs BigQuery Pricing: Which Platform Will Bankrupt You Slowest
We burned through about $47k in cloud bills figuring this out so you don't have to
TensorFlow Serving Production Deployment - The Shit Nobody Tells You About
Until everything's on fire during your anniversary dinner and you're debugging memory leaks at 11 PM
Databricks Acquires Tecton in $900M+ AI Agent Push - August 23, 2025
Databricks - Unified Analytics Platform
Feast - Prevents Your ML Models From Breaking When You Deploy Them
Explore Feast, the open-source feature store, to understand why ML models fail in production and how to ensure reliability.
Apache Airflow: Two Years of Production Hell
I've Been Fighting This Thing Since 2023 - Here's What Actually Happens
Apache Airflow - Python Workflow Orchestrator That Doesn't Completely Suck
Python-based workflow orchestrator for when cron jobs aren't cutting it and you need something that won't randomly break at 3am
Set Up Microservices Monitoring That Actually Works
Stop flying blind - get real visibility into what's breaking your distributed services
KServe - Deploy ML Models on Kubernetes Without Losing Your Mind
Deploy ML models on Kubernetes without writing custom serving code. Handles both traditional models and those GPU-hungry LLMs that eat your budget.
Kubeflow - Why You'll Hate This MLOps Platform
Kubernetes + ML = Pain (But Sometimes Worth It)
Kubeflow Pipelines - When You Need ML on Kubernetes and Hate Yourself
Turns your Python ML code into YAML nightmares, but at least containers don't conflict anymore. Kubernetes expertise required or you're fucked.
Fix Kubernetes ImagePullBackOff Error - The Complete Battle-Tested Guide
From "Pod stuck in ImagePullBackOff" to "Problem solved in 90 seconds"
Fix Kubernetes OOMKilled Pods - Production Memory Crisis Management
When your pods die with exit code 137 at 3AM and production is burning - here's the field guide that actually works
How to Deploy Istio Without Destroying Your Production Environment
A battle-tested guide from someone who's learned these lessons the hard way
Escape Istio Hell: How to Migrate to Linkerd Without Destroying Production
Stop feeding the Istio monster - here's how to escape to Linkerd without destroying everything
Stop Debugging Microservices Networking at 3AM
How Docker, Kubernetes, and Istio Actually Work Together (When They Work)
Stop Docker from Killing Your Containers at Random (Exit Code 137 Is Not Your Friend)
Three weeks into a project and Docker Desktop suddenly decides your container needs 16GB of RAM to run a basic Node.js app