What is Weights & Biases?


Sick of tracking experiments with spreadsheets? Weights & Biases (W&B) is an ML experiment tracking platform that tracks your shit without falling over. If you've ever lost track of which hyperparameters produced your best model, or spent hours trying to reproduce results from three weeks ago, W&B solves that problem.

Started by engineers who got tired of managing experiments with spreadsheets and folder names like "model_final_v2_ACTUALLY_FINAL", W&B tracks your training runs automatically. The web UI doesn't look like it was built in 2005, which is already better than most ML tools. Big companies use it because it scales from solo PhD projects to enterprise teams without completely falling apart.

What W&B Actually Does

Experiment Tracking: Logs metrics, hyperparameters, and system info as your models train. Finally you can compare hundreds of runs side-by-side instead of squinting at terminal output or trying to remember what you changed between runs.

Model Registry: Stores your trained models so you can actually find the model you deployed to production. No more "model_best.pkl" vs "model_best_final.pkl" debates.

Dataset Versioning: Tracks dataset changes. Useful when your data team updates the training set and suddenly your model performance tanks.

Hyperparameter Sweeps: Automatically tries different parameter combinations. Beats manually editing config files and forgetting what you tested.

Integration Reality Check

W&B works with the usual suspects: PyTorch, TensorFlow, Hugging Face, scikit-learn. Adding wandb.init() to your training script is easy. Getting it to work when PyTorch 2.1.0 breaks their auto-logging, your custom data loader throws NoneType errors, and your Docker container can't reach their servers because of networking bullshit? That's where you'll spend your weekend and most of Monday morning.

The integrations mostly work as advertised, until you need distributed training with a custom dataset and suddenly you're debugging TypeError: 'NoneType' object is not subscriptable at 2AM. Their docs cover the happy path well enough, but anything remotely complex means digging through 500 GitHub issues to find the one comment that actually fixes your problem.

Two Main Products

[Image: W&B platform architecture]

W&B Models: The original experiment tracking everyone uses. Solid for traditional ML workflows.

W&B Weave: Their newer LLM and AI application platform. Still beta but doesn't completely suck for prompt engineering and LLM evaluation work.

W&B vs MLOps Alternatives

| Feature | Weights & Biases | MLflow | Neptune | TensorBoard | DVC |
|---|---|---|---|---|---|
| Experiment Tracking | ✅ Real-time UI | ✅ UI looks like 2018 | ✅ Actually good UI | ✅ Local only | ❌ No tracking |
| Model Registry | ✅ Full lifecycle | ✅ Basic registry | ✅ Advanced registry | ❌ No registry | ✅ Git-based versioning |
| Hyperparameter Optimization | ✅ Bayesian sweeps | ❌ Roll your own | ✅ Built-in optimization | ❌ No optimization | ❌ No optimization |
| Dataset Versioning | ✅ Artifacts system | ❌ Barely works | ✅ Dataset management | ❌ No versioning | ✅ Actually good at this |
| Collaboration | ✅ Teams & sharing | ⚠️ Share screenshots | ✅ Advanced sharing | ❌ Screenshot hell begins | ⚠️ Git merge conflicts |
| Cloud Hosting | ✅ SaaS + On-prem | ⚠️ DIY infrastructure | ✅ SaaS + On-prem | ❌ Local only | ❌ No hosting |
| Enterprise Features | ✅ SSO, RBAC, Audit | ⚠️ If you build it | ✅ Enterprise ready | ❌ No enterprise | ❌ No enterprise |
| AI/LLM Tools | ✅ Weave platform | ❌ No LLM tools | ⚠️ Basic LLM support | ❌ No LLM tools | ❌ No LLM tools |
| Pricing | Gets expensive fast | Open source | Costs a fortune | Free (Google) | Open source |
| Setup Pain | 🟢 One line of code* | 🟡 Docker hell | 🟡 Moderate setup | 🟢 Just works | 🔴 YAML nightmare |
| What Breaks | Uploads fail silently | Everything | UI timeouts | Nothing (too simple) | Git history |
| Support Quality | Discord + paid tiers | Stack Overflow | Good if you pay | None | GitHub issues |

W&B: What Works, What Breaks, What'll Cost You

[Image: W&B hyperparameter sweep visualization]

W&B Models: Here's Where It Breaks

Experiment Tracking Reality: W&B tracking logs your metrics, hyperparameters, and system info automatically. Works great until your training script crashes and you lose 6 hours of metrics because you forgot to call wandb.finish(). The UI is actually decent for comparing runs, but expect it to time out on large experiments.

Wrap your training loop in try/finally so wandb.finish() always runs, or register it with atexit.register(wandb.finish) - otherwise you'll lose data when shit inevitably breaks. I learned this the hard way after losing 6 hours of training metrics when my server died. And don't log metrics faster than about once per second, or you'll hit undocumented rate limits that just return "Request failed" with no other context.
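
A minimal sketch of that pattern, assuming a hypothetical train_one_epoch() function (everything else is the standard wandb API):

```python
# Crash-safe logging sketch. train_one_epoch() is a hypothetical stand-in for
# your own training code; the rest is the standard wandb API.
import atexit

import wandb

run = wandb.init(project="my-project", config={"lr": 3e-4, "epochs": 10})
atexit.register(wandb.finish)  # backstop if the process exits some other way

try:
    for epoch in range(run.config.epochs):
        loss = train_one_epoch()  # hypothetical: your training step
        wandb.log({"loss": loss, "epoch": epoch})  # keep this well under ~1 call/sec
finally:
    wandb.finish()  # flushes buffered metrics even when an exception is raised
```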

Version Gotchas: PyTorch 2.1.0 breaks their auto-logging completely, 2.0.1 works fine, 2.1.1 fixes the W&B issue but breaks something else entirely. They hook into private APIs that PyTorch changes without warning. Always check their GitHub issues before upgrading - someone else definitely hit the same problem 3 weeks ago and posted a workaround buried in comment #47.

Hyperparameter Sweeps: W&B Sweeps automatically tries different parameter combinations. Beats manually editing YAML files, but can spawn way too many processes if you're not careful with the agent configuration. Misconfigured sweeps can spawn infinite instances. Set proper resource limits or prepare for surprise AWS bills.

The early stopping works but is aggressive - you might lose promising runs that start slow. Monitor your sweeps, because they occasionally get stuck and stop launching new runs without telling you. Cap the number of runs per agent and tune the early-termination settings yourself instead of trusting the defaults - see the sketch below.
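
A minimal sweep sketch under those assumptions - train() is a hypothetical function that handles its own wandb.init() and wandb.log(), and the count argument caps how many runs this agent will launch:

```python
# Sweep sketch with a hard run cap so a misconfigured sweep can't spawn agents
# forever. train() is a hypothetical function that calls wandb.init()/wandb.log();
# the config keys are standard W&B Sweeps syntax.
import wandb

sweep_config = {
    "method": "bayes",  # Bayesian search over the parameter space
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "lr": {"min": 1e-5, "max": 1e-2},
        "batch_size": {"values": [32, 64, 128]},
    },
    # Hyperband early termination; aggressive settings can kill slow starters
    "early_terminate": {"type": "hyperband", "min_iter": 3},
}

sweep_id = wandb.sweep(sweep_config, project="my-project")
wandb.agent(sweep_id, function=train, count=20)  # cap this agent at 20 runs
```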

[Image: W&B model registry workflow]

Model Registry Pain Points: The model registry tracks your models through deployment stages. Useful for avoiding "model_final_v2_ACTUALLY_FINAL.pkl" situations, but the API for programmatic model promotion is clunky. Expect to write wrapper scripts.

Artifact uploads fail silently more often than they should. Lost a 2GB model file because the upload "succeeded" but the file wasn't actually there. Could have been network issues, could have been their chunked upload system choking, could have been my shitty exception handling - the error just said "Artifact upload failed" with zero useful context. Always verify with artifact.verify() and check the web UI manually before deleting anything local.

Artifacts System Gotchas: W&B Artifacts handles dataset versioning. Works well for small datasets, but large file uploads are unreliable. Their chunked upload system helps with big files but doesn't retry failed chunks automatically.

The deduplication is clever but confusing as hell when artifacts share data - you'll spend hours figuring out why your 10GB model shows up as 200MB.
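
A minimal upload-and-verify sketch, assuming the current wandb Artifact API; the file path and artifact name are placeholders:

```python
# Artifact upload + paranoia check sketch. File paths and artifact names are
# placeholders; wait(), use_artifact(), download(), and verify() are standard
# wandb Artifact calls.
import wandb

run = wandb.init(project="my-project", job_type="upload-model")

artifact = wandb.Artifact("resnet50-checkpoint", type="model")
artifact.add_file("checkpoints/model.pt")  # local checkpoint to upload
run.log_artifact(artifact)
artifact.wait()  # block until the upload has actually finished

# Pull the artifact back and verify its contents against the manifest checksums
downloaded = run.use_artifact("resnet50-checkpoint:latest")
local_dir = downloaded.download()
downloaded.verify(root=local_dir)  # raises if any file is missing or corrupted

run.finish()
```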

W&B Weave: Why This Sucks

[Image: Weave LLM tracing interface]

LLM Tracing: Weave traces LLM calls and agent workflows. Still beta-ish but shows promise for debugging prompt engineering. The automatic tracing catches most calls but manual instrumentation is needed for custom setups.

Token counting is approximate - don't rely on it for billing accuracy. The cost tracking helps but verify against your actual OpenAI/Anthropic bills.
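
A minimal tracing sketch, assuming the weave Python package and the OpenAI client; the project name, model, and prompt are placeholders:

```python
# Weave tracing sketch. Assumes the weave package and an OpenAI client are
# installed and configured; project name, model, and prompt are placeholders.
import weave
from openai import OpenAI

weave.init("my-llm-project")  # links traces to a Weave project in W&B
client = OpenAI()


@weave.op()  # calls to this function get traced (inputs, outputs, latency)
def summarize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this: {text}"}],
    )
    return response.choices[0].message.content


summarize("Weights & Biases is an ML experiment tracking platform.")
```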

Evaluation Framework: Built-in evals for LLM applications work for standard cases. Custom evaluations require more setup than their marketing suggests. The semantic similarity metrics are decent but not magic - garbage prompts still produce garbage results.

Prompt Management: Version control for prompts with A/B testing. Useful for tracking what prompt changes actually improve performance vs. just feeling better. The deployment automation is basic - you'll probably want to build your own promotion pipeline.

Enterprise: Prepare Your Wallet

W&B has the usual enterprise checkboxes: SOC 2, GDPR, SSO. Their security documentation is thorough. On-premises deployment is available, but plan for weeks of setup, not days.

Deployment options include SaaS (easiest), dedicated cloud (middle ground), and on-prem (your headache). Enterprise support is actually responsive, unlike some vendors. Just prepare for "contact us" pricing that scales with your pain.

Here's What'll Bite You

  • Storage costs add up fast if you log everything
  • The free tier limit hits sooner than you think - plan for scaling costs
  • Distributed training setup requires reading GitHub issues, not just docs
  • Team permissions are more complex than they need to be - use service accounts wisely
  • Export functionality exists but isn't obvious - vendor lock-in concerns are valid

Questions Engineers Actually Ask

Q: Why does my W&B experiment randomly lose data?

A: This usually happens when your training script crashes without calling wandb.finish(). W&B buffers metrics, and anything that hasn't been flushed can be dropped when a run doesn't finish cleanly. Always wrap your training loop in try/finally or register an exit handler. Also check whether you're hitting rate limits by logging too frequently - they throttle at around 1 request per second, and the error message just says "Request failed" with no other details.

Production horror story: a server went down during a 12-hour training run and I lost everything because wandb didn't finish properly. Could have been the OOM killer, could have been AWS being AWS, could have been wandb's upload process choking. Honestly, could have been anything. Now I always use atexit.register(wandb.finish) and signal handlers because I refuse to debug this shit again at 3AM.

Q: How much does W&B actually cost in practice?

A: The free tier gives you 100GB of storage and unlimited personal projects. That sounds like a lot until you start logging model checkpoints and large datasets - a week of real training burns through it. Team plans start at $50/user/month, but enterprise pricing is "call us", which means expensive. Budget somewhere around $200-500/user/month for serious usage, maybe more if you're heavy on storage.

Q: Why does W&B break when I update PyTorch/TensorFlow?

A: W&B's auto-logging hooks into framework internals that change between versions. They're usually 1-2 releases behind the latest frameworks. Pin your wandb version or expect to spend time debugging integration issues. Check their GitHub issues before updating anything.

Q: How do I get my data out if I want to leave W&B?

A: Getting your data out requires digging through their API docs. You can download artifacts and export run data via their API, but it's not one-click. Plan ahead if vendor lock-in is a concern. Their Python API lets you script exports, but you'll need to write the extraction code yourself.
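
A minimal export sketch using the public wandb.Api(); the entity/project path is a placeholder, and config values are stringified just to keep the CSV simple:

```python
# Export sketch using the public API: dump run metadata, configs, and sampled
# metric history to CSV. "my-entity/my-project" is a placeholder path.
import pandas as pd
import wandb

api = wandb.Api()
frames = []
for run in api.runs("my-entity/my-project"):
    history = run.history()  # sampled metric history as a DataFrame
    history["run_id"] = run.id
    history["run_name"] = run.name
    for key, value in run.config.items():
        history[f"config.{key}"] = str(value)  # stringified to keep the CSV flat
    frames.append(history)

pd.concat(frames).to_csv("wandb_export.csv", index=False)
```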

Q: Why does W&B say my upload succeeded but the file isn't there?

A: Artifact uploads fail silently, especially for large files. Their chunked upload system doesn't always retry failed chunks. Always verify uploads completed before deleting local files - use artifact.verify() or check the web UI manually.

Q: Can I use W&B without sending data to their servers?

A: Yes, but it's painful. On-premises deployment exists for enterprise customers, and the "server" deployment is basically running their entire stack yourself. Expect weeks of Docker-heavy setup and ongoing maintenance headaches.

Q: Why is my W&B storage bill so high?

A: Every model checkpoint, dataset version, and artifact gets stored with versioning. That "quick experiment" where you logged checkpoints every epoch just ate 50GB of storage. The storage pricing isn't clearly displayed upfront - I found out the hard way when my bill jumped from $200 to $800 in one month. Clean up old experiments regularly or enable automatic cleanup policies.

Q: Does W&B work with distributed training?

A: Mostly. The distributed training setup works but requires manual configuration that isn't well documented. Spent 2 days debugging `TypeError: 'NoneType' object is not subscriptable` - a classic W&B distributed training error that happens when their process coordination gets confused with custom datasets. Finally found the fix in GitHub issue #3847, comment #12. You'll need to handle process coordination yourself. Good luck finding examples that actually work with your exact setup.
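
A minimal sketch of the usual workaround - initialize and log from rank 0 only. It assumes torchrun-style RANK environment variables and a hypothetical train_step() function:

```python
# Distributed-training sketch: only rank 0 talks to W&B so multiple DDP
# processes don't fight over one run. Assumes torchrun-style RANK env vars;
# train_step() is a hypothetical stand-in for your training code.
import os

import wandb

rank = int(os.environ.get("RANK", "0"))
run = wandb.init(project="my-project") if rank == 0 else None

for step in range(1000):
    loss = train_step()  # hypothetical: your per-step training logic
    if run is not None:  # every rank trains, only rank 0 logs
        run.log({"loss": loss, "step": step})

if run is not None:
    run.finish()
```
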
Q: What happens when W&B's servers are down?

A: Your training continues but metrics aren't logged. They have decent uptime but outages happen. There's an offline mode that caches data locally, but you need to enable it before problems occur. Check their status page when things seem broken.
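
A minimal offline-mode sketch; the project name is a placeholder, and you push the cached run later with the wandb sync CLI:

```python
# Offline-mode sketch: metrics are cached on disk and nothing is sent until you
# run `wandb sync <run-dir>` later. Project name is a placeholder.
import wandb

run = wandb.init(project="my-project", mode="offline")  # no network calls to W&B
run.log({"loss": 0.42})
run.finish()

# Later, once the network/servers are back:
#   wandb sync wandb/offline-run-*
```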

Q: Why does the W&B UI time out on large experiments?

A: The web interface struggles with experiments containing thousands of runs or large amounts of logged data. No official limits are documented, but expect timeouts above ~1000 runs. Use their API for programmatic analysis of large experiments instead of relying on the web UI.

Q: Is W&B Weave ready for production LLM apps?

A: It's getting there but still feels beta. Good for experiment tracking and debugging prompt engineering, but I wouldn't bet production workflows on it yet. The tracing is useful but expect to supplement with other monitoring tools. Token counting is approximate - verify costs independently.

