Phoenix Production Deployment Guide - AI-Optimized Summary
Overview
Phoenix is an AI observability platform built around OpenTelemetry trace ingestion. Despite marketing claims of a 5-minute setup, production deployment realistically takes a weekend of effort. Recent versions are more stable than earlier releases, which crashed frequently.
Deployment Options
Phoenix Self-Hosted
What it is: Complete control deployment on your infrastructure
Requirements:
- Docker/Kubernetes deployment capability
- PostgreSQL 12+ for metadata persistence
- S3-compatible storage for trace data
- Load balancer for multiple instances
Trade-offs:
- Full control vs. complete operational responsibility
- Lower ongoing costs vs. high expertise requirements
- Custom security implementation vs. built-in compliance
Phoenix Cloud
What it is: Arize-hosted solution at app.phoenix.arize.com
Benefits:
- Quick deployment (actual 5 minutes)
- Built-in team collaboration features
- Automatic updates and maintenance
Limitations:
- Data sent to third-party cloud
- Pricing not publicly disclosed
- Unknown scaling limits
Arize AX Platform
What it is: Full enterprise platform including Phoenix
Cost: $50k-$200k+ annually
Includes:
- Advanced analytics and compliance reporting
- Dedicated enterprise support
- Professional services for implementation
Critical Production Requirements
Minimum Viable Specifications
- Memory: 8GB minimum, 16GB recommended for stability
- Database: PostgreSQL 12+ (SQLite unsuitable for production)
- Storage: S3-compatible with retention policies configured
- Network: Reverse proxy with TLS termination required
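A minimal Docker Compose sketch covering these requirements. The image tag is a placeholder (pin a real release, never `latest`), and the `PHOENIX_SQL_DATABASE_URL` variable name should be verified against the current self-hosting docs; TLS termination is assumed to happen at the reverse proxy in front of this:

```yaml
services:
  phoenix:
    image: arizephoenix/phoenix:X.Y.Z   # placeholder -- pin a tagged release
    ports:
      - "6006:6006"                     # UI + OTLP HTTP ingestion
    environment:
      # point Phoenix at PostgreSQL instead of the default SQLite
      PHOENIX_SQL_DATABASE_URL: postgresql://phoenix:secret@db:5432/phoenix
    depends_on:
      - db
    deploy:
      resources:
        limits:
          memory: 16g                   # headroom for evaluation-run spikes
  db:
    image: postgres:15
    environment:
      POSTGRES_USER: phoenix
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: phoenix
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
```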
Performance Thresholds and Failure Points
- Trace Volume Limits: Phoenix degrades at ~5k traces/hour on 16GB systems
- Database Bottleneck: PostgreSQL becomes limiting factor before application
- UI Breaking Point: Browser becomes unresponsive above 10k traces in view
- Memory Spikes: Can jump from 4GB to 18GB during evaluation runs
- Storage Growth: Plan for 50-200MB per 1K traces
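These thresholds translate into a rough capacity plan. A back-of-the-envelope estimator in plain Python; the 50-200 MB per 1k traces range comes from the figures above, and the 125 MB midpoint is an assumption you should replace with your own measured payload sizes:

```python
def monthly_storage_gb(traces_per_hour: float,
                       mb_per_1k_traces: float = 125.0) -> float:
    """Estimate monthly trace storage growth in GB.

    mb_per_1k_traces: observed range is roughly 50-200 MB per 1k
    traces; 125 is a midpoint assumption -- measure your own traces.
    """
    traces_per_month = traces_per_hour * 24 * 30
    mb_per_month = traces_per_month / 1000 * mb_per_1k_traces
    return mb_per_month / 1024
```

At the ~5k traces/hour degradation threshold this lands around 440 GB/month at the midpoint assumption, which is why retention policies are non-negotiable.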
Critical Warnings and Failure Scenarios
Authentication Nightmare
- OAuth2 integration poorly documented with frequent lockouts
- API key permissions model counterintuitive
- Environment variables not properly documented
- Workaround: Use API keys instead of OAuth2 for initial deployment
Scaling Failure Points
- Database Connection Exhaustion: Default PostgreSQL 100 connections insufficient
- OOM Kills: Process dies with exit code 137 under memory pressure
- Trace Ingestion Breaks: Silent failures with useless error messages
- UI Performance Degradation: Complete browser lockup with large datasets
Storage Cost Explosion
- Without retention policies: Unlimited trace accumulation
- Real costs observed: 300GB in 2 months = $500/month S3 costs
- Critical action: Configure data retention from day one
Production Architecture
Network Topology
Internet -> Load Balancer -> nginx -> Phoenix instances
Phoenix instances -> PostgreSQL cluster (metadata)
Phoenix instances -> S3/MinIO (trace storage)
Security Considerations
- Phoenix defaults to HTTP (TLS available but not default)
- No built-in rate limiting (requires nginx/reverse proxy)
- No DDoS protection (external solution required)
- Authentication tokens don't expire by default
- Trace data may contain sensitive information
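Since rate limiting has to live in the reverse proxy, a minimal nginx sketch; the zone size, rate, burst, and upstream name are placeholder assumptions to tune for your traffic:

```nginx
limit_req_zone $binary_remote_addr zone=phoenix_ingest:10m rate=50r/s;

server {
    listen 443 ssl;
    server_name phoenix.example.com;
    # ssl_certificate / ssl_certificate_key omitted

    location /v1/traces {
        limit_req zone=phoenix_ingest burst=100 nodelay;
        proxy_pass http://phoenix_backend;
    }
}
```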
Resource Planning and Costs
Infrastructure Costs
- Database scaling: RDS scaling expensive when hitting 50k traces/hour
- Storage: Aggressive retention policies essential (30-90 days maximum)
- Network: AWS data transfer costs significant with heavy trace loads
- Memory: Plan for 20GB spikes during evaluation runs
Time Investment
- Initial deployment: 4-8 hours for experienced engineers
- Production hardening: Full weekend minimum
- Troubleshooting: Expect significant debugging time due to poor error messages
High Availability and Monitoring
Single Points of Failure
- PostgreSQL database (requires replication)
- Phoenix application instances (requires load balancing)
- S3 storage (requires backup strategy)
- Network connectivity (requires monitoring)
Essential Monitoring Metrics
- Trace ingestion rate and queue depth
- Database connection pool utilization
- Memory usage per Phoenix instance
- Response time for UI queries
- Storage growth rate
- Error rates in trace processing
Backup Requirements
- Database: PostgreSQL pg_dump or RDS automated snapshots
- Trace storage: S3 versioning and cross-region replication
- Configuration: Infrastructure as Code version control
Integration Reality
OpenTelemetry Instrumentation
- Performance overhead: 10-20ms for simple calls, higher for complex chains
- Version compatibility: Frequent issues between instrumentation and Phoenix
- Custom attributes: May not render correctly in UI
- Sampling complexity: Essential for cost management at scale
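On the sampling point: OpenTelemetry SDKs ship a ratio-based head sampler (`TraceIdRatioBased` in the Python SDK) that keeps a deterministic fraction of traces. The core idea, sketched in plain Python rather than the SDK API:

```python
def should_sample(trace_id: int, ratio: float = 0.1) -> bool:
    """Head-based sampling: decide from the trace id alone, so every
    span in a trace gets the same keep/drop decision."""
    scale = 10_000
    return (trace_id % scale) < ratio * scale
```

In practice you would configure the SDK's sampler rather than roll your own; the sketch just shows why the decision stays consistent across all spans of one trace.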
Framework-Specific Issues
- OpenAI integration: Generally reliable
- LangChain: Higher instrumentation overhead
- Custom integrations: Require significant additional work
- Distributed systems: Complex OTEL Collector configuration needed
Troubleshooting Common Issues
Database Connection Failures
Symptoms: "FATAL: sorry, too many clients already"
Solution: Increase PostgreSQL max_connections to 200+ or implement pgbouncer
Prevention: Monitor connection pool utilization
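The connection-limit fix from a psql session; 200 is the floor suggested above, not a tuned value, and the change requires a PostgreSQL restart:

```sql
-- raise the server-wide cap (PostgreSQL default is 100)
ALTER SYSTEM SET max_connections = 200;
-- restart PostgreSQL, then confirm:
SHOW max_connections;
```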
Memory Issues
Symptoms: Process killed with exit code 137
Solution: Increase instance memory or implement memory monitoring
Prevention: Set up alerts at 80% memory utilization
Trace Ingestion Failures
Symptoms: "OTEL connection refused" or silent trace dropping
Solutions:
- Verify endpoint configuration (default: http://localhost:6006/v1/traces)
- Check network connectivity with telnet
- Validate trace format and timestamps
- Monitor for TRACE_DROP messages in debug logs
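The telnet connectivity check above can also be scripted. A stdlib-only TCP probe of the OTLP endpoint; host and port default to Phoenix's 6006, so adjust them if ingestion sits behind a proxy:

```python
import socket

def otlp_reachable(host: str = "localhost", port: int = 6006,
                   timeout: float = 2.0) -> bool:
    """TCP-level check: True if something is listening on the OTLP
    endpoint. Does not validate the HTTP path or trace format."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```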
UI Performance Issues
Symptoms: Browser lockup or extreme slowness
Solutions:
- Always use date filters
- Limit results to under 5k traces
- Use API for bulk operations
- Avoid large trace list rendering
Cost Optimization Strategies
Immediate Actions
- Configure data retention policies (30-90 days)
- Implement trace sampling at instrumentation level
- Set up S3 lifecycle policies for cheaper storage tiers
- Monitor evaluation runs resource consumption
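For the S3 lifecycle action, a policy sketch; the `traces/` prefix, transition day count, and 90-day expiration are assumptions to adapt to your bucket layout and retention requirements:

```json
{
  "Rules": [
    {
      "ID": "phoenix-trace-retention",
      "Filter": { "Prefix": "traces/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" }
      ],
      "Expiration": { "Days": 90 }
    }
  ]
}
```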
Long-term Optimization
- Use cheaper storage classes for archived traces
- Implement read replicas for query workloads
- Optimize database indexes and queries
- Consider trace payload size optimization
Enterprise Considerations
Commercial Support Options
- Open source: GitHub issues and Slack community (#phoenix-support)
- Phoenix Cloud: Built-in support with team features
- Arize AX Enterprise: Dedicated support and professional services
Compliance and Security
- Data location: Configurable for self-hosted, US-based for cloud
- Access control: RBAC available but complex to configure
- Audit logging: Available in enterprise versions
- Multi-tenancy: Enterprise feature only
Decision Framework
Choose Self-Hosted If:
- You have Docker/Kubernetes expertise
- Data sovereignty requirements exist
- Cost optimization important long-term
- Custom integrations required
Choose Phoenix Cloud If:
- Quick deployment needed
- Team collaboration essential
- Infrastructure management not desired
- Acceptable to send data to third-party
Choose Arize AX If:
- Enterprise compliance required
- Dedicated support needed
- Budget allows $50k+ annually
- Professional services desired
Migration and Disaster Recovery
Data Export/Import
- Use REST API for trace export/import
- No built-in migration tools available
- Database migrations possible but require testing
- Expect downtime during migration
Backup Testing
- Test restoration process quarterly
- Verify backup integrity regularly
- Document recovery procedures
- Train team on emergency procedures
Performance Optimization
Database Tuning
- Increase max_connections from default 100
- Optimize indexes for trace queries
- Consider connection pooling (pgbouncer)
- Monitor slow query logs
Application Scaling
- Implement horizontal scaling with shared storage
- Configure load balancing with health checks
- Use session affinity if required
- Monitor instance resource utilization
Storage Optimization
- Implement tiered storage strategy
- Configure automated cleanup processes
- Monitor storage growth trends
- Use compression for archived data
Useful Links for Further Investigation
Phoenix Production Resources - What Actually Helps
Link | Description |
---|---|
Phoenix Official Documentation | The official docs. Half the examples don't work and the self-hosting section was clearly written by someone who's never actually deployed this thing, but it's what we've got. Start here, lower your expectations. |
Phoenix Docker Hub Repository | Official container images with tags and deployment instructions. Use tagged versions for production deployments, not `latest`. Essential for containerized deployments. |
OpenInference Instrumentation | Instrumentation libraries for different frameworks. Essential if you're integrating Phoenix with existing applications. Python and JavaScript SDKs are most mature. |
Phoenix Tracing Overview | Comprehensive guide to Phoenix tracing capabilities. Essential for understanding how to instrument applications and collect observability data. |
Phoenix Slack Community | Most active support channel. #phoenix-support channel has engineers and community members who actually use Phoenix in production. Response times vary but usually helpful. |
Phoenix Release Notes | Read these before upgrading or you'll break something. Phoenix moves fast and each version changes things. I learned this the hard way when an upgrade broke our trace ingestion. |
Phoenix Self-Hosting Guide | Official deployment guide that skips all the hard parts. Good for getting started but you'll need to figure out production stuff yourself. The PostgreSQL section is particularly useless. |
Railway Phoenix Deploy | One-click deployment for testing. Not suitable for production but useful for evaluation. |
Phoenix on GCP with Terraform Blog | Community-contributed guide for GCP deployment using Terraform. More realistic than official docs. |
Phoenix RBAC and Authentication | Authentication setup guide. OAuth2 configuration is finicky - check GitHub issues for specific provider examples. |
Data Retention Configuration | Essential for cost management. Configure this early or watch storage costs explode. |
Cost Tracking Documentation | LLM cost monitoring features. Useful for visibility but doesn't prevent runaway costs. |
Phoenix Cloud | Hosted Phoenix with team features. Good for teams who don't want to manage infrastructure. Pricing not public - contact sales. |
Arize AX Platform | Full enterprise platform. Expensive but includes support, compliance features, and advanced analytics. |
Azure Native Integration | Microsoft partnership for Azure deployments. Relevant if you're standardized on Azure services. |
Phoenix REST API Reference | API documentation for automation and integration. The trace export/import APIs are useful for migrations. |
Phoenix Production Guide | Production considerations and best practices. Covers security, scaling, and operational concerns. |
LangSmith vs Phoenix Comparison | Feature comparison with LangSmith. Helps understand Phoenix's positioning and capabilities. |
Langfuse vs Phoenix Comparison | Another competitive analysis. Useful for understanding trade-offs between open source alternatives. |
Arize Company Information | They're well-funded so not going anywhere soon. Good for vendor risk assessment - nobody wants their observability platform to disappear. |
Arize Customer Stories | Case studies from production deployments. Useful for understanding real-world usage patterns and ROI. |
AI Agent Evaluation Course | DeepLearning.AI course on agent evaluation. Covers Phoenix usage for agent observability, includes practical exercises. |
Phoenix MCP Integration | Model Context Protocol support for tracing client-server applications. Available in Phoenix 8.26+ with OpenInference instrumentation. |
Phoenix TypeScript Client | Native TypeScript support for Phoenix with OpenAI, Anthropic, and Vercel AI SDK integration. Essential for JavaScript/Node.js applications. |
Phoenix Evals Hub | Comprehensive guide to LLM evaluation techniques and best practices. Includes Phoenix-specific evaluation patterns and examples. |
Phoenix Open Source Repository | Main Phoenix GitHub repository with source code, issues, examples, and community contributions. Essential for troubleshooting and understanding Phoenix internals. |
Arize AI Learning Hub | Educational content on AI agents and evaluation. More marketing than technical but has some useful concepts. |