Why is my Cosmos DB bill $8,000 this month when I expected $500?

Welcome to the fucking club. Here's probably what happened:

1. **You enabled multi-region writes** without reading the fine print (+200% cost)
2. **Your queries scan entire collections** instead of using partition keys
3. **You're indexing binary data** (images, PDFs) that you never search
4. **Autoscale is stuck at maximum** because of shitty partition key design
5. **Cross-partition queries** during traffic spikes

**Fix it now:**

- Check Azure Cost Management to see what's eating your budget
- Look at RU consumption metrics - normalized RU > 80% = problem
- Turn off multi-region writes unless you actually need them
- Review indexing policy and exclude large fields

**Real production costs:**

- **Small app** (10K users): $300-1,200/month (not the $200-800 Microsoft claims)
- **Medium app** (100K users): $1,500-5,000/month
- **Large app** (1M+ users): $5,000-25,000/month

Can I use multiple APIs on the same data? I heard it's possible.

**No. Don't even think about it.**

Technically possible doesn't mean good idea. Each API expects different data shapes and has different performance. You'll get:

- Data corruption from schema mismatches
- Performance issues from API translation overhead
- Debugging nightmares when something breaks
- Angry team members who have to maintain your mess

**One API per container. Period.**

My queries are taking 30+ seconds. What the hell is wrong?

**90% of the time it's one of these:**

1. **Terrible partition key design** - you're scanning every partition for every query
2. **Missing indexes** - you excluded too much from indexing policy
3. **Cross-partition queries** - WHERE clause doesn't include partition key
4. **Hot partitions** - all data in one partition getting throttled

**Debug steps that work:**

```sql
-- Check partition distribution (NoSQL API)
SELECT c.partitionKey, COUNT(1) as count
FROM c
GROUP BY c.partitionKey
ORDER BY count DESC
```

If one partition has like 10x more docs than others, you're pretty much screwed. Time to rebuild with a better partition key or find a new job.

**Quick fixes:**

- Add partition key to WHERE clause
- Check if you're doing `SELECT *` and returning huge documents
- Look at query metrics - RU consumption > 100 for simple queries = problem
- Increase RUs temporarily to see if it's just throttling
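The "10x more docs" rule of thumb is easy to check mechanically once you have the per-partition counts. Here's a minimal Python sketch, assuming you've exported the query results as a list of rows. `find_hot_partitions` and the row shape are hypothetical (they just mirror the query aliases), and comparing against the median is my choice, not something Cosmos DB defines.

```python
from statistics import median

def find_hot_partitions(rows, ratio=10):
    """Return partition keys whose doc count is >= ratio x the median count.

    rows: list of dicts like {"partitionKey": "tenant-a", "count": 120},
    i.e. the shape the GROUP BY query returns (assumed export format).
    """
    if not rows:
        return []
    typical = median(r["count"] for r in rows)
    # max(typical, 1) avoids flagging everything when the median is 0
    return [r["partitionKey"] for r in rows if r["count"] >= ratio * max(typical, 1)]

rows = [
    {"partitionKey": "tenant-a", "count": 100},
    {"partitionKey": "tenant-b", "count": 120},
    {"partitionKey": "tenant-c", "count": 90},
    {"partitionKey": "tenant-d", "count": 5000},
]
print(find_hot_partitions(rows))  # ['tenant-d']
```

Document count is only a proxy: a partition can be hot from traffic alone, so treat an empty result as "no obvious storage skew", not as a clean bill of health.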

I'm getting 429 errors constantly. How do I fix this?

**429 = "Too Many Requests" = you're being throttled**

**Immediate fixes:**

1. **Increase provisioned RUs** (costs money but stops the bleeding)
2. **Enable autoscale** if you haven't
3. **Implement retry logic** in your app (SDKs do this automatically)
4. **Check for hot partitions** in Azure Monitor

**Long-term fixes:**

- Redesign partition key if one partition is getting hammered
- Optimize queries to consume fewer RUs
- Spread traffic across multiple partition keys
- Use bulk operations for multiple document operations

**Reality check**: If you're getting 429s during normal operation, your partition key probably sucks ass.

Should I use Provisioned or Serverless? I'm confused as hell.

**Provisioned Throughput:**

- Pay for RUs whether you use them or not
- Cheaper if you have consistent traffic
- Required for multi-region deployments
- Can handle traffic spikes if you provision enough

**Serverless:**

- Pay only for RUs you consume
- 2x more expensive per RU than Provisioned
- Single region only
- Gets expensive fast under sustained load

Use Serverless for dev/test environments. Use Provisioned for production unless your app gets less than 1000 requests/day.
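To make the trade-off concrete: if Serverless really costs about 2x per RU (the figure above; treat it as an assumption and check current pricing), it only wins while you consume less than half of what you'd otherwise provision. A rough sketch in relative units, with no real prices involved. `cheaper_mode` is a hypothetical helper.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # rough 30-day month

def cheaper_mode(provisioned_ru_per_sec, consumed_ru_per_month, serverless_multiplier=2.0):
    """Compare relative monthly cost, measured in RUs rather than dollars.

    Provisioned pays for every RU of capacity, used or not.
    Serverless pays only for consumed RUs, at serverless_multiplier x the rate.
    """
    provisioned_cost = provisioned_ru_per_sec * SECONDS_PER_MONTH
    serverless_cost = consumed_ru_per_month * serverless_multiplier
    return "serverless" if serverless_cost < provisioned_cost else "provisioned"

capacity = 400 * SECONDS_PER_MONTH          # RUs available per month at 400 RU/s
print(cheaper_mode(400, 0.10 * capacity))   # serverless (10% utilization)
print(cheaper_mode(400, 0.70 * capacity))   # provisioned (70% utilization)
```

With a 2x multiplier the break-even sits at 50% average utilization, which is why a mostly idle dev database fits Serverless and a steadily busy production workload doesn't.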

How much do operations actually cost in RUs?

| What You're Doing | Document Size | RU Cost | Reality Check |
|------------------|---------------|---------|---------------|
| Read by ID | 1KB | 1 RU | Only thing that's consistent |
| Write new doc | 1KB | 5-6 RUs | Higher with lots of indexes |
| Update existing | 1KB | 6-8 RUs | Depends on what changed |
| Simple query | 10 results | 5-15 RUs | Add partition key or pay more |
| Cross-partition scan | 100 results | 100-500 RUs | Expensive as hell |
| Complex aggregation | 1000 docs | 200-1000 RUs | Can bankrupt you during spikes |

**How to not waste RUs:**

- Always include partition key in WHERE clauses
- Use bulk operations for multiple writes
- Don't index fields you never query
- Use point reads (by ID) whenever possible
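Those per-operation numbers turn into a back-of-the-envelope RU/s budget pretty quickly. A minimal sketch, assuming the upper bound of each range in the table and 1KB documents. `RU_COST` and `estimate_ru_per_sec` are hypothetical names, and your real costs will differ, so measure them.

```python
# Upper-bound RU costs per operation, taken from the table above (1KB docs).
RU_COST = {
    "point_read": 1,
    "write": 6,
    "update": 8,
    "simple_query": 15,
    "cross_partition_scan": 500,
    "aggregation": 1000,
}

def estimate_ru_per_sec(ops_per_sec):
    """Sum RU/s for a workload given as {operation_name: operations_per_second}."""
    return sum(RU_COST[op] * rate for op, rate in ops_per_sec.items())

workload = {"point_read": 200, "write": 20, "simple_query": 10}
print(estimate_ru_per_sec(workload))  # 470

# Add just two cross-partition scans per second and the total more than triples
workload["cross_partition_scan"] = 2
print(estimate_ru_per_sec(workload))  # 1470
```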

My app randomly throws errors. What's happening?

**Common Cosmos DB errors:**

**429 - Too Many Requests**: You're being throttled

- **Fix**: Increase RUs or fix partition key design

**404 - Not Found**: Document or container doesn't exist

- **Fix**: Check database/container names and document IDs

**400 - Bad Request**: Malformed query or document

- **Fix**: Check JSON structure and query syntax

**503 - Service Unavailable**: Cosmos DB is having issues

- **Fix**: Implement retry logic and wait it out

**RequestRateTooLarge**: Same as 429, different name

- **Fix**: Same as 429 - more RUs or better partition keys

I need to migrate from MongoDB/SQL. How screwed am I?

**From MongoDB**: Not terrible

- Use Azure Database Migration Service for the data
- Most MongoDB code works with minimal changes
- **Gotcha**: Some MongoDB features aren't 100% compatible

**From SQL Server**: Pretty painful

- You'll need to denormalize your relational data
- Export to JSON and import, or use Azure Data Factory
- **Reality check**: Plan for weeks of refactoring, not days

**Migration checklist:**

- [ ] Test new partition key with realistic data volume
- [ ] Measure RU consumption with actual query patterns
- [ ] Plan for downtime during cutover (migration tools lie about zero downtime)
- [ ] Have rollback plans ready

What consistency level should I actually use?

**Session Consistency**: Use this for 95% of applications

- Users see their own writes immediately
- Don't see other users' writes immediately (usually fine)
- Best performance/consistency balance

**Strong Consistency**: Only for financial transactions

- Everyone sees the same data always
- Costs 2x RUs and limits you to single-region writes
- Required for payments, banking, inventory

**Everything else**: You probably don't need them

- **Eventual**: For analytics and logging where "close enough" works
- **Bounded Staleness**: Rarely needed in practice
- **Consistent Prefix**: Even more rarely needed

Can I integrate with other Azure services?

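**Yes, and it's actually pretty good:**

- **Azure Functions**: Change feed triggers work great for real-time processing
- **Azure Search**: Cosmos DB indexer gives you full-text search
- **Power BI**: Direct Query works but can be slow with large datasets
- **Synapse Analytics**: Synapse Link for analytics without killing production performance
- **Stream Analytics**: Direct output to Cosmos DB for real-time data ingestion

**Example that works** (attribute names match the 3.x Functions extension; the trigger attribute goes on the parameter, and a lease collection is needed to track progress):

```csharp
using System.Collections.Generic;
using Microsoft.Azure.Documents;
using Microsoft.Azure.WebJobs;

public static class UserChanges
{
    [FunctionName("ProcessUserChanges")]
    public static void ProcessUserChanges(
        [CosmosDBTrigger(
            databaseName: "MyDB",
            collectionName: "Users",
            ConnectionStringSetting = "CosmosDBConnection",
            LeaseCollectionName = "leases")] IReadOnlyList<Document> docs)
    {
        // Runs whenever documents change
        // Great for cache invalidation, notifications, etc.
    }
}
```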

Azure Cosmos DB: AI-Optimized Implementation Guide

Configuration

API Selection Decision Matrix

Primary Recommendation: NoSQL API (formerly the Core (SQL) API)

Performance: Best RU efficiency, gets features first
Feature Support: Stored procedures, triggers, patch operations
Update Priority: Patches released 2+ days faster than other APIs
RU Consumption: 20-30% more efficient than MongoDB API

Alternative APIs - Use Cases Only

| API | Use When | Avoid When | RU Cost Multiplier |
|-----|----------|------------|--------------------|
| MongoDB | Migrating existing MongoDB code | Starting fresh | 1.2-1.3x |
| Table | Simple key-value operations only | Complex queries needed | 1.0x |
| Cassandra | Time-series/IoT at massive scale | Need secondary indexes | 1.1x |
| Gremlin | Graph traversals required | Performance matters | 2-10x |

Production-Ready Configuration Settings

Account Setup - Critical Decisions

{
  "capacityMode": "Provisioned", // Serverless costs 2x per operation
  "consistencyLevel": "Session", // 95% of applications
  "backupPolicy": "Continuous", // Saves jobs during disasters
  "multiRegionWrites": false // +200% cost, rarely needed
}

Partition Key Design Rules

Minimum unique values: 1000+ (not 5-10)
Avoid: timestamps, status fields, device types
Prefer: userIds, customerIds, deviceIds
Cannot change: After container creation - rebuild required

Proven Partition Key Patterns

// E-commerce: Customer isolation
"partitionKey": "/customerId"

// IoT: Device distribution
"partitionKey": "/deviceId"

// Multi-tenant: Tenant isolation
"partitionKey": "/tenantId"

// Content: User-based access
"partitionKey": "/userId"

Indexing Policy - Production Optimized

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {"path": "/userId/?"},
    {"path": "/createdDate/?"}
  ],
  "excludedPaths": [
    {"path": "/largeDescription/*"},
    {"path": "/binaryData/*"},
    {"path": "/*"}
  ]
}
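
The opt-in pattern above (index a handful of paths, exclude everything else) is easy to generate from the list of fields you actually filter or sort on. A minimal Python sketch that only builds the JSON document. `build_indexing_policy` is a hypothetical helper, it handles top-level scalar fields only, and applying the policy to a container is left to whatever SDK or tooling you use.

```python
import json

def build_indexing_policy(queried_fields):
    """Build an opt-in indexing policy: index only the listed top-level fields."""
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": f"/{field}/?"} for field in queried_fields],
        # Excluding the root means anything not listed above is not indexed
        "excludedPaths": [{"path": "/*"}],
    }

policy = build_indexing_policy(["userId", "createdDate"])
print(json.dumps(policy, indent=2))
```

Because the root path is excluded, large fields such as `/largeDescription` and `/binaryData` don't need their own entries. The flip side is that any new field you start querying on must be added here first, or those queries fall back to scans.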

Resource Requirements

Cost Structure Reality

Monthly Costs by Scale

Small app (10K users): $300-1,200/month
Medium app (100K users): $1,500-5,000/month
Large app (1M+ users): $5,000-25,000/month

RU Capacity Estimation

Microsoft calculator accuracy: Multiply by 2x for realistic estimates
Starting point: 25% of estimated need, monitor for 1 week
Autoscale costs: Billed at 1.5x the standard provisioned rate; throughput scales between 10% and 100% of the configured maximum
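
These rules of thumb can be written down as a small planning helper. A minimal sketch, assuming "25% of estimated need" refers to the doubled (realistic) figure. That reading, the function name `capacity_plan`, and the sample input are all my assumptions.

```python
def capacity_plan(calculator_ru_per_sec):
    """Apply the rules of thumb above to a capacity-calculator estimate (RU/s)."""
    realistic = calculator_ru_per_sec * 2   # calculator results get doubled
    starting = realistic // 4               # start at 25%, then monitor for a week
    return {
        "realistic_ru_per_sec": realistic,
        "starting_ru_per_sec": starting,
        # Autoscale moves between 10% and 100% of its configured maximum
        "autoscale_range": (realistic // 10, realistic),
    }

plan = capacity_plan(2000)
print(plan["realistic_ru_per_sec"])  # 4000
print(plan["starting_ru_per_sec"])   # 1000
print(plan["autoscale_range"])       # (400, 4000)
```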

Operation Costs in Request Units

| Operation | Document Size | RU Cost | Performance Impact |
|-----------|---------------|---------|--------------------|
| Point read by ID | 1KB | 1 RU | Only consistent cost |
| Write new document | 1KB | 5-6 RUs | Higher with many indexes |
| Update existing | 1KB | 6-8 RUs | Depends on changed fields |
| Simple query | 10 results | 5-15 RUs | Add partition key or pay more |
| Cross-partition scan | 100 results | 100-500 RUs | Destroys budget during spikes |
| Complex aggregation | 1000 docs | 200-1000 RUs | Can bankrupt applications |

Time Investment Requirements

Setup and Implementation

Basic setup: 1-2 days with proper guidance
Partition key mistakes: 6+ weeks rebuild time
Learning curve by API: NoSQL (2-3 weeks), MongoDB (easy if known), Cassandra (1-2 months), Gremlin (3-6 months)

Migration Timelines

From MongoDB: Weeks with Database Migration Service
From SQL Server: Months (requires denormalization)
Downtime planning: Migration tools overstate "zero downtime" capabilities

Critical Warnings

Breaking Points and Failure Modes

Partition Key Disasters

Hot partition threshold: >80% normalized RU utilization
Common failures:
- E-commerce using /orderStatus (95% of orders are "pending")
- IoT using /timestamp (all current data in one partition)
- Multi-tenant using /tenantPlan (enterprise customers overwhelm partition)
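
Since a bad key means a rebuild, it's worth scoring candidate keys against sample data before creating the container. A minimal sketch using synthetic documents. `evaluate_partition_key` is a hypothetical helper, the 1000-value floor comes from the design rules earlier in this guide, and the sample data is made up to mimic the /orderStatus failure.

```python
from collections import Counter

def evaluate_partition_key(docs, field, min_unique=1000):
    """Report cardinality and skew for a candidate key (docs must be non-empty)."""
    counts = Counter(doc[field] for doc in docs)
    largest_share = max(counts.values()) / len(docs)
    return {
        "field": field,
        "unique_values": len(counts),
        "largest_partition_share": round(largest_share, 4),
        "enough_unique_values": len(counts) >= min_unique,
    }

# Synthetic sample: 2,000 customers, 95% of orders still "pending"
docs = [
    {"customerId": f"cust-{i % 2000}", "orderStatus": "pending" if i % 20 else "shipped"}
    for i in range(10_000)
]
print(evaluate_partition_key(docs, "orderStatus"))
print(evaluate_partition_key(docs, "customerId"))
```

Here `orderStatus` comes back with 2 unique values and 95% of the data in one partition, while `customerId` has 2,000 values and no partition holding more than 0.05%. Storage distribution is only half the picture, so check request distribution under load as well.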

Query Performance Killers

Cross-partition queries: Missing partition key in WHERE clause
Index explosion: Indexing 5MB+ binary fields consumes 100+ RUs per write
Aggregation pipeline inefficiency: MongoDB operations can cost 800 RUs vs 20 RUs in NoSQL

Cost Explosion Triggers

Multi-region writes enabled: +200% cost increase
Default indexing on large fields: Binary data indexing
Autoscale stuck at maximum: Poor partition key distribution
Serverless in production: 2x cost per operation under load

Consistency Level Gotchas

Session Consistency (Recommended)

Multi-device users: See inconsistencies across devices
Cost: 1x RUs (baseline)
Appropriate for: 95% of applications

Strong Consistency Limitations

Regional restriction: Single write region only
Cost: 2x RUs
Required for: Financial transactions, payments, inventory

Performance Troubleshooting

429 Throttling Errors

Root cause: Hot partitions or insufficient RUs
Immediate fix: Increase provisioned capacity
Long-term fix: Redesign partition key
Code requirement: Implement retry logic with exponential backoff
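
A minimal sketch of retry with exponential backoff, kept independent of any SDK. `ThrottledError` is a hypothetical stand-in for whatever exception your client raises on a 429, and the delays are arbitrary defaults. Real 429 responses also carry a retry-after hint that the official SDKs honor; this sketch ignores it for brevity.

```python
import random
import time

class ThrottledError(Exception):
    """Hypothetical stand-in for the client's 429 / RequestRateTooLarge error."""

def with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call operation(); on ThrottledError wait base_delay * 2^attempt (plus jitter), then retry."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the throttling to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many clients from retrying in lockstep
            sleep(delay + random.uniform(0, delay / 2))

# Usage: a fake operation that is throttled twice, then succeeds
calls = {"n": 0}

def flaky_write():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise ThrottledError()
    return "ok"

print(with_backoff(flaky_write, sleep=lambda s: None))  # ok
```

Retries only smooth over short bursts. If the same partition is throttled all day, backoff just turns errors into latency, and the long-term fix above still applies.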

Query Performance Issues

30+ second queries: Usually partition key problems
Debug method: Check partition distribution with COUNT GROUP BY
Quick fixes: Add partition key to WHERE, avoid SELECT *, increase RUs temporarily

Integration Limitations

MongoDB API Compatibility

Missing features: GridFS, some aggregation pipeline operations
Behavioral differences: Compound indexes not used properly
Performance gap: 20-30% higher RU consumption than NoSQL API

Multi-API Usage

Technical possibility: Same data accessible via different APIs
Reality: Data corruption, performance issues, debugging nightmares
Best practice: One API per container, never mix

Backup and Recovery

Backup Policy Critical Settings

Continuous backup: Required for point-in-time recovery
Cost impact: Additional charges but saves jobs during disasters
Recovery reality: Point-in-time restore can take hours

Monitoring Requirements

Essential Alerts

RU consumption >80%: Immediate capacity increase needed
429 error rate >1%: User experience degradation
P99 latency >100ms: Performance investigation required
Monthly cost variance >20%: Configuration review needed
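
The four thresholds above fit in a small table-driven check, for example in a scheduled job that reads metrics you already export. A minimal sketch. The metric names and `breached_alerts` are hypothetical, and in practice you'd likely configure these as Azure Monitor alert rules instead.

```python
# Thresholds from the alert list above
THRESHOLDS = {
    "normalized_ru_pct": 80,   # RU consumption
    "throttle_rate_pct": 1,    # share of requests answered with 429
    "p99_latency_ms": 100,
    "cost_variance_pct": 20,   # change in monthly cost
}

def breached_alerts(metrics):
    """Return the names of metrics that exceed their threshold."""
    return [name for name, limit in THRESHOLDS.items() if metrics.get(name, 0) > limit]

sample = {
    "normalized_ru_pct": 92,
    "throttle_rate_pct": 0.4,
    "p99_latency_ms": 140,
    "cost_variance_pct": 5,
}
print(breached_alerts(sample))  # ['normalized_ru_pct', 'p99_latency_ms']
```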

Metric Interpretation

Normalized RU utilization: Per-partition health indicator
Hot partition detection: One partition consistently >80% utilization
Cross-partition query identification: High RU consumption without partition key

Operational Realities

Development vs Production

Emulator limitations: Performance doesn't match production, SSL certificate issues
Development costs: $200+/month without emulator usage
Testing requirements: Load testing with realistic data volumes mandatory

Team Knowledge Requirements

Partition key design: Cannot be learned from documentation alone
RU optimization: Requires understanding of indexing and query patterns
Troubleshooting skills: 3 AM debugging scenarios require deep Cosmos DB knowledge

Hidden Dependencies

SDK connection management: Tricky with .NET SDK v3
Change feed processing: Requires understanding of continuation tokens
Bulk operations: Essential for high-throughput scenarios

This guide prioritizes operational intelligence over theoretical knowledge, focusing on real-world implementation challenges and cost optimization strategies based on production experience.

Useful Links for Further Investigation

Resources: The Good, The Bad, and The Useless

| Link | Description |
|------|-------------|
| Azure Cosmos DB Documentation Hub | Microsoft's docs are surprisingly not terrible, unlike most other Azure services. Start here and bookmark it. |
| Request Units Explained | Critical reading. RUs are how you get charged, so understand this or go broke. |
| Partitioning Guide | Most important thing to get right. Screw up partition keys and rebuild everything. |
| Cosmos DB Emulator | Essential for development. Saves you hundreds in dev costs per month. |
| Azure Cosmos DB Explorer | Web UI that's actually usable for quick data queries and exploration. |
| .NET SDK v3 | Solid SDK with good bulk operation support. Connection management can be tricky. |
| Data Migration Tool | Works for small datasets. Don't trust it for production migrations without extensive testing. |
| Capacity Planner | Microsoft's RU calculator. Multiply results by 2x for realistic estimates. |
| Azure Monitor for Cosmos DB | Set up alerts for RU consumption > 80% or get surprised by throttling. |
| Performance Guide | Contains useful patterns, not just marketing fluff. |
| Azure Functions Bindings | Change feed triggers work well for real-time processing. Use for cache invalidation, notifications. |
| Synapse Link | Analytics without killing production performance. Useful if you need real-time reporting. |
| Azure Search Integration | Gives you full-text search. Indexer can be slow with large datasets. |
| Power BI Direct Query | Works but performance is unpredictable. Cache your aggregations. |
| Database Migration Service | Works for MongoDB migrations but test thoroughly first. "Online" doesn't mean zero downtime. |
| Data Factory | Good for ETL pipelines. Complex transformations get expensive in RUs. |
| Data Modeling Guide | Helps with denormalization concepts. Real-world data modeling is messier than examples suggest. |
| Official Pricing | Starting point. Remember multi-region writes are much more expensive. |
| Cost Optimization | Some useful tips buried in marketing speak. Focus on indexing and query optimization. |
| Reserved Capacity | 1-3 year commitments for cost savings. Only if you're sure about usage patterns. |
| Stack Overflow | Real developers with real problems. Better than official forums for practical solutions. |
| Microsoft Q&A | Official support team sometimes responds. Hit or miss quality. |
| Cosmos DB Blog | New features and announcements. Occasionally has useful performance tips. |
| Azure Updates | Track breaking changes and new features that might affect your bill. |
| Change Feed Patterns | Event sourcing and real-time processing patterns. Useful for microservices. |
| Multi-tenancy | Tenant isolation strategies. Critical for SaaS applications. |
| Time Series Patterns | IoT and metrics data modeling. Partition key design is crucial here. |
| Microsoft Learn Path | Free hands-on labs. Actually pretty good for beginners. |
| Official Workshops | Practical exercises. More useful than typical Microsoft training. |
| DP-420 Certification | If your company pays for certs. Real-world experience matters more. |
