
Alright, you've fixed the obvious stuff. Now comes the hard part - storage and queries. This is where you either become a hero or accidentally nuke performance for everyone. I've done both, and I can tell you which one gets you called into more meetings.
EBS storage types matter more than AWS lets on. I've seen databases crawl because someone picked gp2 for a high-IOPS workload, then spent weeks debugging "performance issues" that were actually storage bottlenecks.
gp3 is Usually the Right Answer: General Purpose SSD (gp3) gives you 3,000 IOPS baseline regardless of volume size. gp2 gives you 3 IOPS per GB, which means a 100GB volume gets a pathetic 300 IOPS. That's why your small test databases run like garbage.
Storage Performance Reality Check:
- gp3: 3,000 baseline IOPS, scales to 16,000. Costs 20% less than gp2. There's literally no reason to use gp2 anymore.
- gp2: 3 IOPS per GB with burst credits that run out. When credits are gone, performance becomes unpredictable. I've seen production databases randomly slow down because someone ran a backup that ate all the burst credits.
- io2: Up to 64,000 IOPS with predictable latency. Expensive as hell but worth it if you actually need consistent sub-millisecond response times.
The gp2 Burst Credit Trap That Will Ruin Your Day
If You're Still on gp2, You're Playing Russian Roulette: Monitor BurstBalance in CloudWatch like your job depends on it, because it does. When burst credits hit zero, your database performance makes waiting for Windows updates seem fast:
- Queries that fly during quiet hours suddenly take 10x longer during busy periods
- Nightly backups eat your burst credits and make the next day's performance garbage
- Response times become completely random with no correlation to actual query complexity
I once spent three days debugging "intermittent performance issues" only to discover our gp2 volume was running out of burst credits every afternoon when the batch jobs ran. The CloudWatch graphs showed perfect CPU and memory usage, but queries would randomly time out with ERROR: canceling statement due to user request because everything was waiting for disk I/O.
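If you suspect the same thing, you can catch it in the act from the database side. A minimal check, assuming PostgreSQL 9.6 or later (everything here comes from the standard system views, nothing custom):
-- Sessions currently stuck waiting on storage I/O
SELECT pid,
       now() - query_start AS running_for,
       wait_event_type,
       wait_event,
       left(query, 80) AS query_snippet
FROM pg_stat_activity
WHERE wait_event_type = 'IO'
ORDER BY running_for DESC;
If this list is long every afternoon while BurstBalance in CloudWatch is heading toward zero, you've found your culprit.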
IOPS vs Throughput: Know the Difference: Most people obsess over IOPS but actually hit throughput limits first. gp3 lets you tune both independently, so you can optimize for your actual workload:
- OLTP workloads: You need IOPS for all those small, frequent transactions
- Analytics workloads: You need throughput for those huge table scans
- Mixed workloads: Use Performance Insights to figure out which one is actually killing you
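Performance Insights will show you this, but pg_stat_statements gives a quick server-side answer too. A sketch assuming PostgreSQL 13 or later (where the timing column is total_exec_time) with pg_stat_statements loaded:
-- Which statements do the most disk reads: many small calls, or a few huge scans?
SELECT left(query, 60) AS query_snippet,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       shared_blks_read,
       round(shared_blks_read::numeric / greatest(calls, 1), 1) AS blks_read_per_call
FROM pg_stat_statements
ORDER BY shared_blks_read DESC
LIMIT 20;
A handful of statements with huge blks_read_per_call usually means throughput-bound scans; thousands of cheap calls adding up points at IOPS.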
Storage Auto-Scaling Will Save Your Ass: RDS Storage Auto Scaling prevents the 3am call that your database ran out of space. But configure it wrong and you'll get a surprise AWS bill:
-- Storage auto-scaling configuration that won't bankrupt you
-- Maximum storage threshold: set it to 2-3x current usage, not some effectively unlimited number
-- Trigger: RDS scales on its own when free space stays below 10% of allocated storage for about 5 minutes
-- Cooldown: after a scaling event, storage can't be modified again for roughly 6 hours, so don't rely on it for sudden spikes
-- Example for 500GB production database:
-- Current: 500GB
-- Maximum storage threshold: 1500GB (room to grow without runaway costs)
-- Monitor: "FreeStorageSpace" CloudWatch metric
-- Alert: page a human when free space drops below 20-25%, long before auto-scaling has to step in
Aurora Storage: Pay-Per-IO vs I/O-Optimized (Choose Wisely)
Aurora Billing is More Complex Than Quantum Mechanics: Aurora I/O-Optimized changes how you pay for storage I/O. Standard Aurora charges per I/O operation, while I/O-Optimized rolls I/O into predictable instance and storage pricing with no per-I/O charges.
The Math That Matters:
- Standard Aurora: $0.20 per million I/O operations. Cheap until it's not.
- I/O-Optimized: Roughly 30% higher instance cost (and pricier storage), but no I/O charges. Predictable billing.
- Break-even point: Roughly when your I/O charges would exceed that premium; AWS's rule of thumb is I/O making up more than about 25% of your total Aurora bill. Do the math for your workload (a sketch follows below) or guess wrong and pay extra.
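Here's a back-of-the-envelope version of that math as a plain SQL SELECT. The monthly I/O count and instance cost are made-up placeholders, the $0.20 per million figure is the standard-tier I/O price mentioned above, and the 30% premium is approximate; it also ignores the higher I/O-Optimized storage rate, so treat it as a rough screen, not a quote:
-- Rough monthly comparison: standard per-I/O charges vs. the I/O-Optimized instance premium
SELECT monthly_ios / 1000000.0 * 0.20 AS standard_io_cost_usd,
       monthly_instance_cost_usd * 0.30 AS io_optimized_premium_usd,
       (monthly_ios / 1000000.0 * 0.20) > (monthly_instance_cost_usd * 0.30) AS io_optimized_probably_cheaper
FROM (VALUES (3000000000::bigint, 1200.00)) AS t(monthly_ios, monthly_instance_cost_usd);
Swap in your own measured I/O volume and actual instance pricing before making the call.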
Workload Evaluation for I/O Optimization:
-- Analyze I/O patterns using Performance Insights
-- High I/O workloads: Analytics, real-time processing, frequent backups
-- Low I/O workloads: Simple CRUD operations, read-heavy applications
-- Variable I/O: Consider cost predictability vs. optimization


Query Optimization: Fix Your Worst Queries First
RDS Optimized Reads for Complex Queries: RDS Optimized Reads helps when your queries create massive temporary objects. If you've ever run a complex JOIN that brings your database to its knees, this might save you.
When Queries Create Huge Temp Objects:
- Complex JOINs: When your query joins 5 tables with no indexes and creates giant intermediate results
- Sorting without indexes: ORDER BY on millions of rows without a proper index
- High-cardinality GROUP BY: Grouping by columns with millions of unique values
- Window functions: Analytics queries that partition and rank huge datasets
How Optimized Reads Works: Uses local NVMe storage for temp objects instead of your main EBS storage:
- 5x faster temp object processing - but only for specific workload patterns
- Less I/O contention on your main storage while queries run
- Better concurrency when multiple expensive queries run simultaneously
The catch? It only works on specific instance types and costs 10-15% more. Worth it if you actually have these problems.
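Before paying that premium, check whether your workload actually spills to temp. A quick look, assuming PostgreSQL with pg_stat_statements enabled:
-- Statements that write temporary files (the workloads Optimized Reads is aimed at)
SELECT left(query, 60) AS query_snippet,
       calls,
       temp_blks_written,
       pg_size_pretty(temp_blks_written * 8192) AS temp_written
FROM pg_stat_statements
WHERE temp_blks_written > 0
ORDER BY temp_blks_written DESC
LIMIT 10;
If this comes back empty or tiny, local NVMe for temp objects isn't going to buy you much.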
Instance Type Considerations: Optimized Reads is available on specific instance families:
- r6id, r6idn: Memory-optimized instances with local NVMe storage
- m6id, m6idn: General-purpose instances with local NVMe storage
- Cost premium: 10-15% higher than standard instances
Parameter Groups: Stop Using Defaults and Start Using Your Brain
Default Parameters Are Garbage for Production: AWS RDS parameter recommendations exist for a reason, yet the default parameter groups ship with settings optimized for 2003 hardware. I've seen too many slow databases that could be fixed by spending 10 minutes configuring a custom parameter group instead of just accepting whatever Amazon gives you.
PostgreSQL Parameters That Actually Matter: Use the PostgreSQL Performance Tuning Guide and AWS PostgreSQL Best Practices, but here's what I actually change in production:
-- Memory settings that won't kill your instance (for r6i.xlarge: 32GB RAM)
shared_buffers = '8GB' -- 25% of instance memory for buffer cache
work_mem = '32MB' -- Higher for complex queries, monitor total usage
maintenance_work_mem = '2GB' -- For VACUUM, CREATE INDEX operations
effective_cache_size = '24GB' -- 75% of instance memory, tells optimizer about cache
random_page_cost = 1.1 -- SSD storage is fast, tell Postgres to use indexes
seq_page_cost = 1.0 -- Sequential scan cost on SSD
-- Connection management that works
max_connections = 200 -- Don't go crazy, each connection uses ~2.5MB base memory
shared_preload_libraries = 'pg_stat_statements' -- Essential for finding slow queries
-- Autovacuum tuning (critical for performance)
autovacuum = on -- Turn this off and your database will slowly die
autovacuum_max_workers = 6 -- More workers for busy systems
autovacuum_work_mem = '256MB' -- Larger work memory for vacuum operations
track_counts = on -- Required for autovacuum to know what to clean
-- Write-ahead log optimization
wal_buffers = '64MB' -- Larger WAL buffers for write-heavy workloads
checkpoint_completion_target = 0.7 -- Spread checkpoint I/O over longer period
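After applying a custom parameter group, verify the values actually landed; static parameters need a reboot, and it's easy to assume a change applied when it didn't. A quick check against the running instance (the parameter names are the ones from the block above):
-- Confirm the settings from the block above are live
SELECT name, setting, unit, source, pending_restart
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem',
               'effective_cache_size', 'random_page_cost',
               'max_connections', 'autovacuum_max_workers');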
MySQL Performance Parameters: Follow the MySQL Performance Schema Guidelines and AWS MySQL Best Practices:
-- InnoDB memory optimization (for r6i.xlarge: 32GB RAM)
innodb_buffer_pool_size = '24GB' -- 75% of available memory for cache
innodb_buffer_pool_instances = 8 -- Multiple instances for concurrency
innodb_log_buffer_size = '64MB' -- Larger log buffer for write-heavy loads
innodb_sort_buffer_size = '2MB' -- Memory for index creation/rebuilding
-- Transaction and durability settings
innodb_flush_log_at_trx_commit = 1 -- Full durability (AWS recommendation)
innodb_flush_method = O_DIRECT -- Avoid double buffering on Linux
innodb_file_per_table = ON -- Individual tablespace files
innodb_stats_persistent = ON -- Consistent query plans across restarts
-- I/O and concurrency optimization
innodb_io_capacity = 2000 -- Match your storage IOPS capability
innodb_io_capacity_max = 4000 -- Maximum I/O for background operations
innodb_read_io_threads = 8 -- Increase for read-heavy workloads
innodb_write_io_threads = 8 -- Increase for write-heavy workloads
innodb_purge_threads = 4 -- Background cleanup threads
-- Connection and query optimization
max_connections = 500 -- Higher than PostgreSQL (lighter per-connection)
max_user_connections = 450 -- Leave headroom for admin connections
interactive_timeout = 300 -- Kill idle connections after 5 minutes
wait_timeout = 300 -- Same timeout for non-interactive connections
table_open_cache = 4000 -- Cache for frequently accessed tables
-- Disable features that hurt SSD performance
innodb_change_buffering = NONE -- Optimal for modern SSD storage
query_cache_type = 0 -- Disable on MySQL 5.7 (causes contention; the query cache was removed entirely in 8.0)
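Same sanity check on the MySQL side once the parameter group is attached and the instance has rebooted where needed (variable names are the ones from the block above):
-- Confirm the InnoDB and connection settings are live
SHOW GLOBAL VARIABLES
WHERE Variable_name IN ('innodb_buffer_pool_size', 'innodb_io_capacity',
                        'innodb_flush_log_at_trx_commit', 'max_connections');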
Index Strategy and Maintenance
Index Optimization Framework: Effective indexing requires understanding query patterns, not just creating indexes for every WHERE clause:
Covering Index Strategy: Design indexes that include all columns needed for query execution, eliminating table lookups:
-- Instead of separate indexes on (user_id) and (created_date)
-- Create one covering index for the common query pattern (INCLUDE syntax requires PostgreSQL 11+)
CREATE INDEX idx_user_activities_covering
ON user_activities (user_id, created_date)
INCLUDE (activity_type, metadata);
Partial Index Optimization: For large tables with query patterns that filter on specific values:
-- Index only active records for frequently filtered queries
CREATE INDEX idx_active_users
ON users (email)
WHERE status = 'active';
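To confirm the planner actually uses it, run an EXPLAIN against a query shaped like your real traffic (the literal email value here is just a placeholder):
-- Should show an Index Scan (or Index Only Scan) on idx_active_users
EXPLAIN (ANALYZE, BUFFERS)
SELECT email FROM users
WHERE status = 'active' AND email = 'someone@example.com';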
Index Maintenance Automation: Configure automatic statistics updates to prevent query plan degradation:
- PostgreSQL: Ensure autovacuum is properly configured and running
- MySQL: Enable innodb_stats_auto_recalc for automatic statistics updates
- Monitor: Use Performance Insights to identify queries with changing execution plans
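Maintenance also means pruning. Indexes nobody reads still cost you on every write and every vacuum; here's a quick way to find removal candidates on PostgreSQL (check replicas and recent statistics resets before dropping anything):
-- Indexes that have never been scanned since stats were last reset
SELECT schemaname,
       relname AS table_name,
       indexrelname AS index_name,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;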
Advanced Monitoring and Alerting
Custom CloudWatch Metrics: Create composite metrics that combine multiple performance indicators:
-- Storage performance composite metric
-- Combine IOPS utilization + queue depth + latency
-- Alert when multiple indicators show stress simultaneously
-- Connection efficiency metric
-- DatabaseConnections / max_connections ratio
-- Connection creation rate vs. steady state
-- Connection pool efficiency calculations
Performance Insights Query Analysis: Use Performance Insights and Enhanced Monitoring to identify optimization opportunities:
- Top SQL by execution time: Focus optimization efforts on highest-impact queries
- Wait event analysis: Understand what database operations are blocking performance
- Dimension analysis: Correlate performance issues with specific users, hosts, or databases
Proactive Performance Monitoring: Configure alerts for leading indicators, not just problems:
- Storage utilization trends: Alert before reaching capacity limits
- Query performance degradation: Track execution time increases for critical queries
- Connection pattern changes: Detect application behavior changes that impact database
Systematic Optimization Approach:
- Baseline establishment: Use Compute Optimizer and Performance Insights to understand current performance characteristics
- Storage optimization: Migrate to appropriate storage types based on workload analysis
- Query optimization: Address highest-impact performance bottlenecks identified through monitoring
- Connection optimization: Implement appropriate pooling strategies for workload patterns
- Ongoing monitoring: Establish automated monitoring and alerting for performance regression detection
ROI Tracking: Measure optimization impact through:
- Query execution time improvements: Track specific query performance gains
- Resource utilization efficiency: Monitor CPU, memory, and storage utilization improvements
- Cost reduction: Calculate infrastructure cost savings from rightsizing and efficiency gains
- Application performance: Measure end-user experience improvements through APM tools
Storage and query optimization isn't about implementing every fancy feature AWS offers. It's about fixing the actual bottlenecks that are killing your performance. Start with storage (it's usually the easiest win), then tackle the slow queries that Performance Insights is screaming about. Don't optimize for the sake of optimization - fix what's actually broken.
Here's your battle plan, in order of impact:
Week 1: The Foundation (Low effort, high impact)
- Enable Compute Optimizer and wait 2 weeks for recommendations
- Migrate gp2 → gp3 storage (immediate 20% cost reduction + better performance)
- Create custom parameter groups with production-appropriate settings
Week 3-4: Connection Management (Medium effort, eliminates outages)
- Implement RDS Proxy for Lambda/microservice workloads
- Configure application-level connection pooling with proper sizing
- Set up connection monitoring alerts before you hit limits
Month 2: Advanced Optimization (High effort, substantial gains)
- Apply Compute Optimizer recommendations systematically
- Implement query optimization for Performance Insights top offenders
- Consider Graviton migration for cost-sensitive workloads
Ongoing: Monitoring and Maintenance (Prevents regression)
- Configure proactive alerting for leading indicators
- Regular query performance reviews using Performance Insights
- Quarterly cost optimization reviews using Cost Explorer
The difference between databases that struggle and databases that scale isn't rocket science—it's methodically addressing each performance layer without getting distracted by shiny features you don't need. Start with the fundamentals, measure the impact, then move to the next optimization. Your future self (and your AWS bill) will thank you.