pt-online-schema-change: AI-Optimized Technical Reference
Tool Overview
Function: MySQL schema changes without table locks
Source: Percona Toolkit
Current Version: 3.7.0 (December 2024)
MySQL Support: 5.7-8.4, MariaDB 10.x
Critical Implementation Requirements
Prerequisites (FAILURE CONDITIONS)
- Primary Key Required: Tool refuses to run without a primary key or unique index, because the DELETE trigger needs one to identify rows
- Disk Space: Requires 3x table size free space (official docs claim 2x - insufficient)
- MySQL Version: Minimum 5.0.2+, full MySQL 8.4 support requires Toolkit 3.7.0+
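The 3x disk rule above can be checked with a few lines of shell arithmetic. This is a minimal sketch: `required_free_bytes` and `has_headroom` are hypothetical helper names, and the byte counts in the example are made-up inputs rather than measurements.

```shell
#!/usr/bin/env bash
# Sketch: check free space against the 3x rule of thumb from this guide.
# required_free_bytes and has_headroom are hypothetical helper names.

# Print the free bytes needed for a table of the given size (3x multiplier).
required_free_bytes() {
  local table_bytes=$1
  echo $(( table_bytes * 3 ))
}

# Succeed (exit 0) only if free_bytes covers 3x the table size.
has_headroom() {
  local table_bytes=$1 free_bytes=$2
  [ "$free_bytes" -ge "$(required_free_bytes "$table_bytes")" ]
}

# Example: a 10 GiB table with 25 GiB free does not pass (30 GiB needed).
gib=$(( 1024 * 1024 * 1024 ))
if has_headroom $(( 10 * gib )) $(( 25 * gib )); then
  echo "ok to proceed"
else
  echo "not enough free space"
fi
```

In practice the table size would likely come from `data_length + index_length` in `information_schema.tables`, and the free bytes from `df` on the MySQL data volume.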
Resource Requirements by Table Size
Table Size | Time Estimate | Real-World Range |
---|---|---|
1GB | 30 minutes | 30 minutes - 2 hours |
10GB | 2 hours | 2-8 hours |
100GB | 8 hours | 8-24 hours |
500GB+ | Days | Multiple days |
Critical Warning: Last 10% often takes as long as first 90% due to lock contention
Operational Process
How It Works
- Shadow Table Creation: Empty copy with schema changes applied
- Chunked Data Copy: 1,000 rows per batch (configurable), pauses between batches
- Trigger Synchronization: 3 triggers capture writes during copy process
- Atomic Swap: RENAME TABLE operation (milliseconds, not hours)
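The steps above can be sketched as the SQL they roughly correspond to. This is a conceptual illustration only: the helper `print_osc_steps`, the trigger names, and the two-column `users` table are assumptions, and the statements the tool actually generates differ in detail. The triggers are created before the copy starts so that no concurrent writes are missed.

```shell
#!/usr/bin/env bash
# Conceptual sketch only: prints SQL resembling what happens for a
# hypothetical table with primary key `id` and columns (id, name).
# Trigger names and column lists here are illustrative assumptions.

print_osc_steps() {
  local tbl=$1 alter=$2
  cat <<SQL
-- Shadow table: empty copy, then apply the schema change to it
CREATE TABLE _${tbl}_new LIKE ${tbl};
ALTER TABLE _${tbl}_new ${alter};

-- Triggers: mirror writes on the original into the shadow table
CREATE TRIGGER ${tbl}_ins AFTER INSERT ON ${tbl} FOR EACH ROW
  REPLACE INTO _${tbl}_new (id, name) VALUES (NEW.id, NEW.name);
CREATE TRIGGER ${tbl}_upd AFTER UPDATE ON ${tbl} FOR EACH ROW
  REPLACE INTO _${tbl}_new (id, name) VALUES (NEW.id, NEW.name);
CREATE TRIGGER ${tbl}_del AFTER DELETE ON ${tbl} FOR EACH ROW
  DELETE IGNORE FROM _${tbl}_new WHERE id = OLD.id;

-- Chunked copy: repeated for each primary-key range
INSERT IGNORE INTO _${tbl}_new (id, name)
  SELECT id, name FROM ${tbl} WHERE id BETWEEN 1 AND 1000;

-- Atomic swap
RENAME TABLE ${tbl} TO _${tbl}_old, _${tbl}_new TO ${tbl};
SQL
}

print_osc_steps users "ADD COLUMN email VARCHAR(255)"
```

The `IGNORE` on the copy matters: a row already written by a trigger must not be overwritten by an older version arriving from the chunked copy.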
vs Standard ALTER TABLE
- Standard ALTER: Can lock the entire table for the duration when the change cannot run as online DDL (hours for large tables)
- pt-osc: No table locks, controlled resource usage, production-safe
Critical Failure Scenarios
Disk Space Exhaustion
- Symptom: Tool crashes at ~95% completion when logs fill up
- Prevention: Monitor with `df -h`, ensure 3x table size free
- Impact: Production downtime, emergency disk space procurement
Lock Wait Timeouts
- Symptom: `DBD::mysql::db do failed: Lock wait timeout exceeded`
- Cause: Long-running queries blocking tool
- Solution: Identify blocking queries with `SHOW PROCESSLIST`, then terminate them with `KILL <id>`
- Prevention: Run during low-traffic periods (2-6 AM)
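One way to narrow the process list down to likely blockers is to filter by runtime. The sketch below only builds the query text; `long_query_sql` is a hypothetical helper name and the 60-second threshold is an arbitrary assumption. It reads `information_schema.processlist`, which recent MySQL 8.x releases deprecate in favour of `performance_schema.processlist`.

```shell
#!/usr/bin/env bash
# Sketch: build a query listing statements that have been running at least
# min_seconds, as candidates for blocking pt-osc. long_query_sql is a
# hypothetical helper name. Idle sessions holding open transactions can also
# hold locks; this query only shows active statements.

long_query_sql() {
  local min_seconds=$1
  cat <<SQL
SELECT id, user, time, state, LEFT(info, 80) AS query
FROM information_schema.processlist
WHERE command <> 'Sleep' AND time >= ${min_seconds}
ORDER BY time DESC;
SQL
}

long_query_sql 60
```

It could be run as `mysql -e "$(long_query_sql 60)"`, followed by `KILL <id>` for any statement confirmed to be the blocker.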
Progress Reporting Deception
- Reality: Progress bar lies consistently
- 99% Status: Can indicate 2-6 hours remaining
- Cause: Lock contention in final chunks
- Impact: Inaccurate completion estimates
Configuration That Works in Production
Basic Command Structure
pt-online-schema-change \
--alter "ADD COLUMN email VARCHAR(255)" \
--execute \
--max-lag=5s \
--max-load=Threads_running=50 \
--critical-load=Threads_running=100 \
--chunk-size=1000 \
--sleep=0.1 \
D=yourdb,t=yourtable
Critical Parameters
- --max-lag=5s: Prevents replica lag disasters
- --max-load: Prevents CPU overload (Threads_running=50 recommended)
- --critical-load: Emergency brake (Threads_running=100)
- --chunk-size=1000: Balance between speed and resource usage
- --sleep=0.1: Pause between chunks (essential for production)
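The pause between chunks adds up on large tables. As a rough worked example, assuming a fixed `--chunk-size` and a known row count (both inputs here are illustrative, and `sleep_overhead_seconds` is a hypothetical helper), the time spent sleeping alone can be estimated like this:

```shell
#!/usr/bin/env bash
# Sketch: lower bound on time spent sleeping with a fixed --chunk-size and
# --sleep. sleep_overhead_seconds is a hypothetical helper; the row count is
# an assumed input. Actual copy time comes on top of this figure.

sleep_overhead_seconds() {
  local rows=$1 chunk_size=$2 sleep_ms=$3
  local chunks=$(( (rows + chunk_size - 1) / chunk_size ))  # ceiling division
  echo $(( chunks * sleep_ms / 1000 ))
}

# 100 million rows, --chunk-size=1000, --sleep=0.1 (100 ms per chunk)
sleep_overhead_seconds 100000000 1000 100
```

This prints 10000, so roughly 2.8 hours of the run would be pure sleep before any copy work is counted. Raising the chunk size or lowering the sleep trades that time against load on the server.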
AWS RDS Specific
- Required: `--recursion-method=none`
- Reason: RDS networking breaks replica discovery
- Impact: Tool hangs indefinitely without this flag
Foreign Key Handling (Major Pain Point)
Available Methods (All Problematic)
Method | Behavior | Risk Level | Use Case |
---|---|---|---|
`auto` | Tool decides | High | Never use |
`drop_swap` | Disables FK checks, drops the original table, then renames the new one | Medium | Most reliable |
`rebuild_constraints` | Rebuilds FKs | High | Often fails |

Recommendation: Use `--alter-foreign-keys-method=drop_swap` for production, test thoroughly
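Before choosing a method it helps to know whether any child tables reference the table being altered. The sketch below builds a lookup against `information_schema.key_column_usage`; `fk_children_sql` is a hypothetical helper name, and `yourdb` / `yourtable` are the same placeholders used in the command earlier.

```shell
#!/usr/bin/env bash
# Sketch: build a query listing foreign keys in other tables that point at
# the table to be altered. fk_children_sql is a hypothetical helper name.

fk_children_sql() {
  local db=$1 tbl=$2
  cat <<SQL
SELECT DISTINCT table_schema, table_name, constraint_name
FROM information_schema.key_column_usage
WHERE referenced_table_schema = '${db}'
  AND referenced_table_name = '${tbl}';
SQL
}

fk_children_sql yourdb yourtable
```

Running it as `mysql -e "$(fk_children_sql yourdb yourtable)"` and getting rows back means child tables exist and a foreign key method has to be chosen; an empty result suggests the question does not arise for this table.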
Comparison with Alternatives
pt-osc vs gh-ost vs Standard ALTER
Aspect | pt-osc | gh-ost | Standard ALTER |
---|---|---|---|
Table Locking | None | None | Complete lock |
Trigger Dependency | Yes (3 triggers) | No (binlog) | No |
Foreign Key Support | Problematic | None | Full |
Production Impact | Medium | Low | High |
Failure Recovery | Partial | Good | None |
Learning Curve | Medium | Steep | None |
Decision Criteria:
- High write traffic: gh-ost preferred
- Foreign keys required: pt-osc only option
- Simple schemas: pt-osc sufficient
- Development/testing: Standard ALTER acceptable
Production Deployment Strategy
Pre-Execution Checklist
- Verify primary key exists: `SHOW INDEX FROM yourtable WHERE Key_name = 'PRIMARY'`
- Check disk space: `df -h` (need 3x table size)
- Test with `--dry-run` (limited reliability)
- Schedule during low-traffic window
- Alert team about potential replication lag
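The dry-run step in the checklist can be made a gate rather than a manual step. The wrapper below is a sketch under assumptions: `run_schema_change` is a hypothetical name, and the `PTOSC` variable exists only so a different binary or a stub can be substituted. Extra flags such as `--recursion-method=none` are passed through unchanged.

```shell
#!/usr/bin/env bash
# Sketch: run --dry-run first and only proceed to --execute if it succeeds.
# run_schema_change is a hypothetical wrapper name; PTOSC allows substituting
# another binary or a stub for testing.

PTOSC=${PTOSC:-pt-online-schema-change}

run_schema_change() {
  local dsn=$1 alter=$2
  shift 2
  # Any remaining arguments are passed through to both invocations.
  "$PTOSC" --alter "$alter" --dry-run "$@" "$dsn" || {
    echo "dry run failed; not executing" >&2
    return 1
  }
  "$PTOSC" --alter "$alter" --execute "$@" "$dsn"
}
```

A call might look like `run_schema_change D=yourdb,t=yourtable "ADD COLUMN email VARCHAR(255)" --max-load=Threads_running=50`. A passing dry run still says nothing about disk space or lock contention, so the other checklist items remain necessary.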
Monitoring During Execution
- Process List: `SHOW PROCESSLIST` for blocking queries
- Replication Lag: Monitor all replicas
- Disk Usage: `iostat` or `df` monitoring
- CPU Load: Watch `Threads_running` metric
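Watching `Threads_running` by hand gets tedious over a multi-hour run. A small sketch of how the value could be pulled out of the `mysql` client's batch output and compared with the thresholds from the command earlier; both function names are hypothetical, and the defaults of 50 and 100 simply mirror the `--max-load` and `--critical-load` values in this guide.

```shell
#!/usr/bin/env bash
# Sketch: parse Threads_running and classify it against the thresholds used
# in this guide. Both helper names are hypothetical.

# Read "name<TAB>value" lines on stdin and print the Threads_running value.
threads_running_from() {
  awk -F'\t' '$1 == "Threads_running" { print $2 }'
}

# Classify a value: above critical aborts the tool, above max_load pauses it.
load_state() {
  local value=$1 max_load=${2:-50} critical=${3:-100}
  if [ "$value" -gt "$critical" ]; then
    echo critical
  elif [ "$value" -gt "$max_load" ]; then
    echo throttled
  else
    echo ok
  fi
}

# Example with canned input in place of a live server.
value=$(printf 'Threads_running\t72\n' | threads_running_from)
load_state "$value"
```

Against a live server the input would come from something like `mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'Threads_running'"`, polled in a loop at whatever interval suits the run.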
Rollback Procedure
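-- Emergency rollback (requires downtime)
-- Only possible if the run used --no-drop-old-table; pt-osc names the old copy _users_old
-- Writes made after the swap are not in _users_old and will be lost
RENAME TABLE users TO users_broken, _users_old TO users;
Warning: No clean rollback method exists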
Common Production Failures
Replica Discovery Hangs
- Environment: AWS RDS, complex networking
- Solution:
--recursion-method=none
- Impact: Tool hangs indefinitely
Trigger Performance Degradation
- Cause: Heavy write traffic during operation
- Symptom: 30+ second replication lag spikes
- Mitigation: Run during minimal write periods
Out of Space at 95%
- Cause: Insufficient disk monitoring
- Impact: Emergency weekend troubleshooting
- Prevention: 3x space requirement, continuous monitoring
Testing and Validation
Post-Completion Verification
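-- Row count validation (the old copy only exists if --no-drop-old-table was used;
-- counts can differ if writes continued after the swap)
SELECT COUNT(*) FROM old_table_name;
SELECT COUNT(*) FROM new_table_name;
# Replica consistency check (run from the shell, not the mysql client)
pt-table-checksum --databases=yourdb --tables=yourtable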
Dry-Run Limitations
Dry-run does NOT test:
- Actual disk space requirements
- Production lock contention
- Foreign key constraint issues
- Real-world performance impact
Resource and Time Investment
Human Resources Required
- Database Administrator: Schema change planning and execution
- Application Team: Coordination for low-traffic windows
- Monitoring Team: Extended oversight during operation
- On-call Support: 24-hour availability for failures
Infrastructure Costs
- Disk Space: 3x table size temporarily
- CPU Overhead: Trigger execution during high-write periods
- Network: Increased replica traffic
- Time: 2-3x initial estimates for large tables
Critical Warnings
What Official Documentation Omits
- Disk space requirements understated: docs say 2x table size, plan for 3x
- Progress reporting completely unreliable
- Foreign key handling fundamentally broken
- Dry-run testing insufficient for production validation
Breaking Points
- Table Size: 500GB+ requires multi-day operations
- Write Traffic: Heavy writes cause exponential slowdown
- Foreign Keys: Complex FK relationships often fail
- Disk Space: 95% utilization triggers failures
Hidden Costs
- Extended Monitoring: Manual oversight for hours/days
- Failed Attempts: Debugging time for failed operations
- Emergency Response: Weekend/night troubleshooting
- Risk Management: Backup restoration procedures
Decision Framework
Use pt-online-schema-change When:
- Table size > 1GB
- Production uptime requirements > 99%
- Foreign keys present (no alternatives)
- Standard ALTER would cause unacceptable downtime
Consider Alternatives When:
- Tables < 1GB (standard ALTER acceptable)
- No foreign keys (gh-ost viable)
- Vitess already deployed
- Complex FK relationships (manual process may be safer)
Avoid When:
- Insufficient disk space (< 3x table size)
- No primary key (add PK first)
- Critical production window (insufficient testing)
- Multiple concurrent schema changes needed
Useful Links for Further Investigation
Actually Useful Resources (Not Marketing Crap)
Link | Description |
---|---|
Official pt-online-schema-change Docs | The manual. Dry as fuck but has all the options. You'll live in this when shit breaks. |
Percona Forums | Where you go when pt-osc breaks weird and Google fails you. Search first - someone else probably suffered through your exact disaster. |
Percona Bug Tracker | Check here if you hit a bug. Percona is actually pretty good about fixing critical issues, unlike some vendors. Search for PT- tickets for Percona Toolkit issues. |
pt-osc Stuck/Hanging Issues Thread | Thread about pt-osc hanging at 99%. Read the solutions - they'll save your ass. |
Lock Timeout Debugging Guide | Specific thread about lock wait timeouts with actual solutions that work. |
AWS RDS pt-osc Guide | August 2025 walkthrough of using pt-osc on RDS. Covers the networking gotchas and --recursion-method=none requirements. |
gh-ost vs pt-osc Reality Check | Honest comparison that doesn't bullshit you. Actually useful for picking between them. |
Percona's PT-OSC Best Practices | May 2024 guide covering production deployment strategies and optimization tips. |
MySQL 8.4 Compatibility Guide | January 2025 post explaining Percona Toolkit 3.7.0 support for MySQL 8.4 features and changes. |
GitHub gh-ost | GitHub's alternative to pt-osc. Better for high-traffic environments, worse for foreign keys. |
Vitess Online DDL | If you're already using Vitess. Don't use Vitess just for schema changes. |
Percona Toolkit Downloads | Download the package for your OS. Docker images are overkill unless you're already in container hell. |