Andrei Mackenzie

Engineering Manager at Xandr

13 Dec 2012

Altering Large InnoDB Tables

Large InnoDB tables in MySQL present a common challenge: how are columns best added or removed? For this article, “large” means millions, possibly several hundred million, rows. Ideally, table alterations are performed without sacrificing availability or data freshness; the latter is especially at risk when working with a master/slave replicated cluster. The naive approach runs a single ALTER statement, but produces some of the worst results because of the row locking and replication problems it causes.

The Naive Method

A small table can typically be altered with a single statement.

-- Add the new column (the INT type is an assumption; use whatever fits)
ALTER TABLE Foo ADD COLUMN baz INT;

-- Optionally, initialize the column to some value
UPDATE Foo SET baz = ... ;

The MySQL documentation describes how MySQL processes the ALTER statement. A temporary copy of the table is created, and the alteration is made on the copy. Upon completion, MySQL swaps the two tables and drops the original. The table remains readable throughout, but other operations issued by applications (INSERT or UPDATE, for example) wait for the process to complete. For a small table the ALTER is typically very fast; a duration of a few seconds will probably have a negligible effect on applications waiting to perform writes. The duration scales up with row count, however. Performing this sort of ALTER on a large table can easily take hours, even on modern hardware. Applications that rely on synchronous writes to the database will start to time out and become unusable.

A more subtle problem with this approach is how statement-based replication handles the ALTER. Suppose that no write operations are issued against the altered table for the duration of the ALTER, but the table belongs to a database containing several other tables that are actively being changed. A common pitfall is to assume that the naive way suffices for this scenario, because no applications are waiting to write to the altered table on the master. This line of thought overlooks the impact on replication. Once the statement succeeds on the master, it replicates to the slaves and starts executing there. Each slave processes its binary log queue serially, completing each operation before moving on to the next item in the queue. The long-running ALTER is therefore likely to create a logjam of subsequent queries, which accumulate for as long as the ALTER takes. Any applications that read from the slaves will be pulling stale data until the accumulated binary log is fully processed.
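One way to observe this effect is to watch replication lag on a slave while the ALTER replays there:

```sql
-- On a slave, while the ALTER is executing:
SHOW SLAVE STATUS\G
-- The Seconds_Behind_Master field reports how far the slave's SQL thread
-- has fallen behind; it keeps climbing for as long as the ALTER runs.
```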

The Copy & Swap Method

A better approach for some large tables is to unravel some of the work MySQL does as part of the ALTER and perform it without an ALTER. Specifically, a temporary copy of the table is created with the new schema and populated, while writes to the original table are still allowed. Typically this approach works for tables with a temporal dimension. For example, a large table spanning several years of data might be used to drive some report. Data is written to the table every day, but never for previous days (or some other reasonable threshold). If it can be guaranteed that the vast majority of a large table’s rows will not change during the alteration process, those rows can be migrated to the new schema while writes to the remaining rows continue. A much faster query can then migrate the remaining rows, and the new table is finally swapped in via a RENAME.

-- Create an empty copy of Foo and apply the new schema to it
-- (again, the INT type is an assumption)
CREATE TABLE Foo_new LIKE Foo;
ALTER TABLE Foo_new ADD COLUMN baz INT;

-- At this point Foo_new is empty

Now, populate the rows that are guaranteed not to change via some sort of script. It’s important to break up the SQL statements into chunks that execute relatively quickly. This allows replication to stay more or less up to date as the statements replicate.

INSERT INTO Foo_new (fooDate, foo, bar, baz)
-- Use whatever logic is needed to populate 'baz', or insert NULL. In this example, bar + 1
SELECT fooDate, foo, bar, bar + 1
FROM Foo WHERE Foo.fooDate >= '$startDate' AND Foo.fooDate < '$endDate';

Control the interval by selecting new $startDate and $endDate values for each execution. The entire process can be scripted, as long as enough time is allowed for replication to catch up between each execution. Keep in mind that this process is only suitable for migrating the ‘majority’ rows that will not change. The remainder that is at risk of being updated needs to be handled separately as part of the last step.
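The chunking loop itself can live in any scripting language. Below is a minimal sketch in Python, assuming the article's example columns; it runs against an in-memory SQLite database purely so the logic is self-contained. Against MySQL you would swap in a MySQL driver and poll replication lag between chunks.

```python
# Sketch of a chunked backfill driver. Table and column names follow the
# article's example; SQLite stands in for MySQL so the chunking logic can
# be demonstrated end to end.
import sqlite3
from datetime import date

def month_ranges(start, end):
    """Yield (startDate, endDate) ISO-string pairs, one per month, covering [start, end)."""
    cur = start
    while cur < end:
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        yield cur.isoformat(), min(nxt, end).isoformat()
        cur = nxt

def migrate_in_chunks(conn, start, end):
    for start_date, end_date in month_ranges(start, end):
        conn.execute(
            """INSERT INTO Foo_new (fooDate, foo, bar, baz)
               SELECT fooDate, foo, bar, bar + 1
               FROM Foo WHERE fooDate >= ? AND fooDate < ?""",
            (start_date, end_date),
        )
        conn.commit()
        # In production: sleep / check slave lag here before the next chunk.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Foo (fooDate TEXT, foo INT, bar INT)")
conn.execute("CREATE TABLE Foo_new (fooDate TEXT, foo INT, bar INT, baz INT)")
conn.executemany("INSERT INTO Foo VALUES (?, ?, ?)",
                 [("2011-01-15", 1, 10), ("2011-02-20", 2, 20)])
migrate_in_chunks(conn, date(2011, 1, 1), date(2011, 3, 1))
print(conn.execute("SELECT COUNT(*), SUM(baz) FROM Foo_new").fetchone())  # (2, 32)
```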

To complete the migration, a short downtime should be planned in which to execute the final iteration of the script above. This ensures applications are not modifying the remaining ‘minority’ rows. Once writes are paused, those rows can be migrated to the new schema. If planned well, this final statement should run quickly. Then it’s just a matter of renaming the tables - another quick operation.
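Continuing the running example, the final statement issued during the downtime might look like the following (a sketch; $endDate here is the last threshold reached by the chunked script):

```sql
-- Migrate the remaining 'minority' rows while writes are paused
INSERT INTO Foo_new (fooDate, foo, bar, baz)
SELECT fooDate, foo, bar, bar + 1
FROM Foo WHERE Foo.fooDate >= '$endDate';
```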

RENAME TABLE Foo TO Foo_old, Foo_new TO Foo;

At this point writes can be re-enabled. Foo_old can be dropped whenever the new table is deemed to be working well. If anything goes wrong, rollback is as simple as reversing the RENAME.

The benefit of this method is that the altered table remains usable (writable) for the majority of the alteration. If the INSERT INTO/SELECT FROM data chunks are small enough, replication delays will not become a major problem.

If it is too arduous to create small enough chunks, it is possible to sidestep replication and run the migration queries on the master and on every slave concurrently. Issuing SET sql_log_bin=0; as the first statement in each SQL session prevents subsequent statements in that session from being written to the binary log, and therefore prevents them from being replicated. Care needs to be taken to ensure the data is consistent across all servers once the process is completed.
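Such a session might look like this (a sketch; the same session is run on the master and on each slave):

```sql
-- First statement of the session: disable binary logging for this session only
SET sql_log_bin = 0;

INSERT INTO Foo_new (fooDate, foo, bar, baz)
SELECT fooDate, foo, bar, bar + 1
FROM Foo WHERE Foo.fooDate >= '$startDate' AND Foo.fooDate < '$endDate';

-- Restore binary logging before issuing any statements that should replicate
SET sql_log_bin = 1;
```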