MySQL Replication Issues

Posted September 14 by Dan Cryer

In our ongoing, joyous adventures with scaling MySQL, we hit another stumbling block this morning. We arrived to find out that one of our two MySQL slaves had stopped replicating at 6AM on Saturday morning. This meant that our reporting systems were all running on two day old data, even though we’d been diligently collecting and storing data all weekend.
Our setup is reasonably standard:
  • One master – All writes go to this server.
  • Two slaves:
    • Slave One runs in our office, is a somewhat less powerful machine than the master, but usually keeps up. This slave’s job is to do a daily backup, as such, it can be up to six hours behind, as the backup takes that long.
    • Slave Two runs in the same rack as the master at our data centre, this box is of equal spec to the master and is usually 0-2 seconds behind. This slave’s job is to serve our reporting systems.
  • Row-based replication.

When we arrived, we were hit with Slave Two throwing errors such as: “Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event’s master log”. Slave One was, and has remained, fine. Remo, Ben, and I spent quite some time Googling for a solution, but that turned up absolutely nothing. There were a few suggestions, such as setting “slave_exec_mode” to “IDEMPOTENT”. This is merely silent error suppression for such errors, and seems to be suggested for environments using master to master replication and/or cluster-based replication.  We were very uncomfortable with using this, as it meant that our slaves could slowly slip completely out of sync on all tables, and we’d never know about it.

Luckily for us, the affected table is one that we never need to read from on the slaves. It’s a queue table, which we update on the master, and then need to immediately read those rows from, so we do that on the master too. Our solution became clear later in the afternoon. As MySQL allows you to use different storage engines on your slaves to the master, we went the dirty route. The queue tables on the slaves are now powered by the BlackHole storage engine.

Getting the slave back up and running was a whole task in itself, however, as thousands upon thousands of queries had failed in the time this problem had been occuring. The commonly accepted solution is to do the following:

STOP SLAVE;

SET GLOBAL SQL_SLAVE_SKIP_COUNTER=1;

START SLAVE;

SHOW SLAVE STATUS\G;

However, manually running this thousands of times was going to become tedius, so one of the team wrote a script to do it for us. It’ll automatically check that it’s only bypassing errors on a specific table (look out for “database.table” in the script), and bypass them all. It’s not perfect, but it seems to work pretty well. Here’s the code: http://pastie.org/private/mcxnw1oienrlfswciix4w

Leave a Reply