When scaling for speed slows you down…

Posted September 4 by Dan Cryer

At work, the past few days, I’ve been working on the scalability of one of our systems and, hopefully, we’re almost closing on the finishing line. For some context, the system I’m talking about is effectively a web crawler. It (all too slowly) works through a queue of URLs, downloading the pages and parsing them for various information, including backlinks. As it inserts data into a MySQL database, primarily made up of innodb tables.

This all started on Wednesday, we’d been looking at the queue and decided that it was far too large. We’d hit 12.5M pages waiting to be checked, and were processing about 200,000 a day, tops. The initial reaction, as this is a distributed system, was to scale it horizontally. We rented six more VPSs (to add to the existing 5) and planned to run them at capacity until the queue was empty. After setting all the boxes up, we left them over night, in hope of completing 750,000 or more pages. Boy, were we wrong. On Thursday morning, we came in to discover it was running at a peak of 5,000 pages an hour (144,000 a day). We’d slowed it down, and not only that, we’d slowed everything else down too.

This is when the optimisation work started. We looked at the queries we were running, and noticed that an obvious bottleneck appeared to be the queue itself. Each VPS ran 15 instances of our application at any given time, and each one did a query for 250 pages to check when it was starting up, updated them as in progress, and deleting each from the queue on completion. The first change we made was to change the indexing on the table, which merely served to slow us down further. The second, more interesting change, was to change the entire table from innodb to the memory/heap table engine. This meant changing both max_heap_table_size and tmp_table_size to over 2.5GB to accomodate the 1GB table at the same time as allowing temporary tables to be created. Getting this change replicated onto our two slaves was a delight.

The memory table “solution” worked well after turning all of our VPSs back on, for about 10 minutes. At which point, they were up to capacity and the queue table started locking, a lot. As did all of the other tables, inexplicably. In addition to inducing panic, this highlighted to us that another table was starting to cause problems, our anchor text table. This is responsible for storing only the words that are used to link from one site to another. This table was seeing anything up to 2 million inserts an hour, and the same number of selects. After several failed attempts to remedy this, we finally settled on selecting all of the anchor texts we’d just collected, any that came back from the database were then dropped from our insert, thus only inserting those that were not present. Those inserts were done with batches of 10 per query. This cut the number of queries down significantly. Whilst this has improved the locking situation, it has, unfortunately, led to hanging selects. These selects are using potentially gigantic ‘IN(one, two, three)’ conditions, and are taking anything from 0.01 to 120 seconds to complete. Completely unpredictably.

Next priority was fixing the table locking on our queue. We’ve done this by marking all deletes from the queue as LOW_PRIORITY. In doing that, we’re allowing MySQL to queue those deletes for when no other clients are accessing the table, preventing unnecessary locking.

At the time of this writing, 113 instances of our application are running without significant conflicts on either of the tables we’ve optimised. However, we’re still processing just 5,000 pages an hour. I’m sure I’ll be working on this problem again on Monday, and will update here with anything we do discover.

So to summarise this somewhat rambley and unhelpful post a little, don’t assume that because you’ve built your system to run across multiple servers, that it can infact run across multiple servers. Ours scaled up to 4, and then the scaling started working against us. If you found it interesting, have any MySQL questions, want to offer advice, or are doing similar work yourself… drop me a comment below. I’d love to know that people are reading this!

5 Responses to “When scaling for speed slows you down…”

  1. So to summarise: you broke it Dan.

  2. Caius says:

    Mysql is a pain to scale, especially when you’re using it as a queueing system. Its sometimes worth looking at other queueing systems like RabbitMQ to store the jobs, and your workers just pick the next job off the queue, and store the result in the db.

  3. Dan Cryer says:

    Funnily enough, that’s exactly what we’ve been looking at moving towards, though we’d been investigating beanstalkd (http://xph.us/software/beanstalkd/) instead of RabbitMQ.

    Will certainly check it out though, thanks!

  4. [...] posted last week about our work on scaling our crawler application and it’s corresponding MySQL database, but left it with very little conclusion, apart from [...]

  5. [...] you read my previous post about our scaling issues, you’ll know that I was planning on trying out a number of other ideas to see if we could [...]

Leave a Reply