Looking forward: Crawler version 2.0 and 3.0

Posted September 17 by Dan Cryer

Alongside my work on our current crawler, Wade has been rewriting the system to make better use of what we’ve learned so far. The main changes:

  • The current codebase is a mess: many cron jobs running terribly written classes across two separate crawler systems. The new system is neatly organised into models and controllers, and everything runs through one central job manager.
  • Our MySQL-based job queueing system is being replaced with beanstalkd.
  • We’re dropping all auto_increment primary keys in the database and replacing them with pre-calculated binary hashes: a BINARY(16) column holding the UNHEX() value of an MD5.
  • Doing the above lets us drop most of the SELECT queries we do. We no longer need to check whether we’ve already got a reference to a page/url/backlink; we can INSERT IGNORE new rows, and MySQL simply skips the ones we already have. All references to that data will also use the binary hash.
  • Our crawler will make all requests under the user agent MarketDefenderBot/2.0 (some of our requests currently use MarketDefenderBot/1.0, but not all), and will respect robots.txt properly.
  • We’re putting indexes on what we need and use, and dropping ones we don’t… Crazy, I know.
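
As a rough illustration of the hash-key and INSERT IGNORE points above, here is a minimal sketch. It is not our crawler code: it uses Python with an in-memory SQLite database as a stand-in for MySQL, so `INSERT OR IGNORE` plays the role of MySQL’s `INSERT IGNORE`, and the `urls` table, `url_key` and `add_urls` names are hypothetical. It also assumes URLs are hashed as UTF-8 bytes.

```python
import hashlib
import sqlite3


def url_key(url: str) -> bytes:
    """16-byte MD5 digest of the URL (assumed UTF-8), i.e. the bytes
    that UNHEX(MD5(url)) would store in a BINARY(16) column."""
    return hashlib.md5(url.encode("utf-8")).digest()


def add_urls(conn: sqlite3.Connection, urls) -> int:
    """Insert URLs with no prior SELECT; rows whose key already exists
    are skipped by the primary key. Returns how many rows were new."""
    before = conn.total_changes
    conn.executemany(
        # SQLite spelling; the MySQL equivalent is INSERT IGNORE INTO ...
        "INSERT OR IGNORE INTO urls (id, url) VALUES (?, ?)",
        [(url_key(u), u) for u in urls],
    )
    conn.commit()
    return conn.total_changes - before


# Hypothetical table: the hash is the primary key, no auto_increment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id BLOB PRIMARY KEY, url TEXT NOT NULL)")

first = add_urls(conn, ["http://example.com/", "http://example.com/about"])
second = add_urls(conn, ["http://example.com/", "http://example.org/"])
total = conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
```

Because the key is derived from the URL itself, any other table can compute the same 16 bytes and reference the row without looking it up first.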

However, due to the rate at which we’re growing, I’m already beginning to investigate where we can go next. We’re beginning to push some of MySQL’s limits, and we don’t want to simply migrate wholesale to another RDBMS. We’ve got a few options, obviously, including moving specific problems out of MySQL and into other solutions, as we’ve done with beanstalkd for queueing… but my preference is headed rapidly towards migrating to a Hadoop cluster, along with HDFS, MapReduce and HBase.

I know it may seem like simply jumping on the NoSQL bandwagon, but actually, I think that for something of the type and scale we’re working towards, it’s the only option. I realise that there are organisations with much larger datasets running on relational databases, but what a lot of people miss when they make these comparisons is the use case. We’re constantly querying across our entire dataset in a variety of different ways, many of which can’t be optimised with simple indexes.

I’ve also just discovered that Hadoop’s MapReduce can run any application for both the Map and Reduce tasks, meaning we can continue to use PHP until we absolutely have to switch to another language for performance, at which point we can do it on a task-by-task basis.
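
To give a feel for what “any application” means here, below is a minimal sketch of a map step and a reduce step written in the style Hadoop Streaming expects: each step reads lines on stdin and writes tab-separated key/value lines on stdout, and the reducer receives its input sorted by key. It is written in Python purely for illustration; a PHP script reading STDIN would have the same shape. The task itself (counting URLs per host) and the names `map_lines` and `reduce_lines` are made up for the example.

```python
import sys
from itertools import groupby
from urllib.parse import urlsplit


def map_lines(lines):
    """Map step: one URL per input line -> 'host<TAB>1' per valid URL."""
    for line in lines:
        url = line.strip()
        if not url:
            continue
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, skip it
        if host:
            yield f"{host}\t1"


def reduce_lines(lines):
    """Reduce step: input is 'host<TAB>count' sorted by host, so equal
    hosts are adjacent and can be summed with groupby."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for host, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{host}\t{sum(int(count) for _, count in group)}"


# Run as "script.py map" or "script.py reduce"; does nothing when imported
# or run without an argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same script can be tried locally without a cluster by piping a file of URLs through the map step, `sort`, and then the reduce step. On a cluster, Hadoop Streaming takes the two commands through its `-mapper` and `-reducer` options.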

One Response to “Looking forward: Crawler version 2.0 and 3.0”

  1. Ben Davies says:

    It’s good we’re finally getting into a place to make these structural changes to the DB, as it should speed things up loads, and make querying the dataset a lot easier.

    I’m unsure about Hadoop though: I know we tried to argue the case the other week, but our failure to convince the others leaves me thinking that we may be being hasty on this. Part of me thinks that it might be an automatic rejection on the basis of the ‘newness’ of the idea, along with the obvious language changes that are required to implement this properly.

    Given that, and the argument put forward by the guys that there are still places we can go, such as MySQL Cluster, I think we should at least consider that our failure to argue the case means that the benefits that seemed so obvious clearly aren’t.

    Dunno. I think you’re right that this is the inevitable future, I’m just unsure if it’s right now.
