Hi! I'm Dan Cryer, a freelance web developer living in Bollington, Cheshire. I am a full stack developer with a focus on PHP, MySQL and related technologies. My primary interest is in back-end development, server management, APIs and PHP optimisation.

I am available on a freelance or contract basis as a development manager / consultant, senior developer or trainer. Please get in touch if you'd like to work with me.

Dan Cryer

How can I help you?

Consultancy

Not every business that needs web-based systems has the skills in-house to design and plan such a system, or to find the right people to build it for them. I can help to fill that gap.

More Information »

Development

If you just need some extra hands, don't have specific skills in-house, or are simply too busy, I am available for short- to medium-term projects as an experienced senior developer.

More Information »

Server Management

In addition to PHP development, I've spent a lot of time in my career managing Linux servers. I am available for hire for one-off server set-up, or on a retainer for ongoing server management.

More Information »

Latest blog posts

Dynamic DNS with CloudFlare

I have a Raspberry Pi at home that I use as a simple web server for a few bits and pieces, and I wanted to make it web-accessible. I couldn't rely on my ISP to provide a static IP address (it is, at best, usually static.) So, I decided to put together a really simple PHP script to update CloudFlare every X minutes with my current IP. The script is as follows:
<?php

$request = array();
$request['a'] = 'rec_edit';
$request['tkn'] = 'YOUR CLOUDFLARE API TOKEN';
$request['z'] = 'example.com';
$request['email'] = 'YOUR CLOUDFLARE EMAIL';
$request['id'] = 'DNS RECORD ID FOR home.example.com';
$request['type'] = 'A';
$request['name'] = 'home'; // Change this if you want the record to be something else. 
$request['content'] = file_get_contents('http://www.dancryer.com/ip.php?t='.time());
$request['service_mode'] = '0';
$request['ttl'] = '120';

$response = @json_decode(file_get_contents('https://www.cloudflare.com/api_json.html?' . http_build_query($request)), true);

if(!$response || $response['result'] != 'success')
{
    print 'Failed to update DNS. :-(';
    var_dump($response);
}
I then run that script via Cron and voilà, home-brew dynamic DNS.
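For reference, running it every five minutes from a crontab looks something like the line below - the script path and PHP binary location here are assumptions, so adjust them to match your own setup:
*/5 * * * * /usr/bin/php /home/pi/cloudflare-update.php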

Accounting for freelancers and contractors

When I first got started as a contractor, I wanted to make sure that I was set up in the best possible way: that my taxes were covered, that I was legally protected, and that I was getting the most out of the money I was earning. This can be confusing and time-consuming, and it is very difficult to know whose advice to trust. Here are a few things I've learned along the way that I hope will be of help to others. This advice is obviously very UK-specific, so if you're elsewhere, this post may not be all that useful to you.

#1 - Get an accountant

This is the best piece of advice I could give anyone starting out on their own. A good accountant will help to save you significantly more than they will cost you each year, and more importantly, they can save you from getting yourself into hot water. Whoever you speak to might also contradict some of the other things I've learned below, but that is okay, as everyone and every company is different - what suits me may be wrong for you. Lucky for me, my girlfriend's father is a partner in a local accountancy firm, Heywood Shepherd Chartered Accountants. This made them an obvious choice for me, but if you live in Cheshire, they are worth talking to. Speak to an accountant before you make any decisions about how to structure your new venture, including whether or not to become a limited company, whether to register for VAT, and so on.

#2 - Limited companies are very tax efficient

A lot of freelancers and contractors operate as self-employed because it is simpler and cheaper: you don't have the overhead of running a limited company, or the cost that comes with accounting for one. However, it can be very worthwhile to take that overhead on. Becoming a limited company can offer you some big tax efficiencies, including:
  • Directors' Dividends - As a director of your limited company, you can pay yourself up to ~£37,500 from your profits (after Corporation Tax) without paying any personal tax on it.
  • Personal Allowance - If you set up a PAYE (Pay As You Earn) scheme in your new company, you can pay yourself a small salary to take advantage of your personal tax allowance. This money is considered an expense to your company, so you will not pay Corporation Tax on it.
  • Business Expenses - A lot of the expenses you will encounter as a contractor, such as equipment (laptops, etc.) or travel, can be put through the business as company expenses, meaning that, again, they're taken out before you pay Corporation Tax, reducing your overall tax burden.
  • VAT Flat-Rate Scheme - If you register for VAT, but don't think you'll be claiming a lot of VAT back (as most web developers don't,) you can set up under the "flat rate" scheme instead, which allows you to charge your clients the standard 20% VAT, but HMRC will only take some of that from you - 14% in the case of web developers - meaning you get an extra 6% revenue from every invoice.
#3 - Use a decent accounting package

It can be tempting to keep costs as low as possible when you're starting out, by using spreadsheets and a Word template for invoicing. This is fine, but you can end up wasting a lot of time keeping track of invoices, expenses and so on. There are several really good online accounting services, such as Freshbooks and FreeAgent. These services handle invoicing for you, reminders for clients, financial reporting, and will even integrate directly with your bank to keep track of your expenses. Using a package like this will keep you better informed about what's going on with your business, and will save you time and money preparing your accounts at the end of each year. Freshbooks is even free if you only have up to three clients active at any one time.

#4 - Banking

When you set up a limited company, you have to get the company its own bank account. This is because the company is its own legal entity, and thus the money it makes is not yours. I took it a step further than this, and set myself up as follows:
  • Current Account - This is the account money comes into, the account I pay myself from, and the account I use to pay off the company credit card.
  • Savings Account - Each month, I work out how much I can pay myself and how much I need to put aside for tax. That tax money goes into the savings account. I also put aside money for accountancy and other longer term costs.
  • Credit Card - I don't have an awful lot of company expenses, but those that I do have go on a credit card that gets paid off in full each month. This includes bills from Amazon Web Services, travel costs, and so on.
I can't advise which bank is best for all of this, but all of my accounts are with Barclays (and by extension, Barclaycard.)

For those of you thinking about going it alone, or just starting out, I hope that this is of some use to you. Feel free to ask if you have any questions, though I can't guarantee I will know the answer!

Back on my own server

I've been running dancryer.com on a shared hosting account and block8.co.uk (+ client sites) on AppFog for a while now. As of today, it is finally all back running on a server completely under my control! For the geeky amongst you, the server is Ubuntu, running Nginx and PHP-FPM (with FastCGI caching and APC.) It may just be me, but it feels an awful lot snappier so far.

Handling robots.txt files in PHP based bots

Modern web applications are rarely the stand-alone silos they once were. Web applications and sites alike pull data from countless other web sites as part of their normal operation -- this can be a direct result of a user action (e.g. posting a link to a piece of content,) or a web crawling process that runs in the background. Unfortunately, too many of these bots / spiders / crawlers fail to respect the basic protocols that robots are supposed to follow. Popular sites get hammered with hundreds of requests a second from the same services, and the vast majority of bots completely ignore robots.txt rules. I've recently been working on a system that needed to honour these rules, so I thought I'd put together a quick guide.

Rule #1 - Give your bot a name

This might sound silly, but you need to give your bot a name in order to give it a user agent string of its own. So, let's say it is called DanBot; you'd want to put together a user agent string like the following:
Mozilla/5.0 (compatible; DanBot/1.0; +http://www.dancryer.com/about-danbot)
This breaks down as follows: the Mozilla/5.0 (compatible; ...) prefix is a bit of a holdover from an earlier time, but something that most browsers, crawlers, etc. still include. Then comes DanBot/1.0, your bot's name and version number - allowing site admins to know who is visiting them, and allowing you to know which version of your bot was responsible, should someone complain. Finally, you have a URL - this should link to a page that explains what your bot is, what it does, why it is crawling someone's site, and how to stop it.
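As a minimal sketch of how you might actually send that user agent with your requests - assuming you're fetching pages with plain file_get_contents(), as the sample class later in this post does - you can set it via an HTTP stream context:
<?php

// Send our bot's user agent string with every request made via file_get_contents():
$context = stream_context_create(array(
    'http' => array(
        'user_agent' => 'Mozilla/5.0 (compatible; DanBot/1.0; +http://www.dancryer.com/about-danbot)',
    ),
));

$html = file_get_contents('http://www.example.com/some/page', false, $context);
If you're using cURL or an HTTP client library instead, they all offer an equivalent option for setting the user agent.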

Rule #2 - Rate limit your requests

The worst thing a bot can do is fail to rate limit itself. You need to ensure that your bot only sends a reasonable number of requests to a given site in a given period. Now, how you determine what is reasonable is up to you, but some options include:
  • Simply limiting yourself to X requests per second. Depending on your needs, that may even be less than one request per second.
  • Calculating a requests-per-second limit based on the response time of the site, the size of the response, etc. This is a little more complicated, but well worth doing if you're doing a significant amount of web crawling.
  • Respecting the crawl-delay parameter in the site's robots.txt file - this is covered in Rule #3.
Most importantly for this section, just remember that rate limiting should be done at a domain level. If you implement per-URL rate limiting on a site with a million pages, that's still up to a million requests per second for the site!
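As a rough sketch of the first option above - enforcing a fixed minimum gap between requests to any one host - something like the following would do the job for a single, long-running crawler process. The class and method names here are just placeholders, not part of any library:
class DomainRateLimiter
{
    protected $minInterval;            // Minimum number of seconds between requests to the same host.
    protected $lastRequest = array();  // Host name => timestamp of our last request to it.

    public function __construct($minInterval = 1.0)
    {
        $this->minInterval = $minInterval;
    }

    /**
     * Sleeps, if necessary, until it is OK to request the given (full) URL's host again.
     */
    public function waitFor($url)
    {
        $host = parse_url($url, PHP_URL_HOST);

        if(isset($this->lastRequest[$host])) {
            $elapsed = microtime(true) - $this->lastRequest[$host];

            if($elapsed < $this->minInterval) {
                usleep((int)(($this->minInterval - $elapsed) * 1000000));
            }
        }

        $this->lastRequest[$host] = microtime(true);
    }
}
Your crawler would then call waitFor($url) immediately before each request, so requests to the same domain are spaced out regardless of how many URLs from that domain are in your queue.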

Rule #3 - Respect robots.txt

Respecting robots.txt is vital for any well-meaning, well-behaved, web crawler. Robots.txt allows a site owner to define what they do not want any given bot to see, and how often they'd like a given bot to access their site. Some example rules for DanBot might be as follows:
User-Agent: DanBot
Crawl-Delay: 5
Disallow: *.zip
Disallow: /admin
What that would tell DanBot is that the site only wants it to visit once every five seconds, that it must not crawl any URL that ends in .zip, and that it must not crawl any URL that starts with /admin.

These robots.txt files are not too complicated to handle. Here's a sample class to help you get started:
class Robots
{
    const ROBOTS_TXT_LIFETIME = "-1 Day";
    protected $site;                    // Base URL of the site, e.g. http://www.example.com
    protected $robotsTxt;               // Parsed robots.txt rules, keyed by user agent.
    protected $robotsTxtLastUpdated;    // When we last fetched and parsed robots.txt.
    protected $lastCrawled;             // When we last crawled this site - your crawler would need to update this (e.g. via a setter) after each request.

    public function __construct($site)
    {
        $this->site = $site;
    }

    /**
     * Checks whether it is OK to crawl a given URI at this time.
     * @param   $uri                  string  URL to check, e.g. /my/page
     * @param   $ignoreCrawlDelays    bool    Ignore crawl delays - Used to check whether or not a page can be crawled at any time.
     * @return  bool
     */
    public function isOkToCrawl($uri, $ignoreCrawlDelays = false)
    {
        // Check that our robots.txt is sufficiently up to date:
        $botUserAgent               = $this->_getBotUserAgent();
        $lastRobotsUpdateThreshold  = new \DateTime(self::ROBOTS_TXT_LIFETIME);

        if(empty($this->robotsTxtLastUpdated) || $this->robotsTxtLastUpdated < $lastRobotsUpdateThreshold) {
            $this->robotsTxt            = $this->_parseRobotsTxt($this->_fetchRobotsTxt());
            $this->robotsTxtLastUpdated = new \DateTime();
        }

        // No rules? Free for all:
        if(!count($this->robotsTxt)) {
            return true;
        }

        foreach($this->robotsTxt as $userAgent => $botRules)
        {
            // No disallows and no crawl delay? Ignore this ruleset.
            if((empty($botRules['disallow']) || !count($botRules['disallow'])) && empty($botRules['crawl-delay'])) {
                continue;
            }

            // If the user agent matches ours, or is a catch-all, then process the rules:
            if($userAgent == '*' || preg_match('/' . preg_quote($userAgent, '/') . '/i', $botUserAgent)) {
                foreach($botRules['disallow'] as $rule) {
                    $disallow = $rule;
                    $disallow = preg_quote($disallow, '/');
                    $disallow = (substr($disallow, -1) != '*' && substr($disallow, -1) != '$') ? $disallow . '*' : $disallow;
                    $disallow = str_replace(array('\*', '\$'), array('*', '$'), $disallow);
                    $disallow = str_replace('*', '(.*)?', $disallow);

                    if(preg_match('/^' . $disallow . '/i', $uri)) {
                        return false;
                    }
                }

                // Process crawl delay rules (unless we are ignoring them):
                if(!$ignoreCrawlDelays && !empty($botRules['crawl-delay'])) {
                    $lastCrawlThreshold = new \DateTime('-' . $botRules['crawl-delay'] . ' SECOND');

                    if(!empty($this->lastCrawled) && $this->lastCrawled > $lastCrawlThreshold) {
                        return false;
                    }

                }
            }
        }

        return true;
    }

    /**
     * Gets our crawler user agent.
     * @return string
     */
    protected function _getBotUserAgent()
    {
        return 'Mozilla/5.0 (compatible; YourBotUserAgent/1.0; +http://www.yoursite.com/about-our-crawler)';
    }

    /**
     * Fetches the contents of the site's robots.txt.
     * @return string
     * @throws \RuntimeException
     */
    protected function _fetchRobotsTxt()
    {
        return file_get_contents($this->site . '/robots.txt');
    }

    /**
     * Parses the robots.txt file content into a rules array.
     * @param $rules string
     * @return array
     */
    protected function _parseRobotsTxt($rules)
    {
        $rules      = explode("\n", str_replace("\r", "", $rules));
        $outRules   = array();

        $lastUserAgent = '*';
        foreach($rules as $rule)
        {
            if(trim($rule) == '') {
                continue;
            }

            if(strpos($rule, ':') === false) {
                continue;
            }

            $key = strtolower(trim(substr($rule, 0, strpos($rule, ':'))));
            $val = trim(substr($rule, strpos($rule, ':') + 1));

            if($key == 'user-agent') {
                $lastUserAgent = $val;
            }

            if(!isset($outRules[$lastUserAgent])) {
                $outRules[$lastUserAgent] = array();
            }

            if($key == 'disallow') {
                $outRules[$lastUserAgent]['disallow'][] = $val;
            }

            if($key == 'crawl-delay') {
                $outRules[$lastUserAgent]['crawl-delay'] = (float)$val;
            }
        }

        // Empty 'Disallow' means, effectively, allow all - so clear other rules.
        foreach($outRules as $ua => &$userAgent)
        {
            if(!isset($userAgent['disallow'])){
                continue;
            }

            foreach($userAgent['disallow'] as $rule) {
                if($rule == '') {
                    $userAgent['disallow'] = array();
                    break;
                }
            }
        }

        return $outRules;
    }
}
Note: The above code snippet is for example purposes and should be used as a guide only. To break the above code down:
  • isOkToCrawl() is the public interface to this class: we first check when we last fetched robots.txt and update it if necessary, then we use the rules to check whether the given URL is disallowed, and finally we check whether we're allowed to crawl yet, based on the crawl-delay setting.
  • _getBotUserAgent() would provide the code to get your bot user agent from a config file or similar.
  • _fetchRobotsTxt() would be your application-specific code to fetch the contents of a URL; in this case we're just using a really simple file_get_contents() request.
  • _parseRobotsTxt() takes the contents of a robots.txt file and breaks it down into a set of rules for a set of user agents.
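As a quick sketch of how the class might be wired into a crawler (the site and URI here are just examples, and you'd still need to record the last crawl time yourself, for instance via a setter for $lastCrawled):
$robots = new Robots('http://www.example.com');

// Only fetch the page if robots.txt allows it and any crawl-delay has passed:
if($robots->isOkToCrawl('/some/page')) {
    // Fetch http://www.example.com/some/page with your HTTP client of choice,
    // then record the crawl time so the crawl-delay check has something to work with.
}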
I hope this article helps provide a foundation for implementing a respectful and well-behaved web crawler in your PHP applications. If you have any questions, please feel free to ask!

Chill: My little open source PHP CouchDb Library

Two years ago now, I created a CouchDb Client library for PHP called Chill and published it on Github. I used it in a couple of projects, and at some point upgraded it to operate as a Composer package, but otherwise largely just forgot about it and left it on Github. As it turns out, a number of people have started using it in real world projects, and a couple of people have even made forks and submitted pull requests to fix bug reports. All this despite the fact that I'd neglected to ever tag a release, so the Composer package could only be used in "unstable" form. As a result, today I've accepted the pending pull requests, properly licenced the library under the BSD 2 Clause licence and "officially" tagged version 1.0.0. If you want to use it, you can either get it from Github or require 'chill/chill' as a dependency in Composer. If you do use it, I'd love to hear about what you're using it for!