1. Announcing our latest site: Quality Watch

    It’s been a while since we posted on the team tumblr, and what better way to break the ice than with an announcement of a new client site launch?

    The latest addition to our portfolio is Quality Watch: a website to support a research programme run by The Nuffield Trust and The Health Foundation to examine the effects of economic austerity in the NHS.

    It’s built on Drupal 7, is mobile first, is functional right down to IE7, and leveraged our whitelabel features to kick-start the build process. There was a requirement to provide an interface between the website and the research team (whose tools of the trade are predominantly spreadsheets), so we suggested GitHub as the store for data about their charts and maps. We wrote custom code to pull changes from that repo and turn them into Drupal nodes and pretty rendered charts/maps using highcharts.js/polymaps.js respectively.

    This new site represents a fantastic result for Team Drupal owing to the tight delivery timescale and some challenging requirements. We think the end result speaks for itself and if you haven’t already, we’d encourage you to go have a look at it right now… preferably at least twice to see how it behaves between media queries!

    http://www.qualitywatch.org.uk

  2. Implementing most popular content listings in Drupal powered by Google Analytics data

    We recently implemented a feature to display ‘most popular’ content on the WRAP and Zero Waste Scotland websites in order to improve the visibility of popular content on their existing taxonomy pages and complement the existing ‘Latest content’ listings.

    A high level overview of the feature can be found on my recent Torchbox.com blog post. This post is the technical complement to it, where I can elaborate on some of the decisions and implementations that went into the feature.

    Choosing a data source:

    Drupal comes with its own internal statistics module which can be used to track how many users have viewed a given node. It’s not particularly sophisticated and relies on a session of some variety to be able to associate a view event with a node. With something like Pressflow (or Drupal 7), anonymous users no longer have a session cookie set, in order to ensure compatibility with HTTP caches such as Varnish.

    In the same manner, contrib modules such as radioactivity also rely on an active user session (anonymous or otherwise) and can add a significant level of database load (update queries on every node view event) on high traffic sites. Even if some of these updates are queued in memory through something like memcached and flushed to the database at less frequent intervals, there is still a significant performance cost on top of the incompatibility with Varnish.

    WRAP and ZWS had well established Google Analytics profiles, and Analytics has a great API. The downside is that Analytics doesn’t understand which paths are nodes, which are views-based, or what taxonomy term a given path relates to. On balance, this is perfectly acceptable, and it is covered a little later in the post.

    Given the richness of the data stored by Analytics, the ease of access and compatibility with Varnish (Analytics events/calls are client side), and the variety of filters and metrics, it became fairly clear that using Analytics to provide the data for scoring our content according to unique page views would be a good fit.

    Introducing logical separations of concern in the whole process

    Nobody likes a ‘superman’ function or class that tries to do everything. We’ve probably all seen them at one stage or another…

    function mymodule_update_node_score($node, $score) claims to do one discrete task, but look inside and you might find it updates one thing, clears the cache, and maybe even loads another, related node and updates something in that too.

    This kind of approach might well result in functional code, but in the long term it makes the end to end process brittle (how do errors get handled? does the return type (if any) change depending on the context of the current task?), harder to diagnose/follow and therefore more expensive to maintain.

    So, we split the problem into a few logical tasks:

    Structuring the Analytics query:

    We used the Google Analytics Query Explorer tool provided by Google to help test the various criteria we needed to include against the existing data. In this case, we were asking something along the lines of:

    "Please can I have the domain, page path and unique page views for this Analytics profile from the last three months… but don’t count anything that hasn’t got a unique page view in that period of time and ignore anything with these legacy URL paths."

    Talking to Google Analytics:

    The google_analytics_reports module was originally written to grab data from Analytics and display it in Drupal. We’re only interested in the included google_analytics_api module, as it does the heavy lifting of communicating with the Analytics API based on the query format we fine-tuned in the Analytics Query Explorer.
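
    As a sketch of what the hand-off looks like (the function name, result properties and cache options keys here are from memory and may differ between module versions, so treat them as assumptions; mymodule_process_results() is the illustrative helper sketched in the processing section below):

    // Ask the google_analytics_api module to run the query and cache the
    // response for a day.
    $feed = google_analytics_api_report_data($params, array(
      'bin' => 'cache_analytics',
      'expire' => 86400,
    ));
    if (empty($feed->error)) {
      // One row per hostname/pagePath combination.
      mymodule_process_results($feed->results);
    }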

    Storing the result set from Analytics:

    For better or worse the google_analytics_api module gave us a pretty large result object (>1MB of text when serialized in PHP). It was so large that it caused a failure when trying to store it in memcached, which has a maximum size per object of 1MB: the memcached daemon complained and the Drupal memcache module quietly failed.

    To get around this we submitted a patch for the google_analytics_reports module so that the cache bin used for the results could be specified, and configured the Drupal memcache module to keep that bin in the database instead. The use of a drush make file for this project makes modules, themes, libraries and patches submitted to, or hosted on, drupal.org very easy to incorporate, manage and version control.
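
    For reference, pulling a patched contrib module into the build via drush make is a one-liner per patch; the entry below is purely illustrative and the patch URL is a placeholder, not the actual drupal.org issue:

    projects[google_analytics_reports][subdir] = "contrib"
    projects[google_analytics_reports][patch][] = "https://drupal.org/files/example-cache-bin-option.patch"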

    Processing the result set

    Provided we had cached data to work with, we went through each result, checked whether the path corresponded to a node and, if so, took its unique page views value and added it to our list of things to update.

    Finally, once we were done with the result set we could quickly iterate over our list of things to update and turn these into SQL commands to update our table. A few thousand inserts/updates on a simple, indexed table structure doesn’t take long at all.
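
    As a rough sketch (the function, table and result keys below are illustrative rather than the production code, and it assumes a Drupal 7 style database layer), the processing step boiled down to something like:

    function mymodule_process_results(array $rows) {
      $scores = array();
      foreach ($rows as $row) {
        // Resolve any path alias back to its system path, e.g. node/123.
        $path = drupal_get_normal_path(ltrim($row['pagePath'], '/'));
        if (preg_match('#^node/(\d+)$#', $path, $matches)) {
          $nid = (int) $matches[1];
          $views = (int) $row['uniquePageviews'];
          $scores[$nid] = isset($scores[$nid]) ? $scores[$nid] + $views : $views;
        }
      }
      // A few thousand inserts/updates on a small, indexed table is quick.
      foreach ($scores as $nid => $score) {
        db_merge('mymodule_node_score')
          ->key(array('nid' => $nid))
          ->fields(array('score' => $score, 'updated' => REQUEST_TIME))
          ->execute();
      }
    }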

    Scoring nodes?

    At first we got caught up in the procedural aspect of interrogating Analytics about paths from the site. For example, a mindset of: “For each taxonomy term, work out what nodes are tagged, get their paths, match these against Analytics and then update their scores”.

    It didn’t take long to see the flaw in that process. Not only was it complex in the detail, it would be inefficient and likely teeter on the edge of the daily Analytics API quotas; either by querying per-node (yuck) or getting far too much (>10k rows) data per request.

    However, by reducing the query to a sensible date range and carefully filtering out clear non-node/404/obsolete URL paths, we were left with a decent query that provided unique page views over the last 3 months for all WRAP/ZWS domains. Using a 3 month date range ensured the overall result set wasn’t insanely large and also had the benefit of providing a natural expiry point for results. For example, a press release may generate a spike in page views over a one week period, but as page views tailed off it would eventually slip from the rankings. In contrast, consistently popular content would retain a high ranking that reflected the interest in it.

    Abstracting this further and removing the Drupal ‘internals’ from the problem, we realised that if we kept a score (i.e. unique page views) against each node we could let Drupal concern itself with selecting content by domain/language/taxonomy term/publish state. As all the sites we were concerned with were hosted on the same database and managed with the domain access module, this was generally something we could leave to Drupal views.

    Storing the node scores:

    One option was to add a new integer CCK field to each content type on the site. This would immediately integrate it with views (to power the listings) and would also benefit from field permissions.

    We opted to keep scores in a database table in the interests of performance. The daily cron task to update node scores from the result set from Google Analytics would involve several thousand sequential node_load() and node_save() operations and this would not help the site’s performance. A wider concern was that node_save() invokes the cache_clear_all() function before it returns. Several thousand sequential full cache clears on a large site? No thanks!

    Having accepted that we’d introduced a custom schema, we needed to integrate it with the bits of Drupal we were hoping to use. A few minutes’ extra work implementing hook_views_data() exposed our node_score table to views and allowed us to use that data to sort the nodes in our listings. It also meant that updating this table with several thousand results takes 2-3 seconds and has no discernible impact on site performance at all.
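
    For illustration, the hook_views_data() implementation looked roughly like the following (the table and field names are placeholders rather than the production schema):

    /**
     * Implements hook_views_data().
     */
    function mymodule_views_data() {
      $data['mymodule_node_score']['table']['group'] = t('Node score');
      // Let views join the score table onto the node table by nid.
      $data['mymodule_node_score']['table']['join']['node'] = array(
        'left_field' => 'nid',
        'field' => 'nid',
      );
      // Expose the score as a field and, crucially, as a sort criterion.
      $data['mymodule_node_score']['score'] = array(
        'title' => t('Popularity score'),
        'help' => t('Unique page views pulled from Google Analytics.'),
        'field' => array('handler' => 'views_handler_field_numeric'),
        'sort' => array('handler' => 'views_handler_sort'),
      );
      return $data;
    }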

    Front end tweaks

    By this stage we had our data from Analytics stored in and available to Drupal. We made a new view block (based on the latest content listing) and changed the criteria required for selecting and sorting nodes. For the most part, all the existing CSS styles were re-used and existing JavaScript to control the listing filters remained valid.

    The final custom work required was some additional JavaScript to provide a toggle between the two displays and manipulate the DOM slightly, allowing us to fit the original design and pretend we didn’t have two completely independent views blocks. This JavaScript was very much a progressive enhancement rather than graceful degradation: without JavaScript the pages and listings still function as they need to, but appear one above the other.

    All custom code (server side and client side) was managed within a single custom Drupal module, meaning that in the event the module was switched off the collection pages would revert back to their previous state.

    We also made use of the block caches to further lower the resource footprint of this functionality. The content is only refreshed nightly, so we didn’t see the need to regenerate the markup over and over each time a logged in user views the page or the Varnish cache expires. Every little helps in that respect.

  3. New machine, new hostname to choose…

    After months of minor annoyances with Linux Mint I recently converted back to a MBP/OS X for day to day work.

    Most machines at Torchbox have some kind of clever hostname… for Macs you’ll find machines such as aroni, kerel etc… my old Dell was monte.

    All the good names had been taken for prefixes of ‘Mac’… so I opted for something grey, round(ish) and powerful… and an amusing childhood story sprang to mind: The Glerp.

    I am now johan@glerp

  4. Varnish and Drupal 7.20/21 DoS vulnerability handling

    The recent security releases for Drupal 7.20 and 7.21 fix a security vulnerability in the form of a possible denial of service attack where large numbers of requests to generate on-demand image styles could result in very high CPU loads.

    The fix introduces a token that is appended to the path for an image derivative, for example, previous URLs like

    /sites/default/files/styles/thumbnail/public/field/image/example.png
    now have a token such as
    /sites/default/files/styles/thumbnail/public/field/image/example.png?itok=zD_VaCaD.

    The upshot of this is that each request can be validated server side and invalid/DoS requests can be safely discarded, reducing the potential load on the server.

    The downside is that any kind of HTTP accelerator or cache that relies on URL based identification is more or less defeated, as each request to the same image path will have a different, unique token.

    The 7.21 release makes some effort to limit the impact by allowing site administrators to set the

    $conf['image_allow_insecure_derivatives'] = TRUE;

    Drupal variable. This provides partial protection by safeguarding against the most serious form of the vulnerability, but permits requests to standard image derivatives without the token validation.

    Changing the Varnish hashing algorithm

    We use Varnish as an HTTP accelerator for most of our clients’ sites. We also wanted to ensure that anonymous requests to /sites/default/files/styles/thumbnail/public/field/image/example.png and /sites/default/files/styles/thumbnail/public/field/image/example.png?itok=zD_VaCaD are treated as the same cached resource instead of producing two separate cache items. You can do this with the following adjustment to your VCL file:

    sub vcl_hash {
      # Hash on the host header as usual.
      set req.hash += req.http.host;
      # Hash the URL with the itok query string argument stripped out, so
      # requests that differ only by token map to the same cache object.
      set req.hash += regsub(req.url, "\?(.*)itok=.+[^&]", "\1");
      return (hash);
    }

    In plain terms, this configuration uses a regular expression to effectively ignore the itok token when generating a hash value for the item when it’s cached.

    Debugging your regex

    There isn’t really a console for debugging your Varnish regsub regex. One option is to add a custom HTTP header in vcl_fetch which lets you examine what your regex is doing. For instance:

    sub vcl_fetch {
      set beresp.http.X-Regex = regsub("Your URL to check", "Your regex", "Any substitutions");
    } 

    It’s then easy enough to check the response headers with any decent HTTP debugger or curl and examine the value of that header.
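
    For example, if the regsub in vcl_fetch is checking req.url against the itok pattern above, something like this shows the result (the hostname below is a placeholder):

    $ curl -sI "http://example.com/sites/default/files/styles/thumbnail/public/field/image/example.png?itok=zD_VaCaD" | grep X-Regex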

  5. Varnish: safely checking/using your VCL file

    If you’re running Varnish 2.1 or earlier, be very careful when looking to test your VCL file configuration before reloading/restarting the service. Most documentation you’ll find through search results will advise the following:

    $ sudo varnishd -C -f /path/to/mysetup.vcl # Don't do this!

    This runs the Varnish daemon in compile mode, the idea being that if your VCL file is invalid, compilation will fail and the daemon exits. When successful, you get the compiled (C) output of your VCL file. Apparently, the project maintainers for Varnish feel this is an adequate method of checking syntax. I don’t, but it doesn’t appear to be changing anytime soon.

    Varnish 2.1 doesn’t have the -C option, but it will output the VCL file anyway (looking like a success) and, more importantly (and much less obviously!), you’ll end up spawning a new Varnish daemon. Depending on your default settings it will end up allocating a fixed chunk of memory for its use.

    Let’s assume that default is 1GB of memory and you are busy tweaking your VCL file. You run varnishd in what you think is compile (and exit) mode, but what’s really happening is you’re inadvertently spawning new Varnish daemons. Eventually you run out of memory and the server dies.

    Oh my!

    Safely testing/reloading the VCL file

    The safest way to test a VCL file is to use varnishadm:

    $ sudo varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret vcl.load <some-identifier-for-the-vcl-file> /path/to/mysetup.vcl

    This loads (and parses) the VCL file and adds it to a buffer (identified by the identifier you give it).

    $ sudo varnishadm -T 127.0.0.1:6082 -S /etc/varnish/secret vcl.use <your-unique-identifier>

    This command tells Varnish to use the VCL file from the buffer of your choice. The great thing here is it will hot-swap in the new VCL file without the need to restart the service, which will clear your existing cache.

    On Debian machines you can sometimes find this script which will check your VCL file:

    $ sudo /usr/share/varnish/reload-vcl

    This runs both the aforementioned vcl.load and vcl.use. If you only want to parse/check the syntax of your VCL file, add the -c parameter to only run vcl.load.

    Even if you’re running a recent version of Varnish you’re likely to be better off with this approach than simply firing up the daemon or restarting the service.

  6. Be aware when caching large data objects in Drupal

    TL;DR: Memcache fails silently when trying to store data greater than its defined slab size. You can also specify what cache bins are used and which are ignored and pushed through to the database.

    Most of the websites we put into production make use of the memcache module to take some of the strain off the database server. By its nature Drupal is very database centric and although it offers the ability to cache a variety of things there is always the underlying overhead of disk I/O or network latency when communicating with a database. Memcache allows Drupal to store its cached data in memory which is much faster.

    A recent project involved retrieving a relatively large amount of data from Google Analytics API using the google_analytics_reports module, caching it and using it to update content rankings. While the query itself succeeded and we received its payload, the data was mysteriously unavailable when trying to retrieve it from the cache. Cue a hunt to track it down!

    Look in your database tables, duh?

    Hmm… no data there. Not surprising really, because we were using memcache :)

    Telnet then?

    Yes, you could do this to peek at what memcache has (nice overview of how). What’s probably easier is to use drush eval to natively ask Drupal to run cache_get('some-identifier'), which means you don’t need to faff about with whatever memcache/database/cardboard box instance you happen to be using to store things in.
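
    For example (the cache ID and bin below are whatever your module passed to cache_set()):

    $ drush eval "var_dump(cache_get('some-identifier', 'cache_analytics'));"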

    Still no data!?

    cache_get() did not show any results either, which was indeed odd. A quick test with another query returning a single result DID show up. This started to ring a few bells…

    The size of the original query payload returned was big enough to exceed the 1MB slab size memcache uses. Evidently, it was failing to insert a cache item but that failure was not communicated or was not visible to the person/process invoking cache_set().

    Next steps:

    While it would be preferable to reduce the query payload size to something more reasonable, this would involve picking apart the code in the google_analytics_reports contrib module… and we didn’t want to suggest removing the large ‘data’ component of its response object if someone else had a genuine use for it. We could also consider narrowing the query to return less data… or maybe retrieve it in batches.

    We could also consider increasing the memcache slab size to cover the size of our data, but this didn’t feel right given the single, niche requirement and the real possibility of getting the memcache memory handling settings wrong and running out of memory further down the line.

    In the end…

    • I submitted a patch to the google_analytics_reports module which allows the cache options parameter to specify which cache bin to use.
    • The memcache settings in settings.php were changed to ensure that things using the cache_analytics bin were passed back to the database, which can store a larger volume of data per cache item (see the sketch below).
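
    That settings.php change looks roughly like this (a minimal sketch assuming the standard Drupal 7 memcache module configuration; the module path may differ on your install):

    $conf['cache_backends'][] = 'sites/all/modules/memcache/memcache.inc';
    $conf['cache_default_class'] = 'MemCacheDrupal';
    // Large Analytics responses exceed memcached's 1MB item limit, so keep
    // this bin in the database cache instead.
    $conf['cache_class_cache_analytics'] = 'DrupalDatabaseCache';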

    We’re also considering reporting the silently failing cache_insert to the memcache maintainers.

  7. Optimising queries produced by views

    One of our clients has a dynamic user portal page that is responsible for pulling in content based on a number of criteria, including:

    • What organic group a user belongs to
    • Whether content is tagged with a given taxonomy_access term
    • How popular content is at a given time
    • Drupal’s core node access rules.

    We used the devel module with query output to see whether the issue related to database query times or something further up the stack. Profiling the page showed approximately 300 queries being run per page load, the majority of which were generated by views and/or the module_grants module, which mediates between the various modules that determine whether content is accessible or not.

    The queries that are produced do not always play to the strengths of the underlying database structure/indexes and it’s sometimes the case that MySQL’s query optimiser doesn’t pick the quickest solution.

    Based on the query logger output we took the worst offending queries (about 4 or 5 of them), broke them down, and with a considerable team effort optimised the queries to give around a 60% improvement on previous query times.

    How did we do this?

    • We used the STRAIGHT_JOIN MySQL operator to tell the query optimiser we want it to respect the order of tables we define.
    • Replaced a number of expensive sub-queries with better performing table joins, making best use of existing table indexes and keys.
    • Wrote a patch for the notifications module to add an index to one of its tables, reducing some of its queries from 300ms to 0.2ms.
    • Re-ordered the tables in the query to match the order in which the fields were being used.
    • Used EXPLAIN/DESCRIBE to get MySQL to outline its query execution plan.

    module_grants produces a large number of sub-queries that, although unpleasant to look at, aren’t collectively the most damaging portion of the original query produced. The way module_grants appends these to a database query right before the query is executed also makes it impossible to restructure without patching the module itself, so we had to cut our losses there in this instance.

    Finally, we then implemented hook_views_pre_execute() in a custom module to rewrite the SQL produced by views, trading configuration for performance. A normal use case of these hooks would be to add dynamic parameters to an existing view or to slightly adjust the query. In this very specific case we effectively remove all control from the admin interface to be able to improve performance.
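
    A stripped-down sketch of that approach is below. The view name, display and SQL are illustrative, and it assumes a Views 2 style string query in $view->build_info['query']; newer versions of Views expose a query object, where hook_views_query_alter() would be the place to intervene.

    function mymodule_views_pre_execute(&$view) {
      if ($view->name == 'user_portal' && $view->current_display == 'page_1') {
        // Swap in the hand-tuned SQL: STRAIGHT_JOIN keeps our table order and
        // the expensive sub-queries are replaced with plain, indexed joins.
        $view->build_info['query'] = "
          SELECT STRAIGHT_JOIN n.nid
          FROM {node} n
          INNER JOIN {og_ancestry} oga ON oga.nid = n.nid
          WHERE n.status = 1 AND oga.group_nid = %d
          ORDER BY n.changed DESC";
        $view->build_info['query_args'] = array($view->args[0]);
      }
    }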

    The result? Page load times of around 12-14 seconds dropped to anywhere between 3 and 6 seconds. Whilst not ground breaking or lightning fast in their own right we’re pretty pleased we could get this level of improvement from query optimisation alone.

  8. Migrating data into Drupal with the migrate module

    Last week I gave a presentation on data migration with the migrate module at Oxford DrupalCamp. It’s a broad and complex topic so I’m glad it was well received - I had to field plenty of questions throughout the rest of the day.

    The slides from my session are now available on slideshare.

  9. You’re only as young as you feel your t-shirt!


  10. A peacock from the Cornbury park estate came to visit yesterday
