Drupal at Torchbox

What is Team Drupal at Torchbox up to? News about ongoing projects, d.o contributions, site launches and what we do when we're not coding.


A standard approach to improving the performance of Drupal sites is to increase the time pages are cached for. This could be just the database cache you get out of the box, but when you employ an external caching mechanism like Varnish you have an extra component that needs to be kept in line with your wider cache expiration requirements.

Our Chatham House project saw us raise the TTL of cached pages to 1 hour to avoid having to regenerate pages. However, content editors found it frustrating that they couldn’t get breaking content live promptly without lowering that TTL and affecting the wider performance of the website.

We suggested leaving the standard TTLs in place but implementing the expire and purge modules to provide a standardised way for key changes to tell Varnish when to drop a cached item. The rules integration of expire means the website can react to events on key pages (eg: the homepage node or major section index pages), such as a change in moderation state to ‘Published’, and then issue a PURGE request to Varnish for the appropriate URL.

So far so good, although for changes that appear on the homepage but aren’t controlled from the homepage node (eg: blocks), we found the easiest solution was to give administrators a very simple admin form for issuing PURGE requests to Varnish for specific URLs. This means that if the existing rule sets don’t quite cover an edge case nobody has thought of, the administrators at Chatham House can still keep their content moving with minimal effort.

I released a sandbox module called purge_expirepath that provides this. I would hope to see this incorporated into either the expire or purge modules at some stage in the future, but for now it’s available for anyone to use.
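For anyone wondering what such a purge boils down to, the sketch below is a minimal illustration (not the actual purge_expirepath code) of issuing a PURGE request from Drupal 7 using drupal_http_request(). The Varnish address and Host header are hypothetical, and your vcl_recv needs a rule that accepts PURGE requests.

  // Minimal sketch: ask Varnish to drop the cached copy of a given path.
  // The Varnish address and Host header below are hypothetical examples.
  function example_purge_path($path) {
    $varnish = 'http://127.0.0.1:80';
    $response = drupal_http_request($varnish . '/' . ltrim($path, '/'), array(
      'method' => 'PURGE',
      'headers' => array('Host' => 'www.example.com'),
      'timeout' => 5,
    ));
    // A 2xx response indicates Varnish accepted the purge; the exact code
    // depends on how your VCL handles PURGE requests.
    return empty($response->error);
  }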

It’s been a while since we posted on the team tumblr, and what better way to break the ice than with an announcement of a new client site launch?

The latest addition to our portfolio is Quality Watch: a website to support a research programme run by The Nuffield Trust and The Health Foundation to examine the effects of economic austerity in the NHS.

It’s built on Drupal 7, is mobile first, is functional right down to IE7, and leveraged our whitelabel features to kick-start the build process. There was a requirement to provide an interface between the website and the research team (whose tools of the trade are predominantly spreadsheets), so we suggested GitHub to store data about their charts and maps. We wrote custom code to pull changes from that repo and turn them into Drupal nodes and pretty rendered charts/maps using highcharts.js/polymaps.js respectively.
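As a rough illustration of that import step (the real code is more involved, and the raw file URL, content type and field name here are invented for the example), the pattern was along these lines:

  // Sketch only: fetch a JSON chart definition and store it as a node.
  // The URL, content type and field name are hypothetical.
  function example_import_chart($raw_url) {
    $response = drupal_http_request($raw_url);
    if ($response->code != 200) {
      return FALSE;
    }
    $chart = drupal_json_decode($response->data);

    $node = new stdClass();
    $node->type = 'chart';
    $node->language = LANGUAGE_NONE;
    $node->status = 1;
    $node->uid = 1;
    $node->title = $chart['title'];
    // Keep the raw definition on the node for the client side rendering layer.
    $node->field_chart_json[LANGUAGE_NONE][0]['value'] = $response->data;
    node_save($node);
    return $node->nid;
  }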

This new site represents a fantastic result for Team Drupal owing to the tight delivery timescale and some challenging requirements. We think the end result speaks for itself and if you haven’t already, we’d encourage you to go have a look at it right now… preferably at least twice to see how it behaves between media queries!

http://www.qualitywatch.org.uk

We recently implemented a feature to display ‘most popular’ content on the WRAP and Zero Waste Scotland websites in order to improve the visibility of popular content on their existing taxonomy pages and complement the existing ‘Latest content’ listings.

A high level overview of the feature can be found on my recent Torchbox.com blog post. This post is the technical complement to it, where I can elaborate on some of the decisions and implementation details behind the feature.

Choosing a data source:

Drupal comes with its own internal statistics module which can be used to track how many users have viewed a given node. It’s not particularly sophisticated and relies on a session of some variety to be able to associate a view event with a node. With something like Pressflow (or Drupal 7), anonymous users no longer have a session cookie set, in order to ensure compatibility with HTTP caches such as Varnish.

In the same manner, contrib modules such as radioactivity also rely on an active user session (anonymous or otherwise) and can add a significant level of database load (update queries on every node view event) on high traffic sites. Even if you queue some of these updates in memory through something like memcached and flush them to the database at less frequent intervals, there is still a significant performance cost on top of the incompatibility with Varnish.

WRAP and ZWS had well established Google Analytics profiles and Analytics has a great API. The downsides are that Analytics doesn’t understand which paths are nodes, which are views based and what taxonomy term a certain path relates to. On balance, this is perfectly acceptable, and this is covered a little later in the post.

Given the richness of the data stored by Analytics, the ease of access and compatibility with Varnish (Analytics events/calls are client side), and the variety of filters and metrics, it became fairly clear that using Analytics to provide the data for scoring our content by unique page views would be a good fit.

Introducing logical separation of concerns into the whole process

Nobody likes a ‘superman’ function or class that tries to do everything. We’ve probably all seen them at one stage or another…

function mymodule_update_node_score($node, $score) claims to do one discrete task, but look inside and you might find it updates one thing, clears the cache, maybe even loads another, related node and updates something in that too.

This kind of approach might well result in functional code, but in the long term it makes the end to end process brittle (how do errors get handled? does the return type (if any) change depending on the context of the current task?), harder to diagnose/follow and therefore more expensive to maintain.
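The remedy is boring but effective: one discrete task per function, with an orchestrating function that reads like a summary of the whole process. A contrived sketch, with names invented purely for illustration:

  // Each helper does one discrete task; the cron callback just wires them up.
  // These names are illustrative rather than the production code.
  function example_update_popular_content() {
    $report = example_fetch_analytics_report();
    $scores = example_extract_node_scores($report);
    example_store_node_scores($scores);
  }

  // Each of these then has a single, obvious responsibility (bodies elided;
  // concrete versions of the last two steps appear later in this post).
  function example_fetch_analytics_report() { /* query Analytics, return rows */ }
  function example_extract_node_scores(array $report) { /* rows -> nid => views */ }
  function example_store_node_scores(array $scores) { /* write to node_score */ }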

So, we split the problem into a few logical tasks:

Structuring the Analytics query:

We used the Google Analytics Query Explorer tool provided by Google to help test the various criteria we needed to include against the existing data. In this case, we were asking something along the lines of:

"Please can I have the domain, page path and unique page views for this Analytics profile from the last three months… but don’t count anything that hasn’t got a unique page view in that period of time and ignore anything with these legacy URL paths."
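In Core Reporting API terms that request looks roughly like the parameter set below; the legacy-path filter is a simplified stand-in rather than the exact production value.

  // Roughly the raw Core Reporting API parameters behind that plain-English
  // request; the legacy path filter is a simplified example.
  $query = array(
    'metrics' => 'ga:uniquePageviews',
    'dimensions' => 'ga:hostname,ga:pagePath',
    // AND of: at least one unique page view, and not a known legacy path.
    'filters' => 'ga:uniquePageviews>0;ga:pagePath!@/old-site/',
    'sort' => '-ga:uniquePageviews',
    'start-date' => '90daysAgo',
    'end-date' => 'today',
    'max-results' => 10000,
  );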

Talking to Google Analytics:

The google_analytics_reports module was originally written to grab data from Analytics and display it in Drupal. We’re only interested in the included google_analytics_api module, as it does the heavy lifting of communicating with the Analytics API based on the query format we fine-tuned in the Query Explorer.
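From memory, the 7.x google_analytics_api module wraps all of that in a single reporting helper. The call below is a sketch from recollection rather than a verbatim excerpt, so treat the function name, the parameter format and the shape of the returned object as assumptions to verify against the release you install.

  // Sketch from memory of querying Analytics via the google_analytics_api
  // module; verify the function name and parameter format against the
  // installed release before relying on this.
  $feed = google_analytics_api_report_data($query);
  if (empty($feed->error)) {
    // The returned object carries the rows Analytics sent back.
    $rows = $feed->results;
  }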

Storing the result set from Analytics:

For better or worse, the google_analytics_api module gave us a pretty large result object (>1MB of text when serialized in PHP), so large that it failed to store in memcached, which has a maximum size per object of 1MB. The memcached daemon complained and the Drupal memcache module quietly failed.

To get around this we configured the results from our query to be cached in the database rather than memcached, which meant submitting a patch for the google_analytics_reports module. The use of a drush make file for this project makes modules, themes, libraries and patches submitted to, or hosted on, drupal.org very easy to incorporate, manage and version control.
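The settings.php side of that is the standard Drupal 7 pattern of pinning a single cache bin to a different backend; the bin name below is a placeholder, since the real one depends on how the patched module stores its results.

  // settings.php: send cache bins to memcache by default, but keep the
  // (hypothetical) bin holding the large Analytics result set in the database.
  $conf['cache_default_class'] = 'MemCacheDrupal';
  $conf['cache_class_cache_google_analytics'] = 'DrupalDatabaseCache';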

Processing the result set:

Provided we had cached data to work with, we would go through each result, check whether the path corresponded to a node and, if so, take its unique page views value and add it to our list of things to update.

Finally, once we were done with the result set we could quickly iterate over our list of things to update and turn these into SQL commands to update our table. A few thousand inserts/updates on a simple, indexed table structure doesn’t take long at all.
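Stripped of error handling, those two steps look something like the sketch below. The row keys follow the dimensions and metric we requested (the exact names depend on the API wrapper), and node_score is the custom table described further down.

  // Map Analytics rows onto nid => unique page views. Row keys are examples;
  // they depend on how the API wrapper names the returned columns.
  $scores = array();
  foreach ($rows as $row) {
    // Drop any query string and resolve path aliases back to node/NID.
    $path = ltrim(strtok($row['pagePath'], '?'), '/');
    $internal = drupal_get_normal_path($path);
    if (preg_match('/^node\/(\d+)$/', $internal, $matches)) {
      $nid = $matches[1];
      $scores[$nid] = (isset($scores[$nid]) ? $scores[$nid] : 0) + $row['uniquePageviews'];
    }
  }

  // Then one indexed upsert per node into our custom node_score table.
  foreach ($scores as $nid => $score) {
    db_merge('node_score')
      ->key(array('nid' => $nid))
      ->fields(array('score' => $score))
      ->execute();
  }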

Scoring nodes?

At first we got caught up in the procedural aspect of interrogating Analytics about paths from the site. For example, a mindset of: “For each taxonomy term, work out what nodes are tagged, get their paths, match these against Analytics and then update their scores”.

It didn’t take long to see the flaw in that process. Not only was it complex in the detail, it would also be inefficient and likely teeter on the edge of the daily Analytics API quotas, either by querying per node (yuck) or by pulling far too much data (>10k rows) per request.

However, by reducing the query to a sensible date range and carefully filtering out clear non-node/404/obsolete URL paths, we were left with a decent query that provided unique page views over the last 3 months for all WRAP/ZWS domains. Using a 3 month date range ensured the overall result set wasn’t insanely large and also provided a natural expiry point for results. For example, a press release may generate a spike in page views over a one week period, but as page views tailed off it would eventually slip from the rankings. In contrast, consistently popular content would retain a high ranking that reflected the interest in it.

Abstracting this further and removing the Drupal ‘internals’ from the problem, we realised that if we kept a score (ie: unique page views) against each node we could let Drupal concern itself with selecting content by domain/language/taxonomy term/publish state. As all the sites we were concerned with were hosted on the same database and managed with the domain access module, this was generally something we could leave to Drupal views.

Storing the node scores:

One option was to add a new integer CCK field to each content type on the site. This would immediately integrate it with views (to power the listings) and would also benefit from field permissions.

We opted to keep scores in a database table in the interests of performance. The daily cron task to update node scores from the Google Analytics result set would involve several thousand sequential node_load() and node_save() operations, which would not help the site’s performance. A wider concern was that node_save() invokes the cache_clear_all() function before it returns. Several thousand sequential full cache clears on a large site? No thanks!

Having accepted the introduction of a custom schema, we needed to integrate it with the bits of Drupal we were hoping to use. A few minutes of extra work implementing hook_views_data() exposed our node_score table to views and allowed us to use that data to sort the nodes in our listings. It also meant that updating this table with several thousand results takes 2-3 seconds and has no discernible impact on site performance at all.
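For reference, exposing a table like that to Views 3 only takes a small hook_views_data() implementation along these lines (the module name and column details are illustrative):

  // Illustrative hook_views_data() exposing a custom node_score table
  // (nid, score) to Views and joining it to the node table on nid.
  function mymodule_views_data() {
    $data['node_score']['table']['group'] = t('Node score');
    $data['node_score']['table']['join']['node'] = array(
      'left_field' => 'nid',
      'field' => 'nid',
    );
    $data['node_score']['score'] = array(
      'title' => t('Score'),
      'help' => t('Unique page views pulled from Google Analytics.'),
      'field' => array('handler' => 'views_handler_field_numeric'),
      'sort' => array('handler' => 'views_handler_sort'),
    );
    return $data;
  }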

Front end tweaks

By this stage we had our data from Analytics stored in and available to Drupal. We made a new views block (based on the latest content listing) and changed the criteria used for selecting and sorting nodes. For the most part, all the existing CSS styles were re-used and the existing JavaScript controlling the listing filters remained valid.

The final custom work required was some additional JavaScript to provide a toggle between the two displays and manipulate the DOM slightly, allowing us to fit the original design and pretend we didn’t have two completely independent views blocks. This JavaScript was very much a progressive enhancement rather than graceful degradation: without JavaScript the pages and listings still function as they need to, but appear one above the other.

All custom code (server side and client side) was managed within a single custom Drupal module, meaning that if the module were switched off the collection pages would revert to their previous state.

We also made use of the block caches to further lower the resource footprint of this functionality. The content is only refreshed nightly, so we didn’t see the need to regenerate the markup over and over each time a logged in user views the page or the Varnish cache expires. Every little helps in that respect.
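For a block defined in code, Drupal 7’s per-block cache mode is set in hook_block_info(); for a Views block the equivalent lives in the view’s block caching settings, but the idea is the same. A hypothetical example of the former:

  // Hypothetical hook_block_info() marking a block as cacheable once for all
  // users and pages, since its content only changes with the nightly import.
  function mymodule_block_info() {
    $blocks['popular_content'] = array(
      'info' => t('Most popular content'),
      'cache' => DRUPAL_CACHE_GLOBAL,
    );
    return $blocks;
  }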

After months of minor annoyances with Linux Mint I recently converted back to a MBP/OS X for day to day work.

Most machines at Torchbox have some kind of clever hostname… for Macs you’ll find machines such as aroni, kerel etc… my old Dell was monte.

All the good names had been taken for prefixes of ‘Mac’… so I opted for something grey, round(ish) and powerful… and an amusing childhood story sprang to mind: The Glerp.

I am now johan@glerp

The recent security releases for Drupal 7.20 and 7.21 fix a security vulnerability in the form of a possible denial of service attack where large numbers of requests to generate on-demand image styles could result in very high CPU loads.

The fix introduces a token that is appended to the path for an image derivative. For example, a URL that was previously

/sites/default/files/styles/thumbnail/public/field/image/example.png

now carries a token, such as

/sites/default/files/styles/thumbnail/public/field/image/example.png?itok=zD_VaCaD

The upshot of this is that each request can be validated server side and invalid/DoS requests can be safely discarded, reducing the potential load on the server.
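If you build derivative URLs in custom code, core handles the token for you: from 7.20 onwards, image_style_url() returns the derivative URL with the itok query string already appended.

  // image_style_url() in Drupal 7.20+ appends the itok token automatically.
  $url = image_style_url('thumbnail', 'public://field/image/example.png');
  // e.g. .../sites/default/files/styles/thumbnail/public/field/image/example.png?itok=...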

The downside is that any kind of HTTP accelerator or cache that relies on URL-based identification is undermined, since the same image path can now be requested both with and without the token, creating separate cache entries for what is effectively the same resource.

The 7.21 release makes some effort to limit the impact by allowing site administrators to set the

$conf['image_allow_insecure_derivatives'] = TRUE;

Drupal variable. This provides partial protection by safeguarding against the most serious form of the vulnerability, but permits requests to standard image derivatives without the token validation.

Changing the Varnish hashing algorithm

We employ Varnish as an HTTP accelerator for most of our clients’ sites. We also wanted to ensure that anonymous requests to /sites/default/files/styles/thumbnail/public/field/image/example.png and /sites/default/files/styles/thumbnail/public/field/image/example.png?itok=zD_VaCaD are treated as the same cached resource instead of becoming two separate cache items. You can do this with the following adjustment to your VCL file:

sub vcl_hash {
  # Varnish 2.x syntax: build the cache hash from the Host header as normal.
  set req.hash += req.http.host;
  # Strip the itok parameter from the URL before hashing, so requests with
  # and without the token map onto the same cache object.
  set req.hash += regsub(req.url, "\?(.*)itok=.+[^&]", "\1");
  return (hash);
}

In plain terms, this configuration uses a regular expression to effectively ignore the itok token when generating a hash value for the item when it’s cached.

Debugging your regex

There isn’t really a console for debugging your Varnish regsub regex. One option is to add a custom HTTP header in vcl_fetch which lets you examine what your regex is doing. For instance:

sub vcl_fetch {
  # Expose the result of a test substitution as a response header so it can
  # be inspected from the client.
  set beresp.http.X-Regex = regsub("Your URL to check", "Your regex", "Any substitutions");
}

It’s then easy enough to check the response headers with any decent HTTP debugger or curl and examine the value of that header.