We recently implemented a feature to display ‘most popular’ content on the WRAP and Zero Waste Scotland websites, surfacing popular content on their existing taxonomy pages and complementing the existing ‘Latest content’ listings.
A high level overview of the feature can be found on my recent Torchbox.com blog post. This post is the technical complement to it, where I can elaborate on some of the decisions and implementations that went into the feature.
Choosing a data source:
Drupal comes with its own internal statistics module which can be used to track how many users have viewed a given node. It’s not particularly sophisticated and relies on a session of some variety to associate a view event with a node. With something like Pressflow (or Drupal 7), anonymous users no longer have a session cookie set, in order to ensure compatibility with HTTP caches such as Varnish.
Similarly, contrib modules such as radioactivity also rely on an active user session (anonymous or otherwise) and can add a significant amount of database load (update queries on every node view event) on high traffic sites. Even if some of these updates are queued in memory through something like memcached and flushed to the database at less frequent intervals, there is still a significant performance cost on top of the incompatibility with Varnish.
WRAP and ZWS had well established Google Analytics profiles, and Analytics has a great API. The downside is that Analytics doesn’t understand which paths are nodes, which are Views-based, or which taxonomy term a given path relates to. On balance this is perfectly acceptable, and it is covered a little later in the post.
Given the richness of the data stored by Analytics, the ease of access and compatibility with Varnish (Analytics events/calls are client side), and the variety of filters and metrics, it became fairly clear that using Analytics to provide the data for scoring our content according to unique page views would be a good fit.
Introducing logical separation of concerns in the whole process
Nobody likes a ‘superman’ function or class that tries to do everything. We’ve probably all seen them at one stage or another…
function mymodule_update_node_score($node, $score) claims to do one discrete task, but look inside and we might find that it updates one thing, clears the cache, and maybe even loads another, related node and updates something in that too.
This kind of approach might well result in functional code, but in the long term it makes the end to end process brittle (how do errors get handled? does the return type (if any) change depending on the context of the current task?), harder to diagnose/follow and therefore more expensive to maintain.
So, we split the problem into a few logical tasks:
Structuring the Analytics query:
We used the Google Analytics Query Explorer tool provided by Google to help test the various criteria we needed to include against the existing data. In this case, we were asking something along the lines of:
"Please can I have the domain, page path and unique page views for this Analytics profile from the last three months… but don’t count anything that hasn’t got a unique page view in that period of time and ignore anything with these legacy URL paths."
Talking to Google Analytics:
The google_analytics_reports module was originally written to grab data from Analytics and display them in Drupal. We’re only interested in the included google_analytics_api module as it does the heavy lifting of communicating with the Analytics API based on the query format we fine tuned in the analytics query explorer.
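As a rough sketch, a fetch through google_analytics_api looks something like this. Treat the exact parameter keys and function signature as assumptions; they vary between module versions, and the legacy-path filtering is omitted here for brevity.

```php
<?php
// Hypothetical sketch of querying the Analytics API via the
// google_analytics_api module; exact keys/signature may differ
// between module versions.
$request = array(
  'dimensions' => array('ga:hostname', 'ga:pagePath'),
  'metrics' => array('ga:uniquePageviews'),
  // Ignore anything without a unique page view in the period.
  'filters' => 'ga:uniquePageviews>0',
  'sort_metric' => array('-ga:uniquePageviews'),
  'start_date' => strtotime('-3 months'),
  'end_date' => time(),
  'max_results' => 10000,
);
$feed = google_analytics_api_report_data($request);
if (empty($feed->error)) {
  // $feed->results now holds one row per hostname/path combination.
}
```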
Storing the result set from Analytics:
For better or worse, the google_analytics_api module gave us a pretty large result object (over 1MB of text when serialized in PHP). That exceeds memcached’s default maximum size per object of 1MB, so the memcached daemon complained and the Drupal memcache module failed silently.
To get around this we decided to keep the results from our query in the database instead: we submitted a patch to the google_analytics_reports module, and configured the Drupal memcache module so that this particular cache bin stayed in the database. The use of a drush make file for this project makes modules, themes, libraries and patches submitted to, or hosted on, drupal.org very easy to incorporate, manage and version control.
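For reference, pulling in a patched contrib module via drush make takes just a couple of lines. The version and patch URL below are placeholders for illustration only:

```
; Placeholder version and patch URL, for illustration only.
projects[google_analytics_reports][version] = "1.x-dev"
projects[google_analytics_reports][patch][] = "https://www.drupal.org/files/example-cache-bin.patch"
```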
Processing the result set
Provided we had cached data to work with, we would go through each result, check whether its path corresponded to a node and, if so, take the corresponding unique page views value and add it to our list of things to update.
Finally, once we were done with the result set we could quickly iterate over our list of things to update and turn these into SQL commands to update our table. A few thousand inserts/updates on a simple, indexed table structure doesn’t take long at all.
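Put together, the processing step looks roughly like the sketch below. It assumes a $feed object shaped like a google_analytics_api result set, Drupal 7 style database functions and a custom node_score table, so treat the details as illustrative rather than the exact production code.

```php
<?php
// Hypothetical sketch: turn the cached Analytics rows into scores.
$scores = array();
foreach ($feed->results as $row) {
  // Resolve any path alias back to its internal system path.
  $path = ltrim($row['pagePath'], '/');
  $source = drupal_lookup_path('source', $path);
  $path = $source ? $source : $path;
  // Only node paths interest us; views pages, 404s etc. are skipped.
  if (preg_match('#^node/(\d+)$#', $path, $matches)) {
    $nid = (int) $matches[1];
    // The same node can appear under several aliases/domains.
    $current = isset($scores[$nid]) ? $scores[$nid] : 0;
    $scores[$nid] = $current + (int) $row['uniquePageviews'];
  }
}
// One upsert per node: no node_load()/node_save(), no cache clears.
foreach ($scores as $nid => $score) {
  db_merge('node_score')
    ->key(array('nid' => $nid))
    ->fields(array('score' => $score))
    ->execute();
}
```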
At first we got caught up in the procedural aspect of interrogating Analytics about paths from the site. For example, a mindset of: “For each taxonomy term, work out what nodes are tagged, get their paths, match these against Analytics and then update their scores”.
It didn’t take long to see the flaw in that process. Not only was it complex in the detail, it would be inefficient and likely teeter on the edge of the daily Analytics API quotas; either by querying per-node (yuck) or getting far too much (>10k rows) data per request.
However, by reducing the query to a sensible date range and carefully filtering out clear non-node/404/obsolete URL paths, we were left with a decent query that provided unique page views over the last 3 months for all WRAP/ZWS domains. Using a 3 month date range ensured the overall result set wasn’t insanely large, and also provided a natural expiry point for results. For example, a press release might generate a spike in page views over a one week period, but as those page views tailed off it would eventually slip from the rankings. In contrast, consistently popular content would retain a high ranking that reflected the ongoing interest in it.
Abstracting this further and removing the Drupal ‘internals’ from the problem, we realised that if we kept a score (i.e. unique page views) against each node we could let Drupal concern itself with selecting content by domain/language/taxonomy term/publish state. As all the sites we were concerned with were hosted on the same database and managed with the domain access module, this was generally something we could leave to Drupal views.
Storing the node scores:
One option was to add a new integer CCK field to each content type on the site. This would immediately integrate it with views (to power the listings) and would also benefit from field permissions.
We opted to keep scores in a database table in the interests of performance. The daily cron task to update node scores from the result set from Google Analytics would involve several thousand sequential node_load() and node_save() operations and this would not help the site’s performance. A wider concern was that node_save() invokes the cache_clear_all() function before it returns. Several thousand sequential full cache clears on a large site? No thanks!
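A minimal table for this purpose can be declared through hook_schema(). The structure below is a hypothetical sketch of what such a node_score table might look like, not the exact production schema:

```php
<?php
/**
 * Implements hook_schema().
 *
 * Hypothetical sketch: one row per node, with the score indexed so
 * sorting listings by popularity stays cheap.
 */
function mymodule_schema() {
  $schema['node_score'] = array(
    'description' => 'Unique page view scores per node, refreshed nightly.',
    'fields' => array(
      'nid' => array(
        'description' => 'The node this score belongs to.',
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
      ),
      'score' => array(
        'description' => 'Unique page views from Google Analytics.',
        'type' => 'int',
        'unsigned' => TRUE,
        'not null' => TRUE,
        'default' => 0,
      ),
    ),
    'primary key' => array('nid'),
    'indexes' => array('score' => array('score')),
  );
  return $schema;
}
```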
Having accepted the introduction of a custom schema, we needed to integrate it with the bits of Drupal we were hoping to use. A few minutes’ extra work implementing hook_views_data() exposed our node_score table to views and allowed us to use that data to sort the nodes in our listings. It also meant that updating this table with several thousand results takes 2-3 seconds and has no discernible impact on site performance at all.
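The Views integration itself is small. Here is a sketch of such a hook_views_data() implementation, assuming a node_score table with nid and score columns (both hypothetical names):

```php
<?php
/**
 * Implements hook_views_data().
 *
 * Hypothetical sketch: joins a node_score table (nid, score) to the
 * node table so the score is available as a field and sort in Views.
 */
function mymodule_views_data() {
  $data['node_score']['table']['group'] = t('Node score');
  // Implicitly join to the node table on nid.
  $data['node_score']['table']['join']['node'] = array(
    'left_field' => 'nid',
    'field' => 'nid',
  );
  $data['node_score']['score'] = array(
    'title' => t('Score'),
    'help' => t('Unique page views harvested from Google Analytics.'),
    'field' => array('handler' => 'views_handler_field_numeric'),
    'sort' => array('handler' => 'views_handler_sort'),
  );
  return $data;
}
```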
Front end tweaks
All custom code (server side and client side) was managed within a single custom Drupal module, meaning that if the module were switched off the collection pages would revert to their previous state.
We also made use of the block cache to further lower the resource footprint of this functionality. The content is only refreshed nightly, so we didn’t see the need to regenerate the markup over and over each time a logged in user views the page or the Varnish cache expires. Every little helps in that respect.
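In Drupal 7 terms, opting into that cache is a one-line setting in hook_block_info(); the module and block names here are hypothetical:

```php
<?php
/**
 * Implements hook_block_info().
 *
 * Hypothetical sketch: the listing only changes after the nightly
 * cron run, so one cached copy can be served to everyone.
 */
function mymodule_block_info() {
  $blocks['most_popular'] = array(
    'info' => t('Most popular content'),
    // Identical markup for every user and every page it appears on.
    'cache' => DRUPAL_CACHE_GLOBAL,
  );
  return $blocks;
}
```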