Solr Rocks!

After struggling with large FULLTEXT indexes in MySQL, Solr came to the rescue: 16 million records ingested in 20 minutes – wow!

One small gotcha was the security classes, which have obviously moved since the documentation was written (see the fix at the end of the post).

For web apps I live off MySQL, albeit nowadays often wrapped with my own NoSQLite libraries to do Mongo-style databases over the LAMP stack. I’d also recently had a successful experience using MySQL FULLTEXT indices with a smaller database (tens of thousands of records) for the HCI Book search.  So when I wanted to index 16 million book titles with their author names from OpenLibrary I thought I might as well have a go.

For some MySQL table types, the normal recommendation used to be to insert records without an index and add the index later.  However, in the past I have had a very bad experience with this approach, as there doesn’t appear to be a way to tell MySQL to go easy on the process – I recall the disk being absolutely thrashed and Fiona having to restart the web server 🙁

Happily, Ernie Souhrada reports that for MyISAM tables incremental inserts with an index are no worse than bulk insert followed by adding the index.  So I went ahead and set off a script adding batches of 10,000 records at a time, with small gaps ‘just in case’.  The just in case was definitely the case: 16 hours later I’d barely managed a million records and MySQL was getting slower and slower.

I cut my losses, tried an upload without the FULLTEXT index, and 20 minutes later that was fine … but no way did I dare run that ‘CREATE FULLTEXT’!

In my heart I knew that Lucene/Solr was the right way to go.  These are designed for search-engine performance, but I dreaded the pain of installing and coming up to speed with yet another system that might not end up any better in the end.

However, I bit the bullet, and my dread was utterly unfounded.  Fiona got the right version of Java running, and within half an hour of downloading Solr I had it up and running with one of the examples.  I then tried experimental ingests with small chunks of the data: 1,000 records, 10,000 records, 100,000 records, a million records … Solr lapped it up, utterly painless.  The only fix I needed was because my tab-separated records had quote characters that needed mangling.

So,  a quick split into million record chunks (I couldn’t bring myself to do a single multi-gigabyte POST …but maybe that would have been OK!), set the ingest going and 20 minutes later – hey presto 16 million full text indexed records 🙂  I then realised I’d forgotten to give fieldnames, so the ingest had taken the first record values as a header line.  No problems, just clear the database and re-ingest … at 20 minutes for the whole thing, who cares!
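For the record, the chunk-and-ingest step can be sketched roughly as below (the filenames, core name and field names are illustrative assumptions, not my actual setup).  It uses Solr’s CSV update handler, which takes a separator parameter for tab-separated data and an explicit fieldnames list when, as here, the file has no header line:

```shell
# Stand-in for the 16-million-record tab-separated export
# (the real file came from the OpenLibrary dump):
printf '1\tDune\tHerbert\n2\tEmma\tAusten\n3\tIt\tKing\n' > titles.tsv

# Split into fixed-size chunks: a million lines for the real file, 2 here.
split -l 2 titles.tsv chunk_

# Each chunk is then POSTed to Solr's CSV update handler, e.g.:
#   curl 'http://localhost:8983/solr/openlibrary/update?commit=true&separator=%09&header=false&fieldnames=id,title,author' \
#        -H 'Content-type: text/csv' --data-binary @chunk_aa
ls chunk_*
```

Supplying fieldnames together with header=false is what avoids the mistake I made, where the first record got swallowed as a header line.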

As noted, there was one slight gotcha.  The Securing Solr section of the Solr Reference Guide explains how to set up the security.json file.  This kept failing until I realised Solr could not find the classes solr.BasicAuthPlugin and solr.RuleBasedAuthorizationPlugin (solr.log is your friend!).  After a bit of listing of the contents of jars, I tracked down where these classes now live.  I also found that the JSON parser struggled a little with indents, I think maybe tab characters, but after explicitly selecting and re-typing spaces, yay! – I have a fully secured Solr instance with 16 million book titles – wow 🙂

This is my final security.json file (actual credentials obscured, of course!):

    {
      "authentication": {
        "class": "solr.BasicAuthPlugin",
        "blockUnknown": true,
        "credentials": {
          "tom":"blabbityblabbityblabbityblabbityblabbityblo= blabbityblabbityblabbityblabbityblabbityblo=",
          "dick":"blabbityblabbityblabbityblabbityblabbityblo= blabbityblabbityblabbityblabbityblabbityblo=",
          "harry":"blabbityblabbityblabbityblabbityblabbityblo= blabbityblabbityblabbityblabbityblabbityblo="}
      },
      "authorization": {
        "class": "solr.RuleBasedAuthorizationPlugin"
      }
    }


big brother is watching … but doing it so, so badly

I followed a link to an article on Forbes’ web site1.  After a few moments the computer fan started to spin like a merry-go-round, and the page, and the browser in general, became virtually unresponsive.

I copied the url, closed the browser tab (Firefox) and pasted the link into Chrome, as Chrome is often billed for its stability and resilience to badly behaving web pages.  After a few moments the same thing happened: roaring fan and, when I peeked at the Activity Monitor, Chrome eating more than a core’s worth of the machine’s CPU.

I dug a little deeper and peeked at the web inspector.  Network activity was haywire: hundreds and hundreds of downloads, most small, some just a few hundred bytes, others a few Kb, but loads of them.  I watched, mesmerised.  Eventually it began to level off after about 10 minutes, when the total number of downloads was nearing 1700 and the total download 8Mb.


It is clear that the majority of these are ‘beacons’, ‘web bugs’, ‘trackers’: tiny single-pixel images used by various advertising, trend-analysis and web-analytics companies.  The early beacons were simple gifs, which would download once and simply tell the company what page you were on, information then used to tune future advertising, etc.

However, rather than simply images that download once, clearly many of the current beacons are small scripts that then go on to download larger scripts.  The scripts they download then periodically poll back to the server.  Not only can they tell their originating server that you visited the page, but also how long you stayed there.  The last url on the screenshot above is one of these report backs rather than the initial download; notice it telling the server what the url of the current page is.

Some years ago I recall seeing a graphic showing how many of these beacons common ‘quality’ sites contained – note this is Forbes.  I recall several had between one and two hundred on a single page.  I’m not sure of the actual count here, as each beacon seems to create very many hits, but certainly enough to create 1700 downloads in 10 minutes.  The chief culprits, in terms of volume, seemed to be two companies I’d not heard of before, SimpleReach2 and Realtime3, but I also saw Google, DoubleClick and others.

While I was not surprised that these existed, the sheer volume of activity did shock me, consuming more bandwidth than the original web page – no wonder your data allowance disappears so fast on a mobile!

In addition, the size of the JavaScript downloads suggests that they are doing more than merely reporting “page active”; I’m guessing tracking scroll location, mouse movement, hover time … enough to eat a whole core of CPU.

I left the browser window and when I returned, around an hour later, the activity had slowed down, and only a couple of the sites were still actively polling.  The total bandwidth had climbed another 700Kb, so around 10Kb/minute – again think about mobile data allowance, this is a web page that is just sitting there.

When I peeked at the Activity Monitor, Chrome had three highly active processes, between them consuming two cores’ worth of CPU!  Again, all on a web page that is just sitting there.  Not only are these web beacons spying on your every move, they are badly written to boot, consuming vast amounts of CPU when nothing is happening.

I tried to scroll the page and then, surprise, surprise:

So, I will avoid links to Forbes in future, not to protect my privacy; I already know I am tracked and tracked again; who needed Snowden to tell you that?  I will avoid them because the beacons make the site unusable.

I’m guessing this is partly because the network here on Tiree is slow.  It does not take 10 minutes to download 8Mb, but the vast numbers of small requests interact badly with the network characteristics.  However, this is merely exposing what would otherwise be hidden: the vast ratio between useful web page and tracking software, and just how badly written the latter is.

Come on Forbes, if you are going to allow spies to pay to use your web site, at least ask them to employ some competent coders.

  1. The page I was after was this one, but I’d guess any news page would be the same.[back]

JavaScript gotcha: var scope

I have been using JavaScript for more than 15 years, with some projects running to several thousand lines.  But I have just discovered that all these years I have misunderstood the scope rules for variables.  I had assumed they were block scoped, but in fact every variable is effectively declared at the beginning of the enclosing function.

So if you write:

function f() {
    for( var i=0; i<10; i++ ){
        var i_squared = i * i;
        // more stuff ...
    }
}

This is treated as if you had written:

function f() {
    var i, i_squared;
    for( i=0; i<10; i++ ){
        i_squared = i * i;
        // more stuff ...
    }
}

The Mozilla Developer Network describes the basic principle in detail; however, it does not include any examples with inner blocks like this.

So, there is effectively a single variable that gets reused every time round the loop.  Given you do the iterations one after another this is perfectly fine … until you need a closure.

I had a simple for loop:

function f(items) {
    for( var ix in items ){
        var item = items[ix];
        var value = get_value(item);
        // use value ...
    }
}

This all worked well until I needed to get the value asynchronously (an AJAX call) and so turned get_value into an asynchronous function:

    get_value_async( item, callback )

which fetches the value and then calls callback(value) when it is ready.

The loop was then changed to

function f(items) {
    for( var ix in items ){
        var item = items[ix];
        get_value_async( item, function(value) {
            // use item and value ...
        } );
    }
}

I had assumed that ‘item’ in each callback closure would be bound to the value for the particular iteration of the loop, but in fact the effective code is:

function f(items) {
    var ix, item;
    for( ix in items ){
        item = items[ix];
        get_value_async( item, function(value) {
            // use item and value ...
        } );
    }
}

So all the callbacks point to the same ‘item’, which ends up as the one from the last iteration.  In this case the code is updating an onscreen menu, so only the last item got updated!

JavaScript 1.7 and ECMAScript 6 have a new ‘let’ keyword, which has precisely the semantics I had always thought ‘var’ had, but it does not yet seem to be widely available in browsers.
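For browsers that do support it, a sketch of the difference (the helper function here is hypothetical, purely to illustrate): with ‘let’, each iteration gets a fresh binding, so closures created in the loop keep their own value.

```javascript
// Build an array of closures inside a loop.  With 'let', each
// iteration has its own 'item', so each closure remembers the
// value from its own pass round the loop.
function collectClosures(items) {
    var callbacks = [];
    for (let ix = 0; ix < items.length; ix++) {
        let item = items[ix];
        callbacks.push(function () { return item; });
    }
    return callbacks;
}
```

With ‘var’ in place of ‘let’, every closure would instead return the item from the final iteration.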

As a workaround I have used the slightly hacky looking:

function f(items) {
    for( var ix in items ){
        (function() {
            var item = items[ix];
            get_value_async( item, function(value) {
                // use item and value ...
            } );
        })();
    }
}

The anonymous function immediately inside the for loop is there simply to create a scope for the item variable; invoking it afresh each time round the loop means there is a fresh variable to be bound in the innermost callback.

It works, but you do need to be confident with anonymous functions!
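An equivalent, and arguably cleaner, form passes the loop value in as a parameter of the anonymous function rather than declaring it inside.  A sketch, again with a hypothetical helper:

```javascript
// Each invocation of the anonymous function gets its own 'item'
// parameter, so the inner closures are bound to different values.
function makeCallbacks(items) {
    var callbacks = [];
    for (var ix = 0; ix < items.length; ix++) {
        (function (item) {
            callbacks.push(function () { return item; });
        })(items[ix]);
    }
    return callbacks;
}
```

The parameter version makes it a little more obvious which value each closure captures.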

Offline HTML5, Chrome, and infinite regress

I am using HTML5’s offline mode as part of the Tiree Mobile Archive project.

This is, in principle, a lovely way of creating web sites that behave pretty much like native apps on mobile devices.  However, things, as you can guess, do not always go as smoothly as the press releases and blogs suggest!

Some time I must write at length on various useful lessons, but, for now, just one: the potential for an endless cycle of caches, rather like Jörmungandr, the Norse world serpent, that wraps around the world swallowing its own tail.

My problem started when I had a file (which I will call ‘shared.prob’ below, but was actually ‘place_data.js’) which I had updated on the web server, but which kept showing an old version in Chrome no matter how many times I hit refresh, even after I went to the history settings and asked Chrome to empty its cache.

I eventually got to the bottom of this and it turned out to be this Jörmungandr, cache-eats-cache, problem (browser bug!), but I should start at the beginning …

To make a web site work off-line in HTML5 you simply reference an application cache manifest file from the manifest attribute in the main file’s <html> tag.  The browser then pre-loads all of the files mentioned in the manifest to create the application cache (appCache for short). The site is then viewable off-line.  If this is combined with off-line storage using the built-in SQLite database, you can have highly functional applications which can sync to central services using AJAX when connected.

Of course, sometimes you have updated files in the site and would like browsers to pick up the new versions.  To do this you simply update the files, but then also change the manifest file in some way (often updating a version number or date in a comment).  The browser periodically checks the manifest file when it is next connected (or at least some browsers check by themselves; for others you need to add JavaScript code to do it), and when it notices the manifest has changed it invalidates the appCache and re-checks all the files mentioned in the manifest, downloading the new versions.
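A minimal manifest of the kind described might look like this (filenames illustrative); it is hooked in via the manifest attribute, e.g. <html manifest="site.appcache">:

```
CACHE MANIFEST
# version 37
# (bump the version above to invalidate the appCache)

CACHE:
index.html
style.css
place_data.js

NETWORK:
*
```

The NETWORK wildcard allows non-cached resources (such as AJAX endpoints) to be fetched normally when online.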

Great, your web site becomes an off-line app and gets automatically updated 🙂

Of course as you work on your site you are likely to end up with different versions of it.  Each version has its own main html file and manifest giving a different appCache for each.  This is fine, you can update the versions separately, and then invalidate just the one you updated – particularly useful if you want a frozen release version and a development version.

Of course there may be some files, for example icons and images, that are relatively static between versions, so you end up having both manifest files mentioning the same file.  This is fine so long as the file never changes, but, if you ever do update that shared file, things get very odd indeed!

I will describe Chrome’s behaviour as it seems particularly ‘aggressive’ at caching, maybe because Google are trying to make their own web apps more efficient.

First you update the shared file (let’s call it shared.prob), then invalidate the two manifest files by updating them.

Next time you visit the site for appCache_1 Chrome notices that manifest_1 has been invalidated, so decides to check whether the files in the manifest need updating. When it gets to shared.prob it is about to go to the web to check it, then notices it is in appCache_2 – so uses that (old version).

Now it has the old version in appCache_1, but thinks it is up-to-date.

Next you visit the site associated with appCache_2; it notices manifest_2 is invalidated, checks the files … and, you guessed it, when it gets to shared.prob, it takes the same old version from appCache_1 🙁 🙁

They seem to keep playing catch like that for ever!

The only way out is to navigate to the pseudo-url ‘chrome://appcache-internals/’, which lets you remove caches entirely … wonderful.

I don’t know if there is an equivalent to this on the Android browser, which certainly seems to have odd caching behaviour, but it does seem to ‘sort itself out’ after a time!  Other browsers seem to have problems like this temporarily, but a few forced refreshes sort them out!

For future versions I plan to use some Apache ‘Rewrite’ rules to make it look to the browser as if the shared files are in fact completely different files:

RewriteRule  ^version_3/shared/(.*)$   /shared_place/$1 [L]

To be fair, the cache cycle is more of a problem during development than deployment, but still … so confusing.

Useful sites:

These are some sites I found useful for the application cache, but none sorted everything … and none mentioned Chrome’s infinite cache cycle!

    The W3C specification – of course this tell you how appCache is supposed to work, not necessarily what it does on actual browsers!
    It is called “A Beginner’s Guide to using the Application Cache”, but is actually pretty complete.
    Really useful quick reference, but: “FACT: Any changes made to the manifest file will cause the browser to update the application cache.” – don’t you believe it!  For some browsers (Chrome, Android) you have to add your own checks in the code (see the “Updating the cache” section in “A Beginner’s Guide …”).
    Wonderful on-line manifest file validator checks both syntax and also whether all the referenced files download OK.  Of course it cannot tell whether you have included all the files you need to.

Alt-HCI open reviews – please join in

Papers are online for the Alt-HCI track of the British HCI conference in September.

These are papers that are trying in various ways to push the limits of HCI, and we would like as many people as possible to join in the discussion around them … this discussion will be part of the process for deciding which papers are presented at the conference, and possibly how long we give them!

Here are the papers  — please visit the site, comment, discuss, Tweet/Facebook about them.

paper #154 — How good is this conference? Evaluating conference reviewing and selectivity
        do conference reviews get it right? is it possible to measure this?

paper #165 — Hackinars: tinkering with academic practice
        doing vs talking – would you swop seminars for hack days?

paper #170 — Deriving Global Navigation from Taxonomic Lexical Relations
        website design – can you find perfect words and structure for everyone?

paper #181 — User Experience Study of Multiple Photo Streams Visualization
        lots of photos, devices, people – how to see them all?

paper #186 — You Only Live Twice or The Years We Wasted Caring about Shoulder-Surfing
        are people peeking at your passwords? what’s the real security problem?

paper #191 — Constructing the Cool Wall: A Tool to Explore Teen Meanings of Cool
        do you want to make things teens think cool?  find out how!

paper #201 — A computer for the mature: what might it look like, and can we get there from here?
        over 50s have 80% of wealth, do you design well for them?

paper #222 — Remediation of the wearable space at the intersection of wearable technologies and interactive architecture
        wearable technology meets interactive architecture

paper #223 — Designing Blended Spaces
        where real and digital worlds collide

open data: for all or the few?

On Twitter Jeni Tennison asked:

Question: aside from personally identifiable data, is there any data that *should not* be open?  @JenT 11:19 AM – 14 Jul 12

This sparked a Twitter discussion about limits to openness: exposure of undercover agents, information about critical services that could be exploited by terrorists, etc.   My own answer was:

maybe all data should be open when all have equal ability to use it & those who can (e.g. Google) make *all* processed data open too   @alanjohndix 11:34 AM – 14 Jul 12

That is, it is not clear that just because data is open to all, it can be used equally by everyone.  In particular it will tend to be the powerful (governments and global companies) who have the computational facilities and expertise to exploit openly available data.

In India, statistics about the use of their own open government data1 showed that the majority of access was by well-off males over the age of 50 (oops, that may include me!) – hardly a cross-section of society.  At a global scale, Google makes extensive use of open data (and in some cases, such as orphaned works or screen-scraped sites, seeks to make non-open works open), but, quite understandably for a profit-making company, Google regards the amalgamated resources as commercially sensitive – definitely not open.

Open data has great potential to empower communities and individuals and serve to strengthen democracy2.  However, we need to ensure that this potential is realised, to develop the tools and education that truly make this resource available to all3.  If not then open data, like unregulated open markets, will simply serve to strengthen the powerful and dis-empower the weak.

  1. I had a reference to this at one point, but can’t locate it – does anyone else have the source for this?[back]
  2. For example, see my post last year “Private schools and open data” about the way Rob Cowen @bobbiecowman used UK government data to refute the government’s own education claims.[back]
  3. In fact there are a variety of projects and activities that work in this area: hackathons, data analysis and visualisation websites such as IBM Many Eyes, data journalism such as the Guardian Datablog, and some government and international agencies that go beyond simply publishing data and offer tools to help users interpret it (I recall Enrico Bertini worked on this with one of the UN bodies some years ago). Indeed there will be some interesting data for mashing at the next Tiree Tech Wave in the autumn.[back]

not forgotten! 1997 scrollbars paper – best tech writing of the week at The Verge

Thanks to Marcin Wichary for letting me know that my 1997/1998 Interfaces article “Hands across the Screen” was just named in “Best Tech Writing of the Week” at The Verge.  Some years ago Marcin reprinted the article in his GUIdebook: Graphical User Interface gallery, and The Verge picked it up from there.

Hands across the screen is about why we have scroll bars on the right-hand side, even though it makes more sense to have them on the left, close to our visual attention for text.  The answer, I suggested, was that we mentally ‘imagine’ our hand crossing the screen, so a left-hand scroll-bar seems ‘wrong’, even though it is better (more on this later).

Any appreciation is obviously gratifying, but this is particularly so because it is a 15 year old article being picked up as ‘breaking’ technology news.

Interestingly, but perhaps not inconsequentially, the article both addressed an issue current in 1997 and looked back more than 15 years to the design of the Xerox Star and other early Xerox GUIs of the late 1970s and early 1980s, as well as work at York in the mid-1980s.

Of course this should always be the case in academic writing: if the horizon is (only) 3-5 years leave it to industry.   Academic research certainly can be relevant today (and the article in question was in 1997), but if it does not have the likelihood of being useful in 10–20 years, then it is not research.

At the turn of the Millennium I wrote in my regular HCI Education column for SIGCHI Bulletin:

Pick up a recent CHI conference proceedings and turn to a paper at random. Now look at its bibliography – how many references are there to papers or books written before 1990 (or even before 1995)? Where there are older references, look where they come from — you’ll probably find most are in other disciplines: experimental psychology, physiology, education. If our research papers find no value in the HCI literature more than 5 years ago, then what value has today’s HCI got in 5 years time? Without enduring principles we ought to be teaching vocational training courses not academic college degrees.
(“the past, the future, and the wisdom of fools“, SIGCHI Bulletin, April 2000)

At the time about 90% of CHI citations were either to work in the last 5 years, or to the authors’ own work, to me that indicated a discipline in trouble — I wonder if it is any better today?

When revising the HCI textbook I am always pleased at the things that do not need revising — indeed some parts have hardly needed revising since the first edition in 1992.  These parts seem particularly important in education – if something has remained valuable for 10, 15, 20 years, then it is likely to still be valuable to your students in a further 10, 15, 20 years.  Likewise the things that are out of date after 5 years, even when revised, are also likely to be useless to your students even before they have graduated.

In fact, I have always been pleased with Hands across the Screen, even though it was short and not published in a major conference or journal.  It had its roots in an experiment in my first ever academic job at York in the mid-1980s, when we struggled to understand why the ‘obvious’ position for scroll arrows (bottom right) turned out not to work well.  After a detailed analysis, we worked out that in fact the top left was the best place (with some other manipulations), and this analysis was verified in use.

As an important meta-lesson what looked right turned out not to be right.  User studies showed that it was wrong, but not how to put it right, and it was detailed analysis that filled the vital design gap.  However, even when we knew what was right it still looked wrong.  It was only years later (in 1997) that I realised that the discrepancy was because one mentally imagined a hand reaching across the screen, even though really one was using a mouse on the desk surface.

Visual (and other) impressions of designers and users can be wrong; as in any mature field, quite formal, detailed analysis is necessary to complement even the most experienced designer’s intuitions.

The original Interfaces article was followed by an even shorter subsidiary article, “Sinister Scrollbar in the Xerox Star Xplained“, which delved into the history of the direction of scroll arrows on a scrollbar, and how they arose partly from a mistake when Apple took over the Star designs!  This is particularly interesting today given Apple’s perverse decision to remove scroll arrows completely – scrolling now feels like a Monte Carlo exercise, hoping you end up in the right place!

However, while it is important to find underlying principles, theories and explanations that stand the test of time, the application of these will certainly change.  Whilst, for an old mouse-and-screen PC, the visual ‘hands across the screen’ impression was ‘wrong’ in terms of real use experience, touch devices such as the iPad have now changed this.  It really is a good idea to have the scrollbar on the left, so that you don’t cover up the screen as you scroll – or, to be precise, it is good if you are right-handed.  But look hard: there are never options to change this for left-handed users – is this not a discrimination issue?  To be fair, tabs and menu items are normally found at the top of the screen, equally bad for all.  As with the scroll arrows, it seems that Apple long ago gave up any pretence of caring for basic usability or ergonomics (one day those class actions will come from a crippled generation!) – if people buy because of visual and tactile design, why bother?  And where Apple leads, the rest of the market follows 🙁

Actually it is not as easy as simply moving buttons around the screen; we have expectations from large screen GUI interfaces that we bring to the small screen, so any non-standard positioning needs to be particularly clear graphically.  However, the diverse location of items on web pages and often bespoke design of mobile apps, whilst bringing their own problems of inconsistency, do give a little more flexibility.

So today, as you design, do think “hands”, and left hands as well as right hands!

And in 15 years time, who knows what we’ll have in our hands, but let’s see if the same deep principles still remain.

spice up boring lists of web links – add favicons using jQuery

Earlier today I was laying out lists of links to web resources, initially as simple links:

However, this looked a little boring, so I thought it would be good to add each site’s favicon (the little icon a browser shows beside the address), to give a list like this:

  jQuery home page

  Wikipedia page on favicons

  my academic home page

The pages with the lists were being generated, and the icons could have been inserted using a server-side script, but to simplify the server-side code (for speed and maintainability) I put the fetching of favicons into a small JavaScript function using jQuery.  The page is initially written (or generated) with default images, and the script simply fills in the favicons when the page is loaded.
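The detailed code is on the favicon page linked below, but the core idea can be sketched like this (the function name and the jQuery selector are illustrative assumptions, not the actual library):

```javascript
// Derive the conventional favicon location from a page URL.
// (A site may declare a different icon via <link rel="icon">;
// /favicon.ico is the simple default this sketch relies on.)
function faviconUrl(pageUrl) {
    var u = new URL(pageUrl);
    return u.origin + '/favicon.ico';
}

// In the page, jQuery then fills in each placeholder image once
// the document is ready, e.g.:
//   $('a.with-favicon').each(function () {
//       $(this).prev('img').attr('src', faviconUrl(this.href));
//   });
```

Doing this client-side keeps the server-side page generation simple: the page ships with default images and the script swaps in the real icons on load.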

The list above is made by hand, but look at this example page to see the script in action.

You can use this in your own web pages and applications by simply including a few JavaScript files and adding classes to certain HTML elements.

See the favicon code page for a more detailed explanation of how it works and how to use it in your own pages.

If Kodak had been more like Apple

Finally Kodak has crumbled; technology and the market changed, but Kodak could not keep up. Lots of memories of those bright yellow and black film spools, and memories in photographs piled in boxes beneath the bed.

But just imagine if Kodak had been more like Apple.

I’m wondering about the fallout from the Kodak collapse. I’m not an investor, nor an employee, nor even a supplier, but I have used Kodak products since childhood and I do have 40 years of memories in Kodak’s digital photo cloud. There is talk of Fuji buying up the remains of the photo cloud service, so it may be that it will re-emerge, but for the time being I can no longer stream my photos to friends’ kTV-enabled TV sets when I visit, nor view them online.

Happily, my Kodak kReader has a cache of most of my photos. But, how many I’m not sure, when did I last look at the photos of those childhood holidays or my wedding, will they be in my reader, I’ll check my kPhone as well. I’d hate to think I’d lost the snaps of the seaside holiday when my hat blew into the water; I only half remember it, but every time I look at it I remember being told and re-told the story by my dad.

The kReader is only a few months old. I usually try to put off getting a new one as they are so expensive, but even after a couple of years the software updates put a strain on the old machines.  I had to give up when my three year old model seemed to take about a minute to show each photo. It was annoying as this wasn’t just the new photos, but ones I recall viewing instantly on my first photo-reader more than 30 years ago (I can still remember the excitement as I unwrapped it one Christmas, I was 14 at the time, but now children seem to get their first readers when they are 4). The last straw was when the software updates would no longer work on the old processor and all my newer photos were appearing in strange colours.

Some years ago, I’d tried using a Fuji-viewer, which was much cheaper than the Kodak one. In principle you could download your photo cloud collection in an industry standard format and then import them into the Fuji cloud. However, this lost all the notes and dates on the photos and kept timing out unless I downloaded them in small batches, then I lost track of where I was. Even my brother-in-law, who is usually good at this sort of thing, couldn’t help.

But now I’m glad I’ve got the newest model of kReader, as it has 8 times the memory of the old one, so hopefully all of my old photos are in its cache. But oh no, I’ve just thought: has it only cached the things I’ve looked at since I got it?  If so I’ll have hardly anything. Please, please let the kReader have downloaded all it could.

Suddenly, I remember the days when I laughed a little when my mum was still using her reels of old Apple film and the glossy prints that would need scanning to share on the net (not that she did use the net, she’d pop them in the post!). “I know it is the future”, she used to say, “but I never really trust things I can’t hold”. Now I just wish I’d listened to her.

changing rules of copyright on the web – the NLA case

I’ve been wondering about the broader copyright implications of a case that went through the England and Wales Court of Appeal earlier this year.  The case was brought by the NLA (Newspaper Licensing Agency) against Meltwater, who run commercial media-alert services; for example, telling you or your company when and where you have been mentioned in the press.

While the case is specifically about a news service, it appears to have  broader implications for the web, not least because it makes new judgements on:

  • the use of titles/headlines — they are copyright in their own right
  • the use of short snippets (in this case no more than 256 characters) — they too potentially infringe copyright
  • whether a URL link is sufficient acknowledgement of copyright material for fair use – it isn’t!

These, particularly the last, seem to have implications for any form of publicly available lists, bookmarks, summaries, or even search results on the web.  While the NLA specifically allow free services such as Google News and Google Alerts, it appears that this is ‘grace and favour’, not use by right.   I am reminded of the Shetland case1, which led to many organisations having paranoid policies regarding external linking (e.g. seeking explicit permission for every link!).

So, in the UK at least, web copyright law changed significantly through precedent, and I didn’t even notice at the time!

In fact, the original case was heard more than a year ago, in November 2010 (full judgement), and the appeal in July 2011 (full judgement), but it is sufficiently important that the NLA are still headlining it on their home page (see below, and also their press releases (PDF) about the original judgement and appeal).  So effectively things changed at least at that point, although as this is a judgement about law, not new legislation, it presumably also acts retrospectively.  However, I only recently became aware of it after seeing a notice in The Times last week – I guess because it is time for annual licences to be renewed.

Newspaper Licensing Agency (home page) on 26th Dec 2011

The actual case was, in summary, as follows. Meltwater News produce commercial media-monitoring services that include the title, first few words, and a short snippet of news items satisfying some criteria, for example mentioning a company name or product.  The NLA have a licence agreement for such companies and for those using their services, but Meltwater claimed it did not need a licence and that, even if it did, its clients certainly did not require one.  However, the original judgement and the appeal found pretty overwhelmingly in favour of the NLA.

In fact, my gut feeling in this case was with the NLA.  Meltwater were making substantial money from a service that (a) depends on the presence of news services and (b) would, for equivalent print services, require some form of licence fee to be paid.  So while I actually feel the judgement is fair in the particular case, it makes decisions that seem worrying when looked at in terms of the web in general.

Summary of the judgement

The appeal supported the original judgement, so I summarise the main points from the latter (indented text quotes from the text of the judgement).


The status of headlines (and, I guess by extension, book titles, etc.) in UK law is certainly materially changed by this ruling (paras 70/71) from previous case law (Fairfax, para 62).

Para. 70. The evidence in the present case (incidentally much fuller than that before Bennett J in Fairfax -see her observations at [28]) is that headlines involve considerable skill in devising and they are specifically designed to entice by informing the reader of the content of the article in an entertaining manner.

Para. 71. In my opinion headlines are capable of being literary works, whether independently or as part of the articles to which they relate. Some of the headlines in the Daily Mail with which I have been provided are certainly independent literary works within the Infopaq test. However, I am unable to rule in the abstract, particularly as I do not know the precise process that went into creating any of them. I accept Mr Howe’s submission that it is not the completed work as published but the process of creation and the identification of the skill and labour that has gone into it which falls to be assessed.

Links and fair use

The ruling explicitly says that a link is not sufficient acknowledgement in terms of fair use:

Para. 146. I do not accept that argument either. The Link directs the End User to the original article. It is no better an acknowledgment than a citation of the title of a book coupled with an indication of where the book may be found, because unless the End User decides to go to the book, he will not be able to identify the author. This interpretation of identification of the author for the purposes of the definition of “sufficient acknowledgment” renders the requirement to identify the author virtually otiose.

Links as copies

Para 45 (not part of the judgement, but part of NLA’s case) says:

Para. 45. … By clicking on a Link to an article, the End User will make a copy of the article within the meaning of s. 17 and will be in possession of an infringing copy in the course of business within the meaning of s. 23.

The argument here is that the site has some terms and conditions that say it is not for ‘commercial user’.

As far as I can see the judge equivocates on this issue, but happily does not seem convinced:

Para 100. I was taken to no authority as to the effect of incorporation of terms and conditions through small type, as to implied licences, as to what is commercial user for the purposes of the terms and conditions or as to how such factors impact on whether direct access to the Publishers’ websites creates infringing copies. As I understand it, I am being asked to take a broad brush approach to the deployment of the websites by the Publishers and the use by End Users. There is undoubtedly however a tension between (i) complaining that Meltwater’s services result in a small click-through rate (ii) complaining that a direct click to the article skips the home page which contains the link to the terms and conditions and (iii) asserting that the End Users are commercial users who are not permitted to use the websites anyway.

Free use

Finally, the following extract suggests that NLA would not be seeking to enforce the full licence on certain free services:

Para. 20. The Publishers have arrangements or understandings with certain free media monitoring services such as Google News and Google Alerts whereby those services are currently licensed or otherwise permitted. It would apparently be open to the End Users to use such free services, or indeed a general search engine, instead of a paid media monitoring service without (currently at any rate) encountering opposition from the Publishers. That is so even though the End Users may be using such services for their own commercial purposes. The WEUL only applies to customers of a commercial media monitoring service.

Of course, the fact that they allow it without a licence suggests they feel the same copyright rules do apply – that is, the search and collation services are subject to copyright.  The judge does not make a big point of this piece of evidence in any way, which would suggest that these free services do not have a right to abstract and link.  However, the fact that Meltwater (the agency the NLA is acting against) is making substantial money was clearly noted by the judge, as was the fact that users could choose instead to use alternative free services.

Thinking about it

As noted, my gut feeling is that fairness lies with the newspapers involved; news gathering and reporting is costly, and openly accessible online newspapers are of benefit to us all; so, if news providers are unable to make money, we all lose.

Indeed, years ago at aQtive, we were very careful that onCue, our intelligent internet sidebar, did not break the business models of the services it pointed to. While we effectively pre-filled forms and submitted them silently, we did not scrape results and present them directly, but instead sent the user to the web page that provided the information.  This was partly out of a feeling that this was the right and fair thing to do, partly because if we treated others fairly they would be happy for us to provide this value-added service on top of what they offered, and partly because we relied on these third-party services for our business, so our commercial success depended on theirs.

This would all apply equally to the NLA v. Meltwater case.

However, like the Shetland case all those years ago, it is not the particulars of the case that seem significant, but the wide-ranging implications.  I, like so many others, frequently cite web materials in blog posts, web pages and resource lists by title alone, with the title as a live link pointing to the source site.  According to this judgement the title is copyright, and even if my use of it is “fair use” (as it normally would be), the use of the live link is NOT sufficient acknowledgement.

Maybe things are not quite so bad as they seem. In the NLA vs. Meltwater case, the NLA had a specific licence model and agreement.  The NLA were not seeking retrospective damages for copyright infringement before this was in place, merely requiring that Meltwater subscribe fully to the licence.  The issue was not just that copyright had been infringed, but that it had been infringed when a specific commercial option was in place.  In UK copyright law, I believe, it is not sufficient to show that copyright has been infringed; one must also show that the copyright owner has been materially disadvantaged by the infringement; so the existence of the licence option was probably critical to the specific judgement.   However, the general principles probably apply to any case where the owner could claim damage … and maybe claim so merely in order to seek an out-of-court settlement.

This case was resolved five months ago, and I’ve not heard of any rush of law firms creating vexatious copyright claims.  So maybe there will not be any long-lasting major repercussions from the case … or maybe the storm is still to come.

Certainly, the courts have become far more internet savvy since the 1990s, but judges can only deal with the laws they are given, and it is not at all clear that law-makers really understand the implications of their legislation for the smooth running of the web.

  1. This was the case in the late 1990s where the Shetland Times sued the Shetland News for including links to its articles.  Although the particular case involved material that appeared to be re-badged, the legal issues endangered the very act of linking at all. See NUJ Freelance “NUJ still supports Shetland News in internet case”, BBC “Shetland Internet squabble settled out of court”, The Lawyer “Shetland Internet copyright case is settled out of court”.[back]