level of detail – scale matters

Posted on September 8, 2015 by alan

We get used to being able to zoom into every document, picture and map, but part of the cartographer’s skill is putting the right information at the right level of detail. If you took area maps and simply scaled them down, they would not make a good road atlas: the main motorways would hardly be visible, and the rest would look like a spider had walked all over it. Similarly, if you zoom into a road atlas you discover that the narrow blue line of each motorway is in fact half a mile wide on the ground.

Nowadays we all use online maps that try to do this automatically. Sometimes this works … and sometimes it doesn’t.

Here are three successive views of Google maps focused on Bournemouth on the south coast of England.

On the first view we see Bournemouth clearly marked, and on the next, zooming in a little, Poole, Christchurch and some smaller places also appear. So far, so good: as we zoom in, more local names are shown as well as the larger place.

However, zoom in one more level and something weird happens: Bournemouth disappears. Poole and Christchurch are there, but no Bournemouth.

However, looking at the same scale in another browser, Bournemouth is still there:

The difference between the two is the Hotel Miramar. On the first browser I am logged into Google mail, and so Google ‘knows’ I am booked to stay in the Hotel Miramar (presumably by scanning my email), and decides to display this also. The labels for Bournemouth and the hotel label overlap, so Google simply omitted the Bournemouth one as less important than the hotel I am due to stay in.

A human map maker would undoubtedly have simply shifted the name ‘Bournemouth’ up a bit, knowing that it refers to the whole town. In principle, Google maps could do the same, but typically geocoding (e.g. Geonames) simply gives a point for each location rather than an area, so it is not easy for the software to make adjustments … except Google clearly knows it is ‘big’ as it is displayed on the first, zoomed out, view; so maybe it could have done better.
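To make the ‘shift the name up a bit’ idea concrete, here is a minimal sketch in Python. It is not how Google Maps works; the `Label` class, the `movable` flag and the pixel sizes are all assumptions made for illustration. Labels are placed in priority order. A fixed label that collides is dropped (the behaviour seen above), whereas a label marked as movable, such as a town name that refers to a whole area, is nudged upwards until it fits.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Label:
    text: str
    x: float          # left edge in screen pixels
    y: float          # top edge in screen pixels (y grows downwards)
    width: float
    height: float
    movable: bool = False   # hypothetical flag: True for names that cover an area


def overlaps(a: Label, b: Label) -> bool:
    """Axis-aligned bounding-box intersection test."""
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def place(labels, step=4.0, max_shift=60.0):
    """Place labels in the given (priority) order, nudging movable ones upwards."""
    placed = []
    for label in labels:
        candidate, shift = label, 0.0
        while candidate is not None and any(overlaps(candidate, p) for p in placed):
            if label.movable and shift < max_shift:
                shift += step
                candidate = replace(label, y=label.y - shift)
            else:
                candidate = None    # give up: fixed label, or shifted too far
        if candidate is not None:
            placed.append(candidate)
    return placed


# Invented screen positions: the hotel label has priority, the town name can move.
hotel = Label("Hotel Miramar", x=100, y=100, width=90, height=14)
town = Label("Bournemouth", x=110, y=105, width=80, height=14, movable=True)
for label in place([hotel, town]):
    print(label.text, label.y)
```

With these made-up numbers both labels survive, the town name ending up a little above the hotel. A real label engine would also try other directions and would need to know the extent of the area being named, which is exactly the information a point-based gazetteer lacks.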

This problem of overlapping legends will be familiar to anyone involved in visualisation whether map based or more abstract.

cone-trees

The image above is the original Cone Tree hierarchy browser developed by Xerox PARC in the early 1990s¹. This was the early days of interactive 3D visualisation, and the Cone Tree exploited many of its advantages, such as a larger effective ‘space’ in which to place objects, and shadows giving both depth perception and a level of overview. However, there was no room for text labels without them all running over each other.

Enter the Cam Tree:

cam-tree

The Cam Tree is identical to the Cone Tree, except that because it is on its side it is easier to place labels without them overlapping 🙂

Of course, with the Cam Tree the regularity of the layout makes it easy to have a single solution. The problem with maps is that labels can appear anywhere.

This is an image of a particularly cluttered part of the Frasan mobile heritage app developed for the An Iodhlann archive on Tiree. Multiple labels overlap making them unreadable. I should note that the large number of names only appear when the map is zoomed in, but when they do appear, there are clearly too many.

frasan-overlap

It is far from clear how to deal with this best. The Google solution was simply to not show some things, but as we’ve seen that can be confusing.

Another option would be to make the level of detail that appears depend not just on the zoom, but also the local density. In the Frasan map the locations of artefacts are not shown when zoomed out and only appear when zoomed in; it would be possible for them to appear, at first, only in the less cluttered areas, and appear in more busy areas only when the map is zoomed in sufficiently for them to space out. This would trade clutter for inconsistency, but might be worthwhile. The bigger problem would be knowing whether there were more things to see.
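A rough sketch of this density-dependent rule, in Python, might look like the following. It is not the Frasan implementation: the function name, the 30-pixel radius, the neighbour limit and the assumption that screen scale doubles with each zoom level are all illustrative choices. A point is shown only if it has few enough neighbours within a given screen distance at the current zoom, and the count of hidden points is returned so the interface could hint that there is more to see.

```python
import math


def visible_points(points, zoom, radius_px=30.0, max_neighbours=2):
    """Return (shown, hidden_count) for map points at a given zoom level.

    points are (x, y) pairs in map units; the screen scale is assumed to be
    2 ** zoom pixels per map unit.
    """
    scale = 2 ** zoom
    shown = []
    for i, (x, y) in enumerate(points):
        # Count other points that would be drawn within radius_px of this one.
        neighbours = sum(
            1
            for j, (ox, oy) in enumerate(points)
            if j != i and math.hypot((x - ox) * scale, (y - oy) * scale) < radius_px
        )
        if neighbours <= max_neighbours:
            shown.append((x, y))
    return shown, len(points) - len(shown)


# Invented data: a tight cluster of four points plus one isolated point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (100, 100)]
print(visible_points(points, zoom=0))   # only the isolated point is shown
print(visible_points(points, zoom=5))   # the cluster has spread out enough to show
```

The pairwise check is quadratic in the number of points, which is fine for a sketch but would want a spatial index for a large archive.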

Another solution is to group things in busy areas. The two maps below are from house listing sites. The first is Rightmove which uses a Google map in its map view. Note how the house icons all overlap one another. Of course, the nature of houses means that if you zoom in sufficiently they start to separate, but the initial view is very cluttered. The second is daft.ie; note how some houses are shown individually, but when they get too close they are grouped together and just the number of houses in the group shown.
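The grouping behaviour can be approximated with a simple grid, as in the Python sketch below. How daft.ie actually does it is not known here; the grid approach, the function name and the 40-pixel cell size are assumptions for illustration. Icons falling in the same screen cell are replaced by a single marker at their centroid, carrying the number of houses it stands for.

```python
from collections import defaultdict


def group_icons(points, cell_px=40.0):
    """Group screen points by grid cell; return (x, y, count) markers."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_px), int(y // cell_px))].append((x, y))

    markers = []
    for members in cells.values():
        cx = sum(x for x, _ in members) / len(members)
        cy = sum(y for _, y in members) / len(members)
        markers.append((cx, cy, len(members)))   # count == 1 means a single house
    return markers


# Invented screen positions: three houses close together and one on its own.
houses = [(10, 10), (12, 15), (20, 30), (200, 200)]
for marker in group_icons(houses):
    print(marker)
```

A grid is crude, as two icons either side of a cell boundary stay separate however close they are, but it shows why the approach suits small fixed-size icons: each group collapses to a point and a number.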

A few years ago, Geoff Ellis and I reviewed a number of clutter reduction techniques², each with advantages and disadvantages; there is no single ‘best’ answer. The daft.ie grouping solution is for icons, which are fixed size and small; the text label layout problem is far harder!

Maybe someday these automatic tools will be able to cope with the full variety of layout problems that arise, but for the time being this is one area where human cartographers still know best.

¹ Robertson, G. G.; Mackinlay, J. D.; Card, S. K. Cone Trees: animated 3D visualizations of hierarchical information. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’91); 1991 April 27 – May 2; New Orleans, LA. NY: ACM; 1991; 189-194.
² Geoffrey Ellis and Alan Dix. 2007. A Taxonomy of Clutter Reduction for Information Visualisation. IEEE Transactions on Visualization and Computer Graphics 13, 6 (November 2007), 1216-1223. DOI=10.1109/TVCG.2007.70535

REF Redux 5 – growing the gender gap

Posted on September 4, 2015 by alan

This fifth post in the REF Redux series looks at gender issues, in particular the likelihood that the apparent bias in computing REF results will disproportionately affect women in computing. While it is harder to find full data for this, a HEFCE post-REF report has already done a lot of the work.

Spoiler: REF results are exacerbating implicit gender bias in computing

A few weeks ago a female computing academic shared how she had been rejected for a job; in informal feedback she heard that her research area was ‘shrinking’. This seemed likely to be due to the REF sub-area profiles described in the first post of this series.

While this is a single example, I am aware that recruitment and investment decisions are already adjusting widely due to the REF results, so that any bias or unfairness in the results will have an impact ‘on the ground’.

Google image search “computing professor”

In fact gender and other equality issues were explicitly addressed in the REF process, with submissions explicitly asked what equality processes, such as Athena Swan, they had in place.

This is set in the context of a large gender gap in computing. Despite there being more women undergraduate entrants than men overall, only 17.4% of computing first degree graduates are female and this has declined since 2005 (Guardian datablog based on HESA data). Similarly only about 20% of computing academics are female (“Equality in higher education: statistical report 2014“), and again this appears to be declining:

from “Equality in higher education: statistical report 2014”, table 1.6 “SET academic staff by subject area and age group”

The imbalance in application rates for research funding has also been an issue that the European Commission has investigated in “The gender challenge in research funding: Assessing the European national scenes“.

HEFCE commissioned a post-REF report “The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management“, which includes substantial statistics concerning the REF results and models of fit to various metrics (not just citations). Helpfully, Fran Amery, Stephen Bates and Steve McKay used these to create a summary of “Gender & Early Career Researcher REF Gaps” in different academic areas. While far from the largest, Computer Science and Informatics is in joint third place in terms of the gender gap as measured by the 4* outputs.

Their data comes from the HEFCE report’s supplement on “Correlation analysis of REF2014 scores and metrics“, and in particular table B4 (page 75):

Extract of “Table B4 Summary of submitting authors by UOA and additional characteristics” from “The Metric Tide : Correlation analysis of REF2014 scores and metrics”

This shows that while 24% of outputs submitted by males were ranked 4*, only 18% of those submitted by females received a 4*. That is, a male member of staff in computing is 33% more likely to get a 4* than a female.
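For anyone wanting to check the arithmetic, the 33% figure is the ratio of the two rates quoted above from table B4; the variable names in this short Python sketch are just for illustration.

```python
male_4star = 0.24     # share of outputs submitted by men rated 4*
female_4star = 0.18   # share of outputs submitted by women rated 4*

# How much more likely a man's output is to be rated 4* than a woman's.
male_advantage = male_4star / female_4star - 1
# The same gap seen from the other side.
female_shortfall = 1 - female_4star / male_4star

print(f"{male_advantage:.0%} more likely")
print(f"{female_shortfall:.0%} less likely")
```

Note the asymmetry: the same pair of numbers means a woman's output is 25% less likely to be rated 4* than a man's, so it matters which way round the comparison is stated.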

Now this could be due to many factors, not least the relative dearth of female senior academics reported by HESA (“Age and gender statistics for HE staff“).

HESA academic staff gender balance: profs vs senior vs other academic

extract of HESA graphic “Staff at UK HE providers by occupation, age and sex 2013/14” from “Age and gender statistics for HE staff”

However, the HEFCE report goes on to compare this result with metrics, in a similar way to my own analysis of subareas and institutional effects. The report states (my emphasis) that:

Female authors in main panel B were significantly less likely to achieve a 4* output than male authors with the same metrics ratings. When considered in the UOA models, women were significantly less likely to have 4* outputs than men whilst controlling for metric scores in the following UOAs: Psychology, Psychiatry and Neuroscience; Computer Science and Informatics; Architecture, Built Environment and Planning; Economics and Econometrics.

That is, for outputs that look equally good from metrics, those submitted by men are more likely to obtain a 4* than those submitted by women.

Having been on the computing panel, I never encountered any incidents that would suggest any explicit gender bias. Personally speaking, although outputs were not anonymous, the only time I was aware of the gender of authors was when I already knew them professionally.

My belief is that these differences are more likely to have arisen from implicit bias, in terms of what is valued. The Royal Society of Edinburgh report “Tapping our Talents” warns of the danger that “concepts of what constitutes ‘merit’ are socially constructed” and the EU report “Structural change in research institutions” talks of “Unconscious bias in assessing excellence“. In both cases the context is recruitment and promotion procedures, but the same may well be true of the way we assess the results of research.

In previous posts I have outlined the way that the REF output ratings appear to selectively benefit theoretical areas compared with more applied and human-oriented ones, and old universities compared with new universities.

While I’ve not yet been able to obtain numbers to estimate the effects, in my experience the areas disadvantaged by REF are precisely those which have a larger number of women. Also, again based on personal experience, I believe there are more women in new university computing departments than old university departments.

It is possible that these factors alone may account for the male–female differences, although this does not preclude an additional gender bias.

Furthermore, if, as seems to be the case, the REF sub-area profiles are being used to skew recruiting and investment decisions, then this means that women will be selectively disadvantaged in future, exacerbating the existing gender divide.

Note that this is not suggesting that recruitment decisions will be explicitly biased against women, but by unfairly favouring traditionally more male-dominated sub-areas of computing this will create or exacerbate an implicit gender bias.

Making the most of stakeholder interviews

Posted on August 29, 2015 by alan

Recently, I was asked for any tips or suggestions for stakeholder interviews. I realised it was going to be more than would fit in the response to an IM message!

I’ll assume that this is purely for requirements gathering. For participatory or co-design, many of the same things hold, but there would be additional activities.

Kinds of knowing

First remember:

what they know – Whether the cleaner of a public lavatory or the CEO of a multi-national, they have rich experience in their area. Respect even the most apparently trivial comments.
what they don’t know they know – Much of our knowledge is tacit: things people know in the sense that they apply them in their day-to-day activities, but are not explicitly aware of knowing. Part of your job as interviewer is to bring this latent knowledge to the surface.
what they don’t know – You are there because you bring expertise and knowledge, most critically in what is possible; it is often hard for someone who has spent years in a job to see that it could be different.

People also find it easier to articulate ‘what’ compared with ‘why’ knowledge:

what – objects, things, and people involved in their job, also the actions they perform, but even the latter can be difficult if they are too familiar
why – the underlying criteria, motivations and values that underpin their everyday activities

Be concrete

Most of us think best when we have concrete examples or situations to draw on, even if we are using these to describe more abstract concepts.

in their natural situation – People often find it easier to remember things if they are in the place and amongst the tools where they normally do them.
show you what they do – Being in their workplace also makes it easy for them to show you what they do. See “case study: Pensions printout“ for an example: the pensions manager was only able to articulate how a computer listing was used when he could demonstrate using the card files in his office. Note that this applies to physical things and also to digital ones (e.g. talking through files on a computer desktop).
watch what they do – If circumstances allow, observe directly. Often people omit the most obvious things, either because they assume they are known, or because they are too familiar and hence tacit. In “Early lessons – It’s not all about technology“, the (1960s!) system analyst realised that it was the operators’ fear of getting their clothes dirty that was slowing down the printing machine; this was not because of anything any of the operators said, but what they were observed doing.
seek stories of past incidents – Humans are born story tellers (listen to a toddler). If asked to give abstract instructions or information we often struggle.
normal and exceptional stories – both are important. Often if asked about a process or procedure the interviewee will give the normative or official version of what they do. This may be because they don’t want to admit to unofficial methods, or maybe that they think of the task in normative terms even though they actually never do it that way. Ask for ‘war stories’ of unusual, exceptional or problematic situations.
technology probes or envisioned scenarios – Although it may be hard to envisage new situations, if potential futures are presented in an engaging and concrete manner, then we are much more able to see ourselves in them, maybe using a new system, and say “but no that wouldn’t work.” (see more at hcibook online! “technology probes“)

Estrangement

As noted the stakeholder’s tacit knowledge may be the most important. By seeking out or deliberately creating odd or unusual situations, we may be able to break out of this blindness to the normal.

ask about other people’s jobs – As well as asking a stakeholder about what they do, ask them about other people; they may notice things about others better than the other person does themselves.
strangers / new folk / outsiders – Seek out the new person, the temporary visitor from another site, or even the cleaner; all see the situation with fresh eyes.
technology probes or envisioned scenarios (again!) – As well as being able to say “but no that wouldn’t work”, we can sometimes say “but no that wouldn’t work, because …”
fantasy – When the aim is to establish requirements and gain understanding, there is no reason why an envisaged scenario need be realistic or even possible. Think SciFi and magic 🙂 For an extended example of this look at ‘Making Tea‘, which asked chemists to make tea as if it were a laboratory procedure!

Of course some of these, notably fantasy scenarios, may work better in some organisations than others!

Analyse

You need to make sense of all that interview data!

the big picture – Much of what you learn will be about what happens to individuals. You need to see how this all fits together (e.g. Checkland/ Soft System Methodology ‘Rich Picture’, or process diagrams). Dig beyond the surface to make sense of the underlying organisational goals … and how they may conflict with those of individuals or other organisations.
the details – Look for inconsistencies, gaps, etc. both within an individual’s own accounts and between different people’s viewpoints. This may highlight the differences between what people believe happens and what actually happens, and so is part of uncovering the tacit.
the deep values – As noted, it is often hard for people to articulate the criteria and motivations that determine their actions. You could look for ‘why’ vocabulary in what they say or in written documentation, or attempt to ‘reverse engineer’ processes to find their purposes. Unearthing values helps to uncover potential conflicts (above), but is also important when considering radical changes. New processes, methods or systems might completely change existing practices, but should still be consonant with the underlying drivers for those original practices. See work on transforming musicological archival practice in the InConcert project for an example.

If possible you may wish to present these back to those involved. Even if people are unaware of certain things they do or think, once these are presented to them, the flood gates open! If your stakeholders are hard to interview, maybe because they are senior, or far away, or because you only have limited access, then if possible do some level of analysis mid-way so that you can adjust future interviews based on past ones.

Prioritise

Neither you nor your interviewees have unlimited time; you need to have a clear idea of the most important things to learn – whilst of course keeping an open ear for things that are unexpected!

If possible, plan time for a second round with some or all of the interviewees after you have had a chance to analyse the first round. This is especially important as you may not know what is important until this stage!

Privacy, respect and honesty

You may not have total freedom in who you see, what you ask or how it is reported, but in so far as is possible (and maybe refuse unless it is) respect the privacy and personhood of those with whom you interact.

This is partly about good professional practice, but also efficacy – if interviewees know that what they say will only be reported anonymously they are more likely to tell you about the unofficial as well as the official practices! If you need to argue for good practice, the latter argument may hold more sway than the former!

In your reporting, do try to make sure that any accounts you give of individuals are ones they would be happy to hear. There may be humorous or strange stories, but make sure you laugh with not at your subjects. Even if no one else recognises them, they may well recognise themselves.

Of course do ensure that you are totally honest before you start in explaining what will and will not be related to management, colleagues, external publication, etc. Depending on the circumstances, you may allow interviewees to redact parts of an interview transcript, and/or to review and approve parts of a report pertaining to them.

REF Redux 4 – institutional effects

Posted on August 26, 2015 by alan

This fourth post in my REF analysis series compares computing sub-panel results across different types of institution.

Spoiler: new universities appear to have been disadvantaged in funding by at least 50%

When I first started analysing the REF results I expected a level of bias between areas; it is a human process and we all bring our own expectations, and ways of doing things. It was not the presence of bias that was shocking, but the size of the effect.

I had assumed that any bias between areas would have largely ‘averaged out’ at the level of Units of Assessment (or UoA, REF-speak typically corresponding to a department), as these would typically include a mix of areas. However, this had been assuming maybe a 10-20% difference between areas; once it became clear this was a huge 5-10 fold difference, the ‘averaging out’ argument was less certain.

The inter-area differences are crucially important, as emphasised in previous posts, for the careers of those in the disadvantaged areas, and for the health of computing research in the UK. However, so long as the effects averaged out, they would not affect the funding coming to institutions when algorithmic formulae are applied (including all English universities, where HEFCE allocate so called ‘QR’ funding based on weighted REF scores).

Realising how controversial this could be, I avoided looking at institutions for a long time, but it eventually became clear that it could not be ignored. In particular, as post-1992 universities (or ‘new universities’) often focus on more applied areas, I feared that they might have been consequentially affected by the sub-area bias.

It turns out that while this was right to an extent, in fact the picture is worse than I expected.

As each output is assigned to an institution it is possible to work out profiles for each institution based on the same measures as sub-areas (as described in the second and third posts in this series): using various types of Scopus and Google Scholar raw citations and the ‘world rankings’ adjustments using the REF contextual data tables. Just as with the sub-areas, the different kinds of metrics all yield roughly similar results.

The main difference when looking at institutions compared to the sub-areas is that, of the 30 or so sub-areas, many are large enough (many hundreds of outputs) to examine individually with confidence that the numbers are statistically robust. In contrast, there were around 90 institutions with UoA submissions in computing, many with fewer than 50 outputs assessed (10-15 people), so getting towards the point where one would expect citation measures to be imprecise for each one alone.

However, while the numbers for a single institution make it hard to say anything definitive (with a few exceptions such as UCL, Edinburgh and Imperial), we can reliably look for overall trends.

One of the simplest single measures is the GPA for each institution (weighted sum with 4 for a 4*, 3 for a 3*, etc.) as this is a measure used in many league tables. The REF GPA can be compared to the predicted GPA based on citations.

While there is some scatter, which is to be expected given the size of each institution, there is also a clear tendency towards the diagonal.

Another measure frequently used is the ‘research power’, the GPA multiplied by the number of people in the submission.

This ‘stretches’ out the institutions and in particular makes the larger submissions (where the metrics are more reliable) stand out more. It is not surprising that this is more linear, as the points are equally scaled by size irrespective of the metric. However, the fact that it clusters quite closely to the diagonal at first seems to suggest that, at the level of institutions, the computing REF results are robust.

However, while GPA is used in many league tables, funding is not based on GPA. Where funding is formulaic (as it is with HEFCE for English universities), the combined measure is very heavily weighted towards 4*, with no money at all being allocated to 2* and 1*.

For RAE2008, the HEFCE weighting was approximately 3:1 between 4* and 3*; for REF2014, funding is weighted even more heavily towards 4*, at 4:1.
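The measures discussed so far can be written down in a few lines of Python. This is a sketch under stated assumptions, not the HEFCE formula itself: the function names are made up, a profile is taken to be the proportion of outputs at each star level, and the example profile and FTE are invented numbers rather than any real submission.

```python
def gpa(profile):
    """Weighted sum: 4 for a 4*, 3 for a 3*, and so on.

    profile maps star level to the proportion of outputs at that level.
    """
    return sum(stars * share for stars, share in profile.items())


def research_power(profile, fte):
    """GPA multiplied by the number of people (FTE) in the submission."""
    return gpa(profile) * fte


def funding_score(profile, ratio=4.0):
    """Money-related score per person: 4* weighted ratio:1 against 3*,
    with nothing at all for 2* and 1*."""
    return ratio * profile.get(4, 0.0) + profile.get(3, 0.0)


# Invented profile: 20% 4*, 50% 3*, 25% 2*, 5% 1*, with 20 FTE submitted.
example = {4: 0.20, 3: 0.50, 2: 0.25, 1: 0.05}
print(round(gpa(example), 2))
print(round(research_power(example, fte=20), 2))
print(round(funding_score(example), 2))             # REF2014-style 4:1
print(round(funding_score(example, ratio=3.0), 2))  # RAE2008-style 3:1
```

Multiplying `funding_score` by FTE gives the money-related equivalent of ‘power’. The point to notice is how much of the funding score comes from the 4* share alone, whereas GPA spreads the weight across all star levels.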

The next figure shows the equivalent of ‘power’ using a 4:1 ratio – roughly proportional to the amount of money under the HEFCE formula (although some of the institutions are not English, so will have different formulae applied). Like the previous graphs, this plots the actual REF money-related power compared with the one predicted by citations.

Again the data is very spread out, with three very large institutions (UCL, Edinburgh and Imperial) on the upper right and the rest in more of a pack in the lower left. UCL is dead on the line, but the next two institutions look like outliers, doing substantially better under REF than citations would predict, and then further down there is more of a spread, with some below and some above the line.

This massed group is hard to see clearly because of the stretching, so the following graph shows the non-volume-weighted results, that is, the simple 4:1 ratio (which I have dubbed GPA #). This is roughly proportional to money per member of staff; again the citation-based prediction is along the lower axis and the actual REF values are on the vertical axis.

The red line shows the prediction line. There is a rough correlation, but also a lot of spread. Given remarks earlier about the sizes of individual institutions this is to be expected. The crucial issue is whether there are any systematic effects, or whether this is purely random spread.

The two green lines mark out those UoAs with REF money-related scores 25% or more above prediction, the ‘winners’ (upper left), and those with REF scores 25% or more below prediction, the ‘losers’ (lower right).
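In code, the banding amounts to comparing the ratio of actual to predicted score with a threshold. The Python sketch below assumes the 25% is measured relative to the citation-based prediction; the function name and the example scores are invented for illustration.

```python
def classify(actual, predicted, threshold=0.25):
    """Label a UoA by how far its REF score is from the citation prediction."""
    ratio = actual / predicted
    if ratio >= 1 + threshold:
        return "winner"
    if ratio <= 1 - threshold:
        return "loser"
    return "within band"


# Invented (actual, predicted) money-related scores per member of staff.
for actual, predicted in [(1.3, 1.0), (0.7, 1.0), (1.1, 1.0)]:
    print(actual, predicted, classify(actual, predicted))
```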

Of 17 winners 16 are pre-1992 (‘old’) universities with just one post-1992 (‘new’) university. Furthermore of the 16 old university winners, 10 of these come from the 24 Russell Group universities.

Of the 35 losers, 25 are post-1992 (‘new’) universities and of the 10 ‘old’ university losers, there is just 1 Russell Group institution.

The exact numbers change depending on which precise metric one uses and whether one uses a 4:1, or 3:1 ratio, but the general pattern is the same.

Note this is not to do with who gets more or less money in total. Whatever metric one uses, on average the new universities tend to be lower, the old ones higher, and the Russell Group higher still. The issue here is about an additional benefit of reputation over and above this raw quality effect. For works that by external measures are of equal value, there appears to be at least 50-100% added benefit if they are submitted from a more ‘august’ institution.

To get a feel for this, let’s look at a specific example: one of the big ‘winners’, YYYYYYYY, a Russell Group university, compared with one of the ‘losers’, XXXXXXXX, a new university.

As noted, one has to look at individual institutions with some caution as the numbers involved can be small, but XXXXXXXX is one of the larger (in terms of submission FTE) institutions in the ‘loser’ category, with 24.7 FTE and nearly 100 outputs. It also happened (by chance) to sit only one row above YYYYYYYY on the spreadsheet, so was easy to compare. YYYYYYYY is even larger, with nearly 50 FTE and 200 outputs.

At 100 and 200 outputs, these are still, in size, towards the smaller end of the sub-area groups we were looking at in the previous two posts, so this should be taken as more illustrative of the overall trend, not a specific comment on these institutional submissions.

This time we’ll first look at the citation profiles for the two.

The spreadsheet fragment below shows the profiles using raw Scopus citation measures. Note that in this table the right-hand column, the upper quartile, is the ‘best’ column.

The two institutions look comparable; XXXXXXXX is slightly higher in the very highest cited papers, but the differences are effectively within the noise.

Similarly, we can look at the ‘world ranks’ as used in the second post. Here the left-hand side is ‘best’, corresponding to the percentage of outputs that are within the best 1% of their area worldwide.

X-vs-Y-world-ranks

Again XXXXXXXX is slightly above YYYYYYYY, but basically within noise.

If you look at other measures, the picture is mixed: for citations in ‘reliable years’ (2011 and older, where there has been more time to gather cites), XXXXXXXX looks a bit stronger; for Google-based citations, YYYYYYYY looks a bit stronger.

So, except for small variations, these two institutions, one a new university, one a Russell Group one, look comparable in terms of external measures.

However, the REF scores paint a vastly different picture. The respective profiles are below:

Critically, the Russell Group YYYYYYYY has more than three times as many 4* outputs as the new university XXXXXXXX, despite being comparable in terms of external metrics. As the 4* are heavily weighted the effect is that the GPA # measure (roughly money per member of staff) is more than twice as large.

Comparing using the world rankings table: for the new university XXXXXXXX only just over half of their outputs in the top 1% worldwide are likely to be ranked a 4*, whereas for YYYYYYYY nearly all outputs in the top 5% are likely to rank 4*.

As noted it is not generally reliable to do point comparisons on institutions as the output counts are low, and also XXXXXXXX and YYYYYYYY are amongst the more extreme winners and losers (although not the most extreme!). However, they highlight the overall pattern.

At first I thought this institutional difference was due to the sub-area bias, but even when this was taken into account large institutional effects remained; there does appear to be an additional institutional bias.

The sub-area discrepancies will be partly due to experts from one area not understanding the methodologies and quality criteria of other areas. However, the institutional discrepancy is most likely simply a halo effect.

As emphasised in previous posts, the computing sub-panels and indeed everyone involved with the REF process worked as hard as possible to ensure that the process was as fair, and, insofar as it was compatible with privacy, as transparent as possible. However, we are human and it is inevitable that to some extent when we see a paper from a ‘good’ institution we are expecting it to be good and vice versa.

These effects may actually be relatively small individually, but the heavy weighting of 4* is likely to exacerbate even small bias. In most statistical distributions, relatively small shifts of the mean can make large changes at the extremity.
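The tail effect is easy to see with a normal distribution, as in the Python sketch below. The numbers are purely illustrative assumptions (a standard normal ‘quality’ scale, a cut-off two standard deviations up, and a shift of a quarter of a standard deviation); nothing here is fitted to REF data.

```python
from statistics import NormalDist


def share_above(threshold, mean, sd=1.0):
    """Proportion of a normal distribution lying above the threshold."""
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


shift = 0.25   # a small nudge of the mean, in standard deviations

# Effect of the nudge at an extreme cut-off (2 sd above the original mean)...
tail_ratio = share_above(2.0, shift) / share_above(2.0, 0.0)
# ...compared with its effect at a cut-off in the middle of the distribution.
middle_ratio = share_above(0.0, shift) / share_above(0.0, 0.0)

print(f"share above 2 sd grows by a factor of {tail_ratio:.2f}")
print(f"share above the mean grows by a factor of {middle_ratio:.2f}")
```

Under these assumptions the same small shift increases the proportion above the extreme cut-off by roughly three quarters, but the proportion above the middle by only about a fifth. A measure that counts mainly the top band therefore amplifies any small systematic nudge.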

By focusing on 4*, in order to be more ‘selective’ in funding, it is likely that the eventual funding metric is more noisy and more susceptible to emergent bias. Note how the GPA measure seemed far more robust, with REF results close to the citation predictions.

While HEFCE has shifted REF2014 funding more heavily towards 4*, the Scottish Funding Council has shifted slightly the other way from 3.11:1 for RAE2008, to 3:1 for REF2014 (see THES: Edinburgh and other research-intensives lose out in funding reshuffle). This has led to complaints that it is ‘defunding research excellence‘. To be honest, this shift will only marginally reduce institutional bias, but at least appears more reliable than the English formula.

Finally, it should be noted that while there appear to be overall trends favouring Russell Group and old universities compared with post-1992 (new) universities; this is not uniform. For example, UCL, with the largest ‘power’ rating and large enough that it is sensible to look at individually, is dead on the overall prediction line.

REF Redux 3 – plain citations

Posted on August 20, 2015 by alan

This third post in my series on the results of REF 2014, the UK periodic research assessment exercise, is still looking at subarea differences. Following posts will look at institutional (new vs old universities) and gender issues.

The last post looked at world rankings based on citations normalised using the REF ‘contextual data’. In this post we’ll look at plain unnormalised data. To some extent this should be ‘unfair’ to more applied areas as citation counts tend to be lower, as one mechanical engineer put it, “applied work doesn’t gather citations, it builds bridges”. However, it is a very direct measure.

The shocking thing is that while raw citation measures are likely to bias against applied work, the REF results turn out to be worse.

There were a number of factors that pushed me towards analysing REF results using bibliometrics. One was the fact that HEFCE were using this for comparison between sub-panels; another was that Morris Sloman’s analysis of the computing sub-panel results used Scopus and Google Scholar citations.

We’ll first look at the two relevant tables in Morris’ slides, one based on Scopus citations:

and one based on Google Scholar citations:

Both tables rank all outputs based on citations, divide these into quartiles, and then look at the percentage of 1*/2*/3*/4* outputs in each quartile. For example, looking at the Scopus table, 53.3% of 4* outputs have citation counts in the top (4th) quartile.
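The construction of such a table can be sketched as follows. The sample data is invented purely for illustration (the per-output scores are not public), and ties in citation count are simply left in sorted order, which is an assumption rather than the method used in the slides.

```python
from collections import Counter


def quartile_table(outputs):
    """outputs: list of (citations, stars) pairs.

    Returns {stars: [percentage of that star level in quartile 1..4]},
    where quartile 4 holds the most cited outputs.
    """
    ranked = sorted(outputs, key=lambda output: output[0])
    n = len(ranked)
    counts = Counter()
    totals = Counter()
    for position, (_, stars) in enumerate(ranked):
        quartile = position * 4 // n + 1  # 1 (lowest) to 4 (highest)
        counts[(stars, quartile)] += 1
        totals[stars] += 1
    return {
        stars: [100.0 * counts[(stars, q)] / totals[stars] for q in (1, 2, 3, 4)]
        for stars in sorted(totals)
    }


# Hypothetical outputs: (citation count, REF star score)
sample = [(0, 1), (1, 2), (2, 2), (3, 3), (5, 3), (8, 4), (13, 3), (21, 4)]
table = quartile_table(sample)
for stars, row in table.items():
    print(f"{stars}*:", ", ".join(f"{pct:5.1f}%" for pct in row))
```

Each row is a percentage of the outputs at that star level, which matches the way the 53.3% figure is quoted.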

Both tables are roughly clustered towards the diagonal; that is, there is an overall correlation between citation and REF score, apparently validating the REF process.

There are, however, also off-diagonal counts. At the top left are outputs that score well in REF, but have low citations. This is to be expected; non-article outputs such as books, software and patents may be important but typically attract fewer citations, and good papers may have been published in a poor choice of venue, leading to low citations.

More problematic is the lower right: outputs that have high citations but a low REF score. There are occasional reasons why this might be the case, for example papers that are widely cited for being wrong; however, these cases are rare (I do not recall any in those I assessed). In general this area represents outputs that the respective communities have judged strong, but the REF panel regarded as weak. The numbers need care in interpreting as only around 30% of outputs were scored 1* and 2* combined; however, it still means that around 10% of outputs in the top quartile were scored in the lower two categories and thus would not attract funding.

We cannot produce a table like the above for each sub-area as the individual scores for each output are not available in the public domain, and have been destroyed by HEFCE (for privacy reasons).

However, we can create quartile profiles for each area based on citations, which can then be compared with the REF 1*/2*/3*/4* profiles. These can be found on the results page of my REF analysis micro-site. Like the world rank lists in the previous post, there is a marked difference between the citation quartile profiles for each area and the REF star profiles.

One way to get a handle on the scale of the differences, is to divide the proportion of REF 4* by the proportion of top quartile outputs for each area. Given the proportion of 4* outputs is just over 22% overall, the top quartile results in an area should be a good predictor of the proportion of 4* results in that area.
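As a small sketch of that calculation: the area names and percentages below are hypothetical, not taken from the results spreadsheet, and the 25% tolerance used to label an area is an arbitrary assumption.

```python
def four_star_ratio(pct_ref_4star, pct_top_quartile):
    """REF 4* share divided by the share of outputs in the top citation quartile."""
    return pct_ref_4star / pct_top_quartile


def label(ratio, tolerance=0.25):
    """Rough label for a ratio; the tolerance is an illustrative choice."""
    if ratio > 1 + tolerance:
        return "more 4* than citations predict"
    if ratio < 1 - tolerance:
        return "fewer 4* than citations predict"
    return "roughly in line"


# Hypothetical areas: (% of outputs in top citation quartile, % of outputs given 4*)
areas = {
    "area A": (15.0, 36.0),
    "area B": (24.0, 23.0),
    "area C": (30.0, 12.0),
}

for name, (top_quartile, ref_4star) in areas.items():
    ratio = four_star_ratio(ref_4star, top_quartile)
    print(f"{name}: ratio {ratio:.2f} ({label(ratio)})")
```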

The following shows an extract of the full results spreadsheet:

The left hand column shows the percentage of outputs in the top quartile of citations; the column to the right of the area title is the proportion of REF 4*; and the right hand column is the ratio. The green entries are those where the REF 4* results exceed those you would expect based on citations; the red those that get less REF 4* than would be expected.

While there are some areas (AI, Vision) for which the citations are an almost perfect predictor, there are others which obtain two to three times more 4*s under REF than one would expect based on their citation scores, ‘the winners’, and some where REF gives two to three times fewer 4*s than would be expected, ‘the losers’. As is evident, the winners are the more formal areas, the losers the more applied and human centric areas. Remember again that if anything one would expect the citation measures to favour more theoretical areas, which makes this difference more shocking.

Andrew Howes replicated the citation analysis independently using R and produced the following graphic, which makes the differences very clear.

The vertical axis has areas ranked by proportion of REF 4*; higher up means more highly rated by REF. The horizontal axis shows areas ranked by proportion of citations in the top quartile. If REF scores were roughly in line with citation measures, one would expect the points to lie close to the line of equal ranks; instead the areas are scattered widely.

That is, there seems little if any relation between quality as measured externally by citations and the quality measures of REF.

The contrast with the tables at the top of this post is dramatic. If you look at outputs as a whole, there is a reasonable correspondence: outputs that rank higher in terms of citations rank higher in REF star score, apparently validating the REF results. However, when we compare areas, this correspondence disappears. This apparent contradiction is probably due to the correlation being very strong within each area, while the areas themselves are scattered.

Looking at Andrew’s graph, it is clear that it is not a random scatter, but systematic; the winners are precisely the theoretical areas, and the losers the applied and human centred areas.

Not only is the bias against applied areas critical for the individuals and research groups affected, but it has the potential to skew the future of UK computing. Institutions with more applied work will be disadvantaged, and based on the REF results it is clear that institutions are already skewing their recruitment policies to match the areas which are likely to give them better scores in the next exercise.

The economic future of the country is likely to become increasingly interwoven with digital developments and related creative industries and computing research is funded more generously than areas such as mathematics, precisely because it is expected to contribute to this development — or as a buzzword ‘impact’. However, the funding under REF within computing is precisely weighted against the very areas that are likely to contribute to digital and creative industries.

Unless there is rapid action the impact of REF2014 may well be to destroy the UK’s research base in the areas essential for its digital future, and ultimately weaken the economic life of the country as a whole.

If you do accessibility, please do it properly

Posted on August 19, 2015 by alan

I was looking at Coca-Cola’s Rugby World Cup site¹.

On the all-red web page the tooltip stood out, with the uninformative text, “headimg”.

Peeking in the HTML, this is in both the title and alt attributes of the image.

<img title="headimg" alt="headimg" class="cq-dd-image" 
     src="/content/promotions/nwen/....png">

I am guessing that the web designer was aware of the need for an alt tag for accessibility, and may even have been prompted to fill in the alt tag by the design software (Dreamweaver does this). However, perhaps they just couldn’t think of an alternative text and so put anything in (although as the image consists of text, this does betray a certain lack of imagination!); they probably planned to come back later to do it properly.

As the micro-site is predominantly targeted at the UK, Coca-Cola are legally bound to make it accessible and so may well have run it through WCAG accessibility checking software. As the alt tag was present it will have passed W3C validation, even though the text is meaningless. Indeed the web designer might have added the unhelpful text just to get the page to validate.

The eventual page is worse than useless, a blank alt tag would have meant it was just skipped, and at least the text “header image” would have been read as words, whereas “headimg” will be spelt out letter by letter.
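A check that looks at what the alt text says, not just whether it is present, could be sketched like this. It uses Python's standard `html.parser`; the list of filler words and the `head.png` file name are my own illustrative assumptions, and a real checker would need a much richer heuristic.

```python
from html.parser import HTMLParser

# Hypothetical list of filler words, for illustration only.
FILLER = {"img", "image", "headimg", "header", "picture", "photo", "graphic"}


class AltAudit(HTMLParser):
    """Collects <img> elements whose alt text is missing or looks like filler."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src", "(no src)")
        if "alt" not in attributes:
            self.problems.append((src, "missing alt"))
            return
        alt = (attributes["alt"] or "").strip()
        if alt and alt.lower() in FILLER:
            self.problems.append((src, f"filler alt text: {alt!r}"))
        # An empty alt is left alone: it marks the image as decorative.


def audit(html):
    """Return a list of (src, problem) pairs for the given HTML fragment."""
    parser = AltAudit()
    parser.feed(html)
    parser.close()
    return parser.problems


# head.png is a stand-in file name, not the path used on the real page.
page = '<img title="headimg" alt="headimg" class="cq-dd-image" src="head.png">'
print(audit(page))
```

The point of the sketch is that the image above passes a simple "is alt present?" test, but is flagged once the content of the text is examined.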

Perhaps I am being unfair, I’m sure many of my own pages are worse than this … but then again I don’t have the budget of Coca-Cola!

More seriously there are important lessons for process. In particular it is very likely that at the point the designer uploads an image they are prompted for the alt tag; this certainly happens with Dreamweaver. However, at this point your focus is on getting the page looking right, as the client looking at the initial designs is unlikely to be using a screen reader.

Good design software should not just prompt for the right information, but at the right time. It would be far better to make it easy to say “ask me later” and build up a to do list, rather than demand the information when the system wants it, and risk the user entering anything to ‘keep the system quiet’.

I call this the Micawber principle² and it is a good general principle for any notifications requiring user action. Always allow the user to put things off, but also have the application keep track of pending work, and then make it easy for the user to see what needs to be done at a more suitable time.
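As a minimal sketch of the idea, here is a prompt handler that accepts "ask me later" and keeps a to-do list of what is still outstanding. The class and method names are hypothetical and not taken from any existing design tool.

```python
from dataclasses import dataclass, field


@dataclass
class PendingItem:
    target: str    # what the question is about, e.g. an image file
    question: str  # what still needs answering


@dataclass
class DeferredPrompts:
    """Lets the user put a question off, but remembers that it is outstanding."""

    pending: list = field(default_factory=list)
    answers: dict = field(default_factory=dict)

    def ask(self, target, question, answer=None):
        # No answer (or a blank one) means "ask me later", never "enter anything".
        if answer is None or not answer.strip():
            self.pending.append(PendingItem(target, question))
        else:
            self.answers[target] = answer.strip()

    def todo(self):
        """The list to show the user at a more suitable time."""
        return [f"{item.target}: {item.question}" for item in self.pending]

    def resolve(self, target, answer):
        self.answers[target] = answer.strip()
        self.pending = [item for item in self.pending if item.target != target]


prompts = DeferredPrompts()
prompts.ask("header.png", "alt text?")                     # deferred
prompts.ask("logo.png", "alt text?", "Company logo")       # answered now
print(prompts.todo())
prompts.resolve("header.png", "Rugby World Cup promotion")
print(prompts.todo())
```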

¹ Largely because I was fascinated by the semantically questionable statement “Win one of up to 1 million exclusive Gilbert rugby balls.” (my emphasis).
² From Dickens’s Mr Micawber, who was an arch procrastinator. See Learning Analytics for the Academic: An Action Perspective, where I discuss this principle in the context of academic use of learning analytics.

REF Redux 2 – world ranking of UK computing

Posted on August 11, 2015 by alan

This is the second of my posts on the citation-based analysis of REF, the UK research assessment process in computer science. The first post set the scene and explained why citations are a valid means for validating (as opposed to generating) research assessment scores.

Spoiler: for outputs of similar international standing it is ten times harder to get 4* in applied areas than more theoretical areas

As explained in the previous post, amongst the public domain data available is the complete list of all outputs (except a very small number of confidential reports). This does NOT include the actual REF 4*/3*/2*/1* score, but does include Scopus citation data from late 2013 and Google Scholar citation data from late 2014.

From this seven variations of citation metrics were used in my comparative analysis, but essentially all give the same results.

For this post I will focus on one of them, which is perhaps the clearest, effectively turning citation data into world ranking data.

As part of the pre-submission materials, the REF team distributed a spreadsheet, prepared by Scopus, which lists for different subject areas the number of citations for the best 1%, 5%, 10% and 25% of papers in each area. These vary between areas, in particular more theoretical areas tend to have more Scopus counted citations than more applied areas. The spreadsheet allows one to normalise the citation data and for each output see whether it is in the top 1%, 5%, 10% or 25% of papers within its own area.
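The normalisation step can be sketched as below. The area names and threshold values are invented for illustration; the real figures come from the Scopus spreadsheet and differ by subject area (and, I assume, by year of publication).

```python
def world_band(citations, thresholds):
    """Place an output in a world-ranking band for its own area.

    thresholds maps a percentile (1, 5, 10, 25) to the minimum number of
    citations needed to be in that top percentage of papers in the area.
    """
    for percentile in sorted(thresholds):
        if citations >= thresholds[percentile]:
            return f"top {percentile}%"
    return "lower 75%"


# Hypothetical thresholds: the more theoretical area needs more citations
# to reach the same band than the more applied one.
thresholds_by_area = {
    "theoretical area": {1: 80, 5: 40, 10: 25, 25: 10},
    "applied area": {1: 40, 5: 20, 10: 12, 25: 5},
}

# The same raw citation count lands in different bands in different areas.
for area, thresholds in thresholds_by_area.items():
    print(f"30 citations in {area}: {world_band(30, thresholds)}")
```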

The overall figure across REF outputs in computing is as follows:

Top 1%:     16.9%
Top 1-5%:   27.9%
Top 6-10%:  18.0%
Top 11-25%: 23.8%
Lower 75%:  13.4%

The first thing to note is that about 1 in 6 of the submitted outputs are in the top 1% worldwide and not far short of a half (45%) in the top 5%. Of course these are the top publications, so one would expect the REF submissions to score well, but still this feels like a strong indication of the quality of UK research in computer science and informatics.

According to the REF2014 Assessment criteria and level definitions, the definition of 4* is “quality that is world-leading in terms of originality, significance and rigour”, and so these world citation rankings correspond very closely to “world leading”. In computing we allocated 22% of papers as 4*; that is, roughly, if a paper is in the top 1.5% of papers worldwide in its area it is ‘world leading’, which sounds reasonable.

The next level, 3* “internationally excellent”, covers a further 47% of outputs, so approximately the top 11% of papers worldwide, which again sounds a reasonable definition of “internationally excellent”, validating the overall quality criteria of the panel.

As the outputs include a sub-area tag, we can create similar world ‘league tables’ for each sub-area of computing, that is ranking the REF submitted outputs in each area amongst their own area worldwide:

As is evident there is a lot of variation, with some top areas (applications in life sciences and computer vision) with nearly a third of outputs in the top 1% worldwide, whilst other areas trail (mathematics of computing and logic), with only around 1 in 20 papers in the top 1%.

Human computer interaction (my area) is split between two headings, “human-centered computing” and “collaborative and social computing”, which between them sit just above the mid point; AI is also in the middle and Web in the top half of the table.

Just as with the REF profile data, this table should be read with circumspection – it is about the health of the sub-area overall in the UK, not about a particular individual or group which may be at the stronger or weaker end.

The long-tail argument (that weaker researchers and those in less research intensive institutions are more likely to choose applied and human-centric areas) of course does not apply to logic, mathematics and formal methods at the bottom of the table. However, these areas may be affected by a dilution effect as more discursive areas are perhaps less likely to be adopted by non-first-language English academics.

This said, the definition of 4* is “Quality that is world-leading in terms of originality, significance and rigour”, and so these world rankings seem as close as possible to an objective assessment of this.

It would therefore be reasonable to assume that this table would correlate closely to the actual REF outputs, but in fact this is far from the case.

Compare this to the REF sub-area profiles in the previous post:

Some areas lie at similar points in both tables; for example, computer vision is near the top of both tables (ranks 2 and 4) and AI a bit above the middle in both (ranks 13 and 11). However, some areas that are near the middle in terms of world rankings (e.g. human-centred computing at rank 14) and even some near the top (e.g. network protocols at rank 3) come out very poorly in REF (ranks 26 and 24 respectively). On the other hand, some areas that rank very low in the world league table come very high in REF (e.g. logic rank 28 in ‘league table’ compared to rank 3 in REF).

On the whole, areas that are more applied or human focused tend to do a lot worse under REF than they appear to be when looked at in terms of their world rankings, whereas more theoretical areas seem to have inflated REF rankings. Those that are ‘traditional algorithmic computer science’ (e.g. vision, AI) are ranked similarly in REF and in the world rankings.
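To make the size of these shifts concrete, the ranks quoted above can be turned into "places gained or lost under REF". This is only a sketch over the five areas mentioned in the text; for vision and AI the text gives the pair of ranks without saying which is which, so world rank first is an assumption.

```python
# area -> (rank in world citation league table, rank in REF 4* table)
ranks = {
    "computer vision": (2, 4),    # order of the pair assumed
    "AI": (13, 11),               # order of the pair assumed
    "human-centred computing": (14, 26),
    "network protocols": (3, 24),
    "logic": (28, 3),
}


def places_gained(world_rank, ref_rank):
    """Positive when REF ranks an area higher than its world ranking would."""
    return world_rank - ref_rank


shifts = sorted(
    ((places_gained(world, ref), area) for area, (world, ref) in ranks.items()),
    reverse=True,
)

for shift, area in shifts:
    print(f"{area:25s} {shift:+d}")
```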

We will see other ways of looking at these differences in the next post, but one way to get a measure of the apparent bias is by looking at how high an output needs to be in world rankings to get a 4* depending on what area you are in.

We saw that on average, over all of computing, outputs that rank in the top 1.5% world-wide were getting 4* (world leading quality).

For some areas, for example, AI, this is precisely what we see, but for others the picture is very different.

In applied areas (e.g. web, HCI), an output needs to be in approximately the top 0.5% of papers worldwide to get a 4*, whereas in more theoretical areas (e.g. logic, formal, mathematics), a paper needs to only be in the top 5%.

That is, looking at outputs equivalent in ‘world leading’-ness (which REF is trying to measure), it is 10 times easier to get a 4* in theoretical areas than applied ones.

REF Redux 1 – UK research assessment for computing; what it means and is it right?

Posted on August 3, 2015 by alan

REF is the 5 yearly exercise to assess the quality of UK university research, the results of which are crucial for both funding and prestige. In 2014, I served on the sub-panel that assessed computing submissions. Since the publication of the results, I have been using public domain data from the REF process in order to validate the results using citation data.

The results have been alarming, suggesting that, despite the panel’s best efforts to be fair, there was in fact significant bias both in terms of areas of computer science and types of universities. Furthermore, the first of these is also likely to have led to unintentional emergent gender bias.

I’ve presented results of this at a bibliometrics workshop at WebSci 2015 and at a panel at the British HCI conference a couple of weeks ago. However, I am aware that the full data and spreadsheets can be hard to read, so in a couple of posts I’ll try to bring out the main issues. A report and mini-site describe the methods used in detail, so in these posts I will concentrate on the results and implications, starting in this post by setting the scene: how REF ranked sub-areas of computing, and the use of citations for validation of the process. The next post will look at how UK computing sits amongst world research, and whether this agrees with the REF assessment.

Few in UK computing departments will not have seen the ranking list produced as part of the final report of the computing REF panel.

Here topic areas are ranked by the percentage of 4* outputs (the highest rank). Top of the list is Cryptography, with over 45% of outputs ranked 4*. The top of the list is dominated by theoretical computing areas, with 30-40% 4*, whilst the more applied and human areas are at the lower end with less than 20% 4*. Human-centred computing and collaborative computing, the areas where most HCI papers would be placed, are pretty much at the bottom of the list, with 10% and 8.8% of 4* papers respectively.

Even before this list was formally published I had a phone call from someone in an institution where the knowledge of it had obviously leaked. Their department was interviewing for a lectureship and the question being asked was whether they should be recruiting candidates from HCI as this will clearly not be good looking towards REF 2020.

Since then I have heard of numerous institutions who are questioning the value of supporting these more applied areas, due to their apparent poor showing under REF.

In fact, even taken at face value, the data says nothing at all about the value in particular departments, and the sub-panel report includes the warning “These data should be treated with circumspection”.

There are three possible reasons, any or all of which would give rise to the data:

the best applied work is weak — including HCI :-/
long tail — weak researchers choose applied areas
latent bias — despite panel’s efforts to be fair

I realised that citation data could help disentangle these.

There has been understandable resistance against using metrics as part of research assessment. However, that is about their use to assess individuals or small groups. There is general agreement that citation-based metrics are a good measure of research quality en masse; indeed I believe HEFCE are using citations to verify between-panel differences in 4* allocations, and in Morris Sloman’s post REF analysis slides (where the table above first appeared), he also uses the overall correlation between citations and REF scores as a positive validation of the process.

The public domain REF data does not include the actual scores given to each output, but does include citations data provided by Scopus in 2013. In addition, for Morris’ analysis in late 2014, Richard Mortier (then at Nottingham, now at Cambridge) collected Google Scholar citations for all REF outputs.

Together, these allow detailed citation-based analysis to verify (or otherwise) the validity of the REF outputs for computer science.

I’ll go into details in following posts, but suffice to say the results were alarming and show that, whatever other effects may have played a part, and despite the very best efforts of all involved, very large latent bias clearly emerged during the process.

WebSci 2015 – WebSci and IoT panel

Posted on June 29, 2015 by alan

Sunshine on Keble quad, brings back memories of undergraduate days at Trinity, looking out toward the Wren Library.

Yesterday was the first day of WebSci 2015. I’m here largely as I’m giving my work on comparing REF outcomes with citation measures, “Citations and Sub-Area Bias in the UK Research Assessment Process”, at the workshop on “Quantifying and Analysing Scholarly Communication on the Web” on Tuesday.

However, yesterday I was also on a panel on “Web Science & the Internet of Things”.

These are some of the points I made in my initial positioning remarks. I talked partly about a few things skirting round the edge of the Internet of Things (IoT), and then about some concrete examples of IoT related things I’ve been involved with personally, using these to mention a few themes that emerge.

Not quite IoT

Talis

Many at WebSci will remember Talis from its SemWeb work. The SemWeb side of the business has now closed, but the education side, particularly Reading List software with relationships between who read what and how they are related, is definitely still clearly WebSci. However, the URIs (still RDF) of reading items are often books, items in libraries each marked with bar codes.

Years ago I wrote about barcodes as one of the earliest and most pervasive CSCW technologies (“CSCW — a framework“), the same could be said for IoT. It is interesting to look at the continuities and discontinuities between current IoT and these older computer-connected things.

The Walk

In 2013 I walked all around Wales, over 1000 miles. I would *love* to talk about the IoT aspects of this, especially as I was wired up with biosensors the whole way. I would love to do this, but can’t, because the idea of the Internet in West Wales and many rural areas is a bad joke. I could not even Tweet. When we talk about the IoT currently, and indeed anything with ‘Web’ or ‘Internet’ in its name, we have just excluded a substantial part of the UK population, let alone the world.

REF

Last year I was on the UK REF Computer Science and Informatics Sub-Panel. This is part of the UK process for assessing university research. According to the results it appears that web research in the UK is pretty poor. In the case of the computing sub-panel, the final result was the outcome of a mixed human and automated process, certainly an interesting HCI case study of socio-technical systems and not far from WebSci concerns.

This has very real effects on departmental funding and on hiring and investment decisions within universities. From the first printed cheque, computer systems have affected the real world; while there are differences in granularity and scale, some aspects of IoT are not new.

Later in the conference I will talk about citation-based analysis of the results, so you can see if web science really is weak science 😉

Clearly IoT

Three concrete IoT things I’ve been involved with:

Firefly

While at Lancaster Jo Finney and I developed tiny intelligent lights. After more than ten years these are coming into commercial production.

Imagine a Christmas tree, and put a computer behind each and every light – that is Firefly. Each light becomes a single-pixel network computer, which seems like technological overkill, but because the digital technology is commoditised, suddenly the physical structures of wires and switches is simplified – saving money and time and allowing flexible and integrated lighting.

Even early prototypes had thousands of computers in a few square metres. Crucially too the higher level networking is all IP. This is solid IoT territory. However, like a lot of smart-dust and sensing technology, it is based around homogeneous devices and is still, despite computational autonomy, largely centrally controlled.

While it may be another 10 years before it makes the transition from large-scale display lighting to domestic scale, we always imagined domestic scenarios. Picture the road, each house with a Christmas tree in its window, all Firefly and all connected to the internet, light patterns moving from house to house in waves, coordinated twinkling from window to window glistening in the snow. Even in this technology issues of social interaction and trust begin to emerge.

FitBit

My wife has a FitBit. Clearly both an IoT technology and a WebSci phenomenon, with millions of people connecting their devices into FitBit’s data sharing and social connection platform.

The week before WebSci we were on holiday, and we were struggling to get her iPad’s mobile data working. The Vodafone website is designed around phones, and still (how many iPads!) misses crucial information essential for data-only devices.

The FitBit’s alarm had been set for an early hour to wake us ready to catch the ferry. However, while the FitBit app on the iPad and the FitBit talk to one another via Bluetooth, the app will not control the alarm unless it is Internet connected. For the first few mornings of our holiday at 6am each morning …

Like my experience on the Wales walk, the software assumes constant access to the web and fails when this is not present.
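One alternative design is to let local control work regardless of connectivity and treat the web as something to catch up with later. The sketch below is purely illustrative: the class and method names are invented and it says nothing about how the FitBit app is actually built.

```python
class LocalFirstAlarm:
    """Applies changes to the nearby device at once; syncs to the web when it can."""

    def __init__(self):
        self.alarm_time = None  # state held on the device itself
        self.unsynced = []      # changes the web service has not yet seen

    def set_alarm(self, time_text, online):
        # The local change never depends on the internet being available.
        self.alarm_time = time_text
        self.unsynced.append(time_text)
        if online:
            self.sync()

    def sync(self):
        """Pretend to upload queued changes; returns what would be sent."""
        sent = list(self.unsynced)
        self.unsynced.clear()
        return sent


alarm = LocalFirstAlarm()
alarm.set_alarm("06:00", online=False)  # on holiday, no mobile data
alarm.set_alarm(None, online=False)     # alarm switched off, still offline
print(alarm.alarm_time, alarm.unsynced)
print(alarm.sync())                     # back online: queued changes go out
```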

Tiree Tech Wave

I run a twice a year making, talking and thinking event, Tiree Tech Wave, on the Isle of Tiree. A wide range of things happen, but some are connected with the island itself and a number of island/rural based projects have emerged.

One of these projects, OnSupply, looked at awareness of renewable power, as the island has a community wind turbine, Tilly, and at the emergence of SmartGrid technology. A large proportion of the houses on the island are not on modern SmartGrid technology, but do have storage heating controlled remotely for power demand balancing. However, this is controlled using radio signals, and switched across large areas. So at 4am each morning all the storage heating goes on and there is a peak. When, as happens occasionally, there are problems with the cable between the island and the mainland, the island’s backup generator has to deal with this surge, as it cannot be controlled locally. Again, issues of connectivity are deeply embedded in the system design.

We also have a small but growing infrastructure of displays and sensing.

We have, I believe, the world’s first internet-enabled shop open sign. When the café is open, the sign is on; this is broadcast to a web service, which can then be displayed in various ways. It is very important in a rural area to know what is open, as you might have to drive many miles to get to a café or shop.

We also use various data feeds from the ferry company, weather station, etc., to feed into public and web displays (e.g. TireeDashboard). That is, we have heterogeneous networks of devices and displays communicating through web APIs and services – good IoT and WebSci!

This is part of a broader vision of Open Data Islands and Communities, exploring how open data can be of value to small communities. On their own, open environments tend to be most easily used by the knowledgeable, wealthy and powerful, reinforcing rather than challenging existing power structures. We have to work explicitly to create structures and methods that make both IoT and the potential of the web truly of benefit to all.

Into the heart of darkness

Posted on April 2, 2015 by alan

Life is not all joy and fun, but often dark, depressing and painful.

Easter and Christmas are part of popular culture: Easter bunnies, Easter eggs, Christmas presents and Santa Claus. However, except for the odd Hot Cross Bun, Good Friday slips under the radar. The birth of a child and the glory of renewed life are images that are obvious causes for celebration, but a tale of abandonment followed by painful and bloody execution maybe has Gothic overtones, but is hardly party-worthy.

And yet, for those of different and no faith as well as for Christians, Good Friday touches issues of mythic as well as deeply personal significance.

Sometimes we simply need someone to share the darkness with us.

In days past the season of Lent with its fasting and sobriety helped build a sombre tension. Reading the Gospel accounts of Easter week is like one of those disaster films, where life appears to go on as normal, but with small and growing signs of the catastrophe to come. While we also know of the Easter story that follows, this does not shield us from the deep pit of despair that precedes it, “My God, My God, why have you forsaken me”¹.

I love Nina Bawden’s books for children, and often in them are very real and flawed characters who, while young, can sometimes cause real pain and harm; they dig at one’s own buried memories and knowledge of our own flaws.

The Easter story is full of such characters, Peter falling asleep as Jesus prayed in Gethsemane, just at the point he was needed most as a friend; and later, after Jesus was arrested, in fear for his own life, denying that he ever knew Jesus. Each time I read it part of me wants to shout at him, warn him, encourage him, knowing that in his position I would do the same.

And Judas, the friend turned betrayer. Just like the actions of Lubitz, who crashed the plane in the Alps, many have speculated on the reasons in Judas’ heart: disillusionment that Jesus was not going to oust the Romans, greed for the bribe of silver, self-destruction, or maybe simply that bitter rancour in the presence of someone better than ourselves.

The 1960s protest song “There but for fortune” talks of the prisoner, the hobo, the drunk and the war-torn, but now I often think of the Auschwitz guard, the Rwandan militia, the ISIS terrorist — what are the life chances and life choices that brought me to where I am compared to those that took them?

Reading of Judas his betrayal, his remorse, throwing the tainted money back at the Priests’ feet, and taking his own life — there but for fortune.

And it is no accident that the blood money, the price of a life, the price of Jesus’ life, was used to buy a burial ground for strangers and foreigners². The death of one who sought out the marginal, the poor, the disabled, and the ‘immoral’ buys a resting place for the same.

Jesus’ death on the cross is, of course, at the heart of Christian theology; Paul once wrote to early converts in Corinth, “I resolved to know nothing while I was with you except Jesus Christ and him crucified.”³. The main focus is often on sacrifice, both personal, “greater love has no man than he lay down his life for his friends”⁴, and also theological, cosmic atonement for sins.

However, as well as this message that Jesus died for us, there is also a parallel message, that Jesus died with us, alongside us in the darkest hour. This is the end point of the Christmas story, one who was “like us in every respect”⁵, entering the world, a tiny head crowned in the blood of childbirth, and leaving crowned by bloody thorns.

While the early Church was never in doubt as to the resurrection of Jesus, the completeness of this moment, Jesus dying, flanked by criminals and a weeping prostitute at his feet, is so intense that the earliest versions of Mark’s gospel rush through the Easter morning itself in a mere 8 verses, and end with the empty tomb, the astonished disciples and the words, “for they were afraid”⁶.

The Apostles’ Creed repeated in various forms across all Western churches says Jesus “was crucified, died and was buried; he descended into hell”. Hell is not an easy concept for the modern mind, filled with images of half-comic horned demons. But despite its B-movie connotations, and irrespective of whether you read it literally, figuratively or mythically, Hades, Gehenna, the pit are powerful images.

Hell of 1st century Palestine is not just for the lifeless shades of Greek Hades, but more like Tartarus, the place of damnation, the abode of the sinner. Peter says that Jesus “preached to the dead”⁷, and other authors simply that death could not ultimately hold him⁸, but all agree that for three days that was where Jesus was, not simply dying for and with us, but entering the very place of the Auschwitz guard, of Judas, of the ISIS killer, of our own deepest darkness, and sharing it.

The one of whom they said, “Here is a glutton and a drunkard, a friend of tax collectors and sinners”⁹, the one who spent his life with outcasts and prostitutes, would he be anywhere else?