principles vs guidelines

I was recently asked to clarify the difference between usability principles and guidelines.  Having written a page-full of answer, I thought it was worth popping it on the blog.

As with many things the boundary between the two is not absolute … and also the term ‘guidelines’ tends to get used differently at different times!

However, as a general rule of thumb:

  • Principles tend to be very general and would apply pretty much across different technologies and systems.
  • Guidelines tend to be more specific to a device or system.

As an example of the latter, look at the iOS Human Interface Guidelines on “Adaptivity and Layout”.  It starts with a general principle:

“People generally want to use their favorite apps on all their devices and in multiple contexts”,

but then rapidly turns that into more mobile-specific, and then iOS-specific, guidelines, talking first about different screen orientations, and then about specific iOS screen size classes.

I note that the definition on page 259 of Chapter 7 of the HCI textbook is slightly ambiguous.  When it says that guidelines are less authoritative and more general in application, it means in comparison to standards … although I’d now add a few caveats for the latter too!

Basically in terms of ‘authority’, from low to high:

  • lowest: principles – agreed by the community, but not mandated
  • guidelines – proposed by the manufacturer, but rarely enforced
  • highest: standards – mandated by a standards authority

In terms of general applicability, high to low:

  • highest: principles – very broad, e.g. ‘observability’
  • guidelines – more specific, but still allowing interpretation
  • lowest: standards – very tight

This ‘generality of application’ dimension is a little more complex, as guidelines are often manufacturer-specific and so arguably less ‘generally applicable’ than standards, but the range of situations that standards apply to is usually much tighter.

On the whole the more specific the rules, the easier they are to apply.  For example, the general principle of observability requires that the designer think about how it applies in each new application and situation. In contrast, a more specific rule that says, “always show the current editing state in the top right of the screen” is easy to apply, but tells you nothing about other aspects of system state.

Human-Like Computing

Last week I attended an EPSRC workshop on “Human-Like Computing”.

The delegate pack offered a tentative definition:

“offering the prospect of computation which is akin to that of humans, where learning and making sense of information about the world around us can match our human performance.” [E16]

However, the purpose of this workshop was to clarify and expand on this, exploring what it might mean for computers to become more like humans.

It was an interdisciplinary meeting with some participants coming from more technical disciplines such as cognitive science, artificial intelligence, machine learning and robotics; others from psychology or studying human and animal behaviour; and some, like myself, from HCI or human factors, bridging the two.

Why?

Perhaps the first question is why one might even want more human-like computing.

There are two obvious reasons:

(i) Because it is a good model to emulate — Humans are able to solve some problems, such as visual pattern finding, which computers find hard. If we can understand human perception and cognition, then we may be able to design more effective algorithms. For example, in my own work colleagues and I have used models based on spreading activation and layers of human memory when addressing ‘web scale reasoning’ [K10,D10] (see the toy sketch below).

(ii) For interacting with people — There is considerable work in HCI in making computers easier to use, but there are limitations. Often we are happy for computers to be simply ‘tools’, but at other times, such as when your computer notifies you of an update in the middle of a talk, you wish it had a little more human understanding. One example of this is recent work at Georgia Tech teaching human values to artificial agents by reading them stories! [F16]
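To give a flavour of the kind of human-memory-inspired algorithm mentioned in (i), here is a toy sketch of spreading activation over a small concept graph. This is purely illustrative, not the model used in [K10,D10]; the node names, weights, decay factor and threshold are all invented.

```python
# Toy spreading-activation sketch over a small concept graph (illustrative
# only; not the model from [K10, D10]).  Activation flows outwards from seed
# concepts along weighted links, attenuating as it spreads.

graph = {                                   # invented weighted links
    "Tiree":    {"island": 0.9, "Scotland": 0.8},
    "island":   {"beach": 0.7, "ferry": 0.6},
    "Scotland": {"ferry": 0.5},
}

def spread(seeds, steps=2, decay=0.5, threshold=0.05):
    """Return activation levels after propagating from the seed concepts."""
    activation = dict(seeds)                # concept -> activation level
    for _ in range(steps):
        new = dict(activation)
        for node, level in activation.items():
            for neighbour, weight in graph.get(node, {}).items():
                passed = level * weight * decay     # attenuated activation
                if passed > threshold:
                    new[neighbour] = new.get(neighbour, 0.0) + passed
        activation = new
    return activation

# Concepts related to the seed light up in proportion to their connection.
print(spread({"Tiree": 1.0}))
```

The real systems work over ontologies and personal context and include many refinements, but the core idea of activation flowing along weighted links is the same.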

To some extent (i) is simply the long-standing area of nature-inspired or biologically-inspired computing. However, the combination of computational power and psychological understanding means that perhaps we are at the point where new strides can be made. Certainly, the success of ‘deep learning’ and the recent computer mastery of Go suggest this. In addition, by my own calculations, for several years the internet as a whole has had more computational power than a single human brain, and we are very near the point when we could simulate a human brain in real time [D05b].

Both goals, but particularly (ii), suggest a further goal:

(iii) new interaction paradigms — We will need to develop new ways to design for interacting with human-like agents and robots, not least how to avoid the ‘uncanny valley’ and how to avoid the appearance of over-competence that has bedevilled much work in this broad area. (see more later)

Both goals also offer the potential for a fourth secondary goal:

(iv) learning about human cognition — In creating practical computational algorithms based on human qualities, we may come to better understand human behaviour, psychology and maybe even society. For example, in my own work on modelling regret (see later), it was aspects of the computational model that highlighted the important role of ‘positive regret’ (“the grass is greener on the other side”) in helping us avoid ‘local minima’, where we stick to the things we know and do not explore new options.

Human or superhuman?

Of course humans are not perfect; do we want to emulate their limitations and failings?

For understanding humans (iv), the answer is probably “yes”, and maybe by understanding human fallibility we may be in a better position to predict and prevent failures.

Similarly, for interacting with people (ii), the agents should show at least some level of human limitations (even if ‘put on’); for example, a chess program that always wins would not be much fun!

However, for simply improving algorithms, goal (i), we may want to take the ‘best bits’ from human cognition and merge them with the best aspects of artificial computation. Of course it may be that the frailties are also the strengths; for example, the need to come to decisions and act in relatively short timescales (in terms of brain ‘ticks’) may be one way in which we avoid ‘over learning’, a common problem in machine learning.

In addition, the human mind has developed to work with the nature of neural material as a substrate, and the physical world, both of which have shaped the nature of human cognition.

Very simple animals learn purely by Skinner-like response training, effectively what AI would term sub-symbolic. However, this level of learning requires many exposures to similar stimuli. For rarer occurrences, which do not happen frequently within a lifetime, learning must be at the very slow pace of the genetic development of instincts. In contrast, conscious reasoning (symbolic processing) allows us to learn through a single or very small number of exposures; ideal for infrequent events or novel environments.
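As a toy contrast (my own illustration, not something from the workshop): a sub-symbolic learner nudges a numeric estimate a little on every exposure, so it needs many trials, whereas a symbolic learner can adopt an explicit rule after a single event.

```python
# Toy contrast between sub-symbolic and symbolic learning (illustrative only).

# Sub-symbolic: a running estimate nudged slightly on each of many exposures.
danger_estimate = 0.0
LEARNING_RATE = 0.1
for _ in range(50):                          # many similar exposures needed
    observed = 1.0                           # each trial says "dangerous"
    danger_estimate += LEARNING_RATE * (observed - danger_estimate)
print(round(danger_estimate, 2))             # creeps towards 1.0 (about 0.99)

# Symbolic: an explicit rule learnt from a single, possibly novel, event.
rules = set()
rules.add(("red_berries", "poisonous"))      # one exposure is enough
print(("red_berries", "poisonous") in rules) # True
```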

Big Data means that computers effectively have access to vast amounts of ‘experience’, and researchers at Google have remarked on the ‘Unreasonable Effectiveness of Data’ [H09] that allows problems, such as translation, to be tackled in a statistical or sub-symbolic way which previously would have been regarded as essentially symbolic.

Google are now starting to recombine statistical techniques with more knowledge-rich techniques in order to achieve better results again. As humans we continually employ both types of thinking, so there are clear human-like lessons to be learnt, but the eventual system will not have the same ‘balance’ as a human.

If humans had developed with access to vast amounts of data and maybe other people’s experience directly (rather than through culture, books, etc.), would we have developed differently? Maybe we would do more things unconsciously that we do consciously. Maybe with enough experience we would never need to be conscious at all!

More practically, we need to decide how to make use of this additional data. For example, learning analytics is becoming an important part of educational practice. If we have an automated tutor working with a child, how should we make use of the vast body of data about other tutors’ interactions with other children? Should we have a very human-like tutor that effectively ‘reads’ learning analytics just as a human tutor would look at a learning ‘dashboard’? Alternatively, we might have a more loosely human-inspired ‘hive-mind’ tutor that ‘instinctively’ makes pedagogic choices based on the overall experience of all tutors, but maybe in an unexplainable way?

What could go wrong …

There have been a number of high-profile statements in the last year about the potential coming ‘singularity’ (when computers are clever enough to design new computers leading to exponential development), and warnings that computers could become sentient, Terminator-style, and take over.

There was general agreement at the workshop that this kind of risk is overblown and that, despite breakthroughs such as the mastery of Go, these systems are still very domain limited. It will be many years before we have to worry about even general intelligence in robots, let alone sentience.

A far more pressing problem is that of incapable computers, which make silly mistakes, and the way in which people, maybe because of the media attention to the success stories, assume that computers are more capable than they are!

Indeed, over-confidence in algorithms is not just a problem for the general public, but also for computing academics, as I found from personal experience on the REF panel.

There are of course many ethical and legal issues raised as we design computer systems that are more autonomous. This is already being played out with driverless cars, with issues of insurance and liability. Some legislators are suggesting allowing driverless cars, but only if there is a driver there to take control … but if the car relinquishes control, how do you safely manage the abrupt change?

Furthermore, while the vision of autonomous robots taking over the world is still far-fetched, more surreptitious control is already with us. Whether it is Uber cabs called by algorithm, or simply Google’s ranking of search results prompting particular holiday choices, we are all, to varying extents, doing “what the computer tells us”. I recall that in the Dalek Invasion of Earth, the very un-human-like Daleks could not move easily amongst the rubble of war-torn London. Instead they used ‘hypnotised men’ controlled by some form of neural headset. If the Daleks had landed today and simply taken over or digitally infected a few cloud computing services, would we know?

Legibility

Sometimes it is sufficient to have a ‘black box’ that makes decisions and acts. So long as it works we are happy. However, a key issue for many ethical and legal questions, but also for practical interaction, is the ability to interrogate a system, to seek explanations of why a decision has been made.

Back in 1992 I wrote about these issues [D92], in the early days when neural networks and other forms of machine learning were being proposed for a variety of tasks from controlling nuclear fusion reactions to credit scoring. One particular scenario was if an algorithm were used to pre-sort large numbers of job applications. How could you know whether the algorithm was being discriminatory? How could a company using such algorithms defend themselves if such an accusation were brought?

One partial solution then, as now, was to accept that the underlying learning mechanisms may involve emergent behaviour from statistical, neural network or other forms of opaque reasoning. However, this opaque initial learning process should give rise to an intelligible representation. This is rather akin to a judge who might have a gut feeling that a defendant is guilty or innocent, but needs to explicate that in a reasoned legal judgement.

This approach was exemplified by Query-by-Browsing, a system that creates queries from examples (using a variant of ID3), but then converts these into SQL queries. This was subsequently implemented [D94], and is still running as a web demonstration.
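Query-by-Browsing has its own implementation; the following is just a minimal sketch of the general idea of turning a learnt decision tree into a SQL query. The tree, table and field names here are invented for illustration and are not QbB’s actual code.

```python
# Sketch of turning a learnt decision tree into SQL (not QbB's actual code).
# A tree node is either the string "selected"/"rejected" (a leaf), or a tuple
# (field, threshold, below_subtree, above_subtree) meaning "field < threshold".

tree = ("salary", 20000,
            "rejected",
            ("age", 40, "selected", "rejected"))   # invented example tree

def tree_to_sql(node, conditions=()):
    """Collect the paths that lead to 'selected' leaves as SQL conditions."""
    if node == "selected":
        return ["(" + " AND ".join(conditions) + ")"]
    if node == "rejected":
        return []
    field, threshold, below, above = node
    return (tree_to_sql(below, conditions + (f"{field} < {threshold}",)) +
            tree_to_sql(above, conditions + (f"{field} >= {threshold}",)))

clauses = tree_to_sql(tree)
print("SELECT * FROM people WHERE " + " OR ".join(clauses))
# SELECT * FROM people WHERE (salary >= 20000 AND age < 40)
```

The intelligible representation here is the generated SQL: the opaque induction step can be checked, challenged and defended in terms a human can read.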

For many years I have argued that it is likely that our ‘logical’ reasoning arises precisely from this need to explain our own tacit judgement to others. While we simply act individually, or by observing the actions of others, this can remain largely tacit, but as soon as we want others to act in planned, collaborative ways, for example to kill a large animal, we need to convince them. Once we have the mental mechanisms to create these explanations, they become internalised, so that we end up with internal means to question our own thoughts and judgement, and even use them constructively to tackle problems more abstract and complex than those found in nature. That is, dialogue leads to logic!

Scenarios

We split into groups and discussed scenarios as a means to understand the potential challenges for human-like computing. Over multiple sessions the group I was in discussed one main scenario and then a variant.

Paramedic for remote medicine

The main scenario consisted of a patient far from a central medical centre, with an intelligent local agent communicating intermittently and remotely with a human doctor. Surprisingly the remote aspect of the scenario was not initially proposed by me thinking of Tiree, but by another member of the group thinking about some of the remote parts of the Scottish mainland.

The local agent would need to be able to communicate with the patient, be able to express a level of empathy, be able to physically examine (needing touch sensing and vision), and discuss symptoms. On some occasions, like a triage nurse, the agent might be sufficiently certain to be able to make a diagnosis and recommend treatment. However, at other times it may need to pass on to the remote doctor, being able to describe what had been done in terms of examination, symptoms observed and information gathered from the patient, in the same way that a paramedic does when handing over a patient to the hospital. However, even after the handover of responsibility, the local agent may still form part of the remote diagnosis, and may be able to take over again once the doctor has determined an overall course of action.

The scenario embodied many aspects of human-like computing:

  • The agent would require a level of emotional understanding to interact with the patient
  • It would require fine and situation-contingent robotic features to allow physical examination
  • Diagnosis and decisions would need to be guided by rich human-inspired algorithms based on large corpora of medical data, case histories and knowledge of the particular patient.
  • The agent would need to be able to explain its actions both to the patient and to the doctor. That is, it would not only need to transform its own internal representations into forms intelligible to a human, but do so in multiple ways depending on the inferred knowledge and nature of the person.
  • Ethical and legal responsibility are key issues in medical practice
  • The agent would need to be able to manage handovers of control.
  • The agent would need to understand its own competencies in order to know when to call in the remote doctor.

The scenario could be in physical or mental health. The latter is particularly important given recent statistics suggesting that only 10% of people in the UK suffering from mental health problems receive suitable help.

Physiotherapist

As a still more specific scenario, one of the group related how he had been to an experienced physiotherapist after a failed diagnosis by a previous physician. Rather than jumping straight into a physical examination, or even apparently watching the patient’s movement, the physiotherapist proceeded to chat for 15 minutes about aspects of the patient’s life, work and exercise. At the end of this process, the physiotherapist said, “I think I know the problem”, and proceeded to administer a directed test, which correctly diagnosed the problem and led to successful treatment.

Clearly the conversation had given the physiotherapist a lot of information about potential causes of injury, aided by many years observing similar cases.

To do this using an artificial agent would suggest some level of:

  • theory/model of day-to-day life

Thinking about the more conversational aspects of this I was reminded of the PhD work of Ramanee Peiris [P97]. This concerned consultations on sensitive subjects such as sexual health. It was known that when people filled in (initially paper) forms prior to a consultation, they were more forthcoming and truthful than if they had to provide the information face-to-face. This was even if the patient knew that the person they were about to see would read the forms prior to the consultation.

Ramanee’s work extended this first to electronic forms and then to chat-bot style discussions which were semi-scripted, but used simple textual matching to determine which topics had been covered, including those spontaneously introduced by the patient. Interestingly, the more human-like the system became, the more truthful and forthcoming the patients were, even though they were less so with a real human.

As well as revealing lessons for human interactions with human-like computers, this also showed that human-like computing may be possible with quite crude technologies. Indeed, even Eliza was treated (to Weizenbaum’s alarm) as if it really were a counsellor, even though people knew it was ‘just a computer’ [W66].

Cognition or Embodiment?

I think it fair to say that the overall balance, certainly in the group I was in, was towards the cognitivist: that is, a more Cartesian approach, starting with understanding and models of internal cognition, and then seeing how these play out in external action. Indeed, the term ‘representation’ was used repeatedly as an assumed central aspect of any human-like computing, and there was even talk of resurrecting Newell’s project for a ‘unified theory of cognition’ [N90].

There did not appear to be any hard-core embodiment theorists at the workshop, although several people had sympathies. This was perhaps just as well, as we could easily have degenerated into well-rehearsed arguments for and against embodiment- versus cognition-centred explanations … not least about the critical word ‘representation’.

However, I did wonder whether a path that deliberately took embodiment as central would be valuable. How many human-like behaviours could be modelled in this way, taking external perception–action as central and only taking on internal representations when they were absolutely necessary (Andy Clark’s 007 principle) [C98]?

Such an approach would meet limits, not least the physiotherapist’s 15-minute chat, but I would guess it would be more successful over a wider range of behaviours and scenarios than we would at first think.

Human–Computer Interaction and Human-Like Computing

Both Russell and myself were partly there representing our own research interests, but also more generally as part of the HCI community, looking at the way human-like computing would intersect existing HCI agendas, or maybe create new challenges and opportunities (see poster below). It was certainly clear during the workshop that there is a substantial role for human factors, from fine motor interactions, to conversational interfaces and socio-technical systems design.

Russell and I presented a poster, which largely focused on these interactions.

[image: poster – Human–Computer Interaction and Human-Like Computing]

There are two sides to this:

  • understanding and modelling for human-like computing — HCI studies and models complex, real-world human activities and situations. Psychological experiments and models tend to be very deep and detailed, but narrowly focused and based on controlled, artificial tasks. In contrast, HCI’s broader, albeit shallower, approach and focus on realistic or even ‘in the wild’ tasks and situations may mean that we are in an ideal position to inform human-like computing.

  • human interfaces for human-like computing — As noted in goal (iii), we will need paradigms for humans to interact with human-like computers.

As an illustration of the first of these, the poster used my work on making sense of the apparently ‘bad’ emotion of regret [D05].

An initial cognitive model of regret was formulated involving a rich mix of imagination (in order to pull past events and actions to mind), counterfactual modal reasoning (in order to work out what would have happened), emotion (which is modified to feel better or worse depending on the possible alternative outcomes), and Skinner-like low-level behavioural learning (the eventual purpose of regret).

[figure: qualitative cognitive model of regret]

This initial descriptive and qualitative cognitive model was then realised in a simplified computational model, which had a separate ‘regret’ module that could be plugged into a basic behavioural learning system.  Both the basic system and the system with regret learnt, but with the regret module the system did so with between 5 and 10 times fewer exposures.  That is, the regret made a major improvement to the machine learning.

[figure: computational architecture – behavioural learning system with plug-in regret module]
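The model itself is described in [D05]; purely as an illustrative sketch of this kind of architecture — a basic reward learner with a pluggable module that adjusts the learning signal using the counterfactual ‘what might have been’ — one could write something like the following. The environment, parameters and the exact form of the regret adjustment are all invented here.

```python
import random

# Illustrative sketch only (not the model in [D05]): a basic behavioural
# learner plus a pluggable 'regret' module that adjusts the learning signal
# using the counterfactual outcome of the action not taken.

REWARDS = {"A": 0.3, "B": 0.8}        # hidden success rates of two actions

def payoff(action):
    return 1.0 if random.random() < REWARDS[action] else 0.0

def regret(actual, counterfactual):
    """Feel worse if the other choice would have done better, and vice versa."""
    return actual + 0.5 * (actual - counterfactual)

def learn(episodes, regret_module=None, rate=0.1, explore=0.1):
    value = {"A": 0.5, "B": 0.5}      # learner's estimates of each action
    for _ in range(episodes):
        action = max(value, key=value.get)
        if random.random() < explore:             # occasional exploration
            action = random.choice(list(value))
        signal = payoff(action)
        if regret_module is not None:             # plug-in adjusts the signal
            other = "B" if action == "A" else "A"
            signal = regret_module(signal, payoff(other))
        value[action] += rate * (signal - value[action])
    return value

print(learn(200))                 # basic behavioural learner
print(learn(200, regret))         # the same learner with regret plugged in
```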

Turning to the second: direct manipulation has been at the heart of interaction design since the PC revolution in the 1980s. Prior to that, command line interfaces (or worse, job control interfaces) suggested a mediated paradigm, where operators ‘asked’ the computer to do things for them. Direct manipulation changed that, turning the computer into a passive virtual world of computational objects on which you operated with the aid of tools.

To some extent we need to shift back to the 1970s mediated paradigm, but renewed, where the computer is no longer like a severe bureaucrat demanding precise grammatical and procedural requests, but instead a helpful and understanding aide. For this we can draw upon existing areas of HCI such as human–human communication, intelligent user interfaces, conversational agents and human–robot interaction.

References

[C98] Clark, A. 1998. Being There: Putting Brain, Body and the World Together Again. MIT Press. https://mitpress.mit.edu/books/being-there

[D92] A. Dix (1992). Human issues in the use of pattern recognition techniques. In Neural Networks and Pattern Recognition in Human Computer Interaction Eds. R. Beale and J. Finlay. Ellis Horwood. 429-451. http://www.hcibook.com/alan/papers/neuro92/

[D94] A. Dix and A. Patrick (1994). Query By Browsing. Proceedings of IDS’94: The 2nd International Workshop on User Interfaces to Databases, Ed. P. Sawyer. Lancaster, UK, Springer Verlag. 236-248.

[D05] A. Dix (2005). The adaptive significance of regret. Unpublished essay. https://alandix.com/academic/essays/regret.pdf

[D05b] A. Dix (2005). the brain and the web – a quick backup in case of accidents. Interfaces, 65, pp. 6-7. Winter 2005. https://alandix.com/academic/papers/brain-and-web-2005/

[D10] A. Dix, A. Katifori, G. Lepouras, C. Vassilakis and N. Shabir (2010). Spreading Activation Over Ontology-Based Resources: From Personal Context To Web Scale Reasoning. International Journal of Semantic Computing, Special Issue on Web Scale Reasoning: scalable, tolerant and dynamic. 4(1) pp.59-102. http://www.hcibook.com/alan/papers/web-scale-reasoning-2010/

[E16] EPSRC (2016). Human Like Computing Handbook. Engineering and Physical Sciences Research Council. 17–18 February 2016.

[F16] Alison Flood (2016). Robots could learn human values by reading stories, research suggests. The Guardian, Thursday 18 February 2016 http://www.theguardian.com/books/2016/feb/18/robots-could-learn-human-values-by-reading-stories-research-suggests

[H09] Alon Halevy, Peter Norvig, and Fernando Pereira. 2009. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems 24, 2 (March 2009), 8-12. DOI=10.1109/MIS.2009.36

[K10] A. Katifori, C. Vassilakis and A. Dix (2010). Ontologies and the Brain: Using Spreading Activation through Ontologies to Support Personal Interaction. Cognitive Systems Research, 11 (2010) 25–41. https://alandix.com/academic/papers/Ontologies-and-the-Brain-2010/

[N90] Allen Newell. 1990. Unified Theories of Cognition. Harvard University Press, Cambridge, MA, USA. http://www.hup.harvard.edu/catalog.php?isbn=9780674921016

[P97] DR Peiris (1997). Computer interviews: enhancing their effectiveness by simulating interpersonal techniques. PhD Thesis, University of Dundee. http://virtual.inesc.pt/rct/show.php?id=56

[W66] Joseph Weizenbaum. 1966. ELIZA—a computer program for the study of natural language communication between man and machine. Commun. ACM 9, 1 (January 1966), 36-45. DOI=http://dx.doi.org/10.1145/365153.365168

REF Redux 6 — Reasons and Remedies

This, the last of my series of posts on post-REF analysis, asks what went wrong and what could be done to improve things in future.

Spoiler: a classic socio-technical failure story: compromising the quality of human processes in order to feed an algorithm

As I’ve noted multiple times, the whole REF process and every panel member was focused around fairness and transparency, and yet still the evidence is that quite massive bias emerged. This is evident in my own analysis of sub-area and institutional differences, and also in HEFCE’s own report, which highlighted gender differences.

Summarising some of the effects we have seen in previous posts:

  1. sub-areas: When you rank outputs within their own areas worldwide, theoretical papers ranked in the top 5% (top 1 in 20) worldwide get a 4*, whereas more applied, human-centric papers need to be in the top 0.5% (top 1 in 200) – a ten-fold difference (REF Redux 2)
  2. institutions: Outputs that appear equivalent in terms of citation are ranked more highly in Russell Group universities compared with other old (pre-1992) universities, and both higher than new (post-1992) universities.  If two institutions have similar citation profiles, the Russell Group one, on average, would receive 2-3 times more money per member of staff than the equivalent new university (REF Redux 4)
  3. gender: A male academic in computing is 33% more likely to get a 4* than a female academic, and this effect persists even when other factors are considered (HEFCE report “The Metric Tide”). Rather than explicit bias, I believe this is likely to be an implicit bias due to the higher proportions of women in sub-areas disadvantaged by REF (REF Redux 5)

These are all quite shocking results, not so much that the differences are there, but because of the size.

Before being a computer scientist I was trained as a statistician.  In all my years both as a professional statistician, and subsequently as an HCI academic engaged in or reviewing empirical work, I have never seen effect sizes this vast.

What went wrong?

Note that this analysis is all for sub-panel 11 Computer Science and Informatics. Some of the effects (in particular institutional bias) are probably not confined to this panel; however, there are special factors in the processes we used in computing which are likely to have exacerbated latent bias in general and sub-area bias in particular.

As a computing panel, we of course used algorithms!

The original reason for asking submissions to include an ACM sub-area code was to automate reviewer allocation. This meant that while other panel chairs were still starting their allocation process, SP11 members already had their full allocations of a thousand or so outputs apiece. Something like 21,000 output allocations at the press of a button. Understandably this was the envy of other panels!

We also used algorithms for normalisation of panel members’ scores. Some people score high, some score low, some bunch towards the middle with few high and few low scores, and some score too much to the extremes.

This was also the envy of many other panel members. While we did discuss outputs where our scores varied substantially, we did not spend many hours debating whether a particular paper was 3* or 4*, or trying to calibrate ourselves precisely — the algorithm does the work. Furthermore, the process is transparent (we could even open source the code) and defensible — it is all in the algorithm, with no potentially partisan decisions.

Of course such an algorithm cannot simply compare each panel member with the average, as some panel members might have happened to have a better or worse set of outputs to review than others. In order to work there has to be sufficient overlap between panel members’ assessments so that they can be robustly compared. In order to achieve this overlap we needed to ‘spread our expertise’ in the assignment process, so that we each reviewed more papers slightly further from our core area of competence.
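I won’t reproduce the actual SP11 normalisation here, but the general idea of correcting for an assessor’s personal severity and spread can be sketched as a simple standardisation. The scores and target scale below are invented; in practice the common scale has to be established through the overlapping assessments just described.

```python
from statistics import mean, stdev

# Sketch of score normalisation (NOT the actual SP11 algorithm).  Each
# assessor's raw scores are corrected for their personal mean (severity or
# generosity) and spread (bunched vs extreme) onto a common target scale.

raw_scores = {                       # assessor -> raw scores (invented)
    "assessor_1": [3.8, 3.9, 4.0, 3.7, 3.9],     # generous and bunched
    "assessor_2": [1.0, 4.0, 2.0, 3.5, 1.5],     # spread to the extremes
}

TARGET_MEAN, TARGET_SD = 2.5, 0.8    # invented common scale; in reality this
                                     # must be estimated via overlapping marks

def normalise(scores):
    m, s = mean(scores), stdev(scores)
    return [TARGET_MEAN + TARGET_SD * (score - m) / s for score in scores]

for assessor, scores in raw_scores.items():
    print(assessor, [round(x, 2) for x in normalise(scores)])
```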

Panels varied substantially in the way they allocated outputs to reviewers. In STEM areas the typical output was an article of, say, 8–10 pages; in the humanities outputs were often books or portfolios; in performing arts there might even be a recording of a performance lasting hours. Clearly the style of reviewing varied. However most panels tried to assign two expert panellists to each output. In computing we had three assessors per output, compared with two in many areas (and in one sub-panel a single assessor per output). However, because of the expertise spreading, this typically meant one expert and two broader assessors per output.

For example, my own areas of core competence (Human-centered computing / Visualization and Collaborative and social computing) had between them 700 outputs, and there were two other assessors with strong knowledge in these areas. However, of the over 1000 outputs I assessed, barely one in six (170) were in these areas, that is only 2/3 more than if the allocation had been entirely random.

Assessing a broad range of computer science was certainly interesting, and I feel I came away with an understanding of the current state of UK computing that I certainly did not have before. Also having a perspective from outside a core area is very valuable especially in assessing the significance of work more broadly within the discipline.

This said, the downside is that the vast majority of assessments were outside our core areas, and it is thus not so surprising that default assessments (aka bias) became a larger aspect of the assessment. This is particularly problematic when there are differences in methodology; whereas it is easy to look at a paper with mathematical proofs in it and think “that looks rigorous”, it is hard for someone not used to interpretative methodologies to assess, for example, ethnography.

If the effects were not so important, it would be amusing to imagine the mathematics panel with statisticians, applied and pure mathematicians assessing each other’s work, or indeed formal computer science being assessed by pure mathematicians.

Note that the intentions were for the best, trying to make the algorithm work as well as possible; but the side effect was to reduce the quality of the human process that fed the algorithm. I recall that the first thing I ever learnt in computing was the mantra, “garbage in — garbage out”.

Furthermore, the assumption underlying the algorithm was that while assessors differed in their severity/generosity of marking and their ‘accuracy’ of marking, they were all equally good at all assessments. While this might be reasonable if we were all mainly marking within our own competence zone, it is clearly not valid given the breadth of assessment.  That is, the fundamental assumptions of the algorithm were broken.

This is a classic socio-technical failure story: in an effort to ‘optimise’ the computational part of the system, the overall human–computer system was compromised. It is reasonable for those working in more purely computational areas to have missed this; however, in retrospect, those of us with a background in this sort of issue should have foreseen problems (John 9:41), mea culpa.  Indeed, I recall that I did have reservations, but had hoped that any bad effects would average out given so many points of assessment.  It was only seeing first Morris Sloman’s analysis and then the results of my own that I realised quite how bad the distortions had been.

I guess we fell prey to another classic systems failure: not trialling, testing or prototyping a critical system before using it live.

What could be done better?

Few academics are in favour of metrics-only systems for research assessment, and, rather like democracy, it may be that the human-focused processes of REF are the worst possible solution apart from all the alternatives.

I would certainly have been of that view until seeing in detail the results outlined in this series. However, knowing what I do now, if there were a simple choice for the next REF between what we did and a purely metrics-based approach, I would vote for the latter. In every way that a pure metrics-based approach would be bad for the discipline, our actual process was worse.

However, the choice is not simply metrics vs human assessment.

In computing we used a particular combination of algorithm and human processes that amplified rather than diminished the effects of latent bias. This will have been particularly bad for sub-areas where differences in methodology lead to asymmetric biases. However, it is also likely to have amplified institution bias effects as when assessing areas far from one’s own expertise it is more likely that default cues, such as the ‘known’ quality of the institution, will weigh strongly.

Clearly we need to do this differently next time, and other panels definitely ought not to borrow SP11’s algorithms without substantial modification.

Maybe it is possible to use metrics-based approaches to feed into a human process in a way that is complementary. A few ideas could be:

  1. metrics for some outputs — for example we could assess older journal and conference outputs using metrics, combined with human assessment for newer or non-standard outputs
  2. metrics as under-girding – we could give outputs an initial grade based on metrics, which is then altered after reading, but where there is a differential burden of proof — easy to raise a grade (e.g. because of a badly chosen venue for a strong paper), but hard to bring it down (more exceptional reasons, such as citations saying “this paper is wrong”)
  3. metrics for in-process feedback — a purely human process as we had, but part way through calculate the kinds of profiles for sub-areas and institutions that I calculated in REF Redux 2, 3 and 4. At this point the panel would be able to decide what to do about anomalous trends, for example, individually examine examples of outputs.

There are almost certainly other approaches; the critical thing is that we must do better than last time.

level of detail – scale matters

We get used to being able to zoom into every document, picture and map, but part of the cartographer’s skill is putting the right information at the right level of detail.  If you took area maps and simply scaled them down, they would not make a good road atlas: the main motorways would hardly be visible, and the rest would look like a spider had walked all over it.  Similarly, if you zoom into a road atlas you discover that the narrow blue line of each motorway is in fact half a mile wide on the ground.

Nowadays we all use online maps that try to do this automatically.  Sometimes this works … and sometimes it doesn’t.

Here are three successive views of Google maps focused on Bournemouth on the south coast of England.

On the first view we see Bournemouth clearly marked, and on the next, zooming in a little, Poole, Christchurch and some smaller places also appear.  So far, so good; as we zoom in more, local names are shown as well as the larger places.

[images: Google Maps views of Bournemouth at two successive zoom levels]

However, zoom in one more level and something weird happens: Bournemouth disappears.  Poole and Christchurch are there, but no Bournemouth.

[image: Google Maps at the third zoom level – the Bournemouth label is missing]

However, looking at the same zoom level in another browser, Bournemouth is still there:

[image: the same zoom level in another browser – the Bournemouth label is shown]

The difference between the two is the Hotel Miramar.  In the first browser I am logged into Google mail, and so Google ‘knows’ I am booked to stay at the Hotel Miramar (presumably by scanning my email), and decides to display this as well.  The labels for Bournemouth and the hotel overlap, so Google simply omitted the Bournemouth one as less important than the hotel I am due to stay in.

A human map maker would undoubtedly have simply shifted the name ‘Bournemouth’ up a bit, knowing that it refers to the whole town.  In principle, Google Maps could do the same, but typically geocoding (e.g. GeoNames) simply gives a point for each location rather than an area, so it is not easy for the software to make such adjustments … except that Google clearly knows Bournemouth is ‘big’, as it is displayed on the first, zoomed-out view; so maybe it could have done better.

This problem of overlapping legends will be familiar to anyone involved in visualisation whether map based or more abstract.

[image: Cone Tree hierarchy visualisation]

The image above is the original Cone Tree hierarchy browser developed by Xerox PARC in the early 1990s [1].  This was in the early days of interactive 3D visualisation, and the Cone Tree exploited many of its advantages, such as a larger effective ‘space’ in which to place objects, and shadows giving both depth perception and a level of overview.  However, there was no room for text labels without them all running over each other.

Enter the Cam Tree:

[image: Cam Tree visualisation]

The Cam Tree is identical to the Cone Tree, except that because it is on its side it is easier to place labels without them overlapping 🙂

Of course, with the Cam Tree the regularity of the layout makes it easy to have a single solution.  The problem with maps is that labels can appear anywhere.

This is an image of a particularly cluttered part of the Frasan mobile heritage app developed for the An Iodhlann archive on Tiree.  Multiple labels overlap, making them unreadable.  I should note that the large number of names only appears when the map is zoomed in, but when they do appear, there are clearly too many.

[image: Frasan map view with overlapping labels]

It is far from clear how to deal with this best.  The Google solution was simply to not show some things, but as we’ve seen that can be confusing.

Another option would be to make the level of detail that appears depend not just on the zoom, but also on the local density.  In the Frasan map the locations of artefacts are not shown when zoomed out and only appear when zoomed in; it would be possible for them to appear, at first, only in the less cluttered areas, and appear in busier areas only when the map is zoomed in sufficiently for them to space out.  This would trade clutter for inconsistency, but might be worthwhile.  The bigger problem would be knowing whether there were more things to see.

Another solution is to group things in busy areas.  The two maps below are from house-listing sites.  The first is Rightmove, which uses a Google map in its map view.  Note how the house icons all overlap one another.  Of course, the nature of houses means that if you zoom in sufficiently they start to separate, but the initial view is very cluttered.  The second is daft.ie; note how some houses are shown individually, but when they get too close they are grouped together and just the number of houses in the group is shown.

[images: Rightmove map with overlapping house icons; daft.ie map with grouped house markers]

A few years ago, Geoff Ellis and I reviewed a number of clutter reduction techniques [2], each with advantages and disadvantages; there is no single ‘best’ answer. The daft.ie grouping solution works for icons, which are fixed size and small; the text label layout problem is far harder!
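As a sketch of the grouping idea (I don’t know daft.ie’s actual algorithm; a simple grid-based approach is assumed here), icons whose screen positions fall into the same cell can be collapsed into a single marker showing a count:

```python
from collections import defaultdict

# Sketch of grid-based icon grouping (illustrative; not daft.ie's actual
# algorithm).  Houses whose screen positions fall in the same grid cell are
# collapsed into a single marker showing a count.

CELL = 40                                   # cell size in pixels (invented)

def group_markers(points):
    cells = defaultdict(list)
    for x, y in points:
        cells[(x // CELL, y // CELL)].append((x, y))
    markers = []
    for members in cells.values():
        if len(members) == 1:
            markers.append(("house", members[0]))           # show individually
        else:
            cx = sum(p[0] for p in members) / len(members)  # centroid of group
            cy = sum(p[1] for p in members) / len(members)
            markers.append((f"{len(members)} houses", (cx, cy)))
    return markers

print(group_markers([(10, 12), (15, 18), (200, 150)]))
# [('2 houses', (12.5, 15.0)), ('house', (200, 150))]
```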

Maybe someday these automatic tools will be able to cope with the full variety of layout problems that arise, but for the time being this is one area where human cartographers still know best.

  1. Robertson, G. G.; Mackinlay, J. D.; Card, S. K. (1991). Cone Trees: animated 3D visualizations of hierarchical information. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI ’91); 1991 April 27 – May 2; New Orleans, LA. NY: ACM; 189-194.[back]
  2. Geoffrey Ellis and Alan Dix. 2007. A Taxonomy of Clutter Reduction for Information Visualisation. IEEE Transactions on Visualization and Computer Graphics 13, 6 (November 2007), 1216-1223. DOI=10.1109/TVCG.2007.70535[back]

REF Redux 5 – growing the gender gap

This fifth post in the REF Redux series looks at gender issues, in particular the likelihood that the apparent bias in computing REF results will disproportionately affect women in computing. While it is harder to find full data for this, a HEFCE post-REF report has already done a lot of the work.

Spoiler:   REF results are exacerbating implicit gender bias in computing

A few weeks ago a female computing academic shared how she had been rejected for a job; in informal feedback she heard that her research area was ‘shrinking’.  This seemed likely to be due to the REF sub-area profiles described in the first post of this series.

While this is a single example, I am aware that recruitment and investment decisions are already adjusting widely due to the REF results, so that any bias or unfairness in the results will have an impact ‘on the ground’.

[image: Google image search for “computing professor”]

In fact, gender and other equality issues were explicitly addressed in the REF process, with submissions asked what equality processes, such as Athena SWAN, they had in place.

This is set in the context of a large gender gap in computing. Despite there being more women undergraduate entrants than men overall, only 17.4% of computing first degree graduates are female and this has declined since 2005 (Guardian datablog based on HESA data).  Similarly only about 20% of computing academics are female (“Equality in higher education: statistical report 2014“), and again this appears to be declining:

[chart: proportion of female academic staff in computer science]

from “Equality in higher education: statistical report 2014”, table 1.6 “SET academic staff by subject area and age group”

The imbalance in application rates for research funding is also an issue that the European Commission has investigated in “The gender challenge in research funding: Assessing the European national scenes”.

HEFCE commissioned a post-REF report, “The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management”, which includes substantial statistics concerning the REF results and models of fit to various metrics (not just citations). Helpfully, Fran Amery, Stephen Bates and Steve McKay used these to create a summary of “Gender & Early Career Researcher REF Gaps” in different academic areas.  While far from the largest, Computer Science and Informatics is in joint third place in terms of the gender gap as measured by 4* outputs.

Their data comes from the HEFCE report’s supplement on “Correlation analysis of REF2014 scores and metrics“, and in particular table B4 (page 75):

[table: extract of Table B4]

Extract of “Table B4 Summary of submitting authors by UOA and additional characteristics” from “The Metric Tide : Correlation analysis of REF2014 scores and metrics”

This shows that while 24% of outputs submitted by males were ranked 4*, only 18% of those submitted by females received a 4*.  That is, a male member of staff in computing is 33% more likely to get a 4* than a female (24/18 ≈ 1.33).

Now this could be due to many factors, not least the relative dearth of female senior academics reported by HESA (“Age and gender statistics for HE staff”).

[chart: HESA academic staff gender balance – professors vs senior academics vs other academics]

extract of HESA graphic “Staff at UK HE providers by occupation, age and sex 2013/14” from “Age and gender statistics for HE staff”

However, the HEFCE report goes on to compare this result with metrics, in a similar way to my own analysis of subareas and institutional effects.  The report states (my emphasis) that:

Female authors in main panel B were significantly less likely to achieve a 4* output than male authors with the same metrics ratings. When considered in the UOA models, women were significantly less likely to have 4* outputs than men whilst controlling for metric scores in the following UOAs: Psychology, Psychiatry and Neuroscience; Computer Science and Informatics; Architecture, Built Environment and Planning; Economics and Econometrics.

That is, for outputs that look equally good from metrics, those submitted by men are more likely to obtain a 4* than those submitted by women.

Having been on the computing panel, I never encountered any incidents that would suggest any explicit gender bias.  Personally speaking, although outputs were not anonymous, the only time I was aware of the gender of authors was when I already knew them professionally.

My belief is that these differences are more likely to have arisen from implicit bias in terms of what is valued.  The Royal Society of Edinburgh report “Tapping our Talents” warns of the danger that “concepts of what constitutes ‘merit’ are socially constructed”, and the EU report “Structural change in research institutions” talks of “unconscious bias in assessing excellence”.  In both cases the context is recruitment and promotion procedures, but the same may well be true of the way we assess the results of research.

In previous posts I have outlined the way that the REF output ratings appear to selectively benefit theoretical areas compared with more applied and human-oriented ones, and old universities compared with new universities.

While I’ve not yet been able to obtain numbers to estimate the effects, in my experience the areas disadvantaged by REF are precisely those which have a larger number of women.  Also, again based on personal experience, I believe there are more women in new university computing departments than in old university departments.

It is possible that these factors alone may account for the male–female differences, although this does not preclude an additional gender bias.

Furthermore, if, as seems to be the case, the REF sub-area profiles are being used to skew recruitment and investment decisions, then women will be selectively disadvantaged in future, exacerbating the existing gender divide.

Note that this is not suggesting that recruitment decisions will be explicitly biased against women, but by unfairly favouring traditionally more male-dominated sub-areas of computing this will create or exacerbate an implicit gender bias.

Making the most of stakeholder interviews

Recently, I was asked for any tips or suggestions for stakeholder interviews.   I realised it was going to be more than would fit in the response to an IM message!

I’ll assume that this is purely for requirements gathering. For participatory or co-design, many of the same things hold, but there would be additional activities.

See also HCI book chapter 5: interaction design basics and chapter 13: socio-organizational issues and stakeholder requirements.

Kinds of knowing

First remember:

  • what they know – Whether the cleaner of a public lavatory or the CEO of a multi-national, they have rich experience in their area. Respect even the most apparently trivial comments.
  • what they don’t know they know – Much of our knowledge is tacit: things they know in the sense that they apply them in their day-to-day activities, but are not explicitly aware of knowing. Part of your job as interviewer is to bring this latent knowledge to the surface.
  • what they don’t know – You are there because you bring expertise and knowledge, most critically in what is possible; it is often hard for someone who has spent years in a job to see that it could be different.

People also find it easier to articulate ‘what’ compared with ‘why’ knowledge:

  • what – the objects, things and people involved in their job, and also the actions they perform, though even the latter can be difficult to articulate if too familiar
  • why – the underlying criteria, motivations and values that underpin their everyday activities

Be concrete

Most of us think best when we have concrete examples or situations to draw on, even if we are using these to describe more abstract concepts.

  • in their natural situation – People often find it easier to remember things if they are in the place and amongst the tools where they normally do them.
  • show you what they do – Being in their workplace also makes it easy for them to show you what they do – see “case study: Pensions printout” for an example of this: the pensions manager was only able to articulate how a computer listing was used when he could demonstrate using the card files in his office. Note this applies to physical things, and also digital ones (e.g. talking through the files on their computer desktop)
  • watch what they do – If circumstances allow, directly observe – often people omit the most obvious things, either because they assume they are known, or because they are too familiar and hence tacit. In “Early lessons – It’s not all about technology”, the (1960s!) systems analyst realised that it was the operators’ fear of getting their clothes dirty that was slowing down the printing machine; this was not because of anything any of the operators said, but what they were observed doing.
  • seek stories of past incidents – Humans are born story tellers (listen to a toddler). If asked to give abstract instructions or information we often struggle.
  • normal and exceptional stories – both are important. Often if asked about a process or procedure the interviewee will give the normative or official version of what they do. This may be because they don’t want to admit to unofficial methods, or maybe they think of the task in normative terms even though they never actually do it that way. Ask for ‘war stories’ of unusual, exceptional or problematic situations.
  • technology probes or envisioned scenarios – Although it may be hard to envisage new situations, if potential futures are presented in an engaging and concrete manner, then we are much more able to see ourselves in them, maybe using a new system, and say “but no that wouldn’t work.”  (see more at hcibook online! “technology probes“)

Estrangement

As noted the stakeholder’s tacit knowledge may be the most important. By seeking out or deliberately creating odd or unusual situations, we may be able to break out of this blindness to the normal.

  • ask about other people’s jobs – As well as asking a stakeholder about what they do, ask them about other people; they may notice things about others better than the other people do themselves.
  • strangers / new folk / outsiders – Seek out the new person, the temporary visitor from another site, or even the cleaner; all see the situation with fresh eyes.
  • technology probes or envisioned scenarios (again!) – As well as being able to say “but no that wouldn’t work”, we can sometimes say “but no that wouldn’t work, because …”
  • fantasy – When the aim is to establish requirements and gain understanding, there is no reason why an envisaged scenario need be realistic or even possible. Think SciFi and magic 🙂 For an extended example of this look at ‘Making Tea‘, which asked chemists to make tea as if it were a laboratory procedure!

Of course some of these, notably fantasy scenarios, may work better in some organisations than others!

Analyse

You need to make sense of all that interview data!

  • the big picture – Much of what you learn will be about what happens to individuals. You need to see how this all fits together (e.g. a Checkland / Soft Systems Methodology ‘Rich Picture’, or process diagrams). Dig beyond the surface to make sense of the underlying organisational goals … and how they may conflict with those of individuals or other organisations.
  • the details – Look for inconsistencies, gaps, etc., both within an individual’s own accounts and between different people’s viewpoints. This may highlight the differences between what people believe happens and what actually happens, or, as part of that, uncover the tacit.
  • the deep values – As noted, it is often hard for people to articulate the criteria and motivations that determine their actions. You could look for ‘why’ vocabulary in what they say or in written documentation, or attempt to ‘reverse engineer’ processes to find purposes. Unearthing values helps to uncover potential conflicts (above), but is also important when considering radical changes. New processes, methods or systems might completely change existing practices, but should still be consonant with the underlying drivers for those original practices. See the work on transforming musicological archival practice in the InConcert project for an example.

If possible you may wish to present these back to those involved; even if people are unaware of certain things they do or think, once these are presented to them, the flood gates open!  If your stakeholders are hard to interview, maybe because they are senior, or far away, or because you only have limited access, then if possible do some level of analysis mid-way so that you can adjust future interviews based on past ones.

Prioritise

Neither you nor your interviewees have unlimited time; you need to have a clear idea of the most important things to learn – whilst of course keeping an open ear for things that are unexpected!

If possible plan time for a second round of some or all the interviewees after you have had a chance to analyse the first round. This is especially important as you may not know what is important until this stage!

Privacy, respect and honesty

You may not have total freedom in who you see, what you ask or how it is reported, but in so far as is possible (and maybe refuse unless it is) respect the privacy and personhood of those with whom you interact.

This is partly about good professional practice, but also efficacy – if interviewees know that what they say will only be reported anonymously they are more likely to tell you about the unofficial as well as the official practices! If you need to argue for good practice, the latter argument may hold more sway than the former!

In your reporting, do try to make sure that any accounts you give of individuals are ones they would be happy to hear. There may be humorous or strange stories, but make sure you laugh with not at your subjects. Even if no one else recognises them, they may well recognise themselves.

Of course do ensure that you are totally honest before you start in explaining what will and will not be related to management, colleagues, external publication, etc. Depending on the circumstances, you may allow interviewees to redact parts of an interview transcript, and/or to review and approve parts of a report pertaining to them.

REF Redux 4 – institutional effects

This fourth post in my REF analysis series compares computing sub-panel results across different types of institution.

Spoiler: new universities appear to have been disadvantaged in funding by at least 50%

When I first started analysing the REF results I expected a level of bias between areas; it is a human process and we all bring our own expectations, and ways of doing things. It was not the presence of bias that was shocking, but the size of the effect.

I had assumed that any bias between areas would have largely ‘averaged out’ at the level of Units of Assessment (or UoA, REF-speak typically corresponding to a department), as these would typically include a mix of areas. However, this had been assuming maybe a 10-20% difference between areas; once it became clear this was a huge 5-10 fold difference, the ‘averaging out’ argument was less certain.

The inter-area differences are crucially important, as emphasised in previous posts, for the careers of those in the disadvantaged areas, and for the health of computing research in the UK. However, so long as the effects averaged out, they would not affect the funding coming to institutions when algorithmic formulae are applied (as they are for all English universities, where HEFCE allocates so-called ‘QR’ funding based on weighted REF scores).

Realising how controversial this could be, I avoided looking at institutions for a long time, but it eventually became clear that it could not be ignored. In particular, as post-1992 universities (or ‘new universities’) often focus on more applied areas, I feared that they might have been consequentially affected by the sub-area bias.

It turns out that while this was right to an extent, in fact the picture is worse than I expected.

As each output is assigned to an institution it is possible to work out profiles for each institution based on the same measures as for the sub-areas (as described in the second and third posts in this series): various kinds of raw Scopus and Google Scholar citations, and the ‘world rankings’ adjustment using the REF contextual data tables.  Just as with the sub-areas, the different kinds of metrics all yield roughly similar results.

The main difference when looking at institutions compared to the sub-areas is that, of the 30 or so sub-areas, many are large enough (many hundreds of outputs) to examine individually with confidence that the numbers are statistically robust.  In contrast, there were around 90 institutions with UoA submissions in computing, many with fewer than 50 outputs assessed (10-15 people), getting towards the point where one would expect the citation measures to be imprecise for any single institution.

However, while the numbers for a single institution make it hard to say anything definitive (with a few exceptions such as UCL, Edinburgh and Imperial), we can reliably look for overall trends.

One of the simplest single measures is the GPA for each institution (a weighted sum with 4 for a 4*, 3 for a 3*, etc.), as this is a measure used in many league tables.  The REF GPA can be compared to the predicted GPA based on citations.
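For concreteness, here is the calculation in rough Python (a minimal sketch using an illustrative star profile, not any institution’s real figures):

def gpa(profile):
    # profile gives the percentage of outputs at each star level, e.g. {4: 22, 3: 47, 2: 26, 1: 4, 0: 1}
    return sum(stars * percent for stars, percent in profile.items()) / 100

example = {4: 22, 3: 47, 2: 26, 1: 4, 0: 1}   # illustrative profile only
print(gpa(example))                            # 2.85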

[Figure GPA-cites-vs-REF: citation-predicted GPA vs actual REF GPA for each institution]

While there is some scatter, which is to be expected given the size of each institution, there is also a clear tendency towards the diagonal.

Another measure frequently used is the ‘research power’, the GPA multiplied by the number of people in the submission.

[Figure power-cites-vs-REF: citation-predicted research power vs actual REF power for each institution]

This ‘stretches’ out the institutions and in particular makes the larger submissions (where the metrics are more reliable) stand out more.  It is not surprising that this is more linear, as the points are scaled by the same submission size whichever metric is used.  However, the fact that it clusters quite closely to the diagonal at first seems to suggest that, at the level of institutions, the computing REF results are robust.

However, while GPA is used in many league tables, funding is not based on GPA.  Where funding is formulaic (as it is with HEFCE for English universities), the combined measure is very heavily weighted towards 4*, with no money at all being allocated to 2* and 1*.

For RAE2008, the HEFCE weighting was approximately 3:1 between 4* and 3*, for REF2014 funding is weighted even more highly towards 4* at 4:1.
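In rough Python the money-related measure looks like this (a sketch only: illustrative numbers, and ignoring the other details of the real funding formula, such as subject cost weights):

def funding_score(profile, ratio=4):
    # 4* outputs weighted `ratio`, 3* weighted 1, nothing at all for 2*/1*
    return (ratio * profile[4] + profile[3]) / 100

example = {4: 22, 3: 47, 2: 26, 1: 4, 0: 1}   # illustrative profile only
print(funding_score(example))                  # REF2014-style 4:1 weighting -> 1.35
print(funding_score(example, ratio=3))         # RAE2008-style 3:1 weighting -> 1.13
print(funding_score(example) * 24.7)           # multiplied by FTE gives the 'power'-style version used below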

The next figure shows the equivalent of ‘power’ using a 4:1 ratio – roughly proportional to the amount of money under the HEFCE formula (although some of the institutions are not English, so will have different formulae applied).  Like the previous graphs, this plots the actual REF money-related power against the one predicted by citations.

[Figure weighted-power-cites-vs-REF-with-line: citation-predicted vs actual REF 4:1-weighted power, with prediction line]

Again the data is very spread out, with three very large institutions (UCL, Edinburgh and Imperial) on the upper right and the rest in more of a pack in the lower left.  UCL is dead on the line, but the next two institutions look like outliers, doing substantially better under REF than citations would predict; further down there is more of a spread, with some below and some above the line.

This massed group is hard to see clearly because of the stretching, so the following graph shows the non-volume-weighted results, that is the simple 4:1 weighted measure (which I have dubbed GPA#).  This is roughly proportional to money per member of staff, again with the citation-based prediction along the horizontal axis and the actual REF values on the vertical axis.

[Figure weighted-GPA-cites-vs-REF-with-line: citation-predicted vs actual REF 4:1-weighted GPA (GPA#), with prediction line]

The red line shows the prediction line.  There is a rough correlation, but also a lot of spread.  Given remarks earlier about the sizes of individual institutions this is to be expected.  The crucial issue is whether there are any systematic effects, or whether this is purely random spread.

The two green lines mark those UoAs with REF money-related scores 25% or more above prediction, the ‘winners’ (above, top left), and those with REF scores 25% or more below prediction, the ‘losers’ (below, lower right).

Of the 17 winners, 16 are pre-1992 (‘old’) universities, with just one post-1992 (‘new’) university.  Furthermore, of the 16 old-university winners, 10 come from the 24 Russell Group universities.

Of the 35 losers, 25 are post-1992 (‘new’) universities, and of the 10 ‘old’ university losers just one is a Russell Group institution.
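For those who want the mechanics, a minimal sketch of the winner/loser classification (hypothetical institutions and column names; the real analysis uses the citation-predicted scores described above):

import pandas as pd

# hypothetical data: one row per institutional submission
df = pd.DataFrame({
    "institution": ["A", "B", "C", "D"],
    "type": ["russell", "old", "new", "new"],
    "ref_score": [1.60, 1.10, 0.70, 0.95],    # actual 4:1 weighted REF measure
    "cite_score": [1.20, 1.05, 1.00, 0.90],   # citation-predicted measure
})

ratio = df["ref_score"] / df["cite_score"]
df["category"] = "neither"
df.loc[ratio >= 1.25, "category"] = "winner"   # 25% or more above prediction
df.loc[ratio <= 0.75, "category"] = "loser"    # 25% or more below prediction
print(pd.crosstab(df["type"], df["category"]))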

[Table contingency-table: counts of winners and losers by type of institution]

The exact numbers change depending on which precise metric one uses and whether one uses a 4:1, or 3:1 ratio, but the general pattern is the same.

Note this is not to do with who gets more or less money in total: whatever metric one uses, the new universities tend, on average, to be lower, the old ones higher and the Russell Group higher still.  The issue here is an additional benefit of reputation over and above this raw quality effect. For works that by external measures are of equal value, there appears to be at least a 50-100% added benefit if they are submitted from a more ‘august’ institution.

To get a feel for this, let’s look at a specific example: one of the big ‘winners’, YYYYYYYY, a Russell Group university, compared with one of the ‘losers’, XXXXXXXX, a new university.

As noted, one has to look at individual institutions with some caution as the numbers involved can be small, but XXXXXXXX is one of the larger (in terms of submission FTE) institutions in the ‘loser’ category, with 24.7 FTE and nearly 100 outputs.  It also happened (by chance) to sit only one row above YYYYYYYY on the spreadsheet, so it was easy to compare.  YYYYYYYY is even larger, at nearly 50 FTE and 200 outputs.

At 100 and 200 outputs, these are still, in size, towards the smaller end of the sub-area groups we were looking at in the previous two posts, so this should be taken as more illustrative of the overall trend, not a specific comment on these institutional submissions.

This time we’ll first look at the citation profiles for the two.

The spreadsheet fragment below shows the profiles using raw Scopus citation measures.  Note that in this table the right-hand column, the upper quartile, is the ‘best’ column.

[Table X-vs-Y-raw-cite-quartiles: raw Scopus citation quartile profiles for XXXXXXXX and YYYYYYYY]

The two institutions look comparable: XXXXXXXX is slightly higher in the very highest-cited papers, but the differences are effectively within the noise.

Similarly, we can look at the ‘world ranks’ as used in the second post.  Here the left-hand side is ‘best’, corresponding to the percentage of outputs that are within the best 1% of their area worldwide.

[Table X-vs-Y-world-ranks: world-ranking profiles for XXXXXXXX and YYYYYYYY]

Again XXXXXXXX is slightly above YYYYYYYY, but basically within noise.

If you look at other measures: for citations in ‘reliable years’ (2011 and older, where there has been more time to gather cites), XXXXXXXX looks a bit stronger; for Google-based citations, YYYYYYYY looks a bit stronger.

So, except for small variations, these two institutions, one a new university and one a Russell Group one, look comparable in terms of external measures.

However, the REF scores paint a vastly different picture.  The respective profiles are below:

[Table X-vs-Y-REF: REF star profiles for XXXXXXXX and YYYYYYYY]

Critically, the Russell Group YYYYYYYY has more than three times as many 4* outputs as the new university XXXXXXXX, despite being comparable in terms of external metrics.  As the 4* are heavily weighted the effect is that the GPA # measure (roughly money per member of staff) is more than twice as large.

Comparing using the world rankings table: for the new university XXXXXXXX only just over half of their outputs in the top 1% worldwide are likely to be ranked a 4*, whereas for YYYYYYYY nearly all outputs in the top 5% are likely to rank 4*.

As noted it is not generally reliable to do point comparisons on institutions as the output counts are low, and also XXXXXXXX and YYYYYYYY are amongst the more extreme winners and losers (although not the most extreme!).  However, they highlight the overall pattern.

At first I thought this institutional difference was due to the sub-area bias, but even when this was taken into account large institutional effects remained; there does appear to be an additional institutional bias.

The sub-area discrepancies will be partly due to experts from one area not understanding the methodologies and quality criteria of other areas. However, the institutional discrepancy is most likely simply a halo effect.

As emphasised in previous posts, the computing sub-panels, and indeed everyone involved with the REF process, worked as hard as possible to ensure that the process was as fair and, insofar as was compatible with privacy, as transparent as possible.  However, we are human, and it is inevitable that to some extent when we see a paper from a ‘good’ institution we expect it to be good, and vice versa.

These effects may actually be relatively small individually, but the heavy weighting of 4* is likely to amplify even a small bias.  In most statistical distributions, relatively small shifts of the mean make large changes at the extremes.
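A quick back-of-the-envelope illustration (a normal distribution is used purely for illustration, not as a model of the actual score distributions):

from scipy.stats import norm

baseline = norm.sf(2.0)             # chance of clearing a threshold two standard deviations up: about 2.3%
shifted  = norm.sf(2.0, loc=0.25)   # the same threshold after a 0.25 sd shift in the mean: about 4.0%
print(shifted / baseline)           # the tail probability nearly doubles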

By focusing on 4*, in order to be more ‘selective’ in funding, it is likely that the eventual funding metric is more noisy and more susceptible to emergent bias.  Note how the GPA measure seemed far more robust, with REF results close to the citation predictions.

While HEFCE has shifted REF2014 funding more heavily towards 4*, the Scottish Funding Council has shifted slightly the other way, from 3.11:1 for RAE2008 to 3:1 for REF2014 (see THES: Edinburgh and other research-intensives lose out in funding reshuffle).  This has led to complaints that it is ‘defunding research excellence’.  To be honest, this shift will only marginally reduce institutional bias, but it at least appears more reliable than the English formula.

Finally, it should be noted that while there appear to be overall trends favouring Russell Group and old universities compared with post-1992 (new) universities; this is not uniform.  For example, UCL, with the largest ‘power’ rating and large enough that it is sensible to look at individually, is dead on the overall prediction line.

REF Redux 3 – plain citations

This third post in my series on the results of REF 2014, the UK periodic research assessment exercise, is still looking at subarea differences.  Following posts will look at institutional (new vs old universities) and gender issues.

The last post looked at world rankings based on citations normalised using the REF ‘contextual data’.  In this post we’ll look at plain unnormalised data.  To some extent this should be ‘unfair’ to more applied areas as citation counts tend to be lower, as one mechanical engineer put it, “applied work doesn’t gather citations, it builds bridges”.  However, it is a very direct measure.

The shocking thing is that while raw citation measures are likely to be biased against applied work, the REF results turn out to be even more so.

There were a number of factors that pushed me towards analysing the REF results using bibliometrics.  One was the fact that HEFCE were using them for comparison between sub-panels; another was that Morris Sloman’s analysis of the computing sub-panel results used Scopus and Google Scholar citations.

We’ll first look at the two relevant tables in Morris’ slides, one based on Scopus citations:

[Table Sloman-scopos-citation-table: REF star score vs Scopus citation quartile, from Morris Sloman’s slides]

and one based on Google Scholar citations:

[Table Sloman-google-scholar-citation-table: REF star score vs Google Scholar citation quartile, from Morris Sloman’s slides]

Both tables rank all outputs based on citations, divide these into quartiles, and then look at the percentage of 1*/2*/3*/4* outputs in each quartile.  For example, looking at the Scopus table, 53.3% of 4* outputs have citation counts in the top (4th) quartile.
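The construction of such a table is straightforward; a sketch in Python with hypothetical data (the real per-output scores are not public):

import pandas as pd

outputs = pd.DataFrame({
    "citations": [0, 3, 12, 45, 7, 120, 2, 30],   # invented citation counts
    "stars":     [1, 2, 3, 4, 2, 4, 3, 3],        # invented REF star scores
})

# rank outputs by citations and cut into quartiles
outputs["quartile"] = pd.qcut(outputs["citations"].rank(method="first"), 4,
                              labels=["Q1", "Q2", "Q3", "Q4"])

# percentage of each star level falling into each citation quartile (rows sum to 100)
print((pd.crosstab(outputs["stars"], outputs["quartile"], normalize="index") * 100).round(1))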

Both tables are roughly clustered towards the diagonal; that is, there is an overall correlation between citations and REF score, apparently validating the REF process.

There are, however, also off-diagonal counts.  At the top left are outputs that score well in REF but have low citations.  This is to be expected: non-article outputs such as books, software and patents may be important but typically attract fewer citations, and good papers may have been published in a poor choice of venue, leading to low citations.

More problematic is the lower right: outputs that have high citations but a low REF score.  There are occasional reasons why this might be the case, for example papers that are widely cited for being wrong, but these cases are rare (I do not recall any in those I assessed).  In general this region represents outputs that the respective communities have judged strong, but that the REF panel regarded as weak.  The numbers need care in interpretation, as only around 30% of outputs were scored 1* or 2* combined; even so, it means that around 10% of outputs in the top citation quartile were scored in the lower two categories and thus would not attract funding.

We cannot produce a table like the above for each sub-area as the individual scores for each output are not available in the public domain, and have been destroyed by HEFCE (for privacy reasons).

However, we can create quartile profiles for each area based on citations, which can then be compared with the REF 1*/2*/3*/4* profiles.  These can be found on the results page of my REF analysis micro-site.  Like the world rank lists in the previous post, there is a marked difference between the citation quartile profiles for each area and the REF star profiles.

One way to get a handle on the scale of the differences is to divide the proportion of REF 4* by the proportion of top-quartile outputs for each area.  Given that the proportion of 4* outputs is just over 22% overall, the top-quartile proportion in an area should be a good predictor of the proportion of 4* results in that area.
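In code this is a one-line ratio per area (the numbers below are purely illustrative):

areas = {
    # area: (percent of its outputs in the top citation quartile, percent rated 4* by REF)
    "illustrative formal area":  (20.0, 45.0),
    "illustrative applied area": (30.0, 12.0),
}

for name, (top_quartile, ref_4star) in areas.items():
    ratio = ref_4star / top_quartile
    print(f"{name}: REF 4* / top quartile = {ratio:.2f}")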

The following shows an extract of the full results spreadsheet:

[Table quartile-vs-REF: top citation quartile vs REF 4* proportion and their ratio, by sub-area]

The left hand column shows the percentage of outputs in the top quartile of citations; the column to the right of the area title is the proportion of REF 4*; and the right hand column is the ratio.  The green entries are those where the REF 4* results exceed those you would expect based on citations; the red those that get less REF 4* than would be expected.

While there are some areas (AI, Vision) for which the citations are an almost perfect predictor, there are others which obtain two to three times more 4*s under REF than one would expect based on their citation scores, ‘the winners’, and some where REF gives two to three times fewer 4*s than would be expected, ‘the losers’.  As is evident, the winners are the more formal areas, the losers the more applied and human-centric areas.  Remember again that, if anything, one would expect the citation measures to favour the more theoretical areas, which makes this difference all the more shocking.

Andrew Howes replicated the citation analysis independently using R and produced the following graphic, which makes the differences very clear.

[Figure scatter-citation-vs-REF-rank: sub-areas ranked by proportion of REF 4* vs ranked by proportion of top-quartile citations (Andrew Howes)]

The vertical axis has areas ranked by proportion of REF 4*; higher up means more highly rated by REF.  The horizontal axis shows areas ranked by proportion of citations in the top quartile.  If REF scores were roughly in line with citation measures, one would expect the points to lie close to the line of equal ranks; instead the areas are scattered widely.

That is, there seems little if any relation between quality as measured externally by citations and the quality measures of REF.

The contrast with the tables at the top of this post is dramatic.  If you look at outputs as a whole, there is a reasonable correspondence: outputs that rank higher in terms of citations tend to rank higher in REF star score, apparently validating the REF results.  However, when we compare areas, this correspondence disappears.  This apparent contradiction is probably because the correlation is very strong within each area; it is the areas themselves that are scattered.
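A tiny simulation (entirely synthetic data) shows how this can happen: if within every area REF scores track citations closely, but each area carries its own offset, the pooled output-level correlation remains clearly positive while the correlation between area averages is essentially noise:

import numpy as np

rng = np.random.default_rng(1)
n_areas, per_area = 12, 200
cites, ref, area = [], [], []
for a in range(n_areas):
    offset = rng.normal(0, 1.0)                       # area-level scoring bias
    c = rng.normal(0, 1.0, per_area)                  # citation strength of each output
    r = c + offset + rng.normal(0, 0.5, per_area)     # REF score tied to citations within the area
    cites.append(c); ref.append(r); area.append(np.full(per_area, a))
cites, ref, area = map(np.concatenate, (cites, ref, area))

print(np.corrcoef(cites, ref)[0, 1])                  # pooled output-level correlation: clearly positive
area_cites = [cites[area == a].mean() for a in range(n_areas)]
area_ref   = [ref[area == a].mean() for a in range(n_areas)]
print(np.corrcoef(area_cites, area_ref)[0, 1])        # area-level correlation: typically much weaker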

Looking at Andrew’s graph, it is clear that it is not a random scatter, but systematic; the winners are precisely the theoretical areas, and the losers the applied and human centred areas.

Not only is the bias against applied areas critical for the individuals and research groups affected, but it has the potential to skew the future of UK computing. Institutions with more applied work will be disadvantaged, and based on the REF results it is clear that institutions are already skewing their recruitment policies to match the areas which are likely to give them better scores in the next exercise.

The economic future of the country is likely to become increasingly interwoven with digital developments and related creative industries, and computing research is funded more generously than areas such as mathematics precisely because it is expected to contribute to this development — or, as the buzzword has it, ‘impact’.  However, the funding under REF within computing is weighted precisely against the very areas that are likely to contribute to the digital and creative industries.

Unless there is rapid action the impact of REF2014 may well be to destroy the UK’s research base in the areas essential for its digital future, and ultimately weaken the economic life of the country as a whole.

If you do accessibility, please do it properly

I was looking at Coca-Cola’s Rugby World Cup site1.

On the all-red web page the tooltip stood out, with the uninformative text, “headimg”.

[Screenshot coke-rugby-web-site-zoom: zoomed view of the Coca-Cola Rugby World Cup page showing the “headimg” tooltip]

Peeking in the HTML, this is in both the title and alt attributes of the image.

<img title="headimg" alt="headimg" class="cq-dd-image" 
     src="/content/promotions/nwen/....png">

I am guessing that the web designer was aware of the need for an alt tag for accessibility, and may even have been prompted to fill in the alt tag by the design software (Dreamweaver does this).  However, perhaps they just couldn’t think of an alternative text and so put anything in (although as the image consists of text, this does betray a certain lack of imagination!); they probably planned to come back later to do it properly.

As the micro-site is predominantly targeted at the UK, Coca-Cola are legally bound to make it accessible and so may well have run it through WCAG accessibility-checking software.  As the alt tag was present it will have passed W3C validation, even though the text is meaningless.  Indeed the web designer might have added the unhelpful text just to get the page to validate.

The eventual page is worse than useless: a blank alt tag would have meant the image was simply skipped, and at least the text “header image” would have been read as words, whereas “headimg” will be spelt out letter by letter.

Perhaps I am being unfair; I’m sure many of my own pages are worse than this … but then again I don’t have the budget of Coca-Cola!

More seriously, there are important lessons for process.  In particular, it is very likely that at the point the designer uploads an image they are prompted for the alt tag — this certainly happens with Dreamweaver.  However, at this point their focus is on getting the page looking right, as the client looking at the initial designs is unlikely to be using a screen reader.

Good design software should not just prompt for the right information, but at the right time.  It would be far better to make it easy to say “ask me later” and build up a to-do list, rather than demanding the information when the system wants it and risking the user entering anything just to ‘keep the system quiet’.

I call this the Micawber principle2 and it is a good general principle for any notification requiring user action.  Always allow the user to put things off, but also have the application keep track of pending work, and then make it easy for the user to see what needs to be done at a more suitable time.
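As a minimal sketch (a hypothetical API, not any particular design tool), the heart of the pattern is just a deferred to-do list:

from dataclasses import dataclass, field

@dataclass
class PendingItem:
    target: str      # what the question is about, e.g. an uploaded image
    question: str    # the information still needed, e.g. "Provide alt text"

@dataclass
class TodoList:
    items: list = field(default_factory=list)

    def ask_later(self, target, question):
        # record the outstanding question instead of blocking the user now
        self.items.append(PendingItem(target, question))

    def outstanding(self):
        # everything still awaiting the user's attention, to review at a better time
        return [f"{i.question} for {i.target}" for i in self.items]

todo = TodoList()
todo.ask_later("header.png", "Provide alt text")   # at upload time: defer, don't demand
print(todo.outstanding())                          # later: review what still needs doing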

  1. Largely because I was fascinated by the semantically questionable statement “Win one of up to 1 million exclusive Gilbert rugby balls.” (my emphasis).[back]
  2. From Dickens’ Mr Micawber, who was an arch procrastinator.  See Learning Analytics for the Academic: An Action Perspective, where I discuss this principle in the context of academic use of learning analytics.[back]

REF Redux 2 – world ranking of UK computing

This is the second of my posts on the citation-based analysis of REF, the UK research assessment process, in computer science. The first post set the scene and explained why citations are a valid means for validating (as opposed to generating) research assessment scores.

Spoiler:  for outputs of similar international standing it is ten times harder to get 4* in applied areas than more theoretical areas

As explained in the previous post, amongst the public domain data available is the complete list of all outputs (except a very small number of confidential reports). This does NOT include the actual REF 4*/3*/2*/1* scores, but does include Scopus citation data from late 2013 and Google Scholar citation data from late 2014.

From this, seven variations of citation metrics were used in my comparative analysis; essentially all give the same results.

For this post I will focus on one of them, which is perhaps the clearest, effectively turning citation data into world ranking data.

As part of the pre-submission materials, the REF team distributed a spreadsheet, prepared by Scopus, which lists for different subject areas the number of citations for the best 1%, 5%, 10% and 25% of papers in each area. These vary between areas, in particular more theoretical areas tend to have more Scopus counted citations than more applied areas. The spreadsheet allows one to normalise the citation data and for each output see whether it is in the top 1%, 5%, 10% or 25% of papers within its own area.
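A sketch of that normalisation step in Python (the threshold numbers here are invented; the real ones come from the Scopus-prepared spreadsheet and differ by subject area):

# invented per-area citation thresholds for the world top 1%, 5%, 10% and 25%
THRESHOLDS = {
    "more theoretical area": {1: 150, 5: 60, 10: 35, 25: 15},
    "more applied area":     {1: 90,  5: 40, 10: 22, 25: 9},
}

def world_band(area, citations):
    # return the world-ranking band for an output, normalised to its own area
    for pct in (1, 5, 10, 25):
        if citations >= THRESHOLDS[area][pct]:
            return f"top {pct}%"
    return "lower 75%"

print(world_band("more theoretical area", 100))   # top 5%
print(world_band("more applied area", 100))       # top 1%: same count, different band after normalisation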

The overall figure across REF outputs in computing is as follows:

Top 1%:     16.9%
Top 1-5%:   27.9%
Top 6-10%:  18.0%
Top 11-25%: 23.8%
Lower 75%:  13.4%

The first thing to note is that about 1 in 6 of the submitted outputs are in the top 1% worldwide, and not far short of a half (45%) are in the top 5%.   Of course these are the submitted, and hence best, publications, so one would expect them to score well, but still this feels like a strong indication of the quality of UK research in computer science and informatics.

According to the REF2014 Assessment criteria and level definitions, the definition of 4* is “quality that is world-leading in terms of originality, significance and rigour”, so these world citation rankings correspond very closely to “world leading”. In computing we allocated 22% of papers as 4*; that is, roughly, if a paper is in the top 1.5% of papers worldwide in its area it is ‘world leading’, which sounds reasonable.

The next level, 3* “internationally excellent”, covers a further 47% of outputs, so approximately the top 11% of papers worldwide, which again sounds a reasonable definition of “internationally excellent”, validating the overall quality criteria of the panel.

As the outputs include a sub-area tag, we can create similar world ‘league tables’ for each sub-area of computing, that is ranking the REF submitted outputs in each area amongst their own area worldwide:

[Table Cite-ranks: world citation-ranking profiles for each sub-area of computing]

As is evident there is a lot of variation, with some top areas (applications in life sciences and computer vision) having nearly a third of outputs in the top 1% worldwide, whilst other areas trail (mathematics of computing and logic), with only around 1 in 20 papers in the top 1%.

Human-computer interaction (my area) is split between two headings, “human-centered computing” and “collaborative and social computing”, which between them sit just above the mid point; AI is also in the middle, and the Web is in the top half of the table.

Just as with the REF profile data, this table should be read with circumspection – it is about the health of the sub-area overall in the UK, not about a particular individual or group which may be at the stronger or weaker end.

The long-tail argument (that weaker researchers and those in less research intensive institutions are more likely to choose applied and human-centric areas) of course does not apply to logic, mathematics and formal methods at the bottom of the table. However, these areas may be affected by a dilution effect as more discursive areas are perhaps less likely to be adopted by non-first-language English academics.

This said, the definition of 4* is “Quality that is world-leading in terms of originality, significance and rigour”, and so these world rankings seem as close as possible to an objective assessment of this.

It would therefore be reasonable to assume that this table would correlate closely to the actual REF outputs, but in fact this is far from the case.

Compare this to the REF sub-area profiles in the previous post:

[Table REF-ranks: REF star profiles for each sub-area, from the previous post]

Some areas lie at similar points in both tables; for example, computer vision is near the top of both tables (ranks 2 and 4) and AI a bit above the middle in both (ranks 13 and 11). However, some areas that are near the middle in terms of world rankings (e.g. human-centred computing (rank 14) and even some near the top (e.g. network protocols at rank 3) come out very poorly in REF (ranks 26 and 24 respectively). On the other hand, some areas that rank very low in the world league table come very high in REF (e.g. logic rank 28 in ‘league table’ compared to rank 3 in REF).

On the whole, areas that are more applied or human-focused tend to do a lot worse under REF than they appear when looked at in terms of their world rankings, whereas more theoretical areas seem to have inflated REF rankings. Those that are ‘traditional algorithmic computer science’ (e.g. vision, AI) are ranked similarly in REF and in the world rankings.

We will see other ways of looking at these differences in the next post, but one way to get a measure of the apparent bias is by looking at how high an output needs to be in world rankings to get a 4* depending on what area you are in.

We saw that on average, over all of computing, outputs that rank in the top 1.5% world-wide were getting 4* (world leading quality).

For some areas, for example, AI, this is precisely what we see, but for others the picture is very different.

In applied areas (e.g. web, HCI), an output needs to be in approximately the top 0.5% of papers worldwide to get a 4*, whereas in more theoretical areas (e.g. logic, formal, mathematics), a paper needs to only be in the top 5%.

That is, looking at outputs equivalent in ‘world leading’-ness (which is what REF is trying to measure), it is 10 times easier to get a 4* in theoretical areas than in applied ones.