the job of statistics

from the real world to measurement and back again

If you want to use statistics you need to learn how to do statistics: working out which tests to use and perhaps how to drive a stats package such as SPSS or R.

But why do this at all? What does statistics actually do?

Fundamentally statistics is about trying to learn dependable things about the real world based on measurements of it.

However, what we mean by ‘real’ is itself a little complicated, from the actual users you have tested to the hypothetical idea of a ‘typical user’ of your system.

We’ll start with the real world, but what is it?

the sample – First of all there is the actual data you have: results from an experiment, responses from a survey, log data from a deployed application. This is the real world. The user you tested at 3pm on a rainy day in March, after a slightly over-filling lunch, did make precisely three errors and finished the task in 17 minutes and 23 seconds. However, while this measured data is real, it is typically not what you really wanted to know. Would the same user, on a different day and under different conditions, have made the same errors? What about other users?

the population – Another idea of ‘real’, and one that may be what you really want to know about, is when there is a larger group of people you are interested in, say all the people in your company, or all users of product A. What would be the average (and variation in) error rate if all of them sat down and used the software you are testing? Or, as a more concrete kind of measurement, what is their average height?

The sample whose heights you actually measure is real data, but you are using it to find out about the population as a whole.

the ideal – However, while this idea of the actual population is very concrete, often the ‘real’ world you are actually interested in is slightly more nebulous. Even for the current users of product A, you are not interested in the error rate if they tried your new software today, but if they did so multiple times (maybe with the occasional memory-wiping pill administered) over a period – that is, a sort of ‘typical’ error rate when each uses the software.

Furthermore, it is not so much the actual set of users (not that you don’t care about them), but perhaps the typical user, especially for a new piece of software where you have no ‘real’ users yet.

Similarly, when you toss a coin you have an idea of the behaviour of a fair coin, which is not simply the complete collection of every coin in circulation. Even when you have tossed the coin, you can still think about the different ways it could have fallen, somehow reasoning about all the possible pasts and presents of an unrepeatable event.

Finally, this hypothetical ‘real’ event may be represented mathematically by a theoretical distribution such as the Normal distribution (for heights) or the Binomial distribution (for coin tosses).

In practice you rarely need to voice these things explicitly, but occasionally you do need to think carefully about it. If you have done a series of consistent blood tests you may know something very important about a particular individual, but not patients in general. If you are analysing big data you may know something very precise about your current users, and how they behave given a particular social context, and particular algorithms in your system, but not necessarily about potential users and how they may behave if your algorithms and environment change.

Once you know what the ‘real’ world you want to know about is, the job of statistics becomes clear.

You have taken some measurement, often of some sample of people and situations, and you want to use the measurements to understand the real world.

Given a sample of 20 heights of random people from your organisation, what can you infer about the heights of everyone? Given the error rates of 20 people on an artificial task in a lab, what can you tell about the behaviour of a typical user in their everyday situation?
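
To make this concrete, here is a minimal sketch in Python (with invented height values) of the inference from a sample of 20 people to the organisation as a whole: the sample mean, plus a 95% confidence interval for the population mean.

    import numpy as np
    from scipy import stats

    # hypothetical heights (cm) of 20 randomly chosen people in the organisation
    heights = np.array([162, 171, 168, 175, 180, 158, 166, 173, 169, 177,
                        164, 170, 181, 159, 172, 167, 174, 163, 176, 165])

    mean = heights.mean()
    sem = stats.sem(heights)   # standard error of the mean
    low, high = stats.t.interval(0.95, df=len(heights) - 1, loc=mean, scale=sem)

    print(f"sample mean {mean:.1f} cm; 95% CI for the population mean {low:.1f} to {high:.1f} cm")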

As is evident, answering these questions requires a combination of probability and common sense – and this is the job of statistics.

why are probability and statistics so hard?

Do you find probability and statistics hard? If so, don’t worry, it’s not just you; it’s basic human psychology.

We have two systems of thought: (i) subconscious reactions that are based on semi-probabilistic associations, and (ii) conscious thinking that likes to have one model of the world and is really bad at probability. This is why we need to use mathematics and other explicit techniques to help us deal with probabilities.

Statistics needs both this mathematics of probability and an understanding of what it means in the real world. Knowing this means you don’t have to feel bad about finding stats hard(!), but it also helps you find ways to make it easier.

Skinner’s famous experiments with pigeons showed how certain kinds of learning could be studied in terms of associations between stimuli and rewards. If you present a reward enough times with the behaviour you want, the pigeon will learn to do it even when the original reward no longer happens.

This low-level learning is semi-probabilistic in the sense that if rewards are more common the learning is faster or if rewards and penalties both happen at different frequencies, then you get a level of trade-off in the learning. At a cognitive level one can think of strengths of association being built up with rewards strengthening them and penalties inhibiting them.

This kind of learning is not quite a weighted sum of past experience: for example negative experiences typically count more than positive ones, and once a pattern is established it takes a lot to shift it. However, it is not so far from a probability estimate.

We humans share this subconscious learning process with other animals, and at some periods it has been used explicitly in education. It is powerful and leads to very rapid reactions, but it needs very large numbers of exposures to similar situations to establish memories.

Of course we are not just our subconscious! In addition we have conscious thinking and reasoning. This is powerful in that, amongst other things, we are able to learn from a single experience. Retrospectively we are able to retrieve even a single relevant past experience, compare it to what we are encountering now, and work out what to do based on it.

This is very powerful, but unlike our more unconscious sea of overlapping memories and associations, our conscious mind is linear and is normally locked into a single model of the world [1].

Because of this single model of the world this form of thinking is not so good at intuitively grasping probabilities, as is repeatedly evidenced by gambling behaviour and more broadly our assessment of risk.

One experiment uses cards with different coloured backs, red and blue [2]. In the experiment the cards are initially dealt to the subject face down and then turned over. Some cards have a reward, “you have won £5”, others a penalty, “sorry, you’ve lost half your winnings”. The packs differ in that one colour, let’s say blue, has more penalties, and the other a better balance of rewards.

After playing for a while the subjects realise that the packs are different and can tell you which is better.

However, the subjects are also wired up to a skin conductivity sensor as used in a lie detector. Well before they are able to say that one of the card colours is worse than the other, they show a response on the sensor – that is, subconsciously they know it is a bad card.

Although our conscious mind is not naturally good at dealing with probabilities, we are able to reason explicitly about them using mathematics. For example, if the subjects in the experiment had kept a tally of good and bad cards, they would have seen, in the numbers, that the red cards were better.

Some years ago, when I first did some statistics teaching, I remember learning that statistics was known to be particularly difficult to teach. This is in part because it requires a combination of maths and real-world thinking.

In statistics we use the explicit tallying of data and mathematical reasoning about probabilities to let us do quite complex reasoning from effects (measurements) back to causes (the real-world phenomena that are being measured).

So you do need to feel reasonably comfortable with this mathematics.

However, even if you are a whizz at maths, if you can’t relate this back to understanding about the real world, you are also stuck. It is a bit like the applied maths problems where people get so lost in the maths that they forget the units: “the answer is 42” – but 42 what? 42 degrees centigrade, 42 metres, or 42 bananas?

On the whole, those good at mathematics are not always good at relating their thinking back to the real world, and those of a more practical disposition are not always best at maths – no wonder statistics is hard!

However, knowing this we can try to make things better.

In “making sense of statistics” I am focusing more on those who have a reasonable sense of the practical issues, and will try to explain some of the concepts that are necessary without getting deep into the mathematics of how they are calculated … leave that to the computer!

 

  1. There are exceptions to the conscious mind’s single model of the world, including when we tell each other stories or write; see my essay “writing as third order experience“.
  2. Malcolm Gladwell describes this experiment in ‘The Second Mind‘, an online excerpt of his book Blink.

So What (making sense of statistics) – 7 – Building for the future

adding to the discipline

  • repeatability/replication – comparisons more robust than measures
  • meta analysis – reporting the details
  • data publishing – enabling open science

The touchstone of valuable research is the extent to which it builds the discipline, so that the sum of knowledge after you have done your work is greater than it was before. How can you ensure this happens?

One part of this is repeatability, ensuring that you or others could replicate your study or experiment and get the same, or at least similar, results. At the CHI conference the RepliCHI initiative had a series of workshops and led to the addition of a “Validation and refutation” category in some subsequent conferences.

However, true replication is hard in HCI as it is difficult to precisely replicate the full context of even the most controlled experimental studies. The pool of subjects will differ even if they are all university students, as they will be from different institutions. The experimenter usually reads some sort of protocol, or greets subjects, so slight differences in behaviour could alter the subjects’ mood. Similarly, the decoration of the room or lack of it, light levels, etc. may all alter behaviour.

Replication can be difficult even in apparently ‘physical’ situations. Many years ago I worked on agricultural sprayer research. We often used an apparatus that used a laser beam to measure the sizes of spray droplets. In one two-day series of experiments we carefully varied water temperature, quantity of surfactant, and a variety of other factors, largely to see how carefully these needed to be controlled in other kinds of experiments. When we analysed the results there were some small but statistically significant effects that were surprising. After some time one of my colleagues suggested that, as we had run the experiments over two days, we should try adding a ‘day effect’. We did this and sure enough all the anomalous effects disappeared; it just seemed that the runs on one day were in some way different from those on the other, despite all our efforts to control the situation. Maybe this was some atmospheric effect, or a slight difference in the equipment – we never knew.

Replication is, of course, even harder in more ecologically valid or in-the-wild studies.

This does not mean one should not try to replicate, just that one should be aware of the difficulties. There are things that can improve the situation.

  • The first is to ensure that you fully describe your methods, including, for example, any instructions given to participants or data used in trials, as well as the tests used, number of participants, etc.
  • The second is to focus on differences or comparisons more than absolute values. The fact that one condition is 10% faster than another in one experiment is more likely to be replicable than the exact speed of the base condition.

Understanding mechanism will help with both of these.

Meta-analysis is about using multiple studies by different groups in order to cross-validate and find emerging patterns. Like replication, ensuring your work is amenable to meta-analysis requires you to be careful to report method and results clearly and completely.

One way to achieve this is simply to put everything into the public domain: all the materials you used, instructions, software (if applicable), and of course the raw study data – whether survey responses, video, or keystroke logs – as well as derived data, all the way to the numbers that lie behind the graphs in your published papers.

Having this data available means that those seeking to replicate can compare different points in their process, and those seeking to do meta-analysis can calculate common statistics across different data sets, or combine the datasets as a whole. However, making your data open also means that other people can analyse it in totally unexpected ways, testing alternative models or theories, or mining it for emergent patterns.

There are also ethical issues; in HCI, at the very least, you will usually need to anonymise the data. Crucially, you need to ensure that participants are fully aware that the data may be used for purposes other than your own experiments. Often, by the time researchers come to consider publishing their data it is too late to obtain these permissions, so openness needs to be a consideration from the very start of your research design process.

There are also practical problems in documenting your data well enough. During my 1000-mile walk around Wales in 2013 I collected copious data, from bio-data to blogs. When I had finished the walk and wanted to make this data available as a public open science resource, I found I had to learn a whole new skill of documenting the data: ensuring that those using it could do so without necessarily consulting me. Part of this is technical documentation: each field had to be described carefully; and part is about making sure that the user of the data knows exactly how it was collected. Happily, the care has paid off and I often get to hear of people using the data who have never been in touch with me to ask questions about it.

There are also broader cultural issues. The UK has a periodic research assessment exercise, which grades every subject and institution’s research. During the last such exercise, REF2014, the humanities panel included curated datasets as one of their categories of research output, but the science and engineering panels did not. It is not that STEM researchers do not think that data is valuable, but it is not valued, in the sense that careers, promotion and esteem are attached to the analysis and implications of data, not to the meticulous work of data collection itself.

Happily this is changing: many journals now mandate that data be provided for any publication, and many universities are establishing data repositories alongside those for publications themselves.

However, despite these barriers, making your data available to the world is of immense value. You have often expended great effort in gathering it; it is surely worthwhile to see it reused by others … and, of course, by doing so you are doing your small part in building a stronger, greater and more robust discipline.

So What (making sense of statistics) – 6 – Mechanism

from what happens to how and why – when quantitative and qualitative meet

It is important not just to know that something occurs, but how and why. Mechanism is about understanding the steps and processes: which buttons were pressed, what screens were viewed, what information was looked at, and how this all comes together to create a larger phenomenon.

Crucially understanding mechanism makes it possible to draw lessons and make predictions beyond the data available and the particular situations you have studied.

Typically quantitative data and statistical analysis helps you understand what happens as an end-to-end phenomenon and what is true of it as a whole. However, it often reveals little of the processes and mechanisms by which it occurs: what, but not how and why.

In contrast, qualitative methods such as rich observations, ethnography or post-experiment interviews are better suited to exploratory research (see “Why are you doing it?”) and answering these how and why questions. For example, one may determine the most common ways to achieve a task by content analysis of videos or key-stroke trace data.

Theoretical understanding may help here. This may include cognitive and psychological understanding, for example, if a user is selecting a small target with an on-screen pointer, then they have to be looking at it as human peripheral vision is not accurate enough for fine positioning tasks. Alternatively it may be related to unpacking device or application interaction characteristics, for example, if someone is choosing an item from a long menu, they need to decide if the item is in the visible portion, and if not scroll the menu, etc.

Once we have a model of how the user is behaving, we may be able to use that directly or we may use it to plan more in-depth analyses or investigations into each phase of activity.

When you have numerical empirical data, one often attempts to interpolate between measured values. For example, if one found that reading speed was 10% faster with 12-point font than with 10-point font, then there is a good chance that 11-point font will sit in between, maybe of the order of 4% to 6% faster. Even this may be problematic: for example, it may just be that 11-point font pixelates badly on the particular screen resolution of the devices you are experimenting with. However, it is a reasonable heuristic.

However, extrapolation is usually far harder: what about reading 8-point, 32-point or 3-point font?

However, if you understand the mechanism you can deconstruct the overall behaviour into parts that may be simple enough for you to be able to work out whether extrapolation is possible, or which can be put together in different ways to predict performance or behaviour in other contexts.

As an example, we will consider an early paper on font sizes on mobile devices, which included what appeared to have been a well conducted experiment, with statistically significant results, which concluded that a particular font size, let’s say 12 point, was best.

This sounds like a very useful piece of design advice except for two things.

First, the result was almost certainly related to detailed device characteristics such as screen resolution: was this a 12-point font that was best, or a 12-pixel one, or simply one that did not render badly on the particular screen?

Second, the result will have been influenced by the particular task used. This involved finding items in a menu that could be paged (hence the earlier example). Would the result hold for other tasks?

In this case it was relatively easy to work out the mechanism – the detailed steps the user would need to perform in order to complete the menu selection task:

  1. visual search of the screen to see if the target item appears
  2. if not move to next screen and try step 1 again
  3. when it is found select the target item

Looking through these, it seems very likely that step 1 will be easier with larger fonts, until the point at which item names get too long to fit on the screen. Step 2, however, is likely to occur more frequently with larger font sizes, as there will be fewer lines and hence fewer items per screen-full, so for this step smaller fonts are bound to reduce the number of cycles. Finally, step 3 is again likely to have been easier and faster with larger font sizes, whether on a touch device (larger target) or a cursor-key-based one (fewer items to move the cursor through).

In summary:

  • Step 1 – speed of visual search – large font better
  • Step 2 – number of pages to scroll through – small font better
  • Step 3 – speed of item selection – large font better

The optimal font size will have been a trade-off between these factors, and changes in the task would almost certainly have changed this figure. For example, if the search were within a very large menu, then it is likely that scrolling through pages of menu items would dominate and hence the optimal choice would be the smallest readable font. In contrast, if the number of items was always small it would be better to have larger items, so long as they all fitted within the first screen.

As well as being able to make predictions before experimentation starts, unpacking the mechanism in this way would have allowed the experimenters to produce better analyses. Indeed, they had used some form of low-level logging to produce their end-to-end times, and this could have been used to break them down into empirical timings for steps 1 and 3. For step 2, the number of pages that needed to be scrolled through to find the target item can be calculated precisely, with the empirical data used to determine the time taken to press the page-down key.

With these more detailed timings, the authors could have replaced their misleading single ‘optimal’ figure with a formula that, given an average menu length, told you the best font size.
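
For example, a rough sketch of such a model in Python might look like the following. The timing constants and screen size here are pure inventions to show the shape of the trade-off, not values from the paper.

    import math

    SCREEN_HEIGHT_PX = 320     # assumed screen height
    MENU_ITEMS = 30            # average menu length

    def predicted_time(font_px, items=MENU_ITEMS):
        items_per_screen = max(1, SCREEN_HEIGHT_PX // (2 * font_px))  # line height ~ 2x font size
        pages = math.ceil(items / items_per_screen)
        visual_search = 0.08 * items_per_screen * (12 / font_px)  # step 1: harder with small fonts
        page_scrolls  = 0.4 * (pages - 1)                         # step 2: more pages with big fonts
        selection     = 0.3 + 2.0 / font_px                       # step 3: easier with big targets
        return visual_search + page_scrolls + selection

    best_font = min(range(6, 25), key=predicted_time)
    print(f"predicted optimum for a {MENU_ITEMS}-item menu: {best_font}px")

Plugging in different menu lengths moves the optimum up or down, which is exactly the point: the ‘best’ font size is a property of the task as much as of the device.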

Furthermore, other kinds of mobile task would involve steps that resemble those of the menu selection task, enabling predictions to be made in entirely new contexts.

 

So What (making sense of statistics) – 5 – Diversity: individual and task

good for, not just good

It is easy, especially when promoting one’s own idea, to want to show that it is better than everyone else’s!

However, users and tasks differ from one another. Typically a system or design property may be useful for a particular purpose or group of users, but not for others. If you understand this, you are in a better position to improve your research or market your system.

In general, it is more important to know who or what something is good for.

Imagine you have run a head to head comparison between two potential systems designs A and B, with 40 users. The user error rates are:

system A   5.2%

system B   6.2%

In fact they are not that different. System A is marginally better, as people make slightly fewer errors, but is that 1% difference really going to change the world? Anyway, it is a difference, so you go ahead and deploy system A.

However, it just so happens that of the 40 users 10 are novices and 10 are experts. Sure enough the novices have a lower error rate with system A, and indeed by a wide margin (half the error rate), but look at the expert error rates:

expert – system A   9.6%

expert – system B   2.7%   !!!!

In fact, system A is considerably worse than system B for the experts.
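
A minimal sketch of this kind of breakdown (in Python, with invented per-user figures of roughly this shape) shows how the overall averages can sit close together while the subgroups pick opposite winners:

    import pandas as pd

    # invented error rates (%): 20 novices and 20 experts, shaped to echo the example
    rows = [("novice", 1.0, 10.0)] * 20 + [("expert", 9.6, 2.7)] * 20
    df = pd.DataFrame(rows, columns=["group", "system_A", "system_B"])

    print(df[["system_A", "system_B"]].mean())                    # overall: A looks marginally better
    print(df.groupby("group")[["system_A", "system_B"]].mean())   # per group: opposite winners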

If this were a research setting, then just looking at the averages means you have a fairly marginal result to report – yep, you might have a good p-value, but an effect size that will leave your readers yawning in their seats.

However, if you look at the way this differentially affects the different groups (a) you have larger effects to report; which are also (b) far more interesting.   Why do you get the different behaviour for novices and experts? What further research does this prompt?

The issue is perhaps even more critical for the usability professional.

It is often easier to user-test novices rather than professionals when dealing with systems aimed at professionals; for example, you might test a financial planning application with economics students, or a diagnostic system with medical students. Novices are easier to access, and their time is less costly.

However, it is likely that when you deploy, the larger user group will be experts.

You deployed the wrong system … and it is worse by a large margin!

If instead of simply asking, “is my system better?”, you ask, “who is my system better for?”, then you are able to ensure that you deliver the right solution to the right people.

This is also true for tasks. Typically a system or interaction method is good for some purposes, but less good for others.

The slide shows some stills of the PieTree visualisation [OD06]. Like a TreeMap, the PieTree is a constant-area visualisation for hierarchical data, in that the area of each part reflects the number or size of the items it represents. A PieTree starts as a normal pie chart of the top-level categories, but you can explode any segment, showing the next level in each as smaller and smaller segments. At the top right is a fully expanded PieTree, whereas the image in the centre is unexpanded. In real use only some segments may be expanded at any particular time, depending on where the user has drilled down. The screenshot in the middle has the PieTree on the left and a classic file-tree-style visualisation on the right.

In evaluating this, several tasks were used, including extreme ones following the advice on careful choice of tasks from “Gaining Power – tasks“. One was focused on finding the largest items, and was deliberately designed to highlight the advantages of the PieTree over the file-tree-style visualisation; there was an obvious strategy for the former: start by drilling down into the biggest segment. However, there was also a task to find the smallest item, where there was no obvious search heuristic and everything had to be opened. When it was, it is actually easier to scan the text version for the smallest number than to work out which of the slightly differently shaped small segments is actually smallest.

The results were exactly as we expected, that is the PieTree visualisation was good for some kinds of tasks and the file-tree style for others. Having both available, as in the image in the centre, was never best for any task, but was always a good second best no matter which of the visualisations ‘won’.

In general, it is usually far more important to know who or what something is good for than some overall averaged measure. For researchers, knowing this is far more informative, allowing you to start to ask further questions about why certain features or properties are better. For practitioners, it is crucial for targeting solutions at the right people and the right problems.

Reference

[OD06] R. O’Donnell, A. Dix and L. Ball (2006). Exploring the PieTree for Representing Numerical Hierarchical Data. Proceedings of HCI 2006, People and Computers XX – Engage. Springer. pp. 239–254. http://www.alandix.com/academic/papers/HCI2006-PieTree/

So What (making sense of statistics) – 4 – What have you really shown?

Statistics is largely about assessing and validating measured values, but what do they actually measure?

Thinking about the conditions – what have you really shown – some general result or simply that one system or group of users is better than another?

As an example we will look at how a paper published at a major ACM conference appeared to demonstrate the value of a particular kind of interaction style for a particular problem, when it may simply be that the authors chose a particularly bad system as one of their experimental conditions.

Imagine you have got good data and a gold-standard p-value. You are about to write in your conclusions that using reverse-alphabetic menus leads to faster access times than other layouts. However, before you commit, ask yourself “what else might have caused this result?”. Maybe the tasks you used tended to include a lot of items starting with x, y and z?

If you find alternative explanations you might be able to look at your data in a different way to tease out the difference between your original hypothesis and the alternatives. If you can’t, this would be an opportunity to plan a new experiment that exposes the difference.

It is easy to get confused between things that are true about your subjects and things that are true generally. Imagine you have a mobile phone app for amusement parks that offers games for families to play together while they wait in the queue for a ride. You give the app to four families, who also have a small clicker device on which they are asked at intervals whether or not they are happy. The families visit many rides during the day and you analyse the data to see whether they are happier while waiting in queues that have a game compared with those that don’t. Again you get a gold-standard p-value and feel you are ready to publish.

However, if you had a small number of families, and a lot of data per family, what your statistics have probably told you is that you can accurately say for those four families, that they are, on average, happier when they play the app games. However, this is a reliable result about a few families, not a general result about all families; for that you would need far more families and different statistical analysis.

Perhaps even harder to spot because it is so common is to confuse results about specific systems with results about the properties they embody.

To illustrate this we’ll look at a little story from a few years ago.

It was a major ACM conference and the presentation of what appeared to be a good empirical paper. The topic was tools to support a collaborative task which we’ll call ‘X’.

The researchers were interested in two main factors:

  • domain specific for task X vs more generic software
  • synchronous vs asynchronous collaboration

They found three pieces of existing software that covered three of the four slots in the design space:

  • A – domain specific software, synchronous
  • B – generic software, synchronous
  • C – generic software, asynchronous

The experiment used sensible measures of quality for the task and had a reasonable number of subjects in each condition. Overall it seemed to be well conducted, and it had statistically significant results.

The results showed that:

  • domain specific was better than generic
  • asynchronous was better than synchronous

The authors concluded that what was really needed was to fill the missing slot in the design space: asynchronous domain-specific software for X. One assumes that at the next year’s conference they might have a paper on just such a piece of software.

There are some problems with this due to interaction effects: there may be some aspect of the task that means that, while domain-specific synchronous software was better than generic synchronous software, and asynchronous generic software was better for task X than synchronous generic software, it could still be that asynchronous domain-specific software is worse. However, it is still a good place to look.

Much more important is that if you blinked at the wrong moment in the presentation, you could easily miss that the whole research results are potentially completely wrong.

The presentation discussed the experiment mostly in terms of the properties, and certainly the paper’s conclusions did so. In fact, the properties were not independently varied. Instead, three existing systems were used that happened to embody the relevant properties.

Say system B just happened to be a badly designed piece of software, nothing to do with the particular properties. In the comparisons, system B would be worse than system A, which would be interpreted as ‘domain specific is better than generic’. Similarly, system B would be worse than system C, which would be interpreted as ‘asynchronous is better than synchronous’ … but really system B just happens to be bad!

Weirdly most experimenters would realise that this was an issue if there were only three users, but having a small number of pieces of software often goes unnoticed.

So, what went wrong?

The experiment was run with methods borrowed from psychology, where controlled experiments typically have a single cause and are in highly controlled environments, so that only the particular aspect being studied is varied between trials. The task X experiment appears in the guise of just such a controlled experiment, varying a single quality at a time: bespoke vs. generic, synchronous vs. asynchronous.

However, interaction, even in lab settings, needs some level of ecological validity, and indeed the systems used in the experiment were real software, with all their complexities. However, the nature of such ecologically valid experiments is that there are always multiple causes and open situations. Indeed, Carroll and Rosson’s claims analysis [CR92] embraces the alternative and possibly multiple causes of the success (or failure!) of systems.

The obvious way to address this would be to have lots and lots of systems embodying each property, just as you have lots and lots of users. However, this is typically impractical, which is why I have previously declared that:

the evaluation of generative artefacts is methodologically unsound [Dx08]

However, this does not mean that it is not possible to validate principles.

You can use rich data, for example, collecting logs or video, using think aloud protocols, or post-task interviews. This could be analysed looking for incidents that make it clear whether the poor performance of system B is due to the properties being studied or other factors (such as general poor design).

In general when you use any form of research methodology borrowed from another area, make sure you understand the assumptions behind it and modify it appropriately when you use it for yourself.

 

References

[CR92] John M. Carroll and Mary Beth Rosson. 1992. Getting around the task-artifact cycle: how to make claims and design by scenario. ACM Trans. Inf. Syst. 10, 2 (April 1992), 181-212. DOI=http://dx.doi.org/10.1145/146802.146834

[Dx08] A. Dix (2008). Theoretical analysis and theory creation, Chapter 9 in Research Methods for Human-Computer Interaction, P. Cairns and A. Cox (eds). Cambridge University Press, pp.175–195. ISBN-13: 9780521690317 http://www.alandix.com/academic/papers/theory-chapter-2008/

So What (making sense of statistics) – 3 – Visualise carefully

Visualisation is a powerful tool that can help you highlight the important features in your data, but is also dangerous and can be misleading.

Visualisation is a huge topic in its own right, but for initial eyeballing of raw data one is most often using quite simple scatter plots, line graphs or histograms, so here we will deal with two choices you make about these: the baseline and the basepoint.

The first, the baseline, is about where you start vertically: whether you place the bottom of your graph at zero or at some other value, a ‘false’ baseline. The second, the basepoint, is about where you start left-to-right.

Mathematically speaking, the x and y axes are no different: you can graph data either way round, but conventionally they are used differently. Typically the x (horizontal) axis shows the independent variable, the thing that you choose to vary experimentally (e.g. distance to target) or that is given by the world (e.g. date). The vertical, y, axis is usually the dependent variable, what you measure, for example response time or error rate.

 

As noted, the baseline is about where you start, whether you place the bottom of your graph at zero or some other value: the former is arguably more ‘truthful’, but the latter can help reveal differences that might get lost if the base effect is already large – think of climbing ‘small’ peaks near the top of Everest.

In the graph on the top right there is a clear change of slope. However, look more carefully at the vertical scale (you may need to zoom in!). The scale starts at 57.92 and the total range of the values plotted is just 0.02. This is a false baseline: instead of starting the scale at zero, it has started at another value (in this case 57.92).

The utility of this is clear. If the data had been plotted on a full scale of, say, 0-60, then even the slope would be hard to see, let alone the change in slope. Whether these small changes are important depends on the application.
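
If you want to see the effect for yourself, here is a small matplotlib sketch with invented numbers of the same order as the graph described: the same data plotted once from a zero baseline and once from a ‘false’ one.

    import numpy as np
    import matplotlib.pyplot as plt

    # invented data: a tiny change of slope sitting on top of a large base value
    x = np.arange(20)
    y = 57.92 + 0.0005 * x + 0.0005 * np.clip(x - 10, 0, None)

    fig, (honest, zoomed) = plt.subplots(1, 2, figsize=(8, 3))
    honest.plot(x, y)
    honest.set_ylim(0, 60)           # zero baseline: the change of slope is invisible
    zoomed.plot(x, y)
    zoomed.set_ylim(57.92, 57.94)    # 'false' baseline: the change of slope stands out
    plt.show()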

Scientists use the Kelvin scale for temperature, starting at absolute zero (−273°C), but if you used this as a full scale for day-to-day measurements, even the difference between a hot summer’s day and midwinter would only be about 10%; the ‘false’ baselines of the centigrade and Fahrenheit scales are far more useful.

This is even more important in a hospital: the difference between normal temperature and a high fever would be imperceptible (less than 1%) on a Kelvin scale, and medical thermometers do not even show the full centigrade range, but instead run from the mid 30s to the low 40s.

Of course, a false baseline can also be misleading if the reader is not aware of it, making insignificant differences appear large. This may happen by accident, or may be deliberate!

Many years ago there used to be a TV advert for a brand of painkiller, let’s call it Aspradine. The TV advert showed a laboratory with impressive scientific figures in white lab coats. On the laboratory bench was a rack of four test-tubes, each part-filled with white powder, all at the same height. The camera zoomed in to a view of the top portion of the test-tubes, and to the words “Aspradine has 25% more active ingredient than other brands”, additional powder was poured into one, which rose impressively.

Of course the words were perfectly accurate, and I’m sure they were careful to actually only add a quarter extra to the tube, but the impression given was of a much larger difference.

The photographs of President Trump’s inauguration are a high-profile (and highly controversial!) example of this effect. Looking at photos from the front of the crowd, it is very hard to tell the difference between different inaugurations – all look full at the front, just as if the advert had only shown the bottom half of the test-tubes. However, the image from the back clearly shows the quite substantial, and not unexpected, differences between different inaugurations. The downside to this is that, just like the Aspradine advert’s image of the top of the test-tubes or the slope in the graph, it gave the impression that the 2017 crowd was in fact very small … it was reported by at least one news outlet as only a quarter of a million, which Trump then heard and responded to in his CIA speech … and, as they say, the rest is history.

Hopefully your research will not be as controversial, but beware, whether or not this sort of rhetoric is acceptable in the marketing or political arena, be very careful in your academic publications!

The graph at the top of this slide shows UK public sector borrowing over a 20-year period. Imagine you want to quote a 10-year change figure. One choice might be to look at the lowest point in 2007 and compare to the highest point in 2017 (the green line). Alternatively you might choose the highest point in 2007 and compare with the lowest in 2017. The first would suggest that there had been a massive increase in public sector borrowing; the latter would suggest a massive decrease. Both would be misleading!

In this case the data is clearly seasonal, related, one assumes, to varying tax revenues through the year, and perhaps differing costs. Often such data is compared at like times each year (say Jan–Jan), which would give a fairer comparison.

If the data simply varies a lot then some form of average is often better. The lower graph shows precisely the same UK public borrowing data, but averaged over 12 month periods.   Now the long-term trends are far more clear, not least the huge hike at the start of the global recession when there were large-scale bank bailouts followed by a crash in tax revenues.
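
In a tool such as R or Python this kind of smoothing is a one-liner; a sketch with pandas, assuming a hypothetical monthly borrowing series rather than the real published figures, would be:

    import pandas as pd

    # hypothetical monthly borrowing figures (£bn); in practice you would load the real series
    borrowing = pd.Series(
        [3.1, 5.4, 2.2, 7.8, 1.0, 6.3, 2.9, 8.1, 0.5, 5.9, 3.3, 7.2] * 20,
        index=pd.date_range("1998-01-01", periods=240, freq="MS"),
    )

    smoothed = borrowing.rolling(window=12).mean()   # 12-month rolling average
    print(smoothed.tail())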

For a real example of this see my blog “the educational divide – do numbers matter?“.


Finally, you may think that, unless one were deliberately intending to deceive, no one could make the mistake of using either of the two initial lines, as both are so clearly misleading. However, imagine you had never plotted the data and instead it was simply a large spreadsheet full of numbers. It would be easy to pick arbitrary start and end dates, not realising the choice was so critical.

So another reminder – look at the data!

So What (making sense of statistics) – 2 – Look at the data

Look at the data, don’t just add up the numbers.

It seems an obvious message, but so easy to forget when you have that huge spreadsheet and just want to throw it into SPSS or R and see whether all your hard work was worthwhile.

But before you jump to work out your T-test, regression analysis or ANOVA, just stop and look.

Eyeball the raw data: as numbers, but preferably also in a simple graph. Don’t just plot averages – initially do scatter plots of all the data points, so you can get a feel for the way they spread. If the data falls into several clumps, what do they mean?

Are there anomalies, or extreme values?

If so these may be a sign of a fault in the experiment, maybe a sensor went wrong; or it might be something more interesting, a new or unusual phenomenon you haven’t thought about.

Does it match your model? If you are expecting linear data, does it vaguely look like that? If you are expecting the variability to stay similar across conditions (an assumption of many tests, including regression and ANOVA), does it?

The graph above is based on one I once saw in a paper (recreated here), where the authors had fitted a regression line.

However, look at the data – it is not data scattered along a line, but rather data scattered below a line. The fitted line sits below the ‘max’ line, but the data clearly does not fit the standard model of data scattered evenly around a linear fit.

A particular example in the HCI literature where researchers often forget to eyeball the data is in Fitts’ Law experiments. Recall that in Fitts’ original work [Fi54] he found that the time taken to complete tasks was proportional to the Index of Difficulty (IoD), which is the logarithm of the distance to target divided by the target size (with various minor tweaks!):

IoD = log2 ( distance to target / target size )
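
As a quick sketch (in Python, with invented distances, target widths and movement times), computing the IoD and fitting the usual straight line looks like this; note that distance and width are still there to be inspected separately if you want to check the assumption:

    import numpy as np

    def index_of_difficulty(distance, width):
        # the basic form given above; published variants add their own minor tweaks
        return np.log2(distance / width)

    # invented pointing conditions (pixels) and mean movement times (ms)
    distance = np.array([128, 256, 512, 128, 256, 512])
    width    = np.array([ 16,  16,  16,  32,  32,  32])
    time_ms  = np.array([520, 610, 700, 430, 520, 615])

    iod = index_of_difficulty(distance, width)
    slope, intercept = np.polyfit(iod, time_ms, 1)      # time ~ a + b * IoD
    print(f"time ~ {intercept:.0f} + {slope:.0f} * IoD (ms)")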

Fitts’ law has been found to be true for many different kinds of pointing tasks, with a wide variety of devices, and even over multiple orders of magnitude. Given this, many performing Fitts’ Law related work do not bother to separately report distance and target size effects, but instead instantly jump to calculating the IoD assuming that Fitts’ Law holds. Often the assumption proves correct …but not always.

The graph above is based on a Fitts’ Law related paper I once read.

The paper was about the effects of adding noise to the pointer, as if you had a slightly dodgy mouse. Crucially the noise was of a fixed size (in pixels) not related to the speed or distance of mouse movement.

The dots on the graph show the averages of multiple trials on the same parameters: size, distance and the magnitude of the noise were varied. However, size and distance are not separately plotted, just the time to target against IoD.

If you understand the mechanism (that magic word again) of Fitts’ Law [Dx03,BB06], then you would expect anomalies to occur with fixed-magnitude noise. In particular, if the noise is bigger than the target size you would expect to have an initial Fitts-style movement to the general vicinity of the target, but then a Monte Carlo (utterly random) period where the noise dominates and it is pure chance when you manage to click the target.

Sure enough, if you look at the graph you see small triads of points in roughly straight lines, but the clusters of points as a whole follow a slight curve. The regression line is drawn, but this is clearly not simply data scattered around the line.

In fact, given the understanding of mechanism, this is not surprising, but even without that knowledge the graph clearly shows something is wrong – and yet the authors never mentioned the problem.

One reason for this is probably because they had performed a regression analysis and it had come out statistically significant. That is they had jumped straight for the numbers (IoD + regression), and not properly looked at the data!

They will have reasoned that if the regression is significant and correlation coefficient strong, then the data is linear. In fact this is NOT what regression says.

To see why, look at the data above. This is not random, simply an x-squared curve. The straight line is a fitted regression line, which turns out to have a correlation coefficient of 0.96, which sounds near perfect.

There is a trend there, and the line does do a pretty good job of describing some of the change – indeed many algorithms depend on approximating curves with straight lines for precisely this reason. However, the underlying data is clearly not linear.
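
You can check this for yourself in a few lines; the sketch below (Python) fits a line to a pure x-squared curve and still gets a correlation of about 0.97, yet the residuals are systematically curved rather than random scatter.

    import numpy as np

    x = np.arange(1, 21)
    y = x ** 2                                  # perfectly deterministic, but not linear

    r = np.corrcoef(x, y)[0, 1]
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)

    print(f"correlation coefficient r = {r:.3f}")   # high, despite the obvious curve
    print(residuals)                                # positive at the ends, negative in the middle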

So next time you read about a correlation, or do one yourself, or indeed any other sort of statistical or algorithmic analysis, please, Please, remember to look at the data.

References

[BB06] Beamish, D., Bhatti, S. A., MacKenzie, I. S., & Wu, J. (2006). Fifty years later: a neurodynamic explanation of Fitts’ law. Journal of the Royal Society Interface, 3(10), 649–654. http://doi.org/10.1098/rsif.2006.0123

[Dx03] Dix, A. (2003/2005) A Cybernetic Understanding of Fitts’ Law. HCI book online! http://www.hcibook.com/e3/online/fitts-cybernetic/

[Fi54] Fitts, Paul M. (1954). The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47(6): 381–391, June 1954. http://dx.doi.org/10.1037/h0055392

 

So What (making sense of statistics) – 1 – why are you doing it?

You have done your experiment or study and have your data, maybe you have even done some preliminary statistics – what next, how do you make sense of the results?

This part will look at a number of issues and questions:

  • Why are you doing the work in the first place? Is it research or development, exploratory work, or summative evaluation?
  • Eyeballing and visualising your data – finding odd cases, checking your model makes sense, and avoiding misleading diagrams.
  • Understanding what you have really found – is it a deep result, or merely an artefact of an experimental choice?
  • Accepting the diversity of people and purposes – trying to understand not whether your system or idea is good, but who or what it is good for.
  • Building for the future – ensuring your work builds the discipline, sharing data, allowing replication or meta-analysis.

Although these are questions you can ask when you are about to start data analysis, they are also ones you should consider far earlier. One of the best ways to design a study is to imagine this situation before you start!

When you think you are ready to start recruiting participants, ask yourself, “if I have finished my study, and the results are as good as I can imagine, so what? What do I know?” – it is amazing how often this leads to a complete rewriting of a survey or experimental redesign.

Are you doing empirical work because you are an academic addressing a research question, or a practitioner trying to design a better system? Is your work intended to test an existing hypothesis (validation) or to find out what you should be looking for (exploration)? Is it a one-off study, or part of a process (e.g. ‘5 users’ for iterative development)?

These seem like obvious questions, but, in the midst of performing and analysing your study, it is surprisingly easy to lose track of your initial reasons for doing it. Indeed, it is common to read a research paper where the authors have performed evaluations that are more appropriate for user interface development, reporting issues such as wording on menus rather than addressing the principles that prompted their study.

This is partly because there are similarities in the empirical methods used, and also parallels between the stages of each. Furthermore, your goals may shift – you might be in the midst of work to verify a prior research hypothesis, and then notice an anomaly in the data, which suggests a new phenomenon to study or a potential idea for a product.

We’ll start out by looking at the research and software-development processes separately, and then explore the parallels.

There are three main uses of empirical work during research, which often relate to the stages of a research project:

exploration – This is principally about identifying the questions you want to ask. Techniques for exploration are often open-ended. They may be qualitative: ethnography, in-depth interviews, or detailed observation of behaviour, whether in the lab or in the wild. However, this is also a stage that might involve (relatively) big data, for example, if you have deployed software with logging, or have conducted a large-scale but open-ended survey. Data analysis may then be used to uncover patterns, which may suggest research questions. Note, you may not need this as a stage of research if you come with an existing hypothesis, perhaps from previous phases of your own research, questions arising from other published work, or based on your own experiences.

validation – This is predominantly about answering questions or verifying hypotheses. It is often the stage that involves most quantitative work: experiments or large-scale surveys. It is the stage that one most often publishes, especially in terms of statistical results, but that does not mean it is the most important. In order to validate, you must establish what you want to study (exploration) and what it means (explanation).

explanation – While the validation phase confirms that an observation is true, or a behaviour is prevalent, this stage is about working out why it is true, and how it happens in detail. Work at this stage often returns to more qualitative or observational methods, but with a tighter focus. However, it may also be more theory-based, using existing models, or developing new ones in order to explain a phenomenon. Crucially it is about establishing mechanism, uncovering detailed step-by-step behaviours … a topic we shall return to later.

Of course these stages may often overlap, and data gathered for one purpose may turn out to be useful for another. For example, work intended for validation or explanation may reveal anomalous behaviours that lead to fresh questions and new hypotheses. However, it is important to know which you were intending to do, and, if you change, when and why you are now looking at the data differently … and whether this matters.

During iterative software development and user experience design, we are used to two different kinds of evaluation:

formative evaluation – This is about making the system better. This is performed on prototypes or experimental systems during the cycles of design–build–test. The primary purpose of formative evaluation is to uncover usability or experience problems for the next cycle.

summative evaluation – This is about checking that the system works and is good enough. It is performed at the end of the software development process on a pre-release product. It may be related to contractual obligations: “95% of users will be able to use the product for purpose X after 20 minutes training”; or may be comparative: “the new software outperforms competitor Y on both performance and user satisfaction”.

In web applications, the boundaries can become a little less clear as changes and testing may happen on the live system as part of perpetual-beta releases or A–B testing.

Although research and software development have different overall goals, we can see some obvious parallels between the two. There are clear links between explorative research and formative evaluation, and between validation and summative evaluation, although it is perhaps less immediately clear how explanatory research connects with development.

We will look at each in turn.

During the exploration stage of research, or during formative evaluation of a product, you are interested in finding any interesting issue. For research this is about something that you may then go on to study in depth and, hopefully, publish papers about. In software development this is about finding usability problems to fix or identifying opportunities for improvements or enhancements.

It does not matter whether you have found the most important issue, or the most debilitating bug, so long as you have found sufficient for the next cycle of development.

Statistics are less important at this stage, but may help you establish priorities. If cost or time is short, you may need to decide which of the issues you have uncovered is most interesting to study further, or which to fix first.

One of the most well-known (albeit misunderstood) myths of interaction design is the idea that five users are enough.

The source of this was Nielsen and Landauer’s original paper [NL93], nearly twenty-five years ago. However, this was crucially about formative evaluation during iterative development.

I emphasise that it was NOT about summative evaluation, nor about sufficient numbers for statistics!

Nielsen and Landauer combined a simple theoretical model based on software bug detection with empirical data from a small number of substantial software projects to establish the optimum number of users to test per iteration.

Their notion of ‘optimum’ was based on cost-benefit analysis: each cycle of development cost a certain amount, each user test cost a certain amount. If you uncover too few user problems in each cycle you end up with lots of development cycles, which is expensive in terms of developer time. However, if you perform too many user tests you end up finding the same problems, thus wasting user-testing effort.

The optimum value depended on the size and complexity of the project, with the number higher for more complex projects, where redevelopment cycles were more costly, and the figure of five was a rough average.
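
To give a feel for the reasoning, here is a sketch of that style of cost–benefit model. The 1 − (1 − p)ⁿ problem-discovery curve follows Nielsen and Landauer’s model, using the often-quoted average discovery rate of about 31% per user; every cost and value figure below is invented purely to show the shape of the trade-off.

    P_FIND = 0.31             # chance that one test user exposes any given problem
    TOTAL_PROBLEMS = 40       # assumed number of problems in this design iteration
    VALUE_PER_PROBLEM = 1000  # notional value of finding (and fixing) one problem
    COST_PER_CYCLE = 4000     # notional fixed cost of a redevelopment cycle
    COST_PER_USER = 500       # notional cost of running one user test

    def benefit_cost_ratio(users):
        problems_found = TOTAL_PROBLEMS * (1 - (1 - P_FIND) ** users)
        return problems_found * VALUE_PER_PROBLEM / (COST_PER_CYCLE + users * COST_PER_USER)

    best = max(range(1, 21), key=benefit_cost_ratio)
    print(f"best number of users per cycle under these made-up figures: {best}")

Change the cost assumptions and the optimum moves – which is exactly the point made below about modern tool support.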

Nowadays, with better tool support, redevelopment cycles are far less expensive than in any of the projects in the original study, and there are arguments that the optimal value now may even be just testing one user [MT05] – especially if it is obvious that the issues uncovered are ones that are likely to be common.

However, whether one, five or twenty users, there will be more users on the next iteration – this is not about the total number of users tested during development. In particular, at later stages of development, when the most glaring problems have been sorted, it will become more important to ensure you have covered a sufficient range of the kinds of users you expect.

For more on this see Jakob Nielsen’s more recent and nuanced advice [Ni12] and my own analysis in “Are five users enough?” [Dx11].

In both validation in research and summative evaluation during development, the focus is much more exhaustive: you want to find all the problems or issues (hopefully there are not many left by the time of summative evaluation!).

The answers you need are definitive: you are not so much interested in new directions (although these may be an accidental outcome), but in verifying that your precise hypothesis is true, or that the system works as intended. For this you may need statistical tests, whether traditional (p-value) or Bayesian (odds ratio).

You may also be after numbers: how good is it (e.g. “nine out of ten owners say their cats prefer …”), or how prevalent is an issue (e.g. “95% of users successfully use the auto-grow feature”). For this the size of effects is important, so you may be more interested in confidence intervals, or pretty graphs with error bars on them.

While validation establishes that a phenomenon occurs, what is true, explanation tries to work out why it happens and how it works – deep understanding.

As noted this will often involve more qualitative work on small samples of people, but often connecting with quantitative studies of large samples.

For example, you might have a small number of rich in-depth interviews, but match the participants against the demographics of large-scale surveys. Say a particular pattern of response is evident in the large study. If your in-depth interviewee shows a similar response, then it is often a reasonable assumption that their reasons will be similar to those of the large sample. Of course, they could be saying the same thing for completely different reasons, but often common sense or prior knowledge makes the reliability evident. And if you are uncertain of the reliability of the explanation, this could always drive targeted questions in a further round of large-scale surveys.

Similarly, if you have noticed a particular behaviour in logging data from a deployed experimental application, and a user shows the same behaviour during a think-aloud or eye-tracking session, then it is reasonable to assume that their deliberations and cognitive or perceptual behaviours may well be the same as those of the users of the deployed application.

We noted that the parallel with software development was less obvious; however, the last example starts to point towards it.

During the development process, user testing often reveals many minor problems. It iterates towards a good-enough solution … but rarely makes large-scale changes. Furthermore, at worst, the changes you make at each cycle may create new problems. This is a common problem with software bugs, where code becomes fragile, but also with user interfaces, where each change in the interface creates further confusion, and may not even solve the problem that gave rise to it. After a while you may lose track of why each feature is there at all.

Rich understanding of the underlying human processes – perceptual, cognitive and social – can both make sure that ‘bug fixes’ actually solve the problem, and allow more radical, but informed, redesign that may make whole rafts of problems simply disappear.

References

[Dx11] Dix, A. (2011). Are five users enough? HCIbook online! http://www.hcibook.com/e3/online/are-five-users-enough/

[MT05] Marty, P.F. & Twidale, M.B. (2005). Extreme Discount Usability Engineering. Technical Report ISRN UIUCLIS–2005/1+CSCW. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.74.3702

[NL93] Nielsen, J. and Landauer, T. (1993). A mathematical model of the finding of usability problems. INTERACT/CHI ’93. ACM, 206–213.

[Ni12] Nielsen, J. (2012). How Many Test Users in a Usability Study? Nielsen Norman Group (NN/g), June 4, 2012. https://www.nngroup.com/articles/how-many-test-users/

Probability can be hard – from goats to DNA

Simple techniques can help, but even mathematicians can get it wrong.

It would be nice if there were a magic bullet that made all of probability and statistics easy. Hopefully some of the videos I am producing will help statistics make more sense, but there will always be difficult cases – our brains are just not built for complex probabilities.

However, it may help to know that even experts can get it wrong!

In this video we’ll look at two complex issues in probability that even mathematicians sometimes find hard: the Monty Hall problem and DNA evidence. We’ll also see how a simple technique can help you tune your common sense for this kind of problem. It is NOT a magic bullet, but it may sometimes help.

There was a quiz show in the 1950s where the star prize was a car. After battling their way through previous rounds the winning contestant had one final challenge. There were three doors: behind one was the prize car, but behind each of the other two was a goat.

The contestant chose a door, but to increase the drama of the moment, the quizmaster did not immediately open the chosen door, but instead opened one of the others. The quizmaster, who knew which was the winning door, would always open a door with a goat behind. The contestant was then given the chance to change their mind.

  • Should they stick with the original choice?
  • Should they swop to the remaining unopened door?
  • Or doesn’t it make any difference?

What do you think?

One argument is that the chance of the original door hiding the car was one in three, whereas the chance of it being behind the other door is one in two, so you should change. However, the astute may have noticed that this is a slightly flawed probabilistic argument, as the probabilities don’t add up to one.

A counter-argument is that at the end there are two doors, so the chances of which hides the car are even, and there is no advantage in swopping.

An information-theoretic argument is similar – the two doors are equally likely to hide the car before and after the other door has been opened; you have no more knowledge, so why change your mind?

Even mathematicians and statisticians can argue about this, and when they work it out by enumerating the cases, do not always believe the answer. It is one of those cases where one’s common sense simply does not help … even for a mathematician!

Before revealing the correct answer, let’s have a thought experiment.

Imagine if instead of three doors there were a million doors. Behind 999,999 doors are goats, but behind the one lucky door there is a car.

  • I am the quizmaster and ask you to choose a door. Let’s say you choose door number 42.
  • Now I open 999,998 of the remaining doors, being careful to only open doors hiding goats.
  • You are left with two doors, your original choice and the one door I have not opened.

Do you want to change your mind?

This time it is pretty obvious that you should. There was virtually no chance of you having chosen the right door to start with, so it was almost certainly (999,999 out of a million) one of the others – I have helpfully discarded all the rest so the remaining door I didn’t open is almost certainly it.

It is a bit like if, before I opened the 999,998 goat doors, I’d asked you, “Do you want to change your mind? Do you think the car is precisely behind door 42, or behind any of the others?”

In fact exactly the same reasoning holds for three doors. In this case there was a 2/3 chance that the car was behind one of the doors you did not choose, and as the quizmaster I discarded one of these that had a goat behind it, so it is twice as likely that the car is behind the door I did not open as behind your original choice. Regarding the information-theoretic argument: the quizmaster opening a goat door does add information – originally either of the unchosen doors might have hidden the car; afterwards you know that the opened one does not.

Mind you, it still feels a bit like smoke and mirrors with three doors, even though the million-door case is obvious.
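
If the three-door case still feels slippery, a quick simulation can settle it. The sketch below is purely illustrative: it plays the game many times, with the quizmaster opening every unchosen door except one and never revealing the car.

    # Sketch: simulate the quiz-show game for any number of doors.
    import random

    def play(n_doors=3, switch=True):
        car = random.randrange(n_doors)      # door hiding the car
        choice = random.randrange(n_doors)   # contestant's first pick
        # The quizmaster opens all the other doors bar one, never showing the car.
        # So if the first pick was wrong, the one unopened door must hide the car;
        # if the first pick was right, the unopened door hides a goat.
        return (choice != car) if switch else (choice == car)

    trials = 10_000
    for doors in (3, 1_000_000):
        for switch in (False, True):
            wins = sum(play(doors, switch) for _ in range(trials))
            strategy = "switch" if switch else "stick"
            print(f"{doors:>9} doors, {strategy}: win rate {wins / trials:.3f}")

For three doors the switching strategy wins about two thirds of the time; with a million doors it wins virtually every time, just as the thought experiment suggests.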

Using the extreme case helps tune your common sense, often allowing you to see flaws in mistaken arguments, or work out the true explanation.

It is not an infallible heuristic – sometimes arguments do change with scale – but it is often helpful.

The Monty Hall problem has always been a bit of fun, albeit disappointing if you were the contestant who got it wrong. However, there are similar kinds of problem where the outcomes are deadly serious.

DNA evidence is just such an example.

Let’s imagine that DNA matching has an accuracy of one in 100,000; that is, the chance of two unrelated people’s DNA matching by coincidence is 1 in 100,000.

There has been a murder, and traces of DNA have been found at the scene.

Imagine now two scenarios:

Case 1: Shortly before the body was found, the victim was known to have had a violent argument with a friend. The police match the DNA of the friend with that found at the murder scene. The friend is arrested and taken to court.

Case 2: The police look up the DNA in the national DNA database and find a positive match. The matched person is arrested and taken to court.

Similar cases have occurred and led to convictions based heavily on DNA evidence. However, while in case 1 the DNA is strong corroborating evidence, in case 2 it is not. Yet courts, guided by ‘expert’ witnesses, have not understood the distinction and convicted people in situations like case 2. Belatedly the problem has been recognised, and in the UK there have been a number of appeals where longstanding convictions have been overturned – sadly not before people had spent considerable periods behind bars for crimes they did not commit. One can only hope that similar evidence has not been crucial in countries with a death penalty.

If you were the judge or jury in such a case would the difference be obvious to you?

If not, we can use a similar trick to the Monty Hall problem, except that here let’s make the numbers less extreme. Instead of a 1 in 100,000 chance of a false DNA match, let’s make it 1 in 100. While this is still useful, albeit not overwhelming, corroborative evidence in case 1, it is pretty obvious that if there are more than a few hundred people in the police database, you are bound to find a match.

It is a bit like if a red Fiat Uno had been spotted outside the victim’s house. If the arguing friend’s car had been a red Fiat Uno it would have been good additional circumstantial evidence, but simply arresting any red Fiat Uno owner would clearly be silly.

If we return to the original 1 in 100,000 figure for a DNA match, it is just the same. If there are more than a few hundred thousand people in the database then you are almost bound to find a match. This might be a way to find people to investigate, looking for other evidence – indeed it is the way several cold cases have been solved in recent years – but the DNA evidence would not in itself be strong.
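
The arithmetic behind this is the chance of at least one coincidental match when you trawl a database of N unrelated people, which is 1 - (1 - p)^N for a false-match rate p. The sketch below uses the two rates discussed above; the database sizes are invented for illustration.

    # Sketch: chance of at least one coincidental DNA match in a database trawl.
    # Match rates follow the discussion above; database sizes are assumptions.

    def prob_false_match(rate, db_size):
        return 1 - (1 - rate) ** db_size   # P(at least one unrelated person matches)

    for rate in (1 / 100, 1 / 100_000):
        for db_size in (1, 500, 100_000, 5_000_000):
            p = prob_false_match(rate, db_size)
            print(f"match rate 1 in {round(1 / rate):>7,}, "
                  f"database of {db_size:>9,}: P(false match) = {p:.3f}")

With a ‘database’ of one (the arguing friend in case 1) a coincidental match is very unlikely, so a match really is corroborative; trawl a database of a few hundred thousand or more at the 1 in 100,000 rate and at least one coincidental match becomes almost certain.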

In summary, some diverting puzzles and also some very serious problems involving probability can be very hard to understand. Our common sense is not well tuned to probability. Even trained mathematicians can get it wrong, which is one of the reasons we turn to formulae and calculations. Changing the scale of numbers in a problem can sometimes help your common sense understand them.

In the end, if it is confusing, it is not your fault: probability is hard for the human mind; so, if in doubt, seek professional help.