# Wild and wide (making sense of statistics) – 2 – quick (and dirty!) tip

We often deal with survey or count data. This might come in public forms such as opinion poll data preceding an election, or from your own data when you email out a survey, or count kinds of errors in a user study.

So when you find that 27% of the users in your study had a problem, how confident do you feel in using this to estimate the level of prevalence amongst users in general? If you did a bigger study wit more users would you be surprised if the figure you got was actually 17%, 37% or 77%?

You can work out precise numbers of this, but in this video I’ll give a simple rule of thumb method for doing a quick estimate.

We’re going to deal with this by looking at three separate cases.

First when the number you are dealing with is a comparatively small proportion of the overall sample.

For example, assume you want to know about people’s favourite colours. You do a survey of 1000 people at 10% say their favourite colour is blue. The question is how reliable this figure is. If you had done a larger survey, would the answer be close to 10% or could it be very different?

The simple rule is that the variation is 2 x the square root number of people who chose blue.

To work this out first calculate how many people the 10% represents. Given the sample was 1000, this is 100 people. The square root of 100 is 10, so 2 times this is 20 people. You can be reasonably confident (about 95%) that the number of people choosing blue in your sample is within +/- 20 of the proportion you’d expect from the population as a whole. Dividing that +/-20 people by the 1000 sample, the % of people for whom blue is their favourite colour is likely to be within +/- 2% of the measured 10%.

The second case is when you have a large majority who have selected a particular option.

For example, let’s say in another survey, this time of 200 people, 85% said green was their favourite colour.

This time you still apply the “2 x square root” rule, but instead focus on the smaller number, those who didn’t choose green. The 15% who didn’t choose green is 15% of 200 that is 30 people. The square root of 30 is about 5.5, so the expected variability is about +-11, or in percentage terms about +/- 5.5%.

That is the real proportion over the population as a whole could be anywhere between 80% and 90%.

Notice how the variability of the proportion estimate from the sample increases as the sample size gets smaller.

Finally if the numbers are near the middle, just take the square root, but this time multiply by 1.5.

For example, if you took a survey of 2000 people and 50% answered yes to a question, this represents 1000 people. The square toot of 1000 is a bit over 30, so 1.5 times this is around 50 people, so you expect a variation of about +/- 50 people, or about +/- 2.5%.

Opinion polls for elections often have samples of around 2000, so if the parties are within a few points of each other you really have no idea who will win.

For those who’d like to understand the detailed stats for this (skip if you don’t!) …

These three cases are simplified forms of the precise mathematical formula for the variance of a Binomial distribution np(1-p), where n is the number in the sample and p the true population proportion for the thing you are measuring. When you are dealing with fairly small proportions the 1-p term is close to 1, so the whole variance is close to np, that is the number with the given value. You then take the square root to give the standard deviation. The factor of 2 is because about 95% of measurements fall within 2 standard deviations. The reason this becomes 1.5 in the middle is that you can no longer treat (1-p) as nearly 1, and for p =0.5, this makes things smaller by square root of 0,5, which is about 0.7. Two times 0,7 is (about) one and half (I did say quick and dirty!).

However, for survey data, or indeed any kind of data, these calculations of variability are in the end far less critical than ensuring that the sample really does adequately measure the thing you are after.

Is it fair? – Has the way you have selected people made one outcome more likely. For example, if you do an election opinion poll of your Facebook friends, this may not be indicative of the country at large!

For surveys, has there been self-selection? – Maybe you asked a representative sample, but who actually answered. Often you get more responses from those who have strong feelings about the issue. For usability of software, this probably means those who have had a problem with it!

Have you phrased the question fairly? – For example, people are far more likely to answer “Yes” to a question, so if you ask “do you want to leave?” you might get 60% saying “yes” and 40% saying no, but if you asked the question in the opposite way “do you want to stay?”, you might still get 60% saying “yes”.

# Wild and wide (making sense of statistics) – 1 – unexpected wildness of random

This part will begin with some exercises and demonstrations of the unexpected wildness of random phenomena including the effects of bias and non-independence (when one result effects others).

We will discuss different kinds of distribution and the reasons why the normal distribution (classic hat shape), on which so many statistical tests are based, is so common. In particular we will look at some of the ways in which the effects we see in HCI may not satisfy the assumptions behind the normal distribution.

Most will be aware of the use of non-parametric statistics for discrete data such as Likert scales, but there are other ways in which non-normal distributions arise. Positive feedback effects, which give rise to the beauty of a snowflake, also create effects such as the bi-modal distribution of student marks in certain kinds of university courses (don’t believe those who say marks should be normally distributed!). This can become more complex if feedback processes include some form of threshold or other non-linear effect (e.g. when the rate of a task just gets too much for a user).
All of these effects are found in the processes that give rise to social networks both online and offline and other forms of network phenomena, which are often far better described by a long-tailed ‘power law’.

Just how random is the world?

We often underestimate just how wild random phenomena are – we expect to see patterns and reasons for what is sometimes entirely arbitrary.

Through a story and some exercises, I hope that you will get a better feel for how wild randomness is. We sometimes expect random things to end up close to their average behaviour, but we’ll see that variability is often large.

When you have real data you have a combination of some real effect and random ‘noise’. However, by doing some coin tossing experiments you know that the coins you are dealing with are near enough fair – everything you see will be sheer randomness.

In the far off land of Gheisra there lies the plain of Nali. For one hundred miles in each direction it spreads, featureless and flat, no vegetation, no habitation; except, at its very centre, a pavement of 25 tiles of stone, each perfectly level with the others and with the surrounding land.

The origins of this pavement are unknown – whether it was set there by some ancient race for its own purposes, or whether it was there from the beginning of the world.

Rain falls but rarely on that barren plain, but when clouds are seen gathering over the plain of Nali, the monks of Gheisra journey on pilgrimage to this shrine of the ancients, to watch for the patterns of the raindrops on the tiles. Oftentimes the rain falls by chance, but sometimes the raindrops form patterns, giving omens of events afar off.

Some of the patterns recorded by the monks are shown on the following slides. Which are mere chance and which foretell great omens?

Just a reminder choose first and then read one 😉

Before, revealing the true omens, you might like to know how you fare alongside three and seven year olds.

When very young children are presented with this choice they give very mixed answers, but have a small tendency to think that distributions like day 1 are real rainfall, whereas those like day 3 are an omen.

In contrast, once children are older, seven or so, they are more consistent and tended to plump for day 3 as the random rainfall.

Were you more like the three year old and thought day 1 was random rainfall, or more like the seven year old and thought day 1 was an omen and day 3 random. Or perhaps you were like neither of them and thought day 2 was true random rainfall.

Let’s see who is right.

When you looked at day 1 you might have seen a slight diagonal tendency with the lower right corner less dense then the upper left. Or you may have noted the suspiciously collinear three dots in the second tile on the top row.

However, this pattern, the preferred choice f the three year od, is in fact the random rainfall – or at least as random as a computer random number generator can manage!

In true random phenomena you often do get gaps, dense spots or apparent patterns, but this is just pure chance.

In day 2 you might have thought it looked a little clumped towards the middle.

In fact this is perfectly right, it is exactly the same tiles as in day 1, but re-ordered so that the fuller tiles are towards the centre, and the part-empty ones to the edges.

This is an omen!

Finally, day 3 is also an omen.

This is the preferred choice of seven year olds as being random rainfalls and also, I have found, the preferred choice of 27, 37 and 47 year olds.

However it is too uniform. The drops on each tile are distributed randomly, but there are precisely five drops on each tile.

At some point during our early education we ‘learn’ (wrongly!) that random phenomena are uniform. Although this is nearly true when there are very large numbers involved (maybe 12,500 drops rather than 125), wth smaller numbers the effects are far more wacky than one might imagine.

Now for a different exercise, and this time you don’t just have to choose, you have to do something.

Find a coin, or even better if you have 20, get them.

Toss the coins one by one and put the heads into one row and the tails into another.

Keep on tossing until one line of coins has ten coins in it … you could even mark a finish line 10 coins away from the start.

If you only have one coin you’ll have to just toss it and keep tally!

If you are on your own repeat this several times.

However, before you start think about what you expect to see.

So what happened?

Did you get a clear winner, or were they neck and neck?

And what did you expect to happen?

I had a go and did five races. In one case they were nearly neck-and-neck at 9 heads to 10 tails, but the other four races were won by heads with some quite large margins: 10 to 7, 10 to 6, 10 to 5 and 10 to 4.

Often people are surprised because they are expecting a near neck and neck race every time. As the coins are all fair, they expect approximately equal numbers of heads and tails. However, just like the rainfall in Gheisra, it is very common to have one quite far ahead of the other.

In your head you might think that because the probability of it being a head is a half, the number of heads will be near enough half. Indeed, this is the case if you average over lots and lots of tosses. However, with just 20 coins in a race, the variability is large.

The probability of getting an outright winner all heads or all tails is low, only about one in five hundred. However, the probability of getting a near wipe-out with one head and ten tails or vice versa is around one in fifty – in a large class one person is likely to have this.

I hope these two activities begin to give some idea of the wild nature of random phenomena. We can see a few general lessons.

First apparent patterns or differences may just be pure chance. For example, if you had found heads winning by 10 to 2, you might have thought this meant that your coin was in some way biased t heads, or you might have thought that the nearly straight line of three drops on day 1 had to mean something. But random things are so wild that sometimes apparently systematic effects happen by chance.

It is also remembering that this wildness may lead to what appear to be ‘bad values’. If you had got 10 tails and just 1 head, you might have thought “but coins are fair, so I must have done something wrong”.

Famous scientists have fallen for this fallacy!

Mendel’s experiment on inheritance of sweet pea characteristics laid the foundations for modern genetics. However, his results are a little too good. If you cross-pollinate two plants one pure bred to have a recessive characteristic and one purely dominant, you expect to see the second generation to have observable characteristics that are dominant and recessive in the ration 3:1. In Mendel’s data the ratios are just a little too close to this figure. It seems likely that he rejected ‘bad values’ assuming he had done something wrong when in fact they were just the results of chance.

The same thing can happen in Physics. In 1909 Millikan and Harvey Fletcher did an experiment that showed that each electron has an identical charge and measure it (also known as the ‘Millikan can experiment’). To do this they created charged oil drops and suspended them using their electrical charge. The relationship between the field needed and size (and hence weight) of the drop enabled them to calculate the charge on each oil drop. These always came in multiples of a single value – the electron charge. Only there are sources of error in all the measurements and yet the reported charges are a little too close to multiples of the same number. Again it looks like ‘bad’ results were ignored as some form of mistake during the setup of the experiment.

# the job of statistics

from the real world to measurement and back again

If you want to use statistics you need to learn how to do statistics, in the sense of working out what tests to use, maybe a stats package such as SPSS or R.

But why do this at all? What does statistics actually do?

Fundamentally statistics is about trying to learn dependable things about the real world based on measurements of it.

However, what we mean by ‘real’ is itself a little complicated, from the actual users you have tested to the hypothetical idea of a ‘typical user’ of your system.

the sample – First of all there is the actual data you have: results from an experiment, responses from a survey, log data form a deployed application. This is the real world. The user you tested at 3pm on a rainy day in March, after a slightly overfilling lunch, did make precisely three errors and finished the task in 17 mins and 23 seconds. However, while this measured data is real, it is typically not what you meant to know. Would the same user on a different day, under different conditions have made the same errors? What about other users?

the population – Another idea of real, and one that may be what you really want to know, is when there is a larger group of people you want to now about, say all the people in your company, or all users of product A. What would be the average (and variation in) error arte if all of them sat down and used the software you are testing. Or as a more concrete kind of measurement, what is their average height?

The sample that you actually look at measure the heights of is real data, but yu are using to find out about the population as a whole.

the ideal – However, while this idea of the actual population is very concrete, often the real word you actually are interested in is slightly more nebulous. Even for the current uses of product A, you are not interested in the error rate if they tried your new software today, but if they did multiple times (maybe with the occasional memory wiping pill administered occasionally) over a period – that is a sort of ‘typical’ error rate when each uses the software.

Furthermore, it is not so much the actual set of users (not that you don’t care about them), but perhaps the typical user, especially for a new piece of software where you have no ‘real’ users yet.

Similarly, when you toss a coin you have an idea of the behaviour fair coin, that s not simply the complete collection of every coin in circulation. Even when you have tossed the coin, you can still think about the different ways it could have fallen, somehow reasoning about all possible past and present for an unrepeatable event.

Finally, this hypothetical ‘real’ event may be represented in a mathematically in a theoretical distribution such as the Normal distribution (for heights) or Binomial distribution (for coin tosses).

In practice you rarely need to voice these things explicitly, but occasionally you do need to think carefully about it. If you have done a series of consistent blood tests you may know something very important about a particular individual, but not patients in general. If you are analysing big data you may know something very precise about your current users, and how they behave given a particular social context, and particular algorithms in your system, but not necessarily about potential users and how they may behave if your algorithms and environment change.

Once you know what the ‘real’ world you want to know about is, the job of statistics becomes clear.

You have taken some measurement, often of some sample of people and situations, and you want to use the measurements to understand the real world.

Given a sample of 20 heights of random people from your organisation, what can you infer about the heights of everyone? Given the error rates of 20 people on an artificial task in a lab, what can you tell about the behaviour of a typical user in their everyday situation?

As is evident, answering these questions requires a combination of probability and common sense – and this is the job of statistics.

# why are probability and statistics so hard?

Do you find probability and statistics hard? If so, don’t worry, it’s not just you; it’s basic human psychology.

We have two systems of thought (i) subconscious reactions that are based on semi-probabilistic associations, and (ii) conscious thinking that likes to have one model of the world and is really bad at probability. This is why we need to use mathematics and other explicit techniques to help us deal with probabilities.

Statistics both needs this mathematics of probability and an understanding of what it means in the real world. Understanding this means you don’t have to feel bad about finding stats hard (!), but also helps to find ways to make it easier.

Skinner’s famous experiments with pigeons showed how certain kinds of learning could be studied in terms of associations between stimuli and rewards. If you present a reward enough times with the behaviour you want, the pigeon will learn to do it even when the original reward no longer happens.

This low-level learning is semi-probabilistic in the sense that if rewards are more common the learning is faster or if rewards and penalties both happen at different frequencies, then you get a level of trade-off in the learning. At a cognitive level one can think of strengths of association being built up with rewards strengthening them and penalties inhibiting them.

This kind of learning is not quite a weighted sum of past experience: for example negative experiences typically count more than positive ones, and once a pattern is established it takes a lot to shift it. However, it is not so far from a probability estimate.

We humans share this subconscious learning processes with other animals, and at some periods this has been used explicitly in education. It is powerful and leads to very rapid reactions, but needs very large numbers of exposures to similar situations to establish memories.

Of course we are not just our subconscious! In addition we have conscious thinking and reasoning. This is powerful in that, amongst other things, we are able to learn from a single experience. Retrospectively we are able to retrieve even a single relevant past experience, compare it to what we are encountering now, and work what to do based on it.

This is very powerful, but unlike our more unconscious sea of overlapping memories and associations, our conscious mind is linear and is normally locked into a single model of the world1.

Because of this single model of the world this form of thinking is not so good at intuitively grasping probabilities, as is repeatedly evidenced by gambling behaviour and more broadly our assessment of risk.

One experiment uses cards with different coloured backs, red and blue2.  In the experiment the cards are initially dealt to the subject face down and then turned over. Some cards have a rewards, “you have won £5”, other a penalty, “sorry you’ve lost half your winnings”. The cards differ in that one colour, let’s say blue, has more penalties, and the other a better balance of rewards.

After playing a while the subjects realise that the packs are different and can tell you which s better.

However, the subjects are also wired up to a skin conductivity sensor as used in a lie detector. Well before they are able to say that one of the card colours is worse then the other they show a response on the sensor – that is subconsciously they know it is a bad card.

Although our conscious mind is not naturally good at dealing with probabilities , we are able to reason explicitly about them using mathematics. For example, if the subjects in the experiment had kept a tally of good and bad cards, they would have seen, in the numbers, that the red cards were better.

Some years ago, when I first was doing some statistical teaching I remember learning that statistics teaching was known to be particularly difficult. This is in part because it requires a combination of maths and real world thinking.

In statistics we use the explicit tallying of data and mathematical reasoning about probabilities to let is do quite complex reasoning from effects (measurements) back to causes (the real word phenomena that are being measured).

So you do need to feel reasonably comfortable with this mathematics.

However, even if you are a whizz at maths, if you can’t relate this back to understanding about the real world, you are also stuck. It is a bit like the applied maths problems where people get so lost in the maths that they forget the units: “the answer is 42” – but 42 what? 42 degrees centigrade, 42 metres, or 42 bananas?

On the whole those good at mathematics are not always good at relating their thinking back to the real world, and those of a more practical disposition not always best at maths – no wonder statistics is hard!

However, knowing this we can try to make things better.

In the “making sense of statistics” I am focusing more on those who have a reasonable sense of the practical issues and will try to explain some of the concepts that necessary without getting deep into the mathematics of how they are calculated … leave that to the computer!

1. There are exceptions to the conscious mind’s single model of the world including when we tell each other stories or write; see my essay “writing as third order experience“. []
2. Malcolm Gladwell describes this experiment in ‘The Second Mind‘, an online excerpt of his book Blink []

# So What (making sense of statistics) – 7 – Building for the future

• repeatability/replication – comparisons more robust than measures
• meta analysis – reporting the details
• data publishing – enabling open science

The touchstone of valuable research is the extent to which it builds the discipline, so that sum of knowledge after you have done your work is greater than it was before. How can you ensure this happens?

One part of this is repeatability, ensuring that you or others could replicate your study or experiment and get the same, or at least similar, results. At the CHI conference the RepliCHI initiative had a series of workshops and led to the addition of a “Validation and refutation” category in some subsequent conferences.

However, true replication is hard in HCI as it is difficult to precisely replicate the full context of even the most controlled experimental studies. The pool of subjects will differ even if university students, but form different institutions. The experimenter usually reads some sort of protocol, or greets subjects, so slight differences in the behavior could alter the mood of subjects. Similarly, the decoration of the room or lack of it , light levels, etc. all may alter behaviour.

Replication can even difficult even in apparently ‘physical’ situations. Many years ago I worked on agricultural sprayer research. We often used a apparatus that used a laser beam to measure the sizes of spray droplets. In one two-day series of experiments we carefully varied water temperature, quantity of surfactant, and variety of other factors, largely to see how carefully these needed to be controlled in other kinds of experiments. When we analysed the results there were some small but statistically significant effects that were surprising. After some time one of my colleagues suggested that as we had run the experiments over two days we should try adding a ‘dy effect’. We did this and sure enough all the anomalous effects disappeared, it just seemed that the runs one day were in some way different form the other, despite all out efforts to control the situation. Maybe this was some atmospheric effect, or slight difference in the equipment, we never knew.

Replication is, of course, even harder in more ecologically valid or in-the-wild studies.

This does not mean one should not try to replicate, just one should have an awareness of the difficulties.   There are things that can improve this situation.

• First is to ensure that you are careful to fully describe your methods including, for example, any instructions given to participants, or data used in trials, as well as the tests used, numbers of participants etc.
• The second is to focus on differences or comparisons more than absolute values. The fact that one condition is 10% faster than another in one experiment is more likely to be replicable than the exact speed of the base condition.

Understanding mechanism will help with both of these.

Meta-analysis is about using multiple studies by different groups in order to cross-validate and find emerging patterns. Like replication, ensuring your work is amenable to meta-analysis requires you to be careful to report method and results clearly and completely.

One way to achieve this is to simply put everything into the public domain: making all materials you used, instructions, software (if applicable), and of course the raw study data whether survey reports, video, or keystroke logging as well as derived data all the way to the data that lies behind the graphs in your published papers.

Having this data available means that those seeking to replicate can compare different points in their process, and those seeking to do meta-analysis can calculate common statistics across different data sets, or combine the datasets as a whole. However, making your data open also means that other people can analysis it in totally unexpected ways, testing alternative models or theories, or mining it for emergent patterns.

There are ethical problems in HCI at very least you will usually need to anonymise the data. However, crucially you need to ensure that participants are fully aware that data may be sued for purposes other than your own experiments. Often, by the time researchers come to consider publishing their data it is too late to obtain these permissions, so openness needs to be a consideration form the very start of your research design processes.

There are also practical problems to document your data well enough. During my 1000 mile walk around Wales in 2013 I collected copious data from bio-data to blogs. When I had finished the walk and wanted make this data available as a public open science resource I found I had to learn a whole new skill of documenting the data: ensuring that those using it could do so without necessarily consulting me. Part of this is technical documentation: each field had to be described carefully, and part is about making sure that the user of the data knows exactly how it was collected. Happily, the care has paid off and I often get to ear of people using the data who have never been in touch with me to ask questions abut it.

There are also broader cultural issues. The UK has a periodic research assessment exercise, which graded every subject and institution’s research. During the last such exercise, REF2014, the humanities panel included curated datasets as one of their categories of research output, but the science and engineering panel did not. It is not that STEM researchers do not think that data is valuable, but it is not valued, in the sense that careers, promotion andf esteem are attached to the analysis and implications of data, not the meticulous work of data collection itself.

Happily this is changing, many journals now mandate that data be provided for any publication and many universities are establishing data repositories alongside those for publications themselves.

However, despite these barriers, making your data available to the world is of immense value. You have often expended great effort in gathering it, it is surely worthwhile to see it reused by others … and, of course, by doing so you are doing your small part in building a stronger, greater and more robust discipline.

# So What (making sense of statistics) – 6 – Mechanism

from what happens to how and why – when quantitative and qualitative meet

It is important not just to know that something occurs, but how and why.   Mechanism is about understanding the steps and processes, which buttons were pressed, what screens viewed, what information was looked at and how this all comes together to create a larger phenomenon.

Crucially understanding mechanism makes it possible to draw lessons and make predictions beyond the data available and the particular situations you have studied.

Typically quantitative data and statistical analysis helps you understand what happens as an end-to-end phenomenon and what is true of it as a whole. However, it often reveals little of the processes and mechanisms by which it occurs: what, but not how and why.

In contrast, qualitative methods such as rich observations, ethnography or post-experiment interviews are better suited to exploratory research (see “Why are you doing it?”) and answering these how and why questions. For example, one may determine the most common ways to achieve a task by content analysis of videos or key-stroke trace data.

Theoretical understanding may help here. This may include cognitive and psychological understanding, for example, if a user is selecting a small target with an on-screen pointer, then they have to be looking at it as human peripheral vision is not accurate enough for fine positioning tasks. Alternatively it may be related to unpacking device or application interaction characteristics, for example, if someone is choosing an item from a long menu, they need to decide if the item is in the visible portion, and if not scroll the menu, etc.

Once we have a model of how the user is behaving, we may be able to use that directly or we may use it to plan more in-depth analyses or investigations into each phase of activity.

When you have numerical empirical data one often attempts to interpolate between measured values. For example, if one found that reading speed was 10% faster with 12-point font than with 10-point font, then there is a good chance that 11-point font will sit in between, maybe of the order of 4%, to 6% faster. Even this may be problematic, for example, it just may be that 11-point font pixelates badly on the particular screen resolution of the devices you are experimenting with. However, it is a reasonable heuristic.

However, extrapolation is usually far harder: what about reading 8-point font or 32 point font or 3-point font?

However, if you understand the mechanism you can deconstruct the overall behaviour into arts that may be simple enough for you to be able to work out whether extrapolation is possible, or which can be put together in different ways to predict performance or behaviour in other contexts.

As an example, we will consider an early paper on font sizes on mobile devices, which included what appeared to have been a well conducted experiment, with statistically significant results, which concluded that a particular font size, let’s say 12 point, was best.

This sounds like a very useful piece fo design advice except for two things.

First, the result was almost certainly related to detailed device characteristics such as screen resolution: was this a 12-point font that was best, or a 12-pixel one, or simply one that did not render badly on the particular screen?

Second, the result will have been influenced by the particle task used. This involved finding items in a menu that could be paged (hence the earlier example). Would the result hold or other tasks?

In this case it was relatively easily to work out the mechanism, the detailed steps the user would need to perform in order to complete the menu selection task.

1. visual search of the screen to see if the target item appears
2. if not move to next screen and try step 1 again
3. when it is found select the target item

Looking through these it seem very likely that step 1 will be easier with larger fonts until the point at which item names get too long to fit on the screen. Step 2 however is likely to occur more frequently with larger font sizes, as there will be fewer lines and hence fewer items per screen-full, so for this step smaller fonts are bond to reduce the number of cycles. Finally, step 3 is again likely to have been easier and faster with larger font sizes, whether on a touch device (larger target) or cursor key-based one (less items to move cursor through)

In summary:

• Step 1 – speed of visual search – large font better
• Step 2 – number of pages to scroll through – small font better
• Step 3 – speed of item selection – large font better

The optimal font size will have been a trade-off between these factors, and changes in the tasks would almost certainly have changed this figure. For example, if the search were within a very large menu, then it is likely that scrolling through pages of menu items would dominate and hence the optimal choice would be the smallest readable font. In contrast if the number of items was always small I larger be better to have larger items so longa s they all fitted within the first screen.

As well as being able to make predictions before experimentation starts, unpacking the mechanism in this way would have allowed the experimenters to produce better analyses. Indeed, they had used some form of low-level logs to produce their end-to-end times and break these down into empirical timings for steps 1 and 3. For step 3, the number of pages that needed to be scrolled through to find the target item can be calculated precisely with empirical data being used to determine the time taken to press the page down key.

Wit these more detailed timings, the authors could have replaced their misleading single ‘optimal’ figure and replace this with a formula, that given an average menu length told you the best font size.

Furthermore, other kinds of mobile task would involve steps that resemble those for the menu selection task, enabling predictions to me made in entirely new contexts.

# So What (making sense of statistics) – 5 – Diversity: individual and task

good for not just good

It is easy, especially when promoting one’s own idea, to want to show that it is better than everyone else’s!

However, users and tasks differ from one another. Typically a system or design property may be useful for a particular purpose or group of users, but not for others. If you understand this, you are in a better position to improve your research or market your system.

In general, it is more important to know who or what something is good for.

Imagine you have run a head to head comparison between two potential systems designs A and B, with 40 users. The user error rates are:

system A   5.2%

system B   6.2%

In fact they are not that different, System A is marginally better as people have slightly less errors, but is that 1% difference going to change the world. Anyway, it is a difference, so you go ahead and deploy system A.

However, it just so happens that of the 40 users 10 are novices and 10 experts. Sure enough the novices have a lower error rate with system A, and indeed by a wide margin (half the error rate), but look at the expert error rates:

expert – system A   9.6%

expert – system B   2.7%   !!!!

In fact, system A is considerably worse than system B for the experts.

If this were a research setting, then just looking at the averages means you have a fairly marginal result to report – yep, you might have a good p-value, but an effect size that will leave your readers yawning in their seats.

However, if you look at the way this differentially affects the different groups (a) you have larger effects to report; which are also (b) far more interesting.   Why do you get the different behaviour for novices and experts? What further research does this prompt?

The issue is perhaps even more critical for the usability professional.

It is often easier to user test novices when dealing with systems for rather than professionals; for example, you might test a financial planning application with economics students, or a diagnostic system with medical students. Novices are easier to access, and their time less costly.

However, it is likely that when you deploy the larger user group are expert.

You deployed the wrong system … and it is worse by a large margin!

If instead of simply asking, “is my system better?”, you ask, “who is my system better for?”, then you are able to ensue that you deliver the right solution to the right people.

This is also true for tasks. Typically a system or interaction method is good for some purposes, but less good for others.

The slide shows some stills of the PieTree visualisation [OD06] . Like a TreeMap, the PieTeee is a constant area visualisation for hierarchical data, in that the area of each part reflects the number or size of the items it represents. A PiTree starts as a normal pie chart of the top level categories, but you can explode any segment showing the next level in each as smaller and smaller segments. At the top right is a fully expanded PieTree, whereas the image n the centre is unexpanded. In real use only some segments may be expanded at any particular time depending on where the user has drilled down. The screen shot in the middle has the PieTree on the left and classic file tree-style visualisation on the left.

In evaluating this several tasks were used. The tasks included extreme ones following the advice on careful choice of tasks from “Gaining Power – tasks“. One was focused on finding the largest items, and was deliberately designed to highlight the advantages of the PieTree between the file-tree style visualisation; there was an obvious strategy for the former starting by drilling down into the biggest segment. However there was also a task to find the smallest, where there was no obvious search heuristic and everything had to be opened. When it was it is actually easier to san the text version of the smallest number than it is trying to work out which of the slightly different shaped small elements was actually smallest.

The results were exactly as we expected, that is the PieTree visualisation was good for some kinds of tasks and the file-tree style for others. Having both available, as in the image in the centre, was never best for any task, but was always a good second best no matter which of the visualisations ‘won’.

In general, it is usually far more important to know who or what something is good for, than some overall averaged measure. For researchers knowing this is far more informative allowing you to start to ask further questions about why certain features or properties are better. For practitioners, this is crucial for targeting solutions at the right people and the right problems.

# Reference

[OD06] R. O’Donnell, A. Dix and L. Ball (2006). Exploring the PieTree for Representing Numerical Hierarchical Data. Proceedings of Proceedings of HCI2006, People and Conputers XX – Engage. Springer. pp. 239-254. http://www.alandix.com/academic/papers/HCI2006-PieTree/

# So What (making sense of statistics) – 4 – What have you really shown?

Statistics is largely about assessing and validating measured values, but what do they actually measure?

Thinking about the conditions – what have you really shown – some general result or simply that one system or group of users is better than another?

In an example we will look at how a paper published at a major ACM conference appeared to be demonstrating the value of a particular kind of interaction style for a particular problem, but may simply be that they chose a particularly bad system as one of their experimental conditions.

Imagine you have got good data and a gold standard p-value. You are about to rite in your conclusions that using reverse alphabetic menus leads to faster access times than other layouts. However, before you commit, ask yourself “what else might have cased this result”. Maybe the tasks you used tended to include a lot of items starting with x, y and z?

If you find alternative explanations you might be able to look at your data in a different way to tease out the difference between your original hypothesis and the alternatives.   Can’t this would be an opportunity to plan a new experiment that exposes the difference.

It is easy to get confused between things that are true about your subjects and things that are true generally. Imagine you have a mobile phone app for amusement parks that offers games for families to play together while they wait in the queue for a ride.   You give the app to four families who have new app and also have a small clicker device where they are asked at intervals whether or not they are happy. The families visit many rides during the day and you analyse the data to see whether they are more happy while waiting in queues that have a game compared with those that don’t. Again you get a gold standard p-value and feel you are ready to publish.

However, if you had a small number of families, and a lot of data per family, what your statistics have probably told you is that you can accurately say for those four families, that they are, on average, happier when they play the app games. However, this is a reliable result about a few families, not a general result about all families; for that you would need far more families and different statistical analysis.

Perhaps even harder to spot because it is so common is to confuse results about specific systems with results about the properties they embody.

To illustrate this we’ll look at a little story from a few years ago.

It was a major ACM conference and the presentation of, what appeared to be, a good empirical paper. The topic was tools to support a collaborative task which we’ll call ‘X’.

The researchers were interested in two main factors:

• domain specific for task X vs more generic software
• synchronous vs asynchronous collaboration

They found three pieces of existibg siftware that covered three of the four slots’ in the design space:

• A – domain specific software, synchronous
• B – generic software, synchronous
• C – generic software, asynchronous

The experiment used sensible measures of quality for the task and had a reasonable number of subjects in each condition. Overall it seemed to be well conducted and, it had statistically significant results.

The results showed that:

• domain specific was better than generic
• asynchronous was better than synchronous

The authors concluded that what was really needed was the missing gap in the deisgn space, asynchronous domain specific software for X. One assume that in the next year’s conference they may have a paper on just such a piece of software,.

There are some problems with this due to interaction effects, there may be some aspect to the task that means that while domain specific synchronous software is was better than generic software and also asynchronous generic software was better for task X than than synchronous generic software, still it could be that asynchronous domain specific software is worse. However, this is still a good place to look.

Much more important is that if you blinked at the wrong moment in the presentation, you could easily miss that the whole research results are potentially completely wrong.

Although the presentation discussed the experiment mostly in terms of the properties, and certainly the paper’s conclusions did this. In fact these were not independently varied. Instead three systems were used that happened to embody the relevant properties,

Say system B just happened to be a badly designed piece of software, nothing to do with the articular properties. In comparisons System B was would be worse than system A, which would be interpreted as domain specific is better than generic. Similarly system B would be worse than system C, being interpreted as asynchronous is better than synchronous … bit really system B just happens to be bad!

Weirdly most experimenters would realise that this was an issue if there were only three users, but having a small number of pieces of software often goes unnoticed.

So, what went wrong?

The experiment as run with borrowed methods from psychology, where the controlled experiments typically have a single cause and are in highly controlled environments, so that only the particular aspect being studied is varied between trials. The task X experiment appears in the guise of just such a controlled experiment, varying a single quality: bespoke vs. generic, synchronous vs. asynchronous.

However, interaction, even in lab settings, needs some level of ecological validity and indeed the systems used in the experiment were real software, with all their complexities.   However, the nature of such ecologically valid experiments is that there are always multiple causes and open situations. Indeed, Carroll and Rosson’s claims analysis [CR92] embraces the alterative and possibly multiple causes of the success (or failure!) of systems.

The obvious way to address this would be to have lots and lots of systems embodying each property, just as you have lots and lots of users. However, this is typically impractical, so that I have previous declared that:

the evaluation of generative artefacts is methodologically unsound [Dx08]

However, this does not mean that it is not possible to validate principles.

You can use rich data, for example, collecting logs or video, using think aloud protocols, or post-task interviews. This could be analysed looking for incidents that make it clear whether the poor performance of system B is due to the properties being studied or other factors (such as general poor design).

In general when you use any form of research methodology borrowed from another area, make sure you understand the assumptions behind it and modify it appropriately when you use it for yourself.

## References

[CR92] John M. Carroll and Mary Beth Rosson. 1992. Getting around the task-artifact cycle: how to make claims and design by scenario. ACM Trans. Inf. Syst. 10, 2 (April 1992), 181-212. DOI=http://dx.doi.org/10.1145/146802.146834

[Dx08] A. Dix (2008). Theoretical analysis and theory creation, Chapter 9 in Research Methods for Human-Computer Interaction, P. Cairns and A. Cox (eds). Cambridge University Press, pp.175–195. ISBN-13: 9780521690317 http://www.alandix.com/academic/papers/theory-chapter-2008/

# So What (making sense of statistics) – 3 – Visualise carefully

Visualisation is a powerful tool that can help you highlight the important features in your data, but is also dangerous and can be misleading.

Visualisation is a huge topic in its own right, but for initial eyeballing of raw data one is most often using quite simple scatter plots, line graphs or histograms, so here we will deal with two choices you make about these: the baseline and the basepoint.

The first, the baseline, is about where you start place the bottom of your graph at zero or some other value, a ‘false’ baselines. The second, the basepoint, is about the left-to-right start.

Mathematically speaking, the x and y axes are no different, you can graph data either way, but conventionally they are used differently. Typically the x (horizontal) axis shows the independent variable, the thing that you choose to vary experimentally (e.g. distance to target), or given by the world (e.g. date), the vertical, y, axis is usually the dependent variable, what you measure, for example response time or error rate.

As noted, the baseline is about where you start, whether you place the bottom of your graph at zero or some other value: the former is arguably more ‘truthful’, but the latter can help reveal differences that might get lost of the base effect is already large – think of climbing ‘small’ peaks near the top of Everest.

In the graph on the top right there is a clear change of slope. However, look more carefully at the vertical scale (you may need to zoom in!). The scale starts at 57.92 and the total range of the values plotted is just 0.02. This is a false baseline, instead of starting the scale at zero, it has started at anther value (in this case 57.92).

The utility of this is clear. If the data had been plotted on a full scale of, say, 0-60, then even the slope would be hard to see, let alone the change in slope. Whether these small changes are important depends on the application.

Scientists use a Kelvin scale for temperature, starting at absolute zero (-273 C), but if you used this as a full scale for day-to-day measurements, even the difference between a hot summer’s day and midwinter would only be about 10%, the ‘false’ baseline of the centigrade and Fahrenheit scales are far more useful.

This is even more important in a hospital: the difference between normal temperature and high fever, would be imperceptible (less than 1%) on a Kelvin scale, and medical thermometers do not even show a full centigrade range, but instead range from mid 30s to low 40s.

Of course, a false baseline can also be misleading if the reader is not aware of it, making insignificant differences appear large. This may happen by accident, or may be deliberate!

Many years ago there used to be a TV advert for a brand of painkiller, let’s call it Aspradine. The TV advert showed a laboratory with impressive scientific figures in white lab coats. On the laboratory bench was a rack of four test-tubes, each part filled with white powder all at the same height.   The camera zoomed into a view of the top portion of the test-tubes, and to the words “Aspradine has 25% more active ingredient than other brands”, additional powder was poured into one, which rose impressively.

Of course the words were perfectly accurate, and I’m sure they were careful to actually only add a quarter extra to the tube, but the impression given was of a much larger difference.

The photographs of President Trump’s inauguration are a high profile (and highly controversial!) example of this effect. Looking at photos from the front of the crowd, it is very hard to tell the difference between different inaugurations – all look full at the front, just like if the advert had just sown the bottom half of the test-tubes. However, the image from the back clearly shows the quite substantial, and not unexpected, differences between different inaugurations. The downside to this is that, just like the Aspradine advert’s image of the top of the test-tubes or the slope in the graph, it gave the impression that the 2017 crowd was in fact very small … and reported by at least one news outlet at only a quarter of a million, which then Trump heard, responded to in his CIA speech … and, as they say, the rest is history.

Hopefully your research will not be as controversial, but beware, whether or not this sort of rhetoric is acceptable in the marketing or political arena, be very careful in your academic publications!

The graph at the top of this slide shows UK public sector borrowing over a 20-year period. Imagine you want to quote a 10-year change figure. One choice might be to look at the lowest point in 2007 and compare to the highest point in 2017 (the green line). Alternatively you might choose the highest point in 2007 and compare with the lowest in 2017. The first would suggest that there had been a massive increase in public sector borrowing; the latter would suggest a massive decrease. Both would be misleading!

In this case the data is clearly seasonal, related, one assumes, to varying tax revenues through the year, and perhaps differing costs. Often such data is compared at like-time’s each year (say Jan-Jan), which would give a fairer comparison.

If the data simply varies a lot then some form of average is often better. The lower graph shows precisely the same UK public borrowing data, but averaged over 12 month periods.   Now the long-term trends are far more clear, not least the huge hike at the start of the global recession when there were large-scale bank bailouts followed by a crash in tax revenues.

For a real example of this see my blog “the educational divide – do numbers matter?“.

the educational divide – do numbers matter?

Finally, you may think that unless one were deliberately intending to deceive, no-one could make the mistake of using either of the two initial lines as both are so clearly misleading. However, imagine you had never plotted the data and instead it was simply a large spreadsheet full of numbers. It would be easy to pick and arbitrary start and end dates not realising the choice was so critical.

So another reminder – look at the data!

# So What (making sense of statistics) – 2 – Look at the data

Look at the data, don’t just add up the numbers.

It seems an obvious message, but so easy to forget when you have that huge spreadsheet and just want to throw it into SPSS or R and see whether all your hard work was worthwhile.

But before you jump to work out your T-test, regression analysis or ANOVA, just stop and look.

Eyeball the raw data, as numbers, but probably in a simple graph- but don’t just plot averages, initially do scatter plots of all the data points, so you can get a feel for the way it spreads. If the data is in several clumps what do they mean?

Are there anomalies, or extreme values?

If so these may be a sign of a fault in the experiment, maybe a sensor went wrong; or it might be something more interesting, a new or unusual phenomenon you haven’t thought about.

Does it match your model. If you are expecting linear data does it vaguely look like that? If you are expecting the variability to stay similar (an assumption of many tests, including regression and ANOVA).

The graph above is based on one I once saw in a paper (recreated here), where the authors had fitted a regression line.

However, look at the data – it is not data scattered along a line, but rather data scattered below a line. The fitted line is below the max line, but the data clearly does not fit a standard model of linear fit data.

A particular example in the HCI literature where researchers often forget to eyeball the data is in Fitts’ Law experiments. Recall that in Fitts’ original work [Fi54] he found that the time taken to complete tasks was proportional to the Index of Difficulty (IoD), which is the logarithm of the distance to target divided by the target size (with various minor tweeks!):

IoD = log2 ( distance to target / target size )

Fitts’ law has been found to be true for many different kinds of pointing tasks, with a wide variety of devices, and even over multiple orders of magnitude. Given this, many performing Fitts’ Law related work do not bother to separately report distance and target size effects, but instead instantly jump to calculating the IoD assuming that Fitts’ Law holds. Often the assumption proves correct …but not always.

The graph above is based on a Fitts’ Law related paper I once read.

The paper was about the effects of adding noise to the pointer, as if you had a slightly dodgy mouse. Crucially the noise was of a fixed size (in pixels) not related to the speed or distance of mouse movement.

The dots on the graph show the averages of multiple trials on the same parameters: size, distance and the magnitude of the noise were varied. However, size and distance are not separately plotted, just the time to target against IoD.

If you understand the mechanism (that magic word again) of Fitts’ Law [Dx03,BB06], then you would expect anomalies’ to occur with fixed magnitude noise. In particular if the noise is bigger than the target size you would expect to have an initial Fitts Movement to the general vicinity of the target, but then a Monte Carlo (utterly random) period where the noise dominates and its pure chance when you manage to click the target.

Sure enough if you look at the graph you see small triads of points in roughly straight lines, but then the cluster of points following a slight curve. The regression line is drawn, but this is clearly not simply data scattered around the line.

In fact, given the understanding of mechanism, this is not surprising, but even without that knowledge the graph clearly shows something is wrong – and yet the authors never mentioned the problem.

One reason for this is probably because they had performed a regression analysis and it had come out statistically significant. That is they had jumped straight for the numbers (IoD + regression), and not properly looked at the data!

They will have reasoned that if the regression is significant and correlation coefficient strong, then the data is linear. In fact this is NOT what regression says.

To see why, look at the data above. This is not random simply an x squared curve. The straight line is a fitted regression line, which turns out to have a correlation coefficient of 0.96, which sounds near perfect.

There is a trend there, and the line does do a pretty good job of describing some of the change – indeed many algorithms depend on approximating curves with straight lines for precisely this reason. However, the underlying data is clearly not linear.

So next time you read about a correlation, or do one yourself, or indeed any other sort of statistical or algorithmic analysis, please, Please, remember to look at the data.

## References

[BB06] Beamish, D., Bhatti, S. A., MacKenzie, I. S., & Wu, J. (2006). Fifty years later: a neurodynamic explanation of Fitts’ law. Journal of the Royal Society Interface, 3(10), 649–654. http://doi.org/10.1098/rsif.2006.0123

[Dx03] Dix, A. (2003/2005) A Cybernetic Understanding of Fitts’ Law. HCI book online! http://www.hcibook.com/e3/online/fitts-cybernetic/

[Fi54] Fitts, Paul M. (1954) The information capacity of the human motor system in controlling the amplitude of movement. Journal of Experimental Psychology, 47(6): 381-391, Jun 1954,. http://dx.doi.org/10.1037/h0055392