Now here is a small exercise to get a feeling for that p < 5% figure … are your coins really fair?

Choose a coin, any coin.

If a coin is fair then the probability of six heads is 1/64 as is the probability of six tails, and the probability of six of either is 2/64, approximately 3%.

So we can do an experiment.

H_{0} the null hypothesis is that the coin is fair.

H_{1} the alternative hypothesis is that the coin is not fair.

The likelihood of HHHHHH or TTTTTT given H0 is less than 5%, so if you get six heads or six tails, you can reject the null hypothesis and conclude that the coin is not fair.

Try it, and if it doesn’t work try again. How long before you end up with a ‘statistically significant’ test?
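To get a feel for how long 'try again' is likely to take, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the names below are my own, and it assumes every set of six tosses is independent and the coin really is fair.

```python
from fractions import Fraction

# Probability that six tosses of a fair coin are all heads or all tails.
P_SIGNIFICANT = 2 * Fraction(1, 2) ** 6   # 2/64 = 1/32


def chance_of_false_alarm(attempts: int) -> Fraction:
    """Chance that at least one of `attempts` six-toss experiments on a
    fair coin comes out as all heads or all tails."""
    return 1 - (1 - P_SIGNIFICANT) ** attempts


# On average you need 1 / p attempts before the first 'significant' run.
expected_attempts = 1 / P_SIGNIFICANT

print("chance per experiment:", float(P_SIGNIFICANT))
print("expected attempts:", expected_attempts)
for n in (1, 10, 22, 32, 100):
    print(n, "attempts ->", round(float(chance_of_false_alarm(n)), 3))
```

On average it takes 32 attempts, and after 22 attempts you are already more likely than not to have seen at least one 'significant' result from a perfectly fair coin.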

Think back to the discussion of cherry picking and multiple tests …

This might seem a little artificial, but imagine that rather than coin tossing it is six users’ preferences for software A or B.

Having done this, how do you feel about whether 5% is a suitable level to regard as evidence?

This is the point where I nail my colours to the mast – should you use traditional statistics or Bayesian methods? With all the controversy in the media about the ‘statistical crisis’ should one opt for alt-stats or stay with conservative ones? Of course the answer will partly be ‘it depends’, but for most purposes I think there is a best answer …

You may have seen stories about the ‘statistical crisis’ [Ba16]. A variety of papers and articles in the technical and sometimes even popular press have highlighted generally poor statistical practice. This has touched many disciplines, including HCI [Ca07, KN16].

Some have focused on the ‘replication crisis’, the fact that many attempts to repeat scientific studies have failed to reproduce the original (often statistically significant) outcomes. Others have focused on the statistics itself, especially p-hacking, where wittingly or unwittingly scientists use various means to ensure they get the necessary p<5% to enable them to publish their results.

Some of these problems are intrinsic to the scientific publishing process.

One is the tendency for journals to only accept positive results, so that non-significant results do not get published. This sounds reasonable until you remember that the p<5% means that, when there is no real effect, on average one time in twenty you reject the null hypothesis (appear to have a positive finding) by sheer chance. So if 100 scientists do experiments where there is no real effect, typically this will lead to 5 apparently ‘publishable’ effects.
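The '5 out of 100' figure is an average, and it can help to see the spread around it. The following is a minimal sketch, assuming the 100 studies are independent and that none has a real effect; the function name is just for illustration.

```python
from math import comb

ALPHA = 0.05        # significance threshold, p < 5%
N_STUDIES = 100     # studies of an effect that does not really exist


def prob_exactly(k: int, n: int = N_STUDIES, alpha: float = ALPHA) -> float:
    """Binomial probability that exactly k of n null studies are 'significant'."""
    return comb(n, k) * alpha**k * (1 - alpha) ** (n - k)


expected_false_positives = N_STUDIES * ALPHA     # 5 on average
prob_at_least_one = 1 - prob_exactly(0)          # over 99%

print("expected 'publishable' results:", expected_false_positives)
print("chance of at least one:", round(prob_at_least_one, 3))
```

So it is all but certain that at least one of the 100 will have something apparently worth publishing.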

Another problem is that the ‘publish or perish’ culture of academia means that researchers may ‘bend’ the facts slightly to get publishable results. To be fair this may be because they are convinced for other reasons that something is true, so they ‘gild the lily’ a little, ignoring negative indications and emphasising positive ones. As we saw previously, famous scientists have done this in the past, and because what they did happened to be true history has overlooked the poor stats (or looked the other way).

Most of the publicity on this has focused on traditional hypothesis testing. However, the potential problems in traditional statistics have been well known for at least 40 years and are to do with poor use or poor interpretation, not intrinsic weaknesses in the statistical techniques themselves.

There have been some changes, which may have led to the current level of publicity. One is the increasing publication pressures mentioned above. Another is that in years gone by when scientists needed to do statistics they would typically ask a statistician for advice, especially if it was at all unusual or unlike previous studies. Indeed, my own first job was at an agricultural engineering research institute, where, in addition to my main role doing mathematical and computational modelling, I was part of the institute’s statistics advisory team. Now time and budgetary constraints mean that research institutions are less likely to offer easy access to statistical advice and instead researchers reach for easy to use, but potentially easy to misuse, statistical packages.

However, there is also not a little hype amongst the genuine concern, including some fairly shaky statistical methods in some of the papers criticising statistical methods!

Most of the ‘bad press’ has focused on traditional statistics and in particular hypothesis testing – the dreaded p! As noted, this is largely due to well-understood issues when they are used inexpertly or badly. When used properly, traditional statistics (both hypothesis testing and confidence intervals) tend to be relatively conservative.

One reaction to this has been for some to abandon statistics entirely; famously (or maybe infamously) the journal Basic and Applied Social Psychology has banned all hypothesis testing [Wo15]. However, this is a bit like getting worried about a cruise ship sinking and so jumping into the water to avoid drowning. The answer to poor statistics is better statistics, not no statistics!

The other reaction has been to look to alternative statistics or ‘new statistics’; this has included traditional confidence intervals and also Bayesian statistics. Some of this is quite valid; the good use of statistics includes using the correct type of analysis for the kinds of data and information you have available. However, the advocacy of these alternatives can sometimes include an element of snake oil (paper titles such as “Using Bayes to get the most out of non-significant results” probably don’t help [Di14]).

Crucially, most of the problems that have been identified in the ‘statistical crisis’ also apply to alternative methods: selective publication, p-hacking (or various other forms of cherry picking), post-hoc hypotheses. In addition, reduced familiarity can lead to poorer statistical execution and reporting. Bayesian statistics in particular currently requires considerable expertise to be used correctly. Indeed, at the time of writing the Wikipedia page for Bayes Factor (Bayesian alternative to hypothesis testing) includes, as its central example, precisely this kind of inexpert use of the methods [W17].

There are good reasons why, even after 40 plus years of debate, most professional statisticians still use traditional methods!

Based on all this my personal advice is that for most things stick with traditional statistics, but where possible always quote confidence intervals alongside any form of p-value (APA also recommend this [AP10]).

This is partly because, despite the potential misuse, there is still better general understanding of these methods and their pitfalls. You are more likely to do them right and your readers are more likely to have an idea of what they mean.

This said there are a number of circumstances when Bayesian statistics are not only a good idea, but the only sensible thing to do. These are usually circumstances where you know the prior and are involved in some sort of decision-making. For example, if a patient is in hospital you know the underlying prevalence of various diseases and so should use this as part of diagnostic reasoning. This also applies in the algorithmic use of Bayesian methods in intelligent or adaptive interfaces.
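As a rough illustration of the diagnostic case, here is Bayes' rule applied to a single test result. The prevalence, sensitivity and specificity figures are hypothetical, chosen only to make the arithmetic easy to follow, and the function name is my own.

```python
from fractions import Fraction


def posterior_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule: P(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical figures, for illustration only.
prevalence = Fraction(1, 100)     # 1% of patients have the disease
sensitivity = Fraction(9, 10)     # the test detects 90% of true cases
specificity = Fraction(19, 20)    # the test clears 95% of healthy patients

posterior = posterior_given_positive(prevalence, sensitivity, specificity)
print(posterior, float(posterior))   # 2/13, roughly 0.15
```

Even with a fairly good test, the known low prevalence means a positive result still leaves the disease unlikely, which is exactly why the prior matters in this kind of decision.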

If you do choose to use Bayesian statistics, do ensure you consult an expert, especially if you are dealing with continuous values (such as completion times), as the theory around these is particularly complex (as is evident on the Wikipedia page!). Do be careful that your prior does not simply mean you are confirming your own bias. Also do be aware that the odds ratios that are taken as acceptable evidence seem (to a traditional statistician!) to be somewhat lax (and 5% sig. is already quite lax!), so I would advise using one of the more strict levels.

Whether you use traditional stats, p-values, confidence intervals, Bayesian statistics, or tea-leaf reading – make sure you use the statistics properly. Understand what you are doing and what the results you are presenting mean.

… and I hope this course helps!

Endnote

For a balanced view of Bayesian methods see the interview with Peter Diggle, President of the Royal Statistical Society [RS15]. However, it is perhaps telling that the Royal Statistical Society’s own mini-guide for non-statisticians, also called ‘Making Sense of Statistics’, avoids mentioning Bayesian methods entirely [SS10].

References

[AP10] APA (2010). Publication Manual of the American Psychological Association, Sixth Edition. http://www.apastyle.org/manual/

[Ba16] Baker, M. (2016). Statisticians issue warning over misuse of P values. Nature, 531, 151 (10 March 2016) doi:10.1038/nature.2016.19503

[Ca07] Paul Cairns. 2007. HCI… not as it should be: inferential statistics in HCI research. In Proceedings of the 21st British HCI Group Annual Conference on People and Computers: HCI…but not as we know it – Volume 1 (BCS-HCI ’07), Vol. 1. British Computer Society, Swinton, UK, 195-201.

[Di14] Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5, 781. http://doi.org/10.3389/fpsyg.2014.00781

[KN16] Kay, M., Nelson, G., and Hekler, E. 2016. Researcher-Centered Design of Statistics: Why Bayesian Statistics Better Fit the Culture and Incentives of HCI. CHI 2016, ACM, pp. 4521-4532.

[RS15] RSS (2015). Statistician or statistical scientist? an interview with RSS president Peter Diggle. StatsLife, Royal Statistical Society. 08 January 2015. https://www.statslife.org.uk/features/2822-statistician-or-statistical-scientist-an-interview-with-rss-president-peter-diggle

[SS10] Sense about Science (2010). Making Sense of Statistics. Sense about Science. in collaboration with the Royal Statistical Society. 29 April 2010. http://senseaboutscience.org/activities/making-sense-of-statistics/

[W17] Wikipedia (2017). Bayes factor. Wikipedia. Internet Archive at 26th July 2017. https://web.archive.org/web/20170722072451/https://en.wikipedia.org/wiki/Bayes_factor

[Wo15] Chris Woolston (2015). Psychology journal bans P values. Nature News, Research Highlights: Social Selection. 26 February 2015. http://www.nature.com/news/psychology-journal-bans-p-values-1.17001

Many statistical tests depend on the idea of outcomes that are the same or worse than the one you have observed. For example, if the difference in response times between two systems is 5.7 seconds, then 5.8, 6, 23 seconds are all ‘the same or worse’, equally or more surprising. For numeric values this is fairly straightforward, but can be more complex when looking at different sorts of patterns in data. This is critical when you are making ‘post-hoc hypotheses’, noticing some pattern in the data and then trying to verify whether it is a real effect or simple chance.

You have got hold of the stallholder’s coin and are wondering if it is fair or maybe weighted in some way.

Imagine you toss it 10 times and get the sequence: THTHHHTTHH

Does that seem reasonable?

What about all heads? HHHHHHHHHH

In fact, if the coin is fair the probability of any sequence of heads and tails is the same: 1 in 1024

Prob( THTHHHTTHH ) = 1/ 2^{10} ~ 0.001

Prob( HHHHHHHHHH ) = 1/ 2^{10} ~ 0.001

The same would be true of a pack of cards. There are 52! different ways a pack of cards can come out (note: 52! is 52 factorial, the product of all the numbers up to 52 = 52x51x50 … x3x2x1), approximately the number of atoms in our galaxy. Each order is equally likely in a well shuffled pack, so any pack you pick up is incredibly unlikely.

However, this goes against our intuition that some orders of cards, some patterns of coin tosses are more special than others.

This is where we need to have an idea of things that are similar or equally surprising to each other.

For the line of 10 heads, the only thing equally surprising would be a line of 10 tails. However, for the pattern THTHHHTTHH, with 6 heads and 4 tails in a pretty unsurprising order (not even all the heads together), pretty much any other order is equally or more surprising; indeed, if you are thinking about a fair coin, arguably the only thing less surprising is exactly five of each.

This idea of same or worse is relatively straightforward for numeric data such as completion time in an experiment, or number of heads in coin tosses.

We got 6 heads out of 10 tosses, so that 6, 7, 8, 9, 10 heads would be equally or more surprising, as would 6,7,8,9,10 tails.
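For the 6 heads out of 10 case the 'same or worse' probability can be counted exactly. This is a small sketch with a function name of my own choosing; it assumes a fair coin and that the number of heads is more than half the tosses, so the two tails do not overlap.

```python
from fractions import Fraction
from math import comb


def same_or_worse(k: int, n: int) -> Fraction:
    """Two-sided probability, for a fair coin, of a result at least as
    lopsided as k heads in n tosses (k or more heads, or k or more tails).
    Assumes k > n / 2 so the two tails do not overlap."""
    one_side = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(2 * one_side, 2**n)


p = same_or_worse(6, 10)
print(p, float(p))   # 193/256, roughly 0.75
```

In other words about three quarters of all possible outcomes are at least as surprising as 6 heads, so there is nothing here to worry the stallholder.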

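Now suppose instead that you had got 9 heads out of 10 tosses. The outcomes that are equally or more surprising are 9 or 10 heads and 9 or 10 tails: 22 of the 1024 possible sequences, or about 2%. So yes, 9 heads is starting to look more surprising, but is it enough to call the stallholder out for cheating?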

As a final example, imagine 90 heads out of 100 tosses – the same proportion, but more tosses, therefore you expect things to ‘average out’ more.

Here the things that are equally or more surprising are 90 or more heads or 90 or more tails.

prob ( ‘the same or worse’ ) assuming fair
= prob ( nos heads >= 90 ) + prob ( nos tails >= 90 )
less than 1 in a million

The maths for this gets a little more complicated, but turns out to be less than one in a million. If this were a Wild West film this is the point the table would get flung over and the shooting start!
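The 'more complicated' maths is just a longer sum of binomial terms, which a computer is happy to do exactly. Here is a minimal sketch comparing 9 heads out of 10 with 90 out of 100; the function name is illustrative and a fair coin is assumed throughout.

```python
from fractions import Fraction
from math import comb


def two_sided_tail(k: int, n: int) -> Fraction:
    """prob(heads >= k) + prob(tails >= k) for n tosses of a fair coin.
    Assumes k > n / 2 so the two tails do not overlap."""
    one_side = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(2 * one_side, 2**n)


p_9_of_10 = two_sided_tail(9, 10)        # 22/1024, about 2%
p_90_of_100 = two_sided_tail(90, 100)    # far smaller

print("9 of 10:  ", float(p_9_of_10))
print("90 of 100:", float(p_90_of_100))
print("less than one in a million?", p_90_of_100 < Fraction(1, 10**6))
```

The proportion of heads is the same in both cases, but with 100 tosses the result is vastly less likely to be chance.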

For continuous distributions, such as task completion times, the principle is the same. The maths to work out the probabilities gets harder still, but here you just look up the numbers in a statistical table, or rely on R or SPSS to work it out for you.

For example, you measure the average task completion time for ten users of system A as 116.3 seconds and for ten users of system B it is 98.1 seconds. On average for the participants in your user test system B was 18.2 seconds faster. Can you conclude that your newly designed system B is indeed faster to use?

Just as with the coin tosses, the probability of a difference of precisely 18.2 seconds is vanishingly small: it could be 18.3 or 18.1, or any of a huge number of possible values. Indeed, even if the systems were identical, the probabilities of the difference being precisely zero, or 0.1 seconds, or 0.2 seconds are all still pretty tiny. Instead you look at the probability (given the systems are the same) that the difference is 18.2 or greater.

For numeric values, the only real complication is whether you want a one or two tailed test. In the case of checking whether the coin is fair, you would be equally upset if it had been weighted in some way in either direction, hence you look at both sides equally and work out (for the 90 heads out of 100): prob (nos heads >=90) + prob( nos tails >= 90).

However, for the response time, you probably only care about your new system being faster than the old one. So in this case you would only look at the probability of the time difference being >=18.2, and not bother about it being larger in the opposite direction.
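To make the one-tailed versus two-tailed distinction concrete, here is a rough sketch using a permutation test. Note that this is my substitution for the table lookup or R/SPSS route described above, chosen because it needs nothing beyond the standard library. The completion times are hypothetical (made up so that the means differ by 18.2 seconds), and the function name and parameters are illustrative.

```python
import random


def permutation_p_values(a, b, n_shuffles=10_000, seed=0):
    """Estimate one- and two-tailed p-values for mean(a) - mean(b) by
    randomly re-assigning the pooled values to the two groups."""
    rng = random.Random(seed)
    observed = sum(a) / len(a) - sum(b) / len(b)
    pooled = list(a) + list(b)
    one_tailed = two_tailed = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        group_a, group_b = pooled[:len(a)], pooled[len(a):]
        diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
        if diff >= observed:
            one_tailed += 1      # as large, in the direction you care about
        if abs(diff) >= abs(observed):
            two_tailed += 1      # as large, in either direction
    return one_tailed / n_shuffles, two_tailed / n_shuffles


# Hypothetical completion times in seconds (not data from any real study).
system_a = [121, 98, 142, 110, 125, 104, 137, 119, 95, 112]
system_b = [102, 88, 121, 93, 109, 84, 115, 97, 79, 93]

p_one, p_two = permutation_p_values(system_a, system_b)
print("one-tailed p:", p_one)
print("two-tailed p:", p_two)
```

Because the two-tailed count includes everything in the one-tailed count plus the equally large differences in the opposite direction, the one-tailed estimate can never be the larger of the two here.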

Things get more complex in various forms of combinatorial data, for example, friendship circles in network data. Here what it means to be the ‘same or worse’ can be far more difficult.

As an example of this kind of complexity, we’ll return to playing cards. Recall that there are as many ways to shuffle the pack as atoms in the galaxy.

I sometimes play a variety of ‘patience’ (a solitaire card game) which involves laying out a 7×7 grid with the lower left triangle exposed and the upper right face down.

One day I was dealing these out and noticed that three consecutive cards had been jack of clubs, 10 of clubs, 9 of clubs.

My first thought was that I had not shuffled the cards well and this pattern was due to some systematic effects from a previous card game. However, I then realised that the previous game would not lead to sequences of cards in suit.

So what was going on? Remembering the rain drops in the Plains of Nali, is this chance or an omen?

Let’s look a little more closely. If you deal three cards in a row, what is the probability it will be a decreasing sequence?

Well, the first card can be anything except an ace or a 2, that is 44 out of 52 cards, or 11/13.

After this the second and third cards are precisely determined by the first card, so have probability of 1/51 and 1/50 respectively.

Overall the probability of three cards being in a downward run in suit is 11/33150, or about 1 in 3000 … pretty unlikely, and if you were doing statistics after a usability experiment you would get pretty excited with p < 0.001.

However, that is not the end of the story. I would have been equally surprised if it had been an ascending run, or if the run had been anywhere amongst the visible cards where there are 15 possible start positions for a run of three. So, our probability of it being a run up or down in any of these start positions is now 30 x 11 / 33150 or about 1 in 100.
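The card arithmetic is easy to check with exact fractions. This is just a sketch of the calculation described above; the variable names are my own, and the figure of 15 start positions is taken from the text.

```python
from fractions import Fraction

# First card: any rank that leaves room for two lower cards (11 of 13 ranks).
p_first = Fraction(44, 52)
# Second and third cards are each one specific card from those remaining.
p_down_run = p_first * Fraction(1, 51) * Fraction(1, 50)

# Up or down, at any of the 15 start positions.
# Simply adding the probabilities is an approximation (strictly an upper
# bound), since runs at different positions are not mutually exclusive.
p_any_run = 2 * 15 * p_down_run

print(p_down_run, float(p_down_run))   # 11/33150, about 1 in 3000
print(p_any_run, float(p_any_run))     # 11/1105, about 1 in 100
```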

But then there are also vertical runs up and down the cards, or perhaps runs of the same number from different suits that would also be equally surprising.

Finally, how many times do I play the game? I probably only notice when there is something interesting (cherry picking), so suddenly my run of three does not look so surprising.

In some ways noticing the jack, ten, nine run was a bit like a post-hoc hypothesis.

You have run your experiment, looked at the data and then notice that it looks as though expert response time is slower than novice response time. This looks interesting, so you follow this up with a formal statistical test. You divide the data into novice and expert, calculate averages, do the sums and it comes out significant at 1%.

Yay, you have found out something … or have you?

What else would have been equally interesting that you might have noticed: age effects, experience, exposure to other kinds of software, those wearing glasses vs. those with perfect vision?

Remember your Bonferroni correction, if there are 10 things that would have been equally surprising, your 1% statistical significance is really only equivalent to 10%, which may be just plain chance.

Think back to the raindrop patterns on the Plains of Nali. One day’s rainfall included three drops in an almost straight line, but turned out to be an entirely random fall pattern. You notice the three drops in a line, but how many lots of three are there, and how close to a line before you call it straight? Would small triangles or squares have been equally surprising?

In an ideal (scientific) world, you would use patterns you notice in data as hypotheses for a further round of experiments … but often that is not possible and you need to be able to both spot patterns and make tests on their reliability on the same data.

One way to ease this is to visualise a portion of the data and then check it on the rest – this is similar to techniques used in machine learning. However, again that may not always be possible.
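The 'look at a portion, check on the rest' idea can be sketched as a simple random split made before you look at anything. The function name, the 50/50 split and the participant IDs are all hypothetical choices for illustration.

```python
import random


def split_for_exploration(records, explore_fraction=0.5, seed=42):
    """Randomly split records into an exploration set (to look for patterns)
    and a confirmation set (kept unseen until a hypothesis is fixed)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * explore_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical participant IDs standing in for real records.
participants = list(range(1, 21))
explore, confirm = split_for_exploration(participants)

print("explore on:", sorted(explore))
print("confirm on:", sorted(confirm))
```

The important discipline is that the confirmation half is only tested once, against the hypothesis you formed on the exploration half.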

So, if you do need to make post-hoc hypotheses, try to make an assessment of what other things would be ‘similar or more surprising’ and use this to help decide whether you are just seeing random patterns in the wild-slung stars.

Traditional statistics and Bayesian methods have their own specific pitfalls to avoid: for example interpreting a non-significant result as no effect in traditional stats, and confirmation bias for Bayesian stats.

They also have some common potential pitfalls. Perhaps the worst is cherry picking – doing analysis using different tests, and statistics, and methods until you find one that ‘works’! You also have to be careful of inter-related factors such as the age and experience of users. By being aware of these dangers one can hopefully avoid them!

One of the most common problems in statistics is ‘cherry picking’ in its various forms: this is when you ignore results that for some reason are not to your liking and instead just report those that are advantageous. This may be a deliberate attempt to deceive, but more commonly is simply a combination of ignorance and bias. In hypothesis testing people talk about ‘p-hacking’, but this can equally be a problem for Bayesian statistics or confidence intervals.

multiple tests

The most obvious form of cherry picking is when you test loads and loads of things and then pick out the few that come out showing some effect (p value or odds ratio) and ignore the rest, or even worse pick out the few that come out showing the effect you want and ignore the ones that point the opposite way!

A classic example of this is when you have a questionnaire administered after a user test, or remotely. You have 40 questions comparing two versions of a system (A and B) in terms of satisfaction, and the questions cover different aspects of the system and different forms of emotional response. Most of the questions come out mixed between the two systems, but three questions seem to show a marked preference for the new system. You then test these using hypothesis testing and find that all three are statistically significant at 5% level. You report these and feel you have good evidence that system B is better.

But hang on, remember the meaning of 5% significance is that there is a 1 in 20 chance of seeing the effect by sheer chance. So, if you have 40 questions and there is no real difference, then you might expect to see, on average, 2 hits at this 1 in 20 level, sometimes just 1, sometimes 3 or more. In fact there is an approximately one in three chance that you will have 3 or more apparently ‘5% significant’ results with 40 questions.

The answer to this is that if you would have been satisfied with a 5% significance level for a single test and have 10 tests, then any single one needs to be at the 0.5% significance level (5% / 10) in order to correct for the multiple tests. If you have 40 questions, this means we should look for 0.125% or p<0.00125.

This dividing the target p level by the number of tests is called the Bonferroni correction. It is very slightly conservative and there are slightly more exact versions, but for most purposes this is sufficiently accurate.
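Both the 'one in three' figure and the corrected thresholds can be checked with a few lines. This is a minimal sketch that treats the 40 questions as independent tests, which real questionnaire items rarely are; the function names are illustrative.

```python
from math import comb


def prob_at_least(k: int, n: int, alpha: float) -> float:
    """Chance of k or more 'significant' results among n independent tests
    when there is no real effect in any of them."""
    return 1 - sum(comb(n, i) * alpha**i * (1 - alpha) ** (n - i)
                   for i in range(k))


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level under the Bonferroni correction."""
    return alpha / n_tests


three_or_more = prob_at_least(3, 40, 0.05)         # roughly one in three
threshold_10 = bonferroni_threshold(0.05, 10)      # 0.5%
threshold_40 = bonferroni_threshold(0.05, 40)      # 0.125%

print(f"3 or more hits in 40 tests: {three_or_more:.3f}")
print(f"per-test level, 10 tests:   {threshold_10:.5f}")
print(f"per-test level, 40 tests:   {threshold_40:.5f}")
```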

multiple stats

A slightly less obvious form of cherry picking is when you try different kinds of statistical technique. First you try a non-parametric test, then a t-test, etc., until something comes out right.

I have seen one paper where all the statistics were using traditional hypothesis testing, and then in the middle there was one test that used Bayesian statistics. There was no explanation and my best guess was that the hypothesis testing had come out negative so they had a go with Bayesian and it ‘worked’.

This use of multiple kinds of statistics is not usually quite as bad as testing lots of different things as it is the same data so the tests are not independent, but if you decide to swop the statistics you are using mid-analysis, you need to be very clear why you are doing it.

It may be that you have realised that you were initially using the wrong test, for example, you might have initially used a test, such as Student’s t, that assumes normally distributed data, but only after starting the analysis realise this is not true of the data. However, simply swopping statistics part way through in the hope that ‘something will come out’ is just a form of fishing expedition!

For Bayesian stats the choice of prior can also be a form of cherry picking if you try one and then another, until you get the result you want.

outliers

A few outliers, that is extreme values, can have a disproportionate effect on some statistics, notably arithmetic mean and variance. They may be due to a fault in equipment, or some other irrelevant effect, or may simply occur by chance.

If they do appear to be valid data points that just happen to be extreme, there is an argument for just letting them be as they are part of the random nature of the phenomenon you are studying. However, for some purposes, one gets better results by removing the most extreme outliers.

However, this can add a cherry picking potential. One of the largest effects of removing outliers is to reduce the variance of the sample, and a large sample variance reduces the likelihood of getting a statistically significant effect, so there is a temptation to strip out outliers until the stats come out right.

Ideally you should choose a strategy for dealing with outliers before you do your analysis. For example, some analysts choose to remove all data that lies more than 2 or 3 standard deviations from the mean. However, there are times when you don’t realise outliers are likely to be a problem until they occur. When this happens you should attempt to be as blind to the stats as possible as you choose which outliers to remove; do avoid removing a few, re-testing, removing a few more, then re-testing again!
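A pre-chosen rule of this kind might look like the following sketch. The 2 standard deviation cut-off, the function name and the completion times are all hypothetical; the point is simply that the rule is fixed in advance and applied once.

```python
from statistics import mean, stdev


def remove_outliers(values, max_sd=2.0):
    """Drop values lying more than max_sd sample standard deviations from
    the mean. The rule is applied once, not repeated until the stats 'work'."""
    centre = mean(values)
    spread = stdev(values)
    return [v for v in values if abs(v - centre) <= max_sd * spread]


# Hypothetical completion times in seconds; 250 might be a participant
# who was interrupted mid-task.
times = [98, 102, 101, 99, 100, 103, 97, 100, 101, 99, 250]
cleaned = remove_outliers(times, max_sd=2.0)

print("kept:", cleaned)
print("removed:", [t for t in times if t not in cleaned])
```

One thing to watch with small samples is that a single extreme value inflates the standard deviation itself, so a 3 standard deviation rule may never trigger; that is another reason to think the strategy through beforehand.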

post-hoc hypothesis

The final kind of cherry picking to beware of is post-hoc hypothesis testing.

You gather your data, visualise it (good practice), and notice an interesting pattern, perhaps a correlation between variables and then test for it.

This is a bit like doing multiple tests, but with an unspecified number of alternative tests. For example, if you have 40 questions, then there are 780 different possible correlations, so if you happen to notice one and then test for it, this is a bit like doing 780 tests!
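The figure of 780 is just the number of ways of choosing 2 questions from 40, and it is sobering to see what that implies. In this sketch the pairwise tests are treated as independent, which is a simplifying assumption (correlations that share a question are related), so the numbers are indicative only.

```python
from math import comb

n_questions = 40
n_pairs = comb(n_questions, 2)        # 780 possible pairwise correlations

# If nothing at all is going on, how many pairs look 'significant' at 5%?
expected_hits = n_pairs * 0.05        # 39 on average

# Chance that at least one pair looks 'significant', assuming independence.
p_at_least_one = 1 - 0.95 ** n_pairs

print("pairs:", n_pairs)
print("expected spurious hits:", expected_hits)
print("chance of at least one:", p_at_least_one)
```

With dozens of spurious 'effects' expected, noticing one striking correlation and then testing it tells you very little on its own.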

Another potential danger is where the factors you are trying to control for or measure are in some way inter-related making it hard to interpret results, especially potential causes for observed effects.

non-independently controllable factors

Sometimes you cannot change one parameter without changing others as well.

For example, if you are studying diet and try to reduce sugar intake, then it is likely that either fat intake will go up to compensate or overall calorie intake will fall. You can’t reduce sugar without something else changing.

This often happens with user interface properties or features.

For example imagine you find people are getting confused by the underline option on a menu, so you change it so the menu item says ‘underline’ when the text is not underlined, and ‘remove underline’ when it is already underlined. This may improve the underline feature, but then maybe users are confused because it still says ‘italic’ when the selected text is already italicised.

Similarly, imagine trying to take a system and make a version that is ‘not consistent’ but otherwise identical. In practice once you change one thing, you need to change many others to make a coherent design.

The effect of this is that you cannot simply say, in the diet example, “reducing sugar has this effect”, but instead it is more likely to be “reducing sugar whilst keeping the rest of the diet fixed (and hence reducing calories) …” or “reducing sugar whilst keeping calorie intake constant (and hence probably increasing fat) …”.

In the menu example, you probably can’t just study the effects of the underline / remove underline menu options without changing all menu items, and hence will be studying constant name vs. state-based action naming, or something like that.

correlated features

A similar problem can occur with features of your users which you cannot directly control at all.

Let’s start again with a dietary example. Imagine you have clinical measures of health, perhaps cardiovascular test results, and want to work out what factors in day-to-day life contribute to health, so you administer a life-style questionnaire. One question is about the amount of exercise they take and you find this correlates positively with cardiovascular health; that is good. However, it may be that someone who is a little overweight is less likely to take exercise, or vice versa. The different lifestyle traits (healthy diet, weight, exercise) are likely to be correlated and thus it can be difficult to disentangle which are the causal factors for measured effects.

In a user interface setting we might have found that more senior managers work best with slightly larger fonts than their juniors. Maybe you surmise that this might be something to do with the high level of multi-tasking and the need for ‘at a glance’ information. However, on the whole those in more senior positions tend to be older than those in more junior positions, so the preference may have more to do with age-related eyesight problems.

Although they are both founded in probability theory, traditional statistics and Bayesian statistics have fundamental philosophical differences in the way they treat uncertainty. Bayesian methods demand the uncertainty is quantified, whereas traditional methods accept this uncertainty and reason from that. However, in practice our knowledge is somewhere between complete ignorance and precise probability, and both methods have ways of dealing with this in-between knowledge.

We have seen that both traditional statistics and Bayesian statistics effectively start with the same underlying data, and in many circumstances yield effectively equivalent results. However, they adopt fundamentally different philosophical stances in the way that they use that data to answer questions about the world. These philosophical differences are critical in interpreting their results.

Traditional statistics effectively assumes nothing about the world: are there Martians or not, is your new design better than the old one or not, it is not so much neutral as takes no sides at all. It then seeks to reason from that state of unknowledge.

Bayesian statistics instead asks you to quantify that unknowledge into prior probabilities, and then reasons in an apparently mathematically clean way, but based on those guestimates.

In some ways traditional statistics is post-modern accepting uncertainty and leaving it even in the eventual interpretation of the results, whereas Bayesian statistics suggest a more closed world. However, with Bayesian stats the uncertainty is still there, just encapsulated in the guestimate of the prior.

On the surface they have radically different assumptions about the unknown features of the real world. Traditional statistics assumes no knowledge of the real value, whereas Bayesian statistics assumes a precise probability distribution.

However, neither the world, nor the statistics we use to make sense of it, are as clear-cut.

Typically we have some knowledge about the likelihood (in the day to day sense) of things: you are pretty unlikely to encounter Martians; the coin you’ve pulled from your pocket is likely to be fair; that new design for the software, which you’ve put a lot of effort into creating, should be better than the old system. However, typically we do not have a precise measure of that knowledge.

In their purest form traditional statistics entirely ignores that knowledge and Bayesian statistics asks you to make it precise in a way that goes beyond your actual knowledge, turning uncertainty into precise probability. The former ignores information, the latter forces you to invent it!

In practice, both techniques are a little more nuanced.

In traditional statistics the significance level you are willing to accept as good evidence (p<5%, p<1%) often reflects your prior beliefs: you will probably need a very high level before you really call the Men in Black, or even accept that the coin may be loaded. Effectively there is a level of Bayesian reasoning applied during interpretation.

Similarly, while Bayesian statistics demands a precise prior probability distribution, in practice often uniform or other forms of very ‘spread’ priors are used, reflecting the high degree of uncertainty. Ideally it would be good to try a number of priors to obtain a level of sensitivity analysis, rather as we did in the example, but I have not seen this done in practice, possibly as it would add another level of interpretation to explain!
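A prior sensitivity analysis need not be elaborate. The sketch below is not the earlier example referred to above, just an illustration under a deliberately simple assumption of my own: the coin is either fair or two-headed, and you have seen six heads in a row. The priors tried are hypothetical.

```python
from fractions import Fraction


def posterior_two_headed(prior: Fraction, n_heads: int = 6) -> Fraction:
    """Posterior probability that the coin is two-headed after seeing
    n_heads heads in a row, versus the alternative that it is fair."""
    like_two_headed = Fraction(1)               # heads every time
    like_fair = Fraction(1, 2) ** n_heads       # 1/64 for six heads
    evidence = prior * like_two_headed + (1 - prior) * like_fair
    return prior * like_two_headed / evidence


# A spread of hypothetical priors, from very sceptical to agnostic.
priors = (Fraction(1, 1000), Fraction(1, 100), Fraction(1, 10), Fraction(1, 2))
for prior in priors:
    post = posterior_two_headed(prior)
    print(f"prior {float(prior):.3f} -> posterior {float(post):.3f}")
```

The same six heads lead to very different conclusions depending on the prior, which is precisely the extra level of interpretation that would need explaining.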