# Doing it (making sense of statistics) – 5 – Bayesian statistics

Bayesian reasoning allows one to make strong statements about the probability of things based on evidence. This can be used for internal algorithms, for example, to make adaptive or intelligent interfaces, and also for a form of statistical reasoning that can be used as an alternative to traditional hypothesis testing.

However, to do this you need to be able to quantify, in a robust and defensible manner, the expected prior probabilities of the different hypotheses before an experiment. This carries a danger of confirmation bias, simply finding the results you thought of before you started, but when there are solid grounds for those estimates it is precise and powerful.

Crucially it is important to remember that Bayesian statistics is ultimately about quantified belief, not probability.

It is common knowledge that all Martians have antennae (just watch a sci-fi B-movie). However, humans rarely do. Maybe there is some rare genetic condition or occasional fancy dress, so let’s say the probability that a human has antennae is no more than 1 in 1000.

You decide to conduct a small experiment. There are two hypotheses:

H0 – there are no Martians in the High Street

H1 – the Martians have landed

You go out into the High Street and the first person you meet has antennae. The probability of this occurring given the null hypothesis that there are no Martians in the High Street is 1 in 1000, so we can reject the null hypothesis at p<=0.1% … which is a far stronger result than you are likely to see in most usability experiments.

Should you call the Men in Black?

For a more familiar example, let’s go back to the coin tossing. You pull a coin out of your pocket, toss it 10 times and it is a head every time. The chance of this, given it is a fair coin, is about 1 in 1000 (1 in 1024 to be exact); do you assume that the coin is fixed or that it is just a fluke?

Instead imagine it is not a coin from your pocket, but a coin from a stall-holder at a street market doing a gambling game – the coin lands heads 10 times in a row, do you trust it?

Now imagine it is a usability test and ten users were asked to compare your new system that you have spent many weeks perfecting with the previous system. All ten users said they prefer the new system … what do you think about that?

Clearly in day-to-day reasoning we take into account our prior beliefs and use that alongside the evidence from the observations we have made.

Bayesian reasoning tries to quantify this. You turn that vague feeling that it is unlikely you will meet a Martian, or unlikely the coin is biased, into solid numbers – a probability.

Let’s go back to the Martian example. We know we are unlikely to meet a Martian, but how unlikely? We need to make an estimate of this prior probability; let’s say it is a million to one.

prior probability of meeting a Martian = 0.000001

prior probability of meeting a human = 0.999999

Remember that all Martians have antennae, so the probability that someone we meet has antennae given they are Martian is 1, and we said the probability of antennae given they are human was 0.001 (allowing for dressing up).

Now, just as in the previous scenario, you go out into the High Street and the first person you meet has antennae. You combine this information, including the conditional probabilities given the person is Martian or human, to end up with a revised posterior probability of each:

posterior probability of meeting a Martian ~ 0.001

posterior probability of meeting a human ~ 0.999

We’ll see an example with the exact maths for this later, but it makes sense: if it is a million times more likely to meet a human than a Martian, but a thousand times less likely to find a human with antennae, then a final result of about a thousand to one sounds right.

The answer you get does depend on the prior. You might have started out with even less belief in the possibility of Martians landing, perhaps 1 in a billion, in which case even after seeing the antennae, you would still think it a million times more likely the person is human, but that is different from the initial thousand to one posterior. We’ll see further examples of this later.
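As a rough check on the Martian numbers, here is a minimal Python sketch of the calculation. The function name `posterior_martian` is just made up for illustration; the probabilities are the ones given in the text (antennae certain for Martians, 1 in 1000 for humans) and the two priors are the million to one and billion to one cases discussed above.

```python
def posterior_martian(prior_martian,
                      p_antennae_given_martian=1.0,
                      p_antennae_given_human=0.001):
    """Posterior probability that a person with antennae is a Martian."""
    prior_human = 1.0 - prior_martian
    # Probability of "is Martian and has antennae", and likewise for humans.
    joint_martian = p_antennae_given_martian * prior_martian
    joint_human = p_antennae_given_human * prior_human
    # Normalise by the overall probability of seeing antennae.
    return joint_martian / (joint_martian + joint_human)


for prior in (1e-6, 1e-9):
    post = posterior_martian(prior)
    print(f"prior {prior:g} -> posterior P(Martian | antennae) = {post:.3g}")
```

With the million to one prior the posterior comes out at roughly 0.001, and with the billion to one prior at roughly one in a million, matching the informal reasoning.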

So, returning once again to the job of statistics diagram, Bayesian inference is doing the same thing as other forms of statistics, taking the sample, or measurement of the real world, which includes many random effects, and then turning this back to learn things about the real world.

The difference in Bayesian inference is that it also asks for a precise prior probability, what you would have thought was likely to be true of the real world before you saw the evidence of the sample measurements. This prior is combined with the same conditional probability (likelihood) used in traditional statistics, but because of the extra ‘information’ the result is a precise posterior distribution that says precisely how likely are different values or parameters of the real world.

The process is very mathematically sound and gives a far more precise answer than traditional statistics, but does depend on you being able to provide that initial precise prior.

This is very important: if you do not have strong grounds for the prior, as is often the case in Bayesian statistics, you are dealing with quantified belief set in the language of probability, not probabilities themselves.

We are particularly interested in the use of Bayesian methods as an alternative way to do statistics for experiments, surveys and studies. However, Bayesian inference can also be very successfully used within an application to make adaptive or intelligent user interfaces.

We’ll look at an example of how this can be used to create an adaptive website. This is partly because in this example there is a clear prior probability distribution and the meaning of the posterior is also clear. This will hopefully solidify the concepts of Bayesian techniques before looking at the slightly more complex case of Bayesian statistical inference.

This is the front page of the Isle of Tiree website. There is a menu along the left-hand side; it starts with ‘home’, ‘about Tiree’, ‘accommodation’, and the 12th item is ‘sport & leisure’.

Imagine we have gathered extensive data on use by different groups, perhaps by an experiment or perhaps based on real usage data. We find that for most users the likelihood of clicking ‘sport & leisure’ as the first selection on the site is 12.5%, but for surfers this figure is 75%. Clearly different users access the site in different ways, so perhaps we would like to customise the site in some way for different types of users.

Let’s imagine we also have figures for the overall proportion of visitors to the site who are surfers or non-surfers, let’s say that the figures are 20% surfers, 80% non-surfers.   Clearly, as only one in five visitors is a surfer we do not want to make the site too ‘surf-centric’.

However, let’s look at what we know after the user’s first selection.

Consider 100 visitors to the site. On average 20 of these will be surfers and 80 non-surfers. Of the 20 surfers 75%, that is 15 visitors, are likely to click ‘sports & leisure’ first. Of the 80 non-surfers 12.5%, that is 10 visitors, are likely to click ‘sports & leisure’ first.

So in total, of the 100 visitors 25 will click ‘sports & leisure’ first. Of these 15 are surfers and 10 non-surfers; that is, if the visitor has clicked ‘sports & leisure’ first there is a 60% chance the visitor is a surfer, so it becomes more sensible to adapt the site in various ways for these visitors. For visitors who made different first choices (and hence have a lower chance of being a surfer), we might present the site differently.

This is precisely the kind of reasoning that is often used by advertisers to target marketing and by shopping sites to help make suggestions.

Note here that the prior distribution is given by solid data as is the likelihood: the premises of Bayesian inference are fully met and thus the results of applying it are mathematically and practically sound.

If you’d like to see how the above reasoning is written mathematically, it goes as follows – using the notation P(A|B) as the conditional probability that A is true given B is true.

likelihood:

P( ‘sports & leisure’ first click | surfer ) = 0.75

P( ‘sports & leisure’ first click | non-surfer ) = 0.125

prior:

P( surfer ) = 0.2
P( non-surfer ) = 0.8

posterior (writing ‘S&L’ for “‘sports & leisure’ first click “) :

P( surfer | ‘S&L’ )   =   P( surfer and ‘S&L’ ) / P(‘S&L’ )
where:
P(‘S&L’ ) = P( surfer and ‘S&L’ ) + P( non-surfer and ‘S&L’ )
P( surfer and ‘S&L’ ) = P(‘S&L’ | surfer ) * P( surfer )
= 0.75 * 0.2 = 0.15
P( non-surfer and ‘S&L’ ) = P(‘S&L’ | non-surfer ) * P( non-surfer )
= 0.125 * 0.8 = 0.1
so
P( surfer | ‘S&L’ )   =   0.15 / ( 0.15 + 0.1 ) = 0.6
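The same calculation can be sketched in a few lines of Python. This is only an illustration: the helper name `bayes_update` and the dictionary layout are assumptions of mine, while the numbers are the prior and likelihood given above.

```python
def bayes_update(prior, likelihood):
    """Combine a prior and likelihoods over the same hypotheses into a posterior."""
    # P(hypothesis and evidence) for each hypothesis.
    joint = {h: prior[h] * likelihood[h] for h in prior}
    # P(evidence), summed over all hypotheses.
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


prior = {"surfer": 0.2, "non-surfer": 0.8}
# Probability of clicking 'sports & leisure' first, for each kind of visitor.
likelihood = {"surfer": 0.75, "non-surfer": 0.125}

posterior = bayes_update(prior, likelihood)
for visitor, p in posterior.items():
    print(f"P({visitor} | S&L first) = {p:.2f}")
```

Writing the update as a function over a dictionary of hypotheses means the same few lines work for any number of visitor types, not just two.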

Let’s see how the same principle is applied to statistical inference for a user study result.

Let’s assume you are comparing an old system A with your new design system B. You are clearly hoping that your newly designed system is better!

Bayesian inference demands that you make precise your prior belief about the probability of the two outcomes. Let’s say that you have been quite conservative and decided that:

prior probability A & B are the same: 80%

prior probability B is better: 20%

You now do a small study with four users, all of whom say they prefer system B. Assuming the users are representative and independent, then this is just like tossing a coin. For the case where A and B are equally preferred, you’d expect an average 50:50 split in preferences, so the chance of seeing all four users prefer B is 1 in 16.

The alternative, B better, is a little more complex as there are usually many ways that something can be more or less better. Bayesian statistics has ways of dealing with this, but for now I’ll just assume we have done this and worked out that the probability of getting all four users to say they prefer B is ¾.

We can now work out a posterior probability based on the same reasoning as we used for the adaptive web site. The result of doing this yields the following posterior:

posterior probability A & B are the same: 25%

posterior probability B is better: 75%

It is three times more likely that your new design actually is better 🙂

This ratio, 3:1, is called the odds ratio and there are rules of thumb for determining whether this is deemed good evidence (rather like the 5% or 1% significance levels in standard hypothesis testing). While a 3:1 odds ratio is in the right direction, it would normally be regarded as inconclusive; you would not feel able to draw strong recommendations from this data alone.

Now let’s imagine a slightly different prior where you are a little more confident in your new design. You think it four times more likely that you have managed to produce a better design than that you have made no difference (you are still modest enough to admit you may have done it badly!). Codified as a prior probability this gives us:

prior probability A & B are the same: 20%

prior probability B is better: 80%

The experimental results are exactly the same, but because the prior beliefs are different the posterior probability distribution is also different:

posterior probability A & B are the same: ~2%

posterior probability B is better: ~98%

The odds ratio is 48:1, which would be considered an overwhelmingly positive result; you would definitely conclude that system B is better.
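For readers who want to check both sets of figures, here is a small Python sketch. The function name `posterior_b_better` is hypothetical; the likelihoods (1 in 16 if A and B are the same, ¾ if B is better) and the two priors are those used in the text.

```python
def posterior_b_better(prior_better, p_data_if_same=1 / 16, p_data_if_better=0.75):
    """Posterior probability that B is better, after all four users prefer B."""
    prior_same = 1 - prior_better
    joint_same = prior_same * p_data_if_same
    joint_better = prior_better * p_data_if_better
    return joint_better / (joint_same + joint_better)


for prior_better in (0.2, 0.8):
    p = posterior_b_better(prior_better)
    odds = p / (1 - p)
    print(f"prior {prior_better:.0%} -> posterior {p:.1%}, odds {odds:.0f}:1")
```

The conservative prior gives a posterior of 75% (odds 3:1), and the confident prior gives about 98% (odds 48:1), from exactly the same study data.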

Here the same study data leads to very different conclusions depending on the prior probability distribution, in other words your prior belief.

On the one hand this is a good thing, it precisely captures the difference between the situations where you toss the coin out of your pocket compared to the showman at the street market.

On the other hand, this also shows how sensitive the conclusions of Bayesian analysis are to your prior expectations. It is very easy to fall prey to confirmation bias, where the results of the analysis merely rubber stamp your initial impressions.

As is evident Bayesian inference can be really powerful in a variety of settings.   As a statistical tool however, it is evident that the choice of prior is the most critical issue.

## how do you get the prior?

Sometimes you have strong knowledge of the prior probability, perhaps based on previous similar experiments. However, this is more commonly the case for its use in internal algorithms; it is less clear in more typical usability settings such as the comparison between two systems. In these cases you are usually attempting to quantify your expert judgement.

Sometimes the evidence from the experiment or study is so overwhelming that it doesn’t make much difference what prior you choose … but in such cases hypothesis testing would give very high significance levels (low p values!), and confidence intervals very narrow ranges. It is nice when this happens, but if this were always the case we would not need the statistics!

Another option is to be conservative in your prior. The first example we gave was very conservative, giving the new system a low probability of success. More commonly a uniform prior is used, giving everything the same prior probability. This is easy when there are a small number of distinct possibilities, you just make them equal, but a little more complex for unbounded value ranges, where often a Cauchy distribution is used … this is bell shaped a bit like the Normal distribution but has fatter edges, like an egg with more white.

In fact, if you use a uniform prior then the results of Bayesian statistics are pretty much identical to traditional statistics: the posterior is effectively the likelihood function, and the odds ratio is closely related to the significance level.

As we saw, if you do not use a uniform prior, or a prior based on well-founded previous research, you have to be very careful to avoid confirmation bias.

## handling multiple evidence

Bayesian methods are particularly good at dealing with multiple independent sources of evidence; you simply apply the technique iteratively with the posterior of one study forming the prior to the next. However, you do need to be very careful that the evidence is really independent evidence, or apply corrections if it is not.

Imagine you have applied Bayesian statistics using the task completion times of an experiment to provide evidence that system B is better than system A. You then take the posterior from this study and use it as the prior when applying Bayesian statistics to evidence from an error rate study. If these are really two independent studies this is fine, but if these are the task completion times and error rates from the same study then it is likely that a participant who found the task hard on one system will have both slower times and more errors, and vice versa – the evidence is not independent and your final posterior has effectively used some of the same evidence twice!
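To make the iterative idea concrete, here is a minimal Python sketch in which the posterior of one study becomes the prior of the next. It reuses the four-user preference study from earlier, and assumes a purely hypothetical second study, with different users, that happens to give the same result. The helper name `bayes_update` is mine, not a standard library function.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses given a prior and likelihoods."""
    joint = {h: prior[h] * likelihood[h] for h in prior}
    evidence = sum(joint.values())
    return {h: joint[h] / evidence for h in joint}


# Likelihood of "all four users prefer B" under each hypothesis (from the text).
four_prefer_b = {"same": 1 / 16, "B better": 0.75}

belief = {"same": 0.8, "B better": 0.2}  # the conservative prior used earlier
for study in (1, 2):                     # the second study is hypothetical
    # The posterior from the previous study is the prior for this one.
    belief = bayes_update(belief, four_prefer_b)
    print(f"after study {study}: P(B better) = {belief['B better']:.3f}")
```

This chaining is only valid if the second study really is independent of the first. Feeding the same four users’ results in twice would produce exactly the same arithmetic, and so the same inflated posterior, without any new evidence behind it.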

## internecine warfare

Do be aware that there has been an on-going, normally good-natured, debate between statisticians on the relative merits of traditional and Bayesian statistics for at least 40 years. While Bayes Rule, the mathematics that underlies Bayesian methods, is applied across all branches of probability and statistics, Bayesian Statistics, the particular use for statistical inference, has always been less well accepted, the Cinderella of statistics.

However, algorithmic uses of Bayesian methods in machine learning and AI have blossomed over recent years, and are widely accepted and regarded across all communities.

# Doing it (making sense of statistics) – 4 – confidence intervals

Significance testing helps us to tell the difference between a real effect and random-chance patterns, but it is less helpful in giving us a clear idea of the potential size of an effect, and most importantly putting bounds on how similar things are. Confidence intervals help with both of these, giving some idea of where real values or real differences lie.

So you ran your experiment, you compared user response times to a suite of standard tasks, worked out the statistics and it came out not significant – unproven.

As we’ve seen this does not allow us to conclude there is no difference, it just may be that the difference was too small to see given the level of experimental error. Of course this error may be large, for example if we have few participants and there is a lot of individual difference; so even a large difference may be missed.

How can we tell the difference between not proven and no difference?

In fact it is usually impossible to say definitively ‘no difference’ as there may always be vanishingly small differences that we cannot detect. However, we can put bounds on inequality.

A confidence interval does precisely this. It uses the same information and mathematics as is used to generate the p values in a significance test, but then uses this to create a lower and upper bound on the true value.

For example, we may have measured the response times in the old and new system, found an average difference of 0.3 seconds, but this did not turn out to be a statistically significant difference.

On its own this simply puts us in the ‘not proven’ territory, simply unknown.

However we can also ask our statistics application to calculate a 95% confidence interval, let’s say this turns out to be [-0.7,1.3] (often, but not always, these are symmetric around the average value).

Informally this gives an idea of the level of uncertainty about the average. Note this suggests it may be as low as -0.7, that is our new system may be up to 0.7 seconds slower than the old system, but also may be up to 1.3 seconds faster.

However, like everything in statistics, this is uncertain knowledge.

What the 95% confidence interval actually says is that if the true value were outside the range, then the probability of seeing the observed outcome is less than 5%. In other words if our null hypothesis had been “the difference is 2 seconds” or “the difference is 1.4 seconds”, or “the difference is 0.8 seconds the other way”, in all of these cases the probability of the outcome would be less than 5%.

By a similar reasoning to the significance testing, this is then taken as evidence that the true value really is in the range.

Of course, 5% is a low degree of evidence; maybe you would prefer a 99% confidence interval. This then means that if the true value were outside the interval, the probability of seeing the observed outcome is less than 1 in 100. This 99% confidence interval will be wider than the 95% one, perhaps [-1,1.6]: if you want to be more certain that the value is in a range, the range becomes wider.
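If you want to see roughly how a statistics application arrives at such intervals, here is a minimal Python sketch using only the standard library. Several things here are assumptions for illustration: the list of time differences is made up (chosen so that the mean is 0.3 seconds), the function name `normal_ci` is hypothetical, and it uses the Normal approximation. For a sample this small a real analysis would more likely use Student’s t distribution, which gives somewhat wider intervals.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_ci(samples, confidence=0.95):
    """Approximate confidence interval for the mean (Normal approximation)."""
    n = len(samples)
    centre = mean(samples)
    std_err = stdev(samples) / sqrt(n)  # standard error of the mean
    # Two-sided critical value, about 1.96 for 95% and 2.58 for 99%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * std_err, centre + z * std_err


# Made-up per-participant differences in response time (seconds, old minus new).
differences = [1.9, -1.2, 0.4, 2.6, -2.1, 0.8, -0.5, 1.7, -1.4, 0.8]

for level in (0.95, 0.99):
    low, high = normal_ci(differences, level)
    print(f"{level:.0%} confidence interval: [{low:.2f}, {high:.2f}]")
```

Whatever the data, the 99% interval produced this way always contains the 95% one, which is the widening described above.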

Just like with significance testing, the 95% confidence interval of [-0.7,1.3] does not say that there is a 95% probability that the real value is in the range, it either is or it is not.

All it says is that if the real value were to lie outside the range, then the probability of the outcome is less than 5% (or 1% for 99% confidence interval).

Let’s say we have run our experiment as described and it had a mean difference in response time of 0.3 seconds, which was not significant, even at 5%. At this point, we still had no idea of whether this meant no (important) difference or simply a poor experiment. Things are inconclusive.

However, we then worked out the 95% confidence interval to be [-0.7,1.3]. Now we can start to make some stronger statements.

The upper limit of the confidence interval is 1.3 seconds; that is, we have a reasonable level of confidence that the real difference is no bigger than this. Does it matter? Is this an important difference? Imagine this is a 1.3 second difference on a 2 hour task, and that deploying the new system would cost millions; it probably would not be worth it.

Equally, if there were other reasons we want to deploy the system would it matter if it were 0.7 seconds slower?

We had precisely this question with a novel soft keyboard for mobile phones some years ago [HD04]. The keyboard could be overlaid on top of content, but leaving the content visible, so had clear advantages in that respect over a standard soft keyboard that takes up the lower part of the screen.   My colleague ran an experiment and found that the new keyboard was slower (by around 10s in a 100s task), and that this difference was statistically significant.

If we had been trying to improve the speed of entry this would have been a real problem for the design, but we had in fact expected it to be a little slower, partly because it was novel and so unfamiliar, and partly because there were other advantages. It was important that the novel keyboard was not massively slower, but a small loss of speed was acceptable.

We calculated the 95% confidence interval for the slowdown at [2,18]. That is we could be fairly confident it was at least 2 seconds slower, but also confident that it was no more than 18 seconds slower.

Note this is different from the previous examples: here we have a significant difference, but are using the confidence interval to give us an idea of how big that difference is. In this case, we have good evidence that the slowdown was no more than about 20%, which was acceptable.

Researchers are often more familiar with significance testing and know the need to quote the number of participants, the test used, etc.; you can see this in every other report you have read that uses statistics.

When you quote a confidence interval the same applies. If the data is two-outcome true/false data (like the coin toss), then the confidence interval may have been calculated using the Binomial distribution; if it is people’s heights it might use the Normal or Student’s t distribution – this needs to be reported so that others can verify the calculations, or maybe reproduce your results.

Finally do remember that, as with all statistics, the confidence interval is still uncertain. It offers good evidence that the real value is within the interval, but it could still be outside.

## What you can say

In the video and slides I spend so much time warning you what is not true, I forgot to mention one of the things that you can say with certainty from a confidence interval.

If you run experiments or studies, calculate the 95% confidence interval for each and then work on the assumption that the real value lies in the range, then at least 95% of the time you will be right.

Similarly if you calculate 99% confidence intervals (usually much wider) and work on the assumption that the real value lies in the range, then at least 99% of the time you will be right.

This is not to say that for any given experiment there is a 95% probability that the real value lies in the range; it either does or doesn’t. It just puts a limit on the probability that you are wrong if you make that assumption. These sound almost the same, but the former is about the real value of something that may have no probability associated with it, it is just unknown; the latter is about the fact that you do lots of experiments, each effectively like the toss of a coin.

So if you assume something is in the 95% confident range, you really can be 95% confident that you are right.

Of course, this is about ALL of the experiments that you or others do. However, often only positive results are published, so it is NOT necessarily true of the whole published literature.
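This ‘right 95% of the time’ property is easy to see in a simulation. The Python sketch below runs many imaginary experiments against a known true value and counts how often the 95% interval contains it. Everything about the set-up is an assumption for illustration: the true mean, the spread, the sample size and the function name `coverage` are made up, and to keep the sums simple it treats the standard deviation as known, which is rarely the case in a real study.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def coverage(true_mean=0.3, sigma=1.5, n=30, trials=2000, confidence=0.95, seed=1):
    """Fraction of simulated experiments whose interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / sqrt(n)  # sigma assumed known (a simplification)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        centre = mean(sample)
        if centre - half_width <= true_mean <= centre + half_width:
            hits += 1
    return hits / trials


print(f"intervals containing the true value: {coverage():.1%}")
```

Any single simulated interval either contains 0.3 or it does not; it is only the proportion over many experiments that settles close to 95%.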

# Doing it (making sense of statistics) – 3 – hypothesis testing

Hypothesis testing is still the most common use of statistics – using various measures to come up with a p < 5% result.

In this video we’ll look at what this 5% (or 1%) actually means, and as important what it does not mean. Perhaps even more critical is understanding what you can conclude from a non-significant result, and in particular remembering that it means ‘not proven’ NOT ‘no effect’!

The language of hypothesis testing can be a little opaque!

A core term, which you have probably seen, is the idea of the null hypothesis, also written H0, which is usually what you want to disprove. For example, it may be that your new design has made no difference to error rates.

The alternative hypothesis, written H1, is what you typically would like to be true.

The argument form is similar to proof by contradiction in logic or mathematics. In this you assert what you believe to be false as if it were true, reason from that to something that is clearly contradictory, and then use that to say that what you first assumed must be false.

In statistical reasoning of course you don’t know something is false, just that it is unlikely.

The hypothesis testing reasoning goes like this:

if the null hypothesis H0 were true
then the observations (measurements) are very unlikely
therefore the null hypothesis H0 is (likely to be) false
and hence the alternative H1 is (likely to be) true

For example, imagine our null hypothesis is that a coin is fair. You toss it 100 times and it ends up a head every time. The probability of this given a fair coin (likelihood) is 1/2^100, that is around 1 in a nonillion (1 with 30 zeros!). This seems so unlikely, you begin to wonder about that coin!

Of course, most experiments are not so clear cut as that.

You may have heard of the term significance level, this is the threshold at which you decide to reject the null hypothesis. In the example above, this was 1 in a nonillion, but that is quite extreme.

The smallest level that is normally regarded as reasonable evidence is 5%. This means that if the likelihood of the null hypothesis (probability of the observation given H0) is less than 5% (1 in 20), you reject it and assume the alternative must be true.

Let’s say that our null hypothesis was that your new design is no different to the old one. When you tested with six users all the users preferred the new design.

The reasoning using a 5% significance level as a threshold would go:

if   the null hypothesis H0 is true (no difference)

then   the probability of the observed effect (6 preferences)
happening by chance
is 1/64 which is less than 5%

therefore   reject H0 as unlikely to be true
and conclude that the alternative H1 is likely to be true

Yay! your new design is better 🙂
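The 1/64 figure is a Binomial probability, and can be checked with a short Python sketch. The function name `prob_at_least` is made up for illustration, and the calculation is one-sided (it only counts preferences for the new design), as in the reasoning above.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of k or more 'successes' in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Under H0 each user is equally likely to prefer either design (p = 0.5).
p_value = prob_at_least(6, 6)
print(p_value, p_value < 0.05)  # prints: 0.015625 True
```

The same function handles less clear-cut outcomes, for example `prob_at_least(5, 6)` gives the chance of five or more of the six users preferring the new design if there were really no difference.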

Note that this figure of 5% is fairly arbitrary. What figure is acceptable depends on a lot of factors. In usability, we will usually be using the results of the experiment alongside other evidence, often knowing that we need to make adaptations to generalise beyond the conditions. In physics, if they conclude something is true, it is taken to be incontrovertibly true, so they look for a figure more like 1 in 100,000 or millions.

Note too that if you take 5% as your acceptable significance level, then if your system were no better there would still be a 1 in 20 chance you would conclude it was better – this is called a type I error (more stats-speak!), or (more comprehensibly) a false positive result.

Note also that a 5% significance does NOT say that the probability of the null hypothesis is less than 1 in 20.   Think about the coin you are tossing, it is either fair or not fair, or think of the experiment comparing your new design with the old one: your design either is or is not better in terms of error rate for typical users.

Similarly it does not say the probability of the alternative hypothesis (your system is better) is > 0.95. Again it either is or is not.

Nor does it say that the difference is important. Imagine you have lots and lots of participants in an experiment, so many that you are able to distinguish quite small differences. The experiment showed, with a high degree of statistical significance (maybe 0.1%), that users perform faster with your new system than the old one. The difference turns out to be 0.03 seconds over an average time of 73 seconds. The difference is real and reliable, but do you care?

All that a 5% significance level says is that if the null hypothesis H0 were true, then the probability of seeing the observed outcome by chance is less than 1 in 20.

Similarly for a 1% level the probability of seeing the observed outcome by chance is less than 1 in 100, etc.

Perhaps the easiest mistake to make with hypothesis testing is not when the result is significant, but when it isn’t.

Say you have run your experiment comparing the old system with the new design and there is no statistically significant difference between the two. Does that mean there is no difference?

This is a possible explanation, but it also may simply mean your experiment was not good enough to detect the difference.

Although you do reason that a significant result means that H0 is likely to be false and H1 (the alternative) is likely to be true, you cannot do the opposite.

You can NEVER (simply) reason: non-significant result means H0 is true / H1 is false.

For example, imagine we have tossed 4 coins and all came up heads. If the coin is fair the probability of this happening is 1 in 16, this is not <5%, so even with the least strict level, this is not statistically significant.   However, this was the most extreme result that was possible given the experiment, tossing 4 coins could never give you enough information to reject the null hypothesis of a fair coin!
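A quick Python sketch shows how many tosses you would need before even the most extreme outcome could count as significant. The function name `most_extreme_p` is hypothetical, and the calculation is the same one-sided ‘all heads’ probability used in the text.

```python
def most_extreme_p(n_tosses):
    """Probability of all heads in n tosses of a fair coin."""
    return 0.5 ** n_tosses


for n in range(1, 8):
    p = most_extreme_p(n)
    verdict = "could be significant at 5%" if p < 0.05 else "can never reach 5%"
    print(f"{n} tosses: best possible p = {p:.4f} ({verdict})")
```

With four tosses the best you can do is 1 in 16; it takes at least five tosses (1 in 32) before any outcome at all could fall below the 5% level.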

Scottish law courts can return three verdicts: guilty, not guilty or not proven. Guilty means the judge or jury feels there is enough evidence to conclude reasonably that the accused did the crime (but of course still could be wrong) and not guilty means they are reasonably certain the accused did not commit the crime. The ‘not proven’ verdict means that the judge or jury simply does not feel they have sufficient evidence to say one way or the other. This is often the verdict when it is a matter of the victim’s word versus that of the accused, as frequently happens in rape cases.

Scotland is unusual in having the three classes of verdict and there is some debate as to whether to remove the ‘not proven’ verdict, as in practice both ‘not proven’ and ‘not guilty’ mean the accused is acquitted. However, it highlights that in other jurisdictions ‘not guilty’ includes both: it does not mean the court is necessarily convinced that the accused is innocent, merely that the prosecution has not provided sufficient evidence to prove they are guilty. In general a court would prefer that the guilty walk free than that the innocent are incarcerated, so the barrier to declaring ‘guilty’ is high (‘beyond all reasonable doubt’ … not p<5%!), so amongst the ‘not guilty’ will be many who committed a crime as well as many who did not.

In statistics ‘not significant’ is just the same – not proven.

In summary, all a test of statistical significance means is that if the null hypothesis (often no difference) is true, then the probability of seeing the measured results is low (e.g. <5%, or <1%). This is then used as evidence against the null hypothesis. It is good to return to this occasionally, but for most purposes an informal understanding is that statistical significance is evidence for the alternative hypothesis (often what you are trying to show), but may be wrong – and the smaller the % or probability the more reliable the result. However, all that non-significance means is not proven.

# Doing it (making sense of statistics) – 2 – probing the unknown

You use statistics when there is something in the world you don’t know, and want to get a level of quantified understanding of it based on some form of measurement or sample.

One key mathematical element of this shared by all techniques is the idea of conditional probability and likelihood; that is the probability of a specific measurement occurring assuming you know everything pertinent about the real world. Of course the whole point is that you don’t know what is true of the real world, but do know about the measurement, so you need to do back-to-front counterfactual reasoning, to go back from measurement to the world!

Future videos will discuss three major kinds of statistical analysis methods:

• Hypothesis testing (the dreaded p!) – robust but confusing
• Confidence intervals – powerful but underused
• Bayesian stats – mathematically clean but fragile

The first two use essentially the same theoretical approach, and the difference is more about the way you present results. Bayesian statistics takes a fundamentally different approach, with its own strengths and weaknesses.

First of all let’s recall the ‘job of statistics’, which is an attempt to understand the fundamental properties of the real world based on measurements and samples. For example, you may have taken a dozen people (the sample), asked them to perform a task on a piece of software and a new version of the software. You have measured response times, satisfaction, error rate, etc., (the measurement) and want to know whether your new software will outperform the original software for the whole user group (the real world).

We are dealing with data with a lot of randomness and so need to deal with probabilities, but in particular what is known as conditional probability.

Imagine the main street of a local city. What is the probability that it is busy?

Now imagine that you are standing in the same street but it is 4am on a Sunday morning: what is the probability it is busy given this?

Although the overall probability of it being busy (at a random time of day) is high, the probability that it is busy given it is 4am on a Sunday is lower.

Similarly think of throwing a single die. What is the probability it is a six? 1 in 6.

However, suppose I peek and tell you it is at least 4. What now is the probability it is a six? The probability it is a six given it is four or greater is 1 in 3.

When we have more information, then we change our assessment of the probability of events accordingly.  This calculation of probability given some information is what mathematicians call conditional probability.
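The die example can be checked by simply counting faces. This is only a minimal sketch: the helper name `probability` is made up for illustration, and it assumes a fair die where every face is equally likely.

```python
from fractions import Fraction

faces = range(1, 7)  # a fair six-sided die, all faces equally likely

def probability(event, given=lambda face: True):
    """P(event | given), found by counting the faces that remain possible."""
    possible = [f for f in faces if given(f)]
    favourable = [f for f in possible if event(f)]
    return Fraction(len(favourable), len(possible))

p_six = probability(lambda f: f == 6)
p_six_given_at_least_4 = probability(lambda f: f == 6, given=lambda f: f >= 4)

print(p_six)                   # 1/6
print(p_six_given_at_least_4)  # 1/3
```

The extra information does not change the die, it just shrinks the set of outcomes that are still possible.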

Returning to the job of statistics, we are interested in the relationship between measurements of the real world and what is true of the real world. Although we may not know what is true of the world (what is the actual error rate of our new software going to be), we can often work out the probability of measurements given (the unknown) state of the world.

For example, if the probability of a user making a particular error is 1 in 10, then the probability that exactly 2 make the error out of a sample of 5 is 7.29% (calculated from the Binomial distribution).
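If you want to check Binomial figures like this yourself, a few lines are enough. This is a sketch that assumes each user makes the error independently with the same probability; the function name `binomial_pmf` is just an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k 'successes' in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.1  # 5 users, each with a 1 in 10 chance of making the error
for k in range(n + 1):
    print(f"exactly {k} of {n} make the error: {binomial_pmf(k, n, p):.2%}")
```

The 7.29% figure appears in this table, and the six probabilities add up to 1 as they should.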

This conditional probability of a measurement given the state of the world (or typically some specific parameters of the world) is what statisticians call likelihood.

As another example the probability that six tosses of a coin will all come out heads given the coin is fair is 1/64, or in other words the likelihood that it is fair is 1/64. If instead the coin were biased 2/3 heads 1/3 tails, the probability of 6 heads given this, the likelihood of the coin having this bias, is 64/729 ~ 1/11.
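The two likelihoods can be worked out exactly with fractions. A small sketch, assuming independent tosses; the function name is made up for illustration.

```python
from fractions import Fraction

def likelihood_all_heads(p_head, tosses=6):
    """Probability that every one of the tosses is a head, given P(head)."""
    return Fraction(p_head) ** tosses

fair = likelihood_all_heads(Fraction(1, 2))
biased = likelihood_all_heads(Fraction(2, 3))

print(fair)                  # 1/64
print(biased)                # 64/729
print(float(biased / fair))  # about 5.6
```

So six heads in a row are between five and six times more probable under the biased coin than under the fair one. That ratio is evidence, but on its own it is still not the probability that the coin is biased.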

Note this likelihood is NOT the probability that the coin is fair or biased, we may have good reason to believe that most coins are fair. However, it does constitute evidence. The core difference between different kinds of statistics is the way this evidence is used.

Effectively statistics tries to turn this round, to take the likelihood, the probability of the measurement given the unknown state of the world, and reverse this, use the fact that the measurement has occurred to tell us something about the world.

Going back again to the job of statistics, the measurements we have of the world are prone to all sorts of random effects. The likelihood models the impact of these random effects as probabilities.   The different types of statistics then use this to produce conclusions about the real world.

However, crucially these are always uncertain conclusions. Although we can improve our ability to see through the fog of randomness, there is always the possibility that by sheer chance things appear to suggest one conclusion even though it is not true.

We will look at three types of statistics.

Hypothesis testing is what you are most likely to have seen – the dreaded p! It was originally introduced as a form of ‘quick hack’, but has come to be the most widely used tool. Although it can be misused, deliberately or accidentally, in many ways, it is time-tested, robust and quite conservative. The downside is that understanding what it really says (not p<5% means true!) can be slightly complex.

Confidence intervals use the same underlying mathematical methods as hypothesis testing, but instead of talking about whether there is evidence for or against a single value, or proposition, confidence intervals give a range of values. This is really powerful in giving a sense of the level of uncertainty around an estimate or prediction, but confidence intervals are woefully underused.

Bayesian statistics use the same underlying likelihood (although not called that!) but combine this with numerical estimates of the probability of the world. It is mathematically very clean, but can be fragile. One needs to be particularly careful to avoid confirmation bias and when dealing with multiple sources of non-independent evidence. In addition, because the results are expressed as probabilities, this may give an impression of objectivity, but in most cases it is really about modifying one’s assessment of belief.

We will look at each of these in more detail in coming videos.

To some extent these techniques have been pretty much the same for the past 50 years, however computation has gradually made differences. Crucially, early statistics needed to be relatively easy to compute by hand, whereas computer-based statistical analyses can use more complex methods. This has allowed more complex models based on theoretical distributions, and also simulation methods that use models where there is no ‘nice’ mathematical solution.

# Doing it (making sense of statistics) – 1 – introduction

In this part we will look at the major kinds of statistical analysis methods:

• Hypothesis testing (the dreaded p!) – robust but confusing
• Confidence intervals – powerful but underused
• Bayesian stats – mathematically clean but fragile

None of these is a magic bullet; all need care and a level of statistical understanding to apply.

We will discuss how these are related including the relationship between ‘likelihood’ in hypothesis testing and conditional probability as used in Bayesian analysis. There are common issues including the need to clearly report numbers and tests/distributions used, avoiding cherry picking, dealing with outliers, non-independent effects and correlated features. However, there are also specific issues for each method.

Classic statistical methods used in hypothesis testing and confidence intervals depend on ideas of ‘worse’ for measures, which are sometimes obvious, sometimes need thought (one vs. two tailed test), and sometimes outright confusing. In addition, care is needed in hypothesis testing to avoid classic fails such as treating non-significant as no-effect and inflated effect sizes.

In Bayesian statistics different problems arise including the need to be able to decide in a robust and defensible manner what are the expected prior probabilities of different hypotheses before an experiment; the closeness of replication; and the danger of common causes leading to inflated probability estimates due to a single initial fluke event or optimistic prior.

Crucially, while all methods have problems that need to be avoided, not using statistics at all can be far worse.

# Things to come …

probing the unknown

• conditional probability and likelihood
• statistics as counter-factual reasoning

types of statistics

• hypothesis testing (the dreaded p!) – robust but confusing
• confidence intervals – powerful but underused
• Bayesian stats – mathematically clean .. but fragile – issues of independence

issues

• idea of ‘worse’ for measures
• what to do with ‘non-sig’
• priors for experiments?  Finding or verifying
• significance vs effect size vs power

dangers

• avoiding cherry picking – multiple tests, multiple stats, outliers, post-hoc hypotheses
• non-independent effects (e.g. fat and sugar)
• correlated features (e.g. weight, diet and exercise)

# Wild and wide (making sense of statistics) – 7 – Normal or not

approximations, central limit theorem and power laws

Some phenomena, such as tossing coins, have well understood distributions – in the case of coin tossing the Binomial distribution. This means one can work out precisely how likely every outcome is. At other times we need to use an approximate distribution that is close enough.

A special case is the Normal Distribution, the well-known bell-shaped curve. Some phenomena such as heights seem to naturally follow this distribution, whilst others, such as coin-toss scores, end up looking approximately Normal for sufficiently many tosses.

However, there are special reasons why this works and some phenomena, in particular income distributions and social network data, are definitely NOT Normal!

Often in statistics, and also in engineering and forms of applied mathematics, one approximates one thing with another. For example, when working out the deflection of a beam under small loads, it is common to assume it takes on a circular arc shape, when in fact the actual shape is somewhat more complex.

The slide shows a histogram. It is a theoretical distribution, the Binomial Distribution for n=6. This is what you’d expect for the number of heads if you tossed a fair coin six times. The histogram has been worked out from the maths: for n=1 (one toss) it would have been 50:50 for 0 and 1. If it was n=2 (two tosses), the histogram would have been ¼, ½, ¼ for 0, 1, 2. For n=6, the probabilities are 1/64, 6/64, 15/64, 20/64, 15/64, 6/64, 1/64 for 0–6. However, if you tossed six coins, then did it again and again and again, thousands of times and kept tally of the number you got, you would end up getting closer and closer to this theoretical distribution.
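You can see both halves of this, the maths and the repeated tossing, in a few lines. This is only a sketch: the seed and the number of repetitions (20,000) are arbitrary choices of mine, made so the run is repeatable.

```python
import random
from collections import Counter
from math import comb

n = 6  # coins per go

# Theoretical distribution: 1/64, 6/64, 15/64, 20/64, 15/64, 6/64, 1/64
theoretical = [comb(n, k) / 2**n for k in range(n + 1)]

# Empirical distribution: toss six virtual fair coins again and again
rng = random.Random(1)  # fixed seed (an arbitrary choice)
trials = 20_000
counts = Counter(sum(rng.random() < 0.5 for _ in range(n)) for _ in range(trials))
empirical = [counts[k] / trials for k in range(n + 1)]

for k in range(n + 1):
    print(f"{k} heads: theory {theoretical[k]:.4f}   simulated {empirical[k]:.4f}")
```

Try reducing `trials` to 50 or so and the simulated column becomes much more ragged; increase it and the two columns get closer and closer.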

It is discrete and bounded, but overlaid on it is a line representing the Normal distribution with mean and standard deviation chosen to match the binomial. The Normal distribution is a continuous distribution and unbounded (the tails go on for ever), but actually is not a bad fit, and for some purposes may be good enough.

In fact, if you looked at the same curves against a binomial for 20, 50 or 100 tosses, the match gets better and better. As many statistical tests (such as Student’s t-test, linear regression, and ANOVA) are based around the Normal, this is good as it means that you can often use these even with something such as coin tossing data!
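One rough way to put a number on ‘better and better’ is to compare each bar of the binomial histogram with the height of the matching Normal curve at that point. The measure used here (the worst gap as a fraction of the tallest bar) is my own choice for illustration, not a standard test.

```python
from math import comb, sqrt
from statistics import NormalDist

def worst_gap(n):
    """Largest bar-versus-curve gap for n fair coins, relative to the tallest bar."""
    # Normal curve with the same mean (n/2) and standard deviation (sqrt(n)/2)
    normal = NormalDist(mu=n / 2, sigma=sqrt(n) / 2)
    binomial = [comb(n, k) / 2**n for k in range(n + 1)]
    gap = max(abs(b - normal.pdf(k)) for k, b in enumerate(binomial))
    return gap / max(binomial)

gaps = {n: worst_gap(n) for n in (6, 20, 50, 100)}
for n, gap in gaps.items():
    print(f"n = {n:3d}: worst gap is {gap:.2%} of the tallest bar")
```

Even at n=6 the worst mismatch is only a few percent of the tallest bar, and it shrinks steadily as the number of tosses grows.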

Although doing statistics on coin tosses is not that helpful except in exercises, it is not the only thing that comes out approximately normal, so many things from people’s heights to exam results seem to follow the Normal curve.

Why is that?

A mathematical result, the central limit theorem, explains why this is the case. This says that if you take lots of things and:

1. average them or add them up (or do something close to linear)
2. they are around the same size (so no single value dominates)
3. they are nearly independent
4. they have finite variance (we’ll come back to this!)

Then the average (or sum of lots of very small things) behaves more and more like a Normal distribution as the number of small items gets larger.
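To get a feel for this, here is a small simulation sketch. It starts with values that are very lopsided (drawn from an exponential distribution, my choice purely for illustration) and averages more and more of them. Skewness is used as a crude measure of lopsidedness: it is zero for a perfectly symmetric distribution such as the Normal.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(42)  # fixed seed (an arbitrary choice)

def skewness(values):
    """Average cubed z-score: 0 for a symmetric distribution."""
    m, s = fmean(values), pstdev(values)
    return fmean(((v - m) / s) ** 3 for v in values)

def average_of_draws(how_many):
    """Average of several independent, very skewed values."""
    return fmean(rng.expovariate(1.0) for _ in range(how_many))

skew_by_size = {}
for how_many in (1, 5, 50):
    averages = [average_of_draws(how_many) for _ in range(10_000)]
    skew_by_size[how_many] = skewness(averages)
    print(f"average of {how_many:2d} values: skewness {skew_by_size[how_many]:.2f}")
```

The single values are strongly skewed, but the averages of 50 are already much closer to the symmetric bell shape, even though nothing about the underlying values has changed.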

Your height is based on many genes and many environmental factors. Occasionally, for some individuals, there may be a single rare gene condition, traumatic event, or illness that stunts growth or causes gigantism. However, for the majority of the population it is the cumulative effect (near sum) of all those small effects that leads to our ultimate height, and hence why height is Normally distributed.

Indeed this pattern of large numbers of small things is so common that we find Normal distributions everywhere.

So if this set of conditions is so common, is everything Normal so long as you average enough of them?

Well if you look through the conditions various things can go wrong.

Condition (2), not having a single overwhelming value, is fairly obvious, and you can see when it is likely to fail.

The independence condition (3) is actually not as demanding as first appears. In the virtual coin demonstrator, setting a high correlation between coin tosses meant you got long runs of heads or tails, but eventually you get a swop and then a run of the other side. Although ‘enough of them’ ends up being even more, you still get Normal distributions. Here the non-independence is local and fades; all that matters is that there are no long-distance effects where one value affects so many of the others as to dominate.

The linearity condition (1) is more interesting. There are various things that can cause non-linearity. One is some sort of threshold effect, for example, if plants are growing near an electric fence, those that reach a certain height may simply be electrocuted leading to a chopped off Normal distribution!

Feedback effects also cause non-linearity as some initial change can make huge differences, the well-known butterfly effect. Snowflake growth is a good example of positive feedback, ice forms more easily on sharp tips, so any small growth grows further and ends up being a long spike. In contrast kids-picture-book bumpy clouds suggest a negative feedback process that smoothens out any protuberance.

In education in some subjects, particularly mathematical or more theoretical computer science, the results end up not as a Normal-like bell curve, but bi-modal as if there were two groups of students. There are various reasons for this, not least the tendency for many subjects like this to have questions with an ‘easy bit’ and then a ‘sting in the tail’! However, this also may well represent a true split in the student population. In the humanities if you have trouble with one week’s lectures, you can still understand the next week. With mathematical topics, often once you lose track everything else seems incomprehensible – that is a positive feedback process, one small failure leads to a bigger failure, and so on.

However, condition (4), the finite variance, is the oddest. Variance is a measure of the spread of a distribution. You find the variance by first finding the arithmetic mean of your data, then working out the differences from this (called the residuals), squaring those differences and then finding the average of those squares.
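Here is that recipe step by step on a tiny made-up data set (the five numbers are invented purely for illustration):

```python
from math import sqrt
from statistics import fmean, pvariance

data = [12, 15, 9, 14, 10]  # made-up measurements

mean = fmean(data)                                    # 12.0
residuals = [x - mean for x in data]                  # 0, 3, -3, 2, -2
squared = [r**2 for r in residuals]                   # 0, 9, 9, 4, 4
variance = fmean(squared)                             # 26 / 5 = 5.2
standard_deviation = sqrt(variance)

print(mean, variance, round(standard_deviation, 2))
print(pvariance(data))  # the library's version of the same calculation
```

Note that this is the plain ‘average of the squares’ described above. Many statistics packages default to a slightly different version that divides by one less than the number of items, which matters for small samples but not for the argument here.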

For any real set of data this is bound to be a finite number, so what does it mean to not have a finite variance?

Normally, if you take larger and larger samples, this variance figure settles down and gets closer and closer to a single value. Try this with the virtual coin tossing application, increase the number of rows and watch the figure for the standard deviation (square root of variance) gradually settle down to a stable (and finite) value.

However, there are some phenomena where if you did this, took larger and larger samples, the standard deviation wouldn’t settle down to a final value, but would in fact typically get larger and larger as the sample size grows (although as it is random sometimes larger samples might have small spread). Although the variance and standard deviation are finite for any single finite sample, they grow without bound as samples get larger.
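You can watch the difference in a simulation. This sketch compares fair coin tosses with values drawn from a Pareto distribution with shape 1.5, which is my own choice of a standard heavy-tailed example that has a finite mean but no finite variance. The seed and sample sizes are arbitrary.

```python
import random
from math import sqrt
from statistics import fmean

rng = random.Random(7)  # fixed seed (an arbitrary choice)

def spread(values):
    """Standard deviation: square root of the average squared residual."""
    m = fmean(values)
    return sqrt(fmean((v - m) ** 2 for v in values))

coin_sd, heavy_sd = {}, {}
for n in (100, 1_000, 10_000, 100_000):
    coin = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    heavy = [rng.paretovariate(1.5) for _ in range(n)]  # heavy-tailed values
    coin_sd[n] = spread(coin)
    heavy_sd[n] = spread(heavy)
    print(f"n = {n:6d}: coin sd {coin_sd[n]:.3f}   heavy-tailed sd {heavy_sd[n]:.1f}")
```

The coin figure homes in on 0.5. The heavy-tailed figure jumps about and, over many runs with different seeds, tends to drift upwards as the samples grow, because every so often a single enormous value arrives and dominates the calculation.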

Whether this happens or not is all about the tail. It is only possible at all with an unbounded tail, where there are occasional very large values, but on its own this is not sufficient.

Take the example of the number of heads before you get a tail (called a Negative Binomial). This is unbounded, you could get any number of heads, but the likelihood of getting lots of heads falls off very rapidly (one in a million for 20 heads), which leads to a finite mean of 1 and a finite variance of exactly 2.
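These figures can be checked by adding up the series directly. A sketch, assuming a fair coin; the infinite sums are cut off after 200 terms, by which point the remaining terms are far too small to matter.

```python
from fractions import Fraction

def p_heads_then_tail(k):
    """P(exactly k heads followed by the first tail) for a fair coin."""
    return Fraction(1, 2) ** (k + 1)

terms = range(200)  # truncation of the infinite sum (the rest is negligible)
mean = sum(k * p_heads_then_tail(k) for k in terms)
second_moment = sum(k**2 * p_heads_then_tail(k) for k in terms)
variance = second_moment - mean**2

p_at_least_20 = Fraction(1, 2) ** 20  # the first 20 tosses are all heads

print(float(mean))        # very close to 1
print(float(variance))    # very close to 2
print(1 / p_at_least_20)  # 1048576, roughly one in a million
```

The probabilities halve with every extra head, which is fast enough to keep both the mean and the variance finite even though there is no upper limit on the count.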

Similarly the Normal distribution itself has potentially unbounded values, arbitrarily large in positive and/or negative directions, but they are very unlikely resulting in a finite variance.

Indeed, in my early career, distributions without finite variance seemed relatively rare, a curiosity. One of the few common examples was income distributions in economics. For income the few super rich are often not enough to skew the arithmetic average income; Wayne Rooney’s wages averaged over the entire UK working population are less than a penny each. However, these few very well paid are enough to affect variance. Of course, for any set of people at any time, the variance is still finite, but the shape of it means that in practice if you take, say, lots of samples of first 100, then 1000, and so on larger and larger, the variance would keep on going up. For wealth (rather than income) this is also true for the average!

I said ‘in my early career’, as this was before the realisation of how common power law distributions were in human and natural phenomena.

You may have heard of the phrase ‘power law’ or maybe ‘scale free’ distributions, particularly related to web phenomena. Do note that the use of the term ‘power’ here is different from statistical ‘power’, and for that matter the Power Rangers.

Some years ago it became clear that a number of physical phenomena, for example earthquakes, have behaviours that are very different from the Normal distribution or other distributions that were commonly dealt with in statistics at that time.

An example that you can observe for yourself is in a simple egg timer. Watch the sand grains as they dribble through the hole and land in a pile below. The pile initially grows, the little pile gets steeper and steeper, and then there is a little landslide and the top of the pile levels a little, then grows again, another little landslide. If you kept track of the landslides, some would be small, just levelling off the very tip of the pile, some larger, and sometimes the whole side of the pile cleaves away.

There is a largest size for the biggest landslide due to the small size of the egg timer, but if you imagine the same slow stream of sand falling onto a larger pile, you can imagine even larger landslides.

If you keep track of the size of landslides, there are fewer large landslides than smaller ones, but the fall off is not nearly as dramatic as, say, the likelihood of getting lots and lots of heads in a row when tossing a coin. Like the income distribution, the distribution is ‘tail heavy’; there are enough low frequency but very high value events to affect the variance.

For sand piles and for earthquakes, this is due to positive feedback effects. Think of dropping a single grain of sand on the pile. Sometimes it just lands and stays. While the sand pile is quite shallow this happens most of the time and the pile gets higher … and steeper. However, once the pile has grown and steepened, so that the slope is only just stable, sometimes the single grain knocks another out of place. These may both roll a little and then stop, but they may knock another, and the little landslide has enough speed to create a larger one.

Earthquakes are just the same. Tension builds up in the rocks, and at some stage a part of the fault that is either a little loose, or under slightly more tension gives – a minor quake. However, sometimes that small amount of release means the bit of fault next to it is also pushed over its limit and also gives, the quake gets bigger, and so on.

Well, user interface and user experience testing doesn’t often involve earthquakes, nor for that matter sand piles. However, network phenomena such as web page links, paper citations and social media connections follow a very similar pattern. The reason for this is again positive feedback effects; if a web page has lots of links to it or a paper is heavily cited, others are more likely to find it and so link to it or cite it further. Small differences in the engagement or quality of the source, or simply some random initial spark of interest, lead to large differences in the final link/cite count. Similarly if you have lots of Twitter followers, more people see your tweets and choose to follow you.
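A toy simulation shows how strongly this kind of ‘rich get richer’ feedback skews the counts. To be clear, this is a deliberately simplified model of my own for illustration, not real web data: pages arrive one at a time, and each new page links to one existing page chosen with probability proportional to (links it already has + 1).

```python
import random

rng = random.Random(3)  # fixed seed (an arbitrary choice)
n_pages = 20_000

links = [0] * n_pages  # incoming links for each page
tickets = [0]          # page 0 starts with a single ticket in the draw

for new_page in range(1, n_pages):
    target = rng.choice(tickets)  # chance proportional to (incoming links + 1)
    links[target] += 1
    tickets.append(target)        # positive feedback: more links, more tickets
    tickets.append(new_page)      # the newcomer gets its own starting ticket

ordered = sorted(links)
print("median incoming links:", ordered[n_pages // 2])
print("mean incoming links:  ", sum(links) / n_pages)
print("most linked page:     ", ordered[-1])
```

Every page is treated identically by the rules, yet the typical page ends up with hardly any links while a handful collect a huge number. Early luck is amplified rather than averaged away.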

Crucially this means that if you take the number of cites of a paper, or the number of social media connections of a person, these do NOT behave like a Normal distribution even when averaged.

If you use statistical tests and tools, such as t-tests, ANOVA or linear regression that assume Normal data, your analysis will be utterly meaningless and quite likely misleading.

As this sort of data becomes more common this is of growing importance to know.

This does not mean you cannot use this sort of data, you can use special tests for it, or process the data to make it amenable to standard tests.

For example, you can take citation or social network connection data and code each as ‘low’ (bottom quartile of cites/connections), ‘medium’ (middle 50%) or ‘high’ (top quartile of cites/connections). If you turn these into a 0, 1, 2 scale, these have relative probability 0.25, 0.5, 0.25 – just like the number of heads when tossing two coins. This transformation of the data means that it is now suitable for use with standard tests so long as you have sufficient measurements – which is usually not a problem with this kind of data!
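As a sketch of this recoding, here it is applied to some made-up citation counts (the numbers are invented and the function name is mine). The quartile cut points come from Python’s `statistics.quantiles`.

```python
from statistics import quantiles

def code_low_medium_high(values):
    """Recode each value as 0 (bottom quartile), 1 (middle half) or 2 (top quartile)."""
    q1, _median, q3 = quantiles(values, n=4)
    return [0 if v < q1 else 2 if v > q3 else 1 for v in values]

# Made-up citation counts with a very long tail
citations = [0, 1, 1, 2, 3, 3, 4, 5, 6, 8, 12, 15, 40, 95, 900, 12000]

codes = code_low_medium_high(citations)
print(codes)
print([codes.count(c) / len(codes) for c in (0, 1, 2)])  # [0.25, 0.5, 0.25]
```

The paper with 12,000 citations now counts simply as ‘high’, no more extreme than the one with 40, so it can no longer dominate an average. With real data lots of tied values (for example many papers with zero citations) can make the three groups less even than this.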

# Wild and wide (making sense of statistics) – 6 – distributions

discrete or continuous, bounds and tails

We now look at some of the different properties of data and ‘distributions’, another key word of statistics.

We discuss the different kinds of continuous and discrete data and the way they may be bounded within some finite range, or be unbounded.

In particular we’ll see how some kinds of data, such as income distributions, may have a very long tail, small number of very large values (yep that 1%!).

Some of this is a brief reprise if you have previously done some sort of statistics course.

One of the first things to consider, before you start any statistical analysis, is the kind of data you are dealing with.

A key difference is between continuous and discrete data. Think about an experiment where you have measured time to complete a particular sub-task and also the number of errors during the session. The first of these, completion time, is continuous, it might be 12 seconds or 13 seconds, but could also be 12.73 seconds or anything in between. However, while a single user might have 12 or 13 errors, they cannot have 12.5 errors.

Discrete values also come in a number of flavours.

The number of errors is arithmetic. Although a single user cannot get 12.5 errors, it makes sense to average them, so you could find that the average error rate is 12.73 errors per user. Although one often jokes about the mythical family with 2.2 children, it is meaningful, if you have 100 families you expect, on average 220 children.

In contrast nominal or categorical data has discrete values that cannot easily be compared, added or averaged. For example, if when presented with a menu half your users choose ‘File’ and half choose ‘Font’, it does not make sense to say that they have on average selected ‘Flml’!

In between are ordinal values, such as the degrees of agreement or satisfaction in a Likert scale. To add a little confusion these are often coded as numbers, so that 1 might be the left-most point and 5 the right-most point in a simple five point Likert scale. While 5 may represent better than 4 and 3 better than 1, it is not necessarily the case that 4 is twice as good as 2. The points are ordered, but do not represent any precise values. Strictly you cannot simply add up and average ordinal values … however, in practice and if you have enough data, you can sometimes ‘get away’ with it, especially if you just want an indicative idea or quick overview of data … but don’t tell any purists I said so 😉

A special case of discrete data is binary data such as yes/no answers, or present/not present. Indeed one way of dealing with ordinal data, that avoids frowned upon averaging, is to choose some critical value and turn the values into simple big/small. For example, you might say that 4 and 5 are generally ‘satisfied’, so you convert 4 and 5 into ‘Yes’ and 1, 2 and 3 into ‘No’. The downside of this is that it loses information, but can be an easy way to present data.
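In code this conversion is a one-liner. The responses below are made up for illustration, and the cut-off of 4 is the one suggested above.

```python
# Made-up responses on a five point Likert scale (1 = left-most, 5 = right-most)
responses = [5, 4, 2, 3, 4, 1, 5, 4, 3, 2]

SATISFIED_FROM = 4  # 4 and 5 count as 'Yes', 1 to 3 as 'No'
satisfied = [r >= SATISFIED_FROM for r in responses]

share_satisfied = sum(satisfied) / len(satisfied)
print(f"{share_satisfied:.0%} satisfied")  # 50% satisfied
```

A proportion like this is perfectly respectable to report, whereas the ‘average’ of the raw codes is the thing that makes purists wince.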

Another question about data is whether the values are finite or potentially unbounded.

Let’s look at some examples.

number of heads in 6 tosses – This is a discrete value and bounded. The valid responses can only be 0, 1, 2, 3, 4, 5 or 6. You cannot have 3.5 heads, nor can you have -3 heads, nor 7 heads.

number of heads until first tail – still discrete, and still bounded below, but this time unbounded above. Although unlikely you could have to wait for a hundred or a thousand heads before you get a tail, there is no absolute maximum. Although, once I got to 20 heads I might start to believe I’m in the Matrix.

wait before next bus – This is now a continuous value, it could be 1 minute, 20 minutes, or 12 minutes 17.36 seconds. It is bounded below (no negative wait times), but not above, you could wait an hour, or even forever if they have cancelled the service.

difference between heights – say if you had two buildings the Abbot Tower (let’s say height A) and Burton Heights (height B), you could subtract A-B. If Abbot Tower is taller, then this would be positive, if Burton Heights is taller, the difference would be negative. There are probably some physical limits on building height (if it were too tall the top part might be essentially in orbit and float away!). However, for most purposes the difference is effectively unbounded, either building could be arbitrarily bigger than the other.

Now for our first distribution.

The histogram is taken from UK Office for National Statistics data on weekly income during 2012. This is continuous data, but to plot it the ONS has put people into £10 ‘bins’: £0–£9.99 in the first bin, £10–£19.99 in the next bin, etc., and then the histogram height is the number of people who earn in that range.

Note that this is an empirical distribution, it is the actual number of people in each category, rather than a theoretical distribution based on mathematical calculations of probabilities.

You can easily see that mid-range weekly wages are around £300–£400, but with a lot of spread. Each bar in this mid-range represents a million people or so. Remembering my quick and dirty rule for count data, the variability of each column is probably only +/-2000, that is 0.2% of the column height. The little sticky-out columns are probably a real effect, not just random noise (yes really an omen, at this scale, things should be more uniform and smooth). I don’t know the explanation, but I wonder if it is a small tendency for jobs to have weekly, monthly or annual salaries that are round numbers.
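The arithmetic behind those figures is easy to reproduce. I am assuming here that the quick and dirty rule for count data is ‘about twice the square root of the count’, which is the form that gives the numbers quoted above; the function name is made up.

```python
from math import sqrt

def rough_spread(count):
    """Assumed quick rule: variability of a count is about +/- 2 * sqrt(count)."""
    return 2 * sqrt(count)

column = 1_000_000  # a mid-range histogram bar of around a million people
spread = rough_spread(column)

print(spread)                    # 2000.0
print(f"{spread / column:.1%}")  # 0.2%
```

Because the spread grows only with the square root, big counts are proportionately very stable: a bar of 100 people would wobble by about 20%, a bar of a million by only about 0.2%.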

You can also see that this is not a symmetric distribution, it rises quite rapidly from zero, but then tails off a lot more slowly.

In fact, the rate of tailing off is so slow that the ONS have decided to cut it off at £1000 per week, even though it says that 4.1 million people earn more than this. In plotting the data they have chosen a cut off that avoids the lower part getting too squashed.

But how far out does the ‘tail’ go?

I’ve not got the full data, but assume the tail, the long spread of low frequency values, decays reasonably smoothly at first.

However, I have found a few examples to populate a zoomed out view.

At the top is the range expanded by 3 to go up to £3000 a week. On this I’ve put the average UK company director’s salary of £2000 a week (£100K per annum) and the Prime Minister at about £3000 a week (£150K pa). UK professorial salaries fall roughly in the middle of this range.

Now let’s zoom out by a factor of ten. The range is now up to £30,000 a week. About 1/3 of the way along is the vice-chancellor of the University of Bath (effectively CEO), who is currently the highest paid university vice-chancellor in the UK at £450K pa, around three times that of the Prime Minister.

However, we can’t stop here, we haven’t got everyone yet, let’s zoom out by another factor of ten and now we can see Wayne Rooney, who is one of the highest paid footballers in the UK at £260,000 per week. Of course this is before we even get to the tech and property company super-rich who can earn (or at least amass) millions per week.

At this scale, now look at the far left, can you just see a very thin spike of the mass of ordinary wage earners? This is why the ONS did not draw their histogram at this scale. This is a long-tail distribution, one where there are very high values but with very low frequency.
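To see just how thin that spike is, this sketch converts the salaries quoted above to weekly figures (assuming 52 weeks in a year) and works out how far along the final £300,000 axis each one falls.

```python
WEEKS_PER_YEAR = 52

# Annual salaries as quoted in the text
annual = {
    "company director (average)": 100_000,
    "Prime Minister": 150_000,
    "Bath vice-chancellor": 450_000,
}
weekly = {name: pay / WEEKS_PER_YEAR for name, pay in annual.items()}
weekly["Wayne Rooney"] = 260_000         # already quoted per week
weekly["ONS histogram cut-off"] = 1_000  # right-hand edge of the original plot

AXIS_MAX = 300_000  # the most zoomed-out view
for name, pay in sorted(weekly.items(), key=lambda item: item[1]):
    print(f"{name:28s} £{pay:>9,.0f} a week  {pay / AXIS_MAX:6.2%} of the axis")
```

The whole of the original ONS histogram, with its tens of millions of people, occupies about a third of one percent of the width of this final axis.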

N.B. I got this slide wrong in the video because I lost a whole ‘order of magnitude’ between the vice-chancellor of Bath and Wayne Rooney!

There is another use of the term ‘tail’ in statistics – you may have seen mention of one- or two-tailed tests. This is referring to the same thing, the ‘tail’ of values at either end of a distribution, but in a slightly different way.

For a moment forget the distribution of the values, and think about the question you want to ask. Crucially do you care about direction?

Imagine you are about to introduce a new system that has some additional functionality. However, you are worried that its additional complexity will make it more error prone. Before you deploy it you want to check this.

Here your question is, “is the error rate higher?”.

If it is actually lower that is nice, but you don’t really care so long as it hasn’t made things worse.

This is a one-tailed test; you only care about one direction.

In contrast, imagine you are trying to decide between two systems, A and B. Before you make your decision you want to know whether the choice will affect performance. So you do a user test and measure completion times.

This time your question is, “are the completion times different?”.

This is a two-tailed test; you care in both directions, if A is better than B or if B is better than A.
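The choice makes a real difference to the numbers. As a sketch, here are coins standing in for the systems, with a made-up result of 8 heads out of 10 tosses. The one-tailed question is ‘how likely is a result at least this far towards heads?’, while the two-tailed question is ‘how likely is a result at least this far from even, in either direction?’.

```python
from math import comb

N = 10  # tosses

def prob_heads(k):
    """Probability of exactly k heads in N tosses of a fair coin."""
    return comb(N, k) / 2**N

observed = 8  # made-up result: 8 heads out of 10

# One tail: 8, 9 or 10 heads
one_tailed = sum(prob_heads(k) for k in range(observed, N + 1))
# Two tails: also count the mirror-image results of 0, 1 or 2 heads
two_tailed = one_tailed + sum(prob_heads(k) for k in range(0, N - observed + 1))

print(f"one-tailed: {one_tailed:.3f}")  # 0.055
print(f"two-tailed: {two_tailed:.3f}")  # 0.109
```

For a symmetric case like this the two-tailed figure is exactly double the one-tailed one, which is why you need to decide which question you are asking before you look at the data, not after.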

# Wild and wide (making sense of statistics) – 5 – play

experiment with bias and independence

This section will talk you through two small web demonstrators that allow you to experiment with virtual coins.

Because the coins are digital you can alter their properties, make them not be 50:50 or make successive coin tosses not be independent of one another. Add positive correlation and see the long lines of heads or tails, or put in negative correlation and see the heads and tails swop nearly every toss.

Links to the demonstrators can be found at statistics for HCI resources.

Incidentally, the demos were originally created in 1998; my very first interactive web pages. It is amazing how much you could do even with 1998 web technology!

The first application automates the two-horse races that you have done by hand with real coins in the “unexpected wildness of random” exercises. This happens in the ‘virtual race area’ marked in the slide.

So far this doesn’t give you any advantage over physical coin tossing unless you find tossing coins very hard. However, because the coins are virtual, the application is able to automatically keep a tally of all the coin tosses (in the “row counts” area), and then gather statistics about them (in the “summary statistics area”).

Perhaps most interesting is the area marked “experiment with biased or non-independent coins” as this allows you to create virtual coins that would be very hard to replicate physically.

It should initially be empty. Press the marked coin to toss a coin. It will spin and when it stops the coin will be added to the top row for heads or the lower row for tails. Press the coin again for the next toss until one side or other wins.

If you would like another go, just press the ‘new race’ button.

As you do each race you will see the coins also appear as ‘H’ and ‘T’ in the text area on the left below the coin toss button and the counts for each completed race appear in the text box to the right of that.

The area above that box on the right keeps a tally of the total number of heads and tails, the mean (arithmetic average) of heads and tails per race, and the standard deviation of each. We’ll look at these in more detail in ‘more coin tossing’ below.

Finally the area at the bottom left allows you to create unreal coins!

Adjusting the ‘prob head’ allows you to create biased coins (valid values between 0 and 1). For example, setting ‘prob head’ to 0.2 makes a coin that will fall heads 20% of the time and tails 80% of the time (both on average of course!).

Adjusting the ‘correlation’ figure allows you to create coins that depend on the previous coin toss (valid values -1 to +1). A positive figure means that each coin is more likely to be the same as the previous one – that is, if the previous coin was a head you are more likely to also get a head on the next toss. This is a bit like a learning effect. Putting in a negative value does the opposite: if the previous toss was a head the next one is more likely to be a tail.

Play with these values to get a feel for how it affects the coin tossing. However, do remember to do plenty of coin tosses for each setting otherwise all you will see is the randomness!   Play first with quite extreme values, as this will be more evident.
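If you would rather experiment in code than in the browser, the two controls can be imitated with a few lines of Python. This is only a sketch: the demonstrator's own algorithm is not described here, so the model below (a simple two-state chain in which the chance of a head is pulled towards or away from the previous outcome) is an assumption, and the name `toss_coins` is made up for illustration.

```python
import random

def toss_coins(n, prob_head=0.5, correlation=0.0, rng=None):
    """Return a string of n 'H'/'T' tosses.

    Assumed model (not necessarily what the web demo does): after a head the
    chance of a head becomes prob_head + correlation * (1 - prob_head); after
    a tail it becomes prob_head * (1 - correlation).
    """
    rng = rng or random.Random()
    tosses = []
    p = prob_head  # the first toss has no previous toss to depend on
    for _ in range(n):
        head = rng.random() < p
        tosses.append("H" if head else "T")
        if head:
            p = prob_head + correlation * (1 - prob_head)
        else:
            p = prob_head * (1 - correlation)
        # Clamp: a biased coin with strongly negative correlation can fall outside 0..1.
        p = min(1.0, max(0.0, p))
    return "".join(tosses)

print(toss_coins(30, prob_head=0.5, correlation=0.7))   # long runs
print(toss_coins(30, prob_head=0.5, correlation=-0.7))  # lots of alternation
print(toss_coins(30, prob_head=0.2))                    # mostly tails
```

With this model a fair coin keeps its long-run 50:50 balance whatever the correlation; only the pattern of runs changes.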

The second application is very similar, except no race track!

This does not do a two-horse race, but instead will toss a given number of coins, and then repeat this. You don’t have to press the toss button for each coin toss, just once and it does as many as you ask.

Instead of the virtual race area, there is just an area to put the number of coins you want it to toss, and then the number of rows of coins you want it to produce.

Let’s look in more detail.

At the top right are text areas to enter the number of coins to toss (here 10 coins at a time) and the number of rows of coins (here set to 50). You press the coin to start, just as in the two-horse race, except now it will toss the coin 10 times and the Hs and Ts will appear in the tally box below. Once it has tossed the first 10 it will toss 10 more, and then 10 more until it has 50 rows of coins – 500 in total … all just for one press.

The area for setting bias and correlation is the same as in the two-horse race, as is the statistics area.

Here is a set of tosses where the coin was set to be fair (prob head=0.5) with completely independent tosses (correlation=0) – that is just like a real coin (only faster).

You can see the first 9 rows and first 9 row counts in the left and right text areas. Note how the individual rows vary quite a lot, as we have seen in the physical coin tossing experiments. However, the average (over 50 sets of 10 coin tosses) is quite close to 50:50. Note also the standard deviation.

The standard deviation is 1.8, but note this is the standard deviation of the sample. Because this is a completely random coin toss, with well understood probabilistic properties, it is possible to calculate the ‘real’ standard deviation – this is the value you would expect to see if you repeated this thousands and thousands of times. This value is the square root of 2.5, which is just under 1.6. The measured standard deviation is an estimate of the ‘real’ value, and hence not the ‘real’ value itself, just as the measured proportion of heads has turned out at 0.49, not exactly a half. This estimate of the standard deviation itself varies a lot … indeed estimates of variation are often very variable themselves!
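The ‘real’ value comes from the standard formula for the number of heads in n independent tosses, where the variance is n × p × (1 − p), here 10 × 0.5 × 0.5 = 2.5. The following sketch (plain Python, with a fixed random seed chosen purely so that runs are repeatable) computes that value and then repeats the 50-row experiment ten times to show how much the measured standard deviation wobbles.

```python
import math
import random
import statistics

n_coins, p = 10, 0.5
true_sd = math.sqrt(n_coins * p * (1 - p))  # square root of 2.5

def sample_sd_of_heads(rows, rng):
    """Sample standard deviation of heads-per-row over `rows` rows of fair coins."""
    counts = [sum(rng.random() < p for _ in range(n_coins)) for _ in range(rows)]
    return statistics.stdev(counts)

rng = random.Random(1)  # arbitrary seed, an assumption for repeatability
estimates = [sample_sd_of_heads(50, rng) for _ in range(10)]

print(f"'real' standard deviation: {true_sd:.2f}")
print("ten estimates, each from 50 rows:", [round(e, 2) for e in estimates])
```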

Here’s another set of tosses, this time with the probability of a head set to 0.25. That is a (virtual!) coin that falls heads about a quarter of the time. So a bit like having a four sided spinner with heads on one side and tails on the other three.

The correlation has been set to zero still so that the tosses are independent.

You can see how the proportion of heads is now smaller, on average 2.3 heads to 7.7 tails in each run of 10 coins. This is not exactly 2.5, but if you repeated this sometimes it would be less, sometimes more. On average, over 1000s of tosses it would end up close to 2.5.

Now we’re going to play with very unreal non-independent coins.

The probability of being a head is set to 0.5 so it is a fair coin, but the correlation is positive (0.7) meaning heads are more likely to follow heads and vice versa.

If you look at the left hand text area you can see the long runs of heads and tails. Sometimes they do alternate, but then stay the same for long periods.

Looking up to the summary statistics area, the average numbers of heads and tails are near 50:50 – the coin was fair – but the standard deviation is a lot higher than in the independent case. This is very evident if you look at the right-hand text area with the totals, as they swing between extreme values much more than the independent coin did (even more than its quite wild randomness!).

If we put a negative value for the correlation we see the opposite effect.

Now the rows of Hs and Ts alternate a lot of the time, far more than a normal coin.

The average is still close to 50:50, but this time the variation is lower and you can see this in the row totals, which are typically much closer to five heads and five tails than the ordinary coin.

Recall the gambler’s fallacy that if a coin has fallen heads lots of times it is more likely to be a tail next. In some way this coin is a bit like that, effectively evening out the odds, hence the lower variation.
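The effect of correlation on the spread of the row totals can be checked with a small simulation. As before this is a sketch under an assumed model of correlation (the chance of a head is shifted towards or away from the previous outcome), which may differ from the demonstrator's internals; the function names are illustrative only.

```python
import random
import statistics

def heads_in_row(n, correlation, rng, prob_head=0.5):
    """Count heads in n tosses of a fair coin whose tosses lean on the previous one."""
    heads, p = 0, prob_head
    for _ in range(n):
        head = rng.random() < p
        heads += head
        # Assumed model; with prob_head = 0.5 this always stays between 0 and 1.
        if head:
            p = prob_head + correlation * (1 - prob_head)
        else:
            p = prob_head * (1 - correlation)
    return heads

def sd_of_row_counts(correlation, rows=2000, n=10, seed=0):
    """Standard deviation of heads-per-row over many rows of n coins."""
    rng = random.Random(seed)
    counts = [heads_in_row(n, correlation, rng) for _ in range(rows)]
    return statistics.stdev(counts)

for c in (0.7, 0.0, -0.7):
    print(f"correlation {c:+.1f}: sd of heads per row = {sd_of_row_counts(c):.2f}")
```

The positive setting should give the largest spread and the negative setting the smallest, with the ordinary independent coin in between.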

# Wild and wide (making sense of statistics) – 4 – independence and non-independence

‘Independence’ is another key term in statistics. We will see several different kinds of independence, but in general it is about whether one measurement, or factor gives information about another.

Non-independence may increase variability, lead to misattribution of effects or even suggest completely the wrong effect.

Simpson’s paradox is an example of the latter where, for example, you might see year on year improvement in the performance of each kind of student you teach and yet the university tells you that you are doing worse!

Imagine you have tossed a coin ten times and it has come up heads each time. You know it is a fair coin, not a trick one. What is the probability it will be a tail next?

Of course, the answer is 50:50, but we often have a gut feeling that it should be more likely to be a tail to even things out.   This is the uniformity fallacy that leads people to choose the pattern with uniformly dispersed drops in the Gheisra story. It is exactly the same feeling that a gambler has putting in a big bet after a losing streak, “surely my luck must change”.

In fact with the coin tosses, each is independent: there is no relationship between one coin toss and the next. However, there can be circumstances (for example looking out of the window to see it is raining), where successive measurements are not independent.

This is the first of three kinds of independence we will look at:

• measurements
• factor effects
• sample prevalence

These each have slightly different causes and effects. In general the main effect of non-independence is to increase variability, however sometimes it can also induce bias. Critically, if one is unaware of the issues it is easy to make false inferences: I have looked out of the window 100 times and it was raining each time, should I therefore conclude that it is always raining?

We have already seen an example of where successive measurements are independent (coin tosses) and when they are not (looking out at the weather). In the latter case, if it is raining now it is likely still to be raining if I look again in two minutes, so the second observation adds little information.

Many statistical tests assume that measurements are independent and need some sort of correction to be applied or care in interpreting results when this is not the case. However there are a number of ways in which measurements may be related to one another:

order effects – This is one of the most common in experiments with users. A ‘measurement’ in user testing involves the user doing something, perhaps using a particular interface for 10 minutes. You then ask the same user to try a different interface and compare the two. There are advantages of having the same user perform on different systems (reduces the effect of individual differences); however, there are also potential problems.

You may get positive learning effects – the user is better at the second interface because they have already got used to the general ideas of the application in the first. Alternatively there may be interference effects, the user does less well in the second interface because they have got used to the detailed way things were done in the first.

One way this can be partially ameliorated is to alternate the orders: half the users see system A first followed by system B, and the other half see them the other way round. You may even do lots of swops in the hope that the later ones have less order effects: ABABABABAB for some users and BABABABABA for others.

These techniques work best if any order effects are symmetric. If, for example, there is a positive learning effect between A and B, but a negative interference effect between B and A, alternating the order does not help! Typically you cannot tell this from the raw data, although comments made during talk-aloud or post study interviews can help. In the end you often have to make a professional judgment based on experience as to whether you believe this kind of asymmetry is likely, or indeed if order effects happen at all.
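A little arithmetic shows why symmetry matters. The sketch below uses entirely made-up task times (system A takes 100 seconds, system B takes 90, so the true difference is 10) and a hypothetical helper `measured_difference`; it assumes half the users see each order and that the order effect simply adds to or subtracts from the time on whichever system is used second.

```python
def measured_difference(true_a, true_b, carry_a_to_b, carry_b_to_a):
    """Average measured (A - B) time with half the users in each order.

    carry_a_to_b: change in B's time when B is used after A (negative = learning).
    carry_b_to_a: change in A's time when A is used after B.
    """
    # Order AB: A is measured fresh, B is measured after A.
    diff_ab = true_a - (true_b + carry_a_to_b)
    # Order BA: B is measured fresh, A is measured after B.
    diff_ba = (true_a + carry_b_to_a) - true_b
    return (diff_ab + diff_ba) / 2

# Symmetric learning effect (5 seconds faster on whichever system comes second).
print(measured_difference(100, 90, -5, -5))  # 10.0, the true difference

# Asymmetric: learning from A to B, but interference from B to A.
print(measured_difference(100, 90, -5, +5))  # 15.0, overstating the difference
```

In the symmetric case the two orders err by the same amount in opposite directions and cancel; in the asymmetric case both orders err in the same direction, so alternating cannot remove the error.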

context or ‘day’ effects – Successively looking out of the window does not give a good estimate of the overall weather in the area because the observations are effectively all about the particular weather today. In fact the weather is not immaterial to user testing, especially user experience evaluation, as bad weather often affects people’s moods, and if people are less happy walking in to your study they are likely to perform less well and record lower satisfaction!

If you are performing a controlled experiment, you might try to do things strictly to protocol, but there may be slight differences in the way you do things that push the experiment in one direction or another.

Some years ago I was working on hydraulic sprays, as used in farming. We had a laser-based drop sizing machine and I ran a series of experiments varying things such as water temperature and surfactants added to the spray fluid, in order to ascertain whether these had any effect on the size of drops produced. The experiments were conducted in a sealed laboratory and were carefully controlled. When we analysed the results there were some odd effects that did not seem to make sense. After puzzling over this for some while one of my colleagues remembered that the experiments had occurred over two days and suggested we add a ‘day effect’ to the analysis. Sure enough this came out as a major factor and once it was included all the odd effects disappeared.

Now this was a physical system and I had tried to control the situation as well as possible, and yet still there was something, we never worked out what, that was different between the two days. Now think about a user test! You cannot predict every odd effect, but it can help to mix your conditions as much as possible so that they are ‘balanced’ with respect to other factors – for example, if you are doing two sessions of experimentation, try to have a mix of the two systems you are comparing in each session (although I know this is not always possible).

experimenter effects – A particular example of a contextual factor that may affect users’ performance and attitude is you! You may have prepared a script so that you greet each user the same and present the tasks they have to do in the same way, but if you have had a bad day your mood may well come through.

Using pre-recorded or textual instructions can help, but it would be rude to not at least say “hello” when they come in, and often you want to set users at ease so that more personal contact is needed.   As with other kinds of context effect, anything that can help balance out these effects is helpful. It may take a lot of effort to set up different testing systems, so you may have to have a long run of one system testing and then a long run of another, but if this is the case you might consider one day testing system A in the morning and system B in the afternoon and then another day doing the opposite.  If you do this, then, even if you have an off day, you affect both systems fairly.  Similarly if you are a morning person, or get tired in the afternoons, this will again affect both fairly. You can never remove these effects, but you can be aware of them.

The second kind of independence is between the various causal factors that you are measuring things about. For example, if you sampled LinkedIn and WhatsApp users and found that 5% of LinkedIn users were Justin Bieber fans compared with 50% of WhatsApp users, you might believe that there was something about LinkedIn that put people off Justin Bieber. However, of course, age will be a strong predictor of Justin Bieber fandom and is also related to the choice of social network platform. In this case social media use and age are called confounding variables.

As you can see it is easy for these effects to confuse causality.

A similar, and real, example of this is that when you measure the death rate amongst patients in specialist hospitals it is often higher than in general hospitals. At first sight this makes it seem that patients do not get as good care in specialist hospitals leading to lower safety, but in fact this is due to the fact that patients admitted to specialist hospitals are usually more ill to start with.

This kind of effect can sometimes entirely reverse effects leading to Simpson’s Paradox.

Imagine you are teaching a course on UX design. You teach a mix of full-time and part-time students and you have noticed that the performance of both groups has been improving year on year. You pat yourself on the back, happy that you are clearly finding better ways to teach as you grow more experienced.

However, one day you get an email from the university teaching committee noting that your performance seems to be diminishing. According to the university your grades are dropping.

Who is right?

In fact you may both be.

In your figures you have the average full-time student marks in 2015 and 2016 as 75% and 80%, an increase of 5%. In the same two years the average part-time student mark increased from 55% to 60%.

Yes both full-time and part-time students have improved their marks.

The university figures however show an average overall mark of 70% in 2015 dropping to 65% in 2016 – they are right too!

Looking more closely, whilst there were 30 full-time students in both years, the number of part-time students had increased from 10 in 2015 to 90 in 2016, maybe due to a university marketing drive or change in government funding patterns. Looking at the figures, the part-time students score substantially lower than the full-time students, not uncommon as part-time students are often juggling study with a job and may have been out of education for some years. The lower overall average the university reports is entirely due to there being more low-scoring part-time students.
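The two sets of figures can be reconciled with a weighted average, using the student numbers and marks given above. The helper name `overall_mean` is just for illustration.

```python
def overall_mean(groups):
    """Weighted mean of (number_of_students, average_mark) pairs."""
    total_students = sum(n for n, _ in groups)
    return sum(n * mark for n, mark in groups) / total_students

# (full-time, part-time) as (number of students, average mark)
marks_2015 = [(30, 75), (10, 55)]
marks_2016 = [(30, 80), (90, 60)]

print(overall_mean(marks_2015))  # 70.0
print(overall_mean(marks_2016))  # 65.0
```

Each group's average rises by five marks, yet the overall average falls by five because the mix of students has shifted towards the lower-scoring group.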

Although this seems like a contrived example, see [BH75] for a real example of Simpson’s Paradox. Berkeley appeared to have gender bias in admission because (at the time, 1973) women had only a 35% acceptance rate compared with 44% for men. However, deeper analysis found that in individual departments the bias was, if anything, slightly towards female candidates; it was just that females tended to apply for more competitive courses with lower admission rates (possibly revealing discrimination earlier in the education process).

Finally the way you obtain your sample may create lack of independence between your subjects.

This itself happens in two ways:

internal non-independence – This is when subjects are likely to be similar to one another, but in no particular direction with regard to your question. A simple example of this would be if you did a survey of people waiting in the queue to enter a football match. The fact that they are next to each other in the queue might mean they all came off the same coach and so are more likely to be supporting the same team.

Snowball samples are common in some areas. This is when you have an initial set of contacts, often friends or colleagues, use them as your first set of subjects and then ask them to suggest any of their own contacts who might take part in your survey.

Imagine you do this to get political opinions in the US and choose your first person to contact randomly from the electoral register. Let’s say the first person is a Democrat. That person’s friends are likely to share political beliefs and also be Democrat, and then their contacts also. Your Snowball sample is likely to give you the impression that nearly everyone is a Democrat!

Typically this form of internal non-independence increases variability, but does not create bias.

Imagine continuing to survey people in the football queue: eventually you will get to a group of people from a different coach. After interviewing 500 people you might have thought you had pretty reliable statistics, but in fact that corresponds to about 10 coaches, so will have variability closer to a sample size of ten. Alternatively if you sample 20, and colleagues also do samples of 20 each, some of you will think nearly everyone is a supporter of one team, some will get data that suggests the same is true for the other team, but if you average your results you will get something that is unbiased.

A similar thing happens with the snowball sample: if you had instead started with a Republican you would be likely to have had a large sample almost all of whom would have been Republican. If you repeat the process each sample may be overwhelmingly one party or the other, but the long-term average of doing lots of Snowball samples would be correct. In fact, just like doing a bigger sample on the football queue, if you keep on the Snowball process on the sample starting with the Democrat, you are likely to eventually find someone who is friends with a Republican and then hit a big pocket of Republicans. However, again just like the football queue, while you might have surveyed hundreds of people, you may have only sampled a handful of pockets; the lack of internal independence means the effective sample size is a lot smaller than you think.
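The idea of an effective sample size can be made concrete with a simulation of the football queue. This sketch makes the deliberately extreme assumptions that every coach holds 50 people, that everyone on a coach supports the same team, and that each coach is equally likely to belong to either team; the function names are invented for the example.

```python
import random
import statistics

def queue_estimate(n_coaches, coach_size, rng):
    """Proportion of team-A supporters when each whole coach supports one team."""
    supporters = sum(coach_size for _ in range(n_coaches) if rng.random() < 0.5)
    return supporters / (n_coaches * coach_size)

def independent_estimate(n_people, rng):
    """Proportion of team-A supporters among independently chosen people."""
    return sum(rng.random() < 0.5 for _ in range(n_people)) / n_people

rng = random.Random(42)  # arbitrary seed so the run is repeatable
repeats = 2000
queue = [queue_estimate(10, 50, rng) for _ in range(repeats)]
sd_queue = statistics.stdev(queue)
sd_500 = statistics.stdev([independent_estimate(500, rng) for _ in range(repeats)])
sd_10 = statistics.stdev([independent_estimate(10, rng) for _ in range(repeats)])

print(f"mean of queue estimates:        {statistics.mean(queue):.3f}")
print(f"sd, 500 people from the queue:  {sd_queue:.3f}")
print(f"sd, 500 independent people:     {sd_500:.3f}")
print(f"sd, 10 independent people:      {sd_10:.3f}")
```

Under these assumptions the queue estimates average out close to a half (no bias), but their spread matches a sample of ten people rather than five hundred.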

external non-independence – This is when the choice of subjects is actually connected with the topic being studied. For example, visiting an Apple Store and doing a survey about preferences between MacOS and Windows, or iPhone and Android. However, the effect may not be so immediately obvious, for example, using a mobile app-based survey on a topic which is likely to be age related.

The problem with this kind of non-independence is that it may lead to unintentional bias in your results. Unlike the football or snowball sample examples, doing another 20 users in the Apple Store, and then another 20 and then another 20 is not going to average out the fact that it is an Apple Store.

The crucial question to ask yourself is whether the way you have organised your sample is likely to be independent of the thing you want to measure.

In the snowball sample example, it is clearly problematic for sampling political opinions, but may be acceptable for favourite colour or shoe size. The argument for this may be based on previous data, on pilot experiments, or on professional knowledge or common sense reasoning. While there may be some cliques, such as members of a basketball team, with similar shoe size, I am making a judgement based on my own life experience that common shoe size is not closely related to friendship whereas shared political belief is.

The decision may not be so obvious, for example, if you run a Fitts’ Law experiment and all the distant targets are coloured red and the close ones blue. Maybe this doesn’t matter, or maybe there are odd peripheral vision reasons why it might skew the results. In this case, and assuming the colours are important, my first choice would be to include all conditions (including red close and blue distant targets) as well as the ones I’m interested in, or if not, run an alternative experiment or spend a lot of time checking out the vision literature.

Perhaps the most significant potential biasing effect is that we will almost always get subjects from the same society as ourselves. In particular, for university research this tends to mean undergraduate students. However, even the most basic cognitive traits are not necessarily representative of the world at large [HH10], let alone more obviously culturally related attitudes.

## References

[BH75] Bickel, P., Hammel, E., & O’Connell, J. (1975). Sex Bias in Graduate Admissions: Data from Berkeley. Science, 187(4175), 398-404. Retrieved from http://www.jstor.org/stable/1739581

[HH10] Henrich, J., Heine, S., & Norenzayan, A. (2010). The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3), 61-83; discussion 83-135. doi: 10.1017/S0140525X0999152X

# Wild and wide (making sense of statistics) – 3 – bias and variability

When you take a measurement, whether it is the time for someone to complete a task using some software, or a preferred way of doing something, you are using that measurement to find out something about the ‘real’ world – the average time for completion, or the overall level of preference amongst your users.

Two of the core things you need to know about are bias (is it a fair estimate of the real value) and variability (how likely is it to be close to the real value).

The word ‘bias’ in statistics has a precise meaning, but it is very close to its day-to-day meaning.

Bias is about systematic effects that skew your results in one way or another. In particular, if you use your measurements to predict some real world effect, is that prediction likely to over or under estimate the true value; in other words, is it a fair estimate?

Say you take 20 users, and measure their average time to complete some task. You then use that as an estimate of the ‘true’ value, the average time to completion of all your users. Your particular estimate may be low or high (as we saw with the coin tossing experiments). However, if you repeated that experiment very many times would the average of your estimates end up being the true average?

If the complete user base were employees of a large company, and the company forced them to work with you, you could randomly select your 20 users, and in that case, yes, the estimate based on the users would be unbiased [1].

However, imagine you are interested in popularity of Justin Bieber and issued a survey on a social network as a way to determine this. The effects would be very different if you chose to use LinkedIn or WhatsApp. No matter how randomly you selected users from LinkedIn, they are probably not representative of the population as a whole and so you would end up with a biased estimate of his popularity.

Crucially, the typical way to improve an estimate in statistics is to take a bigger sample: more users, more tasks, more tests on each user. Typically, bias persists no matter the sample size [2].
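A quick simulation illustrates the point. The sketch below assumes a purely hypothetical overall popularity of 25% in the whole population, while only 5% of the surveyed platform's users are fans (both rates are illustrative figures, not data). However large the sample drawn from the platform, the estimate settles on the platform's figure, not the population's.

```python
import random

TRUE_OVERALL = 0.25   # hypothetical popularity in the whole population
PLATFORM_RATE = 0.05  # illustrative popularity among the surveyed platform's users

def survey(n, rate, rng):
    """Proportion of fans found among n randomly chosen platform users."""
    return sum(rng.random() < rate for _ in range(n)) / n

rng = random.Random(7)  # arbitrary seed for repeatability
for n in (100, 1_000, 10_000, 100_000):
    estimate = survey(n, PLATFORM_RATE, rng)
    print(f"n={n:>6}: estimate {estimate:.3f}, "
          f"error against overall {estimate - TRUE_OVERALL:+.3f}")
```

The estimates become steadier as n grows, but the error against the overall figure does not shrink.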

However, the good news is that sometimes it is possible to model bias and correct for it. For example, you might ask questions about age or other demographics and then use known population demographics to add weight to groups under-represented in your sample … although I doubt this would work for the Justin Bieber example: if there are 15-year-old members of LinkedIn, they are unlikely to be typical 15-year-olds!
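As a rough sketch of what such a correction looks like, the following re-weights survey results by age group. Every number here is invented for illustration, as are the two age bands; the calculation is only trustworthy if the people you sampled within each group are typical of that group, which is exactly the caveat above.

```python
# Hypothetical survey: share of each age group in the sample,
# and the proportion of fans found within each group.
sample_share = {"under 25": 0.10, "25 and over": 0.90}
fan_rate = {"under 25": 0.40, "25 and over": 0.05}

# Hypothetical known demographics of the population as a whole.
population_share = {"under 25": 0.30, "25 and over": 0.70}

def weighted_rate(shares, rates):
    """Overall fan rate when each group counts in proportion to its share."""
    return sum(shares[group] * rates[group] for group in shares)

raw = weighted_rate(sample_share, fan_rate)            # what the raw sample reports
corrected = weighted_rate(population_share, fan_rate)  # re-weighted to the population

print(f"raw estimate:       {raw:.1%}")
print(f"corrected estimate: {corrected:.1%}")
```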

If you have done an introductory statistics course you might have wondered about the ‘n-1’ that occurs in calculations of standard deviation or variance. In fact this is precisely a correction of bias: the raw standard deviation of a sample slightly underestimates the real standard deviation of the overall population. This is pretty obvious in the case n=1 – imagine grabbing someone from the street and measuring their height. Using that height as an average height for everyone would be pretty unreliable, but it is unbiased. However, the standard deviation of that sample of 1 is zero: it is one number, there is no spread. This underestimation is less clear for 2 or more, but in larger samples it persists. Happily, in this case you can precisely model the underestimation, and the use of n-1 rather than n in the formula for estimated variance precisely corrects for it (for the standard deviation, which is its square root, a small bias remains but it is much reduced).
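The underestimation is easy to see by simulation. This sketch draws many small samples (n = 5) from a normal distribution whose true variance is 1, and averages the variance estimates obtained by dividing by n and by n-1. The choice of distribution, sample size and seed are all assumptions made for the demonstration.

```python
import random

rng = random.Random(3)  # arbitrary seed for repeatability
n, repeats = 5, 20_000
total_div_n = 0.0
total_div_n_minus_1 = 0.0

for _ in range(repeats):
    sample = [rng.gauss(0, 1) for _ in range(n)]  # true variance is 1
    mean = sum(sample) / n
    sum_sq = sum((x - mean) ** 2 for x in sample)
    total_div_n += sum_sq / n
    total_div_n_minus_1 += sum_sq / (n - 1)

mean_div_n = total_div_n / repeats
mean_div_n_minus_1 = total_div_n_minus_1 / repeats

print(f"dividing by n:   {mean_div_n:.3f}")          # close to (n-1)/n = 0.8
print(f"dividing by n-1: {mean_div_n_minus_1:.3f}")  # close to 1
```

Dividing by n gives estimates that are too small on average by a factor of (n-1)/n, which is exactly what dividing by n-1 instead undoes.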

If you toss 10 coins, there is only a one in five hundred chance of getting either all heads or all tails, and about a one in fifty chance of getting only one head or one tail; the really extreme values are relatively unlikely. However, there is about a one in ten chance of getting either just two heads or two tails. If you kept tossing the coins again and again, though, the times you got 2 heads and 8 tails would approximately balance the opposite and overall you would find that the average proportion of heads and tails would come out 50:50.

That is the proportion you estimate by tossing just 10 coins has a high variability, but is unbiased. It is a poor estimate of the right thing.
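The chances quoted for ten coins come from counting how many of the 1024 equally likely sequences give each number of heads. A short check, using only the standard library (the helper name `prob_heads` is illustrative):

```python
from math import comb

def prob_heads(k, n=10):
    """Probability of exactly k heads in n fair coin tosses."""
    return comb(n, k) / 2 ** n

all_same = prob_heads(0) + prob_heads(10)  # 2/1024, about 1 in 512
one_odd = prob_heads(1) + prob_heads(9)    # 20/1024, about 1 in 51
two_odd = prob_heads(2) + prob_heads(8)    # 90/1024, about 1 in 11

for label, p in [("all the same", all_same),
                 ("one odd one out", one_odd),
                 ("two odd ones out", two_odd)]:
    print(f"{label}: {p:.4f} (about 1 in {1 / p:.0f})")
```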

Often the answer is to just take a larger sample – toss 100 coins or 1000 coins, not just 10. Indeed when looking for infrequent events, physicists may leave equipment running for months on end taking thousands of samples per second.

You can sample yourself out of high variability!

Think now about studies with real users – if tossing ten coins can lead to such high variability; what about those measurements on ten users?

In fact there may be time, cost and practicality limits on how many users you can involve, so there are times when you can’t just have more users. My ‘gaining power’ series of videos includes strategies, including reducing variability, for obtaining more traction from the users and time you have available.

In contrast, let’s imagine you have performed a random survey of 10,000 LinkedIn users and obtained data on their attitudes to Justin Bieber. Let’s say you found 5% liked Justin Bieber’s music. Remembering the quick and dirty rule [3], the variability on this figure is about +/- 0.5%. If you repeated the survey, you would be likely to get a similar answer.
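The quick and dirty rule, as used in the footnote, takes twice the square root of the count as a rough plus-or-minus range. Written out as a small function (the name `quick_and_dirty` is made up for this sketch):

```python
import math

def quick_and_dirty(count, sample_size):
    """Rough +/- range on a proportion: twice the square root of the count."""
    spread = 2 * math.sqrt(count)
    return count / sample_size, spread / sample_size

proportion, plus_minus = quick_and_dirty(500, 10_000)
print(f"{proportion:.1%} +/- {plus_minus:.2%}")  # 5.0% +/- 0.45%
```

The exact figure is a little under 0.5%, which the text rounds for convenience.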

That is, you have a very reliable estimate of his popularity amongst all LinkedIn users, but if you are interested in overall popularity, then is this any use?

You have a good estimate of the wrong thing.

As we’ve discussed, you cannot simply sample your way out of this situation: if your process is biased it is likely to stay so. In this case you have two main options. You may try to eliminate the bias – maybe sample over a wide range of social networks that between them offer a more representative view of society as a whole. Alternatively, you might try to model the bias, and correct for it.

On the whole high variability is a problem, but there are relatively straightforward strategies for dealing with it. Bias is your real enemy!

1. Assuming they behaved as normal in the test and weren’t annoyed at being told to be ‘volunteers’.
2. Actually there are some forms of bias that do go away with large samples, called asymptotically unbiased estimators, but this does not apply in the cases where the way you choose your sample has created an unrepresentative sample, or the way you have set up your study favours one outcome.
3. 5% of 10,000 represents 500 users. The square root of 500 is around 22, twice that a bit under 50, so our estimate of variability is 500 +/– 50, or, as a percentage of users, 5% +/– 0.5%.