Wild and wide (making sense of statistics) – 3 – bias and variability

When you take a measurement, whether it is the time for someone to complete a task using some software, or a preferred way of doing something, you are using that measurement to find out something about the ‘real’ world – the average time for completion, or the overall level of preference amongst your users.

Two of the core things you need to know about are bias (is it a fair estimate of the real value?) and variability (how likely is it to be close to the real value?).

The word ‘bias’ in statistics has a precise meaning, but it is very close to its day-to-day meaning.

Bias is about systematic effects that skew your results in one way or another. In particular, if you use your measurements to predict some real-world effect, is that prediction likely to over-estimate or under-estimate the true value; in other words, is it a fair estimate?

Say you take 20 users, and measure their average time to complete some task. You then use that as an estimate of the ‘true’ value, the average time to completion of all your users. Your particular estimate may be low or high (as we saw with the coin tossing experiments). However, if you repeated that experiment very many times would the average of your estimates end up being the true average?

If the complete user base were employees of a large company, and the company forced them to work with you, you could randomly select your 20 users, and in that case, yes, the estimate based on those users would be unbiased [1].

However, imagine you are interested in the popularity of Justin Bieber and issue a survey on a social network to measure this. The results would be very different depending on whether you chose LinkedIn or WhatsApp. No matter how randomly you selected users from LinkedIn, they are probably not representative of the population as a whole, and so you would end up with a biased estimate of his popularity.

Crucially, the typical way to improve an estimate in statistics is to take a bigger sample: more users, more tasks, more tests on each user. Bias, however, typically persists no matter the sample size [2].
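To see this in action, here is a minimal Python sketch. Everything in it is made up for illustration: a hypothetical population of task-completion times, and a biased sampling frame in which only the faster half of users ‘volunteer’. Increasing the sample size makes the fair estimate settle on the true mean, while the biased one settles on the wrong value.

```python
import random

random.seed(1)

# Hypothetical population: task-completion times (seconds) for 100,000 users.
population = [random.gauss(300, 60) for _ in range(100_000)]
true_mean = sum(population) / len(population)

# A biased sampling frame: suppose only the faster half of users volunteer.
volunteers = sorted(population)[: len(population) // 2]

for n in (10, 100, 1000, 10_000):
    fair = sum(random.sample(population, n)) / n     # random sample of everyone
    skewed = sum(random.sample(volunteers, n)) / n   # random sample of volunteers only
    print(f"n={n:>6}  fair estimate={fair:6.1f}  "
          f"biased estimate={skewed:6.1f}  (true mean={true_mean:.1f})")
```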

However, the good news is that sometimes it is possible to model bias and correct for it. For example, you might ask questions about age or other demographics and then use known population demographics to add weight to groups under-represented in your sample … although I doubt this would work for the Justin Bieber example: if there are 15-year-old members of LinkedIn, they are unlikely to be typical 15-year-olds!

If you have done an introductory statistics course you might have wondered about the ‘n-1’ that occurs in calculations of standard deviation or variance. In fact this is precisely a correction of bias: the raw standard deviation of a sample slightly underestimates the real standard deviation of the overall population. This is pretty obvious in the case n=1 – imagine grabbing someone from the street and measuring their height. Using that height as an estimate of the average height of everyone would be pretty unreliable, but it is unbiased. However, the standard deviation of that sample of 1 is zero: it is one number, there is no spread. The underestimation is less obvious for samples of 2 or more, but it persists. Happily, in this case you can model the underestimation precisely, and using n-1 rather than n in the formulae for estimated standard deviation and variance exactly corrects for it.
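If you want to convince yourself of this, here is a rough simulation sketch. The population, its spread and the sample size are all invented for illustration: averaging the raw (divide by n) variance of many small samples comes out low, while the n-1 version comes out close to the true value.

```python
import random

random.seed(42)

# Hypothetical population with a known spread (e.g. heights in cm).
population = [random.gauss(170, 10) for _ in range(100_000)]
mu = sum(population) / len(population)
true_var = sum((x - mu) ** 2 for x in population) / len(population)

n, trials = 5, 20_000
raw_total = corrected_total = 0.0
for _ in range(trials):
    sample = random.sample(population, n)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    raw_total += ss / n                # divide by n   -> biased low
    corrected_total += ss / (n - 1)    # divide by n-1 -> (nearly) unbiased

print(f"true variance         : {true_var:6.1f}")
print(f"average of ss/n       : {raw_total / trials:6.1f}")
print(f"average of ss/(n-1)   : {corrected_total / trials:6.1f}")
```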

If you toss 10 coins, there is only about a one in five hundred chance of getting either all heads or all tails, and about a one in fifty chance of getting just one head or one tail – the really extreme values are relatively unlikely. However, there is about a one in ten chance of getting just two heads or two tails. If you kept tossing the coins again and again, the times you got 2 heads and 8 tails would approximately balance the times you got 8 heads and 2 tails, and overall you would find that the average proportion of heads and tails came out 50:50.
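You can check these rough figures directly from the Binomial distribution; a quick sketch:

```python
from math import comb

n = 10  # number of coins tossed

def p_exactly(k):
    """Probability of exactly k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n

print("all heads or all tails        :", 2 * p_exactly(0))  # about 1 in 500
print("exactly one head or one tail  :", 2 * p_exactly(1))  # about 1 in 50
print("exactly two heads or two tails:", 2 * p_exactly(2))  # about 1 in 10
```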

That is, the proportion you estimate by tossing just 10 coins has high variability, but is unbiased. It is a poor estimate of the right thing.

Often the answer is to just take a larger sample – toss 100 coins or 1000 coins, not just 10. Indeed when looking for infrequent events, physicists may leave equipment running for months on end taking thousands of samples per second.

You can sample yourself out of high variability!

Think now about studies with real users – if tossing ten coins can lead to such high variability, what about those measurements on ten users?

In fact there may be time, cost and practicality limits on how many users you can involve, so there are times when you can’t just add more users. My ‘gaining power’ series of videos includes strategies, including reducing variability, for getting more traction from the users and time you have available.

In contrast, let’s imagine you have performed a random survey of 10,000 LinkedIn users and obtained data on their attitudes to Justin Bieber. Let’s say you found 5% liked Justin Bieber’s music. Remembering the quick and dirty rule [3], the variability on this figure is about +/- 0.5%. If you repeated the survey, you would be likely to get a similar answer.

That is, you have a very reliable estimate of his popularity amongst all LinkedIn users, but if you are interested in overall popularity, is this any use?

You have a good estimate of the wrong thing.

As we’ve discussed, you cannot simply sample your way out of this situation: if your process is biased it is likely to stay so. In this case you have two main options. You may try to eliminate the bias – maybe sample over a wide range of social networks that between them offer a more representative view of society as a whole. Alternatively, you might try to model the bias and correct for it.

On the whole, high variability is a problem, but there are relatively straightforward strategies for dealing with it. Bias is your real enemy!

  1. Assuming they behaved as normal in the test and weren’t annoyed at being told to be ‘volunteers’.
  2. Actually there are some forms of bias that do go away with large samples, called asymptotically unbiased estimators, but this does not apply in the cases where the way you choose your sample has created an unrepresentative sample, or where the way you have set up your study favours one outcome.
  3. 5% of 10,000 represents 500 users. The square root of 500 is around 22, twice that is a bit under 50, so our estimate of variability is 500 +/- 50, or, as a percentage of users, 5% +/- 0.5%.

Wild and wide (making sense of statistics) – 2 – quick (and dirty!) tip

We often deal with survey or count data. This might come in public forms, such as opinion poll data preceding an election, or from your own data when you email out a survey or count the kinds of errors in a user study.

So when you find that 27% of the users in your study had a problem, how confident do you feel in using this to estimate the level of prevalence amongst users in general? If you did a bigger study with more users, would you be surprised if the figure you got was actually 17%, 37% or 77%?

You can work out precise numbers for this, but in this video I’ll give a simple rule-of-thumb method for doing a quick estimate.

We’re going to deal with this by looking at three separate cases.

First when the number you are dealing with is a comparatively small proportion of the overall sample.

For example, assume you want to know about people’s favourite colours. You do a survey of 1000 people and 10% say their favourite colour is blue. The question is how reliable this figure is. If you had done a larger survey, would the answer be close to 10% or could it be very different?

The simple rule is that the variation is 2 x the square root of the number of people who chose blue.

To work this out, first calculate how many people the 10% represents. Given the sample was 1000, this is 100 people. The square root of 100 is 10, so 2 times this is 20 people. You can be reasonably confident (about 95%) that the number of people choosing blue in your sample is within +/- 20 people of the number you’d expect from the population as a whole. Dividing that +/- 20 people by the sample of 1000, the percentage of people for whom blue is their favourite colour is likely to be within +/- 2% of the measured 10%.

The second case is when you have a large majority who have selected a particular option.

For example, let’s say in another survey, this time of 200 people, 85% said green was their favourite colour.

This time you still apply the “2 x square root” rule, but instead focus on the smaller number, those who didn’t choose green. The 15% who didn’t choose green is 15% of 200, that is 30 people. The square root of 30 is about 5.5, so the expected variability is about +/- 11 people, or in percentage terms about +/- 5.5%.

That is, the real proportion over the population as a whole could be anywhere between 80% and 90%.

Notice how the variability of the proportion estimate from the sample increases as the sample size gets smaller.

Finally if the numbers are near the middle, just take the square root, but this time multiply by 1.5.

For example, if you took a survey of 2000 people and 50% answered yes to a question, this represents 1000 people. The square root of 1000 is a bit over 30, so 1.5 times this is around 50 people, so you expect a variation of about +/- 50 people, or about +/- 2.5%.
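Putting the three cases together, here is a small Python sketch of the rule of thumb. The quick_margin helper and the 40% to 60% cut-off for “near the middle” are my own illustrative choices, not a standard definition, but it reproduces the three worked examples above.

```python
from math import sqrt

def quick_margin(sample_size, proportion):
    """Rule-of-thumb margin (as a fraction of the sample) for an observed proportion."""
    smaller = min(proportion, 1 - proportion) * sample_size  # count in the smaller group
    factor = 1.5 if 0.4 <= proportion <= 0.6 else 2          # 1.5 near the middle, else 2
    return factor * sqrt(smaller) / sample_size

print(quick_margin(1000, 0.10))   # ~0.02   -> 10% +/- 2%
print(quick_margin(200, 0.85))    # ~0.055  -> 85% +/- 5.5%
print(quick_margin(2000, 0.50))   # ~0.024  -> 50% +/- 2.5%
```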

Opinion polls for elections often have samples of around 2000, so if the parties are within a few points of each other you really have no idea who will win.

For those who’d like to understand the detailed stats for this (skip if you don’t!) …

These three cases are simplified forms of the precise mathematical formula for the variance of a Binomial distribution, np(1-p), where n is the number in the sample and p the true population proportion for the thing you are measuring. When you are dealing with fairly small proportions the (1-p) term is close to 1, so the whole variance is close to np, that is the number with the given value. You then take the square root to give the standard deviation. The factor of 2 is because about 95% of measurements fall within 2 standard deviations. The reason this becomes 1.5 in the middle is that you can no longer treat (1-p) as nearly 1; for p = 0.5 this makes things smaller by the square root of 0.5, which is about 0.7. Two times 0.7 is (about) one and a half (I did say quick and dirty!).
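For comparison, here is the same calculation done with the full formula, 2 x the square root of np(1-p), for the three surveys above. The quick rule comes out close to the exact figure in each case.

```python
from math import sqrt

# Two standard deviations of the Binomial count, 2 * sqrt(n*p*(1-p)),
# expressed as a fraction of the sample size n.
for n, p in [(1000, 0.10), (200, 0.85), (2000, 0.50)]:
    margin = 2 * sqrt(n * p * (1 - p)) / n
    print(f"n={n:>4}, p={p:.2f}: +/- {100 * margin:.1f} percentage points")
```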

However, for survey data, or indeed any kind of data, these calculations of variability are in the end far less critical than ensuring that the sample really does adequately measure the thing you are after.

Is it fair? – Has the way you have selected people made one outcome more likely? For example, if you do an election opinion poll of your Facebook friends, this may not be indicative of the country at large!

For surveys, has there been self-selection? – Maybe you asked a representative sample, but who actually answered. Often you get more responses from those who have strong feelings about the issue. For usability of software, this probably means those who have had a problem with it!

Have you phrased the question fairly? – For example, people are far more likely to answer “Yes” to a question, so if you ask “do you want to leave?” you might get 60% saying “yes” and 40% saying no, but if you asked the question in the opposite way “do you want to stay?”, you might still get 60% saying “yes”.

Wild and wide (making sense of statistics) – 1 – unexpected wildness of random

This part will begin with some exercises and demonstrations of the unexpected wildness of random phenomena including the effects of bias and non-independence (when one result affects others).

We will discuss different kinds of distribution and the reasons why the normal distribution (classic hat shape), on which so many statistical tests are based, is so common. In particular we will look at some of the ways in which the effects we see in HCI may not satisfy the assumptions behind the normal distribution.

Most will be aware of the use of non-parametric statistics for discrete data such as Likert scales, but there are other ways in which non-normal distributions arise. Positive feedback effects, which give rise to the beauty of a snowflake, also create effects such as the bi-modal distribution of student marks in certain kinds of university courses (don’t believe those who say marks should be normally distributed!). This can become more complex if feedback processes include some form of threshold or other non-linear effect (e.g. when the rate of a task just gets too much for a user).
All of these effects are found in the processes that give rise to social networks both online and offline and other forms of network phenomena, which are often far better described by a long-tailed ‘power law’.

Just how random is the world?

We often underestimate just how wild random phenomena are – we expect to see patterns and reasons for what is sometimes entirely arbitrary.

Through a story and some exercises, I hope that you will get a better feel for how wild randomness is. We sometimes expect random things to end up close to their average behaviour, but we’ll see that variability is often large.

When you have real data you have a combination of some real effect and random ‘noise’. However, by doing some coin tossing experiments you know that the coins you are dealing with are near enough fair – everything you see will be sheer randomness.

We’ll start with a story:

In the far off land of Gheisra there lies the plain of Nali. For one hundred miles in each direction it spreads, featureless and flat, no vegetation, no habitation; except, at its very centre, a pavement of 25 tiles of stone, each perfectly level with the others and with the surrounding land.

The origins of this pavement are unknown – whether it was set there by some ancient race for its own purposes, or whether it was there from the beginning of the world.

Rain falls but rarely on that barren plain, but when clouds are seen gathering over the plain of Nali, the monks of Gheisra journey on pilgrimage to this shrine of the ancients, to watch for the patterns of the raindrops on the tiles. Oftentimes the rain falls by chance, but sometimes the raindrops form patterns, giving omens of events afar off.

Some of the patterns recorded by the monks are shown on the following slides. Which are mere chance and which foretell great omens?

Before reading on make your choices and record why you made your decision.

Just a reminder: choose first and then read on 😉

Before revealing the true omens, you might like to know how you fare alongside three and seven year olds.

When very young children are presented with this choice they give very mixed answers, but have a small tendency to think that distributions like day 1 are real rainfall, whereas those like day 3 are an omen.

In contrast, once children are older, seven or so, they are more consistent and tend to plump for day 3 as the random rainfall.

Were you more like the three year old and thought day 1 was random rainfall, or more like the seven year old and thought day 1 was an omen and day 3 random? Or perhaps you were like neither of them and thought day 2 was true random rainfall?

Let’s see who is right.

When you looked at day 1 you might have seen a slight diagonal tendency with the lower right corner less dense than the upper left. Or you may have noted the suspiciously collinear three dots in the second tile on the top row.

However, this pattern, the preferred choice of the three year old, is in fact the random rainfall – or at least as random as a computer random number generator can manage!

In true random phenomena you often do get gaps, dense spots or apparent patterns, but this is just pure chance.

In day 2 you might have thought it looked a little clumped towards the middle.

In fact this is perfectly right, it is exactly the same tiles as in day 1, but re-ordered so that the fuller tiles are towards the centre, and the part-empty ones to the edges.

This is an omen!

Finally, day 3 is also an omen.

This is the preferred choice of seven year olds as being random rainfalls and also, I have found, the preferred choice of 27, 37 and 47 year olds.

However it is too uniform. The drops on each tile are distributed randomly, but there are precisely five drops on each tile.

At some point during our early education we ‘learn’ (wrongly!) that random phenomena are uniform. Although this is nearly true when there are very large numbers involved (maybe 12,500 drops rather than 125), with smaller numbers the effects are far more wacky than one might imagine.
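If you would like to see this for yourself without waiting for rain, here is a short simulation sketch. The grid size and drop count match the story (125 drops on 25 tiles); the random seed is arbitrary.

```python
import random

random.seed(7)  # arbitrary seed so the run is repeatable

# 125 raindrops falling uniformly at random onto a 5 x 5 pavement of tiles.
counts = [[0] * 5 for _ in range(5)]
for _ in range(125):
    counts[random.randrange(5)][random.randrange(5)] += 1

for row in counts:
    print(" ".join(f"{c:2d}" for c in row))
# The average is 5 drops per tile, but a typical run has some tiles with
# only 1 or 2 drops and others with 8 or more: gaps and clumps by pure chance.
```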

Now for a different exercise, and this time you don’t just have to choose, you have to do something.

Find a coin, or even better if you have 20, get them.

Toss the coins one by one and put the heads into one row and the tails into another.

Keep on tossing until one line of coins has ten coins in it … you could even mark a finish line 10 coins away from the start.

If you only have one coin you’ll have to just toss it and keep tally!

If you are on your own repeat this several times.

However, before you start think about what you expect to see.

So what happened?

Did you get a clear winner, or were they neck and neck?

And what did you expect to happen?

I had a go and did five races. In one case they were nearly neck-and-neck at 9 heads to 10 tails, but the other four races were won by heads with some quite large margins: 10 to 7, 10 to 6, 10 to 5 and 10 to 4.

Often people are surprised because they are expecting a near neck and neck race every time. As the coins are all fair, they expect approximately equal numbers of heads and tails. However, just like the rainfall in Gheisra, it is very common to have one quite far ahead of the other.

In your head you might think that because the probability of it being a head is a half, the number of heads will be near enough half. Indeed, this is the case if you average over lots and lots of tosses. However, with just 20 coins in a race, the variability is large.

The probability of getting an outright winner all heads or all tails is low, only about one in five hundred. However, the probability of getting a near wipe-out with one head and ten tails or vice versa is around one in fifty – in a large class one person is likely to have this.
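If you don’t have twenty coins to hand, here is a quick simulation of the race (10,000 repetitions, arbitrary seed) that tallies how often each losing score occurs. You will see that neck-and-neck finishes happen, but wide margins are common too.

```python
import random
from collections import Counter

random.seed(3)  # arbitrary seed

def race():
    """Toss a fair coin until heads or tails reaches 10; return the losing score."""
    heads = tails = 0
    while heads < 10 and tails < 10:
        if random.random() < 0.5:
            heads += 1
        else:
            tails += 1
    return min(heads, tails)

results = Counter(race() for _ in range(10_000))
for losing_score in sorted(results):
    print(f"10 to {losing_score}: {results[losing_score] / 10_000:.1%} of races")
```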

I hope these two activities begin to give some idea of the wild nature of random phenomena. We can see a few general lessons.

First, apparent patterns or differences may just be pure chance. For example, if you had found heads winning by 10 to 2, you might have thought this meant that your coin was in some way biased to heads, or you might have thought that the nearly straight line of three drops on day 1 had to mean something. But random things are so wild that sometimes apparently systematic effects happen by chance.

It is also worth remembering that this wildness may lead to what appear to be ‘bad values’. If you had got 10 tails and just 1 head, you might have thought “but coins are fair, so I must have done something wrong”.

Famous scientists have fallen for this fallacy!

Mendel’s experiments on the inheritance of pea plant characteristics laid the foundations for modern genetics. However, his results are a little too good. If you cross-pollinate two plants, one pure-bred for a recessive characteristic and one pure-bred for the dominant one, you expect the second generation to have observable characteristics that are dominant and recessive in the ratio 3:1. In Mendel’s data the ratios are just a little too close to this figure. It seems likely that he rejected ‘bad values’, assuming he had done something wrong, when in fact they were just the results of chance.
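To get a feel for how much spread chance alone should produce, here is a hypothetical sketch. The batch size of 600 plants and the twenty repetitions are invented for illustration; Mendel’s actual counts varied from experiment to experiment.

```python
import random

random.seed(11)  # arbitrary seed

n_plants = 600   # hypothetical batch size, purely for illustration

ratios = []
for _ in range(20):   # twenty simulated experiments, each with a true 3:1 ratio
    dominant = sum(random.random() < 0.75 for _ in range(n_plants))
    ratios.append(dominant / (n_plants - dominant))

print(" ".join(f"{r:.2f}" for r in sorted(ratios)))
# Even with a true 3:1 ratio, chance spreads the observed ratios noticeably
# around 3.0 -- a set of results all sitting very close to 3.0 is suspicious.
```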

The same thing can happen in physics. In 1909 Millikan and Harvey Fletcher performed an experiment, now known as the Millikan oil-drop experiment, that showed that every electron has an identical charge, and measured that charge. To do this they created charged oil drops and suspended them using their electrical charge. The relationship between the field needed and the size (and hence weight) of the drop enabled them to calculate the charge on each oil drop. These always came in multiples of a single value – the electron charge. However, there are sources of error in all the measurements, and yet the reported charges are a little too close to multiples of the same number. Again it looks like ‘bad’ results were ignored as some form of mistake during the set-up of the experiment.