gaining power (2) – the noise-effect-number triangle

The heart of gaining power in your studies is understanding the noise–effect–number triangle.  Power arises from a combination of the size of the effect you are trying to detect, the size of the study (number of trails/participants) and the size of the ‘noise’ (the random or uncontrolled factors). We can increase power by addressing any one of these.

Cast your mind back to your first statistics course, or when you first opened a book on statistics.

The standard deviation (sd) is one of the most common ways to measure of the variability of a data point. This is often due to ‘noise’, or the things you can’t control or measure.

For example, the average adult male height in the UK is about 5 foot 9 inches ( with a standard deviation of about 3 inches (7.5cm), most British men are between 5′ 6″ (165cm) and 6′ (180cm) tall.

However, if you take a random sample and look at the average (arithmetic mean), this varies less as typically your sample has some people higher than average, and some people shorter than average, and they tend to cancel out. The variability of this average is called the standard error of the mean (or just s.e.), and is often drawn as little ‘error bars’ on graphs or histograms, to give you some idea of the accuracy of the average measure.

You might also remember that, for many kinds of data the standard error of the mean is given by:

s.e. = σ / √n                   (or if σ is an estimate √n-1 )

For example, of you have one hundred people, the variability of the average height is one tenth the variability of a single person.

The question you then have to ask yourself is how big an effect do you want to detect? Imagine I am about to visit Denmark. I have pretty good idea that Danish men are taller than British men and would like to check this.   If the average were a foot (30cm) I definitely want to know as I’ll end up with a sore neck looking up all the time, but if it is just half an inch (1.25cm) I probably don’t care.

Let’s call this least difference that I care about δ (Greek letters, it’s a mathematician thing!), so in the example δ = 0.5 inch.

If I took a sample of 100 British men and 100 Danes, the standard error of the mean would be about 0.3 inch (~1cm) for each, so it would be touch and go if I’d be able to detect the difference. However, if I took a sample of 900 of each, then the s.e. of each average would be about 0.1 inch, so I’d probably be easily able to detect differences of 0.5 inch.

In general, we’d like the minimum difference we want to detect to be substantially bigger than the standard error of the mean in order to be able to detect the difference. That is:

δ   >> σ / √n

Note the three elements here:

  • the effect size
  • the amount of noise or uncontrolled variation
  • the number of participants, groups or trials

Although the meanings of these vary between different kinds of data and different statistical methods, the basic triad is similar. This is even in data, such as network power-law, where the standard deviation is not well defined and other measures of spread or variation apply (Remember that this is a different use of the term ‘power’). In such data it is not the square root of participants that is the key factor, but still the general rule that you need a lot more participants to get greater accuracy in measures … only for power law data the ‘more’ is even greater than squaring!

Once we understand that statistical power is about the relationship between these three factors, it becomes obvious that while increasing the number of subjects is one way to address power, it is not the only way. We can attempt to effect any one of the three, or indeed several while designing our user studies or experiments.

Thinking of this we have three general strategies:

  • increase number – As mentioned several times, this is the standard approach, and the only one that many people think about. However, as we have seen, the square root means that we often need very lareg increase in the number of subjects or trials in order to reduce the variability of our results to acceptable level. Even when you have addressed other parts of the noise–effect–number triangle, you still have to ensure you have sufficient subjects, although hopefully less than you would need by a more naïve approach.
  • reduce noise – Noise is about variation due to actors that you do not control or know about; so, we can attempt to attack either of these. First we can control conditions reducing the variability in our study; this is the approach usually take in physics and other sciences, using very pure substances, with very precise instruments in controlled environments. Alternatively, we can measure other factors and fit or model the effect of these, for example, we might ask the participants’ age, prior experience, or other things we think may affect the results of our study.
  • increase effect size – Finally, we can attempt to manipulate the sensitivity of our study. A notable example of this is the photo from the back of the crowd at President Trump’s inauguration. It was very hard to assess differences in crowd size at different events from the photos taken from the front of the crowd, but photos at the back are a far more sensitive. Your studies will probably be less controversial, but you can use the same technique. Of course, there is a corresponding danger of false baselines, in that we may end up with a misleading idea of the size of effects — as noted previously with power comes the responsibility to report fairly and accurately.

In the following two posts, we will consider strategies that address the factors of the noise–effect–number triangle in different ways. We will concentrate first on the subjects, the users or participants in our studies, and then on the tasks we give them to perform.

 

gaining power (1) – if there is something there, make sure you find it

 

Statistical power is about whether an experiment or study is likely to reveal an effect if it is present. Without a sufficiently ‘powerful’ study, you risk being in the middle ground of ‘not proven’, not being able to make a strong statement either for or against whatever effect, system, or theory you are testing.


You’ve recruited your participants and run your experiment or posted an online survey and gathered your responses; you put the data into SPSS and … not significant.   Six months work wasted and your plans for your funded project or PhD shot to ruins.

How do you avoid the dread “n.s.”?

Part of the job of statistics is to make sure you don’t say anything wrong, to ensure that when you say something is true, there is good evidence that it really is.

This is the why in traditional hypothesis testing statistics, you have such a high bar to reject the null hypothesis. Typically the alternative hypothesis is the thing you are really hoping will be true, but you only declare it likely to be true if you are convinced that the null hypothesis is very unlikely.

Bayesian statistics has slightly different kinds of criteria, but is in the end doing the same things, ensuring you down have false positives.

However, you can have the opposite problem, a false negative — there may be a real effect there, but your experiment or study was simply not sensitive enough to detect it.

Statistical power is all about avoiding these false negatives. There are precise measures of this you can calculate, but in broad terms, it is about whether an experiment or study is likely to reveal an effect if it is present. Without a sufficiently ‘powerful’ study, you risk being in the middle ground of ‘not proven’, not being able to make a strong statement either for or against whatever effect, system, or theory you are testing.

(Note the use of the term ‘power’ here is not the same as when we talk about power-law distributions for network data).

The standard way to increase statistical power is simply to recruit more participants. No matter how small the effect, if you have a sufficiently large sample, you are likely to detect it … but ‘sufficiently large’ may be many, many people.

In HCI studies the greatest problem is often finding sufficient participants to do meaningful statistics. For professional practice we hear that ‘five users are enough‘, but less often that this figure was based on particular historical contingencies and in the context of single formative iterations, not summative evaluations, which still need the equivalent of ‘power’ to be reliable.

Happily, increasing the number of participants is not the only way to increase power.

In blogs over the next week or two, we will see that power arises from a combination of:

  • the size of the effect you are trying to detect
  • the size of the study (number of trails/participants) and
  • the size of the ‘noise’ (the random or uncontrolled factors).

We will discuss various ways in which careful design, selection of subjects and tasks can increase the power of your study albeit sometimes requiring care in interpreting results. For example, using a very narrow user group can reduce individual differences in knowledge and skill (reduce noise) and make it easier to see the effect of a novel interaction technique, but also reduces generalisation beyond that group. In another example, we will also see how careful choice of a task can even be used to deal with infrequent expert slips.

Often these techniques sacrifice some generality, so you need to understand how your choices have affected your results and be prepared to explain this in your reporting: with great (statistical) power comes great responsibility!

However, if a restricted experiment or study has shown some effect, at least you have results to report, and then, if the results are sufficiently promising, you can go on to do further targeted experiments or larger scale studies knowing that you are not on a wild goose chase.

Slides for CHI2017 statistics course available at SlideShare

All the slides for the CHI 2017 course have been uploaded to SlideShare. They can be found in the relevant section of the course website, and are collected below.

There will be videos for all this material following over the summer. If there are particular topics of interest, let me know and I’ll use this to help prioritise which parts to video first.