The course is divided into four main parts:
Wild and wide – concerning randomness and distributions
The course will begin with some exercises and demonstrations of the unexpected wildness of random phenomena, including the effects of bias and non-independence (when one result affects others).
We will discuss different kinds of distribution and the reasons why the normal distribution (classic hat shape), on which so many statistical tests are based, is so common. In particular we will look at some of the ways in which the effects we see in HCI may not satisfy the assumptions behind the normal distribution.
Most will be aware of the use of non-parametric statistics for discrete data such as Likert scales, but there are other ways in which non-normal distributions arise. Positive feedback effects, which give rise to the beauty of a snowflake, also create effects such as the bi-modal distribution of student marks in certain kinds of university courses (don’t believe those who say marks should be normally distributed!). This can become more complex if feedback processes include some form of threshold or other non-linear effect (e.g. when the rate of a task just gets too much for a user).
All of these effects are found in the processes that give rise to social networks, both online and offline, and to other forms of network phenomena, which are often far better described by a long-tailed ‘power law’.
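As a rough illustration of how positive feedback can produce a long tail, the sketch below grows a toy network in which each new node links to an existing node chosen in proportion to the links that node already has. This is only one simple assumed model ('rich get richer'), not a description of any particular real network; the function name `grow_network` and the network size are made up for the example.

```python
import random
import statistics
from collections import Counter

def grow_network(n_nodes, seed=1):
    """Return the number of links per node in a toy 'rich get richer' network."""
    rng = random.Random(seed)
    # 'ends' lists every link endpoint, so a node with k links appears k times;
    # choosing uniformly from it favours nodes that are already well connected.
    ends = [0, 1]  # start with two nodes joined by a single link
    for new_node in range(2, n_nodes):
        target = rng.choice(ends)          # the positive feedback step
        ends.extend([new_node, target])    # record the new link's two endpoints
    return Counter(ends)

degrees = grow_network(5000)
median_degree = statistics.median(degrees.values())
max_degree = max(degrees.values())
print("median links per node:", median_degree)
print("largest number of links:", max_degree)
```

In a normally distributed quantity the largest value would sit within a few standard deviations of the typical one; here a handful of nodes end up with many times more links than the typical node, which is the kind of behaviour a long-tailed distribution describes.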
Doing it – if not p then what
In this part we will look at the major kinds of statistical analysis methods:
- Hypothesis testing (the dreaded p!) – robust but confusing
- Confidence intervals – powerful but underused
- Bayesian stats – mathematically clean but fragile
- Simulation based – rare but useful
None of these is a magic bullet; all need care and a level of statistical understanding to apply.
We will discuss how these are related, including the relationship between ‘likelihood’ in hypothesis testing and conditional probability as used in Bayesian analysis. There are common issues, including the need to report clearly the numbers and the tests/distributions used, avoiding cherry picking, and dealing with outliers, non-independent effects and correlated features. However, there are also specific issues for each method.
Classic statistical methods used in hypothesis testing and confidence intervals depend on an idea of which outcomes count as ‘worse’ for a measure; this is sometimes obvious, sometimes needs thought (one vs. two tailed test), and is sometimes outright confusing. In addition, care is needed in hypothesis testing to avoid classic mistakes such as treating a non-significant result as showing no effect, and reporting inflated effect sizes.
In Bayesian statistics different problems arise: the need to decide, in a robust and defensible manner, the expected likelihoods of the different hypotheses before an experiment; and the danger of common causes leading to inflated probability estimates because of a single initial fluke event or an optimistic prior.
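The sensitivity to the prior can be seen in a minimal sketch of Bayes' rule for a single yes/no hypothesis. The function name `posterior` and all the probabilities used are illustrative assumptions, not values from any study.

```python
def posterior(prior, p_data_if_true, p_data_if_false):
    """Probability of the hypothesis after seeing the data (Bayes' rule)."""
    # Total probability of the data, whether or not the hypothesis holds.
    evidence = prior * p_data_if_true + (1 - prior) * p_data_if_false
    return prior * p_data_if_true / evidence

# The same (assumed) experimental result, interpreted with three different priors.
for prior in (0.1, 0.5, 0.9):
    print(f"prior {prior:.1f} -> posterior {posterior(prior, 0.8, 0.2):.2f}")

# Updating twice is only valid if the two results are independent. If both
# stem from one common cause (the same fluke), this double counts the evidence.
once = posterior(0.5, 0.8, 0.2)
twice = posterior(once, 0.8, 0.2)
print(f"counted once: {once:.2f}, counted twice: {twice:.2f}")
```

The identical data lead to quite different conclusions depending on the prior, and counting one underlying event as two pieces of evidence pushes the estimate higher than it deserves to be.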
Crucially, while all methods have problems that need to be avoided, we will see how not using statistics at all can be far worse.
Gaining power – the dreaded ‘too few participants’
Statistical power is about whether an experiment or study is likely to reveal an effect if it is present. Without a sufficiently ‘powerful’ study, you risk being in the middle ground of ‘not proven’, not being able to make a strong statement either for or against whatever effect, system, or theory you are testing.
In HCI studies the greatest problem is often finding sufficient participants to do meaningful statistics. For professional practice we hear that ‘five users are enough’, but less often that this figure was based on particular historical contingencies and in the context of single iterations, not summative evaluations, which still need the equivalent of ‘power’ to be reliable.
However, power arises from a combination of the size of the effect you are trying to detect, the size of the study (number of trials/participants) and the size of the ‘noise’ (the random or uncontrolled factors).
Increasing the number of participants is not the only way to increase power, and we will discuss various ways in which careful design and selection of subjects and tasks can increase the power of your study, albeit sometimes requiring care in interpreting results. For example, using a very narrow user group can reduce individual differences in knowledge and skill (reduce noise) and make it easier to see the effect of a novel interaction technique, but it also reduces generalisation beyond that group. In another example, we will see how careful choice of a task can even be used to deal with infrequent expert slips.
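The interplay of effect size, study size and noise can be explored with a small simulation. The sketch below is a simplified illustration under stated assumptions: two independent groups, normally distributed measurements, and a normal (z) approximation in place of a proper t-test. The function name `estimate_power` and all the numbers are hypothetical.

```python
import random
import statistics

def estimate_power(effect, noise, n, sims=1000, seed=1):
    """Fraction of simulated studies that detect the effect at the 5% level."""
    rng = random.Random(seed)
    z_crit = statistics.NormalDist().inv_cdf(0.975)  # two-tailed 5% threshold
    detected = 0
    for _ in range(sims):
        # One simulated study: n participants per condition.
        control = [rng.gauss(0.0, noise) for _ in range(n)]
        treatment = [rng.gauss(effect, noise) for _ in range(n)]
        std_err = (statistics.variance(control) / n
                   + statistics.variance(treatment) / n) ** 0.5
        z = (statistics.mean(treatment) - statistics.mean(control)) / std_err
        if abs(z) > z_crit:
            detected += 1
    return detected / sims

print("10 per group:            ", estimate_power(effect=0.5, noise=1.0, n=10))
print("64 per group:            ", estimate_power(effect=0.5, noise=1.0, n=64))
print("10 per group, half noise:", estimate_power(effect=0.5, noise=0.5, n=10))
```

Under these assumptions, both adding participants and reducing noise raise the chance of detecting the same underlying effect, which is why design choices that cut noise can stand in for a larger sample.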
So what? – making sense of results
You have done your experiment or study and have your data. What next? How do you make sense of the results? In fact, one of the best ways to design a study is to imagine this situation before you start!
This part will address a number of questions to think about during analysis (or design), including:
- Is your work testing an existing hypothesis (validation) or finding out what you should be looking for (exploration)?
- Is it a one-off study, or part of a process (e.g. ‘5 users’ for iterative development)?
- How can you make sure your results and data can be used by others (e.g. repeatability, meta-analysis)?
- Looking at the data, does it make sense given your assumptions (e.g. Fitts’ Law experiments that assume index of difficulty is all that matters)?
- Thinking about the conditions, what have you really shown: some general result, or simply that one system or group of users is better than another?