Choose a coin, any coin.
If a coin is fair, then the probability of six heads in six tosses is 1/64, as is the probability of six tails, so the probability of six of either is 2/64, approximately 3%.
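The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name prob_all_same is made up for illustration, and it assumes independent tosses of a perfectly fair coin.

```python
from fractions import Fraction


def prob_all_same(n_tosses: int) -> Fraction:
    """Probability that n_tosses of a fair coin come up all heads or all tails."""
    # Probability of one specific run, e.g. HHHHHH: (1/2) ** n_tosses
    p_one_side = Fraction(1, 2) ** n_tosses
    # All heads and all tails are mutually exclusive, so their probabilities add.
    return 2 * p_one_side


p = prob_all_same(6)
print(p, float(p))  # 1/32 0.03125
```

Exact fractions are used rather than floats so that 2/64 shows up in its reduced form, 1/32, before being converted to the roughly 3% quoted in the text.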
So we can do an experiment.
H0, the null hypothesis, is that the coin is fair.
H1, the alternative hypothesis, is that the coin is not fair.
The probability of HHHHHH or TTTTTT given H0 is less than 5%, so if you get six heads or six tails, you can reject the null hypothesis and conclude that the coin is not fair.
Try it, and if it doesn’t work try again. How long before you end up with a ‘statistically significant’ test?
Think back to the discussion of cherry picking and multiple tests …
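If you would rather not toss a real coin dozens of times, the 'try again until it works' procedure can be simulated. The sketch below assumes a fair coin throughout (so H0 is true by construction) and uses made-up function names; it is an illustration of the thought experiment, not a recommended analysis method.

```python
import random


def tries_until_significant(rng: random.Random, n_tosses: int = 6) -> int:
    """Repeat the n_tosses experiment with a fair coin until every toss
    in one experiment matches; return how many experiments that took."""
    tries = 0
    while True:
        tries += 1
        tosses = [rng.choice("HT") for _ in range(n_tosses)]
        if len(set(tosses)) == 1:  # HHHHHH or TTTTTT
            return tries


def prob_at_least_one(n_experiments: int, p: float = 1 / 32) -> float:
    """Chance that at least one of n independent experiments looks 'significant'."""
    return 1 - (1 - p) ** n_experiments


rng = random.Random(0)  # fixed seed only so that reruns are repeatable
runs = [tries_until_significant(rng) for _ in range(10_000)]
print("average experiments needed:", sum(runs) / len(runs))
print("chance of a 'significant' result within 22 tries:", prob_at_least_one(22))
```

With a 1 in 32 chance per experiment, the average number of experiments needed is 32 in theory, and the chance of at least one 'significant' result passes 50% by the 22nd attempt, even though the simulated coin is fair every time.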
This might seem a little artificial, but imagine that, rather than coin tossing, it is six users' preferences for software A or B.
Having done this, how do you feel about whether 5% is a suitable level to regard as evidence?