outliers

Outliers are data items that are very unusual, for example having particularly large or small values of a numerical feature. Outliers can be problematic, especially for statistical methods that use the arithmetic mean, variance or similar measures, all of which are quite sensitive to outliers. They can also be a problem in pre-processing for AI and ML algorithms if, for example, we use the mid-point of the full range of a feature as a way of reducing it to a categorical large–small derived feature: a single extreme value moves the mid-point so far that nearly every item ends up in the same category.
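As a rough illustration of this sensitivity, the following Python sketch compares a small made-up sample with and without a single extreme value. The data, the value 500 and the helper name midpoint_split are assumptions chosen for the example, not taken from the book.

```python
import statistics

# Made-up feature values; the single value 500 plays the role of the outlier.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11]
with_outlier = values + [500]


def midpoint_split(xs):
    """Label each item 'large' or 'small' relative to the mid-point of the full range."""
    midpoint = (min(xs) + max(xs)) / 2
    return ["large" if x > midpoint else "small" for x in xs]


# The mean and variance shift a great deal when one extreme item is added ...
mean_clean = statistics.mean(values)
mean_outlier = statistics.mean(with_outlier)
var_clean = statistics.pvariance(values)
var_outlier = statistics.pvariance(with_outlier)

# ... whereas the median barely moves.
median_clean = statistics.median(values)
median_outlier = statistics.median(with_outlier)

# The mid-point split changes from a balanced division to almost all 'small'.
labels_clean = midpoint_split(values)
labels_outlier = midpoint_split(with_outlier)

print("mean:", mean_clean, "->", round(mean_outlier, 1))
print("variance:", round(var_clean, 2), "->", round(var_outlier, 1))
print("median:", median_clean, "->", median_outlier)
print("items labelled large:", labels_clean.count("large"), "->", labels_outlier.count("large"))
```

With these particular numbers the split goes from an even division to one where only the outlier itself is labelled large, so the derived feature carries almost no information.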

One way to avoid these problems is to use measures that are less sensitive to outliers, such as the median or interquartile range, or to pre-whiten the data, for example by scaling it to percentiles. Another approach is to remove outliers from the data; a common rule is to remove items whose values lie more than two standard deviations from the mean. If possible, it is important to fix the rule in advance, to avoid cherry-picking the removal of outliers simply because they give results you don't like, and hence adding your own bias to the data.
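The approaches above can be sketched in Python as follows. This is only one possible reading: the sample data, the constant K and the helper names percentile_ranks and remove_outliers are assumptions made for the example, and the sample standard deviation is used where the text does not say which form is intended.

```python
import statistics
from bisect import bisect_right

# Made-up feature values with one extreme item.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 500]

# Measures that are less sensitive to outliers: median and interquartile range.
median = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1


def percentile_ranks(xs):
    """Rescale each value to the percentage of items less than or equal to it."""
    ordered = sorted(xs)
    return [100 * bisect_right(ordered, x) / len(xs) for x in xs]


# Threshold chosen before looking at any results, so removal is not cherry-picked.
K = 2.0


def remove_outliers(xs, k=K):
    """Drop items lying more than k standard deviations from the mean."""
    mean = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [x for x in xs if abs(x - mean) <= k * sd]


ranks = percentile_ranks(data)
kept = remove_outliers(data)

print("median:", median, "IQR:", iqr)
print("largest percentile rank:", max(ranks))
print("kept after removal:", kept)
```

After percentile scaling the extreme item is simply the top-ranked value rather than one many times larger than the rest. Note that the removal rule here uses a mean and standard deviation that are themselves inflated by the outlier, which is one reason the median and interquartile range are often preferred.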

Used in Chap. 7: pages 88, 89, 96, 100; Chap. 10: pages 134, 138, 139; Chap. 14: page 215