outliers

Terms from Artificial Intelligence: humans at the heart of algorithms

Outliers are data items that are very unusual, for example having particularly large or small values of a numerical feature. Outliers can be problematic, especially for statistical methods that use the arithmetic mean, variance or similar measures, all of which are quite sensitive to outliers. This can also be a problem in pre-processing for AI and ML algorithms if, for example, we use the mid-point of the full range of a feature as a way of reducing it to a categorical large/small derived feature: a single extreme value shifts the mid-point so far that nearly every item falls into the same category. One way to avoid these problems is to use robust metrics, such as the median or interquartile range, or to transform the data first, for example by scaling it to percentiles. Another approach is to remove outliers from the data; a common rule is to remove items whose values are more than two standard deviations from the mean. It is important to have a fixed, pre-determined rule for this if possible, to avoid cherry-picking the removal of outliers that simply give results you don't like, and hence adding your own bias to the data.
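The effect can be seen in a small sketch. The data values, the function name remove_outliers and the threshold parameter k are all made up for illustration; the rule itself is the two-standard-deviation rule mentioned above, fixed before looking at the results.

```python
import statistics

def remove_outliers(values, k=2.0):
    """Drop values more than k sample standard deviations from the mean.

    Hypothetical helper for illustration; k is chosen in advance.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= k * sd]

# Made-up feature values with one extreme item (95).
data = [9, 10, 10, 11, 11, 12, 12, 13, 95]

# The mean is dragged upwards by the outlier; the median is not.
mean_all = statistics.mean(data)
median_all = statistics.median(data)

# Splitting into large/small at the mid-point of the full range
# puts almost everything in the "small" category.
midpoint = (min(data) + max(data)) / 2
large_by_midpoint = [v for v in data if v > midpoint]

# Splitting at the median gives a far more balanced division.
large_by_median = [v for v in data if v > median_all]

# Applying the pre-determined removal rule.
cleaned = remove_outliers(data)

print("mean:", mean_all, "median:", median_all)
print("large by mid-point:", large_by_midpoint)
print("large by median:", large_by_median)
print("after removal:", cleaned)
```

Note that with very small samples a single extreme value inflates the standard deviation so much that it may not exceed the threshold at all, which is one reason the median and interquartile range are often preferred.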

Used on pages 129, 141, 329