Overfitting is when a machine learning or a statistical model starts to model very particular features of the training set rather than the more general features of the underlying populaton or phenomena the training set is intended to be representative of. In the extreme the learning process precsiely mateches the training exmaples, but nothing else. This can be intentional, as in the case of memoisation, but usually an accdient due to fitting too many model parameters. In general there should be far fewer parameters (e.g. weights in a neural network) than features in the trainng data set. For exmaple, if you have a thousand training items with ten features per item, then you should have far less than ten thousand (1000 items x 10 features per item) parameters. By these metrics, some large language models appear to have too many parameters, and indeed do sometimes show features of overfitting such as directly reproducing passages of training text. However, there are arguments that this is not so critical as early layers of {[deep neural networks}} are producing large amounts of non-linear diversity, which is then effectvely internally selected and pruned by later layers, especially at pinch points.
Used on pages 98, 156, 183, 186, 188, 189, 386, 500, 523
Links:
- IBM: What is overfitting?