Model Validation and Overfitting

Model validation is the process of checking the performance of a statistical or data-analytical model.

A common method for validating neural networks is k-fold cross-validation. The data set is divided into k subsets (folds). One fold serves as the test set, and the remaining k-1 folds together form the training set, which is used to fit the model. The proportion of correct results on the test set indicates how well the model generalizes. The test fold is then exchanged for one of the training folds and the performance is measured again, until every fold has served as the test set exactly once. Finally, the average over all k runs is taken as the estimate of the model's performance. The advantage of this method is that it yields a performance estimate with relatively low variance, because every data point is used for both training and testing, so no important structure in the data is permanently excluded from training.

This procedure is essentially an extension of the holdout method, which splits the data set just once into a training set and a test set. In contrast to k-fold cross-validation, the holdout method carries the risk that important data is not available for training, which can leave the model unable to generalize sufficiently.
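The k-fold procedure described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the toy threshold classifier and the toy data are all assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Return the mean test accuracy over k folds."""
    folds = k_fold_indices(len(xs), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        # All folds except fold i form the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        correct = sum(predict(model, xs[j]) == ys[j] for j in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k

# Hypothetical toy model: one threshold halfway between the two class means.
def fit_threshold(xs, ys):
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, x):
    return int(x > threshold)

# Toy data: label 1 for x >= 10, label 0 otherwise.
xs = list(range(20))
ys = [int(x >= 10) for x in xs]

mean_accuracy = cross_validate(xs, ys, k=5, fit=fit_threshold, predict=predict_threshold)
print(f"mean accuracy over 5 folds: {mean_accuracy:.2f}")
```

With k = 5 and 20 samples, each run trains on 16 samples and tests on the remaining 4, and every sample is used for testing exactly once.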

 

Example illustration of k-fold cross-validation

 

Overfitting

Overfitting occurs when a model is adapted too specifically to its training set. For a neural network, this means that the network is very accurate on inputs from the training data set but not on a test set. The model reproduces the training data very accurately, but it is unable to produce generalized results.

Overfitting typically occurs when the training data set is relatively small and the model is relatively complex, because a complex model can reproduce the training data more accurately. Conversely, a model that is too simple does not represent the training data accurately (underfitting). It is therefore generally advisable to keep the model as simple as possible, but not too simple, depending on the available data set. A perfect model, that is, one that neither overfits nor underfits, is almost impossible to create.
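The gap between training and test performance can be made visible with a small experiment. The sketch below is an assumption-laden toy, not a neural network: it compares a "complex" model that memorises every training point (1-nearest-neighbour) with a simple one-parameter threshold model, on a small, noisy training set. The data generator, the 20% label noise and all function names are chosen purely for illustration.

```python
import random

def make_data(n, noise, rng):
    """Points x in [0, 1) with label 1 if x > 0.5; each label is flipped with probability `noise`."""
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) ^ int(rng.random() < noise) for x in xs]
    return xs, ys

def accuracy(predict, xs, ys):
    """Fraction of samples for which the prediction matches the label."""
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = make_data(30, 0.2, rng)    # small training set
test_x, test_y = make_data(1000, 0.2, rng)    # larger, unseen test set

# Complex model: returns the label of the nearest training point,
# so it reproduces the training set exactly, noise included.
def memoriser(x):
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

# Simple model: a single threshold halfway between the two class means.
def class_mean(label):
    members = [x for x, y in zip(train_x, train_y) if y == label]
    return sum(members) / len(members)

threshold = (class_mean(0) + class_mean(1)) / 2

def simple(x):
    return int(x > threshold)

results = {
    "memoriser": (accuracy(memoriser, train_x, train_y), accuracy(memoriser, test_x, test_y)),
    "simple": (accuracy(simple, train_x, train_y), accuracy(simple, test_x, test_y)),
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train accuracy {train_acc:.2f}, test accuracy {test_acc:.2f}")
```

The memorising model is perfect on the training set by construction, yet it cannot be perfect on the test set because it has also learned the noise. A large difference between training and test accuracy of this kind is the usual symptom of overfitting.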

Several methods have been introduced to reduce overfitting while keeping underfitting low. One of them is dropout, a technique patented by Google. During training, dropout randomly deactivates a fixed fraction of the neurons, usually between 20% and 50% depending on the chosen dropout rate. Despite its simplicity, this method achieves a significant reduction in overfitting.
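The core of dropout fits in a few lines. The following is a minimal sketch in plain Python, operating on a list of activations rather than a real network layer. It uses the common "inverted dropout" variant, which rescales the surviving activations so that their expected value stays unchanged. That rescaling, the function name `dropout` and its parameters are assumptions of this example and are not taken from the text above.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time all neurons stay active.
        return list(activations)
    keep = 1.0 - rate
    # Surviving activations are scaled by 1/keep so the expected output is unchanged.
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)
hidden = [1.0] * 10000                      # hypothetical layer output
dropped = dropout(hidden, rate=0.5, rng=rng)

fraction_zeroed = dropped.count(0.0) / len(dropped)
print(f"fraction of deactivated neurons: {fraction_zeroed:.3f}")
```

Because a different random subset of neurons is switched off for every call, the network cannot rely on any single neuron, which is the intuition usually given for why dropout reduces overfitting.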

 

Example illustration of overfitting