Underfitting and overfitting
The purpose of a machine learning model is to approximate an unknown function that associates input elements with output ones (for a classifier, we call them classes). A training set is normally a sample drawn from a global distribution, and it cannot contain all possible elements; otherwise, the problem could be solved with a one-to-one association. Likewise, we don't know the analytic expression of the underlying function. Therefore, when training, it's necessary to fit the model while keeping it free to generalize when an unknown input is presented.
In this regard, it's useful to introduce the concept of the representational capacity of a model: its ability to learn a small or large number of possible distributions over the dataset. A low capacity is normally associated with simpler models that, for example, cannot solve non-linear problems, while a high capacity, which is a function of both the underlying model and the number of parameters, leads to more complex separation surfaces. Considering the last example in the previous section, it's easy to understand that the linear classifier is equivalent to the equation of a straight line:
$$y = mx + q$$
In this case, there are two parameters, m and q, and the curve can never change its slope (which is defined by m). Conversely, the second classifier could be imagined as a cubic equation:
$$y = ax^3 + bx^2 + cx + q$$
Now, we have four parameters and two additional powers of the input value (the square and the cube). These conditions allow modeling a function that can change its slope up to twice and can be adapted to more complex scenarios. Obviously, we could continue this analysis by considering a generic polynomial function:
$$y = a_p x^p + a_{p-1} x^{p-1} + \dots + a_1 x + a_0$$
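To make the link between the degree of a polynomial and its slope changes concrete, here is a minimal sketch in Python with NumPy (not part of the original text). The helper `turning_points` and the two example coefficient sets are hypothetical, chosen only for illustration; the helper looks for the points where the derivative changes sign.

```python
import numpy as np

def turning_points(coefficients, eps=1e-6):
    """Return the x values where the slope of a polynomial changes sign.

    `coefficients` go from the highest power down to the constant term,
    which is the ordering expected by numpy.poly1d.
    """
    slope = np.poly1d(coefficients).deriv()
    roots = slope.roots
    # Only real roots of the derivative correspond to points on the curve
    candidates = np.unique(np.round(roots[np.isreal(roots)].real, 8))
    # Keep a root only if the slope has opposite signs on its two sides
    return [float(x) for x in candidates
            if slope(x - eps) * slope(x + eps) < 0]

line = [2.0, 1.0]              # y = 2x + 1 (illustrative values for m and q)
cubic = [1.0, 0.0, -3.0, 0.0]  # y = x^3 - 3x (illustrative coefficients)

print(turning_points(line))    # empty list: the slope is constant
print(turning_points(cubic))   # two turning points, near x = -1 and x = 1
```

A straight line yields no turning points, while this particular cubic yields two, which is the maximum for degree three because its derivative is a quadratic with at most two real roots.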
The complexity (and, hence, the capacity) grows with the degree p. Combining polynomials and non-linear functions, we can obtain extremely complex representations (such as the ones achieved using neural networks) that can be flexible enough to capture the details of non-trivial datasets. However, it's important to remember that increasing the capacity is normally an irreversible operation. In other words, a more complex model will always be more complex, even when a simpler one would be preferable. The learning process can stretch or bend the curve, but it will never be able to remove the slope changes (for a more formal explanation, please check Mastering Machine Learning Algorithms, Bonaccorso G., Packt Publishing, 2018). This condition leads to two different potential dangers:
- Underfitting: The model isn't able to capture the dynamics shown by the training set itself (probably because its capacity is too limited).
- Overfitting: The model has an excess capacity and it's no longer able to generalize effectively, considering the original dynamics provided by the training set. It can associate almost perfectly all the known samples to the corresponding output values, but when an unknown input is presented, the corresponding prediction error can be very high.
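The two dangers listed above can be observed by comparing the error on the training set with the error on samples the model has never seen. The following is a minimal sketch, assuming a synthetic one-dimensional regression problem (a noisy sine wave) and polynomial models of degree 1, 3, and 15; the data, the degrees, and the helper names are illustrative assumptions, not taken from the original text.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def sample(n):
    # Hypothetical data-generating process: a sine wave plus Gaussian noise
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

def mse(model, x, y):
    # Mean squared error of the model's predictions
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = sample(20)   # small training set
x_test, y_test = sample(200)    # never-seen samples from the same process

results = {}
for degree in (1, 3, 15):
    # Least-squares polynomial fit of the given degree
    model = Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train),
                       mse(model, x_test, y_test))
    print(f"degree={degree:2d}  train MSE={results[degree][0]:.4f}  "
          f"test MSE={results[degree][1]:.4f}")
```

The training error can only decrease as the degree grows, because each model contains the previous ones as special cases. The low-capacity model typically shows a high error on both sets (underfitting), while the high-capacity one typically shows a very low training error together with a clearly higher test error (overfitting).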
In the following graph, there are examples of interpolation with low capacity (underfitting), normal capacity (normal fitting), and excessive capacity (overfitting):
An underfitted model usually has a high bias, which is defined as the difference between the expected value of the estimate of the parameters (denoted $\hat{\theta}$) and the true values θ:
$$\text{Bias}[\hat{\theta}] = E[\hat{\theta}] - \theta$$
When the bias is null, the model is defined as unbiased. On the other hand, the presence of a bias means that the algorithm is not able to learn an acceptable representation of θ. In the first example of the previous graph, the straight line only has a negligible error in the neighborhoods of two points (about 0.3 and 0.8) and, as it's not able to change the slope, the bias will force the error to increase in all the other regions. Conversely, overfitting is often associated with a high variance, which is defined as follows for the estimate $\hat{\theta}$:
$$\text{Var}[\hat{\theta}] = E\left[\left(\hat{\theta} - E[\hat{\theta}]\right)^2\right]$$
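Bias and variance can also be estimated empirically by training the same kind of model on many different training sets. The sketch below is an illustration under stated assumptions: it measures both quantities on the model's predictions rather than on the parameters θ (so that models with different numbers of parameters can be compared), it uses a synthetic noisy sine wave, and it keeps the input values fixed so that the training sets differ only in their noise. All names and values are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def true_function(x):
    # Hypothetical underlying function, unknown in a real scenario
    return np.sin(2.0 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 30)   # fixed training inputs
x_eval = np.linspace(0.0, 1.0, 100)   # points where predictions are compared
n_trials = 200
noise_std = 0.2

def bias_and_variance(degree):
    predictions = np.empty((n_trials, x_eval.size))
    for i in range(n_trials):
        # Each trial draws a new noisy training set and refits the model
        y_train = true_function(x_train) + rng.normal(0.0, noise_std, x_train.size)
        predictions[i] = Polynomial.fit(x_train, y_train, degree)(x_eval)
    mean_prediction = predictions.mean(axis=0)
    # Squared difference between the average prediction and the truth
    squared_bias = float(np.mean((mean_prediction - true_function(x_eval)) ** 2))
    # Spread of the predictions around their own average
    variance = float(np.mean(predictions.var(axis=0)))
    return squared_bias, variance

estimates = {degree: bias_and_variance(degree) for degree in (1, 9)}
for degree, (squared_bias, variance) in estimates.items():
    print(f"degree={degree}  squared bias={squared_bias:.4f}  variance={variance:.4f}")
```

In this setting, the straight line shows the larger squared bias, because no choice of its two parameters can follow the sine wave, while the degree-9 polynomial shows the larger variance, because it also follows the noise of each individual training set.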
A high variance is often the result of a high capacity. The model can now oscillate, changing its slope many times, but it can no longer behave as a simpler one. The right-hand example in the previous graph shows an extremely complex curve that will probably fail to classify the majority of never-seen samples. Underfitting is easier to detect by looking at the prediction error, while overfitting may prove more difficult to discover, as it could initially be mistaken for a perfect fit.
In fact, in a classification task, a high-variance model can easily learn the structure of the dataset employed in the training phase but, due to its excess complexity, it can become frozen and hyperspecialized. This often means that it will handle never-seen samples with less accuracy, as their features cannot be recognized as variants of the samples belonging to a class. Every small modification is captured by the model, which can now adapt its separation surface with more freedom, so the similarities (which are fundamental for the generalization ability) are more difficult to detect.
Cross-validation and other techniques that we're going to discuss in the following chapters can show how our model works with test samples that were never seen during the training phase. That way, it's possible to assess the generalization ability in a broader context (remember that we're not working with all possible values, but always with a subset that should reflect the original distribution) and make the most reasonable decisions. The reader must remember that the real goal of a machine learning model is not to overfit the training set (we'll discuss this in the next chapter, Chapter 3, Feature Selection and Feature Engineering), but to work with never-seen samples; hence, it's necessary to pay attention to the performance metrics before moving to a production stage.
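As a preview of the cross-validation idea mentioned above, here is a minimal hand-written k-fold sketch using only NumPy. It is an illustration under assumptions (synthetic noisy sine data, polynomial models, and a hypothetical helper named `k_fold_mse`), not the procedure presented in the following chapters; libraries such as scikit-learn provide ready-made tools for the same purpose.

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_mse(x, y, degree, n_folds=5, seed=0):
    """Average held-out mean squared error of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices and split them into disjoint folds
    folds = np.array_split(rng.permutation(x.size), n_folds)
    scores = []
    for k in range(n_folds):
        test_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit on the other folds, evaluate on the held-out one
        model = Polynomial.fit(x[train_idx], y[train_idx], degree)
        scores.append(np.mean((model(x[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(scores))

# Hypothetical dataset: a noisy sine wave sampled at random inputs
data_rng = np.random.default_rng(1)
x = data_rng.uniform(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x) + data_rng.normal(0.0, 0.2, x.size)

cv_scores = {degree: k_fold_mse(x, y, degree) for degree in (1, 3, 12)}
for degree, score in cv_scores.items():
    print(f"degree={degree:2d}  cross-validated MSE={score:.4f}")

best_degree = min(cv_scores, key=cv_scores.get)
print("degree with the lowest cross-validated error:", best_degree)
```

Because every sample is used for evaluation exactly once and never by the model that was trained on it, the averaged score estimates the error on never-seen data. A model that underfits typically scores poorly here, and one that overfits typically scores worse than its training error would suggest.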