Learnability
A parametric model can be split into two parts: a static structure and a dynamic set of parameters. The former is determined by the choice of a specific algorithm and is normally immutable (except in cases where the model provides remodeling functionality), while the latter is the objective of our optimization. A set of n unbounded parameters generates an n-dimensional space (imposing bounds yields a subspace, without any relevant change to our discussion) where each point, together with the immutable part of the estimator function, represents a learning hypothesis H (associated with a specific set of parameters).
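In a compact notation (the symbols are assumed here: f for the fixed structure, $\bar{\theta}$ for the parameter vector, and $\hat{y}$ for the prediction), a hypothesis can therefore be written as:

$$
H_{\bar{\theta}}: \hat{y} = f(\bar{x}; \bar{\theta}), \qquad \bar{\theta} = (\theta_1, \theta_2, \ldots, \theta_n) \in \mathbb{R}^{n}
$$

Changing $\bar{\theta}$ moves us from one point of the parameter space to another, that is, from one hypothesis to another, while f stays the same.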
The goal of a parametric learning process is to find the hypothesis whose prediction error is minimal while its residual generalization ability is sufficient to avoid overfitting. In the following diagram, we can see an example of a dataset whose points must be classified as red (Class A) or blue (Class B). Three hypotheses are shown: the first one (the middle line starting from the left) misclassifies 3 samples, while the lower and upper ones misclassify 14 and 24 samples, respectively:
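In a standard formulation (the symbols below, including the zero-one loss L, are assumed rather than taken from the text), this means selecting the parameter vector that minimizes the average loss over the N training samples; counting misclassified points, as done for the three hypotheses above, corresponds to the zero-one loss:

$$
\bar{\theta}^{*} = \underset{\bar{\theta}}{\arg\min} \; \frac{1}{N} \sum_{i=1}^{N} L\left(y_i, f(\bar{x}_i; \bar{\theta})\right), \qquad L(y, \hat{y}) =
\begin{cases}
0 & \text{if } y = \hat{y} \\
1 & \text{otherwise}
\end{cases}
$$

The capacity of the model (or an explicit regularization term) still has to be controlled, because the hypothesis that merely minimizes the training error can overfit.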
Of course, the first hypothesis is almost optimal and should be selected; however, it's important to understand an essential concept in order to recognize potential overfitting. Think about an n-dimensional binary classification problem. We say that dataset X is linearly separable (without transformations) if there exists a hyperplane that divides the space into two subspaces, each containing only elements belonging to the same class. Removing the constraint of linearity, we have infinitely many alternatives using generic hypersurfaces. However, a parametric model adopts only a family of non-periodic, approximating functions whose ability to oscillate and fit the dataset is determined (sometimes in a very complex way) by the number of parameters.
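In symbols (the weight vector $\bar{w}$ and bias b are assumed notation, not given in the text), linear separability of a dataset split into Class A and Class B means:

$$
\exists \, \bar{w} \in \mathbb{R}^{n}, \; b \in \mathbb{R} \; : \; \bar{w}^{T}\bar{x}_i + b > 0 \;\; \forall \bar{x}_i \in \text{Class A} \quad \text{and} \quad \bar{w}^{T}\bar{x}_i + b < 0 \;\; \forall \bar{x}_i \in \text{Class B}
$$

The separating hyperplane is the set of points where $\bar{w}^{T}\bar{x} + b = 0$.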
Consider the example shown in the following diagram:
The blue classifier is linear while the red one is cubic. At a glance, the non-linear strategy seems to perform better, because its concavities give it more expressivity. However, if new samples are added following the trend defined by the last four samples (from the right), they'll be completely misclassified. In fact, while a linear function is globally better, it cannot capture the initial oscillation between 0 and 4; a cubic approach can fit this data almost perfectly but, at the same time, loses the ability to follow the global linear trend. Therefore, there are two possibilities (a code sketch after the following list illustrates the trade-off):
- If we expect future data to be distributed exactly like the training samples, a more complex model can be a good choice for capturing small variations that a lower-level one would discard. In this case, a linear (or lower-level) model will lead to underfitting, because it won't be able to reach an appropriate level of expressivity.
- If we think that future data can be locally distributed differently while keeping the global trend, it's preferable to accept a higher residual misclassification error in exchange for better generalization ability. Using a bigger model that focuses only on the training data can lead to overfitting.
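The following is a minimal sketch of this trade-off (the synthetic data, the degree choices, and the use of scikit-learn's LogisticRegression with PolynomialFeatures are illustrative assumptions, not the book's example): a degree-1 and a degree-3 decision function are fitted on a small 1-D dataset with an initial oscillation, and both are scored on new samples that follow the global trend.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D dataset (illustrative values): an initial oscillation
# between 0 and 4, followed by a global trend toward Class B (label 1)
X_train = np.arange(0, 10, dtype=float).reshape(-1, 1)
y_train = np.array([0, 1, 0, 1, 0, 0, 1, 1, 1, 1])

# New samples that continue the global trend defined by the last points
X_new = np.array([[10.0], [11.0], [12.0], [13.0]])
y_new = np.array([1, 1, 1, 1])

for degree in (1, 3):
    # Polynomial features feed a logistic regression, so degree=1 yields a
    # linear decision function and degree=3 a cubic one
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    print(f"degree={degree}: "
          f"train accuracy={model.score(X_train, y_train):.2f}, "
          f"new-sample accuracy={model.score(X_new, y_new):.2f}")
```

Depending on the regularization strength, the cubic model can trade some of its accuracy on the new samples for a better fit of the initial oscillation, which is exactly the behavior discussed above.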