Supervised learning algorithms
In a supervised scenario, the task of the model is to find the correct label of a sample, assuming the presence of a correctly labeled training set and the possibility of comparing the estimated value with the correct one. The term supervised is derived from the idea of an external teaching agent that provides precise and immediate feedback after each prediction. The model can use such feedback as a measure of the error and, consequently, perform the corrections needed to reduce it.
More formally, if we assume a data generating process, the dataset is obtained as:
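The display equation does not appear in the text; a standard formulation consistent with the surrounding discussion (N IID pairs drawn from the data generating process) is:

```latex
X = \{\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_N\} \quad \text{with } \bar{x}_i \sim p_{data}
Y = \{y_1, y_2, \ldots, y_N\} \quad \text{with } y_i \in \{1, 2, \ldots, M\}
```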
As discussed in the previous section, all samples must be independent and identically distributed (IID) values uniformly sampled from the data generating process. In particular, all classes must represent the actual distribution (for example, if p(y=0) = 0.4 and p(y=1) = 0.6, the proportions should be about 40% and 60%). However, in order to avoid biases, when the discrepancies between classes are not very large, a reasonable choice is perfectly uniform sampling, with the same number of representatives for each class y = 1, 2, ..., M.
A generic classifier can be modeled in two ways:
- A parametrized function that outputs the predicted class
- A parametrized probability distribution that outputs the class probability for every input sample
In the first case, we have:
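The equation itself is missing here; a standard form of a parametrized classifier, consistent with the notation above, is:

```latex
\hat{y} = f(\bar{x}; \bar{\theta})
```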
Considering the whole dataset X, it's possible to compute a global cost function L:
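The formula is not shown; a standard global cost function, written as a sum of per-sample losses over the whole dataset, is:

```latex
L(\bar{\theta}) = \sum_{i=1}^{N} \ell\left(y_i, f(\bar{x}_i; \bar{\theta})\right)
```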
As L depends only on the parameter vector (xi and yi are constants), a generic algorithm must find the optimal parameter vector that minimizes the cost function. For example, in a regression problem (where the labels are continuous), the error measure can be the squared error between the actual value and prediction:
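The squared-error formula is missing; in the regression case it takes the standard form:

```latex
L(\bar{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - f(\bar{x}_i; \bar{\theta})\right)^2
```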
Such a cost function can be optimized in different ways (peculiar to specific algorithms), but a very common strategy (above all in deep learning) is to employ the Stochastic Gradient Descent (SGD) algorithm. It consists of iterating the following two steps:
- Computing the gradient ∇L (with respect to the parameter vector) with a small batch of samples xi ∈ X
- Updating the parameters, moving them in the direction opposite to the gradient, -∇L (remember that the gradient always points towards a maximum)
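The two steps above can be sketched with a minimal NumPy implementation for a linear regression trained with the squared error; the dataset, learning rate, and batch size are illustrative assumptions:

```python
import numpy as np

# Hypothetical toy dataset: y = 2x + 1 plus Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.1, size=256)

# Parameters of the model f(x; theta) = w*x + b
w, b = 0.0, 0.0
learning_rate = 0.1

for step in range(200):
    # Step 1: compute the gradient of the squared error on a small mini-batch
    idx = rng.integers(0, len(X), size=32)
    xb, yb = X[idx, 0], y[idx]
    error = (w * xb + b) - yb
    grad_w = 2.0 * np.mean(error * xb)
    grad_b = 2.0 * np.mean(error)
    # Step 2: move the parameters in the direction opposite to the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b)  # the parameters approach the generating values (2, 1)
```

In practice, frameworks offer more refined SGD variants (momentum, adaptive learning rates), but the update rule is conceptually the same.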
Instead, when the classifier is probabilistic, it should be represented as a parametrized conditional probability distribution:
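The corresponding expression is missing; the standard notation for such a distribution, given an input vector and the parameter vector, is:

```latex
p(y \mid \bar{x}; \bar{\theta})
```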
In other words, the classifier will now output the probability of a label y given an input vector. The goal is now to find the optimal parameter set such that:
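The condition is not displayed in the text; consistently with the discussion, the optimal parameter vector should make the model distribution approximate the conditional data generating distribution:

```latex
\bar{\theta}_{opt} : \quad p(y \mid \bar{x}; \bar{\theta}_{opt}) \approx p_{data}(y \mid \bar{x})
```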
In the preceding formula, we have expressed pdata as a conditional distribution. The optimization can be obtained using a probabilistic distance metric, such as the Kullback-Leibler divergence DKL (which is always non-negative DKL ≥ 0 and DKL = 0 only when the two distributions are identical):
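The divergence formula itself is missing; its standard definition for the two conditional distributions under discussion is:

```latex
D_{KL}\left(p_{data} \,\|\, p\right) =
\sum_{y} p_{data}(y \mid \bar{x}) \log \frac{p_{data}(y \mid \bar{x})}{p(y \mid \bar{x}; \bar{\theta})}
```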
With a few simple manipulations, we obtain:
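The result of the manipulation is not shown; expanding the logarithm in the definition of DKL yields the standard decomposition into cross-entropy and entropy:

```latex
D_{KL}\left(p_{data} \,\|\, p\right) = H(p_{data}, p) - H(p_{data})
```

As H(pdata) is a constant (it does not depend on the parameters), minimizing DKL is equivalent to minimizing the cross-entropy H(pdata, p).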
Therefore, the resulting cost function corresponds to the cross-entropy between pdata and p up to a constant value (the entropy of the data generating process, which does not depend on the parameters). The training strategy is now based on representing the labels using one-hot encoding (for example, if there are two labels, 0 → (0, 1) and 1 → (1, 0), so that the sum of all elements is always equal to 1) and on using an intrinsically probabilistic output (such as in a logistic regression) or a softmax filter, which transforms M values into a probability distribution.
In both cases, it's clear that the presence of a hidden teacher provides a consistent measure of the error, allowing the model to correct the parameters accordingly. In particular, the second approach is extremely helpful for our purposes, so I recommend studying it further if it's not well known (all the main definitions can also be found in Bonaccorso G., Machine Learning Algorithms, Second Edition, Packt, 2018).
We can now discuss a very basic example of supervised learning, a linear regression model that can be employed to predict the evolution of a simple time series.