Maximum-likelihood learning
We have defined likelihood as a filtering term in the Bayes formula. In general, it has the form of:
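L(h \mid X) = p(X \mid h)

where h is a generic hypothesis and p(X \mid h) is the probability of observing the dataset X under that hypothesis.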
Here the first term expresses the actual likelihood of a hypothesis, given a dataset X. As you can imagine, in this formula there are no more a priori probabilities, so maximizing it doesn't imply accepting a theoretically preferred hypothesis, nor taking into account how unlikely a hypothesis might be a priori. A very common approach, known as expectation-maximization and used in many algorithms (we're going to see an example in logistic regression), is split into two main parts (sketched in general form right after this list):
Determining a log-likelihood expression based on model parameters (they will be optimized accordingly)
Maximizing it until residual error is small enough
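In its standard general form (written here with a latent variable Z, which isn't needed in the simple example that follows), the two steps can be sketched as:

Q(\theta \mid \theta_t) = \mathbb{E}_{Z \mid X, \theta_t}\left[\log p(X, Z \mid \theta)\right] \quad \text{(expectation step)}

\theta_{t+1} = \underset{\theta}{\operatorname{argmax}} \; Q(\theta \mid \theta_t) \quad \text{(maximization step)}

The procedure is iterated until the increase in the log-likelihood (or the change in the parameters) drops below a fixed threshold.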
Working with the log-likelihood (normally denoted by L) is a useful trick that can simplify gradient calculations. A generic likelihood expression is:
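L(h_i \mid X) = \prod_{k=1}^{n} p(x_k \mid h_i)

where the dataset X = \{x_1, x_2, \ldots, x_n\} is assumed to be made up of independent samples.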
As all the parameters are inside h_i, the gradient of this product is a complex expression which isn't very manageable. Our goal is to maximize the likelihood, but it's easier to minimize its reciprocal:
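\hat{h} = \underset{h_i}{\operatorname{argmin}} \; \frac{1}{L(h_i \mid X)} = \underset{h_i}{\operatorname{argmin}} \; \frac{1}{\prod_{k=1}^{n} p(x_k \mid h_i)}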
This can be turned into a very simple expression by applying the natural logarithm (which is a monotonic function):
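\log \frac{1}{L(h_i \mid X)} = -\log L(h_i \mid X) = -\sum_{k=1}^{n} \log p(x_k \mid h_i)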
The last term is a summation which can be easily differentiated and used in most of the optimization algorithms. At the end of this process, we can find a set of parameters which provides the maximum likelihood without any strong statement about prior distributions. This approach can seem very technical, but its logic is really simple and intuitive. To understand how it works, I propose a simple exercise, which is part of the Gaussian mixture technique discussed also in Russell S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson.
Let's consider 100 points drawn from a Gaussian distribution with zero mean and a variance equal to 2.0 (quasi-white noise made of independent samples):
import numpy as np

nb_samples = 100

# The scale parameter of np.random.normal is the standard deviation,
# so scale=sqrt(2.0) yields a variance equal to 2.0
X_data = np.random.normal(loc=0.0, scale=np.sqrt(2.0), size=nb_samples)
The plot is shown next:
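To reproduce it, a few lines of matplotlib are enough (this is only a sketch; the styling of the original figure may differ):

import matplotlib.pyplot as plt

# Scatter the samples against their index to visualize the quasi-white noise
plt.figure(figsize=(10, 5))
plt.plot(X_data, 'o')
plt.xlabel('Sample index')
plt.ylabel('Value')
plt.grid(True)
plt.show()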
In this case, there's no need for a deep exploration (we know how the samples were generated); however, after restricting the hypothesis space to the Gaussian family (the most suitable considering only the graph), we'd like to find the best values for the mean and the variance. First of all, we need to compute the log-likelihood (which is rather simple thanks to the exponential function):
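\log L(\mu, \sigma^2 \mid X) = \sum_{k=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x_k - \mu)^2}{2\sigma^2} \right) \right]

and, expanding the logarithm,

\log L(\mu, \sigma^2 \mid X) = -\frac{n}{2}\log\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{k=1}^{n}(x_k - \mu)^2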
A simple Python implementation is provided next (for ease of use, there's a single parameter array which contains both the mean, at index 0, and the variance, at index 1):
def negative_log_likelihood(v):
    # v[0] is the mean, v[1] is the variance
    l = 0.0
    f1 = 1.0 / np.sqrt(2.0 * np.pi * v[1])
    f2 = 2.0 * v[1]

    for x in X_data:
        # Log of the Gaussian density evaluated at x
        l += np.log(f1 * np.exp(-np.square(x - v[0]) / f2))

    # Return the negated sum so that it can be minimized
    return -l
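An equivalent vectorized implementation (shown here only as an alternative sketch, not as part of the original exercise) works directly on the expanded log-density, so it avoids both the Python loop and the log(exp(·)) round trip:

def negative_log_likelihood_vec(v):
    # v[0] is the mean, v[1] is the variance
    mean, variance = v[0], v[1]

    # Negative Gaussian log-likelihood written in closed form
    return 0.5 * np.sum(np.log(2.0 * np.pi * variance) +
                        np.square(X_data - mean) / variance)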
Then we need to find its minimum (in terms of mean and variance) with any of the available methods (gradient descent or another numerical optimization algorithm). For example, using the SciPy minimize function, we can easily get:
from scipy.optimize import minimize
>>> minimize(fun=negative_log_likelihood, x0=[0.0, 1.0])
fun: 172.33380423827057
hess_inv: array([[ 0.01571807, 0.02658017],
[ 0.02658017, 0.14686427]])
jac: array([ 0.00000000e+00, -1.90734863e-06])
message: 'Optimization terminated successfully.'
nfev: 52
nit: 9
njev: 13
status: 0
success: True
x: array([ 0.04088792, 1.83822255])
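The estimated mean (about 0.04) and variance (about 1.84) are close to the true values of 0.0 and 2.0. As a quick cross-check (a sketch that isn't part of the original exercise), the Gaussian ML estimates are also available in closed form, as the sample mean and the biased sample variance:

mu_ml = np.mean(X_data)    # closed-form ML estimate of the mean
var_ml = np.var(X_data)    # np.var uses ddof=0, which is the biased (ML) estimator

print(mu_ml, var_ml)       # expected to be close to the x array returned by minimize()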
A graph of the negative log-likelihood function is plotted next. The global minimum of this function corresponds to the maximum likelihood under the chosen family of distributions. It doesn't mean that the problem has been completely solved, because the first step of this algorithm is determining an expectation, which must always be realistic. The likelihood function, however, is quite sensitive to wrong distributions because it can easily get close to zero when the probabilities are low. For this reason, maximum-likelihood (ML) learning is often preferable to MAP learning, which needs a priori distributions and can fail when they are not selected in the most appropriate way:
This approach has been applied to a specific distribution family (which is indeed very easy to manage), but it also works perfectly when the model is more complex. Of course, it's always necessary to have an initial idea of how the likelihood should be determined, because more than one feasible family can generate the same dataset. In all these cases, Occam's razor is the best way to proceed: the simplest hypothesis should be considered first. If it doesn't fit, an extra level of complexity can be added to our model. As we'll see, in many situations, the easiest solution is the winning one, and increasing the number of parameters or using a more detailed model can only add noise and increase the possibility of overfitting.