Elements of information theory
A machine learning problem can also be analyzed in terms of information transfer or exchange. Our dataset is composed of n features, which we treat as independent (for simplicity, even though it's often a realistic assumption anyway), each drawn from a different statistical distribution. Therefore, there are n probability density functions pi(x) which must be approximated through n other functions qi(x). In any machine learning task, it's very important to understand how two corresponding distributions diverge and how much information we lose when approximating the original dataset.
The most useful measure is called entropy:
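H(X) = -\sum_{x} p(x) \log_2 p(x)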
This value is proportional to the uncertainty of X and it's measured in bits (if the logarithm has a different base, the unit changes accordingly). For many purposes, a high entropy is preferable, because it means that a certain feature carries more information. For example, in tossing a coin (two equiprobable outcomes), H(X) = 1 bit; if the number of outcomes grows, even when they remain equiprobable, H(X) grows as well, because the larger number of distinct values implies increased variability. It's possible to prove that for a Gaussian distribution (using the natural logarithm):
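H(X) = \frac{1}{2}\left(1 + \ln\left(2\pi\sigma^2\right)\right) = \frac{1}{2}\ln\left(2\pi e \sigma^2\right)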
So, the entropy increases with the variance, which is a measure of the amount of information carried by a single feature. In the next chapter, we're going to discuss a method for feature selection based on a variance threshold. Gaussian distributions are very common, so this example can be considered a general approach to feature filtering: low variance implies a low information level, and a model can often discard all such features.
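As a quick preview (a minimal sketch, assuming scikit-learn is available; the threshold value and the toy data are arbitrary choices for illustration):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy dataset: three Gaussian features with high, medium, and almost-zero variance
rng = np.random.RandomState(1000)
X = np.column_stack([
    rng.normal(0.0, 10.0, size=500),   # high variance -> high entropy
    rng.normal(0.0, 2.0, size=500),    # medium variance
    rng.normal(0.0, 0.05, size=500),   # almost constant -> very little information
])

# Keep only the features whose variance exceeds the (arbitrary) threshold
selector = VarianceThreshold(threshold=1.0)
X_filtered = selector.fit_transform(X)

print(X.shape, X_filtered.shape)   # (500, 3) (500, 2)
print(selector.variances_)         # estimated per-feature variances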
In the following figure, there's a plot of H(X) for a Gaussian distribution, expressed in nats (the corresponding unit of measure when natural logarithms are used):
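Such a plot can be reproduced with a few lines of code (a minimal sketch, assuming Matplotlib; the range of standard deviations is arbitrary):

import numpy as np
import matplotlib.pyplot as plt

# Differential entropy of a Gaussian (in nats) as a function of the standard deviation
sigma = np.linspace(0.1, 10.0, 500)
H = 0.5 * np.log(2.0 * np.pi * np.e * sigma ** 2)

plt.plot(sigma, H)
plt.xlabel('Standard deviation')
plt.ylabel('H(X) (nats)')
plt.grid(True)
plt.show()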
For example, if a dataset is made up of some features whose variance (here it's more convenient to talk about the standard deviation) is bounded between 8 and 10, and a few others with STD < 1.5, the latter could be discarded with a limited loss of information. These concepts are very important in real-life problems, when large datasets must be cleaned and processed efficiently.
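To make the comparison concrete, the differential entropies corresponding to those standard deviations can be computed directly (a small sketch using SciPy; the values 10.0 and 1.5 are taken from the example above):

from scipy.stats import norm

# Differential entropy (in nats) of two Gaussian features
h_high = norm(scale=10.0).entropy()   # STD = 10 -> about 3.72 nats
h_low = norm(scale=1.5).entropy()     # STD = 1.5 -> about 1.82 nats

print(h_high, h_low)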
If we have a target probability distribution p(x), which is approximated by another distribution q(x), a useful measure is the cross-entropy between p and q (we use the discrete definition, as our problems must be solved through numerical computations):
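H(P, Q) = -\sum_{x} p(x) \log_2 q(x)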
With the logarithm in base 2, the cross-entropy measures the number of bits required to encode an event drawn from P when using a code optimized for Q. In many machine learning problems, we have a source distribution and we need to train an estimator to correctly identify the class of a sample. If the error is null, P = Q and the cross-entropy is minimal (corresponding to the entropy H(P)). However, as a null error is almost impossible when working with Q, we need to pay a price of H(P, Q) bits to determine the right class starting from a prediction. Our goal is therefore to minimize the cross-entropy, so as to keep this price below the threshold at which it would alter the predicted output. In other words, think about a binary output and a sigmoid function: we have a threshold of 0.5 (this is the maximum price we can pay) to identify the correct class with a step function (0.6 -> 1, 0.1 -> 0, 0.4999 -> 0, and so on). As our classifier doesn't know the original distribution and cannot pay this price exactly, it's necessary to reduce the cross-entropy below a tolerable noise-robustness threshold (ideally, the smallest achievable one).
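As a concrete illustration (a minimal NumPy sketch; the labels and predicted probabilities are invented for the example), the binary cross-entropy grows as the predictions move closer to the 0.5 threshold, even when the predicted classes stay correct:

import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Average cross-entropy (in bits) between true labels and predicted probabilities
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log2(y_prob) + (1.0 - y_true) * np.log2(1.0 - y_prob))

y_true = np.array([1, 0, 0, 1])

# Confident predictions, well inside the correct side of the 0.5 threshold
good = np.array([0.9, 0.1, 0.2, 0.8])
# Predictions barely on the correct side: same classes, higher price in bits
uncertain = np.array([0.51, 0.49, 0.4999, 0.6])

print(binary_cross_entropy(y_true, good))       # about 0.24 bits
print(binary_cross_entropy(y_true, uncertain))  # about 0.92 bits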
In order to understand how well a machine learning approach is performing, it's also useful to introduce the conditional entropy, which expresses the uncertainty of X given the knowledge of Y:
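H(X|Y) = -\sum_{x,y} p(x, y) \log_2 p(x|y)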
Through this concept, it's possible to introduce the idea of mutual information, which is the amount of information shared by the two variables and, therefore, the reduction in uncertainty about X provided by the knowledge of Y:
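I(X; Y) = H(X) - H(X|Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}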
Intuitively, when X and Y are independent, they don't share any information. However, in machine learning tasks, there's a very tight dependence between an original feature and its prediction, so we want to maximize the information shared by both distributions. If the conditional entropy is small enough (so Y is able to describe X quite well), the mutual information gets close to the marginal entropy H(X), which measures the amount of information we want to learn.
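For discrete variables, the mutual information can be estimated directly from observed samples (a minimal sketch based on scikit-learn's mutual_info_score, which returns values in nats; the binary variables and the 10% noise level are made up for the example):

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.RandomState(1000)

# A binary variable and a noisy "prediction" that agrees with it 90% of the time
x = rng.randint(0, 2, size=10000)
noise = rng.uniform(size=x.shape[0]) < 0.1
y = np.where(noise, 1 - x, x)

print(mutual_info_score(x, x))   # maximum sharing: equals H(X), about 0.693 nats
print(mutual_info_score(x, y))   # high, but reduced by the 10% disagreement
print(mutual_info_score(x, rng.randint(0, 2, size=x.shape[0])))  # close to 0 for an independent variable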
An interesting learning approach based on information theory, called Minimum Description Length (MDL), is discussed in Russell S., Norvig P., Artificial Intelligence: A Modern Approach, Pearson, which I suggest you consult for further information about these topics.