Maximum Likelihood Estimation, Cross Entropy and Deep Learning Network

I have been reading Ian Goodfellow’s Deep Learning book recently, and the 5th chapter (Machine Learning Basics) is really great. Compared to Bishop’s Pattern Recognition and Machine Learning, it includes less mathematics and fewer formulas, which is good for a casual read. Today I want to share the topic of maximum likelihood estimation (MLE), which might not be straightforward to understand.


In principle, MLE is not hard: it’s just a way to measure how well a model explains the observed data, and to pick the parameters that explain it best. It is one of many model estimation approaches. I plan to write another blog post about MLE’s friends, such as Bayesian estimation and maximum a posteriori (MAP) estimation.

Maximum Likelihood Estimation

Its formula is:

\theta_\text{ML}=\arg\max_\theta P(Y|X;\theta)

Assume we have an image classification task: recognize whether an input 256 \times 256 picture is a cat, a dog, or anything else. So the input is a 256 \times 256 matrix (the picture) and the output is a 3-dimensional vector. For example, \{0.1, 0.2, 0.7\} represents the probabilities that the input picture belongs to each of the 3 categories (cat/dog/other).

For this task, what the model needs to learn is a function with parameters $\theta$. The function can take any form, as long as it outputs probabilities for the 3 categories. Our goal is that, for any given input picture, the output is as close as possible to the ideal output, i.e. the model gives the labeled category the highest possible probability. This is the so-called maximum likelihood.

From the model training perspective, we can write the formula in the following form:

\arg\max_\theta \sum_{i=1}^{n} \log P(y^\text{(i)}|x^\text{(i)};\theta)

Here, x^\text{(i)} and y^\text{(i)} represent a picture and its labeled category. Since we need to consider all training samples, we add all the results together. Since each training picture is pre-labeled with a single category, the target probability vectors are one-hot encoded, such as \{0,1,0\}, \{0,0,1\}, etc.
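
As a minimal sketch of the sum above, the snippet below computes the log-likelihood for a tiny, made-up batch. The function name `log_likelihood` and the probability values are illustrative assumptions, not something from the book or from any framework.

```python
import math

def log_likelihood(probs, labels):
    """Sum of log P(y_i | x_i) over all samples.

    probs:  list of predicted probability vectors, one per sample
    labels: list of one-hot label vectors with the same length
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # A one-hot label selects the probability of the labeled category.
        true_class = y.index(1)
        total += math.log(p[true_class])
    return total

# Hypothetical model outputs for two pictures (cat/dog/other).
probs = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
labels = [[0, 0, 1], [1, 0, 0]]
ll = log_likelihood(probs, labels)
```

The closer the model's probabilities for the labeled categories are to 1, the closer the sum gets to its maximum of 0.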

So here comes the problem: how can we measure the difference between the output probability and the real probability? A simple way is to use the Euclidean distance between the two vectors. However, Euclidean distance doesn’t take into account that the vectors are probability distributions. There is an existing tool for this: KL (Kullback-Leibler) divergence. (Assume we want to measure the difference between probability distributions P and Q.)

D_\text{KL}(P||Q)=-\sum_{i} P(i) \log \frac{Q(i)}{P(i)}

In our model training case, it becomes:

D_\text{KL}(Y||\hat{Y})=-\sum_{i} y^\text{(i)} \log \frac {\hat{y}^\text{(i)}}{y^\text{(i)}}

Here the sum runs over the categories: y^\text{(i)} is the pre-labeled probability of category i, and \hat{y}^\text{(i)} is the probability the model outputs for it.

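So the optimization goal is to minimize D_\text{KL}. Because the labels are one-hot, D_\text{KL} reduces to minus the log of the probability the model gives to the labeled category (terms with a zero label count as 0), so minimizing it over all training samples is the same as maximizing the log-likelihood sum above. Many optimization approaches can be used, such as SGD (Stochastic Gradient Descent).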
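
To make the KL formula concrete, here is a small sketch for a single training sample. The helper name `kl_divergence` and the numbers are assumptions for illustration; terms where the label probability is 0 are skipped, following the usual convention that 0 log 0 = 0.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = -sum_i P(i) * log(Q(i) / P(i)).

    Terms with P(i) == 0 contribute nothing (0 * log 0 is taken as 0).
    """
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

y = [0, 0, 1]            # one-hot label: the picture is "other"
y_hat = [0.1, 0.2, 0.7]  # hypothetical model output
d = kl_divergence(y, y_hat)
```

With a one-hot label, only the labeled category contributes, so `d` here is simply `-log(0.7)`.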

Cross Entropy

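I don’t want to talk too much about entropy and cross entropy here. In short, cross entropy is a way to measure the difference between two probability distributions. It is closely tied to KL divergence: the cross entropy between P and Q equals the entropy of P plus D_\text{KL}(P||Q). Since the entropy of the labels does not depend on the model, minimizing one is the same as minimizing the other. This is why, in the machine learning world, minimizing cross entropy and maximum likelihood estimation are treated as synonymous.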
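
The link between cross entropy, KL divergence and likelihood can be checked numerically. The sketch below is only an illustration under assumed values; the function names are hypothetical and the distributions are made up.

```python
import math

def entropy(p):
    # H(P) = -sum_i P(i) log P(i), skipping zero-probability terms.
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_i P(i) log Q(i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_i P(i) log(P(i) / Q(i))
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

one_hot = [0.0, 0.0, 1.0]   # labeled category is index 2
soft = [0.2, 0.3, 0.5]      # a non-one-hot distribution, for comparison
q = [0.1, 0.2, 0.7]         # hypothetical model output

# Cross entropy = entropy + KL divergence, for both label distributions.
gap_one_hot = cross_entropy(one_hot, q) - (entropy(one_hot) + kl_divergence(one_hot, q))
gap_soft = cross_entropy(soft, q) - (entropy(soft) + kl_divergence(soft, q))

# For a one-hot label, cross entropy is the negative log-likelihood.
nll = -math.log(q[2])
```

Both gaps come out as zero up to floating-point error, and for the one-hot label the cross entropy matches `nll`.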

For details, you can read the following articles:

1) Tutorial of information gain by Andrew Moore
2) A Friendly Introduction to Cross-Entropy Loss

Relations to deep learning

So what’s the relationship between MLE and cross entropy? If you have used TensorFlow or similar frameworks before, you will have noticed that a Softmax layer is usually added at the end of a classification network.

The most important property of the Softmax function is that it normalizes arbitrary outputs into a probability vector. Once the model outputs a probability vector, we can use MLE/cross entropy to optimize the parameters.
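
For reference, a plain-Python sketch of Softmax follows. It is not TensorFlow's implementation, just the textbook definition with the common max-subtraction trick for numerical stability. The raw scores are made up.

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued scores into a probability vector."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, -1.0]  # hypothetical raw scores for cat/dog/other
probs = softmax(logits)

# Cross-entropy loss for a picture labeled "cat" (index 0).
loss = -math.log(probs[0])
```

The output is always positive and sums to 1, and the ordering of the scores is preserved.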

TF provides several related methods. They have different usage scenarios and merits, so I suggest taking a look at the documentation in order to use them correctly.

Other references