Maximum Likelihood Estimation, Cross Entropy and Deep Learning Network

Reading Ian Goodfellow’s Deep Learning Book recently, the 5th chapter (Machine Learning Basics) is really great. Comparing to Bishop’s Pattern Recognition and Machine Learning, it includes less mathematics and formulas which is good for a casual read. Today I want to share the topic of maximum likelihood estimation (MLE) which might not be straightforward to be understood.


From the principle, MLE is not a hard stuff: it’s just a way to measure how good or bad a model is. It is among many other different model estimation approach. I plan to write another blog about MLE’s friends, such as Bayesian estimation / Maximum Posterior Estimation, etc.

Maximum Likelihood Estimation

It’s formula is:

\theta_\text{ML}={arg\,max}_\theta P(Y|X;\theta)

Assume we have an image classification task, which is to recognize an input 256 \times 256 picture is a cat, a dog or anything else. So input is a 256 \times 256 matrix (picture) output is a 3d vector. For example, \{0.1, 0,2,0.7\} represents probabilities of input picture to 3 categories (cat/dog/other).

For this task, what the model needs to learn is a function which has parameters $\theta$, the function could be in any form, which can output probabilities to 3 categories. Our goal is, for any given input picture, output value should be as close as ideal. This is the so-called maximum likelihood.

From model training perspective, we can write formula in following form:

{arg\,max}_\theta \sum_{i=1}^{n} log P(y^\text{(i)}|x^\text{(i)};\theta)

Here,x^\text{(i)} and y^\text{(i)} represents picture and its labeled category. Since we need to consider all training samples, so we added all results together. Since training pictures are pre-labeled to single category, so training output probability vectors are one hot encoding probability vector. Such as \{0, 1,0\}, \{0,0,1\}, etc.

So here comes the problem, how we can measure difference of output probably to real probability? A simple way is to use Euclidean distance between two vectors. However Euclidean doesn’t understand probability, here’s an existing tool: KL (Kullback-Leibler) divergence. (Assume we want to understand difference from probability distribution P to Q.)

D_\text{KL}(P||Q)=-\sum_{i} P(i) log \frac{Q(i)}{P(i)}

In our model training case, it becomes:
D_\text{KL}(Y||\hat{Y})=-\sum_{i} y^\text{(i)} log \frac {\hat{y}^\text{(i)}}{y^\text{(i)}}

Here y^\text{(i)} means pre-labeled output,and \hat{y}^\text{(i)} means output from model.

So optimization goal is to minimize D_\text{KL}. Many optimization approaches can be used, such as SGD (Stochastic Gradient Descent).

Cross Entropy

I don’t want to talk too much about entropy and cross entropy here. In short, cross entropy is a way to calculate distances between two functions or probability distributions. Similarly, it uses KL divergences. In machine learning world, cross entropy and maximum likelihood estimation are synonymic to each other.

For details, you can find following articles to read:

1) Tutorial of information gain by Andrew Moore
2) A Friendly Introduction to Cross-Entropy Loss

Relations to deep learning

So what’s the relationship between MLE and cross entropy? If you have used Tensorflow or similar frameworks before. You can find at the end of the network construction, a Softmax layer will be added.

The most important of Softmax function is: it can normalize whatever outputs to probabilities vector. Once probability vector output by model, we can use MLE/cross-entropy to optimize parameters.

In TF, there’re several related methods:

They have different usage scenarios and merits, I suggest to take a look at the documentation in order to use them correctly.

Other references


Deep understand locality in CapacityScheduler and how to control it.

Locality settings in CapacityScheduler looks straightforward but it is not so simple in reality.

We have two level of locality delays: node->rack (delay1), rack->off-switch (delay2).  As of now, delay in CapacityScheduler is based on missed-opportunity (abbr. MO, how many skipped node allocation) instead of wall clock time.

Let’s start with an example. We have a cluster with 2 racks, rack1 has node1-node20, rack2 has node21-40. And we have an app_1 requests 3  * rack1. Setting of delay1 (node-locality-delay) is 5, and this is a global setting which applies to all queues/apps. There’s no setting for delay2, YARN computes delay2 automatically (This is trickier, will explain later).

Once app start running, YARN will look at nodes one-by-one to allocate resources for the app.

Let’s mimic what happened. In the beginning, app_1’s missed-opportunity is 0, and assume YARN start with node1.

Node1: (rack-local to app_1’s request), skip, and increase missed_opportunity to 1.
Node2: (rack-local, skip, MO=2)
Node3: (rack-local, skip, MO=3)
Node4: (rack-local, skip, MO=4)
Node5: (rack-local, skip, MO=5)
Node6: (rack-local, since MO=5>delay1, allocate 1st container, and reset MO to 0 [1])
Node7: (rack-local, skip, MO=1)
Node8: (rack-local, skip, MO=2)
Node9: (rack-local, skip, MO=3)
Node10: (rack-local, skip, MO=4)
Node11: (rack-local, skip, MO=5)
Node12: (rack-local, since MO=5>delay1, allocate 2nd container, and reset MO to 0 [1])

Once rack-locality-full-reset set to false, allocation becomes:

Node1: (rack-local to app_1’s request), skip, and increase missed_opportunity to 1.
Node2: (rack-local, skip, MO=2)
Node3: (rack-local, skip, MO=3)
Node4: (rack-local, skip, MO=4)
Node5: (rack-local, skip, MO=5)
Node6: (rack-local, since MO=5>delay1, allocate 1st container, and reset MO to delay1)
             …. could allocate many containers depends on node’s availability.
Node7: (rack-local, allocate for app_1 since MO still >= delay1)
Node8: …

When delay2 comes to the picture, it becomes tricker:

delay2 = min(#cluster-node, (#requested-unique-hosts / #cluster-node) * #requested-containers)

So what the hell this means? First of all, delay2 won’t be exceed total nodes in the cluster you have. And the more unique host you requested, and the more #container you requested, you will expect to see longer delays. We can have [2] to control this behavior.

In addition to above two configs, the maximum number of containers can be allocated in each node allocation can be specified. See [3]

And there’s an open design to introduce better locality scheduling to YARN, see Capacity Scheduler : Improve location preference waiting mechanism


This resetting of MO after rack-local allocation can be controlled by: yarn.scheduler.capacity.rack-locality-full-reset, by default it is true (reset to 0). Once the rack-locality-full-reset set to false, YARN can continue allocate rack-local containers without waiting MO come back to delay1.

(Available since Apache Hadoop 2.8.0)


By setting yarn.scheduler.capacity.rack-locality-additional-delay > 0, we can control delay2 in the cluster level. Here’s description of the parameter:

      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.

(Available since Apache Hadoop 2.9.0/3.0.0)

– Whether to allow multiple container assignments in one NodeManager heartbeat. Defaults to true.


– If `multiple-assignments-enabled` is true, the maximum amount of containers that can be assigned in one NodeManager heartbeat. Defaults to -1, which sets no limit.


– If `multiple-assignments-enabled` is true, the maximum amount of off-switch containers that can be assigned in one NodeManager heartbeat. Defaults to 1, which represents only one off-switch allocation allowed in one heartbeat.