Highlights of Tensorflow Dev Summit 2018 from an Infra Engineer's Perspective

I quickly watched the talks related to infra; here are the highlights that caught my personal interest.

1) Tensorflow Hub.

Link: https://www.tensorflow.org/hub/

This is definitely a must-have piece in Tensorflow. Existing TF training is based on a copy-paste model: assume you're a machine learning engineer who wants to use a pre-trained Inception V3 network. You have to write lots of glue code to download the model, wire your network to the outputs of the pre-trained one, and so on.

As transfer learning becomes more and more popular, deep neural networks can be reused as inputs to other networks. The core concept of Tensorflow Hub is the Module. A module is essentially a group of functions (called signatures), and downstream modules / Tensorflow programs can decide which module to use.

In addition to using pre-trained modules as black boxes (like using DLLs or jars), engineers can fine-tune modules by feeding in their own data, which can save tens of thousands of GPU hours compared to training all parameters from scratch (all $$$).
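A rough sketch of what this looks like with the 2018-era hub.Module API; the module handle below is illustrative, so check tfhub.dev for the actual Inception V3 feature-vector module:

import tensorflow as tf
import tensorflow_hub as hub

# Illustrative module handle; look up the real Inception V3 feature-vector module on tfhub.dev.
MODULE_URL = "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1"

# trainable=True allows fine-tuning the module's weights with your own data.
module = hub.Module(MODULE_URL, trainable=True)

images = tf.placeholder(tf.float32, shape=[None, 299, 299, 3])
features = module(images)                      # reuse the pre-trained network as a layer
logits = tf.layers.dense(features, units=3)    # small head for your own 3-class task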

2) Distributed learning

https://www.youtube.com/watch?v=-h0cWBiQ8s8&list=PLQY2H8rRoyvxjVx3zfw4vA4cvlKogyLNN&index=7

This year, the TF team announced the new DistributionStrategy API. Per my understanding, it introduces a way to better control the execution of distributed tasks, so that a distributed program doesn't need to be aware of the underlying distributed communication implementation.

If you know MPI (Message Passing Interface), you can better understand what it is trying to achieve: it looks like MPI for Tensorflow. DistributionStrategy is still in an early phase; as of now, it only supports a single node with multiple GPU cards. I'm not sure how popular it will become since it is a lower-level API. Many TF users are still using the Estimator's train_and_evaluate since it is more straightforward and can support multi-node training.
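A minimal sketch of how this looked around the time of the summit, when the strategy still lived under tf.contrib; the dataset and model below are made-up toys, not anything from the talk:

import tensorflow as tf

def my_input_fn():
    # Hypothetical toy dataset: 8 random feature rows with integer labels.
    features = tf.random_uniform([8, 10])
    labels = tf.random_uniform([8], maxval=3, dtype=tf.int32)
    return tf.data.Dataset.from_tensor_slices((features, labels)).repeat().batch(4)

def my_model_fn(features, labels, mode):
    # Hypothetical tiny model; nothing here is aware of the communication layer.
    logits = tf.layers.dense(features, units=3)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_or_create_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

# MirroredStrategy replicates the model onto each local GPU and
# aggregates gradients with an all-reduce, MPI-style.
strategy = tf.contrib.distribute.MirroredStrategy()
config = tf.estimator.RunConfig(train_distribute=strategy)

estimator = tf.estimator.Estimator(model_fn=my_model_fn, config=config)
estimator.train(input_fn=my_input_fn, steps=10)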

3) Preferred High-level APIs:

TF recommends using the following high-level APIs; the keynote showed them as screenshots, so you can get an overview there and dig into the details if you're interested.

(Screenshots captured from the keynote, 9min-12min; not reproduced here.)

4) Eager execution.

Video: https://www.youtube.com/watch?v=T8AW0fKP0Hs&list=PLQY2H8rRoyvxjVx3zfw4vA4cvlKogyLNN&index=3

This is a really important piece for converting users from communities like PyTorch over to TF. Prior to eager execution, Tensorflow only supported static graphs, which are harder to debug and harder to construct: you have to write a lot of boilerplate code.

If you have tried to debug a TF program before, you will know how unfriendly the static graph is for debugging: all variables are only evaluated during training execution, and training execution is basically done inside Tensorflow's black boxes, so it's not simple to add breakpoints in your Python IDE.

With the brand-new eager execution, you can write a program like this:

import tensorflow as tf
tf.enable_eager_execution()   # must be called at program startup (TF 1.x)

tf.executing_eagerly()        # => True
x = [[2.]]
m = tf.matmul(x, x)
print("hello, {}".format(m))  # => "hello, [[4.]]"

And use your favorite IDE to debug your program.

5) Tensorflow.js.

I think this is the biggest announcement of the Tensorflow Dev Summit this year.

It looks useful for the TF community, but not so useful for our use case (from an enterprise customer's point of view): it trains the model using browser resources and serves the model using browser resources as well, which means it cannot support larger models or datasets.

Enterprise use cases rely on a cluster to handle models that are too big for a less powerful machine.

Maximum Likelihood Estimation, Cross Entropy and Deep Learning Networks

I have been reading Ian Goodfellow's Deep Learning book recently, and the 5th chapter (Machine Learning Basics) is really great. Compared to Bishop's Pattern Recognition and Machine Learning, it includes less mathematics and fewer formulas, which makes it good for a casual read. Today I want to share the topic of maximum likelihood estimation (MLE), which might not be straightforward to understand.


In principle, MLE is not hard stuff: it's just a way to measure how good or bad a model is. It is one among many model estimation approaches. I plan to write another blog about MLE's friends, such as Bayesian estimation and Maximum a Posteriori (MAP) estimation.

Maximum Likelihood Estimation

Its formula is:

\theta_\text{ML} = \arg\max_\theta P(Y|X;\theta)

Assume we have an image classification task: recognize whether an input 256 \times 256 picture is a cat, a dog, or anything else. So the input is a 256 \times 256 matrix (the picture) and the output is a 3-dimensional vector. For example, \{0.1, 0.2, 0.7\} represents the probabilities of the input picture belonging to the 3 categories (cat/dog/other).

For this task, what the model needs to learn is a function with parameters \theta. The function could take any form, as long as it outputs probabilities for the 3 categories. Our goal is that, for any given input picture, the output should be as close to the ideal one as possible. This is the so-called maximum likelihood.

From the model training perspective, we can write the formula in the following form:

\arg\max_\theta \sum_{i=1}^{n} \log P(y^\text{(i)}|x^\text{(i)};\theta)

Here, x^\text{(i)} and y^\text{(i)} represent a picture and its labeled category. Since we need to consider all training samples, we add all the results together. Since training pictures are pre-labeled with a single category, the training output probability vectors are one-hot probability vectors, such as \{0, 1, 0\}, \{0, 0, 1\}, etc.

So here comes the problem: how can we measure the difference between the output probability and the real probability? A simple way is to use the Euclidean distance between the two vectors. However, Euclidean distance doesn't understand probability; there's an existing tool for this: KL (Kullback-Leibler) divergence. (Assume we want to measure the difference from probability distribution P to Q.)

D_\text{KL}(P||Q) = -\sum_{i} P(i) \log \frac{Q(i)}{P(i)}

In our model training case, it becomes:
D_\text{KL}(Y||\hat{Y}) = -\sum_{i} y^\text{(i)} \log \frac{\hat{y}^\text{(i)}}{y^\text{(i)}}

Here y^\text{(i)} means the pre-labeled output, and \hat{y}^\text{(i)} means the output from the model.

So the optimization goal is to minimize D_\text{KL}. Many optimization approaches can be used, such as SGD (Stochastic Gradient Descent).
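As a tiny numeric check (plain NumPy, with a made-up prediction), the KL divergence against a one-hot label reduces to -\log of the predicted probability of the true class:

import numpy as np

# Made-up single training example whose true class is "dog" (index 1).
y = np.array([0.0, 1.0, 0.0])        # one-hot label
y_hat = np.array([0.1, 0.7, 0.2])    # model's predicted probabilities

# Cross entropy H(y, y_hat) = -sum_i y_i * log(y_hat_i)
cross_entropy = -np.sum(y * np.log(y_hat))

# KL divergence D_KL(y || y_hat); terms with y_i == 0 contribute nothing,
# so only the true class is left for a one-hot label.
mask = y > 0
kl = -np.sum(y[mask] * np.log(y_hat[mask] / y[mask]))

print(cross_entropy, kl)             # both ~= 0.357 == -log(0.7)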

Cross Entropy

I don't want to talk too much about entropy and cross entropy here. In short, cross entropy is a way to calculate the distance between two functions or probability distributions, and it is closely related to the KL divergence. In the machine learning world, minimizing cross entropy and maximum likelihood estimation are synonymous with each other.
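One way to see the equivalence: cross entropy decomposes into the entropy of the true distribution plus the KL divergence,

H(P,Q) = -\sum_{i} P(i) \log Q(i) = H(P) + D_\text{KL}(P||Q)

and since H(P) is fixed by the labels and does not depend on the model parameters, minimizing cross entropy, minimizing D_\text{KL}, and maximizing the log-likelihood all select the same \theta.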

For details, you can read the following articles:

1) Tutorial of information gain by Andrew Moore
2) A Friendly Introduction to Cross-Entropy Loss

Relations to deep learning

So what's the relationship between MLE / cross entropy and deep learning? If you have used Tensorflow or similar frameworks before, you may have noticed that at the end of the network construction, a Softmax layer is added.

The most important property of the Softmax function is that it normalizes arbitrary outputs into a probability vector. Once the model outputs a probability vector, we can use MLE / cross entropy to optimize the parameters.
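As a toy illustration of that normalization (plain NumPy, not the actual TF kernel):

import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / np.sum(e)

logits = np.array([2.0, 1.0, -1.0])   # arbitrary real-valued network outputs
probs = softmax(logits)
print(probs, probs.sum())             # roughly [0.705 0.259 0.035], sums to 1.0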

In TF, there are several related methods:
softmax
log_softmax
sigmoid_cross_entropy_with_logits
sparse_softmax_cross_entropy_with_logits

They have different usage scenarios and merits; I suggest taking a look at the documentation in order to use them correctly.
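For example, here is a minimal sketch (TF 1.x eager style, with made-up logits and labels) showing that the fused op gives the same result as a manual softmax followed by -\log of the true-class probability:

import tensorflow as tf

tf.enable_eager_execution()                # TF 1.x; lets us print results directly

logits = tf.constant([[2.0, 1.0, -1.0]])   # made-up network outputs for one example
labels = tf.constant([2])                  # made-up true class index ("other")

# Manual route: softmax, then -log of the true class probability.
probs = tf.nn.softmax(logits)
manual_loss = -tf.log(probs[0, 2])

# Fused op: numerically more stable, and what you would normally use.
fused_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)

print(manual_loss.numpy(), fused_loss[0].numpy())   # both ~= 3.35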

Other references

1) https://www.autonlab.org/_media/tutorials/mle13.pdf
2) http://www.mi.fu-berlin.de/wiki/pub/ABI/Genomics12/MLvsMAP.pdf