Highlights of TensorFlow Dev Summit 2018 from an Infra Engineer's Perspective

I quickly watched the talks related to infrastructure; here are the highlights that interested me most.

1) TensorFlow Hub

Link: https://www.tensorflow.org/hub/

This is definitely a must-have piece in TensorFlow. Existing TF training is largely based on a copy-paste model: assume you're a machine learning engineer who wants to use a pre-trained Inception V3 network. You have to write lots of glue code to download the model, wire your network to the outputs of the pre-trained model, and so on.

Transfer learning is becoming more and more popular, and deep neural networks can be reused as inputs to other networks. The core concept of TensorFlow Hub is the Module. A module is essentially a reusable piece of a network exposing a group of named functions (signatures), and downstream modules or TensorFlow programs can decide which module and which signature to use.

In addition to using pre-trained modules as black boxes (much like DLLs or JARs), engineers can fine-tune modules on their own data, which can save tens of thousands of GPU hours compared to training all parameters from scratch (that's all $$$).
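To give a flavor of the API, loading and fine-tuning a published image module looks roughly like the sketch below (a minimal sketch against the TF 1.x / tensorflow_hub API of that time; the module URL, the 299x299 input placeholder, and the 3-class head are illustrative assumptions, not taken from the talk):

import tensorflow as tf
import tensorflow_hub as hub

# Pull a published Inception V3 feature-vector module from tfhub.dev
# (URL is illustrative); trainable=True allows fine-tuning its weights.
module = hub.Module(
    "https://tfhub.dev/google/imagenet/inception_v3/feature_vector/1",
    trainable=True)

images = tf.placeholder(tf.float32, [None, 299, 299, 3])  # hypothetical input batch
features = module(images)              # reuse the pre-trained network as a layer
logits = tf.layers.dense(features, 3)  # add our own small classification head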

2) Distributed learning

https://www.youtube.com/watch?v=-h0cWBiQ8s8&list=PLQY2H8rRoyvxjVx3zfw4vA4cvlKogyLNN&index=7

This year, the TF team announced the new DistributionStrategy API. Per my understanding, it introduces a way to better control the execution of distributed tasks while letting a distributed program stay unaware of the underlying distributed communication implementation.

If you know MPI (Message Passing Interface), it's easier to understand what this is trying to achieve: it looks like MPI for TensorFlow. DistributionStrategy is still in an early phase; as of now it only supports a single node with multiple GPU cards. I'm not sure how popular it will become since it is a lower-level API; many TF users still use the Estimator's train_and_evaluate because it is more straightforward and supports multi-node training.
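For a rough idea of how it plugs into Estimators, here is a minimal sketch based on the TF 1.8-era contrib API (model_fn and train_input_fn are placeholder functions you would supply yourself):

import tensorflow as tf

# Mirror the model's variables across all GPUs on a single machine -- the only
# mode supported at the time (the API lived under tf.contrib then).
strategy = tf.contrib.distribute.MirroredStrategy()

# Hand the strategy to the Estimator via RunConfig; the rest of the training
# code stays unaware of how values are exchanged between the GPUs.
config = tf.estimator.RunConfig(train_distribute=strategy)
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
estimator.train(input_fn=train_input_fn)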

3) Preferred High-level APIs:

TF recommends a set of preferred high-level APIs; the keynote walks through them at around the 9 to 12 minute mark, so check the slides there if you want an overview and to dig into the details.


4) Eager execution

Video: https://www.youtube.com/watch?v=T8AW0fKP0Hs&list=PLQY2H8rRoyvxjVx3zfw4vA4cvlKogyLNN&index=3

This is a really important piece for converting users from communities like PyTorch over to TF. Prior to eager execution, TensorFlow only supported static graphs, which are harder to debug and harder to construct: you have to write a lot of boilerplate code.

If you have tried to debug a TF program before, you know how unfriendly the static graph is for debugging: all variables are only evaluated when training runs, and that execution happens inside TensorFlow's black box, so it's not simple to add breakpoints in your Python IDE.

With the brand new eager execution, you can write a program like:

import tensorflow as tf
tf.enable_eager_execution()   # must be called once at program startup (TF 1.x)

tf.executing_eagerly()        # => True
x = [[2.]]
m = tf.matmul(x, x)           # runs immediately, no session needed
print("hello, {}".format(m))  # => "hello, [[4.]]"

And use your favorite IDE to debug your program.

5) TensorFlow.js

I think this is the biggest announcement of the TensorFlow Dev Summit this year.

It looks useful for the TF community, but not that useful for our use case (from an enterprise customer's point of view): it both trains and serves models using the browser's resources, which means it cannot support larger models or datasets.

Enterprise use cases rely on a cluster to handle models that are too heavy for a less powerful machine.

Maximum Likelihood Estimation, Cross Entropy, and Deep Learning Networks

I have been reading Ian Goodfellow's Deep Learning book recently, and the 5th chapter (Machine Learning Basics) is really great. Compared to Bishop's Pattern Recognition and Machine Learning, it includes less mathematics and fewer formulas, which makes it good for a casual read. Today I want to share the topic of maximum likelihood estimation (MLE), which might not be straightforward to understand at first.


In principle, MLE is not hard stuff: it's just a way to measure how good or bad a choice of model parameters is. It is one of many model estimation approaches; I plan to write another blog about MLE's friends, such as Bayesian estimation and maximum a posteriori (MAP) estimation.

Maximum Likelihood Estimation

Its formula is:

\theta_\text{ML}={arg\,max}_\theta P(Y|X;\theta)

Assume we have an image classification task: recognize whether an input 256 \times 256 picture is a cat, a dog, or anything else. The input is a 256 \times 256 matrix (the picture) and the output is a 3-dimensional vector. For example, \{0.1, 0.2, 0.7\} represents the probabilities that the input picture belongs to each of the 3 categories (cat/dog/other).

For this task, the model needs to learn a function with parameters \theta (the function could take any form) that outputs probabilities for the 3 categories. Our goal is that, for any given input picture, the output should be as close to the ideal (labeled) distribution as possible. This is what maximizing the likelihood achieves.

From the model training perspective, we can write the formula in the following form:

{arg\,max}_\theta \sum_{i=1}^{n} log P(y^\text{(i)}|x^\text{(i)};\theta)

Here, x^\text{(i)} and y^\text{(i)} represent a picture and its labeled category. Since we need to consider all training samples, we sum the log-probabilities together. Because training pictures are pre-labeled with a single category, the target probability vectors are one-hot vectors such as \{0,1,0\}, \{0,0,1\}, etc.
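As a tiny made-up numeric illustration of the sum above (the probabilities are invented, not from a real model):

import numpy as np

# Hypothetical model outputs for 3 training pictures (rows) over 3 categories.
probs = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1],
                  [0.2, 0.5, 0.3]])
labels = np.array([2, 0, 1])   # one-hot targets {0,0,1}, {1,0,0}, {0,1,0}

# Sum of log P(y^(i) | x^(i); theta) over the training samples.
log_likelihood = np.sum(np.log(probs[np.arange(len(labels)), labels]))
print(log_likelihood)          # log(0.7) + log(0.8) + log(0.5) ≈ -1.27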

So here comes the problem: how can we measure the difference between the output probabilities and the real (labeled) probabilities? A simple way would be the Euclidean distance between the two vectors, but Euclidean distance doesn't understand probability. There's an existing tool for this: the KL (Kullback-Leibler) divergence. (Assume we want to measure the difference from probability distribution P to Q.)

D_\text{KL}(P||Q)=-\sum_{i} P(i) log \frac{Q(i)}{P(i)}

In our model training case, it becomes:
D_\text{KL}(Y||\hat{Y})=-\sum_{i} y^\text{(i)} log \frac {\hat{y}^\text{(i)}}{y^\text{(i)}}

Here y^\text{(i)} means the pre-labeled output, and \hat{y}^\text{(i)} means the output from the model.

So the optimization goal is to minimize D_\text{KL}. Many optimization approaches can be used, such as SGD (stochastic gradient descent).
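One step worth spelling out: because each y^\text{(i)} is one-hot, every term of the KL sum vanishes except the labeled category c, so for a single sample

D_\text{KL}(y^\text{(i)}||\hat{y}^\text{(i)}) = -1 \cdot log \frac{\hat{y}_c^\text{(i)}}{1} = -log \hat{y}_c^\text{(i)}

which is exactly the negative log-likelihood of the labeled category. Minimizing the KL divergence over the training set is therefore the same as maximizing the MLE objective above.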

Cross Entropy

I don't want to talk too much about entropy and cross entropy here. In short, cross entropy is another way to measure the distance between two probability distributions, and it is closely related to the KL divergence. In the machine learning world, minimizing cross entropy and maximum likelihood estimation are essentially synonymous.
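The precise relationship is the standard identity

H(P,Q) = H(P) + D_\text{KL}(P||Q) = -\sum_{i} P(i) log Q(i)

Since the entropy H(P) of the labels does not depend on the model parameters, minimizing the cross entropy H(P,Q) and minimizing D_\text{KL}(P||Q) pick the same \theta, which is also the maximum likelihood solution.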

For details, the following articles are worth reading:

1) Tutorial on information gain by Andrew Moore
2) A Friendly Introduction to Cross-Entropy Loss

Relations to deep learning

So what's the relationship between MLE, cross entropy, and deep learning networks? If you have used TensorFlow or similar frameworks before, you will notice that a softmax layer is usually added at the end of the network construction.

The most important property of the softmax function is that it normalizes arbitrary outputs (logits) into a probability vector. Once the model outputs a probability vector, we can use MLE/cross-entropy to optimize its parameters.

In TF, there are several related methods:
softmax
log_softmax
sigmoid_cross_entropy_with_logits
sparse_softmax_cross_entropy_with_logits

They have different usage scenarios and merits; I suggest taking a look at the documentation in order to use them correctly. A minimal example is sketched below.
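For instance, the common pattern with integer class labels looks roughly like this (a small sketch in TF 1.x style; the logits and labels are made-up values):

import tensorflow as tf

logits = tf.constant([[2.0, 0.5, 0.1],
                      [0.3, 3.0, 0.2]])   # raw, unnormalized model outputs
labels = tf.constant([0, 1])              # integer class ids (not one-hot)

# Applies softmax internally, then computes -log(softmax(logits)[label]) per row.
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
loss = tf.reduce_mean(losses)

with tf.Session() as sess:
    print(sess.run(loss))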

Other references

1) https://www.autonlab.org/_media/tutorials/mle13.pdf
2) http://www.mi.fu-berlin.de/wiki/pub/ABI/Genomics12/MLvsMAP.pdf

Deep understanding of locality in CapacityScheduler and how to control it

Locality settings in CapacityScheduler look straightforward, but they are not so simple in reality.

We have two levels of locality delay: node->rack (delay1) and rack->off-switch (delay2). As of now, delay in CapacityScheduler is based on missed opportunities (abbr. MO, the number of skipped node allocations) instead of wall-clock time.

Let's start with an example. We have a cluster with 2 racks: rack1 has node1-node20, and rack2 has node21-node40. App_1 requests 3 containers on rack1. delay1 (node-locality-delay) is set to 5, and this is a global setting which applies to all queues/apps. There's no explicit setting for delay2; YARN computes delay2 automatically (this is trickier, and I'll explain it later).

Once the app starts running, YARN looks at nodes one by one to allocate resources for it.

Let's walk through what happens. In the beginning, app_1's missed-opportunity count is 0; assume YARN starts with node1.

Node1: (rack-local to app_1’s request), skip, and increase missed_opportunity to 1.
Node2: (rack-local, skip, MO=2)
Node3: (rack-local, skip, MO=3)
Node4: (rack-local, skip, MO=4)
Node5: (rack-local, skip, MO=5)
Node6: (rack-local, since MO=5 >= delay1, allocate 1st container, and reset MO to 0 [1])
Node7: (rack-local, skip, MO=1)
Node8: (rack-local, skip, MO=2)
Node9: (rack-local, skip, MO=3)
Node10: (rack-local, skip, MO=4)
Node11: (rack-local, skip, MO=5)
Node12: (rack-local, since MO=5 >= delay1, allocate 2nd container, and reset MO to 0 [1])
….

Once rack-locality-full-reset is set to false, the allocation becomes:

Node1: (rack-local to app_1’s request), skip, and increase missed_opportunity to 1.
Node2: (rack-local, skip, MO=2)
Node3: (rack-local, skip, MO=3)
Node4: (rack-local, skip, MO=4)
Node5: (rack-local, skip, MO=5)
Node6: (rack-local, since MO=5 >= delay1, allocate 1st container, and reset MO to delay1)
             …. (could allocate more containers here, depending on the node's availability)
Node7: (rack-local, allocate for app_1 since MO still >= delay1)
Node8: …
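Putting the two walkthroughs together, here is a simplified model of the missed-opportunity bookkeeping (plain Python mirroring the behavior described above, not the actual CapacityScheduler Java code; queue limits, reservations, and off-switch handling are ignored):

# full_reset mirrors yarn.scheduler.capacity.rack-locality-full-reset.
def try_rack_local_allocation(mo, node_locality_delay, full_reset=True):
    """Return (allocated, new_mo) for one rack-local node visit."""
    if mo >= node_locality_delay:
        # Enough node-local opportunities were skipped: allocate rack-local.
        return True, (0 if full_reset else node_locality_delay)
    # Still hoping for a node-local allocation: skip this node.
    return False, mo + 1

mo = 0
for node in range(1, 13):                              # node1 .. node12
    allocated, mo = try_rack_local_allocation(mo, node_locality_delay=5)
    print("node%d: allocated=%s, MO=%d" % (node, allocated, mo))

With full_reset=True this allocates at node6 and node12, exactly as in the first walkthrough; with full_reset=False the delay is paid only once and every later rack-local node can allocate.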

When delay2 comes into the picture, it becomes trickier:

delay2 = min(#cluster-node, (#requested-unique-hosts / #cluster-node) * #requested-containers)

So what the hell does this mean? First of all, delay2 won't exceed the total number of nodes in the cluster. And the more unique hosts and the more containers you request, the longer the delay you can expect. You can use [2] to control this behavior.
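To make the formula concrete, here it is in plain Python with made-up numbers (again just a model of the described behavior, not the YARN source):

def off_switch_delay(cluster_nodes, requested_unique_hosts, requested_containers):
    """delay2 = min(#cluster-nodes, (#unique-hosts / #cluster-nodes) * #containers)."""
    return min(cluster_nodes,
               (requested_unique_hosts / cluster_nodes) * requested_containers)

# A small request spread over 20 of 40 nodes goes off-switch almost immediately:
print(off_switch_delay(40, 20, 3))     # => 1.5 missed opportunities
# A large request touching every node waits the maximum (the cluster size):
print(off_switch_delay(40, 40, 200))   # => 40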

In addition to the above two configs, the maximum number of containers that can be allocated in one node heartbeat can also be specified. See [3].

And there's an open design to introduce better locality scheduling to YARN; see https://issues.apache.org/jira/browse/YARN-4189 (Capacity Scheduler: Improve location preference waiting mechanism).

[1]

The resetting of MO after a rack-local allocation can be controlled by yarn.scheduler.capacity.rack-locality-full-reset; by default it is true (reset MO to 0). When rack-locality-full-reset is set to false, YARN can continue allocating rack-local containers without waiting for MO to climb back to delay1.

(Available since Apache Hadoop 2.8.0)

[2]

By setting yarn.scheduler.capacity.rack-locality-additional-delay > 0, we can control delay2 at the cluster level. Here's the description of the parameter:

      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.

(Available since Apache Hadoop 2.9.0/3.0.0)

[3]
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled:
– Whether to allow multiple container assignments in one NodeManager heartbeat. Defaults to true.

 

yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments
– If `multiple-assignments-enabled` is true, the maximum amount of containers that can be assigned in one NodeManager heartbeat. Defaults to -1, which sets no limit.

 

yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments
– If `multiple-assignments-enabled` is true, the maximum amount of off-switch containers that can be assigned in one NodeManager heartbeat. Defaults to 1, which represents only one off-switch allocation allowed in one heartbeat.

An updated feature comparison between Capacity Scheduler and Fair Scheduler


Introduction

As every Hadoop YARN user knows, YARN has two schedulers: the Fair Scheduler and the Capacity Scheduler. (Actually there's a third scheduler, the FIFO scheduler, but it is not widely adopted.) "Which scheduler should I use?" is one of the most common questions asked by YARN users.

I wrote this blog post to help you understand the latest feature-wise comparison between the two schedulers. Hopefully it will leave you with fewer doubts when choosing between them.


Suggestions About How To Better Use YARN Node Label


Introduction

Node label is an attractive YARN feature that has been available since Apache Hadoop 2.6, and it can solve problems in many different scenarios. However, judging from Hadoop JIRAs and mailing lists, many users run into issues setting up and using node labels.

As a major designer and maintainer of this feature, I highly recommend reading this blog post if you're using or planning to use it. It could save you hours of troubleshooting time, which you can spend on more meaningful stuff, like getting drunk. 🙂