Node label is an attractive feature of YARN, which is available since Apache Hadoop 2.6. It can solve problems in different scenarios. However, from Hadoop JIRA and mail lists, many users encounter issues to setup and use node label.
As major designer and maintainer of this feature, I would highly recommend you to read this blog post if you’re using or plan to use this feature. It could save you hours of time used to troubleshoot issues. Instead, you do more meaningful stuff, like getting drunk. 🙂
Before reading this blog post, It’s better to read/go through following manuals. This blog post is not intended to be replacement of existing manuals/docs:
- Apache Hadoop Node Label Doc
- Tutorial of Using Node Labels
- Slides of Node Label in Hadoop Summit 15′
Here’re a couple of explanations/suggestions to let you better use this feature.
Explanations and Suggestions
#1: Basic philosophy: Node partition or node constraint
As of this writing, node label feature in Apache Hadoop YARN is node partition.
Literally, node partition is a way to “partition” a cluster, which essentially cut a big cluster into several disjoint sub-clusters. Sub-cluster can overwrite some global settings, majorly includes:
- Access-control-list: user can access one partition but not permitted to access another.
- Queue capacity: admin can assign more resource of a specific partition to specific queues. For example, “data scientist” queue can have more shares of “GPU” partition since it may need to run more computation-intensive jobs.
Since each node partition is a sub-cluster, resource scheduler also enforces/accounts many properties like:
- User limit of leaf queues under each partition
- Maximum number of resources that can be used to launch application master
- Per-partition resource usage of queues, apps and users.
- Preemption can also happen to enforce per-partition limits.
In contrast to node partition, node constraints (YARN-3409) is a simple way to tag nodes. They’re used to describe attributes of a node, such as type of OS/kernel version/CPU architecture/JDK version. We don’t need to do any capacity enforcement/access control for node constraints: one sample is, it makes no sense to prohibit one user to use a node with JDK9 installed.
Even though DockerContainerExecutor feature can cut down lots of requirements for node constraints – Docker image contains all dependencies/configurations that an app wants. The feature is still highly requested by the community. We consider to implement it together with the new resource request API definition (YARN-4902) if possible.
#2: Default node partition
Default node partition is set of nodes which aren’t assigned node label, please keep in mind that every node belongs to one partition. When node label is disabled, all nodes in the cluster belong to default node partition.
To get backward compatibility, when configuring queue.capacity / queue.maximum-capacity, it actually configures queue’s capacity of default node partition.
#3: Choose best Hadoop release to use node label feature.
As mentioned above, Apache Hadoop releases after 2.6 support node label. But I don’t suggest to use node label feature in 2.6 since it is too unstable.
Node label in Apache Hadoop 2.7 is much more stable than 2.6. Yahoo! is using node label feature from 2.7 in production to enable use cases like distributed deep learning application (See Large Scale Distributed Deep Learning on Hadoop Clusters). The next 2.7 Hadoop release will be 2.7.3.
For Apache Hadoop 2.8+, we improved node partition feature a lot! For example, better tracking resource usages by node partitions, supports get by-node-partition information from REST API. Improved scheduler UI to show per-partition information (YARN-3362. And also, fixed a couple of annoying bugs such as:
- YARN-3215. Respect labels in CapacityScheduler when computing headroom.
- YARN-3216. Max-AM-Resource-Percentage should respect node labels
Apache Hadoop 2.8 hasn’t been released yet for now, but I expect it should be released by mid of 2016. In general, I will highly recommend upgrading to Apache Hadoop 2.8 if you want above fixes/improvements for node label.
If you’re using Hortonworks Data Platform (HDP). Node label feature is very stable since HDP-2.3.
And please note that only Capacity Scheduler supports node label, node label for Fair Scheduler is still under planning/development. (YARN-2497).
#4: Requesting node labels from your application
Application has different ways to specify node labels.
If you’re writing your native YARN application:
ApplicationSubmissionContext#setNodeLabelExpression, specify requested node label for all the containers within the app.
ResourceRequest#setNodeLabelExpression, specify requested node label for the resource request.
- Similarly to above, from
AMRMClient#addContainerRequest, specify node-label-expression for
Currently many other Hadoop ecosystem projects support node label too, for example:
And finally, queue has a configuration called
default-node-label-expression, if application / ResourceRequest don’t specify node-label-expression, scheduler will pick
default-node-label-expression configured in queue.
#5: Troubleshooting issues of using node label.
5.1 Cannot add/remove node-to-label mappings
Generally, you can use
yarn rmadmin -replaceLabelsOnNode “node1=label1 node2=label2” to assign labels to nodes. If you hit some exceptions, please check following:
- The “=” syntax is supported only after Apache Hadoop 2.7. Documentation for node label of Hadoop 2.6 is in progress, which is tracked by YARN-4867
- Remove label is essentially assign an empty label to the node (move the node to default node partition). For example:
yarn rmadmin -replaceLabelsOnNode "host1= host2=", which move host1/host2 to default node partition.
5.2 Illegal capacity for children of queue=xxx error
If you see the error like:
Illegal capacity of for children of queue for label=
You should have wrong configuration for queue’s capacities: please remember that for each node partition, sum of children’s capacities for every parent queue MUST equal to 100.
For example, let’s say queue=a is a parent queue, and queue=a1/a2/a3 are a’s children.
For default node partition, you need to make sure that
a1.capacity + a2.capacity + a3.capacity = 100.
For non-default node partition, like node label = x, you should also make sure that
a1.accessible-node-labels.x.capacity + a2.accessible-node-labels.x.capacity + a3.accessible-node-labels.x.capacity = 100.
And please note that, by default, root.accessible-node-labels..capacity = 0. For a given label, if you don’t set the label’s capacity in root queue to be 100, you should avoid set the label’s capacity in any of other queues.
5.3 Non-default node partition can only launch one or few applications
Ideally, for every node partition, total resources that can be consumed by AM is,
total-resource-of-node-partition * queue_label.capacity * maximum-am-resource-percent. By default, maximum-am-resource-percent is 0.1, and for each queue, it can launch at least one AM.
If you find you should be able to launch more AM, but it doesn’t, you probably hit YARN-3216. Hadoop 2.8 or later includes this fix.
There’s a workaround: workaround of YARN-3216 if YARN-3216 is not included in your release.
5.4 Cannot get resource allocated from non-default node partition
There’re too many possible reasons cause the problem, for example, hit user-limit, hit queue’s guaranteed resource, container reserved but cannot be allocated, etc.
I plan to write a separate blog about how to troubleshoot such issues.
5.5 Container allocation delayed a lot when under non-default node label
If you find allocation of containers is very slow on non-default node partition (plenty resources available, but scheduler waits a long time after allocation of each container). Probably you hit YARN-4140. Please check the JIRA for more details.
Hope you will enjoy this blog! For this feature, credit goes to Apache Hadoop community, and feel free to send mail to Hadoop user/dev mail list if you have further questions.