A deep dive into how locality works in CapacityScheduler and how to control it.

Locality settings in CapacityScheduler look straightforward, but in reality they are not so simple.

There are two levels of locality delay: node->rack (delay1) and rack->off-switch (delay2). As of now, delay in CapacityScheduler is based on missed opportunities (abbr. MO, the number of node allocations skipped) instead of wall-clock time.

Let’s start with an example. We have a cluster with 2 racks: rack1 has node1-node20, and rack2 has node21-node40. App_1 requests 3 * rack1 (3 containers on rack1). delay1 (node-locality-delay) is set to 5, and this is a global setting which applies to all queues/apps. There is no setting for delay2; YARN computes delay2 automatically (this is trickier and will be explained later).
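
To make the example concrete: delay1 corresponds to yarn.scheduler.capacity.node-locality-delay in capacity-scheduler.xml. Here is a minimal sketch using the value 5 from this example (the shipped default is 40, so 5 is only for illustration):

      <property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <value>5</value>
      </property>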

Once the app starts running, YARN will look at nodes one by one to allocate resources for the app.

Let’s trace what happens. In the beginning, app_1’s missed-opportunity count is 0, and assume YARN starts with node1.

Node1: (rack-local to app_1’s request), skip, and increase missed_opportunity to 1.
Node2: (rack-local, skip, MO=2)
Node3: (rack-local, skip, MO=3)
Node4: (rack-local, skip, MO=4)
Node5: (rack-local, skip, MO=5)
Node6: (rack-local, since MO=5 >= delay1, allocate 1st container, and reset MO to 0 [1])
Node7: (rack-local, skip, MO=1)
Node8: (rack-local, skip, MO=2)
Node9: (rack-local, skip, MO=3)
Node10: (rack-local, skip, MO=4)
Node11: (rack-local, skip, MO=5)
Node12: (rack-local, since MO=5 >= delay1, allocate 2nd container, and reset MO to 0 [1])
….

Once rack-locality-full-reset is set to false (see [1]), the allocation becomes:

Node1: (rack-local to app_1’s request), skip, and increase missed_opportunity to 1.
Node2: (rack-local, skip, MO=2)
Node3: (rack-local, skip, MO=3)
Node4: (rack-local, skip, MO=4)
Node5: (rack-local, skip, MO=5)
Node6: (rack-local, since MO=5 >= delay1, allocate 1st container, and reset MO to delay1)
             …. could allocate many containers here, depending on the node’s availability.
Node7: (rack-local, allocate for app_1 since MO is still >= delay1)
Node8: …

When delay2 comes into the picture, it becomes trickier:

delay2 = min(#cluster-node, (#requested-unique-hosts / #cluster-node) * #requested-containers)

So what the hell does this mean? First of all, delay2 will never exceed the total number of nodes in your cluster. And the more unique hosts you request, and the more containers you request, the longer the delay you should expect. For example, plugging into the formula above: with 40 cluster nodes, 10 requested unique hosts, and 20 requested containers, delay2 = min(40, (10/40) * 20) = 5 missed opportunities. We can use [2] to control this behavior.

In addition to the above two configs, the maximum number of containers that can be allocated in each node heartbeat can also be specified. See [3].

And there is an open design to introduce better locality scheduling to YARN; see https://issues.apache.org/jira/browse/YARN-4189 (Capacity Scheduler: Improve location preference waiting mechanism).

[1]

This resetting of MO after a rack-local allocation can be controlled by yarn.scheduler.capacity.rack-locality-full-reset; by default it is true (reset MO to 0). Once rack-locality-full-reset is set to false, YARN can continue allocating rack-local containers without waiting for MO to climb back to delay1.

(Available since Apache Hadoop 2.8.0)
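
For example, a capacity-scheduler.xml sketch that turns the full reset off (only this property is shown; everything else stays at its defaults):

      <property>
        <name>yarn.scheduler.capacity.rack-locality-full-reset</name>
        <value>false</value>
      </property>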

[2]

By setting yarn.scheduler.capacity.rack-locality-additional-delay > 0, we can control delay2 at the cluster level. Here is the description of the parameter:

      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.

(Available since Apache Hadoop 2.9.0/3.0.0)
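
For example, a capacity-scheduler.xml sketch that pins delay2 to 20 additional missed opportunities on top of node-locality-delay, matching the example in the description above (20 is illustrative; size it to your cluster):

      <property>
        <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
        <value>20</value>
      </property>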

[3]

yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled
– Whether to allow multiple container assignments in one NodeManager heartbeat. Defaults to true.

yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments
– If `multiple-assignments-enabled` is true, the maximum number of containers that can be assigned in one NodeManager heartbeat. Defaults to -1, which sets no limit.

yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments
– If `multiple-assignments-enabled` is true, the maximum number of off-switch containers that can be assigned in one NodeManager heartbeat. Defaults to 1, which means only one off-switch allocation is allowed per heartbeat.
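
A capacity-scheduler.xml sketch combining the three properties above (the values are illustrative, not recommendations):

      <property>
        <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-container-assignments</name>
        <value>10</value>
      </property>
      <property>
        <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
        <value>1</value>
      </property>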
