Hadoop hangs because it cannot reach other instances by hostname #765

mateusz-blaszkowski · 2015-12-21T15:09:47Z

The symptom is very similar to the ones described in #142 and #744. I set terasort_num_rows to a small number (such as 1000) to rule out a long-running generate/sort process. The memory threshold is also high enough (32 or 64 GB). The benchmark hangs in the teragen phase (/tmp/pkb/hadoop/bin/yarn jar /tmp/pkb/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar teragen 1000 /teragen' 1> /tmp/pkb/cmd6e619289-6a25-48dc-af46-ff5e86e90808.log). After a deep dive, it turned out that the Hadoop cluster could not reach other instances by hostname:

root@pkb-e358f15f-2:/tmp/pkb/hadoop/logs# tailf yarn-root-resourcemanager-pkb-e358f15f-2.log
2015-12-21 14:11:01,070 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_1450706300877_0001_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: pkb-e358f15f-1
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
        at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:199)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:425)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:345)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:816)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:809)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:649)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:761)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:742)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: pkb-e358f15f-1
        ... 18 more

This is the ResourceManager log from the master instance. As you can see, it tried to communicate with the pkb-e358f15f-1 instance but could not resolve the hostname. I worked around this in hadoop_terasort_benchmark.py by simply generating new entries in the /etc/hosts file for each instance. However, I am not sure this is the right solution, because the problem may be specific to Mesos (or Kubernetes). How is this resolved on GCE? Is every instance reachable from every other instance in the same network by hostname? Does anyone know whether this can be fixed in the Hadoop configuration itself (for example, by forcing the ResourceManager to use IP addresses instead of hostnames)?
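For illustration, the /etc/hosts workaround could look roughly like the sketch below. It is not the actual change made to hadoop_terasort_benchmark.py: the function name `render_hosts_entries` and the IP addresses are hypothetical, and the hostnames only follow the naming seen in the log above. It builds one "IP hostname" line per instance, which would then have to be appended to /etc/hosts on every VM.

```python
def render_hosts_entries(instances):
    """Return /etc/hosts lines mapping each instance's IP to its hostname.

    `instances` is a dict of hostname -> IP address (hypothetical input shape).
    """
    lines = []
    for hostname, ip_address in sorted(instances.items()):
        # /etc/hosts format: the IP address first, then the name(s) it maps to.
        lines.append("%s %s\n" % (ip_address, hostname))
    return "".join(lines)


# Made-up addresses, for illustration only.
example = {"pkb-e358f15f-1": "10.0.0.11", "pkb-e358f15f-2": "10.0.0.12"}
print(render_hosts_entries(example), end="")
```

Sorting the hostnames keeps the generated text identical on every instance, which makes it easier to compare the files across the cluster.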


cmccoy · 2015-12-21T15:30:21Z

On GCE, every instance in a network is reachable by hostname. Same on AWS, if the VPC is configured with DNS hostnames.

This seems to be a general Hadoop requirement (see the Cloudera docs). We might be able to set dfs.namenode.datanode.registration.ip-hostname-check to false, but I can't find an equivalent for YARN.
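As a rough sketch of what setting that property would involve: it belongs in hdfs-site.xml, so it only affects HDFS datanode registration and would not by itself address the YARN error in the log above. The helper name `render_property` is hypothetical; it just renders one Hadoop-style `<property>` element that would go inside the `<configuration>` section.

```python
import xml.etree.ElementTree as ET


def render_property(name, value):
    """Render a single Hadoop configuration <property> element as a string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# Property name taken from the comment above; "false" disables the check.
print(render_property(
    "dfs.namenode.datanode.registration.ip-hostname-check", "false"))
```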

cmccoy · 2015-12-28T15:42:12Z

@mateusz-blaszkowski - do you want to include your /etc/hosts patch in PerfKitBenchmarker? The addition could be made conditional on nodes not being reachable by name.
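One way to make the addition depend on whether nodes are reachable by name is to try resolving each hostname first. The sketch below is an assumption about how that check could be written, not existing PerfKitBenchmarker code; `unresolvable_hostnames` is a hypothetical helper, and it would need to run on the VMs themselves rather than on the machine driving the benchmark.

```python
import socket


def unresolvable_hostnames(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that `resolve` fails to look up.

    `resolve` defaults to socket.gethostbyname and is injectable so the
    check can be exercised without real DNS.
    """
    missing = []
    for hostname in hostnames:
        try:
            resolve(hostname)
        except OSError:
            # socket.gaierror (failed lookup) is a subclass of OSError.
            missing.append(hostname)
    return missing
```

If the returned list is empty, the /etc/hosts entries can be skipped; otherwise only the missing names need to be added.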

voellm · 2016-01-06T21:17:19Z

Any thoughts on this?

kivio · 2016-01-13T13:18:48Z

I have the same problem with Cassandra YCSB on OpenStack.
Maybe we could add the hostnames to /etc/hosts in a post-creation step for each VM, just to be sure.

voellm added bug P1 labels Dec 27, 2015

voellm assigned dberlin and cmccoy and unassigned dberlin Dec 27, 2015

cmccoy removed their assignment Dec 28, 2015
