Hadoop hangs because it cannot reach other instances by hostname #765

Open
mateusz-blaszkowski opened this issue Dec 21, 2015 · 4 comments
@mateusz-blaszkowski
Contributor

The symptom is very similar to the ones described in #142 and #744. I set terasort_num_rows to a small number (e.g. 1000) to rule out a long-running generate/sort process as the cause. The memory threshold is also high enough (32 or 64 GB). The benchmark hangs in the teragen phase (/tmp/pkb/hadoop/bin/yarn jar /tmp/pkb/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar teragen 1000 /teragen' 1> /tmp/pkb/cmd6e619289-6a25-48dc-af46-ff5e86e90808.log). After a deep dive, it turned out that the Hadoop cluster could not reach other instances by hostname:

root@pkb-e358f15f-2:/tmp/pkb/hadoop/logs# tailf yarn-root-resourcemanager-pkb-e358f15f-2.log
2015-12-21 14:11:01,070 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_1450706300877_0001_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: pkb-e358f15f-1
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
        at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
        at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:199)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:425)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:345)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:816)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:809)
        at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
        at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
        at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
        at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:649)
        at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:761)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:742)
        at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
        at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: pkb-e358f15f-1
        ... 18 more

This is the ResourceManager log from the master instance. As you can see, it tried to communicate with the pkb-e358f15f-1 instance but could not resolve its hostname. I worked around this in hadoop_terasort_benchmark.py by simply generating new /etc/hosts entries for each instance. However, I am not sure this is the right solution, because the problem may be specific to Mesos (or Kubernetes). How is this handled on GCE? Is every instance reachable by hostname from every other instance in the same network? Does anyone know whether this can be fixed in the Hadoop configuration itself (for example, by forcing the ResourceManager to use IP addresses instead of hostnames)?
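
For reference, the /etc/hosts workaround described above could look roughly like the sketch below. This is not the actual patch to hadoop_terasort_benchmark.py: the function names `build_hosts_entries` and `merge_hosts`, the marker comment, and the IP addresses in the usage example are hypothetical, and the sketch assumes each instance's hostname and internal IP are already known.

```python
def build_hosts_entries(instances, marker="# added by pkb"):
    """Return /etc/hosts lines mapping each instance's IP to its hostname.

    `instances` is an iterable of (hostname, ip) pairs (assumed input shape).
    """
    lines = []
    for hostname, ip in instances:
        lines.append("{0} {1} {2}".format(ip, hostname, marker))
    return "\n".join(lines) + "\n"


def merge_hosts(existing, instances, marker="# added by pkb"):
    """Append entries to existing /etc/hosts content.

    Hostnames that already appear in `existing` are skipped, so running
    this twice does not duplicate lines.
    """
    known = set()
    for line in existing.splitlines():
        # Drop trailing comments; the first field is the IP, the rest are names.
        fields = line.split("#", 1)[0].split()
        known.update(fields[1:])
    missing = [(h, ip) for h, ip in instances if h not in known]
    if not missing:
        return existing
    if existing and not existing.endswith("\n"):
        existing += "\n"
    return existing + build_hosts_entries(missing, marker)
```

A call such as `merge_hosts(open("/etc/hosts").read(), [("pkb-e358f15f-1", "10.0.0.1")])` would return the new file content; writing it back on every instance (as root) is left out of the sketch.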

@cmccoy
Contributor

cmccoy commented Dec 21, 2015

On GCE, every instance in a network is reachable by hostname. Same on AWS, if the VPC is configured with DNS hostnames.

This seems to be a general Hadoop requirement (see the Cloudera docs). We might be able to set dfs.namenode.datanode.registration.ip-hostname-check to false, but I can't find an equivalent for YARN.
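
As a rough illustration of the HDFS setting mentioned above, the sketch below renders the `<property>` element that would go into hdfs-site.xml. The helper name `hadoop_property` is hypothetical, and this only relaxes the NameNode's check during DataNode registration; it is not expected to change the YARN error shown in the log.

```python
import xml.etree.ElementTree as ET


def hadoop_property(name, value):
    """Render one Hadoop <property> element as an XML string."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return ET.tostring(prop, encoding="unicode")


# The property name comes from the comment above; "false" disables the check.
snippet = hadoop_property(
    "dfs.namenode.datanode.registration.ip-hostname-check", "false")
print(snippet)
```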

@voellm voellm assigned dberlin and cmccoy and unassigned dberlin Dec 27, 2015
@cmccoy
Contributor

cmccoy commented Dec 28, 2015

@mateusz-blaszkowski - do you want to include your /etc/hosts patch in PerfKitBenchmarker? The entries could be added only when nodes are not reachable by name.
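
One way to make the addition conditional, as a minimal sketch: try to resolve each peer hostname and report only the ones that fail. The function name `unresolvable_hosts` and the injectable `resolver` parameter are hypothetical; the only real API used is `socket.gethostbyname` from the Python standard library.

```python
import socket


def unresolvable_hosts(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that `resolver` cannot resolve.

    `resolver` defaults to socket.gethostbyname and is injectable so the
    logic can be exercised without touching real DNS.
    """
    missing = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

Note that the check is only meaningful when it runs on the instances themselves, not on the machine driving the benchmark, since name resolution can differ between the two.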

@cmccoy cmccoy removed their assignment Dec 28, 2015
@voellm
Contributor

voellm commented Jan 6, 2016 via email

@kivio
Contributor

kivio commented Jan 13, 2016

I have the same problem with Cassandra YCSB on OpenStack.
Maybe we could add the hostnames to /etc/hosts in the post-creation steps of the VM, to be sure.


5 participants