The symptom is very similar to the ones described in #142 and #744. I set terasort_num_rows to a small number (such as 1000) to rule out a long-running generate/sort process as the cause. The memory threshold is also high enough (32 or 64 GB). The benchmark hangs in the teragen phase (/tmp/pkb/hadoop/bin/yarn jar /tmp/pkb/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.2.jar teragen 1000 /teragen' 1> /tmp/pkb/cmd6e619289-6a25-48dc-af46-ff5e86e90808.log). After a deep dive, it turned out that the Hadoop cluster could not reach other instances by hostname:
root@pkb-e358f15f-2:/tmp/pkb/hadoop/logs# tailf yarn-root-resourcemanager-pkb-e358f15f-2.log
2015-12-21 14:11:01,070 ERROR org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: Error trying to assign container token and NM token to an allocated container container_1450706300877_0001_01_000001
java.lang.IllegalArgumentException: java.net.UnknownHostException: pkb-e358f15f-1
at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:373)
at org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:247)
at org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:199)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:425)
at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fifo.FifoScheduler.allocate(FifoScheduler.java:345)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:816)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:809)
at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:649)
at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:104)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:761)
at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:742)
at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: pkb-e358f15f-1
... 18 more
This is the ResourceManager log from the master instance. As you can see, it tried to communicate with the pkb-e358f15f-1 instance but couldn't resolve the hostname. I worked around this in hadoop_terasort_benchmark.py by simply adding an entry to /etc/hosts for each instance. But I'm not sure this is the right solution, because the problem may be specific to Mesos (or Kubernetes). How is this resolved in GCE? Is every instance reachable from every other instance in the same network by hostname? Is anyone aware of a way to fix this in the Hadoop configuration itself (for example, by forcing the ResourceManager to use IP addresses instead of hostnames)?
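For reference, the /etc/hosts workaround described above could look roughly like the sketch below. This is not the actual patch to hadoop_terasort_benchmark.py: the helper names `build_hosts_entries` and `append_missing_entries` and the IP addresses are hypothetical, and the sketch only builds the file contents without writing them to any VM.

```python
def build_hosts_entries(vms):
    """Return /etc/hosts lines for (hostname, internal_ip) pairs.

    `vms` is an assumed input shape; the real benchmark would take
    these values from its VM objects.
    """
    return ['%s %s' % (ip, hostname) for hostname, ip in vms]


def append_missing_entries(hosts_text, entries):
    """Return hosts_text with any entries not already present appended."""
    existing = {line.strip() for line in hosts_text.splitlines()}
    missing = [entry for entry in entries if entry not in existing]
    if not missing:
        return hosts_text
    # Make sure the appended block starts on its own line.
    separator = '' if (not hosts_text or hosts_text.endswith('\n')) else '\n'
    return hosts_text + separator + '\n'.join(missing) + '\n'


# Hypothetical addresses for the two instances named in the log.
entries = build_hosts_entries([
    ('pkb-e358f15f-1', '10.0.0.1'),
    ('pkb-e358f15f-2', '10.0.0.2'),
])
print(append_missing_entries('127.0.0.1 localhost\n', entries))
```

Appending only the missing lines keeps the step idempotent, so running it again on the same instance leaves the file unchanged.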
This seems to be a general Hadoop requirement (see the Cloudera docs). We might be able to set dfs.namenode.datanode.registration.ip-hostname-check to false, but I can't find an equivalent for YARN.
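As a rough illustration of the HDFS setting mentioned above, the sketch below renders the property element that would go into hdfs-site.xml. The helper name `render_property` is hypothetical, and this covers only the DataNode registration check; it would not by itself address the YARN error in the log.

```python
import xml.etree.ElementTree as ET


def render_property(name, value):
    """Render one Hadoop-style <property> element as a string."""
    prop = ET.Element('property')
    ET.SubElement(prop, 'name').text = name
    ET.SubElement(prop, 'value').text = value
    return ET.tostring(prop, encoding='unicode')


# Property name taken from the comment above; 'false' disables the check.
print(render_property(
    'dfs.namenode.datanode.registration.ip-hostname-check', 'false'))
```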
@mateusz-blaszkowski - do you want to include your /etc/hosts patch in PerfKitBenchmarker? The entries could be added only when nodes are not reachable by name.
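One way to make the addition conditional is to check which hostnames fail to resolve first. The sketch below is a minimal version of that check under some assumptions: the function name `unresolvable_hosts` is hypothetical, and it resolves names from the machine it runs on, whereas the benchmark would need to run the equivalent check on each instance.

```python
import socket


def unresolvable_hosts(hostnames, resolver=socket.gethostbyname):
    """Return the hostnames that the resolver cannot resolve.

    `resolver` is injectable so the check can be exercised without DNS.
    """
    missing = []
    for name in hostnames:
        try:
            resolver(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing


if __name__ == '__main__':
    # Hostnames taken from the log above; they resolve only inside the cluster.
    print(unresolvable_hosts(['pkb-e358f15f-1', 'pkb-e358f15f-2']))
```

Only the names returned here would then need /etc/hosts entries, which leaves environments with working name resolution untouched.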