This tutorial will walk you through installing and configuring a basic multi-node Hadoop cluster on 3 nodes.
Make sure you have installed Linux on three different machines and that they are all up and running.
Say the first machine "master" is the master, while the second "slave1" and the third "slave2" are the slaves.
For simplicity, this tutorial will assume the following IP addresses and corresponding roles for the three nodes:
master:
IP : 192.168.1.103
HDFS : Namenode
slave1:
IP : 192.168.1.104
HDFS : Datanode
slave2:
IP : 192.168.1.105
HDFS : Datanode
INSTALLING JAVA
Java is the main prerequisite for Hadoop. First of all, verify whether Java is already installed on your system:
java -version
If Java is installed, you will see output similar to the following:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
If Java is not installed, install it:
sudo apt-get install default-jre
sudo apt-get install default-jdk
Set up the PATH and JAVA_HOME variables by adding the following lines to the ~/.bashrc file:
export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
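To make the new variables take effect and check them, reload the file and print them (a quick sanity check, assuming the JDK really lives under /usr/local/jdk1.7.0_71 as configured above):
source ~/.bashrc
echo $JAVA_HOME
java -version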
MAPPING THE NODES
To map the nodes to each other, you have to edit the hosts file /etc/hosts on all nodes. Specify the IP address of each system followed by its host name by adding these lines to /etc/hosts:
192.168.1.103 master
192.168.1.104 slave1
192.168.1.105 slave2
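To confirm the mapping works, you can ping each node by name from every machine, for example:
ping -c 3 master
ping -c 3 slave1
ping -c 3 slave2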
CONFIGURING THE KEYS
To allow passwordless SSH from the master to the slaves, generate a public/private key pair on the master node and put the master's public key in each slave's authorized_keys file:
ssh-keygen -t rsa
This will generate a public and private key on the machine. Make sure to copy the public key located in:
~/.ssh/id_rsa.pub
Add that key to the authorized keys on the slave machines. The authorized_keys file is located in:
~/.ssh/authorized_keys
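If ssh-copy-id is available, this copy step can be done with one command per node (hduser below is a placeholder, use whatever user runs Hadoop on your machines; copying the key to the master itself is also useful, since the start scripts connect to it over SSH as well):
ssh-copy-id hduser@master
ssh-copy-id hduser@slave1
ssh-copy-id hduser@slave2
Afterwards, ssh slave1 from the master should log you in without asking for a password.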
INSTALLING HADOOP ON MASTER
On the master node, download your required version of Hadoop from one of the Apache mirrors.
Here we will install Hadoop in the /opt directory:
cd /opt/
wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz
mv hadoop-2.7.3 hadoop
Next we will set the $HADOOP_HOME variable by adding the following two lines to the ~/.bashrc file:
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
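The start/stop scripts used later in this tutorial (start-dfs.sh, start-yarn.sh) live in $HADOOP_HOME/sbin rather than $HADOOP_HOME/bin, so it is convenient to add that directory to PATH as well:
export PATH=$PATH:$HADOOP_HOME/sbin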
Load ~/.bashrc
source ~/.bashrc
Verify your installation
hadoop version
The output should be something like this:
Hadoop 2.7.3
Subversion https://git-wip-us.apache.org/repos/asf/hadoop.git -r baa91f7c6bc9cb92be5982de4719c1c8af91ccff
Compiled by root on 2016-08-18T01:41Z
Compiled with protoc 2.5.0
From source with checksum 2e4ce5f957ea4db193bce3734ff29ff4
CONFIGURING HADOOP
You have to make some changes to the following files in $HADOOP_HOME/etc/hadoop.
In core-site.xml, add:
<property>
<name>fs.defaultFS</name>
<value>hdfs://192.168.1.103:8020</value>
</property>
In hdfs-site.xml, add:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/opt/hadoop/hadoop_store/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/opt/hadoop/hadoop_store/hdfs/datanode</value>
</property>
Make sure the directories that will store the HDFS metadata and data are created:
cd $HADOOP_HOME
mkdir -p hadoop_store/hdfs/namenode
chmod 755 hadoop_store/hdfs/namenode
mkdir -p hadoop_store/hdfs/datanode
chmod 755 hadoop_store/hdfs/datanode
Create mapred-site.xml by copying mapred-site.xml.template:
cd $HADOOP_HOME
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
Then add the following lines in mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
Then add the following lines in yarn-site.xml:
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>192.168.1.103:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>192.168.1.103:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>192.168.1.103:8050</value>
</property>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
<name>yarn.nodemanager.disk-health-checker.min-healthy-disks</name>
<value>0</value>
</property>
Add the IP addresses of the slave nodes in $HADOOP_HOME/etc/hadoop/slaves:
192.168.1.104
192.168.1.105
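A minimal way to write this file from the shell (this overwrites any existing content of the slaves file):
cat > $HADOOP_HOME/etc/hadoop/slaves << EOF
192.168.1.104
192.168.1.105
EOF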
Set the JAVA_HOME variable in $HADOOP_HOME/etc/hadoop/hadoop-env.sh so it matches your Java installation:
export JAVA_HOME=/usr/local/jdk1.7.0_71
INSTALLING HADOOP ON SLAVES
First you have to make sure the SSH server is installed on all the machines:
sudo apt-get install openssh-server
sudo systemctl restart ssh
Copy the Hadoop directory from the master node to the slaves; on the master node execute the following commands:
cd /opt
scp -r hadoop slave1:/opt
scp -r hadoop slave2:/opt
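The slaves also need the same environment variables as the master; a minimal sketch, assuming the same paths are used on every node (add these lines to ~/.bashrc on each slave, then reload it with source ~/.bashrc):
export JAVA_HOME=/usr/local/jdk1.7.0_71
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin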
Make sure that each data node has its data directory:
mkdir -p /opt/hadoop/hadoop_store/hdfs/datanode
chmod 755 /opt/hadoop/hadoop_store/hdfs/datanode
FORMAT THE HADOOP FILE SYSTEM ON THE MASTER NODE
hdfs namenode -format
Then start the HDFS and YARN daemons from the master node:
start-dfs.sh
start-yarn.sh
OR
start-all.sh
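To verify that everything started, jps should list NameNode, SecondaryNameNode and ResourceManager on the master, and DataNode and NodeManager on each slave:
jps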
You can monitor the cluster through the web interfaces:
NameNode web UI: http://master:50070
ResourceManager web UI: http://master:8088
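You can also check from the command line on the master that both datanodes have registered with the namenode:
hdfs dfsadmin -report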
If the nodes cannot reach each other or the datanodes do not show up, the firewall may be blocking Hadoop's ports; for the purposes of this tutorial the simplest fix is to disable it on every node:
sudo ufw disable
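As a final smoke test you can run one of the example MapReduce jobs that ship with Hadoop, for instance the pi estimator (the jar path below assumes the standard 2.7.3 tarball layout under $HADOOP_HOME):
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar pi 4 100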