Installing Hadoop on Ubuntu Linux (Single Node Cluster)

Nowadays, Big Data is a buzzword, and the prominent technology behind the jargon is Hadoop.
It is a good skill to have on a developer's resume. In order to learn Hadoop, it is essential to have a single node Hadoop cluster ready to play with.
So, in this article, we are going to explain a way of installing Hadoop on Ubuntu Linux.

Installation Environment

I have tested the following tutorial in the environment below.
OS : Ubuntu Linux 14.04 LTS (64-bit)
Hadoop : Hadoop 2.2.0
Prerequisites : Oracle Java 8

Step 1: Java JDK Installation

The primary requirement for installing Hadoop is Java.
We can use Oracle JDK, OpenJDK or IBM JDK as per our requirements.
The compatibility of Hadoop with each JDK flavour can be found on the Hadoop Wiki.
In one of our previous articles, I have provided a step by step guide to installing Java on Ubuntu Linux, so we can skip the Java installation steps here.
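
To quickly confirm that Java is installed and visible on the PATH before continuing, you can check the reported version (the exact version string will differ depending on the JDK flavour you installed):

$ java -version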

Step 2: Install SSH

SSH is required so that the Hadoop namenode(s) and datanodes can communicate with each other.
If they are not already installed, we can install SSH and rsync using the following commands:

$ sudo apt-get install ssh

$ sudo apt-get install rsync

Step 3: Configuring SSH

After installing SSH on localhost, we need to use the following command to generate an SSH key for the current user with an empty password.
Hadoop uses SSH to access and manage the nodes running Hadoop, and we keep the password as an empty string because otherwise the user would have to enter his/her password every time Hadoop interacts with another node.

$ ssh-keygen -t rsa -P ""

Once the SSH key is generated, we need to execute the following command:

$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
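
If SSH later refuses the key-based login because of file permissions, it usually helps to make sure the .ssh directory and the authorized_keys file are only readable by the current user; this is often already the case on a fresh setup, but the commands below are a common fix:

$ chmod 700 $HOME/.ssh
$ chmod 600 $HOME/.ssh/authorized_keys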

Step 4: Testing SSH configuration

Our SSH configuration for localhost is now complete, so we can test SSH installation using the command:

$ ssh localhost
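
On the first connection, SSH will ask you to confirm the host fingerprint; answer yes, and you should be logged in to localhost without being asked for a password. Type exit to return to your original shell.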

Optional Step : Debugging SSH configuration

If all the steps up to this point have completed successfully, then SSH is configured and running on localhost.
If the SSH connection fails, enable SSH debugging using the ssh -vvv localhost command to get a detailed error for troubleshooting.

Step 5: Download Hadoop

Download Apache Hadoop from the Apache Download Mirrors;
I have used the mirror link to download Hadoop 2.2.0.
Once the file is downloaded, we have to select an appropriate location in the system to keep the Hadoop installation.
I have selected a path like /usr/local/hadoop/<Different_Hadoop_Versions>.
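
For reference, the 2.2.0 release can also be fetched directly from the Apache archive (assuming the archive still hosts this release; any official mirror works just as well):

$ wget https://archive.apache.org/dist/hadoop/common/hadoop-2.2.0/hadoop-2.2.0.tar.gz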

Step 6: Create folder using mkdir command for saving hadoop installation files

So, in order to keep the Hadoop files, we have to create a folder under /usr/local.
I have created the folder using the command:

$ sudo mkdir /usr/local/hadoop

Step 7: Extract tar file content using tar command

Once the folder is created, we have to extract the file that we downloaded from Apache Hadoop's website.
I downloaded and placed that file in /home/javadeveloperzone/Desktop,
so from that directory I execute the following command to extract the file:

$ tar -xvf hadoop-2.2.0.tar.gz

Step 8: Move folder content using mv command

Now we have to move the extracted folder to the location where we want to keep it (in our case /usr/local/hadoop/).
I have used the command below to move the extracted files to /usr/local/hadoop/:

$ sudo mv hadoop-2.2.0 /usr/local/hadoop

Step 9: Set permission using chmod command on Hadoop folder

Once the files are moved, it is time to set appropriate permissions on the folder.
I have used the following command to set the permissions:

$ sudo chmod -R 777 /usr/local/hadoop/hadoop-2.2.0/
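
Note that 777 grants read, write and execute permission to everyone, which is convenient on a single-user learning setup. A more restrictive alternative, assuming you run Hadoop as your own user, is to take ownership of the directory instead:

$ sudo chown -R $USER:$USER /usr/local/hadoop/hadoop-2.2.0/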

Now the setup part is almost done, and we can start the configuration part.

Step 10: Configuring Hadoop related environment variables

In the configuration step, we have to set multiple Hadoop variables. In order to set them, we have to add them to the $HOME/.bashrc file.
I have used the gedit $HOME/.bashrc command to open the file; you can use any of your favorite editors to open it. Append the lines below to that file.

# Set Hadoop-related environment variables
export HADOOP_PREFIX=/usr/local/hadoop/hadoop-2.2.0
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.2.0
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Once the modifications are completed, save the file.

Step 11: Run/Apply bashrc file changes

Now we have to execute the command below so that the Hadoop variables take effect:

$ source $HOME/.bashrc

Step 12: Create directories to save Hadoop data

Now, in the next step, we have to create 3 directories where Hadoop will save its namenode data, datanode data, and temp data.
Create the name, data and temp directories under your /home/{USER} folder, as shown below.
As we are using a single node installation, there is no permission problem if the same user creates the directories under the /home/{USER} folder and provides them as the namenode, datanode and temp data directories in the Hadoop configuration.
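
For example, assuming your username is javadeveloperzone (the user used in the configuration files below), the three directories can be created in one command; adjust the path to match your own user:

$ mkdir -p /home/javadeveloperzone/hadoop/hadoop-2.2.0/{name,data,temp}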

Step 13: Configure Hadoop core-site.xml file

Now we have to specify those directories in the Hadoop configuration files.
Open the core-site.xml file available under the /usr/local/hadoop/hadoop-2.2.0/etc/hadoop location.
Place the content below between the <configuration></configuration> tags.

<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/home/javadeveloperzone/hadoop/hadoop-2.2.0/temp</value>
</property>


Step 14: Configure Hadoop hdfs-site.xml file

Open the hdfs-site.xml file available under the /usr/local/hadoop/hadoop-2.2.0/etc/hadoop location.

Place the content below between the <configuration></configuration> tags.

	<property>
		<name>dfs.replication</name>
		<value>1</value>
	</property>
	<property>
		<name>dfs.namenode.name.dir</name>
		<value>file:/home/javadeveloperzone/hadoop/hadoop-2.2.0/name</value>
	</property>
	<property>
		<name>dfs.datanode.data.dir</name>
		<value>file:/home/javadeveloperzone/hadoop/hadoop-2.2.0/data</value>
	</property>

Step 15: Configure Hadoop mapred-site.xml file

By default, the mapred-site.xml file is only present as a template named mapred-site.xml.template.
Copy (or rename) mapred-site.xml.template to mapred-site.xml.
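
Assuming the default location used throughout this tutorial, the file can be copied with:

$ cp /usr/local/hadoop/hadoop-2.2.0/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/hadoop-2.2.0/etc/hadoop/mapred-site.xml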

Open the mapred-site.xml file and place the content below between the <configuration></configuration> tags.

<property>
	<name>mapreduce.framework.name</name>
	<value>yarn</value>
</property>

Step 16: Configure hadoop yarn-site.xml file

Open the yarn-site.xml file available under the /usr/local/hadoop/hadoop-2.2.0/etc/hadoop location.

Place the content below between the <configuration></configuration> tags.

<property>
	<name>yarn.nodemanager.aux-services</name>
	<value>mapreduce_shuffle</value>
</property>
<property>
	<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
	<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

Step 17: Configure JAVA_HOME in hadoop-env.sh file

Open the hadoop-env.sh file available under the /usr/local/hadoop/hadoop-2.2.0/etc/hadoop location and update the JAVA_HOME variable to point to your Java installation location.

I have updated it like this:

export JAVA_HOME=/usr/lib/jvm/jdk1.8.0_45
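
If you are not sure where your JDK is installed, one way to find it (assuming java is on your PATH) is to resolve the java binary and drop the trailing /bin/java from the printed path:

$ readlink -f $(which java)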

Step 18: Format Hadoop namenode

Now that all the configurations are completed, we have to format the namenode using the command below.

$ hadoop namenode -format
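
In Hadoop 2.x the hadoop namenode script is deprecated in favour of the hdfs command, so the equivalent newer form is:

$ hdfs namenode -format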

Step 19: Start Hadoop components using commands

Now, if our namenode is formatted without any errors, we can start Hadoop using the commands below:

$ start-dfs.sh
$ start-yarn.sh

Step 20: Check whether Hadoop components are running or not using jps command

If the above scripts run properly, they will start the different Hadoop processes listed below;
you can check them using the jps command.

$ jps

It will list the processes below:

9008 ResourceManager
8674 DataNode
8530 NameNode
8854 SecondaryNameNode
9308 NodeManager
10206 Jps

The process ids will be different in every case.

Step 21: Verify Hadoop installation using pi example

Now, if all of the above processes have started, our Hadoop installation is correct, and we can try it out using the example code which is shipped with Hadoop.

Use the following command to run a pi example,

$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 1000

You can use the URLs below to open the HDFS (NameNode) web UI and the Hadoop cluster (ResourceManager) web UI.

http://localhost:50070/
http://localhost:8088

Now your Hadoop single node cluster is running, and you can run Hadoop MapReduce programs on it. In this way, we can install Hadoop on Ubuntu Linux.
Happy Hadooping…

Related Links:

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html
