Massoud Mazar

Sharing The Knowledge


Setting up a hadoop/hive cluster on CentOS 5

If you are reading this post, chances are you are trying to set up a hadoop/hive cluster. I hope these instructions save you some time with your installation.

Install CentOS:

I assume you are starting from scratch and want to set up a small cluster with minimal OS components (as far as I could minimize them). To start, download the OS setup media and boot each machine from it. Go through the screens of the GUI installer, keeping in mind to do the following:

- Disable IPv6 during network setup
- De-select all components, including all GUIs and even "Server"
- On the component selection page, select the "Customize now" radio button and click Next
- In the "Base System" group, de-select "Dialup Networking Support"

It is important (at least in this tutorial) to define the host names in your DNS server (or in the hosts file, if no DNS server is used). Be careful with the hosts file: one problem I ran into was caused by the loopback address (127.0.0.1) being assigned to the name of the host (in this case, centos1). Apparently the Java listeners use the hostname to look up the IP address to listen on, and they picked up the loopback address from the hosts file instead of the network address.
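For example, a working hosts file for this two-machine cluster could look like the following (the 192.168.x.x addresses are placeholders for your machines' real network addresses; note that centos1 is not on the loopback line):

```
127.0.0.1       localhost localhost.localdomain
192.168.1.101   centos1
192.168.1.102   centos2
```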
When the OS installation is complete, make sure to turn off and disable firewalls:
service iptables save
service iptables stop
chkconfig iptables off
It may be a good idea to update your OS before installing hadoop:

yum update

Install java JDK:

You may not like the way I installed the Java JDK, so feel free to do it your way. I assume you are using an SSH client like putty, which does a good job with the clipboard:
  • Using a browser (on the machine you run putty on), visit the Sun JDK download page
  • Select Linux as platform and click continue
  • On the popup click "skip this Step"
  • Copy the hyperlink address for "jdk-6u18-linux-i586-rpm.bin" to clipboard
  • do the following in putty session:
cd /tmp
wget -O jdk-6u18-linux-i586-rpm.bin <the URL you copied from java download site>
chmod a+x jdk-6u18-linux-i586-rpm.bin
./jdk-6u18-linux-i586-rpm.bin
To make sure java was installed correctly:

java -version
(If you have followed the instructions, java is installed in /usr/java/jdk1.6.0_18)

Install hadoop:

- setup hadoop user:

useradd hadoop
passwd hadoop

- create a folder for all hadoop operations:

mkdir /hadoop
chown -R hadoop /hadoop

- switch to user "hadoop":

su - hadoop

- set up passphrase-less ssh (only on master, while still logged in as hadoop):
ssh-keygen -t dsa
(leave passphrase empty)
- copy the identity to each machine (Only on master node, after creating hadoop user on each machine):
ssh-copy-id -i /home/hadoop/.ssh/id_dsa hadoop@centos1
ssh-copy-id -i /home/hadoop/.ssh/id_dsa hadoop@centos2
- to test passwordless ssh:
ssh centos1
ssh centos2
- download and install hadoop (first download hadoop-0.20.2.tar.gz from an Apache mirror into /hadoop):
cd /hadoop
tar -xvzf hadoop-0.20.2.tar.gz
mv hadoop-0.20.2 hadoop

Configure hadoop:

cd /hadoop/hadoop

Modify the core-site.xml:

nano conf/core-site.xml
Add the following inside the <configuration> tags:
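A minimal core-site.xml property for this cluster points the default filesystem at the master node; the hostname and port here are assumptions consistent with the rest of this tutorial:

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://centos1:9000</value>
</property>
```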

Modify the hdfs-site.xml:

nano conf/hdfs-site.xml
Add the following inside the <configuration> tags:
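A minimal hdfs-site.xml property sets the block replication factor; 2 is a reasonable choice for this two-node example (an assumption, not a requirement):

```xml
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
```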

Modify the mapred-site.xml:

nano conf/mapred-site.xml

Add the following inside the <configuration> tags:
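A minimal mapred-site.xml property points the jobtracker at the master node; again, the hostname and port are assumptions matching this tutorial's setup:

```xml
<property>
  <name>mapred.job.tracker</name>
  <value>centos1:9001</value>
</property>
```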


Modify the hadoop-env.sh:

nano conf/hadoop-env.sh

Find the following exports and modify them accordingly:

export JAVA_HOME=/usr/java/jdk1.6.0_18

Define the masters and slaves (only on the master node):

nano conf/masters

In this case the hostname of the master is centos1, so the file contains a single line:

centos1

Now modify the slaves file:

nano conf/slaves

Slaves in our example are centos1 and centos2, one hostname per line:

centos1
centos2

Starting hadoop

Before using the hadoop HDFS, format the namenode:

su - hadoop
cd /hadoop/hadoop
bin/hadoop namenode -format

If there was no error, you can start all components on all nodes (run this on the master; it starts the daemons on the slaves over SSH):

bin/start-all.sh

Testing hadoop

For this test, we download some text files, copy them to HDFS and use a sample MapReduce job to count words in those files:

cd /hadoop
mkdir gutenberg
cd gutenberg
(download a few plain-text books, for example from Project Gutenberg, into this folder)
cd /hadoop/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal /hadoop/gutenberg gutenberg
bin/hadoop dfs -ls gutenberg
bin/hadoop dfs -rmr gutenberg-output
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output

The above commands start a Map/Reduce job and show some progress percentages.
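Conceptually, the wordcount example maps every line to (word, 1) pairs and then reduces them by summing the counts per word. A small local sketch of the same algorithm in plain Python (not Hadoop code, just an illustration) is:

```python
from collections import Counter

def wordcount(lines):
    """Count words across lines, like the hadoop wordcount example."""
    counts = Counter()
    for line in lines:
        # "map" phase: split each line into words;
        # Counter.update plays the "reduce" role by summing counts per word
        counts.update(line.split())
    return dict(counts)

print(wordcount(["hello hadoop", "hello hive"]))
```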

Adding new slave nodes to cluster:

In case you want to add new slave nodes to the cluster, simply add the hostnames of the new slaves to the conf/slaves file on the master node. Make sure to follow the password-less SSH instructions above to allow communication from master to slaves. Log in as the hadoop user on each new slave and start the datanode and tasktracker:

bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker

You can also restart the whole cluster from the master (if you want to!):

bin/stop-all.sh
bin/start-all.sh

Install hive:

hive is Java code that is only needed on nodes used to submit hive queries. In our example, we install it only on the master node. Since hive does not have a production-ready release yet, we need to download the source code and build it on our machine. If svn and ant are not installed, install them first:

yum install subversion
yum install ant

To install hive, log in as root, then:

cd /tmp
svn co <URL of the hive svn repository> hive
cd hive
ant -Dhadoop.version="0.20.0" package
cd build/dist
mkdir /hadoop/hive
cp -R * /hadoop/hive/.
chown -R hadoop /hadoop

Configure hive:

Setup some paths in the login script:
nano /etc/profile
add the following to the end of the profile script:

export JAVA_HOME=/usr/java/jdk1.6.0_18
export HADOOP_HOME=/hadoop/hadoop
export HIVE_HOME=/hadoop/hive

hive will store its data in specific directories in HDFS. Creating them needs to be done only once:

su - hadoop
cd /hadoop/hadoop
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -mkdir /user/hive/warehouse
bin/hadoop fs -chmod g+w /tmp
bin/hadoop fs -chmod g+w /user/hive/warehouse
bin/hadoop fs -ls /user

Testing hive with embedded derby server:

To make sure hive was installed correctly, start the hive CLI and try a command:

cd /hadoop/hive
bin/hive
hive> show tables;
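For a slightly bigger smoke test you can create a throwaway table and list it; the table name and columns below are just an example:

```sql
CREATE TABLE pokes (foo INT, bar STRING);
SHOW TABLES;
DROP TABLE pokes;
```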

Install derby server:

hive by default uses an embedded derby database. In real-world scenarios, where multiple hive queries are executed over multiple sessions, a database server like MySQL or the derby network server is required. For our example, we will use the derby server. Before doing that, make sure the cluster is down:

su - hadoop
cd /hadoop/hadoop
bin/stop-all.sh

Then download and install the same version of derby that comes with hive:

cd /hadoop
tar -xzf db-derby-<version>-bin.tar.gz
mv db-derby-<version>-bin derby
mkdir derby/data

Log in as root, then create a profile script for derby (any name under /etc/profile.d ending in .sh works; I use derby.sh):

nano /etc/profile.d/derby.sh

enter the following into the file (the standard derby environment variables, pointing at our install location):

export DERBY_INSTALL=/hadoop/derby
export DERBY_HOME=/hadoop/derby

Also modify
nano /etc/profile.d/
add the following into the file:

export HADOOP

The rest of the modifications can be done when logged in as "hadoop":

su - hadoop

We can use the cluster start script to also start the derby server:

nano /hadoop/hadoop/bin/start-all.sh

Add the following to the end of the script:

cd /hadoop/derby/data
nohup /hadoop/derby/bin/startNetworkServer -h 0.0.0.0 &

(the -h 0.0.0.0 argument makes derby listen on all interfaces, so other nodes can reach it)

If you have followed this tutorial, you do not have a hive-site.xml yet. Either create one or edit the existing one:

nano /hadoop/hive/conf/hive-site.xml

Content will look like the following; these are the standard settings for a metastore backed by a derby network server (replace centos1 with the hostname of the machine running derby):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://centos1:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

</configuration>

There is another file to modify (or create, if you do not have it):
nano /hadoop/hive/conf/
Note that you have to put the hostname of the server running derby instead of "centos1":
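The file in question is most likely jpox.properties (the JDO configuration hive 0.4 reads); a minimal version for a derby network server, with values that are assumptions based on the standard setup, would contain:

```
javax.jdo.option.ConnectionURL=jdbc:derby://centos1:1527/metastore_db;create=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
```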


Now, copy some derby related jar files from new installation of derby to hive folder:

cp /hadoop/derby/lib/derbyclient.jar /hadoop/hive/lib
cp /hadoop/derby/lib/derbytools.jar /hadoop/hive/lib

Now you can start the cluster:

cd /hadoop/hadoop
bin/start-all.sh

To make sure new derby server is configured correctly:

cd /hadoop/hive
bin/hive
hive> show tables;

Starting the hive web interface:

It is much easier to run hive queries using the hive web interface. If you have followed this tutorial and installed hive release-0.4.1-rc2, then you have to fix a config property manually. (This issue has been fixed in trunk, as I was told.) Edit hive-default.xml:
nano /hadoop/hive/conf/hive-default.xml

And modify the value of hive.hwi.war.file so that it points at the hwi war file that actually exists under /hadoop/hive/lib (the file name carries the release version):

Now you can start the web server listener:

export ANT_LIB=/usr/share/ant/lib
bin/hive --service hwi 

To access the web interface, point your browser to port 9999 (the default hwi port) on the node running hwi (replace centos1 with your hostname):

http://centos1:9999/hwi