Massoud Mazar

Sharing The Knowledge

setting up hadoop/hive cluster on Centos 5

If you are reading this post chances are you are trying to setup a hadoop/hive cluster. I hope these instructions save you some time with your installation.

Install CentOS:

I assume you are starting from scratch and want to setup a small cluster with minimum OS components (as far as I could minimize it). To start, download the OS setup media, and boot each machine with it. Go through the screens of the GUI installer, keeping in mind to do the following:

- Disable IP V 6 during Network setup
- De-select all components, including all GUIs and even  "Server"
- On the component selection page, click "Customize now" radio button and click next
- In "Base System" group, De-select "Dialup Networking Support"
 
It is important (at least in this tutorial) to define the host names in your DNS server (or in the host file, if no DNS server is used). Be careful with the hosts file, one problem was caused by the loop back address (127.0.0.1) being assigned to the name of your host (in this case, centos1). Apparently Java listeners use the hostname to lookup the IP address to listen to, and they picked up the loop back address from hosts file, instead of the network address.
 
When the OS installation is complete, make sure to turn off and disable firewalls:
 
service iptables save
service iptables stop
chkconfig iptables off
It may be a good idea to update your OS before installing hadoop:

yum update

Install java JDK:

You may not like the way I installed Java JDK, so feel free to do it your way. Assuming you are using an SSH client like putty which does a good job with clipboard:
  • Using a browser (on the machine you run putty on) visit: http://java.sun.com/javase/downloads/widget/jdk6.jsp
  • Select Linux as platform and click continue
  • On the popup click "skip this Step"
  • Copy the hyperlink address for "jdk-6u18-linux-i586-rpm.bin" to clipboard
  • do the following in putty session:
cd /tmp
wget -O jdk-6u18-linux-i586-rpm.bin <the URL you copied from java download site>
chmod a+x jdk-6u18-linux-i586-rpm.bin
./jdk-6u18-linux-i586-rpm.bin
To make sure java was installed correctly:

java -version
(If you have followed the instructions, java is installed in /usr/java/jdk1.6.0_18)
 

Install hadoop:

- setup hadoop user:

useradd hadoop
passwd hadoop

- create a folder for all hadoop operations:

mkdir /hadoop
chown -R hadoop /hadoop

- switch to user "hadoop":

su - hadoop

- set up passphrase-less ssh (only on master, while still logged in as hadoop):
ssh-keygen -t dsa
(leave passphrase empty)
- copy the identity to each machine (Only on master node, after creating hadoop user on each machine):
ssh-copy-id -i /home/hadoop/.ssh/id_dsa hadoop@centos1
ssh-copy-id -i /home/hadoop/.ssh/id_dsa hadoop@centos2
...
- to test passwordless ssh:
ssh centos1
ssh centos2
...
- download and install hadoop:
cd /hadoop
wget http://mirror.cloudera.com/apache/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
tar -xvzf hadoop-0.20.2.tar.gz
mv hadoop-0.20.2 hadoop

Configure hadoop:

cd /hadoop/hadoop

Modify the core-site.xml:

nano conf/core-site.xml
Add the following inside the <configuration> tags:
<property>
  <name>fs.default.name</name>
  <value>hdfs://centos1:9000/</value>
</property>

Modify the hdfs-site.xml:

nano conf/hdfs-site.xml
Add the following inside the <configuration> tags:
<property>
  <name>dfs.name.dir</name>
  <value>/hadoop/hdfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/hadoop/hdfs/data</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

Modify the mapred-site.xml:

nano conf/mapred-site.xml

Add the following inside the <configuration> tags:

<property>
  <name>mapred.job.tracker</name>
  <value>centos1:9001</value>
</property>

Modify the hadoop-env.sh:

nano conf/hadoop-env.sh

Find the following exports and modify them accordingly:

export JAVA_HOME=/usr/java/jdk1.6.0_18
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true

define the masters and slaves (Only on master node):
nano conf/masters

in this case hostname of master is centos1:

centos1

now modify the slaves file:

nano conf/slaves

Slaves in our example are centos1 and centos2:

centos1
centos2

Starting hadoop

Before using the hadoop HDFS, format the namenode:

su - hadoop
cd /hadoop/hadoop
bin/hadoop namenode -format

If there was no error, you can start all components, on all nodes:
bin/start-all.sh

Testing hadoop

For this test, we download some text files, copy them to HDFS and use a sample MapReduce job to count words in those files:

cd /hadoop
mkdir gutenberg
cd gutenberg
wget http://www.gutenberg.org/files/20417/20417.txt
wget http://www.gutenberg.org/dirs/etext04/7ldvc10.txt
wget http://www.gutenberg.org/files/4300/4300.txt
cd /hadoop/hadoop
bin/hadoop dfs -ls
bin/hadoop dfs -copyFromLocal /hadoop/gutenberg gutenberg
bin/hadoop dfs -ls gutenberg
bin/hadoop dfs -rmr gutenberg-output
bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output

Above commands will start a Map/Reduce job, and will show some progress percentage.

Adding new slave nodes to cluster:

In case you want to add new slave nodes to the cluster, simply add the hostname of new slaves to conf/slaves file on master node. Make sure follow the password-less SSH instructions above to allow communication from master to slaves. Log in as hadoop user to each slave and start the datanode and tasktracker:

bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker

You can also restart the whole cluster from master (if you want to!):
bin/stop-all.sh
bin/start-all.sh

Install hive:

hive is a java code only needed on nodes which are used to submit hive queries. In our example, we only install it on the master node. Since hive is not in a production ready state yet, we need to download the source code and build it on our machine. If svn and ant are not installed, install them first:

yum install subversion
yum install ant

To install hive log in as root, then:

cd /tmp
svn co http://svn.apache.org/repos/asf/hadoop/hive/tags/release-0.4.1-rc2/ hive
cd hive
ant -Dhadoop.version="0.20.0" package
cd build/dist
mkdir /hadoop/hive
cp -R * /hadoop/hive/.
chown -R hadoop /hadoop

Configure hive:

Setup some paths in the login script:
nano /etc/profile
add the following to the end of the profile script:

export JAVA_HOME=/usr/java/jdk1.6.0_18
export HADOOP_HOME=/hadoop/hadoop
export HIVE_HOME=/hadoop/hive

hive will store it's data in specific directories in HDFS. This needs to be done once:

su - hadoop
cd /hadoop/hadoop
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -mkdir /user/hive/warehouse
bin/hadoop fs -chmod g+w /tmp
bin/hadoop fs -chmod g+w /user/hive/warehouse
bin/hadoop fs -ls /user

Testing hive with embedded derby server:

To make sure hive was installed correctly:

cd /hadoop/hive
bin/hive
hive> show tables;

Install derby server:

hive by default uses an embedded derby database. In real-world scenarios which multiple hive queries are executed using multiple session, a database server like MySQL or derby server is required. For our example, we will use derby server. Before doing that, make sure cluster is down:

su - hadoop
cd /hadoop/hadoop
bin/stop-all.sh

Then download and install the same version of derby that comes with hive:

cd /hadoop
wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
tar -xzf db-derby-10.4.2.0-bin.tar.gz
mv db-derby-10.4.2.0-bin derby
mkdir derby/data

Login as root, then do
nano /etc/profile.d/derby.sh
enter the following into the file:

DERBY_INSTALL=/hadoop/derby
DERBY_HOME=/hadoop/derby
export DERBY_INSTALL
export DERBY_HOME

Also modify hive.sh:
nano /etc/profile.d/hive.sh
add the following into the file:

HADOOP=/hadoop/hadoop/bin/hadoop
export HADOOP

the rest of modifications can be done when logged in as "hadoop":
su - hadoop
we can use start-dfs.sh to also start derby server:
nano /hadoop/hadoop/bin/start-dfs.sh
Add the following to the end of the script:

cd /hadoop/derby/data
nohup /hadoop/derby/bin/startNetworkServer -h 0.0.0.0 &

If you have followed this tutorial, you do not have hive-site.xml. Either create one or edit the existing one:

nano /hadoop/hive/conf/hive-site.xml

Content will look like:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
  <description>controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://centos1:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
</configuration>

There is another file to modify (or create, if you do not have it):
nano /hadoop/hive/conf/jpox.properties
Note that you have to put the hostname of the server running derby instead of "centos1":

javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://centos1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine

Now, copy some derby related jar files from new installation of derby to hive folder:

cp /hadoop/derby/lib/derbyclient.jar /hadoop/hive/lib
cp /hadoop/derby/lib/derbytools.jar /hadoop/hive/lib

Now you can start the cluster:

cd /hadoop/hadoop
bin/start-all.sh

To make sure new derby server is configured correctly:

cd /hadoop/hive
bin/hive
hive> show tables;

Starting the hive web interface:

It is much easier to run hive queries using the hive web interface. If you have followed this tutorial and installed hive release-0.4.1-rc2, then you have to fix a config property manually. (This issue has been fixed in trunk, as I was told.) Edit hive-default.xml:
nano /hadoop/hive/conf/hive-default.xml

And modify value of hive.hwi.war.file to the following:

/hadoop/hive/lib/hive_hwi.war

Now you can start the web server listener:

export ANT_LIB=/usr/share/ant/lib
bin/hive --service hwi 

To access the web interface, point your browser to (replace centos1 with your hostname):
http://centos1:9999/hwi
Loading