Last updated on May 27, 2012 by Dan Nanni
Cloudera's Distribution Including Apache Hadoop (CDH) provides streamlined installation of Apache Hadoop via Cloudera Manager. Besides Apache Hadoop, CDH also allows installation of other components such as Hive, Pig, HBase, and ZooKeeper in a modular fashion. The free edition of Cloudera Manager allows you to build and monitor a Hadoop cluster of up to 50 nodes.
If you would like to install and configure HDFS/Hadoop on a small scale, I strongly recommend CDH.
You can install Cloudera Manager on Red Hat-compatible systems as well as Ubuntu/Debian systems. However, Cloudera Manager can only manage cluster nodes that run CentOS or RHEL. Therefore, you need to install CentOS or RHEL on every cluster node that Cloudera Manager will manage.
In this example, I will show you how to install and configure HDFS and Hadoop using CDH3 (CDH version 3). I assume that there is one Cloudera Manager node, five cluster nodes, and (optionally) one client node that will access the Hadoop cluster.
First, disable SELinux and the iptables firewall on all cluster nodes, and reboot them:
$ sudo vi /etc/sysconfig/selinux
$ sudo chkconfig iptables off
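The edit to /etc/sysconfig/selinux is typically a one-line change that sets SELINUX to disabled. As a minimal sketch, assuming a stock CentOS/RHEL configuration file, that change could be scripted as follows. The function name set_selinux_disabled is hypothetical and not part of any tool.

```shell
# Hypothetical helper: rewrite the SELINUX= line of a SELinux config file so
# that SELinux is disabled at the next boot. A backup is kept as <file>.bak.
set_selinux_disabled() {
    conf_file="$1"
    cp "$conf_file" "$conf_file.bak" || return 1
    # Only the SELINUX= line matches; SELINUXTYPE= and comment lines do not.
    # Redirecting into the file (instead of using sed -i) keeps a symlinked
    # config path intact.
    sed 's/^SELINUX=.*/SELINUX=disabled/' "$conf_file.bak" > "$conf_file"
}
```

On a cluster node this would be run as root with /etc/sysconfig/selinux as the argument, followed by a reboot.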
Make sure that every cluster node, as well as the Cloudera Manager node, has a fully qualified domain name (FQDN) in /etc/hosts. I recommend that the /etc/hosts file of every cluster node and of the Cloudera Manager node include the FQDNs of all nodes as follows. Otherwise, you may not be able to add cluster nodes to Cloudera Manager.
$ sudo vi /etc/hosts
192.168.212.10 manager.mydomain.com
192.168.212.11 node0.mydomain.com
192.168.212.12 node1.mydomain.com
192.168.212.13 node2.mydomain.com
192.168.212.14 node3.mydomain.com
192.168.212.15 node4.mydomain.com
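Since a missing entry on a single node can prevent it from being added, it may help to check each hosts file before continuing. The following is a rough sketch, not an official tool. The function name check_hosts_entries is hypothetical, and it only checks that each name appears in the file, not that the address is correct.

```shell
# Hypothetical helper: report which of the given host names have no entry in
# a hosts file. Returns 0 only if every name is present.
check_hosts_entries() {
    hosts_file="$1"
    shift
    missing=0
    for name in "$@"; do
        # Look for the name as a whole field after the address column,
        # skipping comment lines and anything after an inline '#'.
        if ! awk -v n="$name" '
            /^[ \t]*#/ { next }
            {
                for (i = 2; i <= NF; i++) {
                    if ($i ~ /^#/) break
                    if ($i == n) found = 1
                }
            }
            END { exit !found }' "$hosts_file"; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

For the example cluster, it would be called with /etc/hosts followed by manager.mydomain.com and node0.mydomain.com through node4.mydomain.com.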
Make sure to mount the partition used for data storage in each cluster node with the "noatime" option. With noatime, reading a file no longer updates the atime (last access time) recorded for that file, which saves a write on every read. For example, /etc/fstab in each cluster node can have an entry like the following, where /data is a placeholder for your actual mount point:
/dev/sdb1 /data ext4 noatime 1 1
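After remounting or rebooting, you may want to confirm that the option actually took effect. The sketch below assumes a Linux system, where /proc/mounts lists the active mount options in its fourth field (the same position as in /etc/fstab). The function name has_noatime and the /data mount point are placeholders for illustration.

```shell
# Hypothetical helper: succeed if the given mount point is listed with the
# noatime option in a mounts table (/proc/mounts by default, or any
# fstab-style file passed as the second argument).
has_noatime() {
    mount_point="$1"
    table="${2:-/proc/mounts}"
    awk -v mp="$mount_point" '
        $2 == mp {
            # Options are comma-separated; compare each one exactly so that
            # nodiratime is not mistaken for noatime.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "noatime") ok = 1
        }
        END { exit !ok }' "$table"
}
```

For example, "has_noatime /data && echo ok" prints ok only when /data is currently mounted with noatime.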
Make sure that every cluster node is accessible via ssh as root, and that the root password is identical on all cluster nodes.
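A quick way to verify this from the Cloudera Manager node is to run a no-op command on each node over ssh. This is only a sketch under the assumptions above. The function name check_root_ssh is hypothetical, and with password authentication ssh will prompt for the root password once per node.

```shell
# Hypothetical helper: run a no-op command as root on each node over ssh and
# report the nodes that cannot be reached. Returns 0 only if all succeed.
check_root_ssh() {
    failed=0
    for node in "$@"; do
        # ConnectTimeout keeps one unreachable node from stalling the loop.
        if ssh -o ConnectTimeout=5 "root@$node" true; then
            echo "ok: $node"
        else
            echo "unreachable: $node"
            failed=1
        fi
    done
    return "$failed"
}
```

For the example cluster, the arguments would be node0.mydomain.com through node4.mydomain.com.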
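Next, download and run the Cloudera Manager installer on the Cloudera Manager node. The downloaded file must be made executable before it can be run:
$ wget http://archive.cloudera.com/cloudera-manager/installer/latest/cloudera-manager-installer.bin
$ chmod u+x cloudera-manager-installer.bin
$ sudo ./cloudera-manager-installer.bin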
Now, go to http://manager.mydomain.com:7180/ in your browser to access the Cloudera Manager interface. The default login/password for CDH3 is
Add all cluster nodes, and then install and start HDFS/Hadoop on them through the Cloudera Manager interface. Once Cloudera Manager has started HDFS/Hadoop, the HDFS storage cluster will have a /tmp folder created by default.
Generate client configurations through the Cloudera Manager interface, and download the generated archive.
On the client node (which will read/write files hosted in HDFS, and initiate Hadoop jobs), do the following.
Put the FQDNs of all cluster nodes in the client node's hosts file, as was done on the cluster nodes.
Copy the downloaded global-clientconfig.zip to the client node, and unzip it. This creates a hadoop-conf directory containing the HDFS/Hadoop configuration files.
Set the environment variable that points to the Hadoop configuration directory.
$ export HADOOP_CONF_DIR=[location of hadoop-conf directory]
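Because a mistyped path here only shows up later as confusing behavior, a small sanity check can be useful. The sketch below assumes that the unzipped hadoop-conf directory contains a core-site.xml file, as Hadoop client configurations normally do. The function name check_hadoop_conf_dir is hypothetical.

```shell
# Hypothetical helper: succeed only if HADOOP_CONF_DIR is set and points to a
# directory that contains core-site.xml.
check_hadoop_conf_dir() {
    if [ -z "${HADOOP_CONF_DIR:-}" ]; then
        echo "HADOOP_CONF_DIR is not set"
        return 1
    fi
    if [ ! -f "$HADOOP_CONF_DIR/core-site.xml" ]; then
        echo "no core-site.xml in $HADOOP_CONF_DIR"
        return 1
    fi
    echo "HADOOP_CONF_DIR looks usable: $HADOOP_CONF_DIR"
}
```

Note that export only affects the current shell session. To keep the setting across logins, the same export line can be added to a shell startup file such as ~/.bashrc.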
Install Hadoop on the client node.
Finally, test if you can access HDFS from the client node as follows.
$ hadoop dfs -ls /tmp
If the above command shows the content of the local /tmp directory of the client node, instead of the /tmp directory created inside the storage cluster, something is wrong. Double-check that HADOOP_CONF_DIR is set correctly and that the configuration files are sane. If the command successfully shows the /tmp directory created inside the storage cluster, you are ready to start a Hadoop job from the client node.
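One likely cause of seeing the local /tmp is that the client did not pick up the cluster's default file system. In Hadoop versions of the CDH3 era, this is the fs.default.name property in core-site.xml, and it falls back to the local file system (file:///) when not set. The following sketch prints that value from a configuration directory. The function name default_fs is hypothetical, and the sed pattern assumes the usual layout where the name element comes right before the value element.

```shell
# Hypothetical helper: print the value of fs.default.name from core-site.xml
# in the given configuration directory (default: $HADOOP_CONF_DIR).
default_fs() {
    conf_dir="${1:-$HADOOP_CONF_DIR}"
    # Join the file into one line so that <name> and <value> can be matched
    # even when they sit on separate lines, then capture the value text.
    tr -d '\n' < "$conf_dir/core-site.xml" |
        sed -n 's/.*<name>fs\.default\.name<\/name>[^<]*<value>\([^<]*\)<\/value>.*/\1/p'
}
```

If this prints nothing, or a value that does not start with hdfs://, the hadoop command will not be talking to the storage cluster.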
Please note that this article is published by Xmodulo.com under a Creative Commons Attribution-ShareAlike 3.0 Unported License. If you would like to use the whole or any part of this article, you need to cite this web page at Xmodulo.com as the original source.