Distributed data processing with Hadoop, Part 2: Going further
Install and configure a multinode cluster
M. Tim Jones
Independent author
03 June 2010
The first article in this series showed how to use Hadoop in a single-node cluster. This article
continues with a more advanced setup that uses multiple nodes for parallel processing. It
demonstrates the various node types required for multinode clusters and explores MapReduce
functionality in a parallel environment. This article also digs into the management aspects of
Hadoop, both command line and Web based.
The true power of the Hadoop distributed computing architecture lies in its distribution. In other
words, the ability to distribute work to many nodes in parallel permits Hadoop to scale to large
infrastructures and, likewise, to the processing of large amounts of data. This article starts with a
decomposition of a distributed Hadoop architecture, and then explores distributed configuration
and use.
As shown in Figure 1, the master node consists of the namenode, secondary namenode, and
jobtracker daemons (the so-called master daemons). In addition, this is the node from which
you manage the cluster for the purposes of this demonstration (using the Hadoop utility and a
browser). The slave nodes consist of the tasktracker and the datanode (the slave daemons). The
distinction in this setup is that the master node contains the daemons that provide management
and coordination of the Hadoop cluster, whereas the slave nodes contain the daemons that
implement the storage functions for the Hadoop Distributed File System (HDFS) and the
MapReduce functionality (the data processing function).
For this demonstration, you create a master node and two slave nodes sitting on a single LAN.
This setup is shown in Figure 2. Now, let's explore the installation of Hadoop for multinode
distribution and its configuration.
To simplify the deployment, you employ virtualization, which provides a few advantages. Although
performance may not benefit in this setting, virtualization makes it possible to create a Hadoop
installation once and then clone it for the other nodes. For this reason, your Hadoop cluster
should appear as follows, running the master and slave nodes as virtual machines (VMs) in the
context of a hypervisor on a single host (see Figure 3).
Upgrading Hadoop
In Part 1, you installed a special distribution of Hadoop that ran on a single node (pseudo-configuration). In this article, you update to a distributed configuration. If you've begun this article
series here, read through Part 1 to install the Hadoop pseudo-configuration first.
In the pseudo-configuration, you performed no configuration, as everything was preconfigured for
a single node. Now, you need to update the configuration. First, check the current configuration
using the update-alternatives command as shown in Listing 1. This command tells you that the
configuration is using conf.pseudo (the highest priority).
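If you want to see this yourself, a command along the following lines should display the available
configurations and their priorities (the alternative name hadoop-0.20-conf reflects the Cloudera
packaging used in Part 1 and is an assumption here):

$ update-alternatives --display hadoop-0.20-conf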
Next, create a new configuration by copying an existing one (in this case, conf.empty, as shown in
Listing 1):
$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.dist
$
Now, you have a new configuration called conf.dist that you'll use for your new distributed
configuration. At this stage, because you're running in a virtualized environment, you clone this
node into two additional nodes that will serve as the data nodes.
So, on the master node, you update /etc/hadoop-0.20/conf.dist/masters to identify the master
node, which appears as:
master
and then identify the slave nodes in /etc/hadoop-0.20/conf.dist/slaves, which contains the following
two lines:
slave1
slave2
Next, from each node, connect through Secure Shell (ssh) to each of the other nodes to ensure
that pass-phraseless ssh is working. Each of these files (masters, slaves) is used by the Hadoop
start and stop utilities that you used in Part 1 of this series.
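If pass-phraseless ssh isn't set up yet, one way to do it is sketched below (this assumes you're
working as root on each node, matching the prompts used elsewhere in this article):

$ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave1
$ ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave2
$ ssh slave1 hostname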
Next, continue with Hadoop-specific configuration in the /etc/hadoop-0.20/conf.dist subdirectory.
The following changes are required on all nodes (master and both slaves), as defined by the
Hadoop documentation. First, identify the HDFS master in the file core-site.xml (Listing 4), which
defines the host and port of the namenode (note the use of the master node's IP address). The file
core-site.xml defines the core properties of Hadoop.
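As a sketch of what this definition looks like, a minimal core-site.xml takes roughly the following
form (the IP address 192.168.108.133 and port 9000 are illustrative placeholders; substitute your
master node's address):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.108.133:9000</value>
  </property>
</configuration>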
Next, identify the MapReduce jobtracker. This jobtracker could exist on its own node, but for this
configuration, place it on the master node as shown in Listing 5. The file mapred-site.xml contains
the MapReduce properties.
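Again as an illustrative sketch (the host address and port are placeholders), the jobtracker
definition in mapred-site.xml looks something like this:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.108.133:9001</value>
  </property>
</configuration>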
Finally, define the default replication factor (Listing 6). This value defines the number of replicas
that will be created and is commonly no larger than three. In this case, you define it as 2 (the
number of your datanodes). This value is defined in hdfs-site.xml, which contains the HDFS
properties.
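A sketch of the corresponding hdfs-site.xml entry, with the replication factor set to 2 as described:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>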
The configuration items shown in Listing 4, Listing 5, and Listing 6 are the required elements
for your distributed setup. Hadoop provides a large number of configuration options here, which
allow you to tailor the entire environment. The Resources section provides more information on
what's available.
With your configuration complete, the next step is to format your namenode (the HDFS master
node). For this operation, use the hadoop-0.20 utility, specifying the namenode and operation (format):
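The command takes roughly this form (shown here run as root, as elsewhere in this setup):

root@master:~# hadoop-0.20 namenode -format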
With your namenode formatted, it's time to start the Hadoop daemons. You do this exactly as you
did for the pseudo-distributed configuration in Part 1, but here the same process brings the
daemons up across the distributed configuration. Note that this step starts the namenode and
secondary namenode (as indicated by the jps command):
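A sketch of this step follows (the script path reflects the Cloudera package layout and may differ
on your system; jps on the master should then list NameNode and SecondaryNameNode):

root@master:~# /usr/lib/hadoop-0.20/bin/start-dfs.sh
root@master:~# jps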
If you now inspect one of the slave nodes (data nodes) using jps, you'll see that a datanode
daemon now exists on each node:
The next step is to start the MapReduce daemons (jobtracker and tasktracker). You do this as
shown in Listing 10. Note that the script starts the jobtracker on the master node (as defined by
your configuration; see Listing 5) and the tasktrackers on each slave node. A jps command on the
master node shows that the jobtracker is now running.
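As a sketch (again, the script path reflects the Cloudera layout), starting the MapReduce daemons
and checking the master looks like this; jps on the master should now also list JobTracker:

root@master:~# /usr/lib/hadoop-0.20/bin/start-mapred.sh
root@master:~# jps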
Finally, check a slave node with jps. Here, you see that a tasktracker daemon has joined the
datanode daemon on each slave node:
The relationships between the start scripts, the nodes, and the daemons that are started are
shown in Figure 4. As you can see, the start-dfs script starts the namenodes and datanodes,
whereas the start-mapred script starts the jobtracker and tasktrackers.
Figure 4. Relationship of the start scripts and daemons for each node
Testing HDFS
Now that Hadoop is up and running across your cluster, you can run a couple of tests to
ensure that it's operational (see Listing 12). First, issue a file system command (fs) through the
hadoop-0.20 utility and request a df (disk free) operation. As with Linux, this command simply
identifies the space consumed and available for the particular device. So, with a newly formatted
file system, you've used no space. Next, perform an ls operation on the root of HDFS, create a
subdirectory, list its contents, and remove it. Finally, you can perform an fsck (file system check)
on HDFS using the fsck command within the hadoop-0.20 utility. All this tells you, along with a
variety of other information (such as the fact that two datanodes were detected), that the file
system is healthy.
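A sketch of this sequence of checks (the test subdirectory name is arbitrary):

root@master:~# hadoop-0.20 fs -df
root@master:~# hadoop-0.20 fs -ls /
root@master:~# hadoop-0.20 fs -mkdir test
root@master:~# hadoop-0.20 fs -ls test
root@master:~# hadoop-0.20 fs -rmr test
root@master:~# hadoop-0.20 fsck /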
Next, kick off the wordcount MapReduce job. As in the pseudo-distributed model, you specify your
input subdirectory (which contains the input files) and the output directory (which doesn't exist but
will be created by the namenode and populated with the result data):
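As a sketch, using input and output as the directory names from Part 1 (the examples JAR path
and version vary with your installation, so a glob is used here):

root@master:~# hadoop-0.20 jar /usr/lib/hadoop-0.20/hadoop-*-examples.jar wordcount input output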
The final step is to explore the output data. Because you ran the wordcount MapReduce job,
the result is a single file (reduced from the processed map files). This file contains a list of tuples
representing the words found in the input files and the number of times they appeared in all input
files:
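A sketch of examining the result (the part file name follows the usual Hadoop output naming, so a
glob is used to match it):

root@master:~# hadoop-0.20 fs -ls output
root@master:~# hadoop-0.20 fs -cat output/part-*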
Through the jobtracker, you can inspect running or completed jobs. In Figure 6, you can see an
inspection of your last job (from Listing 14). This figure shows the various data emitted for the
Java archive (JAR) request, as well as the status and number of tasks. Note here that two map
tasks were performed (one for each input file) and one reduce task (to reduce the two map inputs).
Finally, you can check on the status of your datanodes through the namenode. The namenode
main page identifies the number of live and dead nodes (as links) and allows you to inspect them
further. The page shown in Figure 7 shows your live datanodes in addition to statistics for each.
Many other views are possible through the namenode and jobtracker Web interfaces, but for
brevity, only this sample set is shown. Within the namenode and jobtracker Web pages, you'll find
a number of links that will take you to additional information about Hadoop configuration and
operation (including runtime logs).
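For reference, in Hadoop 0.20 these Web interfaces typically listen on the following default ports
(your ports may differ if reconfigured):

http://master:50070/   (namenode status)
http://master:50030/   (jobtracker status)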
Going further
With this installment, you've seen how a pseudo-distributed configuration from Cloudera can be
transformed into a fully distributed configuration. Surprisingly few steps, along with an identical
interface for MapReduce applications, make Hadoop a uniquely useful tool for distributed
processing. Also interesting is the scalability of Hadoop: by adding new datanodes (along with
updating their XML files and the slaves file on the master), you can easily scale Hadoop to even
higher levels of parallel processing. Part 3, the final installment in this Hadoop series, explores
how to develop a MapReduce application for Hadoop.
Resources
Learn
Part 1 of this series, Distributed data processing with Hadoop, Part 1: Getting started
(developerWorks, May 2010), showed how to install Hadoop for a pseudo-distributed
configuration (in other words, running all daemons on a single node).
The Cloudera distribution, used by this series of articles, comes in a number of form factors,
from an installable package to source or a VM. You can learn more about Cloudera at its
main site. Cloudera also maintains a nice set of documentation for installing and using
Hadoop (in addition to Pig and Hive -- respectively, Hadoop's large data set manipulation
language and a data warehouse infrastructure built on Hadoop).
IBM InfoSphere BigInsights Basic Edition -- IBM's Hadoop distribution -- is an integrated, tested,
and pre-configured, no-charge download for anyone who wants to experiment with and learn
about Hadoop.
Find free courses on Hadoop fundamentals, stream computing, text analytics, and more at
Big Data University.
Check out the cluster setup at Apache.org for a full list of the properties for core-site.xml,
mapred-site.xml, and hdfs-site.xml.
See Michael Noll's useful resources for using Hadoop in addition to other interesting topics.
Yahoo! provides a great set of resources for Hadoop at its developer network. Of particular note
is the Yahoo! Hadoop Tutorial, which introduces Hadoop and provides a detailed discussion of
its use and configuration.
In "Distributed computing with Linux and Hadoop" (developerWorks, December 2008)
and the more recent "Cloud computing with Linux and Apache Hadoop" (developerWorks,
October 2009), learn more about Hadoop and its architecture.
In the developerWorks Linux zone, find hundreds of how-to articles and tutorials, as well
as downloads, discussion forums, and a wealth of other resources for Linux developers and
administrators.
Stay current with developerWorks technical events and webcasts focused on a variety of IBM
products and IT industry topics.
Attend a free developerWorks Live! briefing to get up-to-speed quickly on IBM products and
tools as well as IT industry trends.
Watch developerWorks on-demand demos ranging from product installation and setup demos
for beginners, to advanced functionality for experienced developers.
Follow developerWorks on Twitter, or subscribe to a feed of Linux tweets on developerWorks.
Get products and technologies
Hadoop is developed through the Apache Software Foundation.
Download IBM InfoSphere BigInsights Basic Edition at no charge and build a solution that
turns large, complex volumes of data into insight by combining Apache Hadoop with unique
technologies and capabilities from IBM.
Evaluate IBM products in the way that suits you best: Download a product trial, try a product
online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox
learning how to implement Service Oriented Architecture efficiently.
Discuss
Get involved in the My developerWorks community. Connect with other developerWorks
users while exploring the developer-driven blogs, forums, groups, and wikis.
Page 14 of 15
ibm.com/developerWorks/
developerWorks
Page 15 of 15