What is Hadoop?
Hadoop is an open source Apache project written in Java and designed to provide users with two things: a distributed file system (HDFS) and a method for distributed computation. It's based on Google's published Google File System and MapReduce papers, which describe how to build a framework capable of executing intensive computations across tons of computers. Something that might, you know, be helpful in building a giant search index. Read the Hadoop project description and wiki for more information and background on Hadoop.
Caveat Emptor
I'm one of the few who has invested the time to set up an actual distributed Hadoop installation on Windows. I've used it for some successful development tests. I have not used this in production. Also, although I can get around in a Linux/Unix environment, I'm no expert, so some of the advice below may not be the correct way to configure things. I'm also no security expert. If any of you out there have corrections or advice for me, please let me know in a comment and I'll get it fixed. This guide uses Hadoop v0.17 and assumes that you don't have any previous Hadoop installation. I've also done my primary work with Hadoop on Windows XP. Where I'm aware of differences between XP and Vista, I've tried to note them. Please comment if something I've written is not appropriate for Vista. Bottom line: your mileage may vary, but this guide should get you started running Hadoop on Windows.
Hadoop can run in one of three modes:

Standalone: All Hadoop functionality runs in one Java process. This works out of the box and is trivial to use on any platform, Windows included.

Pseudo-Distributed: Hadoop functionality all runs on the local machine, but the various components run as separate processes. This is much more like real Hadoop, and it does require some configuration as well as SSH. It does not, however, permit distributed storage or processing across multiple machines.

Fully Distributed: Hadoop functionality is distributed across a cluster of machines. Each machine participates in somewhat different (and occasionally overlapping) roles. This allows multiple machines to contribute processing power and storage to the cluster.
The Hadoop Quickstart can get you started on Standalone mode and Pseudo-Distributed mode (to some degree). Take a look at that if you're not ready for Fully Distributed. This guide focuses on the Fully Distributed mode of Hadoop. After all, it's the most interesting mode, since that's where you're actually doing real distributed computing.
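If you just want to see Hadoop do something first, standalone mode really is trivial: unpack the Hadoop download and run the grep example from the Quickstart entirely in one Java process. It looks something like the following (the examples jar name assumes the 0.17.0 release; adjust it if yours differs):

$> mkdir input
$> cp conf/*.xml input
$> bin/hadoop jar hadoop-0.17.0-examples.jar grep input output 'dfs[a-z.]+'
$> cat output/*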
Pre-Requisites
Java

I'm assuming that if you're interested in running Hadoop, you're familiar with Java programming and have Java installed on all the machines on which you want to run Hadoop. The Hadoop docs recommend Java 6 and require at least Java 5. Whichever you choose, you need to make sure that you have the same major Java version (5 or 6) installed on each machine. Also, any code you write to run using Hadoop's MapReduce must be compiled with the version you choose. If you don't have Java installed, go get it from Sun and install it. I will assume you're using Java 6 in the rest of this guide.

Cygwin

As I said in the introduction, Hadoop assumes Linux (or a Unix-flavor OS) is being used to run Hadoop. This assumption is buried pretty deeply. Various parts of Hadoop are executed using shell scripts that will only work in a Linux-style shell. It also uses passwordless secure shell (SSH) to communicate between computers in the Hadoop cluster. The best way to do these things on Windows is to make Windows act more like Linux. You can do this using Cygwin, which provides a Linux-like environment for Windows that allows you to use Linux-style command line utilities as well as run really useful Linux-centric software like OpenSSH. Go download the latest version of Cygwin. Don't install it yet. I'll describe how you need to install it below.

Hadoop

Go download Hadoop core. I'm writing this guide for version 0.17 and I will assume that's what you're using.
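Before going further, it's worth confirming that every machine really does report the same major Java version. From a command prompt on each machine:

$> java -version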
More than one Windows PC on a LAN

It should probably go without saying that to follow this guide, you'll need to have more than one PC. I'm going to assume you have two computers and that they're both on your LAN. Go ahead and designate one to be the Master and one to be the Slave. These machines together will be your cluster. The Master will be responsible for ensuring the Slaves have work to do (such as storing data or running MapReduce jobs). The Master can also do its share of this work as well. If you have more than two PCs, you can always set up Slave2, Slave3 and so on. Some of the steps below will need to be performed on all your cluster machines, some on just the Master or the Slaves. I'll note which apply for each step.
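Throughout this guide I'll refer to the machines by the hostnames master and slave. If your machines can't already resolve each other by those names, one simple option is to add entries to each machine's hosts file (C:\WINDOWS\system32\drivers\etc\hosts on XP and Vista). The IP addresses below are placeholders; substitute your machines' actual LAN addresses:

192.168.1.10    master
192.168.1.11    slave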
As mentioned above, Hadoop uses SSH to issue commands from one machine in the cluster to another, and it needs to log in without stopping to ask for a password. SSH logins can be authenticated either with a password or with a public/private key pair, so you'll need to set up SSH to do the latter. I'm not going to go into great detail on how this all works, but suffice it to say that you're going to do the following:

1. Generate a public/private key pair for your user on each cluster machine.
2. Exchange each machine user's public key with each other machine user in the cluster.

Generate public/private key pairs

To generate a key pair, open Cygwin and issue the following commands ($> is the command prompt):
$> ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$> cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now, you should be able to SSH into your local machine using the following command:
$> ssh localhost
When prompted for your password, enter it. You'll see something like the following in your Cygwin terminal.
hayes@localhost's password:
Last login: Sun Jun 8 19:47:14 2008 from localhost

hayes@calculon ~
$>

To quit the SSH session and go back to your regular terminal, use:
$> exit
Make sure to do this on all computers in your cluster.

Exchange public keys

Now that you have public and private key pairs on each machine in your cluster, you need to share your public keys around to permit passwordless login from one machine to the other. Once a machine has a public key, it can safely authenticate a request from a remote machine that is encrypted using the private key that matches that public key. On the master, issue the following command in Cygwin (where <slaveusername> is the username you use to log in to Windows on the slave computer):
$> scp ~/.ssh/id_dsa.pub <slaveusername>@slave:~/.ssh/master-key.pub
Enter your password when prompted. This will copy your public key file in use on the master to the slave. On the slave, issue the following command in Cygwin:
$> cat ~/.ssh/master-key.pub >> ~/.ssh/authorized_keys
This will append your public key to the set of authorized keys the slave accepts for authentication purposes. Back on the master, test this out by issuing the following command in Cygwin:
$> ssh <slaveusername>@slave
If all is well, you should be logged into the slave computer with no password required. Repeat this process in reverse, copying the slave's public key to the master. Also, make sure to exchange public keys between the master and any other slaves that may be in your cluster.

Configure SSH to use default usernames (optional)

If all of your cluster machines are using the same username, you can safely skip this step. If not, read on. Most Hadoop tutorials suggest that you set up a user specific to Hadoop. If you want to do that, you certainly can. Why set up a specific user for Hadoop? Well, in addition to being cleaner from a file permissions and security perspective, when Hadoop uses SSH to issue commands from one machine to another it will automatically try to log in to the remote machine using the same username as on the current machine. If you have different users on different machines, the SSH login performed by Hadoop will fail. However, most of us on Windows typically use our machines with a single user and would probably prefer not to have to set up a new user on each machine just for Hadoop. The way to allow Hadoop to work with multiple users is to configure SSH to automatically select the appropriate user when Hadoop issues its SSH command. (You'll also need to edit the hadoop-env.sh config file, but that comes later in this guide.) You can do this by editing the file named config (no extension) located in the same .ssh directory where you stored your public and private keys for authentication. Cygwin stores this directory under c:\cygwin\home\<windowsusername>\.ssh. On the master, create a file called config and add the following lines (replacing <slaveusername> with the username you're using on the Slave machine):
Host slave
  User <slaveusername>
If you have more slaves in your cluster, add Host and User lines for those as well. On each slave, create a file called config and add the following lines (replacing <masterusername> with the username you're using on the Master machine):
Host master
  User <masterusername>
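One gotcha: OpenSSH refuses to use a config file with overly loose permissions. If ssh complains about the owner or permissions on the file, tighten it from Cygwin:

$> chmod 600 ~/.ssh/config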
Now test this out. On the master, go to Cygwin and issue the following command:
$> ssh slave
You should be automatically logged into the slave machine with no username and no password required. Make sure to exit out of your SSH session. For more information on this configuration file's format and what it does, go here or run man ssh_config in Cygwin.
This means that your Java home directory is wrong. Go back and make sure you specified the correct directory and used the appropriate escaping.
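For reference, the Java home is set via the JAVA_HOME line in conf/hadoop-env.sh. Here's a minimal sketch, assuming a JDK 6 install under Program Files; the exact path is an assumption, and since spaces in the path tend to trip up the shell scripts, the 8.3 short name Progra~1 (or a Java install in a space-free directory) is the safest bet:

# conf/hadoop-env.sh
# Progra~1 is the 8.3 short name for "Program Files"; adjust the JDK folder to your install
export JAVA_HOME=/cygdrive/c/Progra~1/Java/jdk1.6.0_07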
The conf/slaves file on the master lists the machines that will run datanodes and tasktrackers, one hostname per line; in this two-machine cluster it contains:

master
slave
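With the configuration in place, start the distributed file system from the Hadoop directory on the master. The standard script for this in 0.17 is start-dfs.sh:

$> bin/start-dfs.sh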
You should see output somewhat like the following (note that I have 2 slaves in my cluster, which has a cluster ID of Appozite; your mileage will vary somewhat):
starting namenode, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-namenode-calculon.out
master: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-datanode-calculon.out
slave: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-datanode-hayes-daviss-macbook-pro.local.out
slave2: starting datanode, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-datanode-XTRAPUFFYJR.out
master: starting secondarynamenode, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-secondarynamenode-calculon.out
To see whether your distributed file system is actually running across multiple machines, you can open the Hadoop DFS web interface, which runs on your master on port 50070. If you're on the master, you can probably open it by clicking this link: http://localhost:50070. Below is a screenshot of my cluster. As you can see, there are 3 nodes with a total of 712.27 GB of space.
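You can get the same information from the Cygwin command line on the master; the dfsadmin report lists each datanode along with its capacity and usage:

$> bin/hadoop dfsadmin -report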
Starting MapReduce

To start the MapReduce part of Hadoop, issue the following command:
$> bin/start-mapred.sh
You should see output similar to the following (again noting that Ive got 3 nodes in my cluster):
starting jobtracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-jobtracker-calculon.out
master: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-tasktracker-calculon.out
slave: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-tasktracker-hayes-daviss-macbook-pro.local.out
slave2: starting tasktracker, logging to /usr/local/hadoop/bin/../logs/hadoop-Appozite-tasktracker-XTRAPUFFYJR.out
You can view your MapReduce setup using the MapReduce monitoring web app that comes with Hadoop, which runs on port 50030 of your master node. If you're on the master, you can probably open it by clicking this link: http://localhost:50030. Below is a screenshot from my browser. There's not much exciting to see here until you have an actual MapReduce job running.
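You can also query the JobTracker from the command line; the exact options vary a little between Hadoop versions, but listing the currently running jobs looks something like this:

$> bin/hadoop job -list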
Testing it out
Now that you've got your Hadoop cluster up and running, executing MapReduce jobs or writing to and reading from DFS is no different on Windows than on any other platform, so long as you use Cygwin to execute commands. At this point, I'll refer you to Michael Noll's Hadoop on Ubuntu Linux tutorial for an explanation of how to run a MapReduce job large enough to take advantage of your cluster. (Note that he's using Hadoop 0.16.0 instead of 0.17.0, so you'll replace 0.16.0 with 0.17.0 where applicable.) Follow his instructions and you should be good to go. The Hadoop site also offers a MapReduce tutorial so you can get started writing your own jobs in Java. If you're interested in writing MapReduce jobs in other languages that take advantage of Hadoop, check out the Hadoop Streaming documentation.
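As a quick smoke test of the cluster itself, something like the following copies a local directory of text files into DFS, runs the WordCount example that ships with Hadoop over it, and prints the first reducer's output. The local path, the DFS directory names and the exact examples jar name are assumptions here; adjust them to match your own setup and download:

$> bin/hadoop dfs -copyFromLocal /cygdrive/c/tmp/books books
$> bin/hadoop jar hadoop-0.17.0-examples.jar wordcount books books-output
$> bin/hadoop dfs -cat books-output/part-00000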
When you're finished, you can shut the cluster down from the master by running bin/stop-mapred.sh followed by bin/stop-dfs.sh. Stopping DFS produces output like the following:

stopping namenode
master: stopping datanode
slave: stopping datanode
slave2: stopping datanode
master: stopping secondarynamenode
And that's it
I hope this helps anyone out there trying to run Hadoop on Windows. If any of you have corrections, questions or suggestions please comment and let me know. Happy Hadooping!
Above, we discussed the ability of MapReduce to distribute computation over multiple servers. For that computation to take place, each server must have access to the data. This is the role of HDFS, the Hadoop Distributed File System. HDFS and MapReduce are robust: servers in a Hadoop cluster can fail without aborting the computation, and HDFS ensures data is replicated redundantly across the cluster. On completion of a calculation, a node writes its results back into HDFS. There are no restrictions on the data that HDFS stores. Data may be unstructured and schemaless. By contrast, relational databases require that data be structured and schemas be defined before storing the data. With HDFS, making sense of the data is the responsibility of the developer's code. Programming Hadoop at the MapReduce level means working with the Java APIs and manually loading data files into HDFS.
Pig is a programming language that simplifies the common tasks of working with
Hadoop: loading data, expressing transformations on the data, and storing the final results. Pig's built-in operations can make sense of semi-structured data, such as log files, and the language is extensible using Java to add support for custom data types and transformations.
Hive enables Hadoop to operate as a data warehouse: it superimposes structure on data in HDFS and then permits queries over the data using a familiar SQL-like syntax. As with Pig, Hive's core capabilities are extensible. Choosing between Hive and Pig can be confusing. Hive is more suitable for data warehousing tasks, with predominantly static structure and the need for frequent analysis. Hive's closeness to SQL makes it an ideal point of integration between Hadoop and other business intelligence tools. Pig gives the developer more agility for the exploration of large datasets, allowing the development of succinct scripts for transforming data flows for incorporation into larger applications. Pig is a thinner layer over Hadoop than Hive, and its main advantage is to drastically cut the amount of code needed compared to direct use of Hadoop's Java APIs. As such, Pig's intended audience remains primarily the software developer.
HBase is a column-oriented database that runs on top of HDFS and, unlike plain HDFS, offers random, real-time read/write access to data. In order to grant random access to the data, HBase does impose a few restrictions: Hive performance with HBase is 4-5 times slower than with plain HDFS, and the maximum amount of data you can store in HBase is approximately a petabyte, versus HDFS' limit of over 30PB. HBase is ill-suited to ad hoc analytics and more appropriate for integrating big data as part of a larger application. Use cases include logging, counting and storing time-series data.
Though not strictly part of Hadoop, Whirr is a highly complementary component. It offers a way of running services, including Hadoop, on cloud platforms. Whirr is cloud neutral and currently supports the Amazon EC2 and Rackspace services.
Using Hadoop
Normally, you will use Hadoop in the form of a distribution. Much as with Linux before it, vendors integrate and test the components of the Apache Hadoop ecosystem and add in tools and administrative features of their own. Though not per se a distribution, a managed cloud installation of Hadoop's MapReduce is also available through Amazon's Elastic MapReduce service.