Anda di halaman 1dari 63

Hadoop Shell Commands

Posted on November 26, 2013 by aravindu012


1 Comment Leave a comment

This post will help users to learn important and useful Hadoop shell commands which are very
close to Unix shell commands., using these commands user can perform different operations on
hdfs. People who are familiar with Unix shell commands can easily get hold of it, people who
are new to Unix or hadoop need not to worry, just follow this article to learn all the commands
used in day to day basis and practice the same.

FS Shell
The FileSystem (FS) shell is invoked by bin/hadoop fs <args>. All the FS shell commands take
path URIs as arguments. The URI format is scheme://autority/path. For HDFS the scheme
is hdfs, and for the local filesystem the scheme is file. The scheme and authority are optional. If
not specified, the default scheme specified in the configuration is used. An HDFS file or
directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply
as /parent/child.

Administrator Commands:

fsck /
Run a HDFS filesystem checking Utility

Example: hadoop fsck - /


balancer
Runs a cluster balancing utility. An administrator can simply press Ctrl-C to stop the re-
balancing process.

Example: hadoop balancer

version
Prints the hadoop version configured on the machine

Example: hadoop version


Hadoop fs commands:

ls
For a file returns stat on the file with the following format:
filename <number of replicas> filesize modification_date modification_time permissions userid
groupid For a directory it returns list of its direct children as in Unix. A directory is listed
as: dirname <dir> modification_time modification_time permissions userid groupid

Usage: hadoop fs -ls <args>

Example: hadoop fs -ls

lsr
Recursive version of ls. Similar to Unix ls -R.

Usage: hadoop fs -lsr <args>

Example hadoop fs -lsr


mkdir
Takes path uris as an argument and creates directories. The behavior is much like unix mkdir -p
creating parent directories along the path.

Usage: hadoop fs -mkdir <paths>

Example: hadoop fs -mkdir /user/hadoop/Aravindu

mv
Moves files from source to destination. This command allows multiple sources as well in which
case the destination needs to be a directory. Moving files across file systems is not permitted.

Usage: hadoop fs -mv URI [URI ] <dest>

Example: hadoop fs -mv /user/hduser/Aravindu/Consolidated_Sheet.csv


/user/hduser/sandela/

put
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads
input from stdin and writes to destination filesystem.

Usage: hadoop fs -put <localsrc> <dst>


Example: hadoop fs -put localfile /user/Hadoop/hadoopfile
hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir

rm
Delete files specified as args. Only deletes non empty directory and files. Refer to rmr for
recursive deletes.

Usage: hadoop fs -rm URI [URI ]

Example: hadoop fs -rm /user/hduser/sandela/Consolidated_Sheet.csv

rmr
Recursive version of deleting.

Usage: hadoop fs -rmr URI [URI ]

Example: hadoop fs -rmr /user/hduser/sandela

cat
The cat command concatenates and display files, it works similar to Unix cat command:

Usage: hadoop fs -cat URI [URI ]

Example: hadoop fs -cat /user/hduser/Aravindu/Consolidated_sheet.csv


chgrp
Usage: hadoop fs -chgrp [-R] GROUP URI [URI ]

chmod
Change the permissions of files. With -R, make the change recursively through the directory
structure. The user must be the owner of the file, or else a super-user.

Usage: hadoop fs -chmod [-R] <MODE[,MODE] | OCTALMODE> URI [URI ]

chown
Change the owner of files. With -R, make the change recursively through the directory structure.
The user must be a super-user.

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]

copyFromLocal
Copies file form local machine and paste in given hadoop directory

Usage: hadoop fs -copyFromLocal <localsrc> URI


Example: hadoop fs copyFromLocal
/home/hduser/contact_details/output/job_titles/11-26-2013part-r-00000
/user/hduser/finaloutput_112613/

Similar to put command.

copyToLocal
Coppies file from hadoop directory and paste the file in local direcotry

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Example: hadoop fs copyToLocal


/user/hduser/output/Expecting_result_set_112613/part-r-00000
/home/hduser/contact_wordcount/11-26-2013

Similar to get command.

cp
Copy files from source to destination. This command allows multiple sources as well in which
case the destination must be a directory.

Usage: hadoop fs -cp URI [URI ] <dest>

Example:hadoop fs -cp /user/hduser/input/Consolodated_Sheet.csv


/user/hduser/Aravindu
hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
du
Displays aggregate length of files contained in the directory or the length of a file in case its just
a file.

Usage: hadoop fs -du URI [URI ]

Example: hadoop fs -du /user/hduser/Aravindu/Consolidated_Sheet.csv

dus
Displays a summary of file lengths.

Usage: hadoop fs -dus <args>

Example: hadoop fs dus /user/hduser/Aravindu/Consolidated_Sheet.csv

count:
Count the number of directories, files and bytes under the paths that match the specified file
pattern

Example: hadoop fs count hdfs:/


expunge
Empty the Trash.

Usage: hadoop fs -expunge

Example: hadoop fs expunge

setrep
Changes the replication factor of a file. -R option is for recursively increasing the replication
factor of files within a directory.

Usage: hadoop fs -setrep [-R] <path>

Example: hadoop fs -setrep -w 3 -R /user/hadoop/Aravindu

stat
Returns the stat information on the path.

Usage: hadoop fs -stat URI [URI ]

Example: hadoop fs -stat /user/hduser/Aravindu

tail
Displays last kilobyte of the file to stdout. -f option can be used as in Unix.

Usage: hadoop fs -tail [-f] URI


Example: hadoop fs -tail /user/hduser/Aravindu/Info

text
Takes a source file and outputs the file in text format. The allowed formats are zip and
TextRecordInputStream.

Usage: hadoop fs -text <src>

Example: hadoop fs text /user/hduser/Aravindu/Info

touchz
Create a file of zero length.

Usage: hadoop fs -touchz URI [URI ]

Example: hadoop -touchz /user/hduser/Aravind/emptyfile


Installing single node Hadoop 2.2.0 on Ubuntu
Posted on November 2, 2013 by aravindu012
10 Comments Leave a comment

Please find the complete step by step process for installing Hadoop 2.2.0 stable version on
Ubuntu as requested by many of this blog visitors, friends and subscribers.

Apache Hadoop 2.2.0 release has significant changes compared to its previous stable release,
which is Apache Hadoop 1.2.1(Setting up Hadoop 1.2.1 can be found here).

In short , this release has a number of changes compared to its earlier version 1.2.1:

YARN A general purpose resource management system for Hadoop to allow


MapReduce and other data processing frameworks like Hive, Pig and Services

High Availability for HDFS

HDFS Federation, Snapshots

NFSv3 access to data in HDFS

Introduced Application Manager to manage the application life cycle

Support for running Hadoop on Microsoft Windows


HDFS Symlinks feature is disabled & will be taken out in future versions

Jobtracker has been replaced with Resource Manager and Node Manager

Before starting into setting up Apache Hadoop 2.2.0, please understand the concepts of Big Data
and Hadoop from my previous blog posts:

Big Data Characteristics, Problems and Solution.

What is Apache Hadoop?.

Setting up Single node Hadoop Cluster.

Setting up Multi node Hadoop Cluster.

Understanding HDFS architecture (in comic format).

Setting up the environment:

In this tutorial you will know step by step process for setting up a Hadoop Single Node cluster,
so that you can play around with the framework and learn more about it.

In This tutorial we are using following Software versions, you can download same by clicking
the hyperlinks:

Ubuntu Linux 12.04.3 LTS

Hadoop 2.2.0, released in October, 2013

If you are using putty to access your Linux box remotely, please install openssh by running this
command, this also helps in configuring SSH access easily in the later part of the installation:

sudo apt-get install openssh-server


Prerequisites:

1. Installing Java v1.7

2. Adding dedicated Hadoop system user.

3. Configuring SSH access.

4. Disabling IPv6.

Before starting of installing any applications or softwares, please makes sure your list of
packages from all repositories and PPAs is up to date or if not update them by using this
command:
sudo apt-get update

1. Installing Java v1.7:

For running Hadoop it requires Java v1. 7+

a. Download Latest oracle Java Linux version of the oracle website by using this command

wget https://edelivery.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-
x64.tar.gz
If it fails to download, please check with this given command which helps to avoid passing
username and password.

wget --no-cookies --no-check-certificate --header "Cookie: gpw_e24=http%3A%2F


%2Fwww.oracle.com" "https://edelivery.oracle.com/otn-pub/java/jdk/7u45-
b18/jdk-7u45-linux-x64.tar.gz"

b. Unpack the compressed Java binaries, in the directory:

sudo tar xvzf jdk-7u25-linux-x64.tar.gz

c. Create a Java directory using mkdir under /user/local/ and change the directory to
/usr/local/Java by using this command

mkdir -R /usr/local/Java
cd /usr/local/Java
d. Copy the Oracle Java binaries into the /usr/local/Java directory.

sudo cp -r jdk-1.7.0_45 /usr/local/java

e. Edit the system PATH file /etc/profile and add the following system variables to your system
path

sudo nano /etc/profile or sudo gedit /etc/profile

f. Scroll down to the end of the file using your arrow keys and add the following lines below to
the end of your /etc/profile file:

JAVA_HOME=/usr/local/Java/jdk1.7.0_45
PATH=$PATH:$HOME/bin:$JAVA_HOME/bin
export JAVA_HOME
export PATH

g. Inform your Ubuntu Linux system where your Oracle Java JDK/JRE is located. This will tell
the system that the new Oracle Java version is available for use.

sudo update-alternatives --install "/usr/bin/javac" "javac"


"/usr/local/java/jdk1.7.0_45/bin/javac" 1
sudo update-alternatvie --set javac /usr/local/Java/jdk1.7.0_45/bin/javac

This command notifies the system that Oracle Java JDK is available for use

h. Reload your system wide PATH /etc/profile by typing the following command:
. /etc/profile
Test to see if Oracle Java was installed correctly on your system.

Java -version

2. Adding dedicated Hadoop system user.


We will use a dedicated Hadoop user account for running Hadoop. While thats not required but
it is recommended, because it helps to separate the Hadoop installation from other software
applications and user accounts running on the same machine.

a. Adding group:

sudo addgroup Hadoop


b. Creating a user and adding the user to a group:

sudo adduser ingroup Hadoop hduser


It will ask to provide the new UNIX password and Information as shown in below image.

3. Configuring SSH access:

The need for SSH Key based authentication is required so that the master node can then login to
slave nodes (and the secondary node) to start/stop them and also local machine if you want to use
Hadoop with it. For our single-node setup of Hadoop, we therefore need to configure SSH access
to localhost for the hduser user we created in the previous section.
Before this step you have to make sure that SSH is up and running on your machine and
configured it to allow SSH public key authentication.

Generating an SSH key for the hduser user.


a. Login as hduser with sudo
b. Run this Key generation command:

ssh-keyegen -t rsa -P ""

c. It will ask to provide the file name in which to save the key, just press has entered so that it
will generate the key at /home/hduser/ .ssh

d. Enable SSH access to your local machine with this newly created key.

cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


e. The final step is to test the SSH setup by connecting to your local machine with the hduser
user.

ssh hduser@localhost
This will add localhost permanently to the list of known hosts
4. Disabling IPv6.

We need to disable IPv6 because Ubuntu is using 0.0.0.0 IP for different Hadoop configurations.
You will need to run the following commands using a root account:

sudo gedit /etc/sysctl.conf


Add the following lines to the end of the file and reboot the machine, to update the
configurations correctly.

#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Hadoop Installation:

Go to Apache Downloadsand download Hadoop version 2.2.0 (prefer to download any stable
versions)

i. Run this following command to download Hadoop version 2.2.0

wget http://apache.mirrors.pair.com/hadoop/common/stable2/hadoop-2.2..tar.gz

ii. Unpack the compressed hadoop file by using this command:

tar xvzf hadoop-2.2.0.tar.gz

iii. move hadoop-2.2.0 to hadoop directory by using give command

mv hadoop-2.2.0 hadoop

iv. Move hadoop package of your choice, I picked /usr/local for my convenience

sudo mv hadoop /usr/local/

v. Make sure to change the owner of all the files to the hduser user and hadoop group by using
this command:

sudo chown -R hduser:hadoop Hadoop


Configuring Hadoop:

The following are the required files we will use for the perfect configuration of the single node
Hadoop cluster.

a. yarn-site.xml:
b. core-site.xml
c. mapred-site.xml
d. hdfs-site.xml
e. Update $HOME/.bashrc

We can find the list of files in Hadoop directory which is located in

cd /usr/local/hadoop/etc/hadoop

a.yarn-site.xml:

<configuration>
<!-- Site specific YARN configuration properties -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>

b. core-site.xml:

i. Change the user to hduser. Change the directory to /usr/local/hadoop/conf and edit the core-
site.xml file.

vi core-site.xml
ii. Add the following entry to the file and save and quit the file:

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>

c. mapred-site.xml:

If this file does not exist, copy mapred-site.xml.template as mapred-site.xml

i. Edit the mapred-site.xml file

vi mapred-site.xml
ii. Add the following entry to the file and save and quit the file.

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>

d. hdfs-site.xml:

i. Edit the hdfs-site.xml file

vi hdfs-site.xml
ii. Create two directories to be used by namenode and datanode.

mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode
sudo mkdir -p $HADOOP_HOME/yarn_data/hdfs/namenode
mkdir -p $HADOOP_HOME/yarn_data/hdfs/datanode
iii. Add the following entry to the file and save and quit the file:

<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/usr/local/hadoop/yarn_data/hdfs/datanode</value>
</property>
</configuration>

e. Update $HOME/.bashrc
i. Go back to the root and edit the .bashrc file.

vi .bashrc

ii. Add the following lines to the end of the file.

Add below configurations:

# Set Hadoop-related environment variables


export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=${HADOOP_HOME}
export HADOOP_COMMON_HOME=${HADOOP_HOME}
export HADOOP_HDFS_HOME=${HADOOP_HOME}
export YARN_HOME=${HADOOP_HOME}
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
# Native Path
export HADOOP_COMMON_LIB_NATIVE_DIR=${HADOOP_PREFIX}/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_PREFIX/lib"
#Java path
export JAVA_HOME='/usr/locla/Java/jdk1.7.0_45'
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$JAVA_PATH/bin:$HADOOP_HOME/sbin

Formatting and Starting/Stopping the HDFS filesystem via the NameNode:

i. The first step to starting up your Hadoop installation is formatting the Hadoop filesystem
which is implemented on top of the local filesystem of your cluster. You need to do this the first
time you set up a Hadoop cluster. Do not format a running Hadoop filesystem as you will lose
all the data currently in the cluster (in HDFS). To format the filesystem (which simply
initializes the directory specified by the dfs.name.dir variable), run the

hadoop namenode -format


ii. Start Hadoop Daemons by running the following commands:

Name node:

$ hadoop-daemon.sh start namenode


Data node:

$ hadoop-daemon.sh start datanode


Resource Manager:

$ yarn-daemon.sh start resourcemanager


Node Manager:

$ yarn-daemon.sh start nodemanager


Job History Server:

$ mr-jobhistory-daemon.sh start historyserver

v. Stop Hadoop by running the following command

stop-dfs.sh
stop-yarn.sh
Hadoop Web Interfaces:
Hadoop comes with several web interfaces which are by default available at these locations:

HDFS Namenode and check health using http://localhost:50070

HDFS Secondary Namenode status using http://localhost:50090

By this we are done in setting up a single node hadoop cluster v2.2.0, hope this step by step
guide helps you to setup same environment at your end.

Please leave a comment/suggestion in the comment section,will try to answer asap and dont
forget to subscribe for the newsletter and a facebook like

Setting up Hive
Posted on October 28, 2013 by aravindu012
No Comments Leave a comment

As I said earlier, Apache Hive is an open-source data warehouse infrastructure built on top of
Hadoop for providing data summary, query, and analyzing large datasets stored in Hadoop files,
it is developed by Facebook and it provides

Tools to enable easy data extract/transform/load (ETL)

A mechanism to impose structure on a variety of data formats

Access to files stored either directly in Apache HDFSTM or in other data storage systems
such as Apache HBase
Query execution via MapReduce

In this post we will get to know about, how to setup Hive on top of Hadoop cluster

Objective

The objective of this tutorial is for setting up Hive and running HiveQL scripts.

Prerequisites

The following are the prerequisites for setting up Hive.

You should have the latest stable build of Hadoop up and running, to install hadoop, please check
my previous blog article on Hadoop Setup.

Setting up Hive:

Procedure

1. Download a stable version of the hive file from apache download mirrors, For this tutorial we
are using Hive-0.12.0,this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X

wget http://apache.osuosl.org/hive/hive-0.12.0/hive-0.12.0.tar.gz
2. Unpack the compressed hive in home directory:

tar xvzf hive-0.12.0.tar.gz

3. Create a hive directory under usr/local directory as root user and change the ownership to
hduser as shown, this is for our convenience to differentiate each framework,software and
application with different users.

cd /usr/local
mkdir hive
sudo chown -R hduser:hadoop /usr/local/hive
4. Login as hduser and move the uncompressed hive-0.12.0 to /usr/local/hive folder

mv hive-0.12.0/ /usr/local/hive

5. set HIVE_HOME in $HOME/.bashrc so it will be set every time you login.

$ .bashrc

Add the following entries to the .bashrc file.

export HIVE_HOME='/usr/local/hive/hive-0.12.0'
export PATH=$HADOOP_HOME/bin:$HIVE_HOME/bin:PATH

7. compile .bashrc file using this command:

. .bashrc

Setting up hive on top of hadoop has takencare, lets test it:

8. Start hive by executing the following command.

hive

9. table in hive by the following command. Also after creating check if the table exists.

create table test (field1 string, field2 string);


show tables;

10. Show extended details on the table

Describe extended test;

By this output we know that hive was setup correctly on top of Hadoop cluster, its time to learn
the HiveQL.

What is Apache Hive?


Posted on October 28, 2013 by aravindu012
No Comments Leave a comment
Apache Hive is an open-source data warehouse infrastructure built on top of Hadoop for
providing data summary, query, and analyzing large datasets stored in Hadoop files, it is
developed by Facebook and it provides

Tools to enable easy data extract/transform/load (ETL)

A mechanism to impose structure on a variety of data formats

Access to files stored either directly in Apache HDFSTM or in other data storage systems
such as Apache HBase

Query execution via MapReduce

It supports queries expressed in a language called HiveQL, which automatically translates SQL-
like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL supports custom
MapReduce scripts to be plugged into queries. Hive also enables data
serialization/deserialization and increases flexibility in schema design by including a system
catalog called Hive-Metastore.

According to the Apache Hive wiki, Hive is not designed for OLTP workloads and does not
offer real-time queries or row-level updates. It is best used for batch jobs over large sets of
append-only data (like web logs).

Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary
key/value pairs) and RCFiles (Record Columnar Files which store columns of a table in a
columnar database way.)
There is a SqlDevloper/Toad kind of tool named HiveDeveloper, developed by Stratapps Inc
which gives users the power to visualize their data stored in Hadoop as table views and do many
more operations.

In my next blog I will be explaining about how to setup Hive on top of Hadoop cluster, before
that please check how to setup Hadoop in my previous blog post so that you will be ready to
configure Hive on top of it.

Setting up Pig
Posted on October 28, 2013 by aravindu012
No Comments Leave a comment

Apache Pig is a high-level procedural language platform developed to simplify querying large
data sets in Apache Hadoop and MapReduce., Pig is popular for performing query operations in
hadoop using Pig Latin language, this layer that enables SQL-like queries to be performed on
distributed datasets within Hadoop applications, due to its simple interface, support for doing
complex operations such as joins and filters, which has the following key properties:

Ease of programming. Pig programs are easy to write and which accomplish huge tasks
as its done with other Map-Reducing programs.

Optimization: System optimize pig jobs execution automatically, allowing the user to
focus on semantics rather than efficiency.

Extensibility: Pig Users can write their own user defined functions (UDF) to do special-
purpose processing as per the requirement using Java/Phyton and JavaScript.

Objective

The objective of this tutorial is for setting up Pig and running Pig scripts.
Prerequisites

The following are the prerequisites for setting up Pig and running Pig scripts.

You should have the latest stable build of Hadoop up and running, to install hadoop,
please check my previous blog article on Hadoop Setup.

Setting up Pig

Procedure

1. Download a stable version of Pig file from apache download mirrors, For this tutorial we
are using pig-0.11.1,this release works with Hadoop 0.20.X, 1.X, 0.23.X and 2.X

wget http://apache.mirrors.hoobly.com/pig/pig-0.11.1/pig-0.11.1.tar.gz

2. Copy the pig binaries into the /usr/local/pig directory.

cp -r pig-0.11.1.tar.gz /usr/local/pig

3. Change the directory to /usr/local/pig by using this command

cd /usr/local/pig

4. Unpack the compressed pig, in the directory /usr/local/pig

sudo tar xvzf pig-0.11.1.tar.gz


5. set PIG_HOME in $HOME/.bashrc so it will be set every time you login. Add the following
line to it.

export PIG_HOME=<path_to_pig_home_directory>

e.g.
export PIG_HOME='/usr/local/pig/pig-0.11.1'
export PATH=$HADOOP_HOME/bin:$PIG_HOME/bin:$JAVA_HOME/bin:$PATH

6. Set the environment variable JAVA_HOME to point to the Java installation directory, which
Pig uses internally.

export JAVA_HOME=<<Java_installation_directory>>

Execution Modes

Pig has two modes of execution local mode and MapReduce mode.

Local Mode

Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets
which a single machine could handle. It runs on a single JVM and access the local filesystem.

To run in local mode, please pass the following command:

$ pig -x local
grunt>
MapReduce Mode

This is the default mode Pig translates the queries into MapReduce jobs, which requires access to
a Hadoop cluster.

$ pig

2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main Apache Pig version 0.11.1


(r1459641) compiled Mar 22 2013, 02:13:53

2013-10-28 11:39:44,767 [main] INFO org.apache.pig.Main Logging error messages to:


/home/hduser/pig_1382985584762.log

2013-10-28 11:39:44,797 [main] INFO org.apache.pig.impl.util.Utils Default bootup file


/home/hduser/.pigbootup not found

2013-10-28 11:39:45,094 [main] INFO


org.apache.pig.backend.hadoop.executionengine.HExecutionEngine Connecting to hadoop file
system at: hdfs://Hadoopmaster:54310

2013-10-28 11:39:45,592 [main] INFO


org.apache.pig.backend.hadoop.executionengine.HExecutionEngine Connecting to map-reduce
job tracker at: Hadoopmaster:54311

grunt>
You can see the log reports from Pig stating the filesystem and jobtracker it connected to. Grunt
is an interactive shell for your Pig queries. You can run Pig programs in three ways via Script,
Grunt, or embedding the script into Java code. Running in Interactive shell is shown in the
Problem section. To run a batch of pig scripts, it is recommended to place them in a single file
with .pig extension and execute them in batch mode, will explain them in depth in coming posts.

What is Apache Pig?


Posted on October 28, 2013 by aravindu012
No Comments Leave a comment

Apache Pig is a high-level procedural language platform developed to simplify querying large
data sets in Apache Hadoop and MapReduce., Pig is popular for performing query operations in
hadoop using Pig Latin language, this layer that enables SQL-like queries to be performed on
distributed datasets within Hadoop applications, due to its simple interface, support for doing
complex operations such as joins and filters, which has the following key properties:

Ease of programming. Pig programs are easy to write and which accomplish huge tasks
as its done with other Map-Reducing programs.

Optimization: System optimize pig jobs execution automatically, allowing the user to
focus on semantics rather than efficiency.

Extensibility: Pig Users can write their own user defined functions (UDF) to do special-
purpose processing as per the requirement using Java/Phyton and JavaScript.

How it works:

Pig runs on Hadoop and makes use of MapReduce and the Hadoop Distributed File System
(HDFS). The language for the platform is called Pig Latin, which abstracts from the Java
MapReduce idiom into a form similar to SQL. Pig Latin is a flow language which allows you to
write a data flow that describes how your data will be transformed. Since Pig Latin scripts can
be graphs it is possible to build complex data flows involving multiple inputs, transforms, and
outputs. Users can extend Pig Latin by writing their own User Defined functions, using Java,
Python, Ruby, or other scripting languages.

We can run Pig in two modes:

Local Mode

Local mode is usually used to verify and debug Pig queries and/or scripts on smaller datasets
which a single machine could handle. It runs on a single JVM and access the local filesystem.

MapReduce Mode

This is the default mode Pig translates the queries into MapReduce jobs, which requires access to
a Hadoop cluster.

we will discuss more about pig, setting up pig with hadoop, running PigLatin scripts in Local and
MapReduce Mode in my next posts.

Hadoop Installation using Cloudera-CDH4 virtual machine


Posted on October 27, 2013 by aravindu012
No Comments Leave a comment

This document helps to configure Hadoop cluster with help of Cloudera vm in pseudo mode,
using Vmware player on a user machine for there practice.
Step 1: Download the VMware player from the link shown and install it as shown in the
images.

Url to download VMware Player(Non-commercial use):

https://my.vmware.com/web/vmware/free#desktop_end_user_computing/vmware_player/6_0|
PLAYER-600-A|product_downloads
Step 2: Download the Cloudera Setup File from the given url and extract that zipped file onto
your hard drive.
Url to download Cloudera VM :

http://www.cloudera.com/content/support/en/downloads/download-components/download-
products.html?productID=F6mO278Rvo&version=1

Step 3: Start VMPlayer and click Open a Virtual Machine.


Browse the extracted folder and select .vmx file as shown.
By clicking on Play virtual machine button, It will start the VM in a couple of minutes, but
sometimes you may get this issue This host does not support Intet VT-x

There are two reasons why you are getting this error:
1. Your CPU doesnt support it.
2. Your CPU does support it, but you have it disabled in the BIOS

Enabling Virtualization extensions in the BIOS

1. Reboot the computer and open the systems BIOS menu. This can usually be done by pressing
the delete key, the F1 key or Alt and F4 keys depending on the system.
2. Enabling the Virtualization extensions in the BIOS
Many of the steps below may vary depending on your motherboard, processor type, chipset and
OEM. Refer to your systems accompanying documentation for the correct information on
configuring your system.

a. Open the Processor submenu The processor settings menu may be hidden in the Chipset,
Advanced CPU Configuration or Northbridge.

b. Enable Intel Virtualization Technology (also known as Intel VT-x). AMD-V extensions cannot
be disabled in the BIOS and should already be enabled. The Virtualization extensions may be
labeled Virtualization Extensions, Vanderpool or various other names depending on the OEM
and system BIOS.

c. Enable Intel VT-d or AMD IOMMU, if the options are available. Intel VT-d and AMD
IOMMU are used for PCI device assignment.

d. Select Save & Exit.

3. Reboot the machine.

By following this steps this problem should be solved.

Login credentials:

Machine Login credentials are:


a. Username cloudera
b. Password cloudera

Cloudera Manager Credentials are:

a. Username admin
b. Password admin

Click on the black box shown below in the image to start a terminal.
Step 4: Checking your Hadoop Cluster
Type: sudo jps to see if all nodes are running (if you see an error, wait for some time and then
try again, your threads are not started yet)
Type: sudo su hdfs
Execute your command ie hadoop fs ls /

Step 5: Download the list of Hadoop commands for reference from the given URL.

Url do download list of commands:


http://hadoop.apache.org/docs/r1.0.4/commands_manual.pdf

Hadoop multi-node cluster setup


Posted on October 27, 2013 by aravindu012
No Comments Leave a comment

The following document describes the required steps for setting up a distributed multi-node
Apache Hadoop cluster on two Ubuntu machines, the best way to install and setup a multi node
cluster is to start installing two individual single node Hadoop clusters by following my previous
tutorial setting up hadoop single node cluster on Ubuntu and merge them together with
minimal configuration changes in which one Ubuntu box will become the designated master and
the other boxs will become a slave, we can add n number of slaves as per our future request.

Please follow my previous blog post for setting up hadoop single node cluster on Ubuntu

1. Prerequisites
i.Networking

Networking plays an important role here, before merging both single node servers into a multi
node cluster we need to make sure that both the node pings each other( they need to be connected
on the same network / hub or both the machines can speak to each other). Once we are done with
this process, we will be moving to the next step in selecting the master node and slave node, here
we are selecting 172.16.17.68 as the master machine(Hadoopmaster) and 172.16.17.61 as a slave
(hadoopnode) . Then we need to add them in /etc/hosts file on each machine as follows.

sudo vi /etc/hosts
172.16.17.68 Haadoopmaster
172.16.17.61 hadoopnode

Note: The addition of more slaves should be updated here in each machine using unique names
for slaves (e.g.: 172.16.17.xx hadoonode01, 172.16.17.xy slave02 so on..).

ii. Enabling SSH:

hduser on master(Hadoopmaster) machine need to able to connect to its own master


(Hadoopmaster) account user and also need to connect hduser to the slave (hadoopnode)
machine via password-less SSH login.

hduser@Hadoopmaster:~$ ssh-copy-id -i ~/.ssh/id_rsa.pub hduser@hadoonode

If you can see the below output when you run the given command on both master and slave, then
we configured it correctly.

ssh Hadoopmaster
ssh hadoopnode
2. Configurations:
The following are the required files we will use for the perfect configuration of the multi node
Hadoop cluster.

a. masters

b. slaves

c. core-site.xml
d. mapred-site.xml

e. hdfs-site.xml

Lets configure each and every config file accordingly:

a. masters:

In master (Hadoopmaster) machine we need to configure masters file accordingly as shown in


the image and add the master (Hadoopmaster) node name.

vi masters
Hadoopmaster

b. slaves:

Lists the hosts, one per line, where the Hadoop slave daemons (DataNodes and TaskTrackers)
will be running as shown:

Hadoomaster
hadoopnode
If you have additional slave nodes, just add them to the conf/slaves file, one hostname per line.

Configuring all *-site.xml files:

We need to use the same configurations on all the nodes of hadoop cluster, i.e. we need to edit all
*-site.xml files on each and every server accordingly.

c. core-site.xml:

We are changing the host name from localhost to Hadoopmaster, which specifies the
NameNode (the HDFS master) host and port.

vi core-site.xml

d. hdfs-site.xml:

We are changing the replication factor to 2, The default value of dfs.replication is 3. However,
we have only two nodes available, so we set dfs.replication to 2.

vi hdfs-site.xml
e. mapred-site.xml:

We are changing the host name from localhost to Hadoopmaster, which specifies the
JobTracker (MapReduce master) host and port

vi mapred-site.xml

3. Formatting and Starting/Stopping the HDFS filesystem via the NameNode:

The first step to starting up your multinode Hadoop cluster is formatting the Hadoop filesystem
which is implemented on top of the local filesystem of your cluster. To format the filesystem
(which simply initializes the directory specified by the dfs.name.dir variable), run the given
command.

hadoop namenode format


4. Starting the multi-node cluster:

Starting the cluster is performed in two steps.

We begin by starting the HDFS daemons first, the NameNode daemon is started on
Hadoopmaster and DataNode daemons are started on all nodes(slaves).

Then we will start the MapReduce daemons, the JobTracker is started on Hadoomaster and
TaskTracker daemons are started on all nodes (slaves).

a. To start HDFS daemons:

start-dfs.sh

This will get NameNode up and DataNodes up listed in conf/slaves.


By running jps command, we will see list of java processes running on master and slaves:

b. To start Map Red daemons:

start-mapred.sh

This will bring up the MapReduce cluster with the JobTracker running on the machine you ran
the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.

By running jps command, we will see list of java processes including JobTracker and
TaskTracker running on master and slaves:.

c. To stop Map Red daemons:

stop-mapred.sh
d. To stop HDFS daemons:

stop-dfs.sh

5. Running a Map-reduce Job:

Use a much larger volume of data as inputs as we are running in a cluster.

hadoop jar hadoop *examples*. jar wordcount /user/hduser/demo


/user/hduser/demo-output

we can observe namenode,mapreduce,tasktracker process on the webinterface by following


given urls

http://Hadoopmaster:50070/ web UI of the NameNode daemon

http://Hadoopmaster:50030/ web UI of the JobTracker daemon

http://Hadoopmaster:50060/ web UI of the TaskTracker daemon

*Hadoopmaster can be replaced with the machine ip

By this we are done in setting up a multi-node hadoop cluster, hope this step by step guide helps
you to setup same environment at your place.

Please leave a comment in the comment section with your doubts, questions and suggestions,
will try to answer asap

What Is Apache Hadoop?


Posted on October 24, 2013 by aravindu012
No Comments Leave a comment

What Is Apache Hadoop?


The Apache Hadoop project develops open-source software for reliable, scalable, distributed
computing.
The Apache Hadoop is a framework that allows for the storing large data sets which are
distributed across clusters of computers using simple programming models and written in Java to
run on a single computer to large clusters of commodity hardware computers and it derived from
papers published by Google and incorporated the features of the Google File System (GFS) and
MapReduce paradigm and named it as Hadoop Distributed File System and Hadoop
MapReduce.

How does it fit in?


Hadoop Key Characteristics:
Scalable New nodes can be added as needed, and added without needing to change data
formats, how data is loaded, how jobs are written, or the applications on top.

Economical Hadoop brings massively parallel computing to commodity servers. The


result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it
affordable to model all your data.

Flexible Hadoop is schema-less, and can absorb any type of data, structured or not,
from any number of sources. Data from multiple sources can be joined and aggregated in
arbitrary ways enabling deeper analyses than any one system can provide.

Reliable When you lose a node, the system redirects work to another location of the
data and continues processing without missing a beat.(Credit:Cloudera Blog)

Hadoop Eco System:


Hadoop Distributed File System (HDFS):
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on
commodity hardware, it is highly fault-tolerant and is designed to be deployed on low-cost
hardware, it is designed to store very large amounts of data (terabytes or even petabytes), and
provide high-throughput access to this information. Files are stored in blocks across multiple
machines to ensure their availability on the failure of the machine and high availability to
parallel processing, It has many similarities with existing distributed file systems. However, the
differences from other distributed file systems are significant.
HDFS is designed to store a very large amount of information (terabytes or petabytes). This
requires spreading the data across a large number of machines.
HDFS stores data reliably. If individual machines in the cluster fail, data is still being available
with data redundancy.
HDFS provides fast, scalable access to the information loaded on the clusters. It is possible to
serve a larger number of clients by simply adding more machines to the cluster.
HDFS integrate well with Hadoop MapReduce, allowing data to be read and computed upon
locally whenever needed.
HDFS was originally built as infrastructure for the Apache Nutch web search engine project
Assumptions and Goals:
HDFS is a filesystem designed for storing very large files with streaming data access patterns,
running on clusters on commodity hardware.

Commodity Hardware Failure:


Hadoop doesnt require expensive, highly reliable hardware. Its designed to run on clusters of
commodity hardware, an HDFS instance may consist of hundreds or thousands of server
machines, each storing part of the file systems data. The fact that there are a huge number of
components and that each component has a non-trivial probability of failure means that some
component of HDFS is always non-functional. Therefore, detection of faults and quick,
automatic recovery from them is a core architectural goal of HDFS.

Continuous Data Access:


Applications that run on HDFS need continuous access to their data sets. HDFS is designed more
for batch processing rather than interactive use by users. The emphasis is on high throughput of
data access rather than low latency of data access.

Very Large Data Files:


Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to
terabytes in size. So, HDFS is tuned to support large files.

It is also worth examining the applications for which using HDFS does not work so well. While
this may change in the future, these are areas where HDFS is not a good fit today:

Low-latency data access:


Applications that require low-latency access to data, in the tens of milliseconds range, will not
work well with HDFS. Remember HDFS is optimized for delivering a high throughput of data,
and this may be at the expense of latency.

Lots of small files:


Since the namenode holds filesystem metadata in memory, the limit to the number of files in a
filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file,
directory, and block take about 150 bytes. So, for example, if you had one million files, each
taking one block, you would need at least 300 MB of memory. While storing millions of files are
feasible, billions are beyond the capability of current hardware.

Multiple writers, arbitrary file modifications:


Files in HDFS may be written to by a single writer. Writes are always made at the end of the file.
There is no support for multiple writers, or for modifications at arbitrary offsets in the file.

References:

http://blog.cloudera.com/wp-content/uploads/2010/03/HDFS_Reliability.pdf

http://en.wikipedia.org/wiki/Apache_Hadoop

http://hadoop.apache.org/

Hadoop The Definitive guide

Big Data Characteristics, Problems and Solution


Posted on October 23, 2013 by aravindu012
No Comments Leave a comment
What is Big Data?
Big Data has become a Buzzword these days, data in every possible form, whether
through social media, structured, unstructured, text, images, audio, video, log files,
emails, simulations, 3D models, military surveillance, e-commerce and so on, it amounts
to around some zettabytes of data! This huge data is what we call as BIG DATA!
Big data is nothing but a synonym of a huge and complex data that it becomes very
tiresome or slow to capture, store, process, retrieve and analyze it with the help of any
relational database management tools or traditional data processing techniques.

Lets take a moment to look into the above picture this should explain us what happens
in every 60 seconds on the internet. By this we can understand how much data being
generated in a second, a minute, a day or a year and how exponentially its generating, as
per the analysis by TechNewsDaily we might generate more than 8 Zettabytes of data by
2015.
Over the next 10 years: The number of servers worldwide will grow by 10x, Amount of
information managed by enterprise data centers will grow by 50x, Number of files
enterprise data center handles will grow by 75x (Systems / Enterprises generate huge
amount of data from Terabytes to and even Petabytes of information.

Big Data characteristics:


Volume: BIG DATA depends upon how large it is. It could amount to hundreds of
terabytes or even petabytes of information.
Velocity: The increasing rate at which data flows into an organization
Variety: A common theme in big data systems is that the source data is diverse and
doesnt fall into neat relational structures

As we are speaking about data size, the above image will help us to understand or
correlate about what we are speaking.
Big Data problems?
As per our earlier discussions we might now understand what is Big Data, now lets
discuss what are all the problems we might face with Big data.
Traditional systems build within the company for handling the relational databases may
not be able to support/scale as data generating with high volume, velocity and variety
data.
1. Volume: For instance, Terabytes of Facebook posts or 400 billion annual twitter
tweets could mean Big Data! This data need to be stored to analyze and come up with
data science reports for different solutions and problem solving approaches.
2. Velocity: Big data requires fast processing. Time factor plays a very crucial role in
several organizations. For instance, processing million records on the share market need
to able to write this data with the same speed its coming back into the system.
3. Variety: Big Data may not belong to a specific format. It could be in any form such as
structured, unstructured, text, images, audio, video, log files, emails, simulations, 3D
models, etc. Till date we have been working with structured data, it might be difficult to
handle unstructured or semi structured data with quality and quantity we are generating
on a daily basis.
How Big Data can handle this situation:
Distributed File System (DFS):
With the help of distributed file system(DFS) we can divide a large set of data files into
smaller blocks and load them into a multiple number of machines which will be ready for
parallel processing.


Parallel Processing:
Data is residing on N number of servers and holds the power of N servers and can be
processed parallel for analysis, which helps user to reduce the wait time to generate the
final report or analyzed data.

Fault Tolerance:
One of the primary reasons to use some of the BigData frameworks(ex: Hadoop) to run
your jobs is due to its high degree of fault tolerance. Even when running jobs on a large
cluster where individual nodes or network components may experience high rates of
failure,BigData frameworks can guide jobs toward a successful completion as the data is
replicated into multiple nodes/slaves.

Use of Commodity hardware:


Most of the BigData tools and frameworks need commodity hardware for its working
which reduces the cost of the total infrastructure and very easy to add more clusters as
data size increase.