Anda di halaman 1dari 4

Abstract of Hadoop

Hadoop is the popular open source

implementation of Map Rcduce, a powerful
tool designed for deep analysis and
transformation of very large data sets.
Hadoop enables you to explore complex
data, using custom analyses tailored to your
information and questions. Hadoop is the
system that allows unstructured data to be
distributed across hundreds or thousands of
machines forming shared nothing clusters,
and the execution of Map/Reduce routines to
run on the data in that cluster. Hadoop has
its own file system which replicates data to
multiple nodes to ensure if one node holding
data goes down, there are at least 2 other
nodes from which to retrieve that piece of
information. This protects the data
availability from node failure, something
which is critical when there are many iftdes
in a cluster (aka RAID at a server level).

data being unavailable. An active

monitoring system then re-replicates the
data in response to system failures which
ctm result in partial storage. Hvcn though
the file chunks are replicated and distributed
across several machines, they form a single
namespace, so their contents are universally

Hadoop has its origins in Apache Nutch, an

open source web searchengine, itself a part
of the Lucene project. Building a web search
engine from scratch was an ambitious goal,
for not only is the software required to crawl
and index websites complex to write, but it
is also a challenge to run without a dedicated
operations team, since there are so many
moving parts. It's expensive look Mike
Cafarella and Doug Cutting estimated a
system supporting a 1-billkm-puge index
would cost around half a million dollars in
hardware, with a monthly running cost of

Fig.(l) Hadoop Prog. Framework

Introduction of Hadoop:
In a Iladoop cluster, data is distributed to all
the nodes of the cluster as it is being loaded
in. The Hadoop Distributed File System
(HDFS) will split laige data tiles into chunks
which are managed by different nodes in the
cluster. In addition to this each chunk is
replicated across several machines, so that a
single machine failure does not result in any

Data is conceptually record oriented in the

Hadoop programming

framework. Individual input files are broken

into lines or into other formats specific to
the application logic. Hach process running
on a node in the cluster then processes a
subset of these records. The Hadoop
framework then schedules these processes in
proximity to the location of data/records
using knowledge from the distributed file
system. Since files are. spread across the
distributed file system as chunks, each
compute process running on a node operates
on a subset of the data. Which data operated
on by a node is chosen based on its locality
to the node: most data is read from the local
disk straight into the CPU, alleviating strain
on network bandwidth and preventing
unnecessary network transfers. This strategy
of moving computation to the data, instead
of moving the data to The computation
allows Hadoop to achieve high data locality
which in turn results in bigii performance.

Hadoop limits the amount of communication
which can be performed by the processes, as
each individual record is processed by a task
in isolation from one another. While rhis
sounds like a major limitation at first, it
makes the whole framework much more
reliable. Hadoop will not run just any
program and distribute it across a cluster.
Programs must be written to conform to a
particular programming model, named
InMapRednce, records are processed in
isolation by tasks called Mappers. The
output from the Mappers is then brought
together into a second set of tasks called
Reducers, where results from different
mappers can be merged together.
Separate nodes in a Hadoop cluster still
communicate with one another. However, in
contrast to more conventional distributed
systems where application developers
explicitly marshal byte streams from node to
node over sockets or through MPI buffers,
communication in Hadoop is performed
implicitly. Pieces of data can be tagged with
key names which inform Hadoop how to
send related bits of information to a
internally manages all of the data transfer
and cluster topology issues.
By restricting the communication between
nodes, Hadoop makes the distributed system
much more reliable. Individual node failures
can be worked around by restarting tasks on
other machines. Since user-level tasks do
not communicate explicitly with one
another, no messages need to be exchanged
by user programs, nor do nodes need to roll
back to pre-arranged checkpoints to partially
restart the computation. The other workers
continue to operate as though nothing went
wrong, leaving the challenging aspects of

partially restarting the program to the

underlying Hadoop layer
The core Hadoop has two main systems: Haduup Distributed File System
(HDFS):- A distributed file system
that provides high throughput access
to application data.
Map Reduee:-A software framework
ibi distributed processing of large
data sets Oil compute clusters.
IIDFS has a master / slave architecture. An
HDFS cluster consists of a single Name
Node, a umsfer server that manages (he file
system namespace and regulates access to
files by clients. In addition, there lire a
number of Data Nodes, usually one per node
in the cluster, which manage storage
attached to the nodes that they run on.
HDFS exposes a file system name space and
allows user data to be stored in files.
Internally, a file is split into one or more
blocks and these blocks are stored in a set of
Data Nodes. The Name-Node executes file
system namespace operations like opening,
closing, and renaming files and directories.
It also determines the mapping of
blocks to DataNodes.
The DataNodes are responsible for serving
read and write requests from the file
system's clients. The DataNodes also
perform block creation, deletion, and
replication upon instruction from the Name

together into a second set of tasks called


Fig (2):- HDFS Architecture

The NameNode and DataNode are pieces of
software designed to run on commodity
machines. These machines typically run a
GNU/Linux operating system (OS). HDFS
is built using the Java language; any
machine that supports Java can run the
Name Node or the Data Node software.
Usage of the highly portable Java language
means that HDFS can be deployed on a wide
range of machines. A typical deployment
has a dedicated machine that runs only the
Name Node software, Each of the other
machines in the cluster runs one instance of
the Data Node software. The architecture
does not preclude running multiple
Data Nodes on the same machine hut in a
real deployment that is--rarely the case. The
existence of a single Name Node in a cluster
greatly simplifies the architecture of the
system. The Name Node is the sibiLrafor
and repositoiy for ail HDFS metadata. The
system is designed in such a way that user
data never flows through the Name Node.
Map Reduce
In Map Reduce, records are processed in
isolation by tasks called Mappers. The
output from the Mappers is then brought


Input files split (M splits)

Assign Master & Workers
Map tasks
Writing intermediate daLa Lo
disk (R regions)
Intermediate data read & sort
Reduce tasks

Capable of processing vast amounts
of data Scales linearly.
Same data problem will process 10x
faster on lOx larger cluster.
Individual failures have minimal
Failures during a job cause only a
small portion of the job toreexecuted.
Job setup takes time (e.g., several

Map Reduce is not for real-time

Requires deep understanding of the
MapReduce paradigm. Not all
problems arc easily expressed in
Map Reduce .


Acknowledge - to build the

recommender system for behavioral
targeting, plus other dickstrearn
analytics; clusters vary frum 50 to
200 nodes, mostly on F.C2. Context
web - to store ad serving log and use
it as a source for Ad optimizations/
learning, 23 machine cluster with
184 cores and about 35TR raw
storage. Fach (commodity) node has
8 cores, 8GB RAM and 1/7 TB of
Cornell University Web Lab:
Generating web graphs on 100 nodes
(dual 2.4GII7. Xeon Processor, 2 Gil
RAM, 72GB Hard Drive)
NetSeer - Up to 1000 instances on
Amazon. EC2 ; Data storage in
Amazon S3; Used for crawling,
processing, serving and log analysis
The New York Times ; Lurxe seule_
miaae conversions ; EC2 to run
hadoop on a large virtual cluster
Powerset / Microsoft - Natural
Language Search; up to 400
instances on Amazon LC2 ; data
storage in Amazon S3

ADVANTAGES: Iladoop is designed to run on

cheap commodity hardware .
It automatically handles data
replication and node failure.
It does the hard work - you can focus
on processing data.
Cost Saving and efficient and
reliable data processing.
analysis platform for large volumes
of data.
Hadoop will sit along side, not
replace your existing RDBMS.
Hadoop has many tools to ease
data analysis.
REFERENCES: Apachc. Hadoop

( )
on Wikipedia


Free Search hy Doug

Computing atYahoo!
( )
Cloudera Apachy Hudoup for