Hadoop Technologies

A new way to store and analyze data
Presented by: M. Syamprasad
M.Tech CSE student
12P66D5816
Topics Covered
What is Hadoop?
Why, Where, When?
Benefits of Hadoop
How Hadoop Works?
Hadoop Architecture
Hadoop Common
HDFS
Hadoop MapReduce
Installation & Execution
Demo of installation
Hadoop Community
What is Hadoop?
Hadoop was created by Doug Cutting, who named it after his child's stuffed elephant. It was built to support the Lucene and Nutch search engine projects.
It is an open-source project administered by the Apache Software Foundation.
Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called MapReduce.
Hadoop runs large-scale, high-performance processing jobs in spite of system changes or failures.
Hadoop, Why?
Need to process 100 TB datasets
On 1 node: scanning @ 50 MB/s = 23 days
On a 1000-node cluster: scanning @ 50 MB/s per node = 33 min
Need an efficient, reliable and usable framework
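The figures above follow from simple arithmetic. A minimal sketch, assuming decimal units (1 TB = 1,000,000 MB) and perfectly even scaling across nodes; the function name `scan_seconds` is made up for illustration:

```python
# Back-of-the-envelope scan time for the numbers on this slide.
# Assumptions: 1 TB = 1,000,000 MB, and the work splits evenly over all nodes.

def scan_seconds(dataset_tb, mb_per_s_per_node, nodes):
    """Seconds needed to read the whole dataset once."""
    dataset_mb = dataset_tb * 1_000_000
    return dataset_mb / (mb_per_s_per_node * nodes)

single = scan_seconds(100, 50, 1)      # one machine
cluster = scan_seconds(100, 50, 1000)  # 1000-node cluster

print(f"1 node:     {single / 86_400:.1f} days")
print(f"1000 nodes: {cluster / 60:.1f} minutes")
```

Real clusters lose some of this speed-up to scheduling, network transfer and stragglers, so the 33 minutes is a best case.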
Benefits of Hadoop
Hadoop is designed to run on cheap commodity hardware
It automatically handles data replication and node failure
It does the hard work, so you can focus on processing data
Cost savings and efficient, reliable data processing
How Hadoop Works

Hadoop implements a computational paradigm named MapReduce, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
Both MapReduce and the distributed file system are designed so that node failures are automatically handled by the framework.
Hadoop Architecture
The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.
Hadoop consists of:
Hadoop Common: The common utilities that support the other Hadoop subprojects.
HDFS: A distributed file system that provides high-throughput access to application data.
MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Hadoop is made up of a number of elements in addition to Hadoop Common. At the bottom is the Hadoop Distributed File System (HDFS), which stores files across storage nodes in a Hadoop cluster. Above HDFS is the MapReduce engine, which consists of a JobTracker and TaskTrackers.
Data Flow
[Figure: data flow diagram with the components Web Servers, Scribe Servers, Network Storage, Hadoop Cluster, Oracle RAC and MySQL]
Hadoop Common
Hadoop Common is a set of utilities that support the other Hadoop subprojects.
Hadoop Common includes FileSystem, RPC, and serialization libraries.
HDFS
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications.
HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.
Replication and locality
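The idea of replicating blocks across nodes can be sketched in a few lines. This is a toy model, not the HDFS placement policy: it picks replica nodes at random, whereas HDFS also takes rack location into account. The 64 MB block size and replication factor of 3 are assumed defaults, and the function names are made up for illustration.

```python
import random

def place_replicas(file_size_mb, nodes, block_size_mb=64, replication=3, seed=0):
    """Split a file into fixed-size blocks and pick distinct nodes for each block.

    Returns a dict mapping block index -> list of nodes holding a replica.
    Placement is random here (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    num_blocks = -(-file_size_mb // block_size_mb)  # ceiling division
    return {b: rng.sample(nodes, replication) for b in range(num_blocks)}

def readable_blocks(placement, failed_nodes):
    """Blocks that still have at least one replica on a live node."""
    return [b for b, holders in placement.items()
            if any(n not in failed_nodes for n in holders)]

nodes = [f"node{i}" for i in range(6)]
placement = place_replicas(200, nodes)  # a 200 MB file becomes 4 blocks
survivors = readable_blocks(placement, failed_nodes={"node0", "node1"})
print(f"{len(survivors)} of {len(placement)} blocks readable after 2 node failures")
```

With three replicas on distinct nodes, losing any two nodes still leaves at least one copy of every block, which is why the framework can keep working through node failures.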
HDFS Architecture
Hadoop MapReduce
The Map-Reduce programming model
Framework for distributed processing of large data sets
Pluggable user code runs in generic framework
Common design pattern in data processing
input | map | shuffle | reduce | output
Natural for:
Log processing
Web search indexing
Ad-hoc queries
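The input | map | shuffle | reduce | output pattern can be sketched on a single machine. This is a minimal in-memory illustration, not Hadoop's API: `run_mapreduce`, `status_mapper`, `count_reducer` and the log line format are all made up for this example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run user-supplied map and reduce functions through a generic pipeline."""
    # map: each input record yields zero or more (key, value) pairs
    pairs = [kv for record in records for kv in mapper(record)]
    # shuffle: bring all values for the same key together
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # reduce: collapse each key's values into one result
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def status_mapper(line):
    """Emit (status_code, 1); assumes lines look like 'METHOD PATH STATUS'."""
    yield line.split()[-1], 1

def count_reducer(key, values):
    """Sum the counts emitted for one key."""
    return sum(values)

log_lines = ["GET /index.html 200", "GET /missing 404", "POST /login 200"]
result = run_mapreduce(log_lines, status_mapper, count_reducer)
print(result)  # {'200': 2, '404': 1}
```

Only the two small user functions change from job to job; the pipeline around them stays the same, which is what "pluggable user code runs in generic framework" refers to.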
MapReduce Implementation
1. Input files are split (M splits)
2. Master and workers are assigned
3. Map tasks run
4. Intermediate data is written to disk (R regions)
5. Intermediate data is read and sorted
6. Reduce tasks run
7. Results are returned
MapReduce Cluster Implementation
[Figure: input files (split 0 to split 4) -> M map tasks -> intermediate files -> R reduce tasks -> output files (Output 0, Output 1)]
Several map or reduce tasks can run on a single computer.
Each intermediate file is divided into R partitions by a partitioning function.
Each reduce task corresponds to one partition.
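A partitioning function can be as simple as a hash of the key modulo R. The sketch below assumes that scheme and uses CRC32 as a stable hash; the function names and sample data are made up for illustration.

```python
import zlib

def partition(key, num_reducers):
    """Map a key to a partition index in the range 0..num_reducers-1."""
    # CRC32 is used because it gives the same value on every run and machine.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def partition_map_output(pairs, num_reducers):
    """Divide one map task's output into R regions, one per reduce task."""
    regions = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        regions[partition(key, num_reducers)].append((key, value))
    return regions

map_output = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
regions = partition_map_output(map_output, num_reducers=2)
for index, region in enumerate(regions):
    print(f"partition {index}: {region}")
```

Because the function depends only on the key, every pair with the same key lands in the same partition no matter which map task produced it, so a single reduce task sees all values for that key.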
Examples of MapReduce: Word Count
Read text files and count how often words occur.
o The input is text files
o The output is a text file; each line: word, tab, count
Map: Produce pairs of (word, count)
Reduce: For each word, sum up the counts.
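Word count can be sketched as a single-process simulation. The map step emits (word, 1), a sort stands in for the framework's shuffle, and the reduce step sums the counts for each word. Lowercasing and splitting on whitespace are assumptions of this sketch, and the function names are made up for illustration.

```python
from itertools import groupby

def map_words(line):
    """Map: produce a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(word, counts):
    """Reduce: sum up the counts for one word."""
    return word, sum(counts)

def word_count(lines):
    """Return output lines in the form word<TAB>count."""
    # Sorting puts equal words next to each other, as the shuffle step would.
    pairs = sorted(kv for line in lines for kv in map_words(line))
    out = []
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        w, total = reduce_counts(word, (count for _, count in group))
        out.append(f"{w}\t{total}")
    return out

output = word_count(["the quick brown fox", "the lazy dog", "The end"])
print("\n".join(output))
```

For the three sample lines this prints one line per distinct word, ending with the word "the" followed by a tab and the count 3.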
Hadoop Community
Hadoop Users
Adobe
Alibaba
Amazon
AOL
Facebook
Google
IBM
Major Contributors
Apache
Cloudera
Yahoo
References
Apache Hadoop (http://hadoop.apache.org)
Hadoop on Wikipedia (http://en.wikipedia.org/wiki/Hadoop)
Free Search by Doug Cutting (http://cutting.wordpress.com)
Hadoop and Distributed Computing at Yahoo! (http://developer.yahoo.com/hadoop)
Cloudera - Apache Hadoop for the Enterprise (http://www.cloudera.com)
