
Inspire...Educate...Transform.

The best place for students to learn Applied Engineering http://www.insofe.edu.in


Engineering Big Data:
Online Batch
Session 9: Map Reduce 2
Dr. Sreerama KV Murthy
CEO, Teqnium Consultancy Services
September 25, 2013
Refresher: What is MapReduce?
MapReduce is a programming model that Google has used
successfully in processing its big-data sets (~20 petabytes,
i.e. about 20,000 terabytes, per day)
Users specify the computation in terms of a map and a reduce
function
The underlying runtime system automatically parallelizes the computation
across large-scale clusters of machines.
The underlying system also handles machine failures, efficient
communications, and performance issues.
Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified
data processing on large clusters. Communications of the ACM 51, 1 (Jan.
2008), 107-113.
CCSCNE 2009, Plattsburgh, April 24, 2009. B. Ramamurthy & K. Madurai
The Five MapReduce Daemons
1. NameNode
Holds the metadata for HDFS
2. Secondary NameNode
Performs housekeeping functions for the NameNode. It is not a backup or
hot standby for the NameNode.
3. DataNode
Stores actual HDFS data blocks
4. JobTracker
Manages MapReduce jobs, distributes individual tasks to machines, etc.
5. TaskTracker
Instantiates and monitors individual Map and Reduce tasks
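Master Nodes in the cluster run one of the master daemons above (NameNode,
Secondary NameNode, JobTracker; shown in blue on the slide).
Slave Nodes run both of the remaining daemons (DataNode and TaskTracker).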
Each daemon runs in its own Java virtual machine.
Five Daemons of MapReduce (contd.)
YARN (MR2)
CDH-3 Map Reduce Daemons
MapReduce NextGen aka YARN aka MRv2
Divides the two major functions of the JobTracker - resource
management and job life-cycle management - into separate
components
Released in Hadoop-0.23
The new ResourceManager manages the global assignment of
compute resources to applications.
The ResourceManager has two main components: Scheduler and
ApplicationsManager.
The Scheduler is responsible for allocating resources to various
running applications subject to constraints of capacities, queues etc.
MRv2 aka YARN: JobTracker Redefined
YARN aka MRv2 (contd.)
The Scheduler performs no tracking of status for the application, and
offers no guarantees about restarting failed tasks.
The per-application ApplicationMaster manages the application's
scheduling and coordination.
The per-machine NodeManager daemon manages the user
processes on that machine.
An application is either a single MR job or a DAG of such jobs.
The ApplicationMaster negotiates resources from the
ResourceManager and works with the NodeManager(s) to execute
and monitor tasks.
CDH4 continues to support the original MapReduce framework (i.e.
the JobTracker and TaskTrackers). The old framework is referred to
as MRv1.
A COUPLE OF USE CASES
Map-Reduce
Yahoo: Running Production WebMap
Search needs a graph of the known web
Invert edges, compute link text, whole graph heuristics
Periodic batch job using Map/Reduce
Uses a chain of ~100 map/reduce jobs
Scale
1 trillion edges in graph
Largest shuffle is 450 TB
Final output is 300 TB compressed
Runs on 10,000 cores
Raw disk used 5 PB
Written mostly using Hadoop's C++ interface
Yahoo Research Clusters
Mostly data mining/machine learning jobs
Most research jobs are not Java:
42% Streaming
Uses Unix text processing to define map and
reduce
28% Pig
Higher level dataflow scripting language
28% Java
2% C++
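To make the Streaming style above concrete, here is a minimal sketch in Python of a word-count mapper and reducer that exchange tab-separated text lines, which is the default convention of Hadoop Streaming. The function names and the sample input are assumptions for illustration and are not taken from the Yahoo jobs.

```python
from itertools import groupby


def stream_mapper(lines):
    """Yield 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def stream_reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the pipeline: cat input | mapper | sort | reducer
sample = ["big data big", "data big"]
result = list(stream_reducer(sorted(stream_mapper(sample))))
```

In an actual Streaming job the two functions would live in separate scripts that read `sys.stdin` and print to standard output; the framework performs the sort between them.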
NY Times
Needed offline conversion of public domain articles from
1851-1922.
Used Hadoop to convert scanned images to PDF
Ran 100 Amazon EC2 instances for around 24 hours
4 TB of input
1.5 TB of output
Published 1892, copyright New York Times
Terabyte Sort Benchmark
Started by Jim Gray at Microsoft in 1998
Sorting 10 billion 100 byte records
Hadoop won the general category in 209 seconds
910 nodes
2 quad-core Xeons @ 2.0 GHz / node
4 SATA disks / node
8 GB RAM / node
1 Gb Ethernet / node
40 nodes / rack
8 Gb Ethernet uplink / rack
Previous record was 297 seconds
Only hard parts were:
Getting a total order
Converting the data generator to map/reduce
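The slide does not say how the total order was obtained. One common approach is range partitioning on sampled split points, so that every key sent to reducer i sorts before every key sent to reducer i+1. The sketch below illustrates that idea under those assumptions; all function names and parameters are hypothetical.

```python
import bisect
import random


def choose_split_points(sample, num_reducers):
    """Pick num_reducers - 1 evenly spaced keys from a sorted sample."""
    ordered = sorted(sample)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def partition(key, split_points):
    """Return the index of the reducer whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


def total_order_sort(keys, num_reducers, sample_size=100, seed=0):
    """Range-partition the keys, then sort each partition independently."""
    rng = random.Random(seed)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    splits = choose_split_points(sample, num_reducers)
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[partition(key, splits)].append(key)
    # Each "reducer" sorts only its own bucket.
    return [sorted(bucket) for bucket in buckets]


rng = random.Random(1)
keys = [rng.randrange(10**6) for _ in range(1000)]
outputs = total_order_sort(keys, num_reducers=4)
concatenated = [key for bucket in outputs for key in bucket]
```

Because the partitions cover disjoint, increasing key ranges, concatenating the reducer outputs in order yields a globally sorted result without a final merge step.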
NOW... THE SHAKE-UP QUIZ!!
KEY-VALUE PAIRS
MapReduce
Programmers specify two functions:
map (k, v) -> <k', v'>*
reduce (k', v') -> <k', v'>*
All values with the same key are reduced
together
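The two functions can be illustrated with word count. The following is a minimal single-process sketch, not Hadoop's API: `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical names, and the driver only imitates the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_fn(key, value):
    """Emit (word, 1) for each word in the line; the input key is ignored."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """Receive every value that shares `key` and emit their sum."""
    yield key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Apply the mapper, group values by key, then apply the reducer."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    for key in sorted(grouped):
        output.extend(reducer(key, grouped[key]))
    return output


# Input records are (byte offset, line) pairs.
records = [(0, "to be or"), (9, "not to be")]
counts = run_mapreduce(records, map_fn, reduce_fn)
```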
Keys and Values
MapReduce In more detail
MR: Logical Execution
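[Figure: logical execution of a MapReduce job]
Input pairs (k1, v1) ... (k6, v6) are fed to four map tasks.
Map outputs: (a, 1), (b, 2) | (c, 3), (c, 6) | (a, 5), (c, 2) | (b, 7), (c, 8)
Shuffle and Sort: aggregate values by keys
a -> [1, 5]; b -> [2, 7]; c -> [2, 3, 6, 8]
Three reduce tasks emit (r1, s1), (r2, s2), (r3, s3).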
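The shuffle-and-sort step in the figure can be reproduced in a few lines. This is a sketch using the intermediate pairs as read from the figure; `shuffle_and_sort` is a hypothetical helper, not part of Hadoop.

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs as emitted by the four map tasks in the figure.
map_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]


def shuffle_and_sort(per_task_outputs):
    """Merge all map outputs and collect the values for each key."""
    # Sorting whole tuples also orders the values within a key, which
    # matches the figure; a real framework only guarantees key order.
    all_pairs = sorted(pair for task in per_task_outputs for pair in task)
    return {
        key: [value for _, value in group]
        for key, group in groupby(all_pairs, key=itemgetter(0))
    }


grouped = shuffle_and_sort(map_outputs)
```

Each entry of `grouped` corresponds to the input of one reduce call in the figure.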
SIMPLE MAPPERS & REDUCERS
Mappers
Mappers run on nodes which hold their portion of the data locally,
to avoid network traffic
Multiple Mappers run in parallel, each processing a portion of the
input data
Mapper reads data in the form of key/value pairs
Mapper may use, or completely ignore, the input key.
E.g., a standard pattern is to read a line of a file at a time. Key then is
the byte offset into the file at which the line starts. Value is the
contents of the line itself. Typically the key is considered irrelevant.
It outputs zero or more key/value pairs
let map(k, v) = emit(k.toUpper(), v.toUpper())
('foo', 'bar') -> ('FOO', 'BAR')
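The pseudocode above translates directly to Python. The sketch below also shows how (byte offset, line) input records could be produced from a text; `records_from_text` and `upper_mapper` are hypothetical names used only for illustration.

```python
def records_from_text(text):
    """Yield (byte_offset, line) pairs, one per line of the text."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def upper_mapper(key, value):
    """Emit exactly one pair, with both key and value upper-cased."""
    yield key.upper(), value.upper()


records = list(records_from_text("ab\ncde\n"))
result = list(upper_mapper("foo", "bar"))
```

Writing the mapper as a generator makes "zero or more key/value pairs" natural: a mapper may yield nothing, one pair, or many.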
Other Examples: Explode mapper
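The body of this slide is not preserved in the text, so the following is an assumption about what an explode mapper looks like: one input pair fans out into several output pairs, here one per character of the value. The name `explode_mapper` is hypothetical.

```python
def explode_mapper(key, value):
    """Emit one (key, character) pair for every character in the value."""
    for char in value:
        yield key, char


exploded = list(explode_mapper("foo", "bar"))
```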
Example: Filter mapper
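The body of this slide is not preserved in the text. As an assumed illustration, a filter mapper emits its input pair only when a predicate holds and emits nothing otherwise. The primality predicate and the names below are hypothetical choices for the sketch.

```python
def is_prime(n):
    """Return True if n is a prime number (trial division)."""
    if n < 2:
        return False
    for divisor in range(2, int(n ** 0.5) + 1):
        if n % divisor == 0:
            return False
    return True


def filter_mapper(key, value):
    """Pass the pair through only if the value is prime."""
    if is_prime(value):
        yield key, value


inputs = [("foo", 7), ("test", 10)]
kept = [pair for key, value in inputs for pair in filter_mapper(key, value)]
```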
Example: Changing Keyspaces
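The body of this slide is not preserved in the text. As an assumed illustration, the output key of a mapper need not come from the same space as the input key: the sketch below replaces a string key with an integer, the length of the value. The name `length_key_mapper` is hypothetical.

```python
def length_key_mapper(key, value):
    """Discard the input key and key the value by its length instead."""
    yield len(value), value


rekeyed = list(length_key_mapper("foo", "bar"))
```

After the shuffle, all values of the same length would then arrive at the same reduce call.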
International School of Engineering
Plot No 63/A, 1st Floor, Road No 13, Film Nagar, Jubilee Hills, Hyderabad - 500033
For Individuals: +91-9502334561/63
For Corporates: +91-9618483483
Web: http://www.insofe.edu.in
Facebook: https://www.facebook.com/Insofe
Twitter: https://twitter.com/INSOFEedu
YouTube: http://www.youtube.com/InsofeVideos
SlideShare: http://www.slideshare.net/INSOFE
This presentation may contain references to findings of various reports available in the public domain. INSOFE makes no representation as to their accuracy or that the organization
subscribes to those findings.