Outline
MapReduce overview
Note: These notes are based on notes provided by Google
What is a Cloud?
Cloud = lots of storage + compute cycles nearby
Data-Intensive Computing
Data is typically stored at datacenters, with compute nodes nearby
Compute nodes run computation services
In data-intensive computing, the focus is on the data; problem areas include:
Storage
Communication bottleneck
Moving tasks to data (rather than vice-versa)
Security
Availability of Data
Scalability
Computation Services
Google → MapReduce, Sawzall
Yahoo → Hadoop, Pig Latin
Microsoft → Dryad, DryadLINQ
Motivation: Large-Scale Data Processing
Want to process lots of data (> 1 TB)
Want to parallelize across hundreds or thousands of CPUs
How to parallelize the computation
How to distribute the data
How to handle failures
Workers assigned map tasks read their input splits, parse them into key/value records, and invoke the user's Map() method.
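The map step above can be sketched as follows. This is a minimal illustration using a word-count job; `user_map`, `run_map_task`, and the line-oriented record format are assumptions for this sketch, not the actual MapReduce API:

```python
# Sketch of a map worker (illustrative, not Google's actual API).
# The worker parses its input split into (offset, line) records and
# invokes the user's Map() on each record.

def user_map(key, value, emit):
    """User-supplied Map() for word count: key = line offset,
    value = line text; emits (word, 1) for each word."""
    for word in value.split():
        emit(word, 1)

def run_map_task(input_split):
    """Parse the split into records and apply Map() to each one."""
    intermediate = []  # buffered locally before being written to disk
    emit = lambda k, v: intermediate.append((k, v))
    offset = 0
    for line in input_split.splitlines(keepends=True):
        user_map(offset, line, emit)
        offset += len(line)
    return intermediate

pairs = run_map_task("to be or not to be\n")
# Intermediate pairs like ("to", 1), ("be", 1), ...
```

In the real system the intermediate pairs are periodically written to the map worker's local disk, partitioned into R regions, rather than returned in memory.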
Execution
• Reduce workers use remote procedure calls to read the intermediate data from the local disks of the map workers
• Each reduce worker sorts all intermediate data by intermediate key
• The reduce worker iterates over the sorted intermediate data and, for each unique key encountered, passes the key and the corresponding set of intermediate values to the Reduce function
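The sort-and-iterate behavior described above can be sketched as follows (a word-count Reduce is assumed for illustration; `user_reduce` and `run_reduce_task` are hypothetical names, not the actual API):

```python
from itertools import groupby
from operator import itemgetter

def user_reduce(key, values):
    """User-supplied Reduce() for word count: sum the counts."""
    return key, sum(values)

def run_reduce_task(intermediate):
    """Sort all intermediate pairs by key, then pass each key and its
    set of values to the Reduce function, as the text describes."""
    intermediate.sort(key=itemgetter(0))
    results = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        results.append(user_reduce(key, [v for _, v in group]))
    return results

print(run_reduce_task([("to", 1), ("be", 1), ("to", 1)]))
# [('be', 1), ('to', 2)]
```

Sorting first means all occurrences of a key are adjacent, so each key's values can be streamed to Reduce in a single pass without holding the whole dataset grouped in memory.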
Master failure
If the master fails, the MapReduce task is aborted and the client is notified
Locality
The MapReduce master takes the location information of the input files into account and attempts to schedule each map task on a machine that contains a replica of the corresponding input data
The goal is to read most input data locally and thus reduce the consumption of network bandwidth
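The scheduling preference above can be sketched as follows; the data structures and function name are hypothetical, chosen only to illustrate the replica-first choice:

```python
# Sketch of locality-aware scheduling (hypothetical data structures):
# the master prefers an idle machine that holds a replica of the
# task's input split, falling back to any idle machine.

def schedule_map_task(split_replicas, idle_machines):
    """split_replicas: set of machines holding a replica of this
    split; idle_machines: machines with a free task slot.
    Returns a machine that can read the input locally if possible."""
    for machine in idle_machines:
        if machine in split_replicas:
            return machine  # local read: no network transfer needed
    # No replica holder is idle: accept a remote read.
    return idle_machines[0] if idle_machines else None

choice = schedule_map_task({"m3", "m7"}, ["m1", "m3", "m9"])
# choice == "m3": it holds a replica, so the input is read locally
```

The real master also considers rack proximity (a machine near a replica is better than a distant one), which this sketch omits.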
Task Granularity
M (the number of map tasks) and R (the number of reduce tasks) should be much larger than the number of available machines.
This improves dynamic load balancing and speeds up recovery in case of failures.
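To make the granularity point concrete, a rough back-of-the-envelope calculation (the specific numbers are illustrative assumptions, not figures from these notes):

```python
# Illustrative numbers: with many more tasks than machines, a failed
# machine's work is re-executed in many small pieces spread across
# the whole cluster instead of as one huge chunk.

machines = 2000
M = 200000                          # map tasks
tasks_per_machine = M / machines    # ~100 tasks each

# If one machine fails, its ~100 map tasks can be redistributed
# across the remaining 1999 machines, so each picks up roughly one
# extra task rather than one machine redoing 1/2000 of the job alone.
print(tasks_per_machine)  # 100.0
```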