
MapReduce Intro

The MapReduce Programming Model


Introduction and Examples
Dr. Jose María Alvarez-Rodríguez
Quality Management in Service-based Systems and Cloud
Applications
FP7 RELATE-ITN
South East European Research Center
Thessaloniki, 10th of April, 2013
1 MapReduce in a nutshell
2 Thinking in MapReduce
3 Applying MapReduce
4 Success Stories with MapReduce
5 Summary and Conclusions
MapReduce in a nutshell
Features
A programming model...
1 Large-scale distributed data processing
2 Simple but restricted
3 Parallel programming
4 Extensible
Antecedents
Functional programming
1 Inspired
2 ...but not equivalent
Example in Python
Given a list of numbers between 1 and 50, print only the even numbers

print(list(filter(lambda x: x % 2 == 0, range(1, 51))))


A list of numbers (data)
A condition (even numbers)
A function, filter, that is applied to the list (map)
...Other examples...
Example in Python
Return the sum of the squares of a list of numbers between 1 and
50

import operator
from functools import reduce
print(reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0))


reduce is equivalent to foldl in other functional languages such as Haskell
Other mathematical considerations should be taken into account, such as the kind of operator (for example, whether it is associative)...
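The remark about the kind of operator can be illustrated with a small sketch. It assumes the list is split into chunks that are reduced independently, as different workers would do; the helper name `chunked_reduce` and the chunk size are illustrative and not part of the slides. With an associative operator such as addition, the chunked result matches the sequential fold; with subtraction it does not.

```python
import operator
from functools import reduce


def chunked_reduce(op, values, chunk_size, initial):
    """Reduce each chunk independently, then reduce the partial results.

    This imitates partial results computed on different workers. It only
    matches a sequential reduce when `op` is associative and `initial`
    is its identity element.
    """
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk, initial) for chunk in chunks]
    return reduce(op, partials, initial)


squares = [x ** 2 for x in range(1, 51)]

sequential_sum = reduce(operator.add, squares, 0)
chunked_sum = chunked_reduce(operator.add, squares, 10, 0)

sequential_diff = reduce(operator.sub, squares, 0)
chunked_diff = chunked_reduce(operator.sub, squares, 10, 0)

print(sequential_sum == chunked_sum)    # addition is associative
print(sequential_diff == chunked_diff)  # subtraction is not
```

This is one reason why the reduce step of a distributed job is usually restricted to operations whose result does not depend on how the input is grouped.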
Some interesting points...
The MapReduce framework...
1 Inspired by functional programming concepts (but not equivalent)
2 Problems that can be parallelized
3 Sometimes recursive solutions
4 ...
Basic Model
MapReduce: The Programming Model and Practice, SIGMETRICS, Tutorials 2009, Google.
Map Function
Figure: Mapping creates a new output list by applying a function to
individual elements of an input list.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
Reduce Function
Figure: Reducing a list iterates over the input values to produce an
aggregate value as output.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
MapReduce Flow
Figure: High-level MapReduce pipeline.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
Figure: Detailed Hadoop MapReduce data flow.
Module 4: MapReduce, Hadoop Tutorial, Yahoo!.
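The map, shuffle and reduce stages named in the figures can be imitated in a single process. The following is a minimal sketch, not how Hadoop is implemented; `run_mapreduce`, `letter_mapper` and `count_reducer` are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a single-process imitation of the map, shuffle and reduce stages."""
    # Map: every input record may emit zero or more (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))

    # Shuffle: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce: one call per key, in sorted key order.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def letter_mapper(word):
    # Emit (letter, 1) for each character of the word.
    return [(letter, 1) for letter in word]


def count_reducer(key, values):
    # Add up the ones emitted for this letter.
    return sum(values)


letter_counts = run_mapreduce(["map", "reduce"], letter_mapper, count_reducer)
print(letter_counts)
```

In a real cluster the map calls run on different nodes and the shuffle moves data over the network, but the contract between the three stages is the same.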
Tip
What is MapReduce?
It is a framework inspired by functional programming to tackle
problems in which steps can be parallelized by applying a
divide-and-conquer approach.
Thinking in MapReduce
When should I use MapReduce?
Query
  Index and search: inverted index
  Filtering
  Classification
  Recommendations: clustering or collaborative filtering
Analytics
  Summarization and statistics
  Sorting and merging
  Frequency distribution
  SQL-based queries: group-by, having, etc.
  Generation of graphics: histograms, scatter plots
Others
  Message passing, such as Breadth-First Search or PageRank algorithms
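As an illustration of the first use case in the list, an inverted index maps each term to the documents that contain it. The sketch below is a single-process approximation under stated assumptions: the function names and the two sample documents are hypothetical, and tokenization is a plain whitespace split.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Emit (term, doc_id) for every distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def build_inverted_index(documents):
    """Group document ids by term, as the shuffle and reduce stages would."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term, emitted_id in map_document(doc_id, text):
            postings[term].add(emitted_id)
    # The reduce step here only sorts the ids collected for each term.
    return {term: sorted(ids) for term, ids in postings.items()}


documents = {
    "d1": "MapReduce is a programming model",
    "d2": "Hadoop implements the MapReduce model",
}
index = build_inverted_index(documents)
print(index["mapreduce"])
print(index["hadoop"])
```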
How Google uses MapReduce (80% of data processing)
Large-scale web search indexing
Clustering problems for Google News
Producing reports for popular queries, e.g. Google Trends
Processing of satellite imagery data
Language model processing for statistical machine translation
Large-scale machine learning problems
. . .
Comparison of MapReduce and other approaches
MapReduce: The Programming Model and Practice, SIGMETRICS, Tutorials 2009, Google.
Evaluation of MapReduce and other approaches
MapReduce: The Programming Model and Practice, SIGMETRICS, Tutorials 2009, Google.
Apache Hadoop
MapReduce definition
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers
using simple programming models.
Figure: Apache Hadoop Logo.
Tip
What can I do in MapReduce?
Three main functions:
1 Querying
2 Summarizing
3 Analyzing
. . . large datasets in off-line mode for boosting other on-line processes.
Applying MapReduce
MapReduce in Action
MapReduce Patterns
1 Summarization
2 Filtering
3 Data Organization (sort, merging, etc.)
4 Relational-based (join, selection, projection, etc.)
5 Iterative Message Passing (graph processing)
6 Others (depending on the implementation):
  Simulation of distributed systems
  Cross-correlation
  Metapatterns
  Input-output
  . . .
Overview (stages)-Counting Letters
Summarization
Types
1 Numerical summarizations
2 Inverted index
3 Counting and counters
Numerical Summarization-I
Description
A general pattern for calculating aggregate statistical values over
your data.
Intent
Group records together by a key field and calculate a numerical
aggregate per group to get a top-level view of the larger data set.
Numerical Summarization-II
Applicability
To deal with numerical data or counting.
To group data by specific fields.
Examples
1 Word count
2 Record count
3 Min/Max/Count
4 Average/Median/Standard deviation
5 . . .
Numerical Summarization-Pseudocode
class Mapper
    method Map(recordid id, record r)
        for all term t in record r do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2,...])
        sum = 0
        for all count c in [c1, c2,...] do
            sum = sum + c
        Emit(term t, count sum)
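The pseudocode above can be tried out in plain Python. This is a minimal single-process sketch: sorting the emitted pairs stands in for the shuffle stage, and the function names and sample records are illustrative only.

```python
from itertools import groupby
from operator import itemgetter


def map_record(record_id, record):
    """Mapper: emit (term, 1) for every term in the record."""
    for term in record.split():
        yield term, 1


def reduce_term(term, counts):
    """Reducer: add up all the counts emitted for one term."""
    total = 0
    for count in counts:
        total += count
    return term, total


def word_count(records):
    # Map stage over all (record_id, record) pairs.
    pairs = [pair for rid, rec in enumerate(records) for pair in map_record(rid, rec)]
    # Sorting by key stands in for the shuffle stage.
    pairs.sort(key=itemgetter(0))
    # Reduce stage: one call per distinct term.
    return dict(
        reduce_term(term, (count for _, count in group))
        for term, group in groupby(pairs, key=itemgetter(0))
    )


result = word_count(["to be or not to be", "to do"])
print(result)
```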
Overview-Word Counter
Numerical Summarization-Word Counter

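// Reusable output objects (assumed fields; the slide does not show them).
private static final IntWritable one = new IntWritable(1);
private final Text word = new Text();

// Mapper: emit (word, 1) for every token in the input line.
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
    }
}

// Reducer: add up the counts received for each word.
public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
        sum += val.get();
    }
    context.write(key, new IntWritable(sum));
}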


Example-II
Min/Max
Given a list of tweets (username, date, text), determine the first and
last time a user commented and the number of comments.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Min/Max
In the map phase each record carries a single creation date, so the min and max are the same.
Example II-Min/Max, function Map

// Mapper: emit (userId, tuple) where min and max are both the comment date.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Map<String, String> parsed = MRDPUtils.parse(value.toString());
    String strDate = parsed.get(MRDPUtils.CREATION_DATE);
    String userId = parsed.get(MRDPUtils.USER_ID);
    if (strDate == null || userId == null) {
        return; // skip records without a date or a user
    }
    Date creationDate;
    try {
        creationDate = MRDPUtils.frmt.parse(strDate);
    } catch (ParseException e) {
        // An overriding map() cannot declare ParseException, so skip bad dates.
        return;
    }
    outTuple.setMin(creationDate);
    outTuple.setMax(creationDate);
    outTuple.setCount(1);
    outUserId.set(userId);
    context.write(outUserId, outTuple);
}


Example II-Min/Max, function Reduce

// Reducer: keep the earliest min, the latest max and the total count per user.
public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
        throws IOException, InterruptedException {
    result.setMin(null);
    result.setMax(null);
    int sum = 0;
    for (MinMaxCountTuple val : values) {
        if (result.getMin() == null
                || val.getMin().compareTo(result.getMin()) < 0) {
            result.setMin(val.getMin());
        }
        if (result.getMax() == null
                || val.getMax().compareTo(result.getMax()) > 0) {
            result.setMax(val.getMax());
        }
        sum += val.getCount();
    }
    result.setCount(sum);
    context.write(key, result);
}
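The same min/max/count logic can be sketched in Python to make the data flow easier to follow. The sample comments, the date format and the function names are assumptions made for illustration; they are not taken from the repository linked above.

```python
from datetime import datetime


def map_comment(user, date_string):
    """Mapper: each record yields a tuple whose min and max are the same date."""
    date = datetime.strptime(date_string, "%Y-%m-%d")
    return user, (date, date, 1)


def reduce_user(tuples):
    """Reducer: fold (min, max, count) tuples into a single tuple."""
    earliest, latest, total = None, None, 0
    for low, high, count in tuples:
        if earliest is None or low < earliest:
            earliest = low
        if latest is None or high > latest:
            latest = high
        total += count
    return earliest, latest, total


# Hypothetical sample data: (username, date) pairs.
comments = [
    ("ana", "2013-04-10"),
    ("bob", "2013-03-01"),
    ("ana", "2013-01-05"),
    ("ana", "2013-02-20"),
]

# Group the mapper output by user, standing in for the shuffle stage.
grouped = {}
for user, date_string in comments:
    key, value = map_comment(user, date_string)
    grouped.setdefault(key, []).append(value)

summary = {user: reduce_user(values) for user, values in grouped.items()}
print(summary["ana"])
```

Because the reducer's output has the same shape as its input, the same function could also be applied to partial groups before the final reduce.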


Example-III
Average
Given a list of tweets (username, date, text) determine the average
comment length per hour of day.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Average
Example III-Average, function Map

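// Mapper: emit (hour of day, (count = 1, average = comment length)).
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Map<String, String> parsed = MRDPUtils.parse(value.toString());
    String strDate = parsed.get(MRDPUtils.CREATION_DATE);
    String text = parsed.get(MRDPUtils.TEXT);
    if (strDate == null || text == null) {
        return; // skip records without a date or a text
    }
    Date creationDate;
    try {
        creationDate = MRDPUtils.frmt.parse(strDate);
    } catch (ParseException e) {
        // An overriding map() cannot declare ParseException, so skip bad dates.
        return;
    }
    // Date.getHours() is deprecated but kept here as in the original slide.
    outHour.set(creationDate.getHours());
    outCountAverage.setCount(1);
    outCountAverage.setAverage(text.length());
    context.write(outHour, outCountAverage);
}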


Example III-Average, function Reduce

// Reducer: combine (count, average) pairs into a count-weighted average.
public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
        Context context) throws IOException, InterruptedException {
    float sum = 0;
    float count = 0;
    for (CountAverageTuple val : values) {
        sum += val.getCount() * val.getAverage();
        count += val.getCount();
    }
    result.setCount(count);
    result.setAverage(sum / count);
    context.write(key, result);
}
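The reduce function above multiplies each average by its count before summing. A short sketch shows why the count has to travel with the average; the worker data and the name `merge_averages` are hypothetical.

```python
def merge_averages(partials):
    """Combine (count, average) pairs into one (count, average) pair.

    Each average is weighted by its count, mirroring the reduce function above.
    """
    total = 0.0
    count = 0
    for partial_count, partial_average in partials:
        total += partial_count * partial_average
        count += partial_count
    return count, total / count


# Hypothetical comment lengths seen by two different workers for the same hour.
worker_a = [10, 20, 30]   # average 20 over 3 comments
worker_b = [100]          # average 100 over 1 comment

partial_a = (len(worker_a), sum(worker_a) / len(worker_a))
partial_b = (len(worker_b), sum(worker_b) / len(worker_b))

count, average = merge_averages([partial_a, partial_b])
naive = (partial_a[1] + partial_b[1]) / 2

print(count, average)  # weighted result over all 4 comments
print(naive)           # averaging the averages gives a different number
```

Averaging is not associative on its own, so emitting only the average would give wrong results as soon as partial results are merged.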


Numerical Summarization-Other approaches
Relation to SQL

SELECT MIN(numcol1), MAX(numcol1), COUNT(*)
FROM table GROUP BY groupcol2;


Implementation in Pig

b = GROUP a BY groupcol2;
c = FOREACH b GENERATE group, MIN(a.numcol1),
    MAX(a.numcol1), COUNT_STAR(a);


Filtering
Types
1 Filtering
2 Top N records
3 Bloom filtering
4 Distinct
Filtering-I
Description
It evaluates each record separately and decides, based on some
condition, whether it should stay or go.
Intent
Filter out records that are not of interest and keep ones that are.
Filtering-II
Applicability
To collate data
Examples
1 Closer view of dataset
2 Data cleansing
3 Tracking a thread of events
4 Simple random sampling
5 Distributed Grep
6 Removing low-scoring data
7 Log Analysis
8 Data Querying
9 Data Validation
10 . . .
Filtering-Pseudocode
class Mapper
    method Map(recordid id, record r)
        field f = extract(r)
        if predicate(f)
            Emit(recordid id, value(r))

class Reducer
    method Reduce(recordid id, values [r1, r2,...])
        // Optional: any aggregation over the surviving values
        Emit(recordid id, aggregate(values))
Example-IV
Distributed Grep
Given a list of tweets (username, date, text), determine the tweets
that contain a given word.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Distributed Grep
Example IV-Distributed Grep, function Map

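// Mapper: write the record through when its text matches the configured word.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Map<String, String> parsed = MRDPUtils.parse(value.toString());
    String txt = parsed.get(MRDPUtils.TEXT);
    if (txt == null) {
        return; // skip records without a text field
    }
    // The search term is read from the job configuration ("mapregex").
    String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex")
            + "(.)*\\b.*";
    if (txt.matches(mapRegex)) {
        context.write(NullWritable.get(), value);
    }
}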


...and the Reduce function?
In this case it is not necessary: the values emitted by the map function are written directly to the output.
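A map-only filter of this kind is easy to imitate in Python. The sketch below is an approximation, not a port: unlike the Java regular expression above, which also accepts longer words that start with the search term, it matches whole words only. The sample tweets and the name `grep_mapper` are hypothetical.

```python
import re


def grep_mapper(lines, word):
    """Map-only job: yield the lines that contain `word` as a whole word."""
    # re.escape keeps any special characters in the word from being
    # interpreted as regular expression syntax.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    for line in lines:
        if pattern.search(line):
            yield line


# Hypothetical sample tweets.
tweets = [
    "learning hadoop today",
    "hadoopy is a python wrapper",
    "nothing to see here",
    "Hadoop in capital letters",
]

matches = list(grep_mapper(tweets, "hadoop"))
print(matches)
```

The match is case-sensitive; passing `re.IGNORECASE` to `re.compile` would relax that.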
Example-V
Top 5
Given a list of tweets (username, date, text), determine the 5 users
who wrote the longest tweets.
Implementation
See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro
Overview - Top 5
Example V-Top 5, function Map

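// Keeps the MAX_TOP longest tweets seen by this mapper, keyed by text length.
// Note: tweets with the same length overwrite each other in the TreeMap.
private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    Map<String, String> parsed = MRDPUtils.parse(value.toString());
    if (parsed == null) {
        return;
    }
    String userId = parsed.get(MRDPUtils.USER_ID);
    String text = parsed.get(MRDPUtils.TEXT);
    if (userId == null || text == null) {
        return;
    }
    // "Reputation" is the tweet length: longer tweets rank higher.
    repToRecordMap.put(text.length(), new Text(value));
    if (repToRecordMap.size() > MAX_TOP) {
        // Drop the shortest tweet so only MAX_TOP entries remain.
        repToRecordMap.remove(repToRecordMap.firstKey());
    }
}

// Not shown on the slide; assumed here so that the local top records are
// actually emitted once the mapper has seen all its input.
protected void cleanup(Context context)
        throws IOException, InterruptedException {
    for (Text t : repToRecordMap.values()) {
        context.write(NullWritable.get(), t);
    }
}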


Example V-Top 5, function Reduce

// Reducer: merge the local top records and write the global top, longest first.
public void reduce(NullWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
    for (Text value : values) {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(),
                new Text(value));
        if (repToRecordMap.size() > MAX_TOP) {
            repToRecordMap.remove(repToRecordMap.firstKey());
        }
    }
    for (Text t : repToRecordMap.descendingMap().values()) {
        context.write(NullWritable.get(), t);
    }
}
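The two-level structure of the Top N pattern (a local top per mapper, then a global top in a single reducer) can be sketched with Python's `heapq` module. The splits and function names are hypothetical; unlike a map keyed by length, this version keeps records that happen to have the same length.

```python
import heapq


def local_top_n(records, n):
    """Mapper side: keep only the n longest records of one input split."""
    return heapq.nlargest(n, records, key=len)


def global_top_n(local_results, n):
    """Reducer side: merge the local candidates and keep the n longest."""
    candidates = [record for result in local_results for record in result]
    return heapq.nlargest(n, candidates, key=len)


# Hypothetical input splits, one list per mapper.
split_a = ["aa", "aaaaaa", "a"]
split_b = ["bbbb", "bbbbbbbb", "bbb"]

top = global_top_n([local_top_n(split_a, 2), local_top_n(split_b, 2)], 2)
print(top)
```

Each mapper sends at most n records to the reducer, which keeps the amount of shuffled data small regardless of the input size.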


Filtering-Other approaches
Relation to SQL

SELECT * FROM table WHERE colvalue < VALUE;


Implementation in Pig

b = FILTER a BY colvalue < VALUE;


Tip
How can I use and run a MapReduce framework?
You should identify what kind of problem you are addressing and
apply a design pattern to be implemented in a framework such
as Apache Hadoop.
Success Stories with MapReduce
Tip
Who is using MapReduce?
Companies that deal with Big Data analytics problems, such as:
Cloudera
Datasalt
Elasticsearch
. . .
Apache Hadoop-Related Projects
More tips
FAQ
MapReduce is a framework based on a simple programming model
...to deal with large datasets in a distributed fashion
...providing scalability, replication, fault tolerance, etc.
Apache Hadoop is not a database
New frameworks on top of Hadoop for specific tasks: querying, analysis, etc.
Other similar frameworks: Storm, Signal/Collect, etc.
. . .
Summary and Conclusions
Summary
Conclusions
What is MapReduce?
It is a framework inspired by functional programming to tackle problems in which steps can be parallelized
by applying a divide-and-conquer approach.
What can I do in MapReduce?
Three main functions:
1 Querying
2 Summarizing
3 Analyzing
. . . large datasets in off-line mode for boosting other on-line processes.
How can I use and run a MapReduce framework?
You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a
framework such as Apache Hadoop.
What's next?
. . .
Concatenate MapReduce jobs
Optimization using combiners and parameter tuning (partition size, etc.)
Pipelining with other languages such as Python
Hadoop in Action: more examples, etc.
New trending problems (image/video processing)
Real-time processing
. . .
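As a pointer for the combiner item in the list above, the sketch below pre-aggregates word counts inside each mapper so that fewer pairs reach the reducer. It is a single-process illustration with hypothetical names and input, not Hadoop's combiner API.

```python
from collections import Counter


def map_with_combiner(lines):
    """Count words inside one mapper before anything is sent to the reducers."""
    local = Counter()
    for line in lines:
        local.update(line.split())
    return local


def reduce_counts(partial_counts):
    """Merge the pre-aggregated counters coming from every mapper."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


# Hypothetical input splits, one list of lines per mapper.
split_a = ["big data big clusters"]
split_b = ["big jobs", "data data"]

partials = [map_with_combiner(split_a), map_with_combiner(split_b)]
totals = reduce_counts(partials)

pairs_without_combiner = sum(len(line.split()) for line in split_a + split_b)
pairs_with_combiner = sum(len(partial) for partial in partials)
print(totals["big"], pairs_without_combiner, pairs_with_combiner)
```

The saving grows with the number of repeated keys per split; it only works because counting is associative and commutative.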
