6-Leson Map Reduce

LESSON-6
MapReduce
Class will start with refreshing the previous class with QA….
Today topics:
 What is MapReduce in Hadoop?

 How MapReduce Works?
 MapReduce Architecture
 How MapReduce Organizes Work?
 Exercise
What is MapReduce
MapReduce is a programming model suitable for processing of huge data. Hadoop is capable
of running MapReduce programs written in various languages. MapReduce programs are
parallel in nature, thus are very useful for performing large-scale data analysis using multiple
machines in the cluster.
MapReduce programs work in two phases:
1. Map phase
2. Reduce phase.
An input to each phase is key-value pairs. In addition, every programmer needs to specify two
functions: map function and reduce function.
How MapReduce Works?
The whole process goes through four phases of execution namely, splitting, mapping,
shuffling, and reducing.
Let's understand this with an example –
1
Consider we have following input data for the purpose of word count through Map Reduce Pr
ogram.
Welcome to Hadoop Class

Hadoop is good
Hadoop is bad
2
MapReduce Architecture
MapReduce Architecture explained in detail
 One map task is created for each split which then executes map function for each
record in the split.
 It is always beneficial to have multiple splits because the time taken to process a split is
small as compared to the time taken for processing of the whole input. When the splits
are smaller, the processing is better to load balanced since we are processing the splits
in parallel.
 However, it is also not desirable to have splits too small in size. When splits are too
small, the overload of managing the splits and map task creation begins to dominate the
total job execution time.
 For most jobs, it is better to make a split size equal to the size of an HDFS block (which
is 128 MB, by default).
 Execution of map tasks results into writing output to a local disk on the respective node
and not to HDFS.
 Reason for choosing local disk over HDFS is, to avoid replication which takes place in
case of HDFS store operation.
 Map output is intermediate output which is processed by reduce tasks to produce the
final output.
 Once the job is complete, the map output can be thrown away. So, storing it in HDFS
with replication becomes overkill.
 In the event of node failure, before the map output is consumed by the reduce task,
Hadoop reruns the map task on another node and re-creates the map output.
 An output of every map task is fed to the reduce task. Map output is transferred to the
machine where reduce task is running.
 On this machine, the output is merged.
 Unlike the map output, reduce output is stored in HDFS (the first replica is stored on the
local node and other replicas are stored on off-rack nodes). So, writing the reduce
output
How MapReduce Organizes Work?
Hadoop divides the job into tasks. There are two types of tasks:
1. Map tasks (Splits & Mapping)

2. Reduce tasks (Shuffling, Reducing)
As mentioned above, the complete execution process (execution of Map and Reduce tasks,
both) is controlled by two types of entities called a
1. Jobtracker: Acts like a master (responsible for complete execution of submitted job)
2. Task Trackers: Acts like slaves, each of them performing the job
3
For every job submitted for execution in the system, there is one Jobtracker that resides
on Namenode and there are multiple tasktrackers which reside on Datanode.
 Job tracker divide the job into multiple tasks which are then run onto multiple data
nodes in a cluster.
 It is the responsibility of job tracker to coordinate the activity by scheduling tasks to run
on different data nodes.
 Execution of individual task is then to look after by task tracker, which resides on every
data node executing part of the job.
 Task tracker's responsibility is to send the progress report to the job tracker.
4
 In addition, task tracker periodically sends 'heartbeat' signal to the Jobtracker so as to
notify him of the current state of the system.
 Thus job tracker keeps track of the overall progress of each job. In the event of task
failure, the job tracker can reschedule it on a different task tracker.
Mapreduce Exercise (Wordcount from file01 and file02)
su hadoopuser # switch to hadoopuser
editor ~/.bashrc # environment variable
# append the below line

export HADOOP_CLASSPATH=/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar
source ~/.bashrc # sourcing i.e. setting the change
hadoop fs -mkdir /user # create directory “user” in hdfs
#Check if directory got created

http://localhost:50070/explorer.html#/
hadoop fs -mkdir /user/joe # create folder “joe” under “user”
hadoop fs -mkdir /user/joe/wordcount # create file “wordcount” under “joe”
hadoop fs -mkdir /user/joe/wordcount/input # create “input” under “wordcount”
# check with localhost:50070
nano file01 # create a file named “file01”
Hello World Bye World # append it
nano file02 # create a file named “file02”
Hello Hadoop Goodbye Hadoop # append it
hadoop fs -put file01 /user/joe/wordcount/input # Move file01 to hdfs
hadoop fs -put file02 /user/joe/wordcount/input # Move file02 to hdfs
hadoop fs -cat /user/joe/wordcount/input/file01 # view the content of “ file01”
5
Hello World Bye World # will show the output
hadoop fs -cat /user/joe/wordcount/input/file01 # view the content of “file02”
Hello Hadoop Goodbye Hadoop # will show the output
nano WordCount.java # open file for copying java code
#Paste the below java code altogether

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class WordCount {
public static class Map extends MapReduceBase implements

Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
value.set(tokenizer.nextToken());
output.collect(value, new IntWritable(1));
}
}
}
public static class Reduce extends MapReduceBase implements

Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));

}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);

conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path(args[0]));
6
FileOutputFormat.setOutputPath(conf, new Path(args[1]));
JobClient.runJob(conf);
}
}
hadoop com.sun.tools.javac.Main WordCount.java # Compile the code

# incase any problem, use the source code
by opening the file with google docs or use
the java code given at the end of this lesson
jar cf wc.jar WordCount*.class # Create jar file from class files (jar is
zipped file for many .class files)
hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

# Running the final code on terminal
hadoop fs -cat /user/joe/wordcount/output/part-r-00000 #Checking output on terminal
# or hadoop fs -cat /user/joe/wordcount/output/part-00000 whichever works.
Below is a similar link as ref:
https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-
core/MapReduceTutorial.html
If there is any problem with java code, the below code can be used.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper

extends Mapper<Object, Text, Text, IntWritable>{
7
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context

) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer

extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,

Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "WordCount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

6-Leson Map Reduce

Diunggah oleh

Informasi Dokumen

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

6-Leson Map Reduce

Diunggah oleh

Hak Cipta:

Format Tersedia

LESSON-6

 What is MapReduce in Hadoop?

MapReduce programs work in two phases:

How MapReduce Works?

Let's understand this with an example –

Welcome to Hadoop Class

MapReduce Architecture explained in detail

How MapReduce Organizes Work?

1. Map tasks (Splits & Mapping)

Mapreduce Exercise (Wordcount from file01 and file02)

su hadoopuser # switch to hadoopuser

editor ~/.bashrc # environment variable

# append the below line

source ~/.bashrc # sourcing i.e. setting the change

hadoop fs -mkdir /user # create directory “user” in hdfs

#Check if directory got created

hadoop fs -mkdir /user/joe # create folder “joe” under “user”

hadoop fs -mkdir /user/joe/wordcount # create file “wordcount” under “joe”

hadoop fs -mkdir /user/joe/wordcount/input # create “input” under “wordcount”

# check with localhost:50070

nano file01 # create a file named “file01”

Hello World Bye World # append it

nano file02 # create a file named “file02”

Hello Hadoop Goodbye Hadoop # append it

hadoop fs -put file01 /user/joe/wordcount/input # Move file01 to hdfs

hadoop fs -put file02 /user/joe/wordcount/input # Move file02 to hdfs

hadoop fs -cat /user/joe/wordcount/input/file01 # view the content of “ file01”

hadoop fs -cat /user/joe/wordcount/input/file01 # view the content of “file02”

Hello Hadoop Goodbye Hadoop # will show the output

nano WordCount.java # open file for copying java code

#Paste the below java code altogether

public class WordCount {

public static class Map extends MapReduceBase implements

String line = value.toString();

public static class Reduce extends MapReduceBase implements

output.collect(key, new IntWritable(sum));

public static void main(String[] args) throws Exception {

JobConf conf = new JobConf(WordCount.class);

FileInputFormat.setInputPaths(conf, new Path(args[0]));

hadoop com.sun.tools.javac.Main WordCount.java # Compile the code

hadoop jar wc.jar WordCount /user/joe/wordcount/input /user/joe/wordcount/output

hadoop fs -cat /user/joe/wordcount/output/part-r-00000 #Checking output on terminal

# or hadoop fs -cat /user/joe/wordcount/output/part-00000 whichever works.

Below is a similar link as ref:

public class WordCount {

public static class TokenizerMapper

public void map(Object key, Text value, Context context

public static class IntSumReducer

public void reduce(Text key, Iterable<IntWritable> values,

public static void main(String[] args) throws Exception {

Anda mungkin juga menyukai