
Hadoop and Big Data, 4 CSE-A&B, RamaChandra College of Engg, Eluru.

Dr David Raju

UNIT-III
Writing MapReduce Programs: A Weather Dataset, Understanding Hadoop API for MapReduce Framework (Old
and New), Basic programs of Hadoop MapReduce: Driver code, Mapper code, Reducer code, RecordReader,
Combiner, Partitioner

Reference:

Hadoop: The Definitive Guide by Tom White, 3rd Edition, O'Reilly

MapReduce: MapReduce is one of the major components of Hadoop. It is the framework for writing applications
which process huge data.

Advantages of using MapReduce jobs:

1. It is simple: the programs can be developed in languages such as C, C++, Java or Python, and they are easy to run. (Easy)
2. A traditional program may take days to process huge data, but MapReduce jobs run in parallel and speed up the execution, reducing the time to minutes. (Speed)
3. If a failure occurs at any data node, the same (key, value) pairs (a copy) can be recovered from another machine. (Recovery)
4. MapReduce works for huge data, up to petabytes. (Scalability)
5. MapReduce programs run at the data node where the data physically exists. This reduces network traffic and improves performance. (Performance)

MapReduce framework:

1. The Hadoop file system stores the data among various data nodes in the Hadoop cluster.
2. This data may be structured (e.g. relational databases), semi-structured (e.g. XML files) or unstructured (e.g. audio, video and text data).
3. MapReduce is the framework for writing applications which process this data.
4. It is only worthwhile if the data size is in gigabytes, terabytes or more (never use a big net for a small fish).
5. In this framework, the huge data is split into small independent chunks, called input splits, which reside at various data nodes.
6. Suppose the file is a text file of size 200 MB and each block on a data node is 64 MB; then this file is split into 4 input splits/blocks which reside on 4 data nodes (the HDFS process).
7. Our aim is to get the word count output, say (hadoop,3), (is,4), ...
8. A MapReduce program has 3 stages:
1. Mapper stage: For each input split one mapper (program) is created. The input to this mapper is a (key, value) pair and its output is also a (key, value) pair.
2. Shuffle and sort: Shuffling is the process by which intermediate data from the mappers is transferred to 0, 1 or more reducers. Each reducer receives one or more keys and their associated values, depending on the number of reducers (for a balanced load). The values associated with each key are then locally sorted.

The shuffle and sort phases are responsible for two primary activities: determining the reducer that should receive each map output (key, value) pair (called partitioning), and ensuring that, for a given reducer, all its input keys are sorted.

Map output: (hadoop,1), (is,1), (good,1), (hadoop,1), (is,1), (easy,1)
Shuffled and sorted reduce input: (easy,[1]), (good,[1]), (hadoop,[1,1]), (is,[1,1])

Fig. 1. Sample output of the mapper and input to the reducer after shuffle and sort, for the word count program.

3. Reducer stage: The reducer obtains (key, [list of values]) pairs, sorted by the key. The value list contains all values with the same key produced by the mappers. In the word count problem, the input is of the form

(hadoop, [1,1,1])
(is, [1,1,1])
Each reducer emits zero, one or multiple output (key, value) pairs for each input (key, value) pair.
The output is of the form
(easy,1),
(good,1),
(hadoop,3),
(is,3)
9. The 4 input splits are on 4 data nodes. For each input split one mapper is created.
10. JobTracker: It replicates the MapReduce code as many times as there are data nodes (task trackers).
11. Hadoop runs MapReduce only on (key (k), value (v)) pairs, i.e. both the Mapper and the Reducer work only on (k,v) pairs; they do not know anything other than this.
12. The input splits are plain text, which has to be converted to (k,v) pairs. For this conversion Hadoop uses the RecordReader, a built-in interface, for each input split. Hadoop always processes data line by line. Each line in a Hadoop input split file is called a record (not a line). These records have to be read for processing, so the RecordReader comes into play.
13. How the RecordReader splits the data: The RR splits the data depending upon the input format of each record.
Basically, there are 4 input file formats:
1. TextInputFormat
2. KeyValueTextInputFormat
3. SequenceFileInputFormat
4. SequenceFileAsTextInputFormat
These 4 are predefined classes. If the input file format is not specified, by default it is taken as TextInputFormat. If the input file format is of type 1, the RR splits the data as (byte offset, entire line).
14. Let the file named file.txt have 4 input splits as shown.
File name: file.txt, size: assume 200 MB

First input split:
hi how is hadoop
how you learn hadoop

Second input split:
how is MapReduce programs
can you learn it easily

Third input split:
what about MapReduce frame work
is it good to learn

Fourth input split:
I love hadoop and big data

Fig. 2. The input file in HDFS. It contains 4 input splits of size 64 MB (the default block size in Hadoop), 64 MB, 64 MB and 8 MB respectively.

If the first line of the input split is "hi how is hadoop", it is converted to (0 (byte offset), hi how is hadoop (entire line))
and passed to the Mapper as input.
After converting the first record, the next record (line) is read and converted as (18, how you learn hadoop); 18 is
the number of characters in the first line including the end-of-line character (\n). While the RR is reading/converting the second
line, the Mapper works on the first output of the RR; the mapper runs as many times as there are records in the input split.

Note: The entire Java Collections framework works on object types (it works on classes, and classes are instantiated as objects), not on primitive types. Similarly, the Mapper here reads only object-type (k,v) pairs.
How does this conversion take place? In Java it is done by wrapper classes; a wrapper class converts a primitive data type into an object of its type, i.e. int to Integer. Up to Java 1.4 we worked with utility methods such as parseInt(), toString() and valueOf() to convert between int and Integer. From Java 5 onwards, auto-boxing and auto-unboxing convert a primitive type to its object type and back automatically. Hadoop, being a much more recent framework, does not provide such automatic conversion for its own box types, so we have to convert manually, e.g. int to IntWritable as shown below.

int → IntWritable : new IntWritable(int)
long → LongWritable : new LongWritable(long)

To convert an object back to a primitive type in Hadoop:

IntWritable → int : get()

Wrapper class     Primitive type     Box class

Integer           int                IntWritable
Long              long               LongWritable
Double            double             DoubleWritable
Character         char               Text
String            String             Text

Table 1. Java wrapper classes and Hadoop box classes for the primitive types.

Similarly we can convert and revert for long, double, etc. For String:

String → Text : new Text(String)
Text → String : toString()
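
As a quick illustration of these conversions, here is a minimal sketch in plain Java (the class name BoxingDemo is an assumption for illustration; the Hadoop classes are the real org.apache.hadoop.io types):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class BoxingDemo {
    public static void main(String[] args) {
        // primitive -> Hadoop box type
        IntWritable count = new IntWritable(3);        // int -> IntWritable
        LongWritable offset = new LongWritable(18L);   // long -> LongWritable
        Text word = new Text("hadoop");                // String -> Text

        // Hadoop box type -> primitive
        int c = count.get();                           // IntWritable -> int
        long o = offset.get();                         // LongWritable -> long
        String w = word.toString();                    // Text -> String

        // the box classes implement WritableComparable, so keys can be compared and sorted
        System.out.println(new IntWritable(1).compareTo(new IntWritable(2))); // negative value
        System.out.println(c + " " + o + " " + w);
    }
}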

15. The input to the mapper is (0, hi how is hadoop). The key can grow to a large number, so it is considered as a long, and the value is of String type. The Mapper and Reducer programs always run on objects (object/box types), so these two types have to be converted to objects. The key, 0, is converted to LongWritable.
Note: (not IntWritable) because this key is a byte offset; it may produce huge numbers in the range of long (8 bytes), and int (4 bytes) may not be sufficient.
The value, hi how is hadoop, is converted to Text. So if the input file format is TextInputFormat, the RR converts each input record into a (LongWritable, Text) pair.

Note: If the input file format is not TextInputFormat (TIF), explicitly indicate the format type in the Driver code; otherwise the file is read with the default TIF.
Suppose the input file format is other than TIF, as below:

000 David 10000

001 Ramesh 20000
002 Kala 40000

Table 2. Sample data of type KeyValueTextInputFormat

If the input format is specified in the driver code as KeyValueTextInputFormat, then the (k,v) pair looks like (000, David 10000), i.e. (000 (the first string of the record, up to the hidden \t), David 10000 (the text after the \t)).
If KeyValueTextInputFormat is not indicated in the driver code, then the (k,v) pair is (0, 000 David 10000), i.e. (byte offset, entire line).
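
When a non-default input format is needed, it is declared in the driver. A minimal hedged sketch follows; the driver class name and paths are assumptions, and it assumes a Hadoop release where KeyValueTextInputFormat is available in the new org.apache.hadoop.mapreduce.lib.input package (older 0.20-based releases provide it only in the old org.apache.hadoop.mapred API):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyValueDriver {                  // hypothetical driver name
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setJarByClass(KeyValueDriver.class);
        job.setJobName("KeyValue input example");

        // tell the framework each record is "key \t rest of line" instead of a plain line
        job.setInputFormatClass(KeyValueTextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // the mapper/reducer classes and output types would be set here,
        // exactly as in the word count and MaxTemperature drivers discussed below
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}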

16. The output of the RR is read as input to the mapper with type (LongWritable, Text). The Mapper program processes the record and produces output such as (hi,1), (how,1), (is,1), (hadoop,1), which is of type (Text, IntWritable). The value is IntWritable, not LongWritable, because a word count is almost always a small value within the int range. These input and output types have to be indicated in the Driver code, and they may vary from program to program.
17. Suppose the mapper code is 5 KB; it is shared and runs in parallel at all input splits, i.e. at all data nodes/task trackers. This parallel computing improves the performance of the execution.
18. The output of the mapper is the input for the reducer code. The mapper output keys (k) may have duplicates. In the reducer input, the k values must not be duplicated, although the v values can repeat. This matters because only when k has no duplicates can the (k,v) pairs be counted and sorted with respect to k. To obtain these properties the intermediate data goes through 2 more phases: shuffle and sort.
19. The data between the mapper and reducer code is called intermediate data.
20. To avoid duplicate keys, the data goes through the shuffle phase, which combines all values associated with a single identical key. For the first input split of file.txt:
(hi,1), (how,[1,1]), (is,1), (hadoop,[1,1]), (you,1), (learn,1)
Considering the outputs from the mappers of all 4 input splits:
(hi,1), (how,[1,1,1]), (is,[1,1,1]), (hadoop,[1,1,1]), ... etc.
Here no k value is repeated; this works because all the box classes are by default implemented with a comparison interface that compares the key occurrences.
21. In Hadoop the (k,v) pairs are box types/classes such as LongWritable and IntWritable. These box classes by default implement the WritableComparable interface, which provides a method for comparing one key with another. All keys are compared with one another, which implies there are no duplicate keys, and sorting is done automatically.
22. The reducer code executes for each (k,v) pair one after the other.
23. In Java servlets, doPost() and doGet() are executed as many times as there are requests; similarly, the number of times the mapper and reducer code executes is proportional to the number of (k,v) pairs.
24. Regarding the coding, this shuffling and sorting is handled in the reducer by walking through the grouped values with the Iterator interface.
25. The default reducer in Hadoop is the IdentityReducer, which only performs sorting, not shuffling. The other option is a user-defined reducer code, which contains both the shuffling and the sorting logic; this is preferred to get both shuffle and sort.
26. The reducer gets the input from the mapper, shuffle and sort take place, and the logic runs to sum up the v values for each k and give the output to the RecordWriter. Ex: (hi,1), (how,3), (is,3), etc.
27. The RecordWriter only writes the (k,v) pairs, one by one, to the output file (named part-00000). The other files in the directory are _SUCCESS and _logs.
28. This file in turn is stored in the directory (OutPutDirectory) provided by the user while running the jar file. Ex: $ hadoop jar test.jar file.txt OutPutDirectory
29. In the word count program, the output file contains ..(about,1), (and,1), (big,1), (can,1)…..(hadoop,3), (hi,1), (how,3)….etc; the keys are sorted in ascending alphabetical order.
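
Before the flow chart, here is a minimal sketch of the word count Mapper and Reducer just described, written against the new org.apache.hadoop.mapreduce API. The class names MapperCode and ReducerCode match the file names used in the running steps further below; the tokenising logic itself is an illustrative assumption, not code from these notes:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (byte offset, line) -> (word, 1) for every word in the line
class MapperCode extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (token.length() > 0) {
                word.set(token);
                context.write(word, ONE);        // e.g. (hadoop, 1)
            }
        }
    }
}

// Reducer: (word, [1,1,...]) -> (word, total count)
class ReducerCode extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum)); // e.g. (hadoop, 3)
    }
}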


The flow chart of the above discussion for all the 4 input splits is given below

Input file: file.txt, size: assume 200 MB

Input split 1:
hi how is hadoop
how you learn hadoop
RecordReader output (byte offset, entire line):
(0, hi how is hadoop)
(18, how you learn hadoop)
Mapper output:
(hi,1) (how,1) (is,1) (hadoop,1) (how,1) (you,1) (learn,1) (hadoop,1)

Input split 2:
how is MapReduce programs
can you learn it easily
RecordReader output:
(0, how is MapReduce programs)
(27, can you learn it easily)
Mapper output:
(how,1) (is,1) (MapReduce,1) (programs,1) (can,1) (you,1) (learn,1) (it,1) (easily,1)

Input split 3:
what about MapReduce frame work
is it good to learn
RecordReader output:
(0, what about MapReduce frame work)
(33, is it good to learn)
Mapper output:
(what,1) (about,1) (MapReduce,1) (frame,1) (work,1) (is,1) (it,1) (good,1) (to,1) (learn,1)

Input split 4:
I love hadoop and big data
RecordReader output:
(0, I love hadoop and big data)
Mapper output:
(I,1) (love,1) (hadoop,1) (and,1) (big,1) (data,1)

All mapper outputs go through shuffle and sort to the single Reducer, which writes the output file:
..(about,1), (and,1), (big,1), (can,1)…..(hadoop,3), (hi,1), (how,3)

Fig 3. Flow of the MapReduce word count program. The default single reducer is used. The file is assumed to be 200 MB; in reality it is only about 0.15 KB.

Creation and running of a Hadoop MapReduce program:


1. Creating a file
$ cat file.txt
hi how is hadoop
how you learn hadoop
how is MapReduce programs

can you learn it easily
what about MapReduce frame work
is it good to learn
I love hadoop and big data
2. Loading the file file.txt into the Hadoop file system from the local file system
$ hadoop fs -put file.txt file // 'file' is the destination path in HDFS, used as the job input below
3. Writing MapReduce programs
DriverCode.java
MapperCode.java
ReducerCode.java
4. Compiling all the above java programs
$ javac -classpath $HADOOP_HOME/hadoop-core.jar *.java
5. Creating the jar file
$ jar cvf test.jar *.class

6. Running the above test.jar on the file (which is in HDFS)

$ hadoop jar test.jar DriverCode file OutputDirectory

(Here only 'DriverCode' is indicated because it contains the main method; 'file' may be a single file or a
directory; and OutputDirectory must be a directory, not a file, as it contains 3 files: part-00000, _logs and
_SUCCESS.)
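
Step 3 above lists DriverCode.java; a minimal hedged sketch of that word count driver is given here (it mirrors the MaxTemperature driver discussed next; apart from the class names already used above, everything is an assumption):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverCode {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: DriverCode <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(DriverCode.class);
        job.setJobName("Word count");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MapperCode.class);     // word count mapper sketched earlier
        job.setReducerClass(ReducerCode.class);   // word count reducer sketched earlier

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}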

Driver code for weather data:

Weather data: This data is collected from weather sensors at many locations around the world, every hour. It is semi-structured, record-oriented data.
Each line contains information such as the data centre name, latitude, longitude, year, date, etc., as given below.

0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
.
.
320 # wind direction (degrees)
.
.
.
010000 # visibility distance (meters)
.
.
-0128 # air temperature (degrees Celsius x 10)

Fig. 4. A sample input record from the weather sensors. Characters 15 to 18 indicate the year, character 87 the sign of the temperature, and characters 88 to 91 the temperature.

The input sample records for the Mapper (some unused columns are dropped, indicated by …(ellipsis))
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
The input to the Mapper is
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...) //(byte offset, entire line)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
The output from the mapper

(1950, 0) //(key-year, value-temp)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
After shuffle and sort, the reducer function gets the following input
(1949, [111, 78]) //This 1949 is the key, and 111, 78 are temperatures corresponding to this year from two different
records
(1950, [0, 22, −11])
The reducer function iterates through the values and picks the maximum value for each record (key, i.e. year), as shown in
Fig. 5, which is

(1949, 111) //This is the maximum temperature recorded in this year 1949
(1950, 22)

Fig 5. The inputs and outputs to and from the mapper and reducer. map and reduce are the functions of the Mapper and Reducer code; either side of each shows its input and output.

Driver Code: Application to find the maximum temperature in the weather dataset

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature


{
public static void main(String[] args) throws Exception
{
if (args.length != 2)
{
System.err.println("Usage: MaxTemperature<input path><output path>");
System.exit(-1);
}
Job job = new Job();
job.setJarByClass(MaxTemperature.class);
job.setJobName("Max temperature");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(MaxTemperatureMapper.class);
job.setReducerClass(MaxTemperatureReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);

}
}

Code description:
if (args.length != 2): While running the Hadoop jar file, two arguments, the input path and the output path, must be passed after the Driver class name (MaxTemperature). If they are not, the program tells the user the correct usage:
Usage: MaxTemperature <input path> <output path>

Job object: It forms the specification of the job and controls how the job is run. It drives the mapper and
reducer programs, specifies which Mapper and Reducer classes the job runs with, and specifies the input and
output key and value types that the mapper and reducer work with.

setJarByClass(): The class of the program is passed to this method, through which Hadoop locates the
relevant JAR file; Hadoop looks for the JAR file containing MaxTemperature.

setJobName(): The name of the job for this instance; here it is "Max temperature".

FileInputFormat.addInputPath: addInputPath() (like the related setInputPaths()) is a static method of FileInputFormat (FIF), so it can be used directly without creating an object of FIF. This method takes care of setting the input file path in the job object. It requires two parameters: the job and the input path object. args[0] is of String type, accepted from the command prompt, so this string is converted to a Path object with new Path(args[0]).

FIF not only sets up the input file path but also determines how many input splits the input file has, for the JobTracker.

The plural is used in setInputPaths because the input may be a single file, a group of files or a directory (and addInputPath can be called more than once to add several paths).

FileOutputFormat.setOutputPath: It works as above but sets the output path, the directory in which the output files from the reduce function are stored. The singular is used in setOutputPath because the output path is a single directory.

job.setMapperClass and job.setReducerClass: The Mapper and Reducer types are passed through these methods.

job.setOutputKeyClass(Text.class): It specifies the output key type of the reduce function, here Text, which is the year
in our example program.

job.setOutputValueClass(IntWritable.class): It specifies the output value type, which is IntWritable. In our
example it is the maximum temperature corresponding to the key (year).

waitForCompletion(): This method on Job submits the job and waits for it to finish. Its boolean argument (true here) indicates whether the job's progress information is written to the console; the return value (true for success) is turned into the program's exit code.

Mapper Code for finding Maximum Temperature:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {


private static final int MISSING = 9999;
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
String year = line.substring(15, 19);
int airTemperature;
if (line.charAt(87) == '+')

{ // if temp is +ve, then take only the temperature value, not the sign
airTemperature = Integer.parseInt(line.substring(88, 92));
}
else
{
airTemperature = Integer.parseInt(line.substring(87, 92));
}
String quality = line.substring(92, 93);
if (airTemperature != MISSING && quality.matches("[01459]")) {
context.write(new Text(year), new IntWritable(airTemperature));
}
}
}

Code description:

1. The class MaxTemperatureMapper extends the Mapper class, so it has the behaviour of a Mapper; if it did not, it would not be a Mapper code at all. (Compare Java, where a class is serializable iff it implements the Serializable interface.)
2. The Mapper class is of generic type. The parameters are given below in the first column of the table.

Parameter        Specifies       Hadoop type      Java type

Input key        Byte offset     LongWritable     Long
Input value      Entire line     Text             String
Output key       Year            Text             String
Output value     Temperature     IntWritable      Integer

Table 3. Mapper input and output (key, value) parameters (column 1), what they specify (column 2), the Hadoop type passed as the generic parameter in the mapper code (column 3), and the equivalent Java type (column 4).
3. The MaxTemperatureMapper class takes the responsibility of implementing the method of the Mapper class, map(LongWritable key, Text value, Context context).
This function receives the byte offset and the entire line as input. The input value is converted to String type to support substring() and charAt() on the entire line.

4. String line = value.toString(): The second parameter of the map function, value, is the input to this code, which is the entire line. The first parameter, key (the byte offset), is ignored. The value is converted to a String and assigned to line.
5. String year = line.substring(15, 19): From Fig. 4, the characters at index 15 to 18 indicate the year, which is assigned to year.
6. airTemperature = Integer.parseInt(line.substring(88, 92)): Characters 88 to 91 indicate the temperature, which is assigned to airTemperature. (If the value is positive, the + sign is ignored; if it is negative, characters 87 to 91 are taken, since index 87 holds the sign.)
7. String quality = line.substring(92, 93): Gets the character at index 92 and assigns it to quality.
8. if (airTemperature != MISSING && quality.matches("[01459]")): This condition holds when airTemperature is not 9999 (9999 indicates a missing value) and quality.matches() returns true, i.e. quality matches one of the characters 0, 1, 4, 5 or 9.
9. context.write(new Text(year), new IntWritable(airTemperature)): year and airTemperature are converted back to box-type objects (Text and IntWritable) and written to the output.
Note: Context is used to collect and write the output into intermediate as well as final files.

Reducer Code:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {


@Override
public void reduce(Text key, Iterable<IntWritable> values,Context context)throws IOException,
InterruptedException
{
int maxValue = Integer.MIN_VALUE;
for (IntWritable value : values) {
//(1995,[22,-11,100,122])
maxValue = Math.max(maxValue, value.get());
}
context.write(key, new IntWritable(maxValue));
}
}

Code description:

1. The input to the reducer is of the form


(1949, [111, 78])
(1950, [0, 22, −11]),

The output from the reducer is


(1949, 111)
(1950, 22),

2. In view of the above input and output, the reducer code MaxTemperatureReducer is of generic type with the
parameters listed below.
Parameter        Specifies                Hadoop data type     Java type

Input key        Year                     Text                 String
Input value      List of temperatures     IntWritable          Integer
Output key       Year                     Text                 String
Output value     Temperature              IntWritable          Integer

Table 4. Reducer input and output (key, value) parameters (column 1), what they specify (column 2), the Hadoop type passed as the generic parameter in the reducer code (column 3), and the equivalent Java type (column 4).

3. The reduce method has the year as its first parameter, the list of temperatures (of Iterable type) as its second parameter and the context as its third parameter.
4. int maxValue = Integer.MIN_VALUE: Initially the minimum value of the integer range is assigned to maxValue.
5. for (IntWritable value : values): values, of type Iterable<IntWritable>, holds the list of values for that key; each of these values in turn is assigned to value.
6. maxValue = Math.max(maxValue, value.get()): In each iteration the current value is obtained with get() and compared with maxValue using the max function of the Math class; the larger of the two is assigned back to maxValue. The loop repeats for all items in the value list, so the maximum temperature is found for the corresponding key.
7. context.write(key, new IntWritable(maxValue)): The output is written as a (key (year), value (maximum temperature)) pair.
8. The reduce function runs as many times as there are input (key (year), value (list of temperatures)) pairs, so that many output lines are produced.
Ex:
Ex:
(1949, 111)
(1950, 22),
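
Since this unit also covers the old Hadoop API (org.apache.hadoop.mapred), here is a hedged sketch of the same maximum-temperature application written against it. The class name OldMaxTemperature and the nesting of the mapper and reducer are assumptions; the logic simply mirrors the new-API code above:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class OldMaxTemperature {

    // Old-API mapper: implements Mapper and writes through an OutputCollector instead of a Context
    public static class OldMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final int MISSING = 9999;

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                output.collect(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Old-API reducer: the values arrive as an Iterator rather than an Iterable
    public static class OldReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int maxValue = Integer.MIN_VALUE;
            while (values.hasNext()) {
                maxValue = Math.max(maxValue, values.next().get());
            }
            output.collect(key, new IntWritable(maxValue));
        }
    }

    // Old-API driver: JobConf + JobClient.runJob() instead of Job.waitForCompletion()
    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: OldMaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        JobConf conf = new JobConf(OldMaxTemperature.class);
        conf.setJobName("Max temperature (old API)");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(OldMapper.class);
        conf.setReducerClass(OldReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}

It would be compiled, packed into a jar and run exactly like the earlier examples, e.g. $ hadoop jar oldmaxtemp.jar OldMaxTemperature input output (the jar name and paths are assumptions).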


Partitioner Code:
Partitioner: A partitioner partitions the key-value pairs of the intermediate map outputs. It partitions the data using a user-defined condition, which works like a hash function. The total number of partitions is the same as the number of reducer tasks for the job.
Let us assume 4 input splits; these imply 4 running mappers. The mapper output data is called intermediate data. Two phases are applied to this data: shuffle and sort. The shuffled and sorted (k,v) pairs then go to their respective reducers via the partitioner. If we configure n reducers, all these shuffled and sorted (k,v) pairs are partitioned among those reducers. So the duty of the partitioner (partitioner code) is to share the (k,v) pairs among the reducers. The hash partitioner is the default partitioner (just like the default mapper, the identity mapper, and the default reducer, the identity reducer). It sends different (k,v) pairs to different reducers (the number of reducers is given in the driver code) depending on the hash code of the key object; it does not care which (k,v) pair goes to which reducer. But to send specific (k,v) pairs to a specific reducer we have to write our own custom partitioner code. By writing our own partitioner we can improve the performance and readability of our application.
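
For reference, the default hash partitioner essentially does the following; this is a sketch written in the old (org.apache.hadoop.mapred) API used in the code below, and the class name is an assumption:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Sketch of what the default hash partitioner does: the reducer number is derived
// from the hash code of the key, so we cannot control which key goes to which reducer.
public class SketchHashPartitioner<K, V> implements Partitioner<K, V> {
    public void configure(JobConf conf) {
        // nothing to configure
    }

    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}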
Code:
public class MyPartitioner implements Partitioner<Text, IntWritable>

The partitioner code runs between the mapper and the reducer. So its input comes from the mapper and is a (k,v) pair of type Text and IntWritable (String and int, e.g. (hi,1)); the output has the same parameters.
The class MyPartitioner implements Partitioner, so MyPartitioner gets the properties of the Partitioner interface.
The Partitioner interface has two methods, so it is the responsibility of MyPartitioner to override them.
These methods are configure and getPartition, as shown:
public void configure(JobConf conf)
{
}
public int getPartition(Text key, IntWritable value, int numReduceTasks)
{
}
To send the (k,v) pairs whose key length is n to reducer n-1, the following is the body of getPartition (the complete class is sketched below).
Note that the key is of Text type; to work with the string length as done here, convert it to String first:
String s = key.toString();
if (s.length() == 1)
{ return 0; // this invokes reducer 0
}
if (s.length() == 2)
{ return 1;
}
if (s.length() == 3)
{ return 2;
}
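
Putting the fragments together, a minimal sketch of the custom partitioner (old API, class and method names as above; the length-based routing is the illustrative rule from the text and assumes at least 4 reducers are configured):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class MyPartitioner implements Partitioner<Text, IntWritable> {

    public void configure(JobConf conf) {
        // no configuration needed for this example
    }

    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // convert the Text key to a String and use its length, as in the text above
        String s = key.toString();
        if (s.length() == 1) {
            return 0;            // keys of length 1 go to reducer 0
        }
        if (s.length() == 2) {
            return 1;            // keys of length 2 go to reducer 1
        }
        if (s.length() == 3) {
            return 2;            // keys of length 3 go to reducer 2
        }
        return 3;                // everything else goes to reducer 3
    }
}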


Configure the number of reducers in the Driver code; the relevant lines in the Driver code are
{
conf.setPartitionerClass(MyPartitioner.class);
conf.setNumReduceTasks(4); // 4 reducers will be ready to take data from the partitioner
}

Combiner:
For big data, the number of (k,v) pairs at each node runs into crores. Assume a small file of size 0.15 KB produces about 30 (k,v) pairs; then 64 MB of data produces about 1 crore (k,v) pairs, taking the default input split size of 64 MB. For a 500 GB hard disk there are about 7500 blocks, so the number of (k,v) pairs to be processed by the 7500 mappers is 7500 x 1 crore. If the file size grows further, the number of mappers and (k,v) pairs increases enormously. Accordingly, the traffic or network burden on the Reducer increases, and the overall performance goes down. To increase the performance the combiner is introduced. A combiner acts as a mini reducer; each mapper has a combiner.
Combiner: A Combiner, also known as a semi-reducer, is an optional class that accepts the inputs from the Map class and thereafter passes its output key-value pairs on to the Reducer class. The main function of a Combiner is to summarize the map output records with the same key, i.e. after all combiners have worked on their individual mappers' (k,v) pairs, the outputs from the combiners are given to the Reducer.

• A Combiner improves the performance of the application by decreasing the network traffic needed to transmit the (k,v) pairs to the reducer(s).
• A Combiner is a mini reducer: whatever logic we write in the Reducer is also there in the Combiner.
• The number of Combiners equals the number of mappers, and each Combiner works on the output of its specific mapper, performs shuffle and sort, and gives its output.
• All these combiners' outputs are given to the Reducer, which finally performs shuffle and sort again and gives the final output.
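
In code, the combiner is enabled from the driver. For the word count job the reducer logic itself can be reused as the combiner, because summing counts is associative and commutative; a hedged sketch, using the ReducerCode class assumed earlier:

import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // The same class acts as both combiner (mini reducer at each mapper) and reducer.
    public static void enableCombiner(Job job) {
        job.setCombinerClass(ReducerCode.class);
        job.setReducerClass(ReducerCode.class);
    }
}

For the maximum-temperature job, MaxTemperatureReducer could similarly be set as the combiner, since taking the maximum is also associative and commutative.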

