
2014 Fourth International Conference on Communication Systems and Network Technologies

Prominence of MapReduce in BIG DATA Processing


Dr. Vrinda Tokekar
Institute of Engineering & Technology
Indore, India
vrindatokekar@yahoo.com

Shweta Pandey
Shri Vaishnav Institute of Tech. & Science
Indore, India
shweta.pandey2781@gmail.com


Abstract- Big Data has arrived with great haste as a key enabler for social business; it offers an opportunity to create extraordinary business advantage and better service delivery. Big Data is bringing a positive change to the decision-making processes of various business organizations. Along with its many offerings, Big Data raises several issues and challenges related to Big Data management, Big Data processing and Big Data analysis. These challenges are commonly framed as the 3Vs: Volume (a large amount of data), Velocity (data arriving at high speed) and Variety (data coming from heterogeneous sources). In the definition of Big Data, "Big" refers to a dataset that grows so large that it becomes difficult to manage using existing data management concepts and tools. MapReduce plays a very significant role in the processing of Big Data. This paper gives a brief overview of Big Data and its related issues and emphasizes the role of MapReduce in Big Data processing: MapReduce is elastically scalable, efficient and fault tolerant for analysing large sets of data, and the paper highlights the features, in comparison with other design models, that make it a popular tool for processing large-scale data. Analysis of the performance factors of MapReduce shows that eliminating their adverse effects through optimization improves the performance of MapReduce.


Keywords- Big Data, MapReduce, Hadoop Distributed File System, Google File System, Hadoop.

I. INTRODUCTION

Large volumes of data keep growing because organizations continuously capture ever more data for better decision making. The volume of data increases through online content such as blogs, posts, social networking site interactions and photos created by users, while servers continuously record messages about what online users are doing. Today's business is unexpectedly affected by this growth of data. According to an estimate by IBM, 2.5 quintillion bytes of data are created every day, and 90% of the data in the world has been created in the last two years [3]. This is a mind-boggling figure, but the sad part is that instead of having more information, people feel less informed.

MapReduce has come up as a highly effective and efficient tool for Big Data analysis. The reasons behind the popularity of MapReduce are its unique features, which include the simplicity and expressiveness of its programming model: MapReduce has mainly two functions, map() and reduce(), yet a large number of data analytical tasks can be expressed as a set of MapReduce jobs. It also offers a high degree of elastic scalability and fault tolerance [4]. However, the performance of MapReduce is still far off from ideal in the database context. Recent research shows that Hadoop [1], the open-source implementation of MapReduce, is slower than two parallel database systems by a factor of 3.1 to 6.5 [5]. Organizations require a data processing system that is not only elastically scalable but also efficient. MapReduce can become more efficient through proper tuning of various performance-affecting parameters. This paper focuses on those parameters and their proper tuning.

Industrial organizations are essentially intent on making sense of the massive arrival of Big Data and on developing analytic platforms that handle not only traditional structured data but also semi-structured and unstructured sources of information. In this manner, industrial organizations can take advantage of Big Data processing for a better decision-making process. The success of Big Data depends on its analysis. Big Data analysis is an ongoing process that requires a set of activities rather than an isolated activity. Therefore, finding the data, determining new insights, and then integrating and presenting those insights properly in service of specific goals requires a unified set of solutions [23].

The rest of this paper is organized as follows: Section II presents the importance of, and challenges related to, Big Data. Section III and its subsections present the contribution of MapReduce to Big Data processing, a comparison between MapReduce and parallel DBMSs, a glance at the performance-affecting parameters of MapReduce, and the optimization of those parameters to improve the performance of MapReduce. Section IV highlights the literature review. Finally, Section V concludes the paper.
II. BIG DATA CHALLENGES



This is the world of Big Data: web logs, sensor networks, Internet texts and documents, Internet search indexing, CDRs (call detail records), etc. all constitute Big Data.

To make more effective decisions and take full advantage of the available information, Big Data must be processed efficiently, but Big Data poses problems in the capturing, searching, storing, sharing, analyzing and visualizing of such large amounts of data.

Here is a glance at the Big Data challenges:

(A) BIG DATA MANAGEMENT AND STORAGE
In Big Data, "Big" means that the size of data is growing continuously; on the other hand, storage capacity is increasing much more slowly than the amount of data. The available information framework needs to be reconstructed into a hierarchical framework, because researchers have concluded that available DBMSs are not adequate to process such large amounts of data. The architecture commonly used for data processing relies on a database server, and database servers have constraints of scalability and cost, which are prime goals of Big Data.

Different business models have been suggested by database providers, but these are basically application specific; for example, Google seems to be more interested in small applications. Big Data storage is another big issue in Big Data management: available algorithms are sufficient to store homogeneous data but are not able to smartly store data that arrives in real time, because of its heterogeneous nature. How to rearrange data is therefore another big problem in the context of Big Data management. Virtual server technology can aggravate the problem, since it raises the issue of overcommitted resources, especially when there is a lack of communication between the application server administrator and the storage administrator. The problems of concurrent I/O and of single-node master/slave architectures also need to be solved.

(B) BIG DATA CALCULATION AND EXAMINATION
Speed is a prime requirement when processing a query against a database, but the process may take time because all the relevant data in the database cannot be traversed in a short time. The use of indices is the most favourable solution, but an index is only intended for simple types of data, whereas Big Data is not limited to simple types and faces further complications. To solve this, a combination of indices for Big Data and advanced pre-processing techniques can be used. Application parallelization and divide-and-conquer are natural computational approaches to Big Data problems, but they require additional computational resources, which is not that easy. Traditional serial algorithms are not sufficient for Big Data. The cloud's reduced cost model can be used to process an application with sufficient data parallelism, because it allows the use of hundreds of computers for a short time at low cost.

(C) BIG DATA PRIVACY PROTECTION
IT companies use online Big Data applications to share information and reduce cost. Security and privacy concerns badly affect Big Data storage and processing, as third-party services are involved in hosting private information and performing computation tasks on the available data. Big Data keeps growing and brings many challenges of dynamic data monitoring and security protection. Security in Big Data means performing data mining without exposing the sensitive information of individuals. Data changes dynamically, e.g. through variations of attributes and the addition of new data, so it is challenging to implement effective privacy protection in this complicated situation, while current privacy protection technologies are mainly based on static data.

III. Contributions of MapReduce in Big Data Processing

Business and scientific applications need to process and analyse large volumes of data to make better decisions. Terabytes of data must be processed efficiently on a daily basis. As explained in the introductory part of this paper, industries face the Big Data problem due to the inability of conventional database systems and software tools to manage and process this large amount of data within tolerable time limits. MapReduce plays a promising role in processing and managing Big Data. The MapReduce system is well known for its elastic scalability and fault-tolerant behaviour.

For Big Data processing, parallel DBMSs and MapReduce (MR) are both available solutions, and each has its own importance. In subsection 1 below, a table has been prepared on the basis of a comparison between MapReduce (MR) and parallel DBMSs, with reference to the work done in [7].


1. Comparison of MapReduce with other analytical models:
Table II-1 summarizes the comparison between the MR and parallel DBMS approaches on the basis of various parameters:
Table II-1. Comparison of MapReduce (MR) and Parallel DBMSs

Complicated transformations
  MR: Complicated transformations are easier to express in MR.
  Parallel DBMS: Complicated transformations are difficult to express in SQL. User-defined functions can be used with SQL for such complications, but UDF support is either buggy (DBMS-X) or missing (Vertica).

Compression optimization
  MR: Compression is less valuable in MR because parsing of records is mandatory at run time. This is one of the reasons for the performance difference between MR and parallel DBMSs.
  Parallel DBMS: Compression is more valuable in a parallel DBMS, as parsing must be done at load time.

Data format
  MR: MR allows data in any arbitrary format.
  Parallel DBMS: All DBMSs require that data conform to a well-defined schema.

Indexing
  MR: MR does not use pre-generated indices [7].
  Parallel DBMS: A parallel DBMS uses a predefined index before processing the data; this is one of the reasons a parallel DBMS performs better.

Languages
  MR: MR tasks use a procedural, object-oriented language, e.g. Java.
  Parallel DBMS: Parallel DBMSs use the declarative language SQL.

Programming model
  MR: The MR model is simple, which is one of its benefits. MR has only two functions, map() and reduce(), so it performs complex computations by simple chaining of map() and reduce(). MR follows a modest and more comprehensible step-by-step manner.
  Parallel DBMS: Not as simple as the MR model; one has to write SQL queries, which can be hard if the query is complex.

Query execution strategies
  MR: MR divides the dataset into smaller parts and distributes them. In contrast to the loading and indexing phase of a parallel DBMS, MR only loads data into a distributed file system and processes it on-the-fly. As loading is not required, MR takes less time to process data than a parallel DBMS.
  Parallel DBMS: For processing, the query system automatically computes a query plan, distributes it to the entire cluster and executes the query partially in parallel over the cluster. It uses loading and indexing during the processing of data, which improves its performance, but the loading phase takes a long time.

Storage independence
  MR: MR is storage independent and simple; loading data into a database before processing is not required.
  Parallel DBMS: A parallel DBMS is not storage independent and not as simple as MR. The programmer specifies the objective in a high-level language, and an SQL translator is then used.

Strengths
  MR: Fault tolerance, storage system independence, flexibility and simplicity.
  Parallel DBMS: In contrast to MR, the strengths of a parallel DBMS are robustness and high performance.

Type of dataset / use of model
  MR: If data is dynamic and the user is concerned with only a few queries, MR is more useful for such a scenario.
  Parallel DBMS: A parallel DBMS is preferred for a static dataset that has to be queried frequently.

Failure model
  MR: MR widely uses a fine-grained failure model, but performance is low.
  Parallel DBMS: On any failure, the whole task has to be re-executed, which increases the cost but provides better performance in comparison to MR.

Implementation and code
  MR: The code is longer for an MR implementation than for a parallel DBMS, because the MR implementation handles parallelism, node failures, data distribution and load balancing.
  Parallel DBMS: Requires less coding.

Time
  MR: For processing data, MR needs less time than a parallel DBMS, because loading of data is not required before processing.
  Parallel DBMS: Due to the load phase, it requires more time to tune the data.
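To make the two-function programming model in the table concrete, the following sketch shows the canonical word-count job written against Hadoop's Java MapReduce API, with the entire data transformation expressed in map() and reduce(). This listing is illustrative, in the style of the standard Hadoop tutorial example, and is not taken from the paper or from [4].

Listing 1. A minimal Hadoop MapReduce job (word count).

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // map(): emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text(); // reused for every token

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // reduce(): sum the counts grouped under each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}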

2. MR's Performance Parameters:
On the basis of the architectural design of MR, five parameters have been identified that affect the performance of MR: (I) I/O mode, (II) index structure, (III) record parsing, (IV) grouping algorithms and (V) block-level scheduling. Here is a brief look at these parameters:

(a) I/O interface: MR's storage system is independent, but it still requires an I/O interface for scanning the data. Two I/O modes are available: direct I/O and streaming I/O. Direct I/O outperforms streaming I/O by 10%, as shown by benchmarking of HDFS (Hadoop Distributed File System) [19]. (b) Index structure: MR performance can be further improved by the use of an index structure. MR can support three index structures: sorted files with a range index, B+-tree files and database indexed tables. Experimental work shows that performance can be improved by a factor of 2.5 in the selection task and by a factor of up to 10 in the join task, depending on the selectivity of the filtering conditions. (c) Record parsing: Earlier work [5][8] reported that record parsing is one of the reasons for poor performance in MR, which is not true; it has been found that only immutable decoding introduces high performance overhead. The mutable decoding scheme is faster than the immutable scheme by a factor of 10, as shown in [9], so it is suggested that MR should strictly use the mutable decoding scheme for database-like workloads, which improves the performance of MR in the selection task by a factor of 2. (d) Grouping algorithms: The programming model of MR focuses on data transformation, which is specified in the map() and reduce() functions, but the programming model itself does not specify how the intermediate results generated by map() are grouped for the reduce() functions to process. The merge-sort algorithm used by MR is not efficient for certain analytical tasks such as aggregation and join. It has been observed that performance can be improved by using different grouping algorithms; for this purpose, the Hadoop core needs to be modified to support an efficient grouping algorithm. (e) Block-level scheduling: Scheduling at the block level incurs performance overhead: generally, the more data blocks there are to schedule, the higher the cost the scheduler incurs. The scheduling algorithm, which determines the assignment of map tasks to nodes, also affects performance. The scheduling strategy currently used in Hadoop is sensitive to the processing speed of slave nodes, and may slow down the execution time of the entire job by 20% to 30% [9]. In conclusion, by proper tuning of the above five parameters, the performance of MR can be improved by a factor of 2.3 to 3.5 [9]. This is contrary to the studies stating that MR is inferior to parallel DBMSs in terms of performance; instead, MR can offer a competitive edge, as it is elastically scalable and efficient. One can improve the performance of MR by proper tuning of these performance-affecting parameters.



3. Optimization of Various Parameters:
The parameters mentioned above have been identified on the basis of three design components: the programming model, the storage-independent design and run-time scheduling.


a.1) Programming Model: In MR, the data transformation logic is specified using the map() and reduce() functions. The programming model does not specify how to group the generated intermediate key-value pairs for reduce(). According to the designers in [4], specifying a data grouping algorithm is a difficult task, and this complexity should be hidden by the framework. MR's default sort-merge algorithm is not sufficient for all analytical jobs, especially those that do not care about the order of intermediate keys, such as aggregation and join tasks, so there is a need to switch to other grouping algorithms. For the aggregation task, a fingerprint-based algorithm has been suggested in [9], in which a 32-bit integer fingerprint is generated for each key-value pair. When a map task sorts the intermediate pairs, it compares the fingerprints of the keys, and only if they match does it further compare the original keys. In the same way, a reduce task merges the intermediate pairs: first it merges by fingerprint, and then by the original key of each group. For the join task in MR, a partition map-side join has been evaluated in [9], which uses the partition information; implementation details of these algorithms on the Hadoop platform are given in [9]. MR needs to extend its programming model so that users can specify their own customized grouping algorithms.
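As an illustration of the fingerprint idea, the sketch below pairs each key with a 32-bit fingerprint and compares the cheap integers first, falling back to a full byte-wise comparison only on a collision. This is a minimal sketch in plain Java (Java 9+ for Arrays.compare), assuming a hash-based fingerprint function; the class and field names are illustrative, not from [9] or Hadoop. The ordering it produces groups equal keys together without preserving lexicographic order, which, as noted above, is acceptable for aggregation-style tasks.

Listing 2. Fingerprint-assisted key comparison (illustrative sketch).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class FingerprintKey {
  final int fingerprint;   // 32-bit fingerprint of the key bytes
  final byte[] keyBytes;   // original key

  FingerprintKey(byte[] keyBytes) {
    this.keyBytes = keyBytes;
    this.fingerprint = Arrays.hashCode(keyBytes); // stand-in fingerprint function
  }

  static final Comparator<FingerprintKey> CMP = (a, b) -> {
    // Cheap path: compare the 32-bit fingerprints.
    int byFp = Integer.compare(a.fingerprint, b.fingerprint);
    if (byFp != 0) return byFp;
    // Rare path: fingerprints collide, so compare the original keys.
    return Arrays.compare(a.keyBytes, b.keyBytes);
  };

  public static void main(String[] args) {
    List<FingerprintKey> keys = new ArrayList<>();
    keys.add(new FingerprintKey("banana".getBytes()));
    keys.add(new FingerprintKey("apple".getBytes()));
    keys.add(new FingerprintKey("apple".getBytes()));
    keys.sort(CMP); // equal keys end up adjacent, ready for grouping
    for (FingerprintKey k : keys) System.out.println(new String(k.keyBytes));
  }
}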





a.2) Storage Independence: The MR programming model is independent of the storage system, which is beneficial in heterogeneous systems because it allows data from different storage systems to be analyzed. In MR, a reader retrieves each record from the storage system and wraps it in a key-value pair. MR is a data processing system without a built-in storage engine, while parallel database systems use a query engine and a storage engine together: the query engine directly reads records from the storage engine, so no cross-engine calls are required, whereas in MR the processing engine needs to call the readers to load data, which results in lower performance. On the basis of MR's storage-independent design, three factors have been identified: (I) the I/O mode, the way the reader retrieves data from the storage system; (II) data parsing, the scheme the reader uses to parse the format of records; and (III) indexing. The next subsections give a brief account of the optimization of these three parameters to improve the performance of MR:
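To make the reader's role concrete, here is a stripped-down sketch of a reader that pulls records from an arbitrary source (a plain local text file) and wraps each one in a Hadoop key-value pair, reusing mutable Writable objects. It is a minimal illustration under assumed names; a real Hadoop reader would extend org.apache.hadoop.mapreduce.RecordReader and be wired into an InputFormat.

Listing 3. A stripped-down reader wrapping records into key-value pairs (sketch).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class LineReaderSketch implements AutoCloseable {
  private final BufferedReader in;
  private final LongWritable key = new LongWritable(); // reused (mutable)
  private final Text value = new Text();               // reused (mutable)
  private long lineNo = 0;

  public LineReaderSketch(String path) throws IOException {
    in = new BufferedReader(new FileReader(path));
  }

  // Returns false at end of input; otherwise fills key/value in place.
  public boolean nextKeyValue() throws IOException {
    String line = in.readLine();
    if (line == null) return false;
    key.set(lineNo++);
    value.set(line);
    return true;
  }

  public LongWritable getCurrentKey() { return key; }
  public Text getCurrentValue() { return value; }

  @Override public void close() throws IOException { in.close(); }

  public static void main(String[] args) throws IOException {
    try (LineReaderSketch r = new LineReaderSketch(args[0])) {
      while (r.nextKeyValue()) {
        System.out.println(r.getCurrentKey() + "\t" + r.getCurrentValue());
      }
    }
  }
}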

a.2.1) I/O Mode: A reader in MR can choose either direct I/O mode or streaming I/O mode. In direct I/O mode, the reader reads data directly from the local disk through the DMA controller; this is the more efficient way, as no IPC (inter-process communication) calls are required. In the streaming I/O scheme, the reader reads data from another running system process, such as the DataNode in Hadoop, through IPC schemes such as TCP/IP or JDBC. Streaming I/O is the only choice when the reader needs to read data from a remote node. As far as performance is concerned, direct I/O is more efficient than streaming I/O. That is one of the reasons for choosing direct I/O in parallel DBMSs, in which the query engine and processing engine run inside the same database instance and share memory: when the query engine fills a memory buffer, it transfers the buffer directly to the processing engine for further processing. In MR, a hybrid I/O mode is implemented in HDFS, as Hadoop is the implementation platform of MR [9].
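In current Hadoop versions this direct-versus-streaming trade-off surfaces as HDFS "short-circuit" local reads, which let a client co-located with the data bypass the DataNode's streaming path and read block files directly. The sketch below enables that mode before scanning a file; the socket path is a placeholder, and whether the setting takes effect depends on the Hadoop version and on the DataNodes being configured accordingly, so treat it as an assumption about a typical cluster rather than a prescription.

Listing 4. Enabling HDFS short-circuit (direct) local reads before a scan (sketch).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirectReadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask the HDFS client to read local block files directly,
    // bypassing the DataNode's streaming (TCP/IPC) path.
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket"); // placeholder

    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(new Path(args[0]))) {
      byte[] buf = new byte[64 * 1024];
      long total = 0;
      int n;
      while ((n = in.read(buf)) > 0) total += n; // sequential scan
      System.out.println("bytes read: " + total);
    }
  }
}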

a.2.2) Record Parsing: In MR, data parsing is the procedure in which the reader transforms the retrieved data into key-value pairs. Data parsing is needed to decode the raw data and transform it into data objects that can be processed by a programming language such as Java. Two kinds of decoding schemes are used: 1) the immutable decoding scheme and 2) the mutable decoding scheme. The immutable scheme causes CPU overhead because it generates a unique immutable object, a read-only object that cannot be modified, for each record. In the mutable decoding scheme, the record is decoded from its native storage format according to the schema, and a mutable object is filled with the new values. Only one data object is created for any number of decoded records, so the mutable scheme is faster than the immutable scheme, which incurs CPU overhead from the huge number of immutable objects generated, one for each decoded record.
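The following toy sketch contrasts the two schemes on a record of two integers: the immutable path allocates a fresh object per record, while the mutable path refills a single reusable object. The class names are invented for illustration; the point is only the per-record allocation difference described above.

Listing 5. Immutable versus mutable record decoding (toy sketch).

import java.nio.ByteBuffer;

public class DecodingSchemes {
  // Immutable scheme: a fresh read-only object per decoded record.
  static final class ImmutableRecord {
    final int a, b;
    ImmutableRecord(int a, int b) { this.a = a; this.b = b; }
  }

  static ImmutableRecord decodeImmutable(ByteBuffer raw) {
    return new ImmutableRecord(raw.getInt(), raw.getInt());
  }

  // Mutable scheme: one object per task, refilled in place per record.
  static final class MutableRecord {
    int a, b;
    void fillFrom(ByteBuffer raw) { a = raw.getInt(); b = raw.getInt(); }
  }

  public static void main(String[] args) {
    ByteBuffer raw = ByteBuffer.allocate(8 * 1_000_000);
    for (int i = 0; i < 1_000_000; i++) { raw.putInt(i); raw.putInt(i * 2); }
    raw.flip();

    // Immutable decode of the first record, shown for contrast:
    ImmutableRecord first = decodeImmutable(raw);
    System.out.println("first record: " + first.a + ", " + first.b);
    raw.rewind();

    MutableRecord rec = new MutableRecord(); // single reused object
    long sum = 0;
    while (raw.hasRemaining()) {
      rec.fillFrom(raw);   // mutable decode: no per-record allocation
      sum += rec.a + rec.b;
    }
    System.out.println(sum);
  }
}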

a.2.3) Index: At first glance, MR may seem unable to utilize an index [5], but this is not true: MR can utilize an index to make the processing of data efficient and faster. An index can be used in three ways. (A) By implementing a customized grouping algorithm, which can work in two ways depending on the input: (A1) Range index: if the input consists of sorted files (as in HDFS or GFS), a range index can be used. (A2) Naming information: if each input file follows certain naming rules, this naming information can be used by the grouping algorithm. The other two methods for indexing in MR are based on the type of input files: (B) If the input itself is a set of sorted indexed files (B+-tree, hash), these files can be processed in MR by implementing a new reader which takes a specific condition as input and applies it to the index to retrieve the needed records from each input file. (C) If the input is a set of indexed tables stored in n relational database servers, n map() tasks can process those tables, where each map() task submits a SQL query to one database server and thereby transparently utilizes the database index to retrieve data. This scheme was first presented in [4] and demonstrated in [10].
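Scheme (C) can be sketched as follows: the body of a map task pushes its selection predicate down to one database server over JDBC, so the server's own index performs the retrieval, in the spirit of [4][10]. The JDBC URL, credentials, and table and column names are placeholders, a JDBC driver is assumed to be on the classpath, and the surrounding MapReduce plumbing is omitted.

Listing 6. A map task's body retrieving its input through a database index via JDBC (sketch).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IndexedMapSource {
  public static void main(String[] args) throws Exception {
    String jdbcUrl = "jdbc:postgresql://db-node-1/warehouse"; // placeholder
    int lo = 1000, hi = 2000; // this map task's key range

    try (Connection c = DriverManager.getConnection(jdbcUrl, "user", "pass");
         PreparedStatement ps = c.prepareStatement(
             "SELECT k, v FROM indexed_table WHERE k BETWEEN ? AND ?")) {
      ps.setInt(1, lo);
      ps.setInt(2, hi);
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          // Emit (k, v) as this map task's input key-value pair.
          System.out.println(rs.getInt("k") + "\t" + rs.getString("v"));
        }
      }
    }
  }
}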

a.3) Scheduling: MR uses run-time scheduling, in which the scheduler assigns data blocks to the available nodes for further processing, while a parallel DBMS uses a compile-time scheduling strategy, in which the query optimizer generates a distributed query plan when a query is submitted, and the processing of data blocks at each node is done according to the generated plan. In that scheme, no scheduling cost is incurred after the generation of the query plan. MR's run-time scheduling strategy is more expensive and slows down performance, but it gives MR its elastically scalable behaviour, i.e. the ability to dynamically adjust resources during job execution. Run-time scheduling depends on two parameters: 1) the size of the data blocks and 2) the scheduling algorithm. To improve performance, it is recommended to use large data blocks, because larger blocks require less scheduling time per unit of data. For the second parameter, an appropriate scheduling algorithm should be used to recover the performance lost to run-time scheduling; researchers are working on this. So, by proper tuning of the above parameters, the performance of MR can be improved.
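As a concrete knob for the block-size parameter, the sketch below raises the minimum input split size so that a job schedules fewer, larger map tasks. It uses Hadoop's standard split-size setting; the 256 MB figure is an illustrative assumption, not a recommendation from the paper, and the job setup is left incomplete on purpose.

Listing 7. Enlarging input splits to reduce per-block scheduling overhead (sketch).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "large-split job");
    // Force splits of at least 256 MB (the default is typically one HDFS block),
    // so the scheduler has fewer map tasks to place.
    FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
    System.out.println("min split size = "
        + job.getConfiguration().get(FileInputFormat.SPLIT_MINSIZE));
    // ... set mapper/reducer/input/output as in an ordinary job ...
  }
}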

IV. LITERATURE SURVEY

McKinsey's Business Technology Office and the McKinsey Global Institute (MGI) conducted a study of Big Data in five domains: healthcare in the US, the public sector in Europe, retail in the United States, manufacturing, and personal location data globally. Their study shows that Big Data can generate value in each listed domain; for example, by using Big Data a retailer can increase its operating margin by 60% [11]. The same study projects that the U.S. faces a shortage of 140,000 to 190,000 people with analytical expertise, and of 1.5 million managers and analysts with the skills to understand and make decisions based on the analysis of Big Data [12]. MapReduce plays a significant role in solving the issues related to Big Data. MapReduce was proposed by Dean and Ghemawat in [4] as a programming model for processing large amounts of data. MapReduce can also be used as an analytic tool for processing relational queries, as has been confirmed in [13][14][15][16][17].


V. CONCLUSION AND FUTURE SCOPE

This paper has concluded that Big Data plays a very significant role in industry for making better business decisions, but the main issue with Big Data is its analysis. MR plays a significant role in Big Data analysis and offers a competitive edge equal to that of parallel DBMSs. The performance of MR can be improved by proper tuning of its performance-affecting parameters, but further research is still required in this direction: researchers need to provide proper grouping algorithms and scheduling algorithms to make MR more effective performance-wise.


REFERENCES

[1] Apache Hadoop. http://hadoop.apache.org.
[2] Yahoo! Hadoop blog. http://developer.yahoo.net/blogs/hadoop/2008/09/.
[3] Improving Decision Making in the World of Big Data. http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/.
[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[5] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165-178. ACM, 2009.
[6] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
[7] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
[8] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 53(1):64-71, 2010.
[9] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The performance of MapReduce: an in-depth study. Proceedings of the VLDB Endowment, 3(1):472-483, 2010.
[10] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment, 2(1):922-933, 2009.
[11] Big Data: the next frontier for innovation, competition and productivity. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation.
[12] Big Data: the next frontier for competition. http://www.mckinsey.com/features/big_data.
[13] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, pages 1029-1040, 2007.
[14] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099-1110. ACM, 2008.
[15] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2):1265-1276, 2008.
[16] D. J. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: an integrated computation and data management system. Proceedings of the VLDB Endowment, 1(1):28-41, 2008.
[17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a MapReduce framework. Proceedings of the VLDB Endowment, 2(2):1626-1629, 2009.
[18] S. Babu. Towards automatic optimization of MapReduce programs. In SoCC, pages 137-142. ACM, 2010.
[19] Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/.
[20] C. Ji, Y. Li, W. Qi, U. Awada, and K. Li. Big Data processing in cloud computing environments. In International Symposium on Pervasive Systems, Algorithms and Networks. IEEE, 2012.
[21] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, pages 29-42, 2008.
[22] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: a distributed storage system for structured data. In OSDI, pages 205-218, 2006.
