Shweta Pandey
Shri Vaishnav Institute of Tech. & Science
Indore, India
shweta.pandey2781@gmail.com
Abstract: Big Data has emerged rapidly as a key enabler for
social business; it offers an opportunity to create extraordinary
business advantage and better service delivery. Big Data is
bringing a positive change to the decision-making process of many
business organizations. Along with these benefits, Big Data brings
several issues and challenges related to Big Data management,
processing and analysis. These challenges stem from the 3Vs:
Volume (a large amount of data), Velocity (data arriving at high
speed) and Variety (data coming from heterogeneous sources). In
the definition of Big Data, "Big" refers to a dataset that has
grown so large that it becomes difficult to manage with existing
data management concepts and tools. MapReduce plays a very
significant role in the processing of Big Data. This paper gives a
brief overview of Big Data and its related issues, and emphasizes
the role of MapReduce in Big Data processing. MapReduce is
elastically scalable, efficient and fault tolerant for analysing
large data sets; the paper highlights the features of MapReduce
that, in comparison with other design models, make it a popular
tool for processing large-scale data. An analysis of the
performance factors of MapReduce shows that eliminating their
adverse effects through optimization improves the performance of
MapReduce.
I. INTRODUCTION
For Big Data processing, Parallel DBMSs and MapReduce (MR) are
both available solutions, and each has its own importance.
Section 1 presents a table comparing MapReduce (MR) and Parallel
DBMSs, based on the work done in [7].
Table II-1. MapReduce (MR) compared with Parallel DBMSs

Complicated transformations
  MapReduce: Complicated transformations are easier to express in MR.
  Parallel DBMSs: Complicated transformations are difficult to express in SQL. User-defined functions (UDFs) can be used with SQL for such complications, but UDF support is either buggy (DBMS-X) or missing (Vertica).

Compression optimization
  MapReduce: Compression is less valuable in MR because parsing of records is mandatory at run time. This is one of the reasons for the performance difference between MR and Parallel DBMSs.
  Parallel DBMSs: Compression is more valuable in a Parallel DBMS, as parsing is done at load time.

Data Format
  MapReduce: MR allows data in any arbitrary format.
  Parallel DBMSs: All DBMSs require that data conform to a well-defined schema.

Indexing
  MapReduce: MR does not use pre-generated indices [7].
  Parallel DBMSs: A Parallel DBMS uses a predefined index before processing the data. This is one of the reasons why Parallel DBMSs perform better.

Languages
  MapReduce: MR tasks use a procedural, object-oriented language, e.g. Java.
  Parallel DBMSs: Parallel DBMSs use the declarative language SQL.

Programming Model
  MapReduce: The simplicity of the MR model is one of its benefits. MR has only two functions, Map() and Reduce(), so it performs complex computations by simple chaining of Map() and Reduce(). MR follows a modest and more comprehensible step-by-step manner.
  Parallel DBMSs: It is not as simple as the MR model: the user has to write SQL queries, which can be hard if the query is complex.

Query execution strategies
  MapReduce: MR divides the dataset into smaller parts and distributes them. In contrast to the loading and indexing phase of a Parallel DBMS, MR only loads the data into a distributed file system and processes it on the fly. As loading is not required, MR takes less time to process the data than a Parallel DBMS.
  Parallel DBMSs: For processing, the query system automatically computes a query plan, distributes it to the entire cluster, and executes parts of the query in parallel over the cluster. It uses loading and indexing during the processing of data, which improves its performance, but the loading phase takes a long time.

Storage Independence
  MapReduce: MR is storage independent and simple; loading the data into a database before processing is not required.
  Parallel DBMSs: A Parallel DBMS is not storage independent and not as simple as MR. The programmer needs to specify the objective in a high-level language, and then an SQL translator is used.

Strengths
  MapReduce: Fault tolerance, storage system independence, flexibility and simplicity.
  Parallel DBMSs: In contrast to MR, the strengths of a Parallel DBMS are robustness and high performance.

Type of Dataset
  MapReduce: If the data is dynamic and the user is concerned with only a few queries, MR is more useful.
  Parallel DBMSs: A Parallel DBMS is preferred for a static dataset that has to be queried frequently.

Use of Failure Model
  MapReduce: MR widely uses the fine-grained failure model, but performance is low.
  Parallel DBMSs: On any failure, the model has to re-execute the whole task, which increases the cost but provides better performance in comparison to MR.

Implementation & Code
  MapReduce: The code is longer for an MR implementation than for a Parallel DBMS because the MR implementation handles parallelism, node failures, data distribution and load balancing.
  Parallel DBMSs: It requires less coding.

Time
  MapReduce: MR needs less time to process data than a Parallel DBMS because loading of data is not required for processing.
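The Programming Model row above states that MR expresses complex computations by chaining Map() and Reduce(). A minimal single-process sketch of that idea follows. It is an illustration only, not Hadoop's API: the helper names `run_mapreduce`, `map_words`, `map_frequency` and `reduce_sum`, as well as the sample input, are assumptions introduced here, and the sketch omits distribution, data partitioning and fault tolerance.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce pass in a single process (hypothetical helper)."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            groups[key].append(value)  # "shuffle": group values by key
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


def map_words(line):
    # Emit (word, 1) for every word in a line of text.
    for word in line.lower().split():
        yield word, 1


def map_frequency(item):
    # Input is a (word, count) pair from the first job; emit (count, 1).
    word, count = item
    yield count, 1


def reduce_sum(key, values):
    # Sum all values that share the same key.
    return key, sum(values)


lines = ["big data needs big tools", "map reduce processes big data"]

# Job 1: count the occurrences of each word.
word_counts = run_mapreduce(lines, map_words, reduce_sum)

# Job 2, chained on the output of job 1: how many words occur n times?
histogram = run_mapreduce(word_counts.items(), map_frequency, reduce_sum)
```

The second job reuses the same reduce function and consumes the first job's output directly, which is the kind of simple chaining the table refers to.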
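The Use of Failure Model row contrasts re-running only the failed piece of work with re-executing the whole task. The sketch below counts task executions under both policies. It is a toy model under stated assumptions: the function names and the "each listed task fails exactly once" rule are hypothetical, and the sketch does not model the cost of materializing intermediate results, which is where the fine-grained model pays its performance penalty when nothing fails.

```python
def run_fine_grained(tasks, fail_once):
    """Re-run only the task that failed (fine-grained recovery)."""
    executions = 0
    pending_failures = set(fail_once)
    for task in tasks:
        while True:
            executions += 1
            if task in pending_failures:
                # Assumption: a task fails on its first attempt only.
                pending_failures.discard(task)
                continue  # retry just this task
            break
    return executions


def run_coarse_grained(tasks, fail_once):
    """Restart the whole job whenever any task fails."""
    executions = 0
    pending_failures = set(fail_once)
    while True:
        failed = False
        for task in tasks:
            executions += 1
            if task in pending_failures:
                pending_failures.discard(task)
                failed = True
                break  # abandon this run and start over
        if not failed:
            return executions


# Ten tasks, of which task 7 fails once.
fine = run_fine_grained(range(10), {7})
coarse = run_coarse_grained(range(10), {7})
```

In this toy setting the fine-grained policy repeats one task, while the coarse-grained policy repeats every task that ran before the failure, so the gap grows with job length and failure rate.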
and fills the mutable object with new values. Here only one data
object is created for any number of decoded records, so the
mutable scheme is faster than the immutable scheme: the immutable
scheme incurs CPU overhead because it generates a new immutable
object for each decoded record.
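The difference between the two decoding schemes can be sketched as follows. This is an illustration under stated assumptions, not the decoder of any particular framework: the `Record` class, the tab-separated input format and both function names are hypothetical, and Python's object-allocation costs differ from those of the Java runtime that MR implementations typically use, so the sketch shows only the pattern, not the measured speed-up.

```python
class Record:
    """A decoded key/value record (hypothetical type)."""
    __slots__ = ("key", "value")

    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value


def decode_immutable(lines):
    """Create a fresh record object for every decoded line."""
    for line in lines:
        key, value = line.split("\t")
        yield Record(key, int(value))


def decode_mutable(lines):
    """Reuse one record object, overwriting its fields for every line."""
    record = Record()
    for line in lines:
        key, value = line.split("\t")
        record.key = key
        record.value = int(value)
        yield record  # the same object each time, with new values


raw = ["a\t1", "b\t2", "c\t3"]

total_mutable = sum(r.value for r in decode_mutable(raw))
total_immutable = sum(r.value for r in decode_immutable(raw))

# Count distinct objects handed to the consumer by each scheme.
distinct_mutable = len({id(r) for r in decode_mutable(raw)})
distinct_immutable = len({id(r) for r in list(decode_immutable(raw))})
```

Both schemes yield the same values, but the mutable one hands the consumer a single reused object. The trade-off is that a consumer which needs to keep a record beyond the current iteration must copy its fields first.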
IV. LITERATURE SURVEY
REFERENCES
[1] http://hadoop.apache.org
[2] http://developer.yahoo.net/blogs/hadoop/2008/09/
[3] Improving Decision Making in the World of Big Data. http://www.forbes.com/sites/christopherfrank/2012/03/25/improving-decision-making-in-the-world-of-big-data/
[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In OSDI, pages 137-150, 2004.
[5] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, pages 165-178. ACM, 2009.
[6] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 53(1):72-77, 2010.
[7] Jeffrey Dean and Sanjay Ghemawat. An article on MapReduce: A flexible data processing tool. In SIGMOD, 3(1):72-75. ACM, 2010.
[8] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes. Commun. ACM, 53(1):64-71, 2010.
[9] Dawei Jiang, Beng Chin Ooi, Lei Shi, and Sai Wu. The performance of MapReduce: An in-depth study. Proceedings of the VLDB Endowment, 3(1):110-120, 2010.
[10] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., 2(1):922-933, 2009.
[11] Big Data: The next frontier for innovation, competition and productivity. http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
[12] Big Data: The next frontier for competition. http://www.mckinsey.com/features/big_data
[13] H.-C. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-Reduce-Merge: simplified relational data processing on large clusters. In SIGMOD, pages 121-134, 2007.
[14] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, pages 1099-1110. ACM, 2008.
[15] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265-1276, 2008.
[16] D. J. DeWitt, E. Paulson, E. Robinson, J. Naughton, J. Royalty, S. Shankar, and A. Krioukov. Clustera: An integrated computation and data management system. Proc. VLDB Endow., 1(1):28-41, 2008.
[17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A warehousing solution over a map-reduce framework. PVLDB, 2(2):1626-1629, 2009.
[18] S. Babu. Towards automatic optimization of MapReduce programs. In SoCC, pages 137-142. ACM, 2010.
[19] Amazon Elastic Compute Cloud (Amazon EC2). http://aws.amazon.com/ec2/
[20] Changqing Ji, Yu Li, Wenming Qi and Uchechukwu Awada, Keqiu L. Big Data Processing in Cloud Computing Environments. In International Symposium on Pervasive Systems, Algorithms and Networks, pages 1087-4089. IEEE, 2012.
[21] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, pages 29-42, 2008.
[22] F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed structured data storage system. In OSDI, 7(1):305-314, 2006.