
BIG DATA: LARGE-SCALE DATA PROCESSING USING CLUSTERS

Cloud computing has emerged as the default paradigm for a variety of fields, especially
where resources and infrastructure are consumed through distributed access. This places
heavy demands on servers, and Quality of Service (QoS) suffers as a result. Data analysis
is an important capability of cloud computing: it allows huge amounts of data, commonly
called Big Data, to be processed over very large clusters. MapReduce is a popular solution
for handling big data in the cloud because of its excellent scalability and built-in fault
tolerance. Compared to its competitors, however, such as parallel databases, MapReduce is
slower. Complex data analysis tasks require joining multiple data sets in order to compute
aggregates, and whether MapReduce can be customized into a scalable and efficient system
for such tasks remains an open question. The proposed Map-Join-Reduce system extends and
improves the MapReduce runtime framework to efficiently process complex data analysis
tasks on large cloud clusters. It introduces a filtering-join-aggregation programming
model, an extension of Apache MapReduce's filtering model. A new data processing strategy
performs filtering-join-aggregation tasks in two successive MapReduce jobs. The first job
applies filtering logic to all the data sets in parallel, joins the qualified tuples, and
pushes the join results to the reducers for partial aggregation. The second job combines
the partial aggregates from the first and produces the final output. The advantage of this
approach is that multiple data sets can be joined in one go, avoiding the frequent
checkpointing and shuffling of intermediate results that are performance bottlenecks in
most current MapReduce-based systems.
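The two-job strategy above can be sketched in plain Python. This is a hypothetical, single-process simulation, not the paper's implementation: the data sets (orders, customers), the filter threshold, and the function names are all illustrative assumptions. Job 1 filters each data set, joins qualified tuples, and emits partial aggregates; job 2 merges the partials.

```python
from collections import defaultdict

# Illustrative in-memory sketch of filtering-join-aggregation in two "jobs".
# Data and names are invented for the example, not from Map-Join-Reduce.

orders = [  # (order_id, customer_id, amount)
    (1, "c1", 120.0), (2, "c2", 40.0), (3, "c1", 75.0),
]
customers = [  # (customer_id, region)
    ("c1", "EU"), ("c2", "US"),
]

def job1_filter_join_aggregate(orders_part, customers, min_amount):
    # Filtering logic, applied to each data set independently.
    qualified = [o for o in orders_part if o[2] >= min_amount]
    region_of = {cid: region for cid, region in customers}

    # Join qualified tuples, then aggregate partially per reducer (here,
    # per call), keyed by region.
    partials = defaultdict(float)
    for _, cust_id, amount in qualified:
        if cust_id in region_of:
            partials[region_of[cust_id]] += amount
    return dict(partials)

def job2_combine(partial_results):
    # Final job: merge the partial aggregates produced by every job-1 reducer.
    final = defaultdict(float)
    for partials in partial_results:
        for region, total in partials.items():
            final[region] += total
    return dict(final)

# Pretend job 1 ran over two partitions of the order data.
p1 = job1_filter_join_aggregate(orders[:2], customers, min_amount=50.0)
p2 = job1_filter_join_aggregate(orders[2:], customers, min_amount=50.0)
print(job2_combine([p1, p2]))  # → {'EU': 195.0}
```

Because the join and the partial aggregation happen inside one job, the qualified tuples are shuffled only once before the final combine, which is the bottleneck the approach avoids.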

The technology press has been focusing on the revolution of "cloud computing," a paradigm
that entails harnessing large numbers of processors working in parallel to solve computing
problems. In effect, this suggests constructing a data center by lining up a large number
of low-end servers, rather than deploying a smaller set of high-end servers. Along with
this interest in clusters has come a proliferation of tools for programming them.
MapReduce (MR) is one such tool, an attractive option to many because it provides a simple
model through which users are able to express relatively sophisticated distributed
programs.
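That simplicity comes from the programming model: the user writes only a map function and a reduce function, and the framework handles grouping and distribution. The canonical word-count example can be sketched as below; the single-process runner is an assumption for illustration, not a real distributed framework.

```python
from collections import defaultdict
from itertools import chain

# Minimal sketch of the MR programming model: the user supplies map_fn and
# reduce_fn; run_mapreduce stands in for the framework's shuffle and reduce.

def map_fn(document):
    # Emit (key, value) pairs: one (word, 1) per word occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Combine all values emitted for one key.
    return (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(x) for x in inputs):
        groups[key].append(value)          # shuffle: group values by key
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["big data big clusters", "data processing"]
print(run_mapreduce(docs, map_fn, reduce_fn))
# → {'big': 2, 'data': 2, 'clusters': 1, 'processing': 1}
```

The user-visible surface is just the two functions; everything else (partitioning, scheduling, fault tolerance) is the framework's job, which is why the model scales to sophisticated distributed programs.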
