
Big Data Analysis Using Apache Hadoop

Jyoti Nandimath
Dept. of Computer Engineering
SKNCOE, Pune, India
jyotign@gmail.com

Ankur Patil
Dept. of Computer Engineering
SKNCOE, Pune, India
ankurpatil23@gmail.com

Ekata Banerjee
Dept. of Computer Engineering
SKNCOE, Pune, India
koool.ekata@gmail.com

Pratima Kakade
Dept. of Computer Engineering
SKNCOE, Pune, India
pratimakakade@gmail.com

Saumitra Vaidya
Dept. of Computer Engineering
SKNCOE, Pune, India
saumitravaidya28@gmail.com

Abstract - The paradigm of processing huge datasets has shifted from centralized architecture to distributed architecture. As enterprises faced issues of gathering large chunks of data, they found that the data could not be processed using any of the existing centralized architecture solutions. Apart from time constraints, the enterprises faced issues of efficiency, performance and elevated infrastructure cost with data processing in the centralized environment. With the help of distributed architecture, these large organizations were able to overcome the problem of extracting relevant information from a huge data dump. One of the best open source tools used in the market to harness the distributed architecture in order to solve data processing problems is Apache Hadoop. Using Apache Hadoop's various components, such as data clusters, the MapReduce algorithm and distributed processing, we resolve various location-based complex data problems and feed the relevant information back into the system, thereby improving the user experience.

Index Terms - Hadoop, big data, MapReduce, data processing

I. INTRODUCTION

The amount of data generated every day is expanding drastically. Big data is a popular term used to describe data whose volume runs into zettabytes. Governments, companies and many other organisations try to acquire and store data about their citizens and customers in order to know them better and predict customer behaviour. Social networking websites generate new data every second, and handling such data is one of the major challenges companies are facing. Data stored in data warehouses is causing disruption because it is in a raw format; proper analysis and processing must be done in order to produce usable information out of it.

New tools are being used to handle such a large amount of data in a short time. Apache Hadoop is a Java-based programming framework used for processing large data sets in a distributed computing environment. Hadoop is used in systems where multiple nodes are present that can process terabytes of data. Hadoop uses its own file system, HDFS, which facilitates fast transfer of data, can sustain node failure and avoids failure of the system as a whole. Hadoop uses the MapReduce algorithm, which breaks big data down into smaller chunks and performs operations on them.

Various technologies come hand-in-hand to accomplish this task: the Spring Hadoop Data Framework for the basic foundations and running of the MapReduce jobs, Apache Maven for distributed building of the code, REST web services for communication, and Apache Hadoop for distributed processing of the huge dataset.

II. RELATED WORK

The Hadoop framework is used by many big companies, such as Google, Yahoo and IBM, for applications such as search engines, advertising, and information gathering and processing.
A. Apache Hadoop goes realtime at Facebook

At Facebook, Hadoop has traditionally been used in conjunction with Hive for storage and analysis of large data sets. Most of this analysis occurs in offline batch jobs, and the emphasis has been on maximizing throughput and efficiency. These workloads typically read and write large amounts of data from disk sequentially. As such, there has been less emphasis on making Hadoop performant for random-access workloads by providing low-latency access to HDFS. Instead, Facebook has used a combination of large clusters of MySQL databases and caching tiers built using memcached [9]. In many cases, results from Hadoop are uploaded into MySQL or memcached for consumption by the web tier.

Recently, a new generation of applications has arisen at Facebook that requires very high write throughput and cheap, elastic storage, while simultaneously requiring low-latency, disk-efficient sequential and random read performance. MySQL storage engines are proven and have very good random read performance, but typically suffer from low random write throughput. It is difficult to scale MySQL clusters up rapidly while maintaining good load balancing and high uptime, administration of MySQL clusters requires a relatively high management overhead, and they typically use more expensive hardware. Given Facebook's high confidence in the reliability and scalability of HDFS, the company began to explore Hadoop and HBase for such applications.

B. Yelp: uses AWS and Hadoop

Yelp originally depended upon giant RAIDs to store its logs, along with a single local instance of Hadoop. When Yelp moved to Amazon Elastic MapReduce, it replaced the RAIDs with Amazon Simple Storage Service (Amazon S3) and immediately transferred all Hadoop jobs to Amazon Elastic MapReduce.

Yelp uses Amazon S3 to store daily logs and photos, generating around 100 GB of logs per day. The company also uses Amazon Elastic MapReduce to power approximately 20 separate batch scripts, most of them processing the logs. Features powered by Amazon Elastic MapReduce include:
1. People Who Viewed This Also Viewed
2. Review highlights
3. Autocomplete as you type on search
4. Search spelling suggestions
5. Top searches
6. Ads

Yelp relies on MapReduce, which is about the simplest way to break a big job down into little pieces: mappers read lines of input and emit tuples of (key, value), and each key, together with all of its corresponding values, is sent to a reducer.
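As a concrete illustration of this (key, value) flow, the following is a minimal sketch of a Hadoop job, written in Java, that counts how often each search term appears in a log. The class names, the tab-separated log format and the position of the search-term field are assumptions made for illustration; this is not Yelp's actual code.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: reads one log line per call and emits (searchTerm, 1).
public class SearchCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed log format: tab-separated fields, search term in the third field.
        String[] fields = line.toString().split("\t");
        if (fields.length > 2) {
            term.set(fields[2]);
            context.write(term, ONE);
        }
    }
}

// Reducer: receives every count emitted for a term and sums them.
class SearchCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable c : counts) {
            total += c.get();
        }
        context.write(term, new IntWritable(total));
    }
}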

III. THE PROPOSED SCHEMES

As mentioned earlier, we overcome the problem of analysing big data using Apache Hadoop. The processing is done in four steps, which include creating servers of the required configuration using Amazon Web Services, with their CloudFormation language, and configuring them using Chef. Data on the cluster is stored using MongoDB, which stores data in the form of key:value pairs; this is advantageous over a relational database for managing large amounts of data. Scripts can be written in various languages to import data from collections in MongoDB into a JSON file, which can then be processed in Hadoop as per the user's requirements.

Hadoop jobs are written in the Spring framework; these jobs implement the MapReduce algorithm for data processing. Six jobs are implemented in STS for data processing in a location-based social networking application.

A record of the whole session has to be maintained in a log file using aspect-oriented programming (AOP) in Spring. The output produced after data processing in the Hadoop job has to be exported back to the database, and the old values in the database have to be updated immediately after processing, in order to prevent loss of valuable data. The whole process is automated using Java scripts and tasks written in the Ant build tool for executing executable JAR files.

A. Importing data from the database to Hadoop

Data is stored in MongoDB, which is a NoSQL database. It is stored in the form of key-value pairs, which makes later updates and index creation easier than with a traditional relational database. JSON (JavaScript Object Notation) is a data interchange format which is easy for humans to read and write and easy for machines to parse and generate. It is a text format which is completely language independent but is readily understandable to C and Java programmers.

Using JSON files with Hadoop is beneficial because they store information in key-value form. MapReduce is designed to take key-value pairs as its input, so there is no additional cost of conversion.

Groovy scripts can be written to connect to the remote database and copy a collection to a JSON file, which is then passed to HDFS where the required operations are performed. The Groovy scripts are called and run in an application using the Groovy class loader.

The file system of Hadoop is different from the file system of the OS: it uses a single name node and multiple data nodes to store data and perform jobs. Hence the JSON file has to be transferred to HDFS from the local file system. A Java script is included in a batch file to automate this process.
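The paper performs this import with Groovy scripts; the sketch below expresses the same idea in plain Java using the MongoDB Java driver and the Hadoop FileSystem API. The connection string, database name, collection name and file paths are placeholder assumptions.

import java.io.IOException;
import java.io.PrintWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;

// Dumps a MongoDB collection to a local JSON file (one document per line)
// and copies that file into HDFS so a Hadoop job can read it.
public class MongoToHdfsImporter {
    public static void main(String[] args) throws IOException {
        String localJson = "/tmp/locations.json";   // placeholder local path
        String hdfsJson = "/data/locations.json";   // placeholder HDFS path

        // 1. Export every document of the collection as a JSON line.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017");
             PrintWriter out = new PrintWriter(localJson, "UTF-8")) {
            MongoCollection<Document> locations =
                    client.getDatabase("socialapp").getCollection("locations");
            for (Document doc : locations.find()) {
                out.println(doc.toJson());
            }
        }

        // 2. Copy the JSON file from the local file system into HDFS.
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyFromLocalFile(new Path(localJson), new Path(hdfsJson));
        fs.close();
    }
}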

B. Performing jobs in Hadoop

Map/Reduce is a programming paradigm that was made popular by Google, in which a task is divided into small portions and distributed to a large number of nodes for processing (map), and the results are then summarized into the final answer (reduce). Hadoop also uses Map/Reduce for data processing, so the different processing functions are written in the form of Hadoop jobs. A Hadoop job consists of mapper and reducer functions. These jobs are written in the Spring framework, with STS used as the integrated development environment (IDE). Six Hadoop jobs are implemented in the Spring framework. These jobs are required to process data for a location-based social networking application; their input is a JSON file containing different information fields about various locations.

The jobs and their implementation are explained below. Hadoop jobs are written for calculating:
1. Average Rating: ratings for a location are added up from different records and the average is calculated. A higher rating indicates that a particular location is more popular.
2. Total Recommendations: the total number of recommendations given by users for a particular location is calculated.
3. Total Votes: the total number of votes given by users for a particular location is calculated.
4. Total Suggestions: a user can suggest a location to other users who visit the web site; the total number of suggestions is calculated.
5. Unique Tags: the unique tags corresponding to a location are calculated.
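The job code itself is not reproduced in the paper. As a minimal sketch, the Average Rating job could be written as the mapper and reducer below, assuming each input line is one JSON record containing location_id and rating fields (the field names, and the availability of the BSON library on the job classpath, are assumptions): the mapper emits (location_id, rating) pairs and the reducer averages them.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.Document;

// Mapper: parses one JSON record per line and emits (location_id, rating).
public class AverageRatingMapper
        extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        Document record = Document.parse(line.toString());
        Object locationId = record.get("location_id");   // assumed field name
        Number rating = (Number) record.get("rating");    // assumed field name
        if (locationId != null && rating != null) {
            context.write(new Text(String.valueOf(locationId)),
                          new DoubleWritable(rating.doubleValue()));
        }
    }
}

// Reducer: sums all ratings for a location and divides by their count.
class AverageRatingReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text locationId, Iterable<DoubleWritable> ratings, Context context)
            throws IOException, InterruptedException {
        double sum = 0.0;
        long count = 0;
        for (DoubleWritable r : ratings) {
            sum += r.get();
            count++;
        }
        if (count > 0) {
            context.write(locationId, new DoubleWritable(sum / count));
        }
    }
}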

C. Exporting data back to the database

The output produced after data processing in the Hadoop job has to be exported back to the database, and the old values in the database have to be updated immediately after processing, in order to prevent loss of valuable data. Exporting of data is done in the following steps:
1. The output of the Hadoop job is in a key-value format. The key contains the location_id and the value is the specific value pertaining to the data processed.
2. There are six different jobs, and each job has its own export program that copies its specific output to the MongoDB database.

Consider the following example. Calculating the average rating for a particular location (say, Barbeque Nation) is a Hadoop job that we have implemented. Members from all around the globe post their ratings and comments regarding Barbeque Nation, and we need to give Barbeque Nation a final absolute rating in the interest of our users. This falls under the data processing part, which is implemented in the Hadoop job. The final output of the average ratings of all places is produced in a comma-separated output file. This file is in turn used in a Groovy script to update the records in the database: the key-value pairs are extracted from the output file and a query is generated to update the database. Hadoop and MongoDB are connected using the hadoop-mongodb connector.
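The paper drives this update through Groovy and the hadoop-mongodb connector; the sketch below shows the equivalent idea in plain Java with the MongoDB Java driver, reading the comma-separated output file and issuing one update per location. The file path, connection string and field names are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.bson.Document;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;

// Reads lines of the form "<location_id>,<average_rating>" produced by the
// Hadoop job and writes the new value back into the locations collection.
public class AverageRatingExporter {
    public static void main(String[] args) throws IOException {
        String outputFile = "/tmp/average-rating.csv";   // placeholder path

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017");
             BufferedReader reader = new BufferedReader(new FileReader(outputFile))) {
            MongoCollection<Document> locations =
                    client.getDatabase("socialapp").getCollection("locations");

            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split(",");
                if (parts.length != 2) {
                    continue;   // skip malformed lines
                }
                String locationId = parts[0].trim();
                double averageRating = Double.parseDouble(parts[1].trim());

                // Update the old value immediately so it is not lost.
                locations.updateOne(Filters.eq("location_id", locationId),
                                    Updates.set("average_rating", averageRating));
            }
        }
    }
}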

D. Logging and automation

The whole session of such a large process has to be maintained and logged to a file for the record. Aspect-oriented programming (AOP) allows us to put an aspect on a class, i.e. to intercept it, with three kinds of advice:
1. @Before: runs before the method execution.
2. @After: runs after the method has returned a result.
3. @Around: runs around the method execution, combining the two advices above.

An aspect can be implemented using the following steps:
1. Add the project dependencies: to enable AspectJ, aspectjrt.jar, aspectjweaver.jar and spring-aop.jar are required in the pom.xml file.
2. Define a Spring bean: a normal bean with a few methods, which is later intercepted via AspectJ annotations.
3. Enable AspectJ: in the Spring configuration file, put <aop:aspectj-autoproxy />, and define the aspect (interceptor) and the normal bean.

With the @Before advice, the logBefore() method is executed before the execution of a class method; with @After, the logAfter() method is executed after the execution of the class method; and with @Around, the logAround() method is executed around the class method. A sketch of such an aspect is given at the end of this section.

To overcome the shortage of time, and to avoid running such a large process manually, the whole process has to be automated. Ant is a build tool in which tasks can be written to automate the various steps:
1. Building the executable JAR file
2. Compiling the Java classes
3. Cleaning the build (ant clean)
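A minimal sketch of such a logging aspect, using Spring AOP with AspectJ annotations, is shown below. The pointcut package (com.example.hadoopjobs) and the use of console output instead of a log file are assumptions made for illustration.

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.After;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;
import org.springframework.stereotype.Component;

// Intercepts every method of the (assumed) job classes and logs the session.
@Aspect
@Component
public class SessionLoggingAspect {

    // Runs before each intercepted method.
    @Before("execution(* com.example.hadoopjobs..*.*(..))")
    public void logBefore(JoinPoint joinPoint) {
        System.out.println("Starting: " + joinPoint.getSignature());
    }

    // Runs after the intercepted method has returned.
    @After("execution(* com.example.hadoopjobs..*.*(..))")
    public void logAfter(JoinPoint joinPoint) {
        System.out.println("Finished: " + joinPoint.getSignature());
    }

    // Runs around the intercepted method and measures how long it took.
    @Around("execution(* com.example.hadoopjobs..*.*(..))")
    public Object logAround(ProceedingJoinPoint joinPoint) throws Throwable {
        long start = System.currentTimeMillis();
        Object result = joinPoint.proceed();   // invoke the actual method
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(joinPoint.getSignature() + " took " + elapsed + " ms");
        return result;
    }
}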

IV. CONCLUSION

The application performs operations on big data, such as computing average ratings, total recommendations and unique tags, in optimal time, producing its output with minimum utilization of resources. The data analysis and processing are used in a social networking application, thus providing the necessary information to the application's users with the least effort.
V. ACKNOWLEDGMENT

We would like to thank Prof. Jyoti Nandimath for guiding us throughout the selection and implementation of the topic. We would also like to thank Mr. Divyansh Chaturvedi for his guidance and for providing the resources that made this paper possible.
VI. REFERENCES
[1] Chuck Lam, Hadoop in Action.
[2] Tom White, Hadoop: The Definitive Guide, O'Reilly.
[3] Jason Venner, Pro Hadoop: Build Scalable Distributed Applications in the Cloud.
[4] Michael G. Noll, tutorials on applied research, big data and distributed systems, http://www.michael-noll.com
[5] Rod Johnson, Expert One-on-One J2EE Design and Development.
[6] Mkyong, Spring tutorials.
[7] Dierk König, with Andrew Glover, Paul King, Guillaume Laforge and Jon Skeet, Groovy in Action.
[8] MapReduce: Simplified Data Processing on Large Clusters, available at http://labs.google.com/papers/mapreduce-osdi04.pdf
[9] MongoDB: The Definitive Guide, O'Reilly, September 2010.
[10] Costin Leau, Spring Hadoop Reference Manual.