w2014 12 PDF

INTEGRATING R AND HADOOP FOR BIG
DATA ANALYSIS
Bogdan Oancea
"Nicolae Titulescu" University of Bucharest
Raluca Mariana Dragoescu

The Bucharest University of Economic Studies,
BIG DATA
The term big data was defined as data sets of

increasing volume, velocity and variety 3V;
Big data sizes are ranging from a few hundreds
terabytes to many petabytes of data in a single
data set;
Requires high computing power and large
storage devices.
BIG DATA AN OFFICIAL STATISTICS
Official statistics need to harness the potential
of big data to derive more relevant and timely
statistics;
Large data sources that can be used in official
statistics are:
Administrative data;
Commercial or transactional data;
Data provided by sensors;
Data provided by tracking devices;
Behavioral data (for example Internet searches);
Data provided by social media.
BIG DATA AND OFFICIAL STATISTICS
Challenges:
legislative issues;
maintaining the privacy of the data;
financial problems regarding the cost of sourcing

data;
data quality and suitability of statistical methods;
technological challenges
In this paper we will investigate a technological

problem: integrating R and Hadoop
R
Is a free software package for statistics and

data visualization;
Is available for UNIX, Windows and MacOS;
R is used as a computational platform for

regular statistics production in many official
statistics agencies;
It is used in many other sectors like finance,
retail, manufacturing etc.
HADOOP
Is a free software framework developed for

distributed processing of large data sets using
clusters of commodity hardware;
It was developed in Java;
Other languages could be used to: R, Python or

Ruby;
Available at http://hadoop.apache.org/.
HADOOP
The Hadoop framework includes:

Hadoop Distributed File System (HDFS);
Hadoop YARN -a framework for job scheduling and
cluster resource management;
Hadoop MapReduce a system for parallel
processing of large data sets;
HADOOP
The main features of the Hadoop framework

can be summarized as follows:
High degree of scalability;
Cost effective: it allows for massively parallel
computing using commodity hardware;
Flexibility: is able to use any type of data, structured
or not;
Fault tolerance.
R AND HADOOP
R and Streaming;
Rhipe;
Rhadoop;
R AND STREAMING
Allows users to run Map/Reduce jobs with any

script or executable that can access standard
input/standard output;
No client-side integration with R;
R AND STREAMING AN EXAMPLE
$ ${HADOOP_HOME}/bin/Hadoop jar
${HADOOP_HOME}/contrib/streaming/*.jar \
-inputformat
org.apache.hadoop.mapred.TextInputFormat \
-input input_data.txt \
-output output \
-mapper /home/tst/src/map.R \
-reducer /home/tst/src/reduce.R \
-file /home/tst/src/map.R \
-file /home/tst/src/reduce.R
R AND HADOOP
The integration of R and Hadoop using

Streaming is an easy task;
Requires that R should be installed on every
DataNode of the Hadoop cluster ;
RHIPE
Rhipe = R and Hadoop Integrated

Programming Environment;
Provides a tight integration between R and
Hadoop;
Allows the user to carry out data analysis of big
data directly in R;
Available at www.datadr.org.
RHIPE
Rhipe is an R library which allows running a

MapReduce job within R;
Install requirements:
R on each Data Node;
Protocol Buffers on each Data Node;
Rhipe on each Data Node;

RHIPE AN EXAMPLE
library(Rhipe)
rhinit(TRUE, TRUE);
map<-expression ( {lapply (map.values, function(mapper))})
reduce<-expression(
pre = {},
reduce = {},
post = {},
)
x <- rhmr(map=map, reduce=reduce,
ifolder=inputPath,
ofolder=outputPath,
inout=c('text', 'text'),
jobname='a job name'))
rhex(z)
RHADOOP
RHadoop is an open source project developed by

Revolution Analytics;
allows running a MapReduce jobs within R just like
Rhipe;
Consists in:
plyrmr -providing common data manipulation
operations on very large data sets managed by Hadoop;
rmr a collEction of functions providing and integration
of R and MapReduce;
rdfs an interface between R and HDFS;
rhbase - an interface between R and HBase;
RHADOOP AN EXAMPLE
library(rmr)
map<-function(k,v) { }
reduce<-function(k,vv) { }
mapreduce( input =data.txt,
output=output,
textinputformat =rawtextinputformat,
map = map,
reduce=reduce
)
CONCLUSIONS
Each of the approaches has benefits and limitations:
R with Streaming raises no problems regarding installation;
Rhipe and RHadoop requires some effort to set up the
cluster;
The integration with R from the client side is high for Rhipe
and Rhadoop and is missing for R and Streaming.
Rhipe and RHadoop allows users to define and call their own
map and reduce functions within R;
Streaming uses a command line approach where the map
and reduce functions are passed as arguments.
There are other alternatives for large scale data analysis:
Apache Mahout, Apache Hive, commercial versions of R
provided by Revolution Analytics, Segue framework or
ORCH;

w2014 12 PDF

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

w2014 12 PDF

Diunggah oleh

Hak Cipta:

Format Tersedia

INTEGRATING R AND HADOOP FOR BIG

Raluca Mariana Dragoescu

The term big data was defined as data sets of

financial problems regarding the cost of sourcing

In this paper we will investigate a technological

Is a free software package for statistics and

R is used as a computational platform for

Is a free software framework developed for

Other languages could be used to: R, Python or

The Hadoop framework includes:

The main features of the Hadoop framework

Allows users to run Map/Reduce jobs with any

The integration of R and Hadoop using

Rhipe = R and Hadoop Integrated

Rhipe is an R library which allows running a

Rhipe on each Data Node;

RHadoop is an open source project developed by

Anda mungkin juga menyukai