
International Journal of Computer Systems (ISSN: 2394-1065), Volume 03 Issue 01, January, 2016

Available at http://www.ijcsonline.com/

Emergent Trends and Challenges in Big Data Analytics, Data Mining, Virtualization and Cyber Crimes: An Integrated Global Perspective- I
Gurdeep S Hura
Department of Mathematics and Computer Science,
University of Maryland Eastern Shore, Princess Anne, MD 21853
gshura@umes.edu

Abstract
Big data and big data analytics play very important roles in a variety of applications, with implementation capabilities and tools that support efficient solutions for those applications. This new technology has recently emerged as a very popular research and practice-oriented framework that implements i) data mining, ii) predictive analysis and forecasting, iii) text mining, iv) virtualization, v) optimization, vi) data security, and vii) virtualization tools for processing very large data sets, particularly cloud data, for exploring new business enterprise applications and decisions. Big data analytics was considered one of the fastest growing technology trends in 2014 and will continue to grow for a few more years, with a large number of big data applications in social networking, cloud computing, healthcare systems and many business systems. Thus, in order to understand this technology, we need to understand how each of the underlying concepts (data mining, virtualization and data security) has evolved and contributed to big data analytics.
With this in mind, we propose a two-part paper that provides a state of the art of each of these interrelated concepts used in big data analytics, starting with how each concept evolved, its applications, available tools, limitations and current status, so that researchers and developers can understand how this new technology can be used for new applications and for deriving new technologies, tools and frameworks. In this first part of the paper, we focus on the conceptual design of big data applications, big data analytics and solutions, a discussion of a number of open source framework tools, and the roles of data mining and virtualization. Data mining techniques have long been used in big data analysis, but recent multimedia big data applications call for newer data mining techniques for managing and analyzing the huge amounts of data involved. Further, easy representation and display of data analysis results call for efficient and user-friendly tools that help users interpret the data in a simple way. The second part of the paper focuses on the remaining two technologies, data virtualization and data security. Virtualization offers a very efficient tool for representing data, with capabilities for displaying its dynamic behavior. Since modern big data applications are implemented over the Internet, it is important to understand the various cyber-attacks and crimes that affect all the implementation phases of big data multimedia applications. This paper also provides insights into how applications can be protected from these attacks, and how cyber-crime analysis techniques can be used for reliable big data implementation.
Specifically, Part I discusses i) the main concepts, features, applications, implementation issues and capabilities of big data analytics and ii) data mining for mining, analyzing and processing big data, while Part II discusses i) virtualization tools to extract useful information from processed data and ii) data security. The rationale for considering these three subtopics of data mining, virtualization and data security is that all successful and implemented big data applications are derived from big data analytics. We hope these two papers will provide a clear understanding of big data analytics and describe how each of the concepts, such as data mining and text mining, virtualization and data security, has evolved and has now been integrated into big data analytics.

I. BACKGROUND OF BIG DATA ANALYTICS AND DATA MINING (PART I)

The first part of the paper discusses the background of this new technology of big data analytics and then presents how it has been used to implement advanced data mining techniques, virtualization frameworks and data security for various applications in the public sector, manufacturing, retail, healthcare, weather and scientific applications, etc. In other words, the paper describes each of the concepts in detail and how each of them has been implemented in big data analytics. The paper describes operations on big data and discusses known big data applications and their classifications. This is one of the reasons I have selected only these concepts and provided a detailed discussion of them. Further, the paper also describes various available open source tools that have been used to implement and solve big data applications and to implement various data mining techniques.
After introducing the basic concepts of big data analytics, the paper focuses on the data mining techniques that have been used in the past to offer a systematic approach to system analysis and that now find heavy use in the data representation, data collection and data analysis of multimedia applications. With different forms and formats of data from different sources, newer data mining
techniques for the collection and analysis of huge amounts of multimedia data need to be introduced. It is hoped that these new, efficient and formal data mining techniques will be used to understand big data analytics, with a view to offering easy understanding, easy data formatting, and interpretation and extraction of useful information from the data collected in applications. The paper briefly describes the suitable data mining techniques and presents how some of the existing techniques will be redefined for use in applications such as multimedia data applications, social networking, scientific weather data and many other similar applications.
The paper presents various unresolved issues and problems in big data analytics and data mining, together with challenges and possible future applications. It also presents future research initiatives.
II. BACKGROUND OF VIRTUALIZATION AND DATA SECURITY (PART II)

The second part of the paper presents the state of the art of the remaining two important technologies, virtualization and data security, that have been implemented in big data analytics.
One of the implementation phases of big data solutions is data processing. There are many methods that can be used for data processing. A simple, user-friendly, visual and dynamic representation of data can be implemented with data virtualization. This method provides not only an easy representation of the data, but also the dynamic behavior of data movement, and helps to extract useful information from the data. Virtualization tools represent the data processing pipeline in a very simple way for data analysis. The paper discusses the different architectures of virtualization tools, methodologies, mainframe virtualization, guidelines, and the various available abstraction tools of virtualization that have been used in big data applications.
Business and technology professionals and practitioners are deeply concerned about data security. Data coming from different sources such as mobile devices, real-time connectivity and digital business has changed the entire environment, making it harder to protect data assets over the Internet. We have seen some security measures implemented in big data analytics, and it is expected that future big data applications will play an increasingly crucial role in providing data security. Recent years have seen efforts in data analytics that implement various countermeasures for data security, such as intrusion detection, differential privacy, preventive measures, authentication, digital watermarking, malware countermeasures and many other measures. In order to implement operational strategies under serious crises, data security becomes very critical. Some organizations and professionals find it difficult to remain competitive in the absence of data security and are working to include advanced analytics capabilities that manage privacy and security challenges. By following this approach, they are able to create confidence and some level of trust among clients, customers and consumers. In order to reassure customers and consumers about privacy and data security issues, it is important to establish a framework that not only provides security but also evaluates and meets the needs of the business, big data technology and the consumers.
Along with a brief discussion of the role of data security in big data applications, the paper briefly describes malicious cyber attacks and crimes. It presents the challenges and problems associated with creating a secure communication environment over the Internet for big data applications. Further, it briefly describes the various attacks and crimes over the Internet known as cyber attacks and cyber crimes. The paper also presents the known cyber attacks and cyber crimes that may affect the data processing, data mining techniques and virtualization tools of big data applications over the Internet. After examining these attacks and crimes, the paper presents how big data implementations can include security issues in new applications. Further, it also presents cyber security analysis for big data applications.
In conclusion, after presenting the state of the art of big data analytics that implements data mining techniques in Part I, and virtualization, text mining, optimization and associated security issues in Part II, the papers summarize the challenges, unsolved problems and future trends in the areas of big data and big data analytics, data mining techniques, virtualization, and data security countermeasures for cyber attacks and crimes. Further, the papers also elaborate on the various new technologies and methodologies researchers are exploring to handle security issues in big data analytics.
Keywords: Big Data Analytics, Solution of big data applications, Data mining techniques, Virtualization, Virtualization Methodology and tools, Cyber-attacks, Cyber-crimes, Preventive and Defensive Measures
III. INTRODUCTION: PART I (BIG DATA ANALYTICS AND DATA MINING)
A. Big Data Analytics: A brief description
Big data analytics deals with large amounts of data to work with, as well as the processing techniques needed to handle and manage a large number of records with many attributes. The combination of big data and computing power with statistical analysis allows designers to explore new behavioral data throughout the day at various websites. Big data represents a database that cannot be processed and managed by current data mining techniques due to the large size and complexity of the data. Big data analytics includes the representation of data in a suitable form and makes use of data mining to extract useful information from these large datasets or streams of data. As stated above, big data analytics has recently emerged as a very popular research and practice-oriented framework that implements i) data mining, ii) predictive analysis and forecasting, iii) text mining, iv) virtualization, v) optimization, vi) data security, and vii) virtualization tools for processing very large data sets. In the implementation of big data applications, new data mining techniques and virtualization are required due to the volume, variability, forms and velocity of the data to be processed. A set of machine learning techniques based on statistical analysis and neural network technology for big data is still evolving, but it shows great potential for solving big data business problems. Further, a new concept of in-memory database
for enhancing the speed of analytic processing is further helping big data to provide more new applications.
The year 2014 saw this new technology emerge, providing solutions and implementations for a number of applications dealing with enormous amounts of data from different sources and devices with different formats. However, all these new applications suffer from data security problems, and as such big data analytics is taking on another role in data security. It has been observed that analytics has recently explored some countermeasures such as intrusion detection, differential privacy, digital watermarking, malware countermeasures, filters, etc.

IV. BASIC DEFINITIONS AND CONCEPTS OF BIG DATA ANALYTICS

Analytics and machine learning together are being applied to a new process of analytics itself. Big data is very large in volume, usually in the range of zettabytes (10^21 bytes) of data flowing from our computers, mobile devices and machine sensors onto the Internet. With the right tools for the collection and analysis of big data, organizations can dive into all their data and gain valuable insights that were previously unimaginable. It is important to know that big data technologies and analysis tools can move our applications into new business arenas. A framework must be introduced that can provide a big data solution for any application, with simple data processing, simple operations, simple setup, and simple code and application development for easy implementation.
The huge amount of application data stored in databases is growing exponentially, and it is becoming very difficult and a big challenge to implement big data analytics, including data collection, storage, processing, sharing, analysis, visualization, and other related issues of big data applications. Data volumes in modern applications range from gigabytes (10^9 bytes) to terabytes (10^12), petabytes (10^15), exabytes (10^18) and zettabytes (10^21). One survey mentioned that about 5 exabytes of data were created by all of us around the globe in 2003. This volume of data is still increasing exponentially and reached a few zettabytes in 2012. It is estimated that the volume of data may be over 8 zettabytes by the end of 2015 [1-2]. IBM showed that over 2.5 exabytes of data are being generated every day. Storing this large volume of data would require over 20 billion PCs, given that each PC today holds about 500 gigabytes of data. Google alone has more than one million servers around the world. There are over 6 billion mobile subscriptions around the world, out of a total population of nearly 7.5 billion, and these mobile subscribers send over ten billion text messages per day [1-2, 4].
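The byte scales quoted above can be checked with simple arithmetic. The sketch below uses the figures cited in the text (2.5 exabytes generated per day, 500 gigabytes per PC); the variable names are ours:

```python
# Decimal (SI) byte scales used in the text above.
MB, GB = 10**6, 10**9
TB, PB, EB, ZB = 10**12, 10**15, 10**18, 10**21

daily_data = 2500 * PB      # IBM's cited figure: 2.5 exabytes generated per day
pc_capacity = 500 * GB      # typical PC disk size cited in the text

# How many such PCs would one day's worth of generated data fill?
pcs_per_day = daily_data // pc_capacity
print(pcs_per_day)          # → 5000000 (five million PCs for a single day)
```

The text's figure of 20 billion PCs refers to the accumulated volume of stored data, not to a single day's output.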
There exist a number of frameworks and tools that can be used to generate appropriate solutions for big data business applications, but SAP HANA [1-6] seems to be the most widely accepted and robust framework for big data formatting and representation. SAP HANA is an in-memory, column-oriented, relational database management system developed by SAP SE. HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform. An in-memory platform or framework-as-a-service based on SAP HANA allows developers to build innovative applications with improved productivity. SAP HANA was previously called SAP High-Performance Analytic Appliance. It is very powerful, with predictive analytics that provide intuitive modeling, advanced data visualization, and profitability analysis of applications with big data. Business intelligence analysts are responsible for showing how to predict future outcomes and uncover opportunities that will maximize business performance [1-6, 8, 9, 11, 14, 38, 40].
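Since the text highlights HANA's in-memory, column-oriented design, a toy sketch may help show why a columnar layout suits analytic queries: an aggregate touches only the columns it needs. This illustrates the storage idea only, not SAP HANA's actual API; all table data are invented:

```python
# Row-oriented records, as an application might receive them.
rows = [
    {"region": "east", "sales": 120},
    {"region": "west", "sales": 340},
    {"region": "east", "sales": 95},
]

# Column-oriented layout: one contiguous list per attribute.
columns = {
    "region": [r["region"] for r in rows],
    "sales":  [r["sales"] for r in rows],
}

# An analytic aggregate scans only the columns it needs.
total_sales = sum(columns["sales"])
east_sales = sum(s for reg, s in zip(columns["region"], columns["sales"])
                 if reg == "east")
print(total_sales, east_sales)  # → 555 215
```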

V. COMPUTATIONAL OPERATIONS ON BIG DATA

Web services over the Internet support a variety of operations and services depending on the big data application. One very popular class of web operations is social network-based applications, whose data modeling and analysis deal with understanding user intelligence for more targeted advertising, marketing campaigns, capacity planning, customer behavior and buying patterns, and many other inferences. Based on these inferences, firms spend significant time and resources on optimizing big data content and recommendation engines. Some companies, such as Google and Amazon, publish articles related to their work. Inspired by these publications, developers are building similar technologies as open source software such as Lucene, Solr, Hadoop and HBase. Facebook, Twitter and LinkedIn have gone a step further, publishing open source projects for big data such as Cassandra, Hive, Pig, Voldemort, Storm and IndexTank [1, 14].
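Several of the open source stacks named above (notably Hadoop) are built around the MapReduce pattern. A single-machine sketch of that pattern follows; the documents and function names are ours, purely illustrative of the map and reduce phases:

```python
from collections import Counter
from itertools import chain

def map_phase(document):
    # Emit (word, 1) pairs, as a Hadoop mapper would.
    return [(word.lower(), 1) for word in document.split()]

def reduce_phase(pairs):
    # Sum counts per key, as a Hadoop reducer would.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

docs = ["big data analytics", "big data mining"]
counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))
print(counts["big"], counts["data"])  # → 2 2
```

In a real Hadoop cluster the map and reduce phases run in parallel across many machines, with a shuffle step grouping pairs by key between them.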
In 2012, the Obama administration announced big data initiatives of more than $200 million in research and development investments across the National Science Foundation, National Institutes of Health, Department of Defense, Department of Energy and United States Geological Survey. The investments were launched to advance the instruments and methods used to access, organize and glean findings from vast volumes of digital data [21].
Some of the biggest databases are used by both machines and human beings, and must be designed to be understood by users and developers alike. There does not seem to exist any systematic structure that can provide accurate interdependency relationships between the various components of the web system. Currently, information can be retrieved from a database using a single keyword or a combination of keywords, searching the web sites containing that information and showing their Uniform Resource Locators (URLs). Often this does not optimize the search time, as the result may contain a lot of unnecessary or irrelevant information, and as such it expects the users to narrow the search terms.
Information can be retrieved via hyperlinks to web content, which are mainly responsible for establishing physical connections between users and the requested information. The hosts or machines are unable to interpret these hyperlinks, and as such there needs to be some kind of mapping between the hardware and the interpreted
hyperlinks, which in turn may provide faster retrieval of the requested information. Defining structure from an unstructured dataset and deriving useful information from this structure is the first logical step in data analysis. The derivation of information is based on identifying documents in the collection on the basis of properties attributed to the documents by the user requesting the information retrieval [7, 9, 12-13, 38].
The data in these applications usually do not follow any particular format or type, so defining the structure of the data and deriving useful information from it becomes a big challenge, and has been a main research objective in the areas of data mining, natural language processing, relational databases, data virtualization, cyber crimes, crime analysis, information retrieval, data analysis, big data solutions, etc.
In order to appreciate and understand the state of the art of big data and big data analytics, the following section describes the classifications of big data applications and some of the popular applications, along with their implementations and solutions on different platforms.
VI. CLASSIFICATION OF BIG DATA APPLICATIONS

The topic of big data is becoming a current focus not only in the US but also in other countries around the globe, as it concerns the collection, analysis, manipulation and visualization of big data in real time. Big data from different applications can generally be classified into the following five major groups, in which the McKinsey Global Institute identified the potential of big data [1, 17]:
Healthcare: clinical decision support systems, individual analytics applied to patient profiles, personalized medicine, performance-based pricing for personnel, analysis of disease patterns, improving public health.
Public sector: creating transparency through accessible related data, discovering needs, improving performance, customizing actions for suitable products and services, decision making with automated systems to decrease risks, and innovating new products and services.

Retail: in-store behavior analysis, variety and price optimization, product placement design, performance improvement, labor input optimization, distribution and logistics optimization, web-based markets.
Manufacturing: improved demand forecasting, supply
chain planning, sales support, developed production
operations, web search based applications.
Personal location data: smart routing, geo-targeted advertising or emergency response, urban planning, new business models.
Business: Recently we have seen a new class of big data application emerging in the market: business applications. Some of these include: e-mails, images, searching, health records, weather data, sensors and mobile devices, logs, social networks, IRS audits, social security administration, satellite communications, news bulletins, online transactions, banking systems, astronomy, atmospheric science, genomics, biogeochemical and biological science and research, life sciences, medical records, scientific research, government, natural disaster and resource management, the private sector, military surveillance, financial services, retail, web logs, text, documents, photography, audio, video, click streams, search indexing, call detail records, point-of-sale (POS) information, radio frequency identification (RFID), mobile phones, sensor networks and telecommunications. Organizations in any industry that have big data can benefit from its careful analysis to gain insights and depth to solve real problems [1, 3, 20].
Other Applications: Other big data applications, such as banking, business informatics, meteorology, sports and medicine, use big data sets within data management and data mining. These applications, with big volumes of data, cannot be handled by traditional database management systems for storage and processing. Big data technology can be used to define frameworks that provide solutions for big data sets. An attempt has been made to use non-relational database architectures to implement horizontal scalability based on an optimized key-value format and agility. One such non-relational database is the NoSQL (Not Only SQL) database architecture [13, 24]. NoSQL databases also provide a structure for the storage of data, but these structures are less strict than relational schemas, which makes them suitable for some applications. It is important to know that NoSQL is in no way a replacement for a traditional RDBMS; it finds its applications in huge and complex data sets for which traditional relational database system tools may not be suitable.
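The key-value format and horizontal scalability mentioned above can be sketched with a toy sharded store: each key hashes to one of several shards, so shards can in principle live on different machines. This illustrates the general idea only, not the API of any real NoSQL product such as Cassandra or HBase; class, key and field names are invented:

```python
import hashlib

class ShardedKV:
    """Toy key-value store that spreads keys across shards by key hash."""

    def __init__(self, n_shards=4):
        self.shards = [{} for _ in range(n_shards)]

    def _shard(self, key):
        # Hash the key to pick a shard; in a real system each shard
        # would be a separate server, giving horizontal scalability.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key, default=None):
        return self._shard(key).get(key, default)

store = ShardedKV()
store.put("user:42", {"name": "Ann", "visits": 7})
print(store.get("user:42")["visits"])  # → 7
```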

VII. NEW IMPLEMENTATION SOLUTIONS OF BIG DATA APPLICATIONS

The volume of a big data set is usually above terabytes, and the data are usually unstructured. In the literature, the big data problem has been characterized as a combination of four Vs: Volume (from terabytes to zettabytes of existing data to process), Velocity (data in motion, streaming data, and processing them within a time frame of milliseconds or seconds), Variety (different formats and types of data, as well as defined structured formats of data), and Veracity (undefined formats of data, ambiguity in data, difficulty in interpreting the data, inconsistency) [1-7, 11, 15, 38].
In addition to these well-defined classes of big data applications, recent years have seen successful implementations, newer technologies and solutions for some popular multimedia big data applications in the real world:
A. Machine learning: Machine learning-based techniques have introduced new applications on iPhones that can predict the ethnicity, gender and age of users with a high degree of accuracy, as well as on-line prices. These applications benefit from improved data mining techniques, as these allow an easy presentation of the application to the users.
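As a rough illustration of the kind of attribute prediction described above, a toy nearest-neighbour classifier is sketched below. The features, labels and values are entirely invented, and the method is far simpler than what production systems would use:

```python
def predict(sample, training):
    """Return the label of the nearest training example (1-NN)."""
    def dist(a, b):
        # Squared Euclidean distance between feature vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(training, key=lambda t: dist(t[0], sample))[1]

# Invented features: (hours online per day, messages sent per day) -> age band.
training = [
    ((1.0, 5),  "age 50+"),
    ((6.0, 90), "age 18-25"),
    ((3.0, 30), "age 26-49"),
]
print(predict((5.5, 80), training))  # → age 18-25
```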
B. Google Now: In 2010, Google introduced an application known as Google Now, a voice-operated application dealing with our personal lives; it does not pretend to be a person but behaves with human-like
intelligence. It provides a lot of information, some of which may be unnecessary; e.g., the location of a place or business may not be needed for our application. Still, Google Now is a very powerful application, as it has taken all possible aspects of our lives into account to make life better and more comfortable [38].
C. Election campaign tool: Software developers introduced a new big data application that sifted, collated and combined different categories of information on each registered voter to discover patterns, which could then be used to target the voters most likely to respond with further fundraising, advertisements, personal meetings, etc. Based on this concept and on successes in other applications, similar tools can be developed for social issues such as education, health care, utility usage, crime statistics, etc. Other activities, such as phone calls and online searches, can also be used to define the patterns. Large companies have followed suit, such as IBM (redrawing bus routes in Ivory Coast) and Google (flu tracking software). Data mining techniques have also found use in social issues such as tutoring disadvantaged kids, helping retail chains forecast sales, and modeling customer behavior, with published articles showing their use in business applications [46, 53].
D. Social networking: One of the most successful applications of big data, dealing with its collection, analysis and virtualization, is social networks (Facebook, Twitter, Google+, LinkedIn, etc.). According to a statistical survey, over 955 million users with active accounts access Facebook in over 70 languages, have uploaded over 140 billion photos, and maintain over 125 billion friend connections. In 2012, The Human Face of Big Data carried out a global project centering on real-time collection, visualization and analysis of large amounts of data [1].
In addition to these applications, media project survey statistics [1-6] have been reported for some of the popular big data applications, as shown below:
Facebook has 955 million monthly active accounts using 70 languages, and handles the loading of over 140 billion photos, over 125 billion friend connections, access to over 30 billion pieces of content per day, and the posting of over 2.7 billion likes and comments [1].
On YouTube, an estimated 48 hours of video are uploaded every minute, and over 4 billion views are recorded every day [1].
Google monitors over 7.2 billion pages per day and processes over 20 petabytes (10^15 bytes) of data daily, providing these services in over 66 languages around the globe [7, 8, 11, 17, 22, 23, 38].

On Twitter, over one billion tweets are recorded every three days by more than 140 million active users, and over 571 new web sites are created every minute of the day [1-6, 46, 68].
Further, based on the current use of the Internet for these social networks and services, it is expected that the information from them will increase by a factor of about 50 within the next decade, while the number of information technology specialists handling this big amount of data will increase by only about 1.5 times [7, 38].

VIII. IMPLEMENTATION PHASES OF BIG DATA APPLICATIONS
In the previous sections, we introduced the state of the art of big data, big data analytics, classes of big data applications, and the implementation phases required for big data applications. One of the most critical and important phases of implementation is data processing. The following sections discuss the various tasks needed during data processing for big data application implementation. Further, we present the current trend of big data in the enterprise computing environment, and describe various frameworks and tools that have been introduced and are currently being used in those applications.

IX. BIG DATA PROCESSING

Appropriate and accurate solutions for any big data application depend heavily on how the data processing is implemented. There are a number of data processing tools that offer a number of options for implementing in-memory database analysis, optimization techniques, data cleansing, data forms, etc. The development of new algorithms will make these options automated, so they can be used to join and cleanse big data sets. Some of the options provided by these tools are described below.

X. DATA COLLECTION

This is the first step in data processing, where the data collected from different sources is expressed in different formats. Techniques are needed to present the collected data as consolidated data for analysis. One such technique is referred to as data integration, which presents the data (to the maximum extent possible) in a cohesive format. One of the major problems with big data is that it is growing at an exponential rate. Although cloud computing does provide a large amount of space and computing power, its main problems remain the Internet connections and their limitations on data transfer from sources to cloud computers, and across cloud computers, for the transmission, storage and processing of big data sets. Further, cloud computing does not support traditional databases, as its horizontal scalability does not offer any advantage or support to databases, including relational databases. There is a need for new techniques that will provide support to different types of databases for manipulating big data sets. Some of the material presented here has been partially derived from [7-13, 38].
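The data integration step described above can be sketched in miniature: records arriving in two different formats are normalized into one cohesive schema. The sources, field names and values below are invented for illustration:

```python
import csv
import io
import json

# Two invented sources for the same kind of record, in different formats.
csv_source = "id,temp\n1,21.5\n2,19.0\n"
json_source = '[{"id": 3, "temperature": 23.1}]'

# Normalize both into one cohesive schema: {"id": int, "temperature": float}.
records = []
for row in csv.DictReader(io.StringIO(csv_source)):
    records.append({"id": int(row["id"]), "temperature": float(row["temp"])})
for obj in json.loads(json_source):
    records.append({"id": obj["id"], "temperature": obj["temperature"]})

print(len(records), records[2]["temperature"])  # → 3 23.1
```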


XI. DATA CLEANSING

After collection, data cleansing (or cleaning) is
performed. This process recognizes any noise present in
the data, as well as any missing data in the big data set. It
uses different techniques to reduce the noise and, where
possible, eliminate unwanted data from the dataset. After
cleaning, the data may need to be transformed as final
preparation for analytics on the data set.
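The cleansing steps above (detecting missing fields, reducing noisy values) can be sketched in Python; the field names, clipping bounds and discard-versus-impute policy are illustrative assumptions, not a prescription from the paper:

```python
def cleanse(records, bounds=(0, 120)):
    """Drop records with missing fields and clip out-of-range (noisy) values."""
    lo, hi = bounds
    cleaned = []
    for rec in records:
        if rec.get("age") is None or rec.get("name") is None:
            continue  # missing data: discard (imputation is another policy choice)
        age = max(lo, min(hi, rec["age"]))  # reduce noise by clipping outliers
        cleaned.append({"name": rec["name"].strip(), "age": age})
    return cleaned

raw = [{"name": " ada ", "age": 36},
       {"name": "bob", "age": -5},   # noisy out-of-range value
       {"name": None, "age": 40}]    # missing field
```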
XII. DATA FORMS
Big data from different applications is received from
different sources and is usually represented in three
different forms: structured, semi-structured and
unstructured. Structured data is represented in a well-defined
format, already tagged and sorted, and is entered into a
data warehouse; this type of data can be analyzed easily.
Unstructured data, on the other hand, is random, has no
well-defined format, is not tagged and is not easily sorted,
and as such is difficult to analyze due to its randomness.
Semi-structured data does not conform to fixed fields but
contains tags to separate data elements. The volume of
such data is usually in the range of terabytes and petabytes [1].
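A small Python sketch of routing the three forms into tagged rows; the classification rules here are deliberate simplifications (real pipelines inspect schemas, not just JSON parsability):

```python
import json

def to_rows(payload):
    """Normalize structured (list of rows), semi-structured (tagged JSON text),
    and unstructured (free text) inputs into tagged rows for a warehouse."""
    if isinstance(payload, list):                # structured: already tabular
        return [("structured", row) for row in payload]
    try:
        doc = json.loads(payload)                # semi-structured: tagged fields
        return [("semi-structured", doc)]
    except (ValueError, TypeError):
        return [("unstructured", {"text": payload})]  # random free text
```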
XIII. DATA ANALYSIS
Data analysis deals with extracting and interpreting new
information or knowledge from the big data set. This
process of extracting new information allows designers to
define different policies for managing the data and rules for
predicting new information from the data set.
Different analytic methods and techniques have been
used on such data sets. These methods and techniques are
grouped into three categories: statistical analysis, data
mining and machine learning. Since we are dealing with
big data sets, the analysis techniques need to interpret and
extract meaningful information, and they have to be
automated, as manual techniques are very time consuming.
When data analytics techniques have to be executed on a
different platform, a number of problems arise: the new
platform may not i) accommodate such large volumes of
data or ii) support the needed analytic models. Further,
data loading may be too slow, and the new platform may
not support newer advanced analytics techniques or meet
the requirements.
The data analysis of a big data set requires a large
number of servers running massively parallel software.
This type of analysis technique in a distributed
environment can distinguish the big data set by classes
such as category, size of data, velocity and application, and
the analysis will provide detailed insights into the data, its
structure and its interpretation.

A number of new conceptual analysis techniques for the
derivation of information have been introduced, such as
content-based image retrieval, the Semantic Web (SW) and
others.
Content-based image retrieval: It is based on the visual
content of images. The derivation of information through
retrieval offers various retrieval capabilities over text and
images and different attributes like form, content and
structure. This method also provides measures for accurate
information retrieval.
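A toy Python illustration of the idea behind content-based retrieval: images are compared by their visual content (here, a normalized intensity histogram) rather than by text metadata. The histogram-intersection measure is one simple choice for the sketch, not the method the paper prescribes:

```python
def histogram(image, bins=4, max_val=255):
    """Normalized intensity histogram of a grayscale image (2D list of ints)."""
    counts = [0] * bins
    pixels = [p for row in image for p in row]
    for p in pixels:
        counts[min(p * bins // (max_val + 1), bins - 1)] += 1
    return [c / len(pixels) for c in counts]

def similarity(img_a, img_b):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    ha, hb = histogram(img_a), histogram(img_b)
    return sum(min(a, b) for a, b in zip(ha, hb))

dark = [[10, 20], [30, 40]]          # all intensities in the lowest bin
light = [[200, 210], [220, 230]]     # all intensities in the highest bin
```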
Semantic Web (SW): It offers intelligent retrieval of
information during the data analysis. The information
representation by web semantics allows its use in a variety
of applications like display, automation, integration, and
reuse for computers. It allows the representation and
exchange of information in a meaningful way. To retrieve
information from documents, a number of techniques exist,
but in general these are not advanced enough to exploit the
semantic knowledge within documents and return the
desired accurate information. It appears that future web
services will be defined as a combination of text documents
and semantic markup.
Semantic Web (SW) uses Semantic Web documents
(SWDs) which must be combined with Web based
Indexing. To use these techniques, a normal user has to be
aware of all the tools. To overcome these problems, a new
concept of Ontology in Semantic Web has been introduced
that represents various languages that are used for building
software and increases accuracy. It describes basic
concepts in a domain and defines relations among them and
together with a set of individual instances of classes
constitutes a knowledge base. Recent years have seen a
new application of Semantic Web in open, distributed and
heterogeneous Web environments, and for sharing the
knowledge in the semantic web [7-9, 16-20, 38, 40].
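The Semantic Web's core data model can be illustrated with subject-predicate-object triples; this tiny Python pattern matcher is a sketch of how a knowledge base supports intelligent retrieval (the facts in `kb` are invented examples):

```python
def match(triples, s=None, p=None, o=None):
    """Query a toy RDF-style triple store; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# A miniature knowledge base of (subject, predicate, object) facts.
kb = [("hadoop", "isA", "framework"),
      ("hive", "runsOn", "hadoop"),
      ("pig", "runsOn", "hadoop")]
```

An ontology adds class definitions and relations on top of such triples; full SW stacks use SPARQL over RDF rather than this hand-rolled matcher.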

Data Storage, Manipulation and Handling

Since big data sets are transmitted over the internet, we
need to make sure that the data is communicated over a
secured environment. Although we may use cloud
computing technology for storing and processing the data,
we have to make sure that the performance of big data
processing is not affected by the additional overhead of
secured communication. Further, we need to make sure
that the data itself is secured, as tampering may introduce
new types of data into the original dataset. During
transmission over the internet, the data may be
compromised and we may lose important data from the
original data set. For example, an attack on MapReduce
(an open source framework for big data solutions,
discussed below) could be a malicious mapper that
accesses sensitive data and modifies the result. Unlike
most RDBMSs, NoSQL security largely relies on
mechanisms outside of the database system. Research into
the types of attacks that are possible on these new systems
would be beneficial [12-13, 16, 24].

Storage and analytic techniques

There exist different types of storage and analytics
techniques that have been used for a variety of data
formats, e.g., structured, semi-structured, complex, event
and unstructured data (as discussed above). One of the
problems with the storage and manipulation of big data
sets on cloud computing over the internet is the
transmission of the dataset in a secured environment and
also the security of data over
the internet. A number of approaches for sending big data
sets over the internet with an optimal bandwidth
requirement have been suggested and implemented, e.g.,
compression, data deduplication, caching, protocol
optimization, etc. It is important to note that bandwidth
optimization based on compression offers advantages only
for certain types of data, e.g., plain text, while it may not
be beneficial on encrypted data. Further, the benefit of
compression is available only for homogeneous and
susceptible types of data.
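The point about compression helping plain text but not encrypted data can be demonstrated in a few lines of Python, using random bytes as a stand-in for encrypted content:

```python
import os
import zlib

def ratio(data):
    """Compressed size divided by original size: lower is better."""
    return len(zlib.compress(data)) / len(data)

plain = b"big data " * 1000   # homogeneous plain text: compresses very well
random_like = os.urandom(9000)  # stands in for encrypted data: incompressible
```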
Data deduplication, which reduces the size of data
transmission over the internet, deals with the data at the
file and block levels. When duplicates occur, they are sent
as pointers to one copy of the data, thus avoiding the
transmission of multiple copies of the same data set. Some
researchers also call this method redundancy elimination.
The other technique deals with protocol optimization at the
transport level, where one port for each of the transport
layer protocols (TCP and UDP) is dedicated for session
control [1, 4, 11, 12, 23, 45, 70].
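A block-level sketch of this redundancy elimination in Python: each unique block is stored once, and repeats are replaced by short hash pointers (the SHA-256 choice and the stream format are illustrative assumptions):

```python
import hashlib

def deduplicate(blocks):
    """Block-level redundancy elimination: transmit each unique block once,
    and a short hash pointer for every repeat."""
    store, stream = {}, []
    for block in blocks:
        digest = hashlib.sha256(block).hexdigest()
        if digest in store:
            stream.append(("ref", digest))          # pointer to the stored copy
        else:
            store[digest] = block
            stream.append(("data", digest, block))  # first occurrence: send data
    return store, stream

blocks = [b"header", b"payload", b"header"]  # third block duplicates the first
store, stream = deduplicate(blocks)
```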

Handling and storage:

The handling and storage of big data has also changed
the architecture of storage systems, which have recently
focused on implementing highly scalable and flexible
features to handle big data sets in an effective and efficient
manner. The storage of big data sets should provide
reliable, efficient methods of storage and retrieval. The
Google File System (GFS) is one such storage system; it is
based on clusters and stores data as blocks of 64 MB (its
smallest unit) across the nodes of a cluster [46].

- Identify and implement the proposed big data platform
along with best practices and ethics.
- Implement and run big data analytics applications that
will offer possible new technologies and applications.

Our needs: on-premise, cloud, or hybrid

Big data and its associated operations like collection,
storage, analysis and manipulation have become an integral
part of any business, and companies deal with different
types of services. These services in one way or the other
deal with different types of data in different formats in a
variety of applications. One of the major concerns with a
big data set is to develop suitable analytic techniques for
managing it after defining its solution.
A new concept based on in-memory databases has been
introduced that helps enhance the speed of analytic
processing. Many businesses have already started using
this concept when applying big data analytics in an
enterprise computing environment. It forces all the records,
attributes and transactions from different systems to reside
in the same in-memory database, which requires another
product tool that manages, secures and integrates the data
in the database. A number of frameworks for implementing
big data analytics in enterprise computing have been
introduced. This new trend in big data analytics finds its
application in enterprise computing, and some of the big
data applications in the enterprise computing environment
are discussed below.
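The in-memory idea can be sketched with Python's standard sqlite3 module, whose `":memory:"` mode keeps the whole database in RAM so the analytic path never touches disk (the sales table and figures are illustrative, not from the paper):

```python
import sqlite3

# All records live in RAM; no disk I/O on the analytic path.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])
# Aggregate analytics run directly against the in-memory table.
totals = dict(con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))
```

Products such as SAP HANA apply the same principle at enterprise scale, with columnar storage and parallel query execution on top.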

Big Data in Enterprise Computing Environment

It has been observed that, after going through the
collection, storage and analysis of data, companies can
obtain benefits such as targeted marketing, detailed
strategies for business insights, client-based service
offerings via segmentation, determination of sales, market
needs and opportunities, risk analysis, etc. These benefits
are available only when the data analytics is implemented
properly. Further, a lack of experienced analysts, cost, a
lack of database software for analytics, and the difficulty of
designing the analytic system may prevent these benefits
from being realized.

Many enterprise computing environments use an
in-memory platform or framework as a service; one based
on SAP HANA can allow building innovative applications
with improved productivity in handling and managing big
data sets of very large volume. The data analysis of big
data may provide insights into the data sets in such a way
that they can be used to grow the business and the
marketing of products. Big data analytics is becoming a
new technology that has generated interest in the enterprise
arena and as such has to support a number of architectures
that have been developed for those applications [12, 53].

XIV. IMPLEMENTING AND SOLVING BIG DATA APPLICATIONS

The following steps are needed to obtain a solution for
any application with a big data set. The proposed solution
tries to extract the maximum value from big data and
business analytics; this requires transforming the IT
infrastructure and implementing big data technologies that
allow us to capture, store, and leverage data-driven insights
in real time.

- Define applications of big data in line with the
organization's policy and priorities.
- Design a detailed plan for future growth in big data and
possible applications that may help the long-term goals of
the organization.

XV. OPEN SOURCE FRAMEWORKS/TOOLS FOR BIG DATA SOLUTIONS

There exist a number of open source frameworks/tools
to solve big data applications, such as MapReduce,
JobTracker, Hadoop, High Performance Computing
Clusters (HPCC) and many others. A brief description of
some of the popular frameworks/tools is given below. For
more details, please refer to [1-3, 12, 16, 18, 25-36, 38, 46-48].

A. MapReduce
MapReduce is a programming tool for distributed
computing and was created by Google. This framework
uses the divide-and-conquer method to divide a data
problem into small sub-data-set processes and execute

these processes on different processors in a distributed
environment. It is used to solve big data sets and can be
implemented in two stages [18, 28, 46-48].
The MapReduce tool is used for processing big data
sets across the nodes of a cluster of the Google File System
(GFS) [46]. It uses the distributed architecture of GFS to
allocate and schedule the data sets to nodes and to transfer
data across the nodes. The functions for processing big
data sets offered by MapReduce, such as replication and
storage, have been included in the implementation of the
Hadoop framework, the most popular model, which
includes the MapReduce engine, the Hadoop Distributed
File System (HDFS) and utilities of other Hadoop modules.
The HDFS file system is highly fault-tolerant and stores
data on the clusters of the framework. Google introduced
another data storage system known as BigTable, which has
been adopted by the Hadoop framework (to be discussed
below). SQL, MapReduce, in-memory, stream processing,
graph analytics and other associated tools help Hadoop
create more business and enterprise data processing
applications.
The proposed big data solution based on MapReduce is
usually implemented in two stages, as described below:
In the first stage, the master node divides the collected
data into a number of smaller processes. A worker node is
chosen to execute some of these processes (based on
scheduling) under the control of the JobTracker component
of the framework. The results of this execution are stored
in the local file system, where they can be accessed by the
reducer component.
In the second stage, the scheduled data is analyzed and
the input data from the first stage is merged. There can be
multiple reduce tasks to parallelize the aggregation, and
these tasks are executed on the worker nodes under the
control of the JobTracker.
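The two stages above can be sketched with the canonical word-count example in Python; the explicit shuffle step stands in for the grouping the framework performs between the map and reduce stages:

```python
from collections import defaultdict

def map_phase(document):
    """Mapper: emit (word, 1) pairs; one mapper runs per input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between stages."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reducer: aggregate each key's values; reducers run in parallel per key range."""
    return {word: sum(counts) for word, counts in groups.items()}

splits = ["big data big", "data analytics"]
pairs = [pair for doc in splits for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```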

B. Hadoop
The Apache Software Foundation introduced a tool
called Apache Hadoop (an open source data computing
framework) that uses a number of modules and provides
solutions to handle, manage and process big data sets. This
framework and set of tools for processing large data sets
was originally designed to manage clusters of physical
machines. Now we see heavy use of this framework in the
cloud, as in Amazon's Redshift hosted BI data warehouse,
Google's BigQuery data service, IBM's Bluemix cloud
platform and Amazon's Kinesis data processing service.
It is based on BigTable, the data storage system
introduced by Google. Hadoop is a Java-based framework
designed as a heterogeneous open source platform. Various
features of this framework include a distributed file system,
analytics and data storage platforms, a layer to manage
parallel computation, workflow and configuration
administration, and many others needed to solve big data
sets. The Hadoop Distributed File System runs across the
nodes in a Hadoop cluster and provides connectivity to all
the input and output nodes with a view to creating one big
file system. Some of the material described below is
derived from [1, 16, 18, 20, 25-26, 36-37, 42, 46].
Hadoop framework offers solution based on batch
processing concept for handling, managing and processing
big data set. It may not provide appropriate solution for
real-time ad hoc querying management, but has become a
common solution for processing large amounts of data.
Modules such as Pig and Hive along with Hadoop
MapReduce provide querying management. Some efforts
have already been made to provide solutions for real-time
ad hoc querying management over large scale big data set.
The querying system is based on SQL for implementing
queries on the Hadoop system. Other possible solutions
(based on Hadoop) using relational databases, built on
scalability and distributed relational systems, have been
developed that analyze the data set and interpret useful
information from it.
The volume of data is continuously increasing in all
applications at an exponential rate and it is becoming a big
challenge to handle the data and also develop appropriate
solutions [25, 26]. The Hadoop framework model has
become very popular tool for managing social networking
environment applications over the Internet. Nearly all
social networking applications (Facebook, Twitter,
LinkedIn, etc.) deal with huge amounts of data from their
users. The data come in different forms and need to be
presented to users in a very simple and friendly manner. In
spite of these features, real-time analysis and predictive
analysis on Hadoop have been seen to take a significant
amount of time. The SQL query tool Spark SQL appears to
offer fast interactive querying with streaming capabilities,
and such SQL-like querying tools have opened the door for
Hadoop to be used in enterprise computing applications.
A number of new open source modules interfacing at
application layer of Hadoop model have been developed to
implement scalable and distributed computing environment
including: database (HBase and Cassandra), querying
(Hive and Pig), coordination services (ZooKeeper) [33-34].
Various functional module programs offering different
services to be used on the Hadoop framework have
recently been introduced. Some of the popular service
applications include: HDFS, MapReduce, Pig, Hive,
JAQL, HBase, Flume, Sqoop, Oozie, ZooKeeper, YARN,
Mahout, Ambari, Spark, Whirr, Hue, Lucene, Chukwa,
Hama, Cassandra, Impala, etc. [37]. Each module provides
a specific functionality and is used with Hadoop to
implement a specific aspect of big data, from the
collection, storage, administration, query management and
interpretation of big data sets to solving them in different
clusters across the distributed system over the internet.
The following is a brief description of some of these
modules and their services; each of these modules operates
at the top layer of the Hadoop model.
MapReduce: This module provides a powerful parallel
programming technique for distributed processing on
clusters of the framework [1,28]
Apache Hive: This module provides a SQL like
interface and relational model as an application on the
framework for storing and retrieving the data. A data

warehousing system used with Hadoop for querying,
summarization, and analysis of data stored in the Hadoop
model. Queries are expressed in the SQL-like Hive
Querying Language (HiveQL). A compiler translates
HiveQL into a set of MapReduce jobs that are executed on
the Hadoop system. In other words, it provides a means to
perform data manipulations with high-level HiveQL,
without having to write the more complex map and reduce
functions that are harder to maintain and reuse.
Hive arranges data into tables, partitions and buckets. A
table is implemented as rows and columns and contains the
data. A table can have multiple partitions, referred to by
column values, and the buckets within partitions are
divided by the hash of a column. Further, users can
influence the optimization by providing hints through
HiveQL [31-32].
Pig
This module provides a high-level data processing
system for analyzing data sets. Apache Pig has been used
as a tool for the analysis of big data sets. It uses a
high-level language (known as Pig Latin) that is compiled
into MapReduce programs executed on Hadoop. Pig also
allows extending its language with User Defined
Functions, which can be written in Java, JavaScript, or
Python. One of the important features of this application is
that it allows easy querying and analysis with reduced
writing of map functions [1, 29].
HBase
This module creates a scalable and distributed database
for random read/write access to the data stored on clusters.
The HBase model is a distributed, column-oriented NoSQL
database that operates on top of the Hadoop Distributed
File System. It follows Google's BigTable design and
provides a distributed data store that is highly scalable with
consistent reads and writes. Data is stored as indexed
StoreFiles on HDFS and is fault tolerant [1, 32].
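The BigTable/HBase-style layout (table → row key → column family → column → timestamped value) can be sketched with plain Python dictionaries; the class and field names here are illustrative, not HBase's actual API:

```python
import time

class ColumnStore:
    """Toy BigTable-style store: row key -> column family -> column -> (value, ts)."""

    def __init__(self):
        self.rows = {}

    def put(self, row, family, column, value, ts=None):
        # Nested dicts stand in for the sorted, sparse on-disk layout.
        cell = self.rows.setdefault(row, {}).setdefault(family, {})
        cell[column] = (value, ts if ts is not None else time.time())

    def get(self, row, family, column):
        value, _ts = self.rows[row][family][column]
        return value

store = ColumnStore()
store.put("user1", "info", "name", "ada")
store.put("user1", "info", "name", "ada lovelace")  # newer write wins
```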
ZooKeeper
This module supports a centralized service for
providing distributed synchronization and group services
across clusters. It offers coordination services that support
synchronization throughout a Hadoop cluster. It achieves
this by defining objects containing information and
namespaces in memory; this information is kept across
distributed ZooKeeper servers, and client applications
running over a Hadoop cluster can retrieve it from these
distributed servers. One of the advantages of this model is
its ability to support synchronization across the Hadoop
cluster system [31-32].

Cassandra
This module is a distributed column-oriented database
and was originally developed by Facebook. In this
database model, a column includes a name, value and
timestamp, while a row contains multiple columns, column
families contain rows, and keyspaces contain column
families. Column families are stored in separate files. Data
is separated into partitions across the nodes in the
distributed database, and each node has a random position
on a hash ring. For more details on the discussion below,
please refer to [37].

Voldemort
This module is a database program that is a highly
scalable, distributed, key-value data store and was
originally developed by LinkedIn. In this model the data is
automatically replicated and partitioned among the nodes
in the distributed system. Each node is independent of the
others, so there is no single point of failure. Read and write
access is limited to key-value access; as such, it supports
only three types of queries: get, put and delete. Based on
this simplicity, it offers predictable query performance.

Sqoop: This module allows the transfer of data between
relational databases and Hadoop [1].

Avro: This module supports the serialization of data for
its processing.

Oozie: This module defines a systematic workflow for
dependent Hadoop jobs [1].

Chukwa: This module provides support, as a Hadoop
subproject, for data storage and accumulation for managing
and monitoring distributed systems of clusters [1].

Flume: This module provides a mechanism for reliable,
distributed streaming log collection of data across
clusters [1].

XVI. HIGH PERFORMANCE COMPUTING CLUSTER (HPCC)

Another open source software framework/tool for
solving big data sets has been introduced as High
Performance Computing Cluster (HPCC) Systems, which
supports a distributed, data-intensive open source
computing platform. It allows users to define models and
provides big data workflow management services. A
high-level programming language, Enterprise Control
Language (ECL), provides the development environment
and supports the easy description of complex problems;
the framework ensures the execution of ECL within a
minimal elapsed time and configures all nodes to execute
in parallel. Furthermore, the HPCC platform does not
require third-party tools like GreenPlum, Cassandra,
RDBMS, Oozie, etc. [1, 22].


The HPCC framework consists of the following three
application modules [2]:
HPCC Data Refinery (Thor): This module defines a
massively parallel ETL engine that enables data integration
at scale and provides batch-oriented data manipulation.
HPCC Data Delivery Engine (Roxie): This module
allows efficient multi-user retrieval of data and structured
query responses. It is massively parallel and offers high
throughput and ultra-fast, low-latency operation.
Enterprise Control Language (ECL): This module is
responsible for providing automatic distribution of
workload between nodes, with support for automatic
synchronization of algorithms. It contains an extensible
machine learning library and is a simple programming
language optimized for big data operations and query
transactions [1].
Out of these frameworks/tools discussed above, HPCC
and Hadoop seem to be more popular and are being used
for implementing a variety of applications of big data sets.
However, there are some differences between them in
terms of architecture and stacks. The following sections
will highlight some major differences between all three
frameworks/tools. For details, readers are referred to [22].

XVII. COMPARISONS OF THREE OPEN SOURCE FRAMEWORKS (MAPREDUCE, HADOOP AND HPCC)
An attempt has been made to compare these
frameworks in terms of architecture and stacks. Below is a
brief summary of that attempt. For more details, please
refer to [2]:
The clusters in HPCC are operated by the Thor and
Roxie modules, while in Hadoop the clusters are
implemented by the MapReduce application program.
HPCC is based on ECL as its primary language, while
the MapReduce application program is based on the Java
language.
As stated above, HBase is based on a column-oriented
concept and is supported by Hadoop, while the HPCC
platform builds multi-key and multivariate indexes on its
Distributed File System.
The Hive application program of Hadoop provides a
data warehouse and allows the data to be loaded into
HDFS. In HPCC, the data warehouse and loading are based
on structural queries and analyzer application programs.
For a larger hardware configuration with a large
number of nodes, the HPCC framework is faster than
Hadoop and takes very little processing time.
XVIII. HADOOP-BASED SOLUTIONS FOR BIG DATA IN
SOCIAL NETWORKING ENVIRONMENT
The following section describes a list of social networks
that provide solutions to big data sets using the Hadoop
model for storing, manipulating and analyzing big data.
For detailed discussions, please refer to [7, 38-44]. The
state of the art in social networks and data mining can be
found in [68].

A. Facebook
This social network hosts the largest Hadoop cluster by
volume, consisting of a total of over 4400 nodes and 100+
petabytes of data. Its stack consists of five modules used
on the big data set: the Hadoop Core, a log data collector
called Scribe, Hive, a UI for querying with Hive called
HiPal, and an automation framework called NoCron.
Based on its needs and other requirements, the network
has defined its own configurations on the Hadoop
framework. The underlying HDFS uses Federated HDFS,
and its redundancy is implemented using RAID
technology. Facebook also uses Hive to simplify its
analysts' interaction with Hadoop; roughly 90% of their
MapReduce jobs are built on Hive [7, 40, 44].

B. Twitter
Another social network that needs a lot of storage and
processing of big data also uses the Hadoop framework.
All the data in this network is stored in the Hadoop
Distributed File System using LZO (Lempel-Ziv-Oberhumer)
compression. Further, Twitter uses Google's Protocol
Buffers, which are supported by the Hadoop framework, to
efficiently read and write data into its cluster through data
serialization with the generated code they provide. It uses
the Scalding framework to provide a simpler way of
creating MapReduce jobs, much like Hive and Pig
[7, 41-42, 44].

C. LinkedIn
In this social network, Hadoop is used to provide
predictive analytics and querying for features like People
You May Know (PYMK) and Endorsements. Over a
billion LinkedIn relationships are processed each day to
compute the People You May Know recommendations.
Hadoop also drives their engagement emails, presenting a
user's profile views and their associations in professional
circles. LinkedIn adopted Apache Pig to avoid writing
complex MapReduce programs, and developed a Hadoop
log aggregator and dashboard called White Elephant that
supports visualizing utilization across the users in a cluster,
allowing users to understand these features and make
better use of them over the Distributed File System [7, 43].

XIX. PROBLEMS AND CHALLENGES IN HADOOP-BASED IMPLEMENTATION

Although the above section discussed how various
social networks have used Hadoop to provide solutions to
big data sets, there still seem to be some unanswered
problems and challenges that need to be discussed, and
answering them requires the development of new solutions
and research. Meanwhile, the big data being produced is
continually growing in quantity. The transmission of this
ever-growing big data over the internet may become a
bottleneck, taking more time than the actual processing.
One solution could be to implement techniques that
process the data at its storage location instead of retrieving
it over the internet for data analysis. Alternatively, an
efficient method of
defining quality data in such a way that only a subset of the
data is used for processing could be applied. It is difficult
to identify such a subset from a big amount of data, and the
subset may not represent the big data set well enough to
predict its quality.
One of the leading sports organizations, the National
Football League (NFL), has adopted a cloud computing
environment (based on Hadoop) that delivers state-of-the-art
big data experiences for its Fantasy Football league
schedule and time frame. Members associated with this
franchise are able to analyze and compare players and
predict the outcomes of matches, wins and losses, finals,
etc.
The above sections described various frameworks/tools
that have been used in some real-world projects. Although
the success stories of these projects support the use of
these tools for such applications, their implementers and
developers still face a number of challenges: a lack of
uniform standard formats, a lack of formal methods, a lack
of clear representation and analysis of data for providing
useful interpretations to users, a lack of tools that support
predictable behavior of data, a lack of newer user-friendly
representations for easy understanding, etc. In order to
address some of these difficulties and challenges, recent
years have seen a new trend of redefining data mining
techniques for multimedia data analysis and making use of
virtualization for accurate representation and extraction of
useful information from processed data.
The following two sections describe the current state of
the art of data mining techniques and virtualization and
present how the entire big data and big data analytics
technology has taken a new turn in providing efficient
solutions to big data applications in a variety of disciplines.

XX. IMPLEMENTATION OF NEW DATA MINING TECHNIQUES BY BIG DATA ANALYTICS
A. Basic concepts and definitions of data mining
Data mining offers a systematic approach for
understanding and analyzing the big sets of data with a
view to obtain useful information in a highly readable
format and friendly environment. This technique is heavily
based on predictive analysis concept that includes the
concrete assessment of the complex data so that suitable
analysis technique can be identified for deriving useful
information from the set of data. This concept finds its use
in a variety of applications that contain large data sets
such as NASA weather data set, social networks data sets,
data communications via mobile devices, bioinformatics,
sensors, stock markets, manufacturing of embedded
systems, world wide web, etc. Nearly all the applications
are using the Internet for design, data collection, data analysis, dissemination, etc., and as such are expected to provide a secured communication environment over it. New
trends of using data mining techniques for social networks
can be found in [68].
As stated above, big data deals with large-volume, complex, growing data sets with multiple formats and sources. In recent years, we have seen multiple areas where big data has expanded its use in Science, Technology, Engineering, and Mathematics (STEM) and in other sciences like the physical, biological, biomedical, and other
scientific areas. As mentioned above, since big data technology is capable of implementing data mining, virtualization, text mining, and optimization, the newer data mining techniques will provide a new perspective, with new characteristics and features that introduce a new model. This model offers features like: a data-driven approach that involves aggregation of information sources on demand; mining and analysis of users' interests; and security, privacy, and integrity of data.
From the applications discussed above, it is quite obvious that big data technology coupled with data mining techniques has already played an important role in the five known application areas. Based on its usefulness in managing big data, an attempt has been made to apply these concepts to social networks, as nearly all social networks are dealing with an explosion of large amounts of data in different
formats. Social network analysis focuses on understanding
of user intelligence for defining advertising strategies,
marketing strategies, capacity planning, customer behavior,
shopping pattern, targeted customers, and other behavior
profile database for marketing and advertising. Based on
this information, companies and industries use optimization
techniques so that needed and useful contents from this
information can be used on big data engine.
Some of the companies like Google and Amazon have published interesting results of their work based on the underlying framework, and many other companies have developed similar frameworks as open source software, such as Lucene, Solr, Hadoop, and HBase. Facebook, Twitter, and LinkedIn have contributed further open source projects for big data, such as Cassandra, Hive, Pig, Voldemort, Storm, and IndexTank. In addition, predictive analytics on traffic flows and identifying threats from different video, audio, and data feeds are some of the advantages useful for big data set analysis. In order to understand how
data mining techniques can play an important roles in
implementing different categories of big data applications
with more emphasis on social networks, the following
section describes different data mining techniques along
with their features and limitations. For more discussions on
these topics, please refer to [1, 7-9, 11, 12, 17, 20, 22, 36,
49-50, 68-69].
The following section describes some of the well-established data mining techniques, beginning with mining sequence data, that have been used for big data analytics over the past few years. With the advent of multimedia big data, these techniques, along with other derived data mining techniques, have been used extensively for solving big data set applications. We will also describe a few big data applications that have been implemented using data mining techniques. From the data mining-based applications, it is clear that data mining techniques are playing a very important and crucial role in the implementation of big data applications. Further, it will be interesting to watch how these techniques introduce new applications in the future. For more details, readers are referred to [49, 50, 68], from which some of the material presented here is derived.

26 | International Journal of Computer Systems, ISSN-(2394-1065), Vol. 03, Issue 01, January, 2016

Gurdeep S Hura et al

Emergent Trends and Challenges in Big Data Analytics, Data Mining, Virtualization and Cyber Crimes: An Integrated
Global Perspective- I

B. Data mining Techniques
i) Mining Sequence Data
Sequence data may be defined as an ordered list of events and is usually classified, based on the behavior and characteristics of the events, as: time-series, symbolic, and biological data sequences. The following paragraphs describe each of these in brief.
Time-series sequence data consists of long sequences of numeric data, recorded at equal time intervals (e.g., per minute, per hour, or per
day). This data sequence can be generated by many natural
and economic processes such as stock markets, and
scientific, medical, or natural observations.
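As an illustrative sketch (not drawn from the paper itself), mining a time-series sequence often begins by smoothing the equally spaced readings to expose the underlying trend. The Python fragment below applies a simple moving average to a hypothetical daily price series; the data values are invented for illustration.

```python
# Sketch: smoothing an equally-spaced numeric sequence with a moving average
# to expose the underlying trend. The price series is hypothetical.

def moving_average(series, window):
    """Return the moving average of `series` over a sliding window."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    result = []
    for i in range(len(series) - window + 1):
        result.append(sum(series[i:i + window]) / window)
    return result

prices = [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 13.0]   # one reading per day
trend = moving_average(prices, window=3)
print(trend)  # smoothed series with len(prices) - window + 1 points
```

A larger window smooths more aggressively at the cost of losing short-term detail, which is the usual trade-off in time-series trend analysis.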
Symbolic sequence data may be defined as long
sequences of event or nominal data that may not be
recorded or observed at equal time intervals and lapses
between recorded events may not be important. A few
examples of this data sequence include: customer shopping
sequences and web click streams, sequences of events in
science and engineering and in natural and social
developments.
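As a minimal sketch of mining symbolic sequences such as web click streams, the fragment below counts consecutive event pairs across sequences and keeps the frequent ones; the page names and support threshold are made up for illustration.

```python
# Sketch: finding frequent consecutive-event pairs in symbolic sequences
# (e.g., web click streams). Page names are hypothetical.

from collections import Counter

def frequent_pairs(sequences, min_support):
    """Count consecutive event pairs across sequences; keep the frequent ones."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

clicks = [
    ["home", "search", "product", "cart"],
    ["home", "search", "product"],
    ["home", "product", "cart"],
]
print(frequent_pairs(clicks, min_support=2))
# {('home', 'search'): 2, ('search', 'product'): 2, ('product', 'cart'): 2}
```

Real sequential-pattern miners (e.g., GSP or PrefixSpan) generalize this idea to non-consecutive subsequences of arbitrary length.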
Biological sequence data may be defined as very long sequences of data that carry important, complicated information that is hidden in their semantic meaning. Examples of this sequence data include: DNA,
protein sequences and other medical-based sequence data.
ii) Mining Graphs and Networks
Graphs represent a more general class of structures than
sets, sequences, lattices, and trees. There is a broad range
of graph applications on the Web and in social networks,
information networks, biological networks, bioinformatics,
chemical informatics, computer vision, and multimedia and
text retrieval. Hence, graph and network mining have
become increasingly important and heavily researched.
Based on the above concepts, the following graph applications have been introduced to deal with big data sets:
graph pattern mining; statistical modeling of networks; data
cleaning, integration, and validation by network analysis;
clustering and classification of graphs and homogeneous
networks; clustering, ranking, and classification of
heterogeneous networks; role discovery and link prediction
in information networks; similarity search and OLAP in
information networks; evolution of information networks
and other related to graphs and networks. Various
frameworks like Hadoop have successfully implemented
this technique for these applications.
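As an illustrative sketch of graph pattern mining (not taken from the frameworks cited above), the fragment below counts triangles, one of the most basic patterns mined in social and information networks; the edge list is hypothetical.

```python
# Sketch: counting triangles in an undirected graph, a basic graph-pattern-mining
# primitive used in social network analysis. The edge list is invented.

def count_triangles(edges):
    """Count triangles by intersecting the neighbor sets of each edge's endpoints."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:
        total += len(adj[u] & adj[v])   # common neighbors close a triangle
    return total // 3                    # each triangle is counted once per edge

edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_triangles(edges))  # 1 (the triangle a-b-c)
```

Distributed frameworks such as Hadoop apply the same neighbor-intersection idea at scale by partitioning the edge list across workers.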
iii) Mining Other Kinds of Data
In addition to sequences and graphs, there are many
other kinds of semi-structured or unstructured data, such as
spatiotemporal, multimedia, and hypertext data that carry
various kinds of semantics and are either stored in or
dynamically streamed through a system, and call for
specialized data mining methodologies. These types of data
find interesting applications in cyber related multimedia
data. Other applications include: spatial data,
spatiotemporal data, cyber-physical system data,
multimedia data, text data, web data, and data streams, and
other types of data mining.
iv) Mining Spatial Data
Spatial data usually refers to geo-space-related data that
are stored in geospatial data repositories and spatial data
mining can be performed on spatial data warehouses,
spatial databases, and other geospatial data repositories.
The spatial data can be in many forms or formats such as
vector, raster, imagery and geo-referenced multimedia. The
spatial data mining methodology identifies patterns,
locations and knowledge from spatial data. Recently, we
have seen big interests in large geographic data warehouses
that have been constructed by integrating thematic and
geographically referenced data from multiple sources.
From these, we can construct spatial data cubes that contain
spatial dimensions and measures, and support spatial Online Analytic Processing (OLAP) for multidimensional
spatial data analysis. Popular topics on geographic
knowledge discovery and spatial data mining include
mining spatial associations and co-location patterns, spatial
clustering, spatial classification, spatial modeling, and
spatial trend and outlier analysis.
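A minimal sketch of spatial clustering, assuming made-up point coordinates and a simple grid-based density rule (real spatial data mining systems use far richer methods such as DBSCAN over geospatial indexes):

```python
# Sketch: grid-based spatial clustering. Geo-points are bucketed into grid cells
# and dense cells are reported as clusters. Coordinates are invented.

from collections import defaultdict

def dense_cells(points, cell_size, min_points):
    """Bucket (x, y) points into grid cells; return cells holding >= min_points."""
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    return {cell: pts for cell, pts in cells.items() if len(pts) >= min_points}

points = [(0.1, 0.2), (0.4, 0.3), (0.2, 0.9), (5.1, 5.2), (5.3, 5.4), (5.2, 5.9)]
clusters = dense_cells(points, cell_size=1.0, min_points=3)
print(sorted(clusters))  # [(0, 0), (5, 5)]
```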
v) Mining Spatiotemporal Data and Moving Objects
Spatiotemporal data are data that relate to both space
and time. Spatiotemporal data mining refers to the process
of discovering patterns, locations and knowledge from
spatiotemporal data.
Spatiotemporal data mining has become increasingly
important and has far-reaching implications, given the
popularity of mobile phones, GPS devices, Internet-based
map services, weather services, and digital Earth, as well as
satellite, RFID, sensor, wireless, and video technologies.
Typical examples of spatiotemporal data mining include:
discovering the evolutionary history of cities and lands,
uncovering weather patterns, predicting earthquakes and
hurricanes, and determining global warming trends. For
example, animal scientists attach telemetry equipment on
wildlife to analyze ecological behavior, mobility managers
embed GPS in cars to better monitor and guide vehicles,
and meteorologists use weather satellites and radars to
observe hurricanes.
Among many kinds of spatiotemporal data, moving-object data (i.e., data about moving objects) are especially
important. Some of the examples based on this data
application include: mining movement patterns of multiple
moving objects (i.e., the discovery of relationships among
multiple moving objects such as moving clusters, leaders
and followers, merge, convoy, swarm, and pincer, as well
as other collective movement patterns). Another form of spatiotemporal data that is becoming popular is massive-scale moving-object data, which are becoming rich, complex, and ubiquitous and have found application in mining periodic patterns for one or a set of moving objects, and in mining trajectory patterns, clusters, models, and outliers.
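The "leaders and followers" and "convoy" patterns mentioned above can be sketched, under simplifying assumptions, as finding objects that stay close at every recorded timestamp; the trajectories below are invented and the distance threshold is arbitrary.

```python
# Sketch: a toy convoy detector. Object pairs whose positions stay within a
# given distance at every timestamp "travel together". Trajectories are invented.

from itertools import combinations

def travel_together(trajectories, max_dist):
    """Return object pairs whose positions stay within max_dist at all timestamps."""
    pairs = []
    for a, b in combinations(sorted(trajectories), 2):
        close = all(
            ((xa - xb) ** 2 + (ya - yb) ** 2) ** 0.5 <= max_dist
            for (xa, ya), (xb, yb) in zip(trajectories[a], trajectories[b])
        )
        if close:
            pairs.append((a, b))
    return pairs

trajectories = {
    "car1": [(0, 0), (1, 0), (2, 0)],
    "car2": [(0, 1), (1, 1), (2, 1)],   # always 1 unit from car1
    "car3": [(9, 9), (5, 5), (0, 0)],   # wanders off on its own
}
print(travel_together(trajectories, max_dist=1.5))  # [('car1', 'car2')]
```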
vi) Mining Cyber-Physical System Data
This category of mining data known as Cyber-Physical
System (CPS) usually includes a large number of
interacting physical and information components. In
general, these systems dealing with this category of data
may be interconnected over internet to provide
heterogeneous environment of cyber-physical system
networks. This kind of network may also take the form of an embedded system in which different subsystems, each performing different functions, are connected for the transfer of data amongst them. A few examples of
such a cyber-physical system may include: A patient care
system that links a patient monitoring system with a
network of patient/medical information and an emergency
handling system, A transportation system that links a
transportation monitoring network, consisting of many
sensors and video cameras, with a traffic information and
control system; A battlefield commander system that links
a sensor/reconnaissance network with a battlefield
information analysis system. It is quite obvious that
these systems and networks are ubiquitous and form a
critical component of modern information infrastructure.
As expected, the data generated in cyber-physical
systems and networks are very dynamic, volatile, noisy,
inconsistent, and interdependent, containing rich
spatiotemporal information, and are critically important for
real-time decision making. Thus, data mining for these systems requires linking the current situation with large information databases, performing a large volume of real-time calculations, and providing responses quickly, which makes the technique more complex than spatiotemporal data mining. Currently, more and more emphasis is being placed on the handling of rare-event detection and anomaly analysis in cyber-physical data streams, reliability and trustworthiness in cyber-physical data analysis, effective spatiotemporal data analysis in cyber-physical networks, the integration of stream data mining with real-time automated control processes, and other related areas.
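The rare-event detection mentioned above can be sketched, in a much simplified form, as flagging sensor readings whose z-score against the history seen so far exceeds a threshold; the readings, warm-up length, and threshold below are illustrative only.

```python
# Sketch: rare-event detection on a cyber-physical sensor stream using a running
# mean/std and a z-score threshold. Readings and parameters are illustrative.

import math

def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Flag indices whose z-score against prior readings exceeds the threshold."""
    anomalies, seen = [], []
    for i, x in enumerate(stream):
        if len(seen) >= warmup:
            mean = sum(seen) / len(seen)
            var = sum((v - mean) ** 2 for v in seen) / len(seen)
            std = math.sqrt(var)
            if std > 0 and abs(x - mean) / std > threshold:
                anomalies.append(i)
        seen.append(x)
    return anomalies

readings = [10.0, 10.2, 9.9, 10.1, 10.0, 10.1, 25.0, 10.0]
print(detect_anomalies(readings))  # [6] (the sudden spike at 25.0)
```

A production stream miner would bound the history window and decay old statistics; this sketch keeps the full history for clarity.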
vii) Mining Multimedia Data
This category of data mining known as multimedia data
mining is intended to detect interesting patterns from
multimedia databases that store and manage large
collections of multimedia objects, including image data,
video data, audio data, as well as sequence data and
hypertext data containing text, text markups, and linkages.
It is an interdisciplinary field that integrates image
processing and understanding, computer vision, data
mining, and pattern recognition. Some of the issues that
have been considered include content-based retrieval and
similarity search, and generalization and multidimensional
analysis. Multimedia data cubes define additional dimensions and measures for multimedia information of different types. Other topics of interest in this area include
classification and prediction analysis, mining associations,
and video and audio data mining, etc. for multimedia data
mining.
viii) Mining Text Data
Another category of data mining, closely related to multimedia data mining, is text mining, an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics, and computational linguistics, with one main difference: a significant amount of the information is stored as text, e.g., news articles, technical papers, books, digital libraries, email messages, blogs, and web pages. Given the nature of this category of data mining, we have seen big interest in this area, as it finds its use in various applications, in particular social networks, huge data warehouses, etc.
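As a minimal sketch of a basic text-mining step, the fragment below extracts keywords by term frequency with a stop-word filter; the document and stop-word list are illustrative, and real systems would use TF-IDF or more advanced language models.

```python
# Sketch: term-frequency keyword extraction with a stop-word filter.
# The sample document and stop-word list are made up.

from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def top_keywords(text, k=3):
    """Return the k most frequent non-stop-word terms in the text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(k)]

doc = "Big data mining is the mining of big data sets, and mining big data is hard."
print(top_keywords(doc))  # the dominant terms: big, data, mining
```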
ix) Mining Web Data
This category of data mining deals with the analysis of
the contents of web data mining applications. Some of the
examples of this type of data mining include the following
contents: Text, Multimedia data, and structured data
(within web pages or linked across web pages), etc. This
type of data mining application allows the users to
understand different types of contents in web page, web
page summaries, web page relevance and ranking and
many related information needed for the web search and
analysis. It provides scalable and informative keyword-based page indexing and supports entity/concept resolution.
The implementation of web pages can be performed on the
contents by using as a surface web or deep web on
underlying database engines. The surface web is that
portion of the Web that is indexed by typical search
engines. The deep Web (or hidden Web) refers to web
content that is not part of the surface web.
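The keyword-based page indexing described above can be sketched as a tiny inverted index mapping terms to the pages that contain them; the page ids and contents below are invented, and real search engines add ranking, stemming, and link analysis on top of this structure.

```python
# Sketch: a tiny inverted index for keyword-based page indexing of the
# "surface web". Page contents are hypothetical.

def build_index(pages):
    """Map each lowercase term to the set of page ids containing it."""
    index = {}
    for page_id, text in pages.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(page_id)
    return index

pages = {
    "p1": "big data analytics on the web",
    "p2": "web data mining techniques",
    "p3": "cooking recipes",
}
index = build_index(pages)
print(sorted(index["data"]))  # ['p1', 'p2']
print(sorted(index["web"]))   # ['p1', 'p2']
```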
x) Data Mining for Financial Data Analysis
This category of data mining application finds its use in banks and financial institutions that offer a wide variety of banking, investment, mortgage, household loan, and credit services. Depending on customer profiles, other services like insurance and stock investment services may also be offered. It is simple
and straightforward to do data analysis and data mining on
the data collected by banks and investment companies as
the data is relatively complete, reliable, structured and of
high quality.
xi) Data Mining for Retail and Telecommunication
Industries
This category of data mining applications deals with the information maintained by retail industries, as they collect huge amounts of data on sales, customer shopping history, payment modes, financial management, goods transportation, consumption, human resource services, and other related services. The amount of data collected continues to grow rapidly, especially due to the increasing availability, competition, ease, and popularity of business conducted on-line on the Web, or e-commerce. On-line purchasing capabilities have become an integral part of all major chain stores and franchises, and many companies, industries, and businesses conduct their business fully on-line (examples include amazon.com, expedia.com, hotels.com, and many others) without any physical locations for their business.
The data collected from these businesses are used heavily, through some form of data mining, for marketing strategies. They also provide rich sources for identifying customer buying behaviors,
discovering customer shopping patterns and trends,
implementing new techniques to improve the quality of
customer service, achieving better customer retention and
satisfaction, enhancing goods consumption ratios,
designing more effective goods transportation and
distribution policies, reducing the cost of business and
creating a user-friendly based environment for all
transactions of goods and services.
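The discovery of shopping patterns mentioned above can be sketched as a bare-bones market-basket analysis that counts frequent item pairs; the transactions and support threshold are invented, and full association-rule miners such as Apriori extend the same counting idea to larger itemsets.

```python
# Sketch: discovering customer shopping patterns via frequent item pairs
# (minimal market-basket analysis). The transactions are invented.

from collections import Counter
from itertools import combinations

def frequent_item_pairs(baskets, min_support):
    """Return item pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

baskets = [
    ["bread", "milk", "eggs"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["bread", "milk", "butter"],
]
print(frequent_item_pairs(baskets, min_support=3))  # {('bread', 'milk'): 3}
```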

xii) Telecommunication Industries
This category of data mining application deals with the huge amount of data
that is being collected, analyzed, and maintained by
telecommunication industries. One of the earliest services
of telephone industries have been various communication
services like local and long distance calls, but with the
advent of internet, a number of services have been derived
from the basic communication services such as local and
international wireless communications, cellular phones,
smart phone, internet access (wired and wireless), e-mail,
text messages, different types of attachments with e-mails
such as images, pictures, video, and other audio signals,
computer and web data transmissions, and other types of
data communications. With the advent of internet,
numerous technologies have evolved in the areas of
communications and computing. The integration of
telecommunication, computer network, Internet, and
numerous other means of communication and computing
has created a great demand for data mining to help
understand business dynamics, identify telecommunication
patterns, catch fraudulent activities, make better use of
resources, and improve service quality.
xiii) Data Mining in Science and Engineering
This category of data mining applications deal with
collection of massive amounts of complex data, data
preprocessing, data warehousing, and scalable mining of
data for data analysis. The data mining is well supported by
visualization, concepts of graphs and networks for data
analysis. Most of the systems in engineering require real-time responses for their processes, and as such the
appropriate data mining techniques for data analysis and
mining of data streams becomes very crucial.
This category of data mining applications also deals with aspects of software engineering, where data mining monitors the status of various processes, improves system performance, detects and isolates software bugs, detects software plagiarism, supports fault tree analysis and fault-tolerant systems, detects malicious attacks, protects assets from these attacks, identifies malfunctions, maintains quality control, manages various risks, and supports the design of dependable and secured systems. Data
mining for software and system engineering can operate on
static or dynamic (i.e., stream-based) data, depending on
whether the system dumps traces beforehand for post
analysis or if it must react in real time to handle online
data.
xiv) Data Mining and Recommender Systems
This category of data mining applications deals with providing goods and services to customers who are shopping on-line. A new concept combined with data mining has been introduced as the recommender system, which helps consumers select products based on their interests and possible recommendations by the recommender. Some examples of this category of data mining applications include:
books, CDs, movies, restaurants, online news articles,
tours, travel, attractions and many other related services.
Recommender systems may use either a content based
approach, a collaborative approach, or a hybrid approach
that combines both content-based and collaborative
methods.
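The collaborative approach mentioned above can be sketched, under simplifying assumptions, as recommending items liked by the most similar user, with similarity measured as Jaccard overlap of liked-item sets; the users and items below are invented.

```python
# Sketch: a tiny collaborative-filtering recommender. Items liked by the user
# most similar to the target (Jaccard overlap) are suggested. Data is invented.

def recommend(likes, target):
    """Suggest items liked by the most similar other user but not yet by target."""
    best_user, best_score = None, -1.0
    for user, items in likes.items():
        if user == target:
            continue
        inter = len(items & likes[target])
        union = len(items | likes[target])
        score = inter / union if union else 0.0
        if score > best_score:
            best_user, best_score = user, score
    return sorted(likes[best_user] - likes[target])

likes = {
    "alice": {"book_a", "movie_x", "tour_1"},
    "bob": {"book_a", "movie_x", "movie_y"},
    "carol": {"restaurant_r"},
}
print(recommend(likes, "alice"))  # ['movie_y']
```

A content-based recommender would instead compare item attributes; a hybrid system combines both signals.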

xv) Ubiquitous and Invisible Data Mining
This category of data mining application deals with various aspects of our daily lives, as it creates a profile of our behavior for shopping, interests, work, recreation, health, family activities, professional activities, and many other related attributes of our routine lives. Based on these created profiles, a number of information searches will affect our lives via leisure time, health, mental and physical fitness, happiness, etc. These applications can in general be grouped under ubiquitous data mining, where users are unaware that any data mining techniques are being used; by default, data mining is present in all these applications without the users' knowledge.
xvi) Crime Control in the Implementation of Data Mining Approaches
Recent years have seen new developments in applying data mining approaches to crime control. In particular, new data mining algorithms need to be developed for crime data analysis. Data mining is used to identify undefined and unexpected formats, patterns, and any sets of rules that may be present in the data sets. The type of data mining technique to be used for crime analysis depends on a number of classes, and each class offers a set of attributes that helps developers choose the technique. One popular technique depends on the categories of crime: based on a crime or set of crimes, an appropriate technique can be adopted. For large groups of similar types of data, another crime mining technique has been defined as clustering, where a learning technique may be developed to define a structure over the entire group so that useful information can be extracted from all the similar types of data in the data sets [7-9, 45, 49-51, 68].
From the discussion above, it has become very clear why researchers have successfully used data mining techniques for solving multimedia data applications (comprising data from different sources and formats). These efforts have introduced new integrated technologies for solving a variety of applications dealing with huge amounts of data; further, data mining analysis techniques are very useful in extracting useful information from collected and processed data.

XXI. CONCLUSION
In the first part of the paper, we presented the state-of-the-art of big data and big data analytics and various issues associated with them. Starting with basic definitions and the need for such a technology, we covered how big data technology evolved, the operations and services offered by big data, various applications, different forms and classifications of data applications, data processing (including data collection, data cleansing, data analysis, data storage, manipulation, and interpretation), big data solutions, and various open source frameworks/architectures to solve big data applications. We
then discussed various open source software
frameworks/architectures that have been used extensively
in solving different types of applications of big data. In
addition to these frameworks, a number of application
programs have been introduced that can be interfaced with

29 | International Journal of Computer Systems, ISSN-(2394-1065), Vol. 03, Issue 01, January, 2016

Gurdeep S Hura et al

Emergent Trends and Challenges in Big Data Analytics, Data Mining, Virtualization and Cyber Crimes: An Integrated
Global Perspective- I

the top layer of the frameworks. These applications


programs provide a variety of operations and services. This
new technology has become one of the leading
technologies in the last few years as it has implemented
advanced data mining techniques, virtualizations,
optimizations, text mining, etc.
With a detailed background of big data and big data
analytics technology and its successful implementation of
data mining techniques and virtualizations, the paper
provided a detailed discussion how each of these concepts
have been redefined and used in some successful
applications. The paper also discussed various frameworks
and tools that have been introduced with their details with a
view that readers can get all needed experiences of these
frameworks and tools for development of applications in
future. The paper also presented limitations and challenges
in some of these concepts that need to be investigated for
future work.
Data mining has been in existence for over 30 years and has found its use in solving a variety of applications. With large amounts of data, data mining techniques have become very useful and effective in managing high volumes of data; they also provide techniques for analyzing the data and interpreting it to extract the useful meaning of the information. In particular, data mining techniques find useful applications in social networks.
The second part of the paper will focus on the remaining two technologies: data virtualization and data security. Virtualization offers a very efficient tool for the representation of data, with capabilities for displaying the dynamic behavior of the data. Since modern big data applications are implemented over the Internet, it is important
to understand various cyber-attacks and crimes that affect
all the implementation phases of big data multimedia
applications. It will also provide insights of how the
applications can be prevented from these attacks and
further how cyber-crime analysis technique can be used for
reliable big data implementation.

XXII. FUTURE RESEARCH DIRECTIONS
Big data and big data analytics are becoming a new way of exploring and discovering interesting, valuable information. The volume of data for different applications has been increasing exponentially in the last few years, and as such big data technology will be widely used to provide solutions and interpretations of useful data from the information via data analysis. Other related topics associated with this technology, like new tools and techniques, new and improved frameworks, new analysis tools, etc., need to be discussed. In addition, we also need to address and incorporate the critical issues of privacy and security of big data and big data analytics in the future.
Recent years have shown that one of the most popular open source software frameworks, Hadoop, has become a common solution for processing large amounts of data in a number of applications. Future development in this framework is expected to focus on systems that provide real-time ad hoc querying capabilities over large-scale data. Another important interest in this framework is the development of querying systems that make use of SQL, in order to leverage existing SQL knowledge amongst users to query against Hadoop systems. Other interesting extensions in this framework focus on data management and big data technology.
Based on various publications and success stories in big data applications, researchers and developers still face issues and challenges that need to be addressed and considered in future applications of big data and big data analytics. Some of these include: the continually growing volume of data; transmission of this big volume of data over the internet; different techniques for processing of data; processing of data at the location where it is stored; retrieving the data for its processing; processing of subsets of data based on identifying quality and other attributes; mapping between quality of data and quantity of data for solutions and its validity; new methods of evaluating the validity of all the data items in the set; identification of the size of the data items for the evaluation of validity and accurate conclusions; new methods for evaluating the quality of data; and new methods to measure the accuracy, reliability, and quality of data solutions.
One of the main difficulties in big data analysis is finding a suitable method for extracting useful and accurate interpretations and meaning from the collected and analyzed data, so that users can understand them and take appropriate actions on further interleaved data. If the representation of data can be defined graphically using an easy graphical user interface (GUI), and further if we can introduce data analyzing techniques with a GUI, we can view a clear format of the data, and it becomes easy to understand the extracted and interpreted outcome/information from processed and analyzed data.
ACKNOWLEDGEMENT
I am thankful to two of my graduate students, Mr. Avinash Dudi and Ms. Seethi Venkata Sandhya Dhari, who helped me in searching for articles and reviewing some of the articles for me.
REFERENCES

[1] Aaron Ritchie and Henry Quach, (2013), "Developing Distributed Applications Using Zookeeper", Big Data University, online, last accessed April 17, 2015, http://bigdatauniversity.com/bduwp/bducourse/developin-distributed-applications-using-zookeeper/
[2] Aaron Ritchie, (2012), "Using Hive for Data Warehousing", Big Data University, online, last accessed April 12, 2015, http://bigdatauniversity.com/bdu-wp/bdu-course/using-hivefordata-warehousing/
[3] Agrawal, D., Bernstein, P., Bertino, E., Davidson, S., Dayal, U., Franklin, M., and Widom, J., (2012), "Challenges and Opportunities with Big Data", last retrieved Feb 3, 2015, http://cra.org/ccc/docs/init/bigdatawhitepaper.pdf
[4] Agrawal, D., Das, S., and El Abbadi, A., (2011), "Big data and cloud computing: current state and future opportunities", in Proceedings of the 14th International Conference on Extending Database Technology (EDBT/ICDT '11), pp. 530-533, New York, NY, USA: ACM, retrieved March 4, 2015, http://doi.acm.org/10.1145/1951365.1951432
[5] Alexander N. Gorban, Balázs Kégl, Donald Wunsch, and Andrei Zinovyev, (2008), Principal Manifolds for Data Visualization and Dimension Reduction, LNCSE 58, Springer Verlag.
[6] Anand Loganathan, Ankur Sinha, Muthuramakrishnan V., and Srikanth Natarajan, (2014), "A Systematic Approach to Big Data Exploration of the Hadoop Framework", International Journal of Information & Computation Technology, ISSN 0974-2239, Volume 4, Number 9, pp. 869-878, International Research Publications House, http://www.irphouse.com
[7] Apache Ambari, Hortonworks, online: http://hortonworks.com/hadoop/ambari/
[8] Apache Hadoop, last accessed Feb 13, 2015, http://hadoop.apache.org/core/
[9] A. Vailaya, (2012), "What's All the Buzz Around Big Data?", IEEE Women in Engineering Magazine, December 2012, pp. 24-31.
[10] B. Brown, M. Chui, and J. Manyika, (2011), "Are You Ready for the Era of Big Data?", McKinsey Quarterly, McKinsey Global Institute, October 2011.
[11] Begoli, E., and Horey, J., (2012), "Design Principles for Effective Knowledge Discovery from Big Data", 2012 Joint Working IEEE/IFIP Conference on Software Architecture (WICSA) and European Conference on Software Architecture (ECSA), pp. 215-218, last retrieved April 2, 2015, http://dx.doi.org/10.1109/WICSA-ECSA.212.32
[12] B. Gerhardt, K. Griffin, and R. Klemann, (2014), "Unlocking Value in the Fragmented World of Big Data Analytics", Cisco Internet Business Solutions Group, June 2012, last retrieved Nov 2, 2014, http://www.cisco.com/web/about/ac79/docs/sp/InformationInfomediaries.pdf
[13] Bryant, R. E., Katz, R. H., and Lazowska, E. D., (2008), "Big-data computing: Creating revolutionary breakthroughs in commerce, science, and society", in Computing Research Initiatives for the 21st Century, Computing Research Association, 2008, retrieved April 4, 2015, http://www.cra.org/ccc/docs/init/Big_Data.pdf
[14] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun, (2007), "Map-reduce for machine learning on multicore", in B. Scholkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pp. 281-288, MIT Press, Cambridge, MA, 2007.
[15] Computer crime, Wikipedia, the free encyclopedia, last accessed June 11, 2015, https://en.wikipedia.org/wiki/Computer_crime
[16] C. Eaton, D. Deroos, T. Deutsch, G. Lapis, and P. C. Zikopoulos, (2012), Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, McGraw-Hill Companies, ISBN 978-0-07-179053-6, 2012.
[17] C. Ranger, R. Raghuraman, A. Penmetsa, G. R. Bradski, and C. Kozyrakis, (2007), "Evaluating MapReduce for Multi-core and Multiprocessor Systems", Proc. International Symposium on High-Performance Computer Architecture (HPCA), 2007, pp. 13-24.
[18] C. Tankard, (2012), "Big Data Security", Network Security Newsletter, Elsevier, ISSN 1353-4858, July 2012.
[19] C. Tankard, (2012), "Big Data Security", Network Security Newsletter, Elsevier, ISSN 1353-4858, July 2012.
[20] "Data Abstraction Best Practices with Cisco Data Virtualization", (2012), last accessed Nov 11, 2014, www.cisco.com/.../data.../data_abstraction_with_cisco
[21] Jiawei Han, Micheline Kamber, and Jian Pei, (2012), Data Mining: Concepts and Techniques, Third Edition, Chapter 13, http://www.cse.hcmut.edu.vn/~chauvtn/data_mining/Texts/%5B1%5D%20Data%20Mining%20%20Concepts%20and%20Techniques%20(3rd%20Ed).pdf
[22] "Data virtualization: 6 best practices to help the business", (2011), Oct 27, 2011, last accessed April 12, 2015, www.zdnet.com/.../data-virtualization-6best-practices-to-help-the
[23] "Data Virtualization: Achieve Better Business Outcomes, Faster", (2014), Data Center blog, Cisco Systems, Inc., blogs.cisco.com, May 6, 2014, last accessed April 23, 2015.
[24] "Data Abstraction Layer | Data Virtualization Layer", (2012), www.compositesw.com//data-abstraction/
[25] Data Virtualization Applied: Effective Solutions to Today's Business and IT Challenges
(2013), last access, April 10, 2015. www.tdwi.org/.../Case-for-DataVirtualization
Data Virtualization's Value: Myth or Reality? - DATAVERSITY,
A white paper (2015) Last access May 2, 2015
www.dataversity.net/data-virtualizations-value-myth-or-reality Feb
2015/

[27] Data Virtualization Use Cases and Patterns (2014), last access May 30, 2015, www.denodo.com/en/page/data-virtualization-usecases-and-patterns
[28] Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility, Paperback, by Judith T. Davis and Robert Eve (2011), last access May 22, 2015, www.datavirtualizationbook.com, ISBN-13: 978-0-9799304-1-6, Printed in US, Nine Five One Press, Sept 2011
[29] Donald Miner, Adam Shook, (2012), MapReduce Design Patterns, O'Reilly Media, 2012 Edition
[30] Edward Capriolo, Dean Wampler, Jason Rutherglen, (2012), Programming Hive, O'Reilly Media, 2012 Edition
[31] Effective Solutions to Today's Business and IT Challenges (2012), last access March 23, 2015, www.purl.manticoretechnology.com/MTC.../mtcURLSrv.aspx?ID...
[32] Gurdeep S Hura, A chapter on Need for dynamicity in social
networking site: An overview from data mining perspective, Data
mining in dynamic social networks and fuzzy systems, chapter I,
Dec 2013, IGI Global Publishing Company, NY
[33] G. S Hura, Chapter 29: Computer Networks: LANs, MANs,
WANs, and Wireless, Digital Process Control and Networks,
Taylor and Francis Group in June 2011.
[34] G. S Hura, Chapter 30: Internet Fundamentals and Cyber Security
Management, Digital Process Control and Networks, Taylor and
Francis Group in June 2011
[35] G. S. Hura, A Chapter on Terrestrial Wide Area Networks,
Handbook of Computer Networks, 3 Volume Set, Hossein Bidgoli,
Editor-in-Chief, John Wiley and Sons, Inc., 2007.
[36] Hayes, M. (2013). White Elephant: The Hadoop Tool You Never Knew You Needed. Last access Nov 11, 2014, http://engineering.linkedin.com/hadoop/white-elephant-hadooptool-you-never-knew-you-needed
[37] http://hpccsystems.com/, last access Dec 11, 2014
[38] http://en.wikipedia.org/wiki/Big_data , last access Nov 11, 2014
[39] http://hadoop.apache.org/ , last access Nov 11, 2014
[40] http://www.humanfaceofbigdata.com/ , last access Nov 11, 2014
[41] IBM System z - Virtualization: Overview (2012), last access Nov
11, 2014 www.ibm.com/systems/z/advantages/virtualization/
[42] Intel IT Center, (2012), Planning Guide: Getting Started with
Hadoop, Steps IT Managers Can Take to Move Forward with Big
Data Analytics, June 2012
[43] Intel IT Center, (2012), Peer Research: Big Data Analytics, Intel's IT Manager Survey on How Organizations Are Using Big Data, August 2012, last access Feb 4, 2015, http://www.intel.com/content/dam/www/public/us/en/documents/reports/data-insights-peer-research-report.pdf
[44] Ji, C., Li, Y., Qiu, W., Awada, U., & Li, K. (2012). Big Data Processing in Cloud Computing Environments. Pervasive Systems, Algorithms and Networks (ISPAN), 2012 12th International Symposium on (pp. 17-23). http://dx.doi.org/10.1109/I-SPAN.2012.9
[45] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, (2011), Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute, 2011, last access March 2, 2015, http://www.mckinsey.com/~/media/McKinsey/dotcom/Insights%20and%20pubs/MGI/Research/Technology%20and%20Innovation/Big%20Data/MGI_big_data_full_report.ashx
[46] K. Bakshi, (2012), Considerations for Big Data: Architecture and
Approach, Aerospace Conference IEEE, Big Sky Montana, March
2012
[47] M. Smith, C. Szongott, B. Henne and G. Voigt , (2012), Big Data
Privacy Issues in Public Social Media, Digital Ecosystems
Technologies (DEST), 6th IEEE International Conference on,
Campione d'Italia, June 2012
[48] Mainframe Data Virtualization - Rocket Software (2012), last
access April 12, 2015, www.rocketsoftware.com/data-virtualization
[49] Menon, A. (2012). Big data @ facebook. In Proceedings of the 2012 Workshop on Management of Big Data Systems (MBDS '12) (pp. 31-32). New York, NY, USA: ACM. http://dx.doi.org/10.1145/2378356.2378364

[50] Michael J. Quinn (2014), Ethics for the Information Age, Pearson Press, Sixth Edition, 2014, Chapter 7
[51] Paolo Ciuccarelli, Giorgia Lupi, Luca Simeone (2014), "Visualizing the Data City: Social Media as a Source of Knowledge for Urban Planning and Management", Springer-Verlag
[52] Pokorny, J. (2011). NoSQL databases: a step to database scalability in web environment. In Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services.
[53] P. Russom, (2011), Big Data Analytics, TDWI Best Practices Report, TDWI Research, Fourth Quarter 2011, last access April 3, 2015, http://tdwi.org/research/2011/09/best-practices-report-q4-bigdata-analytics/asset.aspx
[54] R. Weiss and L. J. Zgorski, (2012), Obama Administration Unveils Big Data Initiative: Announces $200 Million in New R&D Investments, Office of Science and Technology Policy, Executive Office of the President, March 2012
[55] Rajan, S. et al. (2012). Top Ten Big Data Security and Privacy
Challenges. Retrieved from
https://downloads.cloudsecurityalliance.org/initiatives/bdwg/Big_D
ata_Top_Ten_v1.pdf
[56] Reed, B. (2012). ZooKeeper Overview. Last access March 10, 2015
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Project
Description
[57] Ryaboy, D. (2012). Twitter at the Hadoop Summit. Last access Nov
11, 2014
http://engineering.twitter.com/2012/06/twitter-at-hadoopsummit.htm
[58] S. Singh and N. Singh, (2011), Big Data Analytics, 2012 International Conference on Communication, Information & Computing Technology, Mumbai, India, IEEE, October 2011
[59] Sanjay P. Ahuja and Bryan Moore (2013), State of Big Data Analysis in the Cloud, Network and Communication Technologies, Vol. 2, No. 1, pp. 62-68, 2013
[60] S. Ghemawat, H. Gobioff, and S. Leung (2003), The Google file system, Symposium on Operating Systems Principles, 2003, pp. 29-43
[61] S. Madden, (2012), From Databases to Big Data, IEEE Internet
Computing, June 2012, v.16, pp.4-6
[62] Shashank Tiwari, (2011) Professional NoSQL, Wrox Publications,
2011 Edition
[63] Tierney, B., Kissel, E., Swany, M., & Pouyoul, E. (2012). Efficient data transfer protocols for big data. E-Science (e-Science), 2012 IEEE 8th International Conference on (pp. 1-9). http://dx.doi.org/10.1109/eScience.2012.6404462
[64] Tom White, (2012), Hadoop: The Definitive Guide, O'Reilly Media, 2012 Edition
[65] Warren Pettit, (2012), Introduction to Pig, Big Data University, Online, last access March 23, 2015, http://bigdatauniversity.com/bdu-wp/bdu-course/introduction-topig/
[66] Weil, K. (2010). Hadoop at Twitter. Last access Nov 11, 2014, http://engineering.twitter.com/2010/04/hadoop-at-twitter.html
[67] U. Fayyad, G. Piatetsky-Shapiro and P. Smyth, (1996), From Data Mining to Knowledge Discovery in Databases, American Association for Artificial Intelligence, AI Magazine, Fall 1996, pp. 37-54
[68] Unlocking Agility with Data Virtualization, www.denodo.com/en/video/webinar/unlocking-agility-datavirtualization
[69] Gurdeep S Hura, Dynamic Reconfigurable Software architecture: A
Novel intelligent framework, Proc MTMI, Virginia Beach, VA,
Sept 11-12, 2015
[70] Gurdeep S. Hura and M. Singhal, Data and Computer Communications: Networking and Internetworking, CRC Press, April 2001.
