
BIG DATA: An approach towards the future of Information Science

Bijoy Chhetri
Sr. Lecturer
Department of Computer Science and Engineering
Centre for Computers and Communication Technology

bijoychhetri@gmail.com
ABSTRACT

Today the term BIG DATA draws a lot of attention, but behind it there is a simple story. For decades, companies have been making business decisions based on transactional data stored in relational databases (RDBMS). Beyond that critical data, however, lies a potential treasure of non-traditional, less structured data: weblogs, social media, email, sensors, and photographs that can be mined for useful information and knowledge delivery (KDD). Decreases in the cost of both storage and compute power have made it feasible to collect this data, which would have been thrown away only a few years ago. As a result, more and more companies are looking to include this non-traditional (unstructured) yet potentially very valuable data alongside their traditional enterprise data in their business intelligence analysis.
To derive real business value from BIG DATA, this paper outlines the background of data science, the need for the right tools to capture and organize a wide variety of data types from different sources, and the ability to analyze them easily within the context of all enterprise data, so that instantaneous, spontaneous and constant knowledge delivery can help decision makers act in real time. In this paper an attempt has been made to give a brief discussion of BIG DATA and its impact on data science.

Keywords
BIG DATA, Hadoop, MapReduce, Cloud, NoSQL.

1. INTRODUCTION
BIG DATA is very much like small data, but bigger in scale, complexity and modes of generation. Having BIG DATA, however, means adopting newer technologies and a different approach to handling larger datasets, with the aim of solving new problems and even solving old problems in a better way.
Data have become a torrent flowing into every area of the global economy, and the science behind it, Data Science, deals with the collection, preparation, analysis, visualization, management and preservation of large collections of information. In other words, data science [2] is the integration of methods from statistics, computer science, and other fields for gaining insights from data. In practice, data science encompasses an iterative process of data harvesting, cleaning, analysis, visualization, and implementation. Ultimately, this interdisciplinary and cross-functional field leads to decisions that move an organization forward, whether the object of interest is a product design, a proposed investment, or a business strategy.
In the previous decade of data repositories, the traditional methodologies combined ad-hoc querying and reporting with data mining techniques, applied mostly to structured data pouring in from typical sources that generate small to mid-size datasets. This gave birth to enterprise data warehousing (EDW) techniques with Data Marts and data centers, where data-driven knowledge discovery is carried out with the help of Business Intelligence tools fed on to them. The method works fine for a limited range of data analysis but has clear constraints on processing large datasets in terms of time, cost, reliability and scalability.
With the rapid increase in the production of data from varied sources, the growth in storage capacities and the growth in processing power (Moore's Law), the term BIG DATA came into existence and gained importance. Varied ranges of data are generated from different sources in a continuous manner, demanding technologies that can handle newer types of data and integrate them to a decisive level. Apart from catering to the increase in the volume of data, newer analytical requirements call for optimization, predictive analytics and complex statistical analysis in a more real-time manner. To quote from an IBM website [6]:
"Every day, we create 2.5 quintillion bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data."

1.1 Cloud and BIG DATA


BIG DATA is fuelled by the properties of the cloud, in terms of both its creation and its utilization, rather than being limited to enterprise data, because the on-demand computing services of a cloud architecture and BIG DATA act as permanent driving forces for each other in a never-ending cycle. The data produced in the cloud is captured by the BIG DATA architecture and utilized in real time by cloud services over the Internet. The cloud brings advantages of affordability, economy of scale, agility and extensibility, which in turn enable us to access BIG DATA in a more efficient way. The newer technologies used to analyze data in the cloud will certainly drive future design, enhancement and innovation in the cloud, which further caters to the needs of the BIG DATA technologies in place. The avalanche of BIG DATA challenges us to make use of data for the next generation of breakthroughs.


1.2 Defining BIG DATA


Everyone in the enterprise or business [5] wonders how big BIG DATA is for a particular requirement. Is it only a matter of size? Not necessarily: size is not the only element describing BIG; the speed and the variety of sources pouring in data simultaneously must also be considered.
Taking the examples of a 40 MB .ppt file, a 1 TB MRI scan, or a 1 PB movie: these are BIG DATA only when the underlying infrastructure and technologies do not support them. Data is BIG when it challenges the constraints of the existing system capabilities and the business need. A 40 MB presentation will certainly be a big dataset if it cannot be sent as an attachment, as will a 1 PB movie if it cannot be rendered for editing. Therefore an organization must weigh its capabilities in terms of technologies and architecture before calling any data big; and whenever any fast-growing data pushes the common technology available to utilize it to its limit, it is definitely BIG DATA.
BIG DATA, as defined by Wikipedia, is "a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications."
Gartner defines it as "high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
More simply, it is a huge set of data whose size is beyond the ability of traditional databases and tools to capture, store, manage and analyze.
As technology advances over time, the size of datasets that qualify as BIG DATA will also increase. Note also that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those qualifications, BIG DATA in many sectors today ranges from a few dozen terabytes to multiple petabytes (thousands of terabytes).
Here are a few examples covering the three Vs of BIG DATA, Volume, Velocity and Variety, with further dimensions as subsets of the three major ones, as shown in Fig-1.
There is a 44-fold increase in data size every year. The Large Synoptic Survey Telescope (LSST) alone will generate over 30 thousand gigabytes (30 TB) of images every night over its decade-long sky survey. Consider also the 72 hours of video uploaded to YouTube every minute and the 140 million tweets per day on average: these amount to the Volume of data generated in the pool.
Various formats, types and structures, such as text, numerical data, images, audio, video, sequences, time series, social media data and multi-dimensional arrays, arriving from different sources, amount to the diverse Variety of data being generated. Google using smart phones as sensors to determine traffic conditions, promotions sent right now for the store next to you based on your current location and purchase history, and sensors monitoring activities and the body for any abnormal measurements that require immediate reaction: all of these amount to the Velocity of the data.
Fig-1: Three Dimensions of BIG DATA


1.3 Challenges of BIG DATA
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize and discover knowledge from the collected data in a timely and scalable fashion; these are the challenges for the IT fraternity.

2. METHODOLOGIES USED TO
ANALYSE BIG DATA
Fig-2 represents the different phases of harnessing, cleaning, analyzing and interpreting huge sets of data in an efficient manner.

Fig-2: Processing of BIG DATA

2.1 Data Acquiring


The acquisition phase is one of the major changes in
infrastructure from the days before big data. Because big data
refers to data streams of higher velocity and higher variety, the
infrastructure required to support the acquisition of big data
must deliver low latency in both capturing data and in
executing short, simple queries; be able to handle very high
transaction volumes, often in a distributed environment; and
support flexible, dynamic data structures.
For this purpose, NoSQL databases are frequently used to acquire and store big data. They are well suited to dynamic data structures and are highly scalable, and they do not enforce a fixed schema for storing the data.
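As a minimal sketch of this idea (assuming a MongoDB server running locally and the pymongo driver installed; the database, collection and field names below are illustrative, not part of any particular deployment), documents of different shapes can be written into the same collection without declaring a schema first:

```python
# Illustrative sketch: acquiring heterogeneous records into a schema-less
# NoSQL store. Assumes a local MongoDB instance and the pymongo driver;
# the database/collection/field names are hypothetical.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["acquisition"]["events"]

# Two records with different structures share one collection, because
# no fixed schema is enforced at write time.
events.insert_one({"type": "weblog", "ip": "10.0.0.1", "url": "/index.html",
                   "ts": datetime.now(timezone.utc)})
events.insert_one({"type": "sensor", "sensor_id": "t-17", "temperature_c": 21.4,
                   "ts": datetime.now(timezone.utc)})
print(events.count_documents({}))
```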

2.2 Organizing Data

This phase includes cleaning and data integration, performed at the data's initial destination rather than by moving large volumes of it. The infrastructure required for organizing big data must be able to process and manipulate data in its original storage location, and must support high throughput to deal with large sets of data.
For this purpose, Hadoop is a technology that allows large data volumes to be organized and processed while keeping the data on the original storage cluster. The Hadoop Distributed File System (HDFS) is the long-term storage system for web logs, for example. These web logs are turned into browsing sessions by running MapReduce programs on the cluster and generating aggregated results on the same cluster.
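The following is a minimal, single-machine sketch of the MapReduce pattern just described, turning web-log lines into per-user page-view counts. In a real deployment the same map and reduce functions would run as a Hadoop job over logs stored in HDFS; the log format (timestamp, user id, URL) and the sample lines are assumptions made for illustration only.

```python
# Single-machine sketch of the MapReduce pattern: web-log lines are mapped
# to (user, 1) pairs, shuffled by key, and reduced to per-user counts.
from collections import defaultdict

logs = [
    "2015-01-01T10:00:01 user1 /home",
    "2015-01-01T10:00:05 user2 /products",
    "2015-01-01T10:00:09 user1 /cart",
]

def map_phase(line):
    """Emit (user_id, 1) for every log line."""
    _, user, _ = line.split()
    yield user, 1

def reduce_phase(user, counts):
    """Aggregate all values emitted for one user."""
    return user, sum(counts)

# Shuffle step: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for line in logs:
    for user, one in map_phase(line):
        grouped[user].append(one)

print([reduce_phase(user, counts) for user, counts in grouped.items()])
# [('user1', 2), ('user2', 1)]
```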

2.3 Analyze Big Data


The analysis of the acquired, cleaned and transformed data can be done in a distributed environment. The infrastructure required for analyzing big data must be able to support deeper analytics, such as statistical analysis and data mining, on a wider variety of data types stored in diverse systems, and must scale to extreme data volumes. It must also deliver fast responses, while accommodating changes in behavior and automating decisions based on analytical models. For example, analyzing the sales of a pizza house in combination with the events calendar of the venue in which it is located will indicate the optimal product demand and the business logic needed to drive the sales percentage.
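To make the pizza example concrete, here is a small sketch that joins daily sales with an events calendar and compares average demand on event days against ordinary days; the figures are made-up sample inputs, and a real analysis would of course run over far larger, distributed datasets.

```python
# Illustrative sketch: combining a sales series with an events calendar to
# estimate demand uplift on event days. All numbers are made-up examples.
from statistics import mean

daily_sales = {"2015-06-01": 120, "2015-06-02": 95,
               "2015-06-03": 210, "2015-06-04": 100,
               "2015-06-05": 230}
event_days = {"2015-06-03", "2015-06-05"}   # e.g. concerts at the nearby venue

with_event    = [s for d, s in daily_sales.items() if d in event_days]
without_event = [s for d, s in daily_sales.items() if d not in event_days]

uplift = mean(with_event) / mean(without_event)
print(f"Average demand is {uplift:.1f}x higher on event days")
```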

2.4 Interpretation of the Solution Spectrum


These new systems have created a divided solution spectrum comprising:
SQL solutions: the manageability, security and trusted nature of relational database management systems (RDBMS);
Not Only SQL (NoSQL) solutions: developer-centric, specialized systems;
NewSQL strategies: a blend of NoSQL and SQL.
In the interpretation of the analysis results, systems with a rich palette of visualizations become important for conveying the results of queries to the user in the way they are best understood.
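As a minimal sketch of this visualization step (assuming the matplotlib library is available; the user names and session counts are illustrative values, for example the output of a job like the MapReduce sketch in section 2.2), a result set can be conveyed as a simple chart:

```python
# Illustrative sketch: presenting an aggregated result as a bar chart.
# Assumes matplotlib; the values below are made-up examples.
import matplotlib.pyplot as plt

users = ["user1", "user2", "user3"]
sessions = [42, 17, 29]          # e.g. per-user session counts from a batch job

plt.bar(users, sessions)
plt.title("Browsing sessions per user")
plt.ylabel("sessions")
plt.show()
```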

3. TECHNIQUES USED FOR BIG DATA ANALYSIS
A wide variety of techniques and technologies has been developed and adapted to aggregate, manipulate, analyze, and visualize big data. These techniques and technologies draw from several fields, including statistics, computer science, applied mathematics, and economics. They include the techniques listed below; a minimal A/B testing sketch is given after the list.

A/B/N Testing
Web Mining
Machine Learning Techniques
Crowd Sourcing
Genetic Algorithm
NLP
Sentiment Analysis
Visualization
Time Series Analysis
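As one concrete instance of the first item above, the following sketch compares the conversion rates of two page variants with a two-proportion z-test, which is one common (but not the only) way to evaluate an A/B test; the visitor and conversion counts are made-up illustrative inputs.

```python
# Illustrative A/B test sketch: two-proportion z-test on synthetic counts.
from math import sqrt
from statistics import NormalDist

conversions_a, visitors_a = 120, 2400   # variant A (made-up numbers)
conversions_b, visitors_b = 150, 2350   # variant B (made-up numbers)

p_a, p_b = conversions_a / visitors_a, conversions_b / visitors_b
p_pool = (conversions_a + conversions_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p-value = {p_value:.3f}")
```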

Different architecture technologies include:

Hadoop
MapReduce
BigTable
Cassandra
Distributed systems with Dynamo
Mash-ups
Google File System
BigQuery

4. EXPERIMENTAL RESULTS

5. CONCLUSION
There is no question that enough data is available to overwhelm and overload traditional database management systems, and new systems built for big data will extend, and possibly replace, traditional DBMSs. The rate of data acquisition is accelerating quickly enough that perhaps we will eventually coin a new term beyond BIG DATA.

6. REFERENCES




[1] The McKinsey Global Institute (MGI), report on Big data: The next frontier for innovation, competition, and productivity (2012).
[2] Jeffery Stanton, Introduction to Data Science, Syracuse University.
[3] White paper, "For Big Data Analytics There's No Such Thing as Too Big: The Compelling Economics and Technology of Big Data Computing", March 2012, 4syth.com.
[4] Martin Hilbert, "Big Data for Development: From Information- to Knowledge Societies", paper published on SSRN, 2012.
[5] Webinar on BIG DATA: a pragmatic approach.
[6] www.ibm.com/software/in/data/bigdata/
