
Introduction to Big Data

Soorya Prasanna Ravichandran


What is Big Data?
Big Data is a term for datasets so large and complex that
traditional data processing applications are inadequate

It usually includes datasets with sizes beyond the ability of
commonly used software tools to capture, manage, and process
within a tolerable elapsed time

Any dataset that can be handled easily by traditional databases,
software, and hardware does not qualify as big data

Variety of Data - Structured -> Unstructured


Why Big Data?
In recent years there has been a great shift in the type of data

Fixed / Pre-determined units -> Variable units. Think of FB posts, tags,
likes; Twitter posts, re-tweets. Can you determine the numbers in advance?

Size of the data - Small -> Enormously large

We always produced big data but didn't have a place to store it

Storage became compact and cheap in recent times, and people want to
store every piece of information

Meanwhile, software companies started figuring out how to store and
analyze the information that didn't fit into neat categories

They still do not have a complete solution, but the attempt itself is
triggering a big change
Why process Big Data?
Some of the problems customers want to solve cannot
be solved by structured data alone

Understanding a customer's mindset before he or
she places an order through an online shopping cart

Weather pattern analysis to plan optimal wind turbine
usage by positioning them correctly

Identifying criminals and threats from unstructured video and
audio
How big is Big Data?
Variety of Data (1)

Structured Data - Data which is

Organized in structure

Easy for a computer to handle

Follows a pre-defined data model

Variety of Data (2)
Unstructured Data - Data which is

Not organized in structure

Not easy for a computer to handle

Does not follow a pre-defined data model
Data - Unstructured
Multiple analysts have estimated that data will grow
800x over the next five years

Out of which 80% would be unstructured

Growth of unstructured data is enormously high - it is
growing 10-50x faster than structured data
Traditional BI
BI with Big Data
What is the difference?

Big data has unstructured data as a source besides
the structured data

It has a lot of analytics tools to process it and apply
intelligence to it

Big Data supports complex machine learning
algorithms
Big Data VS Traditional BI

Big Data - Analyzes a variety of real-time information.
Answers: What's likely to happen?

Traditional BI - Analyzes transactional information.
Answers: What happened? Why did it happen?


Big Data Characteristics

Data growth challenges and opportunities are three-
dimensional, generally denoted as the 3Vs

Volume - amount of data

Velocity - speed of data

Variety - range of data types


Volume
The data that needs to be analyzed is huge, and/or the
analysis is very intense, hence a lot of hardware is needed

Machine-generated data is produced in much larger quantities
than non-traditional data

For instance, 10 TB of data can be generated by a single
jet engine in 30 minutes

With around 25,000 flights per day, the daily volume of this
single data source alone runs into the petabytes
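A rough back-of-the-envelope check of the jet-engine figure (assuming just one 10 TB, 30-minute capture per flight; real engine counts and flight durations vary, so this is a lower bound):

```python
# Back-of-the-envelope check of the jet-engine example.
# Assumption: each of the ~25,000 daily flights yields a single
# 10 TB engine capture (real aircraft have multiple engines and
# fly longer than 30 minutes, so the true figure is higher).
tb_per_capture = 10
flights_per_day = 25_000

daily_tb = tb_per_capture * flights_per_day   # 250,000 TB
daily_pb = daily_tb / 1024                    # TB -> PB

print(f"{daily_tb} TB/day is about {daily_pb:.0f} PB/day")
```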
Velocity
The data comes rapidly into the data management system and
frequently requires quick analysis or decision making

Even at 140 characters per tweet, the high frequency (or velocity) of
Twitter data ensures big volumes (over 8 TB of data per day)

E-Promotions: Based on your current location / purchase
history, predict what you will like -> send promotions right now for
the store next to you

Healthcare monitoring: sensors monitoring your activities and
body -> any abnormal measurements require immediate reaction
Variety
The data is not organized into regular patterns as in a
table; rather, images, text, highly varied structures,
or structures unknown in advance are typical

Various formats, types, and structures - text,
numerical, images, audio, video, sequences, time
series, social media data, multi-dimensional arrays,

As new marketing campaigns are executed, new sensors are
deployed, or new services are added, new data types are
needed to capture the resultant information
3Vs -> 4Vs
Assessing the need for big data (1)
Assessing the need for big data (2)
Processing Big Data

Processing of big data can be classified into two types

Batch based

Real-time data stream

Batch processing
Real time processing
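The two modes above can be contrasted with a minimal sketch (illustrative only; real systems use a batch framework such as Hadoop, or a dedicated stream processor, across many machines):

```python
# Batch: process a complete, bounded dataset in one pass.
def batch_average(readings):
    return sum(readings) / len(readings)

# Real time: update the result incrementally as each record arrives.
def streaming_average():
    count, total = 0, 0.0
    avg = None
    while True:
        value = yield avg          # receive the next reading
        count += 1
        total += value
        avg = total / count        # result is always up to date

readings = [3.0, 5.0, 10.0]
print(batch_average(readings))     # one answer at the end: 6.0

stream = streaming_average()
next(stream)                       # prime the generator
for value in readings:
    print(stream.send(value))      # running answers: 3.0, 4.0, 6.0
```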
Big Data Stack
Big Data Stack (1)
Storage

Stores structured as well as unstructured data

Integration

Integrates data from multiple sources

Advantages (over a single source) - More efficient, more accurate, better
insights

Analysis

Analyzes the data to gain greater insight into trends for business decisions

Identifies interesting relationships

Performs classification and clustering


Big Data Stack (2)
Visualization

Graphically presents the information extracted from the
analysis

Enables better understanding

Data Product

Develops applications and services based on the data
captured and stored and on the findings from the analysis
How to store?
Big Data Storage (1)

Structured data storage

Data with a fixed table schema

Stored in traditional relational databases (SQL
databases)

Handling large volume - Vertical scaling


Big Data Storage (2)

Unstructured data storage


Data does not have a fixed table schema

Stored in NoSQL databases

Mostly implemented as key/value pairs

Handling large volume - Horizontal scaling
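The key/value model can be illustrated with a minimal sketch (a plain Python dictionary stands in for a real NoSQL store; actual systems distribute keys across many machines):

```python
# Minimal key/value storage sketch: values need no fixed schema,
# so records of completely different shapes share one store.
store = {}

# put: associate an arbitrary value with a key
store["user:42"] = {"name": "Ada", "tags": ["vip"]}   # nested record
store["tweet:7"] = "Just 140 characters of text"      # plain string, same store

# get: look up by key
print(store["user:42"]["name"])
```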
Distributed File System
Massive data (structured as well as unstructured) is
stored across multiple machines in a distributed way

Distributed File System

Google File System (GFS)

A proprietary distributed file system from Google

Has large clusters of commodity hardware

Reliable and efficient access to data


Hadoop Distributed File System (HDFS)

Primarily used as the storage system for Apache Hadoop
applications

Distributed file system that runs on commodity hardware

Open source Java product

Highly fault-tolerant

Uses BASE properties - Basically Available, Soft state,
Eventually consistent
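A toy sketch of the chunk-and-replicate idea behind GFS/HDFS (node names and chunk size are made up for illustration; HDFS actually uses 64/128 MB blocks and a NameNode to track placement):

```python
# Toy distributed-file-system placement: split a file into fixed-size
# chunks and place each chunk on several nodes for fault tolerance.
from itertools import cycle

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
CHUNK_SIZE = 4                                  # bytes; real DFSs use ~64-128 MB
REPLICATION = 3                                 # copies per chunk, as in HDFS defaults

def place(data: bytes):
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    placement = {}
    rotation = cycle(NODES)
    for idx in range(len(chunks)):
        # each chunk lives on REPLICATION distinct nodes
        placement[idx] = [next(rotation) for _ in range(REPLICATION)]
    return chunks, placement

chunks, placement = place(b"hello big data world")
print(len(chunks), "chunks")   # 5 chunks of <= 4 bytes each
print(placement[0])            # ['node1', 'node2', 'node3']
```

Losing any single node leaves two surviving copies of every chunk it held, which is the property that lets these systems run on unreliable commodity hardware.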

Amazon S3 (Simple Storage Service)

Amazon's proprietary online storage web service

The internal implementation is not disclosed, though it is
believed to be implemented as a key/value database
Big Data Analysis
Google's MapReduce

Framework for highly distributed processing among clusters or
grids

Capable of handling large file systems (unstructured) as well as
databases (structured)

Map: The master node partitions the input and distributes it
among the worker nodes

Reduce: The master node collects the results from the worker
nodes, combines them, and produces the output
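The Map and Reduce steps above can be sketched in a single process (a word-count toy; a real MapReduce framework runs the map and reduce functions on many worker nodes in parallel, with the shuffle done by the framework):

```python
from collections import defaultdict

# Map: each "worker" turns its input partition into (key, value) pairs.
def map_phase(partition):
    return [(word, 1) for line in partition for word in line.split()]

# Shuffle: group values by key (handled by the framework between phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine the values for each key into a final result.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

partitions = [["big data big"], ["data data"]]        # input split by the master
pairs = [p for part in partitions for p in map_phase(part)]
print(reduce_phase(shuffle(pairs)))                   # {'big': 2, 'data': 3}
```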
Big Data Technologies
Hadoop
Originally developed by Yahoo! as a clone of Google's MapReduce
HDFS (Hadoop Distributed File System) for reliable data storage
MapReduce for high performance - parallel data processing
Fault-tolerant in case of any changes or failures in the cluster of servers
Apache Hive
Runs Hadoop jobs using HiveQL (Hive's query language)
Can plug in MapReduce programs when HiveQL is inefficient

Apache Pig
Runs Hadoop jobs using a procedural data processing language
(Pig Latin)
