
Introduction to Big Data

Soorya Prasanna Ravichandran


What is Big Data?
Big Data is a term for datasets so large and complex that
traditional data processing applications are inadequate

It usually includes datasets with sizes beyond the ability of
commonly used software tools to capture, manage, and process
within a tolerable elapsed time

Any dataset that can be handled easily by traditional databases,
software, and hardware does not qualify as big data

Variety of Data - Structured -> Unstructured


Why Big Data?
In recent years there has been a great shift in the type of data

Fixed / Pre-determined units -> Variable units. Think of FB posts, tags,
likes; Twitter posts, re-tweets. Can you determine the numbers in advance?

Size of the data - Small -> Enormously large

We always produced big data but didn't have a place to store it

Storage became compact and cheap in recent times, and people want to
store every piece of information

Meanwhile, software companies started figuring out how to store and
analyze the information that didn't fit into neat categories

They still do not have a complete solution, but the attempt itself is
triggering a big change
Why process Big Data?
Some of the problems customers want to solve cannot
be solved by structured data alone

Understanding a customer's mindset before he or
she places an order through an online shopping cart

Weather pattern analysis to plan optimal wind turbine
usage by positioning them correctly

Identifying criminals and threats from unstructured video and
audio
How big is Big Data?
Variety of Data (1)

Structured Data - Data which is

Organized in structure

Easy for a computer to handle

Follows a pre-defined data model

Variety of Data (2)
Unstructured Data - Data which is

Not organized in structure

Not easy for a computer to handle

Does not follow a pre-defined data model
Data - Unstructured
Multiple analysts have estimated that data will grow
800x over the next five years

Out of which 80% would be unstructured

Growth of unstructured data is enormously high - it is
growing 10-50x faster than structured data
Traditional BI
BI with Big Data
What is the difference?

Big data has unstructured data as a source besides
the structured data

It has a lot of analytics tools to process it and apply
intelligence to it

Big Data supports complex machine learning
algorithms
Big Data VS Traditional BI

Big Data - Analyzes a variety of real-time information.
Answers: What's likely to happen?

Traditional BI - Analyzes transactional information.
Answers: What happened? Why did it happen?


Big Data Characteristics

Data growth challenges and opportunities are three-
dimensional, generally denoted as the 3Vs

Volume - amount of data

Velocity - speed of data

Variety - range of data types


Volume
The data that needs to be analyzed is huge, and/or the
analysis is very intense, hence a lot of hardware is needed

Machine-generated data is produced in much larger quantities
than non-traditional data

For instance, 10 TB of data can be generated by a single
jet engine in 30 minutes

With around 25,000 flights per day, the daily volume of this
single data source alone runs into the petabytes
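A rough back-of-the-envelope check of the jet-engine figure (assuming just one 10 TB, 30-minute capture per flight; real engine counts and flight durations vary, so this is a lower bound):

```python
# Back-of-the-envelope check of the jet-engine example.
# Assumption: each of the ~25,000 daily flights yields a single
# 10 TB engine capture (real aircraft have multiple engines and
# fly longer than 30 minutes, so the true figure is higher).
tb_per_capture = 10
flights_per_day = 25_000

daily_tb = tb_per_capture * flights_per_day   # 250,000 TB
daily_pb = daily_tb / 1024                    # TB -> PB

print(f"{daily_tb} TB/day is about {daily_pb:.0f} PB/day")
```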
Velocity
The data comes rapidly into the data management system and
frequently requires quick analysis or decision making

Even at 140 characters per tweet, the high frequency (or velocity) of
Twitter data ensures big volumes (over 8 TB of data per day)

E-Promotions: Based on your current location / purchase
history, predict what you will like -> send promotions right now for
the store next to you

Healthcare monitoring: sensors monitoring your activities and
body -> any abnormal measurements require immediate reaction
Variety
The data is not organized into regular patterns as in a
table; rather, images, text, highly varied structures,
or structures unknown in advance are typical

Various formats, types, and structures - text,
numerical, images, audio, video, sequences, time
series, social media data, multi-dimensional arrays,

As new marketing campaigns are executed, new sensors are
deployed, or new services are added, new data types are
needed to capture the resultant information
3Vs -> 4Vs
Assessing the need for big data (1)
Assessing the need for big data (2)
Processing Big Data

Processing of big data can be classified into two types

Batch based

Real-time data stream

Batch processing
Real time processing
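The two modes above can be contrasted with a minimal sketch (illustrative only; real systems use a batch framework such as Hadoop, or a dedicated stream processor, across many machines):

```python
# Batch: process a complete, bounded dataset in one pass.
def batch_average(readings):
    return sum(readings) / len(readings)

# Real time: update the result incrementally as each record arrives.
def streaming_average():
    count, total = 0, 0.0
    avg = None
    while True:
        value = yield avg          # receive the next reading
        count += 1
        total += value
        avg = total / count        # result is always up to date

readings = [3.0, 5.0, 10.0]
print(batch_average(readings))     # one answer at the end: 6.0

stream = streaming_average()
next(stream)                       # prime the generator
for value in readings:
    print(stream.send(value))      # running answers: 3.0, 4.0, 6.0
```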
Big Data Stack
Big Data Stack (1)
Storage

Stores structured as well as unstructured data

Integration

Integrates data from multiple sources

Advantages (over a single source) - More efficient, more accurate, better
insights

Analysis

Analyzes the data to gain greater insight into trends for business decisions

Identifies interesting relationships

Performs classification and clustering


Big Data Stack (2)
Visualization

Graphically presents the information extracted from the
analysis

Enables better understanding

Data Product

Develops applications and services based on the data
captured and stored and on the findings from the analysis
How to store?
Big Data Storage (1)

Structured data storage

Data with a fixed table schema

Stored in traditional relational databases (SQL
databases)

Handling large volume - Vertical scaling


Big Data Storage (2)

Unstructured data storage


Data does not have a fixed table schema

Stored in NoSQL databases

Mostly implemented as key/value pairs

Handling large volume - Horizontal scaling
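The key/value model can be illustrated with a minimal sketch (a plain Python dictionary stands in for a real NoSQL store; actual systems distribute keys across many machines):

```python
# Minimal key/value storage sketch: values need no fixed schema,
# so records of completely different shapes share one store.
store = {}

# put: associate an arbitrary value with a key
store["user:42"] = {"name": "Ada", "tags": ["vip"]}   # nested record
store["tweet:7"] = "Just 140 characters of text"      # plain string, same store

# get: look up by key
print(store["user:42"]["name"])
```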
Distributed File System
Massive data (structured as well as unstructured) is
stored across multiple machines in a distributed way

Distributed File System

Google File System (GFS)

A proprietary distributed file system from Google

Has large clusters of commodity hardware

Reliable and efficient access to data


Hadoop Distributed File System (HDFS)

Primarily used as the storage system for Apache Hadoop
applications

Distributed file system that runs on commodity hardware

Open source Java product

Highly fault-tolerant

Uses BASE properties - Basically Available, Soft state,
Eventually consistent
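A toy sketch of the chunk-and-replicate idea behind GFS/HDFS (node names and chunk size are made up for illustration; HDFS actually uses 64/128 MB blocks and a NameNode to track placement):

```python
# Toy distributed-file-system placement: split a file into fixed-size
# chunks and place each chunk on several nodes for fault tolerance.
from itertools import cycle

NODES = ["node1", "node2", "node3", "node4"]   # hypothetical cluster
CHUNK_SIZE = 4                                  # bytes; real DFSs use ~64-128 MB
REPLICATION = 3                                 # copies per chunk, as in HDFS defaults

def place(data: bytes):
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    placement = {}
    rotation = cycle(NODES)
    for idx in range(len(chunks)):
        # each chunk lives on REPLICATION distinct nodes
        placement[idx] = [next(rotation) for _ in range(REPLICATION)]
    return chunks, placement

chunks, placement = place(b"hello big data world")
print(len(chunks), "chunks")   # 5 chunks of <= 4 bytes each
print(placement[0])            # ['node1', 'node2', 'node3']
```

Losing any single node leaves two surviving copies of every chunk it held, which is the property that lets these systems run on unreliable commodity hardware.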

Amazon S3 (Simple Storage Service)

Amazon's proprietary online storage web service

The internal implementation is not disclosed, though it is
believed to be implemented as a key/value database
Big Data Analysis
Google's MapReduce

Framework for highly distributed processing among clusters or
grids

Capable of handling large file systems (unstructured) as well as
databases (structured)

Map: The master node partitions the input and distributes it
among the worker nodes

Reduce: The master node collects the results from the worker
nodes, combines them, and produces the output
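The Map and Reduce steps above can be sketched in a single process (a word-count toy; a real MapReduce framework runs the map and reduce functions on many worker nodes in parallel, with the shuffle done by the framework):

```python
from collections import defaultdict

# Map: each "worker" turns its input partition into (key, value) pairs.
def map_phase(partition):
    return [(word, 1) for line in partition for word in line.split()]

# Shuffle: group values by key (handled by the framework between phases).
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce: combine the values for each key into a final result.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

partitions = [["big data big"], ["data data"]]        # input split by the master
pairs = [p for part in partitions for p in map_phase(part)]
print(reduce_phase(shuffle(pairs)))                   # {'big': 2, 'data': 3}
```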
Big Data Technologies
Hadoop
Originally developed by Yahoo! as a clone of Google's MapReduce
HDFS (Hadoop Distributed File System) for reliable data storage
MapReduce for high performance - parallel data processing
Fault-tolerant in case of any changes or failures in the cluster of servers
Apache Hive
Runs Hadoop jobs using HiveQL (Hive's query language)
Can plug in MapReduce programs when HiveQL is inefficient

Apache Pig
Runs Hadoop jobs using a procedural data processing language
(Pig Latin)
