
Big Data Applications in Management
MDI Gurgaon
Session 1
Big Data?
As per IBM, we generate 2.5 quintillion bytes of data every day (2.5 × 10^18 bytes).
Big Data properties (now famous as the 4 Vs): Volume, Velocity, Veracity and Variety.
Besides, BD is usually unstructured and qualitative in nature. (Why is this a challenge?)
Usually anything above 1,000 TB (1 petabyte) can be considered BD.
BD needs to be not just organized but also utilized in various ways to be useful.
Hadoop Ecosystem
Figure retrieved from https://www.mssqltips.com/sqlservertip/3262/big-data-basics--part-6--related-apache-projects-in-hadoop-ecosystem/
Recommended: https://hadoopecosystemtable.github.io/
Why should managers know about Big Data?
Real World Examples
FMCG and retail companies observe social media sites; this data is used in analyzing consumer behaviour (social media analytics).
Manufacturers monitor real-time vibration data from machines to understand the state of a machine and the optimal time to replace it; they also monitor the social space for after-sales issues.
Advertising and marketing people track the responsiveness of campaigns across various media.
Insurance companies use Big Data to understand where to issue policies and settle claims in real time, and where visits may be required.
Hospitals work on analyzing patient data for readmission and length-of-stay prediction.
Governments are adding more data through Open Data and Open Government frameworks; enabling participation by the general public in decision making again requires massive data management.
Lastly, sports teams are increasingly looking at managing data on a large number of teams, players, competitors etc.
Types and Sources of Data

Type               | Description          | Source
Social data        | Unstructured data    | Facebook, Twitter, LinkedIn
Audio-video data   | Unstructured data    | CCTVs or any other source
Machine data       | Semi-structured data | RFID chips, GPS systems, weblogs, XML, JSON etc.
Transactional data | Structured data      | PoS data, DB data, others
External and Internal Data

Data Source | Definition                                                          | Sources
Internal    | Structured data that is present within the organization             | CRM, ERP, OLTP etc.
External    | Unstructured data from the external environment of the organization | Internet, Govt, market research, formal data suppliers etc.
Challenges of unstructured data
What to process?
How to process?
How to draw any conclusions/insights?
How to present?
How to triangulate with other data for better impact?
Could be beneficial to various sectors: health, government, travel, education etc. Give examples?
Interesting Example
CCTV footage in supermarkets is used to analyze the routes customers take to move through the store.
This is combined with structured data, such as bills, to arrive at deeper insights about how consumers behave in a specific setting.
Both data forms have their own value.
Types of Analytics
Descriptive Analytics: information regarding past and current business events; helps plan the future course of action; involves frequency, mean, sd etc. Helpful in root cause analysis.
Predictive Analytics: the use of statistics, data mining techniques or machine learning to predict future events or patterns based on the past, as well as what-if analysis.
Prescriptive Analytics: the optimized or best possible course of action as determined by the data.
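Descriptive analytics can be sketched in a few lines of plain Java. The daily sales figures below are invented purely for illustration:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Descriptive analytics on a toy sample: frequency, mean and standard
// deviation of a week of (made-up) daily sales figures.
public class DescriptiveStats {

    // Frequency: how often each value occurs
    static Map<Double, Integer> frequency(double[] xs) {
        Map<Double, Integer> freq = new HashMap<>();
        for (double x : xs) freq.merge(x, 1, Integer::sum);
        return freq;
    }

    // Arithmetic mean
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(0);
    }

    // Population standard deviation
    static double sd(double[] xs) {
        double m = mean(xs);
        double variance = Arrays.stream(xs)
                .map(x -> (x - m) * (x - m))
                .average().orElse(0);
        return Math.sqrt(variance);
    }

    public static void main(String[] args) {
        double[] dailySales = {120, 150, 150, 90, 200, 150, 110};
        System.out.println("frequency: " + frequency(dailySales));
        System.out.printf("mean = %.2f, sd = %.2f%n",
                mean(dailySales), sd(dailySales));
    }
}
```

Predictive and prescriptive analytics build on exactly these summaries, adding models (regression, classification) and optimization on top.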
Distributed and Parallel computing
Multiple computing resources set up in a network, computing
tasks are divided and distributed across the network to achieve
end result.
When computation itself is peculiarly complex, multiple
processors may divide it into sub tasks and execute the same in
parallel.
Not every problem may be efficiently parallelized!!
In general a distributed system (scale-out) is more cost-
effective as compared to a high end or scale-up system
e.g. 4 TB data being transferred to a single high end machine
vs 100 commodity machines @100Mbps
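The 4 TB example can be made concrete with back-of-the-envelope arithmetic. This sketch assumes each machine has its own 100 Mbps link, the data splits evenly, and decimal units (1 TB = 10^12 bytes):

```java
// Rough transfer-time comparison: 4 TB pulled by one scale-up machine vs.
// split evenly across 100 commodity machines, each on its own 100 Mbps link.
public class TransferTime {

    static double hoursToTransfer(double terabytes, double mbps, int machines) {
        double bits = terabytes * 1e12 * 8;       // total data in bits
        double bitsPerMachine = bits / machines;  // even split across machines
        double seconds = bitsPerMachine / (mbps * 1e6);
        return seconds / 3600.0;
    }

    public static void main(String[] args) {
        // One machine: roughly 89 hours. 100 machines in parallel: under an hour.
        System.out.printf("1 machine:    %.1f hours%n", hoursToTransfer(4, 100, 1));
        System.out.printf("100 machines: %.1f hours%n", hoursToTransfer(4, 100, 100));
    }
}
```

The two-orders-of-magnitude gap is why scale-out wins for data-intensive work even though any single commodity machine is far slower than the high-end box.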

Parallel Computing Techniques
Cluster / grid computing
Massively Parallel Processing (MPP)
High Performance Computing (HPC)
Some Differences

Distributed computing:
An independent autonomous system connected in a network for accomplishing a variety of tasks.
Coordination is made possible between connected computers that have their own memory and CPU.
Loose coupling between resources.

Parallel computing:
A computing system with several processing units attached to it.
A common shared memory can be directly accessed by every compute unit in the network.
Tight coupling of resources, used for solving a single complex problem.
Problems resolved by Distributed
and Parallel Systems
Failure of resources
Latency issues

Slight Digression: Cloud and Big Data
What is the significance of the two being put together?
Cloud computing characteristics
Types of cloud
Cloud delivery models (IaaS, PaaS, SaaS)
Major vendors
Virtualization and BD
Given the data deluge and the resource intensiveness of BD, virtualization helps bring in efficiency of resources.
Virtualization allows running multiple OSs on a single machine, each with dedicated computing resources.
Virtualization features include logical partitioning, isolation and encapsulation.
Types of virtualization:
Server virtualization: a single server is divided into multiple virtual servers.
Application virtualization: encapsulating an application, making it agnostic to the physical machine.
Network virtualization: a single physical network is divided into multiple networks with differing capacities and performance requirements.
Also storage, data, processor and memory virtualization.
History of Hadoop
Apache Lucene was being developed by Doug Cutting for text indexing and searching at various levels of aggregation (desktop, enterprise etc.).
Nutch was another project with Lucene as its core component, aiming to be an open web search engine.
However, the scalability that Nutch required was difficult to achieve with commodity hardware.
Google published its work on GFS (2003) and MapReduce (2004).
Doug Cutting was able to develop open-source versions of these, and Hadoop took shape.
Yahoo! saw the immense value and offered further resources, which led to Hadoop becoming a major Apache project.
Yahoo! was the largest user of Hadoop; Doug Cutting subsequently joined Cloudera to further develop a mature version of Hadoop.
Distributed Systems and Hadoop
In the earlier example, the 4 TB of data, if read into a Hadoop system, will be read as blocks of 128 MB.
An often-cited comparison of Hadoop is with SETI@home.
The Hadoop philosophy differs from that of other distributed systems, which work on compute-intensive applications.
In Hadoop, the code travels to the data blocks of significance instead of the data travelling to another machine (as much as possible).
Thus Hadoop is especially suited for data-intensive work.
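The block arithmetic is easy to check. A minimal sketch, assuming binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and the 128 MB block size from the slide:

```java
// How many 128 MB blocks does a 4 TB file occupy in HDFS?
public class BlockCount {

    static long numBlocks(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;  // ceiling division
    }

    public static void main(String[] args) {
        long fileSize = 4L << 40;     // 4 TB
        long blockSize = 128L << 20;  // 128 MB
        System.out.println(numBlocks(fileSize, blockSize) + " blocks");  // 32768 blocks
    }
}
```

Each of those 32,768 blocks can be processed on whichever node stores it, which is what makes "code travels to the data" possible.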
Relational DB and Hadoop
SQL-based access assumes structured data, and relational DB systems are optimized for it (typically scale-up systems).
Relational DBs are based on schemas with formal properties and do not support files such as XML or text. (Besides, SQL is a declarative language, dependent on the DB engine.)
Hadoop processing is based on key-value pairs.
Lastly, processing in Hadoop is based on MapReduce, which is more generic and extendable, enabling one to build complex models with the input data.
However, for working with relational data, using Hadoop may not be a good idea!
Hadoop is for write-once, read-many-times (WORM) workloads, unlike DBs.
Hadoop Summary
An open-source platform that provides a distributed computational environment for handling Big Data.
Connects many nodes without shared memory or disk.
Master-slave architecture, with the master distributing tasks to the slaves.
Data stored across different nodes can be accessed and retrieved.
Handles both heavy computational load and a variety of tasks with ease.
Resilient and fault tolerant.
Operating in Hadoop
Hadoop clusters are created from racks of commodity machines.
Dynamic addition and deletion of nodes based on load, together with failure detection, is possible.
Tasks are distributed across the nodes, which work independently and provide their results to the initiating node.
The two primary components of Hadoop are HDFS and Hadoop MapReduce.
In MapReduce, the mapper is responsible for breaking up the task and the reducer is responsible for aggregating the responses into the final result.
MapReduce Philosophy
Scaling a simple program. Pseudocode for word count:

define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
display(wordCount);
To speed up this program over a distributed network of machines, rework it in two parts.

First phase (runs on each machine, over its share of the documents):
define wordCount as Multiset;
for each document in documentSet {
    T = tokenize(document);
    for each token in T {
        wordCount[token]++;
    }
}
sendToSecondPhase(wordCount);

Second phase:
define totalWordCount as Multiset;
for each wordCount received from firstPhase {
    multisetAdd(totalWordCount, wordCount);
}
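The two phases above can be simulated end-to-end on a single machine in plain Java; each document stands in for one first-phase machine's share, and phase two plays the single aggregator:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java simulation of the two-phase word count: phase one builds a
// per-document multiset (a word -> count map); phase two merges all the
// partial multisets into the total, as the aggregator would.
public class TwoPhaseWordCount {

    // Phase one: word count for a single document
    static Map<String, Integer> phaseOne(String document) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String token : document.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) wordCount.merge(token, 1, Integer::sum);
        }
        return wordCount;
    }

    // Phase two: multisetAdd over every partial result received
    static Map<String, Integer> phaseTwo(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> wc : partials) {
            wc.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> documentSet = Arrays.asList(
                "big data needs big systems",
                "hadoop handles big data");
        List<Map<String, Integer>> partials = documentSet.stream()
                .map(TwoPhaseWordCount::phaseOne)
                .collect(Collectors.toList());
        System.out.println(phaseTwo(partials));  // e.g. big=3, data=2, ...
    }
}
```

In real Hadoop the two methods correspond to the Mapper and Reducer, and the framework, not our code, handles moving the partial results between them.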
The Problems
If all the data is present in a single machine, transferring it is a huge bandwidth and server-capacity constraint, so splitting up the documents is required for improved efficiency.
The second issue is space complexity, as the entire output may not fit in RAM (why?); we need a disk-based hash table for storage.
If there is a single output aggregator, we are again stuck, as phase two will be a bottleneck, so we need to ensure it can be distributed as well. This requires smart partitioning and shuffling.
For example, the word counts for each letter of the alphabet may be assigned to a single machine; thus 26 machines in phase two help us out.
In addition, we need to take care of fault tolerance and many other data-transfer issues.
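The 26-machine idea amounts to a partition function. A sketch (note that real MapReduce frameworks hash the whole key rather than using the alphabet, since letter frequencies are skewed and would load the 26 machines unevenly):

```java
// Route each word to one of 26 phase-two "machines" by its first letter,
// so no single aggregator becomes the bottleneck.
public class AlphabetPartitioner {

    static int partitionFor(String word, int numPartitions) {
        char first = Character.toLowerCase(word.charAt(0));
        if (first >= 'a' && first <= 'z') {
            return (first - 'a') % numPartitions;
        }
        // Non-alphabetic keys fall back to a hash-based assignment
        return Math.floorMod(word.hashCode(), numPartitions);
    }

    public static void main(String[] args) {
        for (String w : new String[]{"apple", "Banana", "zebra"}) {
            System.out.println(w + " -> machine " + partitionFor(w, 26));
        }
    }
}
```

Shuffling is then the act of moving every (word, count) pair from the phase-one machines to the phase-two machine its partition function selects.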

Pre-Requisite Completion
Group-wise presentations for the next 2 weeks.
Java topics:
Objects, constructors, abstract classes, interfaces, final, static
Arrays, lists, overloading
Polymorphism
I/O and file I/O
Strings
Inner classes and HashMaps
Any other issues wrt Eclipse usage
Assignment
Individual work:
In-memory computing???
What is Google Query, pros and cons?
What is Google Big Table, pros and cons?
What is data and storage virtualization?
Hyper-V technology, Intel VT-x?
Could be expected in a quiz!
Group work:
Examples of Big Data in industry, one by each group of five-six students, in depth.
To be presented and submitted by the end of the term.
If a PoC project is to be chosen, it should be discussed by the 5th session and submitted by the end of the course.
Thanks

References
Lam, C., Hadoop in Action, Dreamtech Press, Delhi.
Big Data Black Book, Dreamtech Press, Delhi.
Professional Hadoop, Wrox Press, Delhi.
