
Migrating Traditional Data Warehouse to Hadoop Data Warehouse
- ANKUR KULSHRESHTHA (2015HT12284)
Why do we need to store old data?
 Why should an organization make huge investments in storing old data?
 Is it really worth the effort and investment to store old data?
 How much old data should an organization preserve?
 How long should an organization keep old data?
 How should old data be preserved?
Old data gives a wealth of information
 Old data provides a platform for various business insights, such as:
• Reporting
• Analytics
• BI
• Decision support
• Data mining
• Predictive analysis
• Identifying and mitigating risks in business
How to store old data
 Old data is stored in systems known as data warehouses.
 A data warehouse is a repository of historical data generated by an organization over a long period of time.
 This huge collection of data helps in performing data analytics to obtain deep insight into the business.
 Even though an initial investment is required to set up a data warehouse, the returns a business can get from the vast sea of knowledge residing inside it are manifold.
Data Warehouse platform
 For some decades now, traditional data warehouses have been built over relational databases.
 The popular RDBMS choices for smaller volumes of data are MS SQL Server, Oracle, and PostgreSQL.
 For much higher volumes of data, Teradata was the preferred choice.
Is the traditional data warehouse solving all our problems?
Data is the new Oil
 We are in a digital economy where data is more valuable than ever.
 Nowadays, more and more digital data is being generated from various sources than ever before.
 If this huge amount of data can be properly extracted and used, it can yield benefits for various sectors and for society.
 There is a special term for this type of data – Big Data.
Big Data
 In layman's terms, Big Data refers to the very large amounts of digital data being generated from various sources such as the Internet, social media, smartphones, sensors, etc.
 Big Data can be structured, semi-structured, or completely unstructured.
 Big Data is so huge that traditional platforms and software are inadequate to handle the complexity associated with it.
Characteristics of Big Data
 The characteristics and challenges of Big Data are defined by the 3 Vs:
• Volume
• Velocity
• Variety
Characteristics of Big Data
Volume
We are living in an era where data is being generated in all aspects of day-to-day life. From social network activity and purchasing groceries in shops to booking cabs online, we are generating data everywhere.
The sheer volume of data being generated is overwhelming – more than 90% of all the data ever created was generated in the last two years.

Velocity
Velocity signifies the speed at which data is being generated, and this speed is unimaginable. Every minute, 100 hours of video are uploaded to YouTube. In addition, every minute over 200 million emails are sent, around 20 million photos are viewed and 30,000 uploaded on Flickr, almost 300,000 tweets are sent, and almost 2.5 million Google queries are performed.

Variety
Traditionally, the data generated was mostly structured data, and structured data in rows and columns is easy to process. Now, however, data comes in various formats – structured, semi-structured, unstructured, and complex structures. Dealing with such a variety of data at this scale is a challenge in itself.
Challenges of Big Data on traditional platforms
Traditional data warehouses of previous decades were built on relational databases and face the following challenges:
 Data storage – Though traditional data warehouses handle structured data well, they still fail with Big Data because the sheer volume and velocity of the data generated is too much for traditional systems to handle.

 Scalability and cost – Traditional data warehouses do not scale easily to Big Data volumes. Scalability, if achievable at all, comes at huge expense.

 Low retention of data – Because of limited scalability and the huge expenses associated with it, organizations have to classify data by its value and remove the less valuable data over time.

 Inflexibility due to schema on write – Traditional RDBMS data warehouses enforce schema on write, meaning the data being written to the system must conform to the constraints and rules defined by the system. Though this provides clean data, it prevents data warehouse systems from tapping new types of data easily.

 Less computational ability – As the data in an RDBMS data warehouse keeps growing to very large magnitudes, it becomes more and more difficult to carry out large computations with traditional capacity.
How to deal with the Big Data challenge?
We need to move away from traditional ways of working with data warehouses, and the answer is Hadoop.
What is Hadoop
 Hadoop is an open-source Java framework for storing and processing big data sets over a distributed network of commodity hardware (a cluster).
 Hadoop is so widely used nowadays that it has become synonymous with Big Data.
Components of Hadoop
The main components of Hadoop (2.0) are:
 Hadoop Distributed File System (HDFS) – The highly resilient distributed file system of Hadoop that spans a distributed network of commodity machines.

 Hadoop MapReduce – A parallel programming model for computing over very large data sets efficiently (a minimal sketch follows this slide).

 Hadoop YARN – The resource manager that manages computing resources in the cluster and schedules jobs on it.

 Hadoop Common – Libraries and utilities needed by the other Hadoop modules.
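
As a concrete illustration of the MapReduce model (not part of the original deck), here is the classic word-count job written against the Hadoop 2.x Java API; the HDFS input and output paths are supplied on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in the HDFS input split it is given.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");   // YARN schedules this job on the cluster
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}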
Advantages of Hadoop Data Warehouse over traditional RDBMS Data Warehouse
 Scalable – Hadoop is a highly scalable platform to store and process Big Data. It stores data over a cluster of inexpensive computers in a distributed network, and the simplest way to scale is to add more computers to the cluster. This is a huge advantage compared to storing data in traditional RDBMS systems, which are not easily scalable.
 Cost-effective – As Hadoop runs on clusters of commodity machines, it is very cost-effective to set up a Hadoop cluster. Moreover, as scaling only requires adding commodity hardware, it is very cost-effective to scale as well. This was not the case with traditional systems, in which setting up and scaling a data warehouse came at a huge price.
Advantages of Hadoop Data Warehouse over traditional RDBMS Data Warehouse
 Flexibility due to schema on read – In contrast to traditional RDBMS systems, Hadoop relies on the schema-on-read concept. It enforces no rules or constraints when accepting data from various sources: data can be saved in raw form and verified later, while reading it during analysis or computation. Because of this approach, tapping new data from various sources is very easy and less time-consuming (see the sketch after this list).
 Fast computation of Big Data – With its distributed storage system (HDFS) and parallel programming model (MapReduce), Hadoop can process terabytes and petabytes of data relatively quickly. A traditional RDBMS cannot match this processing speed over such huge data sets.
 Resilient to faults – Hadoop has been designed to tolerate failure. Each data block is replicated (three copies by default) across nodes in the cluster to avoid loss of data due to node failure.
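
A minimal sketch of schema on read, assuming HiveServer2 is reachable on its default port and that raw CSV files have already been landed in HDFS; the table name, column names, and path are illustrative, not taken from the original deck:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SchemaOnReadExample {
  public static void main(String[] args) throws Exception {
    // Assumes HiveServer2 on localhost:10000 with the Hive JDBC driver on the classpath.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // The raw files under /data/raw/orders were written with no upfront validation;
      // the schema below is applied only when the data is read at query time.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS orders_raw ("
        + "  order_id BIGINT, customer_id BIGINT, amount DOUBLE, order_ts STRING) "
        + "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' "
        + "STORED AS TEXTFILE "
        + "LOCATION '/data/raw/orders'");
    }
  }
}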
Hadoop Ecosystem
 Hadoop initially started as a pure Java-based framework and was usable only by people who knew Java. Over the last decade, however, many tools and technologies have emerged in which the underlying framework is still Java MapReduce, but users need not know Java to work with them.
 These tools are collectively known as the Hadoop ecosystem.
Hive
 Many professionals are not familiar with Java and cannot write complex MapReduce programs.
 For such professionals, Hive was created at Facebook.
 Hive is a platform for building a data warehouse of structured data.
 We can run SQL queries on Hive, but underneath it is a MapReduce operation that fetches the results.
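
A minimal sketch of querying Hive from Java over JDBC, reusing the hypothetical orders_raw table from the earlier example; the connection URL and credentials are assumptions for a local, unsecured HiveServer2:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hive compiles the SQL below into one or more MapReduce jobs behind the scenes.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT customer_id, SUM(amount) AS total "
           + "FROM orders_raw GROUP BY customer_id ORDER BY total DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getLong("customer_id") + "\t" + rs.getDouble("total"));
      }
    }
  }
}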
Sqoop
 Sqoop is a popular tool to move data between RDBMS systems and Hadoop.
 It has a command-line interface and requires no knowledge of Java.
 Sqoop is the future of ETL in the Big Data era.
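
Sqoop is normally run from the command line, but it also exposes a Java entry point; the sketch below assumes Sqoop 1.x (whose org.apache.sqoop.Sqoop class provides runTool) and uses placeholder connection details, table names, and paths:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
  public static void main(String[] args) {
    // Equivalent to running `sqoop import ...` on the command line;
    // all connection details and names below are illustrative placeholders.
    String[] sqoopArgs = new String[] {
        "import",
        "--connect", "jdbc:oracle:thin:@//ods-host:1521/ODS",
        "--username", "etl_user",
        "--password-file", "/user/etl/.ods_password",
        "--table", "ORDERS",
        "--target-dir", "/data/raw/orders",
        "--num-mappers", "4"
    };
    System.exit(Sqoop.runTool(sqoopArgs));
  }
}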
Oozie
 Oozie is a server-based workflow scheduling system to manage Hadoop jobs.
 It helps streamline the execution of jobs as per the requirements of the system.
 It can manage Pig, Hive, and Java jobs, to mention a few.
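
A minimal sketch of submitting a workflow through the Oozie Java client; the server URL, HDFS paths, and property values are assumptions, and the workflow.xml is presumed to be already deployed to HDFS:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieSubmitExample {
  public static void main(String[] args) throws Exception {
    // Assumes an Oozie server at the URL below.
    OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/etl/workflows/ods-sync");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "resourcemanager:8032");

    // Submit and start the workflow; Oozie returns a job id we can poll for status.
    String jobId = oozie.run(conf);
    System.out.println("Submitted Oozie workflow: " + jobId);
  }
}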
Impala
 Impala is a massively parallel processing SQL engine on Hadoop that works very fast on huge data sets.
 Unlike Hive, Impala does not use MapReduce; rather, it works in a distributed environment using its own daemons.
 Compared to Hive, Impala queries execute at lightning speed even on very large data sets.
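
A minimal sketch of querying Impala from Java; this assumes an Impala daemon accepting HiveServer2-protocol JDBC connections on its default port (21050) on an unsecured test cluster, and reuses the hypothetical orders_raw table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
  public static void main(String[] args) throws Exception {
    // Host, port, and auth settings are assumptions for an unsecured test cluster.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/default;auth=noSasl");
         Statement stmt = conn.createStatement();
         // Same table as the Hive example, but executed by Impala daemons, not MapReduce.
         ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM orders_raw WHERE amount > 1000")) {
      while (rs.next()) {
        System.out.println("Matching orders: " + rs.getLong(1));
      }
    }
  }
}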
Implementing a Hadoop Data Warehouse
 Our traditional RDBMS data warehouse is built on an Oracle database, as are the source system and the ODS.
 Materialized views are built in the ODS over the source system, and the ODS is refreshed at a very high frequency so as to provide near-real-time data.
 Our RDBMS data warehouse is loaded with data via the ODS.
Implementing a Hadoop Data Warehouse
 We use Sqoop to import data from the ODS and the RDBMS data warehouse into HDFS; this data is then loaded into the Hive database (a sketch of the import step follows this slide).
 This activity is managed by an Oozie workflow scheduled to sync the ODS and RDBMS data warehouse data into Hive on a periodic basis.
 The data in Hive can then be queried with Hive SQL or Impala SQL.
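
A hedged sketch of the sync step that the Oozie workflow could invoke on each run, again assuming Sqoop 1.x; the connection string, source table, Hive table, and check column are illustrative placeholders, not the actual names used in the POC:

import org.apache.sqoop.Sqoop;

public class OdsToHiveSync {
  public static void main(String[] args) {
    // Incremental import from the ODS straight into a Hive table.
    String[] sqoopArgs = new String[] {
        "import",
        "--connect", "jdbc:oracle:thin:@//ods-host:1521/ODS",
        "--username", "etl_user",
        "--password-file", "/user/etl/.ods_password",
        "--table", "ORDERS",
        "--hive-import",                 // load into Hive instead of plain HDFS files
        "--hive-table", "dw.orders",
        "--incremental", "append",       // only pull rows added since the last run
        "--check-column", "ORDER_ID",
        "--last-value", args.length > 0 ? args[0] : "0"
    };
    System.exit(Sqoop.runTool(sqoopArgs));
  }
}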
Query performance in RDBMS and Hadoop
 We ran the same queries on 500 MB of identical data in the RDBMS and in Hive; in Hive, we ran each query with both HQL and Impala. We obtained the following average results:

Query Type               | RDBMS Query | Hive Query | Impala Query
Pattern search with LIKE | 7 secs      | 11 secs    | 1 sec
Sorting of data          | 34 secs     | 32 secs    | 3 secs

 From these results we can observe the following:
• For a small data set, a Hive query may not perform better than a normal RDBMS query.
• Even for a small data set, Impala queries give the best, lightning-fast performance.
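
The deck does not reproduce the exact benchmark statements; the pair below is only an illustration of the two query shapes compared in the table, written against the hypothetical orders_raw table used in the earlier examples:

public class BenchmarkQueries {
  // Illustrative only: the original benchmark statements are not listed in the deck.
  static final String PATTERN_SEARCH =
      "SELECT * FROM orders_raw WHERE order_ts LIKE '2016-01%'";
  static final String FULL_SORT =
      "SELECT * FROM orders_raw ORDER BY amount DESC";
}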
Summary
 In this dissertation we saw why Big Data is the big thing today and why our traditional RDBMS-based data warehouses are not capable enough to support it.
 We then saw how Hadoop offers technologies that are well equipped to embrace Big Data.
 We then did a POC by plugging a Hadoop instance into our existing data warehouse system and synchronizing the Hive database with the source system using Sqoop and Oozie.
 We also ran queries on 500 MB of data to do a comparative study of performance between RDBMS, Hive, and Impala queries.
Future Work
 We carried out POC work on how to implement Hadoop to build a data warehouse system. However, with limited hardware we could not explore further to see the real power and capabilities of Hadoop.
 We can extend this POC to a fully fledged Hadoop-based data warehouse if we can get proper hardware and storage capacity.
