Data Warehouse to Hadoop
- ANKUR KULSHRESHTHA (2015HT12284)
Why do we need to store old data?
Why should an organization make huge investments in storing old data?
Is it really worth the effort and investment to store old data?
How much old data should an organization preserve?
How long should an organization keep old data?
How should old data be preserved?
Old data gives a wealth of information.
Characteristics of Big Data
• Volume
• Velocity
• Variety
Volume
We are living in an era where data is generated in every aspect of day-to-day life.
From social network activity and buying groceries in shops to booking cabs online,
we generate data everywhere.
The sheer volume of data being generated is overwhelming: more than 90% of all the data
ever created was generated in the last two years.
Velocity
Velocity signifies the speed at which data is generated, and that speed is hard to imagine.
Every minute, 100 hours of video are uploaded to YouTube. In addition, every minute over 200
million emails are sent, around 20 million photos are viewed and 30,000 uploaded on Flickr,
almost 300,000 tweets are sent, and almost 2.5 million Google queries are performed.
Variety
Traditionally, most of the data generated was structured. Structured data in rows and
columns is easy to process.
But the data generated now comes in various formats: structured, semi-structured,
unstructured, and complex structures.
Dealing with such a variety of data at this scale is a challenge in itself.
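As a minimal sketch of what this variety looks like in practice (not from the original
slides; the record contents and variable names are hypothetical), the same login event
might arrive in all three forms:

# Structured, semi-structured, and unstructured views of one event.
import csv, json, io, re

# Structured: fixed rows and columns, trivially mapped to a table.
structured = io.StringIO("user_id,event\n42,login\n")
rows = list(csv.DictReader(structured))

# Semi-structured: self-describing, but with an irregular shape.
semi_structured = json.loads('{"user_id": 42, "event": "login", "tags": ["mobile"]}')

# Unstructured: free text; structure must be inferred, e.g. with a regex.
unstructured = "User 42 logged in from a mobile device."
match = re.search(r"User (\d+) logged in", unstructured)
user_id = int(match.group(1)) if match else None

print(rows[0]["event"], semi_structured["tags"], user_id)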
Challenges of Big Data on traditional platforms
Scalability and cost - Traditional data warehouses do not scale easily to Big Data volumes.
Scalability, where it is possible at all, comes at huge expense.
Low retention of data - Because of limited scalability and the huge expense involved,
organizations have to classify data by its value and discard the less valuable data
over time.
Inflexibility due to schema on write - Traditional RDBMS data warehouses enforce schema
on write: data written into the system must conform to the constraints and rules the
system imposes. While this yields clean data, it keeps the warehouse from easily tapping
new types of data (see the sketch after this list).
Less computational ability - As the data in an RDBMS data warehouse grows to very large
magnitudes, it becomes more and more difficult to carry out heavy computations with
traditional capacity.
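To make the schema-on-write point concrete, here is a minimal Python sketch (not from
the original slides; the column names and function names are hypothetical) contrasting
a store that rejects nonconforming records at write time with a schema-on-read store
that keeps raw data and imposes structure only when it is queried:

REQUIRED_COLUMNS = ("user_id", "event", "timestamp")

def validate_on_write(record: dict) -> dict:
    """Schema on write: reject any record that violates the schema
    before it is stored, as a traditional RDBMS warehouse does."""
    for column in REQUIRED_COLUMNS:
        if column not in record:
            raise ValueError(f"missing column: {column}")
    return record

def parse_on_read(raw_line: str) -> dict:
    """Schema on read: store the raw line untouched and impose
    structure only when the data is read for a query."""
    fields = raw_line.rstrip("\n").split(",")
    # Missing fields are tolerated; interpretation happens at query time.
    return dict(zip(REQUIRED_COLUMNS, fields))

# A record missing its timestamp is rejected at write time ...
try:
    validate_on_write({"user_id": "42", "event": "login"})
except ValueError as err:
    print("rejected at write time:", err)

# ... but the schema-on-read path still stores and reads it.
print("accepted at read time:", parse_on_read("42,login"))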
How to deal with the Big Data challenge?
We need to move away from the traditional way of working with data warehouses, and the
answer is Hadoop.
What is Hadoop?
Hadoop is an open-source framework for distributed storage and distributed processing of
very large data sets across clusters of commodity hardware. Its core consists of HDFS,
the Hadoop Distributed File System, for storage and MapReduce for parallel computation.
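As a minimal illustration of the MapReduce model (not taken from the slides), here is the
classic word count written for Hadoop Streaming in Python; the file names and job paths
below are placeholders:

#!/usr/bin/env python3
# mapper.py - emits a (word, 1) pair for every word read from stdin.
# Hadoop Streaming feeds each input split to this script line by line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        # Key and value are tab-separated, one pair per line.
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - sums the counts for each word.
# Hadoop Streaming sorts mapper output by key before the reducer runs,
# so all pairs for one word arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The two scripts could then be submitted with the standard Hadoop Streaming jar, for
example: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out
-mapper mapper.py -reducer reducer.py (the input and output paths are hypothetical).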