
Hadoop 101

Learn the Hadoop Fundamentals


Ian Andrews | iandrews@gopivotal.com
Director Advanced Technology
Pivotal Incorporated


HADOOP 101

Purpose
After completing this module, you will be able to:
1. Discuss the evolution of Data Platforms and why Hadoop was created
2. Discuss the purpose, functionality, and value of Hadoop
3. Describe the various Hadoop components
4. Discuss some of the most common use cases for Hadoop


HADOOP 101

Agenda

1. What is Hadoop
2. The Evolution of Data Platforms
3. How Hadoop Is Being Used Today
4. Resources and Key Takeaways


Part 1:
What Is Hadoop?


What is Hadoop?
A framework that allows for distributed processing of large
data sets across clusters of commodity servers
Stores large amounts of data
Processes the large amounts of data stored

Inspired by Google's MapReduce and Google File System (GFS) papers

Apache open source project
Initial work done at Yahoo! starting in 2005
Became a top-level Apache project in 2008; there is now a very active
open source community


What is Hadoop?
Two Core Components

HDFS: scalable storage in the Hadoop Distributed File System

MapReduce: compute via the MapReduce distributed processing platform

Storage and compute in one framework


Open Source Project of the Apache Software Foundation
Java-intensive programming required
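
To make the MapReduce model concrete, here is a minimal word-count job written against the standard Apache Hadoop Java API (org.apache.hadoop.mapreduce). It is an illustrative sketch, not part of any Pivotal product: the mapper emits a (word, 1) pair for every token it sees, and the reducer sums the counts for each word. Input and output HDFS paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner cuts down shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The compiled class is packaged into a jar and submitted with something like "hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output" (paths are illustrative); the framework splits the input across the cluster, schedules map and reduce tasks close to the data in HDFS, and transparently re-runs tasks if a node fails.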


Part 2:
The Evolution of Data
Platforms
How it all began


First there was the Data Warehouse


A new architecture to host data from
multiple sources to support decision-making
Why the data warehouse exists:
Centralization of high-value data
Tools to process data into information
Highly regulated environment



Then the MPP database was introduced


A new approach to the database was
required to handle the new analytics
environment
Why the MPP database exists:

Data got larger
Queries got uglier
Performance became critical
R/SAS/statistical languages needed to
run in-database


Then things changed


The Internet age introduced the ability to
track interactions rather than just
transactions
The cost of traditional platforms was too
high to store and process this new data
Conventional databases were not able to
perform at scale


Drivers of this change: new data streams, new delivery platforms,
new deployment models and languages, expanding data volumes,
greater cost pressures, and increasing customer expectations

Now there is Hadoop


Traditional systems weren't built
to handle the storage and
processing needs of Web 2.0
Why Hadoop exists:
Data volumes moved to the
Petabyte range
Raw (unstructured) forms of
data needed to be
processed
Cost needed to be low
Processing must scale with
storage

The Hadoop Opportunity


Internet age + exploding data growth
Enterprises increasingly interested in leveraging new data sources
quickly:
Understand granular customer behavior
Spot emerging trends
Identify new opportunities, etc.

Traditional database tools are not able to cope

Weren't built for big data use cases
Lack scale, are not cost-effective, and impose rigid data structures

Need for a new approach: Hadoop


Why is Hadoop Important?


1. Hadoop reduces the cost of storing and processing data to the point that
keeping all data indefinitely is suddenly a very real possibility, AND
that cost is halving every 18 months
2. MapReduce makes developing and executing massively parallel data
processing tasks trivial compared to historical alternatives (e.g., HPC /
grid)
3. The schema-on-read paradigm shifts typical data preparation complexity
to the analysis phase rather than the acquisition phase (illustrated below)
4. For the modern CIO, BIG DATA = HADOOP. Don't underestimate
the irrational exuberance of the market
The cost and effort to consume and extract value from data have been
fundamentally changed
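
A minimal sketch of the schema-on-read idea from point 3, using the standard Hadoop FileSystem Java API (the local and HDFS paths are hypothetical): raw files are landed in HDFS exactly as they arrive, with no cleansing or schema applied at load time, and structure is imposed only later by whatever job reads them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // handle to the cluster's HDFS

    // Schema on read: land the raw, unmodified file in HDFS now...
    fs.copyFromLocalFile(
        new Path("/var/log/app/clicks-2013-06-01.log"),  // hypothetical local file
        new Path("/data/raw/clickstream/"));             // hypothetical HDFS directory

    // ...and decide how to interpret it later, e.g. inside a mapper:
    //   String[] fields = value.toString().split("\t");  // schema applied at analysis time
    fs.close();
  }
}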


How is Hadoop unique?


Handles large amounts of data
Stores data in native format
Delivers linear scalability at low cost
Resilient in case of infrastructure failures
Transparent application scalability

Enterprises can gain a competitive advantage through the
adoption of big data analytics


Common Apache Hadoop Components


Evolution of the Commercial Hadoop Market

Apache Hadoop is open-source software managed by the Apache Software Foundation and
available as a free download at Apache.org.

A number of commercial businesses have been created to provide packaged Hadoop
distributions containing integrated subcomponents of Hadoop and other open source
projects, plus education, support, services, and Hadoop management utilities to make the
technology easier to implement and use.

Cloudera was the first commercial vendor to enter the space, in 2008, founded by early Hadoop technologists
who later brought in Doug Cutting as chief architect.
Other commercial vendors of Hadoop software include Hortonworks (2011, out of Yahoo!), IBM (InfoSphere
BigInsights), MapR (2009), and DataStax (2010, Hadoop and Cassandra).

Organizations that are evaluating Hadoop are typically also looking at NoSQL
databases such as Cassandra, MongoDB, and CouchDB, as well as hosted offerings such as Amazon EMR.

They might also be evaluating scale-out file systems to use for storage, such as Isilon
or GlusterFS. These systems mirror Hadoop's scale-out architecture and are also
capable of handling the volume and unstructured nature of data that could be stored in
Hadoop.


Greenplum's Entry into the Hadoop Market


Greenplum began moving towards the Hadoop space when it added
support for MapReduce within the database in 2009. The company then
officially entered the Hadoop market in April of 2011 with the launch of
Greenplum HD.
Greenplum HD is one of several commercially supported distributions of
Apache Hadoop. All of these distributions consist of the same core
Hadoop technology.
In providing our own distribution of Apache Hadoop, Greenplum (and
now Pivotal) is bringing Greenplum's decade of experience developing
the best MPP database platform to the open source Big Data platform.


Part 3:
How Hadoop Is Being
Used Today


Target Markets, Verticals, and Use Cases


Hadoop Use Cases by Vertical

Finance: risk modeling/management, portfolio analysis, investment
predictions, fraud detection, compliance checks, customer profiling,
social media analytics, ETL, network analysis based on transactions

Web 2.0: product recommendation engines, search engine indexing
(search assist), content optimization, advertising optimization,
customer churn analysis, POS transaction analysis, data warehousing

Telecom: network graph analysis, call detail record (CDR) analysis,
network optimization, service optimization and log processing, user
behavior analysis, customer churn prediction, machine-generated data
centralization (logs from firewalls, towers, switches, servers, etc.)

Healthcare: electronic medical record analysis, claims fraud detection,
drug safety analysis, personalized medicine, healthcare service
optimization, drug development, healthcare information exchange,
medical image processing


Use Case: Data Warehouse Augmentation / Offload


Challenges
Existing EDW used for low-value, resource-consuming ETL
processing
Planned growth will far exceed compute capacity
Hard to do analytics or even basic reporting on EDW system

Objectives
Reduce EDW Total Cost of Ownership
Enable longer data retention to support analytics and accelerate time
to market
Migrate ETL off EDW to free up compute resources


Hadoop Use Case: Retailer Trend Analysis


Deep historical reporting for retail trends:
A credit card company loads 10 years of data for all retailers (100s of TBs) into
Hadoop
Runs a MapReduce job on the Hadoop cluster for a single retailer and develops a
historical picture of that retailer (or retailers) in a specific area (see the sketch below)
Loads the results from Hadoop into the data warehouse and analyzes them further with
standard BI/statistics packages

Why do this in Hadoop?

Ability to store years of data cost-effectively

Data available for immediate recall (not on tapes or flat files)
No need to ETL/normalize the data
Data exists in its valuable, original format
Offloads intensive computation from the DW
Ability to pull in other (unstructured) data sources about the retailer
and combine them with other data (EDGAR filings?)
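
A sketch of what the per-retailer job described above might look like, assuming a hypothetical comma-separated record layout of retailer_id, date (yyyy-mm-dd), amount. The mapper keeps only the target retailer's transactions and emits (month, amount); the reducer totals spend per month. The driver would be wired up like the word-count example earlier, with the retailer ID passed in through the job configuration.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RetailerTrend {

  // Input (hypothetical layout): retailer_id,yyyy-mm-dd,amount
  public static class FilterMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private String targetRetailer;
    private final Text month = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    @Override
    protected void setup(Context context) {
      // Retailer of interest is supplied via the job configuration
      targetRetailer = context.getConfiguration().get("retailer.id");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 3 && fields[0].equals(targetRetailer)) {
        month.set(fields[1].substring(0, 7));     // "yyyy-mm" (assumes well-formed dates)
        amount.set(Double.parseDouble(fields[2]));
        context.write(month, amount);             // (month, transaction amount)
      }
    }
  }

  // Reducer: total spend at the retailer for each month
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable v : values) {
        sum += v.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }
}

The small monthly result set produced by the reducer is then easy to export into the data warehouse for the BI and statistics tools mentioned above.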


Hadoop Use Case: Risk Modeling


Risk Modeling
A bank had customer data across multiple lines of business and needed to
develop a better risk picture of its customers.

e.g., if direct deposits stop coming into a checking account, it is likely that the customer lost his/her job,
which impacts creditworthiness for other products (credit card, mortgage, etc.)

Data existed in silos across multiple LOBs and acquired bank systems
Data size approached 1 petabyte

Why do this in Hadoop?


Ability to cost-effectively integrate more than 1 PB of data from multiple data sources: data
warehouse, call center, chat, and email
Platform for more analysis with poly-structured data sources, e.g., combining bank data
with credit bureau data, Twitter, etc.
Offloads intensive computation from the DW


Hadoop Use Case: Sentiment Analysis


Sentiment Analysis
Hadoop used frequently to monitor what customers think of companys products or
services
Data loaded from social media sources (Twitter, blogs, facebook, emails, chats, etc)
into Hadoop cluster
Map/Reduce jobs run continuously to identify sentiment (ie, Acme Companys rates
are outrageous or rip off)
Negative/positive comments can be acted upon (special offer, coupon, etc)

Why Hadoop
Social media/web data is unstructured
Amount of data is immense
New data sources arise weekly
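
For illustration only, the mapper below sketches one such continuously run job, assuming (hypothetically) one social media post per input line, a hard-coded company name, and a hard-coded list of negative phrases; a production system would use a real sentiment/NLP library. It emits a (phrase, 1) pair whenever a post mentions the company and contains a flagged phrase, and a summing reducer like the word-count reducer shown earlier totals the hits.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: flag negative phrases in posts that mention the company.
public class NegativeSentimentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private static final String COMPANY = "acme";                      // assumed company name
  private static final List<String> NEGATIVE_PHRASES =
      Arrays.asList("outrageous", "rip off", "terrible", "cancel");  // assumed phrase list
  private final Text phrase = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String post = value.toString().toLowerCase();
    if (!post.contains(COMPANY)) {
      return;  // ignore posts that do not mention the company
    }
    for (String p : NEGATIVE_PHRASES) {
      if (post.contains(p)) {
        phrase.set(p);
        context.write(phrase, ONE);  // count each flagged phrase per post
      }
    }
  }
}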


Part 4:
Resources and Key
Takeaways


Online Resources
Hadoop Basics: Hadoop Basics from EMC
Pivotal Hadoop online resources: Greenplum Nation Pivotal HD
Selling Guide - Hadoop Selling Guide
Webcast
Hadoop Spotlight Webinar
Video: HAWQ and Pivotal HD Video
Internal Webcast: Internal Webcast

Spring Hadoop: Spring Hadoop


Contact Resources
Advanced Technology Sales
Nick Cayou ncayou@gopivotal.com
Ian Andrews iandrews@gopivotal.com

Hadoop Virtual Team


Jacque Istok jistok@gopivotal.com

Product Management
SK Krishnamurthy - SK.Krishnamurthy@emc.com

Hadoop vTeam Mailing List


GPVTeamHadoop@emc.com


Module Summary & Highlights


Hadoop was created to meet a need for data storage and
processing at a scale and cost that existing systems simply
could not address
Hadoop adoption in the Enterprise is largely driven by the
low cost and flexibility of the platform
Hadoop is relatively immature, but is quickly advancing with
new capabilities backed by a broad base of vendor and
community supported efforts


Thank You
Please note:
It may take up to 24 hours for your
transcript to be updated and reflect that you
have successfully completed this course.
If after 24 hours your transcript is not
updated, please send an e-mail to
joneill@gopivotal.com describing your
issue.
Please include the following:
BADGE NUMBER
ROLE
MODULE TITLE
ISSUE


Click Here to
Provide Feedback

Thank you to the following individual who helped make the
creation of this course possible: Ian Andrews

