
Hadoop 101

Learn the Hadoop Fundamentals


Ian Andrews | iandrews@gopivotal.com
Director Advanced Technology
Pivotal Incorporated


HADOOP 101

Purpose
After completing this module, you will be able to:
1. Discuss the evolution of Data Platforms and why Hadoop was created
2. Discuss the purpose, functionality, and value of Hadoop
3. Describe the various Hadoop components
4. Discuss some of the most common use cases for Hadoop


HADOOP 101

Agenda

1. What is Hadoop
2. The Evolution of Data Platforms
3. How Hadoop Is Being Used Today
4. Resources and Key Takeaways


Part 1:
What Is Hadoop?


What is Hadoop?
A framework that allows for distributed processing of large
data sets across clusters of commodity servers
Stores large amounts of data
Processes the large amounts of data stored

Inspired by Google's MapReduce and Google File System (GFS) papers

Apache open source project
Initial work done at Yahoo! starting in 2005
Became a top-level Apache project in 2008; there is now a very active
open source community


What is Hadoop?
Two Core Components

HDFS: scalable storage in the Hadoop Distributed File System

MapReduce: compute via the MapReduce distributed processing platform

Storage and compute in one framework


Open Source Project of the Apache Software Foundation
Java-intensive programming required
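
To make the MapReduce model concrete, here is a minimal word-count job written against the standard Apache Hadoop Java API (org.apache.hadoop.mapreduce). It is an illustrative sketch, not part of any Pivotal product: the mapper emits a (word, 1) pair for every token it sees, and the reducer sums the counts for each word. Input and output HDFS paths are passed on the command line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each input line, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sum all counts emitted for the same word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);  // combiner cuts down shuffle traffic
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The compiled class is packaged into a jar and submitted with something like "hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output" (paths are illustrative); the framework splits the input across the cluster, schedules map and reduce tasks close to the data in HDFS, and transparently re-runs tasks if a node fails.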


Part 2:
The Evolution of Data
Platforms
How it all began


First there was the Data Warehouse


A new architecture to host data from
multiple sources to support decision-making
Why the data warehouse exists:
Centralization of high-value data
Tools to process data into information
Highly regulated environment



Then the MPP database was introduced


A new approach to the database was
required to handle the new analytics
environment
Why the MPP database exists:

Data got larger
Queries got uglier
Performance became critical
R/SAS/statistical languages needed to
run in-database


Then things changed


The Internet age introduced the ability to
track interactions rather than just
transactions
The cost of traditional platforms was too
high to store and process this new data
Conventional databases were not able to
perform at scale


Drivers of this change: new data streams, new delivery platforms,
new deployment models and languages, expanding data volumes,
greater cost pressures, and increasing customer expectations

Now there is Hadoop


Traditional systems weren't built
to handle the storage and
processing needs of Web 2.0
Why Hadoop exists:
Data volumes moved to the
Petabyte range
Raw (unstructured) forms of
data needed to be
processed
Cost needed to be low
Processing must scale with
storage

The Hadoop Opportunity


Internet age + exploding data growth
Enterprises increasingly interested in leveraging new data sources
quickly:
Understand granular customer behavior
Spot emerging trends
Identify new opportunities, etc.

Traditional database tools are not able to cope

Weren't built for big data use cases
Lack scale, are not cost-effective, and impose rigid data structures

Need for a new approach: Hadoop


Why is Hadoop Important?


1. Hadoop reduces the cost of storing and processing data to the point that
keeping all data indefinitely is suddenly a very real possibility, AND
that cost is halving every 18 months
2. MapReduce makes developing and executing massively parallel data
processing tasks trivial compared to historical alternatives (e.g., HPC /
grid)
3. The schema-on-read paradigm shifts typical data preparation complexity
to the analysis phase rather than the acquisition phase (illustrated below)
4. For the modern CIO, BIG DATA = HADOOP. Don't underestimate
the irrational exuberance of the market
The cost and effort to consume and extract value from data have been
fundamentally changed
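
A minimal sketch of the schema-on-read idea from point 3, using the standard Hadoop FileSystem Java API (the local and HDFS paths are hypothetical): raw files are landed in HDFS exactly as they arrive, with no cleansing or schema applied at load time, and structure is imposed only later by whatever job reads them.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // handle to the cluster's HDFS

    // Schema on read: land the raw, unmodified file in HDFS now...
    fs.copyFromLocalFile(
        new Path("/var/log/app/clicks-2013-06-01.log"),  // hypothetical local file
        new Path("/data/raw/clickstream/"));             // hypothetical HDFS directory

    // ...and decide how to interpret it later, e.g. inside a mapper:
    //   String[] fields = value.toString().split("\t");  // schema applied at analysis time
    fs.close();
  }
}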


How is Hadoop unique?


Handles large amounts of data
Stores data in native format
Delivers linear scalability at low cost
Resilient in case of infrastructure failures
Transparent application scalability

Enterprises can gain a competitive advantage through the
adoption of big data analytics


Common Apache Hadoop Components


Evolution of the Commercial Hadoop Market

Apache Hadoop is open-source software managed by the Apache Software Foundation and
available as a free download at Apache.org.

A number of commercial businesses have been created to provide packaged Hadoop
distributions containing integrated subcomponents of Hadoop and other open source
projects, plus education, support, services, and Hadoop management utilities to make the
technology easier to implement and use.

Cloudera was the first commercial vendor to enter the space, in 2008, founded by early Hadoop technologists
who later brought in Doug Cutting as chief architect.
Other commercial vendors of Hadoop software include Hortonworks (2011, out of Yahoo!), IBM (InfoSphere
BigInsights), MapR (2009), and DataStax (2010, Hadoop and Cassandra).

Organizations that are evaluating Hadoop are typically also looking at NoSQL
databases such as Cassandra, MongoDB, and CouchDB, as well as hosted offerings such as Amazon EMR.

They might also be evaluating scale-out file systems to use for storage, such as Isilon
or GlusterFS. These systems mirror Hadoop's scale-out architecture and are also
capable of handling the volume and unstructured nature of data that could be stored in
Hadoop.


Greenplum's Entry into the Hadoop Market


Greenplum began moving towards the Hadoop space when it added
support for MapReduce within the database in 2009. The company then
officially entered the Hadoop market in April of 2011 with the launch of
Greenplum HD.
Greenplum HD is one of several commercially supported distributions of
Apache Hadoop. All of these distributions consist of the same core
Hadoop technology.
In providing our own distribution of Apache Hadoop, Greenplum (and
now Pivotal) is bringing Greenplum's decade of experience developing
the best MPP database platform to the open source Big Data platform.


Part 3:
How Hadoop Is Being
Used Today


Target Markets, Verticals, and Use Cases


Hadoop Use Cases by Vertical

Finance: risk modeling/management, portfolio analysis, investment
predictions, fraud detection, compliance checks, customer profiling,
social media analytics, ETL, network analysis based on transactions

Web 2.0: product recommendation engines, search engine indexing
(search assist), content optimization, advertising optimization,
customer churn analysis, POS transaction analysis, data warehousing

Telecom: network graph analysis, call detail record (CDR) analysis,
network optimization, service optimization and log processing, user
behavior analysis, customer churn prediction, machine-generated data
centralization (logs from firewalls, towers, switches, servers, etc.)

Healthcare: electronic medical record analysis, claims fraud detection,
drug safety analysis, personalized medicine, healthcare service
optimization, drug development, healthcare information exchange,
medical image processing


Use Case: Data Warehouse Augmentation / Offload


Challenges
Existing EDW used for low-value, resource-consuming ETL
processing
Planned growth will far exceed compute capacity
Hard to do analytics or even basic reporting on EDW system

Objectives
Reduce EDW Total Cost of Ownership
Enable longer data retention to support analytics and accelerate time
to market
Migrate ETL off EDW to free up compute resources


Hadoop Use Case: Retailer Trend Analysis


Deep historical reporting for retail trends:
A credit card company loads 10 years of data for all retailers (100s of TBs) into
Hadoop
Runs a MapReduce job on the Hadoop cluster for a single retailer and develops a
historical picture of that retailer (or retailers) in a specific area (see the sketch below)
Loads the results from Hadoop into the data warehouse and analyzes them further with
standard BI/statistics packages

Why do this in Hadoop?

Ability to store years of data cost-effectively

Data available for immediate recall (not on tapes or flat files)
No need to ETL/normalize the data
Data exists in its valuable, original format
Offloads intensive computation from the DW
Ability to pull in other (unstructured) data sources about the retailer
and combine them with other data (EDGAR filings?)
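
A sketch of what the per-retailer job described above might look like, assuming a hypothetical comma-separated record layout of retailer_id, date (yyyy-mm-dd), amount. The mapper keeps only the target retailer's transactions and emits (month, amount); the reducer totals spend per month. The driver would be wired up like the word-count example earlier, with the retailer ID passed in through the job configuration.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RetailerTrend {

  // Input (hypothetical layout): retailer_id,yyyy-mm-dd,amount
  public static class FilterMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private String targetRetailer;
    private final Text month = new Text();
    private final DoubleWritable amount = new DoubleWritable();

    @Override
    protected void setup(Context context) {
      // Retailer of interest is supplied via the job configuration
      targetRetailer = context.getConfiguration().get("retailer.id");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length == 3 && fields[0].equals(targetRetailer)) {
        month.set(fields[1].substring(0, 7));     // "yyyy-mm" (assumes well-formed dates)
        amount.set(Double.parseDouble(fields[2]));
        context.write(month, amount);             // (month, transaction amount)
      }
    }
  }

  // Reducer: total spend at the retailer for each month
  public static class SumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    private final DoubleWritable total = new DoubleWritable();

    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable v : values) {
        sum += v.get();
      }
      total.set(sum);
      context.write(key, total);
    }
  }
}

The small monthly result set produced by the reducer is then easy to export into the data warehouse for the BI and statistics tools mentioned above.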


Hadoop Use Case: Risk Modeling


Risk Modeling
A bank had customer data across multiple lines of business and needed to
develop a better risk picture of its customers.

e.g., if direct deposits stop coming into a checking account, it is likely that the customer lost his/her job,
which impacts creditworthiness for other products (credit card, mortgage, etc.)

Data existed in silos across multiple LOBs and acquired bank systems
Data size approached 1 petabyte

Why do this in Hadoop?


Ability to cost-effectively integrate more than 1 PB of data from multiple data sources: data
warehouse, call center, chat, and email
Platform for more analysis with poly-structured data sources, e.g., combining bank data
with credit bureau data, Twitter, etc.
Offloads intensive computation from the DW


Hadoop Use Case: Sentiment Analysis


Sentiment Analysis
Hadoop used frequently to monitor what customers think of companys products or
services
Data loaded from social media sources (Twitter, blogs, facebook, emails, chats, etc)
into Hadoop cluster
Map/Reduce jobs run continuously to identify sentiment (ie, Acme Companys rates
are outrageous or rip off)
Negative/positive comments can be acted upon (special offer, coupon, etc)

Why Hadoop
Social media/web data is unstructured
Amount of data is immense
New data sources arise weekly
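
For illustration only, the mapper below sketches one such continuously run job, assuming (hypothetically) one social media post per input line, a hard-coded company name, and a hard-coded list of negative phrases; a production system would use a real sentiment/NLP library. It emits a (phrase, 1) pair whenever a post mentions the company and contains a flagged phrase, and a summing reducer like the word-count reducer shown earlier totals the hits.

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: flag negative phrases in posts that mention the company.
public class NegativeSentimentMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private static final String COMPANY = "acme";                      // assumed company name
  private static final List<String> NEGATIVE_PHRASES =
      Arrays.asList("outrageous", "rip off", "terrible", "cancel");  // assumed phrase list
  private final Text phrase = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String post = value.toString().toLowerCase();
    if (!post.contains(COMPANY)) {
      return;  // ignore posts that do not mention the company
    }
    for (String p : NEGATIVE_PHRASES) {
      if (post.contains(p)) {
        phrase.set(p);
        context.write(phrase, ONE);  // count each flagged phrase per post
      }
    }
  }
}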


Part 4:
Resources and Key
Takeaways


Online Resources
Hadoop Basics: Hadoop Basics from EMC
Pivotal Hadoop online resources: Greenplum Nation Pivotal HD
Selling Guide - Hadoop Selling Guide
Webcast
Hadoop Spotlight Webinar
Video: HAWQ and Pivotal HD Video
Internal Webcast: Internal Webcast

Spring Hadoop: Spring Hadoop


Contact Resources
Advanced Technology Sales
Nick Cayou ncayou@gopivotal.com
Ian Andrews iandrews@gopivotal.com

Hadoop Virtual Team


Jacque Istok jistok@gopivotal.com

Product Management
SK Krishnamurthy - SK.Krishnamurthy@emc.com

Hadoop vTeam Mailing List


GPVTeamHadoop@emc.com


Module Summary & Highlights


Hadoop was created to meet a need for data storage and
processing at a scale and cost that existing systems simply
could not address
Hadoop adoption in the Enterprise is largely driven by the
low cost and flexibility of the platform
Hadoop is relatively immature, but is quickly advancing with
new capabilities backed by a broad base of vendor and
community supported efforts


Thank You
Please note:
It may take up to 24 hours for your
transcript to be updated and reflect that you
have successfully completed this course.
If after 24 hours your transcript is not
updated, please send an e-mail to
joneill@gopivotal.com describing your
issue.
Please include the following:
BADGE NUMBER
ROLE
MODULE TITLE
ISSUE


Click Here to
Provide Feedback

Thank you to the following individual who helped make the
creation of this course possible: Ian Andrews

