Hadoop World: Rethinking The Data Warehouse With Hadoop and Hive, Facebook

Rethinking Data Warehousing & Analytics
Ashish Thusoo,
Facebook Data Infrastructure Team
Why Another Data Warehousing System?
Data, data and more data

200GB per day in March 2008
2+TB(compressed) raw data per day in April 2009
4+TB(compressed) raw data per day today
Trends Leading to More Data
Free or low cost of user services
Realization that more insights are derived from

simple algorithms on more data
Deficiencies of Existing Technologies
Cost of Analysis and Storage on proprietary systems

does not support trends
towards more data
Limited Scalability does not support trends

towards more data
Closed and Proprietary Systems

Hadoop Advantages
 Pros
– Superior in availability/scalability/manageability despite lower
single node performance
– Open system
– Scalable costs
 Cons: Programmability and Metadata

– Map-reduce hard to program (users know sql/bash/python/perl)
– Need to publish data in well known schemas
 Solution: HIVE
What is HIVE?
 A system for managing and querying structured data built

on top of Hadoop
 Components
– Map-Reduce for execution
– HDFS for storage
– Metadata in an RDBMS
Hive: Simplifying Hadoop – New Technology Familiar
Interfaces
hive> select key, count(1) from kv1 where key > 100
group by key;
vs.
$ cat > /tmp/reducer.sh

uniq -c | awk '{print $2"\t"$1}‘
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}‘
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar
-input /user/hive/warehouse/kv1 -mapper map.sh -file
/tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh
-output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs –cat /tmp/largekey/part*
Hive: Open and Extensible
 Query your own formats and types with your own

Serializer/Deserializers
 Extend the SQL functionality through User Defined

Functions
 Do any non-SQL transformations through TRANSFORM

operator that sends data from Hive to any user
program/script
Hive: Smart Execution Plans for Performance
 Hash based Aggregations
 Map-side Joins
 Predicate Pushdown
 Partition Pruning
 Many more to come in the future

Interoperability
 JDBC and ODBC interfaces available
 Integrations with some traditional SQL tools with some

minor modifications
 More improvements in future to support interoperability

with existing front end tools
Information
 Available as a sub project in Hadoop

- http://wiki.apache.org/hadoop/Hive (wiki)
- http://hadoop.apache.org/hive (home page)
- http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
- ##hive (IRC)
- Works with hadoop-0.17, 0.18, 0.19, 0.20
 Release 0.4.0 is coming in the next few days
 Mailing Lists:
– hive-{user,dev,commits}@hadoop.apache.org
Data Warehousing @ Facebook
using
Hive & Hadoop
Data Flow Architecture at Facebook
Filer
s
Web Scribe MidTier
Servers
Hive
replication Scribe-Hadoop
Cluster
Adhoc Hive-Hadoop Cluster
Production Hive-Hadoop Cluster

Oracle RAC Federated MySQL
Looks like this ..
Node
=
Disks Disks Disks Disks Disks Disks DataNode
Node Node Node Node Node Node +
Map-Reduce
1 Gigabit 4 Gigabit
Hadoop & Hive Cluster @ Facebook
 Hadoop/Hive Warehouse – the new generation

– 4800 cores, Storage capacity of 5.5 PetaBytes
– 12 TB per node
– Two level network topology
 1 Gbit/sec from node to rack switch
 4 Gbit/sec to top level rack switch
Hive & Hadoop Usage @ Facebook
 Statistics per day:

– 4 TB of compressed new data added per day
– 135TB of compressed data scanned per day
– 7500+ Hive jobs on per day
– 80K compute hours per day
 Hive simplifies Hadoop:

– New engineers go though a Hive training session
– ~200 people/month run jobs on Hadoop/Hive
– Analysts (non-engineers) use Hadoop through Hive
– 95% of jobs are Hive Jobs
Hive & Hadoop Usage @ Facebook
 Types of Applications:
– Reporting
 Eg: Daily/Weekly aggregations of impression/click counts
 Measures of user engagement
 Microstrategy dashboards
– Ad hoc Analysis
 Eg: how many group admins broken down by state/country
– Machine Learning (Assembling training data)

 Ad Optimization
 Eg: User Engagement as a function of user attributes
– Many others
Facebook’s contributions…
 A lot of significant contributions:

– Hive
– Hdfs features
– Scheduler work
Etc…
 Talks by Dhruba Borthakur and Zheng Shao in the

development track for more information on these projects

Hadoop World: Rethinking The Data Warehouse With Hadoop and Hive, Facebook

Diunggah oleh

Informasi Dokumen

Deskripsi Asli:

Judul Asli

Hak Cipta

Format Tersedia

Bagikan dokumen Ini

Bagikan atau Tanam Dokumen

Opsi Berbagi

Apakah menurut Anda dokumen ini bermanfaat?

Apakah konten ini tidak pantas?

Hak Cipta:

Format Tersedia

Hadoop World: Rethinking The Data Warehouse With Hadoop and Hive, Facebook

Diunggah oleh

Hak Cipta:

Format Tersedia

Rethinking Data Warehousing & Analytics

Data, data and more data

Free or low cost of user services

Realization that more insights are derived from

Cost of Analysis and Storage on proprietary systems

Limited Scalability does not support trends

Closed and Proprietary Systems

 Cons: Programmability and Metadata

 A system for managing and querying structured data built

$ cat > /tmp/reducer.sh

 Query your own formats and types with your own

 Extend the SQL functionality through User Defined

 Do any non-SQL transformations through TRANSFORM

 Hash based Aggregations

 Many more to come in the future

 JDBC and ODBC interfaces available

 Integrations with some traditional SQL tools with some

 More improvements in future to support interoperability

 Available as a sub project in Hadoop

 Release 0.4.0 is coming in the next few days

Production Hive-Hadoop Cluster

 Hadoop/Hive Warehouse – the new generation

 Statistics per day:

 Hive simplifies Hadoop:

– Machine Learning (Assembling training data)

 A lot of significant contributions:

 Talks by Dhruba Borthakur and Zheng Shao in the

Anda mungkin juga menyukai