Ashish Thusoo,
Facebook Data Infrastructure Team
Why Another Data Warehousing System?
Pros
– Superior in availability/scalability/manageability despite lower
single node performance
– Open system
– Scalable costs
Solution: HIVE
What is HIVE?
Components
– Map-Reduce for execution
– HDFS for storage
– Metadata in an RDBMS
Hive: Simplifying Hadoop – New Technology Familiar
Interfaces
hive> select key, count(1) from kv1 where key > 100
group by key;
vs.
Map-side Joins
Predicate Pushdown
Partition Pruning
Mailing Lists:
– hive-{user,dev,commits}@hadoop.apache.org
Data Warehousing @ Facebook
using
Hive & Hadoop
Data Flow Architecture at Facebook
Filer
s
Web Scribe MidTier
Servers
Hive
replication Scribe-Hadoop
Cluster
Adhoc Hive-Hadoop Cluster
Node
=
Disks Disks Disks Disks Disks Disks DataNode
Node Node Node Node Node Node +
Map-Reduce
1 Gigabit 4 Gigabit
Hadoop & Hive Cluster @ Facebook
Types of Applications:
– Reporting
Eg: Daily/Weekly aggregations of impression/click counts
Measures of user engagement
Microstrategy dashboards
– Ad hoc Analysis
Eg: how many group admins broken down by state/country
– Many others
Facebook’s contributions…