Topics
Motivation
2014
MapR
Technologies
2014
MapR
Technologies
Data is doubling in
size every two years
SEMI-STRUCTURED
DATA
STRUCTURED DATA
1980
1990
2000
2010
Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
2020
GBs-TBs
TBs-PBs
Structure
Structured
Development
RELATIONAL DATABASES
NON-RELATIONAL DATASTORES
Fixed schema
DBA controls structure
Database
1980
1990
2000
2010
2020
2014 MapR Technologies
WANT
DONT WANT
SQL
BI (Tableau, MicroStrategy, etc.)
Low latency
Scalability
APACHE DRILL
Schema-free data exploration for Hadoop and NoSQL
Point-and-query vs. schema-first
Low latency SQL queries at Scale
Extreme Ease of Use
Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
Hadoop data
Data modeling
Transformation
Data
movement
Users
Hadoop data
Users
10
Self-Service BI
w/ RDBMS
SQL-on-Hadoop
Self-Service
Data Exploration
IT-driven
IT-driven
IT-driven
Optional
Data Visualization
IT-driven
Self-service
Self-service
Self-service
Zero-day analytics
2014 MapR Technologies
11
2014
MapR
Technologies
2014
MapR
Technologies
12
RDBMS/SQL-on-Hadoop table
Schema-less
Name!
Gender!
Age!
Michael!
M!
6!
Jennifer!
F!
3!
CSV
TSV
{!
HBase
name: {!
first: Michael,!
last: Smith!
},!
hobbies: [ski, soccer],!
district: Los Altos!
Flexibility
Complex
Parquet
Avro
JSON
BSON
Flexibility
}!
{!
name: {!
first: Jennifer,!
last: Gates!
},!
hobbies: [sing],!
preschool: CCLC!
}!
13
Fixed schema
Leverage schema in centralized
repository (Hive Metastore)
SCHEMA ON
WRITE
SCHEMA
BEFORE READ
Schema2Discovered On-The-Fly
14
Raw data
JSON, CSV, ...
Optimized data
Parquet,
Centrally-structured
data
Schemas in Hive
Metastore
Relational data
Highly-structured
data
Exploration
(known and unknown questions)
Oracle, Teradata
15
Use cases
Apache Drill
Enterprise users
Ad-hoc query/ BI
reporting
Enterprise users
Data exploration /
Raw data analysis
Instant /
Day 0 queries
Ad-hoc query/BI
reporting
Hadoop
DW platforms
(e.g., Oracle, Teradata)
16
Feature Walkthrough
2014
MapR
Technologies
2014
MapR
Technologies
17
Business dataset
{
"business_id":
"4bEjOyTaDG24SY5TxsaUNQ",
"full_address":
"3655
Las
Vegas
Blvd
S\nThe
Strip\nLas
Vegas,
NV
89109",
"hours":
{
"Monday":
{"close":
"23:00",
"open":
"07:00"},
"Tuesday":
{"close":
"23:00",
"open":
"07:00"},
"Friday":
{"close":
"00:00",
"open":
"07:00"},
"Wednesday":
{"close":
"23:00",
"open":
"07:00"},
"Thursday":
{"close":
"23:00",
"open":
"07:00"},
"Sunday":
{"close":
"23:00",
"open":
"07:00"},
"Saturday":
{"close":
"00:00",
"open":
"07:00"}
},
"open":
true,
"categories":
["Breakfast
&
Brunch",
"Steakhouses",
"French",
"Restaurants"],
"city":
"Las
Vegas",
"review_count":
4084,
"name":
"Mon
Ami
Gabi",
"neighborhoods":
["The
Strip"],
"longitude":
-115.172588519464,
"state":
"NV",
"stars":
4.0,
"attributes":
{
"Alcohol":
"full_bar,
"Noise
Level":
"average",
"Has
TV":
false,
"Attire":
"casual",
"Ambience":
{
"romantic":
true,
"intimate":
false,
"touristy":
false,
"hipster":
false,
"classy":
true,
"trendy":
false,
"casual":
false
},
"Good
For":
{"dessert":
false,
"latenight":
false,
"lunch":
false,
"dinner":
true,
"breakfast":
false,
"brunch":
false},
}
2014 MapR Technologies
}
18
Reviews dataset
{
"votes":
{"funny":
0,
"useful":
2,
"cool":
1},
"user_id":
"Xqd0DzHaiyRqVH3WRG7hzg",
"review_id":
"15SdjuK7DmYqUAj6rjGowg",
"stars":
5,
"date":
"2007-05-17",
"text":
"dr.
goldberg
offers
everything
...",
"type":
"review",
"business_id":
"vcNAWiLM4dR7D2nwwJ7nCA"
}
19
Install
Launch
shell
(embedded
mode)
Query
files
and
directories
+------------+------------+-------------+
|
state
|
city
|
businesses
|
+------------+------------+-------------+
|
NV
|
Las
Vegas
|
12021
|
|
AZ
|
Phoenix
|
7499
|
|
AZ
|
Scottsdale
|
3605
|
|
EDH
|
Edinburgh
|
2804
|
|
AZ
|
Mesa
|
2041
|
|
AZ
|
Tempe
|
2025
|
|
NV
|
Henderson
|
1914
|
|
AZ
|
Chandler
|
1637
|
|
WI
|
Madison
|
1630
|
|
AZ
|
Glendale
|
1196
|
+------------+------------+-------------+
Results
20
A workspace
- Sub-directory
- HBase namespace
- Hive database
A
-
-
-
table
pathnames
Hive table
HBase table
21
>
SELECT
name,
stars,
b.hours.Friday
friday,
categories
FROM
dfs.yelp.`business.json`
b
WHERE
b.hours.Friday.`open`
<
'22:00'
AND
b.hours.Friday.`close`
>
'22:00'
AND
REPEATED_CONTAINS(categories,
'Mediterranean')
AND
city
=
'Las
Vegas'
ORDER
BY
stars
DESC
LIMIT
2;
Query
data
with
any
levels
of
nesting
+------------+------------+------------+------------+
|
name
|
stars
|
friday
|
categories
|
+------------+------------+------------+------------+
|
Olives
|
4.0
|
{"close":"22:30","open":"11:00"}
|
["Mediterranean","Restaurants"]
|
|
Marrakech
Moroccan
Restaurant
|
4.0
|
{"close":"23:00","open":"17:30"}
|
["Mediterranean","Middle
Eastern","Moroccan","Restaurants"]
|
+------------+------------+------------+------------+
2014 MapR Technologies
22
functionality
(Joins,
Aggregations,
Sorting,
Sub-
queries,
SQL
data
types)
23
Logical views
//Create
a
view
combining
business
and
reviews
datasets
>
CREATE
OR
REPLACE
VIEW
dfs.tmp.BusinessReviews
AS
SELECT
b.name,
b.stars,
r.votes.funny,
r.votes.useful,
r.votes.cool,
r.`date`
FROM
dfs.yelp.`business.json`
b,
dfs.yelp.`review.json`
r
WHERE
r.business_id
=
b.business_id;
+------------+------------+
|
ok
|
summary
|
+------------+------------+
|
true
|
View
'BusinessReviews'
created
successfully
in
'dfs.tmp'
schema
|
+------------+------------+
Lightweight
file
system
based
views
for
granular
and
de-
centralized
data
management
+------------+
|
Total
|
+------------+
| 1125458
|
+------------+
2014 MapR Technologies
24
Save
analysis
results
as
tables
using
familiar
CTAS
syntax
25
2014
MapR
Technologies
2014
MapR
Technologies
26
Dynamically
flatten
repeated
and
nested
data
elements
as
part
of
SQL
queries.
No
ETL
necessary
27
>SELECT
category,
count(*)
AS
categorycount
FROM
(SELECT
name,
FLATTEN(categories)
AS
category
FROM
dfs.yelp.`business.json`)
c
GROUP
BY
category
ORDER
BY
categorycount
DESC;
+------------+------------+
|
category
|
categorycount|
+------------+------------+
|
Restaurants
|
14303
|
|
Australian
|
1
|
|
Boat
Dealers
|
1
|
|
Firewood
|
1
|
+------------+------------+
28
2014
MapR
Technologies
2014
MapR
Technologies
29
30
31
32
33
Architecture Overview
2014
MapR
Technologies
2014
MapR
Technologies
34
35
Basic Process
Query
Drillbit
Drillbit
Drillbit
Zookeeper
DFS/HBase/
Hive
DFS/HBase/
Hive
DFS/HBase/
Hive
36
Execution
Storage Plugins
Optimizer
Physical Plan
SQL Parser
Logical Plan
RPC Endpoint
DFS
Hive
HBase
MongoDB
37
Demo
2014
MapR
Technologies
2014
MapR
Technologies
38
Wrap Up
2014
MapR
Technologies
2014
MapR
Technologies
39
Apache Drill
Industrys first Schema-free SQL query for Hadoop/NoSQL
Self Service
Data
Exploration
Single SQL
Interface for
Structured
and SemiStructured
Data
Data Agility
with No IT
Intervention
40
New Data
Sources
Improved
Multi
Tenancy
3
4
2014 MapR Technologies
41
$50
$50MM
in Free Training
42
Ask questions:
user@drill.apache.org
email:
bknorr@mapr.com
2014 MapR Technologies
43