
NoSQL and SQL Work Side-by-Side to Tackle Real-time Big Data Needs

Allen Day, MapR Technologies

Me
Allen Day
Principal Data Scientist @ MapR
Human Genomics / Bioinformatics (PhD, UCLA School of Medicine)

@allenday

allenday@allenday.com / aday@maprtech.com

You
I'm assuming that the typical attendee:

- is a software developer
- is interested in and familiar with open source
- is familiar with Hadoop and relational DBs
- has heard of or has used some NoSQL technology

Big Data Workloads


Offline
- ETL
- Model creation, clustering & indexing
- Web crawling
- Batch reporting

Online
- Lightweight OLTP
- Classification & anomaly detection
- Stream processing
- Interactive reporting
- SQL

What is NoSQL? Why use it?


- Traditional storage (relational DBs) is unable to accommodate the increasing number and variety of observations
  - Culprits: sensors, event logs, electronic payments
- Solution: stay responsive by relaxing ACID storage requirements
  - Denormalize (handles the increasing number of observations; sketched below)
  - Loosen schema (handles the variety), loosen consistency
- This is the essence of NoSQL
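
To make "denormalize" concrete, here is a minimal sketch assuming a toy users/orders schema (the table and column names are illustrative, not from the talk): the normalized pair of tables is collapsed into one wide table so reads avoid the join, at the cost of duplicated data and relaxed consistency.

    -- Normalized: two tables, reads require a join, an update touches one row.
    CREATE TABLE users  (user_id INT PRIMARY KEY, name VARCHAR(100));
    CREATE TABLE orders (order_id INT PRIMARY KEY,
                         user_id  INT REFERENCES users(user_id),
                         amount   DECIMAL(10,2));

    -- Denormalized: one wide table, no join at read time, but the user's
    -- name is duplicated on every order row; renaming a user must now
    -- update many rows, which is where consistency gets relaxed.
    CREATE TABLE orders_denorm (order_id  INT PRIMARY KEY,
                                user_id   INT,
                                user_name VARCHAR(100),
                                amount    DECIMAL(10,2));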

NoSQL Impact on Business Processes


- The traditional business intelligence (BI) tech stack assumes relational DB storage
  - Company decisions depend on this (reports, charts)
- NoSQL-collected data aren't in a relational DB
  - Data volume/variety is still increasing
  - Tech and methods are still in flux
- Result: decoupled data storage and decision support systems
  - BI can't access the freshest, largest data sets
  - Very high opportunity cost to the business

Ideal Solution Features


- Scalable & reliable
  - Distributed replicated storage (Hadoop FS)
  - Distributed parallel processing (MapReduce, YARN)
- BI application support
  - Ad-hoc, interactive queries
  - Real-time responsiveness
- Flexible
  - Handles rapid storage and schema evolution
  - Handles new analytics methods and functions
- Summed up: a SQL interface, extensible for NoSQL and advanced analytics

From Ideals to Possibilities


- Option 1: Migrate NoSQL data/processing to SQL
  - High cost to marshal NoSQL data into SQL storage
  - SQL systems lack advanced analytics capabilities
- Option 2: Migrate SQL data to NoSQL
  - Breaks compatibility for BI-dependent functions, e.g. financial reporting
  - Limited support for relational operations (joins), high latency
  - NoSQL tech is still in flux (a continuity risk)

Other Approaches?
Yes. First let's consider a SQL/NoSQL use case.

Interactive Queries & Hadoop

[Diagram: the Hadoop interactive-query landscape, with Impala positioned as a low-latency option]

Example Problem: Marketing Campaign


- Jane is an analyst at an e-commerce company
- How does she figure out good targeting segments for the next marketing campaign?
- She has some ideas and lots of data:
  - User profiles
  - Transaction information
  - Access logs

Traditional System Solution 1: RDBMS


- ETL the data (user profiles, access logs) from MongoDB and Hadoop into the RDBMS, alongside the transaction information already there
  - MongoDB data must be flattened, schematized, filtered and aggregated
  - Hadoop data must be filtered and aggregated
- Query the data using any SQL-based tool (a hypothetical example follows)
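
Once everything has been ETL'd into the RDBMS, a segmentation query might look like the sketch below. The schema is hypothetical; the talk does not specify table or column names.

    -- Hypothetical schema: profiles and access logs were ETL'd into the
    -- RDBMS next to the native transactions table.
    SELECT p.age_bracket,
           p.region,
           COUNT(DISTINCT t.user_id) AS buyers,
           SUM(t.amount)             AS revenue
    FROM   transactions t
    JOIN   user_profiles p ON p.user_id = t.user_id
    WHERE  t.purchase_date >= DATE '2013-01-01'
    GROUP BY p.age_bracket, p.region
    ORDER BY revenue DESC;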

Traditional System Solution 2: Hadoop


- ETL the data (transaction information, user profiles) from Oracle and MongoDB into Hadoop, alongside the access logs already there
  - MongoDB data must be flattened and schematized
- Work with the MapReduce team to write custom code to generate the desired analyses

Traditional System Solution 3: Hive


- ETL the data (transaction information, user profiles) from Oracle and MongoDB into Hadoop, alongside the access logs already there
  - MongoDB data must be flattened and schematized
  - Marshaling/coding effort is still required
- But HiveQL queries are slow and BI tool support is limited (a sketch follows)
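
For flavor, here is the same hypothetical segmentation query expressed in HiveQL (table names are again illustrative). Each query compiles down to batch MapReduce jobs, which is where the latency complaint comes from.

    -- HiveQL: syntactically close to SQL, but executed as MapReduce jobs,
    -- so even simple aggregates can take minutes.
    SELECT p.age_bracket,
           COUNT(DISTINCT t.user_id) AS buyers
    FROM   transactions t
    JOIN   user_profiles p ON (p.user_id = t.user_id)
    GROUP BY p.age_bracket;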

What Would Google Do?


    Capability               Google      Open source
    -----------------------  ----------  ----------------
    Distributed file system  GFS         HDFS
    NoSQL storage            BigTable    HBase
    Batch processing         MapReduce   Hadoop MapReduce
    Interactive analysis     Dremel      ???

Build Apache Drill to provide a true open source solution to interactive analysis of Big Data

Apache Drill Overview


- Interactive analysis of Big Data using standard SQL
- Fast
  - Low-latency queries
  - Complements native interfaces and MapReduce/Hive/Pig
- Open
  - Community-driven open source project
  - Under the Apache Software Foundation
- Modern
  - Standard ANSI SQL:2003 (SELECT/INTO)
  - Nested data support
  - Schema is optional
  - Supports RDBMS, Hadoop and NoSQL

[Diagram: latency spectrum. Apache Drill targets interactive queries, data analysts and reporting (100 ms to 20 min); MapReduce/Hive/Pig target data mining, modeling and large ETL (20 min to 20 hr)]

How Does It Work?


SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1

[Diagram: a Drill client (Tableau, MicroStrategy, Crystal Reports) connects through the Drill ODBC driver to a Drillbit acting as coordinator, which hosts the SQL query parser and query planner and fans work out to Drillbits acting as executors]

How Does It Work?


- Drillbits run on each node, designed to maximize data locality
- Processing is done outside the MapReduce paradigm (but possibly within YARN)
- Queries can be fed to any Drillbit
- Coordination, query planning, optimization, scheduling, and execution are distributed

SELECT * FROM oracle.transactions, mongo.users, hdfs.events LIMIT 1
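
Reading the example query above: one statement names an RDBMS table, a MongoDB collection, and files on HDFS in a single FROM clause, where each prefix identifies a storage backend. A more realistic version for Jane's campaign problem might correlate the three sources explicitly; this is a hypothetical sketch, and the column names are invented.

    -- Hypothetical cross-source join in Drill-style SQL: each prefix
    -- (oracle, mongo, hdfs) names a different storage backend.
    SELECT u.segment,
           COUNT(*)      AS purchases,
           SUM(t.amount) AS revenue
    FROM   oracle.transactions t
    JOIN   mongo.users  u ON u.user_id = t.user_id
    JOIN   hdfs.events  e ON e.user_id = t.user_id
    WHERE  e.event_type = 'ad_click'
    GROUP BY u.segment;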

Apache Drill: Key Features


- Full ANSI SQL:2003 support
  - Use any SQL-based tool
- Nested data support
  - Flattening is error-prone and often impossible
- Schema-less data source support
  - Schema can change rapidly and may be record-specific
- Extensible
  - DSLs, UDFs
  - Custom operators (e.g. k-means clustering)
  - Well-documented data source & file format APIs

How Does Impala Fit In?


- Impala strengths
  - Beta currently available
  - Easy install and setup on top of Cloudera
  - Faster than Hive on some queries
  - SQL-like query language
- Questions
  - "Open source lite"
  - Lacks RDBMS support
  - Lacks NoSQL support beyond HBase
  - Early row materialization increases footprint and reduces performance
  - Limited file format support
  - Query results must fit in memory!
  - Rigid schema is required
  - No support for nested data
  - SQL-like (not SQL)
- Many important features are coming soon, but the architectural foundation is constrained and there is no community development.

Drill Status: Alpha Available July


- Heavy active development by multiple organizations
  - Contributors from Oracle, IBM Netezza, Informatica, Clustrix, Pentaho
- Available now
  - Logical plan syntax and interpreter
  - Reference interpreter
- In progress
  - SQL interpreter
  - Storage engine implementations for Accumulo, Cassandra, HBase and various file formats
- Significant community momentum
  - Over 200 people on the Drill mailing list
  - Over 200 members of the Bay Area Drill User Group
  - Drill meetups across the US and Europe
- Beta: Q3

Why Apache Drill Will Be Successful


- Resources
  - Contributors have strong backgrounds from companies like Oracle, IBM Netezza, Informatica, Clustrix and Pentaho
- Community
  - Development done in the open
  - Active contributors from multiple companies
  - Rapidly growing
- Architecture
  - Full SQL
  - New data support
  - Extensible APIs
  - Full columnar execution
  - Beyond Hadoop

Bottom line: Apache Drill enables NoSQL and SQL to work side by side to tackle real-time Big Data needs.

Me
Allen Day
Principal Data Scientist @ MapR

@allenday
allenday@allenday.com / aday@maprtech.com

ADDITIONAL SLIDES

Full SQL (ANSI SQL:2003)


- Drill supports SQL (the ANSI SQL:2003 standard)
  - Correlated subqueries, analytic functions, etc. (sketch below)
  - "SQL-like" is not enough
- Use any SQL-based tool with Apache Drill
  - Tableau, MicroStrategy, Excel, SAP Crystal Reports, Toad, SQuirreL, ...
  - Standard ODBC and JDBC drivers
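
As an illustration of why full SQL:2003 matters, here is a sketch of an analytic (window) function, a construct the "SQL-like" dialects of the time typically lacked. The table and columns are hypothetical.

    -- Rank each user's purchases by amount within that user:
    -- a window-function construct from ANSI SQL:2003.
    SELECT user_id,
           amount,
           RANK() OVER (PARTITION BY user_id ORDER BY amount DESC) AS rnk
    FROM   transactions;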
[Diagram: clients (Tableau, MicroStrategy, Excel, SAP Crystal Reports) connect via the Drill ODBC driver to a Drillbit hosting the SQL query parser and query planner, which dispatches to Drill workers]

Nested Data
- Nested data is becoming prevalent
  - JSON, BSON, XML, Protocol Buffers, Avro, etc.
- The data source may or may not be aware of the nesting
  - MongoDB supports nested data natively
  - A single HBase value could be a JSON document (compound nested type)

JSON example:

    {
      "name": "Homer",
      "gender": "Male",
      "followers": 100,
      "children": [ {"name": "Bart"}, {"name": "Lisa"} ]
    }

- Google Dremel's innovation was efficient columnar storage and querying of nested data
- Flattening nested data is error-prone and often impossible
  - Think about repeated and optional fields at every level

Avro example:

    enum Gender { MALE, FEMALE }

    record User {
      string name;
      Gender gender;
      long   followers;
    }

- Apache Drill supports nested data
  - Via extensions to ANSI SQL:2003 (a hypothetical sketch follows)
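
A sketch of what querying the JSON record above could look like, assuming a Drill-style dotted-path extension for reaching into nested and repeated fields. The exact syntax was still in flux at the time, so treat this as illustrative rather than definitive.

    -- Reach into a nested, repeated field with path syntax:
    -- pull each user's name together with the name of their first child.
    SELECT u.name,
           u.children[0].name AS first_child
    FROM   mongo.users u
    WHERE  u.followers > 50;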

Schema is Optional
- Many data sources do not have rigid schemas
  - Schemas change rapidly
  - Each record may have a different schema, and may be sparse/wide
- Apache Drill supports querying against unknown schemas
  - Query any HBase, Cassandra or MongoDB table (a sketch follows the example below)
- The user can define the schema or let the system discover it automatically
  - The system of record may already have schema information
  - No need to manage schema evolution
Example HBase table:

    Row Key            CF contents               CF anchor
    -----------------  ------------------------  -------------------------------------
    "com.cnn.www"      contents:html = "<html>"  anchor:my.look.ca = "CNN.com"
                                                 anchor:cnnsi.com = "CNN"
    "com.foxnews.www"  contents:html = "<html>"  anchor:en.wikipedia.org = "Fox News"
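
A hedged sketch of querying that HBase table without declaring a schema up front. HBase stores raw bytes, so a Drill-style query would likely need an explicit decode step; the CONVERT_FROM call, the table name, and the column-family path syntax here are assumptions about the eventual interface, not something stated in the talk.

    -- Query the webtable-style HBase rows directly; no schema was
    -- registered beforehand. CONVERT_FROM decodes raw bytes to text
    -- (assumed syntax).
    SELECT CONVERT_FROM(row_key, 'UTF8')         AS site,
           CONVERT_FROM(t.contents.html, 'UTF8') AS html
    FROM   hbase.webtable t;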

Flexible and Extensible Architecture


- Apache Drill is designed for extensibility
  - Well-documented APIs and interfaces
- Data sources and file formats
  - Implement a custom scanner to support a new source/format
- Query languages
  - SQL:2003 is the primary language
  - Implement a custom parser to support a Domain Specific Language
  - UDFs
- Optimizers
  - Drill will have a cost-based optimizer
  - Clean surrounding APIs support easy optimizer exploration
- Operators
  - Custom operators can be implemented (e.g. k-means clustering); a usage sketch follows
  - Operator push-down to the data source (RDBMS)
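
To make the custom-operator idea concrete, here is a purely hypothetical sketch of what invoking a registered k-means operator from SQL could look like; neither the function name nor its signature comes from the talk.

    -- Hypothetical: kmeans() is a custom operator registered with Drill,
    -- assigning each row to one of 5 clusters over the listed features.
    SELECT u.user_id,
           kmeans(u.age, u.spend, 5) AS cluster_id
    FROM   mongo.users u;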
