Big Data Science

June 2013
BIG DATA SCIENCE: A PATH FORWARD
Data Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer
linkedin.com/in/danmallinger/ @danmallinger www.thinkbiganalytics.com

CONFIDENTIAL |
TODAY
High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.
INFRASTRUCTURE, TALENT, & CAPABILITIES
Understand our organizational needs for data science
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.

Infrastructure | Talent           | Capabilities
Hadoop         | Scripting        | Visualization
NoSQL          | MapReduce        | Clustering
Analytics      | Data Exploration | Categorization
SQL/MPP        | Basic Modeling   | Continuous Models
Real Time      | PhD Math         | Text Analysis
ANALYTICS TOOLS
Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.

You will need toolkits to solve unique problems, but smart techniques make that easier.
Boxed solutions are limited but can be a good source of early velocity.
DATA
Presenter
Gigabytes from Stack Overflow
Questions from users
With metadata
Users have reputations
Questions open or closed

Audience
Follow along
Thinking about your data
To learn in a familiar context and plan
STEP 1: EXPLORE
Patterns through Hive

    select
        count(1) as total,
        sum(has_code),
        avg(body_count),
        stddev_samp(body_count),
        corr(reputation, owner_questions),
        histogram_numeric(body_count, 10)
    from questions;

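For a quick check on a small local sample, the same kind of summary can be sketched in Python. This is a minimal illustration, not the deck's method: the sample rows are made up, the field names are borrowed from the Hive query, and the equal-width histogram is only a simple stand-in for Hive's approximate histogram_numeric.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation, the statistic Hive's corr() reports."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def equal_width_histogram(values, bins):
    """Count values per equal-width bin (stand-in for histogram_numeric)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # top edge goes in last bin
        counts[idx] += 1
    return counts

def summarize(rows):
    """Compute the same families of statistics as the Hive query."""
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),
        "with_code": sum(r["has_code"] for r in rows),
        "avg_body": statistics.fmean(body),
        "sd_body": statistics.stdev(body),  # sample std dev, like stddev_samp
        "corr": pearson([r["reputation"] for r in rows],
                        [r["owner_questions"] for r in rows]),
        "hist": equal_width_histogram(body, 10),
    }

# Hypothetical sample rows, for illustration only.
SAMPLE = [
    {"has_code": 1, "body_count": 100, "reputation": 10, "owner_questions": 1},
    {"has_code": 0, "body_count": 200, "reputation": 20, "owner_questions": 2},
    {"has_code": 1, "body_count": 300, "reputation": 30, "owner_questions": 3},
    {"has_code": 1, "body_count": 400, "reputation": 40, "owner_questions": 5},
]

summary = summarize(SAMPLE)
```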
Patterns through Tableau

STEP 2: FEATURE BUILDING

SQL Windowing
Cross-Record Features
Summaries of unstructured data
Clustering: Browsing cohorts
/bin/mahout canopy
Time-since metrics
select transform() using python
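Hive's select transform() streams rows to a script as tab-separated lines on stdin and reads tab-separated lines back from stdout. A minimal sketch of such a script, summarizing an unstructured question body into two structured features, might look like this. The input columns (id, body) and the rule that code appears inside <code> tags are assumptions made for illustration.

```python
import sys

def summarize_body(body):
    """Summarize one unstructured question body as two structured features."""
    body_count = len(body.split())            # whitespace-delimited word count
    has_code = 1 if "<code>" in body else 0   # assumption: code sits in <code> tags
    return body_count, has_code

def transform(lines):
    """Turn tab-separated (id, body) rows into (id, body_count, has_code) rows."""
    for line in lines:
        question_id, body = line.rstrip("\n").split("\t", 1)
        body_count, has_code = summarize_body(body)
        yield f"{question_id}\t{body_count}\t{has_code}"

def main():
    """Entry point when Hive streams rows through the script."""
    for row in transform(sys.stdin):
        print(row)
```

In Hive this could be wired up with "add file features.py;" followed by "select transform(id, body) using 'python features.py' as (id, body_count, has_code) from questions;". Here features.py is a hypothetical file name, and the script would end with a call to main().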

PARALLEL MODELS IN HADOOP

Sample (don't parallelize)
Naturally parallel

SVD, Random Forests, Bootstrapping, Localizing, Linear models with SGD, Neural networks
Estimators and Ensembles
Advanced Parallelization
STEP 3: STRUCTURED MODEL (BAGGING)

Mapper 1:
  Define n reducer keys
  Send any record to reducer i with probability p
Reducer 1:
  Key: Id of sample
  Value: List of records
  Perform analysis over records
Reducer 2:
  Key: One
  Value: List of models
  Aggregate the models (e.g. average)
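The mapper and two reducers above can be simulated locally in plain Python. This is a sketch under stated assumptions: the "model" fitted per sample is a deliberately trivial stand-in (the share of closed questions), the "closed" field is hypothetical, and the shuffle helper imitates the grouping that MapReduce does between stages.

```python
import random
from collections import defaultdict

def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of the n sample keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record

def shuffle(pairs):
    """Group values by key, as the MapReduce shuffle would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer_fit(sample_id, records):
    """Reducer 1: fit one model per sample.

    Stand-in model: the fraction of closed questions in the sample.
    Every model is emitted under one constant key for the next stage.
    """
    closed = sum(r["closed"] for r in records)
    return "models", closed / len(records)

def reducer_aggregate(models):
    """Reducer 2: aggregate the per-sample models by averaging them."""
    return sum(models) / len(models)

def bagged_model(records, n_samples=10, p=0.5, seed=0):
    """Run the two-stage job end to end on an in-memory list of records."""
    rng = random.Random(seed)
    samples = shuffle(mapper(records, n_samples, p, rng))
    models = shuffle(reducer_fit(k, v) for k, v in samples.items())
    return reducer_aggregate(models["models"])
```

Note that sending each record to each sample with probability p draws samples without replacement, which follows the slide rather than a strict bootstrap.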
Bagging a Model

Single R model run many times over samples and aggregated
m <- C5.0(status ~ )
WHERE ARE WE?
We've created a structured model to flag questions that won't be closed using Big Data. But we haven't used unstructured data.

TEXT ANALYSIS
Is "the big dog" really different from "dog is big"?
How about "I like eggs but hate tofu" and "I hate eggs but like tofu"?
Language has lexical and syntactical features
Different techniques leverage these in different ways
Bag of Words: Structure doesn't matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
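The eggs and tofu example can be checked directly. The sketch below uses a naive lowercase, whitespace tokenizer (an assumption, real text needs more care) to show that the two sentences have identical bags of words but different bigrams.

```python
from collections import Counter

def tokens(text):
    """Naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def bag_of_words(text):
    """Token counts with word order discarded."""
    return Counter(tokens(text))

def ngrams(text, n=2):
    """Counts of consecutive n-token windows, which keep local word order."""
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # True: same words, same counts
same_bigrams = ngrams(a) == ngrams(b)           # False: ("like", "eggs") vs ("hate", "eggs")
```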
STEP 4: UNSTRUCTURED MODEL

MR Job 1: Calculate information gain (IG) for all tokens.
MR Job 2: Select tokens with largest IG. Create structured data for record, tokens:
question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3: Build a classifier over the newly structured data (prior slides)

Information Gain
Similar to Hadoop's Word Count
Create counts for token/category pairs
Use counts to calculate Information Gain
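The information gain steps above can be sketched in a single process. Information gain is taken here as the entropy of the category minus its entropy conditioned on whether the token is present. The category labels, function names, and the alphabetical tie-break are assumptions for illustration.

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def token_category_counts(docs):
    """Word-count style pass: count (token, category) pairs and categories."""
    pair_counts, category_counts = Counter(), Counter()
    for toks, category in docs:
        category_counts[category] += 1
        for tok in set(toks):  # count each token once per document
            pair_counts[tok, category] += 1
    return pair_counts, category_counts

def information_gain(token, pair_counts, category_counts):
    """H(category) - H(category | token present or absent)."""
    total = sum(category_counts.values())
    with_tok = [pair_counts[token, c] for c in category_counts]
    without_tok = [category_counts[c] - pair_counts[token, c] for c in category_counts]
    conditional = 0.0
    for part in (with_tok, without_tok):
        if sum(part):
            conditional += sum(part) / total * entropy(part)
    return entropy(category_counts.values()) - conditional

def select_tokens(pair_counts, category_counts, k):
    """Keep the k tokens with the largest IG (ties broken alphabetically)."""
    vocab = {tok for tok, _ in pair_counts}
    ranked = sorted(vocab, key=lambda t: (-information_gain(t, pair_counts, category_counts), t))
    return ranked[:k]

def to_row(toks, selected):
    """Structured record: one 0/1 flag per selected token."""
    return [1 if t in toks else 0 for t in selected]
```

In a MapReduce setting, token_category_counts corresponds to the counting job, and to_row produces the kind of 0/1 record shown for question #4.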
WHERE ARE WE?
We've created two models: one structured, one unstructured. But they don't work together.

STEP 5: ENSEMBLE MODEL

Mapper 1:
  Load multiple models
  Score the models per record and output
Reducer 1:
  Key: Id of record
  Value: List of model outputs
  Join model outputs to make new records
MR Job 2: Build a model over the output data as if it were raw data.

Ensembling
Join many models together by using their output as input to an ensemble model.
Best when models perform differently.
Exploit differences with nonlinearities, like interaction effects.
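The ensemble flow above can be sketched locally as follows. The assumptions are loud ones: the base models are hypothetical callables that return a score per record, the ensemble is a small logistic regression trained by gradient descent (one possible choice, not necessarily the deck's), and the product of the two scores stands in for an interaction effect.

```python
import math
from collections import defaultdict

def score_mapper(records, models):
    """Mapper 1: score every loaded model on every record, keyed by record id."""
    for record in records:
        for name, model in models.items():
            yield record["id"], (name, model(record))

def join_reducer(scored, model_names):
    """Reducer 1: join the model outputs for each record id into one new record."""
    by_id = defaultdict(dict)
    for record_id, (name, score) in scored:
        by_id[record_id][name] = score
    return {rid: [outs[n] for n in model_names] for rid, outs in by_id.items()}

def features(outputs):
    """Bias, both model outputs, and their product (a simple interaction effect)."""
    a, b = outputs
    return [1.0, a, b, a * b]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ensemble(joined, labels, epochs=3000, lr=0.5):
    """MR Job 2 stand-in: logistic regression over the joined model outputs."""
    ids = sorted(joined)
    xs = [features(joined[i]) for i in ids]
    ys = [labels[i] for i in ids]
    w = [0.0] * 4
    for _ in range(epochs):
        grad = [0.0] * 4
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(xs) for wi, g in zip(w, grad)]  # mean-gradient step
    return w

def predict(w, outputs):
    """Ensemble probability for one record's pair of model outputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features(outputs))))
```

Treating the joined scores as if they were raw data keeps the second job identical in shape to any other model-building job.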
WHERE ARE WE?
We've created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.

HOW DID WE GET HERE?
This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.

Questions?
www.thinkbiganalytics.com @danmallinger