
June 2013

BIG DATA SCIENCE: A PATH FORWARD

Data Science Lead @ Think Big
Product/Brand Obsessive
Teacher
Occasional Engineer

linkedin.com/in/danmallinger/
@danmallinger
www.thinkbiganalytics.com


CONFIDENTIAL |

TODAY

A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.


INFRASTRUCTURE, TALENT, & CAPABILITIES

Understand our organizational needs for data science:
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.

Hadoop, Scripting, Visualization, NoSQL, MapReduce, Clustering, Analytics, Data Exploration, Categorization, SQL/MPP, Real Time, Basic Modeling, PhD Math, Continuous Models, Text Analysis


ANALYTICS TOOLS

Boxed Solutions: Mahout & Platform
Toolkits: RHadoop, Scikit, etc.

You will need toolkits to solve unique problems, but smart techniques make that easier.
Boxed solutions are limited but can be a good source of early velocity.


DATA

Presenter
Gigabytes from Stack Overflow
Questions from users, with metadata
Users have reputations
Questions open or closed

Audience
Follow along
Thinking about your data
To learn in a familiar context and plan


STEP 1: EXPLORE

Patterns through Hive

select
    count(1) as total
  , sum(has_code)
  , avg(body_count)
  , stddev_samp(body_count)
  , corr(reputation, owner_questions)
  , histogram_numeric(body_count, 10)
from questions
;

Patterns through Tableau


STEP 2: FEATURE BUILDING


SQL Windowing: Cross-Record Features

Summaries of unstructured data

Clustering: Browsing cohorts
    /bin/mahout canopy

Time-since metrics
    select transform() using python
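
The `select transform() using python` pattern streams each Hive row through a script as a tab-separated line and reads tab-separated rows back. A minimal sketch of such a script follows. It assumes a two-column input of question id and body, and it guesses at how features like `has_code` and `body_count` might be derived (a `<code>` tag check and a whitespace token count). Those definitions, the column layout, and the function names are illustrative assumptions, not the presenter's code.

```python
import sys


def summarize(line):
    """Turn one tab-separated (question_id, body) row into summary features."""
    question_id, body = line.rstrip("\n").split("\t", 1)
    has_code = 1 if "<code>" in body else 0   # assumed marker for a code block
    body_count = len(body.split())            # assumed meaning: whitespace tokens
    return "\t".join([question_id, str(has_code), str(body_count)])


def run(stream_in, stream_out):
    """Read rows from stream_in and write one feature row per input row."""
    for line in stream_in:
        if line.strip():                      # skip empty lines
            stream_out.write(summarize(line) + "\n")


# When used from Hive, the script's last line would be:
#     run(sys.stdin, sys.stdout)
```

With that final call added and the file saved as, say, `summarize.py` (a hypothetical name), Hive could invoke it along the lines of `add file summarize.py;` followed by `select transform(id, body) using 'python summarize.py' as (id, has_code, body_count) from questions;`, where `id` and `body` are assumed column names.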


PARALLEL MODELS IN HADOOP


Sample (don't parallelize)
Naturally parallel

SVD, Random Forests, Bootstrapping, Localizing, Linear models with SGD, Neural networks

Estimators and Ensembles

Advanced Parallelization

STEP 3: STRUCTURED MODEL (BAGGING)


Mapper 1:
    Define n reducer keys
    Send any record to reducer i with probability p
Reducer 1:
    Key: Id of sample
    Value: List of records
    Perform analysis over records
Reducer 2:
    Key: One
    Value: List of models
    Aggregate the models (e.g. average)

Bagging a Model

Single R model run many times over samples and aggregated

m <- C5.0(status ~ )
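
The mapper and two reducers of the bagging job can be simulated in memory to show the data flow. The sketch below is an assumption-laden stand-in: the slide fits a C5.0 model in R inside the reducer, whereas this example swaps in a one-variable least-squares line so that "aggregate the models" reduces to averaging two coefficients. All function names and defaults are hypothetical.

```python
import random
from collections import defaultdict


def mapper(records, n_samples, p, rng):
    """Mapper 1: send each record to each of n reducer keys with probability p."""
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit_line(records):
    """Reducer 1 analysis (stand-in model): least-squares fit of y = a + b*x."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in records)
    if var_x == 0:
        return None  # sample too small or degenerate to fit
    b = sum((x - mean_x) * (y - mean_y) for x, y in records) / var_x
    return mean_y - b * mean_x, b


def aggregate(models):
    """Reducer 2: average the coefficients of all fitted models."""
    a = sum(m[0] for m in models) / len(models)
    b = sum(m[1] for m in models) / len(models)
    return a, b


def bag(records, n_samples=10, p=0.5, seed=0):
    """Run the whole job: sample, fit one model per sample, then aggregate."""
    rng = random.Random(seed)
    groups = defaultdict(list)  # stands in for the shuffle: group records by key
    for sample_id, record in mapper(records, n_samples, p, rng):
        groups[sample_id].append(record)
    models = [m for m in map(fit_line, groups.values()) if m is not None]
    return aggregate(models)
```

Calling `bag([(x, 2 * x + 1) for x in range(20)])` returns an intercept and slope close to 1 and 2. Averaging coefficients only makes sense for models like this one; for tree models such as C5.0, aggregation would more plausibly average or vote over predictions.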


WHERE ARE WE?

We've created a structured model to flag questions that won't be closed using Big Data. But we haven't used unstructured data.


TEXT ANALYSIS
Is "the big dog" really different from "dog is big"?
How about "I like eggs but hate tofu" and "I hate eggs but like tofu"?
Language has lexical and syntactical features
Different techniques leverage these in different ways
Bag of Words: Structure doesn't matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
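
The eggs-and-tofu example can be made concrete with a few lines of Python. This is a minimal sketch assuming lowercase whitespace tokenization; the function names are illustrative. It shows that the two sentences are identical as bags of words but differ once bigrams keep some word order.

```python
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def bag_of_words(text):
    """Count tokens, ignoring their order."""
    return Counter(tokens(text))


def ngrams(text, n=2):
    """Count runs of n consecutive tokens, which preserves local order."""
    words = tokens(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


first = "I like eggs but hate tofu"
second = "I hate eggs but like tofu"

same_bag = bag_of_words(first) == bag_of_words(second)   # order is ignored
same_bigrams = ngrams(first) == ngrams(second)           # order now matters
```

Here `same_bag` is true and `same_bigrams` is false: the bigram `("like", "eggs")` appears only in the first sentence.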

STEP 4: UNSTRUCTURED MODEL


MR Job 1: Calculate information gain (IG) for all tokens.
MR Job 2: Select tokens with largest IG. Create structured data for record, tokens:
    question #4 | 0 | 1 | 0 | 1 | 1
MR Job 3: Build a classifier over the newly structured data (prior slides)

Information Gain
Similar to Hadoop's Word Count
Create counts for token/category pairs
Use counts to calculate Information Gain
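
The information gain step can be sketched on a single machine: count token/category pairs as Word Count would, then turn the counts into IG. The sketch below assumes IG is computed on token presence or absence per document and measured in bits. The function names and the tie-breaking rule in `top_tokens` are illustrative choices, not taken from the slides.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a list of category counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs):
    """docs: iterable of (tokens, category). Returns {token: IG in bits}."""
    docs = [(set(toks), cat) for toks, cat in docs]
    cat_counts = Counter(cat for _, cat in docs)
    # The Word Count-like step: one count per (token, category) pair.
    pair_counts = Counter((tok, cat) for toks, cat in docs for tok in toks)
    n = len(docs)
    base = entropy(cat_counts.values())
    gains = {}
    for tok in {tok for tok, _ in pair_counts}:
        with_tok = [pair_counts[(tok, cat)] for cat in cat_counts]
        without = [cat_counts[cat] - pair_counts[(tok, cat)] for cat in cat_counts]
        n_with = sum(with_tok)
        gains[tok] = (base
                      - n_with / n * entropy(with_tok)
                      - (n - n_with) / n * entropy(without))
    return gains


def top_tokens(gains, k):
    """Select the k tokens with the largest IG (ties broken alphabetically)."""
    return sorted(gains, key=lambda t: (-gains[t], t))[:k]


def to_row(toks, selected):
    """Structured record: 1 if the selected token is present, else 0."""
    present = set(toks)
    return [1 if t in present else 0 for t in selected]
```

A token that appears in every closed question and no open one gets the maximum gain, while a token spread evenly across both categories gets zero. The rows from `to_row` have the same shape as the `question #4 | 0 | 1 | 0 | 1 | 1` record above.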

WHERE ARE WE?

We've created two models: one structured, one unstructured. But they don't work together.


STEP 5: ENSEMBLE MODEL


Mapper 1:
    Load multiple models
    Score the models per record and output
Reducer 1:
    Key: Id of record
    Value: List of model outputs
    Join model outputs to make new records
MR Job 2: Build a model over the output data as if it was raw data.

Ensembling
Join many models together by using their output as input to an ensemble model.
Best when models perform differently.
Exploit differences with nonlinearities, like interaction effects.
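
The ensembling flow can be sketched in memory: score each record with every model, join the outputs by record id, then fit a second-stage model on the joined outputs. In the sketch below, the second stage is a deliberately simple stand-in, a lookup table holding the majority label for each combination of model outputs. It is nonlinear and so can capture interactions between models, but it is an assumption rather than the method used in the talk. All names are hypothetical.

```python
from collections import Counter, defaultdict


def score_records(records, models):
    """Mapper 1: score every record with every model, keyed by record id."""
    for record_id, features in records:
        for name, model in models.items():
            yield record_id, (name, model(features))


def join_outputs(scored, model_names):
    """Reducer 1: join the model outputs for each record into one new record."""
    by_record = defaultdict(dict)
    for record_id, (name, output) in scored:
        by_record[record_id][name] = output
    return {rid: tuple(outs[name] for name in model_names)
            for rid, outs in by_record.items()}


def fit_ensemble(joined, labels):
    """MR Job 2 (stand-in): majority label per combination of model outputs."""
    votes = defaultdict(Counter)
    for rid, outputs in joined.items():
        votes[outputs][labels[rid]] += 1
    return {outputs: counts.most_common(1)[0][0]
            for outputs, counts in votes.items()}


# Two toy base models, standing in for the structured and text models.
models = {
    "structured": lambda f: int(f["reputation"] > 100),
    "text": lambda f: int("homework" in f["body"]),
}
records = [
    (1, {"reputation": 500, "body": "python error"}),
    (2, {"reputation": 5, "body": "homework please"}),
    (3, {"reputation": 500, "body": "homework help"}),
    (4, {"reputation": 5, "body": "python question"}),
]
labels = {1: "open", 2: "closed", 3: "open", 4: "open"}

joined = join_outputs(score_records(records, models), ["structured", "text"])
ensemble = fit_ensemble(joined, labels)
```

In this toy data, the text model's signal only predicts "closed" when the structured model's output is 0, which is the kind of interaction effect a nonlinear ensemble can pick up and a simple average of the two models cannot.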

WHERE ARE WE?

We've created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.


HOW DID WE GET HERE?

This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.


Questions?
www.thinkbiganalytics.com @danmallinger
