Teacher
Occasional Engineer
TODAY
CONFIDENTIAL
Understand our organizational needs for data science
Infrastructure: Technological tools and platforms.
Talent: Staff hired and trained.
Capabilities: Data science techniques utilized.
Hadoop Scripting Visualization NoSQL MapReduce Clustering Analytics Data Exploration Categorization SQL/MPP Real Time
ANALYTICS TOOLS
DATA
Presenter
Gigabytes from Stack Overflow
Questions from users
With metadata
Audience
Follow along
Thinking about your data
To learn in a
STEP 1: EXPLORE
Patterns through Hive
select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions;
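For readers following along without a cluster, the same kind of summary can be sketched in plain Python on a small sample. This is only an illustration: the rows below are made up, the column names are taken from the query above, and the equal-width histogram is a simplification (Hive's histogram_numeric chooses its own bin centers).

```python
import math
from statistics import mean, stdev

# Hypothetical sample rows; only the column names come from the Hive query.
questions = [
    {"has_code": 1, "body_count": 120, "reputation": 50, "owner_questions": 3},
    {"has_code": 0, "body_count": 40, "reputation": 10, "owner_questions": 1},
    {"has_code": 1, "body_count": 300, "reputation": 900, "owner_questions": 12},
    {"has_code": 0, "body_count": 80, "reputation": 200, "owner_questions": 4},
]


def pearson(xs, ys):
    """Pearson correlation coefficient, the quantity corr() reports."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def equal_width_histogram(values, bins):
    """Counts per equal-width bin (a simplification of histogram_numeric)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # avoid a zero width when all values match
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def summarize(rows, bins=10):
    """Mirror the aggregates in the query, one dictionary key per column."""
    body = [r["body_count"] for r in rows]
    return {
        "total": len(rows),
        "sum_has_code": sum(r["has_code"] for r in rows),
        "avg_body_count": mean(body),
        "stddev_samp_body_count": stdev(body),  # sample standard deviation
        "corr": pearson(
            [r["reputation"] for r in rows],
            [r["owner_questions"] for r in rows],
        ),
        "histogram": equal_width_histogram(body, bins),
    }


summary = summarize(questions)
```

Running the local version on a sample first can be a cheap way to sanity-check what each aggregate means before launching the query on the full table.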
Summaries of unstructured data
Time-since metrics
select transform() using python
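A script used with select transform() reads tab-separated rows on standard input and writes tab-separated rows on standard output. The sketch below shows one way a time-since metric could be computed that way. The two input columns (user id, timestamp), the timestamp format, and the function names are assumptions for illustration, not taken from the deck, and the rows are assumed to arrive grouped by user and sorted by time.

```python
import sys
from datetime import datetime


def time_since(lines):
    """Yield each row with an extra seconds-since-previous-question column.

    Assumes tab-separated (user_id, timestamp) rows, grouped by user and
    sorted by time before they reach the script.
    """
    prev_user, prev_time = None, None
    for line in lines:
        user_id, stamp = line.rstrip("\n").split("\t")
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if user_id == prev_user:
            gap = str(int((when - prev_time).total_seconds()))
        else:
            gap = "\\N"  # Hive reads a cell containing only \N as NULL
        yield "\t".join([user_id, stamp, gap])
        prev_user, prev_time = user_id, when


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point a streaming script would call when run by Hive."""
    for row in time_since(stdin):
        stdout.write(row + "\n")


# Demo on in-memory rows; a real script would call main() instead.
sample = [
    "u1\t2012-01-01 00:00:00\n",
    "u1\t2012-01-01 00:01:00\n",
    "u2\t2012-01-02 00:00:00\n",
]
result = list(time_since(sample))
```

Keeping the row logic in a generator separate from the I/O makes the script easy to test locally before it is shipped to the cluster.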
SVD
Random Forests
Bootstrapping
Localizing
Linear models with SGD
Neural networks
NoSQL MapReduce Clustering Analytics Exploration Categorization SQL/MPP Basic Modeling Continuous Real Time PhD Math Text Analysis
Advanced Parallelization
Bagging a Model
Single R model, run many times over samples, with the results aggregated
m <- C5.0(status ~ )
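The bagging idea can be sketched in a few lines of Python. This is a stand-in, not the deck's R code: a one-feature threshold rule replaces the C5.0 tree, and the feature values and the "open"/"closed" status labels are invented for the example. Each model is trained on a sample drawn with replacement, and predictions are combined by majority vote.

```python
import random
from collections import Counter


def train_stump(rows):
    """Fit a threshold rule on (x, label) rows: label_hi when x >= threshold."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for label_hi, label_lo in (("open", "closed"), ("closed", "open")):
            correct = sum(
                (label_hi if x >= threshold else label_lo) == y
                for x, y in rows
            )
            if best is None or correct > best[0]:
                best = (correct, threshold, label_hi, label_lo)
    _, threshold, label_hi, label_lo = best
    return lambda x: label_hi if x >= threshold else label_lo


def bag(rows, n_models=25, seed=0):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        models.append(train_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Hypothetical training data: one numeric feature and a status label.
data = [
    (1, "closed"), (2, "closed"), (3, "closed"), (4, "closed"),
    (6, "open"), (7, "open"), (8, "open"), (9, "open"),
]
predict = bag(data)
```

Because each model only sees its own sample, the training runs are independent of one another, which is what makes this pattern a natural fit for running in parallel.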
TEXT ANALYSIS
Is "the big dog" really different from "dog is big"?
How about "I like eggs but hate tofu" and "I hate eggs but like tofu"?
Language has lexical and syntactical features
Different techniques leverage these in different ways
Bag of Words: Structure doesn't matter
n-gram: Structure matters (but not that much)
Feature Extraction: BACON! BACON! BACON!
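The eggs and tofu sentences make the difference concrete. In the minimal sketch below (whitespace tokenization and the helper names are simplifying assumptions), the two sentences have identical bags of words but different bigrams.

```python
from collections import Counter


def tokens(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Word counts only; word order is discarded."""
    return Counter(tokens(text))


def ngrams(text, n=2):
    """Counts of n consecutive words; local word order is kept."""
    words = tokens(text)
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = ngrams(a) == ngrams(b)
```

Here same_bag is True and same_bigrams is False: the bag-of-words view cannot tell the two opinions apart, while bigrams such as ("like", "eggs") and ("hate", "eggs") can.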
Hadoop Scripting Visualization NoSQL MapReduce Clustering Analytics Exploration Categorization SQL/MPP Basic Modeling Continuous Real Time PhD Math Text Analysis
Information Gain
Similar to Hadoop's Word Count
MR Job 3: Build a classifier over the newly structured data (prior slides)
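Information gain for a word can be computed from counts of how often the word appears with each label, which is where the resemblance to word count comes from. The sketch below runs on a tiny in-memory data set rather than as MapReduce jobs; the documents, labels, and function names are invented for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(docs, word):
    """Reduction in label entropy from knowing whether `word` is present.

    docs is a list of (set_of_words, label) pairs.
    """
    labels = [label for _, label in docs]
    with_word = [label for words, label in docs if word in words]
    without_word = [label for words, label in docs if word not in words]
    remainder = sum(
        len(part) / len(docs) * entropy(part)
        for part in (with_word, without_word)
        if part  # skip an empty side of the split
    )
    return entropy(labels) - remainder


# Hypothetical documents: "bacon" separates the labels, "eggs" does not.
docs = [
    ({"bacon", "eggs"}, "closed"),
    ({"bacon", "tofu"}, "closed"),
    ({"eggs", "hive"}, "open"),
    ({"tofu", "hive"}, "open"),
]
gain_bacon = information_gain(docs, "bacon")
gain_eggs = information_gain(docs, "eggs")
```

In this toy data "bacon" carries one full bit of information about the label and "eggs" carries none, so a feature-selection step would keep the first and drop the second.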
Ensembling
Join many models together
By using their output
As input to an ensemble model.
Best when models perform differently
nonlinearities
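A toy version of this idea: two base models each look at a different feature, and an ensemble model is trained on their outputs. Everything here is an assumption made for illustration, including the XOR-style data, the base models, and the lookup-table ensemble model, which is one simple choice among many.

```python
from collections import Counter, defaultdict


def model_a(row):
    """Hypothetical base model that only looks at the first feature."""
    return row[0]


def model_b(row):
    """Hypothetical base model that only looks at the second feature."""
    return row[1]


def train_ensemble(base_models, rows, labels):
    """Learn a table from base-model outputs to the majority label."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(model(row) for model in base_models)
        votes[key][label] += 1
    table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
    default = Counter(labels).most_common(1)[0][0]  # for unseen output combos

    def predict(row):
        key = tuple(model(row) for model in base_models)
        return table.get(key, default)

    return predict


# Toy data: the label is the XOR of the two features.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
labels = [a ^ b for a, b in rows]

ensemble = train_ensemble([model_a, model_b], rows, labels)
predictions = [ensemble(row) for row in rows[:4]]
accuracy_a = sum(model_a(r) == y for r, y in zip(rows, labels)) / len(rows)
```

Each base model alone is right only half the time on this data, yet the ensemble model recovers the label exactly because the two models err in different places. In practice the ensemble model would usually be trained on held-out predictions rather than on the same rows the base models saw.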
Questions?
www.thinkbiganalytics.com @danmallinger