Summary

Technique

Ordinary linear regression

Application

Predicting a numeric quantity. The independent variables can be numeric as well as categorical.

Alternative

Support vector regression and neural network

Assumption

Normality, independence (i.i.d.) and homoscedasticity of the residuals.
If an assumption is violated, the Box-Cox method can be used to transform the variable.
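The Box-Cox transformation mentioned above can be sketched in a few lines. This is a minimal illustration, assuming strictly positive values and a lambda chosen by hand; the names `box_cox`, `skewed`, `log_scale` and `sqrt_like` are made up for this example. In practice lambda is usually estimated from the data rather than fixed.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform of strictly positive values for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        # lambda = 0 is defined as the natural log
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]

# Hypothetical right-skewed response values
skewed = [1.0, 2.0, 4.0, 8.0, 16.0]
log_scale = box_cox(skewed, 0)     # log transform
sqrt_like = box_cox(skewed, 0.5)   # behaves like a shifted, scaled square root
```

After transforming, the regression is refitted on the transformed variable and the residual assumptions are checked again.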

Limitations

The relationship between the predictors and the response should be linear; otherwise SVM or a neural network can be used.
Multicollinearity among the independent variables needs to be checked.

Accuracy measure

R-squared and adjusted R-squared can be used
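As a rough sketch of how these two measures are computed, the example below fits a one-predictor least-squares line and evaluates both. The data and the function names are invented for illustration; `p` is the number of predictors.

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(y, y_hat):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Penalizes R-squared for the number of predictors p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]
r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n=len(x), p=1)
print(slope, intercept, r2, adj_r2)
```

Adjusted R-squared is the safer choice when comparing models with different numbers of predictors, since plain R-squared never decreases when a variable is added.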

Key terms

Ordinary linear regression, residuals, transformation of variables, multicollinearity, outliers and influential points, Box-Cox, normal distribution, PCA

Sources to study

NPTEL notes by Prof. Shalabh from IIT Kanpur (theory)
Regression Models course on Coursera by Johns Hopkins (implementation in R)

Technique

Logistic Regression

Application

Predicting a categorical variable. The independent variables can be numeric as well as categorical.

Alternative

Support vector machine, neural network, decision tree, naive Bayes

Assumption

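Independent observations and a linear relationship between the predictors and the log-odds of the outcome. Unlike ordinary linear regression, normality and homoscedasticity of the residuals are not required.
If the linearity assumption is violated, the predictors can be transformed.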

Limitations

The log-odds of the outcome should depend linearly on the predictors; otherwise SVM or a neural network can be used.
Multicollinearity among the independent variables needs to be checked.

Accuracy measure

A confusion matrix can be used when the model is used as a classifier
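A minimal sketch of building a confusion matrix from predicted probabilities follows. The 0.5 cutoff, the labels and the probabilities are assumptions made for illustration, not values from the training material.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts

# Hypothetical labels and predicted probabilities from a logistic model
actual = [1, 0, 1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]  # assumed cutoff

cm = confusion_matrix(actual, predicted)
accuracy = (cm["tp"] + cm["tn"]) / len(actual)
precision = cm["tp"] / (cm["tp"] + cm["fp"])
recall = cm["tp"] / (cm["tp"] + cm["fn"])
print(cm, accuracy, precision, recall)
```

Moving the cutoff trades false positives against false negatives, so the matrix should be read together with the cutoff that produced it.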

Key terms

Logistic regression as a classifier, residuals, transformation of variables, multicollinearity, outliers and influential points, Box-Cox, normal distribution, PCA

Sources to study

NPTEL notes by Prof. Shalabh from IIT Kanpur (theory)
Regression Models course on Coursera by Johns Hopkins (implementation in R)

Technique

Poisson Regression

Application

Predicting the number of events in a given time period (an event can be the failure of a machine)

Alternative

Assumption

The probability of an event occurring should be very small

Limitations

Cannot predict when the event will occur

Accuracy measure

AIC, BIC
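To make AIC and BIC concrete, here is a small sketch for the simplest possible Poisson model: a single constant rate with no predictors, estimated by the sample mean. The weekly failure counts are invented; a real Poisson regression would estimate one coefficient per predictor through a log link.

```python
import math

def poisson_log_likelihood(counts, rate):
    """Log-likelihood of the counts under a Poisson distribution with the given rate."""
    return sum(c * math.log(rate) - rate - math.lgamma(c + 1) for c in counts)

def aic(log_lik, k):
    """Akaike information criterion for a model with k parameters."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical machine failures per week
failures_per_week = [0, 1, 0, 2, 1, 0, 3, 1]
rate_hat = sum(failures_per_week) / len(failures_per_week)  # MLE of the rate
log_lik = poisson_log_likelihood(failures_per_week, rate_hat)
aic_value = aic(log_lik, k=1)
bic_value = bic(log_lik, k=1, n=len(failures_per_week))
print(rate_hat, aic_value, bic_value)
```

Lower values are better for both criteria, and they are only meaningful when comparing models fitted to the same data.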

Key terms

Generalized linear models, Poisson distribution, link function

Sources to study

NPTEL notes by Prof. Shalabh from IIT Kanpur (theory)
Regression Models course on Coursera by Johns Hopkins (implementation in R)

Technique

Analysis of variance ( ANOVA )

Application

To determine the significance of an independent categorical variable for a numeric output.

Alternative

Neural network

Assumption

Normality, independence (i.i.d.) and homoscedasticity within the groups.
The Kruskal-Wallis test can be used if the assumptions are violated.

Limitations

The independent variable must be categorical.

Accuracy measure

F statistic
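The F statistic for a one-way ANOVA is the ratio of between-group to within-group mean squares. The sketch below computes it by hand on made-up groups; the function name and data are illustrative only, and the p-value lookup against the F distribution is left out.

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical measurements at three levels of a factor
levels = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(levels)
print(f_stat)
```

A large F suggests that at least one group mean differs; a post-hoc test such as Tukey HSD is then used to find which ones.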

Key terms

Design of experiments, levels in an experiment, Tukey HSD

Sources to study

Kutner, Applied Linear Statistical Models (book)


R code from http://mgmt.iisc.ernet.in/CM/MG221/Handouts.html

Technique

Association rules & Association Sequence

Application

Used in the retail industry to find associations between products.

Alternative

Decision tree, naive Bayes

Assumption

No assumptions

Limitations

All variables should be categorical.
Numeric variables have to be binned into categories first.

Accuracy measure

Support, confidence and lift
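The three measures can be computed directly from a list of transactions. This is a minimal sketch on an invented market basket; the item names and helper functions are assumptions for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
rule_lift = lift(baskets, {"bread"}, {"milk"})
print(rule_support, rule_confidence, rule_lift)
```

A lift above 1 means the items occur together more often than expected if they were independent; a lift below 1, as in this toy data, means less often.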

Key terms

Association sequence, support, confidence, lift, market basket analysis

Sources to study

http://www-users.cs.umn.edu/~kumar/dmbook/index.php ( theory )
R code is provided during training

Technique

Clustering

Application

Used to group similar data points together.

Alternative

Assumption

No assumptions

Limitations

If the dataset is large, hierarchical clustering cannot be used and finding the number of clusters is difficult.
With redundant variables, clear clusters are not visible in large data.

Accuracy measure

Tuning parameter and within-cluster variation
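Within-cluster variation is usually measured as the sum of squared distances from each point to its cluster centroid. The sketch below computes it for a given assignment of points to clusters; the points and labels are invented, and the clustering step itself (for example k-means) is assumed to have been done already.

```python
def within_cluster_ss(points, labels):
    """Sum of squared distances of each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
one_cluster = within_cluster_ss(points, [0, 0, 0, 0])
two_clusters = within_cluster_ss(points, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

Plotting this value against the number of clusters and looking for the knee point is a common way to choose the number of clusters.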

Key terms

K-means, k-medoids, hierarchical, sparse clustering, PCA, feature selection, knee point, sum of squares

Sources to study

Machine Learning course by Stanford University, Prof. Andrew Ng


http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Technique

Decision Tree ( classifier )

Application

Classifier

Alternative

Naive Bayes, logistic regression, SVM, neural network

Assumption

Limitations

The data should be linearly separable.
Interpretation is difficult when the tree is big.
Works with categorical variables only; numeric data is discretized by the decision tree, with some loss of information.

Accuracy measure

Confusion matrix, accuracy, KS statistic, area under the ROC curve
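Area under the ROC curve and the KS statistic both work on the predicted scores rather than on hard class labels. The following is a small sketch under the assumption of binary labels coded 1 and 0; the scores are invented and the function names are illustrative.

```python
def roc_auc(actual, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    # Ties between a positive and a negative count as half a win
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def ks_statistic(actual, scores):
    """Largest gap between the score distributions of positives and negatives."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    largest_gap = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= threshold) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= threshold) / len(neg)
        largest_gap = max(largest_gap, abs(cdf_pos - cdf_neg))
    return largest_gap

# Hypothetical labels and classifier scores
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_scores = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1, 0.8, 0.3]
auc_value = roc_auc(actual_labels, model_scores)
ks_value = ks_statistic(actual_labels, model_scores)
print(auc_value, ks_value)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking; a larger KS value means the scores separate the two classes better.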

Key terms

Entropy, split info, gain ratio
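These three terms fit together as follows: information gain is the drop in entropy after a split, split info is the entropy of the split itself, and gain ratio is their quotient. The sketch below is one way to compute them for a categorical feature; the weather-style data is invented.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of splitting on the feature, divided by its split info."""
    n = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy of the labels after the split
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical feature that separates the classes perfectly
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
perfect = gain_ratio(outlook, play)

# Hypothetical feature that carries no information about the class
coin = ["a", "b", "a", "b"]
useless = gain_ratio(coin, play)
print(perfect, useless)
```

Dividing by split info penalizes features with many distinct values, which plain information gain tends to favour.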

Sources to study

http://www-users.cs.umn.edu/~kumar/dmbook/index.php ( theory )

Technique

Naive Bayes (classifier)

Application

Classifier

Alternative

Decision tree, logistic regression, SVM, neural network

Assumption

All variables must be categorical

Limitations

The data should be linearly separable.
All variables must be independent of each other given the class.
Only categorical variables.

Accuracy measure

Confusion matrix, accuracy, KS statistic, area under the ROC curve

Key terms

Bayes theorem
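A naive Bayes classifier applies Bayes theorem with the assumption that the features are independent given the class. The sketch below is a minimal version for categorical features; the spam/ham data, the Laplace smoothing constant `alpha` and the assumption that every feature has `n_values` = 2 possible values are all choices made for this illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def predict(model, row, alpha=1.0, n_values=2):
    """Posterior class probabilities; n_values is the assumed number of values per feature."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Laplace-smoothed P(value | class), features treated as independent
            score *= (feature_counts[(i, label)][value] + alpha) / (count + alpha * n_values)
        scores[label] = score
    norm = sum(scores.values())
    return {label: s / norm for label, s in scores.items()}

# Hypothetical features: (contains "free", contains "meeting")
rows = [("yes", "no"), ("yes", "no"), ("no", "yes"), ("no", "yes"), ("yes", "yes")]
labels = ["spam", "spam", "ham", "ham", "spam"]
model = train_naive_bayes(rows, labels)
posterior = predict(model, ("yes", "no"))
print(posterior)
```

The smoothing keeps a feature value that was never seen with a class from forcing that class's probability to zero.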

Sources to study

http://www-users.cs.umn.edu/~kumar/dmbook/index.php ( theory )

Technique

SVM and neural network (classifier)

Application

Classifier (non-linear)

Alternative

Assumption

Limitations

Interpretation is very difficult.
Computationally expensive and time-consuming to train.

Accuracy measure

Confusion matrix, accuracy, KS statistic, area under the ROC curve

Key terms

Kernel trick, hidden layers, non-linear classifier, pattern recognition
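The kernel trick means computing an inner product in a higher-dimensional feature space without ever building that space. The sketch below checks this for one specific case, the degree-2 polynomial kernel on 2-D inputs; the vectors are arbitrary and the function names are illustrative.

```python
import math

def polynomial_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel on 2-D inputs: (x . y) ** 2."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def explicit_feature_map(x):
    """The 3-D feature space that the kernel above corresponds to."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)

a = (1.0, 2.0)
b = (3.0, 4.0)

# Kernel evaluated directly in the original 2-D space
kernel_value = polynomial_kernel(a, b)

# Same quantity computed the long way, via the explicit mapping
mapped_value = sum(p * q for p, q in zip(explicit_feature_map(a), explicit_feature_map(b)))
print(kernel_value, mapped_value)
```

Both routes give the same number, which is why an SVM can learn a non-linear boundary while only ever evaluating the kernel on pairs of original points.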

Sources to study

Machine Learning course by Stanford University, Prof. Andrew Ng


http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Technique

Principal component analysis (PCA)

Application

Dimensionality reduction for clustering as well as regression

Alternative

Feature selection

Assumption

Limitations

Interpretation is difficult.

Accuracy measure

Key terms

PCA, multicollinearity, dimensionality reduction
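For two variables, the principal components can be written in closed form from the 2x2 covariance matrix, which makes the link between PCA and multicollinearity easy to see. This is a sketch for the two-variable case only, on invented data; real use involves more variables and a linear algebra library.

```python
import math

def pca_2d(xs, ys):
    """First principal direction and its share of variance for two variables."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    # Eigenvalues of the symmetric 2x2 covariance matrix
    half_trace = (var_x + var_y) / 2.0
    spread = math.sqrt(((var_x - var_y) / 2.0) ** 2 + cov_xy ** 2)
    largest = half_trace + spread
    smallest = half_trace - spread
    if cov_xy == 0:
        # Uncorrelated variables: the axes are already the components
        direction = (1.0, 0.0) if var_x >= var_y else (0.0, 1.0)
    else:
        length = math.hypot(cov_xy, largest - var_x)
        direction = (cov_xy / length, (largest - var_x) / length)
    explained = largest / (largest + smallest)
    return direction, explained

# Hypothetical perfectly collinear predictors (y is exactly 2 * x)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
direction, explained = pca_2d(xs, ys)
print(direction, explained)
```

Here one component carries essentially all of the variance, so the second variable is redundant and can be dropped, which is the dimensionality reduction PCA is used for.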

Sources to study

Machine Learning course by Stanford University, Prof. Andrew Ng


http://www-users.cs.umn.edu/~kumar/dmbook/index.php

Technique

Time series

Application

Autoregression. Used only for time series data.

Alternative

Markov model if sudden jumps are present

Assumption

Stationarity, plus the usual regression assumptions for the residuals

Limitations

Cyclic components and sudden jumps need to be taken into account separately

Accuracy measure

AIC and BIC

Key terms

ARIMA, forecasting
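The autoregressive part of ARIMA can be illustrated with its simplest case, an AR(1) model with no intercept, fitted by least squares. The sketch assumes a stationary, zero-mean series; the series below is invented and decays by exactly one half each step so the result is easy to check.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in y[t] = phi * y[t-1] + error."""
    previous = series[:-1]
    current = series[1:]
    return sum(c * p for c, p in zip(current, previous)) / sum(p * p for p in previous)

# Hypothetical zero-mean series
series = [8.0, 4.0, 2.0, 1.0, 0.5]
phi = fit_ar1(series)

# One-step-ahead forecast from the last observed value
next_value = phi * series[-1]
print(phi, next_value)
```

A full ARIMA model adds differencing (to reach stationarity) and moving-average terms on top of this, and the orders are usually compared with AIC or BIC.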

Sources to study

NPTEL videos by an IISc professor from the Civil Engineering department


Website otext.com

Further readings suggested

Markov models: for predicting sudden jumps in stock market data or in time series data from other domains.
Survival analysis: for predicting when a machine is going to fail. Sometimes the data is censored, for example machine failure data in which some machines had not yet failed when the data was collected.
Data envelopment analysis: to measure the performance difference between various units or teams based on multiple factors.

Further readings suggested

Transformation of variables in linear modeling (Box-Cox method)
What to do when assumptions are violated (Box-Cox)
Adding non-linear terms to your model
Association sequence rules
Sparse clustering for feature selection in clustering (a special method for clustering)
Naive Bayes classifier (used in text analytics to classify tweets, mails, documents, etc.)
K-NN classifier, which is usually used when both clustering and classification are required
How to include interaction terms to improve model performance
Poisson regression for predicting the failure of machines
Generalized linear regression modelling
Linear discriminant analysis
Boosting, bagging and other methods for improving classifiers
Random forest method for classification
Topic modelling
Sentiment analysis
Support vector regression for non-linear regression

Thank You