
Advanced Modelling Techniques

ENSEMBLE, REGULARISATION AND BAYESIALAB

Prepared for
Dr. Sridhar Vaidyanathan

Anurag Sarkar, Payel Ganguly | AMT | 11-02-2017


Table of Contents
Ensemble Methods
  Objective
  Dataset
  Technique: Boosting
    R Codes and Outputs
  Technique: Bagging
    R Codes and Outputs
  Summary
Regularisation
  Objective
  Dataset
  Technique: Ridge Regression
    R Codes and Outputs
  Technique: LASSO Regression
    R Codes and Outputs
  Summary
BayesiaLab application
  Dataset
  Dataset definition

**All explanations of the outputs are given in the R comments within this document.

Ensemble Methods
Objective
The company wants to automate its credit eligibility process (in real time) based on the customer details provided in the online application form. To automate this process, the task is to identify the customer segments that are eligible for a credit amount, so that these customers can be targeted specifically.

Dataset
GermanCredit.csv

Codelist (Var. # - Variable Name - Description - Variable Type - Code Description)

1. OBS# - Observation No. - Categorical
2. CHK_ACCT - Checking account status - Categorical - 0: < 0 DM; 1: 0 < ... < 200 DM; 2: >= 200 DM; 3: no checking account
3. DURATION - Duration of credit in months - Numerical
4. HISTORY - Credit history - Categorical - 0: no credits taken; 1: all credits at this bank paid back duly; 2: existing credits paid back duly till now; 3: delay in paying off in the past; 4: critical account
5. NEW_CAR - Purpose of credit: car (new) - Binary - 0: No, 1: Yes
6. USED_CAR - Purpose of credit: car (used) - Binary - 0: No, 1: Yes
7. FURNITURE - Purpose of credit: furniture/equipment - Binary - 0: No, 1: Yes
8. RADIO/TV - Purpose of credit: radio/television - Binary - 0: No, 1: Yes
9. EDUCATION - Purpose of credit: education - Binary - 0: No, 1: Yes
10. RETRAINING - Purpose of credit: retraining - Binary - 0: No, 1: Yes
11. AMOUNT - Credit amount - Numerical
12. SAV_ACCT - Average balance in savings account - Categorical - 0: < 100 DM; 1: 100 <= ... < 500 DM; 2: 500 <= ... < 1000 DM; 3: >= 1000 DM; 4: unknown/no savings account
13. EMPLOYMENT - Present employment since - Categorical - 0: unemployed; 1: < 1 year; 2: 1 <= ... < 4 years; 3: 4 <= ... < 7 years; 4: >= 7 years
14. INSTALL_RATE - Installment rate as % of disposable income - Numerical
15. MALE_DIV - Applicant is male and divorced - Binary - 0: No, 1: Yes
16. MALE_SINGLE - Applicant is male and single - Binary - 0: No, 1: Yes
17. MALE_MAR_WID - Applicant is male and married or a widower - Binary - 0: No, 1: Yes
18. CO-APPLICANT - Application has a co-applicant - Binary - 0: No, 1: Yes
19. GUARANTOR - Applicant has a guarantor - Binary - 0: No, 1: Yes
20. PRESENT_RESIDENT - Present resident since (years) - Categorical - 0: <= 1 year; 1: 1 < ... <= 2 years; 2: 2 < ... <= 3 years; 3: > 4 years
21. REAL_ESTATE - Applicant owns real estate - Binary - 0: No, 1: Yes
22. PROP_UNKN_NONE - Applicant owns no property (or unknown) - Binary - 0: No, 1: Yes
23. AGE - Age in years - Numerical
24. OTHER_INSTALL - Applicant has other installment plan credit - Binary - 0: No, 1: Yes
25. RENT - Applicant rents - Binary - 0: No, 1: Yes
26. OWN_RES - Applicant owns residence - Binary - 0: No, 1: Yes
27. NUM_CREDITS - Number of existing credits at this bank - Numerical
28. JOB - Nature of job - Categorical - 0: unemployed/unskilled, non-resident; 1: unskilled, resident; 2: skilled employee/official; 3: management/self-employed/highly qualified employee/officer
29. NUM_DEPENDENTS - Number of people for whom liable to provide maintenance - Numerical
30. TELEPHONE - Applicant has phone in his or her name - Binary - 0: No, 1: Yes
31. FOREIGN - Foreign worker - Binary - 0: No, 1: Yes
32. RESPONSE - Credit rating is good - Binary - 0: No, 1: Yes
Dependent variable: RESPONSE

Technique: Boosting
R Codes and Outputs
#BOOSTING

library("ggplot2") #for plotting results

library("ROCR")

library("verification")

library("gbm")

set.seed (100)

data = read.csv(file.choose())

#check whether the data is in the format that we want

str(data)

#cleaning and pre-processing the data

data= data[,-1]

data$CHK_ACCT = as.factor(data$CHK_ACCT)

data$HISTORY = as.factor(data$HISTORY)

data$NEW_CAR = as.factor(data$NEW_CAR)

data$USED_CAR = as.factor(data$USED_CAR)

data$FURNITURE = as.factor(data$FURNITURE)

data$RADIO.TV = as.factor(data$RADIO.TV)

data$EDUCATION = as.factor(data$EDUCATION)

data$RETRAINING = as.factor(data$RETRAINING)

data$SAV_ACCT = as.factor(data$SAV_ACCT)

data$EMPLOYMENT = as.factor(data$EMPLOYMENT)

data$INSTALL_RATE = as.factor(data$INSTALL_RATE)

data$MALE_DIV = as.factor(data$MALE_DIV)

data$MALE_SINGLE = as.factor(data$MALE_SINGLE)

data$MALE_MAR_or_WID = as.factor(data$MALE_MAR_or_WID)

data$CO.APPLICANT = as.factor(data$CO.APPLICANT)

data$GUARANTOR = as.factor(data$GUARANTOR)

data$PRESENT_RESIDENT = as.factor(data$PRESENT_RESIDENT)

data$REAL_ESTATE = as.factor(data$REAL_ESTATE)

data$PROP_UNKN_NONE = as.factor(data$PROP_UNKN_NONE)

data$OTHER_INSTALL = as.factor(data$OTHER_INSTALL)

data$RENT = as.factor(data$RENT)

data$OWN_RES = as.factor(data$OWN_RES)

data$NUM_CREDITS = as.factor(data$NUM_CREDITS)

data$JOB = as.factor(data$JOB)

data$NUM_DEPENDENTS = as.factor(data$NUM_DEPENDENTS)

data$TELEPHONE = as.factor(data$TELEPHONE)

data$FOREIGN = as.factor(data$FOREIGN)

data$RESPONSE = as.factor(data$RESPONSE)

str(data)
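
#the block of per-column conversions above can equivalently be written in one pass; a sketch, assuming the same column names as in the codelist (only DURATION, AMOUNT and AGE stay numeric)

cat.cols <- setdiff(names(data), c("DURATION", "AMOUNT", "AGE"))

data[cat.cols] <- lapply(data[cat.cols], as.factor)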

#dividing the dataset into training and testing data

data.train <- sample(1:nrow(data), nrow(data)*0.8)

str(data.train)

names(data)

dTrain <- data[data.train, ]

nrow(dTrain)

dTest <- data[-data.train, ]

nrow(dTest)

> nrow(dTrain)
[1] 800
> ncol(dTrain)
[1] 31
> dTest <- data[-data.train, ]
> nrow(dTest)
[1] 200

boost.credit = gbm(RESPONSE ~ .,
                   data = dTrain,
                   distribution = "gaussian",
                   n.trees = 5000,
                   interaction.depth = 4,
                   cv.folds = 7)

summary (boost.credit)

> summary (boost.credit)


var rel.inf
CHK_ACCT CHK_ACCT 17.851640118
AMOUNT AMOUNT 17.602342502
DURATION DURATION 12.916551346
HISTORY HISTORY 10.807564555
AGE AGE 6.014443872
SAV_ACCT SAV_ACCT 5.370647421

#to determine the optimal number of trees (boosting iterations), which reduces processing time and guards against over-fitting

best.iter <- gbm.perf(boost.credit,method="cv")

print(best.iter)

> print(best.iter)
[1] 4993
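#as a usage sketch, the CV-selected iteration can be passed straight to predict(); the prediction step further below keeps n.trees = 5000 as originally run

pp.best <- predict(boost.credit, dTest, n.trees = best.iter, type = "response")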
#estimates of marginal effects of a predictor(s) via Partial Dependency Plots.

# These plots enable us to understand the effect of a predictor variable

#(or interaction between two predictors) on the target outcome,

#given the other predictors (partialling them out - after accounting for their average effects).

#The partial dependence function basically gives us the "average" trend of that variable

#(integrating out all others in the model). It's the shape of that trend that is "important".

#We may interpret the relative range of these plots from different predictor variables, but not the absolute range.

par(mfrow = c(2,2))

plot(boost.credit ,i="CHK_ACCT")

plot(boost.credit ,i="AMOUNT")

plot(boost.credit ,i="HISTORY")

plot(boost.credit ,i="DURATION")

#to find the cut-off: the boosting output is continuous and needs to be converted to a categorical output. We'll use the area under the ROC curve to assess the resulting classifier.
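
#the 1.5753 threshold used below is assumed to have been read off a cut-off search on the continuous scores;

#a sketch of one such search (pp.raw is a hypothetical name for the unthresholded predictions), picking the cut-off that maximises Youden's J (TPR - FPR)

pp.raw <- predict(boost.credit, dTest, n.trees = 5000, type = "response")

pred.raw <- prediction(pp.raw, dTest$RESPONSE)

perf.raw <- performance(pred.raw, "tpr", "fpr")

cut.df <- data.frame(cut = perf.raw@alpha.values[[1]], tpr = perf.raw@y.values[[1]], fpr = perf.raw@x.values[[1]])

best.cut <- cut.df$cut[which.max(cut.df$tpr - cut.df$fpr)]

best.cut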

pp= predict(boost.credit, dTest,n.trees = 5000, type="response")

pp <- ifelse(pp<1.5753,0,1)

#print(pp)

x <- data.frame(dTest,pp)

pred <- with(x,prediction(pp,RESPONSE))

perf <- performance(pred,"tpr", "fpr")

auc <-performance(pred, measure = "auc")@y.values[[1]]

print(auc)

rd <- data.frame(x=perf@x.values[[1]],y=perf@y.values[[1]])

p <- ggplot(rd,aes(x=x,y=y)) + geom_path(size=1)

p + labs(title = "ROC Curve")

p <- p + geom_segment(aes(x=0,y=0,xend=1,yend=1),colour="black",linetype= 2)

p <- p + geom_text(aes(x=1, y=0, hjust=1, vjust=0, label=paste(sep = "", "AUC = ", round(auc,3))), colour="black", size=4)

p <- p + scale_x_continuous(name= "False positive rate")

p <- p + scale_y_continuous(name= "True positive rate")

p + labs(title = "ROC")

print(auc)

> print(auc)
[1] 0.7661242

#Other Performance Measures

x <- data.frame(dTest, pp)
cm0 <- confusionMatrix(as.factor(pp), dTest$RESPONSE) #from caret; the 0/1 predictions are coerced to a factor

print(cm0)

> print(cm0)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 29 25
1 13 133

Accuracy : 0.81
95% CI : (0.7487, 0.8619)
No Information Rate : 0.79
P-Value [Acc > NIR] : 0.27532

Kappa : 0.4817
Mcnemar's Test P-Value : 0.07435

Sensitivity : 0.6905
Specificity : 0.8418
Pos Pred Value : 0.5370
Neg Pred Value : 0.9110
Prevalence : 0.2100
Detection Rate : 0.1450
Detection Prevalence : 0.2700
Balanced Accuracy : 0.7661

'Positive' Class : 0

#write output to a csv file

write.csv(x,"Output.csv")

Technique: Bagging
R Codes and Outputs

#bagging

#Bagging (bootstrap aggregation) averages a given procedure over many bootstrap samples to reduce its variance - a "poor man's Bayes".

#Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple interpretable structure (such as that of a single tree) is lost.

#Bagging averages many trees and produces smoother decision boundaries.

library (randomForest)

set.seed (100)

ncol(dTest)

nrow(dTest)

#train using various values of mtry - the number of variables available for splitting at each tree node; mtry = 30 (all predictors) corresponds to bagging

bag.credit1 =randomForest(RESPONSE~.,data=dTrain ,mtry=30, ntree=500, importance =TRUE)

#plot the out-of-bag error rate curve and the per-class misclassification error rate curves

plot(bag.credit1)

legend("topright", colnames(bag.credit1$err.rate),col=1:4,cex=0.8,fill=1:4)

bag.credit1

> bag.credit1

Call:
randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 30, nt
ree = 500, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 30

OOB estimate of error rate: 24.88%


Confusion matrix:
0 1 class.error
0 130 128 0.4961240
1 71 471 0.1309963

bag.credit2 =randomForest(RESPONSE~.,data=dTrain ,mtry=6, importance =TRUE)

bag.credit2

> bag.credit2

Call:

randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 6, imp
ortance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6

OOB estimate of error rate: 25.62%


Confusion matrix:
0 1 class.error
0 106 152 0.58914729
1 53 489 0.09778598

plot(bag.credit2)

legend("topright", colnames(bag.credit2$err.rate),col=1:4,cex=0.8,fill=1:4)

bag.credit3 =randomForest(RESPONSE~.,data=dTrain ,mtry=5, importance =TRUE)

bag.credit3

> bag.credit3

Call:
randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 5, imp
ortance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5

OOB estimate of error rate: 24.5%


Confusion matrix:
0 1 class.error
0 106 152 0.58914729

1 44 498 0.08118081

plot(bag.credit3)

legend("topright", colnames(bag.credit3$err.rate),col=1:4,cex=0.8,fill=1:4)

#For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded.

#Then the same is done after permuting each predictor variable.

#The difference between the two accuracies is then averaged over all trees and normalised by the standard error.

vi <- varImp(bag.credit1) #variable importance for the bagged model (mtry = 30)

vi

> vi
0 1
CHK_ACCT 19.06621013 19.06621013
DURATION 12.10922121 12.10922121
HISTORY 9.01551629 9.01551629
NEW_CAR 2.13719444 2.13719444
USED_CAR 2.42842203 2.42842203
FURNITURE -0.08435182 -0.08435182
RADIO.TV 3.23616591 3.23616591
EDUCATION 0.43650963 0.43650963
RETRAINING 0.92453546 0.92453546
AMOUNT 8.48411289 8.48411289
SAV_ACCT 4.36585660 4.36585660

varImpPlot(bag.credit1, type=2) #type = 2 plots the mean decrease in node impurity (not the mean decrease in accuracy)

#Prediction on the test data

pred.bag = predict(bag.credit1, newdata = dTest)

plot(pred.bag)

cm1 <- confusionMatrix(pred.bag, dTest$RESPONSE)

cm1$overall

> cm1$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull Ac
curacyPValue McnemarPValue
0.80500000 0.33696022 0.74321431 0.85750813 0.79000
000 0.33720145 0.05466394

cm1$byClass

cm1$byClass
Sensitivity Specificity Pos Pred Value Ne
g Pred Value Precision
0.3809524 0.9177215 0.5517241
0.8479532 0.5517241
Recall F1 Prevalence De
tection Rate Detection Prevalence
0.3809524 0.4507042 0.2100000
0.0800000 0.1450000
Balanced Accuracy
0.6493369
x <- data.frame(dTest,pred.bag)

#write output to a csv file

write.csv(x,"Output_Bagging.csv")

Summary
Boosting yields marginally better accuracy than bagging on the hold-out test set: 81% compared with 80.50%.
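
The two accuracies quoted above can be read straight from the confusion-matrix objects created earlier; a minimal sketch (cm0 from the boosting section, cm1 from the bagging section):

acc <- c(boosting = unname(cm0$overall["Accuracy"]), bagging = unname(cm1$overall["Accuracy"]))

round(acc, 4)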

Regularisation
Objective
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different
cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive
model and find out the sales of each product at a particular store.

Dataset
Train_UWu5bXk_Big Mart.csv

Variable - Description
Item_Identifier - Unique product ID (removed during pre-processing)
Item_Weight - Weight of the product
Item_Fat_Content - Whether the product is low fat or regular
Item_Visibility - Percentage of the store's total display area allocated to the product
Item_Type - Category to which the product belongs
Item_MRP - Maximum retail price (list price) of the product
Outlet_Identifier - Unique store ID
Outlet_Establishment_Year - Year in which the store was established (removed during pre-processing)
Outlet_Size - Size of the store
Outlet_Location_Type - Type of city in which the store is located
Outlet_Type - Whether the outlet is a grocery store or a supermarket
Item_Outlet_Sales - Sales of the product in the particular store (the outcome variable to be predicted)

Technique: Ridge Regression
R Codes and Outputs
#Ridge Regression

#The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

#Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

#loading the relevant libraries

library (glmnet )

library(MASS)

library (ISLR)

library(lattice)

#selection of data

data = read.csv(file.choose())

str(data)

#Cleaning and processing the data

data= data[,-8 ]

data= data[,-1 ]

names(data )

dim(data )

sum(is.na(data$Item_Outlet_Sales))#check for missing values in the data

data = na.omit(data) #remove all rows with missing values

dim(data )

sum(is.na(data ))

#To compare the ridge output with the OLS output we run the linear regression model first

#dividing the data into the training and the testing sets

data.train <- sample(1:nrow(data), nrow(data)*0.5)

dTrain <- data[data.train, ]

nrow(dTrain)

dTest <- data[-data.train, ]

#Linear Regression

ols = lm(Item_Outlet_Sales ~., dTrain)

summary(ols)

> summary(ols)

Call:
lm(formula = Item_Outlet_Sales ~ ., data = dTrain)

Residuals:
Min 1Q Median 3Q Max
-3483.0 -634.9 -81.8 532.7 6002.4

Coefficients: (7 not defined because of singularities)


Estimate Std. Error t value Pr(>|t|)
(Intercept) -1851.7738 147.2696 -12.574 <2e-16 *
**
Item_Weight -0.9359 3.9078 -0.239 0.8107
Item_Fat_Contentlow fat 380.7970 189.1178 2.014 0.0441 *
Item_Fat_ContentLow Fat 56.1231 94.4237 0.594 0.5523
Item_Fat_Contentreg 79.3920 175.7233 0.452 0.6514
Item_Fat_ContentRegular 114.1667 98.4877 1.159 0.2465
Item_Visibility -271.7998 382.7683 -0.710 0.4777
Item_TypeBreads 62.9835 124.7067 0.505 0.6136
Item_TypeBreakfast 58.0952 189.9650 0.306 0.7598
Item_TypeCanned 26.4251 90.5039 0.292 0.7703
Item_TypeDairy -1.6757 91.9574 -0.018 0.9855
Item_TypeFrozen Foods -77.2396 85.6644 -0.902 0.3673
Item_TypeFruits and Vegetables -53.9985 80.8350 -0.668 0.5042
Item_TypeHard Drinks -38.2177 128.5790 -0.297 0.7663
Item_TypeHealth and Hygiene -17.5756 98.6799 -0.178 0.8586
Item_TypeHousehold -107.6325 87.4929 -1.230 0.2187
Item_TypeMeat -40.5371 107.1454 -0.378 0.7052
Item_TypeOthers -20.7160 146.8059 -0.141 0.8878
Item_TypeSeafood -42.4768 232.7389 -0.183 0.8552
Item_TypeSnack Foods -62.3765 81.5387 -0.765 0.4443
Item_TypeSoft Drinks -104.1236 100.4662 -1.036 0.3001

Item_TypeStarchy Foods 25.7238 140.4724 0.183 0.8547
Item_MRP 15.5602 0.2934 53.025 <2e-16 *
**
Outlet_IdentifierOUT013 1946.9820 84.0984 23.151 <2e-16 *
**
Outlet_IdentifierOUT017 2045.1455 82.9143 24.666 <2e-16 *
**
Outlet_IdentifierOUT018 1637.0013 83.7154 19.554 <2e-16 *
**
Outlet_IdentifierOUT035 2106.3930 82.0403 25.675 <2e-16 *
**
Outlet_IdentifierOUT045 1866.3759 83.2493 22.419 <2e-16 *
**
Outlet_IdentifierOUT046 1901.1143 83.3527 22.808 <2e-16 *
**
Outlet_IdentifierOUT049 1987.9411 83.5246 23.801 <2e-16 *
**

#Prediction using Linear Regression

predictols = predict(ols, newdata = dTest)

plot(predictols)

#A plot of the residuals against fitted values is used to determine whether there are any systematic patterns,

#such as overestimation for most of the large values, or increasing spread as the model fitted values increase.

xyplot(resid(ols) ~ fitted(ols),
       xlab = "Fitted Values",
       ylab = "Residuals",
       main = "Residual Diagnostic Plot",
       panel = function(x, y, ...) {
         panel.grid(h = -1, v = -1)
         panel.abline(h = 0)
         panel.xyplot(x, y, ...)
       })

#The plot is probably ok but there are more cases of positive residuals and

#when we consider a normal probability plot we see that there are some deficiencies with the model

qqmath(~ resid(ols),
       xlab = "Theoretical Quantiles",
       ylab = "Residuals")

#The function resid extracts the model residuals from the fitted model object

#We would hope that this plot showed something approaching a straight line

#to support the model assumption about the distribution of the residuals

#we calculate the mean square errors for further comparison with other models

mean((predictols - dTest$Item_Outlet_Sales)^2)

> mean((predictols - dTest$Item_Outlet_Sales)^2)
[1] 1189168

#Performing ridge regression and the lasso in order to predict Item_Outlet_Sales on the BigMart data.

# The model.matrix() function is particularly useful for creating x; not only

#does it produce a matrix corresponding to the predictors but it also

#automatically transforms any qualitative variables into dummy variables.

#The latter property is important because glmnet() can only take numerical,

#quantitative inputs.

#help("seq")

x=model.matrix (Item_Outlet_Sales~.,data )[,-1]

y=data$Item_Outlet_Sales

grid =10^ seq (10,-2, length =100)

ridge.mod =glmnet (x,y,alpha =0, lambda =grid)

#By default, the glmnet() function standardizes the

#variables so that they are on the same scale. To turn off this default setting,

#use the argument standardize=FALSE.

#It shows from left to right the number of nonzero coefficients (Df),

#the percent (of null) deviance explained (%Dev) and the value of the penalty lambda.

ridge.mod

> ridge.mod

Call: glmnet(x = x, y = y, alpha = 0, lambda = grid)

Df %Dev Lambda
[1,] 36 1.642e-07 1.000e+10
[2,] 36 2.171e-07 7.565e+09
[3,] 36 2.870e-07 5.722e+09
[4,] 36 3.794e-07 4.329e+09
[5,] 36 5.015e-07 3.275e+09
[6,] 36 6.630e-07 2.477e+09
[7,] 36 8.764e-07 1.874e+09
[8,] 36 1.159e-06 1.417e+09
[9,] 36 1.532e-06 1.072e+09
[10,] 36 2.025e-06 8.111e+08
[11,] 36 2.676e-06 6.136e+08
[12,] 36 3.538e-06 4.642e+08
[13,] 36 4.677e-06 3.511e+08
[14,] 36 6.183e-06 2.656e+08
[15,] 36 8.173e-06 2.009e+08
[16,] 36 1.080e-05 1.520e+08
[17,] 36 1.428e-05 1.150e+08
[18,] 36 1.888e-05 8.697e+07
[19,] 36 2.496e-05 6.579e+07
[20,] 36 3.299e-05 4.977e+07

#plot of the coefficient profiles: each curve shows how a coefficient changes as the penalty varies

#(plotted against the L1 norm of the coefficient vector by default)

plot(ridge.mod, label = TRUE)

dim(coef(ridge.mod ))

dim(coef(ridge.mod ))
[1] 40 100

ridge.mod$lambda [50]

> ridge.mod$lambda [50]


[1] 11497.57

#We can obtain the actual coefficients at one or more values of lambda within the range of the sequence

coef(ridge.mod)[,50]

> coef(ridge.mod)[,50]
(Intercept) Item_Weight It
em_Fat_Contentlow fat
1768.5296057 0.4714473
6.7615077
Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-4.8883221 -13.3125877
6.9185662
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-290.3879446 3.7790286
-12.9511555
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
3.6287889 8.5570676
-4.4858574

Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
7.5728516 -5.4537667
-18.9152273
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
7.5533123 -8.3294249
-6.7918224
Item_TypeSeafood Item_TypeSnack Foods
Item_TypeSoft Drinks
47.3492024 12.3585385
-16.4090562
Item_TypeStarchy Foods Item_MRP Ou
tlet_IdentifierOUT013
22.2712201 1.7932583
28.8954492
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018 Ou
tlet_IdentifierOUT019
26.7496824 2.6719608
0.0000000
Outlet_IdentifierOUT027 Outlet_IdentifierOUT035 Ou
tlet_IdentifierOUT045
0.0000000 34.7145225
6.5502929
Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
Outlet_SizeHigh
17.9059818 29.2979713
28.8974982
Outlet_SizeMedium Outlet_SizeSmall Outle
t_Location_TypeTier 2
18.8550265 31.0140842
32.5158316
Outlet_Location_TypeTier 3 Outlet_TypeSupermarket Type1 Outlet_
TypeSupermarket Type2
-58.5002738 99.3006575
2.6731882
Outlet_TypeSupermarket Type3
0.0000000

coef <- coef(ridge.mod) #sparse matrix of coefficients across the whole lambda grid

coef

summ <- summary(coef)

summ

x<-data.frame(Origin = rownames(coef)[summ$i],

Destination = colnames(coef)[summ$j],

Weight = summ$x)

write.csv(x,"Output_Ridgecef.csv")

OUTPUT File Attached
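
#l2 norm of the ridge coefficient vector (excluding the intercept) at the 50th value of lambda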

sqrt(sum(coef(ridge.mod)[ -1 ,50]^2) )

> sqrt(sum(coef(ridge.mod)[ -1 ,50]^2) )


[1] 330.4143

ridge.mod$lambda [60]

> ridge.mod$lambda [60]


[1] 705.4802
coef(ridge.mod)[,60]

> coef(ridge.mod)[,60]
(Intercept) Item_Weight It
em_Fat_Contentlow fat
29.1878010 -4.6425519
12.4563384
Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-2.2040003 -141.5550260
36.0491284
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-848.6729160 -26.5768543
-203.1094940
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
68.2376470 11.3440468
-90.7943879
Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
47.7664754 -156.8928571
-97.8301151
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
-0.7233438 38.0428187
-217.8423830
Item_TypeSeafood Item_TypeSnack Foods
Item_TypeSoft Drinks
107.1972117 7.4562711
18.6877479
Item_TypeStarchy Foods Item_MRP Ou
tlet_IdentifierOUT013
273.4689065 10.1886705
230.0908254
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018 Ou
tlet_IdentifierOUT019
159.0560125 272.8354786
0.0000000
Outlet_IdentifierOUT027 Outlet_IdentifierOUT035 Ou
tlet_IdentifierOUT045
0.0000000 121.1075127
90.1300494

Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
Outlet_SizeHigh
90.2348227 134.9829698
230.0861606
Outlet_SizeMedium Outlet_SizeSmall Outle
t_Location_TypeTier 2
231.9717947 124.0771150
174.6308261
Outlet_Location_TypeTier 3 Outlet_TypeSupermarket Type1 Outlet_
TypeSupermarket Type2
-302.8971847 595.3107156
272.8358924
Outlet_TypeSupermarket Type3
0.0000000

sqrt(sum(coef(ridge.mod)[ -1 ,60]^2) )

> sqrt(sum(coef(ridge.mod)[ -1 ,60]^2) )


[1] 1357.927

# We can use the predict() function for a number of purposes. For instance,

#we can obtain the ridge regression coefficients for a new value of lambda, say s = 50

predict(ridge.mod ,s=50, type ="coefficients")[1:20 ,]

> predict(ridge.mod ,s=50, type ="coefficients")[1:20 ,]


(Intercept) Item_Weight It
em_Fat_Contentlow fat
-1.142843e+03 -8.961876e-02
4.968519e+01
Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-3.205728e+01 -6.101688e+01
2.741758e+01
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-3.410558e+02 3.564304e+01
-1.100413e+02
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
1.824980e+01 -5.374391e+01
-3.077317e+01
Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
-2.246690e+00 1.207918e+01
-3.913646e+00
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
-2.465764e+01 -3.611566e+01
1.413773e+01
Item_TypeSeafood Item_TypeSnack Foods

3.279985e+02 -4.170007e+00

# training and testing

set.seed (1)

train=sample (1: nrow(x), nrow(x)/2)

test=(- train )

y.test=y[test]

ridge.pred=predict (ridge.mod ,s=4, newx=x[test ,])

mean(( ridge.pred -y.test)^2)

#simply fit a model with just an intercept, we would have predicted each test observation using

#the mean of the training observations. In that case, we could compute the test set MSE like this:
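
#(sketch of the computation referred to above)

mean((mean(y[train]) - y.test)^2)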

#fitting a ridge regression model with lambda = 4 leads to a much lower test

#MSE than fitting a model with just an intercept. We now check whether

#there is any benefit to performing ridge regression with lambda = 4 instead of

#just performing least squares regression. Recall that least squares is simply

#ridge regression with lambda = 0.

ridge.pred=predict (ridge.mod ,s=0, newx=x[test ,], exact=T)

mean(( ridge.pred -y.test)^2)

> mean(( ridge.pred -y.test)^2)


[1] 1129263

lm(y~x, subset =train)

predict (ridge.mod ,s=0, exact =T,type="coefficients") [1:20 ,]

> predict (ridge.mod ,s=0, exact =T,type="coefficients") [1:20 ,]


(Intercept) Item_Weight It
em_Fat_Contentlow fat
-1273.8570048 -0.1753777
45.5903253

Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-37.6752811 -66.2119408
22.3313054
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-219.0993942 25.9599166
-126.0530737
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
6.3310116 -71.8781581
-42.6367252
Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
-15.7890212 3.5453765
-8.7664006
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
-39.0268330 -45.3326115
6.1352188
Item_TypeSeafood Item_TypeSnack Foods
324.7468636 -18.9867244

#if we want to fit a (unpenalized) least squares model, then

#we should use the lm() function, since that function provides more useful

#outputs, such as standard errors and p-values for the coefficients.
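
#a minimal sketch of that unpenalised fit, reusing the design matrix x and the training index defined in this section

ols.fit <- lm(y ~ x, subset = train)

summary(ols.fit) #reports standard errors and p-values, which glmnet does not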

# finding optimum value of lambda using CV.

set.seed (1)

cv.out =cv.glmnet (x[train ,],y[train],alpha =0)

#the plot shows the cross-validation curve (red dotted line) and upper and lower standard deviation curves along the lambda sequence (error bars).

#Two selected values of lambda (lambda.min and lambda.1se) are indicated by the vertical dotted lines (see below).

plot(cv.out)

bestlam =cv.out$lambda.min

bestlam

> bestlam
[1] 104.0413

ridge.pred=predict (ridge.mod ,s=bestlam ,newx=x[test ,])

mean(( ridge.pred -y.test)^2)

> mean(( ridge.pred -y.test)^2)


[1] 1137458

Output=predict(cv.out, newx = x[test ,], s = "lambda.min")

#Users can control the folds used.

#Here we use the same folds so that we can also select a value for alpha.

foldid=sample(1:10,size=length(y),replace=TRUE)

cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)

cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)

cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)

cv1

log(cv1$lambda)

pch=19

par(mfrow=c(2,2))

plot(cv1);plot(cv.5);plot(cv0)

plot(log(cv1$lambda),cv1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=cv1$name)

points(log(cv.5$lambda),cv.5$cvm,pch=19,col="grey")

points(log(cv0$lambda),cv0$cvm,pch=19,col="blue")

legend(2,3,legend=c("alpha= 1","alpha= .5","alpha= 0"),pch=0,col=c("red","grey","blue"))

help("predict")

Technique: LASSO Regression


R Codes and Outputs

#LASSO regression

#The elastic-net penalty is controlled by alpha, and bridges the gap between the lasso (alpha = 1, the default) and ridge (alpha = 0).

#The tuning parameter lambda controls the overall strength of the penalty.

lasso.mod =glmnet(x[train ,],y[train],alpha =1, lambda =grid)

plot(lasso.mod)

lasso.pred=predict (lasso.mod ,s=0, newx=x[test ,], exact=T)

mean(( lasso.pred -y.test)^2)

mean(( ridge.pred -y.test)^2)

mean(( predictols -dTest$Item_Outlet_Sales)^2)

> mean(( lasso.pred -y.test)^2)


[1] 1146829
> mean(( ridge.pred -y.test)^2)
[1] 1137458
> mean(( predictols -dTest$Item_Outlet_Sales)^2)
[1] 1189168

Summary
Comparing the mean squared errors, ridge regression yields the lowest test MSE (1,137,458), followed by the lasso (1,146,829) and ordinary least squares (1,189,168).
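
For reference, the three test MSEs compared above can be collected in one place; a minimal sketch using the objects created in this section (note that the OLS model is evaluated on the dTest split while ridge and lasso use the x[test, ] split):

mse.compare <- data.frame(model = c("OLS", "Ridge (lambda.min)", "Lasso (s = 0)"),
                          test.MSE = c(mean((predictols - dTest$Item_Outlet_Sales)^2),
                                       mean((ridge.pred - y.test)^2),
                                       mean((lasso.pred - y.test)^2)))

mse.compare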

BayesiaLab application
Dataset
Asia.xbl

Dataset definition
Asia.xbl is a fictional Bayesian network which serves as a hypothetical "Expert System" for
pulmonary diseases. It encodes the cumulative knowledge of a pulmonary physician about all of
his patients with regard to lung diseases.

Each node in the network represents a patient-related variable.

The dataset/network contains 4 classes of nodes.

Node # - Node class - Variable names - Variable type

1. Patient characteristics - Smoker (Binary - T/F); Age of patient (Categorical - Adolescent/Adult/Geriatric); Indicator of a recent Asia visit by the patient (Binary - T/F)
2. Pulmonary diseases - Bronchitis (Binary - T/F); Cancer (Binary - T/F); Tuberculosis (Binary - T/F)
3. Logical node - TbOrCa, indicates whether either disease is present (Binary - T/F)
4. Symptoms - Dyspnea, shortness of breath (Binary - T/F); XRay (Binary - Normal/Abnormal)

The graph of the dataset represents the qualitative part of the pulmonary healthcare domain knowledge.

The arcs represent the probabilistic relationships between the nodes.

The quantitative part of the dataset is present in the conditional probability tables (CPT)
which are associated with each node.

To see the CPT for each node, an editor should be activated with a double click

The above picture shows the status of the node VisitAsia

The probability distribution table shows the marginal probability distribution of VisitAsia, where true = 1% and false = 99%. This means 1% of the patients have recently been to an Asian country.

The picture below shows the details of the node Bronchitis

The probability table shows the conditional probabilities of Bronchitis given Smoker. It can be seen that smokers are twice as likely to suffer from bronchitis as non-smokers.
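
To make the arithmetic behind the monitors concrete, the following is a small sketch in R with purely hypothetical CPT values (not the ones stored in Asia.xbl), showing how a marginal probability such as P(Bronchitis = true) follows from the CPT via the law of total probability:

p_smoker <- 0.46 #hypothetical P(Smoker = true)

p_bronc_smoker <- 0.60 #hypothetical P(Bronchitis = true | Smoker = true)

p_bronc_nonsmoker <- 0.30 #hypothetical P(Bronchitis = true | Smoker = false), half the smoker rate

p_bronc <- p_bronc_smoker * p_smoker + p_bronc_nonsmoker * (1 - p_smoker) #marginalise out the parent node

p_bronc #about 0.44 with these illustrative numbers, of the same order as the 43.87% read off the monitor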

The picture below shows the details of node Cancer

The above table shows the conditional probabilities of cancer given age and smoker.

The knowledge stored in the network would therefore help pulmonary doctors gain further insight into the relationships between the different patient-related variables.

The above descriptions are taken from modelling mode. For further analysis it is
required to switch to validation mode.

The screenshot of the validation mode -

As the above screenshot shows, the monitor panel is used to read and manipulate the states of individual nodes.

Analysis of Validation node VisitAsia

When a node is displayed in the monitor panel it is marked yellow. The monitor in the above diagram shows the marginal probability distribution of VisitAsia: 1% of patients recently travelled to an Asian country.

Analysis of validation node Bronchitis

The monitor in the above diagram shows that the a priori probability of having Bronchitis is 43.87%.

Analysis of validation node - Age

The above monitor shows the marginal probability distribution of Age, where 25% of the patients are adolescent, 40% are adult and 35% are in the geriatric category.

Now the given network is utilized to diagnose a new patient -

Geriatric category analysis -

Clicking the horizontal bar sets the evidence: the state Geriatric is highlighted in green, and the node Age is also highlighted in green to indicate that evidence has been set.

Once Age is set to Geriatric, Bronchitis gets a new conditional probability distribution, shown in the diagram below with grey arrows.

The patient also reports dyspnea (shortness of breath). The diagram below captures the probability distribution.

Dyspnea is set to true. Given the age = Geriatric and Dyspnea = True, the probability of
Bronchitis = true increases as shown in the below diagram.

Now given the age and symptoms of the patient, the doctor can consider cancer as a
possibility.

Analysis of validation node - Cancer

From the observations captured in the diagram below, the patient's probability of having cancer is 15%.

H0 (Null Hypothesis) - The patient claims that he/she has quit smoking.

HA (Alternative Hypothesis) - The doctor claims that there is a 0.75 probability that the patient is still smoking.

To analyse the situation, the Smoker monitor shown below is used -

On setting the probability of Smoker = true to 75% and fixing this value, new evidence is propagated through the network: the probability of cancer changes to 19%, as shown in the diagram.

Now the doctor orders a chest X-Ray based on the above result.

Analysis of the validation node XRay

Once the normal bar from the above XRay monitor is selected, the probability of the cancer
drops to 0.5% as shown in the diagram below -

and probability of Bronchitis increases to 88% as shown in the diagram below -

Conclusion - From the above probability evidence, the doctor confirms a diagnosis of bronchitis and starts treatment.

