
Advanced Modelling Techniques

ENSEMBLE, REGULARISATION AND BAYESIALAB

Prepared for
Dr. Sridhar Vaidyanathan

Anurag Sarkar, Payel Ganguly | AMT | 11-02-2017


Table of Contents
Ensemble Methods
  Objective
  Dataset
  Technique: Boosting
    R Codes and Outputs
  Technique: Bagging
    R Codes and Outputs
  Summary
Regularisation
  Objective
  Dataset
  Technique: Ridge Regression
    R Codes and Outputs
  Technique: LASSO Regression
    R Codes and Outputs
  Summary
BayesiaLab application
  Dataset
  Dataset definition

**All explanations of the outputs are given in the R comments within this document.

Ensemble Methods
Objective
The company wants to automate its credit eligibility process (in real time) based on the customer details provided in the online application form. To automate this process, the task is to identify the customer segments that are eligible for a credit amount, so that these customers can be targeted specifically.

Dataset
GermanCredit.csv

Codelist (Var. # - Variable Name - Description - Variable Type - Code Description)

1. OBS# - Observation No. - Categorical
2. CHK_ACCT - Checking account status - Categorical - 0: < 0 DM; 1: 0 < ... < 200 DM; 2: >= 200 DM; 3: no checking account
3. DURATION - Duration of credit in months - Numerical
4. HISTORY - Credit history - Categorical - 0: no credits taken; 1: all credits at this bank paid back duly; 2: existing credits paid back duly till now; 3: delay in paying off in the past; 4: critical account
5. NEW_CAR - Purpose of credit: car (new) - Binary - 0: No, 1: Yes
6. USED_CAR - Purpose of credit: car (used) - Binary - 0: No, 1: Yes
7. FURNITURE - Purpose of credit: furniture/equipment - Binary - 0: No, 1: Yes
8. RADIO/TV - Purpose of credit: radio/television - Binary - 0: No, 1: Yes
9. EDUCATION - Purpose of credit: education - Binary - 0: No, 1: Yes
10. RETRAINING - Purpose of credit: retraining - Binary - 0: No, 1: Yes
11. AMOUNT - Credit amount - Numerical
12. SAV_ACCT - Average balance in savings account - Categorical - 0: < 100 DM; 1: 100 <= ... < 500 DM; 2: 500 <= ... < 1000 DM; 3: >= 1000 DM; 4: unknown/no savings account
13. EMPLOYMENT - Present employment since - Categorical - 0: unemployed; 1: < 1 year; 2: 1 <= ... < 4 years; 3: 4 <= ... < 7 years; 4: >= 7 years
14. INSTALL_RATE - Installment rate as % of disposable income - Numerical
15. MALE_DIV - Applicant is male and divorced - Binary - 0: No, 1: Yes
16. MALE_SINGLE - Applicant is male and single - Binary - 0: No, 1: Yes
17. MALE_MAR_WID - Applicant is male and married or a widower - Binary - 0: No, 1: Yes
18. CO-APPLICANT - Application has a co-applicant - Binary - 0: No, 1: Yes
19. GUARANTOR - Applicant has a guarantor - Binary - 0: No, 1: Yes
20. PRESENT_RESIDENT - Present resident since (years) - Categorical - 0: <= 1 year; 1: 1 < ... <= 2 years; 2: 2 < ... <= 3 years; 3: > 4 years
21. REAL_ESTATE - Applicant owns real estate - Binary - 0: No, 1: Yes
22. PROP_UNKN_NONE - Applicant owns no property (or unknown) - Binary - 0: No, 1: Yes
23. AGE - Age in years - Numerical
24. OTHER_INSTALL - Applicant has other installment plan credit - Binary - 0: No, 1: Yes
25. RENT - Applicant rents - Binary - 0: No, 1: Yes
26. OWN_RES - Applicant owns residence - Binary - 0: No, 1: Yes
27. NUM_CREDITS - Number of existing credits at this bank - Numerical
28. JOB - Nature of job - Categorical - 0: unemployed/unskilled, non-resident; 1: unskilled, resident; 2: skilled employee/official; 3: management/self-employed/highly qualified employee/officer
29. NUM_DEPENDENTS - Number of people for whom liable to provide maintenance - Numerical
30. TELEPHONE - Applicant has phone in his or her name - Binary - 0: No, 1: Yes
31. FOREIGN - Foreign worker - Binary - 0: No, 1: Yes
32. RESPONSE - Credit rating is good - Binary - 0: No, 1: Yes
Dependent variable: RESPONSE

Technique: Boosting
R Codes and Outputs
#BOOSTING

library("ggplot2") #for plotting results

library("ROCR")

library("verification")

library("gbm")

set.seed (100)

data = read.csv(file.choose())

#check whether the data is in the format that we want

str(data)

#cleaning and pre-processing the data

data= data[,-1]

data$CHK_ACCT = as.factor(data$CHK_ACCT)

data$HISTORY = as.factor(data$HISTORY)

data$NEW_CAR = as.factor(data$NEW_CAR)

data$USED_CAR = as.factor(data$USED_CAR)

data$FURNITURE = as.factor(data$FURNITURE)

data$RADIO.TV = as.factor(data$RADIO.TV)

data$EDUCATION = as.factor(data$EDUCATION)

data$RETRAINING = as.factor(data$RETRAINING)

data$SAV_ACCT = as.factor(data$SAV_ACCT)

data$EMPLOYMENT = as.factor(data$EMPLOYMENT)

data$INSTALL_RATE = as.factor(data$INSTALL_RATE)

data$MALE_DIV = as.factor(data$MALE_DIV)

data$MALE_SINGLE = as.factor(data$MALE_SINGLE)

data$MALE_MAR_or_WID = as.factor(data$MALE_MAR_or_WID)

data$CO.APPLICANT = as.factor(data$CO.APPLICANT)

data$GUARANTOR = as.factor(data$GUARANTOR)

data$PRESENT_RESIDENT = as.factor(data$PRESENT_RESIDENT)

data$REAL_ESTATE = as.factor(data$REAL_ESTATE)

data$PROP_UNKN_NONE = as.factor(data$PROP_UNKN_NONE)

data$OTHER_INSTALL = as.factor(data$OTHER_INSTALL)

data$RENT = as.factor(data$RENT)

data$OWN_RES = as.factor(data$OWN_RES)

data$NUM_CREDITS = as.factor(data$NUM_CREDITS)

data$JOB = as.factor(data$JOB)

data$NUM_DEPENDENTS = as.factor(data$NUM_DEPENDENTS)

data$TELEPHONE = as.factor(data$TELEPHONE)

data$FOREIGN = as.factor(data$FOREIGN)

data$RESPONSE = as.factor(data$RESPONSE)

str(data)
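
#the block of per-column conversions above can equivalently be written in one pass; a sketch, assuming the same column names as in the codelist (only DURATION, AMOUNT and AGE stay numeric)

cat.cols <- setdiff(names(data), c("DURATION", "AMOUNT", "AGE"))

data[cat.cols] <- lapply(data[cat.cols], as.factor)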

#dividing the dataset into training and testing data

data.train <- sample(1:nrow(data), nrow(data)*0.8)

str(data.train)

names(data)

dTrain <- data[data.train, ]

nrow(dTrain)

dTest <- data[-data.train, ]

nrow(dTest)

> nrow(dTrain)
[1] 800
> ncol(dTrain)
[1] 31
> dTest <- data[-data.train, ]
> nrow(dTest)
[1] 200

boost.credit = gbm(RESPONSE ~ .,
                   data = dTrain,
                   distribution = "gaussian",
                   n.trees = 5000,
                   interaction.depth = 4,
                   cv.folds = 7)

summary (boost.credit)

> summary (boost.credit)


var rel.inf
CHK_ACCT CHK_ACCT 17.851640118
AMOUNT AMOUNT 17.602342502
DURATION DURATION 12.916551346
HISTORY HISTORY 10.807564555
AGE AGE 6.014443872
SAV_ACCT SAV_ACCT 5.370647421

#to determine the optimal number of trees (boosting iterations), which reduces processing time and guards against over-fitting

best.iter <- gbm.perf(boost.credit,method="cv")

print(best.iter)

> print(best.iter)
[1] 4993
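#as a usage sketch, the CV-selected iteration can be passed straight to predict(); the prediction step further below keeps n.trees = 5000 as originally run

pp.best <- predict(boost.credit, dTest, n.trees = best.iter, type = "response")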
#estimates of marginal effects of a predictor(s) via Partial Dependency Plots.

# These plots enable us to understand the effect of a predictor variable

#(or interaction between two predictors) on the target outcome,

#given the other predictors (partialling them out - after accounting for their average effects).

#The partial dependence function basically gives us the "average" trend of that variable

#(integrating out all others in the model). It's the shape of that trend that is "important".

#We may interpret the relative range of these plots from different predictor variables, but not the absolute range.

par(mfrow = c(2,2))

plot(boost.credit ,i="CHK_ACCT")

plot(boost.credit ,i="AMOUNT")

plot(boost.credit ,i="HISTORY")

plot(boost.credit ,i="DURATION")

#to find the cut-off: the boosting output is continuous and needs to be converted to a categorical output. We'll use the area under the ROC curve to assess the resulting classifier.
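
#the 1.5753 threshold used below is assumed to have been read off a cut-off search on the continuous scores;

#a sketch of one such search (pp.raw is a hypothetical name for the unthresholded predictions), picking the cut-off that maximises Youden's J (TPR - FPR)

pp.raw <- predict(boost.credit, dTest, n.trees = 5000, type = "response")

pred.raw <- prediction(pp.raw, dTest$RESPONSE)

perf.raw <- performance(pred.raw, "tpr", "fpr")

cut.df <- data.frame(cut = perf.raw@alpha.values[[1]], tpr = perf.raw@y.values[[1]], fpr = perf.raw@x.values[[1]])

best.cut <- cut.df$cut[which.max(cut.df$tpr - cut.df$fpr)]

best.cut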

pp= predict(boost.credit, dTest,n.trees = 5000, type="response")

pp <- ifelse(pp<1.5753,0,1)

#print(pp)

x <- data.frame(dTest,pp)

pred <- with(x,prediction(pp,RESPONSE))

perf <- performance(pred,"tpr", "fpr")

auc <-performance(pred, measure = "auc")@y.values[[1]]

print(auc)

rd <- data.frame(x=perf@x.values[[1]],y=perf@y.values[[1]])

p <- ggplot(rd,aes(x=x,y=y)) + geom_path(size=1)

p + labs(title = "ROC Curve")

p <- p + geom_segment(aes(x=0,y=0,xend=1,yend=1),colour="black",linetype= 2)

p <- p + geom_text(aes(x=1, y=0, hjust=1, vjust=0, label=paste(sep = "", "AUC = ", round(auc,3))), colour="black", size=4)

p <- p + scale_x_continuous(name= "False positive rate")

p <- p + scale_y_continuous(name= "True positive rate")

p + labs(title = "ROC")

print(auc)

> print(auc)
[1] 0.7661242

#Other Performance Measures

x <- data.frame(dTest, pp)
cm0 <- confusionMatrix(as.factor(pp), dTest$RESPONSE) #from caret; the 0/1 predictions are coerced to a factor

print(cm0)

> print(cm0)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 29 25
1 13 133

Accuracy : 0.81
95% CI : (0.7487, 0.8619)
No Information Rate : 0.79
P-Value [Acc > NIR] : 0.27532

Kappa : 0.4817
Mcnemar's Test P-Value : 0.07435

Sensitivity : 0.6905
Specificity : 0.8418
Pos Pred Value : 0.5370
Neg Pred Value : 0.9110
Prevalence : 0.2100
Detection Rate : 0.1450
Detection Prevalence : 0.2700
Balanced Accuracy : 0.7661

'Positive' Class : 0

#write output to a csv file

write.csv(x,"Output.csv")

Technique: Bagging
R Codes and Outputs

#bagging

#Bagging (bootstrap aggregation) averages a given procedure over many bootstrap samples to reduce its variance - a "poor man's Bayes".

#Bagging can dramatically reduce the variance of unstable procedures (like trees), leading to improved prediction. However, any simple interpretable structure (such as that of a single tree) is lost.

#Bagging averages many trees and produces smoother decision boundaries.

library (randomForest)

set.seed (100)

ncol(dTest)

nrow(dTest)

#train using various values of mtry - the number of variables available for splitting at each tree node; mtry = 30 (all predictors) corresponds to bagging

bag.credit1 =randomForest(RESPONSE~.,data=dTrain ,mtry=30, ntree=500, importance =TRUE)

#plot the out-of-bag error rate curve and the per-class misclassification error rate curves

plot(bag.credit1)

legend("topright", colnames(bag.credit1$err.rate),col=1:4,cex=0.8,fill=1:4)

bag.credit1

> bag.credit1

Call:
randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 30, nt
ree = 500, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 30

OOB estimate of error rate: 24.88%


Confusion matrix:
0 1 class.error
0 130 128 0.4961240
1 71 471 0.1309963

bag.credit2 =randomForest(RESPONSE~.,data=dTrain ,mtry=6, importance =TRUE)

bag.credit2

> bag.credit2

Call:

randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 6, imp
ortance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 6

OOB estimate of error rate: 25.62%


Confusion matrix:
0 1 class.error
0 106 152 0.58914729
1 53 489 0.09778598

plot(bag.credit2)

legend("topright", colnames(bag.credit2$err.rate),col=1:4,cex=0.8,fill=1:4)

bag.credit3 =randomForest(RESPONSE~.,data=dTrain ,mtry=5, importance =TRUE)

bag.credit3

> bag.credit3

Call:
randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 5, imp
ortance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 5

OOB estimate of error rate: 24.5%


Confusion matrix:
0 1 class.error
0 106 152 0.58914729

1 44 498 0.08118081

plot(bag.credit3)

legend("topright", colnames(bag.credit3$err.rate),col=1:4,cex=0.8,fill=1:4)

#For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded.

#Then the same is done after permuting each predictor variable.

#The difference between the two accuracies is then averaged over all trees and normalised by the standard error.

vi <- varImp(bag.credit1) #variable importance for the bagged model (mtry = 30)

vi

> vi
0 1
CHK_ACCT 19.06621013 19.06621013
DURATION 12.10922121 12.10922121
HISTORY 9.01551629 9.01551629
NEW_CAR 2.13719444 2.13719444
USED_CAR 2.42842203 2.42842203
FURNITURE -0.08435182 -0.08435182
RADIO.TV 3.23616591 3.23616591
EDUCATION 0.43650963 0.43650963
RETRAINING 0.92453546 0.92453546
AMOUNT 8.48411289 8.48411289
SAV_ACCT 4.36585660 4.36585660

varImpPlot(bag.credit1, type=2) #type = 2 plots the mean decrease in node impurity (not the mean decrease in accuracy)

#Prediction on the test data

pred.bag = predict(bag.credit1, newdata = dTest)

plot(pred.bag)

cm1 <- confusionMatrix(pred.bag, dTest$RESPONSE)

cm1$overall

> cm1$overall
Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull Ac
curacyPValue McnemarPValue
0.80500000 0.33696022 0.74321431 0.85750813 0.79000
000 0.33720145 0.05466394

cm1$byClass

cm1$byClass
Sensitivity Specificity Pos Pred Value Ne
g Pred Value Precision
0.3809524 0.9177215 0.5517241
0.8479532 0.5517241
Recall F1 Prevalence De
tection Rate Detection Prevalence
0.3809524 0.4507042 0.2100000
0.0800000 0.1450000
Balanced Accuracy
0.6493369
x <- data.frame(dTest,pred.bag)

#write output to a csv file

write.csv(x,"Output_Bagging.csv")

Summary
Boosting yields marginally better accuracy than bagging on the hold-out test set: 81% compared with 80.50%.
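
The two accuracies quoted above can be read straight from the confusion-matrix objects created earlier; a minimal sketch (cm0 from the boosting section, cm1 from the bagging section):

acc <- c(boosting = unname(cm0$overall["Accuracy"]), bagging = unname(cm1$overall["Accuracy"]))

round(acc, 4)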

Regularisation
Objective
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different
cities. Also, certain attributes of each product and store have been defined. The aim is to build a predictive
model and find out the sales of each product at a particular store.

Dataset
Train_UWu5bXk_Big Mart.csv

Variable - Description
Item_Identifier - Unique product ID (removed during pre-processing)
Item_Weight - Weight of the product
Item_Fat_Content - Whether the product is low fat or regular
Item_Visibility - Percentage of the store's total display area allocated to the product
Item_Type - Category to which the product belongs
Item_MRP - Maximum retail price (list price) of the product
Outlet_Identifier - Unique store ID
Outlet_Establishment_Year - Year in which the store was established (removed during pre-processing)
Outlet_Size - Size of the store
Outlet_Location_Type - Type of city in which the store is located
Outlet_Type - Whether the outlet is a grocery store or a supermarket
Item_Outlet_Sales - Sales of the product in the particular store (the outcome variable to be predicted)

Technique: Ridge Regression
R Codes and Outputs
#Ridge Regression

#The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive model and find out the sales of each product at a particular store.

#Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.

#loading the relevant libraries

library (glmnet )

library(MASS)

library (ISLR)

library(lattice)

#selection of data

data = read.csv(file.choose())

str(data)

#Cleaning and processing the data

data= data[,-8 ]

data= data[,-1 ]

names(data )

dim(data )

sum(is.na(data$Item_Outlet_Sales))#check for missing values in the data

data = na.omit(data) #remove all rows with missing values

dim(data )

sum(is.na(data ))

#To compare the ridge output with the OLS output we run the linear regression model first

#dividing the data into the training and the testing sets

data.train <- sample(1:nrow(data), nrow(data)*0.5)

dTrain <- data[data.train, ]

nrow(dTrain)

dTest <- data[-data.train, ]

#Linear Regression

ols = lm(Item_Outlet_Sales ~., dTrain)

summary(ols)

> summary(ols)

Call:
lm(formula = Item_Outlet_Sales ~ ., data = dTrain)

Residuals:
Min 1Q Median 3Q Max
-3483.0 -634.9 -81.8 532.7 6002.4

Coefficients: (7 not defined because of singularities)


Estimate Std. Error t value Pr(>|t|)
(Intercept) -1851.7738 147.2696 -12.574 <2e-16 *
**
Item_Weight -0.9359 3.9078 -0.239 0.8107
Item_Fat_Contentlow fat 380.7970 189.1178 2.014 0.0441 *
Item_Fat_ContentLow Fat 56.1231 94.4237 0.594 0.5523
Item_Fat_Contentreg 79.3920 175.7233 0.452 0.6514
Item_Fat_ContentRegular 114.1667 98.4877 1.159 0.2465
Item_Visibility -271.7998 382.7683 -0.710 0.4777
Item_TypeBreads 62.9835 124.7067 0.505 0.6136
Item_TypeBreakfast 58.0952 189.9650 0.306 0.7598
Item_TypeCanned 26.4251 90.5039 0.292 0.7703
Item_TypeDairy -1.6757 91.9574 -0.018 0.9855
Item_TypeFrozen Foods -77.2396 85.6644 -0.902 0.3673
Item_TypeFruits and Vegetables -53.9985 80.8350 -0.668 0.5042
Item_TypeHard Drinks -38.2177 128.5790 -0.297 0.7663
Item_TypeHealth and Hygiene -17.5756 98.6799 -0.178 0.8586
Item_TypeHousehold -107.6325 87.4929 -1.230 0.2187
Item_TypeMeat -40.5371 107.1454 -0.378 0.7052
Item_TypeOthers -20.7160 146.8059 -0.141 0.8878
Item_TypeSeafood -42.4768 232.7389 -0.183 0.8552
Item_TypeSnack Foods -62.3765 81.5387 -0.765 0.4443
Item_TypeSoft Drinks -104.1236 100.4662 -1.036 0.3001

Item_TypeStarchy Foods 25.7238 140.4724 0.183 0.8547
Item_MRP 15.5602 0.2934 53.025 <2e-16 *
**
Outlet_IdentifierOUT013 1946.9820 84.0984 23.151 <2e-16 *
**
Outlet_IdentifierOUT017 2045.1455 82.9143 24.666 <2e-16 *
**
Outlet_IdentifierOUT018 1637.0013 83.7154 19.554 <2e-16 *
**
Outlet_IdentifierOUT035 2106.3930 82.0403 25.675 <2e-16 *
**
Outlet_IdentifierOUT045 1866.3759 83.2493 22.419 <2e-16 *
**
Outlet_IdentifierOUT046 1901.1143 83.3527 22.808 <2e-16 *
**
Outlet_IdentifierOUT049 1987.9411 83.5246 23.801 <2e-16 *
**

#Prediction using Linear Regression

predictols = predict(ols, newdata = dTest)

plot(predictols)

#A plot of the residuals against fitted values is used to determine whether there are any systematic patterns,

#such as overestimation for most of the large values, or increasing spread as the model fitted values increase.

xyplot(resid(ols) ~ fitted(ols),
       xlab = "Fitted Values",
       ylab = "Residuals",
       main = "Residual Diagnostic Plot",
       panel = function(x, y, ...) {
         panel.grid(h = -1, v = -1)
         panel.abline(h = 0)
         panel.xyplot(x, y, ...)
       })

#The plot is probably ok but there are more cases of positive residuals and

#when we consider a normal probability plot we see that there are some deficiencies with the model

qqmath(~ resid(ols),
       xlab = "Theoretical Quantiles",
       ylab = "Residuals")

#The function resid extracts the model residuals from the fitted model object

#We would hope that this plot showed something approaching a straight line

#to support the model assumption about the distribution of the residuals

#we calculate the mean square errors for further comparison with other models

mean((predictols - dTest$Item_Outlet_Sales)^2)

> mean((predictols - dTest$Item_Outlet_Sales)^2)
[1] 1189168

#Performing ridge regression and the lasso in order to predict Item_Outlet_Sales on the BigMart data.

# The model.matrix() function is particularly useful for creating x; not only

#does it produce a matrix corresponding to the predictors but it also

#automatically transforms any qualitative variables into dummy variables.

#The latter property is important because glmnet() can only take numerical,

#quantitative inputs.

#help("seq")

x=model.matrix (Item_Outlet_Sales~.,data )[,-1]

y=data$Item_Outlet_Sales

grid =10^ seq (10,-2, length =100)

ridge.mod =glmnet (x,y,alpha =0, lambda =grid)

#By default, the glmnet() function standardizes the

#variables so that they are on the same scale. To turn off this default setting,

#use the argument standardize=FALSE.

#It shows from left to right the number of nonzero coefficients (Df),

#the percent (of null) deviance explained (%Dev) and the value of the penalty lambda.

ridge.mod

> ridge.mod

Call: glmnet(x = x, y = y, alpha = 0, lambda = grid)

Df %Dev Lambda
[1,] 36 1.642e-07 1.000e+10
[2,] 36 2.171e-07 7.565e+09
[3,] 36 2.870e-07 5.722e+09
[4,] 36 3.794e-07 4.329e+09
[5,] 36 5.015e-07 3.275e+09
[6,] 36 6.630e-07 2.477e+09
[7,] 36 8.764e-07 1.874e+09
[8,] 36 1.159e-06 1.417e+09
[9,] 36 1.532e-06 1.072e+09
[10,] 36 2.025e-06 8.111e+08
[11,] 36 2.676e-06 6.136e+08
[12,] 36 3.538e-06 4.642e+08
[13,] 36 4.677e-06 3.511e+08
[14,] 36 6.183e-06 2.656e+08
[15,] 36 8.173e-06 2.009e+08
[16,] 36 1.080e-05 1.520e+08
[17,] 36 1.428e-05 1.150e+08
[18,] 36 1.888e-05 8.697e+07
[19,] 36 2.496e-05 6.579e+07
[20,] 36 3.299e-05 4.977e+07

#plot of the coefficient profiles: each curve shows how a coefficient changes as the penalty varies

#(plotted against the L1 norm of the coefficient vector by default)

plot(ridge.mod, label = TRUE)

dim(coef(ridge.mod ))

dim(coef(ridge.mod ))
[1] 40 100

ridge.mod$lambda [50]

> ridge.mod$lambda [50]


[1] 11497.57

#We can obtain the actual coefficients at one or more values of lambda within the range of the sequence

coef(ridge.mod)[,50]

> coef(ridge.mod)[,50]
(Intercept) Item_Weight It
em_Fat_Contentlow fat
1768.5296057 0.4714473
6.7615077
Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-4.8883221 -13.3125877
6.9185662
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-290.3879446 3.7790286
-12.9511555
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
3.6287889 8.5570676
-4.4858574

Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
7.5728516 -5.4537667
-18.9152273
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
7.5533123 -8.3294249
-6.7918224
Item_TypeSeafood Item_TypeSnack Foods
Item_TypeSoft Drinks
47.3492024 12.3585385
-16.4090562
Item_TypeStarchy Foods Item_MRP Ou
tlet_IdentifierOUT013
22.2712201 1.7932583
28.8954492
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018 Ou
tlet_IdentifierOUT019
26.7496824 2.6719608
0.0000000
Outlet_IdentifierOUT027 Outlet_IdentifierOUT035 Ou
tlet_IdentifierOUT045
0.0000000 34.7145225
6.5502929
Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
Outlet_SizeHigh
17.9059818 29.2979713
28.8974982
Outlet_SizeMedium Outlet_SizeSmall Outle
t_Location_TypeTier 2
18.8550265 31.0140842
32.5158316
Outlet_Location_TypeTier 3 Outlet_TypeSupermarket Type1 Outlet_
TypeSupermarket Type2
-58.5002738 99.3006575
2.6731882
Outlet_TypeSupermarket Type3
0.0000000

coef <- coef(ridge.mod) #sparse matrix of coefficients across the whole lambda grid

coef

summ <- summary(coef)

summ

x<-data.frame(Origin = rownames(coef)[summ$i],

Destination = colnames(coef)[summ$j],

Weight = summ$x)

write.csv(x,"Output_Ridgecef.csv")

OUTPUT File Attached
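
#l2 norm of the ridge coefficient vector (excluding the intercept) at the 50th value of lambda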

sqrt(sum(coef(ridge.mod)[ -1 ,50]^2) )

> sqrt(sum(coef(ridge.mod)[ -1 ,50]^2) )


[1] 330.4143

ridge.mod$lambda [60]

> ridge.mod$lambda [60]


[1] 705.4802
coef(ridge.mod)[,60]

> coef(ridge.mod)[,60]
(Intercept) Item_Weight It
em_Fat_Contentlow fat
29.1878010 -4.6425519
12.4563384
Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-2.2040003 -141.5550260
36.0491284
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-848.6729160 -26.5768543
-203.1094940
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
68.2376470 11.3440468
-90.7943879
Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
47.7664754 -156.8928571
-97.8301151
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
-0.7233438 38.0428187
-217.8423830
Item_TypeSeafood Item_TypeSnack Foods
Item_TypeSoft Drinks
107.1972117 7.4562711
18.6877479
Item_TypeStarchy Foods Item_MRP Ou
tlet_IdentifierOUT013
273.4689065 10.1886705
230.0908254
Outlet_IdentifierOUT017 Outlet_IdentifierOUT018 Ou
tlet_IdentifierOUT019
159.0560125 272.8354786
0.0000000
Outlet_IdentifierOUT027 Outlet_IdentifierOUT035 Ou
tlet_IdentifierOUT045
0.0000000 121.1075127
90.1300494

Outlet_IdentifierOUT046 Outlet_IdentifierOUT049
Outlet_SizeHigh
90.2348227 134.9829698
230.0861606
Outlet_SizeMedium Outlet_SizeSmall Outle
t_Location_TypeTier 2
231.9717947 124.0771150
174.6308261
Outlet_Location_TypeTier 3 Outlet_TypeSupermarket Type1 Outlet_
TypeSupermarket Type2
-302.8971847 595.3107156
272.8358924
Outlet_TypeSupermarket Type3
0.0000000

sqrt(sum(coef(ridge.mod)[ -1 ,60]^2) )

> sqrt(sum(coef(ridge.mod)[ -1 ,60]^2) )


[1] 1357.927

# We can use the predict() function for a number of purposes. For instance,

#we can obtain the ridge regression coefficients for a new value of lambda, say s = 50

predict(ridge.mod ,s=50, type ="coefficients")[1:20 ,]

> predict(ridge.mod ,s=50, type ="coefficients")[1:20 ,]


(Intercept) Item_Weight It
em_Fat_Contentlow fat
-1.142843e+03 -8.961876e-02
4.968519e+01
Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-3.205728e+01 -6.101688e+01
2.741758e+01
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-3.410558e+02 3.564304e+01
-1.100413e+02
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
1.824980e+01 -5.374391e+01
-3.077317e+01
Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
-2.246690e+00 1.207918e+01
-3.913646e+00
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
-2.465764e+01 -3.611566e+01
1.413773e+01
Item_TypeSeafood Item_TypeSnack Foods

3.279985e+02 -4.170007e+00

# training and testing

set.seed (1)

train=sample (1: nrow(x), nrow(x)/2)

test=(- train )

y.test=y[test]

ridge.pred=predict (ridge.mod ,s=4, newx=x[test ,])

mean(( ridge.pred -y.test)^2)

#simply fit a model with just an intercept, we would have predicted each test observation using

#the mean of the training observations. In that case, we could compute the test set MSE like this:
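
#(sketch of the computation referred to above)

mean((mean(y[train]) - y.test)^2)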

#fitting a ridge regression model with lambda = 4 leads to a much lower test

#MSE than fitting a model with just an intercept. We now check whether

#there is any benefit to performing ridge regression with lambda = 4 instead of

#just performing least squares regression. Recall that least squares is simply

#ridge regression with lambda = 0.

ridge.pred=predict (ridge.mod ,s=0, newx=x[test ,], exact=T)

mean(( ridge.pred -y.test)^2)

> mean(( ridge.pred -y.test)^2)


[1] 1129263

lm(y~x, subset =train)

predict (ridge.mod ,s=0, exact =T,type="coefficients") [1:20 ,]

> predict (ridge.mod ,s=0, exact =T,type="coefficients") [1:20 ,]


(Intercept) Item_Weight It
em_Fat_Contentlow fat
-1273.8570048 -0.1753777
45.5903253

Item_Fat_ContentLow Fat Item_Fat_Contentreg It
em_Fat_ContentRegular
-37.6752811 -66.2119408
22.3313054
Item_Visibility Item_TypeBreads
Item_TypeBreakfast
-219.0993942 25.9599166
-126.0530737
Item_TypeCanned Item_TypeDairy
Item_TypeFrozen Foods
6.3310116 -71.8781581
-42.6367252
Item_TypeFruits and Vegetables Item_TypeHard Drinks Item_T
ypeHealth and Hygiene
-15.7890212 3.5453765
-8.7664006
Item_TypeHousehold Item_TypeMeat
Item_TypeOthers
-39.0268330 -45.3326115
6.1352188
Item_TypeSeafood Item_TypeSnack Foods
324.7468636 -18.9867244

#if we want to fit a (unpenalized) least squares model, then

#we should use the lm() function, since that function provides more useful

#outputs, such as standard errors and p-values for the coefficients.
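
#a minimal sketch of that unpenalised fit, reusing the design matrix x and the training index defined in this section

ols.fit <- lm(y ~ x, subset = train)

summary(ols.fit) #reports standard errors and p-values, which glmnet does not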

# finding optimum value of lambda using CV.

set.seed (1)

cv.out =cv.glmnet (x[train ,],y[train],alpha =0)

#the plot shows the cross-validation curve (red dotted line) and upper and lower standard deviation curves along the lambda sequence (error bars).

#Two selected values of lambda (lambda.min and lambda.1se) are indicated by the vertical dotted lines (see below).

plot(cv.out)

bestlam =cv.out$lambda.min

bestlam

> bestlam
[1] 104.0413

ridge.pred=predict (ridge.mod ,s=bestlam ,newx=x[test ,])

mean(( ridge.pred -y.test)^2)

> mean(( ridge.pred -y.test)^2)


[1] 1137458

Output=predict(cv.out, newx = x[test ,], s = "lambda.min")

#Users can control the folds used.

#Here we use the same folds so that we can also select a value for alpha.

foldid=sample(1:10,size=length(y),replace=TRUE)

cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)

cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)

cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)

cv1

log(cv1$lambda)

pch=19

par(mfrow=c(2,2))

plot(cv1);plot(cv.5);plot(cv0)

plot(log(cv1$lambda),cv1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=cv1$name)

points(log(cv.5$lambda),cv.5$cvm,pch=19,col="grey")

points(log(cv0$lambda),cv0$cvm,pch=19,col="blue")

legend(2,3,legend=c("alpha= 1","alpha= .5","alpha= 0"),pch=0,col=c("red","grey","blue"))

help("predict")

Technique: LASSO Regression


R Codes and Outputs

#LASSO regression

#The elastic-net penalty is controlled by alpha, and bridges the gap between the lasso (alpha = 1, the default) and ridge (alpha = 0).

#The tuning parameter lambda controls the overall strength of the penalty.

lasso.mod =glmnet(x[train ,],y[train],alpha =1, lambda =grid)

plot(lasso.mod)

lasso.pred=predict (lasso.mod ,s=0, newx=x[test ,], exact=T)

mean(( lasso.pred -y.test)^2)

mean(( ridge.pred -y.test)^2)

mean(( predictols -dTest$Item_Outlet_Sales)^2)

> mean(( lasso.pred -y.test)^2)


[1] 1146829
> mean(( ridge.pred -y.test)^2)
[1] 1137458
> mean(( predictols -dTest$Item_Outlet_Sales)^2)
[1] 1189168

Summary
Comparing the mean squared errors, ridge regression yields the lowest test MSE (1,137,458), followed by the lasso (1,146,829) and ordinary least squares (1,189,168).
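
For reference, the three test MSEs compared above can be collected in one place; a minimal sketch using the objects created in this section (note that the OLS model is evaluated on the dTest split while ridge and lasso use the x[test, ] split):

mse.compare <- data.frame(model = c("OLS", "Ridge (lambda.min)", "Lasso (s = 0)"),
                          test.MSE = c(mean((predictols - dTest$Item_Outlet_Sales)^2),
                                       mean((ridge.pred - y.test)^2),
                                       mean((lasso.pred - y.test)^2)))

mse.compare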

BayesiaLab application
Dataset
Asia.xbl

Dataset definition
Asia.xbl is a fictional Bayesian network which serves as a hypothetical "Expert System" for
pulmonary diseases. It encodes the cumulative knowledge of a pulmonary physician about all of
his patients with regard to lung diseases.

Each node in the network represents a patient-related variable.

The dataset/network contains 4 classes of nodes.

Node # - Node class - Variable names - Variable type

1. Patient characteristics - Smoker (Binary - T/F); Age of patient (Categorical - Adolescent/Adult/Geriatric); Indicator of a recent Asia visit by the patient (Binary - T/F)
2. Pulmonary diseases - Bronchitis (Binary - T/F); Cancer (Binary - T/F); Tuberculosis (Binary - T/F)
3. Logical node - TbOrCa, indicates whether either disease is present (Binary - T/F)
4. Symptoms - Dyspnea, shortness of breath (Binary - T/F); XRay (Binary - Normal/Abnormal)

The graph of the dataset represents the qualitative part of the pulmonary healthcare domain knowledge.

The arcs represent the probabilistic relationships between the nodes.

The quantitative part of the dataset is present in the conditional probability tables (CPT)
which are associated with each node.

To see the CPT for each node, an editor should be activated with a double click

The above picture shows the status of the node VisitAsia

The probability distribution table shows the marginal probability distribution of VisitAsia, where true = 1% and false = 99%. This means 1% of the patients have recently been to an Asian country.

The picture below shows the details of the node Bronchitis

The probability table shows the conditional probabilities of Bronchitis given Smoker. It can be seen that smokers are twice as likely to suffer from bronchitis as non-smokers.
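
To make the arithmetic behind the monitors concrete, the following is a small sketch in R with purely hypothetical CPT values (not the ones stored in Asia.xbl), showing how a marginal probability such as P(Bronchitis = true) follows from the CPT via the law of total probability:

p_smoker <- 0.46 #hypothetical P(Smoker = true)

p_bronc_smoker <- 0.60 #hypothetical P(Bronchitis = true | Smoker = true)

p_bronc_nonsmoker <- 0.30 #hypothetical P(Bronchitis = true | Smoker = false), half the smoker rate

p_bronc <- p_bronc_smoker * p_smoker + p_bronc_nonsmoker * (1 - p_smoker) #marginalise out the parent node

p_bronc #about 0.44 with these illustrative numbers, of the same order as the 43.87% read off the monitor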

The picture below shows the details of node Cancer

The above table shows the conditional probabilities of cancer given age and smoker.

The knowledge stored in the network would therefore help pulmonary doctors gain further insight into the relationships between the different patient-related variables.

The above descriptions are taken from modelling mode. For further analysis it is
required to switch to validation mode.

The screenshot of the validation mode -

As the above screenshot shows, the monitor panel is used to read and manipulate the states of individual nodes.

Analysis of Validation node VisitAsia

When a node is displayed in the monitor panel it is marked yellow. The monitor in the above diagram shows the marginal probability distribution of VisitAsia: 1% of patients recently travelled to an Asian country.

Analysis of validation node Bronchitis

The monitor in the above diagram shows that the a priori probability of having Bronchitis is 43.87%.

Analysis of validation node - Age

The above monitor shows the marginal probability distribution of Age, where 25% of the patients are adolescent, 40% are adult and 35% are in the geriatric category.

Now the given network is utilized to diagnose a new patient -

Geriatric category analysis -

Clicking the horizontal bar sets the evidence: the state Geriatric is highlighted in green, and the node Age is also highlighted in green to indicate that evidence has been set.

Once Age is set to Geriatric, Bronchitis gets a new conditional probability distribution, shown in the diagram below with grey arrows.

The patient also reports dyspnea (shortness of breath). The diagram below captures the probability distribution.

Dyspnea is set to true. Given the age = Geriatric and Dyspnea = True, the probability of
Bronchitis = true increases as shown in the below diagram.

Now given the age and symptoms of the patient, the doctor can consider cancer as a
possibility.

Analysis of validation node - Cancer

From the observations captured in the diagram below, the patient's probability of having cancer is 15%.

H0 (Null Hypothesis) - The patient claims that he/she has quit smoking.

HA (Alternative Hypothesis) - The doctor claims that there is a 0.75 probability that the patient is still smoking.

To analyse the situation, the Smoker monitor shown below is used -

On setting the probability of Smoker = true to 75% and fixing this value, new evidence is propagated through the network: the probability of cancer changes to 19%, as shown in the diagram.

Now the doctor orders a chest X-Ray based on the above result.

Analysis of the validation node XRay

Once the normal bar from the above XRay monitor is selected, the probability of the cancer
drops to 0.5% as shown in the diagram below -

and probability of Bronchitis increases to 88% as shown in the diagram below -

Conclusion - From the above probability evidence, the doctor confirms a diagnosis of bronchitis and starts treatment.

