Prepared for
Dr. Sridhar Vaidyanathan
**All explanations of the outputs are provided in the R comments in this document.
Ensemble Methods
Objective
The company wants to automate the credit eligibility process (in real time) based on the customer
details provided in the online application form. To automate this process, the task is to identify the
customer segments that are eligible for a credit amount, so that these customers can be specifically
targeted.
Dataset
GermanCredit.csv
Codelist
Var. # | Variable Name | Description | Variable Type | Code Description
12. | SAV_ACCT | Average balance in savings account | Categorical | 1: 100 <= ... < 500 DM; 2: 500 <= ... < 1000 DM; 3: >= 1000 DM; 4: unknown / no savings account
13. | EMPLOYMENT | Present employment since | Categorical | 0: unemployed; 1: < 1 year; 2: 1 <= ... < 4 years; 3: 4 <= ... < 7 years; 4: >= 7 years
14. | INSTALL_RATE | Installment rate as % of disposable income | Numerical |
15. | MALE_DIV | Applicant is male and divorced | Binary | 0: No, 1: Yes
16. | MALE_SINGLE | Applicant is male and single | Binary | 0: No, 1: Yes
17. | MALE_MAR_WID | Applicant is male and married or a widower | Binary | 0: No, 1: Yes
18. | CO-APPLICANT | Application has a co-applicant | Binary | 0: No, 1: Yes
19. | GUARANTOR | Applicant has a guarantor | Binary | 0: No, 1: Yes
20. | PRESENT_RESIDENT | Present resident since (years) | Categorical | 0: <= 1 year; 1: 1 < ... <= 2 years; 2: 2 < ... <= 3 years; 3: > 4 years
21. | REAL_ESTATE | Applicant owns real estate | Binary | 0: No, 1: Yes
22. | PROP_UNKN_NONE | Applicant owns no property (or unknown) | Binary | 0: No, 1: Yes
23. | AGE | Age in years | Numerical |
24. | OTHER_INSTALL | Applicant has other installment plan credit | Binary | 0: No, 1: Yes
25. | RENT | Applicant rents | Binary | 0: No, 1: Yes
26. | OWN_RES | Applicant owns residence | Binary | 0: No, 1: Yes
27. | NUM_CREDITS | Number of existing credits at this bank | Numerical |
28. | JOB | Nature of job | Categorical | 0: unemployed / unskilled - non-resident; 1: unskilled - resident; 2: skilled employee / official; 3: management / self-employed / highly qualified employee / officer
29. | NUM_DEPENDENTS | Number of people for whom liable to provide maintenance | Numerical |
30. | TELEPHONE | Applicant has phone in his or her name | Binary | 0: No, 1: Yes
31. | FOREIGN | Foreign worker | Binary | 0: No, 1: Yes
32. | RESPONSE | Credit rating is good | Binary | 0: No, 1: Yes
Technique: Boosting
R Codes and Outputs
#BOOSTING
library("ROCR")
library("verification")
library("gbm")
set.seed (100)
data = read.csv(file.choose())
str(data)
data= data[,-1]
data$CHK_ACCT = as.factor(data$CHK_ACCT)
data$HISTORY = as.factor(data$HISTORY)
data$NEW_CAR = as.factor(data$NEW_CAR)
PAGE 4
data$USED_CAR = as.factor(data$USED_CAR)
data$FURNITURE = as.factor(data$FURNITURE)
data$RADIO.TV = as.factor(data$RADIO.TV)
data$EDUCATION = as.factor(data$EDUCATION)
data$RETRAINING = as.factor(data$RETRAINING)
data$SAV_ACCT = as.factor(data$SAV_ACCT)
data$EMPLOYMENT = as.factor(data$EMPLOYMENT)
data$INSTALL_RATE = as.factor(data$INSTALL_RATE)
data$MALE_DIV = as.factor(data$MALE_DIV)
data$MALE_SINGLE = as.factor(data$MALE_SINGLE)
data$MALE_MAR_or_WID = as.factor(data$MALE_MAR_or_WID)
data$CO.APPLICANT = as.factor(data$CO.APPLICANT)
data$GUARANTOR = as.factor(data$GUARANTOR)
data$PRESENT_RESIDENT = as.factor(data$PRESENT_RESIDENT)
data$REAL_ESTATE = as.factor(data$REAL_ESTATE)
data$PROP_UNKN_NONE = as.factor(data$PROP_UNKN_NONE)
data$OTHER_INSTALL = as.factor(data$OTHER_INSTALL)
data$RENT = as.factor(data$RENT)
data$OWN_RES = as.factor(data$OWN_RES)
data$NUM_CREDITS = as.factor(data$NUM_CREDITS)
data$JOB = as.factor(data$JOB)
data$NUM_DEPENDENTS = as.factor(data$NUM_DEPENDENTS)
data$TELEPHONE = as.factor(data$TELEPHONE)
data$FOREIGN = as.factor(data$FOREIGN)
data$RESPONSE = as.factor(data$RESPONSE)
str(data)
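The sampling step that creates the training index is not shown in this extract; a minimal sketch, assuming an 80/20 split (consistent with the 800/200 train/test row counts reported below):
#Assumed reconstruction: the original index-creation call is not shown in the extract.
data.train = sample(1:nrow(data), 0.8*nrow(data))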
str(data.train)
names(data)
dTrain <- data[data.train, ]
dTest <- data[-data.train, ]
nrow(dTrain)
ncol(dTrain)
nrow(dTest)
> nrow(dTrain)
[1] 800
> ncol(dTrain)
[1] 31
> nrow(dTest)
[1] 200
boost.credit = gbm(RESPONSE~ .,
data= dTrain,
distribution="gaussian",
n.trees=5000 ,
interaction.depth =4,
cv.folds=7)
summary (boost.credit)
#To determine the optimum number of trees (iterations), which reduces processing time and improves the fit.
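The call that computes best.iter is not shown in the extract; a minimal sketch, assuming gbm.perf() with the cross-validation method (the model was fitted with cv.folds = 7):
#Assumed reconstruction of the missing call.
best.iter = gbm.perf(boost.credit, method = "cv")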
print(best.iter)
> print(best.iter)
[1] 4993
#Estimates of the marginal effects of predictor(s) via partial dependence plots,
#given the other predictors (partialling them out, i.e., after accounting for their average effects).
#The partial dependence function gives the "average" trend of that variable
#(integrating out all others in the model); it is the shape of that trend that is important.
#We may interpret the relative range of these plots across predictor variables, but not the absolute range.
par(mfrow = c(2,2))
plot(boost.credit ,i="CHK_ACCT")
plot(boost.credit ,i="AMOUNT")
plot(boost.credit ,i="HISTORY")
plot(boost.credit ,i="DURATION")
#To find the cut-off: the boosting output is continuous and must be converted to a categorical prediction.
#We use the area under the ROC curve to determine the best cut-off.
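The prediction step that produces pp is not shown in the extract; a minimal sketch, assuming predictions on the test set at the optimal number of trees:
#Assumed reconstruction: continuous boosting scores for the test set.
pp = predict(boost.credit, newdata = dTest, n.trees = best.iter)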
pp <- ifelse(pp<1.5753,0,1)
#print(pp)
x <- data.frame(dTest,pp)
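The construction of the ROC objects (pred, perf) and auc is not shown in the extract; a minimal sketch using ROCR, here applied to the thresholded predictions pp:
#Assumed reconstruction of the missing ROC objects.
pred = prediction(pp, dTest$RESPONSE)
perf = performance(pred, "tpr", "fpr")
auc = performance(pred, "auc")@y.values[[1]]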
print(auc)
rd <- data.frame(x=perf@x.values[[1]],y=perf@y.values[[1]])
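The ggplot object p is not created in the extract; a minimal sketch that draws the ROC curve from rd before the reference diagonal is added below:
#Assumed reconstruction of the missing ggplot object.
p <- ggplot(rd, aes(x = x, y = y)) + geom_line(colour = "blue")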
p <- p + geom_segment(aes(x=0,y=0,xend=1,yend=1),colour="black",linetype= 2)
p + labs(title = "ROC")
> print(auc)
[1] 0.7661242
cm0 <- confusionMatrix(as.factor(pp), dTest$RESPONSE)   #pp coerced to a factor to match the factor reference
print(cm0)
> print(cm0)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 29 25
1 13 133
Accuracy : 0.81
95% CI : (0.7487, 0.8619)
No Information Rate : 0.79
P-Value [Acc > NIR] : 0.27532
Kappa : 0.4817
Mcnemar's Test P-Value : 0.07435
Sensitivity : 0.6905
Specificity : 0.8418
Pos Pred Value : 0.5370
Neg Pred Value : 0.9110
Prevalence : 0.2100
Detection Rate : 0.1450
Detection Prevalence : 0.2700
Balanced Accuracy : 0.7661
'Positive' Class : 0
write.csv(x,"Output.csv")
Technique: Bagging
R Codes and Outputs
#bagging
library (randomForest)
set.seed (100)
ncol(dTest)
nrow(dTest)
#Train using various values of mtry (the number of variables available for splitting at each tree node).
bag.credit1 =randomForest(RESPONSE~.,data=dTrain ,mtry=30, ntree=500, importance =TRUE)
#Plot the out-of-bag error rate curve and the per-class misclassification error rate curves.
plot(bag.credit1)
legend("topright", colnames(bag.credit1$err.rate),col=1:4,cex=0.8,fill=1:4)
bag.credit1
> bag.credit1
Call:
 randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 30, ntree = 500, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 30
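The call that fits the second forest is not shown in the extract; reconstructed from the Call: line printed below (mtry = 6, default ntree = 500):
bag.credit2 = randomForest(RESPONSE~., data=dTrain, mtry=6, importance=TRUE)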
bag.credit2
> bag.credit2
Call:
 randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 6, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 6
plot(bag.credit2)
legend("topright", colnames(bag.credit2$err.rate),col=1:4,cex=0.8,fill=1:4)
bag.credit3
> bag.credit3
Call:
 randomForest(formula = RESPONSE ~ ., data = dTrain, mtry = 5, importance = TRUE)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 5
Confusion matrix (partial output; the class-0 row is truncated in the extract):
   0   1 class.error
1 44 498  0.08118081
plot(bag.credit3)
legend("topright", colnames(bag.credit3$err.rate),col=1:4,cex=0.8,fill=1:4)
#For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded,
#before and after permuting each predictor. The difference between the two accuracies is then averaged over all trees and normalised by the standard error.
vi <- varImp(bag.credit3)   #NOTE: the extract refers to an undefined 'bag.credit'; the most recent fit (bag.credit3) is assumed here
vi
> vi
0 1
CHK_ACCT 19.06621013 19.06621013
DURATION 12.10922121 12.10922121
HISTORY 9.01551629 9.01551629
NEW_CAR 2.13719444 2.13719444
USED_CAR 2.42842203 2.42842203
FURNITURE -0.08435182 -0.08435182
RADIO.TV 3.23616591 3.23616591
EDUCATION 0.43650963 0.43650963
RETRAINING 0.92453546 0.92453546
AMOUNT 8.48411289 8.48411289
SAV_ACCT 4.36585660 4.36585660
#type = 2 plots the mean decrease in node impurity (and not the mean decrease in accuracy).
varImpPlot(bag.credit3, type = 2)   #'bag.credit3' assumed, as above
vi$importance
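The prediction step that produces pred.bag is not shown in the extract; a minimal sketch, assuming class predictions on the test set from bag.credit1 (mtry = 30, i.e., all predictors, which is bagging proper):
#Assumed reconstruction; bag.credit1 is the bagging model (all 30 predictors tried at each split).
pred.bag = predict(bag.credit1, newdata = dTest)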
plot(pred.bag)
cm1 <- confusionMatrix(pred.bag, dTest$RESPONSE)
cm1$overall
> cm1$overall
      Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull AccuracyPValue  McnemarPValue
    0.80500000     0.33696022     0.74321431     0.85750813     0.79000000     0.33720145     0.05466394
cm1$byClass
> cm1$byClass
         Sensitivity          Specificity       Pos Pred Value       Neg Pred Value            Precision
           0.3809524            0.9177215            0.5517241            0.8479532            0.5517241
              Recall                   F1           Prevalence       Detection Rate Detection Prevalence
           0.3809524            0.4507042            0.2100000            0.0800000            0.1450000
   Balanced Accuracy
           0.6493369
x <- data.frame(dTest,pred.bag)
write.csv(x,"Output_Bagging.csv")
Summary
Boosting yields marginally better accuracy than the bagging algorithm: 81.0% compared with 80.5%.
Regularisation
Objective
The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different
cities. Certain attributes of each product and store have also been defined. The aim is to build a predictive
model and predict the sales of each product at a particular store.
Dataset
Train_UWu5bXk_Big Mart.csv
Variable | Description
Item_Identifier | Unique product ID
Item_Weight | Weight of the product
Item_Fat_Content | Whether the product is low fat or regular
Item_Visibility | % of total display area in the store allocated to the product
Item_Type | Category to which the product belongs
Item_MRP | Maximum retail price (list price) of the product
Outlet_Identifier | Unique store ID
Outlet_Establishment_Year | Year in which the store was established
Outlet_Size | Size of the store
Outlet_Location_Type | Type of city in which the store is located
Outlet_Type | Whether the outlet is a grocery store or a supermarket
Item_Outlet_Sales | Sales of the product in the particular store (target variable)
Technique: Ridge Regression
R Codes and Outputs
#Ridge Regression
#The data scientists at BigMart have collected 2013 sales data for 1559 products across 10 stores in different cities.
#Certain attributes of each product and store have also been defined. The aim is to build a predictive model
#and predict the sales of each product at a particular store.
#Using this model, BigMart will try to understand the properties of products and stores which play a key role in increasing sales.
library (glmnet )
library(MASS)
library (ISLR)
library(lattice)
#selection of data
data = read.csv(file.choose())
str(data)
data= data[,-8 ]
data= data[,-1 ]
names(data )
dim(data )
sum(is.na(data ))
#To compare the ridge output with the OLS output we run the linear regression model first
#Dividing the data into the training and the testing sets.
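The split itself is not shown in the extract; a minimal sketch, assuming a 70/30 random split (the original proportion is not reported) and the index vector train that is reused later (test = -train):
#Assumed reconstruction of the missing split.
train = sample(1:nrow(data), floor(0.7*nrow(data)))
dTrain = data[train, ]
dTest = data[-train, ]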
nrow(dTrain)
#Linear Regression
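The fitting call is not shown; reconstructed from the Call: line in the summary output below:
ols = lm(Item_Outlet_Sales ~ ., data = dTrain)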
summary(ols)
> summary(ols)
Call:
lm(formula = Item_Outlet_Sales ~ ., data = dTrain)
Residuals:
Min 1Q Median 3Q Max
-3483.0 -634.9 -81.8 532.7 6002.4
Coefficients (partial output):
Item_TypeStarchy Foods      25.7238   140.4724   0.183   0.8547
Item_MRP                    15.5602     0.2934  53.025   <2e-16 ***
Outlet_IdentifierOUT013   1946.9820    84.0984  23.151   <2e-16 ***
Outlet_IdentifierOUT017   2045.1455    82.9143  24.666   <2e-16 ***
Outlet_IdentifierOUT018   1637.0013    83.7154  19.554   <2e-16 ***
Outlet_IdentifierOUT035   2106.3930    82.0403  25.675   <2e-16 ***
Outlet_IdentifierOUT045   1866.3759    83.2493  22.419   <2e-16 ***
Outlet_IdentifierOUT046   1901.1143    83.3527  22.808   <2e-16 ***
Outlet_IdentifierOUT049   1987.9411    83.5246  23.801   <2e-16 ***
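The object predictols is not created in the extract; a plausible (assumed) reconstruction is the OLS predictions on the test set:
#Assumed; the original definition of predictols is not shown.
predictols = predict(ols, newdata = dTest)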
plot(predictols)
#A plot of the residuals against fitted values is used to determine whether there are any systematic patterns,
#such as over estimation for most of the large values or increasing spread as the model fitted values increase.
xyplot(resid(ols) ~ fitted(ols),
       ylab = "Residuals",
       panel = function(x, y, ...) {
         panel.abline(h = 0)
         panel.xyplot(x, y, ...)
       })
#The plot is probably ok but there are more cases of positive residuals and
#when we consider a normal probability plot we see that there are some deficiencies with the model
qqmath(~ resid(ols), ylab = "Residuals")
#The function resid extracts the model residuals from the fitted model object
#We would hope that this plot showed something approaching a straight line
#to support the model assumption about the distribution of the residuals
#we calculate the mean square errors for further comparison with other models
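A minimal sketch of that calculation (the original line is not shown; the dTest split defined above is assumed):
#Assumed sketch of the OLS test-set mean square error.
mse.ols = mean((predict(ols, newdata = dTest) - dTest$Item_Outlet_Sales)^2, na.rm = TRUE)
mse.ols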
#Performing ridge regression and the lasso in order to predict Item_Outlet_Sales on the BigMart data.
#model.matrix() automatically transforms any qualitative variables into dummy variables.
#The latter property is important because glmnet() can only take numerical, quantitative inputs.
#help("seq")
y=data$Item_Outlet_Sales
#By default, glmnet() standardizes the variables so that they are on the same scale. To turn off this default setting, use standardize = FALSE.
#The printed model object shows, from left to right, the number of nonzero coefficients (Df),
#the percent (of null) deviance explained (%Dev) and the value of lambda (Lambda).
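The glmnet() fitting call is not shown; a reconstruction along ISLR lines, assuming a lambda grid from 10^10 down to 10^-2 and alpha = 0 for the ridge penalty (consistent with the 100 lambda values and the 40 x 100 coefficient matrix reported below):
#Assumed reconstruction of the missing ridge fit.
grid = 10^seq(10, -2, length = 100)
ridge.mod = glmnet(x, y, alpha = 0, lambda = grid)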
ridge.mod
> ridge.mod
Df %Dev Lambda
[1,] 36 1.642e-07 1.000e+10
[2,] 36 2.171e-07 7.565e+09
[3,] 36 2.870e-07 5.722e+09
[4,] 36 3.794e-07 4.329e+09
[5,] 36 5.015e-07 3.275e+09
[6,] 36 6.630e-07 2.477e+09
[7,] 36 8.764e-07 1.874e+09
[8,] 36 1.159e-06 1.417e+09
[9,] 36 1.532e-06 1.072e+09
[10,] 36 2.025e-06 8.111e+08
[11,] 36 2.676e-06 6.136e+08
[12,] 36 3.538e-06 4.642e+08
[13,] 36 4.677e-06 3.511e+08
[14,] 36 6.183e-06 2.656e+08
[15,] 36 8.173e-06 2.009e+08
[16,] 36 1.080e-05 1.520e+08
[17,] 36 1.428e-05 1.150e+08
[18,] 36 1.888e-05 8.697e+07
[19,] 36 2.496e-05 6.579e+07
[20,] 36 3.299e-05 4.977e+07
dim(coef(ridge.mod))
> dim(coef(ridge.mod))
[1]  40 100
ridge.mod$lambda [50]
#We can obtain the actual coefficients at one or more values of lambda within the range of the sequence.
coef(ridge.mod)[,50]
> coef(ridge.mod)[,50]
(Intercept)                      1768.5296057
Item_Weight                         0.4714473
Item_Fat_Contentlow fat             6.7615077
Item_Fat_ContentLow Fat            -4.8883221
Item_Fat_Contentreg               -13.3125877
Item_Fat_ContentRegular             6.9185662
Item_Visibility                  -290.3879446
Item_TypeBreads                     3.7790286
Item_TypeBreakfast                -12.9511555
Item_TypeCanned                     3.6287889
Item_TypeDairy                      8.5570676
Item_TypeFrozen Foods              -4.4858574
Item_TypeFruits and Vegetables      7.5728516
Item_TypeHard Drinks               -5.4537667
Item_TypeHealth and Hygiene       -18.9152273
Item_TypeHousehold                  7.5533123
Item_TypeMeat                      -8.3294249
Item_TypeOthers                    -6.7918224
Item_TypeSeafood                   47.3492024
Item_TypeSnack Foods               12.3585385
Item_TypeSoft Drinks              -16.4090562
Item_TypeStarchy Foods             22.2712201
Item_MRP                            1.7932583
Outlet_IdentifierOUT013            28.8954492
Outlet_IdentifierOUT017            26.7496824
Outlet_IdentifierOUT018             2.6719608
Outlet_IdentifierOUT019             0.0000000
Outlet_IdentifierOUT027             0.0000000
Outlet_IdentifierOUT035            34.7145225
Outlet_IdentifierOUT045             6.5502929
Outlet_IdentifierOUT046            17.9059818
Outlet_IdentifierOUT049            29.2979713
Outlet_SizeHigh                    28.8974982
Outlet_SizeMedium                  18.8550265
Outlet_SizeSmall                   31.0140842
Outlet_Location_TypeTier 2         32.5158316
Outlet_Location_TypeTier 3        -58.5002738
Outlet_TypeSupermarket Type1       99.3006575
Outlet_TypeSupermarket Type2        2.6731882
Outlet_TypeSupermarket Type3        0.0000000
coef <- coef(ridge.mod)     #sparse matrix of coefficients (one column per lambda value)
coef
summ <- summary(coef)       #triplet (i, j, x) form of the sparse matrix; this creation step is assumed, it is not shown in the extract
summ
x<-data.frame(Origin = rownames(coef)[summ$i],
Destination = colnames(coef)[summ$j],
Weight = summ$x)
write.csv(x,"Output_Ridgecef.csv")
OUTPUT File Attached
sqrt(sum(coef(ridge.mod)[ -1 ,50]^2) )
ridge.mod$lambda [60]
> coef(ridge.mod)[,60]
(Intercept)                        29.1878010
Item_Weight                        -4.6425519
Item_Fat_Contentlow fat            12.4563384
Item_Fat_ContentLow Fat            -2.2040003
Item_Fat_Contentreg              -141.5550260
Item_Fat_ContentRegular            36.0491284
Item_Visibility                  -848.6729160
Item_TypeBreads                   -26.5768543
Item_TypeBreakfast               -203.1094940
Item_TypeCanned                    68.2376470
Item_TypeDairy                     11.3440468
Item_TypeFrozen Foods             -90.7943879
Item_TypeFruits and Vegetables     47.7664754
Item_TypeHard Drinks             -156.8928571
Item_TypeHealth and Hygiene       -97.8301151
Item_TypeHousehold                 -0.7233438
Item_TypeMeat                      38.0428187
Item_TypeOthers                  -217.8423830
Item_TypeSeafood                  107.1972117
Item_TypeSnack Foods                7.4562711
Item_TypeSoft Drinks               18.6877479
Item_TypeStarchy Foods            273.4689065
Item_MRP                           10.1886705
Outlet_IdentifierOUT013           230.0908254
Outlet_IdentifierOUT017           159.0560125
Outlet_IdentifierOUT018           272.8354786
Outlet_IdentifierOUT019             0.0000000
Outlet_IdentifierOUT027             0.0000000
Outlet_IdentifierOUT035           121.1075127
Outlet_IdentifierOUT045            90.1300494
Outlet_IdentifierOUT046            90.2348227
Outlet_IdentifierOUT049           134.9829698
Outlet_SizeHigh                   230.0861606
Outlet_SizeMedium                 231.9717947
Outlet_SizeSmall                  124.0771150
Outlet_Location_TypeTier 2        174.6308261
Outlet_Location_TypeTier 3       -302.8971847
Outlet_TypeSupermarket Type1      595.3107156
Outlet_TypeSupermarket Type2      272.8358924
Outlet_TypeSupermarket Type3        0.0000000
sqrt(sum(coef(ridge.mod)[ -1 ,60]^2) )
# We can use the predict() function for a number of purposes. For instance,
#we can obtain the ridge regression coefficients for a new value of lambda, say 50.
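The predict() call itself is not shown; in the standard ISLR form it would be (assumed):
predict(ridge.mod, s = 50, type = "coefficients")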
3.279985e+02 -4.170007e+00
set.seed (1)
test=(- train )
y.test=y[test]
#If we had instead simply fit a model with just an intercept, we would have predicted each test observation using
#the mean of the training observations. In that case, we could compute the test set MSE like this:
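A minimal sketch of that calculation (the original line is not shown):
#Assumed sketch: test MSE of an intercept-only model.
mean((mean(y[train]) - y.test)^2)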
#Fitting a ridge regression model should lead to a much lower test MSE than fitting a model with just an intercept. We now check whether
#there is any benefit to performing ridge regression instead of just performing least squares regression. Recall that least squares is simply ridge regression with lambda = 0.
(partial coefficient output; the command and the first rows are truncated in the extract)
Item_Fat_ContentLow Fat            -37.6752811
Item_Fat_Contentreg                -66.2119408
Item_Fat_ContentRegular             22.3313054
Item_Visibility                   -219.0993942
Item_TypeBreads                     25.9599166
Item_TypeBreakfast                -126.0530737
Item_TypeCanned                      6.3310116
Item_TypeDairy                     -71.8781581
Item_TypeFrozen Foods              -42.6367252
Item_TypeFruits and Vegetables     -15.7890212
Item_TypeHard Drinks                 3.5453765
Item_TypeHealth and Hygiene         -8.7664006
Item_TypeHousehold                 -39.0268330
Item_TypeMeat                      -45.3326115
Item_TypeOthers                      6.1352188
Item_TypeSeafood                   324.7468636
Item_TypeSnack Foods               -18.9867244
#If we want standard errors and p-values for the coefficients, we should use the lm() function, since that function provides more useful outputs.
set.seed (1)
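The cross-validation call that creates cv.out is not shown; a minimal sketch along ISLR lines, assuming ridge (alpha = 0) cross-validation on the training rows:
#Assumed reconstruction of the missing call.
cv.out = cv.glmnet(x[train, ], y[train], alpha = 0)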
#This plots the cross-validation curve (red dotted line), with upper and lower standard deviation curves along the lambda sequence (error bars).
#Two selected lambdas are indicated by the vertical dotted lines (see below).
plot(cv.out)
bestlam =cv.out$lambda.min
bestlam
> bestlam
[1] 104.0413
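With bestlam in hand, the test-set MSE of the ridge model can be computed for comparison with the OLS MSE above; a minimal sketch (the original calculation is not shown):
#Assumed sketch of the ridge test-set mean square error at the cross-validated lambda.
ridge.pred = predict(ridge.mod, s = bestlam, newx = x[test, ])
mean((ridge.pred - y.test)^2)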
#Here we use the same folds so we can also select a value for alpha.
foldid=sample(1:10,size=length(y),replace=TRUE)
cv1=cv.glmnet(x,y,foldid=foldid,alpha=1)
cv.5=cv.glmnet(x,y,foldid=foldid,alpha=.5)
cv0=cv.glmnet(x,y,foldid=foldid,alpha=0)
cv1
par(mfrow=c(2,2))
plot(cv1);plot(cv.5);plot(cv0)
plot(log(cv1$lambda),cv1$cvm,pch=19,col="red",xlab="log(Lambda)",ylab=cv1$name)
points(log(cv.5$lambda),cv.5$cvm,pch=19,col="grey")
points(log(cv0$lambda),cv0$cvm,pch=19,col="blue")
help("predict")
#Lasso regression
#The elastic-net penalty is controlled by alpha, and bridges the gap between the lasso (alpha = 1, the default) and ridge (alpha = 0).
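The lasso fitting call is not shown; a minimal sketch along ISLR lines, assuming alpha = 1 on the training rows with the same lambda grid, producing the coefficient-path plot below:
#Assumed reconstruction of the missing lasso fit.
lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda = grid)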
plot(lasso.mod)
Summary
Comparing the mean square errors, ridge regression appears to yield the best results.
BayesiaLab application
Dataset
Asia.xbl
Dataset definition
Asia.xbl is a fictional Bayesian network which serves as a hypothetical "expert system" for
pulmonary diseases. It encodes the cumulative knowledge of a pulmonary physician about all of
his patients with regard to lung diseases.
The graph of the network represents the qualitative part of the pulmonary healthcare
domain knowledge.
The quantitative part of the network is contained in the conditional probability tables (CPTs)
associated with each node.
To see the CPT for a node, the editor is activated with a double click.
The probability distribution table shows the marginal probability distribution of
VisitAsia, where true = 1% and false = 99%. This means 1% of the patients have recently
been to an Asian country.
The table shows the conditional probabilities of Bronchitis given Smoker. It can be seen that smokers
are twice as likely to suffer from bronchitis as non-smokers.
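As an illustration of how such a conditional probability table is used, a short R sketch of the underlying Bayes'-rule arithmetic; the numbers are hypothetical and chosen only to reflect the "twice as likely" statement, they are not taken from Asia.xbl:
#Hypothetical CPT values for illustration only (not the actual Asia.xbl probabilities).
p_smoker = 0.5                          #assumed prior probability of being a smoker
p_bronch_given_smoker = 0.60            #assumed: smokers twice as likely to have bronchitis...
p_bronch_given_nonsmoker = 0.30         #...as non-smokers
#Marginal probability of bronchitis (law of total probability)
p_bronch = p_bronch_given_smoker*p_smoker + p_bronch_given_nonsmoker*(1 - p_smoker)
#Posterior probability of being a smoker given that the patient has bronchitis (Bayes' rule)
p_smoker_given_bronch = p_bronch_given_smoker*p_smoker / p_bronch
p_bronch; p_smoker_given_bronch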
The above table shows the conditional probabilities of Cancer given Age and Smoker.
The knowledge stored in the network thus helps pulmonary doctors gain further insight into the
relationships among the different patient-related variables.
The above descriptions are taken from the modelling mode. For further analysis it is
necessary to switch to validation mode.
As shown in the screenshot above, the monitor panel is used to read and manipulate the states of
individual nodes.
Analysis in validation mode: node VisitAsia
When displayed, the node in the monitor panel is marked in yellow. The monitor in the above
diagram shows the marginal probability distribution of VisitAsia: 1% of patients recently
travelled to an Asian country.
The monitor in the above diagram shows that the a priori probability of having Bronchitis is 43.87%.
The above monitor shows the marginal probability distribution of Age, where 25% of the patients
are adolescent, 40% are adult and 35% are in the geriatric category.
Once the horizontal bar is clicked, the state geriatric is highlighted in green, indicating that the
evidence has been set. The node Age is also highlighted in green to show that evidence is set on it.
Once Age is set to geriatric, Bronchitis gets a new conditional probability distribution, as
shown in the diagram below (grey arrows).
The patient also reports Dyspnea (shortness of breath). The diagram below captures the resulting
probability distribution.
Dyspnea is set to true. Given Age = geriatric and Dyspnea = true, the probability of
Bronchitis = true increases, as shown in the diagram below.
Given the age and symptoms of the patient, the doctor can now consider cancer as a
possibility.
From the observation captured in the diagram below, the patient's probability of having
cancer is 15%.
H0 (null hypothesis): the patient claims that he/she has quit smoking.
HA (alternative hypothesis): the doctor believes there is a 0.75 probability that the patient is still
smoking.
To analyse the situation, the Smoker monitor shown below is used.
On setting the new probability of Smoker = true to 75% and fixing this value, new evidence is propagated
through the network: the probability of cancer changes to 19%, as shown in the diagram.
Based on this result, the doctor orders a chest X-ray.
Once the "normal" bar in the XRay monitor above is selected, the probability of cancer
drops to 0.5%, as shown in the diagram below,
and the probability of Bronchitis increases to 88%, as shown in the diagram below.
Conclusion: based on the above probability evidence, the doctor confirms a diagnosis of
bronchitis and starts the treatment.