of Statistical Prediction
Suggested Answers to Project 2
Project 2: Logistic Regression and Model
Selection
1/19/2003
Question 1
Give a brief explanation of the meaning of these commands.
data(birthwt)
# Load the data set birthwt, which is provided by the MASS package.
# Alternatively, load the whole MASS package into the workspace with the
command require(MASS); the data set is then available by name.
attach(birthwt)
# The data set birthwt is attached to the search path so that the variables in the
data set can be accessed directly by name. Otherwise, we have to use
birthwt$varname instead of varname.
race <- factor(race, labels = c("white", "black", "other"))
# The function factor is used to encode a vector as a factor. race is originally
stored as a numeric variable. This command converts it into a categorical
factor with the labels white, black, and other.
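As a small illustration of what factor does with labels, here is a sketch on a made-up vector. The name race_code and its values are hypothetical; only the three labels come from the command above.

```r
# Hypothetical numeric codes 1, 2, 3 (not the real birthwt data)
race_code <- c(1, 2, 3, 1, 1)

# factor() maps the sorted distinct codes to the labels, in order
race_f <- factor(race_code, labels = c("white", "black", "other"))

levels(race_f)        # the three labels, in code order
is.numeric(race_code) # the original vector is numeric
is.factor(race_f)     # the result is a categorical factor
```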
table(ftv)
# table uses the cross-classifying factors to build a contingency table of the counts at
each combination of factor levels. The result of this command is as follows:
  0   1   2   3   4   6
100  47  30   7   4   1
ftv <- factor(ftv)
# The function factor is used to encode a vector as a factor.
levels(ftv)[-(1:2)] <- "2+"
# This command renames every level except the first two (that is, 2, 3, 4, and 6)
to "2+", so ftv is collapsed into three levels: 0, 1, and 2+.
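The effect of the level assignment can be checked on a vector rebuilt from the table(ftv) counts shown earlier. This is only a sketch: the name visits is hypothetical, and the vector reproduces the counts, not the original row order.

```r
# Rebuild a vector with the same counts as table(ftv): 100, 47, 30, 7, 4, 1
visits <- factor(rep(c(0, 1, 2, 3, 4, 6), times = c(100, 47, 30, 7, 4, 1)))

levels(visits)                  # six levels before collapsing
levels(visits)[-(1:2)] <- "2+"  # merge every level after the first two
table(visits)                   # counts per collapsed level
```

The last level holds 30 + 7 + 4 + 1 = 42 observations, because assigning the same name to several levels merges them.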
2. Convert ptl to two levels and name the new variable ptd.
ptd <- factor(ptl > 0)
# The function factor is used to encode a vector as a factor. Here, ptd is a new
factor with only two levels: FALSE (ptl is 0) and TRUE (ptl is greater than 0).
3. Create a new data frame bwt.
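bwt <- data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0), ptd, ht = (ht > 0),
ui = (ui > 0), ftv)
# Create a new data frame bwt. low is stored as a factor, and smoke, ht, and ui
are converted to logical (TRUE/FALSE) indicators.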
4. Clean up data.
detach(birthwt)
# Remove the data set from the search path of available R objects. This command can be
used to remove either a data frame which has been attached or a package which was
loaded previously.
rm(race, ptd, ftv)
# remove and rm can be used to remove objects. Here, the variables race, ptd,
and ftv, which were created in the workspace, are removed and no longer exist.
The copies stored inside the data frame bwt are not affected.
Question 2
Give a brief explanation of the specification of the regression
model.
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
# glm is used to fit generalized linear models (GLM) to the data.
In a GLM, two components need to be specified.
The first is the error distribution of the dependent variable. Here, the
chosen distribution is the binomial distribution, specified by
family=binomial.
The second concerns the unknown parameter of that distribution. For the
binomial distribution, it is the probability of success p, and p is what
relates the independent variables to the dependent variable.
A link function makes this connection. The default link function
for the binomial family is the logit, which gives
$F(x) = P(Y = 1 \mid x) = \exp(z) / [1 + \exp(z)]$, where $z = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K$.
The fitted values of this model are the probabilities of low=1 given all the
other variables in the data frame bwt. Here, we only consider the additive
model: every variable enters the linear predictor z as a main effect. (There is no
interaction term in the model.)
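The relation between the linear predictor z and the probability can be checked numerically. This is a minimal sketch: inv_logit is a hypothetical helper name and the z values are arbitrary, while plogis is the logistic distribution function in base R.

```r
# Inverse logit written out as in the formula: exp(z) / (1 + exp(z))
inv_logit <- function(z) exp(z) / (1 + exp(z))

z <- c(-2, 0, 2)   # arbitrary values of the linear predictor
p <- inv_logit(z)  # corresponding probabilities P(Y = 1 | x)

all.equal(p, plogis(z))         # agrees with the built-in logistic function
all.equal(log(p / (1 - p)), z)  # the logit (log-odds) recovers z
```

A linear predictor of 0 corresponds to a probability of 0.5, and the logit maps any probability in (0, 1) back to the whole real line.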
summary(birthwt.glm, correlation = FALSE)
It gives the deviance residuals and coefficients.
birthwt.step$anova
This command is useful in showing the process of deleting variables.
Here, birthwt.step holds the result of the stepwise selection.
For this data set, it removes the two variables ftv and age, and the
final AIC value is reduced to 213.8516. The selection procedure stops
because no further reduction of the AIC value can be achieved by removing
any other variable.
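The call that creates birthwt.step is not shown above. One way it could be produced, assuming the stepAIC function from the MASS package was used (an assumption, not something the text confirms), is sketched here. The data frame is rebuilt with with() instead of attach so that the block runs on its own.

```r
library(MASS)  # provides the birthwt data and stepAIC

# Rebuild bwt as in Question 1, without attaching the data set
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

# Additive logistic regression with all variables
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# Assumed selection step: stepwise search by AIC
birthwt.step <- stepAIC(birthwt.glm, trace = FALSE)
birthwt.step$anova  # table of the steps taken and the AIC after each one
```

The AIC values printed by this sketch depend on how the model is set up and need not match the figure quoted above.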
Question 3
Repeat the steps in Question 2 and consider all models including
pairwise interactions.
To ensure that collinearity is not present in the model,
we usually start by checking the correlations among
all independent variables.
The difficult question is how to assess correlation involving categorical
variables. Refer to your notes on association.
Implementation of strategy 1
Start from an additive model with all independent variables.
birthwt.glm <- glm(low ~ .^2, family = binomial, data = bwt, maxit = 20)
The formula low ~ .^2 expands to all main effects plus all pairwise interactions.
Because of convergence problems, we can increase the upper bound on the number
of iterations, which is done with the maxit argument.
Suggestion: Start from the best additive model derived in Question 2
(exclude the two variables ftv and age).
For this data set, start with the original model with no interaction terms.
The interaction terms age:ftv and smoke:ui are added sequentially.
Finally, the race term is removed.
The process stops at the model
low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui
with the lowest AIC value, 207.0734.
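A search that starts from the additive model and may add pairwise interactions or drop terms could be sketched as follows. It assumes stepAIC from MASS with an upper scope of ~ .^2, which is one possible setup and not necessarily the one used for the results above; the name birthwt.step2 is hypothetical.

```r
library(MASS)  # provides the birthwt data and stepAIC

# Rebuild bwt as in Question 1, without attaching the data set
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

# Start from the additive model; the scope allows all pairwise interactions
birthwt.add <- glm(low ~ ., family = binomial, data = bwt)
birthwt.step2 <- stepAIC(birthwt.add, scope = ~ .^2, trace = FALSE)

birthwt.step2$anova     # terms added (+) or removed (-) at each step
formula(birthwt.step2)  # the model at which the search stopped
```

Because each step is accepted only if it lowers the AIC, the selected model never has a higher AIC than the starting additive model.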
Question 5
Repeat the above procedure with the two chosen models, using
cross-validation to estimate the prediction error again.
How do we divide the data randomly into several groups (fold=5 for
example)?
Use the cv.glm procedure in the boot package.
Some students write their own version of cross-validation.
cv.glm
Cross-validation for Generalized Linear Models
This function calculates the estimated K-fold cross-validation prediction error
for generalized linear models.
cv.glm(data, glmfit, cost, K)
data: A matrix or data frame containing the data. The rows should be cases and the
columns correspond to variables, one of which is the response.
glmfit: An object of class "glm" containing the results of a generalized linear model
fitted to data.
cost: A function of two vector arguments specifying the cost function for the
cross-validation. The first argument to cost should correspond to the observed responses
and the second argument should correspond to the predicted or fitted responses from
the generalized linear model. The default is the average squared error function.
K: The number of groups into which the data should be split to estimate the
cross-validation prediction error.
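A minimal call could look like the sketch below. The two-variable model is a simplification chosen only to keep the example short, not one of the models selected above, and the misclassification cost function (threshold 0.5) is an assumption made here to obtain an error rate instead of the default squared error.

```r
library(boot)  # provides cv.glm
library(MASS)  # provides the birthwt data

set.seed(1)    # the fold assignment is random

# Simplified illustrative model (assumption: not one of the selected models)
fit <- glm(low ~ lwt + smoke, family = binomial, data = birthwt)

# Cost: proportion of cases whose predicted probability is on the wrong
# side of 0.5 (r = observed 0/1 response, pi = predicted probability)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)

cv.err <- cv.glm(birthwt, fit, cost, K = 5)
cv.err$delta   # raw and adjusted cross-validation estimates of the error
```

The first component of delta is the raw K-fold estimate and the second is a bias-adjusted version.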
Cross-validation Algorithm
# Shuffle the rows of bwt by swapping randomly chosen pairs of rows
bwt.shuffle <- bwt; shuf.iter <- 1000; datasize <- length(bwt$low)
for (i in 1:shuf.iter) {
  n1 <- round(runif(n = 1, min = 1, max = datasize))
  n2 <- round(runif(n = 1, min = 1, max = datasize))
  temp <- bwt.shuffle[n1, ]
  bwt.shuffle[n1, ] <- bwt.shuffle[n2, ]
  bwt.shuffle[n2, ] <- temp
}
# k is the number of folds and must be set beforehand (e.g. k <- 5)
fold <- k; testsize <- round(datasize / fold); rate.fold <- rep(0, fold)
for (i in 1:fold) {
  # Rows (test.start + 1):test.end form the test set of fold i
  test.start <- (i - 1) * testsize; test.end <- test.start + testsize
  if (test.end > datasize) test.end <- datasize
  bwt.test <- data.frame(bwt.shuffle[(test.start + 1):test.end, ])
  bwt.train <- data.frame(bwt.shuffle[-((test.start + 1):test.end), ])
  train.glm <- glm(low ~ ., family = binomial, data = bwt.train)
  pred <- predict(train.glm, subset(bwt.test, select = c(age, lwt, race, smoke, ptd, ht, ui, ftv)),
                  type = "response")
  # Proportion of test cases whose rounded probability matches the observed low
  rate.fold[i] <- sum(round(pred) == bwt.test$low) / length(pred)
}
cvrate <- mean(rate.fold); cat("prediction error rate of cross validation:", 1 - cvrate, "\n")
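The algorithm above shuffles by swapping random pairs of rows. A different and shorter way to divide the cases randomly into groups, offered here only as a sketch, is to permute a vector of fold labels with sample(). The names fold.id, test.rows, and train.rows are hypothetical, and n = 189 is the number of cases implied by the table(ftv) counts.

```r
set.seed(1)  # make the random split reproducible

n <- 189     # number of cases (sum of the table(ftv) counts)
K <- 5       # number of folds

# Repeat the labels 1..K until there are n of them, then permute them
fold.id <- sample(rep(1:K, length.out = n))
table(fold.id)  # fold sizes differ by at most one

# Row indices for the first fold: test on fold 1, train on the rest
test.rows <- which(fold.id == 1)
train.rows <- which(fold.id != 1)
```

Every case falls in exactly one fold, so looping over the fold labels 1 to K uses each case for testing exactly once.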