
Financial Time Series I/Methods

of Statistical Prediction
Suggested Answers to Project 2
Project 2: Logistic Regression and Model
Selection
1/19/2003

Question 1
Give a brief explanation on the meaning of those commands.
data(birthwt)
#Load the data set birthwt from the MASS package (not boot; only cv.glm
#later comes from boot). Alternatively, make the whole MASS package available
#in the workspace with the command: require(MASS).
attach(birthwt)
#The data set birthwt is attached to the search path so that the variables in the
#data set can be accessed directly by name. Otherwise, we would have to write
#birthwt$varname instead of varname.
race <- factor(race, labels = c("white", "black", "other"))
#The function factor is used to encode a vector as a factor. race is originally
#treated as a numerical variable; this command converts it into a categorical
#factor with three labelled levels.
table(ftv)
#table uses the cross-classified factors to build a contingency table of the counts at
#each combination of factor levels. The result of this command is as follows:
0 1 2 3 4 6
100 47 30 7 4 1

Question 1
ftv <- factor(ftv)
#The function factor is used to encode a vector as a factor.
levels(ftv)[-(1:2)] <- "2+"
#This command collapses the levels of ftv into three: 0, 1, and 2+.
#2. Convert ptl to two levels and name the new variable ptd.
ptd <- factor(ptl > 0)
#The function factor is used to encode a vector as a factor. Here, ptd is a new
#factor with only two levels (FALSE/TRUE).
#3. Create a new data frame bwt.
bwt <- data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0), ptd,
ht = (ht > 0), ui = (ui > 0), ftv)
# Create a new data frame collecting the response low and all predictors.
#4. Clean up data.
detach(birthwt)
#Remove the data set from the search path of available R objects. This command can be
#used to remove either a data frame that has been attached or a package that was
#loaded previously.
rm(race, ptd, ftv)
#remove and rm can both be used to remove objects. Here, the variables race, ptd,
#and ftv, which were created in this workspace, are removed and no longer exist.
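The data-preparation steps above can be collected into one runnable sketch. This is a compact equivalent, not the original answer's code: it uses with() instead of attach()/detach(), and pmin() to collapse the ftv counts instead of editing levels() directly.

```r
# Consolidated data preparation; the birthwt data set ships with MASS.
library(MASS)

bwt <- with(birthwt, data.frame(
  low   = factor(low),
  age, lwt,
  race  = factor(race, labels = c("white", "black", "other")),
  smoke = (smoke > 0),
  ptd   = factor(ptl > 0),
  ht    = (ht > 0),
  ui    = (ui > 0),
  # pmin(ftv, 2) collapses the counts 2, 3, 4, 6 into a single "2+" level
  ftv   = factor(pmin(ftv, 2), labels = c("0", "1", "2+"))))

str(bwt)
```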

Question 2
Give a brief explanation on the specification of the regression
model.
birthwt.glm <- glm(low ~ ., family=binomial, data = bwt)
# glm is used to fit generalized linear models (GLM) to the data.
In GLM, two things need to be specified.
The first is the error distribution of the dependent variable. Here, the
chosen distribution is the binomial distribution, specified by
family=binomial.
The second is the unknown parameter in the specification of the error
distribution. For the binomial distribution, it is the probability of success p,
which links the independent variables to the dependent variable.
There is a so-called link function to associate them. The default link function
for the binomial distribution is the logit, whose inverse is:
P(Y=1|x) = exp(z)/[1+exp(z)], where z = β0 + β1X1 + ... + βKXK

The fitted values of this model are the probabilities of low=1 given all the
other variables in the data frame bwt. Here, we only consider an additive
model, meaning that all variables enter into the model linearly. (There is no
interaction term in the model.)
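As a quick numeric check of the link function, the inverse-logit transform can be evaluated directly; base R's plogis computes the same quantity.

```r
# The inverse logit maps any linear predictor z to a probability in (0, 1):
#   p = exp(z) / (1 + exp(z))
z <- c(-2, 0, 2)
p <- exp(z) / (1 + exp(z))

# plogis() in base R is exactly this transform
print(all.equal(p, plogis(z)))
print(round(p, 3))
```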


Question 2
summary(birthwt.glm,correlation=F)
It gives the deviance residuals and coefficients.

The command summary is a generic function used to produce
result summaries of the results of various model fitting functions.
The correlations of coefficients are not shown because the
parameter correlation is set to be false.
The AIC value is 217.48 in this additive model containing all the
variables. In addition, the prediction error rate is 0.2698413.
We can compare the AIC values and the prediction error rates with
this one in the following analyses to show the effect of including or
excluding some particular variables.
From the p-value Pr(>|z|) of each variable, we find that the
variables ptdTRUE, htTRUE, lwt, and raceblack are
significant for the prediction if the significance level is set at 0.05.
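One way to pull those significant variables out of the fit programmatically is to read the Pr(>|z|) column of the coefficient table returned by summary(). This is a sketch, not part of the original solution; it rebuilds bwt compactly from MASS's birthwt.

```r
library(MASS)

# Rebuild bwt as in Question 1 (compact form)
bwt <- with(birthwt, data.frame(
  low = factor(low), age, lwt,
  race = factor(race, labels = c("white", "black", "other")),
  smoke = (smoke > 0), ptd = factor(ptl > 0),
  ht = (ht > 0), ui = (ui > 0),
  ftv = factor(pmin(ftv, 2), labels = c("0", "1", "2+"))))

birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# The coefficient matrix from summary(); its last column holds Pr(>|z|)
coefs <- summary(birthwt.glm)$coefficients
sig <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]
print(sig)
```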

Question 2: Model Selection


Consider a model with the above four important variables.
glm(low ~ lwt + race + ptd + ht, family=binomial, data = bwt)
The AIC value will become 217.40, which is slightly smaller than the one using
all of the variables.
In addition, the prediction error rate is 0.2698413.

If we only drop the two most insignificant variables ftv and
age, this leads to an alternative model:
glm(low ~ lwt + race + ptd + ht + smoke + ui, family=binomial,
data = bwt)
The AIC value will become 213.8516, which is smaller than in the two
models described above.
The prediction error rate is 0.2433862.
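The three candidate models can be fitted side by side and their AIC values compared in one step. A sketch under the same data preparation as Question 1 (exact values may differ slightly across R versions but should match the slides closely):

```r
library(MASS)

bwt <- with(birthwt, data.frame(
  low = factor(low), age, lwt,
  race = factor(race, labels = c("white", "black", "other")),
  smoke = (smoke > 0), ptd = factor(ptl > 0),
  ht = (ht > 0), ui = (ui > 0),
  ftv = factor(pmin(ftv, 2), labels = c("0", "1", "2+"))))

# The three models discussed above: all variables, the four significant
# ones, and the four plus smoke and ui.
m.full <- glm(low ~ ., family = binomial, data = bwt)
m.four <- glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt)
m.six  <- glm(low ~ lwt + race + ptd + ht + smoke + ui,
              family = binomial, data = bwt)

print(sapply(list(full = m.full, four = m.four, six = m.six), AIC))
```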

Question 2: Model Selection with AIC


Use AIC as the objective of model selection.
birthwt.step <- step(birthwt.glm, trace=F)
The command step selects a formula-based model by AIC. Its basic idea
is to remove the variables one by one in order to find better models with
smaller AIC values.
The parameter trace is turned off so that the above deletion process is not
shown.
We can build the model with backward elimination or forward selection.
birthwt.step <- step(birthwt.glm, trace=F, direction = "forward")
birthwt.step <- step(birthwt.glm, trace=F, direction = "backward")

birthwt.step$anova
This command is useful in showing the process of deleting variables.
For this data set, it removes the two variables ftv and age, and the
final AIC value is reduced to 213.8516. The selection procedure stops
because no further reduction of the AIC value can be achieved by removing
any other variable.

It is interesting that the model selected by the AIC criterion is
consistent with the one we obtained using a different criterion.

Question 3
Repeat the steps in Question 2 and consider all models that include
pairwise interactions.
In order to ensure that collinearity is not present in the model,
we usually start by checking the correlations among
all independent variables.
The difficult question is how to assess correlation involving categorical
variables. Refer to your notes on association.
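For two categorical predictors, a chi-squared test on their contingency table is one simple way to screen for association before adding an interaction term. An illustrative sketch, not part of the original solution:

```r
library(MASS)

# Cross-tabulate two binary predictors from birthwt and test whether
# they are associated; a small p-value flags possible redundancy between
# the main effects.
tab <- table(smoke = birthwt$smoke, ui = birthwt$ui)
print(tab)
print(chisq.test(tab)$p.value)
```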

Model building strategy:


Strategy 1: backward elimination
Add all pairwise interactions and then remove them one by one.

Implementation of strategy 1
Start from a model containing all independent variables and all their
pairwise interactions.
birthwt.glm <- glm(low ~ .^2, family=binomial, data=bwt, maxit=20)
Due to a convergence problem, we can increase the upper bound on the number
of iterations, which is done via the maxit argument.
Suggestion: start from the best additive linear model derived in Question 2
(exclude the two variables ftv and age).

Question 3: backward elimination


birthwt.glm.pwall <- glm(low ~ (. - ftv - age)^2, family=binomial, data=bwt,
maxit=20)
Consider an additive model without ftv and age.
This leads to the following model
low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + age:ht + age:ftv +
lwt:smoke + lwt:ht + lwt:ui + race:ht + smoke:ht + ptd:ht + ht:ui + ht:ftv
Model selection with AIC:
birthwt.step.pwall <- stepAIC(birthwt.glm.pwall, trace=F)
birthwt.step.pwall$anova
This leads to the following model with AIC=210.8205
low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + lwt:smoke + lwt:ht +
lwt:ui + ht:ui

Suggestion: can we just compare all possible pairs of
interaction terms?
The best one has the interaction terms age:ftv and ht:ui.
Its AIC is 209.0006.
Although the variables age and ftv are not important when we only
consider an additive linear model, they are included when the interaction terms
are also taken into consideration.
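The suggestion above can be sketched as a brute-force screen: fit the additive model plus each candidate interaction in turn and record the AIC. Shown here for single interactions; models with two interactions would loop over pairs of these terms. This is an illustrative sketch, not the original solution's code.

```r
library(MASS)

bwt <- with(birthwt, data.frame(
  low = factor(low), age, lwt,
  race = factor(race, labels = c("white", "black", "other")),
  smoke = (smoke > 0), ptd = factor(ptl > 0),
  ht = (ht > 0), ui = (ui > 0),
  ftv = factor(pmin(ftv, 2), labels = c("0", "1", "2+"))))

vars <- c("age", "lwt", "race", "smoke", "ptd", "ht", "ui", "ftv")
prs  <- combn(vars, 2)   # all 28 candidate two-way interactions

# Add each interaction to the additive model, one at a time, and record AIC
aics <- apply(prs, 2, function(p) {
  f <- as.formula(paste("low ~ . +", p[1], ":", p[2]))
  AIC(glm(f, family = binomial, data = bwt, maxit = 20))
})
names(aics) <- apply(prs, 2, paste, collapse = ":")

print(head(sort(aics), 3))   # the lowest-AIC single additions
```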

Question 3: Model Searching Strategy


birthwt.step.both <- stepAIC(birthwt.glm, scope = list(upper = ~ .^2,
lower = ~ 1), trace=F)
The direction of stepwise search can be one of both, backward, or
forward, with a default of both.
If the scope argument is missing, the default for direction is backward.
Therefore, we not only remove predictors from the model but also add
predictors to reduce the AIC value.

For this data set, start with original model with no interaction term.
The interaction terms age:ftv and smoke:ui are added sequentially.
Finally, the race term is removed.
The process stops at the model
low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui
with the lowest AIC value 207.0734.

Question 5
Repeat the above procedure with the two chosen models, using
cross-validation to give the prediction error again.
How do we divide the data randomly into several groups (fold=5 for
example)?
Use the cv.glm procedure in the boot package.
Some students write their own version of cross-validation.

cv.glm
Cross-validation for Generalized Linear Models
This function calculates the estimated K-fold cross-validation prediction error
for generalized linear models.
cv.glm(data, glmfit, cost, K)
data: A matrix or data frame containing the data. The rows should be cases and the
columns correspond to variables, one of which is the response.
glmfit: An object of class "glm" containing the results of a generalized linear model
fitted to data.
cost: A function of two vector arguments specifying the cost function for the
cross-validation. The first argument to cost should correspond to the observed
responses and the second argument should correspond to the predicted or fitted
responses from the generalized linear model. The default is the average squared
error function.
K: The number of groups into which the data should be split to estimate the
cross-validation prediction error.
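A usage sketch of cv.glm with a misclassification cost (the default cost is average squared error; the 0.5-threshold cost below gives an error rate comparable to the figures quoted earlier). The seed value and the cost function name are illustrative choices, not from the original solution.

```r
library(MASS)   # birthwt data
library(boot)   # cv.glm

bwt <- with(birthwt, data.frame(
  low = factor(low), age, lwt,
  race = factor(race, labels = c("white", "black", "other")),
  smoke = (smoke > 0), ptd = factor(ptl > 0),
  ht = (ht > 0), ui = (ui > 0),
  ftv = factor(pmin(ftv, 2), labels = c("0", "1", "2+"))))

birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# Misclassification cost: count a case as an error when the fitted
# probability falls on the wrong side of 0.5.
cost <- function(y, yhat) mean(abs(y - yhat) > 0.5)

set.seed(123)   # fold assignment is random; fix the seed for repeatability
cv <- cv.glm(bwt, birthwt.glm, cost, K = 5)
print(cv$delta[1])   # estimated prediction error rate
```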

Cross-validation Algorithm
bwt.shuffle <- bwt; shuf.iter <- 1000; datasize <- length(bwt$low)
# Shuffle the rows by repeated random pairwise swaps.
for(i in 1:shuf.iter){
n1 <- round(runif(n=1, min=1, max=datasize))
n2 <- round(runif(n=1, min=1, max=datasize))
temp <- bwt.shuffle[n1,]
bwt.shuffle[n1,] <- bwt.shuffle[n2,]
bwt.shuffle[n2,] <- temp
}
fold <- 5; testsize <- round(datasize/fold); rate.fold <- rep(0, fold)   # k = 5 folds, for example
for(i in 1:fold){
test.start <- (i-1)*testsize; test.end <- test.start + testsize
if (test.end > datasize) test.end <- datasize
bwt.test <- data.frame(bwt.shuffle[(test.start+1):test.end,])
bwt.train <- data.frame(bwt.shuffle[-((test.start+1):test.end),])
train.glm <- glm(low ~ ., family=binomial, data=bwt.train)
pred <- predict(train.glm, subset(bwt.test, select = c(age, lwt, race, smoke, ptd, ht, ui, ftv)),
type = "response")
rate.fold[i] <- sum(round(pred) == bwt.test$low)/length(pred)
}
cv.rate <- mean(rate.fold); cat("prediction error rate of cross validation:", 1 - cv.rate, "\n")
