STA 138
December 4, 2014
Final Project
Introduction: The data that I will be analyzing for this report deals with whether a patient is
diagnosed or not diagnosed with depression in a visit during one year of care. There are many
ways in which a patient can be diagnosed with depression, so there are many more variables
not taken into account in this model that may affect the results greatly. For the sake of this
project, I will try to predict whether a patient will be diagnosed with depression using stepwise
logistic regression. The variables for this data are as follows:
DAV: Diagnosis of depression in any visit during one year of care
PCS
MCS
BECK
PGEND
AGE
EDUCAT
The response variable is DAV. The explanatory variables are PCS, MCS, BECK, PGEND, which
indicates the patient's gender, AGE, which indicates the patient's age, and EDUCAT, which gives
the number of years of formal schooling.
Materials and Methods: For this project, I will test whether we can predict whether a
patient will be diagnosed with depression based on the variables above, and pick the best
model using stepwise logistic regression. 400 patients were randomly selected from primary
care facilities and the above 7 variables were recorded for each patient.
SAS Code and Results:
To read in the data and format:
The very first thing I did was to check that the model with main effects did not
suffer from multicollinearity. I did this with the following code:
Next, I want to create the logistic model. I will first show you the model with only the main effects
(no interactions) that were chosen by the stepwise logistic regression in SAS, though
this is not the model that I will use. The code that was used to obtain the best model
through forward stepwise regression was:
I chose to use a model similar to the one above, with two extra interaction terms. I chose these
interaction terms because I believe that, firstly, the interaction between
PGEND and EDUCAT can help with the prediction of depression because education's
effect may differ depending on the gender of the patient. Secondly, the interaction between
PGEND and BECK, I believe, may help with the prediction of depression because the effect of the
Beck depression score may differ depending on gender as well. So the SAS code for the model
described above is:
and the partial output from this to obtain the model is:
So the model is:
$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 y$
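Filling in the estimates quoted in the explanation that follows gives a rough picture of the fitted main-effects equation. This is only a sketch: it assumes the model is for the probability $\hat{\pi}$ of a depression diagnosis and that PGEND is coded 1 for male and 0 for female, which is how the interpretations read but is not confirmed by the SAS output shown here.

\[
\log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.7921 - 0.0487\,\mathrm{MCS} + 0.2151\,\mathrm{EDUCAT} + 0.0813\,\mathrm{BECK} + 1.3659\,\mathrm{PGEND}
\]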
An explanation of the variables used for my final model is as follows: The intercept $\beta_0$ is
-2.7921, which is the log-odds of a depression diagnosis when all of the explanatory variables
are equal to zero. The slope estimate for MCS is -0.0487, which means that when MCS increases
by one unit, the odds of the patient being diagnosed with depression are multiplied by
e^(-0.0487) = 0.9524, holding the other variables fixed. For EDUCAT, each extra year of formal
schooling multiplies the odds of the patient being diagnosed with depression by
e^(.2151) = 1.24. For BECK, when a patient's Beck depression score increases by one unit, the
odds of that patient being diagnosed with depression are multiplied by e^(.0813) = 1.085. For
PGEND, we can say that the odds of a male patient being diagnosed with depression are
e^(1.3659) = 3.92 times the odds of a female patient being diagnosed with depression.
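The odds ratios quoted above can be re-derived by exponentiating each slope estimate. The following is a minimal Python sketch of that arithmetic, not the SAS code used for the project; the dictionary and function names are my own, and the small difference in the last digit for MCS presumably comes from SAS using an unrounded estimate.

```python
import math

# Slope estimates as reported in the text (log-odds scale).
coefficients = {
    "MCS": -0.0487,
    "EDUCAT": 0.2151,
    "BECK": 0.0813,
    "PGEND": 1.3659,
}


def odds_ratio(beta):
    """Multiplicative change in the odds for a one-unit increase in the variable."""
    return math.exp(beta)


odds_ratios = {name: odds_ratio(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    # Express the odds ratio as a percent change in the odds as well.
    percent_change = (ratio - 1.0) * 100.0
    print(f"{name}: odds ratio = {ratio:.4f} ({percent_change:+.1f}% change in odds)")
```

An odds ratio below 1 (MCS) corresponds to decreasing odds of diagnosis, and a ratio above 1 (EDUCAT, BECK, PGEND) to increasing odds.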
diagnosed with depression will increase by between 5.9% and 34.9%. For BECK, we can
be 95% confident that when the Beck depression score increases by one unit, the odds of
that person being diagnosed with depression will increase by between 1.0% and 14.3%.
For PGEND, since the 95% confidence interval for the odds ratio contains 1, the effect of gender is not statistically significant.
-Residual Analysis:
It is clear from the residual chart that there are many outliers whose Pearson and deviance
residuals exceed 2.0 in absolute value, so they are influencing the coefficients and the
goodness of fit. After looking at the data output (which is not displayed because it is too
big), I can see that the following observations have Pearson and deviance residuals above 2.0
in absolute value: observations 22, 115, 173, 194, 255, 260, 286, 316, 323, 325, 333, 353, and
368. These observation numbers are the ones corresponding to the output from SAS, with
observation 1 being nothing (the header). So if I were to adjust the observations to match
exactly the observations from the data, they would be the observation numbers listed above
minus 1: 21, 114, 172, 193, 254, 259, 287, 315, 322, 324, 332, 352, and 367. Out of these
adjusted observations, the 5 observations with the highest Pearson and deviance residuals are
(with Pearson residual, deviance residual): 193 (5.47, 2.62), 259 (3.54, 2.28), 315 (4.20, 2.42),
324 (4.31, 2.44), and 352 (5.75, 2.65).
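To see how the two kinds of residual relate, here is a small Python sketch. It assumes each observation is a single binary outcome and that the five patients above were diagnosed (y = 1) despite a low fitted probability p, which is what large positive residuals would suggest. Under that assumption the Pearson residual is sqrt((1 - p) / p) and the deviance residual is sqrt(-2 log p), so one can be recovered from the other. The function names are my own and this is not the SAS computation itself.

```python
import math

# (Pearson residual, deviance residual) as reported in the text,
# keyed by adjusted observation number.
reported = {
    193: (5.47, 2.62),
    259: (3.54, 2.28),
    315: (4.20, 2.42),
    324: (4.31, 2.44),
    352: (5.75, 2.65),
}


def fitted_probability_from_pearson(pearson):
    """For an observed y = 1, Pearson r = sqrt((1 - p) / p), so p = 1 / (1 + r^2)."""
    return 1.0 / (1.0 + pearson ** 2)


def deviance_residual_for_event(p):
    """Deviance residual for an observed y = 1 with fitted probability p."""
    return math.sqrt(-2.0 * math.log(p))


implied = {}
for obs, (pearson, deviance) in reported.items():
    p = fitted_probability_from_pearson(pearson)
    implied[obs] = (p, deviance_residual_for_event(p))
    print(f"obs {obs}: implied p = {p:.4f}, "
          f"implied deviance = {implied[obs][1]:.2f}, reported = {deviance:.2f}")
```

The implied deviance residuals agree with the reported ones to about two decimals, and the implied fitted probabilities are all small, meaning the model gave these patients a low chance of diagnosis.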
-Influential Observations
Looking at the hat matrix diagonal column (what we were told to do in class) from the SAS
output, it is clear, after looking carefully, that there is really only one influential
observation, and it is not among the observations with large residuals listed above. It is
observation 378, with a hat matrix diagonal equal to .0892, which is much higher than any of
the others (the next highest is .02). A hat matrix diagonal this high suggests that this
observation is affecting the parameter estimates.
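One common rule of thumb, which is not necessarily the criterion used in class, flags a hat matrix diagonal larger than 2p/n, where p is the number of estimated parameters and n is the number of observations. The Python sketch below applies it, assuming p = 5 (the intercept plus MCS, EDUCAT, BECK, and PGEND) and n = 400; the variable names are my own.

```python
n_observations = 400  # patients, from Materials and Methods
n_parameters = 5      # assumption: intercept + MCS, EDUCAT, BECK, PGEND

# Rule-of-thumb cutoff: twice the average hat matrix diagonal (p / n).
leverage_cutoff = 2 * n_parameters / n_observations

# The two hat matrix diagonal values mentioned in the text.
hat_diagonals = {"observation 378": 0.0892, "next highest": 0.02}

flagged = [label for label, h in hat_diagonals.items() if h > leverage_cutoff]

print(f"cutoff = {leverage_cutoff:.3f}")
print("flagged:", flagged)
```

Under these assumptions only observation 378 exceeds the cutoff, which agrees with what I saw by eye.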
-Goodness of Fit
The percent concordant is 76.5 and the percent discordant is 23.1. This is relatively a good
thing, with Somers' D, Gamma, and c being relatively high (.535, .537, and .767 respectively).
Somers' D is the difference between the proportion of concordant pairs and the proportion of
discordant pairs, so a value of .535 means the model ranks a diagnosed patient above an
undiagnosed patient far more often than the reverse. This isn't an excellent number, but it
still implies that there is some association, which means our model is doing an okay job at
predicting. We could do a similar analysis for Gamma and say that since it is positive and
relatively large (.537), there is some association.
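These association measures can all be recovered from the percent concordant and percent discordant. The Python sketch below uses the standard definitions (tied pairs are whatever remains of the 100%); the variable names are my own, and the results differ from the SAS values in the third decimal only because the percentages above are rounded.

```python
percent_concordant = 76.5
percent_discordant = 23.1
# Pairs that are neither concordant nor discordant are tied.
percent_tied = 100.0 - percent_concordant - percent_discordant

# Somers' D: (concordant - discordant) over all pairs.
somers_d = (percent_concordant - percent_discordant) / 100.0

# Gamma: same numerator, but tied pairs are left out of the denominator.
gamma = (percent_concordant - percent_discordant) / (
    percent_concordant + percent_discordant
)

# c statistic: concordant pairs plus half of the tied pairs.
c_statistic = (percent_concordant + 0.5 * percent_tied) / 100.0

print(f"Somers' D = {somers_d:.3f}")
print(f"Gamma     = {gamma:.3f}")
print(f"c         = {c_statistic:.3f}")
```

Because so few pairs are tied here, Somers' D and Gamma come out nearly identical.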
With the lowest AIC (305.201), I chose to work with the best model chosen by SAS with stepwise
regression over the model with only the intercept (AIC: 353.736), and over the best model with
the interaction terms (AIC: 308.519). With the Hosmer and Lemeshow goodness-of-fit test
in SAS, we can see that the $\chi^2$ statistic is $\chi^2$ = 7.4172 with 8 degrees of freedom and
p-value = 0.4924. Since we have such a large p-value in this case, we fail to reject
H0: the model fits the data well, and conclude that the model is an adequate fit for the data.
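Both steps in this paragraph can be checked by hand. The Python sketch below picks the model with the smallest AIC and recomputes the Hosmer and Lemeshow p-value from the reported statistic. To stay within the standard library it uses the closed form of the chi-square upper tail that holds for even degrees of freedom; the model labels and function name are my own.

```python
import math

# AIC values as reported in the text.
aic = {
    "intercept only": 353.736,
    "stepwise main effects": 305.201,
    "with interaction terms": 308.519,
}
best_model = min(aic, key=aic.get)


def chi_square_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum (x/2)^i / i! for i = 0 .. df/2 - 1.
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


hl_p_value = chi_square_sf_even_df(7.4172, 8)

print("lowest AIC:", best_model)
print(f"Hosmer-Lemeshow p-value = {hl_p_value:.4f}")
```

The recomputed p-value matches the 0.4924 reported by SAS to four decimals.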
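Let's take a look at $\beta_0$. To test, we have H0: $\beta_0 = 0$, against Ha: $\beta_0 \neq 0$. Since
our Wald Chi-Square statistic is 3.897, with p-value = 0.0484 (p-value < .05), we can
conclude that $\beta_0$ has statistical significance, Ha: $\beta_0 \neq 0$.
Let's take a look at $\beta_1$. To test, we have H0: $\beta_1 = 0$, against Ha: $\beta_1 \neq 0$. Since
our Wald Chi-Square statistic is 9.773, with p-value = 0.0018 (p-value < .05), we can
conclude that $\beta_1$ has statistical significance, Ha: $\beta_1 \neq 0$.
Let's take a look at $\beta_2$. To test, we have H0: $\beta_2 = 0$, against Ha: $\beta_2 \neq 0$. Since
our Wald Chi-Square statistic is 5.2214, with p-value = 0.0223 (p-value < .05), we can
conclude that $\beta_2$ has statistical significance, Ha: $\beta_2 \neq 0$.
Let's take a look at $\beta_3$. To test, we have H0: $\beta_3 = 0$, against Ha: $\beta_3 \neq 0$. Since
our Wald Chi-Square statistic is 3.8280, with p-value = 0.0504 (p-value > .05), we fail to
reject H0: $\beta_3 = 0$ and conclude that $\beta_3$ has no statistical significance. Just because
it fails the test, though, does not mean it should not be included in the model. It is right on
the edge of being statistically significant and it can play a part in predicting one's
depression diagnosis, so I believe it should stay in the model.
Lastly, let's take a look at $\beta_4$. To test, we have H0: $\beta_4 = 0$, against Ha: $\beta_4 \neq 0$.
Since our Wald Chi-Square statistic is 8.36009, with p-value = 0.0038 (p-value < .05), we can
conclude that $\beta_4$ has statistical significance, Ha: $\beta_4 \neq 0$.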
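Each Wald statistic above is compared to a chi-square distribution with 1 degree of freedom, so the p-values can be recomputed from the statistics alone. This is a minimal Python sketch using the identity P(X > x) = erfc(sqrt(x / 2)) for 1 degree of freedom; the dictionary keys and function name are my own labels, not SAS output.

```python
import math

# Wald chi-square statistics as reported in the text.
wald_statistics = {
    "beta_0": 3.897,
    "beta_1": 9.773,
    "beta_2": 5.2214,
    "beta_3": 3.8280,
    "beta_4": 8.36009,
}


def chi_square_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


p_values = {name: chi_square_sf_1df(stat) for name, stat in wald_statistics.items()}

# Parameters whose p-value falls below the .05 level used in the report.
significant = [name for name, p in p_values.items() if p < 0.05]

for name, p in p_values.items():
    print(f"{name}: Wald = {wald_statistics[name]:.4f}, p-value = {p:.4f}")
print("significant at .05:", significant)
```

Only the fourth test lands above .05, and only barely, which is the borderline case discussed above.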
-Goodness-of-link function
This was done with the code:
Output:
Beck depression score increases by one unit, the odds of that person being diagnosed
with depression will increase by between 1.0% and 14.3%. And that for PGEND, since the
95% confidence interval contains 1, it is not statistically significant.
I also did residual analysis on the data, and searched for influential observations. I found
there was an influential observation that was not among the observations with large residuals,
which was observation 378. Even though there were several large residuals and an influential
observation, the model was still found to be a good fit for the data, which was determined from
the Hosmer and Lemeshow goodness-of-fit test.
As a result of the statements above, I was able to conclude that the variables BECK,
EDUCAT, MCS, and PGEND are all associated with the depression diagnosis of a patient.
Based on the result, we cannot reject the null hypothesis that the model fits the data,
yet I would be more comfortable with a model that provides more support of fit. Therefore,
I recommend researching additional covariates in order to make more reliable predictions,
such as how many people the patient talks to on a daily basis, whether the patient has a hobby,
and so on.