Anda di halaman 1dari 8

CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.

Lecture 5.3

The theory of probabilities is at bottom nothing but common sense reduced to calculus.
~Laplace, Thorie analytique des probabilits, 1820

1) Simple Logistic Regression Analysis


2) Multiple Logistic Regression
3) Area under the ROC Curve

Logistic Regression Analysis

As in linear regression, our goal is to estimate the regression coefficients in a model, given a sample
of (X, Y) pairs. In the case of logistic regression, the Xs can be numerical or categorical, but Ys are
generally coded as 0 (for those who do not have the event) or 1 (for those who have the event).

The simple logistic model is based on a linear relationship between the natural logarithm (ln) of the
odds of an event and a numerical independent variable. The form of this relationship is as follows:

= = ln = ! + ! + ,
1

where Y is binary and represent the event of interest (response),coded as 0/1 for failure/success,
p is the proportion of successes,
o is the odds of the event,
L is the ln(odds of event),
X is the independent variable,
! ! are the Y-intercept and the slope, respectively, and
is the random error.

1




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3

Computation of the estimates of ! ! in logistic regression are far more complicated. Recall
from Chapter 8 that is p is the probability of the event, and then the odds of the event are:

= =
1
We defined L = ln(odds of event Y), sometimes called the log odds or logit of Y. We can write L in
terms of p, Probability (Y=1), as follows:

= ln = ln .
1
We can then use the laws of exponents and logs and some algebra to express p (the proportion of
successes or risk of the event) in terms of L:

ln = ln , ! = 0 = , = ! ! ,
1 1
!
= .
1 + !
This is called the logistic function and its graph is as follows. Notice that p, the probability of the
event, and increases from 0 to 1.

The logistic regression model may be written in terms of p, the risk of event Y, assuming that L is a
linear function of X as follows:

!! !!! !!!
= .
1 + !! !!! !!!

As in linear regression, our goal is to estimate the regression coefficients ! ! , given ta sample of
(X, Y) pairs. Here as in the linear regression framework, X can be numerical or categorical. However,
the Y is not numerical as in linear regression; rather is coded 0(for those who do not have the event) or
1( for those who do have the event). Unfortunately, the techniques used in linear regression to estimate
the regression coefficients cannot be applied to the logistic regression case. The iterative maximum
likelihood estimation process is used to estimate coefficients is computationally complex and is beyond
the scope of this course. We restrict our attention to interpretation of the logistic model estimated by
statistical software.

2




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3
Example (DAgostino, Example 11.1): Logistic Regression Relating White Blood Counts to
Coronary Abnormalities. Patients with Kawasaki syndrome were treated with gamma globulin (GG
=1) or with the standard treatment (GG=0) and the primary outcome was the development of coronary
abnormalities (CA). A possible explanation for the effect of treatment with gamma globulin on the
development of CA is that it reduces the elevated white blood count (WBC) early in the course of the
disease and that WBC is related to CA. This example addresses the relationship between WBC and
CA. The data set contains a total of n=168 observations; each participant has a numerical value of
WBC and binary/dichotomous value (0 or 1) for the outcome, CA.

Response Profile:
CA Frequency
1 25
0 131
Missing 12
Total 168

The following output results are generated by SAS:

a) The estimated regression coefficients ! = , ! = .

b) Is WBC a significant predictor of the CA (response)? Why?

3




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3

c) Now that the coefficients have been estimated, we can estimate the risk of the event Y, given a
specific value of X, by substituting estimates in the following equation:

!! !!! !
= = _______________________________
1 + !! !!! !

d) Using the logistic regression equation above, estimate the risk of CA for a patient with WBC =10:

We can also use the estimate of \beta_1 to estimate the odds of the event for specified values of X and
the odds ratio comparing the odds of the event for people with different values of X. Recall that the
logistic model is based on the linear relationship between X and the log odds: = = ! + ! .
Thus, the estimated odds are:

! !! !!! !
= = = _________________________

e) For a patient with WBC = 10, the odds of CA are:

The odds ratio comparing the odds of an event for two people, one with = ! and one with
= ! , is:

!! !!! !!
= = !! !! !!!
= __________________________________
!! !!! !!

f) So the estimate of the odds ratio comparing a person with WBC = 15 to a person with WBC= 10 is:

Interpretation: The odds of CA are ______________ times higher for every 5-unit increase in WBC.

Notice that this depends only on the difference in the X values (xa - xb), not on their actual values. Thus
the estimated odds ratio comparing one person with WBC = 10 to another person with WBC = 5 is also
equal to ______________ as is the odds ratio comparing any two people whose WBC values are 5
units apart.

4




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3

The formula for a 95%confidence interval around the odds ratio is:

!! !!.!" !.!. !! !! !!! !! !!.!" !.!. !! !! !!!


, ,
where . . (! ) in the standard error of estimated intercept.

g) Using the formula above, estimate the 95% confidence interval for the odds ratio comparing the
odds of CA in children with Kawasaki syndrome whose WBC are 5 units apart:

Interpretation: We conclude that the odds ratio comparing the odds of CA in children with Kawasaki
syndrome whose WBC are 5 units higher than another child is _________ and that the 95% confidence
interval for the odds ratio is (____________, ___________).

Similarly, we can estimate the relative risk, RR, comparing a patient with = ! and one with
= ! , is:

!! !!! !!
! 1 + !! !!! !!
= = = __________________________________
! !! !!! !!
1 + !! !!! !!

h) Estimate the relative risk, RR, comparing a patient with WBC = 15 to a person with WBC= 10 is:

Interpretation: The estimated relative risk of CA for patients treated with gamma globulin as
compared to those on the standard treatment is RR = 0.24. This may be interpreted, as the risk of CA
among patients treated with gamma globulin is less than a quarter the risk for those on the standard
treatment. The odds ratio may be used to estimate the relative risk if the prevalence of the disease is
low (the overall risk is small). Here we see that it is much easier to obtain the estimate of the odds ratio
from the results of a logistic regression than it is to obtain an estimate of the relative risk.

5




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3

Multiple Logistic Regression Analysis

As in the case of linear regression, logistic regression techniques may be generalized to models with
more than one independent variable. The methods used to estimate coefficients, standard errors, and p
values are beyond the scope of our course.

In general, multiple logistic regression analysis, we consider applications involving a single binary
dependent variable, Y, and multiple independent variables, denoted ! , ! , , ! . The form of the
multiple linear regression equation is:

= = ln = ! + ! ! + ! ! + + ! ! + ,
1

where Y is binary and represent the event of interest (response),coded as 0/1 for failure/success,
p is the proportion of successes,
o is the odds of the event,
L is the ln(odds of event),
! , ! , , ! are the independent variables,
! is the intercept and ! , ! , , ! are the slope coefficient (i.e., the expected change in Y
relative to one unit change in ! ), and
is the random error.

NOTE: That unlike in linear regression in logistic regression dependent variable, Y, is a binary
variable. The independent variables, ! can be continuous variables or dichotomous (sometimes called
indicator) variables. The formulas used to estimate the intercept and slope coefficients are
computationally complex.

The general multiple logistic regression model in terms of p is:

!! !!! !! !!! !! !!!! !! !!


= .
1 + !! !!! !! !!! !! !!!! !! !!

The odds ratio comparing the odds of an event for two people, one with = (!! , !! , , !" ) and
one with = (!! , !! , , !" ), is:

= !! !!! !!!! !!! !!! !!!! !!! !!" !!!"

6




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3

Example (DAgostino, Example 11.3): Multiple Logistic Regression Relating While Blood Count
(WBC) and Treatment to Coronary Abnormalities. Suppose we want to use the same sample described
earlier in the section to assess the effect of treatment with gamma globulin on CA but we are concerned
that white blood count is a potential confounder. Although patients were randomized to one of the two
treatment groups, it turns out that the mean WBC in patients treated with gamma globulin was
significantly lower than the mean WBC in patients on the standard treatment. Thus, the investigators
were concerned that the patients treated with gamma globulin were less sick thatn those on the standard
treatment and that the observed impact of the gamma globulin treatment might not be real. The
following output is a multiple logistic regression in which the outcome is the binary variable CA and
the two independent variables are the binary treatment variable, GG, and the numerical variable, WBC.

a) What is the estimated logistic regression equation for CA given GG and WBC?

b) Are GG and WBC two significant predictors? Why?

c) Can you predict the probability and odds for a patients treated with gamma globulin (GG=1)
with WBC=10?

d) Can you estimate the relative risk between two patient with WBC=10 treated by the gamma
globulin and the standard treatment?

7




CAS MA 116 Statistics I I Summer 2, 2012 Lecture 5.3

Example (DAgostino, Example 11.4): Multiple Logistic Regression Relating While Blood Count
(WBC) and Hemoglobin to Coronary Abnormalities. Suppose we want now to predict the probability
of developing CA as a function of both white blood count and haemoglobin. The following output is
based on a multiple logistic regression in which the outcome is binary (dichotomous) variable, CA, and
two predictors are numerical (WBC and HEM).

a) What is the estimated logistic regression equation for CA given HEM and WBC?

b) Are HEM and WBC two significant predictors? Why?

c) Can you predict the probability and odds for a patients with HEM=10 with WBC=10?

d) Can you estimate the relative risk between two patient with WBC=10 , HEM=10 and WBC=15,
HEM=8?

Interpretation of odds ratio and coefficients:

The odds ratio describes the impact of a 1-unit increase in the independent variable on the risk of the
event, adjusting for the other independent variables. In this example, an increase of 1 unit in HEM is
associated with 35% decrease in the odds of CA (OR = exp(-0.42) = 0.65). An increase of 1 unit in
WBC is associated with 12% increase in the odds of CA (OR=exp(0.1135) = 1.12).
8

Anda mungkin juga menyukai