
Logistic regression
Many statistical tests require the dependent (response) variable to be continuous, so a different set of tests is needed when the dependent variable is categorical. One of the most commonly used tests for categorical variables is the Chi-squared test, which looks at whether or not there is a relationship between two categorical variables, but it does not make an allowance for the potential influence of other explanatory variables on that relationship. For continuous outcome variables, multiple regression can be used for:

a) controlling for other explanatory variables when assessing relationships between a dependent variable and several independent variables
b) predicting outcomes of a dependent variable using a linear combination of explanatory (independent) variables

The maths:
For multiple regression, a model of the following form can be used to predict the value of a response variable y using the values of a number of explanatory variables:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q

where \beta_0 is the constant (intercept) and \beta_1, \dots, \beta_q are the coefficients for the q explanatory variables x_1, \dots, x_q.


The regression process finds the coefficients which minimise the squared differences between the observed and expected values of y (the residuals). As the outcome of logistic regression is binary, y needs to be transformed so that the regression process can be used. The logit transformation gives the following:

\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q

where p is the probability of the event occurring, e.g. a person dies following a heart attack, and p/(1-p) is the odds of the event.

If probabilities of the event of interest happening for individuals are needed, the logistic regression equation can be written as:

p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}, \qquad 0 < p < 1

Logistic regression does the same but the outcome variable is binary and leads to a model
which can predict the probability of an event happening for an individual.
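As a concrete illustration of the transformation, the short Python sketch below turns a linear predictor into a probability using the equation above; the coefficient and variable values are made up purely for illustration.

```python
import numpy as np

def inverse_logit(linear_predictor):
    """Convert beta0 + beta1*x1 + ... + betaq*xq into a probability between 0 and 1."""
    return np.exp(linear_predictor) / (1 + np.exp(linear_predictor))

# Hypothetical coefficients (beta0, beta1, beta2) and one individual's values
# (1 for the intercept term, then x1 and x2)
beta = np.array([-0.5, 1.2, -0.8])
x = np.array([1.0, 0.0, 1.0])

p = inverse_logit(beta @ x)
print(p)   # predicted probability of the event for this individual (about 0.21)
```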
Titanic example: On April 14th 1912, only 705 passengers and crew out of the 2228 on board the Titanic survived when the ship sank. Information on 1309 of those on board will be used to demonstrate logistic regression. The data can be downloaded from biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls

The key variables of interest are:

Dependent variable: whether a passenger survived or not (survival is indicated by survived = 1).
Possible explanatory variables: age, gender (recoded so that sex = 1 for females and 0 for males), class (pclass = 1, 2 or 3), number of accompanying parents/children (parch) and number of accompanying siblings/spouses (sibsp).

Most of the variables can be investigated using crosstabulations with the dependent variable
survived. Another reason for the cross tabulation is to identify categories with small
frequencies as this can cause problems with the logistic regression procedure. The number of
accompanying parents/ children (parch) and number of accompanying siblings/ spouses
(sibsp) were used to create a new binary variable indicating whether or not the person was
travelling alone or with family (1 = travelling with family, 0 = travelling alone).
% surviving by group:

Male                     19.1%
Female                   72.7%
1st class                61.9%
2nd class                43%
3rd class                25.5%
Travelling alone         30.3%
Travelling with family   50.3%
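For readers working outside SPSS, the percentages above can be reproduced with crosstabulations; a minimal pandas sketch, assuming the titanic3.xls file has been downloaded locally (and an .xls reader such as xlrd is installed), might look like this:

```python
import pandas as pd

titanic = pd.read_excel("titanic3.xls")

# 1 = travelling with family, 0 = travelling alone (derived from parch and sibsp)
titanic["family"] = ((titanic["parch"] + titanic["sibsp"]) > 0).astype(int)

# Percentage surviving within each category of each explanatory variable
for var in ["sex", "pclass", "family"]:
    print(pd.crosstab(titanic[var], titanic["survived"], normalize="index") * 100)
```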

When tested separately, Chi-squared tests concluded that there was evidence of a relationship between survival and gender, class and whether an individual was travelling alone. Looking at the percentages surviving, it's clear that women, those in first class and those not travelling alone were much more likely to survive. Logistic regression will be carried out using these three variables first of all. Stage 1 of the following analysis will relate to using logistic regression to control for other variables when assessing relationships and stage 2 will look at producing a good model to predict from.
Use Analyze → Regression → Binary Logistic to get the following screen:
Where there are more than two categories, the last category is automatically the reference category. This means that all the other categories will be compared to the reference in the output, e.g. 1st and 2nd class will be compared to 3rd class.

Treatment of categorical explanatory variables

When interpreting SPSS output for logistic regression, it is important that binary variables are
coded as 0 and 1.
Also, categorical variables with three or more categories need to be recoded as dummy variables with 0/1 outcomes, e.g. class needs to appear as two variables: 1st/not 1st with 1 = yes, and 2nd/not 2nd with 1 = yes. Luckily, SPSS does this for you! When adding a categorical variable to the list of covariates, click on the Categorical button and move all categorical variables to the right hand box. The following table in the output shows the coding of these variables.

Categorical Variables Codings

                                    Frequency   Parameter coding
                                                 (1)       (2)
Class               1st class          323      1.000      .000
                    2nd class          277       .000     1.000
                    3rd class          709       .000      .000
Travelling alone    Alone              790      1.000
                    With family        519       .000
Gender              Male               843      1.000
                    Female             466       .000
For class, 3rd class is the reference class, so if 1st class = 0 and 2nd class = 0, the person must have been in 3rd class. Females and those not travelling alone are the references for the other groups.
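Although this handout works through the SPSS dialogs, the same model can be fitted in other software; a minimal sketch using Python's statsmodels (an assumption on my part, not part of the original material) is shown below. The C() terms request the dummy coding that SPSS produces automatically, although the coding chosen for sex and the family indicator may differ from SPSS's internal coding, so some coefficient signs can be flipped relative to the output discussed next.

```python
import pandas as pd
import statsmodels.formula.api as smf

titanic = pd.read_excel("titanic3.xls")
titanic["family"] = ((titanic["parch"] + titanic["sibsp"]) > 0).astype(int)

# pclass is treated as categorical with 3rd class as the reference category
model = smf.logit(
    "survived ~ C(pclass, Treatment(reference=3)) + C(sex) + C(family)",
    data=titanic,
).fit()

print(model.summary())   # coefficients (B), Wald tests and p-values
```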

Interpretation of the output


The output is split into two sections, block 0 and block 1. Block 0 assesses the usefulness of having a null model, which is a model with no explanatory variables. The Variables in the Equation table only includes a constant, so each person has the same chance of survival.

The null model is:

\ln\left(\frac{p}{1-p}\right) = \beta_0 = -0.481, \qquad p = \frac{\exp(-0.481)}{1 + \exp(-0.481)} = 0.382

where p is the probability of survival.

SPSS calculates the probability of survival for each individual using the block 0 model. If the probability of survival is 0.5 or more it will predict survival (as survived = 1), and death if the probability is less than 0.5. As more people died than survived, the probability of survival is 0.382 and therefore everyone is predicted as dying (coded as 0). As 61.8% of people died and everyone is predicted as dying, classification from the null model is 61.8% accurate. The addition of explanatory variables should increase the percentage of correct classification significantly if the model is good.
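The classification rule behind this can be checked in a couple of lines, using the null-model probability quoted above:

```python
# Under the null model every passenger gets the same predicted probability
p_survival = 0.382
predicted_survived = int(p_survival >= 0.5)   # 0: everyone is predicted not to survive
accuracy = 1 - p_survival                     # 0.618, i.e. 61.8% correctly classified
print(predicted_survived, accuracy)
```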

Block 0: Beginning Block

Block 1: Method = Enter


Block 1 shows the results after the addition of the explanatory variables selected.

The Omnibus Tests of Model Coefficients table gives the result of the Likelihood Ratio (LR) test, which indicates whether the inclusion of this block of variables contributes significantly to model fit. A p-value (Sig) of less than 0.05 for the block means that the block 1 model is a significant improvement on the block 0 model.

In standard regression, the coefficient of determination (R²) value gives an indication of how much variation in y is explained by the model. This cannot be calculated for logistic regression, but the Model Summary table gives the values for two pseudo R² statistics which try to measure something similar. From the table above, we can conclude that between 31% and 42.1% of the variation in survival can be explained by the model in block 1. The correct classification rate has increased by 16.2% to 78%.
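For reference, the two pseudo R² statistics SPSS reports in the Model Summary table are the Cox & Snell and Nagelkerke measures (assumed here to be the 31% and 42.1% figures respectively). They are calculated from the likelihoods of the null model (L_0) and the fitted model (L_1) with sample size n:

R^2_{CS} = 1 - \left(\frac{L_0}{L_1}\right)^{2/n}, \qquad R^2_{N} = \frac{R^2_{CS}}{1 - L_0^{\,2/n}}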

Finally, the Variables in the Equation table summarises the importance of the explanatory variables individually whilst controlling for the other explanatory variables.
The Wald test is similar to the LR test but here it is used to test the hypothesis that each coefficient β = 0. In the Sig column, the p-values are all below 0.05 apart from the test for the variable Alone (p = 0.286). This means that although the Chi-squared test for survival vs Alone was significant, once the other variables were controlled for, there is not a strong enough relationship between that variable and survival. Class is tested as a whole (pclass) and then 1st and 2nd class are compared to the reference category, 3rd class. When interpreting the differences, look at the Exp(B) column, which represents the odds ratio for the individual variable. For example, those in 1st class were 5.493 times more likely to survive than those in 3rd class. With gender, the odds ratio compares the likelihood of a male surviving in comparison to females. The odds are a lot lower for men (0.084 times that of women). For ease of interpretation, calculate the odds of a female surviving over a male using 1/0.084 = 11.9. Females were 11.9 times more likely to survive.
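Because the Exp(B) column is simply the exponential of the B column, these odds ratios can be checked directly from the coefficients quoted in the model below; a quick check in Python:

```python
import numpy as np

print(np.exp(1.703))       # ≈ 5.49: odds ratio for 1st class relative to 3rd class
print(np.exp(-2.474))      # ≈ 0.084: odds ratio for males relative to females
print(1 / np.exp(-2.474))  # ≈ 11.9: odds of a female surviving relative to a male
```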

The coefficients for the model are contained in the column headed B. A negative value means that the odds of survival decrease, e.g. for males and those travelling alone.
The full model being tested is:

\ln\left(\frac{p}{1-p}\right) = 0.460 + 1.703\,x_{\text{1st class}} + 0.832\,x_{\text{2nd class}} - 2.474\,x_{\text{male}} - 0.156\,x_{\text{alone}}

where x_{1st class} = 1 for 1st class, x_{2nd class} = 1 for 2nd class, x_{male} = 1 for men and x_{alone} = 1 for a person travelling alone.
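To illustrate using the fitted model for prediction, the sketch below plugs two passenger profiles into the equation above; the profiles are chosen purely for illustration.

```python
import numpy as np

def survival_probability(first_class, second_class, male, alone):
    """Predicted probability of survival from the fitted model above (each argument is 0 or 1)."""
    linear_predictor = (0.460 + 1.703 * first_class + 0.832 * second_class
                        - 2.474 * male - 0.156 * alone)
    return np.exp(linear_predictor) / (1 + np.exp(linear_predictor))

print(survival_probability(1, 0, 0, 0))   # 1st class female travelling with family: ≈ 0.90
print(survival_probability(0, 0, 1, 1))   # 3rd class male travelling alone: ≈ 0.10
```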
