regression
Many statistical tests require the dependent (response) variable to be continuous, so a
different set of tests is needed when the dependent variable is categorical. One of the
most commonly used tests for categorical variables is the Chi-squared test, which looks at
whether or not there is a relationship between two categorical variables, but this doesn't
make an allowance for the potential influence of other explanatory variables on that
relationship. For continuous outcome variables, multiple regression can be used both to
control for other explanatory variables when assessing a relationship and to predict the
value of the outcome.
The maths:
For multiple regression, a model of the following form can be used to predict the value of
a response variable y using the values of a number of explanatory variables:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$
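In logistic regression the same linear combination is used to model the log of the odds of
the event of interest, where $p$ is the probability of the event and $\frac{p}{1-p}$ is the odds:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q$$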
If probabilities of the event of interest happening for individuals are needed, the logistic
regression equation can be written as:
$$p = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_q x_q)}, \qquad 0 < p < 1$$
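The conversion between the linear predictor, the odds and the probability can be sketched in a few lines of Python. This is only an illustration: the function names and the coefficients in the example are made up and are not taken from the Titanic model.

```python
import math

def linear_predictor(intercept, coefficients, values):
    """Return b0 + b1*x1 + ... + bq*xq."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))

def probability_from_log_odds(log_odds):
    """Logistic equation: p = exp(z) / (1 + exp(z))."""
    return math.exp(log_odds) / (1 + math.exp(log_odds))

def log_odds_from_probability(p):
    """Log of the odds, ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

# Illustrative (made-up) coefficients: b0 = 0.5, b1 = 1.0, b2 = -2.0
z = linear_predictor(0.5, [1.0, -2.0], [1, 0])  # 0.5 + 1.0*1 - 2.0*0 = 1.5
p = probability_from_log_odds(z)
print(round(p, 3))
```

Whatever value the linear predictor takes, the resulting probability always lies strictly between 0 and 1, and applying `log_odds_from_probability` to it returns the linear predictor again.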
Logistic regression does the same job as multiple regression, but the outcome variable is
binary, and it leads to a model which can predict the probability of an event happening for
an individual.
Titanic example: On April 14th 1912, only 705 passengers and crew out of the 2228 on
board the Titanic survived when the ship sank. Information on 1309 of those on board will
be used to demonstrate logistic regression. The data can be downloaded from
biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.xls
Most of the variables can be investigated using crosstabulations with the dependent variable
'survived'. Another reason for the crosstabulation is to identify categories with small
frequencies, as these can cause problems with the logistic regression procedure. The number of
accompanying parents/children (parch) and number of accompanying siblings/spouses
(sibsp) were used to create a new binary variable indicating whether the person was
travelling alone or with family (1 = travelling with family, 0 = travelling alone).
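The derived variable can be sketched as follows. This is an illustration, not the procedure used in SPSS: it assumes that "with family" means at least one accompanying parent, child, sibling or spouse, and the function name and the three example passengers are hypothetical.

```python
def travelling_with_family(parch, sibsp):
    """Return 1 if the person has any relatives on board, 0 if travelling alone."""
    return 1 if (parch + sibsp) > 0 else 0

# Hypothetical passengers given as (parch, sibsp) pairs
passengers = [(0, 0), (2, 1), (0, 1)]
family_flags = [travelling_with_family(parch, sibsp) for parch, sibsp in passengers]
print(family_flags)  # [0, 1, 1]
```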
% surviving
Gender:      Male 19.1%         Female 72.7%
Class:       1st class 61.9%    2nd class 43%         3rd class 25.5%
Travelling:  alone 30.3%        with family 50.3%
When tested separately, Chi-squared tests concluded that there was evidence of a
relationship between survival and each of gender, class and whether an individual was
travelling alone. Looking at the percentages surviving, it's clear that women, those in first
class and those not travelling alone were much more likely to survive. Logistic regression will
be carried out using these three variables first of all. Stage 1 of the following analysis will
relate to using logistic regression to control for other variables when assessing relationships,
and stage 2 will look at producing a good model to predict from.
Use Analyze > Regression > Binary Logistic to get the following screen:
Where there are more than two categories, the last category is automatically the reference
category. This means that all the other categories will be compared to the reference in the
output e.g. 1st and 2nd class will be compared to 3rd class.
When interpreting SPSS output for logistic regression, it is important that binary variables are
coded as 0 and 1.
Also, categorical variables with three or more categories need to be recoded as dummy
variables with 0/1 outcomes e.g. class needs to appear as two variables: 1st / not 1st with
1 = yes and 2nd / not 2nd with 1 = yes. Luckily SPSS does this for you! When adding a
categorical variable to the list of covariates, click on the Categorical button and move all
categorical variables to the right hand box. The following table in the output shows the coding
of these variables.

Categorical Variables Codings
                                 Frequency   Parameter coding
                                             (1)       (2)
class              1st class     323         1.000     .000
                   2nd class     277         .000      1.000
                   3rd class     709         .000      .000
Travelling alone   Alone         790         1.000
                   with family   519         .000
Gender             male          843         1.000
                   female        466         .000
For class, 3rd class is the reference category.
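The dummy coding that SPSS applies can be illustrated with a short Python sketch. The `dummy_code` helper is hypothetical and simply reproduces the 0/1 pattern in the Categorical Variables Codings table, with 3rd class as the reference category.

```python
def dummy_code(value, categories, reference):
    """Return a 0/1 indicator for every category except the reference."""
    return {c: (1 if value == c else 0) for c in categories if c != reference}

classes = ["1st", "2nd", "3rd"]
for passenger_class in classes:
    # The reference category (3rd class) is coded 0 on every indicator
    print(passenger_class, dummy_code(passenger_class, classes, reference="3rd"))
```

A variable with three categories therefore needs only two indicators, because a passenger who is coded 0 on both must be in the reference category.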
The null model is:
$$\ln\left(\frac{p}{1-p}\right) = \beta_0 = -0.481, \qquad p = \text{probability of survival} = \frac{\exp(-0.481)}{1 + \exp(-0.481)} = 0.382$$
SPSS calculates the probability of survival for each individual using the block model. If the
probability of survival is 0.5 or more it will predict survival (as survival = 1), and death if the
probability is less than 0.5. As more people died than survived, the probability of survival
under the null model is 0.382 for everyone, and therefore everyone is predicted as dying
(coded as 0). As 61.8% of people died, 61.8% are correctly classified, so classification from
the null model is 61.8% accurate. The addition of explanatory variables should increase the
percentage of correct classification significantly if the model is good.
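The null-model arithmetic and the 0.5 cut-off rule can be checked with a short Python sketch. The constant -0.481 comes from the text; the `classify` helper is hypothetical, and the accuracy line relies on the fact that an intercept-only model gives every individual the same probability, equal to the observed proportion of survivors.

```python
import math

B0 = -0.481  # constant of the null (block 0) model, taken from the text

# Probability of survival implied by the null model
p_survival = math.exp(B0) / (1 + math.exp(B0))

def classify(probability, cut_off=0.5):
    """Predict 1 (survived) if the probability reaches the cut-off, else 0 (died)."""
    return 1 if probability >= cut_off else 0

predicted = classify(p_survival)

# Everyone receives the same prediction, so the share classified correctly
# equals the share of people who actually fall in the predicted group.
accuracy = p_survival if predicted == 1 else 1 - p_survival

print(round(p_survival, 3), predicted, round(accuracy, 3))
```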
The Omnibus Tests of Model Coefficients table gives the result of the Likelihood Ratio (LR)
test, which indicates whether the inclusion of this block of variables contributes significantly
to model fit. A p-value (sig) of less than 0.05 for the block means that the block 1 model is a
significant improvement on the block 0 model.
The pseudo R-squared values in the Model Summary table suggest that between 31% and 42.1%
of the variation in survival can be explained by the model in block 1. The correct
classification rate has increased by 16.2 percentage points to 78%.
Finally, the Variables in the Equation table summarises the importance of the explanatory
variables individually whilst controlling for the other explanatory variables.
The Wald test is similar to the LR test but here it is used to test the hypothesis that each
$\beta = 0$. In the sig column, the p-values are all below 0.05 apart from the test for the
variable Alone (p = 0.286). This means that although the Chi-squared test for Survival vs
Alone was significant, once the other variables were controlled for, there is not a
strong enough relationship between that variable and survival. Class is tested as a whole
(pclass) and then 1st and 2nd class are compared to the reference category, 3rd class. When
interpreting the differences, look at the Exp(B) column, which represents the odds ratio for
the individual variable. For example, the odds of survival for those in 1st class were 5.493
times the odds for those in 3rd class. With gender, the odds ratio compares the odds of a
male surviving with the odds of a female surviving. The odds are a lot lower for men (0.084
times those of women). For ease of interpretation, calculate the odds ratio for a female
surviving compared with a male using 1/0.084 = 11.9. The odds of survival for females were
11.9 times the odds for males.
The co-efficients for the model are contained in the column headed B. A negative value
means that the odds of survival decrease e.g. for males and those travelling alone.
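The link between the B column and the odds ratios is that each odds ratio is the exponential of the corresponding co-efficient. A minimal Python sketch, using the co-efficients reported for this model, is given below. Because the co-efficients are rounded to three decimal places, the results can differ slightly in the last digit from the values SPSS prints.

```python
import math

# Co-efficients (B) for the block 1 model, as reported in this worksheet
coefficients = {
    "1st class": 1.703,
    "2nd class": 0.832,
    "male": -2.474,
    "alone": -0.156,
}

# Odds ratio for each variable = exp(B)
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")

# Inverting an odds ratio swaps the comparison: female vs male instead of male vs female
female_vs_male = 1 / odds_ratios["male"]
print(f"female vs male: {female_vs_male:.1f}")
```

A negative co-efficient always gives an odds ratio below 1, and a positive co-efficient gives an odds ratio above 1.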
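The full model being tested is:
$$\ln\left(\frac{p}{1-p}\right) = 0.460 + 1.703\,x_{\text{1st class}} + 0.832\,x_{\text{2nd class}} - 2.474\,x_{\text{male}} - 0.156\,x_{\text{alone}}$$
where $x_{\text{1st class}} = 1$ for 1st class, $x_{\text{2nd class}} = 1$ for 2nd class, $x_{\text{male}} = 1$ for men and $x_{\text{alone}} = 1$ for a person travelling alone.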
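To show how the full model turns into predicted probabilities, the sketch below evaluates it for two hypothetical passengers. The constant is taken as +0.460, which is an assumption about its sign, and the function name is illustrative only.

```python
import math

INTERCEPT = 0.460  # assumed to be positive; treat the sign as an assumption

def survival_probability(first_class, second_class, male, alone):
    """Predicted probability of survival; each argument is a 0/1 indicator."""
    log_odds = (INTERCEPT
                + 1.703 * first_class
                + 0.832 * second_class
                - 2.474 * male
                - 0.156 * alone)
    return math.exp(log_odds) / (1 + math.exp(log_odds))

# Hypothetical passengers
p_woman_first_family = survival_probability(1, 0, 0, 0)  # woman, 1st class, with family
p_man_third_alone = survival_probability(0, 0, 1, 1)     # man, 3rd class, travelling alone

print(round(p_woman_first_family, 2), round(p_man_third_alone, 2))
```

With the 0.5 cut-off used by SPSS, the first passenger would be classified as surviving and the second as dying.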