Statistics Lecture 5
Categorical Data: The Chi-Squared Test, Odds Ratios,
Relative Risk and Logistic Regression
Rifat Hamoudi
Senior Lecturer
r.hamoudi@ucl.ac.uk
Review
Comparing one numerical outcome over 2 or more groups:
Independent Groups
- 2 Groups: Independent t-test / Mann Whitney U test
- >2 Groups: One-way ANOVA / Kruskal Wallis test
Paired Groups
- 2 Groups: Paired t-test / Wilcoxon's Signed Rank test
Review
Assessing the relationship between two numerical variables:
Correlation Analysis: quantifies the strength of the linear association
between two numerical variables
Review
Regression Analysis
Next.....
Methods for Analysing
Categorical Data
Outline
Comparing Two Proportions:
- Chi-squared test
- Fisher's Exact test
Categorical Data
Categorical data are data that can be placed into categories:
Binary/Ordinal/Nominal
The mean is useless for categorical data! We cannot use
methods for continuous data to analyse categorical data
We analyse frequencies for categorical variables, that is the
number of observations that fall into each combination of categories
                       Under 5   Over 5
BMI under 30              92       323
Obese (BMI over 30)       19        76
Total                    111       399
Proportion of Obese Children:
Under 5: 19/111 = 0.17
Over 5: 76/399 = 0.19
χ² Test Example
Example:
H0: The proportion of children with the obese
characteristic is equal in the two age groups
HA: The proportion of children with the obese
characteristic is not equal in the two age groups
To conduct the χ² test:
SPSS: Analyze > Descriptive Statistics > Crosstabs
χ² Test Example
2 x 2 Contingency table:
Overweight * Age_Cat Crosstabulation (Count)
                              Age_Cat
Overweight               Under 5   Over 5   Total
BMI Under 30                92       323     415
Obese (BMI Over 30)         19        76      95
Total                      111       399     510
χ² Test Example
Expected Cell counts:
                              Age_Cat
Overweight               Under 5                Over 5
BMI Under 30             415x111/510 = 90.3     415x399/510 = 324.7
Obese (BMI Over 30)      95x111/510 = 20.7      95x399/510 = 74.3
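The expected cell counts can be checked with a few lines of Python. This is a minimal sketch, not SPSS's own procedure: the function name `expected_counts` is an illustrative choice, and the rule used is the one shown on the slide (row total x column total / grand total).

```python
# Observed counts from the 2 x 2 table (rows: BMI group, columns: age group).
observed = [
    [92, 323],  # BMI under 30: under 5, over 5
    [19, 76],   # Obese (BMI over 30): under 5, over 5
]


def expected_counts(table):
    """Expected count for each cell = row total x column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for row in expected:
    print([round(value, 1) for value in row])
```

Rounded to one decimal place, the values should match the 90.3, 324.7, 20.7 and 74.3 shown above.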
χ² Test Example
Results of the Chi-squared test:
Chi-Square Tests
                              Value   df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                           (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square            .214    1    .644          .682         .379
Continuity Correction         .105    1    .746
Likelihood Ratio              .217    1    .641          .682         .379
Fisher's Exact Test                                      .682         .379
Linear-by-Linear Association  .213    1    .644          .682         .379         .101
N of Valid Cases              510
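The Pearson Chi-Square value and its asymptotic 2-sided significance can be reproduced by hand. The sketch below assumes the usual Pearson formula, the sum over cells of (observed - expected)² / expected, and uses the fact that for 1 degree of freedom the upper-tail chi-squared probability equals erfc(sqrt(x / 2)). The function names are illustrative, not part of SPSS.

```python
import math

# Observed counts: rows are BMI group, columns are age group (under 5, over 5).
observed = [[92, 323], [19, 76]]


def pearson_chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic


def p_value_df1(statistic):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


chi_square = pearson_chi_square(observed)
p_value = p_value_df1(chi_square)
print(round(chi_square, 3), round(p_value, 3))
```

The result should agree with the Pearson row of the output (.214 and .644): a large P-value, so there is no evidence that the proportion of obese children differs between the age groups.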
                 Drug   Placebo
Not Healed        152     142
Healed            212      56
Total             364     198
Proportion not healed:
Drug: 152/364 = 0.42
Placebo: 142/198 = 0.72
                        Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                      (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square      46.140   1    .000          .000         .000
Continuity Correction   44.946   1    .000
Likelihood Ratio        47.359   1    .000          .000         .000
Fisher's Exact Test                                 .000         .000
N of Valid Cases        562
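The difference between the Pearson Chi-Square and the Continuity Correction rows can be illustrated in Python. This sketch assumes the continuity correction is Yates' correction (subtract 0.5 from each |observed - expected| before squaring), which is the standard correction for 2 x 2 tables. The function name and the `continuity_correction` flag are illustrative.

```python
# Drug trial counts: rows are Not Healed / Healed, columns are Drug / Placebo.
drug_trial = [[152, 142], [212, 56]]


def chi_square_2x2(table, continuity_correction=False):
    """Pearson chi-squared statistic, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            difference = abs(count - expected)
            if continuity_correction:
                # Yates: shrink each absolute difference by 0.5 (not below 0).
                difference = max(difference - 0.5, 0.0)
            statistic += difference ** 2 / expected
    return statistic


pearson = chi_square_2x2(drug_trial)
corrected = chi_square_2x2(drug_trial, continuity_correction=True)
print(round(pearson, 3), round(corrected, 3))
```

With these counts the two statistics should come out close to the 46.140 and 44.946 reported in the output; the correction always makes the statistic slightly smaller.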
                 Exposed
                 Yes    No     Total
Outcome   Yes    a      b      a+b
          No     c      d      c+d
Total            a+c    b+d    n=a+b+c+d
Risk exp = a/(a + c)
Risk unexp = b/(b + d)
                 Drug   Placebo
Not Healed        152     142
Healed            212      56
Total             364     198
Risk of not healing:
Drug: 152/364 = 42%
Placebo: 142/198 = 72%
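The two risks can be combined into a relative risk. This is a minimal sketch, assuming the drug group is treated as the "exposed" group and "not healed" as the outcome, and using the standard definition of relative risk as the risk in the exposed group divided by the risk in the unexposed group. The variable names are illustrative.

```python
# Counts from the drug trial table.
not_healed_drug, total_drug = 152, 364
not_healed_placebo, total_placebo = 142, 198

# Risk = number with the outcome / number in the group.
risk_drug = not_healed_drug / total_drug
risk_placebo = not_healed_placebo / total_placebo

# Relative risk = risk in exposed group / risk in unexposed group.
relative_risk = risk_drug / risk_placebo

print(round(risk_drug, 2), round(risk_placebo, 2), round(relative_risk, 2))
```

A relative risk below 1 here would mean that patients on the drug were less likely to remain unhealed than patients on placebo.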
            Exposed
            Yes    No     Total
Case        a      b      a+b
Control     c      d      c+d
Total       a+c    b+d    n=a+b+c+d
               Smoker   Non-smoker   Total
Lung Cancer      647         2        649
No Lung Ca       622        27        649
Total           1269        29       1298
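For case-control data like the lung cancer table, the odds ratio can be computed directly from the four cells. The sketch below makes two assumptions: the number of non-smokers with lung cancer is taken as 649 - 647 = 2 (from the row total), and the odds ratio is the usual cross-product ratio ad / bc. The variable names are illustrative.

```python
# Case-control counts (exposure = smoking).
cases_smoker = 647         # a: lung cancer, smoker
cases_non_smoker = 2       # b: lung cancer, non-smoker (649 - 647, assumed)
controls_smoker = 622      # c: no lung cancer, smoker
controls_non_smoker = 27   # d: no lung cancer, non-smoker

# Odds of exposure among cases and among controls.
odds_cases = cases_smoker / cases_non_smoker
odds_controls = controls_smoker / controls_non_smoker

# Odds ratio = (a / b) / (c / d) = ad / bc.
odds_ratio = odds_cases / odds_controls

print(round(odds_ratio, 2))
```

An odds ratio well above 1 would indicate that the odds of having smoked are much higher among the lung cancer cases than among the controls.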
                              Value    df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                            (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square            22.044   1    .000          .000         .000
Continuity Correction         20.316   1    .000
Likelihood Ratio              26.140   1    .000          .000         .000
Fisher's Exact Test                                       .000         .000
Linear-by-Linear Association  22.027   1    .000          .000         .000         .000
N of Valid Cases              1298
P < 0.001. A small P-value indicates there is evidence against the null
hypothesis, so we reject the null hypothesis.
                              Value   df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                           (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square            1.326   1    .250          .534         .355
Continuity Correction         .217    1    .641
Likelihood Ratio              2.126   1    .145          .534         .355
Fisher's Exact Test                                      .534         .355
Linear-by-Linear Association  1.273   1    .259          .534         .355         .355
N of Valid Cases              25
Binary outcome of interest
                       B         S.E.    Wald    df   Sig.   Exp(B)
                                                             (Odds Ratios)
Step 1   Age           .186      .100    3.439   1    .064   1.204
         Treatment(1)  -1.001    .460    4.724   1    .030   .368
         Constant      -10.491   5.484   3.660   1    .056   .000
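The Exp(B) column is the exponential of the B column, which is why it can be read as an odds ratio. A short sketch using the coefficients from the table (the dictionary names are illustrative):

```python
import math

# Logistic regression coefficients (B) taken from the SPSS output.
coefficients = {
    "Age": 0.186,
    "Treatment(1)": -1.001,
    "Constant": -10.491,
}

# Exp(B): the multiplicative change in the odds for a one-unit increase
# in the predictor (or for one group versus the reference group).
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}

for name, value in odds_ratios.items():
    print(name, round(value, 3))
```

A value above 1 (Age) corresponds to higher odds of the outcome as the predictor increases; a value below 1 (Treatment(1)) corresponds to lower odds relative to the reference group.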
Confidence Intervals
95% confidence intervals for the odds ratios give the range within which we
expect the true population odds ratio values to lie
We would expect the confidence interval of Exp(B) [OR] not to include 1
if the associated explanatory variable is significant
If the 95% CI spans OR = 1, the data are consistent with the odds being
equal: we cannot be sure that the true odds ratio is not 1
- For a binary/categorical predictor, OR = 1 means the odds are the same for
one group compared to another
- For a continuous predictor, OR = 1 means the odds are the same at the
different levels of the continuous variable
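As an illustration, approximate 95% confidence intervals for the odds ratios can be built from B and S.E. in the logistic regression output. This sketch assumes the usual Wald-type interval exp(B ± 1.96 x S.E.); the intervals are not shown in the source output, so the numbers printed here are a reconstruction under that assumption. The function name `odds_ratio_ci` is illustrative.

```python
import math


def odds_ratio_ci(b, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) using exp(B +/- z * SE)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


# (B, S.E.) pairs taken from the SPSS output.
predictors = {"Age": (0.186, 0.100), "Treatment(1)": (-1.001, 0.460)}

intervals = {name: odds_ratio_ci(b, se) for name, (b, se) in predictors.items()}

for name, (ratio, lower, upper) in intervals.items():
    spans_one = lower <= 1.0 <= upper
    print(name, round(ratio, 3), round(lower, 3), round(upper, 3), spans_one)
```

Under this assumption the interval for Age spans 1 while the interval for Treatment(1) does not, which agrees with the Sig. values of .064 and .030.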
Significance of Predictors
We can also test the null hypothesis that the relevant binary
logistic regression coefficient is zero, which is equivalent to
testing the hypothesis that the odds ratio associated with this
variable is 1
Wald test: Formally test the null hypothesis that a regression
coefficient B is zero:
H0: B = 0
HA: B ≠ 0
Or equivalently that
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
                       B         S.E.    Wald    df   Sig.   Exp(B)
Step 1   Age           .186      .100    3.439   1    .064   1.204
         Treatment(1)  -1.001    .460    4.724   1    .030   .368
         Constant      -10.491   5.484   3.660   1    .056   .000
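The Wald and Sig. columns can be approximately reproduced from B and S.E. The sketch assumes the Wald statistic is (B / S.E.)² compared with a chi-squared distribution on 1 degree of freedom. Because the table reports B and S.E. rounded to three decimals, the recomputed values only match the output to about two decimal places. The function name is illustrative.

```python
import math


def wald_test(b, se):
    """Return (Wald statistic, p-value) for H0: B = 0, using 1 degree of freedom."""
    wald = (b / se) ** 2
    # Upper-tail chi-squared probability with df = 1.
    p_value = math.erfc(math.sqrt(wald / 2))
    return wald, p_value


# (B, S.E.) pairs taken from the SPSS output.
rows = {
    "Age": (0.186, 0.100),
    "Treatment(1)": (-1.001, 0.460),
    "Constant": (-10.491, 5.484),
}

results = {name: wald_test(b, se) for name, (b, se) in rows.items()}

for name, (wald, p_value) in results.items():
    print(name, round(wald, 2), round(p_value, 3))
```

At the 5% level only Treatment(1) has a P-value below 0.05, so it is the only predictor for which the null hypothesis B = 0 (odds ratio = 1) would be rejected.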
Key Points
When comparing proportions/risks/odds of a characteristic of a
categorical variable over 2 groups consider the structure of the data:
- Independent groups: