25.11.2014 - Categorical Data

Research Methodology
Statistics Lecture 5
Categorical Data: The Chi-Squared Test, Odds Ratios,
Relative Risk and Logistic Regression
Rifat Hamoudi
Senior Lecturer
r.hamoudi@ucl.ac.uk
Review
Comparing one numerical outcome over 2 or more groups:
Independent Groups
- 2 Groups: Independent t-test / Mann Whitney U test
- >2 Groups: One-way ANOVA / Kruskal Wallis test
Paired Groups
- 2 Groups: Paired t-test / Wilcoxon's Signed Rank test
Review
Assessing the relationship between two numerical
variables:
Correlation
Analysis
Quantifies the strength
of the linear association
between two numerical
variables
Review
Regression Analysis
Simple linear regression fits a straight line to describe the
relationship between the two numerical variables where
one variable depends on the other
The regression coefficient quantifies the amount
the dependent variable changes as the explanatory
variable increases by one unit
Which statistical test to use?
Next.....
Methods for Analysing
Categorical Data
Outline
Comparing Two Proportions:
- Chi-squared test
- Fisher's Exact test
Risk, Risk Difference and Risk Ratio
Odds and Odds Ratio
Binary Logistic Regression Analysis
Categorical Data
Categorical data are data that can be placed into categories:
Binary/Ordinal/Nominal
The mean is not meaningful for categorical data! We cannot use
methods for continuous data to analyse categorical data
We analyse frequencies for categorical variables, that is, the
number of individuals that fall into each combination of categories
Obesity in Young Children

Obesity in young life can pave the way for future
musculoskeletal conditions
A dietician conducted a survey of 510 children at a local
primary school
Objective: Are there more obese children under 5 or over 5?
Categorical Data: Comparing Groups
Objective: Are there more obese children under 5 or over 5?
Initially tabulate observed frequencies in a 2 x 2
contingency table, for example:
                       Age Category
                       Under 5    Over 5
BMI under 30              92        323
Obese (BMI over 30)       19         76
Total                    111        399

The proportions of obese children in each age category are
calculated as follows:
                                Age Category
                                Under 5          Over 5
BMI under 30                       92              323
Obese (BMI over 30)                19               76
Total                             111              399
Proportion of Obese Children    19/111 = 0.17    76/399 = 0.19
We wish to formally compare the proportions of children
with the obese characteristic
We often have two independent groups of individuals
(under 5 / over 5)
We want to know whether the proportions of individuals
with a particular characteristic (obese) are the same in the two
groups
Categorical Data: Independent Groups χ² Test
The Chi-Squared (χ²) test allows us to formally compare proportions
between two independent groups
It allows us to determine whether the observed frequencies (counts)
differ markedly from the frequencies that we would expect by chance
Define the null and alternative hypothesis under study:
H0: The proportions of individuals with the characteristic are equal in the
two groups in the population
HA: These population proportions are not equal
Categorical Data: Two Independent Groups χ² Test
SPSS: Analyze > Descriptive Statistics > Crosstabs
Categorical Data: Two Independent Groups χ² Test
χ² Test Technical details:
- The expected numbers in each of the four cells in our 2x2
contingency table are calculated assuming H0 is true (equal proportions)
- The formula for each expected cell is:
(row total*column total)/grand total
where the grand total equals the total number of individuals that
make up the sample (N)

Categorical Data: Two Independent Groups χ² Test
- What was observed is compared to the calculated expected
numbers, which are the counts we would see if there were no differences between
the groups (equal proportions)
- A large discrepancy between the observed and the corresponding
expected frequencies is an indication that the proportions in the
two groups differ (P <0.05)
Observed counts, with Expected Numbers in brackets:
                                Under 5                     Over 5                       Total
BMI under 30                    92  (415x111/510 = 90.3)    323  (415x399/510 = 324.7)   415
Obese (BMI over 30)             19  (95x111/510 = 20.7)      76  (95x399/510 = 74.3)      95
Total                           111                         399                          510
Proportion of Obese Children    19/111 = 0.17               76/399 = 0.19
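The expected-count formula can be checked with a few lines of Python. This is a minimal sketch rather than part of the SPSS workflow in the lecture; the names `observed` and `expected_counts` are illustrative.

```python
# Observed 2 x 2 table from the obesity example.
# Rows: BMI under 30, Obese (BMI over 30). Columns: Under 5, Over 5.
observed = [
    [92, 323],
    [19, 76],
]

def expected_counts(table):
    """Expected cell counts under H0: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
for row in expected:
    print([round(cell, 1) for cell in row])
# prints [90.3, 324.7] then [20.7, 74.3]
```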
χ² Test Example
Example:
H0: The proportion of children with the obese
characteristic is equal in the two age groups
HA: The proportion of children with the obese
characteristic is not equal in the two age groups
To conduct the χ² test:
χ² Test Example
2 x 2 Contingency table:
Overweight * Age_Cat Crosstabulation
Count
                                     Age_Cat
                                     Under 5    Over 5    Total
Overweight   BMI Under 30               92        323      415
             Obese (BMI Over 30)        19         76       95
Total                                  111        399      510
χ² Test Example
Results of the Chi-squared test:
Chi-Square Tests
                                Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                               (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square              .214 b    1    .644          .682         .379
Continuity Correction a         .105      1    .746
Likelihood Ratio                .217      1    .641          .682         .379
Fisher's Exact Test                                          .682         .379
Linear-by-Linear Association    .213 c    1    .644          .682         .379         .101
N of Valid Cases                510
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.68.
c. The standardized statistic is .462.
There is no evidence that the proportions of children with the obese
characteristic differ between the two age groups (Under 5 = 0.17 or
17%, Over 5 = 0.19 or 19%; P = 0.644), so we do not reject H0
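As a cross-check on the SPSS output, the Pearson chi-squared statistic can be computed by hand. The sketch below uses only the standard library; the function name is illustrative, and the P-value line relies on the fact that for 1 degree of freedom the upper tail of the chi-squared distribution equals `erfc(sqrt(x / 2))`, so it applies to 2 x 2 tables only.

```python
import math

def pearson_chi_squared_2x2(table):
    """Return (statistic, p_value) for a 2 x 2 table, without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    # Upper-tail probability of a chi-squared variable with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Obesity example: rows = BMI under 30 / obese, columns = under 5 / over 5.
chi2, p = pearson_chi_squared_2x2([[92, 323], [19, 76]])
print(f"chi-squared = {chi2:.3f}, P = {p:.3f}")
```

The result should agree with the Pearson Chi-Square row of the SPSS table (value about 0.214, P about 0.644).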
Categorical Data: Comparing Risks χ² Test
The χ² test compares observed and expected cell counts - useful
to compare proportions across two independent groups
In the context of a randomised controlled trial our proportions
will be risks
Probably the most common scenario in medical research is to
compare the outcome risk in two independent groups
We can use the χ² test to answer a common RCT question: Is the
risk of failing in group A the same as the risk of failing in
group B?
χ² Test Risk Example
                       Treatment
Outcome                Drug              Placebo
Not Healed             152               142
Healed                 212                56
Total                  364               198
Risk of Not healing    152/364 = 0.42    142/198 = 0.72
Risk of not healing in the drug group = 0.42 or (0.42*100) 42%
Risk of not healing in the placebo group = 0.72 or (0.72*100) 72%
Risk difference = 72% - 42% = 30%
The Chi-squared test allows us to formally compare risks between
groups, answering the question: Is the risk of not healing in the placebo
group the same as the risk of not healing in the drug group?
χ² Test Risk Example
Define the null and alternative hypothesis under study:
H0: The risk of not healing is equal in the two treatment
groups
HA: The risk of not healing is not equal in the two treatment
groups
Chi-Square Tests
                           Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                           (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square         46.140 b   1    .000          .000         .000
Continuity Correction a    44.946     1    .000
Likelihood Ratio           47.359     1    .000          .000         .000
Fisher's Exact Test                                      .000         .000
N of Valid Cases           562
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 94.42.
There is evidence to reject the null hypothesis
(P<0.001). The risk of not healing is not equal in the drug
and placebo groups (risk difference = 30%)
Relative Risk (Risk Ratio)
Typically the risk difference will be a sufficient way of
presenting differences between groups with binary outcomes
If the outcome is rare then ratios are more suitable
                       Exposed to factor
Outcome of Interest    Yes     No      Total
Yes                    a       b       a+b
No                     c       d       c+d
Total                  a+c     b+d     n=a+b+c+d
Risk of Outcome in the exposed group = a / (a+c)
Risk of Outcome in the unexposed group = b / (b+d)
Relative Risk (Risk Ratio or RR) = Risk exp / Risk unexp = [a / (a + c)] / [b / (b + d)]
χ² Test Risk Example
                       Treatment
Outcome                Drug              Placebo
Not Healed             152               142
Healed                 212                56
Total                  364               198
Risk of Not healing    152/364 = 42%     142/198 = 72%
Risk difference = 72% - 42% = 30%
Relative Risk (Risk Ratio): (152/364) / (142/198) = 0.58
A subject in the drug group is 0.58 times as likely to not heal
as a subject in the placebo group
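The risk, risk difference and relative risk for this trial can be reproduced in a few lines. This is an illustrative sketch using the counts from the table above; the variable names are not from the lecture.

```python
# Counts from the drug vs placebo example.
not_healed_drug, total_drug = 152, 364
not_healed_placebo, total_placebo = 142, 198

# Risk = number with the outcome / number in the group.
risk_drug = not_healed_drug / total_drug
risk_placebo = not_healed_placebo / total_placebo

# Risk difference (placebo minus drug) and relative risk (drug relative to placebo).
risk_difference = risk_placebo - risk_drug
relative_risk = risk_drug / risk_placebo

print(f"Risk (drug)     = {risk_drug:.2f}")
print(f"Risk (placebo)  = {risk_placebo:.2f}")
print(f"Risk difference = {risk_difference:.2f}")
print(f"Relative risk   = {relative_risk:.2f}")
```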
Interpretation of the Relative Risk (Risk Ratio)
A RR of 1 indicates that the risk is the same in the two groups
A RR <1 indicates that there is a reduction in the risk of the
outcome in the exposed group (drug group) compared with
the unexposed group (placebo)
A RR >1 indicates that there is an increased risk in the
exposed group (drug group) compared with the unexposed
group (placebo)

Categorical Data: Comparing Odds χ² Test
An RCT is often not feasible if an outcome is rare, so instead known cases
and suitable controls are selected for a case-control study
In a case-control study we do not interpret the proportions of
cases/controls with specific characteristics as risks
Case-control studies only examine association NOT causation
We compare odds because patients are selected because of their
disease status
We do not interpret proportions as risks - you could get any risk value you
wish by simply varying the number of cases and controls selected, and
often the numbers of cases do not reflect the true mix of cases in
the general population
Odds and Odds Ratios

Relative Risk is not valid in such a scenario
Rather we will be comparing odds therefore we must use the
Odds Ratio (OR) to present the differences between groups
Odds are different to risks!
What are odds, what is the odds ratio and how does the odds
ratio differ from the risk ratio?
Odds and Odds Ratios
               Exposed to Factor
               Yes     No      Total
Case           a       b       a+b
Control        c       d       c+d
Total          a+c     b+d     n=a+b+c+d
Odds of being a Case in the exposed group = a / c
Odds of being a Case in the unexposed group = b / d
Odds Ratio = (Odds of being a case in the exposed group) / (Odds of being a case in the unexposed group)
           = (a / c) / (b / d) = (a x d) / (b x c)
Odds and Odds Ratios Example
Lung Cancer Doll & Hill Example: 649 male lung cancer patients and 649
controls. Compare the distribution of lung cancer among smokers and
non-smokers. 647 of 1269 smokers had lung cancer compared to 2
of 29 non-smokers.
               Smoker    Non-smoker    Total
Lung Cancer    647        2            649
No Lung Ca     622       27            649
Total          1269      29            1298
Odds lung cancer in smokers = 647/622 = 1.04
Odds lung cancer in non-smokers = 2/27 = 0.07
Odds ratio = (647/622) / (2/27) = (647 x 27) / (2 x 622) = 14.04
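The odds and odds ratio above can be reproduced as follows. This is an illustrative sketch; the cell labels a, b, c, d follow the layout of the Odds and Odds Ratios table (cases in the first row, exposed in the first column).

```python
# Doll & Hill example: a, b = cases (exposed, unexposed); c, d = controls (exposed, unexposed).
a, b = 647, 2    # lung cancer: smokers, non-smokers
c, d = 622, 27   # no lung cancer: smokers, non-smokers

odds_exposed = a / c      # odds of being a case among smokers
odds_unexposed = b / d    # odds of being a case among non-smokers
odds_ratio = odds_exposed / odds_unexposed

# The cross-product form (a x d) / (b x c) gives the same value.
cross_product = (a * d) / (b * c)

print(f"Odds (smokers)     = {odds_exposed:.2f}")
print(f"Odds (non-smokers) = {odds_unexposed:.2f}")
print(f"Odds ratio         = {odds_ratio:.2f}")
```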
Interpretation of the Odds Ratios

If the odds ratio = 1 then this implies equality
The odds are equivalent in the exposed and unexposed groups
An odds ratio >1 indicates that the odds of disease (outcome) is
greater in the exposed group than in the unexposed group
An odds ratio <1 indicates that the odds of disease (outcome) is
lower in the exposed group than in the unexposed group

Categorical Data: Comparing Odds χ² Test
The Chi-squared test can be used within the context of the case-control
study to formally test:
H0: The odds of having lung cancer in smokers = the odds of
lung cancer in non-smokers (i.e. odds ratio = 1)
HA: The odds of having lung cancer are not equal
               Smoker    Non-smoker    Total
Lung Cancer    647        2            649
No Lung Ca     622       27            649
Odds lung cancer in smokers = 647/622 = 1.04
Odds lung cancer in non-smokers = 2/27 = 0.07
Odds ratio = (647/622) / (2/27) = (647 x 27) / (2 x 622) = 14.04
χ² Test Odds Example
The null and alternative hypothesis under study:
H0: The odds of having lung cancer in smokers = the odds of lung
cancer in non-smokers (i.e. odds ratio = 1)
HA: The odds of having lung cancer are not equal
Chi-Square Tests
                                Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                                 (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square              22.044 b    1    .000          .000         .000
Continuity Correction a         20.316      1    .000
Likelihood Ratio                26.140      1    .000          .000         .000
Fisher's Exact Test                                            .000         .000
Linear-by-Linear Association    22.027 c    1    .000          .000         .000         .000
N of Valid Cases                1298
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 14.50.
c. The standardized statistic is 4.693.
P < 0.001. The small P-value indicates there is evidence against the null
hypothesis, so we reject the null hypothesis.
χ² Test Odds Example
The study provides considerable evidence to suggest an
association between lung cancer and smoking
The odds of having lung cancer are significantly greater for
smokers than non-smokers, odds ratio = 14.04 (P<0.001)
Assumptions of the χ² Test
1. The expected frequency in each of the four cells is at least 5
The Chi-squared test is hence only valid if all the expected
frequencies are sufficiently large
2. The Chi-squared test also assumes the groups are
independent (e.g. treatment group and placebo group,
under 5s and over 5s)
What if Expected Frequency < 5?
Use Fisher's exact test - given in the SPSS output of the Chi-squared test
If any one of the expected cell counts is less than 5,
interpret Fisher's exact test:
Chi-Square Tests
                                Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                                (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square              1.326 b    1    .250          .534         .355
Continuity Correction a         .217       1    .641
Likelihood Ratio                2.126      1    .145          .534         .355
Fisher's Exact Test                                           .534         .355
Linear-by-Linear Association    1.273 c    1    .259          .534         .355         .355
N of Valid Cases                25
a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is .84.
c. The standardized statistic is 1.128.
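For readers who want to see what Fisher's exact test computes, the sketch below evaluates it directly from the hypergeometric distribution. The cell counts behind the SPSS output above are not given in the lecture, so the table used here is hypothetical, and the two-sided rule (sum the probabilities of all tables that are no more likely than the observed one) is one common convention, assumed here.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability of x in the top-left cell given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum over all tables at least as extreme (no more probable) than the observed one.
    return sum(prob(x) for x in range(lowest, highest + 1)
               if prob(x) <= p_observed * (1 + 1e-9))

# Hypothetical small table (NOT the data behind the SPSS output above).
p_fisher = fisher_exact_two_sided(3, 1, 1, 3)
print(f"Fisher's exact two-sided P = {p_fisher:.4f}")
```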
What if the Groups are Not Independent?
What if the two groups are related?
- Each individual may have had their outcome measured in 2
different circumstances
- Cross-over trial: each patient receives drug and placebo
- Matched Case-Control Study
Use McNemar's Test
What if the Groups are Not Independent?
SPSS: Analyze > Descriptive Statistics > Crosstabs
Select McNemar's Test in the Statistics option
Same null and alternative hypothesis:
H0: The proportion of individuals with the characteristic is
equal in the two groups in the population
HA: The proportion of individuals with the characteristic is
not equal in the two groups in the population
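The lecture does not give the McNemar formula, so the following is a sketch of the standard large-sample version, stated as an assumption: only the discordant pairs matter, and the statistic (b - c)² / (b + c) is referred to a chi-squared distribution with 1 degree of freedom. The counts are hypothetical, and software may report an exact or continuity-corrected variant instead.

```python
import math

def mcnemar_test(b, c):
    """Large-sample McNemar test (no continuity correction).

    b = pairs with the characteristic under condition 1 only,
    c = pairs with the characteristic under condition 2 only.
    """
    statistic = (b - c) ** 2 / (b + c)
    # Upper tail of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical cross-over trial: 15 patients healed on drug only, 5 on placebo only.
mcnemar_stat, mcnemar_p = mcnemar_test(15, 5)
print(f"McNemar chi-squared = {mcnemar_stat:.2f}, P = {mcnemar_p:.3f}")
```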
Categorical Data: More than 2 Categories
Suppose we wish to test for an association between two factors which
may have more than two categories
Example: Is there an association between blood group (4 group levels:
A, B, O, AB) and disease severity (3 groups: mild, moderate, severe)?
Are individuals of a particular blood group likely to be more severely ill?
We can still use the Chi-squared test on larger tables - data
presented in an r x c contingency table (r rows and c columns)
The null and alternative hypothesis under study:
H0: There is no association between the categories of one factor and
the categories of the other factor in the population
HA: The two factors are associated in the population
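The same observed-versus-expected calculation extends to an r x c table, with (r - 1) x (c - 1) degrees of freedom. The sketch below uses hypothetical blood group by severity counts, since the lecture gives no data for this example.

```python
def chi_squared_rxc(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected count under H0
            statistic += (obs - exp) ** 2 / exp
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

# Hypothetical counts. Rows: blood group A, B, O, AB. Columns: mild, moderate, severe.
blood_by_severity = [
    [30, 20, 10],
    [15, 15, 10],
    [40, 25, 15],
    [5, 10, 5],
]
rxc_stat, rxc_df = chi_squared_rxc(blood_by_severity)
print(f"chi-squared = {rxc_stat:.2f} on {rxc_df} df")
```

The statistic is then compared with the chi-squared distribution on the stated degrees of freedom, which statistical software does automatically.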
Binary Logistic Regression
Up until now we have discussed regression with a numerical
outcome/dependent variable
Lecture 4 - Linear regression is a modelling technique used to
explore the associations between one numerical dependent
variable and one or more explanatory variables (be these
numerical or categorical)
We are often interested in examining binary outcomes, for
example mortality (dead/alive), case/control, success/failure
We can model a binary outcome using binary logistic
regression

Useful when we wish to compare the proportion of people
with a particular binary outcome by group, but adjusted for
potential confounders
Examples:
1. Is there an association between smoking and lung cancer
after adjusting for Sex?
2. Is a new treatment associated with mortality after
adjustment for age?

The dependent variable and explanatory variable(s) are
distinguished in the same way as linear regression
In binary logistic regression the binary outcome of interest is
the dependent variable. The other factors of interest which
we believe may be related to the binary outcome are the
explanatory/independent variables
Logistic regression evaluates the odds that an individual with
a particular combination of values for the explanatory
variables will have the binary outcome of interest
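To make the link between the model and the odds concrete: logistic regression models the log of the odds as a linear combination of the explanatory variables, so the odds are the exponential of that combination and the probability is odds / (1 + odds). The sketch below illustrates this with hypothetical coefficients; the function name and the numbers are assumptions, not values from the lecture.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Probability of the outcome from a fitted binary logistic model.

    log(odds) = intercept + sum of coefficient * value over the explanatory variables.
    """
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    odds = math.exp(log_odds)
    return odds / (1 + odds)

# Hypothetical model with one explanatory variable: log(odds) = -2.0 + 0.5 * x
prob_at_4 = predicted_probability(-2.0, [0.5], [4])  # log-odds = 0, so odds = 1
prob_at_6 = predicted_probability(-2.0, [0.5], [6])  # log-odds = 1
print(f"P(outcome | x = 4) = {prob_at_4:.2f}")
print(f"P(outcome | x = 6) = {prob_at_6:.2f}")
```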

When you fit a binary logistic regression model, for each
explanatory variable you will get an odds ratio (OR), EXP(B)
For binary/categorical explanatory variables, the OR is the
ratio of the odds of the binary outcome for one group
to the odds for the other/reference group
For numerical explanatory variables, the OR is the factor by which the
odds of the binary outcome are multiplied for a one unit increase in the
numerical explanatory variable

If the odds ratio is greater than 1 then as the predictor increases,
the odds of the outcome occurring increase
Conversely an odds ratio value less than 1 indicates that as the
predictor increases the odds of the outcome occurring decrease
If the odds ratio = 1 then this implies equality
- For a binary/categorical predictor the odds are equivalent for one
group compared to another
- For a numerical predictor the odds are equivalent for the different
levels of the continuous variables
Binary Logistic Regression Example
Example: Clinical trial for breast cancer, comparing mortality at 5
years between new vs standard drug
- Outcome is mortality at 5 years, either yes or no
- The two treatment groups differ in age, so we need to adjust for age
Binary logistic regression is an ideal method of analysis to employ
to determine if treatment is associated with mortality at 5 years
after adjustment for age
We will fit a binary logistic regression model and get an odds ratio
(OR) for Treatment (adjusted for age) and an OR for Age (adjusted
for treatment)
Data: [Figure: data, with the binary outcome of interest labelled]
SPSS: Analyze > Regression > Binary Logistic

Variables in the Equation
                                                                         95.0% C.I. for EXP(B)
                             B         S.E.     Wald     df   Sig.   Exp(B)    Lower    Upper
Step 1 a    Age              .186      .100     3.439    1    .064   1.204     .989     1.466
            Treatment(1)     -1.001    .460     4.724    1    .030   .368      .149     .906
            Constant         -10.491   5.484    3.660    1    .056   .000
a. Variable(s) entered on step 1: Age, Treatment.
The Exp(B) column gives the Odds Ratios
For each variable odds ratio (OR) = Exp(B)
- For the binary explanatory variable Treatment, the OR is the odds of
mortality for treatment = 1 relative to the odds for the reference category,
treatment = 0
OR for Treatment = Exp(B) = 0.368
In comparison to Treatment=0 the odds of mortality at 5 years for
Treatment=1 are 0.368 times as large, or equivalently (0.368 - 1) x 100 = -63.2%,
i.e. 63.2% lower
For the continuous variable Age, the OR is the factor by which the odds
of mortality are multiplied for a one unit increase in Age (1 year increase)
OR for Age = Exp(B) = 1.204
As Age increases by one unit (1 year) the odds of mortality
increase by a factor of 1.204, or equivalently increase by (1.204 - 1) x 100 = 20.4%
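The conversion from a regression coefficient B to an odds ratio and a percentage change in odds can be checked directly. This sketch uses the B values from the SPSS output for Age and Treatment; the helper name is illustrative.

```python
import math

def odds_ratio_and_percent_change(b):
    """Return (odds ratio, percentage change in odds) for a logistic coefficient B."""
    odds_ratio = math.exp(b)
    percent_change = (odds_ratio - 1) * 100
    return odds_ratio, percent_change

# B values taken from the SPSS "Variables in the Equation" table.
or_age, pct_age = odds_ratio_and_percent_change(0.186)
or_treatment, pct_treatment = odds_ratio_and_percent_change(-1.001)

print(f"Age:       OR = {or_age:.3f}, change in odds = {pct_age:+.1f}%")
print(f"Treatment: OR = {or_treatment:.3f}, change in odds = {pct_treatment:+.1f}%")
```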
Confidence Intervals
95% confidence intervals for the Odds Ratios give the range within which we
expect the true population Odds Ratio values to lie
We would expect the confidence interval of Exp(B) [OR] to not include 1
if the associated explanatory variable is significant
If the 95% CI spans OR = 1 then the data are consistent with the odds
being equal. We cannot be sure that the true odds ratio is not 1 if the 95%
confidence interval spans 1
- For a binary/categorical predictor the odds may be equivalent for one group
compared to another
- For a continuous predictor the odds may be equivalent for the different
levels of the continuous variable
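The lecture does not state how the interval is computed. Assuming the usual Wald interval, exp(B ± 1.96 x S.E.), the limits in the SPSS table can be reproduced to within rounding of the printed B and S.E. values:

```python
import math

def odds_ratio_ci(b, se, z=1.96):
    """Approximate 95% Wald confidence interval for the odds ratio exp(B)."""
    return math.exp(b - z * se), math.exp(b + z * se)

# B and S.E. taken from the SPSS "Variables in the Equation" table.
age_lower, age_upper = odds_ratio_ci(0.186, 0.100)
trt_lower, trt_upper = odds_ratio_ci(-1.001, 0.460)

print(f"Age:       95% CI ({age_lower:.3f}, {age_upper:.3f})")
print(f"Treatment: 95% CI ({trt_lower:.3f}, {trt_upper:.3f})")
```

The Age interval includes 1 and the Treatment interval does not, which matches the significance results reported for the two variables.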
Significance of Predictors
We can also test the null hypothesis that the relevant binary
logistic regression coefficient is zero, which is equivalent to
testing the hypothesis that the odds ratio associated with this
variable is 1
Wald test: Formally test the null hypothesis that a regression
coefficient B is zero:
H0: B = 0
HA: B ≠ 0
Or equivalently that
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
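As a sketch of what the Wald test does, assuming the common form W = (B / S.E.)² compared with a chi-squared distribution on 1 degree of freedom: using the B and S.E. values printed in the SPSS output gives results close to the reported Wald and Sig. columns, with small differences because the printed values are rounded.

```python
import math

def wald_test(b, se):
    """Wald statistic (B / S.E.)^2 and its P-value on 1 degree of freedom."""
    wald = (b / se) ** 2
    p_value = math.erfc(math.sqrt(wald / 2))  # chi-squared upper tail, 1 df
    return wald, p_value

# B and S.E. taken from the SPSS "Variables in the Equation" table.
wald_age, p_age = wald_test(0.186, 0.100)
wald_trt, p_trt = wald_test(-1.001, 0.460)

print(f"Age:       Wald = {wald_age:.2f}, P = {p_age:.3f}")
print(f"Treatment: Wald = {wald_trt:.2f}, P = {p_trt:.3f}")
```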

SPSS conducts the Wald test for you!
                                                                         95.0% C.I. for EXP(B)
                             B         S.E.     Wald     df   Sig.   Exp(B)    Lower    Upper
Step 1 a    Age              .186      .100     3.439    1    .064   1.204     .989     1.466
            Treatment(1)     -1.001    .460     4.724    1    .030   .368      .149     .906
            Constant         -10.491   5.484    3.660    1    .056   .000
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
Age: P=0.064. The P-value indicates we should not reject
H0 for Age (taking a strict 0.05 critical level); the data are consistent with an odds ratio of 1.
Treatment: P=0.030. The small P-value indicates evidence
against H0 for Treatment. Odds ratio ≠ 1.

Summary of Results:
- OR for Age = 1.204, 95% CI (0.989, 1.466), P = 0.064
- OR for new treatment = 0.368, 95% CI (0.149, 0.906),
P = 0.030
As age increases, so do the estimated odds of mortality; however, Age is
not a significant predictor of mortality
The odds of mortality were significantly lower (by 63.2%) for patients
on the new treatment
SPSS Practical 4 & 5
Linear Regression
Fits a straight line to describe the relationship
between one or more explanatory/independent variables and
one numerical dependent/outcome variable
Regression coefficients quantify the amount the dependent
variable changes as the explanatory variable increases by one
unit (multiple linear regression - after adjustment for any
other explanatory variables)
Chi-squared test
Binary logistic regression
Solutions will be available on Moodle
Key Points
When comparing proportions/risks/odds of a characteristic of a
categorical variable over 2 groups consider the structure of the data:
- Independent groups:
Expected counts at least 5: χ² Test
Any expected count less than 5: Fisher's Exact Test
- Non-Independent Samples: McNemar's Test
We can also use the χ² test to test for association between 2
categorical factors which may have any number of groups
Binary Logistic regression is used for modelling binary outcomes;
output is given in terms of odds ratios
