
Research Methodology

Statistics Lecture 5
Categorical Data: The Chi Squared Test, Odds Ratios,
Relative Risk and Logistic Regression

Rifat Hamoudi
Senior Lecturer
r.hamoudi@ucl.ac.uk

Review
Comparing one numerical outcome over 2 or more groups:
Independent Groups
- 2 Groups: Independent t-test; Mann Whitney U test
- >2 Groups: One-way ANOVA; Kruskal Wallis test
Paired Groups
- 2 Groups: Paired t-test; Wilcoxon's Signed Rank test

Review
Assessing the relationship between two numerical variables:
Correlation Analysis
Quantifies the strength of the linear association between two
numerical variables

Review
Regression Analysis
Simple linear regression fits a straight line to describe the
relationship between two numerical variables where
one variable depends on the other
The regression coefficient quantifies the amount
the dependent variable changes as the explanatory
variable increases by one unit

Which statistical test to use?

Next.....
Methods for Analysing
Categorical Data

Outline
Comparing Two Proportions:
- Chi-squared test
- Fisher's Exact test
Risk, Risk Difference and Risk Ratio
Odds and Odds Ratio
Binary Logistic Regression Analysis

Categorical Data
Categorical data are data that can be placed into categories:
Binary/Ordinal/Nominal
The mean is useless for categorical data! We cannot use
methods for continuous data to analyse categorical data
We analyse frequencies for categorical variables, that is, the
number of individuals that fall into each combination of categories

Obesity in Young Children


Obesity in young life can pave the way for future
musculoskeletal conditions
A dietician conducted a survey of 510 children at a local
primary school
Objective: Are there more obese children under 5 or over 5?

Categorical Data: Comparing Groups
Objective: Are there more obese children under 5 or over 5?
Initially tabulate the observed frequencies in a 2 x 2
contingency table, for example:

                        Age Category
                        Under 5    Over 5
BMI under 30               92        323
Obese (BMI over 30)        19         76
Total                     111        399

Categorical Data: Comparing Groups
The proportions of obese children in each age category are
calculated as follows:

                               Age Category
                               Under 5          Over 5
BMI under 30                   92               323
Obese (BMI over 30)            19               76
Total                          111              399
Proportion of Obese Children   19/111 = 0.17    76/399 = 0.19

We wish to formally compare the proportions of children
with the obese characteristic

Categorical Data: Comparing Groups


We often have two independent groups of individuals
(under 5 / over 5)
We want to know whether the proportions of individuals
with a particular characteristic are the same in the two
groups (obese)

Categorical Data: Independent Groups χ² Test
The Chi-Squared (χ²) test allows us to formally compare proportions
between two independent groups
It allows us to determine whether the observed frequencies (counts)
differ markedly from the frequencies that we would expect by chance
Define the null and alternative hypothesis under study:
H0: The proportions of individuals with the characteristic are equal in the
two groups in the population
HA: These population proportions are not equal

Categorical Data: Two Independent Groups χ² Test
SPSS: Analyse > Descriptive Statistics > Crosstabs

Categorical Data: Two Independent Groups χ² Test
χ² Test technical details:
- The expected numbers in each of the four cells in our 2x2
contingency table if H0 is true are calculated (equal proportions)
- The formula for each expected cell is:
(row total x column total)/grand total
where the grand total equals the total number of individuals that
make up the sample (N)

Categorical Data: Two Independent Groups χ² Test
- What was observed is compared to the calculated expected
numbers which would indicate there were no differences between
the groups (equal proportions)
- A large discrepancy between the observed and the corresponding
expected frequencies is an indication that the proportions in the
two groups differ (P < 0.05)

Observed counts, with Expected Numbers in brackets:

                               Under 5                     Over 5                       Total
BMI under 30                   92  (415x111/510 = 90.3)    323 (415x399/510 = 324.7)    415
Obese (BMI over 30)            19  (95x111/510 = 20.7)     76  (95x399/510 = 74.3)      95
Total                          111                         399                          510
Proportion of Obese Children   19/111 = 0.17               76/399 = 0.19
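
The expected-count formula can be checked numerically. The following is a minimal Python sketch, not part of the original slides: it recomputes the expected counts for the obesity table and then forms the Pearson statistic as the sum of (observed - expected)² / expected, which is the standard definition but is not written out in the lecture. The P-value line relies on the fact that, for 1 degree of freedom, the upper tail of the χ² distribution equals erfc(sqrt(χ²/2)). All variable names are illustrative.

```python
import math

# Observed counts from the obesity example.
observed = [[92, 323],   # BMI under 30: under 5, over 5
            [19, 76]]    # Obese:        under 5, over 5

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under H0 = (row total x column total) / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson statistic: sum over the four cells of (O - E)^2 / E.
# (Standard formula, assumed here; the slides only show the SPSS result.)
chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))

# Upper-tail probability of a chi-squared variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))

print([[round(e, 1) for e in row] for row in expected])
print(round(chi2, 3), round(p_value, 3))
```

The rounded results should agree with the expected counts above and with the Pearson Chi-Square row of the SPSS output for this example.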

χ² Test Example
Example:
H0: The proportion of children with the obese
characteristic is equal in the two age groups
HA: The proportion of children with the obese
characteristic is not equal in the two age groups
To conduct the χ² test:
SPSS: Analyse > Descriptive Statistics > Crosstabs

χ² Test Example
2 x 2 Contingency table:
Overweight * Age_Cat Crosstabulation
Count
                                  Age_Cat
                                  Under 5    Over 5    Total
Overweight   BMI Under 30         92         323       415
             Obese (BMI Over 30)  19         76        95
Total                             111        399       510

χ² Test Example
Overweight * Age_Cat Crosstabulation
Count, with expected cell counts in brackets:
                                  Age_Cat
                                  Under 5                     Over 5                       Total
Overweight   BMI Under 30         92  (415x111/510 = 90.3)    323 (415x399/510 = 324.7)    415
             Obese (BMI Over 30)  19  (95x111/510 = 20.7)     76  (95x399/510 = 74.3)      95
Total                             111                         399                          510

χ² Test Example
Results of the Chi-squared test:
Chi-Square Tests
                               Value     df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                              (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square             .214(b)   1    .644          .682         .379
Continuity Correction(a)       .105      1    .746
Likelihood Ratio               .217      1    .641          .682         .379
Fisher's Exact Test                                         .682         .379
Linear-by-Linear Association   .213(c)   1    .644          .682         .379         .101
N of Valid Cases               510
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 20.68.
c. The standardized statistic is .462.

There is no evidence that the proportions of children with the obese
characteristic differ between the two age groups (P = 0.644; Under 5 = 0.17 or
17%, Over 5 = 0.19 or 19%)

Categorical Data: Comparing Risks χ² Test
The χ² test compares observed and expected cell counts - useful
to compare proportions across two independent groups
In the context of a randomised controlled trial our proportions
will be risks
Probably the most common scenario in medical research is to
compare the outcome risk in two independent groups
We can use the χ² test to answer a common RCT question: Is the
risk of failing in group A the same as the risk of failing in
group B?

χ² Test Risk Example

                      Treatment
Outcome               Drug              Placebo
Not Healed            152               142
Healed                212               56
Total                 364               198
Risk of Not Healing   152/364 = 0.42    142/198 = 0.72

Risk of not healing in the drug group = 0.42 or (0.42*100) 42%
Risk of not healing in the placebo group = 0.72 or (0.72*100) 72%
Risk difference = 72% - 42% = 30%
The Chi-squared test allows us to formally compare risks between
groups, answering the question: Is the risk of not healing in the placebo
group the same as the risk of not healing in the drug group?

χ² Test Risk Example
Define the null and alternative hypothesis under study:
H0: The risk of not healing is equal in the two treatment
groups
HA: The risk of not healing is not equal in the two treatment
groups
SPSS: Analyse > Descriptive Statistics > Crosstabs

Chi-Square Tests
                           Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.
                                            (2-sided)     (2-sided)    (1-sided)
Pearson Chi-Square         46.140(b)   1    .000          .000         .000
Continuity Correction(a)   44.946      1    .000
Likelihood Ratio           47.359      1    .000          .000         .000
Fisher's Exact Test                                       .000         .000
N of Valid Cases           562
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 94.42.

There is evidence to reject the null hypothesis
(P<0.001). The risk of not healing is not equal in the drug
and placebo groups (risk difference = 30%)

Relative Risk (Risk Ratio)
Typically the risk difference will be a sufficient way of
presenting differences between groups with binary outcomes
If the outcome is rare then ratios are more suitable:
use the Relative Risk (Risk Ratio)

Relative Risk (Risk Ratio)

                      Exposed to factor
Outcome of Interest   Yes       No        Total
Yes                   a         b         a+b
No                    c         d         c+d
Total                 a+c       b+d       n=a+b+c+d

Risk of Outcome in the exposed group = a / (a+c)
Risk of Outcome in the unexposed group = b / (b+d)
Relative Risk (Risk Ratio or RR) = Risk exp / Risk unexp = [a/(a + c)] / [b/(b + d)]

χ² Test Risk Example

                      Treatment
Outcome               Drug              Placebo
Not Healed            152               142
Healed                212               56
Total                 364               198
Risk of Not Healing   152/364 = 42%     142/198 = 72%

Risk difference = 72% - 42% = 30%
Relative Risk (Risk Ratio): (152/364) / (142/198) = 0.58
A subject in the drug group is 0.58 times as likely to not heal
as a subject in the placebo group
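
The risk arithmetic for this trial can be reproduced in a few lines. This is a small illustrative Python sketch using only the counts from the table; the dictionary and variable names are made up for the example and are not from the lecture.

```python
# Counts from the drug vs placebo table.
not_healed = {"drug": 152, "placebo": 142}
total = {"drug": 364, "placebo": 198}

# Risk = number with the outcome / number in the group.
risk = {group: not_healed[group] / total[group] for group in total}

# Risk difference (placebo minus drug) and relative risk (drug over placebo),
# matching the direction used in the lecture.
risk_difference = risk["placebo"] - risk["drug"]
relative_risk = risk["drug"] / risk["placebo"]

print(round(risk["drug"], 2), round(risk["placebo"], 2))
print(round(risk_difference, 2), round(relative_risk, 2))
```

Swapping numerator and denominator would give the relative risk for placebo compared with drug (about 1.72), so it is worth stating the reference group whenever an RR is reported.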

Interpretation of the Relative Risk (Risk Ratio)
An RR of 1 indicates that the risk is the same in the two groups
An RR <1 indicates that there is a reduction in the risk of the
outcome in the exposed group (drug group) compared with
the unexposed group (placebo)
An RR >1 indicates that there is an increased risk in the
exposed group (drug group) compared with the unexposed
group (placebo)

Categorical Data: Comparing Odds χ² Test
An RCT is often not feasible if an outcome is rare, so instead known cases
and suitable controls are selected for a case-control study
In a case-control study we do not interpret the proportions of
cases/controls with specific characteristics
Case-control studies only examine association, NOT causation
We compare odds because patients are selected because of their
disease status
We do not interpret proportions as risks: you could get any risk value you
wish simply by varying the number of cases and controls selected, and
often the numbers of cases do not reflect the true mix of cases in
the general population

Odds and Odds Ratios


Relative Risk is not valid in such a scenario
Rather we will be comparing odds therefore we must use the
Odds Ratio (OR) to present the differences between groups
Odds are different to risks!
What are odds, what is the odds ratio and how does the odds
ratio differ from the risk ratio?

Odds and Odds Ratios

            Exposed to Factor
            Yes       No        Total
Case        a         b         a+b
Control     c         d         c+d
Total       a+c       b+d       n=a+b+c+d

Odds of being a Case in the exposed group = a / c
Odds of being a Case in the unexposed group = b / d
Odds Ratio = (Odds of being a case in the exposed group) / (Odds of being a case in the unexposed group)
           = (a / c) / (b / d) = (a x d) / (b x c)

Odds and Odds Ratios Example
Lung Cancer Doll & Hill Example: 649 male cancer patients and 649
controls. Compare distribution of lung cancer among smokers and
non-smokers. 647 of 1269 smokers had lung cancer compared to 2
of 29 non-smokers.

              Smoker    Non-smoker    Total
Lung Cancer   647       2             649
No Lung Ca    622       27            649
Total         1269      29            1298

Odds lung cancer in smokers = 647/622 = 1.04
Odds lung cancer in non-smokers = 2/27 = 0.07
Odds ratio = (647/622) / (2/27) = (647 x 27) / (2 x 622) = 14.04
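
As a quick check of the odds calculation, here is a minimal Python sketch using the Doll & Hill counts. The letters a, b, c, d follow the case/control layout used earlier in the lecture; the other names are illustrative only.

```python
# Cell counts from the Doll & Hill table.
a, b = 647, 2    # cases (lung cancer): smokers, non-smokers
c, d = 622, 27   # controls (no lung cancer): smokers, non-smokers

odds_exposed = a / c      # odds of being a case among smokers
odds_unexposed = b / d    # odds of being a case among non-smokers

# Ratio of the two odds, and the equivalent cross-product form (a x d) / (b x c).
odds_ratio = odds_exposed / odds_unexposed
cross_product = (a * d) / (b * c)

print(round(odds_exposed, 2), round(odds_unexposed, 2), round(odds_ratio, 2))
```

Both forms give the same value up to floating-point rounding; the cross-product form is often the more convenient one for hand calculation.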

Interpretation of the Odds Ratios
If the odds ratio = 1 then this implies equality:
the odds are equivalent in the exposed and unexposed groups
An odds ratio >1 indicates that the odds of disease (outcome) are
greater in the exposed group than in the unexposed group
An odds ratio <1 indicates that the odds of disease (outcome) are
lower in the exposed group than in the unexposed group

Categorical Data: Comparing Odds χ² Test
The Chi squared test can be used within the context of the case
control study to formally test:
H0: The odds of having lung cancer in smokers = the odds of
lung cancer in non-smokers (i.e. odds ratio = 1)
HA: The odds of having lung cancer are not equal
χ² Test Odds Example
The null and alternative hypothesis under study:
H0: The odds of having lung cancer in smokers = the odds of lung
cancer in non-smokers (i.e. odds ratio = 1)
HA: The odds of having lung cancer are not equal

Chi-Square Tests
                               Value       df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                                (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square             22.044(b)   1    .000          .000         .000
Continuity Correction(a)       20.316      1    .000
Likelihood Ratio               26.140      1    .000          .000         .000
Fisher's Exact Test                                           .000         .000
Linear-by-Linear Association   22.027(c)   1    .000          .000         .000         .000
N of Valid Cases               1298
a. Computed only for a 2x2 table
b. 0 cells (.0%) have expected count less than 5. The minimum expected count is 14.50.
c. The standardized statistic is 4.693.

P < 0.001. The small P-value indicates there is evidence against the null
hypothesis: reject the null hypothesis.

χ² Test Odds Example
The study provides considerable evidence to suggest an
association between lung cancer and smoking
The odds of having lung cancer are significantly greater for
smokers than non-smokers, odds ratio = 14.04 (P<0.001)

Assumptions of the χ² Test
1. The expected frequency in each of the four cells is at least 5
The Chi squared test is hence only valid if all the expected
frequencies are sufficiently large
2. The Chi squared test also assumes the groups are
independent (e.g. treatment group and placebo group,
under 5s and over 5s)

What if Expected Frequency < 5?
Use Fisher's exact test, which is given in the SPSS output of the Chi-squared test
If any one of the expected cell counts is less than 5,
interpret Fisher's exact test:

Chi-Square Tests
                               Value      df   Asymp. Sig.   Exact Sig.   Exact Sig.   Point
                                               (2-sided)     (2-sided)    (1-sided)    Probability
Pearson Chi-Square             1.326(b)   1    .250          .534         .355
Continuity Correction(a)       .217       1    .641
Likelihood Ratio               2.126      1    .145          .534         .355
Fisher's Exact Test                                          .534         .355
Linear-by-Linear Association   1.273(c)   1    .259          .534         .355         .355
N of Valid Cases               25
a. Computed only for a 2x2 table
b. 2 cells (50.0%) have expected count less than 5. The minimum expected count is .84.
c. The standardized statistic is 1.128.
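
The lecture does not show how Fisher's exact P-values are obtained, and the cell counts behind the N = 25 output above are not given. As a hedged illustration only, the Python sketch below computes exact P-values for a hypothetical 2x2 table [[3, 1], [1, 3]] by enumerating every table with the same row and column totals (the hypergeometric approach). The function name and the example counts are assumptions, and the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention that may differ from a given package's default.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Return (one_sided_p, two_sided_p) for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(k):
        # Hypergeometric probability that the top-left cell equals k
        # when all row and column totals are held fixed.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # One-sided (upper tail): top-left counts at least as large as observed.
    one_sided = sum(prob(k) for k in range(a, highest + 1))
    # Two-sided: every table that is no more probable than the observed one.
    two_sided = sum(prob(k) for k in range(lowest, highest + 1)
                    if prob(k) <= p_observed * (1 + 1e-9))
    return one_sided, two_sided

# Hypothetical small table, chosen only for illustration.
one_sided, two_sided = fisher_exact_2x2(3, 1, 1, 3)
print(round(one_sided, 4), round(two_sided, 4))
```

For this hypothetical table the one-sided value is 17/70 and the two-sided value is 34/70, so with such small counts there is no evidence against H0.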

What if the Groups are Not Independent?
What if the two groups are related?
- Each individual may have had their outcome measured in 2
different circumstances
- Cross-over trial: each patient receives both drug and placebo
- Matched Case-Control Study
Use McNemar's Test

What if the Groups are Not Independent?
SPSS: Analyze > Descriptive Statistics > Crosstabs
Select McNemar's Test in the Statistics option
Same null and alternative hypothesis:
H0: The proportion of individuals with the characteristic is
equal in the two groups in the population
HA: The proportion of individuals with the characteristic is
not equal in the two groups in the population
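
The slides give no worked example for McNemar's test, so the sketch below uses hypothetical counts. It assumes the exact (binomial) form of the test, in which only the discordant pairs matter: b pairs where the outcome occurred under the first condition only, and c pairs where it occurred under the second condition only. Under H0 each discordant pair is equally likely to fall either way. The function name and the counts 15 and 5 are illustrative assumptions, not data from the lecture.

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar P-value from the two discordant-pair counts."""
    n = b + c
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as uneven as observed,
    # counted in one tail and then doubled (capped at 1).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical cross-over data: 15 patients healed on drug only,
# 5 patients healed on placebo only.
p_value = mcnemar_exact(15, 5)
print(round(p_value, 4))
```

Pairs with the same outcome under both conditions do not enter the calculation, which is why the test needs the paired table rather than the two marginal proportions.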

Categorical Data: More than 2 Categories
Suppose we wish to test for an association between two factors which
may have more than two categories
Example: Is there an association between blood group (4 group levels:
A, B, O, AB) and disease severity (3 groups: mild, moderate, severe)?
Are individuals of a particular blood group likely to be more severely ill?
We can still use the Chi-squared test on larger frequency tables - data
presented in an r x c contingency table (r rows and c columns)
The null and alternative hypothesis under study:
H0: There is no association between the categories of one factor and
the categories of the other factor in the population
HA: The two factors are associated in the population

Binary Logistic Regression
Up until now we have discussed regression with a numerical
outcome/dependent variable
Lecture 4 - Linear regression is a modelling technique used to
explore the associations between one numerical dependent
variable and one or more explanatory variables (be these
numerical or categorical)
We are often interested in examining binary outcomes, for
example mortality (dead/alive), case/control, success/failure
We can model a binary outcome using binary logistic
regression

Binary Logistic Regression


Useful when we wish to compare the proportion of people
with a particular binary outcome by group, but adjusted for
potential confounders
Examples:
1. Is there an association between smoking and lung cancer
after adjusting for Sex?
2. Is a new treatment associated with mortality after
adjustment for age?

Binary Logistic Regression


The dependent variable and explanatory variable(s) are
distinguished in the same way as linear regression
In binary logistic regression the binary outcome of interest is
the dependent variable. The other factors of interest which
we believe may be related to the binary outcome are the
explanatory/independent variables
Logistic regression evaluates the odds that an individual with
a particular combination of values for the explanatory
variables will have the binary outcome of interest

Binary Logistic Regression
When you fit a binary logistic regression model, for each
explanatory variable you will get an odds ratio (OR), reported as Exp(B)
For binary/categorical explanatory variables, the OR is the
factor by which the odds of the binary outcome differ for one group
compared to the other/reference group
For numerical explanatory variables, the OR is the factor by which the
odds of the binary outcome change for a one unit increase in the
numerical explanatory variable

Binary Logistic Regression


If the odds ratio is greater than 1 then as the predictor increases,
the odds of the outcome occurring increase
Conversely an odds ratio value less than 1 indicates that as the
predictor increases the odds of the outcome occurring decrease
If the odds ratio = 1 then this implies equality
- For a binary/categorical predictor the odds are equivalent for one
group compared to another
- For a numerical predictor the odds are equivalent for the different
levels of the continuous variables

Binary Logistic Regression Example
Example: Clinical trial for breast cancer, comparing mortality at 5
years between new vs standard drug
- Outcome is mortality at 5 years, either yes or no
- There is a difference in age between the two treatment groups, so we need to adjust for age
Binary logistic regression is an ideal method of analysis to employ
to determine if treatment is associated with mortality at 5 years
after adjustment for age
We will fit a binary logistic regression model and get an odds ratio
(OR) for Treatment (adjusted for age) and an OR for Age (adjusted
for treatment)

Binary Logistic Regression Example
Data: [figure: data sheet, with the binary outcome of interest labelled]

Binary Logistic Regression Example
SPSS: Analyze > Regression > Binary Logistic

Binary Logistic Regression Example
Variables in the Equation (the Exp(B) column gives the Odds Ratios)
                                                                          95.0% C.I. for EXP(B)
                          B         S.E.     Wald     df   Sig.   Exp(B)  Lower    Upper
Step 1(a)  Age            .186      .100     3.439    1    .064   1.204   .989     1.466
           Treatment(1)   -1.001    .460     4.724    1    .030   .368    .149     .906
           Constant       -10.491   5.484    3.660    1    .056   .000
a. Variable(s) entered on step 1: Age, Treatment.

For each variable odds ratio (OR) = Exp(B)
- For the binary explanatory variable Treatment, the OR is the odds of
mortality for treatment = 1 compared to the reference category,
treatment = 0
OR for Treatment = Exp(B) = 0.368
In comparison to Treatment=0, the odds of mortality at 5 years for
Treatment=1 are 0.368 times as large, or equivalently (0.368 - 1) x 100 = -63.2%,
i.e. 63.2% lower
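
The step from the coefficient B to the odds ratio and the percentage change in odds can be reproduced directly. This is a short illustrative Python sketch using the Treatment coefficient reported in the SPSS output; the variable names are made up for the example.

```python
import math

# Coefficient for Treatment(1) as reported in the SPSS output.
b_treatment = -1.001

# Odds ratio = Exp(B).
or_treatment = math.exp(b_treatment)

# Percentage change in the odds relative to the reference category.
percent_change = (or_treatment - 1) * 100

print(round(or_treatment, 3), round(percent_change, 1))
```

A negative B always corresponds to an OR below 1 (lower odds), and a positive B to an OR above 1.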

Binary Logistic Regression Example


For the continuous variable Age, the OR is the factor by which the odds
of mortality change for a one unit increase in Age (1 year increase)
OR for Age = Exp(B) = 1.204
As Age increases by one unit (1 year) the odds of mortality
increase by a factor of 1.204, or equivalently increase by (1.204 - 1) x 100 = 20.4%

Confidence Intervals
95% confidence intervals for the odds ratios give the range within which we
expect the true population odds ratio to lie
We would expect the confidence interval of Exp(B) [OR] to not include 1
if the associated explanatory variable is significant
If the 95% CI spans OR = 1, the data are consistent with equal odds:
we cannot be sure that the true odds ratio is not 1
- For a binary/categorical predictor, equal odds means the odds are the same
for one group compared to another
- For a continuous predictor, equal odds means the odds are the same across
the different levels of the continuous variable

Significance of Predictors
We can also test the null hypothesis that the relevant binary
logistic regression coefficient is zero, which is equivalent to
testing the hypothesis that the odds ratio associated with this
variable is 1
Wald test: Formally test the null hypothesis that a regression
coefficient B is zero:
H0: B = 0
HA: B ≠ 0
Or equivalently that
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
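
The slides report the Wald statistics and confidence intervals without showing how they arise. The Python sketch below assumes the usual large-sample formulas, which are not stated in the lecture: Wald = (B / S.E.)², a P-value from the χ² distribution with 1 degree of freedom, and a 95% CI for the OR of Exp(B ± 1.96 x S.E.). It uses the rounded B and S.E. values from the SPSS output, so the results only approximately match the printed ones. The function name is illustrative.

```python
import math

def wald_summary(b, se, z=1.96):
    """Return (wald, p_value, (ci_lower, ci_upper)) for one coefficient."""
    wald = (b / se) ** 2
    # Upper tail of chi-squared with 1 df equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(wald / 2))
    # Confidence interval on the odds-ratio scale.
    ci = (math.exp(b - z * se), math.exp(b + z * se))
    return wald, p_value, ci

# B and S.E. as reported (rounded) in the Variables in the Equation table.
for name, b, se in [("Age", 0.186, 0.100), ("Treatment", -1.001, 0.460)]:
    wald, p_value, (lower, upper) = wald_summary(b, se)
    print(name, round(wald, 2), round(p_value, 3), round(lower, 3), round(upper, 3))
```

Small differences from the SPSS values (for example in the third decimal place) are expected because SPSS works with unrounded coefficients.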

Binary Logistic Regression Example


SPSS Conducts the Wald test for you!
H0: Exp(B) = odds ratio = 1
HA: Exp(B) = odds ratio ≠ 1
Age: P = 0.064. The P-value indicates we should not reject
H0 for Age: there is no evidence that the odds ratio differs from 1
(taking a strict 0.05 critical level)
Treatment: P = 0.030. The small P-value indicates evidence
against H0 for Treatment: odds ratio ≠ 1

Binary Logistic Regression Example
Summary of Results:
- OR for Age = 1.204, 95% CI (0.989, 1.466), P = 0.064
- OR for new treatment = 0.368, 95% CI (0.149, 0.906),
P = 0.030
The estimated odds of mortality increase with age; however, Age is
not a significant predictor of mortality
The odds of mortality were significantly lower, by 63.2%, for patients
on the new treatment
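
To make the fitted model concrete, the following Python sketch turns the reported coefficients into predicted probabilities of death at 5 years. It assumes the standard logistic form, log(odds) = Constant + B_Age x Age + B_Treatment x Treatment, and that Treatment = 1 denotes the new drug and Treatment = 0 the standard drug. The patient age of 55 is hypothetical and chosen only for illustration, and the coefficients are the rounded values from the SPSS output, so the probabilities are approximate.

```python
import math

# Coefficients (B) as reported in the Variables in the Equation table.
B_CONSTANT, B_AGE, B_TREATMENT = -10.491, 0.186, -1.001

def predicted_probability(age, treatment):
    """Modelled probability of death at 5 years for one patient."""
    log_odds = B_CONSTANT + B_AGE * age + B_TREATMENT * treatment
    odds = math.exp(log_odds)
    return odds / (1 + odds)

age = 55  # hypothetical patient age, for illustration only
p_standard = predicted_probability(age, treatment=0)
p_new = predicted_probability(age, treatment=1)

print(round(p_standard, 3), round(p_new, 3))
```

Whatever age is chosen, the ratio of the odds under the new and the standard treatment stays at Exp(-1.001), about 0.368, which is what "adjusted for age" means in this model.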

SPSS Practical 4 & 5
Linear Regression
Fits a straight line to describe the relationship
between one or more explanatory/independent variables and
one numerical dependent/outcome variable
Regression coefficients quantify the amount the dependent
variable changes as the explanatory variable increases by one
unit (multiple linear regression - after adjustment for any
other explanatory variables)
Chi squared test
Binary logistic regression
Solutions will be available on Moodle

Key Points
When comparing proportions/risks/odds of a characteristic of a
categorical variable over 2 groups consider the structure of the data:
- Independent groups:
    Expected counts ≥ 5: χ² Test
    Expected counts < 5: Fisher's Exact Test
- Non-Independent Samples: McNemar's Test
We can also use the χ² test to test for association between 2
categorical factors which may have any number of groups
Binary logistic regression is used for modelling binary outcomes;
output is given in terms of odds ratios
