
Categorical Data Analysis


Biostat 6651 Lecture 4
Fall 2013 07 Oct 2013
Dr. Lynn Eudey
Assignment and Announcement
Finish reading chapter 2 (skip * sections)
Start reading chapter 3
Quiz 1 on Weds Oct 9th
About 20 minutes long; closed book & closed notes
Chapter 1 and 2
All of Chapter 1
Ch 2 up through Lecture 3 material

Contingency Tables
Types of tests for independence (large sample)
Asymptotic Chi-square statistics
Pearson Chi-square test (lecture 1)
Likelihood ratio test (today)
Read about types of tests in Ch1 4.1
What is the test statistic telling us?
Standardized Cell Residuals
Partitioning Chi-square statistics
Ordinal Data
Three-way tables




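Likelihood ratio test for independence
Likelihood ratio tests
Compares the maximum likelihood under $H_0$ with the maximum likelihood under $H_A$
Test statistic: $-2 \ln(\ell_0 / \ell_1)$, where $\ell_0$ and $\ell_1$ are the maximized likelihoods under $H_0$ and $H_A$

For contingency tables
$G^2 = 2 \sum_{i,j} n_{ij} \ln(n_{ij} / \hat{\mu}_{ij})$
where $\mu_{ij} = n \pi_{ij}$ is estimated by $\hat{\mu}_{ij} = n_{i+} n_{+j} / n$ under $H_0$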
Likelihood ratio test for independence
$H_0: \pi_{ij} = \pi_{i+} \pi_{+j}$
$H_A: \pi_{ij} \ne \pi_{i+} \pi_{+j}$

$G^2 \ge 0$
$G^2$ has an asymptotic chi-square distribution under $H_0$
$G^2$ versus Pearson's Chi-square
Both have (I-1)(J-1) degrees of freedom for testing independence
Usually not the same value
Usually lead to the same statistical decision
Pearson's converges more quickly to the chi-square distribution (better for low counts)

Exploring the association
Two choices
Residual analysis
Where are the large deviations from expected counts?
Usually done with Pearson's Chi-square
Partitioning the Chi-square
Can we find sub-tables where there are strong associations, or very weak associations?
Usually done with likelihood ratio chi-square
Example
$H_0: \pi_{ij} = \pi_{i+} \pi_{+j}$
$H_A: \pi_{ij} \ne \pi_{i+} \pi_{+j}$

$\alpha = 0.05$

Use Pearson's Chi-square statistic; has (3-1)(3-1) = 4 df

Decision Rule: Reject $H_0$ if $\chi^2 > 9.49$

Data with observed counts and expected counts in table below

School of              Origin of Schizophrenia
Psychiatric Thought    Observed (Expected under Independence)
                       Biogenic     Environmental   Combination
Eclectic               90 (77.9)    12 (16.6)       78 (85.5)
Medical                13 (8.7)     1 (1.8)         6 (9.5)
Psychoanalytic         19 (35.5)    13 (7.6)        50 (39.0)

$\chi^2$ = 22.19 with a p-value of 0.0002
There is a statistically significant association (dependence) between school of Psychiatric Thought and beliefs on Origin of Schizophrenia

How is the association described? Where does the relationship deviate significantly from independence?
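
As a cross-check on the test above, here is a minimal Python sketch (standard library only) that recomputes the expected counts and Pearson's statistic for the 3 x 3 table. The function names are illustrative and not from the lecture. With unrounded expected counts the statistic comes out near 22.38 rather than 22.19; the quoted 22.19 is what results when the expected counts are first rounded to one decimal. The closed-form tail probability used here is valid only for 4 degrees of freedom.

```python
import math

# Observed counts from the example.
# Rows: Eclectic, Medical, Psychoanalytic
# Columns: Biogenic, Environmental, Combination
observed = [
    [90, 12, 78],
    [13, 1, 6],
    [19, 13, 50],
]


def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def chi_square_upper_tail_4df(x):
    """P(X > x) for a chi-square variable with exactly 4 df (closed form)."""
    return math.exp(-x / 2) * (1 + x / 2)


df = (len(observed) - 1) * (len(observed[0]) - 1)
x2 = pearson_chi_square(observed)
p_value = chi_square_upper_tail_4df(x2)
print(f"df = {df}, X^2 = {x2:.2f}, p-value = {p_value:.4f}")
```

Either version of the statistic is far above the 9.49 cutoff, so the decision is the same.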
Standardized Cell Residuals
It is not sufficient to look at the differences between observed and expected counts: cells with larger counts may have larger (but not meaningful) differences than cells with smaller counts

Standardize by dividing by the (estimated) standard error (the result has an asymptotic standard Normal distribution for large samples)
$r_{ij} = \dfrac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij} (1 - p_{i+}) (1 - p_{+j})}}$, where $p_{i+} = n_{i+}/n$ and $p_{+j} = n_{+j}/n$

For a small number of cells, an absolute value of 2 can be used as the cutoff for a significant residual

For a large number of cells, use a larger cutoff (3)
With 20 cells, about 1 standardized residual would be expected to exceed 2 in absolute value by chance alone
Example Continued
Table of Residuals = Observed - Expected
Some cells have large deviations, but are they significant?

                  Biogenic    Environ     Combined
Eclectic           12.1        -4.6        -7.5
Medical             4.3        -0.8        -3.5
Psychoanalytic    -16.5         5.4        11.0

Table of Standardized Residuals

                  Biogenic    Environ     Combined
Eclectic           3.026256   -1.9703     -1.86164
Medical            2.007924   -0.64928    -1.6262
Psychoanalytic    -4.3656      2.441187    2.887112
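
The standardization can be sketched in a few lines of Python. This is an illustrative sketch, not the lecture's own code: it divides each raw residual by the square root of expected count times (1 - row proportion) times (1 - column proportion), using unrounded expected counts. The results therefore differ slightly in the second decimal from the table above, which appears to have been built from counts rounded to one decimal. The names `standardized_residuals` and `flagged` are made up for the example.

```python
import math

observed = [
    [90, 12, 78],
    [13, 1, 6],
    [19, 13, 50],
]
row_labels = ["Eclectic", "Medical", "Psychoanalytic"]
col_labels = ["Biogenic", "Environ", "Combined"]


def standardized_residuals(table):
    """(observed - expected) / estimated standard error, cell by cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, n_ij in enumerate(row):
            mu = row_totals[i] * col_totals[j] / n
            se = math.sqrt(mu * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            out_row.append((n_ij - mu) / se)
        result.append(out_row)
    return result


residuals = standardized_residuals(observed)

# Cells whose standardized residual exceeds the cutoff of 2 in absolute value.
flagged = [
    (row_labels[i], col_labels[j], round(r, 2))
    for i, row in enumerate(residuals)
    for j, r in enumerate(row)
    if abs(r) > 2
]
for cell in flagged:
    print(cell)
```

With the cutoff of 2, the flagged cells are the whole Biogenic column plus the two remaining Psychoanalytic cells.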
Example Continued

When there are few cells, standardized residuals above 3 in absolute value are considered highly significant, and those above 2 significant

Conclusion: For the Psychoanalytic school there are significantly fewer who believe schizophrenia is biogenic alone; for the other two schools, especially the Eclectic school of thought, there are significantly more who believe schizophrenia is biogenic alone.
Example
$H_0: \pi_{ij} = \pi_{i+} \pi_{+j}$
$H_A: \pi_{ij} \ne \pi_{i+} \pi_{+j}$

$\alpha = 0.05$

Use likelihood ratio test; has (3-1)(3-1) = 4 df

Decision Rule: Reject $H_0$ if $G^2 > 9.49$

Data with observed counts and expected counts in table below

School of              Origin of Schizophrenia
Psychiatric Thought    Observed (Expected under Independence)
                       Biogenic     Environmental   Combination
Eclectic               90 (77.9)    12 (16.6)       78 (85.5)
Medical                13 (8.7)     1 (1.8)         6 (9.5)
Psychoanalytic         19 (35.5)    13 (7.6)        50 (39.0)

$G^2 = 2 \sum_{i,j} n_{ij} \ln(n_{ij} / \hat{\mu}_{ij}) = 2 \left[ 90 \ln(90 / 77.87234) + \cdots + 50 \ln(50 / 38.96454) \right]$

$G^2$ = 23.04; highly significant with p-value = 0.0001
Note that the test statistic is not exactly the same as Pearson's chi-square, but the test still reaches the same conclusion.
($G^2$ calculated using non-rounded expected counts)
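
The sum above can be reproduced with a short Python sketch (standard library only; the function name is illustrative). It uses unrounded expected counts, as the slide does, and skips zero cells, which contribute nothing to the sum. The tail-probability formula is the closed form for 4 degrees of freedom only.

```python
import math

observed = [
    [90, 12, 78],
    [13, 1, 6],
    [19, 13, 50],
]


def likelihood_ratio_statistic(table):
    """G^2 = 2 * sum of n_ij * ln(n_ij / mu_ij), mu_ij = row * column / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # a zero cell contributes 0 to the sum
                mu = row_totals[i] * col_totals[j] / n
                total += n_ij * math.log(n_ij / mu)
    return 2 * total


g2 = likelihood_ratio_statistic(observed)
p_value = math.exp(-g2 / 2) * (1 + g2 / 2)  # chi-square upper tail, 4 df only
print(f"G^2 = {g2:.2f}, p-value = {p_value:.4f}")
```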

Partitioning Chi-square
Can be done with either Pearson's or likelihood ratio test
With likelihood ratio, the sum of the partitioned $G^2$ will be the $G^2$ for the table
For Pearson's, the partitioned $\chi^2$ are valid statistics but will not sum to the $\chi^2$ for the table
Partitioning Chi-square
May help to explain the association
For the example, partition into four independent sub-tables
(Eclectic, Medical) × (Biogenic, Environ | one of those 2)
(Eclectic, Medical) × (non-Combination, Combination)
(Psychoanalytic, Not) × (Biogenic, Environ | one of those 2)
(Psychoanalytic, Not) × (non-Combination, Combination)

Partitioning Chi-square

            Biogenic   Environ                NonComb   Comb
Eclectic       90         12      Eclectic      102      78
Medical        13          1      Medical        14       6

            Biogenic   Environ                NonComb   Comb
NonPsych      103         13      NonPsych      116      84
Psych          19         13      Psych          32      50

Each independent table has 1 df. The $G^2$ statistics are 0.29, 1.36 for the top row and 12.95, 8.43 for the bottom row. This gives further evidence that the Psychoanalytic school of thought is more likely to attribute the origins of schizophrenia to a combination of both biogenic and environmental.
Sum of the unrounded statistics = 23.04
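
A minimal Python sketch of this check, using the four sub-tables exactly as laid out above (the helper `g_squared` and the dictionary labels are illustrative names). It computes $G^2$ for each sub-table and compares the total with $G^2$ for the full 3 x 3 table.

```python
import math


def g_squared(table):
    """Likelihood ratio statistic G^2 for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # zero cells contribute nothing
                total += n_ij * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return 2 * total


full_table = [[90, 12, 78], [13, 1, 6], [19, 13, 50]]

sub_tables = {
    "(Eclectic, Medical) x (Biogenic, Environ)": [[90, 12], [13, 1]],
    "(Eclectic, Medical) x (NonComb, Comb)": [[102, 78], [14, 6]],
    "(NonPsych, Psych) x (Biogenic, Environ)": [[103, 13], [19, 13]],
    "(NonPsych, Psych) x (NonComb, Comb)": [[116, 84], [32, 50]],
}

parts = {name: g_squared(table) for name, table in sub_tables.items()}
for name, value in parts.items():
    print(f"{name}: G^2 = {value:.2f}")
print(f"sum of parts = {sum(parts.values()):.2f}")
print(f"full table   = {g_squared(full_table):.2f}")
```

The two printed totals agree up to floating-point error, which is the signal that the partition is valid.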
Partitioning Tables
Rules:
The degrees of freedom for the sub-tables must sum to the degrees of freedom for the table
Each cell count in the original table must be a cell count in one and only one sub-table
Each marginal total for the original table must be a marginal total for one and only one sub-table
Notes:
There is no unique partitioning
It takes some practice to partition a table
$G^2$ for independent sub-tables will sum to the original table $G^2$
This is a way to check whether you have a good partition
Checking the rules

            Biogenic   Environ   Total                NonComb   Comb   Total
Eclectic       90         12      102     Eclectic      102      78     180
Medical        13          1       14     Medical        14       6      20
Total         103         13      116     Total         116      84     200

            Biogenic   Environ   Total                NonComb   Comb   Total
NonPsych      103         13      116     NonPsych      116      84     200
Psych          19         13       32     Psych          32      50      82
Total         122         26      148     Total         148     134     282

Original table:

          Bio   Env   Comb   Margin
Ecl        90    12    78     180
Med        13     1     6      20
Psyc       19    13    50      82
Margin    122    26   134     282
Ordinal data
Typical chi-square statistics ignore the ordinal nature of data in testing for independence
Solution: treat the ordinal variables as quantitative rather than categorical
Assign scores to reflect the ordering
Increase distance between scores to reflect the distance between the categories
Rank observations and use average rank for each cell
If both variables are ordinal, use Pearson's correlation coefficient $r$ to measure association (formula p. 41)
$M = \sqrt{n - 1}\, r \sim N(0, 1)$ approximately, under $H_0: \rho = 0$ (equivalently, $M^2 = (n - 1) r^2$ is approximately chi-square with 1 df)
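
To make the score-based approach concrete, here is a hedged Python sketch. The 2 x 3 table and the equally spaced scores are HYPOTHETICAL, chosen only to illustrate the arithmetic; they are not data from the lecture or the text. The sketch computes the correlation $r$ between row and column scores weighted by the cell counts, then $M^2 = (n - 1) r^2$, which is referred to a chi-square distribution with 1 df.

```python
import math


def linear_trend_test(table, row_scores, col_scores):
    """Return (r, M^2): r is the Pearson correlation of the row and column
    scores weighted by cell counts, and M^2 = (n - 1) * r^2."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    mean_u = sum(u * t for u, t in zip(row_scores, row_totals)) / n
    mean_v = sum(v * t for v, t in zip(col_scores, col_totals)) / n
    s_uv = sum(
        table[i][j] * (row_scores[i] - mean_u) * (col_scores[j] - mean_v)
        for i in range(len(table))
        for j in range(len(table[0]))
    )
    s_uu = sum(t * (u - mean_u) ** 2 for u, t in zip(row_scores, row_totals))
    s_vv = sum(t * (v - mean_v) ** 2 for v, t in zip(col_scores, col_totals))
    r = s_uv / math.sqrt(s_uu * s_vv)
    return r, (n - 1) * r ** 2


# HYPOTHETICAL counts and scores, for illustration only.
hypothetical = [[20, 10, 5], [5, 10, 20]]
r, m2 = linear_trend_test(hypothetical, row_scores=[1, 2], col_scores=[1, 2, 3])
p_value = math.erfc(math.sqrt(m2 / 2))  # chi-square upper tail with 1 df
print(f"r = {r:.3f}, M^2 = {m2:.2f}, p-value = {p_value:.5f}")
```

Changing the spacing of the scores changes $r$ and $M^2$, which is why the choice of scores should reflect the distance between categories.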

Three-way tables
Three categorical variables: U, X, Y
Examples:
Smoking, Lung disease, age category
Cal grad admissions, gender, department
Victim's race, Defendant's race, Death Penalty
Clinic, Treatment, Response
Sets of two-way tables for each value of the third variable
The two-way tables are called partial tables
Example (Table 2.11)

Clinic   Treatment   Success   Failure
One      A             18        12
         B             12         8
Two      A              2         8
         B              8        32

Each of the tables above is a partial table
We could also put the variables in a different order for a different set of partial tables

Treatment   Clinic   Success   Failure
A           One        18        12
            Two         2         8
B           One        12         8
            Two         8        32
Three-way tables
Mutual independence exists if and only if
$\pi_{ijk} = \pi_{i++} \pi_{+j+} \pi_{++k}$ for all $i, j, k$
Can test using Pearson's chi-square or likelihood ratio chi-square
Degrees of Freedom =
(Number of cells - 1) - (Number of probabilities estimated under $H_0$)
For an I × J × K table this works out to IJK - I - J - K + 2 (see whiteboard)
Try this formula for a two-way I × J table
The joint probabilities need to add to 1, so (IJ - 1)
The number of probabilities being estimated is (I - 1) + (J - 1), since each of the row marginal probabilities and the column marginal probabilities have to add to 1
DF = (IJ - 1) - (I - 1) - (J - 1) = IJ - I - J + 1 = (I - 1)(J - 1)
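
The counting argument can be checked mechanically. The sketch below (function name is illustrative) applies "free cell probabilities minus free marginal probabilities estimated under independence" to any number of dimensions and compares it with the closed forms quoted above.

```python
def independence_df(dims):
    """Degrees of freedom for testing (mutual) independence.

    dims is a tuple of category counts, e.g. (I, J) or (I, J, K).
    """
    cells = 1
    for d in dims:
        cells *= d
    free_cell_probs = cells - 1            # cell probabilities sum to 1
    estimated = sum(d - 1 for d in dims)   # each set of marginals sums to 1
    return free_cell_probs - estimated


# Two-way table: matches (I - 1)(J - 1)
for I, J in [(2, 2), (3, 3), (4, 5)]:
    assert independence_df((I, J)) == (I - 1) * (J - 1)

# Three-way table: matches IJK - I - J - K + 2
for I, J, K in [(2, 2, 2), (2, 3, 4)]:
    assert independence_df((I, J, K)) == I * J * K - I - J - K + 2

print(independence_df((3, 3)), independence_df((2, 2, 2)))
```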

Conditional Associations
Conditional association refers to the
association of two of the variables when
the third is held constant (conditioned on)
What is the association between treatment
and response for clinic one?
What is the association between treatment
and response for clinic two?
Marginal Associations
Marginal association refers to the association of two of the variables when the tables are summed over the third variable.
What is the association between treatment and response for both clinics combined? Clinic is ignored.

              Success   Failure
Treatment A     20        20
Treatment B     20        40
Odds Ratios
Conditional odds ratios estimated same
way as for two-way table, conditioned on
the third variable

Marginal odds ratios estimated same way
as for two-way table using the marginal
table
Conditional Independence
Conditional independence refers to the independence of two of the variables (U, X) when the third variable (Y) is conditioned on
If the variable being controlled has K values, then there are K conditional tables
If independence holds in every one of the K tables, then U and X are called conditionally independent
Conditional and Marginal Independence
Marginal independence refers to the
independence of two of the variables when the
third variable is ignored (i.e. summed over).

Conditional Independence does not imply
Marginal Independence (nor vice-versa).
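
The clinic example from Table 2.11 illustrates this. A minimal Python sketch (helper names are illustrative) estimates the conditional odds ratio within each clinic and the marginal odds ratio after summing over clinic.

```python
# Partial tables: rows = treatment A, B; columns = success, failure
partial_tables = {
    "Clinic One": [[18, 12], [12, 8]],
    "Clinic Two": [[2, 8], [8, 32]],
}


def odds_ratio(table):
    """Sample odds ratio (a * d) / (b * c) for a 2 x 2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)


def collapse(tables):
    """Sum 2 x 2 partial tables cell by cell to get the marginal table."""
    tables = list(tables)
    return [[sum(t[i][j] for t in tables) for j in range(2)] for i in range(2)]


conditional = {clinic: odds_ratio(t) for clinic, t in partial_tables.items()}
marginal = collapse(partial_tables.values())

print(conditional)                    # odds ratio within each clinic
print(marginal, odds_ratio(marginal))  # marginal table and its odds ratio
```

Both conditional odds ratios equal 1, so treatment and response are independent within each clinic, yet the marginal odds ratio is 2: conditional independence holds without marginal independence.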
Homogeneous Association
When the conditional odds ratios are the same
for all K conditional tables then homogeneous
association exists
Homogeneous association is symmetric in the variables: it does not depend on which variable is conditioned on
If U and X show homogeneous association
conditioning on Y then:
X and Y show homogeneous association conditioning on U
U and Y show homogeneous association conditioning on X
Conditional Independence is an example of
homogeneous association

Simpson's Paradox
The association for the marginal table is not consistent with the association in the conditional tables; this is called Simpson's Paradox
Example: Gender, Admission Status, Department for UC Berkeley Graduate School
Example
UC Berkeley is often accused of gender discrimination in their graduate admissions
Fall 1973 data (old study)

          Admission
Gender    Admitted   NotAdmitted   Total
Male        3715        4727        8442
Female      1512        2809        4321
Total       5227        7536       12763

Row Probabilities
          Admission
Gender    Admitted   NotAdmitted   Total
Male        0.44        0.56         1
Female      0.35        0.65         1
Breakdown for six largest departments

          Men                            Women
Depart    Admit   NotAdmit   Total       Admit   NotAdmit   Total
A          512      313       825          89       19       108
B          353      207       560          17        8        25
C          120      205       325         202      391       593
D          138      279       417         131      244       375
E           53      138       191          94      299       393
F           22      251       273          24      317       341
Total     1198     1393      2591         557     1278      1835
Admitted   46%                            30%

Source: Statistics, 2nd Edition (1991), Freedman, Pisani, Purves, Adhikari
Data from Graduate Division of University of California, Berkeley

Look at the marginal table
$H_0: \pi_{ga} = \pi_{g+} \pi_{+a}$
$H_A: \pi_{ga} \ne \pi_{g+} \pi_{+a}$
$\alpha = 0.05$
Use Pearson's Chi-square, 1 d.f.
Decision rule: reject $H_0$ if $\chi^2 > 3.8416$
Data on the next slide
Marginal association

         Observed                        Expected
         Admit    Not     Total          Admit       Not         Total
Men      1198    1393     2591           1027.385    1563.615    2591
Women     557    1278     1835            727.6152   1107.385    1835
Total    1755    2671     4426           1755        2671        4426

$\chi^2$ = 113.244, with a p-value < 0.000001

Relative Risk (Men, Women) = 0.46237 / 0.303542 = 1.52
Odds Ratio = 0.8600 / 0.4358 = 1.973

Conclusion: From the marginal table it would seem that there is significant favoritism towards admitting men.
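
These marginal figures can be reproduced with a short Python sketch (standard library only; variable and function names are illustrative). For 1 degree of freedom the chi-square upper tail can be written with the complementary error function, which is what the sketch uses for the p-value.

```python
import math

# Marginal table for the six largest departments.
# Rows: Men, Women; columns: Admit, Not admitted
observed = [[1198, 1393], [557, 1278]]


def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    return stat


chi2 = pearson_chi_square(observed)
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square upper tail with 1 df

men_rate = observed[0][0] / sum(observed[0])
women_rate = observed[1][0] / sum(observed[1])
relative_risk = men_rate / women_rate
odds_ratio = (observed[0][0] / observed[0][1]) / (observed[1][0] / observed[1][1])

print(f"chi2 = {chi2:.3f}, p-value = {p_value:.2e}")
print(f"relative risk = {relative_risk:.2f}, odds ratio = {odds_ratio:.3f}")
```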
Conditioning on Department
Department A:

Observed
         Admit   Not   Total
Men       512    313    825
Women      89     19    108
Total     601    332    933

Row Probabilities
Men      0.620606   0.379394   1
Women    0.824074   0.175926   1

There is a significant association (p-value < 0.0001), but for this department the women have a higher probability of being admitted than the men do. This is the only department out of the six shown that shows a significant association.
Conditioning on Department
Department F:

Observed
         Admit   Not   Total
Men        22    251    273
Women      24    317    341
Total      46    568    614

Row Probabilities
Men      0.080586   0.919414   1
Women    0.070381   0.929619   1

There is not a significant association (p-value = 0.6631); this department does not appear to have a gender bias in admission. Departments B, C, D, and E also show non-significant associations.
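
The department-by-department comparison can be sketched in Python as follows. This is an illustration under one stated assumption: each department is tested with Pearson's chi-square against the 1 df cutoff of 3.8416 used for the marginal table. The lecture does not say which test produced its quoted per-department p-values, so no p-values are computed here.

```python
CRITICAL_1DF = 3.8416  # 0.05-level cutoff for 1 df, as used for the marginal table

# Per department: [men admitted, men not], [women admitted, women not]
departments = {
    "A": [[512, 313], [89, 19]],
    "B": [[353, 207], [17, 8]],
    "C": [[120, 205], [202, 391]],
    "D": [[138, 279], [131, 244]],
    "E": [[53, 138], [94, 299]],
    "F": [[22, 251], [24, 317]],
}


def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (o - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, o in enumerate(row)
    )


summary = {}
for dept, (men, women) in departments.items():
    summary[dept] = {
        "men_rate": men[0] / sum(men),
        "women_rate": women[0] / sum(women),
        "chi2": pearson_chi_square([men, women]),
    }

significant = [d for d, s in summary.items() if s["chi2"] > CRITICAL_1DF]

for dept, s in summary.items():
    print(f"{dept}: men {s['men_rate']:.3f}, women {s['women_rate']:.3f}, "
          f"chi2 = {s['chi2']:.2f}")
print("significant at 0.05:", significant)
```

Under that assumption only Department A exceeds the cutoff, and there the women's admission rate is the higher one.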
Explanation
Why does the marginal table indicate that women are being discriminated against, while the conditional tables say there is gender discrimination only for Dept A, and that it is against men?

The first two departments have high admission rates, and over 50% of the men applied to these. The last four departments have much lower admission rates, and over 90% of the women applied to these departments.
Department applicants by gender

          Applicants        Column Proportions
Depart    Men     Women     Men        Women
A          825     108      0.318410   0.058856
B          560      25      0.216133   0.013624
C          325     593      0.125434   0.323161
D          417     375      0.160942   0.204360
E          191     393      0.073717   0.214169
F          273     341      0.105365   0.185831
Total     2591    1835      1          1

          Men        Women
A&B       0.534543   0.072480
C-F       0.465457   0.927520
Total     1          1
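
The explanation can be made concrete with a small Python sketch (names are illustrative). It takes the applicant and admission counts from the six-department admissions breakdown, so Department F has 273 male applicants and the men's total is 2591. It then shows that each gender's overall admission rate is a weighted average of the department rates, with weights equal to the share of that gender's applicants in each department.

```python
# (admitted, applicants) per department, from the six-department breakdown
men = {"A": (512, 825), "B": (353, 560), "C": (120, 325),
       "D": (138, 417), "E": (53, 191), "F": (22, 273)}
women = {"A": (89, 108), "B": (17, 25), "C": (202, 593),
         "D": (131, 375), "E": (94, 393), "F": (24, 341)}


def total_applicants(group):
    return sum(applied for _, applied in group.values())


def share_in(group, depts):
    """Proportion of the group's applicants who applied to the given departments."""
    return sum(group[d][1] for d in depts) / total_applicants(group)


def overall_rate(group):
    """Admission rate ignoring department (the marginal rate)."""
    return sum(adm for adm, _ in group.values()) / total_applicants(group)


def weighted_rate(group):
    """Sum over departments of (applicant share) * (department admission rate)."""
    n = total_applicants(group)
    return sum((applied / n) * (adm / applied) for adm, applied in group.values())


for label, group in [("men", men), ("women", women)]:
    print(f"{label}: share applying to A&B = {share_in(group, ['A', 'B']):.4f}, "
          f"overall admission rate = {overall_rate(group):.4f}")
```

Because men are concentrated in the two high-admission departments and women in the four low-admission ones, the weights differ sharply by gender, and the marginal comparison reverses the pattern seen within departments.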
SAS examples from text
UCLA reference
http://www.ats.ucla.edu/stat/sas/examples/icda/default.htm
Wednesday
Start Chapter 3; HW 3 will be assigned
Quiz 1: ch1 and ch2 up through lecture 3
Closed book and closed notes
Please bring a calculator
I will provide paper