
Categorical Data Analysis

Sahadeb Sarkar
IIM Calcutta

• Slides Adapted from Prof Ayanendranath Basu’s Class-notes


• R Programs and Data Sets in Textbook (Tang, He & Tu):
http://accda.sph.tulane.edu/r.html
• Readings: Chapters 1-6, Text

1
Terminology
Discrete data: data with discrete outcomes, described by discrete distributions
• Categorical Data: Discrete data with finitely many
possible values on a nominal scale (e.g., the state a
person lives in, the political party one might vote for,
the blood type of a patient; Multinomial, Bernoulli
distribution). Central tendency given by its mode
• Count data (non-negative integer valued): Records
the frequency of an event, may not have an upper
bound (e.g., Poisson, Binomial, Negative Binomial
distributions). It arises from counting, not ranking.

2
Discrete Data Types
• Dichotomous data: can take only two values
such as “Yes” and “No”
• Nonordered polytomous data: five different
detergents
• Ordered polytomous data: grades A, B, C, D;
“old”, “middle-aged”, “young” employees

• Integer valued: nonnegative counts


3
Derivation Tools in CDA, Text p.18
Delta Method:
If $\hat{\theta}_n \xrightarrow{d} N\!\left(\theta,\ \tfrac{1}{n}\Sigma\right)$ and $g(\theta)$ is a $k \times 1$ differentiable
function of the $m \times 1$ parameter $\theta$, then

$g(\hat{\theta}_n) \xrightarrow{d} N\!\left(g(\theta),\ \tfrac{1}{n}\,D\,\Sigma\,D^{T}\right)$

where $D_{k \times m} = \frac{\partial}{\partial \theta^{T}}\,g(\theta)$

4
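A quick numerical illustration of the delta method (a sketch I have added, not from the text): for g(p) = ln(p/(1−p)), the approximation Var{g(p̂)} ≈ g′(p)² Var(p̂) = 1/{n p(1−p)} can be checked by simulation in R.

# Sketch (not from the text): delta-method variance check for g(p) = log(p/(1-p))
set.seed(1)
n <- 500; p <- 0.3
phat <- rbinom(10000, n, p) / n            # 10,000 simulated sample proportions
g <- function(q) log(q / (1 - q))          # g'(p) = 1/{p(1-p)}
var(g(phat))                               # simulated variance of g(phat)
(1 / (p * (1 - p)))^2 * p * (1 - p) / n    # delta-method value: g'(p)^2 * Var(phat)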
Derivation Tools in CDA, Text p.18
Slutsky’s Theorem:
Suppose $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} c$, a constant. Then
1. $X_n + Y_n \xrightarrow{d} X + c$
2. $Y_n X_n \xrightarrow{d} cX$
3. If c ≠ 0, $X_n / Y_n \xrightarrow{d} X/c$

5
Inference for One-way Frequency
Table

• Binary case (Sec 2.1.1, Text)


• Inference for Multinomial Variable (Sec 2.1.2)
• Inference for Count Variable (Sec 2.1.3)

R Programs and Data Sets in Textbook (Tang, He & Tu):


http://accda.sph.tulane.edu/r.html

6
Binomial Distribution
(leading to a One-Way Frequency Table)
Suppose Y is a random variable with 2 possible outcome
categories c1, c2 with probabilities π1, π2 = (1 − π1).
Suppose there are n observations on Y; we can summarize
the responses through the vector of observed frequencies
(random variables), (X1, X2 = n − X1).

Then (X1, X2 = n − X1) is said to have a Binomial distribution
with parameters n and (π1, π2 = 1 − π1), or simply X1 is said to
have a Binomial distribution with parameters n and π1.

$P(X_1 = x_1) = \frac{n!}{x_1!\,(n-x_1)!}\,\pi_1^{x_1}(1-\pi_1)^{n-x_1},\quad x_1 = 0, 1, \dots, n$

Then, E(X1) = nπ1, V(X1) = nπ1(1 − π1) < E(X1)

7
Example 1.1, p. 6, Text

What is Metabolic Syndrome ?

8
Metabolic syndrome
(https://en.wikipedia.org/wiki/Metabolic_syndrome)

Metabolic syndrome, sometimes known by other


names, is a clustering of at least three of the following
five medical conditions (giving a total of 16 possible
combinations presenting the syndrome):
• Abdominal (central) obesity
• High blood pressure
• High blood sugar
• High serum triglycerides
• Low high-density lipoprotein (HDL) levels

9
Example 1.1 (Binary Case), p. 37, Text
• Test if the prevalence of Metabolic Syndrome is 40% in this
study population
$Z = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0(1-\pi_0)/n}} = \frac{48/93 - 0.4}{\sqrt{0.4 \times 0.6/93}} = 2.286$;
P-value = 2{1 − Φ(2.286)} = 0.0223

• Construct a 95% Confidence Interval for the prevalence in this
population

$\hat{\pi} \pm Z_{\alpha/2}\sqrt{\hat{\pi}(1-\hat{\pi})/n} = \frac{48}{93} \pm 1.96\sqrt{\frac{48}{93}\left(1-\frac{48}{93}\right)/93} = [0.4146,\ 0.6177]$
10
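A minimal R sketch of the same test and interval (mine, not the textbook's code):

x <- 48; n <- 93; pi0 <- 0.4
phat <- x / n
Z <- (phat - pi0) / sqrt(pi0 * (1 - pi0) / n)                  # 2.286
2 * (1 - pnorm(abs(Z)))                                        # two-sided P-value, 0.0223
phat + c(-1, 1) * qnorm(0.975) * sqrt(phat * (1 - phat) / n)   # 95% CI: 0.4146 to 0.6177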
Negative Binomial Distribution (p. 41)
• A sequence of independent Bernoulli trials, each having two potential
outcomes, "success" and "failure". In each trial the probability of success
is p and of failure is (1 − p). Observe this sequence until a predefined
number r of failures has occurred. Then X = the number of successes
observed will have the negative binomial distribution:

• $P(X = k) = \frac{\Gamma(r+k)}{\Gamma(r)\,k!}\,(1-p)^{r}\,p^{k}, \quad k = 0, 1, 2, \dots$

• μ = E(X) = rp/(1 − p),  V(X) = rp/(1 − p)² > E(X).
  Put α = 1/r and μ = rp/(1 − p) for reparameterization.

• Note: $\binom{k+r-1}{k} = (-1)^{k}\binom{-r}{k}$

11
Negative Binomial Distribution (p. 41)

• $P(X = k) = \frac{\Gamma(r+k)}{\Gamma(r)\,k!}\,(1-p)^{r}\,p^{k}$ ……… (1a)
• E(X) = rp/(1 − p), V(X) = rp/(1 − p)² > E(X) …….. (1b)
• Extension through reparameterization:
  put α = 1/r (> 0), μ = rp/(1 − p) in (1)

• Then, $P(X = k) = \frac{\Gamma\!\left(\frac{1}{\alpha}+k\right)}{\Gamma\!\left(\frac{1}{\alpha}\right)\,k!}\left(\frac{1}{1+\alpha\mu}\right)^{1/\alpha}\left(\frac{\alpha\mu}{1+\alpha\mu}\right)^{k}$ ……… (2a)

• E(X) = μ; V(X) = μ + αμ² ……………………(2b)

12
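The reparameterization in (2a) can be checked numerically in R, whose dnbinom() supports both the (size, prob) and the (size, mu) forms; the values of r and p below are hypothetical:

# Sketch (not from the text): (1a) and (2a) give the same probabilities
r <- 4; p <- 0.6                        # hypothetical r and success probability
mu <- r * p / (1 - p); alpha <- 1 / r
k <- 0:5
dnbinom(k, size = r, prob = 1 - p)      # (1a): R's "success" = our "failure"
dnbinom(k, size = 1 / alpha, mu = mu)   # (2a): mean-dispersion form, same numbers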
Hypergeometric Distribution
• Randomly sample n elements from a finite (dichotomous)
population of size N, without replacement, having K
“success”-type and (N-K) “failure”-type elements. (e.g.
Pass/Fail or Employed/ Unemployed).
• The probability of a success changes on each draw, as each
draw decreases the population.
• X = number of successes in the sample. Then X has the
hypergeometric distribution:
  $P(X = x) = \binom{K}{x}\binom{N-K}{n-x}\Big/\binom{N}{n}$

• E(X) = n(K/N), V(X) = {n(K/N)(1 − K/N)} × [(N − n)/(N − 1)]

13
Multivariate Hypergeometric Distribution
• Randomly sample n elements from a finite (polytomous)
population of size N, without replacement, having K1, K2, ..., Kc
elements of types 1, 2, …, c.
• Xi= number of i-th type elements in the sample, i=1,…,c. Then
X has multivariate hypergeometric distribution:
$P(X_i = x_i,\ i = 1, \dots, c) = \frac{\prod_{i=1}^{c}\binom{K_i}{x_i}}{\binom{N}{n}}$
• E(Xi)=n(Ki/N),
• V(Xi) = {n(Ki/N)(1 – (Ki/N) )}×[(N-n)/(N-1)]
• Cov(Xi, Xj) = {n(Ki/N)(Kj/N) }×[(N-n)/(N-1)]

14
Inference for Multinomial Case

15
Multinomial Distribution
(may lead to One-Way, Two-Way, … Frequency Tables)
Suppose Y is a random variable with k possible
outcome categories c1,c2,…,ck with probabilities π1,
π2,…, πk=(1- π1-…- πk-1).
Suppose there are n observations on Y; we can
summarize the responses through the vector of
observed frequencies (random variables), X = (X1,
X2,…, Xk), where Xk=n- X1-…- Xk-1.

Then X = (X1, X2,…, Xk) is said to have a multinomial


distribution with parameters n and (π1, π2,…, πk ).
$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\,\pi_1^{x_1}\cdots\pi_k^{x_k}$

16
Multinomial Distribution
(may lead to One-Way, Two-Way, … Frequency Tables)
X = (X1, X2,…, Xk) has a multinomial distribution
with parameters n and (π1, π2,…, πk ).
$P(X_1 = x_1, \dots, X_k = x_k) = \frac{n!}{x_1!\,x_2!\cdots x_k!}\,\pi_1^{x_1}\cdots\pi_k^{x_k}$
$E(X_i) = n\pi_i$,

$V(X_i) = n\pi_i(1-\pi_i)$;  $Cov(X_i, X_j) = -n\pi_i\pi_j$, i ≠ j

MLE of $\pi_i$ = $X_i/n$ (Prove it, Exercise)

Hint: maximize the Lagrangian
$L(\pi_1, \dots, \pi_k, \lambda) = \text{log-likelihood} + \lambda\left(1 - \sum_{i=1}^{k}\pi_i\right)$

17
Example 1.1, p. 6, Text
One-Way Frequency Table for Metabolic Syndrome Study
MS
Present Absent Total
48 45 93

Two-Way Frequency Table for Metabolic Syndrome Study


MS
Gender Present Absent Total
male 31 31 62
female 17 14 31
Total 48 45 93
18
Pearson’s Chi-square (χ2) Test
H0: πi = π0i,  i = 1, …, k    (1)

The fit of the model is assessed by comparing the


frequencies expected in each cell, against the observed
frequencies. If there is substantial discrepancy between
the observed frequencies and those expected from the
null model, then it would be wise to reject the null model.
The best known goodness-of-fit statistic used to test the
hypothesis in (1) is the Pearson’s Chi-Square (PCS):
PCS, $\chi^2 = \sum_{i=1}^{k}\frac{(\mathrm{Observed}_i - \mathrm{Expected}_i)^2}{\mathrm{Expected}_i} = \sum_{i=1}^{k}\frac{(X_i - n\pi_{0i})^2}{n\pi_{0i}}$

19
Example: Pearson’s χ2 Test
When we are trying to do a test of hypothesis to determine
whether a die is a fair die, it is a simple hypothesis.
Suppose we roll it 120 times and the summarized data
are as follows:
In this case, k=6 and n=120. H0: πi = 1/6 (= π0i), i=1,2,…,6

20
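As an illustration (the slide's frequency table is not reproduced here, so the counts below are hypothetical), the test can be run in R with chisq.test():

obs <- c(18, 23, 16, 21, 18, 24)     # hypothetical face counts summing to 120
chisq.test(obs, p = rep(1/6, 6))     # Pearson chi-square test of a fair die, df = 5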
Pearson’s Chi-Square (contd.)
The hypothesis presented in Equation (1) is an
example of a simple hypothesis. (Simple in the sense
that the hypothesis completely specifies the true
distribution).

The hypothesis becomes composite when the null is


not completely spelt out, but is specified in terms of
d parameters (d < k − 1).

21
Multinomial Example, p.38,Text
Multinomial Case:
Depression Diagnosis in the DOS Study
Major Dep Minor Dep No Dep Total
128 136 481 745
DOS = Depression Of Seniors

Test H0: P(No Dep) = 0.65, P(Minor Dep) = 0.2, P(Major Dep) = 0.15
Here, k=3, n=745
$PCS = \frac{(481-484.25)^2}{484.25} + \frac{(136-149)^2}{149} + \frac{(128-111.75)^2}{111.75} = 3.519$
df= (no. of categories -1) = 2; P-value= 0.1721
22
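The same calculation in R (a sketch, not the textbook's code):

obs <- c(128, 136, 481)                    # Major, Minor, No Depression
chisq.test(obs, p = c(0.15, 0.20, 0.65))   # X-squared = 3.519, df = 2, P-value = 0.172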
Example 2.2, p.4, p.38, Text

Conclusion: The null hypothesis claim appears to be true
23


Testing Composite Hypothesis
in Inference for Count Data

24
Poisson Distribution Case
Suppose Y is a random variable taking integer values y = 0,
1, 2, …, with probability $P(Y = y) = e^{-\lambda}\frac{\lambda^{y}}{y!}$
𝑦!
Suppose there are n observations on Y; we can summarize
the observations through the vector of observed
frequencies for value-categories 0, 1, 2, …

Suppose all counts ≥ 6 are combined to make combined


frequency more than 5. Then with k=7, value-categories
0, 1, 2, 3, 4, 5, and ≥6 (say) observed frequencies X =
(X1, X2,…, Xk), where Xk=n- X1-…- Xk-1, has a multinomial
distribution with parameters n and (π1, π2,…, πk ), where
π1 = P(Y=0), π2 = P(Y=1),…, πk=P(Y≥6).
25
Example 2.3, p.42, Text

Exc: Check MLE of λ = 9.1


Conclusion: Null hypothesis claim appears to be false (df = 7 − 1 − 1)
26


MLE of  = 9.1 ?
$L(\theta) = \left(e^{-\theta}\right)^{n_0}\left(e^{-\theta}\theta\right)^{n_1}\left(e^{-\theta}\frac{\theta^2}{2}\right)^{n_2}\cdots\left(e^{-\theta}\frac{\theta^5}{120}\right)^{n_5} \times \left(1 - e^{-\theta} - e^{-\theta}\theta - e^{-\theta}\frac{\theta^2}{2} - \cdots - e^{-\theta}\frac{\theta^5}{120}\right)^{n_{6+}}$

where ny = observed frequency of count y and n6+ = observed frequency of counts ≥ 6 (the exponents visible on the slide, e.g., 32 and 41, are these observed frequencies)
Maximize this function 𝐿 𝜃 w.r.t. 𝜃
Need to do numerical maximization
Do it as an Exercise

27
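A sketch of the numerical maximization in R (the frequencies below are hypothetical placeholders, since the slide's data table is not reproduced here):

n.y <- c(32, 4, 9, 11, 10, 8)    # hypothetical frequencies of counts 0..5
n.6plus <- 41                    # hypothetical frequency of counts >= 6
negloglik <- function(theta) {
  p <- dpois(0:5, theta)                              # cell probabilities for 0..5
  -(sum(n.y * log(p)) + n.6plus * log(1 - sum(p)))    # minus grouped log-likelihood
}
optimize(negloglik, interval = c(0.01, 30))$minimum   # numerical MLE of theta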
Intentionally Kept Blank

28
Sampling Schemes
Leading to (2×2) Contingency Tables

29
Layout of the 2×2 table
                                  Column factor (‘Response’)
                                  Level 1       Level 2       Row Total
Row factor        Level 1         n11           n12           R1 = n1+
(‘Explanatory’)   Level 2         n21           n22           R2 = n2+
Column Marginal   Totals          C1 = n+1      C2 = n+2      T = n (Grand Total)
30
Sampling schemes
leading to 2×2 contingency tables

Sampling scheme     Marginal Total fixed in advance
Poisson             None
Multinomial         Grand Total (Sample size)
Prospective         Row (explanatory) total
Retrospective       Column (Response) total

31
Poisson Sampling
• Poisson Sampling (French mathematician Simeon
Denis Poisson): Here a fixed amount of time (or space,
volume, money etc.) is employed to collect a random
sample from a single population and each member of
the population falls into one of the four cells in the
2×2 table.
• In the CVD Death example 1 (next slide), researchers
spent a certain amount of time sampling the health
records of 3112 women, who were categorized as
obese or non-obese and by whether or not they died of
CVD. In this case, none of the marginal totals or the
sample size was known in advance.
32
Example-1: Cardio-Vascular Deaths and Obesity among
women in American Samoa

[7.76 (=16/2061) observed deaths versus 6.66 (=7/1051) deaths per


thousand.]
Test equal proportions of CVD deaths in populations of obese and
nonobese Samoan women
This is an “Observational Study“, an example of “Poisson Sampling”
[Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.] 33
Multinomial Sampling
• This is same as the Poisson sampling scheme except
for the fact that here the overall sample size is
predetermined, and not the amount of time for
sampling (or space or volume or money etc.)
• If in the CVD Death Example 1, researchers decided to
sample the health records of exactly 3112 women
and then note (i) who were obese and non obese and
(ii) who died of CVD or did not die of CVD, then it
would have been multinomial sampling.

34
Prospective Product Binomial Sampling
• Prospective Product Binomial Sampling
(“cohort” study): First identify explanatory variable(s)
that explain “causation”. The population is categorized according
to the levels of the explanatory variable, and random samples are then
selected from each explanatory group.
If separate lists of obese and non obese American Samoan
women were available in Example 1, a random sample of
2500 could have been selected from each. The term Binomial
refers to the dichotomy of the explanatory variable. The term
Product refers to the fact that sampling is done from more
than one population independently.

35
Example-2: Vitamin-C versus Common Cold
Outcome

COLD NO COLD TOTAL


PLACEBO 335 76 411
VITAMIN-C 302 105 407
TOTAL 637 181 818

Testing equal proportions of Colds in populations of


Placebo and Vitamin-C takers. One sided P-value for this
example is 0.0059 [Observed proportion 82% versus 74%]
This is a ‘Double-Blind Randomized’ Study (not just
Observational) [Ramsey and Schafer]
36
Retrospective Product Binomial Sampling

• Retrospective Product Binomial Sampling


(“Case-Control” study): This sampling scheme is
technically the same as the previous one. However, the roles
of the response and the explanatory factors are
reversed. In this scheme, we categorize the
population according to the identified response
levels and random samples are selected from each
response group.

37
Example 3: Smoking versus Lung Cancer
Outcome
CANCER CONTROL TOTAL

SMOKER 83 72 155
NON-SMOKER 3 14 17
TOTAL 86 86 172

Testing equality of proportions of smokers in


populations of cancers and non-cancers
(Homogeneity).

A retrospective observational study [Ramsey and Schafer]

38
Retrospective Product Binomial
Sampling
• We cannot test for the equality of proportions along the
explanatory variable if the sampling scheme is
retrospective.
• We only get odds ratio from a case control study which is
an inferior measure of strength of association as
compared to relative risk.
• Why do retrospective sampling at all, then?
Compared to prospective cohort studies, they tend to be less
costly and shorter in duration. Case-control studies are often
used in the study of rare diseases, or as a preliminary study
where little is known about the association between a possible
risk factor and the disease of interest.

39
Retrospective Product Binomial
Sampling (Continued)
• If the probabilities of the “Yes” response are very
small, it may need a huge sample size to get any
“Yes” response at all through prospective sampling.
• Retrospective sampling guarantees that we have at
least a reasonable number of “Yes” responses for
each level of explanatory variable.
• In the smoking versus lung cancer study (Example 3),
retrospective sampling may be accomplished without
having to follow the subjects throughout their
lifetime.

40
Prospective: subjects are selected according to the levels of the Explanatory Variable.

Retrospective: subjects are selected according to the levels of the Response Variable.

41
Layout of the 2×2 table
                                  Column factor (Response)
                                  Level 1       Level 2       Row Total
Row Factor        Level 1         n11           n12           R1 = n1+
(Explanatory)     Level 2         n21           n22           R2 = n2+
Column Marginal   Totals          C1 = n+1      C2 = n+2      T = n (Grand Total)
42
Estimated Proportions
• Proportion of “Yes” (Level 1) response in the
first level of the explanatory variable is

  π̂1 = n11 / R1

• Similarly, the proportion of “Yes” response in
the second level of the explanatory variable is

  π̂2 = n21 / R2
43
Assumption
• We will assume that the frequencies of all the entries
in the 2x2 table are greater than 5.
• This ensures that the “asymptotic tests” performed
on the 2x2 tables are reasonably accurate.
(“asymptotic” means ‘appropriate in large samples’)

• If all the entries in the 2x2 table are not greater than
5, one may try Fisher’s Exact test.

44
Example-1: Cardio-Vascular Deaths and Obesity among
women in American Samoa

[7.76 observed deaths versus 6.66 deaths per thousand.]


Testing equal proportions of CVD deaths in populations of obese
and nonobese Samoan women.
This is an “Observational Study“, an example of “Poisson Sampling”
[Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.]
45
Pearson’s Chi-square (PCS) Test
$PCS = \sum_{\text{category } c}\frac{(O_c - E_c)^2}{E_c}$
where Oc = observed count in category c, Ec = expected
count in category c, as per proposed model.

H0: Proposed model generated the observed data


Ha: Proposed model did not generate the data
If H0 is true then PCS has a chi-square distribution with
appropriate degrees of freedom (df).
46
Chi-square Distribution
Let Z1, …, Zk be independent random variables each
having the N(0,1) distribution. Then Z1² + … + Zk² is said to
follow the chi-square (χ²k) distribution with k degrees of freedom (df).

Result: The expected value and variance of a chi-square
(χ²k) random variable are given by: E(χ²k) = k (= df);
Var(χ²k) = 2k (= 2·df).

For given k and α, let χ²k,α denote the real number which is
exceeded with probability α by a χ²k random variable.
47
Chi-square Distribution

48
Calculations

DF = 1; Two-sided P-value = 1-CHISQ.DIST(0.115,1,TRUE)


One-sided P-value = 1-NORM.S.DIST(0.34,TRUE)
49
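The same P-values in R instead of Excel (my sketch):

1 - pchisq(0.115, df = 1)   # two-sided P-value from the chi-square statistic
1 - pnorm(0.34)             # one-sided P-value from the Z statistic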
Example-2: Vitamin-C versus Common Cold
Outcome

COLD NO COLD TOTAL


PLACEBO 335 76 411
VITAMIN-C 302 105 407
TOTAL 637 181 818

Testing equal proportions of colds in populations of Placebo


and Vitamin-C takers. One sided P-value for this example is
0.0059 [Observed proportion 82% versus 74%]

A randomized study, also double blind [Ramsey and Schafer]
50
Calculations

51
Example 3: Smoking versus Lung Cancer
Outcome
CANCER CONTROL TOTAL

SMOKER 83 72 155
NON-SMOKER 3 14 17
TOTAL 86 86 172

Testing equal proportions of smokers in


populations of cancers and non-cancers. One
sided p-value = 0.0025
A retrospective observational study, Retrospective
Sampling [Ramsey and Schafer]
52
Calculations

53
Intentionally Kept Blank

54
Exact Test: Independence of Two Attributes
• Example: Data collected on a random sample of
people attending preview of a movie

• Did the movie have equal appeal to the young and the old, or was it more liked by the young?
• Test H0: two attributes are independent against Ha:
they are positively associated.
55
Exact Test: Independence of Two Attributes
• To test if two qualitative characters (attributes) A and B
are independent. Let P(A=Ai, B=Bj) = pij, i=1,…,k, j=1,…,l.
• Let $P(A = A_i) = \sum_{j=1}^{l} p_{ij} = p_{i0}$; let $P(B = B_j) = \sum_{i=1}^{k} p_{ij} = p_{0j}$
• To test H0: $p_{ij} = p_{i0}\,p_{0j}$, for all i, j.
• nij = observed freq for cell AiBj. The marginal frequencies
of Ai and Bj are $n_{i0} = \sum_{j=1}^{l} n_{ij}$ and $n_{0j} = \sum_{i=1}^{k} n_{ij}$

56
Exact (Conditional) Test: Independence of
Two Attributes
• To test if two qualitative characters (attributes) A and B are
independent. Let P(A=Ai, B=Bj) = pij, i=1,…,k, j=1,…,l.
• To test H0: 𝑝𝑖𝑗 = 𝑝𝑖0 𝑝0𝑗 , for all i,j.
• nij = observed freq for cell AiBj. The marginal frequencies of Ai and Bj
are $n_{i0} = \sum_{j=1}^{l} n_{ij}$ and $n_{0j} = \sum_{i=1}^{k} n_{ij}$
• Under H0, the conditional distribution of {nij, all i, j} given the current
sample marginals {ni0, n0j, all i, j} has the (multivariate
hypergeometric) pmf

  $P(\{n_{ij}\} \mid \{n_{i0}, n_{0j}\}) = \frac{\prod_{i} n_{i0}!\ \prod_{j} n_{0j}!}{n!\ \prod_{i,j} n_{ij}!}$

57
Exact (Conditional) Test: Independence of
Two Attributes
• Add up probabilities, under H0, of the given table and of
those indicating more extreme positive association (and
having the same marginals). These tables and
corresponding probabilities are:

• So, P-value = 0.0198 < 0.05 ⇒ Ha seems to be true


58
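In R, Fisher's exact test for positive association is available via fisher.test(); the counts below are hypothetical since the slide's movie table is not reproduced here:

tab <- matrix(c(9, 1, 3, 7), nrow = 2, byrow = TRUE,
              dimnames = list(Age = c("Young", "Old"), Liked = c("Yes", "No")))
fisher.test(tab, alternative = "greater")   # one-sided exact test of positive association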
Intentionally Kept Blank

59
Homogeneity versus Independence
Hypotheses
• Hypothesis of homogeneity
H0: π1 = π2
Not done in Retrospective Product Binomial Sampling

• Hypothesis of Independence
(At this stage qualitatively expressed)
Done only in Poisson or Multinomial Sampling

60
Homogeneity versus Independence
Hypotheses (contd.)
• The hypothesis of independence is used to
investigate an association between row and column
factors without specifying one of them as a
response. Although the hypotheses may be
expressed in terms of parameters, it is more
convenient to use the qualitative wording:
• H0: The row categorization is independent of the
column categorization

61
Sampling scheme versus Hypotheses
Sampling scheme   Marginal Total fixed in advance   Usual Hypothesis:   Usual Hypothesis:
                                                    Independence        Homogeneity
Poisson           None                              YES                 YES
Multinomial       Grand Total (Sample size)         YES                 YES
Prospective       Row (explanatory) total                               YES
Retrospective     Column (Response) total                               YES (through “Odds Ratio” only)

62
Inference for 22 Table
(Sec 2.2, Text)

Measures of Association:
• (i) Relative Risk (or Incidence Rate Ratio or
‘Probability Ratio’)
• (ii) Difference Between Proportions,
• (iii) Odds Ratio

63
Is “Tutoring” Helpful in a Business Stat Course?

                Success    Failure    Row marginal
Tutoring        a          b          (a+b)
No Tutoring     c          d          (c+d)
Col. marginal   (a+c)      (b+d)      n = (a+b+c+d)

Estimated Risk Ratio = $\frac{a/(a+b)}{c/(c+d)} = \frac{a(c+d)}{(a+b)c} = \frac{ad+ac}{bc+ac}$,
where π̂1 = a/(a+b), π̂2 = c/(c+d)

Estimated Odds Ratio = $\frac{a/b}{c/d} = \frac{ad}{bc}$
64
Relative Risk vs Odds Ratio

• Relative risk tells how much ‘risk’ (probability) is increased or


decreased from an initial level. It is readily understood. A
relative risk of 0.5 means the initial risk has halved. A relative
risk of 2 means initial risk has increased twofold.
• Odds ratio is simply the ratio of odds in two groups of interest.
If the odds ratio is less than one then the odds (and therefore
the risk too) has decreased, and if the odds ratio is greater
than one then they have increased. But by how much?
• How to interpret an odds ratio of, say, 0.5 or an odds ratio of
2? Lack of familiarity with odds implies no intuitive feel for the
size of the difference when expressed in this way.

65
Layout of the 2×2 table
                                  Column factor (Response)
                                  Level 1       Level 2       Row Total
Row Factor        Level 1         n11           n12           R1 = n1+
(Explanatory)     Level 2         n21           n22           R2 = n2+
Column Marginal   Totals          C1 = n+1      C2 = n+2      T = n (Grand Total)
66
(i) Relative Risk (RR) or Incidence Rate Ratio (IRR)
(Text, p.53)

(Population proportion p, also denoted by Greek letter π)


• The relative risk (RR) of response Y=1 of the population
X=1 to population X=0 is the ratio of two population
proportions:
• RR = $\frac{P(Y=1\mid X=1)}{P(Y=1\mid X=0)} = \frac{\pi_1}{\pi_2}$
• RR > 1 means the probability of response is larger in
Population X=1 than in Population X=0
• Estimate of RR: $\widehat{RR} = \frac{n_{11}/n_{1+}}{n_{21}/n_{2+}}$

67
Confidence Intervals for Relative Risk (RR)
(Text, p.54)
• Estimate of RR (π̂1/π̂2): $\widehat{RR} = \frac{n_{11}/n_{1+}}{n_{21}/n_{2+}}$
• Estimate of the “asymptotic” variance of loge(RR):
  $\widehat{Var}(\log_e \widehat{RR}) = \frac{1-\hat\pi_1}{n_{11}} + \frac{1-\hat\pi_2}{n_{21}}$  [the text’s n22 should be n21]

• 100(1−α)% CI for RR:
  $\widehat{RR}\,\exp\!\left(-Z_{\alpha/2}\sqrt{\widehat{Var}(\log_e \widehat{RR})}\right)$ to
  $\widehat{RR}\,\exp\!\left(Z_{\alpha/2}\sqrt{\widehat{Var}(\log_e \widehat{RR})}\right)$

Zα = 100(1−α)-th percentile of the N(0,1) distribution

Note: RR cannot be estimated with retrospective sampling


68
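A small R helper implementing this interval (a sketch of mine, not the textbook's code), applied to the CVD data (16 deaths among 2061 obese and 7 among 1051 nonobese women):

rr.ci <- function(n11, n12, n21, n22, conf = 0.95) {
  p1 <- n11 / (n11 + n12); p2 <- n21 / (n21 + n22)
  rr <- p1 / p2
  se <- sqrt((1 - p1) / n11 + (1 - p2) / n21)   # SE of log(RR)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(RR = rr, lower = rr * exp(-z * se), upper = rr * exp(z * se))
}
rr.ci(16, 2045, 7, 1044)   # CVD example: RR approx. 1.17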
Difference Between two Proportions
• Interpreting the difference between two proportions
may not always be easy.
• Two proportions π1 = 0.5 and π2 = 0.45 have the same
difference as π1 = 0.1 and π2 = 0.05 (even though in
the second case one is twice the other). This is when
relative risk is a better measure.
• An alternative to comparing proportions (i.e., π1
versus π2) is to compare the corresponding odds (i.e.,
𝜔1 = π1/(1 − π1) versus 𝜔2 = π2/(1 − π2)).

69
Confidence Interval for π1  π2
• Estimate of π1 − π2: $\hat\pi_1 - \hat\pi_2 = \frac{n_{11}}{n_{1+}} - \frac{n_{21}}{n_{2+}}$
• $Var(\hat\pi_1 - \hat\pi_2) = \frac{\pi_1(1-\pi_1)}{n_{1+}} + \frac{\pi_2(1-\pi_2)}{n_{2+}}$
• s.e.$(\hat\pi_1 - \hat\pi_2) = \sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n_{1+}} + \frac{\hat\pi_2(1-\hat\pi_2)}{n_{2+}}}$
• 100(1−α)% CI for π1 − π2:

  $(\hat\pi_1 - \hat\pi_2)\ \pm\ Z_{\alpha/2}\sqrt{\frac{\hat\pi_1(1-\hat\pi_1)}{n_{1+}} + \frac{\hat\pi_2(1-\hat\pi_2)}{n_{2+}}}$

where Zα/2 = 1.96 for α = .05

70
Testing H0: π1  π2 = 0
• Estimate of π1 − π2: $\hat\pi_1 - \hat\pi_2 = \frac{n_{11}}{n_{1+}} - \frac{n_{21}}{n_{2+}}$
• Pooled estimate: $\hat\pi = \frac{n_{11}+n_{21}}{n_{1+}+n_{2+}}$
• $\widehat{Var}(\hat\pi_1 - \hat\pi_2) = \hat\pi(1-\hat\pi)\left(\frac{1}{n_{1+}} + \frac{1}{n_{2+}}\right)$
• Test statistic $Z = \frac{\hat\pi_1 - \hat\pi_2}{\sqrt{\hat\pi(1-\hat\pi)\left(\frac{1}{n_{1+}} + \frac{1}{n_{2+}}\right)}}$ is asymptotically
N(0,1), under H0, if n1+, n2+ are ‘large’

Zα/2 = 1.96 for α = .05

71
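In R, prop.test() reproduces this Z-test (its chi-squared statistic equals Z²); a sketch using the Vitamin C data:

prop.test(x = c(335, 302), n = c(411, 407), correct = FALSE)
# one-sided P-value: pnorm(Z, lower.tail = FALSE) with Z = sqrt(statistic)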
Exact Test of Two Proportions
• Example. Compare two methods of treatment of an allergy.
Method 1 (A) uses 15 patients and Method 2 (B) uses 14. Is
Method 2 better than Method 1?

• Here n1+ = 15, n2+ = 14, n11 = 6, n21 = 11 and Ha: p1 < p2. Here the
sample sizes are not large, hence asymptotic tests are not
applicable. Need to use exact tests.

72
Exact (Conditional) Test of Two Proportions
(GGD, Fundamentals, Vol 1)

• Consider two populations for which the proportions of subjects
with a certain characteristic are p1 and p2. Random
samples of sizes n1 (same as the n1+ notation) and n2
(same as the n2+ notation) are drawn independently from
the two populations. Let X1 and X2 denote the numbers of
members having the characteristic in the samples.
• Want to test H0: p1 = p2 (=p, unknown)
• Make use of the statistics X1, X2, but concentrate on
samples for which X=X1+X2 is fixed, same as observed
sum (x1+x2).
73
Exact Test of Two Proportions
• The conditional pmf of X1 given X = x1 + x2 = x is
(hypergeometric, free of the unknown common p)

  $f(x_1 \mid x) = \binom{n_1}{x_1}\binom{n_2}{x - x_1}\Big/\binom{n_1+n_2}{x}$

• If the observed value of X1 is x10 and that of X is x0, then
use the conditional pmf of X1, f(x1 | x0), for testing H0.

74
Exact Test of Two Proportions

• H0: p1 = p2 against Ha: p1 > p2: the P-value is
computed by

  $P\text{-value} = \sum_{x_1 \ge x_{10}} f(x_1 \mid x_0)$

• H0: p1 = p2 against Ha: p1 < p2: the P-value is
computed by

  $P\text{-value} = \sum_{x_1 \le x_{10}} f(x_1 \mid x_0)$

75
Example: Exact Test of Two Proportions
• Example. Compare two methods of treatment of an allergy.
Method 1 (A) uses 15 patients and Method 2 (B) uses 14. Is
Method 2 better than Method 1?

• Here n1 = 15, n2 = 14, x0 = 17, x10 = 6 and Ha: p1 < p2

76
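A sketch of the same exact P-value in R: under H0, X1 | X = 17 is hypergeometric, so

phyper(6, m = 15, n = 14, k = 17)   # P(X1 <= 6 | X = 17), the one-sided P-value
# equivalently:
# fisher.test(matrix(c(6, 9, 11, 3), 2, byrow = TRUE), alternative = "less")$p.value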
(iii) Odds, and Odds Ratio
Odds of an outcome: Let π be the population
proportion of “YES” outcomes. Then the
corresponding odds is given by

ω = π/(1 − π)

The sample odds is given by

ω̂ = π̂/(1 − π̂)

77
(iii) Odds, and Odds Ratio (contd)
πi = population proportion of “YES” response for
Group X = i. Then the odds of “YES” happening is given
by: ωi = πi/(1 − πi), 0 ≤ ωi < ∞.
The sample odds of “YES” in Group i give the
estimate: ω̂i = π̂i/(1 − π̂i).
Odds Ratio of “YES” response in Group 1 to that in
Group 2:
$\varphi = \frac{\omega_1}{\omega_2} = \frac{\pi_1}{1-\pi_1} \times \frac{1-\pi_2}{\pi_2}$

78
Odds versus Probabilities
Given the probability π of a “YES” outcome, the
corresponding odds is given by

ω = π/(1 − π)

Similarly, given the odds ω of a “YES” response, the
corresponding probability π is given by

π = ω/(1 + ω)

79
Odds versus Probabilities (contd.)
Interpretation: An event with chance of
occurrence 0.95 means the event has odds of 19
to 1 in favour of its occurrence while an event with
chances 0.05 has the same odds 19 to 1, against it.

We generally express the larger number first.

80
Relation between Probability, Odds & Logit
Probability   Odds   Log(Odds) = Logit
0             0      NC
0.1           0.11   −2.20
0.2           0.25   −1.39
0.3           0.43   −0.85
0.4           0.67   −0.41
0.5           1.00    0.00
0.6           1.50    0.41
0.7           2.33    0.85
0.8           4.00    1.39
0.9           9.00    2.20
1             NC     NC

Odds maps probability from [0,1] to [0,∞) asymmetrically, while Logit maps it to (−∞, ∞) symmetrically. (NC = not computable)

81
Example: NFL Football
TEAM ODDS against (Prob of Win)
San Francisco 49ers Even (1/2)
Denver Broncos 5 to 2 (2/7)
New York Giants 3 to 1 (1/4)
Cleveland Browns 9 to 2 (2/11)
Los Angeles Rams 5 to 1 (1/6)
Minnesota Vikings 6 to 1 (1/7)
Buffalo Bills 8 to 1 (1/9)
Pittsburgh Steelers 10 to 1 (1/11)

Total probability is 1.73!!


[Christensen]
82
Odds versus Probabilities (contd.)
Some facts:
1. Odds must be greater than or equal to zero but
have no upper limit.
2. Odds are not defined for the proportions that are
exactly 0 or 1
3. If the odds of a “YES” outcome is ω, then the odds
of a “NO” is 1/ω

83
The Following are Equivalent
• The proportions π1, π2 are equal.

• The odds are equal.

• The odds ratio is equal to 1.

• The log(odds ratio) is equal to 0.

84
Confidence Intervals for Odds Ratio (OR)
(Text, p.52)
• Estimate of OR: $\widehat{OR} = \frac{n_{11}\,n_{22}}{n_{21}\,n_{12}}$
• Estimate of the “asymptotic” variance of loge(OR):
  $\widehat{Var}(\log_e \widehat{OR}) = \frac{1}{n_{11}} + \frac{1}{n_{22}} + \frac{1}{n_{21}} + \frac{1}{n_{12}}$
• 100(1−α)% CI for OR:
  $\widehat{OR}\,\exp\!\left(-Z_{\alpha/2}\sqrt{\widehat{Var}(\log_e \widehat{OR})}\right)$ to
  $\widehat{OR}\,\exp\!\left(Z_{\alpha/2}\sqrt{\widehat{Var}(\log_e \widehat{OR})}\right)$

Note: unlike RR, the OR can be estimated even with retrospective sampling


85
Test for Homogeneity
• Hypothesis of homogeneity
H0: π1 = π2

• Alternatively,
H0 : ω1 = ω2, or H0: φ = 1, or H0: log(φ) = 0

86
Odds Ratio (Contd.)
Interpretation:
If the odds ratio φ = ω1/ω2 equals 4, then ω1 = 4ω2.
This means that the odds of a “yes” outcome in the
first group are four times the odds of a “yes” outcome in
the second group.

87
Advantages of Odds Ratio over
Risk Ratio or Difference of Proportions
1. The estimate of the Odds Ratio (OR) remains invariant over
the sampling design (i.e., it works even in the case of
retrospective sampling), and it is given by
$\widehat{OR} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$, since

$\varphi = \frac{P(Y=1\mid X=1)/P(Y=0\mid X=1)}{P(Y=1\mid X=0)/P(Y=0\mid X=0)} = \frac{P(Y=1,X=1)\,P(Y=0,X=0)}{P(Y=0,X=1)\,P(Y=1,X=0)} = \frac{P(X=1\mid Y=1)/P(X=0\mid Y=1)}{P(X=1\mid Y=0)/P(X=0\mid Y=0)}$

2. Comparison of odds extends nicely to regression
analysis when the response (Y) is a categorical variable. 88
Computation of odds ratio in a 2x2 table
Cold No Cold

Placebo 335 76

Vitamin C 302 105

Odds ratio = (335)(105)/(302)(76) = 4.41/2.88 = 1.53

Calculate the odds ratio by dividing the product of the diagonal elements of the
table by the product of the off-diagonal elements of the table.

The above result indicates that the odds of getting a cold on the placebo
treatment are 1.53 times the odds of getting a cold on the vitamin C
treatment.
89
Example: Computation of odds ratio
Cancer Control

Smoker 83 72

Non-Smoker 3 14

Odds ratio = (83)(14)/(3)(72) = 5.38

Calculate the odds ratio by dividing the product of the diagonal elements of the
table by the product of the off-diagonal elements of the table.

The above result indicates that the odds of cancer for a smoker are
5.38 times the odds of cancer for a non-smoker.

90
Sampling Distribution of the
Loge of Estimated Odds Ratio

Let  be the odds ratio. Then it can be shown that for


the estimated odds ratio 𝜑, using DELTA method,
1 1
ln(𝜑) ~ N 𝑙𝑛 (𝜑), +
𝑛1+ 𝜋1 (1− 𝜋1 ) 𝑛2+ 𝜋2 (1− 𝜋2 )
for large samples, where ‘ln’ denotes loge

91
Two Formulae of Standard Errors for the
Loge of Odds Ratio
• The estimated variance is obtained by substituting
sample quantities for unknowns in the variance
formula of the estimator. The sample quantities used
to replace the unknowns depend on the usage.
– For a confidence interval, π1 and π2 are replaced by their
individual sample estimates.
– For the test of hypothesis, they are replaced by their
pooled sample estimate from the combined sample.

92
• Testing: The odds are equal ⇔ the odds ratio = 1 ⇔ ln(odds ratio) = 0.
– If the sample sizes are large, the resulting P-value for testing
ln(ω1/ω2) = 0 is nearly identical to that obtained with
the Z-test for equal proportions (π1 = π2).

• Confidence interval for odds ratio (OR):


– Construct a confidence interval for log(odds ratio) and
take the antilogarithm of the endpoints.
– A shortcut formula (p.52, text) for the standard
error of log(OR) is the square root of the sum of
the reciprocals of the four cell counts in the 2x2
table.

93
Testing Equality of proportions π1 and π2,
i.e., log(OR)=0 :
• To test the equality of the odds of “YES”, ω1 and ω2, in two
Groups (H0: ω1/ω2 = 1), one estimates the common
proportion from the combined sample and computes the
standard error based on it.
• Estimated st. dev. for constructing the Test Statistic:
  s.e.(ln(ω̂1/ω̂2)) = $\sqrt{\frac{1}{n_{1+}\hat\pi_c(1-\hat\pi_c)} + \frac{1}{n_{2+}\hat\pi_c(1-\hat\pi_c)}}$,
  where $\hat\pi_c = \frac{n_{11}+n_{21}}{n_{1+}+n_{2+}}$
• Test statistic = $\frac{\ln(\hat\omega_1/\hat\omega_2)}{\text{s.e.}(\ln(\hat\omega_1/\hat\omega_2))}$ ~ N(0,1)
Reject H0 if |Test statistic value| > Zα/2
94
Example: Cardio-Vascular Deaths and Obesity
among women in American Samoa

[7.76 (=16/2061) observed deaths versus 6.66 (=7/1051) deaths per


thousand.]
Test equality of proportions of CVD deaths in populations of obese
and nonobese Samoan women ( ln(Odds Ratio) =0 )
This is an “Observational Study“, an example of “Poisson Sampling” [Ramsey, F. L.
and Schafer, D. W. (1997). The Statistical Sleuth. Duxbury Press, Belmont,
California.]
95
Testing equality of two population odds:
Cardiovascular disease and obesity data
Steps (numerical details shown on the slide):
1. Estimate the odds of CVD death in group 1 (obese) and group 2 (nonobese).
2. Compute the odds ratio and its log.
3. Compute the proportion from the combined sample.
4. Compute the SE for the log odds ratio estimate (test version).
5. Compute the Z-statistic.
6. Compute the one-sided P-value.
96
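The six steps in R (my sketch; the counts are from the CVD example):

n11 <- 16; n1 <- 2061   # obese: CVD deaths, group size
n21 <- 7;  n2 <- 1051   # nonobese: CVD deaths, group size
odds1 <- (n11/n1) / (1 - n11/n1); odds2 <- (n21/n2) / (1 - n21/n2)   # step 1
logOR <- log(odds1 / odds2)                                          # step 2
pc <- (n11 + n21) / (n1 + n2)                                        # step 3
se <- sqrt(1/(n1*pc*(1-pc)) + 1/(n2*pc*(1-pc)))                      # step 4
Z  <- logOR / se                                                     # step 5 (approx. 0.35)
pnorm(Z, lower.tail = FALSE)                                         # step 6, one-sided P-value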
Confidence Interval for Odds Ratio
(through that for loge of Odds Ratio)

Estimated st. dev. for the Confidence Interval for ln(ω1/ω2):

s.e.(ln(ω̂1/ω̂2)) = $\sqrt{\frac{1}{n_{1+}\hat\pi_1(1-\hat\pi_1)} + \frac{1}{n_{2+}\hat\pi_2(1-\hat\pi_2)}}$
 = $\sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$ (short-cut formula, p.52, text)

Confidence interval for the odds ratio:
– First construct a confidence interval for log(odds ratio):
  $\ln\!\left(\frac{n_{11}n_{22}}{n_{21}n_{12}}\right) \pm Z_{\alpha/2}\sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$
– Then take the antilogarithm of the endpoints to get the
confidence interval for the odds ratio.
97
Confidence interval for Odds Ratio:
Smoking and Cancer Data
CANCER CONTROL
SMOKER 83 72
NON-SMOKER 3 14

1. Odds ratio and its log: φ̂ = 5.38, ln(φ̂) = 1.683

2. Shortcut method for the SE of the log odds ratio:
   √(1/83 + 1/72 + 1/3 + 1/14) = 0.656

3. 95% interval for the log of the odds ratio: 1.683 ± 1.96 × 0.656 = [0.396, 2.969]

4. 95% interval for the odds ratio: exp(0.396) to exp(2.969),
or 1.486 to 19.471
Conclusion: The odds of cancer for the smokers are estimated to be 5.38 times the
odds of cancer for non-smokers (approximate 95% CI: 1.486 to 19.471)
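A small R helper for this interval (my sketch, not the textbook's code), reproducing the smoking-data numbers:

or.ci <- function(n11, n12, n21, n22, conf = 0.95) {
  or <- (n11 * n22) / (n12 * n21)
  se <- sqrt(1/n11 + 1/n12 + 1/n21 + 1/n22)   # shortcut SE of log(OR)
  z  <- qnorm(1 - (1 - conf) / 2)
  c(OR = or, lower = or * exp(-z * se), upper = or * exp(z * se))
}
or.ci(83, 72, 3, 14)   # OR = 5.38, 95% CI approx. 1.49 to 19.47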
Confidence interval for odds ratio:
Vitamin C and Cold data
Cold No Cold
Placebo 335 76

Vitamin C 302 105

1. Odds ratio and its log: φ̂ = 1.53, ln(φ̂) = 0.425

2. Shortcut method for the SE of the log odds ratio:
   √(1/335 + 1/76 + 1/302 + 1/105) = 0.170

3. 95% interval for the log odds ratio: 0.425 ± 1.96 × 0.170 = [0.093, 0.761]

4. 95% interval for the odds ratio exp(0.093) to exp(0.761); or 1.10 to 2.14

Conclusion: The odds of a cold for the placebo group are estimated to be 1.53
times the odds of a cold for the vitamin C group (approximate 95% CI: 1.10 to 2.14)
99
Intentionally Kept Blank

100
Test for Marginal Homogeneity
(McNemar’s Test, Text, p.55-56)

Comparing dependent proportions in matched pair or


pre-post treatment study design

H0: Prevalence of Depression at two time points are equal


(P(X=1)= p1+ = p+1=P(Y=1), i.e., treatment has no effect)
McNemar’s (Chi-square) test statistic = $\frac{(n_{12}-n_{21})^2}{n_{12}+n_{21}} \sim \chi^2_1$
101
Test for Marginal Homogeneity
(McNemar’s Test, Text, p.55-56)

H0: Prevalence of Depression at two time points are


equal (i.e., treatment has no effect)
McNemar’s (Chi-square) test statistic = $\frac{(n_{12}-n_{21})^2}{n_{12}+n_{21}} \sim \chi^2_1$

Here, (9 − 41)²/(9 + 41) = 20.48; P-value = 6.02E-06;


Conclusion: The treatment seems to be effective
102
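In R, mcnemar.test() computes this statistic from the 2×2 table of paired outcomes; the off-diagonal counts 9 and 41 are from the slide, while the diagonal counts below are hypothetical placeholders (they do not affect the statistic):

tab <- matrix(c(20, 9, 41, 30), nrow = 2, byrow = TRUE)   # diagonal entries hypothetical
mcnemar.test(tab, correct = FALSE)                        # statistic = (9-41)^2/(9+41) = 20.48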
Intentionally Kept Blank

103
Cochran-Mantel-Haenszel Test for no row by
column association in any of the 2×2 Tables
(pp. 94-101)

104
Cochran-Mantel-Haenszel Test (pp. 94-101)

$Q_{CMH} = \frac{\left[\sum_{h=1}^{q}\left(n_{11}^{(h)} - m_{11}^{(h)}\right)\right]^{2}}{\sum_{h=1}^{q} v_{11}^{(h)}}$,
where $m_{11}^{(h)} = \frac{n_{1+}^{(h)}\, n_{+1}^{(h)}}{n^{(h)}}$ and $v_{11}^{(h)} = \frac{n_{1+}^{(h)}\, n_{2+}^{(h)}\, n_{+1}^{(h)}\, n_{+2}^{(h)}}{\left(n^{(h)}\right)^{2}\left(n^{(h)} - 1\right)}$

Here, h = 1, 2. Text, p. 100: QCMH = (18 − 16.4 + 32 − 28.8)²/(2.3855 + 3.7236) = 3.7714; P-value = 0.052 with the χ²1 dist
105
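In R the test needs the full 2×2×q table; a sketch of the call with a hypothetical array (only the n11 cells 18 and 32 echo the slide):

tab <- array(c(18, 10, 22, 30,    # stratum 1 (hypothetical apart from n11 = 18)
               32, 18, 28, 42),   # stratum 2 (hypothetical apart from n11 = 32)
             dim = c(2, 2, 2))
mantelhaen.test(tab, correct = FALSE)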
Intentionally Kept Blank

106
Cochran-Armitage Trend Test
(See Text, p.60-61)
Binary categorical (row) variable X, ordered (column)
response variable Y.

Test H0: Proportions of X=1 follow some (linear) pattern


as a function of the ordered levels of Y.
Cochran-Armitage test statistic ~ N(0,1).
It basically tests whether the slope of a linear regression of
the proportions of X=1 on the levels of Y is zero or not. [Exercise:
Check the calculations on p. 61, text]
107
Intentionally Kept Blank

108
Odds & Odds Ratio:
Extension to Regression Analysis

109
Logistic Regression/ Binary Logit Model

• Logistic regression (“binary logit”) model


extends ideas of multiple linear regression to
situation where response variable, Y, is binary
(with coded values 0 and 1)
• Explanatory variables X1, X2 · · ·, Xk may be
categorical/continuous variables.
• It estimates chances of an observation
belonging to each group.

110
Problems with Usual Regression

For X=100, predicted value = 3.2121 has no meaning!


R Codes for Logistic Regression:
> LM1 = lm(PassFail ~ HoursOfStudy)
> summary(LM1)
> glm_PassFail = glm(PassFail ~ HoursOfStudy, family = binomial(logit), data = passfail)
> summary(glm_PassFail)
111
Solution with Logistic Regression

logistic function: Prob=1/[1+exp(-(-6.39+0.44*x))]


i.e., ln(Prob/(1−Prob)) = –6.39 + 0.44*x [= logit].
For X=100, predicted Prob ≈ 1; for X=18, predicted Prob = 0.82
112
Logistic Function

L = Curve’s maximum value


K = steepness of the curve
X0 = curve’s midpoint

113
Comparison to Classical Linear Regression

• Check if the logistic regression model fits well, if some


of the variables are redundant etc. However, statistical
tools to be used for this purpose are not the usual R2,
t- and F-tests.
• Instead of the t-test one uses a large-sample Z-test, due to
Wald, for checking the redundancy of an explanatory
variable. In place of the F-test or adjusted R2, pseudo R2
and information criteria such as AIC (Akaike
Information Criterion) or BIC (Bayesian Information
Criterion) are used to figure out which model is better.
114
Comparison to Classical Linear Regression
(contd.)
• Instead of the sum of squared errors (SSE), one calculates
negative of the log-likelihood for the fitted model.
• Interpretation of the parameters β1, …, βk is now done through
the log of odds, called the “logit”, or equivalently through odds.
For example, suppose X1 changes by one unit while the values of
X2, …, Xk remain constant. Then β1 is the change in the logit;
equivalently, the odds of the binary Y variable taking the value 1
get multiplied by $e^{\beta_1}$.

115
Categorical Variable Y as a Function of
Continuous Latent Variable Z

116
Proportional Odds Model for Ordinal Data
• Suppose the observed variable Y = j iff αj−1 ≤ Z ≤ αj, j = 1, …, J,
where the latent variable Z depends on the covariates through a
linear or non-linear model. If linear, then Z = βᵀx + ε, where ε
has CDF F. Then,
• $\gamma_j = P(Y \le j) = P(Z \le \alpha_j) = F(\alpha_j - \beta^T x)$, j = 1, …, J−1
• $\Rightarrow F^{-1}(\gamma_j) = \alpha_j - \beta^T x$
• If F(z) = 1/(1+e^{−z}), the standard logistic distribution,
then $F^{-1}(\gamma_j) = \ln\frac{\gamma_j}{1-\gamma_j} = \alpha_j - \beta^T x$
• When J = 2, we have the Binary Logistic Regression Model:
$F^{-1}(\gamma_1) = \ln\frac{\gamma_1}{1-\gamma_1} = \alpha_1 - \beta^T x$

117
Proportional Odds Model for Ordinal Data

• Suppose the observed variable Y = j iff αj−1 ≤ Z ≤ αj, j = 1,
…, J, where the latent variable Z depends on the
covariates through a linear or non-linear model. If
linear, then Z = βᵀx + ε, where ε has CDF F. Then,
• $\gamma_j = P(Y \le j) = P(Z \le \alpha_j) = F(\alpha_j - \beta^T x)$
  $\Rightarrow F^{-1}(\gamma_j) = \alpha_j - \beta^T x$
• F(z) normal ⇒ probit model
• F(z) extreme-value dist ⇒ complementary log-log
model (note: if exp(ε) has an exponential or Weibull dist,
then ε has the extreme value dist)

118
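A sketch of fitting such a proportional odds model in R with MASS::polr(), assuming a data frame dat with an ordered factor response y and covariates x1, x2 (all names hypothetical):

library(MASS)
fit <- polr(y ~ x1 + x2, data = dat, method = "logistic")   # "probit", "cloglog" also allowed
summary(fit)   # the alpha_j appear as intercepts (zeta); beta as coefficients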
Intentionally Kept Blank

119
Properties of the Logit Model

• Although Xi may vary from −∞ to ∞, P is
constrained to lie between 0 and 1.

• When Xi approaches −∞, P approaches 0.

• When Xi approaches ∞, P approaches 1.

• When linear regression model is used, P is not


constrained to lie between 0 and 1.

120
Relation between Probability, Odds & Logit
Probability   Odds   Log(Odds) = Logit
0             0      NC
0.1           0.11   −2.20
0.2           0.25   −1.39
0.3           0.43   −0.85
0.4           0.67   −0.41
0.5           1.00    0.00
0.6           1.50    0.41
0.7           2.33    0.85
0.8           4.00    1.39
0.9           9.00    2.20
1             NC     NC
121
Estimation and Model Fit
• Estimation Method: Maximum Likelihood
• Fit: Cox & Snell R Square and Nagelkerke R Square
(similar to R2 in multiple regression).
• Cox & Snell R Square can not equal 1.0, even if fit is
perfect, this limitation is overcome by Nagelkerke R
Square.
• Hosmer–Lemeshow test used for assessing goodness
of fit of a model (Large P-value indicates overall
model fit is good)
• Compare predicted and actual values of Y to
determine the percentage of correct predictions.
122
1. MLE calculation:
The distribution of scores in a public examination is normal
with unknown mean µ and known standard deviation 5. It is,
however, known that µ equals 52, 60 or 67. If the scores of
three randomly chosen candidates turn out to be 88, 93, 59,
then find the maximum likelihood estimate of µ.
(Ans. mle µ of 67; sample mean = 80)
$L(\mu) = \prod_{i=1}^{n}\frac{1}{5\sqrt{2\pi}}\,e^{-\frac{1}{2\cdot 5^{2}}(X_i-\mu)^{2}}$

$\log_e L(\mu) = \text{constant} - \frac{\sum_{i=1}^{n}(X_i-\mu)^{2}}{2\times 5^{2}}$

$\log_e L(\mu) = \text{constant} - \frac{\sum_{i=1}^{n}\left\{(X_i-\bar X)+(\bar X-\mu)\right\}^{2}}{50}$

$\log_e L(\mu) = \text{constant} - \frac{\sum_{i=1}^{n}(X_i-\bar X)^{2} + n(\bar X-\mu)^{2} + 2(\bar X-\mu)\sum_{i=1}^{n}(X_i-\bar X)}{50}$

(The cross term vanishes, so L(µ) is maximized over {52, 60, 67} at the value closest to the sample mean 80, namely µ̂ = 67.)
123
2. MLE calculation:
(a) The distribution of a certain quality characteristic is
continuous uniform over the range [−2θ, θ], where θ is
unknown. If the values of the quality characteristic in a
random sample of size five are 0.63, 0.27, –1.62, 0.71 and –
1.38, then what is the maximum likelihood estimate of θ?
(Ans. 0.81 = −X(1)/2, where X(1) = min of X1, X2, …, Xn)
(b) The life (in hours) of an electrical component is
exponentially distributed with mean θ, where θ is unknown.
Let p be the probability that a system, consisting of two such
components arranged in series, survives for five hours or
more. If five randomly chosen components are seen to
survive for 7, 6, 4, 8 and 5 hours, then what is the maximum
likelihood estimate of p? (Ans. mle of θ = 6, mle of p =
exp(−10/θ̂) = 0.188756028 ≈ 0.19)

124
3. MLE calculation:
An electrical unit consists of three identical components, and
the unit functions if and only if at least one of the
components functions. The components are known to
function independently, and each component's lifetime has an exponential dist.
500 randomly chosen units of this type 468 were found to be
functioning for 10 hours or more. On the basis of this
information, find the maximum likelihood estimate of the
probability for an individual component to function for 10
hours or more. (Ans. 0.6)

125
Estimating Model Parameters
• Linear Regression model uses OLS method to
minimize sum of squared errors of prediction
• Logistic Regression model maximizes “likelihood” of
observing y1,…,yn, defined by
$L = \prod_{i=1}^{n} p_i^{\,y_i}(1-p_i)^{(1-y_i)}$,

where $p_i = \frac{\exp(\beta_0+\beta_1 x_{1i}+\cdots+\beta_k x_{ki})}{1+\exp(\beta_0+\beta_1 x_{1i}+\cdots+\beta_k x_{ki})}$.

Thus, L is a function of β0, β1, …, βk
126
Estimation (contd.)
$L = \prod_{i=1}^{n} p_i^{\,y_i}(1-p_i)^{(1-y_i)}$, where $p_i = \frac{\exp(\beta_0+\beta_1 x_{1i}+\cdots+\beta_k x_{ki})}{1+\exp(\beta_0+\beta_1 x_{1i}+\cdots+\beta_k x_{ki})}$

To maximize L, equivalently ln(L), we can use Excel Solver.

We require initial values of β0, β1, …, βk, which can be
obtained by MLR of the “empirical log odds” on X1, …, Xk:

$\ln\!\left(\frac{y_i + \frac{1}{2}}{1 - y_i + \frac{1}{2}}\right) = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki}$
127
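Instead of Excel Solver, the same maximization can be sketched in R with optim(); the data below are simulated for illustration:

negloglik <- function(beta, y, x) {
  eta <- beta[1] + beta[2] * x
  p <- 1 / (1 + exp(-eta))
  -sum(y * log(p) + (1 - y) * log(1 - p))   # minus the log-likelihood
}
set.seed(2)
x <- rnorm(100)
y <- rbinom(100, 1, 1 / (1 + exp(-(0.5 + 1.2 * x))))   # simulated 0/1 responses
optim(c(0, 0), negloglik, y = y, x = x)$par            # close to glm(y ~ x, binomial)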
Model Fit Measures
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-
squareds
Cox & Snell R square:
$R^{2} = 1 - \left(\frac{L_0}{L}\right)^{2/n}$
where L0 = max likelihood of the intercept-only model,
L = max likelihood under the specified model

Nagelkerke adjusted R square:
$\text{Adjusted } R^{2} = \frac{R^{2}}{R^{2}_{\max}} = \frac{1 - \left(L_0/L\right)^{2/n}}{1 - \left(L_0\right)^{2/n}}$

128
Other Measures

• Akaike’s Information Criterion:
AIC = −2·ln(L) + 2k,
where k = # of model parameters
• Corrected AIC (for small samples):
AICc = AIC + 2k(k+1)/(n − k − 1)
• Bayesian Information Criterion:
BIC = −2·ln(L) + ln(n)·k

129
Interpretation of Coefficients

• Log odds, i.e., ln(p/(1-p)), is a linear combination of


a0, a1 · · · ak
• If Xi value increases by one unit, log odds will change
by ai units, when other X-variables held constant
• a0= log odds if/when all X-variables equal zero

$\ln\!\left(\frac{p}{1-p}\right) = a_0 + a_1 x_1 + \cdots + a_k x_k$, i.e., $p = \frac{\exp(a_0 + a_1 x_1 + \cdots + a_k x_k)}{1+\exp(a_0 + a_1 x_1 + \cdots + a_k x_k)}$

130
Interpretation of Coefficients
If Xi is increased by one unit, the log odds will change by
ai units, when the values of the other independent variables
are held constant.

The sign of ai determines whether the probability increases
(if the sign is positive) or decreases (if the sign is negative)
by some amount.

Probability   Odds   Log(Odds) = Logit
0             0      NC
0.1           0.11   −2.20
0.2           0.25   −1.39
0.3           0.43   −0.85
0.4           0.67   −0.41
0.5           1.00    0.00
0.6           1.50    0.41
0.7           2.33    0.85
0.8           4.00    1.39
0.9           9.00    2.20
1             NC     NC
131
Intentionally Kept Blank

132
Prescribe “Buy/Sell/Hold” for a Security

Daily Data on
Price, Volume, MACD, RSI, Put-Call Ratio,
Open Interest, etc & “Buy/Sell/Hold prediction”
(attached in hindsight on the basis of next
trading day’s closing price)

Using a Logistic Regression Model fitted to the above
dataset over a period, we can make future decisions

133
134
135
Example: Insurance Requirement

• 100 respondents, [2-level response] 63 yes,


37 no
• Predictors are Age, Income, dependent
(whether have dependents or not).
• Want to estimate the chances of a prospective
customer purchasing insurance.

136
Data (2-level Response)

137
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.

2. Click REGRESSION and then BINARY LOGISTIC.

3. Move “Willing” in to the DEPENDENT VARIABLE box.

4. Move “Age,” “Dependent,” and “Income” into the
COVARIATES box.

5. Select ENTER for METHOD (default option)

6. Click OK.

138
Recovering Probability (π) from Logit (η)

Let η = logit π = $\ln\frac{\pi}{1-\pi}$ = β0 + β1X1 + ⋯ + βpXp

π = $\frac{\exp(\eta)}{1+\exp(\eta)}$ = $\frac{1}{1+\exp(-\eta)}$

139
Output (2-level response)

Given age, dependent & income of a person, compute:


logit =  = –50.326+1.077*age+13.778(if has dependents) +
0.000*income;
& chances of insurance purchase = 1/[1+exp(–)]
R Codes:
> glm_insurance = glm(nWilling ~ Age+Dependent+Income, family = binomial(logit), data = insurance)
> summary(glm_insurance)
140
Output (2-level response)

141
Output (2-level response)
https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-
what-are-pseudo-r-squareds

Deviance Test; Hosmer-


Lemeshow Test, p.151-152

142
Output (2-level response; w/o Income)

Given age, dependent of a person, compute:


logit = z = –50.190+1.083*age+13.782(if has dependents) ;
& chances of insurance purchase = 1/[1+exp(–z)]
> glm_insurance = glm(nWilling ~ Age+Dependent, family = binomial(logit), data =
insurance)
> summary(glm_insurance)
143
Output (2-level response; w/o Income)

144
Output (2-level response; w/o Income)

145
Classification Table
• When there are two categories Y=1 and Y=2, this will
produce a table of counts for classification or
misclassification

• when there are three categories Y=1, Y=2 and Y=3 one
will have a table of counts for classification or
misclassification

146
Classification Table
Proportion of correct classification for the
subjects in the sample is given by
(𝑛11 + 𝑛22 +𝑛33 )
(𝑛11 +𝑛12 +𝑛13 +𝑛21 +𝑛22 +𝑛23 +𝑛31 +𝑛32 +𝑛33 )

147
Receiver Operating Characteristic (ROC)
Curve
• Sensitivity is the true positive rate (TPR) since it is the
rate of correctly classifying an observed “yes” as “yes”
by the model for a fixed cut-off or threshold value.
• Similarly, (1-specificity) is known as the false positive
rate (FPR) as it represents the rate of incorrectly
classifying an observed “no” as “yes” by the model for
the same cut-off or threshold value.
• Thus, the ROC plot, of sensitivity versus (1 − specificity),
can be thought of as a plot of the Power versus Type I
Error of the decision (classification) rule based on the
chosen cut-off or threshold value using the model
predicted probabilities.
148
Receiver Operating Characteristic (ROC)
Curve
• Let x = FPR and y = TPR. Then ROC points (x,y) represent relative trade-off
between true positive rate (beneficial) and false positive rate (costly). The
best possible scenario for prediction would be to have the point (0,1) in
the upper left corner of the ROC space, as it represents 100% sensitivity
(no false negatives) and 100% specificity (no false positives). The (0,1)
point is also known as a perfect classification.
• Consider a prediction method based on random guessing: flip a coin
with Pr(Heads) = p ∈ [0,1], and classify a subject in the “yes” category
if “Heads” falls and the “no” category if “Tails” appears. Such a method would
generate points along the diagonal line from the bottom-left point (0,0) to
the top-right corner (1,1). Points above the diagonal line in the ROC space
represent good classification results (better than random) and points
below the line represent poor results (worse than random).
• Thus, one desires to see the ROC curve as far above the diagonal as
possible, and a binary logistic regression model producing the largest
area under the ROC curve may be declared the best model.

149
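A bare-bones sketch (mine) of computing ROC points in R from fitted probabilities phat and observed 0/1 labels y, without extra packages:

roc.points <- function(y, phat) {
  thr <- sort(unique(phat), decreasing = TRUE)   # candidate cut-offs
  t(sapply(thr, function(cut) {
    yhat <- as.numeric(phat >= cut)
    c(FPR = sum(yhat == 1 & y == 0) / sum(y == 0),   # 1 - specificity
      TPR = sum(yhat == 1 & y == 1) / sum(y == 1))   # sensitivity
  }))
}
# usage: pts <- roc.points(y, fitted(glm_fit)); plot(pts, type = "s"); abline(0, 1)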
Pearson Chi-Square & the Deviance Tests
Pearson Chi-Square test is mainly used for categorical
explanatory variables. For a sample of n subjects with binary
response and a number of covariates, using a fitted binary logit
model, one can construct an I×J table of observed counts, with
rows consisting of the I distinct patterns of the covariates in the data
and columns representing the levels of the binary response (J = 2).
For each of these I patterns, construct model-based expected
counts (Eij) and compare them with the observed counts (nij), i = 1, …,
I, j = 1, …, J.

150
Pearson Chi-Square Tests
• Then the Pearson Chi-Square test statistic is defined by
  $PCS = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{\left(n_{ij} - E_{ij}\right)^{2}}{E_{ij}}$
• If each of the I×J cells has 5 or more observations nij,
then under the null hypothesis of a correct model, PCS
has an asymptotic chi-square distribution with degrees
of freedom (df) = [I(J−1) − m], where m = number of
(independent) model parameters being estimated. If
J = 2, then the PCS has df = (I − m).

151
Deviance test statistic
Deviance test statistic also compares the observed to the
expected counts, defined by
$Deviance = 2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\,\ln\frac{n_{ij}}{E_{ij}}$
It also has the same df = [I(J−1) − m], where m = number of
(independent) model parameters being estimated. The deviance
test is really the likelihood ratio test.
Example: One may test the null hypothesis that the conditional (i.e.,
given xi) distributions of the responses {Yi, i=1,…,I} are
independent binomial(1, π(xi)) distributions, where
$\pi(\mathbf{x}_i) = \frac{e^{\beta_0+\beta_1 x_{1i}+\cdots+\beta_k x_{ki}}}{1+e^{\beta_0+\beta_1 x_{1i}+\cdots+\beta_k x_{ki}}}$,
against the saturated model, which does not assume {π(xi), i=1,
…,n} to be related in any manner. In this case, J = 2 and m = k+1.

152
Hosmer Lemeshow Test: Binary case
Pearson Chi-square test is not appropriate when there are
continuous (i.e., non-categorical) explanatory variables, if they are
not grouped. In this case, for each “covariate” pattern X = x there
will usually be one observation, and hence the asymptotic or
approximate chi-square distribution for the Pearson or the
Deviance test is not valid.
This problem is also faced if the number of categorical binary predictors k
is large. For example, if k is 6, one may have as many as I = 2^k = 64
distinct combinations or patterns of covariates.
To solve this problem of large I, Hosmer and Lemeshow (1980)
order the n subjects according to the fitted probabilities π̂(x) =
Pr(Y=“Yes”|x) and then divide the n subjects into g groups
of comparable sizes, where g is a small number often chosen to
be 10.
153
Hosmer Lemeshow Test: Binary case

nij = number of obs in the j-th cell of the i-th group; Ei1 = sum of the
fitted probabilities π̂(x) = Pr(Y=“Yes”|x) of the subjects in the i-
th group, and Ei2 = ni+ − Ei1. The PCS-type test statistic is
constructed as

$\chi^{2}_{HL} = \sum_{i=1}^{g}\sum_{j=1}^{2}\frac{(n_{ij} - E_{ij})^{2}}{E_{ij}}$

In simulation studies, it has been seen that $\chi^{2}_{HL}$ has approximately
a chi-square distribution with df = (g − 2).
154
Hosmer Lemeshow Test:
Multinomial Response Case
• Fagerland and Hosmer’s (2012) generalization of the
Hosmer-Lemeshow test for multinomial logistic regression
models. The groups are formed in a similar manner as done
in the binary case and the table of observed and expected
counts for J categories (namely, 0, 1, …, J-1 ) takes the form:

155
Exercise
• Fagerland, Hosmer and Bofin (2008) showed that
$\chi^{2}_{FH}$ has an asymptotic chi-square distribution with df =
(g − 2)(J − 1). Various statistical software readily
provide the $\chi^{2}_{HL}$ test in the binary case, but do not
usually produce the $\chi^{2}_{FH}$ test in the multinomial
case. One may compute the latter by writing R
program code.

156
AIC, BIC
The Akaike Information Criterion:
AIC = 2m − 2loge(estimated maximum likelihood value under a model),
where m = number of parameters being estimated in the model. The term 2m
is used to penalize a model with a higher number of parameters, to reward a
parsimonious model. Given a number of candidate models for the data, the
preferred model is the one with the minimum AIC value. This enables us to
compare two models, nested or not.
AIC is not consistent, in that its probability of selecting the true model
among the candidates does not tend to one as the sample size increases to
infinity. One popular alternative that corrects this problem is the Bayesian
Information Criterion (BIC), defined by Schwarz (1978):
BIC = [loge(n)]m − 2loge(estimated maximum likelihood value under a model)
Usually, both AIC and BIC agree on the choice of the best model.

157
Example: Insurance Requirement

• 100 respondents, [3-level response] 42 yes,


28 no, 30 maybe
• Predictors are Age, Income, dependent
(whether have dependents or not).
• Want to estimate the chances of a prospective
customer purchasing insurance.

158
Multinomial Logit Model

$\log\!\left(\frac{\pi_j}{\pi_m}\right) = \beta_{0j} + \beta_{1j}X_1 + \cdots + \beta_{kj}X_k,\quad j = 1, \dots, (m-1)$

i.e., $\pi_j = \frac{\exp\!\left(\sum_{i=0}^{k}\beta_{ij}X_i\right)}{1 + \sum_{j'=1}^{m-1}\exp\!\left(\sum_{i=0}^{k}\beta_{ij'}X_i\right)},\quad j = 1, \dots, (m-1)$, with X0 ≡ 1

159
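A sketch of the corresponding fit in R using nnet::multinom() (mirroring the SPSS steps below; the variable and data frame names follow the slides' insurance example):

library(nnet)
fit <- multinom(Resp3level ~ Age + Dependent + Income, data = insurance)
summary(fit)   # one row of logit coefficients per non-reference response level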
Data (3-level Response)

160
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.
2. Click REGRESSION and then MULTINOMIAL LOGISTIC.
3. Move “Resp3level” in to the DEPENDENT VARIABLE box.
4. Move “Age,” “Dependent,” and “Income” into the
COVARIATES box.
5. Under STATISTICS: under MODEL select everything except
Monotonicity measures, under PARAMETERS select
Estimates, Likelihood ratio tests & then ENTER
6. Click OK.

161
Output (3-level response)

Logit1 = z1 = 63.913 − 29.447*dependent − 0.877*age + 0.000*income
Logit2 = z2 = 132.935 − 50.587*dependent − 2.403*age + 0.000*income
Prob(“Yes”) = π3 = 1/(1+exp(z1)+exp(z2)), for the Reference Category
Prob(“May be”) = π1 = exp(z1)/(1+exp(z1)+exp(z2))
Prob(“No”) = π2 = exp(z2)/(1+exp(z1)+exp(z2))
162
Output (3-level response)

163
Output (3-level response)

164
Output (3-level response; w/o Income)

Logit1 = z1=63.637 -29.573*dependent -0.878*age


Logit2 = z2=128.431 -49.638*dependent -2.325*age
Prob(“Yes”) = p3 = 1/(1+exp(z1)+exp(z2))
Prob(“May be”) = p1=exp(z1)/(1+exp(z1)+exp(z2))
Prob(“No”) = p2 = exp(z2)/(1+exp(z1)+exp(z2))
165
166
Output (3-level response; w/o Income)

167
Output (3-level response; w/o Income)

168
Bankruptcy Example

Source: Book by Johnson & Wichern

169
Bankruptcy Example
(Applied Multivariate Statistical Analysis by Johnson & Wichern)

• Annual financial data collected for 16 (currently)


bankrupt firms about 2 years prior to their bankruptcy
and for 20 (currently) financially sound firms at about
the same time. [ 0: Bankrupt Firms; 1: Non-Bankrupt
Firms]
• Four explanatory variables
X1 (CFTD)= cash flow/total debt,
X2 (NITA) = net income/total assets
X3 (CACL) = current assets/current liabilities,
X4 (CANS) = current assets/net sales

170
Seq#   CF/TD   NI/TD   CA/CL   CA/NS   Bankrupt   (0 = bankrupt, 1 = sound)
1      -0.45   -0.41   1.09    0.45    0
2      -0.56   -0.31   1.51    0.16    0
3       0.06    0.02   1.01    0.40    0
4      -0.07   -0.09   1.45    0.26    0
5      -0.10   -0.09   1.56    0.67    0
6      -0.14   -0.07   0.71    0.28    0
7       0.04    0.01   1.50    0.71    0
8      -0.06   -0.06   1.37    0.40    0
9       0.07   -0.01   1.37    0.34    0
10     -0.13   -0.14   1.42    0.44    0
11     -0.23   -0.30   0.33    0.18    0
12      0.07    0.02   1.31    0.25    0
13      0.01    0.00   2.15    0.70    0
14     -0.28   -0.23   1.19    0.66    0
15      0.15    0.05   1.88    0.27    0
16      0.37    0.11   1.99    0.38    0
171
Seq#   CF/TD   NI/TD   CA/CL   CA/NS   Bankrupt
17      0.51    0.10   2.49    0.54    1
18      0.08    0.02   2.01    0.53    1
19      0.38    0.11   3.27    0.35    1
20      0.19    0.05   2.25    0.33    1
21      0.32    0.07   4.24    0.63    1
22      0.31    0.05   4.45    0.69    1
23      0.12    0.05   2.52    0.69    1
24     -0.02    0.02   2.05    0.35    1
25      0.22    0.08   2.35    0.40    1
26      0.17    0.07   1.80    0.52    1
27      0.15    0.05   2.17    0.55    1
28     -0.10   -0.01   2.50    0.58    1
29      0.14   -0.03   0.46    0.26    1
30      0.14    0.07   2.61    0.52    1
31      0.15    0.06   2.23    0.56    1
32      0.16    0.05   2.31    0.20    1
33      0.29    0.06   1.84    0.38    1
34      0.54    0.11   2.33    0.48    1
35     -0.33   -0.09   3.01    0.47    1
36      0.48    0.09   1.24    0.18    1
172
SPSS Windows: Logit Analysis
1. Select ANALYZE from the SPSS menu bar.

2. Click REGRESSION and then BINARY LOGISTIC.

3. Move “Bankrupt” in to the DEPENDENT VARIABLE box.

4. Move “CFTD”, “NITD”, “CACL”, & “CANS” into the
COVARIATES box.

5. Select ENTER for METHOD (default option)

6. Click OK.

173
Regressors: CFTD, CACL, NITD, CANS

174
Regressors: CFTD, CACL, NITD, CANS

175
Regressors: CFTD, CACL

Given the “CFTD” & “CACL” values of a bank, compute:

logit = η = –4.863 + 5.654*CFTD + 2.550*CACL;
& chances of being “sound” = 1/[1+exp(–η)]

176
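The same two-regressor fit can be sketched in R, assuming a data frame bankruptcy with columns Bankrupt (0/1), CFTD and CACL built from the table above:

fit <- glm(Bankrupt ~ CFTD + CACL, family = binomial(logit), data = bankruptcy)
summary(fit)   # slide reports logit = -4.863 + 5.654*CFTD + 2.550*CACL
predict(fit, newdata = data.frame(CFTD = 0.2, CACL = 2.0), type = "response")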
Regressors: CFTD, CACL

177
Intentionally Kept Blank

178
Example: Survival of the Donner Party

• In 1846, the Donner and Reed families left Springfield,


Illinois, for California by covered wagon. In July the Donner
party reached Fort Bridger, Wyoming. There its leaders
decided to attempt a new and untested route to
Sacramento valley. Having reached its full size of 87 people
and 20 wagons (with 45 individuals with age ≥15), the party
was delayed by a difficult crossing of the Wasatch range …..
• (https://en.wikipedia.org/wiki/Donner_Party)
http://www.stat.ufl.edu/~winner/data/donner.txt (full dataset 87 obs)
• https://onlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.ed
u.stat504/files/lesson07/donner.txt (only 45 obs)

179
Survival of the Donner Party (contd.)
…. and again in the crossing of the desert west of the
great Salt Lake. The group became stranded in the
eastern Sierra Nevada mountains when the region was
hit by heavy snow in October. By the time the last
survivor was rescued on April 21, 1847, 40 of the 87
members had died of hunger and exposure to extreme cold.

180
Donner Party Data (age above 15yrs)
[sex (i.e., gender) = 1 is male indicator variable]

181
Route of the Donner Party

182
The Logistic Regression Model
• The response variable in this case study is
binary; it takes values 0 and 1. The Donner
Party study involves two explanatory
variables, one categorical and one numerical.
Logistic regression is the appropriate tool in
such a situation.

183
Intentionally Kept Blank

184
Generalized Linear Model
• A generalized linear model is one where some
function of the mean of the response variable is
related to the explanatory variables through a
regression equation.
g (  )   0  1 X 1     p X p

• The function g is the link function, and g(μ) = μ leads


to the ordinary multiple linear regression

185
Exponential Family of Distributions with
Vector of Parameters θ

For the Binomial dist Bin(n,p): μ = p and θ(μ) = ln(p/(1−p))

Thus, g(μ) = g(p) = ln(p/(1−p)) is the link function

For Poisson(λ): g(μ) = g(λ) = ln(λ)

186
Logit Link for Binary Responses
We define
logit(π) = log(π/(1 − π))
Then the logistic regression formula is given by

logit(π) = β0 + β1X1 + … + βpXp

Logistic regression is a kind of non-linear regression, since the equation E(Y | X1, X2, …, Xp) = π is nonlinear in the β’s.

187
Recovering Probabilities from Logits


logit π = ln[π/(1 − π)] = β0 + β1X1 + … + βpXp

Let η denote logit π = β0 + β1X1 + … + βpXp. Then

π = exp(η)/(1 + exp(η)) = 1/(1 + exp(−η))
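In R these two formulas are built in: qlogis() is the logit and plogis() the inverse logit, so the recovery above can be checked directly:

eta <- qlogis(0.75)  # logit(0.75) = ln(0.75/0.25) ~ 1.0986
plogis(eta)          # 1/(1 + exp(-eta)) recovers 0.75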

188
Example: Donner Party
GLM code in R
> glm_survive = glm(survive ~ age+sex, family = binomial(logit),
data = donner)
> summary(glm_survive)

• Coefficients:
(Intercept)      age       sex
    3.23041   −0.0782   −1.5973
• Degrees of Freedom: 44 ‘Total’(i.e. Null); 42 Residual (Model)

Residual Deviance = 51.26, DF= 42 (=45 - 3), AIC=57.26, P-value (HL test) =
0.19, AUC=0.73, Correct Classification % = 77.8, Nagelkerke R2 = 0.28
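A fitted probability can be read off the model directly (a sketch; it assumes the donner data frame above, with sex the 0/1 male indicator):

# Predicted survival probability for, e.g., a 25-year-old woman (sex = 0)
predict(glm_survive, newdata = data.frame(age = 25, sex = 0),
        type = "response")
# By hand from the coefficients: plogis(3.23041 - 0.0782*25) ~ 0.78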

189
Example: Donner Party
• The fit of the logistic regression model to the Donner
Party data (with 45 obs with age ≥15), where π
represents the survival probability gives
• logit(πi) = ln(πi/(1 − πi)) = 3.23 − 0.078 agei − 1.60 sexi,
where “sex” is an indicator variable for males.
• Interpretation:
−1.60 = change in ln(odds), i.e. ln(odds ratio), for a male compared to a female of the same age
−0.078 = change in ln(odds), i.e. ln(odds ratio), for a person one year older, of the same gender

190
Odds from Logistic Regression
ω = odds that Y = 1, defined as P(Y = 1)/P(Y = 0)

In logistic regression:
ω = exp(β0 + β1X1 + … + βiXi + … + βkXk)

The ratio of the odds ωa at Xi = a, relative to the odds ωb at Xi = b, for fixed values of the other X’s, is

ωa/ωb = exp(βi(a − b))
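In R the fitted coefficients move to the odds scale by exponentiation (a sketch reusing the glm_survive fit from earlier; confint.default() gives Wald-type intervals):

exp(coef(glm_survive))             # odds ratios exp(beta-hat)
exp(confint.default(glm_survive))  # Wald 95% CIs on the odds-ratio scale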

191
Interpretation of Coefficients (Donner Party)

• Comparing women 50 years old (a = 50) with women 20 years old (b = 20), the odds ratio is estimated as
exp[−0.078(50 − 20)] = 0.096, or about 1/10.
So the odds of survival for a 20-year-old woman are about 10 times the odds of survival for a 50-year-old woman.

• Comparing a woman (sex = 0 = a) with a man (sex = 1 = b) of the same age, the estimated odds ratio is
exp[−1.60(0 − 1)] = exp(1.60) = 4.95,
that is, the woman’s odds of survival were about 5 times the odds of survival of a man of the same age.

192
Testing Presence of Interaction (Donner data)
Wald’s test for the hypothesis that the coefficient of the interaction term is zero in the logistic regression of survival (1 or 0) on age, sex (= 1 for males), and interac = age × sex: Donner Party data (n = 45)

Variable     Coefficient   Standard Error   Z-Statistic
Constant        7.246          3.205           2.261
age            −0.194          0.087          −2.220
sex            −6.928          3.399          −2.038
age × sex       0.162          0.094           1.714

Residual Deviance = 47.34, DF = 41 (= 45 − 4), AIC = 55.35, P-value (HL test) = 0.65, AUC = 0.762, Correct Classification % = 73.3, Nagelkerke R² = 0.368
From the normal distribution, two-sided p-value = 2 × Pr(Z > 1.714) = 0.087
Conclusion: There is suggestive but inconclusive evidence of an interaction.
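The same fit can be reproduced in R (a sketch, assuming the donner data frame used earlier):

# age * sex expands to age + sex + age:sex
glm_int <- glm(survive ~ age * sex, family = binomial, data = donner)
summary(glm_int)$coefficients  # z for age:sex ~ 1.71, two-sided p ~ 0.087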
193
Example ( Confidence Interval for a Coefficient)
Confidence intervals for the odds of survival for females divided by the odds of survival for males, accounting for age, from the model without interaction

Variable          Coefficient   Standard Error   Z-Statistic
Constant             3.230          1.387           2.33
age                 −0.078          0.037          −2.10
Sex (male = 1)      −1.597          0.755          −2.11
Deviance = 51.256, Degrees of freedom = 42, Z.025 = 1.96

95% Confidence Interval for the coefficient of Sex: −1.597 ± 1.96 × 0.755 = −3.077 to −0.117

Take anti-logarithms of the endpoints to get an interval for the odds ratio: 0.046 to 0.889

Conclusion: exp(−1.597) = 0.203 is the estimated odds ratio: the odds of survival for a male are about one-fifth of those of a female of the same age.
194
Intentionally Kept Blank

195
Retrospective studies
In the logistic regression model, probabilities of binary responses are modeled prospectively as functions of explanatory variables. In studies in which the probability of a “yes” response (e.g., cancer) is very small, independent samples are instead drawn retrospectively. Prospective probabilities cannot be estimated from such retrospective sampling. However, since the odds ratio is invariant to the study design, it can be estimated from a retrospective study using the logistic regression model.
In a logistic regression model for a retrospective study, the estimated intercept is not an estimate of the prospective intercept.
196
Logistic Regression for Retrospective Studies
(Agresti, p. 168)
Let Z = 1 if a subject is sampled, Z = 0 otherwise
Let ρ1 = P(Z = 1 | Y = 1) = prob of sampling a “case”
Let ρ0 = P(Z = 1 | Y = 0) = prob of sampling a “control”
The conditional dist of the response Y given X = x is not sampled directly; we only observe subjects with Z = 1. Suppose logit(P(Y = 1 | x)) = α + βx. Then

P(Y = 1 | Z = 1, x)
= P(Z = 1 | Y = 1, x) P(Y = 1 | x) / Σ_{j=0}^{1} P(Z = 1 | Y = j, x) P(Y = j | x)
= ρ1 P(Y = 1 | x) / [ρ1 P(Y = 1 | x) + ρ0 P(Y = 0 | x)]
= ρ1 e^(α+βx) / [ρ1 e^(α+βx) + ρ0]        (dividing through by P(Y = 0 | x))
= e^(ln(ρ1/ρ0) + α + βx) / [1 + e^(ln(ρ1/ρ0) + α + βx)]

⇒ logit(P(Y = 1 | Z = 1, x)) = [ln(ρ1/ρ0) + α] + βx = α* + βx

So the slope β is unchanged from the prospective model; only the intercept is shifted.

197
Estimation of Logistic Regression coefficients
Maximum likelihood estimation: the joint probability (likelihood) is given by

L(β) = Π_{i=1}^{n} πi^{yi} (1 − πi)^{1−yi}

where πi (a function of the parameters β0, β1, …, βk) is obtained from the logistic regression model.

From this likelihood we obtain the score equation by differentiating the log-likelihood with respect to the model parameters and equating it to 0.

Solving the score equation we obtain the estimates of the model parameters. We may not obtain a closed-form solution of the score equation; on such occasions we use iterative procedures.
198
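For illustration, the usual iterative procedure (Fisher scoring, equivalently iteratively reweighted least squares) can be sketched in a few lines of R; this is a bare-bones teaching version, not production code:

# X: n x (k+1) design matrix with a leading column of 1s; y: 0/1 response
irls_logit <- function(X, y, iter = 25) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(iter)) {
    p <- as.vector(plogis(X %*% beta))  # current fitted probabilities
    W <- p * (1 - p)                    # binomial variance weights
    # Fisher-scoring update: beta + (X'WX)^{-1} X'(y - p)
    beta <- beta + solve(t(X) %*% (W * X), t(X) %*% (y - p))
  }
  as.vector(beta)
}
# e.g., irls_logit(cbind(1, donner$age, donner$sex), donner$survive)
# should agree with coef(glm_survive) above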
Donner Party – Survival Probabilities
Since each person survives independently, the joint probability of survival can be obtained by multiplying individual survival probabilities.

If we estimate the parameters by the maximum likelihood method, then we obtain the estimated model as follows:

logit(πi) = 3.23041 − 0.078 agei − 1.60 sexi

199
Tests and Confidence Intervals for Single
Coefficients
The properties imply that each estimated coefficient β̂j in logistic regression has, approximately, a normal sampling distribution, and therefore

Z-ratio = (β̂j − βj0) / SE(β̂j)

has an approximate standard normal distribution when βj0 is the true value.
The standard error is the estimated standard deviation of the sampling distribution of the estimator.
A confidence interval is the estimate plus and minus the half-width, which is the standard error of the estimate times the appropriate percentile of the N(0,1) distribution.

200
Tests and Confidence Intervals for Single
Coefficients
A test statistic is the ratio of the estimate minus the
hypothesized value to the standard error, and
the p-value is obtained by comparing this to a N(0,1)
distribution. Such a test is referred to as Wald’s test.

201
Example (Wald’s Test)
Wald’s test can be used to determine whether the log odds of survival are associated with age differently for men than for women (n = 45).
Wald’s test for the hypothesis that the coefficient of the interaction term is zero in the logistic regression of survival (1 or 0) on age, fem (= 1 for females), and age × fem: Donner Party data

Variable     Coefficient   Standard Error   Z-Statistic
Constant        0.318          1.131           0.28
age            −0.032          0.035          −0.92
fem             6.927          3.354           2.06
age × fem      −0.162          0.093          −1.73
Deviance = 47.34, Degrees of freedom = 41

From the normal distribution, two-sided p-value = 2 × Pr(Z > 1.73) = 0.085

Conclusion: There is suggestive but inconclusive evidence of an interaction.


203
Example ( Confidence Interval for a Coefficient)
Confidence intervals for the odds of survival for females divided by the odds of survival for males, accounting for age, from the model without interaction

Variable          Coefficient   Standard Error   Z-Statistic
Constant             3.230          1.387           2.33
age                 −0.078          0.037          −2.10
Sex (male = 1)      −1.597          0.755          −2.11
Deviance = 51.256, Degrees of freedom = 42, Z.025 = 1.96

95% Confidence Interval for the coefficient of Sex: −1.597 ± 1.96 × 0.755 = −3.077 to −0.117

Take anti-logarithms of the endpoints to get an interval for the odds ratio: 0.046 to 0.889

Conclusion: The odds of survival for males are estimated to have been exp(−1.597) = 0.203, i.e., one-fifth of the odds of survival for females of similar age.
204
Example ( Confidence Interval for a Coefficient)
Confidence intervals for the odds of survival for females divided by the odds of survival for males, accounting for age, from the model without interaction

Variable     Coefficient   Standard Error   Z-Statistic
Constant        1.633          1.105           1.48
age            −0.078          0.037          −2.11
fem             1.597          0.753           2.10
Deviance = 51.256, Degrees of freedom = 42

95% Confidence Interval for the coefficient of fem: 1.597 ± 1.96 × 0.753 = 0.121 to 3.073

Take anti-logarithms of the endpoints to get an interval for the odds ratio: exp(0.121) to exp(3.073), i.e., 1.13 to 21.6

Conclusion: The odds of survival for females are estimated to have been 4.9 times the odds of survival for males of similar age (95% CI: 1.1 times to 21.6 times).
205
Intentionally Kept Blank

206
Continuous Covariate and Invariance to Linear
Transformation, p. 122, Text
Let x′ = a + x/k. [e.g., x = weight in kg, x′ = weight in lb]
Thus, x′ = 1 ⟺ x = k − ka, and x′ = 0 ⟺ x = −ka

Let the two logistic regression models be defined by
logit(P(Y = 1 | x)) = α + βx;  logit(P(Y = 1 | x′)) = α′ + β′x′

Then
e^(β′) = [odds at x′ = 1] / [odds at x′ = 0]
       = [odds at x = k − ka] / [odds at x = −ka]
       = e^(α + β(k − ka)) / e^(α + β(−ka)) = e^(βk)

Thus, β′ = kβ. Similarly, α′ = α + β(−ka)
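A quick numerical check of this invariance in R (a small simulation; a = 2, k = 10 and the true coefficients are arbitrary choices):

set.seed(1)
x  <- rnorm(200)
y  <- rbinom(200, 1, plogis(-1 + 0.8 * x))  # true alpha = -1, beta = 0.8
xp <- 2 + x / 10                            # x' = a + x/k
b  <- coef(glm(y ~ x,  family = binomial))["x"]
bp <- coef(glm(y ~ xp, family = binomial))["xp"]
unname(bp / b)                              # equals k = 10, as derived above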

207
Exact Tests for Categorical and Count
Data

208
Exact Test of Two Poisson Means
• Compare parameters 1 and 2 of two Poisson dist
• Let X1i, i=1, …, n1 and X2i, i=1, …, n2 be two independent
random samples from respective distributions.
• Want to test H0: 1 = 2 (=, unknown)
• Make use of the statistics Y1 = Σ_{i=1}^{n1} X1i ~ Poi(n1λ1) and Y2 = Σ_{i=1}^{n2} X2i ~ Poi(n2λ2). We concentrate on samples for which Y = Y1 + Y2 is fixed, the same as the observed sum of Y1 and Y2. Then, under H0, the conditional dist of Y1 given Y = y is Bin(y, n1/(n1 + n2))

209
Exact Test of Two Poisson Means
• If Ha: λ1 > λ2, compute the P-value = P(Y1 ≥ y1,obs | Y = y), where Y1 | Y = y ~ Bin(y, n1/(n1 + n2)).

Compare it with α for acceptance or rejection of H0.

• If Ha: λ1 < λ2, compute the P-value = P(Y1 ≤ y1,obs | Y = y).
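A sketch in R with hypothetical counts; base R's poisson.test() wraps exactly this conditional binomial test:

y1 <- 18; y2 <- 8   # observed Poisson totals (hypothetical)
n1 <- 10; n2 <- 10  # sample sizes
# Under H0, Y1 | Y1 + Y2 = y ~ Bin(y, n1/(n1 + n2)); here Ha: lambda1 > lambda2
1 - pbinom(y1 - 1, size = y1 + y2, prob = n1 / (n1 + n2))  # P(Y1 >= 18)
poisson.test(c(y1, y2), c(n1, n2), alternative = "greater")$p.value  # same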

• Exercise: Find a nice business application

210
Intentionally Kept Blank

211
Exact Test of Multiple Poisson Means
• Compare parameters 1, 2, …, k of k Poisson dist
• Let {Xij, j=1, …, ni}, i=1, …, k, be k independent random
samples from respective distributions.
• Want to test H0: 1 = 2 = … = k (=, unknown)
• Make use of the statistics Yi = Σ_{j=1}^{ni} Xij ~ Poi(niλi), i = 1, …, k. We concentrate on samples for which Y = Y1 + … + Yk is the same as the observed sum y = y1 + … + yk. Then, under H0, the conditional dist of Y1 given Y = y is Bin(y, n1/(n1 + … + nk)); jointly, (Y1, …, Yk) given Y = y is Multinomial(y; n1/Σni, …, nk/Σni)

212
Intentionally Kept Blank

213
Exact Test of Two Proportions (Homogeneity)
(GGD, Fundamentals, Vol 1, Ch 15)

• Two populations for which the proportions of subjects with a certain characteristic are p1 and p2. Random samples of sizes n1 and n2 are drawn independently from the two populations. Let x1 and x2 denote the numbers of members having the characteristic in the samples.
• Want to test H0: p1 = p2 (= p, unknown)
• Make use of the statistics X1, X2, but concentrate on samples for which X = X1 + X2 is fixed, the same as the observed sum of x1 and x2.
214
Exact Test of Two Proportions (Homogeneity)

• Under H0, the conditional pmf of X1 given X = x1 + x2 is Hypergeometric(“N” = n1 + n2, “K” = n1, “n” = x1 + x2)

• If the observed value of X1 is x10 and that of X is x0, then use the conditional pmf f(x1 | x0) of X1 for testing H0.
215
Exact Test of Two Proportions (Homogeneity)

• H0: p1 = p2 against Ha: p1 > p2: the P-value is computed as P(X1 ≥ x10 | X = x0) = Σ_{x1 ≥ x10} f(x1 | x0)

• H0: p1 = p2 against Ha: p1 < p2: the P-value is computed as P(X1 ≤ x10 | X = x0) = Σ_{x1 ≤ x10} f(x1 | x0)

216
Example: Exact Test of Two Proportions
• Example. Compare two methods of treatment of an allergy. Method 1 (A) uses 15 patients and Method 2 (B) uses 14. Is Method 2 better than Method 1?

• Here n1 = 15, n2 = 14, x0 = 17, x10 = 6 and Ha: p1 < p2
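The slide's numbers can be checked in R (x10 = 6 successes out of n1 = 15 under Method 1, hence 17 − 6 = 11 successes out of n2 = 14 under Method 2):

# Under H0, X1 | X1 + X2 = 17 ~ Hypergeometric("N" = 29, "K" = 15, "n" = 17)
phyper(6, m = 15, n = 14, k = 17)  # P(X1 <= 6) ~ 0.041
# The same one-sided P-value via the 2 x 2 table:
tab <- matrix(c(6, 9,    # Method 1: 6 successes, 9 failures
                11, 3),  # Method 2: 11 successes, 3 failures
              nrow = 2, byrow = TRUE)
fisher.test(tab, alternative = "less")$p.value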

217
Intentionally Kept Blank

218
Exact Test of Multiple Proportions
(GGD, Fundamentals, Vol 1, Ch 15)

• k populations for which proportions of subjects


with certain characteristic are p1, …, pk. Random
samples of sizes n1, …, nk are drawn
independently from the k populations. Let X1, …,
Xk denote numbers of members having
characteristic in the samples.
• Want to test H0: p1 = p2 = … = pk (= p, unknown)
• Make use of the statistics X1, …, Xk, but
concentrate on samples for which X=X1+…+Xk is
fixed, same as observed sum of x1, …, xk.
219
Exact Test of Multiple Proportions
• Under H0, the conditional pmf of X1 given X = x1 + … + xk is Hypergeometric(“N” = n1 + … + nk, “K” = n1, “n” = x1 + … + xk)

• If the observed value of X1 is x10 and that of X is x0, then use the conditional pmf f(x1 | x0) of X1 for testing H0.

220
Intentionally Kept Blank

221
Exact Test for Homogeneity of Two
Multinomial Distributions
• Xi= (Xi1,…,Xik) ~ Mult(ni, pi1,…,pik)
• H0: (p11,…,p1k)=(p21,…,p2k)=(p1,…,pk)
• Let X = (X1+X2), n =n1+n2
• Under H0, Xi ~ Mult(ni, p1,…,pk), X ~ Mult(n, p1,…,pk)
• Then conditional dist of X1 given X = x is
P(X1 = x1 | X = x) = [n1! n2! Π_{j=1}^{k} (x1j + x2j)!] / [(n1 + n2)! (x11! ⋯ x1k!)(x21! ⋯ x2k!)]

222
Intentionally Kept Blank

223
Exact Test: Independence of Two Attributes
• To test if two qualitative characters (attributes) A and B
are independent. Let P(A=Ai, B=Bj) = pij, i=1,…,k, j=1,…,l.
• Let P(A = Ai) = Σ_{j=1}^{l} pij = pi0 and P(B = Bj) = Σ_{i=1}^{k} pij = p0j
• To test H0: pij = pi0 p0j, for all i, j.
• nij = observed freq for cell AiBj. The marginal frequencies of Ai and Bj are ni0 = Σ_{j=1}^{l} nij and n0j = Σ_{i=1}^{k} nij

224
Exact Test: Independence of Two Attributes
• To test if two qualitative characters (attributes) A and B
are independent. Let P(A=Ai, B=Bj) = pij, i=1,…,k, j=1,…,l.
• To test H0: 𝑝𝑖𝑗 = 𝑝𝑖0 𝑝0𝑗 , for all i,j.
• nij = observed freq for cell AiBj. The marginal frequencies of Ai and Bj are ni0 = Σ_{j=1}^{l} nij and n0j = Σ_{i=1}^{k} nij
• Under H0, the conditional distribution of {nij, all i, j} for fixed marginals {ni0, n0j, all i, j} has the pmf

P({nij} | {ni0, n0j}) = [Π_i ni0! Π_j n0j!] / [n! Π_{i,j} nij!], where n = Σ_{i,j} nij
225
Exact Test: Independence of Two Attributes
• Example: Data collected on a random sample of people attending a preview of a movie [2×2 table of age group (young/old) by opinion, shown as a figure in the slide]
• Did the movie have equal appeal to the young and the old, or was it more liked by the young?
• Test H0: the two attributes are independent, against Ha: they are positively associated.
227
Exact Test: Independence of Two Attributes
• Add up the probabilities, under H0, of the given table and of those indicating more extreme positive association (and having the same marginals). [These tables and the corresponding probabilities were shown as a figure in the slide.]
• So, the P-value = 0.0198 < 0.05 ⇒ Ha seems true
228
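For a 2×2 table this computation is the one-sided Fisher exact test, available directly in R (a sketch; the counts below are hypothetical, since the slide's actual table was a figure):

tab <- matrix(c(9, 3,   # young: liked, did not like (hypothetical)
                4, 9),  # old:   liked, did not like (hypothetical)
              nrow = 2, byrow = TRUE)
fisher.test(tab, alternative = "greater")  # Ha: positive association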


Intentionally Kept Blank

229
Degrees of Freedom
The df for the LRT statistic −2 loge(Λ) is the difference in dimensionality of Θ and Θ0, when H0: θ ∈ Θ0 is true (H1: θ ∈ Θ0c).
Let I = no. of distinct patterns of covariates; then df:
1. Independence Test: {IJ − 1} − {(I − 1) + (J − 1)} = (I − 1)(J − 1)
2. Homogeneity Test: {I(J − 1)} − (J − 1)
3. Multinomial Logistic (J = 3, say): {I(J − 1)} − (J − 1) (no. of model parameters under H0 = J − 1)
4. Binary Logistic (J = 2): I(2 − 1) − 2

230
Intentionally Kept Blank
Beyond Syllabus beyond this

231
Multinomial Goodness-of-Fit
Suppose Y is a random variable with k possible
outcomes, c1,c2,…,ck with probabilities π1, π2,…, πk,
where the outcomes are mutually exclusive and
exhaustive categories.
Let n realizations of Y be observed; we can
summarize the responses through the random vector
X = (X1, X2,…, Xk) where Xi is the frequency of cell i.
Then X has a multinomial distribution with
parameters n and (π1, π2,…, πk ).
Σ_{i=1}^{k} πi = 1 and Σ_{i=1}^{k} Xi = n.

232
Pearson’s χ2

Suppose our hypothesis of interest is

H0: πi = π0i, i = 1, …, k    (1)

The fit of the model is assessed by comparing the frequencies expected in each cell against the observed frequencies. If there is substantial discrepancy between the observed frequencies and those expected from the null model, then it would be wise to reject the null model.
The best known goodness-of-fit statistic used to test the hypothesis in (1) is Pearson’s χ²:

χ² = Σ_{i=1}^{k} (xi − nπ0i)² / (nπ0i)
233
The Power Divergence Family (Optional)
The power divergence statistic is defined as

2I^λ(x | n, π) = [2 / (λ(λ + 1))] Σ_{i=1}^{k} xi [ (xi/(nπi))^λ − 1 ]

where λ is the family parameter. The term “power divergence” describes the fact that the divergence uses different powers of the ratio xi/(nπi) to construct the measure.
In the above expression we are measuring how far the empirical probability distribution xi/n diverges from the hypothesized distribution πi.
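A direct R sketch of the statistic (λ = 0 is the limiting case and is handled separately; λ = −1 also needs its own limit and is omitted here):

# 2 * I^lambda for counts x and hypothesized probabilities pi0
power_div <- function(x, pi0, lambda) {
  n <- sum(x)
  if (lambda == 0) return(2 * sum(x * log(x / (n * pi0))))  # G^2 limit
  (2 / (lambda * (lambda + 1))) * sum(x * ((x / (n * pi0))^lambda - 1))
}
x <- c(18, 22, 21, 24, 15); pi0 <- rep(0.2, 5)
power_div(x, pi0, 1)                        # lambda = 1: 2.5
sum((x - sum(x) * pi0)^2 / (sum(x) * pi0))  # Pearson chi-square, also 2.5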
234
Specific values of λ
For λ = 1 we get PCS = (1/2) Σ_{i=1}^{k} (Xi − nπi)² / (nπi)

For λ = 0 we get LD = Σ_{i=1}^{k} Xi log( Xi / (nπi) )

For λ = −1/2 we get HD = 2 Σ_{i=1}^{k} ( √Xi − √(nπi) )²

For λ = −1 we get KLD = Σ_{i=1}^{k} nπi log( nπi / Xi )

For λ = −2 we get NCS = (1/2) Σ_{i=1}^{k} (Xi − nπi)² / Xi
235
Asymptotic distribution of power
divergence statistics
Consider the testing problem of the following simple null hypothesis,

H0: πi = π0i for all i = 1, 2, …, k

for known values π01, π02, …, π0k. For all real λ the asymptotic distribution of the power divergence statistic 2I^λ(X | n, π0) is approximately chi-square with (k − 1) degrees of freedom.

236
Power Comparison
Consider the equi-probable null
H0: πi = 1/k, i = 1, 2, …, k
against the alternative
H1: πi = [1 − η/(k − 1)]/k, i = 1, 2, …, (k − 1);  πk = (1 + η)/k
237
Some specific alternatives for k = 5
The equiprobable null hypothesis specifies the
probability vector (0.2, 0.2, 0.2, 0.2, 0.2)
Consider the following alternatives:
• At η = 1.5, we have (0.125, 0.125, 0.125, 0.125, 0.5) [bump alternative]
• At η = 0.5, we have (0.175, 0.175, 0.175, 0.175, 0.3) [bump alternative]
• At η = −0.9, we have (0.245, 0.245, 0.245, 0.245, 0.02) [dip alternative]
238
Exact Power Functions Of the Randomized Test for the Equiprobable Null Model Against the Alternative
α = 0.05, n = 20, k = 5
λ \ η     1.5      0.5      −0.9
-5.0 0.2253 0.0742 0.5880
-2.0 0.2253 0.0742 0.5880
-1.0 0.2253 0.0742 0.5880
-0.5 0.3361 0.0800 0.5875
-0.3 0.4468 0.0864 0.5693
0.0 0.6100 0.1073 0.4466
0.3 0.6605 0.1146 0.3760
0.5 0.6806 0.1188 0.3216
0.7 0.6907 0.1211 0.2851
1.0 0.6997 0.1228 0.2720
1.5 0.7168 0.1257 0.2297
2.0 0.7306 0.1278 0.1896
2.5 0.7392 0.1288 0.1760
5.0 0.7498 0.1296 0.1464
239
References
• Agresti, A. (2012). Categorical Data Analysis, Wiley Series in
Probability and Statistics.
• Bishop, Y., Fienberg, S. E. and Holland, P. W. (1975). Discrete
Multivariate Analysis, MIT Press, Cambridge.
• Christensen, R. (1990). Loglinear Models. Springer-Verlag,
New York.
• Ramsey, F. L. and Schafer, D. W. (1997). The Statistical Sleuth.
Duxbury Press, Belmont, California.
• Read, T. R. C. and Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer-Verlag, New York.
• Goon, Gupta, Dasgupta, Fundamentals of Statistics, Volume
One.
240