5. Logistic Regression
Illustration – Stata version 14
March 2017
(a) In Stata, input directly from the internet using the command use
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear
(b) From the course website, right click to download. Afterwards, in Stata, use FILE > OPEN
See http://people.umass.edu/biep640w/webpages/demonstrations.html
tab2
Stata will order the rows and columns according to the numeric values of the row and column variable.
For a 0/1 variable, row 1 will be the value “0” row. Row 2 will be the value “1” row.
For a 1/2 variable, row 1 will be the value “1” row. Row 2 will be the value “2” row.
Columns are ordered similarly.
cc, cs
Stata assumes that you are using 0/1 variables here with 1= event and 0=non-event
Stata will order the rows and columns according to event, with event being the first row (or column)
Thus, row 1 will be the value “1=event” row. Row 2 will be the value “0=non-event” row.
Columns are ordered similarly.
Suppose we are interested in the 2x2 table cross-classification of heavy smoking (30+ gm/day versus other) and case
status (esophageal cancer case versus control):
Grouped |
tobacco | exposure12
consum. | heavy other | Total
-----------+----------------------+----------
0-9 gm/day | 0 526 | 526
10-19 | 0 236 | 236
20-29 | 0 131 | 131
30+ | 82 0 | 82
-----------+----------------------+----------
Total | 82 893 | 975
Case |
status |
(1=case, | case12
0=control) | case control | Total
-----------+----------------------+----------
0 | 0 775 | 775
1 | 200 0 | 200
-----------+----------------------+----------
Total | 200 775 | 975
Grouped |
tobacco | exposure01
consum. | other heavy | Total
-----------+----------------------+----------
0-9 gm/day | 526 0 | 526
10-19 | 236 0 | 236
20-29 | 131 0 | 131
30+ | 0 82 | 82
-----------+----------------------+----------
Total | 893 82 | 975
The commands cc and cs are commands for epidemiological analyses of 2x2 tables where the convention is to
have cases be in row 1 (controls in row 2) and exposed be in column 1 (non-exposed in column 2).
| case12
exposure12 | case control | Total
-----------+----------------------+----------
heavy | 31 51 | 82
other | 169 724 | 893
-----------+----------------------+----------
Total | 200 775 | 975
Better. Heavy exposure is now row 1 and cases are now in column 1.
| exposure12
case12 | heavy other | Total
-----------+----------------------+----------
case | 31 169 | 200
control | 51 724 | 775
-----------+----------------------+----------
Total | 82 893 | 975
Or, you might like this arrangement. Cases are now in row 1 and heavy exposure is now in column 1.
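The ordering rules above can be checked without the full data set. The following is a minimal sketch that rebuilds the 2x2 table from the four cell counts shown in the tables (31, 169, 51, 724), then compares tabulate on 1/2-coded variables with cc on 0/1-coded variables. The names heavy, n, case12f and exposure12f are illustrative and are not part of the course data set.

```stata
* Rebuild the 975 observations from the four cell counts in the tables above
clear
input byte case byte heavy int n
1 1  31
1 0 169
0 1  51
0 0 724
end
expand n
drop n

* 1/2 coding: tabulate sorts by numeric value, so code 1 (case, heavy) comes first
generate byte case12     = cond(case  == 1, 1, 2)
generate byte exposure12 = cond(heavy == 1, 1, 2)
label define case12f     1 "case"  2 "control"
label define exposure12f 1 "heavy" 2 "other"
label values case12     case12f
label values exposure12 exposure12f
tabulate case12 exposure12

* 0/1 coding: cc expects 1 = case and 1 = exposed, and lists those first
cc case heavy

* Immediate form, in the order:
* exposed cases, unexposed cases, exposed controls, unexposed controls
cci 31 169 51 724
```

Both cc and cci should report the same odds ratio, (31 x 724) / (169 x 51), which is about 2.60.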
This data set has two continuous variables: age (age, years) and tob (tobacco consumption, gm/day). Here we
consider the variable age. In this lab session, let’s create two new variables. Each is a two-step process.
. *
. *** Create age_quartile = quartiles of age, coded 1, 2, 3, 4
. centile age, c(0 25 50 75 100)
-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 975 0 25 25 25*
| 25 41 40 42
| 50 52 51 53
| 75 63 62 64
| 100 91 91 91*
. generate age_quartile=age
. recode age_quartile (min/41=1) (41.01/52=2) (52.01/63=3) (63.01/91=4)
(age_quartile: 975 changes made)
. *
. *** Create age_qmedian = quartile medians of age, in years
. sort age_quartile
. by age_quartile: centile age, c(50)
------------------------------------------------------------------------------------------
-> age_quartile = 1
-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 250 50 35 34 35
------------------------------------------------------------------------------------------
-> age_quartile = 2
-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 248 50 47 46 48
------------------------------------------------------------------------------------------
-> age_quartile = 3
-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 239 50 59 59 60
------------------------------------------------------------------------------------------
-> age_quartile = 4
-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 238 50 69 68 69
. generate age_qmedian=age_quartile
. recode age_qmedian (1=35) (2=47) (3=59) (4=69)
(age_qmedian: 975 changes made)
. label variable age_quartile "Quartile of Age"
. label variable age_qmedian "Quartile Median Age"
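The two-step recipe above (look up the centiles, then recode by hand) can also be sketched with xtile and egen. This is an alternative, not the method used in the log: xtile computes its cutpoints a little differently from centile, so group sizes and medians may differ slightly from the 250/248/239/238 split shown above. The names age_q4 and age_q4median are hypothetical.

```stata
* Assumes the course data set is reachable and contains age, as in the log above
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

* Quartile group (coded 1-4) in one step
xtile age_q4 = age, nquantiles(4)

* Median age within each quartile group, attached to every observation
egen age_q4median = median(age), by(age_q4)
label variable age_q4 "Quartile of Age (xtile)"
label variable age_q4median "Quartile Median Age (egen)"

* Check: size and age range of each group
tabstat age, by(age_q4) statistics(n min median max)
```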
Summary
The commands logit and logistic will fit logistic regression models. Using logit with no option will produce betas.
Using logistic will produce odds ratios. You can also get odds ratios using the command logit with or as an option.
The user-written commands eststo, estout and esttab (install them with ssc install estout) produce comparisons of models that are easier to read.
After creating some new variables for illustration purposes, 4 logistic regression models are fit and then compared
side-by-side.
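Model 1: Predictors = heavy drinking, age
Model 2: Predictors = heavy smoking, age
Model 3: Predictors = heavy drinking, heavy smoking, age
Model 4: Predictors = heavy drinking, heavy smoking, drinking x smoking interaction, age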
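As a quick illustration of the difference between logit and logistic, the sketch below fits one small model three ways and then compares two models side by side. It uses only variables that already exist in the data set (case, agegp, tobgp), so the models here are not the four lab models. The stored names m_age and m_age_tob are hypothetical, and the esttab lines assume the estout package has been installed.

```stata
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

* Betas (log odds ratios)
logit case i.tobgp

* Odds ratios, two equivalent ways
logit case i.tobgp, or
logistic case i.tobgp

* Side-by-side comparison (user-written; run "ssc install estout" once first)
eststo clear
eststo m_age:     quietly logit case i.agegp
eststo m_age_tob: quietly logit case i.agegp i.tobgp

* eform = exponentiated coefficients (odds ratios); ci = 95% confidence intervals
esttab m_age m_age_tob, eform ci bic
```

Whichever command is used, the stored coefficients are the betas; the or option and logistic only change how the results are displayed.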
Grouped |
alcohol | Heavy Drinker
consum. | < 80 gm/d 80+ gm/da | Total
------------+----------------------+----------
0-39 gm/day | 414 0 | 414
40-79 | 355 0 | 355
80-119 | 0 139 | 139
120+ | 0 67 | 67
------------+----------------------+----------
Total | 769 206 | 975
. * HEAVY SMOKER: Create smoking_30plus = 0/1 measure of tobacco use >=30 gm/day.
. generate smoking_30plus=tobgp
. recode smoking_30plus (1=0) (2=0) (3=0) (4=1)
(smoking_30plus: 975 changes made)
. label define smokingf 0 "< 30 gm/day" 1 "30+ gm/day"
. label values smoking_30plus smokingf
. * Check variable creation
. numlabel, add
. tab2 tobgp smoking_30plus
-> tabulation of tobgp by smoking_30plus
Grouped |
tobacco | smoking_30plus
consum. | 0. < 30 g 1. 30+ gm | Total
--------------+----------------------+----------
1. 0-9 gm/day | 526 0 | 526
2. 10-19 | 236 0 | 236
3. 20-29 | 131 0 | 131
4. 30+ | 0 82 | 82
--------------+----------------------+----------
Total | 893 82 | 975
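The heavy drinker indicator and the interaction term used in the models can be built with the same generate/recode pattern. The log does not show these steps, so the sketch below rests on two assumptions: that the grouped alcohol variable is named alcgp and coded 1 to 4 (check with describe), and that drinker_smoker is the product of the two 0/1 indicators. The label name alcoholf is hypothetical.

```stata
* Assumes the data set is loaded and smoking_30plus exists as created above.
* ASSUMPTION: the grouped alcohol variable is named alcgp, coded 1-4
* (1 = 0-39, 2 = 40-79, 3 = 80-119, 4 = 120+ gm/day).
generate alcohol_80plus = alcgp
recode alcohol_80plus (1=0) (2=0) (3=1) (4=1)
label define alcoholf 0 "< 80 gm/day" 1 "80+ gm/day"
label values alcohol_80plus alcoholf

* Check variable creation against the alcohol table above (769 and 206)
tab2 alcgp alcohol_80plus

* ASSUMPTION: interaction = product of the two 0/1 indicators
generate drinker_smoker = alcohol_80plus * smoking_30plus
label variable drinker_smoker "Heavy drinker AND heavy smoker"
tab2 alcohol_80plus smoking_30plus
```

An alternative that avoids creating the product by hand is factor-variable notation, for example logistic case i.alcohol_80plus##i.smoking_30plus i.agegp, which fits both main effects and their interaction.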
| age3544
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 0 199 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 776 199 | 975
| age4554
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 0 213 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 762 213 | 975
| age5564
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 0 242 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 733 242 | 975
| age6574
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 0 161 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 814 161 | 975
| age75plus
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 0 44 | 44
-----------+----------------------+----------
Total | 931 44 | 975
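The five indicator variables tabulated above can be created with logical expressions. This is a sketch of one way to do it, assuming agegp is coded 1 to 6 as the value labels suggest; the log does not show the commands that were actually used.

```stata
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

* 25-34 (code 1) is the reference group, so it gets no indicator
generate byte age3544   = (agegp == 2)
generate byte age4554   = (agegp == 3)
generate byte age5564   = (agegp == 4)
generate byte age6574   = (agegp == 5)
generate byte age75plus = (agegp == 6)

* Check one of them against the tables above
tab2 agegp age3544

* Hand-made indicators and i.agegp should give identical odds ratios
logistic case age3544 age4554 age5564 age6574 age75plus
logistic case i.agegp
```

A logical expression such as (agegp == 2) evaluates to 0 for missing agegp, so this shortcut is only safe here because all 975 observations have an age group.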
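. * Logistic model with hand-made design variables for agegp
. logistic case alcohol_80plus age3544 age4554 age5564 age6574 age75plus
--------------------------------------------------------------------------------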
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
age3544 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
age4554 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
age5564 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
age6574 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
age75plus | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------
. *
. * Logistic model with Stata-defined design variables for agegp
. logistic case alcohol_80plus i.agegp
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
|
agegp |
2. 35-44 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
3. 45-54 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
4. 55-64 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
5. 65-74 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
6. 75+ | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
|
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------
The two outputs match. For the rest of this lab session, I will use Stata-defined design variables (the i.agegp notation).
. * MODEL 1 –
. * Logistic Regression Heavy Drinking Alone - adjusted for age
. logistic case alcohol_80plus i.agegp
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
|
agegp |
2. 35-44 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
3. 45-54 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
4. 55-64 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
5. 65-74 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
6. 75+ | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
|
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------
. * MODEL 2 –
. * Logistic Regression Heavy Smoking Alone - adjusted for age
. logistic case smoking_30plus i.agegp
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
smoking_30plus | 4.211492 1.205916 5.02 0.000 2.402722 7.381906
|
agegp |
2. 35-44 | 6.267996 6.675411 1.72 0.085 .7773197 50.54262
3. 45-54 | 38.39114 39.30348 3.56 0.000 5.161798 285.536
4. 55-64 | 65.17199 66.48418 4.09 0.000 8.82513 481.2834
5. 65-74 | 82.44814 84.59853 4.30 0.000 11.03516 616.0035
6. 75+ | 59.4483 63.32511 3.84 0.000 7.369337 479.5683
|
_cons | .0060567 .0061361 -5.04 0.000 .0008315 .0441165
--------------------------------------------------------------------------------
. * MODEL 3 –
. * Logistic Regression Heavy Drinking and Heavy Smoking - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
|
agegp |
2. 35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
3. 45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
4. 55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
5. 65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
6. 75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------
. * MODEL 4 –
. * Logistic Regression Heavy Drinking and Heavy Smoking PLUS INTERACTION - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp drinker_smoker
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.018372 1.012542 7.99 0.000 3.379235 7.452591
smoking_30plus | 3.726905 1.379709 3.55 0.000 1.803979 7.699546
|
agegp |
2. 35-44 | 6.563396 7.142474 1.73 0.084 .7777266 55.38986
3. 45-54 | 34.21248 35.78455 3.38 0.001 4.404231 265.7657
4. 55-64 | 58.12253 60.60994 3.90 0.000 7.528611 448.7187
5. 65-74 | 83.55392 87.73638 4.21 0.000 10.66981 654.3001
6. 75+ | 74.15214 80.85386 3.95 0.000 8.749685 628.4273
|
drinker_smoker | 1.251609 .839525 0.33 0.738 .3361396 4.660342
_cons | .0039536 .0041028 -5.33 0.000 .0005172 .0302218
--------------------------------------------------------------------------------
The option “eform” stands for “exponentiated coefficients.” Thus, these are the odds ratios.
1b.agegp 0 0 0 0
(.) (.) (.) (.)
drinker_sm~r 0.224
(0.33)
This tabulation shows the betas. Underneath each beta, in parentheses, is its z-statistic = (beta/standard error); esttab labels these as t statistics.
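The link between the beta table and the odds ratio table can be checked by hand. The sketch below takes the odds ratio and its standard error for drinker_smoker from the Model 4 output and recovers the beta, the z-statistic and the 95% confidence interval. The scalar names are hypothetical.

```stata
* Values copied from the Model 4 output for drinker_smoker
scalar or_hat = 1.251609
scalar se_or  = .839525

scalar b    = ln(or_hat)        // beta = log odds ratio
scalar se_b = se_or / or_hat    // Stata reports SE(OR) = OR x SE(beta)
scalar z    = b / se_b

display "beta = " %6.3f b "   z = " %5.2f z

* The confidence interval is computed on the beta scale, then exponentiated
scalar lo = exp(b - invnormal(.975)*se_b)
scalar hi = exp(b + invnormal(.975)*se_b)
display "95% CI for OR: " %6.3f lo " to " %6.3f hi
```

This reproduces the 0.224 and (0.33) shown in the beta table and the interval [0.336, 4.660] shown in the odds ratio table.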
1b.agegp 1 1 1
[1,1] [1,1] [1,1]
I am showing models 1, 2, and 3 only because the output wraps around (and is unreadable) if I try to show
all 4 models. This tabulation shows the odds ratios and associated 95% CI.
1b.agegp 1 1
[1,1] [1,1]
drinker_sm~r 1.252
[0.336,4.660]
----------------------------------------------------------------
n
chi2 219.2*** 219.3***
bic 825.3 832.1
----------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001
Summary
It is of interest to know whether adding extra predictors to a model improves its fit by a statistically significant
amount. The smaller model (“reduced”) contains the control variables. The larger model (“full”) contains the control
variables plus the extra variables in question.
Models.
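Reduced: $\operatorname{logit}[\pi \mid X_1, X_2, \ldots, X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
Full: $\operatorname{logit}[\pi \mid X_1, X_2, \ldots, X_p, X_{p+1}, X_{p+2}, \ldots, X_{p+k}] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_{p+1} X_{p+1} + \cdots + \beta_{p+k} X_{p+k}$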
A likelihood ratio test is performed to assess the statistical significance of the interaction of heavy drinking and heavy
smoking in the model, controlling for age and the main effects of each of heavy drinking and heavy smoking. Thus,
Model “reduced”: Predictors = age, heavy drinking, heavy smoking
Model “full”: Predictors = age, heavy drinking, heavy smoking + (drinking x smoking)
. * Full model
. logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
Logistic regression Number of obs = 975
LR chi2(8) = 219.35
Prob > chi2 = 0.0000
Log likelihood = -385.07068 Pseudo R2 = 0.2217
(-2) Log likelihood Full Model = 770.14136
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
agegp |
2. 35-44 | 6.563396 7.142474 1.73 0.084 .7777266 55.38986
3. 45-54 | 34.21248 35.78455 3.38 0.001 4.404231 265.7657
4. 55-64 | 58.12253 60.60994 3.90 0.000 7.528611 448.7187
5. 65-74 | 83.55392 87.73638 4.21 0.000 10.66981 654.3001
6. 75+ | 74.15214 80.85386 3.95 0.000 8.749685 628.4273
|
smoking_30plus | 3.726905 1.379709 3.55 0.000 1.803979 7.699546
alcohol_80plus | 5.018372 1.012542 7.99 0.000 3.379235 7.452591
drinker_smoker | 1.251609 .839525 0.33 0.738 .3361396 4.660342
_cons | .0039536 .0041028 -5.33 0.000 .0005172 .0302218
--------------------------------------------------------------------------------
. * Full model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
. estimates store full
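The remaining steps of the likelihood ratio test can be sketched as follows. The sketch assumes the full model has just been stored as full and that the lab variables exist; the stored name reduced is hypothetical. The hand check uses the two log likelihoods reported in this handout (-385.07068 for the full model and -385.128 for the model without the interaction).

```stata
* Reduced model: same predictors without the interaction term
quietly logistic case i.agegp smoking_30plus alcohol_80plus
estimates store reduced

* H0: the interaction beta is 0 (k = 1 extra parameter, so 1 degree of freedom)
lrtest full reduced

* Hand check: LR = 2 x (lnL full - lnL reduced)
scalar lr = 2*(-385.07068 - (-385.128))
display "LR chi2(1) = " %6.4f lr "   p = " %5.3f chi2tail(1, lr)
```

The statistic is small (about 0.11 on 1 degree of freedom), which agrees with the non-significant Wald test for drinker_smoker (p = 0.738) in the full model output.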
Summary
Now you have a model that is your “candidate” final model. There are lots of further explorations you can do to
assess whether this really is a “good” final model.
Having retained the null hypothesis in our likelihood ratio test of the interaction of heavy smoking and heavy drinking,
our “candidate” final model contains: heavy drinking, heavy smoking, and age.
. * Before requesting any diagnostics of a model, you must have fit it.
. logistic case i.agegp smoking_30plus alcohol_80plus
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
agegp |
35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------
. *
. ***** 5a) Numerical measures of fit using command FITSTAT
. fitstat
PARTIAL KEY:
Log-Lik Intercept Only = -494.744: This is the log likelihood for the intercept only model
Log-Lik Full Model = -385.128: This is the log likelihood for the current model
LR(7) = 219.233 is the likelihood ratio chi square statistic which tests whether the current
model predicts better than the intercept only model
Prob > LR = .0001: This is the p-value for the LR(7) test
Then there are a series of pseudo-R2 measures.
Finally, there are a series of information criterion measures that are used to compare different models.
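The first lines of the key can be reproduced from the two log likelihoods alone. The following is a small worked check; the scalar names are hypothetical. Note that fitstat is a user-written command, so it may need to be installed first (search fitstat will locate it).

```stata
scalar ll_0 = -494.744     // intercept-only model
scalar ll_m = -385.128     // current model (7 estimated slopes)

* Likelihood ratio statistic and its p-value on 7 degrees of freedom
scalar lr = 2*(ll_m - ll_0)
display "LR(7)     = " %8.3f lr
display "Prob > LR = " %6.4f chi2tail(7, lr)

* McFadden's pseudo-R2, one of the pseudo-R2 measures
scalar r2_mcf = 1 - ll_m/ll_0
display "McFadden pseudo-R2 = " %6.4f r2_mcf
```

The hand calculation gives 219.232; it differs from the 219.233 in the output only because the log likelihoods above are rounded to three decimals.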
. *
. ***** 5b) Test of Model Adequacy Using command LINKTEST
. linktest
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat | 1.009135 .1404655 7.18 0.000 .7338274 1.284442
_hatsq | .0039801 .0479037 0.08 0.934 -.0899094 .0978696
_cons | .0008299 .1243723 0.01 0.995 -.2429353 .2445952
------------------------------------------------------------------------------
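To see what linktest is doing, the same regression can be sketched by hand: refit the outcome on the model's linear predictor (_hat) and its square (_hatsq). If the model is adequately specified, the squared term should add nothing, which is what the non-significant _hatsq above (p = 0.934) suggests. The names xb_hat and xb_hat_sq are hypothetical, and the sketch assumes the lab variables exist.

```stata
* Refit the candidate model so that predict uses the right estimates
quietly logistic case i.agegp smoking_30plus alcohol_80plus

* Linear predictor (log odds scale) and its square
predict double xb_hat, xb
generate double xb_hat_sq = xb_hat^2

* Same regression that linktest reports
logit case xb_hat xb_hat_sq
```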
. *
. ***** 5c) Test of Overall Goodness of Fit Using command LFIT
. lfit, group(10) table
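In current Stata the same Hosmer-Lemeshow test is documented under the name estat gof; lfit is the older name and still works. A minimal sketch, assuming the candidate model's variables exist:

```stata
quietly logistic case i.agegp smoking_30plus alcohol_80plus

* Hosmer-Lemeshow test with 10 groups, plus the observed/expected table
estat gof, group(10) table

* p-value recomputed from the stored chi-square and degrees of freedom
display "p = " %6.4f chi2tail(r(df), r(chi2))
```

With only a handful of distinct covariate patterns, as here, Stata may not be able to form 10 equal-sized groups, so the grouping is worth inspecting in the table.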
. *
. ***** 6b) Plot of Y=Standardized Residual versus X=Observation Number
. predict std_residual, rs
. label variable std_residual "Standardized Residual"
. generate index=_n
. label variable index "Observation Number"
. *
. ***** 6c) Plot of Influential Observations: Y=Cook versus X=Observation Number
. predict cook, dbeta
. label variable cook "Cook Distance"
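The two plots named in the comments above can then be drawn with scatter. This is a sketch that assumes std_residual, cook and index have been created as shown; the reference lines at -2 and +2 are a common rule of thumb, not a requirement of the method.

```stata
* 6b) Standardized residual versus observation number
twoway scatter std_residual index, yline(-2 0 2)

* 6c) Influence (Pregibon delta-beta, labeled "Cook Distance") versus observation number
twoway scatter cook index, mlabel(index)
```

Labeling the points with mlabel(index) makes it easy to read off which observations stand out.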
Step 1 – Right click anywhere inside the review window. From drop down menu, choose SELECT ALL
Step 2 – Right click again. From drop down menu, choose SEND TO DO-FILE EDITOR
Stata will put you into the Do-File Editor, where you should see the commands from your session listed in order.
Step 3 – Click on the SAVE icon. At SAVE AS: provide a name. At WHERE: provide a path
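Once saved, a do-file is plain text and can be tidied by hand. Below is a minimal sketch of what a cleaned-up do-file for this lab might look like; the file names logistic_lab.do and logistic_lab.log are hypothetical, and only the first model is shown.

```stata
* logistic_lab.do  (hypothetical file name)
capture log close                            // close any log left open
log using "logistic_lab.log", replace text   // keep a text record of the output

use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear
logistic case i.agegp

log close
```

To rerun the whole session, type do "logistic_lab.do" in the command window, or click the Execute (do) icon in the Do-File Editor.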