Stata For Logistic Regression

BIOSTATS 640 – Spring 2017
5. Logistic Regression
Illustration – Stata version 14
March 2017
1. Tip: “1/2” Variables versus “0/1” Variables
2. Tip: How to Create Quartile Groupings of a Continuous Variable
3. Fit a Logistic Regression Model
4. Likelihood Ratio Test for 2 “Hierarchical” Models
5. Regression Diagnostics for Logistic Regression: Numerical
   a. Numerical Measures of Fit Using fitstat
   b. Test of Model Adequacy Using linktest
   c. Test of Overall Goodness-of-Fit Using lfit
6. Regression Diagnostics for Logistic Regression: Graphical
   a. Plot of ROC Curve Using lroc
   b. Plot of Standardized Residuals versus Observation Number
   c. Plot of Influential Observations Using Cook’s Distances
7. Tip: Save Your Commands to a DO File for Later Use
Preliminary – Download the Stata data set illeetvilaine.dta.

Note – This data set is available over the internet. You can read it directly into Stata, or download it from the course website.
(a) In Stata, read it directly from the internet with the command use (type straight quotes, not curly quotes):
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear
(b) From the course website, right-click to download. Afterwards, in Stata, use FILE > OPEN.
See http://people.umass.edu/biep640w/webpages/demonstrations.html
1. Tip - “1/2” Variables versus “0/1” Variables
Why the fuss?

Answer – Sometimes the arrangement of rows and columns in a 2x2 table is not what you expected. It depends on the command you use and on how the variables are coded.
tab2
Stata will order the rows and columns according to the numeric values of the row and column variable.
For a 0/1 variable, row 1 will be the value “0” row. Row 2 will be the value “1” row.
For a 1/2 variable, row 1 will be the value “1” row. Row 2 will be the value “2” row.
Columns are ordered similarly.
cc, cs
Stata assumes that you are using 0/1 variables here with 1= event and 0=non-event
Stata will order the rows and columns according to event, with event being the first row (or column)
Thus, row 1 will be the value “1=event” row. Row 2 will be the value “0=non-event” row.
Columns are ordered similarly.
Ille-et-Vilaine Data: Illustration
Suppose we are interested in the 2x2 table cross-classification of heavy smoking (30+ gm/day versus other) and case
status (esophageal cancer case versus control):
                              Disease (Esophageal Cancer)
Exposure (Heavy Smoking)      Yes       No    Total
Yes (30+ gm/day)               31       51       82
No                            169      724      893
Total                         200      775      975
Preliminary: Introduction to the command recode

Use recode to re-set the values of a variable. This is especially handy in the creation of a new variable. You can
recode a single old value to a new value. Or you can recode a whole range of values to a new value. For example -
. use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

. * recode variablename (oldvalue=newvalue) (rangelower/rangeupper=newvalue) etc.
. generate age12=age
. recode age12 (18=1) (19/max=2)

. * Create "1/2" variables when you want to use command tab2

. * “1/2” measure of heavy smoking (1=30+ gm/day versus 2=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure12=tobgp
. recode exposure12 (1=2) (2=2) (3=2) (4=1)
(exposure12: 739 changes made)
. label define exposure12f 2 "other" 1 "heavy"
. label values exposure12 exposure12f
. * "1/2" variable for case status (1=case versus 2=other)

. generate case12=case
. recode case12 (0=2)
(case12: 775 changes made)
. label define case12f 2 "control" 1 "case"
. label values case12 case12f
. * Check variable creations

. tab2 tobgp exposure12
-> tabulation of tobgp by exposure12
Grouped |
tobacco | exposure12
consum. | heavy other | Total
-----------+----------------------+----------
0-9 gm/day | 0 526 | 526
10-19 | 0 236 | 236
20-29 | 0 131 | 131
30+ | 82 0 | 82
-----------+----------------------+----------
Total | 82 893 | 975
. tab2 case case12

-> tabulation of case by case12
Case |
status |
(1=case, | case12
0=control) | case control | Total
-----------+----------------------+----------
0 | 0 775 | 775
1 | 200 0 | 200
-----------+----------------------+----------
Total | 200 775 | 975

. * Create "0/1" variables when you want to use commands cc, cs

. * “0/1” measure of heavy smoking (1=30+ gm/day versus 0=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure01=tobgp
. recode exposure01 (1=0) (2=0) (3=0) (4=1)
(exposure01: 975 changes made)
. label define exposure01f 0 "other" 1 "heavy"
. label values exposure01 exposure01f
. * "0/1" variable for case status (1=case versus 0=other)

. * This already exists as the variable case

. tab2 tobgp exposure01
-> tabulation of tobgp by exposure01
Grouped |
tobacco | exposure01
consum. | other heavy | Total
-----------+----------------------+----------
0-9 gm/day | 526 0 | 526
10-19 | 236 0 | 236
20-29 | 131 0 | 131
30+ | 0 82 | 82
-----------+----------------------+----------
Total | 893 82 | 975
. * The command cc works fine with 0/1 variables

. cc case exposure01
Proportion
| Exposed Unexposed | Total Exposed
-----------------+------------------------+------------------------
Cases | 31 169 | 200 0.1550
Controls | 51 724 | 775 0.0658
-----------------+------------------------+------------------------
Total | 82 893 | 975 0.0841
| |
| Point estimate | [95% Conf. Interval]
|------------------------+------------------------
Odds ratio | 2.604014 | 1.557944 4.2894 (exact)
Attr. frac. ex. | .6159775 | .3581283 .7668672 (exact)
Attr. frac. pop | .0954765 |
+-------------------------------------------------
chi2(1) = 16.42 Pr>chi2 = 0.0001
The commands cc and cs are commands for epidemiological analyses of 2x2 tables where the convention is to
have cases be in row 1 (controls in row 2) and exposed be in column 1 (non-exposed in column 2).
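As a cross-check outside Stata, the point estimates in the cc output can be reproduced by hand from the four cell counts. The short Python sketch below is for illustration only (the variable names are mine, not Stata's). It assumes the usual case-control formulas: the cross-product ratio for the odds ratio, (OR - 1)/OR for the attributable fraction among the exposed, and that fraction times the proportion of cases exposed for the population attributable fraction.

```python
# Cell counts from the 2x2 table of case status by heavy smoking
cases_exposed, cases_unexposed = 31, 169
controls_exposed, controls_unexposed = 51, 724

# Odds ratio = cross-product ratio
odds_ratio = (cases_exposed * controls_unexposed) / (cases_unexposed * controls_exposed)

# Attributable fraction among the exposed = (OR - 1) / OR
af_exposed = (odds_ratio - 1) / odds_ratio

# Population attributable fraction = af_exposed x proportion of cases who are exposed
prop_cases_exposed = cases_exposed / (cases_exposed + cases_unexposed)
af_population = af_exposed * prop_cases_exposed

print(f"Odds ratio        = {odds_ratio:.6f}")
print(f"Attr. frac. ex.   = {af_exposed:.7f}")
print(f"Attr. frac. pop   = {af_population:.7f}")
```

The three printed values can be compared with the Point estimate column of the cc output above.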

. * tab2 with 0/1 variables

. tab2 exposure01 case
-> tabulation of exposure01 by case
| Case status (1=case,

| 0=control)
exposure01 | 0 1 | Total
-----------+----------------------+----------
other | 724 169 | 893
heavy | 51 31 | 82
-----------+----------------------+----------
Total | 775 200 | 975
Heavy exposure is in row 2, the outcome of case=yes is in column 2.
. * tab2 with 1/2 variables - more to your liking?

. tab2 exposure12 case12
-> tabulation of exposure12 by case12
| case12
exposure12 | case control | Total
-----------+----------------------+----------
heavy | 31 51 | 82
other | 169 724 | 893
-----------+----------------------+----------
Total | 200 775 | 975
Better. Heavy exposure is now row 1 and cases are now in column 1.
. tab2 case12 exposure12

-> tabulation of case12 by exposure12
| exposure12
case12 | heavy other | Total
-----------+----------------------+----------
case | 31 169 | 200
control | 51 724 | 775
-----------+----------------------+----------
Total | 82 893 | 975
Or, you might like this arrangement. Cases are now in row 1 and heavy exposure is now in column 1.

2. Tip – How to Create Quartile Groupings of a Continuous Variable
Creating Quartiles is Useful in Assessing Linearity of Logit

In regression analysis, it is often of interest to explore whether the outcome (here, the logit) is linear in a continuous predictor. To do this, create a new variable that is a grouped version of the original continuous variable.
This data set has two continuous variables: age (age, years) and tob (tobacco consumption, gm/day). Here we
consider the variable age. In this lab session, let’s create two new variables. Each is a two-step process.
age_quartile = Quartile of age, coded 1, 2, 3 or 4
age_qmedian = Median of age, within quartile of age
. *
. *** Create age_quartile = quartiles of age, coded 1, 2, 3, 4
. centile age, c(0 25 50 75 100)
-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 975 0 25 25 25*
| 25 41 40 42
| 50 52 51 53
| 75 63 62 64
| 100 91 91 91*
* Lower (upper) confidence limit held at minimum (maximum) of sample
. generate age_quartile=age
. recode age_quartile (min/41=1) (41.01/52=2) (52.01/63=3) (63.01/91=4)
(age_quartile: 975 changes made)

. *
. *** Create age_qmedian = quartile medians of age, in years
. sort age_quartile
. by age_quartile: centile age, c(50)
------------------------------------------------------------------------------------------
-> age_quartile = 1
-------------+-------------------------------------------------------------
age | 250 50 35 34 35
------------------------------------------------------------------------------------------
-> age_quartile = 2
-------------+-------------------------------------------------------------
age | 248 50 47 46 48
------------------------------------------------------------------------------------------
-> age_quartile = 3
-------------+-------------------------------------------------------------
age | 239 50 59 59 60
------------------------------------------------------------------------------------------
-> age_quartile = 4
-------------+-------------------------------------------------------------
age | 238 50 69 68 69
. generate age_qmedian=age_quartile
. recode age_qmedian (1=35) (2=47) (3=59) (4=69)
(age_qmedian: 975 changes made)
. label variable age_quartile "Quartile of Age"
. label variable age_qmedian "Quartile Median Age"


. tab2 age_quartile age_qmedian
-> tabulation of age_quartile by age_qmedian
Quartile | Quartile Median Age

of Age | 35 47 59 69 | Total
-----------+--------------------------------------------+----------
1 | 250 0 0 0 | 250
2 | 0 248 0 0 | 248
3 | 0 0 239 0 | 239
4 | 0 0 0 238 | 238
-----------+--------------------------------------------+----------
Total | 250 248 239 238 | 975
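The grouping rule behind age_quartile and age_qmedian can be written out explicitly. The Python sketch below is illustrative only (the function and variable names are hypothetical). It assumes the cut points 41, 52 and 63 from the centile output and the quartile medians 35, 47, 59 and 69 found above, with an age equal to a cut point placed in the lower group, as in the recode ranges.

```python
from bisect import bisect_left

# Upper limits of quartiles 1-3 (25th, 50th, 75th centiles of age)
cut_points = [41, 52, 63]

# Median age within each quartile, from "by age_quartile: centile age, c(50)"
quartile_median = {1: 35, 2: 47, 3: 59, 4: 69}

def age_quartile(age):
    """Return the quartile group (1-4); an age equal to a cut point goes to the lower group."""
    return bisect_left(cut_points, age) + 1

for age in (25, 41, 42, 52, 63, 64, 91):
    group = age_quartile(age)
    print(f"age={age:3d}  age_quartile={group}  age_qmedian={quartile_median[group]}")
```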

3. Fit a Logistic Regression Model
Summary
The commands logit and logistic both fit logistic regression models. By default, logit reports betas (log odds ratios) and logistic reports odds ratios. You can also get odds ratios from logit by adding the option or.
The user-written commands eststo, estout and esttab (package estout; install once with ssc install estout) produce side-by-side comparisons of models that are easier to read.
After creating some new variables for illustration purposes, 4 logistic regression models are fit and then compared
side-by-side.
Model 1: Predictors = heavy drinking, age
Model 2: Predictors = heavy smoking, age
Model 3: Predictors = heavy drinking, heavy smoking, age
Model 4: Predictors = heavy drinking, heavy smoking, drinking x smoking interaction, age
* Create some new variables for illustration purposes

. * HEAVY DRINKER: Create alcohol_80plus = 0/1 measure of alcohol use >=80 gm/day.
. generate alcohol_80plus=alcgp
. recode alcohol_80plus (1=0) (2=0) (3=1) (4=1)
(alcohol_80plus: 975 changes made)
. label define alcoholf 0 "< 80 gm/day" 1 "80+ gm/day"
. label values alcohol_80plus alcoholf
. label variable alcohol_80plus "Heavy Drinker"
. * Check variable creation
. tab2 alcgp alcohol_80plus
-> tabulation of alcgp by alcohol_80plus
Grouped |
alcohol | Heavy Drinker
consum. | < 80 gm/d 80+ gm/da | Total
------------+----------------------+----------
0-39 gm/day | 414 0 | 414
40-79 | 355 0 | 355
80-119 | 0 139 | 139
120+ | 0 67 | 67
------------+----------------------+----------
Total | 769 206 | 975

. * HEAVY SMOKER: Create smoking_30plus = 0/1 measure of tobacco use >=30 gm/day.
. generate smoking_30plus=tobgp
. recode smoking_30plus (1=0) (2=0) (3=0) (4=1)
(smoking_30plus: 975 changes made)
. label define smokingf 0 "< 30 gm/day" 1 "30+ gm/day"
. label values smoking_30plus smokingf
. * Check variable creation
. numlabel, add
. tab2 tobgp smoking_30plus
-> tabulation of tobgp by smoking_30plus
Grouped |
tobacco | smoking_30plus
consum. | 0. < 30 g 1. 30+ gm | Total
--------------+----------------------+----------
1. 0-9 gm/day | 526 0 | 526
2. 10-19 | 236 0 | 236
3. 20-29 | 131 0 | 131
4. 30+ | 0 82 | 82
--------------+----------------------+----------
Total | 893 82 | 975
. * INTERACTION: Create drinker_smoker = interaction of heavy drinking and heavy smoking

. generate drinker_smoker=alcohol_80plus*smoking_30plus
. label variable drinker_smoker "Interaction alcohol*smoking"
. * USER CREATED DESIGN VARIABLES FOR AGEGP

. * Note – If you do not have the command fre, type findit fre and download.
. fre agegp
agegp -- Age group
-------------------------------------------------------------
| Freq. Percent Valid Cum.
----------------+--------------------------------------------
Valid 1 25-34 | 116 11.90 11.90 11.90
2 35-44 | 199 20.41 20.41 32.31
3 45-54 | 213 21.85 21.85 54.15
4 55-64 | 242 24.82 24.82 78.97
5 65-74 | 161 16.51 16.51 95.49
6 75+ | 44 4.51 4.51 100.00
Total | 975 100.00 100.00
-------------------------------------------------------------
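. generate age3544=(agegp==2) if agegp !=.
. generate age4554=(agegp==3) if agegp !=.
. generate age5564=(agegp==4) if agegp !=.
. generate age6574=(agegp==5) if agegp !=.
. generate age75plus=(agegp==6) if agegp !=.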

. * Check variable creations.

. numlabel, add
. tab2 agegp age3544
-> tabulation of agegp by age3544
| age3544
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 0 199 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 776 199 | 975

. tab2 agegp age4554
-> tabulation of agegp by age4554
| age4554
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 0 213 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 762 213 | 975

. tab2 agegp age5564
-> tabulation of agegp by age5564
| age5564
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 0 242 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 733 242 | 975


. tab2 agegp age6574
-> tabulation of agegp by age6574
| age6574
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 0 161 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 814 161 | 975
. tab2 agegp age75plus

-> tabulation of agegp by age75plus
| age75plus
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 0 44 | 44
-----------+----------------------+----------
Total | 931 44 | 975
. * Logistic model with user defined design variables for agegp

. logistic case alcohol_80plus age3544 age4554 age5564 age6574 age75plus
Logistic regression Number of obs = 975

LR chi2(6) = 199.30
Prob > chi2 = 0.0000
Log likelihood = -395.09465 Pseudo R2 = 0.2014
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
age3544 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
age4554 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
age5564 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
age6574 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
age75plus | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------

. *
. * Logistic model with STATA defined design variables for agegp
. logistic case alcohol_80plus i.agegp

LR chi2(6) = 199.30
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
|
agegp |
2. 35-44 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
3. 45-54 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
4. 55-64 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
5. 65-74 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
6. 75+ | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
|
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------
The two outputs match. For the rest of this lab session, I will use Stata-defined design variables (the i. prefix).
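A design (indicator) variable equals 1 for observations in one age group and 0 otherwise; the reference group (here 25-34, agegp=1) gets no indicator of its own. The Python sketch below shows that coding for illustration only. The function name and the dictionary keys are hypothetical and are not Stata variable names.

```python
def design_variables(agegp, levels=(1, 2, 3, 4, 5, 6), reference=1):
    """Return 0/1 indicators for every level of agegp except the reference level."""
    return {f"agegp_{level}": int(agegp == level) for level in levels if level != reference}

# Someone aged 45-54 (agegp=3) has exactly one indicator switched on
print(design_variables(3))

# Someone in the reference group (agegp=1) has all indicators equal to 0
print(design_variables(1))
```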
. * MODEL 1 –
. * Logistic Regression Heavy Drinking Alone - adjusted for age
. logistic case alcohol_80plus i.agegp

LR chi2(6) = 199.30
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
|
agegp |
2. 35-44 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
3. 45-54 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
4. 55-64 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
5. 65-74 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
6. 75+ | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
|
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model1
. * MODEL 2 –
. * Logistic Regression Heavy Smoking Alone - adjusted for age
. logistic case smoking_30plus i.agegp

LR chi2(6) = 145.72
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
smoking_30plus | 4.211492 1.205916 5.02 0.000 2.402722 7.381906
|
agegp |
2. 35-44 | 6.267996 6.675411 1.72 0.085 .7773197 50.54262
3. 45-54 | 38.39114 39.30348 3.56 0.000 5.161798 285.536
4. 55-64 | 65.17199 66.48418 4.09 0.000 8.82513 481.2834
5. 65-74 | 82.44814 84.59853 4.30 0.000 11.03516 616.0035
6. 75+ | 59.4483 63.32511 3.84 0.000 7.369337 479.5683
|
_cons | .0060567 .0061361 -5.04 0.000 .0008315 .0441165
--------------------------------------------------------------------------------

. eststo model2
. * MODEL 3 –
. * Logistic Regression Heavy Drinking and Heavy Smoking - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp

LR chi2(7) = 219.23
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
|
agegp |
2. 35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
3. 45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
4. 55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
5. 65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
6. 75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------


. eststo model3
. * MODEL 4 –
. * Logistic Regression Heavy Drinking and Heavy Smoking PLUS INTERACTION - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp drinker_smoker

LR chi2(8) = 219.35
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
alcohol_80plus | 5.018372 1.012542 7.99 0.000 3.379235 7.452591
smoking_30plus | 3.726905 1.379709 3.55 0.000 1.803979 7.699546
|
agegp |
2. 35-44 | 6.563396 7.142474 1.73 0.084 .7777266 55.38986
3. 45-54 | 34.21248 35.78455 3.38 0.001 4.404231 265.7657
4. 55-64 | 58.12253 60.60994 3.90 0.000 7.528611 448.7187
5. 65-74 | 83.55392 87.73638 4.21 0.000 10.66981 654.3001
6. 75+ | 74.15214 80.85386 3.95 0.000 8.749685 628.4273
|
drinker_smoker | 1.251609 .839525 0.33 0.738 .3361396 4.660342
_cons | .0039536 .0041028 -5.33 0.000 .0005172 .0302218
--------------------------------------------------------------------------------

. eststo model4

. * ESTOUT for side-by-side comparison of the 4 models

. * BETAs
. estout model1 model2 model3 model4, prehead("Logistic Regression of Esophageal Cancer - BETA's")
Logistic Regression of Esophageal Cancer - BETA's

----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     1.654102                  1.633552     1.613106
1b.agegp                0            0            0            0
2.agegp          1.543953     1.835457      1.84063     1.881508
3.agegp          3.200302     3.647827     3.499904      3.53259
4.agegp           3.70612      4.17703     4.028732     4.062553
5.agegp          3.966039     4.412169     4.393586     4.425492
6.agegp          3.959277     4.085107     4.268525     4.306119
smoking_30~s                  1.437817     1.383765     1.315578
drinker_sm~r                                              .22443
_cons           -5.049291    -5.106586    -5.506652    -5.533117
----------------------------------------------------------------
Okay. We see the betas.
. * ODDS RATIOS – use option eform

. estout model1 model2 model3 model4, eform prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")
Logistic Regression of Esophageal Cancer - ODDS RATIO's

----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     5.228385                  5.122038     5.018372
1b.agegp                1            1            1            1
2.agegp          4.683066     6.267996     6.300504     6.563396
3.agegp          24.53994     38.39114     33.11226     34.21248
4.agegp           40.6956     65.17199     56.18964     58.12253
5.agegp          52.77508     82.44814     80.93014     83.55392
6.agegp          52.41941      59.4483     71.41624     74.15214
smoking_30~s                  4.211492     3.989895     3.726905
drinker_sm~r                                            1.251609
_cons            .0064139     .0060567     .0040597     .0039536
----------------------------------------------------------------
The option “eform” stands for “exponentiated coefficients.” Thus, these are the odds ratios.
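To see what exponentiation does, the betas can be converted to odds ratios by hand. The Python sketch below is illustrative only (the variable names are mine). It uses the Model 4 betas from the table of betas, and it also shows one consequence of the interaction model: the odds ratio for a heavy drinker who is also a heavy smoker, relative to someone who is neither and is in the same age group, is the exponential of the sum of the two main-effect betas and the interaction beta.

```python
import math

# Model 4 betas, copied from the estout table of betas
beta_alcohol = 1.613106
beta_smoking = 1.315578
beta_interaction = 0.22443

# Odds ratio = exp(beta)
or_alcohol = math.exp(beta_alcohol)
or_smoking = math.exp(beta_smoking)
or_interaction = math.exp(beta_interaction)

# Heavy drinker AND heavy smoker versus neither (same age group):
# add the betas on the logit scale, then exponentiate
or_both = math.exp(beta_alcohol + beta_smoking + beta_interaction)

print(f"OR heavy drinking only  = {or_alcohol:.6f}")
print(f"OR heavy smoking only   = {or_smoking:.6f}")
print(f"OR interaction term     = {or_interaction:.6f}")
print(f"OR both versus neither  = {or_both:.3f}")
```

Adding betas on the logit scale is the same as multiplying the corresponding odds ratios, so or_both equals the product of the three exponentiated coefficients.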

. * ESTTAB for side-by-side comparison of the 4 models

. * BETAs with chi square statistics
. esttab model1 model2 model3 model4, stats(n chi2 bic, star(chi2)) prehead("Logistic Regression of Esophageal Cancer - BETA's")
Logistic Regression of Esophageal Cancer - BETA's

                      (1)             (2)             (3)             (4)
                     case            case            case            case
----------------------------------------------------------------------------
case
alcohol_80~s        1.654***                        1.634***        1.613***
                   (8.74)                          (8.49)          (7.99)
1b.agegp                0               0               0               0
                      (.)             (.)             (.)             (.)
2.agegp             1.544           1.835           1.841           1.882
                   (1.45)          (1.72)          (1.71)          (1.73)
3.agegp             3.200**         3.648***        3.500***        3.533***
                   (3.13)          (3.56)          (3.37)          (3.38)
4.agegp             3.706***        4.177***        4.029***        4.063***
                   (3.64)          (4.09)          (3.90)          (3.90)
5.agegp             3.966***        4.412***        4.394***        4.425***
                   (3.88)          (4.30)          (4.22)          (4.21)
6.agegp             3.959***        4.085***        4.269***        4.306***
                   (3.72)          (3.84)          (3.95)          (3.95)
smoking_30~s                        1.438***        1.384***        1.316***
                                   (5.02)          (4.50)          (3.55)
drinker_sm~r                                                        0.224
                                                                   (0.33)
_cons              -5.049***       -5.107***       -5.507***       -5.533***
                  (-5.00)         (-5.04)         (-5.35)         (-5.33)
----------------------------------------------------------------------------
n
chi2                199.3***        145.7***        219.2***        219.3***
bic                 838.4           892.0           825.3           832.1
----------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
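This tabulation shows the betas. Underneath each beta, in parentheses, is the test statistic = (beta/standard error). esttab labels these “t statistics”; for logistic regression they are the Wald z statistics reported by logistic. The row n is blank because Stata stores the number of observations as e(N), with a capital N; request N in stats() to display it.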
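The statistic = (beta/standard error) and the 95% confidence interval can be recovered from the odds-ratio output of logistic. The Python sketch below is for illustration only (the variable names are mine). It uses the Model 1 row for alcohol_80plus and assumes the delta-method relationship that the standard error of an odds ratio equals the odds ratio times the standard error of beta; the confidence limits are computed on the beta scale and then exponentiated.

```python
import math
from statistics import NormalDist

# Model 1, alcohol_80plus: odds ratio and its standard error from the logistic output
odds_ratio = 5.228385
se_odds_ratio = 0.9892462

beta = math.log(odds_ratio)            # beta = ln(OR)
se_beta = se_odds_ratio / odds_ratio   # assumed delta-method relationship

# Wald z statistic and two-sided p-value
z = beta / se_beta
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% CI: build it on the beta scale, then exponentiate
z_crit = NormalDist().inv_cdf(0.975)
ci_lower = math.exp(beta - z_crit * se_beta)
ci_upper = math.exp(beta + z_crit * se_beta)

print(f"beta = {beta:.6f}, se(beta) = {se_beta:.6f}, z = {z:.2f}")
print(f"95% CI for OR = ({ci_lower:.6f}, {ci_upper:.6f})")
```

The printed z and interval can be compared with the alcohol_80plus row of the Model 1 output.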

. * ODDS RATIOS and 95% CI’s

. * Models 1, 2, and 3
. esttab model1 model2 model3, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")
Logistic Regression of Esophageal Cancer - ODDS RATIO's

                                (1)                       (2)                       (3)
                               case                      case                      case
------------------------------------------------------------------------------------------
case
alcohol_80~s                  5.228***                                            5.122***
                      [3.608,7.576]                                       [3.512,7.469]
1b.agegp                          1                         1                         1
                              [1,1]                     [1,1]                     [1,1]
2.agegp                       4.683                     6.268                     6.301
                      [0.580,37.82]             [0.777,50.54]             [0.763,52.06]
3.agegp                       24.54**                   38.39***                  33.11***
                      [3.304,182.3]             [5.162,285.5]             [4.336,252.9]
4.agegp                       40.70***                  65.17***                  56.19***
                      [5.529,299.5]             [8.825,481.3]             [7.409,426.1]
5.agegp                       52.78***                  82.45***                  80.93***
                      [7.107,391.9]             [11.04,616.0]             [10.51,623.1]
6.agegp                       52.42***                  59.45***                  71.42***
                      [6.503,422.6]             [7.369,479.6]             [8.588,593.9]
smoking_30~s                                            4.211***                  3.990***
                                                [2.403,7.382]             [2.185,7.287]
------------------------------------------------------------------------------------------
n
chi2                          199.3***                  145.7***                  219.2***
bic                           838.4                     892.0                     825.3
------------------------------------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001
I am showing models 1, 2, and 3 only because the output wraps around (and is unreadable) if I try to show
all 4 models. This tabulation shows the odds ratios and associated 95% CI.

. * ODDS RATIOS and 95% CI’s

. * Models 3 and 4
. esttab model3 model4, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")
Logistic Regression of Esophageal Cancer - ODDS RATIO's

                                (1)                       (2)
                               case                      case
----------------------------------------------------------------
case
alcohol_80~s                  5.122***                  5.018***
                      [3.512,7.469]             [3.379,7.453]
smoking_30~s                  3.990***                  3.727***
                      [2.185,7.287]             [1.804,7.700]
1b.agegp                          1                         1
                              [1,1]                     [1,1]
2.agegp                       6.301                     6.563
                      [0.763,52.06]             [0.778,55.39]
3.agegp                       33.11***                  34.21***
                      [4.336,252.9]             [4.404,265.8]
4.agegp                       56.19***                  58.12***
                      [7.409,426.1]             [7.529,448.7]
5.agegp                       80.93***                  83.55***
                      [10.51,623.1]             [10.67,654.3]
6.agegp                       71.42***                  74.15***
                      [8.588,593.9]             [8.750,628.4]
drinker_sm~r                                            1.252
                                                [0.336,4.660]
----------------------------------------------------------------
n
chi2                          219.2***                  219.3***
bic                           825.3                     832.1
----------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001

4. Likelihood Ratio Test for 2 “Hierarchical” Models
Summary
It is of interest to know whether the inclusion of extra predictors to a model is statistically significant. The smaller
model (“reduced”) contains the control variables. The larger model (“full”) contains the control variables plus the
extra variables in question.
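Models.
Reduced: $\operatorname{logit}[\pi \mid X_1, X_2, \ldots, X_p] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$
Full: $\operatorname{logit}[\pi \mid X_1, X_2, \ldots, X_p, X_{p+1}, X_{p+2}, \ldots, X_{p+k}] = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_{p+1} X_{p+1} + \cdots + \beta_{p+k} X_{p+k}$
Null and Alternative Hypotheses:
$H_O: \beta_{p+1} = \beta_{p+2} = \cdots = \beta_{p+k} = 0$
$H_A$: not $H_O$
Definition Likelihood Ratio Test (LR)
$\text{LR statistic} = \text{Deviance}_{\text{REDUCED}} - \text{Deviance}_{\text{FULL}} = [(-2)\ln(\text{Likelihood}_{\text{REDUCED}})] - [(-2)\ln(\text{Likelihood}_{\text{FULL}})]$
Under the null hypothesis, LR is distributed $\chi^2_{\text{DF}=k}$.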
A likelihood ratio test is performed to assess the statistical significance of the interaction of heavy drinking and heavy
smoking, controlling for age and the main effects of heavy drinking and heavy smoking. Thus,
Model “reduced”: Predictors = age, heavy drinking, heavy smoking
Model “full”: Predictors = age, heavy drinking, heavy smoking + (drinking x smoking)

. *** LR Test WITH DISPLAY OF INTERMEDIATE RESULTS

. * Reduced model
. logistic case i.agegp smoking_30plus alcohol_80plus
LR chi2(7) = 219.23
Prob > chi2 = 0.0000
(-2) Log likelihood Reduced Model = 770.2551
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
agegp |
2. 35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
3. 45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
4. 55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
5. 65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
6. 75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------
. estimates store reduced
. * Full model
. logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
LR chi2(8) = 219.35
Prob > chi2 = 0.0000
(-2) Log likelihood Full Model = 770.14136
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
agegp |
2. 35-44 | 6.563396 7.142474 1.73 0.084 .7777266 55.38986
3. 45-54 | 34.21248 35.78455 3.38 0.001 4.404231 265.7657
4. 55-64 | 58.12253 60.60994 3.90 0.000 7.528611 448.7187
5. 65-74 | 83.55392 87.73638 4.21 0.000 10.66981 654.3001
6. 75+ | 74.15214 80.85386 3.95 0.000 8.749685 628.4273
|
smoking_30plus | 3.726905 1.379709 3.55 0.000 1.803979 7.699546
alcohol_80plus | 5.018372 1.012542 7.99 0.000 3.379235 7.452591
drinker_smoker | 1.251609 .839525 0.33 0.738 .3361396 4.660342
_cons | .0039536 .0041028 -5.33 0.000 .0005172 .0302218
--------------------------------------------------------------------------------
. estimates store full
. lrtest reduced full
Likelihood-ratio test LR chi2(1) = 0.11

(Assumption: reduced nested in full) Prob > chi2 = 0.7359
CHECK: [(-2) ln L reduced] – [(-2)ln L full] = 770.2551 – 770.14136 = 0.11374 match!
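The same check, including the p-value, can be done by hand. The Python sketch below is illustrative only (the variable names are mine). It relies on a fact that holds for 1 degree of freedom only: the upper-tail probability of a chi-square variable at x equals erfc(sqrt(x/2)), which avoids the need for a statistics library.

```python
import math

# (-2) log likelihoods annotated in the two outputs above
neg2_loglik_reduced = 770.2551
neg2_loglik_full = 770.14136

# LR statistic = deviance(reduced) - deviance(full); the full model adds 1 term, so DF = 1
lr_statistic = neg2_loglik_reduced - neg2_loglik_full

# Upper-tail chi-square probability; this shortcut is valid for DF = 1 only
p_value = math.erfc(math.sqrt(lr_statistic / 2))

print(f"LR chi2(1)  = {lr_statistic:.5f}")
print(f"Prob > chi2 = {p_value:.4f}")
```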

. *** LR Test - WITHOUT DISPLAY OF RESULTS using quietly:

. * Reduced model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus
. estimates store reduced
. * Full model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
. estimates store full
. lrtest reduced full
Likelihood-ratio test LR chi2(1) = 0.11

(Assumption: reduced nested in full) Prob > chi2 = 0.7359
CHECK: [(-2) ln L reduced] – [(-2)ln L full] = 770.2551 – 770.14136 = 0.11374 match!

5. Regression Diagnostics for Logistic Regression: Numerical
Preliminary – Install the suite of commands in the package spost9_ado
. * Step 1: Install SPost Using net install

. net install spost9_ado
checking spost9_ado consistency and verifying not already installed...

installing into /Users/cbigelow/Library/Application Support/Stata/ado/plus/...
installation complete.
. * Step 2: Now obtain all the ancillary files

. net get spost9_do
checking spost9_do consistency and verifying not already installed...
copying into current directory...

copying st9all.do
copying st9ch2tutorial.do
copying st9ch3estimate.do
copying st9ch4binary.do
copying st9ch5ordinal.do
copying st9ch6nomcase.do
copying st9ch7nomalt.do
copying st9ch8count.do
copying st9ch9other.do
copying binlfp2.dta
copying couart2.dta
copying gsskidvalue2.dta
copying nomocc2.dta
copying ordwarm2.dta
copying science2.dta
copying sciwork.dta
copying travel2.dta
copying travel2case.dta
copying wlsrnk.dta
ancillary files successfully copied.

Summary
Now you have a model that is your “candidate” final model. There are lots of further explorations you can do to
assess whether this really is a “good” final model.
Having retained the null hypothesis in our likelihood ratio test of the interaction of heavy smoking and heavy drinking,
our “candidate” final model contains: heavy drinking, heavy smoking, and age.
. * Before requesting any diagnostics of a model, you must have fit it.
. logistic case i.agegp smoking_30plus alcohol_80plus

LR chi2(7) = 219.23
Prob > chi2 = 0.0000
--------------------------------------------------------------------------------
---------------+----------------------------------------------------------------
agegp |
35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------

. *
. ***** 5a) Numerical measures of fit using command FITSTAT
. fitstat
Measures of Fit for logistic of case
Log-Lik Intercept Only: -494.744 Log-Lik Full Model: -385.128

D(967): 770.255 LR(7): 219.233
Prob > LR: 0.000
McFadden's R2: 0.222 McFadden's Adj R2: 0.205
ML (Cox-Snell) R2: 0.201 Cragg-Uhler(Nagelkerke) R2: 0.316
McKelvey & Zavoina's R2: 0.466 Efron's R2: 0.224
Variance of y*: 6.157 Variance of error: 3.290
Count R2: 0.817 Adj Count R2: 0.110
AIC: 0.806 AIC*n: 786.255
BIC: -5885.062 BIC': -171.056
BIC used by Stata: 825.315 AIC used by Stata: 786.255
PARTIAL KEY:
Log-Lik Intercept Only = -494.744: This is the log likelihood for the intercept only model
Log-Lik Full Model = -385.128: This is the log likelihood for the current model
LR(7) = 219.233 is the likelihood ratio chi square statistic which tests whether the current
model predicts better than the intercept only model
Prob > LR = 0.000: This is the p-value for the LR(7) test. Stata prints 0.000 when the p-value is smaller than .0005.
Then there are a series of pseudo-R2 measures.
Finally, there are a series of information criterion measures that are used to compare different models.
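
Several of the fitstat entries can be rebuilt from the two log likelihoods. The sketch below uses only numbers printed in the output above. The parameter count k = 8 (7 slopes plus the intercept) is an assumption read off the fitted model, and the scalar names are made up for this illustration.

```stata
* Values copied from the fitstat output above
scalar ll_null = -494.744
scalar ll_full = -385.128
scalar n_obs   = 975
* Assumption: 7 slope parameters + 1 intercept
scalar k_par   = 8

* Likelihood ratio statistic and two pseudo-R2 measures
scalar lr_stat = 2*(ll_full - ll_null)
scalar r2_mcf  = 1 - ll_full/ll_null
scalar r2_cs   = 1 - exp(-lr_stat/n_obs)

* Information criteria in the form labeled "used by Stata"
scalar aic_val = -2*ll_full + 2*k_par
scalar bic_val = -2*ll_full + k_par*ln(n_obs)

display "LR(7)        = " %8.3f lr_stat
display "Prob > LR    = " chi2tail(7, lr_stat)
display "McFadden R2  = " %5.3f r2_mcf
display "Cox-Snell R2 = " %5.3f r2_cs
display "AIC          = " %8.3f aic_val
display "BIC          = " %8.3f bic_val
```

The displayed values should match the fitstat lines up to rounding of the log likelihoods. After the logistic command, estat ic reports the same AIC and BIC directly.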
. *
. ***** 5b) Test of Model Adequacy Using command LINKTEST
. linktest
-- iteration output omitted --

LR chi2(2) = 219.24
Prob > chi2 = 0.0000
------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat | 1.009135 .1404655 7.18 0.000 .7338274 1.284442
_hatsq | .0039801 .0479037 0.08 0.934 -.0899094 .0978696
_cons | .0008299 .1243723 0.01 0.995 -.2429353 .2445952
------------------------------------------------------------------------------
WHAT TO LOOK FOR:

We expect the p-value for _HAT to be highly significant.
Evidence of a GOOD FIT is reflected in a NON-SIGNIFICANT _HATSQ
Here the p-value for _HATSQ is .934
This suggests good model adequacy
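
To see what linktest is doing, the same regression can be run by hand. This is a sketch that assumes the variables created earlier in this handout are in memory. The names lp_hat, lp_hatsq and candidate are made up for this illustration.

```stata
* Refit the candidate model quietly and keep a copy of its results
quietly logistic case i.agegp smoking_30plus alcohol_80plus
estimates store candidate

* Linear predictor (the "hat") and its square
predict double lp_hat, xb
generate double lp_hatsq = lp_hat^2

* Regress the outcome on the linear predictor and its square
logit case lp_hat lp_hatsq

* Put the candidate model back so later diagnostics refer to it
estimates restore candidate
```

The coefficients on lp_hat and lp_hatsq should correspond to the _hat and _hatsq rows above. The manual logit replaces the active estimation results, which is why the candidate model is stored first and restored afterwards.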

. *
. ***** 5c) Test of Overall Goodness of Fit Using command LFIT
. lfit, group(10) table
Logistic model for case, goodness-of-fit test
(Table collapsed on quantiles of estimated probabilities)

(There are only 9 distinct quantiles because of ties)
+--------------------------------------------------------+
| Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+-------+-------|
| 1 | 0.0159 | 0 | 0.6 | 106 | 105.4 | 106 |
| 2 | 0.0249 | 6 | 3.9 | 153 | 155.1 | 159 |
| 3 | 0.1158 | 4 | 5.1 | 45 | 43.9 | 49 |
| 4 | 0.1185 | 16 | 17.5 | 132 | 130.5 | 148 |
| 6 | 0.1857 | 27 | 29.7 | 133 | 130.3 | 160 |
|-------+--------+-------+-------+-------+-------+-------|
| 7 | 0.2473 | 42 | 38.0 | 115 | 119.0 | 157 |
| 8 | 0.3462 | 0 | 0.3 | 1 | 0.7 | 1 |
| 9 | 0.5388 | 66 | 62.8 | 67 | 70.2 | 133 |
| 10 | 0.8704 | 39 | 41.9 | 23 | 20.1 | 62 |
+--------------------------------------------------------+
number of observations = 975

number of groups = 9
Hosmer-Lemeshow chi2(7) = 4.43
Prob > chi2 = 0.7291
WHAT TO LOOK FOR:

Evidence of OVERALL GOODNESS OF FIT is reflected in a NON-SIGNIFICANT p-value
Here the Hosmer-Lemeshow test p-value is .7291
This suggests good overall fit
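
The p-value can be recovered from the reported statistic. As a sketch: the Hosmer-Lemeshow statistic is referred to a chi-square distribution with (number of groups - 2) degrees of freedom, which here is 9 - 2 = 7.

```stata
* Degrees of freedom = number of groups - 2
display "df = " 9 - 2

* Upper-tail chi-square probability for the reported statistic 4.43
display "p  = " %6.4f chi2tail(7, 4.43)
```

The result should be close to the 0.7291 shown above; a small difference is possible because 4.43 is itself rounded. In current Stata documentation the same test is listed as estat gof, group(10) table.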

6. Regression Diagnostics for Logistic Regression: Graphical

. *
. ***** 6a) Plot of ROC Curve using LROC
. predict xb, xb
. lroc
Logistic model for case
number of observations = 975
area under ROC curve = 0.8119
WHAT TO LOOK FOR:

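Classification that is no better than a coin toss is represented by the 45 degree reference line.
Evidence of GOOD FIT is reflected in an ROC curve that lies above the 45 degree reference line.
Area under the ROC curve = .8119 says that, for a randomly chosen case and a randomly chosen control, there is about an 81% chance that the case has the higher predicted probability. It measures discrimination. It is not the percent of observations correctly classified.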
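
To obtain the percent of observations correctly classified at a chosen cutoff, which is a separate summary from the area under the ROC curve, estat classification can be run after the fit. The sketch below assumes the logistic model above is still the active estimation result. The cutoff of 0.3 is arbitrary and shown only for illustration.

```stata
* Area under the ROC curve without redrawing the graph
lroc, nograph

* Sensitivity, specificity and percent correctly classified at cutoff 0.5
estat classification

* Same table at a different cutoff (0.3 is an arbitrary illustration)
estat classification, cutoff(0.3)
```

At the default cutoff, the "Correctly classified" line should correspond to the Count R2 reported by fitstat.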

. *
. ***** 6b) Plot of Y=Standardized Residual versus X=Observation Number
. predict std_residual, rs
. label variable std_residual "Standardized Residual"
. generate index=_n
. label variable index "Observation Number"
. graph twoway (scatter std_residual index, msymbol(d)), xlabel(0(100)1000) ylabel(-4(2)4)
> title("Plot of Standardized Residuals versus Observation Number")
> xtitle("Observation Number") ytitle("Standardized Residual") yline(0)
> caption("stdresidual.png", size(vsmall))
WHAT TO LOOK FOR:

Think of standardized residuals as Z-scores, approximately.
We’d like the majority to be within 1.96 of the expected value of 0.
Values outside ±1.96 are potentially extreme.
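
Beyond reading the plot, the potentially extreme observations can be counted and listed. This is a sketch for a do-file, assuming std_residual and index were created as above and that the model variables are in memory.

```stata
* How many standardized residuals fall outside +/- 1.96?
count if abs(std_residual) > 1.96 & !missing(std_residual)

* List them together with the outcome and the predictors
list index case agegp smoking_30plus alcohol_80plus std_residual ///
    if abs(std_residual) > 1.96 & !missing(std_residual), noobs
```

Because the predictors are all categorical, many observations share the same covariate pattern and therefore the same residual, so the list may contain repeated values.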

. *
. ***** 6c) Plot of Influential Observations: Y=Cook versus X=Observation Number
. predict cook, dbeta
. label variable cook "Cook Distance"
. graph twoway (scatter cook index, msymbol(d)), xlabel(0(100)1000)
> title("Plot of Cook Distance versus Observation Number")
> xtitle("Observation Number") ytitle("Cook Distance")
> caption("cook.png", size(vsmall))
WHAT TO LOOK FOR:

Look for an even ribbon of Cook distance values with no spikes.
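
Spikes can also be picked out numerically. The sketch below, written for a do-file, lists observations whose Cook distance exceeds the 99th percentile. That threshold is an arbitrary choice made for this illustration, not a formal rule. It assumes cook and index were created as above.

```stata
* Distribution of the influence measure; detail stores percentiles in r()
summarize cook, detail

* List observations above the 99th percentile (arbitrary cutoff)
list index case agegp smoking_30plus alcohol_80plus cook ///
    if cook > r(p99) & !missing(cook), noobs
```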

7. Tip – Save your Commands to a DO File for Later Use
Step 1 – Right click anywhere inside the review window. From drop down menu, choose SELECT ALL
Step 2 – Right click again. From drop down menu, choose SEND TO DO-FILE EDITOR
Stata will put you into the Do-File Editor, where the commands you selected should appear.
Step 3 – Click on the SAVE icon. At SAVE AS: provide a name. At WHERE: provide a path
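
A saved do-file for the diagnostics in sections 5 and 6 might look like the sketch below. The file names logistic_diagnostics.do and logistic_diagnostics.log are hypothetical, and the placeholder comment stands for the variable creation commands from the earlier sections. fitstat is a user-written command and must already be installed.

```stata
* logistic_diagnostics.do  (hypothetical file name)
capture log close
log using "logistic_diagnostics.log", replace text

use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear
* ... variable creation commands from the earlier sections go here ...

* Fit the candidate final model, then request the diagnostics
logistic case i.agegp smoking_30plus alcohol_80plus
fitstat
linktest
lfit, group(10) table
lroc

log close
```

From the Command window, the file can then be rerun with: do logistic_diagnostics.do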

