
BIOSTATS 640 – Spring 2017 5.

Logistic Regression Stata Illustration

5. Logistic Regression
Illustration – Stata version 14
March 2017

1. Tip: “1/2” Variables versus “0/1” Variables

2. Tip: How to Create Quartile Groupings of a Continuous Variable

3. Fit a Logistic Regression Model

4. Likelihood Ratio Test for 2 “Hierarchical” Models

5. Regression Diagnostics for Logistic Regression: Numerical
   a. Numerical Measures of Fit Using fitstat
   b. Test of Model Adequacy Using linktest
   c. Test of Overall Goodness-of-Fit Using lfit

6. Regression Diagnostics for Logistic Regression: Graphical
   a. Plot of ROC Curve Using lroc
   b. Plot of Standardized Residuals versus Observation Number
   c. Plot of Influential Observations Using Cook’s Distances

7. Tip: Save Your Commands to a DO File for Later Use

Preliminary – Download the Stata data set illeetvilaine.dta.

Note – This data set is accessible through the internet. Alternatively, you can download it from the course website.

(a) In Stata, input directly from the internet using the command use
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

(b) From the course website, right click to download. Afterwards, in Stata, use FILE > OPEN
See http://people.umass.edu/biep640w/webpages/demonstrations.html

….1. Teaching\stata\stata version 14\Stata for Logistic Regression.docx Page 1 of 30



1. Tip - “1/2” Variables versus “0/1” Variables

Why the fuss?


Answer – Sometimes the arrangement of rows and columns in a 2x2 table is not what you expected.

tab2
Stata will order the rows and columns according to the numeric values of the row and column variable.
For a 0/1 variable, row 1 will be the value “0” row. Row 2 will be the value “1” row.
For a 1/2 variable, row 1 will be the value “1” row. Row 2 will be the value “2” row.
Columns are ordered similarly.

cc, cs
Stata assumes that you are using 0/1 variables here, with 1=event and 0=non-event.
Stata will order the rows and columns according to event, with event being the first row (or column).
Thus, row 1 will be the value “1=event” row. Row 2 will be the value “0=non-event” row.
Columns are ordered similarly.

Ille-et-Vilaine Data: Illustration

Suppose we are interested in the 2x2 table cross-classification of heavy smoking (30+ gm/day versus other) and case
status (esophageal cancer case versus control):

                             Disease (Esophageal Cancer)
Exposure (Heavy Smoking)       Yes       No     Total
Yes (30+ gm/day)                31       51        82
No                             169      724       893
Total                          200      775       975

Preliminary: Introduction to the command recode


Use recode to re-set the values of a variable. This is especially handy in the creation of a new variable. You can
recode a single old value to a new value. Or you can recode a whole range of values to a new value. For example -

. use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear


. * recode variablename (oldvalue=newvalue) (rangelower/rangeupper=newvalue) etc.
. generate age12=age
. recode age12 (18=1) (19/max=2)
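
Tip: recode can also create the new variable for you through its generate() option, so that the original variable is left untouched. A minimal sketch follows. The variable name age2grp and the cut point of 50 years are made up for illustration and are not used elsewhere in this handout.

```stata
* age2grp and the cut point of 50 are hypothetical, for illustration only
recode age (min/49=1 "under 50") (50/max=2 "50 or older"), generate(age2grp)
* Check variable creation
tabulate age2grp
```

The text in quotes becomes the value label of the new category, which saves the separate label define and label values steps.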


. * Create "1/2" variables when you want to use command tab2


. * “1/2” measure of heavy smoking (1=30+ gm/day versus 2=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure12=tobgp
. recode exposure12 (1=2) (2=2) (3=2) (4=1)
(exposure12: 739 changes made)
. label define exposure12f 2 "other" 1 "heavy"
. label values exposure12 exposure12f

. * "1/2" variable for case status (1=case versus 2=other)


. generate case12=case
. recode case12 (0=2)
(case12: 775 changes made)
. label define case12f 2 "control" 1 "case"
. label values case12 case12f

. * Check variable creations


. tab2 tobgp exposure12
-> tabulation of tobgp by exposure12

Grouped |
tobacco | exposure12
consum. | heavy other | Total
-----------+----------------------+----------
0-9 gm/day | 0 526 | 526
10-19 | 0 236 | 236
20-29 | 0 131 | 131
30+ | 82 0 | 82
-----------+----------------------+----------
Total | 82 893 | 975

. tab2 case case12


-> tabulation of case by case12

Case |
status |
(1=case, | case12
0=control) | case control | Total
-----------+----------------------+----------
0 | 0 775 | 775
1 | 200 0 | 200
-----------+----------------------+----------
Total | 200 775 | 975


. * Create "0/1" variables when you want to use commands cc, cs


. * “0/1” measure of heavy smoking (1=30+ gm/day versus 0=other)
. * Exposure will be heavy smoking defined as tobgp=4 (30+ gm/day)
. generate exposure01=tobgp
. recode exposure01 (1=0) (2=0) (3=0) (4=1)
(exposure01: 975 changes made)
. label define exposure01f 0 "other" 1 "heavy"
. label values exposure01 exposure01f

. * "0/1" variable for case status (1=case versus 0=other)


. * This already exists as the variable case

. * Check variable creations


. tab2 tobgp exposure01
-> tabulation of tobgp by exposure01

Grouped |
tobacco | exposure01
consum. | other heavy | Total
-----------+----------------------+----------
0-9 gm/day | 526 0 | 526
10-19 | 236 0 | 236
20-29 | 131 0 | 131
30+ | 0 82 | 82
-----------+----------------------+----------
Total | 893 82 | 975

. * The command cc works fine with 0/1 variables


. cc case exposure01
                                                         Proportion
                 |   Exposed   Unexposed  |      Total     Exposed
-----------------+------------------------+------------------------
           Cases |        31         169  |        200       0.1550
        Controls |        51         724  |        775       0.0658
-----------------+------------------------+------------------------
           Total |        82         893  |        975       0.0841
                 |                        |
                 |      Point estimate    |    [95% Conf. Interval]
                 |------------------------+------------------------
      Odds ratio |         2.604014       |    1.557944      4.2894 (exact)
 Attr. frac. ex. |         .6159775       |    .3581283    .7668672 (exact)
 Attr. frac. pop |         .0954765       |
                 +-------------------------------------------------
                               chi2(1) =    16.42  Pr>chi2 = 0.0001

The commands cc and cs are commands for epidemiological analyses of 2x2 tables where the convention is to
have cases be in row 1 (controls in row 2) and exposed be in column 1 (non-exposed in column 2).
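
As a check on the output above, the odds ratio can be reproduced by hand from the four cell counts. The sketch below uses display as a calculator and then cci, the immediate form of cc, which takes the cell counts directly instead of variables. Both should agree with the odds ratio of about 2.604 reported above.

```stata
* Odds ratio = (exposed cases x unexposed controls) divided by
*              (unexposed cases x exposed controls)
display (31*724)/(169*51)

* Immediate form: cci a b c d, where
*   a = exposed cases,    b = unexposed cases,
*   c = exposed controls, d = unexposed controls
cci 31 169 51 724
```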


. * tab2 with 0/1 variables


. tab2 exposure01 case
-> tabulation of exposure01 by case

           | Case status (1=case,
           |      0=control)
exposure01 |         0          1 |     Total
-----------+----------------------+----------
     other |       724        169 |       893
     heavy |        51         31 |        82
-----------+----------------------+----------
     Total |       775        200 |       975

Heavy exposure is in row 2, and the outcome case=yes is in column 2.

. * tab2 with 1/2 variables - more to your liking?


. tab2 exposure12 case12
-> tabulation of exposure12 by case12

| case12
exposure12 | case control | Total
-----------+----------------------+----------
heavy | 31 51 | 82
other | 169 724 | 893
-----------+----------------------+----------
Total | 200 775 | 975

Better. Heavy exposure is now row 1 and cases are now in column 1.

. tab2 case12 exposure12


-> tabulation of case12 by exposure12

| exposure12
case12 | heavy other | Total
-----------+----------------------+----------
case | 31 169 | 200
control | 51 724 | 775
-----------+----------------------+----------
Total | 82 893 | 975

Or, you might like this arrangement. Cases are now in row 1 and heavy exposure is now in column 1.


2. Tip – How to Create Quartile Groupings of a Continuous Variable

Creating Quartiles is Useful in Assessing Linearity of Logit


In logistic regression, it is often of interest to explore whether the logit of the outcome is linear in a
continuous predictor. To do this, a new variable is created that is a grouped measure of the original continuous
variable.

Ille-et-Vilaine Data: Illustration

This data set has two continuous variables: age (age, years) and tob (tobacco consumption, gm/day). Here we
consider the variable age. In this lab session, let’s create two new variables. Each is a two-step process.

age_quartile = Quartile of age, coded 1, 2, 3 or 4

age_qmedian = Median of age, within quartile of age

. *
. *** Create age_quartile = quartiles of age, coded 1, 2, 3, 4
. centile age, c(0 25 50 75 100)

-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 975 0 25 25 25*
| 25 41 40 42
| 50 52 51 53
| 75 63 62 64
| 100 91 91 91*

* Lower (upper) confidence limit held at minimum (maximum) of sample

. generate age_quartile=age
. recode age_quartile (min/41=1) (41.01/52=2) (52.01/63=3) (63.01/91=4)
(age_quartile: 975 changes made)
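
An alternative that avoids typing the cut points by hand is the command xtile, which computes the quantiles and assigns the group codes in one step. This is a sketch, not the method used in this handout. The name age_q4 is hypothetical, and because xtile computes percentiles with its own rule, its groups may differ slightly from the centile-based ones when there are ties at a cut point.

```stata
* age_q4 is a hypothetical name; xtile codes the quartiles of age as 1, 2, 3, 4
xtile age_q4 = age, nquantiles(4)

* Compare with the hand-made version created above
tabulate age_q4 age_quartile
```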


. *
. *** Create age_qmedian = quartile medians of age, in years
. sort age_quartile
. by age_quartile: centile age, c(50)
------------------------------------------------------------------------------------------
-> age_quartile = 1

-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 250 50 35 34 35

------------------------------------------------------------------------------------------
-> age_quartile = 2

-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 248 50 47 46 48

------------------------------------------------------------------------------------------
-> age_quartile = 3

-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 239 50 59 59 60

------------------------------------------------------------------------------------------
-> age_quartile = 4

-- Binom. Interp. --
Variable | Obs Percentile Centile [95% Conf. Interval]
-------------+-------------------------------------------------------------
age | 238 50 69 68 69

. generate age_qmedian=age_quartile
. recode age_qmedian (1=35) (2=47) (3=59) (4=69)
(age_qmedian: 975 changes made)
. label variable age_quartile "Quartile of Age"
. label variable age_qmedian "Quartile Median Age"
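
The within-quartile medians can also be attached to every observation directly, without reading them off the centile output and typing them into recode. A minimal sketch using egen follows. The name age_qmed_alt is hypothetical, and egen's median may differ slightly from the centile value in a group with an even number of observations.

```stata
* age_qmed_alt is a hypothetical name: median of age within each quartile group
egen age_qmed_alt = median(age), by(age_quartile)

* Check: each quartile should map to exactly one median value
tabulate age_quartile age_qmed_alt
```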


. * Check variable creations


. tab2 age_quartile age_qmedian

-> tabulation of age_quartile by age_qmedian

  Quartile |             Quartile Median Age
    of Age |        35         47         59         69 |     Total
-----------+--------------------------------------------+----------
         1 |       250          0          0          0 |       250
         2 |         0        248          0          0 |       248
         3 |         0          0        239          0 |       239
         4 |         0          0          0        238 |       238
-----------+--------------------------------------------+----------
     Total |       250        248        239        238 |       975


3. Fit a Logistic Regression Model

Summary
The commands logit and logistic will fit logistic regression models. Using logit with no option will produce betas.
Using logistic will produce odds ratios. You can also get odds ratios using the command logit with or as an option.
The user-written commands eststo, estout and esttab (all part of the estout package; if you do not have them, type
ssc install estout) produce side-by-side comparisons of models that are easier to read.
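
To see the difference between the two commands, here is a minimal sketch that fits the same model four ways. It uses only the variables case and agegp, which are already in the data set; the choice of this particular model is for illustration only.

```stata
* Betas (log odds ratios)
logit case i.agegp

* Same model, reported as odds ratios
logit case i.agegp, or

* Odds ratios by default
logistic case i.agegp

* Same model, reported as betas
logistic case i.agegp, coef
```

The fitted model is identical in all four; only the scale on which the coefficients are displayed changes.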

Ille-et-Vilaine Data: Illustration

After creating some new variables for illustration purposes, 4 logistic regression models are fit and then compared
side-by-side.

Model 1: Predictors = heavy drinking, age

Model 2: Predictors = heavy smoking, age

Model 3: Predictors = heavy drinking, heavy smoking, age

Model 4: Predictors = heavy drinking, heavy smoking, drinking x smoking interaction, age

* Create some new variables for illustration purposes


. * HEAVY DRINKER: Create alcohol_80plus = 0/1 measure of alcohol use >=80 gm/day.
. generate alcohol_80plus=alcgp
. recode alcohol_80plus (1=0) (2=0) (3=1) (4=1)
(alcohol_80plus: 975 changes made)
. label define alcoholf 0 "< 80 gm/day" 1 "80+ gm/day"
. label values alcohol_80plus alcoholf
. label variable alcohol_80plus "Heavy Drinker"
. * Check variable creation
. tab2 alcgp alcohol_80plus
-> tabulation of alcgp by alcohol_80plus

Grouped |
alcohol | Heavy Drinker
consum. | < 80 gm/d 80+ gm/da | Total
------------+----------------------+----------
0-39 gm/day | 414 0 | 414
40-79 | 355 0 | 355
80-119 | 0 139 | 139
120+ | 0 67 | 67
------------+----------------------+----------
Total | 769 206 | 975


. * HEAVY SMOKER: Create smoking_30plus = 0/1 measure of tobacco use >=30 gm/day.
. generate smoking_30plus=tobgp
. recode smoking_30plus (1=0) (2=0) (3=0) (4=1)
(smoking_30plus: 975 changes made)
. label define smokingf 0 "< 30 gm/day" 1 "30+ gm/day"
. label values smoking_30plus smokingf
. * Check variable creation
. numlabel, add
. tab2 tobgp smoking_30plus
-> tabulation of tobgp by smoking_30plus

Grouped |
tobacco | smoking_30plus
consum. | 0. < 30 g 1. 30+ gm | Total
--------------+----------------------+----------
1. 0-9 gm/day | 526 0 | 526
2. 10-19 | 236 0 | 236
3. 20-29 | 131 0 | 131
4. 30+ | 0 82 | 82
--------------+----------------------+----------
Total | 893 82 | 975

. * INTERACTION: Create drinker_smoker = interaction of heavy drinking and heavy smoking


. generate drinker_smoker=alcohol_80plus*smoking_30plus
. label variable drinker_smoker "Interaction alcohol*smoking"
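
As an alternative to multiplying the two variables yourself, Stata's factor-variable notation can build the interaction inside the model statement. The sketch below is not the approach used in the rest of this handout. It should give the same fit as a model that includes alcohol_80plus, smoking_30plus and drinker_smoker, though the interaction row will be labeled differently in the output.

```stata
* The ## operator requests both main effects plus their interaction
logistic case i.alcohol_80plus##i.smoking_30plus i.agegp
```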

. * USER CREATED DESIGN VARIABLES FOR AGEGP


. * Note – If you do not have the command fre, type findit fre and download.
. fre agegp
agegp -- Age group
-------------------------------------------------------------
| Freq. Percent Valid Cum.
----------------+--------------------------------------------
Valid 1 25-34 | 116 11.90 11.90 11.90
2 35-44 | 199 20.41 20.41 32.31
3 45-54 | 213 21.85 21.85 54.15
4 55-64 | 242 24.82 24.82 78.97
5 65-74 | 161 16.51 16.51 95.49
6 75+ | 44 4.51 4.51 100.00
Total | 975 100.00 100.00
-------------------------------------------------------------

. generate age3544=(agegp==2) if agegp !=.


. generate age4554=(agegp==3) if agegp !=.
. generate age5564=(agegp==4) if agegp !=.
. generate age6574=(agegp==5) if agegp !=.
. generate age75plus=(agegp==6) if agegp !=.
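
A shortcut for creating a full set of 0/1 design variables is the generate() option of tabulate. In this sketch the stub name agedum is hypothetical. Stata appends 1, 2, ..., 6 to it, one indicator per age group, so that agedum2 corresponds to age3544, agedum3 to age4554, and so on.

```stata
* Creates agedum1 through agedum6, one 0/1 indicator per level of agegp
tabulate agegp, generate(agedum)

* Check one of them against the hand-made version
tabulate agedum2 age3544
```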


. * Check variable creations.


. numlabel, add
. tab2 agegp age3544
-> tabulation of agegp by age3544

| age3544
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 0 199 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 776 199 | 975

. tab2 agegp age4554


-> tabulation of agegp by age4554

| age4554
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 0 213 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 762 213 | 975

. tab2 agegp age5564


-> tabulation of agegp by age5564

| age5564
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 0 242 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 733 242 | 975


. tab2 agegp age6574


-> tabulation of agegp by age6574

| age6574
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 0 161 | 161
6. 75+ | 44 0 | 44
-----------+----------------------+----------
Total | 814 161 | 975

. tab2 agegp age75plus


-> tabulation of agegp by age75plus

| age75plus
Age group | 0 1 | Total
-----------+----------------------+----------
1. 25-34 | 116 0 | 116
2. 35-44 | 199 0 | 199
3. 45-54 | 213 0 | 213
4. 55-64 | 242 0 | 242
5. 65-74 | 161 0 | 161
6. 75+ | 0 44 | 44
-----------+----------------------+----------
Total | 931 44 | 975

. * Logistic model with user defined design variables for agegp


. logistic case alcohol_80plus age3544 age4554 age5564 age6574 age75plus

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
age3544 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
age4554 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
age5564 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
age6574 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
age75plus | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------


. *
. * Logistic model with STATA defined design variables for agegp
. logistic case alcohol_80plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
|
agegp |
2. 35-44 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
3. 45-54 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
4. 55-64 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
5. 65-74 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
6. 75+ | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
|
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------
The two outputs match. For the rest of this lab session, I will use Stata-defined design variables.

. * MODEL 1 –
. * Logistic Regression Heavy Drinking Alone - adjusted for age
. logistic case alcohol_80plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     199.30
                                                Prob > chi2       =     0.0000
Log likelihood = -395.09465                     Pseudo R2         =     0.2014

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.228385 .9892462 8.74 0.000 3.608397 7.575664
|
agegp |
2. 35-44 | 4.683066 4.991105 1.45 0.147 .5798813 37.82
3. 45-54 | 24.53994 25.10521 3.13 0.002 3.304189 182.2562
4. 55-64 | 40.6956 41.44652 3.64 0.000 5.528909 299.5404
5. 65-74 | 52.77508 53.98653 3.88 0.000 7.107029 391.895
6. 75+ | 52.41941 55.81879 3.72 0.000 6.502652 422.5653
|
_cons | .0064139 .0064735 -5.00 0.000 .0008872 .0463704
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison
. eststo model1

. * MODEL 2 –
. * Logistic Regression Heavy Smoking Alone - adjusted for age
. logistic case smoking_30plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(6)        =     145.72
                                                Prob > chi2       =     0.0000
Log likelihood = -421.88661                     Pseudo R2         =     0.1473

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
smoking_30plus | 4.211492 1.205916 5.02 0.000 2.402722 7.381906
|
agegp |
2. 35-44 | 6.267996 6.675411 1.72 0.085 .7773197 50.54262
3. 45-54 | 38.39114 39.30348 3.56 0.000 5.161798 285.536
4. 55-64 | 65.17199 66.48418 4.09 0.000 8.82513 481.2834
5. 65-74 | 82.44814 84.59853 4.30 0.000 11.03516 616.0035
6. 75+ | 59.4483 63.32511 3.84 0.000 7.369337 479.5683
|
_cons | .0060567 .0061361 -5.04 0.000 .0008315 .0441165
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison


. eststo model2

. * MODEL 3 –
. * Logistic Regression Heavy Drinking and Heavy Smoking - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp

Logistic regression                             Number of obs     =        975
                                                LR chi2(7)        =     219.23
                                                Prob > chi2       =     0.0000
Log likelihood = -385.12755                     Pseudo R2         =     0.2216

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
|
agegp |
2. 35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
3. 45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
4. 55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
5. 65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
6. 75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------


. * ESTSTO to save results for later comparison


. eststo model3

. * MODEL 4 –
. * Logistic Regression Heavy Drinking and Heavy Smoking PLUS INTERACTION - adjusted for age
. logistic case alcohol_80plus smoking_30plus i.agegp drinker_smoker

Logistic regression                             Number of obs     =        975
                                                LR chi2(8)        =     219.35
                                                Prob > chi2       =     0.0000
Log likelihood = -385.07068                     Pseudo R2         =     0.2217

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
alcohol_80plus | 5.018372 1.012542 7.99 0.000 3.379235 7.452591
smoking_30plus | 3.726905 1.379709 3.55 0.000 1.803979 7.699546
|
agegp |
2. 35-44 | 6.563396 7.142474 1.73 0.084 .7777266 55.38986
3. 45-54 | 34.21248 35.78455 3.38 0.001 4.404231 265.7657
4. 55-64 | 58.12253 60.60994 3.90 0.000 7.528611 448.7187
5. 65-74 | 83.55392 87.73638 4.21 0.000 10.66981 654.3001
6. 75+ | 74.15214 80.85386 3.95 0.000 8.749685 628.4273
|
drinker_smoker | 1.251609 .839525 0.33 0.738 .3361396 4.660342
_cons | .0039536 .0041028 -5.33 0.000 .0005172 .0302218
--------------------------------------------------------------------------------

. * ESTSTO to save results for later comparison


. eststo model4


. * ESTOUT for side-by-side comparison of the 4 models


. * BETAs
. estout model1 model2 model3 model4, prehead("Logistic Regression of Esophageal Cancer - BETA's")

Logistic Regression of Esophageal Cancer - BETA's
----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     1.654102                  1.633552     1.613106
1b.agegp                0            0            0            0
2.agegp          1.543953     1.835457      1.84063     1.881508
3.agegp          3.200302     3.647827     3.499904      3.53259
4.agegp           3.70612      4.17703     4.028732     4.062553
5.agegp          3.966039     4.412169     4.393586     4.425492
6.agegp          3.959277     4.085107     4.268525     4.306119
smoking_30~s                  1.437817     1.383765     1.315578
drinker_sm~r                                              .22443
_cons           -5.049291    -5.106586    -5.506652    -5.533117
----------------------------------------------------------------
Okay. We see the betas.

. * ODDS RATIOS – use option eform


. estout model1 model2 model3 model4, eform prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
----------------------------------------------------------------
                   model1       model2       model3       model4
                        b            b            b            b
----------------------------------------------------------------
case
alcohol_80~s     5.228385                  5.122038     5.018372
1b.agegp                1            1            1            1
2.agegp          4.683066     6.267996     6.300504     6.563396
3.agegp          24.53994     38.39114     33.11226     34.21248
4.agegp           40.6956     65.17199     56.18964     58.12253
5.agegp          52.77508     82.44814     80.93014     83.55392
6.agegp          52.41941      59.4483     71.41624     74.15214
smoking_30~s                  4.211492     3.989895     3.726905
drinker_sm~r                                            1.251609
_cons            .0064139     .0060567     .0040597     .0039536
----------------------------------------------------------------

The option “eform” stands for “exponentiated coefficients.” Thus, these are the odds ratios.
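
To convince yourself, exponentiate a beta from the first table by hand. The sketch below uses display as a calculator on the Model 1 beta for heavy drinking and the Model 2 beta for heavy smoking. The results should agree, up to rounding, with the corresponding odds ratios 5.228385 and 4.211492 in the second table.

```stata
* Model 1: heavy drinking, beta = 1.654102
display exp(1.654102)

* Model 2: heavy smoking, beta = 1.437817
display exp(1.437817)
```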


. * ESTTAB for side-by-side comparison of the 4 models


. * BETAs with chi square statistics
. esttab model1 model2 model3 model4, stats(n chi2 bic, star(chi2)) prehead("Logistic Regression of Esophageal Cancer - BETA's")

Logistic Regression of Esophageal Cancer - BETA's
                      (1)             (2)             (3)             (4)
                     case            case            case            case
----------------------------------------------------------------------------
case
alcohol_80~s        1.654***                        1.634***        1.613***
                   (8.74)                          (8.49)          (7.99)

1b.agegp                0               0               0               0
                      (.)             (.)             (.)             (.)

2.agegp             1.544           1.835           1.841           1.882
                   (1.45)          (1.72)          (1.71)          (1.73)

3.agegp             3.200**         3.648***        3.500***        3.533***
                   (3.13)          (3.56)          (3.37)          (3.38)

4.agegp             3.706***        4.177***        4.029***        4.063***
                   (3.64)          (4.09)          (3.90)          (3.90)

5.agegp             3.966***        4.412***        4.394***        4.425***
                   (3.88)          (4.30)          (4.22)          (4.21)

6.agegp             3.959***        4.085***        4.269***        4.306***
                   (3.72)          (3.84)          (3.95)          (3.95)

smoking_30~s                        1.438***        1.384***        1.316***
                                   (5.02)          (4.50)          (3.55)

drinker_sm~r                                                        0.224
                                                                   (0.33)

_cons              -5.049***       -5.107***       -5.507***       -5.533***
                  (-5.00)         (-5.04)         (-5.35)         (-5.33)
----------------------------------------------------------------------------
n
chi2                199.3***        145.7***        219.2***        219.3***
bic                 838.4           892.0           825.3           832.1
----------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

This tabulation shows the betas. Underneath each beta, in parentheses, is the Wald z statistic = (beta/standard error); esttab labels these “t statistics”.
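
Notice that the row labeled n in the table above is empty. A likely explanation, offered here as an assumption and not verified against this data set, is that Stata stores the number of observations as e(N), with a capital N, and the names given in stats() are case-sensitive. A sketch of the same call with N in place of n:

```stata
* Same table, requesting e(N) (capital N) for the number of observations
esttab model1 model2 model3 model4, stats(N chi2 bic, star(chi2))
```

If this is right, the row should show 975 for each model, matching the Number of obs reported by logistic.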


. * ODDS RATIOS and 95% CI’s


. * Models 1, 2, and 3
. esttab model1 model2 model3, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
                                (1)                       (2)                       (3)
                               case                      case                      case
------------------------------------------------------------------------------------------
case
alcohol_80~s                  5.228***                                            5.122***
                      [3.608,7.576]                                       [3.512,7.469]

1b.agegp                          1                         1                         1
                              [1,1]                     [1,1]                     [1,1]

2.agegp                       4.683                     6.268                     6.301
                      [0.580,37.82]             [0.777,50.54]             [0.763,52.06]

3.agegp                       24.54**                   38.39***                  33.11***
                      [3.304,182.3]             [5.162,285.5]             [4.336,252.9]

4.agegp                       40.70***                  65.17***                  56.19***
                      [5.529,299.5]             [8.825,481.3]             [7.409,426.1]

5.agegp                       52.78***                  82.45***                  80.93***
                      [7.107,391.9]             [11.04,616.0]             [10.51,623.1]

6.agegp                       52.42***                  59.45***                  71.42***
                      [6.503,422.6]             [7.369,479.6]             [8.588,593.9]

smoking_30~s                                            4.211***                  3.990***
                                                [2.403,7.382]             [2.185,7.287]
------------------------------------------------------------------------------------------
n
chi2                          199.3***                  145.7***                  219.2***
bic                           838.4                     892.0                     825.3
------------------------------------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001

I am showing models 1, 2, and 3 only because the output wraps around (and is unreadable) if I try to show
all 4 models. This tabulation shows the odds ratios and associated 95% CI.


. * ODDS RATIOS and 95% CI’s


. * Models 3 and 4
. esttab model3 model4, stats(n chi2 bic, star(chi2)) eform ci prehead("Logistic Regression of Esophageal Cancer - ODDS RATIO's")

Logistic Regression of Esophageal Cancer - ODDS RATIO's
                                (1)                       (2)
                               case                      case
----------------------------------------------------------------
case
alcohol_80~s                  5.122***                  5.018***
                      [3.512,7.469]             [3.379,7.453]

smoking_30~s                  3.990***                  3.727***
                      [2.185,7.287]             [1.804,7.700]

1b.agegp                          1                         1
                              [1,1]                     [1,1]

2.agegp                       6.301                     6.563
                      [0.763,52.06]             [0.778,55.39]

3.agegp                       33.11***                  34.21***
                      [4.336,252.9]             [4.404,265.8]

4.agegp                       56.19***                  58.12***
                      [7.409,426.1]             [7.529,448.7]

5.agegp                       80.93***                  83.55***
                      [10.51,623.1]             [10.67,654.3]

6.agegp                       71.42***                  74.15***
                      [8.588,593.9]             [8.750,628.4]

drinker_sm~r                                            1.252
                                                [0.336,4.660]
----------------------------------------------------------------
n
chi2                          219.2***                  219.3***
bic                           825.3                     832.1
----------------------------------------------------------------
Exponentiated coefficients; 95% confidence intervals in brackets
* p<0.05, ** p<0.01, *** p<0.001


4. Likelihood Ratio Test for 2 “Hierarchical” Models

Summary
It is of interest to know whether the inclusion of extra predictors to a model is statistically significant. The smaller
model (“reduced”) contains the control variables. The larger model (“full”) contains the control variables plus the
extra variables in question.

Models.
Reduced: $\text{logit}[\pi \mid X_1, X_2, \ldots, X_p] = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p$
Full: $\text{logit}[\pi \mid X_1, X_2, \ldots, X_p, X_{p+1}, X_{p+2}, \ldots, X_{p+k}] = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p + \beta_{p+1} X_{p+1} + \ldots + \beta_{p+k} X_{p+k}$

Null and Alternative Hypotheses:

$H_0: \beta_{p+1} = \beta_{p+2} = \ldots = \beta_{p+k} = 0$
$H_A$: not $H_0$

Definition Likelihood Ratio Test (LR)

LR statistic = $\text{Deviance}_{\text{REDUCED}} - \text{Deviance}_{\text{FULL}}$

= $[(-2) \ln(\text{Likelihood})_{\text{REDUCED}}] - [(-2) \ln(\text{Likelihood})_{\text{FULL}}]$

Under the null hypothesis, LR is distributed chi-square with DF = k.

Ille-et-Vilaine Data: Illustration

A likelihood ratio test is performed to assess the statistical significance of the interaction of heavy drinking and heavy
smoking in the model, controlling for age and the main effects of each of heavy drinking and heavy smoking. Thus,

Model “reduced”: Predictors = age, heavy drinking, heavy smoking

Model “full”: Predictors = age, heavy drinking, heavy smoking + (drinking x smoking)
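
These two models are Model 3 and Model 4 of Section 3, so the LR statistic can be worked out by hand from the log likelihoods already reported there (-385.12755 for the reduced model and -385.07068 for the full model). The sketch below uses scalars and display as a calculator. The scalar names are made up for illustration. With one extra predictor, k = 1.

```stata
* Deviance = (-2) x log likelihood
scalar dev_reduced = -2*(-385.12755)
scalar dev_full    = -2*(-385.07068)

* LR statistic = Deviance(reduced) - Deviance(full)
scalar LR = dev_reduced - dev_full
display "LR statistic = " LR

* p-value = upper tail of the chi-square distribution with DF = k = 1
display "p-value = " chi2tail(1, LR)
```

The LR statistic works out to about 0.114, which is small for a chi-square with 1 degree of freedom and is in line with the non-significant z test for drinker_smoker in Model 4.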


. *** LR Test WITH DISPLAY OF INTERMEDIATE RESULTS


. * Reduced model
. logistic case i.agegp smoking_30plus alcohol_80plus
Logistic regression Number of obs = 975
LR chi2(7) = 219.23
Prob > chi2 = 0.0000
Log likelihood = -385.12755 Pseudo R2 = 0.2216
(-2) Log likelihood Reduced Model = 770.2551
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
agegp |
2. 35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
3. 45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
4. 55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
5. 65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
6. 75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------
. estimates store reduced

. * Full model
. logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
Logistic regression Number of obs = 975
LR chi2(8) = 219.35
Prob > chi2 = 0.0000
Log likelihood = -385.07068 Pseudo R2 = 0.2217
(-2) Log likelihood Full Model = 770.14136
--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
agegp |
2. 35-44 | 6.563396 7.142474 1.73 0.084 .7777266 55.38986
3. 45-54 | 34.21248 35.78455 3.38 0.001 4.404231 265.7657
4. 55-64 | 58.12253 60.60994 3.90 0.000 7.528611 448.7187
5. 65-74 | 83.55392 87.73638 4.21 0.000 10.66981 654.3001
6. 75+ | 74.15214 80.85386 3.95 0.000 8.749685 628.4273
|
smoking_30plus | 3.726905 1.379709 3.55 0.000 1.803979 7.699546
alcohol_80plus | 5.018372 1.012542 7.99 0.000 3.379235 7.452591
drinker_smoker | 1.251609 .839525 0.33 0.738 .3361396 4.660342
_cons | .0039536 .0041028 -5.33 0.000 .0005172 .0302218
--------------------------------------------------------------------------------

. estimates store full

. lrtest reduced full

Likelihood-ratio test                                 LR chi2(1)  =      0.11
(Assumption: reduced nested in full)                  Prob > chi2 =    0.7359

CHECK: [(-2) ln L reduced] – [(-2)ln L full] = 770.2551 – 770.14136 = 0.11374 match!
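
The CHECK above can also be carried out inside Stata rather than by hand. The following is a minimal sketch, assuming the two models are refit exactly as shown; the scalar names ll_reduced, ll_full and lr_stat are hypothetical and chosen only for illustration. Its output should agree with the lrtest results up to rounding.

```stata
* Refit each model quietly and keep its log likelihood, which logistic leaves in e(ll)
quietly logistic case i.agegp smoking_30plus alcohol_80plus
scalar ll_reduced = e(ll)

quietly logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
scalar ll_full = e(ll)

* LR statistic = [(-2) ln L reduced] - [(-2) ln L full]
scalar lr_stat = (-2*ll_reduced) - (-2*ll_full)

* Upper-tail chi-square probability; DF = 1 because the full model adds one parameter
display "LR chi2(1)  = " %6.4f lr_stat
display "Prob > chi2 = " %6.4f chi2tail(1, lr_stat)
```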


. *** LR Test - WITHOUT DISPLAY OF RESULTS using quietly:


. * Reduced model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus
. estimates store reduced

. * Full model
. quietly: logistic case i.agegp smoking_30plus alcohol_80plus drinker_smoker
. estimates store full

. lrtest reduced full

Likelihood-ratio test                                 LR chi2(1)  =      0.11
(Assumption: reduced nested in full)                  Prob > chi2 =    0.7359

CHECK: [(-2) ln L reduced] – [(-2)ln L full] = 770.2551 – 770.14136 = 0.11374 match!
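
An alternative way to specify the interaction is Stata's factor-variable notation, which builds the product term on the fly. The sketch below assumes that drinker_smoker is the product of the two 0/1 indicators smoking_30plus and alcohol_80plus; if it was coded differently, the two approaches will not agree. The stored names reduced_fv and full_fv are hypothetical.

```stata
* Reduced model: age group plus the two main effects
quietly logistic case i.agegp i.smoking_30plus i.alcohol_80plus
estimates store reduced_fv

* Full model: ## expands to both main effects plus their interaction
quietly logistic case i.agegp i.smoking_30plus##i.alcohol_80plus
estimates store full_fv

* Likelihood ratio test of the single interaction parameter
lrtest reduced_fv full_fv
```

With this notation there is no need to create and maintain a separate interaction variable in the data set.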


5. Regression Diagnostics for Logistic Regression: Numerical

Preliminary – Install the suite of commands in the package spost9_ado

. * Step 1: Install SPost Using net install


. net install spost9_ado

checking spost9_ado consistency and verifying not already installed...


installing into /Users/cbigelow/Library/Application Support/Stata/ado/plus/...
installation complete.

. * Step 2: Now obtain all the ancillary files


. net get spost9_do
checking spost9_do consistency and verifying not already installed...

copying into current directory...


copying st9all.do
copying st9ch2tutorial.do
copying st9ch3estimate.do
copying st9ch4binary.do
copying st9ch5ordinal.do
copying st9ch6nomcase.do
copying st9ch7nomalt.do
copying st9ch8count.do
copying st9ch9other.do
copying binlfp2.dta
copying couart2.dta
copying gsskidvalue2.dta
copying nomocc2.dta
copying ordwarm2.dta
copying science2.dta
copying sciwork.dta
copying travel2.dta
copying travel2case.dta
copying wlsrnk.dta
ancillary files successfully copied.
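
To confirm that the installation succeeded, one option is to ask Stata where it finds the command. This is a minimal check, assuming the package placed fitstat on the ado path as the log above indicates.

```stata
* Report the location of the installed fitstat.ado file
which fitstat

* Open the help file to review the available options
help fitstat
```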


Summary
Now you have a model that is your “candidate” final model. There are lots of further explorations you can do to
assess whether this really is a “good” final model.

Ille-et-Vilaine Data: Illustration

Having failed to reject the null hypothesis of no interaction of heavy smoking and heavy drinking in our likelihood ratio
test (p = 0.7359), our “candidate” final model contains: heavy drinking, heavy smoking, and age.

. * Before requesting any diagnostics of a model, you must have fit it.
. logistic case i.agegp smoking_30plus alcohol_80plus

Logistic regression Number of obs = 975


LR chi2(7) = 219.23
Prob > chi2 = 0.0000
Log likelihood = -385.12755 Pseudo R2 = 0.2216

--------------------------------------------------------------------------------
case | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
---------------+----------------------------------------------------------------
agegp |
35-44 | 6.300504 6.788536 1.71 0.088 .7625007 52.06074
45-54 | 33.11226 34.34676 3.37 0.001 4.335635 252.8861
55-64 | 56.18964 58.08311 3.90 0.000 7.409194 426.1294
65-74 | 80.93014 84.28445 4.22 0.000 10.51063 623.1486
75+ | 71.41624 77.18044 3.95 0.000 8.588013 593.8835
|
smoking_30plus | 3.989895 1.226166 4.50 0.000 2.184601 7.287035
alcohol_80plus | 5.122038 .9859232 8.49 0.000 3.512352 7.469432
_cons | .0040597 .0041822 -5.35 0.000 .000539 .0305757
--------------------------------------------------------------------------------


. *
. ***** 5a) Numerical measures of fit using command FITSTAT
. fitstat

Measures of Fit for logistic of case

Log-Lik Intercept Only:       -494.744   Log-Lik Full Model:            -385.128
D(967):                        770.255   LR(7):                          219.233
                                         Prob > LR:                        0.000
McFadden's R2:                   0.222   McFadden's Adj R2:                0.205
ML (Cox-Snell) R2:               0.201   Cragg-Uhler(Nagelkerke) R2:       0.316
McKelvey & Zavoina's R2:         0.466   Efron's R2:                       0.224
Variance of y*:                  6.157   Variance of error:                3.290
Count R2:                        0.817   Adj Count R2:                     0.110
AIC:                             0.806   AIC*n:                          786.255
BIC:                         -5885.062   BIC':                          -171.056
BIC used by Stata:             825.315   AIC used by Stata:              786.255

PARTIAL KEY:
Log-Lik Intercept Only = -494.744: This is the log likelihood for the intercept only model
Log-Lik Full Model = -385.128: This is the log likelihood for the current model
D(967) = 770.255: This is the deviance of the current model, equal to (-2) x (log likelihood of the current model)
LR(7) = 219.233 is the likelihood ratio chi square statistic which tests whether the current
model predicts better than the intercept only model
Prob > LR = 0.000: This is the p-value for the LR(7) test. It is displayed as 0.000 because it is smaller than .0005
Then there is a series of pseudo-R2 measures.
Finally, there is a series of information criterion measures that are used to compare different models.
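
Several of the fitstat quantities can be reproduced from results that logistic stores after estimation. The following is a minimal sketch, assuming the candidate model is the one fit above; it relies on the stored results e(ll), e(ll_0), e(rank) and e(N). The three displayed values should agree with McFadden's R2, "AIC used by Stata" and "BIC used by Stata" in the fitstat output.

```stata
* Refit the candidate model so that its stored results are current
quietly logistic case i.agegp smoking_30plus alcohol_80plus

* Built-in table of AIC and BIC
estat ic

* McFadden's R2 = 1 - (log likelihood of current model)/(log likelihood of intercept only model)
display "McFadden R2 = " %6.3f 1 - e(ll)/e(ll_0)

* AIC = deviance + 2 x (number of estimated parameters)
display "AIC = " %9.3f -2*e(ll) + 2*e(rank)

* BIC = deviance + (number of estimated parameters) x ln(number of observations)
display "BIC = " %9.3f -2*e(ll) + e(rank)*ln(e(N))
```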

. *
. ***** 5b) Test of Model Adequacy Using command LINKTEST
. linktest

-- iteration output omitted --


Logistic regression Number of obs = 975
LR chi2(2) = 219.24
Prob > chi2 = 0.0000
Log likelihood = -385.12412 Pseudo R2 = 0.2216

------------------------------------------------------------------------------
case | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_hat | 1.009135 .1404655 7.18 0.000 .7338274 1.284442
_hatsq | .0039801 .0479037 0.08 0.934 -.0899094 .0978696
_cons | .0008299 .1243723 0.01 0.995 -.2429353 .2445952
------------------------------------------------------------------------------

WHAT TO LOOK FOR:


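linktest refits the model using just two predictors: the linear predictor from the fitted model (_HAT) and its square (_HATSQ).
We expect the p-value for _HAT to be highly significant.
Evidence of a GOOD FIT is reflected in a NON-SIGNIFICANT _HATSQ
Here the p-value for _HATSQ is .934
This suggests good model adequacy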
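
To see what linktest is doing, the same regression can be built step by step. This is a sketch under the assumption that the candidate model is the one fit above; the variable names lp_hat and lp_hat_sq are hypothetical.

```stata
* Refit the candidate model
quietly logistic case i.agegp smoking_30plus alcohol_80plus

* Linear predictor (log odds scale) and its square
predict double lp_hat, xb
generate double lp_hat_sq = lp_hat^2

* Regress the outcome on the linear predictor and its square.
* A significant coefficient on lp_hat_sq would suggest a misspecified model.
logit case lp_hat lp_hat_sq
```

The coefficients on lp_hat and lp_hat_sq play the roles of _hat and _hatsq in the linktest table.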


. *
. ***** 5c) Test of Overall Goodness of Fit Using command LFIT
. lfit, group(10) table

Logistic model for case, goodness-of-fit test

(Table collapsed on quantiles of estimated probabilities)


(There are only 9 distinct quantiles because of ties)
+--------------------------------------------------------+
| Group | Prob | Obs_1 | Exp_1 | Obs_0 | Exp_0 | Total |
|-------+--------+-------+-------+-------+-------+-------|
| 1 | 0.0159 | 0 | 0.6 | 106 | 105.4 | 106 |
| 2 | 0.0249 | 6 | 3.9 | 153 | 155.1 | 159 |
| 3 | 0.1158 | 4 | 5.1 | 45 | 43.9 | 49 |
| 4 | 0.1185 | 16 | 17.5 | 132 | 130.5 | 148 |
| 6 | 0.1857 | 27 | 29.7 | 133 | 130.3 | 160 |
|-------+--------+-------+-------+-------+-------+-------|
| 7 | 0.2473 | 42 | 38.0 | 115 | 119.0 | 157 |
| 8 | 0.3462 | 0 | 0.3 | 1 | 0.7 | 1 |
| 9 | 0.5388 | 66 | 62.8 | 67 | 70.2 | 133 |
| 10 | 0.8704 | 39 | 41.9 | 23 | 20.1 | 62 |
+--------------------------------------------------------+

number of observations = 975


number of groups = 9
Hosmer-Lemeshow chi2(7) = 4.43
Prob > chi2 = 0.7291

WHAT TO LOOK FOR:


Evidence of GOOD OVERALL FIT is reflected in a NON-SIGNIFICANT p-value
Here the Hosmer-Lemeshow test p-value is .7291
This suggests good overall fit
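
In current versions of Stata the same test is available as the postestimation command estat gof; lfit is the older name. The sketch below assumes the candidate model fit above and also recomputes the p-value from the reported statistic and degrees of freedom.

```stata
* Refit the candidate model
quietly logistic case i.agegp smoking_30plus alcohol_80plus

* Hosmer-Lemeshow test using (up to) 10 groups, with the table of observed and expected counts
estat gof, group(10) table

* p-value recomputed from the reported chi-square statistic (4.43) on 7 degrees of freedom
display "Prob > chi2 = " %6.4f chi2tail(7, 4.43)
```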


6. Regression Diagnostics for Logistic Regression: Graphical


. *
. ***** 6a) Plot of ROC Curve using LROC
. predict xb, xb
. lroc
Logistic model for case
number of observations = 975
area under ROC curve = 0.8119

WHAT TO LOOK FOR:


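Classification that is no better than a coin toss is represented by the 45 degree reference line
Evidence of GOOD FIT is reflected in an ROC curve that lies above the 45 degree reference line
Area under the ROC curve = .8119 says that, for a randomly chosen case and a randomly chosen control, there is an 81%
chance that the model assigns the higher predicted probability to the case. It is not the percent of observations
correctly classified.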
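
The area under the ROC curve and the percent of observations correctly classified are different quantities. A minimal sketch for obtaining both, assuming the candidate model fit above and the default classification cutoff of 0.5:

```stata
* Refit the candidate model
quietly logistic case i.agegp smoking_30plus alcohol_80plus

* Area under the ROC curve, without redrawing the graph
lroc, nograph

* Classification table: sensitivity, specificity and percent correctly classified
estat classification
```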


. *
. ***** 6b) Plot of Y=Standardized Residual versus X=Observation Number
. predict std_residual, rs
. label variable std_residual "Standardized Residual"
. generate index=_n
. label variable index "Observation Number"

. graph twoway (scatter std_residual index, msymbol(d)), xlabel(0(100)1000) ylabel(-4(2)4)
> title("Plot of Standardized Residuals versus Observation Number")
> xtitle("Observation Number") ytitle("Standardized Residual") yline(0)
> caption("stdresidual.png", size(vsmall))

WHAT TO LOOK FOR:


Think of standardized residuals as Z-scores, approximately.
We’d like the majority to be within 1.96 of the expected value of 0
Values outside ±1.96 are potentially extreme.
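
To follow up on the plot, the potentially extreme observations can be counted and listed. This sketch assumes the variables std_residual and index created above still exist in memory.

```stata
* How many observations have a standardized residual beyond 1.96 in absolute value?
count if abs(std_residual) > 1.96 & !missing(std_residual)

* List them together with their outcome and predictor values
list index case agegp smoking_30plus alcohol_80plus std_residual ///
    if abs(std_residual) > 1.96 & !missing(std_residual)
```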


. *
. ***** 6c) Plot of Influential Observations: Y=Cook versus X=Observation Number
. predict cook, dbeta
. label variable cook "Cook Distance"

. graph twoway (scatter cook index, msymbol(d)), xlabel(0(100)1000)
> title("Plot of Cook Distance versus Observation Number")
> xtitle("Observation Number") ytitle("Cook Distance")
> caption("cook.png", size(vsmall))

WHAT TO LOOK FOR:


Look for an even ribbon of Cook distance values with no spikes.
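
Any spikes seen in the plot can be identified by listing the observations with the largest values. This sketch assumes the variables cook and index created above still exist in memory; the choice of the top 10 is arbitrary.

```stata
* Sort from largest to smallest influence value
gsort -cook

* Inspect the 10 most influential observations
list index case agegp smoking_30plus alcohol_80plus cook in 1/10

* Restore the original observation order
sort index
```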


7. Tip – Save your Commands to a DO File for Later Use

Step 1 – Right click anywhere inside the Review window. From the drop down menu, choose SELECT ALL

Step 2 – Right click again. From the drop down menu, choose SEND TO DO-FILE EDITOR

Stata will put you into the Do-File Editor, where you should see the commands you selected.

Step 3 – Click on the SAVE icon. At SAVE AS: provide a name. At WHERE: provide a path
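
A saved do file for this illustration might look something like the following. This is only a sketch: the file names logistic_illustration.do and logistic_illustration.log are hypothetical, and only commands that run on the downloaded data set as is are included.

```stata
* logistic_illustration.do (hypothetical file name)

* Close any log left open by an earlier run, then start a new text log
capture log close
log using "logistic_illustration.log", text replace

* Load the data directly from the internet
use "http://people.umass.edu/biep640w/datasets/illeetvilaine.dta", clear

* Fit the candidate final model
logistic case i.agegp smoking_30plus alcohol_80plus

* Numerical diagnostics
linktest
lfit, group(10) table

log close
```

To rerun everything later, type do logistic_illustration.do in the Command window, or open the file in the Do-File Editor and click the execute icon.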
