
Stat 431 Assignment 2 Winter 2017

Due by 1:00pm on Friday, February 3, 2017

Notes for Submission: Upload your assignment directly to Crowdmark via the link you received by email
(let me know if you have not received this email). It is your responsibility to make sure your solution to
each question is submitted in the correct section, that the pages are rotated correctly, and that everything is
legible. Typed solutions are preferred. Be sure to include all R code and relevant output for each question
(where applicable). Once the solution key is posted on Learn, no further late submissions will be accepted.

Question 1 [6 marks]
This problem is adapted from Jewell, N.P. (2003). Statistics for Epidemiology. CRC Press.
The data below come from a study investigating the role of several reproductive risk factors for the development
of breast cancer. Parity (number of pregnancies carried to term) is coded as X1 = 1 if parity is 0-1, and
X1 = 0 if parity is ≥ 2. Similarly, age is coded X2 = 1 if age is < 40 years, and X2 = 0 if age is ≥ 40
years. Cases refer to subjects with breast cancer and controls are those without breast cancer. For the parts
below, do not fit the logistic regression using statistical software, as all answers can be computed by simple
calculator operations. Be sure to show your work for all calculations.

                               Breast Cancer Status
Age           Parity           Cases      Controls
< 40 years    0-1 births          24            58
              2+ births           96           160
≥ 40 years    0-1 births         127           172
              2+ births          353           718

a) [3 marks] Suppose age is ignored and the logistic model

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1$$

is fit to the data. What are the estimate and interpretation of β1? Conduct a Wald-based hypothesis
test of H0 : β1 = 0.

b) [3 marks] Suppose the logistic model

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

is fit to the data. What are the estimates and interpretations of the four β parameters?

Question 2 [12 marks]
The data for this problem come from Hosmer Jr, D.W., Lemeshow, S., & Sturdivant, R.X. (2013). Applied
Logistic Regression: Third Edition. John Wiley & Sons. The authors state that:

“Myopia, more commonly referred to as nearsightedness, is an eye condition where an individual
has difficulty seeing things at a distance. ... The risk factors for the development of myopia have
been debated for a long time and include genetic factors (e.g., family history of myopia) and the
amount and type of visual activity that a child performs (e.g., studying, reading, TV watching,
computer or video game playing, and sports/outdoor activity).”

There are 618 subjects who were not myopic when they entered the study. The response of interest is whether
or not the subject became myopic at any time during five years of follow-up. The following variables are
included in the dataset myopia431.txt, which is available for download from Learn.

Variable Name   Values/Labels    Description

MYOPIC          0=No, 1=Yes      Myopia within the first five years of follow-up
AGE             years            Age at first visit
SPORTHR         hours per week   Hours per week spent engaging in sports/outdoor activities
DIOPTERHR       hours per week   Composite measure of near-work (studying, reading, etc.) activities
MOMMY           0=No, 1=Yes      Was the subject’s mother myopic?
DADMY           0=No, 1=Yes      Was the subject’s father myopic?

(a) [4 marks] Input the data into R, fit the main effects logistic regression model, and print the summary
output from R. Find the fitted probability of becoming myopic for a 6 year old child, who engages in
10 hours of sports per week, 20 hours of near-work per week, and whose mother and father are both
myopic.
(b) [1 mark] What is the interpretation of the regression coefficient for the SPORTHR variable in the
main effects model?
(c) [4 marks] Using the main effects model, find an estimate and 95% confidence interval for the odds
ratio for developing myopia in a child who has both a mother and father with myopia versus a child
who has both a mother and father without myopia, holding all other factors constant.
(d) [3 marks] Use a Deviance/Likelihood Ratio test to conduct a test of the null hypothesis that parental
myopia status is not associated with a child’s risk of developing myopia. You should use the main
effects model as the full model and define an appropriate reduced model. Be sure to carefully state the
null and alternative hypotheses in terms of the regression coefficients (be explicit about which model
you are referring to) and give the formula of the test statistic and its asymptotic distribution under the
null hypothesis. What is the conclusion of the test?
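The quantities requested in parts (a), (c), and (d) follow standard formulas, sketched below in Python for illustration only (the assignment itself asks for R). The function names are hypothetical. The sketch assumes a Wald interval with z = 1.96 for part (c) and a reduced model that drops both MOMMY and DADMY for part (d), so the test has 2 degrees of freedom. All inputs (coefficients, variances, covariance, deviances) must come from your own fitted models; none are supplied here.

```python
import math

def fitted_probability(coefs, x):
    """Inverse logit of the linear predictor.
    coefs and x are equal-length sequences; x starts with 1 for the intercept."""
    eta = sum(b * v for b, v in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-eta))

def odds_ratio_ci(b_mom, b_dad, var_mom, var_dad, cov, z=1.96):
    """Odds ratio for both parents myopic versus neither, with a Wald interval.
    The log odds ratio is b_mom + b_dad, and its variance includes the
    covariance term: var_mom + var_dad + 2 * cov."""
    log_or = b_mom + b_dad
    se = math.sqrt(var_mom + var_dad + 2.0 * cov)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

def lrt_pvalue_2df(dev_reduced, dev_full):
    """Likelihood ratio statistic and p-value for a 2-parameter restriction.
    For a chi-square variable with 2 degrees of freedom, P(X > x) = exp(-x / 2)."""
    stat = dev_reduced - dev_full
    return stat, math.exp(-stat / 2.0)
```

The closed-form tail probability in `lrt_pvalue_2df` holds only for 2 degrees of freedom; a reduced model that drops a different number of terms needs a general chi-square tail function.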

Question 3 [12 marks]
In this problem you will investigate how the estimation of β1 is affected by the presence and significance of
the other explanatory variables included in a logistic regression model. We have not discussed theoretical
results in this setting so the investigation will rely on simulation. Consider the true underlying model:

$$\log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 x_1 + \ldots + \beta_{15} x_{15}$$

where

$$\beta = (0.5, 0.75, 0.25, 0.5, 0.75, 1.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)^\top$$

Here are the steps in generating a single dataset and conducting an analysis:

• Generate X, a ((p − 1) × n) matrix of independent N(0, 1) explanatory variables. Then include a row of
ones to make X a (p × n) matrix (p = 16).
• Use X and the true values of β (p × 1) to generate π (n × 1), the vector of individual event probabilities.
• Generate an (n × 1) vector Y of individual event indicators using the true model above.
• Fit a logistic regression using only selected explanatory variables (with intercept). We will consider
three analysis models:
1. x1 only (omit important explanatory variables and nuisance variables)
2. x1 , . . . , x5 (include important explanatory variables, exclude nuisance variables)
3. x1 , . . . , x15 (include important explanatory variables and nuisance variables)
• Save the values of β̂1 and its standard error
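The steps above can be sketched as follows. This is a minimal illustration in Python (the assignment itself asks for R), covering data generation and analysis model 1 only. The function names `simulate_dataset` and `fit_x1_only` are hypothetical. Subjects are stored as rows, which is the transpose of the (p × n) layout described above. The hand-written Newton-Raphson fit is an assumption standing in for a library routine such as R's `glm`.

```python
import math
import random

# True coefficients from the assignment: intercept first, then beta_1..beta_15.
BETA = [0.5, 0.75, 0.25, 0.5, 0.75, 1.0] + [0.0] * 10

def simulate_dataset(n, rng):
    """Return (X, y). Each row of X is [1, x1, ..., x15] with N(0, 1) covariates;
    y holds Bernoulli(pi_i) indicators with logit(pi_i) = x_i' beta."""
    X, y = [], []
    for _ in range(n):
        row = [1.0] + [rng.gauss(0.0, 1.0) for _ in range(len(BETA) - 1)]
        eta = sum(b * x for b, x in zip(BETA, row))
        pi = 1.0 / (1.0 + math.exp(-eta))
        X.append(row)
        y.append(1 if rng.random() < pi else 0)
    return X, y

def fit_x1_only(X, y, iters=25):
    """Newton-Raphson fit of logit(pi) = b0 + b1 * x1 (analysis model 1).
    Returns (b1_hat, se_b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for row, yi in zip(X, y):
            x1 = row[1]
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x1)))
            w = p * (1.0 - p)
            g0 += yi - p            # score for the intercept
            g1 += (yi - p) * x1     # score for the slope
            h00 += w                # entries of the 2 x 2 information matrix
            h01 += w * x1
            h11 += w * x1 * x1
        det = h00 * h11 - h01 * h01
        # Newton step: add (information inverse) times (score)
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Slope variance is the (2, 2) entry of the inverse information. The
    # information is from the last pass, which matches the final estimate
    # once the iteration has converged.
    return b1, math.sqrt(h00 / det)
```

A single replicate could then be run as `X, y = simulate_dataset(100, random.Random(1))` followed by `fit_x1_only(X, y)`, where the seed value is arbitrary. Analysis models 2 and 3 need a general multi-parameter fit, which this sketch does not attempt.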

We will consider each of the three analysis models above at two sample sizes, n = 100 and n = 500. Therefore
there are a total of six scenarios. For each scenario repeat the simulation N = 500 times. So for each scenario
you will have k = 1, ..., 500 estimates $\hat{\beta}_1^{(k)}$ and corresponding standard error estimates
$\widehat{se}^{(k)}(\hat{\beta}_1)$. For each scenario calculate the following summary statistics:

• The average Percent Bias: $\frac{1}{N} \sum_{k=1}^{N} (\hat{\beta}_1^{(k)} - \beta_1)/\beta_1$
• The overall 95% Confidence Interval Coverage Probability:
$\frac{1}{N} \sum_{k=1}^{N} I\left[\hat{\beta}_{1,\text{lower}}^{(k)} \le \beta_1 \le \hat{\beta}_{1,\text{upper}}^{(k)}\right]$, where
$(\hat{\beta}_{1,\text{lower}}^{(k)}, \hat{\beta}_{1,\text{upper}}^{(k)})$ is the estimated 95% confidence interval for β1 from simulation k
• The Average Standard Error (ASE): $\frac{1}{N} \sum_{k=1}^{N} \widehat{se}^{(k)}(\hat{\beta}_1)$
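The three summary statistics can be computed from the saved estimates and standard errors as in the sketch below, written in Python for illustration only (the assignment itself asks for R). The function name `summarize` is hypothetical, and the interval is assumed to be the Wald interval, estimate ± 1.96 × standard error, since the text does not say how the 95% interval is constructed. The bias is returned exactly as the formula above defines it, as a proportion of β1; multiply by 100 to report it as a percentage.

```python
def summarize(estimates, std_errors, true_beta1=0.75, z=1.96):
    """Return (relative_bias, coverage, ase) for one scenario.
    estimates and std_errors are equal-length lists over the N replicates."""
    n_sims = len(estimates)
    # Average of (estimate - truth) / truth over replicates.
    relative_bias = sum((b - true_beta1) / true_beta1 for b in estimates) / n_sims
    # Fraction of replicates whose Wald interval contains the true value.
    covered = sum(
        1 for b, se in zip(estimates, std_errors)
        if b - z * se <= true_beta1 <= b + z * se
    )
    coverage = covered / n_sims
    # Average of the estimated standard errors.
    ase = sum(std_errors) / n_sims
    return relative_bias, coverage, ase
```

The default `true_beta1=0.75` is the value of β1 in the true model above.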

a) [4 marks] Run the simulation study described above. Be sure to submit well-commented code. Report
your results in a table of the form:

Variables                    n = 100                            n = 500
Included        % Bias   95% CI Coverage   ASE     % Bias   95% CI Coverage   ASE
x1 only
x1, ..., x5
x1, ..., x15

b) [4 marks] Provide a written discussion of the patterns you observe in the table you produce. What
happens when important explanatory variables are omitted from the model? What happens when
unimportant explanatory variables are included in the model?
c) [4 marks] Repeat part (a) but this time generate $Y_i$ from a Normal distribution with mean $x_i^\top \beta$ and
variance 1. Fit a linear regression model to your simulated data rather than a logistic regression model.
Provide a written discussion of the patterns you observe in the linear regression simulation study and
contrast them with the patterns you observed for the logistic regression study.
