
Practice Exam 3 KEY

STA 36-202 Statistical Methods


Spring 2017
Remark: This practice exam is a combination of parts of old exercises and
hence longer than the real exam will be.

Printed Name ________________________________________

Andrew I.D. ____________________________

Lab Section:

Section A (Th. 9:30)    Section B (Th. 12:30)    Section C (Th. 3:30)
Section D (F. 10:30)    Section E (F. 11:30)     Section F (F. 1:30)

Date: This exam will take place Wednesday, May 3, in lecture.

Materials: One standard (8 by 11) sheet of notes front and back, and calculator (but not the
one on your cell phone), are allowed.
Cell phone, and all other notes and books, must be stored out of reach and out of sight.
Do not share any materials during the exam.
Check that your exam has all its correct pages.

Format: This is an individual examination. No communication or collaboration is permitted.


Carry calculations to sufficient decimal places to demonstrate the correct answer.
Work must be shown when applicable for full credit.
Comment on numerical results when appropriate.

Time: You have 50 minutes for this exam.

Good luck!

Coverage on the Second Hourly Exam:

lecture 30-34 / lab 9

This document is available on: http://www.cmu.edu/blackboard


Spring 2017 Stat 202 Page 2 of 11
Practice Exam 3

Case Study 1: Modeling Disease Prevention

Mosquito-borne diseases are an ongoing concern, with the Zika virus only the most
recent to be in headlines. Past studies can help us understand the problem.

One mosquito-borne disease which has been a longtime problem is dengue fever. In this
scenario we will analyze part of an observational study conducted in Puerto Vallarta in
1984 to identify risk factors for dengue fever.

The data consist of the following variables:

Age: The subject's age in years;

DENGUE (the response variable of interest): Coded as 1 if they were found to be
suffering from dengue fever, 0 if they weren't.

1 (a) Write the binary logistic regression model (you may use any of the versions seen
in lecture), and say how many parameters there are in the model.
The model is:
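$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1\,\text{Age}$ (this is the log-odds version). Other equivalent versions are:

$\frac{p}{1-p} = e^{\beta_0 + \beta_1\,\text{Age}}$, or $p = \frac{e^{\beta_0 + \beta_1\,\text{Age}}}{1 + e^{\beta_0 + \beta_1\,\text{Age}}}$, or $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1\,\text{Age})}}$
[note the extra negative sign in the exponent]

There are two parameters ($\beta_0$ and $\beta_1$).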
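The equivalence of the versions of the model can be checked numerically. The following is a minimal Python sketch; the parameter values and the age are hypothetical, chosen only for illustration, and the function names are not from the course materials.

```python
import math

def p_from_odds_form(b0, b1, age):
    """p = e^(b0 + b1*age) / (1 + e^(b0 + b1*age))."""
    eta = b0 + b1 * age
    return math.exp(eta) / (1 + math.exp(eta))

def p_from_negated_form(b0, b1, age):
    """p = 1 / (1 + e^-(b0 + b1*age)); note the extra negative sign."""
    eta = b0 + b1 * age
    return 1 / (1 + math.exp(-eta))

# Hypothetical values, for illustration only (not the fitted estimates).
b0, b1, age = 0.5, -0.1, 20

p1 = p_from_odds_form(b0, b1, age)
p2 = p_from_negated_form(b0, b1, age)

# Taking the log-odds of p recovers the linear predictor b0 + b1*age.
log_odds = math.log(p1 / (1 - p1))
print(p1, p2, log_odds)
```

Both functions return the same probability, and the log-odds of that probability equals the linear predictor, which is why any of the versions may be written as "the model".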

The Minitab output from fitting the binary logistic regression model is shown on the
next page; use the output for the rest of scenario 1:


1 (b) Based on the Minitab output, state the estimates of the parameters in the model
(indicate which is which).

The estimate of $\beta_0$ is $\hat\beta_0 = 1.659$

The estimate of $\beta_1$ is $\hat\beta_1 = -0.02846$


1 (c) What do the goodness-of-fit tests in the output suggest about whether the binary
logistic regression model is appropriate for this data? Briefly say how you're
making your conclusion.
The goodness-of-fit tests indicate that the model is appropriate,
because the p-values are large enough (> 0.05) to retain the null
hypothesis that the model is a good fit.

The p-values were 0.067, 0.529, and 0.290, for various types of
goodness-of-fit tests.

1 (d) [Testing significance of the predictor]


If $\beta_1 = 0$ then the coefficient of the X (age) term is actually zero, and then age does not actually help
in predicting whether or not someone has dengue fever. From the output, give the
z score and associated p-value for the hypothesis test:

$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

and then briefly state the conclusion to the hypothesis test.

For the significance test indicated,

the z score is -3.39 (negative, since the estimated coefficient of age is negative),
and the p-value is 0.001.

Since the p-value is less than 0.05,
we reject the null hypothesis and conclude Age is a significant
predictor of whether or not someone has dengue fever,
or equivalently, we conclude $\beta_1 \neq 0$, based on the simple binary
logistic model.
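The reported p-value can be reproduced from the z score, assuming the test is the usual two-sided test based on the standard normal distribution. A minimal Python sketch using only the standard library (the function name is illustrative):

```python
import math

def two_sided_p_value(z):
    """Two-sided p-value for a standard normal test statistic.

    For Z ~ N(0, 1), P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    """
    return math.erfc(abs(z) / math.sqrt(2))

z = 3.39  # magnitude of the reported z score; only |z| matters here
p_value = two_sided_p_value(z)
print(round(p_value, 3))
```

The result is well below 0.05 and rounds to the 0.001 quoted from the output.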

1 (e) Using the output on p. 3 as appropriate, state the odds ratio, and interpret it in
context.

The odds ratio is 0.9719.

The interpretation is:

We predict that, for each year of age (or equivalent wording) , the
odds of having dengue fever are multiplied by 0.9719, on average.
[Note that since this is only a model with one predictor (age) we do not
hold constant any other variable.]


1 (f) Based on the results, who is at greater risk for dengue fever, children or adults?

Children are more at risk for dengue fever based on the model
(since the odds decrease with age because the odds are multiplied
by less than 1)

1 (g) Using the output on p. 3 as appropriate, briefly show how the odds ratio can be
computed from one of the parameter estimates.

$e^{\hat\beta_1}$, i.e., $e^{-0.02846} \approx 0.9719$
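As a quick check, the odds ratio can be computed by exponentiating the slope estimate. This sketch assumes the estimate is -0.02846; the sign must be negative for the odds ratio to fall below 1. The ten-year figure is an added illustration, not part of the exam answer.

```python
import math

b1_hat = -0.02846  # estimated coefficient of Age (assumed negative sign)

# Odds ratio for a one-year increase in age.
odds_ratio = math.exp(b1_hat)
print(round(odds_ratio, 4))

# The same logic extends to a ten-year increase: multiply the odds
# by the one-year odds ratio ten times, i.e. exp(10 * b1_hat).
odds_ratio_10_years = math.exp(10 * b1_hat)
print(odds_ratio_10_years)
```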

1 (f) In the output on p. 3, under "Odds Ratios for Continuous Predictors", is a 95%
confidence interval. Briefly explain how this interval confirms the results of the
hypothesis test in question 1d.

Since the confidence interval does not include 1, we are confident
the odds ratio is not 1, i.e., we are confident $e^{\beta_1} \neq 1$, which is
equivalent to $\beta_1 \neq 0$ [because $e^0 = 1$].

(The relevant part of the output looked as follows:)

Odds Ratios for Continuous Predictors

       Odds Ratio    95% CI
AGE    0.9719        (0.9561, 0.9880)
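One way to see the link to the hypothesis test is to take logs of the interval's endpoints, which gives the corresponding interval for the slope itself. A minimal Python sketch using the endpoints from the output (variable names are illustrative):

```python
import math

# 95% confidence interval for the odds ratio, from the output.
or_low, or_high = 0.9561, 0.9880

# Taking natural logs converts it to an interval for the slope beta_1.
b1_low, b1_high = math.log(or_low), math.log(or_high)

# The odds-ratio interval excludes 1 exactly when the slope interval excludes 0.
excludes_one = not (or_low <= 1 <= or_high)
excludes_zero = not (b1_low <= 0 <= b1_high)
print(b1_low, b1_high, excludes_one, excludes_zero)
```

Both endpoints of the slope interval are negative, so 0 is excluded, which matches rejecting the null hypothesis in 1d.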

1 (g) Using the output on p. 3 as appropriate, compute $e^{\hat\beta_0}$ and interpret it in context.

$e^{\hat\beta_0} = e^{1.659} = 5.254$

The interpretation is:

We predict that, at age zero, the odds of having dengue fever are
5.254 on average.

[Note that since this is only a model with one predictor (age) we do
not hold constant any other variable.]
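The baseline odds can also be converted to a probability, which some readers find easier to interpret. A short Python sketch; the conversion step is an added illustration and was not asked for in the question.

```python
import math

b0_hat = 1.659  # estimated intercept from the output

# Odds of dengue fever at age zero.
odds_at_zero = math.exp(b0_hat)

# Convert odds to a probability: p = odds / (1 + odds).
p_at_zero = odds_at_zero / (1 + odds_at_zero)
print(round(odds_at_zero, 3), round(p_at_zero, 2))
```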


1 (f) [Predicting probability]


Write the equation for p = P (Y = 1) , the equation for predicting, based on age, the
probability that someone would have dengue fever. Then compute the estimated
probability that someone who is 18 years old (i.e., with X = 18) would have
dengue fever; and then do the same for someone who is 37 years old.

$p = \frac{e^{1.659 - 0.02846\,\text{Age}}}{1 + e^{1.659 - 0.02846\,\text{Age}}}$ [this is how Minitab gives it]

or equivalently $p = \frac{1}{1 + e^{-(1.659 - 0.02846\,\text{Age})}}$ [note the extra negative sign in the exponent].

For someone with X = 18:

$p = \frac{e^{1.659 - 0.02846(18)}}{1 + e^{1.659 - 0.02846(18)}} = 0.7589$ (or approx. 0.76) for someone at age 18

For someone with X = 37:

$p = \frac{e^{1.659 - 0.02846(37)}}{1 + e^{1.659 - 0.02846(37)}} = 0.647$ (or approx. 0.65) for someone at age 37
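The two predicted probabilities can be verified with a few lines of Python. This sketch assumes the fitted equation has log-odds 1.659 - 0.02846 Age; the function name is illustrative.

```python
import math

B0_HAT = 1.659      # estimated intercept
B1_HAT = -0.02846   # estimated coefficient of Age

def predicted_probability(age):
    """Estimated probability of dengue fever at a given age."""
    log_odds = B0_HAT + B1_HAT * age
    return 1 / (1 + math.exp(-log_odds))

p18 = predicted_probability(18)
p37 = predicted_probability(37)
print(round(p18, 4), round(p37, 3))
```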

1 (g) In the data is someone who is 18 years old who didn't have dengue fever, and
someone who is 37 years old who did have dengue fever. Based on the estimated
probabilities computed in part f, say whether this pair is concordant or discordant,
and briefly say why.

Discordant, because the person who did have dengue fever had the
lower estimated probability of doing so compared to the other
person of that pair (the 18 year old had p = 0.76 of fever but didn't
have it, while the 37 year old had only p = 0.65 of fever but did have
it).
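The rule for classifying a pair can be written out explicitly. A minimal Python sketch, assuming the usual convention that a pair consists of one subject with the event and one without; the function name and the "tied" label are illustrative.

```python
def classify_pair(p_event, p_nonevent):
    """Classify a pair made of one subject with the event and one without.

    p_event:    estimated probability for the subject who had the event
    p_nonevent: estimated probability for the subject who did not
    """
    if p_event > p_nonevent:
        return "concordant"
    if p_event < p_nonevent:
        return "discordant"
    return "tied"

# The 37 year old had dengue fever (p = 0.65); the 18 year old did not (p = 0.76).
result = classify_pair(p_event=0.65, p_nonevent=0.76)
print(result)
```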


Case Study Scenario 2: Predicting brand preference

Assume that there are three brands of a certain product (A, B, and C). You, the
statistician, were hired by the manufacturer of brand C, who noticed a significant drop in
sales, and who would like to learn more about what kind of buyers prefer the other two
brands. Information was gathered from 150 subjects who bought the product, and the
following variables (among others) were recorded:

Age of the customer (in years);


Gender of the customer (1=Female, 0=Male);
Brand: Which of the three brands the customer ended up choosing (A, B, or C).

2 (a) Which variables are the explanatory variables, and which is the response variable?
Explanatory: Age, and Gender

Response: Brand.

2 (b) Let $p_A$, $p_B$, and $p_C$ denote the probabilities that a buyer chooses brand A,
B, or C respectively; and assume that brand C is the reference category
(which makes sense in the context of the story).

Write down the nominal logistic regression model.


How many parameters are in the model?

The model consists of a set of two relationships, as follows:

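$\ln\left(\frac{p_A}{p_C}\right) = \beta_0^{(A)} + \beta_1^{(A)}\,\text{Age} + \beta_2^{(A)}\,\text{Gender}$

$\ln\left(\frac{p_B}{p_C}\right) = \beta_0^{(B)} + \beta_1^{(B)}\,\text{Age} + \beta_2^{(B)}\,\text{Gender}$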

There are six parameters in the model.

Note that the betas are specific to their respective relationship, as indicated by the superscript
notation, where the betas for the first relationship are indicated with (A) since that equation is the
log odds favoring brand A over the reference brand, and the betas for the second equation are
indicated with (B) since that relationship is the log odds favoring brand B over the reference
brand.

The superscripts are not exponents, they are notation to tag the betas for each relationship; hence
the need for the parentheses around the superscripts.

The nominal logistic regression model was fit to the data, and the Minitab output is
provided. Use the output as needed to answer the questions which follow.


2 (c) Based on the Minitab output, write down the estimated logistic regression
equations, as follows:

$\ln\left(\frac{p_A}{p_C}\right) = -10.73 + 0.266\,\text{Age} - 0.418\,\text{Gender}$

$\ln\left(\frac{p_B}{p_C}\right) = 0.784 - 0.0045\,\text{Age} - 1.03\,\text{Gender}$


2 (d) From the output on the previous page, say which variables are significant
in each of the two binary logistic regressions, and then quantify their
effect by interpreting their odds ratios.

For favoring A over C:

Age is significant (p-value = 0.000; z-score = 5.67);


The odds ratio was 1.30, which means that, controlling for gender (or, holding gender constant), we estimate that
for each additional year of age, the odds of preferring brand A over brand C are multiplied by 1.30, on average.

For favoring B over C:

Gender is significant (p-value = 0.015; z-score = -2.42);


The odds ratio was 0.36, which means that, controlling for age (or, holding age constant), we estimate that the
odds that Females prefer brand B over brand C are 0.36 times the odds that Males prefer brand B over brand C,
on average.
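The two odds ratios quoted above can be checked by exponentiating the corresponding coefficients. A minimal Python sketch, assuming the fitted coefficients are 0.266 (Age, A versus C) and -1.03 (Gender, B versus C); the signs are inferred from the reported odds ratios.

```python
import math

# Assumed coefficients from the two fitted equations.
b_age_a_vs_c = 0.266      # Age, log odds of A over C
b_gender_b_vs_c = -1.03   # Gender, log odds of B over C

or_age_a = math.exp(b_age_a_vs_c)
or_gender_b = math.exp(b_gender_b_vs_c)
print(round(or_age_a, 2), round(or_gender_b, 2))
```

An odds ratio above 1 (Age, A over C) means the odds rise with the predictor; one below 1 (Gender, B over C) means the odds are lower for the group coded 1.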

2 (e) Suppose there is a 37 year old female. Compute her estimated
probabilities of preferring each of the three brands. [Hint: First, compute
the logistic regression equations for her.]

She's 37 years old (Age = 37) and Female (Gender = 1),
so, for her, the logistic equations give:

$\ln\left(\frac{p_A}{p_C}\right) = -10.73 + 0.266(37) - 0.418(1) = -1.306$

$\ln\left(\frac{p_B}{p_C}\right) = 0.784 - 0.0045(37) - 1.03(1) = -0.4125$

So then, for her, the predicted probabilities are:

$p_A = \frac{e^{-1.306}}{1 + e^{-1.306} + e^{-0.4125}} \approx 0.14$

$p_B = \frac{e^{-0.4125}}{1 + e^{-1.306} + e^{-0.4125}} \approx 0.34$

and then:

$p_C = 1 - [p_A + p_B] = 1 - [0.14 + 0.34] = 1 - 0.48 = 0.52$
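The whole calculation can be reproduced in Python. This sketch assumes the fitted equations are ln(p_A/p_C) = -10.73 + 0.266 Age - 0.418 Gender and ln(p_B/p_C) = 0.784 - 0.0045 Age - 1.03 Gender, with brand C as the reference category; the function name is illustrative.

```python
import math

def brand_probabilities(age, gender):
    """Predicted probabilities of preferring brands A, B, and C.

    gender: 1 = Female, 0 = Male. Brand C is the reference category,
    so its linear predictor is 0 and it contributes e^0 = 1 to the denominator.
    """
    eta_a = -10.73 + 0.266 * age - 0.418 * gender
    eta_b = 0.784 - 0.0045 * age - 1.03 * gender
    denom = 1 + math.exp(eta_a) + math.exp(eta_b)
    return {
        "A": math.exp(eta_a) / denom,
        "B": math.exp(eta_b) / denom,
        "C": 1 / denom,
    }

probs = brand_probabilities(age=37, gender=1)
print({brand: round(p, 2) for brand, p in probs.items()})
```

Computing p_C directly as 1 over the denominator gives the same value as subtracting p_A and p_B from 1, up to rounding.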


Case Study Scenario 3: Predicting Grade in 36-202

The instructor used a random sample of students from a past year to investigate
what might predict grade in 36-202.

The response variable, Y, was grade, ordered as follows:

Y = A, B, C, D, or R (fail)

The predictors available were:

Gender of the student (1=Female, 0=Male);


SAT: The SAT score of the student.

For the purpose of the analysis, take A to be the first grade category and R to be
the last (reference) grade category.

3 (a) The model is based on probabilities associated with each grade category;
call them $p_A$, $p_B$, $p_C$, $p_D$, $p_R$. Given that A is taken to be the first grade
category and that R is the last (reference) grade category, what does $p_C$
mean?

$p_C$ means the probability of getting a C or better, i.e., the probability of getting C, B, or A.

[Remember that, for the ordinal model, the probabilities are cumulative, i.e., each p is the sum
of the probabilities of getting into that category or a previous one.]


3 (b)
Write down the ordinal logistic regression model.
How many parameters are in the model?

The model consists of a set of four relationships, as follows:

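$\ln\left(\frac{p_A}{1 - p_A}\right) = \beta_0^{(A)} + \beta_1\,\text{Gender} + \beta_2\,\text{SAT}$

$\ln\left(\frac{p_B}{1 - p_B}\right) = \beta_0^{(B)} + \beta_1\,\text{Gender} + \beta_2\,\text{SAT}$

$\ln\left(\frac{p_C}{1 - p_C}\right) = \beta_0^{(C)} + \beta_1\,\text{Gender} + \beta_2\,\text{SAT}$

$\ln\left(\frac{p_D}{1 - p_D}\right) = \beta_0^{(D)} + \beta_1\,\text{Gender} + \beta_2\,\text{SAT}$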

There are six parameters in the model.

Note that only the intercepts are tagged as unique to each relationship (indicated
by the superscripts, which are not exponents, hence the parentheses). The $\beta_1$ is
the same in all the relationships; similarly, the $\beta_2$ is the same in all the
relationships.

Remember that each relationship is for the log of the cumulative odds, since Y is
ordinal (so for instance, $\frac{p_C}{1 - p_C}$ is the odds of getting a C or better).

Note that there is no relationship for the log odds for grade R. [That is neither
necessary nor would it make any sense, since $p_R$ is the cumulative probability of
getting an R or better, so $p_R$ would be 1 and hence the corresponding odds
$\frac{p_R}{1 - p_R}$ would be undefined.]
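To make the cumulative structure concrete, the sketch below turns the four cumulative relationships into a probability for each individual grade. No fitted output is given for this scenario, so every coefficient value here is HYPOTHETICAL and chosen only for illustration; the function names are also illustrative.

```python
import math

def sigmoid(x):
    """Inverse of the log-odds: p = 1 / (1 + e^-x)."""
    return 1 / (1 + math.exp(-x))

# HYPOTHETICAL coefficients (not from any output): four intercepts, two shared slopes.
INTERCEPTS = {"A": -12.0, "B": -10.0, "C": -8.5, "D": -7.0}
B1_GENDER = 0.3
B2_SAT = 0.008

def cumulative_probabilities(gender, sat):
    """p_A, p_B, p_C, p_D as cumulative probabilities; p_R is always 1."""
    shared = B1_GENDER * gender + B2_SAT * sat
    cum = {grade: sigmoid(b0 + shared) for grade, b0 in INTERCEPTS.items()}
    cum["R"] = 1.0
    return cum

def category_probabilities(gender, sat):
    """Probability of each individual grade, by differencing the cumulative ones."""
    cum = cumulative_probabilities(gender, sat)
    probs, previous = {}, 0.0
    for grade in ["A", "B", "C", "D", "R"]:
        probs[grade] = cum[grade] - previous
        previous = cum[grade]
    return probs

probs = category_probabilities(gender=1, sat=1300)
print(probs)
```

The six quantities set at the top (four intercepts plus two slopes) correspond to the six parameters counted in the answer.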

3 (c) Is the following true or false? Say how you know based on the ordinal
logistic model described on the previous page:

The ordinal logistic regression model proposes that, holding gender
constant, each unit increase in SAT has the same predictive effect for
getting a C or better as it does for getting a B or better.

True, because it's the same $\beta_2$ coefficient of SAT in the 2nd and 3rd relationships (and in fact in all the
relationships, i.e., the model proposes that the SAT effect is the same for getting better grades,
regardless of which grade category we start with).
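This shared-slope property can be illustrated numerically. In the sketch below all coefficient values are HYPOTHETICAL (no fitted output is given for this scenario). It compares the effect of a one-point SAT increase on the cumulative odds of "B or better" and of "C or better".

```python
import math

# HYPOTHETICAL coefficients, for illustration only.
INTERCEPTS = {"B": -10.0, "C": -8.5}   # differ by relationship
B1_GENDER = 0.3                        # shared across relationships
B2_SAT = 0.008                         # shared across relationships

def cumulative_odds(grade, gender, sat):
    """Odds of getting the given grade or better."""
    return math.exp(INTERCEPTS[grade] + B1_GENDER * gender + B2_SAT * sat)

# Odds ratio for a one-point SAT increase, holding gender constant.
ratio_b = cumulative_odds("B", 1, 1301) / cumulative_odds("B", 1, 1300)
ratio_c = cumulative_odds("C", 1, 1301) / cumulative_odds("C", 1, 1300)
print(ratio_b, ratio_c, math.exp(B2_SAT))
```

The intercepts cancel in each ratio, so both ratios equal e raised to the shared SAT coefficient, whatever the starting SAT or grade threshold.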

[End of Practice Exam 3]
