Lab Section:
Materials: One standard (8.5 by 11) sheet of notes, front and back, and a calculator (but not the
one on your cell phone) are allowed.
Cell phone, and all other notes and books, must be stored out of reach and out of sight.
Do not share any materials during the exam.
Check that your exam has all its correct pages.
Good luck!
lectures 30-34 / lab 9
Mosquito-borne diseases are an ongoing concern, with the Zika virus only the most
recent to be in headlines. Past studies can help us understand the problem.
One mosquito-borne disease which has been a longtime problem is dengue fever. In this
scenario we will analyze part of an observational study conducted in Puerto Vallarta in
1984 to identify risk factors for dengue fever.
1 (a) Write the binary logistic regression model (you may use any of the versions seen
in lecture), and say how many parameters there are in the model.
The model is:
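$\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 \text{Age}$   (this is the log-odds version)
Other equivalent versions are:
$\frac{p}{1-p} = e^{\beta_0 + \beta_1 \text{Age}}$
or $p = \frac{e^{\beta_0 + \beta_1 \text{Age}}}{1 + e^{\beta_0 + \beta_1 \text{Age}}}$
or $p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \text{Age})}}$   [note the extra neg in the exponent]
There are two parameters in the model: $\beta_0$ and $\beta_1$.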
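The equivalence of the versions above can be checked numerically. The following is a minimal Python sketch; the parameter values and the age are hypothetical, chosen only for illustration, and the function names are not from the lecture.

```python
import math

def p_from_odds_form(b0, b1, age):
    """p = e^(b0 + b1*Age) / (1 + e^(b0 + b1*Age))."""
    eta = b0 + b1 * age
    return math.exp(eta) / (1 + math.exp(eta))

def p_from_negated_form(b0, b1, age):
    """p = 1 / (1 + e^-(b0 + b1*Age)); note the extra negative sign."""
    eta = b0 + b1 * age
    return 1 / (1 + math.exp(-eta))

# HYPOTHETICAL values, for illustration only (not the fitted estimates)
b0, b1, age = 0.5, -0.1, 20

p1 = p_from_odds_form(b0, b1, age)
p2 = p_from_negated_form(b0, b1, age)

# Taking the log-odds of p recovers the linear predictor b0 + b1*Age
log_odds = math.log(p1 / (1 - p1))

print(p1, p2, log_odds)
```

Both functions return the same probability, and the log-odds of that probability equals the linear predictor, which is why any of the versions is an acceptable answer.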
The Minitab output from fitting the binary logistic regression model is shown on the
next page; use the output for the rest of scenario 1:
1 (b) Based on the Minitab output, state the estimates of the parameters in the model
(indicate which is which).
1 (c) What do the goodness-of-fit tests in the output suggest about whether the binary
logistic regression model is appropriate for this data? Briefly say how you're
making your conclusion.
The goodness-of-fit tests indicate that the model is appropriate,
because the p-values are large enough (> 0.05) to retain the null
hypothesis that the model is a good fit.
The p-values were 0.067, 0.529, and 0.290, for various types of
goodness-of-fit tests.
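The decision rule used here can be written out as a short Python sketch. It assumes the usual 0.05 significance level; the variable names are illustrative only.

```python
ALPHA = 0.05  # assumed significance level

# p-values of the goodness-of-fit tests reported in the output
gof_p_values = [0.067, 0.529, 0.290]

# For each test, H0 is "the model is a good fit"; retain H0 when p > ALPHA
retain_h0 = [p > ALPHA for p in gof_p_values]

# The model is judged appropriate only if no test rejects H0
model_fits = all(retain_h0)

print(retain_h0, model_fits)
```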
$H_0: \beta_1 = 0$
$H_A: \beta_1 \neq 0$
1 (e) Using the output on p. 3 as appropriate, state the odds ratio, and interpret it in
context.
We predict that, for each additional year of age (or equivalent wording), the
odds of having dengue fever are multiplied by 0.9719, on average.
[Note that since this is only a model with one predictor (age) we do not
hold constant any other variable.]
1 (f) Based on the results, who is at greater risk for dengue fever, children or adults?
Children are more at risk for dengue fever based on the model
(since the odds decrease with age, because the odds are multiplied
by less than 1).
1 (g) Using the output on p. 3 as appropriate, briefly show how the odds ratio can be
computed from one of the parameter estimates.
$e^{\beta_1}$, i.e., $e^{-0.02846}$
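As a quick check of this computation, here is a minimal Python sketch that exponentiates the slope estimate. The ten-year comparison at the end is an added illustration, not part of the question.

```python
import math

b1 = -0.02846  # estimated coefficient of Age

# Odds ratio for a one-year increase in age
odds_ratio = math.exp(b1)
print(round(odds_ratio, 4))  # 0.9719

# Illustration: the multiplier compounds, so over ten years the odds
# are multiplied by e^(10 * b1), which equals odds_ratio ** 10
odds_ratio_10_years = math.exp(10 * b1)
print(odds_ratio_10_years)
```

Because the odds ratio is below 1, the odds of dengue fever shrink with each additional year of age.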
1 (f) In the output on p. 3, under "odds ratios for continuous predictors" is a 95%
confidence interval. Briefly explain how this interval confirms the results of the
hypothesis test in question 1d.
$e^{\beta_0} = e^{1.659} = 5.254$
We predict that, at age zero, the odds of having dengue fever are
5.254 on average.
[Note that since this is only a model with one predictor (age) we do
not hold constant any other variable.]
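The intercept calculation can be reproduced with a minimal Python sketch. Converting the odds to a probability in the last step is an added illustration, using p = odds / (1 + odds).

```python
import math

b0 = 1.659  # estimated intercept

# Odds of dengue fever at age zero
odds_at_zero = math.exp(b0)
print(round(odds_at_zero, 3))  # 5.254

# Illustration: the corresponding probability at age zero
prob_at_zero = odds_at_zero / (1 + odds_at_zero)
print(prob_at_zero)
```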
$p = \frac{e^{1.659 - 0.02846 \text{Age}}}{1 + e^{1.659 - 0.02846 \text{Age}}}$   [this is how Minitab gives it]
or equivalently $p = \frac{1}{1 + e^{-(1.659 - 0.02846 \text{Age})}}$   [note the extra neg in the exponent].
$p = \frac{e^{1.659 - 0.02846 (18)}}{1 + e^{1.659 - 0.02846 (18)}} = 0.7589$ (or approx 0.76) for someone at age 18
$p = \frac{e^{1.659 - 0.02846 (37)}}{1 + e^{1.659 - 0.02846 (37)}} = 0.647$ (or approx 0.65) for someone at age 37
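These two estimated probabilities can be reproduced with a minimal Python sketch of the fitted equation. The function name `prob_dengue` is illustrative only.

```python
import math

B0 = 1.659      # estimated intercept
B1 = -0.02846   # estimated coefficient of Age

def prob_dengue(age):
    """Estimated probability of dengue fever at a given age."""
    eta = B0 + B1 * age                        # estimated log-odds
    return math.exp(eta) / (1 + math.exp(eta))

p18 = prob_dengue(18)
p37 = prob_dengue(37)

print(round(p18, 4))  # 0.7589
print(round(p37, 3))  # 0.647
```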
1 (g) In the data is someone who is 18 years old who didn't have dengue fever; and
someone who is 37 years old who did have dengue fever. Based on the estimated
probabilities computed in part f, say whether this pair is concordant or discordant,
and briefly say why.
Discordant, because the person who did have dengue fever had the
lower estimated probability of doing so compared to the other
person of that pair (the 18 year old had p = 0.76 of fever but didn't
have it, while the 37 year old had only p = 0.65 of fever but did have
it).
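The rule applied here can be sketched in Python: a pair consisting of one person with the event and one without is concordant when the person with the event has the higher estimated probability. The function name and the "tied" label are illustrative assumptions.

```python
def classify_pair(p_with_event, p_without_event):
    """Classify a pair made of one person who had the event and one who did not."""
    if p_with_event > p_without_event:
        return "concordant"
    if p_with_event < p_without_event:
        return "discordant"
    return "tied"

# 37-year-old had dengue fever (p = 0.647); 18-year-old did not (p = 0.7589)
result = classify_pair(p_with_event=0.647, p_without_event=0.7589)
print(result)  # discordant
```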
Assume that there are three brands of a certain product (A, B, and C). You, the
statistician, were hired by the manufacturer of brand C, who noticed a significant drop in
sales, and who would like to learn more about what kind of buyers prefer the other two
brands. Information was gathered from 150 subjects who bought the product, and the
following variables (among others) were recorded:
2 (a) Which variables are the explanatory variables, and which is the response variable?
Explanatory: Age, and Gender
Response: Brand.
2 (b) Let $p_A$, $p_B$, and $p_C$ denote the probabilities that a buyer chooses brand A,
B, or C respectively; and assume that brand C is the reference category
(which makes sense in the context of the story).
Note that the betas are specific to their respective relationship, as indicated by the superscript
notation, where the betas for the first relationship are indicated with (A) since that equation is the
log odds favoring brand A over the reference brand, and the betas for the second equation are
indicated with (B) since that relationship is the log odds favoring brand B over the reference
brand.
The superscripts are not exponents; they are notation to tag the betas for each relationship; hence
the need for the parentheses around the superscripts.
The nominal logistic regression model was fit to the data, and the Minitab output is
provided. Use the output as needed to answer the questions which follow.
2 (c) Based on the Minitab output, write down the estimated logistic regression
equations, as follows:
$\ln\left(\frac{p_A}{p_C}\right) = -10.73 + 0.266 \text{Age} - 0.418 \text{Gender}$
$\ln\left(\frac{p_B}{p_C}\right) = 0.784 - 0.0045 \text{Age} - 1.03 \text{Gender}$
2 (d) From the output on the previous page, say which variables are significant
in each of the two binary logistic regressions, and then quantify their
effect by interpreting their odds ratios.
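$\ln\left(\frac{p_A}{p_C}\right) = -10.73 + 0.266(37) - 0.418(1) = -1.306$
$\ln\left(\frac{p_B}{p_C}\right) = 0.784 - 0.0045(37) - 1.03(1) = -0.4125$
$p_A = \frac{e^{-1.306}}{1 + e^{-1.306} + e^{-0.4125}} = 0.14$
$p_B = \frac{e^{-0.4125}}{1 + e^{-1.306} + e^{-0.4125}} \approx 0.34$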
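The calculation above can be reproduced with a minimal Python sketch. The signs of the coefficients are taken so that the arithmetic reproduces the log-odds values -1.306 and -0.4125 used above; the function name is illustrative, and the probability for the reference brand C follows from the same denominator.

```python
import math

def brand_probabilities(age, gender):
    """Estimated brand probabilities, with brand C as the reference category."""
    log_odds_a = -10.73 + 0.266 * age - 0.418 * gender   # ln(pA / pC)
    log_odds_b = 0.784 - 0.0045 * age - 1.03 * gender    # ln(pB / pC)
    denom = 1 + math.exp(log_odds_a) + math.exp(log_odds_b)
    return {
        "A": math.exp(log_odds_a) / denom,
        "B": math.exp(log_odds_b) / denom,
        "C": 1 / denom,  # reference category
    }

# Buyer with Age = 37 and Gender = 1, as in the calculation above
probs = brand_probabilities(age=37, gender=1)
print({brand: round(p, 2) for brand, p in probs.items()})
```

The three probabilities sum to 1 by construction, since they share one denominator.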
and then:
The instructor used a random sample of students from a past year to investigate
what might predict grade in 36-202.
Y = { A, B, C, D, R (fail) }
For the purpose of the analysis, take A to be the first grade category and R to be
the last (reference) grade category.
3 (a) The model is based on probabilities associated with each grade category;
call them $p_A$, $p_B$, $p_C$, $p_D$, $p_R$. Given that A is taken to be the first grade
category and that R is the last (reference) grade category, what does $p_C$
mean?
$p_C$ is the probability of getting a C or better (i.e., an A, B, or C).
[Remember that, for the ordinal model, the probabilities are cumulative, i.e., each p is the sum
of the probabilities of getting into that category or any previous category.]
3 (b)
Write down the ordinal logistic regression model.
How many parameters are in the model?
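$\ln\left(\frac{p_A}{1 - p_A}\right) = \beta_0^{(A)} + \beta_1 \text{Gender} + \beta_2 \text{SAT}$
$\ln\left(\frac{p_B}{1 - p_B}\right) = \beta_0^{(B)} + \beta_1 \text{Gender} + \beta_2 \text{SAT}$
$\ln\left(\frac{p_C}{1 - p_C}\right) = \beta_0^{(C)} + \beta_1 \text{Gender} + \beta_2 \text{SAT}$
$\ln\left(\frac{p_D}{1 - p_D}\right) = \beta_0^{(D)} + \beta_1 \text{Gender} + \beta_2 \text{SAT}$
There are six parameters in the model: the four intercepts $\beta_0^{(A)}$, $\beta_0^{(B)}$, $\beta_0^{(C)}$, $\beta_0^{(D)}$, plus $\beta_1$ and $\beta_2$.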
Note that only the intercepts are tagged as unique to each relationship (indicated
by the superscripts, which are not exponents, hence the parentheses). The $\beta_1$ is
the same in all the relationships; similarly, the $\beta_2$ is the same in all the
relationships.
Remember that each relationship is for the log of the cumulative odds, since Y is
ordinal (so, for instance, $\frac{p_C}{1 - p_C}$ is the odds of getting a C or better).
Note that there is no relationship for the log odds for grade R. [That is neither
necessary nor would it make any sense, since $p_R$ is the cumulative probability of
getting an R or better, so $p_R$ would be 1 and hence the corresponding odds
$\frac{p_R}{1 - p_R}$ would be undefined.]
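To see how the cumulative probabilities relate to the probability of each individual grade, here is a minimal Python sketch. All coefficient values and the student's Gender and SAT values are HYPOTHETICAL, chosen only for illustration; no fitted estimates for this model are given in this section.

```python
import math

def cumulative_prob(intercept, b1, b2, gender, sat):
    """Cumulative probability of this grade or better, from the log cumulative odds."""
    eta = intercept + b1 * gender + b2 * sat
    return 1 / (1 + math.exp(-eta))

# HYPOTHETICAL coefficients: one intercept per relationship, shared slopes
intercepts = {"A": -8.0, "B": -6.5, "C": -5.0, "D": -4.0}
b1, b2 = 0.3, 0.005

# HYPOTHETICAL student
gender, sat = 1, 1200

cum = {g: cumulative_prob(a, b1, b2, gender, sat) for g, a in intercepts.items()}
cum["R"] = 1.0  # "R or better" covers every grade, so no equation is needed

# Each grade's own probability is the difference of successive cumulative ones
category = {}
previous = 0.0
for grade in ["A", "B", "C", "D", "R"]:
    category[grade] = cum[grade] - previous
    previous = cum[grade]

print(cum)
print(category)
```

The cumulative probability for R is fixed at 1, which is why the model has only four relationships for five grade categories.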
3 (c) Is the following true or false? Say how you know based on the ordinal
logistic model described on the previous page:
True, because it's the same $\beta_2$ coefficient of SAT in the 2nd and 3rd relationships (and in fact in all the
relationships, i.e., the model proposes that the SAT effect is the same for getting better grades,
regardless of which grade category we start with).