
PRACTICE QUESTIONS FOR

THE MIDTERM EXAM


Part A.
Midterm Exam 2013
Midterm Exam 2012
Midterm Exam 2011
Midterm Exam 2010
Questions, Answers and Explanations
Part B.
Past Years Midterm Exams
Questions, Answers and Explanations

Part A. Midterm Exam 2013


Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and
calculators are allowed, but not computers, cellphones or on-line connectivity.
MT2013: Question 1

A sampling of sampling questions

a) The Human Resources Department of a large university maintains records on its
faculty members. The table displays some of these data.
Payroll   Birth       Faculty    Years of     Teaching Rating   Salary
Number    date                   Employment   (1-5 scale)       Classification
1520      02/20/56    Law        25           4.1               IV
3210      05/13/62    Science    17           3.9               III
0342      10/12/69    Business   12           4.5               II
2988      02/27/78    Arts       4            4.3               I
Place an X in the space beside each variable that is best described as Quantitative.
__ Payroll Number
__ Birth date
__ Faculty
__ Years of Employment
__ Teaching Rating
__ Salary Classification
b) Which of the following is (are) based on cross-sectional data?
__A. Company quarterly profits
__B. Percentage of Canadian adults who work full-time
__C. Historical closing stock prices
__D. Yearly student enrolments
__E. Annual costs
c) Which of the following is (are) time series data?
__A. Number of employees in 2012
__B. This month's demand for an automotive part
__C. This quarter's sales of automobiles
__D. Weekly receipts at a clothing boutique
__E. Percentage of employees who are female
d) The administration of a large university wants to study the types of wellness programs
that would interest its employees. They plan to survey a random sample of employees.
Under consideration are several sampling plans. Beside each plan, write the number of
the matching sampling strategy. Choose from among:
1 = Simple Random Sampling
2 = Stratified Random Sampling
3 = Cluster Sampling
4 = Systematic Sampling
__ (i) There are five categories of employees (administration, faculty, professional staff,
clerical and maintenance). Randomly select ten individuals from each category.
__ (ii) Each employee has an ID number. Randomly select 50 numbers.
__ (iii) Randomly select a school within the university (e.g., Business School) and survey
all of the individuals (administration, faculty, professional staff, clerical and
maintenance) who work in that school.
__ (iv) The HR Department has an alphabetized list of newly hired employees (hired
within the last five years). After starting the process by randomly selecting an
employee from the list, every fifth name is chosen to be included in the sample.

e) A manufacturer of toys claims that less than 3% of his toys are defective. When 100
toys were drawn from one production run of 5,000 toys, 5% were found to be defective.
For each term on the left, select the matching answer from the list to the right, and write
the number in the blank.
___ Population        1  The 3% value
___ Sample            2  The 5% value
___ Sampling Frame    3  The 100 toys
___ Parameter         4  The 5,000 toys
___ Statistic         5  All toys produced

MT2013: Question 2

Could this label be called a phone tag?

A magazine that publishes product reviews conducted a survey of teenagers' preferences
for cell phones. Three brands of cell phone designed specifically with teens in mind were
the focus of the study. The table summarizes responses by brand and gender.
Cell Phone       Male   Female   Total
Call Me Maybe      55       87     142
Phone Fun XS       99      150     249
Black Keys II     196      113     309
Total             350      350     700


a) Which of the following charts would be appropriate for displaying the marginal
distribution of cell phone brand?
__A. Histogram
__B. Boxplot
__C. Bar Chart
__D. Line Graph
__E. Stem and Leaf Display
b) What percent of teenagers preferred Call Me Maybe?
__A. 50%
__B. 41%
__C. 25%
__D. 16%

__E. 20%

c) What percent of female teenagers preferred the Phone Fun XS?


__A. 43%
__B. 60%
__C. 21%
__D. 50%
__E. 16%
d) What percent of teenagers who preferred the Black Keys II were males?
__A. 63%
__B. 32%
__C. 16%
__D. 50%
__E. 41%
e) Which of the following statements is true?
__A. It appears that cell phone brand preference and gender are not related.
__B. It appears that cell phone brand preference and gender are not independent.
__C. It appears that cell phone brand preference and gender are independent.
__D. A scatterplot will be more informative here than a table.
__E. None of the above

MT2013: Question 3

Spring into these summary questions

a) You have a set of 30 numbers. The standard deviation from these numbers is reported
as zero. You can be certain that:
__A. Half of the numbers are above the mean
__B. All of the numbers in the set are zero
__C. All of the numbers in the set are equal
__D. The numbers are evenly spaced below and above the mean

b) Here is the five number summary of the hourly wages ($) for sales managers.
Min      Q1       Median   Q3       Max
20.94    37.64    44.77    49.24    67.11
(i) The shape of this distribution is best described as:
__A. Symmetric
__B. Skewed to the right
__C. Skewed to the left
__D. Not enough information to tell
(ii) The IQR for these data is: ______________
(iii) Compute the lower and upper inner fences:

Space for calculations:

Lower inner fence: ___________


Upper inner fence: ___________
(iv) Are there any outliers, as defined by the inner fences criterion?
__A. Yes, only on the left side of the distribution
__B. Yes, only on the right side of the distribution
__C. Yes, on both sides of the distribution
__D. No
(v) Suppose there had been an error and that the lowest hourly wage for sales managers
was $18.50 instead of $20.94. Indicate how this change would affect the
following summary statistics (increase, decrease, or stay about the same):
a. Mean      Decrease    Stay the Same    Increase
b. Median    Decrease    Stay the Same    Increase
c. Range     Decrease    Stay the Same    Increase
d. IQR       Decrease    Stay the Same    Increase

c) In a perfectly symmetrical distribution, which of the following statements is false?


__A. The distance from Q1 to Q2 is equal to the distance from Q2 to Q3
__B. The distance from the smallest observation to Q1 is the same as the distance
from Q3 to the largest observation
__C. The distance from the smallest observation to Q2 is the same as the distance
from Q2 to the largest observation
__D. The distance from Q1 to Q3 is half of the distance from the smallest to the
largest observation
d) Here is a stem plot of scores (out of 200) in a graduate finance course.
12 | 6 8
13 | 1 3 4 5 7 8
14 | 3 4 7
15 | 2 6
16 |
17 | 3
18 | 9
(i) How many students were in the course? _____
(ii) What was the maximum score?

_____

(iii) What is the median score?

_____

e) An office supply chain has stores in Toronto and Vancouver. One of these stores is to
be closed within the coming year, and to help make the decision, management reviews
sales data. Below are boxplots for monthly unit sales for both locations.

Which of the following statements is not correct?


__A. Monthly sales are higher in Toronto compared to Vancouver.
__B. The IQR for sales in Toronto is larger than that for Vancouver.
__C. Monthly sales are less variable in Vancouver compared to Toronto.
__D. Both distributions are fairly symmetric.
__E. Monthly sales are more variable in Vancouver compared to Toronto.

MT2013: Question 4

Time for relationship-building

a) A consumer research group investigating the relationship between the price of meat
(per kg) and the fat content (grams) gathered data that produced the following scatterplot.

(i) Which best describes the association between the price of meat and fat content?
__A. Negative, moderately strong
__B. Negative, weak
__C. Positive, strong
__D. Positive, weak
__E. No apparent association
(ii) If the point in the lower left hand corner ($2.00 per kilogram, 6 grams of fat) is
removed, the correlation would most likely
__A. remain the same
__B. become stronger negative
__C. become weaker negative
__D. become positive
__E. become zero
b) For each of the following pairs of variables, would you expect a large negative
correlation, a large positive correlation, or a small correlation? Circle your choices.
1. The age of a used car and its price     Large Neg.    Large Pos.    Small
2. The height and weight of a person       Large Neg.    Large Pos.    Small
3. The height and the IQ of a person       Large Neg.    Large Pos.    Small

c) For each of the following statements, about the correlation coefficient, r, decide
whether it is True or False. Circle your choices as appropriate.
1. r equals the proportion of times two variables lie on a straight line.    True    False
2. r will be +1.0 only if all the data lie exactly on a horizontal straight line.    True    False
3. r measures the fraction of outliers that appear in a scatterplot.    True    False
4. If the correlation between X and Y is r, the correlation between Y and X is -r.    True    False
5. r is a unitless number and must always lie between -1.0 and +1.0 inclusive.    True    False

MT2013: Question 5

If mistrust is the opposite of trust, would mistress be the opposite of stress?

A labour efficiency consultant collected some data on several employees of a
manufacturing operation: their stress levels (X, on a scale from 0 to 10) and the
productivity levels (Y, in parts made per hour). She only recorded some of the relevant
computations, as follows:
$\bar{x} = 5.4$, $s_x = 3.3$
$\bar{y} = 57.5$, $s_y = 11.1$
$b_1 = -3.19$, $s_e = 4.3$
a) Write the estimated regression equation here:

_____________________________
(Use two decimals only for each value)

b) Write the correlation coefficient here:

________ (Round to two decimals)

Space for work:

c) Complete this sentence: For each additional unit on the stress scale, the productivity
level _________________________________________ parts per hour.
d) What percentage of the variation in productivity levels can be explained by
the stress level variable? Give your answer here, to the nearest whole percent: _________

e) Estimate the productivity of an individual whose stress level is 8:

__________

(Round to nearest whole number)

f) Suppose the employee in part e) has an actual productivity level of 60 parts per hour.
Compute the residual and use the fact that the standard deviation of the residuals is 4.3 to
decide whether this data point would be considered an outlier. Explain why in one
sentence only.
Residual = ________

Outlier? Yes

No

Explanation:

g) Estimate the productivity of an individual whose stress level is unknown. __________

h) Give an interval range in which the productivity level of 95% of employees would be
expected to fall. Report to the nearest whole numbers. ____________ to ____________

MT2013: Question 6

Can you answer the call of the bell?

a) Which statistic(s) would you expect to have a normal distribution?


I. Height of women
II. Shoe sizes of men
III. Age (years) of first-year university students
__A. I & II only
__B. II & III only
__C. I & III only
__D. All three
__E. None of the three
b) The length of time taken by a statistics professor to solve The Globe & Mail cryptic
crossword has a normal distribution. It is known that the probability of needing more than
20 minutes is 0.5, while the probability of needing more than 30 minutes is 0.1587.
(i) Find the mean and the standard deviation of the professor's solving time.
Mean = ____________

SD = ______________

(ii) What is the probability that the solving time is between 15 and 25 minutes?
__A. 0.38

__B. 0.17

__C. 0.68

__D. 0.06

__E. 0.12

__F. 0.50

c) A soft drink machine dispenses a cup, syrup and carbonated water, hopefully in that
order! The amount of syrup injected is normally distributed with mean 15 ml and
variance 10 ml². The amount of water injected is normally distributed with mean 80 ml
and variance 15 ml². The two amounts are independent of one another.
(i) Find the mean and standard deviation of the total amount of syrup and water
dispensed.
Mean = ____________

SD = ______________

(ii) If 25 drinks are dispensed in a day, what are the mean and standard deviation of the
total amount of liquid (syrup and water) that are required?
Mean = ____________

SD = ______________

d) Suppose the time it takes for a purchasing agent to complete an online ordering
process is normally distributed with a mean of 8 minutes and a standard deviation of 2
minutes. Suppose a random sample of 25 ordering processes is selected.
(i) The standard deviation of the sampling distribution of mean times is
__A. 0.4 minutes
__B. 2 minutes
__C. 0.08 minutes
__D. 1.6 minutes
__E. 0.12 minutes
(ii) What is the probability that the sample mean will be less than 7.5 minutes?
__A. 0.3944
__B. 0.1056
__C. 0.2114
__D. 0.4013
__E. 0.8944

e) The mean height of male UBC students is 70 inches, with SD 3 inches. The mean
height of female UBC students is 65 inches, with SD 4 inches. You measure the heights
of random samples of 100 males and 100 females. Which result is the most unlikely? To
decide, compute the z-score for each result and write the values in the spaces provided.
__A. One randomly chosen male having a height of 79 inches or more
__B. One randomly chosen female having a height of 74 inches or more
__C. All females in your sample having an average height of 68 inches or more
__D. All males in your sample having an average height of 73 inches or more
z-score for A = _______

z-score for B = _______

z-score for C = _______

z-score for D = _______

Space for work:

MT2013: Question 7

Work with confidence!

a) EU (European Union) countries report that 46% of their labour force is female. Is the
percentage of females in the Canadian labour force the same? Statscan plans to check a
random sample selected from more than 10,000 employment records on file to estimate
the percentage of females in the Canadian labour force.
(i) Statscan wants to estimate the percentage of females in the Canadian labour force to
within 5% with 90% confidence. How many employment records should be sampled?
__A. 121
__B. 269
__C. 451
__D. 382
__E. 1000
(ii) Suppose that Statscan wants to be 90% confident of estimating the percentage of
females in the labour force to within 2% of the true percentage. Which of the following
would they have to do?
__A. Decrease the sample size
__B. Select the same number of employment records
__C. Increase the sample size
__D. Decrease the precision
__E. Increase the sampling error
(iii) They actually select a random sample of 525 employment records, and find that 229
of the people are females. The 90% confidence interval is closest to:
__A. 40.1% to 47.2%
__B. 27.5% to 59.7%
__C. 17.8% to 69.4%
__D. 42.4% to 56.8%
__E. 12.4% to 71.0%
b) For each of the following statements about a 95% confidence interval (CI) for the
mean, decide whether it is True or False. Circle your answers at the right.
1. Results from 95% of all samples will lie in this interval.    True    False
2. CIs are more informative than point estimates because they show how much the population parameters can vary.    True    False
3. The interval is wider than a 90% CI would be.    True    False
4. 95% of data values will fall in the range of a 95% CI for the mean.    True    False
5. We are 95% confident that the confidence interval includes the sample mean.    True    False
6. If we took many additional samples and computed a 95% CI for each, then approximately 95% of those intervals would contain the population mean.    True    False


MT2013: Question 8

Hypothetically speaking

Suppose that a report indicates that 28% of Canadians have experienced difficulty in
making mortgage payments. Further suppose that a news organization randomly sampled
400 Canadians from 10 cities and found that 136 reported such difficulty. Does this
indicate that the problem is more severe among these cities?
a) The correct null and alternative hypotheses are
__A. H0 : p = 0.28 and Ha : p > 0.28
__B. H0 : p = 0.28 and Ha : p < 0.28
__C. H0 : p = 0.28 and Ha : p ≠ 0.28
__D. H0 : p 0.28 and Ha : p = 0.28
__E. H0 : p > 0.28 and Ha : p = 0.28
b) The correct value of the test statistic is:
__A. -1.28
__B. -2.67
__C. 2.67
__D. 1.96
__E. 1.28

Space for work:

c) The P-value corresponding to this test statistic is:


__A. 0.025
__B. 0.2119
__C. 0.0177
__D. 0.0522
__E. 0.0038
d) At α = .05, we can conclude that the percentage of Canadians in these cities
experiencing difficulty making mortgage payments ...
__A. is significantly higher than 28%
__B. is significantly lower than 28%
__C. is not significantly different from 28%
__D. is equal to 28%
__E. is none of the above; no conclusion can be drawn with the given information.
e) Using the P-value in part c), which one of the following statements is true?
__A. A 90% confidence interval for p would contain 28%
__B. A 95% confidence interval for p would contain 28%
__C. A 95% confidence interval for p would not contain 28%
__D. None of the above
Part f) is unrelated to parts a) through e):
f) An opinion poll in a city of 200,000 was based on a simple random sample of 2000
people. Another poll is to be taken in the same way in a second city of population
400,000. In order for this poll to have the same margin of error as the poll in the first city,
the sample size in the second city should be:
__A. 1000
__B. 2000
__C. 4000
__D. 8000


MT2013: Question 9

No Surprise: A Statistics Test with a test statistic!

Insurance companies track life expectancy information to assist in determining the cost of
life insurance policies. Last year the average life expectancy of all policyholders was 77
years. ABI Insurance wants to determine if their clients now have a longer life
expectancy, on average, so they randomly sample some of their recently paid policies.
The insurance company will only change their premium structure if there is evidence that
people who buy their policies are living longer than before. The sample has a mean of
78.6 years and a standard deviation of 4.48 years.
86 75 83 84 81 77 78 79 79 81
76 85 70 76 79 81 73 74 72 83
a) The appropriate null and alternative hypotheses are:
H0: _________________

Ha: _________________

b) Give the formula for the appropriate test statistic and compute its value.
Formula: ________________

Computed value: ______________

Space for work:

c) The corresponding P-value is:


__A. Greater than 0.20
__B. Between 0.10 and 0.20
__C. Between 0.05 and 0.10
__D. Between 0.025 and 0.05
__E. Between 0.01 and 0.025
__F. Less than 0.01
d) State your conclusion using α = .05. Write one statistically and grammatically correct
sentence that tells ABI Insurance whether there is evidence to increase their premiums.

e) Suppose ABI randomly samples 100 recently paid policies. This sample yields a mean
of 77.7 years and a standard deviation of 3.6 years. Compute a 95% confidence interval.
Report it in the format [xx.x , xx.x] with one decimal place. [_________ , _________]

MT2013 END OF QUESTIONS; ANSWERS AND EXPLANATIONS FOLLOW


MT2013: ANSWERS AND EXPLANATIONS


MT2013: Answer 1
a) Years of Employment, Teaching Rating b) B. c) D. d) 2,1,3,4.
e) 5,3,4,1,2. Population = All toys produced; Sample = 100 toys; Sampling Frame =
5,000 toys; Parameter = 3%; Statistic = 5%
Details and Comments:
a) Years of Employment has units (yrs); Teaching Rating does not have units but the
rating is an average of ordinal data over a number of courses, and can range from 1 to 5
with fractional values possible.
b) Percentage of Canadian adults who work full-time is measured at one time point,
hence cross-sectional. The other variables are measured repeatedly over time, hence
longitudinal or time-series.
c) Only Weekly receipts at a clothing boutique is measured at more than one time
point. The other variables are measured once each.
d) (i) The five categories are strata; random samples are taken within each one.
(ii) Each employee has the same chance of being selected for the sample.
(iii) One school is a reasonable representative of the entire university, hence a cluster.
(iv) Choosing every fifth name makes it systematic.
e) The sampling frame is the production run, namely, that part of the population from
which the sample can be drawn.

MT2013: Answer 2
a) C. b) E. c) A. d) A. e) B.

Details and Comments:


a) Categorical data are displayed with a bar chart. Histograms, stem-and-leaf displays,
boxplots (and usually line graphs) are for quantitative data.
b) 20% (142/700)
c) 43% (150/350)
d) 63% (196/309)
e) The column percentages for males are different from those for females, which suggests
that cell phone brand preference and gender are related (i.e. not independent.)
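
The percentages in b) through d) each divide one cell or margin by a different total, which is where most mistakes happen. The following is a minimal Python sketch, assuming the counts are typed in by hand from the Question 2 table; the variable names are illustrative and not part of the exam.

```python
# Counts from the MT2013 Question 2 table: one entry per brand, split by gender.
counts = {
    "Call Me Maybe": {"Male": 55, "Female": 87},
    "Phone Fun XS": {"Male": 99, "Female": 150},
    "Black Keys II": {"Male": 196, "Female": 113},
}

grand_total = sum(sum(row.values()) for row in counts.values())
female_total = sum(row["Female"] for row in counts.values())

# b) Marginal percentage: all teenagers who preferred Call Me Maybe.
pct_call_me_maybe = 100 * sum(counts["Call Me Maybe"].values()) / grand_total

# c) Conditional on gender: females who preferred Phone Fun XS.
pct_female_phone_fun = 100 * counts["Phone Fun XS"]["Female"] / female_total

# d) Conditional on brand: Black Keys II respondents who are male.
black_keys_total = sum(counts["Black Keys II"].values())
pct_black_keys_male = 100 * counts["Black Keys II"]["Male"] / black_keys_total

print(round(pct_call_me_maybe), round(pct_female_phone_fun), round(pct_black_keys_male))
```

Note that c) and d) use different denominators (the female column total versus the Black Keys II row total), even though both are conditional percentages.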


MT2013: Answer 3
a) C.
b) (i) C. (ii) 11.6 (iii) Lower inner fence = 20.24 Upper inner fence = 66.64
(iv) B. (v) Decrease, Stay the same, Increase, Stay the same
c) D. d) 15, 189, 138 e) E.
Details and Comments:
a) Look at the formula for standard deviation. If all numbers are equal, then they are also
all equal to the mean, so all the deviations are zero. This is the only way the standard
deviation can be zero.
b) (i) The median is closer to Q3 than to Q1 so the distribution is skewed to the left.
(ii) IQR = Q3 - Q1 = 49.24 - 37.64 = 11.60
(iii) Lower inner fence = 37.64 - 1.5 × 11.6 = 20.24; Upper inner fence = 49.24 + 1.5 × 11.6 = 66.64
(iv) Yes, only on the right side of the distribution since the maximum exceeds 66.64.
(v) Decreasing the lowest data value decreases the sum, and hence the mean. But it
doesn't really affect which is the middle value or the quartiles. The range increases.
c) Quartiles divide the area of the distribution into four equal sections.
d) (i) Count up the number of data values. Don't forget to attach the leaf to the stem for
the maximum and median.
e) Monthly sales are less variable in Vancouver than in Toronto, since the Toronto box is
taller (larger IQR). Statement E says the opposite, so it is the one that is not correct.
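
The fence arithmetic in b) and the stem-plot readings in d) can be checked mechanically. This is a small sketch using only the Python standard library; the numbers are copied from Question 3 and the variable names are assumptions made for illustration.

```python
from statistics import median

# Five-number summary from Question 3 b).
minimum, q1, q3, maximum = 20.94, 37.64, 49.24, 67.11

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A value is flagged as an outlier when it falls outside the inner fences.
low_outlier = minimum < lower_fence
high_outlier = maximum > upper_fence

# Stem plot from Question 3 d), with each leaf attached to its stem.
scores = [126, 128, 131, 133, 134, 135, 137, 138,
          143, 144, 147, 152, 156, 173, 189]

print(round(iqr, 2), round(lower_fence, 2), round(upper_fence, 2))
print(low_outlier, high_outlier)
print(len(scores), max(scores), median(scores))
```

Only the minimum and maximum need to be compared with the fences here, because the five-number summary does not list the other observations.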

MT2013: Answer 4
a) (i) A. Negative, moderately strong (ii) B. Become stronger negative
b) Large Neg.; Large Pos.; Small
c) False, False, False, False, True

Details and Comments:


a) (i) Top left to bottom right is negative association.
(ii) Removing the lower left point reduces the scatter.
b) 1. The older the car, the lower the price. 2. The taller the person, the heavier the
person. 3. Height has no connection with IQ.
c) 1. Creative but completely wrong.
2. The points must lie exactly on a straight line with a positive slope.
3. Creative but also completely wrong.
4. Corr of X and Y = Corr of Y and X. The roles are interchangeable.
5. Two of the properties of r.


MT2013: Answer 5
a) $\hat{y} = 74.73 - 3.19x$ b) -0.95 c) decreases by 3.19
d) 90% e) 49
f) Residual = 11; Yes, it is an outlier since the residual is more than 2.5 $s_e$ away from 0.
g) 57.5 h) 35 to 80
Details and Comments:
a) $b_0 = \bar{y} - b_1\bar{x} = 57.5 - (-3.19)(5.4) = 74.73$
b) Rearrange the formula for the slope, $b_1 = r\,s_y/s_x$:
$r = b_1 (s_x/s_y) = (-3.19)(3.3/11.1) = -0.95$
c) Interpretation of slope.
d) $r^2 = (-0.95)^2 = 0.90$ or 90%
e) $\hat{y}(8) = 74.73 - 3.19(8) = 49.21$ (Round to 49)
f) Residual = $y - \hat{y}$ = 60 - 49 = 11, which is 11/4.3 ≈ 2.6 residual standard deviations
from 0; remember the 68-95-99.7 Rule for identifying outliers/unusual observations.
g) Since x is unknown, just use the mean of y.
h) Use the 68-95-99.7 Rule, i.e. $57.5 \pm 2(11.1)$, which gives 35.3 to 79.7
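
Every part of this answer follows from the six summary values given in the question. The sketch below recomputes them in Python; it assumes the slope is negative (as the answer key uses it) and the variable names are illustrative only.

```python
# Summary statistics from Question 5 (x = stress level, y = productivity).
x_bar, s_x = 5.4, 3.3
y_bar, s_y = 57.5, 11.1
slope = -3.19   # b1: change in parts per hour per unit of stress
s_e = 4.3       # standard deviation of the residuals

intercept = y_bar - slope * x_bar   # b0 = y_bar - b1 * x_bar
r = slope * s_x / s_y               # rearranged from b1 = r * s_y / s_x
r_squared = r ** 2


def predict(x):
    """Predicted productivity at stress level x."""
    return intercept + slope * x


y_hat = predict(8)
residual = 60 - round(y_hat)        # the answer key rounds y_hat before subtracting
standardized = residual / s_e       # how many residual SDs away from zero

# Rough 95% range for productivity when x is unknown: mean of y plus or minus 2 SDs.
low, high = y_bar - 2 * s_y, y_bar + 2 * s_y

print(round(intercept, 2), round(r, 2), round(r_squared, 2))
print(round(y_hat), residual, round(standardized, 2))
print(round(low, 1), round(high, 1))
```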

MT2013: Answer 6
a) A. I and II only
b) (i) Mean = 20; SD = 10
(ii) A.
c) (i) Mean = 95 ; SD = 5
(ii) Mean = 2375; SD = 25
d) (i) A. 0.4 minutes
(ii) B. 0.1056
e) D. z-scores: 3, 2.25, 7.5, 10;
D has the highest z-score and therefore is the most unlikely.
Details and Comments:
a) First-year students' ages will vary only slightly since most are within a year or two in
age. There might be some older students, i.e. those returning to school etc., but it is
highly unlikely to have students who are much younger than 18 or 19!
b) (i) Computations: Pr(Z > z) = 0.5 => z = 0, so $X = \mu + z\sigma$ => $20 = \mu + 0\cdot\sigma$ => $\mu = 20$
Pr(Z > z) = 0.1587 => z = 1, so $X = \mu + z\sigma$ => $30 = 20 + 1\cdot\sigma$ => $\sigma = 10$
(ii) Computations: Pr(15 < X < 25) = Pr([15-20]/10 < Z < [25-20]/10)
= Pr(-0.5 < Z < 0.5) = 1 - 2(0.3085) = 0.383
c) (i) Computations: E(X+Y) = E(X) + E(Y) = 15 + 80 = 95;
Var(X+Y) = Var(X) + Var(Y) (since indep.) = 10 + 15 = 25, so SD = $\sqrt{25}$ = 5
(ii) Computations: E(T) = 25(95) = 2375; Var(T) = 25(25) = 625; SD = $\sqrt{625}$ = 25
d) (i) $\sigma/\sqrt{n} = 2/\sqrt{25} = 0.4$
(ii) $\Pr(\bar{x} < 7.5)$ = Pr(Z < [7.5-8]/0.4) = Pr(Z < -1.25) = 0.1056
e) Computations:
z-score for A = [79-70]/3 = 3
z-score for B = [74-65]/4 = 2.25
z-score for C = [68-65]/[4/$\sqrt{100}$] = 7.5
z-score for D = [73-70]/[3/$\sqrt{100}$] = 10
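
For readers who want to check the normal-table lookups without a table, here is a hedged sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8). The variable names are assumptions made for illustration; the inputs are taken from Question 6.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

# b) (i) The z-value with upper-tail area 0.1587 is about 1, so sigma = (30 - 20) / z.
z_30 = std_normal.inv_cdf(1 - 0.1587)
sigma_est = (30 - 20) / z_30

# b) (ii) Probability that the solving time is between 15 and 25 minutes.
solve_time = NormalDist(mu=20, sigma=10)
p_between = solve_time.cdf(25) - solve_time.cdf(15)

# c) Independent syrup and water amounts: means add and variances add.
mean_drink = 15 + 80
sd_drink = sqrt(10 + 15)
mean_total = 25 * mean_drink
sd_total = sqrt(25 * (10 + 15))

# d) Sampling distribution of the mean of n = 25 ordering times.
se_mean = 2 / sqrt(25)
p_below = NormalDist(mu=8, sigma=se_mean).cdf(7.5)

# e) Single observations use the SD; sample means use SD / sqrt(n).
z_scores = {
    "A": (79 - 70) / 3,
    "B": (74 - 65) / 4,
    "C": (68 - 65) / (4 / sqrt(100)),
    "D": (73 - 70) / (3 / sqrt(100)),
}
most_unlikely = max(z_scores, key=z_scores.get)

print(round(sigma_est, 1), round(p_between, 3), round(p_below, 4), most_unlikely)
```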


MT2013: Answer 7
a) (i) B. 269 (ii) C. Increase the sample size
(iii) A. [ 40.1% , 47.2% ]
b) 1. False; 2. False; 3. True; 4. False; 5. False; 6. True
Details and Comments:
a) (i) $n = (1.645^2)(0.46)(0.54)/(0.05^2) = 269$
(ii) Look at the formula for the CI. The sample size is in the denominator of the margin of
error, so increasing the sample size decreases the margin of error.
(iii) $\hat{p} = 229/525 = 0.4362$;
90% CI: $0.4362 \pm 1.645\sqrt{(0.4362)(0.5638)/525}$
$= 0.4362 \pm 0.0356$ or [0.4006, 0.4718]
b) 1. The interval changes from sample to sample
2. Population parameters don't vary; sample statistics vary
3. Higher confidence requires wider intervals
4. CIs are not about individual data values; they are about estimates
5. All CIs for mean include the sample mean; only 95% include the population mean
6. Definition of a CI
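
The sample-size and interval calculations in part a) can be reproduced with a few lines of Python. This is a sketch under the usual large-sample normal approximation; the helper `sample_size` is a hypothetical name introduced here, not something from the exam.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Critical value for 90% confidence (about 1.645).
z_90 = NormalDist().inv_cdf(0.95)


def sample_size(p, margin, z):
    """Smallest n for which z * sqrt(p(1-p)/n) is at most the margin."""
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


# (i) Within 5%, using the EU value 0.46 as the planning proportion.
n_5pct = sample_size(0.46, 0.05, z_90)

# (ii) Tightening the margin to 2% requires a larger sample.
n_2pct = sample_size(0.46, 0.02, z_90)

# (iii) 90% confidence interval from 229 females in 525 records.
p_hat = 229 / 525
margin = z_90 * sqrt(p_hat * (1 - p_hat) / 525)
ci = (p_hat - margin, p_hat + margin)

print(n_5pct, n_2pct > n_5pct, round(ci[0], 3), round(ci[1], 3))
```

Rounding the sample size up rather than to the nearest integer is a design choice: it guarantees the margin of error is not exceeded.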
MT2013: Answer 8
a) A. H0: p = 0.28 and Ha: p > 0.28
b) C. 2.67
c) E. 0.0038 d) A. e) C. f) B. 2000

Details and Comments:


a) One-sided alternative since the question asks whether the problem is more severe.
b) $\hat{p} = 136/400 = 0.34$;
$z = (\hat{p} - p_0)/\sqrt{p_0(1-p_0)/n} = (0.34 - 0.28)/\sqrt{(0.28)(0.72)/400} = 2.67$.
c) The P-value is the area to the right of 2.67 on a standard normal curve.
d) Since the P-value is less than 0.05, the null hypothesis is rejected; the true population
proportion is significantly higher than 28%.
e) The two-sided P-value is 2(0.0038) = 0.0076, which is less than 0.05, so a two-tailed test
also rejects the null hypothesis; this is equivalent to 28% falling outside the usual
(two-sided) 95% confidence interval.
f) Sampling variability depends only on the sample size, as long as the population is large.
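
The test statistic and P-value for parts b) and c) can be verified as follows. This is a minimal sketch of a one-proportion z-test using only the standard library; variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

p0, n, successes = 0.28, 400, 136
p_hat = successes / n

# The standard error uses the null value p0, not the sample proportion.
standard_error = sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / standard_error

# Upper-tail area, because the alternative is Ha: p > 0.28.
p_value = 1 - NormalDist().cdf(z)
reject_at_5pct = p_value < 0.05

print(round(z, 2), round(p_value, 4), reject_at_5pct)
```

The software P-value uses the unrounded z, so it may differ from a table lookup at z = 2.67 in the last decimal place.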
MT2013: Answer 9
a) H0 : μ = 77 and Ha : μ > 77
b) Formula and computed value:
$t = (\bar{x} - \mu_0)/(s/\sqrt{n}) = (78.6 - 77)/(4.48/\sqrt{20}) = 1.597$

c) C. Between 0.05 and 0.10


d) There is not sufficient evidence that the mean length of life of people who buy their
policies is higher, so do not increase premiums.
e) [77.0 , 78.4]
Details and Comments:
a) One-sided alternative since the question asks whether policy-buyers are living longer
than before.
c) Use the t-table with 19 degrees of freedom
d) Since the P-value is greater than 0.05, do not reject the null hypothesis.
e) $77.7 \pm 1.984 \times 3.6/\sqrt{100} = 77.7 \pm 0.7$
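
The t statistic in b) can be recomputed directly from the 20 listed values, which also confirms the stated sample mean and standard deviation. The sketch below uses only the Python standard library; the critical value 1.984 is copied from the answer key's t-table lookup rather than computed, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# The 20 sampled values listed in Question 9.
ages = [86, 75, 83, 84, 81, 77, 78, 79, 79, 81,
        76, 85, 70, 76, 79, 81, 73, 74, 72, 83]

n = len(ages)
x_bar = mean(ages)
s = stdev(ages)                      # sample SD (divides by n - 1)
t_stat = (x_bar - 77) / (s / sqrt(n))
df = n - 1

# Part e): 95% interval for the second sample (n = 100, mean 77.7, SD 3.6).
t_crit = 1.984                       # taken from the answer key, not computed here
half_width = t_crit * 3.6 / sqrt(100)
ci = (77.7 - half_width, 77.7 + half_width)

print(n, round(x_bar, 1), round(s, 2), round(t_stat, 2), df)
print(round(ci[0], 1), round(ci[1], 1))
```

Because the code keeps the unrounded standard deviation, its t value can differ slightly in the third decimal from a hand calculation that uses 4.48.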
END OF ANSWERS AND EXPLANATIONS TO MIDTERM 2013

Midterm Exam 2012


Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and
calculators are allowed, but not computers, cellphones or on-line connectivity.
MT2012: Question 1

A sole practitioner

ASW, a regional shoe chain, has recently launched an online store. Sales via the Internet
have been sluggish compared to their brick and mortar stores, and management suspects
that its regular customers have concerns regarding the security of online transactions. To
determine if this is the case, they plan to survey a sample of their regular customers.
a) Suppose that ASW's regular customers belong to a rewards program and have a
customer rewards ID number. ASW decides to randomly select 100 numbers. This
sampling plan is called:
__A. Simple Random Sampling
__B. Stratified Sampling
__C. Cluster Sampling
__D. Systematic Sampling
__E. Convenience Sampling
b) Suppose that ASW has an alphabetized list of regular customers who belong to their
rewards program. After randomly selecting a customer on the list, every 25th customer
from that point on is chosen to be in the sample. This sampling plan is called:
__A. Simple Random Sampling
__B. Stratified Sampling
__C. Cluster Sampling
__D. Systematic Sampling
__E. Convenience Sampling
c) All regular ASW customers is known as the ________ of the study.
__A. Parameter
__B. Statistic
__C. Target Population
__D. Sampling Frame
__E. Sample
d) Which of the following is the parameter of interest in the ASW study?
__A. All regular ASW customers
__B. % of regular ASW customers who have concerns about online security
__C. ASW customers who belong to the rewards program
__D. % of ASW customers who belong to the rewards program but don't shop online
__E. None of the above
e) One member of the management team at ASW suggests that their survey could be
done online. Customers logging on to the online store would be asked to complete the
survey and offered a coupon as incentive to participate. Which statement is true?
__A. This is a voluntary response sample
__B. This would result in an unbiased random sample
__C. This would result in a biased sample
__D. Both A and B
__E. Both A and C

MT2012: Question 2

Planning ahead

A brokerage firm gathered information on how their clients were investing for retirement.
Here is a small sample of the data they collected.
Respondent   Age   Gender   Household   Self-directed   Book value of
Number                      Income      RRSP?           portfolio
1001         45    Male     $155,000    Yes             $750,000
1002         53    Female   $160,000    No              $500,000
1003         58    Female   $210,000    No              $1,000,000


a) Place an X in the space beside each variable that is best described as Quantitative.
__ Respondent Number
__ Age
__ Gender
__ Household Income
__ Self-directed RRSP
__ Book value of portfolio
Based on age, clients were categorized according to where the largest percentage of their
retirement portfolio was invested and shown in the table below.
               Age 50 or Younger   Over Age 50   Total
Mutual Funds          30               34          64
Stocks                37               45          82
Bonds                 19               23          42
Total                 86              102         188

b) The percentage of clients who are over age 50 and invest in mutual funds is:
__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%
c) Of the clients over age 50, the percentage who invest in mutual funds is:
__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%
d) Of the clients who invest in mutual funds, the percentage over age 50 is:
__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%
e) The percentage of clients over age 50 is:
__A. 53.1% __B. 33.3% __C. 18.1% __D. 34.0% __E. 54.3%

f) Consider the following side-by-side bar chart for the data above:
Does the chart indicate that mode of
investment is independent of age?
Yes

No

Explain in one short sentence only.


MT2012: Question 3

Mmm Marketing Manager Money

Here is a histogram and the five-number summary of salaries (in $) for a sample of 48
marketing managers.

Min      Q1       Median   Q3       Max
46360    69693    77020    91750    129420

a) The shape of this distribution is:


__A. Symmetric
__B. Bimodal
__C. Skewed to the right
__D. Skewed to the left
__E. Normal
b) Which of the following is true?
__A. Mode < Median < Mean
__B. Median < Mode < Mean
__C. Mean < Median < Mode
__D. Mean < Mode < Median
__E. All three are equal
c) Which of the following is closest to the standard deviation?
__A. $ 3,676
__B. $ 13,843
__C. $ 20,765
__D. $ 83,060
__E. Can't tell without the data
d) The IQR for these data is:
__A. $83,060
__B. $22,057
__C. $69,693
__D. $77,020
__E. $14,566


e) Compute the lower and upper inner fences:


Space for calculations:
Lower inner fence: ___________
Upper inner fence: ___________
f) Are there any outliers, as defined by the inner fences criterion?
__A. Yes, only on the left side of the distribution
__B. Yes, only on the right side of the distribution
__C. Yes, on both sides of the distribution
__D. No
g) Suppose the marketing manager who was earning $129,420 got a raise and is now
earning $140,000. Which of the following statements is true?
__A. The mean would increase
__B. The median would increase
__C. The range would stay the same
__D. The IQR would increase
__E. The IQR would decrease
The next two parts are not related to parts (a) through (g) above.
The boxplots below show monthly sales revenue figures ($ thousands) for a discount
office supply company with locations in three different regions of Canada (Atlantic,
Central and West).

h) Which of the following statements is true?


__A. Central has the lowest sales revenues
__B. Central has the lowest median sales revenue
__C. West has the lowest mean sales revenue
__D. West has the lowest median sales revenue
__E. Atlantic has the lowest mean sales.
i) Which of the following statements is false?
__A. West has the most variable sales revenues.
__B. West has the largest IQR.
__C. Central has the smallest IQR.
__D. Atlantic has the most variable sales revenues.
__E. Central has the least variable sales revenues.

MT2012: Question 4

OMG: A great place to work

To determine whether the cash bonus paid by a company is related to annual pay, data
were gathered for 10 account executives at Outstanding Management Group (OMG) who
received cash bonuses in 2007. The data and summary statistics are shown below.

                      ANNUAL PAY    CASH BONUS
                      $  70,609     $ 11,225
                      $  58,487     $  6,238
                      $ 104,561     $ 14,194
                      $  43,922     $  4,188
                      $  82,613     $ 11,863
                      $ 116,250     $ 13,671
                      $  76,751     $  7,758
                      $  68,513     $ 20,760
                      $ 137,000     $ 55,000
                      $  94,469     $ 34,368

Mean                  $  85,318     $ 17,927
Standard Deviation    $  28,077     $ 15,618

Correlation           0.735

a) What percentage of variability in cash bonuses can be explained by pay?

b) What would the correlation be if the Dollars were converted to Euros at the current
conversion rate of (1 Canadian Dollar = 0.76 Euros)?

c) Estimate the linear regression model that relates the response variable (cash bonus) to
the predictor variable (annual pay).
Slope of the regression line:

________________ (Report to three decimal places)

Intercept of the regression line: ________________ (Report to nearest whole number)


Equation of the linear model:

___________________________

Space for work:

d) From the equation, in part c), estimate the cash bonus for an executive at OMG earning
$82,613 a year, and compute the residual for this estimate.
Estimated cash bonus: ___________

Residual: ____________


e) Would you be confident in using your regression equation to estimate the cash bonus
for an executive at OMG earning $200,000 a year?
Yes

No

Reason:

f) Below is a plot showing residuals versus fitted values for the estimated regression
equation relating cash bonus to pay for the account executives at OMG.

Circle the conditions for linear regression which are violated, if any.
None are violated
Linearity
Normality
Constant Variance (Equal spread)
Independence
Parts (g) through (i) are unrelated to parts (a) through (f):
g) In commenting on the increase in home foreclosures (i.e. banks repossessing homes), a
news reporter stated there appears to be a strong correlation between home foreclosures
and job loss of the head of household. Comment on this statement; use one sentence
only.

h) A research study investigated the relationship between number of hours individuals
spend on the Internet and age. Which is the predictor variable? Circle your choice.
Hours on Internet

Age

i) The correlation associated with the following scatterplot is:

__A. 1.00
__B. -1.00
__C. 0.50
__D. -0.50
__E. 0.00


MT2012: Question 5

Greater attitude, greater latitude

The Survey of Study Habits and Attitudes (SSHA) is a psychological test that measures
academic motivation and study habits. Females score higher, on average, than males. The
distribution of SSHA scores among the female students at a university has mean 120 and
standard deviation 28; the distribution among male students has mean 105 and standard
deviation 35. Scores are normally distributed. Assume also that scores are independent.
a) What percentage of female students have SSHA scores greater than 162? Report your
percentage to one decimal place only.

b) What SSHA score is exceeded by only 10% of female students? Round your answer
to the nearest whole number.

c) Compute the lower and upper quartiles for the distribution of scores of female students.
Round your answers to the nearest whole numbers.

d) Suppose you select a single female student and a single male student at random and
give them the SSHA test. What are the mean and the standard deviation of the difference
(female minus male) between their scores. Report to one decimal place.
Mean = __________

Standard Deviation = ____________

e) Using your answers from part d), compute the probability that the chosen female has a
higher score than the chosen male.


f) Suppose Angelina (a female) scores 78 on the SSHA, while Brad (a male) scores 70 on
the SSHA. Use an appropriate calculation to determine who did worse compared to the
average for their gender. Circle the name of the person who did worse.
Angelina

Brad

Explanation:

MT2012: Question 6

A convenient truth

Part I. A convenience store owner suspects that only 10% of the customers buy
magazines and thinks that he might be able to sell something more profitable. In order to
decide whether he should stop selling them, he tracks the number of customers who buy
magazines on a given day.
a) On that day he had 300 customers. Assuming it was a typical day and that his estimate
is correct, what are the mean and standard deviation of the number of customers who buy
magazines each day? Report your answers to one decimal place.
Mean = ___________

Standard Deviation = ______________

b) What is the probability that 25 to 35 customers (inclusive) bought magazines that day?

c) How many magazine sales would you consider to be very strong evidence that his 10%
estimate was too low. That is, what number of sales would be extremely unusually high?
Hints: Use The Empirical (68-95-99.7) Rule. Remember to give a whole number answer.

Part II. Past records indicate that the magazines he sells on any day have an average
revenue of $150 with a standard deviation of $30. Suppose he takes a random sample of
36 past days' sales receipts and records the dollar value of magazine sales.
a) Describe the sampling distribution for the sample mean by naming the model and
telling its mean and standard deviation.

b) Suppose the resulting sample mean is $130. Do you think that this sample result is
unusually small? Explain.


MT2012: Question 7

Talk about confidence!

One division of a telecommunications equipment company reports that 12% of non-electrical
components are reworked. Management wants to determine if this percentage is
the same as the percentage rework for electrical components manufactured by the
company. The Quality Control Department plans to check a random sample of the over
10,000 electrical components manufactured across all divisions.
a) The Quality Control Department wants to estimate the true percentage of rework for
electrical components to within 4%, with 99% confidence. How many components
should they sample?
__A. 651
__B. 1000
__C. 344
__D. 438
__E. 579
b) They actually select a random sample of 450 electrical components and find that 46 of
those had to be reworked. The 99% confidence interval is closest to:
__A. [ 0.0654 , 0.1390 ]
__B. [ 0.0432 , 0.1608 ]
__C. [ 0.0763 , 0.1277 ]
__D. [ 0.0541 , 0.1499 ]
__E. Cannot be determined with the given information.
c) The 95% confidence interval based on these data is 0.0742 to 0.1302. Which one of
the following is the correct interpretation?
__A. The percentage of electronic components that are reworked is
between 7.4% and 13.0%.
__B. We are 95% confident that between 7.4% and 13.0% of electrical
components are reworked.
__C. The margin of error for the true percentage of electrical components
that are reworked is between 7.4% and 13.0%.
__D. All samples of size 450 will yield a percentage of reworked electrical
components that falls within 7.4% and 13.0%.
__E. There is a 95% chance that 7.4% to 13.0% of the electrical components
have to be reworked.
d) Based on the 95% confidence interval, should the Quality Control Department
conclude that the percentage of rework for the electrical components is lower than the
rate of 12% for non-electrical components?
__A. Yes, because the lower limit of the confidence interval is 7.4%.
__B. Yes, because 12% is contained within the 95% confidence interval.
__C. No, because 12% is contained within the 95% confidence interval.
__D. No, because the upper limit of the confidence interval is 13.0%.
__E. We cannot say since the sample size is not large enough.
e) All else being equal, increasing the level of confidence desired will...:
__A. ...tighten the confidence interval
__B. ...decrease the margin of error
__C. ...increase precision
__D. ...increase the margin of error
__E. ...increase the margin of error and tighten the confidence interval

MT2012: Question 8

A dip in chips

A company manufacturing computer chips finds that 8% of all chips manufactured are
defective. Management is concerned that high employee turnover is partially responsible
for the high defect rate. In an effort to decrease the percentage of defective chips,
management decides to provide additional training to those employees hired within the
last year. After training was implemented, a sample of 450 chips revealed only 27 with
defects. Was the additional training effective in lowering the defect rate?
a) The appropriate null and alternative hypotheses are:
H0: ______________

Ha: ______________

b) Give the formula for the appropriate test statistic and compute its value.

Test Statistic Formula: __________________


Computed value: ______________
Show your work:

c) Assume that the value of the test statistic is -1.4. Do not use your computed value from
part b). The P-value associated with the given test statistic is closest to:
__A. 0.0404
__B. 0.05
__C. 0.0808
__D. 0.1616
__E. 0.9192
d) From the P-value in part c), and using a 1% significance level (i.e. α = 0.01), which of
the following is true?
__A. Conclude that additional training significantly lowered the defect rate.
__B. Conclude that additional training did not significantly lower the defect rate.
__C. Conclude that additional training significantly increased the defect rate.
__D. Conclude that additional training did not affect the defect rate.
__E. No conclusion can be made with the given information.


MT2012: Question 9

The non-profit motive

A large software development firm recently relocated its facilities. Top management has
encouraged their professional employees to engage in local service activities. They
believe that the firm's professionals volunteer an average of more than 15 hours per
month. If this is not the case, they will institute an incentive program to increase it. A
random sample of 24 professionals reported the following number of hours:
12 13 14 14 15 15 15 16 16 16 16 16
17 17 17 18 18 18 19 19 19 20 20 22
The sample has a mean of 16.75 hours and a standard deviation of 2.40 hours.
a) The correct null and alternative hypotheses are:
__A. H0 : = 15 and Ha : > 15
__B. H0 : μ = 15 and Ha : μ > 15
__C. H0 : μ = 15 and Ha : μ < 15
__D. H0 : μ ≠ 15 and Ha : μ = 15
__E. H0 : μ = 15 and Ha : μ ≠ 15
b) The correct value of the test statistic is closest to:
__A. 3.572
__B. -3.572
__C. 1.327
__D. -1.327
__E. 0.729
c) Which of the following conclusions is correct?
__A. We reject the alternative hypothesis at the 5% significance level.
__B. We fail to reject the null hypothesis at the 5% significance level.
__C. An incentive program is needed since the evidence indicates professional
employees volunteer an average of no more than 15 hours per month.
__D. We reject the null hypothesis; the firm shouldn't need to institute an
incentive program since the evidence indicates that professional
employees volunteer an average of more than 15 hours per month.
__E. No conclusion can be reached about the hypothesis with the information
that is given.
d) It is appropriate to test the mean because:
__ A. The data are a simple random sample from the population of interest
__ B. The distribution of the sample data appears to be approximately normal
__ C. Volunteer hours is likely to be independent across employees
__ D. All of the above
e) A 95% confidence interval for the true mean number of hours of volunteer time is
closest to:
__A. 16.75 ± 1.016
__B. 16.75 ± 0.840
__C. 16.75 ± 4.966
__D. 16.75 ± 4.114
__E. 2.40 ± 7.074
MT2012 END OF QUESTIONS; ANSWERS AND EXPLANATIONS FOLLOW

MT2012: ANSWERS AND EXPLANATIONS


MT2012: Answer 1
a) A. b) D. c) C. d) B. e) E.

Details and Comments:


a) Each regular customer has the same chance of being selected for the sample.
b) Choosing every 25th customer makes it systematic.
c) The target population is the universe for which you want to be able to generalize.
d) A parameter is a numerical characteristic such as a mean or a proportion/percentage.
e) Since people can decide whether to answer or not, it is a voluntary response, and hence
subject to bias. People who decide to participate may not be like people who decide not to
participate.

MT2012: Answer 2
a) Age, Household Income, Book value of portfolio
b) C. 18.1% c) B. 33.3% d) A. 53.1% e) E. 54.3%
f) Yes: The age distribution (ratio of younger to older) is about the same for each mode
(i.e. type) of investment.
Details and Comments:
a) Age (yrs), Household Income ($), and Book Value ($) all have units and are measured
on a continuum, so they are quantitative.
b) 34/188 = 0.181
c) 34/102 = 0.333
d) 34/64 = 0.531
e) 102/188 = 0.543
f) Look for differences across the clusters of bars.

MT2012: Answer 3
a) C. Skewed to the right
b) A. Mode < Median < Mean
c) B. $ 13,843
d) B. $22,057
e) Lower inner fence = $36,607.50; Upper inner fence = $124,835.50
f) B. g) A. h) B. i) D.
Details and Comments:
a) Long right-hand tail: more of the area is piled up to the left.
b) The mode is the peak and it is clearly to the left of the median value of 77020. The
median is less than the mean for a right-skewed distribution.
c) Use the rule of thumb: s ≈ Range/6 = (129,420 − 46,360)/6 ≈ 13,843
d) IQR = Q3 − Q1 = 91,750 − 69,693 = 22,057
e) Lower inner fence = 69,693 − 1.5 × 22,057 = $36,607.50
Upper inner fence = 91,750 + 1.5 × 22,057 = $124,835.50
f) The maximum is larger than the upper fence but the minimum is not smaller than the
lower fence.
g) The sum is increased so the mean is increased.
h) The median is the line in the interior of the box.
i) Variability is shown by the length of the box.
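
The arithmetic in parts c) through f) can be checked with a short Python sketch. It uses only the five-number summary given in the question; the variable names are illustrative and not part of the exam.

```python
# Five-number summary from Question 3 (salaries in dollars).
minimum, q1, median, q3, maximum = 46_360, 69_693, 77_020, 91_750, 129_420

# Interquartile range and the 1.5 * IQR inner fences.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Rule of thumb for the standard deviation: range / 6.
sd_rule_of_thumb = (maximum - minimum) / 6

# Only the minimum and maximum are known, so this checks the extremes only.
outlier_on_left = minimum < lower_fence
outlier_on_right = maximum > upper_fence

print(iqr, lower_fence, upper_fence, round(sd_rule_of_thumb))
print(outlier_on_left, outlier_on_right)
```

Because the maximum exceeds the upper fence while the minimum stays above the lower fence, the check agrees with answer f) B.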

MT2012: Answer 4
a) r² = 0.735² = 0.5402 or 54%
b) Unchanged at 0.735
c) b1 = r(sy/sx) = 0.735(15,618/28,077) = 0.409;
b0 = ȳ − b1·x̄ = 17,927 − (0.409)(85,318) = -16,968;
ŷ = -16,968 + 0.409x
d) ŷ(82,613) = -16,968 + 0.409(82,613) = $16,821
Residual = 11,863 − 16,821 = -$4,958
e) No; a prediction at $200,000 requires extrapolation beyond the range of data.
f) Constant Variance (V-shape indicates violation of this assumption)
g) The two variables are categorical, not quantitative, so correlation is not appropriate.
h) Age
i) E. 0.00
Details and Comments:
a) This is the definition of r-squared.
b) The correlation coefficient has no units; it doesn't change if the measurement units
change.
c) Straightforward application of least squares regression line formulas.
d) Substitute the x-value into the regression equation to get the predicted y. The residual
is the observed y minus the predicted y.
h) Age precedes and therefore predicts Hours on Internet.
i) The best-fitting straight line is horizontal.
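
As a rough check of parts a), c) and d), the least squares line can be rebuilt from the summary statistics alone. This is a sketch in Python; the function name predict_bonus is hypothetical, and the slope is rounded to three decimals as the question requests.

```python
# Summary statistics from Question 4 (x = annual pay, y = cash bonus).
r = 0.735
mean_x, sd_x = 85_318, 28_077
mean_y, sd_y = 17_927, 15_618

# a) Share of variability in bonus explained by pay.
r_squared = r ** 2

# c) Least squares slope and intercept from r, the SDs and the means.
slope = round(r * sd_y / sd_x, 3)
intercept = mean_y - slope * mean_x


def predict_bonus(pay):
    """Predicted cash bonus for a given annual pay (illustrative helper)."""
    return intercept + slope * pay


# d) Prediction and residual for the executive earning $82,613,
#    whose observed bonus in the data table is $11,863.
predicted = predict_bonus(82_613)
residual = 11_863 - predicted

print(round(r_squared, 4), slope, round(intercept))
print(round(predicted), round(residual))
```

The residual is negative, so this executive's actual bonus was below what the line predicts for that pay.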

MT2012: Answer 5
a) 6.7%
b) 156
c) Q1 = 101; Q3 = 139
d) Mean = 15; SD = 44.8
e) 0.6293 or 0.63 or 63%
f) Angelina; Z-score for Angelina = -1.5; Z-score for Brad = -1.0;
Details and Comments:
a) Standardize the X-value; 162 is 1.5 SDs above the average. Find the area to the right of
1.5 on the Z-curve.
Pr(X > 162) = Pr(Z > [162 − 120]/28) = Pr(Z > 1.5) = 0.0668.
b) Find the value of Z that has an area of 10% to the right; then unstandardize.
z = 1.28; X = 120 + 1.28(28) =155.8.
c) Find z-values that have an area of 25% to the right and to the left; then
unstandardize. Since the Z is symmetric, the z-value on the left is the negative of the
z-value on the right.
Q1: z = -0.675; X = 120 + (-0.675)(28) = 101
Q3: z = 0.675; X = 120 + (0.675)(28) = 139
d) Mean = 120 − 105 = 15; SD = √(28² + 35²) = 44.8
e) Pr(F − M > 0) = Pr(Z > [0 − 15]/44.8) = Pr(Z > -0.33) = 0.6293 or 0.63 or 63%
f) Z-score for Angelina = (78 − 120)/28 = -1.5; Z-score for Brad = (70 − 105)/35 = -1.0;
Angelina did worse relative to the reference populations since her Z-score is more negative.
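
The normal calculations above can be reproduced without a Z table. The sketch below assumes Python's standard-library statistics.NormalDist; results differ slightly from the table-based answers because z is not rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist

# Score distributions stated in Question 5.
female = NormalDist(mu=120, sigma=28)
male = NormalDist(mu=105, sigma=35)

# a) Share of female students scoring above 162.
p_above_162 = 1 - female.cdf(162)

# b) Score exceeded by only 10% of female students (90th percentile).
top_10_cutoff = female.inv_cdf(0.90)

# c) Lower and upper quartiles for female students.
q1, q3 = female.inv_cdf(0.25), female.inv_cdf(0.75)

# d) Difference (female minus male) of two independent scores:
#    means subtract, variances add.
diff_mean = female.mean - male.mean
diff_sd = sqrt(female.stdev ** 2 + male.stdev ** 2)

# e) Probability that the difference is positive.
p_female_higher = 1 - NormalDist(diff_mean, diff_sd).cdf(0)

# f) Z-scores relative to each student's own group.
z_angelina = (78 - female.mean) / female.stdev
z_brad = (70 - male.mean) / male.stdev

print(round(p_above_162, 4), round(top_10_cutoff), round(q1), round(q3))
print(diff_mean, round(diff_sd, 1), round(p_female_higher, 2))
print(z_angelina, z_brad)
```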


MT2012: Answer 6
Part I.
a) Mean = np = 300 × 0.10 = 30.0; SD = √(np(1 − p)) = √(300 × 0.10 × 0.90) = 5.2
b) Pr(25 ≤ X ≤ 35) = Pr([25 − 30]/5.2 < Z < [35 − 30]/5.2) = Pr(-0.96 < Z < 0.96)
= 1 − 2(0.1685) = 0.663.
c) From the Empirical Rule, 3 SDs above the mean is extremely unusual;
μ + 3σ = 30 + 3(5.2) = 45.6. Sales of 46 or more would be extremely unusual.
Part II.
a) Normal: Mean = 150 and SD = 30/√36 = 5
b) Pr(x̄ < 130) = Pr(Z < [130 − 150]/5) = Pr(Z < -4) < 0.0001
There is an extremely small probability of getting a sample mean this small.
Details and Comments:
Part I.
a) Use the mean and standard deviation of a count.
b) Use the normal sampling distribution of a count. (Note: Continuity correction was not
needed, but if you used it correctly you would get an answer of 0.711.)
Part II.
a) Use the mean and standard deviation of a mean. (Note: The CLT applies here, but it is
not necessary to say this in the answer.)
b) Use the normal sampling distribution of a mean.
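
Both parts can be checked numerically. The following Python sketch uses the normal approximation described above via the standard-library statistics.NormalDist; the variable names are illustrative, and the small differences from the table-based answers come from rounding z.

```python
from math import ceil, sqrt
from statistics import NormalDist

# Part I: count of magazine buyers among n = 300 customers with p = 0.10.
n, p = 300, 0.10
mean_count = n * p
sd_count = sqrt(n * p * (1 - p))
count = NormalDist(mean_count, sd_count)

# b) Normal approximation, without and with the continuity correction.
p_25_to_35 = count.cdf(35) - count.cdf(25)
p_25_to_35_cc = count.cdf(35.5) - count.cdf(24.5)

# c) Smallest whole number of sales at least 3 SDs above the mean.
unusually_high = ceil(mean_count + 3 * sd_count)

# Part II: sampling distribution of the mean of 36 days' revenue.
xbar = NormalDist(mu=150, sigma=30 / sqrt(36))
p_mean_below_130 = xbar.cdf(130)

print(round(mean_count, 1), round(sd_count, 1))
print(round(p_25_to_35, 3), round(p_25_to_35_cc, 3), unusually_high)
print(xbar.stdev, p_mean_below_130)
```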

MT2012: Answer 7
a) D. b) A. c) B. d) C. e) D.

Details and Comments:


a) n = (2.576²)(0.12)(0.88)/(0.04²) = 438
b) p̂ = 46/450 = 0.1022
99% CI: 0.1022 ± 2.576 √((0.1022)(0.8978)/450) = 0.1022 ± 0.0368
c) Notice the wording and the use of the term "95% confident".
d) Values inside a confidence interval are likely values of the parameter. Evidence of a
change or a difference depends on the target value being outside the CI.
e) Examine the CI formula; a higher confidence level requires a larger multiplier/critical
value so the margin of error will be larger.
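
The sample size and both confidence intervals can be verified with a short Python sketch. The helper names z_critical and proportion_ci are hypothetical, and the critical values come from statistics.NormalDist rather than a printed table.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_critical(confidence):
    """Two-sided standard normal critical value, e.g. about 2.576 for 99%."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def proportion_ci(successes, n, confidence):
    """Large-sample confidence interval for a proportion."""
    p_hat = successes / n
    margin = z_critical(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


# a) Sample size for a 4% margin at 99% confidence, planning value p = 0.12.
n_required = ceil(z_critical(0.99) ** 2 * 0.12 * 0.88 / 0.04 ** 2)

# b) and c) Intervals from 46 reworked components out of 450.
low99, high99 = proportion_ci(46, 450, 0.99)
low95, high95 = proportion_ci(46, 450, 0.95)

print(n_required)
print(round(low99, 4), round(high99, 4))
print(round(low95, 4), round(high95, 4))
```

The 99% interval is wider than the 95% interval, which illustrates the point made in part e).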


MT2012: Answer 8
a) H0 : p = 0.08 and Ha : p < 0.08
b) Formula and computed value: p̂ = 27/450 = 0.06
z = (p̂ − p0)/√(p0(1 − p0)/n) = (0.06 − 0.08)/√((0.08)(0.92)/450) = -1.56
c) C. d) B.

Details and Comments:


a) One-tailed alternative since the question asks whether the training was effective in
lowering the defect rate.
b) Remember that the test statistic uses p0 in the denominator, not p̂ as in the
confidence interval.
c) Find the area to the left of -1.4 on the standard normal curve.
d) Since the P-value (0.0808) is not less than 0.01, the evidence is not statistically significant.
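
A minimal Python sketch of this one-proportion z-test, using the null value p0 = 0.08 in the standard error. The variable names are illustrative; the P-value for part c) uses the test statistic of -1.4 that the question supplies rather than the computed one.

```python
from math import sqrt
from statistics import NormalDist

# Data from Question 8: 27 defective chips in a sample of 450; H0: p = 0.08.
p0, n, defects = 0.08, 450, 27
p_hat = defects / n

# b) Test statistic: the standard error is computed under the null value p0.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# c) Lower-tail P-value for the test statistic of -1.4 given in the question.
p_value_given = NormalDist().cdf(-1.4)

# d) Decision at the 1% significance level.
alpha = 0.01
reject_null = p_value_given < alpha

print(round(z, 2), round(p_value_given, 4), reject_null)
```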

MT2012: Answer 9
a) B. b) A. c) D. d) D. e) A.
Details and Comments:
a) H0 : μ = 15 and Ha : μ > 15.
Use one-tailed alternative since the question is about increasing the volunteer time.
b) t = (16.75 − 15)/(2.40/√24) = 3.572

c) The P-value is much smaller than 0.05 so reject the null hypothesis. The volunteer time
is greater than 15 hours. So no incentive program is needed to get past 15 hours.
d) These are the assumptions/conditions for a one-sample t-test.
e) 16.75 ± 2.069 × 2.40/√24 ≈ 16.75 ± 1.014, which is closest to option A (16.75 ± 1.016)
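
The summary statistics, the t statistic and the confidence-interval margin can be recomputed from the 24 raw values. This Python sketch relies on the standard library only; the critical value 2.069 (23 degrees of freedom, 95% confidence) is taken from the answer above rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Volunteer hours reported by the 24 sampled professionals (Question 9).
hours = [12, 13, 14, 14, 15, 15, 15, 16, 16, 16, 16, 16,
         17, 17, 17, 18, 18, 18, 19, 19, 19, 20, 20, 22]

n = len(hours)
x_bar = mean(hours)
s = stdev(hours)            # sample standard deviation (n - 1 divisor)
std_error = s / sqrt(n)

# b) One-sample t statistic for H0: mu = 15.
t_stat = (x_bar - 15) / std_error

# e) Margin of error for a 95% interval; 2.069 is the t table value for 23 df.
t_critical = 2.069
margin = t_critical * std_error

print(n, x_bar, round(s, 2), round(t_stat, 3), round(margin, 2))
```

The computed margin comes out at about 1.01, so option A remains the closest choice.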
END OF ANSWERS AND EXPLANATIONS TO MIDTERM 2012


Midterm Exam 2011


Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and
calculators are allowed, but not computers, cellphones or on-line connectivity.
First things first

MT2011: Question 1

a) At the beginning of the term we asked all Commerce 291 students to complete our online survey. This survey was most likely designed to be:
__A. a random sample of all C291 students
__B. a census of all C291 students
__C. a random sample of business students
__D. a random sample of 2nd year UBC students
__E. all of the above
b) The survey asked a wide range of questions. For each variable, circle the description
which best describes the type of data the variable represents.
Ethnic background        Categorical    Quantitative    Identifier
Height                   Categorical    Quantitative    Identifier
C290 grade               Categorical    Quantitative    Identifier
# hrs online per day     Categorical    Quantitative    Identifier

c) From the survey results, we can estimate that, on average, students spent 15.2 hours
per week studying. This number seems high given that for a course load of 4 courses the
students spend 12 hours per week in the classroom and nearly half of the students
reported doing paid work. What is the most likely explanation?
__A. the data are very skewed and the median is a better numerical summary
__B. the data are bimodal, the two groups are those that work and those that don't
__C. women study more than men
__D. none of the above
d) Unfortunately, not every C291-registered student responded to the survey. If it were
true that students who didn't respond also spend less time studying, then our estimate of
study time from the survey is:
__A. a good estimate of average study time of C291 students
__B. biased above the true average study time of C291 students
__C. biased below the true average study time of C291 students
__D. not a good estimate for study time of C291 students but
we can't say whether it is too high or too low.
e) From the survey we find that the Commerce 290 Grade (call this variable, X) has a
symmetric, bell-shaped distribution. Also, 95% of the grades fall in the range 53 to 93.
Use that information to compute the mean and standard deviation of X. Report to at most
one decimal place.

Mean of X = _____

SD of X = _____

MT2011: Question 2
Stock answers are sufficient here
a) The following data are the price-to-earnings ratios (P/E ratio) for a random sample of
25 stocks traded on the NYSE. The data values have been sorted from smallest to largest.
Data: 4 8 11 11 12 13 13 14 14 15 16 17 17 17 19
21 22 22 24 24 26 28 33 35 39
The mean of these values is 19.0 and the standard deviation is 8.5.
i) Find the following:
Median = ______
Q1 = ______
Q3 = ______
IQR = ______
Inner fences = ________________
Outliers: = __________________ (If there are no outliers, write "None")

(Note: Outliers are defined using the inner fences criterion)

ii) Is the distribution symmetric or skewed? (Note: You do not have to draw a graph to
answer this.) Circle your choice. Then give your reason.
Symmetric

Skewed

Reason:

iii) Sketch a boxplot of these data. Use the version based on the five-number summary;
do not use the modified version using fences.

b) Determine whether each statement is true or false? Circle your choice. No explanation
is required.
1. If the mean and SD are equal for a measurement variable
that only takes positive values, the distribution is symmetric.     True   False
2. If the mean and median are equal, the distribution must
be normal.                                                          True   False
3. If the mean and median are equal, the mode must also
equal the mean and median.                                          True   False
4. The SD and IQR are always equal for a symmetric
distribution.                                                       True   False
5. The SD of a set of data values can never be zero.                True   False


MT2011: Question 3
To-fu or not to-fu, that is the question
Read the following survey design plan and then answer the questions after it.
Get Healthy, a producer of health foods conducts a survey of the Lower Mainland to
determine how receptive high school students would be to its TOFU BURGH product and
what market potential (sales) it could expect. It plans the survey as follows:
i. From the list of all schools in the area, two groups are defined, public and private high
schools, called PUBS and PRIS
ii. From the PUBS, four schools are chosen randomly.
iii. From the PRIS, one school is chosen randomly.
iv. In the PUBS schools selected, on one day, researchers give every fifteenth student to
exit the school a TOFU BURGH and a stamped, self-addressed postcard (like the one here).
v. In the PRIS school, researchers set up a stand outside the school and give a free
TOFU BURGH and the postcard to any student who comes to the stand.
a) The overall survey sampling design planned by the company can best be described as:
__A. convenience sampling
__B. multi-stage sampling
__C. stratified sampling
__D. simple random sampling
__E. cluster sampling
b) In the PUBS selected, the sampling design uses:
__ A. systematic sampling
__ B. voluntary response strategy
__ C. unacceptable bribery of students
__ D. anecdotal responses
c) In the PRIS selected, the sampling design uses:
__ A. systematic sampling
__ B. voluntary response strategy
__ C. unacceptable bribery of students
__ D. anecdotal responses
d) One parameter of interest is likely to be:
__ A. the total number of students who replied to the survey
__ B. the number of high school students in the Lower Mainland
__ C. the number of students who replied they would buy at least one
TOFU BURGH in a typical week
__ D. the proportion of students who replied they would buy at least one
TOFU BURGH in a typical week
e) Which of the two samples is likely to have non-response bias?
__ A. PUBS schools only
__ B. PRIS school only
__ C. Both PUBS and PRIS schools
__ D. Neither will have non-response bias

MT2011: Question 4

Unassociated questions about association: how ironic

Note: This question has three unrelated parts.


a) A business school conducted a survey of companies in its state. They mailed a
questionnaire to small, medium-sized, and large companies. The rate of non-response is
important in deciding how reliable survey results are. Here are the data on responses to
this survey.
               Small    Medium    Large
Response       375      160       40
No Response    225      240       160
Total          600      400       200
(i) What was the overall percent of non-response?

(ii) How is non-response related to the size of the business? Use percents to make your
statement precise.

b) Investment reports now often include correlations. Following a table of correlations
among mutual funds, a report adds, "Two funds can have perfect correlation, yet different
levels of risk. For example, Fund A and Fund B may be perfectly correlated, yet Fund A
moves 20% whenever Fund B moves 10%." Explain to someone who knows no statistics
how this can happen.

c) A study shows that there is a positive correlation between the size of a hospital
(measured by its number of beds, x) and the median number of days, y, that patients
remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a
small hospital? Explain your answer choice.
Yes

No

Reason:


MT2011: Question 5

Bart vs. Lisa does not refer to Simpson's Paradox

a) At a well-known business school the grade point averages (GPA) of its 1000
undergraduates are normally distributed with mean 2.84 and standard deviation 0.40.
(i) What percentage of the undergraduates have GPAs below 2.00 (i.e. on probation)?
Answer: ________

(ii) What GPA will be exceeded by only 20% of the student body?
Answer: ________

(iii) Compute the lower and upper quartiles, and the interquartile range for this
distribution.
Q1 = _______

Q3 = _______

IQR = ______

b) Bart scores 725 on the mathematics section of the Scholastic Aptitude Test (SAT). In a
reference population, SAT scores are normally distributed with mean 500 and standard
deviation 100. Lisa scores 33 on the American College Test (ACT) mathematics test;
ACT scores are normally distributed with mean 18 and standard deviation 6.
(i) What are the z-scores for each student?
Bart: _______

Lisa: _______

(ii) Circle either the name Bart or Lisa (above) based on who did better relative to the
reference populations.


MT2011: Question 6

Strength in numbers; numbers on strength

a) To test the strength of building materials such as steel girders, engineers place
increasing loads on the girders until they break. The pressure exerted by the load that
eventually breaks the material is called the strength of the girder. Generally speaking, the
longer the girder, the less the strength. Your company makes steel girders. The engineer
in charge of testing tells you that he has tested 10 girders to breaking point and has
obtained data linking the length of each girder (in metres) to its strength (in kg per square
centimetre). But his computer crashed just after he ran a regression analysis on the data
and all he can remember is the lengths of the girders and a few strengths. He did manage
to record the means and standard deviations of all the lengths and strengths and the r² of
the regression, which was 0.719.

          (X) Length (m)    (Y) Strength (kg/cm²)
          1                 90
          1                 101
          2                 Lost
          2                 Lost
          3                 91
          3                 77
          4                 Lost
          4                 Lost
          5                 76
          5                 Lost
Mean      3.00              82.60
SD        1.49              10.72

Note: The means and standard deviations are calculated for the ENTIRE data set,
including those that are missing.
(i) What is the correlation between length and strength? Report to three decimal places.

(ii) Work out a regression equation that predicts strength from length.
Equation: ___________________________

(iii) You notice that the purchaser of your girders requires the 5 m girders to support an
average load of 75 kg per square centimetre. Do you feel confident your girders will do
that? Give a numerical rationale.


b) What is the correlation coefficient for the following three points in the X-Y plane?
(STOP AND THINK BEFORE YOU START!)
X    1    3    5
Y    4    3    2
Answer: __________
c) An American study found that the correlation between two-year-old children's heights
(measured in inches) and their weights (measured in pounds) was 0.46. What would the
correlation coefficient be if you converted their heights to centimetres and weights to
kilograms? (One inch = 2.54 cm and 1 pound = 0.454 kg.)
Answer: __________

d) An economist studied salaries of 321 bank employees with five or fewer years of
employment in a national bank. He found that the relationship between years of service
and salary was linear and that the regression equation predicting salary (in thousands of
dollars) was: Salary = 21.5 + 3.1 * Years.
He concludes that employees with 10 years of service should make an average salary of
$52,500. Is his conclusion correct? If not, say why.

e) In part d) the economist has used the regression equation to make a prediction. Which
of these numbers best measures the precision of this prediction?
__A. The slope of the line (b1)
__B. The standard deviation of y (sy)
__C. The standard deviation of x (sx)
__D. The square of the correlation coefficient (r²)
__E. The ratio of the two standard deviations (sy / sx)
f) An investigator measuring various characteristics of a large group of athletes found that
the correlation coefficient between the weight of the athlete and the weight that the
athlete could lift was r = 0.60. Determine whether each statement is true or false. Circle
your choice.
(i) If an athlete gains 5 kg, he/she will be able to lift
an additional 3 kg.                                           True   False
(ii) The more an athlete can lift, on the average the more
that athlete weighs.                                          True   False
(iii) 36 per cent of the athlete's lifting ability can be
attributed to his or her weight alone.                        True   False
(iv) 60 per cent of the athlete's lifting ability can be
attributed to his or her weight alone.                        True   False


MT2011: Question 7

Pack up all your troubles, and call it a day

An important part of the customer service responsibilities of a telephone company relates
to the speed with which troubles in residential service can be repaired. Suppose that past
data indicate that there is a probability of 0.70 that service troubles can be repaired on the
same day they are reported.
a) Suppose the company receives 100 trouble calls on a particular day. What is the
approximate chance that 80% or more will receive same-day repairs?

b) Suppose it is also known that the repair time for a trouble call has a mean of 480
minutes and a standard deviation of 250 minutes. A random sample of 400 trouble calls
was taken and the repair times recorded. Compute the probability that the mean of the
400 repair times is less than 500 minutes.


MT2011: Question 8

Statistical analysis of a logo transformation

An established clothing retailer, CHAP, is interested in customer response to a proposed
new logo. A survey randomly samples 100 customers; 55 of them say they would prefer
the new logo to the previous one. However, CHAP will only change its logo if it is
convinced that the newly designed logo is preferred by the majority (i.e. more than half)
of its customers. Based on this information answer the following questions.
a) The sample estimate p̂, the proportion of customers who prefer the newly designed
logo over the previous one is:
__ A. 0.55
__ B. 55
__ C. 100
__ D. Not able to be determined from the information given
b) The standard error of this estimate is closest to:
__ A. 0.0025
__ B. 0.050
__ C. 0.071
__ D. 0.50
c) The 95% confidence interval for the true proportion of the customers who prefer the
new logo over the previous one is closest to:
__ A. 0.55 ± 0.098
__ B. 0.55 ± 0.98
__ C. 0.55 ± 0.0049
__ D. 55 ± 9.8
d) How large a sample n would you need to estimate p, the proportion of people who
prefer the newly designed logo over the previous one, with margin of error 0.05 with 99%
confidence? Use 0.5 as the guess for the value of p.
__ A. 384
__ B. 664
__ C. 26
__ D. 271
e) If a hypothesis test were conducted on these data, the test statistic would be 1.00. If the
alternative hypothesis were one-sided, what would the P-value be?
__ A. 0.0794
__ B. 0.1587
__ C. 0.3174
__ D. 0.8413
f) Which of the following is a correct conclusion from the hypothesis test in part e)?
__ A. Customers definitely prefer the new logo
__ B. Customers definitely do not prefer the new logo
__ C. There is not enough evidence to say customers prefer the new logo
__ D. There is not enough evidence to say customers do not prefer the new logo


MT2011: Question 9

The business of bus-ness

You are the new Operations Manager of the local public transportation company and are
especially interested in the reliability of bus service. You plan, on a monthly basis, to take
a random sample of major bus stops and observe whether the buses depart on time or late
and how late they are. (Buses never leave early since, if they arrive early, they wait until
their departure will be exactly on time.)
a) The first month, you gather a random sample of 121 bus departures from a variety of
times of day, days of the week, routes and locations. The sample has an average lateness
of departure of 6.4 minutes with a standard deviation of 1.8 minutes. Which of the
following is closest to a 95% confidence interval for the average lateness of departures
for the entire bus system this month?
__ A. 6.4 ± 0.029
__ B. 6.4 ± 0.271
__ C. 6.4 ± 0.324
__ D. 6.4 ± 3.564
b) Which of the following would decrease the width of the confidence interval?
__ A. Reduce the confidence level
__ B. Increase the sample size
__ C. Reduce the sample standard deviation
__ D. All of the above

Five years ago, the system-wide mean lateness of departure was known to be 6.8 minutes.
Using a 5% level of significance and the sample results of part a), carry out a hypothesis
test to decide whether the system is improving; that is, whether the mean lateness has
decreased from five years ago.
c) The appropriate null and alternative hypotheses are:
H0: ____________

Ha: ____________

d) Give the formula for the appropriate test statistic and compute its value.
Formula: __________________
Computed value: ______________
(Show your work to the right ==>)

e) Give a range in which the P-value is located.


f) From the P-value associated with this test statistic, which of the following is correct?
__ A. Do not reject H0 at the 10% significance level
__ B. Reject H0 at the 10% significance but not at the 5% significance level
__ C. Reject H0 at the 5% significance level but not at the 1% significance level
__ D. Reject H0 at the 1% significance level
g) Using the 5% significance level, state your conclusion in one clearly worded sentence
that the bus company management can understand.

h) The distribution of lateness of departure is strongly skewed to the right. However, it is
still appropriate to test the mean because:
__ A. The data are a simple random sample from the population of interest
__ B. The sample size is large enough for the Central Limit Theorem to apply
__ C. Since the sample is random, bus departures are independent of one another
__ D. All of the above

BONUS: In what century did the equals sign first appear in print?
__ A. 1300s
__ B. 1400s
__ C. 1500s
__ D. 1600s
__ E. 1700s
__ F. 1800s
__ G. 1900s
MT2011 END OF QUESTIONS; ANSWERS AND EXPLANATIONS FOLLOW


MT2011: ANSWERS AND EXPLANATIONS


MT2011: Answer 1
a) B
b) Ethnicity: Categorical; Height: Quantitative; C290 grade: Quantitative;
# hrs online: Quantitative
c) A; d) B; e) Mean(X) = 73, SD(X) = 10
Details and Comments:
a) The goal was to survey the entire population of C291 students; that is the definition of
a census.
b) Height, C290 grade, and # hrs online are each measured with units (cm, %, and hrs,
respectively) so all three are quantitative variables.
c) The distribution is likely to have a small number of students with a high number of
study hours; this skewness has the effect of inflating the mean.
d) Bias comes from, among other sources, missing values which are missing for a reason
related to the variable of interest.
e) The interval 73 ± 20 = Mean ± 2SD (by the Empirical Rule) = 73 ± 2(10).
The midpoint of the interval (73) is the mean; the SD is 10.

MT2011: Answer 2
a) i) Median = 17, Q1 = 13, Q3 = 24, IQR = 11;
Inner fences = (-3.5, 40.5). [Accept also (0,40.5).] There are no outliers.
ii) The distribution is skewed since the mean is quite different from the median.
iii)
_________
|-----------|__|_______|-----------------|
___________________________________
0        10        20        30        40
b) All five statements are False.
Details and Comments:
a) i) With 25 data points, the median is the 13th value. The Q1 is between the 6th and 7th
values (which are equal here) and the Q3 is between the 19th and 20th values (which are
also equal here). IQR = Q3 − Q1. Since the lower inner fence (Q1 − 1.5×IQR) is negative,
it is also acceptable to report it as 0 because P/E ratios cannot be negative.
ii) Actually, the distribution is skewed to the right, but that distinction was not needed in
the answer.
iii) The sketch must show the skewness, namely that the median is closer to the left side
of the box and the left whisker is shorter than the right whisker.
b)
1. The Empirical Rule wouldn't be able to work, so the distribution is NOT symmetric.
2. A distribution can be symmetric without being normal; e.g. pyramid shape, or uniform.
3. A symmetric distribution can have two peaks; the mean and median are in the middle
but the modes are at either end (e.g. U-shaped)
4. There is no reason for this to be true.
5. SD = 0 if all data values are the same.

MT2011: Answer 3
a) B or C; b) A; c) B; d) D; e) C;
Details and Comments:
a) Both multi-stage sampling and stratified sampling are acceptable answers.
Technically, multi-stage sampling is the preferred answer, since for PUBS, four schools
are chosen randomly but the actual students are selected systematically.
b) Since every fifteenth student is selected, the selection is systematic, not random.
c) Since students are free to come, or not, to the stand, this is voluntary response.
d) Counts are not parameters because they are not adjusted for sample size; however,
proportions are parameters.
e) Cards are handed out either to every fifteenth student or to volunteers; however, in
each group not everyone who receives a card will mail the card in; that's non-response.

MT2011: Answer 4
a) (i) 52% (625/1200 = 0.52)
(ii) Non-response rates are: Small: 37.5%, Medium: 60%, Large: 80%.
The larger the company the higher the expected rate of non-response.
b) Correlation is not the same as slope. So a perfect correlation does not mean that the
slope is 1, hence a 1 unit increase in x does not mean a 1 unit increase in y.
c) No: Larger hospitals are more likely to take more serious cases requiring longer length
of stay.
Details and Comments:
a) (i) Sum across the columns to get the row totals of 575 Respondents and 625
Non-respondents. Then divide the Non-respondents total by the overall total of 1200.
(ii) Column percentages are needed here, not row percentages.
b) Remember the formula for slope: b1 = r(sy/sx). Even if r = 1, the slope is still the
ratio of the SDs, which need not be equal.
c) Look for lurking variables to explain unusual or nonsensical correlations.

MT2011: Answer 5
a) (i) Pr (X < 2.00) = Pr (Z < [2.00 − 2.84]/0.40) = Pr (Z < −2.10) = 0.0179 or 1.79%.
(ii) Z = 0.84; X = 2.84 + 0.84(0.40) = 3.18 (or 3.176)
(iii) Q1 = 2.57; Q3 = 3.11; IQR = 0.54
Q1 for Z = −0.675; X = 2.84 + (−0.675)(0.40) = 2.57
Q3 for Z = 0.675; X = 2.84 + (0.675)(0.40) = 3.11
IQR = 3.11 − 2.57 = 0.54
b) Bart: 2.25; Lisa: 2.50; Circle Lisa
Z-score for Bart = (725 − 500)/100 = 2.25; Z-score for Lisa = (33 − 18)/6 = 2.50;
Lisa did better relative to the reference populations since her positive Z-score is higher.
Details and Comments:
a) Remember to make sketches of the required areas so that you get the correct parts of
the normal curve. In (i), standardize X to Z and find the corresponding area; in (ii) and
(iii), begin with the area, find Z and unstandardize to get X.
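
The two directions described above (value to area, and area back to value) can be checked with a short sketch. It assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative and not part of the exam.

```python
from statistics import NormalDist

# Normal model with the mean (2.84) and SD (0.40) used in Answer 5 a).
x_dist = NormalDist(mu=2.84, sigma=0.40)

# (i) Value to area: area to the left of 2.00.
p_below_2 = x_dist.cdf(2.00)

# (ii) Area to value: the value with 80% of the area below it (Z = 0.84).
cutoff_80 = x_dist.inv_cdf(0.80)

# (iii) Quartiles and IQR, again going from area to value.
q1 = x_dist.inv_cdf(0.25)
q3 = x_dist.inv_cdf(0.75)
iqr = q3 - q1

# b) Z-scores compare values measured on different scales.
bart_z = (725 - 500) / 100
lisa_z = (33 - 18) / 6

print(round(p_below_2, 4), round(cutoff_80, 2), round(iqr, 2), bart_z, lisa_z)
```

Software uses unrounded Z values, so results can differ from table-based answers in the last decimal place.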

MT2011: Answer 6
a) (i) r = −0.848 (Note that the correlation is negative!)
(ii) b1 = r(sy/sx) = −0.848(10.72/1.49) = −6.10;
b0 = ȳ − b1x̄ = 82.60 − (−6.10)(3.00) = 100.9;
ŷ = 100.9 − 6.10x
(iii) ŷ(5) = 100.9 − 6.10(5) = 70.4. Since this is less than the required 75 kg/cm², you
should not be confident that the 5 m girders will be sufficient.
b) Perfect negative correlation: r = −1 (Plot the data points; they fall on a straight line.)
c) r = 0.46, unchanged (Correlation is invariant to the measurement scales.)
d) No: predictions at 10 years require extrapolation beyond the range of data (that is,
the analysis was done using bank employees with 5 or fewer years of employment).
e) D. The square of the correlation coefficient.
f) False, True, True, False
Details and Comments:
a) The minus sign is vital; the correlation is negative since the longer the girder, the lower
the strength. If you forget the minus sign your calculations of the slope, intercept and
regression equation will be incorrect and you would end up concluding that 5 m girders
would support an average load of 75 kg/cm2. In that case your building might fall down.
Bad statistical analysis can kill!
b) Remember to make a plot before doing the calculations.
f) (i) is false because a gain of 5 kg will mean an additional lift of 3 kg only on average.
A gain of 5 kg might give additional lift greater than 3 kg for some people and less than 3
kg for others; (ii) refers to what happens on average; (iii) uses the definition of r²; (iv)
mistakenly uses r instead of r².
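
The slope and intercept calculation in part a) can be reproduced from the summary statistics alone. This is a minimal sketch; it assumes that 3.00 and 1.49 are the mean and SD of girder length (x) and that 82.60 and 10.72 are the mean and SD of strength (y), which is how those numbers enter the answer's formulas. The function name is illustrative.

```python
def line_from_summary(r, x_mean, x_sd, y_mean, y_sd):
    """Least-squares line from summary statistics: b1 = r*sy/sx, b0 = ybar - b1*xbar."""
    slope = r * y_sd / x_sd
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Values taken from Answer 6 a); note the negative correlation.
slope, intercept = line_from_summary(r=-0.848, x_mean=3.00, x_sd=1.49,
                                     y_mean=82.60, y_sd=10.72)

# Predicted strength for a 5 m girder, compared with the required 75 kg/cm^2.
predicted_at_5 = intercept + slope * 5
meets_requirement = predicted_at_5 >= 75

print(round(slope, 2), round(intercept, 1), round(predicted_at_5, 1), meets_requirement)
```

Dropping the minus sign on r flips the slope and pushes the prediction above 75, which is exactly the mistake the comment on part a) warns about.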

MT2011: Answer 7
a) Pr (p̂ > 0.80) = Pr (Z > [0.80 − 0.70]/√[(0.70)(0.30)/100]) = Pr (Z > 2.18) = 0.0145 or 1.45%
b) Pr (x̄ < 500) = Pr (Z < [500 − 480]/[250/√400]) = Pr (Z < 1.60) = 0.945 or 94.5%
Details and Comments:
a) Use the sampling distribution of p̂.
b) Use the sampling distribution of x̄ (i.e. remember the √n in the denominator).
Both of these situations depend on the Central Limit Theorem and both random samples
are large enough (100 and 400, respectively). Remember to make a sketch to get the
correct area!
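
As a rough check of both normal approximations, the sketch below builds the two sampling distributions directly. It assumes Python 3.8 or later for `statistics.NormalDist`; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# a) Sampling distribution of the sample proportion: p = 0.70, n = 100.
p, n_calls = 0.70, 100
prop_dist = NormalDist(mu=p, sigma=sqrt(p * (1 - p) / n_calls))
p_at_least_80pct = 1 - prop_dist.cdf(0.80)

# b) Sampling distribution of the sample mean: mu = 480, sigma = 250, n = 400.
mu, sigma, n_times = 480, 250, 400
mean_dist = NormalDist(mu=mu, sigma=sigma / sqrt(n_times))
p_mean_below_500 = mean_dist.cdf(500)

print(round(p_at_least_80pct, 4), round(p_mean_below_500, 3))
```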


MT2011: Answer 8
a) A; b) B; c) A; d) B; e) B; f) C
Details and Comments:
a) Reason: p̂ = 55/100 = 0.55
b) Reason: SE(p̂) = √[p̂(1 − p̂)/n] = √[(0.55)(0.45)/100] = 0.050
c) Reason: 0.55 ± 1.96(0.050)
d) Reason: n = (2.576²)(0.5)(0.5)/(0.05²) = 664
e) Reason: Area to the right of 1.00 on the z-curve.
f) Reason: The P-value is not less than 0.05 (and not even less than 0.10).
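
The reasons for parts a) to e) can be verified numerically. This is a sketch under the usual large-sample normal approximation, assuming Python 3.8 or later for `statistics.NormalDist`; the names are illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist()                 # standard normal
n, prefer_new = 100, 55
p_hat = prefer_new / n           # part a)

se = sqrt(p_hat * (1 - p_hat) / n)          # part b)
margin_95 = z.inv_cdf(0.975) * se           # part c): 0.55 +/- margin_95

# Part d): margin of error 0.05 at 99% confidence, using 0.5 as the guess for p.
n_needed = ceil(z.inv_cdf(0.995) ** 2 * 0.5 * 0.5 / 0.05 ** 2)

# Part e): under H0: p = 0.5 the standard error uses the null value of p.
z_stat = (p_hat - 0.5) / sqrt(0.5 * 0.5 / n)
p_value_one_sided = 1 - z.cdf(z_stat)

print(round(se, 3), round(margin_95, 3), n_needed, round(p_value_one_sided, 4))
```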

MT2011: Answer 9
a) C b) D c) H0: μ = 6.8; Ha: μ < 6.8
d) t = (x̄ − μ0)/(s/√n) = (6.4 − 6.8)/(1.8/√121) = −2.44
e) 0.005 < P-value < 0.01
f) D. Reject H0 at the 1% significance level
g) There is strong evidence to say that the system is improving (or that mean lateness has
decreased).
h) B or D (either is acceptable)
Details and Comments:
a) Reason: t* = 1.980: CI = 6.4 ± 1.980(1.8/√121) = 6.4 ± 0.324
b) Examine the effect of each of these by referring to the formula for the CI.
c) This is a one-tailed alternative since the question asks whether mean lateness has
decreased from five years ago.
d) Remember the minus sign on the test statistic.
e), f) & g) Reject H0 since the P-value is less than 0.01. Remember to state your
conclusion in a sentence that answers the original question.
h) B is the most important of the three, but A and C are also needed for the test to work.
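
The test statistic, the P-value bracket and the confidence interval for this question can be sketched as follows. The Python standard library has no t distribution, so the critical values below are typed in from the df = 120 row of a standard t table; that row, like the variable names, is an assumption of this sketch rather than part of the exam.

```python
from math import sqrt

n, x_bar, s = 121, 6.4, 1.8      # sample size, mean and SD of lateness (minutes)
mu_0 = 6.8                       # system-wide mean five years ago

se = s / sqrt(n)
t_stat = (x_bar - mu_0) / se     # part d)

# t-table values for df = 120 (assumed table row).
t_star_95 = 1.980                # two-sided 95% confidence
t_tail_01 = 2.358                # one-tail area 0.01
t_tail_005 = 2.617               # one-tail area 0.005

# Part e): |t| falls between the two table values, so 0.005 < P-value < 0.01.
p_between_005_and_01 = t_tail_01 < abs(t_stat) < t_tail_005

# Part a): margin of error of the 95% confidence interval.
margin_95 = t_star_95 * se

print(round(t_stat, 2), p_between_005_and_01, round(margin_95, 3))
```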
BONUS: C. The equals sign first appeared in print in 1557.

MT2011 END OF ANSWERS AND EXPLANATIONS


Midterm Exam 2010


Notes: This exam has 9 questions. The duration is 2 hours. Books, notes, and
calculators are allowed, but not computers, cellphones or on-line connectivity.
MT2010: Question 1 "Mittens, means, and medians"
a) The Hudson's Bay Company was the official retailer of Olympics merchandise,
including the very popular red mittens. Their database included information on each sale
made to customers who paid by credit card (Visa only). Some of the variables they
collected are listed below. Decide whether each variable would, for analysis, be most
usefully considered as categorical, quantitative or neither.

Total amount of the sale ($)        Categorical   Quantitative   Neither
Country of origin on credit card    Categorical   Quantitative   Neither
Gender of the customer              Categorical   Quantitative   Neither
Visa credit card number             Categorical   Quantitative   Neither

b) Credit card customers were divided into two groups: Canadian residents and visitors to
Canada. The average amount spent by all Canadian residents was $200. The average
amount spent by all visitors to Canada was $300. What must be true about the average
amount spent by all customers?
__ A. It must be $250
__ B. It must be larger than the median expenditure
__ C. It could be any number between $200 and $300
__ D. It must be larger than $250

c) A sample of 500 cash sales had a mean of $20 and a standard deviation of $40. The
histogram of the data would most likely be:
__ A. skewed to the left (i.e. long left-hand tail)
__ B. approximately symmetric
__ C. skewed to the right (i.e. long right-hand tail)
__ D. bimodal

d) Which of the following is likely to have a mean that is smaller than the median?
__ A. The salaries of all National Hockey League players
__ B. The grades of students (out of 100) on a very easy exam on which most
score very high or perfectly, but a few do very poorly
__ C. The prices of homes in Vancouver
__ D. The grades of students (out of 100) on a very difficult exam on which most
score poorly, but a few do very well


e) Here is the frequency distribution of the ages of a sample of 100 employees of the
Hudson's Bay Company.
Age (years)   Frequency
15-19              2
20-24             10
25-29             19
30-34             27
35-39             16
40-44             10
45-49              6
50-54              5
55-59              3
60-64              2
Total            100
(i) What percentage of the employees is 50 or older? _______
(ii) The median age of the employees is:
__ A. About 40
__ B. Between 30 and 34
__ C. Between 40 and 49
__ D. None of the above
(iii) The mean age of the employees is:
__ A. About 34 because about half are younger than 34 and half are older than 34
__ B. Above the median because the distribution is approximately symmetric
__ C. Above the median because the distribution is skewed to the right
__ D. None of the above
f) Based on the following figure, decide whether each of the statements below the figure
is more likely to be True or False. (Note: House income means "total household income"
and is referred to simply as "income" in the statements.)

[Figure: car buyer house income (vertical axis, 0 to 350,000) by car brand: BMW, Cadillac,
Lexus, Lincoln, Mercedes]

Mercedes buyers have the highest variability in income.        True   False
For each car type, the incomes are reasonably symmetric.       True   False
There is a positive correlation between income and brand.      True   False

MT2010: Question 2 "Catching some zzzs"


a) Consider a standard normal random variable, Z, (i.e. with mean 0 and standard
deviation 1). Find the median, lower and upper quartiles and interquartile range (IQR) of
Z.
Median of Z: ______
Lower quartile (Q1): _______
Upper quartile (Q3): _______
Interquartile Range: _______

b) What percentage of values of Z lie outside 1.5×IQR on each side of the median? That
is, find the total percentage below "Median − 1.5×IQR" or above "Median + 1.5×IQR".

c) Draw a boxplot that would represent data obtained from a large sample of values of Z.

d) This part is unrelated to parts a), b) and c).


Scores on the Wechsler Adult Intelligence Scale (WAIS), a standard IQ test, are
approximately normally distributed for all age groups; however, the means and standard
deviations of scores differ across different age groups. For the 20 to 34 age group, the
mean is 110 and the standard deviation is 25, while for the 60 to 64 age group, the mean
is 90 and the standard deviation is 25. Sarah is 29 and her mother Ann is 62. Sarah scores
135 on the WAIS while Ann scores 120. Which of the two has the higher score relative to
her age group? Explain your choice with appropriate calculations.
____ Ann

____ Sarah


MT2010: Question 3 "Contender for gender offender"


A university offers only two degree programs, one in Engineering and one in English.
Admission to the programs is competitive, and a women's group suspects discrimination
against women in the admissions process. They obtain the following data from the
university, a two-way classification of all applicants by gender and admissions decision.
                Male   Female
Admitted         35      20
Not Admitted     45      40
a) Is there evidence of an association between the applicants' gender and success in
obtaining admission? Why or why not?

b) The university replies that there is no discrimination. In its defence, it produces a
three-way table that classifies applicants by gender, admission decision AND program to
which they applied.
                   Engineering         English
                  Male   Female      Male   Female
Admitted           30      10          5      10
Not Admitted       30      10         15      30
Is there an association between admission rates and gender in either program? Explain
why or why not.

c) Are the answers in parts a) and b) contradictory? If so, how can you explain the
contradiction?

d) After disregarding gender, are admission rates different in the two programs? Support
your conclusion with an appropriate two-way table (i.e. admission decision by program).


MT2010: Question 4 "Beauty is in the eye of the frolder"


On a recent trip to Mars, scientists discovered a colony of small creatures that they named
frolders. Due to the speed and agility of the frolders, the scientists could only capture five
specimens to bring back to Earth to study. One scientist suspects the weight of the frolder
may be related to the number of eyes it has. The following table shows the weight and
number of eyes for each of the five specimens:
Specimen ID       A101   A102   A103   A104   A105
Weight (kg)          2      8      4     15      6
Number of Eyes       2     11      5     17      5
a) Plot these data below.

[Blank plotting grid titled "Frolder Study": # Eyes (0 to 20) on the vertical axis,
Weight (kg) (0 to 20) on the horizontal axis]

b) Briefly describe the association (must be brief for full marks!)

c) Which of the following values is the correct correlation coefficient for this data?
Note: You can reason this out without doing the calculation.
__ A. r = 0.5
__ B. r = 0.975
__ C. r = 0
__ D. r = -0.954
__ E. r = -0.5
d) Looking at the scatterplot, is the correlation coefficient an appropriate measure? Why
or why not?

e) A journalist reporting on this study claims that being heavier causes a frolder to grow
more eyes. What is wrong with this statement?

f) Do you think these five frolders represent a random sample? Why or why not?


MT2010: Question 5 "Wires, dam wires, and electricians"


Electrical wires can corrode over time. And wires used near hydroelectric dams can
corrode more quickly because of the extra moisture in the air. Corrosion rates (measured
in hundredths of mils) are generally known for various types of wire, but electricians
would like to be able to predict the corrosion rates near dams. Corrosion rates for 30
types of wire were measured in normal use and at dams to assess the relationship. A
linear regression model can be constructed with wires in normal use as the x variable and
the same wires used at dams as the y variable. The following scatterplot shows the data:
[Scatterplot: Dam Wire (vertical axis, 0 to 1200) against Wire (normal use) (horizontal
axis, 0 to 800)]

a) In this study, the response variable is:


__ A. Corrosion rate for a dam wire
__ B. Corrosion rate for a wire in normal use
__ C. Either rate; it does not matter which is considered the response
__ D. Neither; the instrument used to measure corrosion is the response variable
b) Is linear regression appropriate here? Choose the single best statement.
__ A. Yes, the scatterplot is straight enough
__ B. No, there is not enough scatter
__ C. No, there is too much scatter
__ D. Yes, there are no outliers
c) Summary statistics are presented below. Use them to calculate the regression line.
Show the formulas and your work. Report your final answers to three decimal places.
x̄ = 304.6667     sx = 196.4466     r = 0.8691
ȳ = 554.0000     sy = 286.6104


d) A new type of wire has a corrosion rate measure of 555. What does the model predict
for the corrosion measure of this type of wire used at a dam?

e) One of the data points is (220, 245). What is the value of the residual for this point?

f) What fraction of the variation in y is accounted for by the model?

g) Can the regression line be used to reliably estimate the dam wire corrosion rate for a
wire which has a rate of 2500 mil under normal use? Give a reason.
___ Yes

___ No

Reason:

h) Fill in each blank with the letter of the ending that fits best.
(i) If the x and y variables are switched, __________.
(ii) If the units are changed for both x and y variables, __________.
(iii) If the units are changed for just the x variable, __________.
(iv) If a constant is added to the y variable, __________.
Endings:
A. ...the slope will change but the averages and standard deviations will not change.
B. ...sx will change but

will not change.

C. ...the data will be normally distributed.


D. ...only the correlation will change.
E. ...the correlation, slope, and standard deviations will remain the same.
F. ...the correlation and slope will both change.
G. ...the slope will change, and sx and sy will also change.


MT2010: Question 6 "Putting the pedal to the medal"


Retain all precision throughout your calculations but write down only two decimal places
for your final answer.
For parts a) and b), assume that the weights of the gold medals, silver medals, and
ribbons are all independent (especially since we have not learned how to deal with such
questions otherwise!).
a) Each medal made for the recent Olympics is unique. Ours were the first Olympic
Games for which the medals have not been identical! Complete gold medals (that is, the
medal plus the ribbon) weigh 48 grams on average with a standard deviation of 6 grams.
The ribbons that are attached to the medals weigh 8 grams on average with a standard
deviation of 2 grams. Find the mean, variance and standard deviation of the weights of
the gold medals without their ribbons.

b) Complete silver medals (i.e. medal plus ribbon) weigh 38 grams on average with a
standard deviation of 5 grams. Find the mean, variance and standard deviation of a pair of
complete medals (gold and silver) combined.

c) You were instructed to assume that the weights of the gold medals, silver medals, and
lengths of ribbon are all independent. Is this a reasonable assumption? Explain why or
why not in one brief sentence at most.

d) In some winter Olympic events, such as the snowboard parallel giant slalom, the
winner is the rider with the best combined time over two runs. In some summer Olympic
events, such as the javelin throw, the winner is athlete with the best single distance out of
four tries. Generally speaking, does the sum of two random times or the maximum of four
random distances have greater variability?
__ A. Sum of two random times
__ B. Maximum of four random distances
__ C. Cannot say because time and distance are unrelated variables
Why? Explain in one sentence maximum.


MT2010: Question 7 "The food of the gods!"


Chocolate bars produced by a certain machine are labeled 240 grams to comply with
advertising rules and regulations. However, the distribution of the actual weight of these
chocolate bars is claimed to be normal with a mean of 243 grams and a standard
deviation of 3 grams.
a) Approximately what percentage of all chocolate bars produced by this machine would
be expected to be between 240 and 246 grams?

b) A quality control manager initially plans to take a random sample of size n from the
production line. If he were to double his sample size to 2n, the standard deviation of the
sampling distribution of the sample mean would be multiplied by:
__ A. 1/2
__ B. 1/√2
__ C. √2
__ D. 2

c) The quality control manager plans to take a random sample of size n from the
production line. How big should n be so that the sampling distribution of x̄ has standard
deviation 0.3 grams?
__ A. 10
__ B. 100
__ C. 1000
__ D. Cannot be determined unless we know that the population is normal.

d) If the quality control manager takes a random sample of nine chocolate bars from the
production line, what is the probability that the sample mean weight of the nine sample
chocolate bars will be less than 240 grams?
__ A. 0
__ B. 0.0013
__ C. 0.1587
__ D. 0.9987
Show your work:


MT2010: Question 8 "Shooters for the shooters?"


A radio talk show host with a large audience is interested in the proportion p of adults in
his listening area who think the drinking age should be lowered to 18. To find out, he
poses the following question to his listeners: "Do you think that the drinking age should
be reduced to 18, in light of the fact that 18-year-olds are eligible for military service?"
He asks listeners to phone in and vote "yes" if they agree the drinking age should be
lowered and "no" if not. Of the 100 people who phoned in, 70 answered "yes".
a) The sample estimate, p̂, of the proportion of adults who think the drinking age should
be reduced is:
__ A. 70
__ B. 0.70
__ C. 0.69
__ D. Not able to be determined from the information given

b) The standard error of this estimate is closest to:


__ A. 0.089
__ B. 0.046
__ C. 0.0021
__ D. 0.0045

c) The margin of error for a 90% confidence interval is closest to:


__ A. 0.046
__ B. 0.075
__ C. 0.090
__ D. 0.690

d) How large a sample n would you need to estimate p with margin of error 0.01 with
95% confidence? Use 0.6 as the guess for the value of p.
__ A. 6768
__ B. 9220
__ C. 9502
__ D. 9596

e) Which of the following assumptions for inference about a proportion using a


confidence interval are violated in this case?
__ A. The data are a simple random sample from the population of interest
__ B. The success/failure condition
__ C. A third choice of no opinion needed to be included
__ D. There appear to be no violations


MT2010: Question 9 "Going postal"


A simple random sample of 100 Canada Post employees found that the average time
these employees had worked for the postal service was 7.0 years with standard deviation
of 2.0 years. Do these data provide evidence that the mean length of time that the
population of Canada Post employees have worked for the postal service had changed
from the value of 7.5 of 20 years ago?
a) Give the appropriate null and alternative hypotheses.

b) Give the formula for the appropriate test statistic and compute its value.

c) Give a range in which the P-value is located.

d) From the P-value associated with this test statistic, which of the following is correct?
__ A. Do not reject H0 at the 10% significance level
__ B. Reject H0 at the 10% significance but not at the 5% significance level
__ C. Reject H0 at the 5% significance level but not at the 1% significance level
__ D. Reject H0 at the 1% significance level
e) Using the 5% significance level, state your conclusion in one clearly worded sentence
that Canada Post management can understand.

f) The 95% CI for the mean time the population of postal employees have spent with the
postal service is closest to:
__ A. 7.0 ± 0.2
__ B. 7.0 ± 0.4
__ C. 7.0 ± 2.0
__ D. 7.0 ± 4.0
Bonus Question: Just for Fun and Bragging Rights
Over the 17 days of the Winter Olympics you saw the Olympic rings logo countless
times. In the official logo, not the single-colour Vancouver 2010 version, each of the five
rings is a different colour. How well do you remember the order of the colours in the
rings? Write the colours in the blanks as indicated.
________
Ring 1

________
Ring 2
________
Ring 4

________
Ring 3
________
Ring 5

END OF MIDTERM 2010; ANSWERS AND EXPLANATIONS FOLLOW



MIDTERM EXAM 2010: ANSWERS AND EXPLANATIONS


MT2010: Answer 1
a) Quantitative; Categorical; Categorical; Neither
b) C c) C d) B
e) (i) 10% (ii) B (iii) C
f) True; True; False

Details and Comments:


a) Although the text considers an identifier variable, such as a Visa credit card number, as
a type of categorical variable, it is useless in that form; it is best thought of as Neither.
You aren't likely to do any analysis on the Visa card number!
b) The average must lie between the minimum and maximum, but depending on
skewness it could be smaller or larger than the midpoint or median.
c) The minimum value is 0 but the maximum can be very large, hence right-skewed.
d) All except B are likely to have a long right-hand tail, where the mean exceeds the
median.
e) (i) (5+3+2)/100 = 10%
(ii) 31% of values (2+10+19) are less than 30; including the 30-34 interval increases the
cumulative count to 58% (2+10+19+27), so the median falls in the 30-34 interval.
f) Incomes are not exactly symmetric, but for all practical purposes and especially for
data analysis, they certainly are reasonably symmetric.
MT2010: Answer 2
a) Median = 0; Q1 = -0.675; Q3 = 0.675; IQR = 1.35
b) Prob. = 2×Pr (Z > 1.5×1.35) = 2×Pr (Z > 2.025) = 2×0.0215 = 0.0430 or about 4%
c) The boxplot is symmetric around 0, with the ends of the box at Q1 and Q3 at -0.675
and 0.675 (from part a). Since Z has no limits, the whiskers can't extend to the minimum
and maximum. Instead, use inner fences; the whiskers should extend to -2.7 and 2.7.
d) Ann has a higher rank.
Ann's z-score = (120 − 90)/25 = 1.2; Sarah's z-score = (135 − 110)/25 = 1
Details and Comments:
a) Z is symmetric so the median equals the mean.
It is acceptable to report answers to two decimal places:
For Q1: -0.68 or -0.67; for Q3: 0.68 or 0.67; for IQR: 1.36 or 1.34
b) If you used IQR of 1.36, the probability is 0.0414.
If you used IQR of 1.34, the probability is 0.0444.
c) Since the distribution is unbounded, any reasonable choice of whiskers is acceptable.
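
The quartiles, the fences and the fraction outside them can be confirmed without tables. This sketch assumes Python 3.8 or later for `statistics.NormalDist`; software uses the unrounded quartile 0.6745, so the last decimal may differ slightly from table-based answers.

```python
from statistics import NormalDist

z = NormalDist()                         # standard normal: mean 0, SD 1

# a) Median, quartiles and IQR.
median = z.inv_cdf(0.50)
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1

# b) Total fraction more than 1.5*IQR away from the median.
outside_fraction = 2 * (1 - z.cdf(median + 1.5 * iqr))

# c) Inner fences, where the boxplot whiskers would stop.
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# d) Relative standing within each age group.
ann_z = (120 - 90) / 25
sarah_z = (135 - 110) / 25

print(round(iqr, 2), round(outside_fraction, 3), round(upper_fence, 1), ann_z, sarah_z)
```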


MT2010: Answer 3
a) Yes: Percent of males admitted = 35/80 = 0.4375 or 43.75%
Percent of Females admitted = 20/60 = 0.33 or 33%
b) No: Half of engineers of either sex are admitted. One-quarter of English students of
either sex are admitted.
c) The English program is harder to get into, and that is where more females applied. This
is an illustration of Simpson's Paradox.
d)
                Engineering   English   Row Total
Admitted             40          15         55
Not Admitted         40          45         95
Column Total         80          60        140
Admitted to Engineering: 40/80 = 0.50 or 50%
Admitted to English: 15/60 = 0.25 or 25%
Details and Comments:
When a two-way table is provided, it is useful to add the row totals and the column totals.
They are needed to compute conditional probabilities. Simpson's Paradox is one of the
most revealing illustrations of the need to dig deeper into the relationship between
categorical variables. What might appear to be the result for a two-way table may well be
reversed when a third variable is incorporated.
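
The aggregation effect can be seen by computing admission rates both within each program and overall. This is an illustrative sketch using the counts from the three-way table in Question 3; the data structure and names are not part of the exam.

```python
# counts[program][gender] = (admitted, not_admitted)
counts = {
    "Engineering": {"Male": (30, 30), "Female": (10, 10)},
    "English": {"Male": (5, 15), "Female": (10, 30)},
}

def rate(admitted, not_admitted):
    """Admission rate as a fraction of all applicants in the group."""
    return admitted / (admitted + not_admitted)

# Rates within each program: equal for men and women.
by_program = {
    program: {gender: rate(*cell) for gender, cell in genders.items()}
    for program, genders in counts.items()
}

# Rates after collapsing over program: men appear to be favoured.
overall = {}
for gender in ("Male", "Female"):
    admitted = sum(counts[program][gender][0] for program in counts)
    not_admitted = sum(counts[program][gender][1] for program in counts)
    overall[gender] = rate(admitted, not_admitted)

print(by_program)
print(overall)
```

The overall gap arises only because most women applied to English, the program with the lower admission rate.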
MT2010: Answer 4
a) [Scatterplot of # Eyes (vertical axis, 0 to 20) against Weight (kg) (horizontal axis,
0 to 20) for the five specimens]

b) Strong positive linear association


c) B
d) Yes; there is a clear linear relationship
e) Correlation does not imply causation.
f) No. They were the slower ones, or the easier ones to catch.
Details and Comments:
c) Since the correlation is strong and positive, only 0.975 is a sensible choice for r.
d) Correlation coefficients require linear relationships.
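
Although part c) can be reasoned out, the correlation is easy to confirm from the five data points in Question 4. The helper below is a plain implementation of the Pearson formula written for this illustration.

```python
from math import sqrt

weights = [2, 8, 4, 15, 6]     # kg, specimens A101 to A105
eyes = [2, 11, 5, 17, 5]       # number of eyes, same order

def pearson_r(xs, ys):
    """Pearson correlation: sum of cross-deviations over the root of the product of sums of squares."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(weights, eyes)
print(round(r, 3))
```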


MT2010: Answer 5
a) A b) A
c) b1 = r(sy/sx) = 0.8691(286.6104/196.4466) = 1.268
b0 = ȳ − b1x̄ = 554.0000 − 1.268(304.6667) = 167.683 (or 167.684)
ŷ = 167.683 + 1.268x (or ŷ = 167.684 + 1.268x)
d) ŷ = 167.683 + 1.268(555) = 871.423 (or 871.424)
e) ŷ = 167.683 + 1.268(220) = 446.643 (or 446.644)
Residual = e = 245 − 446.643 = −201.643 (or −201.644)
f) r² = 0.8691² = 0.755
g) No; this is extrapolation far beyond the range of data.
h) (i) A (ii) G (iii) B (iv) E
Details and Comments:
a) Response variable is on the vertical axis.
c) Beware of round-off error. Carry all available decimal places in the intermediate
calculations, but report fewer as instructed.
d) Simple substitution
e) Use the definition of residual: observed minus predicted.
f) This is the definition of r-squared.
g) Although it is mathematically correct to substitute 2500 into the regression equation,
extrapolation far beyond the range of data is a major misuse of regression.
h) Examine the formulas for slope, intercept and correlations and test out the effect of the
suggested changes. For (ii), correlation does not depend on units, but slope and SDs do
change if both variables change. For (iv), the scatterplot is simply moved straight up, so
SDs, slope, and correlation are not affected.
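The calculations in parts c) to f) can be reproduced with the sketch below. It assumes that 304.6667 and 196.4466 are the mean and SD of x, and that 554.0 and 286.6104 are the mean and SD of y, as the slope and intercept formulas suggest; the variable names are hypothetical. The slope is rounded to three decimals before reuse, which matches the first set of values in the answer key.

```python
# Summary statistics as they appear in the answer to part c).
# Assumption: (mean_x, sd_x) and (mean_y, sd_y) are identified from the formulas.
r = 0.8691
mean_x, sd_x = 304.6667, 196.4466
mean_y, sd_y = 554.0, 286.6104

# Slope = r * (sd_y / sd_x), rounded as in the answer key.
slope = round(r * sd_y / sd_x, 3)
# Intercept = mean_y - slope * mean_x.
intercept = mean_y - slope * mean_x


def predict(x):
    """Predicted response from the fitted least squares line."""
    return intercept + slope * x


# Residual = observed minus predicted, for x = 220 and observed y = 245.
residual = 245 - predict(220)
r_squared = r ** 2

print(slope, round(intercept, 3))
print(round(predict(555), 3), round(residual, 3), round(r_squared, 3))
```

Keeping the slope unrounded instead shifts the last decimal place slightly, which is the round-off effect that comment c) warns about.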
MT2010: Answer 6
a) Mean(X - Y) = Mean(X) - Mean(Y) = 48 - 8 = 40
Var(X - Y) = Var(X) + Var(Y) = 36 + 4 = 40
SD(X - Y) = √40 = 6.32
b) Mean(X + Y) = Mean(X) + Mean(Y) = 48 + 38 = 86
Var(X + Y) = Var(X) + Var(Y) = 36 + 25 = 61
SD(X + Y) = √61 = 7.81
c) Yes: Heavier ribbons are not expected to be found only on heavier medals.
d) A
Details and Comments:
a) and b) The variance of a sum or difference of two independent variables is always the
sum of the individual variances. Remember that calculations are not done with standard
deviations; combine variances first and then take the square root.
d) The sum of two random variables generally has greater variability than a single
random variable. However, if the question had asked about the mean of two measures
rather than the sum, then the mean would have lesser variability than a single measure.
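As a rough check of parts a) and b), the rule "add the variances, then take the square root" can be sketched as follows. The function name `combine_independent` is hypothetical, and the SDs 6, 2 and 5 are simply the square roots of the variances 36, 4 and 25 used above.

```python
from math import sqrt


def combine_independent(mean_x, sd_x, mean_y, sd_y, sign=1):
    """Mean and SD of X + Y (sign=1) or X - Y (sign=-1) for independent X and Y."""
    mean = mean_x + sign * mean_y
    # Variances add for both the sum and the difference.
    sd = sqrt(sd_x ** 2 + sd_y ** 2)
    return mean, sd


diff_mean, diff_sd = combine_independent(48, 6, 8, 2, sign=-1)   # part a)
sum_mean, sum_sd = combine_independent(48, 6, 38, 5, sign=1)     # part b)

print(diff_mean, round(diff_sd, 2))
print(sum_mean, round(sum_sd, 2))
```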


MT2010: Answer 7
a) 68%
b) B
c) B
d) B
Details for d): Pr(x̄ < 240) = Pr(z < [240 - 243] / [3/√9]) = Pr(z < -3) = 0.0013

Details and Comments:
a) Use the 68/95/99.7 (or Empirical) Rule: 243 ± 1σ = 243 ± 3 = (240, 246)
b) Remember that "standard deviation of the sampling distribution" is another name for
"standard error", and the standard error of the sample mean is σ/√n. If the sample size
becomes 2n, then σ/√(2n) = (1/√2)(σ/√n), so the new SE is (1/√2) times the old SE.
c) σ/√n = 3/√n = 0.3, so √n = 3/0.3 = 10, and n = 100
d) Since the question is about the sample mean, the standardization uses σ/√n = 3/√9.
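The standard error calculations in b) to d) can be sketched as below. The sample size n = 9 in part d) is an assumption inferred from the z-value of -3 with σ = 3; the sample size 50 in part b) is an arbitrary illustrative value. The exact normal probability differs slightly from the table value 0.0013 because of rounding.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 243, 3


def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the sample mean."""
    return sigma / sqrt(n)


# b) Doubling the sample size multiplies the SE by 1/sqrt(2), whatever n is.
ratio = standard_error(sigma, 2 * 50) / standard_error(sigma, 50)

# c) Solve sigma / sqrt(n) = 0.3 for n.
n_needed = (sigma / 0.3) ** 2

# d) Standardize the sample mean using the SE (n = 9 is assumed, see above).
z = (240 - mu) / standard_error(sigma, 9)
p = NormalDist().cdf(z)

print(round(ratio, 4), round(n_needed), z, round(p, 4))
```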
MT2010: Answer 8
a) B b) B c) B d) B e) A

Details and Comments:
a) Reason: p̂ = 70/100 = 0.70
b) Reason: SE(p̂) = √(p̂(1 - p̂)/n) = √((0.70)(0.30)/100) = 0.0458, or about 0.046
c) Reason: 1.645(0.0458) = 0.075
d) Reason: n = (1.96²)(0.6)(0.4)/(0.01²) = 9220
e) The data are a convenience sample since people choose whether or not to phone in!
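A minimal sketch of the arithmetic in a) to d), assuming the values quoted in the answer key (70 successes out of 100, the multiplier 1.645, and a planning proportion of 0.6 with a margin of 0.01 at 95% confidence). Variable names are hypothetical.

```python
from math import ceil, sqrt

# a) Sample proportion.
n = 100
p_hat = 70 / n

# b) Standard error of the sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

# c) Margin of error with the multiplier 1.645 quoted in the answer.
margin = 1.645 * se

# d) Sample size for a margin of 0.01 at 95% confidence, planning value p = 0.6.
n_required = ceil(1.96 ** 2 * 0.6 * 0.4 / 0.01 ** 2)

print(p_hat, round(se, 3), round(margin, 3), n_required)
```

Rounding the required sample size up, rather than to the nearest integer, keeps the margin of error at or below the target.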
MT2010: Answer 9
a) H0: μ = 7.5; Ha: μ ≠ 7.5
b) t = (x̄ - μ0)/(s/√n) = (7.0 - 7.5)/(2.0/√100) = -2.5
c) 0.01 < P-value < 0.02
d) C
e) There is evidence to say that the mean length of time employees have worked for
Canada Post has changed from 20 years ago.
f) B. 7.0 ± 1.984(2.0/√100) = 7.0 ± 0.4
Details and Comments:
a) This is a t-test of a single mean. The alternative hypothesis is two-tailed since the
question asks whether the mean length of time had "changed" from the value of 7.5. This
is not a directional hypothesis.
b) & c) Note that although the value of the test statistic is negative, you look up the
positive value in the t-table to find the P-value.
d) Since the P-value is less than 0.05, reject the null hypothesis at the 5% level; but since
the P-value is greater than 0.01, do not reject the null hypothesis at the 1% level.
e) The conclusion must be in terms of the original question, not just "Reject H0".
f) The multiplier is based on a t with n - 1 = 99 df. Use the value for 100 df in the table.
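The test statistic and the confidence interval can be sketched as follows. The sample mean 7.0, sample SD 2.0 and n = 100 are read from part f) of the answer; the multiplier 1.984 is the table value quoted there, not something computed by the code.

```python
from math import sqrt

# Values inferred from the answer to part f).
x_bar, s, n = 7.0, 2.0, 100
mu_0 = 7.5            # hypothesized mean under H0
t_star = 1.984        # t-table multiplier quoted in the answer key

se = s / sqrt(n)

# b) One-sample t statistic.
t = (x_bar - mu_0) / se

# f) 95% confidence interval for the mean.
margin = t_star * se
interval = (x_bar - margin, x_bar + margin)

print(t, round(margin, 1), tuple(round(v, 1) for v in interval))
```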
Bonus Question: Olympic Rings colours
BLUE
BLACK
RED
YELLOW
GREEN
END OF ANSWERS AND EXPLANATIONS TO MIDTERM 2010


Part B. Past Years Midterm Exams


A collection of questions from midterm exams of past years, with
answers and explanations
This section presents questions from midterm exams in recent years. Since course
syllabi, textbooks, order of topics, and even notation, have changed, not every question
from past exams is relevant today. So the exam questions have been reorganized by broad
topic area as follows:
Section A: Descriptive Statistics
Section B: Scatterplots, Association, Correlation, Least Squares Regression
Section C: Normal Curve, Sampling Distributions, Combining Random Variables
Section D: Introduction to Inference, Confidence Intervals, Hypothesis Tests
Section E: Miscellaneous
Questions in each topic area are arranged from the most recent year and go back in time.
Following the questions in each topic area is a set of answers and explanations/comments
about the answers. The comments give details of calculations and common errors made
by students.
Since the teaching of any course is dynamic and always undergoing change, there may
still be some terminology or notation or even a few parts of questions which are
unfamiliar to you. If you are unclear whether a particular question or topic is relevant to
the current year, please ask your instructor.


SECTION A: DESCRIPTIVE STATISTICS


Question A1 (MT2009Q1) Not yet an Olympic Sport
NOTE: Parts g), h) and j) require material from later sections.
The following boxplots show the distributions of ages of the UBC male and female
underwater basket-weaving teams:

a) Which team has more members? (Circle the correct response)
     Male     Can't tell     Female     Same size

b) Given the information provided, which of the following is most likely the mean age of
the female team? (Circle the correct response)
     21     22     23     30

c) For each of the three measures below, fill in the numerical value in the blank provided
and then decide if each is a measure of shape, centre, spread, or none of these (circle one
choice for each measure):
                                   Value:        Is a measure of:
Interquartile range (for males):   _________     Shape    Centre    Spread    None
50th percentile (for females):     _________     Shape    Centre    Spread    None
Oldest male member:                _________     Shape    Centre    Spread    None

d) The distribution of male ages is: (circle the correct response)
     Symmetric     Skewed to the left     Skewed to the right

e) The distribution of female ages is: (circle the correct response)
     Symmetric     Skewed to the left     Skewed to the right

f) The mean male age is 22.5 years. One of the members of the male team is 22 years old
and has a z-score of -0.25. What is the standard deviation of male ages?

g) If we assume that male ages are normally distributed, what proportion of males on the
team are 22 years of age or younger?

h) Which of the following is the best justification for the assumption of normality made
in part g)? (Check the best response)
__ A. The Law of Large Numbers
__ B. The Central Limit Theorem
__ C. Least squares regression
__ D. None of the above

i) Team members are required to take a course in the history of underwater
basket-weaving. The professor records the values of several variables for each student.
These variables are listed below. For each one, decide whether it has been recorded as
quantitative or categorical.
Score on the final exam (out of 200 points)       Quantitative     Categorical
Final grade for the course (A, B, C, D, or F)     Quantitative     Categorical
The number of lectures the student missed         Quantitative     Categorical
Brand name of favorite swimsuit                   Quantitative     Categorical

j) Universities across North America require underwater basket-weaving students to take
a quantitative skills test. Percentage scores on this test have a mean of 30% and a
standard deviation of 10%. Give a range within which you would expect to find the
middle 95% of all North American underwater basket-weaving student test scores.

Question A2 (MT2008Q1) There are two kinds of data -- good and bad!
a) Here is a small part of the data set in which CyberStat Corporation records information
about its employees.
Employee #    Surname    Age    Gender    Salary     Job Type
11234         Smith      39     Female    $62,100    Management
23467         Jones      27     Male      $47,350    Technical
98543         Chan       22     Female    $25,250    Clerical
76548         Wong       48     Male      $77,600    Management
Circle the names of the variables below which are recorded as quantitative scale variables
in the data set above.
     Employee #     Surname     Age     Gender     Salary     Job Type

b) Three small Statistics classes all took the same test. Histograms of the scores for each
class are shown below.

(i) Which class had the highest mean score?                       1     2     3
(ii) Which class had the highest median score?                    1     2     3
(iii) For which class are the mean & median most different?       1     2     3
(iv) Which class had the smallest standard deviation?             1     2     3

c) For each of these variables, decide whether its distribution is more likely symmetric or
skewed right (i.e. long right-hand tail) or skewed left (i.e. long left-hand tail). Circle one
choice for each variable.
Individual incomes in the United States     Symmetric     Skewed right     Skewed left
Age of male heart attack victims            Symmetric     Skewed right     Skewed left
Lifetimes of electric light bulbs           Symmetric     Skewed right     Skewed left
IQ scores of the Canadian population        Symmetric     Skewed right     Skewed left

Question A3 (MT2008Q2) A Nash-ional Game


The data set to the right contains all the point differentials or margins in all NBA
games played by the Phoenix Suns up to February 13 of the 2007/08 season.
Negative numbers indicate losses, positive numbers indicate wins. The data have
been arranged in ascending order for you (biggest loss to biggest win).
a) Compute the various numerical summaries and put them into the table below
part b) under "original data". Some have been computed for you.
NOTE: Part b) is not part of the current curriculum. You can ignore it. But
think of it as a challenge question. It is easy to figure out. Instructions
are given in the Answers/Comments.
b) Suppose the data undergo a transformation such that X* = 2X - 3, where X =
the original variable and X* is the new, transformed variable. Find all of the
numerical summaries for X* and put them into the table below under
"transformed data".
             Original Data (X)     Transformed Data (X*)
Mean              5.6
Median
Range
Q1
Q3
IQR
Std dev           11.7

c) Are there any outliers? Use the inner fences definition of outliers and the
original data (not the transformed data) to identify any outliers.
Lower inner fence = ___________________
Upper inner fence = ___________________
Observation numbers of outliers = _______

Obs#   Margin
1      -22
2      -21
3      -15
4      -10
5      -9
6      -7
7      -7
8      -7
9      -6
10     -5
11     -4
12     -3
13     -3
14     -3
15     -2
16     -2
17     1
18     2
19     2
20     4
21     4
22     4
23     5
24     5
25     5
26     6
27     7
28     7
29     8
30     8
31     8
32     9
33     10
34     10
35     10
36     11
37     11
38     11
39     11
40     11
41     14
42     15
43     15
44     16
45     19
46     19
47     20
48     20
49     22
50     24
51     31
52     33

Question A4 (MT2007Q1) "Data, data, data! I can't make bricks without clay!" (Sherlock Holmes)
a) A sample of shoppers at a mall was asked the following questions. Decide whether the
type of data are more likely to be quantitative or categorical. (Circle your choice)
What is your age (in years)?                                       Categorical     Quantitative
How much did you spend (in $)?                                     Categorical     Quantitative
What is your marital status?                                       Categorical     Quantitative
Rate the availability of parking. (Excellent, Good, Fair, Poor)    Categorical     Quantitative

b) Here is a table of sources of electricity in Canada and the US and the percentage of
electricity generated by each. Construct a bar graph to compare Canada and the US.
Do NOT use separate sets of axes for each graph.
Source          Canada (%)     US (%)
Hydropower          65            6
Coal                16           51
Nuclear             16           21
Natural Gas          1           16
Other                2            6

c) A news article reports that, "Of the 411 players on National Basketball Association
rosters in February 1998, only 139 made more than the league _______ salary of $2.36
million." Which word should go in the blank, "mean" or "median"? That is, is $2.36 million
the mean or median salary for NBA players? Explain why, in one sentence only.
d) A study was made of the age of entering first-year university students. Which of the
following is most likely to be the standard deviation? Explain why, in one sentence only.
__A. 1 month
__B. 1 year
__C. 5 years

e) The following histogram displays the December 2000 percentage unemployment rates
in the 50 U.S. states and Puerto Rico. The labels on the horizontal axis should be
interpreted as follows: the bar labelled "1" represents rates of 1.0% to 1.9%, the bar
labelled "2" represents rates of 2.0% to 2.9%, etc.
[Histogram not reproduced: Number of States (vertical axis, 0 to 24) against Unemployment Rate (horizontal axis).]

(i) What percentage of the rates (out of a total of 51 observations) is 5.0% or greater?
(ii) Estimate the median unemployment rate.
f) You have decided to sell your home. The market is booming now with the 2010
Olympic Games preparations, and therefore most sellers of houses with similar
characteristics have received extremely good deals in the past few months. You ask the
realtor for a summary of net prices of homes sold in your neighborhood. The realtor
hands you the following two density curves, one of them of the prices of homes sold in
the past few months in your neighborhood, and the other of the prices of homes sold
during a deep economic recession.

[Figure not reproduced: two density curves, labelled Curve A and Curve B.]

(i) Under the given assumptions, which of the two curves better represents the
distribution of prices of homes sold in the past few months? Circle your answer choice.
     Curve A     Curve B
(ii) A potential buyer offers to give you the mean, the median or the mode of the prices of
all the homes sold in the past few months in your neighborhood. Assuming that the
density curve is the one you chose in (i) directly above, which numerical measure would
you prefer? Circle your answer choice.
If you chose Curve A:     Mean     Median     Mode
OR:
If you chose Curve B:     Mean     Median     Mode

(iii) You are told that the mean price of 50 houses sold is $700,000. However, you notice
that there was a mistake in the calculation, and that one of the buyers paid $500,000
instead of the $800,000 that was used when making this calculation. What is the actual
mean price of the 50 houses sold?


Question A5 (MT2007Q2) "Teach Your Children" (Book of Deuteronomy)


In 2003, the salaries (in $) of secondary school classroom teachers in the United States
gave the following descriptive statistics:
Minimum = 31,200
Q1      = 37,400
Median  = 40,000
Q3      = 48,400
Maximum = 57,200
a) Find the range and interquartile range.
Range = _______          Interquartile range = _______

b) Are there any outliers, as defined by the 1.5 × IQR rule (a.k.a. inner fences)?
Explain.
     No     Yes

c) Predict the direction of skewness for this distribution. Explain.


d) If the distribution, although somewhat skewed, is approximately bell-shaped, which
one of the following would be the most realistic value for the standard deviation?
__A. 100
__B. 1,000
__C. 5,000
__D. 25,000
e) NOTE: Part e) is not part of the current curriculum; it is similar to Question A3 b)
above. Detailed instructions are given in the Answers / Comments to Question A3 b).
It is very worthwhile to learn this!
Suppose each teacher were to receive a 10% increase in salary for 2004. For each of the
four statistics in the left-hand column, write the number of the phrase from the list in the
right-hand column that states how each statistic will change. (Note: Some phrases may be
used more than once; others may not be used at all.)
Write the phrase number where indicated by the arrows.

____  Median                  1. Will be multiplied by
____  IQR                     2. Will be multiplied by 0.10
____  Range                   3. Will be multiplied by 1.10
____  Standard Deviation      4. Will remain unchanged

Question A6 (MT2006Q1) Call centre (and spread and shape)


Here are the numbers of calls answered by 20 workers in a call centre on a particular day:
13   13   14   16   17   17   18   18   19   19
19   19   20   20   21   21   22   24   25   25

The mean number of calls is 19 and the standard deviation is 3.49.


a) What is the variable in this dataset?
__A. Days
__B. Call centre workers
__C. Length of calls
__D. Numbers of calls per day
b) Describe this distribution with a five-number summary: (Min, Q1, Median, Q3, Max).
Min = ___     Q1 = ___     Median = ___     Q3 = ___     Max = ___

c) Compute the inner fences using the 1.5 × IQR criterion.

Identify any outliers (using inner fences) _________________ (If none, write "None")

Consider the following histograms.

[Histograms (a) and (b) not reproduced: each plots Count against Number of calls. One uses
class intervals of width 2 (11.5-13.5, 13.5-15.5, ..., 23.5-25.5); the other uses class
intervals of width 3 (11.5-14.5, 14.5-17.5, ..., 23.5-26.5).]


d) Which of these two histograms describes the dataset given at the start of the question?
__A. Histogram (a)
__B. Histogram (b)
__C. Both of them
__D. Neither one of them
e) Assuming that a workday in the call centre is 8 hours long and the workers are on the
phone 60% of the time, what is the mean length of a call?
__A. About 15 minutes
__B. About 19 minutes
__C. About 25 minutes
__D. Cannot tell, since the dataset does not contain data about individual calls


SECTION A: ANSWERS AND EXPLANATIONS


Answer to Question A1 (MT2009Q1)
a) Can't tell
b) 23
c) IQR = 3, Spread; 50th percentile = 22, Centre; Oldest male = 27, None
d) Symmetric
e) Skewed to the right
f) Z = -0.25 = (22 - 22.5)/σ, so σ = (22 - 22.5)/(-0.25) = 2
g) Pr(Z < -0.25) = 0.4013
h) D. None of the above
i) Quantitative, Categorical, Quantitative, Categorical
j) Empirical (68-95-99.7) Rule: 30 ± 20 = (10, 50)
(Also accept 30 ± 19.6)
Note: Parts g), h) and j) are about Sampling Distributions and the Normal Model.
Check your notes or the textbook.
Details and Comments:
a) Boxplots do not show sample sizes; they only show: min, Q1, median, Q3, and max.
b) Since the age distribution for females is strongly skewed to the right, the mean is
greater than the median. The median (from the graph) is 22, so the mean must be a little
larger, hence 23. Note that 30 is close to the maximum and far above Q3 so it is not a
realistic estimate of the mean.
c) IQR (Males) = 24 - 21 = 3; 50th percentile (Females) = median = 22; Oldest Male = max = 27
f) Use the formula for standardizing X to Z; however, here both the values of X and Z are
given and it is the value of σ which is unknown.
h) The Central Limit Theorem cannot be used as the reason here since the sample is
unlikely to be large.
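Parts f), g) and j) can be verified with a short sketch. It assumes, as part g) does, that male ages follow a normal model; the variable names are hypothetical.

```python
from statistics import NormalDist

# f) Solve z = (x - mean) / sd for the standard deviation.
mean_age, age, z = 22.5, 22, -0.25
sd = (age - mean_age) / z

# g) Proportion aged 22 or younger under the assumed normal model.
p_22_or_younger = NormalDist(mean_age, sd).cdf(age)

# j) Empirical rule: the middle 95% lies within about 2 SDs of the mean.
test_mean, test_sd = 30, 10
middle_95 = (test_mean - 2 * test_sd, test_mean + 2 * test_sd)

print(sd, round(p_22_or_younger, 4), middle_95)
```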

Answer to Question A2 (MT2008Q1)


a) The quantitative variables are Age and Salary.
b) Answers: 3, 3, 3, 1.
c) Answers:
Individual incomes in the United States: Skewed right (long right-hand tail)
Age of male heart attack victims: Skewed left (long left-hand tail)
Lifetimes of electric light bulbs: Skewed right (long right-hand tail)
IQ scores of the Canadian population: Symmetric (equal tails)
Details and Comments:
a) Gender and Job Type are categorical; Employee # and Surname are simply strings and
used as identifier variables. Taking the mean of the Employee # would not make sense.
b) Class 3 has much more area to the right than Class 1 or Class 2 so the mean and
median are also shifted to the right. And since the histogram for Class 3 shows the
greatest skewness, it has the greatest difference between mean and median. Class 1 is less
spread out (the tails are both smaller than in the other two classes) so it has the smallest
standard deviation.


c) Incomes are skewed right because fewer people have very large incomes, more people
have incomes at the lower end or middle.
Age of heart attack victims is skewed left because heart attacks are much more likely in
older people.
Lifetimes of bulbs are skewed right because most bulbs last the amount of time they are
engineered to last but some will last much longer; that is, quality is designed in. Only a
few will fail early. Lifetimes in general are skewed right.
Answer to Question A3 (MT2008Q2)
a) and b)
             Original Data (X)     Transformed Data (X*)
Mean              5.6                   8.2
Median            6.5                   10
Range             55                    110
Q1                -3                    -9
Q3                11                    19
IQR               14                    28
Std dev           11.7                  23.4
c) Lower inner fence = -3 - 1.5(14) = -24
Upper inner fence = 11 + 1.5(14) = 32
Observation numbers of outliers = 52
Details and Comments:
Note that the question asked for the observation number(s), not the margin!
For part b): Suppose the data are transformed (linearly) as follows: X* = a + bX; that is,
multiply the original observations by "b" and then add "a". That shifts all the values of X
up or down by the amount "a" and changes the size of the unit of measurement by "b".
Taking b > 0, as in this question:
Mean(X*) = a + b × Mean(X);
Median(X*) = a + b × Median(X);
Range(X*) = b × Range(X); [the effect of "a" is cancelled]
Q1(X*) = a + b × Q1(X);
Q3(X*) = a + b × Q3(X);
IQR(X*) = b × IQR(X); [the effect of "a" is cancelled]
SD(X*) = b × SD(X); [the effect of "a" is cancelled]
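The summaries, fences and transformation rules can be checked against the data with the sketch below. It assumes the quartiles are found as the medians of the lower and upper halves of the sorted data, which reproduces the answer key here; other quartile conventions may give slightly different values. The helper name `quartiles` is hypothetical.

```python
from statistics import mean, median

# The 52 point margins from Question A3, already in ascending order.
margins = [-22, -21, -15, -10, -9, -7, -7, -7, -6, -5, -4, -3, -3,
           -3, -2, -2, 1, 2, 2, 4, 4, 4, 5, 5, 5, 6,
           7, 7, 8, 8, 8, 9, 10, 10, 10, 11, 11, 11, 11,
           11, 14, 15, 15, 16, 19, 19, 20, 20, 22, 24, 31, 33]


def quartiles(sorted_values):
    """Q1 and Q3 as medians of the lower and upper halves (assumed convention)."""
    half = len(sorted_values) // 2
    return median(sorted_values[:half]), median(sorted_values[-half:])


q1, q3 = quartiles(margins)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observation numbers (1-based) of values outside the inner fences.
outliers = [i for i, m in enumerate(margins, start=1)
            if m < lower_fence or m > upper_fence]

# Linear transformation X* = a + bX with a = -3 and b = 2.
a, b = -3, 2
transformed = [a + b * m for m in margins]

print(round(mean(margins), 1), median(margins), q1, q3, iqr)
print(lower_fence, upper_fence, outliers)
print(round(mean(transformed), 1), median(transformed),
      max(transformed) - min(transformed))
```

Recomputing the summaries from the transformed list and comparing them with a + b × (original summary) is a quick way to confirm each rule listed above.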


Answer to Question A4 (MT2007Q1)


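a) What is your age (in years)?          Quantitative
How much did you spend (in $)?           Quantitative
What is your marital status?             Categorical
Rate the availability of parking         Categorical

b) [Bar graphs not reproduced: two acceptable layouts were shown, each plotting the Canada
and US percentages for every source on a single set of axes.]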

c) Of the 411 players on National Basketball Association rosters in February 1998, only
139 made more than the league MEAN salary of $2.36 million. If it were the median,
then half of the 411 players (i.e. 205 or 206) would exceed the value.
d) 1 year is the typical difference in age between entering first-year university students.
e) (i) 5/51 = 0.098, so 9.8%. It is also acceptable to round to 10%.
(ii) The median is in the 3.0-3.9 interval, so the median is best estimated as the midpoint
of that interval at 3.5%.
Comment: It is also acceptable to give the range 3.0-3.9. It is not acceptable to estimate
the median as 3.0%.
f) (i) Curve B
(ii) If you chose Curve A: Mean
If you chose Curve B: Mode
Note: The two choices offered in part (ii) are to give you a chance to get the correct
answer to part (ii) even if you made the wrong choice in part (i).
(iii) [(50 × 700,000) - 300,000]/50 = $694,000
Comment: Use the formula for mean and adjust accordingly.
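The adjustment in part (iii) amounts to rebuilding the total, swapping the wrong value for the right one, and dividing again. A minimal sketch, with a hypothetical function name:

```python
def corrected_mean(reported_mean, n, wrong_value, right_value):
    """Mean after replacing one mis-recorded value in a sample of size n."""
    total = reported_mean * n              # total implied by the reported mean
    total = total - wrong_value + right_value
    return total / n


actual_mean = corrected_mean(700_000, 50, wrong_value=800_000, right_value=500_000)
print(actual_mean)
```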


Answer to Question A5 (MT2007Q2)


a) Range = $26,000 (i.e. 57,200 - 31,200)
Interquartile range = $11,000 (i.e. 48,400 - 37,400)
b) No: Q1 - 1.5 × IQR = 20,900; Q3 + 1.5 × IQR = 64,900
There are no values outside this range, so there are no outliers.
c) The distribution is skewed to the right (i.e. long right-hand tail) since the distance
between the median and Q3, and between Q3 and the maximum, are greater than the
distance between the median and Q1, and between Q1 and the minimum.
d) C. 5,000
Use the rule of thumb that SD is approx. equal to Range/6 = 26,000/6 = 4333
e) 3, 3, 3, 3
Details and Comments:
All four quantities are multiplied by 1.10 (that's what a 10% increase means). Since each
data value is multiplied by 1.10, so are the minimum and maximum (and therefore the
range), the quartiles (and therefore the IQR), and the median. And when all values are
scaled up by a constant, the standard deviation is scaled up by the same constant. See
Answer to Question A3 (MT2008Q2) for details of linear transformations.
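Parts a), b) and e) can be sketched from the five-number summary alone. The dictionary and the function `spread_measures` are hypothetical names; the 10% raise is applied as multiplication by 11/10 to keep the arithmetic exact.

```python
# Five-number summary of the 2003 salaries (in $) from Question A5.
summary = {"min": 31_200, "q1": 37_400, "median": 40_000,
           "q3": 48_400, "max": 57_200}


def spread_measures(s):
    """Range, IQR and inner fences computed from a five-number summary."""
    value_range = s["max"] - s["min"]
    iqr = s["q3"] - s["q1"]
    fences = (s["q1"] - 1.5 * iqr, s["q3"] + 1.5 * iqr)
    return value_range, iqr, fences


value_range, iqr, fences = spread_measures(summary)

# b) With only a summary, it is enough to check the minimum and maximum.
has_outliers = summary["min"] < fences[0] or summary["max"] > fences[1]

# e) A 10% raise multiplies every salary, and so every summary value, by 1.10.
raised = {name: value * 11 / 10 for name, value in summary.items()}
raised_range, raised_iqr, _ = spread_measures(raised)

print(value_range, iqr, fences, has_outliers)
print(raised["median"], raised_range, raised_iqr)
```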

Answer to Question A6 (MT2006Q1)


Here are the numbers of calls answered by 20 workers in a call centre on a particular day:
13   13   14   16   17   17   18   18   19   19
19   19   20   20   21   21   22   24   25   25

The mean number of calls is 19 and the standard deviation is 3.49.


a) D. Numbers of calls per day
b) Min = 13; Q1 = 17; Median = 19; Q3 = 21; Max = 25
c) IQR = 4: Inner fences: 11 and 27. There are no outliers.
d) C. Both of them
e) A. About 15 minutes
Details and Comments:
b) Q1 is between the 5th and 6th observations; Q3 is between the 15th and 16th
observations. Fortunately, here the 5th and 6th observations are the same, and the 15th and
16th observations are the same, so there is no ambiguity about Q1 or Q3.
c) IQR = 21 - 17 = 4
Lower inner fence = 17 - (1.5)(4) = 11; Upper inner fence = 21 + (1.5)(4) = 27
No data values are less than 11 or more than 27.
d) Notice the effect that changing the bins (i.e. class intervals) has on the look of the
histogram.
e) (0.60 × 8 × 60)/19 = 288/19, or about 15 minutes
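The answers to b), c) and e) can be reproduced from the 20 data values with the sketch below. As in the comment on part b), the quartiles are taken as the medians of the lower and upper halves of the sorted data; that convention is an assumption, and software defaults may differ.

```python
from statistics import mean, median, stdev

# Calls answered by the 20 workers, in ascending order.
calls = [13, 13, 14, 16, 17, 17, 18, 18, 19, 19,
         19, 19, 20, 20, 21, 21, 22, 24, 25, 25]

half = len(calls) // 2
q1, q3 = median(calls[:half]), median(calls[half:])
five_number = (min(calls), q1, median(calls), q3, max(calls))

# c) Inner fences and any values beyond them.
iqr = q3 - q1
fences = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
outliers = [c for c in calls if c < fences[0] or c > fences[1]]

# e) Minutes on the phone per day divided by the mean number of calls.
minutes_on_phone = 0.60 * 8 * 60
minutes_per_call = minutes_on_phone / mean(calls)

print(five_number, fences, outliers)
print(mean(calls), round(stdev(calls), 2), round(minutes_per_call, 1))
```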

SECTION B: SCATTERPLOTS, ASSOCIATION, CORRELATION, LEAST SQUARES REGRESSION

Question B1 (MT2009Q2) The Need for Speed
Highway planners investigated the relationship between traffic density (number of cars
per mile) and the average speed of the traffic on a large city thoroughfare. The data were
collected at the same location at 10 different times over a span of three months. They
found a mean traffic density of 70 cars per mile (cpm) with standard deviation of 27 cpm;
this variable is called Cars. Overall the cars' average speed was 26.5 mph with standard
deviation of 10 mph; this variable is called Speed. These researchers found the least
squares regression line for predicting Speed from Cars to be:
.
a) Compute the value of the correlation coefficient between Speed and Cars?
b) What percent of the variation in average speed is explained by traffic density? (Round
your answer to the nearest whole percent.)
c) Predict the average speed of traffic on the thoroughfare when traffic density is 50 cpm.
d) Using the prediction you made in part c), what is the value of the residual for a traffic
density of 50 cpm when the observed speed was 32.5 mph?
e) What prediction would you make for the average speed when the traffic density is
unknown?
f) The data set initially included the point Cars = 125 cpm, Speed = 55 mph. This point
was considered an outlier and was not included in the analysis. Will the slope of the least
squares regression line increase, decrease or stay the same if we redo the analysis and
include this point?
     Increase     Decrease     Stay the same

g) A Canadian member of the research team measured the speed of the cars in kilometres
per hour (1 km ≈ 0.62 miles) and the traffic density in cars per kilometre. What is the
value of his calculated correlation between speed and traffic density?
h) Does this study demonstrate that traffic density is a causal factor in explaining the
average speed of the traffic on the thoroughfare?
i) Suppose another researcher got confused about which was the response variable and
which was the explanatory variable and computed a linear regression model to predict
Cars from Speed. What would the slope of this line be? Report a maximum of two
decimal places.


Question B2 (MT2008Q4) Drawing a wine line


There is some evidence that drinking moderate amounts of wine helps prevent heart
attacks. Researchers collected data on yearly wine consumption (litres of alcohol from
drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000
people) in 19 developed nations, including Canada.
a) The scatterplot below shows the relationship between heart disease death rates and
wine consumption for the 19 developed nations.

Describe the shape, direction and strength of this relationship by circling the best choice.
Shape:         Linear          Curved            No Pattern
Direction:     Positive        Negative          Neither
Strength:      Very strong     Fairly strong     Quite weak

b) The correlation between heart disease death rates and national wine consumption is
r = -0.843. What does a negative correlation say about wine consumption and heart
disease deaths? Answer this question by circling the appropriate italicized words below:
High wine consumption goes with (more / fewer) heart disease deaths,
while low consumption goes with (more / fewer) deaths.
c) Do you think these data give good evidence that drinking wine causes a reduction in
heart disease deaths? Explain why, in one sentence only.
d) About what percent of the variation among countries in heart disease rates is explained
by the straight-line relationship with wine consumption? Report the percentage to the
nearest whole number.


e) The least squares regression line for predicting heart disease death rate from wine
consumption is:
Use this equation to predict the heart disease rate in another country where adults average
4 litres of alcohol from drinking wine each year.
f) What is the predicted heart disease rate for a country that drinks enough wine to supply
150 litres of alcohol per person? Can this result be true? Explain why using the
least-squares regression line for this prediction is not justified.
g) Which of the three figures below corresponds to the plot of least-squares residuals
versus national wine consumption? Hint: Look at the vertical axis.

Circle your choice:     Graph (a)     Graph (b)     Graph (c)

Question B3 (MT2008Q5) Causal relationships differ from casual relationships!


a) Here are some general statements about regression. For each one, decide whether it is
True (T) or False (F) and circle the letter of your answer.
T   F   The least-squares residual is defined as the difference between an
        observed value of the explanatory variable and the value predicted
        by the least-squares regression line.
T   F   The least-squares residuals add up to 0.
T   F   If the least-squares regression line fits the data poorly, the residual
        plot will exhibit a systematic pattern.
T   F   Removing an influential observation will markedly change the
        least-squares regression line.

b) Someone says, "There is a strong positive correlation between the number of
firefighters at a fire and the amount of damage the fire does. So sending lots of
firefighters just causes more damage." Explain why this reasoning is wrong, in one
sentence only.
c) A study shows that there is a positive correlation between the size of a hospital
(measured by its number of beds, X) and the median number of days, Y, that patients
remain in the hospital. Does this mean that you can shorten a hospital stay by choosing a
small hospital? Explain why, in one sentence only!
Circle one:     Yes     No

d) Which set of two variables is most likely to have a cause and effect relationship?
__ A. The height of a person and their corresponding weight
__ B. The weight of a box and the postage required to ship the box to Toronto
__ C. The make of a car and the fuel efficiency (miles per gallon) of the car
__ D. The age of a teacher and their corresponding yearly income


Question B4 (MT2007Q5) Good relationships are everything


a) Rank the following in order, from highest to lowest, by the correlations you would
estimate there to be between:
A Purchase price of a home and the yearly income of the purchasing family
B Purchase price of a home and the shoe size of the purchaser
C Asking price and corresponding selling price of homes in Greater Vancouver
D Purchase price of a home and age of the buyer
Write the letters in order from highest to lowest.
(Highest)  __  __  __  __  (Lowest)
b) A survey of 3000 medical records showed that smokers are more inclined to get
depressed than non-smokers. Does this necessarily imply that smoking causes depression?
Explain, in one sentence only!

c) For each of the eight sections of last year's C291 course, the average midterm exam
mark and the average final exam mark were calculated. The correlation for the eight pairs
of averages was +0.97. Does this mean that the relationship between a student's midterm
and final exam marks, for all students in the eight classes, is almost exactly a
straight line? Explain, in one sentence only!
__Yes     __No

d) A study is made of people who stutter. Each subject is asked to read two passages of
equal length, and the number of times they stutter while reading each passage is recorded.
The researchers discover that the subjects who stuttered many times on the first passage
tended to stutter fewer times on the second passage. They conclude that the subjects who
stuttered many times on the first passage must have been nervous the first time and more
relaxed the second time, so that they tended to stutter less. Do you agree? Explain, in one
sentence only.
__Yes     __No

e) Studies show that in the period from 1850 to 1900 in the United States, the average
marriage lasted only 12 years. Does this show that the divorce rate was high in that
period? Explain, in one sentence only!
__Yes     __No

Question B5 (MT2007Q6) Bed-size kings?


An expert consultant in hospital resource planning states that the number of open beds
that a hospital can use effectively should be estimated by the number of FTEs (full-time
equivalent employees) on staff. The consultant collected data on the number of open beds
and number of FTEs for 12 hospitals, and computed the means and SDs as follows:
Number of open beds:     Mean = 50      SD = 20
Number of FTEs:          Mean = 140     SD = 40

She computed the least squares regression equation and found that for a hospital with 100
FTEs, the estimated number of open beds was 32.
a) Use this information to compute the value of the correlation coefficient.

b) What is the least squares regression equation she found?

c) From the available data, what would you predict the number of open beds to be for a
hospital with an unknown number of FTEs?

d) What fraction of the variation in number of open beds is explained by the number of
FTEs?

e) Another expert consultant, this one in hospital administration, claims that the
regression was done the wrong way around, and that the number of FTEs required in a
hospital should be estimated from the number of open beds in the hospital. What would
the value of the correlation coefficient be if the analysis were done this way?


Question B6 (MT2006Q2) Putting the pedal to the medal


Torino 2006: In this question we will look at some aspects of medals citizens of the
mythical country of Statland have won over the past 30 years in Winter Olympic Games.
Statlander Olympic Medals can be divided into team medals and individual medals. Here
are the numbers of medals that Statlanders have won in the past eight Winter Olympic
Games:
Site (Year): Innsbruck (1976), Lake Placid (1980), Sarajevo (1984), Calgary (1988), Albertville (1992), Lillehammer (1994), Nagano (1998), Salt Lake City (2002)
Team medals / Individual medals: 13 37 40 49 86

Here is a scatterplot of the number of team medals Statland has won since Innsbruck
versus the corresponding individual medals.
a) What is the correlation, r, between these two variables?
__A. 0.98
__B. -0.98
__C. 0.58
__D. -0.58

[Scatterplot: Medals: Individuals (vertical axis, 0 to 100) versus Medals: Team (horizontal axis)]

b) How would r change if Statland had won exactly 20 more individual medals in each of
these eight Winter Olympic Games?
__A. r would be 20/8=2.5 times larger
__B. r would be 20/8=2.5 times smaller
__C. r would be the same
__D. r would increase, but I am not sure how much larger

c) Which word best describes the type of relationship between team and individual
medals?
__A. Causation
__B. Correlation
__C. Confounding
__D. Some other word starting with C


The number of medals Statland has won in Winter Olympics has grown dramatically over the
past 30 years. Here is the total number of medals Statland has won in each of the past eight
Winter Olympics:
Site              Year       Medals
Innsbruck         1976        3
Lake Placid       1980        2
Sarajevo          1984        4
Calgary           1988        8
Albertville       1992       42
Lillehammer       1994       45
Nagano            1998       57
Salt Lake City    2002       99

Mean              1989.25    32.5
SD                8.9        34.8

Least squares regression line: Medals = -7089 + 3.58 Year


d) Find the correlation coefficient.
Write your answer here: r = __________ (Report 2 decimal places only.)
Show your work:
[Scatterplot: Medals (vertical axis, 0 to 120) versus Year (horizontal axis, 1976 to 2004)]

e) How many medals do you expect Statland to win in 2010?


__A. 106 or 107
__B. Cannot say due to the outlier in the data
__C. Cannot say, because this involves extrapolating beyond the range of data
__D. Cannot say without examining the residual plot first


Question B7 (MT2006Q5) Wood flakes are not as tasty as corn flakes


Wood scientists are interested in replacing solid-wood building material by less
expensive products made from wood flakes. They collected data to examine the
relationship between L, the length (in metres) and T, the tensile strength (in kg per cm²)
of beams made of wood flakes.
a) Which is the response variable in this study?
The researchers computed the means to be \(\bar{L} = 3\) and \(\bar{T} = 20\), and the standard deviations
to be \(S_L = 1\) and \(S_T = 4\). The correlation between T and L is \(r_{TL} = -0.8\).
b) What is \(r_{LT}\) (i.e. the correlation between L and T)?
c) Write the regression equation (using the same response variable as your answer in a)).
Show your calculations:
d) Based on your answer in part c), what is the predicted strength (in kg/cm²) of a 2.5
metre beam?
e) In order to present the data in a wood industry conference in the US, the researchers
converted their data to units of feet and pound/inch² (1 metre = 3.28 feet, and 1 kg/cm² =
0.0704 pound/inch²). Let \(L^*\) and \(T^*\) be the converted variables. Then:
\(\bar{L}^*\) = _______
\(\bar{T}^*\) = _______
\(S_{L^*}\) = _______
\(S_{T^*}\) = _______
\(r_{L^*T^*}\) = _______
Show your calculations:
f) Which of the following observations (that are part of the dataset used for the regression
above) is the most influential observation?
__A. L = 2, T = 21.2
__B. L = 3, T = 20.0
__C. L = 4, T = 18.8
__D. L = 5, T = 13.6
g) The negative correlation between T and L for the beams produced by the new
technology means that:
__A. wood flake beams are stronger than solid wood beams
__B. wood flake beams are weaker than solid wood beams
__C. wood flake beams are shorter than solid wood beams
__D. shorter wood flake beams are weaker but longer ones are stronger
__E. shorter wood flake beams are stronger but longer ones are weaker


SECTION B: ANSWERS AND EXPLANATIONS


Answer to Question B1 (MT2009Q2)
a) \(r = b\,(s_x/s_y) = -0.35(27/10) = -0.945\)
b) \(r^2 = (-0.945)^2 = 0.89\), i.e. 89%
c) \(\hat{y} = 51 - 0.35(50) = 33.5\)
d) Residual = 32.5 - 33.5 = -1.0
e) Since x is unknown, use the mean of y = 26.5
f) Increase
g) Unchanged: -0.945
h) No; correlation does not prove causation, only association.
i) Switch the roles of x and y:
\(b = r\,(s_x/s_y) = (-0.945)(27/10) = -2.55\)

Details and Comments:


This question requires familiarity and facility with the formulas for the least squares
estimates.
a) Take the usual least squares formula for b, and solve for r. Notice what happens to the
ratio of the standard deviations. That is, since \(b = r\,(s_y/s_x)\), then \(r = b\,(s_x/s_y)\).
b) This is the definition of \(r^2\).
f) This point is far above the regression line, so including it will pull up the slope of the
line.

Answer to Question B2 (MT2008Q4)


a) Shape = Linear; Direction = Negative; Strength = Fairly strong
b) High wine consumption goes with fewer heart disease deaths,
while low consumption goes with more deaths.
c) No; correlation does not prove causation, only association.
d) \(r^2 = (-0.843)^2 = 0.711\), i.e. 71%
e) Substitute x=4 into the least squares equation:
\(\hat{y} = 260 - 23(4) = 168\) (or 168 per 100,000 people)
f) Substitute x=150 into the least squares equation:
\(\hat{y} = 260 - 23(150) = -3190\)
The predicted disease rate cannot be negative! The reason for the negative number is
extrapolation far beyond the range of data values.
g) Graph (b)
Details and Comments:
b) Negative correlation: as X increases, Y decreases and as X decreases, Y increases.
c) This is especially true in an observational study. Many other factors can be responsible
for heart disease deaths.
d) This is simply the definition of \(r^2\).
g) Residuals are centered around 0, some are positive and some are negative. In order for
Graph (a) to be the residual plot then all the points on the scatterplot would have to be
above the line. For Graph (c) all the points would be below the line. Neither would, of
course, be the best-fitting line.
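
The substitutions in parts d) to f) can be checked with a few lines of Python. This is only a sketch: the function name `predicted_rate` is an illustrative choice, and the coefficients are simply the ones quoted in the answer above.

```python
def predicted_rate(x):
    """Least squares line quoted in the answer above: y-hat = 260 - 23x."""
    return 260 - 23 * x

r = -0.843                      # correlation quoted in part d)
r_squared = r ** 2              # fraction of variation explained

inside_range = predicted_rate(4)      # part e)
far_outside = predicted_rate(150)     # part f): extrapolation gives a negative rate

print(round(r_squared, 3), inside_range, far_outside)
```

A negative predicted rate is the signal that x = 150 lies far outside the data used to fit the line.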

Answer to Question B3 (MT2008Q5)


a) F, T, T, T
b) No. The number of firefighters and amount of damage are common responses to the
seriousness of the fire; the more serious the fire, the greater the number of firefighters
sent and the greater the amount of damage.
c) No. Larger hospitals tend to take more serious cases, so the length of stay is longer, on
average.
d) B
Details and Comments:
a) In the first statement change the word "explanatory" to "response" to make it a true
statement; a residual is the difference between y and \(\hat{y}\).
d) Postal rates are set according to the weight of the package; as the weight increases it
causes the postage to increase. However, increasing the height does not necessarily
cause the weight to increase; and increasing the age of a teacher does not necessarily
cause an increase in salary. The make of the car doesn't cause the fuel efficiency; that's
due to engine design, etc.

Answer to Question B4 (MT2007Q5)


a) (Highest) C A D B (Lowest)
b) No. Depression could just as easily cause people to smoke!
c) No. This is correlation based on averages (called ecological correlation).
d) No. This is an example of the regression effect.
e) No. Life expectancy was much lower 150 years ago so marriages were shorter because
people died earlier!
Details and Comments:
a) In the 2007 real estate market, sellers generally got the price they asked for, or an
amount very close. Purchase price and yearly income are expected to have a strong
correlation, but purchase price and age are not. Shoe size is irrelevant with respect to
purchase price (except possibly for rare cases like professional basketball players).
b) Correlation is not causation; there are confounding factors involved.
c) Using averages suppresses the scatter in a scatterplot and gives false impressions of
correlation.
d) Once again, there are confounding factors involved.
e) Interpreting statistical results requires thinking about the historical and societal
context.


Answer to Question B5 (MT2007Q6)


a) There are multiple ways to solve this:
Since \(\bar{y} = a + b\bar{x}\): 50 = a + b(140)
Since \(\hat{y} = a + bx\): 32 = a + b(100)
Solve for b by taking the first equation minus the second equation: b = 18/40 = 0.45
Since \(b = r\,(s_y/s_x)\): 0.45 = r (20/40)
Hence r = 0.9
Another way: 100 is 1 SD below average in the X-variable; the prediction there is 32, which is r SDs
below average in the Y-variable. So 32 = 50 - r(20). Hence r = 0.9.
b) What is the least squares regression equation she found?
\(\hat{y} = -13 + 0.45x\)
\(b = 0.90(20/40) = 0.45\);
\(a = 50 - 0.45(140) = -13\)
c) If X is unknown, predict Y to be the mean of y, which is 50.
d) \(r^2 = 0.81\) (This is simply the definition of \(r^2\).)
e) 0.9, unchanged.
Details and Comments:
Practise using the basic least squares formulas. The most common mistake is to confuse
the standard deviations of X and Y in the formula for b.
For part e), correlation between X and Y is the same as correlation between Y and X.
However, the regression equation would be different; the roles of X and Y make a
difference in regression but not in correlation.
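
The arithmetic in parts a), b) and d) can be checked with a short Python sketch. The function name `regression_from_summary` and the variable names are illustrative only; the sketch assumes the usual least squares relations, slope = r(SD of y / SD of x) and intercept = mean of y minus slope times mean of x.

```python
def regression_from_summary(mean_x, sd_x, mean_y, sd_y, r):
    """Return (intercept, slope) of the least squares line y-hat = a + b*x."""
    slope = r * sd_y / sd_x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Summary statistics from Question B5 (x = FTEs, y = open beds).
mean_x, sd_x = 140, 40
mean_y, sd_y = 50, 20

# The line passes through (mean_x, mean_y) and through (100, 32),
# so those two points fix the slope.
slope = (mean_y - 32) / (mean_x - 100)
r = slope * sd_x / sd_y          # invert slope = r * sd_y / sd_x

intercept, slope_check = regression_from_summary(mean_x, sd_x, mean_y, sd_y, r)
print(f"r = {r:.2f}, y-hat = {intercept:.0f} + {slope_check:.2f} x, r^2 = {r**2:.2f}")
```

Swapping the roles of x and y in `regression_from_summary` changes the slope and intercept but leaves r as it is, which is the point of part e).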

Answer to Question B6 (MT2006Q2)


a) A. 0.98
b) C. r would be the same
c) B. Correlation
d) r = 0.92
e) C. Cannot say, because this involves extrapolating beyond the range of data
Details and Comments:
a) The scatterplot shows positive correlation so B and D are incorrect. As well, the
clustering around a straight line is very strong so 0.98 is more strongly suggested than
0.58 which is moderate to weak correlation.
b) Adding a constant would move the entire scatterplot but would not change the relative
position of the data points so the correlation would be unchanged.
d) Invert the least squares equation for the slope to solve for the correlation.
r = 3.58(8.9/34.8) = 0.92
e) This is also known as a restricted range problem.
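
For readers who want to verify parts b), d) and e) numerically, here is a minimal Python sketch that works directly from the Year/Medals table in the question. The helper names (`mean`, `sample_sd`, `correlation`) are illustrative. The shift check uses the total-medal data rather than the team/individual data, but the principle in part b) is the same: adding a constant to every value leaves r unchanged.

```python
import math

years = [1976, 1980, 1984, 1988, 1992, 1994, 1998, 2002]
medals = [3, 2, 4, 8, 42, 45, 57, 99]

def mean(values):
    return sum(values) / len(values)

def sample_sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = correlation(years, medals)
slope = r * sample_sd(medals) / sample_sd(years)
intercept = mean(medals) - slope * mean(years)

# Adding a constant to every y value shifts the plot but leaves r unchanged.
r_shifted = correlation(years, [m + 20 for m in medals])

# Value the fitted line gives for 2010; it lies outside the observed years,
# which is why the answer key declines to trust it.
extrapolated_2010 = intercept + slope * 2010

print(round(r, 2), round(slope, 2), round(r_shifted, 2), round(extrapolated_2010, 1))
```

The intercept computed here differs slightly from the -7089 in the question because the question rounds the slope to 3.58 before computing it.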


Answer to Question B7 (MT2006Q5)


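a) Response variable: T (tensile strength)
b) r is unchanged: still -0.8
c) \(\hat{T} = 29.6 - 3.2L\); \(b = -0.8(4/1) = -3.2\); \(a = 20 - (-3.2)(3) = 29.6\)
d) 21.6
e) For the converted variables \(L^*\) (feet) and \(T^*\) (pound/inch²):
\(\bar{L}^* = 9.84\); \(\bar{T}^* = 1.41\); \(S_{L^*} = 3.28\); \(S_{T^*} = 0.28\); \(r_{L^*T^*} = -0.8\)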
f) D. L = 5, T = 13.6
g) E. Shorter wood flake beams are stronger but longer ones are weaker
Details and Comments:
b) The roles of X and Y are interchangeable in correlation (but not in regression). Look at
the formula for r and note what happens if you switch X and Y.
c) Use the least squares formulas for slope and intercept.
d) Substitute 2.5 into the equation in c): 29.6 - 3.2(2.5) = 21.6
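
Parts c), d) and e) can be reproduced with the sketch below. The variable names and the dictionary `converted` are illustrative; the conversion factors are the ones stated in the question. The sketch shows that rescaling by a positive constant changes means and SDs but not the correlation.

```python
# Summary statistics from Question B7 (L in metres, T in kg/cm^2).
mean_L, sd_L = 3.0, 1.0
mean_T, sd_T = 20.0, 4.0
r = -0.8

slope = r * sd_T / sd_L                  # least squares slope of T on L
intercept = mean_T - slope * mean_L      # line passes through the point of means
predicted_T = intercept + slope * 2.5    # predicted strength of a 2.5 m beam

# Unit conversion: multiplying by a positive constant rescales means and SDs
# but leaves the correlation unchanged.
FEET_PER_METRE = 3.28
PSI_PER_KG_CM2 = 0.0704                  # factor as given in the question
converted = {
    "mean_L": mean_L * FEET_PER_METRE,
    "mean_T": mean_T * PSI_PER_KG_CM2,
    "sd_L": sd_L * FEET_PER_METRE,
    "sd_T": sd_T * PSI_PER_KG_CM2,
    "r": r,
}

print(slope, intercept, predicted_T, converted)
```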


SECTION C: NORMAL CURVE, SAMPLING DISTRIBUTIONS, COMBINING RANDOM VARIABLES

Question C1 (MT2009Q3) The Patron Saint of Coffee: St. Arbucks!
One of your COMM 291 instructors explains how he/she organizes the school day: "I
often work at home until I leave for campus, intending to arrive just in time for class. My
driving time varies, of course. The mean time is 41 minutes and the standard deviation is
2 minutes. Without fail, I stop at Starbucks for tea and my mean time there is 6 minutes,
with a standard deviation of 1.5 minutes. I assume that my driving time and my
Starbucks time are normally distributed and independent of one another."
a) Is the distribution of my total time getting to campus (i.e., driving and Starbucks time
combined) normal, non-normal, or unknown? Give the mean, variance and standard
deviation of the distribution of total time.
Space for calculations
Distribution ___________
Mean         ___________
Variance     ___________
SD           ___________

b) If I were willing to be late for class 3% of the time in the long run, how far ahead of
class time should I leave home?

c) Well, I admit that I am late very occasionally, but I have found over time that there is a
0.2 probability of any student being late to class. Assume that students are independent
from one another with respect to being late. (Actually, this probability was made up for
the purposes of this question; you all are much better than this!) In a class of 100
students, what is the approximate probability of at least 75% of students being on time?
(Hint: Be careful in setting up the start of the question.)


Question C2 (MT2009Q4) The Need for Less Speed


An automobile insurer has found that repair claims have a mean of $920 and a standard
deviation of $870.
a) What is the probability that one randomly chosen claim is larger than $1000?
__ A. 0.9100
__ B. 0.4641
__ C. 0.0900
__ D. 0.5359
b) Compute the mean and standard deviation of the average, \(\bar{x}\), of the next 100 random
claims.
__ A. Mean = $920; Standard Deviation = $87
__ B. Mean = $920; Standard Deviation = $8.70
__ C. Mean = $92; Standard Deviation = $87
__ D. Mean = $92; Standard Deviation = $870
__ E. None of these

c) What is the probability that the average, \(\bar{x}\), of the 100 claims is larger than $1000?
__ A. 0.9200
__ B. 0.8212
__ C. 0.0800
__ D. 0.1788
In order to get full marks for part c), show your work:

d) The Central Limit Theorem justifies some of the calculations above. What does the
Central Limit Theorem say? Complete the following sentence by selecting the most
appropriate phrase from the choices.
When a sample of size n is to be drawn from any population with mean mu and standard
deviation sigma, then when n is sufficiently large
__ A. the standard deviation of the sample mean is
__ B. the distribution of the population is exactly normal
__ C. the distribution of the population is approximately normal
__ D. the distribution of the sample mean is exactly normal
__ E. the distribution of the sample mean is approximately normal


Question C3 (MT2008Q3) It's all in/on your head!


a) According to GQ magazine, the first and third quartiles of men's haircut prices in
London are £21 and £29, respectively. What are the mean, variance and standard
deviation of haircut prices? Assume haircut prices are normally distributed.

b) Instead of using the numbers you found in part a), assume that the mean haircut price
is £26.50 and the standard deviation is £5. Suppose a visiting tourist from Paris is so
desperate for a haircut that he walks into the first barber shop he sees and sits right down
in the chair. What is the probability that his haircut will be £25 or less?

c) Two twin sisters are registered in different MBA programs in the United States. The
sister registered at Harvard got 87% on her final comprehensive exam. The average mark
on that exam was 73% and the standard deviation was 7%. The sister registered at
Stanford got 84% on her final comprehensive exam. Its mean was also 73% but its
standard deviation was 5%. Assume exam marks are normally distributed. Which twin's
result ranked higher within her own class? What are their respective percentiles?
Percentile of Harvard sister = ______
Percentile of Stanford sister = ______
Which twin ranked higher within her own class?
__ Harvard sister
__ Stanford sister


Question C4 (MT2008Q7) What exactly is a widget?


NOTE: This question focuses on means and variances of combinations of random
variables. See textbook and course notes.
The dictionary defines "widget" as "1: a usually small device, contrivance, or mechanical
part, or 2: an unnamed article considered for purposes of a hypothetical example as the
typical product of a company." We are using the second definition here.
The manufacturing of a widget requires the following 3 steps:
1. Cut a small cube of wood. The amount of time, W, needed to cut the cube is
normally distributed with mean 15 seconds and standard deviation 5 seconds.
2. Drill a hole through the center of each face of the cube through to the other
side. A total of 3 holes are required and the amount of time, X, to drill one
hole is normally distributed with mean 3 seconds and standard deviation 1
second.
3. Round each corner. There are a total of 8 corners and the amount of time, Y, to
round a corner is normally distributed with mean 5 seconds and standard
deviation 2 seconds.
Assume that the variables W, X, and Y are independent.
a) Find the mean and standard deviation for the total time, T, required for producing a
widget. Hint: Figure out all the variables that go into the sum, T.
Mean of T: ______

Standard deviation of T: ______

Show your work:

b) Assume that the mean of T is 70 and the standard deviation of T is 8. (Those aren't
actually the correct answers for part a), but will allow you to do part b) independently.)
Let S be the time required to produce 9 widgets. What is the probability that S will exceed
660 seconds? Hint: Define S in terms of T from part a).
Answer: _____
Show your work:


Question C5 (MT2008Q8) Is RV a random variable or a recreational vehicle?


NOTE: Parts a) and b) are unrelated.
a) The time Canadians spend watching TV and movies per day has mean 190 minutes and
standard deviation 52 minutes. The time Americans spend watching TV and movies per
day has mean 170 minutes and standard deviation 45 minutes. Two simple random
samples, of 100 Canadians and 100 Americans, are selected. Let \(\bar{x}_C\) denote the mean
TV-watching time for the Canadian group, and \(\bar{x}_A\) for the American group. Note that both
\(\bar{x}_C\) and \(\bar{x}_A\) are random variables. Assume all 200 TV-watching times in the two samples are
independent.
(i) Find the means and standard deviations of \(\bar{x}_C\) and \(\bar{x}_A\).
Mean of \(\bar{x}_C\): ________
SD of \(\bar{x}_C\): ________
Mean of \(\bar{x}_A\): ________
SD of \(\bar{x}_A\): ________

(ii) What is the approximate shape of the distribution of \(\bar{x}_C - \bar{x}_A\), the difference between the
mean TV-watching times in the two samples? Give a reason why.

(iii) What is the probability that \(\bar{x}_C - \bar{x}_A\) is greater than 30 minutes? Report your answer with
no more than two decimal digits. Hint: You will need the mean and standard deviation of
\(\bar{x}_C - \bar{x}_A\) to solve this. You can easily compute the mean; we'll give you the standard
deviation, which is 6.88. You can trust us!

b) The Grocery Manufacturers of Canada reported that 72% of consumers read the
ingredients listed on a product's label. Assume the population proportion is p = 0.72 and
a sample of 250 consumers is selected from the population. What is the approximate
probability that the percentage of consumers in the sample who read the ingredients on a
label will exceed 76%?


Question C6 (MT2007Q4) Normal is as normal does (apologies to Forrest Gump)

a) Suppose midterm exam scores in a math course (X) are normally distributed with a
mean of 60 and a standard deviation of 20. What is the interquartile range? (Hint: Start by
finding Q1 and Q3 for a standard normal distribution.)

b) Now suppose that the final exam scores are normally distributed, also with a mean of
60 and a standard deviation of 20. The instructor wishes to give 20% A's, 30% B's and
the rest C or lower.
(i) What final score should be the cutoff between A and B?
(ii) What final score should be the cutoff between B and C? (This one's easy!)

c) A company has two manufacturing plants, one that uses low-tech machines and
another that uses high-tech machines. From recent history, the number of defects per
week observed at each plant is normally distributed with the following parameters.
Low-tech:
Mean = 15
SD = 3
High-tech:
Mean = 10
SD = 1
Last week, the low-tech plant produced ten defects, while the high-tech plant produced
eight defects. Which plant performed better relative to past performance? Explain why.
___ Low-tech
___ High-tech

d) Refer to part c). The two plants work independently. Compute the mean and standard
deviation of the total number of defects from the two plants.
Mean = ___________
SD = ____________


Question C7 (MT2007Q7) Bean-counting and counting beans


a) An accounting professor claims that about one-quarter of undergraduate business
students will major in accounting. Assuming the professor is correct, what is the
probability that in a random sample of 1200 undergraduate business students, at least
28% will major in accounting?

b) Refer to part a). A survey of a random sample of 1200 undergraduate business students
indicates that there are 336 students who plan to major in accounting. What does this tell
you about the professor's claim?

c) The restaurant in a large commercial building provides coffee for the building's
occupants. The restaurateur has determined that the mean number of cups of coffee
consumed in a day by each occupant is 2.0 with a standard deviation of 0.6. A new tenant
of the building intends to have a total of 125 new employees. What is the probability that
the new employees will consume more than 240 cups per day? (Hint: You can answer
this using either of two different but related methods.)

Question C8 (MT2006Q6) Making a grand entrance


Scores on the ACT college entrance examination in the U.S. vary normally with mean \(\mu = 18\)
and standard deviation \(\sigma = 6\). The range of reported scores is 1 to 36.
a) What range contains the middle 95% of all individual scores?
b) If the ACT scores of 25 randomly selected students are averaged, what range contains
the middle 95% of the averages computed from many repetitions of the sampling
process?
c) If the sample size increased from 25 to 50, by what factor would the standard deviation
of the sampling distribution of \(\bar{x}\) change?
__A.
__B. \(1/\sqrt{2}\)
__C.
__D. 2


Question C9 (MT2006Q3) Pooling your resources


Beijing 2008: In this question we focus on swimming events at the Summer Olympics.
a) To qualify for the men's 100m freestyle event, a swimmer needs to have a time of 48
seconds or less. An American, a Dutch and an Australian hope to qualify. Their past
times (in seconds) in this event are normally distributed: American ~N(50,3.0), Dutch
~N(49.5,2.5), and the Australian ~N(49,2.5). (Remember: The first number is the mean,
the second is the standard deviation.) What is the probability that the Dutch swimmer will
qualify? (Report only 2 decimal places!)
Probability = ______
b) Which swimmer has the lowest probability of qualifying? Compute that probability.
Swimmer is ________
Probability = _____
c) If, in the next Olympic competition, each swimmer achieves his 90th percentile
performance, who will win the event? (Remember: Lower times are better, so be careful!)

Winner is _________
d) Men's 4x100m freestyle relay: The Dutch relay team has four swimmers, each of
whose past times are normally distributed as follows: N(49.5,2.5), N(50,3.0), N(51.5,5),
and N(53,2.5). What are the mean and standard deviation of the total time needed by the
four swimmers to complete the 400m swim? Assume racers' times are independent.
Mean = _____
SD = _____
e) Refer to part d): What is the chance that the Dutch team will complete the swim in
under 200 seconds (the time needed to qualify)? (Report only 2 decimal places!)


SECTION C: ANSWERS AND EXPLANATIONS


Answer to Question C1 (MT2009Q3)
a) Distribution = Normal
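Mean = 41+6 = 47; Variance = \(2^2 + 1.5^2 = 6.25\);
SD = 2.5
b) Pr (Z < z) = 0.97; z = 1.88 = (x - 47)/2.5; x = 51.70. (Leave 52 minutes before class)
c) \(\Pr(\hat{p} < 0.25) = \Pr(Z < [0.25 - 0.20]/\sqrt{0.2(0.8)/100}) = \Pr(Z < 1.25) = 0.894\),
where \(\hat{p}\) is the sample proportion of students who are late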
Details and Comments:
a) Since each individual component is normally distributed, the sum is normally
distributed. The mean of the sum is sum of the means. The variance of sum is also
additive since the components are independent. Note that you can't add the SDs.
b) The probability (area under the z curve) is given; find z and then unstandardize to
find x.
c) The information given is about being late. The question asks for at least 75% being on
time. This translates into less than 25% who are late. Once you see that the question asks
for \(\Pr(\hat{p} < 0.25)\), using the sampling distribution of \(\hat{p}\), standardize and use the standard
normal curve.
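
The three parts can be checked with Python's standard library. This is a sketch under the assumptions stated in the question (normal, independent times; normal approximation for the proportion); the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# a) Total time = driving + Starbucks, both normal and independent.
mean_total = 41 + 6
sd_total = sqrt(2**2 + 1.5**2)          # variances add, SDs do not
total = NormalDist(mean_total, sd_total)

# b) Leave early enough that only 3% of trips take longer.
lead_time = total.inv_cdf(0.97)

# c) Proportion late in a class of 100, with p = 0.2 of being late.
p, n = 0.2, 100
sd_phat = sqrt(p * (1 - p) / n)
prob_on_time = NormalDist(p, sd_phat).cdf(0.25)   # fewer than 25% late

print(round(sd_total, 2), round(lead_time, 1), round(prob_on_time, 3))
```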


Answer to Question C2 (MT2009Q4)


a) B. Pr(X > 1000) = Pr (Z > [1000 - 920]/870) = Pr (Z > 0.09) = 0.4641
b) A. \(SD(\bar{x}) = \sigma/\sqrt{n} = 870/\sqrt{100} = 87\)
c) D. Pr(\(\bar{x}\) > 1000) = Pr (Z > [1000 - 920]/87) = Pr (Z > 0.92) = 0.1788
d) E. (Look up the explanation of the Central Limit Theorem)
Details and Comments:
This question is all about the sampling distribution of a mean. In part a), the probability is
concerned with a single observation. In parts b) and c) the probability is concerned with
the mean of n=100 observations. That's why the denominator in a) is 870 and in c) is 87
(= \(870/\sqrt{100}\)).
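
The contrast between a single claim and the mean of 100 claims can be sketched as follows. The code follows the answer key in using a normal model for both parts; because it does not round z to two decimals, its results differ from the table values in the third decimal place.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 920, 870, 100

# a) One claim: the full population SD is the denominator.
p_single = 1 - NormalDist(mu, sigma).cdf(1000)

# b), c) Mean of n claims: same centre, SD shrinks by a factor of sqrt(n).
sd_mean = sigma / sqrt(n)
p_mean = 1 - NormalDist(mu, sd_mean).cdf(1000)

print(sd_mean, round(p_single, 3), round(p_mean, 3))
```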

Answer to Question C3 (MT2008Q3)


a) Mean: \(\mu\) = (29+21)/2 = 25
Q3: z(0.75) = 0.675 = (29 - \(\mu\))/\(\sigma\) = (29 - 25)/\(\sigma\); hence \(\sigma\) = 4/0.675 = 5.9
If you used z(0.75) = 0.67, \(\sigma\) = 5.97; if you used z(0.75) = 0.68, \(\sigma\) = 5.88
All three values for \(\sigma\) are acceptable.
Variance = \(\sigma^2\) = 34.81 (or 35.64 or 34.57 corresponding to the other values of \(\sigma\) above).
b) Pr (X < 25) = Pr (Z < [25 - 26.50]/5) = Pr (Z < -0.3) = 0.3821
c) First find the z-scores and then convert them to percentiles.
Percentile of Harvard sister: z = (87 - 73)/7 = 2.00, the 97.7th percentile
Percentile of Stanford sister: z = (84 - 73)/5 = 2.20, the 98.6th percentile
Which twin ranked higher within her own class? Stanford sister
Details and Comments:
a) Since the distribution is assumed to be normal it is symmetric, so the mean is halfway
between the first and third quartiles. Therefore, add the two quartiles and divide by 2.
If you didn't notice that \(\mu\) could be found this way, another way to solve this is to use Q1:
z(0.25) = -0.675 = (21 - \(\mu\))/\(\sigma\), and then solve two equations in two unknowns.
b) Remember to draw a sketch of the standard normal and mark -0.3 on the sketch. You
can look up -0.3 directly in the tables or look up +0.3 and subtract the result from 1.
c) The area to the left of 2.00 on the standard normal is .977 which is the same as saying
2.00 is the 97.7th percentile.
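
As a cross-check on the table look-ups, the same quantities can be computed with `statistics.NormalDist`. The variable names are illustrative; the only assumption is the normality stated in the question.

```python
from statistics import NormalDist

std_normal = NormalDist()          # mean 0, SD 1

# a) Quartiles 21 and 29 of a normal distribution.
q1, q3 = 21, 29
mu = (q1 + q3) / 2                 # symmetry puts the mean halfway between
z75 = std_normal.inv_cdf(0.75)     # about 0.674
sigma = (q3 - mu) / z75

# b) Pr(X <= 25) with mean 26.50 and SD 5.
p_cheap = NormalDist(26.50, 5).cdf(25)

# c) Percentiles from z-scores.
harvard = std_normal.cdf((87 - 73) / 7)
stanford = std_normal.cdf((84 - 73) / 5)

print(round(sigma, 2), round(p_cheap, 4), round(harvard, 3), round(stanford, 3))
```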


Answer to Question C4 (MT2008Q7)


a) Mean of T: 64
Standard deviation of T: 7.75
T = W+(X1+X2+X3)+(Y1+Y2+...+Y8) [this is NOT the same as W + 3X + 8Y].
Mean(T) = Mean(W)+Mean(X1)+Mean(X2)+Mean(X3)+Mean(Y1)+...+Mean(Y8)
= Mean(W)+3Mean(X)+8Mean(Y) = 15+3(3)+8(5) = 64
Var(T) = Var(W)+Var(X1)+Var(X2)+Var(X3)+Var(Y1)+...+Var(Y8) [by independence]
= Var(W)+3Var(X)+8Var(Y) = \(5^2 + 3(1^2) + 8(2^2) = 60\)
SD(T) = \(\sqrt{60}\) = 7.75
b) Answer: 0.1056 or 10.56% or 11%.
S = T1 + ... + T9
Mean(S) = 9Mean(T) = 9(70) = 630
Var(S) = 9Var(T) = \(9(8^2) = 576\)
SD(S) = \(\sqrt{576}\) = 24
Pr (S > 660) = Pr (Z > [660 - 630]/24) = Pr (Z > 1.25) = 1 - 0.8944 = 0.1056 or 10.56% or
11%.
Details and Comments:
a) These calculations use two principles of combining random variables. The mean of a
sum of random variables is the sum of the means. The variance of a sum of random
variables is the sum of the variances, ONLY if the random variables are independent
(which they are in the case of a random sample). Note that you can never add standard
deviations. First compute the variance of the sum and then take the square root.
b) Remember to draw a sketch of the standard normal. From the Empirical Rule you
know that the area to the right of 1 is 16% and the area to the right of 2 is 2.5%, so the
area to the right of 1.25 must be between 2.5% and 16%; 10% seems reasonable.
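
The difference between summing twelve independent times and scaling a single time can be made concrete with a short sketch. The name `var_wrong` is illustrative and marks the variance you would get from the mistaken reading W + 3X + 8Y.

```python
from math import sqrt
from statistics import NormalDist

# a) T = W + X1 + X2 + X3 + Y1 + ... + Y8, all independent.
mean_T = 15 + 3 * 3 + 8 * 5
var_T = 5**2 + 3 * 1**2 + 8 * 2**2      # sum of 12 separate variances
sd_T = sqrt(var_T)

# For contrast: W + 3X + 8Y (one drill time tripled, one corner time times 8)
# has variance 5^2 + 3^2 * 1^2 + 8^2 * 2^2, which is much larger.
var_wrong = 5**2 + 3**2 * 1**2 + 8**2 * 2**2

# b) S = T1 + ... + T9 with Mean(T) = 70 and SD(T) = 8, as stated in the question.
mean_S = 9 * 70
sd_S = sqrt(9 * 8**2)
p_exceed = 1 - NormalDist(mean_S, sd_S).cdf(660)

print(mean_T, round(sd_T, 2), var_wrong, round(p_exceed, 4))
```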

Answer to Question C5 (MT2008Q8)


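a) (i) Mean of \(\bar{x}_C\) (Canadian sample mean) = 190
Mean of \(\bar{x}_A\) (American sample mean) = 170
SD of \(\bar{x}_C\) = \(52/\sqrt{100}\) = 5.2
SD of \(\bar{x}_A\) = \(45/\sqrt{100}\) = 4.5
(ii) Normal, due to the Central Limit Theorem, since each n is larger than 30.
(iii) Mean(\(\bar{x}_C - \bar{x}_A\)) = 190 - 170 = 20; SD(\(\bar{x}_C - \bar{x}_A\)) = 6.88 (given)
Pr (\(\bar{x}_C - \bar{x}_A\) > 30) = Pr (Z > [30 - 20]/6.88) = Pr (Z > 1.45) = 1 - 0.9265 = 0.0735 or 0.07
b) Pr (\(\hat{p}\) > 0.76) = Pr (Z > [0.76 - 0.72]/\(\sqrt{0.72(0.28)/250}\)) = Pr (Z > 1.41) = 1 - 0.9207
= 0.0793 or 0.08.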

Details and Comments:


a) First calculate the mean of the difference between the Canadian and American sample
means (the SD is given), then standardize and find the appropriate area under the z curve.
b) Use the sampling distribution of \(\hat{p}\), standardize and use the standard normal curve.
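
The given standard deviation of 6.88 and both probabilities can be verified with the sketch below. Variable names are illustrative; the normal approximations are the ones the answer relies on.

```python
from math import sqrt
from statistics import NormalDist

n = 100
sd_canada = 52 / sqrt(n)
sd_usa = 45 / sqrt(n)

# a) Difference of independent sample means: means subtract, variances add.
mean_diff = 190 - 170
sd_diff = sqrt(sd_canada**2 + sd_usa**2)
p_diff = 1 - NormalDist(mean_diff, sd_diff).cdf(30)

# b) Sample proportion with p = 0.72 and a sample of 250.
p, m = 0.72, 250
sd_phat = sqrt(p * (1 - p) / m)
p_prop = 1 - NormalDist(p, sd_phat).cdf(0.76)

print(round(sd_diff, 2), round(p_diff, 2), round(p_prop, 2))
```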


Answer to Question C6 (MT2007Q4)


a) For Z: Q1 = -0.675, Q3 = 0.675 (from Z-table)
Unstandardize to X: Q1 = 60 + (-0.675)(20) = 46.5, Q3 = 60 + (0.675)(20) = 73.5
IQR = 73.5 - 46.5 = 27
b) (i) Find Z with area of 0.20 to the right; Z = 0.84; X = 60 + (0.84)20 = 76.8.
Hence the cutoff is 77.
(ii) The cutoff is 60 (50% are C or lower, so the median, which equals the mean in a normal
distribution, is the cutoff)
c) High-tech
Z-score for low-tech = (10 - 15)/3 = -1.67
Z-score for high-tech = (8 - 10)/1 = -2.00. More unusual, so better relative performance.
d) Mean = 15+10 = 25; SD = \(\sqrt{3^2 + 1^2} = \sqrt{10}\) = 3.16
Details and Comments:
a) Hints are given for a reason.
The z-table only gives two decimal places for a z-value, so it is also acceptable to base
your computations on -0.68 and 0.68: IQR = 27.2, or on -0.67 and 0.67: IQR = 26.8.
However, it is an easy and sensible interpolation to get to -0.675 and 0.675; that is the
preferred solution.
b) (ii) is based on the definition of the median; that's why the question said it was easy.
c) Be careful here; high-tech has the z-score farther below zero and hence has smaller probability
associated with it, but less probable is better here since you are computing the probability
of defects! It is analogous to shorter times in a race or lower golf scores representing
better outcomes. Context is everything.
d) Means are additive; standard deviations are not. However, variances are additive if the
random variables are independent.
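
A short sketch reproducing all four parts without rounding z to two decimals (variable names are illustrative):

```python
from math import sqrt
from statistics import NormalDist

scores = NormalDist(60, 20)

# a) Interquartile range.
iqr = scores.inv_cdf(0.75) - scores.inv_cdf(0.25)

# b) Top 20% get A, next 30% get B, so the cutoffs are the 80th and 50th percentiles.
cutoff_a = scores.inv_cdf(0.80)
cutoff_b = scores.inv_cdf(0.50)

# c) z-scores: fewer defects than usual gives a negative z; more negative is better.
z_low = (10 - 15) / 3
z_high = (8 - 10) / 1

# d) Total defects from two independent plants.
mean_total = 15 + 10
sd_total = sqrt(3**2 + 1**2)

print(round(iqr, 1), round(cutoff_a, 1), cutoff_b, round(z_low, 2), z_high, round(sd_total, 2))
```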
Answer to Question C7 (MT2007Q7)
a) Pr (\(\hat{p}\) > 0.28) = Pr (Z > [0.28 - 0.25]/\(\sqrt{0.25(0.75)/1200}\)) = Pr (Z > 2.4) = 0.0082
b) \(\hat{p}\) = 336/1200 = 0.28. We know from part a) that the claim is very unlikely since the
probability of exceeding 0.28 is so small.
c) The amount of coffee consumed follows a normal distribution; \(\bar{x}\) = 240/125 = 1.92
Pr (\(\bar{x}\) > 1.92) = Pr (Z > [1.92 - 2.0]/[0.6/\(\sqrt{125}\)]) = Pr (Z > -1.49) = 0.93
Alternative method (but not necessary for you to do):
Pr (Total > 240) = Pr (Z > [240 - 125(2.0)]/[(0.6)\(\sqrt{125}\)]) = Pr (Z > -1.49) = 0.93
Details and Comments:
a) Use the sampling distribution of \(\hat{p}\). This question asks for the probability of the
proportion exceeding 0.28.
c) This question can be answered two ways, either by working out the probability that the
mean exceeds 1.92 (i.e. 240/125), or that the total exceeds 240. To use the second method
you need to work out the mean and standard deviation of a Total, using combinations of
random variables. Using the mean will be much more familiar to you and therefore much
easier! The main difference is in the standard deviation formula; the square root of n
term changes positions.
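
The equivalence of the two methods in part c) is easy to confirm numerically. In this sketch (illustrative variable names), the mean-based method divides the SD by the square root of the sample size, while the total-based method multiplies by it.

```python
from math import sqrt
from statistics import NormalDist

# a) Sample proportion, p = 0.25, n = 1200.
p, n = 0.25, 1200
sd_phat = sqrt(p * (1 - p) / n)
p_at_least_28 = 1 - NormalDist(p, sd_phat).cdf(0.28)

# c) Coffee: 125 employees, mean 2.0 cups and SD 0.6 cups each.
mu, sigma, k = 2.0, 0.6, 125
# Method 1: the sample mean exceeds 240/125 cups.
via_mean = 1 - NormalDist(mu, sigma / sqrt(k)).cdf(240 / k)
# Method 2: the total exceeds 240 cups.
via_total = 1 - NormalDist(k * mu, sigma * sqrt(k)).cdf(240)

print(round(p_at_least_28, 4), round(via_mean, 2), round(via_total, 2))
```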


Answer to Question C8 (MT2006Q6)


a) \(18 \pm 2(6) = 18 \pm 12\) = (6, 30)
OR: \(18 \pm 1.96(6) = 18 \pm 11.76\) = (6.24, 29.76)
b) \(18 \pm 2(6/\sqrt{25}) = 18 \pm 2.4\) = (15.6, 20.4)
OR: \(18 \pm 1.96(6/\sqrt{25}) = 18 \pm 2.35\) = (15.65, 20.35)
c) B. \(1/\sqrt{2}\)
Details and Comments:
a) This part uses The Empirical Rule (68-95-99.7 Rule) since we are interested in
individual scores. That is, the plus or minus number is about \(2\sigma\).
b) This part uses standard error (from the sampling distribution of \(\bar{x}\)) since we are
interested in the mean of 25 scores. That is, the plus or minus number is about \(2(\sigma/\sqrt{n})\).
c) Notice the effect of the square root of n.
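
A minimal sketch of the three parts, using the 1.96 multiplier rather than the rounded value of 2. The helper name `middle_95` is illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma = 18, 6
z = NormalDist().inv_cdf(0.975)          # about 1.96

def middle_95(mean, sd):
    """Return the interval holding the central 95% of a normal distribution."""
    return mean - z * sd, mean + z * sd

individual = middle_95(mu, sigma)                 # a) single scores
averages_25 = middle_95(mu, sigma / sqrt(25))     # b) means of 25 scores

# c) Going from n = 25 to n = 50 multiplies the SD of the sample mean by this factor.
factor = (sigma / sqrt(50)) / (sigma / sqrt(25))

print(individual, averages_25, factor)
```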

Answer to Question C9 (MT2006Q3)


a) Pr (X < 48) = Pr (Z < [48-49.5]/2.5) = Pr (Z < -0.6) = 0.27 or 27%
b) American: Pr (Z < [48-50]/3) = Pr (Z < -0.67) = 0.25
Australian: Pr (Z < [48-49]/2.5) = Pr (Z < -0.40) = 0.34
Swimmer is American; Probability = 0.25
c) Australian: -1.28(2.5) + 49 = 45.8
American: -1.28(3) + 50 = 46.2
Winner is Australian
d) Mean = 49.5 + 50 + 51.5 + 53 = 204
Variance = \(2.5^2 + 3^2 + 5^2 + 2.5^2 = 46.5\); SD = 6.82
e) Pr (T < 200) = Pr (Z < [200-204]/6.82) = Pr (Z < -0.59) = 0.28 or 28%
Details and Comments:
b) After standardizing, draw a sketch to get the correct part of the normal curve.
c) From the given probability (through the percentile), find the corresponding value on
the standard normal curve and then unstandardize. Remember that faster is better here.
d) The mean of a sum is the sum of the means. The variance of a sum is the sum of the
variances IF the random variables are independent. Standard deviations are never
additive.
e) Use the mean and standard deviation from part d) to standardize.
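
All five parts can be reproduced with the sketch below. The dictionary and variable names are illustrative. For part c) it follows the answer key in reading each swimmer's "90th percentile performance" as the time he beats only 10% of the time, i.e. the 10th percentile of his time distribution.

```python
from math import sqrt
from statistics import NormalDist

swimmers = {
    "American": NormalDist(50, 3.0),
    "Dutch": NormalDist(49.5, 2.5),
    "Australian": NormalDist(49, 2.5),
}

# a), b) Probability of a time of 48 seconds or less.
qualify = {name: dist.cdf(48) for name, dist in swimmers.items()}
least_likely = min(qualify, key=qualify.get)

# c) Lower times are better, so use the 10th percentile of each time distribution.
best_times = {name: dist.inv_cdf(0.10) for name, dist in swimmers.items()}
winner = min(best_times, key=best_times.get)

# d), e) Relay total: means add, and variances add for independent swimmers.
relay = [(49.5, 2.5), (50, 3.0), (51.5, 5), (53, 2.5)]
relay_mean = sum(m for m, _ in relay)
relay_sd = sqrt(sum(s**2 for _, s in relay))
p_relay = NormalDist(relay_mean, relay_sd).cdf(200)

print(qualify, least_likely, winner, relay_mean, round(relay_sd, 2), round(p_relay, 2))
```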


SECTION D: INTRODUCTION TO INFERENCE, CONFIDENCE INTERVALS, HYPOTHESIS TESTS

Question D1 (MT2009Q6) Frozen biscuits
Thunderbird Co. uses a high technology manufacturing process to produce ice hockey
pucks with an average weight of 170 g. Sometimes the process gets out of adjustment and
produces pucks with average weights different from 170 g. When the average weight
exceeds 170 g., sales will be negatively affected since these pucks cannot be used in a
game due to possible serious harm. As well, there are equipment standards for
professional competition and when the average weight falls below 170 g., Thunderbird's
pucks might be rejected.
Thunderbird's quality control program involves taking periodic samples of 50 pucks to
monitor the manufacturing process. For each sample, a hypothesis test is conducted to
determine whether the process has fallen out of adjustment. The quality control team
selected α = 0.05 as the level of significance for the test.
a) Suppose that a sample of 50 pucks is selected and that the sample mean, x̄, is 172.6 g
and sample standard deviation, s, is 12 g.
Carry out the hypothesis test by following these steps.
State the hypotheses

Explain whether Ha should be one-sided or two-sided


Calculate the test statistic

Find bounds on the P-value (Assume the test statistic = 1.5 rather than the answer you
got immediately above.)

State your conclusion (Base it on the test statistic = 1.5 and the P-value you found
immediately above. Reminder: Don't just say "Accept H0" or "Reject H0".)

b) Construct a 95% confidence interval for the mean weight of the population of all pucks
made by Thunderbird.


Question D2 (MT2008Q9) A testbank bank test


The Bank of ABC focused on a stable workforce that has very little turnover. The bank
has always promoted the idea that its employees stay with them for a very long time, and
has used the following line in its recruitment brochures: "The average tenure of our
employees is 20 years." However, its new HR manager thinks that the average tenure may
be less than 20 years which would mean that some new measures to improve workforce
stability would be required. A random sample of 100 employees is taken and the average
tenure is computed to be 19 years, with a standard deviation of 4 years. Does the HR
manager have enough evidence to support his claim at the 5% significance level?
a) State the hypotheses.

b) Compute the value of the test statistic.

c) Find an approximate P-value for this test statistic.

d) State a conclusion about the manager's claim, in one complete sentence. Don't just say
"Accept H0" or "Reject H0".

e) Assuming the same sample mean of 19 years and standard deviation of 4 years, what is
the smallest sample size that would still reject the null hypothesis at the 5% significance
level? Hint. Find the value of the test statistic that would give a P-value as close as
possible to 0.05.


Question D3 (MT2008Q10) No lack of confidence here!


a) A study of the career paths of hotel general managers sent questionnaires to a random
sample of 160 hotels belonging to major U.S. hotel chains and received 114 responses.
The mean time these 114 general managers had spent with their current company was
11.8 years, with a standard deviation of 3.2 years. Construct a 99% confidence interval
for the mean number of years general managers of major-chain hotels have spent with
their current company.

b) Answer each of the following with: Yes, No, or Can't Tell. Circle your choices.
(1) Does the sample mean lie in the 95% confidence interval?
    Yes    No    Can't Tell
(2) Does the population mean lie in the 95% confidence interval?
    Yes    No    Can't Tell
(3) If a 90% confidence level were used, would the same data produce an interval
wider than the 95% confidence interval?
    Yes    No    Can't Tell
(4) With a smaller sample size, all other things being the same, would the 95%
confidence interval be wider than with a larger sample size?
    Yes    No    Can't Tell

c) A radio talk show invites listeners to enter a dispute about a proposed pay increase for
city council members. "What yearly pay do you think council members should get? Call
us with your number." In all, 958 people call. The station calculates the 95% confidence
interval for the mean pay, μ, that all citizens would propose for council members to be
$9669 to $9811. Is this result trustworthy? Explain your answer.


Question D4 (MT2007Q8) I owe, I owe, it's off to work I go (go ahead & sing it)
The National Association of Independent Colleges and Universities took a random
sample of 64 college graduates and found that their average debt upon graduation was
$12,000, with a standard deviation of debt upon graduation of $1800.
a) Construct a 95% confidence interval for the mean debt of all college graduates.
b) True or False: The confidence interval you obtained in part a) means that
approximately 95% of sample averages obtained from repeated random samples of 64
college graduates will fall in that interval. (No explanation is necessary.)
___True    ___False

c) Calculate the sample size required to have a 99% confidence interval with the same
margin of error as that found in part a).
d) A college president says that the sample of 64 graduates in part a) resulted in an
overestimate and that the actual mean debt is $11,500. Test whether the actual mean debt
exceeds $11,500 by forming the appropriate hypotheses, obtaining a P-value, and
interpreting it.
e) Decide whether each statement is more likely to be True or False:
(i) The larger the sample size, the more likely you will get statistical significance
using a t-test (assuming the sample mean does not change).
    True    False
(ii) The P-value does not depend on the level of significance.
    True    False
(iii) As the P-value gets smaller, the evidence against the null hypothesis gets stronger.
    True    False
(iv) In hypothesis testing, if the P-value is 1%, there is a 1% chance that the null
hypothesis being tested is true.
    True    False
(v) The smaller the P-value, the less likely the null hypothesis is true.
    True    False


Question D5 (MT2006Q7) Buyer, and renter, beware!


The square footage of the several thousand apartments in a new development is
advertised to be 1250 square feet, on average. A tenant group thinks that the apartments
are smaller than advertised. They hire an engineer to measure a random sample of
apartments to test their suspicion about the true average area (in sq. ft.) of the apartments.
a) What are the appropriate null and alternative hypotheses?
__A. H0: μ = 1250 vs. Ha: μ < 1250
__B. H0: μ = 1250 vs. Ha: μ ≠ 1250
__C. H0: μ = 1250 vs. Ha: μ > 1250
__D. H0: x̄ = 1250 vs. Ha: x̄ < 1250
b) The engineer's sample of 36 apartments had a mean of 1208 square feet and standard
deviation of 120 square feet. What is the value of the test statistic?
__A. -0.35
__B. -2.1
__C. -12.6
__D. 2.1
__E. None of these
Show your work:
c) The P-value for the test statistic you computed in part b) is closest to:
__A. >0.20
__B. 0.05
__C. 0.025
__D. <0.001
__E. None of these
d) State your conclusion in one complete sentence that the tenant group can understand.


SECTION D: ANSWERS AND EXPLANATIONS


Answer to Question D1 (MT2009Q6)
a) Hypotheses: H0: μ = 170; Ha: μ ≠ 170
Two-sided; differences from 170 in either direction are both problems!
Test statistic: t = (x̄ - μ0)/(s/√n) = (172.6 - 170)/(12/√50) = 1.53

Degrees of freedom = 49
Two-tail P-value (from Table T) is between 0.10 and 0.20.
There is not enough evidence to conclude a difference from the target of 170 g.
b) x̄ ± t*(s/√n) = 172.6 ± 2.01 (12/√50) = 172.6 ± 3.4 or (169.2, 176.0)

Details and Comments:
a) It is a problem if pucks are too light or too heavy so this is a two-tailed (two-sided)
alternative.
b) Look up the multiplier in Table T, using n - 1 = 49 degrees of freedom and 95%
confidence level. The multiplier for 49 df is about 2.01.
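
The test statistic and two-tail P-value for Question D1 a) can be checked without a printed table. The sketch below is an illustration under stated assumptions, not the method used in the answer key: the helper functions `t_pdf` and `t_upper_tail` are hypothetical names, and the tail area is approximated by Simpson's rule applied to the Student's t density.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3  # half the distribution lies above 0

# Values from Question D1
n, sample_mean, s, mu0 = 50, 172.6, 12, 170

t_stat = (sample_mean - mu0) / (s / math.sqrt(n))
p_two_sided = 2 * t_upper_tail(abs(t_stat), n - 1)

print(f"t = {t_stat:.2f}, two-sided P-value = {p_two_sided:.3f}")
```

The computed P-value should fall inside the bounds read from Table T, since the table only brackets the tail area between two tabulated columns.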

Answer to Question D2 (MT2008Q9)


a) H0: μ = 20; Ha: μ < 20
b) t = (19 - 20)/(4/√100) = -2.5
c) P-value = Pr(t99 < -2.5): From Table T, this is between 0.005 and 0.01.
d) There is enough evidence to support the manager's claim that the average tenure is less
than 20 years.
e) From Table T, the critical value of t100 corresponding to a one-tail probability of 0.05 is
1.660. We are working with the left-hand tail, so t = -1.660 = (19 - 20)/(4/√n).
Hence √n = (-1.660)(4)/(-1) = 6.64; n = 44.1. Round up: n = 45

Details and Comments:


This is a case of the one-sample t-test for a mean. In this scenario the alternative
hypothesis is one-tailed (i.e. one-sided). Since the P-value is much less than the threshold
of 0.05 (i.e. 5%), the null hypothesis is rejected. Remember that the smaller the P-value
the stronger the evidence that something real is happening.
Note that in part e) it is not correct to leave the answer as 44.1; you can't have fractional
sample sizes!
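
The algebra in part e) can be written as a short calculation. This is a sketch assuming the fixed critical value 1.660 quoted from Table T; strictly, the critical value changes slightly with the degrees of freedom, so the result is an approximation in the same spirit as the answer above.

```python
import math

sample_mean, null_mean, sd = 19, 20, 4   # from Question D2
t_critical = 1.660                        # one-tail 0.05 value quoted from Table T

# Reject H0 when (sample_mean - null_mean) / (sd / sqrt(n)) <= -t_critical,
# which rearranges to sqrt(n) >= t_critical * sd / (null_mean - sample_mean).
root_n = t_critical * sd / (null_mean - sample_mean)
n_exact = root_n ** 2
n_min = math.ceil(n_exact)   # sample sizes are whole numbers, so round up

print(round(root_n, 2), round(n_exact, 1), n_min)
```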


Answer to Question D3 (MT2008Q10)


a) x̄ ± t*(s/√n) = 11.8 ± 2.62 (3.2/√114) = 11.8 ± 0.79 or (11.01, 12.59)
Since the sample size is 114, there are 113 df, so accept multipliers of 2.617 or 2.626,
corresponding to 120 and 100 df, respectively, or anything in between. I rounded to two
decimal places.
b) (1) YES
(2) CAN'T TELL
(3) NO
(4) YES
c) Confidence intervals are only trustworthy if they are based on random samples; this is
NOT a random sample.
Details and Comments:
a) This is a straightforward use of the confidence interval for a single mean. Don't obsess
about the choice of degrees of freedom; it only makes a difference in the second decimal
place, but the mean is only reported to one decimal place. In fact, since we are working
with years of employment, one decimal place is sufficient.
b) (1) The formula has the sample mean plus or minus a margin of error, so by definition
the sample mean must be the midpoint of the confidence interval.
(2) That's the whole point of constructing a confidence interval. You don't know where
the population mean is, but you are 95% sure that it is included in the interval.
(3) Narrower intervals mean less confidence.
(4) Increasing the sample size decreases the variability and hence the margin of error.
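
Points (3) and (4) can be illustrated by computing the margin of error under different settings. This is a rough sketch: it uses a normal (z) multiplier from Python's standard library as an approximation to the Table T multiplier, and the smaller sample size of 30 is a hypothetical value chosen only for comparison.

```python
from statistics import NormalDist
import math

def margin_of_error(sd, n, confidence):
    """Margin of error for a mean, using a z multiplier as an approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)

sd, n = 3.2, 114   # from part a)

m90 = margin_of_error(sd, n, 0.90)
m95 = margin_of_error(sd, n, 0.95)
m99 = margin_of_error(sd, n, 0.99)
m95_small_n = margin_of_error(sd, 30, 0.95)   # hypothetical smaller sample

print(round(m90, 2), round(m95, 2), round(m99, 2), round(m95_small_n, 2))
```

Lower confidence gives a narrower interval, and a smaller sample gives a wider one, matching answers (3) and (4).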

Answer to Question D4 (MT2007Q8)


a) 95% CI: 12,000 ± 2.000(1800/√64) = 12,000 ± 450 = (11,550 ; 12,450)
Note: Since Table T does not have probabilities corresponding to 63 degrees of freedom,
use the slightly more conservative choice of 60 df; hence the multiplier is 2.00. It would
also be acceptable to use a multiplier of 1.992, corresponding to 75 df in the next row in
the table. The CI doesn't change much: 12,000 ± 448 = (11,552 ; 12,448).
b) False. The CI changes from sample to sample, but every CI is centred around its own
sample average.
d) H0: μ = 11,500; Ha: μ > 11,500
Test statistic: t = (12,000 - 11,500)/(1800/√64) = 500/225 = 2.22
P-value = Pr (t63 > 2.22): from Table T, this probability is between 0.01 and 0.025.
Conclusion: Reject the null hypothesis. There is evidence that mean debt is greater than
$11,500. The college president's claim is rejected.
e) True, True, True, False, False
Details and Comments:
e) (i) Examine the formula for the t-test statistic; with n in the denominator of the
denominator, increasing it will increase the value of the test statistic.
(ii) The level of significance (i.e. alpha) is chosen before the P-value is calculated and
does not enter into the calculation of P-value.
(iii) This is precisely the interpretation of P-value.
(iv) and (v) The P-value assumes the null hypothesis is true so it can't be a statement
about the chance that the null hypothesis is true. It is a statement about the data and the
consistency of the data with the null hypothesis.
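
Part c) of Question D4 asks for the sample size that gives a 99% confidence interval the same margin of error as in part a). One common approach is sketched below. It assumes a normal (z) multiplier and treats the sample standard deviation of 1800 as a planning value; using a t multiplier from Table T instead would give a slightly larger answer.

```python
from statistics import NormalDist
import math

sd = 1800             # sample SD from the question, used as a planning value
target_margin = 450   # margin of error found in part a)
confidence = 0.99

z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 2.576

# Solve z * sd / sqrt(n) = target_margin for n, then round up
n_required = math.ceil((z * sd / target_margin) ** 2)

print(round(z, 3), n_required)
```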


Answer to Question D5 (MT2006Q7)


a) A. H0: μ = 1250 vs. Ha: μ < 1250
b) B. -2.1
c) C. 0.025
d) There is sufficient evidence to conclude that the apartments are smaller than
advertised. (OR: The difference between the sample mean and advertised mean cannot be
explained by chance alone; the difference is real.)
Details and Comments:
This is a one-sample t-test of a mean.
a) D is completely wrong since hypotheses are about parameters, not estimates.
The remaining decision is whether the alternative is one-tailed or two-tailed, and if
one-tailed, which way.
b) t = (1208 - 1250)/(120/√36) = -42/20 = -2.1

c) Use Table T to find the area (i.e. probability) to the left of -2.1, with 35 degrees of
freedom. Remember to look up the one-tail probability.
d) Note that it is not sufficient simply to say, "Reject H0."


SECTION E: MISCELLANEOUS
Question E1 (MT2009Q5) Drinking and de-riving a sample
A study of the number of years that employees work for food-and-drink businesses in the
Lower Mainland was based on a sample from the telephone directory's Yellow Pages
listings of food-and-drink businesses in the Lower Mainland. The sample was drawn as
follows. The study investigator first drew a simple random sample of four municipalities
in the Lower Mainland. Then within each selected municipality, he randomly sampled 50
businesses. For various reasons, the study got no response from 40% of the 200
businesses chosen. Interviews were completed with 120 businesses that responded. Each
of the 120 businesses was asked for the typical number of years that an employee stayed
with the business.
a) The population of interest to the investigator is:
__ A. all food-and-drink businesses in the Lower Mainland that are listed under
the telephone directory's Yellow Pages
__ B. all food-and-drink businesses in the Lower Mainland
__ C. the 200 businesses that were chosen by the investigator
__ D. the 120 businesses that responded
b) What is the relevant statistic here?
__ A. The mean years of employment of all food-and-drink businesses in the
Lower Mainland listed under the telephone directory's Yellow Pages
__ B. The mean years of employment of all food-and-drink businesses in the
Lower Mainland
__ C. The mean years of employment of the 200 businesses that were chosen by
the investigator
__ D. The mean years of employment of the 120 businesses that responded
c) The sampling scheme that the investigator used in choosing the 200 businesses is:
__ A. Simple random sampling
__ B. Stratified random sampling
__ C. Multistage sampling (also known as Cluster sampling)
__ D. Systematic sampling
d) The main source of bias in this study is due to the fact that:
__ A. only four municipalities were sampled
__ B. only 50 businesses in each municipality were sampled
__ C. only 120 of the 200 businesses sampled actually responded
__ D. not all food-and-drink businesses are listed in the Yellow Pages
e) This study is an example of
__ A. an experiment
__ B. a double-blind study
__ C. a census
__ D. a survey


Question E2 (MT2009Q8) So many choices, so little time


For each of the following multiple choice questions, choose the single correct answer.
Place an X in the space beside the letter of your choice or circle the letter.
a) A study of elementary school children, ages 6 to 11, finds a high positive correlation
between shoe size X and score Y on a test of reading comprehension. The observed
correlation is most likely due to
__ A. the effect of a lurking variable, such as age
__ B. a mistake since the correlation must be negative
__ C. cause and effect (larger shoe size causes higher reading comprehension)
__ D. reverse cause and effect (higher reading comprehension causes large
shoe size)
b) Which of the following is true of the least-squares regression line?
__ A. the slope is the change in the response variable that would be predicted by
a unit change in the explanatory variable
__ B. It always passes through the point (x̄, ȳ), the means of the explanatory
and response variables, respectively
__ C. It will only pass through all the data points if r = ±1
__ D. All of the above
c) The slope of a regression line and the correlation are similar in the sense that:
__ A. they both have the same sign
__ B. they both do not depend on the units of measurement of the data
__ C. they both fall between -1 and 1 inclusive
__ D. neither of them can be affected by outliers
__ E. both can be used for prediction
d) When the level of confidence and variance remain the same, a confidence interval for a
population mean based on a sample of n=100 will be _____________ as the confidence
interval for a mean based on a sample of n =400.
__ A. twice as wide
__ B. half as wide
__ C. the same size
__ D. four times as wide
__ E. one fourth as wide
e) A random sample of 1500 observations gave a 95% confidence interval for the mean of
810 ± 62. What would the margin of error be, to the nearest whole number, if we wanted
just 90% confidence?
__ A. 65
__ B. 59
__ C. 74
__ D. 52
__ E. cannot be determined


f) In testing hypotheses, which of the following would be strong evidence against the null
hypothesis?
__ A. Using a small level of significance
__ B. Using a large level of significance
__ C. Obtaining data with a small P-value
__ D. Obtaining data with a large P-value
g) In a statistical test of hypotheses, we say the data are statistically significant at level
α if
__A. α = 0.05
__B. α is small
__C. the P-value is less than α
__D. the P-value is larger than α

h) An engineer designs an improved light bulb. The previous design had an average
lifetime of 1200 hours. The new bulb has a lifetime of 1201 hours, using a sample of
2000 bulbs. Although the difference is quite small, it is statistically significant. The
explanation for the statistically significant difference is
__ A. that new designs typically have more variability than standard designs
__ B. that the sample size is very large
__ C. that the mean of 1200 is large
__ D. all of the above

i) In a test of statistical hypotheses, the P-value is:


__ A. the probability that the null hypothesis is true
__ B. the probability that the alternative hypothesis is true
__ C. the largest level of significance at which the null hypothesis can be rejected
__ D. the smallest level of significance at which the null hypothesis can be
rejected


Question E3 (MT2006Q4) Shooting and sleighing


Olympics again: Shooting has been an Olympic sport since 1896. The running target
event has shooters firing at a moving target as it moves across a two-metre opening, from
a distance of 10 meters.
a) In Olympic competition, each shooter has 60 shots. The average shooter has a success
rate of p = 0.70. The sampling distribution of the sample proportion is approximately
normal with:
Mean = ______

SD = ______

b) The probability that a shooter hits the target 48 or more times is closest to:
__A. 0.95
__B. 0.26
__C. 0.74
__D. 0.05

Bobsleigh racing was developed in the 19th century by the Swiss in search of the ultimate
thrill. Race times are normally distributed with mean 53 seconds and standard deviation 3
seconds. In bobsleigh events, racers complete four runs.
c) A sample of four runs at a particular event gave a mean of 51 seconds, with the same
standard deviation of 3 seconds (as expected). Compute a 90% confidence interval for the
true mean run time. (Report 2 decimal places)

d) The observed margin of error for another sample of four runs was 8.76. What level of
confidence was chosen to compute that confidence interval?
__A. 90%
__B. 95%
__C. 99%
__D. None of the above


Question E4 (MT2006Q8)
a) The Environmental Protection Agency records data on the fuel economy of many
different makes of cars. Some of the variables collected are listed below. Identify each
variable as categorical or measurement (i.e. quantitative). (Circle your choice)
Manufacturer (GM, Ford, Toyota, etc.)          Categorical    Measurement
Gas mileage (miles per gallon)                 Categorical    Measurement
Weight (in pounds)                             Categorical    Measurement
Size (small, medium, full-size, truck, etc.)   Categorical    Measurement

b) A study of the caloric content of hot dogs was undertaken. As part of the study, the
number of calories in 20 brands of beef hot dogs were recorded and the five-number
summary computed as follows: Min = 110, Max = 190, Median = 152.5, Quartiles = 140,
180. The researchers did not provide the standard deviation. However, previous work has
shown that calorie count is approximately normally distributed. Which of the following is
the most reasonable estimate of the standard deviation?
__A. 10
__B. 20
__C. 40
__D. 80
c) A television station is interested in predicting whether or not voters in its listening area
are watching their coverage of the Winter Olympics. It asks its viewers to phone in and
report whether or not they have watched at least one hour of Olympic coverage in the
first week of the Games. Of the 1242 viewers who phoned in, 512 (41.22%) said "Yes".
The number 41.22% is a:
__A. statistic
__B. parameter
__C. sample
__D. population
d) Refer to part c), immediately above. Choose the best statement from the following.
__A. The results are valid because the sample size is very large
__B. The results are valid because people who are undecided do not phone in
__C. The results are not valid because the response is voluntary
__D. The results are not valid because the question is poorly worded


e) Which of the following statements is a correctly worded statement about correlation?


(I) There is a high correlation between the gender of Canadian workers and their income.
(II) There is a high correlation (r = 1.09) between students' ratings of faculty teaching
and ratings made by other faculty members.
(III) The correlation between age and income was found to be r = 0.53 years.
__A. I only
__B. II only
__C. III only
__D. I and III only
__E. None of them
Two Bonus Questions for Algebra Fans
Bonus Question 1: Suppose you have only two data values and . Which of the
following formulas gives the variance of and ?
__A.
__B.
__C.
__D. None of these
__E. Variance cannot be computed with only two data values

Bonus Question 2:
Suppose a random variable X takes only two possible values: μ - σ and μ + σ, each with
probability 0.5. What are the mean and standard deviation of X?
__A. Mean = μ, SD = σ
__B. Mean = 0, SD = 1
__C. Mean = , SD = 2
__D. Mean = 0, SD =
__E. You cant compute mean and SD without actual data.


SECTION E: ANSWERS AND EXPLANATIONS


Answer to Question E1 (MT2009Q5)
a) B. All food-and-drink businesses in the Lower Mainland
b) D. The mean years of employment of the 120 businesses that responded
c) C. Multistage cluster sampling
d) C. Only 120 of the 200 businesses sampled actually responded
e) D. A survey
Details and Comments:
a) Although the population is all businesses, the sampling frame is only those listed in the
Yellow Pages. Sometimes it is not possible to access the entire population of interest.
b) A statistic is computed from the actual sample, hence the 120 responding businesses.
d) A and B are sources of variability, not bias. D is also a source of bias, but not as large
as the non-response bias.

Answer to Question E2 (MT2009Q8)


a) A   b) D   c) A   d) A   e) D   f) C   g) C   h) B   i) D

Details and Comments:


c) Since b1 = r(sy/sx), the signs on b1 and r will be the same (since SDs are positive).
d) The square root is important here; changing the sample size by a factor of 4 will
change the margin of error by a factor of 2.
e) Work backward from the formulas to solve for SE and then work forward to get the
new margin of error. SE = 62/1.96; so 1.645(SE) = (1.645/1.96)(62) = 52
f) and g) and i) Look at the definition and interpretation of P-value.
h) Statistical significance and practical significance are different. It is possible to detect a
very small difference using a very large sample size, but the difference may not be
meaningful in practice.
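
Parts d) and e) both come down to how the margin of error scales. The sketch below is an illustration that assumes normal (z) multipliers, as the worked answer to e) does; the helper name `z_multiplier` is hypothetical.

```python
from statistics import NormalDist
import math

def z_multiplier(confidence):
    """Two-sided normal multiplier for the given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

# e) rescale a 95% margin of error of 62 to 90% confidence
margin_95 = 62
standard_error = margin_95 / z_multiplier(0.95)
margin_90 = z_multiplier(0.90) * standard_error

# d) ratio of interval widths for n = 100 versus n = 400 (same SD, same confidence)
width_ratio = math.sqrt(400) / math.sqrt(100)

print(round(margin_90), width_ratio)
```

The standard error does not change with the confidence level, so only the multiplier is swapped in part e); in part d) only √n changes, which is why the ratio is 2 rather than 4.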

Answer to Question E3 (MT2006Q4)


a) Mean = 0.7; SD = 0.059 (= √(0.7 × 0.3/60))
b) D. 0.05
c) Use a multiplier from the t-distribution with 3 df.:
51 ± 2.353(3/√4) = 51 ± 3.53 or (47.47 ; 54.53)
d) C. 99%
Details and Comments:
b) Convert the sample count to a sample proportion: p̂ = 48/60 = 0.80
Then find Pr (p̂ > 0.80) = Pr (Z > [0.80 - 0.70]/0.059) = Pr (Z > 1.69) = 0.0455
d) Margin of error = 8.76 = t × SE; hence t = 8.76/SE = 8.76/1.5 = 5.84.
From Table T, Pr (t3 > 5.84) corresponds to a two-tail probability of 0.01 and a
confidence level of 99%.
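
Parts a) and b) can be verified with the normal approximation to the sampling distribution of a proportion. This is a minimal sketch using only the values given in the question; the variable names are illustrative.

```python
from statistics import NormalDist
import math

p, n = 0.70, 60                      # success rate and number of shots

# a) mean and SD of the sampling distribution of the sample proportion
se = math.sqrt(p * (1 - p) / n)

# b) probability of 48 or more hits, i.e. a sample proportion of at least 0.80
p_hat = 48 / n
prob_48_or_more = 1 - NormalDist(p, se).cdf(p_hat)

print(round(se, 3), round(prob_48_or_more, 3))
```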


Answer to Question E4 (MT2006Q8)


a) Categorical: Manufacturer, Size
Quantitative: Gas mileage, Weight
b) B. 20
c) A. statistic
d) C. The results are not valid because the response is voluntary
e) E. None of them
Details and Comments:
b) Use the rule of thumb that, for small samples (n ≈ 20), SD is approximately the range
divided by 4.
c) A statistic is a quantity computed from the sample data.
e) Statement (I) uses a categorical variable; Statement (II) has r > 1, Statement (III)
reports units for r. None of these are possible for r.

Answer to Bonus Question 1:


A.
Use the definition of variance and do the algebra yourself!
Answer to Bonus Question 2:
A. Mean = μ, SD = σ
Calculation/algebra involved:
Mean of X: [(μ - σ) + (μ + σ)]/2 = 2μ/2 = μ
Variance of X: {[(μ - σ) - μ]² + [(μ + σ) - μ]²}/2 = {σ² + σ²}/2 = σ²
SD of X = σ
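
A numeric check of the algebra for a random variable that takes the values μ - σ and μ + σ with probability 0.5 each. The values μ = 10 and σ = 3 are hypothetical, chosen only to illustrate; the function name is likewise an assumption.

```python
import math

def two_point_mean_sd(mu, sigma):
    """Mean and SD of X taking values mu - sigma and mu + sigma, each with probability 0.5."""
    values = [mu - sigma, mu + sigma]
    probs = [0.5, 0.5]
    mean = sum(p * v for p, v in zip(probs, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    return mean, math.sqrt(variance)

mean, sd = two_point_mean_sd(10.0, 3.0)   # hypothetical mu and sigma
print(mean, sd)
```

Because the probabilities are known, the variance divides by 2 (a probability-weighted average), not by n - 1 as in a sample variance.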

END OF PRACTICE QUESTIONS FOR THE MIDTERM EXAM
