
STA 6166

STATISTICAL RESEARCH METHODS I


Demetris Athienitis
Department of Statistics, University of Florida

Contents

Part 1 Material

1 Descriptive Statistics
  1.1 Concept
  1.2 Summary Statistics
    1.2.1 Location
    1.2.2 Spread
    1.2.3 Effect of shifting and scaling measurements
  1.3 Graphical Summaries
    1.3.1 Dot Plot
    1.3.2 Histogram
    1.3.3 Box-Plot
    1.3.4 Pie chart
    1.3.5 Scatterplot

2 Probability
  2.1 Sample Space and Events
    2.1.1 Basic concepts
    2.1.2 Relating events
  2.2 Probability
  2.3 Conditional Probability and Independence
    2.3.1 Independent Events
    2.3.2 Law of Total Probability
    2.3.3 Bayes' Rule
  2.4 Random Variables
    2.4.1 Expected Value And Variance
    2.4.2 Population Percentiles
    2.4.3 Common Discrete Distributions
    2.4.4 Common Continuous Distributions
    2.4.5 Covariance
    2.4.6 Mean and variance of linear combinations
    2.4.7 Central Limit Theorem

3 Inference For Population Mean
  3.1 Confidence intervals
    3.1.1 Large sample C.I. for population mean
    3.1.2 Small sample C.I. for population mean
    3.1.3 Sample size for a C.I. of fixed level and width
  3.2 Hypothesis Testing
    3.2.1 One sample hypothesis tests
    3.2.2 Small sample test for population mean

Part 2 Material

4 Inference For Population Proportion
  4.1 Large sample C.I. for population proportion
  4.2 Large sample test for population proportion

5 Inference For Two Population Means
  5.1 Two Sample C.I.s
    5.1.1 Large sample C.I. for two means
    5.1.2 Small sample C.I. for two means
    5.1.3 Large sample C.I. for two population proportions
    5.1.4 C.I. for paired data
  5.2 Two Sample Hypothesis Tests (optional)
    5.2.1 Large sample test for difference of two means
    5.2.2 Small sample test for difference of two means
    5.2.3 Large sample test for difference of two proportions
    5.2.4 Test for paired data
  5.3 Normal Probability Plot

6 Nonparametric Procedures For Population Location
  6.1 Sign test
  6.2 Wilcoxon rank-sum test
  6.3 Wilcoxon signed-rank test

7 Inference About Population Variances
  7.1 Inference On One Variance
  7.2 Comparing Two Variances
  7.3 Comparing t ≥ 2 Variances

8 Contingency Tables

Part 3 Material

9 Regression
  9.1 Simple Linear Regression
    9.1.1 Goodness of fit
    9.1.2 Distribution of response and coefficients
    9.1.3 Inference on slope coefficient
    9.1.4 C.I. on the mean response
    9.1.5 Prediction interval
    9.1.6 Checking assumptions
    9.1.7 Box-Cox (Power) transformation
  9.2 Multiple Regression
    9.2.1 Model
    9.2.2 Goodness of fit
    9.2.3 Inference

10 Analysis Of Variance
  10.1 Completely Randomized Design
    10.1.1 Post-hoc comparisons
    10.1.2 Nonparametric procedure
  10.2 Randomized Block Design
    10.2.1 Nonparametric procedure
Part I
Part 1 Material

Chapter 1
Descriptive Statistics
Chapter 3 in textbook

1.1 Concept

Definition 1.1. Population parameters are a numerical summary concerning the complete collection of subjects, i.e. the population.
Population parameters are notated by Greek symbols, such as the population mean μ.
Definition 1.2. Sample statistics are a numerical summary concerning a subset of the population, i.e. the sample, from which we try to draw inference about the population parameter.
Sample statistics are notated by placing a hat over the population parameter symbol, such as μ̂ for the sample mean, or sometimes for convenience by a symbol from the English alphabet; for the sample mean we write x̄.

1.2 Summary Statistics

Let x1 , . . . , xn denote n observations/numbers.

1.2.1 Location

The mean is the arithmetic average of the observations: x̄ = (1/n) Σ_{i=1}^n x_i.
The median is the center of the ordered data.
If n is odd then the median is located at the (n + 1)/2 position of the ordered data.
If n is even the median is the average of two observations, the one located at the n/2 position and the one at the (n/2) + 1 position.
The mode is the most frequently encountered observation.

The α% trimmed mean is the mean of the data with the smallest α% × n observations and the largest α% × n observations truncated from the data.
The pth percentile value divides the ordered data such that p% of the data are less than that value and (100 − p)% are greater than it. It is located at the (p/100)(n + 1) position of the ordered data. If the position value is not an integer then average the values at the positions just below and just above (p/100)(n + 1). The median is actually the 50th percentile.
According to the textbook p.76, the j-th ordered observation corresponds to the 100(j − 0.5)/n percentile.

Example 1.1. The following values of fracture stress (in megapascals) were measured for a sample of 24 mixtures of hot mixed asphalt (HMA).

30 75 79 80 80 105 126 138 149 179 179 191
223 232 232 236 240 242 245 247 254 274 384 470

Hence, Σ_{i=1}^{24} x_i = 30 + 75 + ... + 384 + 470 = 4690 and thus x̄ = 4690/24 = 195.4167.
The median is the average of the observations at the 12th and 13th position of the ordered data, i.e. x̃ = (191 + 223)/2 = 207.
There are three modes: 80, 179 and 232.
To compute the 5% trimmed mean we need to remove 0.05(24) = 1.2 ≈ 1 observations from the lower and upper side of the data. Hence remove 30 and 470 and recalculate the average of the remaining 22 observations. That is 190.45.
The 25th percentile (a.k.a. 1st Quartile) is located at the (25/100)(24 + 1) = 6.25 position. So average the values at the 6th and 7th position, i.e. (105 + 126)/2 = 115.5.
http://www.stat.ufl.edu/~athienit/STA6166/loc_stats.pdf
Remark 1.1. Note that the mean is more sensitive to outliers (observations that do not fall in the general pattern of the rest of the data) than the median. Assume we have the values 2, 3, 5. The mean is 3.33 and the median is 3. Assume we now have 2, 3, 5, 112. The mean is 30.5 but the median is now 4.

1.2.2 Spread

The variance is a measure of spread of the individual observations from their center as indicated by the mean:

σ̂² = s² = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)² = (1/(n−1)) [ Σ_{i=1}^n x_i² − n x̄² ]

The standard deviation is simply the square root of the variance, in order to return to the original units of measurement.
The range is the maximum observation − minimum observation.
The interquartile range (IQR) is the 75th percentile − 25th percentile (or Q3 − Q1).
Example 1.2. Continuing from Example 1.1, we have Σ x_i² = 1152494, and hence
s² = (1/23)(1152494 − 24(195.4167)²) = 10260.43 and s = √10260.43 = 101.2938.
The range is 470 − 30 = 440.
The IQR is 243.5 − 115.5 = 128.
http://www.stat.ufl.edu/~athienit/STA6166/loc_stats.pdf
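The linked handout is not reproduced here; a minimal R sketch of the calculations in Examples 1.1 and 1.2 (with quartiles computed by the textbook's position rule rather than R's default interpolation) could look as follows.

  x <- c(30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
         223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470)

  mean(x)                # 195.4167
  median(x)              # 207
  mean(x, trim = 0.05)   # 5% trimmed mean, 190.45
  mean(sort(x)[6:7])     # 25th percentile by the (p/100)(n+1) rule: 115.5
  var(x)                 # 10260.43
  sd(x)                  # 101.2938
  max(x) - min(x)        # range: 440
  mean(sort(x)[18:19]) - mean(sort(x)[6:7])   # IQR by the same quartile rule: 128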

1.2.3 Effect of shifting and scaling measurements

As we know, measurements can be made in different scales, e.g. cm, m, km, etc., and even in different units of measurement, e.g. Kelvin, Celsius, Fahrenheit. Let us see how shifting and rescaling influence the mean and variance.
Let x_1, ..., x_n denote the data and define y_i = a x_i + b, where a and b are some constants. Then,

ȳ = (1/n) Σ y_i = (1/n) Σ (a x_i + b) = (1/n) (n b + a Σ x_i) = a x̄ + b,

and,

s_y² = (1/(n−1)) Σ (y_i − ȳ)² = (1/(n−1)) Σ (a x_i + b − a x̄ − b)² = a² (1/(n−1)) Σ (x_i − x̄)² = a² s_x².
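These identities are easy to verify numerically; a quick illustration in R (any data and constants will do):

  x <- c(30, 75, 79, 80)
  a <- 1.8; b <- 32          # e.g. a Celsius-to-Fahrenheit style rescaling
  y <- a * x + b

  mean(y); a * mean(x) + b   # identical values
  var(y);  a^2 * var(x)      # identical values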

1.3 Graphical Summaries

1.3.1 Dot Plot

Stack each observation on a horizontal line to create a dot plot that gives an
idea of the shape of the data. Some rounding of data values is allowed in
order to stack.

Figure 1.1: Dot plot of data from Example 1.1 (fracture stress in MPa)

1.3.2 Histogram

1. Create class intervals (by choosing boundary points) in which to place the data.
2. Construct a Frequency Table.
3. Draw a rectangle for each class.
It is up to the researcher to decide how many class intervals to create. As a rule of thumb one creates about 2n^{1/3} classes. For Example 1.1 that is 5.75, so we can go with either 5 or 6 classes.
Class Interval   Freq.   Relative Freq.   Density
0 -< 100           5     5/24 = 0.208     0.208/100 = 0.00208
100 -< 200         7     7/24 = 0.292     0.292/100 = 0.00292
200 -< 300        10     0.417            0.00417
300 -< 400         1     0.0417           0.000417
400 -< 500         1     0.0417           0.000417

Table 1.1: Frequency Table for Example 1.1

Figure 1.2: Histogram of data from Example 1.1 (vertical axis: density; horizontal axis: fracture stress in MPa)



Remark 1.2. One may use Frequency, Relative Frequency or Density as the vertical axis when the class widths are equal. However, class widths are not necessarily equal; unequal widths are usually used to create smoother graphics, if not mandated by the situation at hand. In that case we must use Density, which accounts for the width, because large classes may have unrepresentatively large frequencies.
http://www.stat.ufl.edu/~athienit/STA6166/hist1_boxplot1.R
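The linked script is not shown here; a minimal sketch of a density-scale histogram using the class boundaries of Table 1.1 might be:

  x <- c(30, 75, 79, 80, 80, 105, 126, 138, 149, 179, 179, 191,
         223, 232, 232, 236, 240, 242, 245, 247, 254, 274, 384, 470)
  hist(x, breaks = seq(0, 500, by = 100), freq = FALSE,
       xlab = "Fracture stress in MPa", main = "Histogram")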

1.3.3 Box-Plot

A Box-Plot is a graphic that only uses quartiles. A box is created with Q1, Q2, and Q3. A lower whisker is drawn from Q1 down to the smallest data point that is within 1.5 IQR of Q1. Hence from Q1 = 115.5 down to Q1 − 1.5 IQR = 115.5 − 1.5(128) = −76.5, but we stop at the smallest point within that range, which is 30. Similarly the upper whisker is drawn from Q3 = 243.5 to Q3 + 1.5 IQR = 435.5, but we stop at the largest point within that range, which is 384.

Figure 1.3: Box-Plot of data from Example 1.1


Remark 1.3. Any point beyond the whiskers is classified as an outlier, and any point beyond 3 IQR from either Q1 or Q3 is classified as an extreme outlier.
http://www.stat.ufl.edu/~athienit/STA6166/hist1_boxplot1.R

These densities have shapes that can be described as: symmetric, skewed left, skewed right, or bi-modal (with possibly more than two modes).

1.3.4 Pie chart

A pie, or circle, has 360 degrees. For each category of a variable, the size of the slice is determined by the fraction of 360 that corresponds to that category.
Example 1.3. There is a total of 337,297,000 native English speakers in the world, categorized as follows.
Country     Pop. (1000)   % of Total   % of pie
USA           226,710       67.21      0.6721(360) = 241.97
UK             56,990       16.90       60.83
Canada         19,700        5.84       21.02
Australia      15,316        4.54       16.35
Other          18,581        5.51       19.83
Total         337,297      100         360

Table 1.2: Frequency table for native English speakers of 1997


Figure 1.4: Pie chart of English speaking countries (USA 67%, UK 17%, Canada 6%, Australia 5%, Other 6%)
http://www.stat.ufl.edu/~athienit/STA6166/pie.R

1.3.5 Scatterplot

A scatterplot is used to plot the raw 2-D points of two variables in an attempt to discern a relationship.
Example 1.4. A small study with 7 subjects on the pharmacodynamics of LSD, concerning how LSD tissue concentration affects the subjects' math scores, yielded the following data.
Score   78.93  58.20  67.47  37.47  45.65  32.92  29.97
Conc.    1.17   2.97   3.26   4.69   5.83   6.00   6.41

Table 1.3: Math score with LSD tissue concentration


Figure 1.5: Scatterplot of Math score vs. LSD tissue concentration
http://www.stat.ufl.edu/~athienit/STA6166/scatterplot.R
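A minimal R sketch of such a plot, using the data in Table 1.3 (the linked script may differ):

  score <- c(78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97)
  conc  <- c(1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41)
  plot(conc, score, xlab = "LSD tissue concentration", ylab = "Math score",
       main = "Scatterplot")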


Chapter 2
Probability
Chapter 4.1 - 4.5 in textbook.
The study of probability began in the 17th century when gamblers started hiring mathematicians to calculate the odds of winning for different types of games.

2.1 Sample Space and Events

2.1.1 Basic concepts

Definition 2.1. The set of all possible outcomes of an experiment is called


the sample space (S) for the experiment.
Example 2.1. Here are some basic examples:
Rolling a die. Then S = {1, 2, 3, 4, 5, 6}
Tossing a quarter and a penny. S = {Hh, Ht, Th, Tt}
Counting the number of flaws in my personality. S = {1, 2, . . .}
Machine cuts rods of certain length (in cm). S = {x|5.6 < x < 6.4}
Remark 2.1. Elements in S may not be equally weighted.
Definition 2.2. A subset of a sample space is called an event.
For instance the empty set ∅ = {} and the entire sample space S are also events.
Example 2.2. Let A be the event of an even outcome when rolling a die. Then, A = {2, 4, 6} ⊆ S.

2.1.2 Relating events

When we are concerned with multiple events within the sample space, Venn Diagrams are useful to help explain some of the relationships. Let's illustrate this via an example.
Example 2.3. Let,
S = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
A = {1, 3, 5, 7, 9}
B = {6, 7, 8, 9, 10}

Figure 2.1: Venn Diagram


Combining events implies combining the elements of the events. For example,
A ∪ B = {1, 3, 5, 6, 7, 8, 9, 10}.
Intersecting events implies only listing the elements that the events have in common. For example,
A ∩ B = {7, 9}.
The complement of an event implies listing all the elements in the sample space that are not in that event. For example,
A^c = {2, 4, 6, 8, 10}     (A ∪ B)^c = {2, 4}.

Definition 2.3. A collection of events A1, A2, ... is mutually exclusive if no two of them have any outcomes in common. That is, A_i ∩ A_j = ∅ for all i ≠ j.
In terms of the Venn Diagram, there is no overlapping between them.

2.2 Probability

Notation: Let P (A) denote the probability that the event A occurs. It is the
proportion of times that the event A would occur in the long run.
Axioms of Probability:
  P(S) = 1
  0 ≤ P(A) ≤ 1, since A ⊆ S
  If A1, A2, ... are mutually exclusive, then P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ...
As a result of the axioms we have that P(A) = 1 − P(A^c) and that P(∅) = 0.
Example 2.4. In a computer lab there are 4 computers and once a day a technician inspects them and counts the number of computer crashes. Hence, S = {0, 1, 2, 3, 4} and

Crashes       0     1     2     3     4
Probability  0.60  0.30  0.05  0.04  0.01

Table 2.1: Probabilities for computer crashes


Let A be the event that at least one crash occurs on a given day. Then
P(A) = 0.30 + 0.05 + 0.04 + 0.01 = 0.4, or equivalently P(A) = 1 − P(A^c) = 1 − 0.60 = 0.4.
If S contains N equally likely outcomes/elements and the event A contains k (≤ N) outcomes then,
P(A) = k/N.
Example 2.5. The experiment consists of rolling a die. There are 6 outcomes
in the sample space, all of which are equally likely (assuming a fair die).
Then, if A is the event of an outcome of a roll being even, A = {2, 4, 6} with
3 elements so, P (A) = 3/6 = 0.5

The axioms provide a way of finding the probability of a union of two events, but only if they are mutually exclusive. What if they are not mutually exclusive? We illustrate the formula via the following example.
Example 2.6. Let's continue from Example 2.3. We have shown that A ∪ B = {1, 3, 5, 6, 7, 8, 9, 10} and A ∩ B = {7, 9}. So,
P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
We need to subtract the probability of the intersection set {7, 9}, as that probability was double counted since it is included within A and within B.

2.3 Conditional Probability and Independence

Definition 2.4. A probability that is based upon the entire sample space is called an unconditional probability, but when it is based upon a subset of the sample space it is a conditional (on the subset) probability.
Definition 2.5. Let A and B be two events with P(B) ≠ 0. Then the conditional probability of A given B (has occurred) is
P(A|B) = P(A ∩ B) / P(B).
The reason that we divide by P(B), the probability of the given occurrence, is to re-standardize the sample space. We update the sample space to be just B, i.e. S = B, and hence P(B|B) = 1. The only part of event A that occurs within this new S = B is P(A ∩ B).
Proposition 2.1. Rule of Multiplication:
If P(A) ≠ 0, then P(A ∩ B) = P(B|A)P(A).
If P(B) ≠ 0, then P(A ∩ B) = P(A|B)P(B).
Example 2.7. A player serving at tennis is only allowed one fault. On a double fault the server loses a point (the other player gains a point). We are given the following information (tree diagram): the first serve is a Fault with probability 0.44 and a Success with probability 0.56; given a first fault, the second serve is a Fault, resulting in loss of the point, with probability 0.02 and a Success with probability 0.98.
What is the probability that the server loses a point, i.e. P(Fault 1 and Fault 2)?
P(Fault 1 and Fault 2) = P(Fault 2 | Fault 1) P(Fault 1) = (0.02)(0.44) ≈ 0.009

2.3.1 Independent Events

When the given occurrence of one event does not influence the probability of a potential outcome of another event, then the two events are said to be independent.
Definition 2.6. Two events A and B are independent if the probability of each remains the same, whether or not the other has occurred. If P(A) ≠ 0 and P(B) ≠ 0, then
P(B|A) = P(B)   and, equivalently,   P(A|B) = P(A).
If either P(A) = 0 or P(B) = 0, then the two events are independent.
Definition 2.7. (Generalization) The events A1, ..., An are independent if for each Ai and each collection Aj1, ..., Ajm of events with P(Aj1 ∩ ... ∩ Ajm) ≠ 0,
P(Ai | Aj1 ∩ ... ∩ Ajm) = P(Ai).
As a consequence of independence, the rule of multiplication then says
P(A ∩ B) = P(A|B)P(B) = P(A)P(B),
and in the general case
P(A1 ∩ ... ∩ Ak) = ∏_{i=1}^k P(Ai),   0 < k ≤ n.

Example 2.8. Of the microprocessors manufactured by a certain process, 20% of them are defective. Assume they function independently. Five microprocessors are chosen at random. What is the probability that they will all work?
Let Ai denote the event that the ith microprocessor works, for i = 1, 2, 3, 4, 5. Then,
P(all work) = P(A1 ∩ A2 ∩ A3 ∩ A4 ∩ A5) = P(A1)P(A2)P(A3)P(A4)P(A5) = 0.8^5 = 0.328

2.3.2 Law of Total Probability

Recall that the sequence of events A1, ..., An is mutually exclusive if no two of them have any elements in common, i.e. Ai ∩ Aj = ∅ for all i ≠ j. We also say that the sequence is exhaustive if the union of all the events is the sample space, i.e. ∪_{i=1}^n Ai = S.
Proposition 2.2. Law of Total Probability
If A1, ..., An are mutually exclusive and exhaustive events, and B is any event, then,
P(B) = Σ_{i=1}^n P(Ai ∩ B) = P(A1 ∩ B) + ... + P(An ∩ B).
Equivalently, if P(Ai) ≠ 0 for each Ai,
P(B) = Σ_{i=1}^n P(B|Ai)P(Ai) = P(B|A1)P(A1) + ... + P(B|An)P(An).
To better illustrate this proposition let n = 4 and look at Figure 2.2.

Figure 2.2: Venn Diagram illustrating Law of Total Probability

Example 2.9. Customers can purchase a car with three options for engine sizes:
Small (45% sold), Medium (35% sold), Large (20% sold).
Of the cars with the small engine, 10% fail an emissions test within 10 years of purchase, while 12% of the medium and 15% of the large fail.
What is the probability that a randomly chosen car will fail the emissions test within 10 years?
IN CLASS
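The solution is worked in class; purely as a numerical sketch, the law of total probability for this example can be evaluated in R as:

  p_engine <- c(small = 0.45, medium = 0.35, large = 0.20)   # P(A_i)
  p_fail   <- c(small = 0.10, medium = 0.12, large = 0.15)   # P(B | A_i)
  sum(p_fail * p_engine)   # P(B) = 0.045 + 0.042 + 0.030 = 0.117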

2.3.3 Bayes' Rule

In most cases P(B|A) ≠ P(A|B). Bayes' rule provides a method to calculate one conditional probability if we know the other one. It uses the rule of multiplication in conjunction with the law of total probability.
Proposition 2.3. Bayes' Rule
Special Case: Let A and B be two events with P(A) ≠ 0, P(A^c) ≠ 0, and P(B) ≠ 0. Then,
P(A|B) = P(A ∩ B) / P(B) = P(B|A)P(A) / [ P(B|A)P(A) + P(B|A^c)P(A^c) ].
General Case: Let A1, ..., An be mutually exclusive and exhaustive events with P(Ai) ≠ 0 for each i = 1, ..., n. Let B be any event with P(B) ≠ 0. Then,
P(Ak|B) = P(B|Ak)P(Ak) / Σ_{i=1}^n P(B|Ai)P(Ai).

Example 2.10. In a telegraph signal a dot or dash is sent. Assume that
P(dot sent) = 3/7,   P(dash sent) = 4/7.
Suppose that there is some interference and with probability 1/8 a dot is mistakenly received on the other end as a dash, and vice versa.
Find P(dot sent | dash received).
IN CLASS
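The derivation is done in class; as a numerical sketch only, Bayes' rule for this example would be evaluated along the following lines:

  p_dot  <- 3/7; p_dash <- 4/7       # prior probabilities of what was sent
  p_dashrec_given_dot  <- 1/8        # dot garbled into a dash
  p_dashrec_given_dash <- 7/8        # dash received correctly

  # P(dot sent | dash received) via Bayes' rule
  (p_dashrec_given_dot * p_dot) /
    (p_dashrec_given_dot * p_dot + p_dashrec_given_dash * p_dash)   # 3/31, about 0.097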

2.4 Random Variables
Chapter 4.6 - 4.10 in textbook.


Definition 2.8. A random variable is a function that assigns a numerical value to each outcome in a sample space.
It is an outcome characteristic that is unknown prior to the experiment. For example, an experiment may consist of tossing two dice. One potential random variable could be the sum of the outcome of the two dice, i.e. X = sum of two dice. Then, X is a random variable. Another experiment could consist of applying different amounts of a chemical agent, and a potential random variable could consist of measuring the amount of final product created in grams.
Quantitative random variables can either be discrete, in which case they have a countable set of possible values, or continuous, in which case the set of possible values is uncountably infinite.
Notation: For a discrete random variable (r.v.) X, the probability distribution is the probability of a certain outcome occurring, denoted as
P(X = x) = pX(x).
This is also called the probability mass function (p.m.f.).

Notation: For a continuous random variable (r.v.) X, the probability density function (p.d.f.), denoted by fX(x), models the relative frequency of X. Since there are infinitely many outcomes within an interval, the probability evaluated at a single point is always zero, i.e. P(X = x) = 0 for all x when X is a continuous r.v.
Conditions for a function to be a:
  p.m.f.: 0 ≤ p(x) ≤ 1 and Σ_x p(x) = 1
  p.d.f.: f(x) ≥ 0 and ∫_{−∞}^{∞} f(x) dx = 1

Example 2.11. (Discrete) Suppose a storage tray contains 10 circuit boards,


of which 6 are type A and 4 are type B, but they both appear similar. An
inspector selects 2 boards for inspection. He is interested in X = number of
type A boards. What is the probability distribution of X?
The sample space of X is {0, 1, 2}. We can calculate the following:
p(2) = P (A on first)P (A on second|A on first)
= (6/10)(5/9) = 0.3333
p(1) = P (A on first)P (B on second|A on first)
+ P (B on first)P (A on second|B on first)
= (6/10)(4/9) + (4/10)(6/9) = 0.5333
p(0) = P (B on first)P (B on second|B on first)
= (4/10)(3/9) = 0.1334
Consequently,
X = x     0       1       2      Total
p(x)    0.1334  0.5333  0.3333    1.0

Table 2.2: Probability Distribution of X

Example 2.12. (Continuous) The lifetime of a certain battery has a distribution that can be approximated by f(x) = 0.5 e^{−0.5x}, x > 0.

Figure 2.3: Probability density function of battery lifetime (x-axis: lifetime in 100 hours).


Notation: You may recall that ∫ f(t) dt is contrived from lim Σ f(t_i) Δ_i. Hence for the following definitions and expressions we will only be using notation for continuous variables; wherever you see ∫ simply replace it with Σ for the discrete case.

Definition 2.9. The cumulative distribution function (c.d.f.) of a r.v. X is denoted by FX(x) and defined as
FX(x) = P(X ≤ x) = ∫_{−∞}^x f(t) dt     (for a discrete r.v., = Σ_{t ≤ x} p(t)).

Example 2.13. Example 2.11 continued. Find F(1). That is,
F(1) = P(X ≤ 1) = P(X = 0) + P(X = 1) = 0.1334 + 0.5333 = 0.6667
Example 2.14. Example 2.12 continued. Find F(1). That is,
F(1) = ∫_{−∞}^1 f(x) dx = ∫_{−∞}^0 0 dx + ∫_0^1 0.5 e^{−0.5x} dx = 0 + (−e^{−0.5x}) |_0^1 = 1 − e^{−0.5} = 0.3935
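This integral is easy to check numerically in R (a side check only; note that this p.d.f. is that of an exponential distribution with rate 0.5):

  f <- function(x) 0.5 * exp(-0.5 * x)   # p.d.f. from Example 2.12
  integrate(f, lower = 0, upper = 1)     # 0.3935, matching F(1)
  pexp(1, rate = 0.5)                    # same value via the built-in c.d.f.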

2.4.1 Expected Value And Variance

The expected value of a r.v. is thought of as the long term average for that variable. Similarly, the variance is thought of as the long term average squared deviation of the values of the r.v. from the expected value.
Definition 2.10. The expected value (or mean) of a r.v. X is
μX := E(X) = ∫ x f(x) dx     (for a discrete r.v., = Σ_x x p(x)).
In actuality, this definition is a special case of a much broader statement.


Definition 2.11. The expected value (or mean) of a function h(·) of a r.v. X is
E(h(X)) = ∫ h(x) f(x) dx.
Due to this last definition, if the function h performs a simple linear transformation, such as h(t) = at + b, for constants a and b, then
E(aX + b) = ∫ (ax + b) f(x) dx = a ∫ x f(x) dx + b ∫ f(x) dx = a E(X) + b.
Example 2.15. Referring back to Example 2.11, the expected value of the number of type A boards (X) is
E(X) = Σ_x x p(x) = 0(0.1334) + 1(0.5333) + 2(0.3333) = 1.1999.
We can also calculate the expected value of (i) 5X + 3 and (ii) 3X².
(i) 5(1.1999) + 3 = 8.9995.
(ii) 3(0²)(0.1334) + 3(1²)(0.5333) + 3(2²)(0.3333) = 5.5995
Definition 2.12. The variance of a r.v. X is
σX² := V(X) = E[(X − μX)²]
            = ∫ (x − μX)² f(x) dx
            = ∫ (x² − 2xμX + μX²) f(x) dx
            = ∫ x² f(x) dx − 2μX ∫ x f(x) dx + μX² ∫ f(x) dx
            = E(X²) − 2E²(X) + E²(X)
            = E(X²) − E²(X)

Example 2.16. We know that E(X) = 1.1999 and E(X²) = 0²(0.1334) + 1²(0.5333) + 2²(0.3333) = 1.8665. Thus,
V(X) = E(X²) − E²(X) = 1.8665 − 1.1999² = 0.42674
Example 2.17. IN CLASS. Variance for Example 2.12
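As a quick numerical check of Examples 2.15 and 2.16 (a side computation only):

  x <- 0:2
  p <- c(0.1334, 0.5333, 0.3333)     # p.m.f. from Table 2.2

  EX  <- sum(x * p);   EX            # 1.1999
  EX2 <- sum(x^2 * p); EX2           # 1.8665
  EX2 - EX^2                         # variance, about 0.4267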


Definition 2.13. The variance of a function h of a r.v. X is
V(h(X)) = ∫ [h(x) − E(h(X))]² f(x) dx = E(h²(X)) − E²(h(X)).
Notice that if h stands for a linear transformation function then,
V(aX + b) = ... IN CLASS

2.4.2 Population Percentiles

Let X be a continuous r.v. with p.d.f. f and c.d.f. F. The population pth percentile, x_p, is found by solving the following equation for x_p:
F(x_p) = ∫_{−∞}^{x_p} f(t) dt = p/100.
Example 2.18. Let the r.v. X have p.d.f. f(x) = 0.5 e^{−0.5x}, x > 0. The median of X is found by solving for x_m in
F(x_m) = ∫_0^{x_m} 0.5 e^{−0.5t} dt = 0.5.
We note that
∫_0^{x_m} 0.5 e^{−0.5t} dt = −e^{−0.5t} |_0^{x_m} = −e^{−0.5 x_m} − (−e^0) = −e^{−0.5 x_m} + 1.
Hence, we need to solve
−e^{−0.5 x_m} + 1 = 0.5
−0.5 x_m = log 0.5
x_m = −2 log 0.5 = 1.386294
Example 2.19. Example 2.11, IN CLASS
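A numerical side check of Example 2.18 (this density is the exponential distribution with rate 0.5, so R's quantile function applies):

  qexp(0.5, rate = 0.5)   # 1.386294, the median found above
  -2 * log(0.5)           # same value from the closed-form solution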

2.4.3 Common Discrete Distributions

Bernoulli
Imagine an experiment where the r.v. X can take only two possible outcomes, success (X = 1) with some probability p and failure (X = 0) with probability 1 − p. The p.m.f. of X is
p(x) = p^x (1 − p)^{1−x},   x = 0, 1,   0 ≤ p ≤ 1,
and we denote this by stating X ~ Bernoulli(p). The mean of X is
E(X) = Σ_x x p(x) = 0 p(0) + 1 p(1) = p,
and the variance is
V(X) = E(X²) − E²(X) = [0² p(0) + 1² p(1)] − p² = p − p² = p(1 − p).
Example 2.20. A die is rolled and we are interested in whether the outcome is a 6 or not. Let X = 1 if the outcome is 6, and X = 0 otherwise.
Then, X ~ Bernoulli(1/6) with mean 1/6 and variance 5/36.
Binomial
If X1, ..., Xn correspond to n Bernoulli trials conducted where
  the trials are independent,
  each trial has identical probability of success p,
  the r.v. X is the total number of successes,
then X = Σ_{i=1}^n Xi ~ Bin(n, p). The intuition behind the form of the p.m.f. can be motivated by the following example.
Example 2.21. A fair coin is tossed 10 times and X = the number of heads is recorded. What is the probability that X = 3?
One possible outcome is
(H) (H) (H) (T) (T) (T) (T) (T) (T) (T)
The probability of this outcome occurring in exactly this order is p³(1 − p)⁷. However there are (10 choose 3) possible ways of obtaining 3 Heads and 7 Tails, since order is not important.
Consequently, the p.m.f. of X ~ Bin(n, p) is
p(x) = (n choose x) p^x (1 − p)^{n−x},   x = 0, 1, ..., n,
with E(X) = np and V(X) = np(1 − p).
Another variable of interest concerning experiments with binary outcomes is the proportion of successes p̂ = X/n. Note that p̂ is simply the r.v. X multiplied by a constant, 1/n. Hence,
E(p̂) = E(X/n) = np/n = p
and
V(p̂) = V(X/n) = (1/n²) V(X) = np(1 − p)/n² = p(1 − p)/n.

Example 2.22. A die is rolled 4 times and the number of 6s is observed. Find the probability that there is at least one 6.
Let X be the number of 6s, which implies X ~ Bin(4, 1/6).
P(X ≥ 1) = Σ_{i=1}^4 (4 choose i) (1/6)^i (1 − 1/6)^{4−i}
         = 1 − P(X < 1)
         = 1 − P(X = 0)
         = 1 − (4 choose 0) (1/6)^0 (1 − 1/6)^4
         = 0.518
Also, E(X) = 4(1/6) = 2/3 and V(X) = 4(1/6)(5/6) = 5/9. The expected value of the proportion of 6s is E(p̂) = 1/6 and its variance is V(p̂) = (5/36)/4 = 5/144.
2.4.4 Common Continuous Distributions

Uniform
A continuous r.v. that places equal weight on all values within its support, [a, b], a ≤ b, is said to be a uniform r.v. It has p.d.f.
f(x) = 1/(b − a),   a ≤ x ≤ b.

Figure 2.4: Density function of Uniform[1, 5].

Hence if X ~ Uniform[a, b] then E(X) = (a + b)/2 and V(X) = (b − a)²/12.

Example 2.23. Waiting time for the delivery of a part from the warehouse to a certain destination is said to have a uniform distribution from 1 to 5 days. What is the probability that the delivery time is two or more days?
Let X ~ Uniform[1, 5]. Then, f(x) = 0.25 for 1 ≤ x ≤ 5 and hence
P(X ≥ 2) = ∫_2^5 0.25 dt = 0.75.
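Checking with R's uniform distribution functions:

  punif(2, min = 1, max = 5, lower.tail = FALSE)   # P(X >= 2) = 0.75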

Normal
The normal distribution (Gaussian distribution) is by far the most important distribution in statistics. The normal distribution is identified by a location parameter μ and a scale parameter σ² (> 0). A normal r.v. X is denoted as X ~ N(μ, σ²) with p.d.f.
f(x) = (1/(σ√(2π))) e^{−(x−μ)²/(2σ²)},   −∞ < x < ∞.

Figure 2.5: Density function of N(0, 1).

It is symmetric, unimodal, and bell shaped with E(X) = μ and V(X) = σ².
Notation: A normal random variable with mean 0 and variance 1 is called a standard normal r.v. It is usually denoted by Z ~ N(0, 1). The c.d.f. of a standard normal is given at the end of the textbook and is also available online (z-table), so that probabilities, which can be expressed in terms of the c.d.f., can be conveniently obtained.
Example 2.24. Find P(−2.34 < Z < −1). From the relevant remark,
P(−2.34 < Z < −1) = P(Z < −1) − P(Z < −2.34) = 0.1587 − 0.0096 = 0.1491

If Z is standard normal then it has mean 0 and variance 1. Now if we take a linear transformation of Z, say X = aZ + b, then
E(X) = E(aZ + b) = aE(Z) + b = b
and
V(X) = V(aZ + b) = a²V(Z) = a².
This fact together with the following proposition allows us to express any normal r.v. as a linear transformation of the standard normal r.v. Z by setting a = σ and b = μ.
Proposition 2.4. The r.v. X that is expressed as the linear transformation σZ + μ is also a normal r.v. with E(X) = μ and V(X) = σ².
Linear transformations are completely reversible, so given a normal r.v. X with mean μ and variance σ² we can revert back to a standard normal by
Z = (X − μ)/σ.
As a consequence, any probability statement made about an arbitrary normal r.v. can be reverted to a statement about a standard normal r.v.
Example 2.25. Let X ~ N(15, 7). Find P(13.4 < X < 19.0).
We begin by noting
P(13.4 < X < 19.0) = P( (13.4 − 15)/√7 < (X − 15)/√7 < (19.0 − 15)/√7 )
                   = P(−0.61 < Z < 1.51)
                   = P(Z < 1.51) − P(Z < −0.61)
                   = 0.93448 − 0.27093
                   = 0.66355
If one is using a computer there is no need to revert back and forth from a standard normal, but it is always useful to standardize concepts.
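For instance, Example 2.25 can be evaluated directly in R (using the standard deviation √7):

  pnorm(19, mean = 15, sd = sqrt(7)) - pnorm(13.4, mean = 15, sd = sqrt(7))   # about 0.66
  pnorm(1.51) - pnorm(-0.61)   # same answer up to rounding of the z values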
Example 2.26. The height of males in inches is assumed to be normally distributed with mean 69.1 and standard deviation 2.6. Let X ~ N(69.1, 2.6²). Find the 90th percentile for the height of males.

Figure 2.6: N(69.1, 2.6²) distribution, with 90% of the area to the left of the 90th percentile.


IN CLASS
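The percentile is derived in class; as a numerical sketch only, it can be obtained with R's normal quantile function:

  qnorm(0.90, mean = 69.1, sd = 2.6)   # about 72.4 inches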

2.4.5 Covariance

The population covariance is a measure of the strength of a linear relationship between two variables.
Definition 2.14. Let X and Y be two r.v.s. The population covariance of X and Y is
Cov(X, Y) = E[(X − E(X))(Y − E(Y))] = E(XY) − E(X)E(Y).
Remark 2.2. If X and Y are independent, then
E(XY) = ∫∫ xy f(x, y) dx dy = ∫∫ xy fX(x) fY(y) dx dy = ∫ x fX(x) dx ∫ y fY(y) dy = E(X)E(Y),
and consequently Cov(X, Y) = 0, since the joint p.d.f. can be expressed as the product of the two marginal p.d.f.s. However, the converse is not true. Think of a circle such as sin²X + cos²Y = 1. Obviously, X and Y are dependent but they have no linear relationship. Hence, Cov(X, Y) = 0.

The covariance is not unitless, so a measure called the population correlation is used to describe the strength of the linear relationship; it
  is unitless,
  ranges from −1 to 1:
ρXY = Cov(X, Y) / ( √V(X) √V(Y) ).
A negative relationship implies a negative covariance and consequently a negative correlation.
To estimate the covariance and the correlation from a sample we use:
σ̂XY := Cov̂(X, Y) = (1/(n−1)) Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = (1/(n−1)) [ Σ_{i=1}^n x_i y_i − n x̄ ȳ ].
Therefore,
rXY := ρ̂XY = ( Σ_{i=1}^n x_i y_i − n x̄ ȳ ) / ( (n−1) sX sY ).

Example 2.27. Let's assume that we want to look at the relationship between two variables, height (in inches) and self esteem, for 20 individuals.

Height  68   71   62   75   58   60   67   68   71   69
Esteem  4.1  4.6  3.8  4.4  3.2  3.1  3.8  4.1  4.3  3.7

Height  68   67   63   62   60   63   65   67   63   61
Esteem  3.5  3.2  3.7  3.3  3.4  4.0  4.1  3.8  3.4  3.6

Table 2.3: Height to self esteem data

Hence, with Σ x_i y_i = 4937.6, x̄ = 65.4, ȳ = 3.755, sX = 4.406 and sY = 0.426,
rXY = (4937.6 − 20(65.4)(3.755)) / (19(4.406)(0.426)) = 0.731,
so there is a moderate to strong positive relationship.
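Entering the data of Table 2.3 in R gives the same correlation (a quick check, not from the course scripts):

  height <- c(68, 71, 62, 75, 58, 60, 67, 68, 71, 69,
              68, 67, 63, 62, 60, 63, 65, 67, 63, 61)
  esteem <- c(4.1, 4.6, 3.8, 4.4, 3.2, 3.1, 3.8, 4.1, 4.3, 3.7,
              3.5, 3.2, 3.7, 3.3, 3.4, 4.0, 4.1, 3.8, 3.4, 3.6)

  mean(height); mean(esteem)   # 65.4 and 3.755, as in the text
  cor(height, esteem)          # about 0.73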

2.4.6 Mean and variance of linear combinations

Let X and Y be two r.v.s. For aX + bY with constants a and b,
E(aX + bY) = aE(X) + bE(Y)
V(aX + bY) = a²V(X) + b²V(Y) + 2ab Cov(X, Y).
Example 2.28. Let X be a r.v. with E(X) = 3 and V(X) = 2, and Y be another r.v. independent of X with E(Y) = −5 and V(Y) = 6. Then,
E(X − 2Y) = E(X) − 2E(Y) = 3 − 2(−5) = 13
and
V(X − 2Y) = V(X) + 4V(Y) = 2 + 4(6) = 26.
Now we extend these two concepts to more than two r.v.s. Let X1, ..., Xn be a sequence of r.v.s and a1, ..., an a sequence of constants. Then the r.v. Σ_{i=1}^n ai Xi has mean and variance
E( Σ_{i=1}^n ai Xi ) = Σ_{i=1}^n ai E(Xi)                                          (2.1)
and
V( Σ_{i=1}^n ai Xi ) = Σ_{i=1}^n Σ_{j=1}^n ai aj Cov(Xi, Xj)
                     = Σ_{i=1}^n ai² V(Xi) + 2 Σ Σ_{i<j} ai aj Cov(Xi, Xj)          (2.2)
Example 2.29. Assume a random sample, i.e. independent identically distributed (i.i.d.) r.v.s, X1, ..., Xn, is to be obtained and of interest will be the specific linear combination corresponding to the sample mean X̄ = (1/n) Σ_{i=1}^n Xi. Since the r.v.s are i.i.d., let E(Xi) = μ and V(Xi) = σ² for i = 1, ..., n. Then,
E(X̄) = E( (1/n) Σ_{i=1}^n Xi ) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) n μ = μ
and, by independence,
V(X̄) = V( (1/n) Σ_{i=1}^n Xi ) = (1/n²) Σ_{i=1}^n V(Xi) = (1/n²) n σ² = σ²/n.
Remark 2.3. As the sample size increases, the variance of the sample mean decreases, with lim_{n→∞} V(X̄) = 0.
Proposition 2.5. A linear combination of independent normal random variables is a normal random variable.

2.4.7 Central Limit Theorem

The central limit theorem is a powerful statement concerning the mean of a random sample. There are three versions, the classical, the Lyapunov and the Lindeberg, but in effect they all make the same statement: the asymptotic distribution of the sample mean X̄ is normal, irrespective of the distribution of the individual r.v.s X1, ..., Xn.
Proposition 2.6. (Central Limit Theorem)
Let X1, ..., Xn be a random sample, i.e. i.i.d., with E(Xi) = μ < ∞ and V(Xi) = σ² < ∞. Then, for X̄ = (1/n) Σ_{i=1}^n Xi,
(X̄ − μ) / (σ/√n) → N(0, 1) in distribution.
Although the central limit theorem is an asymptotic statement, i.e. as the sample size goes to infinity, we can in practice implement it for sufficiently large sample sizes (n > 30), as the distribution of X̄ will be approximately normal with mean and variance derived from Example 2.29:
X̄ ~ approx. N(μ, σ²/n).
Remark 2.4. The following additional conditions/rule of thumb are needed to ensure that the CLT is applicable to the Binomial distribution:
np > 5 and n(1 − p) > 5.

Example 2.30. At a university the mean age of students is 22.3 and the standard deviation is 4. A random sample of 64 students is to be drawn. What is the probability that the average age of the sample will be greater than 23?
By the CLT,
X̄ ~ approx. N(22.3, 4²/64).
So we need to find
P(X̄ > 23) = P( (X̄ − 22.3)/(4/√64) > (23 − 22.3)/(4/√64) ) = P(Z > 1.4) = 0.0808
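A one-line R check of Example 2.30:

  pnorm(23, mean = 22.3, sd = 4/sqrt(64), lower.tail = FALSE)   # 0.0808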

Example 2.31. At a university assume it is known that 25% of students are over 21. In a sample of 400, what is the probability that more than 110 of them are over 21?
IN CLASS
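Worked in class; one numerical sketch uses the CLT normal approximation for the sample proportion (the in-class treatment of continuity issues may differ):

  p <- 0.25; n <- 400
  # P(p-hat > 110/400) under the approximate N(p, p(1-p)/n) distribution
  pnorm(110/400, mean = p, sd = sqrt(p * (1 - p) / n), lower.tail = FALSE)   # about 0.12

  # exact binomial probability, for comparison
  pbinom(110, size = n, prob = p, lower.tail = FALSE)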


Chapter 3
Inference For Population Mean
Chapter 5 in textbook.

3.1 Confidence intervals

When a population parameter is estimated by a sample statistic, such as μ̂ = x̄, the sample statistic is a point estimate of the parameter. Due to sampling variability the point estimate will vary from sample to sample.
An alternative or complementary approach is to report an interval of plausible values based on the point estimate sample statistic and its standard deviation (a.k.a. standard error). A confidence interval (C.I.) is calculated by first selecting the confidence level, the degree of reliability of the interval. A 100(1 − α)% C.I. means that the method by which the interval is calculated will contain the true population parameter 100(1 − α)% of the time. That is, if a sample is replicated multiple times, the proportion of times that the C.I. will not contain the population parameter is α.

Figure 3.1: Multiple confidence intervals from different samples

3.1.1 Large sample C.I. for population mean

Let X1, ..., Xn be i.i.d. N(μ, σ²) with unknown mean μ and known variance σ² (both assumed finite). Then, X̄ ~ N(μ, σ²/n). However, if the sample size is large enough we do not require that the Xi be normal r.v.s; the central limit theorem guarantees that the sample mean X̄ is normal.
Let z_c stand for the value of Z ~ N(0, 1) such that P(Z > z_c) = c. Hence the proportion of C.I.s containing the population parameter is,
1 − α = P( −z_{α/2} < (X̄ − μ)/(σ/√n) < z_{α/2} )
      = P( X̄ − z_{α/2} σ/√n < μ < X̄ + z_{α/2} σ/√n ),
and the probability that (in the long run) the random C.I. interval,
X̄ ± z_{α/2} σ/√n,
contains the true value of μ is 1 − α. When a C.I. is constructed from a single sample we can no longer talk about a probability, as there is no long run temporal concept, but we can say that we are 100(1 − α)% confident that the methodology by which the interval was contrived will contain the true population parameter.
In practice the population variance is unknown. However, a large sample size implies that the sample variance s² is a good estimate for σ² and we can replace it in the C.I. calculation. The technically correct method for creating a C.I. when σ² is unknown is shown in Section 3.1.2.
Example 3.1. In a packaging plant, the sample mean and standard deviation for the fill weight of 100 boxes are x̄ = 12.05 and s = 0.1. The 95% C.I. for the mean fill weight of the boxes is
12.05 ± z_{0.025} (0.1/√100) = 12.05 ± 1.96 (0.1/√100) = (12.0304, 12.0696).   (3.1)
Remark 3.1. If we wanted a 90% C.I. we would simply replace z_{0.05/2} with z_{0.10/2} = z_{0.05} = 1.645, which would lead to a C.I. of (12.0334, 12.0667), a narrower interval. Thus, as α increases, 100(1 − α) decreases, which implies a narrower interval.
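A sketch of the computation in Example 3.1 in R:

  xbar <- 12.05; s <- 0.1; n <- 100
  alpha <- 0.05
  xbar + c(-1, 1) * qnorm(1 - alpha/2) * s / sqrt(n)   # (12.0304, 12.0696)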

3.1.2 Small sample C.I. for population mean

The construction of the C.I. for the population mean uses the fact that X̄ has a normal distribution. This can happen in two ways:
  X1, ..., Xn are i.i.d. normal random variables, or
  n > 30 and the assumptions of the C.L.T. are satisfied.
In the small sample setting with n ≤ 30 we must assume that the data are derived from a normal distribution, since we cannot use the C.L.T. Then, if σ² is known the 100(1 − α)% C.I. for μ is
x̄ ± z_{α/2} σ/√n.   (3.2)
However, when σ² is unknown, simply replacing σ with the sample statistic s is not sufficient, as s is no longer considered an accurate estimate due to the small sample size.
In higher level statistics the distribution of s² is found, as it is a statistic that depends on the random variables X1, ..., Xn, and it is shown that
(X̄ − μ)/(s/√n) ~ t_{n−1},   (3.3)
where t_{n−1} stands for Student's t distribution with parameter degrees of freedom = n − 1. A Student's t distribution is similar to the standard normal except that it places more weight on extreme values, as seen in Figure 3.2.

Figure 3.2: Standard normal and t_4 probability density functions

It is important to note that Student's t is not just similar to the standard normal but asymptotically (as n → ∞) is the standard normal. One just needs to view the t-table to see that under infinite degrees of freedom the values in the table are exactly the same as the ones found for the standard normal. Intuitively then, using Student's t when σ² is unknown makes sense

as it adds more probability to extreme values due to the uncertainty introduced by estimating σ².
The 100(1 − α)% C.I. for μ is then
x̄ ± t_{(n−1, α/2)} s/√n.   (3.4)

Remark 3.2. To be technically correct, when σ² is known one should use equation (3.2) and when it is unknown, equation (3.4). It is common practice, mainly for convenience, to use equation (3.2) even when σ² is unknown but the sample size is large. As discussed earlier, under this scenario s² is a good estimate of σ² and the values in the t-table and z-table are very close to each other.
Example 3.2. Suppose that a sample of 36 resistors is taken with x̄ = 10 and s² = 0.7. A 95% C.I. for μ is
10 ± t_{(35, 0.025)} √(0.7/36) = 10 ± 2.03 √(0.7/36) = (9.71693, 10.28307)
Note: If the exact degrees of freedom are not in the table, use the closest one.
Since n > 30 you may in practice see equation (3.2) used, for the reasons discussed. The 95% C.I. using that method would be
10 ± z_{0.025} √(0.7/36) = 10 ± 1.96 √(0.7/36) = (9.726691, 10.273309)
3.1.3 Sample size for a C.I. of fixed level and width

The price paid for a higher confidence level, for the same sample statistics, is a wider interval (try this at home using different confidence levels). We know that as the sample size n increases, the standard deviation of X̄, σ/√n, decreases and consequently so does the margin of error. Thus, knowing some preliminary information, such as a rough estimate for σ, can help us determine the sample size needed to obtain a fixed margin of error.
Using equation (3.2), the width of the interval is twice the margin of error:
width = 2 z_{α/2} σ/√n.
Thus,
√n = 2 z_{α/2} σ / width   ⟹   n ≥ ( 2 z_{α/2} σ / width )².
Example 3.3. In Example 3.1 we had x̄ = 12.05 and s = 0.1 for the 100 boxes, leading to a 95% C.I. for the true mean of width 0.0392, or ±0.0196, see (3.1). Boss man requires a 95% C.I. of 0.0120.

IN CLASS

3.2 Hypothesis Testing

A statistical hypothesis is a claim about a population characteristic (and on occasion more than one). An example of a hypothesis is the claim that the population mean is some value, e.g. μ = 0.75.
Definition 3.1. The null hypothesis, denoted by H0, is the hypothesis that is initially assumed to be true.
The alternative hypothesis, denoted by Ha or H1, is the complementary assertion to H0 and is usually the new statement that we wish to test.
A test procedure is created under the assumption of H0 and then it is determined how likely that assumption is compared to its complement Ha. The decision will be based on
  the test statistic, a function of the sampled data, and
  the rejection region, the set of all test statistic values for which H0 will be rejected.
The basis for choosing a particular rejection region lies in an understanding of the errors that can be made.
Definition 3.2. A type I error consists of rejecting H0 when it is actually true.
A type II error consists of failing to reject H0 when in actuality H0 is false.
The type I error is generally considered to be the most serious one, and due to limitations we can only control for one, so the rejection region is chosen based upon the maximum probability of a type I error that a researcher is willing to accept.

3.2.1 One sample hypothesis tests

We motivate the test procedure by an example whereby the drying time of a certain type of paint, under fixed environmental conditions, is known to be normally distributed with mean 75 min. and standard deviation 9 min. Chemists have added a new additive that is believed to decrease drying time and have obtained a sample of 35 drying times, and wish to test their assertion. Hence,
H0: μ ≥ 75 (or μ = 75)
Ha: μ < 75
An obvious candidate for a test statistic, one that is an unbiased estimator of the population mean, is X̄, which is normally distributed. If the data were not known to be normally distributed, the normality of X̄ can also be confirmed by the C.L.T. Thus, under the null assumption H0,
X̄ ~ N(75, 9²/35),
or equivalently
(X̄ − 75)/(9/√35) ~ N(0, 1).
Since we wish to control for the type I error, we set P(type I error) = α. The default value of α is usually taken to be 5%.

Figure 3.3: Rejection region equivalent to α = 0.05 (standard normal; shaded area of 0.05 to the left of −1.645)


Then if the test statistic value,
T.S. = (x̄ − 75)/(9/√35),
is in the rejection region shown in Figure 3.3, i.e. T.S. < −z_{0.05}, then H0 is rejected. We assume that the sample mean x̄ is a good estimate for μ, and hence x̄ − μ should be close to 0, which implies the T.S. should be close to zero. However, if it is not, then it implies that μ = 75 was not a good hypothesis value for the true mean and consequently that the T.S. was not centered correctly.
Assume that x̄ = 70.8 from the 35 samples. Then T.S. = −2.76, which is in the rejection region, and we reject H0 at the α = 0.05 level. Equivalently, a conclusion can be reached in hypothesis/significance testing by using the p-value.
Definition 3.3. The p-value of a hypothesis test is the probability of observing the specific value of the test statistic, T.S., or a more extreme value, under the null hypothesis. The direction of the extreme values is indicated by the alternative hypothesis.
Therefore, in this example values more extreme than −2.76 are {x | x < −2.76}, as the alternative, Ha: μ < 75, is indicating smaller values. Thus,
p-value = P(Z < −2.76) = 0.0029.
Therefore, since p-value < α, the null hypothesis is rejected in favor of the alternative hypothesis, as the probability of observing the test statistic value of −2.76 or more extreme (as indicated by Ha) is smaller than the probability of the type I error we are willing to undertake.

Figure 3.4: Rejection region and p-value (standard normal; rejection area α = 0.05 to the left of −1.645, p-value area to the left of −2.76).
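A quick R check of the test statistic and p-value in this example:

  xbar <- 70.8; mu0 <- 75; sigma <- 9; n <- 35
  ts <- (xbar - mu0) / (sigma / sqrt(n)); ts   # about -2.76
  pnorm(ts)                                    # p-value for Ha: mu < 75, about 0.0029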


Large sample test for population mean
Let X1, ..., Xn be a random sample with n > 30, and hence X̄ is normally distributed. To test
(i) H0: μ ≤ μ0 vs Ha: μ > μ0
(ii) H0: μ ≥ μ0 vs Ha: μ < μ0
(iii) H0: μ = μ0 vs Ha: μ ≠ μ0
at the α significance level, compute the test statistic
T.S. = (x̄ − μ0)/(σ/√n).   (3.5)
Reject the null if
(i) T.S. > z_α, equivalently p-value = P(Z > T.S.) < α
(ii) T.S. < −z_α, equivalently p-value = P(Z < T.S.) < α
(iii) |T.S.| > z_{α/2}, equivalently p-value = P(|Z| > |T.S.|) < α
Remark 3.3. If σ is unknown and instead s is used, one should be using Student's t and the relevant t-table instead of the z-table, but since the sample size is large the two distributions are equivalent.
Example 3.4. A scale is to be calibrated by weighing a 1000g weight 60 times. From the sample we obtain x̄ = 1000.6 and s = 2. Test whether the scale is calibrated correctly.
IN CLASS
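Worked in class; a numerical sketch of the two-sided z test for Example 3.4 (large sample, s used in place of σ):

  xbar <- 1000.6; mu0 <- 1000; s <- 2; n <- 60
  ts <- (xbar - mu0) / (s / sqrt(n)); ts    # about 2.32
  2 * pnorm(abs(ts), lower.tail = FALSE)    # two-sided p-value, about 0.02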

Example 3.5. A company representative claims that the number of calls arriving at their center is no more than 15 per week. To investigate the claim, 36 random weeks were selected from the company's records, with a sample mean of 17 and sample standard deviation of 3. Do the sample data contradict this statement?
First we begin by stating the hypotheses:
H0: μ ≤ 15   vs   Ha: μ > 15
The test statistic is
T.S. = (17 − 15)/(3/√36) = 4
The conclusion is that there is significant evidence to reject H0, as the p-value is very close to 0.

3.2.2 Small sample test for population mean

If the sample size is small, i.e. n ≤ 30, then the C.L.T. is not applicable for X̄, and therefore we must assume that the individual r.v.s X1, ..., Xn corresponding to the sample are normal r.v.s with mean μ and variance σ². Then, by Proposition 2.5 we have that X̄ ~ N(μ, σ²/n), and we can proceed exactly as in equation (3.5).
However, if σ is unknown, which is usually the case, we replace it by its sample estimate s. Consequently, under H0,
(X̄ − μ0)/(s/√n) ~ t_{n−1},
and for an observed value X̄ = x̄ the test statistic becomes
T.S. = (x̄ − μ0)/(s/√n).
At the α significance level, for the same hypothesis tests as before, we reject H0 if
(i) T.S. > t_{(n−1, α)}, equivalently p-value = P(t_{n−1} > T.S.) < α
(ii) T.S. < −t_{(n−1, α)}, equivalently p-value = P(t_{n−1} < T.S.) < α
(iii) |T.S.| > t_{(n−1, α/2)}, equivalently p-value = P(|t_{n−1}| > |T.S.|) < α
Example 3.6. In an ergonomic study, 5 subjects were chosen to study the maximum acceptable weight of lift (MAWL) for a frequency of 4 lifts/min. Assuming the MAWL values are normally distributed, do the following data suggest that the population mean of MAWL exceeds 25?
25.8, 36.6, 26.3, 21.8, 27.2
IN CLASS
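Worked in class; a one-sided one-sample t test sketch for Example 3.6:

  mawl <- c(25.8, 36.6, 26.3, 21.8, 27.2)
  t.test(mawl, mu = 25, alternative = "greater")
  # by hand: T.S. = (mean(mawl) - 25) / (sd(mawl) / sqrt(5)), about 1.04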

Remark 3.4. The values contained within a two-sided 100(1 − α)% C.I. are precisely those values for which the p-value of a two-sided hypothesis test will be greater than α.
Example 3.7. The lifetime of a single cell organism is believed to be on average 257 hours. A small preliminary study was conducted to test whether the average lifetime was different when the organism was placed in a certain medium. The measurements are assumed to be normally distributed and turned out to be 253, 261, 258, 255, and 256. The hypothesis test is
H0: μ = 257 vs. Ha: μ ≠ 257
With x̄ = 256.6 and s = 3.05, the test statistic value is
T.S. = (256.6 − 257)/(3.05/√5) = −0.293.
The p-value is P(|t4| > |−0.293|) = P(t4 < −0.293) + P(t4 > 0.293) = 0.7839. Hence, since the p-value is large (> 0.05) we fail to reject H0 and conclude that the population mean is not statistically different from 257.
Instead of a hypothesis test, if a two-sided 95% C.I. is constructed as
256.6 ± t_{(4, 0.025)} (3.05/√5) = 256.6 ± 2.776 (3.05/√5) = (252.81, 260.39),
it is clear that the null hypothesis value of μ = 257 is a plausible value and consequently H0 is plausible, so it is not rejected.
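This example can be reproduced directly with t.test():

  life <- c(253, 261, 258, 255, 256)
  t.test(life, mu = 257)   # two-sided test and 95% C.I., matching the values above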


Part II
Part 2 Material


Chapter 4
Inference For Population Proportion
Chapter 10.1 - 10.2 in textbook.

4.1 Large sample C.I. for population proportion

In the binomial setting, experiments had binary outcomes and of interest was the number of successes out of the total number of trials. Let X be the total number of successes; then X ~ Bin(n, p). Once an experiment is conducted and data obtained, an estimate for p can be obtained,
p̂ = x/n,
which is an average: the total number of successes over the total number of trials. As such, if the number of successes and the number of failures are both greater than 5, the C.L.T. tells us that
p̂ ~ approx. N( p, p(1 − p)/n ).
Then a 100(1 − α)% C.I. can be created as in equation (3.2),
p̂ ± z_{α/2} √( p̂(1 − p̂)/n ).
This is the classical approach for when the sample size is large. This cannot be used for the small sample framework as the C.L.T. is not applicable. An exact version exists in the field of nonparametric statistics. However, there does exist an interval, similar to the classical version, that works relatively well for small sample sizes (not too small) and is equivalent for large sample sizes. It is called the Agresti-Coull 100(1 − α)% C.I.,
p̃ ± z_{α/2} √( p̃(1 − p̃)/ñ ),
where ñ := n + 4 and p̃ := (x + 2)/ñ.
Note: The instructor will be using the Agresti-Coull interval on the exam and quizzes; however the current textbook uses the classical approach.
Example 4.1. A map and GPS application for a smartphone was tested for accuracy. The experiment yielded 26 errors out of the 74 trials. Find the 90% C.I. for the proportion of errors.
Since n = 74 and x = 26, then ñ = 74 + 4 = 78 and p̃ = (26 + 2)/78 = 0.359. Hence the 90% C.I. for p is
0.359 ± z_{0.05} √( 0.359(1 − 0.359)/78 ) = 0.359 ± 1.645 √( 0.359(1 − 0.359)/78 ) = (0.269, 0.448)
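A sketch of the Agresti-Coull computation in R:

  x <- 26; n <- 74
  ntil <- n + 4; ptil <- (x + 2) / ntil
  ptil + c(-1, 1) * qnorm(0.95) * sqrt(ptil * (1 - ptil) / ntil)   # (0.269, 0.448)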

4.2 Large sample test for population proportion

Let X be the number of successes in n Bernoulli trials with probability of success p; then X ~ Bin(n, p). We know by the C.L.T. that under certain regularity conditions (number of successes and number of failures greater than 5), p̂ ~ approx. N(p, p(1 − p)/n). To test
(i) H0: p ≤ p0 vs Ha: p > p0
(ii) H0: p ≥ p0 vs Ha: p < p0
(iii) H0: p = p0 vs Ha: p ≠ p0
we must assume, under the null hypothesis H0, that the number of successes and failures is greater than 5, i.e. np0 > 5 and n(1 − p0) > 5, such that
p̂ ~ N( p0, p0(1 − p0)/n )   under H0.
The test statistic is
T.S. = (p̂ − p0) / √( p0(1 − p0)/n ),
and the r.v. corresponding to the test statistic has a standard normal distribution under the null hypothesis assumption. Reject the null if
(i) T.S. > z_α, equivalently p-value = P(Z > T.S.) < α
(ii) T.S. < −z_α, equivalently p-value = P(Z < T.S.) < α
(iii) |T.S.| > z_{α/2}, equivalently p-value = P(|Z| > |T.S.|) < α

Chapter 5
Inference For Two Population Means
Chapter 6 in textbook.

5.1 Two Sample C.I.s

There are instances when a C.I. for the difference between two means is of
interest when one wishes to compare the sample mean from one population
to the sample mean of another.

5.1.1 Large sample C.I. for two means

Let X1, ..., X_{nX} and Y1, ..., Y_{nY} represent two independent random large samples with nX > 40, nY > 40, with means μX, μY and variances σX², σY² respectively. A simple application of the C.L.T. implies that X̄ and Ȳ are normal random variables. Proposition 2.5 allows us to find the distribution of the random variable K corresponding to the difference X̄ − Ȳ. Hence, K is a normal random variable with
E(K) = E(X̄ − Ȳ) = μX − μY,
and
V(K) = V(X̄ − Ȳ) = σX²/nX + σY²/nY.
Therefore,
K := X̄ − Ȳ ~ N( μX − μY, σX²/nX + σY²/nY ),
and hence a 100(1 − α)% C.I. for the difference μX − μY is
x̄ − ȳ ± z_{α/2} √( σX²/nX + σY²/nY ).

Once again, if the variances are unknown we can replace them with the sample variances due to the large sample size. In addition, we could use Student's t critical values instead of the z-score z_{α/2} (as the variances are unknown), but large sample sizes imply that the t-score will be approximately equal to the z-score.
Example 5.1. In an experiment, 50 observations of soil NO3 concentration (mg/L) were taken at each of two (independent) locations X and Y. The descriptive statistics are: x̄ = 88.5, sX = 49.4, ȳ = 110.6 and sY = 51.5. Construct a 95% C.I. for the difference in means and interpret.
IN CLASS
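Worked in class; a numerical sketch of the large sample interval:

  xbar <- 88.5; sx <- 49.4; ybar <- 110.6; sy <- 51.5; n <- 50
  (xbar - ybar) + c(-1, 1) * qnorm(0.975) * sqrt(sx^2 / n + sy^2 / n)
  # about (-41.9, -2.3); the interval excludes 0, suggesting the two means differ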

5.1.2 Small sample C.I. for two means

As in Section 3.1.2, with small sample sizes we must assume that X1, ..., X_{nX} are i.i.d. N(μX, σX²) and Y1, ..., Y_{nY} are i.i.d. N(μY, σY²), with the two samples being independent of one another. As in equation (3.3),
( X̄ − Ȳ − (μX − μY) ) / √( sX²/nX + sY²/nY ) ~ t_ν,
where
ν = ( sX²/nX + sY²/nY )² / [ (sX²/nX)²/(nX − 1) + (sY²/nY)²/(nY − 1) ].   (5.1)
Hence the 100(1 − α)% C.I. for μX − μY is
x̄ − ȳ ± t_{(ν, α/2)} √( sX²/nX + sY²/nY ).

Example 5.2. Two methods are considered standard practice for surface hardening. For Method A there were 15 specimens with a mean of 400.9 (N/mm²) and standard deviation 10.6. For Method B there were also 15 specimens with a mean of 367.2 and standard deviation 6.1. Assuming the samples are independent and from normal distributions, the 98% C.I. for μA − μB is
400.9 − 367.2 ± t_{(ν, 0.01)} √( 10.6²/15 + 6.1²/15 ),
where
ν = ( 10.6²/15 + 6.1²/15 )² / [ (10.6²/15)²/14 + (6.1²/15)²/14 ] = 22.36 ≈ 22,
and hence t_{(22, 0.01)} = 2.508, giving a 98% C.I. of (25.8, 41.6).

Remark 5.1. When the population variances are believed to be equal, i.e. σX² ≈ σY², we can improve on the estimate of variance by using a pooled or weighted average estimate. If, in addition to the regular assumptions, we can assume equality of variances then the 100(1 − α)% C.I. for μX − μY is
x̄ − ȳ ± t_{(nX + nY − 2, α/2)} s_p √( 1/nX + 1/nY ),
with
s_p = √( [ (nX − 1)sX² + (nY − 1)sY² ] / (nX + nY − 2) ).
The assumption that the variances are equal must be made a priori, and not used simply because the two sample variances may be close in magnitude.
Example 5.3. Consider Example 5.2 but now assume that σX² ≈ σY². A 98% C.I. for the difference μX − μY constructed with
s_p = √( (14(10.6²) + 14(6.1²))/28 ) = 8.648
is
400.9 − 367.2 ± t_{(28, 0.01)} (8.648) √(2/15) = 33.7 ± 2.467 (8.648) √(2/15) = (25.9097, 41.4903).
How is this interval different from the one in Example 5.2?
5.1.3 Large sample C.I. for two population proportions

A simple extension of Section 4.1 to the two sample framework yields the
100(1 )% C.I. for the difference of two population proportions. Let X
Bin(nX , pX ) and Y Bin(nY , pY ) be two independent binomial r.vs. Define
n
X = nX + 2 and pX = (x + 1)/
nX , similarly for Y . Then the 100(1 )%
C.I. for pX pY is
s
pX (1 pX ) pY (1 pY )
pX pY z/2
+
.
n
X
n
Y
50

Intuitively, since proportions are between 0 and 1, the difference of two proportions must lie between -1 and 1. Hence if the bounds of a C.I. are outside
the intuitive ones, they should be replaced by the intuitive bounds.
Example 5.4. In a clinical trial for a pain medication, 394 subjects were
blindly administered the drug, while an independent group of 380 were given
a placebo. From the drug group, 360 showed an improvement. From the
placebo group 304 showed improvement. Construct a 95% C.I. for the difference and interpret.
IN CLASS
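A minimal R sketch for this interval, using the adjusted ("plus one success and one failure per sample") estimates defined above:

# Example 5.4: adjusted 95% C.I. for p_drug - p_placebo
x <- 360; nx <- 394   # drug group
y <- 304; ny <- 380   # placebo group
ptx <- (x + 1)/(nx + 2); pty <- (y + 1)/(ny + 2)
se  <- sqrt(ptx*(1 - ptx)/(nx + 2) + pty*(1 - pty)/(ny + 2))
(ptx - pty) + c(-1, 1) * qnorm(0.975) * se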

5.1.4 C.I. for paired data

There are instances when two samples are not independent, when a relationship exists between the two. For example, before-treatment and after-treatment measurements made on the same experimental subject are dependent on each other through the experimental subject. This is a common event in clinical studies where the effectiveness of a treatment, which may be quantified by the difference in the before and after measurements, is dependent upon the individual undergoing the treatment. Then the data are said to be paired.
Consider the data in the form of the pairs (X_1, Y_1), (X_2, Y_2), . . . , (X_n, Y_n). We note that the pairs, i.e. two-dimensional vectors, are independent, as the experimental subjects are assumed to be independent, with marginal expectations E(X_i) = μ_X and E(Y_i) = μ_Y for all i = 1, . . . , n. By defining

D_1 = X_1 − Y_1
D_2 = X_2 − Y_2
. . .
D_n = X_n − Y_n

a two sample problem has been reduced to a one sample problem. Inference for μ_X − μ_Y is equivalent to one sample inference on D̄ as was done in Chapter 3. This holds since

μ_D := E(D̄) = E( (1/n) Σ_{i=1}^{n} D_i ) = (1/n) Σ_{i=1}^{n} E(X_i − Y_i) = E(X̄ − Ȳ) = μ_X − μ_Y.

In addition we note that the variance of D̄ does incorporate the covariance between the two samples, as

σ²_D̄ := V(D̄) = V( (1/n) Σ_{i=1}^{n} D_i ) = (1/n²) Σ_{i=1}^{n} V(D_i) = ( σ²_X + σ²_Y − 2σ_XY ) / n.
Example 5.5. A new and an old type of rubber compound can be used in tires. A researcher is interested in a compound/type that does not wear easily. Ten cars were chosen at random to go around a track a predetermined number of times. Each car did this twice, once for each tire type, and the depth of the tread was then measured.

Car    1     2     3     4     5     6     7     8     9    10
New   4.35  5.00  4.21  5.03  5.71  4.61  4.70  6.03  3.80  4.70
Old   4.19  4.62  4.04  4.72  5.52  4.26  4.27  6.24  3.46  4.50
D     0.16  0.38  0.17  0.31  0.19  0.35  0.43 -0.21  0.34  0.20

With d̄ = 0.232 and s_D = 0.183. Assuming that the data are normally distributed, a 95% C.I. for μ_new − μ_old = μ_D is

0.232 ± t_{9,0.025} (0.183/√10), where t_{9,0.025} = 2.262,   which gives   (0.101, 0.363),

and we note that the interval is strictly greater than 0, implying that the difference is positive, i.e. that μ_new > μ_old.
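In R the same interval can be obtained directly with t.test() on the paired data:

# Example 5.5: paired 95% C.I.
new <- c(4.35, 5.00, 4.21, 5.03, 5.71, 4.61, 4.70, 6.03, 3.80, 4.70)
old <- c(4.19, 4.62, 4.04, 4.72, 5.52, 4.26, 4.27, 6.24, 3.46, 4.50)
t.test(new, old, paired = TRUE)$conf.int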

5.2 Two Sample Hypothesis Tests (optional)

5.2.1 Large sample test for difference of two means

Let X_1, . . . , X_{n_X} and Y_1, . . . , Y_{n_Y} represent two independent random large samples with n_X > 40, n_Y > 40, with means μ_X, μ_Y and variances σ²_X, σ²_Y respectively. We have seen in Section 5.1.1, by virtue of the C.L.T., that

X̄ − Ȳ ∼ N( μ_X − μ_Y , σ²_X/n_X + σ²_Y/n_Y ).

To test
(i) H0: μ_X − μ_Y ≤ Δ0 vs Ha: μ_X − μ_Y > Δ0
(ii) H0: μ_X − μ_Y ≥ Δ0 vs Ha: μ_X − μ_Y < Δ0
(iii) H0: μ_X − μ_Y = Δ0 vs Ha: μ_X − μ_Y ≠ Δ0
we assume that the variances are known and the test statistic is

T.S. = ( x̄ − ȳ − Δ0 ) / √( σ²_X/n_X + σ²_Y/n_Y ).

The r.v. corresponding to the test statistic has a standard normal distribution under the null hypothesis H0 that μ_X − μ_Y = Δ0. Reject the null if
(i) T.S. > z_α, or p-value = P(Z > T.S.) < α
(ii) T.S. < −z_α, or p-value = P(Z < T.S.) < α
(iii) |T.S.| > z_{α/2}, or p-value = P(|Z| > |T.S.|) < α.
If the variances σ²_X and σ²_Y are unknown it is acceptable to replace them by their sample estimates, or alternatively use a t distribution as shown in the next section.

5.2.2 Small sample test for difference of two means

Inference via hypothesis testing is analogous to Section 5.1.2, which is an extension of the large sample methodology. However, since the C.L.T. is not applicable, we must assume that the two random samples are normally distributed and independent.
If the variances are known the test statistic is

T.S. = ( x̄ − ȳ − Δ0 ) / √( σ²_X/n_X + σ²_Y/n_Y ),

which has a standard normal distribution under H0. Reject H0 as before.
Usually the variances are unknown and have to be estimated; then the test statistic is

T.S. = ( x̄ − ȳ − Δ0 ) / √( s²_X/n_X + s²_Y/n_Y ),

which has a t distribution under H0, where the degrees of freedom are given by equation (5.1).
Remark 5.2. As in Remark 5.1, when population variances are believed to be equal, i.e. σ²_X ≈ σ²_Y, we can improve on the estimate of variance, and hence obtain a more powerful test, by using a pooled estimate of the variance. If in addition to the regular assumptions we can assume equality of variances, then replace both s_X and s_Y with

s_p = √( [ (n_X − 1)s²_X + (n_Y − 1)s²_Y ] / (n_X + n_Y − 2) ),

and the degrees of freedom for the t distribution by n_X + n_Y − 2.

5.2.3 Large sample test for difference of two proportions

Let X ∼ Bin(n_X, p_X) and Y ∼ Bin(n_Y, p_Y) represent two independent binomial r.vs from two Bernoulli trial experiments. To test
(i) H0: p_X − p_Y ≤ 0 vs Ha: p_X − p_Y > 0
(ii) H0: p_X − p_Y ≥ 0 vs Ha: p_X − p_Y < 0
(iii) H0: p_X − p_Y = 0 vs Ha: p_X − p_Y ≠ 0
we must assume that the number of successes and failures is greater than 10 for both samples. As the null hypothesis values for p_X and p_Y are not available, we simply check that the sample successes and failures are greater than 10. By virtue of the C.L.T., under H0,

p̂_X − p̂_Y ∼ N( 0 , p_X(1 − p_X)/n_X + p_Y(1 − p_Y)/n_Y ),

and a test statistic would be constructed in the usual way. However, under H0 it is assumed that p_X = p_Y, which implies that the two variances are equal, and therefore, in light of Remark 5.1, we can replace p_X and p_Y in the variance by the pooled estimate

p̂ = (x + y)/(n_X + n_Y).

The test statistic is then

T.S. = ( p̂_X − p̂_Y − 0 ) / √( p̂(1 − p̂)(1/n_X + 1/n_Y) ),

and the r.v. corresponding to the test statistic has a standard normal distribution under the null hypothesis.
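A minimal R sketch of this test, illustrated with the counts from Example 5.4:

# Two-proportion z test with the pooled variance estimate
x <- 360; nx <- 394; y <- 304; ny <- 380
phx <- x/nx; phy <- y/ny
pbar <- (x + y)/(nx + ny)
TS <- (phx - phy)/sqrt(pbar*(1 - pbar)*(1/nx + 1/ny))
2*pnorm(-abs(TS))                               # two-sided p-value
prop.test(c(x, y), c(nx, ny), correct = FALSE)  # equivalent chi-square form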

5.2.4 Test for paired data

In the event that two samples are dependent, i.e. paired, such as when two different measurements are made on the same experimental unit, the inference methodology must be adapted to account for the dependence/covariance between the two samples.
Refer to Section 5.1.4, where we consider the data in the form of the pairs (X_1, Y_1), . . . , (X_n, Y_n) and construct the one-dimensional, i.e. one-sample, D_1, D_2, . . . , D_n where D_i = X_i − Y_i for all i = 1, . . . , n. As shown earlier, μ_D = μ_X − μ_Y and the variance term σ²_D incorporates the covariance between X and Y.
To test
(i) H0: μ_X − μ_Y = μ_D ≤ Δ0 vs Ha: μ_X − μ_Y = μ_D > Δ0
(ii) H0: μ_X − μ_Y = μ_D ≥ Δ0 vs Ha: μ_X − μ_Y = μ_D < Δ0
(iii) H0: μ_X − μ_Y = μ_D = Δ0 vs Ha: μ_X − μ_Y = μ_D ≠ Δ0
perform a one-sample hypothesis test by either large or small sample inference using the test statistic

T.S. = ( d̄ − Δ0 ) / ( σ_D/√n )   or   T.S. = ( d̄ − Δ0 ) / ( s_D/√n ).

5.3 Normal Probability Plot

A probability plot is a graphical technique for comparing two data sets: either two sets of empirical observations, or one empirical set against a theoretical set.
Definition 5.1. The empirical distribution function, or empirical c.d.f., is the cumulative distribution function associated with the empirical measure of the sample. This c.d.f. is a step function that jumps up by 1/n at each of the n data points.

F̂_n(x) = (number of elements ≤ x)/n = (1/n) Σ_{i=1}^{n} I{x_i ≤ x}

Example 5.6. Consider the sample: 1, 5, 7, 8. The empirical c.d.f. is

F̂_4(x) = 0 if x < 1;  0.25 if 1 ≤ x < 5;  0.50 if 5 ≤ x < 7;  0.75 if 7 ≤ x < 8;  1 if x ≥ 8.

Figure 5.1: Empirical c.d.f.
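In R the empirical c.d.f. of this small sample can be plotted directly:

# Example 5.6: empirical c.d.f.
x <- c(1, 5, 7, 8)
plot(ecdf(x), main = "Empirical c.d.f.")   # step function as in Figure 5.1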

The normal probability plot is a graphical technique for normality testing, i.e. for assessing whether or not a data set is approximately normally distributed. The data are plotted against a theoretical normal distribution in such a way that the points should form an approximate straight line. Departures from this straight line indicate departures from normality.
There are two types of plots commonly used to compare the empirical c.d.f. to the theoretical normal one, G(·). A P-P plot plots (F̂_n(x), G(x)) (with scales changed to look linear), while the more widely used Q-Q plot plots the quantile functions (F̂_n⁻¹(x), G⁻¹(x)).
Example 5.7. An experiment on lead concentrations (mg/kg dry weight) from 37 stations yielded 37 observations. Of interest is to determine whether the data are normally distributed (this is of more practical use when sample sizes are small, e.g. < 30).

Figure 5.2: QQ plot (sample quantiles of the lead data vs. theoretical normal quantiles).


http://www.stat.ufl.edu/~ dathien/STA6166/QQplot.R
INTERPRETATION OF FIGURES
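A minimal sketch of the R commands behind such a plot (the actual lead data are in the linked script; here a placeholder vector is used purely to show the calls):

# Normal Q-Q plot for a numeric vector of observations
lead <- rnorm(37, mean = 100, sd = 40)   # placeholder data, for illustration only
qqnorm(lead)
qqline(lead)                             # reference line through the quartiles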


Chapter 6
Nonparametric Procedures For
Population Location
When the sample size is small and we cannot assume that the data are normally distributed, we must use exact nonparametric procedures to perform inference on population central values. Instead of means we will be referring to medians (μ̃) and other location concepts, as they are less influenced by outliers, which can have a drastic impact (especially) on small samples.

6.1 Sign test

Section 5.8 in textbook.


Recall that the median is the 50th percentile, so we expect 50% of the data to fall above that value. Let B be the number of observations that are strictly greater than the null value μ̃0. (This will be the test statistic irrespective of the type of hypothesis test.) By definition of the median, under H0 there is a 50-50 chance that an observation is above μ̃0. Therefore, B ∼ Bin(n, 0.5).
To test the hypotheses
(i) H0: μ̃ ≤ μ̃0 vs Ha: μ̃ > μ̃0
(ii) H0: μ̃ ≥ μ̃0 vs Ha: μ̃ < μ̃0
(iii) H0: μ̃ = μ̃0 vs Ha: μ̃ ≠ μ̃0
we reject H0 if the p-value < α. We illustrate the calculation of the p-value with the following example.
Example 6.1. Pulse rates for a sample of 15 students were:
60, 62, 72, 60, 63, 75, 64, 68, 63, 60, 52, 64, 82, 68, 64

To test H0: μ̃ ≥ 65 vs Ha: μ̃ < 65 we have B = 5. The p-value (i.e. the probability of observing the test statistic or a value more extreme) is

p-value = P(B ≤ 5 | B ∼ Bin(15, 0.5)) = P(B = 0) + . . . + P(B = 5) = Σ_{i=0}^{5} C(15, i) 0.5^i 0.5^{15−i} = 0.1509.
Hence, we fail to reject H0 . In R we would simply run


binom.test(5,15,alternative="less")
How does one calculate the p-value for a two-sided test? IN CLASS

Remark 6.1. If we wanted to test the location of the 70th percentile then B ∼ Bin(n, 0.3).
Remark 6.2. There is also a normal approximation (shown in the textbook) but we will stick to the exact method.

6.2 Wilcoxon rank-sum test

Section 6.3 in textbook.


One of the most widely used two sample tests for location differences between two populations (treatments). Assume that two independent samples X_1, . . . , X_{n_X} are i.i.d. with c.d.f. F_1(·) and Y_1, . . . , Y_{n_Y} are i.i.d. with c.d.f. F_2(·). The null hypothesis H0: F_1(x) = F_2(x) ∀x is tested against
(i) Y's tend to be smaller than the X's.
(ii) Y's tend to be larger than the X's.
(iii) One of the two populations is shifted from the other.
To conduct the test we
• first rank all the (n_X + n_Y) data irrespective of sample,
• calculate the sum of the ranks associated with the smallest sample (if the sample sizes are equal, the choice of "smallest" is irrelevant; usually go with the first sample).
H0 is rejected if
(i) T_X ≥ T_U when n_X ≤ n_Y, or T_Y ≤ T_L when n_X > n_Y
(ii) T_X ≤ T_L when n_X ≤ n_Y, or T_Y ≥ T_U when n_X > n_Y
(iii) T_X ≥ T_U or T_X ≤ T_L when n_X ≤ n_Y, or T_Y ≥ T_U or T_Y ≤ T_L when n_X > n_Y
where the critical values T_U and T_L can be found in Table 5 (Table 6 in the textbook), in which the first sample is the smaller one (done for convenience). In practice though, R can provide exact p-values.

Example 6.2. Two groups of 10 did not know whether they were receiving alcohol or the placebo, and their reaction times (in seconds) were recorded.

Placebo  0.90 0.37 1.63 0.83 0.95 0.78 0.86 0.61 0.38 1.97
Alcohol  1.46 1.45 1.76 1.44 1.11 3.07 0.98 1.27 2.56 1.32

Test whether the distribution of reaction times for the placebo is shifted to the left of that for alcohol (case (ii)). The ranks are:

Placebo   7  1 16  5  8  4  6  3  2 18   (sum 70)
Alcohol  15 14 17 13 10 20  9 11 19 12   (sum 140)

The test statistic is T = 70. From Table 6b, T_L = 83 and T_U = 127. Since T ≤ T_L we reject H0.
Remark 6.3. Notice that the table only provides critical values for n_X ≤ 10 and n_Y ≤ 10. For larger values, you may use
• other tables online,
• the normal approximation

z = [ T − n_1(n_1 + n_2 + 1)/2 ] / √( n_1 n_2 (n_1 + n_2 + 1)/12 )

(as shown in the textbook, p. 254),
• software such as R, which gives you exact p-values.


http://www.stat.ufl.edu/~ athienit/STA6166/wilcox_1.R
Remark 6.4. If there are ties in the data then the values that are tied get the
average of the ranks that they would have gotten if not tied. For example,
the rank of the data 0.3, 0.5, 0.5, 0.7 is 1, 2.5, 2.5, 4, as the values 0.5 should
have gotten ranks 2 and 3 if they were slightly different.
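A minimal R sketch for Example 6.2 (note that wilcox.test reports W = T − n_X(n_X + 1)/2 rather than the rank sum itself, but the p-value is the quantity of interest):

# Example 6.2: Wilcoxon rank-sum test
placebo <- c(0.90, 0.37, 1.63, 0.83, 0.95, 0.78, 0.86, 0.61, 0.38, 1.97)
alcohol <- c(1.46, 1.45, 1.76, 1.44, 1.11, 3.07, 0.98, 1.27, 2.56, 1.32)
wilcox.test(placebo, alcohol, alternative = "less")   # placebo shifted to the left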

6.3 Wilcoxon signed-rank test

Section 6.5 in textbook.


To test for location differences between the X and Y components in the i.i.d. pairs (X_1, Y_1), . . . , (X_n, Y_n), we take the differences D_i = X_i − Y_i (as in Section 5.1.4) and test
H0: the distribution of the D_i's is symmetric about the null value D_0, against the alternatives
(i) D_i's tend to be larger than D_0, i.e. X's tend to be larger than Y's by an amount of D_0 or greater.
(ii) D_i's tend to be smaller than D_0, i.e. X's tend to be smaller than Y's by an amount of D_0 or greater.
(iii) D_i's tend to be consistently larger or smaller than D_0, i.e. X's tend to be consistently different from Y's by an amount of D_0 or greater.
The test procedure consists of
• calculating the differences d_i = (x_i − y_i) − D_0,
• discarding any d_i = 0 from the data,
• ranking |d_1|, . . . , |d_n| from smallest to largest,
• calculating
T+ = sum of ranks corresponding to positive d_i's
T− = sum of ranks corresponding to negative d_i's.
H0 is rejected if
(i) T− < T_c
(ii) T+ < T_c
(iii) min{T−, T+} < T_c
where T_c is the critical value found in Table 6 (Table 7 in the textbook).
Remark 6.5. The table of critical values is limited, but there does exist a normal approximation, provided in the textbook, for larger sample sizes, or one can simply use software like R.
Example 6.3. A city park department compared two fertilizers A and B
on 20 softball fields. Each field was divided in half where each fertilizer was
used. The effect of the fertilizer was measured in the pounds (lbs) of grass
clippings produced.
Since not specified in the problem, we consider as an alternative hypothesis the (general) two-sided alternative (case (iii)) with D0 = 0.

Field    A      B      D    Rank(|D|)        Field    A      B      D    Rank(|D|)
  1    211.4  186.3   25.1     15              11    208.9  183.6   25.3     17.5
  2    204.4  205.7   -1.3      1              12    208.7  188.7   20.0      8
  3    202.0  184.4   17.6      7              13    213.8  188.6   25.2     16
  4    201.9  203.6   -1.7      2              14    201.6  204.2   -2.6      4
  5    202.4  180.4   22.0     14              15    201.8  181.6   20.1      9
  6    202.0  202.0    0        0              16    200.3  208.7   -8.4      6
  7    202.4  181.5   20.9     13              17    201.8  181.5   20.3     10
  8    207.1  186.7   20.4     11              18    201.5  208.7   -7.2      5
  9    203.6  205.7   -2.1      3              19    212.1  186.8   25.3     17.5
 10    216.0  189.1   26.9     19              20    203.4  182.9   20.5     12
T+ = 15 + 7 + 14 + 13 + 11 + 19 + 17.5 + 8 + 16 + 9 + 10 + 17.5 + 12 = 169
T− = 1 + 2 + 3 + 4 + 6 + 5 = 21

The test statistic is T = 21, which is smaller than T_c = 46 (from Table 7 with n = 19 nonzeros, α = 0.05), so we reject H0.
We can conclude that fertilizers A and B differ, and since T+ is greater than T−, that A produces more clippings than B.
http://www.stat.ufl.edu/~ athienit/STA6166/wilcox_2.R
Remark 6.6. Suppose that type B was the old fertilizer and that a sales agent
approached the city council with a claim that their new fertilizer (type A)
was better in that it would produce 5 or more pounds of grass clippings
compared to B.
The alternative hypothesis is case (i) with D0 = 5. As a result we obtain
the following table
Field   A−B     D     Rank(|D|)        Field   A−B     D      Rank(|D|)
  1    25.1   20.1      16              11    25.3   20.3      18.5
  2    -1.3   -6.3       2              12    20.0   15.0       9
  3    17.6   12.6       7              13    25.2   20.2      17
  4    -1.7   -6.7       3              14    -2.6   -7.6       5
  5    22.0   17.0      15              15    20.1   15.2      10
  6     0     -5.0       1              16    -8.4  -13.4       8
  7    20.9   15.9      14              17    20.3   15.3      11
  8    20.4   15.4      12              18    -7.2  -12.2       6
  9    -2.1   -7.1       4              19    25.3   20.3      18.5
 10    26.9   21.9      20              20    20.5   15.5      13

T+ = 16 + 7 + 15 + 14 + 12 + 20 + 18.5 + 9 + 17 + 10 + 11 + 18.5 + 13 = 181
T− = 2 + 3 + 1 + 4 + 5 + 8 + 6 = 29

The test statistic is T− = 29, which is smaller than T_c = 60 (from Table 7 with n = 20 nonzeros, α = 0.05), and we reject H0.
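A minimal R sketch of the two signed-rank tests above (ties and zero differences make R print warnings about exact p-values; the conclusions are unchanged):

# Example 6.3 and Remark 6.6: Wilcoxon signed-rank tests
A <- c(211.4, 204.4, 202.0, 201.9, 202.4, 202.0, 202.4, 207.1, 203.6, 216.0,
       208.9, 208.7, 213.8, 201.6, 201.8, 200.3, 201.8, 201.5, 212.1, 203.4)
B <- c(186.3, 205.7, 184.4, 203.6, 180.4, 202.0, 181.5, 186.7, 205.7, 189.1,
       183.6, 188.7, 188.6, 204.2, 181.6, 208.7, 181.5, 208.7, 186.8, 182.9)
wilcox.test(A, B, paired = TRUE)                                   # two-sided, D0 = 0
wilcox.test(A, B, paired = TRUE, mu = 5, alternative = "greater")  # Remark 6.6, D0 = 5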

Chapter 7
Inference About Population
Variances
Chapter 7 in textbook.

7.1 Inference On One Variance

The sample statistic s² is widely used as the point estimate for the population variance σ², and similar to the sample mean it varies from sample to sample and has a sampling distribution.
Let X_1, . . . , X_n be i.i.d. r.v.s. We already have some tools that help us determine the distribution of X̄ = (1/n) Σ_{i=1}^{n} X_i, a function of the r.v.s; hence X̄ is a r.v. itself, and once a sample is collected a realization X̄ = x̄ is observed. Similarly, let

S² = (1/(n − 1)) Σ_{i=1}^{n} (X_i − X̄)²

be a function of the r.v.s X_1, . . . , X_n and hence a r.v. itself. A realization of this r.v. is the sample variance s². If X_1, . . . , X_n are i.i.d. N(μ, σ²) then

(n − 1)S²/σ² ∼ χ²_{n−1},

where χ²_{n−1} denotes a chi-square distribution with (n − 1) degrees of freedom. Let χ²_{(n−1),α} denote the critical value of a χ²_{n−1} distribution such that the area to the right is α.

Figure 7.1: χ² distribution and critical value.


Consequently,

1 − α = P( χ²_{(n−1),1−α/2} < (n − 1)S²/σ² < χ²_{(n−1),α/2} )
      = P( (n − 1)S²/χ²_{(n−1),α/2} < σ² < (n − 1)S²/χ²_{(n−1),1−α/2} ),

which implies that in the long run this interval will contain the true population variance parameter 100(1 − α)% of the time. Thus, the 100(1 − α)% C.I. for σ² is

( (n − 1)s²/χ²_{(n−1),α/2} , (n − 1)s²/χ²_{(n−1),1−α/2} ).
Example 7.1. At a coffee plant a machine fills 500g coffee containers. Ideally, the amount of coffee in a container should vary only slightly about the 500g nominal value. The machine is designed to dispense coffee amounts that have a normal distribution with mean 506.6g and standard deviation 4g. This implies that only 5% of containers weigh less than 500g. A quality control engineer samples 30 containers every hour. A particular sample yields

x̄ = 500.453,   s = 3.433.

We have already seen how to construct a C.I. for μ, so we skip ahead to constructing a 95% C.I. for σ² (or σ). Assuming that the data are normally distributed (checked by creating a normal probability plot), we have

29(3.433²)/45.722 < σ² < 29(3.433²)/16.047
7.4752 < σ² < 21.2986

as a 95% C.I. for σ², or equivalently, by taking the square root, (2.7341, 4.6150) as a 95% C.I. for σ.
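A minimal R sketch of this interval:

# Example 7.1: 95% C.I. for sigma^2 and sigma
n <- 30; s <- 3.433
ci.var <- (n - 1)*s^2 / qchisq(c(0.975, 0.025), df = n - 1)  # (lower, upper) for sigma^2
ci.var
sqrt(ci.var)                                                 # C.I. for sigma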
Remark 7.1. Hypothesis testing will be skipped, but the methodology is exactly the same as for the mean, with H0: σ² = σ²_0 and a test statistic value of

T.S. = (n − 1)s²/σ²_0.

For details see p. 299 of the textbook.

7.2 Comparing Two Variances

Now we extend the setup to two independent i.i.d. normal samples X_1, . . . , X_{n_X} and Y_1, . . . , Y_{n_Y} with variances σ²_X and σ²_Y respectively. It is known that

(n_X − 1)S²_X/σ²_X ∼ χ²_{n_X−1}   and   (n_Y − 1)S²_Y/σ²_Y ∼ χ²_{n_Y−1},

but it is also known that a ratio of two χ²'s, each standardized by dividing by its degrees of freedom, follows an F-distribution. Therefore,

[ ((n_X − 1)S²_X/σ²_X)/(n_X − 1) ] / [ ((n_Y − 1)S²_Y/σ²_Y)/(n_Y − 1) ] = ( S²_X/S²_Y ) / ( σ²_X/σ²_Y ) ∼ F_{n_X−1, n_Y−1}.

When comparing two variances, in order to use the F-distribution it is practical to make comparisons in terms of ratios rather than differences. For example, σ²_X/σ²_Y = 1 implies that the two variances are identical.
A 100(1 − α)% C.I. for σ²_X/σ²_Y is constructed by

1 − α = P( F_{(n_X−1,n_Y−1),1−α/2} < ( S²_X/S²_Y ) / ( σ²_X/σ²_Y ) < F_{(n_X−1,n_Y−1),α/2} )
      = P( (S²_X/S²_Y) · 1/F_{(n_X−1,n_Y−1),α/2} < σ²_X/σ²_Y < (S²_X/S²_Y) · 1/F_{(n_X−1,n_Y−1),1−α/2} ).

Thus, the 100(1 − α)% C.I. for σ²_X/σ²_Y is

( (s²_X/s²_Y) · 1/F_{(n_X−1,n_Y−1),α/2} , (s²_X/s²_Y) · 1/F_{(n_X−1,n_Y−1),1−α/2} ),

where F_{(n_X−1,n_Y−1),α} is the critical value for the F-distribution with area α to the right.
Remark 7.2. Hypothesis testing will be skipped but the methodology follows along the same lines as before, with H0: σ²_X/σ²_Y = 1 and a test statistic value of

T.S. = s²_X/s²_Y.

For details see p. 307 of the textbook.

Example 7.2. The life length of an electrical component was studied under two operating voltages, 110 and 220. Ten different components were assigned to be tested under 110V and 16 under 220V. The times to failure (in 100s of hrs) were then recorded. Assuming that the two samples are independent and normal, we construct a 90% C.I. for σ²_110/σ²_220.

V     n    Mean    St.Dev.
110   10   20.04   0.474
220   16    9.99   0.233

IN CLASS
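A minimal R sketch of the calculation from the summary statistics (with raw data, var.test(x, y, conf.level = 0.90) gives the same interval):

# Example 7.2: 90% C.I. for the ratio of variances
s1 <- 0.474; n1 <- 10    # 110V sample
s2 <- 0.233; n2 <- 16    # 220V sample
ratio <- s1^2 / s2^2
ratio / qf(c(0.95, 0.05), n1 - 1, n2 - 1)   # (lower, upper) bounds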

7.3 Comparing t ≥ 2 Variances

There are instances where we may wish to compare more than two population variances, such as the variability in SAT examination scores for students using one of three types of preparatory material. The null hypothesis (for t populations) is

H0: σ²_1 = · · · = σ²_t

versus the alternative that not all the σ²_i's are equal.
Many different methods exist, but we will focus on Levene's Test, which is the least restrictive in its assumptions, as there are no assumptions regarding the sample sizes/distributions. We still require the assumption of independent populations/groups. However, it is slightly complicated to calculate by hand (so you will not be asked to calculate it by hand, but to use and interpret R output).
For Levene's test, the sampling distribution of the test statistic is F_{(t−1, N−t)}, where N = Σ_{i=1}^{t} n_i, i.e. the grand total number of observations. For a specified α,

Reject H0 if T.S. ≥ F_{(t−1, N−t),α},

or if the p-value (the area to the right of the T.S. under an F_{t−1, N−t} distribution) is < α.

Example 7.3. Three different additives that are marketed for increasing fuel efficiency in miles per gallon (mpg) were evaluated by a testing agency. Studies have shown an average increase of 8% in mpg after using the products for 250 miles. The testing agency wants to evaluate the variability in the increase.

Additive 1 (% increase in mpg):  4.2   2.9  0.2  25.7   6.3   7.2   2.3  9.9  5.3   6.5
Additive 2 (% increase in mpg):  0.2  11.3  0.3  17.1  51.0  10.1   0.3  0.6  7.9   7.2
Additive 3 (% increase in mpg):  7.2   6.4  9.9   3.5  10.6  10.8  10.6  8.4  6.0  11.9

Run the R code


http://www.stat.ufl.edu/~ athienit/STA6166/levene.R
and obtain a test statistic of 1.8268. The p-value, that is the area to the
right of 1.8268 for an F2,27 is 0.1803.

Figure 7.2: F_{2,27} distribution, with the p-value 0.1803 being the area to the right of the test statistic 1.8268.


The critical value F(2,27),0.05 is 3.354131. Clearly, the test statistic does
not fall in the rejection region.
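A minimal sketch of how such a test can be run in R (assuming the car package, whose leveneTest() function is one common implementation; depending on the choice of center, mean or median, the statistic may differ slightly from the value quoted above):

# Example 7.3: Levene's test for equality of the three variances
library(car)
mpg <- c(4.2, 2.9, 0.2, 25.7, 6.3, 7.2, 2.3, 9.9, 5.3, 6.5,
         0.2, 11.3, 0.3, 17.1, 51.0, 10.1, 0.3, 0.6, 7.9, 7.2,
         7.2, 6.4, 9.9, 3.5, 10.6, 10.8, 10.6, 8.4, 6.0, 11.9)
additive <- factor(rep(1:3, each = 10))
leveneTest(mpg ~ additive, center = mean)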


Chapter 8
Contingency Tables
Chapter 10.5 in textbook.
Contingency tables are cross-tabulations of frequency counts where the rows
(typically) represent the levels of the explanatory variable and the columns
represent the levels of the response variable.
We motivate the methodology through an example. A personnel manager wants to assess the popularity of 3 alternative flexible time-scheduling
plans among workers. A random sample of 216 workers yields the following
frequencies.

                       Office
Favored Plan    1    2    3    4   Total
      1        15   32   18    5      70
      2         8   29   23   18      78
      3         1   20   25   22      68
  Total        24   81   66   45     216

Table 8.1: 3 × 4 contingency table of frequencies.


Numbers within the table represent the numbers of individuals falling in the corresponding combination of levels of the two variables. Denote n_ij to be the frequency count for row i, column j.
Row and column totals are called the marginal distributions for the two variables. Denote n_{i+} to be the ith row total and n_{+j} to be the jth column total.
Table 8.1 can be re-expressed as a table of joint and marginal proportions.

                            Office
Favored Plan       1        2        3        4     Total
      1         0.0694   0.1481   0.0833   0.0231   0.3241
      2         0.0370   0.1343   0.1065   0.0833   0.3611
      3         0.0046   0.0926   0.1157   0.1019   0.3148
  Total         0.1111   0.3750   0.3056   0.2083   1.0000

Table 8.2: 3 × 4 contingency table of proportions.


In Section 2.3.1 we have seen that two events are independent if the joint
probability can be written as a product of the marginal probabilities. Hence,
under independence we expect
p_ij = p_{i+} p_{+j},   i = 1, 2, 3;  j = 1, 2, 3, 4.

Performing this operation for all rows and columns gives us a table of expected joint probabilities.
                            Office
Favored Plan       1        2        3        4     Total
      1         0.0360   0.1215   0.0990   0.0675   0.3241
      2         0.0401   0.1354   0.1103   0.0752   0.3611
      3         0.0350   0.1181   0.0962   0.0656   0.3148
  Total         0.1111   0.3750   0.3056   0.2083   1.0000

Table 8.3: Table of expected (under independence) proportions.


Multiplying these expected (under independence) probabilities by the sample size n = 216 gives us the expected (under independence) frequencies

E_ij = n p_ij = n p_{i+} p_{+j} = n (n_{i+}/n)(n_{+j}/n) = n_{i+} n_{+j} / n.

As a result E_11 = (70)(24)/216 = 7.7778. Similarly we obtain
                             Office
Favored Plan       1         2         3         4      Total
      1         7.7778   26.2500   21.3889   14.5833       70
      2         8.6667   29.2500   23.8333   16.2500       78
      3         7.5556   25.5000   20.7778   14.1667       68
  Total             24        81        66        45      216

Table 8.4: Table of expected (under independence) frequencies.



To test

H0: the levels of one variable are independent of the other

we use Pearson's chi-square (χ²) test statistic, which is applicable if E_ij > 5 ∀ i, j:

T.S. = Σ_{i=1}^{r} Σ_{j=1}^{c} (n_ij − E_ij)² / E_ij

where r is the number of rows and c is the number of columns. The sampling distribution of the test statistic is χ²_{(r−1)(c−1)} and hence, for a specified α, H0 is rejected if

T.S. ≥ χ²_{(r−1)(c−1), α},

or if the p-value (the area to the right of the test statistic) < α. For the example at hand the T.S. is

T.S. = (15 − 7.7778)²/7.7778 + · · · + (22 − 14.1667)²/14.1667 = 27.135,

the degrees of freedom are 2(3) = 6 and the p-value is 0.0001366. Therefore, we reject H0 and conclude that Favored Plan and Office are not independent.
Once dependence is established, of interest is to determine which cells in the contingency table have higher or lower frequencies than expected (under independence). This is usually determined by observing the standardized residuals (deviations) of the observed counts n_ij from the expected counts E_ij, i.e.

r_ij = (n_ij − E_ij) / √( E_ij (1 − p_{i+})(1 − p_{+j}) ).
                          Office
Favored Plan       1         2         3         4
      1         3.3409    1.7267   -1.0695   -3.4306
      2        -0.3005   -0.0732   -0.2563    0.6104
      3        -3.0560   -1.6644    1.3428    2.8258

Table 8.5: Table of standardized residuals.


http://www.stat.ufl.edu/~ athienit/STA6166/contingency.R
Hence, there appear to be more people than expected (under independence) in office 1 who choose plan 1, and fewer than expected who choose plan 3. The opposite applies for office 4.
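A minimal R sketch that reproduces the test, the expected counts and the standardized residuals:

# Chapter 8 example: chi-square test of independence
counts <- matrix(c(15, 32, 18, 5,
                    8, 29, 23, 18,
                    1, 20, 25, 22), nrow = 3, byrow = TRUE)
out <- chisq.test(counts)
out               # should match: X-squared = 27.135, df = 6, p-value = 0.0001366
out$expected      # Table 8.4
out$stdres        # Table 8.5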


Example 8.1. A personnel director categorizes colleges as most desirable, good, adequate, and undesirable for purposes of hiring their graduates. The director collects data on 156 graduates on their performance and on which college they came from.

                              Rating
School            Outstanding   Average   Poor
Most desirable             21        25      2
Good                       20        36     10
Adequate                    4        14      7
Undesirable                 3         8      6

IN CLASS http://www.stat.ufl.edu/~ athienit/STA6166/rating.R


Part III
Part 3 Material


Chapter 9
Regression
Chapter 11 in textbook.
We have seen and interpreted the population correlation coefficient between
two r.vs that measures the strength of the linear relationship between the
two variables. In this chapter we hypothesize a linear relationship between
the two variables, estimate and draw inference about the model parameters.

9.1 Simple Linear Regression

The simplest deterministic mathematical relationship between two mathematical variables x and y is a linear relationship

y = β_0 + β_1 x,

where the coefficient β_0 represents the y-axis intercept, the value of y when x = 0, and β_1 represents the slope, interpreted as the amount of change in the value of y for a 1 unit increase in x.
To this model we add variability by introducing the random variables ε_i, i.i.d. N(0, σ²), for each observation i = 1, . . . , n. Hence, the statistical model by which we wish to model one random variable using known values of some predictor variable becomes

Y_i = β_0 + β_1 x_i + ε_i,   i = 1, . . . , n,        (9.1)

where Y_i represents the r.v. corresponding to the response, i.e. the variable we wish to model, and x_i stands for the observed value of the predictor. Therefore we have that, independently,

Y_i ∼ N(β_0 + β_1 x_i, σ²).        (9.2)

Notice that the Y's are no longer identical since their mean depends on the value of x_i.

Figure 9.1: Regression model (data points and regression line).


In order to fit a regression line one needs to find estimates for the coefficients 0 and 1 in order to find the prediction line
Yi = 0 + 1 xi .
The goal is to have this line as close to the data points as possible. The
concept, is to minimize the error from the actual data points to the predicted
points (in the direction of Y , i.e. vertical)
min

n
X
i=1

(Yi E(Yi ))

min

n
X
i=1

(Yi (0 + 1 xi ))2 .

Hence, the goal is to find the values of 0 and 1 that minimizes the sum of
the distances between the points and their expected value under the model.
This is done by the following steps:
1. Taking the partial derivatives with respect to 0 and 1
2. Equate the two resulting equations to 0
3. Solve the simultaneous equations for 0 and 1
4. (Optional) Taking second partial derivatives to show that in fact they
minimize, not maximize.

75

Therefore,

b_1 := β̂_1 = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)² = [ Σ x_i y_i − n x̄ ȳ ] / [ Σ x_i² − n x̄² ] = r (s_Y/s_X)        (9.3)

and

b_0 := β̂_0 = ȳ − b_1 x̄.

Remark 9.1. Do not extrapolate the model for values of the predictor x that were not in the data, as it is not clear how the model behaves for other values. Also, do not fit a linear regression to data that do not appear to be linear.
Next we introduce some notation that will be useful in conducting inference on the model. In order to determine whether a regression model is adequate we must compare it to the most naive model, which uses the sample mean Ȳ as the prediction, i.e. Ŷ = Ȳ. This model does not take into account any predictors as the prediction is the same for all values of x. Then the total distance of a point y_i to the sample mean ȳ can be broken down into two components, one measuring the error of the model for that point, and one measuring the improvement accounted for by the regression model:

(y_i − ȳ)  =  (y_i − ŷ_i)  +  (ŷ_i − ȳ)
  Total          Error         Regression

Figure 9.2: Sum of Squares breakdown.

Summing over all observations we have that

Σ_{i=1}^{n} (y_i − ȳ)²  =  Σ_{i=1}^{n} (y_i − ŷ_i)²  +  Σ_{i=1}^{n} (ŷ_i − ȳ)²,        (9.4)
        SST                      SSE                       SSR

since it can easily be shown that the cross-product term Σ_{i=1}^{n} (y_i − ŷ_i)(ŷ_i − ȳ) is equal to 0.

9.1.1 Goodness of fit

A goodness of fit statistic is a quantity that measures how well a model explains a given set of data. For regression, we will use the coefficient of determination

R² = SSR/SST = 1 − SSE/SST,

which is the proportion of variability in the response Y (about its mean) that is explained by the regression model; R² ∈ [0, 1].
Remark 9.2. For simple linear regression (only), with one predictor, the coefficient of determination is the square of the correlation coefficient, i.e. R² = r². This is not true when more than one predictor is used in the model.
Example 9.1. For 15 cement blocks of certain dimensions, their weight (lbs) and porosity (%) were measured.

Weight:    99.0 101.1 102.7 103.0 105.4 107.0 108.7 110.8 112.1 112.4 113.6 113.8 115.1 115.4 120.0
Porosity:  28.8  27.9  27.0  25.2  22.8  21.5  20.9  19.6  17.1  18.9  16.0  16.7  13.0  13.6  10.8

Table 9.1: Weight and porosity for cement blocks

Figure 9.3: Scatterplot with regression of weight vs. porosity.


The scatterplot of weight vs. porosity shows that there is a strong negative relationship between the two variables. Below is the MINITAB output.
Regression Analysis: weight versus porosity

The regression equation is
weight = 131 - 1.08 porosity

Predictor    Coef       SE Coef    T        P
Constant     130.854    1.012      129.28   0.000
porosity     -1.07644   0.04889    -22.02   0.000

S = 1.02319   R-Sq = 97.4%   R-Sq(adj) = 97.2%

Analysis of Variance

Source           DF   SS       MS       F        P
Regression        1   507.59   507.59   484.84   0.000
Residual Error   13   13.61    1.05
Total            14   521.20

http://www.stat.ufl.edu/~ dathien/STA6166/reg.R
We note that the slope b_1 = −1.08 implies that for each percentage point increase in porosity, the weight decreases by 1.08 lbs (for the values of porosity observed, about 10-30%). The coefficient of determination is 0.974, implying that 97.4% of the variability in weight (about its mean), conveyed by SS Total, is explained by the model, conveyed by the SS Regression value. The sample correlation coefficient of r = −0.987 also illustrates this strong negative relationship.
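A minimal R sketch that reproduces this analysis (data typed in from Table 9.1):

# Example 9.1: simple linear regression of weight on porosity
porosity <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9,
              16.0, 16.7, 13.0, 13.6, 10.8)
weight   <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1,
              112.4, 113.6, 113.8, 115.1, 115.4, 120.0)
fit <- lm(weight ~ porosity)
summary(fit)     # coefficients, s, R-squared
anova(fit)       # SS Regression / SS Error breakdown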

If we wish to predict weight for a block of porosity 18%, that would be 130.85 − 1.08(18) = 111.41 lbs. For the existing data, we have the fitted values, which are used to calculate SSE and SSR.

Porosity       28.8    27.9    27.0    25.2    22.8    21.5    20.9    19.6    17.1    18.9    16.0    16.7    13.0    13.6    10.8
Weight (y_i)   99.0   101.1   102.7   103.0   105.4   107.0   108.7   110.8   112.1   112.4   113.6   113.8   115.1   115.4   120.0
Fitted (ŷ_i)  99.853 100.822 101.791 103.728 106.312 107.711 108.357 109.756 112.447 110.510 113.631 112.878 116.861 116.215 119.229

Table 9.2: Porosity, weight and fitted weight values.

9.1.2 Distribution of response and coefficients

For the regression model in equation (9.1), we assume that the ε_i are i.i.d. N(0, σ²) and hence the Y_i are independent N(β_0 + β_1 x_i, σ²). The variance term σ², unlike in earlier chapters, represents the variance of the response Y about its mean as indicated by the model. Therefore,

s² = σ̂² = SSE/(n − 2) = Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2).

The denominator is n − 2 as we lose 2 degrees of freedom for estimating the two parameters β_0 and β_1. You will recall in earlier chapters the use of n − 1 degrees of freedom, which was due to losing 1 degree of freedom for estimating the center μ_Y by ȳ.
The coefficients b_0 and b_1 of equation (9.3) are linear combinations of the responses. Therefore, they have corresponding r.vs B_0 and B_1 and, since the Y's are independent normal r.vs, by Proposition 2.5 they are themselves normal r.vs. Re-expressing the r.v. B_1,

B_1 = Σ_{i=1}^{n} (x_i − x̄)(Y_i − Ȳ) / Σ_{i=1}^{n} (x_i − x̄)² = Σ_{i=1}^{n} [ (x_i − x̄) / Σ_{j=1}^{n} (x_j − x̄)² ] Y_i,

it is clear that B_1 is a linear combination of the responses with

E(B_1) = [ 1/Σ_{j=1}^{n} (x_j − x̄)² ] Σ_{i=1}^{n} (x_i − x̄) E(Y_i) = · · · = β_1,   where E(Y_i) = β_0 + β_1 x_i,

and

V(B_1) = [ 1/( Σ_{j=1}^{n} (x_j − x̄)² )² ] Σ_{i=1}^{n} (x_i − x̄)² V(Y_i) = · · · = σ² / Σ_{j=1}^{n} (x_j − x̄)².

Thus,

B_1 ∼ N( β_1 , σ² / Σ_{i=1}^{n} (x_i − x̄)² ).        (9.5)

Remark 9.3. The intercept term is not of much practical importance as it is the value of the response when the predictor value is 0, and inference on it is omitted. Also, whether statistically significant or not, it is always kept in the model to create a parsimonious and better fitting model. It can be shown, in similar fashion to B_1, that

B_0 ∼ N( β_0 , [ 1/n + x̄²/Σ_{i=1}^{n} (x_i − x̄)² ] σ² ).

Remark 9.4. The larger the spread in the values of the predictor, the larger the Σ_{i=1}^{n} (x_i − x̄)² value will be, and hence the smaller the variances of B_0 and B_1. Also, since the (x_i − x̄)² are nonnegative terms, when we have more data points, i.e. larger n, we are summing more nonnegative terms and Σ_{i=1}^{n} (x_i − x̄)² grows larger.

9.1.3 Inference on slope coefficient

The distribution of the r.v. corresponding to the slope coefficient, that is B_1, is given in equation (9.5). This is a scenario we are all too familiar with; it is the same as in Section 3.1.2, and similar to equation (3.3) we have that

( B_1 − β_1 ) / ( s / √( Σ_{i=1}^{n} (x_i − x̄)² ) ) ∼ t_{n−2},

where s stands for the conditional (upon the model) standard deviation of the response. The true variance is never known, as there are infinite model variations, and hence the Student's t distribution is used instead of the standard normal, irrespective of the sample size. Important to note is the fact that the degrees of freedom are n − 2, as 2 were lost due to the estimation of β_0 and β_1.
Therefore, a 100(1 − α)% C.I. for β_1 is

β̂_1 ± t_{n−2, α/2} s_{β̂_1},

where s_{β̂_1} = s / √( Σ_{i=1}^{n} (x_i − x̄)² ). Similarly, for a null hypothesis value β_{10}, the test statistic is

T.S. = ( β̂_1 − β_{10} ) / s_{β̂_1},

which under the null hypothesis has a corresponding t_{n−2} r.v.

Example 9.2. A 95% C.I. for β_1 is

−1.07644 ± t_{13,0.025} (0.04889)   which gives   (−1.1821, −0.9708).

9.1.4 C.I. on the mean response

For an observed value of the predictor, x_obs, i.e. x_obs = x_k for some k, we also have the observed value of the response, creating the data point (x_obs, y_obs) in two-dimensional space. However, we also have the fitted value of the response (once a regression model is fitted), ŷ = b_0 + b_1 x_obs. We wish to better understand the behavior of the fitted response, so let us look at the r.v. Ŷ = B_0 + B_1 x_obs, that is, before any data on the responses are obtained. After substituting the formulas for b_0 and b_1 it can be re-expressed as

Ŷ = B_0 + B_1 x_obs = Σ_{i=1}^{n} [ 1/n + (x_obs − x̄)(x_i − x̄)/Σ_{j=1}^{n} (x_j − x̄)² ] Y_i.        (9.6)

Hence, Ŷ is a r.v. that can be expressed as a linear combination of the independent normal r.vs Y_i, whose distribution is known (equation (9.2)). Therefore, Ŷ is also a normal r.v. After some algebra, we have that

Ŷ ∼ N( β_0 + β_1 x_obs , [ 1/n + (x_obs − x̄)²/Σ_{j=1}^{n} (x_j − x̄)² ] σ² ).

Thus, a 100(1 − α)% C.I. for the mean response, E(Y) = β_0 + β_1 x_obs, for a value of the predictor that is observed, i.e. in the data, is

ŷ ± t_{(n−2, α/2)} s √( 1/n + (x_obs − x̄)²/Σ_{j=1}^{n} (x_j − x̄)² ),

where the term multiplying t_{(n−2, α/2)} is denoted s_Ŷ.

Example 9.3. Refer back to Example 9.1, concerning cement blocks, and specifically Table 9.1. Assume we are interested in a C.I. for the mean value of the weight (response) when porosity is 27%. Notice that the value of 27% for porosity is observed. So with

x̄ = 19.987,   Σ(x_i − x̄)² = 438.057,   (x_obs − x̄)² = 49.182,   s = 1.023,

a 95% C.I. around the mean response, with fitted value 130.85 − 1.08(27) = 101.79, is

101.791 ± t_{13,0.025} (1.023) √( 1/15 + 49.182/438.057 ), where t_{13,0.025} = 2.160,   which gives   (100.8563, 102.7257).

9.1.5 Prediction interval

Once a regression model is fitted, after obtaining data (x_1, y_1), . . . , (x_n, y_n), it may be of interest to predict a value of the response using a value of the predictor that is unobserved, i.e. not in the dataset. From equation (9.1), we have some idea where this new prediction value will lie: somewhere around the mean response E(Y_new) = β_0 + β_1 x_new. However, according to the model, equation (9.1), we do not expect new predictions to fall exactly on the mean response but close to it, hence the error term ε ∼ N(0, σ²). Thus, when predicting for a new value of a predictor we must add ε to equation (9.6). Therefore,

Y_pred ∼ N( β_0 + β_1 x_new , [ 1 + 1/n + (x_new − x̄)²/Σ_{j=1}^{n} (x_j − x̄)² ] σ² ),

and a 100(1 − α)% prediction interval (P.I.) for Y_pred = β_0 + β_1 x_new + ε, for a value of the predictor that is unobserved, i.e. not in the data, is

ŷ_pred ± t_{(n−2, α/2)} s √( 1 + 1/n + (x_new − x̄)²/Σ_{j=1}^{n} (x_j − x̄)² ),

where the term multiplying t_{(n−2, α/2)} is denoted s_pred.

Example 9.4. Refer back to Example 9.1. Let us estimate the value of the cement block weight when porosity is 14% and create an interval around it. First we check that a value of 14% for porosity is within the range of the observed data but does not belong to one of the data points, so we need to predict the value and create a P.I. The predicted value is 130.85 − 1.08(14) = 115.73, and with the following values

x̄ = 19.987,   Σ(x_i − x̄)² = 438.057,   (x_new − x̄)² = 34.422,   s = 1.023,

a 95% P.I. around the predicted value is

115.73 ± t_{(13,0.025)} (1.023) √( 1 + 1/15 + 34.422/438.057 ), where t_{13,0.025} = 2.160,   which gives   (113.3653, 118.0947).
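Both of these intervals can be obtained in R with predict() on the fitted model (a sketch, refitting the model of Example 9.1):

# Examples 9.3 and 9.4: C.I. for the mean response and P.I. for a new observation
porosity <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9,
              16.0, 16.7, 13.0, 13.6, 10.8)
weight   <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1,
              112.4, 113.6, 113.8, 115.1, 115.4, 120.0)
fit <- lm(weight ~ porosity)
predict(fit, newdata = data.frame(porosity = 27), interval = "confidence")  # Example 9.3
predict(fit, newdata = data.frame(porosity = 14), interval = "prediction")  # Example 9.4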

9.1.6 Checking assumptions

Recall that for the simple linear regression model

Y_i = β_0 + β_1 x_i + ε_i,   i = 1, . . . , n,

we assume that the ε_i are i.i.d. N(0, σ²) for i = 1, . . . , n. However, once a model is fit, before any inference or conclusions are made based upon the fitted model, the assumptions of the model need to be checked. These are:

1. Normality
2. Independence
3. Homogeneity of variance
4. Model fit
with components of model fit being checked simultaneously within the first three. The assumptions are checked using the residuals e_i := y_i − ŷ_i for i = 1, . . . , n, or the standardized residuals, which are the residuals divided by their standard deviation. Standardized residuals are usually the default residuals used, as their standard deviation should be around 1.
Although exact statistical tests exist to test the assumptions, linear regression is robust to slight deviations, so only graphical procedures will be introduced here.
Normality
The simplest way to check for normality is with two graphical procedures:
• Histogram
• P-P or Q-Q plot
A histogram of the residuals is plotted and we try to determine whether the histogram is symmetric and bell shaped, like a normal distribution. In addition, to check the model fit, we assume the observed response values y_i are centered around the regression line ŷ. Hence, the histogram of the residuals should be centered at 0. Referring to Example 9.1, we obtain the following histogram.

Figure 9.4: Histogram of standardized residuals.


We have referenced P-P and Q-Q plots in Section 5.3. Referring to Example 9.1, we obtain the following P-P plot of the residuals.

Figure 9.5: P-P Plot of standardized residuals.


Independence
To check for independence, a time series plot of the residuals/standardized residuals is used, i.e. a plot of the value of the residual versus its position in the data set. For example, the first data point (x_1, y_1) will yield the residual e_1 = y_1 − ŷ_1. Hence, the order of e_1 is 1, and so forth. Independence is graphically checked if there is no discernible pattern in the plot. That is, one cannot predict the next ordered residual by knowing a few previous ordered residuals. Referring to Example 9.1, we obtain the following plot, where there does not appear to be any discernible pattern.

Figure 9.6: Time series plot of residuals.


Homogeneity of variance/Fit of model
Recall that the regression model assumes that the errors ε_i have constant variance σ². In order to check this assumption, a plot of the residuals (e_i) versus the fitted values (ŷ_i) is used. If the variance is constant, one expects to see a constant spread/distance of the residuals to the 0 line across all the ŷ_i values of the horizontal axis. Referring to Example 9.1, we see that this assumption does not appear to be violated.

Figure 9.7: Residual versus fitted values plot.


In addition, the same plot can be used to check the fit of the model. If the model is a good fit, one expects to see the residuals evenly spread on either side of the 0 line. For example, if we observe residuals that are more heavily sided above the 0 line for some interval of ŷ_i, then this is an indication that the regression line is not moving through the center of the data points for that section. By construction, the regression line does move through the center of the data overall, i.e. for the whole big picture. So if it is underestimating (or overestimating) for some portion then it will overestimate (or underestimate) for some other. This is an indication that there is some curvature and that perhaps some polynomial terms should be added. (To be discussed in the next chapter.)
http://www.stat.ufl.edu/~ athienit/STA6166/reg_ex.R
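A minimal R sketch of these diagnostic plots for Example 9.1 (keeping the -3 to 3 scale recommended in Remark 9.5 below):

# Residual diagnostics for the Example 9.1 fit
porosity <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9,
              16.0, 16.7, 13.0, 13.6, 10.8)
weight   <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1,
              112.4, 113.6, 113.8, 115.1, 115.4, 120.0)
fit  <- lm(weight ~ porosity)
sres <- rstandard(fit)                                   # standardized residuals
hist(sres)                                               # normality, centered at 0
qqnorm(sres); qqline(sres)                               # normality
plot(sres, type = "b", ylim = c(-3, 3))                  # independence (order plot)
plot(fitted(fit), sres, ylim = c(-3, 3)); abline(h = 0)  # homogeneity / fit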

9.1.7 Box-Cox (Power) transformation

In the event that the model assumptions appear to be violated to a significant degree, then a linear regression model on the available data is not valid. However, have no fear, your friendly statistician is here. The data can be transformed, in an attempt to fit a valid regression model to the new transformed data set. Both the response and the predictor can be transformed, but there is usually more emphasis on the response.
A common transformation mechanism is the Box-Cox transformation (also known as the power transformation). This transformation mechanism, when

applied to the response variable, will attempt to remedy the worst of the assumptions violated, i.e. to reach a compromise. A word of caution: in an attempt to remedy the worst it may worsen the validity of one of the other assumptions. The mechanism works by trying to identify the (minimum or maximum, depending on software) value of a parameter λ that will be used as the power to which the responses will be transformed. The transformation is

y_i^(λ) = ( y_i^λ − 1 ) / ( λ G_y^{λ−1} )   if λ ≠ 0,
y_i^(λ) = G_y log(y_i)                      if λ = 0,

where G_y = ( Π_{i=1}^{n} y_i )^{1/n} denotes the geometric mean of the responses. Note that a value of λ = 1 effectively implies no transformation is necessary. There are many software packages that can calculate an estimate for λ, and if the sample size is large enough even create a C.I. around the value. Referring to Example 9.1, we see that λ̂ = 0.55.

Figure 9.8: Box-Cox plot.


A C.I. could not be created due to the relatively small sample size of 15 observations. However, one could argue that the value is close to 1 and that a transformation may not necessarily improve the overall impression of the validity of the assumptions, so no transformation is necessary. In addition,
we know that linear regression is somewhat robust to deviations from the
assumptions, and it is more practical to work with the untransformed data
that are in the original units of measurements. For example, if the data is in
miles and a transformation is used, inference will be on log(miles).
Below is another set of data that yielded the following Box-Cox plot. In
this example a log transformation appears to be appropriate.


Figure 9.9: Box-Cox plot.


Example 9.5. Shown by running R code:
http://www.stat.ufl.edu/~ athienit/STA6166/boxcox.R
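One common way to produce such a plot is MASS::boxcox(); a minimal sketch for the Example 9.1 model (the estimate it returns should be close to, though not necessarily identical to, the 0.55 quoted above):

# Box-Cox plot for the Example 9.1 regression
library(MASS)
porosity <- c(28.8, 27.9, 27.0, 25.2, 22.8, 21.5, 20.9, 19.6, 17.1, 18.9,
              16.0, 16.7, 13.0, 13.6, 10.8)
weight   <- c(99.0, 101.1, 102.7, 103.0, 105.4, 107.0, 108.7, 110.8, 112.1,
              112.4, 113.6, 113.8, 115.1, 115.4, 120.0)
bc <- boxcox(lm(weight ~ porosity))   # profile log-likelihood over lambda
bc$x[which.max(bc$y)]                 # value of lambda maximizing the likelihood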
If the model fit assumption is the major culprit violated, a transformation of the predictor(s) will often resolve the issue without
having to transform the response and consequently changing its scale.
Example 9.6. In an experiment, 13 subjects were asked to memorize a list of disconnected items and then asked to recall them at various times up to a week later.
Response = proportion of items recalled correctly.
Predictor = time, in minutes, since the list was initially memorized.

Time (min):   1    5    15   30   60   120  240  480  720  1440  2880  5760  10080
Prop:        0.84 0.71 0.61 0.56 0.54 0.47 0.45 0.38 0.36 0.26  0.20  0.16  0.08

[Figure: scatterplot of proportion recalled vs. time, together with residual diagnostic panels (histogram, normal Q-Q plot, independence, homogeneity/fit) for the untransformed model.]

bcPower Transformation to Normality
          Est.Power  Std.Err.  Wald Lower Bound  Wald Upper Bound
dat$time     0.0617    0.1087           -0.1514            0.2748

Likelihood ratio tests about transformation parameters
                             LRT  df          pval
LR test, lambda = (0)   0.327992   1  5.668439e-01
LR test, lambda = (1)  46.029370   1  1.164935e-11

It seems that a decent choice for λ is 0, i.e. a log transformation for time.

[Figure: scatterplot of proportion recalled vs. log(time), together with residual diagnostic panels (histogram, normal Q-Q plot, independence, homogeneity/fit) for the transformed model.]

http://www.stat.ufl.edu/~ athienit/STA6166/reg_transpred.R
Remark 9.5. When creating graphs and checking for patterns, try to keep the axis for the standardized residuals ranging from -3 to 3, that is, 3 standard deviations below 0 to 3 standard deviations above 0. Software has a tendency to zoom in, as done in the notes, where some axes for standardized residuals run from -1.5 to 0.5. Obviously, if you zoom in enough you will see a pattern. For example, is glass smooth? If you are viewing by eye then yes. If you are viewing via an electron microscope then no.
In R just add plot(....., ylim=c(-3,3))

9.2 Multiple Regression

9.2.1 Model

The multiple regression model is an extension of the simple regression model whereby, instead of only one predictor, there are multiple predictors to better aid in the estimation and prediction of the response. Let p denote the number of predictors and (y_i, x_{1i}, x_{2i}, . . . , x_{pi}) denote the (p + 1)-dimensional data points for i = 1, . . . , n. The statistical model is

Y_i = β_0 + β_1 x_{1i} + · · · + β_p x_{pi} + ε_i

for i = 1, . . . , n with ε_i i.i.d. N(0, σ²).

A multiple regression model can also include polynomial terms (powers of predictors). For example, one can define x_{2i} := x_{1i}³. The model is still linear as it is linear in the coefficients (the β's). Polynomial terms are useful for accounting for potential curvature in the relationship between predictors and the response. Also, a polynomial term such as x_{4i} = x_{1i} x_{3i} is coined the interaction term of x_1 with x_3. Such terms are of particular usefulness when an interaction exists between two predictors, i.e. when the level/magnitude of one predictor has a relationship to the level/magnitude of the other. For example, one may wish to fit the following model:

Y_i = β_0 + β_1 x_{1i} + β_2 x_{1i}² + β_3 x_{2i} + β_4 x_{1i}x_{2i} + β_5 x_{1i}²x_{2i} + ε_i

9.2.2 Goodness of fit

Goodness of fit is still measured by the coefficient of determination R² = SSR/SST, where the sums of squares are calculated under the multiple regression model. Intuitively, we note that SSR will always increase, or equivalently SSE always decreases, as we include more and more predictors in the model. This is because the fitted values ŷ_i better fit the observed values of the response (y_i). However, any increase in SSR, no matter how minuscule, will cause R² to increase.
This implies that the addition of seemingly unimportant predictors to the model will lead to a tiny increase in R². This small gain in goodness of fit is usually not of any practical value, and in effect the model may be overcomplicated with too many unimportant predictors. This has led to the introduction of the adjusted R², defined as

R²_adj := R² − (1 − R²) p/(n − p − 1) = 1 − (SSE/df_E)/(SST/df_T).

As p increases, R² increases, but the second term, which is subtracted from R², also increases. Hence, the second term can be thought of as a penalizing factor. For example, a linear regression model of 50 observations with 3 predictors may yield an R² = 0.677, and the addition of 2 unimportant predictors yields a slight increase to R² = 0.679. Then

R²_adj = 0.679 − (1 − 0.679) (5/44) = 0.6425.

9.2.3 Inference

The sum of squares calculation remains as in equation (9.4). However, the degrees of freedom associated with SSE are now n − (p + 1). Therefore,

SS:   SST   =  SSR  +  SSE
df:  (n − 1) =   p   + (n − p − 1)

and the conditional variance is

s² = SSE/(n − p − 1) = Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − p − 1).

Individual tests
Estimating the vector of coefficients β = (β_0, β_1, . . . , β_p) now falls in the field of matrix algebra and will not be covered in this class. We will simply use the estimates provided by computer output. The interpretation of the slope coefficients requires an additional statement. For example, a 1-unit increase in predictor k will cause the response to change by the amount β_k, assuming all other predictors are held constant.
Inference on the slope parameters β_j for j = 1, . . . , p is done as in Section 9.1.3 but under the assumption that

( B_j − β_j ) / s_{β̂_j} ∼ t_{n−p−1}.

An individual test on β_k tests the significance of predictor k, assuming all other predictors x_j for j ≠ k are included in the model. This can lead to different conclusions depending on what other predictors are included in the model.
Consider the following theoretical toy example. Someone wishes to measure the area of a square (the response) using as predictors two potential variables, the length and the height of the square. Due to measurement error, replicate measurements are taken.
• A simple linear regression is fitted with length as the only predictor, x = length. For the test H0: β_1 = 0, do you think that we would reject H0, i.e. is length a significant predictor of area?
• Now assume that a multiple regression model is fitted with both predictors, x_1 = length and x_2 = height. Now, for the test H0: β_1 = 0, do you think that we would reject H0, i.e. is length a significant predictor of area given that height is already included in the model?
This scenario is defined as confounding. In the toy example, height is a
confounding variable, i.e. an extraneous variable in a statistical model that
correlates with both the response variable and another predictor variable.
Example 9.7. In an experiment of 22 observations, a response y and two predictors x_1 and x_2 were observed. Two simple linear regression models were fitted:

(1)  y = 6.33 + 1.29 x1

Predictor   Coef     SE Coef   T      P
Constant    6.335    2.174     2.91   0.009
x1          1.2915   0.1392    9.28   0.000

S = 2.95954   R-Sq = 81.1%   R-Sq(adj) = 80.2%

(2)  y = 54.0 - 0.919 x2

Predictor   Coef      SE Coef   T       P
Constant    53.964    8.774     6.15    0.000
x2          -0.9192   0.2821    -3.26   0.004

S = 5.50892   R-Sq = 34.7%   R-Sq(adj) = 31.4%

Each predictor in its respective model is significant due to the small p-values for their corresponding coefficients. The simple linear regression model (1) is able to explain more of the variability in the response than model (2), with R² = 81.1%. Logically, one would then assume that a multiple regression model with both predictors would be the best model. The output of this model is given below:

(3)  y = 12.8 + 1.20 x1 - 0.168 x2

Predictor   Coef      SE Coef   T       P
Constant    12.844    7.514     1.71    0.104
x1          1.2029    0.1707    7.05    0.000
x2          -0.1682   0.1858    -0.91   0.377

S = 2.97297   R-Sq = 81.9%   R-Sq(adj) = 80.0%

We notice that the individual test for β_1 still classifies x_1 as significant given x_2, but x_2 is no longer significant given x_1. Also, we notice that the coefficient of determination, R², has increased only by 0.8%, and in fact R²_adj has decreased from 80.2% in (1) to 80.0% in (3). This is because x_1 is acting as a confounding variable on x_2. The relationship of x_2 with the response y is mainly accounted for by the relationship of x_1 with y. The correlation coefficient of r_{x1,x2} = −0.573 indicates a moderate negative relationship, as shown in the scatterplot.

Figure 9.10: Scatterplot of x1 vs x2 .


However, since x1 is a better predictor, the multiple regression model is
still able to determine that x1 is significant given x2 , but not vice versa.
Remark 9.6. In the event that the correlation between x1 and x2 is strong, e.g.
|rx1 ,x2 | > 0.7, both p-values for the individual tests in the multiple regression
model would be large. The model would not be able to distinguish a better
predictor from the two since they are nearly identical. Hence, x1 given x2 ,
and x2 given x1 would not be significant.
Simultaneous tests
Thus far we have only seen hypothesis tests about individual β's. In an experiment with multiple predictors, using only individual tests, the researcher can only test and potentially drop one predictor at a time, refitting the model at each step. However, a method exists for testing the statistical significance of multiple predictors simultaneously.
Let p denote the total number of predictors. Then we can simultaneously test for the significance of k (≤ p) predictors. For example, let p = 5 and the full model be

Y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + β_3 x_{3i} + β_4 x_{4i} + β_5 x_{5i} + ε_i        (9.7)

Now, assume that after fitting this model and looking at some preliminary results, including the individual tests, we wish to test whether we can simultaneously remove the first, third and fourth predictors, i.e. x_1, x_3 and x_4. Consequently, we wish to test the hypotheses

H0: β_1 = β_3 = β_4 = 0 vs Ha: at least one of them ≠ 0.

In effect we wish to compare the full model in equation (9.7) to the reduced model

Y_i = β_0 + β_1 x_{2i} + β_2 x_{5i} + ε_i        (9.8)

The SSE of the reduced model will be larger than the SSE of the full model, as it only has two of the predictors of the full model. The test statistic is based on comparing the difference in SSE of the reduced model to the full model:

T.S. = [ (SSE_red − SSE_full) / (dfE_red − dfE_full) ] / [ SSE_full / dfE_full ].

Under the null hypothesis H0, the r.v. corresponding to the test statistic follows an F-distribution with two degrees of freedom, F_{ν1,ν2}, with ν1 = dfE_red − dfE_full and ν2 = dfE_full. The p-value for this test is always the area to the right of the F-distribution, i.e. P(F_{ν1,ν2} > T.S.).
Remark 9.7. Note that ν1 = dfE_red − dfE_full always equals the number of predictors being tested in a simultaneous test. If n denotes the sample size, then for our example with p = 5 and testing 3 predictors,

ν1 = (n − 2 − 1) − (n − 5 − 1) = 3.
Remark 9.8. Simultaneous testing has to be done by fitting both the full model and the reduced model in order to obtain the two sets of SSE. Computer output will, however, perform a simultaneous test for the significance of all the predictors. This is called the overall test of the model. In this case, the reduced model has no predictors, hence

Y_i = β_0 + ε_i,   equivalently   Y_i = μ + ε_i,

and thus SSE_red = SST and dfE_red = n − 1. Therefore,

T.S. = [ (SST − SSE) / ((n − 1) − (n − p − 1)) ] / [ SSE / (n − p − 1) ] = (SSR/p) / (SSE/(n − p − 1)) = MSR/MSE.
Example 9.8. In a biological experiment, researchers wanted to model the biomass of an organism with respect to salinity (SAL), acidity (pH), potassium (K), sodium (Na) and zinc (Zn), with a sample size of 45. The full model yielded the following results:

Coefficients:
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  171.06949   1481.15956    0.115   0.90864
salinity      -9.11037     28.82709   -0.316   0.75366
pH           311.58775    105.41592    2.956   0.00527
K             -0.08950      0.41797   -0.214   0.83155
Na            -0.01336      0.01911   -0.699   0.48877
Zn            -4.47097     18.05892   -0.248   0.80576
---
Residual standard error: 477.8 on 39 degrees of freedom
Multiple R-squared: 0.4867,  Adjusted R-squared: 0.4209
F-statistic: 7.395 on 5 and 39 DF,  p-value: 5.866e-05

Analysis of Variance

Source           DF   SS         MS         F       P
Regression        5    8439559   1687912    7.395   0.000
Residual Error   39    8901715   228249.1
Total            44   17341274

Assuming all the model assumptions are met, we first take a look at the overall fit of the model:
$$H_0: \beta_1 = \dots = \beta_5 = 0 \quad \text{vs} \quad H_a: \text{at least one of them} \neq 0$$
The test statistic value is T.S. = 7.395 with an associated p-value of approximately 0 (found using an F_{5,39} distribution). Hence, at least one predictor appears to be significant. In addition, the coefficient of determination, R², is 48.67%, indicating that the regression model accounts for a sizeable proportion of the variability in the response.
Looking at the individual tests, acidity (pH) is significant given all the other predictors, with a small p-value (0.00527), but salinity, potassium (K), sodium (Na) and zinc (Zn) have large p-values for their individual tests. Since those p-values are all large (about 0.5 or above), it is acceptable to consider them for a simultaneous test. However, this may just be a case of confounding, as certain variables are highly correlated. Table 9.3 provides the pairwise correlations of the continuous variables.
           biomass  salinity      pH       K      Na      Zn
biomass      1.000    -0.084   0.669  -0.150  -0.219  -0.503
salinity    -0.084     1.000  -0.051  -0.021   0.162  -0.421
pH           0.669    -0.051   1.000   0.019  -0.038  -0.722
K           -0.150    -0.021   0.019   1.000   0.792   0.074
Na          -0.219     0.162  -0.038   0.792   1.000   0.117
Zn          -0.503    -0.421  -0.722   0.074   0.117   1.000

Table 9.3: Pearson pairwise correlations among the continuous variables

Notice that pH and Zn are highly negatively correlated, so we should attempt to remove Zn, as its p-value is 0.80576 (and pH's p-value is small). Also, there is a strong positive correlation between K and Na. Since both of their p-values are large, at 0.83155 and 0.48877 respectively, we should attempt to remove at least one and see what happens, or be a bit greedy and see if we can remove both. We will be a bit conservative and try to remove salinity, K and Zn simultaneously. To test
$$H_0: \beta_{salinity} = \beta_{K} = \beta_{Zn} = 0 \quad \text{vs} \quad H_a: \text{at least one of them} \neq 0,$$
the reduced model needs to be fitted.
Coefficients:
                Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -282.86356   319.38767   -0.886    0.3809
pH             333.10556    55.78001    5.972  4.36e-07
Na              -0.01770     0.01011   -1.752    0.0871
---
Residual standard error: 461.1 on 42 degrees of freedom
Multiple R-squared: 0.4851, Adjusted R-squared: 0.4606
F-statistic: 19.79 on 2 and 42 DF, p-value: 8.82e-07
Analysis of Variance
Source            DF         SS        MS      F      P
Regression         2    8412953   4206477  19.79  0.000
Residual Error    42    8928321    212579
Total             44   17341274

The test statistic is
$$T.S. = \frac{(8928321 - 8901715)/3}{8901715/39} = 0.0389.$$
From the F-table with 3 and 39 degrees of freedom we can determine that the p-value is greater than 0.05 (it is in fact 0.9896), so we fail to reject the null. This implies that those predictors are not statistically significant, and we now use the reduced model as our current model.
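As a quick check, the same test statistic and p-value can be computed in R directly from the two error sums of squares reported above:

sse_full <- 8901715; df_full <- 39   # from the full model ANOVA table
sse_red  <- 8928321; df_red  <- 42   # from the reduced model ANOVA table
ts <- ((sse_red - sse_full)/(df_red - df_full)) / (sse_full/df_full)
ts                                                                   # 0.0389
pf(ts, df1 = df_red - df_full, df2 = df_full, lower.tail = FALSE)    # about 0.99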
At this point we see that Na is marginally significant, with a p-value of 0.0871. Some may argue to remove it and some may not (due to its p-value being on the cusp). I would suggest keeping it, since our model is simple enough at this point. Fitting a model without Na (only pH) gives
- a model with a higher conditional standard deviation,
$$s = \sqrt{MSE} = \sqrt{\frac{SSE}{n-p-1}} = 472 \quad \text{(compared to 461.1)}$$
- a smaller R²_adj of 0.4347 (compared to 0.4606).
So we can argue to keep Na in the model.



Remark 9.9. Similar to simple linear regression,
- one can still create C.I.s or P.I.s for multiple regression, but it must be done via statistical software as the calculations are more complex;
- checking assumptions is done in the exact same way as in Subsection 9.1.6.
http://www.stat.ufl.edu/~athienit/STA6166/linthurst.R
Remark 9.10.
- Another possibility in trying to determine a full versus reduced model is to go through a large number of possible models with different combinations of predictors and compare their R²_adj values.
- We could have reached the same final model choice by simply performing individual t-tests on the coefficients and refitting the model each time, i.e. not performing simultaneous tests. This is the basis behind stepwise regression, which software can do. Simultaneous tests are simply used for convenience, to save time.
- There is no one best/correct model; certain models meet certain criteria better than others.
Remark 9.11. A full model does not necessarily imply a model with all the
predictors. It simply means a model that has more predictors than the
reduced model, i.e. a fuller model. For example, one may do a simultaneous
test to determine if they can drop 2 predictors and hence compare a full versus
reduced model. Assume that they do decide to go with the reduced model
but then wish to perform an additional simultaneous test on the reduced
model. In this second step, the reduced model becomes the new full model
that will be compared to a further reduced model.
Remark 9.12. In a regression model with many predictors it is not always easy to decide which predictors should be tested simultaneously. As a rule of thumb, only predictors with individual-test p-values of about 0.4 or above should be considered. The reasoning is that once a few predictors are dropped, the significance of other predictors may increase.
Example 9.9. Automated model selection:
http://www.stat.ufl.edu/~athienit/STA6166/cruise_model_selection.R
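The referenced script is not reproduced here, but as an illustration of the idea, one common automated approach in R is step(), which performs stepwise selection using AIC rather than R²_adj (variable and data frame names below are hypothetical):

full <- lm(y ~ x1 + x2 + x3 + x4 + x5, data = dat)   # start from a full model
step(full, direction = "backward")                    # drop predictors one at a time by AIC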
Qualitative predictors
Interpreting a regression model with qualitative predictors is slightly different. A qualitative predictor is a variable with groups or classifications. The basic case, with only two groups, is illustrated by the following example.
Example 9.10. A study is conducted to determine the effects of company size and the presence or absence of a safety program on the number of hours lost due to work-related accidents. A total of 40 companies are selected for the study. The variables are as follows:
y = lost work hours
x1 = number of employees
x2 = 1 if a safety program is used, 0 if no safety program is used
The proposed model,
$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \epsilon_i$$
implies that
$$Y_i = \begin{cases} (\beta_0 + \beta_2) + \beta_1 x_{1i} + \epsilon_i & \text{if } x_2 = 1 \\ \beta_0 + \beta_1 x_{1i} + \epsilon_i & \text{if } x_2 = 0 \end{cases}$$

When a safety program is used, i.e. x2 = 1, the intercept is β0 + β2, but the slope (for x1) remains the same in both cases. Fitting this model yields
y = 31.4 + 0.0142 x1 - 54.2 x2

Predictor        Coef   SE Coef      T      P
Constant       31.399     9.902   3.17  0.003
x1           0.014208  0.001400  10.15  0.000
x2            -54.210     7.243  -7.48  0.000

S = 22.8665   R-Sq = 81.5%   R-Sq(adj) = 80.5%

Analysis of Variance
Source            DF       SS     MS      F      P
Regression         2    85464  42732  81.73  0.000
Residual Error    37    19346    523
Total             39   104811

Therefore, when x2 = 1 the intercept is 31.399 − 54.210 = −22.811. The regression line for when a safety program is used is parallel to the line when no safety program is used, and it lies below it. A scatterplot of the data and the associated regression lines, differentiated by whether x2 = 1 or 0, is presented in Figure 9.11.

Figure 9.11: Scatterplot of lost work hours against x1 with fitted (parallel) regression lines for x2 = 0 and x2 = 1.


Although the overall fit of the model seems adequate, from Figure 9.11 we see that the regression line for x2 = 1 does not fit the data well, a fact that can also be seen by plotting the residuals in the assumption-checking procedure. The model is too restrictive in forcing parallel lines. Adding an interaction term makes the model less restrictive:
$$Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 (x_1 x_2)_i + \epsilon_i$$
which implies
$$Y_i = \begin{cases} (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_{1i} + \epsilon_i & \text{if } x_2 = 1 \\ \beta_0 + \beta_1 x_{1i} + \epsilon_i & \text{if } x_2 = 0 \end{cases}$$
Now, the slope for x1 is allowed to differ for x2 = 1 and x2 = 0.


y = -1.8 + 0.0197 x1 + 10.7 x2 - 0.0110 x1x2

Predictor         Coef   SE Coef      T      P
Constant         -1.84     10.13  -0.18  0.857
x1            0.019749  0.001546  12.78  0.000
x2               10.73     14.05   0.76  0.450
x1x2         -0.010957  0.002174  -5.04  0.000

S = 17.7488   R-Sq = 89.2%   R-Sq(adj) = 88.3%

Analysis of Variance
Source            DF       SS     MS      F      P
Regression         3    93470  31157  98.90  0.000
Residual Error    36    11341    315
Total             39   104811

The overall fit of the new model is adequate with T.S. = 98.90 but, more importantly, R²_adj has increased and s has decreased. Figure 9.12 also shows the better fit.
Figure 9.12: Scatterplot of lost work hours against x1 with fitted regression lines (including the interaction) for x2 = 0 and x2 = 1.


Remark 9.13. Since the interaction term x1x2 is deemed significant, then, to preserve the model hierarchy, all lower-order terms of the interaction, i.e. x1 and x2, should be kept in the model irrespective of their statistical significance. If x1x2 is significant then intuitively x1 and x2 are of importance (maybe not in the statistical sense).
Now let's try to do inference on the slope coefficient for x1. From the previous equation we saw that the slope takes on two values depending on the value of x2.
- For x2 = 0, it is just β1 and inference is straightforward.
- For x2 = 1, it is β1 + β3. We can estimate this with b1 + b3, but the variance of this estimator is not directly reported. From (2.2) we have that
$$V(B_1 + B_3) = V(B_1) + V(B_3) + 2Cov(B_1, B_3)$$
The sample statistics for all the covariances among the coefficients can easily be obtained in R using the vcov function (the variances, i.e. the squared standard errors, of b1 and b3 are already available in the output). Then create a 100(1 − α)% C.I. for β1 + β3:
$$b_1 + b_3 \pm t_{(n-p-1),\,\alpha/2} \sqrt{s^2_{b_1} + s^2_{b_3} + 2 s_{b_1 b_3}}$$
Remark 9.14. This concept can easily be extended to linear combinations of more than two coefficients.
http://www.stat.ufl.edu/~athienit/STA6166/safe_reg.R
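A minimal sketch of this calculation in R, assuming the interaction model has been fitted as fit <- lm(y ~ x1*x2, data = safety) (the object, data frame and coefficient names are assumptions based on that formula, not taken from the notes):

b <- coef(fit)
V <- vcov(fit)                          # estimated covariance matrix of the coefficient estimates
est <- b["x1"] + b["x1:x2"]             # estimate of beta1 + beta3
se  <- sqrt(V["x1", "x1"] + V["x1:x2", "x1:x2"] + 2*V["x1", "x1:x2"])
est + c(-1, 1) * qt(0.975, df = df.residual(fit)) * se   # 95% C.I. for beta1 + beta3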
In the previous example the qualitative predictor only had two levels, the use or lack of use of a safety program. To fully state all levels, only one binary/dummy predictor was necessary. In general, if a qualitative predictor has k levels, then k − 1 binary predictor variables are necessary. For example, a qualitative predictor for a traffic light has three levels: red, yellow and green. Therefore, only two binary predictors are necessary:
$$x_{red} = \begin{cases} 1 & \text{if red} \\ 0 & \text{otherwise} \end{cases} \qquad\qquad x_{yellow} = \begin{cases} 1 & \text{if yellow} \\ 0 & \text{otherwise} \end{cases}$$
The case when x_red = x_yellow = 0 means that the light is green.
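In R this bookkeeping is automatic: a qualitative predictor stored as a factor is expanded by lm() into k − 1 dummy columns. A small stand-alone illustration (not taken from the notes):

light <- factor(c("red", "yellow", "green", "red"))
model.matrix(~ light)   # design matrix showing the dummy columns for a 3-level factor
# With the default (alphabetical) level ordering, "green" is the baseline and the two
# columns lightred and lightyellow play the roles of x_red and x_yellow above.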
However, we can potentially treat this variable as quantitative rather than qualitative. Although the colour variable has three categories, one may argue that colour (in some contexts) is an ordinal qualitative predictor. For example, you can order a drink in 3 sizes: small, medium and large, with an inherent order of 1, 2 and 3. In terms of frequency (or wavelength) there is also an order:
- red (400-484 THz),
- yellow (508-526 THz),
- green (526-606 THz).
Instead of creating 2 dummy variables we can create one (continuous) variable for frequency, of which we happen to observe only 3 values: 442 THz for red, 517 THz for yellow and 566 THz for green (simply taking the midpoints of the frequency intervals).


Example 9.11. IN CLASS
http://www.stat.ufl.edu/~athienit/STA6166/hmwk12_13.R


Chapter 10
Analysis Of Variance
Chapter 8 in the textbook.
For t = 2 samples/populations we have already seen, in Chapters 5 and 6, various inference methods for comparing the central location of two populations. Next we introduce statistical models and procedures that allow us to compare more than two (> 2) populations.
Design / Data                 Parametric (Normal)   Nonparametric
Independent Samples (CRD)     1-Way ANOVA           Kruskal-Wallis Test
Paired Data (RBD)             2-Way ANOVA           Friedman's Test

10.1 Completely Randomized Design

The Completely Randomized Design (CRD) is a linear model (as is the regression model) where, for controlled experiments, subjects are assigned at random to one of t treatments and, for observational studies, subjects are sampled from t existing groups, with the purpose of comparing the different groups.
The CRD statistical model is
$$Y_{ij} = \mu + \tau_i + \epsilon_{ij}, \qquad j = 1, \dots, n_i, \quad i = 1, \dots, t \qquad (10.1)$$
where the ε_ij are i.i.d. N(0, σ²) and, to make the model identifiable, we impose the restriction that some τi = 0 or that Σ_{i=1}^t τi = 0. The goal is to test the statistical significance of the treatment effects, the τ's. If all τ's are 0 then the response can be modeled by a single mean μ rather than an individual mean μi for each treatment/sample. To better illustrate this concept consider the following example (Example 6.1 in the textbook).

Example 10.1. Company officials were concerned about the length of time a particular drug retained its potency. A random sample of n1 = 10 fresh bottles was retained and a second sample of n2 = 10 bottles was stored for a period of 1 year, and the following potency readings were obtained.

Fresh:   10.2  10.5  10.3  10.8   9.8  10.6  10.7  10.2  10.0  10.6
Stored:   9.8   9.6  10.1  10.2  10.1   9.7   9.5   9.6   9.8   9.9

Under the assumption of no treatment effects, H0: τ1 = τ2 = 0, there is no treatment effect but simply one grand mean μ (labeled the naive model in regression).

[Plot: the potency observations with the grand mean superimposed.]

Under the assumption of significant treatment effects, at least one treatment has a mean that is different from the grand mean (and from the others).

[Plot: the potency observations by method (Fresh, Stored) with the treatment means and the grand mean superimposed.]

The model in equation (10.1) is a linear model, so we have the same identity for the sums of squares:
$$\underbrace{\sum_{i=1}^{t}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{++})^2}_{SST} = \underbrace{\sum_{i=1}^{t}\sum_{j=1}^{n_i}(y_{ij}-\bar{y}_{i+})^2}_{SSE} + \underbrace{\sum_{i=1}^{t}\sum_{j=1}^{n_i}(\bar{y}_{i+}-\bar{y}_{++})^2}_{SSTrt}$$
where we can simplify
- SST = (N − 1)s²_y
- SSTrt = Σ_{i=1}^t n_i (ȳ_{i+} − ȳ_{++})²
- SSE = Σ_{i=1}^t (n_i − 1)s²_i

In addition, we have a similar identity for the degrees of freedom associated with each SS:
$$\underbrace{N-1}_{df_{Total}} = \underbrace{N-t}_{df_{Error}} + \underbrace{t-1}_{df_{Trt}}$$

It can be shown (in more advanced courses) that the SS have χ² distributions, and from that
$$E(MSE) = \sigma^2 \qquad\qquad E(MSTrt) = \sigma^2 + \frac{\sum_{i=1}^{t} n_i \tau_i^2}{t-1}$$
where MSE is SSE/df_Error and MSTrt is SSTrt/df_Trt. As a consequence, under H0: τ1 = ⋯ = τt = 0, E(MSTrt)/E(MSE) = 1. The test statistic for this hypothesis, with sampling distribution F_{t−1, N−t}, is
$$T.S. = \frac{MSTrt}{MSE}.$$


Reject H0 if T.S. ≥ F_{(t−1, N−t), α}, or equivalently if the p-value (the area to the right) is < α.
Remark 10.1. Checking the assumptions for the CRD model is done in exactly the same way as for regression, since both models belong to the same family of linear models. In addition, the Box-Cox transformation can also be used, just as was previously done for regression.
Example 10.2. A metal alloy that undergoes one of four possible strengthening procedures is tested for strength.

Factor         Alloy Strength            Mean   St. Dev.
A         250  264  256  260  239       253.8     9.757
B         263  254  267  265  267       263.2    5.4037
C         257  279  269  273  277       271.0    8.7178
D         253  258  262  264  273       262.0    7.4498

http://www.stat.ufl.edu/~athienit/STA6166/anova1.R
IN CLASS
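A minimal sketch of this analysis in R, entering the alloy data from the table above (the referenced script may differ in details):

strength <- c(250, 264, 256, 260, 239,   # A
              263, 254, 267, 265, 267,   # B
              257, 279, 269, 273, 277,   # C
              253, 258, 262, 264, 273)   # D
procedure <- factor(rep(c("A", "B", "C", "D"), each = 5))
fit <- aov(strength ~ procedure)
summary(fit)   # F test of H0: all treatment effects are zero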


10.1.1 Post-hoc comparisons

If differences in group means are detected by the F-test, researchers then want to compare pairs of groups. Three popular methods include (from most conservative to least):
1. Bonferroni's Method: Adjusts individual comparison error rates so that all conclusions will be correct at the desired confidence/significance level. Any number of comparisons can be made. A very general approach that can be applied to any inferential problem.
2. Tukey's Method: Specifically compares all t(t − 1)/2 pairs of groups. Utilizes a special table (Table 11, p. 701).
3. Fisher's LSD: Upon rejecting the null hypothesis of no differences in group means, the LSD method is equivalent to doing pairwise comparisons among all pairs of groups as in Chapter 6.
Bonferroni procedure
This is the most general procedure, used when we wish to test C pairwise comparisons chosen a priori. When all pairs of treatments are to be compared, C = t(t − 1)/2. However, we shall see that the larger C is, the wider the intervals will be. The steps are:
1. Choose an overall α so that the overall confidence level is 100(1 − α)%.
2. Decide how many and which pairwise comparisons are to be made, i.e. C.
3. Construct each pairwise C.I. (or test) with α/C, i.e. confidence level 100(1 − α/C)%. The margin of error for comparing treatment i to j will be
$$t_{N-t,\, \alpha/(2C)} \sqrt{MSE\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}$$
Example 10.3. In our example we do not know beforehand which comparisons we wish to make, so let us perform all 4(3)/2 = 6 comparisons with an overall 95% confidence level. This implies that each pairwise comparison must be made at the confidence level
$$100\left(1 - \frac{0.05}{6}\right)\% = 99.1667\%,$$
so the critical value used should be
$$t_{16,\,(0.05/6)/2} = t_{16,\,0.004166667} = 3.008334$$
and the standard error (the same for every pair since n1 = ⋯ = n4 = 5) is
$$\sqrt{63.98\left(\frac{1}{5}+\frac{1}{5}\right)} = 5.059,$$
which we multiply together to create the margin of error of 15.21813.
Difference   Estimate      Lower      Upper
A-B              -9.4   -24.6181     5.8181
A-C             -17.2   -32.4181    -1.9819
A-D              -8.2   -23.4181     7.0181
B-C              -7.8   -23.0181     7.4181
B-D               1.2   -14.0181    16.4181
C-D               9.0    -6.2181    24.2181
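The critical value and margin of error above can be verified directly in R (using MSE = 63.98 and 16 error degrees of freedom from the ANOVA fit):

qt(1 - 0.05/(2*6), df = 16) * sqrt(63.98 * (1/5 + 1/5))   # 15.218, as above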

Tukey's procedure
This procedure is derived so that the probability that at least one false difference is detected is α (the experimentwise error rate). The margin of error is
$$q_{(t,\, N-t),\, \alpha} \sqrt{\frac{MSE}{n}}$$
where n is the common sample size for each treatment (which was 5 in the example). If the sample sizes are unequal, use
$$n = \frac{t}{\frac{1}{n_1} + \dots + \frac{1}{n_t}}$$

Fisher's LSD
Only apply this method after significance is confirmed through the F-test. For each pairwise comparison the margin of error is
$$t_{N-t,\, \alpha/2} \sqrt{MSE\left(\frac{1}{n_i}+\frac{1}{n_j}\right)}$$
http://www.stat.ufl.edu/~athienit/STA6166/anova1.R
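In R, assuming fit is the aov object from Example 10.2 above, Tukey and Bonferroni pairwise comparisons can be obtained as follows (a sketch, not the referenced script):

TukeyHSD(fit, conf.level = 0.95)   # all pairwise C.I.s via Tukey's procedure
pairwise.t.test(strength, procedure, p.adjust.method = "bonferroni")   # Bonferroni-adjusted p-values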


10.1.2 Nonparametric procedure

The Kruskal-Wallis test is an extension of the Wilcoxon rank-sum test to more than 2 groups. To test
H0: the t distributions (corresponding to the t treatments) are identical,
the steps are:
1. Rank the observations across groups from smallest (1) to largest (N), adjusting for ties.
2. Compute the sums of ranks for each group, T1, ..., Tt, and then compute
$$T.S. = \frac{12}{N(N+1)} \sum_{i=1}^{t} \frac{T_i^2}{n_i} - 3(N+1)$$
whose sampling distribution under H0 is χ²_{t−1}. So reject H0 if T.S. ≥ χ²_{t−1, α}, or if the p-value (the area to the right) is < α.
Remark 10.2. The textbook suggests an alteration to the test statistic when ties are present, but it is unnecessarily complicated. Hence, adjust for ties as we have always done.
Rank equivalent to Tukey's HSD
To compare groups i and j, the rank equivalent of Tukey's HSD declares the two groups different if
$$\left|\bar{T}_i - \bar{T}_j\right| \geq q_{(t,\,\infty),\, \alpha} \sqrt{\frac{N(N+1)}{24}\left(\frac{1}{n_i}+\frac{1}{n_j}\right)},$$
where $\bar{T}_i$ is the average of the ranks corresponding to group i.
Example 10.4. In an experiment, patients were administered three levels of glucose and insulin release levels were then measured.

Glucose conc.    Insulin release
Low              1.59  1.73  3.64  1.97
Medium           3.36  4.01  3.49  2.89
High             3.92  4.82  3.87  5.39

The corresponding ranks are

Glucose conc.    Ranks              T_i    Mean rank
Low               1    2    7    3    13        3.25
Medium            5   10    6    4    25        6.25
High              9   11    8   12    40       10.00

providing
$$T.S. = \frac{12}{12(13)}\left[\frac{13^2}{4} + \frac{25^2}{4} + \frac{40^2}{4}\right] - 3(13) = 7.0385$$
with a p-value of 0.02962 (using a χ²₂ distribution).

http://www.stat.ufl.edu/~athienit/STA6166/kruskal_wallis.R
IN CLASS
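A minimal sketch of this test in R, using the insulin data from the tables above:

insulin <- c(1.59, 1.73, 3.64, 1.97,   # Low
             3.36, 4.01, 3.49, 2.89,   # Medium
             3.92, 4.82, 3.87, 5.39)   # High
glucose <- factor(rep(c("Low", "Medium", "High"), each = 4))
kruskal.test(insulin ~ glucose)   # chi-squared = 7.0385, df = 2, p-value = 0.0296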

10.2 Randomized Block Design

Blocking is used to reduce variability so that treatment differences can be identified. Usually, experimental units constitute the blocks. In effect this is a 2-way ANOVA with the treatment being a fixed factor and the experimental unit being the random factor.
- A fixed factor is a predictor with a fixed number of levels.
- A random factor is a predictor with potentially infinitely many levels, of which only a certain number are observed in the experiment.
For example, consider a temperature predictor with levels 20°F, 30°F, 40°F. Is this fixed or random? It depends!
- If these are the only 3 settings on a machine, then fixed.
- If these were the 3 settings chosen in a greenhouse experiment, where any temperature could be used, then random.
The model is
$$Y_{ij} = \mu + \tau_i + \beta_j + \epsilon_{ij}, \qquad i = 1, \dots, t, \quad j = 1, \dots, b$$
with the ε_ij i.i.d. N(0, σ²) and, independently, β_j ~ N(0, σ_β²). Notice that the random factor has notation similar to the error term. If we were performing a 1-way ANOVA it would be hidden inside the error, but now we try to account for it and remove some noise from the model. We still have the same restriction on the τ's: that some τi = 0 (or that Σi τi = 0).
                    Block
Factor         1       2     ...       b
   1        y_11    y_12     ...    y_1b
   2        y_21    y_22     ...    y_2b
  ...        ...     ...     ...     ...
   t        y_t1    y_t2     ...    y_tb

This is an extension of the paired test to more than 2 populations: notice that we now have more than 2 observations on the same experimental unit.
The sum of squares decomposition is
$$SST = \underbrace{SSTrt + SSBlock}_{SSModel} + SSE$$
with
$$SST = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \bar{y})^2$$
$$SSTrt = \sum_{i=1}^{t} b(\bar{y}_{i+} - \bar{y})^2$$
$$SSBlock = \sum_{j=1}^{b} t(\bar{y}_{+j} - \bar{y})^2$$
$$SSE = \sum_{i=1}^{t}\sum_{j=1}^{b}(y_{ij} - \bar{y}_{i+} - \bar{y}_{+j} + \bar{y})^2$$
where ȳ_{i+} is the mean of treatment i, ȳ_{+j} is the mean of block j, and ȳ is the grand mean.
The ANOVA table is then

Source    SS        df            MS                  E(MS)                    F
Trt       SSTrt     t-1           SSTrt/(t-1)         σ² + b Σ τi²/(t-1)       MSTrt/MSE
Block     SSBlock   b-1           SSBlock/(b-1)       σ² + t σ_β²
Error     SSE       (b-1)(t-1)    SSE/[(b-1)(t-1)]
Total     SST       bt-1

In order to test whether there is a treatment effect, i.e. H0: τ1 = ⋯ = τt = 0, we notice that under H0, E(MSTrt) = E(MSE), and hence the test statistic is
$$T.S. = \frac{MSTrt}{MSE} \overset{H_0}{\sim} F_{t-1,\,(b-1)(t-1)}$$
with the p-value being the area to the right.


Remark 10.3. The block factor is used to reduce variability and we rarely care about its statistical significance, similar to the intercept term in regression. However, if we wished to test H0: β1 = ⋯ = βb = 0, equivalently σ_β² = 0, the test statistic would be similar:
$$T.S. = \frac{MSBlock}{MSE} \overset{H_0}{\sim} F_{b-1,\,(b-1)(t-1)}$$

Multiple comparison procedures are the same as in Section 10.1.1, with the exception that ni = b and the error degrees of freedom are (b − 1)(t − 1).
Definition 10.1. The Relative Efficiency of conducting an RBD as opposed to a CRD is the number of replicates that would be needed for each treatment/factor level in a CRD to obtain estimates of the differences between two treatment means as precise as those obtained by using b experimental units per treatment in the RBD:
$$RE(RBD, CRD) = \frac{(b-1)MSB + b(t-1)MSE}{(bt-1)MSE}$$

Example 10.5. http://www.stat.ufl.edu/~athienit/STA6166/RBD.pdf
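In R an RBD is fitted by adding the block as a second factor in the model. A minimal sketch, assuming a long-format data frame dat with columns y, trt and block (names are hypothetical):

fit <- aov(y ~ trt + block, data = dat)   # trt and block both stored as factors
summary(fit)   # the F test on the trt line uses MSE with (b-1)(t-1) error degrees of freedom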

10.2.1 Nonparametric procedure

Friedman's test works by ranking the measurements corresponding to the t treatments within each block. Each block has ranks 1 through t, and the ranks are then summed for each treatment across blocks.
Similar to the Kruskal-Wallis procedure, we only concern ourselves with SSTrt. In order to test the null hypothesis that all treatment populations are identical (and hence have the same location), use
$$T.S. = \frac{12b}{t(t+1)} \sum_{i=1}^{t}\left(\bar{R}_{i+} - \frac{t+1}{2}\right)^2 = \frac{12}{bt(t+1)} \sum_{i=1}^{t} R_{i+}^2 - 3b(t+1)$$
where R_{i+} is the sum of the ranks for treatment i and $\bar{R}_{i+} = R_{i+}/b$. Under the null, the sampling distribution of the test statistic is χ²_{t−1}. As usual, we reject the null if the p-value < α.

The follow-up multiple (pairwise) comparison at the α significance level declares treatments i and i′ different if
$$\left|R_{i+} - R_{i'+}\right| \geq z_{\alpha/[t(t-1)]} \sqrt{\frac{bt(t+1)}{6}}$$
Example 10.6. http://www.stat.ufl.edu/~athienit/STA6166/friedman.pdf
http://www.stat.ufl.edu/~athienit/STA6166/friedman.R
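A minimal sketch of Friedman's test in R, assuming the responses are arranged in a matrix y with one row per block and one column per treatment (the layout used in the referenced script may differ):

friedman.test(y)                              # blocks in rows, treatments in columns
# With long-format data: friedman.test(y ~ trt | block, data = dat)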

