
Hypothesis Testing

Statistical Method



Karl Phillip R. Alcarde MBA
University of Negros Occidental-Recoletos


Hypothesis Testing

Statistical Method Page 1







The method of hypothesis testing can be summarized in four steps.

1. To begin, we identify a hypothesis or claim that we feel should be tested. For example, we might want to
test the claim that the mean number of hours that children in the United States watch TV is 3 hours.

2. We select a criterion upon which we decide that the claim being tested is true or not. For example, the
claim is that children watch 3 hours of TV per week. Most samples we select should have a mean close to
or equal to 3 hours if the claim we are testing is true. So at what point do we decide that the discrepancy
between the sample mean and 3 is so big that the claim we are testing is likely not true? We answer this
question in this step of hypothesis testing.

3. Select a random sample from the population and measure the sample mean. For example, we could select
20 children and measure the mean time (in hours) that they watch TV per week.

4. Compare what we observe in the sample to what we expect to observe if the claim we are testing is true.
We expect the sample mean to be around 3 hours. If the discrepancy between the sample mean and
population mean is small, then we will likely decide that the claim we are testing is indeed true. If the
discrepancy is too large, then we will likely decide to reject the claim as being not true.


The goal of hypothesis testing is to determine the likelihood that a claim about a population parameter, such as the mean, is true. In this section, we describe the four steps of hypothesis testing that were briefly introduced in Section 8.1:

Step 1: State the hypotheses.

Step 2: Set the criteria for a decision.

Step 3: Compute the test statistic.

Step 4: Make a decision.

Step 1: State the hypotheses. We begin by stating the value of a population mean in a null hypothesis, which we
presume is true. For the children watching TV example, we state the null hypothesis that children in the United
States watch an average of 3 hours of TV per week. This is a starting point so that we can decide whether this is
likely to be true, similar to the presumption of innocence in a courtroom. When a defendant is on trial, the jury
starts by assuming that the defendant is innocent. The basis of the decision is to determine whether this
assumption is true. Likewise, in hypothesis testing, we start by assuming that the hypothesis or claim we are
testing is true. This is stated in the null hypothesis. The basis of the decision is to determine whether this
assumption is likely to be true.
DEFINITION: Hypothesis testing or significance testing is a method for testing a claim or hypothesis about a parameter in a population, using data measured in a sample. In this method, we test some hypothesis by determining the likelihood that a sample statistic could have been selected, if the hypothesis regarding the population parameter were true.

NOTE: Hypothesis testing is the method of testing whether claims or hypotheses regarding a population are likely to be true.




Keep in mind that the only reason we are testing the null hypothesis is because we think it is wrong. We state what
we think is wrong about the null hypothesis in an alternative hypothesis. For the children watching TV example,
we may have reason to believe that children watch more than (>) or less than (<) 3 hours of TV per week. When we
are uncertain of the direction, we can state that the value in the null hypothesis is not equal to (≠) 3 hours.

In a courtroom, since the defendant is assumed to be innocent (this is the null hypothesis so to speak), the
burden is on a prosecutor to conduct a trial to show evidence that the defendant is not innocent. In a similar way,
we assume the null hypothesis is true, placing the burden on the researcher to conduct a study to show evidence
that the null hypothesis is unlikely to be true. Regardless, we always make a decision about the null hypothesis
(that it is likely or unlikely to be true). The alternative hypothesis is needed for Step 2.



















MAKING SENSE: Testing the Null Hypothesis

A decision made in hypothesis testing centers on the null hypothesis. This means
two things in terms of making a decision:

1. Decisions are made about the null hypothesis. Using the courtroom analogy, a jury decides whether a defendant is guilty or not guilty. The jury does not make a decision of guilty or innocent because the defendant is assumed to be innocent. All evidence presented in a trial is to show that a defendant is guilty. The evidence either shows guilt (decision: guilty) or does not (decision: not guilty). In a similar way, the null hypothesis is assumed to be correct. A researcher conducts a study showing evidence that this assumption is unlikely (we reject the null hypothesis) or fails to do so (we retain the null hypothesis).
2. The bias is to do nothing. Using the courtroom analogy, for the same reason the courts would rather let the guilty go free than send the innocent to prison, researchers would rather do nothing (accept previous notions of truth stated by a null hypothesis) than make statements that are not correct. For this reason, we assume the null hypothesis is correct, thereby placing the burden on the researcher to demonstrate that the null hypothesis is not likely to be correct.


DEFINITION: The null hypothesis (H0), stated as the null, is a statement about a population parameter, such as the population mean, that is assumed to be true. The null hypothesis is a starting point. We will test whether the value stated in the null hypothesis is likely to be true.

NOTE: In hypothesis testing, we conduct a study to test whether the null hypothesis is likely to be true.

DEFINITION: An alternative hypothesis (H1) is a statement that directly contradicts a null hypothesis by stating that the actual value of a population parameter is less than, greater than, or not equal to the value stated in the null hypothesis. The alternative hypothesis states what we think is wrong about the null hypothesis, which is needed for Step 2.
Step 2: Set the criteria for a decision. To set the criteria for a decision, we state the level of significance for a test.
This is similar to the criterion that jurors use in a criminal trial. Jurors decide whether the evidence presented
shows guilt beyond a reasonable doubt (this is the criterion). Likewise, in hypothesis testing, we collect data to
show that the null hypothesis is not true, based on the likelihood of selecting a sample mean from a population
(the likelihood is the criterion). The likelihood or level of significance is typically set at 5% in behavioral research
studies. When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, then we
conclude that the sample we selected is too unlikely and so we reject the null hypothesis.








The alternative hypothesis establishes where to place the level of significance. Remember that we know that
the sample mean will equal the population mean on average if the null hypothesis is true. All other possible values
of the sample mean are normally distributed (central limit theorem). The empirical rule tells us that at least 95% of
all sample means fall within about 2 standard deviations (SD) of the population mean, meaning that there is less
than a 5% probability of obtaining a sample mean that is beyond 2 SD from the population mean. For the children
watching TV example, we can look for












the probability of obtaining a sample mean beyond 2 SD in the upper tail (greater than 3), the lower tail
(less than 3), or both tails (not equal to 3). Figure 8.2 shows that the alternative hypothesis is used to determine
which tail or tails to place the level of significance for a hypothesis test.
DEFINITION: Level of significance, or significance level, refers to a criterion of judgment upon which a decision is made regarding the value stated in a null hypothesis. The criterion is based on the probability of obtaining a statistic measured in a sample if the value stated in the null hypothesis were true. In behavioral science, the criterion or level of significance is typically set at 5%. When the probability of obtaining a sample mean is less than 5% if the null hypothesis were true, then we reject the value stated in the null hypothesis.

NOTE: The alternative hypothesis determines whether to place the level of significance in one or both tails of a sampling distribution. Sample means that fall in the tails are unlikely to occur (less than a 5% probability) if the value stated for a population mean in the null hypothesis is true.
Step 3: Compute the test statistic. Suppose we measure a sample mean equal to 4 hours per week that children
watch TV. To make a decision, we need to evaluate how likely this sample outcome is, if the population mean
stated by the null hypothesis (3 hours per week) is true. We use a test statistic to determine this likelihood.
Specifically, a test statistic tells us how far, or how many standard deviations, a sample mean is from the
population mean. The larger the value of the test statistic, the further the distance, or number of standard
deviations, a sample mean is from the population mean stated in the null hypothesis. The value of the test
statistic is used to make a decision in Step 4.
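As a concrete sketch of this step, a one-sample z statistic simply counts standard errors between the sample mean and the mean stated in the null hypothesis. The population SD (2 hours) and sample size (n = 20) below are hypothetical values for illustration, since the text does not give them:

```python
from math import sqrt

def z_statistic(sample_mean, pop_mean, pop_sd, n):
    """Number of standard errors the sample mean lies from the
    population mean stated in the null hypothesis."""
    standard_error = pop_sd / sqrt(n)
    return (sample_mean - pop_mean) / standard_error

# Hypothetical values: H0 says mu = 3 hours; we observed M = 4 hours
# in a sample of n = 20, with an assumed known population SD of 2 hours.
z = z_statistic(sample_mean=4, pop_mean=3, pop_sd=2, n=20)
print(round(z, 3))  # about 2.236 standard errors above the null mean
```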










Step 4: Make a decision. We use the value of the test statistic to make a decision about the null hypothesis. The
decision is based on the probability of obtaining a sample mean, given that the value stated in the null hypothesis
is true. If the probability of obtaining a sample mean is less than 5% when the null hypothesis is true, then the
decision is to reject the null hypothesis. If the probability of obtaining a sample mean is greater than 5% when the
null hypothesis is true, then the decision is to retain the null hypothesis. In sum, there are two decisions a
researcher can make:


1. Reject the null hypothesis. The sample mean is associated with a low probability of occurrence when the
null hypothesis is true.

2. Retain the null hypothesis. The sample mean is associated with a high probability of occurrence when the
null hypothesis is true.

The probability of obtaining a sample mean, given that the value stated in the null hypothesis is true, is stated
by the p value. The p value is a probability: It varies between 0 and 1 and can never be negative. In Step 2, we
stated the criterion or probability of obtaining a sample mean at which point we will decide to reject the value
stated in the null hypothesis, which is typically set at 5% in behavioral research. To make a decision, we compare
the p value to the criterion we set in Step 2.
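The compare-and-decide step can be sketched in a few lines. The two-tailed z test and the specific numbers (μ = 3, M = 4, σ = 2, n = 20) are illustrative assumptions, not values from the text:

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_p(sample_mean, pop_mean, pop_sd, n):
    """p value for a two-tailed one-sample z test."""
    z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))
    # probability of a result at least this extreme in either tail
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical numbers: H0: mu = 3, observed M = 4, assumed sigma = 2, n = 20
p = two_tailed_p(4, 3, 2, 20)
decision = "reject H0" if p < 0.05 else "retain H0"
print(round(p, 4), decision)
```

With these made-up numbers the two-tailed p value comes out near .025, below the .05 criterion, so the decision would be to reject the null hypothesis.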










When the p value is less than 5% (p < .05), we reject the null hypothesis. We will refer to p < .05 as the criterion for deciding to reject the null hypothesis, although note that when p = .05, the decision is also to reject the null hypothesis. When the p value is greater than 5% (p > .05), we retain the null hypothesis. The decision to reject or retain the null hypothesis is called significance. When the p value is less than .05, we reach significance; the decision is to reject the null hypothesis. When the p value is greater than .05, we fail to reach significance; the decision is to retain the null hypothesis. Figure 8.3 shows the four steps of hypothesis testing.

DEFINITION: The test statistic is a mathematical formula that allows researchers to determine the likelihood of obtaining sample outcomes if the null hypothesis were true. The value of the test statistic is used to make a decision regarding the null hypothesis.

NOTE: The level of significance in hypothesis testing is the criterion we use to decide whether the value stated in the null hypothesis is likely to be true.

NOTE: We use the value of the test statistic to make a decision regarding the null hypothesis.

DEFINITION: A p value is the probability of obtaining a sample outcome, given that the value stated in the null hypothesis is true. The p value for obtaining a sample outcome is compared to the level of significance.

DEFINITION: Significance, or statistical significance, describes a decision made concerning a value stated in the null hypothesis. When the null hypothesis is rejected, we reach significance. When the null hypothesis is retained, we fail to reach significance.















HYPOTHESIS TESTING AND SAMPLING DISTRIBUTIONS

The logic of hypothesis testing is rooted in an understanding of the sampling distribution of the mean. In Chapter
7, we showed three characteristics of the mean, two of which are particularly relevant in this section:

1) The sample mean is an unbiased estimator of the population mean. On
average, a randomly selected sample will have a mean equal to that in the
population. In hypothesis testing, we begin by stating the null hypothesis. We
expect that, if the null hypothesis is true, then a random sample selected from
a given population will have a sample mean equal to the value stated in the
null hypothesis.

2) Regardless of the distribution in the population, the sampling distribution of
the sample mean is normally distributed. Hence, the probabilities of all other
possible sample means we could select are normally distributed. Using this
distribution, we can therefore state an alternative hypothesis to locate the
probability of obtaining sample means with less than a 5% chance of being selected if the value stated in the null hypothesis is true. Figure 8.2 shows that we can identify sample mean outcomes in one or both tails.

Figure 8.3: A summary of hypothesis testing.

3) To locate the probability of obtaining a sample mean in a sampling
distribution, we must know (1) the population mean and (2) the standard
error of the mean (SEM; introduced in Chapter 7). Each value is entered in
the test statistic formula computed in Step 3, thereby allowing us to make a
decision in Step 4. To review, Table 8.1 displays the notations used to
describe populations, samples, and sampling distributions. Table 8.2
summarizes the characteristics of each type of distribution.

TABLE 1: A review of the notation used for the mean, variance, and
standard deviation in population, sample, and sampling distributions.

Characteristic       Population   Sample       Sampling Distribution
Mean                 μ            M or X̄       μ_M = μ
Variance             σ²           s² or SD²    σ²_M = σ²/n
Standard deviation   σ            s or SD      σ_M = σ/√n

TABLE 2: A review of the key differences between population, sample, and sampling distributions.

                     Population Distribution      Sample Distribution               Distribution of Sample Means
What is it?          Scores of all persons        Scores of a select portion        All possible sample means that
                     in a population              of persons from the population    can be drawn, given a certain
                                                                                    sample size
Is it accessible?    Typically, no                Yes                               Yes
What is the shape?   Could be any shape           Could be any shape                Normally distributed


MAKING A DECISION: TYPES OF ERROR

In Step 4, we decide whether to retain or reject the null hypothesis. Because we are observing a
sample and not an entire population, it is possible that a conclusion may be wrong. Table 8.3
shows that there are four decision alternatives regarding the truth and falsity of the decision we
make about a null hypothesis:

1. The decision to retain the null hypothesis could be correct.

2. The decision to retain the null hypothesis could be incorrect.

3. The decision to reject the null hypothesis could be correct.

4. The decision to reject the null hypothesis could be incorrect.


TABLE 3: Four outcomes for making a decision. The decision can be either correct (correctly reject or
retain null) or wrong (incorrectly reject or retain null).

                                       Decision
                         Retain the null          Reject the null
Truth in the    True     CORRECT (1 − α)          TYPE I ERROR (α)
population      False    TYPE II ERROR (β)        CORRECT (POWER; 1 − β)

We investigate each decision alternative in this section. Since we will observe a
sample, and not a population, it is impossible to know for sure the truth in the
population. So for the sake of illustration, we will assume we know this. This
assumption is labeled as truth in the population in Table 8.3. In this section, we will
introduce each decision alternative.


DECISION: RETAIN THE NULL HYPOTHESIS

When we decide to retain the null hypothesis, we can be correct or incorrect. The correct decision is to retain a true null hypothesis. This decision is called a null result or null finding. This is usually an uninteresting decision because the decision is to retain what we already assumed: that the value stated in the null hypothesis is correct. For this reason, null results alone are rarely published in behavioral research.

The incorrect decision is to retain a false null hypothesis. This decision is an example of a Type II error, or beta (β) error. With each test we make, there is always some probability that the decision could be a Type II error. In this decision, we decide to retain previous notions of truth that are in fact false. While it is an error, we still did nothing; we retained the null hypothesis. We can always go back and conduct more studies.

DEFINITION: Type II error, or beta (β) error, is the probability of retaining a null hypothesis that is actually false.
DECISION: REJECT THE NULL HYPOTHESIS

When we decide to reject the null hypothesis, we can be correct or incorrect. The incorrect decision is to reject a true null hypothesis. This decision is an example of a Type I error. With each test we make, there is always some probability that our decision is a Type I error. A researcher who makes this error decides to reject previous notions of truth that are in fact true. Making this type of error is analogous to finding an innocent person guilty. To minimize this error, we assume a defendant is innocent when beginning a trial. Similarly, to minimize making a Type I error, we assume the null hypothesis is true when beginning a hypothesis test.


DEFINITION: Type I error is the probability of rejecting a null hypothesis that is actually true. Researchers directly control for the probability of committing this type of error.

DEFINITION: An alpha (α) level is the level of significance or criterion for a hypothesis test. It is the largest probability of committing a Type I error that we will allow and still decide to reject the null hypothesis.

Since we assume the null hypothesis is true, we control for Type I error by stating a level of significance. The level we set, called the alpha level (symbolized as α), is the largest probability of committing a Type I error that we will allow and still decide to reject the null hypothesis. This criterion is usually set at .05 (α = .05), and we compare the alpha level to the p value. When the probability of a Type I error is less than 5% (p < .05), we decide to reject the null hypothesis; otherwise, we retain the null hypothesis.
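One way to see what the alpha level controls: if the null hypothesis really is true, a test run at α = .05 should (falsely) reject about 5% of the time. The simulation below is an illustrative sketch; the population values (μ0 = 3, σ = 2, n = 20) and the seed are made up for the demonstration:

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

random.seed(1)  # reproducible illustration only

mu0, sigma, n, alpha = 3.0, 2.0, 20, 0.05
trials = 10_000
rejections = 0
for _ in range(trials):
    # draw a sample from a population where H0 is actually TRUE
    sample = [random.gauss(mu0, sigma) for _ in range(n)]
    z = (fmean(sample) - mu0) / (sigma / sqrt(n))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    if p < alpha:
        rejections += 1  # this rejection is a Type I error

print(rejections / trials)  # should be close to alpha = 0.05
```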


The correct decision is to reject a false null hypothesis. There is always some probability that we decide that the null
hypothesis is false when it is indeed false. This decision is called the power of the decision-making process. It is called power
because it is the decision we aim for. Remember that we are only testing the null hypothesis because we think it is wrong.
Deciding to reject a false null hypothesis, then, is the power, inasmuch as we learn the most about populations when we
accurately reject false notions of truth. This decision is the most published result in behavioral research.


DEFINITION: The power in hypothesis testing is the probability of rejecting a false null hypothesis. Specifically, it is the probability that a randomly selected sample will show that the null hypothesis is false when the null hypothesis is indeed false.

NOTE: A Type II error, or beta (β) error, is the probability of incorrectly retaining the null hypothesis.

NOTE: Researchers directly control for the probability of a Type I error by stating an alpha (α) level.

NOTE: The power in hypothesis testing is the probability of correctly rejecting the value stated in the null hypothesis.
Hypothesis Testing Using z- and t-tests


In hypothesis testing, one attempts to answer the following question: If the null hypothesis is assumed to be true, what is the probability of obtaining the observed result, or any more extreme result that is favourable to the alternative hypothesis?¹ In order to tackle this question, at least in the context of z- and t-tests, one must first understand two important concepts: 1) sampling distributions of statistics, and 2) the central limit theorem.

Sampling Distributions

Imagine drawing (with replacement) all possible samples of size n from a population, and for each sample, calculating a statistic, e.g., the sample mean. The frequency distribution of those sample means would be the sampling distribution of the mean (for samples of size n drawn from that particular population).

Normally, one thinks of sampling from relatively large populations. But the concept of a sampling distribution can
be illustrated with a small population. Suppose, for example, that our population consisted of the following 5
scores: 2, 3, 4, 5, and 6. The population mean = 4, and the population standard deviation (dividing by N) = 1.414.

If we drew (with replacement) all possible samples of 2 from this population, we would end up with the 25
samples shown in Table 1.


The 25 sample means from Table 1 are plotted below in Figure 1 (a histogram). This distribution of sample means is called the sampling distribution of the mean for samples of n=2 from the population of interest (i.e., our population of 5 scores).
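Table 1 itself is not reproduced here, but its contents are easy to regenerate. This sketch enumerates all 25 samples of n = 2 (with replacement) and confirms the summary values the text reports (mean of the sample means = 4, SD of the sample means = 1.000):

```python
from itertools import product
from statistics import fmean, pstdev

population = [2, 3, 4, 5, 6]

# all 25 possible samples of size n = 2, drawn with replacement
samples = list(product(population, repeat=2))
means = [fmean(s) for s in samples]

print(len(samples))             # 25 samples
print(fmean(means))             # mean of the sample means = 4.0
print(round(pstdev(means), 3))  # SD of the sample means = 1.0
```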




¹ That probability is called a p-value. It is really a conditional probability: it is conditional on the null hypothesis being true.

[Figure: histogram titled "Distribution of Sample Means" (or "Sampling Distribution of the Mean"); x-axis: MEANS, from 2.00 to 6.00; Std. Dev = 1.02, Mean = 4.00, N = 25.]


Figure 1: Sampling distribution of the mean for samples of n=2 from a population of N=5.

I suspect the first thing you noticed about this figure is that it is peaked in the middle and symmetrical about the mean. This is an important characteristic of sampling distributions, and we will return to it in a moment.

You may have also noticed that the standard deviation reported in the figure legend is 1.02, whereas I reported
SD = 1.000 in Table 1. Why the discrepancy? Because I used the population SD formula (with division by N) to
compute SD = 1.000 in Table 1, but SPSS used the sample SD formula (with division by n-1) when computing the
SD it plotted alongside the histogram. The population SD is the correct one to use in this case, because I have the
entire population of 25 samples in hand.
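That discrepancy is easy to reproduce: apply the population formula (divide by N) and the sample formula (divide by n−1) to the same 25 sample means. A minimal sketch:

```python
from itertools import product
from statistics import fmean, pstdev, stdev

# regenerate the 25 sample means from the population {2, 3, 4, 5, 6}
means = [fmean(s) for s in product([2, 3, 4, 5, 6], repeat=2)]

print(round(pstdev(means), 3))  # population formula (divide by N): 1.0
print(round(stdev(means), 2))   # sample formula (divide by n-1): 1.02
```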

The Central Limit Theorem (CLT)

If I were a mathematical statistician, I would now proceed to work through derivations, proving the following
statements:
1. The mean of the sampling distribution of the mean = the population mean

2. The SD of the sampling distribution of the mean = the standard error (SE) of the mean = the population
standard deviation divided by the square root of the sample size

Putting these statements into symbols:



μ_X̄ = μ        { mean of the sample means = the population mean }    (1.1)

σ_X̄ = σ / √n    { SE of the mean = population SD over the square root of n }    (1.2)


But alas, I am not a mathematical statistician. Therefore, I will content myself with telling you that these statements are true (those of you who do not trust me, or are simply curious, may consult a mathematical stats textbook), and pointing to the example we started this chapter with. For that population of 5 scores, μ = 4 and σ = 1.414. As shown in Table 1, μ_X̄ = μ = 4, and σ_X̄ = 1.000. According to equation (1.2), if we divide the population SD by the square root of the sample size, we should obtain the standard error of the mean. So let's give it a try:

σ / √n = 1.414 / √2 = 0.9998 ≈ 1    (1.3)

When I performed the calculation in Excel and did not round off to 3 decimals, the solution worked out to 1 exactly. In the Excel worksheet that demonstrates this, you may also change the values of the 5 population scores, and should observe that σ_X̄ = σ/√n for any set of 5 scores you choose. Of course, these demonstrations do not prove the CLT (see the aforementioned math-stats books if you want proof), but they should reassure you that it does indeed work.

What the CLT tells us about the shape of the sampling distribution

The central limit theorem also provides us with some very helpful information about the shape of the sampling
distribution of the mean. Specifically, it tells us the conditions under which the sampling distribution of the mean is
normally distributed, or at least approximately normal, where approximately means close enough to treat as
normal for practical purposes.

The shape of the sampling distribution depends on two factors: the shape of the population from which you
sampled, and sample size. I find it useful to think about the two extremes:

1. If the population from which you sample is itself normally distributed, then the sampling distribution of
the mean will be normal, regardless of sample size. Even for sample size = 1, the sampling distribution
of the mean will be normal, because it will be an exact copy of the population distribution.
2. If the population from which you sample is extremely non-normal, the sampling distribution of the mean will still be approximately normal given a large enough sample size (e.g., some authors suggest sample sizes of 300 or greater).

So, the general principle is that the more the population shape departs from normal, the greater the sample size must be to ensure that the sampling distribution of the mean is approximately normal. This tradeoff is illustrated in the following figure, which uses colour to represent the shape of the sampling distribution (purple = non-normal, red = normal, with the other colours representing points in between).
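The same tradeoff can be seen by simulation. The sketch below uses an arbitrary, strongly skewed population (an exponential distribution with mean 1, chosen purely for illustration) and shows that means of samples of n = 50 still center on the population mean with SE ≈ σ/√n, as the CLT predicts:

```python
import random
from statistics import fmean, pstdev

random.seed(42)  # reproducible illustration only

# a strongly skewed population: exponential with mean 1.0 (and SD 1.0)
def draw_sample_mean(n):
    return fmean(random.expovariate(1.0) for _ in range(n))

sample_means = [draw_sample_mean(50) for _ in range(10_000)]

# CLT: the sampling distribution centers on the population mean (1.0)
# with SE = sigma / sqrt(n) = 1 / sqrt(50), about 0.141
print(round(fmean(sample_means), 2))
print(round(pstdev(sample_means), 2))
```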







What is the rule of 30 about then?

In the olden days, textbook authors often did make a distinction between small-sample and large-sample versions of t-tests. The small- and large-sample versions did not differ at all in terms of how t was calculated. Rather, they differed in how/where one obtained the critical value to which one compared the computed t-value. For the small-sample test, one used the critical value of t, from a table of critical t-values. For the large-sample test, one used the critical value of z, obtained from a table of the standard normal distribution. The dividing line between small and large samples was usually n = 30 (or sometimes 20).

Why was this done? Remember that in that era, data analysts did not have access to desktop computers and statistics packages that computed exact p-values. Therefore, they had to compute the test statistic, and compare it to the critical value, which they looked up in a table. Tables of critical values can take up a lot of room. So when possible, compromises were made. In this particular case, most authors and statisticians agreed that for n ≥ 30, the critical value of z (from the standard normal distribution) was close enough to the critical value of t that it could be used as an approximation. The following figure illustrates this by plotting critical values of t with alpha = .05 (2-tailed) as a function of sample size. Notice that when n ≥ 30 (or even 20), the critical values of t are very close to 1.96, the critical value of z.
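This convergence is easy to check numerically. The sketch below computes two-tailed critical t values (α = .05) using only the standard library, by numerically integrating the t density and bisecting for the .975 quantile; this is just one way to reproduce the values plotted in the figure, not a claim about how the figure was made:

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    # log-gamma avoids overflow for large df
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_critical(df, steps=2000):
    """Two-tailed critical value at alpha = .05: the t with CDF(t) = .975,
    found by bisection on a trapezoid-rule integral of the density."""
    def area_from_zero(t):
        h = t / steps
        s = 0.5 * (t_pdf(0, df) + t_pdf(t, df))
        s += sum(t_pdf(i * h, df) for i in range(1, steps))
        return s * h
    lo, hi = 0.0, 50.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if area_from_zero(mid) < 0.475:  # CDF(t) = 0.5 + area
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for df in (10, 29, 100, 1000):
    print(df, round(t_critical(df), 3))  # shrinks toward 1.96 as df grows
```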





Nowadays, we typically use statistical software to perform t-tests, and so we get a p-value computed using the appropriate t-distribution, regardless of the sample size. Therefore the distinction between small- and large-sample t-tests is no longer relevant, and has disappeared from most modern textbooks.

The sampling distribution of the mean and z-scores

When you first encountered z-scores, you were undoubtedly using them in the context of a raw score distribution. In that case, you calculated the z-score corresponding to some value of X as follows:

z = (X − μ_X) / σ_X    (1.4)

And, if the distribution of X was normal, or at least approximately normal, you could then take that z-score, and refer it to a table of the standard normal distribution to figure out the proportion of scores higher than X, or lower than X, etc.

Because of what we learned from the central limit theorem, we are now in a position to compute a z-score as follows:

z = (X̄ − μ_X̄) / σ_X̄ = (X̄ − μ) / (σ / √n)    (1.5)

This is the same formula, but with X̄ in place of X, and σ_X̄ in place of σ_X. And, if the sampling distribution of X̄ is normal, or at least approximately normal, we may then refer this value of z to the standard normal distribution, just as we did when we were using raw scores. (This is where the CLT comes in, because it tells us the conditions under which the sampling distribution of X̄ is approximately normal.)

An example. Here is a (fictitious) newspaper advertisement for a program designed to increase intelligence of school children:

As an expert on IQ, you know that in the general population of children, the mean IQ = 100, and the population
SD = 15 (for the WISC, at least). You also know that IQ is (approximately) normally distributed in the population.
Equipped with this information, you can now address questions such as:

If the n=25 children from Dundas are a random sample from the general population of children,

a) What is the probability of getting a sample mean of 108 or higher?
b) What is the probability of getting a sample mean of 92 or lower?
c) How high would the sample mean have to be for you to say that the probability of getting a mean that
high (or higher) was 0.05 (or 5%)?
d) How low would the sample mean have to be for you to say that the probability of getting a mean that
low (or lower) was 0.05 (or 5%)?
The solutions to these questions are quite straightforward, given everything we have learned so far in this
chapter. If we have sampled from the general population of children, as we are assuming, then the
population from which we have sampled is at least approximately normal. Therefore, the sampling
distribution of the mean will be normal, regardless of sample size. Therefore, we can compute a z-score, and
refer it to the table of the standard normal distribution.

So, for part (a) above:

z = (X̄ − μ_X̄) / σ_X̄ = (108 − 100) / (15 / √25) = 8/3 = 2.667    (1.6)

And from a table of the standard normal distribution (or using a computer program, as I did), we can see that
the probability of a z-score greater than or equal to 2.667 = 0.0038. Translating that back to the original units,
we could say that the probability of getting a sample mean of 108 (or greater) is .0038 (assuming that the 25
children are a random sample from the general population).
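The same answer can be verified in a couple of lines; here Python's NormalDist plays the role of the printed z table:

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n, xbar = 100, 15, 25, 108

z = (xbar - mu) / (sigma / sqrt(n))  # (108 - 100) / 3
p = 1 - NormalDist().cdf(z)          # P(sample mean >= 108)

print(round(z, 3))  # 2.667
print(round(p, 4))  # 0.0038
```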

For part (b), do the same, but replace 108 with 92:

z = (X̄ − μ_X̄) / σ_X̄ = (92 − 100) / (15 / √25) = −8/3 = −2.667    (1.7)

Because the standard normal distribution is symmetrical about 0, the probability of a z-score equal to or less
than -2.667 is the same as the probability of a z-score equal to or greater than 2.667. So, the probability of a
sample mean less than or equal to 92 is also equal to 0.0038. Had we asked for the probability of a sample
mean that is either 108 or greater, or 92 or less, the answer would be 0.0038 + 0.0038 = 0.0076.

Part (c) above amounts to the same thing as asking, "What sample mean corresponds to a z-score of 1.645?", because we know that P(z ≥ 1.645) = 0.05. We can start out with the usual z-score formula, but need to rearrange the terms a bit, because we know that z = 1.645, and are trying to determine the corresponding value of X̄.


z = (X̄ - μ) / σ_X̄    { cross-multiply to get to next line }

z·σ_X̄ = X̄ - μ    { add μ to both sides }

z·σ_X̄ + μ = X̄    { switch sides }

X̄ = z·σ_X̄ + μ = 1.645(15/√25) + 100 = 104.935    (1.8)
So, had we obtained a sample mean of 105, we could have concluded that the probability of a mean that high or higher was .05 (or 5%).

For part (d), because of the symmetry of the standard normal distribution about 0, we would use the same method, but substituting -1.645 for 1.645. This would yield an answer of 100 - 4.935 = 95.065. So the probability of a sample mean of 95.065 or lower is also 5%.
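Parts (c) and (d) invert the calculation, and can be sketched the same way with the inverse-normal function (again an illustrative addition, assuming μ = 100 and a standard error of 3):

```python
from scipy.stats import norm

mu, se = 100, 3                      # population mean and standard error (15/sqrt(25))

upper = norm.ppf(0.95) * se + mu     # mean with only 5% of sample means above it
lower = norm.ppf(0.05) * se + mu     # mean with only 5% of sample means below it

print(round(upper, 3), round(lower, 3))  # 104.935 95.065
```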


The single sample z-test

It is now time to translate what we have just been doing into the formal terminology of hypothesis testing. In
hypothesis testing, one has two hypotheses: The null hypothesis, and the alternative hypothesis. These two
hypotheses are mutually exclusive and exhaustive. In other words, they cannot share any outcomes in common,
but together must account for all possible outcomes.

In informal terms, the null hypothesis typically states something along the lines of, "there is no treatment
effect", or "there is no difference between the groups".

The alternative hypothesis typically states that there is a treatment effect, or that there is a difference between
the groups. Furthermore, an alternative hypothesis may be directional or non-directional. That is, it may or
may not specify the direction of the difference between the groups.

H0 and H1 are the symbols used to represent the null and alternative hypotheses respectively. (Some books may use Ha for the alternative hypothesis.) Let us consider various pairs of null and alternative hypotheses for the IQ example we have been considering.

Version 1: A directional alternative hypothesis

H0: μ ≤ 100

H1: μ > 100

This pair of hypotheses can be summarized as follows. If the alternative hypothesis is true, the sample of 25 children we have drawn is from a population with mean IQ greater than 100. But if the null hypothesis is true, the sample is from a population with mean IQ equal to or less than 100. Thus, we would only be in a position to reject the null hypothesis if the sample mean is greater than 100 by a sufficient amount. If the sample mean is less than 100, no matter by how much, we would not be able to reject H0.

How much greater than 100 must the sample mean be for us to be comfortable in rejecting the null hypothesis? There is no unequivocal answer to that question. But the answer that most disciplines use by convention is this: The difference between X̄ and μ must be large enough that the probability it occurred by chance (given a true null hypothesis) is 5% or less.

The observed sample mean for this example was 108. As we saw earlier, this corresponds to a z-score of 2.667, and P(z ≥ 2.667) = 0.0038. Therefore, we could reject H0, and we would act as if the sample was drawn from a population in which mean IQ is greater than 100.

Version 2: Another directional alternative hypothesis

H0: μ ≥ 100

H1: μ < 100

This pair of hypotheses would be used if we expected Dr. Duntz's program to lower IQ, and if we were willing to include an increase in IQ (no matter how large) under the null hypothesis. Given a sample mean of 108, we could stop without calculating z, because the difference is in the wrong direction. That is, to have any hope of rejecting H0, the observed difference must be in the direction specified by H1.

Version 3: A non-directional alternative hypothesis

H0: μ = 100

H1: μ ≠ 100

In this case, the null hypothesis states that the 25 children are a random sample from a population with mean IQ = 100, and the alternative hypothesis says they are not--but it does not specify the direction of the difference from 100. In the first directional test, we needed to have X̄ > 100 by a sufficient amount, and in the second directional test, X̄ < 100 by a sufficient amount, in order to reject H0. But in this case, with a non-directional alternative hypothesis, we may reject H0 if X̄ < 100 or if X̄ > 100, provided the difference is large enough.

Because differences in either direction can lead to rejection of H0, we must consider both tails of the standard normal distribution when calculating the p-value--i.e., the probability of the observed outcome, or a more extreme outcome favourable to H1. For symmetrical distributions like the standard normal, this boils down to taking the p-value for a directional (or 1-tailed) test, and doubling it.

For this example, the sample mean = 108. This represents a difference of +8 from the population mean (under a true null hypothesis). Because we are interested in both tails of the distribution, we must figure out the probability of a difference of +8 or greater, or a difference of -8 or lower. In other words, p = P(X̄ ≥ 108) + P(X̄ ≤ 92) = .0038 + .0038 = .0076.
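Putting the directional and non-directional cases together, a small helper function (illustrative only; the function name and structure are my own, not from the notes) might look like:

```python
from scipy.stats import norm

def z_test(xbar, mu, sigma, n):
    """One-sample z-test: returns z plus one- and two-tailed p-values."""
    z = (xbar - mu) / (sigma / n ** 0.5)
    p_one = norm.sf(abs(z))          # directional test, difference in the predicted direction
    return z, p_one, 2 * p_one       # non-directional p doubles the one-tailed value

z, p1, p2 = z_test(108, 100, 15, 25)
# 2.667 0.0038 0.0077 -- doubling before rounding gives .0077;
# the text's .0076 comes from doubling the already-rounded .0038
print(round(z, 3), round(p1, 4), round(p2, 4))
```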

Single sample t-test (when is not known)

In many real-world cases of hypothesis testing, one does not know the standard deviation of the population. In such cases, it must be estimated using the sample standard deviation. That is, s (calculated with division by n-1) is used to estimate σ. Other than that, the calculations are as we saw for the z-test for a single sample--but the test statistic is called t, not z.





t_(df = n-1) = (X̄ - μ) / s_X̄    (1.9)

where s_X̄ = s/√n, and s = √( SS / (n - 1) ) = √( Σ(X - X̄)² / (n - 1) )

In equation (1.9), notice the subscript written by the t. It says "df = n-1". The "df" stands for degrees of
freedom. "Degrees of freedom" can be a bit tricky to grasp, but let's see if we can make it clear.

Degrees of Freedom

Suppose I tell you that I have a sample of n=4 scores, and that the first three scores are 2, 3, and 5. What is the value of the 4th score? You can't tell me, given only that n = 4. It could be anything. In other words, all of the scores, including the last one, are free to vary: df = n for a sample mean.

To calculate t, you must first calculate the sample standard deviation. The conceptual formula for the sample
standard deviation is:




s = √( Σ(X - X̄)² / (n - 1) )    (1.10)


Suppose that the last score in my sample of 4 scores is a 6. That would make the sample mean equal to
(2+3+5+6)/4 = 4. As shown in Table 2, the deviation scores for the first 3 scores are -2, -1, and 1.

Table 2: Illustration of degrees of freedom for sample standard deviation

Score   Mean   Deviation from Mean
2       4      -2
3       4      -1
5       4       1
--      --      x4

Using only the information shown in the final column of Table 2, you can deduce that x4, the 4th deviation score, is equal to +2. How so? Because by definition, the sum of the deviations about the mean = 0. This is another way of saying that the mean is the exact balancing point of the distribution. In symbols:

Σ(X - X̄) = 0    (1.11)

So, once you have n-1 of the (X - X̄) deviation scores, the final deviation score is determined.

That is, the first n-1 deviation scores are free to vary, but the final one is not. There are n-1 degrees of freedom whenever you calculate a sample variance (or standard deviation).
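This bookkeeping is easy to see numerically. A minimal sketch (an addition for illustration, using the sample 2, 3, 5, 6 from the example above):

```python
scores = [2, 3, 5, 6]
mean = sum(scores) / len(scores)            # 4.0
deviations = [x - mean for x in scores]     # [-2.0, -1.0, 1.0, 2.0]

# The deviations always sum to zero, so once the first n-1 are known
# the last is fixed: hence df = n - 1 for the sample variance.
print(deviations, sum(deviations))          # [-2.0, -1.0, 1.0, 2.0] 0.0
```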

The sampling distribution of t

To calculate the p-value for a single sample z-test, we used the standard normal distribution. For a single sample t-test, we must use a t-distribution with n-1 degrees of freedom. As this implies, there is a whole family of t-distributions, with degrees of freedom ranging from 1 to infinity (∞ = the symbol for infinity). All t-distributions are symmetrical about 0, like the standard normal. In fact, the t-distribution with df = ∞ is identical to the standard normal distribution.

But as shown in Figure 2 below, t-distributions with df < ∞ have lower peaks and thicker tails than the standard normal distribution. To use the technical term for this, they are leptokurtic. (The normal distribution is said to be mesokurtic.) As a result, the critical values of t are further from 0 than the corresponding critical values of z. Putting it another way, the absolute value of critical t is greater than the absolute value of critical z for all t-distributions with df < ∞:






Figure 2: Probability density functions of: the standard normal distribution (the highest peak with the thinnest tails); the t-distribution with df=10 (intermediate peak and tails); and the t-distribution with df=2 (the lowest peak and thickest tails). The dotted lines are at -1.96 and +1.96, the critical values of z for a two-tailed test with alpha = .05. For all t-distributions with df < ∞, the proportion of area beyond -1.96 and +1.96 is greater than .05. The lower the degrees of freedom, the thicker the tails, and the greater the proportion of area beyond -1.96 and +1.96.

Table 3 (see below) shows yet another way to think about the relationship between the standard normal distribution and various t-distributions. It shows the area in the two tails beyond -1.96 and +1.96, the critical values of z with 2-tailed alpha = .05. With df=1, roughly 15% of the area falls in each tail of the t-distribution. As df gets larger, the tail areas get smaller and smaller, until the t-distribution converges on the standard normal when df = infinity.

Table 3: Area beyond critical values of ±1.96 in various t-distributions. The t-distribution with df = infinity is identical to the standard normal distribution.

T-Test

         N    Mean    Std. Deviation   Std. Error Mean
HEIGHT   8    64.25   2.550            .901

One-Sample Test (Test Value = 63)

         t       df   Sig. (2-tailed)   Mean Difference   95% CI Lower   95% CI Upper
HEIGHT   1.387   7    .208              1.25              -.88           3.38
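The SPSS output above can be reproduced from its own summary statistics. A sketch using scipy (not part of the original notes; tiny differences arise because the printed SD is rounded):

```python
from scipy.stats import t

# Summary statistics taken from the SPSS output above
n, mean, sd, test_value = 8, 64.25, 2.550, 63

se = sd / n ** 0.5                       # standard error of the mean, ~0.901
t_stat = (mean - test_value) / se        # t = mean difference / SE
p_two = 2 * t.sf(abs(t_stat), n - 1)     # Sig. (2-tailed), df = 7

# t ~1.386 (SPSS shows 1.387 from unrounded values), p ~.208
print(round(t_stat, 3), round(p_two, 3))
```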



Paired (or related samples) t-test

Another common application of the t-test occurs when you have either 2 scores for each person (e.g., before
and after), or when you have matched pairs of scores (e.g., husband and wife pairs, or twin pairs). The paired t-
test may be used in this case, given that its assumptions are met adequately. (More on the assumptions of the
various t-tests later).

Quite simply, the paired t-test is just a single-sample t-test performed on the difference scores. That is, for each matched pair, compute a difference score. Whether you subtract score 1 from score 2 or vice versa does not matter, so long as you do it the same way for each pair. Then perform a single-sample t-test on those differences.

The null hypothesis for this test is that the difference scores are a random sample from a population in which the
mean difference has some value which you specify. Often, that value is zero--but it need not be.

For example, suppose you found some old research which reported that on average, husbands were 5 inches
taller than their wives. If you wished to test the null hypothesis that the difference is still 5 inches today (despite
the overall increase in height), your null hypothesis would state that your sample of difference scores (from
husband/wife pairs) is a random sample from a population in which the mean difference = 5 inches.

In the equations for the paired t-test, X̄ is often replaced with D̄, which stands for the mean difference.

t = (D̄ - μ_D) / s_D̄    (1.16)

where s_D̄ = s_D / √n

D̄ = the (sample) mean of the difference scores
μ_D = the mean difference in the population, given a true H0 {often μ_D = 0, but not always}
s_D = the sample SD of the difference scores (with division by n-1)
n = the number of matched pairs; the number of individuals = 2n
s_D̄ = the SE of the mean difference

df = n - 1
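A sketch of the paired test in Python (the husband/wife heights below are invented purely for illustration; scipy's ttest_1samp on the differences, with popmean=5, mirrors the null hypothesis described above):

```python
from scipy.stats import ttest_1samp

# Hypothetical husband/wife height pairs (inches); data invented for illustration
husbands = [70, 72, 68, 71, 69, 74]
wives    = [64, 66, 65, 66, 63, 68]

diffs = [h - w for h, w in zip(husbands, wives)]  # difference scores

# Paired t-test = single-sample t-test on the differences;
# popmean=5 tests H0: the mean husband-wife difference is still 5 inches.
result = ttest_1samp(diffs, popmean=5)

# t is small and p is well above .05: no evidence the gap differs from 5 inches
print(round(result.statistic, 3), round(result.pvalue, 3))
```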


Unpaired (or independent samples) t-test

Another common form of the t-test may be used if you have 2 independent samples (or groups). The formula for
this version of the test is given in equation (1.17).


t = [ (X̄1 - X̄2) - (μ1 - μ2) ] / s_(X̄1 - X̄2)    (1.17)

The left side of the numerator, (X̄1 - X̄2), is the difference between the means of two (independent) samples, or the difference between group means. The right side of the numerator, (μ1 - μ2), is the difference between the corresponding population means, assuming that H0 is true. The denominator is the standard error of the difference between two independent means. It is calculated as follows:
s_(X̄1 - X̄2) = √( s²_pooled/n1 + s²_pooled/n2 )    (1.18)

where s²_pooled = pooled variance estimate = (SS1 + SS2) / (n1 + n2 - 2) = SS_WithinGroups / df_WithinGroups

SS1 = Σ(X - X̄1)² for Group 1
SS2 = Σ(X - X̄2)² for Group 2
n1 = sample size for Group 1
n2 = sample size for Group 2
df = n1 + n2 - 2
As indicated above, the null hypothesis for this test specifies a value for (μ1 - μ2), the difference between the population means. More often than not, H0 specifies that (μ1 - μ2) = 0. For that reason, most textbooks omit (μ1 - μ2) from the numerator of the formula, and show it like this:

t_unpaired = (X̄1 - X̄2) / √( [ (SS1 + SS2) / (n1 + n2 - 2) ] × [ 1/n1 + 1/n2 ] )    (1.19)

I prefer to include (μ1 - μ2) for two reasons. First, it reminds me that the null hypothesis can specify a non-zero difference between the population means. Second, it reminds me that all t-tests have a common format, which I will describe in a section to follow.

The unpaired (or independent samples) t-test has df = n1 + n2 - 2. As discussed under the single-sample t-test, one degree of freedom is lost whenever you calculate a sum of squares (SS). To perform an unpaired t-test, we must first calculate both SS1 and SS2, so two degrees of freedom are lost.
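Equation (1.19) translates directly into code and can be checked against scipy's pooled-variance t-test (the two small samples below are illustrative; the helper function is my own sketch, not from the notes):

```python
from scipy.stats import ttest_ind

# Two small illustrative samples
group1 = [16, 19, 14, 13, 18]
group2 = [24, 22, 19, 18, 22]

def unpaired_t(x1, x2):
    """Pooled-variance t, equation (1.19), with H0: mu1 - mu2 = 0."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss1 = sum((x - m1) ** 2 for x in x1)
    ss2 = sum((x - m2) ** 2 for x in x2)
    s2_pooled = (ss1 + ss2) / (n1 + n2 - 2)        # pooled variance estimate
    se = (s2_pooled / n1 + s2_pooled / n2) ** 0.5  # SE of the difference
    return (m1 - m2) / se

t_manual = unpaired_t(group1, group2)
t_scipy = ttest_ind(group1, group2).statistic      # scipy's equal-variance default
print(round(t_manual, 3), round(t_scipy, 3))       # both -3.162
```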

Testing the significance of Pearson r

Pearson r refers to the Pearson Product-Moment Correlation Coefficient. If you have not learned about it yet, it
will suffice to say for now that r is a measure of the strength and direction of the linear (i.e., straight line)
relationship between two variables X and Y. It can range in value from -1 to +1.

Negative values indicate that there is a negative (or inverse) relationship between X and Y: i.e., as X increases, Y
decreases. Positive values indicate that there is a positive relationship: as X increases, Y increases.

If r=0, there is no linear relationship between X and Y. If r = -1 or +1, there is a perfect linear relationship
between X and Y: That is, all points fall exactly on a straight line. The further r is from 0, the better the linear
relationship.

Pearson r is calculated using matched pairs of scores. For example, you might calculate the correlation between
students' scores in two different classes. Often, these pairs of scores are a sample from some population of
interest. In that case, you may wish to make an inference about the correlation in the population from which you
have sampled.

The Greek letter rho is used to represent the correlation in the population. It looks like this: ρ. As you may have already guessed, the null hypothesis specifies a value for ρ. Often that value is 0. In other words, the null hypothesis often states that there is no linear relationship between X and Y in the population from which we have sampled. In symbols:

H0: ρ = 0    { there is no linear relationship between X and Y in the population }
H1: ρ ≠ 0    { there is a linear relationship between X and Y in the population }

A null hypothesis which states that ρ = 0 can be tested with a t-test, as follows:

t = r / s_r, where s_r = √( (1 - r²) / (n - 2) ), and df = n - 2    (1.20)
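Equation (1.20) is straightforward to sketch in code (the r = .60, n = 20 values below are invented for illustration; the function name is my own):

```python
from scipy.stats import t

def r_significance(r, n):
    """t-test for H0: rho = 0, per equation (1.20); df = n - 2."""
    s_r = ((1 - r ** 2) / (n - 2)) ** 0.5   # standard error of r
    t_stat = r / s_r
    p_two = 2 * t.sf(abs(t_stat), n - 2)    # two-tailed p-value
    return t_stat, p_two

t_stat, p = r_significance(r=0.60, n=20)
print(round(t_stat, 3), round(p, 4))        # t ~3.182, p ~0.005
```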



General format for all z- and t-tests

You may have noticed that all of the z- and t-tests we have looked at have a common format. The formula
always has 3 components, as shown below:
z or t = (statistic - parameter|H0) / SE_statistic    (1.21)


The numerator always has some statistic (e.g., a sample mean, or the difference between two independent sample means) minus the value of the corresponding parameter, given that H0 is true. The denominator is the standard error of the statistic in the numerator. If the population standard deviation is known (and used to calculate the standard error), the test statistic is z. If the population standard deviation is not known, it must be estimated with the sample standard deviation, and the test statistic is t with some number of degrees of freedom. Table 5 lists the 3 components of the formula for the t-tests we have considered in this chapter. For all of these tests but the first, the null hypothesis most often specifies that the value of the parameter is equal to zero. But there may be exceptions.

Similarity of standard errors for single-sample and unpaired t-tests

People often fail to see any connection between the formula for the standard error of the mean, and the
standard error of the difference between two independent means. Nevertheless, the two formulae are very
similar, as shown below.
s_X̄ = s/√n = √(s²)/√n = √( s²/n )    (1.22)
Given how s_X̄ is expressed in the preceding Equation (1.22), it is clear that s_(X̄1 - X̄2) is a fairly straightforward extension of s_X̄ (see Table 5).
Table 5: The 3 components of the t-formula for t-tests described in this chapter.

Name of Test                        Statistic     Parameter|H0   SE of the Statistic
Single-sample t-test                X̄            μ              s_X̄ = s/√n
Paired t-test                       D̄            μ_D            s_D̄ = s_D/√n
Independent samples t-test          (X̄1 - X̄2)   (μ1 - μ2)      s_(X̄1-X̄2) = √(s²_pooled/n1 + s²_pooled/n2)
Test of significance of Pearson r   r             ρ              s_r = √( (1 - r²) / (n - 2) )

Assumptions of t-tests

All t-ratios are of the form t = (statistic - parameter under a true H0) / SE of the statistic. The key requirement (or assumption) for any t-test is that the statistic in the numerator must have a sampling distribution that is normal. This will be the case if the populations from which you have sampled are normal. If the populations are not normal, the sampling distribution may still be approximately normal, provided the sample sizes are large enough. (See the discussion of the Central Limit Theorem earlier in this chapter.) The assumptions for the individual t-tests we have considered are given in Table 6.

A little more about assumptions for t-tests

Many introductory statistics textbooks list the following key assumptions for t-tests:

1. The data must be sampled from a normally distributed population (or populations in the case of a two-sample test).
2. For two-sample tests, the two populations must have equal variances.
3. Each score (or difference score for the paired t-test) must be independent of all other scores.

The third of these is by far the most important assumption. The first two are much less important than many
people realize.

Table 6: Assumptions of various t-tests.

Single-sample t-test:
- You have a single sample of scores
- All scores are independent of each other
- The sampling distribution of X̄ is normal (the Central Limit Theorem tells you when this will be the case)

Paired t-test:
- You have matched pairs of scores (e.g., two measures per person, or matched pairs of individuals, such as husband and wife)
- Each pair of scores is independent of every other pair
- The sampling distribution of D̄ is normal (see Central Limit Theorem)

Independent samples t-test:
- You have two independent samples of scores, i.e., there is no basis for pairing of scores in sample 1 with those in sample 2
- All scores within a sample are independent of all other scores within that sample
- The sampling distribution of X̄1 - X̄2 is normal
- The populations from which you sampled have equal variances

Test for significance of Pearson r:
- The sampling distribution of r is normal
- H0 states that the correlation in the population = 0


Let's stop and think about this for a moment in the context of an unpaired t-test. A normal distribution ranges from minus infinity to positive infinity. So in truth, none of us who are dealing with real data ever sample from normally distributed populations. Likewise, it is a virtual impossibility for two populations (at least of the sort that would interest us as researchers) to have exactly equal variances. The upshot is that we never really meet the assumptions of normality and homogeneity of variance.

Therefore, what the textbooks ought to say is that if one was able to sample from two normally distributed populations with exactly equal variances (and with each score being independent of all others), then the unpaired t-test would be an exact test. That is, the sampling distribution of t under a true null hypothesis would be given exactly by the t-distribution with df = n1 + n2 - 2. Because we can never truly meet the assumptions of normality and homogeneity of variance, t-tests on real data are approximate tests. In other words, the sampling distribution of t under a true null hypothesis is approximated by the t-distribution with df = n1 + n2 - 2.

So, rather than getting ourselves worked into a lather over normality and homogeneity of variance (which we know are not true), we ought to instead concern ourselves with the conditions under which the approximation is good enough to use (much like we do when using other approximate tests, such as Pearson's chi-square).

Two guidelines

The t-test approximation will be poor if the populations from which you have sampled are too far from normal. But how far is too far? Some folks use statistical tests of normality to address this issue (e.g., the Kolmogorov-Smirnov test; the Shapiro-Wilk test). However, statistical tests of normality are ill advised. Why? Because the seriousness of departure from normality is inversely related to sample size. That is, departure from normality is most grievous when sample sizes are small, and becomes less serious as sample sizes increase. Tests of normality have very little power to detect departure from normality when sample sizes are small, and have too much power when sample sizes are large. So they are really quite useless.

A far better way to test the shape of the distribution is to ask yourself the following simple question:

Is it fair and honest to describe the two distributions using means and SDs?

If the answer is YES, then it is probably fine to proceed with your t-test. If the answer is NO (e.g., due to severe
skewness, or due to the scale being too far from interval), then you should consider using another test, or
perhaps transforming the data. Note that the answer may be YES even if the distributions are somewhat skewed,
provided they are both skewed in the same direction (and to the same degree).

The second guideline, concerning homogeneity of variance, is that if the larger of the two variances is no more than 4 times the smaller, the t-test approximation is probably good enough, especially if the sample sizes are equal. But as Howell (1997, p. 321) points out,

It is important to note, however, that heterogeneity of variance and unequal sample sizes do
not mix. If you have reason to anticipate unequal variances, make every effort to keep your
sample sizes as equal as possible.

Note that if the heterogeneity of variance is too severe, there are versions of the independent groups t-test that
allow for unequal variances. The method used in SPSS, for example, is called the Welch-Satterthwaite method.
See the appendix for details.
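As a sketch of that option in Python (the data below are invented; scipy's equal_var=False flag implements the Welch-Satterthwaite approach):

```python
from scipy.stats import ttest_ind

# Illustrative groups with very unequal spread and unequal n (invented data)
group1 = [10, 12, 11, 13, 12, 11]
group2 = [9, 18, 5, 22]

pooled = ttest_ind(group1, group2)                   # assumes equal variances
welch = ttest_ind(group1, group2, equal_var=False)   # Welch-Satterthwaite correction

# pooled t ~ -0.635, Welch t ~ -0.506: the correction changes the result
print(round(pooled.statistic, 3), round(welch.statistic, 3))
```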

All models are wrong

Finally, let me remind you of a statement that is attributed to George Box, author of a well-known book on
design and analysis of experiments:

All models are wrong. Some are useful.

This is a very important point, and as one of my colleagues remarked, a very liberating statement. When we perform statistical analyses, we create models for the data. By their nature, models are simplifications we use to aid our understanding of complex phenomena that cannot be grasped directly. And we know that models are simplifications (e.g., I don't think any of the early atomic physicists believed that the atom was actually a plum pudding). So we should not find it terribly distressing that many statistical models (e.g., t-tests, ANOVA, and all of the other parametric tests) are approximations. Despite being wrong in the strictest sense of the word, they can still be very useful.

Pearson's Correlation Coefficient
Pearson's r is a measure of linear correlation. The population parameter is denoted by the Greek letter rho (ρ) and the sample statistic is denoted by the Roman letter r.
Here are some properties of r:
- r only measures the strength of a linear relationship. There are other kinds of relationships besides linear.
- r is always between -1 and 1 inclusive. -1 means perfect negative linear correlation and +1 means perfect
positive linear correlation
- r has the same sign as the slope of the regression (best fit) line
- r does not change if the independent (x) and dependent (y) variables are interchanged
- r does not change if the scale on either variable is changed. You may multiply or divide all the x-values or y-values by a positive constant, or add or subtract a value to/from them, without changing the value of r.
- the test statistic for r (see equation 1.20) has a Student's t distribution
Here is the formula for r. Don't worry about the raw-score version: it can be simplified through some simple algebra, and then some substitutions using the SS notation discussed earlier. Written in terms of sums of squares, the linear correlation coefficient is:

r = SS_xy / √( SS_x × SS_y )

This is the formula that we would be using for calculating the linear correlation coefficient if we were doing it by hand. Luckily for us, the TI-82 has this calculation built into it, and we won't have to do it by hand at all.
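A hand calculation in the SS notation can be sketched as follows (the x and y data are invented for illustration):

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
ss_x = sum((xi - mx) ** 2 for xi in x)                       # SS of x
ss_y = sum((yi - my) ** 2 for yi in y)                       # SS of y
ss_xy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # cross-product SS

r = ss_xy / math.sqrt(ss_x * ss_y)   # r in sum-of-squares notation
print(round(r, 3))                   # 0.775
```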

Analysis of Variance (ANOVA)

Analysis of Variance, in its simplest form, can be thought of as an extension of the
two-sample t-test to more than two samples. (Further methods in Chapter 8 of Business Statistics)
As an example: Using the 2-sample t-test we have tested to see whether there was any difference between the
size of invoices in a company's Leeds and Bradford stores. Using Analysis of variance we can, simultaneously,
investigate invoices from as many towns as we wish, assuming that sufficient data is available.

Problem: Why can't we just carry out repeated t-tests on pairs of the variables?

If many independent tests are carried out pairwise then the probability of being correct for the combined results is greatly reduced. For example, if we compare the average marks of two students at the end of a semester to see if their mean scores are significantly different we would have, at a 5% level, a 0.95 probability of being correct. Comparing more students:

Students   Pairwise tests   P(all correct)       P(at least one incorrect)
2          1                0.95                 0.05
3          3                0.95^3 = 0.857       0.143
n          n(n-1)/2         0.95^(n(n-1)/2)      1 - 0.95^(n(n-1)/2)
10         45               0.95^45 = 0.10       0.90

Obviously this situation is not acceptable
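The numbers in the table above can be generated with a few lines (a sketch assuming independent tests, as the table does; the function name is my own):

```python
def p_at_least_one_error(n_students, alpha=0.05):
    """Probability that at least one pairwise test is wrong,
    assuming independent tests each correct with probability 1 - alpha."""
    n_tests = n_students * (n_students - 1) // 2   # number of pairwise comparisons
    return 1 - (1 - alpha) ** n_tests

for n in (2, 3, 10):
    print(n, round(p_at_least_one_error(n), 3))    # 0.05, 0.143, 0.901
```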

Solution: We need therefore to use methods of analysis which will allow the variation between all n means to be
tested simultaneously giving an overall probability of 0.95 of being correct at the 5% level. This type of analysis is
referred to as Analysis of Variance or 'ANOVA' in short. In general, it:
- measures the overall variation within a variable;
- finds the variation between its group means;
- combines these to calculate a single test statistic;
- uses this to carry out a hypothesis test in the usual manner.


In general, Analysis of Variance, ANOVA, compares the variation between groups and the variation within samples
by analysing their variances.

One-way ANOVA: Is there any difference between the average sales at various departmental stores within a
company?

Two-way ANOVA: Is there any difference between the average sales at various stores within a company and/or
the types of department? The overall variation is split 'two ways'.


One-way ANOVA

Total variation (SST) splits into:
- Variation due to difference between the groups, i.e. between the group means (SSG)
- Residual (error) variation not due to difference between the group means (SSE)

Two-way ANOVA

Total variation (SST) splits into:
- Variation due to difference between the groups, i.e. between the group means (SSG)
- Residual (error) variation not due to difference between the main group means (SSE1), which splits further into:
  - Variation due to difference between the block means, i.e. second group means (SSBl)
  - Genuine residual (error) variation not due to difference between either set of group means (SSE)

where SST = Total Sum of Squares; SSG = Treatment Sum of Squares between the groups; SSBl = Blocks Sum of
Squares; SSE = Sum of Squares of Errors.
(At this stage just think of 'sums of squares' as being a measure of variation.)


The method of measuring this variation is variance, which is standard deviation squared.
Total variance = between groups variance + variance due to the errors

It follows that: Total sum of squares (SST) = Sum of squares between the groups (SSG) + Sum of squares due to the errors (SSE)

If we find any two of the three sums of squares then the other can be found by difference.

In practice we calculate SST and SSG and then find SSE by difference.

Since the method is much easier to understand with a numerical example, it will be explained in stages using
theory and a numerical example simultaneously.

Example 1

One important factor in selecting software for word processing and database management systems is the time
required to learn how to use a particular system. In order to evaluate three database management systems, a firm
devised a test to see how many training hours were needed for five of its word processing operators to become
proficient in each of three systems.
System A 16 19 14 13 18 hours
System B 16 17 13 12 17 hours
System C 24 22 19 18 22 hours

Using a 5% significance level, is there any difference between the training time needed for the three systems?


In this case the 'groups' are the three database management systems. These account for some, but not all, of the
total variance. Some, however, is not explained by the difference between them. The residual variance is referred
to as that due to the errors.

Total variance = between systems variance + variance due to the errors.

It follows that: Total sum of squares (SST) = Sum of squares between systems (SSSys) + Sum of squares of errors (SSE)

In practice we transpose this equation and use: SSE = SST - SSSys

Calculation of Sums of Squares

The 'square' for each case is (x - x̄)², where x is the value for that case and x̄ is the mean. The 'total sum of squares' is therefore Σ(x - x̄)². The classical method for calculating this sum is to tabulate the values; subtract the mean from each value; square the results; and finally sum the squares. The use of a statistical calculator is preferable!

In the lecture on summary statistics we saw that the standard deviation is calculated by:

s_n = √( Σ(x - x̄)² / n ), so Σ(x - x̄)² = n × s_n², with s_n from the calculator using [xσn]

or

s_(n-1) = √( Σ(x - x̄)² / (n - 1) ), so Σ(x - x̄)² = (n - 1) × s_(n-1)², with s_(n-1) from the calculator using [xσn-1]

Both methods estimate exactly the same value for the total sum of squares.

TotalSS    Input all the data individually and output the values for n, x̄ and σn from the calculator in SD mode.
           Use these values to calculate σn² and n × σn².

    n     x̄       σn      σn²     n × σn²
    15    17.33   3.419   11.69   175.3  = SST
SSSys    Calculate n and x̄ for each of the management systems separately:

                n    x̄
    System A    5    16
    System B    5    15
    System C    5    21

Input these as frequency data, (n = 5), in your calculator and output n, x̄ and σn:

    n     x̄       σn      σn²     n × σn²
    15    17.33   2.625   6.889   103.3  = SSSys

SSE is found by difference SSE = SST - SSSys = 175.3 - 103.3 = 72.0
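The whole sums-of-squares calculation can also be sketched in a few lines of Python (standard library only; the group data are taken from the example above):

```python
# Sums of squares for the worked example, standard library only.
groups = {
    "A": [16, 19, 14, 13, 18],
    "B": [16, 17, 13, 12, 17],
    "C": [24, 22, 19, 18, 22],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = sum(all_values) / len(all_values)            # 17.33

# Total SS: every value against the grand mean.
sst = sum((x - grand_mean) ** 2 for x in all_values)

# Between-systems SS: each group mean against the grand mean, weighted by size.
sssys = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
            for g in groups.values())

# Error SS found by difference.
sse = sst - sssys

print(round(sst, 1), round(sssys, 1), round(sse, 1))  # 175.3 103.3 72.0
```

This reproduces SST = 175.3, SSSys = 103.3 and SSE = 72.0 exactly as in the calculator working.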


Using the ANOVA table
Below is the general format of an analysis of variance table. If you find it helpful then make use of it, otherwise just
work with the numbers, as on the next page.

General ANOVA Table (for k groups, total sample size N)

    Source            S.S.   d.f.                M.S.S.                 F
    Between groups    SSG    k - 1               MSG = SSG / (k - 1)    F = MSG / MSE
    Errors            SSE    (N - 1) - (k - 1)   MSE = SSE / (N - k)
    Total             SST    N - 1

Method
- Fill in the total sum of squares, SST, and the between groups sum of squares, SSG, after calculation; find the
sum of squares due to the errors, SSE, by difference;
- the degrees of freedom, d.f., for the total and the groups are one less than the total number of values and the
number of groups respectively; find the error degrees of freedom by difference;
- the mean sum of squares, M.S.S., is found in each case by dividing the sum of squares, S.S., by the
corresponding degrees of freedom.
- The test statistic, F, is the ratio of the mean sum of squares due to the differences between the group means
and that due to the errors.

In this example: (k = 3 systems, N = 15 values)

    Source             S.S.    d.f.          M.S.S.             F
    Between systems    103.3   3 - 1 = 2     103.3/2 = 51.65    51.65/6.00 = 8.61
    Errors             72.0    14 - 2 = 12   72.0/12 = 6.00
    Total              175.3   15 - 1 = 14


The hypothesis test
The methodology for this hypothesis test is similar to that described last week.

The null hypothesis, H0, is that all the group means are equal. H0: μ1 = μ2 = μ3 = μ4 etc.

The alternative hypothesis, H1, is that at least two of the group means are different.

The significance level is as stated or 5% by default.

The critical value is from the F-tables, Fα(ν1, ν2), with the two degrees of freedom from the groups, ν1, and the
errors, ν2.

The test statistic is the F-value calculated from the sample in the ANOVA table.

The conclusion is reached by comparing the test statistic with the critical value and rejecting the null hypothesis if
the test statistic is the larger of the two.

Example 1 (cont.)

H0: μA = μB = μC
H1: At least two of the means are different.
Critical value: F0.05(2,12) = 3.89 (Degrees of freedom from 'between systems' and 'errors'.)
Test statistic: 8.61
Conclusion: T.S. > C.V. so reject H0. There is a difference between the mean learning times for at least two of the
three database management systems.
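For readers with SciPy installed, the same one-way test can be cross-checked directly: `f_oneway` reproduces the F statistic from the ANOVA table, and `f.ppf` gives the tabulated critical value. This is purely an illustrative check, not part of the calculator method described above.

```python
# Requires SciPy; an illustrative cross-check of the hand calculation.
from scipy.stats import f, f_oneway

system_a = [16, 19, 14, 13, 18]
system_b = [16, 17, 13, 12, 17]
system_c = [24, 22, 19, 18, 22]

result = f_oneway(system_a, system_b, system_c)   # one-way ANOVA
critical = f.ppf(0.95, dfn=2, dfd=12)             # 5% point of F(2, 12)

print(round(result.statistic, 2))                 # 8.61, as in the ANOVA table
print(round(critical, 2))                         # 3.89, as in Table 4
print(result.statistic > critical)                # True: reject H0
```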


Where does any difference lie?
We can calculate a critical difference, CD, which depends on the MSE, the sample sizes and the significance level,
such that any difference between means which exceeds the CD is significant and any less than it is not.
The critical difference formula is:

    CD = t × √( MSE × (1/n1 + 1/n2) )

where t has the error degrees of freedom and one tail, and MSE is taken from the ANOVA table.
Example 1 (cont.)

    CD = t × √( MSE × (1/n1 + 1/n2) ) = 1.78 × √( 6.00 × (1/5 + 1/5) ) = 2.76

From the samples: x̄A = 16, x̄B = 15, x̄C = 21.
System C takes significantly longer to learn than Systems A and B which are similar.
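The critical-difference arithmetic can be sketched as follows, again assuming SciPy is available for the one-tailed t point (MSE and the degrees of freedom come from the one-way ANOVA table above):

```python
import math
from scipy.stats import t  # assumed available for the one-tailed t point

mse = 6.00        # MSE from the one-way ANOVA table
df_error = 12     # error degrees of freedom
n1 = n2 = 5       # the two group sizes being compared

t_crit = t.ppf(0.95, df_error)                     # one tail, 5%: about 1.78
cd = t_crit * math.sqrt(mse * (1 / n1 + 1 / n2))

print(round(t_crit, 2), round(cd, 2))  # 1.78 2.76
```

The mean for System C (21) differs from those of A (16) and B (15) by more than 2.76, while A and B differ by only 1, which is why only C stands apart.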


Two-way ANOVA
In the above example it might have been reasonable to suggest that the five Operators might have different
learning speeds and were therefore responsible for some of the variation in the time needed to master the three
Systems. By extending the analysis from one-way ANOVA to two-way ANOVA we can find out whether Operator
variability is a significant factor or whether the differences found previously were just due to the Systems.

Example 2 Operators
1 2 3 4 5
System A 16 19 14 13 18
System B 16 17 13 12 17
System C 24 22 19 18 22

Again we ask the same question: using a 5% level, is there any difference between the training times for the three
systems? We can use the Operator variation just to explain some of the unexplained error, thereby reducing it (a
'blocked' design), or we can consider it in a similar manner to the System variation in the last example in order to
see if there is a difference between the Operators.

In the first case the 'groups' are the three database management systems and the 'blocks' being used to reduce the
error are the different operators who themselves may differ in speed of learning. In the second we have a second
set of groups - the Operators.

Total variance = between systems variance + between operators variance
+ variance of errors.





In 2-way ANOVA we find SST, SSSys, SSOps and then find SSE by difference.
SSE = SST - SSSys - SSOps

We already have SST and SSSys from 1-way ANOVA but still need to find SSOps.

Operators
1 2 3 4 5 Means
System A 16 19 14 13 18 16.00
System B 16 17 13 12 17 15.00
System C 24 22 19 18 22 21.00
Means 18.67 19.33 15.33 14.33 19.00 17.33

From example 1: SST = 175.3 and SSSys = 103.3

SSOps    Inputting the Operator means as frequency data (n = 3) gives:

    n = 15, x̄ = 17.33, σn = 2.078 and n × σn² = 64.7 = SSOps

Two-way ANOVA table, including both Systems and Operators:
Source S.S. d.f. M.S.S. F
Between Systems 103.3 3 - 1 = 2 103.3/2 = 51.65 51.65/0.91 = 56.76
Between Operators 64.7 5 - 1 = 4 64.7/4 = 16.18 16.18/0.91 = 17.78
Errors 7.3 14 - 6 = 8 7.3/8 = 0.91
Total 175.3 15 - 1 = 14
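The two-way sums of squares can be sketched the same way (standard library only; note that SSE is again found by difference):

```python
# Two-way sums of squares: rows are Systems, columns are Operators.
rows = {
    "A": [16, 19, 14, 13, 18],
    "B": [16, 17, 13, 12, 17],
    "C": [24, 22, 19, 18, 22],
}
n_sys, n_ops = 3, 5

all_values = [x for r in rows.values() for x in r]
grand = sum(all_values) / len(all_values)

sst = sum((x - grand) ** 2 for x in all_values)
sssys = sum(n_ops * (sum(r) / n_ops - grand) ** 2 for r in rows.values())

# Column (Operator) means, taken down the three Systems.
op_means = [sum(r[j] for r in rows.values()) / n_sys for j in range(n_ops)]
ssops = sum(n_sys * (m - grand) ** 2 for m in op_means)

sse = sst - sssys - ssops  # error SS by difference
print(round(sssys, 1), round(ssops, 1), round(sse, 1))  # 103.3 64.7 7.3
```

With the Operator variation accounted for, the error sum of squares falls from 72.0 to 7.3, which is why the F statistics in the two-way table are so much larger.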

Hypothesis test (1) for Systems
H0: μA = μB = μC    H1: At least two of them are different.
Critical value: F0.05(2,8) = 4.46
Test Statistic: 56.76 (Notice how the test statistic has increased with the use of the more powerful two-way
ANOVA)
Conclusion: T.S. > C.V. so reject H0. There is a difference between at least two of the mean times
needed for training on the different systems.
Using CD of 1.12 (see overhead): C takes significantly longer to learn than A and/or B.

Hypothesis test (2) for Operators
H0: μ1 = μ2 = μ3 = μ4 = μ5    H1: At least two of them are different.
Critical value: F0.05(4,8) = 3.84
Test Statistic: 17.78
Conclusion: T.S. > C.V. so reject H0. There is a difference between at least two of the Operators in the mean time
needed for learning the systems.
So: Total sum of squares (SST) = Sum of squares between systems (SSSys)
    + Sum of squares between operators (SSOps) + Sum of squares of errors (SSE)


Using CD of 1.45 calculated as previously (see overhead): Operators 3 and 4 are significantly quicker learners than
Operators 1, 2 and 5.

Table 4  PERCENTAGE POINTS OF THE F DISTRIBUTION, α = 5%

Rows: denominator degrees of freedom, ν2. Columns: numerator degrees of freedom, ν1.

ν2 \ ν1   1       2       3       4       5       6       7       8       9
1 161.40 199.50 215.70 224.60 230.20 234.00 236.80 238.90 240.50
2 18.51 19.00 19.16 19.25 19.30 19.33 19.35 19.37 19.38
3 10.13 9.55 9.28 9.12 9.01 8.94 8.89 8.85 8.81
4 7.71 6.94 6.56 6.39 6.26 6.16 6.09 6.04 6.00
5 6.61 5.79 5.41 5.19 5.05 4.95 4.88 4.82 4.77
6 5.99 5.14 4.76 4.53 4.39 4.28 4.21 4.15 4.10
7 5.59 4.74 4.35 4.12 3.97 3.87 3.79 3.73 3.68
8 5.32 4.46 4.07 3.84 3.69 3.58 3.50 3.44 3.39
9 5.12 4.26 3.86 3.63 3.48 3.37 3.29 3.23 3.18
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22

30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04

120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88
