
INTRODUCTION 17/06/13 4:22 PM

THE REALLY EASY STATISTICS SITE


Produced by Jim Deacon

Biology Teaching Organisation, University of Edinburgh


______________________________________________

I have produced this site as someone who uses statistics in my experimental work, but I am
not a statistician. The site is intended to provide a simple, straightforward guide to the
basics of experimental design and to some of the common statistical tests.

There are several excellent sites on statistical methods. But I think that many undergraduates
(and graduate students) want only a user-friendly beginner's guide - or a refresher course - that
enables them to use statistical tests with a minimum of fuss. That's the aim of this site. It is by
no means a comprehensive guide, but it will get you started and, if nothing more, it will help
you to understand the meaning of the symbols you see in scientific papers.

You can use this site in two ways:

- If you know the statistical test that you want, then go straight to the index of tests (see below).
- If you are "rusty" or a beginner, then you should read the introductory pages: THE SCIENTIFIC METHOD, Experimental design, Common statistical terms, Descriptive statistics, and What test do I need?
______________________________________________

- THE SCIENTIFIC METHOD
- Experimental design
- Designing experiments with statistics in mind
- Common statistical terms
- Descriptive statistics: standard deviation, standard error, confidence intervals of the mean

WHAT TEST DO I NEED?

- Student's t-test
- Paired-samples test (like a t-test)
- Analysis of variance

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress1.html Page 1 of 2
EXPERIMENTAL DESIGN 17/06/13 4:22 PM

This site is no longer maintained and has been left for archival purposes. Text and links may be out of date.

This page covers four things:

- THE SCIENTIFIC METHOD
- Experimental Design
- Designing experiments so that you can use statistics
- Some common statistical terms

THE SCIENTIFIC METHOD


After reading this section, you should be able to describe the main stages involved in scientific
problem-solving.

What is THE SCIENTIFIC METHOD?

Science is usually fun. Most scientists enjoy their work, and fortunately it is sometimes useful
to society. Since scientists are people rather than machines, they behave as other people do.
They can be bad-tempered, pig-headed, jealous of the success of others and untruthful. Even so,
the pattern of work employed by scientists, the scientific method, is the most powerful tool yet
devised for the analysis and solution of problems in the natural world. The method can be
applied as much to situations in everyday life as to conventional 'scientific' problems. The world
might even be a happier place if more people attempted to solve their problems in a 'scientific'
way.

Scientific problem-solving has a number of recognisable stages:

- Observing a phenomenon that is in some way interesting or puzzling.

- Making a guess as to the explanation of the phenomenon.

- Devising a test to show how likely this explanation is to be true or false.

- Carrying out the test, and, on the basis of the results, deciding whether the
explanation is a good one or not. In the latter case, a new explanation will (with
luck) 'spring to mind' as a result of the first test.

Originality

Science depends upon original thinking at several points. One is when we make the original
'guess' - usually called an hypothesis. Another is when we devise a test or experiment to show
how likely the hypothesis is to be correct. A good scientist relies on 'inspiration' in the same

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html Page 1 of 13

way as a good artist. [Some teachers are concerned that this point is largely ignored in present-
day science education.]

THE SCIENTIFIC METHOD in practice - two everyday examples

The examples constitute the sort of story you might read in a newspaper. We will try to analyse
these stories in terms of THE SCIENTIFIC METHOD. A general feature is the lack of a
rigorous test situation in the original stories; the strength of THE SCIENTIFIC METHOD
largely lies in devising tests that are capable of discriminating between different hypotheses.

1. The dog that understands French

Mr Smith of Morningside has taught his dog Rover to understand French. Mr Smith noticed that
every evening, after dinner, when he went to the door with his coat on and said "Walkies",
Rover immediately understood and came running. Mr Smith was going to France for the
summer, and, as an experiment in international understanding, decided to teach Rover French.
He started to say "Allons" instead of "Walkies". To his delight, Rover very quickly understood
and came running.

Analysis

- What is the observation here?

The dog apparently responds to the call of 'allons'.

- What is Mr Smith's hypothesis?

The dog understands the word as meaning 'walkies'.

- Is this the only explanation?

No. The dog may respond to a total situation (after dinner, going to the door, coat on, call) of which what is actually called is only a small part. A change in the call may not matter much to the dog.

- Can we devise a test to discriminate between these possibilities?

There are several. For example:

(a) Call "Allons" to the dog in a different situation: for example, in the morning when he does not usually go for a walk.

(b) Go to the door, in the normal way (coat on, after dinner) without calling anything.

(c) Do likewise and call something silly like "Rhubarb".

The results of these and similar tests should indicate whether Rover is specifically responding to
the word "allons", or (more likely) to an overall situation he is well used to.

Notice that these tests do not tell us anything of a dog's ability to learn French words. They are
only concerned with the specific case of responding to one French word. We will see later that
extrapolating from the specific to the general is very important in scientific methodology.

2. Long-term success of a foreteller of the future

The Institute for Psychical Research conducted a study on the performance of well-known
fortune-tellers. The most positive results involve Arnold Woodchuck who, at the start of each
year, makes a series of ten predictions for the coming year in a national tabloid newspaper. The
Institute has found that Mr Woodchuck is consistently correct in about 80% of his predictions.
For example, for 1995 he predicted a political crisis in Europe (the former Yugoslavia?), a
major human disaster in Africa (Rwanda?), a dispute over public sector pay (nurses?) and the
demise of a prominent football manager (Mr Graham?). He was unfortunately wrong in
predicting that England would win the Rugby Union World Cup. A spokesman for the Institute
was 'optimistic' about future studies on Mr Woodchuck.

Analysis

The apparent observation is that Mr Woodchuck has got more predictions correct than would
have been expected by chance. The Institute's hypothesis would be that Mr Woodchuck has
some kind of 'psychic powers'. Can we devise an alternative hypothesis?

We are dealing here with probability. If we toss an unbiased coin we get on average the same
number of heads as tails. If we asked someone to predict the outcome of the toss, we would not
be terribly surprised if from a small number of trials, he got 4 out of 5 right. But if he continued
to achieve 80% success over a long series, we would begin to suspect: (1) a biased coin; (2)
cheating; (3) psychic powers.

Mr Woodchuck regularly gets 80%. Is his 'coin' biased, is he cheating, or does he have psychic
powers? The most likely explanation is the 'biased coin' one, i.e. that the events he predicts do
not have a 1:1 probability, but perhaps a probability nearer to 4 or 5:1 on; in other words, a very
high probability that they will occur.

We have therefore two kinds of test:

- An analysis of the real probabilities of the events predicted.

For example, almost invariably every year there is at least one 'political crisis' in Europe and a
'major human disaster' in Africa. Similarly, football managers have a short shelf-life. Public
sector employees (such as nurses, railway signalmen or indeed University teachers) have for
years perceived themselves to be underpaid whilst their masters either cannot or will not
respond appropriately. In contrast, the chances of England's winning the Rugby Union World
Cup were over-stated by the English press - and this is a prediction that failed.

- An analysis of Mr Woodchuck's success in making correct predictions of events which really do have a low probability.

Again, the results of this investigation would be limited. They would probably show that the
'biased coin' explanation is the most likely. They would not show (a) whether Mr Woodchuck
has some kind of psychic power; or (b) whether psychic powers are possible.

Notice also that even a large deviation from an expected result can occur by chance in a small
sample (e.g., getting 4 out of 5 coin-tossing guesses right). This is very important in Biology,
and the basis of the use of statistical methods in biological analysis.
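The coin-tossing point can be checked directly. As a minimal sketch in plain Python (assuming independent guesses, each right with probability 0.5), the chance of getting at least 4 of 5 guesses right is quite high, while 80% success over a long series is vanishingly unlikely:

```python
from math import comb

def prob_at_least(n, k, p=0.5):
    # Probability of k or more correct guesses in n trials,
    # when each guess is right with probability p (binomial upper tail).
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# 4 or more right out of 5 tosses: 6/32, i.e. about 0.19 - not surprising at all
print(prob_at_least(5, 4))

# 40 or more right out of 50 tosses: far too small to be put down to chance
print(prob_at_least(50, 40))
```

This is exactly why a large deviation in a small sample proves little, but the same percentage deviation in a large sample demands an explanation.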

EXPERIMENTAL DESIGN

After reading this section you should be able to discriminate between good and bad
experimental design.

Experimental Design

The design of a suitable experiment to test an hypothesis often requires some ingenuity and a
suspicious nature. In modern biology, the experiment may involve very sophisticated
equipment. But there are a number of features common to all good experiments (and often
absent from bad ones) which exist whatever the technical details. In summary these are:

Discrimination

Experiments should be capable of discriminating clearly between different hypotheses. It often
turns out that two or more hypotheses give indistinguishable results when tested by poorly-designed
experiments.

Replication and generality

Living material is notoriously variable. Usually experiments must be repeated enough times for
the results to be analysed statistically. Similarly, because of biological variability, we must be
cautious of generalising our results either from individual creatures to others of the same
species, or to other species. For instance, if our hypothesis is about mammals, it is inadequate
simply to carry out our experiments on laboratory rats. Similarly, it is dangerous to extrapolate
from healthy students to elite athletes.

Controls

The experiment must be well controlled. We must eliminate by proper checks the possibility
that other factors in the overall test situation produce the effect we are observing, rather than the
factor we are interested in.

An example: Growth hormone is secreted in response to a number of agents, including the
amino acid arginine. This was shown by injecting volunteers with arginine. As a control, the
investigators injected the volunteers with a saline solution. To their surprise, growth hormone
was again secreted. The investigators then waved a syringe and needle in front of their
volunteers, and found that this provoked growth hormone secretion too. Growth hormone is
now known to be secreted in response to stress (as well as arginine).

At a more technical level, we must be sure that our method of measurement is reproducible
from day to day, between operators in the same laboratory, or between laboratories. Whilst we
might be confident about a balance or a ruler, can we be as sure about, say, a method for
measuring haemoglobin? Do two groups of students measuring the same samples by the same
methods produce the same results? Quality control helps here.

'Blind' Designs

Investigators can subconsciously 'fudge' their data if they know what result they want to find.
The answer is to do the experiment 'blind', so the investigators (and the subjects, if humans are
being studied) do not know which treatment's effect they are observing. This can make the
logistics of doing the experiment more complex: for example, when determining the
haemoglobin concentration of male and female class members.

There is a story about a professor who devised a maze for measuring the intelligence of rats.
One day he gave his technicians, who actually made the measurements, three groups of rats. He
told them one group had been specially bred for intelligence, one for stupidity and the third was
average. The technicians assessed the rats' intelligence and confirmed that the 'bright' group
performed the best and the 'stupid' group the worst. The point is, of course, that the professor
had put animals into the three groups at random. They did not differ in intelligence.

Measurement

Good experiments often, though not always, involve measuring something: a weight, say. When
you make measurements, it is important you know both the accuracy and the precision of your
measuring system. These two terms are not synonymous: 'accuracy' means the ability of the
method to give an unbiased answer on average, whereas 'precision' is an index of the method's
reproducibility. Ideally your method should be both accurate (i.e., give the true mean) and
precise (i.e., have a low standard deviation). Sometimes one is more important than the other.
For example, if you were looking for small changes with time in a quantity (such as an athlete's
haemoglobin concentration), you would need a precise measure of it rather more than an
accurate one.

Accuracy and precision together help you to judge the reliability of your data. They also help
you to judge to how many significant figures you should quote your results. For example, if you
use a balance reading to the nearest gram, you should give the results to the nearest gram and
not, say, to the nearest tenth of a gram.
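The distinction can be made concrete with a few lines of code. This is only a sketch (the function name and the readings are invented for illustration): the bias of the mean measures accuracy, and the standard deviation measures precision.

```python
def bias_and_spread(measurements, true_value):
    # Accuracy: how far the average measurement sits from the true value (bias).
    # Precision: how reproducible the measurements are (standard deviation).
    n = len(measurements)
    m = sum(measurements) / n
    bias = m - true_value
    s2 = sum((x - m) ** 2 for x in measurements) / (n - 1)
    return bias, s2 ** 0.5

# A precise but inaccurate method: readings cluster tightly, but around the wrong value
print(bias_and_spread([10.4, 10.5, 10.4, 10.5], true_value=10.0))
```

For the athlete's haemoglobin example above, you would want the second number (the spread) to be small, even if the first (the bias) were not quite zero.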

Some experiments are very difficult to do because it is not obvious what can be measured. This
is a real problem in animal behaviour: for example, there is no obvious unit or measure for
'emotional state'. It is usually necessary to isolate measurable components of behaviour. Thus
the speed at which a tiger paces up and down a cage can give some indication of the internal
state of the animal but can never give a full picture of it.

Many of these points are rather abstract, but they should become clearer when you think about
the following examples.

Example 1: Do plants give off water vapour?

Experiment:

Forty bean plants, growing in pots, were covered one afternoon by individual glass containers
and left in the laboratory overnight. Next morning, the inside of the lid of each container was
found to be covered in droplets of a fluid which proved to be water.

Conclusion:

Plants generally give off water vapour.

Critique

1. Lack of controls.

The water could have come from the plants, the soil, the pots, or the air in the jar.
Control experiments should have been set up to test for these possibilities.

2. The conclusion contains some points that are not valid.

(a) The experiment was done overnight and so can tell us nothing about the
behaviour of the plants at other times of day; the word 'generally' is not justified.


(b) It was carried out with an adequate number of bean plants but can tell us
nothing about other kinds of plants; the word 'plants' should be qualified.

(c) There is no evidence in the experiment that water is given off as a vapour.

Example 2: Is your supermarket's 'own brand' of washing powder as good as a nationally-advertised one?

Eric Triton bemoaned the fact that his wife Ariel insisted on washing his clothes with their local
supermarket's own brand of powder. He was sure the well-known brand he saw performing
miracles on television most evenings would do better. He therefore set out to prove as much.

Mr Triton decided to compare the effectiveness of the two products on what his wife called
'difficult' dirt: grass stains on white linen handkerchiefs. He bought 4kg of the well-known
brand for £5.17 in their supermarket and noted that the same weight of the own-brand powder
would have cost £4.47. He followed the instructions on the packets exactly, weighing out the
same amount of powder and using their washing machine's programme for white linens. Mr
Triton was aware of the need for an index of 'cleanliness' and therefore devised a subjective
scale, ranging from 10 ('whiter than white') to 0 (the starting level of dirtiness).

Mr Triton's belief was substantially confirmed. He scored the handkerchief cleaned by the
national brand an impressive 8, whereas the own-brand powder only managed 7. Triumphantly,
he reported the outcome to his wife. Mrs Triton, however, was unimpressed. She pointed out to
her husband that there were several flaws in his experiment and convinced him that the outcome
was 'not proven'.

- What do you think were the flaws that Mrs Triton could have detected?

- How should Mr Triton have done the experiment?

Further reading: Barnard C, Gilbert F and McGregor P (1993) Asking Questions in Biology,
Longmans.

DESIGNING EXPERIMENTS TO USE STATISTICS


There is a story about an eminent Professor at Cambridge who gave a paper at a scientific
meeting and was asked by a questioner "what statistical test did you use to verify your results?"
The Professor explained that he used his own statistical test:

"In our Department we have a long corridor with a notice board at one end. I draw a
histogram of my results, pin it to the notice board, then walk to the other end of the
corridor. If I can still see a difference between the treatments then it's significant"

The relevance of this story lies in what it does not say! If an experiment is designed and
executed properly - as we would expect of an eminent scientist - then the results often speak for
themselves. For example, this might be true of experiments in which mutants are generated (or
genes inserted) in an organism, giving a clear change of behaviour such as resistance to an
antibiotic or expression of a new trait. Such "all or nothing" effects seldom need to be backed
by statistical tests, but they still need good experimental design.

However, in many areas of biology we work with variable effects - differences in the growth
rates of organisms, quantitative differences in antibiotic resistance or in size or in rates of
biochemical reactions, etc. Then we not only need statistical tests to analyse those differences
but we also need good experimental design to ensure that we haven't biased our results in some
way, without realising it.

Good experimental design is the key to good science. But it's not as easy as it
might seem.
In many cases good experimental design involves having a clear idea about how we will
analyse the results when we get them. That's why statisticians often tell us to think about the
statistical tests we will use before we start an experiment.

Three important steps in good experimental design


1. Define the objectives. Record (i.e. write down) precisely what you want to test in an
experiment.

2. Devise a strategy. Record precisely how you can achieve the objective. This includes
thinking about the size and structure of the experiment - how many treatments? how many
replicates? how will the results be analysed?

3. Set down all the operational details. How will the experiment be performed in practice? In
what order will things be done? Should the treatments be randomised or follow a set structure?
Can the experiment be done in a day? Will there be time for lunch? etc.

If all this sounds trivial or obvious, then read on. It's not as easy as you think!

Example 1. Experiments that yield no useful results because we did not collect enough
data

Suppose that we want to test the results of a Mendelian genetic cross. We start with 2 parents of
genotype AABB and aabb (where A and a represent the dominant and recessive alleles of one
gene, and B and b represent the dominant and recessive alleles of another gene).

We know that all the F1 generation (first generation progeny of these parents) will have
genotype AaBb and that their phenotype will display both dominant alleles (e.g. in fruit flies all
the F1 generation will have red eyes rather than white eyes, and normal wings rather than
stubby wings).

This F1 generation will produce 4 types of gamete (AB, Ab, aB and ab), and when we self-cross
the F1 generation we will end up with a variety of F2 genotypes (see the table below).

Gametes    AB      Ab      aB      ab
   AB     AABB    AABb    AaBB    AaBb
   Ab     AABb    AAbb    AaBb    Aabb
   aB     AaBB    AaBb    aaBB    aaBb
   ab     AaBb    Aabb    aaBb    aabb

All these genotypes fall into 4 phenotypes: double dominant, single dominant A, single
dominant B and double recessive. And we know that in classical Mendelian genetics the ratio
of these phenotypes is 9:3:3:1


Most people also know that we use a chi squared test to analyse the results of genetic crosses:
we do our experiment, count the number of F2 progeny that fall into the different categories,
and test to see if our results agree with an expectation. In this case, the expectation would be
that the results fit a 9:3:3:1 ratio.

But what you might not know is that a chi squared test is only valid if the expected count in
every category is 5 or more (it does not matter what the actual count is in each category, but
the expected count must be 5 or more). In other words, we MUST have at least 80 F2 progeny
in order to use the chi squared test for this experiment, because then the smallest category -
double recessive - would have an expected 5 individuals in it (one-sixteenth of 80 being 5).
[A fuller explanation of this is given in Chi squared test.]
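That validity check can be built straight into the calculation. Here is a minimal sketch of the chi squared statistic for a cross like this one (plain Python; the progeny counts are invented for illustration):

```python
def chi_squared(observed, ratio):
    # Expected counts come from the hypothesised ratio (e.g. 9:3:3:1).
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    # The test is only valid if every EXPECTED count is 5 or more.
    if min(expected) < 5:
        raise ValueError("too few progeny: smallest expected count is below 5")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 160 F2 progeny scored into the four phenotype classes (expected 90:30:30:10)
print(chi_squared([90, 29, 32, 9], [9, 3, 3, 1]))   # small value: a good fit to 9:3:3:1

# With only 48 progeny the double-recessive class expects just 3, so the test refuses:
# chi_squared([27, 9, 10, 2], [9, 3, 3, 1])  raises ValueError
```

The calculated statistic would then be compared with the tabulated chi squared value for 3 degrees of freedom.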

Similarly, for comparing two counts (e.g. counts from dilution plating of bacteria) by means of
a Poisson distribution, you will need to count about 30 colonies at the chosen dilution level.
[The principles underlying the Poisson distribution do not hold for counts lower than this.]

As a different example along the same lines, we might want to compare the biomass produced
by plant callus culture in flasks containing different nutrient solutions. We know that we need
more than one flask of each nutrient solution (i.e. we need replicates), and we will use
Student's t-test to compare the mean growth in each solution. [Basically, a t-test compares the
difference between the two means in relation to the amount of variation within the treatments.
In other words, we get a significant result if the difference between the means is large and/or the
variation between replicates is small].

So, how many replicates should we use? This is a matter of judgement (and the available
resources) but if we look at a t-table we can make some rational decisions. If we use 2 flasks for
each treatment (4 flasks in total), we would have 2 degrees of freedom. This term is explained
elsewhere, but for now we can note that the number of degrees of freedom for each treatment is
one less than the number of replicates. In other words, with 2 treatments of 2 flasks each we
have 2 degrees of freedom. With 2 treatments of 10 flasks each we have 18 degrees of freedom.

When we analyse our results by Student's t-test, we calculate a t value and compare it with the t
value for probability of 0.05 in the t-table. Our treatments differ significantly if the calculated t
value is greater than the tabulated value.

Look at the tabulated t value (4.30) for 2 degrees of freedom. It is quite high, and we would
only find a significant difference between our treatments if we have quite a large difference
between the means and also little variation in our replicates. But if we used 4 replicates of each
treatment (6 degrees of freedom) we would have a much better chance of finding a significant
difference (t value of 2.45) between the same means. But look even further down the t-table -
e.g. downwards from 10 degrees of freedom (t-value 2.23) - and we see that we would gain very
little by using any more replicates. We would be in the realm of diminishing returns, gaining
very little for all the extra time and resources.
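The diminishing-returns argument can be seen by computing t directly. A sketch of Student's t for two equal-sized treatments, using Python's standard library (the flask biomass figures are invented for illustration):

```python
from statistics import mean, variance

def students_t(a, b):
    # Pooled two-sample t: the difference between the means relative to the
    # variation within the treatments. variance() uses the n - 1 divisor.
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    se = (pooled * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(a) - mean(b)) / se, n1 + n2 - 2   # t value and degrees of freedom

solution_1 = [10.1, 12.0, 11.2, 13.1]   # biomass per flask, 4 replicates
solution_2 = [8.2, 9.0, 7.1, 8.4]
t, df = students_t(solution_1, solution_2)
print(t, df)   # compare t against the tabulated value for df degrees of freedom
```

With 4 replicates per treatment we get 6 degrees of freedom, so the calculated t only has to exceed 2.45 to be significant at P = 0.05; with 2 replicates per treatment it would have to exceed 4.30.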

The message from these examples is that knowledge of the statistical test that we will use
helps us to design our experiment properly.

Example 2. Experiments that seem to give useful results but our procedures let us down!

Under this heading we deal with the actual process of doing an experiment - a task with many
hidden pitfalls.

Suppose we decide to compare 4 treatments, with 4 replicates each - a total of 16 flasks of
bacteria, 16 potted plants, 16 biochemical reactions to measure with a spectrophotometer, etc.


We do the experiment, get nice results, analyse them (the appropriate test would be Analysis of
Variance) and find significant differences between the treatments. We write up the results, get a
Nobel Prize, or a good mark, or whatever. End of story.

Or is it? The answer to that question depends on how we did the experiment. For example,
there might have been a good "practical" (i.e. convenient) reason for setting up all replicates of
treatment 1, then (for example) changing the pipette and setting up all replicates of treatment 2,
and so on. The problem is: how can we be sure that the difference we found between treatments
was due to the treatments themselves, and was not influenced by the order in which we set them
up? Even if we DO feel sure, our experiment is inherently biased and nobody would trust the
results if we said how we did it! [There is an almost infinite number of reasons why the
conditions might change during the time taken to set up an experiment. For example, we might
get progressively more efficient, or more tired. The temperature of the water bath (or whatever)
might change slightly during this time. Each pipette will be slightly different from the next, etc.
etc.]

So, what about doing one replicate of treatment 1, then one of treatment 2, then 3, then 4, and
then doing a second replicate of treatment 1, a second of treatment 2, and so on? In truth, this
would remove only some of the inherent bias - on average, treatment 1 is still being set up
before treatment 2, etc.

The only way to overcome this is to plan in advance. We have basically two options.

- We could randomise the order in which each test is set up, so that it is entirely a matter of
chance whether "treatment 1, replicate 1" is done first or last or somewhere in the middle.

- We could structure the setting up (or the sampling of the experiment) so that one replicate
of each treatment is done first (in random order), then the next replicate of each treatment
is done (again in random order), and so on. This would be sensible if we knew (or
suspected) that the conditions are likely to change during the setting up of an experiment,
because Analysis of Variance (the statistical test we would use for our results) could
separate out the effect of "setting up time" from the effect of "treatments". This sort of
experimental design is termed a randomised block design. It is illustrated in the grid
below, where the rows represent blocks and the order of treatments (A, B, C and D)
within the blocks is determined randomly. [But note that the grid below is perfect - you
would seldom get that by randomly ordering the treatments within blocks. A grid of this
type is called a Latin square, and for some purposes it is probably better than a
randomised block.]

Row 1 A B C D
Row 2 D C A B
Row 3 B A D C
Row 4 C D B A
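Planning the run order in advance is easy to automate. A sketch in plain Python that produces one independently shuffled order of the treatments per block (unlike the perfect grid above, the same treatment may recur in a column - that is normal for a randomised block):

```python
import random

def randomised_blocks(treatments, n_blocks, seed=None):
    # One block = one complete replicate of every treatment;
    # the order of treatments is shuffled independently within each block.
    rng = random.Random(seed)
    design = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)
        design.append(block)
    return design

for block in randomised_blocks(["A", "B", "C", "D"], n_blocks=4, seed=1):
    print(block)
```

Fixing the seed records the design so the run order can be written into the lab book before the experiment starts.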

And one last point - the same principles should be applied to other practical aspects of an
experiment. For example, have you ever tested the temperature gradient in a laboratory
incubator or (worse still) in a walk-in growth room? The temperature gauge might record
"30°C" but there is likely to be a vertical (or lateral) temperature gradient of 2-3°C or more. So
never put all the replicates of one treatment together. Randomise them.

To block or not to block?


For most experiments we would simply randomise the treatments and replicates. There is an
important statistical reason for this, because all the statistical procedures are based on the
fundamental assumption that variation is random: in other words, that it is determined by
chance alone. But "blocking" becomes useful, or even essential, if we know (or strongly
suspect) that "extraneous" factors will introduce variation that is irrelevant to the effects we
wish to test and that might mask the effects of our treatments. Here are two examples, and you
could think of many more along the same lines.

1. Suppose that we want to test the effect of fertiliser treatment (or pesticide treatment, etc.)
on a crop. We set up a field trial, with plots that will receive different treatments.
Everyone knows that conditions can vary across a field - the topsoil may be thinner or
drier in some places than in others, leading to differences in crop growth. We cannot
control this, so it would be sensible to mark out 3 (or more) blocks, and to have plots
(randomised treatments) within these blocks. Then, by Analysis of Variance, we will be
able to separate the variation due to our treatments from the variation caused by soil or
site factors (which are not relevant to the question we are trying to ask).
2. Suppose that an experiment will take a long time to set up or to sample, and we cannot be
sure that the conditions will remain constant during this time - the lab might warm up, the
humidity rise, and so on. Then it is sensible to block the experiment into time periods -
before coffee, after coffee, after lunch. Of course this is not ideal. But experiments are
done in the real world.

SOME COMMON STATISTICAL TERMS


Suppose that we are measuring the size of cells, the height of trees, the biomass of microbial
cultures, the number of eggs in nests, or anything else. The thing that we are measuring or
recording (e.g. cell size, plant height, etc.) is called a variable.

Each measurement that we record (e.g. the size of each cell) is a value or observation.

We obtain a number of values (e.g. 100 for cells), and this is our sample.

The sample (e.g. 100 cells) is part of a population. In this case the population (in biological
terms) is all the cells in the culture (or all the trees in a forest, etc.). Theoretically, we could
measure every cell or tree to get a precise measure of that population. But often we want to be
able to say more than this - something of general significance, based on our sample. For
example, that if anyone were to measure the cells of that organism, then they would find a
certain average value and a certain range of variation. Here are 3 sorts of thing that you might
want to say.

- The optimum temperature for growth of the bacterium Escherichia coli is 37°C, whereas
the optimum temperature for Bacillus cereus is 30°C.
- The average height of adult men in Britain is 175 cm, whereas the average height of
women is 162 cm.
- In rats the LD50 of the drug "Zappit" is 3 mg kg-1 body weight. [LD50 is the calculated
(lethal) dose that kills 50% of the treated individuals.]

General statements such as these will always be based on a sample, because we could never test
every possible strain of E. coli, nor measure every possible adult, nor test every possible rat that
could ever live. So, in these and in many other cases the population can be considered to be
infinite. That's the sense in which statisticians use the term "population" - for all the
possible measurements or events (i.e. all the possible values of a variable) of a particular
type that there could ever be.


In statistics, we use SAMPLES to ESTIMATE the PARAMETERS of a POPULATION.

The statistical procedures are based on quite complex mathematics. But that need not concern
us at all, because the procedures are actually very simple to apply. Basically, from our sample
we calculate:

The average, or mean. The notation for the sample mean is x̄ (pronounced "X bar").
Some measure of the dispersion (range of variation) of data around the sample mean. For
this we use the variance (the notation is S²) and thence the standard deviation (S).

Having obtained those values, we use them to estimate the population mean and the
population variance. In order to distinguish between what we measure (samples) and what we
wish to estimate (populations) from the samples, many statisticians use Greek letters for the
population mean (which is denoted μ (mu) instead of x̄ for the sample mean) and the
population variance (which is denoted σ² (sigma squared) instead of S² for the sample
variance).
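These sample statistics are easy to compute. Here is a minimal sketch in Python (the cell diameters are invented for illustration):

```python
import statistics

# Hypothetical sample: diameters (micrometres) of 10 cells
sample = [8.2, 7.9, 8.5, 8.1, 7.6, 8.4, 8.0, 7.8, 8.3, 8.2]

n = len(sample)
mean = statistics.mean(sample)          # the sample mean, x-bar
variance = statistics.variance(sample)  # S squared; divides by n - 1
std_dev = statistics.stdev(sample)      # S, the square root of the variance

# x-bar estimates the population mean (mu); S squared estimates
# the population variance (sigma squared).
print(n, mean, round(variance, 4), round(std_dev, 4))
```

Note that `statistics.variance` divides by n - 1 rather than n, which is what makes S² an unbiased estimate of the population variance.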

We cover these points under Descriptive statistics. But before we move on to that, we must
consider briefly how variation is distributed in statistical terms.

If we measured cells or people or plants or biochemical reactions (as absorbance values in a
spectrophotometer) etc. we would find a range of variation. As we made more and more
measurements of this type they would display a continuous range of variation. If we were to put
these measurements into appropriate categories or class intervals (for example, all the
measurements that fall between 1.0 and less than 2.0, all between 2.0 and less than 3.0, and so
on) and then plot the numbers in each category as a histogram it would look like this:

[Histogram not reproduced: counts per class interval rising to a single central peak.]

Given enough measurements (and small enough class intervals), this would be a completely
symmetrical, bell-shaped curve. Data of this sort are said to be normally distributed. Most of
our measurements (data points) would be close to the mean, and progressively fewer would
depart widely from the mean.
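The grouping of measurements into class intervals can be sketched in a few lines of Python (the data below are invented):

```python
from collections import Counter

def class_intervals(values, width=1.0):
    """Count how many values fall in each interval
    [k * width, (k + 1) * width)."""
    bins = Counter(int(v // width) for v in values)
    return {(k * width, (k + 1) * width): bins[k] for k in sorted(bins)}

# Invented measurements, clustered around a central value
data = [2.1, 2.8, 3.2, 3.4, 3.5, 3.6, 3.9, 4.1, 4.4, 5.2]
for interval, count in class_intervals(data).items():
    print(interval, "#" * count)   # a crude text histogram
```

With only ten values the shape is ragged; with enough measurements (and narrow enough intervals) the counts would trace out the bell-shaped curve described above.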

Most of the statistical tests that we consider on this site are for normally distributed data.

But there are other types of distribution. For example, if we measured the heights of men and
women as a single population they might form a bimodal distribution - two humps with a dip
between them, because women, on average, are shorter than men. Then we should treat them as
two populations, not as a single one. Other types of data, such as counts, fall naturally into
"either/or" categories. These are treated in different ways, some of which are explained later.



This page covers four things:

THE SCIENTIFIC METHOD

Experimental Design

Designing experiments so that you can use statistics

Some common statistical terms

THE SCIENTIFIC METHOD


After reading this section, you should be able to describe the main stages involved in scientific
problem-solving.

What is THE SCIENTIFIC METHOD?

Science is usually fun. Most scientists enjoy their work, and fortunately it is sometimes useful
to society. Since scientists are people rather than machines, they behave as other people do.
They can be bad-tempered, pig-headed, jealous of the success of others and untruthful. Even so,
the pattern of work employed by scientists, the scientific method, is the most powerful tool yet
devised for the analysis and solution of problems in the natural world. The method can be
applied as much to situations in everyday life as to conventional 'scientific' problems. The world
might even be a happier place if more people attempted to solve their problems in a 'scientific'
way.

Scientific problem-solving has a number of recognisable stages:

- Observing a phenomenon that is in some way interesting or puzzling.

- Making a guess as to the explanation of the phenomenon.

- Devising a test to show how likely this explanation is to be true or false.

- Carrying out the test, and, on the basis of the results, deciding whether the
explanation is a good one or not. In the latter case, a new explanation will (with
luck) 'spring to mind' as a result of the first test.

Originality

Science depends upon original thinking at several points. One is when we make the original
'guess' - usually called an hypothesis. Another is when we devise a test or experiment to show
how likely the hypothesis is to be correct. A good scientist relies on 'inspiration' in the same


way as a good artist. [Some teachers are concerned that this point is largely ignored in present-
day science education.]

THE SCIENTIFIC METHOD in practice - two everyday examples

The examples constitute the sort of story you might read in a newspaper. We will try to analyse
these stories in terms of THE SCIENTIFIC METHOD. A general feature is the lack of a
rigorous test situation in the original stories; the strength of THE SCIENTIFIC METHOD
largely lies in devising tests that are capable of discriminating between different hypotheses.

1. The dog that understands French

Mr Smith of Morningside has taught his dog Rover to understand French. Mr Smith noticed that
every evening, after dinner, when he went to the door with his coat on and said "Walkies",
Rover immediately understood and came running. Mr Smith was going to France for the
summer, and, as an experiment in international understanding, decided to teach Rover French.
He started to say "Allons" instead of "Walkies". To his delight, Rover very quickly understood
and came running.

Analysis

- What is the observation here?

The dog apparently responds to the call of 'allons'.

- What is Mr Smith's hypothesis?

The dog understands the word as meaning 'walkies'.

- Is this the only explanation?

No. The dog may respond to a total situation (after dinner,
going to the door, coat on, call) of which what is actually
called is only a small part. A change in the call may not
matter much to the dog.

- Can we devise a test to discriminate between these possibilities?

There are several. For example:

(a) Call "Allons" to the dog in a different situation: for
example, in the morning when he does not usually go for a
walk.

(b) Go to the door, in the normal way (coat on, after
dinner) without calling anything.

(c) Do likewise and call something silly like "Rhubarb".

The results of these and similar tests should indicate whether Rover is specifically responding to
the word "allons", or (more likely) to an overall situation he is well used to.

Notice that these tests do not tell us anything of a dog's ability to learn French words. They are
only concerned with the specific case of responding to one French word. We will see later that


extrapolating from the specific to the general is very important in scientific methodology.

2. Long-term success of a foreteller of the future

The Institute for Psychical Research conducted a study on the performance of well-known
fortune-tellers. The most positive results involve Arnold Woodchuck who, at the start of each
year, makes a series of ten predictions for the coming year in a national tabloid newspaper. The
Institute has found that Mr Woodchuck is consistently correct in about 80% of his predictions.
For example, for 1995 he predicted a political crisis in Europe (the former Yugoslavia?), a
major human disaster in Africa (Rwanda?), a dispute over public sector pay (nurses?) and the
demise of a prominent football manager (Mr Graham?). He was unfortunately wrong in
predicting that England would win the Rugby Union World Cup. A spokesman for the Institute
was 'optimistic' about future studies on Mr Woodchuck.

Analysis

The apparent observation is that Mr Woodchuck has got more predictions correct than would
have been expected by chance. The Institute's hypothesis would be that Mr Woodchuck has
some kind of 'psychic powers'. Can we devise an alternative hypothesis?

We are dealing here with probability. If we toss an unbiased coin we get on average the same
number of heads as tails. If we asked someone to predict the outcome of the toss, we would not
be terribly surprised if from a small number of trials, he got 4 out of 5 right. But if he continued
to achieve 80% success over a long series, we would begin to suspect: (1) a biased coin; (2)
cheating; (3) psychic powers.
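The chance of such a streak can be worked out exactly from the binomial distribution. A short sketch (the 50-toss comparison is added here purely for contrast):

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent
    trials, each with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Guessing 4 or more of 5 fair coin tosses correctly:
print(prob_at_least(4, 5))    # 6/32 = 0.1875 - not surprising at all

# Keeping up an 80% hit rate over 50 tosses is far less likely:
print(prob_at_least(40, 50))
```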

Mr Woodchuck regularly gets 80%. Is his 'coin' biased, is he cheating, or does he have psychic
powers? The most likely explanation is the 'biased coin' one, i.e. that the events he predicts do
not have a 1:1 probability, but perhaps a probability nearer to 4 or 5:1 on; in other words, a very
high probability that they will occur.

We have therefore two kinds of test:

- An analysis of the real probabilities of the events predicted.

For example, almost invariably every year there is at least one 'political crisis' in Europe and a
'major human disaster' in Africa. Similarly, football managers have a short shelf-life. Public
sector employees (such as nurses, railway signalmen or indeed University teachers) have for
years perceived themselves to be underpaid whilst their masters either cannot or will not
respond appropriately. In contrast, the chances of England's winning the Rugby Union World
Cup were over-stated by the English press - and this is a prediction that failed.

- An analysis of Mr Woodchuck's success in making correct predictions of events
which really do have a low probability.

Again, the results of this investigation would be limited. They would probably show that the
'biased coin' explanation is the most likely. They would not show (a) whether Mr Woodchuck
has some kind of psychic power; or (b) whether psychic powers are possible.

Notice also that even a large deviation from an expected result can occur by chance in a small
sample (e.g., getting 4 out of 5 coin-tossing guesses right). This is very important in Biology,
and the basis of the use of statistical methods in biological analysis.

EXPERIMENTAL DESIGN

After reading this section you should be able to discriminate between good and bad
experimental design.

Experimental Design

The design of a suitable experiment to test an hypothesis often requires some ingenuity and a
suspicious nature. In modern biology, the experiment may involve very sophisticated
equipment. But there are a number of features common to all good experiments (and often
absent from bad ones) which exist whatever the technical details. In summary these are:

Discrimination

Experiments should be capable of discriminating clearly between different hypotheses. It often
turns out that two or more hypotheses give indistinguishable results when tested by
poorly-designed experiments.

Replication and generality

Living material is notoriously variable. Usually experiments must be repeated enough times for
the results to be analysed statistically. Similarly, because of biological variability, we must be
cautious of generalising our results either from individual creatures to others of the same
species, or to other species. For instance, if our hypothesis is about mammals, it is inadequate
simply to carry out our experiments on laboratory rats. Similarly, it is dangerous to extrapolate
from healthy students to elite athletes.

Controls

The experiment must be well controlled. We must eliminate by proper checks the possibility
that other factors in the overall test situation produce the effect we are observing, rather than the
factor we are interested in.

An example: Growth hormone is secreted in response to a number of agents, including the
amino acid arginine. This was shown by injecting volunteers with arginine. As a control, the
investigators injected the volunteers with a saline solution. To their surprise, growth hormone
was again secreted. The investigators then waved a syringe and needle in front of their
volunteers, and found that that provoked growth hormone secretion too. Growth hormone is
now known to be secreted in response to stress (as well as arginine).

At a more technical level, we must be sure that our method of measurement is reproducible
from day to day, between operators in the same laboratory, or between laboratories. Whilst we
might be confident about a balance or a ruler, can we be as sure about, say, a method for
measuring haemoglobin? Do two groups of students measuring the same samples by the same
methods produce the same results? Quality control helps here.

'Blind' Designs

Investigators can subconsciously 'fudge' their data if they know what result they want to find.
The answer is to do the experiment 'blind', so the investigators (and the subjects, if humans are
being studied) do not know which treatment's effect they are observing. This can make the
logistics of doing the experiment more complex: for example, when determining the
haemoglobin concentration of male and female class members.

There is a story about a professor who devised a maze for measuring the intelligence of rats.
One day he gave his technicians, who actually made the measurements, three groups of rats. He


told them one group had been specially bred for intelligence, one for stupidity and the third was
average. The technicians assessed the rats' intelligence and confirmed that the 'bright' group
performed the best and the 'stupid' group the worst. The point is, of course, that the professor
had put animals into the three groups at random. They did not differ in intelligence.

Measurement

Good experiments often, though not always, involve measuring something: a weight, say. When
you make measurements, it is important you know both the accuracy and the precision of your
measuring system. These two terms are not synonymous: 'accuracy' means the ability of the
method to give an unbiased answer on average, whereas 'precision' is an index of the method's
reproducibility. Ideally your method should be both accurate (i.e., give the true mean) and
precise (i.e., have a low standard deviation). Sometimes one is more important than the other.
For example, if you were looking for small changes with time in a quantity (such as an athlete's
haemoglobin concentration), you would need a precise measure of it rather more than an
accurate one.
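The distinction can be made concrete with a small sketch. The "true value" and the two sets of readings below are invented: method A is accurate and precise, method B is precise but biased:

```python
import statistics

TRUE_VALUE = 150.0   # hypothetical true haemoglobin concentration (g/L)

method_a = [149, 151, 150, 148, 152, 150, 149, 151, 150, 150]
method_b = [160, 161, 159, 160, 162, 158, 160, 161, 159, 160]

for name, readings in [("A", method_a), ("B", method_b)]:
    mean = statistics.mean(readings)
    bias = mean - TRUE_VALUE             # accuracy: distance of the mean from the truth
    spread = statistics.stdev(readings)  # precision: reproducibility of the readings
    print(name, mean, round(bias, 1), round(spread, 2))
```

Method B would still serve for following small changes in haemoglobin over time, despite its bias.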

Accuracy and precision together help you to judge the reliability of your data. They also help
you to judge to how many significant figures you should quote your results. For example, if you
use a balance reading to the nearest gram, you should give the results to the nearest gram and
not, say, to the nearest tenth of a gram.

Some experiments are very difficult to do because it is not obvious what can be measured. This
is a real problem in animal behaviour: for example, there is no obvious unit or measure for
'emotional state'. It is usually necessary to isolate measurable components of behaviour. Thus
the speed at which a tiger paces up and down a cage can give some indication of the internal
state of the animal but can never give a full picture of it.

Many of these points are rather abstract, but they should become clearer when you think about
the following examples.

Example 1: Do plants give off water vapour?

Experiment:

Forty bean plants, growing in pots, were covered one afternoon by individual glass containers
and left in the laboratory overnight. Next morning, the inside of the lid of each container was
found to be covered in droplets of a fluid which proved to be water.

Conclusion:

Plants generally give off water vapour.

Critique

1. Lack of controls.

The water could have come from the plants, the soil, the pots, or the air in the jar.
Control experiments should have been set up to test for these possibilities.

2. The conclusion contains some points that are not valid.

(a) The experiment was done overnight and so can tell us nothing about the
behaviour of the plants at other times of day; the word 'generally' is not justified.


(b) It was carried out with an adequate number of bean plants but can tell us
nothing about other kinds of plants; the word 'plants' should be qualified.

(c) There is no evidence in the experiment that water is given off as a vapour.

Example 2: Is your supermarket's 'own brand' of washing powder as good as a
nationally-advertised one?

Eric Triton bemoaned the fact that his wife Ariel insisted on washing his clothes with their local
supermarket's own brand of powder. He was sure the well-known brand he saw performing
miracles on television most evenings would do better. He therefore set out to prove as much.

Mr Triton decided to compare the effectiveness of the two products on what his wife called
'difficult' dirt: grass stains on white linen handkerchiefs. He bought 4kg of the well-known
brand for £5.17 in their supermarket and noted that the same weight of the own-brand powder
would have cost £4.47. He followed the instructions on the packets exactly, weighing out the
same amount of powder and using their washing machine's programme for white linens. Mr
Triton was aware of the need for an index of 'cleanliness' and therefore devised a subjective
scale, ranging from 10 ('whiter than white') to 0 (the starting level of dirtiness).

Mr Triton's belief was substantially confirmed. He scored the handkerchief cleaned by the
national brand an impressive 8, whereas the own-brand powder only managed 7. Triumphantly,
he reported the outcome to his wife. Mrs Triton, however, was unimpressed. She pointed out to
her husband that there were several flaws in his experiment and convinced him that the outcome
was 'not proven'.

- What do you think were the flaws that Mrs Triton could have detected?

- How should Mr Triton have done the experiment?

Further reading: Barnard C, Gilbert F and McGregor P (1993) Asking Questions in Biology,
Longmans.

DESIGNING EXPERIMENTS TO USE STATISTICS


There is a story about an eminent Professor at Cambridge who gave a paper at a scientific
meeting and was asked by a questioner "what statistical test did you use to verify your results?"
The Professor explained that he used his own statistical test:

"In our Department we have a long corridor with a notice board at one end. I draw a
histogram of my results, pin it to the notice board, then walk to the other end of the
corridor. If I can still see a difference between the treatments then it's significant"

The relevance of this story lies in what it does not say! If an experiment is designed and
executed properly - as we would expect of an eminent scientist - then the results often speak for
themselves. For example, this might be true of experiments in which mutants are generated (or
genes inserted) in an organism, giving a clear change of behaviour such as resistance to an
antibiotic or expression of a new trait. Such "all or nothing" effects seldom need to be backed
by statistical tests, but they still need good experimental design.

However, in many areas of biology we work with variable effects - differences in the growth
rates of organisms, quantitative differences in antibiotic resistance or in size or in rates of
biochemical reactions, etc. Then we not only need statistical tests to analyse those differences
but we also need good experimental design to ensure that we haven't biased our results in some

way, without realising it.

Good experimental design is the key to good science. But it's not as easy as it
might seem.
In many cases good experimental design involves having a clear idea about how we will
analyse the results when we get them. That's why statisticians often tell us to think about the
statistical tests we will use before we start an experiment.

Three important steps in good experimental design


1. Define the objectives. Record (i.e. write down) precisely what you want to test in an
experiment.

2. Devise a strategy. Record precisely how you can achieve the objective. This includes
thinking about the size and structure of the experiment - how many treatments? how many
replicates? how will the results be analysed?

3. Set down all the operational details. How will the experiment be performed in practice? In
what order will things be done? Should the treatments be randomised or follow a set structure?
Can the experiment be done in a day? Will there be time for lunch? etc.

If all this sounds trivial or obvious, then read on. It's not as easy as you think!

Example 1. Experiments that yield no useful results because we did not collect enough
data

Suppose that we want to test the results of a Mendelian genetic cross. We start with 2 parents of
genotype AABB and aabb (where A and a represent the dominant and recessive alleles of one
gene, and B and b represent the dominant and recessive alleles of another gene).

We know that all the F1 generation (first generation progeny of these parents) will have
genotype AaBb and that their phenotype will display both dominant alleles (e.g. in fruit flies all
the F1 generation will have red eyes rather than white eyes, and normal wings rather than
stubby wings).

This F1 generation will produce 4 types of gamete (AB, Ab, aB and ab), and when we self-cross
the F1 generation we will end up with a variety of F2 genotypes (see the table below).

              Gametes
Gametes    AB      Ab      aB      ab
   AB     AABB    AABb    AaBB    AaBb
   Ab     AABb    AAbb    AaBb    Aabb
   aB     AaBB    AaBb    aaBB    aaBb
   ab     AaBb    Aabb    aaBb    aabb

All these genotypes fall into 4 phenotypes (marked by colours in the original table): double
dominant, single dominant A, single dominant B and double recessive. And we know that in
classical Mendelian genetics the ratio of these phenotypes is 9:3:3:1.
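The whole table can be generated mechanically, which is a useful check. The sketch below pairs every gamete with every other and counts the four phenotypes, recovering the 9:3:3:1 ratio:

```python
from collections import Counter
from itertools import product

gametes = ["AB", "Ab", "aB", "ab"]

def phenotype(g1, g2):
    """Phenotype of an F2 individual: a dominant (upper-case)
    allele at a locus masks the recessive one."""
    return ("A" in g1 + g2, "B" in g1 + g2)

counts = Counter(phenotype(g1, g2) for g1, g2 in product(gametes, repeat=2))
# (True, True)   double dominant    : 9
# (True, False)  single dominant A  : 3
# (False, True)  single dominant B  : 3
# (False, False) double recessive   : 1
print(counts)
```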


Most people also know that we use a chi squared test to analyse the results of genetic crosses:
we do our experiment, count the number of F2 progeny that fall into the different categories,
and test to see if our results agree with an expectation. In this case, the expectation would be
that the results fit a 9:3:3:1 ratio.

But what you might not know is that a chi squared test is only valid if the expected
count in every category is 5 or more (it does not matter what the actual count is in each
category, but the expected count must be 5 or more). In other words, we MUST have at least 80
F2 progeny in order to use the chi squared test for this experiment, because then the smallest
category - double recessive - would have an expected 5 individuals in it (one-sixteenth of 80
being 5). [A fuller explanation of this is given in Chi squared test.]
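Given enough progeny, the statistic itself is simple to compute. A minimal sketch (the observed counts are invented for illustration):

```python
def chi_squared(observed, ratio):
    """Chi squared statistic for observed counts against an expected ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    # The test is only valid if every EXPECTED count is at least 5
    assert all(e >= 5 for e in expected), "too few progeny for a valid test"
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented F2 counts for 160 progeny (expected 90:30:30:10)
print(round(chi_squared([95, 27, 30, 8], [9, 3, 3, 1]), 3))
```

The calculated value is compared with the tabulated chi squared value for 3 degrees of freedom (7.81 at P = 0.05); a smaller calculated value means the counts are consistent with the 9:3:3:1 expectation.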

Similarly, for comparing two counts (e.g. counts from dilution plating of bacteria) by means of
a Poisson distribution, you will need to count about 30 colonies at the chosen dilution level.
[The principles underlying the Poisson distribution do not hold for counts lower than this]
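The text does not spell out the Poisson comparison here, so the following is only a sketch of one widely used approximation: because a Poisson count's variance equals its mean, the difference between two counts of about 30 or more is approximately normal.

```python
from math import sqrt

def poisson_z(count1, count2):
    """Approximate z statistic for comparing two Poisson counts.
    The variance of (count1 - count2) is estimated by count1 + count2,
    since for a Poisson variable the variance equals the mean."""
    return (count1 - count2) / sqrt(count1 + count2)

# Hypothetical colony counts from two dilution plates
z = poisson_z(50, 30)
print(round(z, 3))   # |z| > 1.96 suggests a real difference at P = 0.05
```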

As a different example along the same lines, we might want to compare the biomass produced
by plant callus culture in flasks containing different nutrient solutions. We know that we need
more than one flask of each nutrient solution (i.e. we need replicates), and we will use
Student's t-test to compare the mean growth in each solution. [Basically, a t-test compares the
difference between the two means in relation to the amount of variation within the treatments.
In other words, we get a significant result if the difference between the means is large and/or the
variation between replicates is small].

So, how many replicates should we use? This is a matter of judgement (and the available
resources) but if we look at a t-table we can make some rational decisions. If we use 2 flasks for
each treatment (4 flasks in total), we would have 2 degrees of freedom. This term is explained
elsewhere, but for now we can note that the number of degrees of freedom for each treatment is
one less than the number of replicates. In other words, with 2 treatments of 2 flasks each we
have 2 degrees of freedom. With 2 treatments of 10 flasks each we have 18 degrees of freedom.

When we analyse our results by Student's t-test, we calculate a t value and compare it with the t
value for probability of 0.05 in the t-table. Our treatments differ significantly if the calculated t
value is greater than the tabulated value.

Look at the tabulated t value (4.30) for 2 degrees of freedom. It is quite high, and we would
only find a significant difference between our treatments if we have quite a large difference
between the means and also little variation in our replicates. But if we used 4 replicates of each
treatment (6 degrees of freedom) we would have a much better chance of finding a significant
difference (t value of 2.45) between the same means. But look even further down the t-table -
e.g. downwards from 10 degrees of freedom (t-value 2.23) - and we see that we would gain very
little by using any more replicates. We would be in the realm of diminishing returns, gaining
very little for all the extra time and resources.
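The diminishing returns are easy to see by tabulating degrees of freedom against the critical t values. The values below are those quoted in the text, plus the 18-degrees-of-freedom entry from standard two-tailed tables (P = 0.05):

```python
# Critical t values (P = 0.05, two-tailed)
CRITICAL_T = {2: 4.30, 6: 2.45, 10: 2.23, 18: 2.10}

def degrees_of_freedom(n_per_treatment, n_treatments=2):
    """For a t-test, each treatment contributes one less degree of
    freedom than its number of replicates."""
    return n_treatments * (n_per_treatment - 1)

for replicates in (2, 4, 6, 10):
    df = degrees_of_freedom(replicates)
    print(replicates, "replicates:", df, "df, critical t =", CRITICAL_T[df])
```

The big drop comes between 2 and 4 replicates; beyond about 6 replicates per treatment the critical value barely moves.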

The message from these examples is that knowledge of the statistical test that we will use
helps us to design our experiment properly.

Example 2. Experiments that seem to give useful results but our procedures let us down!
Under this heading we deal with the actual process of doing an experiment - a task with many
hidden pitfalls.

Suppose we decide to compare 4 treatments, with 4 replicates each - a total of 16 flasks of


bacteria, 16 potted plants, 16 biochemical reactions to measure with a spectrophotometer, etc.


We do the experiment, get nice results, analyse them (the appropriate test would be Analysis of
Variance) and find significant differences between the treatments. We write up the results, get a
Nobel Prize, or a good mark, or whatever. End of story.

Or is it? The answer to that question depends on how we did the experiment. For example,
there might have been a good "practical" (i.e. convenient) reason for setting up all replicates of
treatment 1, then (for example) changing the pipette and setting up all replicates of treatment 2,
and so on. The problem is: how can we be sure that the difference we found between treatments
was due to the treatments themselves, and was not influenced by the order in which we set them
up? Even if we DO feel sure, our experiment is inherently biased and nobody would trust the
results if we said how we did it! [There is an almost infinite number of reasons why the
conditions might change during the time taken to set up an experiment. For example, we might
get progressively more efficient, or more tired. The temperature of the water bath (or whatever)
might change slightly during this time. Each pipette will be slightly different from the next, etc.
etc.]

So, what about doing one replicate of treatment 1, then one of treatment 2, then 3, then 4, and
then doing a second replicate of treatment 1, a second of treatment 2, and so on? In truth, this
would remove only some of the inherent bias - on average, treatment 1 is still being set up
before treatment 2, etc.

The only way to overcome this is to plan in advance. We have basically two options.

We could randomise the order in which each test is set up, so that it is entirely a matter of
chance whether "treatment 1, replicate 1" is done first or last or somewhere in the middle.
We could structure the setting up (or the sampling of the experiment) so that one replicate
of each treatment is done first (in random order), then the next replicate of each treatment
is done (again in random order), and so on. This would be sensible if we knew (or
suspected) that the conditions are likely to change during the setting up of an experiment,
because Analysis of Variance (the statistical test we would use for our results) could
separate out the effect of "setting up time" from the effect of "treatments". This sort of
experimental design is termed a randomised block design. It is illustrated in the grid
below, where the rows represent blocks and the order of treatments (A, B, C and D)
within the blocks is determined randomly. [But note that the grid below is perfect - you
would seldom get that by randomly ordering the treatments within blocks. A grid of this
type is called a Latin square, and for some purposes it is probably better than a
randomised block]

Row 1 A B C D
Row 2 D C A B
Row 3 B A D C
Row 4 C D B A
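Generating a randomised block design is straightforward; this sketch shuffles the treatment order independently within each block (being random, it will usually not form a perfect Latin square like the grid above):

```python
import random

def randomised_block(treatments, n_blocks, seed=None):
    """Return one block (row) per replicate, with the order of
    treatments shuffled independently within each block."""
    rng = random.Random(seed)
    design = []
    for _ in range(n_blocks):
        block = list(treatments)
        rng.shuffle(block)   # random order within this block
        design.append(block)
    return design

for i, row in enumerate(randomised_block("ABCD", 4), start=1):
    print("Row", i, " ".join(row))
```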

And one last point - the same principles should be applied to other practical aspects of an
experiment. For example, have you ever tested the temperature gradient in a laboratory
incubator or (worse still) in a walk-in growth room? The temperature gauge might record
"30°C" but there is likely to be a vertical (or lateral) temperature gradient of 2-3°C or more. So
never put all the replicates of one treatment together. Randomise them.

To block or not to block?


For most experiments we would simply randomise the treatments and replicates. There is an
important statistical reason for this, because all the statistical procedures are based on the
fundamental assumption that variation is random: in other words, that it is determined by
chance alone. But "blocking" becomes useful, or even essential, if we know (or strongly
suspect) that "extraneous" factors will introduce variation that is irrelevant to the effects we
wish to test and that might mask the effects of our treatments. Here are two examples, and you
could think of many more along the same lines.

1. Suppose that we want to test the effect of fertiliser treatment (or pesticide treatment, etc.)
on a crop. We set up a field trial, with plots that will receive different treatments.
Everyone knows that conditions can vary across a field - the topsoil may be thinner or
drier in some places than in others, leading to differences in crop growth. We cannot
control this, so it would be sensible to mark out 3 (or more) blocks, and to have plots
(randomised treatments) within these blocks. Then, by Analysis of Variance, we will be
able to separate the variation due to our treatments from the variation caused by soil or
site factors (which are not relevant to the question we are trying to ask).
2. Suppose that an experiment will take a long time to set up or to sample, and we cannot be
sure that the conditions will remain constant during this time - the lab might warm up, the
humidity rise, and so on. Then it is sensible to block the experiment into time periods -
before coffee, after coffee, after lunch. Of course this is not ideal. But experiments are
done in the real world.
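The randomised-block layout described in example 1 can be sketched in a few lines of Python. The treatment names here are hypothetical; the point is that every treatment appears exactly once in each block, but in a random order within the block.

```python
import random

# Hypothetical treatments for the field-trial example.
treatments = ["control", "fertiliser A", "fertiliser B", "fertiliser C"]

def randomised_blocks(treatments, n_blocks, seed=1):
    """Assign every treatment to a random plot position within each block."""
    rng = random.Random(seed)      # fixed seed so the layout is repeatable
    layout = []
    for _ in range(n_blocks):
        plots = treatments[:]      # each treatment appears once per block...
        rng.shuffle(plots)         # ...but in a random order within the block
        layout.append(plots)
    return layout

layout = randomised_blocks(treatments, n_blocks=3)
for i, block in enumerate(layout, start=1):
    print(f"Block {i}: {block}")
```

Fixing the seed is only for reproducibility of the printed layout; for a real trial you would let the generator vary.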

SOME COMMON STATISTICAL TERMS


Suppose that we are measuring the size of cells, the height of trees, the biomass of microbial
cultures, the number of eggs in nests, or anything else. The thing that we are measuring or
recording (e.g. cell size, plant height, etc.) is called a variable.

Each measurement that we record (e.g. the size of each cell) is a value or observation.

We obtain a number of values (e.g. 100 for cells), and this is our sample.

The sample (e.g. 100 cells) is part of a population. In this case the population (in biological
terms) is all the cells in the culture (or all the trees in a forest, etc.). Theoretically, we could
measure every cell or tree to get a precise measure of that population. But often we want to be
able to say more than this - something of general significance, based on our sample. For
example, that if anyone were to measure the cells of that organism, then they would find a
certain average value and a certain range of variation. Here are 3 sorts of thing that you might
want to say.

The optimum temperature for growth of the bacterium Escherichia coli is 37°C, whereas
the optimum temperature for Bacillus cereus is 30°C.
The average height of adult men in Britain is 175 cm, whereas the average height of
women is 162 cm.
In rats the LD50 of the drug "Zappit" is 3 mg kg⁻¹ body weight. [LD50 is the calculated
(lethal) dose that kills 50% of the treated individuals]

General statements such as these will always be based on a sample, because we could never test
every possible strain of E. coli, nor measure every possible adult, nor test every possible rat that
could ever live. So, in these and in many other cases the population can be considered to be
infinite. That's the sense in which statisticians use the term "population" - for all the
possible measurements or events (i.e. all the possible values of a variable) of a particular
type that there could ever be.


In statistics, we use SAMPLES to ESTIMATE the PARAMETERS of a POPULATION.

The statistical procedures are based on quite complex mathematics. But that need not concern
us at all, because the procedures are actually very simple to apply. Basically, from our sample
we calculate:

The average, or mean. The notation for the sample mean is x̄ (pronounced "X bar").
Some measure of the dispersion (range of variation) of data around the sample mean. For
this we use the variance (the notation is S²) and thence the standard deviation (S).

Having obtained those values, we use them to estimate the population mean and the
population variance. In order to distinguish between what we measure (samples) and what we
wish to estimate (populations) from the samples, many statisticians use Greek letters for the
population mean (which is denoted μ (mu) instead of x̄ for the sample mean) and the
population variance (which is denoted σ² (sigma squared) instead of S² for the sample
variance).

We cover these points under Descriptive statistics. But before we move on to that, we must
consider briefly how variation is distributed in statistical terms.

If we measured cells or people or plants or biochemical reactions (as absorbance values in a
spectrophotometer) etc. we would find a range of variation. As we made more and more
measurements of this type they would display a continuous range of variation. If we were to put
these measurements into appropriate categories or class intervals (for example, all the
measurements that fall between 1.0 and less than 2.0, all between 2.0 and less than 3.0, and so
on) and then plot the numbers in each category as a histogram, we would see the following pattern.

Given enough measurements (and small enough class intervals), this would be a completely
symmetrical, bell-shaped curve. Data of this sort are said to be normally distributed. Most of
our measurements (data points) would be close to the mean, and progressively fewer would
depart widely from the mean.

Most of the statistical tests that we consider on this site are for normally distributed data.
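The "most measurements close to the mean" property can be checked with Python's standard statistics module: for any normal distribution, fixed fractions of the observations are expected to lie within 1, 2 or 3 standard deviations of the mean.

```python
from statistics import NormalDist

# A standard normal distribution (any mean/sigma gives the same fractions).
nd = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Expected fraction of observations within k standard deviations of the mean."""
    return nd.cdf(k) - nd.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {fraction_within(k):.4f}")
# within ±1 SD ≈ 0.6827, ±2 SD ≈ 0.9545, ±3 SD ≈ 0.9973
```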

But there are other types of distribution. For example, if we measured the heights of men and
women as a single population they might form a bimodal distribution - two humps with a dip
between them, because women, on average, are shorter than men. Then we should treat them as
two populations, not as a single one. Other types of data, such as counts, fall naturally into
"either/or" categories. These are treated in different ways, some of which are explained later.

DESCRIPTIVE STATISTICS 17/06/13 4:23 PM

This site is no longer maintained and has been left for archival purposes
Text and links may be out of date

DESCRIPTIVE STATISTICS
Sounds boring, but it's about giving the most - and most useful - information from a set of
data.

IMPORTANT: AN OVERVIEW OF THIS SECTION.

When we take measurements or record data - for example, the height of people - we cannot
possibly measure every person in the world (or, as another example, every cell of a particular
type of bacterium). Instead, we have to take a representative sample, and from that sample we
might wish to say something of wider significance - something about the population (e.g. all
the people in the world, or all the bacteria of that type). So, we use samples as estimates of
populations. But in many cases they can only be estimates, because if our sample size had been
greater (or if we had measured a different sample) then our estimate would have been slightly
different. Statistical techniques are based on probability, and enable us to make the jump from
samples to populations. But we should never lose sight of the fact that our initial sample can
only be an estimate of a population.

In the following sections we will start from a small sample, describe it in statistical terms, and
then use it to derive estimates of a population.

______________________________________
A sample
Here are some values of a variable: 120, 125, 160, 150.

We will assume that they are measurements of the diameter of 4 cells, but they could be the
mass of 4 cultures, the lethal dose of a drug in 4 experiments with different batches of
experimental animals, the heights of 4 plants, or anything else. Each value is a replicate - a
repeat of a measurement of the variable.

In statistical terms, these data represent our sample. We want to summarize these data in the
most meaningful way. So, we need to state:

the mean, and the number of measurements (n) that it was based on

a measure of the variability of the data about the mean (which we express as the
standard deviation)

other useful information derived from the mean and standard deviation, such as (1) the
range within which 95% or 99% or 99.9% of measurements of this sort would be
expected to fall - the prediction intervals, and (2) the range of means that we could
expect 95% or 99% or 99.9% of the time if we were to repeat the same type of

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress3.html#DESCRIPTIVE%20STATISTICS Page 1 of 7

measurement again and again on different samples - this is often called the confidence
interval.

Now we will go through these points, explaining the meaning of the procedures. If you are
familiar with all this, you can go straight to Describing a population: practical steps.

How to express the variability of data as variance or standard deviation

The sample variance (S²)

There is a simple formula for calculating the variance of the sample (S²). It is given below, but
first we should see how it is derived.

Each data point (measurement) in our sample differs from the mean by an amount called the
deviation (d). We could, in theory, find each d value by subtraction (keeping the sign as + or -),
then square each deviation, add all the d² values (to get the sum of squares of the deviations,
shortened to the sum of squares) and divide this by n-1 to give the variance, where n is the
number of observations in our sample. We can then obtain the standard deviation (notation,
S), which is the square root of the variance.

Why do we use n-1 and not n?

You should just accept this as standard and necessary practice! However, it has a reason, based
on the fact that we almost always use the sample variance to obtain an estimate of the
population variance (a population being all the measurements or events of the same type that
could ever be found). Put in simple terms, the population variance is derived from the sample
mean and from the deviation (d) of each measurement from the sample mean. But if we lacked
any one of these measurements (the mean or a single d value) we could calculate it from the
other information. So, with n measurements (data points) only n-1 of them are free to vary when
we know the mean - we could calculate the missing one. "n-1" is therefore the number of
degrees of freedom of our data.

The formula for calculating Sample variance

For each observation (x) the deviation (d) from the mean (x̄) is x - x̄.

Therefore d² = (x - x̄)²

Expanding this equation, we get: d² = x² - 2x·x̄ + x̄²

To obtain the sum of squares of the deviations, we sum both sides of this equation (the capital Greek letter sigma, Σ, means "sum of"):

Σd² = Σx² - 2x̄Σx + nx̄²

Because x̄ = Σx/n, this simplifies to the following important equation for the sum of squares, Σd²:

Σd² = Σx² - (Σx)²/n


Then we find the sample variance and sample standard deviation:

sample variance (S²) = Σd² / (n - 1)

standard deviation (S) = √(S²)

If we present the mean ± standard deviation, we will have summarised in just two numbers
the most important properties of the sample that we used. This also is our estimate of the mean
(μ) and standard deviation (sigma, σ) of the population.

Now we can express our data as x̄ ± S.

This is the conventional way in which you see data published. For example, if the four values
(120, 125, 160, 150) given earlier were the diameters of four cells, measured in micrometres,
then we would say that the mean cell diameter was 138.8 ± 19.31 µm (see the worked example
later).
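Using the worked-example data from this page, a short Python sketch confirms that the shortcut formula for the sum of squares gives exactly the same answer as summing the squared deviations directly:

```python
import math

# Worked-example data from this page (cell diameters in micrometres).
data = [120, 125, 160, 150]
n = len(data)
mean = sum(data) / n

# Sum of squares, first from the definition (sum of squared deviations)...
ss_by_definition = sum((x - mean) ** 2 for x in data)
# ...then from the shortcut formula  Σd² = Σx² − (Σx)²/n
ss_by_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

variance = ss_by_definition / (n - 1)   # divide by the n−1 degrees of freedom
sd = math.sqrt(variance)

print(ss_by_definition, ss_by_shortcut)   # both 1118.75
print(round(variance, 4), round(sd, 3))   # 372.9167 and 19.311
```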

Further steps: the standard error of a mean


What we have done so far is useful, but not useful enough! If we think about it, we will see that
the estimate of the population mean (i.e. all the measurements that we ever could make of the
same type - the diameter of this type of cell, etc.) has been fixed by the sample of four
individuals. If we did the experiment again by measuring another 4 cells then we almost
certainly would get a different mean.

What we really want to know is "how good is our estimate of the mean?", not "how much
variation was found around this particular estimate of the mean?" We do this by calculating the
standard error (standard deviation of the mean).

Imagine that we repeat the experiment again and again, with a different set of cells (or other
types of variable). Each time, we would get a slightly different mean, but if we were to plot a
frequency distribution (histogram) of the means then it would show a normal distribution.

We could find the mean of the means and then calculate a standard deviation of it (not the
standard deviation around a single mean). By convention, this standard deviation of the mean is
called the standard error (SE) or standard error of the mean (SEM).

The notation for the standard error of the mean is σn


We do not need to repeat our experiment many times for this, because there is a simple
statistical way of estimating σn, which is based on: σn = σ / √n. (For this we are using S as an
estimate of σ.)

So, if we had a sample of 4 values (120, 125, 160, 150) and the mean with standard deviation
(x̄ ± σ) was 138.8 ± 19.31 µm, then the mean with standard error (x̄ ± σn) would be 138.8
± 9.65 µm, because we divide σ by √n (which is 2) to obtain the standard error.

Note 1. You don't need to understand the mathematics behind the use of √n. Look on it as a
"sliding scale" that compensates for the number of values (data points) in the original sample.
Note 2. If you go to Deciphering the data in publications, you will see the value of expressing
results as mean ± standard error.

Confidence intervals of a mean


A calculated value for a standard deviation or a standard error has little practical use in itself.
But it becomes meaningful when we use it to calculate confidence intervals. We can do this
easily by multiplying a standard deviation or a standard error by a t value obtained from a table
of t. The confidence intervals show us the range within which 95% or 99% or 99.9% of
observations could be expected to lie.

We will illustrate this with the 4 values that we mentioned above (120, 125, 160, 150).

We found a mean with standard error of the mean (138.8 ± 9.65 µm). To put confidence
intervals on this, we multiply 9.65 by a t value.

If we had measured an infinite number of cells we would consult the bottom line of a table of t
values (Student's t-test), which appears as follows.

Degrees of freedom          Probability
                      0.05        0.01        0.001
                      (95%)       (99%)       (99.9%)
∞                     1.96        2.58        3.29

We select the level of confidence we want (usually 95% in biological work - see the notes
below) and multiply σn by the tabulated value. If x̄ ± σn was 138.8 ± 9.65 µm, then the 95%
confidence interval would be 138.8 ± (9.65 × 1.96) µm, or 138.8 ± 18.91 µm. In other words, if
we were to repeat this experiment over and over again then in 95% of cases the mean could be
expected to fall within the range of values 119.89 to 157.71. These limiting values are the
confidence limits.

But our sample was not infinite - we had 4 measurements - so we use the t value corresponding
to 4 measurements, not to ∞. The t table shows degrees of freedom (df), which are always one
less than the number of observations. For 4 observations there are 3 df, because if we knew any
3 values and we also knew the mean, then the fourth value would not be free to vary.

To obtain a confidence interval, we multiply σn by a t value as before, using the df in our
original data. In our example, the 95% confidence interval would be 138.8 ± (9.65 × 3.18) µm,


or 138.8 ± 30.69 µm.
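As a quick Python check, multiplying the standard error from the worked example by the two t values shows how much the confidence interval widens when we use the correct 3 df rather than the infinite-df value. (Small rounding differences from the text arise because the unrounded standard error 9.6555 is used here instead of 9.65.)

```python
mean, se = 138.75, 9.6555       # mean and standard error from the worked example
t_values = [("infinite df", 1.96), ("3 df", 3.18)]   # 95% t values from the table

for label, t in t_values:
    half_width = t * se          # confidence half-width = t × standard error
    print(f"95% CI ({label}): {mean:.2f} ± {half_width:.2f}")
# ± 18.92 for infinite df, ± 30.70 for 3 df
```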

When do we use the standard error?

Many people prefer to cite the standard error rather than standard deviation, because it makes
the range of variation appear to be smaller! But it has real value in telling us something - for
example, that if anyone were to repeat our experiment, then the mean would be likely to fall
within the limits of t × σn.

Note the following points

1. The standard deviation is a measure of variability in our data and is an estimate of σ, an
important property of the population.

2. We choose the level of confidence we wish to place on our data. The confidence limits will
be larger if we choose a higher level of confidence (e.g. 99% compared with 95%). However,
for most biological work we use the 95% level.

3. The number of degrees of freedom determines the t value. So, when designing experiments
we have to find a compromise between the level of confidence we want and the amount of work
involved. Inspection of a t table shows that the values fall off rapidly from 2 to 5 df (3 to 6
replicate observations) but more slowly thereafter.

[Note that a t-table does not have space to give us every degree of freedom, so if the degrees of
freedom for a sample fall between two entries in a t-table (say between 60 and 120) then you
should use the t value for the lower entry (60 df), which is the more conservative choice.]

Describing a population: practical steps (see the worked example below)


1. Tabulate the data.

2. Sum the data to obtain Σx, then square this to obtain (Σx)²

3. Calculate the mean, x̄ = Σx / n

4. Square each data value and sum the squares to obtain Σx²

5. Calculate the sum of squares of the deviations, Σd² = Σx² - (Σx)²/n

6. Estimate the variance of the population (σ²) as: Σd² / (n - 1)

7. Find the estimated standard deviation of the population (σ) = square root of the variance.

8. Calculate the estimated standard error (SE) of the mean (σn) = σ / √n
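The eight steps above can be wrapped into a small Python function; run on the worked-example data it reproduces the values tabulated below.

```python
import math

def describe(sample):
    """Steps 1-8 above: mean, estimated population SD, and standard error."""
    n = len(sample)
    total = sum(sample)                      # step 2: Σx
    mean = total / n                         # step 3: x̄ = Σx / n
    sum_sq = sum(x * x for x in sample)      # step 4: Σx²
    ss_dev = sum_sq - total ** 2 / n         # step 5: Σd² = Σx² − (Σx)²/n
    variance = ss_dev / (n - 1)              # step 6: estimate of σ²
    sd = math.sqrt(variance)                 # step 7: estimate of σ
    se = sd / math.sqrt(n)                   # step 8: standard error σn
    return mean, sd, se

mean, sd, se = describe([120, 125, 160, 150])
print(round(mean, 2), round(sd, 3), round(se, 4))  # 138.75 19.311 9.6555
```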

Worked example of the data given at the top of this page: 120, 125, 160, 150.

Item                               Value             Notes/ explanation
Replicate 1                        120
Replicate 2                        125
Replicate 3                        160
Replicate 4                        150
Σx                                 555               Total (= sum of the replicates)
n                                  4                 Number of replicates
x̄                                  138.75            Mean (= total / n)
Σx²                                78125             Sum of the squares of each replicate value
(Σx)²                              308025            Total squared
Σd²                                1118.75           [= Σx² - (Σx)²/n = 78125 - 77006.25]
σ²                                 372.9167          = Σd² / (n - 1)
σ                                  19.311            = √σ²
σn                                 9.6555            = σ / √n
mean ± standard error (x̄ ± σn)     138.75 ± 9.655    [In practice, we would record this as
                                                     138.8 ± 9.66, with one more decimal place
                                                     than we measured, and an extra decimal
                                                     place for the standard error]
95% confidence limits of the       138.75 ± 30.705   [The value for t with 3 df is 3.18]
mean (x̄ ± tσn, where t has 3 df)

How would we present our results in a publication or a practical report?

The cell diameter was 138.75 ± 9.66 µm (mean ± s.e.; n = 4)

OR The lethal dose was 138.75 ± 9.66 mg (kg body weight)⁻¹ (mean ± s.e.; n = 4)
OR The plant height was 138.75 ± 9.66 cm (mean ± s.e.; n = 4)

Note that these statements contain everything that anyone would need to know about the mean!
For example, if somebody wanted to calculate a confidence interval they could multiply the
standard error by the t value (we gave them the number of replicates so they can look up the t
value). They also can decide if they want to have 95%, 99% or 99.9% confidence intervals.

In other sections of this site we shall see that the statements above give all the information we
need to test for significant differences between treatments. As one example, go to Student's t-
test.

WHAT TEST? 17/06/13 4:23 PM

This site is no longer maintained and has been left for archival purposes
Text and links may be out of date

STATISTICAL TESTS FOR SIGNIFICANCE


What test do I need?
Other parts of this site explain how to do the common statistical tests. Here is a guide to
choosing the right test for your purposes. When you have found it, click on "more
information?" to confirm that the test is suitable. If you know it is suitable, click on "go for it!"

Important: Your data might not be in a suitable form (e.g. percentages, proportions) for the
test you need. You can overcome this by using a simple transformation. Always check this -
click HERE.

1. Student's t-test

Use this test for comparing the means of two samples (but see test 2 below),
even if they have different numbers of replicates. For example, you might want
to compare the growth (biomass, etc.) of two populations of bacteria or plants, the
yield of a crop with or without fertiliser treatment, the optical density of samples
taken from each of two types of solution, etc. This test is used for "measurement
data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3
etc. You would need to transform percentages and proportions because these have
fixed limits (0-100, or 0-1).

More information?
Go for it!

2. Paired-samples test

Use this test like the t-test but in special circumstances - when you can arrange
the two sets of replicate data in pairs. For example: (1) in a crop trial, use the
"plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus"
nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a
drug treatment is compared with a placebo (no treatment), one pair might be 20-
year-old Caucasian males, another pair might be 30-year-old Asian females, and so
on.

More information?
Go for it!

3. Analysis of variance for comparing the means of three or more samples

Use this test if you want to compare several treatments. For example, the
growth of one bacterium at different temperatures, the effects of several drugs or
antibiotics, the sizes of several types of plant (or animals' teeth, etc.). You can also

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress4.html#What%20test%20do%20I%20need? Page 1 of 8

compare two things simultaneously - for example, the growth of 3 bacteria at
different temperatures, and so on. Like the t-test, this test is used for "measurement
data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3
etc. You would need to transform percentages and proportions because these have
fixed limits (0-100, or 0-1).

More information? You need this, because there are different forms of this test.

4. Chi-squared test for categories of data

Use this test to compare counts (numbers) of things that fall into different
categories. For example, the numbers of blue-eyed and brown-eyed people in a
class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing experiment.
You can also use the test for combinations of factors (e.g. the incidence of
blue/brown eyes in people with light/dark hair, or the numbers of oak and birch
trees with or without a particular type of toadstool beneath them on different soil
types, etc.).

More information?
Go for it!

5. Poisson distribution for count data

Use this test for putting confidence limits on the mean of counts of random
events, so that different count means can be compared for statistical difference.
For example, numbers of bacteria counted in the different squares of a counting
chamber (haemocytometer) should follow a random distribution, unless the bacteria
attract one another (in which case the numbers in some squares should be
abnormally high, and abnormally low in other squares) or repel one another (in
which case the counts should be abnormally similar in all squares). Very few things
in nature are randomly distributed, but testing the recorded data against the
expectation of the Poisson distribution would show this. By using the Poisson
distribution you have a powerful test for analysing whether objects/ events are
randomly distributed in space and time (or, conversely, whether the objects/ events
are clustered).

More information?
Go for it!

6. Correlation coefficient and regression analysis for curve fitting

These procedures are used for looking at the relationship between different
factors, and (if appropriate) for graphing the results in statistically meaningful
ways. For example, as the temperature (or pH, etc.) increases, does growth rate
increase or decrease? As the dose rate of a drug is increased does the response rate
of patients rise? As altitude is increased does the number of butterflies (or oak
trees) increase or decrease? Sometimes the relationship is linear, sometimes
logarithmic, sometimes sigmoidal, etc. You can test all these possibilities and, in
drug or toxicity trials (for example) calculate the LD50 or ED50 (lethal dose, or
estimated dose, for a 50% response rate).

More information?
Go for it!


==========================================

More information
Student's t-test

Use this test for comparing the means of two populations that you have
sampled (but see test 2 below). For example, you might want to compare the
growth (biomass, etc.) of two bacteria or plants, the yield of a crop with or without
added nitrogen, the optical density of samples taken from each of two types of
solution, etc.

What you will need for this test: a minimum of 2 or 3 replicates of each sample
or treatment, but ideally at least 5 replicates. For example, the yield measured for 5
fields of a crop fertilised with nitrogen and for 5 unfertilised fields, the optical
density of 5 tubes of each solution, the measurement of 5 plants of each type, etc.
Large sample sizes (10 or more) are always better than small sample sizes, but it is
easier to measure the height of 10 or 20 (or 50) plants than it is to set up 10 or 20
large-scale fermenters!

You don't need the same number of replicates of each treatment - for example,
you can compare 3 tubes of one solution with 4 tubes of another. You could also
use this test to compare several replicates of one treatment with a single value for
another treatment, but it would not be very sensitive.
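To show the arithmetic behind the test, here is a minimal pooled-variance t statistic in plain Python. The yield figures are invented for illustration, and the site's own t-test page describes the full procedure, including looking up the tabulated t value.

```python
import math

def pooled_t(sample1, sample2):
    """t statistic for two independent samples (equal-variance, pooled form)."""
    n1, n2 = len(sample1), len(sample2)
    m1, m2 = sum(sample1) / n1, sum(sample2) / n2
    ss1 = sum((x - m1) ** 2 for x in sample1)        # sums of squares of deviations
    ss2 = sum((x - m2) ** 2 for x in sample2)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)         # pooled estimate of the variance
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se_diff, n1 + n2 - 2          # t and its degrees of freedom

# Hypothetical yields: 5 fertilised fields vs 4 unfertilised fields.
t, df = pooled_t([12.1, 11.8, 12.9, 12.4, 12.6], [10.9, 11.2, 10.8, 11.5])
print(f"t = {t:.2f} with {df} df")  # t ≈ 4.89 with 7 df
```

Note that, as the text says, the two samples need not be the same size: the degrees of freedom are simply n1 + n2 − 2.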

Go for it!
Back to "What test do I need?"

Paired-samples test

Use this test like the t-test but in special circumstances - when you can arrange
the two sets of replicate data in pairs. For example: (1) in a crop trial, use the
"plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus"
nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a
drug treatment is compared with a placebo (no treatment), one pair might be 20-
year-old males, another pair might be 30-year-old females, and so on.

Why do we use the paired samples test? Because farms or people or many other
things are inherently variable, but by pairing the treatments we can remove much of
this random variability from the test of "nitrogen versus no nitrogen" or "drug
treatment versus no treatment", etc.

What are the requirements for this test? The main requirement is that the
experiment is PLANNED ahead of time. Then you can use the paired samples test
for many purposes - for example, two treatments compared on one day, then the
same two treatments compared on the next day, and so on.

In general, you will need more replicates than for a t-test (say, a minimum of 5
for each treatment), and you will need the same number of replicates for each
treatment.

But you must have a good reason to pair treatments - you should not do it
arbitrarily.
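A sketch of the paired calculation in Python, with invented yields for five farms: the test is simply a one-sample t-test on the within-pair differences, which is how the farm-to-farm variability is removed.

```python
import math

# Hypothetical paired data: yield with and without nitrogen on the same five farms.
with_n    = [8.2, 7.9, 9.1, 8.5, 8.8]
without_n = [7.6, 7.8, 8.2, 8.0, 8.1]

diffs = [a - b for a, b in zip(with_n, without_n)]   # the test works on the differences
n = len(diffs)
mean_d = sum(diffs) / n
sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
t = mean_d / (sd_d / math.sqrt(n))                   # compare with tabulated t for n−1 = 4 df
print(f"mean difference = {mean_d:.2f}, t = {t:.2f}")
```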


Go for it!
Back to "What test do I need?"

Analysis of variance for comparing the means of three or more samples.

Use this test if you want to compare several treatments. For example, the
growth of one bacterium at different temperatures, the effects of several drugs or
antibiotics, the sizes of several plants (or animals' teeth, etc.). You can also
compare two things simultaneously - for example, the growth of 3 or 4 strains of
bacteria at different temperatures, and so on.

The simplest form of this test is one-way ANOVA (ANalysis Of VAriance). Use
this to compare several separate treatments (e.g. effects of 3 or more
temperatures, antibiotic levels, crop treatments, etc.). You will need at least 2
replicates of each treatment.

One-way ANOVA tells you if there are differences between the treatments as a
whole. But it can also be used, with caution, like a multiple t-test, to tell you
which of the treatments differ from each other.
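A minimal one-way ANOVA in Python (with invented growth figures for three temperatures) shows where the F statistic comes from: the variance between treatment means is compared with the variance within treatments.

```python
def one_way_anova(groups):
    """F statistic for a one-way ANOVA: between- vs within-treatment variance."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # between-treatment and within-treatment sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between, df_within = len(groups) - 1, n_total - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical growth of one bacterium at three temperatures, 3 replicates each:
f, df1, df2 = one_way_anova([[4.1, 4.4, 4.2], [5.0, 5.3, 5.1], [4.6, 4.5, 4.7]])
print(f"F = {f:.1f} with ({df1}, {df2}) df")
```

A large F (compared with the tabulated value for those degrees of freedom) means the treatments as a whole differ significantly.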
Go for one-way ANOVA?
Back to "What test do I need?"

Another form of this test is two-way ANOVA. Use this if you want to compare
combinations of treatments. For example, to compare the growth of an organism
on several different substrates at several different temperatures. Or the effects of
two (or more) drugs singly and in combination. Or responses of crops to fertiliser
treatment on different farms or soil types. You can get useful information even if
you have one of each combination of treatments, but you get much more
information if you have 2 (or more) replicates of each combination of
treatments. Then the test can tell you if you have significant interaction - for
example, if changing the temperature changes the way that an organism responds to
a change of pH, etc.
Go for two-way ANOVA?
Back to "What test do I need?"

Chi-squared test for categories of data

Use this test to compare counts (numbers) of things that fall into different
categories. For example, to compare the numbers of blue-eyed and brown-eyed
people in a class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing
experiment. You can also use the test for looking at combinations of factors (e.g.
the incidence of blue/brown eyes in people with light/dark hair, or the numbers of
toadstools beneath oak and birch trees on different soil types, etc.).

For this test you compare the actual counts (in the different categories) with an
"expected" set of counts. Sometimes the expectation is obvious - for example,
that half of the progeny from a cross between parents Aa and aa will have the Aa
genotype and half will have aa. You have to construct an hypothesis (termed the
null hypothesis) by using logical arguments.

What are the requirements for this test? Almost any sort of "count" data can be
analysed by chi-squared, but you have to use "real" numbers, not proportions or
percentages.
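For the genetic-cross example, the chi-squared arithmetic is short enough to sketch directly. The counts are invented; for 1 df the tabulated value at P = 0.05 is 3.84, so this particular result would not reject the 1:1 hypothesis.

```python
# Hypothetical cross Aa × aa: the null hypothesis expects a 1:1 ratio of progeny.
observed = [58, 42]                    # counts of Aa and aa progeny (real numbers!)
total = sum(observed)
expected = [total / 2, total / 2]      # 50 and 50 under the 1:1 null hypothesis

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-squared = {chi_sq:.2f} with 1 df")  # 2.56
```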


Go for it!
Back to "What test do I need?"

Poisson distribution for count data

The main requirement for this test is that the mean count (of bacterial colonies,
buttercups, etc.) needs to be relatively high (say 30 or more) before it can be
expected to conform to a Poisson distribution. If you have such a high count, then
you can test whether or not your results actually do conform to the Poisson
distribution.
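The expected Poisson frequencies can be generated with a few lines of Python (the mean of 4 cells per square is invented for illustration). Comparing your observed counts against these expected fractions is the basis of the test for randomness described above.

```python
import math

def poisson_pmf(mean, k):
    """Probability of observing exactly k random events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

# If cells land at random with a mean of 4 per haemocytometer square,
# the expected fraction of squares holding exactly k cells is:
for k in range(8):
    print(f"{k} cells: {poisson_pmf(4, k):.4f}")
```

A useful property of the Poisson distribution is that its variance equals its mean, so a sample variance much larger than the mean suggests clustering, and one much smaller suggests repulsion.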

Go for it!
Back to "What test do I need?"

Correlation coefficient and regression analysis for curve fitting

These procedures are used for looking at the relationship between different
factors, and (if appropriate) for graphing the results in statistically meaningful
ways. For example, as the temperature (or pH, etc.) increases, does growth rate
increase or decrease? As the dose rate of a drug is increased does the response rate
of patients rise? As altitude is increased does the number of butterflies (or oak
trees) increase or decrease? Sometimes the relationship is linear, sometimes
logarithmic, sometimes sigmoidal, etc. You can test all these possibilities and, in
drug or toxicity trials (for example) calculate the LD50 or ED50 (lethal dose, or
estimated dose, for a 50% response rate).

There is a 3-stage procedure:

1. Plot your results on graph paper, and ask yourself: does the relationship look
(or is expected to be) linear, or is it logarithmic, or sigmoid (S-shaped)? You
might need to transform the data (see transforming data) if they are not
linear.
2. Calculate the correlation coefficient, which tells you whether the data fit a
straight line relationship (and how close the fit is, in statistical terms).
3. If the correlation coefficient is significant, and other conditions are met,
proceed to regression analysis, which gives the equation for the line of best
fit, then draw this line on your graph.
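Stages 2 and 3 can be sketched in plain Python with hypothetical growth-rate data: the correlation coefficient measures how closely the points follow a straight line, and regression gives the line of best fit itself.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient: how well the points fit a straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def regression_line(xs, ys):
    """Least-squares line of best fit: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical data: growth rate measured at five temperatures.
temp = [10, 15, 20, 25, 30]
rate = [1.1, 1.9, 3.2, 3.8, 5.1]
print(round(pearson_r(temp, rate), 3))   # 0.995
slope, intercept = regression_line(temp, rate)
print(round(slope, 3), round(intercept, 2))   # 0.198 and -0.94
```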

Go for it!
Back to "What test do I need?"

Transformation of data
1. Proportions and percentages: convert to arcsin values
Certain mathematical assumptions underlie all the statistical tests on this site. The most
important assumption is that the data are normally distributed and are free to vary widely about
the mean - there are no imposed limits. Clearly this is not true of percentages, which cannot be
less than 0 nor more than 100. If you have data that are close to these limits, then you need to
transform the original data before you analyse them.

One simple way of doing this is to convert the percentages to arcsin values and then analyse
these arcsin values. The arcsin transformation moves very low or very high values towards the


centre, giving them more theoretical freedom to vary.

[You convert percentages (x) to arcsin values (θ), where θ is the angle for which sin θ = √(x/100)]

On a calculator:

to get the arcsin value for a percentage (e.g. 50%), divide this by 100 ( = 0.5), take
the square root (= 0.7071), then press "sin-1" to get the arcsin value (= 45). [NB: if
your calculator gives the result as 0.785 then this is the angle in radians rather than
degrees]

to get the arcsin value for a proportion (e.g. 0.4), take the square root (= 0.6325),
then press "sin-1" to get the arcsin value (= 39.23).

On an "Excel" spreadsheet:

convert percentages to arcsin values (and back again) by entering a formula into the
spreadsheet - Go for it!
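Outside Excel, the same conversion is easy in any language with trigonometric functions. A minimal Python sketch (an aside; note that it returns the angle in degrees, as this site assumes):

```python
import math

def arcsin_transform(percent):
    """Convert a percentage (0-100) to an arcsin (angular) value in degrees."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))

def arcsin_back(angle):
    """Convert an arcsin value (degrees) back to a percentage."""
    return math.sin(math.radians(angle)) ** 2 * 100

print(arcsin_transform(50))   # close to 45, as in the worked example
print(arcsin_transform(0.5))  # close to 4.06
```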

2. Logarithmic transformation
Use this for two purposes:

When fitting a curve to logarithmic data (exponential growth of cells, etc). Take the
logarithm of each "growth" value and plot this against time (real values). You can use
either natural logarithms or logs to base 10. The data should now show a straight-line
relationship and can be analysed using correlation coefficient and regression.
In Analysis of Variance, when comparing means that differ widely. The reason for this
is that an analysis of variance is based on the assumption that the variance is the same
across all the data. But usually this will not be true if some means are very small and
others are very large - the individual data points for the large mean could vary widely.
[For example, a mean of 500 could be made up from 3 values of 100, 400 and 1000,
whereas a mean of 50 could not possibly include such wide variation] This problem is
overcome by converting the original data to logarithms, squeezing all the data points
closer together. Perhaps surprisingly, analysing the log-transformed data can reveal significant
differences between small and large means that would otherwise be masked by the unequal variances.
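A tiny illustration of the first use, with made-up counts growing tenfold each hour; the log values climb by a constant step, i.e. a straight line against time:

```python
import math

times = [0, 1, 2, 3, 4]                  # hours (hypothetical)
counts = [10, 100, 1000, 10000, 100000]  # cell counts growing exponentially

logs = [math.log10(c) for c in counts]   # log-transform each "growth" value
print(logs)  # close to [1, 2, 3, 4, 5] - linear in time, ready for regression
```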

3. Converting Percentages to Probits


Some types of data show a sigmoid (S-shaped) relationship. A classic case is in dosage-
response curves, for testing antibiotics, pharmaceuticals, etc. To analyse these relationships the
"percentage of patients/cells responding to a treatment" can be converted to a "probit" value,
and the dosage is converted to a logarithm. This procedure converts an S-shaped curve into a
straight-line relationship, which can be analysed by correlation coefficient and regression
analysis in the normal way. From the straight-line equation, we can calculate the LD50, ED50,
and so on.

The method for doing this in "Excel" is shown below.

Converting between percentage, arcsin and probits in ‘Excel’.


The table below shows part of a page from an ‘Excel’ worksheet. Columns are headed A-F and
rows are labelled 1-21, so each cell in the table can be identified (e.g. B2 or F11).

Representative % values were inserted in cells B2-B21.

You will now see how to convert these % values into probits or arcsin values, and back again. If
you do the relevant conversion in your own spreadsheet, you can then use the probit or arcsin
values instead of % values for the statistical tests.

In cell C2 of the spreadsheet. a formula was entered to convert Percentage to Probit values.

The formula (without spaces) is: =NORMINV(B2/100,5,1)

This formula is not seen. As soon as we move out of cell C2 it automatically gives the probit
value (in C2) for the percentage in cell B2, seen in the "printout" below. Copying and then
pasting this formula into every other cell of column C produces a corresponding probit value
(e.g. cell C3 contains the probit of the % in cell B3).

Next, a formula was entered in cell D2 to convert Probit to Percentage, and the above
procedure was repeated for all cells in column D.

The formula is: =NORMDIST(C2,5,1,TRUE)*100

The formula entered in cell E2 converts Percentage to Arcsin

The formula is: =ASIN(SQRT(B2/100))*180/PI()

The formula in cell F2 converts Arcsin to Percentage

The formula is: =SIN(E2/180*PI())^2*100

A B C D E F
1 Percent % to Probit Probit to % % to arcsin arcsin to %
2 0.1 1.91 0.1 1.812 0.1
3 0.5 2.424 0.5 4.055 0.5
4 1 2.674 1 5.739 1
5 2 2.946 2 8.13 2
6 3 3.119 3 9.974 3
7 4 3.249 4 11.54 4
8 5 3.355 5 12.92 5
9 6 3.445 6 14.18 6
10 7 3.524 7 15.34 7
11 8 3.595 8 16.43 8
12 9 3.659 9 17.46 9
13 10 3.718 10 18.43 10
14 50 5 50 45 50
15 96 6.751 96 78.46 96
16 97 6.881 97 80.03 97
17 98 7.054 98 81.87 98
18 99.5 7.576 99.5 85.95 99.5
19 99.99 8.719 99.99 89.43 99.99
20 99.999 9.265 99.999 89.82 99.999
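For readers working outside Excel, the NORMINV/NORMDIST formulas above are simply the inverse and forward cumulative normal distribution with mean 5 and standard deviation 1. A Python sketch (an aside, using only the standard library):

```python
from statistics import NormalDist

probit_dist = NormalDist(mu=5, sigma=1)  # probits are normal quantiles shifted by +5

def percent_to_probit(percent):
    """Equivalent of the spreadsheet formula =NORMINV(percent/100, 5, 1)."""
    return probit_dist.inv_cdf(percent / 100)

def probit_to_percent(probit):
    """Equivalent of =NORMDIST(probit, 5, 1, TRUE)*100."""
    return probit_dist.cdf(probit) * 100

print(round(percent_to_probit(50), 3))   # 5.0, matching the 50% row of the table
```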

T-TEST 17/06/13 4:23 PM


Student's t-test
We use this test for comparing the means of two samples (or treatments), even if they have
different numbers of replicates. In simple terms, the t-test compares the actual difference
between two means in relation to the variation in the data (expressed as the standard deviation
of the difference between the means).

Procedure
First, we will see how to do this test using "pencil and paper" (with a calculator to help with the
calculations). Then we can see how the same test can be done in a spreadsheet package
(Microsoft 'Excel')

1. We need to construct a null hypothesis - an expectation - which the experiment was


designed to test. For example:

If we are analysing the heights of pine trees growing in two different locations, a suitable
null hypothesis would be that there is no difference in height between the two locations.
The student's t-test will tell us if the data are consistent with this or depart significantly
from this expectation. [NB: the null hypothesis is simply something to test against. We
might well expect a difference between trees growing in a cold, windy location and
those in a warm, protected location, but it would be difficult to predict the scale of that
difference - twice as high? three times as high? So it is sensible to have a null hypothesis
of "no difference" and then to see if the data depart from this.

2. List the data for sample (or treatment) 1.

3. List the data for sample (or treatment) 2.

4. Record the number (n) of replicates for each sample (the number of replicates for sample 1
being termed n1 and the number for sample 2 being termed n2).

5. Calculate the mean of each sample (x̄1 and x̄2).

6. Calculate s² for each sample; call these s1² and s2². [Note that we are actually using s² as an
estimate of the population variance, σ², in each case.]

7. Calculate the variance of the difference between the two means (sd²) as follows:

sd² = s1²/n1 + s2²/n2

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress4a.html#Student's%20t-­test Page 1 of 7

8. Calculate sd (the square root of sd²).

9. Calculate the t value as follows:

t = (x̄1 − x̄2) / sd

(when doing this, transpose x̄1 and x̄2 if x̄2 > x̄1, so that you always get a positive value)

10. Enter the t-table at (n1 + n2 − 2) degrees of freedom; choose the level of significance required
(normally p = 0.05) and read the tabulated t value.

11. If the calculated t value exceeds the tabulated value we say that the means are significantly
different at that level of probability.

12. A significant difference at p = 0.05 means that if the null hypothesis were correct (i.e. the
samples or treatments do not differ) then we would expect to get a t value as great as this on less
than 5% of occasions. So we can be reasonably confident that the samples/treatments do differ
from one another, but we still have nearly a 5% chance of being wrong in reaching this
conclusion.

Now compare your calculated t value with tabulated values for lower probabilities (e.g. p = 0.01).
For example, if our calculated t value exceeds the tabulated value for p = 0.01, then there is less
than a 1% probability that a difference this large would arise by chance if the null hypothesis were
true (and less than 0.1% if the calculated t value exceeds the tabulated value for p = 0.001). By
convention, we say that a difference between means at the p = 0.05 level is "significant", a
difference at the p = 0.01 level is "highly significant" and a difference at the p = 0.001 level is
"very highly significant".

What does this mean in "real" terms? Statistical tests allow us to make statements with a degree
of precision, but cannot actually prove or disprove anything. A significant result at the 95%
probability level tells us that our data are good enough to support a conclusion with 95%
confidence (but there is a 1 in 20 chance of being wrong). In biological work we accept this
level of significance as being reasonable.

Student's t-test: a worked example


Suppose that we measured the biomass (milligrams) produced by bacterium A and bacterium B,
in shake flasks containing glucose as substrate. We had 4 replicate flasks of each bacterium.

Bacterium A Bacterium B
Replicate 1 520 230
Replicate 2 460 270
Replicate 3 500 250
Replicate 4 470 280
Σx 1950 1030 Total (= sum of the 4 replicate values)

n 4 4

x̄ 487.5 257.5 Mean (= total / n)

Σx² 952900 266700 Sum of the squares of each replicate value

(Σx)² 3802500 1060900 Square of the total (Σx). It is not the same as Σx²

(Σx)²/n 950625 265225

Σd² 2275 1475 (= Σx² − (Σx)²/n)

s² 758.3 491.7 s² = Σd² / (n−1)

sd² = 189.6 + 122.9 = 312.5 (sd² = s1²/n1 + s2²/n2, the variance of the difference between the means)

sd = √312.5 = 17.68 (the standard deviation of the difference between the means)

t = (487.5 − 257.5)/17.68 = 230/17.68 = 13.0

Entering a t table at 6 degrees of freedom (3 for n1 + 3 for n2) we find a tabulated t value of
2.45 (p = 0.05) going up to a tabulated value of 5.96 (p = 0.001). Our calculated t value exceeds
these, so the difference between our means is very highly significant. Clearly, bacterium A
produces significantly more biomass when grown on glucose than does bacterium B.

[Note that all the time-consuming calculations above can be done on a calculator with memory
and statistics functions. Guidance on this can be found in your calculator's instruction booklet.
Note also that this test and others can be run on computer packages. Below is a print-out from a
package in Microsoft "Excel"]
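The same arithmetic can also be scripted. Here is a minimal Python sketch of steps 2-9 (an aside; not part of the site's calculator/Excel workflow), using the biomass data above:

```python
import math

# Biomass (mg) from the worked example above
a = [520, 460, 500, 470]   # Bacterium A
b = [230, 270, 250, 280]   # Bacterium B

def mean(xs):
    return sum(xs) / len(xs)

def sum_sq_dev(xs):
    """Sum of squared deviations from the mean, i.e. Σd² = Σx² − (Σx)²/n."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

n1, n2 = len(a), len(b)
s1sq = sum_sq_dev(a) / (n1 - 1)           # sample variance s1² (758.3)
s2sq = sum_sq_dev(b) / (n2 - 1)           # sample variance s2² (491.7)
sd = math.sqrt(s1sq / n1 + s2sq / n2)     # sd² = s1²/n1 + s2²/n2
t = (mean(a) - mean(b)) / sd

print(f"t = {t:.2f} at {n1 + n2 - 2} degrees of freedom")
```

The result matches the hand calculation: t is about 13.0 at 6 df.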

Student's t-test: the worked example using "Excel" (Microsoft) spreadsheet

[NB: If you cannot find "Data analysis" on Excel then go into "Help" and find "statistical
analysis" in the Help index. Different versions of Excel have slightly different commands, so
you may not find the following section to be identical to the one you are using.]

The screen for "Excel" (not shown here) has cells arranged in columns A-F... and rows 1-10...
For the print-out below, row 1 was used for headings and column A for replicate numbers. The
data for Bacterium A were entered in cells B2,3,4,5 and data for Bacterium B in cells C2,3,4,5
of the spreadsheet. From the Tools option at the top of the screen, I selected Data analysis.
This displays Analysis options and from the drop-down menu I selected t-test: Two-sample
assuming equal variances. Then click OK and enter cells B2-5 for Variable range 1, cells
C2-5 for Variable range 2, and a free cell (e.g. A7) for output range (choose the top-left cell
of the area where you want the results of the analysis to be displayed). Then click OK and the
printout appears.


Replicate Bacterium A Bacterium B


1 520 230
2 460 270
3 500 250
4 470 280

t-Test: Two-Sample Assuming Equal Variances


Bacterium A Bacterium B

Mean 487.5 257.5


Variance 758.3333 491.6667
Observations 4 4
Pooled Variance 625

Hypothesized Mean 0 (The test will "ask" what is the probability of obtaining our
Difference given results by chance if there is no difference between the
population means?)
df 6

t Stat 13.01076 (This shows the t value calculated from the data)
P(T<=t) one-tail 6.35E-06

t Critical one-tail 1.943181

P(T<=t) two-tail 1.27E-05 (This shows the probability of getting our calculated t value
by chance alone. That probability is extremely low, so the
means are significantly different)
t Critical two-tail 2.446914 (This shows the t value that we would need to exceed in
order for the difference between the means to be significant
at the 5% level)

We are given the column means, the pooled variance, the number of observations (n), the null
hypothesis (that there is no difference between the population means), the degrees of freedom,
the calculated t value, and four other entries. The first two of these refer to a one-tailed t-test
(i.e. if we wish to test only that one particular mean is larger (or smaller) than the other. The
final two entries refer to a two-tailed test, where we do not specify the "direction" of the test.
For most purposes, we use a two-tailed test. In each case we are shown the probability of
obtaining our calculated t-value by chance alone, together with the critical t-value that must be
exceeded for significance (shown as "t Critical").
Note that the calculated t-value (13.0) vastly exceeds the critical t-value (two-tailed); the means
for the bacteria are significantly different at p = 1.27 × 10⁻⁵. In other words, the probability of
getting our observed difference between the means by chance alone is roughly 1 in 80,000.

The analysis for a t-test always pools the variances and, strictly speaking, it is only valid if the
variances of the two treatments are similar. In the analysis above we could have selected the
option "t-test: Two-sample assuming unequal variances". This would have given us the same
result from our particular set of data but would have shown their separate variances as 758.33
for bacterium A and 491.67 for bacterium B.

For interest, let us ask if these variances differ in statistical terms. There is a simple test
for this: divide the larger variance by the smaller (758.33 / 491.67 = 1.54) and compare this
variance ratio with a value from a table of ‘F’ (variance ratio) for p = 0.05. Each variance has
n − 1 = 3 degrees of freedom, so we enter the F table at n1 = 3 and n2 = 3 df. The tabulated F
value is 9.28. Our variance ratio (1.54) is less than this, so our variances do not differ
significantly. Our t-test was valid.

Student's t-test: deciphering the data in publications


Here are some results taken randomly from a scientific paper. Look at any scientific journal and
you will find something similar to this:

Intracellular water volume for Streptococcus mutans; µl (mg dry wt)⁻¹

pH 7.0 2.41 ± 0.06

pH 5.5 2.62 ± 0.05
(means ± SE; n = 6)

Note that the authors give us all the information that we need to test for significance. They tell
us that the values are means of 6 replicates for each pH level, with the standard errors of those
means.

If we go back to Descriptive Statistics, we will see that a standard error (sn) for a mean is
calculated from the standard deviation (s) as follows: sn = s / √n.

If we square s / √n, we get s²/n.

Now, if we go back to one of the steps in the t-test, we see that we calculate:

sd² = s1²/n1 + s2²/n2

In this step we add the separate values of s²/n for each mean.

In other words, to do a t test on the published data, all we need do is to square the standard
errors. We will do it now:

pH 7.0 pH 5.5
Mean ± SE 2.41 ± 0.06 2.62 ± 0.05
n 6 6
SE² (= s²/n) 0.0036 0.0025

sd² = 0.0036 + 0.0025 = 0.0061

sd = √0.0061 = 0.0781

t = (2.62 − 2.41)/0.0781 = 2.689


The critical t value (p = 0.05) for 10 degrees of freedom (n1 + n2 - 2) is 2.23. The calculated
value exceeds this. There is less than a 5% probability of getting a value as high as this by
chance alone. So we can be reasonably confident that pH does affect the intracellular
water volume of the bacterium.
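This "square the standard errors" shortcut is easily scripted; a short Python sketch (an aside), using the published values above:

```python
import math

# Published summary data (means, standard errors, n = 6 per group)
mean1, se1, n1 = 2.41, 0.06, 6   # pH 7.0
mean2, se2, n2 = 2.62, 0.05, 6   # pH 5.5

# sd² = s1²/n1 + s2²/n2, and SE² is exactly s²/n, so we just square the SEs
sd = math.sqrt(se1 ** 2 + se2 ** 2)
t = abs(mean2 - mean1) / sd
df = n1 + n2 - 2

print(f"t = {t:.3f} at {df} df")
```

The calculated t (about 2.69) exceeds the tabulated value of 2.23 for 10 df, as in the text.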


PAIRED SAMPLES 17/06/13 4:24 PM


Paired-samples test
Use this test as an alternative to the t-test, for cases where data can be paired to reduce
incidental variation - i.e. variation that you expect to be present but that is irrelevant to the
hypothesis you want to test.

As background, let us consider exactly what we do in a conventional t-test to compare two


samples. We compare the size of the difference between two means in relation to the amount of
inherent variability (the random error, not related to treatment differences) in the data. If the
random error is large then we are unlikely to find a significant difference between means unless
this difference is also very large.

Consider the data in the table below, which shows the number of years' remission from
symptoms (of cancer, AIDS, etc.) in two groups of patients: group A who received a new drug
and group B who received a placebo (the controls). There were 10 patients in each group, and
we will first analyse the data by conventional t-test (see Student's t-test if you are not familiar
with this).

Patient Drug Placebo

1 7 4

2 5 3

3 2 1

4 8 6

5 3 2

6 4 4

7 10 9

8 7 5

9 4 3

10 9 8

Σx 59 45

n 10 10

x̄ 5.9 4.5

Σx² 413 261

(Σx)²/n 348.1 202.5

Σd² 64.9 58.5

s² 7.21 6.5

sd² (= s1²/n1 + s2²/n2) = 1.37

sd = √1.37 = 1.17

t = (5.9 − 4.5)/1.17 = 1.2

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress5.html#Paired-samples%20test Page 1 of 6

Clearly, there is no significant difference between the means. [The smallest tabulated t value for
significant difference at p = 0.05 is 1.96.]

But drug trials are never done as randomly as this. Instead, the patients are matched as nearly as
possible to exclude the effects of extraneous variation. For example, patient 1 in each group
(drug or placebo) might be a Caucasian male aged 20-25; patient 2 in each group might be an
Asian female aged 40-50, and so on. There is every reason to suspect that age, sex, social
factors etc. could influence the course of a disease, and it would be foolish not to exclude this
variation if the purpose of the trial is to see if the drug actually has an overall effect. In other
words, we are not dealing with random groups but with purposefully paired observations. (The
same would be true if, for example, we wanted to test effects of a fungicide against a disease on
10 farms, or to test whether a range of different bacteria are sensitive to an antibiotic, etc.).

Now we will analyse the data as paired samples.

Procedure (see worked example later)

1. Subtract each control value from the corresponding treatment value and call the difference z.
(NB Always subtract in the same "direction", recording negative values where they occur)

2. Calculate Σz, Σz² and (Σz)²/n, where "n" is the number of pairs (z values)

3. Construct a null hypothesis. In this case it would be appropriate to "expect" no difference


between the groups (drug treatment versus controls). If this were true then the observed values
of z would have a mean close to zero, with variation about this mean.

4. Calculate: Σd² = Σz² − (Σz)²/n

5. Calculate: sd² = Σd² / (n − 1)

6. Square root this to find sd, then calculate the standard error of the mean difference:
sn = sd / √n

7. Find t from the equation: t = z̄ / sn (the mean of the z values divided by its standard error)

8. Consult a t table at n-1 degrees of freedom, where n is the number of pairs (number of z
values). The tabulated t value for 9 df is 2.26 (p = 0.05)

In our example the calculated t value is 5.24, which is very highly significant - it exceeds the t
value for probability (p) of 0.001. In other words, we would expect such a result to occur by
chance only once in a thousand times. So the drug is effective: we see below that it gives
remission of symptoms for 1.4 ± 0.266 years (this value is the mean ± standard error of the
mean). The confidence limits are 1.4 ± 0.6 years (mean ± t·sn).

Patient Drug Placebo Difference (z)

1 7 4 3

2 5 3 2

3 2 1 1

4 8 6 2

5 3 2 1

6 4 4 0

7 10 9 1

8 7 5 2

9 4 3 1

10 9 8 1

Σz 14

n 10

z̄ 1.4

Σz² 26

(Σz)²/n 19.6

Σd² 6.4

sd² 0.71

sd 0.84

sn 0.267

t = 1.4/0.267 = 5.24

It is instructive to consider what we have done in this analysis. We calculated the mean
difference between the pairs of patients (treatments), calculated the standard error of this
mean difference and tested it to see if it is significantly different from zero (no difference). The
following diagram should make this clear.
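In code, the whole paired analysis reduces to working with the column of differences. A Python sketch (an aside), using the patient data above:

```python
import math

drug    = [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]

z = [d - p for d, p in zip(drug, placebo)]         # paired differences
n = len(z)
zbar = sum(z) / n                                  # mean difference (1.4)
s2 = sum((zi - zbar) ** 2 for zi in z) / (n - 1)   # variance of the z values
sn = math.sqrt(s2 / n)                             # standard error of the mean difference
t = zbar / sn                                      # test against a mean of zero

print(f"t = {t:.2f} at {n - 1} df")
```

This reproduces the full-precision Excel result (t = 5.25 at 9 df).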

Paired-samples t-test: print-out from "Excel".

The example above was run on "Excel", as before (see Student's t-test) but we select t-test:
paired two sample for means from the analysis tools package, click OK, select the whole data
set (cells B2-C11) for Input variable range and a clear cell for Output range. See Student's
t-test for explanation of other relevant entries in the print-out. [The calculated t (5.25) differs
slightly from the worked example (5.24) on the previous page because the computer did not
round intermediate values during the calculations]

Patient Drug Placebo


1 7 4
2 5 3
3 2 1
4 8 6


5 3 2
6 4 4
7 10 9
8 7 5
9 4 3
10 9 8

t-Test: Paired Two Sample for Means Variable 1 Variable 2


Mean 5.9 4.5
Variance 7.211111 6.5
Observations 10 10
Pearson Correlation 0.949414

Hypothesized Mean Difference 0

df 9

t Stat 5.25

P(T<=t) one-tail 0.000264


t Critical one-tail 1.833114

P(T<=t) two-tail 0.000528 ************


t Critical two-tail 2.262159

******** Note that the two-tailed t-test shows the drug and placebo to be significantly different
at p = 0.0005 (probability of 5 in 10,000 that we would get this result by chance alone). But in
this case we would be justified in using a one-tailed test (P = 0.00026) because we are testing
whether the mean difference (1.4 years) is significantly greater than zero. Look at the ‘critical
t’ for a one-tailed test in the printout. This is the value given for p = 0.1 in a t-table (because we
are testing only for a difference in one direction - above zero - so we can double the normal
probability of 0.05). We should have used this value, not the value for p = 0.05, in our testing for
significance.

ONE-WAY ANOVA 17/06/13 4:24 PM


Analysis of variance (ANOVA) for comparing means of

three or more samples.
Use this test for comparing means of 3 or more samples/treatments, to avoid the error
inherent in performing multiple t-tests

Background. If we have, say, 3 treatments to compare (A, B, C) then we would need 3 separate
t-tests (comparing A with B, A with C, and B with C). If we had seven treatments we would
need 21 separate t-tests. This would be time-consuming but, more important, it would be
inherently flawed because in each t-test we accept a 5% chance of our conclusion being wrong
(when we test for p = 0.05). So, in 21 tests we would expect (by probability) that one test would
give us a false result. ANalysis Of Variance (ANOVA) overcomes this problem by enabling us
to detect significant differences between the treatments as a whole. We do a single test to see if
there are differences between the means at our chosen probability level.

Ideally, for this test we would have the same number of replicates for each treatment, but
this is not essential. Advanced computer programmes can overcome the problem of unequal
replicates by entering "missing values".

An important assumption underlies the Analysis of Variance: that all treatments have
similar variance. If there are strong reasons to doubt this then the data might need to be
transformed before the test can be done. In practice, there is a simple way to check for
"homogeneity of variance". We deal with this at step "3" in the procedure below.

Procedure (see worked example)


Don't be frightened by this! It looks complicated but it is actually very easy. You should
understand it, and then you can use a simple statistical programme (e.g. Microsoft "Excel") to
run the whole test.

Assume that we have recorded the biomass of 3 bacteria in flasks of glucose broth, and we used
3 replicate flasks for each bacterium. [But the test could apply equally to any sort of variable]

Step 1. Record the data in columns:

Replicate Bacterium A Bacterium B Bacterium C


1 12 20 40
2 15 19 35
3 9 23 42

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress6.html#Analysis%20of%20variance Page 1 of 7

Step 2. For each column, calculate Σx, n, x̄, Σx² and Σd².

Step 3. [A check for equal variance - the underlying assumption of this test] For each column
divide Σd² by n − 1 to obtain the variance, s². Divide the highest value of s² by the lowest value
of s² to obtain a variance ratio (F). Then look up a table of Fmax for the number of treatments in
our table of data and the degrees of freedom (number of replicates per treatment − 1). If our
variance ratio does not exceed the Fmax value then we are safe to proceed. If not, the data might
need to be transformed.

Step 4. Sum all the values of Σx² and call the sum A.

Step 5. Sum all the values of (Σx)²/n and call the sum B.

Step 6. Sum all the values of Σx to obtain the grand total.

Step 7. Square the grand total and divide it by the total number of observations; call this D.

Step 8. Calculate the Total sum of squares (S of S) = A - D

Step 9. Calculate the Between-treatments sum of squares = B - D

Step 10. Calculate the Residual sum of squares = A - B [This is sometimes called the Error
sum of squares]

Step 11. Construct a table as follows, where *** represents items to be inserted, and where u =
number of treatments and v = number of replicates.

Source of variance Sum of squares Degrees of freedom Mean square


(S of S) (df) = S of S / df
Between treatments *** u-1 ***
Residual *** u(v-1) ***
Total *** (uv)-1

[The total df is always one fewer than the total number of data entries]

Step 12. Using the mean squares in the final column of this table, do a variance ratio test to
obtain an F value:

F = Between treatments mean square / Residual mean square

Step 13. Go to a table of F (p = 0.05) and read off the value where n1 is the df of the between
treatments mean square and n2 is df of the residual mean square. If the calculated F value
exceeds the tabulated value there is significant difference between treatments. If so, then look at
the tabulated F values for p = 0.01 and then 0.001, to see if the treatment differences are more
highly significant.

What does all this mean?


If you look at many of the steps above they should remind you of the steps in a t-test. For
example, in a t-test we calculate Σx, Σx², x̄ and Σd² (which is the sum of squares), then
we divide Σd² by n − 1, just as we did in step 11 (above). So the Analysis of Variance uses
the same types of procedure, but for more than 2 samples. If you want to convince yourself of
this, then try doing the Analysis of Variance for just two samples (e.g. Bacterium A and
Bacterium B). You will get exactly the same result as in a t-test.

Analysis of variance: worked example

Replicate Bacterium A Bacterium B Bacterium C Row totals


1 12 20 40 72
2 15 19 35 69
3 9 23 42 74
Σx 36 62 117 215 (Grand total)
n 3 3 3
x̄ 12 20.7 39

Σx² 450 1290 4589 6329 (call this A)

(Σx)²/n 432 1281.3 4563 6276.3 (call this B)

Σd² 18 8.7 26 52.7 (A − B)

s² (= Σd²/(n−1)) 9 4.35 13

Fmax test: F = 13/4.35 = 2.99. This is lower than the Fmax of 87.5 (for 3 treatments and 2 df, at
p = 0.05) so the variances are homogeneous and we can proceed with analysis of variance. If
our value exceeded the tabulated Fmax then we would need to transform the data.

D = (Grand total)² / total number of observations = 215² / 9 = 5136.1

Total sum of squares (S of S) = A - D = 1192.9

Between-treatments S of S = B - D = 1140.2

Residual S of S = A - B = 52.7

Sum of squares Mean square


Source of variance Degrees of freedom *
(S of S) (= S of S df)
Between treatments 1140.2 u - 1 (=2)* 570.1
Residual 52.7 u(v-1) (=6)* 8.78
Total 1192.9 (uv)-1 (=8)*

[* For u treatments (3 in our case) and v replicates (3 in our case); the total df is one fewer than
the total number of data values in the table (9 values in our case)]

F = Between treatments mean square /Residual mean square = 570.1 / 8.78 = 64.93

The tabulated value of F (p = 0.05), where n1 is the df of the between-treatments mean square (2)
and n2 is the df of the residual mean square (6), is 5.1. Our calculated F value exceeds this and
even exceeds the tabulated F value for p = 0.001 (F = 27.0). So there is a very highly significant
difference between treatments.
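Steps 4-13 can be scripted directly. A Python sketch of the worked example (an aside; the variable names A, B and D follow the steps above):

```python
# One-way ANOVA for the three bacteria, following steps 4-13 above
data = {
    "A": [12, 15, 9],
    "B": [20, 19, 23],
    "C": [40, 35, 42],
}

all_values = [x for xs in data.values() for x in xs]
grand_total = sum(all_values)          # step 6
N = len(all_values)

A = sum(x ** 2 for x in all_values)                        # step 4: sum of all Σx²
B = sum(sum(xs) ** 2 / len(xs) for xs in data.values())    # step 5: sum of (Σx)²/n
D = grand_total ** 2 / N                                   # step 7

between_ss = B - D       # step 9: between-treatments sum of squares
residual_ss = A - B      # step 10: residual (error) sum of squares
u = len(data)            # number of treatments
v = len(data["A"])       # replicates per treatment

# Step 12: variance ratio of the two mean squares
F = (between_ss / (u - 1)) / (residual_ss / (u * (v - 1)))
print(f"F = {F:.1f} at {u - 1} and {u * (v - 1)} df")
```

Carrying full precision gives F of about 64.9, in agreement (to rounding) with the hand calculation.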

[Note that the term "mean square" in an Analysis of Variance is actually a variance - it is
calculated by dividing the sum of squares by the degrees of freedom. In a t-test we would call it
s², obtained by dividing Σd² by n − 1. Analysis of Variance involves the partitioning of the total
variance into (1) variance associated with the different treatments/samples and (2) random
variance, evidenced by the variability within the treatments. When we calculate the F value, we
ask, in effect, "is there a large amount of variance associated with the different treatments
compared with the amount of random variance?".]

Which treatments differ from one another?


The Analysis of Variance has told us only that there are differences between treatments in the
experiment as a whole. Sometimes this information is useful in its own right. But it does not tell
us which treatments differ from one another.

We now have a problem, because every time we compare one treatment with another (for
example, comparing bacterium A with bacterium B) we are doing the equivalent of a t-test, with
a probability of making a wrong interpretation. We need some way of avoiding this problem.

Method 1. Calculate the least significant difference between any two means. [This is not
generally favoured, but it can be used with caution.]

We make use of the fact that our calculations for Analysis of Variance were similar to those of a
t-test (see earlier); in particular, the residual mean square is an estimate of s2 for each
treatment, because the variance for all treatments is assumed to be equal in an Analysis of
Variance.

In the t-test, we calculate sd2 by summing s2/n for the two samples being compared.

In the analysis of variance, s2 for each treatment is assumed to be the same, and if n for each
treatment is the same, then we could compare any two means by calculating sd2 as follows:

sd2 = 2 x residual mean square / n

We can then find sd as the square root of sd2 and calculate t as:

t = (mean 1 - mean 2) / sd

If we did this for two particular means, we could compare the calculated t with that in a t-table,
using the df of the residual mean square (because this reflects the residual variance in the
whole experiment).

There is a simpler way of doing this for any two means:


If we take the equation and multiply each side by sd we get: t(sd) = mean 1 - mean 2

In other words, any two means would be significantly different from one another if they differ
by more than "t multiplied by sd"

So t(sd) represents the least significant difference (LSD) between any two means.

In scientific papers you might see data presented as follows:

Bacterium Biomass (mg)


1 12
2 20.7
3 39
5% LSD 5.92

Here the author would be giving us the means for the 3 treatments (bacteria) and telling us that
analysis of variance was used to find the least significant difference between any of the means
at p = 0.05 (the level of probability chosen for the t value).

In fact, the table above uses the data for bacterial biomass in our worked example.

For 5% LSD, we find sd2 (= 2 x residual mean square / n). It is 17.56 /3 = 5.85.

We square root this to find sd = 2.42.

The tabulated value of t for 6 df (of the residual mean square) is 2.45 (p = 0.05).

So the 5% LSD is t(sd ) = 2.45 x 2.42 = 5.92.

Our table of data indicates that each bacterium produced a significantly different biomass from
every other one.
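The LSD arithmetic above can be sketched in the same way (a minimal sketch; the t value for 6 df is taken from a t-table, everything else from the worked example):

```python
# 5% least significant difference (LSD) for the bacterial biomass data.
residual_ms = 8.78   # residual mean square from the ANOVA table
n = 3                # replicates per treatment
t_crit = 2.447       # t (p = 0.05) for the 6 df of the residual mean square

sd2 = 2 * residual_ms / n   # variance of the difference between two means
sd = sd2 ** 0.5             # about 2.42
lsd = t_crit * sd           # 5% LSD, about 5.92

means = {"Bacterium 1": 12, "Bacterium 2": 20.7, "Bacterium 3": 39}
names = list(means)
for i, m1 in enumerate(names):
    for m2 in names[i + 1:]:
        diff = abs(means[m1] - means[m2])
        verdict = "significant" if diff > lsd else "not significant"
        print(f"{m1} vs {m2}: difference {diff:.1f} is {verdict}")
```

All three differences exceed the LSD of about 5.92, as the table of means indicates.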

A word of caution: We can be much more confident about significant difference between
bacteria 1 and 3 or between bacteria 2 and 3 than we can about the difference between bacteria
1 and 2. Remember that every time we make such a comparison we run the risk of 5% error.
But if we had used the t value for p = 0.01 then we could more safely make five comparisons
and still have only a 1 in 20 chance of being wrong.

Statisticians recommend that the LSD should never be used indiscriminately, but only to test
comparisons between treatments that we "nominated" when designing the experiment. For
example, each treatment might be compared with a control, but each treatment should not
necessarily be compared with each other treatment.

Method 2. Many people now use variants of the LSD, such as a Multiple Range Test, which
enables us more safely to compare any treatments in a table. This test is far preferable to the
LSD. It is explained separately on another page.

Analysis of variance: using "Excel"


The example that we used (bacterial biomass) above is shown below as a print-out from
"Excel".

Having entered the data on the spreadsheet, we select Anova: single factor from the analysis
tools, click OK, and enter all 9 cells of data in Input variable range. The table shows the
source of variance as "Between groups" (= between treatments) and "within groups" (=
residual). We are also told the calculated F value (64.949..), the F value that we would need to
exceed (F critical) in order to have a significant difference between treatments, and the
probability (p-value) that our calculated F value would be obtained by chance (random error)
alone. This probability is very small (8.61 x 10^-5) so we have a highly significant difference
between treatments in our table. We could then use the residual (within groups) mean square
(MS) to calculate LSD, as explained earlier.

Replicate Treatment A Treatment B Treatment C


1 12 20 40

2 15 19 35

3 9 23 42

Anova: Single Factor

SUMMARY

Groups Count Sum Average Variance

Column 1 3 36 12 9

Column 2 3 62 20.66667 4.333333

Column 3 3 117 39 13
ANOVA

Source of Variation    SS          df    MS          F           P-value     F crit
Between Groups         1140.222    2     570.1111    64.94937    8.61E-05    5.143249
Within Groups          52.66667    6     8.777778
Total                  1192.889    8

Note: There is always a danger in using a statistical package, because the package does
whatever we tell it to do. It does not "think" or "consider" whether what we ask it to do is
legitimate. For example, it does not test for homogeneity of variance. BEWARE!
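The same single-factor analysis can be reproduced outside Excel. Here is a sketch using SciPy's f_oneway function (this assumes SciPy is installed; like Excel, it does not check homogeneity of variance for you):

```python
# Cross-check of the Excel "Anova: Single Factor" output with SciPy.
from scipy.stats import f_oneway

treatment_a = [12, 15, 9]
treatment_b = [20, 19, 23]
treatment_c = [40, 35, 42]

stat, p = f_oneway(treatment_a, treatment_b, treatment_c)
print(f"F = {stat:.5f}, p = {p:.2e}")  # matches Excel: F = 64.94937, p = 8.61e-05
```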

MULTIPLE RANGE TEST 17/06/13 4:25 PM


A multiple range test for comparing means in an analysis of


variance
This procedure is broadly similar to that for calculating the LSD but it gives us more confidence
in comparing the means within a table.

In calculating LSD, we first found sd2 (= 2 x residual mean square / n) and from this we found
sd (the standard deviation of the difference between any two means) and multiplied it by a t
value (for the degrees of freedom of the residual mean square). For a multiple range test, we use
essentially the same procedure but instead of a t value we use a Q value obtained from a table
"The Studentized Range". We list our means in order of magnitude, from highest to lowest,
then we test for significant difference between the highest and lowest - the difference must be
greater than Q x sd. If this is significant, we test the second highest mean against the lowest, and
continue in this way until all means have been tested against one another. However, the Q value
changes each time. For the first test (highest against lowest mean) we look up the Q value for
the number of treatments (i.e. for the number of means in our table of results). For the next test
(second highest against lowest) we use the Q value for the number of means minus 1 (because
we are excluding the highest mean now), and so on. However, the degrees of freedom do not
change - they are always the df of the residual (error) mean square. Although each step in this
procedure is simple, you need to be organised for testing each mean against all others - see a
worked example for the best way to do this.

Having done an analysis like this, most people construct a table as follows, using letters to show
which treatments differ from others. They would say that treatments that are not followed by
the same letter differ significantly from one another (p = 0.05).

For example, in the fictitious table below, the means for pH 4 and 5 do not differ from one
another but differ from the means for all other pH values. The mean for pH 6 differs from the
means at all other pH values. The means for pH 7 and 8 do not differ from one another but
differ from all other means, and the mean for pH 9 differs from all others.

Treatment Mean
pH 4 36 a
pH 5 35 a
pH 6 30 b
pH 7 20 c
pH 8 17 c
pH 9 10 d

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress7.html#A%20multiple%20range%20test Page 1 of 5

We will now apply the Multiple Range Test to the data on bacterial biomass that we
analysed by ANOVA

Summary table of data:

Bacterium A Bacterium B Bacterium C


Replicate 1 12 20 40
Replicate 2 15 19 35
Replicate 3 9 23 42
n 3 3 3
Mean 12 20.7 39

Source of variance Sum of squares Degrees of freedom Mean square


Between treatments 1140.2 2 570.1
Residual 52.7 6 8.78
Total 1192.9 8

sd2 = 2 x residual mean square / n = 17.56 /3 = 5.85.

sd = 2.42 (obtained as the square root of sd2).

First: Rank the treatments from highest to lowest mean. Then compare the highest (39) with the
lowest (12) mean:

With 3 treatments, and 6 degrees of freedom for the residual mean square, we have
a Q value of 4.34. So, Q(sd) = 4.34 x 2.42 = 10.50. The difference between the
highest and lowest means is greater than this, so the biomass of Bacterium C (the
highest) differs significantly from that of Bacterium A (the lowest).

Second: compare the second highest (20.7) with the lowest (12) mean:

With 2 treatments (because we are excluding the highest mean now), and 6 degrees
of freedom for the residual mean square, we have a Q value of 3.46. So, Q(sd) =
3.46 x 2.42 = 8.37. The difference between means is greater than this (but only
just), so the biomass of Bacterium B differs significantly from Bacterium A.

Third: continue in this way, down the table of ranked means, until you get a non-significant
result. [In our case, we have reached the end because there are only 3 treatments]

Fourth: [This is not necessary in our case because we ran out of means!] Compare the second
lowest mean with the highest, then continue with the second lowest and second highest, etc.
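These stepwise comparisons can be sketched in a few lines of Python (an illustrative sketch, not a library routine; the Q values are those quoted above for 6 df, and each pair is tested against Q x sd for the number of ranked means the pair spans):

```python
# Multiple range test on the bacterial means (Q values from a table of
# the Studentized Range, p = 0.05, df of the residual mean square = 6).
sd = 2.42
q_table = {3: 4.34, 2: 3.46}   # Q for the number of means spanned

means = sorted([12, 20.7, 39], reverse=True)   # rank highest to lowest

results = []
for i in range(len(means)):
    for j in range(len(means) - 1, i, -1):
        span = j - i + 1               # ranked means spanned by this pair
        critical = q_table[span] * sd  # Q x sd for this comparison
        diff = means[i] - means[j]
        verdict = "significant" if diff > critical else "ns"
        results.append(verdict)
        print(f"{means[i]} vs {means[j]}: diff {diff:.1f}, "
              f"critical {critical:.2f} -> {verdict}")
```

All three differences come out significant, echoing the worked example; the 20.7 vs 12 comparison only just clears its critical value of about 8.37.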

Comparison of the Multiple Range Test with the Least Significant Difference
It is interesting to compare the findings of these two types of test. Using the LSD method, we
found that the least significant difference between any two means would need to be t(sd ) = 2.45


x 2.42 = 5.92.

Using the Multiple Range Test, we had to meet stricter criteria: the highest and lowest means
had to differ by 10.50, and the second-highest and lowest means had to differ by 8.37. If we had
had more means to compare (e.g. the third-highest and lowest) then the critical value would
have been reduced again. The Multiple Range Test is much more discriminating than the LSD -
it greatly reduces the chance of error when making multiple comparisons between treatments.

Suggested procedure for comparing all means


Suppose that we did an ANOVA on 5 treatments, with 4 replicates per treatment, and found
means of 36, 42, 74, 10, 80. We calculated sd as 2.50. With 5 treatments and 4 replicates we
have 15 degrees of freedom for the residual (error) mean square.

We will need the Q values for comparing 5, then 4, then 3, then 2 treatments (always with 15
df). It is sensible to make a small table:

No. of treatments 5 4 3 2
Q= 4.37 4.08 3.67 3.01
Qxsd = 10.925 10.20 9.175 7.525

Now we can rank our means from highest to lowest and see if they differ by Qxsd. Again make
a table (see below). Start in box 1, comparing the means 80 and 10. These differ by more than
10.925 (Qxsd for 5 treatments) so we insert * in box 1. Repeat this for box 2 (comparing means
80 and 36, using Q for 4 treatments). Again insert *. We would also get a significant difference
in box 3 (means 80 and 42) but not in box 4, so we insert ns (not significant).

Now start in the next column (boxes 5, 6 and 7) then the third column (boxes 8, 9) and the
fourth column.

Mean
80
74                                                  box 4 ns
42                                    box 7 *       box 3 *
36                      box 9 ns      box 6 *       box 2 *
10        box 10 *      box 8 *       box 5 *       box 1 *
    10    36            42            74            80          Mean

Finally, we put letters against the means to show significant differences. We state that: means
followed by the same letter do not differ significantly from one another (p = 0.05). Do this
in a series of steps as shown below, to reach the letters (shown in blue) in the final column.
These are the letters that we retain from the early steps if they are not contradicted by a later
step. [Of course, having done this, we can present our means in any order, with their letters; we
do not need to keep them in ranked order]


Means    Step 1             Step 2             Step 3           Final
         (from boxes 1-4)   (from boxes 5-7)   (boxes 8,9,10)
80       a                                                      a
74       a                  c                                   a
42       b                  d                  e                b
36       b                  d                  e                b
10       b                  d                  f                d
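The box-by-box procedure above can be automated in the same way (a sketch; the Q values are those tabulated above, with the df of the residual (error) mean square):

```python
# Pairwise multiple range comparisons for the five ranked means.
sd = 2.50
q_table = {5: 4.37, 4: 4.08, 3: 3.67, 2: 3.01}

means = sorted([36, 42, 74, 10, 80], reverse=True)  # 80, 74, 42, 36, 10

verdicts = {}
for i in range(len(means)):
    for j in range(len(means) - 1, i, -1):
        span = j - i + 1               # ranked means spanned by the pair
        critical = q_table[span] * sd  # Q x sd for this comparison
        verdicts[(means[i], means[j])] = means[i] - means[j] > critical

for (high, low), sig in verdicts.items():
    print(f"{high} vs {low}: {'*' if sig else 'ns'}")
```

Only 80 vs 74 (box 4) and 42 vs 36 (box 9) come out non-significant, matching the table of boxes.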

Transformation of data
During ANOVA we do an Fmax test to check for homogeneity of variance, i.e. to check that it is
safe to pool all the treatment variances - an essential condition for performing an Analysis of
Variance.

What should we do if the Fmax test shows a major discrepancy in the variances, thereby
invalidating ANOVA?

The answer is to use some mathematical transformation of the original data, then perform
ANOVA with the transformed data. There are several types of transformation, each most
appropriate for particular circumstances.

1. When our data consist of small, whole-numbered counts the variance is often proportional to
the mean. This is overcome by converting each value (X) to √X and analysing the √X data. If
the counts are low and contain zeros then use √(X + 0.5).

2. More generally, it is appropriate to use log10 X, or log10 (X+1) if there are zero values.

3. Percentages and proportions (after multiplying by 100) can be converted to arcsin values.

In all these cases the transformed data are analysed in exactly the same way as in a normal
ANOVA, and we can use LSD or a multiple range test, as we did before, to test for significant
differences between the treatment means. BUT remember that these tests tell us the
difference between the transformed values, and it is not valid to de-transform an LSD and
show it as a significant difference between the ‘true’ means. This problem does not arise with a
multiple range test, where we use letters to show significant differences.

The way to overcome this is to present the data in a table as follows, showing both the true
means and the transformed means, and the LSD that applies to the transformed means:

Treatment    Mean (with log10 (X+1), arcsin or √ value in parentheses)
1            20 (4.47)
2            10 (3.16)
3            15 (3.87)
5% LSD       (0.37)
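The three transformations can be written as small helper functions. This is an illustrative sketch using only the standard library; for the arcsin case the usual form, arcsin of the square root of the proportion, is assumed here, since the page does not give the formula explicitly:

```python
import math

def sqrt_transform(values):
    """Square-root transform for small whole-number counts;
    0.5 is added to every value if the sample contains zeros."""
    shift = 0.5 if 0 in values else 0.0
    return [math.sqrt(x + shift) for x in values]

def log_transform(values):
    """log10 transform; 1 is added to every value if there are zeros."""
    shift = 1.0 if 0 in values else 0.0
    return [math.log10(x + shift) for x in values]

def arcsin_transform(proportions):
    """Arcsin of the square root of each proportion (0..1), in degrees."""
    return [math.degrees(math.asin(math.sqrt(p))) for p in proportions]

print(sqrt_transform([20, 10, 15]))  # the means in the table: 4.47, 3.16, 3.87
```

The transformed values are then fed into exactly the same ANOVA as before.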

TWO-­WAY ANOVA 17/06/13 4:25 PM


Analysis of variance (ANOVA) for factorial combinations of


treatments
Elsewhere on this site we have dealt with ANOVA for simple comparisons of treatments. We
can also use ANOVA for combinations of treatments, where two factors (e.g. pH and
temperature) are applied in every possible combination. These are called factorial designs, and
we can analyse them even if we do not have replicates.

This type of analysis is called TWO-WAY ANOVA.

Suppose that we have grown one bacterium in broth culture at 3 different pH levels at 4
different temperatures. We have 12 flasks in all, but no replicates. Growth was measured by
optical density (O.D.).

Construct a table as follows (O.D. is given in fictitious whole numbers here for convenience).

Temp oC    pH 5.5    pH 6.5    pH 7.5
25         10        19        40
30         15        25        45
35         20        30        55
40         15        22        40

Then calculate the following (see the worked example and the output from Microsoft "Excel").

(a) S x, S x2, (S x)2 / n, and the mean for each column in the table.

(b) S x, S x2, (S x)2 / n, and the mean for each row.

(c) Find the grand total by adding all S x for columns (it should be the same for rows). Square
this grand total and then divide by uv, where u is the number of data entries in each row, and v
is the number of data entries in each column. Call this value D; in our example it is (336)2 / 12 =
9408.

(d) Find the sum of S x2 values for columns; call this A. It will be the same for S x2 of rows. In
our example it is 11570.

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress8.html#Analysis%20of%20variance%20for%20factorial Page 1 of 7

(e) Find the sum of (S x)2 / n values for columns; call this B. In our example it is 11304.

(f) Find the sum of (S x)2 / n values for rows; call this C. In our example it is 9646.

(g) Set out a table of analysis of variance as follows:

Source of variance    Sum of squares    Degrees of freedom*    Mean square (= S of S / df)
Between columns       B - D (1896)      u - 1 (= 2)            948
Between rows          C - D (238)       v - 1 (= 3)            79.3
Residual              *** (28)          (u-1)(v-1) (= 6)       4.67
Total                 A - D (2162)      (uv) - 1 (= 11)        196.5

[* Where u is the number of data entries in each row, and v is the number of data entries in each
column); note that the total df is always one fewer than the total number of entries in the table
of data.

*** Obtained by subtracting the between-columns and between-rows sums of squares from total
sum of squares.

Now do a variance ratio test to obtain F values:

(1) For between columns (pH): F = Between columns mean square / Residual mean square

= 948 / 4.67 = 203

(2) For between rows (temperature) F = Between rows mean square / Residual mean square

= 79.3 / 4.67 = 17.0

In each case, consult a table of F (p = 0.05 or p = 0.01 or p = 0.001) where u is the between-
treatments df (columns or rows, as appropriate) and v is residual df. If the calculated F value
exceeds the tabulated value then the treatment effect (temperature or pH) is significant. In our
example, for the effect of pH (u is 2 degrees of freedom, v is 6 df) the critical F value at p =
0.05 is 5.14. In fact, we have a significant effect of pH at p = 0.001. For the effect of
temperature (u is 3 degrees of freedom, v is 6 df) the critical F value at p = 0.05 is 4.76. We find
that the effect of temperature is significant at p = 0.01.
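Steps (a) to (g) can be checked with a short script (an illustrative sketch; the names A, B, C and D follow the text):

```python
# Two-way ANOVA without replication for the O.D. data.
data = [            # rows = 25, 30, 35, 40 oC; columns = pH 5.5, 6.5, 7.5
    [10, 19, 40],
    [15, 25, 45],
    [20, 30, 55],
    [15, 22, 40],
]
v = len(data)       # data entries in each column (4)
u = len(data[0])    # data entries in each row (3)

col_totals = [sum(row[j] for row in data) for j in range(u)]
row_totals = [sum(row) for row in data]
grand = sum(row_totals)

A = sum(x * x for row in data for x in row)   # sum of all S x2
B = sum(t * t / v for t in col_totals)        # sum of (S x)2 / n for columns
C = sum(t * t / u for t in row_totals)        # sum of (S x)2 / n for rows
D = grand ** 2 / (u * v)

between_cols_ss = B - D                       # pH
between_rows_ss = C - D                       # temperature
residual_ss = (A - D) - between_cols_ss - between_rows_ss
residual_ms = residual_ss / ((u - 1) * (v - 1))

f_cols = (between_cols_ss / (u - 1)) / residual_ms
f_rows = (between_rows_ss / (v - 1)) / residual_ms
print(f"F (pH) = {f_cols:.2f}, F (temperature) = {f_rows:.2f}")
# F (pH) = 208.10, F (temperature) = 17.46, matching the Excel print-out below
```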

Worked example:

              pH 5.5    pH 6.5    pH 7.5    S x (rows)          n (= u)    Mean     S x2     (S x)2 / n
25oC          10        19        40        69                  3          23       2061     1587
30oC          15        25        45        85                  3          28.33    2875     2408
35oC          20        30        55        105                 3          35       4325     3675
40oC          15        22        40        77                  3          25.67    2309     1976
Totals                                      336 (grand total)                                9646 (Total C)

S x (columns)    60      96      180
n (= v)          4       4       4
Mean             15      24      45
S x2             950     2370    8250     11570 (Total A)
(S x)2 / n       900     2304    8100     11304 (Total B)

Below, we see a print-out of this analysis from "Excel".

We select Anova: Two-Factor Without Replication from the analysis tools package. Note
that the Anova table gives Source of Variation separately for Rows, Columns and Error (=
Residual).

pH 5.5 pH 6.5 pH 7.5

25oC 10 19 40

30oC 15 25 45
35oC 20 30 55

40oC 15 22 40

Anova: Two-Factor Without Replication

SUMMARY    Count   Sum   Average    Variance
Row 1        3      69   23         237
Row 2        3      85   28.33333   233.3333
Row 3        3     105   35         325
Row 4        3      77   25.66667   166.3333
Column 1     4      60   15         16.66667
Column 2     4      96   24         22
Column 3     4     180   45         50

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Rows                  238.6667    3   79.55556   17.46341   0.00228    4.757055
Columns               1896        2   948        208.0976   2.87E-06   5.143249
Error                 27.33333    6   4.555556
Total                 2162       11

Of interest, another piece of information is revealed by this analysis - the effects of temperature do not interact with the effects of pH. In other words, a change of temperature does not change the response to pH, and vice-versa. We can deduce this because the residual (error) mean square (MS) is small compared with the mean squares for temperature (rows) or pH (columns). [A low residual mean square tells us that most variation in the data is accounted for by the separate effects of temperature and pH.]

But suppose that our data were as follows:

Temp °C   pH 5.5   pH 6.5   pH 7.5
25          10       19       40
30          15       25       30
35          20       30       25
40          25       22       10

Here an increase of temperature increases growth at low pH but decreases growth at high pH. If
we analysed these data we would probably find no significant effect of temperature or pH,
because these factors interact to influence growth. The residual mean square would be very
large. This type of result is not uncommon - for example, patients' age might affect their
susceptibility to levels of stress. Inspection of our data strongly suggests that there is
interaction. To analyse it, we would need to repeat the experiment with two replicates, then use
a slightly more complex analysis of variance to test for (1) separate temperature effects, (2)
separate pH effects, and (3) significant effects of interaction.

As an example, below is shown a print-out from "Excel" of the following table, where I have assumed that we did the experiment above with replication.

Temp °C       pH 5.5   pH 6.5   pH 7.5
25   rep 1       9       18       36
     rep 2      11       20       44
30   rep 1      13       23       27
     rep 2      17       27       33
35   rep 1      18       27       23
     rep 2      22       33       27
40   rep 1      22       20        7
     rep 2      28       24       13

The procedure in "Excel" is as follows.

1. Enter the replicates as separate rows.

2. From the analysis tools menu, choose Anova: Two-Factor with Replication.


3. Insert all the cells of the table in Input range (Anova assumes that column A and row 1 are used for headings).

4. Enter "2" (in our case) where asked for "Rows per sample".

In the table displayed on the screen (see below) the analysis shows the means for each
temperature and each pH. It also tells us the following (see the bottom rows of the table).

(i) There is no significant difference between temperatures overall ("Excel" has called the temperature "Sample") because the calculated F value (3.148) is less than the critical F value (3.49).

(ii) There is a very highly significant (p < 0.001) effect of pH ("Columns") overall.

(iii) There is a very highly significant interaction (p = 0.0000397) between temperature and pH. In other words, the response to pH depends on the temperature, or vice-versa. This might have been the purpose of doing the experiment - to see how the organism behaves when subjected to combinations of factors.

Temp     pH 5.5   pH 6.5   pH 7.5
25°C        9       18       36
25°C       11       20       44
30°C       13       23       27
30°C       17       27       33
35°C       18       27       23
35°C       22       33       27
40°C       22       20        7
40°C       28       24       13

Anova: Two-Factor With Replication

SUMMARY          pH 5.5   pH 6.5   pH 7.5   Total
25°C   Count        2        2        2        6
       Sum         20       38       80      138
       Average     10       19       40       23
       Variance     2        2       32      196.8

30°C   Count        2        2        2        6
       Sum         30       50       60      140
       Average     15       25       30       23.33
       Variance     8        8       18       53.47

35°C   Count        2        2        2        6
       Sum         40       60       50      150
       Average     20       30       25       25
       Variance     8       18        8       26.8

40°C   Count        2        2        2        6
       Sum         50       44       20      114
       Average     25       22       10       19
       Variance    18        8       18       59.2

Total  Count        8        8        8
       Sum        140      192      210
       Average     17.5     24       26.25
       Variance    40.86    24      144.79

ANOVA
Source        SS         df   MS         F          P-value    F crit
Sample        116.5       3   38.83333   3.148649   0.064794   3.4903
Columns       330.3333    2   165.1667   13.39189   0.000877   3.88529
Interaction   1203        6   200.5      16.25676   3.97E-05   2.996117
Within        148        12   12.33333
Total         1797.833   23

Note: Because there is so much interaction, it is difficult to analyse the separate effects of temperature and pH. We should repeat the analysis, using separate parts of the data - for example, ANOVA for all the pH treatments at 25°C, then at 30°C, then 35°C and 40°C. Alternatively, we could assemble all the means (there are 12) in ranked order and do a multiple range test to find significant differences.
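For readers who want to verify the partition of the sums of squares without "Excel", here is a minimal Python sketch (an illustration, not part of the original page) that reproduces the ANOVA table above:

```python
# Replicated data: reps[temperature][pH] = (rep 1, rep 2)
reps = [
    [(9, 11), (18, 20), (36, 44)],    # 25 deg C
    [(13, 17), (23, 27), (27, 33)],   # 30 deg C
    [(18, 22), (27, 33), (23, 27)],   # 35 deg C
    [(22, 28), (20, 24), (7, 13)],    # 40 deg C
]
r, c, n = len(reps), len(reps[0]), 2
values = [x for row in reps for cell in row for x in cell]
C = sum(values) ** 2 / len(values)                 # correction term
total_ss = sum(x * x for x in values) - C
# Between temperatures ("Sample"): row totals over c*n entries each
rows_ss = sum(sum(sum(cell) for cell in row) ** 2 for row in reps) / (c * n) - C
# Between pH levels ("Columns"): column totals over r*n entries each
cols_ss = sum(sum(sum(reps[i][j]) for i in range(r)) ** 2 for j in range(c)) / (r * n) - C
# Between the 12 cells, then subtract the two main effects to isolate interaction
cells_ss = sum(sum(cell) ** 2 for row in reps for cell in row) / n - C
inter_ss = cells_ss - rows_ss - cols_ss            # Interaction SS
within_ss = total_ss - cells_ss                    # Within (error) SS
```

The four sums of squares match the Sample, Columns, Interaction and Within rows of the "Excel" ANOVA table.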

This site is no longer maintained and has been left for archival purposes. Text and links may be out of date.

Chi-squared test for categories of data


Background: The Student's t-test and Analysis of Variance are used to analyse measurement data which, in theory, are continuously variable. Between a measurement of, say, 1 mm and 2 mm there is a continuous range from 1.0001 to 1.9999 mm.

But in some types of experiment we wish to record how many individuals fall into a particular category, such as blue eyes or brown eyes, motile or non-motile cells, etc. These counts, or enumeration data, are discontinuous (1, 2, 3, etc.) and must be treated differently from continuous data. Often the appropriate test is chi-squared (χ²), which we use to test whether the numbers of individuals in different categories fit a null hypothesis (an expectation of some sort).

Chi-squared analysis is simple, and valuable for all sorts of things - not just Mendelian crosses! On this page we build from the simplest examples to more complex ones. When you have gone through the examples you should consult the checklist of procedures and potential pitfalls.

A simple example
Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the
Pharmacology Honours class over the past ten years there have been 80 females and 40 males.
Is this a significant departure from expectation? We proceed as follows (but note that we are
going to overlook a very important point that we shall deal with later).

Set out a table as shown below, with the "observed" numbers and the "expected" numbers (i.e.
our null hypothesis).

Then subtract each "expected" value from the corresponding "observed" value (O-E).

Square the "O-E" values, and divide each by the relevant "expected" value to give (O-E)²/E.

Add all the (O-E)²/E values and call the total "X²".

                       Female    Male     Total
Observed numbers (O)     80       40       120
Expected numbers (E)     60*3     60*3     120*1
O-E                      20      -20         0*2
(O-E)²                  400      400
(O-E)²/E                  6.67     6.67    13.34 = X²

Notes:
*1 This total must always be the same as the observed total
*2 This total must always be zero
*3 The null hypothesis was obvious here: we are told that there are equal numbers of males and females in the Science Faculty, so we might expect that there will be equal numbers of males and females in Pharmacology. So we divide our total number of Pharmacology students (120) in a 1:1 ratio to get our "expected" values.

Now we must compare our X² value with a χ² (chi-squared) value in a table of χ² with n-1 degrees of freedom (where n is the number of categories, i.e. 2 in our case - males and females). We have only one degree of freedom (n-1). From the χ² table, we find a critical value of 3.84 for p = 0.05.

If our calculated value of X² exceeds the critical value of χ² then we have a significant difference from the expectation. In fact, our calculated X² (13.34) exceeds even the tabulated χ² value (10.83) for p = 0.001. This shows an extreme departure from expectation. It is still possible that we could have got this result by chance - a probability of less than 1 in 1000. But we could be 99.9% confident that some factor leads to a "bias" towards females entering Pharmacology Honours. [Of course, the data don't tell us why this is so - it could be self-selection or any other reason.]
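The calculation is short enough to script. Here is a minimal Python sketch (an illustration only - the hand calculation above is all the test requires):

```python
observed = [80, 40]                   # females, males in Pharmacology Honours
total = sum(observed)
expected = [total / 2, total / 2]     # null hypothesis: a 1:1 ratio
X2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# X2 = 13.33 (the table's 13.34 comes from rounding each term to 6.67)
```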

Now repeat this analysis, but knowing that 33.5% of all students in the Science Faculty are males.

                       Female    Male     Total
Observed numbers (O)     80       40       120
Expected numbers (E)     79.8*3   40.2     120*1
O-E                       0.2     -0.2       0*2
(O-E)²                    0.04     0.04
(O-E)²/E                  0.0005   0.001    0.0015 = X²

Note *1: We know that the expected total must be 120 (the same as the observed total), so we can calculate the expected numbers as 66.5% and 33.5% of this total.

Note *2: This total must always be zero.

Note *3: Although the observed values must be whole numbers, the expected values can be (and often need to be) decimals.

Now, from a χ² table we see that our data do not depart from expectation (the null hypothesis). They agree remarkably well with it and might lead us to suspect that there was some design behind this! In most cases, though, we might get intermediate X² values, which neither agree strongly nor disagree with expectation. Then we conclude that there is no reason to reject the null hypothesis.

Some important points about chi-squared


Chi-squared is a mathematical distribution with properties that enable us to equate our calculated X² values to χ² values. The details need not concern us, but we must take account of some limitations so that χ² can be used validly for statistical tests.

(i) Yates correction for two categories of data (one degree of freedom)

When there are only two categories (e.g. male/female) or, more correctly, when there is only one degree of freedom, the χ² test should not, strictly, be used. There have been various attempts to correct this deficiency, but the simplest is to apply Yates correction to our data. To do this, we simply subtract 0.5 from each calculated value of "O-E", ignoring the sign (plus or minus). In other words, an "O-E" value of +5 becomes +4.5, and an "O-E" value of -5 becomes -4.5. To signify that we are reducing the absolute value, ignoring the sign, we use vertical lines: |O-E|-0.5. Then we continue as usual but with these new (corrected) O-E values: we calculate (with the corrected values) (O-E)², (O-E)²/E and then sum the (O-E)²/E values to get X². Yates correction only applies when we have two categories (one degree of freedom).

We ignored this point in our first analysis of student numbers (above). So here is the table again, using Yates correction:

                       Female    Male      Total
Observed numbers (O)     80        40       120
Expected numbers (E)     60*3      60*3     120*1
O-E                      20       -20         0*2
|O-E|-0.5                19.5     -19.5      0
(|O-E|-0.5)²            380.25   380.25
(|O-E|-0.5)²/E            6.338    6.338   12.676 = X²

In this case, the observed numbers were so different from the expected 1:1 ratio that Yates correction made little difference - it only reduced the X² value from 13.34 to 12.68. But there would be other cases where Yates correction would make the difference between acceptance or rejection of the null hypothesis.
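The corrected calculation can be scripted in the same way. A minimal Python sketch (again, just an illustration):

```python
observed = [80, 40]
expected = [60, 60]   # 1:1 null hypothesis
# Yates correction: subtract 0.5 from the absolute value of each O-E
X2 = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
# X2 = 12.675 (the page's 12.676 comes from rounding each term to 6.338)
```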

(ii) Limitations on numbers in "expected" categories

Again to satisfy the mathematical assumptions underlying χ², the expected values should be relatively large. The following simple rules are applied:

no expected category should be less than 1 (it does not matter what the observed values are)
AND no more than one-fifth of expected categories should be less than 5.

What can we do if our data do not meet these criteria? We can either collect larger samples so that we satisfy the criteria, or we can combine the data for the smaller "expected" categories until their combined expected value is 5 or more, then do a χ² test on the combined data. We will see an example below.

Chi squared with three or more categories


Suppose that we want to test the results of a Mendelian genetic cross. We start with 2 parents of
genotype AABB and aabb (where A and a represent the dominant and recessive alleles of one
gene, and B and b represent the dominant and recessive alleles of another gene).

We know that all the F1 generation (first generation progeny of these parents) will have
genotype AaBb and that their phenotype will display both dominant alleles (e.g. in fruit flies all
the F1 generation will have red eyes rather than white eyes, and normal wings rather than
stubby wings).

This F1 generation will produce 4 types of gamete (AB, Ab, aB and ab), and when we self-cross
the F1 generation we will end up with a variety of F2 genotypes (see the table below).

Gametes Gametes
AB Ab aB ab
AB AABB AABb AaBB AaBb
Ab AABb AAbb AaBb Aabb
aB AaBB AaBb aaBB aaBb
ab AaBb Aabb aaBb aabb

All these genotypes fall into 4 phenotypes: double dominant (A-B-), single dominant A (A-bb), single dominant B (aaB-) and double recessive (aabb). We know that in classical Mendelian genetics the expected ratio of these phenotypes is 9:3:3:1.

Suppose we got observed counts as follows:

Phenotype              AB       Ab       aB       ab      Total
Observed numbers (O)    40       20       16        4       80
Expected numbers (E)    45       15       15        5       80*1
O-E                     -5        5        1       -1        0
(O-E)²                  25       25        1        1
(O-E)²/E                 0.56     1.67     0.07     0.20     2.50 = X²

[Note: *1. From our expected total 80 we can calculate our expected values for the categories in the ratio 9:3:3:1.]


From a χ² table with 3 df (we have four categories, so 3 df) at p = 0.05, we find that a χ² value of 7.82 is necessary to reject the null hypothesis (expectation of ratio 9:3:3:1). So our data are consistent with the expected ratio.
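A small helper function makes it easy to test any observed counts against any expected ratio. This Python sketch is an illustration (the function name is my own, not from the page):

```python
def chi_squared_for_ratio(observed, ratio):
    """X2 for observed counts against expected counts in the given ratio."""
    total = sum(observed)
    expected = [total * part / sum(ratio) for part in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The Mendelian cross above, tested against 9:3:3:1
X2 = chi_squared_for_ratio([40, 20, 16, 4], [9, 3, 3, 1])
# X2 = 2.489 (the table's 2.50 comes from rounding each term)
```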

Combining categories

Look at the table above. We only just collected enough data to be able to test a 9:3:3:1 expected ratio. If we had only counted 70 (or 79) fruit flies then our lowest expected category would have been less than 5, and we could not have done the test as shown. We would break one of the "rules" for χ² - that no more than one-fifth of expected categories should be less than 5. We could still do the analysis, but only after combining the smaller categories and testing against a different expectation.

Here is an illustration of this, assuming that we had used 70 fruit flies and obtained the
following observed numbers of phenotypes.

Phenotype              AB        Ab        aB       ab      Combined aB + ab   Total
Observed numbers (O)    34        18        15        3           18             70
Expected numbers (E)    39.375    13.125    13.125    4.375       17.5           70*1
O-E                     -5.375     4.875                           0.5            0
(O-E)²                  28.891    23.766                           0.25
(O-E)²/E                 0.734     1.811                           0.014          2.559 = X²

One of our expected categories (ab, with an expected value of 4.375) is less than 5. So we have combined this category with one of the others and then must analyse the results against an expected ratio of 9:3:4. The numbers in the expected categories were entered by dividing the total (70) in this ratio.

Now, with 3 categories we have only 2 degrees of freedom. The rest of the analysis is done as
usual, and we still have no reason to reject the null hypothesis. But it is a different null
hypothesis: the expected ratio is 9:3:4 (double dominant: single dominant Ab: single dominant
aB plus double recessive ab).

Chi-squared: double classifications


Suppose that we have a population of fungal spores which clearly fall into two size categories,
large and small. We incubate these spores on agar and count the number of spores that
germinate by producing a single outgrowth or multiple outgrowths.

Spores counted:

120 large spores, of which 80 form multiple outgrowths and 40 produce single outgrowths
60 small spores, of which 18 form multiple outgrowths and 42 produce single outgrowths

Is there a significant difference in the way that large and small spores germinate?

Procedure:

1. Set out a table as follows:

                      Large spores   Small spores   Total
Multiple outgrowth         80             18          98
Single outgrowth           40             42          82
Total                     120             60         180

2. Decide on the null hypothesis.

In this case there is no "theory" that gives us an obvious null hypothesis. For example, we have no reason to suppose that 55% or 75% or any other percentage of large spores will produce multiple outgrowths. So the most sensible null hypothesis is that the large and the small spores behave similarly - that both types of spore produce the same proportion of multiple and single outgrowths. Then, if our data do not agree with this expectation we will have evidence that spore size affects the type of germination.

3. Calculate the expected frequencies, based on the null hypothesis.

This step is complicated by the fact that we have different numbers of large and
small spores, and different numbers of multiple versus single outgrowths. But we
can find the expected frequencies (a, b, c and d) by using the grand total (180) and
the column and row totals (see table below).

                                   Large spores   Small spores   Row totals
Multiple outgrowth   Observed (O)       80             18            98
                     Expected (E)        a              b      (expected 98)
Single outgrowth     Observed (O)       40             42            82
                     Expected (E)        c              d      (expected 82)
Column totals                          120             60           180

To find the expected value "a" we know that a total of 98 spores had multiple outgrowths and that 120 of the total 180 spores were large. So a is 98(120/180) = 65.33.

Similarly, to find b we know that 98 spores had multiple outgrowths and that 60 of the total 180 spores were small. So, b is 98(60/180) = 32.67. [Actually, we could have done this simply by subtracting a from the expected 98 row total - the expected total must always be the same as the observed total.]

To find c we know that 82 spores had single outgrowths and that 120 of the total 180 spores were large. So c is 82(120/180) = 54.67.

To find d we know that 82 spores had single outgrowths and that 60 of the total 180 spores were small. So d is 82(60/180) = 27.33. [This value also could have been obtained by subtraction.]

4. Decide the number of degrees of freedom

You might think that there are 3 degrees of freedom (because there are 4
categories). But there is actually one degree of freedom! The reason is that we
lose one degree of freedom because we have 4 categories, and we lose a further 2
degrees of freedom because we used two pieces of information to construct our null
hypothesis - we used a column total and a row total. Once we had used these we
would have needed only one data entry in order to fill in the rest of the values
(therefore we have one degree of freedom).

Of course, with one degree of freedom we must use Yates correction (subtract 0.5
from each O-E value).

5. Run the analysis as usual, calculating O-E, (O-E)² and (O-E)²/E for each category; then sum the (O-E)²/E values to obtain X² and test this against χ².

The following table shows some of the working. The sum of the four (O-E corrected)²/E values gives X² = 20.23.

                                    Large spores   Small spores   Row totals
Multiple outgrowth   Observed (O)       80             18             98
                     Expected (E)       65.33          32.67          98
                     O-E               +14.67         -14.67
                     |O-E|-0.5          14.17          14.17           0
                     (|O-E|-0.5)²/E      3.07           6.14
Single outgrowth     Observed (O)       40             42             82
                     Expected (E)       54.67          27.33          82
                     O-E               -14.67         +14.67
                     |O-E|-0.5          14.17          14.17           0
                     (|O-E|-0.5)²/E      3.67           7.35      X² = 20.23
Column totals                          120             60            180

We compare the X² value with a tabulated χ² with one degree of freedom. Our calculated X² exceeds the tabulated χ² value (10.83) for p = 0.001. We conclude that there is a highly significant departure from the null hypothesis - we have very strong evidence that large spores and small spores show different germination behaviour.
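All of steps 3-5 can be condensed into a short script. This Python sketch (an illustration, not part of the original procedure) computes the marginal expected values and applies Yates correction:

```python
table = [[80, 18],   # multiple outgrowths: large spores, small spores
         [40, 42]]   # single outgrowths:   large spores, small spores
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)
X2 = 0.0
for i in range(2):
    for j in range(2):
        e = row_totals[i] * col_totals[j] / grand     # expected value from the margins
        X2 += (abs(table[i][j] - e) - 0.5) ** 2 / e   # Yates correction (one df)
# X2 = 20.23, as in the worked table
```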


Checklist: procedures and potential pitfalls


Chi squared is a very simple test to use. The only potentially difficult things about it are:

calculating the expected frequencies when we have double classifications - use the
marginal subtotals and totals to work out these frequencies
determining the number of degrees of freedom, especially when we have to use some of
the data to construct the null hypothesis.

If you follow the examples given on this page you should not have too many difficulties.

Some points to watch:


Always work with "real numbers" in the observed categories, not with proportions. To illustrate this, consider a simple chi-squared test on tossing of coins. Suppose that in 100 throws you get 70 "heads" and 30 "tails". Using Yates correction (for one degree of freedom) you would find an X² value of 15.21, equating to a χ² probability less than 0.001. But if you got 7 "heads" and 3 "tails" in a test of 10 throws it would be entirely consistent with random chance. The ratio is the same (7:3), but the actual numbers determine the level of significance in a chi-squared test.
Observed categories must have whole numbers, but expected categories can have decimals.
Follow the rules about the minimum numbers in expected categories. These rules do not apply to the observed categories.
Remember Yates correction for one degree of freedom.


Poisson distribution for count data


Use this test for counts of events that should be randomly distributed in space and time. For
example, the number of cells in a certain number of squares in a counting chamber, or the
number of colonies growing on agar plates in a dilution plating assay. With this test we can
compare such counts and place confidence limits on them.

As background to the Poisson distribution, we should compare the treatment of random count data with the treatment of measurement data. Suppose that we did a survey of the height of postal vans and another survey of the height of postal workers. The mean heights might be very similar (depending on the type of van) but the variance (a measure of the spread of data) would almost certainly be different. People are much more variable than postal vans of a given type. That's obvious. But suppose that we look down on a city where for some reason (perhaps a catastrophe) the postal vans had been abandoned and the postal workers were making their way to their various homes, and we count the number of postal vans in each square kilometre, and do the same for postal workers. Then if the means were the same the variance also would be the same. If counts of anything are randomly distributed in space and time then they follow the Poisson rule:

the variance is equal to the mean
so the standard deviation = square root of the mean.

The same point applies if we have a suspension of blood cells in a counting chamber. Provided that these cells do not attract or repel one another, their count will conform to a Poisson distribution. If there is a mean of 80 cells per square of the counting chamber, then there will be a variance of 80, a standard deviation of 8.94 (i.e. √80) and 95% confidence limits of ±8.94 x 1.96 (using the t value for infinite degrees of freedom). In other words, 95% of squares in the counting chamber would be expected to contain a count between 62.5 and 97.5.

Note two important points:

1. Provided that the cells are randomly distributed (no mutual attraction or repulsion) then their
count conforms to Poisson distribution, and this applies to all the counts (of various types) that
ever have been made or that ever will be made. So we need not bother with degrees of freedom
- we use the t value for infinite degrees of freedom (actually this is termed a d value).

2. Provided that our count is reasonably high (say, above 30) then it can be treated as part of a Poisson distribution, and we do not even need replicates. So, a count of 30 in one square of a counting chamber (or a count of 80 pooled from, for example, 3 squares) is all we need. A pooled count of 80 has:

a mean of 80,

a variance (s²) of 80,

a standard deviation (s) of √80,

a standard error (sn) of √80, because sn = s/√n and we counted one square (or pooled 3 squares), so n = 1.

An improved estimate of the confidence limits of the mean can be obtained by introducing a correction factor. The confidence limits of a count X are calculated as:

X + d²/2 ± d√(X + d²/4), where d is obtained from the bottom of a t-table (p = 0.05).

Thus, for our count of 80, the 95% confidence limits are:

80 + 1.96²/2 ± 1.96 √(80 + 1.96²/4) = 81.92 ± 17.64, so the limits are 64.28 to 99.56.
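This correction is easy to wrap in a small function. A Python sketch (illustrative only; the function name is my own):

```python
import math

def poisson_limits(count, d=1.96):
    """Confidence limits for a single Poisson count, with the correction factor.

    d = 1.96 is the t (strictly d) value for infinite df at p = 0.05."""
    centre = count + d ** 2 / 2
    half_width = d * math.sqrt(count + d ** 2 / 4)
    return centre - half_width, centre + half_width

low, high = poisson_limits(80)   # 64.28 to 99.56, as in the worked example
```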

Comparison of two Poisson counts


If you are starting to wonder where all that preamble is taking us, suppose that we count 100
cells in a certain volume of bacterial suspension (or blood), and 150 cells in the same volume of
another suspension. Are these significantly different?

Call the first count X1 and the second count X2, and use these in the following equation:

d = ( |X1 - (X1 + X2)/2| - 0.5 ) / √[(X1 + X2)/4]

[We have applied a correction factor of 0.5 here, as in Yates correction for χ², to improve the estimate of d. The vertical lines "| |" simply mean that the 0.5 is subtracted from the absolute value of the term between them, regardless of whether that value is positive or negative; a value of +5 becomes +4.5, and a value of -5 becomes -4.5.]

If we use our counts of 100 and 150 in the equation above, we get:

d = ( |150 - (150 + 100)/2| - 0.5 ) / √[(150 + 100)/4] = 24.5 / 7.9 = 3.10

We compare this with the d values on the bottom line of a t-table and find that it is higher than the d value for p = 0.002. Our two counts are significantly different; there is a probability of only 2 in 1000 of finding this difference by chance.
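The same comparison as a Python sketch (illustrative; the function name is my own), for two counts made over equal volumes:

```python
import math

def poisson_d(x1, x2):
    """d statistic comparing two Poisson counts from equal volumes,
    with the 0.5 correction factor."""
    return (abs(x1 - (x1 + x2) / 2) - 0.5) / math.sqrt((x1 + x2) / 4)

d = poisson_d(100, 150)   # 3.10, as in the worked example
```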

If the counts were obtained from different volumes (termed V1 and V2) then we simply apply a modified formula:

d = ( |X1 - (X1 + X2)(V1/(V1 + V2))| - 0.5 ) / √[(X1 + X2)(V1/(V1 + V2))(V2/(V1 + V2))]

Dilution plating
All the methods above can be applied to dilution plating of bacteria or fungi on agar plates. For example, if we used a soil dilution and counted 67 colonies on a plate at the 10⁻⁵ dilution, then we can estimate the original population in terms of its mean ± SE, as (67 ± 8.18) x 10⁵ colony-forming units ml⁻¹ (√67 = 8.18).

A test for randomness in space and time


Sometimes we might wish to test whether counts conform to a Poisson distribution. For
example, motile cells can aggregate into clumps, non-motile cells can agglutinate by surface
interactions, and cells can also repel one another by producing metabolites. We might wish to
test whether these events are occurring, in order to investigate the mechanisms or their
biological significance. The method is simple.

Suppose that we incubate cells in a counting chamber for 30 minutes and then count the number
of cells in several different squares of the chamber (of course, we can choose the size of our
sampling unit by pooling counts for groups of 4 or 16 squares, etc. to get mean counts large
enough (say, at least 30) to conform to Poisson expectation).

We might find the following counts in five squares of the chamber: 50, 30, 80, 90, 10.

For these five replicate counts we can obtain a mean (52) and variance in the normal way (see methods) by calculating:

Σd² = Σ(x - mean)² = 4480

Then variance, s² = Σd² / (n-1) = 4480 / 4 = 1120

If the data conformed to a Poisson distribution, then the mean of 52 would have a variance of 52. But our calculated variance is 1120. It seems that our counts do not conform to Poisson expectation - the cells are not randomly distributed in the counting chamber.

There are different ways of testing this, which need not be explained, but the simplest is to calculate Σd²/mean (= 4480 / 52 = 86.2) and equate this to χ² with n-1 degrees of freedom (n = 5 in our example). From the χ² table we see that our calculated value of 86.2 far exceeds the tabulated value of 18.47 for 4 df at p = 0.001. So our counts differ significantly from a Poisson expectation - the cells are not randomly distributed. Instead, we have highly significant evidence that they tend to be aggregated. [We might explain this in terms of agglutination, or if the cells are motile they might release substances that attract other cells.]

[An explanation of what we have done. To test for randomness of distribution, we calculate Σd², the sum of the squared deviations of our five replicate counts from their mean, and we divide it by the mean. If the data fit a Poisson distribution (where the variance, Σd²/(n-1), equals the mean) then "Σd²/mean" behaves as a χ² variable with n-1 degrees of freedom, so its expected value is close to n-1. Any major clustering (aggregation) of cells etc. will give a "Σd²/mean" value much greater than this. Conversely, if the cells etc. are "too uniformly dispersed" the "Σd²/mean" value will be much smaller.]

Now suppose that we had five counts: 49, 50, 50, 49, 50.

We can calculate the mean (49.6), Σd² (1.2) and Σd²/mean (= 0.024). Consulting a χ² table we see that our value of 0.024 is less than the tabulated value (0.297) for 4 degrees of freedom at p = 0.99. In other words, again our counts do not fit a Poisson expectation - the cells have a significant tendency (p < 0.01) to be uniformly dispersed. Perhaps they repel one another or perhaps the uniformity is caused by some other factor - that is a question to be addressed by a separate experiment.
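Both dispersion tests can be reproduced with a few lines. A Python sketch (illustrative only; the function name is my own):

```python
def dispersion_index(counts):
    """Sum of squared deviations from the mean, divided by the mean.

    Compare with chi-squared on n-1 degrees of freedom; under a Poisson
    distribution the expected value is close to n-1."""
    mean = sum(counts) / len(counts)
    sum_d2 = sum((x - mean) ** 2 for x in counts)
    return sum_d2 / mean

clustered = dispersion_index([50, 30, 80, 90, 10])   # 4480/52 = 86.2, far above chi-squared for 4 df
uniform = dispersion_index([49, 50, 50, 49, 50])     # 1.2/49.6 = 0.024, far below
```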

Now think about elephants!


Poisson distributions don't apply only to cells or bacterial counts (or postal vans). They apply
equally to elephants and animal behaviour. For example, if you surveyed an area of a large
game park and counted the elephants in each square kilometre (or whatever area is appropriate),
would the data fit a Poisson distribution? Would this be true at all times of the year? The results
you obtain would only tell you, in statistical terms, whether the counts fit a Poisson distribution
(i.e. whether elephants are randomly distributed in space). But the findings would suggest a lot
about the behaviour of elephants. Do they have large family groups? Do these groups disperse
at certain times of the year? Of course, this analysis can never tell us why they behave in this way: do elephants congregate at sites of food abundance, and disperse to forage widely
in periods of food shortage? Questions such as those need to be formulated and tested, but at
least the statistical analysis of distribution prompts us to ask them.

CORRELATION AND REGRESSION 17/06/13 4:26 PM

This site is no longer maintained and has been left for archival purposes. Text and links may be out of date.

Correlation and regression analysis for curve fitting


The techniques described on this page are used to investigate relationships between two
variables (x and y). Is a change in one of these variables associated with a change in the other?
For example, if we increase the temperature do we increase the growth rate of a culture or the
rate of a chemical reaction? Does an increase in DDT content of bird tissues correlate with
thinning of the egg shell? Is an increase in slug density in a field plot associated with a decrease
in seedling development?

We can use the technique of correlation to test the statistical significance of the association. In
other cases we use regression analysis to describe the relationship precisely by means of an
equation that has predictive value. We deal separately with these two types of analysis -
correlation and regression - because they have different roles.

Correlation
Suppose that we took 7 mice and measured their body weight and their length from nose to tail.
We obtained the following results and want to know if there is any relationship between the
measured variables. [To keep the calculations simple, we will use small numbers]

Mouse   Units of weight (x)   Units of length (y)
1       1                     2
2       4                     5
3       3                     8
4       4                     12
5       8                     14
6       9                     19
7       8                     22

Procedure

(1) Plot the results on graph paper. This is the essential first step, because only then can we
see what the relationship might be - is it linear, logarithmic, sigmoid, etc?

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress11.html#Correlation%20coefficient Page 1 of 10

In our case the relationship seems to be linear, so we will continue on that assumption. If it does not seem to be linear we might need to transform the data.

(2) Set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and x̄, ȳ (the means of x and y).

        Weight (x)   Length (y)   x²          y²           xy
Mouse 1   1            2           1           4            2
Mouse 2   4            5           16          25           20
Mouse 3   3            8           9           64           24
Mouse 4   4            12          16          144          48
Mouse 5   8            14          64          196          112
Mouse 6   9            19          81          361          152
Mouse 7   8            22          64          484          176
Total     Σx = 37      Σy = 82     Σx² = 251   Σy² = 1278   Σxy = 553
Mean      x̄ = 5.286    ȳ = 11.714

(3) Calculate Σx² − (Σx)²/n = 251 − 37²/7 = 55.429 in our case.

(4) Calculate Σy² − (Σy)²/n = 1278 − 82²/7 = 317.429 in our case.

(5) Calculate Σxy − (Σx)(Σy)/n (this can be positive or negative) = 119.571.

(6) Calculate r (the correlation coefficient):

r = [Σxy − (Σx)(Σy)/n] / √{[Σx² − (Σx)²/n] × [Σy² − (Σy)²/n]} = 119.571 / √(55.429 × 317.429) = 0.9014 in our case.

(7) Look up r in a table of correlation coefficients (ignoring + or - sign). The number of degrees
of freedom is two less than the number of points on the graph (5 df in our example because we
have 7 points). If our calculated r value exceeds the tabulated value at p = 0.05 then the
correlation is significant. Our calculated value (0.9014) does exceed the tabulated value (0.754).
It also exceeds the tabulated value for p = 0.01, but not for p = 0.001. If the null hypothesis were true (that there is no relationship between length and weight) we would have obtained a correlation coefficient as high as this fewer than 1 time in 100. So we can be confident that weight and length are positively correlated in our sample of mice.
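Steps (2) to (6) can be checked with a short Python sketch using only the standard library; it computes the same corrected sums of squares and products for the mouse data.

```python
import math

def pearson_r(xs, ys):
    # Corrected sums of squares and products, as in steps (3)-(5)
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxy / math.sqrt(sxx * syy)

weight = [1, 4, 3, 4, 8, 9, 8]      # x values from the mouse table
length = [2, 5, 8, 12, 14, 19, 22]  # y values
print(round(pearson_r(weight, length), 4))  # 0.9014
```

The significance of the resulting r is still judged against a table of correlation coefficients with n − 2 degrees of freedom, as described in step (7).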

Important notes:

1. If the calculated r value is positive (as in this case) then the slope will rise from left to right
on the graph. As weight increases, so does the length. If the calculated value of r is negative the
slope will fall from left to right. This would indicate that length decreases as weight increases.

2. The r value will always lie between -1 and +1. If you have an r value outside of this range
you have made an error in the calculations.

3. Remember that a correlation does not necessarily demonstrate a causal relationship. A


significant correlation only shows that two factors vary in a related way (positively or
negatively). This is obvious in our example because there is no logical reason to think that
weight influences the length of the animal (both factors are influenced by age or growth stage).
But it can be easy to fall into the "causality trap" when looking at other types of correlation.

What does the correlation coefficient mean?

r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² × Σ(y − ȳ)²]

The part above the line in this equation is a measure of the degree to which x and y vary together (using the deviations, d, of each value from its mean). The part below the line is a measure of the degree to which x and y vary separately.
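This deviation form of r can be verified directly on the mouse data from above; the sketch below computes the "above the line" and "below the line" parts separately and gives the same 0.9014.

```python
import math

weight = [1, 4, 3, 4, 8, 9, 8]      # x values from the mouse table
length = [2, 5, 8, 12, 14, 19, 22]  # y values
mx = sum(weight) / len(weight)      # mean of x
my = sum(length) / len(length)      # mean of y

# Degree to which x and y vary together (sum of products of deviations)
above = sum((x - mx) * (y - my) for x, y in zip(weight, length))
# Degree to which x and y vary separately
below = math.sqrt(sum((x - mx) ** 2 for x in weight) *
                  sum((y - my) ** 2 for y in length))
print(round(above / below, 4))  # 0.9014
```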

Regression analysis: fitting a line to the data


It would be tempting to try to fit a line to the data we have just analysed - producing an equation
that shows the relationship, so that we might predict the body weight of mice by measuring
their length, or vice-versa. The method for this is called linear regression.

However, this is not strictly valid because linear regression is based on a number of
assumptions. In particular, one of the variables must be "fixed" experimentally and/or precisely
measurable. So, the simple linear regression methods can be used only when we define some
experimental variable (temperature, pH, dosage, etc.) and test the response of another variable
to it.

The variable that we fix (or choose deliberately) is termed the independent variable. It is always plotted on the X axis. The other variable is termed the dependent variable and is plotted on the Y axis.

Suppose that we had the following results from an experiment in which we measured the
growth of a cell culture (as optical density) at different pH levels.

pH Optical density
3 0.1
4 0.2
4.5 0.25
5 0.32
5.5 0.33
6 0.35
6.5 0.47
7 0.49
7.5 0.53

We plot these results (see below) and they suggest a straight-line relationship.

Using the same procedures as for correlation, set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and x̄, ȳ (the means of x and y).

pH (x)   Optical density (y)   x²          y²             xy
3        0.1                   9           0.01           0.3
4        0.2                   16          0.04           0.8
4.5      0.25                  20.25       0.0625         1.125
5        0.32                  25          0.1024         1.6
5.5      0.33                  30.25       0.1089         1.815
6        0.35                  36          0.1225         2.1
6.5      0.47                  42.25       0.2209         3.055
7        0.49                  49          0.2401         3.43
7.5      0.53                  56.25       0.2809         3.975
Total    Σx = 49   Σy = 3.04   Σx² = 284   Σy² = 1.1882   Σxy = 18.2
Mean     x̄ = 5.444   ȳ = 0.3378

Now calculate Σx² − (Σx)²/n = 284 − 49²/9 = 17.22 in our case.

Calculate Σy² − (Σy)²/n = 1.1882 − 3.04²/9 = 0.1614 in our case.

Calculate Σxy − (Σx)(Σy)/n (this can be positive or negative) = +1.649.

Now we want to use regression analysis to find the line of best fit to the data. We have done
nearly all the work for this in the calculations above.

The regression equation for y on x is: y = bx + a where b is the slope and a is the intercept
(the point where the line crosses the y axis)

We calculate b as:

b = [Σxy − (Σx)(Σy)/n] / [Σx² − (Σx)²/n] = 1.649 / 17.22 = 0.0958 in our case.

We calculate a as:

a = ȳ − b·x̄

From the known values of ȳ (0.3378), x̄ (5.444) and b (0.0958) we thus find a (−0.1837).

So the equation for the line of best fit is: y = 0.096x - 0.184 (to 3 decimal places).

To draw the line through the data points, we substitute values of x into this equation. For example:

when x = 4, y = (0.096 × 4) − 0.184 = 0.200, so one point on the line has the x,y coordinates (4, 0.200);

when x = 7, y = (0.096 × 7) − 0.184 = 0.488, so another point on the line has the x,y coordinates (7, 0.488).

It is also true that the line of best fit always passes through the point with coordinates (x̄, ȳ), so we actually need only one other calculated point in order to draw a straight line.
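The slope and intercept arithmetic above can be mirrored with a small Python sketch using the same corrected sums; for the pH data it reproduces the b and a values found by hand (and in the Excel printout that follows).

```python
def linear_fit(xs, ys):
    # b = [Σxy - (Σx)(Σy)/n] / [Σx² - (Σx)²/n];  a = ȳ - b·x̄
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    b = sxy / sxx               # slope
    a = sum_y / n - b * sum_x / n  # intercept
    return b, a

ph = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
od = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]
b, a = linear_fit(ph, od)
print(round(b, 4), round(a, 4))  # 0.0957 -0.1835
```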

Regression analysis using Microsoft Excel


Below is a printout of the Regression analysis from Microsoft "Excel". It is obtained simply by
entering two columns of data (x and y) then clicking "Tools - Data analysis - Regression". We
see that it gives us the correlation coefficient r (as "Multiple R"), the intercept and the slope of
the line (seen as the "coefficient for pH" on the last line of the table). It also shows us the result
of an Analysis of Variance (ANOVA) to calculate the significance of the regression (4.36 × 10⁻⁷).

Regression Statistics

Multiple R          0.989133329
R Square            0.978384742
Adjusted R Square   0.975296848
Standard Error      0.022321488
Observations        9

ANOVA

             df   SS         MS         F          Significance F
Regression   1    0.157868   0.157868   316.8453   4.36E-07
Residual     7    0.003488   0.000498
Total        8    0.161356

             Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept    -0.18348387    0.030215         -6.07269   0.000504   -0.25493    -0.11204
pH           0.095741935    0.005379         17.80015   4.36E-07   0.083023    0.108461

Presenting the results

The final graph should show:

(i) all measured data points;

(ii) the line of best fit;

(iii) the equation for the line;

(iv) the R² and p values.

Further applications: logarithmic and sigmoid curves


When we plot our initial results on a graph it will usually be clear whether they best fit a linear
relationship or a logarithmic relationship or something else, like a sigmoid curve. We can
analyse all these relationships in exactly the same way as above if we transform the x and y
values as appropriate so that the relationship between x and y becomes linear. BEWARE - you
MUST look at a scatter plot on graph paper to see what type of relationship you have. If you
simply instruct a computer program such as "Excel" to run a regression on untransformed data it will do this by assuming that the relationship is linear!

(i) For plots of data that suggest exponential (logarithmic) growth, convert all y values to log
of y (using either log₁₀ or logₑ). Then go through the linear regression procedures above, using
the log y data instead of y data.

(ii) For sigmoid curves (drug dose response curves and UV killing curves are often sigmoid),
the y values (proportion of the population responding to the treatment) can be converted using a
logistic or probit transformation. Sometimes it is useful to convert the x (dose) data to
logarithms; this condenses the x values, removing the long tails of non-responding individuals
at the lowest and highest dose levels. A plot of logistic or probit (y) against dose (x) or log of
dose (x) should show a straight-line relationship.

Converting between percentage, arcsin, logistic and probits in ‘Excel’


The table below shows part of a page from an ‘Excel’ worksheet, produced as an exercise to show how transformations are performed. Columns in the worksheet are headed A-G and rows are labelled 1-21, so each cell in the table can be identified (e.g. B2 or F11).
Representative Proportions were inserted in cells A2-A21, and % values were inserted in cells
B2-B21.

Then a formula was entered in cell C2 to convert Proportions to logistic values

The logistic transformation converts y to log(y/(1-y))

The formula (without spaces) entered into cell C2 was: =LOG(A2/(1-A2))

This formula is not seen in the cell, but as soon as we move out of cell C2 it automatically gives
the logistic value (in C2) for the proportion in cell A2, seen in the printout below. Copying and
then pasting this formula into every other cell of column C produces a corresponding logistic
value (e.g. cell C3 contains the logistic value of the proportion in cell A3).

Similarly, a formula was entered in cell D2 to convert Percentage to Probit values.

The formula (without spaces) is: =NORMINV(B2/100,5,1) This was then pasted into all cells
of column D

Next, a formula was entered in cell E2 to convert Probit to Percentage, and pasted into all cells of column E.

The formula is: =NORMDIST(D2,5,1,TRUE)*100

The formula entered in cell F2 converts Percentage to Arcsine.

The formula is: =ASIN(SQRT(B2/100))*180/PI()

The formula in cell G2 converts Arcsine to Percentage.

The formula is: =SIN(F2/180*PI())^2*100

     A            B         C                        D             E             F             G
1    Proportion   Percent   Proportion to logistic   % to Probit   Probit to %   % to arcsin   arcsin to %

2 0.001 0.1 -2.99957 1.91 0.1 1.812 0.1


3 0.005 0.5 -2.29885 2.424 0.5 4.055 0.5
4 0.01 1 -1.99564 2.674 1 5.739 1
5 0.02 2 -1.6902 2.946 2 8.13 2
6 0.03 3 -1.50965 3.119 3 9.974 3
7 0.04 4 -1.38021 3.249 4 11.54 4
8 0.05 5 -1.27875 3.355 5 12.92 5
9 0.06 6 -1.19498 3.445 6 14.18 6
10 0.07 7 -1.12338 3.524 7 15.34 7
11 0.08 8 -1.0607 3.595 8 16.43 8
12 0.09 9 -1.0048 3.659 9 17.46 9
13 0.1 10 -0.95424 3.718 10 18.43 10
14 0.5 50 0 5 50 45 50
15 0.96 96 1.380211 6.751 96 78.46 96
16 0.97 97 1.50965 6.881 97 80.03 97
17 0.98 98 1.690196 7.054 98 81.87 98
18 0.995 99.5 2.298853 7.576 99.5 85.95 99.5
19 0.9999 99.99 3.999957 8.719 99.99 89.43 99.99
20 0.99999 99.999 4.999996 9.265 99.999 89.82 99.999
21 0.999999 99.9999 6 9.768 100 89.94 99.9999
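The same transformations can be reproduced outside Excel. The sketch below uses only the Python standard library; NormalDist(5, 1) plays the role of NORMINV/NORMDIST with mean 5 and standard deviation 1 (the conventional probit scale).

```python
import math
from statistics import NormalDist

probit_scale = NormalDist(mu=5, sigma=1)  # probits are z-scores shifted by +5

def logistic(p):
    # =LOG(A2/(1-A2)): log10 of the odds, for a proportion p
    return math.log10(p / (1 - p))

def pct_to_probit(pct):
    # =NORMINV(pct/100, 5, 1)
    return probit_scale.inv_cdf(pct / 100)

def pct_to_arcsin(pct):
    # =ASIN(SQRT(pct/100))*180/PI(): angular transformation, in degrees
    return math.degrees(math.asin(math.sqrt(pct / 100)))

# First data row of the worksheet (proportion 0.001, i.e. 0.1 %)
print(round(logistic(0.001), 4))     # -2.9996
print(round(pct_to_probit(0.1), 2))  # 1.91
print(round(pct_to_arcsin(0.1), 3))  # 1.812
```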

As an example of the use of transformations, the data from a fictitious dose-response curve
(table below) are shown in two curves - first, without transformation and then after
transforming the proportion responding to logistic values.

Dose Proportion Logistic


1 0.01 -1.99564
2 0.015 -1.81734
3 0.02 -1.6902
4 0.04 -1.38021
5 0.045 -1.32679
6 0.05 -1.27875
7 0.07 -1.12338
8 0.1 -0.95424
9 0.19 -0.62973
10 0.25 -0.47712
11 0.34 -0.28807
12 0.44 -0.10474
13 0.53 0.052178
14 0.62 0.212608
15 0.68 0.327359
16 0.74 0.454258
17 0.79 0.575408
18 0.83 0.688629
19 0.85 0.753328
20 0.88 0.865301
21 0.9 0.954243
22 0.92 1.060698
23 0.935 1.157898
24 0.95 1.278754


25 0.96 1.380211
26 0.96 1.380211
27 0.97 1.50965
28 0.975 1.591065
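To see the transformation in action, here is a brief sketch applying the logistic transformation to a few rows of the table above; the transformed values are what would then be fed into the linear regression procedure described earlier.

```python
import math

# A few (dose, proportion responding) pairs from the fictitious table above
data = [(1, 0.01), (8, 0.1), (13, 0.53), (20, 0.88), (28, 0.975)]

# Logistic transformation: log10(p / (1 - p)) straightens the sigmoid curve
logit = [(dose, math.log10(p / (1 - p))) for dose, p in data]
for dose, value in logit:
    print(dose, round(value, 5))
# e.g. dose 13 gives 0.05218, matching the tabulated logistic value
```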

CONTENTS

INTRODUCTION
THE SCIENTIFIC METHOD
Experimental design
Designing experiments with statistics in mind
Common statistical terms
Descriptive statistics: standard deviation, standard error, confidence intervals of mean.

WHAT TEST DO I NEED?

STATISTICAL TESTS:
Student's t-test for comparing the means of two samples
Paired-samples test. (like a t-test, but used when data can be paired)
Analysis of variance for comparing means of three or more samples:

For comparing separate treatments (One-way ANOVA)


Calculating the Least Significant Difference between means
Using a Multiple Range Test for comparing means
For factorial combinations of treatments (Two-way ANOVA)

Chi-squared test for categories of data


Poisson distribution for count data
Correlation coefficient and regression analysis for line fitting:

linear regression
logarithmic and sigmoid curves

TRANSFORMATION of data: percentages, logarithms, probits and arcsin values

STATISTICAL TABLES:
t (Student's t-test)
F, p = 0.05 (Analysis of Variance)
F, p = 0.01 (Analysis of Variance)
F, p = 0.001 (Analysis of Variance)
χ² (chi squared)
r (correlation coefficient)
Q (Multiple Range test)
Fmax (test for homogeneity of variance)
