Research Methodology in Botany


Produced by Jim Deacon

______________________________________________

I have produced this site as someone who uses statistics in my experimental work, but I

am not a statistician. The site is intended to provide a simple, straightforward guide to the

basics of experimental design and to some of the common statistical tests.

There are several excellent sites on Statistical methods. But I think that many undergraduates

(and graduate students) want only a user-friendly beginner's guide - or a refresher course - that

enables them to use statistical tests with a minimum of fuss. That's the aim of this site. It is by

no means a comprehensive guide, but it will get you started and, if nothing more, it will help

you to understand the meaning of the symbols you see in scientific papers.

If you know the statistical test that you want, then go straight to the index of tests (see

below).

If you are "rusty" or a beginner, then you should read the introductory pages: THE

SCIENTIFIC METHOD, Experimental design, Common statistical terms, Descriptive

statistics, and What test do I need?

______________________________________________


http://archive.bio.ed.ac.uk/jdeacon/statistics/tress1.html Page 1 of 2

EXPERIMENTAL DESIGN 17/06/13 4:22 PM

This page is kept for archival purposes. Text and links may be out of date.

Experimental Design

After reading this section, you should be able to describe the main stages involved in scientific

problem-solving.

Science is usually fun. Most scientists enjoy their work, and fortunately it is sometimes useful

to society. Since scientists are people rather than machines, they behave as other people do.

They can be bad-tempered, pig-headed, jealous of the success of others and untruthful. Even so,

the pattern of work employed by scientists, the scientific method, is the most powerful tool yet

devised for the analysis and solution of problems in the natural world. The method can be

applied as much to situations in everyday life as to conventional 'scientific' problems. The world

might even be a happier place if more people attempted to solve their problems in a 'scientific'

way.

In outline, scientific problem-solving involves:

- Making an observation that needs explaining.

- Suggesting an explanation, or hypothesis, for the observation.

- Devising a test of that explanation.

- Carrying out the test and, on the basis of the results, deciding whether the explanation is a good one or not. If it is not, a new explanation will (with luck) 'spring to mind' as a result of the first test.

Originality

Science depends upon original thinking at several points. One is when we make the original

'guess' - usually called an hypothesis. Another is when we devise a test or experiment to show

how likely the hypothesis is to be correct. A good scientist relies on 'inspiration' in the same

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html Page 1 of 13


way as a good artist. [Some teachers are concerned that this point is largely ignored in present-

day science education.]

The following examples constitute the sort of story you might read in a newspaper. We will try to analyse

these stories in terms of THE SCIENTIFIC METHOD. A general feature is the lack of a

rigorous test situation in the original stories; the strength of THE SCIENTIFIC METHOD

largely lies in devising tests that are capable of discriminating between different hypotheses.

Mr Smith of Morningside has taught his dog Rover to understand French. Mr Smith noticed that

every evening, after dinner, when he went to the door with his coat on and said "Walkies",

Rover immediately understood and came running. Mr Smith was going to France for the

summer, and, as an experiment in international understanding, decided to teach Rover French.

He started to say "Allons" instead of "Walkies". To his delight, Rover very quickly understood

and came running.

Analysis

Rover probably responds to an overall situation (evening, after dinner, going to the door, coat on, call) of which what is actually called is only a small part. A change in the call may not matter much to the dog. Discriminating tests might include:

- Saying "Allons" at an unusual time, for example in the morning when he does not usually go for a walk.

- Setting up the rest of the situation (going to the door, coat on, after dinner) without calling anything.

The results of these and similar tests should indicate whether Rover is specifically responding to

the word "allons", or (more likely) to an overall situation he is well used to.

Notice that these tests do not tell us anything of a dog's ability to learn French words. They are

only concerned with the specific case of responding to one French word. We will see later that


extrapolating from the specific to the general is very important in scientific methodology.

The Institute for Psychical Research conducted a study on the performance of well-known

fortune-tellers. The most positive results involve Arnold Woodchuck who, at the start of each

year, makes a series of ten predictions for the coming year in a national tabloid newspaper. The

Institute has found that Mr Woodchuck is consistently correct in about 80% of his predictions.

For example, for 1995 he predicted a political crisis in Europe (the former Yugoslavia?), a

major human disaster in Africa (Rwanda?), a dispute over public sector pay (nurses?) and the

demise of a prominent football manager (Mr Graham?). He was unfortunately wrong in

predicting that England would win the Rugby Union World Cup. A spokesman for the Institute

was 'optimistic' about future studies on Mr Woodchuck.

Analysis

The apparent observation is that Mr Woodchuck has got more predictions correct than would

have been expected by chance. The Institute's hypothesis would be that Mr Woodchuck has

some kind of 'psychic powers'. Can we devise an alternative hypothesis?

We are dealing here with probability. If we toss an unbiased coin we get on average the same

number of heads as tails. If we asked someone to predict the outcome of the toss, we would not

be terribly surprised if from a small number of trials, he got 4 out of 5 right. But if he continued

to achieve 80% success over a long series, we would begin to suspect: (1) a biased coin; (2)

cheating; (3) psychic powers.

Mr Woodchuck regularly gets 80%. Is his 'coin' biased, is he cheating, or does he have psychic

powers? The most likely explanation is the 'biased coin' one, i.e. that the events he predicts do

not have a 1:1 probability, but perhaps a probability nearer to 4 or 5:1 on; in other words, a very

high probability that they will occur.

For example, almost invariably every year there is at least one 'political crisis' in Europe and a

'major human disaster' in Africa. Similarly, football managers have a short shelf-life. Public

sector employees (such as nurses, railway signalmen or indeed University teachers) have for

years perceived themselves to be underpaid whilst their masters either cannot or will not

respond appropriately. In contrast, the chances of England's winning the Rugby Union World

Cup were over-stated by the English press - and this is a prediction that failed.

A rigorous test would require Mr Woodchuck to predict specific events which really do have a low probability.

Again, the results of this investigation would be limited. They would probably show that the

'biased coin' explanation is the most likely. They would not show (a) whether Mr Woodchuck

has some kind of psychic power; or (b) whether psychic powers are possible.

Notice also that even a large deviation from an expected result can occur by chance in a small

sample (e.g., getting 4 out of 5 coin-tossing guesses right). This is very important in Biology,

and the basis of the use of statistical methods in biological analysis.
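This point is easy to check with a little arithmetic. The sketch below (plain Python; the numbers are the coin-tossing figures from the text) computes the chance of guessing at least 4 of 5 tosses correctly by luck alone, and contrasts it with the chance of sustaining an 80% hit rate over 50 guesses:

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent guesses,
    each correct with probability p (a binomial tail sum)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance of guessing at least 4 of 5 coin tosses by luck alone:
print(prob_at_least(4, 5))               # 0.1875 -- nearly 1 trial in 5

# Chance of sustaining an 80% hit rate over 50 guesses (40 of 50):
print(prob_at_least(40, 50) < 0.001)     # True -- effectively never by luck
```

Almost one short trial in five would give 4 out of 5 by chance, which is why small samples prove so little.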

EXPERIMENTAL DESIGN


After reading this section you should be able to discriminate between good and bad

experimental design.

Experimental Design

The design of a suitable experiment to test an hypothesis often requires some ingenuity and a

suspicious nature. In modern biology, the experiment may involve very sophisticated

equipment. But there are a number of features common to all good experiments (and often

absent from bad ones) which exist whatever the technical details. In summary these are:

Discrimination

The experiment should discriminate decisively between rival hypotheses; it often turns out that two or more hypotheses give indistinguishable results when tested by poorly-designed experiments.

Replication

Living material is notoriously variable. Usually experiments must be repeated enough times for

the results to be analysed statistically. Similarly, because of biological variability, we must be

cautious of generalising our results either from individual creatures to others of the same

species, or to other species. For instance, if our hypothesis is about mammals, it is inadequate

simply to carry out our experiments on laboratory rats. Similarly, it is dangerous to extrapolate

from healthy students to elite athletes.

Controls

The experiment must be well controlled. We must eliminate by proper checks the possibility

that other factors in the overall test situation produce the effect we are observing, rather than the

factor we are interested in.

A classic example: secretion of growth hormone in humans was thought to be stimulated by the
amino acid arginine. This was shown by injecting volunteers with arginine. As a control, the

investigators injected the volunteers with a saline solution. To their surprise, growth hormone

was again secreted. The investigators then waved a syringe and needle in front of their

volunteers, and found that that provoked growth hormone secretion too. Growth hormone is

now known to be secreted in response to stress (as well as arginine).

At a more technical level, we must be sure that our method of measurement is reproducible

from day to day, between operators in the same laboratory, or between laboratories. Whilst we

might be confident about a balance or a ruler, can we be as sure about, say, a method for

measuring haemoglobin? Do two groups of students measuring the same samples by the same

methods produce the same results? Quality control helps here.

'Blind' Designs

Investigators can subconsciously 'fudge' their data if they know what result they want to find.

The answer is to do the experiment 'blind', so the investigators (and the subjects, if humans are

being studied) do not know which treatment's effect they are observing. This can make the

logistics of doing the experiment more complex: for example, when determining the

haemoglobin concentration of male and female class members.

There is a story about a professor who devised a maze for measuring the intelligence of rats.

One day he gave his technicians, who actually made the measurements, three groups of rats. He


told them one group had been specially bred for intelligence, one for stupidity and the third was

average. The technicians assessed the rats' intelligence and confirmed that the 'bright' group

performed the best and the 'stupid' group the worst. The point is, of course, that the professor

had put animals into the three groups at random. They did not differ in intelligence.

Measurement

Good experiments often, though not always, involve measuring something: a weight, say. When

you make measurements, it is important you know both the accuracy and the precision of your

measuring system. These two terms are not synonymous: 'accuracy' means the ability of the

method to give an unbiased answer on average, whereas 'precision' is an index of the method's

reproducibility. Ideally your method should be both accurate (i.e., give the true mean) and

precise (i.e., have a low standard deviation). Sometimes one is more important than the other.

For example, if you were looking for small changes with time in a quantity (such as an athlete's

haemoglobin concentration), you would need a precise measure of it rather more than an

accurate one.
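The distinction can be illustrated with a minimal simulation (the haemoglobin values, offset and scatter below are invented for illustration):

```python
import random
from statistics import mean, stdev

random.seed(42)
TRUE_VALUE = 150.0   # invented "true" haemoglobin concentration, g/L

# Method A: accurate but imprecise -- centred on the true value, wide scatter
method_a = [random.gauss(TRUE_VALUE, 10.0) for _ in range(1000)]

# Method B: precise but inaccurate -- small scatter around a biased value
method_b = [random.gauss(TRUE_VALUE + 8.0, 1.0) for _ in range(1000)]

print(f"Method A: mean = {mean(method_a):6.1f}, sd = {stdev(method_a):4.1f}")
print(f"Method B: mean = {mean(method_b):6.1f}, sd = {stdev(method_b):4.1f}")
```

Method A gives the true mean on average but any single reading may be far out; method B gives almost the same (wrong) answer every time, which is exactly what you want when tracking small changes over time.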

Accuracy and precision together help you to judge the reliability of your data. They also help

you to judge to how many significant figures you should quote your results. For example, if you

use a balance reading to the nearest gram, you should give the results to the nearest gram and

not, say, to the nearest tenth of a gram.

Some experiments are very difficult to do because it is not obvious what can be measured. This

is a real problem in animal behaviour: for example, there is no obvious unit or measure for

'emotional state'. It is usually necessary to isolate measurable components of behaviour. Thus

the speed at which a tiger paces up and down a cage can give some indication of the internal

state of the animal but can never give a full picture of it.

Many of these points are rather abstract, but they should become clearer when you think about

the following examples.

Experiment:

Forty bean plants, growing in pots, were covered one afternoon by individual glass containers

and left in the laboratory overnight. Next morning, the inside of the lid of each container was

found to be covered in droplets of a fluid which proved to be water.

Conclusion: plants generally give off water as a vapour.

Critique

1. Lack of controls.

The water could have come from the plants, the soil, the pots, or the air in the jar.

Control experiments should have been set up to test for these possibilities.

2. The conclusion goes beyond the results.

(a) The experiment was done overnight and so can tell us nothing about the

behaviour of the plants at other times of day; the word 'generally' is not justified.


(b) It was carried out with an adequate number of bean plants but can tell us

nothing about other kinds of plants; the word 'plants' should be qualified.

(c) There is no evidence in the experiment that water is given off as a vapour.

Is a supermarket's own-brand washing powder as good as a nationally advertised one?

Eric Triton bemoaned the fact that his wife Ariel insisted on washing his clothes with their local

supermarket's own brand of powder. He was sure the well-known brand he saw performing

miracles on television most evenings would do better. He therefore set out to prove as much.

Mr Triton decided to compare the effectiveness of the two products on what his wife called

'difficult' dirt: grass stains on white linen handkerchiefs. He bought 4kg of the well-known

brand for £5.17 in their supermarket and noted that the same weight of the own-brand powder

would have cost £4.47. He followed the instructions on the packets exactly, weighing out the

same amount of powder and using their washing machine's programme for white linens. Mr

Triton was aware of the need for an index of 'cleanliness' and therefore devised a subjective

scale, ranging from 10 ('whiter than white') to 0 (the starting level of dirtiness).

Mr Triton's belief was substantially confirmed. He scored the handkerchief cleaned by the

national brand an impressive 8, whereas the own-brand powder only managed 7. Triumphantly,

he reported the outcome to his wife. Mrs Triton, however, was unimpressed. She pointed out to

her husband that there were several flaws in his experiment and convinced him that the outcome

was 'not proven'.

! What do you think were the flaws that Mrs Triton could have detected?

Further reading: Barnard C, Gilbert F and McGregor P (1993) Asking Questions in Biology,

Longmans.

There is a story about an eminent Professor at Cambridge who gave a paper at a scientific

meeting and was asked by a questioner "what statistical test did you use to verify your results?"

The Professor explained that he used his own statistical test:

"In our Department we have a long corridor with a notice board at one end. I draw a

histogram of my results, pin it to the notice board, then walk to the other end of the

corridor. If I can still see a difference between the treatments then it's significant"

The relevance of this story lies in what it does not say! If an experiment is designed and

executed properly - as we would expect of an eminent scientist - then the results often speak for

themselves. For example, this might be true of experiments in which mutants are generated (or

genes inserted) in an organism, giving a clear change of behaviour such as resistance to an

antibiotic or expression of a new trait. Such "all or nothing" effects seldom need to be backed

by statistical tests, but they still need good experimental design.

However, in many areas of biology we work with variable effects - differences in the growth

rates of organisms, quantitative differences in antibiotic resistance or in size or in rates of

biochemical reactions, etc. Then we not only need statistical tests to analyse those differences

but we also need good experimental design to ensure that we haven't biased our results in some way.


Good experimental design is the key to good science. But it's not as easy as it

might seem.

In many cases good experimental design involves having a clear idea about how we will

analyse the results when we get them. That's why statisticians often tell us to think about the

statistical tests we will use before we start an experiment.

1. Define the objectives. Record (i.e. write down) precisely what you want to test in an

experiment.

2. Devise a strategy. Record precisely how you can achieve the objective. This includes

thinking about the size and structure of the experiment - how many treatments? how many

replicates? how will the results be analysed?

3. Set down all the operational details. How will the experiment be performed in practice? In

what order will things be done? Should the treatments be randomised or follow a set structure?

Can the experiment be done in a day? Will there be time for lunch? etc.

If all this sounds trivial or obvious, then read on. It's not as easy as you think!

Example 1. Experiments that yield no useful results because we did not collect enough

data

Suppose that we want to test the results of a Mendelian genetic cross. We start with 2 parents of

genotype AABB and aabb (where A and a represent the dominant and recessive alleles of one

gene, and B and b represent the dominant and recessive alleles of another gene).

We know that all the F1 generation (first generation progeny of these parents) will have

genotype AaBb and that their phenotype will display both dominant alleles (e.g. in fruit flies all

the F1 generation will have red eyes rather than white eyes, and normal wings rather than

stubby wings).

This F1 generation will produce 4 types of gamete (AB, Ab, aB and ab), and when we self-cross

the F1 generation we will end up with a variety of F2 genotypes (see the table below).

Gametes    AB      Ab      aB      ab

AB       AABB    AABb    AaBB    AaBb
Ab       AABb    AAbb    AaBb    Aabb
aB       AaBB    AaBb    aaBB    aaBb
ab       AaBb    Aabb    aaBb    aabb

All these genotypes fall into 4 phenotypes: double dominant,

single dominant A, single dominant B and double recessive. And we know that in classical

Mendelian genetics the ratio of these phenotypes is 9:3:3:1
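If you want to check the 9:3:3:1 ratio for yourself, a short script can enumerate all 16 gamete combinations and count the phenotypes (a sketch; the two-letter phenotype codes are just a convenient notation):

```python
from collections import Counter
from itertools import product

gametes = ["AB", "Ab", "aB", "ab"]

def phenotype(g1: str, g2: str) -> str:
    """Phenotype code of the offspring of two gametes: each gene shows
    the dominant trait if either allele is the (upper-case) dominant one."""
    a = "A" if "A" in (g1[0], g2[0]) else "a"
    b = "B" if "B" in (g1[1], g2[1]) else "b"
    return a + b

counts = Counter(phenotype(g1, g2) for g1, g2 in product(gametes, repeat=2))
print(counts["AB"], counts["Ab"], counts["aB"], counts["ab"])   # 9 3 3 1
```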


Most people also know that we use a chi squared test to analyse the results of genetic crosses:

we do our experiment, count the number of F2 progeny that fall into the different categories,

and test to see if our results agree with an expectation. In this case, the expectation would be

that the results fit a 9:3:3:1 ratio.

But what you might not know is that a chi squared test is only valid if the expected count in every category is 5 or more (it does not matter what the actual count is in each category, but the expected count must be 5 or more). In other words, we MUST have at least 80

F2 progeny in order to use the chi squared test for this experiment, because then the smallest

category - double recessive - would have an expected 5 individuals in it (one-sixteenth of 80

being 5). [A fuller explanation of this is given in Chi squared test.]
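As a sketch of the calculation itself (with invented counts from a hypothetical batch of 160 F2 progeny), the chi squared statistic is just the sum of (observed - expected)²/expected:

```python
def chi_squared(observed, expected):
    """Chi squared statistic: the sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Invented counts for 160 F2 progeny, against a 9:3:3:1 expectation
observed = [95, 29, 27, 9]                        # must sum to 160
expected = [160 * r / 16 for r in (9, 3, 3, 1)]   # [90.0, 30.0, 30.0, 10.0]
assert min(expected) >= 5                         # the validity rule above

stat = chi_squared(observed, expected)
print(round(stat, 3))                             # 0.711
# Tabulated value for 3 degrees of freedom at P = 0.05 is 7.815:
print("fits 9:3:3:1" if stat < 7.815 else "significant departure")
```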

Similarly, for comparing two counts (e.g. counts from dilution plating of bacteria) by means of

a Poisson distribution, you will need to count about 30 colonies at the chosen dilution level.

[The principles underlying the Poisson distribution do not hold for counts lower than this]

As a different example along the same lines, we might want to compare the biomass produced

by plant callus culture in flasks containing different nutrient solutions. We know that we need

more than one flask of each nutrient solution (i.e. we need replicates), and we will use

Student's t-test to compare the mean growth in each solution. [Basically, a t-test compares the

difference between the two means in relation to the amount of variation within the treatments.

In other words, we get a significant result if the difference between the means is large and/or the

variation between replicates is small].

So, how many replicates should we use? This is a matter of judgement (and the available

resources) but if we look at a t-table we can make some rational decisions. If we use 2 flasks for

each treatment (4 flasks in total), we would have 2 degrees of freedom. This term is explained

elsewhere, but for now we can note that the number of degrees of freedom for each treatment is

one less than the number of replicates. In other words, with 2 treatments of 2 flasks each we

have 2 degrees of freedom. With 2 treatments of 10 flasks each we have 18 degrees of freedom.

When we analyse our results by Student's t-test, we calculate a t value and compare it with the t

value for probability of 0.05 in the t-table. Our treatments differ significantly if the calculated t

value is greater than the tabulated value.

Look at the tabulated t value (4.30) for 2 degrees of freedom. It is quite high, and we would

only find a significant difference between our treatments if we have quite a large difference

between the means and also little variation in our replicates. But if we used 4 replicates of each

treatment (6 degrees of freedom) we would have a much better chance of finding a significant

difference (t value of 2.45) between the same means. But look even further down the t-table -

e.g. downwards from 10 degrees of freedom (t-value 2.23) - and we see that we would gain very

little by using any more replicates. We would be in the realm of diminishing returns, gaining

very little for all the extra time and resources.
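The diminishing returns are easy to see by listing the tabulated critical values against the degrees of freedom they correspond to (two-tailed values at P = 0.05, copied from a standard t-table):

```python
# Two-tailed critical values of t at P = 0.05, from a standard t-table
T_CRIT_05 = {2: 4.303, 6: 2.447, 10: 2.228, 18: 2.101, 30: 2.042}

def degrees_of_freedom(replicates: int, treatments: int = 2) -> int:
    """Degrees of freedom for a two-sample t-test: one less than the
    number of replicates, summed over the treatments."""
    return treatments * (replicates - 1)

for reps in (2, 4, 6, 10):
    df = degrees_of_freedom(reps)
    print(f"{reps:2d} replicates per treatment -> df = {df:2d}, critical t = {T_CRIT_05[df]}")
```

Nearly all of the gain comes from moving beyond 2 replicates; past about 10 degrees of freedom the critical value barely falls.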

The message from these examples is that knowledge of the statistical test that we will use

helps us to design our experiment properly.

Example 2. Experiments that seem to give useful results but our procedures let us down!

Under this heading we deal with the actual process of doing an experiment - a task with many

hidden pitfalls.


We do the experiment, get nice results, analyse them (the appropriate test would be Analysis of

Variance) and find significant differences between the treatments. We write up the results, get a

Nobel Prize, or a good mark, or whatever. End of story.

Or is it? The answer to that question depends on how we did the experiment. For example,

there might have been a good "practical" (i.e. convenient) reason for setting up all replicates of

treatment 1, then (for example) changing the pipette and setting up all replicates of treatment 2,

and so on. The problem is: how can we be sure that the difference we found between treatments

was due to the treatments themselves, and was not influenced by the order in which we set them

up? Even if we DO feel sure, our experiment is inherently biased and nobody would trust the

results if we said how we did it! [There is an almost infinite number of reasons why the

conditions might change during the time taken to set up an experiment. For example, we might

get progressively more efficient, or more tired. The temperature of the water bath (or whatever)

might change slightly during this time. Each pipette will be slightly different from the next, etc.

etc.]

So, what about doing one replicate of treatment 1, then one of treatment 2, then 3, then 4, and

then doing a second replicate of treatment 1, a second of treatment 2, and so on? In truth, this

would remove only some of the inherent bias - on average, treatment 1 is still being set up

before treatment 2, etc.

The only way to overcome this is to plan in advance. We have basically two options.

We could randomise the order in which each test is set up, so that it is entirely a matter of

chance whether "treatment 1, replicate 1" is done first or last or somewhere in the middle.

We could structure the setting up (or the sampling of the experiment) so that one replicate

of each treatment is done first (in random order), then the next replicate of each treatment

is done (again in random order), and so on. This would be sensible if we knew (or

suspected) that the conditions are likely to change during the setting up of an experiment,

because Analysis of Variance (the statistical test we would use for our results) could

separate out the effect of "setting up time" from the effect of "treatments". This sort of

experimental design is termed a randomised block design. It is illustrated in the grid

below, where the rows represent blocks and the order of treatments (A, B, C and D)

within the blocks is determined randomly. [But note that the grid below is perfect - you

would seldom get that by randomly ordering the treatments within blocks. A grid of this

type is called a Latin square, and for some purposes it is probably better than a

randomised block]

Row 1 A B C D

Row 2 D C A B

Row 3 B A D C

Row 4 C D B A
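A layout like the grid above can be produced by shuffling the treatments independently within each block, for example (a minimal sketch; the function name is ours):

```python
import random

def randomised_blocks(treatments, n_blocks, seed=None):
    """One row (block) per replicate, with the order of the treatments
    shuffled independently within each block."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_blocks):
        row = list(treatments)
        rng.shuffle(row)
        rows.append(row)
    return rows

for i, row in enumerate(randomised_blocks("ABCD", 4, seed=1), start=1):
    print(f"Row {i}: {' '.join(row)}")
```

Unlike the perfect Latin square above, each run of this gives a genuinely random order within each block.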

And one last point - the same principles should be applied to other practical aspects of an

experiment. For example, have you ever tested the temperature gradient in a laboratory

incubator or (worse still) in a walk-in growth room? The temperature gauge might record

"30°C" but there is likely to be a vertical (or lateral) temperature gradient of 2-3°C or more. So

never put all the replicates of one treatment together. Randomise them.


For most experiments we would simply randomise the treatments and replicates. There is an

important statistical reason for this, because all the statistical procedures are based on the

fundamental assumption that variation is random: in other words, that it is determined by

chance alone. But "blocking" becomes useful, or even essential, if we know (or strongly

suspect) that "extraneous" factors will introduce variation that is irrelevant to the effects we

wish to test and that might mask the effects of our treatments. Here are two examples, and you

could think of many more along the same lines.

1. Suppose that we want to test the effect of fertiliser treatment (or pesticide treatment, etc.)

on a crop. We set up a field trial, with plots that will receive different treatments.

Everyone knows that conditions can vary across a field - the topsoil may be thinner or

drier in some places than in others, leading to differences in crop growth. We cannot

control this, so it would be sensible to mark out 3 (or more) blocks, and to have plots

(randomised treatments) within these blocks. Then, by Analysis of Variance, we will be

able to separate the variation due to our treatments from the variation caused by soil or

site factors (which are not relevant to the question we are trying to ask).

2. Suppose that an experiment will take a long time to set up or to sample, and we cannot be

sure that the conditions will remain constant during this time - the lab might warm up, the

humidity rise, and so on. Then it is sensible to block the experiment into time periods -

before coffee, after coffee, after lunch. Of course this is not ideal. But experiments are

done in the real world.

Suppose that we are measuring the size of cells, the height of trees, the biomass of microbial

cultures, the number of eggs in nests, or anything else. The thing that we are measuring or

recording (e.g. cell size, plant height, etc.) is called a variable.

Each measurement that we record (e.g. the size of each cell) is a value or observation.

We obtain a number of values (e.g. 100 for cells), and this is our sample.

The sample (e.g. 100 cells) is part of a population. In this case the population (in biological

terms) is all the cells in the culture (or all the trees in a forest, etc.). Theoretically, we could

measure every cell or tree to get a precise measure of that population. But often we want to be

able to say more than this - something of general significance, based on our sample. For

example, that if anyone were to measure the cells of that organism, then they would find a

certain average value and a certain range of variation. Here are 3 sorts of thing that you might

want to say.

The optimum temperature for growth of the bacterium Escherichia coli is 37°C, whereas the optimum temperature for Bacillus cereus is 30°C.

The average height of adult men in Britain is 175 cm, whereas the average height of

women is 162 cm.

In rats the LD50 of the drug "Zappit" is 3 mg per kg body weight. [LD50 is the calculated

(lethal) dose that kills 50% of the treated individuals]

General statements such as these will always be based on a sample, because we could never test

every possible strain of E. coli, nor measure every possible adult, nor test every possible rat that

could ever live. So, in these and in many other cases the population can be considered to be

infinite. That's the sense in which statisticians use the term "population" - for all the

possible measurements or events (i.e. all the possible values of a variable) of a particular

type that there could ever be.


The statistical procedures are based on quite complex mathematics. But that need not concern

us at all, because the procedures are actually very simple to apply. Basically, from our sample

we calculate:

The average, or mean. The notation for the sample mean is x̄ (pronounced "X bar").

Some measure of the dispersion (range of variation) of data around the sample mean. For this we use the variance (the notation is S²) and thence the standard deviation (S).

Having obtained those values, we use them to estimate the population mean and the

population variance. In order to distinguish between what we measure (samples) and what we wish to estimate (populations) from the samples, many statisticians use Greek letters for the population mean (which is denoted μ (mu) instead of x̄ for the sample mean) and the population variance (which is denoted σ² (sigma squared) instead of S² for the sample variance).
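In practice the sample statistics take one line each; for example, Python's statistics module uses the n - 1 divisor for the sample variance, as required here (the cell diameters below are invented for illustration):

```python
from statistics import mean, stdev, variance

# Invented sample: diameters of 10 cells, in micrometres
cells = [10.2, 11.1, 9.8, 10.5, 10.9, 9.6, 10.4, 10.7, 10.1, 10.3]

x_bar = mean(cells)      # sample mean (x-bar)
s2 = variance(cells)     # sample variance S^2 (uses the n - 1 divisor)
s = stdev(cells)         # sample standard deviation S

print(f"x-bar = {x_bar:.2f}, S^2 = {s2:.3f}, S = {s:.3f}")
# x-bar = 10.36, S^2 = 0.218, S = 0.467
```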

We cover these points under Descriptive statistics. But before we move on to that, we must

consider briefly how variation is distributed in statistical terms.

If we made repeated measurements of a continuous variable (people's heights, say, or the optical density of bacterial cultures read in a spectrophotometer) we would find a range of variation. If we were to put these measurements into appropriate categories or class intervals (for example, all the measurements that fall between 1.0 and less than 2.0, all between 2.0 and less than 3.0, and so on) and then plot the numbers in each category as a histogram, the bars would rise to a peak near the mean and fall away on either side.

Given enough measurements (and small enough class intervals), this would be a completely

symmetrical, bell-shaped curve. Data of this sort are said to be normally distributed. Most of

our measurements (data points) would be close to the mean, and progressively fewer would

depart widely from the mean.

Most of the statistical tests that we consider on this site are for normally distributed data.

But there are other types of distribution. For example, if we measured the heights of men and

women as a single population they might form a bimodal distribution - two humps with a dip

between them, because women, on average, are shorter than men. Then we should treat them as

two populations, not as a single one. Other types of data, such as counts, fall naturally into

"either/or" categories. These are treated in different ways, some of which are explained later.

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html Page 11 of 13

EXPERIMENTAL DESIGN 17/06/13 4:23 PM

archival purposes

Text and links may be out of date

Experimental Design

After reading this section, you should be able to describe the main stages involved in scientific

problem-solving.

Science is usually fun. Most scientists enjoy their work, and fortunately it is sometimes useful

to society. Since scientists are people rather than machines, they behave as other people do.

They can be bad-tempered, pig-headed, jealous of the success of others and untruthful. Even so,

the pattern of work employed by scientists, the scientific method, is the most powerful tool yet

devised for the analysis and solution of problems in the natural world. The method can be

applied as much to situations in everyday life as to conventional 'scientific' problems. The world

might even be a happier place if more people attempted to solve their problems in a 'scientific'

way.

- Carrying out the test, and, on the basis of the results, deciding whether the

explanation is a good one or not. In the latter case, a new explanation will (with

luck) 'spring to mind' as a result of the first test.

Originality

Science depends upon original thinking at several points. One is when we make the original

'guess' - usually called an hypothesis. Another is when we devise a test or experiment to show

how likely the hypothesis is to be correct. A good scientist relies on 'inspiration' in the same

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress2.html#SOME%20COMMON%20STATISTICAL%20TERMS Page 1 of 13

EXPERIMENTAL DESIGN 17/06/13 4:23 PM

way as a good artist. [Some teachers are concerned that this point is largely ignored in present-

day science education.]

The following examples constitute the sort of story you might read in a newspaper. We will try to analyse

these stories in terms of THE SCIENTIFIC METHOD. A general feature is the lack of a

rigorous test situation in the original stories; the strength of THE SCIENTIFIC METHOD

largely lies in devising tests that are capable of discriminating between different hypotheses.

Mr Smith of Morningside has taught his dog Rover to understand French. Mr Smith noticed that

every evening, after dinner, when he went to the door with his coat on and said "Walkies",

Rover immediately understood and came running. Mr Smith was going to France for the

summer, and, as an experiment in international understanding, decided to teach Rover French.

He started to say "Allons" instead of "Walkies". To his delight, Rover very quickly understood

and came running.

Analysis

Rover is probably responding to a complex, familiar situation (evening, after dinner, going to the door, coat on, call) of which what is actually called is only a small part. A change in the call may not matter much to the dog. Better tests would be to say "Allons" in a different situation, for example in the morning when he does not usually go for a walk, or to go through the rest of the routine (evening, after dinner) without calling anything.

The results of these and similar tests should indicate whether Rover is specifically responding to

the word "allons", or (more likely) to an overall situation he is well used to.

Notice that these tests do not tell us anything of a dog's ability to learn French words. They are

only concerned with the specific case of responding to one French word. We will see later that


extrapolating from the specific to the general is very important in scientific methodology.

The Institute for Psychical Research conducted a study on the performance of well-known

fortune-tellers. The most positive results involve Arnold Woodchuck who, at the start of each

year, makes a series of ten predictions for the coming year in a national tabloid newspaper. The

Institute has found that Mr Woodchuck is consistently correct in about 80% of his predictions.

For example, for 1995 he predicted a political crisis in Europe (the former Yugoslavia?), a

major human disaster in Africa (Rwanda?), a dispute over public sector pay (nurses?) and the

demise of a prominent football manager (Mr Graham?). He was unfortunately wrong in

predicting that England would win the Rugby Union World Cup. A spokesman for the Institute

was 'optimistic' about future studies on Mr Woodchuck.

Analysis

The apparent observation is that Mr Woodchuck has got more predictions correct than would

have been expected by chance. The Institute's hypothesis would be that Mr Woodchuck has

some kind of 'psychic powers'. Can we devise an alternative hypothesis?

We are dealing here with probability. If we toss an unbiased coin we get on average the same

number of heads as tails. If we asked someone to predict the outcome of the toss, we would not

be terribly surprised if from a small number of trials, he got 4 out of 5 right. But if he continued

to achieve 80% success over a long series, we would begin to suspect: (1) a biased coin; (2)

cheating; (3) psychic powers.
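The chance element here is easy to quantify. The sketch below (illustrative Python, not part of the original text) uses the binomial distribution with a fair 50:50 "coin" to show why 4 right out of 5 is unremarkable, while sustained 80% success over many trials is not:

```python
from math import comb

# Probability of correctly guessing at least k of n fair coin tosses by
# chance alone (binomial distribution with p = 0.5).
def p_at_least(k, n, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(p_at_least(4, 5))   # 0.1875 -- 4 or 5 right out of 5 is unremarkable
print(p_at_least(40, 50)) # ~1e-5 -- sustained 80% over 50 trials is not
```

So a long run of 80% success is only impressive if each individual prediction really had a roughly even chance of being right.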

Mr Woodchuck regularly gets 80%. Is his 'coin' biased, is he cheating, or does he have psychic

powers? The most likely explanation is the 'biased coin' one, i.e. that the events he predicts do

not have a 1:1 probability, but perhaps a probability nearer to 4 or 5:1 on; in other words, a very

high probability that they will occur.

For example, almost invariably every year there is at least one 'political crisis' in Europe and a

'major human disaster' in Africa. Similarly, football managers have a short shelf-life. Public

sector employees (such as nurses, railway signalmen or indeed University teachers) have for

years perceived themselves to be underpaid whilst their masters either cannot or will not

respond appropriately. In contrast, the chances of England's winning the Rugby Union World

Cup were over-stated by the English press - and this is a prediction that failed.

A more rigorous test would ask Mr Woodchuck to predict events which really do have a low probability.

Again, the results of this investigation would be limited. They would probably show that the

'biased coin' explanation is the most likely. They would not show (a) whether Mr Woodchuck

has some kind of psychic power; or (b) whether psychic powers are possible.

Notice also that even a large deviation from an expected result can occur by chance in a small

sample (e.g., getting 4 out of 5 coin-tossing guesses right). This is very important in Biology,

and the basis of the use of statistical methods in biological analysis.


After reading this section you should be able to discriminate between good and bad

experimental design.

Experimental Design

The design of a suitable experiment to test an hypothesis often requires some ingenuity and a

suspicious nature. In modern biology, the experiment may involve very sophisticated

equipment. But there are a number of features common to all good experiments (and often

absent from bad ones) which exist whatever the technical details. In summary these are:

Discrimination

A good experiment should discriminate between rival hypotheses; it often turns out that two or more hypotheses give indistinguishable results when tested by poorly-designed experiments.

Replication

Living material is notoriously variable. Usually experiments must be repeated enough times for

the results to be analysed statistically. Similarly, because of biological variability, we must be

cautious of generalising our results either from individual creatures to others of the same

species, or to other species. For instance, if our hypothesis is about mammals, it is inadequate

simply to carry out our experiments on laboratory rats. Similarly, it is dangerous to extrapolate

from healthy students to elite athletes.

Controls

The experiment must be well controlled. We must eliminate by proper checks the possibility

that other factors in the overall test situation produce the effect we are observing, rather than the

factor we are interested in.

A well-known example concerns the claim that human growth hormone secretion is stimulated by the amino acid arginine. This was shown by injecting volunteers with arginine. As a control, the

investigators injected the volunteers with a saline solution. To their surprise, growth hormone

was again secreted. The investigators then waved a syringe and needle in front of their

volunteers, and found that that provoked growth hormone secretion too. Growth hormone is

now known to be secreted in response to stress (as well as arginine).

At a more technical level, we must be sure that our method of measurement is reproducible

from day to day, between operators in the same laboratory, or between laboratories. Whilst we

might be confident about a balance or a ruler, can we be as sure about, say, a method for

measuring haemoglobin? Do two groups of students measuring the same samples by the same

methods produce the same results? Quality control helps here.

'Blind' Designs

Investigators can subconsciously 'fudge' their data if they know what result they want to find.

The answer is to do the experiment 'blind', so the investigators (and the subjects, if humans are

being studied) do not know which treatment's effect they are observing. This can make the

logistics of doing the experiment more complex: for example, when determining the

haemoglobin concentration of male and female class members.

There is a story about a professor who devised a maze for measuring the intelligence of rats.

One day he gave his technicians, who actually made the measurements, three groups of rats. He


told them one group had been specially bred for intelligence, one for stupidity and the third was

average. The technicians assessed the rats' intelligence and confirmed that the 'bright' group

performed the best and the 'stupid' group the worst. The point is, of course, that the professor

had put animals into the three groups at random. They did not differ in intelligence.

Measurement

Good experiments often, though not always, involve measuring something: a weight, say. When

you make measurements, it is important you know both the accuracy and the precision of your

measuring system. These two terms are not synonymous: 'accuracy' means the ability of the

method to give an unbiased answer on average, whereas 'precision' is an index of the method's

reproducibility. Ideally your method should be both accurate (i.e., give the true mean) and

precise (i.e., have a low standard deviation). Sometimes one is more important than the other.

For example, if you were looking for small changes with time in a quantity (such as an athlete's

haemoglobin concentration), you would need a precise measure of it rather more than an

accurate one.

Accuracy and precision together help you to judge the reliability of your data. They also help

you to judge to how many significant figures you should quote your results. For example, if you

use a balance reading to the nearest gram, you should give the results to the nearest gram and

not, say, to the nearest tenth of a gram.

Some experiments are very difficult to do because it is not obvious what can be measured. This

is a real problem in animal behaviour: for example, there is no obvious unit or measure for

'emotional state'. It is usually necessary to isolate measurable components of behaviour. Thus

the speed at which a tiger paces up and down a cage can give some indication of the internal

state of the animal but can never give a full picture of it.

Many of these points are rather abstract, but they should become clearer when you think about

the following examples.

Experiment:

Forty bean plants, growing in pots, were covered one afternoon by individual glass containers

and left in the laboratory overnight. Next morning, the inside of the lid of each container was

found to be covered in droplets of a fluid which proved to be water.

Conclusion:

Plants generally give off water vapour.

Critique

1. Lack of controls.

The water could have come from the plants, the soil, the pots, or the air in the jar.

Control experiments should have been set up to test for these possibilities.

2. Limited conclusions.

(a) The experiment was done overnight and so can tell us nothing about the

behaviour of the plants at other times of day; the word 'generally' is not justified.


(b) It was carried out with an adequate number of bean plants but can tell us

nothing about other kinds of plants; the word 'plants' should be qualified.

(c) There is no evidence in the experiment that water is given off as a vapour.

Is a supermarket's own-brand washing powder as good as a heavily advertised one?

Eric Triton bemoaned the fact that his wife Ariel insisted on washing his clothes with their local

supermarket's own brand of powder. He was sure the well-known brand he saw performing

miracles on television most evenings would do better. He therefore set out to prove as much.

Mr Triton decided to compare the effectiveness of the two products on what his wife called

'difficult' dirt: grass stains on white linen handkerchiefs. He bought 4kg of the well-known

brand for £5.17 in their supermarket and noted that the same weight of the own-brand powder

would have cost £4.47. He followed the instructions on the packets exactly, weighing out the

same amount of powder and using their washing machine's programme for white linens. Mr

Triton was aware of the need for an index of 'cleanliness' and therefore devised a subjective

scale, ranging from 10 ('whiter than white') to 0 (the starting level of dirtiness).

Mr Triton's belief was substantially confirmed. He scored the handkerchief cleaned by the

national brand an impressive 8, whereas the own-brand powder only managed 7. Triumphantly,

he reported the outcome to his wife. Mrs Triton, however, was unimpressed. She pointed out to

her husband that there were several flaws in his experiment and convinced him that the outcome

was 'not proven'.

- What do you think were the flaws that Mrs Triton could have detected?

Further reading: Barnard C, Gilbert F and McGregor P (1993) Asking Questions in Biology,

Longmans.

There is a story about an eminent Professor at Cambridge who gave a paper at a scientific

meeting and was asked by a questioner "what statistical test did you use to verify your results?"

The Professor explained that he used his own statistical test:

"In our Department we have a long corridor with a notice board at one end. I draw a

histogram of my results, pin it to the notice board, then walk to the other end of the

corridor. If I can still see a difference between the treatments then it's significant"

The relevance of this story lies in what it does not say! If an experiment is designed and

executed properly - as we would expect of an eminent scientist - then the results often speak for

themselves. For example, this might be true of experiments in which mutants are generated (or

genes inserted) in an organism, giving a clear change of behaviour such as resistance to an

antibiotic or expression of a new trait. Such "all or nothing" effects seldom need to be backed

by statistical tests, but they still need good experimental design.

However, in many areas of biology we work with variable effects - differences in the growth

rates of organisms, quantitative differences in antibiotic resistance or in size or in rates of

biochemical reactions, etc. Then we not only need statistical tests to analyse those differences

but we also need good experimental design to ensure that we haven't biased our results in some way.


Good experimental design is the key to good science. But it's not as easy as it

might seem.

In many cases good experimental design involves having a clear idea about how we will

analyse the results when we get them. That's why statisticians often tell us to think about the

statistical tests we will use before we start an experiment.

1. Define the objectives. Record (i.e. write down) precisely what you want to test in an

experiment.

2. Devise a strategy. Record precisely how you can achieve the objective. This includes

thinking about the size and structure of the experiment - how many treatments? how many

replicates? how will the results be analysed?

3. Set down all the operational details. How will the experiment be performed in practice? In

what order will things be done? Should the treatments be randomised or follow a set structure?

Can the experiment be done in a day? Will there be time for lunch? etc.

If all this sounds trivial or obvious, then read on. It's not as easy as you think!

Example 1. Experiments that yield no useful results because we did not collect enough

data

Suppose that we want to test the results of a Mendelian genetic cross. We start with 2 parents of

genotype AABB and aabb (where A and a represent the dominant and recessive alleles of one

gene, and B and b represent the dominant and recessive alleles of another gene).

We know that all the F1 generation (first generation progeny of these parents) will have

genotype AaBb and that their phenotype will display both dominant alleles (e.g. in fruit flies all

the F1 generation will have red eyes rather than white eyes, and normal wings rather than

stubby wings).

This F1 generation will produce 4 types of gamete (AB, Ab, aB and ab), and when we self-cross

the F1 generation we will end up with a variety of F2 genotypes (see the table below).

                Gametes
Gametes    AB     Ab     aB     ab
   AB      AABB   AABb   AaBB   AaBb
   Ab      AABb   AAbb   AaBb   Aabb
   aB      AaBB   AaBb   aaBB   aaBb
   ab      AaBb   Aabb   aaBb   aabb

All these genotypes fall into 4 phenotypes: double dominant, single dominant A, single dominant B and double recessive. And we know that in classical Mendelian genetics the ratio of these phenotypes is 9:3:3:1.
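The cross can be checked by brute force. The sketch below (illustrative Python, not part of the original text) enumerates the 16 gamete pairings from the table above and tallies the resulting phenotypes:

```python
from collections import Counter
from itertools import product

# Enumerate all 16 gamete pairings of the F2 cross and classify each
# genotype into its phenotype (dominant if any dominant allele present).
gametes = ["AB", "Ab", "aB", "ab"]

def phenotype(g1, g2):
    a = "A" if "A" in (g1[0], g2[0]) else "a"   # gene 1: dominant if any A
    b = "B" if "B" in (g1[1], g2[1]) else "b"   # gene 2: dominant if any B
    return a + b

counts = Counter(phenotype(g1, g2) for g1, g2 in product(gametes, repeat=2))
print(counts)  # Counter({'AB': 9, 'Ab': 3, 'aB': 3, 'ab': 1}) -- the 9:3:3:1 ratio
```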


Most people also know that we use a chi squared test to analyse the results of genetic crosses:

we do our experiment, count the number of F2 progeny that fall into the different categories,

and test to see if our results agree with an expectation. In this case, the expectation would be

that the results fit a 9:3:3:1 ratio.

But what you might not know is that a chi squared test is only valid if the expected count in every category is 5 or more (it does not matter what the observed count is in each category, but the expected count must be 5 or more). In other words, we MUST have at least 80

F2 progeny in order to use the chi squared test for this experiment, because then the smallest

category - double recessive - would have an expected 5 individuals in it (one-sixteenth of 80

being 5). [A fuller explanation of this is given in Chi squared test.]
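The goodness-of-fit calculation itself is simple. A minimal sketch, using hypothetical observed counts that total exactly 80 so that every expected category reaches 5:

```python
# Goodness-of-fit test of hypothetical F2 counts against a 9:3:3:1 ratio.
# The observed counts below are illustrative only.
observed = [48, 18, 13, 1]
total = sum(observed)
expected = [total * r / 16 for r in (9, 3, 3, 1)]   # [45.0, 15.0, 15.0, 5.0]

# Chi squared statistic: sum of (observed - expected)^2 / expected.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))   # 4.27
# Tabulated chi squared for 3 degrees of freedom at P = 0.05 is 7.81:
print(chi2 > 7.81)      # False -- no significant departure from 9:3:3:1
```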

Similarly, for comparing two counts (e.g. counts from dilution plating of bacteria) by means of

a Poisson distribution, you will need to count about 30 colonies at the chosen dilution level.

[The principles underlying the Poisson distribution do not hold for counts lower than this]

As a different example along the same lines, we might want to compare the biomass produced

by plant callus culture in flasks containing different nutrient solutions. We know that we need

more than one flask of each nutrient solution (i.e. we need replicates), and we will use

Student's t-test to compare the mean growth in each solution. [Basically, a t-test compares the

difference between the two means in relation to the amount of variation within the treatments.

In other words, we get a significant result if the difference between the means is large and/or the

variation between replicates is small].

So, how many replicates should we use? This is a matter of judgement (and the available

resources) but if we look at a t-table we can make some rational decisions. If we use 2 flasks for

each treatment (4 flasks in total), we would have 2 degrees of freedom. This term is explained

elsewhere, but for now we can note that the number of degrees of freedom for each treatment is

one less than the number of replicates. In other words, with 2 treatments of 2 flasks each we

have 2 degrees of freedom. With 2 treatments of 10 flasks each we have 18 degrees of freedom.

When we analyse our results by Student's t-test, we calculate a t value and compare it with the t

value for probability of 0.05 in the t-table. Our treatments differ significantly if the calculated t

value is greater than the tabulated value.

Look at the tabulated t value (4.30) for 2 degrees of freedom. It is quite high, and we would

only find a significant difference between our treatments if we have quite a large difference

between the means and also little variation in our replicates. But if we used 4 replicates of each

treatment (6 degrees of freedom) we would have a much better chance of finding a significant

difference (t value of 2.45) between the same means. But look even further down the t-table -

e.g. downwards from 10 degrees of freedom (t-value 2.23) - and we see that we would gain very

little by using any more replicates. We would be in the realm of diminishing returns, gaining

very little for all the extra time and resources.
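The diminishing returns are easy to see by tabulating the critical values. The sketch below hard-codes two-tailed t values at P = 0.05 from a standard t-table (the flask numbers are illustrative):

```python
# Critical two-tailed t values at P = 0.05, taken from a standard t-table,
# showing the diminishing returns of extra replication.
t_crit = {2: 4.30, 4: 2.78, 6: 2.45, 10: 2.23, 18: 2.10, 30: 2.04}

# With 2 treatments of n flasks each, degrees of freedom = 2 * (n - 1).
for n in (2, 4, 6, 10):
    df = 2 * (n - 1)
    print(f"{n} flasks per treatment: df = {df}, critical t = {t_crit[df]}")
```

The critical value falls steeply between 2 and 6 degrees of freedom but barely changes beyond 10, which is exactly the "diminishing returns" argument made above.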

The message from these examples is that knowledge of the statistical test that we will use

helps us to design our experiment properly.

Example 2. Experiments that seem to give useful results but our procedures let us down!

Under this heading we deal with the actual process of doing an experiment - a task with many

hidden pitfalls.


We do the experiment, get nice results, analyse them (the appropriate test would be Analysis of

Variance) and find significant differences between the treatments. We write up the results, get a

Nobel Prize, or a good mark, or whatever. End of story.

Or is it? The answer to that question depends on how we did the experiment. For example,

there might have been a good "practical" (i.e. convenient) reason for setting up all replicates of

treatment 1, then (for example) changing the pipette and setting up all replicates of treatment 2,

and so on. The problem is: how can we be sure that the difference we found between treatments

was due to the treatments themselves, and was not influenced by the order in which we set them

up? Even if we DO feel sure, our experiment is inherently biased and nobody would trust the

results if we said how we did it! [There is an almost infinite number of reasons why the

conditions might change during the time taken to set up an experiment. For example, we might

get progressively more efficient, or more tired. The temperature of the water bath (or whatever)

might change slightly during this time. Each pipette will be slightly different from the next, etc.

etc.]

So, what about doing one replicate of treatment 1, then one of treatment 2, then 3, then 4, and

then doing a second replicate of treatment 1, a second of treatment 2, and so on? In truth, this

would remove only some of the inherent bias - on average, treatment 1 is still being set up

before treatment 2, etc.

The only way to overcome this is to plan in advance. We have basically two options.

We could randomise the order in which each test is set up, so that it is entirely a matter of

chance whether "treatment 1, replicate 1" is done first or last or somewhere in the middle.

We could structure the setting up (or the sampling of the experiment) so that one replicate

of each treatment is done first (in random order), then the next replicate of each treatment

is done (again in random order), and so on. This would be sensible if we knew (or

suspected) that the conditions are likely to change during the setting up of an experiment,

because Analysis of Variance (the statistical test we would use for our results) could

separate out the effect of "setting up time" from the effect of "treatments". This sort of

experimental design is termed a randomised block design. It is illustrated in the grid

below, where the rows represent blocks and the order of treatments (A, B, C and D)

within the blocks is determined randomly. [But note that the grid below is perfect - you

would seldom get that by randomly ordering the treatments within blocks. A grid of this

type is called a Latin square, and for some purposes it is probably better than a

randomised block]

Row 1 A B C D

Row 2 D C A B

Row 3 B A D C

Row 4 C D B A

And one last point - the same principles should be applied to other practical aspects of an

experiment. For example, have you ever tested the temperature gradient in a laboratory

incubator or (worse still) in a walk-in growth room? The temperature gauge might record

"30°C" but there is likely to be a vertical (or lateral) temperature gradient of 2-3°C or more. So

never put all the replicates of one treatment together. Randomise them.
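Both options are trivial to set up in advance. A sketch (hypothetical treatment labels A-D, using Python's random module):

```python
import random

# Sketch of the two options described above.
treatments = ["A", "B", "C", "D"]
n_blocks = 4

# Option 1: completely randomised -- the run order is pure chance.
runs = [(t, rep) for t in treatments for rep in range(1, n_blocks + 1)]
random.shuffle(runs)

# Option 2: randomised block design -- one replicate of every treatment
# per block, order shuffled independently within each block.
blocks = []
for _ in range(n_blocks):
    order = treatments[:]          # copy, so the original list is untouched
    random.shuffle(order)
    blocks.append(order)

for i, order in enumerate(blocks, start=1):
    print("Row", i, " ".join(order))
```

Unlike the "perfect" Latin square shown above, a genuinely random ordering may repeat a treatment in the same column position across blocks; that is expected and does not invalidate the design.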


For most experiments we would simply randomise the treatments and replicates. There is an

important statistical reason for this, because all the statistical procedures are based on the

fundamental assumption that variation is random: in other words, that it is determined by

chance alone. But "blocking" becomes useful, or even essential, if we know (or strongly

suspect) that "extraneous" factors will introduce variation that is irrelevant to the effects we

wish to test and that might mask the effects of our treatments. Here are two examples, and you

could think of many more along the same lines.

1. Suppose that we want to test the effect of fertiliser treatment (or pesticide treatment, etc.)

on a crop. We set up a field trial, with plots that will receive different treatments.

Everyone knows that conditions can vary across a field - the topsoil may be thinner or

drier in some places than in others, leading to differences in crop growth. We cannot

control this, so it would be sensible to mark out 3 (or more) blocks, and to have plots

(randomised treatments) within these blocks. Then, by Analysis of Variance, we will be

able to separate the variation due to our treatments from the variation caused by soil or

site factors (which are not relevant to the question we are trying to ask).

2. Suppose that an experiment will take a long time to set up or to sample, and we cannot be

sure that the conditions will remain constant during this time - the lab might warm up, the

humidity rise, and so on. Then it is sensible to block the experiment into time periods -

before coffee, after coffee, after lunch. Of course this is not ideal. But experiments are

done in the real world.

Suppose that we are measuring the size of cells, the height of trees, the biomass of microbial

cultures, the number of eggs in nests, or anything else. The thing that we are measuring or

recording (e.g. cell size, plant height, etc.) is called a variable.

Each measurement that we record (e.g. the size of each cell) is a value or observation.

We obtain a number of values (e.g. 100 for cells), and this is our sample.

The sample (e.g. 100 cells) is part of a population. In this case the population (in biological

terms) is all the cells in the culture (or all the trees in a forest, etc.). Theoretically, we could

measure every cell or tree to get a precise measure of that population. But often we want to be

able to say more than this - something of general significance, based on our sample. For

example, that if anyone were to measure the cells of that organism, then they would find a

certain average value and a certain range of variation. Here are 3 sorts of thing that you might

want to say.

The optimum temperature for growth of the bacterium Escherichia coli is 37°C, whereas the optimum temperature for Bacillus cereus is 30°C.

The average height of adult men in Britain is 175 cm, whereas the average height of women is 162 cm.

In rats the LD50 of the drug "Zappit" is 3 mg kg⁻¹ body weight. [The LD50 is the calculated dose that kills 50% of the treated individuals.]

General statements such as these will always be based on a sample, because we could never test

every possible strain of E. coli, nor measure every possible adult, nor test every possible rat that

could ever live. So, in these and in many other cases the population can be considered to be

infinite. That's the sense in which statisticians use the term "population" - for all the

possible measurements or events (i.e. all the possible values of a variable) of a particular

type that there could ever be.


The statistical procedures are based on quite complex mathematics. But that need not concern

us at all, because the procedures are actually very simple to apply. Basically, from our sample

we calculate:

The average, or mean. The notation for the sample mean is x̄ (pronounced "X bar").

Some measure of the dispersion (range of variation) of data around the sample mean. For this we use the variance (the notation is S²) and thence the standard deviation (S).

Having obtained those values, we use them to estimate the population mean and the population variance. In order to distinguish between what we measure (samples) and what we wish to estimate (populations) from the samples, many statisticians use Greek letters for the population mean (which is denoted μ (mu) instead of x̄ for the sample mean) and the population variance (which is denoted σ² (sigma squared) instead of S² for the sample variance).
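These quantities are straightforward to compute. A minimal sketch with a small made-up sample (note the divisor n - 1 used for the sample variance):

```python
from math import sqrt

# Sample mean, variance and standard deviation for a small made-up sample.
sample = [12.0, 15.0, 11.0, 14.0, 13.0]
n = len(sample)

mean = sum(sample) / n                                     # x-bar
variance = sum((x - mean) ** 2 for x in sample) / (n - 1)  # S squared
std_dev = sqrt(variance)                                   # S

print(mean, variance, round(std_dev, 3))  # 13.0 2.5 1.581
```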

We cover these points under Descriptive statistics. But before we move on to that, we must

consider briefly how variation is distributed in statistical terms.

If we made repeated measurements of some continuous variable (for example, optical density readings from a spectrophotometer) we would find a range of variation. As we made more and more

measurements of this type they would display a continuous range of variation. If we were to put

these measurements into appropriate categories or class intervals (for example, all the

measurements that fall between 1.0 and less than 2.0, all between 2.0 and less than 3.0, and so

on) and then plot the numbers in each category as a histogram, the bars would rise to a peak near the mean and fall away on either side.

Given enough measurements (and small enough class intervals), this would be a completely

symmetrical, bell-shaped curve. Data of this sort are said to be normally distributed. Most of

our measurements (data points) would be close to the mean, and progressively fewer would

depart widely from the mean.
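The sorting of measurements into class intervals can be sketched in a few lines of Python. The measurements here are simulated (mean 5.0, standard deviation 1.0), purely for illustration:

```python
import random
from collections import Counter

# Simulated measurements: 10 000 values drawn from a normal
# distribution with mean 5.0 and standard deviation 1.0 (illustrative).
random.seed(1)
measurements = [random.gauss(5.0, 1.0) for _ in range(10_000)]

# Sort each measurement into a class interval of width 1.0
# (all values from 1.0 to <2.0, from 2.0 to <3.0, and so on).
bins = Counter(int(x) for x in measurements)

# Most values land in the intervals nearest the mean (4-5 and 5-6),
# and the counts tail off symmetrically on either side.
for interval in sorted(bins):
    print(f"{interval}.0 to <{interval + 1}.0: {bins[interval]}")
```

Plotting these counts as a histogram gives the symmetrical bell shape described above.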

Most of the statistical tests that we consider on this site are for normally distributed data.

But there are other types of distribution. For example, if we measured the heights of men and

women as a single population they might form a bimodal distribution - two humps with a dip

between them, because women, on average, are shorter than men. Then we should treat them as

two populations, not as a single one. Other types of data, such as counts, fall naturally into

"either/or" categories. These are treated in different ways, some of which are explained later.


DESCRIPTIVE STATISTICS 17/06/13 4:23 PM

archival purposes

Text and links may be out of date

DESCRIPTIVE STATISTICS

Sounds boring, but it's about giving the most - and most useful - information from a set of

data.

When we take measurements or record data - for example, the height of people - we cannot

possibly measure every person in the world (or, as another example, every cell of a particular

type of bacterium). Instead, we have to take a representative sample, and from that sample we

might wish to say something of wider significance - something about the population (e.g. all

the people in the world, or all the bacteria of that type). So, we use samples as estimates of

populations. But in many cases they can only be estimates, because if our sample size had been

greater (or if we had measured a different sample) then our estimate would have been slightly

different. Statistical techniques are based on probability, and enable us to make the jump from

samples to populations. But we should never lose sight of the fact that our initial sample can

only be an estimate of a population.

In the following sections we will start from a small sample, describe it in statistical terms, and

then use it to derive estimates of a population.

______________________________________

A sample

Here are some values of a variable: 120, 125, 160, 150.

We will assume that they are measurements of the diameter of 4 cells, but they could be the

mass of 4 cultures, the lethal dose of a drug in 4 experiments with different batches of

experimental animals, the heights of 4 plants, or anything else. Each value is a replicate - a

repeat of a measurement of the variable.

In statistical terms, these data represent our sample. We want to summarize these data in the

most meaningful way. So, we need to state:

the mean, and the number of measurements (n) that it was based on

a measure of the variability of the data about the mean (which we express as the

standard deviation)

other useful information derived from the mean and standard deviation, such as (1) the

range within which 95% or 99% or 99.9% of measurements of this sort would be

expected to fall - the prediction intervals, and (2) the range of means that we could

expect 95% or 99% or 99.9% of the time if we were to repeat the same type of

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress3.html#DESCRIPTIVE%20STATISTICS Page 1 of 7


measurement again and again on different samples - this is often called the confidence

interval.

Now we will go through these points, explaining the meaning of the procedures. If you are

familiar with all this, you can go straight to Describing a population: practical steps.

There is a simple formula for calculating the variance of the sample (S2). It is given below, but

first we should see how it is derived.

Each data point (measurement) in our sample differs from the mean by an amount called the

deviation (d). We could, in theory, find each d value by subtraction (keeping the sign as + or -),

then square each deviation, add all the d² values (to get the sum of squares of the deviations,

shortened to the sum of squares) and divide this by n-1 to give the variance, where n is the

number of observations in our sample. We can then obtain the standard deviation (notation,

S), which is the square root of the variance.

You should just accept this as standard and necessary practice! However, it has a reason, based

on the fact that we almost always use the sample variance to obtain an estimate of the

population variance (a population being all the measurements or events of the same type that

could ever be found). Put in simple terms, the population variance is derived from the sample

mean and from the deviation (d) of each measurement from the sample mean. But if we lacked

any one of these measurements (the mean or a single d value) we could calculate it from the

other information. So, with n measurements (data points) only n-1 of them are free to vary when

we know the mean - we could calculate the missing one. "n-1" is therefore the number of

degrees of freedom of our data.

For each observation (x) the deviation (d) from the mean (x̄) is x − x̄.

Therefore d² = (x − x̄)²

Expanding this equation, we get: d² = x² − 2x·x̄ + (x̄)²

To obtain the sum of squares of the deviations, we sum both sides of this equation (the capital letter sigma, Σ, = sum of):

Σd² = Σx² − 2x̄Σx + n(x̄)²

From this equation we can derive the following important equation for the sum of squares, Σd²:

Σd² = Σx² − (Σx)² / n
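As a quick check of this formula, the sum of squares and variance for the four replicate values used in the worked example on this page can be computed directly (a minimal Python sketch):

```python
import math

# The four replicate values from the worked example on this page.
sample = [120, 125, 160, 150]
n = len(sample)

sum_x = sum(sample)                   # Σx  = 555
sum_x2 = sum(x * x for x in sample)   # Σx² = 78125

# Sum of squares of the deviations: Σd² = Σx² - (Σx)²/n
sum_d2 = sum_x2 - sum_x ** 2 / n      # 1118.75

variance = sum_d2 / (n - 1)           # S² = 372.9167 (n - 1 degrees of freedom)
std_dev = math.sqrt(variance)         # S  = 19.311
```

Both routes give the same answer (squaring each deviation directly, or using Σx² − (Σx)²/n); the second avoids calculating each deviation from the mean.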


If we present the mean ± standard deviation, we will have summarised in just two numbers the most important properties of the sample that we used. This also is our estimate of the mean (μ) and standard deviation (sigma, σ) of the population.

This is the conventional way in which you see data published. For example, if the four values (120, 125, 160, 150) given earlier were the diameters of four cells, measured in micrometres, then we would say that the mean cell diameter was 138.8 ± 19.31 μm (see the worked example later).

What we have done so far is useful, but not useful enough! If we think about it, we will see that

the estimate of the population mean (i.e. all the measurements that we ever could make of the

same type - the diameter of this type of cell, etc.) has been fixed by the sample of four

individuals. If we did the experiment again by measuring another 4 cells then we almost

certainly would get a different mean.

What we really want to know is "how good is our estimate of the mean?", not "how much

variation was found around this particular estimate of the mean?" We do this by calculating the

standard error (standard deviation of the mean).

Imagine that we repeat the experiment again and again, with a different set of cells (or other

types of variable). Each time, we would get a slightly different mean, but if we were to plot a

frequency distribution (histogram) of the means then it would show a normal distribution:

We could find the mean of the means and then calculate a standard deviation of it (not the

standard deviation around a single mean). By convention, this standard deviation of the mean is

called the standard error (SE) or standard error of the mean (SEM).


We do not need to repeat our experiment many times for this, because there is a simple statistical way of estimating σn, based on: σn = σ / √n. (For this we are using S as an estimate of σ.)

So, if we had a sample of 4 values (120, 125, 160, 150) and the mean with standard deviation (± σ) was 138.8 ± 19.31 μm, then the mean with standard error (± σn) would be 138.8 ± 9.65 μm, because we divide σ by √n (which is 2) to obtain the standard error.
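The division by the square root of n can be sketched in the same way, with the same four values:

```python
import math

sample = [120, 125, 160, 150]
n = len(sample)
mean = sum(sample) / n                                         # 138.75

# Sample standard deviation (n - 1 degrees of freedom).
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))  # 19.311

# Standard error of the mean: s divided by the square root of n.
standard_error = s / math.sqrt(n)                              # 9.655
```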

Note 1. You don't need to understand the mathematics behind the use of √n. Look on it as a "sliding scale" that compensates for the number of values (data points) in the original sample.

Note 2. If you go to Deciphering the data in publications, you will see the value of expressing

results as mean ± standard error.

A calculated value for a standard deviation or a standard error has little practical use in itself.

But it becomes meaningful when we use it to calculate confidence intervals. We can do this

easily by multiplying a standard deviation or a standard error by a t value obtained from a table

of t. The confidence intervals show us the range within which 95% or 99% or 99.9% of

observations could be expected to lie.

We will illustrate this with the 4 values that we mentioned above (120, 125, 160, 150). We found a mean with standard error of the mean (138.8 ± 9.65 μm). To put confidence intervals on this, we multiply 9.65 by a t value.

If we had measured an infinite number of cells we would consult the bottom line of a table of t

values (Student's t-test), which appears as follows.

p:       0.05      0.01      0.001
        (95%)     (99%)     (99.9%)
df ∞:    1.96      2.58      3.29

We select the level of confidence we want (usually 95% in biological work - see the notes below) and multiply σn by the tabulated value. If the mean ± σn was 138.8 ± 9.65 μm, then the 95% confidence interval would be 138.8 ± (9.65 × 1.96) μm, or 138.8 ± 18.91 μm. In other words, if we were to repeat this experiment over and over again then in 95% of cases the mean could be expected to fall within the range of values 119.89 to 157.71. These limiting values are the confidence limits.

But our sample was not infinite - we had 4 measurements - so we use the t value corresponding

to 4 measurements, not to ∞. The t table shows degrees of freedom (df), which are always one

less than the number of observations. For 4 observations there are 3 df, because if we knew any

3 values and we also knew the mean, then the fourth value would not be free to vary.

This gives wider confidence limits, reflecting the uncertainty of a small sample. In our example, the 95% confidence interval would be 138.8 ± (9.65 × 3.18) μm, i.e. 138.8 ± 30.7 μm.
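Putting the pieces together, the 95% confidence interval for the worked example can be computed as follows (the t value of 3.18 for 3 df is taken from a t table, as in the text):

```python
import math

sample = [120, 125, 160, 150]
n = len(sample)
mean = sum(sample) / n
s = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
standard_error = s / math.sqrt(n)

# Tabulated t for n - 1 = 3 degrees of freedom at the 95% level.
t_95 = 3.18

half_width = t_95 * standard_error     # about 30.7
lower = mean - half_width              # lower confidence limit
upper = mean + half_width              # upper confidence limit
```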


Many people prefer to cite the standard error rather than standard deviation, because it makes

the range of variation appear to be smaller! But it has real value in telling us something - for

example, that if anyone were to repeat our experiment, then the mean would be likely to fall

within the limits of ± t × σn.

1. The standard deviation is an important property of the population.

2. We choose the level of confidence we wish to place on our data. The confidence limits will

be larger if we choose a higher level of confidence (e.g. 99% compared with 95%). However,

for most biological work we use the 95% level.

3. The number of degrees of freedom determines the t value. So, when designing experiments

we have to find a compromise between the level of confidence we want and the amount of work

involved. Inspection of a t table shows that the values fall off rapidly from 2 to 5 df (3 to 6

replicate observations) but more slowly thereafter.

[Note that a t-table does not have space to give us every degree of freedom, so if a sample has a

degree of freedom that falls between two entries in a t-table (say between 60 and 120) then you

should use the t value for 60 df in the table.]

Describing a population: practical steps

1. Tabulate the data.

2. Calculate the total (Σx) and record the number of replicates (n).

3. Calculate the mean (x̄ = Σx / n).

4. Square each value and sum the squares to obtain Σx².

5. Calculate the sum of squares of the deviations: Σd² = Σx² − (Σx)² / n.

6. Divide the sum of squares by n − 1 to obtain the estimated variance of the population (σ²).

7. Find the estimated standard deviation of the population (σ) = square root of the variance.

8. Divide σ by √n to obtain the standard error of the mean (σn).

Worked example of the data given at the top of this page: 120, 125, 160, 150.

Item | Value | Notes / explanation
Replicate 1 | 120 |
Replicate 2 | 125 |
Replicate 3 | 160 |
Replicate 4 | 150 |
Σx | 555 | Total (= sum of the replicates)
n | 4 | Number of replicates
x̄ | 138.75 | Mean (= total / n)
Σx² | 78125 | Sum of the squares of each replicate value
σ² | 372.9167 | Variance = [Σx² − (Σx)²/n] / (n − 1)
σ | 19.311 | = √σ²
σn | 9.6555 | = σ / √n
mean ± standard error (± σn) | 138.75 ± 9.655 | [In practice, we would record this as 138.8 ± 9.66, with one more decimal place than we measured, and an extra decimal place for the standard error]
95% confidence limits of the mean (± tσn, where t has 3 df) | 138.75 ± 30.705 | [The value for t with 3 df is 3.18]

For example: The mean cell diameter was 138.75 ± 9.66 μm (mean ± s.e.; n = 4)

OR The lethal dose was 138.75 ± 9.66 mg (kg body weight)⁻¹ (mean ± s.e.; n = 4)

OR The plant height was 138.75 ± 9.66 cm (mean ± s.e.; n = 4)

Note that these statements contain everything that anyone would need to know about the mean!

For example, if somebody wanted to calculate a confidence interval they could multiply the

standard error by the t value (we gave them the number of replicates so they can look up the t

value). They also can decide if they want to have 95%, 99% or 99.9% confidence intervals.

In other sections of this site we shall see that the statements above give all the information we

need to test for significant differences between treatments. As one example, go to Student's t-

test.


WHAT TEST? 17/06/13 4:23 PM


What test do I need?

Other parts of this site explain how to do the common statistical tests. Here is a guide to

choosing the right test for your purposes. When you have found it, click on "more

information?" to confirm that the test is suitable. If you know it is suitable, click on "go for it!"

Important: Your data might not be in a suitable form (e.g. percentages, proportions) for the

test you need. You can overcome this by using a simple transformation. Always check this -

click HERE.

1. Student's t-test

Use this test for comparing the means of two samples (but see test 2 below),

even if they have different numbers of replicates. For example, you might want

to compare the growth (biomass, etc.) of two populations of bacteria or plants, the

yield of a crop with or without fertiliser treatment, the optical density of samples

taken from each of two types of solution, etc. This test is used for "measurement

data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3

etc. You would need to transform percentages and proportions because these have

fixed limits (0-100, or 0-1).

More information?

Go for it!

2. Paired-samples test

Use this test like the t-test but in special circumstances - when you can arrange

the two sets of replicate data in pairs. For example: (1) in a crop trial, use the

"plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus"

nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a

drug treatment is compared with a placebo (no treatment), one pair might be 20-

year-old Caucasian males, another pair might be 30-year old Asian females, and so

on.

More information?

Go for it!

3. Analysis of variance

Use this test if you want to compare several treatments. For example, the

growth of one bacterium at different temperatures, the effects of several drugs or

antibiotics, the sizes of several types of plant (or animals' teeth, etc.). You can also

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress4.html#What%20test%20do%20I%20need? Page 1 of 8


compare two things simultaneously - for example, the growth of 3 or 4 strains of bacteria at different temperatures, and so on. Like the t-test, this test is used for "measurement

data" that are continuously variable (with no fixed limits), not for counts of 1, 2, 3

etc. You would need to transform percentages and proportions because these have

fixed limits (0-100, or 0-1).

More information? You need this, because there are different forms of this test.

4. Chi-squared test

Use this test to compare counts (numbers) of things that fall into different

categories. For example, the numbers of blue-eyed and brown-eyed people in a

class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing experiment.

You can also use the test for combinations of factors (e.g. the incidence of

blue/brown eyes in people with light/dark hair, or the numbers of oak and birch

trees with or without a particular type of toadstool beneath them on different soil

types, etc.).

More information?

Go for it!

5. Poisson distribution

Use this test for putting confidence limits on the mean of counts of random

events, so that different count means can be compared for statistical difference.

For example, numbers of bacteria counted in the different squares of a counting

chamber (haemocytometer) should follow a random distribution, unless the bacteria

attract one another (in which case the numbers in some squares should be

abnormally high, and abnormally low in other squares) or repel one another (in

which case the counts should be abnormally similar in all squares). Very few things

in nature are randomly distributed, but testing the recorded data against the

expectation of the Poisson distribution would show this. By using the Poisson

distribution you have a powerful test for analysing whether objects/ events are

randomly distributed in space and time (or, conversely, whether the objects/ events

are clustered).

More information?

Go for it!

6. Correlation and regression

These procedures are used for looking at the relationship between different

factors, and (if appropriate) for graphing the results in statistically meaningful

ways. For example, as the temperature (or pH, etc.) increases, does growth rate

increase or decrease? As the dose rate of a drug is increased does the response rate

of patients rise? As altitude is increased does the number of butterflies (or oak

trees) increase or decrease? Sometimes the relationship is linear, sometimes

logarithmic, sometimes sigmoidal, etc. You can test all these possibilities and, in

drug or toxicity trials (for example) calculate the LD50 or ED50 (lethal dose, or

estimated dose, for a 50% response rate).

More information?

Go for it!


==========================================

More information

Student's t-test

Use this test for comparing the means of two populations that you have

sampled (but see test 2 below). For example, you might want to compare the

growth (biomass, etc.) of two bacteria or plants, the yield of a crop with or without

added nitrogen, the optical density of samples taken from each of two types of

solution, etc.

What you will need for this test: a minimum of 2 or 3 replicates of each sample

or treatment, but ideally at least 5 replicates. For example, the yield measured for 5

fields of a crop fertilised with nitrogen and for 5 unfertilised fields, the optical

density of 5 tubes of each solution, the measurement of 5 plants of each type, etc.

Large sample sizes (10 or more) are always better than small sample sizes, but it is

easier to measure the height of 10 or 20 (or 50) plants than it is to set up 10 or 20

large-scale fermenters!

You don't need the same number of replicates of each treatment - for example,

you can compare 3 tubes of one solution with 4 tubes of another. You could also

use this test to compare several replicates of one treatment with a single value for

another treatment, but it would not be very sensitive.

Go for it!

Back to "What test do I need?"

Paired-samples test

Use this test like the t-test but in special circumstances - when you can arrange

the two sets of replicate data in pairs. For example: (1) in a crop trial, use the

"plus" and "minus" nitrogen crops on one farm as a pair, the "plus" and "minus"

nitrogen crops on a second farm as a pair, and so on; (2) in a drug trial where a

drug treatment is compared with a placebo (no treatment), one pair might be 20-

year-old males, another pair might be 30-year old females, and so on.

Why do we use the paired samples test? Because farms or people or many other

things are inherently variable, but by pairing the treatments we can remove much of

this random variability from the test of "nitrogen versus no nitrogen" or "drug

treatment versus no treatment", etc.

What are the requirements for this test? The main requirement is that the

experiment is PLANNED ahead of time. Then you can use the paired samples test

for many purposes - for example, two treatments compared on one day, then the

same two treatments compared on the next day, and so on.

In general, you will need more replicates than for a t-test (say, a minimum of 5

for each treatment), and you will need the same number of replicates for each

treatment.

But you must have a good reason to pair treatments - you should not do it

arbitrarily.


Go for it!

Back to "What test do I need?"

Analysis of variance

Use this test if you want to compare several treatments. For example, the

growth of one bacterium at different temperatures, the effects of several drugs or

antibiotics, the sizes of several plants (or animals' teeth, etc.). You can also

compare two things simultaneously - for example, the growth of 3 or 4 strains of

bacteria at different temperatures, and so on.

The simplest form of this test is one-way ANOVA (ANalysis Of VAriance). Use

this to compare several separate treatments (e.g. effects of 3 or more

temperatures, antibiotic levels, crop treatments, etc.). You will need at least 2

replicates of each treatment.

One-way ANOVA tells you if there are differences between the treatments as a

whole. But it can also be used, with caution, like a multiple t-test, to tell you

which of the treatments differ from each other.

Go for one-way ANOVA?

Back to "What test do I need?"
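The arithmetic behind one-way ANOVA can be sketched without any statistics library. The data below (growth of one bacterium at three temperatures, four replicates each) are invented purely for illustration:

```python
# Hypothetical growth measurements (e.g. colony diameter, mm) for one
# bacterium at three temperatures - illustrative numbers only.
groups = {
    "20C": [10.2, 11.1, 9.8, 10.5],
    "25C": [12.4, 13.0, 12.1, 12.8],
    "30C": [11.0, 10.6, 11.4, 10.9],
}

data = [x for g in groups.values() for x in g]
n_total = len(data)
k = len(groups)
grand_mean = sum(data) / n_total

# Between-treatments sum of squares (each group mean vs the grand mean).
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())

# Within-treatments (residual) sum of squares.
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in groups.values() for x in g)

# Mean squares and the F ratio, with k - 1 and n - k degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_ratio = ms_between / ms_within
```

The calculated F ratio is then compared with a tabulated F value at (k − 1, n − k) degrees of freedom; here that is (2, 9) df, for which the tabulated value at p = 0.05 is 4.26.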

Another form of this test is two-way ANOVA. Use this if you want to compare

combinations of treatments. For example, to compare the growth of an organism

on several different substrates at several different temperatures. Or the effects of

two (or more) drugs singly and in combination. Or responses of crops to fertiliser

treatment on different farms or soil types. You can get useful information even if

you have one of each combination of treatments, but you get much more

information if you have 2 (or more) replicates of each combination of

treatments. Then the test can tell you if you have significant interaction - for

example, if changing the temperature changes the way that an organism responds to

a change of pH, etc.

Go for two-way ANOVA?

Back to "What test do I need?"

Chi-squared test

Use this test to compare counts (numbers) of things that fall into different

categories. For example, to compare the numbers of blue-eyed and brown-eyed

people in a class, or the numbers of progeny (AA, Aa, aa) from a genetic crossing

experiment. You can also use the test for looking at combinations of factors (e.g.

the incidence of blue/brown eyes in people with light/dark hair, or the numbers of

toadstools beneath oak and birch trees on different soil types, etc.).

For this test you compare the actual counts (in the different categories) with an

"expected" set of counts. Sometimes the expectation is obvious - for example,

that half of the progeny from a cross between parents Aa and aa will have the Aa

genotype and half will have aa. You have to construct an hypothesis (termed the

null hypothesis) by using logical arguments.

What are the requirements for this test? Almost any sort of "count" data can be

analysed by chi-squared, but you have to use "real" numbers, not proportions or

percentages.
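For the Aa × aa cross mentioned above, the calculation reduces to a few lines (the counts are hypothetical):

```python
# Progeny counts from a hypothetical Aa x aa cross - illustrative only.
observed = {"Aa": 57, "aa": 43}
total = sum(observed.values())

# Null hypothesis: a 1:1 ratio, so half the progeny in each class.
expected = {"Aa": total / 2, "aa": total / 2}

# Chi-squared statistic: sum of (O - E)^2 / E over the categories.
chi2 = sum((observed[c] - expected[c]) ** 2 / expected[c] for c in observed)

# With 2 classes there is 1 degree of freedom; the tabulated value at
# p = 0.05 is 3.84, so a chi2 below 3.84 is consistent with a 1:1 ratio.
```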


Go for it!

Back to "What test do I need?"

Poisson distribution

The main requirement for this test is that the mean count (of bacterial colonies, buttercups, etc.) needs to be relatively high (say 30 or more) before the counts can be

expected to conform to a Poisson distribution. If you have such a high count, then

you can test whether or not your results actually do conform to the Poisson

distribution.
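A quick preliminary check of randomness rests on a property of the Poisson distribution: its variance equals its mean. A sketch, with invented haemocytometer counts:

```python
# Hypothetical counts of bacteria in 10 haemocytometer squares.
counts = [25, 38, 31, 27, 36, 29, 33, 24, 37, 30]
n = len(counts)
mean = sum(counts) / n
variance = sum((c - mean) ** 2 for c in counts) / (n - 1)

# For a Poisson (random) distribution the variance roughly equals the
# mean. A ratio well above 1 suggests clustering (attraction); a ratio
# well below 1 suggests abnormally even spacing (repulsion).
ratio = variance / mean
```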

Go for it!

Back to "What test do I need?"

Correlation and regression

These procedures are used for looking at the relationship between different

factors, and (if appropriate) for graphing the results in statistically meaningful

ways. For example, as the temperature (or pH, etc.) increases, does growth rate

increase or decrease? As the dose rate of a drug is increased does the response rate

of patients rise? As altitude is increased does the number of butterflies (or oak

trees) increase or decrease? Sometimes the relationship is linear, sometimes

logarithmic, sometimes sigmoidal, etc. You can test all these possibilities and, in

drug or toxicity trials (for example) calculate the LD50 or ED50 (lethal dose, or

estimated dose, for a 50% response rate).

1. Plot your results on graph paper, and ask yourself: does the relationship look

(or is expected to be) linear, or is it logarithmic, or sigmoid (S-shaped)? You

might need to transform the data (see transforming data) if they are not

linear.

2. Calculate the correlation coefficient, which tells you whether the data fit a

straight line relationship (and how close the fit is, in statistical terms).

3. If the correlation coefficient is significant, and other conditions are met,

proceed to regression analysis, which gives the equation for the line of best

fit, then draw this line on your graph.

Go for it!

Back to "What test do I need?"

Transformation of data

1. Proportions and percentages: convert to arcsin values

Certain mathematical assumptions underlie all the statistical tests on this site. The most

important assumption is that the data are normally distributed and are free to vary widely about

the mean - there are no imposed limits. Clearly this is not true of percentages, which cannot be

less than 0 nor more than 100. If you have data that are close to these limits, then you need to

transform the original data before you analyse them.

One simple way of doing this is to convert the percentages to arcsin values and then analyse

these arcsin values. The arcsin transformation moves very low or very high values towards the centre of the range.


[You convert percentages (x) to arcsin values (θ), where θ is an angle for which sin θ = √(x/100)]

On a calculator:

to get the arcsin value for a percentage (e.g. 50%), divide this by 100 (= 0.5), take the square root (= 0.7071), then press "sin⁻¹" to get the arcsin value (= 45). [NB: if your calculator gives the result as 0.785 then this is the angle in radians rather than degrees]

to get the arcsin value for a proportion (e.g. 0.4), take the square root (= 0.6325), then press "sin⁻¹" to get the arcsin value (= 39.23).
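The same calculator steps, written as a small Python helper:

```python
import math

def arcsin_transform(percent):
    """Arcsin (angular) transformation: the angle in degrees whose
    sine is the square root of (percent / 100)."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))

def arcsin_back(angle):
    """Convert an arcsin value (in degrees) back to a percentage."""
    return 100 * math.sin(math.radians(angle)) ** 2

print(arcsin_transform(50))   # about 45 degrees, as in the example above
print(arcsin_transform(40))   # about 39.23 degrees (the 0.4 proportion)
```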

On an "Excel" spreadsheet:

convert percentages to arcsin values (and back again) by entering a formula into the

spreadsheet - Go for it!

2. Logarithmic transformation

Use this for two purposes:

When fitting a curve to logarithmic data (exponential growth of cells, etc). Take the

logarithm of each "growth" value and plot this against time (real values). You can use

either natural logarithms or logs to base 10. The data should now show a straight-line

relationship and can be analysed using correlation coefficient and regression.

In Analysis of Variance, when comparing means that differ widely. The reason for this

is that an analysis of variance is based on the assumption that the variance is the same

across all the data. But usually this will not be true if some means are very small and

others are very large - the individual data points for the large mean could vary widely.

[For example, a mean of 500 could be made up from 3 values of 100, 400 and 1000,

whereas a mean of 50 could not possibly include such wide variation.] This problem is overcome by converting the original data to logarithms, squeezing all the data points closer together. Perhaps surprisingly, this transformation can reveal significant differences between small and large means that would not be detected otherwise.
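To sketch the first use, take some hypothetical exponential growth data, log-transform the counts, and fit a straight line by ordinary least squares (pure Python, no statistics package):

```python
import math

# Hypothetical exponential growth data: time (h) and cell count,
# roughly doubling each hour - invented purely for illustration.
times = [0, 1, 2, 3, 4, 5]
counts = [100, 205, 398, 810, 1605, 3190]

# Log-transform the counts; the relationship should now be linear.
logs = [math.log10(c) for c in counts]

# Ordinary least-squares slope and intercept on the transformed data.
n = len(times)
mean_t = sum(times) / n
mean_y = sum(logs) / n
slope = (sum(t * y for t, y in zip(times, logs)) - n * mean_t * mean_y) / \
        (sum(t * t for t in times) - n * mean_t ** 2)
intercept = mean_y - slope * mean_t

# slope close to log10(2) = 0.301 per hour implies a doubling
# time of about one hour.
```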

Some types of data show a sigmoid (S-shaped) relationship. A classic case is in dosage-

response curves, for testing antibiotics, pharmaceuticals, etc. To analyse these relationships the

"percentage of patients/cells responding to a treatment" can be converted to a "probit" value,

and the dosage is converted to a logarithm. This procedure converts an S-shaped curve into a

straight-line relationship, which can be analysed by correlation coefficient and regression

analysis in the normal way. From the straight-line equation, we can calculate the LD50, ED50,

and so on.

The table below shows part of a page from an ‘Excel’ worksheet. Columns are headed A-F and

rows are labelled 1-21, so each cell in the table can be identified (e.g. B2 or F11).


You will now see how to convert these % values into probits or arcsin values, and back again. If

you do the relevant conversion in your own spreadsheet, you can then use the probit or arcsin

values instead of % values for the statistical tests.

In cell C2 of the spreadsheet, a formula was entered to convert Percentage to Probit values.

This formula is not seen. As soon as we move out of cell C2 it automatically gives the probit

value (in C2) for the percentage in cell B2, seen in the "printout" below. Copying and then

pasting this formula into every other cell of column C produces a corresponding probit value

(e.g. cell C3 contains the probit of the % in cell B3).

Next, a formula was entered in cell D2 to convert Probit to Percentage, and the above

procedure was repeated for all cells in column D.

     A    B         C            D            E            F
 1        Percent   % to Probit  Probit to %  % to arcsin  arcsin to %
 2        0.1       1.91         0.1          1.812        0.1
 3        0.5       2.424        0.5          4.055        0.5
 4        1         2.674        1            5.739        1
 5        2         2.946        2            8.13         2
 6        3         3.119        3            9.974        3
 7        4         3.249        4            11.54        4
 8        5         3.355        5            12.92        5
 9        6         3.445        6            14.18        6
10        7         3.524        7            15.34        7
11        8         3.595        8            16.43        8
12        9         3.659        9            17.46        9
13        10        3.718        10           18.43        10
14        50        5            50           45           50
15        96        6.751        96           78.46        96
16        97        6.881        97           80.03        97
17        98        7.054        98           81.87        98
18        99.5      7.576        99.5         85.95        99.5
19        99.99     8.719        99.99        89.43        99.99
20        99.999    9.265        99.999       89.82        99.999
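The probit conversion itself (column C above) can be reproduced with the Python standard library: a probit is 5 plus the standard normal deviate for the percentage (NormalDist is available from Python 3.8):

```python
from statistics import NormalDist

def percent_to_probit(percent):
    """Probit: 5 plus the standard normal deviate for the percentage."""
    return 5 + NormalDist().inv_cdf(percent / 100)

def probit_to_percent(probit):
    """Inverse conversion: a probit value back to a percentage."""
    return 100 * NormalDist().cdf(probit - 5)

# Reproduces the table above: 0.1% -> 1.91, 50% -> 5, 99.5% -> 7.576
```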


T-TEST 17/06/13 4:23 PM


Student's t-test

We use this test for comparing the means of two samples (or treatments), even if they have

different numbers of replicates. In simple terms, the t-test compares the actual difference

between two means in relation to the variation in the data (expressed as the standard deviation

of the difference between the means).

Procedure

First, we will see how to do this test using "pencil and paper" (with a calculator to help with the

calculations). Then we can see how the same test can be done in a spreadsheet package

(Microsoft 'Excel')

Like all statistical tests, the t-test starts from a null hypothesis - a statement that the experiment is designed to test. For example:

If we are analysing the heights of pine trees growing in two different locations, a suitable

null hypothesis would be that there is no difference in height between the two locations.

The student's t-test will tell us if the data are consistent with this or depart significantly

from this expectation. [NB: the null hypothesis is simply something to test against. We

might well expect a difference between trees growing in a cold, windy location and

those in a warm, protected location, but it would be difficult to predict the scale of that

difference - twice as high? three times as high? So it is sensible to have a null hypothesis

of "no difference" and then to see if the data depart from this.]

4. Record the number (n) of replicates for each sample (the number of replicates for sample 1 being termed n1 and the number for sample 2 being termed n2).

5. Calculate σ² for each sample; call these σ1² and σ2² [Note that actually we are using S² as an estimate of σ² in each case].

6. Calculate the variance of the difference between the two means (σd²) as follows:

σd² = σ1²/n1 + σ2²/n2

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress4a.html#Student's%20t-test Page 1 of 7


7. Calculate sd (the square root of sd²) and then t = (mean 1 - mean 2) / sd

(when doing this, transpose the means if mean 2 > mean 1 so that you always get a positive value)

8. Enter the t-table at (n1 + n2 -2) degrees of freedom; choose the level of significance required

(normally p = 0.05) and read the tabulated t value.

9. If the calculated t value exceeds the tabulated value we say that the means are significantly

different at that level of probability.

10. A significant difference at p = 0.05 means that if the null hypothesis were correct (i.e. the

samples or treatments do not differ) then we would expect to get a t value as great as this on less

than 5% of occasions. So we can be reasonably confident that the samples/treatments do differ

from one another, but we still have nearly a 5% chance of being wrong in reaching this

conclusion.

Now compare your calculated t value with tabulated values for higher levels of significance

(e.g. p = 0.01). These levels tell us the probability of our conclusion being correct. For example,

if our calculated t value exceeds the tabulated value for p = 0.01, then there is a 99% chance of

the means being significantly different (and a 99.9% chance if the calculated t value exceeds the

tabulated value for p = 0.001). By convention, we say that a difference between means at the

95% level is "significant", a difference at 99% level is "highly significant" and a difference at

99.9% level is "very highly significant".

What does this mean in "real" terms? Statistical tests allow us to make statements with a degree

of precision, but cannot actually prove or disprove anything. A significant result at the 95%

probability level tells us that our data are good enough to support a conclusion with 95%

confidence (but there is a 1 in 20 chance of being wrong). In biological work we accept this

level of significance as being reasonable.

Suppose that we measured the biomass (milligrams) produced by bacterium A and bacterium B,

in shake flasks containing glucose as substrate. We had 4 replicate flasks of each bacterium.

              Bacterium A   Bacterium B
Replicate 1   520           230
Replicate 2   460           270
Replicate 3   500           250
Replicate 4   470           280

Σx            1950          1030      (total = sum of the 4 replicate values)
n             4             4
Mean          487.5         257.5
Σx²           952900        266700    (sum of the squares of each replicate value)
(Σx)²         3802500       1060900   (square of the total Σx; it is not the same as Σx²)
(Σx)²/n       950625        265225
Σd²           2275          1475      (= Σx² - (Σx)²/n)
s²            758.33        491.67    (= Σd² / (n - 1))

sd² = s1²/n1 + s2²/n2 = 312.5 (variance of the difference between the means)

sd = √312.5 = 17.68 (standard deviation of the difference between the means)

t = (487.5 - 257.5) / 17.68 = 230 / 17.68 = 13.0

Entering a t table at 6 degrees of freedom (3 for n1 + 3 for n2) we find a tabulated t value of

2.45 (p = 0.05) going up to a tabulated value of 5.96 (p = 0.001). Our calculated t value exceeds

these, so the difference between our means is very highly significant. Clearly, bacterium A

produces significantly more biomass when grown on glucose than does bacterium B.
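For readers who prefer to check the arithmetic in code, the whole calculation can be sketched in Python. This is my own illustration (the page itself uses a calculator or "Excel"); the function name students_t is invented for the sketch:

```python
from math import sqrt

def students_t(sample1, sample2):
    """Two-sample t as in the worked example: t = (mean1 - mean2) / sd,
    where sd^2 = s1^2/n1 + s2^2/n2."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = sum(sample1) / n1, sum(sample2) / n2
    # Sum of squared deviations, via sum(x^2) - (sum x)^2 / n
    ss1 = sum(x * x for x in sample1) - sum(sample1) ** 2 / n1
    ss2 = sum(x * x for x in sample2) - sum(sample2) ** 2 / n2
    s1_sq, s2_sq = ss1 / (n1 - 1), ss2 / (n2 - 1)   # sample variances
    sd = sqrt(s1_sq / n1 + s2_sq / n2)              # SD of the difference between means
    return abs(mean1 - mean2) / sd, n1 + n2 - 2     # t and degrees of freedom

bacterium_a = [520, 460, 500, 470]
bacterium_b = [230, 270, 250, 280]
t, df = students_t(bacterium_a, bacterium_b)
# t is about 13.0 with 6 df, matching the worked example
```

The calculated t is then compared with the tabulated value at (n1 + n2 - 2) degrees of freedom, exactly as in step 8 above.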

[Note that all the time-consuming calculations above can be done on a calculator with memory

and statistics functions. Guidance on this can be found in your calculator's instruction booklet.

Note also that this test and others can be run on computer packages. Below is a print-out from a

package in Microsoft "Excel"]

[NB: If you cannot find "Data analysis" in Excel then go into "Help" and find "statistical

analysis" in the Help index. Different versions of Excel have slightly different commands, so

you may not find the following section to be identical to the one you are using.]

The screen for "Excel" (not shown here) has cells arranged in columns A-F... and rows 1-10...

For the print-out below, row 1 was used for headings and column A for replicate numbers. The

data for Bacterium A were entered in cells B2,3,4,5 and data for Bacterium B in cells C2,3,4,5

of the spreadsheet. From the Tools option at the top of the screen, I selected Data analysis.

This displays Analysis options and from the drop-down menu I selected t-test: Two-sample

assuming equal variances. Then click OK and enter cells B2-5 for Variable range 1, cells

C2-5 for Variable range 2, and a free cell (e.g. A7) for output range (choose the top-left cell

of the area where you want the results of the analysis to be displayed). Then click OK and the

printout appears.


     Bacterium A   Bacterium B
1    520           230
2    460           270
3    500           250
4    470           280

                              Bacterium A   Bacterium B
Mean                          487.5         257.5
Variance                      758.3333      491.6667
Observations                  4             4
Pooled Variance               625
Hypothesized Mean Difference  0          (The test will "ask": what is the probability of obtaining our results by chance if there is no difference between the population means?)
df                            6
t Stat                        13.01076   (This shows the t value calculated from the data)
P(T<=t) one-tail              6.35E-06
t Critical one-tail           1.943180
P(T<=t) two-tail              1.27E-05   (This shows the probability of getting our calculated t value by chance alone. That probability is extremely low, so the means are significantly different)
t Critical two-tail           2.446914   (This shows the t value that we would need to exceed in order for the difference between the means to be significant at the 5% level)

We are given the column means, the pooled variance, the number of observations (n), the null

hypothesis (that there is no difference between the population means), the degrees of freedom,

the calculated t value, and four other entries. The first two of these refer to a one-tailed t-test

(i.e. if we wish to test only that one particular mean is larger (or smaller) than the other. The

final two entries refer to a two-tailed test, where we do not specify the "direction" of the test.

For most purposes, we use a two-tailed test. In each case we are shown the probability that

our calculated t-value is equal to or less than the "tabulated" t-value (shown as "Critical t").

Note that the calculated t-value (13.0) vastly exceeds the critical t-value (two-tailed); the means

for the bacteria are significantly different at p = 1.27 x 10^-5. In other words, there is a

probability of about 1 in 100,000 that we would get our observed difference between the means

by chance alone.

The analysis for a t-test always pools the variances and, strictly speaking, it is only valid if the

variances of the two treatments are similar. In the analysis above we could have selected the

option "t-test: Two-sample assuming unequal variances". This would have given us the same

result from our particular set of data but would have shown their separate variances as 758.33

for bacterium A and 491.67 for bacterium B.

For interest, let us ask if these variances are different in statistical terms. There is a simple

test for this: divide the larger variance by the smaller (758.33 / 491.67 = 1.54) and compare this

variance ratio with a value from a table of ‘F’ (variance ratio) for p = 0.05. For 2 treatments


there is one degree of freedom between treatments (shown as n1 in the table) and 6 df within

treatments (3 for each treatment), shown as n2 in the table. The tabulated F value is 6.0. Our

variance ratio (1.54) is less than this, so our variances do not differ significantly. Our t-test

was valid.
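This check is simple enough to express in one line of code. A minimal sketch in Python (my own illustration; the function name variance_ratio is invented):

```python
def variance_ratio(var1, var2):
    """F = larger variance / smaller variance, for the homogeneity check."""
    return max(var1, var2) / min(var1, var2)

# Variances from the Excel print-out above
f = variance_ratio(758.33, 491.67)
# f is about 1.54, below the tabulated F of 6.0, so the t-test was valid
```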

Here are some results taken from a scientific paper. Look at any scientific journal and

you will find something similar to this:

            pH 7.0        pH 5.5
Mean ± SE   2.41 ± 0.06   2.62 ± 0.05

(means ± SE; n = 6)

Note that the authors give us all the information that we need to test for significance. They tell

us that the values are means of 6 replicates for each pH level, with the standard errors of those

means.

If we go back to Descriptive Statistics, we will see that a standard error (sn) for a mean is

calculated from the standard deviation (s) as follows: sn = s / √n.

If we square s / √n, we get s² / n.

Now, if we go back to one of the steps in the t-test, we see that we calculate:

sd² = s1²/n1 + s2²/n2

In this step we add the separate values of s²/n for each mean.

In other words, to do a t test on the published data, all we need do is to square the standard

errors. We will do it now:

              pH 7.0        pH 5.5
Mean ± SE     2.41 ± 0.06   2.62 ± 0.05
n             6             6
SE² (= s²/n)  0.0036        0.0025

sd = √0.0061 = 0.0781

t = (2.62 - 2.41) / 0.0781 = 2.69


The critical t value (p = 0.05) for 10 degrees of freedom (n1 + n2 - 2) is 2.23. The calculated

value exceeds this. There is less than a 5% probability of getting a value as high as this by

chance alone. So we can be reasonably confident that pH does affect the intracellular

water volume of the bacterium.
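The trick of squaring the published standard errors can also be sketched in code. A minimal Python version (my own illustration; the function name t_from_means_and_ses is invented):

```python
from math import sqrt

def t_from_means_and_ses(mean1, se1, mean2, se2):
    """t-test from published summaries: squaring each SE gives s^2/n,
    so sd = sqrt(SE1^2 + SE2^2)."""
    sd = sqrt(se1 ** 2 + se2 ** 2)
    return abs(mean1 - mean2) / sd

t = t_from_means_and_ses(2.41, 0.06, 2.62, 0.05)
# t is about 2.69, exceeding the critical value of 2.23 for 10 df
```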

CONTENTS

INTRODUCTION

THE SCIENTIFIC METHOD

Experimental design

Designing experiments with statistics in mind

Common statistical terms

Descriptive statistics: standard deviation, standard error, confidence intervals of mean.

STATISTICAL TESTS:

Student's t-test for comparing the means of two samples

Paired-samples test. (like a t-test, but used when data can be paired)

Analysis of variance for comparing means of three or more samples:

Calculating the Least Significant Difference between means

Using a Multiple Range Test for comparing means

For factorial combinations of treatments (Two-way ANOVA)

Poisson distribution for count data

Correlation coefficient and regression analysis for line fitting:

linear regression

logarithmic and sigmoid curves

STATISTICAL TABLES:

t (Student's t-test)

F, p = 0.05 (Analysis of Variance)

F, p = 0.01 (Analysis of Variance)

F, p = 0.001 (Analysis of Variance)

χ² (chi squared)

r (correlation coefficient)

Q (Multiple Range test)

Fmax (test for homogeneity of variance)


PAIRED SAMPLES 17/06/13 4:24 PM


Paired-samples test

Use this test as an alternative to the t-test, for cases where data can be paired to reduce

incidental variation - i.e. variation that you expect to be present but that is irrelevant to the

hypothesis you want to test.

The principle is the same as for a t-test on independent samples: we compare the size of the difference between two means in relation to the amount of

inherent variability (the random error, not related to treatment differences) in the data. If the

random error is large then we are unlikely to find a significant difference between means unless

this difference is also very large.

Consider the data in the table below, which shows the number of years' remission from

symptoms (of cancer, AIDS, etc.) in two groups of patients: group A who received a new drug

and group B who received a placebo (the controls). There were 10 patients in each group, and

we will first analyse the data by conventional t-test (see Student's t-test if you are not familiar

with this).

Patient   Group A (drug)   Group B (placebo)
1         7                4
2         5                3
3         2                1
4         8                6
5         3                2
6         4                4
7         10               9
8         7                5
9         4                3
10        9                8

Σx        59               45
n         10               10
Mean      5.9              4.5
Σx²       413              261
Σd²       65               58
s²        7.22             6.44

sd² = s1²/n1 + s2²/n2 = 1.37

sd = 1.17

t = (5.9 - 4.5) / 1.17 = 1.2

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress5.html#Paired-samples%20test Page 1 of 6

Clearly, there is no significant difference between the means. [The smallest tabulated t value for

significant difference at p = 0.05 is 1.96.]

But drug trials are never done as randomly as this. Instead, the patients are matched as nearly as

possible to exclude the effects of extraneous variation. For example, patient 1 in each group

(drug or placebo) might be a Caucasian male aged 20-25; patient 2 in each group might be an

Asian female aged 40-50, and so on. There is every reason to suspect that age, sex, social

factors etc. could influence the course of a disease, and it would be foolish not to exclude this

variation if the purpose of the trial is to see if the drug actually has an overall effect. In other

words, we are not dealing with random groups but with purposefully paired observations. (The

same would be true if, for example, we wanted to test effects of a fungicide against a disease on

10 farms, or to test whether a range of different bacteria are sensitive to an antibiotic, etc.).

1. Subtract each control value from the corresponding treatment value and call the difference z.

(NB Always subtract in the same "direction", recording negative values where they occur)

Our null hypothesis is that there is no difference between the groups (drug treatment versus controls). If this were true then the observed values

of z would have a mean close to zero, with variation about this mean.

4. Calculate: Σz, n, the mean of z (= Σz / n), Σz², and Σd² = Σz² - (Σz)²/n.

5. Calculate: the variance sd² = Σd² / (n - 1), the standard deviation sd = √sd², the standard error of the mean difference sn = sd / √n, and then t = (mean of z) / sn.

8. Consult a t table at n-1 degrees of freedom, where n is the number of pairs (number of z

values). The tabulated t value for 9 df is 2.26 (p = 0.05)

In our example the calculated t value is 5.24, which is very highly significant - it exceeds the t

value for a probability (p) of 0.001. In other words, we would expect such a result to occur by

chance only once in a thousand times. So the drug is effective: we see below that it gives

remission of symptoms for 1.4 ± 0.266 years (this value is the mean ± standard error of the

mean). The confidence limits are 1.4 ± 0.6 years (mean ± t.sn).

Patient     Drug   Placebo   z (difference)
1           7      4         3
2           5      3         2
3           2      1         1
4           8      6         2
5           3      2         1
6           4      4         0
7           10     9         1
8           7      5         2
9           4      3         1
10          9      8         1

Σz          14
n           10
Mean of z   1.4
Σz²         26
(Σz)²/n     19.6
Σd²         6.4
sd²         0.71
sd          0.84
sn          0.267

t = 1.4 / 0.267 = 5.24
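The paired calculation can be checked with a few lines of Python. This is my own sketch (the page itself uses "Excel"); the function name paired_t is invented:

```python
from math import sqrt

def paired_t(treatment, control):
    """Paired-samples t on the differences z, as in the worked table."""
    z = [a - b for a, b in zip(treatment, control)]
    n = len(z)
    mean_z = sum(z) / n
    ss = sum(v * v for v in z) - sum(z) ** 2 / n   # sum of squared deviations of z
    sn = sqrt(ss / (n - 1)) / sqrt(n)              # standard error of the mean difference
    return mean_z / sn, n - 1                      # t and degrees of freedom

drug =    [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]
t, df = paired_t(drug, placebo)
# t is about 5.25 with 9 df; the 95% confidence half-width is roughly
# 2.26 * 0.267 = 0.6 years, as stated in the text
```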

It is instructive to consider what we have done in this analysis. We calculated the mean

difference between the pairs of patients (treatments), calculated the standard error of this

mean difference and tested it to see if it is significantly different from zero (no difference). The

following diagram should make this clear.

The example above was run on "Excel", as before (see Student's t-test) but we select t-test:

paired two sample for means from the analysis tools package, click OK, select the whole data

set (cells B2-C11) for Input variable range and a clear cell for Output range. See Student's t-

test for explanation of other relevant entries in the print-out. [The calculated t (5.25) differs

slightly from the worked example (5.24) above because the computer does not round the

intermediate values during calculation]

      Drug   Placebo
1     7      4
2     5      3
3     2      1
4     8      6
5     3      2
6     4      4
7     10     9
8     7      5
9     4      3
10    9      8

                      Drug       Placebo
Mean                  5.9        4.5
Variance              7.211111   6.5
Observations          10         10
Pearson Correlation   0.949414
df                    9
t Stat                5.25
t Critical one-tail   1.833114
t Critical two-tail   2.262159

Note that the two-tailed t-test shows the drug and placebo to be significantly different

at p = 0.0005 (a probability of 5 in 10,000 that we would get this result by chance alone). But in

this case we would be justified in using a one-tailed test (p = 0.00026) because we are testing

whether the mean difference (1.4 years) is significantly greater than zero. Look at the 'critical

t' for a one-tailed test in the printout. This is the value given for p = 0.1 in a t-table (because we

are testing only for a difference in one direction - above zero - so we can double the normal

probability of 0.05). We should have used this value, not the value for p = 0.05, in our testing for

significance.


ONE-WAY ANOVA 17/06/13 4:24 PM


Analysis of variance for comparing the means of three or more samples

Use this test for comparing means of 3 or more samples/treatments, to avoid the error

inherent in performing multiple t-tests

Background. If we have, say, 3 treatments to compare (A, B, C) then we would need 3 separate

t-tests (comparing A with B, A with C, and B with C). If we had seven treatments we would

need 21 separate t-tests. This would be time-consuming but, more important, it would be

inherently flawed because in each t-test we accept a 5% chance of our conclusion being wrong

(when we test for p = 0.05). So, in 21 tests we would expect (by probability) that one test would

give us a false result. ANalysis Of Variance (ANOVA) overcomes this problem by enabling us

to detect significant differences between the treatments as a whole. We do a single test to see if

there are differences between the means at our chosen probability level.
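The inflation of error over many tests is easy to quantify. A minimal Python sketch (my own illustration, assuming the tests are independent; the function name familywise_error is invented):

```python
# Chance of at least one false positive when making k comparisons,
# each tested at p = 0.05 (assumes the comparisons are independent)
def familywise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

risk = familywise_error(21)
# risk is about 0.66: with 21 separate t-tests, at least one false
# result is more likely than not
```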

Ideally, for this test we would have the same number of replicates for each treatment, but

this is not essential. Advanced computer programmes can overcome the problem of unequal

replicates by entering "missing values".

An important assumption underlies the Analysis of Variance: that all treatments have

similar variance. If there are strong reasons to doubt this then the data might need to be

transformed before the test can be done. In practice, there is a simple way to check for

"homogeneity of variance". We deal with this at step "3" in the procedure below.

Don't be frightened by this! It looks complicated but it is actually very easy. You should

understand it, and then you can use a simple statistical programme (e.g. Microsoft "Excel") to

run the whole test.

Assume that we have recorded the biomass of 3 bacteria in flasks of glucose broth, and we used

3 replicate flasks for each bacterium. [But the test could apply equally to any sort of variable]

Replicate   Bacterium A   Bacterium B   Bacterium C
1           12            20            40
2           15            19            35
3           9             23            42

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress6.html#Analysis%20of%20variance Page 1 of 7


Step 2. For each column, enter Σx, n, the mean, Σx², and Σd²

Step 3. [A check for equal variance - the underlying assumption of this test] For each column

divide Σd² by n - 1 to obtain the variance, s². Divide the highest value of s² by the lowest value

of s² to obtain a variance ratio (F). Then look up a table of Fmax for the number of treatments in

our table of data and the degrees of freedom (number of replicates per treatment - 1). If our

variance ratio does not exceed the Fmax value then we are safe to proceed. If not, the data might

need to be transformed.
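The Fmax check in step 3 can be sketched in Python (my own illustration; the function name fmax_ratio is invented, and it simply computes each sample variance and takes the largest-to-smallest ratio):

```python
def fmax_ratio(*samples):
    """Largest sample variance divided by smallest, for the Fmax check."""
    def var(xs):
        n = len(xs)
        # sum of squared deviations / (n - 1)
        return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)
    variances = [var(s) for s in samples]
    return max(variances) / min(variances)

f = fmax_ratio([12, 15, 9], [20, 19, 23], [40, 35, 42])
# f is about 3.0 (the text's 2.99 reflects rounding), well below the
# tabulated Fmax of 87.5, so we may proceed with the analysis
```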

Step 4. For each column, calculate (Σx)² / n.

Step 5. Sum all the values of (Σx)² / n and call the sum B.

Step 6. Sum all the values of Σx to obtain the grand total.

Step 7. Square the grand total and divide it by the total number of observations; call this D.

Step 8. Sum the values of Σx² for all columns; call this A.

Step 9. Calculate the Between-treatments sum of squares = B - D, and the Total sum of squares = A - D.

Step 10. Calculate the Residual sum of squares = A - B [This is sometimes called the Error

sum of squares]

Step 11. Construct a table as follows, where *** represents items to be inserted, and where u =

number of treatments and v = number of replicates.

Source of variance   Sum of squares (S of S)   Degrees of freedom (df)   Mean square (= S of S / df)
Between treatments   ***                       u - 1                     ***
Residual             ***                       u(v - 1)                  ***
Total                ***                       (uv) - 1

[The total df is always one fewer than the total number of data entries]

Step 12. Using the mean squares in the final column of this table, do a variance ratio test to

obtain an F value: F = Between-treatments mean square / Residual mean square.

Step 13. Go to a table of F (p = 0.05) and read off the value where n1 is the df of the between

treatments mean square and n2 is df of the residual mean square. If the calculated F value

exceeds the tabulated value there is significant difference between treatments. If so, then look at

the tabulated F values for p = 0.01 and then 0.001, to see if the treatment differences are more

highly significant.


If you look at many of the steps above they should remind you of the steps in a t-test. For

example, in a t-test we calculate Σx, Σx², and Σd² (which is the sum of squares), then

we divide Σd² by n - 1, just as we did in step 11 (above). So, the Analysis of Variance is using

the same types of procedure, but for more than 2 samples. If you want to convince yourself of

this, then try doing the Analysis of Variance for just two samples (e.g. Bacterium A and

Bacterium B). You will get exactly the same result as in a t-test.

Replicate   Bacterium A   Bacterium B   Bacterium C   Total
1           12            20            40            72
2           15            19            35            69
3           9             23            42            74

Σx          36            62            117           215 (Grand total)
n           3             3             3
Mean        12            20.7          39
s²          9             4.35          13
(Σx)²/n     432           1281.3        4563          6276.3 (call this B)

Fmax test: F = 13/4.35 = 2.99. This is lower than the Fmax of 87.5 (for 3 treatments and 2 df, at

p = 0.05) so the variances are homogeneous and we can proceed with analysis of variance. If

our value exceeded the tabulated Fmax then we would need to transform the data.

Between-treatments S of S = B - D = 1140.2

Residual S of S = A - B = 52.7

Source of variance   Sum of squares (S of S)   Degrees of freedom (df)*   Mean square (= S of S / df)
Between treatments   1140.2                    u - 1 (= 2)                570.1
Residual             52.7                      u(v - 1) (= 6)             8.78
Total                1192.9                    (uv) - 1 (= 8)

[* For u treatments (3 in our case) and v replicates (3 in our case); the total df is one fewer than

the total number of data values in the table (9 values in our case)]


F = Between treatments mean square /Residual mean square = 570.1 / 8.78 = 64.93

The tabulated value of F (p = 0.05), where n1 is the df of the between-treatments mean square (2)

and n2 is the df of the residual mean square (6), is 5.1. Our calculated F value exceeds this and even exceeds

the tabulated F value for p = 0.001 (F = 27.0). So there is a very highly significant difference

between treatments.
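Steps 5-13 can be condensed into a short Python sketch (my own illustration; the function name one_way_anova is invented, and the letters a, b, d follow the A, B, D of the procedure above):

```python
def one_way_anova(*treatments):
    """One-way ANOVA following steps 5-12: returns F and its two df values."""
    u = len(treatments)                                 # number of treatments
    values = [x for t in treatments for x in t]
    n_total = len(values)
    a = sum(x * x for x in values)                      # A: sum of all x^2
    b = sum(sum(t) ** 2 / len(t) for t in treatments)   # B: sum of (Sx)^2 / n
    d = sum(values) ** 2 / n_total                      # D: (grand total)^2 / N
    df_between, df_residual = u - 1, n_total - u
    f = ((b - d) / df_between) / ((a - b) / df_residual)
    return f, df_between, df_residual

f, df1, df2 = one_way_anova([12, 15, 9], [20, 19, 23], [40, 35, 42])
# f is about 64.95 with (2, 6) df, matching the worked example
```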

[Note that the term "mean square" in an Analysis of Variance is actually a variance - it is

calculated by dividing the sum of squares by the degrees of freedom. In a t-test we would call it

s², obtained by dividing Σd² by n - 1. Analysis of Variance involves the partitioning of the total

variance into (1) variance associated with the different treatments/samples and (2) random

variance, evidenced by the variability within the treatments. When we calculate the F value, we

ask, in effect, "is there a large amount of variance associated with the different treatments

compared with the amount of random variance?".]

The Analysis of Variance has told us only that there are differences between treatments in the

experiment as a whole. Sometimes this information is useful in its own right. But it does not tell

us which treatments differ from one another.

We now have a problem, because every time we compare one treatment with another (for

example, comparing bacterium A with bacterium B) we are doing the equivalent of a t-test, with

a probability of making a wrong interpretation. We need some way of avoiding this problem.

Method 1. Calculate the least significant difference between any two means. [This is not

generally favoured, but it can be used with caution.]

We make use of the fact that our calculations for Analysis of Variance were similar to those of a

t-test (see earlier); in particular, the residual mean square is an estimate of s2 for each

treatment, because the variance for all treatments is assumed to be equal in an Analysis of

Variance.

In the analysis of variance, s² for each treatment is assumed to be the same, and if n for each

treatment is the same, then we could compare any two means by calculating sd² as follows:

sd² = 2 x (residual mean square) / n

We can then find sd as the square root of sd² and calculate t as:

t = (difference between the two means) / sd

If we did this for two particular means, we could compare the calculated t with that in a t-table,

using the df of the residual mean square (because this reflects the residual variance in the

whole experiment).


In other words, any two means would be significantly different from one another if they differ

by more than "t multiplied by sd"

So t(sd) represents the least significant difference (LSD) between any two means.

Bacterium   Mean biomass
1           12
2           20.7
3           39

5% LSD      5.92

Here the author would be giving us the means for the 3 treatments (bacteria) and telling us that

analysis of variance was used to find the least significant difference between any of the means

at p = 0.05 (the level of probability chosen for the t value).

In fact, the table above uses the data for bacterial biomass in our worked example.

For 5% LSD, we find sd² (= 2 x residual mean square / n) = 17.56 / 3 = 5.85, so sd = √5.85 = 2.42.

The tabulated value of t for 6 df (of the residual mean square) is 2.45 (p = 0.05), giving a 5% LSD of 2.45 x 2.42 = 5.92.

Our table of data indicates that each bacterium produced a significantly different biomass from

every other one.
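The LSD calculation is short enough to sketch in Python (my own illustration; the function name lsd is invented, and the critical t of 2.447 for 6 df is taken from the Excel print-out earlier):

```python
from math import sqrt

def lsd(residual_ms, n, t_crit):
    """Least significant difference between any two treatment means."""
    sd = sqrt(2 * residual_ms / n)   # SD of the difference between two means
    return t_crit * sd

# Residual mean square 8.78 and n = 3 replicates, from the worked example
value = lsd(8.78, 3, 2.447)
# value is about 5.92; any two means further apart than this differ
# significantly at p = 0.05
```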

A word of caution: We can be much more confident about significant difference between

bacteria 1 and 3 or between bacteria 2 and 3 than we can about the difference between bacteria

1 and 2. Remember that every time we make such a comparison we run the risk of 5% error.

But if we had used the t value for p = 0.01 then we could more safely make five comparisons

and still have only a 1 in 20 chance of being wrong.

Statisticians recommend that the LSD should never be used indiscriminately, but only to test

comparisons between treatments that we "nominated" when designing the experiment. For

example, each treatment might be compared with a control, but each treatment should not

necessarily be compared with each other treatment.

Method 2. Many people now use variants of the LSD, such as a Multiple Range Test, which

enables us more safely to compare any treatments in a table. This test is far preferable to the

LSD. It is explained separately on another page.


The example that we used (bacterial biomass) above is shown below as a print-out from

"Excel".

Having entered the data on the spreadsheet, we select Anova: single factor from the analysis

tools, click OK, and enter all 9 cells of data in Input variable range. The table shows the

source of variance as "Between groups" (= between treatments) and "within groups" (=

residual). We are also told the calculated F value (64.949..), the F value that we would need to

exceed (F critical) in order to have a significant difference between treatments, and the

probability (p-value) that our calculated F value would be obtained by chance (random error)

alone. This probability is very small (8.61 x 10-5) so we have a highly significant difference

between treatments in our table. We could then use the residual (within groups) mean square

(MS) to calculate LSD, as explained earlier.

            Bacterium A   Bacterium B   Bacterium C
1           12            20            40
2           15            19            35
3           9             23            42

SUMMARY
Groups      Count   Sum   Average    Variance
Column 1    3       36    12         9
Column 2    3       62    20.66667   4.333333
Column 3    3       117   39         13

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        1140.222   2    570.1111   64.94937   8.61E-05   5.143249
Within Groups         52.66667   6    8.777778
Total                 1192.889   8

Note: There is always a danger in using a statistical package, because the package does

whatever we tell it to do. It does not "think" or "consider" whether what we ask it to do is

legitimate. For example, it does not test for homogeneity of variance. BEWARE!

CONTENTS

INTRODUCTION

THE SCIENTIFIC METHOD

Experimental design

Designing experiments with statistics in mind

Common statistical terms

Descriptive statistics: standard deviation, standard error, confidence intervals of mean.

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress6.html#Analysis%20of%20variance Page 6 of 7

ONE-WAY ANOVA 17/06/13 4:24 PM

archival purposes

Text and links may be out of date

three or more variables.

Use this test for comparing means of 3 or more samples/treatments, to avoid the error

inherent in performing multiple t-tests

Background. If we have, say, 3 treatments to compare (A, B, C) then we would need 3 separate

t-tests (comparing A with B, A with C, and B with C). If we had seven treatments we would

need 21 separate t-tests. This would be time-consuming but, more important, it would be

inherently flawed because in each t-test we accept a 5% chance of our conclusion being wrong

(when we test for p = 0.05). So, in 21 tests we would expect (by probability) that one test would

give us a false result. ANalysis Of Variance (ANOVA) overcomes this problem by enabling us

to detect significant differences between the treatments as a whole. We do a single test to see if

there are differences between the means at our chosen probability level.

Ideally, for this test we would have the same number of replicates for each treatment, but

this is not essential. Advanced computer programmes can overcome the problem of unequal

replicates by entering "missing values".

An important assumption underlies the Analysis of Variance: that all treatments have

similar variance. If there are strong reasons to doubt this then the data might need to be

transformed before the test can be done. In practice, there is a simple way to check for

"homogeneity of variance". We deal with this at step "3" in the procedure below.

Don't be frightened by this! It looks complicated but it is actually very easy. You should

understand it, and then you can use a simple statistical programme (e.g. Microsoft "Excel") to

run the whole test.

Assume that we have recorded the biomass of 3 bacteria in flasks of glucose broth, and we used

3 replicate flasks for each bacterium. [But the test could apply equally to any sort of variable]

1 12 20 40

2 15 19 35

3 9 23 42

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress6.html#LSD Page 1 of 7

ONE-WAY ANOVA 17/06/13 4:24 PM

Step 2. For each column, enter S x, n, , S x2, and Sd2 (click here for method)

Step 3. [A check for equal variance - the underlying assumption of this test] For each column

divide Sd2 by n-1 to obtain the variance, s 2. Divide the highest value of s2 by the lowest value

of s 2 to obtain a variance ratio (F). Then look up a table of Fmax for the number of treatments in

our table of data and the degrees of freedom (number of replicates per treatment -1). If our

variance ratio does not exceed the Fmax value then we are safe to proceed. If not, the data might

need to be transformed.

Step 5. Sum all the values for and call the sum B.

Step 6. Sum all the values for S x to obtain the grand total.

Step 7. Square the grand total and divide it by total number of observations; call this D.

Step 10. Calculate the Residual sum of squares = A - B [This is sometimes called the Error

sum of squares]

Step 11. Construct a table as follows, where *** represents items to be inserted, and where u =

number of treatments and v = number of replicates.

(S of S) (df) = S of S / df

Between treatments *** u-1 ***

Residual *** u(v-1) ***

Total *** (uv)-1

[The total df is always one fewer than the total number of data entries]

Step 12. Using the mean squares in the final column of this table, do a variance ratio test to

obtain an F value:

Step 13. Go to a table of F (p = 0.05) and read off the value where n1 is the df of the between

treatments mean square and n2 is df of the residual mean square. If the calculated F value

exceeds the tabulated value there is significant difference between treatments. If so, then look at

the tabulated F values for p = 0.01 and then 0.001, to see if the treatment differences are more

highly significant.

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress6.html#LSD Page 2 of 7

ONE-WAY ANOVA 17/06/13 4:24 PM

If you look at many of the steps above they should remind you of the steps in a t-test. For example, in a t-test we calculate Σx, Σx², and Σd² (which is the sum of squares), then we divide Σd² by n-1, just as we did in step 11 (above). So the Analysis of Variance uses the same type of procedure, but for more than 2 samples. If you want to convince yourself of this, then try doing the Analysis of Variance for just two samples (e.g. Bacterium A and Bacterium B). You will get exactly the same result as in a t-test.

Replicate          Bacterium A   Bacterium B   Bacterium C   Row total
1                  12            20            40            72
2                  15            19            35            69
3                  9             23            42            74
Σx                 36            62            117           215 (Grand total)
n                  3             3             3
Mean (x̄)          12            20.7          39
s² (= Σd²/(n-1))   9             4.33          13
Σx²                450           1290          4589          6329 (call this A)
(Σx)²/n            432           1281.3        4563          6276.3 (call this B)

Fmax test: F = 13/4.33 = 3.0. This is lower than the critical Fmax of 87.5 (for 3 treatments and 2 df, at p = 0.05), so the variances are homogeneous and we can proceed with analysis of variance. If our value had exceeded the tabulated Fmax then we would need to transform the data.

Between-treatments S of S = B - D = 1140.2

Residual S of S = A - B = 52.7

Source of variance    S of S    Degrees of freedom*   Mean square (= S of S / df)
Between treatments    1140.2    u - 1 (=2)            570.1
Residual              52.7      u(v-1) (=6)           8.78
Total                 1192.9    (uv)-1 (=8)

[* For u treatments (3 in our case) and v replicates (3 in our case); the total df is one fewer than

the total number of data values in the table (9 values in our case)]


F = Between-treatments mean square / Residual mean square = 570.1 / 8.78 = 64.93

The tabulated value of F (p = 0.05) where u is df of between treatments mean square (2) and v

is df of residual mean square (6) is 5.1. Our calculated F value exceeds this and even exceeds

the tabulated F value for p = 0.001 (F = 27.0). So there is a very highly significant difference

between treatments.
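The same analysis can be run directly with scipy's one-way ANOVA function (a cross-check, not part of the original page; it assumes scipy is installed):

```python
from scipy import stats

a = [12, 15, 9]    # Bacterium A
b = [20, 19, 23]   # Bacterium B
c = [40, 35, 42]   # Bacterium C

# One-way ANOVA across the three treatments
f, p = stats.f_oneway(a, b, c)
print(round(f, 2))   # 64.95
print(p < 0.001)     # True: a very highly significant difference
```

The F value agrees with the hand calculation (64.93, the small difference being rounding), and the p-value confirms significance well beyond p = 0.001.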

[Note that the term "mean square" in an Analysis of Variance is actually a variance - it is calculated by dividing the sum of squares by the degrees of freedom. In a t-test we would call it s², obtained by dividing Σd² by n-1. Analysis of Variance involves the partitioning of the total variance into (1) variance associated with the different treatments/samples and (2) random variance, evidenced by the variability within the treatments. When we calculate the F value we ask, in effect, "is there a large amount of variance associated with the different treatments compared with the amount of random variance?"]

The Analysis of Variance has told us only that there are differences between treatments in the

experiment as a whole. Sometimes this information is useful in its own right. But it does not tell

us which treatments differ from one another.

We now have a problem, because every time we compare one treatment with another (for

example, comparing bacterium A with bacterium B) we are doing the equivalent of a t-test, with

a probability of making a wrong interpretation. We need some way of avoiding this problem.

Method 1. Calculate the least significant difference between any two means. [This is not

generally favoured, but it can be used with caution.]

We make use of the fact that our calculations for Analysis of Variance were similar to those of a t-test (see earlier); in particular, the residual mean square is an estimate of s² for each treatment, because the variance for all treatments is assumed to be equal in an Analysis of Variance.

In the analysis of variance, s² for each treatment is assumed to be the same, so if n is also the same for each treatment we can compare any two means by calculating sd² = 2s²/n, using the residual mean square as s². We then find sd as the square root of sd², and calculate t as the difference between the two means divided by sd.

If we did this for two particular means, we could compare the calculated t with that in a t-table, using the df of the residual mean square (because this reflects the residual variance in the whole experiment).


In other words, any two means would be significantly different from one another if they differ

by more than "t multiplied by sd"

So t(sd) represents the least significant difference (LSD) between any two means.

Treatment (bacterium)   Mean biomass
1                       12
2                       20.7
3                       39
5% LSD                  5.92

Here the author would be giving us the means for the 3 treatments (bacteria) and telling us that

analysis of variance was used to find the least significant difference between any of the means

at p = 0.05 (the level of probability chosen for the t value).

In fact, the table above uses the data for bacterial biomass in our worked example.

For the 5% LSD, we find sd² (= 2 × residual mean square / n) = 17.56/3 = 5.85, so sd = 2.42. The tabulated value of t for 6 df (the df of the residual mean square) is 2.45 (p = 0.05), giving LSD = 2.45 × 2.42 = 5.92.

Our table of data indicates that each bacterium produced a significantly different biomass from

every other one.
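The LSD calculation is short enough to script. A sketch in Python (assuming scipy is installed; scipy.stats.t.ppf with 0.975 gives the two-tailed 5% t value):

```python
import math
from scipy import stats

ms_residual = 8.78   # residual (error) mean square from the ANOVA table
n = 3                # replicates per treatment
df_residual = 6

sd = math.sqrt(2 * ms_residual / n)        # sd of a difference between two means
t_crit = stats.t.ppf(0.975, df_residual)   # two-tailed t for p = 0.05, about 2.45
lsd = t_crit * sd
print(round(lsd, 2))   # 5.92
```

Any two treatment means differing by more than this LSD (5.92) are significantly different at p = 0.05.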

A word of caution: We can be much more confident about significant difference between

bacteria 1 and 3 or between bacteria 2 and 3 than we can about the difference between bacteria

1 and 2. Remember that every time we make such a comparison we run the risk of 5% error.

But if we had used the t value for p = 0.01 then we could more safely make five comparisons

and still have only a 1 in 20 chance of being wrong.

Statisticians recommend that the LSD should never be used indiscriminately, but only to test

comparisons between treatments that we "nominated" when designing the experiment. For

example, each treatment might be compared with a control, but each treatment should not

necessarily be compared with each other treatment.

Method 2. Many people now use variants of the LSD, such as a Multiple Range Test, which enables us to compare any treatments in a table more safely. This test is far preferable to the LSD. It is explained separately on another page.


The example that we used (bacterial biomass) above is shown below as a print-out from

"Excel".

Having entered the data on the spreadsheet, we select Anova: single factor from the analysis

tools, click OK, and enter all 9 cells of data in Input variable range. The table shows the

source of variance as "Between groups" (= between treatments) and "within groups" (=

residual). We are also told the calculated F value (64.949..), the F value that we would need to

exceed (F critical) in order to have a significant difference between treatments, and the

probability (p-value) that our calculated F value would be obtained by chance (random error) alone. This probability is very small (8.61 × 10⁻⁵) so we have a highly significant difference between treatments in our table. We could then use the residual (within groups) mean square (MS) to calculate the LSD, as explained earlier.

Replicate   Bacterium A   Bacterium B   Bacterium C
1           12            20            40
2           15            19            35
3           9             23            42

SUMMARY
Groups      Count   Sum   Average   Variance
Column 1    3       36    12        9
Column 2    3       62    20.667    4.333
Column 3    3       117   39        13

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        1140.222   2    570.1111   64.94937   8.61E-05   5.143249
Within Groups         52.66667   6    8.777778
Total                 1192.889   8

Note: There is always a danger in using a statistical package, because the package does

whatever we tell it to do. It does not "think" or "consider" whether what we ask it to do is

legitimate. For example, it does not test for homogeneity of variance. BEWARE!

CONTENTS

THE SCIENTIFIC METHOD

Experimental design

Designing experiments with statistics in mind

Common statistical terms

Descriptive statistics: standard deviation, standard error, confidence intervals of mean.


MULTIPLE RANGE TEST 17/06/13 4:25 PM


This procedure is broadly similar to that for calculating the LSD but it gives us more confidence

in comparing the means within a table.

In calculating the LSD, we first found sd² (= 2 × residual mean square / n) and from this we found sd (the standard deviation of the difference between any two means) and multiplied it by a t value (for the degrees of freedom of the residual mean square). For a multiple range test we use essentially the same procedure, but instead of a t value we use a Q value obtained from a table of "The Studentized Range". We list our means in order of magnitude, from highest to lowest, then we test for significant difference between the highest and the lowest: the difference between them must be greater than Q × sd. If this is significant, we test the second highest against the lowest mean, and continue in this way until all means have been tested against one another. However, the Q value changes each time. For the first test (highest against lowest mean) we look up the Q value for the number of treatments (i.e. for the number of means in our table of results). For the next test (second highest against lowest) we use the Q value for the number of means minus 1 (because we are excluding the highest mean now), and so on. However, the degrees of freedom do not change - they are always the df of the residual (error) mean square. Although each step in this procedure is simple, you need to be organised for testing each mean against all others - see the worked example below for the best way to do this.

Having done an analysis like this, most people construct a table as follows, using letters to show which treatments differ from others: treatments that are not followed by the same letter differ significantly from one another (p = 0.05).

For example, in the fictitious table below, the means for pH 4 and 5 do not differ from one

another but differ from the means for all other pH values. The mean for pH 6 differs from the

means at all other pH values. The means for pH 7 and 8 do not differ from one another but

differ from all other means, and the mean for pH 9 differs from all others.

Treatment Mean

pH 4 36 a

pH 5 35 a

pH 6 30 b

pH 7 20 c

pH 8 17 c

pH 9 10 d

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress7.html#A%20multiple%20range%20test Page 1 of 5


We will now apply the Multiple Range Test to the data on bacterial biomass that we

analysed by ANOVA

              Bacterium A   Bacterium B   Bacterium C
Replicate 1   12            20            40
Replicate 2   15            19            35
Replicate 3   9             23            42
n             3             3             3
Mean (x̄)     12            20.7          39

Source of variance   S of S   df   Mean square
Between treatments   1140.2   2    570.1
Residual             52.7     6    8.78
Total                1192.9   8

First: Rank the treatments from highest to lowest mean. Then compare the highest (39) with the

lowest (12) mean:

With 3 treatments, and 6 degrees of freedom for the residual mean square, we have a Q value of 4.34. So Q × sd = 4.34 × 2.42 = 10.50. The difference between the highest and lowest means (39 - 12 = 27) is greater than this, so the biomass of Bacterium C (the highest) differs significantly from that of Bacterium A (the lowest).

Second: compare the second highest (20.7) with the lowest (12) mean:

With 2 treatments (because we are excluding the highest mean now), and 6 degrees of freedom for the residual mean square, we have a Q value of 3.46. So Q × sd = 3.46 × 2.42 = 8.37. The difference between the means (20.7 - 12 = 8.7) is greater than this (but only just), so the biomass of Bacterium B differs significantly from that of Bacterium A.

Third: continue in this way, down the table of ranked means, until you get a non-significant

result. [In our case, we have reached the end because there are only 3 treatments]

Fourth: [This is not necessary in our case because we ran out of means!] Compare the second

lowest mean with the highest, then continue with the second lowest and second highest, etc.

Comparison of the Multiple Range Test with the Least Significant Difference

It is interesting to compare the findings of these two types of test. Using the LSD method, we

found that the least significant difference between any two means would need to be t × sd = 2.45 × 2.42 = 5.92.

Using the Multiple Range Test, we had to meet stricter criteria: the highest and lowest means had to differ by 10.50, and the second-highest and lowest means had to differ by 8.37. If we had had more means to compare (e.g. the third-highest and lowest) then the critical value would have been reduced again. The Multiple Range Test is much more discriminating than the LSD - it greatly reduces the chance of error when making multiple comparisons between treatments.

Suppose that we did an ANOVA on 5 treatments, with 4 replicates per treatment, and found means of 36, 42, 74, 10, 80. We calculated sd as 2.50. With 5 treatments and 4 replicates we have 15 degrees of freedom for the residual (error) mean square.

We will need the Q values for comparing 5, then 4, then 3, then 2 treatments (always with 15 df). It is sensible to make a small table:

No. of treatments   5        4       3       2
Q                   4.37     4.08    3.67    3.01
Q × sd              10.925   10.20   9.175   7.525
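Instead of reading Q from a printed table, the Studentized Range quantiles can be generated in Python (a sketch, assuming scipy version 1.7 or later, which provides scipy.stats.studentized_range):

```python
from scipy import stats

sd = 2.50        # standard deviation of a difference between two means
df_error = 15    # degrees of freedom of the residual (error) mean square

# Q values for comparing spans of 5, 4, 3 and 2 ranked means (p = 0.05)
for k in (5, 4, 3, 2):
    q = stats.studentized_range.ppf(0.95, k, df_error)
    print(k, round(q, 2), round(q * sd, 2))
```

The printed Q values match the small table above to two decimal places, and multiplying each by sd gives the critical differences used in the boxes below.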

Now we can rank our means from highest to lowest and see if they differ by more than Q × sd. Again make a table (see below). Start in box 1, comparing the means 80 and 10. These differ by more than 10.925 (Q × sd for 5 treatments) so we insert * in box 1. Repeat this for box 2 (comparing means 80 and 36, using Q for 4 treatments). Again insert *. We would also get a significant difference in box 3 (means 80 and 42) but not in box 4, so we insert ns (not significant).

Now start in the next column (boxes 5, 6 and 7), then the third column (boxes 8, 9) and the fourth column (box 10).

Mean
80
74                                       box 4 ns
42                          box 7 *      box 3 *
36              box 9 ns    box 6 *      box 2 *
10   box 10 *   box 8 *     box 5 *      box 1 *
     36         42          74           80        Mean

Finally, we put letters against the means to show significant differences. We state that: means

followed by the same letter do not differ significantly from one another (p = 0.05). Do this

in a series of steps as shown below, to reach the letters (shown in blue) in the final column.

These are the letters that we retain from the early steps if they are not contradicted by a later

step. [Of course, having done this, we can present our means in any order, with their letters; we

do not need to keep them in ranked order]


Mean   (from boxes 1-4)   (from boxes 5-7)   (boxes 8, 9, 10)   Final
80     a                                                        a
74     a                  c                                     a
42     b                  d                  e                  b
36     b                  d                  e                  b
10     b                  d                  f                  d

Transformation of data

During ANOVA we do an Fmax test to check for homogeneity of variance, i.e. to check that it is

safe to pool all the treatment variances - an essential condition for performing an Analysis of

Variance.

What should we do if the Fmax test shows a major discrepancy in the variances, thereby

invalidating ANOVA?

The answer is to use some mathematical transformation of the original data, then perform

ANOVA with the transformed data. There are several types of transformation, each most

appropriate for particular circumstances.

1. When our data consist of small, whole-numbered counts, the variance is often proportional to the mean. This is overcome by converting each value (X) to √X and analysing the √X data. If the counts are low and contain zeros then use √(X + 0.5).

2. More generally, it is appropriate to use log10 X, or log10 (X+1) if there are zero values.

3. Percentages and proportions (after multiplying by 100) can be converted to arcsin values.

In all these cases the transformed data are analysed in exactly the same way as in a normal

ANOVA, and we can use LSD or a multiple range test, as we did before, to test for significant

differences between the treatment means. BUT remember that these tests tell us the

difference between the transformed values, and it is not valid to de-transform an LSD and

show it as a significant difference between the ‘true’ means. This problem does not arise with a

multiple range test, where we use letters to show significant differences.

The way to overcome this is to present the data in a table as follows, showing both the true

means and the transformed means, and the LSD that applies to the transformed means:

Treatment   True mean (transformed mean, √X, in parentheses)
1           20 (4.47)
2           10 (3.16)
3           15 (3.87)
5% LSD      (0.37)
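The three transformations listed above take one line each in Python. A sketch with made-up values (assuming numpy is installed; the arcsine transform here returns degrees, as in printed arcsin tables):

```python
import numpy as np

counts = np.array([0, 3, 8, 15])   # small whole-number counts, with a zero
sqrt_t = np.sqrt(counts + 0.5)     # use sqrt(X + 0.5) because zeros occur

values = np.array([20.0, 10.0, 15.0])
log_t = np.log10(values + 1)       # log10(X + 1) is safe if zeros can occur

percentages = np.array([5.0, 50.0, 95.0])
# Arcsine transform: the angle (in degrees) whose sine is sqrt(proportion)
arcsin_t = np.degrees(np.arcsin(np.sqrt(percentages / 100)))
print(np.round(arcsin_t, 1))   # [12.9 45.  77.1]
```

The transformed arrays are then analysed by ANOVA exactly as the raw data would be, remembering that any LSD applies only to the transformed scale.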


TWO-WAY ANOVA 17/06/13 4:25 PM


Elsewhere on this site we have dealt with ANOVA for simple comparisons of treatments. We

can also use ANOVA for combinations of treatments, where two factors (e.g. pH and

temperature) are applied in every possible combination. These are called factorial designs, and

we can analyse them even if we do not have replicates.

Suppose that we have grown one bacterium in broth culture at 3 different pH levels and 4 different temperatures. We have 12 flasks in all, but no replicates. Growth was measured by optical density (O.D.).

Construct a table as follows (O.D. is given in fictitious whole numbers here for convenience).

Temp (°C)   pH …   pH …   pH 7.5
25          10     19     40
30          15     25     45
35          20     30     55
40          15     22     40

Then calculate the following (see the worked example and the output from Microsoft "Excel").

(a) For each column, find Σx, n, the mean, Σx² and (Σx)²/n.

(b) Do the same for each row.

(c) Find the grand total by adding all Σx for columns (it should be the same for rows). Square this grand total and then divide by uv, where u is the number of data entries in each row, and v is the number of data entries in each column. Call this value D; in our example it is (336)²/12 = 9408.

(d) Find the sum of the Σx² values for columns; call this A. It will be the same for Σx² of rows. In our example it is 11570.

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress8.html#Analysis%20of%20variance%20for%20factorial Page 1 of 7


(e) Find the sum of the (Σx)²/n values for columns; call this B. In our example it is 11304.

(f) Find the sum of the (Σx)²/n values for rows; call this C. In our example it is 9646.7.

Source of variance   Sum of squares   Degrees of freedom*   Mean square (= S of S / df)
Between columns      B - D (1896)     u - 1 (=2)            948
Between rows         C - D (238.7)    v - 1 (=3)            79.6
Residual             *** (27.3)       (u-1)(v-1) (=6)       4.56
Total                A - D (2162)     (uv)-1 (=11)          196.5

[* Where u is the number of data entries in each row, and v is the number of data entries in each

column); note that the total df is always one fewer than the total number of entries in the table

of data.

*** Obtained by subtracting the between-columns and between-rows sums of squares from total

sum of squares.

(1) For between columns (pH): F = Between columns mean square / Residual mean square

(2) For between rows (temperature) F = Between rows mean square / Residual mean square

In each case, consult a table of F (p = 0.05 or p = 0.01 or p = 0.001) where u is the between-

treatments df (columns or rows, as appropriate) and v is residual df. If the calculated F value

exceeds the tabulated value then the treatment effect (temperature or pH) is significant. In our

example, for the effect of pH (u is 2 degrees of freedom, v is 6 df) the critical F value at p =

0.05 is 5.14. In fact, we have a significant effect of pH at p = 0.001. For the effect of

temperature (u is 3 degrees of freedom, v is 6 df) the critical F value at p = 0.05 is 4.76. We find

that the effect of temperature is significant at p = 0.01.
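Steps (a)-(f) and the two F tests can be cross-checked with a short numpy script (a sketch, not part of the original page; symbols A, B, C, D match the text above):

```python
import numpy as np

# O.D. values: rows = temperatures (25-40 degrees C), columns = pH levels
data = np.array([[10, 19, 40],
                 [15, 25, 45],
                 [20, 30, 55],
                 [15, 22, 40]], dtype=float)

v, u = data.shape                        # v rows (4), u columns (3)
D = data.sum() ** 2 / data.size          # (336)^2 / 12 = 9408
A = (data ** 2).sum()                    # 11570
B = (data.sum(axis=0) ** 2 / v).sum()    # column term: 11304
C = (data.sum(axis=1) ** 2 / u).sum()    # row term: 9646.67

ss_cols, ss_rows = B - D, C - D
ss_resid = (A - D) - ss_cols - ss_rows

ms_cols = ss_cols / (u - 1)
ms_rows = ss_rows / (v - 1)
ms_resid = ss_resid / ((u - 1) * (v - 1))
print(round(ms_cols / ms_resid, 1))   # F for pH (columns): 208.1
print(round(ms_rows / ms_resid, 1))   # F for temperature (rows): 17.5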

Worked example:

Temp       pH …   pH …   pH 7.5   Σx                  n   Mean    Σx²    (Σx)²/n
25°C       10     19     40       69                  3   23      2061   1587
30°C       15     25     45       85                  3   28.33   2875   2408.3
35°C       20     30     55       105                 3   35      4325   3675
40°C       15     22     40       77                  3   25.67   2309   1976.3
Σx         60     96     180      336 (grand total)                      Total C = 9646.7
n (= v)    4      4      4
Mean       15     24     45
Σx²        950    2370   8250     Total A = 11570
(Σx)²/n    900    2304   8100     Total B = 11304

We select Anova: Two-Factor Without Replication from the analysis tools package. Note

that the Anova table gives Source of Variation separately for Rows, Columns and Error (=

Residual).

            pH …   pH …   pH 7.5
25°C        10     19     40
30°C        15     25     45
35°C        20     30     55
40°C        15     22     40

SUMMARY
            Count   Sum   Average    Variance
Row 1       3       69    23         237
Row 2       3       85    28.33333   233.3333
Row 3       3       105   35         325
Row 4       3       77    25.66667   166.3333
Column 1    4       60    15         16.66667
Column 2    4       96    24         22
Column 3    4       180   45         50

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Rows                  238.6667   3    79.55556   17.46341   0.00228    4.757055
Columns               1896       2    948        208.0976   2.87E-06   5.143249
Error                 27.33333   6    4.555556
Total                 2162       11

Of interest, another piece of information is revealed by this analysis - the effects of temperature do not interact with the effects of pH. In other words, a change of temperature does not change the response to pH, and vice versa. We can deduce this because the residual (error) mean square (MS) is small compared with the mean squares for temperature (rows) or pH (columns). [A low residual mean square tells us that most variation in the data is accounted for by the separate effects of temperature and pH.]

Temp (°C)   pH …   pH …   pH 7.5
25          10     19     40
30          15     25     30
35          20     30     25
40          25     22     10

Here an increase of temperature increases growth at low pH but decreases growth at high pH. If

we analysed these data we would probably find no significant effect of temperature or pH,

because these factors interact to influence growth. The residual mean square would be very

large. This type of result is not uncommon - for example, patients' age might affect their

susceptibility to levels of stress. Inspection of our data strongly suggests that there is

interaction. To analyse it, we would need to repeat the experiment with two replicates, then use

a slightly more complex analysis of variance to test for (1) separate temperature effects, (2)

separate pH effects, and (3) significant effects of interaction.

As an example, below is shown a print-out from "Excel" of the following table, where I have

assumed that we did the experiment above with replication.

Temp (°C)        pH …   pH …   pH 7.5
25     rep 1     9      18     36
       rep 2     11     20     44
30     rep 1     13     23     27
       rep 2     17     27     33
35     rep 1     18     27     23
       rep 2     22     33     27
40     rep 1     22     20     7
       rep 2     28     24     13

1. Enter the table of data on the spreadsheet.

2. From the analysis tools menu, choose Anova: Two-Factor with Replication.


3. Insert all the cells of the table in Input range ("Excel" assumes that column A and row 1 contain the labels).

In the table displayed on the screen (see below) the analysis shows the means for each

temperature and each pH. It also tells us the following (see the bottom rows of the table).

(i) There is no significant difference between temperatures overall ("Excel" has called the temperature "Sample") because the calculated F value (3.148) is less than the critical F value (3.49).

(ii) There is a highly significant difference between pH levels ("Columns"): the calculated F value (13.39) exceeds the critical F value (3.89), with p = 0.000877.

(iii) There is very highly significant interaction (p = 0.0000397) between temperature and pH. In other words, the response to pH depends on the temperature, or vice versa. This might have been the purpose of doing the experiment - to see how the organism behaves when subjected to combinations of factors.

        pH …   pH …   pH 7.5
25°C    9      18     36
25°C    11     20     44
30°C    13     23     27
30°C    17     27     33
35°C    18     27     23
35°C    22     33     27
40°C    22     20     7
40°C    28     24     13

SUMMARY
                  pH …   pH …   pH 7.5   Total
25°C   Count      2      2      2        6
       Sum        20     38     80       138
       Average    10     19     40       69
       Variance   2      2      32       36
30°C   Count      2      2      2        6
       Sum        30     50     60       140
       Average    15     25     30       70
       Variance   8      8      18       34
35°C   Count      2      2      2        6
       Sum        40     60     50       150
       Average    20     30     25       75
       Variance   8      18     8        34
40°C   Count      2      2      2        6
       Sum        50     44     20       114
       Average    25     22     10       57
       Variance   18     8      18       44
Total  Count      8      8      8
       Average    70     96     105
       Variance   36     36     76

ANOVA
Source of Variation    SS         df   MS         F          P-value    F crit
Sample (temperature)   116.5      3    38.83333   3.148649   0.064794   3.4903
Columns (pH)           330.3333   2    165.1667   13.39189   0.000877   3.88529
Interaction            1203       6    200.5      16.25676   3.97E-05   2.996117
Within                 148        12   12.33333
Total                  1797.833   23

Note: Because there is so much interaction, it is difficult to analyse the separate effects of temperature and pH. We should repeat the analysis using separate parts of the data - for example, ANOVA for all the pH treatments at 25°C, then at 30°C, then 35°C and 40°C. But we could assemble all the means (there are 12) in ranked order and do a multiple range test to find significant differences.
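The partitioning into temperature, pH, interaction and within-cell sums of squares can be reproduced by hand with numpy (a sketch, not part of the original page; it works from cell means rather than the A/B/C/D totals, which gives the same answer):

```python
import numpy as np

# 4 temperatures x 3 pH levels x 2 replicates, from the table above
data = np.array([
    [[9, 11],  [18, 20], [36, 44]],   # 25 degrees C
    [[13, 17], [23, 27], [27, 33]],   # 30 degrees C
    [[18, 22], [27, 33], [23, 27]],   # 35 degrees C
    [[22, 28], [20, 24], [7, 13]],    # 40 degrees C
], dtype=float)

a, b, r = data.shape
gm = data.mean()
ss_total = ((data - gm) ** 2).sum()
ss_temp = b * r * ((data.mean(axis=(1, 2)) - gm) ** 2).sum()
ss_ph = a * r * ((data.mean(axis=(0, 2)) - gm) ** 2).sum()
ss_within = ((data - data.mean(axis=2, keepdims=True)) ** 2).sum()
ss_inter = ss_total - ss_temp - ss_ph - ss_within

# F for interaction: interaction MS over within-cell (error) MS
f_inter = (ss_inter / ((a - 1) * (b - 1))) / (ss_within / (a * b * (r - 1)))
print(round(ss_inter, 1), round(f_inter, 2))   # 1203.0 16.26
```

The interaction sum of squares (1203) and its F value (16.26) match the "Excel" ANOVA table above.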


CHI SQUARED TEST 17/06/13 4:25 PM


Background: The Student's t-test and Analysis of Variance are used to analyse measurement data which, in theory, are continuously variable. Between a measurement of, say, 1 mm and 2 mm there is a continuous range from 1.0001 to 1.9999 mm.

But in some types of experiment we wish to record how many individuals fall into a particular category, such as blue eyes or brown eyes, motile or non-motile cells, etc. These counts, or enumeration data, are discontinuous (1, 2, 3 etc.) and must be treated differently from continuous data. Often the appropriate test is chi-squared (χ²), which we use to test whether the numbers of individuals in different categories fit a null hypothesis (an expectation of some sort). Chi-squared analysis is simple, and valuable for all sorts of things - not just Mendelian crosses!

On this page we build from the simplest examples to more complex ones. When you have gone

through the examples you should consult the checklist of procedures and potential pitfalls.

A simple example

Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the

Pharmacology Honours class over the past ten years there have been 80 females and 40 males.

Is this a significant departure from expectation? We proceed as follows (but note that we are

going to overlook a very important point that we shall deal with later).

Set out a table as shown below, with the "observed" numbers and the "expected" numbers (i.e.

our null hypothesis).

Then subtract each "expected" value from the corresponding "observed" value (O-E).

Square the O-E values, and divide each by the relevant "expected" value to give (O-E)²/E.

Add all the (O-E)²/E values and call the total X².

                       Female   Male   Total
Observed numbers (O)   80       40     120
Expected numbers (E)   60*3     60*3   120*1
O-E                    20       -20    0*2
(O-E)²                 400      400
(O-E)²/E               6.67     6.67   13.34 = X²

Notes:

*1 This total must always be the same as the observed total

*2 This total must always be zero

*3 The null hypothesis was obvious here: we are told that there are equal numbers of males and

females in the Science Faculty, so we might expect that there will be equal numbers of males

and females in Pharmacology. So we divide our total number of Pharmacology students (120)

in a 1:1 ratio to get our ‘expected’ values.

Now we must compare our X² value with a χ² (chi-squared) value in a table of χ² with n-1 degrees of freedom (where n is the number of categories, i.e. 2 in our case - males and females). We have only one degree of freedom (n-1). From the χ² table, we find a critical value of 3.84 for p = 0.05.

If our calculated value of X² exceeds the critical value of χ² then we have a significant difference from the expectation. In fact, our calculated X² (13.34) exceeds even the tabulated χ² value (10.83) for p = 0.001. This shows an extreme departure from expectation. It is still possible that we could have got this result by chance - a probability of less than 1 in 1000. But we could be 99.9% confident that some factor leads to a "bias" towards females entering Pharmacology Honours. [Of course, the data don't tell us why this is so - it could be self-selection or any other reason.]
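The whole test is one function call in Python (a cross-check, assuming scipy is installed; scipy.stats.chisquare is a goodness-of-fit test against supplied expected frequencies):

```python
from scipy import stats

observed = [80, 40]
expected = [60, 60]   # null hypothesis: a 1:1 split of the 120 students

x2, p = stats.chisquare(observed, f_exp=expected)
print(round(x2, 2))   # 13.33 (the hand calculation gives 13.34 through rounding)
print(p < 0.001)      # True: an extreme departure from expectation
```

The p-value below 0.001 matches the conclusion drawn from the printed χ² table.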

Now repeat this analysis, but knowing that 33.5% of all students in the Science Faculty are

males

                       Female   Male    Total
Observed numbers (O)   80       40      120
Expected numbers (E)   79.8*3   40.2    120*1
O-E                    0.2      -0.2    0
(O-E)²                 0.04     0.04
(O-E)²/E               0.0005   0.001   0.0015 = X²

Note *1: We know that the expected total must be 120 (the same as the observed total), so we

can calculate the expected numbers as 66.5% and 33.5% of this total.

Note *3: Although the observed values must be whole numbers, the expected values can be

(and often need to be) decimals.

Now, from a χ² table we see that our data do not depart from expectation (the null hypothesis). They agree remarkably well with it, and might lead us to suspect that there was some design behind this! In most cases, though, we might get intermediate X² values, which neither agree strongly nor disagree with expectation. Then we conclude that there is no reason to reject the null hypothesis.

Chi-squared is a mathematical distribution with properties that enable us to equate our calculated X² values to χ² values. The details need not concern us, but we must take account of some limitations so that χ² can be used validly for statistical tests.

(i) Yates correction for two categories of data (one degree of freedom)

When there are only two categories (e.g. male/female) or, more correctly, when there is only one degree of freedom, the χ² test should not, strictly, be used. There have been various attempts to correct this deficiency, but the simplest is to apply Yates correction to our data. To do this, we simply subtract 0.5 from each calculated value of O-E, ignoring the sign (plus or minus). In other words, an O-E value of +5 becomes +4.5, and an O-E value of -5 becomes -4.5. To signify that we are reducing the absolute value, ignoring the sign, we write |O-E|-0.5. Then we continue as usual but with these new (corrected) values: we calculate (|O-E|-0.5)², then (|O-E|-0.5)²/E, and sum these values to get X². Yates correction only applies when we have two categories (one degree of freedom).

We ignored this point in our first analysis of student numbers (above). So here is the table

again, using Yates correction:

                       Female   Male     Total
Observed numbers (O)   80       40       120
Expected numbers (E)   60*3     60*3     120*1
O-E                    20       -20      0*2
|O-E|-0.5              19.5     19.5
(|O-E|-0.5)²           380.25   380.25
(|O-E|-0.5)²/E         6.338    6.338    12.676 = X²

In this case, the observed numbers were so different from the expected 1:1 ratio that Yates correction made little difference - it only reduced the X² value from 13.34 to 12.68. But there would be other cases where Yates correction would make the difference between acceptance or rejection of the null hypothesis.
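Because scipy's goodness-of-fit function does not apply Yates correction, the corrected statistic is best computed directly (a sketch of the calculation in the table above):

```python
observed = [80, 40]
expected = [60, 60]

# Yates correction: subtract 0.5 from each |O - E| before squaring
x2 = sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 3))   # 12.675
```

This gives 12.675 exactly; the table's 12.676 comes from rounding each (|O-E|-0.5)²/E term to three decimals before adding.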

Again, to satisfy the mathematical assumptions underlying χ², the expected values should be relatively large. The following simple rules are applied:

no expected category should be less than 1 (it does not matter what the observed values

are)

AND no more than one-fifth of expected categories should be less than 5.


What can we do if our data do not meet these criteria? We can either collect larger samples so that we satisfy the criteria, or we can combine the data for the smaller "expected" categories until their combined expected value is 5 or more, then do a χ² test on the combined data. We will see an example below.

Suppose that we want to test the results of a Mendelian genetic cross. We start with 2 parents of

genotype AABB and aabb (where A and a represent the dominant and recessive alleles of one

gene, and B and b represent the dominant and recessive alleles of another gene).

We know that all the F1 generation (first generation progeny of these parents) will have

genotype AaBb and that their phenotype will display both dominant alleles (e.g. in fruit flies all

the F1 generation will have red eyes rather than white eyes, and normal wings rather than

stubby wings).

This F1 generation will produce 4 types of gamete (AB, Ab, aB and ab), and when we self-cross

the F1 generation we will end up with a variety of F2 genotypes (see the table below).

Gametes   AB      Ab      aB      ab
AB        AABB    AABb    AaBB    AaBb
Ab        AABb    AAbb    AaBb    Aabb
aB        AaBB    AaBb    aaBB    aaBb
ab        AaBb    Aabb    aaBb    aabb

All these genotypes fall into 4 phenotypes: double dominant (AB), single dominant A (Ab), single dominant B (aB) and double recessive (ab). We know that in classical Mendelian genetics the expected ratio of these phenotypes is 9:3:3:1.

Phenotype               AB      Ab      aB      ab      Total
Observed numbers (O)    40      20      16      4       80
Expected numbers (E)    45      15      15      5       80 *1
O-E                     -5      5       1       -1      0
(O-E)²                  25      25      1       1
(O-E)²/E                0.56    1.67    0.07    0.20    2.50 = X²

[Note: *1. From our expected total of 80 we can calculate the expected values for the categories in the ratio 9:3:3:1.]


From a χ² table with 3 df (we have four categories, so 3 df) at p = 0.05, we find that a χ² value of 7.82 would be necessary to reject the null hypothesis (expectation of a 9:3:3:1 ratio). Our X² of 2.50 falls well below this, so our data are consistent with the expected ratio.
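The same test (3 df, so no Yates correction) can be sketched as:

```python
# Chi-squared test of the fruit-fly counts against a 9:3:3:1 Mendelian ratio.
observed = [40, 20, 16, 4]
ratio = [9, 3, 3, 1]
total = sum(observed)                                 # 80 flies
expected = [total * r / sum(ratio) for r in ratio]    # 45, 15, 15, 5
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x2, 2))  # 2.49 (the table's 2.50 comes from rounding each term first)
# 2.49 < 7.82 (chi-squared, 3 df, p = 0.05), so the 9:3:3:1 ratio is not rejected.
```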

Combining categories

Look at the table above. We only just collected enough data to be able to test a 9:3:3:1 expected ratio. If we had counted only 70 (or 79) fruit flies then our lowest expected category (ab) would have been less than 5, and we could not have done the test as shown. We would break one of the "rules" for χ² - that no more than one-fifth of expected categories should be less than 5. We could still do the analysis, but only after combining the smaller categories and testing against a different expectation.

Here is an illustration of this, assuming that we had used 70 fruit flies and obtained the

following observed numbers of phenotypes.

Phenotype               AB       Ab       aB       ab       Combined aB+ab   Total
Observed numbers (O)    34       18       15       3        18               70
Expected numbers (E)    39.375   13.125   13.125   4.375    17.5             70 *1
O-E                     -5.375   4.875                      0.5              0
(O-E)²/E                0.734    1.811                      0.014            2.559 = X²

One of our expected categories (ab) is less than 5 (shown in bold italics in the table). So we

have combined this category with one of the others and then must analyse the results against an

expected ratio of 9:3:4. The numbers in the expected categories were entered by dividing the

total (70) in this ratio.

Now, with 3 categories we have only 2 degrees of freedom. The rest of the analysis is done as

usual, and we still have no reason to reject the null hypothesis. But it is a different null

hypothesis: the expected ratio is 9:3:4 (double dominant: single dominant Ab: single dominant

aB plus double recessive ab).

Suppose that we have a population of fungal spores which clearly fall into two size categories,

large and small. We incubate these spores on agar and count the number of spores that

germinate by producing a single outgrowth or multiple outgrowths.

Spores counted:

120 large spores, of which 80 form multiple outgrowths and 40 produce single

outgrowths

60 small spores, of which 18 form multiple outgrowths and 42 produce single


outgrowths

Is there a significant difference in the way that large and small spores germinate?

Procedure:

                      Large spores   Small spores   Total
Multiple outgrowth    80             18             98
Single outgrowth      40             42             82
Total                 120            60             180

In this case there is no "theory" that gives us an obvious null hypothesis. For example, we have no reason to suppose that 55% or 75% or any other percentage of large spores will produce multiple outgrowths. So the most sensible null hypothesis is that spore size has no effect on germination - that large and small spores produce multiple and single outgrowths in the same proportions. Then, if our data do not agree with this expectation we will have evidence that spore size affects the type of germination.

This step is complicated by the fact that we have different numbers of large and

small spores, and different numbers of multiple versus single outgrowths. But we

can find the expected frequencies (a, b, c and d) by using the grand total (180) and

the column and row totals (see table below).

                                    Large spores   Small spores   Row total
Multiple outgrowth   Observed (O)   80             18             98
                     Expected (E)   a              b              (expected 98)
Single outgrowth     Observed (O)   40             42             82
                     Expected (E)   c              d              (expected 82)
Column totals                       120            60             180

To find the expected value "a" we know that a total of 98 spores had multiple

outgrowths and that 120 of the total 180 spores were large. So a is 98(120/180) =

65.33.

Similarly, to find b we know that 98 spores had multiple outgrowths and that 60 of

the total 180 spores were small. So, b is 98(60/180) = 32.67. [Actually, we could

have done this simply by subtracting a from the expected 98 row total - the

expected total must always be the same as the observed total]


To find c we know that a total of 82 spores had single outgrowths and that 120 of the total

180 spores were large. So c is 82(120/180) = 54.67.

To find d we know that 82 spores had single outgrowths and that 60 of the total 180

spores were small. So d is 82(60/180) = 27.33. [This value also could have been

obtained by subtraction]

You might think that there are 3 degrees of freedom (because there are 4 categories). But there is actually one degree of freedom! Starting from the 4 categories we lose one degree of freedom for the grand total, and we lose a further 2 degrees of freedom because we used two pieces of information to construct our null hypothesis - a column total and a row total. Once we had used these we would have needed only one data entry in order to fill in the rest of the values (therefore we have one degree of freedom).

Of course, with one degree of freedom we must use Yates correction (subtract 0.5

from each O-E value).

5. Run the analysis as usual: calculate O-E, (O-E)² and (O-E)²/E for each category, then sum the (O-E)²/E values to obtain X² and test this against χ².

The following table shows the working. The sum of the four (|O-E|-0.5)²/E values gives X² = 20.23.

                                    Large spores   Small spores   Row total
Multiple     Observed (O)           80             18             98
outgrowth    Expected (E)           65.33          32.67          98
             O-E                    +14.67         -14.67
             |O-E|-0.5              14.17          14.17
             (|O-E|-0.5)²/E         3.07           6.15
Single       Observed (O)           40             42             82
outgrowth    Expected (E)           54.67          27.33          82
             O-E                    -14.67         +14.67
             |O-E|-0.5              14.17          14.17
             (|O-E|-0.5)²/E         3.67           7.35
Column totals                       120            60             180

We compare the X² value with tabulated χ² with one degree of freedom. Our calculated X² exceeds the tabulated χ² value (10.83) for p = 0.001. We conclude that there is a highly

significant departure from the null hypothesis - we have very strong evidence that large spores

and small spores show different germination behaviour.
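The whole contingency procedure - expected values from the marginal totals, then Yates-corrected chi-squared - can be sketched as:

```python
# 2x2 contingency test for the spore-germination data above.
observed = [[80, 18],   # multiple outgrowths: large spores, small spores
            [40, 42]]   # single outgrowths:   large spores, small spores

row_totals = [sum(row) for row in observed]        # 98, 82
col_totals = [sum(col) for col in zip(*observed)]  # 120, 60
grand_total = sum(row_totals)                      # 180

x2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total  # e.g. 98 * 120/180 = 65.33
        x2 += (abs(o - e) - 0.5) ** 2 / e                # Yates correction (1 df)
print(round(x2, 2))  # 20.23, far above 10.83 (chi-squared, 1 df, p = 0.001)
```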


Chi squared is a very simple test to use. The only potentially difficult things about it are:

calculating the expected frequencies when we have double classifications - use the

marginal subtotals and totals to work out these frequencies

determining the number of degrees of freedom, especially when we have to use some of

the data to construct the null hypothesis.

If you follow the examples given on this page you should not have too many difficulties.

Always work with "real numbers" in the observed categories, not with proportions.

To illustrate this, consider a simple chi squared test on tossing of coins. Suppose that in

100 throws you get 70 "heads" and 30 "tails". Using Yates correction (for one degree of freedom) you would find an X² value of 15.21, equating to a χ² probability less than 0.001. But if you got 7 "heads" and 3 "tails" in a test of 10 throws it would be entirely

consistent with random chance. The ratio is the same (7:3), but the actual numbers

determine the level of significance in a chi squared test.

Observed categories must have whole numbers, but expected categories can have

decimals.

Follow the rules about the minimum numbers in expected categories. These rules do not

apply to the observed categories.

Remember Yates correction for one degree of freedom.


POISSON DISTRIBUTION 17/06/13 4:25 PM

[This page is kept for archival purposes. Text and links may be out of date.]

Use this test for counts of events that should be randomly distributed in space and time. For

example, the number of cells in a certain number of squares in a counting chamber, or the

number of colonies growing on agar plates in a dilution plating assay. With this test we can

compare such counts and place confidence limits on them.

As background to the Poisson distribution, we should compare the treatment of random count

data with the treatment of measurement data. Suppose that we did a survey of the height of

postal vans and another survey of the height of postal workers. The mean heights might be

very similar (depending on the type of van) but the variance (a measure of the spread of data)

would almost certainly be different. People are much more variable than postal vans of a given

type. That's obvious. But suppose that we look down on a city where for some reason (perhaps

a catastrophe) the postal vans had been abandoned and the postal workers were making their

way to their various homes, and we count the number of postal vans in each square kilometre,

and do the same for postal workers. Then if the means were the same the variance also would

be the same. If counts of anything are randomly distributed in space and time then they follow the Poisson rule:

variance = mean, so the standard deviation = square root of the mean.

The same point applies if we have a suspension of blood cells in a counting chamber. Provided

that these cells do not attract or repel one another their count will conform to a Poisson distribution. If there is a mean of 80 cells per square of the counting chamber, then there will be a variance of 80, a standard deviation of 8.94 (i.e. √80) and 95% confidence limits of ±8.94 × 1.96 = ±17.5 (1.96 being the t value for infinite degrees of freedom). In other words, 95% of squares in the counting chamber would be expected to contain a cell count between 62.5 and 97.5.

1. Provided that the cells are randomly distributed (no mutual attraction or repulsion) then their

count conforms to Poisson distribution, and this applies to all the counts (of various types) that

ever have been made or that ever will be made. So we need not bother with degrees of freedom

- we use the t value for infinite degrees of freedom (actually this is termed a d value).

2. Provided that our count is reasonably high (say, above 30) then it can be treated as part of a Poisson distribution, and we do not even need replicates. So, a count of 30 in one square of a counting chamber (or a count of 80 pooled from, for example, 3 squares) is all we need. Taking the count of 80 as our example, it has:

a mean of 80,

variance (s²) of 80,

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress10.html#Poisson%20distribution%20for%20count%20data Page 1 of 5

POISSON DISTRIBUTION 17/06/13 4:25 PM

standard error (sn) of √80, because sn = s/√n and we counted one square (or pooled 3 squares) so n = 1.

The confidence limits of a count X are calculated, with a correction factor, as:

X + 1.96²/2 ± 1.96 √(X + 1.96²/4)

Thus, for our count of 80, the 95% confidence limits are:

80 + 1.96²/2 ± 1.96 √(80 + 1.96²/4) = 81.92 ± 17.64, so the limits are 64.28 to 99.56
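This confidence-limit formula can be checked numerically with a short sketch (the function name is my own):

```python
# 95% confidence limits for a single Poisson count X:
# X + 1.96^2/2 +/- 1.96 * sqrt(X + 1.96^2/4)
import math

def poisson_limits_95(x):
    centre = x + 1.96 ** 2 / 2
    half_width = 1.96 * math.sqrt(x + 1.96 ** 2 / 4)
    return centre - half_width, centre + half_width

low, high = poisson_limits_95(80)
print(round(low, 2), round(high, 2))  # 64.29 99.56 (the 64.28 above uses rounded intermediates)
```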

If you are starting to wonder where all that preamble is taking us, suppose that we count 100

cells in a certain volume of bacterial suspension (or blood), and 150 cells in the same volume of

another suspension. Are these significantly different?

Call the first count X1 and the second count X2, and use these in the following equation:

d = ( |X1 - (X1 + X2)/2| - 0.5 ) / √[(X1 + X2)/4]

[We have applied a correction factor of 0.5 here, as in Yates correction for χ², to improve the estimate of d. The symbols "| |" mean that we take the absolute (positive) value of the quantity between the two lines before subtracting 0.5; so a value of either +5 or -5 becomes 4.5.]

If we use our counts of 100 and 150 in the equation above, we get:

d = ( |100 - (100 + 150)/2| - 0.5 ) / √[(100 + 150)/4] = 24.5 / 7.9 = 3.10

We compare this with the d values on the bottom line of a t table and find that it is higher than

the d value for p = 0.002. Our two counts are significantly different; there is a probability of

only 2 in 1000 of finding this difference by chance.
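The d calculation for two counts can be sketched as:

```python
# Comparing two Poisson counts (100 and 150) with the corrected d statistic.
import math

n1, n2 = 100, 150
d = (abs(n1 - (n1 + n2) / 2) - 0.5) / math.sqrt((n1 + n2) / 4)
print(round(d, 2))  # 3.1, exceeding the tabulated d value for p = 0.002
```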

If the counts were obtained from different volumes (termed V1 and V2) then we simply apply a modified formula:

d = ( |X1 - (X1 + X2)(V1/(V1 + V2))| - 0.5 ) / √[ (X1 + X2) (V1/(V1 + V2)) (V2/(V1 + V2)) ]

Dilution plating

All the methods above can be applied to dilution plating of bacteria or fungi on agar plates. For example, if we used a soil dilution and counted 67 colonies on a plate at the 10⁻⁵ dilution, then we can estimate the original population in terms of its mean ± SE, as (67 ± 8.18) × 10⁵ colony-forming units ml⁻¹ (√67 = 8.18).

Sometimes we might wish to test whether counts conform to a Poisson distribution. For

example, motile cells can aggregate into clumps, non-motile cells can agglutinate by surface

interactions, and cells can also repel one another by producing metabolites. We might wish to

test whether these events are occurring, in order to investigate the mechanisms or their

biological significance. The method is simple.

Suppose that we incubate cells in a counting chamber for 30 minutes and then count the number

of cells in several different squares of the chamber (of course, we can choose the size of our

sampling unit by pooling counts for groups of 4 or 16 squares, etc. to get mean counts large

enough (say, at least 30) to conform to Poisson expectation).

We might find the following counts in five squares of the chamber: 50, 30, 80, 90, 10.

For these five replicate counts we can obtain a mean (52) and variance in the normal way (see methods) by calculating:

Σd² = Σ(x - x̄)² = 4480

Then variance, s² = Σd²/(n - 1) = 4480/4 = 1120

If the data conformed to a Poisson distribution, then the mean of 52 would have a variance of 52. But our calculated variance is 1120. It seems that our counts do not conform to Poisson expectation - the cells are not randomly distributed in the counting chamber.

There are different ways of testing this, which need not be explained, but the simplest is to calculate Σd²/mean (= 4480/52 = 86.2) and equate this to χ² with n-1 degrees of freedom (n = 5 in our example). From the χ² table we see that our calculated value of 86.2 far exceeds the tabulated value of 18.47 for 4 df at p = 0.001. So our counts differ significantly from a Poisson expectation - the cells are not randomly distributed. Instead, we have highly significant evidence that they tend to be aggregated. [We might explain this in terms of agglutination, or if the cells are motile they might release substances that attract other cells.]

[An explanation of what we have done. To test for randomness of distribution, we calculate Σd² (the sum of squared deviations of our five replicate counts from their mean) and divide it by the mean. If the data fit a Poisson distribution then the variance equals the mean, so Σd²/mean should be close to n-1, the centre of the χ² distribution for n-1 degrees of freedom. Any major clustering (aggregation) of cells etc. will inflate the variance and give a Σd²/mean value much greater than this. Conversely, if the cells etc. are "too uniformly dispersed" the Σd²/mean value will be much smaller.]

Now suppose that we had five counts: 49, 50, 50, 49, 50.

We can calculate the mean (49.6), Σd² (1.2) and Σd²/mean (= 0.024). Consulting a χ² table we see that our value of 0.024 is less than the tabulated value (0.297) for 4 degrees of freedom at p = 0.99. In other words, again our counts do not fit a Poisson expectation - the cells have a significant tendency (99% probability) to be uniformly dispersed. Perhaps they repel one

another or perhaps the uniformity is caused by some other factor - that is a question to be

addressed by a separate experiment.
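The dispersion test for these uniformly dispersed counts can be sketched as:

```python
# Dispersion test: sum of squared deviations divided by the mean,
# compared with chi-squared on n - 1 degrees of freedom.
counts = [49, 50, 50, 49, 50]
mean = sum(counts) / len(counts)               # 49.6
sum_d2 = sum((c - mean) ** 2 for c in counts)  # 1.2
ratio = sum_d2 / mean
print(round(ratio, 3))  # 0.024, below 0.297 (chi-squared, 4 df, p = 0.99)
```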

Poisson distributions don't apply only to cells or bacterial counts (or postal vans). They apply

equally to elephants and animal behaviour. For example, if you surveyed an area of a large

game park and counted the elephants in each square kilometre (or whatever area is appropriate),

would the data fit a Poisson distribution? Would this be true at all times of the year? The results

you obtain would only tell you, in statistical terms, whether the counts fit a Poisson distribution

(i.e. whether elephants are randomly distributed in space). But the findings would suggest a lot

about the behaviour of elephants. Do they have large family groups? Do these groups disperse

at certain times of the year? Of course, what this analysis can never tell us is why they behave in

this way - do elephants congregate at sites of food abundance, etc. and disperse to forage widely

in periods of food shortage? Questions such as those need to be formulated and tested, but at

least the statistical analysis of distribution prompts us to ask them.


CORRELATION AND REGRESSION 17/06/13 4:26 PM


The techniques described on this page are used to investigate relationships between two

variables (x and y). Is a change in one of these variables associated with a change in the other?

For example, if we increase the temperature do we increase the growth rate of a culture or the

rate of a chemical reaction? Does an increase in DDT content of bird tissues correlate with

thinning of the egg shell? Is an increase in slug density in a field plot associated with a decrease

in seedling development?

We can use the technique of correlation to test the statistical significance of the association. In

other cases we use regression analysis to describe the relationship precisely by means of an

equation that has predictive value. We deal separately with these two types of analysis -

correlation and regression - because they have different roles.

Correlation

Suppose that we took 7 mice and measured their body weight and their length from nose to tail.

We obtained the following results and want to know if there is any relationship between the

measured variables. [To keep the calculations simple, we will use small numbers]

Mouse   x    y
1       1    2
2       4    5
3       3    8
4       4    12
5       8    14
6       9    19
7       8    22

Procedure

(1) Plot the results on graph paper. This is the essential first step, because only then can we see what the relationship might be - is it linear, logarithmic, sigmoid, etc.? The analysis below rests on a linear

http://archive.bio.ed.ac.uk/jdeacon/statistics/tress11.html#Correlation%20coefficient Page 1 of 10

CORRELATION AND REGRESSION 17/06/13 4:26 PM

assumption. If it does not seem to be linear we might need to transform the data.

(2) Set out a table as follows and calculate Σx, Σy, Σx², Σy², Σxy, and the means x̄ and ȳ.

Mouse     x    y    x²    y²     xy
Mouse 1   1    2    1     4      2
Mouse 2   4    5    16    25     20
Mouse 3   3    8    9     64     24
Mouse 4   4    12   16    144    48
Mouse 5   8    14   64    196    112
Mouse 6   9    19   81    361    171
Mouse 7   8    22   64    484    176
Totals    37   82   251   1278   553

(3)-(6) From these totals calculate the correlation coefficient, r. For our data this gives r = 0.9014.


(7) Look up r in a table of correlation coefficients (ignoring + or - sign). The number of degrees

of freedom is two less than the number of points on the graph (5 df in our example because we

have 7 points). If our calculated r value exceeds the tabulated value at p = 0.05 then the

correlation is significant. Our calculated value (0.9014) does exceed the tabulated value (0.754).

It also exceeds the tabulated value for p = 0.01 but not for p = 0.001. If the null hypothesis were

true (that there is no relationship between length and weight) we would have obtained a correlation coefficient as high as this on fewer than 1 in 100 occasions. So we can be confident that

weight and length are positively correlated in our sample of mice.
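Working from the totals in the table above, r can be checked numerically (this uses the standard sums-of-squares form of the correlation coefficient):

```python
# Correlation coefficient for the mouse data.
import math

x = [1, 4, 3, 4, 8, 9, 8]
y = [2, 5, 8, 12, 14, 19, 22]
n = len(x)
sx, sy = sum(x), sum(y)                 # 37, 82
sxx = sum(v * v for v in x)             # 251
syy = sum(v * v for v in y)             # 1278
sxy = sum(a * b for a, b in zip(x, y))  # 553
r = (sxy - sx * sy / n) / math.sqrt((sxx - sx ** 2 / n) * (syy - sy ** 2 / n))
print(round(r, 4))  # 0.9014, exceeding the tabulated 0.754 (5 df, p = 0.05)
```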

Important notes:

1. If the calculated r value is positive (as in this case) then the slope will rise from left to right

on the graph. As weight increases, so does the length. If the calculated value of r is negative the

slope will fall from left to right. This would indicate that length decreases as weight increases.

2. The r value will always lie between -1 and +1. If you have an r value outside of this range

you have made an error in the calculations.

3. A significant correlation only shows that two factors vary in a related way (positively or negatively); it does not show that one causes the other. This is obvious in our example because there is no logical reason to think that weight influences the length of the animal (both factors are influenced by age or growth stage). But it can be easy to fall into the "causality trap" when looking at other types of correlation.

The correlation coefficient is calculated as:

r = Σ(dx·dy) / √(Σdx² · Σdy²)

The part above the line in this equation is a measure of the degree to which x and y vary together (using the deviations d of each from the mean). The part below the line is a measure of the degree to which x and y vary separately.

It would be tempting to try to fit a line to the data we have just analysed - producing an equation

that shows the relationship, so that we might predict the body weight of mice by measuring

their length, or vice-versa. The method for this is called linear regression.

However, this is not strictly valid because linear regression is based on a number of

assumptions. In particular, one of the variables must be "fixed" experimentally and/or precisely measurable. So, the simple linear regression methods can be used only when we define some

experimental variable (temperature, pH, dosage, etc.) and test the response of another variable

to it.

The variable that we fix (or choose deliberately) is termed the independent variable. It is always


plotted on the X axis. The other variable is termed the dependent variable and is plotted on the

Y axis.

Suppose that we had the following results from an experiment in which we measured the

growth of a cell culture (as optical density) at different pH levels.

pH Optical density

3 0.1

4 0.2

4.5 0.25

5 0.32

5.5 0.33

6 0.35

6.5 0.47

7 0.49

7.5 0.53

We plot these results (see below) and they suggest a straight-line relationship.

Using the same procedures as for correlation, set out a table and calculate Σx, Σy, Σx², Σy², Σxy, and the means x̄ and ȳ. For these data:

Σx = 49, Σy = 3.04, Σx² = 284, Σy² = 1.1882, Σxy = 18.2, x̄ = 5.444, ȳ = 0.3378

Now we want to use regression analysis to find the line of best fit to the data. We have done

nearly all the work for this in the calculations above.

The regression equation for y on x is: y = bx + a where b is the slope and a is the intercept

(the point where the line crosses the y axis)

We calculate b as:

b = (Σxy - (Σx)(Σy)/n) / (Σx² - (Σx)²/n)

which for our data gives b = 0.0958.

We calculate a as:

a = ȳ - b·x̄

From the known values of ȳ (0.3378), x̄ (5.444) and b (0.0958) we thus find a (-0.1837).

So the equation for the line of best fit is: y = 0.096x - 0.184 (to 3 decimal places).

To draw the line through the data points, we substitute in this equation. For example:

when x = 4, y = 0.200, so one point on the line has the x,y coordinates (4, 0.200);

when x = 7, y = 0.488, so another point on the line has the x,y coordinates (7, 0.488).

It is also true that the line of best fit always passes through the point with coordinates (x̄, ȳ), so we actually need only one other calculated point in order to draw a straight line.
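The slope and intercept can be verified with a short sketch using the least-squares formulas above:

```python
# Least-squares line y = bx + a for the pH / optical density data.
x = [3, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5]
y = [0.1, 0.2, 0.25, 0.32, 0.33, 0.35, 0.47, 0.49, 0.53]
n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(v * v for v in x)
b = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)  # slope
a = sy / n - b * sx / n                        # intercept: a = mean(y) - b * mean(x)
print(round(b, 4), round(a, 4))  # 0.0957 -0.1835
```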


Below is a printout of the Regression analysis from Microsoft "Excel". It is obtained simply by

entering two columns of data (x and y) then clicking "Tools - Data analysis - Regression". We

see that it gives us the correlation coefficient r (as "Multiple R"), the intercept and the slope of

the line (seen as the "coefficient for pH" on the last line of the table). It also shows us the result

of an Analysis of Variance (ANOVA) to calculate the significance of the regression (4.36 × 10⁻⁷).

Regression Statistics
Multiple R           0.989133329
R Square             0.978384742
Adjusted R Square    0.975296848
Standard Error       0.022321488
Observations         9

ANOVA
            df   SS         MS   F   Significance F
Total       8    0.161356

            Coefficients   Standard Error   t Stat     P-value    Lower 95%   Upper 95%
Intercept   -0.18348387    0.030215         -6.07269   0.000504   -0.25493    -0.11204
pH          0.095741935    0.005379         17.80015   4.36E-07   0.083023    0.108461

When we plot our initial results on a graph it will usually be clear whether they best fit a linear

relationship or a logarithmic relationship or something else, like a sigmoid curve. We can

analyse all these relationships in exactly the same way as above if we transform the x and y

values as appropriate so that the relationship between x and y becomes linear. BEWARE - you

MUST look at a scatter plot on graph paper to see what type of relationship you have. If you simply instruct a computer programme such as "Excel" to run a regression on whatever data you supply, it will obligingly fit a straight line whether or not a straight line is appropriate.


(i) For plots of data that suggest exponential (logarithmic) growth, convert all y values to log

of y (using either log10 or loge). Then go through the linear regression procedures above, using

the log y data instead of y data.

(ii) For sigmoid curves (drug dose response curves and UV killing curves are often sigmoid),

the y values (proportion of the population responding to the treatment) can be converted using a

logistic or probit transformation. Sometimes it is useful to convert the x (dose) data to

logarithms; this condenses the x values, removing the long tails of non-responding individuals

at the lowest and highest dose levels. A plot of logistic or probit (y) against dose (x) or log of

dose (x) should show a straight-line relationship.

The table below shows part of a page from an 'Excel' worksheet, produced as an exercise to show how transformations are performed. Columns in an Excel worksheet are headed A-G and rows are labelled 1-21, so each cell in the table can be identified (e.g. B2 or F11).

Representative proportions were inserted in cells A2-A21, and % values were inserted in cells B2-B21. A formula converting Proportion to logistic was then entered in cell C2 [the logistic values in the printout are consistent with log10(p/(1-p)), i.e. =LOG10(A2/(1-A2))]. The formula is not seen in the cell, but as soon as we move out of cell C2 it automatically gives the logistic value (in C2) for the proportion in cell A2, seen in the printout below. Copying and then pasting this formula into every other cell of column C produces a corresponding logistic value (e.g. cell C3 contains the logistic value of the proportion in cell A3).

The formula for converting % to Probit (without spaces) is: =NORMINV(B2/100,5,1). This was then pasted into all cells of column D.

Next, a formula was entered in cell E2 to convert Probit to Percentage, and pasted into all cells of column E.

Row   A: Proportion   B: Percent   C: Proportion to logistic   D: % to Probit   E: Probit to %   F: % to arcsin   G: arcsin to %

(each row below gives the row number, then the values in columns A-G)


3 0.005 0.5 -2.29885 2.424 0.5 4.055 0.5

4 0.01 1 -1.99564 2.674 1 5.739 1

5 0.02 2 -1.6902 2.946 2 8.13 2

6 0.03 3 -1.50965 3.119 3 9.974 3

7 0.04 4 -1.38021 3.249 4 11.54 4

8 0.05 5 -1.27875 3.355 5 12.92 5

9 0.06 6 -1.19498 3.445 6 14.18 6

10 0.07 7 -1.12338 3.524 7 15.34 7

11 0.08 8 -1.0607 3.595 8 16.43 8

12 0.09 9 -1.0048 3.659 9 17.46 9

13 0.1 10 -0.95424 3.718 10 18.43 10

14 0.5 50 0 5 50 45 50

15 0.96 96 1.380211 6.751 96 78.46 96

16 0.97 97 1.50965 6.881 97 80.03 97

17 0.98 98 1.690196 7.054 98 81.87 98

18 0.995 99.5 2.298853 7.576 99.5 85.95 99.5

19 0.9999 99.99 3.999957 8.719 99.99 89.43 99.99

20 0.99999 99.999 4.999996 9.265 99.999 89.82 99.999

21 0.999999 99.9999 6 9.768 100 89.94 99.9999
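These worksheet columns can be reproduced outside Excel. My reading of the printout (an assumption, but it matches the tabulated values) is that the logistic column is log10(p/(1-p)), the probit column is an inverse normal CDF with mean 5 and sd 1 (as NORMINV(p,5,1)), and the arcsin column is the angle whose sine is the square root of the proportion:

```python
# Reproducing the transformation columns of the worksheet above.
import math
from statistics import NormalDist

def logistic(p):
    """Proportion to logistic: log10 of the odds (assumed from the printout)."""
    return math.log10(p / (1 - p))

def probit(p):
    """Proportion to probit: inverse normal CDF, mean 5, sd 1 (as NORMINV(p,5,1))."""
    return NormalDist(mu=5, sigma=1).inv_cdf(p)

def arcsin_transform(percent):
    """% to arcsin: the angle (degrees) whose sine is sqrt(proportion)."""
    return math.degrees(math.asin(math.sqrt(percent / 100)))

print(round(logistic(0.005), 5))        # -2.29885 (row 3, column C)
print(round(probit(0.005), 3))          # 2.424    (row 3, column D)
print(round(arcsin_transform(0.5), 3))  # 4.055    (row 3, column F)
```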

As an example of the use of transformations, the data from a fictitious dose-response curve

(table below) are shown in two curves - first, without transformation and then after

transforming the proportion responding to logistic values.

Dose   Proportion responding   Logistic value
1      0.01                    -1.99564

2 0.015 -1.81734

3 0.02 -1.6902

4 0.04 -1.38021

5 0.045 -1.32679

6 0.05 -1.27875

7 0.07 -1.12338

8 0.1 -0.95424

9 0.19 -0.62973

10 0.25 -0.47712

11 0.34 -0.28807

12 0.44 -0.10474

13 0.53 0.052178

14 0.62 0.212608

15 0.68 0.327359

16 0.74 0.454258

17 0.79 0.575408

18 0.83 0.688629

19 0.85 0.753328

20 0.88 0.865301

21 0.9 0.954243

22 0.92 1.060698

23 0.935 1.157898

24 0.95 1.278754


25 0.96 1.380211

26 0.96 1.380211

27 0.97 1.50965

28 0.975 1.591065

CONTENTS

THE SCIENTIFIC METHOD

Experimental design

Designing experiments with statistics in mind

Common statistical terms

Descriptive statistics: standard deviation, standard error, confidence intervals of mean.

Student's t-test for comparing the means of two samples

Paired-samples test. (like a t-test, but used when data can be paired)

Analysis of variance for comparing means of three or more samples:

Calculating the Least Significant Difference between means

Using a Multiple Range Test for comparing means

For factorial combinations of treatments (Two-way ANOVA)

Poisson distribution for count data

Correlation coefficient and regression analysis for line fitting:

linear regression

logarithmic and sigmoid curves

STATISTICAL TABLES:

t (Student's t-test)

F, p = 0.05 (Analysis of Variance)

F, p = 0.01 (Analysis of Variance)

F, p = 0.001 (Analysis of Variance)

χ² (chi-squared)

r (correlation coefficient)

Q (Multiple Range test)

Fmax (test for homogeneity of variance)

This site is no longer maintained; it has been left online for archival purposes.
