
# SOR1002 Statistical Methods

7. Simple Significance Tests

Learning Outcomes
- Tests based on the Normal distribution: single mean; comparison of two means; single proportion or percentage; comparison of two proportions; correlation coefficient.
- Tests based on the t-distribution: single mean; paired comparison; comparison of two means; linear regression coefficient.
- Tests based on the F-distribution: comparison of two variances; comparison of t means (analysis of variance).
- Tests based on the χ² distribution: single variance; goodness-of-fit for classified data.

7.1 Tests Based on the Normal Distribution

7.1.1 Test of Single Mean

7.1.1.1 Suppose we have a random sample of size n from the normal distribution N(μ, σ²), where μ is unknown but the variance σ² is known. We wish to test H0: μ = μ0 (specified) against H1: μ ≠ μ0 (or H1: μ > μ0, or H1: μ < μ0) at a given significance level α. The test statistic is

Z = (X̄ − μ0)/√(σ²/n) ~ N(0,1).

The critical region lies in both tails of the N(0,1) distribution (or in the right-hand tail or left-hand tail respectively).


A 100(1−α)% confidence interval for μ is x̄ ± z* √(σ²/n), where P(Z > z*) = α/2. (See examples 2 and 3 of chapter 6.)
7.1.1.2 If the variance σ² is unknown, then provided the sample size n is reasonably large (say n ≥ 30), the above procedure can be used with σ² replaced by the sample variance s² = (1/(n − 1)) Σᵢ (xᵢ − x̄)². The results will be approximate. The quantity √(s²/n) is called the standard error (of the mean) and is used in confidence interval calculations. (If n is small, use the t-distribution; see 7.2.1 later.)

Suppose the distribution of the population from which the sample is drawn is not known to be normal. When n is large, X̄ ~ approximately N(μ, σ²/n) by the Central Limit Theorem (SOR101 Chapter 5), and hence Z = (X̄ − μ)/√(σ²/n) ~ approximately N(0,1). We can then use the methods of 7.1.1.1 and 7.1.1.2 but treat the results as approximate.
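The single-mean z-test above can be sketched in a few lines of standard-library Python. This is an illustration, not part of the original notes, and the numbers in the usage note are made up:

```python
import math

def z_test_single_mean(xbar, mu0, var, n):
    """Two-sided z-test of H0: mu = mu0 when the variance is known
    (or when n >= 30 and var is the sample variance).
    Returns (z, two-sided p-value)."""
    z = (xbar - mu0) / math.sqrt(var / n)
    # Phi(|z|) via the error function; p-value is the two-tailed area
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - phi)
```

For example, with the hypothetical values x̄ = 52.1, μ0 = 50, σ² = 16, n = 36, the function gives z = 3.15, significant at the 5% level since |z| > 1.96.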

7.1.2 Comparison of Two Means

7.1.2.1 Suppose we have random samples of sizes n1 and n2 from independent normal populations with unknown means μ1 and μ2 and known variances σ1² and σ2².

Sample mean random variable from population 1: X̄1 ~ N(μ1, σ1²/n1).
Sample mean random variable from population 2: X̄2 ~ N(μ2, σ2²/n2).

Consider H0: μ1 − μ2 = 0 against H1: μ1 − μ2 ≠ 0. Let D = X̄1 − X̄2. Then E(D) = μ1 − μ2 and

Var(D) = Var(X̄1 − X̄2) = Var(X̄1) + Var(X̄2) = σ1²/n1 + σ2²/n2.

Therefore

D ~ N(μ1 − μ2, σ1²/n1 + σ2²/n2),

and under H0 the test statistic Z = D/√(σ1²/n1 + σ2²/n2) ~ N(0,1).


A 100(1−α)% c.i. for the unknown difference μ1 − μ2 is

(x̄1 − x̄2) ± z* √(σ1²/n1 + σ2²/n2), where P(Z > z*) = α/2.

7.1.2.2 If σ1² and σ2² are unknown and n1 ≥ 30, n2 ≥ 30, replace σ1² and σ2² by the sample variances and proceed as above.

7.1.2.3 When the population distributions are not known to be normal, the above procedure can be used for large samples (n1 ≥ 30, n2 ≥ 30), with Z ~ approximately N(0,1).


Example 1. Suppose we wish to determine if there is a difference in mean weight between the two sexes in a particular bird species. The following data were obtained:

Male: sample size n1 = 125, mean weight x̄1 = 92.31 g, s1² = 56.22 g².
Female: sample size n2 = 85, mean weight x̄2 = 88.84 g, s2² = 65.41 g².

Test H0: μ1 = μ2 (i.e. μ1 − μ2 = 0) against H1: μ1 ≠ μ2 (i.e. μ1 − μ2 ≠ 0) at a 5% significance level. If significant, give a 95% c.i. for μ1 − μ2.

Under H0, the test statistic is

Z = (X̄1 − X̄2 − 0)/√(s1²/n1 + s2²/n2) ~ approx. N(0,1).

Test statistic value: z = (92.31 − 88.84)/√(56.22/125 + 65.41/85) = 3.14.


The critical region is |z| > 1.96 (2.5% in each tail of N(0,1)).

The test is highly significant at the 5% level since P(|Z| > 3.14) < 0.01. Hence we are confident that the mean weights are different, in particular with male mean weight greater than female mean weight. How different? From

P(−1.96 ≤ (X̄1 − X̄2 − (μ1 − μ2))/√(s1²/n1 + s2²/n2) ≤ 1.96) ≈ 0.95,

approximate 95% confidence limits for μ1 − μ2 are

(x̄1 − x̄2) ± 1.96 √(s1²/n1 + s2²/n2).

That is, the approximate 95% confidence interval is [1.31, 5.63].

Note: since μ1 − μ2 = 0 does not lie in this interval, the test is significant at the 5% level.


7.1.3 Tests of Proportion(s)

7.1.3.1 Single Proportion p

Suppose we have a random sample of n units from a large population, a proportion p (unknown) of which possess a certain attribute. Let x units in the sample possess the attribute. Then the sample proportion p̂ = x/n estimates p.

A possible probability model for this situation is the binomial distribution. Let the random variable X be the number of units with the attribute in the sample of size n. Then

P(X = x) = C(n, x) pˣ (1 − p)ⁿ⁻ˣ, x = 0, 1, …, n.

When the sample size n is large and both np and n(1 − p) exceed 5, X ~ approx. N(np, np(1 − p)) in the sense that P(a ≤ X ≤ b) ≈ P(a ≤ Y ≤ b), where Y ~ N(np, np(1 − p)).

The sample proportion random variable is X/n, with

E(X/n) = (1/n) E(X) = np/n = p,
Var(X/n) = (1/n²) Var(X) = np(1 − p)/n² = p(1 − p)/n.

Therefore X/n ~ approx. N(p, p(1 − p)/n) and

Z = (X/n − p)/√(p(1 − p)/n) ~ approx. N(0,1).

Hence the test statistic Z can be used to test H0: p = p0 (specified) against H1: p ≠ p0 (or p > p0 or p < p0), as in 7.1.1.1.

Since

P(−1.96 ≤ (X/n − p)/√(p(1 − p)/n) ≤ 1.96) ≈ 0.95,

an approximate 95% c.i. for the unknown proportion p is obtained by replacing p(1 − p)/n by p̂(1 − p̂)/n, where p̂ = x/n. That is,

p̂ ± 1.96 √(p̂(1 − p̂)/n).

The above procedure is also applicable when p is a probability, for example p = P(head when a coin is tossed).


Example 2. In a random sample of 120 graduates, 78 spent 3 years at university and 42 more than 3 years. Test the hypothesis that 70% of graduates obtain degrees in 3 years.

Let p = P(graduate in 3 years) (unknown). H0: p = 0.7 against H1: p ≠ 0.7.

Sample proportion: p̂ = 78/120 = 0.65. Test statistic value:

z = (0.65 − 0.7)/√(0.7 × 0.3/120) = −1.2.

The critical region is |z| > 1.96 (2.5% in each tail). The test is not significant at the 5% level: we have insufficient evidence for rejecting H0.
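Example 2 can be reproduced with a small stdlib-only Python sketch (an illustration, not part of the notes):

```python
import math

# z-test of H0: p = 0.7 (Example 2: 78 of 120 graduates finish in 3 years)
n, x, p0 = 120, 78, 0.7
p_hat = x / n
# Under H0 the standard error uses p0, not p_hat
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)     # about -1.2

# Approximate 95% c.i. for p replaces p0 by p_hat in the standard error
half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
ci = (p_hat - half, p_hat + half)
```

The interval contains 0.7, consistent with the non-significant test.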


7.1.3.2 Comparison of Two Proportions (from large samples)

Suppose we have large samples of sizes n1 and n2 from two populations in which proportions p1 and p2 respectively have an attribute. We wish to test H0: p1 = p2 against H1: p1 ≠ p2 (or a one-sided alternative).

Under H0, denote the common proportion by p, that is, H0: p1 = p2 = p say, where p is unknown.

Sample proportion random variable X1/n1 ~ approx. N(p1, p1(1 − p1)/n1).
Sample proportion random variable X2/n2 ~ approx. N(p2, p2(1 − p2)/n2).

Therefore

X1/n1 − X2/n2 ~ approx. N(p1 − p2, p1(1 − p1)/n1 + p2(1 − p2)/n2),

and under H0,

X1/n1 − X2/n2 ~ approx. N(0, p(1 − p)(1/n1 + 1/n2)).

The above variance needs to be estimated by estimating p. Under H0, p is estimated from the combined samples, that is, p̂ = (x1 + x2)/(n1 + n2).


Under H0, the test statistic

Z = (X1/n1 − X2/n2)/√(p̂(1 − p̂)(1/n1 + 1/n2)) ~ approx. N(0,1).

A 100(1−α)% c.i. for the unknown difference p1 − p2 is obtained from

P(−z* ≤ (X1/n1 − X2/n2 − (p1 − p2))/√(p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2) ≤ z*) ≈ 1 − α,

where p̂1 = x1/n1, p̂2 = x2/n2 and P(Z > z*) = α/2.


Example 3. We wish to compare the germination rates of spinach seeds for two different methods of preparation.

Method A: 80 seeds sown, 65 germinate.
Method B: 90 seeds sown, 80 germinate.

Let the proportions germinating be p1 and p2. Test H0: p1 = p2 = p (unknown) against H1: p1 ≠ p2.

Estimate p by p̂ = (65 + 80)/(80 + 90) = 0.853. Also p̂1 = 65/80 = 0.8125 and p̂2 = 80/90 = 0.889.

Under H0,

z = (0.8125 − 0.889)/√((0.853)(0.147)(1/80 + 1/90)) = −1.4.

The critical region is |z| > 1.96 (2.5% in each tail). The test is not significant at the 5% level: we have no evidence for supposing the germination rates to be different.

(Hypotheses involving proportions can also be tested using the χ² distribution. See chapters 7 and 8.)
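The pooled two-proportion calculation of Example 3 can be checked with stdlib-only Python (a sketch, not part of the notes):

```python
import math

# Comparison of two germination proportions (Example 3)
n1, x1 = 80, 65     # Method A
n2, x2 = 90, 80     # Method B
p1_hat, p2_hat = x1 / n1, x2 / n2
p_pool = (x1 + x2) / (n1 + n2)          # pooled estimate of p under H0

se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1_hat - p2_hat) / se              # about -1.4
```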

7.1.4 Test of Correlation Coefficient

Suppose we have a random sample (x1, y1), …, (xn, yn) with sample correlation coefficient r from a bivariate probability distribution with unknown correlation coefficient ρ. The distribution of the sample correlation coefficient random variable R is very complicated. However, the transformed random variable

Z = ½ ln((1 + R)/(1 − R))

is approximately normally distributed with mean ζ = ½ ln((1 + ρ)/(1 − ρ)) and variance 1/(n − 3). (This is called the Fisher Z transformation.) Confidence limits for ρ are obtained by first calculating confidence limits for ζ and then transforming back. NCEST Tables 16 and 17 are useful.


Example 4. A random sample of 39 pairs of observations has sample correlation coefficient r = 0.73. Test the hypothesis that the population correlation coefficient is 0.9. Give a 95% c.i. for ρ.

H0: ρ = 0.9 against H1: ρ ≠ 0.9.

Z = ½ ln((1 + R)/(1 − R)) ~ approx. N(ζ, 1/36), where ζ = ½ ln((1 + ρ)/(1 − ρ)).

Therefore Y = (Z − ζ)/√(1/36) ~ approx. N(0,1).

Under H0, ζ = ½ ln(1.9/0.1) = 1.4722. The observed value of Z is z = ½ ln(1.73/0.27) = 0.9287. Therefore the test statistic value is

y = (0.9287 − 1.4722)/(1/6) = −3.26,

which lies well inside the critical region |y| > 1.96; the p-value is about 0.1%. The test is very highly significant at the 5% level: we confidently conclude that ρ ≠ 0.9.

From

P(−1.96 ≤ 6(Z − ζ) ≤ 1.96) ≈ 0.95,

that is,

P(Z − 1.96/6 ≤ ζ ≤ Z + 1.96/6) ≈ 0.95,

a 95% c.i. for ζ = ½ ln((1 + ρ)/(1 − ρ)) is

(0.9287 − 1.96/6, 0.9287 + 1.96/6) = (0.60, 1.26).

From Table 17, z = 0.60 gives r = 0.54 and z = 1.26 gives r = 0.85. Therefore a 95% c.i. for ρ is (0.54, 0.85).

Note: ρ = 0.9 does not lie in this interval, therefore the test is significant at the 5% level.

What happens to z = ½ ln((1 + r)/(1 − r)) when r is negative? Use the result

½ ln((1 + r)/(1 − r)) = −½ ln((1 − r)/(1 + r)),

then Table 16.
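Instead of Tables 16 and 17, the Fisher transformation and its inverse (tanh) can be computed directly; the following stdlib-only Python sketch (not part of the notes) reproduces Example 4:

```python
import math

# Fisher Z test of H0: rho = 0.9 (Example 4: n = 39, r = 0.73)
n, r, rho0 = 39, 0.73, 0.9
z_obs = 0.5 * math.log((1 + r) / (1 - r))          # about 0.9287
zeta0 = 0.5 * math.log((1 + rho0) / (1 - rho0))    # about 1.4722
y = (z_obs - zeta0) * math.sqrt(n - 3)             # about -3.26

# 95% c.i. for zeta, transformed back to rho via tanh
# (tanh is the exact inverse of the Fisher transformation)
half = 1.96 / math.sqrt(n - 3)
ci_rho = (math.tanh(z_obs - half), math.tanh(z_obs + half))  # about (0.54, 0.85)
```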


7.2 Tests Based on the (Student's) t-distribution

Many of the tests described in previous sections require large samples or precise information about variances (which is often lacking in practice). Small samples are important since in some practical situations the number of observations which can be made may be limited by: the experimental technique, the amount of experimental material, the cost of making an observation, the particular environmental conditions, etc. For small samples there are useful tests based on the t-distribution.

The probability density function of the t-distribution with ν degrees of freedom is

f(t) = (Γ((ν + 1)/2)/(√(νπ) Γ(ν/2))) (1 + t²/ν)^(−(ν+1)/2), −∞ < t < ∞,

where ν is a positive integer.

Note: (i) The p.d.f. is symmetrical about t = 0. As ν → ∞, the t-distribution tends to N(0,1). For practical purposes, when ν ≥ 30, the t-distribution is approximately the same as N(0,1).

(ii) If X1, …, Xn are independent and each ~ N(μ, σ²), then

T = (X̄ − μ)/√(S²/n) ~ t_{n−1},

i.e. the t-distribution with (n − 1) degrees of freedom, where

X̄ = (1/n) Σᵢ Xᵢ and S² = (1/(n − 1)) Σᵢ (Xᵢ − X̄)².

Therefore, given a random sample of size n from N(μ, σ²), t = (x̄ − μ)/√(s²/n) is an observation from t_{n−1}.

(iii) NCEST Table 9 tabulates the cumulative distribution function P(T ≤ t) = F_ν(t): given ν and t (t ≥ 0), read F_ν(t). NCEST Table 10 tabulates percentage points: given P% and ν, read t_ν(P), where P(T > t_ν(P)) = P%. For example, for ν = 30:

| P | 5% | 2.5% | 0.5% | 0.05% |
|---|---|---|---|---|
| t_30(P) | 1.697 | 2.042 | 2.750 | 3.646 |
| N(0,1) | 1.645 | 1.960 | 2.576 | 3.291 |


7.2.1 Single Mean

Suppose we have a random sample of size n (small) from N(μ, σ²), where μ and σ² are both unknown. We wish to test H0: μ = μ0 (specified) against H1: μ ≠ μ0 (or a one-sided alternative). Under H0,

t = (x̄ − μ0)/√(s²/n)

is an observation from t_{n−1}. The critical region will lie in the tails of the t_{n−1} distribution.

A 100(1−α)% confidence interval for μ is obtained from T = (X̄ − μ)/√(S²/n) ~ t_{n−1}:

x̄ ± t* √(s²/n), where P(T > t*) = α/2.


Example 5. The temperature of warm water springs in a basin is reported to have a mean of 38°C. A sample of 12 springs from the west end of the basin had mean temperature 39.4 and variance 1.92. Have springs at the west end a different mean temperature? Give a 95% c.i. for the mean temperature.

Denote the west end spring temperature by X, where X has mean μ and variance σ². We estimate σ² by s² = 1.92 with 11 degrees of freedom. Test H0: μ = 38 against H1: μ ≠ 38. Under H0,

t = (39.4 − 38)/√(1.92/12) = 3.5

is an observation from t_11.

The upper 2.5% point of t_11 is 2.201, so the critical region is |t| > 2.201. The test is significant at the 5% level and we conclude that west springs do have a different mean temperature. Since the upper 0.5% point of t_11 is 3.106, the test is highly significant at the 5% level. (Alternatively, p-value = P(|T| > 3.5) = 2P(T > 3.5) = 2(1 − P(T < 3.5)) = 2(1 − 0.9975) = 0.005, or 0.5%, from Table 9.)

A 95% c.i. for μ is obtained from

P(−2.201 ≤ (X̄ − μ)/√(1.92/12) ≤ 2.201) = 0.95,

giving limits 39.4 ± 2.201 × 0.4, that is, [38.52, 40.28].
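Example 5 can be reproduced in stdlib-only Python (a sketch, not part of the notes; the critical value 2.201 is the tabulated upper 2.5% point of t_11):

```python
import math

# One-sample t-test of H0: mu = 38 (Example 5: spring temperatures)
n, xbar, s2, mu0 = 12, 39.4, 1.92, 38.0
se = math.sqrt(s2 / n)                  # standard error, 0.4
t = (xbar - mu0) / se                   # 3.5

t_crit = 2.201                          # upper 2.5% point of t_11, from tables
significant = abs(t) > t_crit           # True

ci = (xbar - t_crit * se, xbar + t_crit * se)   # about (38.52, 40.28)
```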


7.2.2 Paired Comparison Test

In this case we are interested in the difference between two methods or treatments where the observations occur naturally in pairs and taking the difference of the paired observations is valid. It is not possible to pair arbitrarily.

Example 6. Consider an experiment to compare the effects of two sleeping drugs A and B. There are 10 subjects and each subject receives treatment with each of the two drugs (the order of treatment being randomised). The number of hours slept by each subject is recorded. Is there any difference between the effects of the two drugs?

| Subject | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Hours slept using A | 9.9 | 8.8 | 9.1 | 8.1 | 7.9 | 12.4 | 13.5 | 9.6 | 12.6 | 11.4 |
| Hours slept using B | 8.7 | 6.4 | 7.8 | 6.8 | 7.9 | 11.4 | 11.7 | 8.8 | 8.0 | 10.0 |
| Difference (A − B), x | 1.2 | 2.4 | 1.3 | 1.3 | 0.0 | 1.0 | 1.8 | 0.8 | 4.6 | 1.4 |

The paired sample data have been reduced to a single sample of differences. This will tend to cancel out any subject effect, assuming that the effect of the drug is additive. Assume the x values to be normally distributed with mean μ. Then

Σx = 15.8, x̄ = 1.58, Σx² = 38.58, s² = 1.513.

Test H0: μ = 0 against H1: μ ≠ 0.


Under H0,

t = (1.58 − 0)/√(1.513/10) = 4.06

is an observation from the t-distribution with 9 degrees of freedom. From Table 10, P(|T| > 2.262) = 0.05, so the critical region is |t| > 2.262. The test is significant at the 5% level.

The upper 0.5% point of t_9 is 3.250, so the test is also significant at the 1% level. We are thus confident that there is a difference between the drugs, in particular that drug A induces more sleep than drug B on average. (Or p-value = P(|T| > 4.06) = 2(1 − P(T ≤ 4.06)) = 2(1 − 0.9986) ≈ 0.3%, from Table 9.)

A 95% confidence interval for the unknown mean difference μ is

(x̄ − 2.262 √(1.513/10), x̄ + 2.262 √(1.513/10)),

that is, [0.70, 2.46].
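The paired analysis of Example 6 reduces to a one-sample t-test on the differences; a stdlib-only Python sketch (not part of the notes):

```python
import math

# Paired-comparison t-test for the sleeping-drug data (Example 6)
a = [9.9, 8.8, 9.1, 8.1, 7.9, 12.4, 13.5, 9.6, 12.6, 11.4]
b = [8.7, 6.4, 7.8, 6.8, 7.9, 11.4, 11.7, 8.8, 8.0, 10.0]
d = [ai - bi for ai, bi in zip(a, b)]   # within-subject differences

n = len(d)
dbar = sum(d) / n                                  # 1.58
s2 = sum((x - dbar) ** 2 for x in d) / (n - 1)     # about 1.513
t = dbar / math.sqrt(s2 / n)                       # about 4.06, on 9 df
```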


7.2.3 Comparison of Two Means (small samples)

Suppose we have a random sample of size n1 (small, < 30) with sample mean x̄1 and sample variance s1² from a normal or approximately normal distribution with unknown mean μ1 and unknown variance σ², and a random sample of size n2 (small) with x̄2 and s2² from a distribution with unknown mean μ2 and the same variance σ². Note: the unknown population variances are assumed equal.

We wish to test H0: μ1 − μ2 = δ0 against H1: μ1 − μ2 ≠ δ0. First, estimate σ² by the pooled variance

s² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2).

Under H0,

t = (x̄1 − x̄2 − δ0)/√(s²(1/n1 + 1/n2))

is an observation from t_{n1+n2−2}. The critical region is in the tails of the distribution. From

P(−t* ≤ (X̄1 − X̄2 − (μ1 − μ2))/√(S²(1/n1 + 1/n2)) ≤ t*) = 1 − α, where P(T > t*) = α/2,

a 100(1−α)% c.i. for the unknown difference (μ1 − μ2) is

(x̄1 − x̄2) ± t* √(s²(1/n1 + 1/n2)).

Example 7. Two methods of oxidation are used in an industrial process. Repeated measurements of the oxidation time are made to test the hypothesis that the oxidation time of method 2 is longer than that of method 1 on average.

| | Sample size | Sample mean | Sample variance |
|---|---|---|---|
| Method 1 | 9 | 41.3 | 20.7 |
| Method 2 | 8 | 48.9 | 34.2 |

We wish to test H0: μ1 = μ2 (that is, μ1 − μ2 = 0) against H1: μ1 < μ2 (that is, μ1 − μ2 < 0).

We shall assume that the unknown population variances are equal. (This can be tested using an F-test; see 7.3.1.)

s² = (8(20.7) + 7(34.2))/(8 + 7) = 27.

Under H0,

t = (41.3 − 48.9)/√(27(1/9 + 1/8)) = −3.01

with 15 degrees of freedom. The critical region is the lower 5% tail of t_15, that is, t < −1.753.

Since the p-value = P(T < −3.01) < 0.005, the test is highly significant at the 5% level. We are confident that the oxidation time for method 2 is longer.

A one-sided 95% confidence bound for (μ1 − μ2) is obtained from

P((X̄1 − X̄2 − (μ1 − μ2))/√(S²(1/9 + 1/8)) ≥ −1.753) = 95%,

that is,

P(μ1 − μ2 ≤ X̄1 − X̄2 + 1.753 √(S²(1/9 + 1/8))) = 95%.

That is, (μ1 − μ2) < −3.2, i.e. (μ2 − μ1) > 3.2.
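The pooled-variance calculation of Example 7 can be checked in stdlib-only Python (a sketch, not part of the notes):

```python
import math

# Pooled two-sample t-test for the oxidation times (Example 7)
n1, xbar1, s1sq = 9, 41.3, 20.7
n2, xbar2, s2sq = 8, 48.9, 34.2

# Pooled estimate of the common variance
s2 = ((n1 - 1) * s1sq + (n2 - 1) * s2sq) / (n1 + n2 - 2)   # 27
df = n1 + n2 - 2                                            # 15

t = (xbar1 - xbar2) / math.sqrt(s2 * (1 / n1 + 1 / n2))    # about -3.01
```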


7.2.4 Test of Slope and Intercept in Linear Regression Model

In the linear regression model, suppose that the responses are normally distributed. That is,

Yᵢ ~ N(α + βxᵢ, σ²), i = 1, …, n.

The least squares estimator of β is

β̂ = Σᵢ (xᵢ − x̄)(Yᵢ − Ȳ) / Σᵢ (xᵢ − x̄)²,

and (Chapter 5) E(β̂) = β, Var(β̂) = σ²/Σᵢ (xᵢ − x̄)². Since β̂ is a linear combination of normal random variables, β̂ is also normally distributed. That is,

β̂ ~ N(β, σ²/Σᵢ (xᵢ − x̄)²).

Let S² be the corresponding estimator of σ²; then

T = (β̂ − β)/√(S²/Σᵢ (xᵢ − x̄)²) ~ t_{n−2},

which can be used to test hypotheses such as H0: β = β0 (specified) against H1: β ≠ β0, and also to set up c.i.s for the unknown slope β.
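A stdlib-only Python sketch of the slope test (not part of the notes; the (x, y) data below are invented purely to illustrate the formulas, and the residual mean square is used as the estimate of σ² on n − 2 degrees of freedom):

```python
import math

# t-test of H0: beta = 0 for a least-squares slope (illustrative data)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxx = sum((xi - xbar) ** 2 for xi in x)
beta_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
alpha_hat = ybar - beta_hat * xbar

# Residual mean square estimates sigma^2 with n - 2 degrees of freedom
s2 = sum((yi - alpha_hat - beta_hat * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
t = beta_hat / math.sqrt(s2 / sxx)   # compare with t_{n-2} percentage points
```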


7.3 Tests Based on the F-distribution

The probability density function of the F-distribution with (ν1, ν2) degrees of freedom is

f(w) = (1/B(ν1/2, ν2/2)) (ν1/ν2)^(ν1/2) w^(ν1/2 − 1) (1 + ν1w/ν2)^(−(ν1+ν2)/2), w > 0,

where B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function and ν1 and ν2 are positive integers.

Note: (i) For ν1, ν2 ≥ 3 the probability density function has a skewed (not symmetrical) shape.

(ii) Given two independent random samples of sizes n1 and n2 from N(μ1, σ1²) and N(μ2, σ2²) respectively, the ratio

(S1²/σ1²)/(S2²/σ2²) ~ F(n1 − 1, n2 − 1),

where S1² and S2² are the sample variance random variables.

(iii) NCEST Tables 12(a) to 12(f) tabulate percentage points for the right-hand tail only: given P% (where P = 10, 5, 2.5, 1, 0.5, 0.1) and (ν1, ν2), read w(P), where P(F > w(P)) = P%.

Suppose we wish to find the lower percentage point w_L(P) for the F(ν1, ν2) distribution. First find the upper percentage point w_U(P) for F(ν2, ν1), that is, with the degrees of freedom interchanged. Then

w_L(P) = 1/w_U(P).

Linear interpolation in ν1 or ν2 will be sufficient except when the degrees of freedom are large, in which case harmonic interpolation should be used. (See the example later.)

7.3.1 Comparison of Two Variances

Suppose we have two random samples of sizes n1 and n2 with sample variances s1² and s2² from two independent normal distributions with unknown variances σ1² and σ2². We wish to test

H0: σ1²/σ2² = 1, that is, σ1² = σ2², against H1: σ1² ≠ σ2².

Under H0, the test statistic

F = S1²/S2² ~ F(n1 − 1, n2 − 1).

The critical region lies in both tails of the F-distribution. (For H1: σ1² > σ2² use the right-hand tail only; for H1: σ1² < σ2² use the left-hand tail only.)

A 100(1−α)% confidence interval for the variance ratio σ1²/σ2² is obtained from

P(w_L ≤ (S1²/S2²)(σ2²/σ1²) ≤ w_U) = 1 − α,

that is,

P(S1²/(S2² w_U) ≤ σ1²/σ2² ≤ S1²/(S2² w_L)) = 1 − α.

That is, a 100(1−α)% confidence interval for σ1²/σ2² is

[s1²/(s2² w_U), s1²/(s2² w_L)].
Example 8. We wish to compare the precisions of two technicians in titrations of CaCO3 content of raw meal. The following results were obtained:

1st Technician: n1 = 31, s1² = 0.0388.
2nd Technician: n2 = 25, s2² = 0.0177.

We wish to test H0: σ1² = σ2² against H1: σ1² ≠ σ2². Under H0, the test statistic value is

F = s1²/s2² = 0.0388/0.0177 = 2.19,

an observation from F(30, 24).

The critical region consists of the two 2.5% tails of F(30, 24), below w_L and above w_U.

In Table 12(c) there is no tabulated value for F(30, 24). We use harmonic interpolation in ν1, that is, linear interpolation in 1/ν1 (or a multiple of 1/ν1):

| (ν1, ν2) | Upper 2.5% point | 120/ν1 |
|---|---|---|
| (24, 24) | 2.269 | 5 |
| (30, 24) | w_U | 4 |
| (∞, 24) | 1.935 | 0 |

w_U = 1.935 + (4/5)(2.269 − 1.935) = 2.202.

For w_L, we first find the upper 2.5% point of F(24, 30), that is, 2.136. Then w_L = 1/2.136 = 0.468.

The observed value of the test statistic, 2.19, does not fall in the critical region. Not significant at the 5% level: we have no convincing evidence that the precisions are different.
Alternatively, under H0 the test statistic F = s2²/s1² is an observation from F(24, 30), with critical region below 1/2.202 = 0.454 and above 2.136; again not significant at the 5% level.

In practice we calculate

F = (larger sample variance)/(smaller sample variance),

which will fall in the right-hand tail of the F-distribution. It is still a two-sided test, but we do not have to calculate the lower percentage point.

From

P(0.468 ≤ (S1²/S2²)(σ2²/σ1²) ≤ 2.202) = 0.95,

that is,

P(S1²/(S2² × 2.202) ≤ σ1²/σ2² ≤ S1²/(S2² × 0.468)) = 0.95,

a 95% c.i. for σ1²/σ2² is [0.996, 4.682].
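The variance-ratio calculation of Example 8 can be checked in a short Python sketch (not part of the notes; the critical values 2.202 and 2.136 are the tabulated or interpolated percentage points quoted above):

```python
# Variance-ratio (F) test for the two technicians (Example 8)
n1, s1sq = 31, 0.0388
n2, s2sq = 25, 0.0177

f = s1sq / s2sq                     # about 2.19, on (30, 24) df

# Two-sided 5% critical values from the tables:
# upper point 2.202 (interpolated), lower point 1/2.136 = 0.468
w_upper, w_lower = 2.202, 1 / 2.136
significant = f > w_upper or f < w_lower        # False here

# 95% c.i. for the variance ratio sigma1^2/sigma2^2
ci = (f / w_upper, f / w_lower)                 # about (0.996, 4.68)
```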


7.3.2 Comparison of t Means

Suppose we have t samples of sizes n1, n2, …, n_t from independent normal distributions N(μ1, σ²), N(μ2, σ²), …, N(μ_t, σ²) respectively, where μ1, μ2, …, μ_t and σ² are unknown.

| Sample | Data | Total | Mean | Variance |
|---|---|---|---|---|
| 1 | y11, y12, …, y1n₁ | T_1 = Σ_j y_1j | ȳ_1 = T_1/n_1 | s_1² |
| 2 | y21, y22, …, y2n₂ | T_2 = Σ_j y_2j | ȳ_2 = T_2/n_2 | s_2² |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| t | yt1, yt2, …, ytnₜ | T_t = Σ_j y_tj | ȳ_t = T_t/n_t | s_t² |

Here

n = Σ_i n_i, G = Σ_i T_i = Σ_i Σ_j y_ij, ȳ = G/n (overall mean),

and

s_i² = (1/(n_i − 1)) Σ_j (y_ij − ȳ_i)² (ith sample variance).


An estimate of σ² is

s² = ((n_1 − 1)s_1² + (n_2 − 1)s_2² + … + (n_t − 1)s_t²)/((n_1 − 1) + (n_2 − 1) + … + (n_t − 1))  (see Chapter 4)
   = (1/(n − t)) Σ_i (n_i − 1)s_i²
   = (1/(n − t)) Σ_i Σ_j (y_ij − ȳ_i)²
   = SS_E/(n − t),

where SS_E is the within-samples sum of squares (also called the error sum of squares or residual sum of squares), with (n − t) degrees of freedom.

Now consider

SS_T = Σ_i Σ_j (ȳ_i − ȳ)² = Σ_i n_i (ȳ_i − ȳ)².

SS_T is the sum of squares of deviations of the sample means from the overall mean and is referred to as the between-samples sum of squares or treatment sum of squares, with (t − 1) degrees of freedom.

We wish to test H0: μ1 = μ2 = … = μ_t against H1: not all the means are equal. When H0 holds, the test statistic

F = (SS_T/(t − 1))/(SS_E/(n − t))

is an observation from the F(t − 1, n − t) distribution.

When H0 does not hold, some of the (ȳ_i − ȳ)² terms will tend to be larger than expected, resulting in a larger value of SS_T than expected, leading to a larger value of the test statistic. So we set up the critical region in the right-hand tail only of the F-distribution. If the test statistic falls in this region, the null hypothesis of equality of means is rejected.

In practice we do not compute SS_E and SS_T in the form given above. First compute the total corrected sum of squares.
SS_TC = Σ_i Σ_j (y_ij − ȳ)²
      = Σ_i Σ_j y_ij² − 2ȳ Σ_i Σ_j y_ij + nȳ²
      = Σ_i Σ_j y_ij² − 2nȳ² + nȳ²
      = Σ_i Σ_j y_ij² − nȳ²,   (*)

so that

SS_TC = Σ_i Σ_j y_ij² − G²/n.


Also,

SS_TC = Σ_i Σ_j ((y_ij − ȳ_i) + (ȳ_i − ȳ))²
      = Σ_i Σ_j (y_ij − ȳ_i)² + 2 Σ_i Σ_j (y_ij − ȳ_i)(ȳ_i − ȳ) + Σ_i Σ_j (ȳ_i − ȳ)².

The 1st term is SS_E and the 3rd term is SS_T. The 2nd term is

2 Σ_i (ȳ_i − ȳ) Σ_j (y_ij − ȳ_i) = 2 Σ_i (ȳ_i − ȳ)(T_i − n_iȳ_i) = 0,

since ȳ_i = (1/n_i) Σ_j y_ij. Therefore

SS_TC = SS_E + SS_T.

We usually calculate SS_TC using (*) and SS_T, and then obtain SS_E by subtraction, that is, SS_E = SS_TC − SS_T.


SS_T is calculated as follows:

SS_T = Σ_i n_i (ȳ_i − ȳ)²
     = Σ_i n_iȳ_i² − 2ȳ Σ_i n_iȳ_i + ȳ² Σ_i n_i
     = Σ_i n_iȳ_i² − 2nȳ² + nȳ²
     = Σ_i n_iȳ_i² − nȳ²,

so that

SS_T = Σ_i T_i²/n_i − G²/n.   (**)

We set out the calculations in an analysis of variance table:

| Source of variation | df | SS | MS (mean square) | F ratio |
|---|---|---|---|---|
| Between samples | t − 1 | SS_T | SS_T/(t − 1) | (SS_T/(t − 1))/(SS_E/(n − t)) |
| Within samples (residual) | n − t | SS_E | SS_E/(n − t) (= s²) | |
| Total corrected | n − 1 | SS_TC | | |

We can also set up confidence intervals for an unknown mean μ_i or a difference (μ_i − μ_j) using the t-distribution and s² with (n − t) degrees of freedom, and we can test pairs of means for equality using a t-test.


Example 9. We wish to test if there is any difference in the average yield of a particular crop when treated with four different fertilisers: 1. Straw; 2. Straw + Nitrate; 3. Straw + Phosphate; 4. Straw + Nitrate + Phosphate. In particular we are interested in any difference between fertilisers 3 and 4. A properly designed experiment was carried out with the following results:

| Fertiliser | Yield y_ij | n_i | Total T_i | Mean ȳ_i | T_i²/n_i | Σ_j y_ij² |
|---|---|---|---|---|---|---|
| 1. S | 38.3, 38.5, 38.7, 41.2 | 4 | 156.7 | 39.18 | 6138.7 | 6144.3 |
| 2. S+N | 38.8, 43.4, 38.9, 39.1 | 4 | 160.2 | 40.05 | 6416.0 | 6431.0 |
| 3. S+P | 40.3, 42.6, 41.1, 40.6 | 4 | 164.6 | 41.15 | 6773.3 | 6776.4 |
| 4. S+N+P | 62.7, 61.0, 54.8, 51.7 | 4 | 230.2 | 57.55 | 13248.0 | 13328.2 |
| Total | | 16 | 711.7 | | 32576.0 | 32679.9 |

G²/n = 711.7²/16 = 31657.3
SS_TC = 32679.9 − 31657.3 = 1022.6
SS_T = 32576.0 − 31657.3 = 918.7
SS_E = 1022.6 − 918.7 = 103.9

| Source | df | SS | MS | F ratio |
|---|---|---|---|---|
| Between fertilisers | 3 | 918.7 | 306.2 | 35.4 |
| Within samples | 12 | 103.9 | 8.66 | |
| Total corrected | 15 | 1022.6 | | |


We test H0: μ1 = μ2 = μ3 = μ4 against H1: not all the means are equal, at the 5% level.

Under H0 the F ratio 35.4 is an observation from F(3, 12). The value 35.4 falls in the critical region: if W ~ F(3, 12), then from Table 12(f), P(W > 35.4) < 0.001. Hence the test is very highly significant at the 5% level, i.e. we are very confident that the fertilisers give different mean yields.

A 95% confidence interval for μ_i is based on the percentage points of the t_12 distribution and s² = 8.66:

ȳ_i ± 2.179 √(s²/n_i).

| Fertiliser | Mean | Standard error | 95% confidence interval |
|---|---|---|---|
| 1 | 39.18 | 1.47 | (35.97, 42.39) |
| 2 | 40.05 | 1.47 | (36.84, 43.26) |
| 3 | 41.15 | 1.47 | (37.94, 44.36) |
| 4 | 57.55 | 1.47 | (54.34, 60.76) |

(The standard error is √(8.66/4) = 1.47.) Clearly fertiliser 4 is different from 1, 2 and 3. To investigate fertilisers 3 and 4, consider H0: μ3 = μ4 against H1: μ3 ≠ μ4.


Under H0,

t = (ȳ3 − ȳ4 − 0)/√(s²(1/4 + 1/4)) = (41.15 − 57.55)/√(8.66(1/4 + 1/4)) = −7.88

is an observation from t_12, with critical region |t| > 2.179 (2.5% in each tail). The test is very highly significant at the 5% level, that is, we are very confident that fertilisers 3 and 4 are different on average: from examination of the sample means, fertiliser 4 produces a higher yield on average than fertiliser 3.
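The sums-of-squares shortcut formulas (*) and (**) applied to Example 9 can be checked with a stdlib-only Python sketch (not part of the notes):

```python
# One-way analysis of variance for the fertiliser yields (Example 9)
samples = [
    [38.3, 38.5, 38.7, 41.2],   # 1. S
    [38.8, 43.4, 38.9, 39.1],   # 2. S+N
    [40.3, 42.6, 41.1, 40.6],   # 3. S+P
    [62.7, 61.0, 54.8, 51.7],   # 4. S+N+P
]
t = len(samples)
n = sum(len(s) for s in samples)
G = sum(sum(s) for s in samples)

ss_tc = sum(y * y for s in samples for y in s) - G * G / n      # total corrected SS
ss_t = sum(sum(s) ** 2 / len(s) for s in samples) - G * G / n   # between samples
ss_e = ss_tc - ss_t                                             # within samples

f_ratio = (ss_t / (t - 1)) / (ss_e / (n - t))                   # about 35.4
```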


7.4 The χ² Distribution

The probability density function of the χ² distribution with ν degrees of freedom is

f(v) = v^(ν/2 − 1) e^(−v/2) / (2^(ν/2) Γ(ν/2)), v > 0.

Note: (i) For ν ≥ 3 the p.d.f. has a skewed (not symmetrical) shape.

(ii) Let S² be the sample variance random variable of a random sample of size n from N(μ, σ²). Then

V = (n − 1)S²/σ² ~ χ²_{n−1}.

(iii) NCEST Table 8 tabulates χ² percentage points: given ν and P%, where P = 99.95, 99.9, …, 60 (page 40) or P = 50, …, 0.05 (page 41), read χ²_ν(P), where P(V > χ²_ν(P)) = P%.


7.4.1 Single Variance σ²

Suppose we have a random sample of size n from a normal distribution with unknown variance σ². We wish to test H0: σ² = σ0² (specified) against H1: σ² ≠ σ0² (or a one-sided alternative) at a given significance level α.

Under H0, the test statistic V = (n − 1)S²/σ0² ~ χ²_{n−1}. The critical region lies in both tails of the χ² distribution. (For H1: σ² > σ0², use the right-hand tail only. For H1: σ² < σ0², use the left-hand tail only.)

A 100(1−α)% confidence interval for σ² is obtained from

P(χ²_L ≤ (n − 1)S²/σ² ≤ χ²_U) = 1 − α,

that is,

P((n − 1)S²/χ²_U ≤ σ² ≤ (n − 1)S²/χ²_L) = 1 − α.

Therefore a 100(1−α)% confidence interval for σ² is

[(n − 1)s²/χ²_U, (n − 1)s²/χ²_L].


Example 10. The precision of a measuring process is stated to be σ² = 0.025. A random sample of 30 measurements has sample variance s² = 0.032. Is the above statement justified?

H0: σ² = 0.025 against H1: σ² ≠ 0.025.

Assuming the measurements are normally distributed, then under H0 the test statistic value

(29 × 0.032)/0.025 = 37.12

is an observation from χ²_29. The lower and upper 2.5% points of χ²_29 are 16.05 and 45.72, so the test is not significant at the 5% level. We have insufficient evidence for rejecting H0 in favour of H1, that is, no reason to doubt the statement.

(In this case we would not normally compute a c.i. for σ² since the test was not significant. However, a 95% c.i. could be computed as follows. From

P(16.05 ≤ 29S²/σ² ≤ 45.72) = 0.95,

that is,

P(29S²/45.72 ≤ σ² ≤ 29S²/16.05) = 0.95,

a 95% c.i. for σ² is

[29 × 0.032/45.72, 29 × 0.032/16.05] = [0.0203, 0.0578],

which contains σ² = 0.025.)
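Example 10 can be reproduced in a short Python sketch (not part of the notes; the χ²_29 percentage points 16.05 and 45.72 are taken from the tables as quoted above):

```python
# Chi-squared test of H0: sigma^2 = 0.025 (Example 10: n = 30, s^2 = 0.032)
n, s2, sigma0_sq = 30, 0.032, 0.025
v = (n - 1) * s2 / sigma0_sq            # 37.12, compared with chi^2_29

# Lower and upper 2.5% points of chi^2_29 from the tables
chi_lo, chi_hi = 16.05, 45.72
significant = v < chi_lo or v > chi_hi          # False here

# 95% c.i. for sigma^2
ci = ((n - 1) * s2 / chi_hi, (n - 1) * s2 / chi_lo)   # about (0.0203, 0.0578)
```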


7.4.2 Goodness-of-fit Test for Classified Data

Suppose that a sample of n observations is classified into k mutually exclusive and exhaustive classes, that is, each observation belongs to one and only one class. Let O_i be the observed frequency in the ith class, with Σᵢ O_i = n. Consider a null hypothesis H0 which specifies the probabilities of belonging to the k classes. Under H0, let E_i be the expected frequency in the ith class, with Σᵢ E_i = n. Under H0, the goodness-of-fit test statistic

χ² = Σᵢ (O_i − E_i)²/E_i

is approximately distributed χ²_ν, where ν = k − 1 − (number of independent parameters estimated from the data).

The critical region lies in the right-hand tail only of the χ² distribution, since if H0 is not true we would expect the E_i's to be quite different from the O_i's, resulting in a larger than expected value of χ². (A small χ² results when the E_i's and O_i's are in good agreement, which is certainly not a reason to reject H0.)

Note: (i) The exact distribution of χ² is discrete and is approximated by the continuous χ² distribution. For this approximation to be reasonable, E_i should be > 5 for each class. If not, combine adjacent classes, with a resultant loss of one or more degrees of freedom.


(ii) In tests with only 1 degree of freedom, a better approximation is obtained by including Yates' continuity correction:

χ² = Σᵢ (|O_i − E_i| − ½)²/E_i ~ approx. χ²_1.

Example 11. The geneticist Mendel evolved the theory that for a certain type of pea, the characteristics Round and Yellow, Round and Green, Angular and Yellow, Angular and Green occur in the ratio 9:3:3:1. He classified 556 seeds and the observed frequencies were 315, 108, 101 and 32. Test Mendel's theory on the basis of these data.

H0: p1 = 9/16, p2 = 3/16, p3 = 3/16, p4 = 1/16.
H1: probabilities not as in H0.

| Seed | O_i | p_i | E_i = 556p_i | (O_i − E_i)²/E_i |
|---|---|---|---|---|
| R+Y | 315 | 9/16 | 312.75 | 0.016 |
| R+G | 108 | 3/16 | 104.25 | 0.135 |
| A+Y | 101 | 3/16 | 104.25 | 0.101 |
| A+G | 32 | 1/16 | 34.75 | 0.218 |
| Total | 556 (= n) | 1 | 556 | 0.470 (= χ²) |

Under H0, χ² = 0.47 is an observation from χ²_3. The test is not significant at the 5% level, that is, there is no evidence for supporting the rejection of H0. (That is, the data are in agreement with the theory.)
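Example 11's table can be reproduced in a few lines of Python (a sketch, not part of the notes; 7.815 is the standard upper 5% point of χ²_3):

```python
# Goodness-of-fit test for Mendel's 9:3:3:1 ratio (Example 11)
observed = [315, 108, 101, 32]
ratios = [9, 3, 3, 1]
n = sum(observed)                                 # 556
expected = [n * r / 16 for r in ratios]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # about 0.47
df = len(observed) - 1                            # 3; no parameters estimated

# Upper 5% point of chi^2_3 is 7.815, so the fit is not rejected
reject = chi_sq > 7.815
```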

Example 12. In a random sample of 120 graduates, 78 spent 3 years at university and 42 more than 3 years. Test the hypothesis that 70% obtain a degree in 3 years. (See 7.1.3.1 for the test of a proportion using the normal distribution.)

H0: P(degree in 3 years) = p = 0.7 against H1: p ≠ 0.7.

| | O_i | E_i |
|---|---|---|
| Degree in 3 years | 78 | 84 |
| More than 3 years | 42 | 36 |
| Total | 120 | 120 |

Degrees of freedom = 2 − 1 = 1, therefore use Yates' correction:

χ² = (|78 − 84| − ½)²/84 + (|42 − 36| − ½)²/36 = 1.2.

The test is not significant at the 5% level: no evidence to support the rejection of H0. (An alternative method is the normal approximation of 7.1.3.1.)


Fitting and Testing the Goodness-of-fit of a Probability Distribution to Classified Data

This consists of the following steps:
- decide which distribution is applicable;
- find parameter values from H0 and/or by estimation;
- calculate the probability p_i for the ith class;
- compute the expected value in the ith class, E_i = np_i, where n is the number of observations in the sample;
- carry out the χ² goodness-of-fit test, amalgamating adjacent classes if necessary (i.e. to make all E_i's at least 5).

For discrete distributions the classes occur in a straightforward manner and the calculation of p_i is based on evaluating the probability function at specified values (Example 13: the yeast cell data). For continuous distributions, the classes used are only one of many possible divisions of the real line into classes; if (c_i, c_{i+1}] defines the ith class, then

p_i = ∫ from c_i to c_{i+1} of f(x | θ1, …, θk) dx.

Different divisions may lead to different values of χ². For this reason we often use other methods for testing the goodness-of-fit of a continuous distribution.


Example 14. Fitting and testing the goodness-of-fit of a Poisson distribution to the yeast cell data. (See chapter 3, Grouped Data, for a description of the experiment and data.)

| Cells in a square, i | Observed count O_i | Probability p_i | Expected count E_i | (O_i − E_i)²/E_i |
|---|---|---|---|---|
| 0 | 0 | 0.0093 | 3.7 | (combined with i = 1) |
| 1 | 20 | 0.0434 | 17.4 | 0.1 |
| 2 | 43 | 0.1016 | 40.6 | 0.1 |
| 3 | 53 | 0.1585 | 63.4 | 1.7 |
| 4 | 86 | 0.1855 | 74.2 | 1.9 |
| 5 | 70 | 0.1736 | 69.4 | 0.0 |
| 6 | 54 | 0.1354 | 54.2 | 0.0 |
| 7 | 37 | 0.0905 | 36.2 | 0.0 |
| 8 | 18 | 0.0530 | 21.2 | 0.5 |
| 9 | 10 | 0.0276 | 11.0 | 0.1 |
| 10 | 5 | 0.0129 | 5.2 | 0.0 (classes 10 to 13 combined) |
| 11 | 2 | 0.0055 | 2.2 | |
| 12 | 2 | 0.0021 | 0.8 | |
| 13 | 0 | 0.0011 | 0.4 | |
| Total | 400 | 1.0000 | 399.9 | 4.4 |

A possible model for X, the number of yeast cells in a square, is a Poisson distribution. We thus want to test H0: the data arise from a Poisson distribution, against H1: the data do not arise from a Poisson distribution, at a 5% significance level.


1. Estimate the Poisson parameter λ by the sample mean x̄ = (1/400) Σᵢ i·O_i = 4.68.
2. Calculate the Poisson probabilities p_i = P(X = i) = (4.68)ⁱ e^(−4.68)/i! for i = 0, 1, …, 12, and then P(X ≥ 13) = 1 − Σ_{i=0}^{12} P(X = i).
3. Calculate the expected counts E_i = 400p_i.
4. Combine E_0 with E_1, and E_10 through E_13, so that all expected values exceed 5; combine the corresponding observed counts.
5. χ² = 4.4 with 10 − 1 − 1 = 8 degrees of freedom (since 10 classes were used in computing χ² and one parameter was estimated from the data). The upper 5% point of the χ² distribution with 8 degrees of freedom is 15.51. Hence we do not reject H0 in favour of H1; that is, we conclude that the Poisson distribution model provides an adequate fit to the yeast cell data.
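The five steps above can be sketched in stdlib-only Python (an illustration, not part of the notes):

```python
import math

# Fitting a Poisson distribution to the yeast-cell counts (Example 14)
observed = [0, 20, 43, 53, 86, 70, 54, 37, 18, 10, 5, 2, 2, 0]  # i = 0..13
n = sum(observed)                                   # 400
lam = sum(i * o for i, o in enumerate(observed)) / n            # 4.68

# Poisson probabilities for i = 0..12; the last class is P(X >= 13)
p = [lam ** i * math.exp(-lam) / math.factorial(i) for i in range(13)]
p.append(1 - sum(p))
expected = [n * pi for pi in p]

# Combine classes {0,1} and {10,...,13} so all expected counts exceed 5
obs_c = [observed[0] + observed[1]] + observed[2:10] + [sum(observed[10:])]
exp_c = [expected[0] + expected[1]] + expected[2:10] + [sum(expected[10:])]

chi_sq = sum((o - e) ** 2 / e for o, e in zip(obs_c, exp_c))    # about 4.4
df = len(obs_c) - 1 - 1                             # 8: one parameter estimated
```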


7.4.3 Amalgamation of χ² Results

Theoretical result: if V1, …, V_r are independent χ² random variables with degrees of freedom ν1, …, ν_r respectively, then the random variable V = V1 + … + V_r ~ χ²_ν, where ν = ν1 + … + ν_r.

Example 15. Suppose 4 independent experiments are performed to test a null hypothesis H0 and the goodness-of-fit test statistic χ² is calculated in each case. Also suppose that the experimental results cannot be combined. Each of the four individual tests of H0 is not significant at the 5% level, but the four χ² values sum to 50.2, with degrees of freedom summing to 30. Using the additive property, under H0 the value 50.2 is an observation from χ²_30. The upper 5% point of χ²_30 is 43.77. Hence the combined test is significant at the 5% level, so we have sufficient evidence for rejecting H0.
