
Chapter 1

Sampling and Sampling distributions



Prepared by Noman Rasheed

Statistical Population
A statistical population or universe is defined as the aggregate or totality of all individual members or objects, whether animate or inanimate, concrete or abstract, possessing some characteristic of interest.

Sampling units
The individual members of the population are called sampling units or simply units.

Types of Population
The following are the types of population:
Finite Population
Infinite Population
Target Population
Sampled Population
Existent Population
Hypothetical Population

Finite Population
A population is said to be finite if it consists of a finite or countable number of sampling units. For example: all students in a college, all houses in a country, etc.

Infinite Population
A population is said to be infinite if it consists of an infinite or uncountable number of sampling units. For example: all points on a line, the number of stars in the sky, etc.

Target Population
A population about which we wish to draw inferences is called Target population.

M.Phil Stat, M.Ed. Contact: nomanrasheed163@yahoo.com, facebook.com/ Something About Statistics.


Sampled Population
The population from which the sample is actually selected is called the sampled population.

Explanation
Suppose we desire to know the opinions of college students in the province of the Punjab with regard to the present examination system. Then our population will consist of the total number of students in all the colleges in the province; that is our target population. But owing to a shortage of resources (time, cost, etc.), if we select a sample from five colleges scattered throughout the province, then those five colleges constitute our sampled population.

Existent Population
A population whose units are available in solid form (concrete form) such as trees, households, students
etc is called an Existent population.

Hypothetical Population
A population which consists of the possible ways in which an event can occur is called a hypothetical population; its units are not available in solid form. For example, the outcomes of a die or a coin.

Sample
A sample is a part of the population which is selected with the expectation that it will represent the characteristics of the population.

Population Size
Total number of units in a finite population is called the size of the population and is denoted by N.

Sample Size
Number of units selected in the sample from the population is called size of sample denoted by n.

Population distribution
The arrangement of the values (sampling units) of a population with their probabilities of occurrence is called the population distribution.

Sample distribution
The arrangement of the values of a sample with their probabilities of occurrence is called the sample distribution.

Sampling
The process of selecting a sample from the population is called sampling.

Parameter and Statistic


A numerical value obtained from a population is called a population parameter or simply a parameter. A parameter is a constant and is usually denoted by a Greek letter such as μ, σ, π, etc.
A statistic is a value obtained from sample observations. Its value is used to estimate the corresponding population parameter. A statistic is a random variable and is denoted by a Roman letter such as x̄, S, r, etc.

Basic aims/ Purposes of Sampling


1. To obtain the maximum information about the characteristics of population without examining every
unit of the population.
2. To find the reliability of the estimates derived from the sample.

Sampling Design
A sample design is a definite statistical plan concerned with all the principal steps taken in the selection of a sample and the estimation procedure. These steps are formulated in advance of conducting the survey.

Sampling Frame
A sampling frame is a complete list or a map that contains all the N sampling units in a population such
as a complete list of all households in a city, a map of a village showing all fields etc.

Requirements of a good frame


A good frame should have the following qualities:
It should not contain inaccurate sampling units.
It should be complete and exhaustive.
It should be free from errors of omission and duplication of sampling units.
It should be as up-to-date as possible at the time of use.

Probability and Non-Probability Sampling


Sampling methods are broadly classified as probability and non-probability sampling
1. Probability or Random Sampling
When each unit in the population has a known non-zero (not necessarily equal) probability of being included in the sample, the sampling is called probability or random sampling. The major advantage of probability sampling is that it provides a valid estimate of the sampling error. The major types of
probability sampling are
(a) Simple Random Sampling
(b) Stratified Random Sampling
(c) Systematic Sampling
(d) Cluster Sampling
2. Non-Probability Sampling
A non-probability or non-random sampling is a process in which personal judgment determines which units of the population are selected for the sample. The disadvantage of non-probability sampling is that the reliability of the sample results cannot be determined in terms of probability. The major types
of non-random sampling are
(a) Purposive Sampling
(b) Quota Sampling

Sampling with and Sampling without replacement


Samples may be selected with replacement or without replacement.

CHAPTER 1. SAMPLING AND SAMPLING DISTRIBUTIONS


1. With replacement
A sampling is said to be with replacement when, from a population (finite or infinite), a sampling unit is drawn, observed and then returned to the population before another unit is drawn. The number of units available for future drawings is not affected; the population remains the same, and a sampling unit might be selected more than once. Thus the successive drawings are independent.
2. Without replacement
A sampling is said to be without replacement, when the sampling unit selected is not returned to the
population, before the next unit is selected. Thus the number of units remaining after each drawing
will be reduced by one. In this case, a sampling unit selected once cannot be selected again for the
sample because the selected unit is not replaced therefore successive drawings are dependent.

Theorem
If a sample of size n is selected from a finite population of size N, then the number of all possible samples is given as

    No. of possible samples = N^n                 if sampling is with replacement
    No. of possible samples = NPn = N!/(N − n)!   if sampling is without replacement

Proof. (Sampling with replacement)
If we use sampling with replacement, the number of all possible samples of size n that could be selected from a finite population of size N is

    No. of possible samples = N × N × ... × N (n times) = N^n,

because the first unit of the sample can be selected in N different ways, the second unit of the sample can be selected in N ways, and so on; the nth unit of the sample can also be selected in N ways. A sample of n units constitutes one arrangement, and there are N^n possible arrangements of n units from a finite population of N units.

Proof. (Sampling without replacement)
If we use sampling without replacement, the number of all possible samples of size n that can be drawn from a finite population of size N is

    No. of possible samples = N(N − 1)(N − 2)...(N − (n − 1)) = N(N − 1)(N − 2)...(N − n + 1)

Multiplying and dividing by (N − n)...(3)(2)(1), we get

    = N(N − 1)(N − 2)...(N − n + 1)(N − n)...(3)(2)(1) / [(N − n)...(3)(2)(1)]
    = N!/(N − n)! = NPn,

because the first unit of the sample can be selected in N different ways, the second unit of the sample can be selected in (N − 1) ways, and so on; the nth unit of the sample can be selected in (N − n + 1) ways. A sample of n units constitutes one permutation, and there are NPn possible permutations of n units from a finite population of N units.
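The two counting formulas above can be checked numerically against brute-force enumeration; a small sketch (the function names are my own):

```python
from itertools import product, permutations
from math import factorial

def n_samples_with_replacement(N, n):
    # Ordered draws, all N units available at every draw: N^n
    return N ** n

def n_samples_without_replacement(N, n):
    # Ordered draws without repetition: N!/(N - n)! = NPn
    return factorial(N) // factorial(N - n)

# Cross-check against full enumeration for N = 4, n = 2
units = [3, 7, 11, 15]
assert n_samples_with_replacement(4, 2) == len(list(product(units, repeat=2)))    # 16
assert n_samples_without_replacement(4, 2) == len(list(permutations(units, 2)))   # 12
print(n_samples_with_replacement(4, 2), n_samples_without_replacement(4, 2))
```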

Sampling and Non-Sampling Errors


The difference between a sample statistic and the true value of the corresponding population parameter is called the sampling error. For example, if x̄ is the mean obtained from a sample of size n and μ is the corresponding population parameter, then the difference between x̄ and μ is the sampling error; that is,

    Sampling error = x̄ − μ

As the sample size increases, the sampling error is reduced, and in a complete enumeration (census) there is no sampling error, as x̄ equals μ.
The errors which occur at the stage of gathering, arranging and analyzing the data are called non-sampling errors. Non-sampling errors include all kinds of human errors, a faulty sampling frame, a biased method of selection of units, processing errors such as errors in editing and coding, misclassification of observations, etc.

Sampling Bias
In a survey, sampling bias means a systematic component of error which deprives the survey results of their representativeness. Bias is introduced by the following methods of selection:
Deliberate Selection
Substitution
Incomplete Coverage
Haphazard Selection
Inadequate Interviewing

Simple Random Sampling


Simple random sampling is a procedure of selecting a sample of size n from the population in such a way
that 1) each unit in the population has an equal probability of being included in the sample and 2) each
possible sample of the same size n has an equal probability of being selected.
Simple random sampling is used when the population is essentially homogeneous with respect to the characteristic of interest, and when the population size is small, so that the sampling units are easily identifiable and accessible.

Selection of Simple Random Sample


A simple random sample can be selected by the following methods.
1. Goldfish Bowl Procedure/ Lottery Method
2. Using a Random Number Table
3. Using a Computer
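Method 3 (using a computer) can be sketched with Python's standard library; the frame size and sample size here are made up:

```python
import random

random.seed(1)  # fixed seed for reproducibility only
frame = list(range(1, 501))            # hypothetical frame of N = 500 serially numbered units

srs_wor = random.sample(frame, 10)     # simple random sample WITHOUT replacement
srs_wr = random.choices(frame, k=10)   # simple random sample WITH replacement (repeats possible)

print(sorted(srs_wor))
assert len(set(srs_wor)) == 10         # no unit repeats when sampling without replacement
```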

Stratified Random Sampling


If the units in population are not homogeneous, then population is divided into non-overlapping classes
or groups called strata. Units within each stratum (singular of strata) are as homogeneous as possible with
respect to the characteristics under study or stratifying factor. From each stratum, a simple random sample
is taken and the overall sample is obtained by combining the samples for all strata.
Stratified random sampling is used when 1) the variation among strata is greater than the variation within strata, or 2) information about some part of the population is desired.
The purposes of stratified random sampling are 1) to provide improved estimates of the population characteristics and 2) to reduce the variance of the estimator (a more precise estimator).

Allocation of sample sizes


By allocation of a sample we mean the way the total sample size n is distributed among the various strata
into which the population has been divided. Four methods of allocating the sample numbers are available.
They are
1. Equal Allocation
2. Proportional Allocation

CHAPTER 1. SAMPLING AND SAMPLING DISTRIBUTIONS


3. Neyman Allocation
4. Optimum Allocation

Equal Allocation
The allocation is called equal allocation when an equal number of sampling units is selected from each stratum; that is, the total sample size n is distributed equally among all k strata. Thus the stratum sample size nᵢ for equal allocation is

    nᵢ = n/k,    for i = 1, 2, 3, ..., k

Proportional Allocation
The allocation is said to be proportional when the total sample size n is distributed among the different strata in proportion to the sizes of the strata. In other words, the allocation is proportional if

    nᵢ = n·(Nᵢ/N),    for i = 1, 2, 3, ..., k

where Nᵢ is the population size of the ith stratum, nᵢ is the ith stratum sample size, and N is the total population size.

Neyman Allocation
This method of allocation consists of finding the nᵢ which minimize the variance of the stratified sample mean for a fixed total sample size n, assuming the costs of surveying the units to be the same in all strata. The stratum sample size nᵢ is given by the relation

    nᵢ = n·(Nᵢσᵢ / Σ Nᵢσᵢ),    for i = 1, 2, 3, ..., k

Neyman allocation becomes exactly the same as proportional allocation when all the stratum standard deviations are equal.

Optimum Allocation
The allocation is called optimum when the total sample size n is allocated among the different strata in such a way that, for a given cost of selecting the sample, the variance of the stratified sample mean is minimized. The stratum sample size nᵢ for this method of allocation is

    nᵢ = n·[(Nᵢσᵢ/√cᵢ) / Σ(Nᵢσᵢ/√cᵢ)],    for i = 1, 2, 3, ..., k

where Nᵢ is the population size of the ith stratum, σᵢ is the stratum standard deviation, and cᵢ is the cost of surveying one unit in the ith stratum.
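The four allocation rules can be compared side by side; a sketch with made-up strata (function name my own), rounding only for display:

```python
def allocations(n, N, sigma, cost):
    # N, sigma, cost: per-stratum population sizes, standard deviations and per-unit costs
    k = len(N)
    equal = [n / k] * k                                           # n_i = n/k
    prop = [n * Ni / sum(N) for Ni in N]                          # n_i = n*N_i/N
    w_ney = [Ni * si for Ni, si in zip(N, sigma)]                 # Neyman weights N_i*sigma_i
    neyman = [n * w / sum(w_ney) for w in w_ney]
    w_opt = [Ni * si / ci ** 0.5 for Ni, si, ci in zip(N, sigma, cost)]
    optimum = [n * w / sum(w_opt) for w in w_opt]                 # N_i*sigma_i/sqrt(c_i)
    return equal, prop, neyman, optimum

# Three hypothetical strata, total sample n = 60
eq, pr, ney, opt = allocations(60, N=[400, 300, 300], sigma=[2.0, 5.0, 5.0], cost=[1.0, 4.0, 1.0])
print([round(x, 1) for x in pr])    # proportional: [24.0, 18.0, 18.0]
print([round(x, 1) for x in ney])   # Neyman shifts units toward the high-variance strata
```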

Systematic Sampling
Systematic sampling is a method of selecting a sample of size n in which every kth unit in the population is selected after the units have been serially numbered from 1 to N or arranged in some systematic fashion. The steps for this technique are:
1. Allot serial numbers from 1 to N to the sampling units.
2. Divide the population into n groups of k units each, where

    k = population size / sample size = N/n

3. Select the first unit at random from the first group, and then every kth unit thereafter.
Some advantages of this technique over simple random sampling are 1) it is easier to draw, because only one random number is required, and 2) it distributes the sample more evenly over the listed population.
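The steps above can be sketched as follows (a minimal version, assuming N is a multiple of n; the function name is my own):

```python
import random

def systematic_sample(N, n, seed=None):
    # k = N/n; pick a random start in the first group, then take every kth unit
    k = N // n
    rng = random.Random(seed)
    start = rng.randint(1, k)               # the single random number required
    return [start + i * k for i in range(n)]

# N = 20 units, n = 5, so k = 4: one random start, then every 4th unit
print(systematic_sample(N=20, n=5, seed=7))
```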

Cluster Sampling
Cluster sampling is a method of selecting a sample from a population which is divided into natural groups, such as households, agricultural farms, etc., called clusters. Taking these clusters as sampling units, a sample of clusters is drawn at random. After the clusters have been selected, all, or a part, of the units in each cluster are included in the sample. This sample is called a cluster random sample. Cluster sampling also uses prior knowledge about the target population and partitions the population into groups or clusters, where each cluster ideally has the same characteristics as the target population.

Distinguish between Stratum and Cluster


Stratum:
1) Units within a stratum are homogeneous.
2) Variation within a stratum is less than the variation among strata.

Cluster:
1) Units within a cluster are heterogeneous.
2) Variation within a cluster is more than the variation among clusters.

Purposive Sampling
In this method, personal judgment plays an important role in the selection of sampling units. The units are selected deliberately, keeping the purpose of the inquiry in view.

Quota sampling
A sampling technique in which the sampling units are selected into the sample from quotas (groups, usually of human beings) by personal, limited choice is called quota sampling.

Sampling distribution
The sampling distribution is defined as the probability distribution or relative frequency distribution of a sample statistic. As a sampling distribution is a probability distribution, the sum of the probabilities in it is always equal to one. The distribution has its own mean and its own standard deviation. The most common sampling distributions are the F, t and χ² distributions.

Standard Error
The standard error is defined as the standard deviation of the sampling distribution of a sample statistic, abbreviated S.E.

Accuracy and Precision


Accuracy refers to the size of deviations from the true population mean μ, whereas precision refers to the size of deviations from the overall mean obtained by repeated application of the sampling procedure.

List of Sampling distributions


1. Sampling distribution of the sample mean (x̄)
2. Sampling distribution of differences between sample means (x̄₁ − x̄₂)
3. Sampling distribution of the sample proportion (p̂)
4. Sampling distribution of differences between sample proportions (p̂₁ − p̂₂)
5. Sampling distribution of sample variances (S² and s²)


Sampling distribution of the sample mean (x̄)

The probability distribution or relative frequency distribution of the means x̄ of all possible random samples of the same size that could be selected from a given population. For example:

    x̄       f(x̄)
    x̄₁      f(x̄₁)
    x̄₂      f(x̄₂)
    ...     ...
    x̄ₖ      f(x̄ₖ)

Properties of the sampling distribution of the sample mean (x̄)

The sampling distribution of x̄ has the following properties.
1. Mean
The mean of the sampling distribution of x̄ is equal to the population mean, regardless of whether sampling is done with replacement or without replacement. That is,

    μ_x̄ = E(x̄) = μ

2. Variance
When sampling is done with replacement from a finite or an infinite population, the variance of the sampling distribution of the mean is given by

    σ²_x̄ = σ²/n,    (σ² = population variance)

and

    S.E.(x̄) = σ/√n

When sampling is done without replacement from a finite population of size N,

    σ²_x̄ = (σ²/n)·((N − n)/(N − 1)),    (σ² = population variance)

and

    S.E.(x̄) = (σ/√n)·√((N − n)/(N − 1))

3. Shape
The shape of the sampling distribution of x̄ can be studied as follows.
(a) Normal population with σ known
If the parent population is normal, then the sampling distribution of x̄ will also be normal, regardless of sample size (whether the sample is small or large). The standardized normal variable is

    Z = (x̄ − μ)/(σ/√n)

If sampling is without replacement and the sample size n is 5 percent or more of the population size N, then Z values are obtained by the formula

    Z = (x̄ − μ)/[(σ/√n)·√((N − n)/(N − 1))]

(b) Normal population with σ unknown
For large samples: When the sample is drawn from a normal population with σ unknown, σ is estimated by the sample standard deviation. If the sample size is sufficiently large (n ≥ 30), then by the central limit theorem the sampling distribution of x̄ is approximately normal with mean μ and a standard deviation of S/√n. The standardized normal variable is

    Z = (x̄ − μ)/(S/√n)

For small samples: When σ is unknown and the sample size is small (n < 30), the sampling distribution of x̄ follows Student's t-distribution with statistic

    t = (x̄ − μ)/(s/√n),    with degrees of freedom = n − 1

(c) Non-normal population with σ known or unknown (large sample)
By the central limit theorem, for large sample sizes the sampling distribution of x̄ is approximately normally distributed even though the parent population is non-normal. The standardized normal variable with σ known is

    Z = (x̄ − μ)/(σ/√n)

and when σ is unknown,

    Z = (x̄ − μ)/(S/√n)
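The mean and variance properties (with replacement) can be checked by simulation; a sketch with a made-up four-unit population:

```python
import random
import statistics

random.seed(0)
pop = [3, 7, 11, 15]        # population with mu = 9, sigma^2 = 20
n, reps = 2, 100_000

# Draw many samples of size n with replacement and record their means
means = [statistics.fmean(random.choices(pop, k=n)) for _ in range(reps)]

print(round(statistics.fmean(means), 2))      # close to mu = 9
print(round(statistics.pvariance(means), 2))  # close to sigma^2/n = 10
```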

Theorem
The mean of the sampling distribution of x̄ is equal to the population mean μ (when sampling is done with replacement); that is,

    μ_x̄ = μ

Proof: Let x₁, x₂, ..., xₙ be a random sample of size n from a population with mean μ. Then the sample mean is x̄ = (Σᵢ₌₁ⁿ xᵢ)/n. We know that

    μ_x̄ = E(x̄) = E[(Σᵢ₌₁ⁿ xᵢ)/n] = (1/n)·[E(x₁) + E(x₂) + ... + E(xₙ)]        (1.1)

In a random sample, the random variables x₁, x₂, ..., xₙ are independent, and each has the same distribution as the population. Thus

    E(x₁) = E(x₂) = ... = E(xₙ) = μ

So (1.1) can be written as

    μ_x̄ = (1/n)·(μ + μ + ... + μ) = nμ/n = μ

Theorem
The mean of the sampling distribution of x̄ is equal to the population mean μ (when sampling is done without replacement); that is,

    μ_x̄ = μ

Proof: Consider a population of size N and a sample size n. The number of samples drawn without replacement is k = NCn. Let x̄₁, x̄₂, ..., x̄ₖ be the means of the k samples. Then the mean of the sample means is

    E(x̄) = μ_x̄ = (1/k)·[x̄₁ + x̄₂ + ... + x̄ₖ]
               = (1/k)·[(x₁ + x₂ + ...)/n + (x₁ + x₃ + ...)/n + ... + (x₂ + x₃ + ...)/n]

Each xᵢ repeats C(N − 1, n − 1) times. Therefore,

    μ_x̄ = (1/(nk))·C(N − 1, n − 1)·[x₁ + x₂ + ... + x_N]
        = [C(N − 1, n − 1)/(n·C(N, n))]·[x₁ + x₂ + ... + x_N]
        = [(N − 1)!/((n − 1)!(N − n)!)]·[n!(N − n)!/(n·N!)]·[x₁ + x₂ + ... + x_N]
        = (1/N)·[x₁ + x₂ + ... + x_N]

    E(x̄) = μ_x̄ = μ
Theorem
If a random sample of size n is drawn with replacement from an infinite or finite population, the standard deviation of the sampling distribution of x̄ is given by

    σ_x̄ = σ/√n

Proof: Let x₁, x₂, x₃, ..., xₙ be a random sample of size n drawn with replacement from a population whose mean is μ and variance is σ². The sample mean is x̄ = (Σᵢ₌₁ⁿ xᵢ)/n, and the variance of x̄, σ²_x̄, is defined as

    σ²_x̄ = E[x̄ − E(x̄)]²
         = E[(Σᵢ₌₁ⁿ xᵢ)/n − μ]²
         = (1/n²)·E[Σᵢ₌₁ⁿ (xᵢ − μ)]²
         = (1/n²)·E[Σᵢ₌₁ⁿ (xᵢ − μ)² + Σ_{i≠j} (xᵢ − μ)(xⱼ − μ)]

Applying expectation, we get

    = (1/n²)·[Σᵢ₌₁ⁿ E(xᵢ − μ)² + Σ_{i≠j} E(xᵢ − μ)(xⱼ − μ)]        (1.2)

Consider the factors:

    E(xᵢ − μ)² = σ²

and E(xᵢ − μ)(xⱼ − μ) = 0, since this is the covariance between xᵢ and xⱼ, and because of sampling with replacement the ith and jth draws are independent.
Therefore (1.2) becomes

    σ²_x̄ = (1/n²)·(nσ²) = σ²/n

and the standard error is

    S.E.(x̄) = σ/√n
Theorem
If a random sample of size n is drawn without replacement from a finite population, the standard deviation of the sampling distribution of x̄ is given by

    σ_x̄ = (σ/√n)·√((N − n)/(N − 1))

Proof: Let x₁, x₂, x₃, ..., xₙ be a random sample of size n drawn without replacement from a population whose mean is μ and variance is σ². The sample mean is x̄ = (Σᵢ₌₁ⁿ xᵢ)/n, and the variance of x̄, σ²_x̄, is defined as

    σ²_x̄ = E[x̄ − E(x̄)]²
         = E[(Σᵢ₌₁ⁿ xᵢ)/n − μ]²
         = (1/n²)·E[Σᵢ₌₁ⁿ (xᵢ − μ)]²
         = (1/n²)·E[Σᵢ₌₁ⁿ (xᵢ − μ)² + Σ_{i≠j} (xᵢ − μ)(xⱼ − μ)]

Applying expectation, we get

    = (1/n²)·[Σᵢ₌₁ⁿ E(xᵢ − μ)² + Σ_{i≠j} E(xᵢ − μ)(xⱼ − μ)]        (1.3)

Consider the factor

    E(xᵢ − μ)² = σ²

Similarly, E(xᵢ − μ)(xⱼ − μ) is the covariance between xᵢ and xⱼ, and being without replacement, successive draws are dependent. Therefore

    E(xᵢ − μ)(xⱼ − μ) = (1/(N(N − 1)))·Σ_{i≠j}ᴺ (xᵢ − μ)(xⱼ − μ)

Now consider

    [Σᵢ₌₁ᴺ (xᵢ − μ)]² = Σᵢ₌₁ᴺ (xᵢ − μ)² + Σ_{i≠j}ᴺ (xᵢ − μ)(xⱼ − μ)        (1.4)

Using the property of the arithmetic mean, Σᵢ₌₁ᴺ (xᵢ − μ) = 0, relation (1.4) becomes

    0 = Nσ² + Σ_{i≠j}ᴺ (xᵢ − μ)(xⱼ − μ),  so that  Σ_{i≠j}ᴺ (xᵢ − μ)(xⱼ − μ) = −Nσ²        (1.5)

Inserting (1.5) in the expression for the covariance, we get

    E(xᵢ − μ)(xⱼ − μ) = −σ²/(N − 1)

Now (1.3) becomes

    σ²_x̄ = (1/n²)·[nσ² − n(n − 1)·σ²/(N − 1)]
         = (σ²/n)·[1 − (n − 1)/(N − 1)]
         = (σ²/n)·((N − n)/(N − 1))

Taking the square root, we get

    σ_x̄ = (σ/√n)·√((N − n)/(N − 1))

Finite population correction factor

The factor (N − n)/(N − 1) is usually called the finite population correction (fpc) or finite correction factor (fcf) for the variance because, in sampling from a finite population, the variance of the sampling distribution of x̄ is reduced by this factor. It is dropped from the formula whenever n is less than 5% of N, and used when n is 5% or more of N.
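The fpc rule of thumb can be wrapped in a small helper (the function name is my own):

```python
import math

def se_mean(sigma, n, N=None):
    # Standard error of the sample mean; apply the fpc only when the
    # population is finite and n is at least 5% of N.
    se = sigma / math.sqrt(n)
    if N is not None and n / N >= 0.05:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(round(se_mean(sigma=20 ** 0.5, n=2, N=4), 3))   # fpc applied: sqrt(10 * 2/3) ≈ 2.582
print(round(se_mean(sigma=20 ** 0.5, n=2), 3))        # infinite population: sqrt(10) ≈ 3.162
```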

Sampling distribution of differences between sample means (x̄₁ − x̄₂)

Suppose we have two large or infinite populations with means μ₁ and μ₂ and variances σ₁² and σ₂², respectively. Let independent random samples of sizes n₁ and n₂ be selected from the respective populations, and the differences (x̄₁ − x̄₂) between the means of all possible pairs of samples be computed. Then the probability distribution of the differences (x̄₁ − x̄₂) is called the sampling distribution of the differences of sample means (x̄₁ − x̄₂).

Properties of the sampling distribution of differences between sample means (x̄₁ − x̄₂)
The distribution has the following properties.
1. Mean
The mean of the sampling distribution of (x̄₁ − x̄₂), denoted by μ_{x̄₁−x̄₂}, is equal to the difference between the population means; that is,

    μ_{x̄₁−x̄₂} = μ₁ − μ₂        [E(x̄₁ − x̄₂) = E(x̄₁) − E(x̄₂) = μ₁ − μ₂]

2. Variance
The variance of the sampling distribution of (x̄₁ − x̄₂), denoted by σ²_{x̄₁−x̄₂}, is given by

    σ²_{x̄₁−x̄₂} = σ₁²/n₁ + σ₂²/n₂

and

    S.E.(x̄₁ − x̄₂) = √(σ₁²/n₁ + σ₂²/n₂)

If the values of σ₁ and σ₂ are not known and both sample sizes are sufficiently large, they are replaced by S₁ and S₂, the standard deviations of the respective samples. Then the S.E. will be

    S(x̄₁ − x̄₂) = √(S₁²/n₁ + S₂²/n₂)

If the populations are finite, sampling is done without replacement, and the sample sizes are greater than or equal to 5% of the population sizes, the S.E. is

    S.E.(x̄₁ − x̄₂) = √((σ₁²/n₁)·((N₁ − n₁)/(N₁ − 1)) + (σ₂²/n₂)·((N₂ − n₂)/(N₂ − 1)))

3. Shape
The shape of the sampling distribution of (x̄₁ − x̄₂) can be studied as follows.
(a) Normal populations with σ₁ and σ₂ known
If the populations are normally distributed, the sampling distribution of (x̄₁ − x̄₂), regardless of sample sizes, will be normal with mean μ₁ − μ₂ and variance σ²_{x̄₁−x̄₂} = σ₁²/n₁ + σ₂²/n₂. In other words, the standardized variable is

    Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)]/√(σ₁²/n₁ + σ₂²/n₂)

(b) Normal populations with σ₁ and σ₂ unknown
For large samples: When the independent samples of sizes n₁ and n₂ are drawn from normal populations with unknown standard deviations, we estimate them by the respective sample standard deviations. If the sample sizes are sufficiently large, then we can assume (by the central limit theorem) that the sampling distribution of (x̄₁ − x̄₂) is approximately normal with mean μ₁ − μ₂ and standard deviation √(S₁²/n₁ + S₂²/n₂), and the standardized normal variable is

    Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)]/√(S₁²/n₁ + S₂²/n₂)

For small samples: If the sample sizes are small (n₁ and n₂ < 30) and the populations have unknown but equal standard deviations, then the sampling distribution of (x̄₁ − x̄₂) follows Student's t-distribution with statistic

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)]/[s_p·√(1/n₁ + 1/n₂)],    with degrees of freedom = n₁ + n₂ − 2

where the pooled variance s_p² is

    s_p² = [Σ(x₁ᵢ − x̄₁)² + Σ(x₂ᵢ − x̄₂)²]/(n₁ + n₂ − 2)
    s_p² = (1/(n₁ + n₂ − 2))·[(Σx₁ᵢ² − (Σx₁ᵢ)²/n₁) + (Σx₂ᵢ² − (Σx₂ᵢ)²/n₂)]
    s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²]/(n₁ + n₂ − 2)
    s_p² = (n₁S₁² + n₂S₂²)/(n₁ + n₂ − 2)

(c) Non-normal populations with σ₁ and σ₂ known or unknown
If the sample sizes are sufficiently large, then by the central limit theorem the sampling distribution of (x̄₁ − x̄₂) will be approximately normal even though the populations are non-normal, with standardized normal variable

    Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)]/√(σ₁²/n₁ + σ₂²/n₂)

If the population standard deviations are unknown, then they are estimated by the sample standard deviations and the standardized normal variable is

    Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)]/√(S₁²/n₁ + S₂²/n₂)

where S₁² and S₂² are the sample variances.
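The small-sample pooled t statistic can be computed directly; a sketch with made-up data (function name my own):

```python
import math
import statistics

def pooled_t(x1, x2):
    n1, n2 = len(x1), len(x2)
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    # s_p^2 = [(n1-1)s1^2 + (n2-1)s2^2] / (n1 + n2 - 2), with s^2 the (n-1)-divisor variance
    sp2 = ((n1 - 1) * statistics.variance(x1) + (n2 - 1) * statistics.variance(x2)) \
          / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2   # the statistic and its degrees of freedom

t, df = pooled_t([12, 15, 11, 14], [10, 9, 13, 12, 11])
print(round(t, 3), df)      # 1.764 with 7 degrees of freedom
```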

Sampling distribution of the sample proportion (p̂)

The sample proportion is p̂ = x/n, where x represents the number of units having the characteristic of interest and n is the sample size. Here x is a binomial random variable and the binomial parameter P is the proportion of successes. The sample proportion p̂ takes different values in different samples; therefore it is a random variable and has a probability distribution. The probability distribution of the proportions of all possible random samples of size n is called the sampling distribution of p̂.

Properties of the sampling distribution of the sample proportion (p̂)
The sampling distribution has the following properties.
1. Mean
The mean of the sampling distribution of p̂, denoted by μ_p̂, is equal to the population proportion P. That is,

    μ_p̂ = P

2. Variance
The variance of the sampling distribution of p̂ is denoted by σ²_p̂ and, when sampling is with replacement, is given by

    σ²_p̂ = PQ/n,    Q = 1 − P

and the standard error is obtained as

    σ_p̂ = √(PQ/n)

When sampling is done without replacement and the sample size n is 5% or more of N,

    σ_p̂ = √(PQ/n)·√((N − n)/(N − 1))

When the population proportion P is not known and both the population size N and the sample size n are sufficiently large, the population proportion P is estimated by the sample proportion p̂, and the standard error is obtained as

    S_p̂ = √(p̂q̂/n)

When the sample is selected without replacement from a finite population of size N,

    S_p̂ = √(p̂q̂/n)·√((N − n)/(N − 1))

3. Shape
The sampling distribution of p̂ is the binomial distribution. However, for large sample sizes the sampling distribution of p̂ is approximately normal. A continuity correction of 1/(2n) on p̂ = x/n is needed when the normal approximation to the binomial distribution is used. The standardized normal variable is

    Z = (p̂ − P)/√(PQ/n)                     (without continuity correction)
    Z = (p̂ ± 1/(2n) − P)/√(PQ/n)            (with continuity correction)

Sometimes we use

    Z = (x − nP)/√(nPQ)                     (without continuity correction)
    Z = (x ± 0.5 − nP)/√(nPQ)               (with continuity correction)
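A quick check of the normal approximation for p̂, with the continuity correction of 1/(2n); the values of P and n here are made up:

```python
import math
from statistics import NormalDist

# Hypothetical setting: P = 0.4, n = 100; approximate P(p_hat <= 0.45)
P, n = 0.4, 100
se = math.sqrt(P * (1 - P) / n)       # sigma_p_hat = sqrt(PQ/n) ≈ 0.049
z = (0.45 + 1 / (2 * n) - P) / se     # continuity correction of 1/(2n) added for an upper tail
print(round(z, 2))                    # 1.12
print(round(NormalDist().cdf(z), 3))  # the approximate probability
```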

Sampling distribution of differences between proportions (p̂₁ − p̂₂)

Suppose there are two binomial populations with proportions of successes P₁ and P₂, respectively. Let independent random samples of sizes n₁ and n₂ be drawn from the respective populations, and the differences (p̂₁ − p̂₂) between the proportions of all possible pairs of samples be computed. Then a probability distribution of the differences (p̂₁ − p̂₂) can be obtained, and this probability distribution is called the sampling distribution of the differences between proportions (p̂₁ − p̂₂).

Properties of the sampling distribution of differences between proportions (p̂₁ − p̂₂)
The distribution has the following properties.
1. Mean
The mean of the sampling distribution of (p̂₁ − p̂₂), denoted by μ_{p̂₁−p̂₂}, is equal to the difference between the population proportions; that is,

    μ_{p̂₁−p̂₂} = P₁ − P₂

2. Variance
The variance of the sampling distribution of (p̂₁ − p̂₂), denoted by σ²_{p̂₁−p̂₂}, is given by

    σ²_{p̂₁−p̂₂} = P₁Q₁/n₁ + P₂Q₂/n₂

and

    σ_{p̂₁−p̂₂} = √(P₁Q₁/n₁ + P₂Q₂/n₂)

If both populations have the same proportion of successes, i.e. P₁ = P₂ = P, or if both samples have been drawn from a common binomial distribution, then

    σ_{p̂₁−p̂₂} = √(PQ·(1/n₁ + 1/n₂))

Whenever the value of the common proportion is not known, then for sufficiently large sample sizes it is replaced with its estimate p̂_c, where p̂_c = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂). Then the standard error is

    S_{p̂₁−p̂₂} = √(p̂_c·q̂_c·(1/n₁ + 1/n₂)),    where q̂_c = 1 − p̂_c

Whenever P₁ ≠ P₂ and both are unknown, then for large sample sizes they are replaced with the sample proportions p̂₁ and p̂₂, and the standard error is

    S_{p̂₁−p̂₂} = √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)

3. Shape
The sampling distribution of (p̂₁ − p̂₂) is approximately normal for large sample sizes, with standardized normal variable

    Z = [(p̂₁ − p̂₂) − (P₁ − P₂)]/√(P₁Q₁/n₁ + P₂Q₂/n₂)

The standardized variable changes with the standard error according to the conditions mentioned above.
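The pooled standard error used when P₁ = P₂ can be sketched as follows (made-up counts, function name my own):

```python
import math

def pooled_se_diff_prop(x1, n1, x2, n2):
    # Pooled estimate p_c = (x1 + x2)/(n1 + n2), q_c = 1 - p_c
    pc = (x1 + x2) / (n1 + n2)
    return math.sqrt(pc * (1 - pc) * (1 / n1 + 1 / n2))

# Hypothetical data: 30 successes out of 100, and 45 out of 150 (pooled p_c = 0.3)
print(round(pooled_se_diff_prop(30, 100, 45, 150), 4))   # 0.0592
```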

Sampling distribution of sample variances (S² and s²)

The probability distribution of the sample variances calculated from all possible random samples of size n from a normal population with variance σ².

Properties of the sampling distribution of sample variances (S² and s²)
The distribution has the following important properties.
1. Mean
The mean of the sample variance s² = Σᵢ₌₁ⁿ (xᵢ − x̄)²/(n − 1) is denoted by E(s²) = μ_{s²}. If sampling is done with replacement, then E(s²) = μ_{s²} = σ²; thus s² is an unbiased estimator of the population variance. The sample variance S² is defined as Σᵢ₌₁ⁿ (xᵢ − x̄)²/n. If samples are drawn with replacement, then (n/(n − 1))·E(S²) = σ², or E(S²) = ((n − 1)/n)·σ²; thus S² is a biased estimator of σ². In the case of sampling without replacement, the following relations hold:

    ((N − 1)/N)·E(s²) = σ²    or    E(s²) = (N/(N − 1))·σ²

and

    (n/(n − 1))·((N − 1)/N)·E(S²) = σ²    or    E(S²) = ((n − 1)/n)·(N/(N − 1))·σ²

2. Shape
The sampling distribution of the sample variance follows the chi-square distribution, while the sampling distribution followed by the ratio of two sample variances is called the F-distribution.

Advantages of sampling over census


There are the following advantages of sampling over a census. By a census, we mean a procedure of systematically acquiring and recording information about all members (units) of the given population.
1. Sampling saves money, as it is much cheaper to collect information from a sample than from the whole population.
2. Sampling saves a lot of time and energy, as the needed data are collected and processed much faster than census information.
3. In the case of an inaccessible population, the only way to collect information is sampling.
4. Sampling is the only way to collect information when the measurement process physically damages or destroys the sampling units under investigation.
5. Sampling is extensively used to obtain some of the census information.
6. Sampling provides a valid measure of reliability for sample estimates.
7. Following up non-response is easier in sampling.
8. More detailed information can be obtained by sampling.

Numerical Problems
Example 1.1: (a) A population consists of four numbers 3, 7, 11, 15. Considering all possible samples of size two which can be drawn with replacement from this population, find 1) the population mean, 2) the population standard deviation, 3) the mean of the sampling distribution of means, 4) the standard deviation of the sampling distribution of means. Verify 3) and 4) directly from 1) and 2) by a suitable formula.
(b) Repeat 3) and 4) of (a) when sampling is without replacement.
Solution: (a) (1) Population mean

μ = (3 + 7 + 11 + 15)/4 = 36/4 = 9

(2) Population standard deviation
We know that

 x        x²
 3         9
 7        49
 11      121
 15      225
 Σx = 36   Σx² = 404

σ² = Σx²/N − (Σx/N)² = 404/4 − (36/4)² = 101 − 81 = 20

so that σ = √20 = 4.47
(3) Now we draw all possible samples of size two from the population; the number of such samples is Nⁿ = 4² = 16.

 Sample    x̄      Sample      x̄
 (3, 3)     3      (11, 3)      7
 (3, 7)     5      (11, 7)      9
 (3, 11)    7      (11, 11)    11
 (3, 15)    9      (11, 15)    13
 (7, 3)     5      (15, 3)      9
 (7, 7)     7      (15, 7)     11
 (7, 11)    9      (15, 11)    13
 (7, 15)   11      (15, 15)    15
Sampling distribution of x̄ and calculations for the mean and standard deviation:

 x̄    Tally    f    f(x̄)    x̄ f(x̄)    x̄² f(x̄)
 3     /       1    1/16      3/16       9/16
 5     //      2    2/16     10/16      50/16
 7     ///     3    3/16     21/16     147/16
 9     ////    4    4/16     36/16     324/16
 11    ///     3    3/16     33/16     363/16
 13    //      2    2/16     26/16     338/16
 15    /       1    1/16     15/16     225/16
       Σf = 16   Σf(x̄) = 1   Σ x̄ f(x̄) = 144/16   Σ x̄² f(x̄) = 1456/16

Now

μ_x̄ = Σ x̄ f(x̄) = 144/16 = 9

σ_x̄ = √[ Σ x̄² f(x̄) − (Σ x̄ f(x̄))² ] = √(1456/16 − (144/16)²) = √(91 − 81) = √10 = 3.16

Verification
μ_x̄ = μ = 9
σ²_x̄ = σ²/n = 20/2 = 10
and σ_x̄ = √10 = 3.16, which verifies the result.
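The enumeration above can be reproduced programmatically. A minimal sketch (using the example's own numbers) lists all 16 with-replacement samples, builds the sampling distribution of x̄, and checks the verification formulas μ_x̄ = μ and σ²_x̄ = σ²/n:

```python
from itertools import product
from math import sqrt

# Population and sample size from Example 1.1(a).
population = [3, 7, 11, 15]
n = 2

mu = sum(population) / len(population)                             # 9
sigma2 = sum(x**2 for x in population) / len(population) - mu**2   # 20

# All N^n = 16 ordered samples drawn with replacement, and their means.
means = [sum(s) / n for s in product(population, repeat=n)]

mu_xbar = sum(means) / len(means)                                  # 9.0
var_xbar = sum(m**2 for m in means) / len(means) - mu_xbar**2      # 10.0

print(mu_xbar, var_xbar, sqrt(var_xbar))   # 9.0 10.0 3.16...
print(var_xbar == sigma2 / n)              # True: sigma_xbar^2 = sigma^2 / n
```

The same loop with `itertools.permutations(population, 2)` gives the 12 ordered without-replacement samples needed for part (b).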
Example 1.2: A random variable has the following probability distribution:

 xᵢ          4     5     6     7
 P(X = x)   0.2   0.4   0.3   0.1

1. Find the mean μ_x̄ and the variance σ²_x̄ of the mean x̄ for a random sample of size 36.
2. Find the probability that the mean of 36 items will be less than 5.5.
Solution:
We know that μ = Σ xP(x) and σ² = Σ x²P(x) − (Σ xP(x))². Therefore

 x    P(x)    xP(x)    x²P(x)
 4    0.2      0.8       3.2
 5    0.4      2.0      10.0
 6    0.3      1.8      10.8
 7    0.1      0.7       4.9
       Σ xP(x) = 5.3    Σ x²P(x) = 28.9

μ = 5.3 and σ² = Σ x²P(x) − (Σ xP(x))² = 28.9 − (5.3)² = 0.81

⇒ μ_x̄ = μ = 5.3 and σ²_x̄ = σ²/n = 0.81/36 = 0.0225, so σ_x̄ = 0.15

Now for P(x̄ < 5.5): the sample size is sufficiently large, therefore x̄ follows a normal distribution and the standard normal variable is

Z = (x̄ − μ)/(σ/√n)

Inserting the values, we obtain

Z = (5.5 − 5.3)/0.15 = 1.33

P(x̄ < 5.5) = P(Z < 1.33) = P(−∞ < Z ≤ 0) + P(0 ≤ Z ≤ 1.33) = 0.5 + 0.4082 = 0.9082
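The table lookup 0.5 + 0.4082 can be cross-checked with the exact normal CDF, Φ(z) = ½(1 + erf(z/√2)). A small sketch with the example's numbers:

```python
from math import erf, sqrt

# Example 1.2 redone with the exact normal CDF instead of a z-table.
mu, sigma2, n = 5.3, 0.81, 36
se = sqrt(sigma2 / n)            # sigma_xbar = 0.15

z = (5.5 - mu) / se              # 1.33 to two decimals
p = 0.5 * (1 + erf(z / sqrt(2))) # Phi(z)

print(round(z, 2), round(p, 4))  # 1.33, and p close to the table value 0.9082
```

The tiny difference from 0.9082 comes from carrying z = 1.3333... instead of rounding it to 1.33 before the lookup.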

Chapter 2

Statistical Estimation

Prepared by Noman Rasheed

Statistical Inference
The process of drawing conclusions or inferences about a population on the basis of limited information
contained in a random sample is called statistical inference.

Types of Statistical Inference


Statistical inference is traditionally divided into two major areas:
1. Estimation of parameters
2. Testing of hypothesis

Estimation
Estimation is a procedure of making a judgment, through a numerical value, about the true but unknown value of a population parameter on the basis of the limited information contained in a random sample obtained from the population whose estimate is required.

Testing of Hypothesis
Hypothesis testing is a procedure which enables us to decide whether to accept or reject any specified assumption or statement about the value of a population parameter on the basis of the limited information contained in a random sample.

Estimator
The rule, formula, or function used to estimate a population parameter is called an estimator or point estimator. The word estimator is used in general for a statistic, which is a random variable because it is a function of the random observations obtained from the population.

Estimate
A numerical value obtained by substituting the sample observations into the rule or formula is called an estimate.

Explanation
Suppose we have n = 5 values 5, 10, 12, 18, 15, a random sample obtained from the population. Then

x̄ = Σx/n = 60/5 = 12.

Here x̄ = Σx/n is an estimator of the population mean μ, and the value 12 is called the estimate. There are two categories of estimates:
Point Estimate
When an estimate for the unknown population parameter is expressed by a single value, it is called a point estimate.
Interval Estimate
An estimate expressed by a range of values, within which the true value of the population parameter is believed to lie, is referred to as an interval estimate.

Types of Estimation
There are two types of estimation:
1. Point Estimation
2. Interval Estimation

Point Estimation
The process of obtaining a single value from the sample as an estimate (point estimate) of the unknown but true value of a population parameter is called point estimation.

Linear Estimate
If an estimate can be expressed as a sum of weighted observations (as a linear combination), it is called a linear estimate. For example, x̄ is a linear estimate because it can be expressed as

x̄ = (1/n)x₁ + (1/n)x₂ + ... + (1/n)xₙ,

which is a linear combination of the values of the x's; in terms of weights, each observation is given a weight equal to 1/n.

Criteria/Properties/Qualities of a good point estimator

A point estimator is considered a good estimator if it satisfies the following properties:
Unbiasedness
Consistency
Efficiency
Sufficiency

Unbiasedness
An estimator is defined to be unbiased if the statistic used as an estimator has its expected value (mean) equal to the value of the population parameter being estimated. Let θ̂ be an estimator of a population parameter θ; then θ̂ will be an unbiased estimator if E(θ̂) = θ.

Bias
If E(θ̂) ≠ θ, then the statistic is said to be a biased estimator, and the bias of an estimator, called the estimation bias, is defined as

Bias = E(θ̂) − θ

Positively and Negatively Biased

The estimator is said to be positively biased when E(θ̂) > θ, and the estimator is said to be negatively biased when E(θ̂) < θ.

List of unbiased estimators

The most common unbiased estimators are:
Sample mean (x̄)
Sample proportion (p̂)
Sample variance (s²)

Theorem
Show that the sample proportion p̂ is an unbiased estimator of the population parameter p.
Proof: We know that p̂ = x/n. Applying expectation, we get

E(p̂) = E(x/n) = (1/n)E(x) = np/n = p

Theorem

If x̄ and s², defined by s² = (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)², are the sample mean and sample variance of a random sample of size n from a population with mean μ and variance σ², then show that E(s²) = σ².

Proof: Let x₁, x₂, . . . , xₙ be a random sample of size n from a population with mean μ and variance σ². Then

E(s²) = E[ (1/(n − 1)) Σᵢ₌₁ⁿ (xᵢ − x̄)² ] = (1/(n − 1)) E[ Σᵢ₌₁ⁿ (xᵢ − x̄)² ]

Multiplying each side of this equation by n − 1, we have

(n − 1)E(s²) = E[ Σᵢ₌₁ⁿ (xᵢ − x̄)² ]

Adding and subtracting μ on the right side of the above equation, we get

(n − 1)E(s²) = E[ Σᵢ₌₁ⁿ (xᵢ − μ + μ − x̄)² ]
            = E[ Σᵢ₌₁ⁿ {(xᵢ − μ) − (x̄ − μ)}² ]
            = E[ Σᵢ₌₁ⁿ {(xᵢ − μ)² − 2(xᵢ − μ)(x̄ − μ) + (x̄ − μ)²} ]
            = E[ Σᵢ₌₁ⁿ (xᵢ − μ)² − 2(x̄ − μ) Σᵢ₌₁ⁿ (xᵢ − μ) + n(x̄ − μ)² ]    (2.1)

Consider the factor Σᵢ₌₁ⁿ (xᵢ − μ) = Σᵢ₌₁ⁿ xᵢ − nμ = nx̄ − nμ = n(x̄ − μ).
Inserting this result in (2.1), we get

(n − 1)E(s²) = E[ Σᵢ₌₁ⁿ (xᵢ − μ)² − 2n(x̄ − μ)² + n(x̄ − μ)² ]
            = E[ Σᵢ₌₁ⁿ (xᵢ − μ)² − n(x̄ − μ)² ]
            = Σᵢ₌₁ⁿ E(xᵢ − μ)² − n E(x̄ − μ)²    (2.2)

Since E(x̄ − μ)² = σ²/n and E(xᵢ − μ)² = σ², (2.2) becomes

(n − 1)E(s²) = nσ² − n(σ²/n) = nσ² − σ² = σ²(n − 1)

E(s²) = σ²
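The two results above (s² with divisor n − 1 is unbiased, S² with divisor n has expectation ((n − 1)/n)σ²) can be illustrated by simulation. This sketch uses made-up settings, not numbers from the text: a normal population with σ² = 4 and repeated samples of size 5.

```python
import random
import statistics

# Illustrative simulation: compare the two variance estimators over many samples.
random.seed(1)
sigma2, n, reps = 4.0, 5, 20000   # true variance 4, sample size 5

s2_vals, S2_vals = [], []
for _ in range(reps):
    sample = [random.gauss(0, 2.0) for _ in range(n)]
    s2_vals.append(statistics.variance(sample))   # divisor n - 1 (unbiased)
    S2_vals.append(statistics.pvariance(sample))  # divisor n (biased)

print(round(statistics.mean(s2_vals), 2))  # close to sigma^2 = 4
print(round(statistics.mean(S2_vals), 2))  # close to (n-1)/n * sigma^2 = 3.2
```

The averages land near 4 and 3.2 respectively, matching E(s²) = σ² and E(S²) = ((n − 1)/n)σ².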

Consistency
An estimator is said to be consistent if the statistic used as an estimator becomes closer and closer to the population parameter being estimated as the sample size increases. In other words, θ̂ is called a consistent estimator of θ if the probability that θ̂ becomes closer and closer to θ approaches unity with increasing sample size. Symbolically,

lim_{n→∞} P( |θ̂ − θ| < ε ) = 1

To prove that an estimator is consistent, we may state a criterion that is sometimes quite useful, as follows:
Let θ̂ be an estimator of θ based on a sample of size n. Then θ̂ is a consistent estimator of θ if Var(θ̂) → 0 as n → ∞.
A consistent estimator is unbiased in the limit, but an unbiased estimator may or may not be a consistent estimator.

Efficiency
An unbiased estimator is defined to be efficient if the variance of its sampling distribution is smaller than the variance of the sampling distribution of any other unbiased estimator of the same parameter. Suppose we have two unbiased estimators θ̂₁ and θ̂₂ of the same parameter θ; then θ̂₁ is said to be a more efficient estimator than θ̂₂ if Var(θ̂₁) < Var(θ̂₂).

Relative Efficiency
The relative efficiency is measured by the ratio

Ef = Var(θ̂₂) / Var(θ̂₁)

If Ef > 1, then θ̂₁ is more efficient than θ̂₂.
If Ef < 1, then θ̂₂ is more efficient than θ̂₁.

Efficiency of Biased Estimators

The efficiency of biased estimators of θ is compared on the basis of the Mean Square Error (MSE), which is defined as the expected value of the squared difference between the estimator and the true value of the population parameter:

MSE(θ̂) = E(θ̂ − θ)²

Theorem

Show that MSE(θ̂) = Var(θ̂) + (Bias)².
Proof: We know that

MSE(θ̂) = E(θ̂ − θ)²

Adding and subtracting E(θ̂):

MSE(θ̂) = E[ θ̂ − E(θ̂) + E(θ̂) − θ ]²
       = E[ {θ̂ − E(θ̂)} + {E(θ̂) − θ} ]²
       = E[ {θ̂ − E(θ̂)}² + {E(θ̂) − θ}² + 2{θ̂ − E(θ̂)}{E(θ̂) − θ} ]

Applying expectation, we get

MSE(θ̂) = E{θ̂ − E(θ̂)}² + {E(θ̂) − θ}² + 2{E(θ̂) − θ} E{θ̂ − E(θ̂)}

MSE(θ̂) = Var(θ̂) + (Bias)²

where E{θ̂ − E(θ̂)} = E(θ̂) − E(θ̂) = 0.
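The decomposition can be checked numerically on a small discrete sampling distribution for an estimator; all numbers below are hypothetical, chosen only to make the identity visible.

```python
# Made-up discrete sampling distribution for an estimator of theta = 10.
theta = 10.0
values = [8.0, 9.0, 11.0, 14.0]   # possible values of the estimator
probs  = [0.2, 0.3, 0.3, 0.2]     # hypothetical probabilities (sum to 1)

mean = sum(v * p for v, p in zip(values, probs))                  # E(theta_hat)
var = sum((v - mean) ** 2 * p for v, p in zip(values, probs))     # Var(theta_hat)
bias = mean - theta                                               # E(theta_hat) - theta
mse_direct = sum((v - theta) ** 2 * p for v, p in zip(values, probs))

# MSE computed directly equals Var + Bias^2 (4.44 + 0.16 = 4.6 here).
print(abs(mse_direct - (var + bias ** 2)) < 1e-9)  # True
```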

BLUE
An estimator that is linear, unbiased, and has minimum variance among all linear unbiased estimators of θ is called a best linear unbiased estimator, or BLUE for short.

Sufficiency
An estimator is defined to be sufficient if the statistic used as an estimator uses all the information that is contained in the sample. Any statistic that is not computed from all the values in the sample is not a sufficient estimator. Examples of sufficient estimators are x̄ and p̂.

Methods of Point Estimation

A point estimator of a parameter can be obtained by several methods, but here only three methods are described:
1. The Method of Maximum Likelihood
2. The Method of Moments
3. The Method of Least Squares

The Method of Maximum Likelihood

The method or principle of maximum likelihood, abbreviated ML, is to consider every possible value that the parameter might have and, for each value, to compute the probability that the given sample would have occurred if that were the true value of the parameter. That value of the parameter for which the probability of the given sample is greatest is chosen as the estimate.


Mathematical technique of finding maximum likelihood estimators

This technique consists of the following steps:
1. Obtain the likelihood function L(θ)
2. Take the natural log (ln) of the likelihood function L(θ)
3. Differentiate with respect to the parameter θ
4. Equate the derivative to zero and solve for the parameter θ

Q: Define the term likelihood function.
Let x₁, x₂, . . . , xₙ be a random sample from a probability distribution f(x; θ), where θ is a single unknown parameter. Then the joint distribution of x₁, x₂, . . . , xₙ, denoted by L(θ), is

L(θ) = f(x₁, x₂, . . . , xₙ; θ) = f(x₁; θ) f(x₂; θ) f(x₃; θ) . . . f(xₙ; θ) = ∏ᵢ₌₁ⁿ f(xᵢ; θ)

This joint probability function is called the likelihood function.

Properties of maximum likelihood estimators

Maximum likelihood estimators generally satisfy the criteria of a good estimator. They possess the following properties:
They are consistent and asymptotically efficient.
They are generally biased, but the bias becomes negligible as the sample size increases.
For large sample sizes, these estimators are approximately normally distributed.
Note: The other two methods are beyond the B.Sc syllabus.
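The four steps above can also be mimicked numerically when the calculus is inconvenient. A minimal sketch, using hypothetical Bernoulli data and a grid search in place of steps 3 and 4; the closed-form MLE for a Bernoulli proportion is the sample proportion x/n, so the search should land on (approximately) that value:

```python
import math

# Hypothetical 0/1 observations (7 successes out of 10).
sample = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]

def log_likelihood(p, data):
    # ln L(p) = sum of ln f(x_i; p) for the Bernoulli density p^x (1-p)^(1-x)
    return sum(x * math.log(p) + (1 - x) * math.log(1 - p) for x in data)

# Steps 3-4 done numerically: evaluate ln L(p) on a fine grid, take the argmax.
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=lambda p: log_likelihood(p, sample))

print(p_hat)                      # 0.7
print(sum(sample) / len(sample))  # analytic MLE x/n = 0.7
```

Maximizing ln L(θ) rather than L(θ) itself is step 2 in action: the logarithm turns the product into a sum without moving the location of the maximum.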

Interval Estimation or Estimation by Confidence Interval

A process of obtaining a range of values within which the true value of the unknown population parameter is expected to lie, with a certain degree of confidence, is called interval estimation or estimation by confidence interval.

Confidence Interval
The range of values is known as an interval, and the interval with which a 100(1 − α)% probability is associated that it will include the population parameter is termed a confidence interval.

Level of Confidence
The probability (1 − α) or 100(1 − α)% associated with the interval is known as the confidence coefficient or level of confidence. In practice its commonly used values are 90%, 95% and 99%.

Confidence Limits
The end points that bound the confidence interval are called the lower and upper confidence limits for the unknown parameter. These limits are random variables because they are functions of the sample observations, which are randomly selected from the population.
The difference between the upper confidence limit and the lower confidence limit is called the precision of the estimate. The shorter the confidence interval, the more precise the estimate. The precision can be increased by
increasing the sample size n, or
decreasing the level of confidence.

Confidence Interval Estimates

The confidence interval estimates are:
1. Confidence Interval Estimate for Population Mean μ
2. Confidence Interval Estimate for Difference of Population Means (Independent Samples)
3. Confidence Interval Estimate for Population Proportion p
4. Confidence Interval Estimate for Difference between Proportions
5. Confidence Interval Estimate for Difference of Population Means (Dependent Samples)
6. Confidence Interval Estimate for Population Correlation Coefficient ρ
7. Confidence Interval Estimate for α (the population intercept of the regression line)
8. Confidence Interval Estimate for β (the population regression coefficient)
9. Confidence Interval Estimate for σ²
10. Confidence Interval Estimate for the Variance Ratio σ₁²/σ₂²

Confidence Interval Estimate for Population Mean

To compute a confidence interval for the population mean μ, we have to consider population normality, the population standard deviation and the sample size. We discuss these different cases below:
Normal population with known σ
Normal population with unknown σ
Non-normal population with known or unknown σ

Normal Population with known σ

Let x₁, x₂, . . . , xₙ be a random sample of size n drawn from a normal population with an unknown mean μ and a known standard deviation σ. Then the sampling distribution of the mean x̄ will be normal, regardless of the sample size, with mean μ and standard deviation σ/√n; the standard normal variable is

Z = (x̄ − μ)/(σ/√n)

Then, according to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2} to Z_{α/2} is equal to 1 − α:

P( −Z_{α/2} ≤ Z ≤ Z_{α/2} ) = 1 − α

Putting in Z, we get

P( −Z_{α/2} ≤ (x̄ − μ)/(σ/√n) ≤ Z_{α/2} ) = 1 − α

Multiplying by σ/√n:

P( −Z_{α/2} σ/√n ≤ x̄ − μ ≤ Z_{α/2} σ/√n ) = 1 − α

Subtracting x̄ from each term, we have

P( −x̄ − Z_{α/2} σ/√n ≤ −μ ≤ −x̄ + Z_{α/2} σ/√n ) = 1 − α

Now multiplying by −1 (the inequality signs reverse):

P( x̄ + Z_{α/2} σ/√n ≥ μ ≥ x̄ − Z_{α/2} σ/√n ) = 1 − α

which is equivalent to

P( x̄ − Z_{α/2} σ/√n ≤ μ ≤ x̄ + Z_{α/2} σ/√n ) = 1 − α

For a particular sample of size n, a 100(1 − α)% confidence interval for μ is given by

( x̄ − Z_{α/2} σ/√n , x̄ + Z_{α/2} σ/√n )

which may be expressed more compactly as

x̄ ± Z_{α/2} σ/√n
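As a worked sketch of the final formula with hypothetical numbers (x̄ = 72, σ = 8, n = 64, and 95% confidence so Z_{α/2} = 1.96):

```python
from math import sqrt

# Hypothetical data: 95% confidence interval for mu with sigma known,
# using x_bar +/- z_{alpha/2} * sigma / sqrt(n).
x_bar, sigma, n = 72.0, 8.0, 64
z_half_alpha = 1.96                  # cuts off alpha/2 = 0.025 in each tail

margin = z_half_alpha * sigma / sqrt(n)   # 1.96 * 8 / 8 = 1.96
lower, upper = x_bar - margin, x_bar + margin

print((round(lower, 2), round(upper, 2)))  # (70.04, 73.96)
```

Quadrupling n halves the margin, which is the precision/sample-size trade-off described earlier.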

Normal Population with unknown σ

Let x₁, x₂, . . . , xₙ be a random sample of size n drawn from a normal population with σ unknown; we then estimate σ by the sample standard deviation S, which is used in its place. If the sample size is sufficiently large (n ≥ 30), then by the central limit theorem the sampling distribution of x̄ is approximately normal with mean μ and standard deviation S/√n; the standard normal variable is

Z = (x̄ − μ)/(S/√n)

Then, according to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2} to Z_{α/2} is equal to 1 − α:

P( −Z_{α/2} ≤ Z ≤ Z_{α/2} ) = 1 − α

Putting in Z, we get

P( −Z_{α/2} ≤ (x̄ − μ)/(S/√n) ≤ Z_{α/2} ) = 1 − α

Multiplying by S/√n:

P( −Z_{α/2} S/√n ≤ x̄ − μ ≤ Z_{α/2} S/√n ) = 1 − α

Subtracting x̄ from each term, we have

P( −x̄ − Z_{α/2} S/√n ≤ −μ ≤ −x̄ + Z_{α/2} S/√n ) = 1 − α

Now multiplying by −1 (the inequality signs reverse):

P( x̄ + Z_{α/2} S/√n ≥ μ ≥ x̄ − Z_{α/2} S/√n ) = 1 − α

which is equivalent to

P( x̄ − Z_{α/2} S/√n ≤ μ ≤ x̄ + Z_{α/2} S/√n ) = 1 − α

For a particular sample of size n, a 100(1 − α)% confidence interval for μ is given by

( x̄ − Z_{α/2} S/√n , x̄ + Z_{α/2} S/√n )

which may be expressed more compactly as

x̄ ± Z_{α/2} S/√n

Note: When σ is unknown and the sample size is small (n < 30), the sampling distribution of x̄ follows the t-distribution. We shall discuss this case in a later chapter.

Non-normal Population with known or unknown σ (Large Sample)

When the population is non-normal, then for a large sample (n > 30) the central limit theorem tells us that the sampling distribution of the mean x̄ is approximately normal, with standard normal variable

Z = (x̄ − μ)/(σ/√n)

Therefore an approximate 100(1 − α)% confidence interval for the mean μ of a non-normal population with known σ is given by

x̄ ± Z_{α/2} σ/√n

If σ is unknown and is estimated by the sample standard deviation S, the confidence interval estimate for μ becomes

x̄ ± Z_{α/2} S/√n

If sampling is done without replacement from a finite population of size N and the sample size n is a sizeable fraction of the population size (commonly taken as n/N greater than 5%), then the confidence interval is

x̄ ± Z_{α/2} (σ/√n) √((N − n)/(N − 1))

Confidence Interval Estimate for Difference of Population Means

To construct the confidence interval for the difference between two means (μ₁ − μ₂), the following three cases are to be considered:
Both populations are normal with known standard deviations.
Both populations are normal with unknown standard deviations.
Both populations are non-normal with known or unknown standard deviations.

Normal Populations with known Standard Deviations

Suppose we have two normal populations. Population 1 has an unknown mean μ₁ and a known standard deviation σ₁, and population 2 has an unknown mean μ₂ and a known standard deviation σ₂. Independent samples of sizes n₁ and n₂ are taken from the populations and the sample means calculated. Let the sample means be denoted by x̄₁ and x̄₂. Then the sampling distribution of the difference x̄₁ − x̄₂ is normally distributed with mean μ₁ − μ₂ and standard deviation √(σ₁²/n₁ + σ₂²/n₂), with standard normal variable

Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂)

Then, according to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2} to Z_{α/2} is equal to 1 − α:

P( −Z_{α/2} ≤ Z ≤ Z_{α/2} ) = 1 − α

Putting in Z, we get

P( −Z_{α/2} ≤ [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂) ≤ Z_{α/2} ) = 1 − α

Multiplying by √(σ₁²/n₁ + σ₂²/n₂), we get

P( −Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ≤ (x̄₁ − x̄₂) − (μ₁ − μ₂) ≤ Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ) = 1 − α

Subtracting (x̄₁ − x̄₂), we get

P( −(x̄₁ − x̄₂) − Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ≤ −(μ₁ − μ₂) ≤ −(x̄₁ − x̄₂) + Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ) = 1 − α

Multiplying by −1 (the inequality signs reverse):

P( (x̄₁ − x̄₂) + Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ≥ (μ₁ − μ₂) ≥ (x̄₁ − x̄₂) − Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ) = 1 − α

which is equivalent to

P( (x̄₁ − x̄₂) − Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ≤ (μ₁ − μ₂) ≤ (x̄₁ − x̄₂) + Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂) ) = 1 − α

Hence the 100(1 − α)% confidence interval for particular samples obtained for (μ₁ − μ₂) is

(x̄₁ − x̄₂) ± Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂)
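A worked sketch of the final formula with hypothetical two-sample numbers (all values below are made up for illustration):

```python
from math import sqrt

# Hypothetical data: 95% confidence interval for mu1 - mu2 with both
# population standard deviations known.
x1, sigma1, n1 = 80.0, 6.0, 36
x2, sigma2, n2 = 75.0, 9.0, 49
z = 1.96

se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)   # sqrt(36/36 + 81/49)
diff = x1 - x2

print((round(diff - z * se, 2), round(diff + z * se, 2)))  # (1.81, 8.19)
```

Since the whole interval lies above zero, these (hypothetical) data would suggest μ₁ > μ₂ at the 95% level.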

Normal Populations with unknown Standard Deviations

When independent samples of sizes n₁ and n₂ are drawn from normal populations with unknown standard deviations, we estimate them by the respective sample standard deviations. If the sample sizes are sufficiently large, then we can assume that the sampling distribution of the difference (x̄₁ − x̄₂) is approximately normal with mean μ₁ − μ₂ and standard deviation √(S₁²/n₁ + S₂²/n₂), with standard normal variable

Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(S₁²/n₁ + S₂²/n₂)

Then, according to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2} to Z_{α/2} is equal to 1 − α:

P( −Z_{α/2} ≤ Z ≤ Z_{α/2} ) = 1 − α

Putting in Z, we get

P( −Z_{α/2} ≤ [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(S₁²/n₁ + S₂²/n₂) ≤ Z_{α/2} ) = 1 − α

Multiplying by √(S₁²/n₁ + S₂²/n₂), we get

P( −Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ≤ (x̄₁ − x̄₂) − (μ₁ − μ₂) ≤ Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ) = 1 − α

Subtracting (x̄₁ − x̄₂), we get

P( −(x̄₁ − x̄₂) − Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ≤ −(μ₁ − μ₂) ≤ −(x̄₁ − x̄₂) + Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ) = 1 − α

Multiplying by −1 (the inequality signs reverse):

P( (x̄₁ − x̄₂) + Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ≥ (μ₁ − μ₂) ≥ (x̄₁ − x̄₂) − Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ) = 1 − α

which is equivalent to

P( (x̄₁ − x̄₂) − Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ≤ (μ₁ − μ₂) ≤ (x̄₁ − x̄₂) + Z_{α/2} √(S₁²/n₁ + S₂²/n₂) ) = 1 − α

Hence the 100(1 − α)% confidence interval for particular samples obtained for (μ₁ − μ₂) is

(x̄₁ − x̄₂) ± Z_{α/2} √(S₁²/n₁ + S₂²/n₂)

Note: When the sample sizes are small and the populations have unknown but equal standard deviations, we use Student's t-distribution. We shall discuss this case in a later chapter.

Non-normal Populations with known or unknown Standard Deviations

If the sample sizes are sufficiently large, then by the central limit theorem the sampling distribution of the difference (x̄₁ − x̄₂) will be approximately normal even though the populations are non-normal. An approximate 100(1 − α)% confidence interval for (μ₁ − μ₂) when the population standard deviations are known is

(x̄₁ − x̄₂) ± Z_{α/2} √(σ₁²/n₁ + σ₂²/n₂)

If the population standard deviations are unknown, then they are estimated by the sample standard deviations. The approximate 100(1 − α)% confidence interval for (μ₁ − μ₂) is then given by

(x̄₁ − x̄₂) ± Z_{α/2} √(S₁²/n₁ + S₂²/n₂)

Confidence Interval for Population Proportion (Large Sample)

Let a random sample of size n (n ≥ 30) be drawn from a binomial population with an unknown proportion of successes p, and let the sample proportion be p̂ = x/n. We wish to estimate p by an interval computed from the sample data.
We know that the sampling distribution of a sample proportion p̂ is approximately normal with mean p and standard deviation √(pq/n), provided the sample size is sufficiently large and p is not too close to zero or one. Thus the standard normal random variable is

Z = (p̂ − p)/√(pq/n)

Then, according to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2} to Z_{α/2} is equal to 1 − α:

P( −Z_{α/2} ≤ Z ≤ Z_{α/2} ) = 1 − α

Putting in Z, we get

P( −Z_{α/2} ≤ (p̂ − p)/√(pq/n) ≤ Z_{α/2} ) = 1 − α

Multiplying by √(pq/n), we get

P( −Z_{α/2} √(pq/n) ≤ p̂ − p ≤ Z_{α/2} √(pq/n) ) = 1 − α

Subtracting p̂, we get

P( −p̂ − Z_{α/2} √(pq/n) ≤ −p ≤ −p̂ + Z_{α/2} √(pq/n) ) = 1 − α

Multiplying by −1 (the inequality signs reverse):

P( p̂ + Z_{α/2} √(pq/n) ≥ p ≥ p̂ − Z_{α/2} √(pq/n) ) = 1 − α

which is equivalent to

P( p̂ − Z_{α/2} √(pq/n) ≤ p ≤ p̂ + Z_{α/2} √(pq/n) ) = 1 − α

Hence the 100(1 − α)% confidence interval for particular samples obtained for p is

p̂ ± Z_{α/2} √(pq/n)

Note: The standard error of the sample proportion involves the unknown p. For large sample sizes this difficulty is overcome by using the sample proportion p̂ in place of p. Hence the confidence interval is

p̂ ± Z_{α/2} √(p̂q̂/n)
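A worked sketch of the large-sample interval with hypothetical counts (x = 120 successes in n = 400 trials, 95% confidence):

```python
from math import sqrt

# Hypothetical data: 95% confidence interval for a population proportion,
# using p_hat +/- z * sqrt(p_hat * q_hat / n) with p_hat in place of p.
x, n = 120, 400
z = 1.96

p_hat = x / n                     # 0.30
q_hat = 1 - p_hat
se = sqrt(p_hat * q_hat / n)      # sqrt(0.30 * 0.70 / 400) ~= 0.0229

print((round(p_hat - z * se, 3), round(p_hat + z * se, 3)))  # (0.255, 0.345)
```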

Confidence Interval for the Difference between Proportions (Large Samples)

Suppose there are two binomial populations with unknown proportions of successes p₁ and p₂ respectively. Let p̂₁ be the proportion of successes based on a random sample of size n₁ drawn from the first population, and p̂₂ the proportion of successes based on a random sample of size n₂ drawn from the second population. Then the sampling distribution of the difference (p̂₁ − p̂₂) will be approximately normal with mean p₁ − p₂ and standard deviation √(p₁q₁/n₁ + p₂q₂/n₂), with standard normal variable

Z = [(p̂₁ − p̂₂) − (p₁ − p₂)] / √(p₁q₁/n₁ + p₂q₂/n₂)

Then, according to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2} to Z_{α/2} is equal to 1 − α:

P( −Z_{α/2} ≤ Z ≤ Z_{α/2} ) = 1 − α

Putting in Z, we get

P( −Z_{α/2} ≤ [(p̂₁ − p̂₂) − (p₁ − p₂)] / √(p₁q₁/n₁ + p₂q₂/n₂) ≤ Z_{α/2} ) = 1 − α

Multiplying by √(p₁q₁/n₁ + p₂q₂/n₂), we get

P( −Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ≤ (p̂₁ − p̂₂) − (p₁ − p₂) ≤ Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ) = 1 − α

Subtracting (p̂₁ − p̂₂), we get

P( −(p̂₁ − p̂₂) − Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ≤ −(p₁ − p₂) ≤ −(p̂₁ − p̂₂) + Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ) = 1 − α

Multiplying by −1 (the inequality signs reverse):

P( (p̂₁ − p̂₂) + Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ≥ (p₁ − p₂) ≥ (p̂₁ − p̂₂) − Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ) = 1 − α

which is equivalent to

P( (p̂₁ − p̂₂) − Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ≤ (p₁ − p₂) ≤ (p̂₁ − p̂₂) + Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂) ) = 1 − α

Hence the 100(1 − α)% confidence interval for particular samples of sizes n₁ and n₂ is

(p̂₁ − p̂₂) ± Z_{α/2} √(p₁q₁/n₁ + p₂q₂/n₂)

Note: The standard error of (p̂₁ − p̂₂) involves the unknown parameters p₁ and p₂; we therefore replace p₁ and p₂ with their sample estimates p̂₁ and p̂₂. Hence, for sufficiently large samples, an approximate confidence interval is

(p̂₁ − p̂₂) ± Z_{α/2} √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)

Note: Confidence interval estimates 5 to 10 will be discussed in further chapters.

One-sided Confidence Interval

Sometimes we wish to find only an upper or a lower confidence limit for the parameter, that is, a one-sided interval. In such a case, the entire area α is located at one end of the sampling distribution.

Sample size for Estimating Population Mean

The 100(1 − α)% confidence interval for μ is given by

x̄ − Z_{α/2} σ/√n < μ < x̄ + Z_{α/2} σ/√n

which may be written as

|x̄ − μ| = Z_{α/2} σ/√n

when sampling is done with replacement or the population is very large (infinite); here σ/√n is the standard error of x̄. The quantity |x̄ − μ| is called the error of the estimator x̄, denoted by e. Then

e = Z_{α/2} σ/√n   or   √n = Z_{α/2} σ/e

Squaring both sides, we get

n = ( Z_{α/2} σ / e )²

Note: The population standard deviation σ is usually unknown; its estimate is then found either from past experience or from a pilot sample of size n > 30.

Similarly, when sampling is performed without replacement from a finite population of size N, the standard error of the sampling distribution of x̄ is

σ_x̄ = (σ/√n) √((N − n)/(N − 1))

In this case the 100(1 − α)% error bound for estimating μ becomes

e = Z_{α/2} (σ/√n) √((N − n)/(N − 1))

Squaring both sides, we get

e² = (1/n) Z²_{α/2} σ² (N − n)/(N − 1)

n e² (N − 1) = Z²_{α/2} σ² (N − n) = N Z²_{α/2} σ² − n Z²_{α/2} σ²

n [ e²(N − 1) + Z²_{α/2} σ² ] = N Z²_{α/2} σ²

n = N Z²_{α/2} σ² / [ e²(N − 1) + Z²_{α/2} σ² ]

Sample size for Estimating Population Proportion

For a large sample, the confidence interval for p is given by

p̂ ± Z_{α/2} √(pq/n)

This implies that

e = Z_{α/2} √(pq/n)

Therefore, solving for n, we obtain

n = Z²_{α/2} pq / e²

Chapter 3

Testing of Hypothesis

Prepared by Noman Rasheed

Statistical Hypothesis
A statistical hypothesis is a statement or assumption about a characteristic of one or more populations which may or may not be true, and its validity is checked on the basis of a random sample selected from the population.

Null and Alternative Hypotheses

A null hypothesis, generally denoted by the symbol H₀, is the hypothesis which is to be tested for possible rejection under the assumption that it is true.
An alternative hypothesis is any other hypothesis which we accept when the null hypothesis H₀ is rejected; it is denoted by H₁ or H_A.

Simple and Composite Hypotheses

A simple hypothesis is one in which all the parameters of the population distribution are specified, while a hypothesis is said to be composite when not all the parameters of the population distribution are specified.
For example: suppose that the age distribution of first-year college students is normally distributed with mean μ and variance 25. Then H₀: μ = 16 is a simple hypothesis because it completely specifies the distribution. On the other hand, H₁: μ < 16 or H₁: μ > 16 is a composite hypothesis, which does not completely specify the distribution.

Test Statistic
A sample statistic on which the decision of accepting or rejecting the null hypothesis is based is called a test statistic. Every test statistic has a probability (sampling) distribution which gives the probability of obtaining a specified value of the test statistic when the null hypothesis is true. The sampling distributions of the most commonly used test statistics are Z, t, χ² and F.

Rejection and Acceptance Region

The rejection (critical) region is that part of the sampling distribution of a statistic for which H₀ is rejected. In this case the results from the sample are not consistent with the null hypothesis when H₀ is true.
The acceptance region is that part of the sampling distribution of a statistic for which H₀ is accepted. In this case the results from the sample are consistent with H₀.


Critical Value
The value(s) separating the critical region from the acceptance region is (are) called the critical value(s).

Type-I and Type-II Errors

Whenever we reject or accept a statistical hypothesis on the basis of sample data, there is a possibility that the sample evidence may lead us to make a wrong decision. There are two types of wrong decision, called Type-I and Type-II errors.
Type-I error: when a true null hypothesis is rejected, we commit an error called a Type-I error.
Type-II error: when a false null hypothesis is accepted, we commit an error called a Type-II error.

Probability of Type-I and Type-II Errors

The probability of making a Type-I error, or the level of significance, is conventionally denoted by α. It is a small pre-assigned value used as a standard for rejecting a null hypothesis H₀ while it is assumed to be true. The value α is also known as the size of the critical region. The most frequently used values of α, the level of significance, are 0.05 and 0.01.
The probability of committing a Type-II error (accepting a false null hypothesis) is denoted by β. In symbols, we may write

α = P(Type-I error) = P(rejecting H₀ | H₀ is true)
β = P(Type-II error) = P(accepting H₀ | H₀ is false)

Relation between α and β

There is an inverse relationship between α and β. When α becomes smaller, β tends to become larger, and when α becomes larger, β tends to become smaller. We can reduce both α and β by increasing the sample size.
Note: In general, α + β ≠ 1.

Power of a Test
The power of a test with respect to a specified alternative hypothesis is the probability of rejecting the null hypothesis when it is actually false. In other words, the power is the complement of β. Symbolically,

Power = P(reject H₀ | H₀ is false)
Power = 1 − β

Note: The power generally increases with an increase in the sample size. A test for which β is small is defined to be a powerful test.

Operating Characteristic Curve

A curve giving the probabilities of making Type-II errors for various parametric values under the alternative hypothesis is called an operating characteristic curve, or simply the OC curve.

Power Curve
The power curve, which may be regarded as the complement of the OC curve, shows the probabilities of rejecting the null hypothesis H₀ for various values of the parameter.

Test of Significance
The method which makes it possible, using sample observations, either to accept or reject the null hypothesis at a level of significance that is not already given but is decided according to the situation of the problem is called a test of significance.

One-Tailed and Two-Tailed Tests

A test is called a one-tailed (one-sided) test if the critical region is located in only one tail (either left or right) of the sampling distribution of the test statistic. A one-tailed test is used when the alternative hypothesis H₁ is formulated as H₁: θ > θ₀ or H₁: θ < θ₀.
A test is called a two-tailed (two-sided) test if the critical region is located equally in both tails of the sampling distribution of the test statistic. In this case H₁ will be H₁: θ ≠ θ₀.

P-Value
The P-value for a test of hypothesis is defined as the smallest level of significance at which the null hypothesis is rejected, or the largest level of significance at which the null hypothesis is accepted. The P-value enables us to test a hypothesis without first specifying a value of α.

Formulation of Hypotheses
The hypotheses must be formulated in such a way that when one is true, the other is false (H₀ and H₁ are opposites). An equality sign is always used in the null hypothesis, and any one of the signs =, ≥, ≤ is used in its formulation. An equality sign is never used in the alternative hypothesis, and any one of the signs ≠, >, < is used in its formulation.
If θ is a population parameter and θ₀ is the specific value to be tested, then the null and alternative hypotheses take the form:

Null Hypothesis        Alternative Hypothesis
If H₀: θ = θ₀          Then H₁: θ ≠ θ₀, H₁: θ > θ₀, or H₁: θ < θ₀
If H₀: θ ≥ θ₀          Then H₁: θ < θ₀
If H₀: θ ≤ θ₀          Then H₁: θ > θ₀

Q:Explain the general procedure for testing the hypothesis?


Ans: The procedure for testing the a hypothesis about a population parameter involves the following six
steps.
1. State our problem and formulate an appropriate null and alternative hypothesis which may take the
form
Null Hypothesis
If H0 : = 0
If H0 : 0
If H0 : 0

Alternative Hypothesis
Then H1 : 6= 0 , H1 : > 0 , H1 : < 0
Then H1 : < 0
Then H1 : > 0

2. Decide the level of significance , the probability of type-I error. The most common value of 0.05 or

38

CHAPTER 3. TESTING OF HYPOTHESIS


0.01 is used.
3. Choose an appropriate test-statistic, determine and sketch the sampling distribution of the teststatistic, assuming H0 is true.
4. Compute the value of the test-statistic from the sample data in order to decide whether to accept or
reject the null hypothesis H0 .
5. Determine the rejection or critical region in such a way that the probability of rejecting the null hypothesis H₀, if it is true, is equal to the significance level α. The location of the critical region depends upon the form of H₁: it may be located on one side (either left or right) or on both sides of the distribution.
6. Draw a conclusion: if the calculated value lies in the rejection region, we reject H₀ and accept H₁; if the calculated value lies in the acceptance region, we accept H₀ and reject H₁.

Tests Based on the Normal Distribution


In this chapter, we deal with the following tests of hypothesis.
1. Testing hypothesis about Population Mean.
2. Testing hypothesis about the difference between two Population Means (Independent Samples).
3. Testing hypothesis about Population Proportions.
4. Testing hypothesis about difference between two Population Proportions.

Testing hypothesis about Population Mean.


These tests can be classified as:
Testing a hypothesis about the Mean of a Normal Population when σ is known.
Testing a hypothesis about the Mean of a Normal Population when σ is unknown and n ≥ 30.
Testing a hypothesis about the Mean of a Non-Normal Population with a large sample (n ≥ 30).
Q: Explain the procedure for testing a hypothesis about the Mean of a Normal Population when σ is known?
Ans: Suppose a random sample of size n is drawn from a normal population with mean μ having a specified value μ₀ and a known standard deviation σ. The sample mean is x̄. We wish to determine whether the sample accords with the hypothesis that the population mean μ has the specified value μ₀. For this purpose, we employ the normal distribution test Z = (x̄ − μ₀)/(σ/√n), and the procedure is:
1. Formulate the null and alternative hypothesis about μ. Three possible forms are
a) H₀: μ = μ₀ and H₁: μ ≠ μ₀
b) H₀: μ ≤ μ₀ and H₁: μ > μ₀
c) H₀: μ ≥ μ₀ and H₁: μ < μ₀
2. Decide the level of significance α (take α = 0.05 or 0.01).
3. The test statistic in this case is Z = (x̄ − μ₀)/(σ/√n).

4. Calculate the value of Z from the sample data.


5. Determine the rejection region, which depends on the alternative hypothesis, and can be described as:
When the Alternative Hypothesis is      The rejection region will be
a) H₁: μ ≠ μ₀ (two-sided)               Z(calculated) > Z(α/2) or Z(calculated) < −Z(α/2)
b) H₁: μ < μ₀ (one-sided)               Z(calculated) < −Z(α)
c) H₁: μ > μ₀ (one-sided)               Z(calculated) > Z(α)

6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
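The steps above can be sketched in a few lines of stdlib-only Python. This is a minimal illustration, not part of the original notes; the sample figures (x̄ = 52, μ₀ = 50, σ = 6, n = 36) are hypothetical, and the standard normal CDF is computed from the error function:

```python
from math import erf, sqrt

def z_test_mean(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """One-sample Z-test for a population mean when sigma is known.

    Returns the calculated Z, the P-value, and whether H0 is rejected.
    """
    z = (xbar - mu0) / (sigma / sqrt(n))
    # Standard normal CDF via the error function (stdlib only).
    phi = lambda x: 0.5 * (1 + erf(x / sqrt(2)))
    if tail == "two":
        p = 2 * (1 - phi(abs(z)))   # H1: mu != mu0
    elif tail == "right":
        p = 1 - phi(z)              # H1: mu > mu0
    else:
        p = phi(z)                  # H1: mu < mu0
    return z, p, p < alpha

# Hypothetical data: n = 36, sample mean 52, testing H0: mu = 50 with sigma = 6.
z, p, reject = z_test_mean(52, 50, 6, 36)
```

Here Z = (52 − 50)/(6/√36) = 2.0, and since the two-sided P-value falls below α = 0.05, H₀ is rejected.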

Q: Explain the procedure for testing a hypothesis about the Mean of a Normal Population when σ is unknown and n ≥ 30?
Ans: Suppose a random sample of size n is drawn from a normal population with mean μ having a specified value μ₀ and an unknown standard deviation σ. The sample mean is x̄. We wish to determine whether the sample accords with the hypothesis that the population mean μ has the specified value μ₀. As the population standard deviation σ is unknown, the sample standard deviation S is used as an estimate. For a large sample (n ≥ 30), the central limit theorem allows us to assume that the sampling distribution of x̄ is approximately normal with a mean of μ and a standard deviation of S/√n; then the standard normal variable is Z = (x̄ − μ₀)/(S/√n), and the testing procedure is:
1. Formulate the null and alternative hypothesis about μ. Three possible forms are
a) H₀: μ = μ₀ and H₁: μ ≠ μ₀
b) H₀: μ ≤ μ₀ and H₁: μ > μ₀
c) H₀: μ ≥ μ₀ and H₁: μ < μ₀
2. Decide the level of significance α (take α = 0.05 or 0.01).
3. The test statistic in this case is Z = (x̄ − μ₀)/(S/√n).

4. Calculate the value of Z from the sample data.


5. Determine the rejection region, which depends on the alternative hypothesis, and can be described as:
When the Alternative Hypothesis is      The rejection region will be
a) H₁: μ ≠ μ₀ (two-sided)               Z(calculated) > Z(α/2) or Z(calculated) < −Z(α/2)
b) H₁: μ < μ₀ (one-sided)               Z(calculated) < −Z(α)
c) H₁: μ > μ₀ (one-sided)               Z(calculated) > Z(α)

6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
Q: Explain the procedure for testing a hypothesis about the Mean of a Population when σ is known or unknown, the Population is Non-Normal, and n ≥ 30?
Ans: The central limit theorem tells us that for a large sample size, the sampling distribution of x̄ is approximately normal even though the parent population is non-normal, whether σ is known or unknown.
If σ is known, the statistic is Z = (x̄ − μ₀)/(σ/√n) and the testing procedure is the same as mentioned above.
If σ is unknown, the statistic is Z = (x̄ − μ₀)/(S/√n) and the testing procedure is the same as mentioned above.

Testing hypothesis about difference between two Population Means


To test hypothesis about the difference between two population means, we deal with the following three
cases.
Both the populations are normal with known standard deviations.
Both the populations are normal with unknown standard deviations.
Both the populations are non-normal with large sample sizes.
Q: Explain the procedure for testing a hypothesis about the difference between two population means when both populations are normal and the population standard deviations are known?
Ans: Let x̄₁ be the mean of the first random sample of size n₁ from a normal population with a mean of μ₁ and a known standard deviation σ₁, and let x̄₂ be the mean of the second random sample of size n₂ from another normal population with a mean of μ₂ and a known standard deviation σ₂. Then the sampling distribution of the difference (x̄₁ − x̄₂) is normally distributed with mean μ₁ − μ₂ and standard deviation √(σ₁²/n₁ + σ₂²/n₂).


CHAPTER 3. TESTING OF HYPOTHESIS

In other words, the variable
Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂)
is exactly a standard normal variable even when the sample sizes are small. Hence it is used as the test-statistic for testing the hypothesis about the difference between two population means. The procedure is
1. Formulate the null and alternative hypothesis from the following three forms.
a) H₀: μ₁ − μ₂ = ∆₀ and H₁: μ₁ − μ₂ ≠ ∆₀   (∆₀ may be equal to zero)
b) H₀: μ₁ − μ₂ ≤ ∆₀ and H₁: μ₁ − μ₂ > ∆₀
c) H₀: μ₁ − μ₂ ≥ ∆₀ and H₁: μ₁ − μ₂ < ∆₀
2. Decide the level of significance α (take α = 0.01 or 0.05).
3. The test statistic Z, under H₀, becomes
Z = [(x̄₁ − x̄₂) − ∆₀] / √(σ₁²/n₁ + σ₂²/n₂)

4. Compute the value of Z from the sample data.


5. Determine the rejection region, which depends upon the alternative hypothesis, and can be described as:
When the Alternative Hypothesis is          The rejection region will be
a) H₁: μ₁ − μ₂ ≠ ∆₀ (two-sided)             Z(calculated) > Z(α/2) or Z(calculated) < −Z(α/2)
b) H₁: μ₁ − μ₂ < ∆₀ (one-sided)             Z(calculated) < −Z(α)
c) H₁: μ₁ − μ₂ > ∆₀ (one-sided)             Z(calculated) > Z(α)

6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
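As a minimal numerical sketch of this two-sample Z-test (all figures below are hypothetical, and ∆₀ is taken as zero):

```python
from math import sqrt

# Hypothetical data: two samples with known population standard deviations.
n1, n2 = 40, 50
xbar1, xbar2 = 20.5, 19.0
sigma1, sigma2 = 3.0, 4.0
delta0 = 0.0                      # H0: mu1 - mu2 = 0

# Standard error of the difference of means.
se = sqrt(sigma1**2 / n1 + sigma2**2 / n2)
z = ((xbar1 - xbar2) - delta0) / se

# Two-sided test at alpha = 0.05: reject H0 when |Z| > 1.96.
reject = abs(z) > 1.96
```

With these figures Z ≈ 2.03, which exceeds 1.96, so H₀: μ₁ − μ₂ = 0 would be rejected at the 5% level.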
Q: Explain the procedure for testing a hypothesis about the difference between two population means when both populations are normal and the population standard deviations are unknown?
Ans: Let x̄₁ be the mean of the first random sample of size n₁ from a normal population with a mean of μ₁ and an unknown standard deviation σ₁, and let x̄₂ be the mean of the second random sample of size n₂ from another normal population with a mean of μ₂ and an unknown standard deviation σ₂. Here σ₁ and σ₂ are both unknown, therefore they are estimated by the sample standard deviations. For sufficiently large sample sizes (n₁, n₂ > 30), the sampling distribution of (x̄₁ − x̄₂) is approximately normal with mean μ₁ − μ₂ and standard deviation √(S₁²/n₁ + S₂²/n₂), where S₁² is the variance of the first sample and S₂² is the variance of the second sample. The standard normal variable is
Z = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / √(S₁²/n₁ + S₂²/n₂)

The testing procedure is
1. Formulate the null and alternative hypothesis from the following three forms.
a) H₀: μ₁ − μ₂ = ∆₀ and H₁: μ₁ − μ₂ ≠ ∆₀   (∆₀ may be equal to zero)
b) H₀: μ₁ − μ₂ ≤ ∆₀ and H₁: μ₁ − μ₂ > ∆₀
c) H₀: μ₁ − μ₂ ≥ ∆₀ and H₁: μ₁ − μ₂ < ∆₀
2. Decide the level of significance α (take α = 0.01 or 0.05).
3. The test statistic Z, under H₀, becomes
Z = [(x̄₁ − x̄₂) − ∆₀] / √(S₁²/n₁ + S₂²/n₂)
4. Compute the value of Z from the sample data.
5. Determine the rejection region, which depends upon the alternative hypothesis, and can be described as:
When the Alternative Hypothesis is          The rejection region will be
a) H₁: μ₁ − μ₂ ≠ ∆₀ (two-sided)             Z(calculated) > Z(α/2) or Z(calculated) < −Z(α/2)
b) H₁: μ₁ − μ₂ < ∆₀ (one-sided)             Z(calculated) < −Z(α)
c) H₁: μ₁ − μ₂ > ∆₀ (one-sided)             Z(calculated) > Z(α)

6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
Q: Explain the procedure for testing a hypothesis about the difference between two population means when both populations are non-normal and the population standard deviations are known or unknown but the sample sizes are large?
Ans: When both populations are non-normal but the sample sizes are sufficiently large, the central limit theorem tells us that the sampling distribution of (x̄₁ − x̄₂) will be approximately normal, whether or not the population standard deviations are known.
If σ₁ and σ₂ are known, then Z = [(x̄₁ − x̄₂) − ∆₀] / √(σ₁²/n₁ + σ₂²/n₂) and the testing procedure is the same as mentioned above.
If σ₁ and σ₂ are unknown, then Z = [(x̄₁ − x̄₂) − ∆₀] / √(S₁²/n₁ + S₂²/n₂) and the testing procedure is the same as mentioned above.

Q: Explain the procedure for testing a hypothesis about a population proportion when the sample size is large?
Ans: Let p̂ be the proportion of successes in a sample of size n drawn from a binomial population having proportion of successes p. If the sample size n is sufficiently large, then p̂ will be approximately normally distributed with a mean p and a standard deviation √(pq/n), where q = 1 − p. In other words, if the sample is large, then the standard variable
Z = (p̂ − p) / √(pq/n)
is approximately normal. Since p̂ = x/n, where x is the actual number of successes in the random sample, the standard variable becomes
Z = (p̂ − p) / √(pq/n) = (x − np) / √(npq)
Suppose we want to test a specified value p₀ of the population proportion; then
Z = (p̂ − p₀) / √(p₀q₀/n) = (x − np₀) / √(np₀q₀)
This is used as the test-statistic for testing the population proportion, and the testing procedure is
1. Formulate the null and alternative hypothesis from the following three forms.
a) H₀: p = p₀ and H₁: p ≠ p₀
b) H₀: p ≤ p₀ and H₁: p > p₀
c) H₀: p ≥ p₀ and H₁: p < p₀
2. Decide the level of significance α (take α = 0.01 or 0.05).



3. The test statistic Z, under H₀, becomes
Z = (x − np₀) / √(np₀q₀)          (without continuity correction)
Z = [(x ± 0.5) − np₀] / √(np₀q₀)   (with continuity correction)
Z = (p̂ − p₀) / √(p₀q₀/n)          (using p̂ directly)
4. Compute the value of Z from the sample data.
5. Determine the rejection region, which depends upon the alternative hypothesis, and can be described as:
When the Alternative Hypothesis is      The rejection region will be
a) H₁: p ≠ p₀ (two-sided)               Z(calculated) > Z(α/2) or Z(calculated) < −Z(α/2)
b) H₁: p < p₀ (one-sided)               Z(calculated) < −Z(α)
c) H₁: p > p₀ (one-sided)               Z(calculated) > Z(α)

6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
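The three equivalent forms of the test statistic in step 3 can be checked numerically. In this hypothetical sketch (x = 130 successes in n = 200 trials, H₀: p = 0.6), the continuity correction moves x half a unit toward np₀:

```python
from math import sqrt

# Hypothetical example: x = 130 successes in n = 200 trials, H0: p = 0.6.
n, x, p0 = 200, 130, 0.6
q0 = 1 - p0
phat = x / n

z_plain = (x - n * p0) / sqrt(n * p0 * q0)    # without continuity correction
# With continuity correction: shift x half a unit toward n*p0.
shift = -0.5 if x > n * p0 else 0.5
z_corr = (x + shift - n * p0) / sqrt(n * p0 * q0)
z_phat = (phat - p0) / sqrt(p0 * q0 / n)      # using p-hat directly
```

Note that the count form and the proportion form give the same value of Z, while the corrected statistic is slightly smaller in magnitude.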
Q: Explain the procedure for testing a hypothesis about the difference between two population proportions when the sample sizes are large?
Ans: Suppose we wish to test the hypothesis that the difference between two proportions is equal to a specified value ∆₀, or that the two proportions are equal. The statistic on which we base our decision rule is the variable (p̂₁ − p̂₂), where p̂₁ is the proportion of successes in the first sample of size n₁ and p̂₂ is the proportion of successes in the second sample of size n₂; the samples are drawn from two binomial populations with unknown proportions of successes p₁ and p₂ respectively. If the samples are sufficiently large, the sampling distribution of the difference (p̂₁ − p̂₂) is approximately normal with a mean of p₁ − p₂ and a standard deviation of √(p₁q₁/n₁ + p₂q₂/n₂), and the standard variable is
Z = [(p̂₁ − p̂₂) − (p₁ − p₂)] / √(p₁q₁/n₁ + p₂q₂/n₂)

which is approximately normal. The testing procedure is
1. Formulate the null and alternative hypothesis from the following three forms.
a) H₀: p₁ − p₂ = ∆₀ and H₁: p₁ − p₂ ≠ ∆₀   (∆₀ may be zero)
b) H₀: p₁ − p₂ ≤ ∆₀ and H₁: p₁ − p₂ > ∆₀
c) H₀: p₁ − p₂ ≥ ∆₀ and H₁: p₁ − p₂ < ∆₀
2. Decide the level of significance α (take α = 0.01 or 0.05).
3. The test statistic Z, under H₀, becomes
Z = [(p̂₁ − p̂₂) − (p₁ − p₂)] / √(p₁q₁/n₁ + p₂q₂/n₂)
When the values of p₁ and p₂ are not known, then for large sample sizes they are replaced with the sample proportions p̂₁ and p̂₂ respectively:
Z = [(p̂₁ − p̂₂) − (p₁ − p₂)] / √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)
and the test-statistic, if the hypothesis H₀: p₁ − p₂ = ∆₀ is true, will be
Z = (p̂₁ − p̂₂ − ∆₀) / √(p̂₁q̂₁/n₁ + p̂₂q̂₂/n₂)
If the null hypothesis is H₀: p₁ = p₂ = p, then

Z = (p̂₁ − p̂₂) / √[p̂c q̂c (1/n₁ + 1/n₂)]
where p̂c = (n₁p̂₁ + n₂p̂₂)/(n₁ + n₂) and q̂c = 1 − p̂c.

4. Compute the value of Z from the sample data.


5. Determine the rejection region, which depends upon the alternative hypothesis, and can be described as:
When the Alternative Hypothesis is          The rejection region will be
a) H₁: p₁ − p₂ ≠ ∆₀ (two-sided)             Z(calculated) > Z(α/2) or Z(calculated) < −Z(α/2)
b) H₁: p₁ − p₂ < ∆₀ (one-sided)             Z(calculated) < −Z(α)
c) H₁: p₁ − p₂ > ∆₀ (one-sided)             Z(calculated) > Z(α)

6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it
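The pooled form of the statistic under H₀: p₁ = p₂ can be sketched as follows (the success counts below are hypothetical; note that p̂c reduces to the combined success rate (x₁ + x₂)/(n₁ + n₂)):

```python
from math import sqrt

# Hypothetical counts: x1 successes out of n1 trials, x2 out of n2.
n1, x1 = 200, 120
n2, x2 = 250, 125
p1hat, p2hat = x1 / n1, x2 / n2

# Pooled estimate of the common proportion under H0: p1 = p2 = p.
pc = (n1 * p1hat + n2 * p2hat) / (n1 + n2)   # equals (x1 + x2)/(n1 + n2)
qc = 1 - pc

z = (p1hat - p2hat) / sqrt(pc * qc * (1 / n1 + 1 / n2))
reject = abs(z) > 1.96                        # two-sided test at alpha = 0.05
```

Here p̂₁ = 0.60 and p̂₂ = 0.50 pool to p̂c = 245/450 ≈ 0.544, and the resulting Z ≈ 2.12 falls in the two-sided rejection region at α = 0.05.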

Relation between Confidence interval and Testing hypothesis


There is a close relationship between the confidence interval for a population parameter θ and a test of hypothesis about θ. Let [L (lower confidence limit), U (upper confidence limit)] be a 100(1 − α)% confidence interval for the parameter θ. Then we shall accept the null hypothesis H₀: θ = θ₀ against H₁: θ ≠ θ₀ at a level of significance α if θ₀ falls inside the confidence interval, but if θ₀ falls outside the interval we shall reject H₀.
In the language of hypothesis testing, the 100(1 − α)% confidence interval is known as the acceptance region, and the region outside the confidence interval is called the rejection region or critical region.


Chapter 4

The Chi-Square Distribution and Statistical Inference

Prepared by Noman Rasheed¹

The Chi-Square (χ²) Statistic and Chi-Square distribution

The χ² random variable or statistic is defined as the sum of squares of independent standard normal random variables. Let Z₁, Z₂, …, Zₙ be n independent standard normal random variables; then the chi-square random variable with n degrees of freedom is defined as
χ²ₙ = Σᵢ₌₁ⁿ Zᵢ²
The sampling distribution of the χ² random variable is called the chi-square distribution and its pdf is
f(χ²) = (χ²)^((n/2)−1) e^(−χ²/2) / [2^(n/2) Γ(n/2)],   0 < χ² < ∞
with parameter n, the number of degrees of freedom.

Properties of the Chi-Square distribution

The Chi-Square distribution has the following properties:
1. The area under the curve is unity.
2. The chi-square is a continuous distribution ranging from zero to plus infinity (0 < χ² < ∞).
3. The mean of the distribution is equal to its parameter and the variance is twice its parameter, that is, E(χ²₍ₙ₎) = n and Var(χ²₍ₙ₎) = 2n.
4. Moments of the distribution are
(a) Moments about the origin: μ′₁ = n, μ′₂ = n(n + 2), μ′₃ = n(n + 2)(n + 4), μ′₄ = n(n + 2)(n + 4)(n + 6)
(b) Moments about the mean: μ₁ = 0, μ₂ = 2n, μ₃ = 8n, μ₄ = 12n² + 48n
5. Coefficients of skewness and kurtosis: β₁ = 8/n and β₂ = 3 + 12/n
¹M.Phil Stat, M.Ed. Contact: Nomanrasheed163@yahoo.com, facebook.com/Something About Statistics



6. The χ²-distribution tends to the normal distribution as the number of degrees of freedom approaches infinity (β₁ → 0 and β₂ → 3 as n → ∞).
7. The moment generating function of χ²₍ₙ₎ is M₀(t) = (1 − 2t)^(−n/2).
8. If x and y are independent χ²-random variables with n₁ and n₂ degrees of freedom respectively, then the sum x + y is a χ²-random variable with n₁ + n₂ degrees of freedom.
9. A χ² random variable can be partitioned into two or more parts which are also χ² random variables, and the sum of their degrees of freedom equals the total degrees of freedom: χ²₍ₙ₎ = χ²₍ₙ₋₁₎ + χ²₍₁₎.
10. By Fisher, for sufficiently large n, the random variable √(2χ²) is approximately normally distributed with mean √(2n − 1) and unit variance; similarly, by Wilson and Hilferty, the random variable (χ²/n)^(1/3) is approximately normal with mean 1 − 2/(9n) and variance 2/(9n).

Assumptions of the chi-square distribution

When the chi-square distribution is used for statistical inference, the following assumptions are made:
1. The parent population (the population from which the sample is drawn) is normal.
2. The sample is a random sample, so that the observations are independently distributed.

Confidence Interval Estimate of σ² from a Sample Variance

Let x̄ and S² be the mean and variance of a random sample x₁, x₂, …, xₙ of size n drawn from a normal population with variance σ². Then the statistic
χ² = nS²/σ² = Σ(xᵢ − x̄)²/σ²
that is, the ratio of the sum of squared deviations from the sample mean to the population variance, has a chi-square distribution with (n − 1) degrees of freedom.
Then according to the chi-square distribution, the probability that the value of χ² will fall in the interval from χ²(1−α/2) to χ²(α/2) is equal to 1 − α:
P[χ²(1−α/2) < χ² < χ²(α/2)] = 1 − α
Inserting the value of chi-square, we get
P[χ²(1−α/2) < Σ(x − x̄)²/σ² < χ²(α/2)] = 1 − α
Dividing all the terms by Σ(x − x̄)²:
P[χ²(1−α/2)/Σ(x − x̄)² < 1/σ² < χ²(α/2)/Σ(x − x̄)²] = 1 − α
Taking the reciprocal of each term (the inequality signs reverse):
P[Σ(x − x̄)²/χ²(1−α/2) > σ² > Σ(x − x̄)²/χ²(α/2)] = 1 − α
which can be written as
P[Σ(x − x̄)²/χ²(α/2) < σ² < Σ(x − x̄)²/χ²(1−α/2)] = 1 − α
If, instead of the sample values, the biased sample variance S² = Σ(x − x̄)²/n is used, then the confidence interval will be
P[nS²/χ²(α/2) < σ² < nS²/χ²(1−α/2)] = 1 − α
If the unbiased sample variance s² = Σ(x − x̄)²/(n − 1) is used, then the confidence interval will be
P[(n − 1)s²/χ²(α/2) < σ² < (n − 1)s²/χ²(1−α/2)] = 1 − α

Hence the 100(1 − α)% confidence interval for a particular sample of size n is
Σ(x − x̄)²/χ²(α/2) < σ² < Σ(x − x̄)²/χ²(1−α/2)
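A minimal numerical sketch of this interval, using a hypothetical sample of n = 10 observations and the standard chi-square table values for 9 degrees of freedom (χ²(0.025, 9) = 19.023 and χ²(0.975, 9) = 2.700):

```python
# 95% confidence interval for sigma^2 from a hypothetical sample of n = 10.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]
n = len(data)
xbar = sum(data) / n
ss = sum((x - xbar) ** 2 for x in data)   # sum of squared deviations

# Chi-square table values for 9 degrees of freedom at alpha = 0.05.
chi2_upper, chi2_lower = 19.023, 2.700    # chi2(alpha/2) and chi2(1 - alpha/2)
lower = ss / chi2_upper
upper = ss / chi2_lower

# The unbiased sample variance s^2 = ss/(n-1) lies inside the interval.
s2 = ss / (n - 1)
```

Here ss = 0.60, so the interval runs from 0.60/19.023 ≈ 0.032 to 0.60/2.700 ≈ 0.222, and it contains s² = 0.60/9 ≈ 0.067.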

Tests Based On Chi-Square Distribution


Some of the most frequently used tests of hypothesis that are based on the χ²-distribution are presented here:
1. Testing hypothesis about the variance of a Normal population.
2. Testing hypothesis about the equality of proportions of a multinomial distribution.
3. Tests for goodness of fit.
4. Testing hypothesis about the independence of two attributes in a contingency table.

Testing hypothesis about variance of a Normal population

Suppose we desire to test a null hypothesis H₀ that the variance σ² of a normally distributed population has some specified value σ₀². To do this, we draw a random sample x₁, x₂, …, xₙ of size n from the normal population and compute the value of the sample variance S². Then the statistic
χ² = nS²/σ₀² = Σ(x − x̄)²/σ₀²
has a χ²-distribution with (n − 1) degrees of freedom. The testing procedure is
1. Formulate the null and alternative hypothesis about σ². Three possible forms are
a) H₀: σ² = σ₀² and H₁: σ² ≠ σ₀²
b) H₀: σ² ≤ σ₀² and H₁: σ² > σ₀²
c) H₀: σ² ≥ σ₀² and H₁: σ² < σ₀²
2. Decide the significance level α. The commonly used values are α = 0.05 or 0.01.
3. The test-statistic to be used is
χ²₍ₙ₋₁₎ = nS²/σ₀² = Σ(x − x̄)²/σ₀²
which under H₀ has a chi-square distribution with (n − 1) degrees of freedom.



4. Compute the value of χ² from the given data.
5. Determine the critical region, which depends on α and the alternative hypothesis H₁:
a) When H₁ is σ² ≠ σ₀², the critical region is χ² < χ²(1−α/2)(n − 1) or χ² > χ²(α/2)(n − 1)
b) When H₁ is σ² > σ₀², the critical region is χ² > χ²(α)(n − 1)
c) When H₁ is σ² < σ₀², the critical region is χ² < χ²(1−α)(n − 1)
6. Draw conclusion: reject H₀ if the calculated value of χ² falls in the critical region, otherwise accept it.

Testing hypothesis about equality of proportions of multinomial distribution

In multinomial-type problems, where there are k classes or cells and the cell probabilities pᵢ's are completely specified, the procedure for testing the hypothesis is given below.
1. Formulate the null and alternative hypothesis about the p's as
H₀: P₁ = P₁₀, P₂ = P₂₀, …, Pₖ = Pₖ₀ and
H₁: Pᵢ ≠ Pᵢ₀ for at least one i
2. Decide the level of significance α. Commonly used values are 0.05 or 0.01.
3. The test statistic to use is
χ² = Σᵢ₌₁ᵏ (nᵢ − npᵢ₀)² / (npᵢ₀)
which, if H₀ is true, has approximately a chi-square distribution with (k − 1) degrees of freedom.
4. Compute the value of χ², after having calculated the expected values npᵢ₀, from the given data.
5. Determine the critical region, which depends on α and the degrees of freedom.
6. Draw conclusion: reject H₀ if the computed value of χ² > χ²(α)(k − 1), otherwise accept H₀.

Test for goodness of fit

A goodness-of-fit test is a hypothesis test that is concerned with determining whether the results of a sample conform to a hypothesized distribution, which may be the uniform, binomial, Poisson, Normal or any other distribution. It is a kind of hypothesis test for problems where we do not know the probability distribution of the random variable under consideration. The test involves a comparison of the observed frequency distribution in the sample with the expected frequency distribution based on some theoretical model. A goodness-of-fit test between observed and expected frequencies is based upon the quantity
χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)² / Eᵢ
The symbols Oᵢ and Eᵢ represent the observed and expected frequencies respectively for the i-th class, and k represents the number of possible outcomes or the number of different classes. The sampling distribution of χ² approaches the chi-square distribution with degrees of freedom ν = k − 1 − m, where k represents the number of classes and m is the number of parameters estimated from the sample statistics.
Q: Explain the procedure for testing hypothesis for a goodness-of-fit test?
Ans: The procedure for a goodness-of-fit test is as follows.
1. Formulate the null and alternative hypothesis as
H₀: The population has a specified probability distribution, and
H₁: The population does not have the specified distribution.
2. Decide the level of significance α. The commonly used value is α = 0.05.
3. The test statistic to use is
χ² = Σᵢ₌₁ᵏ (Oᵢ − Eᵢ)² / Eᵢ
which, if H₀ is true, has an approximate chi-square distribution with degrees of freedom ν = k − 1 − m.
4. Compute the expected values and the value of χ².
5. Determine the critical region, which depends upon α and the degrees of freedom.
6. Draw conclusion: if the calculated value of χ² exceeds the χ² value against the appropriate degrees of freedom from the χ²-table, we reject H₀, otherwise we accept it.
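As a minimal sketch of the procedure, consider testing whether a die is fair from hypothetical counts of 120 rolls. With k = 6 classes and no estimated parameters, ν = 5, and the table value χ²(0.05, 5) = 11.070:

```python
# Chi-square goodness-of-fit test for a fair die (hypothetical counts).
observed = [18, 22, 16, 25, 19, 20]     # 120 rolls
n = sum(observed)
expected = [n / 6] * 6                   # H0: uniform, p_i = 1/6 for each face

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# k = 6 classes, m = 0 estimated parameters -> nu = 5 degrees of freedom.
# From a chi-square table, chi2(0.05, 5) = 11.070.
reject = chi2 > 11.070
```

Here χ² = 2.5, well below 11.070, so the fair-die hypothesis is not rejected.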

Attributes
A characteristic which varies only in quality from one individual to another is called an attribute such as
male or female, tall or short, satisfied or dissatisfied, high or low, healthy or diseased, positive or negative
etc. The attributes cannot be measured accurately but they can be divided into classes and their numbers
in each class can be counted.

Dichotomy
If the data (population) are divided into two distinct and mutually exclusive classes by a single attribute, as for instance when the population of human beings is divided into males and females, the process is called dichotomy. A population may be divided into three or more classes, which is called trichotomy or manifold division respectively.

Positive and Negative attributes

Capital letters A, B, C, … are used to denote the presence of an attribute, and these attributes are called positive attributes. Greek letters α, β, γ, … are used to denote the absence of an attribute, and these attributes are called negative attributes.

Class and Class frequency


A class is a set of the objects which are sharing a given characteristic while a class frequency is the
number of observations (objects) which are distributed in a class. Class frequencies are denoted by enclosing
the class symbols in brackets such as (A) etc.

Order of Classes
Order of classes is known by the number of attributes specifying the class. For example a class specified
by one attribute is known as the class of order 1.

Ultimate class frequency

The frequencies of classes of the highest order are called ultimate class frequencies. The number of ultimate class frequencies for k attributes is given by 2ᵏ. For example, in the case of two attributes they are (AB), (Aβ), (αB), (αβ).

Consistency
If the class frequencies are observed in a certain sample data and all class frequencies are recorded correctly then there will be no error in them and they will be called consistent.

50

CHAPTER 4. THE CHI-SQUARE DISTRIBUTION AND STATISTICAL INFERENCE

Independence
Suppose in a population of size N, the class frequencies of two attributes A and B are given by (A) and (B). Then the two attributes A and B are said to be independent if the actual frequency equals the expected one, that is
(AB) = (A)(B)/N
Similarly, α and β will be independent if
(αβ) = (α)(β)/N

Association
Two attributes A and B are said to be associated only if they appear together a larger number of times than would be expected if they were independent. There may be complete association (perfect positive association) or complete disassociation (perfect negative association).
Positively Associated or Simply Associated: A and B are said to be positively associated, or simply associated, if
(AB) > (A)(B)/N
Negatively Associated or Disassociated: On the contrary, A and B are said to be negatively associated, or briefly disassociated, if
(AB) < (A)(B)/N
Note: Disassociation does not mean independence.

Measures of Association
The strength of association between two attributes A and B is measured by the coefficient called the coefficient of association, defined by the formula
Q = [(AB)(αβ) − (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]
It lies between −1 and +1. When Q = 0 the attributes are independent, when Q = +1 there is complete (positive) association, and when Q = −1 there is complete (negative) association.
Another coefficient, known as the coefficient of colligation, which also measures the strength of association, is defined as
Y = [1 − √((Aβ)(αB)/((AB)(αβ)))] / [1 + √((Aβ)(αB)/((AB)(αβ)))]
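Both coefficients can be computed directly from the four cell frequencies. The counts below are hypothetical; the last line checks the standard identity Q = 2Y/(1 + Y²) linking the two coefficients:

```python
from math import sqrt

# Hypothetical 2x2 frequencies: (AB), (A beta), (alpha B), (alpha beta).
AB, Ab, aB, ab = 50, 10, 20, 40

# Yule's coefficient of association.
Q = (AB * ab - Ab * aB) / (AB * ab + Ab * aB)

# Coefficient of colligation.
ratio = (Ab * aB) / (AB * ab)
Y = (1 - sqrt(ratio)) / (1 + sqrt(ratio))

# Q and Y are linked by Q = 2Y / (1 + Y^2).
check = 2 * Y / (1 + Y ** 2)
```

With these counts Q = 1800/2200 = 9/11 ≈ 0.818, indicating a fairly strong positive association, and the identity reproduces Q from Y.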

Contingency Table
A table consisting of r rows and c columns in which the data are classified according to two attributes A and B is called an r × c contingency table.

Attributes   B₁        B₂        ...   Bc        Total
A₁           (A₁B₁)    (A₁B₂)    ...   (A₁Bc)    (A₁)
A₂           (A₂B₁)    (A₂B₂)    ...   (A₂Bc)    (A₂)
...          ...       ...       ...   ...       ...
Ar           (ArB₁)    (ArB₂)    ...   (ArBc)    (Ar)
Total        (B₁)      (B₂)      ...   (Bc)      n

The simplest form of a contingency table is the 2 × 2 (read as "2 by 2") table, in which the two attributes are dichotomized.
Q: Explain the procedure for testing hypothesis of independence in a contingency table?
Ans: Independence of attributes in a contingency table is tested by χ². The procedure involves six steps, which are as follows.
1. Formulate the null and alternative hypothesis as
H₀: The two characteristics or criteria of classification are independent.
H₁: The two characteristics or criteria of classification are not independent.
2. Choose a significance level α. The commonly used levels are α = 0.05 or 0.01.
3. The test statistic χ², which compares the expected and the observed cell frequencies, is
χ² = Σᵢ₌₁ʳ Σⱼ₌₁ᶜ (oᵢⱼ − eᵢⱼ)² / eᵢⱼ
which, if H₀ is true, has approximately a chi-square distribution with (r − 1)(c − 1) degrees of freedom.
4. Compute the expected frequency under H₀ for each cell by the formula eᵢⱼ = (Aᵢ)(Bⱼ)/n; also calculate the value of χ² and the degrees of freedom.
5. Determine the critical region, which depends on α and the number of degrees of freedom, that is, χ² > χ²(α)[(r − 1)(c − 1)].
6. Draw conclusion: we reject H₀ if χ² > χ²(α)[(r − 1)(c − 1)], otherwise we accept it.
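The expected-frequency formula eᵢⱼ = (Aᵢ)(Bⱼ)/n and the χ² sum can be sketched for a hypothetical 2 × 2 table, where (r − 1)(c − 1) = 1 and the table value is χ²(0.05, 1) = 3.841:

```python
# Chi-square test of independence for a hypothetical 2x2 table.
table = [[30, 20],
         [20, 30]]

row_totals = [sum(row) for row in table]          # (A_i)
col_totals = [sum(col) for col in zip(*table)]    # (B_j)
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n     # e_ij = (A_i)(B_j)/n
        chi2 += (o - e) ** 2 / e

# (r-1)(c-1) = 1 degree of freedom; chi2(0.05, 1) = 3.841 from the table.
reject = chi2 > 3.841
```

Every expected frequency here is 25, giving χ² = 4.0 > 3.841, so independence would be rejected at the 5% level.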

Measures of Association in a Contingency Table

The test of independence indicates only whether any dependency relationship exists between the attributes; it does not indicate the degree of association or the direction of the dependency.
A measure of the degree of association or dependency in a contingency table, which is called Pearson's coefficient of mean square contingency or simply the coefficient of contingency, is given by
C = √(χ²/(χ² + n)),   0 ≤ C ≤ √((q − 1)/q)
where q represents the number of rows or columns, whichever is smaller, and n is the sample size.
Another measure of association in a contingency table, which is called Cramer's contingency coefficient, is
Q = √(χ²/(n(q − 1))),   0 ≤ Q ≤ 1
where q is the number of rows or columns, whichever is smaller, and n represents the sample size.
Note: If Q = 0 then the attributes are independent, and when Q = 1 there is a perfect relationship.

Yates Correction for Continuity

The χ² test statistic is computed from cell frequencies, which are discrete data. However, the continuous chi-square distribution yields a satisfactory approximation provided that the number of degrees of freedom is greater than one and there are five or more observations in each cell. The corrected formula is
χ²(corrected) = Σ (|oᵢ − eᵢ| − 0.5)² / eᵢ
If the expected frequencies are large, the corrected and uncorrected results are almost the same. When the expected frequencies are between 5 and 10, Yates' correction should be applied.

Simplified formula in 2 × 2 Contingency Table

In applying the χ² approximation, we are required to combine the smaller frequencies (less than 5) with larger ones. But in the case of two classes only, we cannot pool the small frequency into the larger one, as no degree of freedom would remain. Therefore a simplified formula can be applied for calculating the χ².
Suppose the observed frequencies in a 2 × 2 contingency table are a, b, c and d, as shown in the following table.

Attribute   B₁      B₂      Total
A₁          a       b       a + b
A₂          c       d       c + d
Total       a + c   b + d   n

Then the value of χ² is calculated as
χ² = n(ad − bc)² / [(a + b)(a + c)(c + d)(b + d)]
With continuity correction:
χ²(corrected) = n(|ad − bc| − n/2)² / [(a + b)(a + c)(c + d)(b + d)]
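The shortcut formula gives exactly the same value as summing (o − e)²/e over the four cells, which the following sketch checks on hypothetical counts:

```python
# The 2x2 shortcut formula, checked against the cell-by-cell computation.
a, b, c, d = 30, 20, 20, 30
n = a + b + c + d

chi2_short = n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (c + d) * (b + d))

# Cell-by-cell chi-square for comparison.
table = [[a, b], [c, d]]
rows = [a + b, c + d]
cols = [a + c, b + d]
chi2_cells = sum((table[i][j] - rows[i] * cols[j] / n) ** 2
                 / (rows[i] * cols[j] / n)
                 for i in range(2) for j in range(2))

# Shortcut form with Yates' continuity correction.
chi2_corr = n * (abs(a * d - b * c) - n / 2) ** 2 \
            / ((a + b) * (a + c) * (c + d) * (b + d))
```

Both routes give χ² = 4.0 for these counts, and the corrected value (3.24) is smaller, as expected.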

Degrees of freedom
Degrees of freedom is the number of values that are free to vary after we have placed certain restrictions
upon the data.

Chapter 5

The Student's t-distribution and Statistical Inference

Prepared by Noman Rasheed¹

The t-statistic and The t-distribution

Let x₁, x₂, …, xₙ be a random sample of size n drawn from a normal population with mean μ and variance σ², and let x̄ = Σxᵢ/n and s² = Σ(x − x̄)²/(n − 1), which is the unbiased estimate of σ². Then the sampling distribution of the statistic
t = (x̄ − μ)/(s/√n)
is called the t-distribution or Student's t-distribution with ν = n − 1 degrees of freedom, having pdf
f(t) = [√ν B(1/2, ν/2)]⁻¹ (1 + t²/ν)^(−(ν+1)/2),   −∞ < t < ∞
with parameter ν, the degrees of freedom.

Note: the t-statistic is the quotient of a standard normal variable and the square root of a chi-square random variable divided by its degrees of freedom. Symbolically,
t = Z / √(χ²/ν)

Properties of Student's t-distribution

The t-distribution has the following properties:
1. The area under the curve is unity.
2. The t-distribution is continuous and symmetric about the value t = 0, ranging from −∞ to ∞.
3. The mean of the t-distribution is μ = 0 and its variance is σ² = ν/(ν − 2) for ν > 2.
Note: The variance does not exist for ν ≤ 2. The variance is greater than 1 and approaches 1 as the degrees of freedom increase.
4. The distribution is unimodal with a bell shape. The mode of the distribution is t = 0 and the median is also equal to zero.
5. The shape of the distribution changes as the number of degrees of freedom (the sample size) changes.
6. For a small value of ν, the distribution is flatter than the standard normal distribution (it is more spread out in the tails than the standard normal distribution).
7. The distribution approaches the standard normal distribution as the number of degrees of freedom (the sample size) becomes larger.
8. It is independent of the population standard deviation σ.

¹M.Phil Stat, M.Ed. Contact: Nomanrasheed163@yahoo.com, facebook.com/Something About Statistics

Assumptions in using t-distribution

To use the t-distribution, we make the following assumptions.
1. The sample of n observations x₁, x₂, …, xₙ is selected randomly.
2. The population from which the small sample is drawn is normal. This is essential for x̄ and s, the components of the statistic t.
3. In the case of two small samples, both samples are selected randomly, both populations are normal and both populations have equal variances.

Confidence Interval Estimate of Mean for small sample (n < 30)


Let x₁, x₂, . . . , xₙ be a random sample of size n drawn from a normal population with an unknown mean μ
and an unknown standard deviation σ, where the sample size is small (n < 30). Since the sample size is small
and the population standard deviation is unknown, the sampling distribution of the mean x̄ will be the
t-distribution with statistic

    t = (x̄ − μ)/(s/√n)

with degrees of freedom ν = n − 1.
Then, according to the t-distribution, the probability that a value of t will fall in the interval from −t_{α/2,ν}
to t_{α/2,ν} is equal to 1 − α:

    P(−t_{α/2,ν} ≤ t ≤ t_{α/2,ν}) = 1 − α

Inserting t, we get

    P(−t_{α/2,ν} ≤ (x̄ − μ)/(s/√n) ≤ t_{α/2,ν}) = 1 − α

Multiplying each term by s/√n:

    P(−t_{α/2,ν} · s/√n ≤ x̄ − μ ≤ t_{α/2,ν} · s/√n) = 1 − α

Subtracting x̄ from each term, we have

    P(−x̄ − t_{α/2,ν} · s/√n ≤ −μ ≤ −x̄ + t_{α/2,ν} · s/√n) = 1 − α

Multiplying by −1 (the inequality signs reverse):

    P(x̄ + t_{α/2,ν} · s/√n ≥ μ ≥ x̄ − t_{α/2,ν} · s/√n) = 1 − α

which is equivalent to

    P(x̄ − t_{α/2,ν} · s/√n ≤ μ ≤ x̄ + t_{α/2,ν} · s/√n) = 1 − α

Hence the 100(1 − α)% confidence interval for μ for a particular sample of size n (n < 30) is

    x̄ ± t_{α/2,ν} · s/√n
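The interval x̄ ± t_{α/2,ν} s/√n can be computed directly. A minimal sketch with invented data; `scipy.stats.t.interval` is assumed available and is cross-checked against the hand formula.

```python
# One-sample t confidence interval, by hand and via SciPy (invented data).
import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.4, 12.0, 11.6, 12.3, 12.2, 11.9])
n = len(x)
xbar, s = x.mean(), x.std(ddof=1)          # unbiased s divides by n - 1
alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, n - 1)  # t_{alpha/2, n-1}

lo = xbar - tcrit * s / np.sqrt(n)
hi = xbar + tcrit * s / np.sqrt(n)

# Same interval from SciPy directly
lo2, hi2 = stats.t.interval(1 - alpha, n - 1, loc=xbar, scale=s / np.sqrt(n))
assert np.isclose(lo, lo2) and np.isclose(hi, hi2)
print(round(lo, 3), round(hi, 3))
```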


Confidence Interval Estimate for the difference of two Means (μ₁ − μ₂)

Independent Samples
Let x₁₁, x₁₂, . . . , x₁ₙ₁ and x₂₁, x₂₂, . . . , x₂ₙ₂ be two small independent random samples (n₁, n₂ < 30)
from two normal populations with means μ₁ and μ₂ and standard deviations σ₁ and σ₂ respectively. If
σ₁ = σ₂ = σ but unknown, then the unbiased pooled or combined estimate of the common variance σ² is
given by

    s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)

Therefore the sampling distribution of (x̄₁ − x̄₂) follows Student's t-distribution with ν = n₁ + n₂ − 2
degrees of freedom, and the statistic is

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / [s_p √(1/n₁ + 1/n₂)]

Then, according to the t-distribution, the probability that a value of t will fall in the interval from −t_{α/2,ν}
to t_{α/2,ν} is equal to 1 − α:

    P(−t_{α/2,ν} ≤ t ≤ t_{α/2,ν}) = 1 − α

Inserting t, we get

    P(−t_{α/2,ν} ≤ [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / [s_p √(1/n₁ + 1/n₂)] ≤ t_{α/2,ν}) = 1 − α

Multiplying by s_p √(1/n₁ + 1/n₂):

    P(−t_{α/2,ν} s_p √(1/n₁ + 1/n₂) ≤ (x̄₁ − x̄₂) − (μ₁ − μ₂) ≤ t_{α/2,ν} s_p √(1/n₁ + 1/n₂)) = 1 − α

Subtracting (x̄₁ − x̄₂):

    P(−(x̄₁ − x̄₂) − t_{α/2,ν} s_p √(1/n₁ + 1/n₂) ≤ −(μ₁ − μ₂) ≤ −(x̄₁ − x̄₂) + t_{α/2,ν} s_p √(1/n₁ + 1/n₂)) = 1 − α

Multiplying by −1 (the inequality signs reverse):

    P((x̄₁ − x̄₂) + t_{α/2,ν} s_p √(1/n₁ + 1/n₂) ≥ μ₁ − μ₂ ≥ (x̄₁ − x̄₂) − t_{α/2,ν} s_p √(1/n₁ + 1/n₂)) = 1 − α

which is equivalent to

    P((x̄₁ − x̄₂) − t_{α/2,ν} s_p √(1/n₁ + 1/n₂) ≤ μ₁ − μ₂ ≤ (x̄₁ − x̄₂) + t_{α/2,ν} s_p √(1/n₁ + 1/n₂)) = 1 − α

Hence the 100(1 − α)% confidence interval for (μ₁ − μ₂), for small sample sizes when σ₁ and σ₂ are unknown
but equal and the populations are normal, is

    (x̄₁ − x̄₂) ± t_{α/2,ν} s_p √(1/n₁ + 1/n₂)
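The pooled interval above can be sketched as follows; the two samples are invented, and SciPy supplies the t critical point.

```python
# Pooled-variance CI for mu1 - mu2 (illustrative, invented data).
import numpy as np
from scipy import stats

x1 = np.array([20.1, 19.4, 21.0, 20.3, 19.8])
x2 = np.array([18.9, 19.2, 18.4, 19.5, 18.7, 19.0])
n1, n2 = len(x1), len(x2)

# s_p^2 = [(n1-1)s1^2 + (n2-1)s2^2] / (n1 + n2 - 2)
sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2) * np.sqrt(1 / n1 + 1 / n2)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, n1 + n2 - 2)
diff = x1.mean() - x2.mean()
ci = (diff - tcrit * se, diff + tcrit * se)
print(ci)
```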


Confidence Interval Estimate for the difference of two Means (μ₁ − μ₂)

Dependent Samples
When the observations from two samples are paired or dependent, we find the difference between the two
observations of each pair. If the pairs are (X₁, Y₁), (X₂, Y₂), . . . , (Xₙ, Yₙ), then the differences dᵢ = (Xᵢ − Yᵢ)
constitute a single random sample from a population of differences which is normally distributed with
mean μ_d = μ₁ − μ₂ and variance σ_d². The parameters μ_d and σ_d² are estimated by their corresponding
statistics

    d̄ = Σᵢ dᵢ / n   and   s_d² = Σᵢ (dᵢ − d̄)² / (n − 1)

where n represents the number of pairs. Here the differences (d₁, d₂, . . . , dₙ) are a random sample which is
normally distributed, and the statistic

    t = (d̄ − μ_d)/(s_d/√n)

follows the t-distribution with ν = n − 1 degrees of freedom.


According to the t-distribution, the probability that a value of t will fall in the interval from −t_{α/2,ν} to
t_{α/2,ν} is equal to 1 − α:

    P(−t_{α/2,ν} ≤ t ≤ t_{α/2,ν}) = 1 − α

Inserting t, we get

    P(−t_{α/2,ν} ≤ (d̄ − μ_d)/(s_d/√n) ≤ t_{α/2,ν}) = 1 − α

Multiplying each term by s_d/√n:

    P(−t_{α/2,ν} · s_d/√n ≤ d̄ − μ_d ≤ t_{α/2,ν} · s_d/√n) = 1 − α

Subtracting d̄ from each term, we have

    P(−d̄ − t_{α/2,ν} · s_d/√n ≤ −μ_d ≤ −d̄ + t_{α/2,ν} · s_d/√n) = 1 − α

Multiplying by −1 (the inequality signs reverse):

    P(d̄ + t_{α/2,ν} · s_d/√n ≥ μ_d ≥ d̄ − t_{α/2,ν} · s_d/√n) = 1 − α

which is equivalent to

    P(d̄ − t_{α/2,ν} · s_d/√n ≤ μ_d ≤ d̄ + t_{α/2,ν} · s_d/√n) = 1 − α

Hence the 100(1 − α)% confidence interval for μ_d for a particular sample of size n (n < 30) is

    d̄ ± t_{α/2,ν} · s_d/√n

Testing Hypothesis about Mean of a Normal Population


When σ is unknown and n < 30
Let x₁, x₂, . . . , xₙ be the observations in a small random sample of size n taken from a normally distributed
population whose standard deviation σ is unknown. Then σ is estimated from the sample data. Let x̄ be
the sample mean and s the unbiased estimate of σ. If we wish to test the hypothesis that the population
mean has a specified value μ₀, the statistic

    t = (x̄ − μ₀)/(s/√n)

has, when the hypothesis is true, a t-distribution with ν = n − 1 degrees of freedom. The testing procedure
is

1. Formulate the null and alternative hypothesis about μ. Three possible forms are
   a) H₀: μ = μ₀ and H₁: μ ≠ μ₀
   b) H₀: μ ≤ μ₀ and H₁: μ > μ₀
   c) H₀: μ ≥ μ₀ and H₁: μ < μ₀
2. Decide the level of significance α (take α = 0.05 or 0.01).
3. The test statistic in this case is t = (x̄ − μ₀)/(s/√n).
4. Calculate the value of t from the sample data.
5. Determine the rejection region, which depends on the alternative hypothesis and can be described as

   When the Alternative Hypothesis is      The rejection region will be
   a) H₁: μ ≠ μ₀ (two-sided)               t(calculated) > t_{α/2,ν} or t(calculated) < −t_{α/2,ν}
   b) H₁: μ < μ₀ (one-sided)               t(calculated) < −t_{α,ν}
   c) H₁: μ > μ₀ (one-sided)               t(calculated) > t_{α,ν}

6. Conclusion: Reject H₀ if the value of t(calculated) falls in the rejection region; otherwise accept it.
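The six steps above can be sketched in code; the data and μ₀ are invented, and `scipy.stats.ttest_1samp` is assumed available as a cross-check of the hand-computed t.

```python
# One-sample t test of H0: mu = mu0 (invented data, two-sided case).
import numpy as np
from scipy import stats

x = np.array([4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4])
mu0, alpha = 5.0, 0.05

t_calc = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(len(x)))
res = stats.ttest_1samp(x, popmean=mu0)        # two-sided by default
assert np.isclose(t_calc, res.statistic)

# Reject H0 when |t| > t_{alpha/2, n-1}
tcrit = stats.t.ppf(1 - alpha / 2, len(x) - 1)
print("reject H0" if abs(t_calc) > tcrit else "accept H0")
```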

Testing hypothesis about difference of two Means (μ₁ − μ₂)

When σ₁ = σ₂ but unknown and n₁, n₂ < 30
Let x₁₁, x₁₂, . . . , x₁ₙ₁ and x₂₁, x₂₂, . . . , x₂ₙ₂ be two small independent random samples (n₁, n₂ < 30)
from two normal populations with means μ₁ and μ₂ and standard deviations σ₁ and σ₂ respectively. If
σ₁ = σ₂ = σ but unknown, then the unbiased pooled or combined estimate of the common variance σ² is
given by

    s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂²] / (n₁ + n₂ − 2)

Therefore the sampling distribution of (x̄₁ − x̄₂) follows Student's t-distribution with ν = n₁ + n₂ − 2
degrees of freedom, and the statistic is

    t = [(x̄₁ − x̄₂) − (μ₁ − μ₂)] / [s_p √(1/n₁ + 1/n₂)]

Then the testing procedure is


1. Formulate the null and alternative hypothesis from the following three forms.
   a) H₀: μ₁ − μ₂ = Δ₀ and H₁: μ₁ − μ₂ ≠ Δ₀   (Δ₀ may be equal to zero)
   b) H₀: μ₁ − μ₂ ≤ Δ₀ and H₁: μ₁ − μ₂ > Δ₀
   c) H₀: μ₁ − μ₂ ≥ Δ₀ and H₁: μ₁ − μ₂ < Δ₀
2. Decide the level of significance α (take α = 0.01 or 0.05).
3. The test statistic t, under H₀, becomes

       t = [(x̄₁ − x̄₂) − Δ₀] / [s_p √(1/n₁ + 1/n₂)]

4. Compute the value of t from the sample data.
   s_p² can also be obtained as

       s_p² = (n₁S₁² + n₂S₂²) / (n₁ + n₂ − 2)

       s_p² = [Σ(x₁ᵢ − x̄₁)² + Σ(x₂ᵢ − x̄₂)²] / (n₁ + n₂ − 2)

       s_p² = 1/(n₁ + n₂ − 2) · [ {Σx₁ᵢ² − (Σx₁)²/n₁} + {Σx₂ᵢ² − (Σx₂)²/n₂} ]

5. Determine the rejection region, which depends upon the alternative hypothesis and can be described as

   When the Alternative Hypothesis is           The rejection region will be
   a) H₁: μ₁ − μ₂ ≠ Δ₀ (two-sided)              t(calculated) > t_{α/2,ν} or t(calculated) < −t_{α/2,ν}
   b) H₁: μ₁ − μ₂ < Δ₀ (one-sided)              t(calculated) < −t_{α,ν}
   c) H₁: μ₁ − μ₂ > Δ₀ (one-sided)              t(calculated) > t_{α,ν}

6. Conclusion: Reject H₀ if the value of t(calculated) falls in the rejection region; otherwise accept it.
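The pooled test can be sketched as follows; the samples are invented, and `scipy.stats.ttest_ind` with `equal_var=True` (the equal-variance case assumed in this section) is used as a cross-check.

```python
# Pooled two-sample t test of H0: mu1 - mu2 = 0 (invented data).
import numpy as np
from scipy import stats

x1 = np.array([72, 75, 70, 74, 73, 71], dtype=float)
x2 = np.array([68, 70, 67, 71, 69], dtype=float)
n1, n2 = len(x1), len(x2)

sp2 = ((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2)
t_calc = (x1.mean() - x2.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))

res = stats.ttest_ind(x1, x2, equal_var=True)  # pooled-variance t test
assert np.isclose(t_calc, res.statistic)
print(round(t_calc, 3), round(res.pvalue, 4))
```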

Paired Observations
There are many situations in which the two samples are not independent. This happens when the observations
occur in pairs, the two observations of a pair being related to each other. Pairing occurs either
naturally or by design (artificial pairing).
Natural Pairing
Natural pairing occurs whenever a measurement is taken on the same unit or individual at two different times.
For example, suppose 10 young recruits are given a strenuous physical training programme by the Army.
Their weights are recorded before they begin and after they complete the training. The two observations
obtained for each recruit (the before-and-after measurements) constitute natural pairing.
Pairing by Design/Artificial Pairing
The two observations are also paired by design to eliminate an effect in which there is no interest.

Testing Hypothesis About Two Means (Paired Observations)

When the observations from two samples are paired or dependent, either naturally or artificially, we find
the difference between the two observations of each pair. If the pairs are (X₁, Y₁), (X₂, Y₂), . . . , (Xₙ, Yₙ),
then the differences dᵢ = (Xᵢ − Yᵢ) constitute a single random sample from a population of differences which
is normally distributed with mean μ_d = μ₁ − μ₂ and variance σ_d². The parameters μ_d and σ_d² are
estimated by their corresponding statistics

    d̄ = Σᵢ dᵢ / n   and   s_d² = Σᵢ (dᵢ − d̄)² / (n − 1)

where n represents the number of pairs. Here the differences (d₁, d₂, . . . , dₙ) are a random sample which is
normally distributed, and the statistic

    t = (d̄ − μ_d)/(s_d/√n)

follows the t-distribution with ν = n − 1 degrees of freedom. Then the testing procedure is
1. Formulate the null and alternative hypothesis from the following three forms.
   a) H₀: μ₁ − μ₂ = d₀ and H₁: μ₁ − μ₂ ≠ d₀   (d₀ may be equal to zero)
   b) H₀: μ₁ − μ₂ ≤ d₀ and H₁: μ₁ − μ₂ > d₀
   c) H₀: μ₁ − μ₂ ≥ d₀ and H₁: μ₁ − μ₂ < d₀
2. Decide the level of significance α (take α = 0.01 or 0.05).
3. The test statistic t, under H₀, becomes t = (d̄ − d₀)/(s_d/√n).
4. Compute the value of t from the sample data.
5. Determine the rejection region, which depends upon the alternative hypothesis and can be described as

   When the Alternative Hypothesis is       The rejection region will be
   a) H₁: μ₁ − μ₂ ≠ d₀ (two-sided)          t(calculated) > t_{α/2,ν} or t(calculated) < −t_{α/2,ν}
   b) H₁: μ₁ − μ₂ < d₀ (one-sided)          t(calculated) < −t_{α,ν}
   c) H₁: μ₁ − μ₂ > d₀ (one-sided)          t(calculated) > t_{α,ν}

6. Conclusion: Reject H₀ if the value of t(calculated) falls in the rejection region; otherwise accept it.
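The paired procedure can be sketched with invented before/after measurements (in the spirit of the recruits example); `scipy.stats.ttest_rel` is assumed available as a cross-check.

```python
# Paired t test: d_i = before_i - after_i, H0: mu_d = 0 (invented data).
import numpy as np
from scipy import stats

before = np.array([61.0, 58.5, 63.2, 60.1, 59.4, 62.0])
after  = np.array([62.3, 59.0, 64.1, 60.0, 60.2, 63.1])
d = before - after
n = len(d)

t_calc = d.mean() / (d.std(ddof=1) / np.sqrt(n))   # hand formula
res = stats.ttest_rel(before, after)               # same statistic
assert np.isclose(t_calc, res.statistic)
print(round(t_calc, 3), "df =", n - 1)
```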

Chapter 6

The F-distribution and Statistical Inference

Prepared by Noman Rasheed

The F-Statistic and The F-distribution

The F-statistic is defined as the ratio of two independent chi-square random variables, each divided by its
degrees of freedom:

    F = (χ₁²/ν₁) / (χ₂²/ν₂)

After some simplification it becomes F = s₁²/s₂², where s₁² and s₂² are the unbiased estimated variances of
two random samples of sizes n₁ and n₂ drawn from normal populations with the same variance.
The sampling distribution of the F-statistic is called the F-distribution, having pdf

    f(F) = Γ[(ν₁ + ν₂)/2] (ν₁/ν₂)^(ν₁/2) F^((ν₁/2) − 1) / { Γ(ν₁/2) Γ(ν₂/2) [1 + ν₁F/ν₂]^((ν₁+ν₂)/2) },   0 < F < ∞

with two parameters, ν₁ and ν₂ degrees of freedom.

Properties of F-distribution
The F-distribution has the following important properties:
1. Area under the curve is unity.
2. The F-distribution always ranges from zero to infinity.
3. The mean and variance of the distribution with ν₁ and ν₂ degrees of freedom are

       μ = ν₂/(ν₂ − 2),   for ν₂ > 2

   and

       σ² = 2ν₂²(ν₁ + ν₂ − 2) / [ν₁(ν₂ − 2)²(ν₂ − 4)],   for ν₂ > 4

4. The F-distribution for ν₁ > 2 is unimodal, and the mode of the distribution is at

       F = ν₂(ν₁ − 2) / [ν₁(ν₂ + 2)]

   which is always less than 1.




5. The F-distribution is skewed to the right, but as the degrees of freedom ν₁ and ν₂ become large, the
distribution approaches the normal distribution.
6. If F has an F-distribution with ν₁ and ν₂ degrees of freedom, then 1/F has an F-distribution with ν₂ and
ν₁ degrees of freedom. This result allows one to table the F-distribution for the upper tail only:

    F_{1−α}(ν₁, ν₂) = 1 / F_α(ν₂, ν₁)

7. The square of a t-distributed random variable with ν degrees of freedom has an F-distribution with 1
and ν degrees of freedom. Symbolically,

    t² = Z² / (χ²/ν) = (χ₁²/1) / (χ²/ν) = F(1, ν)

8. The F-distribution does not possess a moment generating function because some of its moments are
infinite.
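Properties 3, 6 and 7 can be verified numerically; this sketch is not part of the notes, and the degrees of freedom are arbitrary choices.

```python
# Numerical checks of F-distribution properties 3, 6 and 7 (illustrative).
import numpy as np
from scipy import stats

v1, v2 = 5, 12

# Property 3: mean v2/(v2 - 2) for v2 > 2
mean_f = stats.f.mean(v1, v2)
assert np.isclose(mean_f, v2 / (v2 - 2))

# Property 6: F_{1-a}(v1, v2) = 1 / F_a(v2, v1)
a = 0.05
assert np.isclose(stats.f.ppf(1 - a, v1, v2), 1 / stats.f.ppf(a, v2, v1))

# Property 7: the upper-tail points satisfy t_{a/2}(v)^2 = F_a(1, v)
v = 9
assert np.isclose(stats.t.ppf(1 - a / 2, v) ** 2, stats.f.ppf(1 - a, 1, v))
print("checks passed")
```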

Assumptions required for F-distribution


The F-distribution can be applied if the following assumptions are satisfied.
1. The two samples are independently and randomly selected.
2. The two populations from which samples are selected are normally distributed.

Some Useful Result for My Students

We know that

    F = (χ₁²/ν₁) / (χ₂²/ν₂)

but

    χ₁² = (n₁ − 1)s₁²/σ₁²   with   ν₁ = (n₁ − 1)

and similarly

    χ₂² = (n₂ − 1)s₂²/σ₂²   with   ν₂ = (n₂ − 1)

Now, using χ₁² and χ₂², F becomes

    F = { [(n₁ − 1)s₁²/σ₁²] / (n₁ − 1) } / { [(n₂ − 1)s₂²/σ₂²] / (n₂ − 1) } = (s₁²/σ₁²) / (s₂²/σ₂²)

Note: In testing of hypothesis where both population variances are equal (σ₁² = σ₂²), then

    F = s₁²/s₂²

Confidence Interval for the Variance Ratio σ₁²/σ₂²

Let two independent random samples of sizes n₁ and n₂ be taken from two normal populations with
variances σ₁² and σ₂², and let s₁² and s₂² be the unbiased estimates of σ₁² and σ₂². Then we know that

    F = (s₁²/σ₁²) / (s₂²/σ₂²) = (σ₂² s₁²) / (σ₁² s₂²)

Then, according to the F-distribution, the probability that a value of F will fall in the interval from
F_{1−α/2}(ν₁, ν₂) to F_{α/2}(ν₁, ν₂) is equal to 1 − α:

    P(F_{1−α/2}(ν₁, ν₂) < F < F_{α/2}(ν₁, ν₂)) = 1 − α

Putting in the value of F, we get

    P(F_{1−α/2}(ν₁, ν₂) < (σ₂² s₁²)/(σ₁² s₂²) < F_{α/2}(ν₁, ν₂)) = 1 − α

Multiplying each term in the inequality by s₂²/s₁², we obtain

    P((s₂²/s₁²) F_{1−α/2}(ν₁, ν₂) < σ₂²/σ₁² < (s₂²/s₁²) F_{α/2}(ν₁, ν₂)) = 1 − α

Now, inverting each term in the inequality (the inequality signs reverse), we get

    P( (s₁²/s₂²) · 1/F_{1−α/2}(ν₁, ν₂) > σ₁²/σ₂² > (s₁²/s₂²) · 1/F_{α/2}(ν₁, ν₂) ) = 1 − α

which is equivalent to

    P( (s₁²/s₂²) · 1/F_{α/2}(ν₁, ν₂) < σ₁²/σ₂² < (s₁²/s₂²) · 1/F_{1−α/2}(ν₁, ν₂) ) = 1 − α

We know that

    1/F_{1−α/2}(ν₁, ν₂) = F_{α/2}(ν₂, ν₁)

Therefore

    P( s₁²/(s₂² F_{α/2}(ν₁, ν₂)) < σ₁²/σ₂² < (s₁²/s₂²) F_{α/2}(ν₂, ν₁) ) = 1 − α

Thus the 100(1 − α)% confidence interval for σ₁²/σ₂² is

    ( s₁²/(s₂² F_{α/2}(ν₁, ν₂)),  (s₁²/s₂²) F_{α/2}(ν₂, ν₁) )

We can also find a confidence interval for σ₁/σ₂ by taking the square root of the endpoints of this interval.
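The final interval can be computed directly; the samples below are invented, and F_{α/2} here denotes the upper α/2 point, i.e. `f.ppf(1 - alpha/2, ...)` in SciPy.

```python
# CI for sigma1^2/sigma2^2:  ( s1^2/(s2^2 F_{a/2}(v1,v2)),  (s1^2/s2^2) F_{a/2}(v2,v1) )
import numpy as np
from scipy import stats

x1 = np.array([10.2, 9.8, 11.1, 10.5, 9.9, 10.8, 10.0])
x2 = np.array([10.1, 10.3, 9.9, 10.2, 10.0, 10.1])
s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)
v1, v2 = len(x1) - 1, len(x2) - 1
alpha = 0.05

lo = s1_sq / (s2_sq * stats.f.ppf(1 - alpha / 2, v1, v2))
hi = (s1_sq / s2_sq) * stats.f.ppf(1 - alpha / 2, v2, v1)
print(round(lo, 3), round(hi, 3))

# Square roots of the endpoints give an interval for sigma1/sigma2
print(round(np.sqrt(lo), 3), round(np.sqrt(hi), 3))
```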


Tests based on F-distribution


The following tests of hypothesis are based on the F-distribution.
1. Testing a hypothesis about the equality of two variances.
2. Testing a hypothesis about the equality of k(k > 2) population means.
3. Testing a hypothesis about linearity of regression.
4. Testing a hypothesis about various correlation co-efficients.

Testing Hypothesis about the Equality of Two Variances

Suppose that we have two independent random samples of sizes n₁ and n₂ from two normal populations
with variances σ₁² and σ₂², and we wish to test the hypothesis that the two variances are equal (that is,
H₀: σ₁²/σ₂² = 1, or equivalently H₀: σ₁² = σ₂²). Let s₁² and s₂² denote the unbiased estimates, based on
ν₁ = n₁ − 1 and ν₂ = n₂ − 1 degrees of freedom. Then the statistic is

    F = s₁²/s₂²

The procedure for testing the hypothesis that the population variances σ₁² and σ₂² are equal consists of the
following steps.
1. Formulate the null hypothesis as H₀: σ₁²/σ₂² = 1 (that is, H₀: σ₁² = σ₂²). The alternative hypothesis may be
   (a) H₁: σ₁²/σ₂² > 1
   (b) H₁: σ₁²/σ₂² < 1
   (c) H₁: σ₁²/σ₂² ≠ 1
2. Decide the level of significance α (usually 0.05 or 0.01).
3. The test statistic to use is

       F = s₁²/s₂²   (where s₁² is larger than s₂²)

   which, if H₀ is true, has an F-distribution with ν₁ and ν₂ degrees of freedom.
4. Calculate the value of F from the sample data.
5. Determine the critical region, which depends upon the size of α and the degrees of freedom:
   (a) When H₁: σ₁²/σ₂² > 1 (H₁: σ₁² > σ₂²), the critical region is F(calculated) ≥ F_α(ν₁, ν₂)
   (b) When H₁: σ₁²/σ₂² < 1 (H₁: σ₁² < σ₂²), the critical region is F(calculated) ≤ 1/F_α(ν₂, ν₁),
       since we know that F_{1−α}(ν₁, ν₂) = 1/F_α(ν₂, ν₁)
   (c) When H₁: σ₁²/σ₂² ≠ 1 (H₁: σ₁² ≠ σ₂²), the critical region is F(calculated) ≤ 1/F_{α/2}(ν₂, ν₁) or
       F(calculated) ≥ F_{α/2}(ν₁, ν₂)
6. Draw the conclusion: reject H₀ if F(calculated) falls in the critical or rejection region; otherwise accept H₀.
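The two-sided case, with the larger variance placed in the numerator as in step 3, can be sketched as follows; the data are invented.

```python
# F test of H0: sigma1^2 = sigma2^2 (two-sided, invented data).
import numpy as np
from scipy import stats

x1 = np.array([14.1, 15.3, 13.8, 16.0, 14.7, 15.5, 13.9])
x2 = np.array([14.9, 15.0, 15.2, 14.8, 15.1, 14.95])

s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)
# Put the larger variance in the numerator, as in step 3
if s1_sq < s2_sq:
    s1_sq, s2_sq = s2_sq, s1_sq
    v1, v2 = len(x2) - 1, len(x1) - 1
else:
    v1, v2 = len(x1) - 1, len(x2) - 1

F = s1_sq / s2_sq
alpha = 0.05
crit = stats.f.ppf(1 - alpha / 2, v1, v2)   # upper critical point
print("reject H0" if F >= crit else "accept H0")
```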

Chapter 7

Statistical Inference in Regression and Correlation

Prepared by Noman Rasheed

Introduction
A simple linear regression model that describes the relationship between x and y takes the form

    Yᵢ = α + βXᵢ + εᵢ

where α is the intercept term, β is the slope of the line or regression coefficient, and εᵢ is the error term or
disturbance term. The random errors εᵢ are assumed to be independent of Xᵢ and normally distributed with
E(εᵢ) = 0 and var(εᵢ) = σ²_y.x. The above regression line is estimated from the sample data by Ŷ = a + bxᵢ,
where

    b = [nΣXY − ΣX ΣY] / [nΣX² − (ΣX)²]   and   a = Ȳ − bX̄

The quantities a, b and Ŷ will vary from one sample to another. They are thus random variables, hence have
sampling distributions, and have their own mean and variance.

Mean and Variance of Sampling distribution of b

1. Mean

       μ_b = E(b) = β

2. Variance

       σ_b² = σ²_y.x / Σ(xᵢ − x̄)²

   The standard error of b is

       σ_b = σ_y.x / √Σ(xᵢ − x̄)²

Generally, σ²_y.x will be unknown; we therefore require an estimate of it from the sample data. The
unbiased estimator is given by

    s²_y.x = Σ(Yᵢ − Ŷ)² / (n − 2)

Thus the estimate of σ_b², denoted by s_b², may be taken as

    s_b² = s²_y.x / Σ(xᵢ − x̄)²

Note:

    s²_y.x = Σ(Yᵢ − Ŷ)²/(n − 2) = [ΣY² − aΣY − bΣXY]/(n − 2)   and   Σ(x − x̄)² = Σx² − (Σx)²/n


Mean and Variance of Sampling distribution of a

1. Mean

       μ_a = E(a) = α

2. Variance

       σ_a² = σ²_y.x [ 1/n + X̄² / Σ(X − X̄)² ]

When σ²_y.x is unknown, we use s²_y.x in place of σ²_y.x.

Confidence Interval Estimate of Population Regression coefficient β

To construct a confidence interval for the population regression coefficient β, we use b, the sample
estimate of β. The sampling distribution of b is normal with mean β and standard deviation
σ_b = σ_y.x / √Σ(xᵢ − x̄)². Then the statistic is

    Z = (b − β) / [σ_y.x / √Σ(xᵢ − x̄)²]

But σ_y.x is generally not known; we therefore estimate it from the sample data and use Student's
t-distribution rather than the normal distribution. In other words, the statistic, with degrees of freedom
ν = n − 2, is

    t = (b − β) / [s_y.x / √Σ(xᵢ − x̄)²]

Hence a 100(1 − α)% confidence interval for the population regression coefficient β for a particular sample
of size n (n < 30) is given by

    b ± t_{α/2}(n − 2) · s_y.x / √Σ(xᵢ − x̄)²
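The slope interval can be sketched as follows; the (x, y) data are invented, and `scipy.stats.linregress` is used to cross-check b and its standard error.

```python
# CI for beta:  b ± t_{a/2}(n-2) * s_yx / sqrt(Sum (x - xbar)^2)   (invented data)
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9])
n = len(x)

Sxx = np.sum((x - x.mean()) ** 2)
b = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
a = y.mean() - b * x.mean()
resid = y - (a + b * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))   # unbiased estimate of sigma_y.x
s_b = s_yx / np.sqrt(Sxx)

alpha = 0.05
tcrit = stats.t.ppf(1 - alpha / 2, n - 2)
print(round(b - tcrit * s_b, 3), round(b + tcrit * s_b, 3))

# Cross-check slope and its standard error against scipy.stats.linregress
res = stats.linregress(x, y)
assert np.isclose(b, res.slope) and np.isclose(s_b, res.stderr)
```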

Confidence Interval Estimate of α, the intercept of Regression Line

To construct a confidence interval for α, we use a, the sample estimate of α. We know that a is normally
distributed with mean μ_a = α and standard deviation

    σ_a = σ_y.x √( 1/n + X̄² / Σ(Xᵢ − X̄)² )

Since σ_y.x is usually unknown, we use its unbiased sample estimate s_y.x; then the statistic, with degrees
of freedom ν = n − 2, is

    t = (a − α) / [ s_y.x √( 1/n + X̄² / Σ(Xᵢ − X̄)² ) ]

Hence a 100(1 − α)% confidence interval for α when the sample size is n (n < 30) is given by

    a ± t_{α/2}(n − 2) · s_y.x √( 1/n + X̄² / Σ(Xᵢ − X̄)² )

Testing hypothesis about the Population Regression Co-efficient β

Suppose that we wish to test the hypothesis that the population regression coefficient β has some specified
value β₀. We draw a random sample (x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ) of n pairs of observations from a bivariate
normal population and obtain b from the data. It is an estimate of the population regression coefficient β and
follows a normal distribution with mean β and standard deviation σ_b = σ_y.x / √Σ(xᵢ − x̄)². When σ_y.x
is unknown, it is estimated from the sample data. If the regression coefficient has the specified value β₀, then
the statistic

    t = (b − β₀) / s_b,   where   s_b = s_y.x / √Σ(Xᵢ − X̄)²

has a t-distribution with ν = n − 2 degrees of freedom. The testing procedure is
1. Formulate the null and alternative hypothesis about β. Three possible forms are
   a) H₀: β = β₀ and H₁: β ≠ β₀
   b) H₀: β ≤ β₀ and H₁: β > β₀
   c) H₀: β ≥ β₀ and H₁: β < β₀
2. Decide the level of significance α (take α = 0.05 or 0.01).
3. The test statistic in this case is t = (b − β₀)/s_b.
4. Calculate the value of t from the sample data.
5. Determine the rejection region, which depends on the alternative hypothesis and can be described as

   When the Alternative Hypothesis is      The rejection region will be
   a) H₁: β ≠ β₀ (two-sided)               t(calculated) > t_{α/2,ν} or t(calculated) < −t_{α/2,ν}
   b) H₁: β < β₀ (one-sided)               t(calculated) < −t_{α,ν}
   c) H₁: β > β₀ (one-sided)               t(calculated) > t_{α,ν}

6. Conclusion: Reject H₀ if the value of t(calculated) falls in the rejection region; otherwise accept it.

Testing hypothesis about the intercept of Regression line

Suppose that we wish to test the hypothesis that the intercept α of the regression line has some specified
value α₀. We draw a random sample (x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ) of n pairs of observations from a bivariate
normal population and obtain a from the data. It is an estimate of the intercept of the regression line and
follows a normal distribution with mean α and standard deviation

    σ_a = σ_y.x √( 1/n + X̄² / Σ(Xᵢ − X̄)² )

When σ_y.x is unknown, it is estimated from the sample data. If the intercept of the regression line has the
specified value α₀, then the statistic

    t = (a − α₀) / s_a,   where   s_a = s_y.x √( 1/n + X̄² / Σ(Xᵢ − X̄)² )

has a t-distribution with ν = n − 2 degrees of freedom. The testing procedure is
1. Formulate the null and alternative hypothesis about α. Three possible forms are
   a) H₀: α = α₀ and H₁: α ≠ α₀
   b) H₀: α ≤ α₀ and H₁: α > α₀
   c) H₀: α ≥ α₀ and H₁: α < α₀
2. Decide the level of significance. (Take the level of significance to be 0.05 or 0.01.)
3. The test statistic in this case is t = (a − α₀)/s_a.
4. Calculate the value of t from the sample data.
5. Determine the rejection region, which depends on the alternative hypothesis and can be described as

   When the Alternative Hypothesis is      The rejection region will be
   a) H₁: α ≠ α₀ (two-sided)               t(calculated) > t_{α/2,ν} or t(calculated) < −t_{α/2,ν}
   b) H₁: α < α₀ (one-sided)               t(calculated) < −t_{α,ν}
   c) H₁: α > α₀ (one-sided)               t(calculated) > t_{α,ν}

6. Conclusion: Reject H₀ if the value of t(calculated) falls in the rejection region; otherwise accept it.


Sampling distribution of sample correlation coefficient r

Let r be the sample correlation coefficient obtained from a random sample of n pairs of values from a
bivariate normal population having a linear correlation ρ. Then r is used as the estimator of ρ. The sampling
distribution of r depends upon ρ and n. The standard deviation of the sampling distribution of r is
approximately equal to (1 − ρ²)/√n. The sampling distribution of r is far from a normal distribution for
large values of ρ; it is sharply skewed in the neighbourhood of ρ = ±1. If the sample is large enough
(n > 400) and if ρ is only moderately large, then r is approximately normal with mean ρ and standard
deviation (1 − ρ²)/√n, but (1 − r²)/√n is commonly used because ρ is unknown; thus the statistic is

    Z = (r − ρ) / [(1 − r²)/√n]

This is not recommended for use when n is small and ρ is large; instead, the non-normal distribution of r can
be changed by a simple transformation into an approximately normal distribution. The transformation is
known as Fisher's z-transformation: the variable

    Z_f = ½ ln[(1 + r)/(1 − r)] = 1.1513 log[(1 + r)/(1 − r)]

is approximately normal with mean

    μ_z = ½ ln[(1 + ρ)/(1 − ρ)] = 1.1513 log[(1 + ρ)/(1 − ρ)]

and standard deviation 1/√(n − 3). Hence the statistic is

    Z = (z_f − μ_z) / (1/√(n − 3))

We know that the sampling distribution of r is skewed when ρ is not zero. However, when ρ = 0 the sampling
distribution of r is symmetric. Thus, when the random variables x and y are normally distributed and ρ = 0,
the t-distribution is used and the statistic is

    t = r√(n − 2) / √(1 − r²)

with ν = n − 2 degrees of freedom.

Confidence Interval Estimate for ρ

We know that Z_f = 1.1513 log[(1 + r)/(1 − r)] is approximately normal with mean
μ_z = 1.1513 log[(1 + ρ)/(1 − ρ)] and standard deviation 1/√(n − 3); thus the standard normal variable is

    Z = (z_f − μ_z) / (1/√(n − 3))

According to the normal distribution, the probability that a value of Z will fall in the interval from −Z_{α/2}
to Z_{α/2} is equal to 1 − α:

    P(−Z_{α/2} ≤ Z ≤ Z_{α/2}) = 1 − α

Inserting Z, we get

    P(−Z_{α/2} ≤ (z_f − μ_z)/(1/√(n − 3)) ≤ Z_{α/2}) = 1 − α

Dividing by √(n − 3):

    P(−Z_{α/2}/√(n − 3) ≤ z_f − μ_z ≤ Z_{α/2}/√(n − 3)) = 1 − α

Subtracting z_f, we get

    P(−z_f − Z_{α/2}/√(n − 3) ≤ −μ_z ≤ −z_f + Z_{α/2}/√(n − 3)) = 1 − α

Multiplying by −1 (the inequality signs reverse):

    P(z_f + Z_{α/2}/√(n − 3) ≥ μ_z ≥ z_f − Z_{α/2}/√(n − 3)) = 1 − α

which is equivalent to

    P(z_f − Z_{α/2}/√(n − 3) ≤ μ_z ≤ z_f + Z_{α/2}/√(n − 3)) = 1 − α

Hence the 100(1 − α)% confidence interval for μ_z is given by

    ( z_f − Z_{α/2}/√(n − 3),  z_f + Z_{α/2}/√(n − 3) )
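The interval above is for μ_z, so its endpoints are transformed back to ρ with tanh (the inverse of z = arctanh r). A sketch with an invented r and n:

```python
# Fisher-z confidence interval for rho (invented r and n).
import numpy as np
from scipy import stats

r, n = 0.62, 40
alpha = 0.05

zf = np.arctanh(r)                     # = 0.5 * ln((1+r)/(1-r))
zc = stats.norm.ppf(1 - alpha / 2)     # Z_{alpha/2}
lo_z = zf - zc / np.sqrt(n - 3)
hi_z = zf + zc / np.sqrt(n - 3)

# Back-transform the z-interval to an interval for rho
lo_rho, hi_rho = np.tanh(lo_z), np.tanh(hi_z)
print(round(lo_rho, 3), round(hi_rho, 3))
```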

Testing hypothesis that ρ has a Specified Value other than Zero

We know that Z_f = 1.1513 log[(1 + r)/(1 − r)] is approximately normal with mean
μ_z = 1.1513 log[(1 + ρ)/(1 − ρ)] and standard deviation 1/√(n − 3); thus the standard normal variable is

    Z = (z_f − μ_z) / (1/√(n − 3))

Then the testing procedure is
1. Formulate the null and alternative hypothesis about ρ. Three possible forms are
   a) H₀: ρ = ρ₀ and H₁: ρ ≠ ρ₀
   b) H₀: ρ ≤ ρ₀ and H₁: ρ > ρ₀
   c) H₀: ρ ≥ ρ₀ and H₁: ρ < ρ₀
2. Decide the level of significance α (take α = 0.05 or 0.01).
3. The test statistic in this case is

       Z = (z_f − μ_z) / (1/√(n − 3))

   where z_f = 1.1513 log[(1 + r)/(1 − r)] and μ_z = 1.1513 log[(1 + ρ₀)/(1 − ρ₀)].
4. Calculate the value of Z from the sample data.
5. Determine the rejection region, which depends on the alternative hypothesis and can be described as

   When the Alternative Hypothesis is      The rejection region will be
   a) H₁: ρ ≠ ρ₀ (two-sided)               Z(calculated) > Z_{α/2} or Z(calculated) < −Z_{α/2}
   b) H₁: ρ < ρ₀ (one-sided)               Z(calculated) < −Z_α
   c) H₁: ρ > ρ₀ (one-sided)               Z(calculated) > Z_α

6. Conclusion: Reject H₀ if the value of Z(calculated) falls in the rejection region; otherwise accept it.

Testing hypothesis about the Equality of Two Correlations

Let r₁ and r₂ be the correlation coefficients of two random samples of n₁ and n₂ pairs, drawn from two
bivariate normal populations with correlation coefficients ρ₁ and ρ₂. Then, to test the hypothesis
H₀: ρ₁ = ρ₂, we calculate

    z_f1 = 1.1513 log[(1 + r₁)/(1 − r₁)]   and   z_f2 = 1.1513 log[(1 + r₂)/(1 − r₂)]

Since z_f1 and z_f2 are approximately normally distributed, the difference z_f1 − z_f2, if H₀: ρ₁ = ρ₂
is true, is approximately normally distributed with mean zero and standard deviation
√(1/(n₁ − 3) + 1/(n₂ − 3)), and the test statistic is

    Z = (z_f1 − z_f2) / √(1/(n₁ − 3) + 1/(n₂ − 3))

which is approximately standard normal. The testing procedure is
1. Formulate the null and alternative hypothesis from the following three forms.
   a) H₀: ρ₁ = ρ₂ and H₁: ρ₁ ≠ ρ₂
   b) H₀: ρ₁ ≤ ρ₂ and H₁: ρ₁ > ρ₂
   c) H₀: ρ₁ ≥ ρ₂ and H₁: ρ₁ < ρ₂
2. Decide the level of significance α (take α = 0.01 or 0.05).
3. The test statistic Z, under H₀, becomes

       Z = (z_f1 − z_f2) / √(1/(n₁ − 3) + 1/(n₂ − 3))

4. Compute the value of Z from the sample data.
5. Determine the rejection region, which depends upon the alternative hypothesis and can be described as

   When the Alternative Hypothesis is      The rejection region will be
   a) H₁: ρ₁ ≠ ρ₂ (two-sided)              Z(calculated) > Z_{α/2} or Z(calculated) < −Z_{α/2}
   b) H₁: ρ₁ < ρ₂ (one-sided)              Z(calculated) < −Z_α
   c) H₁: ρ₁ > ρ₂ (one-sided)              Z(calculated) > Z_α

6. Conclusion: Reject H₀ if the value of Z(calculated) falls in the rejection region; otherwise accept it.
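The two-sample z test can be sketched as follows; the correlation values and sample sizes are invented, and `np.arctanh` is the Fisher transform.

```python
# Two-sample test of H0: rho1 = rho2 via Fisher's z (invented r's and n's).
import numpy as np
from scipy import stats

r1, n1 = 0.70, 35
r2, n2 = 0.45, 40

z1, z2 = np.arctanh(r1), np.arctanh(r2)        # Fisher transforms
se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
Z = (z1 - z2) / se

alpha = 0.05
zcrit = stats.norm.ppf(1 - alpha / 2)
print(round(Z, 3), "reject H0" if abs(Z) > zcrit else "accept H0")
```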

Testing hypothesis that ρ = 0

We are often interested in testing the null hypothesis that the population correlation coefficient ρ equals
zero. That is, we wish to test H₀: ρ = 0 (there is no linear correlation between the variables x and y).
We know that the sampling distribution of r, the sample correlation coefficient, is skewed when ρ is not
zero. However, when ρ = 0, the sampling distribution of r is symmetric. This property makes it possible
to test the hypothesis H₀: ρ = 0 by using the t-distribution. Thus, when the random variables x and y are
normally distributed and ρ = 0, the statistic

    t = r√(n − 2) / √(1 − r²)

has a Student's t-distribution with ν = n − 2 degrees of freedom, and the testing procedure is
1. Formulate the null and alternative hypothesis about ρ. Three possible forms are
   a) H₀: ρ = 0 and H₁: ρ ≠ 0
   b) H₀: ρ ≤ 0 and H₁: ρ > 0
   c) H₀: ρ ≥ 0 and H₁: ρ < 0
2. Decide the level of significance α (take α = 0.05 or 0.01).
3. The test statistic in this case is t = r√(n − 2)/√(1 − r²).
4. Calculate the value of t from the sample data.
5. Determine the rejection region, which depends on the alternative hypothesis and can be described as

   When the Alternative Hypothesis is      The rejection region will be
   a) H₁: ρ ≠ 0 (two-sided)                t(calculated) > t_{α/2,ν} or t(calculated) < −t_{α/2,ν}
   b) H₁: ρ < 0 (one-sided)                t(calculated) < −t_{α,ν}
   c) H₁: ρ > 0 (one-sided)                t(calculated) > t_{α,ν}

6. Conclusion: Reject H₀ if the value of t(calculated) falls in the rejection region; otherwise accept it.
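The t-statistic above is the same one behind `scipy.stats.pearsonr`'s two-sided p-value, which this sketch (with invented data) verifies.

```python
# Test of H0: rho = 0 via t = r*sqrt(n-2)/sqrt(1-r^2) (invented data).
import numpy as np
from scipy import stats

x = np.array([1.2, 2.4, 3.1, 4.0, 5.3, 6.1, 7.2, 8.0])
y = np.array([2.0, 2.2, 3.5, 3.9, 5.0, 5.8, 6.9, 7.1])
n = len(x)

r = np.corrcoef(x, y)[0, 1]
t_calc = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)

# pearsonr's two-sided p-value comes from this same t with n-2 df
p_hand = 2 * stats.t.sf(abs(t_calc), n - 2)
r2, p2 = stats.pearsonr(x, y)
assert np.isclose(r, r2) and np.isclose(p_hand, p2)
print(round(r, 3), round(t_calc, 2))
```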


Chapter 8

Vital Statistics

Prepared by Noman Rasheed

Vital Events
There are some factors which cause changes in the size and composition of a human population; such
factors are called vital events. Examples are births, deaths, migrations, marriages, divorces, sickness,
adoptions, etc.

Vital Statistics
The collection, presentation and analysis of vital events constitute vital statistics. Vital statistics includes
the whole study of man and throws light on various social and medical problems.
Factors which change the size of population
Birth
Death
Migrations

Factors which affect the population composition


Marriages
Divorces
Sickness

Sources/Collection of Vital Data

There are two traditional sources of data on population, namely, the census and the registration system.
Vital statistics can be obtained from a registration system, while the composition of the population can be
determined from a census.

Census
A complete count of the population at a fixed point in time is called a census. In Pakistan the 1st census
was held in 1951, the 2nd in 1961, the 3rd in 1972, the 4th in 1981 and the 5th in 1998.

Registration
Keeping a record of all vital events like births, deaths, still births, marriages and divorces etc. is known as
registration. It gives the vital statistics, but does not provide the composition of the population into
categories, e.g. age and sex etc.
Q:How will you describe the registration system of births and deaths in Pakistan?
Ans:In registration system all events like birth, death, still birth, marriages and divorced etc. are recorded.The
registration of births and deaths is carried out in Pakistan as under.
1. In Rural Areas
The registration of birth and death in rural areas is carried out under order of basic democracies of
1959. It places its duty on the union council.The process is as under


(a) The head of the household reports to the chowkidar, who registers the events at the union council.
(b) The union council sends a copy to the District Health Officer (DHO).
2. In Urban Areas
In urban areas the registration of birth and death is carried out under order of basic democracies of
1960. The process is as under.
(a) The head of household reports the birth and death at municipal committee, town committee or
cantonment board etc.
(b) Municipal committee, town committee or cantonment board sends a copy to the District Health
Officer (DHO).

The copies of births and deaths from rural and urban areas are sent to Divisional Health Directorates then to
Provincial Health Directorates.The annual statements from provincial health directorates are sent to director
general health Govt of Pakistan.

Uses/Advantages of Vital Statistics


Vital Statistics is an important branch of Statistics.Its importance lies in the fact that it deals with events
which are very important in human life.Some of its important uses are as follows.
1. Records of births, deaths, marriages, divorces, etc. are of immense use to the individuals.
2. Insurance companies make use of vital Statistics in determining the rates of premium.
3. Vital Statistics provide a numerical assessment of the state of public health, hygienic conditions and
availability of medical facilities.
4. Vital Statistics are used by various government agencies for a number of administrative purposes.
5. Use of vital Statistics in planning for economic development is indispensable (essential).
6. Vital Statistics are of immense use to the businessmen.
7. Vital Statistics are indispensable in demographic research.
8. Accurately registered and systemically collected vital Statistics can be used to check up the accuracy
of data provided by the census.
9. Medical and pharmaceutical research is carried out on the basis of mortality and natality data.

Shortcomings of Vital Statistics


The vital Statistics in Pakistan suffer from many defects.Some of the major defects are given below.
1. There is evidence that many births escape registration, especially in rural areas.
2. Another defect is the delayed reporting of births and deaths by the reporting agencies.
3. Data on ages of females is usually unreliable.
4. Lack of knowledge causes many inaccuracies.
5. Data on widowed and divorced women are misreported.
6. Delays in collection and tabulation misrepresent the results.

Ratios and Rates


For purposes of comparison, we need relative numbers. In vital Statistics, the commonly used relative numbers are the ratios and rates.

1. Ratio
The ratio of one number a to another number c is defined as a divided by c. It indicates the relative size of two numbers, where a and c represent separate and distinct categories. In vital Statistics, a ratio expresses the relation of a given kind of event to the occurrence of other events, or of one kind of data to another. Thus

Ratio = a / c

where
a denotes the number of times the given kind of event occurs, and
c denotes the number of times another event occurs.
Vital ratios are usually multiplied by 100 for ease in understanding and recording.
2. Rate
A rate is a type of ratio, which in vital Statistics may be defined as a numerical proportion of the number of vital events to the population in which the events took place. In other words,

Rate = a / (a + b)

where a stands for the number of times the given vital event occurs and b denotes the number of times the event does not occur.
Vital rates are usually multiplied by 1000 for ease in understanding and recording.

Vital Ratios
There are several ratios which are used in vital Statistics, depending upon the need of the study. The commonly used vital ratios are
1. Sex Ratio
2. Child-Women Ratio
3. Birth-Death Ratio/Vital Index

Sex Ratio
The ratio between males and females in a population is called the sex ratio. It is computed by dividing the number of males in a population by the number of females in the same population, and the result is expressed as a percentage. In other words,

Sex Ratio = (Number of Males / Number of Females) × 100

Interpretation
The sex ratio indicates the number of males per 100 females.
A sex ratio of more than 100 indicates that there are more men than women in the population.
A sex ratio of less than 100 indicates that there are fewer men than women in the population.
A sex ratio of 100 indicates that men and women are equal in number in the population.
Sometimes we are interested in the sex ratio of a portion of the population. For example, the sex ratio at birth describes the sex composition of the live births at a specified time. It is given by

Sex Ratio at Birth = (Bm / Bf) × 100

where
Bm = number of male live births, and
Bf = number of female live births.
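As a quick illustration, both ratios can be computed directly. The counts below are hypothetical, chosen only to show the arithmetic:

```python
# Hypothetical population counts, for illustration only.
males = 52_000
females = 50_000
male_births = 2_650
female_births = 2_500

sex_ratio = males / females * 100                     # males per 100 females
sex_ratio_at_birth = male_births / female_births * 100

print(round(sex_ratio, 1))           # 104.0
print(round(sex_ratio_at_birth, 1))  # 106.0
```

Both values exceed 100 here, which by the interpretation above means males outnumber females in this hypothetical population and among its live births.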


CHAPTER 8. VITAL STATISTICS

Child-Women Ratio
The ratio between children under 5 years of age and women of child-bearing age is called the child-women ratio. The child-bearing age is defined sometimes by the age group 15-44 and sometimes by the age group 15-49. The child-women ratio is computed by the formula

Child-Women Ratio = (P0-4 / f15-44) × 100

where
P0-4 denotes the number of children of both sexes (male and female) combined under 5 years of age, and
f15-44 denotes the number of females (women) between the ages 15-44 (or f15-49 between the ages 15-49).

Birth-Death Ratio or Vital Index


The ratio between the total number of births and the total number of deaths in a population during a particular year is called the birth-death ratio or vital index. It is computed by the formula

Vital Index = (Total Number of Births / Total Number of Deaths) × 100

Interpretation
A vital index of more than 100 indicates that the population is increasing and is in a healthy condition.
A vital index of less than 100 indicates that the population is decreasing.
A vital index of 100 indicates that the population is stable.

Population Growth Rate


The annual population growth rate is computed by dividing the increase in population during the year by the population at the beginning of that year, when the total population of a country is available each year.

Annual population growth rate = Increase in population during the year / Population at the beginning of the year

When the total population is not available each year, the following formula is used:

Pn = P0 (1 + r)^n,  so that  r = (Pn / P0)^(1/n) - 1

where
P0 denotes the population at the beginning of the period/decade,
Pn denotes the population after n years,
n denotes the intercensal period, and
r denotes the unknown rate of change of population.
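The rearranged formula for r can be checked numerically. The census figures below are hypothetical, used only to show the computation:

```python
# Hypothetical intercensal figures, for illustration only.
P0 = 84_254   # population at the earlier census
Pn = 132_352  # population at the later census
n = 17        # intercensal period in years

r = (Pn / P0) ** (1 / n) - 1   # average annual growth rate
print(round(r * 100, 2))       # 2.69 (percent per year)

# Sanity check: growing P0 at rate r for n years recovers Pn.
assert abs(P0 * (1 + r) ** n - Pn) < 1e-6
```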

Classification of Vital Rates


The commonly employed rates in vital Statistics may be classified as follows.
1. Death Rates or Mortality Rates
The kinds of death rates are
(a) Crude Death Rate
(b) Specific Death Rate
(c) Infant Mortality Rate
(d) Standardized Death Rate

2. Birth Rates or Natality Rates
The commonly used birth rates are
(a) Crude Birth Rate
(b) Specific Birth Rate
(c) Standardized Birth Rate
3. Reproduction Rates
There are two main types of such rates
(a) Gross Reproduction Rate
(b) Net Reproduction Rate
4. Morbidity Rates or Sickness Rates
5. Marriage Rates
6. Divorce Rates

Crude Death Rate


For a given area, the crude death rate may be defined as the ratio of total registered deaths in some specified year to the total midyear population in the same year, multiplied by 1000. It is computed as follows:

C.D.R = (D / P) × 1000

where C.D.R stands for crude death rate,
D denotes the total number of deaths from all causes during a calendar year, and
P denotes the midyear total population (which is taken as an estimate of the average population during the whole calendar year) during the same year.
Advantages
The crude death rate is perhaps the most widely used vital rate because it is easily understood and quickly computed. It is used to measure the probability of dying of a person in the population.
Disadvantage
It is a well-known fact that mortality varies with age, sex, race and occupation, but the crude death rate ignores all these factors. It can therefore be misleading and should not be used for comparison between areas.
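A minimal sketch of the crude death rate computation, with hypothetical figures:

```python
# Hypothetical registered deaths and midyear population, for illustration only.
deaths = 1_250          # D: deaths from all causes in the calendar year
mid_year_pop = 180_000  # P: midyear total population

cdr = deaths / mid_year_pop * 1000  # deaths per 1000 population
print(round(cdr, 2))                # 6.94
```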

Specific Death Rates


When death rates are computed for some specific class of people or specific age group of a population, they are called specific death rates. The specification may be made with respect to age, sex, marital status, occupation, religion, etc. The most commonly used specific death rates are the age-specific and sex-specific death rates. Age-specific death rates are computed by the formula

A.S.D.R = (di / Pi) × 1000

where A.S.D.R denotes the age-specific death rate,
di denotes the number of deaths occurring in the ith age group, and
Pi denotes the midyear population in the ith age group.

Infant Mortality Rate


It is defined as the ratio of registered deaths of infants (under one year of age) during a specified year to the total live births registered in the same year. The formula thus becomes

I.M.R = (d0 / B) × 1000


Where I.M.R stands for infant mortality rate.


d0 denotes the number of deaths (excluding foetal deaths) under one year of age registered during a given
year in a locality, and
B denotes the number of live births registered during the same year in the same locality.
Inaccuracy in I.M.R
The infant mortality rate does not provide an accurate measure of the risk of death during the first year of life because:
1. Infants are usually under-enumerated.
2. The babies who die immediately after birth are often not registered as live births.
3. Sometimes infant deaths are not separated from stillbirths and abortions.
4. Some of the deaths under one year of age during a calendar year must have been of infants who had been born in the preceding calendar year.
Purposes of Infant Mortality Rate
A low infant mortality rate signifies that maternity cases are well attended, medical care facilities are adequate, hygienic conditions are good, etc. Thus it serves as an indicator of the level of healthiness of a society.
Foetal Death
A foetal death is the death of a product of conception, before its complete expulsion from the mother, after at least 20 weeks of gestation, where the child does not show any signs of life.
Still Birth
A still birth, generally termed a late foetal death, is one in which the child shows no signs of life after being completely separated from the mother.

Standardized Death Rates


The crude death rates of two localities or of two occupations cannot be compared because mortality rates differ with age, sex, climate, occupation, etc. Although comparisons can be made with specific death rates, such an investigation requires an enormous amount of data, which is a difficult task. A need is therefore felt for a single mortality rate which sums up the rates at all ages and enables satisfactory comparisons between the mortality rates of one locality and those of another, or the mortality rates of the same locality over the years. The death rate used for this purpose is known as the standardized death rate, corrected death rate or adjusted death rate.
Methods
There are two methods to calculate standardized death rate
1. Direct Method
2. Indirect Method

Direct Method
In the direct method, the standardized death rate is obtained as the ratio of the expected deaths in the standard population to the total standard population:

S.D.R = (Expected deaths in standard population / Total standard population) × 1000

S.D.R = [ Σ (di/pi) Pi / Σ Pi ] × 1000

where
(di/pi) Pi is the expected number of deaths in the standard population,
di = number of deaths in the actual population of the ith age group,
pi = midyear actual population of the ith age group, and
Pi = midyear standard population of the ith age group.
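The direct method can be sketched with three hypothetical age groups. The figures are invented only to show the steps:

```python
# Hypothetical age-group data, for illustration only.
di = [40, 15, 60]            # deaths in the actual population
pi = [8_000, 10_000, 4_000]  # midyear actual population
Pi = [9_000, 9_000, 6_000]   # midyear standard population

# Expected deaths in the standard population: (di/pi) * Pi summed over age groups.
expected = sum(d / p * P for d, p, P in zip(di, pi, Pi))

sdr = expected / sum(Pi) * 1000  # deaths per 1000 of the standard population
print(round(sdr, 2))             # 6.19
```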


If sex-wise data are available, then the direct formula is

S.D.R = [ Σ (dim/pim) Pim + Σ (dif/pif) Pif ] / [ Σ Pim + Σ Pif ] × 1000

where the subscripts m and f denote males and females respectively.

Indirect Method
In the indirect method, the standardized death rate is obtained by multiplying the crude death rate of the standard population by the ratio of the actual deaths in the actual population to the expected deaths in the actual population:

S.D.R = (No. of deaths in actual population / Expected deaths in actual population) × C.D.R of standard population

S.D.R = [ Σ di / Σ (Di/Pi) pi ] × [ (Σ Di / Σ Pi) × 1000 ]

where
(Di/Pi) pi = expected deaths in the actual population,
di = number of deaths in the actual population of the ith age group,
pi = midyear actual population of the ith age group,
Pi = midyear standard population of the ith age group, and
Di = number of deaths in the standard population of the ith age group.
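The indirect method can be sketched with the same kind of hypothetical age-group data:

```python
# Hypothetical age-group data, for illustration only.
di = [40, 15, 60]            # deaths in the actual population
pi = [8_000, 10_000, 4_000]  # midyear actual population
Di = [54, 27, 30]            # deaths in the standard population
Pi = [9_000, 9_000, 6_000]   # midyear standard population

cdr_standard = sum(Di) / sum(Pi) * 1000  # C.D.R of the standard population

# Expected deaths in the actual population: (Di/Pi) * pi summed over age groups.
expected_actual = sum(D / P * p for D, P, p in zip(Di, Pi, pi))

sdr = sum(di) / expected_actual * cdr_standard
print(round(sdr, 2))  # 5.43
```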

If sex-wise data are available, then

S.D.R = [ Σ dim + Σ dif ] / [ Σ (Dim/Pim) pim + Σ (Dif/Pif) pif ] × C.D.R of standard population

where the C.D.R of the standard population is (Σ Dim + Σ Dif) / (Σ Pim + Σ Pif) × 1000, and the subscripts m and f denote males and females respectively.

Note: If the population is changing slowly, then both the direct and indirect methods give the same results. Standardized death rates are used when two populations are equivalent in their age distribution but differ in occupation, climate, sex and number of people, especially in the early and late age groups.

Crude Birth Rate


It is the ratio of total registered live births during a calendar year to the total midyear population during the same year, multiplied by 1000. It is computed by the formula

C.B.R = (B / P) × 1000

Where C.B.R. stands for crude birth rate.


B denotes the total number of live births registered during a given year, and
P denotes the midyear total population during the same year.

Age-Specific Birth Rate


The birth performance varies with the age of the mother and the years of married life. To get a true picture of fertility, we need age-specific birth rates. An age-specific birth rate is defined as the number of births per 1000 women of a given age group. Thus the age-specific birth rate is given by

fi = (Bi / Pif) × 1000

where fi is the age-specific birth rate,


Bi = number of registered births to women of the ith age-group during a year
Pif = number of women in the ith age-group at the middle of the year.


Standardized Birth Rate


The birth rate differs with sex composition, the number of married women, age at marriage, occupation of women of child-bearing age, health of the couple, competition in producing children in the society, etc. Moreover, it is affected by attitudes towards family planning. For these reasons the crude birth rate is unsuitable for comparing the birth rates of two localities. Hence we standardize birth rates to make comparisons. The standardized birth rate is obtained by the direct or indirect method.
Fertility
The actual production of children by women is called fertility. It is measured by the number of births.
Fecundity
The physiological ability of women to produce children is known as fecundity. There is no direct measurement of it.
Sterility
The inability of women to produce children is known as sterility. It is the opposite of fecundity.

General Fertility Rate


The general fertility rate is the ratio of all live births registered during a year to the number of women of child-bearing age. It is computed by the formula

G.F.R = (B / Pif) × 1000

Where G.F.R. stands for general fertility rate,


B denotes the total number of live-births registered during the year, and
Pif denotes the midyear population of women of child-bearing age.
The fertility rate is general in the sense that it attributes all births to all women in the child-bearing age groups. But the number of births depends upon the number of married women of child-bearing age.
Note: The general fertility rate may be further refined by restricting the base population (females) to married or ever-married women. Such a fertility rate is known as the marital fertility rate.

Age-Specific Fertility Rate


Fertility varies with a number of factors such as age, duration of marriage, occupation, social class, area of residence, etc. Therefore, instead of the general fertility rate, a specific fertility rate is used, which is computed by the following formula:

Age-Specific Fertility Rate = (Bi / Pif) × 1000

Where Bi denotes the number of live births occurring to mothers of the ith age-group during a year,
Pif denotes the midyear female population of the same age-group during the same year.
Note
The terms age-specific birth rates and the age-specific fertility rates are used interchangeably.

Total Fertility Rate


The number of babies who would be born to a group of 1000 women throughout their reproductive life is known as the total fertility rate. It is computed by adding the age-specific fertility rates and then multiplying the sum by the size of the class interval of the age groups. In other words, the total fertility rate is computed by the formula

T.F.R = 5 × Σ (age-specific fertility rate)
      = 5 × Σ (Bi / Pif × 1000)
where T.F.R stands for total fertility rate,
Bi denotes the total live births in ith age-group and

Pif denotes the midyear female population in the same age-group during the same year.
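The T.F.R computation can be sketched over seven 5-year age groups (15-49). The births and female populations below are hypothetical:

```python
# Hypothetical data for the seven 5-year age groups 15-19, ..., 45-49.
births = [300, 1_100, 1_300, 900, 500, 200, 50]              # Bi
women = [10_000, 9_500, 9_000, 8_500, 8_000, 7_500, 7_000]   # Pif

# Age-specific fertility rates: births per 1000 women in each group.
asfr = [b / w * 1000 for b, w in zip(births, women)]

tfr = 5 * sum(asfr)  # each age group spans 5 years
print(round(tfr))    # 2462 births per 1000 women over the reproductive span
```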

Reproduction Rates
A reproduction rate gives an indication of the number of females which a female will produce over her child-bearing age to replace herself.
There are two types of reproduction rates.
1. Gross Reproduction Rate
2. Net Reproduction Rate

Gross Reproduction Rate


The number of female babies who would be born to a woman throughout her reproductive life is known as the gross reproduction rate. It is computed by adding the age-specific fertility rates for female babies and multiplying the sum by the size of the class interval of the age groups (usually 5). Thus it is obtained as

G.R.R = 5 × Σ (age-specific fertility rates for female babies)
      = 5 × Σ (bif / Pif)
where,
bif = Number of female babies born to women of ith age in the given year.
Pif = Number of women of ith age group of child bearing age of same year.

Net Reproduction Rate


The number of female babies born to a woman throughout her reproductive life who would survive to reach their child-bearing age is known as the net reproduction rate. It is computed by adding the age-specific fertility rates for female babies who would become mothers and multiplying the sum by the size of the class interval of the age groups (usually 5). Thus it is obtained as

N.R.R = 5 × Σ (age-specific fertility rates for female babies × probability of survival)
      = 5 × Σ (bif / Pif) × P(S)
Where,
bif = Number of female babies born to women of ith age in the given year.
Pif = Number of women of ith age group of child bearing age of same year.
P (S) = Probability of survival.

Purposes of N.R.R
The N.R.R (net reproduction rate) is used to measure how the female population is replacing itself.
1. If N.R.R = 1, the number of potential mothers stays the same; hence the population is stable.
2. If N.R.R > 1, the number of potential mothers is increasing and, as a result, the population is increasing.
3. If N.R.R < 1, the number of potential mothers is decreasing and, as a result, the population is decreasing.
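The G.R.R and N.R.R can be sketched together. The female-birth counts and survival probabilities below are hypothetical, chosen only to show the steps:

```python
# Hypothetical data for seven 5-year age groups of child-bearing age.
female_births = [150, 540, 640, 440, 240, 100, 25]          # bif
women = [10_000, 9_500, 9_000, 8_500, 8_000, 7_500, 7_000]  # Pif
survival = [0.95, 0.94, 0.93, 0.92, 0.91, 0.90, 0.89]       # P(S) per group

grr = 5 * sum(b / w for b, w in zip(female_births, women))
nrr = 5 * sum(b / w * s for b, w, s in zip(female_births, women, survival))

print(round(grr, 2))  # 1.21
print(round(nrr, 2))  # 1.12 -> N.R.R > 1, so the population is increasing
```

Since survival probabilities are less than 1, the N.R.R is always at most the G.R.R, as the output shows.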
