Statistical Population
A statistical population, or universe, is the aggregate or totality of all individual members or
objects, whether animate or inanimate, concrete or abstract, that share some characteristic of interest.
Sampling units
The individual members of the population are called sampling units or simply units.
Types of Population
The following types of population are distinguished:
Finite Population
Infinite Population
Target Population
Sampled Population
Existent Population
Hypothetical Population
Finite Population
A population is said to be finite if it consists of a finite number of sampling units, so that the units can be counted. For example:
all students in a college, all houses in a country, etc.
Infinite Population
A population is said to be infinite if it consists of an infinite number of sampling units, so that the units cannot be counted. For
example: all points on a line, the stars in the sky, etc.
Target Population
A population about which we wish to draw inferences is called the target population.
Sampled Population
The population from which the sample is actually selected is called the sampled population.
Explanation
Suppose we desire to know the opinions of college students in the province of the Punjab with regard
to the present examination system. Our population then consists of all the students in all
the colleges in the province; this is our target population. If, because of a shortage of resources (time, cost, etc.),
we select a sample from only five colleges scattered throughout the province, the students of those five colleges form our sampled population.
Existent Population
A population whose units are available in solid (concrete) form, such as trees, households, students,
etc., is called an existent population.
Hypothetical Population
A population which consists of all the possible ways in which an event can occur is called a hypothetical population. Its units are not available in solid form. For example: the outcomes of throwing a die or tossing a coin.
Sample
A sample is a part of the population which is selected with the expectation that it will represent the characteristics of the population.
Population Size
Total number of units in a finite population is called the size of the population and is denoted by N.
Sample Size
Number of units selected in the sample from the population is called size of sample denoted by n.
Population distribution
The arrangement of the values (sampling units) of a population together with their probabilities of occurrence is called the population distribution.
Sample distribution
Arrangement of values of sample with their probabilities of occurrence is called sample distribution.
Sampling
The process of selecting a sample from the population is called sampling.
Sampling Design
A sampling design is a definite statistical plan concerned with all the principal steps taken in the selection of
a sample and in the estimation procedure. These steps are formulated in advance of conducting the survey.
Sampling Frame
A sampling frame is a complete list or a map that contains all the N sampling units in a population such
as a complete list of all households in a city, a map of a village showing all fields etc.
Theorem
If a sample of size n is selected from a finite population of size N, then the number of all possible samples
is given as follows. When sampling is done with replacement,
\[ \text{No. of possible samples} = N^n \]
and when sampling is done without replacement,
\[ \text{No. of possible samples} = \frac{{}^{N}P_n}{{}^{n}P_n} = {}^{N}C_n \]
This is because the first unit of the sample can be selected in N different ways, the second unit
in (N - 1) ways and so on, and the nth unit in (N - n + 1) ways, which gives ${}^{N}P_n$ ordered selections
of n units from a finite population of N units. The ${}^{n}P_n = n!$ permutations of the same n units
constitute only one sample, which is why ${}^{N}P_n$ is divided by ${}^{n}P_n$.
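As a quick check of these counts, the following minimal Python sketch evaluates them for N = 4 and n = 2. The function name `count_samples` and the dictionary keys are hypothetical, chosen only for illustration; the counting itself uses the standard library.

```python
import math

def count_samples(N: int, n: int) -> dict:
    """Number of possible samples of size n from a population of size N."""
    return {
        # every draw has N choices when units are replaced
        "with_replacement": N ** n,
        # ordered selections without replacement: N(N-1)...(N-n+1)
        "without_replacement_ordered": math.perm(N, n),
        # unordered samples without replacement: NPn / nPn
        "without_replacement_unordered": math.comb(N, n),
    }

counts = count_samples(4, 2)
print(counts)
```

For N = 4 and n = 2 this gives 16 samples with replacement, 12 ordered selections and 6 distinct samples without replacement.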
As the sample size increases, the sampling error is reduced, and in a complete enumeration (census) there is
no sampling error because $\bar{x}$ equals $\mu$.
The errors which occur at the stage of gathering, arranging and analyzing the data are called non-sampling
errors. Non-sampling errors include all kinds of human errors, a faulty sampling frame, a biased method of selecting units, processing errors such as errors in editing and coding, misclassification of observations, etc.
Sampling Bias
In a survey, sampling bias means a systematic component of error which deprives the survey results of their
representativeness. Bias is introduced by the following methods of selection:
Deliberate Selection
Substitution
Incomplete Coverage
Haphazard Selection
Inadequate Interviewing
Equal Allocation
The allocation is called equal allocation when an equal number of sampling units is selected
from each stratum. That is, the total sample size n is distributed equally among all the k strata. Thus the stratum
sample size $n_i$ for equal allocation is
\[ n_i = \frac{n}{k}, \quad \text{for } i = 1, 2, 3, \ldots, k \]
Proportional Allocation
The allocation is said to be proportional when the total sample size n is distributed among the different
strata in proportion to the sizes of the strata. In other words, the allocation is proportional if
\[ n_i = n\,\frac{N_i}{N}, \quad \text{for } i = 1, 2, 3, \ldots, k \]
where $N_i$ is the population size of the ith stratum, $n_i$ is the ith stratum sample size and N is the total
population size.
Neyman Allocation
This method of allocation consists of finding the $n_i$ which minimize the variance of the stratified sample
mean for a fixed total sample size n, assuming the cost of surveying a unit to be the same in all strata.
The stratum sample size $n_i$ is given by the relation
\[ n_i = n\,\frac{N_i \sigma_i}{\sum N_i \sigma_i}, \quad \text{for } i = 1, 2, 3, \ldots, k \]
where $\sigma_i$ is the standard deviation of the ith stratum. Neyman allocation becomes exactly the same as
proportional allocation when all the stratum standard deviations are equal.
Optimum Allocation
The allocation is called optimum when the total sample size n is allocated among the different strata
in such a way that, for a given cost of selecting the sample, the variance of the stratified sample mean
is minimized. The stratum sample size $n_i$ for this method of allocation is
\[ n_i = n\,\frac{N_i \sigma_i / \sqrt{c_i}}{\sum N_i \sigma_i / \sqrt{c_i}}, \quad \text{for } i = 1, 2, 3, \ldots, k \]
where $N_i$ is the population size of the ith stratum, $\sigma_i$ is the stratum standard deviation, and $c_i$ is the cost of
surveying one unit in the ith stratum.
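The four allocation rules differ only in the weight given to each stratum, which the following minimal Python sketch tries to make explicit. The function names `allocate` and `equal_allocation` and the stratum figures are hypothetical and used only for illustration; the results are left unrounded, so in practice they would still have to be rounded to whole units.

```python
import math

def equal_allocation(n, k):
    """Equal allocation: n_i = n / k for each of the k strata."""
    return [n / k] * k

def allocate(n, sizes, sds=None, costs=None):
    """Allocate n in proportion to N_i * sigma_i / sqrt(c_i).

    sizes only         -> proportional allocation
    sizes and sds      -> Neyman allocation
    sizes, sds, costs  -> optimum allocation
    """
    k = len(sizes)
    sds = sds or [1.0] * k      # equal standard deviations by default
    costs = costs or [1.0] * k  # equal unit costs by default
    weights = [N_i * s_i / math.sqrt(c_i)
               for N_i, s_i, c_i in zip(sizes, sds, costs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# Hypothetical strata: sizes, standard deviations and unit costs.
sizes = [500, 300, 200]
proportional = allocate(100, sizes)
neyman = allocate(100, sizes, sds=[10, 20, 30])
optimum = allocate(100, sizes, sds=[10, 20, 30], costs=[1, 4, 9])
print(proportional, neyman, optimum)
```

With equal standard deviations the Neyman weights reduce to the stratum sizes, which is why the two rules then coincide.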
Systematic Sampling
Systematic sampling is a method of selecting a sample of size n that takes every kth unit from a population
whose units have been serially numbered from 1 to N or arranged in a systematic fashion. The steps for this technique
are:
1. Allot serial numbers from 1 to N to the sampling units.
2. Divide the population into n groups.
3. Find the number of units in each group, k, as
\[ k = \frac{\text{Population size}}{\text{Sample size}} = \frac{N}{n} \]
4. Select the first unit from the first group at random and then every kth unit thereafter.
Some advantages of this technique over simple random sampling are 1) it is easier to draw because only
one random number is required. 2) It distributes the sample more evenly over the listed population.
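The procedure can be sketched in a few lines of Python. This is only an illustration under the assumption that N is a multiple of n (otherwise k is rounded down here); the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(N, n, rng=None):
    """Return the serial numbers (1..N) of a systematic sample of size n."""
    rng = rng or random.Random()
    k = N // n                    # sampling interval k = N / n
    start = rng.randint(1, k)     # one random start within the first group
    return [start + j * k for j in range(n)]  # then every kth unit

sample = systematic_sample(N=100, n=10, rng=random.Random(0))
print(sample)
```

Only one random number (the starting unit) is drawn, and the selected units are spread evenly over the list.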
Cluster Sampling
Cluster sampling is a method of selecting a sample from a population which is divided into natural
groups, such as households, agricultural farms, etc., which are called clusters. Taking these clusters as
sampling units, a sample of clusters is drawn at random. After the clusters have been selected, all, or part of, the units in
each selected cluster are included in the sample. This sample is called a cluster random sample. Cluster sampling also
uses prior knowledge about the target population and partitions the population into groups or clusters,
where each cluster ideally has the same characteristics as the target population.
Cluster
1) Units within a cluster are heterogeneous.
2) Variation within clusters is greater than variation among clusters.
Purposive Sampling
In this method personal judgment plays an important role in the selection of sampling units. The sample
is selected purposively, with a particular purpose in view.
Quota sampling
A sampling technique in which the sampling units are selected from quotas (groups, usually
of human beings) according to the limited personal choice of the investigator is called quota sampling.
Sampling distribution
The sampling distribution is defined as the probability distribution or relative frequency distribution of a
sample statistic. As a sampling distribution is a probability distribution, the sum of the probabilities in
it is always equal to one. The distribution has its own mean and its own standard deviation. The most commonly used
sampling distributions are $F$, $t$ and $\chi^2$.
Standard Error
The standard error is defined as the standard deviation of a sampling distribution of a sample statistic
abbreviated as S.E.
$\bar{x}$        $f(\bar{x})$
$\bar{x}_1$      $f(\bar{x}_1)$
$\bar{x}_2$      $f(\bar{x}_2)$
...              ...
$\bar{x}_k$      $f(\bar{x}_k)$

\[ \text{S.E.}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \]
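When sampling is done without replacement from a finite population of size N,
\[ \sigma^2_{\bar{x}} = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}, \quad (\sigma^2 = \text{population variance}) \]
and
\[ \text{S.E.}(\bar{x}) = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \]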
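The two standard-error formulas, $\sigma/\sqrt{n}$ and its finite-population version with the factor $\sqrt{(N-n)/(N-1)}$, can be compared numerically. The sketch below is illustrative only: the function name `standard_error_mean` is hypothetical, and the figures ($\sigma^2 = 20$, n = 2, N = 4) are taken from Example 1.1 of these notes.

```python
import math

def standard_error_mean(sigma, n, N=None):
    """Standard error of the sample mean.

    If N is given, sampling is assumed to be without replacement from a
    finite population and the correction sqrt((N - n)/(N - 1)) is applied.
    """
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

sigma = math.sqrt(20)
se_with = standard_error_mean(sigma, n=2)          # sqrt(10), about 3.16
se_without = standard_error_mean(sigma, n=2, N=4)  # sqrt(20/3), about 2.58
print(round(se_with, 2), round(se_without, 2))
```

The correction factor is always at most one, so sampling without replacement never gives a larger standard error.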
3. Shape
The shape of the sampling distribution of $\bar{x}$ can be studied as follows.
(a) Normal Population with known $\sigma$
If the parent population is normal, then the sampling distribution of $\bar{x}$ will also be normal regardless
of sample size (whether the sample is small or large). The standardized normal variable is then
\[ Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]
If sampling is without replacement and the sample size n is 5 percent or more of the
population size N, then Z values are obtained by the formula
\[ Z = \frac{\bar{x} - \mu}{\dfrac{\sigma}{\sqrt{n}}\sqrt{\dfrac{N-n}{N-1}}} \]
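When $\sigma$ is unknown and the sample is large, the sampling distribution of $\bar{x}$ is approximately normal with a mean of $\mu$
and a standard deviation of $S/\sqrt{n}$. Then the standardized normal variable is
\[ Z = \frac{\bar{x} - \mu}{S/\sqrt{n}} \]
For Small Sample
When $\sigma$ is unknown and the sample size is small (n < 30), the sampling distribution of $\bar{x}$ follows
Student's t-distribution, and the statistic is
\[ t = \frac{\bar{x} - \mu}{S/\sqrt{n}} \]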
Theorem
The mean of the sampling distribution of $\bar{x}$ is equal to the population mean (when sampling is done with
replacement), that is,
\[ \mu_{\bar{x}} = \mu \]
Proof: Let $x_1, x_2, \ldots, x_n$ be a random sample of size n from a population with mean $\mu$. Then the sample mean
is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$.
We know that
\[ \mu_{\bar{x}} = E(\bar{x}) = E\left(\frac{\sum_{i=1}^{n} x_i}{n}\right) = \frac{1}{n} E\left(\sum_{i=1}^{n} x_i\right) = \frac{1}{n}\left[E(x_1 + x_2 + \cdots + x_n)\right] \]
In a random sample, the random variables $x_1, x_2, \ldots, x_n$ are independent and each has the same distribution
as the population. Thus
\[ E(x_1) = E(x_2) = \cdots = E(x_n) = \mu \tag{1.1} \]
\[ \mu_{\bar{x}} = \frac{1}{n}(\mu + \mu + \cdots + \mu) = \frac{n\mu}{n} = \mu \]
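Theorem
The mean of the sampling distribution of $\bar{x}$ is equal to the population mean (when sampling is done without
replacement), that is,
\[ \mu_{\bar{x}} = \mu \]
Proof: Consider a population of size N and sample size n. Let the number of samples drawn without
replacement, ${}^{N}C_n$, be denoted by k, and let $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k$ be the means of the k samples. Then the mean of the sample means is
\[ E(\bar{x}) = \mu_{\bar{x}} = \frac{1}{k}\left[\bar{x}_1 + \bar{x}_2 + \cdots + \bar{x}_k\right] \]
\[ = \frac{1}{k}\left[\frac{x_1 + x_2 + \cdots}{n} + \frac{x_1 + x_3 + \cdots}{n} + \cdots + \frac{x_2 + x_3 + \cdots}{n}\right] \]
Each $x_i$ appears in $\binom{N-1}{n-1}$ of the samples. Therefore,
\[ = \frac{1}{nk}\left[x_1\binom{N-1}{n-1} + x_2\binom{N-1}{n-1} + \cdots + x_N\binom{N-1}{n-1}\right]
   = \frac{\binom{N-1}{n-1}}{nk}\left[x_1 + x_2 + \cdots + x_N\right] \]
\[ = \frac{\dfrac{(N-1)!}{n\,(n-1)!\,(N-n)!}}{\dfrac{N!}{n!\,(N-n)!}}\left[x_1 + x_2 + \cdots + x_N\right]
   = \frac{(N-1)!\,n!\,(N-n)!}{N!\,(N-n)!\,n\,(n-1)!}\left[x_1 + x_2 + \cdots + x_N\right] \]
\[ = \frac{1}{N}\left[x_1 + x_2 + \cdots + x_N\right] \]
\[ E(\bar{x}) = \mu_{\bar{x}} = \mu \]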
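Theorem
If a random sample of size n is drawn with replacement from an infinite or finite population, the standard
deviation of the sampling distribution of $\bar{x}$ is given by
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} \]
Proof: Let $x_1, x_2, x_3, \ldots, x_n$ be a random sample of size n drawn with replacement from a population whose
mean is $\mu$ and variance $\sigma^2$. The sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the variance of $\bar{x}$, $\sigma^2_{\bar{x}}$, is defined as
\[ \sigma^2_{\bar{x}} = E\left[\bar{x} - E(\bar{x})\right]^2 \]
\[ = E\left[\frac{\sum_{i=1}^{n} x_i}{n} - \mu\right]^2, \quad \text{since } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ = E\left[\frac{\sum_{i=1}^{n} x_i - n\mu}{n}\right]^2 = \frac{1}{n^2} E\left[\sum_{i=1}^{n}(x_i - \mu)\right]^2 \]
\[ = \frac{1}{n^2} E\left[\sum_{i=1}^{n}(x_i - \mu)^2 + \sum_{i \neq j}(x_i - \mu)(x_j - \mu)\right] \]
\[ = \frac{1}{n^2}\left[\sum_{i=1}^{n} E(x_i - \mu)^2 + \sum_{i \neq j} E(x_i - \mu)(x_j - \mu)\right] \tag{1.2} \]
Now $E(x_i - \mu)(x_j - \mu) = 0$: it is the covariance between $x_i$ and $x_j$, and because of sampling with replacement
the ith and jth draws are independent. Also $E(x_i - \mu)^2 = \sigma^2$.
Therefore (1.2) becomes
\[ \sigma^2_{\bar{x}} = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n} \]
and the standard error is
\[ \text{S.E.}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \]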
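Theorem
If a random sample of size n is drawn without replacement from a finite population, the standard deviation of the
sampling distribution of $\bar{x}$ is given by
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \]
Proof: Let $x_1, x_2, x_3, \ldots, x_n$ be a random sample of size n drawn without replacement from a population of size N whose
mean is $\mu$ and variance $\sigma^2$. The sample mean is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the variance of $\bar{x}$, $\sigma^2_{\bar{x}}$, is defined as
\[ \sigma^2_{\bar{x}} = E\left[\bar{x} - E(\bar{x})\right]^2 \]
\[ = E\left[\frac{\sum_{i=1}^{n} x_i}{n} - \mu\right]^2, \quad \text{since } \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
\[ = E\left[\frac{\sum_{i=1}^{n} x_i - n\mu}{n}\right]^2 = \frac{1}{n^2} E\left[\sum_{i=1}^{n}(x_i - \mu)\right]^2 \]
\[ = \frac{1}{n^2} E\left[\sum_{i=1}^{n}(x_i - \mu)^2 + \sum_{i \neq j}(x_i - \mu)(x_j - \mu)\right] \]
\[ = \frac{1}{n^2}\left[\sum_{i=1}^{n} E(x_i - \mu)^2 + \sum_{i \neq j} E(x_i - \mu)(x_j - \mu)\right] \tag{1.3} \]
Because sampling is without replacement, each ordered pair of distinct population units is equally likely to occupy the ith and jth draws, so
\[ E(x_i - \mu)(x_j - \mu) = \frac{1}{N(N-1)} \sum_{i \neq j}^{N} (x_i - \mu)(x_j - \mu) \]
where the sum runs over the population. For the population values,
\[ \left[\sum_{i=1}^{N}(x_i - \mu)\right]^2 = \sum_{i=1}^{N}(x_i - \mu)^2 + \sum_{i \neq j}^{N}(x_i - \mu)(x_j - \mu) \]
and since $\sum_{i=1}^{N}(x_i - \mu) = 0$,
\[ \sum_{i=1}^{N}(x_i - \mu)^2 = -\sum_{i \neq j}^{N}(x_i - \mu)(x_j - \mu) \tag{1.4} \]
Using $\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \sigma^2$, (1.4) gives
\[ \sum_{i \neq j}^{N}(x_i - \mu)(x_j - \mu) = -N\sigma^2 \tag{1.5} \]
so that
\[ E(x_i - \mu)(x_j - \mu) = -\frac{N\sigma^2}{N(N-1)} = -\frac{\sigma^2}{N-1} \]
There are n terms in the first sum of (1.3) and n(n - 1) terms in the second, hence
\[ \sigma^2_{\bar{x}} = \frac{1}{n^2}\left[n\sigma^2 - n(n-1)\frac{\sigma^2}{N-1}\right]
   = \frac{\sigma^2}{n}\left[1 - \frac{n-1}{N-1}\right]
   = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} \]
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \]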
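\[ \sigma^2_{\bar{x}_1 - \bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} \]
and
\[ \text{S.E.}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
If the values of $\sigma_1$ and $\sigma_2$ are not known and both sample sizes are sufficiently large, they are
replaced by $S_1$ and $S_2$, the standard deviations of the respective samples. Then the S.E. will be
\[ S_{(\bar{x}_1 - \bar{x}_2)} = \sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]
If the populations are finite, sampling is done without replacement and the sample sizes are greater
than or equal to 5% of the population sizes, the S.E. is
\[ \text{S.E.}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1}\cdot\frac{N_1 - n_1}{N_1 - 1} + \frac{\sigma_2^2}{n_2}\cdot\frac{N_2 - n_2}{N_2 - 1}} \]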
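3. Shape
The shape of the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$ can be studied as follows.
(a) Normal Populations with known $\sigma_1$ and $\sigma_2$
If the populations are normally distributed, the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$, regardless of
sample sizes, will be normal with mean $\mu_1 - \mu_2$ and variance $\sigma^2_{\bar{x}_1 - \bar{x}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$. In other words,
the standardized variable is
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
When $\sigma_1$ and $\sigma_2$ are unknown and the samples are large, the standard error is estimated by
$\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$ and the standardized normal variable is
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \]
When the samples are small and the populations have unknown but equal standard deviations, the statistic is
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]
where the pooled variance is
\[ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
   = \frac{1}{n_1 + n_2 - 2}\left[\left(\sum_{i=1}^{n_1} x_{1i}^2 - \frac{\left(\sum_{i=1}^{n_1} x_{1i}\right)^2}{n_1}\right) + \left(\sum_{i=1}^{n_2} x_{2i}^2 - \frac{\left(\sum_{i=1}^{n_2} x_{2i}\right)^2}{n_2}\right)\right] \]
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
If the population standard deviations are unknown, then they are estimated by the sample standard deviations and the standardized normal variable is
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \]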
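3. Shape
The sampling distribution of $\hat{p}$ is the binomial distribution. However, for large sample size, the sampling
distribution of $\hat{p}$ is approximately normal. Because $\hat{p} = \frac{x}{n}$, a continuity correction of $\frac{1}{2n}$ is needed when the
normal approximation to the binomial distribution is used. The standardized normal variable is
\[ Z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} \]
or, with the continuity correction,
\[ Z = \frac{\left(\hat{p} \pm \dfrac{1}{2n}\right) - p}{\sqrt{\dfrac{pq}{n}}} \]
Sometimes we use
\[ Z = \frac{x - np}{\sqrt{npq}} \]
\[ Z = \frac{(x \pm 0.5) - np}{\sqrt{npq}} \]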
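\[ \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \]
If both populations have the same proportion of successes, i.e. $p_1 = p_2 = p$, or if both the samples have
been drawn from a common binomial distribution, then
\[ \sigma_{\hat{p}_1 - \hat{p}_2} = \sqrt{pq\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} \]
Whenever the value of the common proportion is not known, then for sufficiently large sample sizes it
is replaced with its estimate $\hat{p}_c$, where
\[ \hat{p}_c = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2} \]
Then the standard error is
\[ \sqrt{\hat{p}_c\hat{q}_c\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}, \quad \text{where } \hat{q}_c = 1 - \hat{p}_c \]
Whenever $p_1 \neq p_2$ and both are unknown, then for large sample sizes they are replaced with the sample proportions $\hat{p}_1$ and $\hat{p}_2$, and the standard error is
\[ S_{\hat{p}_1 - \hat{p}_2} = \sqrt{\frac{\hat{p}_1\hat{q}_1}{n_1} + \frac{\hat{p}_2\hat{q}_2}{n_2}} \]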
3. Shape
The sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately normal for large sample sizes, with standardized
normal variable
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \]
The standard error in the denominator of the standardized variable changes according to the conditions mentioned above.
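The mean of the sample variance $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$ is denoted by $E(s^2) = \mu_{s^2}$. If sampling is done with
replacement, then $E(s^2) = \mu_{s^2} = \sigma^2$. Thus it is an unbiased estimator of the population variance. The sample
variance $S^2$ is defined as $\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n}$. If samples are drawn with replacement, then
\[ \frac{n}{n-1}E(S^2) = \sigma^2 \quad \text{or} \quad E(S^2) = \frac{n-1}{n}\sigma^2 \]
Thus $S^2$ is a biased estimator of $\sigma^2$. In case of sampling without replacement
the following relations hold:
\[ \frac{N-1}{N}E(s^2) = \sigma^2 \quad \text{or} \quad E(s^2) = \frac{N}{N-1}\sigma^2 \]
and
\[ \frac{n}{n-1}\cdot\frac{N-1}{N}E(S^2) = \sigma^2 \quad \text{or} \quad E(S^2) = \frac{n-1}{n}\cdot\frac{N}{N-1}\sigma^2 \]
2. Shape
For a normal population, the sampling distribution of the sample variance is related to the chi-square distribution
(the quantity $(n-1)s^2/\sigma^2$ follows a $\chi^2$ distribution with n - 1 degrees of freedom), while the sampling
distribution followed by the ratio of two sample variances is called the F-distribution.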
6. Sampling provides a valid measure of reliability for sample estimates.
7. Following up non-response is easier in sampling.
8. More detailed information can be obtained by sampling.
Numerical Problems
Example 1.1: (a) A population consists of the four numbers 3, 7, 11, 15. Consider all possible samples of
size two which can be drawn with replacement from this population. Find 1) the population mean, 2) the
population standard deviation, 3) the mean of the sampling distribution of means, 4) the standard deviation
of the sampling distribution of means. Verify (3) and (4) directly from (1) and (2) by use of suitable formulas.
(b) Repeat 3) and 4) in (a) when sampling is without replacement.
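Solution: (a) (1) Population Mean
\[ \mu = \frac{3 + 7 + 11 + 15}{4} = \frac{36}{4} = 9 \]
(2) Population Standard deviation
We know that

x        x^2
3        9
7        49
11       121
15       225
Σx = 36  Σx^2 = 404

\[ \sigma^2 = \frac{\sum x^2}{N} - \left(\frac{\sum x}{N}\right)^2 = \frac{404}{4} - \left(\frac{36}{4}\right)^2 = 101 - 81 = 20 \]
so that
\[ \sigma = \sqrt{20} = 4.47 \]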
3) Now we draw all possible samples of size two, with replacement, from the population. Their number is
\[ N^n = 4^2 = 16 \]

Sample     x̄      Sample      x̄
(3, 3)     3      (11, 3)     7
(3, 7)     5      (11, 7)     9
(3, 11)    7      (11, 11)    11
(3, 15)    9      (11, 15)    13
(7, 3)     5      (15, 3)     9
(7, 7)     7      (15, 7)     11
(7, 11)    9      (15, 11)    13
(7, 15)    11     (15, 15)    15
Sampling distribution of $\bar{x}$ and calculation of its mean and standard deviation

x̄      Tally   f     f(x̄)    x̄ f(x̄)    x̄^2 f(x̄)
3      /       1     1/16    3/16      9/16
5      //      2     2/16    10/16     50/16
7      ///     3     3/16    21/16     147/16
9      ////    4     4/16    36/16     324/16
11     ///     3     3/16    33/16     363/16
13     //      2     2/16    26/16     338/16
15     /       1     1/16    15/16     225/16
Total          16    1       144/16    1456/16
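Now
\[ \mu_{\bar{x}} = \sum \bar{x} f(\bar{x}) = \frac{144}{16} = 9 \]
\[ \sigma_{\bar{x}} = \sqrt{\sum \bar{x}^2 f(\bar{x}) - \left(\sum \bar{x} f(\bar{x})\right)^2} = \sqrt{\frac{1456}{16} - \left(\frac{144}{16}\right)^2} = \sqrt{91 - (9)^2} = \sqrt{10} = 3.16 \]
Verification
\[ \mu_{\bar{x}} = \mu = 9 \]
\[ \sigma^2_{\bar{x}} = \frac{\sigma^2}{n} = \frac{20}{2} = 10 \]
and $\sigma_{\bar{x}} = \sqrt{10} = 3.16$, which verifies the result.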
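The enumeration in this example is small enough to be checked by brute force. The following Python sketch lists every sample of size two from the population 3, 7, 11, 15, both with replacement and without replacement (the case asked for in part (b)), and computes the mean and variance of the sample means. The variable names are illustrative only.

```python
from itertools import combinations, product
from statistics import mean, pvariance

population = [3, 7, 11, 15]
n = 2
N = len(population)

mu = mean(population)           # population mean
sigma2 = pvariance(population)  # population variance (divisor N)

# All N**n ordered samples drawn with replacement.
means_wr = [mean(s) for s in product(population, repeat=n)]
# All C(N, n) samples drawn without replacement.
means_wor = [mean(s) for s in combinations(population, n)]

print(mu, sigma2)                             # 9 20
print(mean(means_wr), pvariance(means_wr))    # mean 9, variance sigma2/n = 10
print(mean(means_wor), pvariance(means_wor))  # mean 9, variance (sigma2/n)(N-n)/(N-1)
```

Without replacement the variance of the sample means comes out as 20/3, which agrees with $\frac{\sigma^2}{n}\cdot\frac{N-n}{N-1} = 10 \cdot \frac{2}{3}$.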
Example 1.2: A random variable has the following probability distribution:

x_i         4     5     6     7
P(X = x)    0.2   0.4   0.3   0.1

1. Find the mean $\mu_{\bar{x}}$ and the variance $\sigma^2_{\bar{x}}$ of the mean $\bar{x}$ for a random sample of size 36.
2. Find the probability that the mean of 36 items will be less than 5.5.
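Solution:
We know that
\[ \mu = \sum xP(x) \quad \text{and} \quad \sigma^2 = \sum x^2P(x) - \left(\sum xP(x)\right)^2 \]
Therefore

x        P(x)    xP(x)    x^2 P(x)
4        0.2     0.8      3.2
5        0.4     2.0      10.0
6        0.3     1.8      10.8
7        0.1     0.7      4.9
Total            5.3      28.9

\[ \mu = 5.3 \quad \text{and} \quad \sigma^2 = \sum x^2P(x) - \left(\sum xP(x)\right)^2 = 28.9 - (5.3)^2 = 0.81 \]
\[ \mu_{\bar{x}} = \mu = 5.3 \quad \text{and} \quad \sigma^2_{\bar{x}} = \frac{\sigma^2}{n} = \frac{0.81}{36} = 0.0225, \quad \sigma_{\bar{x}} = 0.15 \]
Now we require $P(\bar{x} < 5.5)$.
We know that the sample size is sufficiently large, therefore $\bar{x}$ follows the normal distribution and the standard
normal variable is
\[ Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]
Inserting the values, we obtain
\[ Z = \frac{5.5 - 5.3}{0.15} = 1.33 \]
\[ P(\bar{x} < 5.5) = P(Z < 1.33) = P(-\infty < Z \le 0) + P(0 \le Z \le 1.33) \]
\[ P(\bar{x} < 5.5) = 0.5 + 0.4082 = 0.9082 \]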
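The arithmetic of this example can be reproduced with the Python standard library. This is only a sketch of the calculation above: it uses the unrounded value of Z, so the probability differs slightly in the fourth decimal place from the table value obtained with Z = 1.33.

```python
from math import sqrt
from statistics import NormalDist

x = [4, 5, 6, 7]
p = [0.2, 0.4, 0.3, 0.1]
n = 36

mu = sum(xi * pi for xi, pi in zip(x, p))                   # 5.3
sigma2 = sum(xi**2 * pi for xi, pi in zip(x, p)) - mu**2    # 0.81
se = sqrt(sigma2 / n)                                       # 0.15

z = (5.5 - mu) / se            # about 1.33
prob = NormalDist().cdf(z)     # P(Z < z), about 0.909
print(round(mu, 2), round(sigma2, 2), round(se, 2), round(z, 2), round(prob, 3))
```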
Chapter 2
Statistical Estimation
Statistical Inference
The process of drawing conclusions or inferences about a population on the basis of limited information
contained in a random sample is called statistical inference.
Estimation
Estimation is a procedure of making a judgment, expressed as a numerical value, about the true but unknown value
of a population parameter on the basis of the limited information contained in a random sample obtained from the
population whose parameter is to be estimated.
Testing of Hypothesis
Hypothesis testing is a procedure which enables us to decide whether to accept or reject any specified
assumption or statement about the value of a population parameter on the basis of the limited information
contained in a random sample.
Estimator
The rule, formula or function used to estimate a population parameter is called an estimator or point
estimator. The word estimator is used in general for a statistic, which is a random variable because it is a
function of random observations obtained from the population.
Estimate
A numerical value obtained by substituting the sample observations in the rule or formula is called an
estimate.
Explanation
Let 5, 10, 12, 18, 15 be a random sample of n = 5 values obtained from the population. Then
\[ \bar{x} = \frac{\sum x}{n} = \frac{60}{5} = 12 \]
Here $\bar{x} = \frac{\sum x}{n}$ is an estimator of the population mean $\mu$ and the value 12 is called the
estimate. There are two categories of estimates.
Point Estimate
When an estimate for the unknown population parameter is expressed by a single value, it is called
point estimate.
Interval Estimate
An estimate expressed by a range of values within which a true value of the population parameter is
believed to lie, is referred to as an interval estimate.
Types of Estimation
There are two types of estimation
1. Point Estimation
2. Interval Estimation
Point Estimation
The process of obtaining a single value from the sample as an estimate (point estimate) of the unknown
but true value of population parameter is called point estimation.
Linear Estimate
If an estimate can be expressed as a weighted sum of the observations (that is, as a linear combination), it is
called a linear estimate. For example, $\bar{x}$ is a linear estimator of the population mean because it can
be expressed as
\[ \bar{x} = \frac{1}{n}x_1 + \frac{1}{n}x_2 + \cdots + \frac{1}{n}x_n \]
which is a linear combination of the values of the x's; in terms of weights, each observation is given a weight
equal to $\frac{1}{n}$.
Unbiasedness
An estimator is defined to be unbiased if the statistic used as an estimator has its expected value (mean)
equal to the value of the population parameter being estimated. Let $\hat{\theta}$ be an estimator of a population
parameter $\theta$; then $\hat{\theta}$ will be an unbiased estimator if $E(\hat{\theta}) = \theta$.
Bias
If $E(\hat{\theta}) \neq \theta$, then the statistic is said to be a biased estimator, and the bias of the estimator, called the
estimation bias, is defined as
\[ \text{Bias} = E(\hat{\theta}) - \theta \]
Theorem
Show that the sample proportion $\hat{p} = \frac{x}{n}$ is an unbiased estimator of the population parameter p.
Proof: We know that $\hat{p} = \frac{x}{n}$.
Applying expectation, we get
\[ E(\hat{p}) = E\left(\frac{x}{n}\right) = \frac{1}{n}E(x) = \frac{np}{n} = p \]
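Theorem
If $\bar{x}$ and $s^2$, defined by $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$, are the sample mean and sample variance of a random sample
of size n from a population with mean $\mu$ and variance $\sigma^2$, then show that $E(s^2) = \sigma^2$.
Proof: Let $x_1, x_2, \ldots, x_n$ be a random sample of size n from a population with mean $\mu$ and variance $\sigma^2$.
Then
\[ E(s^2) = E\left[\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}\right] = \frac{1}{n-1}E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right] \]
Multiplying each side of this equation by n - 1, we have
\[ (n-1)E(s^2) = E\left[\sum_{i=1}^{n}(x_i - \bar{x})^2\right] \]
Adding and subtracting $\mu$ on the right side of the above equation, we get
\[ = E\left[\sum_{i=1}^{n}(x_i - \mu + \mu - \bar{x})^2\right] \]
\[ = E\left[\sum_{i=1}^{n}\left[(x_i - \mu) - (\bar{x} - \mu)\right]^2\right] \]
\[ = E\left[\sum_{i=1}^{n}\left\{(x_i - \mu)^2 - 2(x_i - \mu)(\bar{x} - \mu) + (\bar{x} - \mu)^2\right\}\right] \]
\[ = E\left[\sum_{i=1}^{n}(x_i - \mu)^2 - 2(\bar{x} - \mu)\sum_{i=1}^{n}(x_i - \mu) + n(\bar{x} - \mu)^2\right] \tag{2.1} \]
Consider the factor $\sum_{i=1}^{n}(x_i - \mu) = \sum_{i=1}^{n}x_i - n\mu = n\bar{x} - n\mu = n(\bar{x} - \mu)$.
Inserting this result in (2.1), we get
\[ (n-1)E(s^2) = E\left[\sum_{i=1}^{n}(x_i - \mu)^2 - 2n(\bar{x} - \mu)^2 + n(\bar{x} - \mu)^2\right] \]
\[ = E\left[\sum_{i=1}^{n}(x_i - \mu)^2 - n(\bar{x} - \mu)^2\right] \]
\[ = \sum_{i=1}^{n}E(x_i - \mu)^2 - nE(\bar{x} - \mu)^2 \tag{2.2} \]
Since $E(\bar{x} - \mu)^2 = \frac{\sigma^2}{n}$ and $E(x_i - \mu)^2 = \sigma^2$, (2.2) is
\[ (n-1)E(s^2) = n\sigma^2 - n\frac{\sigma^2}{n} = \sigma^2(n-1) \]
\[ E(s^2) = \sigma^2 \]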
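The result $E(s^2) = \sigma^2$ can be illustrated exactly on a small population by averaging the sample variance over every possible sample drawn with replacement. The sketch below reuses the population 3, 7, 11, 15 of Example 1.1 with n = 2 and exact fractions; the helper name `sample_variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [3, 7, 11, 15]
n = 2
N = len(population)

mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance = 20

def sample_variance(sample, divisor):
    """Sum of squared deviations from the sample mean, divided by `divisor`."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / divisor

samples = list(product(population, repeat=n))         # all 16 equally likely samples
E_s2 = sum(sample_variance(s, n - 1) for s in samples) / len(samples)
E_S2 = sum(sample_variance(s, n) for s in samples) / len(samples)

print(sigma2, E_s2, E_S2)   # 20 20 10
```

The divisor n - 1 gives an expectation equal to $\sigma^2$, whereas the divisor n gives $\frac{n-1}{n}\sigma^2$, here 10.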
Consistency
An estimator is said to be consistent if the statistic used as the estimator becomes closer and closer
to the population parameter being estimated as the sample size increases. In other words, an estimator $\hat{\theta}$ is
called a consistent estimator of $\theta$ if the probability that $\hat{\theta}$ lies within any small distance e of $\theta$ approaches unity with
increasing sample size. Symbolically,
\[ \lim_{n \to \infty} P\left[\,|\hat{\theta} - \theta| \le e\,\right] = 1 \]
To prove that an estimator is consistent, we may state a criterion that is sometimes quite useful, as follows.
Let $\hat{\theta}$ be an unbiased estimator of $\theta$ based on a sample of size n. Then $\hat{\theta}$ is a consistent estimator of $\theta$ if $Var(\hat{\theta}) \to 0$
as $n \to \infty$.
A consistent estimator is unbiased in the limit, but an unbiased estimator may or may not be a consistent estimator.
Efficiency
An unbiased estimator is defined to be efficient if the variance of its sampling distribution is smaller
than the variance of the sampling distribution of any other unbiased estimator of the same parameter. Suppose
we have two unbiased estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ of the same parameter $\theta$; then $\hat{\theta}_1$ is said to be a more efficient
estimator than $\hat{\theta}_2$ if $Var(\hat{\theta}_1) < Var(\hat{\theta}_2)$.
Relative Efficiency
The relative efficiency is measured by the ratio:
\[ E_f = \frac{Var(\hat{\theta}_2)}{Var(\hat{\theta}_1)} \]
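Theorem
Show that $MSE(\hat{\theta}) = Var(\hat{\theta}) + (\text{Bias})^2$.
Proof: We know that
\[ MSE(\hat{\theta}) = E(\hat{\theta} - \theta)^2 \]
\[ = E\left[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\right]^2 \]
\[ = E\left[\hat{\theta} - E(\hat{\theta})\right]^2 + \left[E(\hat{\theta}) - \theta\right]^2 + 2\left[E(\hat{\theta}) - \theta\right]E\left[\hat{\theta} - E(\hat{\theta})\right] \]
\[ MSE(\hat{\theta}) = Var(\hat{\theta}) + (\text{Bias})^2 \]
where $E\left[\hat{\theta} - E(\hat{\theta})\right] = 0$.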
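The decomposition can be verified exactly for a concrete biased estimator. The sketch below is an illustration under assumed data: it takes the divisor-n sample variance as the estimator of $\sigma^2$, uses the small population 3, 7, 11, 15 with samples of size 2 drawn with replacement, and checks that the mean squared error equals the variance plus the squared bias. The function name `biased_variance` is hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [3, 7, 11, 15]
n = 2
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N    # parameter being estimated

def biased_variance(sample):
    """Sample variance with divisor n, a biased estimator of sigma^2."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample) / len(sample)

estimates = [biased_variance(s) for s in product(population, repeat=n)]
m = len(estimates)

expected = sum(estimates) / m                           # E(theta_hat)
variance = sum((e - expected) ** 2 for e in estimates) / m
bias = expected - sigma2                                # E(theta_hat) - theta
mse = sum((e - sigma2) ** 2 for e in estimates) / m     # E(theta_hat - theta)^2

print(mse == variance + bias ** 2)   # True
```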
BLUE
An estimator that is linear, unbiased and has minimum variance among all linear unbiased estimators
of $\theta$ is called a best linear unbiased estimator, or BLUE for short.
Sufficiency
An estimator is defined to be sufficient if the statistic used as the estimator uses all the information that
is contained in the sample. Any statistic that is not computed from all the values in the sample is not a
sufficient estimator. Examples of sufficient estimators are $\bar{x}$ and $\hat{p}$.
Confidence Interval
A range of values is known as an interval, and the interval to which a probability of $100(1-\alpha)\%$ is attached
that it will include the population parameter is termed a confidence interval.
Level of Confidence
The probability $(1-\alpha)$, or $100(1-\alpha)\%$, associated with the interval is known as the confidence coefficient
or level of confidence. In practice its commonly used values are 90%, 95% and 99%.
Confidence Limits
The end points that bound the confidence interval are called the lower and upper confidence limits for
the unknown parameter. These limits are random variables because they are functions of sample observations
which are randomly selected from the population.
The difference between the upper confidence limit and the lower confidence limit is called the precision of the
estimate. The shorter the confidence interval, the more precise the estimate. The precision can be increased
by
increasing the sample size n
decreasing the level of confidence.
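If the sampling distribution of $\bar{x}$ is normal with mean $\mu$ and standard deviation $\frac{\sigma}{\sqrt{n}}$, then the standard normal variable is
\[ Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]
Then according to the normal distribution, the probability that a value of Z will fall in the interval from
$-Z_{\alpha/2}$ to $Z_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left(-Z_{\alpha/2} \le Z \le Z_{\alpha/2}\right) = 1 - \alpha \]
Putting in Z, we get
\[ P\left(-Z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le Z_{\alpha/2}\right) = 1 - \alpha \]
Multiplying by $\sigma/\sqrt{n}$,
\[ P\left(-Z_{\alpha/2}\,\sigma/\sqrt{n} \le \bar{x} - \mu \le Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha \]
Subtracting $\bar{x}$ from each term, we have
\[ P\left(-\bar{x} - Z_{\alpha/2}\,\sigma/\sqrt{n} \le -\mu \le -\bar{x} + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha \]
Now multiplying by -1, the inequality signs are reversed:
\[ P\left(\bar{x} + Z_{\alpha/2}\,\sigma/\sqrt{n} \ge \mu \ge \bar{x} - Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha \]
which is equivalent to
\[ P\left(\bar{x} - Z_{\alpha/2}\,\sigma/\sqrt{n} \le \mu \le \bar{x} + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha \]
For a particular sample of size n, a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by
\[ \left(\bar{x} - Z_{\alpha/2}\,\sigma/\sqrt{n},\; \bar{x} + Z_{\alpha/2}\,\sigma/\sqrt{n}\right) \]
which may be expressed more compactly as
\[ \bar{x} \pm Z_{\alpha/2}\,\sigma/\sqrt{n} \]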
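The interval $\bar{x} \pm Z_{\alpha/2}\,\sigma/\sqrt{n}$ is straightforward to compute. The following Python sketch is illustrative only: the function name `z_interval` and the figures (sample mean 50, $\sigma$ = 10, n = 100) are hypothetical, and $Z_{\alpha/2}$ is obtained from the standard library instead of a table.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """100(1 - alpha)% confidence interval for the mean when sigma is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # Z_{alpha/2}, about 1.96 for 95%
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

lower, upper = z_interval(xbar=50, sigma=10, n=100)
print(round(lower, 2), round(upper, 2))   # 48.04 51.96
```

Replacing `sigma` by the sample standard deviation S gives the large-sample interval used when $\sigma$ is unknown.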
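\[ Z = \frac{\bar{x} - \mu}{S/\sqrt{n}} \]
Then according to the normal distribution, the probability that a value of Z will fall in the interval from
$-Z_{\alpha/2}$ to $Z_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left(-Z_{\alpha/2} \le Z \le Z_{\alpha/2}\right) = 1 - \alpha \]
Putting in Z, we get
\[ P\left(-Z_{\alpha/2} \le \frac{\bar{x} - \mu}{S/\sqrt{n}} \le Z_{\alpha/2}\right) = 1 - \alpha \]
Multiplying by $S/\sqrt{n}$,
\[ P\left(-Z_{\alpha/2}\,S/\sqrt{n} \le \bar{x} - \mu \le Z_{\alpha/2}\,S/\sqrt{n}\right) = 1 - \alpha \]
Subtracting $\bar{x}$ from each term, we have
\[ P\left(-\bar{x} - Z_{\alpha/2}\,S/\sqrt{n} \le -\mu \le -\bar{x} + Z_{\alpha/2}\,S/\sqrt{n}\right) = 1 - \alpha \]
Now multiplying by -1, the inequality signs are reversed:
\[ P\left(\bar{x} + Z_{\alpha/2}\,S/\sqrt{n} \ge \mu \ge \bar{x} - Z_{\alpha/2}\,S/\sqrt{n}\right) = 1 - \alpha \]
which is equivalent to
\[ P\left(\bar{x} - Z_{\alpha/2}\,S/\sqrt{n} \le \mu \le \bar{x} + Z_{\alpha/2}\,S/\sqrt{n}\right) = 1 - \alpha \]
For a particular sample of size n, a $100(1-\alpha)\%$ confidence interval for $\mu$ is given by
\[ \left(\bar{x} - Z_{\alpha/2}\,S/\sqrt{n},\; \bar{x} + Z_{\alpha/2}\,S/\sqrt{n}\right) \]
which may be expressed more compactly as
\[ \bar{x} \pm Z_{\alpha/2}\,S/\sqrt{n} \]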
Note: When $\sigma$ is unknown and the sample size is small (n < 30), the sampling distribution of $\bar{x}$ follows the
t-distribution. We shall discuss this case in a later chapter.
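\[ Z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \]
Therefore an approximate $100(1-\alpha)\%$ confidence interval for the mean $\mu$ of a non-normal population with
known $\sigma$ is given by
\[ \bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
In case $\sigma$ is unknown and is estimated by the sample standard deviation S, the confidence interval estimate
for $\mu$ becomes
\[ \bar{x} \pm Z_{\alpha/2}\frac{S}{\sqrt{n}} \]
If sampling is done without replacement from a finite population of size N and the sample size n is greater than
or equal to 5% of the population size, then the confidence interval is
\[ \bar{x} \pm Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \]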
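The sampling distribution of $(\bar{x}_1 - \bar{x}_2)$ is normal with mean $\mu_1 - \mu_2$ and standard deviation
$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, with standard normal variable
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
Then according to the normal distribution, the probability that a value of Z will fall in the interval from
$-Z_{\alpha/2}$ to $Z_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left(-Z_{\alpha/2} \le Z \le Z_{\alpha/2}\right) = 1 - \alpha \]
Putting in Z, we get
\[ P\left(-Z_{\alpha/2} \le \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \le Z_{\alpha/2}\right) = 1 - \alpha \]
Multiplying by $\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, we get
\[ P\left(-Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2) \le Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha \]
Subtracting $(\bar{x}_1 - \bar{x}_2)$, we get
\[ P\left(-(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le -(\mu_1 - \mu_2) \le -(\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha \]
Multiplying by -1, the inequality signs are reversed:
\[ P\left((\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \ge (\mu_1 - \mu_2) \ge (\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha \]
which is equivalent to
\[ P\left((\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right) = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $(\mu_1 - \mu_2)$, for the particular samples obtained, is
\[ (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]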
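\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \]
Then according to the normal distribution, the probability that a value of Z will fall in the interval from
$-Z_{\alpha/2}$ to $Z_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left(-Z_{\alpha/2} \le Z \le Z_{\alpha/2}\right) = 1 - \alpha \]
Putting in Z, we get
\[ P\left(-Z_{\alpha/2} \le \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \le Z_{\alpha/2}\right) = 1 - \alpha \]
Multiplying by $\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}$, we get
\[ P\left(-Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \le (\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2) \le Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right) = 1 - \alpha \]
Subtracting $(\bar{x}_1 - \bar{x}_2)$, we get
\[ P\left(-(\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \le -(\mu_1 - \mu_2) \le -(\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right) = 1 - \alpha \]
Multiplying by -1, the inequality signs are reversed:
\[ P\left((\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \ge (\mu_1 - \mu_2) \ge (\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right) = 1 - \alpha \]
which is equivalent to
\[ P\left((\bar{x}_1 - \bar{x}_2) - Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}\right) = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $(\mu_1 - \mu_2)$, for the particular samples obtained, is
\[ (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]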
Note: When the sample sizes are small and the populations have unknown but equal standard deviations,
we use Student's t-distribution. We shall discuss this case in a later chapter.
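\[ \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \]
If the population standard deviations are unknown, they are estimated by the sample standard deviations. The approximate $100(1-\alpha)\%$ confidence interval for $(\mu_1 - \mu_2)$ is then given by
\[ (\bar{x}_1 - \bar{x}_2) \pm Z_{\alpha/2}\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}} \]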
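\[ P\left(-Z_{\alpha/2} \le \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} \le Z_{\alpha/2}\right) = 1 - \alpha \]
Multiplying by $\sqrt{\frac{pq}{n}}$, we get
\[ P\left(-Z_{\alpha/2}\sqrt{\frac{pq}{n}} \le \hat{p} - p \le Z_{\alpha/2}\sqrt{\frac{pq}{n}}\right) = 1 - \alpha \]
Rearranging in the same way as for the mean gives
\[ P\left(\hat{p} - Z_{\alpha/2}\sqrt{\frac{pq}{n}} \le p \le \hat{p} + Z_{\alpha/2}\sqrt{\frac{pq}{n}}\right) = 1 - \alpha \]
Note: the standard error of the sample proportion involves the unknown p. For large sample sizes this
difficulty is overcome by using the sample proportion $\hat{p}$ in place of p. Hence the confidence interval is
\[ \hat{p} \pm Z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} \]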
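As a numerical illustration of the large-sample interval $\hat{p} \pm Z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$, the sketch below computes it for hypothetical data (x = 60 successes in n = 200 trials). The function name `proportion_interval` is an assumption made for this example.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(x, n, confidence=0.95):
    """Large-sample confidence interval for a population proportion."""
    p_hat = x / n
    q_hat = 1 - p_hat
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_{alpha/2}
    half_width = z * sqrt(p_hat * q_hat / n)
    return p_hat - half_width, p_hat + half_width

low, high = proportion_interval(x=60, n=200)
print(round(low, 4), round(high, 4))
```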
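\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \]
Then according to the normal distribution, the probability that a value of Z will fall in the interval from
$-Z_{\alpha/2}$ to $Z_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left(-Z_{\alpha/2} \le Z \le Z_{\alpha/2}\right) = 1 - \alpha \]
Putting in Z, we get
\[ P\left(-Z_{\alpha/2} \le \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \le Z_{\alpha/2}\right) = 1 - \alpha \]
Multiplying by $\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$, we get
\[ P\left(-Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \le (\hat{p}_1 - \hat{p}_2) - (p_1 - p_2) \le Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}\right) = 1 - \alpha \]
Subtracting $(\hat{p}_1 - \hat{p}_2)$,
\[ P\left(-(\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \le -(p_1 - p_2) \le -(\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}\right) = 1 - \alpha \]
Multiplying by -1, the inequality signs are reversed:
\[ P\left((\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \ge (p_1 - p_2) \ge (\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}\right) = 1 - \alpha \]
which is equivalent to
\[ P\left((\hat{p}_1 - \hat{p}_2) - Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \le (p_1 - p_2) \le (\hat{p}_1 - \hat{p}_2) + Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}\right) = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for particular samples of sizes $n_1$ and $n_2$ is
\[ (\hat{p}_1 - \hat{p}_2) \pm Z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \]
where, for large samples, the unknown $p_1$ and $p_2$ under the root are replaced by the sample proportions $\hat{p}_1$ and $\hat{p}_2$.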
\[ \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}} \]
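\[ \bar{x} - Z_{\alpha/2}\,\sigma/\sqrt{n} < \mu < \bar{x} + Z_{\alpha/2}\,\sigma/\sqrt{n} \]
which may be written as
\[ |\bar{x} - \mu| \le Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
so that the maximum error of the estimate is
\[ e = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \]
or
\[ n = \left(\frac{Z_{\alpha/2}\,\sigma}{e}\right)^2 \]
Note: The population standard deviation $\sigma$ is usually unknown; its estimate is then found either from past
experience or from a pilot sample of size n > 30.
Similarly, when sampling is performed without replacement from a finite population of size N, the standard
error of the sampling distribution of $\bar{x}$ is
\[ \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \]
so that
\[ e = Z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}} \]
Squaring both sides, we get
\[ e^2 = \frac{Z_{\alpha/2}^2\,\sigma^2}{n}\cdot\frac{N-n}{N-1} \]
\[ \frac{n e^2}{Z_{\alpha/2}^2\,\sigma^2} = \frac{N-n}{N-1} \]
\[ n e^2 (N-1) = N Z_{\alpha/2}^2\,\sigma^2 - n Z_{\alpha/2}^2\,\sigma^2 \]
\[ n\left[e^2(N-1) + Z_{\alpha/2}^2\,\sigma^2\right] = N Z_{\alpha/2}^2\,\sigma^2 \]
\[ n = \frac{N Z_{\alpha/2}^2\,\sigma^2}{e^2(N-1) + Z_{\alpha/2}^2\,\sigma^2} \]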
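Both sample-size formulas, $n = (Z_{\alpha/2}\,\sigma/e)^2$ for sampling with replacement and its finite-population counterpart, can be wrapped in one small function. The sketch below is an illustration under assumed figures ($\sigma$ = 15, e = 2, N = 1000); the function name `sample_size_mean` is hypothetical, and the result is rounded up because a fraction of a unit cannot be sampled.

```python
from math import ceil
from statistics import NormalDist

def sample_size_mean(sigma, e, confidence=0.95, N=None):
    """Sample size needed to estimate a mean with maximum error e.

    If N is given, the finite-population formula
    n = N z^2 sigma^2 / (e^2 (N - 1) + z^2 sigma^2) is used.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # Z_{alpha/2}
    if N is None:
        n = (z * sigma / e) ** 2
    else:
        n = N * z**2 * sigma**2 / (e**2 * (N - 1) + z**2 * sigma**2)
    return ceil(n)

n_infinite = sample_size_mean(sigma=15, e=2)
n_finite = sample_size_mean(sigma=15, e=2, N=1000)
print(n_infinite, n_finite)   # 217 178
```

The finite-population requirement is smaller because sampling without replacement carries the correction factor $\frac{N-n}{N-1}$.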
2
pq
n
Chapter 3
Testing of Hypothesis
Statistical Hypothesis
A statistical hypothesis is a statement or assumption about a characteristic of one or more populations
which may or may not be true, and whose validity is checked on the basis of a random sample selected from the
population.
Test Statistic
A sample statistic on which the decision to accept or reject the null hypothesis is based is called a
test statistic. Every test statistic has a probability (sampling) distribution which gives the probability of
obtaining a specified value of the test statistic when the null hypothesis is true. The sampling distributions
of the most commonly used test statistics are $Z$, $t$, $\chi^2$ and $F$.
Critical value
The value(s) that separates the critical region from the acceptance region is called the critical value(s).
Power of a Test
The power of a test with respect to a specified alternative hypothesis is the probability of rejecting a null
hypothesis when it is actually false. In other words, the power is the complement of $\beta$. Symbolically,
\[ \text{Power} = P(\text{reject } H_0 \mid H_0 \text{ is false}) \]
\[ \text{Power} = 1 - \beta \]
Note: The power generally increases with an increase in the sample size. A test for which $\beta$ is small is
defined to be a powerful test.
Power Curve
The power curve, which may be regarded as the complement of the OC (operating characteristic) curve, shows the probabilities of
rejecting the null hypothesis $H_0$ for various values of the parameter.
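A few points of a power curve can be computed directly. The sketch below is a hedged illustration, not a procedure taken from these notes: it assumes an upper-tailed Z test of $H_0: \mu = \mu_0$ against $H_1: \mu > \mu_0$ with known $\sigma$, for which $\beta = \Phi\left(Z_{\alpha} - \frac{\mu_1 - \mu_0}{\sigma/\sqrt{n}}\right)$ at a true mean $\mu_1$. The function name and all figures are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tailed_z(mu0, mu1, sigma, n, alpha=0.05):
    """Power = 1 - beta of an upper-tailed Z test when the true mean is mu1."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha)       # critical value
    shift = (mu1 - mu0) / (sigma / sqrt(n))       # standardized true difference
    beta = std_normal.cdf(z_alpha - shift)        # P(accept H0 | H0 false)
    return 1 - beta

# Points on the power curve for several true means (mu0 = 50, sigma = 10, n = 100).
for mu1 in (50, 51, 52, 53):
    print(mu1, round(power_upper_tailed_z(50, mu1, 10, 100), 3))

power_n100 = power_upper_tailed_z(50, 52, 10, 100)
power_n200 = power_upper_tailed_z(50, 52, 10, 200)
```

At $\mu_1 = \mu_0$ the function returns $\alpha$, and for a fixed alternative the power rises as n increases, in line with the note above.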
Test of Significance
A method which makes it possible, by using sample observations, either to accept or to reject the null hypothesis at a level of significance that is not given in advance but is decided according to the nature of the problem,
is called a test of significance.
P-Value
The P-value for a test of hypothesis is defined as the smallest level of significance at which the null hypothesis would be rejected; at any level of significance smaller than the P-value the null hypothesis is accepted. The P-value enables us to test a hypothesis without first specifying a value of $\alpha$.
Formulation of Hypothesis
The hypotheses must be formulated in such a way that when one is true, the other is false ($H_0$ and $H_1$ are opposites). An equality sign is always used in the null hypothesis, so one of the following signs is used in its formulation: 1) $=$ 2) $\le$ 3) $\ge$. An equality sign is never used in the alternative hypothesis, so one of the following signs is used in its formulation: 1) $\ne$ 2) $>$ 3) $<$.
If $\theta$ is a population parameter and $\theta_0$ is its specified value to be tested, then the null and alternative hypotheses take the form.
Null Hypothesis                    Alternative Hypothesis
If $H_0: \theta = \theta_0$        Then $H_1: \theta \ne \theta_0$, $H_1: \theta > \theta_0$ or $H_1: \theta < \theta_0$
If $H_0: \theta \ge \theta_0$      Then $H_1: \theta < \theta_0$
If $H_0: \theta \le \theta_0$      Then $H_1: \theta > \theta_0$
2. Decide the level of significance $\alpha$, the probability of type-I error. The most common value of $\alpha$ is 0.05 or
\[ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \]
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
Q: Explain the procedure for testing a hypothesis about the mean of a normal population when $\sigma$ is unknown and $n \ge 30$?
Ans: Suppose a random sample of size $n$ is drawn from a normal population whose mean $\mu$ has a specified value $\mu_0$ and whose standard deviation $\sigma$ is unknown. The sample mean is $\bar{x}$. We wish to determine whether the sample accords with the hypothesis that the population mean has the specified value $\mu_0$. Since the population standard deviation is unknown, the sample standard deviation $S$ is used as an estimate. For a large sample ($n \ge 30$), the central limit theorem allows us to assume that the sampling distribution of $\bar{x}$ is approximately normal with mean $\mu$ and standard deviation $S/\sqrt{n}$. The standard normal variable is then
\[ Z = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} \]
and the testing procedure is as follows.
1. Formulate the null and alternative hypotheses about $\mu$. Three possible forms are
a) $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$
b) $H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$
c) $H_0: \mu \ge \mu_0$ and $H_1: \mu < \mu_0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.05$ or $0.01$).
3. The test statistic in this case is
\[ Z = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} \]
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
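The large-sample procedure can be sketched in a few lines. This is a minimal illustration, assuming summary statistics ($\bar{x}$, $S$, $n$) are already available; the function name `z_test_mean`, the `alternative` labels and the numbers are hypothetical. It reports a P-value as an alternative to looking up the rejection region.

```python
import math
from statistics import NormalDist


def z_test_mean(xbar, mu0, s, n, alternative="two-sided"):
    """Large-sample Z test for a mean: Z = (xbar - mu0) / (s / sqrt(n)).

    Returns the calculated Z and the corresponding P-value.
    """
    z = (xbar - mu0) / (s / math.sqrt(n))
    std = NormalDist()
    if alternative == "greater":      # H1: mu > mu0
        p = 1 - std.cdf(z)
    elif alternative == "less":       # H1: mu < mu0
        p = std.cdf(z)
    else:                             # H1: mu != mu0
        p = 2 * (1 - std.cdf(abs(z)))
    return z, p


if __name__ == "__main__":
    # Hypothetical sample: n = 64, mean 52, standard deviation 8, H0: mu = 50
    z, p = z_test_mean(52, 50, 8, 64)
    alpha = 0.05
    print(z, p, "reject H0" if p < alpha else "accept H0")
```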
Q: Explain the procedure for testing a hypothesis about the mean of a population when $\sigma$ is known or unknown, the population is non-normal and $n \ge 30$?
Ans: The central limit theorem tells us that for a large sample size, the sampling distribution of $\bar{x}$ is approximately normal even though the parent population is non-normal, whether $\sigma$ is known or unknown.
If $\sigma$ is known, then the random variable is
\[ Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \]
and the testing procedure is the same as mentioned above.
If $\sigma$ is unknown, then the random variable is
\[ Z = \frac{\bar{x} - \mu_0}{S/\sqrt{n}} \]
and the testing procedure is the same as mentioned above.
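$\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$. In
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]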
It is exactly a standard normal variable even when the sample sizes are small. Hence it is used as the test statistic for testing a hypothesis about the difference between two population means. The procedure is
1. Formulate the null and alternative hypotheses from the following three forms.
a) $H_0: \mu_1 - \mu_2 = \Delta_0$ and $H_1: \mu_1 - \mu_2 \ne \Delta_0$ ($\Delta_0$ may be equal to zero)
b) $H_0: \mu_1 - \mu_2 \le \Delta_0$ and $H_1: \mu_1 - \mu_2 > \Delta_0$
c) $H_0: \mu_1 - \mu_2 \ge \Delta_0$ and $H_1: \mu_1 - \mu_2 < \Delta_0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.01$ or $0.05$).
3. The test statistic $Z$, under $H_0$, becomes
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
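The two-sample statistic can be computed as follows. This is a sketch under the assumption that both population standard deviations are known; the function name `z_two_means` and the example values are illustrative only.

```python
import math
from statistics import NormalDist


def z_two_means(xbar1, xbar2, sigma1, sigma2, n1, n2, delta0=0.0):
    """Z = ((xbar1 - xbar2) - delta0) / sqrt(sigma1^2/n1 + sigma2^2/n2)."""
    se = math.sqrt(sigma1**2 / n1 + sigma2**2 / n2)  # standard error of the difference
    return ((xbar1 - xbar2) - delta0) / se


if __name__ == "__main__":
    # Hypothetical data: H0: mu1 - mu2 = 0 against H1: mu1 - mu2 != 0, alpha = 0.05
    z = z_two_means(80, 75, 6, 8, 36, 64)
    z_crit = NormalDist().inv_cdf(1 - 0.05 / 2)      # two-sided critical value
    print(z, "reject H0" if abs(z) > z_crit else "accept H0")
```

Replacing `sigma1` and `sigma2` by the sample standard deviations gives the large-sample version of the same statistic.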
Q: Explain the procedure for testing a hypothesis about the difference between two population means when both populations are normal and the population standard deviations are unknown?
Ans: Let $\bar{x}_1$ be the mean of the first random sample of size $n_1$ from a normal population with mean $\mu_1$ and unknown standard deviation $\sigma_1$, and let $\bar{x}_2$ be the mean of the second random sample of size $n_2$ from another normal population with mean $\mu_2$ and unknown standard deviation $\sigma_2$. Since $\sigma_1$ and $\sigma_2$ are both unknown, they are estimated by the sample standard deviations. For sufficiently large sample sizes ($n_1, n_2 > 30$), the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$ is approximately normal with mean $\mu_1 - \mu_2$ and standard deviation $\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}$, where $S_1^2$ is the variance of the first sample and $S_2^2$ is the variance of the second sample. The standard normal variable is
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \]
Under $H_0$, the test statistic becomes
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \]
4. Compute the value of Z from the sample data.
5. Determine the rejection region, which actually depends upon the alternative hypothesis and it can be
described as
When the Alternative Hypothesis is
a) $H_1: \mu_1 - \mu_2 \ne \Delta_0$ (two-sided)
b) $H_1: \mu_1 - \mu_2 < \Delta_0$ (one-sided)
c) $H_1: \mu_1 - \mu_2 > \Delta_0$ (one-sided)
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
Q: Explain the procedure for testing a hypothesis about the difference between two population means when both populations are non-normal and the population standard deviations are known or unknown, but the sample sizes are large?
Ans: When both populations are non-normal but the sample sizes are sufficiently large, the central limit theorem tells us that the sampling distribution of $(\bar{x}_1 - \bar{x}_2)$ will be approximately normal, whether or not the population standard deviations are known.
If $\sigma_1$ and $\sigma_2$ are known, then
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}} \]
and the testing procedure is the same as mentioned above.
If $\sigma_1$ and $\sigma_2$ are unknown, then
\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{S_1^2}{n_1} + \dfrac{S_2^2}{n_2}}} \]
and the testing procedure is the same as mentioned above.
Q: Explain the procedure for testing a hypothesis about a population proportion when the sample size is large?
Ans: Let $\hat{p}$ be the proportion of successes in a sample of size $n$ drawn from a binomial population having proportion $p$. If the sample size $n$ is sufficiently large, then $\hat{p}$ will be approximately normally distributed with mean $p$ and standard deviation $\sqrt{pq/n}$, where $q = 1 - p$. In other words, if the sample is large, then the standard variable
\[ Z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} \]
is approximately normal. Since $\hat{p} = \dfrac{x}{n}$, where $x$ is the actual number of successes in the random sample, the standard variable becomes
\[ Z = \frac{\hat{p} - p}{\sqrt{\dfrac{pq}{n}}} = \frac{x - np}{\sqrt{npq}} \]
Suppose we want to test a specified value $p_0$ of the population proportion; then
\[ Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} = \frac{x - np_0}{\sqrt{np_0 q_0}} \]
It is used as the test statistic for testing the population proportion, and the testing procedure is
1. Formulate the null and alternative hypotheses from the following three forms.
a) $H_0: p = p_0$ and $H_1: p \ne p_0$
b) $H_0: p \le p_0$ and $H_1: p > p_0$
c) $H_0: p \ge p_0$ and $H_1: p < p_0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.01$ or $0.05$).
3. The test statistic is
\[ Z = \frac{x - np_0}{\sqrt{np_0 q_0}} \quad \text{or} \quad Z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0 q_0}{n}}} \ \text{(using } \hat{p} \text{ directly)} \]
4. Compute the value of Z from the sample data.
5. Determine the rejection region, which actually depends upon the alternative hypothesis and it can be
described as
When the Alternative Hypothesis is
a) $H_1: p \ne p_0$ (two-sided)
b) $H_1: p < p_0$ (one-sided)
c) $H_1: p > p_0$ (one-sided)
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
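The two equivalent forms of the proportion statistic can be verified numerically. The sketch below is illustrative; `z_test_proportion` and the sample figures are assumptions, not taken from the notes.

```python
import math
from statistics import NormalDist


def z_test_proportion(x, n, p0):
    """Return Z computed from the count x and from p_hat = x/n; both must agree."""
    q0 = 1 - p0
    z_count = (x - n * p0) / math.sqrt(n * p0 * q0)   # (x - n p0) / sqrt(n p0 q0)
    p_hat = x / n
    z_prop = (p_hat - p0) / math.sqrt(p0 * q0 / n)    # (p_hat - p0) / sqrt(p0 q0 / n)
    return z_count, z_prop


if __name__ == "__main__":
    # Hypothetical data: 60 successes in 100 trials, H0: p = 0.5 against H1: p > 0.5
    z_count, z_prop = z_test_proportion(60, 100, 0.5)
    z_crit = NormalDist().inv_cdf(1 - 0.05)           # one-sided critical value
    print(z_count, z_prop, "reject H0" if z_count > z_crit else "accept H0")
```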
Q: Explain the procedure for testing a hypothesis about the difference between two population proportions when the sample sizes are large?
Ans: Suppose we wish to test the hypothesis that the difference between two proportions is equal to a specified value $\Delta_0$, or that the two proportions are equal. The statistic on which we base our decision rule is the variable $(\hat{p}_1 - \hat{p}_2)$, where $\hat{p}_1$ is the proportion of successes in the first sample of size $n_1$ and $\hat{p}_2$ is the proportion of successes in the second sample of size $n_2$. The samples are drawn from two binomial populations with unknown proportions of success $p_1$ and $p_2$ respectively. If the samples are sufficiently large, the sampling distribution of the difference $(\hat{p}_1 - \hat{p}_2)$ is approximately normal with mean $p_1 - p_2$ and standard deviation $\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}$, and the standard variable is
\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1 q_1}{n_1} + \dfrac{p_2 q_2}{n_2}}} \]
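\[ Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_c \hat{q}_c \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} \]
where $\hat{p}_c = \dfrac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2}$ and $\hat{q}_c = 1 - \hat{p}_c$.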
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it
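The pooled statistic can be sketched as follows, assuming the null hypothesis is that the two proportions are equal. The function name `z_two_proportions` and the counts used are hypothetical.

```python
import math


def z_two_proportions(x1, n1, x2, n2):
    """Pooled Z statistic for H0: p1 = p2.

    p_c = (n1*p1_hat + n2*p2_hat) / (n1 + n2), which equals (x1 + x2) / (n1 + n2).
    """
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_c = (n1 * p1_hat + n2 * p2_hat) / (n1 + n2)   # pooled estimate of the common p
    q_c = 1 - p_c
    se = math.sqrt(p_c * q_c * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se


if __name__ == "__main__":
    # Hypothetical counts: 60 successes out of 100 versus 40 out of 100
    print(z_two_proportions(60, 100, 40, 100))
```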
Chapter 4
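\[ \chi^2 = \sum_{i=1}^{n} Z_i^2 \]
The sampling distribution of the $\chi^2$ random variable is called the chi-square distribution and its pdf is
\[ f(\chi^2) = \frac{(\chi^2)^{(n/2)-1}\, e^{-\chi^2/2}}{2^{n/2}\, \Gamma(n/2)}, \qquad 0 < \chi^2 < \infty \]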
10. By Fisher, for sufficiently large $n$, the random variable $\sqrt{2\chi^2}$ is approximately normally distributed with mean $\sqrt{2n-1}$ and unit variance. Similarly, by Wilson and Hilferty, the random variable $\left( \dfrac{\chi^2}{n} \right)^{1/3}$ is approximately normal with mean $1 - \dfrac{2}{9n}$ and variance $\dfrac{2}{9n}$.
2
deviations from the sample mean to the population variance, has a chi-square distribution with $(n-1)$ degrees of freedom.
\[ \chi^2 = \frac{nS^2}{\sigma^2} = \frac{\sum (x_i - \bar{x})^2}{\sigma^2} \]
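Then, according to the chi-square distribution, the probability that the value of $\chi^2$ will fall in the interval from $\chi^2_{1-\alpha/2}$ to $\chi^2_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left[ \chi^2_{1-\alpha/2} < \chi^2 < \chi^2_{\alpha/2} \right] = 1 - \alpha \]
Inserting the value of chi-square, we get
\[ P\left[ \chi^2_{1-\alpha/2} < \frac{\sum (x - \bar{x})^2}{\sigma^2} < \chi^2_{\alpha/2} \right] = 1 - \alpha \]
Dividing all the terms by $\sum (x - \bar{x})^2$,
\[ P\left[ \frac{\chi^2_{1-\alpha/2}}{\sum (x - \bar{x})^2} < \frac{1}{\sigma^2} < \frac{\chi^2_{\alpha/2}}{\sum (x - \bar{x})^2} \right] = 1 - \alpha \]
Taking reciprocals reverses the inequalities, so this can be written as
\[ P\left[ \frac{\sum (x - \bar{x})^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{\sum (x - \bar{x})^2}{\chi^2_{1-\alpha/2}} \right] = 1 - \alpha \]
If instead of the sample values the biased sample variance $S^2 = \dfrac{\sum (x - \bar{x})^2}{n}$ is given, then the confidence interval will be
\[ P\left[ \frac{nS^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{nS^2}{\chi^2_{1-\alpha/2}} \right] = 1 - \alpha \]
If the unbiased sample variance $s^2 = \dfrac{\sum (x - \bar{x})^2}{n-1}$ is given, then the confidence interval will be
\[ P\left[ \frac{(n-1)s^2}{\chi^2_{\alpha/2}} < \sigma^2 < \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}} \right] = 1 - \alpha \]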
\[ \chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}} = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]
The symbols $O_i$ and $E_i$ represent the observed and expected frequencies respectively for the $i$th class, and $k$ represents the number of possible outcomes or the number of different classes. The sampling distribution of $\chi^2$ approaches the chi-square distribution with $\nu = k - 1 - m$ degrees of freedom, where $k$ represents the number of classes and $m$ is the number of parameters estimated from the sample statistics.
Q: Explain the procedure for testing a hypothesis for a goodness of fit test?
Ans: The procedure for a goodness of fit test is as follows.
1. Formulate the null and alternative hypotheses as
$H_0$: The population has a specified probability distribution, and
$H_1$: The population does not have the specified distribution.
2. Decide the level of significance $\alpha$. The commonly used value is $\alpha = 0.05$.
3. The test statistic to use is
\[ \chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \]
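The goodness of fit statistic and its degrees of freedom can be computed as in the sketch below. The function name `chi_square_gof` and the die-rolling data are illustrative assumptions; the critical value is assumed to be read from a chi-square table, since the Python standard library has no chi-square quantile function.

```python
def chi_square_gof(observed, expected, m=0):
    """Return (chi2, df) with chi2 = sum((O - E)^2 / E) and df = k - 1 - m.

    m is the number of parameters estimated from the sample.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1 - m
    return chi2, df


if __name__ == "__main__":
    # Hypothetical data: a die rolled 60 times, H0: the die is fair (E_i = 10)
    chi2, df = chi_square_gof([8, 12, 9, 11, 10, 10], [10] * 6)
    critical = 11.070   # tabulated chi-square value for alpha = 0.05, df = 5
    print(chi2, df, "reject H0" if chi2 > critical else "accept H0")
```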
Attributes
A characteristic which varies only in quality from one individual to another is called an attribute such as
male or female, tall or short, satisfied or dissatisfied, high or low, healthy or diseased, positive or negative
etc. The attributes cannot be measured accurately but they can be divided into classes and their numbers
in each class can be counted.
Dichotomy
If the data (population) are divided into two distinct and mutually exclusive classes by a single attribute, as for instance when the population of human beings is divided into males and females, the process is called dichotomy.
A population may also be divided into three classes or into more than three classes, which is called trichotomy or manifold division respectively.
Order of Classes
Order of classes is known by the number of attributes specifying the class. For example a class specified
by one attribute is known as the class of order 1.
Consistency
If the class frequencies are observed in a certain sample data and all class frequencies are recorded correctly then there will be no error in them and they will be called consistent.
Independence
Suppose in a population of size $N$, the class frequencies of two attributes $A$ and $B$ are given by $(A)$ and $(B)$. Then the two attributes $A$ and $B$ are said to be independent if the actual frequency equals the expected one, that is
\[ (AB) = \frac{(A)(B)}{N} \]
Similarly, $\alpha$ and $\beta$ will be independent if
\[ (\alpha\beta) = \frac{(\alpha)(\beta)}{N} \]
Association
Two attributes $A$ and $B$ are said to be associated only if they appear together a larger number of times than would be expected if they were independent. There may be complete association (perfect positive association) or complete disassociation (perfect negative association).
Positively Associated or Simply Associated: $A$ and $B$ are said to be positively associated, or simply associated, if
\[ (AB) > \frac{(A)(B)}{N} \]
Negatively Associated or Disassociated: On the contrary, $A$ and $B$ are said to be negatively associated, or briefly disassociated, if
\[ (AB) < \frac{(A)(B)}{N} \]
Note: Disassociation does not mean independence.
Measures of Association
The strength of association between two attributes $A$ and $B$ is measured by a coefficient called the coefficient of association, defined by the formula
\[ Q = \frac{(AB)(\alpha\beta) - (A\beta)(\alpha B)}{(AB)(\alpha\beta) + (A\beta)(\alpha B)} \]
It lies between $-1$ and $+1$. When $Q = 0$ the attributes are independent, when $Q = +1$ there is complete association, and when $Q = -1$ there is complete disassociation.
Another coefficient, known as the coefficient of colligation, also measures the strength of association and is defined as
\[ Y = \frac{1 - \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}}{1 + \sqrt{\dfrac{(A\beta)(\alpha B)}{(AB)(\alpha\beta)}}} \]
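Both coefficients can be evaluated from the four cell frequencies of a 2 by 2 classification. The sketch below is illustrative: the function name `association_coefficients`, the argument names and the frequencies are assumptions, and the relation $Q = 2Y/(1 + Y^2)$ is used only as a numerical cross-check.

```python
import math


def association_coefficients(ab, a_beta, alpha_b, alpha_beta):
    """Coefficient of association Q and coefficient of colligation Y.

    ab = (AB), a_beta = (A beta), alpha_b = (alpha B), alpha_beta = (alpha beta).
    Assumes (AB)(alpha beta) > 0 so that the ratio under the square root exists.
    """
    same = ab * alpha_beta       # (AB)(alpha beta)
    cross = a_beta * alpha_b     # (A beta)(alpha B)
    q = (same - cross) / (same + cross)
    root = math.sqrt(cross / same)
    y = (1 - root) / (1 + root)
    return q, y


if __name__ == "__main__":
    # Hypothetical frequencies: (AB)=40, (A beta)=10, (alpha B)=20, (alpha beta)=30
    q, y = association_coefficients(40, 10, 20, 30)
    print(q, y)
```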
Contingency Table
A table consisting of $r$ rows and $c$ columns, in which the data are classified according to two attributes $A$ and $B$, is called an $r \times c$ contingency table.

Attributes   B_1              B_2              ...   B_{c-1}              B_c              Total
A_1          (A_1 B_1)        (A_1 B_2)        ...   (A_1 B_{c-1})        (A_1 B_c)        (A_1)
A_2          (A_2 B_1)        (A_2 B_2)        ...   (A_2 B_{c-1})        (A_2 B_c)        (A_2)
...          ...              ...              ...   ...                  ...              ...
A_{r-1}      (A_{r-1} B_1)    (A_{r-1} B_2)    ...   (A_{r-1} B_{c-1})    (A_{r-1} B_c)    (A_{r-1})
A_r          (A_r B_1)        (A_r B_2)        ...   (A_r B_{c-1})        (A_r B_c)        (A_r)
Total        (B_1)            (B_2)            ...   (B_{c-1})            (B_c)            n
The simplest form of a contingency table is the $2 \times 2$ (read as 2 by 2) table, in which the two attributes are dichotomised.
Q: Explain the procedure for testing a hypothesis of independence in a contingency table?
Ans: Independence in a contingency table is tested by $\chi^2$. The procedure involves six steps, which are as follows.
1. Formulate the null and alternative hypotheses as
$H_0$: Two characteristics or two criteria of classification are independent.
$H_1$: Two characteristics or two criteria of classification are not independent.
2. Choose a significance level $\alpha$. The commonly used levels are $\alpha = 0.05$ and $0.01$.
3. The test statistic $\chi^2$, which compares the expected and the observed cell frequencies, is
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(o_{ij} - e_{ij})^2}{e_{ij}} \]
which, if $H_0$ is true, has an approximate chi-square distribution with $(r-1)(c-1)$ degrees of freedom.
4. Compute the expected frequencies under $H_0$ for each cell by the formula $e_{ij} = \dfrac{(A_i)(B_j)}{n}$; also calculate the value of $\chi^2$ and the degrees of freedom.
5. Determine the critical region, which depends on $\alpha$ and the number of degrees of freedom, that is
\[ \chi^2 > \chi^2_{\alpha,\,[(r-1)(c-1)]} \]
6. Draw the conclusion: we reject $H_0$ if $\chi^2 > \chi^2_{\alpha,\,[(r-1)(c-1)]}$, otherwise we accept it.
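Steps 3 to 6 can be sketched as follows. The function name `chi_square_independence` and the table of counts are illustrative assumptions; the critical value is assumed to come from a chi-square table.

```python
def chi_square_independence(table):
    """Return (chi2, df, expected) for an r x c table of observed frequencies.

    Expected frequency of cell (i, j) is (row total i) * (column total j) / n.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


if __name__ == "__main__":
    # Hypothetical 2 x 2 table of observed frequencies
    chi2, df, expected = chi_square_independence([[30, 20], [20, 30]])
    critical = 3.841   # tabulated chi-square value for alpha = 0.05, df = 1
    print(chi2, df, "reject H0" if chi2 > critical else "accept H0")
```

For a 2 by 2 table the same value is obtained from the shortcut $n(ad - bc)^2 / [(a+b)(a+c)(c+d)(b+d)]$.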
\[ Q = \frac{\chi^2}{n(q-1)}, \qquad 0 \le Q \le 1 \]
where $q$ is the number of rows or columns, whichever is smaller, and $n$ represents the sample size.
Note: If $Q = 0$ then the attributes are independent, and when $Q = 1$ there is a perfect relationship.
\[ \chi^2 = \sum \frac{(|o_i - e_i| - 0.5)^2}{e_i} \]
If the expected frequencies are large, the corrected and uncorrected results are almost the same. When the expected frequencies are between 5 and 10, Yates' correction should be applied.
         B_1     B_2     Total
A_1      a       b       a+b
A_2      c       d       c+d
Total    a+c     b+d     n

\[ \chi^2 = \frac{n(ad - bc)^2}{(a+b)(a+c)(c+d)(b+d)} \]
Degrees of freedom
Degrees of freedom is the number of values that are free to vary after we have placed certain restrictions
upon the data.
Chapter 5
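The distribution of
\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]
is called the t-distribution or Student's t-distribution with $\nu = n - 1$ degrees of freedom, having pdf
\[ f(t) = \frac{1}{\sqrt{\nu}\, B\!\left( \frac{1}{2}, \frac{\nu}{2} \right)} \left( 1 + \frac{t^2}{\nu} \right)^{-(\nu+1)/2}, \qquad -\infty < t < \infty \]
\[ \sigma^2 = \frac{\nu}{\nu - 2} \quad \text{for } \nu > 2. \]
Note: The variance does not exist for $\nu \le 2$. The variance is greater than 1 and approaches 1 as the degrees of freedom increase.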
4. The distribution is unimodal with a bell shape. The mode of the distribution is t = 0 and the median
is also equal to zero.
5. The shape of the distribution changes as the number of degrees of freedom (or the sample size) changes.
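\[ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \]
with degrees of freedom $\nu = n - 1$.
Then, according to the t-distribution, the probability that a value of $t$ will fall in the interval from $-t_{\alpha/2,(\nu)}$ to $t_{\alpha/2,(\nu)}$ is equal to $1 - \alpha$:
\[ P\left[ -t_{\alpha/2,(\nu)} \le t \le t_{\alpha/2,(\nu)} \right] = 1 - \alpha \]
Inserting $t$, we get
\[ P\left[ -t_{\alpha/2,(\nu)} \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le t_{\alpha/2,(\nu)} \right] = 1 - \alpha \]
Multiplying each term by $s/\sqrt{n}$,
\[ P\left[ -t_{\alpha/2,(\nu)}\, s/\sqrt{n} \le \bar{x} - \mu \le t_{\alpha/2,(\nu)}\, s/\sqrt{n} \right] = 1 - \alpha \]
Subtracting $\bar{x}$ from each term, we have
\[ P\left[ -\bar{x} - t_{\alpha/2,(\nu)}\, s/\sqrt{n} \le -\mu \le -\bar{x} + t_{\alpha/2,(\nu)}\, s/\sqrt{n} \right] = 1 - \alpha \]
Multiplying by $-1$, the inequality signs are reversed:
\[ P\left[ \bar{x} + t_{\alpha/2,(\nu)}\, s/\sqrt{n} \ge \mu \ge \bar{x} - t_{\alpha/2,(\nu)}\, s/\sqrt{n} \right] = 1 - \alpha \]
which is equivalent to
\[ P\left[ \bar{x} - t_{\alpha/2,(\nu)}\, s/\sqrt{n} \le \mu \le \bar{x} + t_{\alpha/2,(\nu)}\, s/\sqrt{n} \right] = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $\mu$ for a particular sample of size $n$ ($n < 30$) is
\[ \bar{x} \pm t_{\alpha/2,(\nu)} \frac{s}{\sqrt{n}} \]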
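\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]
Then, according to the t-distribution, the probability that a value of $t$ will fall in the interval from $-t_{\alpha/2,(\nu)}$ to $t_{\alpha/2,(\nu)}$ is equal to $1 - \alpha$:
\[ P\left[ -t_{\alpha/2,(\nu)} \le t \le t_{\alpha/2,(\nu)} \right] = 1 - \alpha \]
Inserting $t$, we get
\[ P\left[ -t_{\alpha/2,(\nu)} \le \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \le t_{\alpha/2,(\nu)} \right] = 1 - \alpha \]
Multiplying by $s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}$,
\[ P\left[ -t_{\alpha/2,(\nu)}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le (\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2) \le t_{\alpha/2,(\nu)}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] = 1 - \alpha \]
Subtracting $(\bar{x}_1 - \bar{x}_2)$ from each term and multiplying by $-1$ (which reverses the inequality signs),
\[ P\left[ (\bar{x}_1 - \bar{x}_2) - t_{\alpha/2,(\nu)}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{x}_1 - \bar{x}_2) + t_{\alpha/2,(\nu)}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \right] = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $\mu_1 - \mu_2$ is
\[ (\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2,(\nu)}\, s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]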
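\[ t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} \]
\[ P\left[ -t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \le \bar{d} - \mu_d \le t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \right] = 1 - \alpha \]
Subtracting $\bar{d}$ from each term, we have
\[ P\left[ -\bar{d} - t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \le -\mu_d \le -\bar{d} + t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \right] = 1 - \alpha \]
Multiplying by $-1$, the inequality signs are reversed:
\[ P\left[ \bar{d} + t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \ge \mu_d \ge \bar{d} - t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \right] = 1 - \alpha \]
which is equivalent to
\[ P\left[ \bar{d} - t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \le \mu_d \le \bar{d} + t_{\alpha/2,(\nu)}\, s_d/\sqrt{n} \right] = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $\mu_d$ for a particular sample of size $n$ ($n < 30$) is
\[ \bar{d} \pm t_{\alpha/2,(\nu)} \frac{s_d}{\sqrt{n}} \]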
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]
has, when the hypothesis is true, a t-distribution with $\nu = n - 1$ degrees of freedom. The testing procedure is
1. Formulate the null and alternative hypotheses about $\mu$. Three possible forms are
a) $H_0: \mu = \mu_0$ and $H_1: \mu \ne \mu_0$
b) $H_0: \mu \le \mu_0$ and $H_1: \mu > \mu_0$
c) $H_0: \mu \ge \mu_0$ and $H_1: \mu < \mu_0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.05$ or $0.01$).
3. The test statistic in this case is
\[ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \]
6. Conclusion: Reject H0 if value of t(calculated) falls in the rejection region otherwise accept it.
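\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]
Under $H_0$ the test statistic becomes
\[ t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \]
where
\[ s_p^2 = \frac{1}{n_1 + n_2 - 2} \left[ \left\{ \sum x_{1i}^2 - \frac{\left( \sum x_1 \right)^2}{n_1} \right\} + \left\{ \sum x_{2i}^2 - \frac{\left( \sum x_2 \right)^2}{n_2} \right\} \right] \]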
5. Determine the rejection region, which actually depends upon the alternative hypothesis and it can be
described as
When the Alternative Hypothesis is
a) $H_1: \mu_1 - \mu_2 \ne \Delta_0$ (two-sided)
b) $H_1: \mu_1 - \mu_2 < \Delta_0$ (one-sided)
c) $H_1: \mu_1 - \mu_2 > \Delta_0$ (one-sided)
6. Conclusion: Reject H0 if value of t(calculated) falls in the rejection region otherwise accept it.
Paired Observations
There are many situations in which the two samples are not independent. This happens when the observations are found in pairs, so that the two observations of a pair are related to each other. Pairing occurs either naturally or by design (artificial pairing).
Natural Pairing
Natural pairing occurs whenever a measurement is taken on the same unit or individual at two different times.
For example, suppose 10 young recruits are given a strenuous physical training programme by the Army.
Their weights are recorded before they begin and after they complete the training. The two observations
obtained for each recruit (the before-and-after measurements) constitute natural pairing.
Pairing by Design/ Artificial Pairing
The two observations are also paired to eliminate effects in which there is no interest.
corresponding statistics $\bar{d} = \dfrac{\sum_{i=1}^{n} d_i}{n}$ and $s_d^2 = \dfrac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}$, where $n$ represents the number of pairs. Here the differences $(d_1, d_2, \ldots, d_n)$ form a random sample from a normal distribution, and the statistic is
\[ t = \frac{\bar{d} - \mu_d}{s_d/\sqrt{n}} \]
which follows the t-distribution with $\nu = n - 1$ degrees of freedom. Then the testing procedure is
1. Formulate the null and alternative hypotheses from the following three forms.
a) $H_0: \mu_1 - \mu_2 = d_0$ and $H_1: \mu_1 - \mu_2 \ne d_0$ ($d_0$ may be equal to zero)
b) $H_0: \mu_1 - \mu_2 \le d_0$ and $H_1: \mu_1 - \mu_2 > d_0$
c) $H_0: \mu_1 - \mu_2 \ge d_0$ and $H_1: \mu_1 - \mu_2 < d_0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.01$ or $0.05$).
3. The test statistic $t$, under $H_0$, becomes
\[ t = \frac{\bar{d} - d_0}{s_d/\sqrt{n}} \]
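The paired statistic can be computed from raw before-and-after measurements, as in the sketch below. The function name `paired_t` and the weights are hypothetical; the critical value $t_{\alpha/2,(\nu)}$ is assumed to be read from a t-table.

```python
import math
from statistics import mean, stdev


def paired_t(first, second, d0=0.0):
    """Return (t, df) for paired samples, using differences d_i = second_i - first_i.

    t = (d_bar - d0) / (s_d / sqrt(n)), with df = n - 1.
    """
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    d = [b - a for a, b in zip(first, second)]
    n = len(d)
    d_bar = mean(d)
    s_d = stdev(d)     # sample standard deviation with divisor n - 1
    t = (d_bar - d0) / (s_d / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical weights of 5 recruits before and after training
    before = [70, 72, 68, 75, 71]
    after = [72, 75, 69, 78, 71]
    t, df = paired_t(before, after)
    critical = 2.776   # tabulated t value for alpha/2 = 0.025, df = 4
    print(t, df, "reject H0" if abs(t) > critical else "accept H0")
```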
Chapter 6
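\[ f(F) = \frac{\Gamma\left[ (\nu_1 + \nu_2)/2 \right] (\nu_1/\nu_2)^{\nu_1/2}\, F^{(\nu_1/2)-1}}{\Gamma(\nu_1/2)\, \Gamma(\nu_2/2) \left( 1 + \dfrac{\nu_1 F}{\nu_2} \right)^{(\nu_1+\nu_2)/2}}, \qquad 0 < F < \infty \]
Properties of F-distribution
The F-distribution has the following important properties
1. Area under the curve is unity.
2. The F-distribution always ranges from zero to infinity.
3. The mean and variance of the distribution with $\nu_1$ and $\nu_2$ degrees of freedom are
\[ \mu = \frac{\nu_2}{\nu_2 - 2} \ (\nu_2 > 2) \quad \text{and} \quad \sigma^2 = \frac{2\nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)} \ (\nu_2 > 4) \]
4. The F-distribution for $\nu_1 > 2$ is unimodal, and the mode of the distribution is at
\[ F = \frac{\nu_2 (\nu_1 - 2)}{\nu_1 (\nu_2 + 2)} \]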
\[ F_{1-\alpha}(\nu_1, \nu_2) = \frac{1}{F_{\alpha}(\nu_2, \nu_1)} \]
7. The square of a t-distributed random variable with $\nu$ degrees of freedom has an F-distribution with 1 and $\nu$ degrees of freedom. Symbolically,
\[ t^2 = \frac{Z^2}{\chi^2_{\nu}/\nu} = \frac{\chi^2_1/1}{\chi^2_{\nu}/\nu} = F_{(1,\nu)} \]
8. The F-distribution does not possess a moment generating function because some of its moments are infinite.
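\[ F = \frac{\chi_1^2/\nu_1}{\chi_2^2/\nu_2} \]
But
\[ \chi_1^2 = \frac{(n_1 - 1)s_1^2}{\sigma^2} \]
Similarly
\[ \chi_2^2 = \frac{(n_2 - 1)s_2^2}{\sigma^2} \]
so that, with $\nu_1 = n_1 - 1$ and $\nu_2 = n_2 - 1$,
\[ F = \frac{s_1^2}{s_2^2} \]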
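Confidence Interval For The Variance Ratio $\sigma_1^2/\sigma_2^2$
Let two independent random samples of sizes $n_1$ and $n_2$ be taken from two normal populations with variances $\sigma_1^2$ and $\sigma_2^2$, and let $s_1^2$ and $s_2^2$ be the unbiased estimates of $\sigma_1^2$ and $\sigma_2^2$. Then we know that
\[ F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} = \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2} \]
Then, according to the F-distribution, the probability that a value of $F$ will fall in the interval from $F_{1-\alpha/2}(\nu_1, \nu_2)$ to $F_{\alpha/2}(\nu_1, \nu_2)$ is equal to $1 - \alpha$:
\[ P\left[ F_{1-\alpha/2}(\nu_1, \nu_2) < F < F_{\alpha/2}(\nu_1, \nu_2) \right] = 1 - \alpha \]
Putting in the value of $F$, we get
\[ P\left[ F_{1-\alpha/2}(\nu_1, \nu_2) < \frac{\sigma_2^2 s_1^2}{\sigma_1^2 s_2^2} < F_{\alpha/2}(\nu_1, \nu_2) \right] = 1 - \alpha \]
Multiplying each term in the inequality by $\dfrac{s_2^2}{s_1^2}$, we obtain
\[ P\left[ \frac{s_2^2}{s_1^2} F_{1-\alpha/2}(\nu_1, \nu_2) < \frac{\sigma_2^2}{\sigma_1^2} < \frac{s_2^2}{s_1^2} F_{\alpha/2}(\nu_1, \nu_2) \right] = 1 - \alpha \]
Taking reciprocals reverses the inequalities:
\[ P\left[ \frac{s_1^2}{s_2^2} \frac{1}{F_{1-\alpha/2}(\nu_1, \nu_2)} > \frac{\sigma_1^2}{\sigma_2^2} > \frac{s_1^2}{s_2^2} \frac{1}{F_{\alpha/2}(\nu_1, \nu_2)} \right] = 1 - \alpha \]
which is equivalent to
\[ P\left[ \frac{s_1^2}{s_2^2} \frac{1}{F_{\alpha/2}(\nu_1, \nu_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2} \frac{1}{F_{1-\alpha/2}(\nu_1, \nu_2)} \right] = 1 - \alpha \]
We know that
\[ \frac{1}{F_{1-\alpha/2}(\nu_1, \nu_2)} = F_{\alpha/2}(\nu_2, \nu_1) \]
Therefore
\[ P\left[ \frac{s_1^2}{s_2^2} \frac{1}{F_{\alpha/2}(\nu_1, \nu_2)} < \frac{\sigma_1^2}{\sigma_2^2} < \frac{s_1^2}{s_2^2} F_{\alpha/2}(\nu_2, \nu_1) \right] = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $\dfrac{\sigma_1^2}{\sigma_2^2}$ is
\[ \left( \frac{s_1^2}{s_2^2} \frac{1}{F_{\alpha/2}(\nu_1, \nu_2)}, \ \frac{s_1^2}{s_2^2} F_{\alpha/2}(\nu_2, \nu_1) \right) \]
We can also find a confidence interval for $\sigma_1/\sigma_2$ by taking the square root of the endpoints of this interval.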
\[ F = \frac{s_1^2}{s_2^2} \]
The procedure for testing a hypothesis that the population variances $\sigma_1^2$ and $\sigma_2^2$ are equal consists of the following steps.
1. Formulate the null hypothesis as
$H_0: \sigma_1^2/\sigma_2^2 = 1$ (that is, $H_0: \sigma_1^2 = \sigma_2^2$). The alternative hypothesis may be
(a) $H_1: \sigma_1^2/\sigma_2^2 > 1$
(b) $H_1: \sigma_1^2/\sigma_2^2 < 1$
(c) $H_1: \sigma_1^2/\sigma_2^2 \ne 1$
2. Decide the level of significance $\alpha$. Usually 0.05 or 0.01 is used.
3. The test statistic to use is
\[ F = \frac{s_1^2}{s_2^2} \quad \text{(where } s_1^2 \text{ is larger than } s_2^2 \text{)} \]
which, if $H_0$ is true, has an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom.
4. Calculate the value of $F$ from the sample data.
5. Determine the critical region, which depends upon the size of $\alpha$ and the degrees of freedom.
(a) When $H_1: \sigma_1^2/\sigma_2^2 > 1$ ($H_1: \sigma_1^2 > \sigma_2^2$), the critical region will be $F(\text{calculated}) \ge F_{\alpha}(\nu_1, \nu_2)$.
(b) When $H_1: \sigma_1^2/\sigma_2^2 < 1$ ($H_1: \sigma_1^2 < \sigma_2^2$), the critical region will be $F(\text{calculated}) \le \dfrac{1}{F_{\alpha}(\nu_2, \nu_1)}$, since we know that $F_{1-\alpha}(\nu_1, \nu_2) = \dfrac{1}{F_{\alpha}(\nu_2, \nu_1)}$.
(c) When $H_1: \sigma_1^2/\sigma_2^2 \ne 1$ ($H_1: \sigma_1^2 \ne \sigma_2^2$), the critical region will be $F(\text{calculated}) \le \dfrac{1}{F_{\alpha/2}(\nu_2, \nu_1)}$ or $F(\text{calculated}) \ge F_{\alpha/2}(\nu_1, \nu_2)$.
6. Draw the conclusion: reject $H_0$ if $F(\text{calculated})$ falls in the critical or rejection region, otherwise accept $H_0$.
Chapter 7
Introduction
A simple linear regression model that describes the relationship between $x$ and $y$ takes the form
\[ Y_i = \alpha + \beta X_i + \varepsilon_i \]
where $\alpha$ is the intercept term, $\beta$ is the slope of the line or regression coefficient, and $\varepsilon_i$ is the error term or disturbance term. The random errors $\varepsilon_i$ are assumed to be independent of $X_i$ and normally distributed with $E(\varepsilon_i) = 0$ and $\operatorname{var}(\varepsilon_i) = \sigma_{y.x}^2$. The above regression line is estimated from the sample data by $\hat{Y} = a + b x_i$, where
\[ b = \frac{n \sum XY - \sum X \sum Y}{n \sum X^2 - \left( \sum X \right)^2} \quad \text{and} \quad a = \bar{Y} - b\bar{X}. \]
The quantities $a$, $b$, $\hat{Y}$ and $\bar{Y}$ will vary from one sample to another. They are thus random variables and hence have sampling distributions with their own means and variances.
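\[ \sigma_b^2 = \frac{\sigma_{y.x}^2}{\sum (x_i - \bar{x})^2} \]
The standard error of $b$ is
\[ \sigma_b = \frac{\sigma_{y.x}}{\sqrt{\sum (x_i - \bar{x})^2}} \]
Generally, $\sigma_b^2$ will be unknown; we therefore require an estimate of $\sigma_b^2$ from the sample data. The unbiased estimator of $\sigma_{y.x}^2$ is given by
\[ s_{y.x}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} \]
Thus the estimate of $\sigma_b^2$, denoted by $s_b^2$, may be taken as
\[ s_b^2 = \frac{s_{y.x}^2}{\sum (x_i - \bar{x})^2} \]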
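The estimates $a$, $b$ and the standard error $s_b$ can be computed step by step from the formulas above. The following is a minimal sketch; the function name `slope_inference` and the small data set are illustrative assumptions.

```python
import math


def slope_inference(x, y):
    """Least-squares estimates a, b and the estimated standard error s_b of b."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)              # slope
    a = sy / n - b * sx / n                                    # intercept: Y_bar - b X_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s2_yx = sse / (n - 2)                                      # unbiased estimate of sigma^2_{y.x}
    ss_x = sxx - sx ** 2 / n                                   # sum of (x_i - x_bar)^2
    s_b = math.sqrt(s2_yx / ss_x)
    return a, b, s_b


if __name__ == "__main__":
    # Hypothetical data
    a, b, s_b = slope_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
    print(a, b, s_b, b / s_b)   # the last value is t for H0: beta = 0, with n - 2 df
```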
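Note:
\[ s_{y.x}^2 = \frac{\sum (Y_i - \hat{Y}_i)^2}{n - 2} = \frac{\sum Y^2 - a \sum Y - b \sum XY}{n - 2} \quad \text{and} \quad \sum (x - \bar{x})^2 = \sum x^2 - \frac{\left( \sum x \right)^2}{n} \]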
\[ Z = \frac{b - \beta}{\sigma_{y.x} \big/ \sqrt{\sum (x_i - \bar{x})^2}} \]
But $\sigma_{y.x}$ is generally not known; we therefore estimate it from the sample data and use Student's t-distribution rather than the normal distribution. In other words, the statistic
\[ t = \frac{b - \beta}{s_{y.x} \big/ \sqrt{\sum (x_i - \bar{x})^2}} \]
has a t-distribution with $\nu = n - 2$ degrees of freedom.
Hence a $100(1-\alpha)\%$ confidence interval for the population regression coefficient $\beta$ for a particular sample of size $n$ ($n < 30$) is given by
\[ b \pm t_{\alpha/2,(n-2)}\, \frac{s_{y.x}}{\sqrt{\sum (x_i - \bar{x})^2}} \]
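\[ t = \frac{a - \alpha}{s_{y.x} \sqrt{\dfrac{1}{n} + \dfrac{\bar{X}^2}{\sum (X_i - \bar{X})^2}}} \]
Hence the $100(1-\alpha)\%$ confidence interval for $\alpha$ when the sample size is $n$ ($n < 30$) is given by
\[ a \pm t_{\alpha/2,(n-2)}\, s_{y.x} \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}} \]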
1. Formulate the null and alternative hypotheses about $\beta$. Three possible forms are
a) $H_0: \beta = \beta_0$ and $H_1: \beta \ne \beta_0$
b) $H_0: \beta \le \beta_0$ and $H_1: \beta > \beta_0$
c) $H_0: \beta \ge \beta_0$ and $H_1: \beta < \beta_0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.05$ or $0.01$).
3. The test statistic in this case is
\[ t = \frac{b - \beta_0}{s_b} \]
6. Conclusion: Reject $H_0$ if the value of $t$(calculated) falls in the rejection region, otherwise accept it.
\[ t = \frac{a - \alpha_0}{s_a} \]
6. Conclusion: Reject $H_0$ if the value of $t$(calculated) falls in the rejection region, otherwise accept it.
r
1 r2
This is not recommended for use when $n$ is small and $\rho$ is large; the non-normal distribution of $r$ can, however, be changed by a simple transformation into an approximately normal distribution. The transformation is known as Fisher's z-transformation. The variable
\[ Z_f = \frac{1}{2} \ln \frac{1+r}{1-r} = 1.1513 \log \frac{1+r}{1-r} \]
is approximately normal with mean
\[ \mu_z = \frac{1}{2} \ln \frac{1+\rho}{1-\rho} = 1.1513 \log \frac{1+\rho}{1-\rho} \]
and standard deviation $\dfrac{1}{\sqrt{n-3}}$. Hence the statistic
\[ Z = \frac{z_f - \mu_z}{1/\sqrt{n-3}} \]
We know that the sampling distribution of $r$ is skewed when $\rho$ is not zero. However, when $\rho = 0$ the sampling distribution of $r$ is symmetric. Thus, when the random variables $x$ and $y$ are normally distributed and $\rho = 0$, the t-distribution is used and the statistic is
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
with $\nu = n - 2$ degrees of freedom.
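We know that $Z_f = 1.1513 \log \dfrac{1+r}{1-r}$ is approximately normal with mean $\mu_z = 1.1513 \log \dfrac{1+\rho}{1-\rho}$ and standard deviation $\dfrac{1}{\sqrt{n-3}}$; thus the standard normal variable is
\[ Z = \frac{z_f - \mu_z}{1/\sqrt{n-3}} \]
According to the normal distribution, the probability that a value of $Z$ will fall in the interval from $-Z_{\alpha/2}$ to $Z_{\alpha/2}$ is equal to $1 - \alpha$:
\[ P\left[ -Z_{\alpha/2} \le Z \le Z_{\alpha/2} \right] = 1 - \alpha \]
Inserting $Z$, we get
\[ P\left[ -Z_{\alpha/2} \le \frac{z_f - \mu_z}{1/\sqrt{n-3}} \le Z_{\alpha/2} \right] = 1 - \alpha \]
Multiplying by $\dfrac{1}{\sqrt{n-3}}$,
\[ P\left[ -\frac{Z_{\alpha/2}}{\sqrt{n-3}} \le z_f - \mu_z \le \frac{Z_{\alpha/2}}{\sqrt{n-3}} \right] = 1 - \alpha \]
Subtracting $z_f$, we get
\[ P\left[ -z_f - \frac{Z_{\alpha/2}}{\sqrt{n-3}} \le -\mu_z \le -z_f + \frac{Z_{\alpha/2}}{\sqrt{n-3}} \right] = 1 - \alpha \]
Multiplying by $-1$, the inequality signs are reversed:
\[ P\left[ z_f + \frac{Z_{\alpha/2}}{\sqrt{n-3}} \ge \mu_z \ge z_f - \frac{Z_{\alpha/2}}{\sqrt{n-3}} \right] = 1 - \alpha \]
which is equivalent to
\[ P\left[ z_f - \frac{Z_{\alpha/2}}{\sqrt{n-3}} \le \mu_z \le z_f + \frac{Z_{\alpha/2}}{\sqrt{n-3}} \right] = 1 - \alpha \]
Hence the $100(1-\alpha)\%$ confidence interval for $\mu_z$ is given by
\[ \left( z_f - \frac{Z_{\alpha/2}}{\sqrt{n-3}}, \ z_f + \frac{Z_{\alpha/2}}{\sqrt{n-3}} \right) \]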
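The interval for $\mu_z$ is easy to evaluate numerically. The sketch below is illustrative: the function name `fisher_z_interval` and the sample values are assumptions. It also back-transforms the limits to the correlation scale with the inverse transformation $\rho = \tanh(\mu_z)$, a step that is not spelled out in the derivation above.

```python
import math
from statistics import NormalDist


def fisher_z_interval(r, n, confidence=0.95):
    """Confidence limits for mu_z and the corresponding limits for rho.

    z_f = 0.5 * ln((1 + r) / (1 - r)), with standard deviation 1 / sqrt(n - 3).
    """
    z_f = 0.5 * math.log((1 + r) / (1 - r))
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z_crit / math.sqrt(n - 3)
    z_lo, z_hi = z_f - half_width, z_f + half_width
    # Back-transformation to the correlation scale (added for illustration)
    return (z_lo, z_hi), (math.tanh(z_lo), math.tanh(z_hi))


if __name__ == "__main__":
    # Hypothetical sample: r = 0.6 from n = 28 pairs
    (z_lo, z_hi), (rho_lo, rho_hi) = fisher_z_interval(0.6, 28)
    print(z_lo, z_hi, rho_lo, rho_hi)
```

The constant 1.1513 in the notes is $\tfrac{1}{2}\ln 10$, so `1.1513 * math.log10(...)` gives the same $z_f$ to about four decimal places.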
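Testing the Hypothesis that $\rho$ has a Specified Value other than Zero
We know that $Z_f = 1.1513 \log \dfrac{1+r}{1-r}$ is approximately normal with mean $\mu_z = 1.1513 \log \dfrac{1+\rho}{1-\rho}$ and standard deviation $\dfrac{1}{\sqrt{n-3}}$; thus the standard normal variable is
\[ Z = \frac{z_f - \mu_z}{1/\sqrt{n-3}} \]
where $z_f = 1.1513 \log \dfrac{1+r}{1-r}$ and $\mu_z = 1.1513 \log \dfrac{1+\rho_0}{1-\rho_0}$.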
6. Conclusion: Reject H0 if value of Z(calculated) falls in the rejection region otherwise accept it.
\[ z_{f1} = 1.1513 \log \frac{1 + r_1}{1 - r_1} \quad \text{and} \quad z_{f2} = 1.1513 \log \frac{1 + r_2}{1 - r_2} \]
Since $z_{f1}$ and $z_{f2}$ are approximately normally distributed, the difference $z_{f1} - z_{f2}$, if $H_0: \rho_1 = \rho_2$ is true, is approximately normally distributed with mean zero and standard deviation $\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}$, and the test statistic is
\[ Z = \frac{z_{f1} - z_{f2}}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}} \]
6. Conclusion: Reject H0 if value of Z(Calculated) falls in the rejection region otherwise accept it.
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
has a Student's t-distribution with $\nu = n - 2$ degrees of freedom, and the testing procedure is
1. Formulate the null and alternative hypotheses about $\rho$. Three possible forms are
a) $H_0: \rho = 0$ and $H_1: \rho \ne 0$
b) $H_0: \rho \le 0$ and $H_1: \rho > 0$
c) $H_0: \rho \ge 0$ and $H_1: \rho < 0$
2. Decide the level of significance $\alpha$ (take $\alpha = 0.05$ or $0.01$).
3. The test statistic in this case is
\[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \]
4. Calculate the value of t from the sample data.
5. Determine the rejection region, which depends on the alternative hypothesis and can be described as
When the Alternative Hypothesis is
a) $H_1: \rho \ne 0$ (two-sided)
b) $H_1: \rho < 0$ (one-sided)
c) $H_1: \rho > 0$ (one-sided)
6. Conclusion: Reject H0 if value of t(calculated) falls in the rejection region otherwise accept it.
Chapter 8
Vital Statistics
Vital Events
Certain factors cause changes in the size and composition of a human population; such factors are called vital events. Examples are births, deaths, migrations, marriages, divorces, sickness, adoptions etc.
Vital Statistics
The collection, presentation and analysis of vital events constitute vital statistics. Vital statistics includes the whole study of man and throws light on various social and medical problems.
Factors which change the size of population
Birth
Death
Migrations
Census
A complete count of the population at a fixed point in time is called a census. In Pakistan the 1st census was held in 1951, the 2nd in 1961, the 3rd in 1972, the 4th in 1981 and the 5th in 1998.
Registration
Keeping a record of all vital events such as births, deaths, still births, marriages and divorces is known as registration. It does not give the vital statistics but provides us the composition of the population by categories, e.g. age and sex.
Q: How will you describe the registration system of births and deaths in Pakistan?
Ans: In a registration system all events such as births, deaths, still births, marriages and divorces are recorded. The registration of births and deaths is carried out in Pakistan as under.
1. In Rural Areas
The registration of births and deaths in rural areas is carried out under the Basic Democracies Order of 1959, which places this duty on the union council. The process is as under
The copies of the birth and death records from rural and urban areas are sent to the Divisional Health Directorates and then to the Provincial Health Directorates. The annual statements from the Provincial Health Directorates are sent to the Director General Health, Government of Pakistan.
1. Ratio
The ratio of one number $a$ to another number $c$ is defined as $a$ divided by $c$. It indicates the relative size of the two numbers, where $a$ and $c$ represent separate and distinct categories. In vital statistics a ratio expresses the relation of a given kind of event to the occurrence of other events, or of one kind of data to another. Thus
\[ \text{Ratio} = \frac{a}{c} \]
where $a$ denotes the number of times the given kind of event occurs and $c$ denotes the number of times another event occurs.
Vital ratios are usually multiplied by 100 for ease in understanding and recording.
2. Rate
A rate is a type of ratio, which in vital statistics may be defined as a numerical proportion of the number of vital events to the population in which the events took place. In other words,
\[ \text{Rate} = \frac{a}{a+b} \]
where $a$ stands for the number of times the given vital event occurs and $b$ denotes the number of times the event does not occur.
Vital rates are usually multiplied by 1000 for ease in understanding and recording.
Vital Ratios
There are several ratios which are used in vital statistics, depending upon the need of the study. The commonly used vital ratios are
1. Sex Ratio
2. Child-Women Ratio
3. Birth-Death Ratio/Vital Index
Sex Ratio
The ratio between males and females in a population is called the sex ratio. It is computed by dividing the number of males in a population by the number of females in the same population, and the result is expressed as a percentage. In other words
\[ \text{Sex Ratio} = \frac{\text{Number of Males}}{\text{Number of Females}} \times 100 \]
Interpretation
The sex ratio indicates the number of males per 100 females.
A sex ratio of more than 100 indicates that there are more men than women in the population.
A sex ratio of less than 100 indicates that there are fewer men than women in the population.
A sex ratio of 100 indicates that the numbers of men and women in the population are equal.
Sometimes we are interested in the sex ratio of a portion of the population. For example, the sex ratio at birth describes the sex composition of the live births at a specified time. It is given by
\[ \text{Sex Ratio at Birth} = \frac{B_m}{B_f} \times 100 \]
where
$B_m$ = number of male live-births
$B_f$ = number of female live-births.
Child-Women Ratio
The ratio between children under 5 years of age and women of child-bearing age is called the child-women
ratio. The child-bearing age is sometimes defined by the age group 15-44 and sometimes by the age group
15-49. The child-women ratio is computed by the formula
$$\text{Child-Women Ratio} = \frac{P_{0-4}}{f_{15-44}} \times 100$$
where
P_{0-4} denotes the number of children of both sexes (male and female) combined under 5 years of age, and
f_{15-44} denotes the number of females (women) aged 15-44 (or f_{15-49} when the age group 15-49 is used).
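A minimal Python sketch of the child-women ratio follows. It uses the multiplier 100 given in the formula above; the function name and the counts are hypothetical.

```python
def child_women_ratio(children_under_5: int, women_child_bearing_age: int) -> float:
    """Children under 5 (both sexes) per 100 women of child-bearing age."""
    if women_child_bearing_age <= 0:
        raise ValueError("number of women must be positive")
    return children_under_5 * 100 / women_child_bearing_age


# Hypothetical counts: 1,200 children under 5 and 4,000 women aged 15-44.
print(child_women_ratio(1200, 4000))  # 30.0
```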
Interpretation
A vital index of more than 100 indicates that the population is increasing and is in a healthy condition.
A vital index of less than 100 indicates that the population is decreasing.
A vital index of exactly 100 indicates that the population is stable.
When total population is not available then the following formula is used.
$$P_n = P_0 (1 + r)^n, \qquad r = \left(\frac{P_n}{P_0}\right)^{1/n} - 1$$
where
P_0 denotes the population at the beginning of the period (decade),
P_n denotes the population after n years,
n denotes the intercensal period, and
r denotes the unknown rate of change of population.
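The growth-rate formula can be checked numerically with a short Python sketch. The function name and the census figures are hypothetical; the sketch assumes compound growth, P_n = P_0 (1 + r)^n.

```python
def intercensal_growth_rate(p0: float, pn: float, n: float) -> float:
    """Solve P_n = P_0 * (1 + r)**n for r."""
    if p0 <= 0 or pn <= 0 or n <= 0:
        raise ValueError("p0, pn and n must all be positive")
    return (pn / p0) ** (1 / n) - 1


# Hypothetical censuses: 1,000,000 persons growing to 1,210,000 over n = 2 years.
r = intercensal_growth_rate(1_000_000, 1_210_000, 2)
print(round(r, 4))  # 0.1, i.e. about 10% per year

# Round trip: applying r for n years should approximately reproduce P_n.
print(round(1_000_000 * (1 + r) ** 2))  # 1210000
```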
2. Birth Rates or Natality Rates
The commonly used birth rates are
(a) Crude Birth Rate
(b) Specific Birth Rate
(c) Standardized Birth Rate
3. Reproduction Rates
There are two main types of such rates
(a) Gross Reproduction Rate
(b) Net Reproduction Rate
4. Morbidity Rates or Sickness Rates
5. Marriage Rates
6. Divorce Rates
$$\text{C.D.R} = \frac{D}{P} \times 1000$$
where C.D.R stands for crude death rate, D denotes the total number of deaths from all causes during a
calendar year, and
P denotes the midyear total population (which is taken as an estimate of the average population during the
whole calendar year) during the same year.
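A minimal Python sketch of the crude death rate follows, assuming the usual scaling of vital rates by 1000. The function name and the figures are hypothetical.

```python
def crude_death_rate(deaths: int, midyear_population: int) -> float:
    """Deaths from all causes per 1000 of the midyear population."""
    if midyear_population <= 0:
        raise ValueError("midyear population must be positive")
    return deaths * 1000 / midyear_population


# Hypothetical year: 12,000 deaths in a midyear population of 1,500,000.
print(crude_death_rate(12_000, 1_500_000))  # 8.0 deaths per 1000
```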
Advantages
The crude death rate is perhaps the most widely used vital rate because it is easily understood and quickly
computed. It is used to measure the probability of dying of a person in the population.
Disadvantage
It is a well-known fact that mortality varies with age, sex, race and occupation, but the crude death rate ignores all
these factors. It can therefore give misleading results and should not be used for comparison between areas.
$$\frac{d_i}{P_i} \times 1000$$
$$\frac{d_0}{B} \times 1000$$
Direct Method
The method in which the death rate is obtained by calculating the ratio between the expected deaths in the
standard population and the total standard population.
$$\text{S.D.R} = \frac{\sum \frac{d_i}{p_i} P_i}{\sum P_i} \times 1000$$
where
\frac{d_i}{p_i} P_i is the expected number of deaths in the standard population for the ith age group,
d_i = Number of deaths in the actual population of the ith age group,
p_i = Midyear actual population of the ith age group,
P_i = Midyear standard population of the ith age group.
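The direct method can be sketched in Python as follows. The function name, the two age groups and all figures are hypothetical, and the result is scaled by 1000 on the assumption that the rate is expressed per 1000 like the other vital rates.

```python
from typing import Sequence


def sdr_direct(d: Sequence[float], p: Sequence[float], P: Sequence[float]) -> float:
    """Directly standardized death rate per 1000.

    d[i]: deaths in the actual population, ith age group
    p[i]: midyear actual population, ith age group
    P[i]: midyear standard population, ith age group
    """
    if not (len(d) == len(p) == len(P)):
        raise ValueError("all age-group lists must have the same length")
    # Expected deaths if the standard population had the actual age-specific rates.
    expected = sum(di * Pi / pi for di, pi, Pi in zip(d, p, P))
    return expected * 1000 / sum(P)


# Hypothetical data for two age groups.
d = [10, 60]        # deaths in the actual population
p = [1000, 2000]    # actual midyear population
P = [3000, 1000]    # standard midyear population
print(sdr_direct(d, p, P))  # 15.0
```

With these figures the crude death rate of the actual population is about 23.3 per 1000, while the standardized rate is 15.0, because the standard population has relatively more people in the low-mortality age group.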
Indirect Method
The method in which the death rate is obtained by multiplying the crude death rate of the standard population by the
ratio between the actual deaths in the actual population and the expected deaths in the actual population.
$$\text{S.D.R} = \left(\frac{\sum D_i}{\sum P_i} \times 1000\right) \times \frac{\sum d_i}{\sum \frac{D_i}{P_i} p_i}$$
where
\frac{D_i}{P_i} p_i = Expected number of deaths in the actual population for the ith age group,
d_i = Number of deaths in the actual population of the ith age group,
p_i = Midyear actual population of the ith age group,
P_i = Midyear standard population of the ith age group,
D_i = Number of deaths in the standard population of the ith age group.
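The indirect method, as described in words above, can be sketched in Python. The function name and all figures are hypothetical; the sketch assumes the crude death rate of the standard population is expressed per 1000.

```python
from typing import Sequence


def sdr_indirect(
    d: Sequence[float], p: Sequence[float], D: Sequence[float], P: Sequence[float]
) -> float:
    """Indirectly standardized death rate per 1000.

    d[i], p[i]: deaths and midyear population, actual population, ith age group
    D[i], P[i]: deaths and midyear population, standard population, ith age group
    """
    if not (len(d) == len(p) == len(D) == len(P)):
        raise ValueError("all age-group lists must have the same length")
    # Crude death rate of the standard population, per 1000.
    cdr_standard = sum(D) * 1000 / sum(P)
    # Deaths expected in the actual population under the standard age-specific rates.
    expected = sum(Di * pi / Pi for Di, pi, Pi in zip(D, p, P))
    return cdr_standard * sum(d) / expected


# Hypothetical data for two age groups.
d = [10, 60]        # deaths, actual population
p = [1000, 2000]    # midyear actual population
D = [60, 40]        # deaths, standard population
P = [3000, 1000]    # midyear standard population
print(sdr_indirect(d, p, D, P))  # 17.5
```

Here the standard population has a crude death rate of 25 per 1000, 100 deaths would be expected in the actual population, and 70 were observed, so the standardized rate is 25 × 70/100 = 17.5.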
$$\text{S.D.R} = \frac{\sum (d_{im} + d_{if})}{\sum \frac{D_{im}}{P_{im}} p_{im} + \sum \frac{D_{if}}{P_{if}} p_{if}} \times 1000$$
Note: If the population is changing slowly then both the direct and indirect methods give the same results. Standardized
death rates are used when two populations are equivalent in their age distribution but differ in occupation,
climate, sex and number of people, especially in the earliest and latest age groups.
$$\frac{B}{P} \times 1000$$
$$\frac{B_i}{P_{if}} \times 1000$$
$$\frac{B}{P_{if}} \times 1000$$
$$\frac{B_i}{P_{if}} \times 1000$$
Where Bi denotes the number of live births occurring to mothers of the ith age-group during a year,
Pif denotes the midyear female population of the same age-group during the same year.
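Using the symbols B_i and P_if defined above, age-specific birth rates per 1000 women can be sketched in Python. The function name, the age groups and the figures are hypothetical.

```python
from typing import Dict, Tuple


def age_specific_birth_rates(
    data: Dict[str, Tuple[int, int]]
) -> Dict[str, float]:
    """Map each age group to B_i / P_if * 1000.

    `data` maps an age-group label to (live births B_i, midyear female population P_if).
    """
    rates = {}
    for group, (births, females) in data.items():
        if females <= 0:
            raise ValueError(f"female population must be positive for group {group}")
        rates[group] = births * 1000 / females
    return rates


# Hypothetical data for two age groups of mothers.
rates = age_specific_birth_rates({"20-24": (300, 5000), "25-29": (500, 4000)})
print(rates)  # {'20-24': 60.0, '25-29': 125.0}
```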
Note
The terms age-specific birth rates and the age-specific fertility rates are used interchangeably.
Pif denotes the midyear female population in the same age-group during the same year.
Reproduction Rates
The reproduction rate gives an indication of the number of females which a female will produce
over her child-bearing age to replace herself.
There are two types of reproduction rate.
1. Gross Reproduction Rate
2. Net Reproduction Rate
Purposes of N.R.R
N.R.R (net reproduction rate) is used to measure how the female population is replacing itself.
1. If N.R.R = 1, the number of potential mothers remains the same, hence the population is stable.
2. If N.R.R > 1, the number of potential mothers is increasing and as a result the population is increasing.
3. If N.R.R < 1, the number of potential mothers is decreasing and as a result the population is decreasing.
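The three cases above can be expressed as a small Python sketch. The function name and the tolerance parameter `tol` are hypothetical; the tolerance is only an assumption for comparing a computed rate with 1.

```python
def interpret_nrr(nrr: float, tol: float = 1e-9) -> str:
    """Classify the population trend implied by a net reproduction rate."""
    if abs(nrr - 1.0) <= tol:
        return "stable"
    return "increasing" if nrr > 1.0 else "decreasing"


# Hypothetical values of the net reproduction rate.
for value in (0.85, 1.0, 1.3):
    print(value, interpret_nrr(value))
```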