SAMPLING THEORY

Kaustav Banerjee
Decision Sciences Area, IIM Lucknow

1 Sample and population


Say you are cooking rice. How do you know whether the rice is ready? You take a sample of rice
from the bowl and check. With this example, let us find answers to these questions:
Q: Why do you need a sample?
A: A sample is required to have an educated guess about some aspects of the population.
Note, it is impossible to assess if each of the rice grains is ready to eat. A sample gives some
idea about the feature of the population you are interested in.
Q: What do you conclude by checking the sample?
A: Typically you think: the rice is ready. Notice that this conclusion applies to all the rice in the
bowl and is not limited to the sample alone. Therefore, though you check the sample, your
conclusion applies to the population at large. This is a ‘leap of faith’ or an ‘inductive leap’
in the process of learning about the population: a leap from the particular to the general. Yes, there
is risk involved in this kind of decision making. The good news is that we can quantify this risk,
provided we select the sample judiciously or randomly.
Suppose the question is: what is the average number of hours per day that a college student in
Lucknow spent on social networking sites in 2016? Consider this question to understand the
following terms.
Population: all college students in Lucknow. In other words, collection of all students covered
by the inquiry. 
Sampling units: each student could be a sampling unit, when a sample of students is to be
selected from the population. However, a more efficient process may be to select a sample of
colleges and then collect data from each student of the selected colleges. In the latter case
the sampling unit is a college. Alternatively, after selecting a sample of colleges from all the
colleges in Lucknow, if a sample of students is selected from each selected college, then the
colleges are the first stage sampling units and the students are the second stage sampling units.
Sampling frame: list of all college students in Lucknow or list of all colleges in Lucknow. In
other words, the list of all sampling units in the population. 
Sample: this is a collection of sampling units drawn from a frame or frames. A sample could
be drawn according to convenience or could be drawn in such a way that every possible sample
has an equal chance of being selected. The former is called a convenience sample and the latter
is called a random sample. A random sample is often preferred because it avoids bias in
selection and usually results in a representative sample. A convenience sample in this example
could be a sample of students from the nearby colleges. For drawing a random sample, on
the other hand, a list of all college students (a sampling frame) needs to be created and then,
using some random mechanism, a sample of students is selected from the list.
Parameter: a parameter is a population characteristic of our interest. In our example it
is the average based on data from all college students in Lucknow. It is usually a fixed
unknown number. Other population characteristics that may be of our interest are the standard
deviation of the times spent by the students in the population, and the percentage of students
in the population spending more than 4 hours, etc. 
Statistic: a statistic is a sample analogue of the parameter, like the sample average or sample
standard deviation. A statistic is calculated on the basis of sample observations. In the
context of our example, the average and standard deviation of the times spent on social
networking sites by the selected students, and the percentage of selected students spending
more than four hours on such sites, are examples of statistics. For estimating the unknown
parameter, the sample value of the associated statistic is used. For example, in estimating the
population average, the value of the sample average observed for the selected sample is used.

1.1 Statistic and parameter: a short evaluation


For the sake of simplicity, suppose there are only 10 college students in Lucknow, who constitute the
entire population, and we have data on the number of hours per day each of them spends on
social networking sites. The following table summarizes the results. Note that here proportion refers
to the proportion of students spending more than 4 hours on social networking.

Population  Parameter         Sample 1  Statistic 1       Sample 2  Statistic 2

5.7         average: 3.89     5.7       average: 3.58     5.7       average: 3.86
5.8         sd: 1.48          –         sd: 1.43          –         sd: 1.83
3.4         median: 3.6       3.4       median: 3.4       –         median: 2.8
2.6         range: 3.8        –         range: 3.5        2.6       range: 3.8
2.4         proportion: 0.4   2.4       proportion: 0.4   –         proportion: 0.4
6.0                           –                           6.0
2.2                           2.2                         2.2
4.2                           4.2                         –
2.8                           –                           2.8
3.8                           –                           –

The first column gives the number of hours; the second column gives five parameter values obtained
on the basis of all 10 students. Next we select a random sample of 5 students, listed as
Sample 1, and obtain the same quantities on the basis of the selected students, reported in
Statistic 1. Taking a fresh sample in Sample 2, we repeat the exercise, reported in Statistic 2.
The message is that parameters remain constant, as they are based on all the sampling units
of the population. A statistic, however, is a sample/data-driven quantity: its value varies with
the sample.
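As a rough check, the entries of the table can be recomputed in R. This is only a sketch: the helper name describe is hypothetical (it is not part of these notes), and it assumes that sd is computed with the divisor one less than the number of observations, which is what R's sd() does and which agrees with the tabulated values.

```r
# Hours per day for the 10 students (first column of the table)
population <- c(5.7, 5.8, 3.4, 2.6, 2.4, 6.0, 2.2, 4.2, 2.8, 3.8)

# The two samples of size 5, read off the table
sample1 <- c(5.7, 3.4, 2.4, 2.2, 4.2)
sample2 <- c(5.7, 2.6, 6.0, 2.2, 2.8)

# describe() is a hypothetical helper introduced for illustration only
describe <- function(x) {
  c(average    = mean(x),
    sd         = sd(x),             # divisor is length(x) - 1
    median     = median(x),
    range      = max(x) - min(x),
    proportion = mean(x > 4))       # share spending more than 4 hours
}

# One row per column pair of the table
round(rbind(parameter  = describe(population),
            statistic1 = describe(sample1),
            statistic2 = describe(sample2)), 2)
```

The parameter row stays the same however often the code is run, while the statistic rows change as soon as different samples are supplied.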

2 Collection of Data
Two methods are usually followed in the collection of data for research: conducting
experiments, or conducting surveys (often using data already collected by some agency, known as
syndicated data). The former is often called data collected through an experimental study and the
latter, data collected through an observational study.
We now consider collection of data through surveys. When the data are collected from each
and every sampling unit of the population, the method is known as complete enumeration or the
census method of data collection. Naturally, if the data are collected through the census method,
theoretically one can find the true value of the parameter. However, this statement rests on an
implicit assumption that rarely holds and is very far from what happens in reality: that the
collected data are free from errors, which is never (!!!) true.
On the other hand, in estimation through sample surveys we only have access to
the selected sample observations, which are a part of the whole population. The other part
of the population remains unobserved. Thus the value of the sample statistic, which we use
as an estimate for the unknown value of the parameter, could be near or far from the true
value of the parameter depending on whether the sample is a good or a bad representation of
the population. So one of the important issues in a sample survey is to select a sample which
represents the population well, so that an estimate with good accuracy is obtained.
Better representation, and hence more accuracy, could be achieved by (i) increasing the sample
size or (ii) adopting a better method of selecting the sample, that is, using a better sampling
design. The representation, and hence accuracy, also depends on (iii) how homogeneous the
population is. But controlling homogeneity is beyond our reach.

What do you think? The population of Switzerland in 2012 was about 8
million, and the population of China was about 1,400 million. So a random sample
of the Chinese population should be larger than that of the Swiss population.

2.1 Sampling and non-sampling error


In any survey the estimates are subject to two types of error: sampling error and non-sampling
error. The sampling error arises because of sampling, in other words, because we observe a part of
the population instead of the whole population. The more representative the sample is of the
population, the smaller the sampling error.
On the other hand, non-sampling error arises due to all factors other than sampling.
It could be due to errors of coverage. This error arises when the list (like the telephone
directory or a list of e-mail addresses) that is used for drawing the sample does not match up
perfectly with the sampling frame of the target population.
It could be due to non-response. This is considered to be one of the most serious and
frequently occurring errors in any survey. In a personal interview this arises in one of three
ways: the inability to contact the sampled unit (a person, a household, an organization etc.;
in actual surveys substituting a next door neighbour is common but is not a good idea),
the inability of the respondent to come up with an answer to the question of interest (for
example, asking someone who may not have any clue about the impact of a policy decision),
or refusal to answer (could be because of fear or an intention not to divulge). A good survey
should attempt to obtain some information about the group of non-respondents in order to
understand how different or similar they are, as a group, from the group of respondents.
Besides these, there could be errors of observation. These may be due to the respondent’s
reporting error: the respondent may simply not remember correctly, or may
not understand the question properly, as when the head of a household is asked the number of
literates in the household (the meaning of literate may not be clear to the respondent).
Besides the above, the errors could be due to the inability of the interviewers to obtain honest
responses, to bad design of the questionnaire (it has been observed that the ordering
and wording of questions, and whether a question is open ended or close
ended, lead to a lot of variation in responses), to coding errors etc. So any kind of
error besides sampling error is known as non-sampling error.
It is believed that for a moderately large sample survey the non-sampling error contributes
around 70-80% of the total error. Finally, the non-sampling error increases with the increase
in sample size.

What do you think? With reference to the above discussion, why do you
think sample survey could be a better choice than census?

3 Finite population: simple random sampling


There are two kinds of random sampling used for finite population: simple random sampling
with replacement (SRSWR) and simple random sampling without replacement (SRSWOR).
For all practical purposes SRSWOR is preferred to SRSWR but in some situations like random
number generation we need to use SRSWR. Let’s discuss how to draw a random sample of size
10 from a class of 60 students.
1. Assign numbers 01 to 60 to the students. Consider a random number generation mechanism
that selects one of the digits 0 to 9 with equal probability, i.e., 1/10.
2. Select two digits using this mechanism. If they give a two-digit number between 01 and 60
(say 10), select the student with that roll number. If the number is either 00 or between 61
and 99, reject it and select again until a number between 01 and 60 is obtained. Prove
that by this method the chance of selecting a given student in the first draw is 1/60.
3. Repeat step 2 until you draw nine more two-digit numbers between 01 and 60. The students
with the corresponding roll numbers are selected. Notice that in this way you may select a student
more than once. So if you draw a simple random sample with replacement, a student
could be selected more than once. However, to draw a simple random sample without
replacement you need to repeat step 2 until you select 9 more distinct two-digit numbers.
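The digit-by-digit procedure above can be sketched in R as follows. This is an illustrative sketch only: the function name draw_two_digit and the seed are hypothetical choices, and R's sample() stands in for the random digit mechanism.

```r
set.seed(2016)  # arbitrary seed, only to make the run reproducible

# Hypothetical helper: draw two random digits, reject 00 and 61-99
draw_two_digit <- function() {
  repeat {
    digits <- sample(0:9, size = 2, replace = TRUE)  # each digit has chance 1/10
    number <- 10 * digits[1] + digits[2]             # 00, 01, ..., 99
    if (number >= 1 && number <= 60) return(number)
  }
}

# SRSWR: ten accepted draws, repeats allowed
srswr <- replicate(10, draw_two_digit())

# SRSWOR: keep drawing until ten distinct roll numbers are obtained
srswor <- integer(0)
while (length(srswor) < 10) {
  candidate <- draw_two_digit()
  if (!(candidate %in% srswor)) srswor <- c(srswor, candidate)
}

srswr
srswor
```

In practice sample(1:60, 10, replace = TRUE) or sample(1:60, 10) would do the same job directly; the loop is written out only to mirror the steps in the text.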

Q1 Following SRSWR, what is the probability of drawing a particular sequence of n (say
n = 10) units in order, out of N (say N = 60) units in the population?
Q2 Following SRSWOR, what is the probability of drawing a particular sequence of n (say
n = 10) units in order, out of N (say N = 60) units in the population?
Q3 In SRSWR, what is the probability that any particular unit (say the second unit) is
present in a sample of size n, drawn from a population of N units?
Q4 In SRSWOR, what is the probability that any particular unit (say the second unit) is
present in a sample of size n, drawn from a population of N units?
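For checking your own answers to Q1 to Q4, the standard results are collected below without proof. They assume that at every draw each unit still available is equally likely to be picked, and in Q2 that the sequence consists of distinct units.

```latex
\begin{aligned}
\text{Q1 (SRSWR):}\quad  & P(\text{given ordered sequence}) = \frac{1}{N^{n}} \\
\text{Q2 (SRSWOR):}\quad & P(\text{given ordered sequence}) = \frac{1}{N(N-1)\cdots(N-n+1)} \\
\text{Q3 (SRSWR):}\quad  & P(\text{given unit is in the sample}) = 1 - \left(1 - \frac{1}{N}\right)^{n} \\
\text{Q4 (SRSWOR):}\quad & P(\text{given unit is in the sample}) = \frac{n}{N}
\end{aligned}
```

With N = 60 and n = 10, Q4 gives 1/6 and Q3 gives roughly 0.155, slightly less because repeated selections waste some draws.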
You can use an EXCEL function like randbetween(min, max) to generate random numbers
between min and max. A random number table is a sequence of digits generated using a
mechanism as discussed above, so that in the long run the table contains all the digits 0,1,2,...,9
in approximately equal proportions, with no trends in the pattern in which the digits are
generated. Thus if a digit is drawn at random from the random number table, the chance of
getting any particular digit is 1/10.

Simple Random Sample (Finite Population): A simple random sample
of size n from a finite population of size N is a sample selected such that each
possible sample of size n has the same probability of being selected.

3.1 R code for sampling


> population = c(5.7, 5.8, 3.4, 2.6, 2.4, 6.0, 2.2, 4.2, 2.8, 3.8)
> population
[1] 5.7 5.8 3.4 2.6 2.4 6.0 2.2 4.2 2.8 3.8
> srswor = sample(population, size = 5, replace = FALSE)
> srswor
[1] 5.7 2.8 2.2 3.4 3.8
> srswr = sample(population, size = 5, replace = TRUE)
> srswr
[1] 3.8 2.6 2.6 4.2 2.4

4 Infinite population: random sampling


Truly speaking there is nothing like an infinite population but often the population is either
very large or hypothetical (i.e., non-existent) so that sampling frame is not available. In such
cases, the sampling units should be selected fulfilling the following two conditions.

Random Sample (Infinite Population): A random sample of size n drawn
from an infinite population is a sample selected such that:
1. Each sampling unit should be a member of the target population.
2. The sampling units should be selected independently of each other.

For example, consider the problem of drawing a random sample of customers of Flipkart from
its customer base. Flipkart’s customer base is not only very large but dynamic too. For all
practical purposes the population could be approximately considered as an infinite population.
If Flipkart selects a sample of customers by picking up a purchaser every 5 seconds during a
grand sale of 36 hours, the customers in the sample could be dependent, because there is a
possibility that the customers could exhibit similar buying behaviour. One should take care
to avoid dependence whenever domain knowledge suggests it may be present.
Question: Does it intuitively make sense to assume that sampling from an infinite population
is equivalent to sampling from a finite population with replacement?

5 The sample mean and sample proportion


5.1 Sampling distribution of sample mean
Example: Suppose a circus owner had 5 crocodiles to ship from Chennai to Mumbai. The
shipping company agreed to ship but would charge Rs. 20,000 per 100 kg. Naturally, they
need to know the total weight of all five crocodiles. Weighing a crocodile is difficult and at the
same time expensive too. Let the crocodiles be Jumbo (J), Kambo (K), Lambo (L), Mambo
(M) and Shambo (S). They hired a statistician for estimating the total weight by weighing two
crocodiles only. The statistician proposed the following procedure: (i) select two crocodiles at
random without replacement, (ii) weigh them, find the mean weight and multiply it by 5.
By following the statistician’s procedure the total weight came out as 1750 kg. The
manager of the shipping company was not happy with the estimate. After observing the size of
the crocodiles, the manager felt from his experience that the estimate was very low. The
statistician, however, claimed that his estimate is unbiased and that, if the distribution of weight
could be assumed to be normal, it is actually the best among all unbiased estimates.
There was a guy who helped the company in the past for weighing crocodiles. He could
measure the weight of a crocodile by measuring its length and knowing its age. His error in
estimation was always within 10 kgs. The manager called the guy. His estimates of weights
(kg) were: 1000 (J), 600(K), 500 (L), 400 (M) and 300 (S). Assuming these weights to be
correct, find the following:
1. the mean weight of the crocodiles and hence the total weight.
2. the standard deviation of weights.
3. write down all possible samples of size two.
4. for each sample find sample mean and the probability distribution of sample mean.
5. the mean and variance of the distribution of sample mean.
Match the values of mean and standard deviation of the probability distribution of sample
mean with the values that you get directly by using the following formula.

Sample Mean $\bar{X} = n^{-1} \sum_{i=1}^{n} X_i$ under Finite Population

\[
E(\bar{X}) = \mu; \qquad SD(\bar{X}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N-n}{N-1}} \quad \text{(SRSWOR)}
\]

where $\mu$ is the population mean, $\sigma$ is the population standard deviation, $n$ is
the sample size and $N$ is the population size.
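One way to carry out the matching exercise for the crocodile example is to enumerate all samples in R. The sketch below is not a definitive solution; it assumes the five estimated weights are correct and that σ in the formula is the population standard deviation with divisor N (not N − 1). The variable names are illustrative.

```r
# Weights (kg) as estimated in the example, assumed to be correct
weights <- c(J = 1000, K = 600, L = 500, M = 400, S = 300)
N <- length(weights)
n <- 2

mu    <- mean(weights)
sigma <- sqrt(mean((weights - mu)^2))   # population SD, divisor N

# All choose(5, 2) = 10 SRSWOR samples, each with probability 1/10
samples      <- combn(names(weights), n)
sample_means <- apply(samples, 2, function(s) mean(weights[s]))
names(sample_means) <- apply(samples, 2, paste, collapse = "")
sample_means

# Mean and SD of the sampling distribution, by direct enumeration
mean_of_means <- mean(sample_means)
sd_of_means   <- sqrt(mean((sample_means - mean_of_means)^2))

# SD from the SRSWOR formula with the finite population correction
sd_formula <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))

c(mu = mu, mean_of_means = mean_of_means,
  sd_of_means = sd_of_means, sd_formula = sd_formula)
```

The enumeration also shows that the statistician's figure of 1750 kg corresponds to the one sample with the smallest mean, which is why an unbiased procedure can still give a poor estimate in a single draw.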

1. Notice that for SRSWR and SRSWOR the sample mean and the sample proportion are
random variables and hence they have a probability distribution.
2. Do you notice that the accuracy of the sample mean does not depend on the population size
if the sampling fraction is very small? Usually if it is less than 0.05, the finite population
correction (fpc) $\sqrt{(N-n)/(N-1)}$ is taken as 1. Do you think the above fact is
counter-intuitive?
3. As an implication of the above formulas one could very nicely interpret the impact of
sample size, of population heterogeneity and the role of sampling fraction f = n/N on
accuracy of sample mean as an estimator of population mean. Please explain.
4. For SRSWR what alterations would be required to the formula?
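To get a feel for the finite population correction, it can be evaluated for a few population sizes. The sketch below uses the rounded population figures quoted earlier for Switzerland and China; the sample size of 1000 is a hypothetical choice made only for illustration.

```r
# Finite population correction as it enters SD(X-bar) under SRSWOR
fpc <- function(N, n) sqrt((N - n) / (N - 1))

n <- 1000                      # hypothetical sample size

fpc(N = 8e6,   n = n)          # Switzerland, about 8 million
fpc(N = 1.4e9, n = n)          # China, about 1,400 million
fpc(N = 60,    n = 10)         # the class of 60 students, f = 1/6

# Sampling fractions for comparison
c(switzerland = n / 8e6, china = n / 1.4e9, class = 10 / 60)
```

For both countries the correction is practically 1, so a sample of the same size gives essentially the same accuracy for a given σ; only in the small class does the correction matter.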

5.2 Sampling distribution of sample proportion


Example: A statistician who belonged to a group of rebels was taken prisoner by
the army of king Juna and produced before the king. The king offered to let him play a game
of chance. The game is as follows: six identical bags of coins labelled B1 to B6 would be
placed before him. Each bag contains either only gold coins or only silver coins. The statistician
would be allowed to pick two bags at random and to observe their contents.
Based on this information he will have to predict the number of bags containing only gold
coins. Naturally, the predicted value would be six times the proportion of bags containing gold
in the sample. If the predicted number is correct he will be freed; if he errs by 1 bag, he
will be imprisoned for 5 years; and if he errs by more than 1 bag, he will be executed. Suppose
the king ordered two bags of gold coins and four bags of silver coins to be kept. Based on this
information find the following:
1. write down all possible choices of two bags (sample size 2) the statistician could make.
2. for each sample find the proportion of bags having gold coin only. Find the probability
distribution of sample proportion of bags containing only gold coin.
3. find the estimate of the number of bags out of the six containing only gold coin.
4. find also the probabilities of the statistician getting free and getting executed.
5. find the mean and standard deviation of the sampling distribution of sample proportion.
Match your result with the results that could be directly obtained from the following formula.

Sample Proportion $\bar{p}$ under Finite Population

\[
E(\bar{p}) = p; \qquad SD(\bar{p}) = \sqrt{\frac{p(1-p)}{n}} \sqrt{\frac{N-n}{N-1}} \quad \text{(SRSWOR)}
\]

where $p$ is the population proportion, $\bar{p}$ is the sample proportion, $n$ is the
sample size and $N$ is the population size.
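The matching exercise for the bags can be sketched in R by listing all possible samples. The assignment of gold to B1 and B2 below is an arbitrary labelling assumed for illustration (the example only says that two of the six bags hold gold), and the variable names are ours.

```r
# 1 = gold, 0 = silver; which two bags hold gold is an arbitrary assumption
bags <- c(B1 = 1, B2 = 1, B3 = 0, B4 = 0, B5 = 0, B6 = 0)
N <- length(bags)
n <- 2
p <- mean(bags)                         # population proportion of gold bags

# All choose(6, 2) = 15 equally likely samples of two bags
samples <- combn(names(bags), n)
p_bar   <- apply(samples, 2, function(s) mean(bags[s]))

# Probability distribution of the sample proportion
dist <- table(p_bar) / ncol(samples)
dist

# Mean and SD by enumeration, and SD from the SRSWOR formula
mean_p_bar <- mean(p_bar)
sd_p_bar   <- sqrt(mean((p_bar - mean_p_bar)^2))
sd_formula <- sqrt(p * (1 - p) / n) * sqrt((N - n) / (N - 1))
c(p = p, mean_p_bar = mean_p_bar, sd_p_bar = sd_p_bar, sd_formula = sd_formula)

# Prediction error in number of bags: predicted 6 * p_bar versus the true 2
error      <- abs(N * p_bar - sum(bags))
p_freed    <- mean(error == 0)
p_executed <- mean(error > 1)
c(p_freed = p_freed, p_executed = p_executed)
```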

1. As an implication of the above formulas one could very nicely interpret the impact of
sample size, of population heterogeneity and the role of sampling fraction f = n/N on
accuracy of sample proportion as an estimator of population proportion. Please explain.
2. For SRSWR what alterations would be required to the formula?

5.3 Sampling from an Arbitrary Population
Example: As a promotion strategy of its brand a cell phone company decides to offer a
discount of either Rs. 5000 or Rs. 3000 or Rs. 2000 to the first 10000 customers e-ordering a
particular model on its website. The price of the phone is Rs. 10000. As soon as a customer
places an order the discount amount will be flashed and will be deducted from the price. To
decide on the discount to be offered to a customer the company decides to use the following
random mechanism. With the placing of an order, a digit between 0 and 9 will be selected at
random. If the chosen random digit is either 0 or 1, the offered discount will be Rs. 5000; if
it is between 2 and 4, it will be Rs. 3000; otherwise it will be Rs. 2000.
1. Suppose a customer (among the first 10000) places an order for a phone. Find the
probability distribution of the price of the phone for the customer and also find its mean
and standard deviation.
2. Suppose a customer (among the first 9000) decides to place order for two such phones,
then find the probability distribution of total (average) price of two phones for the
customer. Also find its mean and standard deviation.
3. If a customer (a local shop owner, among the first 5000 customers) places an order of
40 cell phones then find the mean and standard deviation of the total (average) price
of the phones. Find an approximation to its probability distribution and then find the
probability that the average price is (i) less than or equal to Rs. 6000, (ii) more than Rs. 7000
and (iii) between Rs. 6000 and Rs. 8000.
How do we find an approximation to the probability distribution of the average price for the
cell phones ordered by the shop owner? The next result provides a surprising answer.

Central Limit Theorem (CLT): For a random sample of size $n$ drawn
from an arbitrary population with mean $\mu$ and standard deviation $\sigma$, when
$n$ is large, the distribution of $\bar{X}$ is approximately normal with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$. It follows that

\[
Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \quad \text{is approximately } N(0, 1).
\]

So the distribution of $\bar{X}$ can be approximated by a normal distribution with mean $\mu$ and
standard deviation $\sigma/\sqrt{n}$ if the sample size is greater than or equal to 30 (a thumb rule).
Similarly, the distribution of $\bar{p}$ can be approximated by a normal distribution with mean $p$
(population proportion) and standard deviation $\sqrt{p(1-p)/n}$, where $n$ is the sample size,
if both $np$ and $n(1-p)$ are greater than or equal to 5 or 10 (a thumb rule). Can we utilize this
result to solve problem 3 above?
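A possible way of applying the CLT to the shop owner's order of 40 phones is sketched below in R. It assumes that a separate random digit is drawn independently for each of the 40 phones, that "between 2 and 4" means the digits 2, 3 and 4, and that n = 40 is large enough for the normal approximation; the variable names are illustrative.

```r
# Price paid = 10000 - discount; probabilities follow from the digit rule
price <- c(5000, 7000, 8000)        # discounts of 5000, 3000, 2000
prob  <- c(2, 3, 5) / 10            # digits {0,1}, {2,3,4}, {5,...,9}

mu    <- sum(price * prob)                   # mean price of one phone
sigma <- sqrt(sum((price - mu)^2 * prob))    # SD of the price of one phone

n  <- 40
se <- sigma / sqrt(n)               # SD of the average price of 40 phones

# Normal approximation for the average price (CLT)
p_le_6000 <- pnorm(6000, mean = mu, sd = se)
p_gt_7000 <- 1 - pnorm(7000, mean = mu, sd = se)
p_between <- pnorm(8000, mean = mu, sd = se) - pnorm(6000, mean = mu, sd = se)

c(mu = mu, sigma = sigma, se = se)
c(p_le_6000 = p_le_6000, p_gt_7000 = p_gt_7000, p_between = p_between)

# For the total price of the 40 phones: mean n * mu, SD sqrt(n) * sigma
c(total_mean = n * mu, total_sd = sqrt(n) * sigma)
```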

Application 1 in Statistical Quality Control: A manufacturing process is supposed to
produce capsules containing 400 mg of a chemical, say, C. However, variation in a manufacturing
process is inherent, so the contents of different capsules would vary. Suppose the regulatory
authority makes it mandatory that the content of every capsule should be between 398 mg
and 402 mg. To ensure this, the mean and standard deviation of the contents produced by the
manufacturing process are set at 400 mg and 0.5 mg. The production supervisor knows from
his experience that the standard deviation of the process rarely changes. However, he feels
that continuous monitoring of the process is necessary for checking the stability of the mean
of the process. A consultant suggested that he implement the following procedure: every
hour during a shift a sample of 100 capsules is to be selected, and if the average content of the
sample falls below 399.90 mg or above 400.10 mg, stop the process and hunt for the trouble.
1. What is the probability of a false alarm if this procedure is followed?
2. Assuming that the mean has actually shifted to 400.1, what is the probability that the
shift will be detected using such a sample?
3. What is the probability that the change in mean will remain undetected after two such
samples are inspected since the beginning of morning shift?
4. What is the probability that it remains undetected in the first two and gets detected at
the inspection of the third sample?
5. Suppose the process produces 10000 capsules per hour. What is the expected number of
capsules produced that will violate the norm of the regulatory authority till the change
in mean is detected in case of 3 above?
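Questions 1 to 4 can be approached with the normal approximation for the sample mean. The R sketch below assumes that the CLT applies for samples of 100, that the process standard deviation stays at 0.5 mg, and that successive hourly samples are independent; the function name p_alarm is hypothetical. Question 5 is left out.

```r
mu0   <- 400      # target mean (mg)
sigma <- 0.5      # process SD (mg), assumed unchanged
n     <- 100
se    <- sigma / sqrt(n)          # SD of the sample mean

lower <- 399.90
upper <- 400.10

# Probability that a sample mean falls outside the limits when the process mean is mu
p_alarm <- function(mu) {
  pnorm(lower, mean = mu, sd = se) + (1 - pnorm(upper, mean = mu, sd = se))
}

false_alarm  <- p_alarm(mu0)                  # question 1
detect       <- p_alarm(400.1)                # question 2
miss_two     <- (1 - detect)^2                # question 3
detect_third <- (1 - detect)^2 * detect       # question 4

c(false_alarm = false_alarm, detect = detect,
  miss_two = miss_two, detect_third = detect_third)
```

The limits sit two standard errors on either side of the target, which is what determines the false alarm probability.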

Application 2 in Statistical Quality Control: For assessing the quality of lots sent by
vendors, the quality control departments usually devise sampling inspection plans for taking
a decision on whether to accept or reject a lot. Suppose the lot size is 100 (N ), then the
sampling inspection plan specifies a sample size, say, 10 (n) that needs to be selected from the
lot without replacement and if the sample contains more than, say, 1 (c) defective items (again
the number specified by the sampling plan) the decision would be to reject the lot, otherwise
accept it. Sampling is often the only option if the testing is destructive in nature.
For designing sampling inspection plans the interests of both the consumer and the vendor
are to be protected. Since the decision to accept or reject a lot is taken on the basis of a
sample, there is a chance that even if the lot quality is good (bad) the lot may get rejected
(accepted). The vendor to protect himself from rejection of good lots imposes a condition
like: ‘if a lot has 5% (p1 ) defective items, the chance of rejecting such a lot should not exceed
10%’ (V Risk). Let us call it the vendor’s risk. On the other hand, for reducing the chance of
accepting a bad lot the consumer imposes a condition like, ‘the chance of accepting a lot with
10% (p2 ) defective should not exceed 10%’ (C Risk). Let us call this consumer’s risk.
Let us now consider a problem just to illustrate the above. Suppose N = 20; n = 5; c = 0;
p1 = 5%; V Risk = 10%; p2 = 10%; C Risk = 10%.
1. Does this sampling plan fulfill Vendor’s risk?
2. Does this sampling plan fulfill Consumer’s risk?
3. If actual number of defectives in the lot is 4, what is the chance of accepting such a lot?
4. Repeat 1–2 for $N = 10^3$; n = 20; c = 2; p1 = 5%; V Risk = 10%; p2 = 10%; C Risk = 10%
5. Repeat 3 with actual number of defectives 40.
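Since the sample is drawn without replacement, the number of defectives in the sample follows a hypergeometric distribution, and the questions can be explored with R's phyper(). This is a sketch under stated assumptions: the helper accept_prob and its argument names are hypothetical, and the lot size in question 4 is read as 1000.

```r
# P(accept) = P(number of defectives in the sample <= acceptance number)
# accept_prob() is a hypothetical helper introduced for illustration
accept_prob <- function(lot_size, sample_size, accept_no, defectives) {
  phyper(accept_no, m = defectives, n = lot_size - defectives, k = sample_size)
}

# Plan with N = 20, n = 5, c = 0
vendor_risk_1   <- 1 - accept_prob(20, 5, 0, defectives = 1)  # 5% of 20 = 1 defective
consumer_risk_1 <- accept_prob(20, 5, 0, defectives = 2)      # 10% of 20 = 2 defectives
accept_4        <- accept_prob(20, 5, 0, defectives = 4)      # question 3

# Plan with n = 20, c = 2, lot size assumed to be 1000
vendor_risk_2   <- 1 - accept_prob(1000, 20, 2, defectives = 50)   # 5% of 1000
consumer_risk_2 <- accept_prob(1000, 20, 2, defectives = 100)      # 10% of 1000
accept_40       <- accept_prob(1000, 20, 2, defectives = 40)       # question 5

c(vendor_risk_1 = vendor_risk_1, consumer_risk_1 = consumer_risk_1, accept_4 = accept_4)
c(vendor_risk_2 = vendor_risk_2, consumer_risk_2 = consumer_risk_2, accept_40 = accept_40)
```

A plan fulfils the vendor's (consumer's) condition when the corresponding risk computed here does not exceed the stated 10%.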

5.4 Sampling under normal model


Suppose, in Application 1, the recommended sample size is 25 instead of 100. Then, what
assumption would be necessary to compute the probabilities?
From an application point of view, one can invoke the CLT provided the sample size is large
enough. When sample sizes are small, we must assume normality for all practical purposes.
Note that there are procedures to validate the assumption of normality, to be taken up in due
course. The following result helps to tackle problems when the sample size is not so large.

If $X_1, X_2, \ldots, X_n$ are a random sample from $N(\mu, \sigma)$, the sampling distribution of $\bar{X}$
is $N(\mu, \sigma/\sqrt{n})$. It follows that

\[
E(\bar{X}) = \mu, \qquad SD(\bar{X}) = \frac{\sigma}{\sqrt{n}}.
\]

Further, the sampling distribution of $(n-1)S^2/\sigma^2$ is $\chi^2_{n-1}$, the chi-square
distribution with $(n-1)$ degrees of freedom. It follows that

\[
E(S^2) = \sigma^2, \qquad SD(S^2) = \sigma^2 \sqrt{\frac{2}{n-1}}.
\]

Application: An automotive part must be machined to close tolerances to be acceptable to
customers. Regulatory specifications call for a maximum variance in the lengths of the parts
of 0.0004. A random sample of size 10 is drawn during a shift to monitor the process. How
large should the sample variance be so that the probability of a false alarm is 5%?
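One way to set up this calculation in R uses the chi-square result for (n − 1)S²/σ². The sketch assumes that the part lengths are normally distributed, that an alarm is raised when the sample variance exceeds a cut-off, and that a false alarm means an alarm raised while the process variance is exactly at the allowed maximum of 0.0004. The variable names are illustrative.

```r
sigma2_max <- 0.0004   # maximum allowed variance
n          <- 10       # sample size
alpha      <- 0.05     # desired false alarm probability

# P(S^2 > cutoff) = alpha  <=>  (n - 1) * cutoff / sigma2_max = upper 5% point of chi-square(n - 1)
cutoff <- qchisq(1 - alpha, df = n - 1) * sigma2_max / (n - 1)
cutoff

# Check: probability of exceeding the cut-off when the variance equals sigma2_max
1 - pchisq((n - 1) * cutoff / sigma2_max, df = n - 1)
```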
