Training Manual on Sampling Techniques, Calculations and Applications


Draft 2006
Prepared by the International Programs Center of the U.S. Bureau of the Census for training in developing countries.

CHAPTER 1

GENERAL NATURE OF SAMPLE SURVEYS

________________________________________________________________________________

1.1 ROLE OF SAMPLING IN STATISTICAL THEORY AND METHODS

In a broad sense, sampling theory can be considered as coextensive with modern statistical methods.

Almost all of the modern developments in statistics relate to the inferences that can be made about a

population when information is available from only a sample of the elements of the population.

Some of the ways in which this is reflected in statistical programs are mentioned below.

In most survey work, the population consists of all persons (or housing units, households, industrial

establishments, farms, etc.) in a city or other area. Information is obtained or desired from a sample

of the population, but inferences are required on characteristics of the whole population.

In the design and analysis of experiments, the population represents all possible applications of

several alternative techniques which can be used. For example, the experiment may be agricultural,

in which a number of fertilizers are being tested. The population is infinite because it represents the

use of the fertilizers in all possible farms over all time. The problem is to design experiments so that

the maximum amount of information can be made available for inferences about the full population,

estimated from a sample of limited size.

In the application of quality control methods in an industrial establishment, for example, the

population is all of the products coming out of a machine. Inferences are needed on how well the

products conform to specifications. The term "quality control" is also applied to a sample check on

the quality of field work done in a sample survey; the sample check is carried out after the actual

survey is completed. Office operations such as editing and coding are also subject to quality control;

a sample of the work is checked to determine if it meets acceptable standards.

These chapters will be limited to one aspect of sampling: its application in survey work. They will deal mainly with principles of sampling from a common-sense rather than a

mathematical viewpoint, though mathematics cannot be entirely avoided. The emphasis will be on

the methods of sampling that can be used under different conditions. The formulas will be

presented, some without mathematical proof, but with information on how they should be used. Two

types of examples will be used to illustrate the formulas and methods: (a) simple examples to make

the techniques clear, and (b) examples taken from actual surveys to show the realistic applications of

the methods discussed.

First there will be a general discussion of the subject as a whole, including the nature of probability

sampling, and choices of sampling units and sampling frames. Then we shall describe the types of

common sample designs--simple random sampling, stratified sampling, and cluster sampling. The

features of these designs and the methods of sample selection will be discussed. The different

methods of estimating the characteristics of the population from the sample results will also be

treated, as well as how to determine the size of sample required for a particular degree of reliability

and how to calculate sampling errors.

We shall also discuss the problem of estimating, from a sample, the results that would have been

obtained from a full census using the same questionnaire, enumeration or interview procedures,

supervision, etc. These are aspects of the problem of sampling error. There are, of course,

nonsampling errors that arise from wrong responses to questions, or from poorly worded questions.

These are present in complete censuses as well as in sample surveys. Although the lectures are not

primarily concerned with such nonsampling errors, they may be very important. In fact,

nonsampling errors often represent more serious limitations on the use of statistics than sampling

errors.

Sampling offers several advantages over a complete census:

(1) A sample may save money (as compared with the cost of a complete census) when absolute precision is not necessary.

(2) A sample saves time, when data are desired more quickly than would be possible with a

complete census.

(3) In industrial uses, some tests are destructive (for example, testing the length of time an electric bulb will last) and can only be performed on a sample of items.

(4) Some populations can be considered as infinite, and can, therefore, only be sampled. A simple example is an agricultural experiment for testing fertilizers. In one sense, a census can be considered as a sample at one instant of time of an underlying causal system which has random features in it.

(5) Where nonsampling errors are necessarily large, a sample may give better results than a complete census because nonsampling errors are easier to control in smaller-scale operations.


1.4.1 Limited Funds

The use of a sample survey when limited funds are available for collecting information is well

known. Sampling may also be used to save money in tabulation. For example, in the 1950 Census

in the United States most of the data were collected on a 100-percent basis. However, many

tabulations were made on a sample basis (20% or 3-1/3%) for special detailed classifications to save

the cost of tabulating 150,000,000 individual records. The 1960 Census utilized sampling

procedures to an even greater extent in both the collection and the tabulation of data.

Other examples from the 1950 census in the United States illustrate how samples can be used to save

time. The enumeration of the census was taken in April 1950. The time required for processing the

results was such that publication of the results was expected to start in 1951 and continue through

1952. A sample of the census results was selected for quick processing and tabulation, and

preliminary results were published on the basis of this sample. These results were issued 1 to 2 years

earlier than the complete census results.

Some surveys require such intensive and time-consuming interviews that it is impossible to consider

them on any basis except a sample basis. Moreover, the use of sampling permits particular attention

to be given to a limited number of cases. Examples are family budget studies and comprehensive

studies of health conditions.

Information may be required for a time series when data are available only for particular periods of

time and results are needed promptly. The series may be one of economic activity in the country,

with figures available only on a yearly or monthly basis, or it may be one of producing a learning

curve for which only occasional tests are possible.

An interesting case arose in the 1950 United States Census where the relationship between nonsampling and sampling errors made sample results preferable to complete census results.

The United States has conducted a monthly sample survey of the labor force since 1940. In 1950, it

was based on a sample of 20,000 households. The information obtained in the 1950 complete census

also included labor force status. When the results of the census became available, it was clear that

the figures for both unemployed and employed persons were quite

different from those estimated from the labor force sample survey; the differences were far beyond

what could be expected on the basis of the sampling errors. The problem of reporting in the census

introduced much greater error than the sampling error of the monthly survey (this greater error was

caused by the use of enumerators who, for the most part, were inexperienced in interviewing). Users

of census data were advised, therefore, to use the sample results as the more reliable national statistics on the labor force.

Under certain conditions, the usefulness of sampling becomes questionable. Three principal

conditions can be mentioned.

(1) If data are needed for very small areas, disproportionately large samples are required, since

precision of a sample depends largely on the sample size and not on the sampling rate. In

this case, sampling may be almost as expensive as a complete census.

(2) If data are needed at regular intervals of time, and it is important to measure very small

changes from one period to the next, very large samples may be necessary.

(3) If there are unusually high overhead costs connected with a sample survey, caused by work

involved in sample selection, control, etc., sampling may be impractical. For example, in a

country with many small villages it may be more economical to enumerate all the

households in the sample villages than to enumerate a sample of households within the

sample villages. For office processing, however, a sample of the enumerated households

may be used to reduce the work and costs of producing tabulations.


CHAPTER 2

CRITERIA AND DEFINITIONS

_______________________________________________________________________________________________________________

2.1 CRITERIA FOR THE ACCEPTABILITY OF A SAMPLING METHOD

It has been demonstrated repeatedly in practical applications that modern sampling methods can

provide data of known reliability on an efficient and economical basis. However, although a sample

includes only part of a population, it would be misleading to call a collection of numbers a "sample"

merely because it includes part of a population.

To be acceptable for statistical analysis, a sample must represent the population and must have

measurable reliability. In addition, the sampling plan should be practical and efficient.

2.1.1 Representativeness

The sample must be selected so that it properly represents the population that is to be covered. This

means that each unit (farm, household, person, or whatever unit is being sampled) must have a

nonzero probability (chance) of being selected.

2.1.2 Measurability

It should be possible to measure the reliability of the estimates made from the sample. That is, in

addition to the desired estimates of characteristics of the population (totals, averages, percentages,

etc.) the sample should give measures of the precision of these estimates. As we shall see later, these

measures of precision can be used to indicate the maximum error that may reasonably be expected in

the estimates, if the procedures are carried out as specified, and if the sample is moderately large.

The estimation of precision is not possible unless the selection is carried out so that the chance of

selection of each unit is known in advance and random sampling is used.

2.1.3 Feasibility

A third characteristic is that the sampling plan must be practical. It must be sufficiently simple and

straightforward so that it can be carried out substantially as planned; that is, the sampling theory and

practice will be the same. A plan for selecting a sample, no matter how attractive it may appear on

paper, is useful only to the extent that it can be carried out in practice. When the methods actually

followed are the same (or substantially the same) as specified in the sampling plan, then known

sampling theory provides the necessary measures of reliability. In addition, the measures of

reliability computed from the survey results will serve as powerful guides for future improvement in

important aspects of the sample design.


2.1.4 Economy and Efficiency

Finally, the design should be efficient. Among the various sampling methods that meet the three

criteria stated above, we would naturally choose the method which, to the best of our knowledge,

produces the most information at the smallest cost. Although this is not an essential feature of an

acceptable sampling plan, it is clearly a highly desirable one. It implies that the most effective

possible use will be made of all available facilities and resources, such as maps, other statistical data,

personal knowledge, sampling theory, etc.

We shall consider only sampling methods that conform to the above criteria. We shall present basic

theory for various alternative designs which are possible, and methods of measuring their precision.

We shall also stress practical methods of application and considerations of efficiency.

2.2 DEFINITIONS

2.2.1 Sample Survey

A sample survey is one in which measurements are taken on a sample of elements for making statistical inferences (see Glossary in Annex A) about a defined group of elements. Surveys are conducted in many ways.

2.2.2 Unit of Analysis

The unit of analysis is the unit for which we wish to obtain statistical data. The most common units

of analysis are persons, households, farms, and business firms. They may also be products coming

out of some machine process. The unit of analysis is frequently called an element of the population.

There may be more than one unit of analysis in the same survey; for example, households and

persons; or number of farms and hectares (or acres) harvested.

2.2.3 Characteristic

A characteristic is a general term for any variable or attribute having different possible values for

different individual units of sampling or analysis. In a sample survey, we observe or measure the

values of one or more characteristics for the units in the sample. For example, we observe (or ask

about) the area of land for rice crop, the number of cattle on a farm, the age and sex of a person, the

number of children per family, etc. So, we observe a unit, but we measure several characteristics of

that unit.

2.2.4 Population or Universe

The population or universe is the entire group of all the units of analysis whose characteristics are

to be estimated. The chapters in this sampling manual will deal primarily with a finite population,

having N units.


2.2.5 Probability Sample

In probability sampling, every element in a defined population has a known, nonzero probability of being selected.

It should be possible to consider any element of the population and state its probability of selection.

A simple way of obtaining a probability sample is to draw the units one by one with a known

probability of selection assigned to each unit of the population at the first and each subsequent draw.

The successive draws may be made with or without replacing the units selected in the preceding

draws. The former is called the procedure of sampling with replacement, and the latter, sampling

without replacement.

2.2.6 Simple Random Sampling

Simple random sampling is a special case of probability sampling, sometimes called unrestricted

random sampling. It is a process for selecting n sampling units one at a time, from a population of N

sampling units so that each sampling unit has an equal chance of being in the sample. Every possible

combination of n sampling units has the same chance of being chosen. Selection of one sampling

unit at a time with equal probability may be accomplished by either sampling with replacement or

without replacement. Most, if not all, samples are selected without replacement. Using a table of

random numbers to select the units satisfies this definition of simple random sampling.
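The selection procedure just described can be sketched in a few lines of code. This is an illustrative sketch only, not part of the manual (which uses tables of random numbers); the function name and the frame of 100 unit labels are made up for the example.

```python
import random

# Illustrative sketch: draw a simple random sample of n units without
# replacement, so every combination of n units is equally likely.
def simple_random_sample(frame, n, seed=None):
    rng = random.Random(seed)           # seeded for reproducibility
    return rng.sample(frame, n)         # sampling without replacement

frame = list(range(1, 101))             # a hypothetical frame of N = 100 unit labels
sample = simple_random_sample(frame, n=10, seed=42)

assert len(sample) == 10                # n units selected
assert len(set(sample)) == 10           # no unit appears twice
assert all(u in frame for u in sample)  # every unit comes from the frame
```

Fixing the seed reproduces the same sample on every run, which stands in for documenting which rows of a random-number table were used.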

2.2.7 Sampling Frame

The totality of the sampling units from which the sample is to be selected is called the sampling

frame. The frame may be a list of persons or of housing units; it may be a subdivided map, or it may

be a directory of names and addresses stored in some kind of electronic medium, such as a file on a hard disk or a database.

2.2.9 Parameter

A parameter is a quantity computed from all values in a population set. That is, a parameter is a

descriptive measure of a population. For example, consider a population consisting of N elements.

Then the population total, the population average or any other quantity computed from

measurements including all elements of the population is a parameter. The objective of sampling is

to estimate the parameters of a population.

2.2.10 Statistic

A statistic is a quantity computed from sample observations of a characteristic, usually for the

purpose of making an inference about the characteristic in the population. The characteristic may be

any variable which is associated with a member of the population, such as age, income, employment

status, etc.; the quantity may be a total, an average, a median, or other quantiles. It may also be a rate

of change, a percentage, a standard deviation, or it may be any other quantity whose value we wish to

estimate for the population.


Note that the term statistic refers to a sample estimate and the term parameter refers to a population

value.

Note on Quantiles: What is a quantile? If a set of data is arranged in order of magnitude, the

middle value (or the arithmetic mean of the two middle values) which divides the set into two equal

parts is the MEDIAN. By extending this idea we can think of those values which divide the set into

four equal parts. These values, denoted by Q1, Q2 and Q3 are called the first, second and third

quartiles respectively, the value of Q2 being equal to the median. Similarly the values which divide

the data into ten equal parts are called deciles and are denoted by D1, D2, ... D9, while the values

dividing the data into one hundred equal parts are called percentiles and are denoted by P1, P2, ... P99.

The 5th decile and the 50th percentile correspond to the median. The 25th and 75th percentiles

correspond to the first and third quartiles, respectively. Collectively, quartiles, deciles, percentiles

and other values obtained by equal subdivisions of the data are called quantiles.
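The quantile relationships in the note above can be checked with Python's standard library; this is an illustrative sketch with a made-up data set, not an example from the manual:

```python
import statistics

# Hypothetical data set, already arranged in order of magnitude.
data = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 18, 20]

quartiles = statistics.quantiles(data, n=4)      # Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)       # D1 ... D9
percentiles = statistics.quantiles(data, n=100)  # P1 ... P99

# The 2nd quartile, 5th decile, and 50th percentile all equal the median.
median = statistics.median(data)
assert quartiles[1] == median
assert deciles[4] == median
assert percentiles[49] == median
```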

2.2.11 Independent Information

Independent information consists of data that are known in advance of or simultaneously with the

survey which are not based on the survey but are used to improve the survey design. Such data may

be used for purposes of stratification, for determining the probabilities of selection, or in estimating

the final results from the sample data. The data must be of good, known quality.

2.2.12 Estimate and Estimator

An estimate is a numerical value, computed from sample observations, that is intended to provide information about an unknown population value.

An estimator is a mathematical formula or rule which uses sample results to produce an estimate for the entire population. For example, the sample average, ȳ = (y₁ + y₂ + ... + yₙ)/n, is an estimator of the population average.

Therefore, the estimator refers to a mathematical formula. When numbers are plugged into the

formula, an estimate is produced. However, in common statistical language, the words estimate and

estimator are used interchangeably.


2.2.13 Probability of Selection

The probability of selection is the chance that each unit in the population has of being included in

the sample. Probability values range from 0 to 1, inclusive.

2.2.14 Random Variable

A random variable is a variable which, by chance, can be equal to any value in a specified set. The probability that it equals any given value (or falls between two limits) is either known, can be determined, or can be approximated or estimated. A chance mechanism determines the value which a random variable takes. For example, in flipping a coin, we can define the random variable X, which takes the value 1 if the coin lands 'heads' and the value 0 if it lands 'tails'. Therefore, the variable X, as just defined, can take either one of two values after the coin is flipped.

2.2.15 Probability Distribution

The probability distribution gives the probabilities associated with the values which a random

variable can equal. If there are N values that a random variable X can take, say X1, X2, ... ,XN, then

there are N probabilities associated with the Xi's values, namely P1, P2, ... ,PN. The probabilities and

the values the random variable takes constitute the probability distribution of X.

2.2.16 Illustration

The 2000 U.S. Census of Population and Housing found that 281,421,906 persons lived in 105,480,101 households, of which 71,787,347 were family households and 33,692,754 were nonfamily households. Table 2.1 below shows the distribution of households by type.¹

These data show that 68.1% of all households are of the “family” type and 31.9% are of the

“nonfamily” type. Now if we were to pick a household at random, what is the probability that we

would pick a family household? If each household, large or small, is equally likely to be picked,

then there is a .681 probability of picking a family household.
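As a quick arithmetic check (not part of the manual), the probability just quoted can be reproduced from the census counts in the text:

```python
# Household counts from the 2000 U.S. Census figures quoted above.
family = 71_787_347
nonfamily = 33_692_754
total = family + nonfamily
assert total == 105_480_101             # matches the total quoted in the text

# If every household, large or small, is equally likely to be picked:
p_family = family / total
assert round(p_family, 3) == 0.681      # 68.1% family households
assert round(1 - p_family, 3) == 0.319  # 31.9% nonfamily households
```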

¹ The Census Bureau defines a household as persons who occupy a house, apartment, or other separate living quarters. One of the tests in determining a household is that there are complete kitchen facilities for the exclusive use of the occupants. People who are not in households live in group quarters, including rest homes, rooming houses, military barracks, jails, and college dormitories.


Table 2.1. Distribution of Households by Type

Type of Household                          Number of Households   Percent of Total Households
Married Couple                                     54,493,232          51.7
Female Householder, no husband present             12,900,103          12.2
Male Householder, no wife present                   4,394,012           4.2
One Person                                         27,230,075          25.8
Two or More People                                  6,462,679           6.1


Exercises

2.1 In order to select a sample of the total population of a city, a sample is selected from the

telephone directory for that city and the families of the persons selected are interviewed.

Does this satisfy the criteria for acceptability? Explain.

2.2 In order to determine the population of a city where all children of school age attend school,

a sample of school children is drawn and their families are interviewed. Give two reasons

why this does not meet the criteria for acceptability. (Think of families who have more than

one child in school and families that don’t have any children.)

2.3 Suppose that you were using sampling to estimate the total number of words in a book that

contains illustrations.

(b) What are the pros and cons of using (1) the page, or (2) the line, as a sampling unit?

2.4 Suppose that you work for a major public opinion pollster and you wish to estimate the

proportion of adult citizens who think the President is doing a good job in heading the

nation's economy. Clearly define the population you wish to sample.

2.5 The problem of finding a frame that is complete and from which a sample can be drawn is often an obstacle. What kinds of frames might be tried for the following surveys? Do the frames have any serious weaknesses?

(d) A survey to estimate the number of hours per week spent by family members watching

television.


CHAPTER 3

SIMPLE RANDOM SAMPLING

SAMPLING DISTRIBUTION

______________________________________________________________________________

3.1 INTRODUCTION

In this chapter, we shall introduce the concept of the sampling distribution of a statistic, probably the

most basic concept of statistical inference. We shall concentrate only on the sample mean and its

sampling distribution. We shall first introduce certain definitions and relationships of terms needed

for the sampling distribution.

The expected value is the average value for a single characteristic over all possible samples. Mathematically, we define the expected value (or mean) of a random variable Y as follows:

E(Y) = Σ y·P(y)

where the sum extends over all possible values y of Y, P(y) is the probability that Y takes the value y, and the Greek letter Σ (sigma) indicates the sum of the products of all possible values of y and their associated probabilities P(y). The small y denotes a particular value of Y.

The expected value is a weighted average of the possible outcomes, with the probability weights

reflecting the likelihood of occurrence of each outcome. Thus, the expected value should be

interpreted as the long-run average value of Y, if the frequency with which each outcome occurs is in

accordance with its probability.

For example, consider the tossing of a die in which each outcome (numbers 1 to 6) has the same probability of occurring, 1/6 (assuming the die is not biased). If Y is used to represent the number that appears when we throw a die, the expected value of Y is given by:

E(Y) = 1(1/6) + 2(1/6) + 3(1/6) + 4(1/6) + 5(1/6) + 6(1/6) = 21/6 = 3.5

The expected value of Y is not the most likely or the most typical value of Y. It is the long-run average value of Y, if we repeatedly perform the experiment that originates the outcomes. Some throws of the die will produce numbers below 3.5 and others above 3.5. The average of these different numbers, in the long run, will be 3.5.
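As an illustrative aside, the same calculation can be done mechanically with exact fractions:

```python
from fractions import Fraction

# E(Y) = sum over all outcomes of y * P(y), for a fair six-sided die.
outcomes = range(1, 7)
p = Fraction(1, 6)                 # each face has probability 1/6
expected = sum(y * p for y in outcomes)

assert expected == Fraction(7, 2)  # E(Y) = 21/6 = 3.5
```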


3.2.1 Unbiased Estimate

An unbiased estimate has the property that the average of all the estimates obtained from all possible

samples of a given size is equal to the true value. Mathematically, an estimate is unbiased if the

expected value of the estimate is equal to the parameter being estimated.

For example, if θ̂ is an estimate of the parameter θ and if E(θ̂) = θ, then θ̂ is an unbiased estimate of θ. Otherwise, θ̂ is biased, and the bias is E(θ̂) − θ. That is, the bias is the difference between the expected value of an estimate and the true population value (parameter) being estimated.

3.2.2 Consistent Estimate

An estimate is consistent if its values tend to concentrate increasingly around the true value as the sample size increases. In other words, the estimate assumes the population value with probability approaching unity as the sample size tends to infinity. This definition of consistency strictly applies to estimates based on samples drawn from an infinite population. We use the following definition in the case of a finite population: an estimate ȳ is said to be a consistent estimate of the parameter Ȳ if it takes the population value when n = N.

In the next section we will see that for simple random sampling the sample mean is an unbiased and

consistent estimate of the population mean as the sample size increases.

A sampling distribution is the probability distribution of all possible values that an estimate might

take under a specified sampling plan.

In this section we will show by examples that the sample average (mean) is both an unbiased and a

consistent estimate of the true population average.

Let us first present the idea of a sampling distribution of the mean by actually listing all possible

random samples of size n = 2 which can be drawn from a hypothetical population of N = 5 housing

units (HUs) shown in Table 3.1. We wish to estimate the average household (HH) size of these HUs

from a sample.

Table 3.1

HU 1 2 3 4 5

HH Size 3 5 7 9 11


The average number of persons per household (or average household size) is:

Ȳ = (3 + 5 + 7 + 9 + 11)/5 = 35/5 = 7

If we take a sample of size 2 from this population, there are 10 possibilities, and they are:

3 and 5, 3 and 7, 3 and 9, 3 and 11, 5 and 7, 5 and 9, 5 and 11, 7 and 9, 7 and 11, 9 and 11

The means of these samples are 4, 5, 6, 7, 6, 7, 8, 8, 9, and 10, respectively, and if sampling is

random so that each sample has the probability 1/10, we obtain all the possible samples of size two

HUs from a population of 5 HUs, as shown in Table 3.2. Table 3.3 presents the sampling

distribution of the mean.

Table 3.2

Sample (n = 2)    ȳ     p(ȳ)
3, 5              4     1/10
3, 7              5     1/10
3, 9              6     1/10
3, 11             7     1/10
5, 7              6     1/10
5, 9              7     1/10
5, 11             8     1/10
7, 9              8     1/10
7, 11             9     1/10
9, 11            10     1/10


Table 3.3

Mean ȳ    Probability
4         1/10
5         1/10
6         2/10
7         2/10
8         2/10
9         1/10
10        1/10

An examination of this sampling distribution reveals some pertinent information relative to the problem of estimating the mean of the given population using a random sample of size 2. For instance, we see that, corresponding to ȳ = 6, 7, or 8, the probability is 6/10 that a sample mean will not differ from the population mean (which is 7) by more than 1, and that, corresponding to ȳ = 5, 6, 7, 8, or 9, the probability is 8/10 that a sample mean will not differ from the population mean by more than 2.

Further useful information about this sampling distribution of the mean can be obtained by calculating its expected value as follows:

E(ȳ) = 4(1/10) + 5(1/10) + 6(2/10) + 7(2/10) + 8(2/10) + 9(1/10) + 10(1/10) = 70/10 = 7

Note that the same result would be obtained for samples of any size. Recall the definition of the expected value, which is the average of a single characteristic over all possible samples. With simple random sampling, the sample mean is an unbiased estimate of the true mean.
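As an illustrative aside (not part of the manual), the enumeration behind Tables 3.2 and 3.3 can be reproduced in a few lines of Python:

```python
from itertools import combinations
from fractions import Fraction

hh_sizes = [3, 5, 7, 9, 11]                  # household sizes of the N = 5 HUs
samples = list(combinations(hh_sizes, 2))    # all possible samples of size 2
assert len(samples) == 10                    # ten equally likely samples

means = [Fraction(a + b, 2) for a, b in samples]
e_mean = sum(means) / len(means)             # each sample has probability 1/10

pop_mean = Fraction(sum(hh_sizes), len(hh_sizes))
assert pop_mean == 7
assert e_mean == pop_mean                    # the sample mean is unbiased
```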


We will now compare the distribution of the sample estimates to show that:

(1) As the sample size increases, the means of the samples tend to concentrate more and more

around the true average value. In other words, the estimates tend to become more and more

reliable as the sample size increases.

(2) The percentage distributions of the sample estimates can be used to predict the chance of

obtaining a sample estimate within specified ranges of the true value.

To see the above statements, consider a hypothetical population of 12 individuals. We wish to make estimates from samples of 1, 2, 3, 4, 5, 6, and 7 individuals. The full population is shown in Table 3.4 below.

Table 3.4

Person   Income     Person   Income
1        $1,300     7         1,800
2         6,300     8         2,700
3         3,100     9         1,500
4         2,000     10          900
5         3,600     11        4,800
6         2,200     12        1,900

TOTAL INCOME:    $32,100
AVERAGE INCOME:   $2,675

A frequency distribution of the sample means is illustrated in Table 3.5 for samples of sizes 1, 2, 3, 4, 5, 6, and 7 individuals. For each sample size, the percentage of the sample estimates falling within a specified range of the true value and the average of the means are also shown in the table. For example, the proportion of the sample results falling between $2,000 and $3,400 is 47% for samples of 2; 58% for samples of 3; 69% for samples of 4; and 78%, 87%, and 94% for samples of 5, 6, and 7, respectively. This tells us that by taking samples large enough, the proportion of the

sample estimates falling within a designated interval about the expected value can be made as close

to 100% as desired. That is, we can predict the precision of a sample if we have the distribution of

all sample estimates of a given size for the population. The increasing concentration of sample

estimates around the true value illustrates consistency, a quality possessed by important types of

sample estimates.


Table 3.5. Distribution of Sample Means from Samples Drawn
Without Replacement from the Population of 12 Persons

                       Number of samples of size n
Estimated Mean       n=1   n=2   n=3   n=4   n=5   n=6   n=7
$ 800 to $1,199        1     1     -     -     -     -     -
$1,200 to $1,399       1     2     3     1     -     -     -
$1,400 to $1,599       1     5    10    11     7     1     -
$1,600 to $1,799       -     6    15    25    25    16     6
$1,800 to $1,999       2     5    20    42    55    50    27
$2,000 to $2,199       1     6    22    50    78    84    61
$2,200 to $2,399       1     6    22    52    90   109    98
$2,400 to $2,599       -     6    19    52   101   139   136
$2,600 to $2,799       1     3    17    49   108   151   150
$2,800 to $2,999       -     4    16    57   101   133   130
$3,000 to $3,199       1     3    16    46    81   107   108
$3,200 to $3,399       -     3    16    38    61    79    62
$3,400 to $3,599       -     2    13    26    46    43    14
$3,600 to $3,799       1     2    10    21    27    12     -
$3,800 to $3,999       -     3     7    11    10     -     -
$4,000 to $4,199       -     3     4    10     2     -     -
$4,200 to $4,399       -     2     6     3     -     -     -
$4,400 to $4,599       -     1     1     1     -     -     -
$4,600 to $4,799       -     1     2     -     -     -     -
$4,800 to $6,399       2     2     1     -     -     -     -

Average of all
possible samples*  $2,675 $2,675 $2,675 $2,675 $2,675 $2,675 $2,675

* Expected Value

This means that if the sample is sufficiently large, one takes very little risk in using sample

estimates. (From the above illustration, it might appear that the increase in concentration arises

from the fact that, as the size of the sample increases, the percentage of the population in the sample

becomes higher. Actually, similar results would be observed when the size of sample increases even

though only a small proportion of the universe is included.)
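The concentration of sample means described above can be reproduced by brute force, since the population has only 12 members. This is an illustrative sketch; the exact 47%/78% figures depend on how the interval endpoints are treated, so only the direction of the comparison is asserted here.

```python
from itertools import combinations
from statistics import mean

# Incomes of the 12 persons in Table 3.4.
incomes = [1300, 6300, 3100, 2000, 3600, 2200,
           1800, 2700, 1500, 900, 4800, 1900]
assert sum(incomes) == 32_100
assert mean(incomes) == 2675

def concentration(n, lo=2000, hi=3400):
    """Fraction of all size-n sample means that fall in [lo, hi]."""
    means = [mean(c) for c in combinations(incomes, n)]
    return sum(lo <= m <= hi for m in means) / len(means)

# Larger samples concentrate more tightly around the true mean of $2,675.
assert concentration(2) < concentration(5)
```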

3.4 SAMPLING ERROR (CONFIDENCE INTERVAL)

We have seen that the precision of a sample can be predicted if we have the distribution of all

sample estimates of a given size for the population. In a real situation, we cannot select all possible samples and examine the estimates derived from them. We must depend upon a single sample.

Therefore, it is necessary to find some measure of the extent to which the estimates made from

various samples differ from the true value; this measure, if it is to be useful, must be one that can be

estimated from the sample itself. Before showing how and why we can do this, we shall introduce

certain definitions and relationships which are derived from the theory of sampling.


3.4.1. Standard Deviation

We shall show that there is a measure of the variability in the original population which can be

estimated from the observations in a single sample, and from which it is possible to estimate the

expected error in the sample mean.

The measure of variability in the population is called the standard deviation; its square is called
the population variance and is designated by the symbol σ² or VAR. The variance of the
population is defined as the average of the squares of the deviations of all the individual
observations from their mean value. Thus, it would be computed by the following process, if all the
values in the universe could be observed:

    σ² = [(Y1 − Ȳ)² + (Y2 − Ȳ)² + ... + (YN − Ȳ)²] / N = Σ(Yi − Ȳ)² / N

where the Y's with subscripts are individual observations and Ȳ is the mean of the N observations
for the N elements in the universe. Note that it has become fairly general practice to denote the
population variance by σ² when dividing by N, and by S² when dividing by N − 1; symbolically,

    σ² = Σ(Yi − Ȳ)² / N    and    S² = Σ(Yi − Ȳ)² / (N − 1)

The corresponding sample variance is

    s² = Σ(yi − ȳ)² / (n − 1)

where n is the sample size, yi is the sample measurement of a characteristic and ȳ is the sample
mean.

We will use S² throughout the text because s² is an unbiased estimate of S². Note that all results are
equivalent in either notation. Also,

    S² = N σ² / (N − 1)
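The statement that s² is an unbiased estimate of S² can be checked by exhaustive enumeration on a toy population (the four values below are an arbitrary illustration, not data from the text): average s² over every possible without-replacement sample and compare with S².

```python
from itertools import combinations
from statistics import mean, variance

population = [1, 2, 3, 4]        # arbitrary toy population (not from the text)
N = len(population)
Y_bar = mean(population)

# S^2: population variance with the N - 1 divisor, the convention of this manual.
S2 = sum((y - Y_bar) ** 2 for y in population) / (N - 1)

# s^2 for every possible simple random sample of size n = 2 ...
sample_variances = [variance(s) for s in combinations(population, 2)]

# ... averages out to exactly S^2, i.e. s^2 is unbiased for S^2.
assert abs(mean(sample_variances) - S2) < 1e-9
```

The same check with the N-divisor version σ² would fail, which is precisely why the N − 1 convention is kept throughout the text.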

The variance of the sample means is the average of the squares of the deviations of the means of all
possible samples of size n from the true mean. The variance of ȳ is denoted by V(ȳ) and we
write:

    V(ȳ) = (S²/n) × (N − n)/N = (S²/n)(1 − f)

where f = (n/N) = sampling fraction.

The square root of the variance of ȳ is called the sampling error for means of samples of size n.
The sampling error of ȳ is:

    SE(ȳ) = √[(S²/n) × (N − n)/N]

It is important to note that the sampling error varies with the size of the sample, as we would

expect. If we compute the sampling error for all possible samples of sizes shown in Table 3.5, we

see that as the sample size increases, the sampling error becomes smaller and smaller. This is

shown in the following illustration (see Table 3.6). The factor (N - n)/N in the formula for the
variance of ȳ is called the finite population correction factor (fpc). As a rule of thumb, if n
≤ 0.05N we can ignore (N - n)/N since its value will be close to 1. Otherwise we should include it in
the formula in order not to severely overestimate the variance of ȳ.
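The rule of thumb can be illustrated numerically. The values of S, n and N below are hypothetical, chosen only to show the size of the fpc effect:

```python
import math

def sampling_error(S, n, N=None):
    """Sampling error of the mean; applies the fpc (N - n)/N when N is given."""
    fpc = (N - n) / N if N is not None else 1.0
    return math.sqrt(fpc * S ** 2 / n)

# Hypothetical example with S = 200 and n = 100.
# When n = 0.01 N, the fpc barely matters:
with_fpc = sampling_error(200, 100, N=10_000)
without_fpc = sampling_error(200, 100)
print(round(with_fpc, 2), round(without_fpc, 2))   # prints: 19.9 20.0

# When n = 0.5 N, ignoring the fpc overstates the error badly:
print(round(sampling_error(200, 100, N=200), 2))   # prints: 14.14
```

With a 1 percent sampling fraction the two answers agree to half a percent; with a 50 percent sampling fraction, dropping the fpc inflates the sampling error by over 40 percent.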

3.4.3 Illustration

Consider again the population of 12 individuals in Table 3.4. In this case, the true average is
Ȳ = $2,675 with N = 12. We compute S² as follows:

    S² = Σ(Yi − Ȳ)²/(N − 1) = Σ(Yi − 2,675)²/11

and S = $1,571.41.

Using S, we can compute the sampling error of the sample mean for different sample sizes n. For
example, if the sample size n = 1 then,

    SE(ȳ) = √[(S²/1) × (12 − 1)/12] = 1,571.41 × 0.957 ≈ $1,505

for n = 2,

    SE(ȳ) = √[(S²/2) × (12 − 2)/12] = 1,571.41 × 0.645 ≈ $1,015

The sampling errors for all possible sample sizes are given in the following table.

Table 3.6
Sampling Error of ȳ

Size of sample n    Sampling error of estimated measure
       1                        $1,505
       2                         1,015
       3                           786
       4                           642
       5                           537
       6                           454
       7                           383
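Table 3.6 can be reproduced directly from S = $1,571.41 and N = 12; the small discrepancies (a dollar at most) come from the rounding of S:

```python
import math

N, S = 12, 1571.41          # population size and standard deviation from the text

def se_mean(n):
    # sqrt( (S^2 / n) * (N - n) / N ): sampling error of the mean
    return math.sqrt((S ** 2 / n) * (N - n) / N)

table_3_6 = [1505, 1015, 786, 642, 537, 454, 383]
for n, published in zip(range(1, 8), table_3_6):
    computed = se_mean(n)
    assert abs(computed - published) <= 1       # agrees to within $1
    print(n, round(computed))
```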

We know that the probability of an estimate being equal to the true value (parameter) is zero for

continuous variables. Thus, it will be more useful if we can state how probable it is that an interval

based on our estimate will contain the parameter to be estimated.

Interval estimator - An interval estimator is a formula that tells us how to use the sample

observations to calculate two numbers that define an interval which will enclose the estimated

parameter with a certain (usually high) probability. The resulting interval is called a confidence

interval and the probability that it contains the true parameter is called its confidence

coefficient. If a confidence interval has a confidence coefficient equal to .95, we call it a 95%

confidence interval.

For the mean, the interval takes the form

    ȳ ± t √[(S²/n) × (N − n)/N]

The symbol t is the value of the normal deviate corresponding to the desired confidence probability.


In practice, S² is not known. Usually s², the sample variance, is calculated from the sample data and
used as an estimate of S². If n is large, s provides a fairly good estimate of S; however, for small
samples this may not be the case. Using s, the confidence interval is

    ȳ ± t √[(s²/n) × (N − n)/N]

The value t depends on the level of confidence desired. For large samples, the most common
values (see Appendix I - Normal Distribution Table) are:

    Confidence level:   68%    90%     95%    99%    99.7%
    t:                  1.00   1.645   1.96   2.58   3.00

If the sample size is less than 30, the percentage points may be taken from the Student's t table (see
Appendix II) with (n-1) degrees of freedom.
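Putting the pieces together, a confidence interval for the mean can be computed as follows. The sample figures are hypothetical (they happen to match Exercise 3.3 below), and t = 1.96 gives the usual large-sample 95% level:

```python
import math

def confidence_interval(y_bar, s, n, t=1.96, N=None):
    """y_bar +/- t * se, with se = sqrt((s^2/n) * (N-n)/N); fpc skipped if N is None."""
    fpc = (N - n) / N if N is not None else 1.0
    se = math.sqrt(fpc * s ** 2 / n)
    return y_bar - t * se, y_bar + t * se

# Hypothetical sample: mean 75, s = 15, n = 100 (cf. Exercise 3.3).
low, high = confidence_interval(75, 15, 100)
print(round(low, 2), round(high, 2))   # prints: 72.06 77.94
```

Passing N enables the fpc for the cases where n is more than about 5 percent of the population.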

Comparing Tables 3.5 and 3.6, it can be seen that as the sample size increases, the sample estimates

differ less and less from the expected value, and at the same time the sampling error becomes

smaller and smaller. In practical sampling problems, where a reasonably large sample is used

(generally 30 or more cases), the distribution of sample results over all possible samples

approximates very closely the normal distribution-- the familiar bell-shaped curve. This is the

result of the most important theorem in statistics, The Central Limit Theorem, which states, briefly,

that sums (and hence means) of many independent random variables are approximately normally distributed.

For this distribution, the probabilities of being within a fixed range of the average value are well

known and have been published (see Appendix I). These probabilities depend solely on the value of

the sampling error. For example, the probability of being within one sampling error is 68 percent;

for two sampling errors, it is 95 percent; for three sampling errors, it is 99.7 percent.

The implications are of fundamental importance to sampling theory. Suppose we have drawn a
simple random sample from a population, have computed the mean ȳ from the sample and have
estimated the true sampling error of the mean by means of

    se(ȳ) = √[(s²/n) × (N − n)/N]

How can we infer the precision of this particular sample result? If we set an interval based on
se(ȳ) around the sample estimate, then ȳ ± se(ȳ) will give an interval that covers the true mean
about two-thirds of the time. Similarly, ȳ ± 2 se(ȳ) gives a confidence interval for which the
assumption will be correct 95 percent of the time, and for ȳ ± 3 se(ȳ) it will be correct 99.7
percent of the time. To understand the concept, we present the following illustration.

3.4.6 Illustration

Consider again the same population of 12 individuals in Table 3.4. Let us find the percent of
sample averages in Table 3.5 which differ from the population average Ȳ by less than SE(ȳ), less
than 2 SE(ȳ), and less than 3 SE(ȳ). (We are using capital S instead of small s, as well as Ȳ
instead of ȳ, because we are dealing with a population and we therefore know its true variance and
its true mean.) This is the same as finding the percent of sample averages which fall within
Ȳ ± SE(ȳ), Ȳ ± 2 SE(ȳ), and Ȳ ± 3 SE(ȳ). Consider a sample of size 2. Using Table 3.5
with Ȳ = $2,675 and SE(ȳ) = $1,015, we have:

    Ȳ ± SE(ȳ)   = 2,675 ± 1,015 = (1,660, 3,690)
    Ȳ ± 2 SE(ȳ) = 2,675 ± 2,030 = (645, 4,705)

Table 3.5 shows that there are 42 sample averages that fall within the confidence interval (1660,

3690). That is, 63.6% of sample averages differ from the population average by less than one

sampling error. Similarly, there are 64 averages that fall within the confidence interval (645, 4705);

that is, about 97% of sample averages differ from the population average by less than two sampling

errors. It can easily be seen that 100% of sample averages differ from the population average by

less than three sampling errors. For the normal distribution, we have seen that the probability of

being within one standard (or sampling) error is 68%; for two standard errors, it is 95%; for three

standard errors it is 99.7%. This shows that even for small samples of size 2, the distribution of

sample results over all possible samples approximates very closely the normal distribution. For

larger samples, the results would conform to the normal distribution much more closely. The
percentages of sample averages in Table 3.5 which differ from the population average by less
than SE(ȳ), 2 SE(ȳ), and 3 SE(ȳ) are displayed in Table 3.7.


Table 3.7
Percent of Sample Averages Differing from the Population Average
by Less than 1, 2 and 3 Sampling Errors

Size n     SE(ȳ)    1 SE(ȳ)   2 SE(ȳ)   3 SE(ȳ)
  1       $1,505       75        92       100
  2        1,015       64        97       100
  3          786       65        96       100
  4          642       64        97       100
  5          537       65        97       100
  6          454       64        97       100
  7          383       65        97       100
NORMAL
DISTRIBUTION           68        95       99.7

Consider the distribution given in Table 3.5 of average income in all possible samples of size 7. A

graph of this distribution is shown in Figure 3.1. This figure appears approximately symmetric,

with a clustering of measurements about the midpoint of the distribution, tailing off rapidly as we

move away from the center of the histogram. Thus, the graph possesses the following properties:

Figure 3.1


(1) The sampling distribution of ȳ appears approximately normally distributed when the sample

size is large.

(2) The average of all possible sample averages equals the population average.

(3) The variance of the sampling distribution is equal to V(ȳ) = (S²/n) × (N − n)/N, which is
less than the population variance, S².

Property (1) above is the result of the Central Limit Theorem (CLT), one of the most fundamental

and important theorems in statistics. Briefly stated, the CLT shows that if x1, x2, ... , xn are

independent random variables having the same distribution with mean μ and variance σ², then for a
large enough sample, the variable

    z = (x̄ − μ) / (σ/√n)

has a standard normal distribution (i.e., mean zero and variance one).

3.4.7 Illustration

Unoccupied seats on flights cause the airlines to lose revenue. Suppose a large airline wants to

estimate the average number of unoccupied seats per flight over the past year. To accomplish this,

the records of 225 flights are randomly selected from the files, and the number of unoccupied seats

is noted for each of the sampled flights.

Estimate the mean number of unoccupied seats per flight during the past year, using a 90%
confidence interval (ignore the fpc). With a sample mean of ȳ = 11.6 seats and a sample standard
deviation of s = 4.1 seats, the interval is

    ȳ ± 1.645 s/√n = 11.6 ± 1.645 × 4.1/√225 = 11.6 ± 0.45

that is, at the 90% confidence level, we estimate the mean number of unoccupied seats per flight to
be between 11.15 and 12.05 during the sampled year.
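The arithmetic behind this interval can be checked in a few lines (taking ȳ = 11.6 and s = 4.1, the sample figures implied by the quoted interval):

```python
import math

n, y_bar, s = 225, 11.6, 4.1    # sample size and (implied) sample statistics
t = 1.645                       # normal deviate for 90% confidence

half_width = t * s / math.sqrt(n)
low, high = y_bar - half_width, y_bar + half_width
print(round(low, 2), round(high, 2))   # prints: 11.15 12.05
```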

Estimates are subject to both sampling errors and nonsampling errors. Sampling error arises

because information is not collected from the entire target population, but rather from some portion

of it. Through the use of scientific sampling procedures, however, it is possible to estimate from the

sample data the range within which the true population value (parameter) is likely to be with a

known probability.

Nonsampling error, on the other hand, is defined as a residual category consisting of all other errors

which are not the result of the data having been collected from only a sample. These include errors

made by respondents, enumerators, supervisors, office clerical staff, key coding operators, etc.

The total error is the sum of all errors about a sample estimate, both sampling and nonsampling,
both variable and systematic. An illustration of the composition of the total error follows:

    Total error
        Sampling error:      variable error and bias
        Nonsampling error:   variable error and bias

In practice, the bulk of sampling error consists of variable error, and by contrast the bulk of
nonsampling error is bias.

Mathematically, the total error is represented by the mean square error. In terms of expected values,
the mean square error of an estimate ŷ of the parameter Y is denoted by MSE(ŷ) and is given by:

    MSE(ŷ) = E(ŷ − Y)²

which is the average of the squares of deviations of all possible estimates from the parameter.
Recall that

    MSE(ŷ) = V(ŷ) + [Bias(ŷ)]²

If the estimates are unbiased, the mean square error is equivalent to the variance.
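The decomposition of the mean square error into variance plus squared bias can be verified numerically; the three "possible estimates" below are arbitrary illustration values, not data from the text:

```python
from statistics import mean

estimates = [2.0, 4.0, 6.0]   # hypothetical set of all possible sample estimates
theta = 3.0                   # the parameter being estimated

mse = mean((e - theta) ** 2 for e in estimates)
var = mean((e - mean(estimates)) ** 2 for e in estimates)
bias = mean(estimates) - theta

# Mean square error = variance of the estimates + square of the bias.
assert abs(mse - (var + bias ** 2)) < 1e-9
```

Setting theta equal to the average of the estimates makes the bias zero, and the MSE collapses to the variance alone.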


Exercises

3.1 Assume that you know the distribution of the number of cows in a population of eight farms,

as follows:

Farm 1 2 3 4 5 6 7 8

Number of Cows 4 5 0 3 2 1 1 0

a. Calculate the true mean number of cows per farm.

b. Calculate the true standard deviation and variance of the number of cows per farm.

c. Take all possible samples of two farms each and calculate the average number of cows per
farm for each sample.

d. Compute the average of the 28 means obtained in c. and compare it with the true mean.

e. Find the percent of sample means that fall within Ȳ ± SE(ȳ) and Ȳ ± 2 SE(ȳ). How do they
compare with the expected proportion, assuming the sampling distribution of ȳ is normal?

3.2 Consider the following distribution of N = 6 population values which represent "the number of

household persons residing in the housing unit." Random samples of size 2 are drawn from

this population.

Housing Unit (HU)    Persons in Household (HH)
        1                        5
        2                        6
        3                        7
        4                        8
        5                        9
        6                       10

a. Show that the mean of this population is Ȳ = 7.5 and its standard deviation is σ ≈ 1.71.


b. How many possible random samples of size 2 can be drawn from this population? List

them all and calculate their means.

c. Use the results obtained in b. to assign to each possible sample a probability and construct

the sampling distribution of the mean for random samples of size 2 from the given

population.

d. Calculate the mean and the standard deviation of the probability distribution obtained in c.

3.3 A simple random sample of 100 households is selected from a village of Nigeria. For
this village, the sample mean expenditure on electricity is 75 Naira per month, with s = 15
Naira. Find a 95% confidence interval for the true mean Ȳ. Interpret the interval (ignore the fpc).

3.4 A manufacturing company wishes to estimate the mean number of hours per month an

employee is absent from work. The company decides to randomly sample 320 of its

employees from a total of 5,000 employees and monitor their working time for 1 month. At

the end of the month the total number of hours absent from work is recorded for each

employee. If the mean and standard deviation of the sample are ȳ hours and s = 6.4

hours, find a 95% confidence interval for the true mean number of hours absent per month per

employee.


CHAPTER 4

SIMPLE RANDOM SAMPLING

BASIC THEORY

________________________________________________________________________________

4.1 SIMPLE RANDOM SAMPLING

To introduce the idea of a simple random sample, let us ask the following questions:

(1) How many distinct samples of size n can be drawn from a population of size N?

(2) What do we mean by a simple random sample?

(3) How do we take a random sample in actual practice?

To answer the first question, we use combinatorics, which allows us to choose n objects out of a
total of N in

    C(N,n) = N! / [n!(N − n)!]

ways.

To answer the second question, we make use of the answer to the first one and define a simple
random sample of size n (or more briefly, a random sample) selected from a population of size N as a
sample which is chosen in such a way that each of the C(N,n) possible samples has the same
probability, 1/C(N,n), of being selected.


For example, if a population consists of the N = 5 elements A, B, C, D and E (which might be the
incomes of five persons, the number of persons in five households, and so on), there are

    C(5,3) = 5! / (3! 2!) = 10

possible distinct samples of size n = 3; they consist of the elements ABC, ABD, ABE, ACD, ACE,
ADE, BCD, BCE, BDE, and CDE. If we choose one of these samples in such a way that each has
the probability 1/10 of being chosen, we call this sample a simple random sample.
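This count and the list itself are easy to verify:

```python
from itertools import combinations
from math import comb

# All distinct samples of size 3 from the population {A, B, C, D, E}.
samples = [''.join(c) for c in combinations('ABCDE', 3)]
print(samples)   # ['ABC', 'ABD', 'ABE', 'ACD', 'ACE', 'ADE', 'BCD', 'BCE', 'BDE', 'CDE']

assert len(samples) == comb(5, 3) == 10
# Under simple random sampling, each of the 10 samples has probability 1/10.
```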

With regard to the third question of how to take a random sample in actual practice, we could, in

simple cases like the one above, write each of the possible samples on a slip of paper, put these

slips into a hat, shuffle them thoroughly, and then draw one without looking. Such a procedure is

obviously impractical, if not impossible, given the size of most populations; we mentioned it here

only to make the point that the selection of a random sample must depend entirely on chance.

Fortunately, we can take a random sample without actually resorting to the tedious process of listing

all possible samples. We can list instead the N individual elements of a population, and then take a

random sample by choosing the elements to be included in the sample one at a time, making sure that

in each of the successive drawings each of the remaining elements of the population has the same

chance of being selected. The selection may be accomplished by either sampling with replacement

or sampling without replacement. In sampling from a finite population, the practice usually is to

sample without replacement. Most of the theory which will be discussed is based on this method.

For example, to take a random sample of 12 of a city's 273 drugstores, we could write each store's

name (address, or some other business identification number) on a slip of paper, put the slips of

paper into a box or a bag and mix them thoroughly, and then draw (without looking) 12 of the slips

one after the other without replacement.

Even this relatively easy procedure can be simplified in actual practice; usually, the simplest way to

take a random sample from a population of N units is to refer to a table of random numbers (see

Appendix III). In practice, however, the members of the population are sorted according to certain

rules and then a systematic selection of n elements is carried out. The sample thus obtained is, for all

practical purposes, a simple random sample.

4.1.1 Procedure for Selecting a Simple Random Sample (Use of Random Number Tables)

A practical procedure of selecting a random sample is to choose units one by one with the help of a

table of random numbers. Tables of random numbers are used in practical sampling to avoid the

necessity of carrying out some operation such as selecting numbered chips from an urn to designate

the units to be included in the sample. Moreover, experience has shown that it is practically

impossible to mix a set of chips thoroughly between each selection, that devices such as cards or dice


have imperfections in their manufacture, that in thinking of numbers at random people tend to favor

certain digits, etc. Consequently, such methods do not, in fact, give each member of the population

an equal chance of selection. The use of a table of random numbers, however, reduces the amount of

work involved, and also gives much greater assurance that all elements have the same probability of

selection.

Many tables of random numbers are readily available. There are several in the series of Tracts for

Computers, notably tables compiled by Tippett, and by Kendall and Smith. The RAND Corporation

has published A Million Random Digits. Sets are also available in Statistical Tables by Fisher and

Yates, and in other sources. Many of these publications describe the methods of compilation and the

uses of the tables. Some microcomputer packages, such as LOTUS spreadsheets, also have a random
number generator which can be used to produce pseudo-random numbers; these generators,
however, provide random numbers between 0 and 1. A table of random numbers is given

in Appendix III.

Typically, these tables show sets of random digits arranged in groups both horizontally and

vertically. To select a set of random numbers, one can start anywhere on a page. Furthermore, after

selecting the first number, one can proceed down a column, across a row, up a column, or in any

other pattern that is desired.

4.1.2 Illustration

To obtain a random number between 1 and a given number, for example between 1 and 273, proceed

as follows: Notice how many digits are in the upper limit number (for 273 there are three digits).

Use this number of columns, counting from the first (or a predetermined) column, and start at the top

(or on a predetermined line). Each line in the set of three columns has a 3-digit number. Choose the

first of these which is between 001 and the given number, inclusive. That is, between 001 and 273 in

our example. Discard numbers which are greater than 273 and discard 000. If more than one

random number is desired, continue down the three columns, choosing each 3-digit number which is

between 001 and 273, until the desired number of random numbers has been obtained. If a number
is chosen two or more times, use it only once.

1089 8719

9385 7902

6934 8660

0052 1007

5736 9249

1901 5988

5372 6212

Within the limits of the numbers in the examples which follow, we shall select random numbers

from the above table, using a selected number only once.


Example A: Select 3 numbers at random between 1 and 10. First choose an arbitrary column,

having decided to let 0 stand for 10. Suppose we choose the fifth column. The first number in the

column is 8; the second number is 7; the third is 8 again. Since 8 has already been selected, we skip

it and take the next number which is 1. The three numbers selected, therefore, are 8, 7, and 1.

Example B: Select 5 numbers at random between 1 and 80. Suppose we take the first two columns

as our choice of a start. First take 10; discard 93 since it is not between 01 and 80; take 69; discard

00 (which represents 100); and take 57, 19, and 53.

If we use a table of random numbers frequently, we should not always use the same part. For

example, if the first random number is always taken from the same column of the same page, the

same set of numbers would be used repeatedly, and we would not get proper randomization. If

tables of random numbers are used frequently, one can continue from the last random number

selected for the previous sample, or a new starting point should be taken for each use.
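On a computer, the slips-in-a-hat and random-number-table procedures collapse into one library call. The sketch below draws 12 of the 273 drugstores mentioned earlier, without replacement; the fixed seed is an assumption made only so the illustration is reproducible:

```python
import random

rng = random.Random(273)          # fixed seed: for a reproducible illustration only
stores = range(1, 274)            # drugstores numbered 1 through 273

# 12 distinct stores, each possible sample equally probable.
sample = rng.sample(stores, 12)
print(sorted(sample))

assert len(set(sample)) == 12
assert all(1 <= s <= 273 for s in sample)
```

For production work one would omit the seed (or draw it from a physical source), for the same reason the text warns against always starting at the same place in a random number table.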

4.2 NOTATION

The notation defined in this section is appropriate not only for simple random sampling, but also for

most designs. It provides a key to the system used throughout this manual. Capital letters refer to

population values and lower case (small) letters denote corresponding sample values. A bar (-) over

a letter denotes an average or mean value and (^) over a letter indicates an estimate. We shall use the

following notation:

Ȳ = (ΣYi)/N = population mean

ȳ = (Σyi)/n = sample mean

σ² = Σ(Yi − Ȳ)²/N = population variance

S² = Σ(Yi − Ȳ)²/(N − 1) = population variance

and

s² = Σ(yi − ȳ)²/(n − 1) = sample variance

CV = coefficient of variation

As we mentioned earlier, we shall use, unless otherwise mentioned, S² for the population variance.

The difference between S² and F2 disappears for large populations. In general, the population

variance, S², is not known. The sample variance, s², will be used as its estimate; this will hold

throughout the course regardless of the sampling scheme being discussed. It should be noted that in

simple random sampling, s² is an unbiased estimate of S².

The sample estimate of the population total value, Y, is denoted by Ŷ and can be written as:

    Ŷ = N ȳ    (4.1)

The variance of the sample mean and of the estimated total, their sampling errors, and the
corresponding estimates computed from a single sample are:

    V(ȳ) = (S²/n) × (N − n)/N    (4.2)

    V(Ŷ) = N² V(ȳ)    (4.3)

    SE(ȳ) = √V(ȳ)    (4.4)

    v(ȳ) = (s²/n) × (N − n)/N    (4.5)

    se(ȳ) = √v(ȳ)    (4.6)

4.2.2 Illustration

Let us verify equation (4.3) with the data for the 12 individuals discussed previously (see Chapter 3).
We have already used equation (4.4) for the means of samples of sizes 1 and 2 in illustration 3.4.3,
and their standard errors for different sizes were given in Table 3.6 of Chapter 3. Using
this table, the total income of the 12 individuals can be estimated. Equation (4.3) can be expressed as:

    SE(Ŷ) = N × SE(ȳ)

Using Table 3.6 of Chapter 3, the sampling error of the estimated total income for samples of size 2
is:

    SE(Ŷ) = 12 × $1,015 = $12,180


4.2.3 Relative Error

Often we wish to consider not the absolute value of the standard error, but its value in relation to the

magnitude of the statistic (mean, total, etc.) being estimated. For this purpose, one can express the

standard error as a proportion (or a percent) of the value being estimated. This form is called the

relative standard error, or coefficient of variation and is denoted by the symbol CV. The true

population CV (for a given characteristic or variable) is defined as follows:

    CV = S / Ȳ

To estimate the true CV, we use the following formula, which uses data from a sample:

    cv = s / ȳ

One advantage of expressing error as a coefficient of variation is that it is unitless, unlike absolute

measures, like the standard deviation and the sampling error. The CV is useful when making

comparisons because no units enter into play. The population CV refers to the relative sampling

error of means of samples of 1 unit (that is, the population standard deviation expressed as a

proportion of the population mean) and it is denoted simply by CV (not followed by a parenthesis).

Thus, for the estimate of the total, the true coefficient of variation is:

    CV(Ŷ) = SE(Ŷ) / Y    (4.7)


Similarly, for the estimate of the sample mean, the coefficient of variation is:

    CV(ȳ) = SE(ȳ) / Ȳ    (4.8)

The corresponding formulas for the estimated (obtained from a sample) coefficients of variation are:

    cv(Ŷ) = se(Ŷ) / Ŷ    (4.9)

    cv(ȳ) = se(ȳ) / ȳ    (4.10)

The standard error of the estimated total is N times that of the mean, while the coefficients of

variation of the two estimates are the same; this result is, upon reflection, not unexpected. An

estimated total is obtained by multiplying the sample mean (an estimate) by the number of elements

in the population (a known number); the only source of error is the sample mean. Therefore, we

should expect that, when expressed as a proportion or percentage, the error in the total would be the

same as that in the mean; however, when the error in the total is expressed in absolute terms, it

would be N times as large as the error in the mean, since N is the factor of multiplication.

The big advantage of the coefficient of variation is that it permits comparison of two distributions of

values even though they may be totally unrelated. For example, one could compare the variability of

length of mice tails to weight of elephants. This is possible because variability is expressed relative

to the mean, that is, it is the average variability per unit of mean.
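For instance, two completely unrelated distributions can be compared on the CV scale; the figures below are invented for illustration, not measurements from the text:

```python
from statistics import mean, stdev

def cv(data):
    """Coefficient of variation: s / y_bar, expressed as a percent."""
    return 100 * stdev(data) / mean(data)

mouse_tails_cm = [7.1, 8.0, 7.6, 8.4, 7.9]        # hypothetical data
elephants_kg = [4900, 5400, 5100, 6000, 5600]     # hypothetical data

# The units (cm, kg) cancel, so the relative variabilities compare directly.
print(round(cv(mouse_tails_cm), 1), round(cv(elephants_kg), 1))
```

Whichever series shows the larger CV is the more variable one relative to its own mean, even though the absolute standard deviations differ by orders of magnitude.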

Another way to look at the coefficient of variation is to consider it as a measure of dispersion for
relative deviations. Recall that the variance of Yi was given by

    σ² = Σ(Yi − Ȳ)² / N

If we now consider the relative deviations

    (Yi − Ȳ) / Ȳ

square them, add them, and then average them over N, we get the following expression:

    (1/N) Σ [(Yi − Ȳ)/Ȳ]² = σ² / Ȳ²

which is called the relative variance of the distribution or simply the relvariance.

The square root of this last expression is the population coefficient of variation mentioned before.

4.3 ESTIMATION OF PROPORTIONS

An important class of statistics for which the formulas for variance and the formulas for determining

the size of sample become particularly simple is the estimation of the proportion of units having a

certain characteristic.

Proportions arise in two ways in statistical analysis. First of all, we are frequently interested in a

statistic that is a proportion, rather than a total or an average; for example, the proportion of the

population that is unemployed, or the percentage of families with income greater than a certain

amount, or the proportion of business firms interested in purchasing a particular product. Secondly,

it may be desired to classify a population into a number of groups, and to find the percentage of the

total population in each of these groups. The groups may have a natural ordering as in distribution

by age (0 to 4 years, 5 to 9, 10 to 14, etc.) or income classes; or they may be groups having no natural

order, such as those in an industrial classification of business firms, where the groups can be

arranged in a number of ways. The analysis is the same whenever the proportion of the total in each

group is the statistic to be measured.


4.3.2 Relationship to Previous Theory

Suppose we think of the total population and the sample in the following way. Consider a particular

class of units in which we are interested, and use the following notation:

A = Number of units in that class in the population

P = Proportion in that class in the population (P = A/N)

Q = Proportion not in that class in the population (Q = 1 - P)

a = Number of units in that class in the sample

p = Proportion in that class in the sample (p = a/n)

q = Proportion not in that class in the sample (q = 1 - p).

All of the formulas discussed in previous lectures can be applied to this particular case by

considering each member of the population as having a characteristic which can have only one of

two values, either 0 or 1. If the member is in a particular class in which we are interested, the value

assigned is 1; if the member is not in the class, the value is 0. Examining the entire population, we

can see that the A members of the class each have a value of 1; the rest have a value of 0. Adding up

the values for all elements of the population, we get A. In other words, A can be considered as the
population total of this 0-1 characteristic, and P = A/N as its population mean, in the same way as
Y and Ȳ. We can now use the previous formulas. It turns out that they take a particularly simple form.

In sampling for proportions, the following formulas are applicable (with simple random sampling):

    p = a/n    and    Â = N p    (4.11)

That is, an estimate of the proportion in the population is obtained by using the sample proportion,
and an estimate of the total number of units having the characteristic is obtained by multiplying the
sample proportion by the total number of units in the population. Also

    σ² = P − P² = P Q    (4.12)

The population variance is PQ. Note that it is the variance of the population distribution giving the


value of 1 or 0 to an element depending on whether or not it is in the class (whether it has the

attribute in question). It can still be estimated by pq, unless n is very small (for example n < 30) in

The variance of the estimate of the proportion which is computed from all samples of size n is

    V(p) = (PQ/n) × (N − n)/N    (4.13)

The estimate of this variance which is made from a single sample of n observations is

    v(p) = [pq/(n − 1)] × (N − n)/N    (4.13a)

See equations (4.4) and (4.6). These are the same formulas given previously for the mean, with PQ
substituted for S², and with npq/(n − 1) substituted for s².

Similarly, the formulas of the previous section for the relative standard error (coefficient of
variation) of a mean and for the sampling error of an estimated total become:

    CV(p) = SE(p) / P    (4.14)

and,

    SE(Â) = N × SE(p)    (4.15)

Again the relative standard error of the total is the same as that of the mean.

The confidence interval for the proportion is derived on the same assumptions as for the quantitative

characteristics, namely that the sample proportion p is normally distributed. From (4.13a) for the

estimated variance of p, one form of the normal approximation to the confidence interval for p is:

    p ± t √v(p) = p ± t √{[pq/(n − 1)] × (N − n)/N}    (4.16)

where the value t depends on the level of confidence desired (see Section 4.4 of Chapter 3).


4.3.4 Illustration

Estimate of sampling error.--Suppose that the proportion of farms that grow maize in a given area is

0.40; what would be the sampling error in estimating this proportion from a random sample of 500

farms, if the total number of farms in the area is 10,000? In this case,

N = 10,000 P = 0.40

n = 500 Q = 0.60

We have

    PQ = (0.40)(0.60) = 0.24

Consequently,

    SE(p) = √[(0.24/500) × (10,000 − 500)/10,000] = √0.000456 = 0.021

How is the figure of 0.021 to be interpreted? This means that if we establish an interval of
0.40 ± 0.021 around the true proportion (or 0.379 to 0.421), there is a reasonably good chance (68 percent)

that a sample of 500 farms will give a proportion somewhere between 0.379 and 0.421. If we double

the interval to get a range of 0.358 to 0.442, the chance is about 95 percent that the sample estimate

will be within that range. If an interval based on three times 0.021, (or 0.063) is used, the chance is

0.997 (or nearly certain) that the sample estimate will be within that range. In normal practice, it is

customary to use a 2-S range (two standard/sampling errors) as providing sufficient confidence in

the accuracy of the estimates. If very important decisions are to be based on the results of the survey,

and we wish to be almost absolutely sure of the range within which the sample estimate will lie, we

can use a 3-S level. It is difficult to conceive of cases in which 3-S would not be sufficient.
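The computation in this illustration, and the 1-, 2- and 3-sampling-error ranges built from it, can be reproduced as follows:

```python
import math

N, n, P = 10_000, 500, 0.40
Q = 1 - P

# SE(p) = sqrt( (PQ/n) * (N - n)/N )
se_p = math.sqrt((P * Q / n) * (N - n) / N)
assert round(se_p, 3) == 0.021

# The 1-S, 2-S and 3-S ranges around the true proportion.
for k in (1, 2, 3):
    print(k, round(P - k * se_p, 3), round(P + k * se_p, 3))
```

(The printed bounds use the unrounded sampling error, so they may differ from the text's figures by a thousandth, since the text rounds 0.021 first.)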

In this example, both the proportion (0.40) and the chance of the sample estimate being within a

certain range around this proportion were known. In practice, we are usually interested in the

converse of this situation, in which we do not know the true proportion but we do have a sample

estimate of 0.40 based on a sample of 500 farms out of 10,000. We wish to establish ranges around

the sample figure which will be expected to include the true mean. For all practical purposes, the

same statements can be made as before by substituting the term "true figure" for "sample estimate."

That is, if the sample shows that 0.40 of the farms grow maize and we establish a range of
0.40 ± 0.021, the chances are about 68 percent that this range will include the true figure; the

chances are about 95 percent that the interval 0.358 to 0.442 will include the true figure; etc.

Frequently, the proportion to be estimated is a percentage, not of the total population, but of a

particular class. For example, we may be interested not in unemployment expressed as a percentage

of the total population, but as a percentage of persons in the labor force; or we may need to know the

proportion of firms with more than 5 employees in a particular industry. In such cases, a very close

approximation to an exact analysis can be made by using the formulas listed above, but interpreting

39

the numbers N and n as applying to the class in which we are interested. That is, N would not be

considered the total population but would be the number of persons in this class (for example, the

total number of persons in the labor force) as estimated from the sample; n would be the number of

sample cases in this class; a would be the number of sample cases in the subset (for example, the

number of unemployed).

Table 4.1 shows the value of √(PQ/n) for specified values of P and n. As described in sections 4.3.7 and 4.3.8 below, we can use the simplified formula

    SE(p) = √(PQ/n)                                                    (4.17)

to compute the standard error of the proportion of units having a certain attribute, if the sample is an

unrestricted (simple) random sample and if N is so large relative to n that the factor (N-n)/N in the

formula has a value very close to 1.

Since the true proportion in the population (P) is not known, the estimate from the sample (p) may

be used in equation (4.17) to give an estimate of the sampling error of p:

    se(p) = √(pq/n)                                                    (4.18)

Most samples are stratified; that is, they are not simple random samples. We shall see later that this

has the effect of making the sampling error smaller than it would be for a simple random sample of

the same size. However, most samples used in surveys are also clustered and we shall also see that

this has the opposite effect of making the sampling error larger than it would be for a simple random

sample of the same size. When the sample is both stratified and clustered, the formulas for the

standard error become more complex.

Sometimes it is not possible to work out the exact formulas, but a rough estimate of the standard

error can be obtained by using the simple formula of equation (4.17) with an allowance for the

expected net effect of departures from randomness in the sample design. If the units of analysis are

clustered into rather small groups--for example, 5 housing units or 25 persons in a cluster, and the

persons within a cluster are rather similar, as in a cluster located in a rural area--the standard error of

a proportion as read from Table 4.1 might be multiplied by a factor such as 1.25. This factor is a

design effect. In a larger cluster, such as a city block with 40 or 50 housing units, the factor to be

applied to Table 4.1 might be 1.75, even though the persons within the cluster are less alike in an

urban area than in a rural area.

The size of the design effect to be used depends on the sample design and the nature of the

population; it can sometimes be roughly estimated by an experienced sampling statistician, using

40

past experience and mathematical formulas involving the “intraclass correlation.”

Table 4.1

SAMPLING ERROR OF AN ESTIMATE OF A PROPORTION

IN SIMPLE RANDOM SAMPLING

P = Proportion of units having a characteristic (Q = 1-P has the same standard error)

n = number of .001 .002 .01 .02 .03 .04 .05 .10 .15 .20 .25 .30 .40

sample cases or or or or or or or or or or or or or .50

.999 .998 .99 .98 .97 .96 .95 .90 .85 .80 .75 .70 .60

50 .0045 .0063 .0141 .0198 .024 .028 .031 .042 .051 .057 .061 .065 .069 .071

100 .0032 .0045 .0099 .0140 .017 .020 .022 .030 .036 .040 .043 .046 .049 .050

200 .0022 .0032 .0071 .0099 .012 .014 .016 .021 .025 .028 .031 .033 .035 .035

300 .0018 .0026 .0058 .0081 .0099 .012 .013 .017 .021 .023 .025 .027 .028 .029

400 .0016 .0023 .0050 .0070 .0086 .010 .011 .015 .018 .020 .022 .023 .024 .025

500 .0014 .0020 .0045 .0063 .0076 .0089 .0098 .013 .016 .018 .019 .021 .022 .022

600 .0013 .0018 .0041 .0057 .0070 .0082 .0090 .012 .015 .016 .018 .019 .020 .020

700 .0012 .0017 .0038 .0053 .0065 .0076 .0083 .011 .014 .015 .016 .017 .019 .019

800 .0011 .0016 .0035 .0050 .0061 .0071 .0078 .011 .013 .014 .015 .016 .017 .018

1000 .0010 .0014 .0032 .0044 .0054 .0063 .0070 .0095 .011 .013 .014 .015 .015 .016

1200 .0009 .0013 .0029 .0040 .0049 .0058 .0064 .0087 .010 .012 .013 .013 .014 .014

1500 .0008 .0012 .0026 .0036 .0044 .0052 .0057 .0077 .0093 .010 .011 .012 .013 .013

1700 .0008 .0011 .0024 .0034 .0042 .0049 .0053 .0073 .0087 .0097 .011 .011 .012 .012

2000 .0007 .0010 .0022 .0031 .0038 .0045 .0049 .0067 .0081 .0090 .0097 .010 .011 .011

2500 .0006 .0009 .0020 .0028 .0034 .0040 .0044 .0060 .0072 .0080 .0087 .0092 .0098 .0100

3000 .0006 .0008 .0018 .0026 .0031 .0036 .0040 .0055 .0066 .0073 .0079 .0084 .0090 .0092

3500 .0005 .0008 .0017 .0024 .0029 .0034 .0037 .0051 .0061 .0068 .0073 .0078 .0083 .0084

4000 .0005 .0007 .0016 .0022 .0027 .0032 .0035 .0047 .0057 .0063 .0068 .0073 .0077 .0079

4500 .0005 .0006 .0015 .0021 .0025 .0030 .0033 .0045 .0054 .0060 .0065 .0069 .0073 .0074

5000 .0004 .0006 .0014 .0020 .0024 .0028 .0031 .0042 .0051 .0057 .0061 .0065 .0069 .0071

In practice the sample value p would be used, inasmuch as the population value P would not be

known.

For values of n greater than 5,000, when n is multiplied by 100, the standard error is divided by

10.
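The entries of Table 4.1 are simply √(PQ/n), so individual cells can be spot-checked with a short Python sketch (the helper name is ours):

```python
import math

def table_entry(P, n):
    # Sampling error of a proportion, sqrt(P*Q/n), with the fpc ignored,
    # as tabulated in Table 4.1.
    return math.sqrt(P * (1 - P) / n)

print(round(table_entry(0.50, 500), 3))    # 0.022
print(round(table_entry(0.10, 1000), 4))   # 0.0095
print(round(table_entry(0.02, 200), 4))    # 0.0099
```

The footnote's rule also follows directly: multiplying n by 100 divides each entry by √100 = 10.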

41

The design effect or DEFF is the ratio of the variance of the estimate obtained from the more

complex sample (described later in this text) to the variance of the estimate obtained from a

simple random sample of the same size. For instance, if Var_c(ȳ) is the variance of the estimate, say ȳ, obtained from a complex sample, and Var_srs(ȳ) is the variance of the same estimate from a simple random sample of the same size, then

    DEFF = Var_c(ȳ) / Var_srs(ȳ)    and    Var_c(ȳ) = DEFF × Var_srs(ȳ)

This approach is commonly used by practical samplers. For many situations where we can not

estimate directly the variance of the estimate, we may be able to guess fairly well both the

element variance S2 and DEFF from experience with similar past data. This comprehensive

factor attempts to summarize the effects of various complexities in the sample design especially

those of clustering.
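The relation between a complex design's variance and the SRS variance can be sketched in Python. The design effect of 1.5 below is an assumed illustrative value, and the function names are ours:

```python
import math

def var_srs(p, n):
    # Variance of a sample proportion under SRS (fpc ignored): PQ/n.
    return p * (1 - p) / n

def var_complex(p, n, deff):
    # Approximate variance under a complex design: DEFF times the SRS variance.
    return deff * var_srs(p, n)

# Hypothetical design effect of 1.5 for a clustered sample of 500:
p, n, deff = 0.40, 500, 1.5
se_srs = math.sqrt(var_srs(p, n))
se_cx = math.sqrt(var_complex(p, n, deff))
print(round(se_srs, 4), round(se_cx, 4))  # 0.0219 0.0268
```

Note that the factor applied to the standard error is √DEFF; the multipliers 1.25 and 1.75 suggested earlier for small and large clusters are of this square-root kind, so a DEFF of 1.5 inflates the standard error by about 22 percent.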

The exact formula for the relative variance (square of the coefficient of variation) of a mean for a simple random sample is

    [cv(ȳ)]² = [(N-n)/N] × (CV)²/n,   where (CV)² = S²/Ȳ²

or, for a proportion,

    [cv(p)]² = [(N-n)/N] × Q/(nP)

The only way the size of the total population comes into the formula is in the expression (N-n)/N. This is usually called the finite population correction factor (fpc). If the population were infinite, this factor would be 1 and the formulas would be much simpler:

    [cv(ȳ)]² = (CV)²/n    or    [cv(p)]² = Q/(nP)

42

The value of (N-n)/N is equal to (1 - f), where f = n/N is the sampling

rate. If the sampling rate is small, say less than 0.05, the effect of the finite population correction

factor is very small and, for all practical purposes, the finite population correction factor can be

ignored.

With large populations and small sampling rates, the fpc can be ignored and the formulas become

simpler.

Simplified Formulae

    Variance of a mean:                              V(ȳ) = S²/n
    Variance of a proportion:                        V(p) = PQ/n
    Variance of a total:                             V(X̂) = N²S²/n
    Variance of the estimated number of
        units having an attribute:                   V(Â) = N²PQ/n
    Coefficient of variation of a mean
        or of a total:                               cv = CV/√n
    Coefficient of variation of a proportion or
        of the estimated number of units
        having an attribute:                         cv = √(Q/(nP))

43

CHAPTER 5

SIMPLE RANDOM SAMPLING

ESTIMATION OF SAMPLE SIZE

_______________________________________________________________________

5.1 SPECIFIC CONSIDERATIONS FOR DETERMINING THE

SAMPLE SIZE

One of the first questions which a statistician is called upon to answer in planning a sample survey

refers to the size of the sample required for estimating a population parameter with a specified

precision. Making a decision about the size of the sample for the survey is important. Too large a

sample implies a waste of resources, and too small a sample diminishes the utility of the results.

When considering sample size determination, there are three very important concerns: ACCURACY,

PRACTICALITY, and EFFICIENCY.

5.1.1. Accuracy

Accuracy can be defined as an inverse measure of the total error. Total error is the sum of sampling

error (SE) and nonsampling error (NSE). Sampling error arises because only a part of the population

is observed, and not all of it. The terms PRECISION and RELIABILITY are associated with

sampling error. Estimator A is more precise or more reliable than estimator B if the sampling error

of A is smaller than the sampling error of B. Nonsampling errors are usually biases which are very

often due to poor quality control of the survey operations (poor questionnaire design; interviewers

that are not well trained; response errors; etc.)

5.1.2. Practicality

To obtain an accurate estimate, both sampling and nonsampling errors must be reduced. However,

accuracy may come into conflict with practicality because:

a. to reduce sampling errors and increase precision, the sample size must be large.

b. too large a sample can impose an excessive burden on the limited resources available

(and resources are usually very limited) and increase the likelihood of nonsampling

errors.

5.1.3. Efficiency

A further concern is that a given sample size can produce different levels of precision depending on

which sampling techniques are chosen. This concept is known as the statistical efficiency of the

design. The most efficient design is the one that gives the most precision for the same sample size.

Therefore, expert sample design is needed in the determination of the optimal sample size.

44

Example 5.1

A population consists of N = 5000 persons. A simple random sample without replacement (SRS-

WOR) of size n = 50 included 10 persons of Chinese descent.

A 95% confidence interval for P, the proportion of persons of Chinese descent in the population, is:

    p ± 1.96 √[((N-n)/N) × pq/(n-1)] = 0.20 ± 1.96 √[(4,950/5,000)(0.20)(0.80)/49]
                                     = 0.20 ± 0.111, or (0.089, 0.311)

The conclusion is that between 8.9% and 31.1% of the population is of Chinese descent. This

interval is too wide to be useful. There are two ways in which a narrower interval could be obtained:

- by increasing the sample size
- by lowering the confidence level (for example, from 95% to 90%)

There is a point at which lowering the confidence level is not attractive. We shall consider the

problem of determining the sample size necessary to produce a fixed level of precision.

The following eight steps are taken into account when determining the sample size. We will study

each one in detail.

(7) Population subdivisions for which separate estimates of a given precision are

required. These are also called domains of estimation.

45

5.2 Degree of Precision

The precision of an estimate refers to the amount of variable error, mainly sampling error, contained

in an estimate. To lower the sampling error, that is, to increase the precision, we want n to be

sufficiently large. Therefore, we decide on a target value for the precision of the estimate. The

degree of precision desired can be stated in terms of:

(1) The absolute error E

We require

    Pr(|θ̂ - θ| ≤ E) = 1 - α

where θ̂ is an estimate of the parameter θ and (1 - α) is the degree of confidence desired. The absolute error E is measured in the same unit used to measure the variable. For example, E = 5 hectares or E = $10,000 or E = 25 persons.

(2) The relative error RE

This is E expressed as a proportion (or percentage) of the true value of the parameter being estimated. For example, if E = 5 hectares and the true value of the parameter is 100 hectares, then RE = 5/100 = 0.05, or 5%.

(3) The target coefficient of variation (cv) for the estimate (v0)

We set the cv (also known as the relative standard error) for the estimate equal to a

target value v0. For example, we can have v0 = 0.05, that is, a target relative standard error of 5 percent.

Depending on which of the three ways we use to specify the precision, the formula for

n will be different. The values of E, RE and α are usually decided by the user of the

data in conjunction with the statistician.

5.3 Formula that Connects n (sample size) with the Desired Degree of Precision

46

S² = the population variance; an estimate of it could be used instead.

n = the desired sample size

k = 1 for 68% confidence

2 for 95% confidence

3 for 99.7% confidence

E = Absolute error.

RE = Relative error

Note: the level of confidence states the probability that the n determined will provide the degree

of precision specified. For example, a 95% level of confidence means that, except for a

small chance (5%), we can be 95% certain that the precision specified will be reached with

the calculated n. This is equivalent to saying that the acceptable risk is 5% that the true θ

will lie outside of the range specified in the confidence interval.

The sampling error of a mean from a simple random sample is given by:

    SE(ȳ) = √[((N-n)/N) × S²/n]

Now, E = k × SE(ȳ), where k is a multiple of the sampling error, selected to achieve the specified degree of confidence. Therefore, if we substitute SE(ȳ) for (E/k) and solve for n, we get:

    n = k²NS² / (NE² + k²S²)                                           (5.1)

which can also be written as

    n = n0 / (1 + n0/N),   where n0 = k²S²/E²                          (5.2)

47

If the population size is large and n ≤ 0.05N, the finite population correction factor in equation (5.1)

can be ignored because its effect would be minimal. In this case, we have:

    n = n0 = k²S²/E²                                                   (5.3)

Example 5.2

Consider a population consisting of 1,000 farms for which the population variance of the number of

cattle per farm is 250 (N = 1,000 and S² = 250). Let us estimate the average number of cattle per

farm from a sample; we wish to have reasonable confidence that the estimate will be close to the true

value. Suppose the sample estimate is to be in error by no more than 1 (one head of cattle) from the

true average, and we require an assurance of 95 chances out of 100 that the error will be no larger

than 1. In this case,

    N = 1,000,  S² = 250,  E = 1,  k = 2

    n = k²NS² / (NE² + k²S²) = (4)(1,000)(250) / [(1,000)(1) + (4)(250)]
      = 1,000,000 / 2,000 = 500 farms

If in the same situation we are satisfied with an error of not more than 3, with a confidence level of 95 percent, the only change in the formula would be in the values of E and E², as follows:

    E = 3 and E² = 9, giving n = 1,000,000 / [(1,000)(9) + (4)(250)] = 1,000,000 / 10,000 = 100 farms.
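Equations (5.1) through (5.3) can be wrapped in a small Python function; this sketch (the function name is ours) reproduces both answers of Example 5.2:

```python
def n_for_mean(S2, E, k, N=None):
    # Sample size to estimate a mean within absolute error E, with
    # confidence multiplier k (k = 2 for ~95%).  Uses the fpc form
    # of equations (5.1)-(5.2) when N is given, equation (5.3) otherwise.
    n0 = k ** 2 * S2 / E ** 2
    return n0 if N is None else n0 / (1 + n0 / N)

print(n_for_mean(250, 1, 2, N=1000))  # 500.0  (Example 5.2, E = 1)
print(n_for_mean(250, 3, 2, N=1000))  # ~100   (Example 5.2, E = 3)
print(n_for_mean(30, 2, 2, N=2000))   # ~29.6, i.e. about 30 (Example 5.3)
```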

Example 5.3

We wish to estimate the average age of 2,000 seniors on a particular college campus. How large a

SRS must be taken if we wish to estimate the age within 2 years from the true average, with 95%

confidence? Assume S2 = 30.

E = 2 and k = 2, so n = k²S²/E² = (4)(30)/4 = 30. Since 30/2,000 is well below 0.05, the fpc can be ignored and a sample of 30 seniors suffices.

48

5.3.2. Sample size needed to estimate a proportion with absolute error E

The sample size n to estimate a population proportion P is obtained from equation (5.2); in this

equation, S² = NPQ/(N-1), but we’ll use the approximation S² = PQ (i.e., we’ll assume N is big enough so that N/(N-1) is very close to 1):

    n = k²NPQ / (NE² + k²PQ)                                           (5.4)

And for a large population size (n ≤ 0.05N), we have from equation (5.3),

    n = k²PQ / E²                                                      (5.5)

Example 5.4

Refer to Example 5.1. Suppose we would like to estimate P, the proportion of persons of

Chinese descent to within ± 3%, with 95% confidence. What sample size do we have to choose to

achieve this target? Assume P to be no larger than 1/2.

    n = k²NPQ / (NE² + k²PQ) = (4)(5,000)(0.25) / [(5,000)(0.0009) + (4)(0.25)]
      = 5,000 / 5.5 ≈ 909 persons

Now, let’s assume that we know that P ≤ 0.25. What is the required sample size?

    n = (4)(5,000)(0.1875) / [(5,000)(0.0009) + (4)(0.1875)] = 3,750 / 5.25 ≈ 714 persons
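A quick check of Example 5.4 in Python (the helper name is ours):

```python
def n_for_proportion(P, E, k, N=None):
    # Sample size to estimate a proportion P within absolute error E
    # (equations (5.4)-(5.5); the fpc is applied when N is given).
    n0 = k ** 2 * P * (1 - P) / E ** 2
    return n0 if N is None else n0 / (1 + n0 / N)

print(n_for_proportion(0.50, 0.03, 2, N=5000))  # ~909 (conservative P = 1/2)
print(n_for_proportion(0.25, 0.03, 2, N=5000))  # ~714 (when we know P <= 0.25)
```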

5.3.3. Sample size needed to estimate a total with absolute error E

Using equation (4.3), and letting (E/k) equal the standard error of the estimated total, we get the following formula for n:

    n = k²N²S² / (E² + k²NS²)                                          (5.6)

49

For a large population size (small sampling fraction), this reduces to

    n = k²N²S² / E²                                                    (5.7)

5.3.4. Sample size needed to estimate the number of units that possess a certain attribute

with absolute error E

To obtain the n necessary to estimate A, the number of units that possess a certain characteristic,

simply substitute PQ in place of S2 in equations (5.6) and (5.7).

5.3.5. Sample size formulas when the error is expressed in relative terms (RE)

We can obtain formulas for estimates when the desired error is expressed in relative terms instead of

absolute terms. For relative errors (RE), since RE is a proportion of the quantity being estimated, substitute (RE/k) for the coefficient of variation of the estimate (in equation (4.7) or (4.8)). We will denote by cv the estimated coefficient of variation. The true population coefficient of variation is denoted by CV. We then have:

    n = k²N(CV)² / [N(RE)² + k²(CV)²]                                  (5.8)

Note: when the sampling fraction is small, this reduces to

    n = k²(CV)² / (RE)²                                                (5.9)

NOTE 1: In actual practice, we usually do not know S² or (CV)2. Indeed we do not even know s²

in advance of the survey. Instead, we use rough estimates of S² or (CV)2, obtained by

the methods discussed in section 8 of chapter 6.

NOTE 2: For the mean and the total, it is better to express the variance in relative rather than

absolute terms, for two reasons:

(1) Most importantly, because a population’s relative variance is more stable than its

absolute variance. A guess or estimate of the population coefficient of variation

CV (from past data or from similar populations) is likely to be closer to the true

value than a guess or estimate of the variance.

(2) The formula for n is the same for estimators of means or totals when it is

expressed in terms of the coefficient of variation.

50

NOTE 3: To estimate the proportion P, it is preferable to use the absolute error previously

discussed because the proportion is itself a relative quantity, so that taking the

percentage of a percentage can become confusing.

To obtain the formula for the sample size required to estimate a population proportion when the error

is expressed as relative error (RE), use equation (5.8) with (CV)² = Q/P:

    n = k²NQ / [N(RE)²P + k²Q]                                         (5.10)

and, when the sampling fraction is small,

    n = k²Q / [(RE)²P]                                                 (5.11)

Example 5.5

We would like to carry out a survey to estimate the total area in hectares of the farms in a population.

The estimate should be within 10% of the true value. How many farms should be surveyed? (In a

pilot survey, we estimated the population coefficient of variation, CV, of the variable farm size to be

1.2). Use 95% confidence.

    n = k²(CV)² / (RE)² = (4)(1.44) / (0.01) = 576 farms
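Example 5.5 can be computed with the relative-error formula for a small sampling fraction (N is not specified, so the fpc is ignored; the function name is ours):

```python
def n_relative(CV, RE, k):
    # Sample size for a mean or total with target relative error RE,
    # ignoring the fpc: n = (k * CV / RE) ** 2.
    return (k * CV / RE) ** 2

print(round(n_relative(1.2, 0.10, 2)))  # 576 farms (Example 5.5)
```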

5.3.6. Sample size formulas when the error is expressed in terms of the coefficient of

variation

If CV is the population coefficient of variation and v0 is a specified target value for an estimate's coefficient of variation, then (5.8) becomes,

51

    n = N(CV/v0)² / [N + (CV/v0)²]                                     (5.12)

and, when the sampling fraction is small,

    n = (CV/v0)²                                                       (5.13)

Let’s consider Example 5.5 and use coefficients of variation to solve the problem.

Example 5.6

Suppose that a survey was carried out to estimate the total area in hectares of the farms in a

population. The estimate should be within 10 percent of the true value, with 95 percent confidence.

How many farms should be surveyed? [In a pilot survey, we estimated the population coefficient of

variation CV of the variable "farm size" to be 1.2].

In this case,

    v0 = RE/k = 0.10/2 = 0.05   and   n = (CV/v0)² = (1.2/0.05)² = 576 farms,

the same answer as in Example 5.5.

Example 5.7

The results from a pilot test are used to estimate the mean and S for the variable ‘income’ in a population of

5,000 households.

s = $12,300

A full scale survey is planned. What should be the sample size for this survey if we want to estimate

the mean income per household with a cv no larger than 5%?

52

5.4. Advance Estimates of Population Variances

In the preceding section, we noted that most of the sample size formulas are written in terms of the

population variance. In practice this is unknown and it must be estimated or guessed. There are five

ways of estimating population variances for sample size determination.

Method 1: Select the sample in two steps, the first being a simple random sample of size n1 (the

first sample) from which estimates s1² and p1 of S² and P, respectively, are obtained.

Then use this information to determine the required n (the final sample size).

Method 2: Use the results of a pilot survey. This is one of the more commonly used methods.

Method 3: Use the results of previous samples of the same or similar population.

Method 4: Guess about the structure of the population and use some mathematical results.

Method 5: If the characteristic is an attribute (for example, whether or not a farm grows corn), then make a fairly good guess of P (the proportion in the population).

Method 1 carries out the survey in two steps. In the first step, only a subsample (a random part of

the total sample) is enumerated. An analysis of this part permits one to estimate the variance and to

make revisions in the total size of the sample, if necessary. In the second step, the remainder of the

sample is enumerated in accordance with these changes, if any. This method gives the most reliable

estimates of S² or P, but it is not often used, since it slows up the completion of the survey.

Method 2 is one of the more commonly used methods. It serves many purposes, especially if the

feasibility of the main survey is in doubt. If the pilot survey is itself a simple random sample, the

preceding methods apply. But often the pilot work is restricted to a part of the population that is

convenient to handle or that will reveal the magnitude of certain problems.

Method 3 is also a very commonly used method. This method points to the value of making

available, or at least keeping accessible, any data on standard errors obtained in previous surveys.

Unfortunately, the cost of computing standard errors in complex surveys is high, and frequently only

those standard errors needed to give a rough idea of the precision of the principal estimates are

computed and recorded. If suitable past data are found, the value of S² may require adjustment for

time changes. Experience indicates that the variance of an item tends to change much more slowly

over time than the mean value of the item itself. Even if the mean value changes, the relative error

may be quite stable.

53

Method 4 uses some mathematical results. Deming (1960) showed that some simple mathematical

distributions may be used to estimate S² from a knowledge of the range (h) and a general idea of the

shape of the distribution of the characteristic of interest:

Uniform (rectangular)    .29 * h
Right triangle           .24 * h
Normal                   .17 * h

These relations do not help much if the range, h, is large or poorly known. However, if h is large,

good sampling practice is to stratify the population so that within any stratum the range is

significantly reduced. Usually the shape also becomes simpler (closer to rectangular) within a

stratum. Consequently, these relations are effective in predicting S, and hence S², from h within individual strata.

Example 5.8

The universities in the State of Maryland were classified according to the number of enrolled

students into four size classes. The standard deviation within each class is shown below:

Enrollment Level, Xi < 1,000 1,000-3,000 3,000-10,000 > 10,000

Si 236 625 2,008 10,023

If you knew the class boundaries but not the values of Si, how well could you guess the values by

using the Deming method? (No university has fewer than 200 enrolled students and the largest has

about 50,000).

We do not know the number of universities in each size class; therefore, we cannot obtain a

frequency distribution that would show us the general shape of the distribution. A conservative

estimate would be that the distribution is uniform. In this case, Si would be given by 0.29 * hi, where

54

hi is the range of each class.

S1 = 0.29 (1,000 - 200) = 232

S2 = 0.29 (3,000 - 1,000) = 580

S3 = 0.29 (10,000 - 3,000) = 2,030

S4 = 0.29 (50,000 - 10,000) = 11,600
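The Deming range-based guesses can be computed with a small helper (the function name and dictionary of shape factors are ours):

```python
def deming_sd(h, shape="uniform"):
    # Rough guess of a standard deviation from the range h, using
    # Deming's factors for a few distribution shapes.
    factor = {"uniform": 0.29, "right_triangle": 0.24, "normal": 0.17}
    return factor[shape] * h

# Class ranges from Example 5.8, uniform shape assumed in each class:
for lo, hi in [(200, 1_000), (1_000, 3_000), (3_000, 10_000), (10_000, 50_000)]:
    print(round(deming_sd(hi - lo)))  # 232, 580, 2030, 11600
```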

Method 5 applies when the characteristic is an attribute--for example, whether or not a farm is growing corn--since the population variance is then approximately PQ. It is only necessary to be able to make a

fairly good guess at P in order to estimate S². As long as the guess is reasonably close, we will get a

good estimate of S². For example, suppose the true value of P is 0.4; then the value of S² = PQ

would be 0.4 x 0.6 = 0.24. Suppose we made a rather poor guess of P, say 0.3. We would then

estimate the value of the variance as 0.3 x 0.7 = 0.21, which differs from the true value by only about 12 percent. Note that we can also estimate S² by setting S² = PQ = (1/2)(1/2) because the formula for

n is maximized when P = Q = 1/2. This latter is called a "conservative estimate," because we can

never do worse than that.
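The insensitivity of PQ to the guessed value of P is easy to tabulate:

```python
def pq(p):
    # Variance of a 0/1 characteristic (fpc ignored): S^2 is approximately P(1 - P).
    return p * (1 - p)

for guess in (0.3, 0.4, 0.5):
    print(guess, round(pq(guess), 2))  # 0.21, 0.24, 0.25
```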

5.5. Practical Constraints on Sample Size

Let us recall that the total error is composed of both bias and variance. High sample sizes reduce the variance (i.e., yield high precision) but tend to increase cost and operational difficulties, which translates into larger nonsampling errors. The sample sizes calculated so far also assume that sufficient resources are available to handle them.

However, in a real survey setting, there exist constraints with respect to:

(a) budget

(b) field conditions

(c) field and office personnel

(d) time

(e) equipment and materials, etc.

Hence, in addition to precision, we also need to consider the maximum sample size that can be

handled by the available resources. It may be necessary to limit the sample size in order to stay

within budget and operational constraints.

If the maximum practical sample size is much smaller than that required to achieve the specified

precision, calculations can be made to estimate the level of precision that could be expected from the

actual sample size. If this level is not acceptable, greater resources have to be allocated to

accommodate a larger sample size.

To compromise between precision and practicality, we may take a sample size that is somewhere

between the constraint-based and the precision-based sizes.

55

5.6. Expected Sample Loss Due to Nonresponse

If past experience indicates that a certain level of nonresponse can be present, we may want to inflate

the calculated sample size to compensate. This is because our calculations were based on a 100

percent response. If we do not obtain all the interviews, then the estimates will be based on a

number smaller than the calculated n and will, therefore, have a greater variance than expected.

Inflating Procedure

    n(adjusted) = n / r

where r is an estimate of the expected response rate and it can be obtained from previous rounds of

the same survey, previous experience with similar surveys, a pilot (pre-test), etc.

For example, we calculate n to be 1,000 units. Based on the results of a pilot survey, we anticipate

the response rate to be 70 percent.

Then the adjusted sample size is 1,000 / 0.70 = 1,429 units. If our assumption was correct, we should get back 70% of 1,429 = 1,000.

Therefore, our estimates will be based on the same number of units as expected and the target

precision will be attained.
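The inflation step is one line of arithmetic; a sketch (the function name is ours):

```python
import math

def inflate_for_nonresponse(n, r):
    # Inflate a calculated sample size n by the expected response rate r,
    # so that r times the adjusted size gives back about n completed interviews.
    return math.ceil(n / r)

print(inflate_for_nonresponse(1000, 0.70))  # 1429
```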

Important Note

Inflating the sample size when there is nonresponse only helps compensate for the resulting loss in

precision. It does nothing for diminishing the resulting nonresponse bias.

In most surveys information is collected from a sampling unit for more than one characteristic. One

method of determining sample size is to specify margins of error for the characteristics that are

regarded as most vital to the survey. An estimation of the sample size needed is first made

separately for each of these important characteristics.

When the estimations of n have been completed for each of the most important characteristics, it is

time to take stock of the situation. It may happen that the n's required are all reasonably close. If the

largest of the n's falls within the limits of the budget, this sample size is selected. More commonly,

there is sufficient variation among the n's so that we are reluctant to choose the largest, either for

budgetary considerations or because this will give an overall level of precision substantially higher

than originally contemplated for the other characteristics. In this event the desired level of precision

may be relaxed for some of the characteristics in order to permit the use of a smaller value of n.

56

In some cases the n's required for different characteristics are so different that some of them must be

dropped from the survey; with the resources available the precision expected for these characteristics

is totally inadequate. The difficulty may not be merely one of sample size. Some characteristics call

for a different type of sampling scheme than others. With populations that are sampled repeatedly, it

is useful to gather information about those characteristics that can be combined economically in a

general survey and those that need special methods. As an example, a classification of

characteristics into four types, suggested by experience in regional agricultural surveys, is shown in

Table 5.1. In this classification, a general survey means one in which the units are fairly evenly

distributed over some region as, for example, by a simple random sample.

Table 5.1.

Type   Distribution of the characteristic               Most suitable type of sampling

1      Widespread throughout the region, occurring      A general survey with low sampling
       with reasonable frequency in all parts.          fraction.

2      Widespread throughout the region but with        A general survey, but with a higher
       low frequency.                                   sampling fraction.

3      Occurring with reasonable frequency in most      For best results, a stratified sample
       parts of the region, but with more sporadic      with different intensities in different
       distribution, being absent in some parts and     parts of the region (Chapter 5). Can
       highly concentrated in others.                   sometimes be included in a general
                                                        survey with supplementary sampling.

4      Distribution very sporadic or concentrated       Not suitable for a general survey.
       in a small part of the region.                   Requires a sample geared to its
                                                        distribution.

Example

The following coefficients of variation per unit were obtained in a farm survey in Iowa, the unit

being an area of 1 square mile.

Item Estimated cv

Acres in farms (Y1 ) 0.38

Acres in corn (Y2 ) 0.39

Acres in oats (Y3 ) 0.44

Number of family workers (Y4 ) 1.00

Number of hired workers (Y5 ) 1.10

Number of unemployed (Y6 ) 3.17

57

A survey is planned to estimate acreage characteristics with a cv of 2½% and numbers of workers

(excluding unemployed) with a cv of 5%. With simple random sampling, how many units are

needed? How well would this sample be expected to estimate the number of unemployed? The

results are displayed in the following table:

Item    cv per unit    Target cv for estimate    n required    Expected cv with n = 484

Y1      0.38           0.025                     232           0.017
Y2      0.39           0.025                     244           0.018
Y3      0.44           0.025                     310           0.020
Y4      1.00           0.050                     400           0.046
Y5      1.10           0.050                     484           0.050
Y6      3.17           --                        --            0.144
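The table's columns follow from n = (cv per unit / target cv)² and expected cv = (cv per unit)/√n. A Python sketch (helper names are ours; values agree with the table up to rounding conventions):

```python
import math

def n_needed(cv_unit, target_cv):
    # SRS sample size so the estimate's cv meets the target: n = (cv/target)^2,
    # rounded up (the inner round() guards against float noise).
    return math.ceil(round((cv_unit / target_cv) ** 2, 6))

def cv_with_n(cv_unit, n):
    # Expected cv of the estimate for a given sample size.
    return cv_unit / math.sqrt(n)

for name, cv, target in [("Y1", 0.38, 0.025), ("Y2", 0.39, 0.025),
                         ("Y3", 0.44, 0.025), ("Y4", 1.00, 0.050),
                         ("Y5", 1.10, 0.050)]:
    print(name, n_needed(cv, target), round(cv_with_n(cv, 484), 3))
print("Y6", round(cv_with_n(3.17, 484), 3))  # about 0.144, no target set
```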

Comments

1. Assuming cost and workload constraints permitted it, a sample of 484 segments should be

taken (the largest calculated size). This sample size should guarantee the desired precision

(or better) for the estimates of Y1 through Y5. As noted in the last column, the cv of the

estimate is expected to be either as small as desired or smaller, if n = 484 is used.

2. As far as the estimate of Y6, a cv of approximately 14% can be expected if a sample size of

484 is used. Although it is true that the precision will be lower for this estimate than for the

others, this is not critical because sponsors and data users did not require higher precision.

If there are subpopulations or domains of estimation for which separate estimates of a given

precision are required, we must resort to a different sampling strategy, such as the use of stratified

sampling with different sampling rates by stratum.

Under stratified sampling, each stratum or domain is considered a "population" in its own right. We

can then apply the same principles to calculate separate sample sizes within each stratum to meet the

precision requirements for the domain estimates. Often the same precision is required in each

domain. If the variability and the cost within the domain are similar from domain to domain, then

the sample sizes will be about the same in all domains.

58

The overall sample size would then be the sum of the stratum sample sizes. The overall estimate for

the whole population would have a higher precision than the stratum-level estimates.

For example, if the unemployment rate is to be measured at the national level with x% target cv, the

national sample size computed would be n, say 5,000 households. On the other hand, if the

unemployment rate is needed for each of 5 regions of the country, all with the same precision, the

total (national) sample size required would be around 5n or 25,000 households. The national

estimate would have a precision much higher than originally planned.

The formulas discussed so far are all based on simple random sampling (SRS). Let us denote as nsrs,

the sample sizes obtained from those formulas.

However, as will be seen later on, simple random sampling is rarely used in complex surveys. The

efficiency of the design actually used is measured by comparing the variance of the estimator θ̂

obtained with the complex design and the variance of the same estimator with SRS.

- If the complex design is more efficient, that is, inherently tends to produce a lower variance

than SRS, then our precision is likely to be better than expected with nsrs.

- If, on the other hand, the complex design is less efficient than the SRS one, that is, has an

inherent tendency to produce a higher variance for θ̂ than SRS, then our expected precision

level may not be met with the calculated nsrs. In this case, it would be desirable to inflate

nsrs beforehand.

As we study different sampling schemes, we will know which are more efficient than SRS and which

are less. Here are some examples:

- More efficient than SRS: stratified sampling, and the use of auxiliary information in
  estimators (e.g., ratio estimators of totals)
- Less efficient than SRS: cluster sampling

The efficiency of a particular sample design is measured by the design effect (see Chapter 4).

We return to certain implications of the basic formula from which all the above formulas are derived.

That basic formula was given in equation (4.4) in Chapter 4 as:

59

    V(ȳ) = [(N-n)/N] × (S²/n)                                          (5.14)

Notice that the sampling variance of the mean is equal to the variance of individual observations (S²)

in the population multiplied by the factor What happens when the sample increases

from its smallest possible size (n = 1) to its largest possible size (n = N)? When n = 1,

This states the familiar fact that the variance of the means of samples of one unit is the same as the

variance of individual observations in the population. At the other extreme, when n = N,

Var(ȳ) = (S²/N) × [(N-N)/N] = 0

That is, if the sample includes the entire population, the mean is estimated without sampling error.

For sample sizes between these extremes, how does the sampling fraction (sampling rate) n/N affect

the standard error? The answer, sometimes surprising to students, is that for populations that are

large relative to the sample size, the absolute size of the sample (n) and not the sampling fraction n/N

determines the precision of the estimated mean. This follows from the fact that when N is large

relative to n, the factor [(N-n)/N] ≈ 1. (The symbol ≈ stands for "is approximately equal to".) Then

Var(ȳ) ≈ S²/n

On the other hand, for small populations the sampling fraction does have an effect. For example,

suppose two populations have the same mean and the same variance S² = 100, while

N1 = 40 and N2 = 400. If we take the same size of sample from each, say n1 = n2 = 20, the standard

errors are related (in an inverse way) to the sampling fractions. Equation (5.14) then gives:

                        N       n     n/N    Standard error

    1st population     40      20     .50         1.6

    2nd population    400      20     .05         2.2

The number of sample units needed to achieve the same precision would be greater for the second

(larger) population. However, the number of sample units needed to achieve a given reliability does

not increase indefinitely as the number of elements in the population increases. In other words, we

reach a point in which adding an extra sampling unit does not produce a sizable reduction in

variance.
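As a check on equation (5.14), the two standard errors above can be computed directly. Below is a minimal sketch in Python (a language choice of ours, not the manual's):

```python
import math

def se_mean(s2, n, N):
    """Standard error of the sample mean under simple random sampling,
    including the finite population correction (N - n)/N."""
    return math.sqrt((s2 / n) * (N - n) / N)

# Two populations with the same variance S^2 = 100, each sampled with n = 20
print(round(se_mean(100, 20, 40), 1))   # small population (n/N = .50): 1.6
print(round(se_mean(100, 20, 400), 1))  # larger population (n/N = .05): 2.2
```

For very large N the correction factor approaches 1, and the standard error depends on the absolute sample size n alone.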


Example

Table 5.2 below shows the size of sample necessary to give an estimate of the population mean

within a 5 percent error (E = 0.05) of the estimate (with confidence coefficient k = 2) for populations

ranging in size from 50 to 10,000,000 elements and with (CV)² = .10 in each case. These results

were obtained using equation (4.8) of Chapter 4. Equation 4.8 is given by:

n = k²(CV)² / [E² + k²(CV)²/N]

Table 5.2

(E = .05 and k = 2)


Number of elements Number of elements

in the population required in sample

(N) (n) n/N

50*..................... 38 .76

100..................... 62 .62

1,000.................. 138 .14

10,000................ 158 .016

100,000.............. 160 .0016

1,000,000........... 160 .00016

10,000,000......... 160 .000016

____________________________________________

* Use equation (4.8) when N is smaller than 50.

As an example, let's calculate the first value of n in Table 5.2. Since N = 50 is very small for a

population, we have to use the formula for n that contains the finite population correction

factor (N-n)/N. The series of steps leading to the number 38 in Table 5.2 is shown below.

The objective is to leave n on one side of the equation in terms of the other components. We start

by setting the desired margin of error equal to k standard errors of the mean:

E·Ȳ = k · sqrt[ (S²/n) × (N-n)/N ]

Now, we know that (CV)² = 0.10. This is the population coefficient of variation and is given to us

as a known value. However, we do not know the values of S² and Ȳ separately, but we can obtain

what we need by squaring both sides, dividing by Ȳ², and using the following:

S²/Ȳ² = (CV)²

This gives E² = k²(CV)²(N-n)/(nN), and solving for n:

n = k²(CV)²N / [NE² + k²(CV)²] = (4)(0.10)(50) / [(50)(0.0025) + (4)(0.10)] = 20/0.525 ≈ 38

Table 5.2 shows that for small populations, the sample size needed for a given accuracy does

increase as the population increases, but the sample size approaches a fixed number as the population

gets very large. The largest size of sample we would ever need for this accuracy (with CV² = .10) is

160 elements, and this is approximately the number we would need whether there are 10,000 or

10,000,000 elements in the population. Furthermore, if we had used a sample of 160 for a

population even as small as 1,000, the sample would be somewhat larger than necessary; but the

excess would not have been very serious.
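Table 5.2 can be reproduced with a few lines. Below is a minimal sketch in Python (our choice of language), using the unadjusted size n0 = k²(CV)²/E² followed by the finite population adjustment; small differences from the table can remain because of rounding:

```python
def required_n(cv2, E, k, N):
    """Sample size for a relative error E with confidence coefficient k,
    given the population (CV)^2 and the population size N."""
    n0 = k**2 * cv2 / E**2      # sample size ignoring the population size
    return n0 / (1 + n0 / N)    # finite population adjustment

for N in (50, 1000, 10_000_000):
    print(N, round(required_n(0.10, 0.05, 2, N)))
# 50 -> 38, 1000 -> 138, 10000000 -> 160, as in Table 5.2
```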


Chapter 5 Exercises

1. State park officials were interested in the proportion of campers who consider the campsite

spacing adequate in a particular campground. They decided to take a simple random sample of

size n = 30 from the first N = 300 camping parties which visit the campground. Let yi = 0 if the

head of the i-th party sampled does not think the spacing is adequate and yi = 1 if he does (i = 1,

2, . . . , 30). Use the data below to estimate P, the proportion of campers who consider the

campsite spacing adequate. Find the sampling error of the estimate and its coefficient of

variation.

Party (i)    Response (yi)

    1              1

    2              0

    3              1

    .              .

    .              .

    .              .

   20              1

   30              1

2. Use the data in Exercise 1 to determine the sample size required to estimate P with a bound on

the error of estimation of magnitude E = 0.05.

Answer: n = 125

3. A simple random sample of 100 water meters within a community is monitored to estimate the

average daily water consumption per household over a specified dry spell. The sample mean

and sample variance are found to be and s² = 1252, respectively. If we assume that

there are N = 10,000 households within the community, estimate μ, the true average daily

consumption, and find the sampling error of the mean and its coefficient of variation.

4. Using Exercise 3, estimate the total number of gallons of water, T, used daily during the dry

spell. Find the sampling error of the total and its coefficient of variation.


5. Resource managers of forest game lands are concerned about the size of the deer and rabbit

populations during the winter months in a particular forest. As an estimate of population size,

they propose using the average number of pellet groups for rabbits and deer per 30 foot square

plots. Using an aerial photograph, the forest was divided into N = 10,000 thirty foot square

grids. A simple random sample of n = 500 plots was taken, and the number of pellet groups

was observed for rabbits and for deer. The results of this study are summarized below:

Deer Rabbits

Sample mean = 2.30 Sample mean = 4.52

Sample variance = 0.65 Sample variance = 0.97

Estimate μ1 and μ2, the average number of pellet groups for deer and rabbits respectively, per

30-foot-square plots. Find the sampling error and the coefficient of variation of each mean.

Answers: Mean(rabbits) = 4.52; se(rabbits) = 0.042930176; cv(rabbits) = 0.95%

proportion of students in favor of converting from the semester to the quarter system. If 25 of

the students answered affirmatively, estimate the proportion of students on campus in favor of

the change. (Assume N = 2000.) Find the sampling error of the proportion and its coefficient

of variation.

7. A dentist was interested in the effectiveness of a new toothpaste. A group of N = 1,000 school

children participated in a study. Prestudy records showed there was an average of 2.2 cavities

every six months for the group. After three months on the study, the dentist sampled n = 10

children to determine how they were progressing on the new toothpaste. Using the data below,

estimate the mean number of cavities for the entire group and find the sampling error and the

coefficient of variation of the mean.

1 0

2 4

3 2

4 3

5 2

6 0

7 3

8 4

9 1

10 1


Answers: Mean = 2.0; se(mean) = 0.469039; cv(mean) = 23.45%
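The stated answers can be reproduced with a short computation. Below is a minimal sketch in Python, applying the SRS estimators with the finite population correction (the results match the stated answers to rounding):

```python
import math

cavities = [0, 4, 2, 3, 2, 0, 3, 4, 1, 1]             # data from Exercise 7
n, N = len(cavities), 1000

mean = sum(cavities) / n                              # 2.0
s2 = sum((y - mean) ** 2 for y in cavities) / (n - 1) # sample variance
se = math.sqrt((s2 / n) * (N - n) / N)                # sampling error with fpc
cv = se / mean
print(mean, round(se, 3), f"{100 * cv:.2f}%")         # 2.0 0.469 23.45%
```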

8. The Fish and Game department of a particular state was concerned about the direction of its

future hunting programs. In order to provide for a greater potential for future hunting, the

department wanted to determine the proportion of hunters seeking any type of game bird. A

simple random sample of n = 1000 of the N = 99,000 licensed hunters was obtained. If 430

indicated they hunted game birds, estimate P, the proportion of licensed hunters seeking game

birds. Find the sampling error and the coefficient of variation of the proportion.

9. Using the data in Exercise 8, determine the sample size the department must obtain to estimate

the proportion of game-bird hunters, given an error of estimation E = 0.02.

Answer: n = 2,300

10. A company auditor was interested in estimating the total number of travel vouchers that were

incorrectly filed. In a simple random sample of n = 50 vouchers taken from a group of N =

250, 20 were filed incorrectly. Estimate the total number of vouchers from the N = 250 that

have been filed incorrectly, and find its sampling error and coefficient of variation. (Hint: If P

is the population proportion of incorrect vouchers, then NP is the total number of incorrect

vouchers. An estimator of NP is Np, which has an estimated variance given by N²v(p), where v(p) = [p(1-p)/(n-1)] × (N-n)/N.)
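As a sketch of how the hint is applied, the Python fragment below estimates NP and its sampling error; the variance formula v(p) = [p(1-p)/(n-1)](N-n)/N is our assumption about the intended estimator:

```python
import math

N, n, incorrect = 250, 50, 20
p = incorrect / n                             # sample proportion, 0.4

total_hat = N * p                             # estimated incorrect vouchers = 100
v_p = (p * (1 - p) / (n - 1)) * (N - n) / N   # estimated variance of p with fpc
se_total = N * math.sqrt(v_p)                 # sampling error of N*p
cv = se_total / total_hat
print(total_hat, round(se_total, 1), f"{100 * cv:.1f}%")
```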

11. A psychologist wishes to estimate the average reaction time to a stimulus among 200 patients

in a hospital specializing in nervous disorders. A simple random sample of n = 20 patients was

selected and their reaction times were measured with the following results:

Estimate the population mean, μ, and find the sampling error and the coefficient of variation of

the mean.

12. In Exercise 11, how large a sample should be taken in order to estimate μ with an error of

estimation equal to one second? Use 1.0 second as an approximation of the population

standard deviation.

Answer: n = 4.

13. The manager of a machine shop wishes to estimate the average time that it takes for an operator

to complete a simple task. The shop has 98 operators. Eight operators are selected at random

and timed. The following are the observed results:


Time in Minutes

4.2 5.3

5.1 4.6

7.9 5.1

3.8 4.1

Estimate the average time it takes an operator to complete a simple task and find the sampling

error and the coefficient of variation of the average time.

14. A sociological study conducted in a small town calls for the estimation of the proportion of

households which contain at least one member over 65 years of age. The city has 621

households according to the most recent city directory. A simple random sample of n = 60

households was selected from the directory. At the completion of the field work, out of the 60

households sampled, 11 contained at least one member over 65 years of age. Estimate the true

population proportion, P, and find the sampling error and the coefficient of variation of the

proportion.

15. In Exercise 14, how large a sample should be taken in order to estimate P with an error of

estimation of 0.08? Assume the true proportion P is approximately 0.2.

Answer: n = 84

16. An investigator is interested in estimating the total number of “count trees” (trees larger than a

specified size) on a plantation of N = 1500 acres. This information is used to determine the

total volume of lumber for trees on the plantation. A simple random sample of n = 100 one-

acre plots was selected, and each plot was examined for the number of count trees. If the

sample average for the n = 100 one-acre plots was with a sample variance of s2 = 136,

estimate the total number of count trees on the plantation and find the sampling error and the

coefficient of variation of the estimated total.

17. Using the results of the survey conducted in Exercise 16, determine the sample size required to

estimate T, the total number of trees on the plantation, with an error of estimation E = 1500.

Answer: n = 388.

18. You want to design a household survey to estimate average annual income per household. The

number of households is 2,000,000. On the basis of the data from a previous census, the

population variance of annual income per household is estimated to be 1,000,000 (that is, S =

1000).


a. What sample size is necessary to estimate the average annual income with a 95 percent

confidence that the result is accurate to plus or minus $100?

Answer: n = 385.

b. What size sample is necessary to estimate average annual income within plus or minus $50,

also at the 95 percent confidence level?

Answer: n = 1,537.

19. Refer to the universe of eight farms listed below with known value of land and buildings as

follows:

Farm 2 - $6854 Farm 6 - $9284

Farm 3 - $1532 Farm 7 - $1438

Farm 4 - $2180 Farm 8 - $8836

a. List the 28 simple random samples of two farms each, compute the mean for each sample and

verify that the average mean of all 28 means is $4,694.75.

b. Compute the standard deviation of the 28 means and check that the standard deviation is

$2,037 (or 2,036.776).


Chapter 6.

PRACTICAL CONSIDERATIONS

IN SELECTING A SAMPLE

_____________________________________________________________________________________________

6.1 SAMPLING FRAME

In order to select a sample, it is necessary to have a sampling frame; that is, a list of all elements (or

the equivalent, such as a list of blocks, housing units, etc.) so that the probability of selection of each

element can be known in advance. The frame need not be literally a list. In sampling from cards,

questionnaires, etc., the documents themselves can be considered as the frame. But it is necessary to

know that the file is complete. For example, in sampling from a file of records, one should make

sure that no records are out of the file--in use or waiting to be refiled--since such records would not

have any chance of selection. Again, in using a population register maintained by local authorities,

one should make certain the list is current. For example, the list might not contain all newly

married couples. Since new families and those that move around are likely to differ in their

characteristics from older and more settled families, a biased sample would result.

In using local registers or lists, it may be useful to conduct an actual check of the completeness, on a

more or less informal basis. This can be done by going out to the area to be sampled, selecting a few

families (or farms or business firms) scattered around the area, and checking to see if they are on the

list. If possible, it is better to select families of the type likely to be missing from the list, since this

would provide a better test. A rough idea of the adequacy of the list can be obtained in this manner.

Special difficulties arise when some units have more than one chance of selection--for example,

when sampling from a file in which some individuals are included more than once; when selecting a

sample of families from a sample of individual persons; etc. To illustrate, one might select a sample

of school children and use it to select families. It is clear that if one draws a sample of families by

first selecting a sample of persons and including the families to which these persons belong, the

families will have unequal probabilities of selection, since the larger the family the greater the

chance of selection. Similarly, selecting a sample of a business firm's customers by using a record

file containing a separate sheet (or card) for each purchase will give customers making more than

one purchase a greater chance of selection.

To avoid the biases which result from giving some of the units a greater chance of selection than

others, it is desirable to restrict the sampling procedure so that each unit has only one chance of

selection. For example, when selecting a sample of families, we could make a rule to include the

family only if the head of the family is the person selected. Since each family has only one head,

each family would have the same chance of selection.

The specified person on whom the selection of the family depends need not be the head; he/she could


just as well be the oldest person, the youngest child, etc. The only requirement is that each family

have one and only one such member. Similarly, in sampling customers, we could restrict the sample

by using only the cards with the earliest date for each customer, etc.

While the technique described in the preceding paragraph is generally recommended, whether the

sample is drawn from a file, a set of questionnaires, or is selected in the field, there are other

techniques that might be used. They will provide unbiased estimates of the universe, although they

do not strictly satisfy the conditions of simple random sampling. Some of these techniques are:

(1) After selecting the initial sample by including all families for which one (or more) person has

been selected, we group the sample by size of family. It is clear that families with 2 members

have twice the chance of selection as those with 1; families with 3 members have three times

the chance of selection; etc. Therefore, instead of interviewing all families in the sample, we

interview only 1/2 of the two-member families, 1/3 of the three-member families, etc.

(2) Proceed as above, but interview all families instead of 1/2, 1/3, etc. However, in tabulating the

results, tabulate each size class separately, and multiply the results of the two-person families

by 1/2, the three-person families by 1/3, etc., before adding the results together.
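Technique (2) can be illustrated with a small simulation. In the Python sketch below, the universe of families is hypothetical; persons are sampled, each sampled family is down-weighted by its size, and the weighted count recovers the number of families:

```python
import random

random.seed(1)

# Hypothetical universe: 1,000 families of 1 to 4 members each
families = [{"id": i, "size": random.randint(1, 4)} for i in range(1000)]

# Person-level frame: one entry per member, pointing to the member's family
persons = [fam for fam in families for _ in range(fam["size"])]

# A family's chance of entering the sample is proportional to its size,
sample = random.sample(persons, 200)

# so each sampled family is weighted by 1/size (technique (2)), and the
# result is scaled up by the inverse of the person-level sampling fraction
fraction = len(sample) / len(persons)
est_families = sum(1 / fam["size"] for fam in sample) / fraction
print(round(est_families))   # close to the true count of 1,000
```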

Sometimes the only available frame is a list which includes some units which are outside the scope

of the universe defined for the survey. For example, suppose a special analysis is desired of the

census characteristics of males. The only source for sampling is a card file containing cards for all

persons both male and female, and it is not feasible to remove all the cards for females. The file can

still be used as a frame even though cards for both males and females will be designated by the

random selection process. The proper procedure in such a case is to take only the cards for the

males selected, and disregard those for the females.

Do not substitute. A procedure that is sometimes erroneously used (and may cause serious bias) is to

substitute the next "male" card in the file for each "female" card drawn in the sample. There are two

things wrong with this method:

(1) It results in a higher sampling rate than that specified. Also, the sampling rate actually

obtained cannot be calculated unless the total number of males is known. This makes it

impossible to use the reciprocal of the sampling rate, N/n, as a multiplier to produce estimates

of totals from the sample.

(2) A more serious objection to this substitution lies in the biases it may introduce in the selection

process. Suppose we have a list of all housing units and we wish to select a sample of

occupied dwellings only. If we use a procedure that substitutes the next occupied unit for each

vacant housing unit that falls into the sample, occupied units that are neighbors of vacant ones

will have two chances of selection--the chance that their own listing entry is selected and the

chance that the listing of the neighboring vacant dwelling is selected. If vacant units are more

likely to be found in poor and undesirable neighborhoods, this would mean that occupied

housing units in such areas would be over-represented in the sample.


6.4 SYSTEMATIC SAMPLING

The work necessary to draw a simple random sample can be quite burdensome when the number of

units to be selected is large. For example, to get a 5 percent sample of 20,000 elements, it would be

necessary to select 1,000 random numbers from a table of random numbers and then to select the

designated units from the population. In practice, most statisticians prefer a different method. A

sample of this size is usually drawn by taking a random number between 1 and 20, then taking every

20th element thereafter. Thus, if the random number is 3, the elements taken will be 3, 23, 43, 63,

and so on up to 19,983. The reciprocal of the sampling rate (20 in this case) is called the sampling

interval. The method of estimating the mean, total, or a proportion is the same as for simple random

sampling.

This type of sampling is called systematic sampling. It is not the same as simple random sampling,

but it is an acceptable sampling method because the chance of selecting any one element is known

and we can calculate the sampling errors.

If the elements in the population are arranged in a nearly random order (that is, with very

little correlation between successive elements), the results of systematic sampling will be in

close agreement with those of simple random sampling. Experience shows that, generally, the

two methods will give results of roughly the same accuracy. The systematic sample will often

have a somewhat smaller sampling error, since it will make certain the sample will be spread

throughout the population. We may make use of the formulas for simple random sampling to

evaluate the reliability of estimates from a systematic sample; the result will usually somewhat

overstate the standard error for systematic sampling. In other words, we will underestimate, slightly,

the reliability of the estimates. There are ways of calculating the standard errors of systematic

samples more precisely; however, they are not covered in these chapters.

The steps for selecting a systematic sample of n units from a list of N units are:

1. Assign a serial number, 1 to N, to the units on the list

2. Compute the sampling interval SI = N/n, keeping the decimals

- you may round if you are doing this without a calculator, but you would be sacrificing

exactness for convenience

3. Select a random number (RN) from a table of random numbers between 0 and the SI. This

is called a random start (RS)

- in the permitted range, exclude zero, but include the sampling interval

- use as many digits as SI has, including decimals

- if you are searching through a RN table, pretend the decimal point is not there

- if you are using a calculator which only provides random numbers between zero and


one, multiply this random number by the value of SI in order to get a random number

between zero and SI. Remember to keep the decimals, do not round yet.

4. Begin the series of cumulated numbers with RS. Add SI to this first number to determine

the second. Then, add SI to the second number to get the third, and so on.

5. Stop cumulating when the last cumulated number exceeds N (discard this last number)

- if you rounded SI before adding, you may not have exactly n

6. Now go back and round all the cumulated numbers up to the next integer

7. On the list of population units, circle the serial numbers that correspond to these integers

Example 6.1

Suppose that a village contains 285 housing units (HUs) and we wish to select a systematic sample

of 12 HUs for a survey. Assume the list is randomly ordered.

1. The housing units are serially numbered from 1 to 285

2. SI = 285/12 = 23.75

3. RS = 19.79

Unit    Cumulated Number    Serial Number of Selected Unit

                            (after rounding up)

1 19.79 20

2 19.79+23.75=43.54 44

3 43.54+23.75=67.29 68

4 67.29+23.75=91.04 92

5 114.79 115

6 138.54 139

7 162.29 163

8 186.04 187

9 209.79 210


10 233.54 234

11 257.29 258

12 281.04 282

13 304.79 (Discard)
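The cumulate-and-round-up procedure of Example 6.1 can be sketched as follows (in Python, our choice of language):

```python
import math

def systematic_sample(N, n, random_start):
    """Systematic selection with a fractional sampling interval:
    cumulate SI from the random start, then round each cumulated
    number up to the next integer (steps 4-6 above)."""
    SI = N / n                          # keep the decimals
    serials, cum = [], random_start
    while cum <= N:                     # discard cumulations beyond N
        serials.append(math.ceil(cum))
        cum += SI
    return serials

# Example 6.1: N = 285 housing units, n = 12, RS = 19.79
print(systematic_sample(285, 12, 19.79))
# [20, 44, 68, 92, 115, 139, 163, 187, 210, 234, 258, 282]
```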

Remarks: Let's see what might have happened if we had not carried the decimals.

2. SI = 285/12 = 23.75, rounded to 24

3. RS = 24

4. Results:

(1) 24

(2) 48

(3) 72

.

.

(11) 264

(12) 288 (discard).

We exhausted the population before reaching our 12 units. This would not have happened if we had

kept the decimals (had not rounded up at the beginning), even if our RN was equal to the SI.

We accomplish the same results by truncating instead of rounding up. Refer to Section 4.1 above.

- In step 3 of Section 4.1, while choosing RN, include zero but exclude SI;

- Then, in step 6, truncate (that is, retain only the integer portion of the number), instead of

rounding up.

This alternative is convenient when using computer software packages, because their rounding

functions usually round to the closest number instead of always rounding up. So, it is better to use

the integer functions, which truncate systematically.

Let's look at an example in order to clarify the concepts. Refer to the previous example.


3. RS = 19.79 + 1 = 20.79

Unit    Cumulated Number    Serial Number of Selected Unit

                            (after truncating)

1 20.79 20

2 20.79+23.75=44.54 44

3 44.54+23.75=68.29 68

4 68.29+23.75=92.04 92

5 115.79 115

6 139.54 139

7 163.29 163

8 187.04 187

9 210.79 210

10 234.54 234

11 258.29 258

12 282.04 282

13 305.79 (Discard)

There is one situation in which systematic sampling will give very poor reliability. That is the case

in which the arrangement of the elements in the population follow a very regular (periodic) pattern

and the sampling interval of the systematic sample falls into that pattern. For example, suppose all

families in a certain population consisted of exactly four persons--the head, his wife, and two

children. The population has been listed in the order just given and we wish to draw a 25 percent

systematic sample from this list to obtain some special information. Since the sampling procedure is

to take every fourth person starting at random, four possible samples could be obtained:

(1) Random start is 1--the sample will consist entirely of heads of families.

(2) Random start is 2--the sample will consist entirely of wives of heads.

(3) Random start is 3 or 4--the sample will consist entirely of children.

In a case such as this, results from sample to sample would have nearly the maximum possible

variation, and it would be likely that estimates based on any one of the samples would be quite far

from the true values for the population. However, even in this extreme case, the estimates would be

unbiased; that is, the averages of the estimates for all possible samples would be the population

averages.

Although the example given above is not likely to occur in practice, approximations to this situation

sometimes arise. If there is suspicion of any regularity in the sequence of listing, which could

conform to the sampling interval, systematic sampling should be avoided or modified. For example,

the list could be randomized before systematic selection is used.


6.4.3 Modified systematic sampling

One variant of systematic sampling that could be used when there is some systematic ordering in the

population is to use a different random number within each sampling interval. To illustrate, let us

use the previous example of 25-percent sample when family members are listed in order--head, wife,

child. With a systematic sample, once a random number is selected, this sets the pattern for the

entire sample. As explained above, if the random number is 1, the sample will be the 1st, 5th, 9th, 13th

person, etc. (all heads of families); if the random number is 2, the sample will include the 2nd, 6th,

10th, 14th person, etc. (all wives of heads). To avoid this difficulty, we can select a different random

number within each group of 4 persons, so as to avoid a constant interval between our sample cases.

The selection scheme is indicated below:

number (1 four selected

to 4) persons

3 1st 3rd

1 2nd 5th

2 3rd 10th

1 4th 13th

4 5th 20th

etc.

That is, in the first group, one child is selected because the random number is 3. In the second

group, the husband is selected because the random number is 1 and the husband is the first person in

the group, but the fifth person in the list. In the third group, the second person is selected (the wife),

who is the 10th person in the list, and so forth.

The system requires more work than ordinary systematic sampling, but it avoids the possibility of the

patterns indicated above. We do not mean to imply that such patterns as described above usually

exist and that systematic sampling should be avoided. In most cases, systematic sampling produces

very satisfactory results.
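A sketch of this variant in Python: a fresh random number is drawn within each group of SI units, so no constant interval can line up with a periodic pattern in the list.

```python
import random

def modified_systematic_sample(N, interval):
    """Select one unit at random from each successive group of
    `interval` units, instead of using a single random start."""
    selected = []
    for group_start in range(0, N, interval):
        pick = group_start + random.randint(1, interval)
        if pick <= N:                # guard against a short final group
            selected.append(pick)
    return selected

random.seed(7)
# 25-percent sample from a list of 20 persons ordered head, wife, child, child
print(modified_systematic_sample(20, 4))
```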

Frequently, in sampling office files, the records have a serial number. We may take advantage of this

fact to draw the sample; for example, by designating all records whose serial numbers end in 5, 7, or

some other number chosen from a table of random numbers. However, before deciding on this

system, one should make sure that the last digit of the serial number is actually random, and does not

represent a nonrandom arrangement of some kind; if it does, we might obtain only one particular

type of unit in the sample by repeatedly selecting the same last digit. If such a serial number is not

present, frequently one can be assigned at random with little cost, and used for sampling.


6.5 GUIDELINES ON WHEN TO USE DIFFERENT SAMPLING

SCHEMES

6.5.1 When to Use Simple Random Sampling

Simple random sampling (SRS) is indicated when:

1. There are no major cost differences associated with including various classes of sampling

units in the sample.

2. The population is relatively homogeneous with respect to the major characteristics being

estimated.

4. There are no cost savings in surveying units which are close together or other natural

clusters of the population.

It should be noted that none of these reasons on its own is enough to justify the use of SRS.

6.5.2 When to Use Systematic Sampling

There are several reasons for using systematic sampling, but in practice, the main reason usually is:

1. The frame is a record system requiring a manual selection of sample units (e.g., a physical

list, card files, etc.)

Systematic sampling can also be used to provide implicit stratification during sample selection if

sampling units are arranged in a particular order. This type of sampling, however, would not be

SRS.


6.5.3 When to Use Stratification

1. Natural or predefined strata of the population exist: e.g., geographic divisions such as

states, provinces; ecological zones that have great socioeconomic impact on the population,

etc..

2. There exist subpopulations of interest for which separate estimates of a given precision are

required.

Strata could be created so that each regional office can handle the sampling and the

interviewing in their respective areas.

b. The potential strata are internally homogeneous with respect to the variables of interest.

6. Auxiliary information upon which to base the stratification is available for all population

units.

6.5.4 When to Use Cluster Sampling

Single stage cluster sampling is indicated when:

1. Natural or predefined clusters of the population exist: e.g., Metropolitan Statistical Areas

(MSAs), Enumeration Districts (EDs), Enumeration Areas (EAs), etc.

2. Confining sampling operations to units that are nearby produces large cost and time

savings.

3. No frame is available which lists all population elements but one could be constructed for a

limited number of clusters to list all elements in the cluster.


7. Nonsampling errors can be controlled more effectively (e.g. listing operation can be done

more accurately for a cluster than for the whole population, yielding better coverage).

It is generally recommended that clusters be selected either with probability proportional to size or

with equal probabilities after stratification by size. In addition, it is recommended that larger clusters

be placed in certainty strata so they may all be included in the sample. This is done in order to

control the variance of estimates.

The situations which suggest the use of a multistage design are the same as for single stage cluster

sampling except that multistage sampling is preferred over single stage sampling when:

2. Only a limited number of sample elements can be handled, and concentrating them in a few

clusters would result in estimates of poor precision. In such a case, it would be more

efficient to spread the sample over more clusters and only subsample each cluster.

Remarks: The above guidelines for using different sampling schemes are not meant to be rigid or

exhaustive; in practice, other considerations may also apply.

6.6 CONTROLS

After a sample is selected, it is necessary to check the number of cases actually obtained against the

number expected (as calculated by applying the sampling rate to the number of cases in the

universe). Discrepancies may indicate that the sampling procedure was not properly carried out. For

example, forgetting to sample from file drawers in use at the time of sampling, and thus omitting part

of the population, would result in fewer cases than expected. Further checks on whether the sample

shows any unusual features may also help us know whether the sampling was actually performed as

planned.

Very frequently, when a sample has been selected for a study, sample data will be collected and

tabulated for a set of basic items for which there are already available known population totals in

addition to the items of special interest in the survey. Such known population totals are called

"check data" or "independent information." If the sample results for the known items agree closely


with the known population totals, it is sometimes claimed that this coincidence "validates" the

sample and proves it will provide good results for other items.

Actually, this so-called "validation" does not demonstrate that we have a "good" sampling procedure,

or that the sample will yield "good" estimates for the other items in the survey. It is only on the basis

of a random method of selecting the sample that we are able to attach a sampling error to our

statistics, and to evaluate the probability that the estimates will be within specified limits of the true

value; therefore, we cannot rely exclusively on such "validation." There are, however, three acceptable uses of check data:

(1) Available check data may be used in improving the method of sampling; for example,

in providing a basis for stratification. (This is the subject of the next two chapters.)

(2) It is possible to calculate the standard errors of the estimates made from the sample

data. If the check data and sample estimates of the same items differ more than might

reasonably be expected from the size of the calculated standard errors, this may

indicate that the sampling procedures may not have been carried out properly, the

sampling frame has coverage errors, or something else may have gone wrong in the

implementation of the survey. Further investigation is needed.

(3) Check data may be used in improving the method of making estimates from the

sample; for example, by adjusting the sample estimate by the ratio of the true value of

the check item to the sample estimate of this check item (using a ratio estimate). We

will discuss this more fully in later chapters.

The above three applications of the use of check data (or independent information) are acceptable,

since we can make statistical inferences when using them.

Recall that the sampling weight of a sample unit is equal to the reciprocal of the probability of

selection. In SRS, the probability of selection is (n / N). Therefore, the sampling weight is equal to:

Sampling Weight = 1 / (n/N) = N/n

A sample is self-weighting if every unit in the sample has the same probability of selection. By their

very nature, SRS samples are self-weighting. However, in practice, most complex designs produce

non-self-weighting samples. For instance, a higher sampling fraction is used for the stratum that

contains large businesses (sometimes all of them are chosen); in demographic surveys (say health)

we may oversample special minority groups in order to obtain better estimates with smaller

variances. In addition, almost any self-weighting design becomes non-self-weighting due to


adjustments to the basic weights.
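A short Python sketch of this relationship (the population and sample sizes are hypothetical):

```python
# Under SRS each unit's probability of selection is n/N,
# so its sampling weight is the reciprocal, N/n.
N = 1_000   # population size (hypothetical)
n = 50      # sample size (hypothetical)

probability = n / N      # 0.05
weight = N / n           # 20.0 -- the same as 1/probability

print(probability, weight)
```

In a self-weighting sample every unit carries this same weight, so weighted totals are simply (N/n) times the sample totals.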

Exercises

6.1 You have a population of 185 persons. Select a systematic sample of 20 persons. List the

numbers assigned to them and describe the procedure you used in the selection.

6.2 Suppose that a city block contains 125 housing units. We wish to select a systematic sample of

10 housing units. Follow the steps we discussed in this chapter to accomplish this.


Chapter 7

STRATIFIED SAMPLING-BASIC THEORY

__________________________________________________________________________

7.1 DESCRIPTION OF THE STRATIFICATION PROCEDURE

In simple random sampling, we do not try to force the sample to be representative of different groups

in the population. The tendency to be representative is inherent in the procedure itself and the

sampling error can be reduced only by increasing the size of sample. However, if something is

known in advance about a population, it may be possible to use this information in stratification and

thus reduce the sampling error. The judgment of experts may be useful here.

Stratified random sampling is a method in which the elements of the population are divided into

groups (strata), and a simple random sample is selected for each group, taking at least one element

from each group (stratum). One element from each group is sufficient to estimate the mean, but two

are needed to estimate its reliability; generally many more than two are needed to make the estimates

sufficiently precise. The process of establishing these groups is called stratification and the groups

are called strata. The strata may reflect regions of a country, densely populated or sparsely populated

areas, various ethnic or other groups.

In stratification we group together elements which are similar, so that the population variance

within stratum h is small; at the same time, it is desirable that the means of the several strata be

as different as possible. The letter h will be used to identify the strata so that if L strata are created, h

will go from 1 to L.

In stratified sampling, the probabilities of selection may be the same from group to group, or they

may be different. It is not necessary that all elements have the same chance of selection, but the

chance of each must be known. Under stratified random sampling all the elements in a particular

stratum have equal chances of being selected. While not every combination of elements is possible,

all of the possible samples (that is, combinations of elements) that might be drawn have the same

chance of occurring.

In stratified sampling, the selection of sampling units, the location and enumeration of the selected

units, distribution and supervision of fieldwork and, in general, the whole administration of the

survey is greatly simplified. The procedure, however, presupposes the knowledge of the strata sizes,

that is, the total number of sampling units in each stratum as well as the availability of a frame for

selecting a sample from each stratum.

The most important aspect of a good stratification is that it lowers significantly the sampling error of

the estimates if the stratification variable is highly correlated with the variables of interest.


7.2 NOTATION

We use the same notation as for simple random sampling, except that there will be a subscript to

indicate a particular stratum when we refer to information regarding this stratum. Thus, N will

represent the total number of elements in the population, as before; but N1 will be the number in the

first stratum, N2 will be the number in the second stratum, etc. Similarly, n will be the total sample

size; n1 will be the size of the sample in the first stratum, n2 will be the size of the sample in the

second stratum, etc. The subscript h denotes the stratum and i the unit within the stratum. As in the

case of simple random sampling, capital letters refer to population values and lower case letters

denote corresponding sample values. The following notation will be used:

                              Population   Stratum h   Sample

Number of strata              L            --          --
Number of elements            N            Nh          nh
Mean                          Ȳ            Ȳh          ȳh
Variance                      S²           Sh²         sh²


7.2.1 Illustration for a Whole Population

Suppose we have a universe of eight farms with known value of land and buildings as follows:

Farm    Value of land and buildings

1       $2,026

2        6,854

3        1,532

4        2,180

5        5,408

6        9,284

7        1,438

8        8,836

Let us compute the average (mean) and the standard deviation of these values. In terms of the

notation above, we would have

N=8

Ȳ = $4,694.75

S = $3,326.04

Now let us arrange the farms into two strata, so that the groupings of values are as follows:

Stratum 1 Stratum 2

$1,438 $5,408

1,532 6,854

2,026 8,836

2,180 9,284

If we compute the average and standard deviation of each group of four farms separately, we would

have


Stratum 1 Stratum 2

N1 = 4 N2 = 4

Ȳ1 = $1,794              Ȳ2 = $7,595.50

S1 = $364.33 S2 = $1,800.45
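These stratum figures can be verified with Python's statistics module; note that stdev uses the N − 1 divisor, which is how S is computed in this manual (small differences from the printed values are rounding):

```python
import statistics

# Values of land and buildings for the eight farms, grouped into the two strata.
stratum1 = [1438, 1532, 2026, 2180]
stratum2 = [5408, 6854, 8836, 9284]

mean1 = statistics.mean(stratum1)   # 1794
mean2 = statistics.mean(stratum2)   # 7595.5
s1 = statistics.stdev(stratum1)     # about 364.33
s2 = statistics.stdev(stratum2)     # about 1800.4

print(mean1, mean2, round(s1, 2), round(s2, 2))
```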

The population mean can be expressed in terms of the stratum totals, as follows:

Ȳ = (Y1 + Y2 + ... + YL) / N     (7.1)

and, since Yh = NhȲh, in terms of the stratum means:

Ȳ = Σh (Nh/N) Ȳh     (7.2)

Within each stratum, simple random sampling is used. We saw previously that for simple random

sampling, ȳ is an unbiased estimate of Ȳ. This suggests that for stratified sampling an estimate of

the population mean can be obtained by substituting, for each stratum mean, the corresponding

estimate from the sample. That is, the mean of the sample elements from the first stratum gives an

estimate of the true mean of the first stratum; the mean of the sample elements in the second stratum

gives us an estimate of the true mean for the second stratum, etc. In symbols, therefore, the estimate

of the population mean from a stratified sample is denoted by ȳst (st for stratified) and is given by:

ȳst = Σh (Nh/N) ȳh     (7.3)

or, writing Wh = Nh/N for the stratum weight,

ȳst = Σh Wh ȳh     (7.4)


7.3.1 Illustration of estimate of mean

A stratified sample is drawn from a population of 1,000 farms to estimate average expenditure by

farm operators for hired labor. There are three strata--the total number of farms in the first is 300; in

the second, also 300; and in the third, 400. The selected samples have 30, 30, and 40 farms in the

three strata respectively. The average expenditure for the 30 farms in the first stratum is $12.20; for

the 30 farms in the second stratum, $25.60; and for the 40 farms in the third stratum, $48.70. For the

sample estimate of the average expenditure for all farms in the population we would have

ȳst = [300(12.20) + 300(25.60) + 400(48.70)] / 1,000 = $30.82
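The same weighted-mean computation, carried out in Python:

```python
# Stratified estimate of the mean expenditure for hired labor (formula 7.3).
N_h    = [300, 300, 400]          # farms per stratum
ybar_h = [12.20, 25.60, 48.70]    # sample mean expenditure per stratum

N = sum(N_h)                                             # 1,000 farms in all
total_est = sum(Nh * yb for Nh, yb in zip(N_h, ybar_h))  # estimated total, 30,820
ybar_st = total_est / N                                  # stratified mean

print(round(ybar_st, 2))   # 30.82
```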

As with simple random sampling, we make an estimate of the population total by multiplying the

estimate of the mean by the total number of elements in the population:

Ŷst = N ȳst     (7.5)

To estimate a proportion for the population, the procedure is similar to that for the mean because a

proportion, Pst, is simply a special case of the mean when the only possible values of the observations are 0 and 1. In this case,

ȳh = ph

for stratified random sampling. The true population proportion Pst is given by

Pst = Σh (Nh/N) Ph

and it is estimated by

pst = Σh (Nh/N) ph     (7.6)


7.4 SAMPLING ERROR OF A STRATIFIED SAMPLE

The sampling errors of the three types of estimates referred to above are computed by using equation

(7.7) for the mean, equation (7.8) for the total, and equation (7.9) for a proportion:

σ(ȳst) = √[ Σh Wh² (1 − nh/Nh) Sh² / nh ]     (7.7)

where Wh = Nh/N.

σ(Ŷst) = N σ(ȳst) = √[ Σh Nh² (1 − nh/Nh) Sh² / nh ]     (7.8)

σ(pst) = √[ Σh Wh² ((Nh − nh)/(Nh − 1)) PhQh / nh ],  where Qh = 1 − Ph     (7.9)

The corresponding formulas for the estimated sampling error for each type of estimate are:

se(ȳst) = √[ Σh Wh² (1 − nh/Nh) sh² / nh ]     (7.10)

se(Ŷst) = N se(ȳst)     (7.11)


se(pst) = √[ Σh Wh² (1 − nh/Nh) phqh / (nh − 1) ],  where qh = 1 − ph     (7.12)

Similar formulas can be derived for the coefficient of variation by dividing the above expressions by

the value of the item being estimated. Thus, for example:

cv(ȳst) = se(ȳst) / ȳst     (7.13)

The formulas for confidence intervals of the population mean and the population total are:

ȳst ± t · se(ȳst)     (7.14)

N ȳst ± t · N · se(ȳst)     (7.15)

These formulas assume that ȳst is normally distributed and that se(ȳst) is well determined, so that the

multiplier t can be read from tables of the normal distribution (see Appendix I). If only a few

degrees of freedom (less than 30) are provided by each stratum, the t-value should be taken from the

tables of Student's t (see Appendix II) instead of the normal table.

7.4.1 Illustration

Let us apply equation (7.7) to the case of the eight farms in the illustration in section 7.2.1. Suppose we

took a sample of four farms out of the eight--two from each stratum--and we have computed ȳst by

equation (7.3). What is the sampling error of ȳst?

Stratum 1                    Stratum 2

N1 = 4                       N2 = 4

n1 = 2                       n2 = 2

S1 = 364.33                  S2 = 1,800.45

S1² = 132,736.35             S2² = 3,241,620.2

Applying equation (7.7),

σ(ȳst) = √[(4/8)²(1 − 2/4)(132,736.35)/2 + (4/8)²(1 − 2/4)(3,241,620.2)/2] = √210,897.28 ≈ $459

It is interesting to compare this sampling error with the corresponding sampling error of the mean for

a simple random sample of four farms. For a simple random sample of four farms, we would have

σ(ȳ) = √[(1 − 4/8)(3,326.04)²/4] ≈ $1,176

In this example, the sampling error of the stratified sample is much smaller than that of the simple

random sample, less than half. In fact, it would require a sample of six farms, using simple random

sampling, to achieve the same reliability (that is, as small a sampling error) as we obtained with a

stratified sample of the four farms.
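Both sampling errors can be reproduced numerically from the stratum figures of the illustration:

```python
import math

N = 8
# (N_h, n_h, S_h) for the two strata of the farm example
strata = [(4, 2, 364.33), (4, 2, 1800.45)]

# Equation (7.7): V(ybar_st) = sum of (N_h/N)^2 (1 - n_h/N_h) S_h^2 / n_h
var_st = sum((Nh / N) ** 2 * (1 - nh / Nh) * Sh ** 2 / nh
             for Nh, nh, Sh in strata)
se_st = math.sqrt(var_st)

# Simple random sample of n = 4 with the overall S = 3,326.04:
n, S = 4, 3326.04
se_srs = math.sqrt((1 - n / N) * S ** 2 / n)

print(round(se_st, 2), round(se_srs, 2))
print(se_st < se_srs / 2)   # True: less than half, as the text notes
```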

7.4.2 Remarks

In actual practice, we usually do not know the true values of the Sh and Ph. Instead, we substitute

sample estimates of these values into equations (7.7), (7.8), and (7.9) to obtain equations (7.10),

(7.11) and (7.12), respectively. To make such estimates from a single sample, we would need at least

two elements from each stratum. (In the examples described above, we were able to compute the

standard error for samples having only one element per stratum because we had information on all

elements in the universe.)

(7.16)


(7.17)

(7.18)

(7.19)

(7.20)

(7.21)

We will now rewrite equation (7.21) in a different way to make some observations.

V(ȳst) = Σh Wh² (1 − nh/Nh) Sh² / nh     (7.22)

V(ȳst) = Σh Wh² Sh² / nh − Σh Wh² Sh² / Nh     (7.23)

From equation (7.22) we can see that if the fpc = 1, i.e., if nh/Nh is negligible in every stratum, then equation

(7.22) becomes:


V(ȳst) = Σh Wh² Sh² / nh     (7.24)

Equation (7.23) has two components. The first component is shown in equation (7.24) and it

represents the variance of the mean when sampling is done with replacement, that is, when the fpc =

1.

The second term in equation (7.23) represents the adjustment that one needs to make when sampling

is done without replacement.

We can also see from equation (7.24) that the variance of the mean is directly proportional to the

stratum population variances. That is, the smaller the population variance in the strata, the smaller the

variance of the mean. In other words, the more homogeneous the strata, the smaller the overall

variance of the mean with stratified sampling.


Exercises

7.1 Suppose you have a population of 12 persons whose hourly earnings are as follows:

Person   Hourly earnings ($)

1        0.85

2 1.35

3 0.60

4 2.20

5 1.80

6 3.10

7 0.90

8 1.50

9 1.75

10 0.75

11 2.40

12 2.10

b. What is the sampling error of the mean for a sample of six persons selected as a simple

random sample?

c. Stratify this population into three strata of equal size in the best way to estimate average

earnings. List the persons in each stratum by their hourly earnings.

d. Select a sample of six persons--two from each stratum. Suppose from stratum I we obtain in

sample the values (0.60) and (0.90); from stratum II we obtain the sample values (1.35) and

(1.50); and from stratum III we get (3.10) and (2.10).

Estimate the average (mean) hourly earnings for this sample.


Chapter 8

STRATIFIED SAMPLING-ALLOCATION TO STRATA

______________________________________________________________________________

The definition of stratified sampling does not specify a particular size of sample in a stratum. The

sample can be selected so as to have the same size in each stratum, or it can be distributed in some

other way. As long as we select at least one element per stratum, the specification for a stratified

sample is satisfied; and with two elements per stratum we can estimate both the mean and its error.

Usually the total sample size is much larger than two elements per stratum. Hence, the question

arises as to what criterion should be used in allocating the total sample among the strata.

Let us return to the earlier example of a population of eight farms in two strata. If we wish to select

a sample of two farms to estimate the mean, we have no choice but to take one farm from each

stratum. Suppose, however, that we wish to select four farms. Then we have a choice in the

allocation of the sample. Would it be better to select two farms from each stratum or take one farm

from one stratum and three farms from the other?

There are two important criteria for determining how the sample should be distributed among the

various strata. The first criterion is convenience; that is, choose a method which is easy to apply and

simple to tabulate. This usually leads to the use of proportionate or proportional (allocation)

stratified sampling. The second criterion is precision: choose a method which will provide the

smallest sampling variance (or sampling error). This leads to the use of optimum allocation.

It is very common in stratified sampling to select the same proportion of units in each stratum. With

this method, to take a 10-percent sample of a given population, we would take a 10-percent sample

from each stratum.

Since the sampling rates in all strata are the same, the number of elements taken for the sample will

vary from stratum to stratum, depending on the size of the stratum. Within each stratum, the sample

size will be proportionate to the total population of the stratum. We can express this mathematically

as follows:

nh / Nh = n / N     for every stratum h,

or alternatively

nh = (Nh / N) n

For the population characteristics that we are usually interested in (namely, Ȳ and Y), we can

prepare estimates from a proportionate stratified sample as easily as from a simple random sample--

in fact, by using the same formula

ȳst = (1/n) Σh Σi yhi = ȳ     (8.1)

In this formula, the sum is for all sample elements without regard to strata; since (Nh/nh) is a

constant, and equal to (N/n), equation (7.3) of chapter 7 reduces to this form. We also have

Ŷst = N ȳst = (N/n) Σh Σi yhi     (8.2)

The simple weighting procedure makes proportionate sampling attractive since results are easy to

tabulate. Different strata do not have to be tabulated separately. All of the sample data can be added

together before application of any factors such as (1/n) or (N/n). A sample which has this feature is

self-weighting. That is, in a self-weighting sample, every individual observation has the same

probability of selection and, consequently, the same weight. The true standard error of the mean

estimated from a proportionate stratified sample is

V(ȳprop) = [(1 − n/N)/n] Σh Wh Sh²     (8.3)

σ(ȳprop) = √{[(1 − n/N)/n] Σh Wh Sh²}     (8.4)

σ(Ŷprop) = N σ(ȳprop)     (8.5)

Proportional allocation has several advantages:

1. In order to use this allocation procedure we don't need to know the stratum variances (as the

methods we'll discuss later do).

2. Other methods require us to know the costs of sampling units in the different strata, but not

this method.

3. The increase in precision from other more elaborate methods is not very large.

4. It is efficient for national-level estimates.


However, we will see later on that when there is a very large variation in the stratum variances, the

gain in precision obtained by other methods may outweigh the simplicity of proportional allocation.

Even so, as shown later, this method is widely used in applied sample design.

Sometimes we have to conduct a survey with a fixed amount of money and we may be faced with the

fact that the cost of sampling units in different strata differs widely. For instance, it is a well-known

fact that sampling units in rural areas are generally more expensive to cover than those in urban areas, because the

distances are longer and sometimes sampling units are more difficult to find. The term optimum

allocation refers to the optimum (the most efficient) way of allocating the total sample (n) to the

different strata. The formula is given by:

nh = n · (NhSh/√ch) / Σh (NhSh/√ch)

where ch is the cost of sampling one unit in stratum h. The above formula is obtained by finding the allocation that minimizes the variance of the estimated mean for a given total cost.

When the costs of sampling in the different strata are the same, the optimum allocation formula is

called Neyman allocation, after Jerzy Neyman (1934), who investigated mathematically the question

of what distribution of the sample among strata would give the smallest possible sampling error. He

found that the answer was to let the sampling rate in each stratum vary according to the amount of

variability in the stratum--in other words, to make the sampling rate in a given stratum proportional

to the standard deviation in that stratum. The number of elements to be sampled from any stratum,

then, would depend not only on the total number of elements in that stratum, but also on the standard

deviation of the characteristic to be measured. For Neyman allocation, the number to be selected

within a stratum is given by the following formula:

nh = n · NhSh / Σh NhSh     (8.6)

With Neyman allocation, the formula for the variance of the mean (after using (8.6) in formula (8.3))

reduces to

V(ȳst) = (Σh WhSh)² / n − Σh WhSh² / N     (8.7)

The second term on the right represents the use of the fpc.


As before, the standard error of the total is given by the following formula:

σ(Ŷst) = N σ(ȳst)     (8.8)

For this type of allocation, it is necessary to know the values of Sh in the universe. If these are not

known in advance, then Sh may be estimated within each stratum, by using the methods described in

Section 5.4 of Chapter 5 (p. 53).

Note that in formula (8.6), when the Sh are all equal, Neyman allocation becomes proportionate

allocation.

8.3.1 Illustration

Let us compare the standard errors arising from proportionate and optimum allocation in the same

survey. In 1942, a census of lumber production was taken in the United States. In 1943, the survey

was to be repeated, but on a sample basis. Before selecting the sample, mills were grouped into

strata, on the basis of their 1942 production; an analysis of the data produced the information

presented in Table 8.1.

Table 8.1

(Production figures and standard deviations given in thousands of board feet)

Stratum   1942 Annual       Number of   Average         Standard
          production        mills       production in   deviation
                            (Nh)        stratum         for 1943
                                                        (Sh)*

1         5,000 and over       538       11,029.7        9,000
2         1,000 to 4,999     4,756        1,779.6        1,200
3         Under 1,000       30,964          203.8          300

Total                       36,258          571.2      **1,684

*Estimated from 1942 data.   **For unstratified sampling.

Now let us select a sample of 1,000 mills. The first question to consider is how to determine the

sample size in each stratum, under either proportionate sampling or optimum allocation sampling.

The second question to consider is the resulting reliability of the two methods. Let us consider first

the matter of the sample size, then the matter of reliability.


8.3.2 Sample Size in Each Stratum

For proportionate allocation, since the sampling rate is 1,000 out of 36,258, this rate is used in each

stratum. The sample sizes, therefore, would be:

n1 = (1,000/36,258) × 538 = 15

n2 = (1,000/36,258) × 4,756 = 131

n3 = (1,000/36,258) × 30,964 = 854.
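In Python, the proportionate allocation is simply the overall rate applied to each stratum:

```python
# Proportionate allocation for the lumber survey: n_h = n * N_h / N.
N_h = [538, 4756, 30964]   # mills per stratum (Table 8.1)
n = 1000
N = sum(N_h)               # 36,258

n_h = [round(n * Nh / N) for Nh in N_h]
print(n_h)          # [15, 131, 854]
print(sum(n_h))     # 1000
```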

For optimum allocation, the sample size in each stratum would be determined by the following table.

Table 8.2

Stratum   Number of   Standard    NhSh          Proportion   Number in   Sampling
          mills       deviation                 of total     sample      rate
          (Nh)        (Sh)                                   (nh)*

1            538       9,000       4,842,000      0.244         244        1/2
2          4,756       1,200       5,707,200      0.288         288        1/16
3         30,964         300       9,289,200      0.468         468        1/66

Total     36,258                  19,838,400      1.000       1,000

*nh = n × NhSh / Σ NhSh
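The Neyman allocation in Table 8.2 can be reproduced by allocating the sample in proportion to NhSh (equation 8.6):

```python
# Neyman allocation: n_h proportional to N_h * S_h.
N_h = [538, 4756, 30964]
S_h = [9000, 1200, 300]
n = 1000

products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
total = sum(products)          # 19,838,400

n_h = [round(n * p / total) for p in products]
print(n_h)      # [244, 288, 468]
```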

What are the standard errors for these two sample designs? For proportionate allocation, the

standard error of the estimate of the mean is given by equation (8.4):


For the survey of lumber production,

Σh WhSh² = [538(9,000)² + 4,756(1,200)² + 30,964(300)²] / 36,258 ≈ 1,467,630

and

σ(ȳprop) = √[(1 − 1,000/36,258)(1,467,630)/1,000] ≈ 37.8

For optimum allocation, the corresponding standard error is given by equation (8.7):

σ(ȳopt) = √[(19,838,400/36,258)²/1,000 − 1,467,630/36,258] = √(299.4 − 40.5) ≈ 16.1

To complete the analysis, one may compare these results with those obtained if we had not stratified

the mills, but had taken a simple random sample of 1,000 mills from the universe. In this case, the

standard error is given by:

σ(ȳran) = √[(1 − 1,000/36,258)(1,684)²/1,000] ≈ 52.5

8.4 COMPARISON OF SAMPLING METHODS

Examining the results of the sample designs above, we see that optimum allocation gave us a standard

error of 16.1 thousand board feet, considerably smaller than that under proportionate sampling, which

was 37.8; we see also that the sampling error under proportionate sampling was smaller than that

under simple random sampling, which was 52.5. Putting the results another way, it would require a

proportionate sample more than 5 times as large as an optimum allocation sample to achieve the same

reliability. Simple random sampling would require a sample 10 times as large. The efficiency of

optimum allocation results from the fact that it provides for more intensive sampling in strata having

large standard deviations, which can be expected to contribute more heavily to the total sampling

error.
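The three standard errors quoted above can be recomputed from the figures in Table 8.1:

```python
import math

N_h = [538, 4756, 30964]
S_h = [9000, 1200, 300]
N, n = sum(N_h), 1000
W = [Nh / N for Nh in N_h]     # stratum weights W_h = N_h / N
fpc = 1 - n / N

# Proportionate allocation: sqrt(fpc/n * sum of W_h S_h^2)
se_prop = math.sqrt(fpc / n * sum(w * s**2 for w, s in zip(W, S_h)))

# Optimum (Neyman) allocation: (sum W_h S_h)^2 / n minus the fpc term sum W_h S_h^2 / N
var_opt = (sum(w * s for w, s in zip(W, S_h)) ** 2) / n \
          - sum(w * s**2 for w, s in zip(W, S_h)) / N
se_opt = math.sqrt(var_opt)

# Simple random sampling with the overall S = 1,684:
se_srs = math.sqrt(fpc * 1684**2 / n)

print(round(se_prop, 1), round(se_opt, 1), round(se_srs, 1))   # 37.8 16.1 52.5
```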

The example in section 8.3 above illustrates a general result which can be demonstrated

mathematically. The sampling errors of the three types of designs are approximately related in the


following way (if the sampling rates are small enough so that the finite correction factors can be

ignored):

V(ȳran) = V(ȳprop) + (1/n) Σh Wh (Ȳh − Ȳ)²     (8.9)

V(ȳprop) = V(ȳopt) + (1/n) Σh Wh (Sh − S̄)²,  where S̄ = Σh WhSh     (8.10)

where V(ȳran), V(ȳopt), and V(ȳprop) are,

respectively, the variances of the estimated means based on simple random sampling, optimum and

proportionate sampling.

An examination of this formula shows that sampling errors obtained with optimum allocation will be

at least as small, and usually smaller, than those obtained with proportionate stratified sampling.

Furthermore, the errors obtained with either of these methods will be at least as small, and generally

smaller, than those obtained with simple random sampling. (There are a few rare cases, which almost

never occur in practice, in which this is not true. When the sample is very small and the stratification

is completely ineffective, neither proportionate sampling nor optimum allocation may show a gain

over simple random sampling. For all practical purposes, this possibility can be ignored.)

Consider the conditions under which important differences result from the three methods. When we

compare proportionate stratified sampling with simple random sampling, it can be shown that the gain

in reliability depends on the amount by which the means of the strata vary; the greater the variation

between the means (in other words, the greater the differences among the strata), the more the

reduction in the standard error arising from the use of proportionate sampling. On the other hand, if

the variance between stratum means is fairly small compared to the total variance, not much will be

gained by stratification. As a result, stratification is usually less important in dealing with proportions

than with measured items (or with aggregates or quantities). For example, it would be of much

greater help in trying to estimate the average expenditure of farmers for hired labor than in estimating

the proportion of farmers who hire labor. Even for measured items the gains would be slight unless

the strata are established so that the differences between the means are sizable (as was the case in the

example of lumber mills). For example, in conducting a survey to measure personal income, it would

probably not pay to establish separate strata for different professional groups--for example, doctors,

lawyers, etc. It probably would be useful, however, to set up separate strata for broader groups--

laborers, businessmen, professionals, etc. Since proportionate sampling is nearly always better than

simple random sampling, stratification is recommended whenever it can be accomplished with little

additional work.

Comparing optimum allocation with proportionate allocation, we see that if the standard deviations in

all strata are the same, the two methods are identical. The greater the differences between the

standard deviations in the strata, the greater the reduction in sampling error to be expected from


optimum allocation. Unless the range among the standard deviations is greater than 2 or 3 to 1, the

gains of optimum allocation are so small that they are probably not worth the extra complications in

tabulation. With larger variations in standard deviations, the gains are appreciable and optimum

allocation is advisable. In the example of lumber mills, the standard deviation for stratum 1 was 30

times as large as that for stratum 3.

We need to know the Sh for each stratum either (a) to apply optimum allocation or (b) to estimate the

errors of proportionate stratified samples. Of course, in practice, we never really know each Sh and

must estimate it. Two questions arise: (a) How is the accuracy of the sample affected by the errors

introduced by estimating Sh instead of knowing the true value? (b) What methods can be used to

estimate these quantities?

In answer to the first question, if our estimates of the standard deviation are fairly reasonable (for

example, accurate to within 30% or 40%) we will obtain almost all of the gains of optimum

allocation. The reason for this is that the sampling error does not increase very rapidly as the

allocation departs from the optimum within fairly broad limits. (It should be noted that poor guesses

of the values of Sh do not introduce any biases in the result; they only increase the sampling errors.)

However, if the estimates of Sh are very unreliable, the "optimum allocation" may have a larger

variance than proportionate allocation. In this case, it is safer to use proportionate allocation.

In regard to the second question, we can use the methods for estimating the standard deviations

described previously (Section 5.4 of Chapter 5). One additional method that is sometimes used is to

assume that the standard deviations for the strata are proportional to the average values within the

strata; that is, assume the same relative standard deviation in each stratum. (Note that for optimum

allocation, it is not necessary to know the absolute values of the standard deviations; it is only

necessary to know their values relative to each other.) This assumption will frequently give results

reasonably close to the optimum. In the case of the lumber mills discussed previously, this would

give us a sample with approximately the following distribution by strata (allocating in proportion to the stratum totals NhȲh): n1 = 287, n2 = 409, n3 = 305.

It can be seen that this allocation is much closer to optimum allocation than is proportionate

allocation. In fact, if the standard error of this allocation is computed, it turns out to be 17.3. This is

not quite as good as the 16.1 for optimum allocation, but it is far superior to the 37.8 obtained with

proportionate sampling.
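This allocation and its standard error can be checked numerically; the nh below follow from rounding n·NhȲh / Σ NhȲh, and the standard error then uses the true Sh from Table 8.1:

```python
import math

N_h    = [538, 4756, 30964]
ybar_h = [11029.7, 1779.6, 203.8]   # 1942 stratum means
S_h    = [9000, 1200, 300]          # standard deviations for 1943
N, n = sum(N_h), 1000

# Assume S_h proportional to the stratum mean: allocate in proportion to N_h * Ybar_h.
totals = [Nh * yb for Nh, yb in zip(N_h, ybar_h)]
n_h = [round(n * t / sum(totals)) for t in totals]
print(n_h)    # roughly [287, 409, 305]

# Standard error of this allocation under equation (7.7):
var = sum((Nh / N)**2 * (1 - nh / Nh) * Sh**2 / nh
          for Nh, nh, Sh in zip(N_h, n_h, S_h))
print(round(math.sqrt(var), 2))   # close to the 17.3 quoted in the text
```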


8.5 OPTIMUM ALLOCATION WITH VARIABLE COSTS

The discussion of optimum allocation thus far has been in terms of getting the most reliable results for

a given total sample size. It frequently happens that the costs of obtaining information vary

substantially from stratum to stratum. To give an example, let us suppose that families have been

stratified by urban and rural residence; furthermore, suppose that the cost of conducting a rural

interview is five times as great as that of an urban interview. It would be wise to concentrate more of

the sample in the cheaper stratum. Another example would be a sample survey of business firms; we

may mail questionnaires to small companies and visit large ones personally, when there are large

differences in unit costs.

A more general approach than the one which is described in section 4 above is to consider the

optimum allocation for a fixed cost, rather than for a fixed sample size. In other words, we would

like to allocate the sample among strata in such a way as to achieve the lowest standard error with a

fixed budget.

For this we need a cost function, which is a mathematical formulation expressing the cost of taking

the survey in terms of the sample sizes, nh. Suppose the average cost for a single questionnaire in the

hth stratum is called Ch. Thus C1 is the cost per questionnaire in the first stratum, C2 is the cost in the

second stratum, etc. Ch represents the total cost of a questionnaire in the hth stratum, including the

cost of interviewing, coding, data entry, etc. (There may also be an overhead cost for the survey

which does not depend on the size of the sample, but it is not necessary to consider this in the cost

function.) The total cost of the survey which can be affected by the sample size is

C = Σh Chnh = C1n1 + C2n2 + ... + CLnL

For a fixed cost C, the optimum allocation of the sample turns out to be

nh = n · (NhSh/√Ch) / Σh (NhSh/√Ch)     (8.11)

Note: To use this formula, n must first be calculated. Note that nh is a function of the C h 's, Sh 's, and Nh 's. See Sample Survey

Methods and Theory, Volume I: Methods and Applications, by Hansen, M.H., Hurwitz, W.N., and Madow, W.G. New York,

Wiley and Sons, 1953, p. 221.

That is, nh is directly proportional to Nh and to Sh, and inversely proportional to √Ch. Formula (8.11)

leads to several rules. In a given stratum, we would take a larger sample under the following conditions:

(1) If the stratum contains a larger number of elements.

(2) If the stratum is more variable internally than the average stratum.


(3) If the cost of collection and processing is cheaper than in the average stratum.

In regard to the third point, the cost per stratum (Ch) enters into the formula in the form of a square

root. This tends to reduce the effect of the differences in unit cost. Unless the costs vary by a factor

of at least 2 to 1, using the formula above will give results not very much different from the simpler

Neyman allocation given in equation (8.6).

In equation (8.11), we do not yet know the value of n. If cost is fixed, substitute the value of nh from

(8.11) in C = Σh Chnh and solve for n. This gives

n = C · Σh (NhSh/√Ch) / Σh (NhSh√Ch)     (8.12)

If instead the variance V of the mean is fixed, the required total sample size is

n = (Σh WhSh√Ch)(Σh WhSh/√Ch) / (V + (1/N) Σh WhSh²)     (8.13)

Note that in case Ch = c, that is, if the cost per unit is the same in all strata, then the cost becomes

C = c·n, and also equation (8.11) reduces to equation (8.6), which is the formula for

Neyman allocation. That is, optimum allocation for fixed cost reduces to optimum allocation for

fixed sample size.

8.5.1 Illustration

Suppose a sampler proposes to take a stratified random sample. He expects that his field costs will be

of the form C = c1n1 + c2n2. His advance estimates of relevant quantities for the two strata are as

follows:

Stratum 1 Stratum 2

N1 = 1,056 N2 = 1,584

S1 = 10 S2 = 20

C1 = $4 C2 = $9

(a) Find the sample size required under optimum allocation, to make the probability about 0.95 that ȳst differs from the population mean by no more than 2 units. Ignore the

fpc.

(b) Determine the sample size for each stratum (i.e., the allocation of the total sample size n to

each of the two strata).


(c) How much will the total field cost be (excluding overhead costs)?

Solution of (a)

to 2 (it should be 1.96 to be exact). Therefore,

Solution of (b)

The sample size in each stratum is given by equation (8.11). For the sample size in the first stratum:

Similarly, n2 = 159.

The formula for the optimum allocation of the sample (equation 8.6 or 8.11) is necessarily computed

for a single characteristic or variable, Y. If it is desired to obtain the most favorable sample allocation

for several characteristics, some kind of compromise must be made. Some alternatives are:

(1) Determine the most important item (or group of highly correlated items) and allocate the

sample to get the best estimate for this item.

(2) Follow the procedure in (1) and increase the size of the sample in some strata to provide

adequate coverage of other important items.

(3) Set up a function which assigns a weight to each item according to its importance; use this


function in the allocation to prevent poor sample estimates for the most important

characteristics.

Optimum allocation is most effective for characteristics which vary widely for the individual units;

such as amount of personal income, number of board feet produced by a sawmill, kilos of maize

harvested on a farm, etc.

In sampling for attributes, however, such as the proportion of the population in a class (for example,

in the income class $1,000 - $1,999), proportionate sampling may be the best allocation. It has the

added advantage of being self-weighting.

Before concluding this chapter, some comments will be made on the problem of sample allocation when the object is to estimate a population proportion P. From equation (7.9) of chapter 7, we have for stratified random sampling,

(8.14)    V(pst) = Σ (Nh/N)² (Ph Qh / nh) · (Nh − nh)/(Nh − 1)

(8.15)    V(pst) ≈ (1/n) Σ (Nh/N) Ph Qh    (proportional allocation)

(8.16)    V(pst) ≈ (1/n) [ Σ (Nh/N) √(Ph Qh) ]²    (Neyman allocation)

For the sample estimate of the variance, substitute the sample proportions ph and qh = 1 − ph for the unknown Ph and Qh in any of the formulas above.

Optimum allocation will differ substantially from proportional allocation only if the quantities √(Ph Qh) differ considerably from stratum to stratum. For example, let the Ph lie between 0.3 and 0.7, in which case √(Ph Qh) will lie between 0.46 and 0.50. In this situation the optimum allocation will not be preferred to proportional allocation when the simplicity of the computations involved is another factor to be taken into account.


We can choose nh in order to minimize the variance of pst, either for a fixed total sample size or for a fixed total cost. Thus,

(8.17)    nh = n · Nh √(Ph Qh) / Σ Nh √(Ph Qh)

(8.18)    nh = n · Nh √(Ph Qh / ch) / Σ Nh √(Ph Qh / ch)

where cost = Σ ch nh.

8.7.1 Illustration

In a firm, 62% of the employees are skilled or unskilled males, 31% are clerical females, and 7% are

supervisory. A sample of 400 employees is taken from a total of 7,000 employees. Based on the

sample, the firm wishes to estimate the proportion that uses certain recreational facilities. Rough

guesses are that the facilities are used by 40 to 50% of the males, 20 to 30% of the females, and 5 to 10% of the supervisors. How would you allocate the sample among the three groups? What would the standard error of the estimated proportion pst be? Ignore the fpc.

We have,

N = 7,000 n = 400

N1 = 4340, N2 = 2170 and N3 = 490

Using equation (8.17), we can allocate the total sample size (n = 400) to the different strata, as

follows:


Similarly, n2 = 116 and n3 = 16, so that n1 = 400 − 116 − 16 = 268.
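The allocation can be checked numerically. The sketch below takes the midpoints of the rough usage guesses as the stratum proportions (P1 = 0.45, P2 = 0.25, P3 = 0.075, which is an assumption) and applies the equal-cost allocation nh ∝ Nh √(Ph Qh):

```python
# Check of illustration 8.7.1: allocation proportional to N_h * sqrt(P_h * Q_h).
# The P_h below are midpoints of the rough guesses in the text (an assumption).
from math import sqrt

N_h = [4340, 2170, 490]       # males, females, supervisors
P_h = [0.45, 0.25, 0.075]
n = 400

w = [N * sqrt(P * (1 - P)) for N, P in zip(N_h, P_h)]
n_h = [round(n * wh / sum(w)) for wh in w]
# n_h comes out [268, 116, 16], matching the figures in the text.
```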

In simple random sampling, we saw that the determination of n depended on the sampling variance of

the estimator. In a similar way, for stratified sampling, we need to know the formulas for the

sampling variances of the different methods of allocation in order to determine n for each one of these

methods. Let's summarize the methods of allocation.

1. Equal allocation: the same sample size in each stratum.

2. Proportionate allocation.

3. Optimum allocation: fixed cost, unequal sampling costs among strata.

4. Neyman allocation: fixed sample size, equal sampling costs among strata.

We also saw that the stratum sample sizes nh for these methods of allocation were given by:


(8.20)    nh = n · Nh / N    (proportionate)

(8.21)    nh = n · (Nh Sh / √ch) / Σ (Nh Sh / √ch)    (optimum)

(8.22)    nh = n · Nh Sh / Σ Nh Sh    (Neyman)

To determine the sample size we need to know the variances of these methods. So, we start with the

formula for the variance of a mean when using stratified random sampling. Recall that the formula is

given by:

(8.23)    V(ȳst) = (1/N²) Σ Nh² (Sh²/nh) (1 − nh/Nh)

Substituting each method of allocation into (8.23) gives (L is the number of strata):

(8.24)    V(ȳst) = (L/(n N²)) Σ Nh² Sh² − (1/N²) Σ Nh Sh²    (equal)

(8.25)    V(ȳst) = (1/(n N)) Σ Nh Sh² − (1/N²) Σ Nh Sh²    (proportionate)

(8.26)    V(ȳst) = (1/(n N²)) (Σ Nh Sh √ch)(Σ Nh Sh / √ch) − (1/N²) Σ Nh Sh²    (optimum)

(8.27)    V(ȳst) = (1/(n N²)) (Σ Nh Sh)² − (1/N²) Σ Nh Sh²    (Neyman)

Now, let's see how to determine the sample size n to estimate the mean with an error of estimation E.

The sample size is directly related to the error we are willing to tolerate (or the precision we are

required to obtain) in our estimates. As before, we define the error the following way:


Error of estimation = E = k · S(ȳst)

where k is the level of reliability. So, given the precision E that we need to obtain and the level of

reliability k, we can write:

(8.28)    B² = E²/k² = V(ȳst)

We know that as n increases, the variance of the estimate becomes smaller. Therefore, we need to

find the sample size n that will give us a variance equal to B2.

Let's try to solve for n in equation (8.24), that is, when we have equal samples.

(8.29)    B² = (L/(n N²)) Σ Nh² Sh² − (1/N²) Σ Nh Sh²

Multiply each side of equation (8.29) by N2 and leave the term which contains n on one side of the

equation. After we do this, we obtain:

(8.30)    (L/n) Σ Nh² Sh² = N² B² + Σ Nh Sh²

(8.31)    n = L Σ Nh² Sh² / (N² B² + Σ Nh Sh²)

Now, when (nh/Nh) is very small (negligible), the fpc = 1 and we may omit from the denominator of equation (8.31) the term Σ Nh Sh².

Applying a similar procedure to equations (8.25), (8.26), and (8.27), we obtain the sample size n

given by the following formulas:

(8.32)    n = N Σ Nh Sh² / (N² B² + Σ Nh Sh²)    (proportionate)


(8.33)    n = (Σ Nh Sh √ch)(Σ Nh Sh / √ch) / (N² B² + Σ Nh Sh²)    (optimum)

(8.34)    n = (Σ Nh Sh)² / (N² B² + Σ Nh Sh²)    (Neyman)

As before, when the fpc = 1, the denominator in equations (8.32), (8.33), and (8.34) contains only the term N²B². Another important point to mention is that all the formulas for n have been given in terms

of the stratum population variances (Sh). In practice, we don't know this value and it has to be

estimated by means of a sample or from other sources.
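As a sketch, the sample-size computations for proportional and Neyman allocation can be coded directly from the formulas above. The stratum figures in the example call are illustrative values, not data from the text; B denotes the target standard error E/k:

```python
# Sample size for a target standard error B of the stratified mean:
# numerator over N^2*B^2 + sum(N_h*S_h^2), with the fpc retained
# in the denominator (forms of equations 8.32 and 8.34).

def n_proportional(N_h, S_h, B):
    N = sum(N_h)
    t = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    return N * t / (N ** 2 * B ** 2 + t)

def n_neyman(N_h, S_h, B):
    N = sum(N_h)
    t = sum(Nh * Sh ** 2 for Nh, Sh in zip(N_h, S_h))
    num = sum(Nh * Sh for Nh, Sh in zip(N_h, S_h)) ** 2
    return num / (N ** 2 * B ** 2 + t)

# Illustrative strata (assumed values):
n_prop = n_proportional([500, 300, 200], [10, 20, 30], 1.0)
n_ney = n_neyman([500, 300, 200], [10, 20, 30], 1.0)
# Neyman allocation never requires a larger n than proportional allocation.
```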


Stratified Random Sampling

1. A chain of department stores is interested in estimating the proportion of accounts receivable that

are delinquent. The chain consists of four stores. To reduce the cost of sampling, stratified

random sampling is used with each store as a stratum. Since no information on population

proportions is available before sampling, proportional allocation is used. From the table given

below, estimate P, the proportion of delinquent accounts for the chain, find its sampling error and

calculate the coefficient of variation of the estimate.

Stratum I II III IV

Number of accounts receivable N1 = 65 N2 = 42 N3 = 93 N4 = 25

Sample size n1 = 14 n2 = 9 n3 = 21 n4 = 6

Sample Proportion of delinquent accounts

2. A corporation desires to estimate the total number of man-hours lost, for a given month, because

of accidents among all employees. Since laborers, technicians, and administrators have different

accident rates, it is decided to use stratified random sampling with each group forming a separate

stratum. Data from previous years suggest the following variances for the number of man-hours

lost per employee in the three groups and current data give the following stratum sizes:

N1 = 132 N2 = 92 N3 = 27

3. For Exercise 2, estimate the total number of man-hours lost during the given month and place a

bound on the error of estimation. Use the following data obtained from sampling 18 laborers, 10

technicians, and 2 administrators:

8 24 0 4 5 1

0 16 32 0 24 8

6 0 16 8 12

7 4 4 3 2

9 5 8 1 8

18 2 0


Answer: Total = 1903.90; Error = 676.80

4. A zoning commission is formed to estimate the average appraised value of houses in a residential

suburb of a city. It is convenient to use the two voting districts in the suburb as strata because

separate lists of dwellings are available for each district. From the data given below, estimate the

average appraised value for all houses in the suburb and place a bound on the error of estimation

(note that proportional allocation was used):

Stratum I Stratum II

N1 = 110 N2 = 168

n1 = 20 n2 = 30

5. A corporation wants to know how its division heads rate a certain piece of equipment. A number of division heads will be interviewed by telephone and asked to rate the equipment on a numerical scale. The divisions are located in North America, Europe, and Asia. Hence,

stratified sampling is used. The costs are larger for interviewing division heads located outside

of North America. The following costs per interview, approximate variances of the ratings, and

Ni’s have been established:

c1 = $9    c2 = $25    c3 = $36

N1 = 112    N2 = 68    N3 = 39

The corporation wants to estimate the average rating with a fixed bound on the error of estimation. Choose the sample size, n, which achieves this bound and find the appropriate allocation.

6. A school desires to estimate the average score that would be obtained on a reading

comprehension exam for students in the sixth grade. The school has students divided into three

tracks, with the fast learners in track I and the slow learners in track III. It was decided to stratify

on tracks since this method should reduce variability of test scores. The sixth grade contains 55


students in track I, 80 in track II, and 65 in track III. A stratified random sample of 50 students is

proportionally allocated and yields simple random samples of n1 = 14, n2 = 20, and n3 = 16 from

tracks I, II, and III, respectively. The test is administered to the sample of students with the

following results:

80 92 85 82 42 32

68 85 48 75 36 31

72 87 53 73 65 29

85 91 65 78 43 19

90 81 49 69 53 14

62 79 72 81 61 31

61 83 53 59 42 30

68 52 39 32

71 61

59 42

Estimate the average score for the sixth grade, and place a bound on the error of estimation.

7. Suppose the average test score for the class in Exercise 6 is to be estimated again at the end of

the school year. The costs of sampling are equal in all strata, but the variances differ. Find the

optimum (Neyman) allocation of a sample of size 50 using the data in Exercise 6 to approximate

the variances.

8. Using the data in Exercise 6, find the sample size required to estimate the average score with a

bound of 4 points on the error of estimation. Use proportional allocation.

Answer: n = 33

9. Repeat Exercise 8 using Neyman allocation. Compare the result with the answer to Exercise 8.

Answer: n = 32

10. A forester wants to estimate the total number of farm-acres planted in trees for a state. Since the

number of acres of trees varies considerably with the size of the farm, it is decided to stratify on

farm sizes. The 240 farms in the state are placed in one of four categories according to size. A

stratified random sample of 40 farms, selected using proportional allocation, yields the following

results on number of acres planted in trees:


Stratum I Stratum II Stratum III Stratum IV

0-200 acres 200-400 acres 400-600 acres 600+ acres

N1 = 86 N2 = 72 N3 = 52 N4 = 30

n1 = 14 n2 = 12 n3 = 9 n4 = 5

Stratum I: 97, 67, 42, 125, 25, 92, 105, 86, 27, 43, 45, 59, 53, 21

Stratum II: 125, 155, 67, 96, 256, 47, 310, 236, 220, 352, 142, 190

Stratum III: 142, 256, 310, 440, 495, 510, 320, 396, 196

Stratum IV: 167, 655, 220, 540, 780

Estimate the total number of acres of trees on farms in the state, and place a bound on the error of

estimation.

11. The study of Exercise 10 is to be made yearly with the bound on the error of estimation 500

acres. Find an approximate sample size to achieve this bound if Neyman allocation is to be used.

Use the data in Exercise 10.

Answer: n = 156

12. A psychologist working with a group of mentally retarded adults desires to estimate their average

reaction time to a certain stimulus. He feels that men and women probably will show a

difference in reaction times so he wants to stratify on sex. The group of 96 people contains 43

men. In previous studies of this type it has been observed that the times range from 5 to 20

seconds for men and 3 to 14 seconds for women. The costs of sampling are the same for both

strata. Using optimum allocation find the approximate sample size necessary to estimate the

average reaction time for the group to within 1 second.

Answer: n = 29

13. A county government is interested in expanding the facilities of a day-care center for mentally

retarded children. The expansion would increase the cost of enrolling a child in the center. A

sample survey will be conducted to estimate the proportion of families with retarded children that

would make use of the expanded facilities. The families are divided into those who use the

existing facilities and those who do not. Some families live in the city in which the center is

located and some live in the surrounding suburban and rural areas. Thus, stratified random

sampling is used with users in the city, users in the surrounding county, nonusers in the city, and

nonusers in the county forming strata 1, 2, 3, and 4, respectively. Approximately 90% of the

present users and 50% of the present nonusers would use the expanded facilities. The cost of

obtaining an observation from a user is $4.00 and from a nonuser is $8.00. The difference in cost

is due to the fact that nonusers are difficult to locate.

Existing records give N1 = 97; N2 = 43; N3 = 145; N4 = 68. Find the approximate sample size

and allocation necessary to estimate the population proportion with a bound of 0.05 on the error

of estimation.


Answer: n = 158; n1 = 39; n2 = 17; n3 = 69; n4 = 33

14. The survey in Exercise 13 was conducted and yields the following proportion of families who

would use the new facilities:

Estimate the population proportion, P, and place a bound on the error of estimation. Was the

desired bound achieved?

15. Suppose in Exercise 13 the total cost of sampling is fixed at $400. Choose the sample size and

allocation which minimizes the variance of the estimator, pst, for this fixed cost.

16. The following data show the stratification of all the farms in a county by farm size and the

average acres of corn per farm in each stratum.

Farm Size (acres)    Number of farms (Nh)    Average Corn Acres    Standard deviation (Sh)    NhSh

a. For a sample of 100 farms, allocate the sample size to each stratum under:

(i) Proportional allocation

(ii) Optimum allocation

(iii) Equal allocation

b. For a sample of 100 farms, compute the sampling error of the estimated total for

(i) a simple random sample

(ii) proportional allocation

(iii) Neyman allocation

c. On the basis of this analysis, which of the three methods of allocating the sample would you

recommend?


17. It is desired to estimate the total value of farm products for a population of 5,900 farms. Means

and variances are available from a past census on the value of farm products classified by farm

size and tenure of the operator:

Size and tenure         Number of farms (Nh)   Average value of products   Variance        Standard deviation (Sh)   NhSh

Size of Farm
10 to 49 acres          1,600     1,500      15,000,000     3,872.98     6,196,773.35
50 to 99 acres          1,150     2,200      18,000,000     4,242.64     4,879,036.79
100 to 179 acres        1,200     3,600      35,000,000     5,916.08     7,099,295.74
180 to 259 acres          490     5,500      70,000,000     8,366.60     4,099,634.13
260 to 999 acres          650     6,200     200,000,000    14,142.14     9,192,388.16
1,000+ acres              220    18,000     400,000,000    20,000.00     4,400,000.00
Total: Σ NhSh = 38,370,286.18;  Σ Nh Sh² = 349,620,000,000

Tenure of Operator
Full owner              1,890     3,500      40,000,000     6,324.56    11,953,409.56
Part owner              3,300     2,600      35,000,000     5,916.08    19,523,063.28
Manager                   660     6,900     110,000,000    10,488.09     6,922,138.40
Tenant                     50    18,000     510,000,000    22,583.18     1,129,158.98
Total: Σ NhSh = 39,527,770.22;  Σ Nh Sh² = 289,200,000,000

a. Compute the standard error of the total value of products from a proportionate stratified

sample of 300 farms for each of the two methods of stratification (by size and by tenure of

the operator).

c. Compute the standard error of the estimate of the total value of products, using a simple

random sample of 300 farms.

d. For both methods of stratification, use the Neyman allocation for a sample of 300 farms, and

compute:

(i) The allocation of the sample to the strata.

(ii) The standard error of the estimate of the total value of products.

e. On the basis of this analysis, which of the four methods of allocating the sample would you

recommend?

f. Assume that the sample was stratified by tenure and allocated by the optimum method.

Assume also that the following means by strata were obtained:


Tenure Mean Value of Products

Full owner 2,900

Part owner 6,400

Manager 20,000

Tenant 4,000

Estimate the mean value of products for the population of 5,900 farms.

g. Describe how you would calculate the standard error of the mean computed in (f) above after the survey results are available.

18. With three strata, the values of the Nh, Sh, and ch are as follows:

Stratum Nh Sh ch

1 860 5 2

2 640 4 3

3 1230 6 5

a. Find the sample size in each stratum for a sample of size 200 under an optimum allocation.

19. Using the list of 600 households residing in 30 villages (Appendix IV), select a SRS-WOR of 20

households, and on the basis of the data on the size of these 20 sample households, do the following:

a. Determine the number of households for each zone and then select a sample of size nh(h = 1,

2, 3) in each of the three zones. Use proportional allocation.

(ii) the average household (HH) size and its sampling error.

(ii) the average HH size and its sampling error.

(iii) the coefficients of variation (CVs) for both number of persons and average HH size.

d. Compare the population estimates and standard errors obtained from exercise 19 with those

obtained from SRS-WOR in chapter 5.


CHAPTER 9

RATIO ESTIMATES

_______________________________________________________________________________________________

9.1 REASONS FOR CONSIDERING USE OF RATIO ESTIMATES

In earlier chapters we dealt with the problem of how to design the most efficient sample (from the

point of view of minimizing the standard error) using as much relevant information as we can obtain

about the population. We have seen how to use information for stratification with either

proportionate sampling or optimum allocation, how to take unit costs into account, and how to

choose between different kinds of sampling units. We have seen how to use whatever knowledge we

have of costs and of the variances of different methods of sampling, in order to produce the

maximum amount of information with the resources we have available. All of this analysis has been

in terms of fairly simple estimates, such as the sample mean ȳ and the expanded total Ŷ = Nȳ, in which the estimates were prepared by using only the sample data, the total number of units (N) in the population, and the probabilities of selection.

There are similar formulas for cluster sampling, or for estimation of proportions.

There are, however, more complex methods of estimating these statistics, which under certain

circumstances can result in very large reductions in the standard errors.

Moreover, there are other types of statistics which we wish to measure--such as ratios of two

characteristics, change over time of a single characteristic, etc. For example, we may obtain

information on wages and salary payments and on number of hours worked, but we may be more

interested in estimating the average hourly earnings, rather than total wages and salaries or total

hours worked. From surveys covering two different periods of time, we may be more interested in

finding out whether total wages have gone up or down than in measuring the level at any one time.

The analysis of the standard errors of estimated ratios also helps with the problem of producing more efficient estimates of means and totals.

We shall investigate the simplest and most commonly used method of improving the reliability of an

estimated mean or total, by the use of a special estimating technique which produces a "ratio

estimate." A number of other very powerful tools are useful in particular situations; for example,

difference estimates and regression estimates, double sampling (in which the final sample is selected

from a previously selected larger sample that provides information for improving the final selection

or the estimation procedure), and special methods for the estimation of time series. However, we

will only discuss in this chapter ratio estimates.

Ratio estimation is the most commonly used of the more complex estimation techniques available to

the statistician. It is also the easiest to apply. It is appropriate whenever the units of the population

possess two characteristics that are positively correlated--the higher the correlation, the greater the

gain from using this technique. The simplest kind of ratio estimator, of the form given by equation (9.1), is an estimate of Y (the population aggregate):1

(9.1)    ŶR = (Ŷ / X̂) · X

Here, Ŷ and X̂ are the ordinary estimates of the aggregates of the two characteristics Y and X; the aggregate X must be known in order to estimate the aggregate Y.

Ratio estimates of aggregates are ordinarily applied in the three situations described in sections 9.2.1

to 9.2.3 below.

9.2.1 X Is the Same Characteristic at an Earlier Period

X is the same type of characteristic as Y, but X refers to an earlier time period during which a

complete census was taken. For example, we may have taken a full census of manufacturers in one

year, and wish to take a sample survey the following year. Suppose we wish to estimate the total

1. The ratio estimator of a mean (as an estimate of Ȳ) is obtained by dividing ŶR by N; it has the same rel-variance as ŶR.


value of shipments. For each manufacturing establishment in the sample, we obtain not only yi, the

value of shipments in the survey year, but also xi, the value during the preceding census year. Then Ŷ and X̂ would be estimates from the sample of total shipments for the two years, obtained by the

methods discussed earlier. X is the total value of shipments tabulated from the full census. In this

application, the survey is actually used to measure the rate of change between the two years, using

the identical sample of establishments. The rate of change is then multiplied by the census total for

the previous year.

9.2.2 X Is a Correlated Characteristic with Known Total

Y and X are two different characteristics for the same time period, which are known to be positively

correlated. The true value of the aggregate X is known. For example, for the ith farm in a sample, xi

may be the total hectares in the farm, and yi the payments for farm labor; the total hectares in all

farms, X, is known from another source. If, in general, the larger farms pay more total wages for

farm labor than the smaller ones, the ratio estimate can drastically reduce the sampling error. In this

application, the survey is used to measure a rate (such as the average payment per hectare) which is

multiplied by the known number of hectares.

9.2.3 Y Is a Subset of X

The characteristic Y is a subset of X, varying roughly in proportion to X. For example, xi may be total

acres in the ith farm in the sample, and yi the acres planted to a particular crop on that farm. Another

application is the case in which X is the total number of units of analysis and Y is the number of these

having a particular attribute. For example, yi might be the number of persons in the labor force in the

ith cluster; xi is the total number of persons in this cluster;2 and X is the known total number of

persons in the population. In these cases, the survey is used to measure a ratio which is then

multiplied by the population total (X) for the characteristic in the denominator of the ratio.

In examining equation (9.1), it is clear that X is not derived from the sample. The sampling error in the estimate ŶR is, therefore, dependent on the sampling error of the ratio Ŷ/X̂, with X having only the effect of a constant multiplier. Therefore, an analysis of the sampling error of ŶR

2. In cluster sampling, the estimate of the total number of units of analysis will be a random variable, which is usually not exactly equal to the true figure. Hence, the proportion of units having the attribute must be treated as a ratio of random variables.


is closely related to that of the ratio as an estimate of R = Y/X.

The mathematical form of the distribution of the ratio of two random variables from sample to

sample is much more complicated than that of the simpler estimates discussed earlier. It involves the

relationship of two variables, both of which have sampling errors. Hence, more care is required in

deciding when to use such ratios. The following facts about the variance of ratios and ratio estimates

will indicate when to use a ratio estimator to estimate a mean or an aggregate. They also tell us what

error to expect when using the estimate.

(9.2)    V(r) ≈ R² [ V(Ŷ)/Y² + V(X̂)/X² − 2 Cov(Ŷ, X̂)/(X·Y) ]

(9.3)    V(ŶR) = X² V(r)

For simple random sampling this becomes

(9.3a)    V(r) ≈ ((1 − n/N)/(n X̄²)) (Sy² + R² Sx² − 2 R Syx)

and is estimated by

(9.3b)    v(r) = ((1 − n/N)/(n x̄²)) · Σ (yi − r xi)² / (n − 1)

Equations (9.2) and (9.3) are somewhat simpler if expressed in terms of the coefficient of variation, CV. The square of the coefficient of variation (that is, the rel-variance) of ŶR is the same as that of r and can be expressed as

(9.4)    CV²(ŶR) = CV²(Ŷ) + CV²(X̂) − 2 D · CV(Ŷ) · CV(X̂)

In the above formulas, D is the coefficient of correlation between the variables Y and X. It represents

the correlation of Y and X, not for the elementary units of analysis but for the units used for

sampling. For example, if Y and X represent the incomes of persons in two different years, but the

sample is a cluster sample, the correlation coefficient D will be the correlation between the values Yi

and Xi where Yi is the sum of the incomes for all persons in the ith cluster in the year of estimation

and Xi is the corresponding sum in the base year. Frequently, Cov(Ŷ, X̂) is referred to as the sampling covariance between Ŷ and X̂, and the symbol Syx is used for it. It can be calculated exactly as the variance, but with the cross product (yi − ȳ)(xi − x̄) replacing the square wherever it occurs.

(9.5)    Cov(Ŷ, X̂) = D · S(Ŷ) · S(X̂)

where

(9.6)    S(Ŷ) = √V(Ŷ)

(9.7)    S(X̂) = √V(X̂)

Substituting the sample estimates in place of the population values in equation (9.5), and solving for D, gives the estimate

(9.8)    d = cov(Ŷ, X̂) / [ s(Ŷ) · s(X̂) ]

For stratified sampling,

(9.9)    cov(Ŷ, X̂) = Σ cov(Ŷh, X̂h)

where the cov(Ŷh, X̂h) are within-strata covariances and are computed in exactly the same way, but within each stratum.

If we examine equation (9.4), the formula for the rel-variance of a ratio estimate of a total,

(9.4)    CV²(ŶR) = CV²(Ŷ) + CV²(X̂) − 2 D · CV(Ŷ) · CV(X̂)

we see that CV² of the ratio estimate ŶR can be expressed as CV² of the simpler estimate Ŷ, plus the term CV²(X̂), minus the term 2D·CV(Ŷ)·CV(X̂). Whether we gain or lose by the use of a ratio estimate, as compared with the simpler estimate, depends on whether the last term is larger or smaller than CV²(X̂). In particular, we have the following:

(1) If D > CV(X̂) / [2 CV(Ŷ)], the ratio estimate has the smaller standard error.

(2) If D < CV(X̂) / [2 CV(Ŷ)], the simpler estimate has the smaller standard error.

(3) If D = CV(X̂) / [2 CV(Ŷ)], both estimates have the same standard error.

To see the implication of these facts in some common situations, consider the example of a census of manufacturers which was conducted in one year, followed by a sample the next year. Let yi and xi represent the values of shipments for the same sample firm in two consecutive years. In this case CV(Ŷ) and CV(X̂) are nearly the same, and their ratio is approximately 1. Furthermore, there will be a very high correlation between Y and X, probably about 0.90 or 0.95. Consequently, a ratio estimate will result in a substantial gain in accuracy. The amount of the gain can be found as follows: if CV(X̂) = CV(Ŷ) and D = 0.90, equation (9.4) becomes

CV²(ŶR) = CV²(Ŷ)(1 + 1 − 2D) = 0.2 · CV²(Ŷ)

In other words, the use of a ratio estimate achieves an 80 percent reduction in variance; with D = 0.95 the reduction is 90 percent. Looking at the result in another way, the ratio estimate is as effective as using a sample 5 times (or 10 times) as large.
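A quick numerical check: when CV(X̂) = CV(Ŷ), as assumed in this example, equation (9.4) reduces the rel-variance by the factor 2(1 − D):

```python
# Variance-reduction factor of the ratio estimate when CV(X-hat) = CV(Y-hat):
# CV^2(Y_R) = CV^2(Y-hat) * (1 + 1 - 2*D) = CV^2(Y-hat) * 2 * (1 - D).
reduction = {D: 2 * (1 - D) for D in (0.90, 0.95)}
# D = 0.90 leaves 20% of the variance (an 80% reduction);
# D = 0.95 leaves 10%, i.e., as effective as a sample 10 times as large.
```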

Consider now the situation described in section 9.2.3, in which Y is a subset of X. In such cases, the correlation is likely to be quite low unless the ratio Y/X is fairly large--for example, greater than ½. In practice, if Y/X is less than about 20 percent, a ratio estimate may increase the sampling error, although, generally, not much. If Y/X is greater than 40 or 50 percent, a ratio estimate will usually improve the efficiency; the closer to 100 percent, the more the improvement. Between 20 and 40

percent, the differences between the two types of estimates will be small. Thus, for example, in a

labor force survey, the use of ratio estimates probably provides an important improvement in the

estimate of the number of employed (which comprises a fairly high proportion of the adult

population) but probably results in a slight increase in the standard error of the estimate of

unemployed.

The ratio estimate is a biased estimate. This can easily be demonstrated by constructing a small

population with values Yi and Xi for each element, taking all possible samples of two or three

elements, and computing the ratio r for each sample. It will be seen that the average of the ratios is not

the true average. However, the bias tends to be negligible for moderately large samples. In most

practical applications, the bias is so small compared with the advantage gained in reducing the

sampling error, that the ratio estimate is preferred over the unbiased estimate.

A ratio estimate, although biased, is a consistent estimate. This means that, if we use a large enough

sample, we can be sure that the estimate will be as close as we like to the true value. Not only does

the standard error decrease with increasing sample size, but the bias is also reduced.

For reasonably large samples, ratio estimates are normally distributed (for the kinds of populations

dealt with in practice). Consequently, if we can compute the standard error of the ratio estimate, we can construct the same type of confidence limits for r and ŶR as for ȳ and Ŷ; that is, we can say that

we have a 68-percent chance that a range around the estimate of plus and minus one standard error

will cover the true figure, a 95-percent chance that a range of plus and minus two standard errors will

cover the true figure, etc.

Sections 9.3.3 to 9.3.5 above refer to the fact that moderately large samples are needed to make the

bias negligible, and to provide a reasonably normal distribution of sample estimates. When is the

sample large enough? The following working rule has been suggested: If the sample size exceeds 30 and if the coefficients of variation of X̂ and Ŷ are both less than 10 percent, then the bias is

negligible and we can assume that the theory for the normal distribution applies. The first condition

does not mean that a ratio estimate is necessarily better than a simple unbiased estimate whenever

n > 30; it means this size of sample is required before the formulas for sampling error have the usual

meaning in terms of confidence intervals.


9.3.7 Formula for Bias

The bias of r is given, to a first approximation, by

Bias(r) ≈ R [ CV²(X̂) − D · CV(Ŷ) · CV(X̂) ]

where D and R are defined as in section 9.3.1. For the estimate of a total, the bias is X times this quantity. Even with low values of D, this will be small compared with the standard error of ŶR, provided only that CV(X̂) is small.

These bias formulas are presented for analytical purposes. They are never used to adjust estimates.

In situations where the bias would be expected to be significantly large, we would either increase the

sample size or use a different method of estimation.

If ratio estimates are applied separately for a large number of subgroups of the population, with a

small sample in each subgroup, the bias in the subgroup may accumulate and become too large to

ignore. For example, suppose a relatively small sample of persons is classified by separate age-sex

groups--300 persons divided into 5-year age groups by sex. There would be about 30 such groups.

Suppose we know the true total population in each of these 30 groups. For any statistic we are

interested in, we could compute a separate ratio estimate for the persons in each of the 30 groups,

and then get a final estimate by adding the 30 results. The average size of sample in each group

would be 10. Since there would be only a small sample in each of the age groups for which a ratio

estimate would be formed, the accumulation of 30 different ratio estimates could result in a serious

bias. In such a case, the use of ratio estimation group-by-group is not recommended.

9.3.9 Illustration

Suppose that a complete census of the value of manufacturing shipments was taken in 1981. The following table shows the values of shipments in 1981 and 1982 for a simple random sample of 10 establishments drawn from the 30 establishments in the population. The problem is to estimate the total value of shipments in 1982. The true 1981 total, X, is assumed to be known. Its value is $19.5 billion.


Value of shipments in

1981 (xi) 0.3 1.1 0.5 0.4 1.0 0.7 0.2 0.3 2.4 0.1

Value of shipments in

1982 (yi) 0.1 0.6 0.8 0.6 1.0 0.8 0.9 0.8 2.7 0.2

We have,

N = 30, n = 10

Compute the estimate of the total and the variance, the coefficient of variation of the estimate and

the confidence interval for Y by using (a) a method of simple random sampling and (b) a method of

ratio estimates.

(a) Simple random sampling:

(1) Estimate of the total: Ŷ = N·ȳ = 30 × (8.5/10) = $25.5 billion.

(2) Estimated variance: v(Ŷ) = N²(1 − n/N)s²/n, with s² = Σ(yi − ȳ)²/(n − 1) = 4.565/9 = 0.507, giving v(Ŷ) = 900 × (2/3)(0.507)/10 = 30.4.

(3) Standard error: se(Ŷ) = √30.4 = $5.5 billion.

(4) Coefficient of variation: cv(Ŷ) = 5.5/25.5 = 0.22, or 22 percent.

(5) 95 % confidence interval for Y is 25.5 ± (1.96)(5.5), or about $14.7 billion to $36.3 billion.

(b) Ratio estimation:

(1) Estimate of the total: r = Σyi/Σxi = 8.5/7.0 = 1.21, so ŶR = r·X = (1.21)(19.5) = $23.7 billion.

(2) Estimated variance: v(ŶR) = N²(1 − n/N)/n × Σ(yi − r·xi)²/(n − 1) = 900 × (2/3)(0.154)/10 = 9.2.

(3) Standard error and coefficient of variation: se(ŶR) = √9.2 = $3.0 billion; cv(ŶR) = 3.0/23.7 = 0.13, or 13 percent.

(4) 95 % confidence interval for Y is 23.7 ± (1.96)(3.0), or about $17.7 billion to $29.6 billion.
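The two sets of computations above can be checked numerically. The sketch below is a minimal Python check (not part of the original manual) that applies the standard simple-random-sampling and ratio-estimate formulas to the shipment data:

```python
import math

# Shipment values (billions of dollars) for the sample of n = 10
# drawn from N = 30; the known 1981 total is X = 19.5.
x = [0.3, 1.1, 0.5, 0.4, 1.0, 0.7, 0.2, 0.3, 2.4, 0.1]  # 1981
y = [0.1, 0.6, 0.8, 0.6, 1.0, 0.8, 0.9, 0.8, 2.7, 0.2]  # 1982
N, n, X = 30, 10, 19.5
f = n / N  # sampling fraction

# (a) Simple random sampling: Y_hat = N * y_bar
y_bar = sum(y) / n
Y_srs = N * y_bar
s2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)
v_srs = N ** 2 * (1 - f) * s2 / n
cv_srs = math.sqrt(v_srs) / Y_srs

# (b) Ratio estimate: Y_hat_R = (sum of y / sum of x) * X
r = sum(y) / sum(x)
Y_ratio = r * X
sd2 = sum((yi - r * xi) ** 2 for xi, yi in zip(x, y)) / (n - 1)
v_ratio = N ** 2 * (1 - f) * sd2 / n
cv_ratio = math.sqrt(v_ratio) / Y_ratio

print(round(Y_srs, 2), round(cv_srs, 3))      # 25.5 0.216
print(round(Y_ratio, 2), round(cv_ratio, 3))  # 23.68 0.128
```

The ratio estimate has a markedly smaller coefficient of variation, which is why ratio estimation is preferred when the x- and y-values are highly correlated, as here.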

Formulas for Ratio Estimation Variances

Population ratio R:

    R = Y/X = ΣYi / ΣXi (sums over all N units), estimated from the sample by r = Σyi / Σxi

Estimated variance of r:

    v(r) = (1 − n/N) / (n·x̄²) × Σ(yi − r·xi)² / (n − 1)

Estimated variance of the ratio estimate of the total, ŶR = r·X:

    v(ŶR) = N²(1 − n/N) / n × Σ(yi − r·xi)² / (n − 1)

Estimated variance of the ratio estimate of the mean, ȳR = r·X̄:

    v(ȳR) = (1 − n/N) / n × Σ(yi − r·xi)² / (n − 1)

Ratio Estimation

1. A forester is interested in estimating the total volume of trees in a timber sale. He records the

volume for each tree in a simple random sample. In addition he measures the basal area for each

tree marked for sale. He then uses a ratio estimator of total volume.

The forester decides to take a simple random sample of n = 12 from the N = 250 trees marked for

sale. Let x denote basal area and y the cubic foot volume for a tree. The total basal area for all

250 trees, Tx, is 75 square feet. Use the data below to estimate Ty, the total cubic foot volume

for those trees marked for sale, and place a bound on the error of estimation.

Tree sampled    Basal area (xi)    Volume (yi)    xi²    yi²    xi²yi²

1 .3 6 0.09 36 3.24

2 .5 9 0.25 81 20.25

3 .4 7 0.16 49 7.84

4 .9 19 0.81 361 292.41

5 .7 15 0.49 225 110.25

6 .2 5 0.04 25 1

7 .6 12 0.36 144 51.84

8 .5 9 0.25 81 20.25

9 .8 20 0.64 400 256

10 .4 9 0.16 81 12.96

11 .8 18 0.64 324 207.36

12 .6 13 0.36 169 60.84

Σxi = 6.7    Σyi = 142    Σxi² = 4.25    Σyi² = 1,976    Σxi²yi² = 1,044.24
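As a quick numeric check of Exercise 1 (not part of the original manual), the ratio estimate of the total volume is ŶR = (Σy/Σx)·Tx, which can be computed directly from the table:

```python
# Basal area (x, square feet) and volume (y, cubic feet) for the n = 12 sampled trees
x = [0.3, 0.5, 0.4, 0.9, 0.7, 0.2, 0.6, 0.5, 0.8, 0.4, 0.8, 0.6]
y = [6, 9, 7, 19, 15, 5, 12, 9, 20, 9, 18, 13]
N, Tx = 250, 75.0  # 250 trees marked for sale, total basal area 75 sq ft

r = sum(y) / sum(x)      # sample ratio of volume to basal area
Ty_hat = r * Tx          # ratio estimate of the total cubic-foot volume
print(round(Ty_hat, 1))  # 1589.6
```

The bound on the error of estimation follows from the variance formula for ŶR given earlier in this chapter.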

2. Use the y-data in Exercise 1 to compute an estimate of Ty using the expansion estimator N·ȳ. Place a bound on the error of estimation. Compare your results to those obtained in Exercise 1.

3. A consumer survey was conducted to determine the ratio of the money spent on food to the total

income per year for households in a small community. A simple random sample of 14

households was selected from 150 in the community. Sample data are tabulated below. Estimate

R, the population ratio, and place a bound on the error of estimation.


Household    Total income (xi)    Amount spent on food (yi)    xi²    yi²    xiyi

2 12,240 2,524 149,817,600 6,370,576 30,893,760

3 9,600 1,935 92,160,000 3,744,225 18,576,000

4 15,600 3,123 243,360,000 9,753,129 48,718,800

5 14,400 2,760 207,360,000 7,617,600 39,744,000

6 6,500 1,337 42,250,000 1,787,569 8,690,500

7 8,700 1,756 75,690,000 3,083,536 15,277,200

8 8,200 2,132 67,240,000 4,545,424 17,482,400

9 14,600 3,504 213,160,000 12,278,016 51,158,400

10 12,700 2,286 161,290,000 5,225,796 29,032,200

11 11,500 2,875 132,250,000 8,265,625 33,062,500

12 10,600 2,226 112,360,000 4,955,076 23,595,600

13 7,700 1,463 59,290,000 2,140,369 11,265,100

14 8,500 1,905 72,250,000 3,629,025 16,192,500

4. A corporation is interested in estimating the total earnings from sales of color television sets at

the end of a given three month period. The total earnings figures are available for all districts

within the corporation for the corresponding three month period of the previous year. A simple

random sample of 13 district offices is selected from the 123 offices within the corporation.

Using a ratio estimator, estimate Ty and place a bound on the error of estimation. Use the data in

the table below and take Tx = 128,200.

Office    Data from previous year (xi)    Data from current year (yi)    xi²    yi²    xiyi

2 720 780 518,400 608,400 561,600

3 1,500 1,600 2,250,000 2,560,000 2,400,000

4 1,020 1,030 1,040,400 1,060,900 1,050,600

5 620 600 384,400 360,000 372,000

6 980 1,050 960,400 1,102,500 1,029,000

7 928 977 861,184 954,529 906,656

8 1,200 1,440 1,440,000 2,073,600 1,728,000

9 1,350 1,570 1,822,500 2,464,900 2,119,500

10 1,750 2,210 3,062,500 4,884,100 3,867,500

11 670 980 448,900 960,400 656,600

12 729 865 531,441 748,225 630,585

13 1,530 1,710 2,340,900 2,924,100 2,616,300

5. Use the data in Exercise 4 to estimate the mean earnings for offices within the corporation.

Place a bound on the error of estimation.


Answer: Mean = 1,186.5348; Error = 59.79

6. An investigator has a colony of N = 763 rats which have been subjected to a standard drug.

The average length of time to thread a maze correctly under influence of the standard drug

was found to be μx = 17.2 seconds. The investigator now would like to subject a random

sample of 11 rats to a new drug. Estimate the average time required to thread the maze while

under the influence of the new drug. Place a bound on the error of estimation. (Hint: it is

reasonable to employ a ratio estimator for μy if we assume that the rats will react to the new

drug in much the same way as they did the standard drug.)

Rat    Time with standard drug (xi)    Time with new drug (yi)    xi²    yi²    xiyi

2 15.7 16.1 246.49 259.21 252.77

3 17.8 18.1 316.84 327.61 322.18

4 17.5 17.6 306.25 309.76 308

5 13.2 14.5 174.24 210.25 191.4

6 18.8 19.4 353.44 376.36 364.72

7 17.6 17.5 309.76 306.25 308

8 14.3 14.1 204.49 198.81 201.63

9 14.9 15.2 222.01 231.04 226.48

10 17.9 18.1 320.41 327.61 323.99

11 19.2 19.5 368.64 380.25 374.4

7. A group of 100 rabbits is being used in a nutrition study. A pre-study weight is recorded for

each rabbit. The average of these weights is 3.1 pounds. After two months the experimenter

wants to obtain a rough approximation of the average weight of the rabbits. He selects n = 10

rabbits at random and weighs them. The original weights and current weights are presented

below:

Rabbit 1 2 3 4 5 6 7 8 9 10

Original weight 3.2 3.0 2.9 2.8 2.8 3.1 3.0 3.2 2.9 2.8

Current weight 4.1 4.0 4.1 3.9 3.7 4.1 4.2 4.1 3.9 3.8

Estimate the average current weight and place a bound on the error of estimation.

8. A social worker wants to estimate the ratio of the average number of rooms per apartment to

the average number of people per apartment in an urban ghetto area. He selects a simple

random sample of 25 apartments from the 275 in the ghetto area. Let xi denote the number of

people in apartment i, and let yi denote the number of rooms in apartment i. From a count of

the number of rooms and number of people in each apartment, the following data are


obtained:

Estimate the ratio of average number of rooms to average number of people for this area, and

place a bound on the error of estimation.

9. A forest resource manager is interested in estimating the number of dead fir trees in a 300

acre area of heavy infestation. Using an aerial photo, he divides the area into 200 one-and-a-half-acre plots. Let x denote the photo count of dead firs and y the actual ground count for a

simple random sample of n = 10 plots. The total number of dead fir trees obtained from the

photo count is Tx = 4,200. Use the sample data below to estimate Ty, the total number of

dead firs in the 300 acre area. Place a bound on the error of estimation.

Plot    Photo count (xi)    Ground count (yi)    xi²    yi²    xiyi

2 30 42 900 1,764 1,260

3 24 24 576 576 576

4 24 36 576 1,296 864

5 18 24 324 576 432

6 30 36 900 1,296 1,080

7 12 14 144 196 168

8 6 10 36 100 60

9 36 48 1296 2,304 1,728

10 42 54 1764 2,916 2,268

10. Members of a teachers’ association are concerned about the salary increases given to high

school teachers in a particular school system. A simple random sample of n = 15 teachers is

selected from an alphabetical listing of all high school teachers in the system. All 15 teachers

are interviewed to determine their salaries for this year and the previous year. Use these data

to estimate R, the rate of change, for N = 750 high school teachers in the community school

system. Place a bound on the error of estimation.


Teacher    Past year's salary (xi)    Present year's salary (yi)    xi²    yi²    xiyi

2 6,700 6,940 44,890,000 48,163,600 46,498,000

3 7,792 8,084 60,715,264 65,351,056 62,990,528

4 9,956 10,275 99,121,936 105,575,625 102,297,900

5 6,355 6,596 40,386,025 43,507,216 41,917,580

6 5,108 5,322 26,091,664 28,323,684 27,184,776

7 7,891 8,167 62,267,881 66,699,889 64,445,797

8 5,216 5,425 27,206,656 29,430,625 28,296,800

9 5,416 5,622 29,333,056 31,606,884 30,448,752

10 5,397 5,597 29,127,609 31,326,409 30,207,009

11 8,152 8,437 66,455,104 71,182,969 68,778,424

12 6,436 6,700 41,422,096 44,890,000 43,121,200

13 9,192 9,523 84,492,864 90,687,529 87,535,416

14 7,006 7,279 49,084,036 52,983,841 50,996,674

15 7,311 7,582 53,450,721 57,486,724 55,432,002

11. An experimenter was investigating a new food additive for cattle. Midway through the two

month study, he was interested in estimating the average weight for the entire herd of N =

500 steers. A simple random sample of n = 12 steers was selected from the herd and

weighed. These data and prestudy weights are presented below for all cattle sampled.

Assume μx, the pre-study average, was 880 lbs. Estimate μy, the average weight for the herd,

and place a bound on the error of estimation. All the weights below are in pounds.

Steer    Pre-study weight (xi)    Current weight (yi)    xi²    yi²    xiyi

2 919 992 844,561 984,064 911648

3 690 752 476,100 565,504 518880

4 984 1,093 968,256 1,194,649 1075512

5 200 768 40,000 589,824 153600

6 260 828 67,600 685,584 215280

7 1,323 1,428 1,750,329 2,039,184 1889244

8 1,067 1,152 1,138,489 1,327,104 1229184

9 789 875 622,521 765,625 690375

10 573 642 328,329 412,164 367866

11 834 909 695,556 826,281 758106

12 1,049 1,122 1,100,401 1,258,884 1176978

12. An advertising firm is concerned about the effect of a new regional promotional campaign on

the total dollar sales for a particular product. A simple random sample of n = 20 stores is

drawn from the N = 452 regional stores in which the product is sold. Quarterly sales data are

obtained for the current three-month period and the three-month period prior to the new

campaign. Use these data to estimate Ty, the total sales for the current period, and place a

bound on the error of estimation. Assume Tx = 216,256.


Store    Pre-campaign sales (xi)    Present sales (yi)    xi²    yi²    xiyi

1 208 239 43,264 57,121 49712

2 400 428 160,000 183,184 171200

3 440 472 193,600 222,784 207680

4 259 276 67,081 76,176 71484

5 351 363 123,201 131,769 127413

6 880 942 774,400 887,364 828960

7 273 294 74,529 86,436 80262

8 487 514 237,169 264,196 250318

9 183 195 33,489 38,025 35685

10 863 897 744,769 804,609 774111

11 599 626 358,801 391,876 374974

12 510 538 260,100 289,444 274380

13 828 888 685,584 788,544 735264

14 473 510 223,729 260,100 241230

15 924 998 853,776 996,004 922152

16 110 171 12,100 29,241 18810

17 829 889 687,241 790,321 736981

18 257 265 66,049 70,225 68105

19 388 419 150,544 175,561 162572

20 244 257 59,536 66,049 62708

Σxi = 9,506    Σyi = 10,181    Σxi² = 5,808,962    Σyi² = 6,609,029    Σxiyi = 6,194,001

13. Use the data of Exercise 12 to determine the sample size required to estimate Ty with a bound

on the error of estimation equal to $3,800.

Answer: n = 14.
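The stated answer can be reproduced with the usual sample-size formula for a ratio estimate of a total, n = N·s²/(N·D + s²) with D = B²/(4N²), where s² is estimated from the Exercise 12 sample. A sketch (not part of the manual):

```python
import math

# Sample sums from Exercise 12 (n0 = 20 stores out of N = 452); B is the desired bound
n0, N, B = 20, 452, 3800.0
sum_x, sum_y = 9506.0, 10181.0
sum_x2, sum_y2, sum_xy = 5808962.0, 6609029.0, 6194001.0

r = sum_y / sum_x
# Variance of the residuals d_i = y_i - r*x_i, computed from the sums
s2 = (sum_y2 - 2 * r * sum_xy + r ** 2 * sum_x2) / (n0 - 1)
D = B ** 2 / (4 * N ** 2)
n = N * s2 / (N * D + s2)
print(math.ceil(n))  # 14
```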

14. A 10-percent simple random sample of housing units in a village has been selected producing

the 12 housing units listed below. At each sample unit, information was obtained on the

number of persons in the household and the total annual earnings; the results are given below.

It is also known from independent sources that the total population of all households in the

village is 600 persons.


Sample Total Total

unit persons earnings

1 6 $ 7,000

2 6 8,000

3 5 3,000

4 8 10,000

5 4 2,000

6 2 1,000

7 4 2,000

8 5 3,000

9 1 1,000

10 7 8,000

11 4 1,000

12 5 6,000

Total 57 $52,000

a. Estimate the total earnings in all households in the village using a direct inflation

factor.

b. Estimate the total earnings in all households in the village using a ratio estimate.

c. Use the sample results to estimate the coefficient of variation for each of the above

estimates.

15. The following table shows the total hectares and the payments for farm labor for a sample of three farms drawn from 30 farms. The true total hectares of all farms, X, is assumed to be 800.

Farm (i)    Hectares (xi)    Labor payments (yi)

1 5 382

2 8 467

3 10 701

a. Compute the ratio estimate, ŶR, of the total payments for farm labor, Y.

b. Estimate the variance of ŶR.

c. Compute the coefficient of variation of ŶR.

d. Find a 95% confidence interval for Y.


Chapter 10

CLUSTER SAMPLING

________________________________________________________________________________________

10.1 DESCRIPTION OF CLUSTER SAMPLING

The discussion so far has been about sampling methods in which the units of analysis (people, farms,

business firms, etc.) were considered as arranged in a list (or its equivalent) and a sample of

individual units could be selected directly from the list. Now we will consider a sampling procedure

in which the units of analysis in the population are grouped into clusters and a sample of clusters

(rather than a sample of individual units of analysis) is selected. The sample clusters then determine

the units to be included. The determination may be made in either of two ways:

(1) The sample could include all units in the selected clusters. This is usually referred to as

single-stage cluster sampling.

(2) A subsample of units in the selected clusters could be selected for enumeration. This is

called multi-stage cluster sampling, or simply multi-stage sampling.

There are two main reasons for using cluster sampling. Often there is no adequate frame (such as a

list) from which to select a sample of the elements in the population, and the cost of constructing

such a frame may be too great. In other cases, such a frame may exist but the savings in field costs

obtained by cluster sampling (on some kind of geographical basis) may make this method more

efficient than a simple random sample from a list. In most practical situations, a sample of a given

number of units selected at random will have smaller variance than a sample of the same size

selected in clusters; nevertheless, when cost is balanced against precision, the cluster sample may be

more efficient.

Even though the units in which we are interested are not selected directly, the probability of selecting

a cluster and each unit in it (i.e., the probability of selecting a unit from the population) is fixed in

advance; consequently, cluster sampling satisfies the criterion for probability sampling.

To draw a sample of persons, it would generally not be feasible to obtain a list of all persons, and

then to select a sample from the list. It might be possible to find a list of families. We could then

select a sample of families and obtain information by interview concerning all persons in the selected

families. This is an example of single-stage cluster sampling; the family constitutes the cluster.

Note that for a given number of individuals in the sample, it would undoubtedly be less costly in

terms of both travel and time to take all persons within selected families than to select the same

number of persons at random from all individuals in the population.

Often there is no list of families available, and some other procedure must be used. A possible

method is as follows. In large cities, a map showing the boundaries of city blocks can usually be

obtained; and we can select a sample of blocks. In the rest of the country, we can use maps divided

into small areas called segments, which have identifiable boundaries, and select a sample of

segments. Within the sample blocks and segments, we could include all persons in the sample;

alternatively, we could select a sample of persons living in the selected blocks. The choice would

depend upon the number of stages of sampling we believe would be most efficient. By using maps,

we eliminate the need for a list of all persons. We replace it with a list of blocks and segments and a

list of families within a sample of blocks and segments. (In practice there frequently is an earlier

stage of sampling in which a sample of cities and/or other administrative areas is selected.) The

preceding discussion illustrates an important application of cluster sampling; namely, area sampling.

However, other applications of cluster sampling are frequently made.

Suppose we wish to make a survey of school children in order to obtain information on their health,

or information on their knowledge of a particular subject. One way to do this is to obtain a complete

list of schools, then select a sample of schools, and finally choose a sample of children within the

selected schools. Similarly, a sample of factory workers could be selected by first choosing a sample

of factories and then interviewing a sample of workers within these factories. In both cases we

would need to construct a list of individuals only for the schools or factories selected in the sample.

These examples illustrate multi-stage (specifically, two-stage) cluster sampling. The probability that

any unit in the population is selected in the sample can be expressed as the product of the

probabilities at each stage. Thus, in the first example the probability of selecting the jth child from

the ith school is the probability of first selecting the ith school times the conditional probability of

selecting the jth child, given that the ith school has been selected. That is,

P(jth child, ith school) = P(ith school) x P(jth child | ith school).
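For instance, if 4 of 20 schools were selected and then 10 of a selected school's 200 children were subsampled, the overall selection probability would multiply out as below (the numbers are a hypothetical illustration, not from the manual):

```python
# Hypothetical two-stage design: select schools, then children within schools.
n_schools, N_schools = 4, 20      # first stage: 4 of 20 schools
m_children, M_children = 10, 200  # second stage: 10 of a school's 200 children

p_school = n_schools / N_schools                # P(ith school) = 0.2
p_child_given_school = m_children / M_children  # P(jth child | ith school) = 0.05
p_child = p_school * p_child_given_school       # overall selection probability
print(round(p_child, 4))  # 0.01
```

Note that each child's overall probability of selection, 1 in 100, is fixed in advance, which is what makes this a probability sample.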

Since area sampling is a frequently used application of cluster sampling, we shall describe in more

detail the methods which are usually applied. Area sampling is useful when one or both of the

following conditions exist:

(1) When complete lists of housing units (or other desired units of observation) are not

available but maps having a reasonable amount of detail are available. Such maps can be

considered as a list covering all of the housing units in the area.

(2) When there are large travel costs in sending an interviewer from one randomly selected

sample housing unit to another randomly selected housing unit. For a given amount of

money, we may be able to increase the number of sample housing units greatly by

grouping units together and selecting a random sample of groups.

Three simple procedures exist for drawing an area sample. We shall use city blocks as an illustration

(segments of land with identifiable boundaries around them could be used in rural areas in exactly

the same way as blocks are used in cities). We shall assume that a 1-percent sample of housing units

is to be drawn.


Procedure A for a sample of areas to be enumerated completely:

(1) Obtain a reasonably accurate map of the city, showing as much detail as possible for

blocks. If the map is not new, one should take steps through local inquiry to bring it up-

to-date (for example, draw in new streets that have been opened since the map was

printed).

(2) Number the blocks serially, entering the numbers directly on the map; a serpentine

numbering system is advisable in order to make certain that no blocks are omitted.

(3) Select a simple random or systematic sample of blocks, using a 1-percent sample. If a

systematic sample is used, select a random number from 1 to 100 to determine the first

sample block, and include every one-hundredth block thereafter.
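Step (3) of procedure A can be sketched in code. Assuming, say, 1,200 numbered blocks, a 1-percent systematic sample is drawn as follows (the block count and seed are illustrative assumptions, not from the manual):

```python
import random

def systematic_sample(num_blocks: int, interval: int, rng: random.Random) -> list[int]:
    """Select every `interval`-th block after a random start in 1..interval."""
    start = rng.randint(1, interval)
    return list(range(start, num_blocks + 1, interval))

rng = random.Random(7)  # fixed seed so the sketch is reproducible
blocks = systematic_sample(1200, 100, rng)
print(len(blocks))      # 12 blocks, i.e. a 1-percent sample of 1,200 blocks
```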

The 1-percent sample can also be obtained by drawing, for example, a sample of 1 in 25 blocks, then taking a subsample of one-fourth of the area in each sample block.

Procedure B for a sample of areas with subsampling:

(1) Proceed as in (1), (2), and (3) in procedure A above, except that instead of taking 1 in 100 blocks, take 1 block in 25.

(2) Divide each of the sample blocks into 4 segments. If maps are available that show the

internal structure of each block (alleys, buildings, etc.), these can be used. If not, make a

quick and crude sketch of the sample blocks, showing each building; use this sketch as

the basis of the segmentation. The 4 segments within any block should have roughly the

same number of housing units in each.

(3) Select the sample segments by taking a random number from 1 to 4 for each block.

Notice that although a 1-percent sample is obtained in both procedures, procedure B includes more

sample blocks and fewer housing units per block. Usually, it will cost more to obtain the same

sample size by procedure B, since there is a cost of subsampling not involved in procedure A; also,

travel will be increased in visiting a greater number of blocks. This subsampling procedure is almost

equivalent to dividing every block in the city into 4 parts, or segments, and taking 1 in 100 of these

segments. Hence, the use of subsampling as described above in procedure B can be regarded as

essentially equivalent to using a sample of small clusters of housing units (in which every housing

unit would be enumerated) but with two-stage sampling as a device for reducing the work of drawing

a sample of small clusters.


Procedure C for a sample of areas with listing and subsampling:

To carry out procedure B, it is necessary to have or to construct detailed maps. A third procedure

accomplishes approximately the same results and is frequently applicable when detailed maps are not

available and are not easy to prepare.

(1) Select a sample of 1 block in 25, proceeding as in steps (1), (2), and (3) of procedure A but with the higher sampling rate.

(2) Visit each sample block and make a list of all the housing units in it. Number the housing units serially. The numbering can be done (a) separately by blocks (that is, starting with 1 for each block), (b) in a single sequence throughout all the sample blocks, or (c) by some combination, such as a separate sequence for various groups of blocks.

(3) Select one-fourth of the housing units within the sample blocks either by using a random

number table, or by systematic sampling using the serial numbers assigned to the housing

units.

(4) Interview the households whose serial numbers are selected for the sample.

Note: If advance information is available on the approximate numbers of housing units in all blocks,

some combination of the above procedures with stratification of blocks by size can be used.

In designing a sample, the sampling statistician must decide how many sampling stages are to be

used. In addition, at each stage he must determine the sampling unit. In making his decision, the

statistician often has many alternatives from which to choose. Suppose, for example, that he desires

to estimate the average number of cattle per holding. Ultimately, the information must be obtained

from a sample of individual holdings (units of analysis or elementary units). In order to obtain such

a sample, however, any of the following plans could be used:

(1) A simple random, systematic, or stratified sample of individual holdings could be taken if

complete and accurate lists of holdings were available.

(2) Maps could be used to subdivide the country into small area segments (for example,

segments containing an average of 5 or 10 holdings). A sample of these area segments

could then be selected, and all holdings within each selected segment included in the

sample. For holdings which extend across segment boundaries, rules would be needed to

associate holdings with segments.

(3) A sample of small administrative subdivisions, such as districts, could be selected. All

holdings in the selected districts could be included in the sample, or a subsample of

holdings could be selected.

(4) A sample of provinces (larger administrative divisions) could be selected, and a sample of

areas and holdings within the selected provinces could be taken in one of the ways

described in procedures A, B, and C above.


Where subsampling is used, the cluster initially selected is called the first-stage unit or the primary

sampling unit (PSU) and the unit of subsampling is called the second-stage unit (SSU). For

example, in (3) above, if a subsample of holdings is selected, the "district" is the PSU and the

holding is the second-stage unit; in (4), the "province" is the PSU, the small area is the second-stage

unit, and still smaller areas or holdings may be third-stage units (TSU).

How can one make an intelligent choice among the various alternatives? We may reason as follows:

where cost is not important, single-stage sampling using the elementary unit (the holding in the

above case) as the sampling unit provides the most accurate results for the given number of

elementary units in the sample. (There are some exceptions, but these are rather unusual cases.) On

the other hand, when cost and administrative convenience are important, a cluster sample involving

one or more stages may be desirable. The cost of enumeration per elementary unit is usually much

less if the units are in clusters than if they are randomly distributed throughout the country; by

clustering, travel time and cost for interviewing are reduced. As a result, for a given amount of

money it may be possible, by using cluster sampling, to increase the number of elementary units in

the sample above the number that the same budget would allow if these were selected at random. If

the increase in the number of units more than compensates for the fact that a cluster sample tends to

increase the standard error, a net gain will be obtained in the reliability of estimates made from the

sample.

In order to choose among alternative sampling units, we must therefore balance the expected costs

against the standard errors for the various possible designs and use the method which will provide

the smallest standard error for a fixed cost. In some administrative situations, the correct decision

may be obvious. If the survey involves little or no travel cost--for example, if mail questionnaires

are used, or if the survey uses personnel who travel around as a normal part of their other activities,

such as policemen or postmen (mailmen)--and if listings of elementary units are available, the

elementary unit should always be taken as the sampling unit. If travel costs or the costs of

constructing lists of elementary units are rather large, an alternative design using a clustered sample

will usually be better. A full discussion of this matter is beyond the scope of these chapters, but

some of the important points will be discussed here.

Usually there is a fixed budget available for a survey, and one of the major functions of the sampling

statistician is to provide a method of obtaining the smallest sampling error for this budget. Let us

first examine how costs enter into a survey involving the use of cluster sampling.

In studying stratified sampling, we discussed the possibility that enumeration and processing costs

can vary from stratum to stratum, and we constructed a cost function which expressed the variable

part of the total cost as a sum of unit costs multiplied by sample sizes (for example, C = C1n1 + C2n2 + ...). A similar approach is needed for cluster sampling, although the unit costs are of a different

type. For simplicity, let us consider a two-stage sample.

In order to analyze the costs of a two-stage cluster sample, it is necessary to identify the various

phases of the survey and to distinguish between three elements of cost:


(1) Overhead costs; that is, those costs that are fixed regardless of the manner in which the

sample is selected.

(2) Costs that depend primarily on the number of first-stage clusters in the sample, and the

way in which such costs vary as the number of these primary sampling units in the sample

varies.

(3) The costs that depend primarily on the number of second-stage units in the sample, and

the way in which such costs vary with this number.

Overhead costs include such things as the administrative and technical work required for the survey,

rent for space and for some types of equipment, cost of printing the final results, etc. These costs

will generally be approximately the same, even with great variations in the size and design of the

survey. Since these costs are not affected by the size of the survey, they do not enter into the

decision on sample design. The only reason for separating these costs is to subtract them from the

total available budget in order to see what funds can be spent on the variable costs.

Certain costs will usually vary in proportion to the number of first-stage sampling units. These will

include (a) the cost of selecting, traveling to, and locating each first-stage unit, (b) the cost of

preparing a list of second-stage units (within the primary unit), and (c) the cost of designating the

subsample of second-stage units. There may also be other costs (costs of preparing maps for the

first-stage sample units, hiring special enumerators to handle each one, etc.) depending on the nature

of the administrative organization, and the materials available before the start of the survey.

The costs which depend on the number of second-stage units will include the costs of interviewing,

reviewing the survey results, coding, recording, etc.

Let us assume a simple situation in which the cost per first-stage unit does not change despite

changes in the number of such units in the sample. Similarly, the cost per second-stage unit does not

change. Then the total variable cost (which excludes overhead costs) can be represented by

    C = C1·n + C2·m     (10.1)

where

C1 is the cost per first-stage unit,

C2 is the cost per second-stage unit,

n is the total number of first-stage sampling units,

m is the total number of second-stage sampling units, and

m̄ = m/n is the average number of second-stage units in a primary unit.


Using equation (10.1), one can set down combinations of n and m which would add up to the same

cost. For example, suppose the total variable cost available for a survey was $2,500, and the

estimates of C1 and C2 were $10 and $2, respectively. The table below shows various combinations

of sample sizes all of which would cost exactly $2,500; the last column shows the average size of

cluster for each allocation:

First stage (n)    Second stage (m)    Average cluster size (m/n)

10 1200 120

20 1150 57.5

50 1000 20

75 875 11.7

100 750 7.5

125 625 5

150 500 3.3
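Each allocation in the table satisfies C = C1·n + C2·m = $2,500 with C1 = $10 and C2 = $2. A small sketch (not part of the manual) that regenerates the table:

```python
C_total, C1, C2 = 2500, 10, 2  # total variable budget and unit costs in dollars

rows = []
for n in (10, 20, 50, 75, 100, 125, 150):
    # Spend whatever remains after the first-stage cost on second-stage units.
    m = (C_total - C1 * n) // C2
    rows.append((n, m, round(m / n, 1)))

for n, m, avg in rows:
    print(n, m, avg)
```

Pairing each (n, m) combination with its sampling variance would then let one pick the allocation with the smallest variance for the fixed budget, as the text describes.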

If the sampling error can be found for each of the above combinations, one can choose that

combination which would give the lowest sampling error. In fact, with this simple type of cost

function, it is usually possible to determine the optimum allocation mathematically. However, this is

not necessary; if a formula can be found which expresses the variance in terms of n and m, we can

easily see which combination is best. Furthermore, this can also be done in situations involving

more complex cost functions, when it is more difficult to develop a mathematical solution to the

problem of optimum allocation. The next chapter will be devoted to analyzing the variances for the

simpler and more common situations.

One additional comment on costs should be made. The formulation of the cost function above as C

= C1 n + C2 m covers the simplest type of situation only. In practice, the cost function may be much

more complex. For example, there may be stratification for either the first-stage or the second-stage

units with different unit costs in each stratum. The cost function would then be of the form

    C = Σh (C1h·nh + C2h·mh),

where the sum is taken over the strata h, and the problem of the allocation of the sample would be a combination of optimum allocation for

cluster sampling with optimum allocation for stratified sampling. Frequently, the unit costs would

depend on the number of units in the sample.


For example, suppose that C1 included a part that resulted from the time spent traveling from one

first-stage unit to another. With only a few primary units in the sample, the average distance from

one to the next might be quite large, resulting in a high value of C1. However, as the number of units

in the sample increases, the average distance gets smaller and C1 will be smaller. A different type of

cost function would be used in such a situation. In general, in planning a large-scale and important

survey, a detailed analysis should be made of how costs vary, in order to construct a cost function

which is realistic for that particular survey.

Cluster sampling is simple random sampling with each sampling unit containing a number of elements. Hence, the estimators of the population mean, μ, and total, T, are similar to those for simple random sampling. In particular, the sample mean, ȳ, is a good estimator of the population mean, μ. An estimator of μ and two estimators of T are discussed in this section. Define

N = the number of clusters in the population,

n = the number of clusters selected in a simple random sample,

mi = the number of elements in cluster i, i = 1, . . ., N, and

yi = the total of the observations in the ith cluster.

The estimator of the population mean, μ, is the sample mean

    ȳ = Σ yi / Σ mi ,

with both sums running over the n sample clusters. Thus, ȳ takes the form of a ratio estimator, as developed in Chapter 9, with mi taking the place of xi. Then, the estimated variance of ȳ has the form of the variance of a ratio estimator:

142

Estimator of the Population Mean ::

(10.1)

Estimated variance of

(10.2)

The estimated variance in equation in (10.2) is biased and a good estimator of only if n is

large, say n $ 20. The bias disappears if the cluster sizes m1, m2. . . . mN are all equal.

Estimator of the population total T:

(10.4)    T̂ = M ȳ = M · (Σ yi / Σ mi)

Estimated variance of Mȳ:

(10.5)    V̂(Mȳ) = M² V̂(ȳ) = N² (1 - n/N) sr² / n

Note that the estimator Mȳ is useful only if the number of elements in the population, M, is known. Often the number of elements in the population is not known in problems for which cluster sampling is appropriate. This makes it impossible to use the estimator Mȳ, but we can form another estimator of the population total which does not depend on M. The quantity ȳt given by

(10.7)    ȳt = (1/n) Σ yi

is the average of the cluster totals for the n sampled clusters. Hence, ȳt is an unbiased estimator of the average of the N cluster totals in the population. By the same reasoning as used previously, the estimator N ȳt is an unbiased estimator of the sum of the cluster totals or, equivalently, of the population total, T.

For example, it is highly unlikely that the number of adult males in a city would be known, and hence the estimator N ȳt rather than Mȳ would have to be used to estimate T.

Estimator of the population total T when M is unknown:

(10.8)    T̂ = N ȳt

Estimated variance of N ȳt:

(10.9)    V̂(N ȳt) = N² (1 - n/N) st² / n,    where    st² = Σ (yi - ȳt)² / (n - 1)

If there is a large amount of variation among the cluster sizes and if cluster sizes are highly correlated with cluster totals, the variance of N ȳt is generally larger than the variance of Mȳ. The estimator N ȳt does not use the information provided by the cluster sizes and, hence, may be less precise.

The estimators of μ and T possess special properties when all cluster sizes are equal, that is, when m1 = m2 = . . . = mN. First, the estimator ȳ, given by equation (10.1), is an unbiased estimator of the population mean μ. Second, V̂(ȳ), given by equation (10.2), is an unbiased estimator of the variance of ȳ. Finally, the two estimators, Mȳ and N ȳt, of the population total T are equivalent.
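The estimators above can be sketched in a short script. This is an illustrative sketch, not part of the original text; the cluster data are hypothetical, and the formulas follow equations (10.1), (10.2), (10.7), (10.8), and (10.9), with m̄ standing in for M̄.

```python
# One-stage cluster sampling estimators (hypothetical data).
# n clusters sampled from N; m[i] = elements in cluster i, y[i] = cluster total.
N = 10                    # clusters in the population
m = [4, 6, 5]             # sample cluster sizes
y = [20, 36, 25]          # sample cluster totals
n = len(m)

ybar = sum(y) / sum(m)                         # (10.1) estimator of the mean
mbar = sum(m) / n                              # sample average cluster size
sr2 = sum((yi - ybar * mi) ** 2 for yi, mi in zip(y, m)) / (n - 1)
v_ybar = (1 - n / N) * sr2 / (n * mbar ** 2)   # (10.2), mbar estimating Mbar

ytbar = sum(y) / n                             # (10.7) average cluster total
T_hat = N * ytbar                              # (10.8) total when M is unknown
st2 = sum((yi - ytbar) ** 2 for yi in y) / (n - 1)
v_That = N ** 2 * (1 - n / N) * st2 / n        # (10.9)

print(ybar, v_ybar, T_hat, v_That)
```

With these numbers, ȳ = 81/15 = 5.4 and N ȳt = 10 · 27 = 270.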

10.6 Selecting the Sample Size for Estimating Population Means and Totals

The quantity of information in a cluster sample is affected by two factors, the number of clusters and

the relative cluster size. We have not encountered the latter factor in any of the sampling procedures

discussed previously. In the problem of estimating the number of homes with inadequate fire

insurance in a state, the clusters could be counties, voting districts, school districts, communities, or

any other convenient grouping of homes. We will assume that the relative cluster size has been selected in advance and will consider the problem of choosing the number of clusters, n.

The variance of ȳ is approximately

(10.11)    V(ȳ) = (1 - n/N) · σr² / (n M̄²)

where

(10.12)    σr² = Σ (yi - μ mi)² / N,    summed over all N clusters in the population.

Because we do not know σr² or the average cluster size M̄, choice of the sample size, that is, the number of clusters necessary to purchase a specified quantity of information concerning a population parameter, is difficult. We overcome this difficulty by using an estimate of σr² and M̄ from a prior survey or by selecting a preliminary sample containing n′ units. Thus, as in all problems of selecting a sample size, we equate two standard deviations of our estimator to a bound on the error of estimation, E. This bound is chosen by the experimenter and represents the maximum error that he is willing to tolerate. That is,

    2 √V(ȳ) = E

We obtain similar results when using Mȳ to estimate the population total T because V(Mȳ) = M² V(ȳ).

The approximate sample size required to estimate μ with a bound, E, on the error of estimation:

(10.13)    n = N σr² / (N D + σr²),    where    D = E² M̄² / 4

where σr² is estimated by sr² and M̄ is estimated by m̄ (or by M/N if M is known).

The approximate sample size required to estimate T, using Mȳ, with a bound, E, on the error of estimation:

(10.14)    n = N σr² / (N D + σr²),    where    D = E² / (4 N²)

The approximate sample size required to estimate T, using N ȳt, with a bound, E, on the error of estimation:

(10.17)    n = N σt² / (N D + σt²),    where    D = E² / (4 N²)

(10.18)    σt² is estimated by st² = Σ (yi - ȳt)² / (n - 1).


The estimators of the mean and total can be adapted to estimate a population proportion p. The estimator of p is the sample proportion

(10.19)    p̂ = Σ ai / Σ mi

where ai denotes the total number of elements in cluster i that possess the characteristic of interest. The estimated variance of p̂ has the same form as equation (10.2), with (ai - p̂ mi) replacing (yi - ȳ mi).
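Equation (10.19) can be sketched in code; the counts below are hypothetical, and the variance follows the ratio-estimator form of equation (10.2).

```python
# Estimating a proportion from a one-stage cluster sample (hypothetical data).
N = 50                    # clusters in the population
m = [10, 12, 8]           # elements per sample cluster
a = [6, 9, 4]             # elements with the characteristic, per cluster
n = len(m)

p_hat = sum(a) / sum(m)   # (10.19)
# Estimated variance: ratio-estimator form with (a_i - p_hat * m_i) deviations.
mbar = sum(m) / n
sp2 = sum((ai - p_hat * mi) ** 2 for ai, mi in zip(a, m)) / (n - 1)
v_p = (1 - n / N) * sp2 / (n * mbar ** 2)
print(p_hat, v_p)
```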


One-Stage Cluster Sampling Problems

1. A manufacturer of band saws wants to estimate the average repair cost per month for the saws

he has sold to certain industries. He cannot obtain a repair cost for each saw, but he can obtain

the total amount spent for saw repairs and the number of saws owned by each industry. Thus,

he decides to use cluster sampling with each industry as a cluster. The manufacturer selects a

simple random sample of n = 20 from the N = 96 industries which he services. The data on

total cost of repairs per industry and number of saw per industry are as follows:

Industry (cluster)    Number of saws    Total repair cost for past month (dollars)

1 3 50

2 7 110

3 11 230

4 9 140

5 2 60

6 12 280

7 14 240

8 3 45

9 5 60

10 9 230

11 8 140

12 6 130

13 3 70

14 2 50

15 1 10

16 4 60

17 12 280

18 6 150

19 5 110

20 8 120

Totals:    Σ = 130    Σ = 2,565

Estimate the average repair cost per saw for the past month, and place a bound on the error of

estimation.
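A sketch of a solution to Exercise 1 follows (the numerical results are computed here, not taken from the text; since M is unknown, the sample average cluster size m̄ = 130/20 estimates M̄):

```python
# Exercise 1: average repair cost per saw (one-stage cluster sample).
N = 96
saws = [3, 7, 11, 9, 2, 12, 14, 3, 5, 9, 8, 6, 3, 2, 1, 4, 12, 6, 5, 8]
cost = [50, 110, 230, 140, 60, 280, 240, 45, 60, 230, 140, 130,
        70, 50, 10, 60, 280, 150, 110, 120]
n = len(saws)

ybar = sum(cost) / sum(saws)          # (10.1): 2565 / 130
mbar = sum(saws) / n
sr2 = sum((y - ybar * m) ** 2 for y, m in zip(cost, saws)) / (n - 1)
v_ybar = (1 - n / N) * sr2 / (n * mbar ** 2)   # (10.2)
bound = 2 * v_ybar ** 0.5             # bound on the error of estimation
print(round(ybar, 2), round(bound, 2))
```

The estimate is about $19.73 per saw, with a bound of roughly $1.78.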

2. For the data in Exercise 1, estimate the total amount spent by the 96 industries on band saw repairs. Place a bound on the error of estimation.


3. After checking his sales records, the manufacturer of Exercise 1 finds that he sold a total of 710 band saws to these industries. Using this additional information, estimate the total amount spent on saw repairs by these industries and place a bound on the error of estimation.

4. The same manufacturer wants to estimate the average repair cost per saw for next month. How

many clusters should he select for his sample if he wants the bound on the error of estimation

to be less than $2.00?

Answer: n = 14
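The stated answer to Exercise 4 can be checked with equation (10.13), using sr² from the Exercise 1 data as the estimate of σr²; taking M̄ = 710/96 from the Exercise 3 sales records is an assumption made here.

```python
import math

# Exercise 4: number of clusters for a $2.00 bound on the mean repair cost.
N = 96
saws = [3, 7, 11, 9, 2, 12, 14, 3, 5, 9, 8, 6, 3, 2, 1, 4, 12, 6, 5, 8]
cost = [50, 110, 230, 140, 60, 280, 240, 45, 60, 230, 140, 130,
        70, 50, 10, 60, 280, 150, 110, 120]
n_prior = len(saws)

ybar = sum(cost) / sum(saws)
sr2 = sum((y - ybar * m) ** 2 for y, m in zip(cost, saws)) / (n_prior - 1)

E = 2.0
Mbar = 710 / 96                       # average cluster size, from Exercise 3
D = E ** 2 * Mbar ** 2 / 4            # (10.13)
n_needed = math.ceil(N * sr2 / (N * D + sr2))
print(n_needed)
```

This reproduces the stated answer, n = 14.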

5. A political scientist developed a test designed to measure the degree of awareness of current

events. He wants to estimate the average score which would be achieved on this test by all

students in a certain high school. The administration at the school would not allow the

experimenter to randomly select students out of classes in session, but it would allow him to

interrupt a small number of classes for the purpose of giving the test to every member of the

class. Thus, the experimenter selects 25 classes at random from the 108 classes in session at a

particular hour. The test is given to each member of the sampled classes with the following

results:

Class    Number of students    Total score

1    31    1590
2    29    1510
3    25    1490
4    35    1610
5    15    800
6    31    1720
7    22    1310
8    27    1427
9    25    1290
10    19    860
11    30    1620
12    18    710
13    21    1140
14    40    1980
15    38    1990
16    28    1420
17    17    900
18    22    1080
19    41    2010
20    32    1740
21    35    1750
22    19    890
23    29    1470
24    18    910
25    31    1740

Estimate the average score that would be achieved on this test by all students in the school.

Place a bound on the error of estimation.

6. The same political scientist of Exercise 5 wants to estimate the average test score for a

similar high school. If he wants the bound on the error of estimation to be less than 2 points,

how many classes should he sample? Assume the school has 100 classes in session during

each hour.


Answer: n = 13

7. An industry is considering revision of its retirement policy and wants to estimate the

proportion of employees which favor the new policy. The industry consists of 87 separate

plants located throughout the United States. Since results must be obtained quickly and with

little cost, the industry decides to use cluster sampling with each plant as a cluster. A simple

random sample of 15 plants is selected, and the opinions of the employees in these plants are

obtained by questionnaire. The results are as follows:

Plant    Number of employees    Number favoring new policy

1    51    42
2    62    53
3    49    40
4    73    45
5    101    63
6    48    31
7    65    38
8    49    30
9    73    54
10    61    45
11    58    51
12    52    29
13    65    46
14    49    37
15    55    42

Estimate the proportion of employees in the industry who favor the new retirement policy

and place a bound on the error of estimation.

8. The industry of Exercise 7 modified its retirement policy after obtaining the results of the

survey. It now wants to estimate the proportion of employees in favor of the modified

policy. How large a sample should be taken to have a bound of 0.08 on the error of

estimation? Use the data from Exercise 7 to approximate the results of the new survey.

Answer: n = 7

9. An economic survey is designed to estimate the average amount spent on utilities for

households in a city. Since no list of households is available, cluster sampling is used with

divisions (wards) forming the clusters. A simple random sample of 20 wards is selected

from the 60 wards of the city. Interviewers then obtain the cost of utilities from each

household within the sampled wards; the total costs are tabulated below:


Sampled Ward    Number of Households    Total Amount Spent on Utilities

1    55    2210
2    60    2390
3    63    2430
4    58    2380
5    71    2760
6    78    3110
7    69    2780
8    58    2370
9    52    1990
10    71    2810
11    73    2930
12    64    2470
13    69    2830
14    58    2370
15    63    2390
16    75    2870
17    78    3210
18    51    2430
19    67    2730
20    70    2880

Estimate the average amount a household in the city spends on utilities, and place a bound

on the error of estimation.

10. In the above survey the number of households in the city is not known. Estimate the total

amount spent on utilities for all households in the city, and place a bound on the error of

estimation.

11. A survey similar to that of Exercise 9 is to be conducted in a neighboring city of similar size and structure. The objective is to estimate the total amount spent on utilities by households in

the city with a bound of $5,000 on the error of estimation. Use the data in Exercise 9 to find

the approximate sample size needed to achieve this bound.

Answer: n = 30

12. An inspector wants to estimate the average weight of fill for cereal boxes packaged in a

certain factory. The cereal is available to him in cartons containing 12 boxes each. The

inspector randomly selects 5 cartons and measures the weight of fill for every box in the

sampled cartons, with the following results (in ounces):

1 16.1 15.9 16.1 16.2 15.9 15.8 16.1 16.2 16.0 15.9 15.8 16.0

2 15.9 16.2 15.8 16.0 16.3 16.1 15.8 15.9 16.0 16.1 16.1 15.9

3 16.2 16.0 15.7 16.3 15.8 16.0 15.9 16.0 16.1 16.0 15.9 16.1

4 15.9 16.1 16.2 16.1 16.1 16.3 15.9 16.1 15.9 15.9 16.0 16.0

5 16.0 15.8 16.3 15.7 16.1 15.9 16.0 16.1 15.8 16.0 16.1 15.9

Estimate the average weight of fill for boxes packaged by this factory, and place a bound on

the error of estimation. Assume that the total number of cartons packaged by the factory is

large enough for the finite population correction to be ignored.


Answer: Mean = 16.005; Error = 0.0215
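The stated answer to Exercise 12 can be reproduced with the equal-cluster-size form of equations (10.1) and (10.2), with the finite population correction ignored as instructed:

```python
# Exercise 12: mean fill weight from 5 cartons (clusters) of 12 boxes each.
cartons = [
    [16.1, 15.9, 16.1, 16.2, 15.9, 15.8, 16.1, 16.2, 16.0, 15.9, 15.8, 16.0],
    [15.9, 16.2, 15.8, 16.0, 16.3, 16.1, 15.8, 15.9, 16.0, 16.1, 16.1, 15.9],
    [16.2, 16.0, 15.7, 16.3, 15.8, 16.0, 15.9, 16.0, 16.1, 16.0, 15.9, 16.1],
    [15.9, 16.1, 16.2, 16.1, 16.1, 16.3, 15.9, 16.1, 15.9, 15.9, 16.0, 16.0],
    [16.0, 15.8, 16.3, 15.7, 16.1, 15.9, 16.0, 16.1, 15.8, 16.0, 16.1, 15.9],
]
n, m = len(cartons), len(cartons[0])
totals = [sum(c) for c in cartons]     # cluster totals y_i

ybar = sum(totals) / (n * m)           # (10.1) with equal cluster sizes
sr2 = sum((t - ybar * m) ** 2 for t in totals) / (n - 1)
v_ybar = sr2 / (n * m ** 2)            # (10.2) with the fpc ignored
bound = 2 * v_ybar ** 0.5
print(round(ybar, 3), round(bound, 4))
```

This matches the stated answer: mean 16.005 ounces, bound about 0.0215.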

13. A newspaper wants to estimate the proportion of voters favoring a certain candidate,

“Candidate A,” in a state-wide election. Since it is very expensive to select and interview a

simple random sample of registered voters, cluster sampling is used with precincts as

clusters. A simple random sample of 50 precincts is selected from the 497 precincts in the

state. The newspaper wants to make the estimation on election day, but before final returns

are tallied. Therefore, reporters are sent to the polls of each sample precinct to obtain the

pertinent information directly from the voters. The results are tabulated below:

Number of Voters   No. Favoring A   Number of Voters   No. Favoring A   Number of Voters   No. Favoring A

1170 631 1942 1187 1066 487

840 475 971 542 1171 596

1620 935 1143 973 1213 782

1381 472 2041 1541 1741 980

1492 820 2530 1679 983 693

1785 933 1567 982 1865 1033

2010 1171 1493 863 1888 987

974 542 1271 742 1947 872

832 457 1873 1010 2021 1093

1247 983 2142 1092 2001 1461

1896 1462 2380 1242 1493 1301

1943 873 1693 973 1783 1167

798 372 1661 652 1461 932

1020 621 1555 523 1237 481

1141 642 1492 831 1843 999

1820 975 1957 932

Estimate the proportion of voters favoring Candidate A, and place a bound on the error of

estimation.

14. The same newspaper wants to conduct a similar survey during the next election. How large

a sample size will be needed to estimate the proportion of voters favoring a similar

candidate with a bound of 0.05 on the error of estimation? Use the data in Exercise 13.

Answer: n = 21

15. A forester wishes to estimate the average height of trees on a plantation. The plantation is

divided into quarter-acre plots. A simple random sample of 20 plots is selected from the

386 plots on the plantation. All trees on the sampled plots are measured with the following

results:


Number of Trees   Average Height (feet)   Number of Trees   Average Height (feet)

42 6.2 60 6.3

51 5.8 52 6.7

49 6.7 61 5.9

55 4.9 49 6.1

47 5.2 57 6.0

58 6.9 63 4.9

43 4.3 45 5.3

59 5.2 46 6.7

48 5.7 62 6.1

41 6.1 58 7.0

Estimate the average height of trees on the plantation, and place a bound on the error of

estimation. (Hint: the total for cluster i can be found by taking the total number of elements in cluster i times the cluster average.)

16. To emphasize safety, a taxi-cab company wants to estimate the proportion of unsafe tires on

their 175 cabs. (Ignore spare tires.) It is impractical to select a simple random sample of

tires, so cluster sampling is used with each cab as a cluster. A random sample of 25 cabs

gives the following number of unsafe tires per cab:

2, 4, 0, 1, 2, 0, 4, 1, 3, 1, 2, 0, 1,
1, 2, 2, 4, 1, 0, 0, 3, 1, 2, 2, 1.

Estimate the proportion of unsafe tires being used on the company’s cabs, and place a bound

on the error of estimation.


CHAPTER 11

CLUSTER SAMPLING VARIANCES

11.1 VARIANCE OF A TWO-STAGE CLUSTER SAMPLE

To study the variance of a two-stage cluster sample, it will be useful to review some ideas of

stratified sampling. In stratified sampling, the standard error of a sample estimate depends on the

within-stratum variances, For each stratum, the variance is defined by the same

formula as S² (the total variance of the population) but using only the elements in the ith stratum.

We saw that stratified sampling was most useful when the means of the strata were very

different. In fact the gains of stratified sampling can be determined by computing the standard

deviation among the means of the strata (that is, computing the standard deviation of the stratum means, weighted by the number of units within each stratum) if the necessary data are

available. The square of this weighted standard deviation between cluster (primary sampling

units or PSUs, in this case) means is called the between-PSU variance.

Similar concepts can be considered in cluster sampling. In fact, there is a close analogy between

cluster and stratified sampling. In both cases we group the individual elements into sets before

selecting the sample. The difference is that in stratified sampling it is necessary to sample within

every one of the sets (the strata); in cluster sampling a sample of the sets (the clusters) is selected

and then either all or a sample of the elements within the selected sets is included. The purpose

and method of forming the sets is very different in the two cases.

11.1.1 Notation

Consider a two-stage design in which second-stage sample units (SSUs) are selected randomly

from the elementary units within selected clusters (primary sampling units or PSUs) for

interview.

N = number of PSUs (clusters) in the population

n = number of PSUs selected for the sample

Mi = number of SSUs in the ith PSU in the population, i = 1, . . ., N

M = Σ Mi = total number of SSUs in the population

M̄ = avg. number of SSUs per PSU in the population, or avg. cluster size

mi = number of SSUs selected for the sample in the ith PSU, i = 1, . . ., n

m̄ = average number of SSUs per sample PSU

Yij = value of a characteristic for the jth elementary unit in the ith PSU in the population

yij = value of the characteristic for the jth sample SSU in the ith sample PSU


11.1.2 Estimates of Means and Totals

The formulas given in previous chapters for estimating population means are appropriate when

the sampling unit is identical with the unit of analysis. An important characteristic of cluster

sampling, however, is that the sampling unit (at least in the first stage) is not the unit of analysis.

Thus, in the examples in the previous chapter, we would probably not be interested in the mean

per family, per school, per factory, or per block. Rather, we would be interested in estimating the

mean per family member, per school child, per factory worker, or per housing unit.

Consider a two-stage design in which the second stage units are the units of analysis; n clusters

are selected from among N clusters by simple random sampling; and mi units are selected in the

ith PSU using simple random sampling for i = 1, ... , n.

Within the ith cluster, the population mean per unit is given by

(11.1)    Ȳi = (1/Mi) Σ Yij,    summed over j = 1, . . ., Mi

Since the units within the cluster were selected by simple random sampling, we know (from

chapter 4, section 2) that we can estimate this mean without bias, by using the following formula:

(11.2)    ȳi = (1/mi) Σ yij,    summed over j = 1, . . ., mi

These estimates of the cluster unit means from the n sample clusters must then be combined in

some way to estimate the overall population total (Y) and the population mean per unit (Ȳ) given by the following formula:

    Ȳ = Y/M = (Σi Σj Yij) / (Σi Mi)

Several estimators are available and are discussed in most standard texts: we shall examine only

one of these.

First, we shall construct an estimator for the population total for the Y-characteristic. An

unbiased estimator for Yi, the ith PSU total, is given by

(11.3)    Ŷi = Mi ȳi

An unbiased estimator for the population total is then given by

(11.4)    Ŷ = (N/n) Σ Ŷi = (N/n) Σ Mi ȳi,    summed over the n sample PSUs

Similarly, we can estimate the total number of units of analysis in the population (assuming that we do not know it) by

(11.5)    M̂ = (N/n) Σ Mi,    summed over the n sample PSUs

An estimator of Ȳ is

(11.6)    Ȳ-hat = Ŷ / M̂ = (Σ Mi ȳi) / (Σ Mi)

As can be seen, this estimator is a weighted mean of the n sample cluster means per unit where

the weights are the corresponding cluster sizes. As indicated previously, this is only one of

several possible estimators; however, this estimator seems to be most generally useful. Since

both the numerator and denominator are random variables, this is a ratio-type estimator and it has

the usual bias of a ratio estimator. The bias will usually not be serious if the number of clusters

in the sample is reasonably large.
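Equations (11.3) through (11.6) can be sketched as follows; the data are hypothetical, with Mi the known PSU sizes and the inner lists the sampled SSU values.

```python
# Two-stage estimates of the total Y, the SSU count M, and the mean per SSU.
N = 10                                 # PSUs in the population
Mi = [40, 60]                          # SSUs in each sample PSU
samples = [[3, 5, 4], [6, 2, 4, 4]]    # sampled SSU values, per sample PSU
n = len(Mi)

ybar_i = [sum(s) / len(s) for s in samples]       # (11.2) sample PSU means
Yhat_i = [M * yb for M, yb in zip(Mi, ybar_i)]    # (11.3) estimated PSU totals
Yhat = (N / n) * sum(Yhat_i)                      # (11.4)
Mhat = (N / n) * sum(Mi)                          # (11.5)
mean_per_ssu = Yhat / Mhat                        # (11.6) weighted mean
print(Yhat, Mhat, mean_per_ssu)
```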

11.1.3 Variances

Consider the case when n PSUs are selected from a population of N PSUs and random samples of mi (i = 1, . . ., n) SSUs are taken from the Mi (i = 1, . . ., N) SSUs in the selected PSUs. Then the variance of Ŷ, the estimator of Y, is

(11.7)    V(Ŷ) = N² (1 - n/N) Sb²/n + (N/n) Σi Mi² (1 - mi/Mi) Si²/mi,    summed over all N PSUs

where,

(11.8)    Sb² = Σi (Yi - Y/N)² / (N - 1)    and

(11.9)    Si² = Σj (Yij - Ȳi)² / (Mi - 1)

The variance of the estimator of Y is the sum of two components. The first component is the contribution to the variance arising from the selection of first-stage units. The second component is the contribution from the selection of second-stage units. If there are three or more stages of sampling, the variance will include additional terms similar in form for each additional stage.

The sample estimator of V(Ŷ) is

(11.10)    v(Ŷ) = N² (1 - n/N) sb²/n + (N/n) Σi Mi² (1 - mi/Mi) si²/mi,    summed over the n sample PSUs

where,

(11.11)    sb² = Σi (Mi ȳi - Ŷ/N)² / (n - 1)

and,

(11.12)    si² = Σj (yij - ȳi)² / (mi - 1)

The variance of Ȳ-hat, the estimator of the population mean per unit, is approximately

(11.13)    V(Ȳ-hat) ≈ N² (1 - n/N) Sb′²/(n M²) + (N/(n M²)) Σi Mi² (1 - mi/Mi) Si²/mi,
           where Sb′² = Σi Mi² (Ȳi - Ȳ)² / (N - 1),

so that the first term is based on the deviations of the cluster means per unit from the overall mean per unit, weighted by the cluster sizes.

The approximate value of the variance of Ȳ-hat may also be obtained from equation (11.7) as follows:

(11.14)    V(Ȳ-hat) ≈ V(Ŷ) / M²

(11.15)    v(Ȳ-hat) = v(Ŷ) / M̂²

If all PSUs have the same number of second-stage units M̄ and a constant number m of them is sampled from every sample PSU, we have

(11.16)    f1 = n/N    and    f2 = m/M̄

and the variance of Ŷ reduces to

(11.17)    V(Ŷ) = N² M̄² [ (1 - f1) S1²/n + (1 - f2) S2²/(m n) ]

where,

(11.18)    S1² = Σi (Ȳi - Ȳ)² / (N - 1)    and    S2² = Σi Σj (Yij - Ȳi)² / (N (M̄ - 1))

The sample estimator is

(11.19)    v(Ŷ) = N² M̄² [ (1 - f1) s1²/n + f1 (1 - f2) s2²/(m n) ]

where,

(11.20)    s1² = Σi (ȳi - Ȳ-hat)² / (n - 1)    and    s2² = Σi Σj (yij - ȳi)² / (n (m - 1)),

with Ȳ-hat = (1/n) Σ ȳi when all cluster sizes are equal.

The variance of an estimated mean is

(11.21)    V(Ȳ-hat) = (1 - f1) S1²/n + (1 - f2) S2²/(m n)

with sample estimator

(11.22)    v(Ȳ-hat) = (1 - f1) s1²/n + f1 (1 - f2) s2²/(m n)

11.1.3.1 Illustration

A population consists of four clusters of five households each. The second-stage units, which are also the elementary units in this case, are households having numbers of persons as follows:

Household    Cluster 1    Cluster 2    Cluster 3    Cluster 4
1    3    8    4    7
2    10    3    6    2
3    9    6    3    6
4    8    4    8    4
5    6    5    6    6

First, select two clusters at random from a population of four clusters. Then within each of these selected clusters take a random sample of three households. Compute the estimate of the population total Y and the variance of Ŷ. Find the variance and the coefficient of variation of the estimate of the mean per element.

Suppose that clusters 3 and 4 are selected at random. Assume also that households 1, 2, and 5 within cluster 4 and households 2, 4, and 5 within cluster 3 are selected at random. Then we have

    ȳ3 = (6 + 8 + 6)/3 = 6.667    and    ȳ4 = (7 + 2 + 6)/3 = 5

    Ŷ = (N/n) Σ Mi ȳi = (4/2) [5(6.667) + 5(5)] = 116.67

Using equation (11.19),

    v(Ŷ) = N² M̄² [ (1 - f1) s1²/n + f1 (1 - f2) s2²/(m n) ]

where f1 = n/N = 2/4 = 0.5 and f2 = m/M̄ = 3/5 = 0.6.

We have,

    s1² = Σ (ȳi - ȳ̄)² / (n - 1) = (6.667 - 5.833)² + (5 - 5.833)² = 1.389,    with ȳ̄ = (6.667 + 5)/2 = 5.833

and

    s2² = Σi Σj (yij - ȳi)² / (n (m - 1)) = (2.667 + 14)/4 = 4.167

On substitution, we have,

    v(Ŷ) = (16)(25) [ (0.5)(1.389)/2 + (0.5)(0.4)(4.167)/6 ] = 194.4

The standard error of Ŷ is:

    √194.4 = 13.9
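The illustration can be reproduced numerically; the variance formula used here is the equal-cluster-size estimator referred to as equation (11.19).

```python
# Two-stage illustration: N = 4 clusters of Mbar = 5 households; n = 2 clusters
# sampled, m = 3 households per sample cluster.
N, Mbar, n, m = 4, 5, 2, 3
samples = [[6, 8, 6],   # households 2, 4, 5 of cluster 3
           [7, 2, 6]]   # households 1, 2, 5 of cluster 4

means = [sum(s) / m for s in samples]
Yhat = (N / n) * Mbar * sum(means)     # estimate of the population total

f1, f2 = n / N, m / Mbar
grand = sum(means) / n
s1 = sum((yb - grand) ** 2 for yb in means) / (n - 1)
s2 = sum((y - yb) ** 2
         for s, yb in zip(samples, means) for y in s) / (n * (m - 1))
v_Yhat = N**2 * Mbar**2 * ((1 - f1) * s1 / n + f1 * (1 - f2) * s2 / (m * n))
se = v_Yhat ** 0.5
print(round(Yhat, 2), round(v_Yhat, 2), round(se, 2))
```

So Ŷ ≈ 116.67 persons, v(Ŷ) ≈ 194.4, and the standard error is about 13.9, a coefficient of variation of roughly 12 percent.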

The above formulas are somewhat cumbersome. Consequently, short-cut approximations are often

used to reduce the amount of work, particularly if variance estimates are to be computed for a large

number of characteristics. One of these approximations is known as the random group method.

The random group method consists of dividing the sample into a number of groups at random; each

group is then used to make an estimate of the total, mean, etc. (this would be done for each

characteristic for which a variance is to be computed). Each of the random groups will reflect the

various steps of the sample selection so that the estimate from each group is an estimate of the total

with the same sample design as the whole sample (but with a much smaller sample size). In a multi-

stage sample, the random groups are usually formed by placing the entire sample from a primary

sampling unit in a single group. For complex designs using stratification and/or sampling over time,

somewhat different methods are available to divide the sample into random groups. However, the

method is not very useful if the number of first-stage units is small.

In computing the estimates of variance, it is exactly the variance between different possible estimates

of the total or mean in which we are interested. Therefore, this method which provides a number of

different estimates of the total or mean, each with some degree of stability (that is, the number of

cases in a group should not be too small) is a realistic one for estimating variances.
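The random group method can be sketched as follows; the PSU totals and grouping below are hypothetical. Each group of sample PSUs yields its own estimate of the population total, and the spread among the group estimates estimates the variance of their mean.

```python
import random

# Hypothetical: 12 sample PSUs (drawn from N = 120) with estimated PSU totals.
N = 120
psu_totals = [52, 61, 48, 70, 55, 63, 58, 49, 66, 54, 59, 62]
G = 4                                    # number of random groups

random.seed(1)
psus = psu_totals[:]
random.shuffle(psus)                     # form the groups at random
groups = [psus[g::G] for g in range(G)]  # each group keeps whole PSUs

# Each group estimate expands its PSUs to the whole population.
est = [(N / len(g)) * sum(g) for g in groups]
mean_est = sum(est) / G
v_mean = sum((e - mean_est) ** 2 for e in est) / (G * (G - 1))
print(mean_est, v_mean)
```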

Examining the variance equation (11.7) and equation (11.13), we can easily see what happens in two

simple situations. First, if all second-stage units are included in the sample we have the case

described in chapter 9 as "single-stage cluster sampling." In this case, mi = Mi and the term arising

from variation within first-stage units is zero. In equation (11.7), the first term is the same as the

variance formula for simple random sampling except that the sample sizes and values of Yi refer to

the first-stage units. For example, if area segments were the first-stage units, N is the total number of

area segments and Yi is the segment total for the variable. In equation (11.13), the first term is the between-cluster component of the overall variance, which is based on the differences among cluster means per unit of analysis rather than on differences among cluster totals.

Secondly, consider a situation in which all first-stage units are in the sample. In this case, n = N and the first term becomes zero. The variance of the estimator of the population total becomes equal to

    Σi Mi² (1 - mi/Mi) Si²/mi

The variance of the estimator of the population mean per element is then equal to

    (1/M²) Σi Mi² (1 - mi/Mi) Si²/mi

These are the variance formulas for the estimators of totals and means from a stratified sample. In other words, a stratified sample is simply a special case of a cluster sample in which all first-stage units are included in the sample and a subsample of second-stage units is selected from each first-stage unit.

This discussion has covered only the case of simple random sampling for both the first-stage and

second-stage selections. Analogous formulas can be developed for stratified cluster sampling in

which the only difference is that the terms in the equations are replaced by the sums of similar terms

over strata.

A more detailed analysis of equation (11.7) and equation (11.13) would show that for a two-stage

sample containing a given total number of units of analysis, the sampling variances of estimates

computed by equation (11.4) and equation (11.6) depend on several factors. Two important factors

which the sampling statistician must consider in designing the sample are:

(1) The variability in size of first-stage units in terms of the number of second-stage units they

contain.

(2) The variability among second-stage units (the elementary units or units of analysis) within first-

stage units.

If the first-stage units are unequal in size in terms of the number of second-stage units (for example,

the number of holdings in an area segment), these variations in size can have a profound effect on the

size of the variance of the estimator of the population total, as shown by the first term in equation


(11.7). We can see in equation (11.13) that the variance of the estimator of the population mean per

elementary unit is affected by the variation among first-stage means per element. If the variability in

size is very great, it will be necessary to use a large sample of first-stage units or to change the

sampling and estimating methods to keep the standard error within reasonable bounds (see section

11.4 below).

The second important factor is the variability among second-stage units (units of analysis) within

first-stage units (clusters). For a given sampling plan in which we select n out of N clusters and an

average of m̄ units of analysis out of each sample cluster, it can be shown that the greater the

variability among second-stage units within first-stage units, the smaller will be the sampling

variability of resulting estimates. In other words, it is desirable that the units of analysis have a

relatively low intraclass correlation. Intraclass correlation is a measure of similarity among units

within a cluster with regard to the characteristics being investigated.

A mathematical demonstration of this phenomenon is beyond the scope of this chapter; however, by

considering an extreme example we can gain an intuitive understanding of it. Consider a situation in

which the units of analysis within each cluster are identical. Clearly, a sampling plan such as

described above would not be efficient. A single unit of analysis within a given cluster would

provide complete information about all the units; consequently, the remaining units would

contribute nothing additional to our knowledge. To include them in the sample would be a waste of

resources. The inefficiency of this design in this situation would be reflected in a high sampling

variability relative to a simple random sample with the same number of units of analysis.

The statistician must consider the effect of intraclass correlation on the sampling variability when

designing a sample. This is particularly true of area sampling since units which are close together

geographically are usually quite similar for many characteristics such as income, education, attitudes,

type of agricultural activity, etc. The usual approach is to limit the number of units of analysis taken

from the first-stage units and include more of the first-stage units in the sample. In a single-stage

sample, the statistician can do this by making the clusters as small as practicable. The more common

approach, however, is to introduce additional stages in the sampling procedure so that the number of

units of analysis ultimately selected from each unit at the last stage is small. The statistician must, of

course, balance precision against cost in deciding on a sampling plan.

Notice that in cluster sampling we gain by having units within clusters as unlike as possible, but in

stratified sampling we gain by having units within strata as much alike as possible. The reason for

this difference becomes clear when you recall from section 11.2 above that in stratified sampling, the

"between-cluster" component of the variance drops out of the equation entirely.

In all of this discussion, it has been assumed that the only way we could affect the sampling variance,

with the given population, is to take more or fewer sample cases in the first or second stages or to

vary the size of the first-stage units. Of course, if the sampling variance can be reduced by


appropriate stratification, this should be done first. Several special procedures are also available to

control the effect of variability in size of cluster. The most important procedure is described below.

Although this discussion is related to a two-stage sample, a similar analysis could be made for three

or more stages. The procedures described below for controlling variability in size apply equally well

to first, second, or other stages, whenever cluster sampling is used.

One obvious method is to attempt to define clusters in such a way that they are approximately equal

in size in terms of the number of units of analysis with the expectation that this will tend to make

them equal also in terms of characteristics being investigated. If this can be done with available

materials and information, then no other action is necessary. For example, if block counts of

numbers of housing units are available for cities and villages, it may be possible to group small

blocks together to make clusters which contain approximately the same number of housing units.

In some cases, it may be feasible to define clusters directly in terms of a characteristic being

investigated. For example, in an agricultural survey, clusters can be constructed to be nearly equal in

area. If recent aerial photographs are available, they might even be made nearly equal in terms of

cultivated area.

If information is available on the size of all the first-stage clusters in the universe in advance of the

survey (reasonably good approximations are adequate), it is possible to stratify the clusters by size

group. The effect of stratification is to replace a total variance by a sum of within-stratum variances.

Within each stratum, the clusters should be about equal in size; therefore, stratification by size of

cluster will have about the same effect as making all clusters in the whole population about equal in

size.

If information on size is not available, it may be worthwhile to spend a small amount of the available

resources, for example, in making a "Quick Count" of city blocks in order to obtain approximate

sizes of the first-stage units (in terms of the number of housing units they contain). Errors in counts

do not cause biases in the estimates, which are based on the actual numbers of housing units found in

the survey itself.

Either optimum or proportionate sampling can be performed depending on which appears most

useful in the particular case. If more than one characteristic is being estimated, proportionate

sampling may be preferable to optimum allocation, since the optimum allocation might be different

for each characteristic. Also, proportionate sampling is usually safer unless very good measures of

size are available, since the use of the optimum allocation formula with poor measures of size may

actually increase the variance.

A third method of reducing the effect of variability in cluster size is through the use of ratio


estimates. Ratio estimates were discussed in detail in Chapter 9; an example of the method is given here. A ratio estimate makes use of a quantity of the form Ŷ/X̂, where both Ŷ and X̂ are estimates of totals made from sample data. X, the universe total of the quantity of which X̂ is an estimate, must be known (it may be a projection or other figure which is believed to be very close to the true value). One can make a ratio estimate of the universe total Y--an estimate which is frequently very efficient--by using

    ŶR = (Ŷ/X̂) X

instead of Ŷ alone. The new estimate of Y thus differs significantly from Ŷ, since it involves two items having sampling variances instead of one. Ratio estimates are generally much less sensitive to variation in size of cluster than estimates of the type

    Ŷ = (N/n) Σ Mi ȳi

and their use will frequently reduce the standard errors appreciably.

Two different uses of ratio estimates for this purpose will be discussed. In the first, "X" is a variable
closely related to the total number of units of analysis in the clusters, and x̂ is an estimate, based on
the sample clusters only, of the population aggregate, X. For example, consider a sample design in
which city blocks are the first-stage units, and housing units are both the second-stage units and the
units of analysis. We have rough counts (Xi) of housing units for each block based on a previous
census or special counts made for this purpose. These counts can be totaled for all blocks in the city
to obtain X. Then a sample estimate x̂ of X can be obtained by adding up the rough counts for the
sample blocks only, and multiplying this by N/n (where N is the total number of blocks in the city
and n is the number in the sample). Then, a ratio estimate of Y is

    Ŷ = (ŷ/x̂) X,

where ŷ is the corresponding estimate of Y made from the same sample blocks.

If subsampling is used within the first-stage units, the procedure would be modified. In order to

make the fullest gain with this type of ratio estimate, it is advisable not to subsample independently

within the clusters, but to treat the second-stage units within the clusters as a continuous list and

sample systematically throughout.
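As a sketch of this first use of the ratio estimate, the block counts and survey totals below are hypothetical illustrative numbers (not from the text): ŷ and x̂ are the inflated totals computed from the same sample blocks, and X is the known total of the rough counts.

```python
# Hypothetical data: rough housing-unit counts for all N = 6 blocks,
# and survey totals y_i observed in the n = 3 sample blocks.
rough_counts = [50, 12, 20, 31, 10, 60]       # X_i for every block
X = sum(rough_counts)                         # known universe total of rough counts
sample_blocks = [0, 3, 5]                     # indices of the sampled blocks
y_sample = [48, 33, 64]                       # observed survey totals in sampled blocks

N, n = len(rough_counts), len(sample_blocks)
y_hat = (N / n) * sum(y_sample)                                # inflation estimate of Y
x_hat = (N / n) * sum(rough_counts[b] for b in sample_blocks)  # same estimate applied to the X_i
Y_ratio = (y_hat / x_hat) * X                                  # ratio estimate of Y
```

Note that the inflation factor N/n cancels in ŷ/x̂, so the ratio estimate depends only on the sample sums and the known total X.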


11.4.3.2 Ratio to a Correlated Statistic

In a second use of ratio estimates, the true value of some universe total X is known and a sample
estimate x̂ of X can be obtained in the survey. If the characteristics "Y" and "X" are positively
correlated, then the ratio estimate (ŷ/x̂)X will reduce the effect of variability in cluster size (and
possibly other types of variability as well). For example, suppose a survey is planned to measure the
total wage and salary earnings of factory workers (Y). We can do this by taking a sample of factories
(the clusters) and including all workers within the sample of factories. Suppose the total sales of all
factories can be found (X) from some other source--tax records, for example. We could then include
on our questionnaire to the sample factories a question on total sales (xi) as well as wage and salary
payments (yi), and we could prepare estimates ŷ and x̂ of the population totals for both characteristics
from the sample in the usual manner. The ratio estimate of wages and salaries would then be

    Ŷ = (ŷ/x̂) X.
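The factory example can be sketched the same way; the wage, sales, and tax-record figures below are hypothetical illustrative numbers, not data from the text.

```python
# Hypothetical sketch: n = 4 sample factories out of N = 20.
wages = [120.0, 95.0, 210.0, 75.0]     # y_i: wage and salary payments (thousands)
sales = [900.0, 700.0, 1500.0, 560.0]  # x_i: sales reported by the same factories
X = 18_000.0                           # known total sales of all 20 factories (tax records)
N, n = 20, 4

y_hat = (N / n) * sum(wages)   # inflation estimate of total wages
x_hat = (N / n) * sum(sales)   # survey-based estimate of total sales
Y_ratio = (y_hat / x_hat) * X  # ratio estimate of total wages
```

Because sales and wages are positively correlated, a sample that happens to contain unusually large factories inflates ŷ and x̂ together, and the ratio ŷ/x̂ stays stable.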

A fourth method for controlling the effects of variability in cluster size is to select the sample

clusters with probability proportionate to size instead of using a simple random sample of clusters.

Probability proportionate to size is frequently abbreviated as PPS. Selection with PPS means that a

cluster which is, for example, 5 times as large as another, will have 5 times the chance to be in the

sample. It might appear, at first, that this would introduce a bias in the sample result, with some

clusters over represented and others under represented. When PPS is used, the unbiased estimate of

the total, where there is no subsampling, is

    Ŷ = Σ (Yi/Pi),

where the sum extends over the n sample clusters. Here Yi is the total in the ith cluster in the sample
and Pi is the probability of selection of this cluster.

It can easily be shown that this provides an unbiased estimate of Y.

A common application of sampling with PPS is the use of PPS for the selection of the first-stage

units in a two-stage sample. When this is done, the subsampling rates are usually set as inversely

proportional to size. As a result, the chance of any second-stage unit being included in the sample is

the product of the probability of the first-stage and second-stage selections. All second-stage units

therefore have identical probabilities and the sample is self-weighting.

There are a number of other advantages to this type of selection procedure; for example, the

workload can be made approximately the same for all selected first-stage units. Moreover, the

estimates will have smaller variances than those from a proportionate sample in which the first-stage

units are selected with equal probabilities.
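The unbiasedness of the PPS estimate can be sketched for the simplest case of a single draw (n = 1); the cluster sizes and totals below are made-up illustrative numbers.

```python
# Sketch of the PPS estimator with one sampled cluster (n = 1), using
# hypothetical cluster totals Y_i and sizes M_i. P_i = M_i / M is the
# selection probability; drawing cluster i gives the estimate Y_i / P_i.
sizes = [50, 12, 20, 31, 10]   # M_i, measures of size
totals = [48, 14, 19, 33, 9]   # Y_i, cluster totals of the characteristic
M = sum(sizes)
Y = sum(totals)                # true universe total

probs = [m / M for m in sizes]
estimates = [y / p for y, p in zip(totals, probs)]

# Unbiasedness check: averaging the estimate over all possible draws,
# weighted by the draw probabilities, recovers the true total Y exactly.
expected = sum(p * est for p, est in zip(probs, estimates))
```

The check works because each term is Pi · (Yi/Pi) = Yi, so the expectation is simply ΣYi = Y.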


11.4.4.2 Measures of Size

In order to select with PPS, it is necessary to have measures of size of each cluster in the population.

If measures of size are not available, it will usually be found worth the effort to prepare crude

estimates of size. (Rough approximations will be almost as effective as more exact measures.) Let

us assume such measures are available. The mechanics for selecting a sample with PPS can best be

described through an illustration.

11.4.4.3 Illustration

Suppose the clusters are blocks and we wish to sample the housing units in a universe made up of the

10 blocks as listed in column (1) of Table 11.1. We would list, in column (2), the measure of size

for each block (this may be a rough estimate of the number of housing units), and cumulate the

measures of size in column (3). The last figure in column (3) is the total number (rough estimate) of

housing units in all 10 blocks. Let us assume that we wish to include in the sample 5 blocks out of

the 10, and that the sample is to include 10 percent of all the housing units.

Table 11.1

Block      Measure    Cumulative    Sample         Probability of      Sampling rate
number     of size    measure       designation    selection           within block
(PSU)      (Mi)                                    (Pi) = n(Mi/M)      = (1/10)/Pi

 1           50         1 - 50         22.5         50 ÷ 60.2           60.2 ÷ 500
 2           12        51 - 62
 3           20        63 - 82
 4           31        83 - 113        82.7         31 ÷ 60.2           60.2 ÷ 310
 5           10       114 - 123
 6           60       124 - 183       142.9         60 ÷ 60.2           60.2 ÷ 600
 7           55       184 - 238       203.1         55 ÷ 60.2           60.2 ÷ 550
 8           13       239 - 251
 9           30       252 - 281       263.3         30 ÷ 60.2           60.2 ÷ 300
10           20       282 - 301

Total       301

After completing the first three columns of Table 11.1 as shown, proceed as follows:

(1) Since there are 5 blocks in the sample, divide the final cumulative measure (301) by 5; this


gives 60.2, which is the "sampling interval" for selecting blocks.

(2) Choose a random number between 0.1 and 60.2; suppose the number happens to be 22.5. This

number is called the Random Start (RS).

(3) Use this random number as the starting number and enter it in column (4), on the line whose

cumulative measure interval includes the number 22.5. In our example, the cumulative

measure interval is [0 - 50].

(4) Add the sampling interval (60.2) to the random start (22.5), that is add 60.2 to 22.5. This

number is equal to 82.7; enter 82.7 on the line whose cumulative measure interval contains this

number. In our case, the interval is [83 - 113]. Continue adding 60.2 to the last number

obtained (82.7 in our case) and obtain the next one: 142.9. Locate the interval which contains

142.9. In our case the interval is [124 - 183]. Continue with this procedure until a number is

reached which is larger than the last cumulative measure.

(5) The blocks with entries in column (4) are the ones in the sample. In this example, they are

blocks 1, 4, 6, 7, and 9.

(6) The probability (Pi) of selection of each block actually selected is entered in column (5). For

each block, the probability is the measure of size in column (2) divided by the sampling

interval 60.2.

(7) The sampling rate to be used within each selected block is computed and entered in column (6).

For each block, the rate is the desired overall probability of selection, namely 1/10, divided by

the entry in column (5). Thus, for block 1, the rate in column (6) would be (1/10) ÷ (50/60.2),
or 60.2 ÷ 500 (about 1 housing unit in 8.3).

(8) It occasionally happens that some of the blocks are so large that the measures of size are greater

than the sampling interval. As a result, there may be two or more entries in column (4) for the

same block. In such a case, the subsampling rate within the block is adjusted to make the

overall probability for the selection of housing units equal to 1/10, in our example.
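The selection steps above can be sketched in code. The measures of size below reproduce Table 11.1 (block 1's measure, 50, is implied by the cumulative intervals), and 22.5 is the random start from step (2).

```python
import bisect

def pps_systematic(sizes, n_clusters, random_start):
    """Systematic PPS selection: cumulate the measures of size, step through
    them by the sampling interval, and pick the cluster whose cumulative
    interval contains each selection number. Returns 1-based cluster numbers."""
    cum = []
    running = 0
    for size in sizes:
        running += size
        cum.append(running)
    interval = running / n_clusters
    selected = []
    point = random_start
    while point <= running:
        # first cluster whose cumulative measure reaches the selection number
        selected.append(bisect.bisect_left(cum, point) + 1)
        point += interval
    return selected, interval

sizes = [50, 12, 20, 31, 10, 60, 55, 13, 30, 20]  # measures of size, blocks 1-10
blocks, interval = pps_systematic(sizes, 5, 22.5)
# blocks == [1, 4, 6, 7, 9] and interval == 60.2, matching the worked example.
```

Step (8)'s special case appears naturally here: a block whose measure exceeds the interval can be hit by two successive selection numbers and will then appear twice in the returned list.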


Two-Stage Cluster Sampling

1. A nurseryman wants to estimate the average height of seedlings in a large field that is divided

into 50 plots that vary slightly in size. He believes the heights are fairly constant throughout

each plot, but may vary considerably from plot to plot. Therefore, it is decided to sample 10%

of the trees within each of 10 plots using a two-stage cluster sample. The data are as follows:

Plot       Number of       Number of       Heights of sampled
           seedlings       seedlings       seedlings (inches)
           in plot         sampled

1 52 5 12, 11, 12, 10, 13

2 56 6 10, 9, 7, 9, 8, 10

3 60 6 6, 5, 7, 5, 6, 4

4 46 5 7, 8, 7, 7, 6

5 49 5 10, 11, 13, 12, 12

6 51 5 14, 15, 13, 12, 13

7 50 5 6, 7, 6, 8, 7

8 61 6 9, 10, 8, 9, 9, 10

9 60 6 7, 10, 8, 9, 9, 10

10 45 6 12, 11, 12, 13, 12, 12

Estimate the average height of seedlings in the field, and place a bound on the error of

estimation.
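As a check on Exercise 1, one common point estimate (the ratio-to-size estimator of the mean, ȳ = Σ Mi ȳi / Σ Mi, where Mi is the number of seedlings in plot i and ȳi the sample mean height in that plot) can be computed as follows; the bound on the error of estimation requires the variance formulas and is left to the reader.

```python
# Plot sizes M_i and sampled heights for the 10 sampled plots of Exercise 1.
plot_sizes = [52, 56, 60, 46, 49, 51, 50, 61, 60, 45]
heights = [
    [12, 11, 12, 10, 13],
    [10, 9, 7, 9, 8, 10],
    [6, 5, 7, 5, 6, 4],
    [7, 8, 7, 7, 6],
    [10, 11, 13, 12, 12],
    [14, 15, 13, 12, 13],
    [6, 7, 6, 8, 7],
    [9, 10, 8, 9, 9, 10],
    [7, 10, 8, 9, 9, 10],
    [12, 11, 12, 13, 12, 12],
]

plot_means = [sum(h) / len(h) for h in heights]
# Ratio-to-size estimate of the mean height: sum(M_i * ybar_i) / sum(M_i)
mean_height = sum(m * yb for m, yb in zip(plot_sizes, plot_means)) / sum(plot_sizes)
# mean_height is about 9.38 inches.
```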

2. In Exercise 1, assume that the nurseryman knows there are approximately 2600 seedlings in the

field. Use this additional information to estimate the average height, and place a bound on the

error of estimation.

3. A supermarket chain has stores in 32 cities. A company official wants to estimate the

proportion of stores in the chain which do not meet a specified cleanliness criterion. Stores

within each city appear to possess similar characteristics; therefore, it is decided to select a

two-stage cluster sample containing one-half of the stores within each of four cities. Cluster

sampling is desirable in this situation because of travel costs. The data collected are as follows:

City       Number of          Number of          Number of sampled stores
           stores in city     stores sampled     not meeting criterion

1 25 13 3

2 10 5 1

3 18 9 4

4 16 8 2


Estimate the proportion of stores not meeting the cleanliness criterion, and place a bound on

the error of estimation.

5. To improve telephone service, an executive of a certain company wants to estimate the total

number of phone calls placed by secretaries in the company during one day. The company

contains 12 departments, each making approximately the same number of calls per day. Each

department employs approximately 20 secretaries, and the number of calls made varies

considerably from secretary to secretary. It is decided to employ two-stage cluster sampling

using a small number of departments (cluster) and selecting a fairly large number of secretaries

(elements) from each. Ten secretaries are sampled from each of four departments. The data are

summarized in the following table:

Department     Number of        Secretaries     Mean number of     Variance
               secretaries      sampled         calls (ȳi)         (si²)

1 21 10 15.5 2.8

2 23 10 15.8 3.1

3 20 10 17.0 3.5

4 20 10 14.9 3.4

Estimate the total number of calls placed by the secretaries in this company, and place a bound

on the error of estimation.

6. A city zoning commission wants to estimate the proportion of property owners in a certain

section of a city who favor a proposed zoning change. The section is divided into 7 distinct

residential areas, each containing similar residents. Because the results must be obtained in a

short period of time, two-stage cluster sampling is used. Three of the 7 areas are selected at

random and 20% of the property owners in each area selected are sampled. The figure of 20%

seems reasonable because the people living within each area seem to be in the same

socioeconomic class and, hence, they tend to hold similar opinions on the zoning question.

The results are as follows:

Area       Number of           Number of          Number favoring
           property owners     owners sampled     zoning change

1 46 9 1

2 67 13 2

3 93 20 2


Estimate the proportion of property owners who favor the proposed zoning change, and place a

bound on the error of estimation.

7. A forester wants to estimate the total number of trees in a certain county which are infected

with a particular disease. There are ten well-defined forest areas in the country; these areas can

be subdivided into plots of approximately the same size. Four crews are available to conduct

the survey, which must be completed in one day. Hence, two-stage cluster sampling is used.

Four areas (clusters) are chosen with 6 plots (elements) randomly selected from each. (Each

crew can survey six plots in one day). The data are as follows:

Area       Number of     Number of plots     Number of infected
           plots         sampled             trees per plot

1 12 6 15, 14, 21, 13, 9, 10

2 15 6 4, 6, 10, 9, 8, 5

3 14 6 10, 11, 14, 10, 9, 15

4 21 6 8, 3, 4, 1, 2, 5

Estimate the total number of infected trees in the county, and place a bound on the error of

estimation.

8. A new bottling machine is being tested by a company. During a test run, the machine fills 24

cases, each containing a dozen bottles. It is desired to estimate the average number of ounces

of fill per bottle. A two -stage cluster sample is employed using 6 cases (clusters) with 4

bottles (elements) randomly selected from each. The results are as follows:

Case       Mean ounces of fill (ȳi)       Variance (si²)

1 7.9 0.15

2 8.0 0.12

3 7.8 0.09

4 7.9 0.11

5 8.1 0.10

6 7.9 0.12

Estimate the average number of ounces per bottle, and place a bound on the error of estimation.


9. A population consists of four clusters. The second-stage units, which are also the elementary

units in this case, are houses having rental values as follows:

Cluster:        1        2        3        4
              $100     $100      $10      $50
               100      100       20       90
               200                40
               400                50
TOTALS         800      200      120      140

b. What is the value of the within-cluster variance for the first cluster?

c. A sample of two clusters is selected with equal probability; within each selected

cluster, half the elementary units are in the sample.

(iii) What is the probability that any elementary unit will be in the sample?

10. The following table shows areas of cacao holdings of 15 farmers in five clusters (PSUs) of

equal size. The five clusters were selected at random from a total of 40 clusters into which the

territory had been divided. Each PSU represents a geographic division containing 120 cacao

farmers.

Cluster 1 Cluster 2 Cluster 3 Cluster 4 Cluster 5

96 110 102 140 132

134 121 113 142 162

152 146 157 161 184

TOTALS 382 377 372 443 478


c. Compute the standard errors for the estimates given in exercises (a) and (b).

d. Compute the coefficient of variation for the estimates given in exercises (a) and (b).

11. Assume a city with 12 blocks, as listed in the first column below. Measures of size

(approximate number of housing units in each block) are given in the second column. On the

basis of this information, we wish to select a sample of 4 blocks with probability proportionate

to size, and then to select housing units within the blocks in order to obtain a self-weighting

sample of an expected 10 housing units.

Block      Measure of size       Cumulative     Actual number     Serial numbers
(PSU)      (approximate number   measure        of housing        of actual
           of housing units)                    units*            housing units

1 10 10 9 1 to 9

2 5 15 6 10 to 15

3 2 17 2 16 to 17

4 5 22 6 18 to 23

5 5 27 6 24 to 29

6 10 37 8 30 to 37

7 10 47 8 38 to 45

8 2 49 2 46 to 47

9 2 51 4 48 to 51

10 5 56 6 52 to 57

11 5 61 6 58 to 63

12 10 71 9 64 to 72

TOTALS 71 72

* The number of housing units that would actually be found in the block in a field operation if the block were

selected in the sample.

a. Prepare a worksheet showing the selection of the sample of blocks. Assume 3.7 is the

random start number for designating the sample blocks.

b. Assume that you have visited the blocks selected in your sample and determine the actual

number of housing units as given in the fourth column above. The housing units that

actually exist in each block are designated by "Serial Numbers" as shown in the fifth

column. Perform necessary computations for selecting the sample of housing units and

list the Serial Numbers for the housing units selected in your sample.

c. Consider the list of 600 households of 30 villages located in 3 zones (See Appendix IV).


Using a two-stage cluster sample design, it is desired to estimate the total number of

persons in the population. A random sample of four clusters is chosen and five

households in each sampled cluster are randomly selected. Assume

households and consider the village as the cluster (PSU) for the survey.


CHAPTER 12

NONRESPONSE

The best way to deal with nonresponse is to prevent it. After nonresponse has occurred, it is

sometimes possible to model the missing data, but predicting the missing observations is never as

good as observing them in the first place. Nonrespondents often differ in critical ways from

respondents; if the nonresponse rate is not negligible, inference based upon only the respondents may

be seriously flawed.

We discuss two types of nonresponse in this chapter: unit nonresponse, in which the entire

observation unit is missing, and item nonresponse, in which some measurements are present for the

observation unit but at least one item is missing. In a survey of persons, unit nonresponse means that

the person provides no information for the survey; item nonresponse means that the person does not

respond to a particular item on the questionnaire. In the Current Population Survey and the National
Crime Victimization Survey (NCVS), unit nonresponse can arise for a variety of reasons: the

interviewer may not be able to contact the households; the person may be ill and cannot respond to

the survey; the person may refuse to participate in the survey. In these surveys, the interviewer tries

to get demographic information about the nonrespondent, such as age, sex, and race, as well as

characteristics of the dwelling unit, such as urban/rural status; this information can be used later to

adjust for the nonresponse. Item nonresponse occurs largely because of refusals: a household may

decline to give information about income, for example.

In agriculture or wildlife surveys, the term missing data is generally used instead of nonresponse, but

the concepts and remedies are similar. In a survey of breeding ducks, for example, some birds will

not be found by the researchers; they are, in a sense, nonrespondents. The nest may be raided by

predators before the investigator can determine how many eggs were laid; this is comparable to item

nonresponse.

The following are the main ways of dealing with nonresponse:

1. Prevent it. Design the survey so that nonresponse is low. This is by far the best method.

2. Take a representative subsample of the nonrespondents, and use that subsample to make
inferences about the other nonrespondents.

3. Use a model to predict values for the nonrespondents. Weights implicitly use a model to

adjust for unit nonresponse. Imputation often adjusts for item nonresponse, and parametric

models may be used for either type of nonresponse.

Example 12.1

Thomas and Siring (1983) report results from a 1969 survey on voting behavior carried out by the

Central Bureau of Statistics in Norway. In this survey, three calls were followed by a mail survey.

The final nonresponse rate was 9.9%, which is often considered to be a small nonresponse rate. Did

the nonrespondents differ from the respondents?

In the Norwegian voting register, it was possible to find out whether a person voted in the election.

The percentage of persons who voted could then be compared for respondents and nonrespondents;

Table 12.1 shows the results. The selected sample is all persons selected to be in the sample,

including data from the Norwegian voting register for both respondents and nonrespondents.

The difference in voting rate between the nonrespondents and the selected sample was largest in the

younger age groups. Among the nonrespondents, the voting rate varied with the type of nonresponse.

The overall voting rate for the persons who refused to participate in the survey was 81%, the voting

rate for the not-at-homes was 65%, and the voting rate for the mentally and physically ill was 55%,

implying that absence and illness, rather than refusal, were the primary causes of nonresponse bias.

Table 12.1  Percentage who voted, by age group

                                      Age
                    All    20-24    25-29    30-49    50-69    70-79
Nonrespondents       71      59       56       72       78       74
Selected Sample      88      81       84       90       91       84

It has been demonstrated repeatedly that nonresponse can have large effects on the results of a

survey–in example 12.1, a nonresponse rate of less than 10% led to an overestimate of voting rate in

Norway. Holt and Elliot discuss the results of a series of studies done on nonresponse in the United

Kingdom, indicating that “lower response rates are associated with the following characteristics:

London residents; households with no car; single people; childless couples; older people;
divorced/widowed people; new Commonwealth origin; lower educational attainment;
self-employed” (1991, 334).

Moreover, increasing the sample size without targeting nonresponse does nothing to reduce

nonresponse bias; a larger sample size merely provides more observations from the class of persons

that would respond to the survey. Increasing the sample size may actually worsen the nonresponse

bias, as the larger sample size may divert resources that could have been used to reduce or remedy

the nonresponse or it may result in less care in the data collection. Recall that the infamous Literary

Digest Survey of 1936 (see Annex 1) had 2.4 million respondents but a response rate of less than

25%. The U. S. decennial census itself does not include the entire population, and the undercoverage

rate varies for different demographic groups. In the early 1990s, the nonresponse and undercoverage

in the U. S. Census prompted a lawsuit from certain cities to force the Census Bureau to adjust for

the nonresponse, and the debate about census adjustment continues.


Most small surveys ignore any nonresponse that remains after callbacks and follow-ups, and report

results based on complete records only. Hite (1987) did so in her survey and much of the criticism of

her results was based on her low response rate. Nonresponse is also ignored for many surveys

reported in newspapers, both local and national.

An analysis of complete records has the underlying assumption that the nonrespondents are similar to

the respondents and that units with missing items are similar to units that have responses for every

question. Much evidence indicates that this assumption does not hold true in practice. If

nonresponse is ignored in the NCVS, for example, victimization rates are underestimated. Biderman

and Cantor (1984) find lower victimization rates for persons who respond in three consecutive

interviews than for persons who are nonrespondents in at least one of those interviews or who

move before the panel study is completed.

Results reported from an analysis of only complete records should be taken as representative of the

population of persons who would respond to the survey, which is rarely the same as the target

population. If you insist on estimating population means and totals using only the complete records

and making no adjustment for nonrespondents, at the very least you should report the rate of

nonresponse.

The main problem caused by nonresponse is potential bias of population estimates. Think of the

population as being divided into two somewhat artificial strata of respondents and nonrespondents.

The population respondents are the units that would respond if they were chosen to be in the sample;

the number of population respondents, NR, is unknown. Similarly, the NM (M for missing)

population nonrespondents are the units that would not respond. We then have the following

population quantities:

                        Number of units     Total     Mean
Respondents                   NR             TR       ȲR = TR/NR
Nonrespondents                NM             TM       ȲM = TM/NM
Entire Population             N              T        Ȳ = T/N

with mean Ȳ = T/N and total T. A probability sample from the population will likely contain some
respondents and some nonrespondents. But, of course, on the first call we do not observe yi for any
of the units in the nonrespondent stratum. If the population mean in the nonrespondent stratum
differs from that in the respondent stratum, estimating the population mean using only the
respondents will produce bias.¹

Let ȳR be an approximately unbiased estimator of the mean ȲR in the respondent stratum, using only
the respondents. Because Ȳ = (NR/N) ȲR + (NM/N) ȲM, the bias of ȳR as an estimate of the
population mean is approximately

    E[ȳR] − Ȳ ≈ ȲR − Ȳ = (NM/N)(ȲR − ȲM).

The bias is small if either (1) the mean for the nonrespondents is close to the mean for the

respondents or (2) (NM/N) is small–there is little nonresponse. But we can never be assured of (1), as

we generally have no data for the nonrespondents. Minimizing the nonresponse rate is the only sure

way to control nonresponse bias.
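To make the bias formula concrete, here is a small arithmetic sketch; the voting rates (90% for respondents, 71% for nonrespondents) and the 9.9% nonresponse rate are illustrative numbers loosely echoing Example 12.1, not exact figures from it.

```python
def nonresponse_bias(mean_resp, mean_nonresp, nonresp_rate):
    """Approximate bias of the respondents-only mean: (NM/N) * (YbarR - YbarM)."""
    return nonresp_rate * (mean_resp - mean_nonresp)

# Illustrative values: 90% of the respondent stratum votes, 71% of the
# nonrespondent stratum votes, and the nonresponse rate NM/N is 9.9%.
bias = nonresponse_bias(0.90, 0.71, 0.099)

# The true population mean is the stratum-weighted average; the respondents-only
# estimate (0.90) overshoots it by exactly the bias computed above.
true_mean = (1 - 0.099) * 0.90 + 0.099 * 0.71
```

Even a sub-10% nonresponse rate shifts the estimate by nearly 2 percentage points here, which is why minimizing nonresponse is the only sure remedy.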

A common feature of poor surveys is a lack of time spent on the design and nonresponse follow-up

in the survey. Many persons new to surveys (and some, unfortunately, not new) simply jump in and

start collecting data without considering potential problems in the data-collection process; they mail

questionnaires to everyone in the target population and analyze those that are returned. It is not

surprising that such surveys have poor response rates. Many surveys reported in academic journals

on purchasing, for example, have response rates between 10 and 15%. It is difficult to see how

anything can be concluded about the population in such a survey.

A researcher who knows the target population well will be able to anticipate some of the reasons for

nonresponse and prevent some of it. Most investigators, however, do not know as much about

reasons for nonresponse as they think they do. They need to discover why the nonresponse occurs

and resolve as many of the problems as possible before commencing the survey.

These reasons can be discovered through designed experiments and application of quality-

improvement methods to the data collection and processing. You do not know why previous surveys

related to yours have a low response rate? Design an experiment to find out. You think errors are

introduced in the data recording and processing? Use a nested design to find the sources of errors.

Any book on quality control or designed experiments will tell you how to collect your data.

And, of course, you can rely on previous researchers’ experiments to help you minimize
nonsampling errors. The references on experiment design and quality control at the end of the book
are a good place to start; Hidiroglou et al. (1993) give a general framework for nonresponse.

¹ The variance is often too low as well. In income surveys, for example, the rich and the poor are more
likely to be nonrespondents on the income question. In that case, SR², the variance for the respondent
stratum, is smaller than S². The point estimate of the mean may be biased, and the variance estimate may
be biased, too.

Example 12.2

The 1990 U. S. decennial census attempted to survey each of the over 100 million households in the

United States. The response rate for the mail survey was 65%; households that did not mail in the

survey needed to be contacted in person, adding millions of dollars to the cost of the census.

Increasing the mail response rate for future censuses would result in tremendous savings.

Dillman et al. (1995a) report results of a factorial experiment employed in the 1992 Census

Implementation Test, designed to explore the individual effects and interactions of three

experimental factors on response rates. The three factors were:

(1) a prenotice letter alerting the household to the impending arrival of the census form,

(2) a stamped return envelope included with the census form, and

(3) a reminder postcard sent a few days after the census form.

The results were dramatic, as shown in Figure 12.1. The experiment established that, although all

three factors influenced the response rate, the letter and postcard led to greater gains in response rate

than the stamped return envelope.

Figure 12.1

Response rates achieved for each combination of the factors letter, envelope, and postcard. The observed

response rate was 64.3% when all three aids were used and only 50% when none were used.

Nonresponse can have many different causes; as a result, no single method can be recommended for

every survey. Platek (1977) classifies sources of nonresponse as related to (1) survey content, (2)

methods of data collection, and (3) respondent characteristics, and illustrates various sources using

the diagram in Figure 12.2. Groves (1989) and Dillman (1978) discuss additional sources of

nonresponse.


Figure 12.2

The following are some factors that may influence response rate and data accuracy.

# Survey content. A survey on drug use or financial matters may have a large number of

refusals. Sometimes the response rate can be increased for sensitive items by careful

ordering of the questions or by using a randomized response technique (see Section 12.5).

# Time of survey. Some calling periods or seasons of the year may yield higher response rates

than others. The vacation month of August, for example, would be a bad time to take a one-

time household survey in Germany.

# Interviewers. Gower (1979) found a large variability in response rates achieved by different

interviewers, with about 15% of interviewers reporting almost no nonresponse. Some field

investigators in a bird survey may be better at spotting and identifying birds than others.

Standard quality-improvement methods can be applied to increase the response rate and

accuracy for interviewers. The same methods can be applied to the data-coding process.

# Data-collection method. Generally, telephone and mail surveys have a lower response rate

than in-person surveys (they also have lower costs, however). Computer Assisted Telephone

Interviewing (CATI) has been demonstrated to improve accuracy of data collected in

telephone surveys; with CATI, all questions are displayed on a computer, and the interviewer


codes the responses in the computer as questions are asked. CATI is especially helpful in

surveys in which a respondent’s answer to one question determines which question is asked

next (Catlin and Ingram 1988).

Mail, fax, and Internet surveys often have low response rates. Possible reasons for

nonresponse in a mail survey should be explored before the questionnaire is mailed: Is the

survey sent to the wrong address? Do recipients discard the envelope as junk mail even

before opening it? Will the survey reach the intended recipient? Will the recipient believe

that filling out the survey is worth the time?

# Questionnaire design. We have already seen that question wording has a large effect on the

responses received; it can also affect whether a person responds to an item on the

questionnaire. The volume edited by Tanur (1993) explores some recent research on

application of cognitive research on question design. In a mail survey, a well-designed form

for the respondent may increase data accuracy.

# Respondent burden. Persons who respond to a survey are doing you an immense favor, and

the survey should be as nonintrusive as possible. A shorter questionnaire, requiring less

detail, may reduce the burden to the respondent. Respondent burden is a special concern in

panel surveys such as the NCVS, in which sampled households are interviewed every six

months for 3 ½ years. DeVries et al. (1996) discuss methods for reducing respondent
burden, so that a smaller sample suffices to give the required precision.

# Survey introduction. The survey introduction provides the first contact between the

interviewer and potential respondent; a good introduction, giving the recipient motivation to

respond, can increase response rates dramatically. Nielsen Media Research emphasizes to

households in its selected sample that their participation in the Nielsen ratings affects which

television shows are aired. The respondent should be told for what purpose the data will be

used (unscrupulous persons often pretend to be taking a survey when they are really trying to

attract customers or converts) and assured confidentiality.

# Incentives and disincentives. Incentives, financial or otherwise, may increase the response

rate. Disincentives may work as well: Physicians who refused to be assessed by peers after

selection in a stratified sample from the College of Physicians and Surgeons of Ontario

registry had their medical licenses suspended. Not surprisingly, nonresponse was low

(McAuley et al. 1990).

# Follow-up. The initial contact of the sample is usually less costly per unit than follow-ups of

the initial nonrespondents. If the initial survey is by mail, a reminder may increase the

response rate. Not everyone responds to follow-up calls, though; some persons will refuse to

respond to the survey no matter how often they are contacted. You need to decide how many

follow-up calls to make before the marginal returns do not justify the money spent.

You should try to obtain at least some information about nonrespondents that can be used later to

adjust for the nonresponse, and include surrogate items that can be used for item nonresponse. True,

there is no complete compensation for not having the data, but partial information may be better than

none. Information about the race, sex, or age of a nonrespondent may be used later to adjust for


nonresponse. Questions about income may well lead to refusals, but questions about cars,

employment, or education may be answered and can be used to predict income. If the pretests of the

survey indicate a nonresponse problem that you do not know how to prevent, try to design the survey

so that at least some information is collected for each observation unit.

The quality of survey data is largely determined at the design stage. Fisher’s (1938) words about

experiments apply equally well to the design of sample surveys: “To call in the statistician after the

experiment is done may be no more than asking him to perform a postmortem examination: he may

be able to say what the experiment died of.” Any survey budget needs to allocate sufficient

resources for survey design and for nonresponse follow-up. Do not scrimp on the survey design;

every hour spent on design may save weeks of remorse later.

Virtually all good surveys rely on callbacks to obtain responses from persons not at home for the first

try. Analysis of callback data can provide some information about the biases that can be expected

from the remaining nonrespondents.

Example 12.3

Traugott (1987) analyzed callback data from two 1984 Michigan polls on preference for presidential

candidates. The overall response rates for the surveys were about 65%, typical for large political

polls. About 21% of the interviewed sample responded on the first call; up to 30 attempts were

made to reach persons who did not respond on the first call. Traugott found that later respondents

were more likely to be male, older, and Republican than early respondents; while 48% of the

respondents who answered the first call supported Reagan and 45% supported Mondale, 59% of the

entire sample supported Reagan as opposed to 39% for Mondale. Differing procedures for

nonresponse follow-up and persistence in callback may explain some of the inconsistencies among

political polls.

If nonrespondents resemble late respondents, one might speculate that nonrespondents were more

likely to favor Reagan. But nonrespondents do not necessarily resemble the hard-to-reach; persons

who absolutely refuse to participate may differ greatly from persons who could not be contacted

immediately, and nonrespondents may be more likely to have illnesses or other circumstances

preventing participation. We also do not know how likely it is that nonrespondents to the surveys

will vote in the election; even if we speculate that they were more likely to favor Reagan, they are

not necessarily more likely to vote for Reagan.

Often, when the survey is designed so that callbacks will be used, the initial contact is by mail

survey; the follow-up calls use a more expensive method such as a personal interview.

Hansen and Hurwitz (1946) proposed subsampling the nonrespondents and using two-phase

sampling (also called double sampling) for stratification to estimate the population mean or total.

The population is divided into two strata, as described in Section 12.1; the two strata are respondents

and initial nonrespondents, persons who do not respond in the first call. We will develop the theory

of two-phase sampling for general survey designs in Section 12.1; here, we illustrate how it can be

used for nonresponse.


In the simplest form of two-phase sampling, randomly select n units in the population. Of these, nR

respond and nM do not respond. The values nR and nM, though, are random variables; they will

change if a different simple random sample (SRS) is selected. Then, make a second call on a

random subsample of 100v% of the nM nonrespondents in the sample, where the subsampling

fraction v does not depend on the data collected.

Suppose that through some superhuman effort all the targeted nonrespondents are reached. Let ȳR be the sample average of the original respondents and ȳM (M stands for missing) be the average of the subsampled nonrespondents. The two-phase sampling estimates of the population mean and total are:

ȳ = (nR/n) ȳR + (nM/n) ȳM        (12.1)

and

t̂ = N ȳ = (N/n) Σi∈SR yi + (N/(nv)) Σi∈SM yi        (12.2)

where SR represents the sampled units in the respondent stratum and SM represents the sampled units

in the nonrespondent stratum. Note that t̂ is a weighted sum of the observed units; the weights are

N/n for the respondents and N/(nv) for the subsampled nonrespondents. Because only a subsample

was taken in the nonrespondent stratum, each subsampled unit in that stratum represents more units

in the population than does a unit in the respondent stratum.

The expected value and variance of these estimators are found in Section 12.1. Because t̂ is an appropriately weighted unequal-probability estimator, Theorem 6.2 implies that it is unbiased for the population total. From (12.5), if the finite population corrections can be ignored, the variance can be estimated from the respondent and subsampled-nonrespondent strata.

If everyone responds in the subsample, two-phase sampling not only removes the nonresponse bias

but also accounts for the original nonresponse in the estimated variance.
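As a quick numerical sketch of (12.1) and (12.2), the following uses invented numbers (the population size, sample size, subsampling fraction, and y-values are all hypothetical, chosen only for illustration):

```python
# Hypothetical SRS of n = 10 units from a population of N = 1000.
# Six units respond on the first call; a subsample of v = 0.5 of the
# four initial nonrespondents is followed up, and all of it responds.
N, n, v = 1000, 10, 0.5
y_R = [12, 15, 11, 14, 13, 15]   # respondents on the first call
y_M = [20, 22]                   # subsampled nonrespondents, now reached
nR = len(y_R)
nM = n - nR                      # 4 initial nonrespondents

ybar_R = sum(y_R) / nR
ybar_M = sum(y_M) / len(y_M)

# Equation (12.1): combine the stratum means with weights nR/n and nM/n
ybar = (nR / n) * ybar_R + (nM / n) * ybar_M

# Equation (12.2): the same estimate as a weighted sum, with weight N/n
# for each respondent and N/(n*v) for each subsampled nonrespondent
t_hat = (N / n) * sum(y_R) + (N / (n * v)) * sum(y_M)

print(ybar)    # 16.4
print(t_hat)   # 16400.0, which equals N * ybar
```

Note that the subsampled nonrespondents pull the estimate well above the respondent mean of about 13.3, exactly the bias correction two-phase sampling is designed to provide.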

Most surveys have some residual nonresponse even after careful design and follow-up of

nonresponse. All methods for fixing up nonresponse are necessarily model-based. If we are to

make any inferences about the nonrespondents, we must assume that they are related to respondents

in some way. A good nontechnical reference for methods of dealing with nonresponse is Groves


(1989); the three-volume set edited by Madow et al. (1983) contains much information on the

statistical research on nonresponse up to that date.

Dividing population members into two fixed strata of would-be respondents and would-be

nonrespondents is fine for thinking about potential nonresponse bias and for two-phase methods. To

adjust for nonresponse that remains after all other measures have been taken, we need a more

elaborate setup, letting the response or nonresponse of unit i be a random variable. Define the random variable Ri = 1 if unit i responds to the survey, and Ri = 0 otherwise.

After sampling, the realizations of the response indicator variable are known for the units selected in

the sample. A value for yi is recorded if ri, the realization of Ri, is 1. The probability that a unit

selected for the sample will respond, Mi = P(Ri = 1), is of course unknown but assumed positive. Rosenbaum and Rubin (1983) call Mi the propensity

score for the ith unit.

Suppose that yi is a response of interest and that xi is a vector of information known about unit i in

the sample. Information used in the survey design is included in xi. We consider three types of

missing data, using the Little and Rubin (1987) terminology of nonresponse classification.

Missing Completely at Random If Mi does not depend on xi, yi, or the survey design, the missing

data are missing completely at random (MCAR). Such a situation occurs if, for example, someone

at the laboratory drops a test tube containing the blood sample of one of the survey participants–there

is no reason to think that the dropping of the test tube had anything to do with the white blood cell

count.2 If data are MCAR, the respondents are representative of the selected sample.

Missing data in the NCVS would be MCAR if the probability of nonresponse is completely unrelated

to region of the United States, race, sex, age, or any other variable measured for the sample and if the

probability of nonresponse is unrelated to any variables about victimization status. Nonrespondents

would be essentially selected at random from the sample.

If the response probabilities Mi are all equal and the events {Ri = 1} are conditionally independent of

each other and of the sample-selection process given nR, then the data are MCAR. If an SRS of size

n is taken, then under this mechanism the respondents will be a simple random subsample of variable

size nR. The sample mean of the respondents, ȳR, is approximately unbiased for the population

mean. The MCAR mechanism is implicitly adopted when nonresponse is ignored.

2 Even here, though, the suspicious mind can create a scenario in which the nonresponse might be related to quantities of interest: perhaps workers are less likely to drop test tubes that they believe contain HIV.


Missing at Random Covariates, or Ignorable Nonresponse If Mi depends on xi but not on yi, the

data are missing at random (MAR); the nonresponse depends only on observed variables. We can

successfully model the nonresponse, since we know the values of xi for all sample units. Persons in

the NCVS would be missing at random if the probability of responding to the survey depends on

race, sex, and age–all known quantities–but does not vary with victimization experience within each

age/race/sex class. This is sometimes termed ignorable nonresponse: ignorable means that a model

can explain the nonresponse mechanism and that the nonresponse can be ignored after the model

accounts for it, not that the nonresponse can be completely ignored and complete-data methods used.

Nonignorable Nonresponse If the probability of nonresponse depends on the value of the response variable and cannot be completely explained by values of the x's, then the nonresponse is

nonignorable. This is likely the situation for the NCVS: it is suspected that a person who has been

victimized by crime is less likely to respond to the survey than a nonvictim, even if they share the

values of all known variables such as race, age, and sex. Crime victims may be more likely to move

after a victimization and thus not be included in subsequent NCVS interviews. Models can help in

this situation, because the nonresponse probability may also depend on known variables, but cannot

completely adjust for the nonresponse.

The probabilities of responding, Mi, are useful for thinking about the type of nonresponse.

Unfortunately, they are unknown, so we do not know for sure which type of nonresponse is present.

We can sometimes distinguish between MCAR and MAR by fitting a model attempting to predict

the observed probabilities of response for subgroups from known covariates; if the coefficients in a

logistic regression model are significantly different from zero, the missing data are likely not MCAR.

Distinguishing between MAR and nonignorable nonresponse is more difficult. In the next section,

we discuss a method for estimating the Mi’s.
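One simple way to approximate the response probabilities (a sketch with invented data, not the procedure of any particular survey) is to use observed response rates within classes formed from covariates known for the whole selected sample; markedly different rates across classes are evidence against MCAR:

```python
from collections import defaultdict

# Hypothetical selected sample: (age class, responded?) for each unit.
sample = [("15-24", 1), ("15-24", 0), ("15-24", 0),
          ("25-34", 1), ("25-34", 1), ("25-34", 0),
          ("65+", 1), ("65+", 1)]

# Estimate the response propensity of every unit in a class by the
# class's observed response rate.
tallies = defaultdict(lambda: [0, 0])     # class -> [responses, total]
for age_class, r in sample:
    tallies[age_class][0] += r
    tallies[age_class][1] += 1

M_hat = {c: resp / total for c, (resp, total) in tallies.items()}
print(M_hat)   # response rates differ sharply by class: evidence against MCAR
```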

In previous chapters we have seen how weights can be used in calculating estimates for various

sampling schemes (see Sections 4.3, 5.4, and 7.2). The sampling weights are the reciprocals of the

probabilities of selection, so an estimate of the population total is t̂ = Σi∈S wi yi.

For stratification, the weights are wi = (Nh / nh) if unit i is in stratum h; for sampling elements with

unequal probabilities, wi = 1/πi.

Weights can also be used to adjust for nonresponse. Let Zi be the indicator variable for presence in

the selected sample, with P(Zi = 1) = πi. If Ri is independent of Zi, then the probability that unit i will be measured is P(Zi Ri = 1) = πi Mi.

The probability of responding, Mi, is estimated for each unit in the sample, using auxiliary

information that is known for all units in the selected sample. The final weight for a respondent is


then wi = 1/(πi M̂i), where M̂i is the estimated response probability. Weighting methods assume that the response probabilities can be estimated from

variables known for all units; they assume MAR data. References for more information on

weighting are Oh and Scheuren (1983) and Holt and Elliot (1991).

Sampling weights wi have been interpreted as the number of units in the population represented by

unit i of the sample. Weighting-class methods extend this approach to compensate for nonsampling

errors: variables known for all units in the selected sample are used to form weighting-adjustment

classes, and it is hoped that respondents and nonrespondents in the same weighting-adjustment class

are similar. Weights of respondents in the weighting-adjustment class are increased so that the

respondents represent the nonrespondents’ share of the population as well as their own.

Example 12.4

Suppose the age is known for every member of the selected sample and that person i in the selected

sample has sampling weight wi = 1/πi. Then weighting classes can be formed by dividing the

selected sample among different age classes, as Table 12.2 shows.

Then the sampling weight for each respondent in class c is multiplied by the weight factor in

Table 12.2. The weight of each respondent with age between 15 and 24, for example, is multiplied

by 1.622. Since there was no nonresponse in the over-65 group, their weights are unchanged.

Table 12.2 Weighting classes formed from age

Age class                   15–24    25–34    35–44    45–64    65+      Total
Sum of weights for sample   30322    33013    27046    29272    30451    150104


The probability of response is assumed to be the same within each weighting class, with the

implication that within a weighting class, the probability of response does not depend on y. As

mentioned earlier, weighting-class methods assume MAR data. The weight for a respondent in

weighting class c is its sampling weight multiplied by the factor (sum of weights for all sampled units in class c) / (sum of weights for respondents in class c).

To estimate the population total using weighting-class adjustments, let xci = 1 if unit i is in class c,

and 0 otherwise. Then let the new weight for respondent i be

w̃i = wi Σc xci (Σj∈S xcj wj) / (Σj∈SR xcj wj),

and w̃i = 0 if unit i is a nonrespondent. Then,

t̂wc = Σi∈SR w̃i yi

and

ȳwc = t̂wc / Σi∈SR w̃i.

In an SRS, for example, if nc is the number of sample units in class c, ncR is the number of respondents in class c, and ȳcR is the average for the respondents in class c, then t̂wc = N Σc (nc/n) ȳcR and ȳwc = Σc (nc/n) ȳcR.
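The weighting-class adjustment can be sketched as follows; the class labels, weights, and response statuses below are hypothetical:

```python
from collections import defaultdict

# (weighting class, sampling weight, responded?) for each sampled unit
units = [("A", 10.0, True), ("A", 10.0, False), ("A", 20.0, True),
         ("B", 15.0, True), ("B", 15.0, True), ("B", 15.0, False)]

# Weight factor for class c:
#   (sum of weights of all sampled units in c) /
#   (sum of weights of respondents in c)
w_all = defaultdict(float)
w_resp = defaultdict(float)
for c, w, responded in units:
    w_all[c] += w
    if responded:
        w_resp[c] += w
factor = {c: w_all[c] / w_resp[c] for c in w_all}

# Respondents absorb the nonrespondents' share; nonrespondents get 0.
new_weights = [w * factor[c] if responded else 0.0
               for c, w, responded in units]

print(factor)            # {'A': 1.333..., 'B': 1.5}
print(sum(new_weights))  # 85.0, same as the sum of the original weights
```

The check at the end illustrates the key property: the adjustment redistributes weight within each class without changing the total weight.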

To adjust for individual nonresponse in the NCVS, the within-household noninterview adjustment

factor (WHHNAF) of Chapter 7 is used. NCVS interviewers gather demographic information on the

nonrespondents, and this information is used to classify all persons into 24 weighting-adjustment

cells. The cells depend on the age of the person, the relation of the person to the reference person

(head of household), and the race of the reference person.

For any cell, let WR be the sum of the weights for the respondents and WM be the sum of the weights

for the nonrespondents. Then the new weight for a respondent in a cell will be the previous weight

multiplied by the weighting-adjustment factor (WR + WM)/WR.


Thus, the weights that would be assigned to nonrespondents are reallocated among respondents with

similar (we hope) characteristics.

A weighting-adjustment factor can be large if a cell contains many more nonrespondents than respondents. In this case, the variance of the estimate increases; if the

number of respondents in the cell is small, the weight may not be stable. The U. S. Census Bureau

collapses cells to obtain weighting-adjustment factors of 2 or less. If there are fewer than 30

interviewed persons in a cell or if the weighting-adjustment factor is greater than 2, the cell is

combined (collapsed) with neighboring cells until the collapsed cell has more than 30 observations

and a weight-adjustment factor of 2 or less.

Weighting-class methods treat the weighting classes as though they were strata; as shown in the next section, weighting adjustment is similar to post-

stratification. The classes should be formed so that units within each class are as similar as possible

with respect to the major variables of interest and so that the response rates vary from class to class.

Little (1986) suggests estimating the response probabilities as a function of the known variables

(perhaps using logistic regression) and grouping observations into classes based on the estimated M̂i. This approach is preferable to simply using the estimated values of M̂i in individual case weights, as the

estimated response probabilities may be extremely variable and might cause the final estimates to be

unstable.

12.5.2 Post-stratification

Post-stratification is similar to weighting-class adjustment, except that population counts are used to

adjust the weights. Suppose an SRS is taken. After the sample is collected, units are grouped into H

different post-strata, usually based on demographic variables such as race or sex. The population has

Nh units in post-stratum h; of these, nh were selected for the sample and nhR responded. The post-stratified estimator of the population mean is

ȳpost = Σh (Nh/N) ȳhR;

the weighting-class estimator, if the weighting classes are the post-strata, is

ȳwc = Σh (nh/n) ȳhR.


The two estimators are similar in form; the only difference is that in post-stratification the Nh are

known, whereas in weighting-class adjustments the Nh are unknown and estimated by N nh/n.

For the post-stratified estimator, often the conditional variance given the nhR is used. For an SRS,

V̂(ȳpost) = Σh (Nh/N)² (1 − nhR/Nh) s²hR/nhR,        (12.3)

as given in Oh and Scheuren (1983). A variance estimator for post-stratification will be given in

Exercise 5 of Chapter 9.

In a general survey design, the sum of the weights in subgroup h is supposed to estimate the

population count Nh for that subgroup. Post-stratification uses the ratio estimator within each

subgroup to adjust by the true population count.

Let xhi = 1 if unit i is in post-stratum h, and 0 otherwise. Then, let

w̃i = wi Σh xhi Nh / (Σj∈SR xhj wj).

Post-stratification can adjust for undercoverage as well as nonresponse if the population count Nh

includes individuals not in the sampling frame for the survey.
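A minimal sketch of the post-stratified mean (the post-strata, the counts Nh, and the y-values are all hypothetical):

```python
# Known population counts for two post-strata, e.g. formed by sex.
N_h = {"F": 520, "M": 480}
# Respondent y-values, grouped by post-stratum.
y_resp = {"F": [3, 5, 4], "M": [7, 6, 8, 7]}

N = sum(N_h.values())
# ybar_post = sum over h of (N_h / N) * (respondent mean in post-stratum h)
ybar_post = sum(N_h[h] / N * sum(ys) / len(ys) for h, ys in y_resp.items())
print(ybar_post)   # 0.52*4 + 0.48*7 = 5.44
```

Because the true Nh replace the sample-based estimates, the respondent means are weighted by each post-stratum's actual share of the population.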


Example 12.6

The second stage factor in the NCVS (see Section 7.6) uses post-stratification to adjust the weights.

After all other weighting adjustments have been done, including the weighting-class adjustments for

nonresponse, post-stratification is used to make the sample counts agree with estimates of the

population counts from the U. S. Census Bureau. Each person is assigned to one of 72 post-strata

based on the person’s age, race, and sex. The number of persons in the population falling in that

post-stratum, Nh, is known from other sources. Then, the weight for a person in post-stratum h is

multiplied by Nh divided by the sum of the weights of the interviewed persons in post-stratum h.

With weighting classes, the weighting factor to adjust for unit nonresponse is always at least 1. With

post-stratification, because weights are adjusted so that they sum to a known population total, the

weighting factor can be any positive number, although weighting factors of 2 or less are desirable.

Post-stratification relies on the assumptions that

(1) within each post-stratum, each unit selected for the sample has the same probability of being a respondent, and

(2) the response or nonresponse of a unit is independent of the behavior of all other units.

In other words, the data are MCAR within each post-stratum. These are big assumptions; to make them seem a

little more plausible, survey researchers often use many post-strata. But a large number of post-strata

may create additional problems, in that few respondents in some post-strata may result in unstable

estimates, and may preclude the application of the central limit theorem. If faced with post-strata

with few observations, most practitioners collapse the post-strata with others that have similar means

in key variables until they have a reasonable number of observations in each post-stratum. For the

Current Population Survey, a “reasonable” number means that each group has at least 20

observations and that the response rate for each group is at least 50%.

Raking is a post-stratification method that can be used when post-strata are formed using more than

one variable, but only the marginal population totals are known.

Raking was first used in the 1940 census to ensure that the complete census data and samples taken

from it gave consistent results and was introduced in Deming and Stephan (1940); Brackstone and

Rao (1976) further developed the theory. Oh and Scheuren (1983) describe raking ratio estimates for

nonresponse.


Consider the following table of sums of weights from a sample; each entry in the table is the sum of

the sampling weights for persons in the sample falling in that classification (for example, the sum of

the sampling weights for black females is 300).

           Black    White    Asian    Native American    Other    Sum of Weights
Female     300      1200     60       30                 30       1620
Male       150      1080     90       30                 30       1380
Sum        450      2280     150      60                 60       3000

Now suppose we know the true population counts for the marginal totals: we know that the

population has 1510 women and 1490 men, 600 blacks, 2120 whites, 150 Asians, 100 Native

Americans, and 30 persons in the “Other” category. The population counts for each cell in the table,

however, are unknown; we do not know the number of black females in this population and cannot

assume independence. Raking allows us to adjust the weights so that the sums of weights in the

margins equal the population counts.

First, adjust the rows. Multiply each entry by (true row population) / (estimated row population).

Multiplying the cells in the female row by 1510/1620 and the cells in the male row by 1490/1380

results in the following table:

           Black    White      Asian    Native American    Other    Sum of Weights
Female     279.63   1118.52    55.93    27.96              27.96    1510.00
Male       161.96   1166.09    97.17    32.39              32.39    1490.00

The row totals are fine now, but the column totals do not yet equal the population totals. Repeat the

same procedure with the columns in the new table. The entries in the first column are each

multiplied by 600/441.59. The following table results:

           Black    White      Asian    Native American    Other    Sum of Weights
Female     379.94   1037.93    54.80    46.33              13.90    1532.90
Male       220.06   1082.07    95.21    53.67              16.10    1467.10

But this has thrown the row totals off again. Repeat the procedure until both row and column totals

equal the population counts. The procedure converges as long as all cell counts are positive. In this

example, the final table of adjusted counts is


           Black    White      Asian    Native American    Other    Sum of Weights
Female     375.59   1021.47    53.72    45.56              13.67    1510
Male       224.41   1098.53    96.28    54.44              16.33    1490
Sum        600      2120       150      100                30       3000

The entries in the last table may be better estimates of the cell populations (that is, with smaller

variance) than the original weighted estimates, simply because they use more information about the

population. The weighting-adjustment factor for each white male in the sample is 1098.53/1080; the

weight of each white male is increased a little to adjust for nonresponse and undercoverage.

Likewise, the weights of white females are decreased because they are overrepresented in the sample.
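The alternating row and column scaling described above is iterative proportional fitting; a short sketch that reproduces the example's tables:

```python
# Starting table of sums of weights (rows: Female, Male) and the known
# population margins from the example.
table = [[300.0, 1200.0, 60.0, 30.0, 30.0],    # Female
         [150.0, 1080.0, 90.0, 30.0, 30.0]]    # Male
row_targets = [1510.0, 1490.0]
col_targets = [600.0, 2120.0, 150.0, 100.0, 30.0]

for _ in range(100):   # alternate row and column scaling until stable
    # scale each row so it sums to its known margin
    for i, target in enumerate(row_targets):
        scale = target / sum(table[i])
        table[i] = [x * scale for x in table[i]]
    # scale each column so it sums to its known margin
    for j, target in enumerate(col_targets):
        scale = target / sum(row[j] for row in table)
        for row in table:
            row[j] *= scale

print([round(x, 2) for x in table[1]])
# approximately [224.41, 1098.53, 96.28, 54.44, 16.33], as in the final table
```

One hundred passes is far more than needed here; in a 2 × 5 table with all cells positive, the iteration converges after only a few sweeps.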

The assumptions for raking are the same as for post-stratification, with the additional assumption

that the response probabilities depend only on the row and column and not on the particular cell. If

the sample sizes in each cell are large enough, the raking estimator is approximately unbiased.

Raking has some difficulties–the algorithm may not converge if some of the cell estimates are zero.

There is also a danger of “overadjustment”–if there is little relation between the extra dimension in

raking and the cell means, raking can increase the variance rather than decrease it.

Some weighting-class methods use weights that are the reciprocal of the estimated probability of

response. A famous example is the Politz-Simmons method for adjusting for nonavailability of

sample members.

Suppose all calls are made during Monday through Friday evenings. Each nonrespondent is asked

whether he or she was at home, at the time of the interview, on each of the four preceding

weeknights. The respondent replies that she was home k of the four nights. It is then assumed that

the probability of response is proportional to the number of nights at home during interviewing

hours, so the probability of response is estimated by (ki + 1)/5. The sampling weight wi for

each respondent is then multiplied by 5/(ki + 1). The respondents with k = 0 were home on only one

of the five nights and are assigned to represent their share of the population plus the share of four

persons in the sample who were called on one of their “unavailable” nights. The respondents most

likely to be home have k = 4; it is presumed that all persons in the sample who were home every

night were reached, so their weights are unchanged. The estimate of the population mean is the weighted ratio ȳ = Σi∈SR [5wi/(ki + 1)] yi / Σi∈SR [5wi/(ki + 1)].

This method of weighting–described by Hartley (1946) and Politz and Simmons (1949)–is based on


the premise that the most accessible persons will tend to be overrepresented in the survey data. The

method is easy to use, theoretically appealing, and can be used in conjunction with callbacks. But it

still misses people who were not at home on any of the five nights or who refused to participate in

the survey. Because nonresponse is due largely to refusals in some telephone surveys, the Politz-

Simmons method may not be helpful in dealing with all nonresponse. Values of k may also be in

error, because people may err when recalling how many evenings they were home.
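A sketch of the Politz-Simmons estimate with invented respondent data (the y-values, nights-at-home counts ki, and base weights are hypothetical):

```python
# (y value, k = nights at home on the previous four evenings, base weight)
respondents = [(10.0, 4, 1.0), (12.0, 2, 1.0), (20.0, 0, 1.0)]

# Each respondent's weight is multiplied by 5/(k + 1), the reciprocal of
# the estimated response probability (k + 1)/5.
num = sum(w * 5 / (k + 1) * y for y, k, w in respondents)
den = sum(w * 5 / (k + 1) for y, k, w in respondents)
ybar_ps = num / den

print(round(ybar_ps, 4))   # 16.9565, versus an unweighted mean of 14.0
```

The rarely-at-home respondent (k = 0) gets five times the weight of the always-at-home respondent, pulling the estimate toward the hard-to-reach part of the population.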

Potthoff et al. (1993) modified and extended the Politz-Simmons method to determine weights based

on the number of callbacks needed, assuming that the Mi's follow a beta distribution.

The models for weighting adjustments for nonresponse are strong: in each weighting cell, the

respondents and nonrespondents are assumed to be similar. Each individual in a weighting class is

assumed equally likely to respond to the survey, regardless of the value of the response. These

models never exactly describe the true state of affairs, and you should always consider their

plausibility and implications. It is an unfortunate tendency of many survey practitioners to treat the

weighting adjustment as a complete remedy and then to act as though there were no nonresponse.

Weights may improve many of the estimates, but they rarely eliminate all nonresponse bias. If

weighting adjustments are made (and remember, making no adjustments is itself a model about the

nature of the nonresponse), practitioners should always state the assumed response model and give

evidence to justify it. Weighting adjustments are usually used for unit nonresponse, not for item

nonresponse (which would require a different weight for each item).

12.6 Imputation

Missing items may occur in surveys for several reasons: an interviewer may fail to ask a question; a

respondent may refuse to answer the question or cannot provide the information; a clerk entering the

data may skip the value. Sometimes, items with responses are changed to missing when the data set

is edited or cleaned–a data editor may not be able to resolve the discrepancies for an individual 3-year-old who voted in the last election and may set both values to missing.

Imputation is commonly used to assign values to the missing items. A replacement value, often

from another person in the survey who is similar to the item nonrespondent on other variables, is

imputed for the missing value. When imputation is used, an additional variable that indicates

whether the response was measured or imputed should be created for the data set.

Imputation procedures are used not only to reduce the nonresponse bias but to produce a “clean,”

rectangular data set–one without holes for the missing values. We may want to look at tables for

subgroups of the population, and imputation allows us to do that without considering the item

nonresponse separately each time we construct a table. Some references for imputation include

Sande (1983) and Kalton and Kasprzyk (1982; 1986).

Example 12.7

The Current Population Survey (CPS) has an overall high household response rate (typically well


above 90%), but some households refuse to answer certain questions. The nonresponse rate is about

20% on many income questions. This nonresponse would create a substantial bias in any analysis

unless some corrective action were taken: various studies suggest that the item nonresponse for the

income items is highest for low-income and high-income households. Imputation for the missing

data makes it possible to use standard statistical techniques such as regression without the analyst

having to treat the nonresponse by using specially developed methods. For surveys such as the CPS,

if imputation is to be done, the agency collecting the data has more information to guide it in filling

the missing values than does an independent analyst, because identifying information is not released

on the public-use tapes.

The CPS uses weighting for noninterview adjustment and hot-deck imputation for item nonresponse.

The sample is divided into classes using variables sex, age, race, and other demographic

characteristics. If an item is missing, a corresponding item from another unit in that class is

substituted. Usually, hot-deck imputation is done by taking the value of the missing item from a

household that is similar to the household with the missing item in some other explanatory variable

such as family size.

We use the small data set in Table 12.3 to illustrate some of the different methods for imputation.

This artificial data set is only used for illustration; in practice, a much larger data set is needed for

imputation. A “1” means the respondent answered yes to the question.

Table 12.3 Small artificial data set used to illustrate imputation methods (“?” denotes a missing value)

Years of Education    Victim of Crime    Victim of Violent Crime
?                     1                  1
11                    0                  0
?                     1                  1
12                    1                  1
?                     0                  0
20                    1                  ?
12                    0                  0
13                    0                  ?
10                    ?                  ?
12                    0                  0
12                    0                  0
11                    1                  ?
16                    1                  0
14                    0                  0
11                    0                  0
14                    0                  0
10                    0                  0
12                    ?                  0
10                    0                  0


12.6.1 Deductive Imputation

Some values may be imputed in the data editing, using logical relations among the variables. In

Table 12.3, person 9 is missing the response for whether she was a victim of violent crime. But she

had responded that she was not a victim of any crime, so the violent-crime response should be

changed to 0.

Deductive imputation may sometimes be used in longitudinal surveys. If a woman has two children

in year 1 and two children in year 3, but is missing the value for year 2, the logical value to impute

would be 2.

12.6.2 Cell Mean Imputation

Respondents are divided into classes (cells) based on known variables, as in weighting-class

adjustments. Then the average of the values for the responding units in cell c, ȳcR, is substituted

for each missing value. Cell mean imputation assumes that missing items are missing completely at

random within the cells.

Example 12.8

The four cells for our example are constructed using the variables age and sex. (In practice, of

course, you would want to have many more individuals in each cell.)

        Age < 35                       Age ≥ 35
M       Persons 3, 5, 10, 14           Persons 1, 7, 8, 15, 16
F       Persons 4, 12, 13, 19, 20      Persons 2, 6, 9, 11, 17, 18

Persons 2 and 6, missing the value for years of education, would be assigned the mean value for the

four women aged 35 or older who responded to the question: 12.25. The mean for each cell after

imputation is the same as the mean of the respondents. The imputed value, however, is not one of

the possible responses to the question about education.
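Cell mean imputation can be sketched as below; the education values are invented, but chosen so that the respondent mean in the older-women cell matches the 12.25 of the example (None marks a missing item):

```python
# Education responses by imputation cell; None = item nonresponse.
cells = {"F, 35+": [12, None, 13, None, 12, 12],   # two missing values
         "M, 35+": [16, 12, 13, 14, 10]}

imputed = {}
for cell, values in cells.items():
    observed = [v for v in values if v is not None]
    cell_mean = sum(observed) / len(observed)      # respondent mean in cell
    imputed[cell] = [cell_mean if v is None else v for v in values]

print(imputed["F, 35+"])   # both missing values become 12.25
```

As the text notes, every missing item in a cell receives the identical value, which is what creates the “spike” in the imputed distribution.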

Mean imputation gives the same point estimates for means, totals, and proportions as the weighting-

class adjustments. Mean imputation methods fail to reflect the variability of the nonrespondents,

however–all missing observations in a class are given the same imputed value. The distribution of y

will be distorted because of a “spike” at the value of the sample mean of the respondents. As a

consequence, the estimated variance in the subclass will be too small.

To avoid the spike, a stochastic cell mean imputation could be used. If the response variable were

approximately normally distributed, the missing values could be imputed with a randomly generated


value from a normal distribution with mean ȳcR and standard deviation scR.

Mean imputation, stochastic or otherwise, distorts relationships among different variables because

imputation is done separately for each missing item. Sample correlations and other statistics are

changed. Jinn and Sedransk (1989a; 1989b) discuss the effect of different imputation methods on

secondary data analysis–for instance, for estimating a regression slope.

12.6.3 Hot-Deck Imputation

In hot-deck imputation, as in cell mean imputation and weighting-adjustment methods, the sample

units are divided into classes. The value of one of the responding units in the class is substituted for

each missing response. Often, the values for a set of related missing items are taken from the same

donor, to preserve some of the multivariate relationships. The name hot deck is from the days when

computer programs and data sets were punched on cards–the deck of cards containing the data set

being analyzed was warmed by the card reader, so the term hot deck was used to refer to imputations

made using the same data set. Fellegi and Holt (1976) discuss methods for data editing and hot-deck

imputation with large surveys.

Sequential Hot-Deck Imputation Some hot-deck imputation procedures impute the value in the

same subgroup that was last read by the computer. This is partly a carryover from the card days of

computers (imputation could be done in one pass) and partly a belief that, if the data are arranged in

some geographic order, adjacent units in the same subgroup will tend to be more similar than

randomly chosen units in the subgroup. One problem with using the value on the previous “card” is

that often nonrespondents also tend to occur in clusters, so one person may be a donor multiple

times, in a way that the sampler cannot control. One of the other hot-deck imputation methods is

usually used today for most surveys.

In our example, person 19 is missing the response for crime victimization. Person 13 had the last

response recorded in her subclass, so the value 1 is imputed.

Random Hot-Deck Imputation A donor is randomly chosen from the persons in the cell with

information on all missing items. To preserve multivariate relationships, usually values from the

same donor are used for all missing items of a person.

In our small data set, person 10 is missing both variables for victimization. Persons 3, 5, and 14 in

his cell have responses for both crime questions, so one of the three is chosen randomly as the donor.

In this case, person 14 is chosen, and his values are imputed for both missing variables.
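A minimal sketch of random hot-deck imputation follows; the records and field names are hypothetical stand-ins for the example, and the donor is drawn with Python's `random` module:

```python
import random

# Random hot-deck imputation: a donor is chosen at random from the complete
# respondents in the recipient's cell, and all of the recipient's missing
# items are taken from that one donor so that multivariate relationships
# are preserved.

def random_hot_deck(records, cell_key, item_keys, rng=random):
    # donors are the records complete on every item, grouped by cell
    donors = {}
    for rec in records:
        if all(rec[k] is not None for k in item_keys):
            donors.setdefault(rec[cell_key], []).append(rec)
    out = []
    for rec in records:
        rec = dict(rec)
        missing = [k for k in item_keys if rec[k] is None]
        if missing:
            donor = rng.choice(donors[rec[cell_key]])
            for k in missing:  # one donor supplies all imputed items
                rec[k] = donor[k]
        out.append(rec)
    return out

# Persons 3, 5, and 14 are complete in person 10's cell; one of the three is
# randomly chosen as the donor for both of person 10's missing items.
people = [
    {"id": 3, "cell": "C", "violent": 0, "property": 1},
    {"id": 5, "cell": "C", "violent": 0, "property": 0},
    {"id": 14, "cell": "C", "violent": 1, "property": 1},
    {"id": 10, "cell": "C", "violent": None, "property": None},
]
result = random_hot_deck(people, "cell", ["violent", "property"],
                         rng=random.Random(4))
```

Taking both items from a single donor is the design choice that keeps the imputed pair jointly plausible.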

Nearest-Neighbor Hot-Deck Imputation Define a distance measure between observations, and impute the value of a respondent who is “closest” to the person with the missing item, where closeness is defined using the distance function.

If age and sex are used for the distance function, so that the person of closest age with the same sex is selected to be the donor, the victimization responses of person 3 will be imputed for person 10.

Regression imputation predicts the missing value by using a regression of the item of interest on

variables observed for all cases. A variation is stochastic regression imputation, in which the

missing value is replaced by the predicted value from the regression model, plus a randomly

generated error term.

We only have 18 complete observations for the response crime victimization (not really enough for

fitting a model to our data set), but a logistic regression of the response with explanatory variable age

gives a fitted model for the predicted probability of victimization as a function of age.

The predicted probability of being a crime victim for a 17-year-old is 0.74; because that is greater

than a predetermined cutoff of 0.5, the value 1 is imputed for person 10.
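The prediction-and-threshold step can be sketched as below. The fitted equation itself is not reproduced in the text, so the coefficients `b0` and `b1` here are hypothetical, chosen only so that a 17-year-old's predicted probability is about 0.74 as in the example:

```python
import math
import random

# Regression imputation for a binary item: impute 1 when the predicted
# probability from a fitted logistic model exceeds a predetermined cutoff.

def logistic_impute(age, b0, b1, cutoff=0.5):
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))  # predicted P(victim = 1)
    return (1 if p >= cutoff else 0), p

def stochastic_logistic_impute(age, b0, b1, rng=random):
    # Stochastic regression imputation: instead of thresholding, impute 1
    # with probability p, so the imputed values carry random variation.
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * age)))
    return (1 if rng.random() < p else 0), p

value, prob = logistic_impute(17, b0=2.75, b1=-0.10)
```

The stochastic variant matters when the imputed data will be used to estimate variability: deterministic predictions cluster on the regression surface and understate spread.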

Example 12.9

Paulin and Ferraro (1994) discuss regression models for imputing income in the U. S. Consumer

Expenditure Survey. Households selected for the interview component of the survey are interviewed

each quarter for five consecutive quarters; in each interview, they are asked to recall expenditures for

the previous 3 months. The data are used to relate consumer expenditures to characteristics such as

family size and income; they are the source of reports that expenditures exceed income in certain

income classes.

The Consumer Expenditure Survey conducts about 5000 interviews each year, as opposed to about

60,000 for the NCVS. This sample size is too small for hot-deck imputation methods, as it is less

likely that suitable donors will be found for nonrespondents in a smaller sample. If imputation is to

be done at all, a parametric model needs to be adopted. Paulin and Ferraro used multiple regression

models to predict the log of family income (logarithms are used because the distribution of income is

skewed) from explanatory variables including total expenditures and demographic variables. These

models assume that income items are MAR, given the covariates.

In cold-deck imputation, the imputed values are from a previous survey or other information, such as

from historical data. (Since the data set serving as the source for the imputation is not the one

currently running through the computer, the deck is “cold.”) Little theory exists for the method. As

with hot-deck imputation, cold-deck imputation is not guaranteed to eliminate selection bias.


12.6.6. Substitution

Substitution methods are similar to cold-deck imputation. Sometimes interviewers are allowed to

choose a substitute while in the field; if the household selected for the sample is not at home, they try

next door. Substitution may help reduce some nonresponse bias, as the household next door may be

more similar to the nonresponding household than would be a household selected at random from the

population. But the household next door is still a respondent; if the nonresponse is related to the

characteristics of interest, there will still be nonresponse bias. An additional problem is that, since

the interviewer is given discretion about which household to choose, the sample no longer has

known probabilities of selection.

The 1975 Michigan Survey of Substance Abuse was taken to estimate the number of persons that

used 16 types of substances in the previous year. The sample design was a stratified multistage

sample with 2100 households. Three calls were made at a dwelling; then the house to the right was

tried, then the house to the left. From the data, evidence shows that the substance-use rate increases

as the required number of calls increases.

Some surveys select designated substitutes at the same time the sample units are selected. If a unit

does not respond, then one of the designated substitutes is randomly selected. The National

Longitudinal Study (see National Center for Education Statistics 1977) used this method. This

stratified, multistage sample of the high school graduating class of 1972 was intended to provide data

on the educational experiences, plans, and attitudes of high school seniors. Four high schools were

randomly selected from each of 600 strata. Two were designated for the sample, and the other two

were saved as backups in case of nonresponse. Of the 1200 schools designated for the sample, 948

participated, 21 had no graduating seniors, and 231 either refused or were unable to participate.

Investigators chose 122 schools from the backup group to substitute for the nonresponding schools.

Follow-up studies showed a consistent 5% bias in a number of estimated totals, which was attributed

to the use of substitute schools and to nonresponse.

Substitution has the added danger that efforts to contact the designated units may not be as great as if

no “easy way out” was provided. If substitution is used, it should be reported in the results.

In multiple imputation, each missing value is imputed m (≥ 2) different times. Typically, the same stochastic model is used for each imputation. This creates m different “data” sets with no missing

values. Each of the m data sets is analyzed as if no imputation had been done; the different results

give the analyst a measure of the additional variance due to the imputation. Multiple imputation

with different models for nonresponse can give an idea of the sensitivity of the results to particular

nonresponse models. See Rubin (1987; 1996) for details on implementing multiple imputation.

Imputation creates a “clean,” rectangular data set that can be analyzed by standard software.

Analyses of different subsets of the data will produce consistent results. If the nonresponse is

missing at random given the covariates used in the imputation procedure, imputation substantially


reduces the bias due to item nonresponse. If parts of the data are confidential, the data collector can

perform the imputation. The data collector has more information about the sample and population

than is released to the public (for example, the collector may know the exact address for each sample

member) and can often perform a better imputation using that information.

The foremost danger of using imputation is that future data analysis will not distinguish between the

original and the imputed values. Ideally, the imputer should record which observations are imputed,

how many times each nonimputed record is used as a donor, and which donor was used for a specific

response imputed to a recipient. The imputed values may be good guesses, but they are not real data.

Variances computed using the data together with the imputed values are always too small, partly

because of the artificial increase in the sample size and partly because the imputed values are treated

as though they were really obtained in the data collection. The true variance will be larger than that

estimated from a standard software package. Rao (1996) and Fay (1996) discuss methods for

estimating the variances after imputation.
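A small simulation with hypothetical data illustrates why naive variances computed after imputation are too small: mean imputation fills the holes with a constant, which deflates the sample standard deviation, while standard software still divides by the artificially inflated sample size.

```python
import random
from statistics import mean, stdev

rng = random.Random(1)
complete = [rng.gauss(50, 10) for _ in range(200)]  # what we wanted to observe
observed = complete[:120]                           # 80 responses missing (MCAR)
filled = observed + [mean(observed)] * 80           # mean-imputed "data" set

naive_var = stdev(filled) ** 2 / len(filled)        # what software would report
honest_var = stdev(observed) ** 2 / len(observed)   # variance from respondents only
# naive_var is always smaller: the 80 constants add nothing to the sum of
# squares, yet n has been inflated from 120 to 200.
```

Even this honest respondent-only variance ignores the imputation model's uncertainty; methods such as those of Rao (1996) and Fay (1996), or multiple imputation, address that remaining gap.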

Most of the methods for dealing with nonresponse assume that the nonresponse is ignorable–that is,

conditionally on measured covariates, nonresponse is independent of the variables of interest. In this

situation, rather than simply dividing units among different subclasses and adjusting weights, one

can fit a superpopulation model. From the model, then, one predicts the values of the y’s not in the

sample. The model fitting is often iterative.

In a completely model-based approach, we develop a model for the complete data and add

components to the model to account for the proposed nonresponse mechanism. Such an approach

has many advantages over other methods: the modeling approach is flexible and can be used to

include any knowledge about the nonresponse mechanism, the modeler is forced to state the

assumptions about nonresponse explicitly in the model, and some of these assumptions can be

evaluated. In addition, variance estimates that result from fitting the model account for the

nonresponse, if the model is a good one.

Example 12.10

Many people believe that spotted owls in Washington, Oregon, and California are threatened with

extinction because timber harvesting in mature coniferous forests reduces their available habitat.

Good estimates of the size of the spotted owl population are needed for reasoned debate on the issue.

In the sampling plan described by Azuma et al. (1990), a region of interest is divided into N

sampling regions (PSU’s), and an SRS of n PSU’s is selected. Let Yi = 1 if PSU i is occupied by an owl pair and Yi = 0 otherwise.

Assume that the Yi’s are independent and that P(Yi = 1) = p, the true proportion of occupied PSU’s.


If occupancy could be definitively determined for each PSU, the proportion of PSU’s occupied could

be estimated by the sample proportion of PSU’s found to be occupied. While a fixed number of visits can establish that a PSU is occupied, however, a determination that a PSU is unoccupied may be wrong–some owl pairs are

“nonrespondents,” and ignoring the nonresponse will likely result in a too-low estimate of percentage

occupancy.

Azuma et al. (1990) propose using a geometric distribution for the number of visits required to

discover the owls in an occupied unit, thus modeling the nonresponse. The assumptions for the

model are:

(1) the probability of determining occupancy on the first visit, θ, is the same for all PSU’s, and (2) successive visits to a PSU are independent, each with the same probability θ of determining occupancy.

A geometric distribution is commonly used for number of callbacks needed in surveys of people (see

Potthoff et al. 1993).

Let Xi be the number of visits required to determine whether PSU i is occupied or not. Under the geometric model, P(Xi = k) = θ(1 − θ)^(k−1) for k = 1, 2, . . . .

The budget of the U. S. Forest Service, however, does not allow for an infinite number of visits.

Suppose a maximum of s visits are to be made to each PSU. The random variable Yi cannot be observed directly; the observable random variables are Ri, the indicator that PSU i is determined to be occupied on one of the s visits, and, for those PSU’s, the visit Xi on which occupancy was first determined.

Here, u = ΣRi counts the number of PSU’s observed to be occupied, and T = ΣRiXi counts the total number of visits made to occupied units. Using the geometric model, the probability that an owl is first observed in PSU i on visit k (k ≤ s) is

pθ(1 − θ)^(k−1),

and the probability that an owl is observed on one of the s visits to PSU i is p[1 − (1 − θ)^s].


Thus, the expected value of the sample proportion of occupied units is p[1 − (1 − θ)^s], which is less than the proportion of interest p if θ < 1. The geometric model agrees with the intuition that some owls are missed in the s visits.

We find the maximum likelihood estimates of p and θ under the assumption that all PSU’s are independent. The likelihood function has no closed-form maximum; numerical methods are needed to calculate the estimates. Maximum likelihood theory also allows calculation of the asymptotic covariance matrix of the parameter estimates.

[Table: counts of PSU’s by the visit number (1–6) on which occupancy was first determined; the observed counts appear in the goodness-of-fit table below.]

The average number of visits made to occupied units, together with these counts, yields the maximum likelihood estimates; in particular, p̂ = 0.370. Using the asymptotic covariance matrix from maximum likelihood theory, we estimate the variance of p̂ by 0.00137. Thus, an approximate 95% confidence interval for the proportion of units that are occupied is 0.370 ± 0.072.
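The fit can be sketched numerically. With u PSU’s observed to be occupied, T total visits made to those PSU’s, n PSU’s sampled, and at most s visits per PSU, the log likelihood below follows the model in the text; the crude grid-refinement optimizer stands in for a proper numerical routine, and the simulated data (true p and θ chosen by us) are hypothetical:

```python
import math
import random

def log_lik(p, theta, u, T, n, s):
    # u detected PSU's contribute p * theta * (1-theta)^(x_i - 1) each;
    # the remaining n - u PSU's were not seen to be occupied in s visits.
    p_miss = 1.0 - p * (1.0 - (1.0 - theta) ** s)
    return (u * (math.log(p) + math.log(theta))
            + (T - u) * math.log(1.0 - theta)
            + (n - u) * math.log(p_miss))

def mle(u, T, n, s):
    best, step = (0.5, 0.5), 0.25
    for _ in range(60):  # repeatedly search a shrinking 3x3 grid
        p0, t0 = best
        grid = [(min(max(p0 + i * step, 1e-6), 1 - 1e-6),
                 min(max(t0 + j * step, 1e-6), 1 - 1e-6))
                for i in (-1, 0, 1) for j in (-1, 0, 1)]
        best = max(grid, key=lambda g: log_lik(g[0], g[1], u, T, n, s))
        step *= 0.8
    return best

# Simulate n PSU's from known p and theta, then recover them.
rng = random.Random(7)
n, s, p_true, theta_true = 2000, 6, 0.4, 0.5
u = T = 0
for _ in range(n):
    if rng.random() < p_true:              # PSU is truly occupied
        visit = 1
        while visit <= s and rng.random() >= theta_true:
            visit += 1                      # owl not found; try again
        if visit <= s:                      # detected within s visits
            u += 1
            T += visit
p_hat, theta_hat = mle(u, T, n, s)
```

With a large simulated sample the recovered estimates land close to the true p and θ, which is a useful sanity check on any likelihood code before applying it to real field data.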

Incorporating the geometric model for number of visits gave a larger estimate of the proportion of


units occupied. If the model does not describe the data, however, the estimate will still be biased; if the model is poor, the model-based estimate may be worse than the simple sample proportion. If, for example, field

investigators were more likely to find owls on later visits because they accumulate additional

information on where to look, the geometric model would be inappropriate.

We need to check whether the geometric model adequately describes the number of visits needed to

determine occupancy. Unfortunately, we cannot determine whether the model would describe the

situation for units in which owls are not detected in six visits, as the data are missing. We can,

however, use a goodness-of-fit test to see whether data from the six visits made are fit by the

model. Under the model, we expect a proportion pθ(1 − θ)^(k−1) of the PSU’s to have owls first observed on visit k, and we plug in our estimates of p and θ to calculate the expected counts:

Visit    Observed    Expected
1        33          29.66
2        17          19.74
3        12          13.14
4         7           8.75
5–6      12           9.71
Total    81          80.99

Visits 5 and 6 were combined into one category so that the expected cell count would be greater than

5. The test statistic is 1.75, with p-value > 0.05. There is no indication that the model is

inadequate for the data we have. We cannot check its adequacy for the missing data, however. The
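The test statistic can be reproduced directly from the observed and expected counts in the table (visits 5 and 6 combined so that every expected count exceeds 5):

```python
# Chi-square goodness-of-fit statistic: sum of (observed - expected)^2 / expected
observed = [33, 17, 12, 7, 12]
expected = [29.66, 19.74, 13.14, 8.75, 9.71]
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))  # ≈ 1.75
```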

geometric model assumes observations are independent and that an occupied PSU would eventually

be determined to be occupied if enough visits were made. We cannot check whether that assumption

of the model is reasonable or not: if some wily owls will never be detected in any number of visits, the estimated occupancy rate will still be too small.

Commonly, maximum likelihood methods are used to estimate parameters, and the likelihood

equations rarely have closed-form solutions. Calculation of estimates required numerical methods

even for the simple model adopted for the owls, and that was an SRS with a simple geometric model

for the response mechanism that allowed us to write down the likelihood function easily. Likelihood


functions for more complex sampling designs or nonresponse mechanisms are much more difficult

to construct (particularly if observations in the same cluster are considered dependent), and

calculating estimates often requires intensive computations. Little and Rubin (1987) discuss

likelihood-based methods for missing data in general. Stasny (1991) gives an example of using

models to account for nonresponse.

Often an investigator will say, “I expect to get a 60% response rate in my survey. Is that acceptable

and will the survey give me valid results?” As we have seen in this chapter, the answer to that

question depends on the nature of the nonresponse: if the nonrespondents are MCAR, then we can

largely ignore the nonresponse and use the respondents as a representative sample of the population.

If the nonrespondents tend to differ from the respondents, then the biases in the results from using

only the respondents may make the entire survey worthless.

Many references give advice on cutoffs for acceptability of response rates. Babbie, for example,

says: “I feel that a response rate of at least 50 percent is adequate for analysis and reporting. A

response of at least 60 percent is good. And a response rate of 70 percent is very good” (1973, 165).

I believe that giving such absolute guidelines for acceptable response rates is dangerous and has led

many survey investigators to unfounded complacency about nonresponse; many examples exist of

surveys with a 70% response rate whose results are flawed. The NCVS needs corrections for

nonresponse bias even with a response rate of about 95%.

Be aware that response rates can be manipulated by defining them differently. Researchers often do

not say how the response rate was calculated or may use an estimate of response rate that is smaller

than it should be. Many surveys inflate the response rate by eliminating units that could not be

located from the denominator. Very different results for response rate accrue, depending on which

definition of response rate is used; all of the following have been used in surveys:

• (number of completed interviews) / (number of units in sample)

• (number of completed interviews) / (number of units contacted)

• (number of completed interviews) / (number of contacted units − ineligible units)

• (number of completed interviews) / (number of contacted units − ineligible units − refusals)

Note that a “response rate” calculated using the last formula will be much higher than one calculated

using the first formula because the denominator is smaller.
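Computing the competing definitions for one hypothetical set of survey counts shows how far apart they can be:

```python
# Hypothetical counts for a single survey
n_sample = 1000      # units selected for the sample
n_contacted = 800
n_ineligible = 50    # contacted units found to be out of scope
n_refusals = 150
n_complete = 500     # completed interviews

rate1 = n_complete / n_sample                                   # 0.500
rate2 = n_complete / n_contacted                                # 0.625
rate3 = n_complete / (n_contacted - n_ineligible)               # ~0.667
rate4 = n_complete / (n_contacted - n_ineligible - n_refusals)  # ~0.833
```

The same survey can thus report a “response rate” anywhere from 50% to 83% depending only on the denominator chosen, which is why the definition used should always be stated.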

The guidelines for reporting response rates in Statistics Canada (1993) and Hidiroglou et al. (1993)

provide a sensible solution for reporting response rates. They define in-scope units as those that

belong to the target population, and resolved units as those units for which it is known whether or

not they belong to the target population.3 They suggest reporting a number of different response

rates for a survey including the following:

• Out-of-scope rate: the ratio of the number of out-of-scope units to the number of resolved units

• No-contact rate: the ratio of the number of no-contacts and unresolved units to the number of in-scope and unresolved units

• Refusal rate: the ratio of the number of refusals to the number of in-scope units

• Nonresponse rate: the ratio of the number of nonrespondent and unresolved units to the number of in-scope and unresolved units

Different measures of response rates may be appropriate for different surveys, and I hesitate to

recommend one “one-size-fits-all” definition of response rate. The quantities used in calculating response

rate, however, should be defined for every survey. The following recommendations from the U. S.

Office of Management and Budget’s Federal Committee on Statistical Methodology, reported in

Gonzales et al. (1994), are helpful:

Recommendation 1. Survey staffs should compute response rates in a uniform fashion over time and

document response rate components on each edition of a survey.

Recommendation 2. Survey staffs for repeated surveys should monitor response rate components

(such as refusals, not-at-homes, out-of-scopes, address not locatable, post-master returns, etc.) over

time, in conjunction with routine documentation of cost and design changes.

3. If, for example, the target population is residential telephone numbers, it may be impossible to tell whether or not a telephone that rings but is not answered belongs to the target population; such a number would be an unresolved unit.


Recommendation 3. Response rate components should be published in survey reports; readers should be given definitions of the response rates used, including actual counts, and commentary on the relevance of response rates to the quality of the survey data.

Recommendation 4. Some research on nonresponse can have real payoffs. It should be encouraged

by survey administrators as a way to improve the effectiveness of data collection operations.


Annex 1

Many surveys have more than one of these problems. The Literary Digest (1932, 1936a, b, c) began

taking polls to forecast the outcome of the U. S. presidential election in 1912, and their polls attained

a reputation for accuracy because they forecast the correct winner in every election between 1912

and 1932. In 1932, for example, the poll predicted that Roosevelt would receive 56% of the popular

vote and 474 votes in the electoral college; in the actual election, Roosevelt received 58% of the

popular vote and 472 votes in the electoral college.

With such a strong record of accuracy, it is not surprising that the editors of the Literary Digest had a

great deal of confidence in their polling methods by 1936. Launching the 1936 poll, they said:

The Poll represents thirty years’ constant evolution and perfection. Based on the “commercial sampling” methods used for more than a century by publishing houses to push book sales, the present mailing list is drawn from every telephone book in the United States, from the rosters of clubs and associations, from city directories, lists of registered voters, classified mail-order and occupational data. (1936a, 3)

On October 31, the poll predicted that Republican Alf Landon would receive 55% of the popular

vote, compared with 41% for President Roosevelt. The article “Landon, 1,293,669; Roosevelt,

972,897: Final Returns in The Digest’s Poll of Ten Million Voters” contained the statement: “We

make no claim to infallibility. We did not coin the phrase ‘uncanny accuracy’ which has been so

freely applied to our Polls” (1936b). It is a good thing they made no claim to infallibility; in the

election, Roosevelt received 61% of the vote; Landon, 37%.

What went wrong? One problem may have been the undercoverage in the sampling frame, which

relied heavily on telephone directories and automobile registration lists–the frame was used for

advertising purposes, as well as for the poll. Households with a telephone or automobile in 1936

were generally more affluent than other households, and opinion of Roosevelt’s economic policies

was generally related to the economic class of the respondent. But sampling frame bias does not

explain all the discrepancy. Postmortem analyses of the poll by Squire (1988) and Cahalan (1989)

indicate that even persons with both a car and a telephone tended to favor Roosevelt, though not to

the degree that persons with neither car nor telephone supported him.

The low response rate to the survey was likely the source of much of the error. Ten million

questionnaires were mailed out, and 2.3 million were returned–an enormous sample but a response

rate of less than 25%. In Allentown, Pennsylvania, for example, the survey was mailed to every

registered voter, but the survey results for Allentown were still incorrect because only one-third of

the ballots were returned. Squire (1988) reports that persons supporting Landon were much more

likely to have returned the survey; in fact, many Roosevelt supporters did not even remember

receiving a survey, even though they were on the mailing list.

One lesson to be learned from the Literary Digest poll is that the sheer size of a sample is no

guarantee of its accuracy. The Digest editors became complacent because they sent out

questionnaires to more than one quarter of all registered voters and obtained a huge sample of 2.3

million people. But large unrepresentative samples can perform as badly as small unrepresentative

samples. A large unrepresentative sample may do more damage than a small one because many


people think that large samples are always better than small ones. The design of the survey is far

more important than the absolute size of the sample.

What good are samples with selection bias? We prefer samples with no selection bias that serve as a microcosm of the population. When the primary interest is in estimating the total

number of victims of violent crime in the United States or the percentage of likely voters in the

United Kingdom who intend to vote for the Labour Party in the next election, serious selection bias

can cause the sample estimates to be invalid.

Purposive or judgment samples can provide valuable information, though, particularly in the early

stages of an investigation. Teichman et al. (1993) took soil samples along Interstate 880 in Alameda

County, California, to determine the amount of lead in yards of homes and in parks close to the

freeway. In taking the samples, they concentrated on areas where they thought children were likely

to play and areas where soil might easily be tracked into homes. The purposive sampling scheme

worked well for justifying the conclusion of the study, that “lead contamination of urban soil in the

east bay area of the San Francisco metropolitan area is high and exceeds hazardous waste levels at

many sites.” A sampling scheme that avoided selection bias would only be needed for this study if

the investigators wanted to generalize the estimated percentage of contaminated sites to the entire

area.


Annex 2

Shere Hite’s book Women and Love: A Cultural Revolution in Progress (1987) had a number of

widely quoted results:

• 84% of women are “not satisfied emotionally with their relationships” (p.804).

• 70% of all women “married five or more years are having sex outside of their marriages” (p.

856).

• 95% of women “report forms of emotional and psychological harassment from men with

whom they are in love relationships” (p. 810).

• 84% of women report forms of condescension from the men in their love relationships (p.

809).

The book was widely criticized in newspaper and magazine articles throughout the United States.

The Time magazine cover story “Back Off, Buddy” (October 12, 1987), for example, called the

conclusions of Hite’s study “dubious” and “of limited value.”

Why was Hite’s study so roundly criticized? Was it wrong for Hite to report the quotes from women

who feel that the men in their lives refuse to treat them as equals, who perhaps have never been

given the chance to speak out before? Was it wrong to report the percentages of these women who

are unhappy in their relationships with men?

Of course not. Hite’s research allowed women to discuss how they viewed their experiences, and

reflected the richness of these women’s experiences in a way that a multiple-choice questionnaire

could not. Hite’s error was in generalizing these results to all women, whether they participated in

the survey or not, and in claiming that the percentages applied to all women. The following

characteristics of the survey make it unsuitable for generalizing the results to all women.

• The sample was self-selected–that is, recipients of questionnaires decided whether they

would be in the sample or not. Hite mailed 100,000 questionnaires; of these, 4.5% were

returned.

• The questionnaires were distributed through organizations such as counseling centers, church societies, and senior citizens’ centers. The members may differ in

political views, but many have joined an “all-women” group, and their viewpoints may differ

from other women in the United States.

• The survey has 127 essay questions, and most of the questions have several parts. Who will

tend to return the survey?

• Many of the questions are vague, using words such as love. The concept of love probably has

as many interpretations as there are people, making it impossible to attach a single

interpretation to any statistic purporting to state how many women are “in love.” Such

question wording works well for eliciting the rich individual vignettes that comprise most of

the book but makes interpreting percentages difficult.


• Many of the questions are leading–they suggest to the respondent which response she should

make. For instance: “Does your husband/lover see you as an equal? Or are there times when

he seems to treat you as an inferior? Leave you out of the decisions? Act superior?” (p.

795).

Hite writes, “Does research that is not based on a probability or random sample give one the right to

generalize from the results of the study to the population at large? If a study is large enough and the

sample broad enough, and if one generalizes carefully, yes” (p. 778). Most survey statisticians

would answer Hite’s questions with a resounding no. In Hite’s survey, because the women who were sent questionnaires were purposively chosen and an extremely small percentage of the women returned

the questionnaires, statistics calculated from these data cannot be used to indicate attitudes of all

women in the United States. The final sample is not representative of women in the United States,

and the statistics can only be used to describe women who would have responded to the survey.

Hite claims that results from the sample could be generalized because characteristics such as the age,

educational, and occupational profiles of women in the sample matched those for the population of

women in the United States. But the women in the sample differed on one important aspect–they

were willing to take the time to fill out a long questionnaire dealing with harassment by men and to

provide intensely personal information to a researcher. We would expect that in every age group and

socioeconomic class, women who choose to report such information would in general have had

different experiences than women who choose not to participate in the survey.


Annex 3

As we have seen before, the sample mean ȳ is an unbiased estimator of the population mean ȳU, where the latter is the average of all possible values of ȳ if we could examine all possible SRSs S that could be chosen. We also calculated the variance of ȳ, given by

V(ȳ) = (1 − n/N) S²/n.

No distributional assumptions are made about the yi’s in order to ascertain that ȳ is unbiased for estimating ȳU. We do not, for instance, assume that the yi’s are normally distributed with mean μ.

In the randomization theory (also called design-based) approach to sampling, the yi’s are

considered to be fixed but unknown numbers–any probabilities used arise from the probabilities of

selecting units to be in the sample. The randomization theory approach provides a nonparametric

approach to inference–we need not make any assumptions about the distribution of random

variables.

Let’s see how the randomization theory works for deriving properties of the sample mean in simple

random sampling. As done in Cornfield (1944), define Zi = 1 if unit i is in the sample and Zi = 0 otherwise. Then

ȳ = (1/n) Σ (i = 1 to N) Zi yi.

The Zi’s are the only random variables in the above equation because, according to randomization

theory, the yi’s are fixed quantities. When we choose an SRS of n units out of the N units in the

population, {Z1, . . . , ZN} are identically distributed Bernoulli random variables with

P(Zi = 1) = n/N. (2.18)

The probability in (2.18) follows from the definition of an SRS. To see this, note that if unit i is in

the sample, then the other (n – 1) units in the sample must be chosen from the (N - 1) units in the

population.

A total of C(N − 1, n − 1) possible samples of size (n − 1) may be drawn from a population of size (N − 1), so

P(Zi = 1) = C(N − 1, n − 1)/C(N, n) = n/N,

and consequently E[ȳ] = (1/n) Σ yi E[Zi] = (1/n) Σ yi (n/N) = ȳU.

The variance of ȳ is also calculated using properties of the random variables Z1, . . . , ZN. Note that V(Zi) = (n/N)(1 − n/N). Because the population is finite, the Zi’s are not quite independent–if we know that unit i is in the sample, we do have a small amount of information about whether unit j is in the sample, reflected in the conditional probability P(Zj = 1 | Zi = 1) = (n − 1)/(N − 1). Consequently, for i ≠ j,

Cov(Zi, Zj) = P(Zi = 1, Zj = 1) − P(Zi = 1)P(Zj = 1) = (n/N)[(n − 1)/(N − 1)] − (n/N)² < 0.

We use the covariance (Cov) of Zi and Zj to calculate the variance of ȳ; see Appendix B for properties of covariances. The negative covariance of Zi and Zj is the source of the fpc.
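These randomization-theory results can be verified numerically by enumerating every possible SRS from a small population (the population values below are hypothetical):

```python
from itertools import combinations
from statistics import mean

# Brute-force check over all SRSs of size n from a tiny fixed population:
# the sample mean is unbiased, each unit's inclusion probability is n/N,
# and the inclusion indicators Z_i have a negative covariance.

y = [3, 7, 8, 12, 20]        # fixed, known population values
N, n = len(y), 2
samples = list(combinations(range(N), n))   # all C(5, 2) = 10 possible SRSs

expected_ybar = mean(mean(y[i] for i in s) for s in samples)   # E[ybar]
incl_prob = sum(1 for s in samples if 0 in s) / len(samples)   # P(Z_0 = 1)
p_both = sum(1 for s in samples if 0 in s and 1 in s) / len(samples)
cov_z0_z1 = p_both - (n / N) ** 2          # negative: the source of the fpc
```

Averaging the sample mean over all ten samples reproduces the population mean exactly, which is precisely what design-based unbiasedness asserts.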

To show that the usual variance estimator of ȳ is unbiased, we need to show that E[s2] = S2. The argument proceeds much like the one above. Since s2 is computed from the sampled units only, it makes sense, when trying to find an unbiased estimator, to find the expected value of the sum of squared deviations Σ Zi(yi − ȳ)² and then find the multiplicative constant that will give the unbiasedness. Thus, E[s2] = S2.

Unless you have studied randomization theory in the design of experiments, the proofs in the

preceding section probably seemed strange to you. The random variables in randomization theory

are not concerned with the responses yi: they are simply random variables that tell us whether the ith

unit is in the sample or not. In a design-based, or randomization theory, approach to sampling

inference, the only relationship between units sampled and units not sampled is that the nonsampled

units could have been sampled had we used a different starting value for the random number

generator.

In Section 2.7 we found properties of the sample mean using randomization theory: y1, y2, . . . , yN were considered to be fixed values, and ȳ is unbiased because the average of ȳ over all possible samples S equals ȳU. The only probabilities used in finding the expected value and variance of ȳ are the probabilities that units are included in the sample.

In your basic statistics class, you learned a different approach to inference. There, you had random variables {Yi} that followed some probability distribution, and the actual sample values were realizations of those random variables. Thus you assumed, for example, that y1, y2, . . ., yn were independent and identically distributed from a normal distribution with mean μ and variance σ², and

used properties of independent random variables and the normal distribution to find expected values

of various statistics.


We can extend this approach to sampling by thinking of random variables Y1, Y2, . . . , YN generated from some model; the actual values y1, y2, . . . , yN for the finite population are one realization of the random variables, and the sample is part of that realization. The joint probability distribution of Y1, Y2, . . . , YN supplies the link between units in the sample and units not in the sample in this model-based approach–a link that is missing in the randomization approach. Here, we use the observed sample to predict the values of the units not in the sample. Thus, problems in finite population sampling may be thought of as prediction problems.


CHAPTER 13

VARIANCE ESTIMATION IN COMPLEX SURVEYS

_________________________________________________________________________

Population means and totals are easily estimated using weights. Estimating variances is more

intricate. We noted before that in a complex survey with several levels of stratification and

clustering, variances for estimated means and totals are calculated at each level and then combined as

the survey design is ascended. Poststratification and nonresponse adjustment also affect the

variance.

In previous chapters, we have presented and derived variance formulas for a variety of sampling

plans. Some of the variance formulas, such as those for simple random samples (SRSs), are

relatively simple. Other formulas, such as those for a two-stage cluster sample without

replacement, are more complicated. All work for estimating variances of estimated totals. But we

often want to estimate other quantities from survey data for which we have presented no variance

formula. For example, in Chapter 3 we derived an approximate variance for a ratio of two means

when an SRS is taken. What if you want to estimate a ratio, but the survey is not an SRS? How

would you estimate the variance?

This chapter describes several methods for estimating variances of estimated totals and other

statistics from complex surveys. Section 13.1 describes the commonly used linearization method for

calculating variances of nonlinear statistics. Sections 13.2 and 13.3 present random group and

resampling methods for calculating variances of linear and nonlinear statistics. Section 13.4

describes the calculation of generalized variance functions, and Section 13.5 describes constructing

confidence intervals. These methods are described in more detail by Wolter (1985) and Rao (1988);

Rao (1997) and Rust and Rao (1996) summarize recent work.

13.1 LINEARIZATION (TAYLOR SERIES) METHODS

Most of the variance formulas in Chapters 2 through 6 were for estimates of means and totals. Those

formulas can be used to find variances for any linear combination of estimated means and totals. If

T̂1, T̂2, . . . , T̂k are unbiased estimates of k totals in the population, then

V(a1T̂1 + a2T̂2 + . . . + akT̂k) = Σi ai² V(T̂i) + Σi Σj≠i ai aj Cov(T̂i, T̂j).    (13.1)

The result can be expressed equivalently using unbiased estimates of k means in the population.

Thus, if T1 is the total number of dollars robbery victims reported stolen, T2 is the number of days of

work robbery victims missed because of the crime, and T3 is the total medical expenses incurred by

robbery victims, one measure of financial consequences of robbery (assuming $150 per day of work

lost) might be T̂1 + 150T̂2 + T̂3. By (13.1), the variance is:

V(T̂1 + 150T̂2 + T̂3) = V(T̂1) + 150² V(T̂2) + V(T̂3) + 2(150) Cov(T̂1, T̂2) + 2 Cov(T̂1, T̂3) + 2(150) Cov(T̂2, T̂3).

This expression requires calculation of six variances and covariances; it is easier computationally to

define a new variable at the observation unit level.
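The shortcut can be sketched numerically. In the sketch below, the data, sample sizes, and distributions are all hypothetical (invented for illustration); it shows that under an SRS, defining one new variable per observation unit gives exactly the same variance estimate as the six-term formula from (13.1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-victim data for an SRS of n victims from N (invented
# for illustration): dollars stolen, work days missed, medical expenses.
N, n = 5000, 100
y1 = rng.gamma(2.0, 200.0, size=n)           # dollars stolen
y2 = rng.poisson(1.5, size=n).astype(float)  # work days missed
y3 = rng.gamma(1.5, 100.0, size=n)           # medical expenses

# Instead of computing six variances and covariances for
# V(T1_hat + 150*T2_hat + T3_hat), define one variable per unit:
u = y1 + 150.0 * y2 + y3

# SRS estimated total and its estimated variance (with fpc):
t_hat = N * u.mean()
v_new_variable = N**2 * (1 - n / N) * u.var(ddof=1) / n

# Six-term version of (13.1), using the sample covariance matrix:
a = np.array([1.0, 150.0, 1.0])
cov = np.cov(np.vstack([y1, y2, y3]), ddof=1)   # 3 x 3 matrix
v_six_terms = N**2 * (1 - n / N) * (a @ cov @ a) / n

# The two computations agree exactly.
assert abs(v_new_variable - v_six_terms) / v_six_terms < 1e-10
```

The agreement is exact because the sample variance of u equals a′Σ̂a by the bilinearity of the sample covariance; the derived-variable route simply avoids writing out the six terms.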

Suppose, though, that we are interested in the proportion of total loss accounted for by the stolen property, T1/Tq, where Tq = T1 + 150T2 + T3 is the total loss. This is not a linear statistic, as T1/Tq cannot be expressed in the form a1T1 + a2Tq for constants ai. But Taylor’s theorem from calculus allows us to linearize a smooth nonlinear function h(T1, T2, . . . , Tk) of the population totals; Taylor’s theorem gives the constants a0, a1, . . . , ak so that

h(T̂1, T̂2, . . . , T̂k) ≈ a0 + a1T̂1 + a2T̂2 + . . . + akT̂k,

and the variance of the linear approximation can then be found using (13.1).

Taylor series approximations have long been used in statistics to calculate approximate variances. Woodruff (1971) illustrates their use in complex surveys. Binder (1983) gives a more rigorous treatment of Taylor series methods for complex surveys and tells how to use linearization when the parameter of interest θ solves h(θ, T1, . . ., Tk) = 0, but θ is not necessarily expressed as an explicit function of T1, . . ., Tk.

Example 13.1

Suppose we wish to estimate θ = P(1 − P), where P is a population proportion. Assume that p is an unbiased estimator of P and that V(p) is known. Let h(x) = x(1 − x), so θ = h(P) and θ̂ = h(p). Now h is a nonlinear function of x, but the function can be approximated at any nearby point a by the tangent line to the function; the slope of the tangent line is given by the derivative, as illustrated in Figure 13.1.


Figure 13.1

The function h(x) = x(1 − x), along with the tangent to the function at point P. If p is close to P, then h(p) will be close to the tangent line. The slope of the tangent line is h′(P) = 1 − 2P.

The first-order version of Taylor’s theorem states that if the second derivative of h is continuous, then

h(x) = h(a) + h′(a)(x − a) + (1/2) h″(c)(x − a)²  for some c between a and x.

Under conditions commonly satisfied in statistics, the last term is small relative to the first two, and we use the approximation

h(p) ≈ h(P) + h′(P)(p − P).

Then,

V[h(p)] ≈ [h′(P)]² V(p) = (1 − 2P)² V(p).
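The linearization V[h(p)] ≈ (1 − 2P)² V(p) can be checked by simulation. In the sketch below, the values of P, n, and the number of Monte Carlo replications are hypothetical, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Check V[h(p)] ~ (1 - 2P)^2 V(p) for h(x) = x(1 - x), where p is the
# sample proportion from n independent draws (hypothetical P and n).
P, n, reps = 0.3, 500, 20000

p = rng.binomial(n, P, size=reps) / n
theta_hat = p * (1 - p)

v_empirical = theta_hat.var(ddof=1)
v_linearized = (1 - 2 * P) ** 2 * P * (1 - P) / n  # (1-2P)^2 V(p)

# For this n, the two agree to within a few percent.
assert abs(v_empirical - v_linearized) / v_linearized < 0.05
```

The higher-order Taylor terms shrink faster than V(p) as n grows, which is why the agreement improves with the sample size.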

The following are the basic steps for constructing a linearization estimator of the variance of a

nonlinear function of means or totals:

1. Express the quantity of interest as a smooth function h of the population totals or means, with the estimate θ̂ computed by applying h to the corresponding estimates from the sample. In general, θ = h(T1, T2, . . . , Tk) or θ = h(ȳ1U, . . . , ȳkU). In Example 13.1, θ = h(P) and θ̂ = h(p).


2. Find the partial derivatives of h with respect to each argument. The partial derivatives, evaluated at the population quantities, form the linearizing constants ai.

3. Apply Taylor’s theorem to linearize the estimate:

h(T̂1, T̂2, . . . , T̂k) ≈ h(T1, T2, . . . , Tk) + Σj aj (T̂j − Tj),

where aj = ∂h/∂Tj, evaluated at the population quantities (T1, T2, . . . , Tk).

4. Now find the estimated variance of the linear approximation Σj aj T̂j. This will generally approximate the variance of h(T̂1, T̂2, . . . , T̂k).

Example 13.2

We used linearization methods to approximate the variance of the ratio and regression estimators in

Chapter 3. In Chapter 3, we used an SRS, the ratio estimator B̂ = ȳ/x̄, and a linearization approximation to its variance.

Essentially, we used Taylor’s theorem to obtain this approximation. The steps below give the same

result.


1. Here, B = Ty/Tx = h(Tx, Ty), with h(c, d) = d/c, and B̂ = T̂y/T̂x. Assume that the sample estimates T̂x and T̂y are unbiased.

2. The partial derivatives are ∂h/∂c = −d/c² and ∂h/∂d = 1/c. Evaluated at the population totals, they give the linearizing constants a1 = −B/Tx and a2 = 1/Tx.

3. By Taylor’s Theorem,

V(B̂) ≈ (1/Tx²) [V(T̂y) − 2B Cov(T̂x, T̂y) + B² V(T̂x)].    (13.2)


We can substitute estimates of B, of the variances and covariance, and possibly of Tx from the particular sampling scheme used into (13.2). Alternatively, we could define the residual variable qi = yi − B̂xi, find the estimated variance V̂(T̂q) of its estimated total under the design used, and take V̂(B̂) = V̂(T̂q)/T̂x².
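The residual-variable route can be sketched as follows. The x and y values are hypothetical (simulated for illustration), and an SRS is assumed so that the variance of the estimated total of q takes the familiar form:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical SRS data for a ratio B_hat = ybar/xbar.
N, n = 2000, 120
x = rng.uniform(10, 50, size=n)
y = 2.5 * x + rng.normal(0, 8, size=n)

b_hat = y.mean() / x.mean()

# Residual variable q_i = y_i - B_hat * x_i; its estimated total
# carries the variance of the linearized statistic.
q = y - b_hat * x

# v(B_hat) ~ (1 - n/N) s_q^2 / (n xbar^2) for an SRS
v_b = (1 - n / N) * q.var(ddof=1) / (n * x.mean() ** 2)
se_b = np.sqrt(v_b)
```

For a design other than an SRS, only the middle step changes: V̂(T̂q) is computed with whatever variance formula the design requires, which is what makes the residual approach convenient in practice.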

Advantages If the partial derivatives are known, linearization almost always gives a variance

estimate for a statistic and can be applied in general sampling designs. Linearization methods have

been used for a long time in statistics, and the theory is well developed. Software exists for

calculating linearization variance estimates for many nonlinear functions of interest, such as ratios

and regression coefficients; some software will be discussed in Section 13.6.

Disadvantages Calculations can be messy, and the method is difficult to apply for complex

functions involving weights. You must either find analytical expressions for the partial derivatives

of h or calculate the partial derivatives numerically. A separate variance formula is needed for each

nonlinear statistic that is estimated, and that can require much special programming; a different

method is needed for each statistic. In addition, not all statistics can be expressed as a smooth

function of the population totals–the median and other quantiles, for example, do not fit into this

framework. The accuracy of the linearization approximation depends on the sample size–the

estimate of the variance is often biased downward if the sample is not large enough.

13.2 RANDOM GROUP METHODS

Suppose the basic survey design is replicated independently R times. Independently here means that

after each sample is drawn, the sampled units are replaced in the population so that they are available

for later samples. Then, the R replicate samples produce R independent estimates of the quantity of


interest; the variability among those estimates can be used to estimate the variance of θ̂.

Mahalanobis (1946) describes early uses of the method, which he calls “replicated networks of

sample units” and “interpenetrating sampling.”

Let θ̂r be the estimate of θ calculated from replicate sample r, and let θ̃ = (1/R) Σr θ̂r. Then

V̂1(θ̃) = (1/R) [Σr (θ̂r − θ̃)² / (R − 1)],    (13.3)

the sample variance of the R independent estimates of θ divided by R–the usual estimate of the variance of a sample mean.

Example 13.3

The 1991 Information Please Almanac listed enrollment, tuition, and room-and-board costs for every

4-year college in the United States. Suppose we want to estimate the ratio of nonresident tuition to

resident tuition for public colleges and universities in the United States. In a typical implementation

of the random group method, independent samples would be chosen using the same design and θ̂r found for each sample. Let’s take four SRSs of size 10 each (Table 13.1). Each of the four SRSs is taken without replacement, but the same college can appear in more than one of the four SRSs.


Table 13.1: Four SRSs of Colleges, Used in Example 13.3

Southeastern Massachusetts University 4,983

U. S. Naval Academy 1,500

Athens State College 2,160

University of South Alabama 2,475

Virginia State University 5,135

SUNY College of Technology-Farmingdale 3,950

University of Houston 4,050

CUNY-Lehman College 4,140

Austin Peay State University 4,166

Indiana University-Southeast

University of Wisconsin-Platteville

University of California-Santa Barbara

Weber State College

Kennesaw College

South Dakota State University

Dickinson State University

Chadron State College

University of Alaska-Fairbanks

University of Maine-Fort Kent

Southern University-Baton Rouge

University of Oregon

Virginia State University

Glenville State College

Winston-Salem State University

Framingham State College

SUNY-Old Westbury

Northwest Missouri State University

Worcester State College

University of California-Davis

Sam Houston State University

University of Texas-Tyler

Southeastern Oklahoma State University

University of Southern Colorado

Pennsylvania State University

East Central University

Univ of Arkansas-Monticello


Thus, θ̂r, the ratio of mean nonresident tuition to mean resident tuition, is calculated for each of the four samples. The sample average of the four independent estimates of θ is θ̃. The sample standard deviation (SD) of the four estimates is 0.343, so the standard error (SE) of θ̃ is 0.343/√4 = 0.172. The estimated variance is based on four independent observations, so a 95% confidence interval (CI) for the ratio is θ̃ ± 3.18(0.172), where 3.18 is the appropriate t critical value with 3 degrees of freedom (df). Note that the small

number of replicates causes the confidence interval to be wider than it would be if more replicate

samples were taken, because the estimate of the variance with 3 df is not very stable.
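The arithmetic of the random group method is simple enough to sketch directly. The four replicate values below are hypothetical stand-ins (the actual four ratio estimates are not reproduced in this text):

```python
import numpy as np

# Four hypothetical replicate estimates standing in for the
# four SRS ratio estimates of Example 13.3.
theta_r = np.array([1.29, 1.72, 1.92, 1.12])
R = len(theta_r)

theta_tilde = theta_r.mean()
v1 = theta_r.var(ddof=1) / R        # (13.3): sample variance / R
se = np.sqrt(v1)

# 95% CI with the t critical value 3.18 on R - 1 = 3 df
ci = (theta_tilde - 3.18 * se, theta_tilde + 3.18 * se)
```

With only 3 df, the t multiplier 3.18 (rather than 1.96) is what widens the interval; more replicate groups would shrink it.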

In practice, subsamples are not usually drawn independently, but the complete sample is selected

according to the survey design. The complete sample is then divided into R groups so that each

group forms a miniature version of the survey, mirroring the sample design. The groups are then

treated as though they are independent replicates of the basic survey design.

If the sample is an SRS of size n, the groups are formed by randomly apportioning the n observations

into R groups, each of size n/R. These pseudo-random groups are not quite independent replicates

because an observation unit can only appear in one of the groups; if the population size is large

relative to the sample size, however, the groups can be treated as though they are independent

replicates. In a cluster sample, the PSUs are randomly divided among the R groups. The PSU takes

all its observations units with it to the random group, so each random group is still a cluster sample.

In a stratified multistage sample, a random group contains a sample of PSUs from each stratum.

Note that if k PSUs are sampled in the smallest stratum, at most k random groups can be formed.

If θ is a nonlinear quantity, θ̃ will not, in general, be the same as the estimator θ̂ calculated directly from the complete sample. For example, in ratio estimation, θ̂ = ȳ/x̄, while θ̃ is the average of the R replicate ratios ȳr/x̄r. In that case, V̂1(θ̃) from (13.3) can still be used to estimate the variance of θ̂, although it tends to be an overestimate. Another estimator of the variance is slightly larger but is often used:

V̂2(θ̂) = [1/(R(R − 1))] Σr (θ̂r − θ̂)².    (13.4)

Example 13.4

The 1987 Survey of Youth in Custody, discussed in Example 7.4, was divided into seven random


groups. The survey design had 16 strata. Strata 6-16 each consisted of one facility (= PSU), and

these facilities were sampled with probability 1. In strata 1-5, facilities were selected with

probability proportional to number of residents in the 1985 Children in Custody census.

It was desired that each random group be a miniature of the sampling design. For each self-

representing facility in strata 6-16, random group numbers were assigned as follows: the first

resident selected from the facility was assigned a number between 1 and 7. Let’s say the first

resident was assigned number 6. Then the second resident in that facility would be assigned number

7, the third resident 1, the fourth resident 2, and so on. In strata 1-5, all residents in a facility (PSU)

were assigned to the same random group. Thus, for the seven facilities sampled in stratum 2, all

residents in facility 33 were assigned random group number 1, all residents in facility 9 were

assigned random group 2, and so on. Seven random groups were formed because strata 2-5 each have

seven PSUs.

After all random group assignments were made, each random group had the same basic design as the

original sample. Random group 1, for example, forms a stratified sample in which a (roughly)

random sample of residents is taken from the self-representing facilities in strata 6-16, and a pps

(probability proportional to size) sample of facilities is taken from each of strata 1-5.

To use the random group method, θ̂r is calculated for each random group. The following table shows estimates of mean age of residents for each random group; each estimate was calculated using

θ̂r = Σ wi yi / Σ wi,

where wi is the final weight for resident i and the summations are over observations in random group r.

Random Group r     θ̂r
1                  16.55
2                  16.66
3                  16.83
4                  16.06
5                  16.32
6                  17.03
7                  17.27

The average of the seven random group estimates is θ̃ = 16.67, and from (13.3), V̂1(θ̃) = 0.0244, so SE(θ̃) = 0.156.

Using the entire data set, we calculate θ̂ directly, and V̂2(θ̂) then follows from (13.4). We can use either V̂1 or V̂2 to calculate confidence intervals; using V̂1, a 95% CI for mean age is 16.67 ± 2.45(0.156) = [16.29, 17.06].
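These quantities can be reproduced directly from the seven random group estimates in the table above:

```python
import numpy as np

# Seven random group estimates of mean age, from the table above.
theta_r = np.array([16.55, 16.66, 16.83, 16.06, 16.32, 17.03, 17.27])
R = len(theta_r)

theta_tilde = theta_r.mean()        # about 16.67
v1 = theta_r.var(ddof=1) / R        # (13.3); about 0.0244
se = np.sqrt(v1)                    # about 0.156

# 95% CI using the t distribution with R - 1 = 6 df (critical value 2.447)
ci = (theta_tilde - 2.447 * se, theta_tilde + 2.447 * se)
```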

Advantages No special software is necessary to estimate the variance, and it is very easy to

calculate the variance estimate. The method is well suited to multiparameter or nonparametric

problems. It can be used to estimate variances for percentiles and nonsmooth functions, as well as

variances of smooth functions of the population totals. Random group methods are easily used after

weighting adjustments for nonresponse and undercoverage.

Disadvantages The number of random groups is often small–this gives imprecise estimates of the

variances. Generally, you would like at least ten random groups to obtain a more stable estimate of

the variance and to avoid inflating the confidence interval by using the t distribution rather than the

normal distribution. Setting up the random groups can be difficult in complex designs, as each

random group must have the same design structure as the complete survey. The survey design may

limit the number of random groups that can be constructed; if two PSUs are selected in each stratum,

then only two random groups can be formed.

13.3 RESAMPLING AND REPLICATION METHODS

Random group methods are easy to compute and explain but are unstable if a complex sample can

only be split into a small number of groups. Resampling methods treat the sample as if it were itself

a population; we take different samples from this new “population” and use the subsamples to

estimate a variance. All methods in this section calculate variance estimates for a sample in which

PSUs are sampled with replacement. If PSUs are sampled without replacement, these methods may

still be used but are expected to overestimate the variance and result in conservative confidence

intervals.

13.3.1 Balanced Repeated Replication (BRR)

Some surveys are stratified to the point that only two PSUs are selected from each stratum. This


gives the highest degree of stratification possible while still allowing calculation of variance

estimates in each stratum.

We illustrate BRR for a problem we already know how to solve–calculating the variance of ȳstr from a stratified random sample. More complicated statistics from stratified multistage samples are discussed in Section 13.3.1.2.

Suppose an SRS of two observation units is chosen from each of seven strata. We arbitrarily label

one of the sampled units in stratum h as yh1 and the other as yh2. The sampled values are given in

Table 13.2.

Table 13.2

Wh        yh1 − yh2
.10       -210
.05       -4,510
.10       -450
.20       2,036
.05       446
.20       36

Ignoring the fpc’s (finite population correction) in Equation (4.5) gives the variance estimate

v̂(ȳstr) = Σh Wh² sh²/nh.

When nh = 2, as here, sh² = (yh1 − yh2)²/2, so

v̂(ȳstr) = Σh Wh² (yh1 − yh2)²/4 = 55,892.75,

the variance estimate we would use if the PSUs had been sampled with replacement.


To use the random group method, we would randomly select one of the observations in each stratum

for group 1 and assign the other to group 2. The groups in this situation are half-samples. For

example, group 1 might consist of {y11, y22, y32, y42, y51, y62, y71} and group 2 of the other seven

observations. Then θ̂1 and θ̂2 are the values of ȳstr computed from group 1 and group 2, respectively.

The random group estimate of the variance–in this case, 139,129–has only 1 df for a two-psu-per-

stratum design and is unstable in practice. If a different assignment of observations to groups had

been made–had, for example, group 1 consisted of yh1 for strata 2, 3, and 5 and yh2 for strata 1, 4, 6

and 7–then and the random group estimate of the variance would have been

3238.

McCarthy (1966; 1969) notes that altogether 2^H possible half-samples could be formed and suggests using a balanced sample of the 2^H possible half-samples to estimate the variance. Balanced repeated replication uses the variability among R replicate half-samples that are selected in a balanced way to estimate the variance of ȳstr.

To define balance, let’s introduce the following notation. Half-sample r can be defined by a vector αr = (αr1, . . . , αrH), where αrh = 1 if the first sampled unit of stratum h is in half-sample r and αrh = −1 if the second sampled unit is in half-sample r. Let

yh(αr) = yh1 if αrh = 1, and yh(αr) = yh2 if αrh = −1.

Equivalently,

yh(αr) = [(1 + αrh)/2] yh1 + [(1 − αrh)/2] yh2.

If group 1 contains observations {y11, y22, y32, y42, y51, y62, y71} as above, then α1 = (1, −1, −1, −1, 1, −1, 1). Similarly, α2 = (−1, 1, 1, 1, −1, 1, −1). The set of R replicate half-samples is balanced if

Σr αrh αrl = 0 for all l ≠ h.

Let θ̂(αr) be the estimate of interest, calculated the same way as θ̂ but using only the observations in the half-sample selected by αr. For estimating the mean of a stratified random sample,

θ̂(αr) = ȳstr(αr) = Σh (Nh/N) yh(αr).

The BRR variance estimator is

V̂BRR(θ̂) = (1/R) Σr [θ̂(αr) − θ̂]².

For our example, the set of α’s in the following table meets the balancing condition Σr αrh αrl = 0 for all l ≠ h. The 8 x 7 matrix of -1's and 1's has orthogonal columns; in fact, it is the design matrix

(excluding the column of 1's) for a fractional factorial design (Box et al. 1978). Designs described

by Plackett and Burman (1946) give matrices with k orthogonal columns, for k a multiple of 4;

Wolter (1985) explicitly lists some of these matrices.

                         Stratum (h)
                  1    2    3    4    5    6    7
         α1      -1   -1   -1    1    1    1   -1
         α2       1   -1   -1   -1   -1    1    1
         α3      -1    1   -1   -1    1   -1    1
Half-    α4       1    1   -1    1   -1   -1   -1
Sample   α5      -1   -1    1    1   -1   -1    1
(r)      α6       1   -1    1   -1    1   -1   -1
         α7      -1    1    1   -1   -1    1   -1
         α8       1    1    1    1    1    1    1

The estimate from each half-sample, θ̂(αr) = ȳstr(αr), is calculated from the data in Table 13.2.


r     ȳstr(αr)     [ȳstr(αr) − ȳstr]²
1     4732.4       78,792.5
2     4439.8       141.6
3     4741.3       83,868.2
4     4344.3       11,534.8
5     4084.6       134,762.4
6     4592.0       19,684.1
7     4123.7       107,584.0
8     4555.5       10,774.4
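Two facts in this example can be verified numerically: the matrix of α's is balanced (orthogonal columns), and the average squared deviation of the half-sample estimates equals 55,892.75. The full-sample value ȳstr = 4451.7 used below is recoverable from the squared deviations:

```python
import numpy as np

# The 8 x 7 matrix of alpha values from the table above.
A = np.array([[-1, -1, -1,  1,  1,  1, -1],
              [ 1, -1, -1, -1, -1,  1,  1],
              [-1,  1, -1, -1,  1, -1,  1],
              [ 1,  1, -1,  1, -1, -1, -1],
              [-1, -1,  1,  1, -1, -1,  1],
              [ 1, -1,  1, -1,  1, -1, -1],
              [-1,  1,  1, -1, -1,  1, -1],
              [ 1,  1,  1,  1,  1,  1,  1]])

# Balance: the columns are orthogonal, so A'A = 8 I.
assert np.allclose(A.T @ A, 8 * np.eye(7))

# Half-sample estimates ybar_str(alpha_r) from the table above.
half_est = np.array([4732.4, 4439.8, 4741.3, 4344.3,
                     4084.6, 4592.0, 4123.7, 4555.5])
ybar_str = 4451.7

# BRR variance: average squared deviation over the R = 8 half-samples.
v_brr = np.mean((half_est - ybar_str) ** 2)
assert abs(v_brr - 55892.75) < 0.1
```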

The average of [ȳstr(αr) − ȳstr]² for the eight replicate half-samples is 55,892.75, which is the same as v̂(ȳstr) for sampling with replacement. Note that we can do the BRR estimation above by creating a new variable of weights for each replicate half-sample. The sampling weight for observation i in stratum h is whi = Nh / nh.

In BRR with a stratified random sample, we eliminate one of the two observations in stratum h to calculate yh(αr). To compensate, we double the weight for the remaining observation. Define

whi(αr) = 2whi if observation i of stratum h is in half-sample r, and whi(αr) = 0 otherwise.

Then,

ȳstr(αr) = Σh Σi whi(αr) yhi / Σh Σi whi(αr).

Similarly, for any statistic calculated using the weights whi, θ̂(αr) is calculated exactly the same way, but using the new weights whi(αr). Using the new weight variables instead of selecting the subset of observations simplifies calculations for surveys with many response variables–the same column w(αr) can be used to find the rth half-sample estimate for all quantities of interest. The


modified weights also make it easy to extend the method to stratified multistage samples.
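Constructing a replicate-weight column can be sketched as follows, on a tiny hypothetical layout (two strata, two PSUs per stratum; all labels, weights, and y-values are invented):

```python
import numpy as np

# Hypothetical data: stratum and PSU labels and sampling weights.
stratum = np.array([1, 1, 1, 1, 2, 2, 2, 2])
psu     = np.array([1, 1, 2, 2, 1, 1, 2, 2])
w       = np.array([3., 3., 3., 3., 5., 5., 5., 5.])
y       = np.array([10., 12., 11., 13., 20., 22., 21., 23.])

# alpha_rh = +1 keeps PSU 1 of stratum h; -1 keeps PSU 2.
alpha_r = {1: 1, 2: -1}

in_half = np.array([(psu[i] == 1) == (alpha_r[stratum[i]] == 1)
                    for i in range(len(w))])
w_rep = np.where(in_half, 2 * w, 0.0)   # double kept weights, zero the rest

# Half-sample estimate of the mean uses w_rep in place of w.
ybar_rep = (w_rep * y).sum() / w_rep.sum()
```

The same w_rep column serves every response variable in the file, which is the point of the weight-based formulation.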

When ȳstr is the only quantity of interest in a stratified random sample, BRR is simply a fancy method of calculating the variance estimate of Equation (4.5) and adds little extra to the procedure in Chapter 4.

BRR’s value in a complex survey comes from its ability to estimate the variance of a general population quantity θ, where θ may be a ratio of two variables, a correlation coefficient, a quantile,

or another quantity of interest.

Suppose the population has H strata, and two PSUs are selected from stratum h with unequal

probabilities and with replacement. (In replication methods, we like sampling with replacement

because the subsampling design does not affect the variance estimator, as we saw in Section 6.3).

The same method may be used when sampling is done without replacement in each stratum, but the estimated variance of θ̂, calculated under the assumption of with-replacement sampling, is expected to be larger than the without-replacement variance.

The data file for a complex survey with two PSUs per stratum often resembles that shown in Table

13.3, after sorting by stratum and PSU.

The vector αr defines the half-sample r: if αrh = 1, then all observation units in PSU 1 of stratum h are in half-sample r; if αrh = −1, then all observation units in PSU 2 of stratum h are in half-sample r. The vectors αr are selected in a balanced way, exactly as in stratified random sampling. Now, for half-sample r, create a new column of weights w(αr):

wi(αr) = 2wi if observation unit i is in half-sample r, and wi(αr) = 0 otherwise.

Observation   Stratum   PSU      SSU      Weight   Response     Response     Response
Number        Number    Number   Number   wi       Variable 1   Variable 2   Variable 3

1 w1 y1 x1 u1

2 w2 y2 x2 u2

3 w3 y3 x3 u3

4 w4 y4 x4 u4

5 w5 y5 x5 u5

6 w6 y6 x6 u6

7 w7 y7 x7 u7

8 w8 y8 x8 u8

9 w9 y9 x9 u9

10 w10 y10 x10 u10

11 w11 y11 x11 u11

etc.

For the data structure in Table 13.3, the column w(αr) sets wi(αr) = 0 for observations in the PSUs excluded by αr and wi(αr) = 2wi for observations in the PSUs included in half-sample r.


Now use the column w(αr) instead of w to estimate quantities for half-sample r. The estimate of the population total of y for the full sample is T̂y = Σi wi yi; the estimate of the population total of y for half-sample r is T̂y(αr) = Σi wi(αr) yi. If θ = Ty/Tx, then θ̂ = T̂y/T̂x and θ̂(αr) = T̂y(αr)/T̂x(αr).

We saw in Section 7.3 that the empirical distribution function F̂ is calculated using the weights. If θ is the population median, then θ̂ may be defined as the smallest value of y for which F̂(y) ≥ 1/2, and θ̂(αr) is the smallest value of y for which F̂αr(y) ≥ 1/2, where

F̂αr(y) = Σ{i: yi ≤ y} wi(αr) / Σi wi(αr).    (13.6)

BRR can also be used to estimate covariances of statistics: if θ and η are two quantities of interest, then

ĈovBRR(θ̂, η̂) = (1/R) Σr [θ̂(αr) − θ̂] [η̂(αr) − η̂].

For smooth statistics, Krewski and Rao (1981) and Rao and Wu (1985) show that if h is a smooth function of the

population totals, the variance estimate from BRR is asymptotically equivalent to that from

linearization. BRR also provides a consistent estimator of the variance for quantiles when a

stratified random sample is taken (Shao and Wu 1992).

Example 13.5

Bye and Gallicchio (1993) describe BRR estimates of variance in the U. S. Survey of Income and

Program Participation (SIPP). SIPP, like the National Crime Victimization Survey (NCVS), has a


stratified multistage cluster design. Self-representing (SR) strata consist of one PSU that is sampled

with probability 1, and one PSU is selected with PPS from each non-self-representing (NSR)

stratum. Strictly speaking, BRR does not apply since only one PSU is selected in each stratum, and

BRR requires two PSUs per stratum. To use BRR, “pseudostrata” and “pseudo-PSUs” were formed.

A typical pseudostratum was formed by combining an SR stratum with two similar NSR strata: the

PSU selected in each NSR stratum was randomly assigned to one of the two pseudo-PSUs, and the

segments in the SR PSU were randomly split between the two pseudo-PSUs. This procedure created

72 pseudostrata, each with two pseudo-PSUs.

The 72 half-samples, each containing the observations from one pseudo-PSU from each

pseudostratum, were formed using a 71-factor Plackett-Burman (1946) design. This design is

orthogonal, so the set of replicate half-samples is balanced.

About 8500 of the 54,000 persons in the 1990 sample said they received Social Security benefits;

Bye and Gallicchio wanted to estimate the mean and median monthly benefit amount for persons

receiving benefits, for a variety of subpopulations. The mean monthly benefit for married males was

estimated as

ȳSM = Σi∈SM wi yi / Σi∈SM wi,

where yi is the monthly benefit amount for person i in the sample, wi is the weight assigned to person i, and SM is the subset of the sample consisting of married males receiving Social Security benefits.

The median benefit payment can be estimated from the empirical distribution function for the

married men in the sample, computed with the weights wi.

Calculating the estimates for a replicate is simple: merely define a new weight variable w(αr), as previously described, and use w(αr) instead of w to estimate the mean and median.

Advantages BRR gives a variance estimate that is asymptotically equivalent to that from

linearization methods for smooth functions of population totals and for quantiles. It requires

relatively few computations when compared with the jackknife and the bootstrap.

Disadvantages BRR as described requires a two-PSU-per-stratum design, though it is often extended to other sampling designs by using more complicated balancing schemes.

BRR, like the jackknife and bootstrap, estimates the with-replacement variance and may

overestimate the variance if the Nh’s, the number of PSUs in stratum h in the population, are small.


13.3.2 The Jackknife

The jackknife method, like BRR, extends the random group method by allowing the replicate

groups to overlap. The jackknife was introduced by Quenouille (1949; 1956) as a method of

reducing bias; Tukey (1958) used it to estimate variances and calculate confidence intervals. In this

section, we describe the delete-1 jackknife; Shao and Tu (1995) discuss other forms of the jackknife

and give theoretical results.

For an SRS, let θ̂(j) be the estimator of the same form as θ̂ but not using observation j. Thus, if θ̂ = ȳ, then θ̂(j) = ȳ(j) = Σi≠j yi/(n − 1). Define the delete-1 jackknife variance estimator (so called because we delete one observation in each replicate) as

v̂JK(θ̂) = [(n − 1)/n] Σj (θ̂(j) − θ̂)².    (13.7)

Then, for θ̂ = ȳ, the jackknife estimator reduces to v̂JK(ȳ) = s²/n, the usual SRS variance estimator without the fpc.

Example 13.6

Let’s use the jackknife to estimate the ratio of nonresident tuition to resident tuition for the first group of colleges in Table 13.1. Here, θ̂ = B̂ = ȳ/x̄, where x is resident tuition and y is nonresident tuition. For each jackknife group, omit one observation. Thus, x̄(j) is the average of all the x’s except for xj (Table 13.4). The jackknife variance estimate v̂JK(B̂) is then computed from the B̂(j) column of the table using (13.7), with n = 10.


Table 13.4: Jackknife Calculations for Example 13.6

j      x̄(j)       ȳ(j)       B̂(j)
1      1545.9     3480.3     2.2513
2      1565.6     3867.3     2.4703
3      1612.2     3794.0     2.3533
4      1523.9     3759.0     2.4667
5      —          —          —
6      1391.0     3463.4     2.4899
7      1560.9     3595.1     2.3032
8      1628.9     3584.0     2.2003
9      1583.3     3574.0     2.2573
10     1597.8     3571.1     2.2350
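A generic delete-1 jackknife for a ratio can be sketched as follows. Because the college data are not fully listed above, hypothetical x and y values are simulated for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical tuition-like data for n = 10 colleges.
n = 10
x = rng.uniform(1000, 2500, size=n)        # resident tuition
y = 2.3 * x + rng.normal(0, 300, size=n)   # nonresident tuition

b_hat = y.mean() / x.mean()

# Delete observation j and recompute the ratio from the remaining n - 1.
b_j = np.array([np.delete(y, j).mean() / np.delete(x, j).mean()
                for j in range(n)])

# (13.7): v_JK = ((n - 1)/n) * sum_j (B_(j) - B_hat)^2
v_jk = (n - 1) / n * np.sum((b_j - b_hat) ** 2)
se_jk = np.sqrt(v_jk)
```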

How can we extend this to a cluster sample? One might think that you could just delete one

observation unit at a time, but that will not work–deleting one observation unit at a time destroys the

cluster structure and gives an estimate of the variance that is only correct if the intraclass correlation

is zero. In any resampling method and in the random group method, keep observation units within a

PSU together while constructing the replicates–this preserves the dependence among observation

units within the same PSU. For a cluster sample, then, we would apply the jackknife variance

estimator in (13.7) by letting n be the number of PSUs and letting θ̂(j) be the estimate of θ that we would obtain by deleting all the observations in PSU j.

In a stratified multistage cluster sample, the jackknife is applied separately in each stratum at the first

stage of sampling, with one PSU deleted at a time. Suppose there are H strata, and nh PSUs are

chosen for the sample from stratum h. Assume these PSUs are chosen with replacement.

To apply the jackknife, delete one PSU at a time. Let θ̂(hj) be the estimator of the same form as θ̂ when PSU j of stratum h is omitted. To calculate θ̂(hj), define a new weight variable: let wi(hj) = 0 if observation i is in PSU j of stratum h, wi(hj) = wi nh/(nh − 1) if observation i is in stratum h but not in PSU j, and wi(hj) = wi if observation i is not in stratum h. Then use the weights wi(hj) to calculate θ̂(hj), and estimate the variance by

v̂JK(θ̂) = Σh [(nh − 1)/nh] Σj (θ̂(hj) − θ̂)².    (13.8)


Example 13.7

Here we use the jackknife to calculate the variance of the mean egg volume from Example 5.6. In that example, since we did not know the number of clutches in the population, we calculated the with-replacement variance.

First, find the weight vector for each of the 184 jackknife iterations. We have only one stratum, so h = 1 for all observations. For θ̂(1 1), delete the first PSU. Thus, the new weights for the observations

in the first PSU are 0; the weights in all remaining PSUs are the previous weights times nh /(nh - 1) =

184/183. Using the weights from Example 5.8, the new jackknife weight columns are shown in

Table 13.5.

Table 13.5: Jackknife Weights for Example 13.7

PSU   csize   wi    wi(1 1)    wi(1 2)    ...   wi(1 184)

1 13 6.5 0 6.535519 ... 6.535519

2 13 6.5 6.535519 0 ... 6.535519

2 13 6.5 6.535519 0 ... 6.535519

3 6 3 3.016393 3.016393 ... 3.016393

3 6 3 3.016393 3.016393 ... 3.016393

4 11 5.5 5.530055 5.530055 ... 5.530055

4 11 5.5 5.530055 5.530055 ... 5.530055

. . . . . ... .

. . . . . ... .

. . . . . ... .

183 13 6.5 6.535519 6.535519 ... 6.535519

183 13 6.5 6.535519 6.535519 ... 6.535519

184 12 6 6.032787 6.032787 ... 0

184 12 6 6.032787 6.032787 ... 0

Note that the sums of the jackknife weights vary from column to column because the original sample

is not self-weighting. We calculated θ̂ as the weighted mean of the egg volumes, using the weights wi; to find θ̂(hj), follow the same procedure but use wi(hj) in place of wi. Using (13.8), we then calculate v̂JK(θ̂). This results in a standard error of 0.061, the same as calculated in Example 5.6.
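The weight construction of Table 13.5 can be sketched on a toy single-stratum sample (hypothetical PSU labels and weights; the rescaling factor nh/(nh − 1) plays the role of 184/183 in the example):

```python
import numpy as np

# Hypothetical single-stratum cluster sample: 3 PSUs, 5 observations.
w   = np.array([6.5, 6.5, 6.5, 3.0, 3.0])  # sampling weights
psu = np.array([1,   2,   2,   3,   3  ])  # PSU of each observation
n_h = 3                                    # number of sampled PSUs

def jackknife_weights(w, psu, j, n_h):
    """Replicate weights with PSU j deleted: its observations get
    weight 0, the rest are rescaled by n_h/(n_h - 1)."""
    return np.where(psu == j, 0.0, w * n_h / (n_h - 1))

w_del1 = jackknife_weights(w, psu, 1, n_h)
# w_del1 is [0, 9.75, 9.75, 4.5, 4.5]
```

As in the example, each replicate column is then used in place of w to recompute the statistic, and (13.8) combines the replicates into the variance estimate.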

Advantages This is an all-purpose method. The same procedure is used to estimate the variance

for every statistic for which the jackknife can be used. The jackknife works in stratified multistage

samples in which BRR does not apply because more than two PSUs are sampled in each stratum.

The jackknife provides a consistent estimator of the variance when θ is a smooth function of population totals (Krewski and Rao 1981).


Disadvantages The jackknife performs poorly for estimating the variances of some statistics. For

example, the jackknife produces a poor estimate of the variance of quantiles in an SRS. Little is

known about how the jackknife performs in unequal-probability, without-replacement sampling

designs in general.

13.3.3 The Bootstrap

As with the jackknife, theoretical results for the bootstrap were developed for areas of statistics

other than survey sampling; Shao and Tu (1995) summarize theoretical results for the bootstrap in

complex survey samples. We first describe the bootstrap for an SRS with replacement, as developed

by Efron (1979, 1982) and described in Efron and Tibshirani (1993). Suppose S is an SRS of size n.

We hope, in drawing the sample, that it reproduces properties of the whole population. We then treat

the sample S as if it were a population and take resamples from S. If the sample really is similar to

the population–if the empirical probability mass function (epmf) of the sample is similar to the

probability mass function of the population–then samples generated from the epmf should behave

like samples taken from the population.

Example 13.8

Let’s use the bootstrap to estimate the variance of the median height, θ, in the height population from Example 7.3, using the sample in the file ht.srs. The population median height is θ = 168; the sample median from ht.srs is the estimate θ̂. Figure 7.2, the probability mass function for the population,

and Figure 7.3, the histogram of the sample, are similar in shape (largely because the sample size for

the SRS is large), so we would expect that taking an SRS of size n with replacement from S would

be like taking an SRS with replacement from the population. A resample from S, though, will not be

exactly the same as S because the resample is with replacement–some observations in S may occur

twice or more in the resample, while other observations in S may not occur at all.

We take an SRS of size 200 with replacement from S to form the first resample. The first resample from S has an epmf similar to but not identical to that of S; it yields the resample median θ̂1. Repeating the process, the second resample from S yields the median θ̂2. We take a total of R = 2000 resamples from S and calculate the sample median from each resample, obtaining θ̂1, θ̂2, . . . , θ̂2000. We obtain the following frequency table for the 2000 sample medians:

Frequency   Median of Resample
1           165
5           166
2           166.5
40          167
15          167.5
268         168
87          168.5
739         169
111         169.5
491         170
44          170.5
188         171
5           171.5
4           172

The sample mean of these 2000 values is 169.3, and the sample variance of these 2000 values is

0.9148; this is the bootstrap estimator of the variance. The bootstrap distribution may be used to

calculate a confidence interval directly: since it estimates the sampling distribution of θ̂, a 95% CI is

calculated by finding the 2.5 percentile and the 97.5 percentile of the bootstrap distribution. For this

distribution, a 95% CI for the median is [167.5, 171].
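A minimal sketch of this with-replacement bootstrap in Python; the heights below are simulated stand-ins for the ht.srs data (so the numbers only illustrate the method), and the function names are our own:

```python
import random
import statistics

random.seed(1)

# Hypothetical stand-in for the ht.srs sample: an SRS of n = 200 heights (cm).
sample = [random.gauss(169, 7) for _ in range(200)]

def bootstrap_median(sample, R=2000, level=0.95):
    """With-replacement bootstrap: resample S, collect the resample medians,
    return the bootstrap variance estimate and the percentile CI."""
    n = len(sample)
    medians = sorted(
        statistics.median(random.choices(sample, k=n)) for _ in range(R)
    )
    var_boot = statistics.variance(medians)   # bootstrap estimate of V(median)
    alpha = (1 - level) / 2
    ci = (medians[int(alpha * R)], medians[int((1 - alpha) * R) - 1])
    return var_boot, ci

var_boot, ci = bootstrap_median(sample)
```

With the real ht.srs values, the same function reproduces the kind of variance estimate and percentile interval described in the example.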

If the original SRS is without replacement, Gross (1980) proposes creating N/n copies of the sample

to form a “pseudopopulation,” then drawing R SRSs without replacement from the

pseudopopulation. If n/N is small, the with-replacement and without-replacement bootstrap

distributions should be similar.
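Gross’s pseudopopulation idea can be sketched in a few lines; N, n, and the data here are hypothetical:

```python
import random
import statistics

random.seed(2)

# Hypothetical SRS of n = 50 drawn without replacement from a population of N = 500.
N, n = 500, 50
sample = [random.gauss(169, 7) for _ in range(n)]

# Gross (1980): replicate the sample N/n times to form a pseudopopulation,
# then draw R SRSs *without* replacement from it.
pseudopop = sample * (N // n)
medians = [statistics.median(random.sample(pseudopop, n)) for _ in range(1000)]
var_boot = statistics.variance(medians)
```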

Sitter (1992) describes and compares three bootstrap methods for complex surveys. In all these

methods, bootstrapping is applied within each stratum. Here are steps for using one version of the

rescaling bootstrap of Rao and Wu (1988) for a stratified random sample:

1. For each stratum, draw an SRS of size (nh − 1) with replacement from the sample in stratum
h. Do this independently for each stratum.

2. Define rescaled weights wi(r) = wi [nh/(nh − 1)] mi(r), where mi(r) is the number of times that observation i is selected to be in the resample. (Since E[mi(r)] = (nh − 1)/nh, the rescaled weights average out to the original weights.)

3. Calculate θ̂(r) using the weights wi(r).

4. Calculate, after repeating steps 1-3 for r = 1, 2, . . . , R, the bootstrap variance estimate

V̂B(θ̂) = (1/R) Σr [θ̂(r) − θ̂]².
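These steps can be sketched for the weighted mean of a stratified sample. Everything below is an illustrative assumption: the data, the weights, and the rescaling factor nh/(nh − 1) used with resamples of size nh − 1 (one common version of the rescaling bootstrap):

```python
import random
from collections import Counter

random.seed(3)

# Hypothetical stratified random sample: records of (stratum, base weight w_i, y_i).
sample = [("A", 10.0, random.gauss(50, 5)) for _ in range(30)] + \
         [("B", 20.0, random.gauss(80, 8)) for _ in range(20)]

def weighted_mean(records):
    """Weighted mean sum(w*y) / sum(w)."""
    return sum(w * y for _, w, y in records) / sum(w for _, w, _ in records)

def rescaling_bootstrap_var(sample, R=500):
    """Sketch of a rescaling-bootstrap variance for the weighted mean."""
    strata = {}
    for rec in sample:
        strata.setdefault(rec[0], []).append(rec)
    theta_hat = weighted_mean(sample)
    reps = []
    for _ in range(R):
        rep = []
        for units in strata.values():
            nh = len(units)
            # Step 1: SRS of size nh - 1 with replacement within the stratum.
            counts = Counter(random.randrange(nh) for _ in range(nh - 1))
            # Step 2: rescaled weights w_i(r) = w_i * (nh/(nh-1)) * m_i(r).
            for i, (h, w, y) in enumerate(units):
                m = counts.get(i, 0)
                if m:
                    rep.append((h, w * nh / (nh - 1) * m, y))
        # Step 3: recompute the estimate with the rescaled weights.
        reps.append(weighted_mean(rep))
    # Step 4: average squared deviation from the full-sample estimate.
    return sum((t - theta_hat) ** 2 for t in reps) / R

v_hat = rescaling_bootstrap_var(sample)
```

With equal weights within each stratum, the rescaled weights in every replicate sum to the original stratum weight total, so only the composition of the resample varies.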


Advantages The bootstrap will work for nonsmooth functions (such as quantiles) in general

sampling designs. The bootstrap is well suited for finding confidence intervals directly: to get a 90%

CI, merely take the 5th and 95th percentiles of the bootstrap distribution of θ̂, or use a bootstrap-t method such as

that described in Efron (1982).

Disadvantages The bootstrap requires more computations than BRR or jackknife since R is

typically a very large number. Compared with BRR and jackknife, less theoretical work has been

done on properties of the bootstrap in complex sampling designs.

In many large government surveys such as the U. S. Current Population Survey (CPS) or the

Canadian Labour Force Survey, hundreds or thousands of estimates are calculated and published.

The agencies analyzing the survey results could calculate standard errors for each published estimate

and publish additional tables of the standard errors, but that would add greatly to the labor involved in

publishing timely estimates from the surveys. In addition, other analysts of the public-use tapes may

wish to calculate additional estimates, and the public-use tapes may not provide enough information

to allow calculation of standard errors.

Generalized variance functions (GVFs) are provided in a number of surveys to calculate standard

errors. They have been used for the CPS since 1947. Here, we describe some GVFs in the 1990

NCVS.

Criminal Victimization in the United States, 1990 (U.S. Department of Justice 1992, 146) gives
GVF formulas for calculating standard errors. If t̂ is an estimated number of persons or households
victimized by a particular type of crime, or if t̂ estimates a total number of victimization incidents,

V̂(t̂) = a t̂² + b t̂.   (13.9)

If p̂ is an estimated proportion,

V̂(p̂) = b p̂(1 − p̂)/n̂,   (13.10)

where n̂ is the estimated base population for the proportion. For the 1990 NCVS, the values of a

and b were a = -.00001833 and b = 3725. For example, it was estimated that 1.23% of persons aged

20 to 24 were robbed in 1990 and that 18,017,100 persons were in that age group. Thus, the GVF
estimate of SE(p̂) is

√[(3725)(.0123)(1 − .0123)/18,017,100] = .0016.

Assuming that asymptotic results apply, this gives an approximate 95% CI of .0123 ± (1.96)(.0016),

or [.0091, .0153].


There were an estimated 800,510 completed robberies in 1990. Using (13.9), the standard error of
this estimate is

√[(−.00001833)(800,510)² + (3725)(800,510)] ≈ 54,500.
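Taking (13.9) as V̂(t̂) = a t̂² + b t̂ and (13.10) as V̂(p̂) = b p̂(1 − p̂)/n̂ (the forms consistent with the published NCVS constants and the standard errors quoted above), both computations can be checked in a few lines:

```python
import math

a, b = -0.00001833, 3725.0   # NCVS 1990 GVF constants

def gvf_se_total(t_hat):
    """SE of an estimated total, taking (13.9) as V(t) = a*t^2 + b*t."""
    return math.sqrt(a * t_hat ** 2 + b * t_hat)

def gvf_se_prop(p_hat, base):
    """SE of an estimated proportion, taking (13.10) as V(p) = b*p*(1-p)/base."""
    return math.sqrt(b * p_hat * (1 - p_hat) / base)

se_p = gvf_se_prop(0.0123, 18017100)   # robbery rate, persons aged 20-24
se_t = gvf_se_total(800510)            # completed robberies, 1990
```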

Where do these formulas come from? Suppose Ti is the total number of observation units
belonging to a class, say, the total number of persons in the United States who were victims of
violent crime in 1990. Let Pi = Ti/N, the proportion of persons in the population belonging to that
class. If di is the design effect (deff) in the survey for estimating Pi (see Section 7.5), then

V(T̂i) ≈ di N² Pi(1 − Pi)/n = ai Ti² + bi Ti  and  V(P̂i) ≈ di Pi(1 − Pi)/n = bi Pi(1 − Pi)/N,   (13.11)

where ai = −di/n and bi = di N/n. If estimating a proportion in a domain, say, the proportion of persons in the 20-24
age group who were robbery victims, the denominator N in (13.11) is changed to the estimated
population size of the domain (see Section 3.3).

If the deff’s are similar for different estimates, so that ai ≈ a and bi ≈ b for all i, then constants a and b can
be estimated that give (13.9) and (13.10) as approximations to the variance for a number of

quantities. The general procedure for constructing a generalized variance function is as follows:

1. Using replication or some other method, estimate variances for k population totals of special
interest, T̂1, T̂2, . . . , T̂k. Let vi = V̂(T̂i)/T̂i² be the relative variance for T̂i,
for i = 1, 2, . . . , k.

2. Postulate a model relating vi to T̂i, such as vi = α + β/T̂i.
This is a linear regression model with response variable vi and explanatory variable 1/T̂i.
Valliant (1987) found that this model produces consistent estimates of the variances for the
class of superpopulation models he studied.

3. Use regression techniques to estimate α and β. Valliant (1987) suggests using weighted least
squares to estimate the parameters, giving higher weight to items with small vi. The GVF
estimate of variance, then, is the predicted value from the regression equation:

V̂(T̂) = α̂ T̂² + β̂ T̂.

The ai and bi for individual items are replaced by quantities a and b, which are calculated from all k

items. For the 1990 NCVS, b = 3725. Most weights in the 1990 NCVS are between 1500 and 2500;

b approximately equals (average weight) × (deff) if the overall design effect is about 2.
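The regression fit in step 3 can be sketched with closed-form weighted least squares. The totals, relative variances, and the weight choice 1/vi² (one way of giving higher weight to items with small vi) are all made-up illustrations:

```python
# Hypothetical GVF inputs: estimated totals t_i and relative variances v_i
# obtained from replication.
totals = [5.0e5, 1.2e6, 3.0e6, 8.0e6, 2.0e7]
relvars = [0.0075, 0.0031, 0.0013, 0.00047, 0.00019]

x = [1.0 / t for t in totals]          # explanatory variable 1/t_i
w = [1.0 / v ** 2 for v in relvars]    # WLS weights: small v_i counts more

# Closed-form weighted least squares for v = alpha + beta * x.
sw = sum(w)
xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
vbar = sum(wi * vi for wi, vi in zip(w, relvars)) / sw
beta = (sum(wi * (xi - xbar) * (vi - vbar) for wi, xi, vi in zip(w, x, relvars))
        / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
alpha = vbar - beta * xbar

def gvf_variance(t_hat):
    """GVF variance estimate: predicted relative variance times t_hat^2."""
    return (alpha + beta / t_hat) * t_hat ** 2
```

The fitted intercept and slope play the roles of a and b when the predicted relative variance is converted back into a variance for a new estimated total.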

Valliant (1987) found that if deff’s for the k estimated totals are similar, the GVF variances were

often more stable than the direct estimate, as they smooth out some of the fluctuations from item to

item. If a quantity of interest does not follow the model in step 2, however, the GVF estimate of the

variance is likely to be poor, and you can only know that it is poor by calculating the variance

directly.

Advantages The GVF may be used when insufficient information is provided on the public-use

tapes to allow direct calculation of standard errors. The data collector can calculate the GVF, and the

data collector often has more information for estimating variances than is released to the public. A

generalized variance function saves a great deal of time and speeds production of annual reports. It

is also useful for designing similar surveys in the future.

Disadvantages The model relating vi to T̂i may not be appropriate for the quantity you are

interested in, resulting in an unreliable estimate of the variance. You must be careful about using

GVFs for estimates not included when calculating the regression parameters. If a subpopulation has

an unusually high degree of clustering (and hence a high deff), the GVF estimate of the variance may

be much too small.

Theoretical results exist for most of the variance estimation methods discussed in this chapter,

stating that under certain assumptions (θ̂ − θ)/√V̂(θ̂) asymptotically follows a standard normal

distribution. These results and conditions are given in Binder (1983), for linearization estimates; in

Krewski and Rao (1981) and Rao and Wu (1985), for jackknife and BRR; in Rao and Wu (1988) and

Sitter (1992), for bootstrap. Consequently, when the assumptions are met, an approximate 95%
confidence interval for θ may be constructed as

θ̂ ± 1.96 √V̂(θ̂).

Alternatively, a tdf percentile may be substituted for 1.96, with df = (number of groups - 1) for the

random group method. Rust and Rao (1996) give guidelines for appropriate df’s for other methods.

Roughly speaking, the assumptions for linearization, jackknife, BRR, and bootstrap are as follows:

1. The quantity of interest θ can be expressed as a smooth function of the population totals;


more precisely, θ = h(T1, T2, . . . , Tk), where the second-order partial derivatives of h are

continuous.

2. The sample sizes are large: either the number of PSUs sampled in each stratum is large, or

the survey contains a large number of strata. (See Rao and Wu 1985 for the precise technical

conditions needed.) Also, to construct a confidence interval using the normal distribution,

the sample sizes must be large enough so that the sampling distribution of θ̂ is approximately

normal.

Furthermore, a number of simulation studies indicate that these confidence intervals behave well in

practice. Wolter (1985) summarizes some of the simulation studies; others are found in Kovar et al.

(1988) and Rao et al. (1992). These studies indicate that the jackknife and linearization methods

tend to give similar estimates of the variance, while the bootstrap and BRR procedures give slightly

larger estimates. Sometimes a transformation may be used so that the sampling distribution of a

statistic is closer to a normal distribution: if estimating total income, for example, a log

transformation may be used because the distribution of income is extremely skewed.

The theoretical results described above for BRR, jackknife, bootstrap, and linearization do not apply

to population quantiles, however, because they are not smooth functions of population totals.

Special methods have been developed to construct confidence intervals for quantiles; McCarthy

(1993) compares several confidence intervals for the median, and his discussion applies to other

quantiles as well.

Let q be between 0 and 1. Then define the quantile θq as θq = F⁻¹(q), where F⁻¹(q) is defined to be the
smallest value y satisfying F(y) ≥ q. Similarly, define θ̂q = F̂⁻¹(q), where F̂ is the empirical
distribution function of the sample. Now F⁻¹ and F̂⁻¹ are not
smooth functions, but we assume the population and sample are large enough so that they can be
well approximated by continuous functions.

Some of the methods already discussed work quite well for constructing confidence intervals for

quantiles. The random group method works well if the number of random groups, R, is moderate.

Let θ̂q,r be the estimated quantile from random group r, and let θ̄q be the average of the R values
θ̂q,r. Then, a confidence interval for θq is

θ̄q ± t √[ Σr (θ̂q,r − θ̄q)² / (R(R − 1)) ],

where t is the appropriate percentile from a t distribution with R − 1 df. Similarly, empirical studies
by McCarthy (1993), Kovar et al. (1988), Sitter (1992), and Rao et al. (1992) indicate that in certain
designs confidence intervals can be formed using

θ̂q ± 1.96 √V̂(θ̂q),


where the variance estimate is calculated using BRR or bootstrap.

An alternative interval can be constructed based on a method introduced by Woodruff (1952). For

any y, F(y) is a function of population totals: F(y) = (1/N) Σi ui, where ui = 1 if yi ≤ y
and ui = 0 if yi > y. Thus, a method in this chapter can be used to estimate V̂[F̂(y)] for any value y,
and an approximate 95% CI for F(y) is given by

F̂(y) ± 1.96 √V̂[F̂(y)].

Woodruff’s idea is to invert the confidence interval for q = F(θq) to obtain an approximate
confidence interval for θq: since we have a 95% CI for q, the values of y at which F̂ equals the two
endpoints of that interval give an approximate 95% CI for θq.

[Figure 13.2 appeared here: Woodruff’s confidence interval for the quantile θq when the empirical
distribution function is continuous.]


Figure 13.2 shows Woodruff’s confidence interval for the quantile θq if the empirical distribution
function is continuous. Since F(y) is a proportion, we can easily calculate a confidence interval (CI)
for any value of y, shown on the vertical axis. We then look at the corresponding points on the
horizontal axis to form a confidence interval for θq.

Now we need several technical assumptions to use the Woodruff-method interval. These

assumptions are stated by Rao and Wu (1987) and Francisco and Fuller (1991), who studied a similar

confidence interval. Basically, the problem is that both F and F̂ are step functions; they have jumps
at the values of y in the population and sample. The technical conditions basically say that the jumps
in F and in F̂ should be small and that the sampling distribution of F̂(y) is approximately normal.

Example 13.9

Let’s use Woodruff’s method to construct a 95% CI for the median height in the file ht.srs, discussed
in Examples 7.3 and 13.8. Note that F̂(θ̂0.5) = 0.5 is the sample proportion of observations in the SRS that
take on a value of at most θ̂0.5; so, ignoring the fpc,

SE[F̂(θ̂0.5)] ≈ √[(0.5)(0.5)/200] = 0.0354.

The lower confidence bound for the median is then F̂⁻¹(0.5 − 1.96 × 0.0354) = F̂⁻¹(0.4307), and the upper confidence
bound for the median is F̂⁻¹(0.5 + 1.96 × 0.0354) = F̂⁻¹(0.5693). As heights were only measured to the nearest centimeter,
we’ll use linear interpolation to smooth the step function F̂. The following values were obtained
for the empirical distribution function:


y      F̂(y)
167    0.405
168    0.440
170    0.515
171    0.550
172    0.605

Then, interpolating, the lower confidence bound is 167 + (0.4307 − 0.405)/(0.440 − 0.405) ≈ 167.73,
and the upper confidence bound is 171 + (0.5693 − 0.550)/(0.605 − 0.550) ≈ 171.35, giving the
approximate 95% CI [167.73, 171.35] for the median height.
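The interpolation can be scripted directly from the tabulated ECDF values; this sketch assumes the n = 200 SRS of Examples 7.3 and 13.8:

```python
import math

# ECDF values from the example's table for ht.srs (heights, cm).
ecdf = [(167, 0.405), (168, 0.440), (170, 0.515), (171, 0.550), (172, 0.605)]
n, q = 200, 0.5

def interp_quantile(ecdf, p):
    """Invert the ECDF by linear interpolation between tabulated points."""
    for (y0, f0), (y1, f1) in zip(ecdf, ecdf[1:]):
        if f0 <= p <= f1:
            return y0 + (p - f0) / (f1 - f0) * (y1 - y0)
    raise ValueError("p outside tabulated range")

se = math.sqrt(q * (1 - q) / n)            # SE of F-hat at the median, ignoring the fpc
lower = interp_quantile(ecdf, q - 1.96 * se)
upper = interp_quantile(ecdf, q + 1.96 * se)
```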

The confidence intervals presented so far in this chapter have been developed under the design-based

approach. A 95% CI may be interpreted in the repeated-sampling sense that, if samples were

repeatedly taken from the finite population, we would expect 95% of the resulting confidence

intervals to include the true value of the quantity in the population.

Sometimes, especially in situations when ratio estimation or poststratification is used, you may
want to consider constructing a conditional confidence interval instead. In poststratification as used
for nonresponse (Section 8.5.2), a variance conditional on the respondent sample sizes nhR was
presented. A 95% conditional confidence interval, constructed using the variance in (8.3), would
have the interpretation that we would expect 95% of all samples having those specific values of nhR
to yield confidence intervals containing the true value of the population quantity.

The theory of conditional confidence intervals is beyond the scope of this book; we refer the reader

to Särndal et al. (1992, sec. 7.10), Casady and Valliant (1993), and Thompson (1997, sec. 5.12) for

more discussion and bibliography.

This chapter has briefly introduced you to some basic types of variance estimation methods that are

used in practice: linearization, random groups, replication, and generalized variance functions. But

this is just an introduction; you are encouraged to read some of the references mentioned in this

chapter before applying these methods to your own complex survey. Much of the research done

exploring properties and behavior of these methods has been done since 1980, and variance

estimation methods are still a subject of research by statisticians.


Linearization methods are perhaps the most thoroughly researched in terms of theoretical properties

and have been widely used to find variance estimates in complex surveys. The main drawback of

linearization, though, is that the derivatives need to be calculated for each statistic of interest, and

this complicates the programs for estimating variances. If the statistic you are interested in is not

handled in the software, you must write your own code.

The random group method is an intuitively appealing method for estimating variances. Easy to

explain and to compute, it can be used for almost any statistic of interest. Its main drawback is that

we generally need enough random groups to have a stable estimate of the variance, and the number

of random groups we can form is limited by the number of PSUs sampled in a stratum.

Resampling methods for stratified multistage surveys avoid partial derivatives by computing

estimates for subsamples of the complete sample. They must be constructed carefully, however, so

that the correlation of observations in the same cluster is preserved in the resampling. Resampling

methods require more computing time than linearization but less programming time: the same

method is used on all statistics. They have been shown to be equivalent to linearization for large

samples when the characteristic of interest is a smooth function of population totals.

The BRR method can be used with almost any statistic, but it is usually used only for two-PSU-per-

stratum designs or for designs that can be reformulated into two PSUs per stratum. The jackknife and

bootstrap can also be used for most estimators likely to be used in surveys (exception: the delete-1

jackknife may not work well for estimating the variance of quantiles) and may be used in stratified

multistage samples in which more than two PSUs are selected in each sample, but they require more

computing than BRR.

Generalized variance functions are cheap and easy to use but have one major drawback: unless you

can calculate the variance using one of the other methods, you cannot be sure that your statistic

follows the model used to develop the GVF.

All methods except GVFs assume that information on the clustering is available to the data analyst.

In many surveys, such information is not released because it might lead to identification of the

respondents. See Dippo et al. (1984) for a discussion of this problem.

Various software packages have been developed to assist in analyzing data from complex surveys.

Cohen (1997), Lepkowski and Bowles (1996), and Carlson et al. (1993) evaluate PC-based packages

for analysis of complex survey data.1 SUDAAN (Shah et al. 1995), OSIRIS (Lepkowski 1982), Stata
(StataCorp 1996), and PC-CARP (Fuller et al. 1989) all use linearization methods to estimate

variances of nonlinear statistics. SUDAAN, for example, calculates variances of estimated

population totals for various stratified multistage sampling designs that have H strata, unequal-

probability cluster sampling with or without replacement at the first stage of sampling, and SRS with

or without replacement at subsequent stages. The formula in (6.9) is used to estimate the variance

1. Lepkowski and Bowles (1996) tell how to access the free (or almost-free) software packages CENVAR,
CLUSTERS, Epi Info, VPLX, and WesVarPC through e-mail or from the internet. Software for analysis of survey
data is changing rapidly; the Survey Research Methods Section of the American Statistical Association
(www.amstat.org) is a good resource for updated information.


for each stratum in with-replacement sampling, and the Sen-Yates-Grundy form in (6.15) is used for

without-replacement variance. Then, the variances for the totals in the strata are added to estimate

the variance for the estimated population total. SUDAAN then uses linearization to find variances

for ratios, regression coefficients, and other nonlinear statistics. Recent versions of SUDAAN also

implement BRR and jackknife.

OSIRIS also implements BRR and jackknife methods. The survey software packages WesVarPC

(Brick et al. 1996; at press time, WesVarPC could be downloaded free from www.westat.com) and

VPLX (Fay 1990) both use resampling methods to calculate variance estimates. A simple S-PLUS

function for jackknife is given in Appendix D; this is not intended to substitute for well-tested

commercial software but to give you an idea of how these calculations might be done. Then, after

you understand the principles of the methods, you can use commercial software for your complex

surveys.


CHAPTER 14

SAMPLING FOR OBJECTIVE MEASUREMENT

SURVEYS IN AGRICULTURE

14.1 NEED FOR OBJECTIVE MEASUREMENTS

The principles of sampling discussed in the previous lectures are widely applicable to survey

programs generally. Certain kinds of surveys, however, may require special techniques of sampling

and data collection which are determined by the nature of the inquiry or the ability of respondents to

give accurate answers. This chapter describes some special techniques used in agricultural surveys.

Statistics on area planted with individual crops and on yields from these crops are, in most countries,

based upon periodic reports from crop reporters. In some countries, these reporters are holders or

other individuals who reside in the rural areas and have knowledge of the local agriculture; they

report voluntarily, usually by mail. In other countries the reporters are government officials or

agents. The reports submitted by these agents are usually less accurate than those submitted by

private individuals, in part because the agents are usually reporting for a much larger area and in part

because the agents are not so closely connected with agriculture. However, whether made by private

individuals or by government agents, these reports are all subject to biases which are often large and

always difficult to evaluate. For example, investigations in various countries have shown that in

estimating yields, reporters (particularly official reporters) have a tendency to be biased toward the

normal; in other words, in good years they tend to underestimate the yield whereas in bad years they

tend to overestimate. Although private reporters also have this tendency to some extent, they are

generally more inclined to underestimate in the belief that it will be to their advantage to do so.

Areas, on the other hand, tend to be overestimated because of the difficulty of making proper

allowances for nonplanted areas around the edges of fields and areas within the fields that cannot be

planted.

Check data from past years can be used to evaluate the biases in the estimates of production obtained

from reporters. For crops such as tobacco or cotton, which must be processed before being used,

information on production can be obtained from the processors and compared with the corresponding

figures obtained from reporters. For other crops, similar use can be made of data obtained from

marketing or shipping sources. If such data are complete (usually there is no guarantee that they are

complete) and if the relative bias remains reasonably constant from year to year, estimates for the

current year can be adjusted on the basis of this past experience. For other crops, which are at least

partly consumed locally, fed to livestock, etc., such check data are not available. Census data, if

available, can be used as a benchmark for adjusting the reports for these crops. However, the census

data are also subject to reporting biases. Furthermore, adjustments using census data become less

and less reliable as the time lapse between the last census and the current year widens.

Experience in many different countries under a variety of conditions has indicated that subjective

methods of estimating production, even when other data are available for adjusting the estimates,

cannot provide reliable results. If accurate and unbiased estimates are required, the only alternative

is to establish some type of program utilizing objective methods of observation applied on a random

sampling basis. Such surveys are called "objective measurement surveys" because the data are

collected by actual observation and measurement or counting, rather than by methods depending on

the judgment, good memory, or education of persons who report the required information. Even

though such a program of objective measurement surveys is relatively costly and difficult to carry

out, the results will usually justify the effort.

The theoretical considerations affecting sample design, discussed in previous lectures, are as relevant

to the design of an objective measurement survey as they are to any other survey.

The sampling statistician must know whether estimates are required for the nation as a whole, for the

Provinces or districts individually, or for some other administrative areas. The sample allocation

must be planned to give estimates for the desired areas at an acceptable level of reliability. If an

estimate of the number of holdings (either in total or for a specific crop) is also required, this must be

considered in designing the sample.

14.2.2 Stratification

First-level strata often consist of the smallest areas requiring separate estimates. Further gains in

efficiency may be obtained by further stratification into geographic areas having relatively

homogeneous yield rates for the crop. Other bases for stratification, such as irrigated and

nonirrigated land, varieties of crops, etc., may also be used.

The statistician must decide how to allocate the sample to strata. A common practice is to allocate it

proportionately to the area under the particular crop or group of crops being investigated. If

available, knowledge about the relative variances and/or the relative costs of performing the field

work in the different strata should also be used in allocating the sample.

A decision must be made on the method of sampling within strata. As was indicated before, there

are usually several possible sampling units and sample designs. In deciding upon a sampling plan,

the sampling statistician will need to know what materials are available for constructing the sampling

frame and what types of data are required. His choice may also be influenced by other factors such

as the availability of capable personnel to carry out the work. However, even with the restrictions

imposed by these considerations, there will usually be a number of possible choices.

In most practical applications, several sampling stages and sampling units will be used within strata.

For example, if the strata are large administrative divisions, such as Provinces, a sample of districts

might be selected at the first stage and a sample of subdistricts within sample districts at the second

stage. Where "villages" have identifiable boundaries and account for all the land, they can serve as

convenient units at some stage in the sampling. The ultimate unit of analysis will usually be an

individual holding, the individual field, or (for studies involving estimation of yields) small plots

within fields. If the field is the unit of analysis, holdings may be selected at the preceding stage.


14.2.4.2 Methods of selecting holdings and fields

The following examples illustrate some procedures that can be used to select holdings and fields in

the final stages of the sample design. The selection of plots within fields is discussed in section

14.4.4 of this chapter.

(1) Holdings can be selected from lists if lists are available or can be constructed without much

difficulty. Lists of holdings would be needed only for the units (villages, subdistricts, etc.)

actually selected in the sample at the preceding stage; if necessary, these could be compiled as

part of the field operation. The selection of holdings can be made either with equal probability

or with probability proportionate to size (assuming that information on size is available or can

be obtained). The measure of size might be total reported area in the holding, total area in a

particular crop or group of crops, etc.

Similarly, within each selected holding, a list of fields could be compiled and a sample

selected. Again, selection could be made either with equal probability or with probability

proportionate to size.

(2) If maps or aerial photographs are available, these can be used to select fields directly without

first selecting holdings. One way to do this is to superimpose on the map or photo a grid on

which dots have been placed either in a systematic pattern or at random; each field into which a

dot falls is then included in the sample, thus giving the fields probabilities of selection

proportionate to their sizes. This procedure requires, of course, that the maps or photos be

sufficiently detailed so that the point and the corresponding field can be located on the ground.

(This procedure is not easily adaptable to estimating number of holdings, if that is desired.)

(3) Area segments are useful sampling units for determining which holdings and/or fields are to be

included in the sample. These segments may be constructed either with natural boundaries that

can be located on the ground or with imaginary boundaries drawn on a photo or map; the

choice depends upon the particular situation. Holdings and/or fields may be associated with

area segments in any of the following ways:

(a) Area segments with imaginary boundaries could be used as first-stage sampling units and a

sample of segments selected; within the sample segments, fields could be selected as

second-stage units in the manner described above in (2).

(b) An alternative procedure would be to include in the sample all fields (or holdings) for

which a uniquely defined point falls within the segment boundaries. With this procedure,

fields (or holdings) would not be selected with probability proportionate to their sizes; the

probability of selection would be the same as the probability of selection of the segment

into which the point falls. This is known as an open segment approach. The segments

determine which units are included in the sample, but data are tabulated for some fields (or

holdings) lying partly outside the segment and are not tabulated for other fields (or

holdings) lying partly inside the segment.


The unique point must be defined with care. Usually a particular corner of the field

(holding) would be designated as the unique point. Because fields (holdings) may not be

rectangular, a specific rule for locating this corner would be needed as well. For example,

if the northwest corner were the designated unique point, it could be defined either (1) by

identifying the boundary points that lie farthest west and then designating the most

northern of these points as the northwest corner or (2) by identifying the boundary points

that lie farthest north and then designating the most western of these points as the

northwest corner. If the holding were the unit of analysis, the residence of the holder

(provided all such residences had a chance of being included in the sample) would

generally be preferred as the unique point since it would be the easiest point to locate. A

combination of rules is, perhaps, even more useful. For example, the residence of the

holder might be used when the holder lives on the holding, and a particular corner used

when he does not live on the holding. In any case, the point must be defined in a way such

that it is truly unique (that is, each unit must have one, and only one, such point associated

with it and thus have one, and only one, chance of being included in the sample); it should

also be fairly easy to identify.

(c) If the unit of analysis is the holding, the weighted segment approach will usually be more

efficient than the open segment approach. With this procedure, all holdings having any

land in the segment are included in the sample. In the estimation, the data from each

holding are weighted by a factor based on the proportion of the entire holding lying inside

the segments. In almost all applications, the weighted segment approach requires that the

segments have natural boundaries that can be identified on the ground.

(d) Still another possibility is to use the so-called closed-segment approach in which only

those fields or parts of fields lying within the segments are included in the sample. One

advantage of this procedure is that it avoids the difficulty of having to define the holding.

Of course, if information is desired on a holding basis, the closed-segment approach is not

appropriate since some holdings will certainly extend beyond the segment boundaries.
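The dot-grid selection described in (2) above can be illustrated with a small simulation: dots thrown at random on a map fall into each field with probability proportional to the field's area, which is exactly the probability-proportionate-to-size property used there. The map sheet and the rectangular fields below are hypothetical:

```python
import random

random.seed(4)

# Hypothetical map: non-overlapping rectangular fields (name, x0, y0, x1, y1)
# covering a 100 x 100 map sheet.
fields = [("F1", 0, 0, 50, 40), ("F2", 50, 0, 100, 40), ("F3", 0, 40, 100, 100)]

def field_at(x, y):
    """Return the field containing the dot (x, y), if any."""
    for name, x0, y0, x1, y1 in fields:
        if x0 <= x < x1 and y0 <= y < y1:
            return name
    return None

# Throw random dots on the map; each field is hit with probability
# proportional to its share of the map area.
dots = 20000
hits = {name: 0 for name, *_ in fields}
for _ in range(dots):
    name = field_at(random.uniform(0, 100), random.uniform(0, 100))
    if name is not None:
        hits[name] += 1
```

Here F3 covers 60 percent of the map, so roughly 60 percent of the dots land in it; in practice a single dot (or a sparse grid of dots) is used, and the fields hit are the sample.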

14.3 ESTIMATION OF AREA

Since it is known that data on land area obtained by asking individuals to respond to questionnaires

can be very inaccurate, other means of obtaining these data have been investigated.1 The usual

approach in objective measurement surveys is to select a sample of areas, and then to go to these

areas and measure them directly. There are also methods of obtaining objective estimates of area

that do not require direct measurement of the land; for example, measuring the area on aerial

photographs. In addition to the measurements, other information may be obtained. For example, the

land may be classified into various categories according to its use (crop land, pasture, wasteland,

etc.), the particular crop being grown on each piece of land may be identified, etc.

1. For discussion of techniques and experiences in many countries, see S. S. Zarkovich (ed.), Estimation of Areas in

Agricultural Statistics, Food and Agriculture Organization of the United Nations, Rome, 1965.


14.3.1 Measurement of land area

The first step in making direct measurements of land is to make a scale drawing. In order to do this,

one must be able to measure distances and angles. A drawing made by a professional land surveyor

using technical equipment would be very precise. On the other hand, a drawing made by an

inexperienced worker measuring distances by pacing and measuring angles by eye estimates would

not be very accurate. Between these extremes, there are many other methods that can be used. One

should balance the relative cost against the relative accuracy of the various procedures and select the

method that will provide an acceptable level of reliability for the lowest cost.

After the scale drawing has been made, the area of the drawing must be determined. If the land that

was measured is in the shape of a regular geometric figure such as a rectangle, trapezoid, etc., it is

relatively easy to determine the area of the drawing by standard mathematical formulas. Using the

appropriate expansion factor, the area of the land represented by the drawing can then be determined.

Often, however, the area is of irregular shape and other methods must be used; for example,

triangulation, planimetering, gridding, dot counting, and map cutting and weighing.

14.3.1.1 Triangulation.--In triangulation, the polygon formed by the drawing is converted into

simple triangles. It is a principle of geometry that this can always be done. (Curved

boundaries are roughly approximated by a series of straight lines before triangulation.)

Each triangle is measured and the area computed by standard formulas. This procedure is

time consuming and tedious and has largely been replaced.

14.3.1.2 Planimetering.--A planimeter is an instrument with which one can determine the area of a

closed figure by tracing around the boundary of the figure with a pencil-like device. A

good planimeter will give very accurate results. It does, however, require a skilled

operator and much time.

14.3.1.3 Gridding.--Basically, a grid is a plane divided into small squares (for example, a piece of

ordinary graph paper). For use in measuring area, the squares are constructed so that each

is equivalent to a particular amount of area in accordance with the scale of the drawing. A

transparent plastic grid can be placed over the drawing; or the grid can be printed on paper

and the drawing made directly on this paper. To estimate the area represented by the

drawing, one counts the whole squares and parts of squares within the perimeter of the

scale drawing and converts this number to its equivalent in terms of the appropriate unit of

area.


Figure 1: MEASUREMENT BY GRIDDING

(1 SQUARE = 1/4 HECTARE)

Although not as accurate as planimetering, gridding can be done in less time. It requires only that

the individual be able to count accurately and that he be able to accurately convert the partial squares

into an equivalent number of whole squares. See Figure 1 above for an illustration

of this method. There are approximately 159 squares within the scale drawing (including the partial

squares that overlap the boundary); thus, since each square represents 1/4 hectare, the field contains

about 40 hectares.

14.3.1.4 Dot counting.--Dot counting is essentially the same as gridding except that instead of small

squares, the grid consists of uniformly spaced dots. Each dot represents a unit area

according to the scale of the drawing. One need only count the dots lying within the

perimeter of the drawing to find the area. If any dots lie on the boundary, only half of

them are counted.
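Dot counting reduces to simple arithmetic. The sketch below uses hypothetical counts, with each dot representing 1/4 hectare as in Figure 1, and applies the half-weight rule for boundary dots just described:

```python
# Dot-count area estimate: interior dots count 1, boundary dots count 1/2.
# Hypothetical counts; area per dot follows from the scale of the drawing.

def dot_count_area(interior_dots, boundary_dots, area_per_dot):
    effective_dots = interior_dots + 0.5 * boundary_dots
    return effective_dots * area_per_dot

# e.g. 150 interior dots and 18 boundary dots at 1/4 hectare per dot:
print(dot_count_area(150, 18, 0.25))  # -> 39.75 hectares
```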

14.3.1.5 Map cutting and weighing.--By this procedure, the map or photograph of the area is

carefully cut into pieces representing different categories of land along the lines drawn by

the field worker. Each piece is then carefully weighed. The estimation is based on the


weight of the paper in each category relative to the weight for the entire area. This

procedure is not very practical; it is time consuming and requires a weighing instrument of

high precision and map paper of uniform quality.

Some methods of objectively measuring area do not require direct measurement of the land itself.

Instead, the proportion of land falling into various categories is estimated by some objective means

and multiplied by the known total area of land in the universe (Province, district, etc.) to estimate the

total area in each category. All of the methods discussed below except the last one (the road-substitution method described in section 14.3.2.2) require accurate, up-to-date maps or aerial photographs;

consequently, their usefulness is somewhat limited at this time. However, as progress is made in

aerial photography, these and similar methods are likely to become more generally useful in the

future.

14.3.2.1 Observations for a sample of points.--A sample of points is selected and the points

marked on maps or aerial photographs. In selecting the sample of points, appropriate techniques of

stratification and clustering should be used to maximize the efficiency of the design. For example, if

primary interest is in the estimation of crop areas, higher sampling rates should be used in those

portions of the universe known to consist primarily of crop land.

If only broad categories of land use are to be estimated, and suitable aerial photographs are available,

it may be possible to make the necessary observations directly from the photographs. For most

purposes, however, it will be necessary to send observers to the field to locate each sample point and

to record the crop being grown or other use being made of the land at the point.

One author has suggested that for periodic surveys the sample points be permanently identified by

suitable markers, to make them easier to locate. The markers could not be placed at the exact

locations of the sample points, since they would interfere with farming operations; however, they

would be placed nearby and equipped with sighting devices aimed at the sample points. This method

has not yet been tried in the field. (Refer to "Fixed-Point Sampling--A New Method of Estimating

Crop Areas" by Thomas B. Jabine in Estadistica, published by the Inter-American Statistical

Institute, Washington, D.C., September-December 1967.)

Once the observations have been made for the sample of points, one can make an unbiased estimate

of area devoted to a particular use:

(1) For each stratum in which points were sampled at a constant rate, tally the number of sample

points in each land use category.

(2) Multiply the known total area of the stratum by the proportion of sample points devoted to that

use.


14.3.2.2 Observations for a sample of lines.--A sample of lines is selected and the lines are

marked on maps or aerial photographs. As in the case of points, appropriate techniques of

stratification and clustering should be used to increase the efficiency of the design. The usual

procedure within ultimate sampling units is to select a sample of parallel lines spaced at equal

intervals.

By using aerial photographs, or by actually pacing the lines, the investigator determines the

proportion of each line falling into each land use category. Unbiased estimates are then made from

these observations by a procedure completely analogous to that described above for point samples.
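For line samples, the analogous calculation might look like this; the line lengths and total area are hypothetical:

```python
# Line-sample estimate: the proportion of total sampled line length falling in
# a category, multiplied by the known total area of the universe.

def line_sample_estimate(lengths_in_category, total_line_length, total_area):
    proportion = sum(lengths_in_category) / total_line_length
    return proportion * total_area

# Three sampled lines totalling 12 km, of which 4.5 km cross crop land,
# in a district of 80,000 hectares:
print(line_sample_estimate([1.5, 2.0, 1.0], 12.0, 80_000))  # -> 30000.0 hectares
```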

A relatively cheap but biased form of line sampling involves the substitution of roads for a

probability sample of lines. The investigator drives a car along a prescribed route. The car is

equipped with a distance measuring device. As he drives, the investigator notes and records the

distance for which the road is bordered by each category of land being measured (specific crops, crop

land in general, pasture, woodland, etc.). Estimates are then made in the normal way for line

sampling.

This last technique is likely to be seriously biased, especially in areas where the road network is

sparse, since the pattern of land use along roads is likely to differ substantially from the overall

pattern for a given area. Techniques based on probability sampling should be used in preference if at

all possible.

Having completed area measurements on the holdings (or other units of analysis) in the sample, we

can estimate totals directly from these data by the estimation procedure which is appropriate to the

particular sample design. This procedure can usually be improved upon, however, if in addition to

making area measurements for a sample of the population, we also have available less accurate and

less expensive area data (for example, data obtained by direct interview) from the entire population.

Such data would normally come from a complete census. By means of ratio estimation, we can often

obtain estimates of population totals that will be more reliable than those that could be obtained from

either the objective measurements or the interview responses alone. The procedure is essentially the

same as that discussed in section 2.3 of chapter 10. The X-characteristic in this case would be the

actual measurement of the land obtained for a subset of the population; the Y-characteristic would be

the data collected by the interview.

Even more useful and practical is a technique called double sampling2 in which the less expensive

technique is used to obtain data from a relatively large sample of the population and the more

expensive technique to obtain data from a subsample of the basic sample. Again, ratio estimation is

used, but here the Y-characteristic is the response that is obtained by the less expensive technique,

and the sample estimate of the population total for the Y-characteristic is used in place of a total

based on 100-percent coverage.

2. Double sampling is a statistical technique useful in a variety of situations whenever a characteristic of interest that is difficult

or expensive to determine is correlated highly with another characteristic that can be determined relatively easily or

inexpensively.
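A minimal sketch of the double-sampling ratio estimate, using made-up figures (the variable names are ours, not from the text): interview responses y are cheap and collected for the large sample; accurate measurements x are made only for a subsample, and the subsample ratio x/y rescales the estimated population total of y.

```python
# Double-sampling ratio estimation with invented data.
N = 100                                  # holdings in the population
y_large = [10, 20, 30, 40]               # interview responses, large sample (n' = 4)
pairs = [(10, 10.5), (30, 31.5)]         # (interview y, measured x) for the subsample

y_total_est = N * sum(y_large) / len(y_large)                 # estimated total of y: 2500.0
ratio = sum(x for _, x in pairs) / sum(y for y, _ in pairs)   # 42 / 40 = 1.05
x_total_est = ratio * y_total_est                             # ratio estimate of the x total
print(x_total_est)  # 2625.0
```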


Compared with the method based on area measurement alone, methods using ratio estimation will be

preferred if the gain in efficiency more than offsets the cost of obtaining the supplementary

observations by the less expensive technique (either from the entire population or, in the case of

double sampling, from a larger sample from the population). The factors to be considered are:

(1) The strength of the relationship between the data obtained by the two methods. The interview

response must have a high positive correlation with the area measurement if a significant

improvement is to be obtained. One would reasonably expect this to be the case.

(2) The relative cost of the two methods. Assuming that the correlation is large enough, ratio

estimation will reduce the number of holdings requiring area measurement in order to achieve a

given level of reliability. Whether or not this reduction will offset the cost of obtaining the

interview responses depends in part upon the difference in costs between the two types of

observations.

Compared with the method based only on interview responses, the use of ratio estimation will be

preferred whenever it is believed that the bias in the interview responses is sufficient to justify the

additional expense of obtaining the area measurements. The concept of mean square error (MSE) is

needed to understand the situation more fully. Recall from previous chapters that the variance is

based on differences between estimates (x') based on samples and the value X that would be obtained

if data had been collected from all members of the population, using the same techniques. The mean

square error, on the other hand, is based on differences between estimates based on samples and the

true value of the quantity being measured (XT). If the data-collection technique is unbiased (X = XT), the MSE is equivalent to the variance; if the technique is biased, the MSE is equal to the variance plus the square of the bias (X - XT), or

(14.1) MSE = V(x') + (X - XT)²

For a given cost, data can be obtained by interview from a sample of a certain size. For the same

cost, data can be obtained by interview from a smaller sample, combined with objective

measurements from a subsample of this sample. Estimates based on the large interview sample will

have a specified MSE containing a bias component as well as a variance component. Ratio estimates

based on the combination of interview and objective measurement data will have a smaller bias but a

larger variance. The MSE may be either larger or smaller than the MSE based only on the large

interview sample depending on the variability in the population, the relative cost of the two

procedures (which determines the relative sample sizes), the relative size of the biases (or the

effectiveness of the ratio estimation procedure in reducing the bias), etc. The sampling statistician

must consider all of these factors in allocating the available resources between the two procedures.

His goal is to minimize the MSE for a given cost (or to minimize the cost of obtaining an acceptable

level of reliability).
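The trade-off can be illustrated numerically, computing MSE as variance plus squared bias; the figures below are invented purely for illustration. A large interview-only sample has low variance but a persistent response bias, while the smaller double-sampling design trades a larger variance for a smaller bias:

```python
# Comparing two designs of equal cost by mean square error.
def mse(variance, bias):
    return variance + bias ** 2

interview_only = mse(variance=400.0, bias=50.0)    # large sample, biased responses
double_sampling = mse(variance=900.0, bias=10.0)   # smaller sample, bias reduced
print(interview_only, double_sampling)  # 2900.0 1000.0 -> double sampling preferred here
```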

The goal of objective measurement of yield is usually to estimate the yield of a crop on a unit basis

(such as bushels per acre, quintals per hectare, etc.). In order to estimate the total production, it is

necessary also to have an estimate of the total area planted to the crop in question. In some


instances, only the yield is estimated by objective means, although estimates of both the yield and the

area should be based on objective measurements.

The general procedure in making objective measurements of yield (usually called "crop cutting") is

to use a random process to select small areas (usually called plots) within planted fields, and to cut and weigh the

produce from each of these plots at or near the time the remainder of the field is harvested.3 Each

different crop has different characteristics, and the same crop will behave differently in different

parts of the world. Consequently, there is no specific set of rules that can be applied to all crops or

even to the same crops in different locations. We will, however, discuss in general terms some of

the factors to be considered in planning such a program and describe some of the techniques that

have been used in the past.

Because information gained about other crops or about the behavior of the crop in question in other

countries is not directly transferable to one's own situation, pilot studies should be carried out before

establishing any program for objective measurement of yield. Pilot studies can provide important

information about most of the things that need to be considered such as sampling variability,

optimum size and shape of plot, harvesting procedures, problems such as personnel and materials

needed to carry out the work, etc. They are also useful as training devices for those who will

eventually be in charge of the full-scale operation. On the basis of the pilot studies, the investigator

can develop a sampling plan and field procedures appropriate to the conditions under which the

survey will be conducted.

After a procedure has been decided upon, it is usually advisable to put it into operation only

gradually and, after it is in full operation to carry it out for a few years simultaneously with the

procedure it is to replace. The existing program, no matter how inadequate it may be, should not be

ended until the proposed new method has been sufficiently tested and found to be clearly superior

and operationally feasible.4 After its superiority and feasibility have been established, the new

method can then serve as a basis for evaluating the bias in the old method which would not be

possible unless the two were conducted simultaneously for a few years. This is particularly

important to users of the data who are interested in examining differences or trends over a period of

years; they must know to what extent observed differences in the data are simply the result of

differences in measurement technique.

14.4.2 Variability

One must have some idea of the variability in yield of the crop to be measured in order to plan

wisely. Two aspects of variability which are of interest are:

3. Objective measurements are also used to forecast yields on the basis of observations made earlier in the season.

Since the sampling procedures used in forecasting yields are quite similar to those used in estimating yields,

only the latter are discussed in this section.

4. Actually, it may be necessary to continue the existing program in any case, particularly if data are required for administrative

areas different from those for which estimates are made using objective data. Furthermore, the existing program may collect

data on a number of crops which are not economically important enough to justify an expensive objective measurement

program.


(1) The relative variability of yields for different sizes and shapes of plots.

(2) For a plot of given size and shape, the relative magnitude of the variation among fields and the

variation among plots within a field.

In deciding which type of plot to use, the investigator must balance the variability against the cost.

He will attempt to select the plot that will give the desired degree of reliability for the lowest cost,

although other factors (for example, personnel considerations) may force him to choose one that is

not quite optimal in terms of costs and variances.

Experience has shown that in almost all cases, the variation among fields is considerably greater than

variation within fields. As a result, the number of plots selected within each sample field should be

small so that the available resources can be more efficiently expended on sampling as many different

fields as possible. In fact, in some investigations, the optimum number of plots has been only one

per field.5 A minimum of two plots is necessary, of course, if one wishes to estimate the within-field

variability from the sample; nevertheless, the investigator may choose to have only one plot per field

if the within-field component of variance is very small compared with the between-field component.

Circular, triangular, square, and rectangular plots have all been used in past studies for crops that are

scattered in the field or planted in very closely spaced rows (for example, small grains or hay). For

crops in widely spaced rows (for example, maize or cotton), rectangular plots are the logical choice;

the width is often designated in terms of rows and the length in terms of feet (or meters, etc.).

Along with the shape of the plot, a method of marking it must be specified. Rigid frames or other

devices have been used successfully for marking small plots. Ropes, chains, etc., are easier to

transport but are more difficult to place in the field if the worker has to measure and drive stakes at

the corners, etc. For a triangular plot, a closed chain with rings at the three vertices can be used quite

easily; the same device, provided it forms a right triangle, can also be used to mark rectangular plots

using a suitable combination of triangles. Large plots are usually laid out using pegs or stakes,

string, and a measuring tape.

As the size of the plot increases, the variability among plots decreases; however, since the within-

field contribution to the overall variance is usually negligible relative to the other sources of

variance, small plots are usually preferred from a practical standpoint. One man can usually do the work alone: he can place a portable frame much faster than he can stake out a large plot, he can harvest more quickly, and he has less material to handle.

Unfortunately, experience has shown that small plots almost always produce seriously biased

estimates. The reasons for this are not entirely clear, but it appears that two factors are largely

responsible:

(1) In locating the plot in the field, it is much easier for the field worker to allow the condition of

5. Theoretically, the optimum number of plots need not be an integer. As a practical matter, of course, the theoretical result

must be rounded to an integer.


the crop to influence the precise location of the smaller plot.

(2) The problem of whether to count plants on the boundary as being in or out of the plot is more

critical with the smaller plot, since the perimeter of a small plot is greater relative to its area

than is the perimeter of a large plot. The general tendency appears to be to include plants that

should be excluded and, thus, to consistently overestimate the yield. For a smaller plot, even a

single plant erroneously included can seriously affect the results.

Many different procedures have been proposed for locating plots in the field. Whatever method is

used, it is important that the field staff understand clearly how it should be done, and checks should

be made to see that they are following the instructions. Otherwise, subjective bias on the part of the

field worker will almost certainly enter into the procedure.

Ideally it would be desirable to divide the entire field into plots of the size and shape decided upon

and select the required number of plots at random. However, this is not usually practicable. A

method that has been used and is practicable whenever the field is rectangular (or can be

conveniently enclosed in a rectangle) is to locate points at random within the field; the sample plots

are then laid out in a prescribed manner about these points. For each plot to be located, the

procedure is as follows:

(1) The field worker selects a random number x between 0 and n1, where n1 represents the total

length of one dimension of the field (or of the enclosing rectangle); he selects another random

number y between 0 and n2, where n2 represents the total length of the other dimension. For a

row crop, the first dimension would usually be expressed in terms of the number of rows.6 In

other cases, the dimensions would be expressed in terms of units, such as meters, or in terms of

steps or paces.

(2) Starting at a predetermined corner, the field worker measures or paces (or counts rows) the

distance x along the appropriate side of the field (or of the enclosing rectangle); then at right

angles to this side, he measures or paces the distance y into the field.

(3) If the worker is still within the boundaries of the field, he marks the random point (for example,

by digging with his heel and driving a stake). If he is not within the boundaries of the field (he

would, of course, be within the enclosing rectangle), he uses another pair of random numbers

and repeats the process.

(4) From this point, the field worker lays out the plot. If the plot is to be circular, the random point

should be used as the center. If it is to be triangular or rectangular, the point should be used to

locate a predetermined vertex or corner; this vertex or corner is usually chosen so that the plot

will extend away from the random point in the direction that the worker has been walking.
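Steps (1) through (3) can be sketched as a simple rejection procedure. In the sketch below, `inside_field` is a hypothetical user-supplied test (trivially true for a rectangular field), and the uniform draws stand in for the worker's random-number selection; for a row crop the first coordinate would instead be an integer row number.

```python
import random

# Locate a random point in a field enclosed by an n1-by-n2 rectangle:
# draw (x, y) uniformly over the rectangle and accept only points
# falling inside the (possibly irregular) field boundary.

def random_point(n1, n2, inside_field, rng=random):
    while True:
        x = rng.uniform(0, n1)   # distance along one side of the rectangle
        y = rng.uniform(0, n2)   # distance into the field at right angles
        if inside_field(x, y):   # reject points outside the field itself
            return x, y

x, y = random_point(200, 100, lambda x, y: True)  # rectangular field: always accept
print(0 <= x <= 200 and 0 <= y <= 100)  # True
```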

Figure 2 illustrates this procedure. In this example the point (x1, y1) falls

inside the field and is accepted. The point (x2, y2) falls outside the field and is rejected. From the

6. The random number would then be selected between 1 and the total number of rows in the field (n1).


sample point, the plot would usually extend upward and to the right.

One difficulty in this scheme is that it allows plots to overlap field boundaries; any of the several

feasible rules that can be used in such cases present certain problems. Consider, for example, a field

of maize 200 rows wide and 100 meters long. Suppose that the plot is to be 4 rows wide by 6 meters

long. Suppose further that the selected row coordinate is 198 and the length coordinate is 95. From

the point of intersection of the coordinates, the plot would extend 1 meter and 1 row beyond the

boundaries of the field (the plot starts at the end of the 95th meter but includes row 198). Possible

rules that could be adopted to take care of this situation include:

(1) Instruct the worker to harvest only the partial plot 3 rows by 5 meters and, of course, to record

these dimensions on his form. Using the proper inflation factor, an unbiased estimate of the

yield for this field could be made. In this example, this procedure could be carried out rather

easily; however, if the field were irregular in shape or the plot were circular or triangular, the

worker might find it difficult to estimate the portion of the plot in the field.

(2) Instruct the worker to think of the rows as being numbered in a circular manner and similarly

the length. Thus, in this example, row 1 would be the fourth row of the plot and the first meter

in each row would be taken to finish out the length of the plot. This, too, would be an unbiased


procedure. It would, however, not be practicable for anything except rectangular plots in

regularly shaped fields. Furthermore, it might be difficult to explain it to the average field

worker. Finally it does not fit into the usual concept of a plot as a contiguous piece of land.

(3) Instruct the worker to restrict his random selection to numbers that will not allow this situation

or, equivalently, to reject plots found to overlap boundaries and select another set of

coordinates. In this case, in the example, he could do the former by restricting the selection for

rows to numbers between 1 and 197 and for length to numbers between 0 and 94. This

procedure is clearly biased since the edges of the field (in the example, the first and last four

rows and the first and last six meters) have less chance of being in the sample than does the

remainder of the field. If the yield tends to be greater or smaller than average around the edges

of the field, estimates of yield based on this method will be biased. However, this is the

simplest procedure. If the borders of the field are small in area relative to the remainder of the

field or if there is no reason to believe that the yield is different along the edges, this method

can be recommended in preference to unbiased but more difficult procedures.
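Rule (3) can be sketched as follows for the maize example (a 200-row by 100-meter field with 4-row by 6-meter plots); the function name is ours. Restricting the coordinates guarantees the plot never overlaps the boundary, at the cost of under-sampling the field edges:

```python
import random

# Restricted coordinate selection so a w-row by L-meter plot always fits
# inside an R-row by M-meter field. Simple, but slightly biased because
# the edges of the field have less chance of falling in the sample.

def restricted_plot_origin(R, M, w, L, rng=random):
    row = rng.randint(1, R - w + 1)   # plot covers rows row .. row + w - 1
    meter = rng.randint(0, M - L)     # plot spans meters meter .. meter + L
    return row, meter

row, meter = restricted_plot_origin(R=200, M=100, w=4, L=6)
print(1 <= row <= 197 and 0 <= meter <= 94)  # True, matching rule (3) above
```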

If the plots are small, the field worker will probably do the work himself, cutting the crop and

weighing it in the field. He will then take a small subsample to be sent to the central office for

drying. (It is always a good practice to return the remainder of the produce to the holder.) If plots

are large enough, it may be desirable to harvest them by the same method that the holder will use in

the regular harvest and, if possible, at the same time. This will require his cooperation and help.

The technician's method of harvesting small plots and processing the produce usually gives a higher

rate of yield than do the normal harvesting procedures used by the holder because of greater

harvesting losses in the normal methods. For some crops, these losses are substantial. In addition, it

is not possible to harvest all plots on or immediately before the harvest date. If the worker waits too

long to start harvesting, he will almost certainly find some fields harvested before he arrives;

consequently, he will need to start harvesting plots in some fields while the crop is immature. Both

of these factors will cause biased estimates if adjustments are not made.

(The harvesting of small plots measures what is often referred to as biological yield.)

One method of adjustment is to select a subsample of fields of known area and harvest them for the

holders, using the normal procedures. This provides a basis for adjusting the data collected from the

harvested plots. A similar method appropriate for some crops (for example, hay crops that are taken

from the field in the form of bales) is to arrange to weigh the entire crop in a subsample of fields as

the holder transports it from the harvested field, but allowing the holder to harvest it whenever and

however he wishes.

Another method of adjustment is to carry out a gleaning operation after harvest to estimate field

losses directly. The estimated field losses per unit area are then subtracted from the estimated

biological yield to get the actual yield. This procedure has the advantage of not requiring the worker

to be present at the harvest--an important consideration since several holders of different sample

fields may all decide to harvest on the same day. Unfortunately, experience has shown that the


problems of estimating field losses are fully as great as those of estimating the original biological

production.

As already mentioned, it is desirable that sample plots be harvested as near as possible to the date the

remainder of the field is harvested; however, this cannot always be accomplished for all fields. One

object of a pilot study would be to determine what adjustments, if any, must be made for differences

between these harvesting dates. For many crops, no adjustment is necessary because the crop has

essentially completed its growth before either date and is then in the process only of losing moisture.

An additional adjustment that must be made is for moisture content. A procedure commonly used is

to dry the material from the plots (or a subsample of it) until it is at or very near to 0% moisture

content and then to weigh it. This so-called dry weight can then be adjusted to any moisture content

desired. For many crops, a standard moisture content has been specified. If the dry material is only a

subsample of the plot, a two-step process is required. The material from the entire plot and the

subsample must be weighed separately in the field immediately after cutting. The subsample is then

dried and weighed. The dry weight of the entire plot can then be estimated using the ratio of dry to

wet weight of the subsample.
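The two-step adjustment can be sketched as follows; the weights and the 14-percent standard moisture content are invented for illustration:

```python
# Two-step dry-weight adjustment: the whole plot and a subsample are weighed
# wet in the field; the subsample is dried to ~0% moisture and weighed again.
# The subsample's dry/wet ratio estimates the plot's dry weight, which is
# then converted to the standard moisture content.

def plot_dry_weight(plot_wet, sub_wet, sub_dry):
    return plot_wet * (sub_dry / sub_wet)

def at_standard_moisture(dry_weight, standard_moisture):
    """Convert a 0%-moisture weight to weight at the standard moisture content."""
    return dry_weight / (1.0 - standard_moisture)

dry = plot_dry_weight(plot_wet=25.0, sub_wet=2.0, sub_dry=1.6)   # kg
print(dry)                               # 20.0 kg at 0% moisture
print(at_standard_moisture(dry, 0.14))   # ~23.26 kg at a 14% standard moisture
```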

Before an extensive program to measure yields objectively can be put into operation, numerous

practical problems must be solved. These include the availability of labor, the availability of

facilities for drying the crops, equipment needs, the need to coordinate the activities of the workers

with the holders' plans for harvesting their crops, etc. The problem of timing can be very difficult,

particularly when the crop is likely to be ready for harvest at the same time over a wide area. As

stated previously, one important reason for conducting pilot studies is to obtain information about

these practical problems.


Study Assignment

The segment contains a total of 100 hectares divided into categories according to the uses made

of the land. The categories are:

Crop land:

A1 - maize B - grassland

A2 - wheat C - forest

A3 - other crop land D - wasteland

A grid of 36 dots has been placed over the segment to be used in estimating the amount of land

by categories of use.

Exercise 1. Estimate the number of hectares in this segment that are used for crop land.

Exercise 4. Estimate the proportion of crop land used for maize. In what basic way does this

estimate differ from those in exercises 1 to 3?

Problem B. In the sketch above, marks on the east and west boundaries of the segment subdivide the boundaries into

40 units. Using these marks as guides, place two lines at random across the segment parallel to the north

and south boundaries.

Exercise 5. Use these parallel lines to estimate the quantities estimated in Problem A.


Exercise 6. For each quantity, compile the distribution of the estimates obtained by several trials or several persons.

Exercise 7. Draw a circle around the corner corresponding to the unique point according to each

of the definitions given below. Place the appropriate letter (a, b, c) by each circle.

(a) Northwest corner - Identify those boundary points lying farthest north. The northwest

corner is the most western of these points.

(b) Northwest corner - Identify those boundary points lying farthest west. The northwest

corner is the most northern of these points.

(c) Southwest corner - Identify those boundary points lying farthest south. The southwest

corner is the most western of these points.

Problem D. Data on the total area of crop land harvested have been obtained by interview from a simple random sample
(selected without replacement) of 24 holdings out of a population of 96 holdings. Objective measurements
have been carried out on a subsample of 8 of these holdings, selected at random without replacement. The
data are shown in the table below.

Holding   Interview   Objective measurement
   1          14              14.4
   2          79                -
   3          46                -
   4         112             116.1
   5          46                -
   6          92                -
   7          29                -
   8          40              41.9
   9          12                -
  10          78              80.4
  11          66                -
  12          43                -
  13          39                -
  14          91              93.9
  15          17              16.8
  16          68                -
  17         100                -
  18          87                -
  19          74              75.4
  20          64                -
  21          78                -
  22          40              42.6
  23          22                -
  24          55                -

Exercise 8. Estimate the total crop land harvested using the interview data only. Estimate the variance of this

estimated total.
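A sketch of the Exercise 8 computation, using the interview column of the table and the standard simple-random-sampling formulas: the estimated total is N times the sample mean, and its variance carries the finite population correction.

```python
import statistics

N = 96
interview = [14, 79, 46, 112, 46, 92, 29, 40, 12, 78, 66, 43,
             39, 91, 17, 68, 100, 87, 74, 64, 78, 40, 22, 55]
n = len(interview)

ybar = statistics.mean(interview)          # sample mean = 58
total_hat = N * ybar                       # estimated total = 5568
s2 = statistics.variance(interview)        # sample variance (n - 1 divisor)
var_total = N**2 * (1 - n / N) * s2 / n    # variance of the estimated total
print(total_hat, var_total)
```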

Exercise 9. Estimate the total crop land harvested using the objective measurement data only. Estimate the variance of

this estimate.

Exercise 10. Using the formulas given below, estimate the total crop land harvested and the variance of this estimate

using both types of data and ratio estimation.
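Since the manual's formulas are not reproduced here, the sketch below uses the standard two-phase (double-sampling) ratio estimator as one reasonable reading of Exercise 10: the interview-based total is scaled by the objective-to-interview ratio observed on the 8 subsampled holdings. It gives the point estimate only; the variance calculation depends on the manual's exact expressions.

```python
N = 96
interview = [14, 79, 46, 112, 46, 92, 29, 40, 12, 78, 66, 43,
             39, 91, 17, 68, 100, 87, 74, 64, 78, 40, 22, 55]
# (interview value, objective measurement) for the 8 subsampled holdings
pairs = [(14, 14.4), (112, 116.1), (40, 41.9), (78, 80.4),
         (91, 93.9), (17, 16.8), (74, 75.4), (40, 42.6)]

x_hat = N * sum(interview) / len(interview)              # interview-based total
r = sum(y for _, y in pairs) / sum(x for x, _ in pairs)  # objective/interview ratio
y_hat = r * x_hat                                        # ratio estimate of the total
print(round(y_hat, 1))
```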


SELECTED LIST OF REFERENCES

1. Cochran, William G. Sampling Techniques. Second edition. New York, John Wiley and

Sons. 1963.

2. Food and Agriculture Organization of the United Nations (FAO). Estimation of Areas in

Agricultural Statistics. Edited by S. S. Zarkovich. Rome, 1965.

3. Food and Agriculture Organization of the United Nations (FAO). Estimation of Crop Yields.

By V. G. Panse. Rome, 1954.

4. Food and Agriculture Organization of the United Nations (FAO). Sampling Methods and Censuses. Rome, 1965; Quality of Statistical Data. Rome, 1966.

5. Hansen, Morris H.; Hurwitz, William N.; and Madow, William G. Sample Survey Methods

and Theory. New York, John Wiley and Sons, 1953. (Volume I: Methods and

Applications; Volume II: Theory)

6. Kish, Leslie. Survey Sampling. New York, John Wiley and Sons, 1965.

7. Kniceley, Maurice R. Probability Sampling for Surveys and Censuses, Course Notes,

PSDP, 1985.

8. Megill, David J. Preliminary Recommendations for Designing the Master Frame for the

Senegal Intercensal Household Survey Program, U.S. Bureau of the Census, November

1990.

9. Neter, John and Wasserman, William. Fundamental Statistics for Business and

Economics. Boston, Mass., U.S.A., Allyn and Bacon, 1961.

10. Sampford, M. R. An Introduction to Sampling Theory. Edinburgh and London, Oliver and

Boyd, 1962.

11. Sukhatme, Pandurang V. Sampling Theory of Surveys with Applications. Ames, Iowa.

U.S.A., The Iowa State College Press, 1953. New Delhi, India, The Indian Society of

Agricultural Statistics, 1953.

12. The RAND Corporation. A Million Random Digits. Glencoe, Illinois, U.S.A., The Free

Press, 1955.

13. United Nations, Statistical Office. Handbook of Household Surveys: A Practical Guide for

Inquiries on Levels of Living. New York, 1964. (Studies in Methods, Series F, No. 10)

14. U.S. Bureau of the Census. The Current Population Survey Reinterview Program, Some

Notes and Discussion. Washington, D.C., U.S. Government Printing Office, 1963.

(Technical Paper No. 6)


15. U.S. Bureau of the Census. The Current Population Survey--A Report on Methodology.

Washington, D.C., U.S. Government Printing Office, 1963. (Technical Paper No. 7)

16. U.S. Department of Commerce. Statistical Abstract, Washington, D.C., U.S. Government

Printing Office, 1981, Table 202, P. 123.

17. Yates, Frank. Sampling Methods for Censuses and Surveys. Third edition. New York,

Hafner Publishing Company, 1960.


Annex A

GLOSSARY OF TERMS

Accuracy: Quality of survey result as measured by the closeness of the survey estimate to the

exact or true value being estimated. The accuracy is affected by both sampling error and bias.

Allocation of sample: The method used in determining how the sample should be distributed.

In stratified cluster sampling, it usually refers to the number of clusters to be allocated to each
stratum and the size of the sample selected from each cluster.

Area sample: A type of sample (usually a multistage sample) in which the sampling units are

individual land areas (segments) which can be defined on a map. The segments cover the entire

area to be included in the survey; the segments do not overlap; and, in most applications, the

boundaries of each segment must be clearly defined so they can be recognized and identified by

enumerators in the field. Often the segments are clusters of the units of analysis; for example,

clusters of farms or housing units. Each unit of analysis must be associated with one and only

one segment.

Attribute: See also ‘Characteristic.’ Quality or characteristic. This term is also used in reference

to the proportion of units having a certain characteristic.

Benchmark statistics: Statistics that provide information against which one can measure or

compare changes.

Bias: The difference between the expected value of an estimator and the true population value

being estimated. When the bias is equal to zero, the estimator is said to be “unbiased.”

The term bias is also generally used to designate an effect which deprives a statistical result

of representativeness by systematically distorting it, as distinct from a random error which may

distort on any one occasion but balances out on the average.

Bounded recall: Recall in which the respondent is reminded of the information reported in an
earlier interview and is then asked only to report on any new events that occurred subsequent to
the bounding interview. This method is usually used in income and expenditure surveys, in
which at the beginning of each bounded interview (the second and subsequent interviews) the
respondent is told about the expenditures reported during the previous interview, and is then
asked about additional expenditures made since then.

Bounding: Prevention of erroneous shifts of the timing of events by having the enumerator or

respondent supply at the start of the interview (or in a mail survey) a record of events reported in

the previous interview.

Census: Data collection program through which attempts are made to collect information about

every element (person, household, farm, etc.) in the population.

Characteristic: A variable having different possible values for different individual units of

sampling or analysis. In a sample survey, we observe or measure the values of one or more

characteristics for the units in the sample. For example, we observe (or ask about) the area of

land in rice, or the number of cattle on a farm.

Classification error: Error arising in the application of classification systems to survey data.

Cluster sample: A system of sampling in which the units of analysis of the population are

considered as grouped into clusters, and a sample of clusters is selected. The selected clusters

then determine the units to be included in the sample. The sample may include all units in the

selected clusters or a subsample of units in each selected cluster.

Clusters: See also ‘Cluster sample.’ Small groups into which a population is divided to

facilitate the data collection. The groups generally are defined so as to help break a large survey

area into workload-sized chunks and/or to reduce travel and administrative costs. Ideally, the

units in a cluster should be as heterogeneous as possible.

Coding: Coding is a technical procedure for converting verbal information into numbers or

other symbols which can be more easily counted and tabulated.

Coding error: Error that occurs during the coding of sample data. The assignment of an

incorrect code to a survey response.

Coefficient of variation: The relative standard error; that is, the standard error as a proportion of
the magnitude of the estimate. The population coefficient of variation is denoted by CV, and is
estimated from a sample by cv. The coefficient of variation of an estimate, such as a mean,
proportion, or total, is denoted by CV( ), with the estimate of interest placed inside the
parentheses. If θ̂ is an estimate of the population parameter θ, then CV(θ̂) denotes the true
coefficient of variation of the estimate and cv(θ̂) is an estimate of CV(θ̂).
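As an illustration (hypothetical data, simple random sampling, finite population correction ignored), the estimated coefficient of variation of a sample mean can be computed as:

```python
import statistics, math

data = [12, 15, 11, 14, 18, 13, 16, 15]   # hypothetical sample observations
n = len(data)
ybar = statistics.mean(data)
se_mean = math.sqrt(statistics.variance(data) / n)   # estimated standard error of the mean
cv = se_mean / ybar                                  # relative standard error
print(round(cv, 3))
```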

Conditioning effect: The effect on responses resulting from the previous collection of data

from the same respondents in recurring surveys.

Confidence interval: A range above and below the estimated value which may be expected to

enclose the true value with a known probability, assuming no bias.

Consistent estimate: An estimate of a type that (while possibly biased) approaches more

and more closely the true value being estimated as the size of sample increases, the most

common example being a ratio estimate.

Content error: Error of observation or measurement, of recording, of imputation, or of other
processing which results in associating a wrong value of the characteristic with a specified
unit.

Coverage error: The error in an estimate that results from (1) failure to include in the frame all
units belonging to the defined population, or failure to include specified units in the conduct of
the survey (undercoverage); and (2) erroneous inclusion of some units, either because of a
defective frame or because of inclusion of unspecified units, or of specified units more than
once, in the actual survey (overcoverage).

Cost function: A mathematical expression showing the cost of conducting a survey in terms of

the sample sizes and unit costs.

Editing: Preliminary step in which the responses are inspected, corrected and sometimes

precoded according to a fixed set of rules.

Efficiency: A comparative measure of one sample design relative to another with respect to

amount of precision produced per unit of cost for a given sample size.

Estimate: A numerical quantity calculated from sample data and intended to provide

information about an unknown population value.

Expected value: The average value of the sample estimates over all possible samples.

Finite Population Correction Factor (fpc): A factor that corrects the value of the variance
when the sample size is large relative to the size of the population.

Frame: A list of units which make up a population. The frame consists of previously available

descriptions of the objects or material related to the physical field in the form of maps, lists,

directories, etc., from which sampling units may be constructed and a set of sampling units

selected; and also information on communications, transport, etc., which may be of value in

improving the design for the choice of sampling units, and in the formation of strata.

Imputation: The process of developing estimates for missing or inconsistent data in a survey.

Data obtained from other units in the survey are usually used in developing the estimate.

Independent information: Data known in advance or simultaneously with the survey, which are

not based on the survey but may be used to improve the survey design. Such data may be used

for stratifying, deciding on the probabilities of selection, or estimating the final results from the

sample data.

Interviewer bias: Bias in the responses which is the direct result of the action of the interviewer.

Interviewer error: Errors in the responses obtained in a survey that are due to actions of the

interviewer.

Interviewer variance: The component of the nonsampling variance which is due to the

different ways in which different interviewers elicit or record responses.

Intraclass correlation: A measure of the degree of homogeneity (or heterogeneity) between
elementary units within a cluster. It can be used to determine how satisfactorily clusters have
been formed. For example, the closer the value is to zero (or negative), the more unlike the
elementary units are and, consequently, the better the clusters have been formed. It can also be
used to evaluate how effectively strata have been created.

Item nonresponse: The type of nonresponse in which some questions, but not all, are answered

for a particular unit. The type of nonresponse in which a question is missed for an interviewed

unit.

List: A population in which the sampling units have been numbered or otherwise identified; the

list of units can be the basis for the selection of a sample. See also Sampling Frame.

Mean square error: A measure of the accuracy of an estimate or the extent to which an estimate

from sample data differs from the true population value being estimated. If the estimates are

unbiased, the mean square error is equivalent to the variance.

Multiframe sampling: The use of two or more sampling frames to select a survey sample.

Generally necessary when the usual frame, such as an address register, will not adequately cover

the population and/or there are unique or unusually large units that must appear in the sample.

Multistage sampling: The most common type of cluster sampling. In this method, a sample of
clusters is selected, and then a subsample of units is selected within each sample cluster. If the

subsample of units is the last stage of sample selection, it is called a two-stage sample design

(although each such unit may contain more than one unit of analysis, as in an area sample). If the

subsample is also a cluster from which units are again selected, it is a three-stage design, or four-

stage design, etc.

Nonresponse: The failure to obtain information from sample units for such reasons as: not at home, refusals, incapacity, and lost questionnaires.

Noninterview adjustment: A method of adjusting the weights for interviewed units in a survey

to the extent needed to account for occupied sample units for which no information was

obtained.

Nonsampling error: The error in an estimate arising at any stage in a survey from such sources

as varying interpretation of questions by enumerators, unwillingness or inability of respondents

to give correct answers, nonresponse, improper coverage, and other sources exclusive of

sampling error. This definition includes all components of the Mean Square Error (MSE) except

sampling variance.

Optimum allocation of sample: Refers to the selection of a sample in such a way as to produce

the minimum standard error for a constant sample size or for a constant cost. It is used in both

stratified sampling and cluster sampling.

Overhead costs: Costs that are fixed and do not vary with the sample design or sample size;
they therefore do not enter into designing the sample. Included are such costs as administration,
rent, equipment, printing, and utilities.

Parameters: These are values descriptive of the population distribution and calculated from all

population units. They are estimated from a sample, the estimates being called statistics. For

normal distributions the parameters are the mean and standard deviation.

Population: Any clearly defined set of units (or elements) for which estimates are to be made.

The elements can be persons, farms, households, blocks, counties, businesses, and so on. Most of

our discussion deals with sampling from a finite population, containing a finite number of

elements.

Precision: The closeness of the sample estimate to the value that a complete count conducted
under the same survey conditions would produce. It is measured by the sampling error or
relative sampling error.

Primary sampling unit (PSU): The units making up the sampling frame for the first stage of a

multistage sample.

Probability of selection: The chance each unit has of being selected in the sample. This is

known prior to sample selection.

Probability proportionate to size (PPS): A method of sample selection in which units are

selected with unequal probability of selection, the probability for each unit being proportionate

to a measure of size. The measure of size for a unit is a number assigned to that unit in advance

of selection, which is believed to be highly correlated with the statistics to be estimated.

Probability proportionate to size is frequently abbreviated to PPS.
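One common way to carry out a single PPS draw is the cumulative-size method, sketched below with hypothetical measures of size: a random number up to the total size falls in exactly one unit's cumulative interval, so each unit's chance of selection is proportional to its size.

```python
import random

sizes = [120, 45, 300, 80, 55]   # hypothetical measures of size for 5 units
cum, total = [], 0
for s in sizes:
    total += s
    cum.append(total)            # cumulative totals: 120, 165, 465, 545, 600

random.seed(1)                   # fixed seed so the example is repeatable
r = random.randint(1, total)     # random number in 1..600
selected = next(i for i, c in enumerate(cum) if r <= c)
print("selected unit:", selected)
```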

Proportion: Measure of the relative frequency of units that possess a certain characteristic in the

population or sample.

Proportionate stratified sampling: A system of selecting a stratified sample in which the same

probability of selection is used in each stratum.

Response bias: The difference between the average of the average responses over a large
number of independent repetitions of the census and the unknown average that could be
measured if the census were accomplished under ideal conditions and without error; that is, the
difference between the average reported value over trials and the true value. It is a combined
bias: the algebraic sum of all bias terms representing diverse sources of bias.

Response error: The part of the nonsampling error which is due to the failure of the respondent
to report the correct value (respondent error) or of the interviewer to record the value correctly
(interviewer error). It includes both the consistent response biases and the variable errors of
response, which tend to balance out.

Response variance: That part of the response error which tends to balance out over repeated

trials or over a large number of interviewers. The variance among the trial means over a large

number of trials. The response variance of a survey estimator is the sum of the simple response

variance and the correlated response variance.

Response variance, correlated: The correlated response variance is the contribution to the total

variance arising from nonzero correlations (in the sense of the distribution of measurement

errors) among the response of sample units. The contribution to the total response variance from

the correlations among response deviations.

Response variance, uncorrelated: The simple response variance is the contribution to the total
variance arising from the variability of each survey response about its own expected value. In
terms of a simple random sampling design, the simple response variance is the population mean
of the variances of each population unit: the variance of the individual response deviations
over all possible trials, that is, the basic trial-to-trial variability in response, averaged over the
elements in the population.

Rotation bias: A type of bias that occurs in panel surveys which consist of repeated interviews

on the same units. Although these surveys are designed so that the estimates of a characteristic

are expected to be nearly the same for each panel in the survey, this expectation has not been

realized. For example, an estimate from a panel that is in the survey for the first time may differ

significantly from estimates from the panels that have been in the survey longer. The downward

tendency in the value of the characteristics reported if the observation of the same units is

continued over a longer period of time. For example, it was found in expenditure surveys that the

average expenditure per item per person is usually higher in the first week of the survey than in

the second or the third.

Sample: A subset of the units in a population. In this manual, the term refers to a probability
sample; that is, a sample in which each element in the population has a known probability of
selection.

Sample Survey: A data collection program through which information is collected from a

probability-selected subset of the population.

Sampling bias: That part of the difference between the expected value of the sample estimator

and the true value of the characteristic which results from the sampling procedure, the estimating

procedure, or their combination.

Sampling Distribution: The distribution of values of a statistic calculated from all possible

samples of the same size from the same population.

Sampling Error (of Estimator): That part of the error of an estimator which is due to the fact

that the estimator is obtained from a sample rather than a 100 percent enumeration using the

same procedures. The sampling error has an expected frequency distribution for repeated

samples, and the sampling error is described by stating a multiple of the standard deviation of

this distribution. That part of the difference between a population value and an estimator

thereof, derived from a random sample, which is due to the fact that only a sample of values is

observed; as distinct from errors due to imperfect selection, bias in response or estimation, errors

of observation and recording, etc. The totality of sampling errors in all possible samples of the

same size generates the sampling distribution of the statistic which is being used to estimate the

parent value.

Sampling frame: The totality of sampling units from which a sample is to be selected. The

frame may be a listing of persons or housing units; a file of records; a generalization about the

population based on information contained in a sample.

Sampling Variance: The variance of an estimator; it is denoted by V(θ̂), where θ̂ denotes any
estimator. For a simple random sample selected without replacement, the variance of the
mean is given by V(ȳ) = (1 − n/N)S²/n, where S² is the population variance, n the sample size,
and N the population size.
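For a simple random sample selected without replacement, the variance of the mean can be computed as in the following sketch (hypothetical data): the usual s²/n term is multiplied by the finite population correction (1 − n/N).

```python
import statistics

N = 50                                   # hypothetical population size
sample = [4, 7, 5, 9, 6, 8, 5, 4, 7, 5]  # hypothetical sample of n = 10
n = len(sample)
var_mean = (1 - n / N) * statistics.variance(sample) / n
print(round(var_mean, 4))
```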

Sampling plan: The actual procedure describing how sample units are to be selected and from

which sampling frames.

Sampling unit: The units to be selected. These may or may not be the same as the units of

analysis. For example, to obtain information on persons, one might use a complete listing in a

Census, or a register, and select a sample of persons directly. However, one could also select a

sample of households and include in the survey all persons in the selected households. Similarly,

one could select complete buildings, and include all persons in the sample buildings. The choice

of the most efficient sampling unit is an important consideration in the design of a survey.

Sampling with replacement: A sample obtained by first selecting one element of the

population, replacing it, then making a second selection and replacing it before making the third

selection, etc., until n selections have been made. With this method of selection, a particular unit

can be included more than once in the sample--in fact, up to n times.

Sampling without replacement: A sample obtained by selecting one element of the population

and, without replacing it, selecting one of the remaining elements; then continuing this process

until n different selections have been made. With this method, a unit can be included only once

in any sample.
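The two schemes can be contrasted in a short sketch (hypothetical population of 10 units): drawing without replacement yields distinct units, while drawing with replacement may repeat a unit.

```python
import random

population = list(range(1, 11))    # units numbered 1..10
random.seed(7)                     # fixed seed so the example is repeatable

without = random.sample(population, 4)       # without replacement: 4 distinct units
with_repl = random.choices(population, k=4)  # with replacement: units may repeat

print(without, with_repl)
```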

Self-weighting sample: A sample in which every element in the population has the same chance

of selection, although unequal probabilities may have been used at various stages of sampling.

For example, clusters may have been selected with PPS; then the sampling within a selected

cluster is done in such a way as to give each element in it the same chance of being in the sample

as the elements to be selected in other clusters.

Simple random sample (also called unrestricted random sample): The simplest type of

sampling system. For a sample of size n, each of the possible combinations of n elementary units

that may be formed from a population of N units has the same chance of selection as every other

combination of n units. Moreover, every element will have the same chance of selection as every

other element (chapters 2, 3, 4, and 5).

Standard error: A measure of the extent to which estimates from various samples differ from

their expected value. With a reasonably large sample, the distribution of sample results for all

possible samples is approximately the normal distribution, and probability statements can be

made about how close the sample can be expected to come to the expected value--the

probabilities being expressed in terms of the standard error. The standard error usually is

expressed by the Greek letter σ (sigma) or by S. See also Variance.

Statistic: A quantity computed from sample observations of a characteristic, usually for the

purpose of making an inference about the population. The characteristic may be any variable

associated with a member of the population; such as age, income, employment status, etc. The

quantity may be a total, an average, a median, or other percentile; it may also be a rate of change,

a percentage, a standard deviation or any other quantity whose value we wish to estimate for the

population.

coefficient of variation.

Stratification: The process of dividing a population into groups for the purpose of selecting a

separate sample from each group. Each group is usually made as internally homogeneous as

possible. The groups are called strata with each one referred to as a stratum.

Stratified sampling: The method of sampling from a universe which has been stratified. At least

one sample unit must be selected from each stratum, but at least two units are needed to calculate

variances. Probabilities of selection can be different from stratum to stratum.

Systematic error: As opposed to a random error, an error which is in some sense biased, that is

to say, has a distribution with mean (or some equally acceptable measure of location) not at zero.

Systematic sampling: A method of sample selection in which the population is listed in some

order and every kth element is selected for the sample.
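A sketch of the selection mechanics (hypothetical list of N = 20 units, interval k = 5): a random start between 1 and k determines the whole sample.

```python
import random

N, k = 20, 5
random.seed(3)                     # fixed seed so the example is repeatable
start = random.randint(1, k)       # random start in 1..k
sample = list(range(start, N + 1, k))   # the start, then every kth listed unit
print(sample)                           # N // k = 4 selected units
```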

Telescoping: The tendency of the respondent to allocate an event to a period other than the

reference period (also called border bias). A telescoping error occurs when the respondent

misremembers the duration of an event. While one might imagine that errors would be

randomly distributed around the true duration, the errors are primarily in the direction of

remembering an event as having occurred more recently than it did. This is due to the

respondent’s wish to perform the task required of him. When in doubt, the respondent prefers to

give too much information rather than too little.

Total Error: The difference between an estimate and its true value in the population measured as

the root mean square error, that is, the square root of the sum of variable error squared and bias

squared.

True value: The value that would be obtained if no mistakes were made and no errors existed.

Ultimate cluster: The totality of units included in the sample from a primary sampling unit.
Even when the sample is obtained through several stages of selection, all sample units coming
from one primary sampling unit form a single ultimate cluster.

Unbiased estimate: A type of estimate having the property that the average of such estimates

made from all possible samples of a given size is equal to the true value.

Unbounded recall: Ordinary type of recall, where respondents are asked for expenditures made
since a given date and no control is exercised over the possibility that respondents may
erroneously shift some of their expenditure reports into or out of the recall period.

Unit of Analysis: A unit for which we wish to obtain statistical data. The units may be persons,

households, farms or business firms; they may also be products resulting from some machine

process, etc.

Variance: The square of the standard error; it is usually written as S2 with a subscript to indicate

the statistic to which it refers. The term is usually written without a subscript for the square of

the standard deviation. Where there is any possibility of confusion, sampling variance is used

for the square of the standard error, and population variance for the square of the standard

deviation.
