The questions are usually structured and standardized. The structure is intended to reduce bias. For
example, questions should be ordered in such a way that a question does not influence the
responses to subsequent questions. Surveys are standardized to ensure reliability, generalizability, and
validity. Every respondent should be presented with the same questions, in the same order as
other respondents.
In organizational development (OD), carefully constructed survey instruments are often used as the
basis for data gathering, organizational diagnosis, and subsequent action planning. Some OD
practitioners (e.g. Fred Nickols) even consider survey-guided development as the sine qua non of
OD.
Serial surveys
Serial surveys are those which repeat the same questions at different points in time, producing
repeated measures data. There are three basic designs for a study with more than one
measurement occasion: cross-sectional design, longitudinal design, and time-series design.
Telephone
• use of interviewers encourages sample persons to respond, leading to higher response rates.
• interviewers can increase comprehension of questions by answering respondents' questions.
• fairly cost-efficient, depending on the local call charge structure
• good for large national (or international) sampling frames
• some potential for interviewer bias (e.g. some people may be more willing to discuss a
sensitive issue with a female interviewer than with a male one)
• cannot be used for non-audio information (graphics, demonstrations, taste/smell samples)
• unreliable for consumer surveys in rural areas where telephone penetration is low
• three types:
o traditional telephone interviews
o computer-assisted telephone dialing
o computer-assisted telephone interviewing (CATI)
Mail
• the questionnaire may be handed to the respondents or mailed to them, but in all cases they
are returned to the researcher via mail.
• cost is very low, since bulk postage is cheap in most countries
• long time delays, often several months, before the surveys are returned and statistical
analysis can begin
• not suitable for issues that may require clarification
• respondents can answer at their own convenience (allowing them to break up long surveys;
also useful if they need to check records to answer a question)
• no interviewer bias introduced
• large amount of information can be obtained: some mail surveys are as long as 50 pages
• response rates can be improved by using mail panels
o members of the panel have agreed to participate
o panels can be used in longitudinal designs where the same respondents are surveyed
several times
Mall intercept surveys
• shoppers at malls are intercepted - they are either interviewed on the spot, taken to a room
and interviewed, or taken to a room and given a self-administered questionnaire
• socially acceptable - people feel that a mall is a more appropriate place to do research
than their home
• potential for interviewer bias
Online surveys
• fast
• easy to manipulate by completing multiple times to skew results
Sampling
Sample selection is critical to the validity of the information that represents the populations that are
being studied. The sampling approach helps determine the focus of the study and supports the
generalizations that are being made. Careful use of biased sampling can
be justified, as long as it is noted that the resulting sample may not be a true
representation of the population of the study. There are two different approaches to sampling in
survey research:
• In the nonprobability sampling approach, the researcher does not know
each element's probability of selection in the sample. The most commonly used
nonprobability sampling method is convenience sampling, which
samples only those who are available and willing to participate in the survey. This
approach is convenient for the researcher but may sacrifice data validity due to
the lack of representation.
• The probability sampling approach for research methods gives each element a known
chance of being included in the sample. This method is closer to a true representation of the
population. It can be difficult to use due to cost of a rigorous sampling method, and difficulty
in obtaining full coverage of the target population, but the generalizations that come from it
are more likely to be closer to a true representation of the population. Different forms of
probability sampling are designed to achieve various benefits - e.g. theoretical simplicity,
operational simplicity, detailed information on subpopulations, or minimal cost. Some
common forms:
o Equal probability of selection designs (EPS), in which each element of the population
has an equal chance of being included in the sample. This uniformity makes EPS
surveys relatively simple to interpret. Forms of EPS include simple random sampling
(SRS) and systematic sampling.
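As a sketch of the two EPS forms just named, the following hypothetical Python snippet draws a simple random sample and a systematic sample from a numbered population (the population size, sample size, and seed are invented for illustration):

```python
import random

# Hypothetical population of 1000 numbered elements.
population = list(range(1000))
n = 50  # desired sample size

# Simple random sampling (SRS): every subset of size n is equally likely.
random.seed(42)
srs = random.sample(population, n)

# Systematic sampling: pick a random start, then take every k-th element.
k = len(population) // n          # sampling interval, here 20
start = random.randrange(k)
systematic = population[start::k]

print(len(srs), len(systematic))  # 50 50
```

Both designs give every element the same inclusion probability (n/N), which is what makes them EPS designs.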
Degree level: Graduates, Postgraduates, PhD; total students in each age group and degree level.
A3.:
                              DEPARTMENTS
Age           Degree Level    Eng    Hin    Pol. Sci    History    Sociology    Total
20-40
40-60
60 & Above
Total
Q4. The data given below is the distribution of employees of a business according to their efficiency.
Find the mean deviation and coefficient of mean deviation from Mean and Median:
Efficiency Index   22-26   26-30   30-34   34-38   38-42
Employees            25      35      15       5       2

MD = (Σf|D|)/N, where D = X − A and A is the mean (or the median)

Mean = Σfx/N = (25×24 + 35×28 + 15×32 + 5×36 + 2×40)/82 = 2320/82 = 28.29

From the mean:
EI       f     X    |D| = |X − 28.29|    f|D|
22-26    25    24         4.29          107.25
26-30    35    28         0.29           10.15
30-34    15    32         3.71           55.65
34-38     5    36         7.71           38.55
38-42     2    40        11.71           23.42
       N = 82                    Σf|D| = 235.02

MD (from mean) = 235.02/82 = 2.866
Coefficient of MD (from mean) = 2.866/28.29 = 0.101

Median = L + ((N/2 − cf)/f) × h = 26 + ((41 − 25)/35) × 4 = 26 + 1.83 = 27.83

From the median:
EI       f     Cf    X    |D| = |X − 27.83|    f|D|
22-26    25    25    24         3.83           95.75
26-30    35    60    28         0.17            5.95
30-34    15    75    32         4.17           62.55
34-38     5    80    36         8.17           40.85
38-42     2    82    40        12.17           24.34
       N = 82                          Σf|D| = 229.44

MD (from median) = 229.44/82 = 2.798
Coefficient of MD (from median) = 2.798/27.83 = 0.101
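The grouped-data computation above can be reproduced in a few lines of Python (class midpoints and frequencies taken from the table; small differences from the hand calculation are rounding):

```python
# Mean deviation for grouped data (a sketch of the Q4 computation).
midpoints = [24, 28, 32, 36, 40]
freqs = [25, 35, 15, 5, 2]
N = sum(freqs)                                              # 82
mean = sum(f * x for f, x in zip(freqs, midpoints)) / N     # 2320/82 ≈ 28.29
md_mean = sum(f * abs(x - mean) for f, x in zip(freqs, midpoints)) / N

# Median of grouped data: N/2 = 41 falls in the 26-30 class (cf 25 -> 60).
median = 26 + (N / 2 - 25) / 35 * 4                         # ≈ 27.83
md_median = sum(f * abs(x - median) for f, x in zip(freqs, midpoints)) / N

print(round(mean, 2), round(md_mean, 3))
print(round(median, 2), round(md_median, 3))
```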
A5.: Conditional probability is the probability of some event A, given the occurrence of some other
event B. Conditional probability is written P(A|B), and is read "the (conditional) probability of A,
given B" or "the probability of A under the condition B". When in a random experiment the event B is
known to have occurred, the possible outcomes of the experiment are reduced to B, and hence the
probability of the occurrence of A is changed from the unconditional probability into the conditional
probability given B.
Joint probability is the probability of two events in conjunction. That is, it is the probability of both
events together. The joint probability of A and B is written P(A ∩ B) or P(A, B).
Marginal probability is then the unconditional probability P(A) of the event A; that is, the probability
of A, regardless of whether event B did or did not occur. If B can be thought of as the event of a
random variable X having a given outcome, the marginal probability of A can be obtained by
summing (or integrating, more generally) the joint probabilities over all outcomes for X. For example,
if there are two possible outcomes for X with corresponding events B and B′, this means that
P(A) = P(A ∩ B) + P(A ∩ B′). This is called marginalization.
In these definitions, note that there need not be a causal or temporal relation between A and B. A
may precede B or vice versa or they may happen at the same time. A may cause B or vice versa or
they may have no causal relation at all. Notice, however, that causal and temporal relations are
informal notions, not belonging to the probabilistic framework. They may apply in some examples,
depending on the interpretation given to events.
Conditioning of probabilities, i.e. updating them to take account of (possibly new) information, may
be achieved through Bayes' theorem. In such conditioning, the probability of A given only initial
information I, P(A|I), is known as the prior probability. The updated conditional probability of A,
given I and the outcome of the event B, is known as the posterior probability, P(A|B,I).
Introduction
Consider the simple scenario of rolling two fair six-sided dice, labelled die 1 and die 2. Define the
following three events (not assumed to occur simultaneously):
A: Die 1 lands on 3.
B: Die 2 lands on 1.
C: The dice sum to 8.
The prior probability of each event describes how likely the outcome is before the dice are rolled,
without any knowledge of the roll's outcome. For example, die 1 is equally likely to fall on each of its
6 sides, so P(A) = 1/6. Similarly P(B) = 1/6. Likewise, of the 6 × 6 = 36 possible ways that a pair of dice
can land, just 5 result in a sum of 8 (namely 2 and 6, 3 and 5, 4 and 4, 5 and 3, and 6 and 2), so P(C)
= 5/36.
Some of these events can both occur at the same time; for example events A and C can happen at
the same time, in the case where die 1 lands on 3 and die 2 lands on 5. This is the only one of the 36
outcomes where both A and C occur, so its probability is 1/36. The probability of both A and C
occurring is called the joint probability of A and C and is written P(A ∩ C), so P(A ∩ C) = 1/36. On the
other hand, if die 2 lands on 1, the dice cannot sum to 8, so P(B ∩ C) = 0.
Now suppose we roll the dice and cover up die 2, so we can only see die 1, and observe that die 1
landed on 3. Given this partial information, the probability that the dice sum to 8 is no longer 5/36;
instead it is 1/6, since die 2 must land on 5 to achieve this result. This is called the conditional
probability, because it is the probability of C under the condition that A is observed, and is written
P(C | A), which is read "the probability of C given A." Similarly, P(C | B) = 0, since if we observe die 2
landed on 1, we already know the dice can't sum to 8, regardless of what the other die landed on.
On the other hand, if we roll the dice and cover up die 2, and observe die 1, this has no impact on
the probability of event B, which only depends on die 2. We say events A and B are statistically
independent or just independent and in this case

P(B | A) = P(B) = 1/6.

In other words, the probability of B occurring after observing that die 1 landed on 3 is the same as
before we observed die 1. For independent events,

P(A ∩ B) = P(A) P(B) = 1/6 × 1/6 = 1/36.

In other words, if two events are independent, their joint probability is the product of the prior
probabilities of each event occurring by itself.
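The dice probabilities above can be verified by brute-force enumeration of the 36 equally likely outcomes; a minimal Python sketch:

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of two fair dice.
outcomes = list(product(range(1, 7), range(1, 7)))

def prob(event):
    hits = sum(1 for o in outcomes if event(o))
    return Fraction(hits, len(outcomes))

A = lambda o: o[0] == 3          # die 1 lands on 3
B = lambda o: o[1] == 1          # die 2 lands on 1
C = lambda o: o[0] + o[1] == 8   # the dice sum to 8

p_A, p_B, p_C = prob(A), prob(B), prob(C)
p_AC = prob(lambda o: A(o) and C(o))
p_C_given_A = p_AC / p_A         # conditional probability P(C | A)
p_AB = prob(lambda o: A(o) and B(o))

print(p_A, p_B, p_C)             # 1/6 1/6 5/36
print(p_C_given_A)               # 1/6
print(p_AB == p_A * p_B)         # True: A and B are independent
```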
Definition
Given a probability space (Ω, F, P) and two events A, B ∈ F with P(B) > 0, the conditional probability
of A given B is defined by

P(A | B) = P(A ∩ B)/P(B).

For example, if X and Y are non-degenerate and jointly continuous random variables with density
ƒX,Y(x, y) then, if B has positive measure,

P(X ∈ A | Y ∈ B) = (∫B ∫A ƒX,Y(x, y) dx dy) / (∫B ∫ ƒX,Y(x, y) dx dy).

The case where B has zero measure can only be dealt with directly in the case that B = {y0},
representing a single point, in which case

P(X ∈ A | Y = y0) = (∫A ƒX,Y(x, y0) dx) / (∫ ƒX,Y(x, y0) dx).

If A has measure zero then the conditional probability is zero. An indication of why the more general
case of zero measure cannot be dealt with in a similar way can be seen by noting that the limit,
as all Δyi approach zero, of

P(X ∈ A | Y ∈ ∪i [yi, yi + Δyi])

depends on their relationship as they approach zero. See conditional expectation for more
information.
Derivation
The following derivation is taken from Grinstead and Snell's Introduction to Probability.
Let Ω be a sample space with probability measure P. Suppose the event E ⊆ Ω has occurred and an
altered probability P({ω} | E) is to be assigned to the elementary events {ω} to reflect the fact that E
has occurred. (In the following we will omit the curly brackets.)

For all ω ∉ E we want to make sure that the intuitive result P(ω | E) = 0 is true.

Also, without further information provided, we can be certain that the relative magnitude of
probabilities is conserved:

P(ω1 | E)/P(ω2 | E) = P(ω1)/P(ω2) for all ω1, ω2 ∈ E,

i.e. P(ω | E) = α P(ω) for all ω ∈ E, where α is a positive real constant or scaling factor chosen to
reflect the above requirement. Since the conditional probabilities must sum to one over E, α = 1/P(E),
so that P(ω | E) = P(ω)/P(E) for every ω ∈ E.
Statistical Independence
Two events A and B are statistically independent if and only if

P(A ∩ B) = P(A) P(B).

Thus, if A and B are independent, then their joint probability can be expressed as a simple product of
their individual probabilities. Equivalently, for events of positive probability,

P(A | B) = P(A) and P(B | A) = P(B).

In other words, if A and B are independent, then the conditional probability of A, given B is simply the
individual probability of A alone; likewise, the probability of B given A is simply the probability of B
alone.
The Conditional Probability Fallacy
The conditional probability fallacy is the assumption that P(A|B) is approximately equal to P(B|A).
The mathematician John Allen Paulos discusses this in his book Innumeracy, where he points out that
it is a mistake often made even by doctors, lawyers, and other highly educated non-statisticians. It
can be overcome by describing the data in actual numbers rather than probabilities.
The two quantities are related by Bayes' theorem: P(A|B) = P(B|A) P(A)/P(B).
In other words, one can only assume that P(A|B) is approximately equal to P(B|A) if the prior
probabilities P(A) and P(B) are also approximately equal.
An Example
In the following constructed but realistic situation, the difference between P(A|B) and P(B|A) may
be surprising, but is at the same time obvious.
In order to identify individuals having a serious disease in an early curable form, one may consider
screening a large group of people. While the benefits are obvious, an argument against such
screenings is the disturbance caused by false positive screening results: If a person not having the
disease is incorrectly found to have it by the initial test, they will most likely be quite distressed until a
more careful test shows that they do not have the disease. Even after being told they are well, their
lives may be affected negatively.
Suppose 1% of the group suffer from the disease, and the rest are well. Choosing an individual at
random,

P(ill) = 1% and P(well) = 99%.

Suppose that when the screening test is applied to a person not having the disease, there is a 1%
chance of getting a false positive result and a 99% chance of getting a true negative result, i.e.

P(positive | well) = 1% and P(negative | well) = 99%.

Finally, suppose that when the test is applied to a person having the disease, there is a 1% chance of
a false negative result and a 99% chance of getting a true positive result, i.e.

P(negative | ill) = 1% and P(positive | ill) = 99%.

The fraction of individuals in the whole group who are well and test negative (true negative):

P(well ∩ negative) = P(well) × P(negative | well) = 99% × 99% = 98.01%.

The fraction of individuals in the whole group who are ill and test positive (true positive):

P(ill ∩ positive) = P(ill) × P(positive | ill) = 1% × 99% = 0.99%.

The fraction of individuals in the whole group who have false positive results:

P(well ∩ positive) = P(well) × P(positive | well) = 99% × 1% = 0.99%.

The fraction of individuals in the whole group who have false negative results:

P(ill ∩ negative) = P(ill) × P(negative | ill) = 1% × 1% = 0.01%.

Furthermore, the fraction of individuals in the whole group who test positive:

P(positive) = P(ill ∩ positive) + P(well ∩ positive) = 0.99% + 0.99% = 1.98%.

Finally, the probability that an individual actually has the disease, given that the test result is positive:

P(ill | positive) = P(ill ∩ positive)/P(positive) = 0.99%/1.98% = 50%.

In this example, it should be easy to relate to the difference between the conditional probabilities
P(positive | ill) (which is 99%) and P(ill | positive) (which is 50%): the first is the probability that an
individual who has the disease tests positive; the second is the probability that an individual who
tests positive actually has the disease.
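The arithmetic of this screening example can be confirmed with a short Python sketch:

```python
# Numeric check of the screening example above (1% prevalence,
# 1% false-positive rate, 1% false-negative rate).
p_ill = 0.01
p_well = 1 - p_ill
p_pos_given_ill = 0.99    # true positive rate (sensitivity)
p_pos_given_well = 0.01   # false positive rate

true_pos = p_ill * p_pos_given_ill        # P(ill and positive)  = 0.0099
false_pos = p_well * p_pos_given_well     # P(well and positive) = 0.0099
p_pos = true_pos + false_pos              # P(positive)          = 0.0198

p_ill_given_pos = true_pos / p_pos        # Bayes' theorem
print(p_ill_given_pos)                    # 0.5
```

Even with a 99% accurate test, only half of the people who test positive are actually ill, because the disease is rare.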
Another type of fallacy is interpreting conditional probabilities of events (or a series of events) as
(unconditional) probabilities, or seeing them as being in the same order of magnitude. A conditional
probability of an event and its (total) probability are linked with each other through the formula of
total probability, but without additional information one of them says little about the other. The
fallacy of viewing P(A|B) as P(A), or as being close to P(A), is often related to some forms of statistical
bias, but it can be subtle.
Here is an example: One of the conditions for the legendary wild-west hero Wyatt Earp to have
become a legend was having survived all the duels he fought. Indeed, it is reported that he was
never wounded, not even scratched by a bullet. The probability of this happening is very small,
contributing to his fame, because events of very small probability attract attention. However, the
point is that the degree of attention depends very much on the observer. Somebody impressed by a
specific event (here, seeing a "hero") is prone to view effects of randomness differently from others
who are less impressed.
In general it does not make much sense to ask after observation of a remarkable series of events
"What is the probability of this?"; this is a conditional probability based upon observation. The
distinction between conditional and unconditional probabilities can be intricate if the observer who
asks "What is the probability?" is himself/herself an outcome of a random selection. The name "Wyatt
Earp effect" was coined in an article "Der Wyatt Earp Effekt" (in German) showing through several
examples its subtlety and impact in various scientific domains.
Q6. The probability that a football player will play at Eden Gardens is 0.6 and at Ambedkar Stadium is
0.4. The probability that he will get a knee injury when playing at Eden Gardens is 0.07, and at
Ambedkar Stadium it is 0.04. What is the probability that he would get a knee injury if he played at
Eden Gardens?
A6.:
Let E = the player plays at Eden Gardens, and I = he gets a knee injury.
P(E) = 0.6 and P(I | E) = 0.07
The probability that he plays at Eden Gardens and gets a knee injury there is
P(E ∩ I) = P(E) × P(I | E) = 0.6 × 0.07
= 0.042
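A short sketch of the calculation; it also shows the total injury probability across both grounds via the formula of total probability (an extra step beyond what the question asks):

```python
# Joint and total injury probability for the stadium example.
p_eden, p_ambedkar = 0.6, 0.4
p_injury_given_eden, p_injury_given_ambedkar = 0.07, 0.04

joint_eden = p_eden * p_injury_given_eden                       # ≈ 0.042
p_injury = joint_eden + p_ambedkar * p_injury_given_ambedkar    # ≈ 0.058

print(round(joint_eden, 3), round(p_injury, 3))
```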
MBA SEMESTER 1
MB0040 – STATISTICS FOR MANAGEMENT
Assignment Set - 2
Q1. A random sample of 6 sachets of mustard oil was examined and two were found to be leaking.
A wholesaler receives seven hundred twenty six packs, each containing 6 sachets. Find the
expected number of packs containing exactly one leaking sachet.
A1.:
n = 6 sachets per pack, N = 726 packs
From the sample, the estimated probability that a sachet leaks is p = 2/6 = 1/3, so q = 1 − p = 2/3.
The number of leaking sachets in a pack follows a binomial distribution, so
P(exactly one leaking) = 6C1 × (1/3)^1 × (2/3)^5 = 6 × (1/3) × (32/243) = 64/243 ≈ 0.2634
Expected number of packs with exactly one leaking sachet:
E = N × P(exactly one) = 726 × 0.2634 ≈ 191 packs
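The binomial computation above can be sketched in Python:

```python
from math import comb

# Binomial sketch of A1: P(leak) estimated as 2/6 from the sample.
n, p = 6, 2 / 6
packs = 726
p_one = comb(n, 1) * p**1 * (1 - p)**(n - 1)   # P(exactly one sachet leaks)
expected = packs * p_one
print(round(p_one, 4), round(expected))        # 0.2634 191
```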
Q2. What do you mean by errors in statistics? Mention the measures used to quantify them.
A2.: In statistics and optimization, statistical errors and residuals are two closely related and easily
confused measures of the deviation of a sample from its “theoretical value”. The error of a sample is
the deviation of the sample from the (unobservable) true function value; while the residual of a
sample is the difference between the sample and the estimated function value.
The distinction is most important in regression analysis, where it leads to the concept of studentized
residuals.
Suppose there is a series of observations from a univariate distribution and we want to estimate the
mean of that distribution (the so-called location model). In this case the errors are the deviations of
the observations from the population mean, while the residuals are the deviations of the
observations from the sample mean.
A statistical error is the amount by which an observation differs from its expected value; the latter
being based on the whole population from which the statistical unit was chosen randomly. For
example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly
chosen man is 1.80 meters tall, then the “error” is 0.05 meters; if the randomly chosen man is 1.70
meters tall, then the “error” is −0.05 meters. The expected value, being the mean of the entire
population, is typically unobservable, and hence the statistical error cannot be observed either.
The nomenclature arose from random measurement errors in astronomy. It is as if the measurement
of the man’s height were an attempt to measure the population mean, so that any difference
between the man’s height and the mean would be a measurement error.
A residual (or fitting error), on the other hand, is an observable estimate of the unobservable
statistical error. Consider the previous example with men’s heights and suppose we have a random
sample of n people. The sample mean could serve as a good estimator of the population mean.
Then we have:
• The difference between the height of each man in the sample and the unobservable
population mean is a statistical error, whereas
• The difference between the height of each man in the sample and the observable sample
mean is a residual.
Note that the sum of the residuals within a random sample is necessarily zero, and thus the residuals
are necessarily not independent. The statistical errors, on the other hand, are independent, and their
sum within the random sample is almost surely not zero.
One can standardize statistical errors (especially of a normal distribution) in a z-score (or “standard
score”), and standardize residuals in a t-statistic, or more generally studentized residuals.
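The claim that residuals necessarily sum to zero, while errors almost surely do not, can be checked numerically; a small Python sketch with simulated heights (the true mean, spread, and sample size are invented for illustration):

```python
import random

# Residuals (deviations from the sample mean) always sum to zero;
# errors (deviations from the true mean) need not.
random.seed(0)
true_mean = 1.75
heights = [random.gauss(true_mean, 0.07) for _ in range(10)]

sample_mean = sum(heights) / len(heights)
residuals = [h - sample_mean for h in heights]
errors = [h - true_mean for h in heights]

print(abs(sum(residuals)) < 1e-9)   # True: residuals sum to zero
print(sum(errors))                  # almost surely nonzero
```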
The standard error of the mean (SEM) is the standard deviation of the sample mean estimate of a
population mean. (It can also be viewed as the standard deviation of the error in the sample mean
relative to the true mean, since the sample mean is an unbiased estimator.) SEM is usually estimated
by the sample estimate of the population standard deviation (sample standard deviation) divided
by the square root of the sample size (assuming statistical independence of the values in the
sample):

SE(x̄) = s/√n

where
s is the sample standard deviation (i.e., the sample-based estimate of the standard deviation
of the population), and
n is the size (number of observations) of the sample.
This estimate may be compared with the formula for the true standard deviation of the mean:

SD(x̄) = σ/√n

where σ is the standard deviation of the population.
Note 1: Standard error may also be defined as the standard deviation of the residual error term.
Note 2: Both the standard error and the standard deviation of small samples tend to systematically
underestimate the population standard error and deviations: the standard error of the mean is a
biased estimator of the population standard error. With n = 2 the underestimate is about 25%, but for
n = 6 the underestimate is only 5%. Gurland and Tripathi (1971) provide a correction and equation for
this effect. Sokal and Rohlf (1981) give an equation of the correction factor for small samples of n <
20. See unbiased estimation of standard deviation for further discussion.
A practical result: Decreasing the uncertainty in your mean value estimate by a factor of two
requires that you acquire four times as many observations in your sample. Worse, decreasing
standard error by a factor of ten requires a hundred times as many observations.
If the data are assumed to be normally distributed, quantiles of the normal distribution and the
sample mean and standard error can be used to calculate approximate confidence intervals for
the mean. The following expressions can be used to calculate the upper and lower 95% confidence
limits, where x̄ is equal to the sample mean, SE is equal to the standard error for the sample mean,
and 1.96 is the 0.975 quantile of the normal distribution:

Upper 95% limit = x̄ + (SE × 1.96)
Lower 95% limit = x̄ − (SE × 1.96)
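As a sketch of these formulas, the following Python snippet computes the SEM and the approximate 95% limits for a small hypothetical sample (the height values are invented for illustration):

```python
from math import sqrt
from statistics import mean, stdev

# SEM and an approximate 95% CI for a small illustrative sample
# (hypothetical heights, in meters).
heights = [1.70, 1.74, 1.75, 1.78, 1.80, 1.83]
n = len(heights)
x_bar = mean(heights)
se = stdev(heights) / sqrt(n)      # sample SD divided by sqrt(n)

lower = x_bar - 1.96 * se
upper = x_bar + 1.96 * se
print(round(x_bar, 3), round(se, 4))
print(round(lower, 3), round(upper, 3))
```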
In particular, the standard error of a sample statistic (such as the sample mean) is the estimated
standard deviation of the error in the process by which it was generated. In other words, it is the
standard deviation of the sampling distribution of the sample statistic. The notation for standard error
can be SE or SEM (for standard error of measurement or mean).
Standard errors provide simple measures of uncertainty in a value and are often used because:
• If the standard error of several individual quantities is known, then the standard error of some
function of the quantities can be easily calculated in many cases;
• Where the probability distribution of the value is known, it can be used to calculate a good
approximation to an exact confidence interval;
• Where the probability distribution is unknown, relationships like Chebyshev’s or the
Vysochanskiï–Petunin inequality can be used to calculate a conservative confidence interval; and
• As the sample size tends to infinity, the central limit theorem guarantees that the sampling
distribution of the mean is asymptotically normal.
Correction for finite population
The formula given above for the standard error assumes that the sample size is much smaller than
the population size, so that the population can be considered to be effectively infinite in size. When
the sampling fraction is large (approximately 5% or more), the estimate of the error must be
corrected using a “finite population correction”

FPC = √((N − n)/(N − 1))

to account for the added precision gained by sampling a larger percentage of the
population. The effect of the FPC is that the error becomes zero when the sample size n is equal to
the population size N.
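A minimal sketch of the correction, assuming the FPC formula above (the numeric inputs are invented for illustration):

```python
from math import sqrt

# Finite population correction: the SE shrinks as the sample covers
# more of the population, reaching zero for a full census (n == N).
def corrected_se(s, n, N):
    fpc = sqrt((N - n) / (N - 1))
    return (s / sqrt(n)) * fpc

print(round(corrected_se(5.0, 150, 10000), 4))  # large population: small correction
print(corrected_se(5.0, 150, 150))              # full census: 0.0
```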
If values of the measured quantity A are not statistically independent but have been obtained from
known locations in parameter space x, an unbiased estimate of the error in the mean may be obtained
by multiplying the standard error above by the factor f:

f = √((1 + ρ)/(1 − ρ))

where the sample bias coefficient ρ is the average of the autocorrelation-coefficient ρij values (a
quantity between −1 and 1) for all sample point pairs. See unbiased estimation of standard deviation
for more discussion.
(Figure: Expected error in the mean of A for a sample of n data points with sample bias coefficient ρ.
The unbiased standard error plots as the ρ = 0 line with log-log slope −½.)
Q3. From a population known to have a standard deviation of 1.4, a sample of 70 individuals is
taken. The mean of this sample is found to be 6.2. Find the standard error of the mean. Also establish
an interval estimate around the sample mean using one standard deviation of the mean.
A3.:
σ = 1.4, x̄ = 6.2, n = 70
SE(x̄) = σ/√n = 1.4/√70 = 1.4/8.367 = 0.167
Interval estimate at one standard error around the sample mean:
x̄ ± SE = 6.2 ± 0.167, i.e. (6.033, 6.367)
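A quick numeric check of this answer:

```python
from math import sqrt

# Standard error of the mean for A3 (population sigma is known).
sigma, x_bar, n = 1.4, 6.2, 70
se = sigma / sqrt(n)
interval = (x_bar - se, x_bar + se)
print(round(se, 3))                          # 0.167
print(tuple(round(v, 3) for v in interval))  # (6.033, 6.367)
```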
Q4. A machine is designed so as to pack 300ml of a solution with a standard deviation of 5ml. A
sample of 150 bottles when measured had a mean content of 201.3ml. Test whether the machine is
functioning properly.(use 5% level of significance)
A4.:
H0: µ = 300 ml (the machine functions properly); H1: µ ≠ 300 ml
µ0 = 300 ml, n = 150 bottles, α = 0.05, σ = 5 ml, x̄ = 201.3 ml
Z = (x̄ − µ0)/(σ/√n) = (201.3 − 300)/(5/√150) = −98.7/0.408 = −241.8
Since |Z| = 241.8 > 1.96, H0 is rejected at the 5% level of significance.
Conclusion: The mean is not 300 ml; that is, the machine is not functioning properly.
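The test statistic can be computed directly:

```python
from math import sqrt

# One-sample Z test for A4 (two-sided, alpha = 0.05).
mu0, sigma, n, x_bar = 300.0, 5.0, 150, 201.3
z = (x_bar - mu0) / (sigma / sqrt(n))
reject = abs(z) > 1.96
print(round(z, 1), reject)   # -241.8 True
```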
Q5. Out of 2000 people surveyed, 1200 belong to urban areas and the rest to semi-urban areas. Among
1000 who visited other regions, 800 belonged to urban areas. Test at 5% level of significance whether
area and visiting other regions are dependent.
A5.:
H0: P = P0 (area and visiting other regions are independent)
n = 1000 (those who visited other regions), P = 800/1000 = 0.8
P0 = 1200/2000 = 0.6, Q0 = 1 − P0 = 0.4, α = 0.05
Z = (P − P0)/√(P0Q0/n)
= (0.8 − 0.6)/√(0.6 × 0.4/1000)
= 0.2/0.0155 = 12.91
Since |Z| = 12.91 > 1.96, H0 is rejected at the 5% level.
Conclusion: Area and visiting other regions are dependent.
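The proportion test in code:

```python
from math import sqrt

# Z test for a single proportion, using the A5 figures.
p, p0, n = 800 / 1000, 1200 / 2000, 1000
q0 = 1 - p0
z = (p - p0) / sqrt(p0 * q0 / n)
print(round(z, 2), abs(z) > 1.96)   # 12.91 True
```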
Q6. How is statistics useful for modern managers? Give examples and explain.
A6.: Modern managers often join agencies because they seek to serve and help their communities
and country. Not surprisingly, some managers are puzzled by the suggestion of engaging in research
and statistics: research appears boring in comparison with developing and implementing new
programs, and statistics seems, well, impossibly challenging with little payoff in sight.
In fact, analytical techniques involving research and statistics are increasingly in demand. Many
decisions that modern managers make involve data and analysis, one way or another. Consider the
following common uses of analysis and data:
Second, data are used to describe policies and programs. What are programs and policies
expected to achieve? How many services are programs expected to provide? What are some
milestones of achievement? How much will a program cost? These questions involve quantifiable
answers, such as the number of national guardsmen that are brought in to assist with search and
rescue efforts after a major hurricane, or the number of evacuees for whom officials expect to
provide refuge. Policies and programs can be described in quite detailed ways, involving distinct
program activities, the duration and geographic scope of activities, staffing levels, and
program budget data.
Third, programs produce much routine, administrative data that are used to monitor progress and
prevent fraud. For example, hospitals produce a large amount of data about patient visits, who
attended them, their diagnosis, billing codes, and so on. Schools produce vast amounts of data
about student achievement, student conduct, extracurricular activities, support and administrative
services, and so on. Regulatory programs produce data about inspections and compliance. In
many states, gaming devices (such as slot machines) are monitored electronically to ensure that
taxes are collected and that they are not tampered with. Managers are expected to be familiar
with the administrative data in their lines of business.
Fifth, data are used to evaluate outcomes. Legislatures and citizens want to know what return they
are getting from their tax dollars. Did programs and policies achieve their aims? Did they produce
any unexpected results?
Most grant applications require modern managers to be accountable for program outcomes.
Modern managers must demonstrate that their programs are producing effective outcomes and
that they are doing so in cost-effective ways. This demand for outcome evaluation and monitoring
far exceeds any requirement of proper funds management. Analysis can also be used to determine
the impact of different conditions on program effectiveness, leading to suggestions for improving
programs.
Data and analysis are omnipresent in programs and policies. They are there at every stage, from the
inception of programs and policies, to their very end. Of course, decisions are also based on
personal observation, political consensus, anecdotal and impressionistic descriptions, and the
ideologies of leaders. Yet data and analysis often are present, too, one way or another. This is
because analysis is useful. Specifically, quantitative analysis aids in providing an objective, factual
underpinning of situations and responses. Analysis, along with data, helps quantify the extent of
problems and solutions in ways that other information seldom can. Analysis can help quantify the
actual or likely impact of proposed strategies, for example, helping to determine their adequacy. At
the very least, a focus on facts and objective analysis might reduce judgment errors stemming from
overly impressionistic or subjective perceptions that are factually incorrect. So managers are
expected to bring data and analysis to the decision-making table.