
GEM2900: Understanding Uncertainty & Statistical Thinking

David Nott
standj@nus.edu.sg
Department of Statistics and Applied Probability
National University of Singapore

GEM2900: Understanding Uncertainty & Statistical Thinking DSAP, NUS, Semester 2, 2008/2009 – 1
More about sampling

 Last lecture we discussed some very basic methods in opinion polling.

 Suppose there is some population of individuals of size N and we want to know the proportion of those who would answer yes when asked some question.

 Call the true proportion p.

 If we can randomly sample a collection of n individuals from the population then we can count the number X who answer yes in the sample.

 Then p̂ = X/n is the proportion in the sample who answer yes, and this estimates
the proportion p in the whole population who would say yes.

More about sampling

 We said that we could think of X (and p̂) as random variables, since they would vary randomly if we were to draw different samples.

 We argued that the distribution of X was approximately binomial, Bin(n, p).

 The distribution of the random variable p̂ can tell us something about the reliability
of the results of the poll (for example, the standard deviation would tell us
something about how much p̂ is expected to vary from sample to sample).

 We had an estimate of the standard deviation of p̂, the standard error, and 1.96
times the standard error was what we called the margin of error (this has the
property that if we construct intervals p̂ plus or minus the margin of error over
many samples then for about 95% of samples the intervals will contain the true p).
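As a rough sketch of the calculation just described, the following computes p̂, a standard error and the margin of error. It assumes the usual standard error formula sqrt(p̂(1 − p̂)/n) for a binomial proportion; the poll numbers (520 yes answers out of 1000) are hypothetical and only for illustration.

```python
import math

def margin_of_error(x, n, z=1.96):
    """Return (p_hat, standard_error, margin) for x yes-answers out of n."""
    p_hat = x / n
    # Assumed standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, se, z * se

# Hypothetical poll: 520 yes-answers in a random sample of 1000 people
p_hat, se, moe = margin_of_error(520, 1000)
print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}, margin of error = {moe:.4f}")
print(f"approximate 95% interval: ({p_hat - moe:.3f}, {p_hat + moe:.3f})")
```

With a sample of this size the margin of error is roughly three percentage points.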

More about sampling

 The margin of error might tell us something about whether our sample size is adequate to give accurate results.

 However, adequacy of the sample size is often the least of your problems if you’re conducting an opinion poll.

 Another problem is ensuring that your sample is random. When we argued that the distribution of X in our sample was approximately binomial with parameters n and p we were assuming that each person in the population had the same chance of being chosen in the sample, so that the probability of a yes response for someone in our sample was p.

Example: the worst opinion poll ever

 Before the US presidential election in 1936, the magazine Literary Digest published an opinion poll that predicted an easy win for Republican candidate Alf Landon over President Franklin D. Roosevelt: 57% for Landon versus 43% for Roosevelt.

 The result of the election was that Roosevelt got 62% and Landon 38%.

 What went wrong?

 The Literary Digest poll was based on 2.3 million responses - small
sample size was certainly not an issue!

Olofsson (2007), Chapter 8.


Example: the worst opinion poll ever

 How was the poll conducted? When they selected people to be included in the poll
they chose people from address lists such as their own subscription list.

 This does not constitute a random sample.

 Also, they mailed postcards to 10 million people, and their poll results were based
on the 2.3 million who responded. The tendency to respond may be correlated
with what you’re trying to measure (someone who has just lost their job and may
be sympathetic to Roosevelt may very likely not respond).

 This is called non-response bias.

 Using a smaller sample size but making greater efforts to ensure that the sample
was randomly chosen would have yielded a much more accurate result.
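The last point can be illustrated with a small simulation. This is only a sketch: the response rates below are made up for illustration (they are not estimates of what happened in 1936), and the only number taken from the example is Roosevelt's 62% share. It compares a large mail-out with non-response bias against a small random sample in which everyone responds.

```python
import random

rng = random.Random(1936)  # fixed seed so the sketch is reproducible

TRUE_P = 0.62  # share supporting Roosevelt (election result quoted above)
# Hypothetical response rates, chosen purely for illustration
RESPOND_IF_ROOSEVELT = 0.15
RESPOND_IF_LANDON = 0.35

def biased_mail_poll(n_mailed):
    """Mail n_mailed postcards; whether a person responds depends on their vote."""
    yes = 0
    responses = 0
    for _ in range(n_mailed):
        supports_roosevelt = rng.random() < TRUE_P
        rate = RESPOND_IF_ROOSEVELT if supports_roosevelt else RESPOND_IF_LANDON
        if rng.random() < rate:
            responses += 1
            yes += supports_roosevelt
    return yes / responses, responses

def small_random_poll(n):
    """Simple random sample of size n in which everyone responds."""
    return sum(rng.random() < TRUE_P for _ in range(n)) / n

big_estimate, n_responses = biased_mail_poll(100_000)
small_estimate = small_random_poll(1_000)
print(f"biased poll ({n_responses} responses): {big_estimate:.3f}")
print(f"random sample of 1000:               {small_estimate:.3f}")
```

Under these assumed response rates the large poll settles on the wrong answer, while the small random sample lands close to the true 62%.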

Bias in polls

 Even if we can guarantee a randomly chosen sample from the population, other problems arise.

 One problem is that people may not tell the truth.

 For example, suppose you are conducting a poll to determine the extent of some activity which is illegal (for example, drug use).

 Government agencies have a need for accurate information about such things but people are not likely to respond truthfully in surveys on this topic.

 Statisticians have a variety of clever ways to get around this problem.

Confidentiality in polls

 One way of guaranteeing to people that their responses will be confidential is to use a randomization device.

 I am going to conduct an opinion poll of GEM2900 students. I will take a sample of size 20.

 The question I will ask is potentially embarrassing. The question is: have you ever cheated in an exam at NUS?

 To protect your confidentiality, we do the following. For each of the 20 people in my sample, I ask them to flip a coin twice. If they obtain two heads they are instructed to answer ‘yes’ to the question. If they obtain any other outcome they are asked to respond truthfully.

 So if they answer ‘yes’ I cannot be sure whether they really cheated (responding
truthfully) or whether they just flipped two heads.

Confidentiality in polls

 Suppose that 10 people say yes in my sample of 20.

 Now, I’d expect that since the probability of two heads when I flip a
coin twice is 1/4, about 1/4 × 20 = 5 people are going to answer
yes regardless of whether they’ve cheated or not.

 So let’s subtract 5 from both 20 and 10, and then I have an estimate
of 5 people out of 15 who are answering yes truthfully.

 So that would give an estimate of 1/3 of students who have cheated on an NUS exam (unrealistic, I hope).

 Calculation of standard errors and margins of error is a little more complex than before.
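The subtraction argument can be written as a small function. The reasoning is that the chance of a yes answer is 1/4 + (3/4)p, so we remove the expected number of forced yes answers from both the yes count and the sample size. This is a minimal sketch; the function name and the `forced_yes_prob` parameter are my own labels, not notation from the lecture.

```python
from fractions import Fraction

def randomized_response_estimate(yes_count, n, forced_yes_prob=Fraction(1, 4)):
    """Estimate the true proportion of 'yes' under the two-coin-flip scheme.

    A fraction forced_yes_prob of respondents is expected to answer 'yes'
    regardless of the truth, so that expected count is subtracted from both
    the number of yes answers and the sample size.
    """
    forced = forced_yes_prob * n            # expected forced 'yes' answers (5 here)
    return (yes_count - forced) / (n - forced)

estimate = randomized_response_estimate(10, 20)
print(estimate)  # 1/3
```

Exact fractions are used so that the result matches the hand calculation of 5 out of 15.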
Rare events and the Poisson distribution

 We’ve been looking at applications of the binomial distribution.

 A related distribution is the so-called Poisson distribution. It often arises when we are counting occurrences of rare events in time and space.

 Olofsson (2007) states the following law of rare events: suppose we are dealing with some rare, unpredictable event that occurs on average λ times. The number of occurrences is said to follow a Poisson distribution, with the probability of k events given by

\[ P(k \text{ occurrences}) = \frac{\exp(-\lambda)\,\lambda^k}{k!}. \]

 We won’t be too precise about the exact conditions needed for this law of rare
events to hold.

Poisson probability distribution

A random variable X has a Poisson distribution if its p.m.f. is

\[ P(X = x) = p_X(x) = e^{-\lambda}\,\frac{\lambda^x}{x!}, \quad x = 0, 1, 2, \ldots \]

for some λ > 0. We also write X ∼ Poisson(λ).

If X has a Poisson distribution with parameter λ, then

E[X] = Var[X] = λ

Woolfson (2008, Chapter 12)
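The p.m.f. and the fact that E[X] = Var[X] = λ can be checked numerically. The sketch below uses an arbitrary illustrative value λ = 2.5 and truncates the infinite sums at x = 99, which is an approximation (the omitted terms are negligible for this λ).

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 2.5          # arbitrary illustrative value of the parameter
xs = range(100)    # truncation of x = 0, 1, 2, ...; the tail is negligible here

total = sum(poisson_pmf(x, lam) for x in xs)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.6f}, variance = {var:.6f}")
```

Both the mean and the variance should come out equal to λ, up to floating-point error.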


Connection between binomial and Poisson distribution
Assume X follows a binomial distribution with parameters n and p where n is “large” and p is “small”.
Then for “small” values of x = 0, 1, 2, . . . we have

\[ P(X = x) \approx P(Y = x) \]

where Y follows a Poisson distribution with parameter λ = np.

NOTE:
 \( P(Y = x) = e^{-\lambda}\lambda^x / x! \) might be easier to evaluate than \( P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x} \).

 The probability that no success was observed among the n (independent) Bernoulli trials is

\[ P(X = 0) = (1 - p)^n \approx P(Y = 0) = e^{-\lambda} = e^{-np}. \]

The probability that at least one success was observed is

\[ P(X > 0) \approx P(Y > 0) = 1 - P(Y = 0) = 1 - e^{-\lambda} = 1 - e^{-np}. \]
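To get a feel for how good the approximation is, one can compare the two p.m.f.s directly. The values n = 1000 and p = 0.002 below are hypothetical, chosen only as an example of “large” n and “small” p.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """P(Y = x) for Y ~ Poisson(lam)."""
    return math.exp(-lam) * lam**x / math.factorial(x)

n, p = 1000, 0.002   # hypothetical "large n, small p" example
lam = n * p          # matching Poisson parameter, lambda = np

for x in range(5):
    print(f"x = {x}: binomial {binomial_pmf(x, n, p):.5f}, "
          f"Poisson {poisson_pmf(x, lam):.5f}")

# Largest disagreement over the small values of x considered
max_diff = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(5))
print(f"largest difference: {max_diff:.6f}")
```

For these values the two sets of probabilities agree to about three decimal places.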


An application of the law of rare events

Matching Problem
7 couples attend a dancing class where the instructor pairs everyone off
at random. What is the probability that at least one couple gets to dance
together? We calculated the exact answer in a previous lecture using the
inclusion-exclusion formula:

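\[ 7 \times \frac{1}{7} - 21 \times \frac{1}{42} + 35 \times \frac{1}{210} - 35 \times \frac{1}{840} + 21 \times \frac{1}{2520} - 7 \times \frac{1}{5040} + 1 \times \frac{1}{5040} \]
\[ = 1 - \frac{1}{2} + \frac{1}{6} - \frac{1}{24} + \frac{1}{120} - \frac{1}{720} + \frac{1}{5040} \approx 0.6321 \]

Do you notice a pattern in the fractions? For n couples the answer would be

\[ \frac{1}{1!} - \frac{1}{2!} + \frac{1}{3!} - \frac{1}{4!} + \frac{1}{5!} - \frac{1}{6!} + \cdots \pm \frac{1}{n!} \approx 0.6321 \]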
Olofsson (2007)
An application of the law of rare events

 We can think of the problem as writing down the numbers 1 to 7 in one column,
and then writing down a random permutation of the numbers 1 to 7 in a second
column alongside. What is the chance of a match?

 For any position, the chance of a match is 1/7. So with 7 positions we expect an average of 7 × 1/7 = 1 match. If a match is considered “rare” then the law of rare events says that the chance of zero matches is exp(−1) and hence the chance of one or more matches is 1 − exp(−1). This is 0.6321. Remarkably the law of rare events gives an answer accurate to four decimal places here.

 Note that the terms in the expression that we got from the inclusion-exclusion
formula can be derived from approximating exp(−1) by the first eight terms in its
series expansion, for those of you who know what this means.
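The comparison between the exact inclusion-exclusion answer and the law of rare events can be reproduced in a few lines. This is a sketch only; the function name is my own.

```python
import math
from fractions import Fraction

def prob_at_least_one_match(n):
    """Exact probability of at least one match among n couples:
    1/1! - 1/2! + 1/3! - ... +/- 1/n! (the inclusion-exclusion result)."""
    return sum(Fraction((-1) ** (k + 1), math.factorial(k)) for k in range(1, n + 1))

exact = prob_at_least_one_match(7)   # exact answer as a fraction
approx = 1 - math.exp(-1)            # law of rare events with lambda = 1

print(f"exact:  {exact} = {float(exact):.6f}")
print(f"approx: {approx:.6f}")
```

The two values differ only from the fifth decimal place onwards, which is why both round to 0.6321.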

Poisson probability distribution (cont.)

EXAMPLE: (Death by Horse Kick in Prussian Army)

Data from von Bortkiewicz (1898) on number of deaths by horse or mule kicks in 10 (of 14 reported) corps of the Prussian army from 1875 to 1894. 4 corps were not considered by Fisher (1925) as they had a different organisation.

# Deaths per corps per year     0    1    2    3    4   >4
# corps years                 109   65   22    3    1    0

Woolfson (2008, Chapter 12.4)


von Bortkiewicz, L. (1898). Das Gesetz der kleinen Zahlen, Teubner, Leipzig.
Fisher, R.A. (1925). Statistical Methods for Research Workers, Oliver & Boyd, London.
Poisson probability distribution (cont.)

EXAMPLE: (Death by Horse Kick in Prussian Army cont.)

If a Poisson distribution fits the data, then the parameter of that distribution is its expected value.

In this example, the parameter can be interpreted as the expected number of deaths per corps per year.

It is natural to estimate the parameter from the data by calculating the observed average number of deaths per corps per year.

In the 200 corps years there have been 122 deaths:

65 × 1 + 22 × 2 + 3 × 3 + 1 × 4 = 122

Thus, there have been an estimated λ̂ = 122/200 = 0.61 deaths per corps per year.
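The same estimate can be computed directly from the frequency table. A minimal sketch, with variable names of my own choosing:

```python
# Frequency table from von Bortkiewicz's data as given above
deaths = [0, 1, 2, 3, 4]             # deaths per corps per year
corps_years = [109, 65, 22, 3, 1]    # number of corps years with that many deaths

total_corps_years = sum(corps_years)
total_deaths = sum(d * c for d, c in zip(deaths, corps_years))

# Estimated Poisson parameter: observed average deaths per corps per year
lam_hat = total_deaths / total_corps_years

print(total_corps_years, total_deaths, lam_hat)
```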
Poisson probability distribution (cont.)

EXAMPLE: (Death by Horse Kick in Prussian Army cont.)

If we take λ̂ = 0.61 as the true value of the parameter, what kind of data would we expect if the data were in fact generated from the Poisson distribution?
If X follows a Poisson distribution with parameter 0.61, then

  x    P(X = x)    expected # corps years    observed
  0    0.543       108.67                    109
  1    0.331        66.29                     65
  2    0.101        20.22                     22
  3    0.0206        4.11                      3
 ≥4    0.00356       0.71                      1

This compares favourably with the observed values.


Poisson probability distribution (cont.)

EXAMPLE: (Death by Horse Kick in Prussian Army cont.)

A word of explanation about how the “expected # corps years” column was calculated in the table.

Under the model, there is a probability of 0.543 of a zero count for the number of deaths in one year for one of the corps. With 200 corps years observed, we expect 54.3% of the 200 corps years to result in a zero, i.e. an expected count of 0.54335 × 200 = 108.67 (the probability is shown rounded to 0.543 in the table).

Other entries in the table are calculated similarly.
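The whole “expected # corps years” column can be reproduced as follows. This is a sketch under the same assumption as the table, namely that λ̂ = 0.61 is treated as the true parameter; the last category lumps together all counts of 4 or more.

```python
import math

lam_hat = 0.61        # estimated deaths per corps per year
n_corps_years = 200   # total number of corps years observed
observed = [109, 65, 22, 3, 1]

# Poisson probabilities for x = 0, 1, 2, 3
probs = [math.exp(-lam_hat) * lam_hat**x / math.factorial(x) for x in range(4)]
# Remaining probability is P(X >= 4)
probs.append(1 - sum(probs))

# Expected count = probability times number of corps years
expected = [n_corps_years * p for p in probs]

labels = ["0", "1", "2", "3", ">=4"]
for label, p, e, o in zip(labels, probs, expected, observed):
    print(f"{label:>3}  {p:.5f}  {e:7.2f}  {o:4d}")
```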

Poisson probability distribution (cont.)

EXAMPLE: (Death by Horse Kick in Prussian Army cont.)

How do we compare observed frequencies with expected frequencies?

Just look at the (squared) deviation?

Technical considerations suggest looking at

\[ \chi^2 = \sum_{i=1}^{c} \frac{(o_i - e_i)^2}{e_i} \]

if we have data that fall into c categories; \( o_i \) and \( e_i \) are the observed and expected frequency for category i, respectively.

Woolfson (2008, Chapters 6.2, 8 and 12.4)


Poisson probability distribution (cont.)

EXAMPLE: (Death by Horse Kick in Prussian Army cont.)

For this example:

\[ \chi^2 = \sum_{i=1}^{c} \frac{(o_i - e_i)^2}{e_i} \]
\[ = \frac{(109 - 108.67)^2}{108.67} + \frac{(65 - 66.29)^2}{66.29} + \frac{(22 - 20.22)^2}{20.22} + \frac{(3 - 4.11)^2}{4.11} + \frac{(1 - 0.71)^2}{0.71} \]
\[ = 0.599929 \]

(The value 0.599929 is obtained with unrounded expected frequencies; the rounded values shown give about 0.601.)

But is this value “typical” or “suspiciously large” if the Poisson distribution were assumed to generate the data?
It can be shown that (under some conditions) this statistic approximately follows a \( \chi^2_d \) distribution with a particular parameter d if the Poisson is the true model.
Woolfson (2008, Chapters 6.2, 8 and 12.4)
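The statistic for the horse kick data can be computed as below. This is a sketch that only evaluates the sum; it recomputes the expected frequencies from λ̂ = 0.61 without rounding, and it does not attempt the comparison with the χ² distribution.

```python
import math

observed = [109, 65, 22, 3, 1]   # corps years with 0, 1, 2, 3, >=4 deaths
lam_hat = 0.61
n_corps_years = 200

# Expected frequencies under Poisson(lam_hat), last category is x >= 4
probs = [math.exp(-lam_hat) * lam_hat**x / math.factorial(x) for x in range(4)]
probs.append(1 - sum(probs))
expected = [n_corps_years * p for p in probs]

# Chi-squared statistic: sum over categories of (o_i - e_i)^2 / e_i
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi-squared = {chi2:.6f}")
```

The result should agree with the 0.5999 quoted above, up to rounding.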