
Lecture notes 5: sampling distributions and the central limit theorem

Highlights:

The law of large numbers

The central limit theorem

Sampling distributions

Formalizing the central limit theorem

Calculating probabilities associated with sample means
Two important results in inferential statistics

Two results that are important in establishing the basis for inferential statistics are the law of large numbers (LLN) and the central limit theorem (CLT).

Both of these results have to do with sample size, and the kinds of behavior we can expect from statistics calculated using large samples rather than small samples.

We will first consider the LLN, and then the CLT.



The law of large numbers
The law of large numbers tells us what tends to
happen to a sample mean as the sample size gets
bigger.

It says that, as our sample size increases, the average of our sample will tend to get closer and closer to the true average of the population from which we are sampling.
Here is a simple example: if you flip a coin twice, you
may well get two heads. In this case, you will have
flipped heads, on average, 100% of the time.

You may also get two tails. In this case, you will
have flipped heads, on average, 0% of the time.

You may also get one tail and one head, which would give you the correct average of 50%. But there is a very good chance (50%, in fact, the combined probability of HH and TT) that your average will be very far off from the true average.

If you flip a coin 10 times, chances are still pretty good that the number of heads will be far away from 5.
Now let's say you flip a coin 1,000,000 times. Again, you are unlikely to flip heads exactly 50% of the time. However, the rate at which you flip heads will almost certainly be very close to 50%.

Here is a visual example of how the proportion of coin flips that are heads approaches 50% as the number of flips increases:
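A minimal R sketch (assumed, not the original plot code) that produces this kind of picture:

set.seed(1)
flips <- rbinom(10000, size = 1, prob = 0.5)      # 1 = heads, 0 = tails
running_prop <- cumsum(flips) / seq_along(flips)  # proportion of heads so far
plot(running_prop, type = "l", log = "x",
     xlab = "Number of flips", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)                          # the true proportion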
Likewise with rolling dice. Here is an image showing the behavior of the average of the rolls as the number of rolls (labeled "trials" on the horizontal axis) increases. Note that the population average is $(1+2+3+4+5+6)/6 = 3.5$.
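And a similar hedged sketch for the dice version:

set.seed(2)
rolls <- sample(1:6, 10000, replace = TRUE)
running_avg <- cumsum(rolls) / seq_along(rolls)   # average of rolls so far
plot(running_avg, type = "l", log = "x",
     xlab = "Number of rolls (trials)", ylab = "Running average")
abline(h = 3.5, lty = 2)                          # the population mean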
The law of large numbers would also apply to
estimating human height using a sample.

The average height of adult American women is 64 inches (5 ft 4 in).

If you take a sample of two adult American women, you might happen to pick two who are taller than average, or two who are shorter than average.

If you sample 200 adult American women, these differences will mostly cancel out, and the average from your sample should be very close to 64.
(The average for women in our class is 64.51.)
The law of large numbers also tells us why casinos don't have to worry about going out of business due to a bunch of lucky gamblers.

Casino games are always designed so that the casino has an advantage, in the sense that over the long run they will tend to make money and gamblers will tend to lose money.

So, even though an individual gambler may do very well at a casino, if you combine the winnings and losses of thousands upon thousands of gamblers, the house will on average make money.
To use the dice example: I wouldn't bet $1,000 that, on a single roll of a die, the number that comes up will be less than five, even though there is a 4/6 chance of this happening and I would have the advantage.

However, I would bet $1,000 that on 10 rolls of a die, the average of all the rolls would be less than five. This is because I know that, by the law of large numbers, the average of the rolls will be close to 3.5.

Just how unlikely is it that the average of 10 rolls will be 5 or greater? We will find the answer to this using the central limit theorem.
The Central Limit Theorem
The Central Limit Theorem tells us that any distribution (no matter how skewed or strange) will produce an approximately normal distribution of sample means if you take large enough samples from it.

Furthermore, the larger the sample size, the less spread out this distribution of means becomes.

This is of great importance in statistics. It allows us to use the properties of a normal distribution when analyzing data, even when the data we are analyzing is not normal.

This is nice, because we rarely work with normally distributed data, and we are often interested in means.
Here is how it works: take any distribution you
like; for instance, this heavily positively skewed
distribution:

Now take a random sample from this distribution. I used software to take a sample of size n=2.

> # x holds the values of the skewed distribution shown above
> sample1 <- sample(x, 2)
> sample1
[1] 11 9
> mean(sample1)
[1] 10

So, we took a random sample, and got the numbers 11 and 9. Their mean is 10.
We can keep doing this over and over again, recording the mean of each sample.
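In R, this repetition can be automated (a sketch; as before, x stands for the skewed population values):

# Draw 20 samples of size n=2 and record each sample's mean
sample_means <- replicate(20, mean(sample(x, 2)))
sample_means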

Here are the results I got from doing this 20 more times. Remember, for each sample, I draw two random numbers from our skewed distribution, and then I find their average. So these are all sample means, from samples of size n=2:
22.0 4.0 17.5 14.0
16.5 6.5 14.0 19.5
33.0 10.0 16.0 9.5
32.0 9.5 19.5 6.5
6.5 8.5 12.0 7.5
Here is a histogram of the 20 sample means from
the previous slide, along with that of the original
data:

Notice that this new histogram looks much closer to a normal distribution than the original. All we did was take repeated samples of size n=2, and then graph their averages.

Let's see what happens when we increase the sample size.
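Here is a hedged R sketch of what is plotted on the next two slides (again assuming x holds the population values):

# Sampling distributions of the mean for larger sample sizes
means_n10 <- replicate(1000, mean(sample(x, 10)))
means_n30 <- replicate(1000, mean(sample(x, 30)))
par(mfrow = c(1, 2))   # two histograms side by side
hist(means_n10, main = "Sample size: n=10", xlab = "Sample mean")
hist(means_n30, main = "Sample size: n=30", xlab = "Sample mean")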
Sample size: n=10

Sample size: n=30

Despite the skew in the original data, it didn't take very large samples for the distribution of sample means to become approximately normal. Also notice that, as the sample size increases, the distribution of means becomes less spread out.
How large the sample size must be before we can be
confident that the distribution of sample means will
be normal depends upon how far from (or close to)
normal the underlying distribution is.

Extremely skewed distributions require larger sample sizes. Distributions that are already normal will always have normally distributed sample means.

As a very loose, general rule of thumb, n=30 is a safe sample size at which we can assume the distribution of sample means is normal. If the underlying distribution is already close to normal, the sample size can be much smaller. If the underlying distribution is extremely skewed, the sample size needs to be much larger.
Sampling distributions
All of the histograms we just looked at are examples
of sampling distributions.

A sampling distribution is the distribution of a statistic under repeated sampling. In other words, it tells us the values that a statistic takes on, and how often it takes them on.

Note again how these sampling distributions were created: in these examples, we kept taking new samples from the same population over and over again, and each time we recorded the sample mean.

Each histogram we created displayed a sampling distribution of means.
Sampling distributions
The central limit theorem tells us about the behavior
of the sampling distribution of a mean.

All statistics have associated sampling distributions.

Any time we calculate a statistic from a random sample, we can treat it as having come from a sampling distribution of possible values for that statistic: the values we could have obtained had our sample been different.

This concept is the basis for all of the inferential procedures we will look at.
Sir Francis Galton on the Central Limit Theorem:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the law of frequency of error. The law would have been personified by the Greeks if they had known of it. It reigns with serenity and complete self-effacement amidst the wildest confusion. The larger the mob, the greater the apparent anarchy, the more perfect its sway. It is the supreme law of unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.
Formalizing the Central Limit Theorem

The Central Limit Theorem can be stated formally:

For any distribution with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means converges to a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, as $n$ goes to infinity.

Here, "as $n$ goes to infinity" can just be thought of as "as $n$ gets larger and larger." And the distribution of the sample means can be written as:

$\bar{X} \sim N(\mu, \sigma^2/n)$
Formalizing the Central Limit Theorem

This reads as "X-bar is distributed normally with mean mu and variance sigma squared over n."

Note that if the variance is $\sigma^2/n$, the standard deviation will be $\sigma/\sqrt{n}$.

We can also use this notation to describe the standard normal distribution:

$z \sim N(0, 1)$

i.e., z is distributed normally with mean 0 and variance 1.
Formalizing the Central Limit Theorem

Since we know that the sampling distribution of a sample mean will converge to a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, we can convert any sample mean to a z-score and find probabilities associated with it, using a slightly modified z formula:

$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

Note that, in order to use this formula, $\mu$ and $\sigma$ must be either known or assumed.
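As a quick illustration, here is a small R helper (a sketch; the function name is ours, not from the notes) that applies this formula:

# z-score of a sample mean: z = (xbar - mu) / (sigma / sqrt(n))
z_mean <- function(xbar, mu, sigma, n) (xbar - mu) / (sigma / sqrt(n))
z_mean(5, 3.5, 1.71, 10)   # the die example below; about 2.77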
Calculating probabilities associated with sample means
Here is the example we introduced earlier: the 6 numbers on a die have a mean of 3.5 and a standard deviation of $\sqrt{35/12} \approx 1.71$. What is the probability that the average of 10 rolls will be less than 5?

Formally, this can be written as: $P(\bar{x} < 5)$

To convert $\bar{x}$ to z, we use this new z formula:

$z = \dfrac{5 - 3.5}{1.71/\sqrt{10}} \approx 2.77$

Which gives us: $P(\bar{x} < 5) = P(z < 2.77) \approx 0.997$
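We can check this in R (a sketch, assuming a fair six-sided die):

mu <- 3.5; sigma <- sqrt(35/12); n <- 10     # sigma is about 1.71
z <- (5 - mu) / (sigma / sqrt(n))            # about 2.78
pnorm(z)                                     # about 0.997
# Simulation check: share of 10-roll averages that fall below 5
mean(replicate(1e5, mean(sample(1:6, 10, replace = TRUE))) < 5)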


Calculating probabilities associated with sample means
Another example: we know that the heights of adult women in the U.S. are normally distributed with a mean of 64 inches and a standard deviation of 3 inches. What is the probability that a sample of 20 women will yield a mean height between 63 and 65 inches?

Formally, we can write this as: $P(63 < \bar{x} < 65)$

Using our z formula gives us:

$z = \dfrac{63 - 64}{3/\sqrt{20}} \approx -1.49$ and $z = \dfrac{65 - 64}{3/\sqrt{20}} \approx 1.49$

so $P(63 < \bar{x} < 65) = P(-1.49 < z < 1.49) \approx 0.864$.
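In R, one way to get this directly (using the normal distribution of the sample mean):

se <- 3 / sqrt(20)   # standard error of the mean, about 0.67
pnorm(65, mean = 64, sd = se) - pnorm(63, mean = 64, sd = se)   # about 0.864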


Calculating probabilities associated with sample means
To get a feel for how the distribution of the sample mean differs from the distribution of the original variable itself, let's find the probability that one randomly selected woman's height will be between 63 and 65 inches:

$P(63 < x < 65) = P(-0.33 < z < 0.33) \approx 0.26$

(Here $z = (63 - 64)/3 \approx -0.33$ and $z = (65 - 64)/3 \approx 0.33$; we divide by $\sigma$, not $\sigma/\sqrt{n}$, because this is a single observation.)
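The same calculation in R:

pnorm(65, mean = 64, sd = 3) - pnorm(63, mean = 64, sd = 3)   # about 0.26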
Calculating probabilities associated with sample means
Note that this probability is much smaller than the one
we calculated for a sample mean. Visually, we can
draw how the distribution of height itself differs from
the distribution of mean height when the sample size is
n=20:
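A sketch of this picture in R, overlaying the two normal curves:

# Individual heights: N(64, 3); means of n=20 samples: N(64, 3/sqrt(20))
curve(dnorm(x, mean = 64, sd = 3), from = 55, to = 73,
      xlab = "Height (inches)", ylab = "Density")
curve(dnorm(x, mean = 64, sd = 3 / sqrt(20)), add = TRUE, lty = 2)
legend("topright", c("One woman", "Mean of n = 20"), lty = c(1, 2))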
Putting the LLN and CLT together

The central limit theorem can also be understood in terms of the law of large numbers.

The law of large numbers tells us that, as our sample size increases, the mean of our sample is more and more likely to be close to the true mean.

If we take lots of samples from a population (i.e., if we obtain a sampling distribution), then each sample mean is more likely to be close to the true mean if the sample size is large than if it is small.
Putting the LLN and CLT together
So, if we have a sampling distribution of means
taken from a population, then the larger each sample
was, the less spread out around the true population
mean this distribution will be.

This agrees with what we know from the central limit theorem: that, as our sample size gets larger, the sampling distribution of the mean becomes both more normal and less spread out, since the standard deviation of these means gets smaller.
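A quick R check of this shrinkage (a sketch using die rolls): the standard deviation of the sample means should be close to $\sigma/\sqrt{n}$.

sigma <- sqrt(35/12)   # sd of one die roll, about 1.71
means_n10 <- replicate(1e4, mean(sample(1:6, 10, replace = TRUE)))
sd(means_n10)          # close to sigma / sqrt(10), about 0.54
sigma / sqrt(10)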
Remember the sampling distribution!
In the next set of notes, we will begin discussing
formal statistical inferential procedures.

In these procedures, we will be computing special kinds of statistics from sample data, called test statistics.

These test statistics will be treated as having come from a known sampling distribution, which will double as a probability distribution.

This will allow us to compute probabilities associated with these statistics, which in turn will help us answer scientific questions: the reason we collect data in the first place!
