
Lecture notes 5: sampling distributions and the central limit theorem

Highlights:

The law of large numbers

The central limit theorem

Sampling distributions

Formalizing the central limit theorem

Calculating probabilities associated with sample means
Two important results in inferential statistics

Two results that are important in establishing the basis for inferential statistics are the law of large numbers (LLN) and the central limit theorem (CLT).

Both of these results have to do with sample size, and the kinds of behavior we can expect from statistics calculated using large samples rather than small samples.

We will first consider the LLN, and then the CLT.



The law of large numbers
The law of large numbers tells us what tends to
happen to a sample mean as the sample size gets
bigger.

It says that, as our sample size increases, the average of our sample will tend to get closer and closer to the true average of the population from which we are sampling.
Here is a simple example: if you flip a coin twice, you
may well get two heads. In this case, you will have
flipped heads, on average, 100% of the time.

You may also get two tails. In this case, you will
have flipped heads, on average, 0% of the time.

You may also get one tail and one head, which would give you the correct average of 50%. But there is a very good chance (50%, in fact, the combined probability of HH and TT) that your average will be very far off from the true average.

If you flip a coin 10 times, chances are still pretty good that the number of heads will be far away from 5.
Now let's say you flip a coin 1,000,000 times. Again, you are unlikely to flip heads exactly 50% of the time. However, the rate at which you flip heads will almost certainly be very close to 50%.

Here is a visual example of how the proportion of coin flips that are heads approaches 50% as the number of flips increases:
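A minimal R sketch (assumed, not the original plot code) that produces this kind of picture:

set.seed(1)
flips <- rbinom(10000, size = 1, prob = 0.5)      # 1 = heads, 0 = tails
running_prop <- cumsum(flips) / seq_along(flips)  # proportion of heads so far
plot(running_prop, type = "l", log = "x",
     xlab = "Number of flips", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)                          # the true proportion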
Likewise with rolling dice. Here is an image showing the behavior of the average of the rolls as the number of rolls (labeled "trials" on the horizontal axis) increases. Note that the population average is $(1+2+3+4+5+6)/6 = 3.5$.
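And a similar hedged sketch for the dice version:

set.seed(2)
rolls <- sample(1:6, 10000, replace = TRUE)
running_avg <- cumsum(rolls) / seq_along(rolls)   # average of rolls so far
plot(running_avg, type = "l", log = "x",
     xlab = "Number of rolls (trials)", ylab = "Running average")
abline(h = 3.5, lty = 2)                          # the population mean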
The law of large numbers would also apply to
estimating human height using a sample.

The average height of adult American women is 64 inches (5 ft 4 in).

If you take a sample of two adult American women, you might happen to pick two who are taller than average, or two who are shorter than average.

If you sample 200 adult American women, these differences will mostly cancel out, and the average from your sample should be very close to 64.
(The average for women in our class is 64.51.)
The law of large numbers also tells us why casinos don't have to worry about going out of business due to a bunch of lucky gamblers.

Casino games are always designed so that the casino has an advantage, in the sense that over the long run they will tend to make money and gamblers will tend to lose money.

So, even though an individual gambler may do very well at a casino, if you combine the winnings and losses of thousands upon thousands of gamblers, the house will on average make money.
To use the dice example: I wouldn't bet $1,000 that, on a single roll of a die, the number that comes up will be less than five, even though there is a 4/6 chance of this happening and I would have the advantage.

However, I would bet $1,000 that on 10 rolls of a die, the average of all the rolls would be less than five. This is because I know that, by the law of large numbers, the average of the rolls will be close to 3.5.

Just how unlikely is it that the average of 10 rolls will be 5 or greater? We will find the answer to this using the central limit theorem.
The Central Limit Theorem
The Central Limit Theorem tells us that any distribution (no matter how skewed or strange) will produce an approximately normal distribution of sample means if you take large enough samples from it.

Furthermore, the larger the sample size, the less spread out this distribution of means becomes.

This is of great importance in statistics. It allows us to use the properties of a normal distribution when analyzing data, even when the data we are analyzing is not normal.

This is nice, because we rarely work with normally distributed data, and we are often interested in means.
Here is how it works: take any distribution you
like; for instance, this heavily positively skewed
distribution:

Now take a random sample from this distribution. I used software to take a sample of size n=2.

> # x holds the values of the skewed distribution shown above
> sample1 <- sample(x, 2)
> sample1
[1] 11 9
> mean(sample1)
[1] 10

So, we took a random sample, and got the numbers 11 and 9. Their mean is 10.
We can keep doing this over and over again, recording the mean of each sample.
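In R, this repetition can be automated (a sketch; as before, x stands for the skewed population values):

# Draw 20 samples of size n=2 and record each sample's mean
sample_means <- replicate(20, mean(sample(x, 2)))
sample_means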

Here are the results I got from doing this 20 more times. Remember, for each sample, I draw two random numbers from our skewed distribution, and then I find their average. So these are all sample means, from samples of size n=2:
22.0 4.0 17.5 14.0
16.5 6.5 14.0 19.5
33.0 10.0 16.0 9.5
32.0 9.5 19.5 6.5
6.5 8.5 12.0 7.5
Here is a histogram of the 20 sample means from
the previous slide, along with that of the original
data:

Notice that this new histogram looks much closer to a normal distribution than the original. All we did was take repeated samples of size n=2, and then graph their averages.

Let's see what happens when we increase the sample size.
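Here is a hedged R sketch of what is plotted on the next two slides (again assuming x holds the population values):

# Sampling distributions of the mean for larger sample sizes
means_n10 <- replicate(1000, mean(sample(x, 10)))
means_n30 <- replicate(1000, mean(sample(x, 30)))
par(mfrow = c(1, 2))   # two histograms side by side
hist(means_n10, main = "Sample size: n=10", xlab = "Sample mean")
hist(means_n30, main = "Sample size: n=30", xlab = "Sample mean")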
Sample size: n=10

Sample size: n=30

Despite the skew in the original data, it didn't take very large samples for the distribution of sample means to become approximately normal. Also notice that, as the sample size increases, the distribution of means becomes less spread out.
How large the sample size must be before we can be
confident that the distribution of sample means will
be normal depends upon how far from (or close to)
normal the underlying distribution is.

Extremely skewed distributions require larger sample sizes. Distributions that are already normal will always have normally distributed sample means.

As a very loose, general rule of thumb, n=30 is a safe sample size at which we can assume the distribution of sample means is normal. If the underlying distribution is already close to normal, the sample size can be much smaller. If the underlying distribution is extremely skewed, the sample size needs to be much larger.
Sampling distributions
All of the histograms we just looked at are examples
of sampling distributions.

A sampling distribution is the distribution of a statistic under repeated sampling. In other words, it tells us the values that a statistic takes on, and how often it takes them on.

Note again how these sampling distributions were created: in these examples, we kept taking new samples from the same population over and over again, and each time we recorded the sample mean.

Each histogram we created displayed a sampling distribution of means.
Sampling distributions
The central limit theorem tells us about the behavior
of the sampling distribution of a mean.

All statistics have associated sampling distributions.

Any time we calculate a statistic from a random sample, we can treat it as having come from a sampling distribution of possible values for that statistic: the values we could have obtained had our sample been different.

This concept is the basis for all of the inferential procedures we will look at.
Sir Francis Galton on the Central Limit Theorem:

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the law of frequency of error. The law would have been personified by the Greeks if they had known of it. It reigns with serenity and complete self-effacement amidst the wildest confusion. The larger the mob, the greater the apparent anarchy, the more perfect its sway. It is the supreme law of unreason. Whenever a large sample of chaotic elements are taken in hand and marshaled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.
Formalizing the Central Limit Theorem

The Central Limit Theorem can be stated formally:

For any distribution with mean $\mu$ and standard deviation $\sigma$, the distribution of sample means converges to a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, as $n$ goes to infinity.

Here, "as $n$ goes to infinity" can just be thought of as "as $n$ gets larger and larger." And the distribution of the sample means can be written as:

$\bar{X} \sim N(\mu, \sigma^2/n)$
Formalizing the Central Limit Theorem

This reads as "X-bar is distributed normally with mean mu and variance sigma squared over n."

Note that if the variance is $\sigma^2/n$, the standard deviation will be $\sigma/\sqrt{n}$.

We can also use this notation to describe the standard normal distribution:

$z \sim N(0, 1)$

i.e., z is distributed normally with mean 0 and variance 1.
Formalizing the Central Limit Theorem

Since we know that the sampling distribution of a sample mean will converge to a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$, we can convert any sample mean to a z-score and find probabilities associated with it, using a slightly modified z formula:

$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

Note that, in order to use this formula, $\mu$ and $\sigma$ must be either known or assumed.
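As a quick illustration, here is a small R helper (a sketch; the function name is ours, not from the notes) that applies this formula:

# z-score of a sample mean: z = (xbar - mu) / (sigma / sqrt(n))
z_mean <- function(xbar, mu, sigma, n) (xbar - mu) / (sigma / sqrt(n))
z_mean(5, 3.5, 1.71, 10)   # the die example below; about 2.77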
Calculating probabilities associated with sample means
Here is the example we introduced earlier: the 6 numbers on a die have a mean of 3.5 and a standard deviation of $\sqrt{35/12} \approx 1.71$. What is the probability that the average of 10 rolls will be less than 5?

Formally, this can be written as: $P(\bar{x} < 5)$

To convert $\bar{x}$ to z, we use this new z formula:

$z = \dfrac{5 - 3.5}{1.71/\sqrt{10}} \approx 2.77$

Which gives us: $P(\bar{x} < 5) = P(z < 2.77) \approx 0.997$
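We can check this in R (a sketch, assuming a fair six-sided die):

mu <- 3.5; sigma <- sqrt(35/12); n <- 10     # sigma is about 1.71
z <- (5 - mu) / (sigma / sqrt(n))            # about 2.78
pnorm(z)                                     # about 0.997
# Simulation check: share of 10-roll averages that fall below 5
mean(replicate(1e5, mean(sample(1:6, 10, replace = TRUE))) < 5)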


Calculating probabilities associated with sample means
Another example: we know that the heights of adult women in the U.S. are normally distributed with a mean of 64 inches and a standard deviation of 3 inches. What is the probability that a sample of 20 women will yield a mean height between 63 and 65 inches?

Formally, we can write this as: $P(63 < \bar{x} < 65)$

Using our z formula gives us:

$z = \dfrac{63 - 64}{3/\sqrt{20}} \approx -1.49$ and $z = \dfrac{65 - 64}{3/\sqrt{20}} \approx 1.49$

so $P(63 < \bar{x} < 65) = P(-1.49 < z < 1.49) \approx 0.864$.
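In R, one way to get this directly (using the normal distribution of the sample mean):

se <- 3 / sqrt(20)   # standard error of the mean, about 0.67
pnorm(65, mean = 64, sd = se) - pnorm(63, mean = 64, sd = se)   # about 0.864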


Calculating probabilities associated with sample means
To get a feel for how the distribution of the sample mean differs from the distribution of the original variable itself, let's find the probability that one randomly selected woman's height will be between 63 and 65 inches:

$P(63 < x < 65) = P(-0.33 < z < 0.33) \approx 0.26$

(Here $z = (63 - 64)/3 \approx -0.33$ and $z = (65 - 64)/3 \approx 0.33$; we divide by $\sigma$, not $\sigma/\sqrt{n}$, because this is a single observation.)
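The same calculation in R:

pnorm(65, mean = 64, sd = 3) - pnorm(63, mean = 64, sd = 3)   # about 0.26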
Calculating probabilities associated with sample means
Note that this probability is much smaller than the one
we calculated for a sample mean. Visually, we can
draw how the distribution of height itself differs from
the distribution of mean height when the sample size is
n=20:
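A sketch of this picture in R, overlaying the two normal curves:

# Individual heights: N(64, 3); means of n=20 samples: N(64, 3/sqrt(20))
curve(dnorm(x, mean = 64, sd = 3), from = 55, to = 73,
      xlab = "Height (inches)", ylab = "Density")
curve(dnorm(x, mean = 64, sd = 3 / sqrt(20)), add = TRUE, lty = 2)
legend("topright", c("One woman", "Mean of n = 20"), lty = c(1, 2))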
Putting the LLN and CLT together

The central limit theorem can also be understood in terms of the law of large numbers.

The law of large numbers tells us that, as our sample size increases, the mean of our sample is more and more likely to be close to the true mean.

If we take lots of samples from a population (i.e., if we obtain a sampling distribution), then each sample mean is more likely to be close to the true mean if the sample size is large than if it is small.
Putting the LLN and CLT together
So, if we have a sampling distribution of means
taken from a population, then the larger each sample
was, the less spread out around the true population
mean this distribution will be.

This agrees with what we know from the central limit theorem: that, as our sample size gets larger, the sampling distribution of the mean becomes both more normal and less spread out, since the standard deviation of these means gets smaller.
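A quick R check of this shrinkage (a sketch using die rolls): the standard deviation of the sample means should be close to $\sigma/\sqrt{n}$.

sigma <- sqrt(35/12)   # sd of one die roll, about 1.71
means_n10 <- replicate(1e4, mean(sample(1:6, 10, replace = TRUE)))
sd(means_n10)          # close to sigma / sqrt(10), about 0.54
sigma / sqrt(10)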
Remember the sampling distribution!
In the next set of notes, we will begin discussing
formal statistical inferential procedures.

In these procedures, we will be computing special kinds of statistics from sample data, called test statistics.

These test statistics will be treated as having come from a known sampling distribution, which will double as a probability distribution.

This will allow us to compute probabilities associated with these statistics, which in turn will help us answer scientific questions: the reason we collect data in the first place!
