
In this video, we will define sampling distributions.

We're going to introduce the central limit theorem and review the conditions required for the theorem to apply. We're also going to do some simulation demos to illustrate the central limit theorem and start talking about why it works, without going into a theoretical proof, as well as how it works and why it might be of use to us.

Say we have a population of interest, and we take a random sample from it. Based on that sample, we calculate a sample statistic, for example, the mean of that sample. Then suppose we take another random sample and also calculate and record its mean. Then we do this again, and again, many more times. Each one of these samples will have its own distribution, which we call a sample distribution. Each observation in these distributions is a randomly sampled unit from the population, say, a person, or a cat, or a dog, depending on what population you are studying. The values we recorded from each sample, the sample statistics, also now make a new distribution, where each observation is not a unit from the population but a sample statistic, in this case, a sample mean. The distribution of these sample statistics is called the sampling distribution. So the two terms, sample distribution and sampling distribution, sound similar, but they're different concepts.

Let's work through a more concrete example. Suppose we're interested in the average height of US women. Our population of interest is US women. We'll call capital N the population size, and our parameter of interest is the average height of all women in the US, which we denote as mu. Let's assume that we have height data from every single woman in the US. Using these data, we could find the population mean; 65 inches is probably a reasonable estimate. Using the same population data, we can also calculate the population standard deviation, which we usually call sigma. We wouldn't expect this number, the sigma, to be very small, since the heights of all women in the US are probably quite variable. It's possible to find a woman as short as 4 feet or as tall as 7 feet.

Then, let's assume that we take random samples of 1,000 women from each state. We'll start with the first state on the alphabetical list, Alabama. We sample 1,000 women from Alabama. We represent each woman in our sample with an x, and we use subscripts to keep track of the state as well as the observation number, ranging from 1 to 1,000. Then we collect data from 1,000 women from each of a bunch more states, including North Carolina, where I happen to be currently located, and then a bunch more, until finally we get to the last state on the alphabetical list, Wyoming. For each state, we calculate the state's mean, which we denote as x bar. So now we have a data set consisting of a bunch of means, 50 to be exact, since there are 50 states. We call this distribution the sampling distribution. The mean of the sample means will probably be around the true population mean, roughly 65 inches as well. The standard deviation of the sample means will probably be much lower than the population standard deviation, since we would expect the average heights for each state to be pretty close to one another. For example, we wouldn't expect to find a state where the average height of a random sample of 1,000 women is as low as 4 feet or as high as 7 feet. We call the standard deviation of the sample means the standard error. In fact, as the sample size, n, increases, the standard error will decrease. The fewer women we sample from each state, the more variable we would expect the sample means to be.
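As a minimal sketch of this scheme in code (the video uses an applet, not code), assuming numpy and a made-up population standard deviation of 3.5 inches, since the lecture never states sigma, we can draw 50 samples of 1,000 "women" each and record the 50 sample means:

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed values for illustration: mu matches the lecture's 65 inches,
# but sigma = 3.5 inches is a made-up population standard deviation.
mu, sigma, n_states, n = 65, 3.5, 50, 1000

# One random sample of n = 1,000 women per state; record each state's mean
sample_means = np.array(
    [rng.normal(mu, sigma, size=n).mean() for _ in range(n_states)]
)

print(sample_means.mean())        # mean of the 50 means: close to 65
print(sample_means.std(ddof=1))   # standard error: far smaller than sigma
```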

Next, we're going to illustrate what we were just talking about, in terms of sampling distributions, their shapes, centers, and spreads, using an applet that simulates a bunch of sampling distributions for us, given certain parameters of the population distribution and its shape. If you would like to play along with us, you can follow the URL on the screen.

Let's start with the default case of a normal distribution for the population, with mean 0 and standard deviation 20. Let's take samples of, say, size 45 from this population. What we can see here is that each one of these dot plots shows us one sample of 45 observations from the normal population. We can see that the center of each one of these samples is close to 0, though not exactly 0. And we can also see that the sample mean varies from one sample to another. Since these are random samples from the population, each time we reach out to the population and grab 45 observations, we may not be getting the same sample; in fact, we will not be getting the same sample. And therefore, the x bars for each sample are slightly different. The standard deviation of each one of these samples should be roughly equal to the population standard deviation, because, after all, each one of these samples is simply a subset of our population. We have illustrated the first eight samples here, but we're actually taking 200 samples from the population. We can make this a very large number, say, 1,000 samples from the population. And what we have at the very bottom is basically our sampling distribution. Each one of the sample means, once calculated, gets dropped to the lower plot, and what we're seeing here is a distribution of sample means.

Since we saw that the sample means had some variability among them, the sampling distribution basically illustrates for us what this variability looks like. The sampling distribution, as we expected, looks just like the population distribution, so nearly normal, and the center of the sampling distribution, that is, the mean of the means, is close to the true population mean of 0. However, one big difference between our population distribution up top and our sampling distribution at the bottom is the spread of these distributions. The sampling distribution at the bottom is much skinnier than the population distribution up top. While the standard deviation of the population distribution is 20, the standard error, that is, the standard deviation of the sample means, is only 2.93. The reason for this is that while individual observations can be very variable, it is unlikely that sample means are going to be very variable. So, if we want to decrease the variability of the sample means, that is, to take samples that have more consistent means, we would want to increase our sample size.

Let's say that we increase our sample size all the way to 500. What we have here is, again, our same population distribution. Here, we're seeing the first eight of the 1,000 samples being taken from the population. The distributions look much denser here because we simply have more observations; each one of these samples represents 500 observations from the population. We can also see that the means are, again, variable, but let's check to see if they're as variable as before. The curve is indeed skinnier, so the higher the sample size of each sample you're taking from the population, the less variable the means of those samples. And indeed, we can see this graphically, looking at the curve, and numerically, looking at the value of the standard error.
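Here is a rough replication of what the applet is doing, assuming numpy; it reproduces both standard errors we just read off the screen:

```python
import numpy as np

rng = np.random.default_rng(7)
sigma = 20  # population: normal with mean 0 and standard deviation 20

for n in (45, 500):
    # 1,000 samples of size n; one mean per sample
    means = rng.normal(0, sigma, size=(1000, n)).mean(axis=1)
    print(n, means.std(ddof=1))
# n = 45  -> standard error near 2.9 (the applet showed 2.93)
# n = 500 -> standard error near 0.9 (a much skinnier sampling distribution)
```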

Now it's finally time to introduce the central limit theorem. The central limit theorem says that the sampling distribution of the mean, the distribution of sample means from many samples, is nearly normal, centered at the population mean, with standard error equal to the population standard deviation divided by the square root of the sample size. Note that this is called the central limit theorem because it's central to much of statistical inference theory. So the central limit theorem tells us about the shape of the sampling distribution, which it says is going to be nearly normal; the center, which it says is going to be at the population mean; and the spread, which we measure using the standard error.

If sigma is unknown, which is often the case, remember, sigma is the population standard deviation, and oftentimes we don't have access to the entire population to calculate this number, we use s, the sample standard deviation, to estimate the standard error. That would be the standard deviation of the one sample that we happen to have at hand. In the earlier demo, the simulation, we talked about taking many samples, but if you're running a study, as you can imagine, you would only take one sample. So it's the standard deviation of that sample that we would use as our best guess for the population standard deviation. So it wasn't a coincidence that the sampling distribution we saw earlier was symmetric and centered at the true population mean, and that as n, the sample size, increased, the standard error decreased. We won't go through a detailed proof of why the standard error is equal to sigma over the square root of n, but understanding the inverse relationship between them is very important.
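Written compactly (a standard formulation consistent with what's said above, not a slide from the video):

```latex
\bar{x} \sim N\!\left(\text{mean} = \mu,\; \text{SE} = \frac{\sigma}{\sqrt{n}}\right),
\qquad \text{SE} \approx \frac{s}{\sqrt{n}} \ \text{when } \sigma \text{ is unknown}
```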

As the sample size increases, we would expect samples to yield more consistent sample means, hence the variability among the sample means would be lower, which results in a lower standard error.

Certain conditions must be met for the central limit theorem to apply. The first one is independence: sampled observations must be independent. This is very difficult to verify, but it is more likely to hold if we have used random sampling or random assignment, depending on whether we have an observational study, where we're sampling from the population randomly, or an experiment, where we're randomly assigning experimental units to various treatments. In addition, if sampling without replacement, the sample size n should be less than 10% of the population. So we've previously mentioned that we love large samples, and now we're saying that, well, we don't exactly want them to be very large. We're going to talk about why this is the case in a moment. The other condition is related to sample size, or skew: either the population distribution is normal, or, if the population distribution is skewed or we have no idea what it looks like, the sample size is large. According to the central limit theorem, if the population distribution is normal, the sampling distribution will also be nearly normal, regardless of the sample size. We illustrated this earlier when we were working with the applet, where we looked at a sample size of 45 as well as a sample size of 500, and in both instances the sampling distribution was nearly normal. However, if the population distribution is not normal, the more skewed the population distribution, the larger the sample size we need for the central limit theorem to apply. For moderately skewed distributions, n greater than 30 is a widely used rule of thumb that we're going to make use of often in this course as well.
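This checklist can be encoded as a rough rule-of-thumb helper; the sketch below is just the conventions above (the 10% and n > 30 cutoffs), with a made-up function name. Independence itself cannot be verified numerically:

```python
def clt_conditions_ok(n, population_size=None, skewed=False):
    """Rule-of-thumb CLT checks; does not (and cannot) test independence."""
    # 10% condition: only relevant when sampling without replacement
    # from a finite population of known size
    if population_size is not None and n > 0.10 * population_size:
        return False
    # Sample size / skew condition: skewed populations need larger n
    if skewed and n <= 30:
        return False
    return True

print(clt_conditions_ok(45, population_size=10_000))  # True
print(clt_conditions_ok(500, population_size=1_000))  # False: more than 10%
print(clt_conditions_ok(15, skewed=True))             # False: n too small
```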

The distribution of the population is also something very difficult to verify, because we often do not know what the population looks like; that's why we're doing this investigation in the first place. But we can check it using the sample data and assume that the sample mirrors the population. So if you make a plot of your sample distribution, and it looks nearly normal, then you might be fairly certain that the parent population distribution it comes from is nearly normal as well. We'll discuss these conditions in more detail in the next couple of slides.

First, let's focus on the 10% condition: if sampling without replacement, the sample size needs to be less than 10% of the population, as we stated earlier. Why is this the case? Let's think about this for a moment. Say that you live in a very small town with a population of only 1,000 people, and your family lives there as well, including your extended family. Say that I'm a researcher doing research on some genetic application, and I want to randomly sample some individuals from your town. Say I take a random sample of size just 10. If we're randomly sampling 10 people out of 1,000, and you are included in our sample, it's going to be quite unlikely that your parents are also included in that sample, because remember, we're only grabbing 10 out of a population of 1,000. But say, on the other hand, I actually sample 500 people from the 1,000 that live in your town. If you live with your parents and all of your extended family, and I've already grabbed you to be in my sample, and I have 499 other people to grab, chances are I might get somebody from your family in my sample as well. You and a family member of yours are not genetically independent; often, observations in the population itself are not independent of each other. Therefore, if we grab a very big portion of the population to be in our sample, it's going to be very difficult to make sure that the sampled individuals are independent of each other. That's why, while we like large samples, we also want to keep the size of our samples somewhat proportional to our population. A good rule of thumb, if we're sampling without replacement, is that we don't grab more than 10% of the population to be in our sample.
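We can put numbers on this story with a small simulation, assuming numpy; the family size of 10 relatives is a made-up figure for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
pop, family, reps = 1000, 10, 10_000  # town of 1,000; 10 relatives (assumed)

for n in (10, 500):
    hits = 0
    for _ in range(reps):
        # You occupy one slot; draw the other n - 1 people from the
        # remaining 999, of whom the first `family` indices are relatives
        others = rng.choice(pop - 1, size=n - 1, replace=False)
        hits += (others < family).any()
    print(n, hits / reps)
# n = 10  -> about 0.09: a relative is unlikely to be co-sampled with you
# n = 500 -> about 0.999: a relative almost certainly ends up in the sample
```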

What about sampling with replacement? That's not something we often do in survey settings, because if I've already sampled you once, given you a survey, and gotten your responses, I don't want to be able to sample you again; I don't need your responses again. But if I were sampling with replacement, then the probability of sampling you versus somebody from your family would stay constant throughout all of the trials, so we wouldn't need to worry about the 10% condition there. But again, in realistic survey sampling situations, we sample without replacement, and while we like large samples, we do not want our samples to be any more than 10% of our population.

And what about the sample size and skew condition? Say we have a skewed population distribution; here, we have a population distribution that's extremely right skewed. When the sample size is small, here we're looking at a sampling distribution created from samples of just n = 10, the sample means will be quite variable, and the shape of their distribution will mimic the population distribution. Increasing the sample size a bit, from n = 10 to n = 100, decreases the standard error, and the distribution starts to condense around the mean and starts looking more unimodal and symmetric. With quite large samples, here we're looking at a sampling distribution where each of the sample means was calculated from an individual sample of size 200, we can actually overcome the effect of the parent distribution: the central limit theorem kicks in, and the sampling distribution starts to resemble a nearly normal distribution.
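To see the same effect in code, here is a sketch assuming numpy and scipy, using an exponential population as a stand-in for an extremely right skewed distribution (the applet's exact shape isn't specified):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)

for n in (10, 100, 200):
    # 1,000 sample means, each from a right skewed (exponential) population
    means = rng.exponential(scale=1.0, size=(1000, n)).mean(axis=1)
    print(n, round(float(skew(means)), 2))
# Skewness of the sampling distribution shrinks as n grows:
# roughly 0.6 at n = 10, near 0.2 at n = 100, and smaller still at n = 200
```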

Why are we somewhat obsessed with having nearly normal sampling distributions? Because we've learned earlier that once you have a normal distribution, calculating probabilities, which will later serve as our p-values in our hypothesis tests, is relatively simple. So, having a nearly normal sampling distribution, which relies on the central limit theorem, is going to open up a bunch of doors for us for doing statistical inference, using confidence intervals and hypothesis tests based on normal distribution theory.
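For instance, assuming scipy and borrowing the heights example, here is the kind of probability calculation a nearly normal sampling distribution makes easy; the 65.3-inch cutoff and the 0.11 standard error are made-up numbers:

```python
from scipy.stats import norm

# If sample means follow N(mean = 65, SE = 0.11), how unusual is a state
# whose sample mean height is 65.3 inches or more?
mu, se = 65, 0.11
print(norm.sf(65.3, loc=mu, scale=se))  # about 0.003, a small tail area
```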

Let's do another quick demo. We looked earlier at what a sampling distribution looks like when we have a nearly normal population distribution; let's take a look at what happens if the population distribution is not nearly normal. Suppose I first pick a uniform distribution, say, uniform between 4 and 12 (obviously, the upper bound needs to be greater than the lower bound). So we can see a uniform distribution between 4 and 12, with absolutely no peak. Say that we're taking samples of size just 15 from this distribution. Each one of our samples contains 15 observations from the parent population, and the centers of these samples are going to be somewhere close to the population mean. We take a bunch of these samples, 1,000 of them, and let's take a look at what the sampling distribution looks like. It actually looks fairly unimodal and symmetric. The center of the distribution is very close to our population distribution's mean, and the variability of this distribution is much lower than that of our population distribution: the standard error is 0.59, while the original population standard deviation was 2.31.

What happens if we have skewed data? Here, we have a population distribution that's right skewed, and we're taking samples of size 15. Let's actually make this an extremely right skewed distribution. If we're taking samples of size 15, looking at each one of our individual samples, the sampling distribution is looking awfully skewed as well. However, if I increase my sample size to something much larger, say 500, then my sampling distribution starts to look much more unimodal and symmetric and starts to resemble a nearly normal distribution. What about a left skewed distribution? Once again, let's make the skew of this distribution pretty high. We can see that our sampling distribution, when we have a large number of observations in each sample, we have still kept it at 500 observations, looks pretty nearly normal. However, if I were to decrease my sample size to something pretty small, 24, let's say, then my sampling distribution looks more and more left skewed. And in fact, if I take even smaller samples, let's go all the way down to 12, for example, the distribution looks even more skewed. If, though, I decrease the skew, so that my population distribution to begin with is not all that skewed anyway, then I really don't need a whole lot of observations in my sample. Here, I have only 12 observations in each sample, and the sampling distribution is already looking pretty unimodal and symmetric.
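The uniform-demo numbers check out analytically, as a quick computation (assuming numpy) shows: the standard deviation of a Uniform(a, b) distribution is (b - a) divided by the square root of 12:

```python
import numpy as np

a, b, n = 4, 12, 15
sigma = (b - a) / np.sqrt(12)  # sd of a Uniform(4, 12) population
print(sigma)                   # about 2.31, matching the applet
print(sigma / np.sqrt(n))      # about 0.60, close to the observed 0.59
```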

So the moral of this story is: the more the skew, the higher the sample size you need for the central limit theorem to kick in. Please feel free to go play with this applet, interact with it, and find out for yourself what the sampling distribution looks like in various scenarios. Also play around with the different parameters of the distributions: how skewed they are, if it's a skewed distribution; what the minimum and the maximum are, if it's a uniform distribution; or what the mean and the standard deviation are, if it's a normal distribution.
