
# Econ 307: Two-Sample Inference and Chi-Square Test

Pasita Chaijaroen
pchaijaroen@wm.edu
Outline
• Two-sample inference
• When do we need two-sample inference?
• The sampling distribution of the difference in means
• Confidence interval and hypothesis testing for the difference in means
• Chi-square test for the relationship between two random variables
• What is it for?
• The Chi-square distribution
Two-sample hypothesis testing
• We are often interested in testing hypotheses about whether a population
parameter is identical in two subsets of a population
• Example: do male and female students get the same grade?
• The null hypothesis states that there is no difference between the two
groups
• $H_0: \mu_f = \mu_m$
• The alternative hypothesis may be that the two parameters are not equal
(two-tail test) or that one parameter is greater than the other (one-tail
test)
• $H_1: \mu_f \neq \mu_m$
• $H_1: \mu_f > \mu_m$
Test statistics for two-sample tests
• We need a sample statistic to perform these two-sample tests
• We can rewrite the hypotheses as
  $H_0: \mu_f - \mu_m = 0$ vs. $H_1: \mu_f - \mu_m \neq 0$ (two-tail)
  $H_0: \mu_f - \mu_m = 0$ vs. $H_1: \mu_f - \mu_m > 0$ (one-tail, upper)
  $H_0: \mu_f - \mu_m = 0$ vs. $H_1: \mu_f - \mu_m < 0$ (one-tail, lower)
• Then, our sample statistic is $\bar{X}_f - \bar{X}_m$
• To find this sample statistic, take each subsample, find its
sample mean, then subtract one sample mean from the other
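The step above can be sketched in a few lines of Python. The scores and the names `female_scores`, `male_scores`, and `difference_in_means` are hypothetical, chosen only to illustrate the calculation; they are not data from the course.

```python
from statistics import mean

# Hypothetical exam scores for each subsample (illustrative values only).
female_scores = [82.0, 91.0, 77.0, 88.0]
male_scores = [79.0, 85.0, 80.0, 84.0]


def difference_in_means(sample_1, sample_2):
    """Return the sample mean of sample_1 minus the sample mean of sample_2."""
    return mean(sample_1) - mean(sample_2)


diff = difference_in_means(female_scores, male_scores)
print(f"Difference in sample means: {diff:.2f}")
```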
How is $\bar{X}_1 - \bar{X}_2$ distributed?
• It turns out that $\bar{X}_1 - \bar{X}_2$ is an unbiased and consistent estimator of
$\mu_1 - \mu_2$
• The sampling distribution of $\bar{X}_1 - \bar{X}_2$ depends on many factors
• In the case of unknown population standard deviations and large
enough subsamples (each $n > 25$ to $30$),
$$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)$$
• The sampling distribution is actually a t-distribution with $n - 2$ degrees of
freedom; however, we can apply the central limit theorem if our sample size is
large
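A short derivation shows why the two variance terms add. It assumes the two subsamples are drawn independently of each other, which the slide does not state explicitly:

$$
\operatorname{Var}(\bar{X}_1 - \bar{X}_2)
= \operatorname{Var}(\bar{X}_1) + \operatorname{Var}(\bar{X}_2) - 2\operatorname{Cov}(\bar{X}_1, \bar{X}_2)
= \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}
$$

The covariance term is zero under independence. Replacing the unknown $\sigma_1^2$ and $\sigma_2^2$ with the sample variances $s_1^2$ and $s_2^2$ gives the estimated variance used in the normal approximation.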
Two-sample inference
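• Z-score for a difference in means
$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
• Confidence interval
$$CI = (\bar{x}_1 - \bar{x}_2) \pm \zeta\, s_{\bar{x}_1 - \bar{x}_2}, \quad \text{where } s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$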
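The z-score and confidence interval above can be computed from subsample summary statistics. The following is a minimal sketch using only the Python standard library; the function name `two_sample_z` and all input numbers are hypothetical, and the critical value (the multiplier on the standard error in the confidence interval) is taken from the standard normal distribution, which assumes large subsamples.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z(mean_1, sd_1, n_1, mean_2, sd_2, n_2,
                 null_diff=0.0, confidence=0.95):
    """Large-sample z statistic, two-tail p-value, and CI for mu_1 - mu_2."""
    diff = mean_1 - mean_2
    # Standard error of the difference: sqrt(s1^2/n1 + s2^2/n2).
    se = sqrt(sd_1 ** 2 / n_1 + sd_2 ** 2 / n_2)
    z = (diff - null_diff) / se
    # Two-tail p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Critical value for the requested confidence level (about 1.96 for 95%).
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return z, p_value, ci


# Hypothetical summary statistics for two subsamples.
z, p_value, ci = two_sample_z(75.0, 10.0, 100, 72.0, 12.0, 144)
print(f"z = {z:.3f}, p = {p_value:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

If the confidence interval excludes zero, the two-tail test at the matching significance level rejects the null hypothesis of no difference.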
Example
• Based on the Boston dataset, we calculate the following subsample
descriptive statistics for areas that are bounded by the Charles River
and those that are not. Is the average house price the same
whether or not the area is bounded by the river?
Example
• 95% confidence interval
• $CI = (\bar{x}_1 - \bar{x}_2) \pm \zeta\, s_{\bar{x}_1 - \bar{x}_2}$, where
$s_{\bar{x}_1 - \bar{x}_2} = \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$
Example
• Hypothesis testing with $\alpha = .05$
• $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$
Computing
Statistical significance vs meaningful difference
• From problem 1 in the worksheet, we found that the difference
between male and female scores is statistically significant with a p-
value of .038.
• What about the magnitude of this difference?
• Female students, on average, got 1.1 points higher than male students did on
a scale of 100
• Shall we use this finding and suggest that male students should not become
statisticians?
• A statistically significant relationship does not imply a meaningful
relationship!
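One way to see the point is to hold the 1.1-point difference fixed and vary the sample size. The sketch below does this with a hypothetical within-group standard deviation of 15 and hypothetical group sizes; none of these numbers come from the worksheet, and `two_tail_p` is an illustrative helper.

```python
from math import sqrt
from statistics import NormalDist


def two_tail_p(diff, sd, n_per_group):
    """Two-tail p-value for a difference in means, equal SDs and group sizes."""
    se = sqrt(2 * sd ** 2 / n_per_group)
    z = diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


diff = 1.1  # difference in average scores, on a 100-point scale
sd = 15.0   # hypothetical standard deviation within each group

p_small = two_tail_p(diff, sd, 50)    # 50 students per group
p_large = two_tail_p(diff, sd, 2000)  # 2000 students per group
print(f"n = 50 per group:   p = {p_small:.3f}")
print(f"n = 2000 per group: p = {p_large:.3f}")
```

The magnitude of the difference is the same in both cases; only the p-value changes with the sample size.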
Inference for a difference in proportions
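• The hypotheses would be
  $H_0: \pi_1 - \pi_2 = 0$ vs. $H_1: \pi_1 - \pi_2 \neq 0$ (two-tail)
  $H_0: \pi_1 - \pi_2 = 0$ vs. $H_1: \pi_1 - \pi_2 > 0$ (one-tail, upper)
  $H_0: \pi_1 - \pi_2 = 0$ vs. $H_1: \pi_1 - \pi_2 < 0$ (one-tail, lower)
• The sample statistic, $\bar{p}_1 - \bar{p}_2$, is also an unbiased and consistent
estimator of $\pi_1 - \pi_2$
• The sampling distribution is
$$\bar{p}_1 - \bar{p}_2 \sim N\left(\pi_1 - \pi_2,\ \frac{\pi_1(1 - \pi_1)}{n_1} + \frac{\pi_2(1 - \pi_2)}{n_2}\right)$$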
Inference for a difference in proportions
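• Z-score
$$z = \frac{(\bar{p}_1 - \bar{p}_2) - (\pi_1 - \pi_2)}{\sqrt{\dfrac{\pi_1(1 - \pi_1)}{n_1} + \dfrac{\pi_2(1 - \pi_2)}{n_2}}}$$
• Confidence interval
$$CI = (\bar{p}_1 - \bar{p}_2) \pm \zeta\, s_{\bar{p}_1 - \bar{p}_2}, \quad \text{where } s_{\bar{p}_1 - \bar{p}_2} = \sqrt{\frac{\bar{p}_1(1 - \bar{p}_1)}{n_1} + \frac{\bar{p}_2(1 - \bar{p}_2)}{n_2}}$$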
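The confidence interval for a difference in proportions follows the same pattern as the one for means. A minimal sketch, assuming large subsamples and using hypothetical counts; `two_proportion_ci` is an illustrative name:

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ci(successes_1, n_1, successes_2, n_2, confidence=0.95):
    """Point estimate, standard error, and CI for pi_1 - pi_2."""
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    diff = p_1 - p_2
    # Standard error built from the sample proportions.
    se = sqrt(p_1 * (1 - p_1) / n_1 + p_2 * (1 - p_2) / n_2)
    # Critical value from the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, se, (diff - z_crit * se, diff + z_crit * se)


# Hypothetical counts: 60 of 100 in group 1, 45 of 100 in group 2.
diff, se, ci = two_proportion_ci(60, 100, 45, 100)
print(f"diff = {diff:.3f}, se = {se:.4f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```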
Example
• Is low birth weight statistically related to gender?
Chi-square test for statistical independence
• Recall that we tested for statistical independence using the joint
probability distribution before
• Our approach back then did not account for sampling errors, so we tested for
strict independence
• The Chi-square test allows us to test for statistical independence in
a world with sampling error
• How far from strict independence does a relationship have to be for us to say
that the two variables or attributes are statistically related?
• In 1900, Karl Pearson developed a new test statistic that
simultaneously compares observed and expected values when all
possible outcomes are divided into mutually exclusive groups
Chi-square test for statistical independence
• The intersections in our joint probability distribution (or contingency
table) are mutually exclusive, so we can use this test statistic
• This test statistic is called the Chi-square statistic ($\chi^2$), and it follows
the Chi-square distribution (which is also used for hypothesis testing
for population variance)
• The Chi-square test statistic can be calculated as follows:
$$\chi^2 = \sum \frac{(\text{observed \#} - \text{expected \#})^2}{\text{expected \#}}$$
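The formula above can be sketched in Python for a contingency table of counts. The expected count in each cell is taken to be (row total × column total) / grand total, which is what independence implies. The table values and the name `chi_square_statistic` are hypothetical.

```python
def chi_square_statistic(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the row and column variables are independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (obs - expected) ** 2 / expected
    return chi2


# Hypothetical 2x2 table of counts (rows: one attribute, columns: the other).
table = [[30, 20],
         [20, 30]]
chi2 = chi_square_statistic(table)
print(f"chi-square = {chi2:.2f}")
```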
Example

$$\chi^2 = \sum \frac{(\text{observed \#} - \text{expected \#})^2}{\text{expected \#}}$$
Chi-square test for statistical independence
• The null hypothesis is always “the two variables/outcomes are not
related”
• $H_0$: the number of observations in each intersection equals
the number that would be expected if the two variables were independent
• $H_1$: the two numbers are not equal in at least one intersection
• The Chi-square statistic is never negative, and the shape of its distribution
depends on the degrees of freedom. As the degrees of freedom increase, the
Chi-square distribution gets closer to the normal distribution
• The degrees of freedom are (number of rows − 1) times
(number of columns − 1): $(\text{rows} - 1) \times (\text{cols} - 1)$
Chi-square distribution
Chi-square test for statistical independence
• We then pick our level of significance, $\alpha$, and obtain the critical value from
the Chi-square table using the degrees of freedom and $\alpha$
• We reject the null hypothesis if our $\chi^2$ is greater than the critical value
$$\chi^2 = \sum \frac{(\text{observed \#} - \text{expected \#})^2}{\text{expected \#}}$$
Example
• The degrees of freedom = $(\text{rows} - 1) \times (\text{cols} - 1) = (2 - 1) \times (2 - 1) = 1$
• The critical value for $\alpha = .05$ is 3.84
• We cannot reject the null hypothesis that gender and birth
weight are independent
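The decision rule for a 2×2 table can be sketched without a Chi-square table, because a Chi-square variable with 1 degree of freedom is the square of a standard normal variable. The shortcut below therefore applies only when the degrees of freedom equal 1. The test statistic value is hypothetical, as are the function names.

```python
from statistics import NormalDist


def degrees_of_freedom(n_rows, n_cols):
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)


def chi_square_critical_value_df1(alpha):
    """Critical value for 1 degree of freedom: a squared normal quantile."""
    return NormalDist().inv_cdf(1 - alpha / 2) ** 2


alpha = 0.05
df = degrees_of_freedom(2, 2)
critical = chi_square_critical_value_df1(alpha)  # about 3.84

chi2 = 2.5  # hypothetical test statistic from a 2x2 table
reject = chi2 > critical
print(f"df = {df}, critical value = {critical:.2f}, reject H0: {reject}")
```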
Using Stata
List of Stata commands
• Tests for two means
• ttest var, by(groupvar)
• Chi-square test for independence
• tabulate var1 var2, chi2
Assignments