
Statistical Conclusion Validity

IGS

Statistical Conclusion Validity

The validity of inferences about the correlation (covariation) between treatment X and outcome Y

Glossary

Population parameter: A fixed feature of a population (e.g., the population mean, the population standard deviation, ...); we conventionally indicate parameters using Greek letters (e.g., $\mu$, $\sigma$, ...)
Sample statistic: A feature that varies from one sample to another (e.g., the sample average, the sample standard deviation, ...)
Estimator: Any function of sample data used to estimate parameters
Expectation: The mathematical expectation of a variable $Y_i$, indicated as $E[Y_i]$, is the population average of this variable
When a sample statistic has expectation equal to the corresponding population parameter, it is said to be an unbiased estimator of that parameter

Formal Statistical Inference

The process of drawing conclusions about a population based on sample data
Practical questions
  How much uncertainty is associated with sample data?
  Do my results constitute strong evidence or just a lucky draw/chance finding?

Formal Statistical Inference (contd)

If we select at random a sample of $n$ units from a population of $N$ units, every possible sample of size $n$ has the same chance of selection

$\frac{N!}{n!\,(N-n)!}$ possible samples are equally likely

Example: if we select 3 units at random from a population of 8 units, 56 samples are equally likely

$$\frac{N!}{n!\,(N-n)!} = \frac{8!}{3!\,5!} = \frac{8 \cdot 7 \cdot 6 \cdot 5 \cdot 4 \cdot 3 \cdot 2 \cdot 1}{(3 \cdot 2 \cdot 1)(5 \cdot 4 \cdot 3 \cdot 2 \cdot 1)} = 56$$
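The count of 56 can be checked by brute force. The sketch below enumerates every sample of size 3 from 8 units; the unit labels 1 to 8 are placeholders, not data from the slides.

```python
from itertools import combinations
from math import comb

population = list(range(1, 9))               # 8 hypothetical unit labels
samples = list(combinations(population, 3))  # every possible sample of size 3

n_samples = len(samples)
print(n_samples)        # 56, counted by enumeration
print(comb(8, 3))       # 56, from N! / (n! (N - n)!)
print(1 / n_samples)    # chance of drawing any one particular sample
```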

The Mean

For a given population
  Only one population mean $\mu = E[Y_i]$ (parameter)
  Many sample averages $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ that depend on what units are drawn

Unbiasedness of the Sample Mean

If we were to draw infinitely many random samples, the average of the resulting sample means would be the population mean $\mu$

$$\mu = E[\bar{Y}]$$
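For a small population the claim can be verified exactly rather than with infinitely many draws, because all possible samples can be listed. This is a minimal sketch with made-up population values; it averages the sample mean over every possible sample of size 3.

```python
from itertools import combinations

population = [2, 4, 4, 5, 7, 9, 10, 15]   # hypothetical outcome values
n = 3

mu = sum(population) / len(population)          # population mean (the parameter)
samples = list(combinations(population, n))     # all equally likely samples
sample_means = [sum(s) / n for s in samples]    # one statistic per sample

avg_of_means = sum(sample_means) / len(sample_means)
print(mu, avg_of_means)                         # the two agree up to rounding
print(min(sample_means), max(sample_means))     # individual samples still vary
```

Any single sample mean can be far from the population mean; unbiasedness is a statement about the average over all samples.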

Variability of the Sample Mean
Sampling variance

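Variance
  Population: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(Y_i - \mu)^2$
  Sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$
  Sample mean: $\mathrm{Var}(\bar{Y}) = \frac{\sigma^2}{n}$
Std. Dev.
  Population: $\sigma$
  Sample: $s$
  Sample mean: $SE(\bar{Y}) = \frac{\sigma}{\sqrt{n}}$
The standard error (SE) summarizes the variability in an estimate due to random sampling

Estimated standard error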

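The population standard deviation $\sigma$ is usually unknown, so the standard error must be estimated by replacing $\sigma$ with the sample standard deviation $s$: $\widehat{SE}(\bar{Y}) = \frac{s}{\sqrt{n}}$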

T-statistic for the sample mean

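Under the working/null hypothesis $E[Y_i] = \mu$, a t-statistic for the sample mean is

$$t(\mu) = \frac{\bar{Y} - \mu}{\widehat{SE}(\bar{Y})} = \frac{\bar{Y} - \mu}{s / \sqrt{n}}$$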


T-statistic for the sample mean (contd)

If the null hypothesis is $E[Y_i] = \mu = 0$, a t-statistic for the sample mean is

$$t(0) = \frac{\bar{Y}}{\widehat{SE}(\bar{Y})}$$
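A minimal sketch of the calculation: the estimated standard error is the sample standard deviation divided by the square root of the sample size, and the t-statistic is the distance between the sample mean and the hypothesized value in standard-error units. The data and the null value `mu_0` are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

sample = [4.1, 5.3, 6.0, 4.8, 5.9, 5.2, 6.4, 4.7, 5.5, 5.1]  # hypothetical data
mu_0 = 5.0                                                    # hypothetical null value

n = len(sample)
y_bar = mean(sample)
s = stdev(sample)                  # sample SD (n - 1 in the denominator)
se_hat = s / sqrt(n)               # estimated standard error of the mean
t_stat = (y_bar - mu_0) / se_hat   # t-statistic under the null hypothesis

print(round(y_bar, 3), round(se_hat, 3), round(t_stat, 3))
```

Setting `mu_0 = 0` reduces the last step to the ratio of the sample mean to its estimated standard error.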

Central limit theorem

If $E[Y_i] = \mu$ then, as long as the sample is large enough, the t-statistic $t(\mu) = \frac{\bar{Y} - \mu}{\widehat{SE}(\bar{Y})}$ has a sampling distribution that is close to a standard normal distribution (mean of 0 and standard deviation of 1), irrespective of the population distribution of $Y_i$
In other words, for large samples, the distribution of a t-statistic is independent of the distribution of the underlying data
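The theorem can be illustrated with a simulation. The sketch below draws repeated samples from a strongly skewed (exponential) population and computes the t-statistic at the true mean each time; the population, sample size, number of replications, and seed are all arbitrary choices made for this example.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(0)

def t_stat(sample, mu):
    """t-statistic for the sample mean, evaluated at the hypothesized mean mu."""
    return (mean(sample) - mu) / (stdev(sample) / sqrt(len(sample)))

mu = 1.0              # true mean of an exponential distribution with rate 1
n, reps = 200, 2000   # sample size and number of simulated samples

t_values = [
    t_stat([random.expovariate(1.0) for _ in range(n)], mu)
    for _ in range(reps)
]

share_beyond_2 = sum(abs(t) > 2 for t in t_values) / reps
print(round(mean(t_values), 2), round(stdev(t_values), 2), round(share_beyond_2, 3))
```

Although the raw data are far from normal, the simulated t-statistics should have a mean near 0, a standard deviation near 1, and roughly 5% of values beyond 2 in absolute value.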

Distribution of a t-statistic

Hypothesis testing

With standard normal variables, the frequency of values larger than 2 in absolute value is about 5%
Any t-statistic larger than 2 in absolute value is too unlikely to be consistent with the null hypothesis → We reject the null
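As a quick check of this rule of thumb, write $\Phi$ for the standard normal distribution function. The numerical values below are standard table values, not figures from the slides; they show that the cutoff of 2 is a rounded version of the exact 5% cutoff of 1.96.

```latex
P(|Z| > 2) = 2\,[1 - \Phi(2)] \approx 2 \times 0.0228 \approx 0.046,
\qquad
P(|Z| > 1.96) = 2\,[1 - \Phi(1.96)] \approx 0.050
```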

Confidence interval

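If we repeatedly drew infinitely many independent samples from the same population, the interval

$$\left[\bar{Y} - 2 \times \widehat{SE}(\bar{Y}),\ \bar{Y} + 2 \times \widehat{SE}(\bar{Y})\right]$$

would include the population mean about 95% of the time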
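The repeated-sampling interpretation can be sketched with a simulation: draw many samples from a population whose mean is known, build the interval from each sample, and count how often it covers the true mean. The normal population, its parameters, the sample size, and the seed are assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)

mu, sigma = 50.0, 10.0   # hypothetical population mean and SD
n, reps = 100, 2000      # sample size and number of simulated samples

covered = 0
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    y_bar = mean(sample)
    se_hat = stdev(sample) / sqrt(n)
    # Does the interval built from this sample include the true mean?
    if y_bar - 2 * se_hat <= mu <= y_bar + 2 * se_hat:
        covered += 1

coverage = covered / reps
print(coverage)   # share of intervals that include the population mean
```

Each individual interval either contains the population mean or it does not; the 95% refers to the long-run share of intervals that do.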

Confidence Level

Confidence level: the percentage of confidence intervals, computed from all possible independent samples from the same population, that can be expected to include the true population parameter
If we repeatedly drew infinitely many independent samples from the same population and we calculated a confidence interval for each sample, then a certain percentage (confidence level) of the intervals would include the unknown population parameter
Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter

Comparison of Two Group Averages

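$$\mu_1 = E[Y_i \mid X_i = 1]$$

$$\mu_0 = E[Y_i \mid X_i = 0]$$

$$H_0: \mu_1 - \mu_0 = \mu = 0$$

$$t = \frac{\bar{Y}_1 - \bar{Y}_0}{\widehat{SE}(\bar{Y}_1 - \bar{Y}_0)} = \frac{\bar{Y}_1 - \bar{Y}_0}{s \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}}$$

where $s$ is the pooled sample standard deviation and $n_1$ and $n_0$ are the sizes of the two groups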
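A minimal sketch of the two-group comparison, assuming a common standard deviation in both groups so that a pooled estimate can be used. The outcome values for the treated group ($X = 1$) and the control group ($X = 0$) are invented.

```python
from math import sqrt
from statistics import mean, variance

treated = [78, 85, 90, 72, 88, 95, 81, 79]   # hypothetical outcomes, X = 1
control = [70, 75, 82, 68, 77, 80, 73, 71]   # hypothetical outcomes, X = 0

n1, n0 = len(treated), len(control)
diff = mean(treated) - mean(control)          # difference in group averages

# Pooled variance: weighted average of the two within-group sample variances
s2_pooled = ((n1 - 1) * variance(treated) + (n0 - 1) * variance(control)) / (n1 + n0 - 2)
se_hat = sqrt(s2_pooled) * sqrt(1 / n1 + 1 / n0)

t_stat = diff / se_hat                        # t-statistic for H0: no difference
print(diff, round(se_hat, 3), round(t_stat, 3))
```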

Significance vs. Effect magnitude

A large t-statistic may be due to
  A large effect size
  Or a small estimated standard error

Null Hypothesis Significance Testing (NHST)

The null hypothesis ($H_0$) is a claim to be tested, usually a hypothesis of no difference (e.g., no difference between test scores in group A and group B)
The alternative hypothesis ($H_1$) is the one we would believe if the null hypothesis is rejected
  Rejecting $H_0$ does not prove $H_0$ to be false nor $H_1$ to be true
  The only way $H_0$ can be proven false (or true) is to know the value of the population parameter(s) specified in the null hypothesis; sample data do not provide that kind of information

p-value

The probability of getting the observed or more extreme results if the null hypothesis were true
Following Fisher (1926), we usually say that results are statistically significant if p < .05 (an arbitrary threshold)
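As a sketch of the definition, the function below turns an observed t-statistic into a two-sided p-value. It assumes a sample large enough that the t-statistic is approximately standard normal under the null hypothesis; the function name and the example inputs are illustrative only.

```python
from statistics import NormalDist

def two_sided_p_value(t_stat):
    """P(|Z| >= |t_stat|) for a standard normal Z (large-sample approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

p_at_1 = two_sided_p_value(1.0)   # a result well within chance variation
p_at_2 = two_sided_p_value(2.0)   # a result at the usual rule-of-thumb cutoff

print(round(p_at_1, 4), round(p_at_2, 4))
print(p_at_2 < 0.05)              # "statistically significant" by the .05 convention
```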

More on NHST
                                  DECISION
TRUTH             Do not reject $H_0$                Reject $H_0$
$H_0$ is true     Correct decision                   Type I error (false positive)
                  Prob = $1 - \alpha$                Prob = $\alpha$ (significance)
$H_0$ is false    Type II error (false negative)     Correct decision
                  Prob = $\beta$                     Prob = $1 - \beta$ (power)

More on NHST (contd)

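$$\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$$

$$\text{Power} = 1 - \beta = 1 - P(\text{Type II error})$$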

Statistical Conclusion Validity

1. Do X and Y covary?
  Type I error (false positive): We may incorrectly conclude that X and Y covary when they do not
  Type II error (false negative): We may incorrectly conclude that X and Y do not covary when they do
2. How strongly do X and Y covary?
  We can over/underestimate
    The magnitude of covariation
    The degree of confidence that magnitude estimate warrants

Threats to Statistical Conclusion Validity

1. Low Statistical Power

An insufficiently powered experiment may incorrectly conclude that the relationship between treatment and outcome is not significant (Shadish et al. 2002, 55)

$$\text{Power} = 1 - \beta = 1 - P(\text{Type II error})$$

The ability of a test to detect relationships that exist in the population
The probability that a statistical test will reject the null hypothesis when it is false

1. Low Statistical Power (contd)

Low power → Larger estimated SE → Wider confidence intervals
Common practice to set $\beta = .20$ → Power = .80
Important to increase power when missing a real effect would have negative consequences, e.g. when testing for harmful effects of a new drug
Low power is a problem when effect sizes are small
  Remedy: Meta-analysis
Comprehensive list of remedies: Table 2.3 in Shadish et al. (2002)

1. Low Statistical Power (contd)

Factors affecting power
  A. Sample size: The larger the sample size, the higher the power (see figure 1)
    Remedy: Increasing sample size (sometimes expensive/difficult)

Figure 1: The relationship between sample size $n$ and power for $H_0: \mu = 75$, real $\mu = 80$, one-tailed $\alpha = 0.05$, for $\sigma$'s of 10 and 15.
Source: Lane 2015

1. Low Statistical Power (contd)

Factors affecting power
  B. Standard deviation (SD): The smaller the SD, the higher the power (see figure 1)
    Remedies:
      Sampling from a homogeneous population
      Reducing random measurement error

1. Low Statistical Power (contd)

Factors affecting power (contd)
  C. Effect size (i.e. difference between hypothesized and true parameter): Easier to detect larger effects (see figure 2)

Figure 2. The relationship between $\mu$ and power for $H_0: \mu = 75$, one-tailed $\alpha = 0.05$, for $\sigma$'s of 10 and 15

Source: Lane 2015


1. Low Statistical Power (contd)

Factors affecting power (contd)
  D. Significance level ($\alpha$): The lower the $\alpha$ (i.e. the probability of Type I error/false positive), the lower the power (see figure 3)
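The four factors (sample size, standard deviation, effect size, significance level) can be seen in one formula. The sketch below computes the power of an upper-tailed z-test, assuming a known population standard deviation and a normally distributed sample mean. The values 75, 80, 10, 15 and 0.05 come from the figure captions; the sample sizes and the function name are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def power_one_tailed(mu_0, mu_true, sigma, n, alpha=0.05):
    """Power of an upper-tailed z-test of H0: mu = mu_0 when the true mean is mu_true."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)                  # rejection cutoff under H0
    shift = (mu_true - mu_0) / (sigma / sqrt(n))   # true effect in standard-error units
    return 1 - z.cdf(z_crit - shift)

base = power_one_tailed(75, 80, sigma=10, n=25)
larger_n = power_one_tailed(75, 80, sigma=10, n=100)              # A. more units
larger_sd = power_one_tailed(75, 80, sigma=15, n=25)              # B. noisier outcome
larger_effect = power_one_tailed(75, 85, sigma=10, n=25)          # C. bigger true difference
lower_alpha = power_one_tailed(75, 80, sigma=10, n=25, alpha=0.01)  # D. stricter test

for label, value in [("base", base), ("larger n", larger_n), ("larger SD", larger_sd),
                     ("larger effect", larger_effect), ("lower alpha", lower_alpha)]:
    print(f"{label:14s} {value:.3f}")
```

With $n = 25$ and $\sigma = 10$ the power is close to the conventional .80; raising $\sigma$ to 15 or lowering $\alpha$ to .01 pushes it well below that level.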

Figure 3. The relationship between significance level and power with one-tailed test: $H_0: \mu = 75$, real $\mu = 80$, and $\sigma = 10$.
Source: Lane 2015

2. Violated Assumptions of the Test Statistics

Violations of statistical test assumptions can lead to either overestimating or underestimating the size and significance of an effect (Shadish et al. 2002, 55)
Example: If we ignore the hierarchical/multilevel structure of the data (e.g., soccer players nested within teams, students nested within classes), we may severely underestimate standard errors and conclude that effects that might be ascribed to chance are real (i.e. a higher risk of Type I error)

3. Fishing and the Error Rate Problem

Repeated tests for significant relationships, if uncorrected for the number of tests, can artifactually inflate statistical significance (Shadish et al. 2002, 55)
If the nominal $\alpha = .05$, the actual $\alpha = .923$ when the test is repeated fifty times (Maxwell & Delaney 1990)
Examples:
  Fishing until we find a significant effect
  Multiple researchers analyzing the same data
Remedy:
  Bonferroni correction: divides the target $\alpha$ by the number of tests and uses the Bonferroni-corrected $\alpha$ in all individual tests
  Bonferroni and other corrections may be too conservative in low-powered studies (high risk of Type II error)
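The .923 figure follows from the chance of at least one false positive across repeated tests. The sketch below assumes the fifty tests are independent, which is the simplest case; the function names are illustrative.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive across k independent tests at level alpha."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-test significance level after a Bonferroni correction."""
    return alpha / k

fwe_50 = familywise_error(0.05, 50)               # uncorrected: about .923
alpha_b = bonferroni_alpha(0.05, 50)              # corrected per-test level
fwe_50_corrected = familywise_error(alpha_b, 50)  # stays just under .05

print(round(fwe_50, 3), alpha_b, round(fwe_50_corrected, 4))
```

The corrected per-test level is very strict, which is why the correction costs power in each individual test.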

4. Unreliability of Measures

Measurement error weakens the relationship between two variables and strengthens or weakens the relationship between three or more variables (Shadish et al. 2002, 55)
With three or more variables, unreliability of measures can lead to either false positives or false negatives
  What does that mean?
Particularly problematic in longitudinal studies that assess change over time
Remedies:
  Increasing the number of measurements
    More items to measure the same concept
    Multiple raters
  Improving the quality of measures
    Using validated scale items
    Training for raters
  Techniques like latent variable modelling
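The two-variable case can be illustrated with a simulation. The sketch below generates true scores with a correlation of about 0.6 and then adds random measurement error to both variables; the error variances, sample size, and seed are arbitrary assumptions.

```python
import random
from math import sqrt

random.seed(2)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

n = 5000
true_x = [random.gauss(0, 1) for _ in range(n)]
true_y = [0.6 * x + random.gauss(0, 0.8) for x in true_x]   # true correlation near 0.6

# Observed scores = true scores + random error with the same variance as the true scores
obs_x = [x + random.gauss(0, 1) for x in true_x]
obs_y = [y + random.gauss(0, 1) for y in true_y]

r_true = pearson_r(true_x, true_y)
r_observed = pearson_r(obs_x, obs_y)
print(round(r_true, 2), round(r_observed, 2))
```

In this setup half of the variance in each observed score is error, so the observed correlation should come out at roughly half of the true one.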

4. Unreliability of Measures (contd)

Structural equation modeling (SEM): family of statistical modeling techniques (e.g., confirmatory factor analysis, path analysis) to test theoretical models
Two main components
  Measurement model: uses observed variables (e.g., survey items) to define latent constructs (e.g., happiness, self-efficacy, intelligence)
  Structural regression model: system of simultaneous regression equations to estimate paths linking the latent constructs

5. Restriction of Range

Reduced range on a variable usually weakens the relationship between it and another variable (Shadish et al. 2002, 55)
Small range → Lower power
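A small simulation makes the weakening visible. The sketch below computes a correlation on the full range of a variable and again after keeping only units in a narrow band of it; the data-generating values, the band of plus or minus 0.5, and the seed are assumptions for illustration.

```python
import random
from math import sqrt

random.seed(3)

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.6 * a + random.gauss(0, 0.8) for a in x]   # same underlying relationship throughout

r_full = pearson_r(x, y)

# Restrict the range: keep only units whose x lies in a narrow band around the mean
kept = [(a, b) for a, b in zip(x, y) if -0.5 < a < 0.5]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])

print(len(kept), round(r_full, 2), round(r_restricted, 2))
```

The slope linking the two variables is unchanged, but the restricted sample shows a much weaker correlation and therefore offers less power to detect it.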
This problem can affect either the
  Independent variable (IV). Example:
    Comparing two similar treatments → Remedy: using different treatment doses and even full-dose vs. no treatment
  Dependent variable (DV). Examples:
    Dummy (dichotomous) variables
    Floor effects (respondents cluster near the bottom)
    Ceiling effects (respondents cluster near the top)
    Remedy: Using models that are appropriate for limited dependent variables (e.g., Tobit, truncated regression, Heckman)

6. Unreliability of Treatment Implementation

If a treatment that is intended to be implemented in a standardized manner is implemented only partially for some respondents, effects may be underestimated compared with full implementation (Shadish et al. 2002, 55)
Common in field experiments
It usually decreases effect size, but it can also increase the effect size when implementation is tailored to the recipients
Important to measure all components of the treatment package

7. Extraneous Variance in the Experimental Setting

Some features of an experimental setting may inflate error, making detection of an effect more difficult (Shadish et al. 2002, 55)
Example: Fire drill or concert downstairs during lab experiment
Particularly frequent in field experiments
When sources of extraneous variance cannot be controlled, we should measure them and include them in the statistical analysis

8. Heterogeneity of Units

Increased variability on the outcome variable within conditions increases error variance, making detection of a relationship more difficult (Shadish et al. 2002, 55)
Heterogeneity of respondents on an outcome variable increases standard deviations on that variable and on any other correlated with it → Weaker treatment effect
Remedies
  Sample units that are similar on characteristics correlated with outcome
    Potential risks:
      Lower external validity
      Limited range on DV
  Measure respondent characteristics that interact with a cause-effect relationship and use them for blocking or as covariates
  Within-participants designs comparing pre- and post-test scores for each participant

9. Inaccurate Effect Size Estimation

Some statistics systematically overestimate or underestimate the size of an effect (Shadish et al. 2002, 55)
Examples
  Outliers (departing from normal distribution) can dramatically decrease effect sizes
  Analyzing binary outcomes with effect size measures intended for continuous variables (correlation coefficient or standardized mean difference statistic) → Underestimation of effect size

Internal Validity

Internal Validity

The validity of inferences about whether observed covariation between X (the presumed treatment) and Y (the presumed outcome) reflects a causal relationship from X to Y as those variables were manipulated or measured

Internal Validity (contd)

Local molar causal validity (Campbell 1986)
  Local: Causal conclusions are limited to the context of the particular treatments, outcomes, times, settings, and persons studied
  Molar: Treatments are a complex package consisting of many components, all of which are tested as a whole

Threats to Internal Validity

Threats to Internal Validity

Typically, we infer from an effect to a cause by eliminating other possible causes (Mackie 1974, p. 67)
Threats to internal validity are those other possible causes
Different threats are not necessarily independent

1. Ambiguous Temporal Precedence

Lack of clarity about which variable occurred first may yield confusion about which variable is the cause and which is the effect (Shadish et al. 2002, 55)
Correlational studies are often unable to answer the question: Which came first, the chicken or the egg?
  Not always: e.g., it is unlikely that an increase in the sales of air conditioners increases outside temperature
Particularly tricky because some causation is bidirectional (reciprocal)
  High performance → Self-efficacy → Higher performance

2. Selection

Systematic differences over conditions in respondent characteristics that could also cause the observed effect (Shadish et al. 2002, 55)
Example
  A new drug is given only to patients who volunteer to take the new treatment
  The volunteering patients might differ from nonvolunteers in ways (e.g., sicker, older, etc.) that might affect the outcome
Random assignment eliminates selection bias because randomly formed groups differ only by chance

3. History

[External] Events occurring concurrently with treatment could cause the observed effect (Shadish et al. 2002, 55)
Example: A study of psychotherapy with depressed patients at the time a new antidepressant went on the market

4. Maturation

Naturally occurring changes over time could be confused with a treatment effect (Shadish et al. 2002, 55)
While maturation is internal, a natural course of things having to do with some quality of the participants in the study, history has to do with an external event of some kind
Example: We may think that an ineffective medicine works because patients get better by themselves

5. Regression Artifacts

When units are selected for their extreme scores, they will often have less extreme scores on other variables, an occurrence that can be confused with a treatment effect (Shadish et al. 2002, 55)
Test theory → Every measure has
  A true score component reflecting a true ability
  Plus a random error component that is normally and randomly distributed around the mean of the measure

5. Regression Artifacts (contd)

High scores will tend to have more positive random error pushing them up, low scores will tend to have more negative random error pulling them down
On the same measure at a later time, or on other measures at the same time, the random error is less likely to be so extreme
Examples
  A compensatory tutoring program for kids in the lowest 10 percent on a pretest will seem more effective than it actually is because those kids will tend to improve anyway in the post-test
  People tend to go to psychotherapy after a shock and organizations tend to hire consultants after a downturn; clients' measured progress is partly a movement back toward their stable mean as the temporary shock grows less acute
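The tutoring example can be sketched as a simulation in which nothing at all happens between pretest and posttest. Each score is a stable true score plus fresh random error; the score scale, error size, sample size, and seed are assumptions for illustration.

```python
import random
from statistics import mean

random.seed(4)

n = 10000
ability = [random.gauss(100, 10) for _ in range(n)]      # stable true scores
pretest = [a + random.gauss(0, 10) for a in ability]     # true score + random error
posttest = [a + random.gauss(0, 10) for a in ability]    # fresh error, no treatment given

# Select the kids in the lowest 10 percent on the pretest
cutoff = sorted(pretest)[n // 10]
lowest = [i for i in range(n) if pretest[i] <= cutoff]

pre_mean = mean(pretest[i] for i in lowest)
post_mean = mean(posttest[i] for i in lowest)
gain = post_mean - pre_mean

print(round(pre_mean, 1), round(post_mean, 1), round(gain, 1))
```

The selected group improves substantially on the posttest even though no program was delivered, while still scoring below the overall mean of 100.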

6. Attrition/Mortality

Loss of respondents to treatment or to measurement can produce artifactual effects if that loss is systematically correlated with conditions (Shadish et al. 2002, 55)
A special subset of selection bias occurring after the treatment is in place
Unlike selection bias, attrition is not controlled by random assignment
Example
  If those dropping out of a compensatory tutoring course are the low pretest scorers, by the end of the course the participants who remain will be the ones with higher academic skills
  By comparing the average pretest to posttest scores we would overestimate the effect of the course
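A sketch of this example, assuming a course with no true effect and dropout by everyone scoring below 45 on the pretest. The score distribution, dropout rule, and seed are invented for illustration.

```python
import random
from statistics import mean

random.seed(5)

n = 2000
pretest = [random.gauss(50, 10) for _ in range(n)]
posttest = [p + random.gauss(0, 5) for p in pretest]   # the course has zero true effect

# Low pretest scorers drop out before the posttest is administered
stayers = [i for i in range(n) if pretest[i] >= 45]

# Naive comparison: everyone's pretest average vs. the remaining participants' posttest average
naive_gain = mean(posttest[i] for i in stayers) - mean(pretest)

# Comparison restricted to the same people at both waves
matched_gain = mean(posttest[i] - pretest[i] for i in stayers)

print(len(stayers), round(naive_gain, 2), round(matched_gain, 2))
```

The naive comparison shows a sizable gain produced entirely by who dropped out, while the matched comparison stays close to zero.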

7. Testing

Exposure to a test can affect scores on subsequent exposures to that test, an occurrence that can be confused with a treatment effect (Shadish et al. 2002, 55)
Only in pretest-posttest designs
Example: People commonly improve on standardized tests such as intelligence tests, SATs, or GREs, due to practice, familiarity or other forms of reactivity

8. Instrumentation

The nature of a measure may change over time or conditions in a way that could be confused with a treatment effect (Shadish et al. 2002, 55)
Only in pretest-posttest designs
Whereas testing involves a change in the participant, instrumentation involves a change in the instrument
Examples
  The spring on a bar press might become weaker and easier to push over time
  Schools often use two different types of tests before and after a compensatory tutoring course (to reduce the testing threat); if the level of difficulty is not the same between the two tests, part or all of any pre-post difference is due to the change in instrument, not to the course

9. Additive and Interactive Effects of Threats to Internal Validity

The impact of a threat can be added to that of another threat or may depend on the level of another threat (Shadish et al. 2002, 55).
