Lecture Notes for EDM6401, Quantitative Methods in Educational Research


Chang Lei, Ph.D., Professor
1. An Overview of Educational Research
Four Ways of Knowing
Method of Tenacity
Truth is true because one believes it, even in the face of contradicting evidence.
Superstition.
Method of Authority
Truth is true because an authority says so. Religion.
Method of Intuition
Truth is true because it is logical. It derives from reasoning but does not bear empirical
support. Philosophy.
Method of Science
Science is a method of seeking truth. This method only accounts for solvable problems
that have empirical solutions based on observable events. Some major components of the
scientific method include:
empirical evidence vs. refusal of contradicting evidence
random sampling vs. isolated or selected events
countering rival explanations
replication and public inquiry
Truth is routinely challenged, tested, and retested by the public. There is no final truth,
only a temporary state in which a claim has not yet been disproved or rejected as untrue.
The "Low" Status of Social Sciences
Difficult to replicate
The more developed a discipline, the higher the probability of detection and replication.
Value judgement
Science is considered "pure" and free from value judgement; social research seldom is.
Too "obvious"
People tend to be uninterested in questioning what is already "obvious" to them. Nor
do people want to be bothered with what they do not know.
Thorndike: "That is the fate of educational research. If it comes out the way people
thought it should, they ask, 'What is the point?' If startling conclusions emerge, people say, 'I
do not believe it'."
Norms of Science
Universal Standards
The quality of research is judged by universal standards regardless of the experience,
race, sex, affiliation, or the characteristics of the researcher. e.g., Blind review process.
Common Ownership of Information
Scientific information is not proprietary but is owned and freely shared by all;
publication is not only a right but an obligation of a researcher. e.g., Data are to be shared on
request. "Publish or perish" enforces this norm.
Integrity in Gathering and Interpreting Data
The researcher displays disinterestedness and an impersonal tone when gathering data or
presenting a point of view, e.g., referring to oneself in the third person or as "the author" to
minimize personal reference.
Organized Scepticism
It is the responsibility of the community of scientists to be skeptical of each new
knowledge claim, to test it, to try to think of reasons the claim might be false, to think of
alternative explanations. This challenge to new knowledge is sought in science, e.g.,
conference debate, validation study.
The Roles and Outcomes of Research
Exploratory
Discover new phenomena and relationships among phenomena that are missed by
others. Qualitative research plays an important role here.
e.g., A counselling psychologist wants to know what things make an effective
counsellor.
Explanatory
Develop new theories or use existing theories to account for the observations.
e.g., Dollard and Doob (1939) theorized that frustration leads to aggression from the
observation that a child strikes out when deprived of a toy.
According to social learning theory, a good role model is important for school
achievement.
Validation
Validating and replicating existing research and theory is an important part of science.
Using different samples, populations, research methods.
Three Components of Educational Research Methodology
1. MEASUREMENT (PSYCHOMETRICS)
Instrumentation, Reliability, Validity
2. RESEARCH DESIGN
Sampling, Designs, Internal and External Validity
Experimental, quasi-experimental, and non-experimental research.
3. DATA ANALYSIS (STATISTICS)
Descriptive statistics, Hypothesis testing, and Various
analytical techniques
The Process of Scientific Inquiry
1. Identification of a research problem. (why)
2. Consult the literature for a solution. (find out why)
3. Formulation of testable hypotheses on the basis of existing theory and/or experience. (a
tentative solution)
4. Design a study with efforts to minimize extraneous factors that may contribute to the same
phenomenon or relationship you hypothesized. (design a study)
5. Data collection. When the behaviour of subjects is measured or observed, the
measurements or observations become empirical data. (carry out the study)
6. Data analysis. Data are summarized in such a way that the summary bears on the research
questions and hypotheses. Statistics are used to generalize from sample to population. (report
the findings)
7. Interpretation of data, adding to the existing body of knowledge. (this is why)
Research Report:
* Title Page
* Abstract
* Introduction
Problem
Significance
Justifications
Hypotheses which are integrated in the literature review
* Method
population
sample
procedures
measurement
designs
* Results
Present the results in the order of the hypotheses
* Discussion
* References
* Tables and Figures
The Process of Research:

1. INTRODUCTION SECTION
  Step 1. Identification of a research problem. (Why)
    -- Objectives and significance of study
  Step 2. Consult the literature for a solution. (Find out why)
    -- Literature review
  Step 3. Formulation of testable hypotheses on the basis of existing theory and research.
  (Here is a solution)
    -- Research questions and hypotheses; independent and dependent variables

2. METHOD SECTION
  Step 4. Design a study to minimize extraneous factors that affect the same phenomenon
  or relationship you hypothesized. (A plan to test the solution)
    -- Sample, design (experimental, quasi-, non-experimental), procedure, validity threats
  Step 5. Data collection. When the behaviours are experimentally manipulated or
  observed, the outcomes become data. (Carry out the plan.)
    -- Measurements, reliability and validity

3. RESULTS SECTION
  Step 6. Data analysis. Data are summarized in such a way that the summary bears on
  the research questions and hypotheses. (Report it)
    -- ANOVA vs. regression framework, significance test, confidence interval

4. DISCUSSION SECTION
  Step 7. Interpretation of data, adding to the existing body of knowledge. (Why? This is why.)
    -- Theory & explanation, limitations and future directions

Ways to Locate a Research Problem


1. Identify broad areas that are closely related to your interests and professional goals and write
them down.
2. Then, from among the areas that relate to your future career, choose an area or a research
topic that is feasible.
3. Collaborate with other people; join on-going projects.
4. Read textbooks, where rather comprehensive topics in a field are summarized and problems
and future research needs are identified; journal articles, for the state of the art of the field and
the authors' recommendations; review articles, for both.
5. Test a theory.
6. Replication. Replicate a major milestone study. Replicate studies using different populations,
samples, or methods.
7. Observations. Observe carefully the existing practices in your area of interest.
8. Develop research ideas from advanced courses you take.
9. Get ideas from newspapers and popular magazines.
Variable and Constant
An attribute or characteristic of a person or object that varies from person to person,
object to object. A constant is an attribute that does not vary from person to person. Student
achievement, motivation, blood pressure, etc. Pi, in relation to a particular population, age is a
constant for the 6th graders, religion for parochial schools, gender for male prisoners, etc.
When you ask a research question, you ask about variables, you want to know the relationship

Dr.Chang

among variables. Why don't students learn, is it because they are in poor health, not motivated,
distracted by family problems, by crime, the teachers are not qualified, etc? You end up with a
question regarding the relationship among variables. Is there a relationship between motivation
and achievement.
The independent variable represents the research interest and is manipulated (experiment)
or measured (non-experiment) to see the effect of its change on the dependent variable.
The dependent variable is the observed outcome in response to the independent variable.
It is used to evaluate the independent variable.
A control variable is a variable that is either made into a constant or is included in the
study (even though it is not of interest) to control or neutralize factors extraneous to the
research question.
Operational Definitions
Assign meaning to a construct or a variable by specifying the activities or "operations"
necessary to MEASURE or MANIPULATE it.
Redefine a concept in terms of clearly observable operations that anyone can see and
repeat. These observable and replicable operations can take the form of an experiment or of a
measurement instrument.
As Cronbach and Meehl (1955) point out, it is rare for a construct to receive one
commonly endorsed operational definition. To some researchers, hunger is defined in an
animal experiment as "amount of time since last feeding." It may also be defined by others as
"amount of energy an animal would expend to seek food." Thus, it is important to be
operationally clear about a particular construct, so that other researchers understand, for
example, what the construct "hunger" is intended to mean.
Measured operational definition (more often used):
Intelligence is defined as scores on the Woodcock-Johnson Test of Cognitive Abilities.
Vagueness of lecturing is defined as using the following words: A couple, a few,
sometimes, all of this, something like that, pretty much.
School achievement is defined as one's GPA.
Socioeconomic status is defined by the number of years of education and the amount
of salary the head of a family receives.
Popularity is defined operationally by the number of friendship nominations a student
receives from his/her school mates.
Experimental operational definition:
Recall is defined by asking subjects to recite items shown to them from a stimulus list
and assigning a point for each item that matches one on the list.
Recognition is defined by showing subjects items and asking them to decide whether
they were part of the stimulus list.
Aggression is defined as the number of times a child hits a toy doll after watching a
violent TV show.
Distinguishing Between Two General Types of Literature Reviews
There are two general types of literature reviews, each possessing unique as well as
common characteristics. Making the distinction prior to embarking on the review is important
to both your own mental health and the quality of the product. The two types are:
1.A critical review of a literature
2.A review of literature relevant to a research proposal

Following are some "is" and "is not" characteristics of each type:


Critical Review of a Literature
Is a place where you may review the body of literature that bears on a problematic area -- or
even examine all the research that relates to the specific question raised in a research
proposal.
Is an activity the product of which is devoted to a critical retrospective on scholarship -- and
publishable in such journals as Psychological Bulletin and Psychological Review.
Is not encumbered with supporting the conceptual framework of a proposed study or of
justifying study design and methodology decisions.
Review of Literature Relevant to a Research Proposal
Is an obligation to place the question or hypothesis in the context of previous work in such a
way as to explain and justify the decisions made.
Is a product that reflects a step-by-step explanation of decisions, punctuated by references to
studies that support the conceptual framework and ongoing argument.
Is not a product to educate the reader concerning the state of science in the problem area; nor
is it to display the thoroughness with which the author pursued a comprehensive understanding
of the literature.
How to Write a Lit Review
Break up the review into several topic areas.
Organize all the findings under the various topics into a unified picture of the state of
knowledge in the area reviewed. The process of combining and interpreting the literature is
more difficult than merely reviewing what has been done.
Use two to three studies that are most pertinent and well done as foundations of your
review topics. Use similar studies as support.
Write the review as if you are expressing your own thoughts and developing and
building your own arguments and themes but not as if you are reporting others' work.
Don't do an article-by-article listing of findings.
Don't use the same reporting format throughout, e.g., "Baker found..."
Rather than citing everything in an article in one place, cite an article multiple times to
fit different themes of yours.
Write down your thoughts and paraphrase important points of the articles as you read.
It may not be a good idea to read all the articles and then write.
Look over the articles before copying them. Read several carefully before looking for
more.


2: Hypothesis Testing
1.The hypothesis should state an expected relationship between two or more variables.
2.The researcher should have definite reasons based on either theory or evidence for
considering the hypothesis worthy of testing.
3.The hypothesis should be testable. The relationship or difference that is stated in a hypothesis
should be such that measurement of the variables involved can be made and necessary
statistical comparisons carried out in order to determine whether the hypothesis as stated is or
is not supported by the research.
4.The hypothesis should be as brief as possible.
There is a gender difference in the perception of body sensations.
Women and men use physiological cues (internal) and situational factors (external)
differently in defining bodily state.
Women, compared to men, make greater use of external cues in defining their body
sensations.
There is a relationship between information processing techniques and subsequent recall
of information.
Visual imagery has a greater enhancing effect on recall than verbal recitation.
People tend to apply dispositional attribution to account for behaviours of others and
use situational attribution to explain behaviours of themselves.
Induced self-consciousness enhances recall of personal information.
Teachers who use specific feedback during lectures obtain higher pupil achievement
gains than teachers who use general feedback.
High intimacy self-disclosing statements would be more effective in counselling than
low intimacy self-disclosing statements.
Hypothesis Testing
A hypothesis is always about a population.
Testing a hypothesis means drawing an inference from a random sample to the population
from which the sample is taken.
1. Research hypothesis reflecting your verbal reasoning. The wording often reflects the
research design.
e.g., There is a relationship between motivation to learn and math achievement.
Girls have higher math achievement than boys.
The effect of induced public self-consciousness is stronger among adolescents than among adults.
2. Statistical hypothesis reflecting the statistic used to summarize your observations.
e.g., ρ > 0: There is a positive correlation between motivation to learn and math
achievement. The statistic of correlation is used to summarize data.
μg > μb: The mean math achievement of girls is higher than the mean of boys. The mean is used
to summarize data.
3. Null hypothesis representing a way to test the statistical hypothesis.
μg = μb: The mean math achievement of girls is the same as the mean of boys.
ρ = 0: There is no correlation between motivation to learn and math achievement.


4. Statistical tests are conducted with the assumption that the null hypothesis is true.
What is the probability of finding a positive correlation when the truth is there is no
correlation?
What is the probability of finding a difference between the two means when there is no
difference?
Statistical Significance
The probability level at which you will reject the null hypothesis, or, at which you will
allow yourself the risk of wrongly rejecting the null hypothesis.
Type I Error
Significance level is also Type I error rate. It is the probability of rejecting the null
hypothesis when the null hypothesis is true. You make such an error only when the null is
rejected.
Type II Error
It is the probability of not rejecting the null hypothesis when the null hypothesis is false.
You make such an error only when you fail to reject the null hypothesis.
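To make the two error rates concrete, here is a minimal simulation sketch in Python (NumPy
and SciPy assumed available; all numbers are invented for illustration). Both samples are drawn
from the same population, so the null hypothesis is true by construction, and the long-run
proportion of (wrongful) rejections comes out near the significance level:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha = 0.05
    rejections = 0
    n_studies = 10_000

    for _ in range(n_studies):
        # Two samples from the SAME population: the null is true.
        a = rng.normal(loc=50, scale=10, size=30)
        b = rng.normal(loc=50, scale=10, size=30)
        t, p = stats.ttest_ind(a, b)
        if p < alpha:  # rejecting a true null is a Type I error
            rejections += 1

    print(rejections / n_studies)  # approaches alpha = .05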
Sampling distribution of means (or any statistic)
Is an imagined or theoretical distribution of an infinite number of means computed from
random samples of the same size. Because of the central limit theorem, this distribution is used
as a probability distribution to determine the probability of obtaining a mean larger than or as
large as (in absolute value) the one computed from your sample.
Central limit theorem
1. If repeated random samples of size n are drawn from a normally distributed
population, the distribution of the sample means is normal.
2. As the sample size increases, regardless of the shape of the population distribution,
the sampling distribution of means approximates normality.
3. The mean of the sampling distribution of means equals the population mean.
4. The standard deviation of the sampling distribution of means equals the population
standard deviation divided by the square root of sample size. This is called standard error of
means.
If population variance is not known, sample variance can be used as an estimate of
population variance in computing the standard error.
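The theorem is easy to check by simulation. Below is a minimal sketch in Python (NumPy
assumed; the population and sample sizes are invented): repeated samples are drawn from a
deliberately skewed population, and the mean and standard deviation of the resulting sample
means are compared against points 3 and 4 above.

    import numpy as np

    rng = np.random.default_rng(1)
    # A clearly non-normal (skewed) population, per point 2.
    population = rng.exponential(scale=2.0, size=100_000)

    n = 50
    means = [rng.choice(population, size=n).mean() for _ in range(5_000)]

    print(np.mean(means), population.mean())             # ~equal (point 3)
    print(np.std(means), population.std() / np.sqrt(n))  # ~equal (point 4)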
Four steps in hypothesis testing (with an example):
1. State the null and alternative hypotheses. e.g., H0: μ1 - μ2 = 0; H1: μ1 - μ2 > 0.
2. Set the level of statistical significance, which is the probability at which you'll reject the
null or at which you'll allow yourself to make the Type I error. e.g., α = .05, with critical
value t(.05, 28) = 1.70.
3. Compute the test statistic, which can be a t-test, z-test, F-test, chi-square, etc. e.g.,
t(28) = 2.85.
4. Decision about the null. If you reject the null, you may make a Type I error, the probability
of which is set at step 2. If you do not reject the null, you are running the risk of making a
Type II error, the probability of which can be calculated if you know certain parameters. e.g.,
t(28) = 2.85 > 1.70: reject the null and support your research (alternative) hypothesis.
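The same four steps can be traced in a minimal Python sketch (SciPy assumed). Two groups
of n = 15 each give df = 28, matching the example above; the scores here are simulated, but in
a real study they would be your data.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    group1 = rng.normal(55, 10, size=15)  # hypothetical treatment scores
    group2 = rng.normal(45, 10, size=15)  # hypothetical control scores

    # Step 1: H0: mu1 - mu2 = 0 versus H1: mu1 - mu2 > 0 (one-tailed).
    # Step 2: alpha = .05; the critical value t(.05, 28) is about 1.70.
    alpha = 0.05
    critical_t = stats.t.ppf(1 - alpha, df=28)

    # Step 3: compute the test statistic.
    t, p_two_tailed = stats.ttest_ind(group1, group2)
    p_one_tailed = p_two_tailed / 2  # one-tailed p, valid when t > 0

    # Step 4: decision about the null.
    print(t, critical_t, p_one_tailed)
    if t > critical_t:
        print("Reject the null; the research hypothesis is supported.")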

An Example
Hypothesis:
People high in public self-consciousness are more conforming to perceived social norms
on gender roles (than those low in public self-consciousness).
There is a relationship between public self-consciousness and gender role conformity.
Independent variable is public self-consciousness.
Dependent variable is gender role conformity.
Operational definitions:
Public self-consciousness is measured by the Self-Consciousness Scale (Fenigstein,
Scheier, & Buss, 1975; Scheier & Carver, 1985).
Gender role conformity is defined by the following operations: Ten gender role
attitude questions were used to first determine participants' own standings on these gender
role questions. The participants were then informed of the mean ratings of their peers on these
gender role questions and were asked to re-assess their attitudes toward these gender roles.
Conformity to social norms on gender roles is measured by the difference score between the
two self-assessments on the ten gender role questions.
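As an illustration only, here is a minimal Python sketch of one plausible scoring of this
difference score (NumPy assumed; all ratings are invented, and treating movement toward the
peer means as conformity is an assumption of this sketch, not a prescription from the scale):

    import numpy as np

    first  = np.array([4, 5, 3, 6, 2, 4, 5, 3, 4, 6])  # before seeing norms
    second = np.array([4, 4, 4, 5, 3, 4, 4, 4, 4, 5])  # after seeing norms
    peers  = np.array([3, 4, 4, 4, 3, 4, 4, 4, 4, 4])  # peer mean ratings shown

    # Conformity: how far each re-assessment moved toward the peer mean.
    conformity = np.abs(first - peers) - np.abs(second - peers)
    print(conformity.sum())  # larger totals indicate greater conformity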
Statistical hypothesis:
μhigh public - μlow public > 0
This implies that the statistic, the mean, is used to summarize sampled data that bear on the
hypothesis.
or
ρ > 0, implying that the statistic, the correlation, is used to summarize data.
Null hypothesis:
μhigh public - μlow public = 0
or
ρ = 0
Significance level: α ≤ .05
Hypothesis testing rationale:
The hypothesis is regarding population.
The null assumes that there is no mean difference between the two populations
(groups).
My hypothesis assumes that there is a mean difference (in the direction hypothesized)
between the two populations.

The purpose of hypothesis testing is to make the qualitative decision regarding whether
my samples are taken from the populations defined by the null (decision: accept null and your
research hypothesis is not supported) or are taken from the populations defined by the
alternative (research) hypothesis (decision: reject null and your research hypothesis is
supported).
The hypothesis testing starts with the assumption that the null is true. Even though the
null is true, there is a good chance that, due to sampling fluctuation, you will find some small
magnitudes of difference in your samples. The chance for you to find large differences,
however, should be very small. In fact, such chance is so small that you should no longer
attribute the difference to sampling fluctuations but to the possibility that the null is not true.
This chance is the probability associated with your computed sample statistic. As this
probability gets smaller, you grow more doubtful about the truth of the null to the point that
you make the qualitative decision that the null is not true (reject the null). This point is your
significance level and your decision is associated with the possibility of a type I error.
Another example:
Induced state of public self-consciousness increases gender role conformity.
State of public self-consciousness is the independent variable. It is induced by
videotaping the participants while they are assessing their gender role attitudes in relation to
these attitudes of their peers. The knowledge that they are on camera induces the state of
public self-consciousness.
3: Measurement and Testing
Reliability
Classical test theory, also known as true score theory, is mostly concerned with test
reliability, or the reliability of observed scores of a test in measuring the underlying true
abilities or true scores. Reliability can be defined as the strength of the relationship between observed
scores and true scores.
If we were to administer a test to the same person under all different conditions at
different times using different items, there would be different observed scores. The mean of all
these observed scores is the person's true score, or true ability or personality.
In reality, we only give the person one test and there is only one observed score. This
score can be seen as a random variable, or as a randomly sampled observation from a distribution
of all possible observed scores. The observed score can be seen as consisting of the mean of
the distribution (or the true score) and a deviation from the mean which is called error or error
score. Thus, x = t + e.
The extent to which an observed score represents the true score is reliability. We can
use the Pearson product moment correlation, ρ, to describe the strength of the relationship
between observed scores and true scores, i.e., reliability. Thus, ρxt is called the reliability index.
(Note ρxt is not the reliability coefficient.) Of course, we don't know the true scores and thus
cannot solve for ρxt. But assumptions can be made that enable the solving of ρxt. These
assumptions, which are not discussed here, make up the classical test theory.
With the assumptions, we can numerically estimate the reliability of a test without
knowing its true scores. First, we define the square of the reliability index, numerically, as the
ratio between true score variance and observed score variance. This is called the reliability
coefficient. (Note this is ρxx or ρxx'.) Second, we can estimate the reliability coefficient by
simply correlating two parallel tests (or two forms of a test, two halves of a test, or two
administrations of a test). The result is a Pearson correlation coefficient, r, which is an estimate
of the reliability coefficient. Keep in mind that the meaning of a Pearson r, when used as an
estimate of the reliability coefficient, is really that of r², representing the proportion of the
observed score variance that is true score variance. r ranges from -1 to 1 whereas r² ranges
from 0 to 1. That is why r, as a reliability estimate, ranges from 0 to 1 but not -1 to 1.
Depending on which kinds of two tests are being correlated to arrive at the reliability
estimates, these estimates are given different names as shown below:
Test-retest reliability (coefficient of stability)
Correlate two administrations of the same test.
Parallel form reliability (coefficient of equivalence)
Correlate two forms of the same test.
Split half reliability (Spearman-Brown prophecy formula)
Correlate two halves of the test.
Internal consistency reliability (Cronbach α)
Correlate every item with every other item.
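As a concrete illustration of internal consistency, here is a minimal Python sketch (NumPy
assumed; the item scores are invented) computing Cronbach's α = k/(k - 1) × (1 - Σ item
variances / total-score variance):

    import numpy as np

    # Rows are examinees, columns are items on the test.
    scores = np.array([[4, 5, 4, 5],
                       [2, 3, 2, 2],
                       [5, 5, 4, 4],
                       [3, 2, 3, 3],
                       [4, 4, 5, 4]])

    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores

    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(alpha)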
When ρxx' = 1,
1. the measurement has been made without error (e = 0 for all examinees).
2. X = T for all examinees.
3. all observed score variance reflects true-score variance.
4. all differences between observed scores are true score differences.
5. the correlation between observed scores and true scores is 1.
6. the correlation between observed scores and errors is zero.

When ρxx' = 0,
1. only random error is included in the measurement.
2. X = E for all examinees.
3. all observed score variance reflects error variance.
4. all differences between observed scores are errors of measurement.
5. the correlation between observed scores and true scores is 0.
6. the correlation between observed scores and errors is 1.

When ρxx' is between zero and 1,
1. the measurement includes some error and some truth.
2. X = T + E.
3. observed score variance includes true-score and error variance.
4. differences between scores reflect true-score differences and error.
5. the correlation between observed scores and true scores is the square root of the
reliability coefficient (the reliability index).
6. the correlation between observed scores and errors is the square root of (1 - reliability).
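The model x = t + e can itself be simulated. In the minimal Python sketch below (NumPy
assumed; all parameters invented), true scores are generated directly (something never possible
in practice), so one can verify that the reliability coefficient, true-score variance over
observed-score variance, equals the squared correlation between observed and true scores:

    import numpy as np

    rng = np.random.default_rng(3)
    t = rng.normal(100, 15, size=10_000)  # true scores (unknowable in reality)
    e = rng.normal(0, 10, size=10_000)    # random error, uncorrelated with t
    x = t + e                             # observed scores

    reliability = t.var() / x.var()       # true variance / observed variance
    r_xt = np.corrcoef(x, t)[0, 1]        # the reliability index
    print(reliability, r_xt ** 2)         # agree, apart from sampling noise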

Validity
The validity of the use of a test refers to the extent to which the test truly measures
what it is expected to measure. For example, the use of a bathroom scale to measure weight is
valid whereas the use of a bathroom scale to measure height is invalid. Commonly discussed
validities include content, construct, and predictive validity.
Kinds of Validity Evidence
Content validity refers to the extent to which the items on a test are representative of a
specified domain content. For example, a test that is intended to measure the content of this
course should contain items about reliability, validity, intelligence and personality tests. If this
test is made up of items on calculus or matrix algebra, the test will have no content validity.
Achievement and aptitude (but not personality and attitude) tests are concerned with content
validity.
Construct validity refers to the extent to which items on a test are representative of the
underlying construct, e.g., personality or attribute. Personality and attitude tests are concerned
with construct validity. The process to establish construct validity is referred to as construct
validation. Construct validation is complicated, involving testing hypotheses concerning the
theories from which the test is derived. A common practice is to compare high scorers with low
scorers on the test with respect to some external behavior which is hypothesized to correlate
with the test.
Construct validity has often been narrowly interpreted as providing evidence for the
internal structure of a test. Another term for this narrow definition of construct validity is
factorial validity because such validity evidence is gathered through factor analysis. The correct
definition of construct validity refers to gathering evidence for a broad nomological network
of relations.
In 1959, Campbell and Fiske published an important paper, which has become one of the
most cited papers in psychology. In this paper, they conceptualize validity issues in a Multitrait-
Multimethod (MTMM) correlation matrix. In this MTMM matrix, convergent validity (CV) is
the correlation between different methods of measuring the same trait. This correlation
coefficient should be high. Discriminant validity (DV) is the correlation between different traits
obtained by the same method. A low value indicates evidence of validity. The correlation between
different traits obtained by different methods (HH, heterotrait-heteromethod) should be the
lowest in value. This MTMM concept has since become an important part of construct validity,
both in terms of its narrow definition and the broad definition.
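Here is a minimal sketch of reading an MTMM matrix, in Python (NumPy assumed). The two
traits, two methods, and every correlation below are invented purely to mark which cells carry
which kind of validity evidence:

    import numpy as np

    # Order: anxiety/self-report, extraversion/self-report,
    #        anxiety/peer-rating, extraversion/peer-rating
    R = np.array([
        [1.00, 0.30, 0.65, 0.10],
        [0.30, 1.00, 0.12, 0.60],
        [0.65, 0.12, 1.00, 0.25],
        [0.10, 0.60, 0.25, 1.00],
    ])

    # Convergent validity: same trait, different methods -- should be high.
    print("CV:", R[0, 2], R[1, 3])
    # Discriminant validity: different traits, same method -- should be low.
    print("DV:", R[0, 1], R[2, 3])
    # Heterotrait-heteromethod -- should be the lowest values of all.
    print("HH:", R[0, 3], R[1, 2])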
Criterion related validity, including predictive validity and concurrent validity, refers to
the extent to which a test correlates with future behaviors which the test is intended to predict.
For example, the HKALE is intended to select students who are capable of university studies.
For the HKALE to have predictive validity, its test scores should correlate with undergraduate
performance, such as the GPA. Predictive validity is simply the correlation between the test and
a criterion measure the test is intended to predict. The correlation coefficient is also called
validity coefficient. Sometimes, because of the lack of a criterion, one correlates the test with
another test that purports to measure the same thing. This is called concurrent validity.
Making Sense of Validity
Validation is an ongoing process where one keeps accumulating validity evidence. But
one is not capable of obtaining all the evidence at once. Certain evidence is never attempted.
For different uses of a test, the gathering of some evidence becomes more important than
others, giving rise to different validity concepts and procedures. Over time, this discriminating
use of evidence in relation to different types of tests becomes a tradition, so that a certain test and
its use are routinely associated with one kind of validity evidence but not with other kinds. For
example, most education achievement tests are only concerned with content validity; i.e.,
whether the test items are representative of what has been taught within a specified domain
content. One purpose of education is to create a competent work force and, thus, a valid test
of such education achievement should be correlated with future job performance. However,
such criterion-related validity evidence is seldom gathered for an education achievement test in
part because there is also a strong public mentality that students must be assessed for what they
have learned. As another example, the content validity of a personality test is never
questioned. The validity concern with a personality test lies in establishing the linkage between
the test items and the underlying trait structure as defined by a theory within which the test is
conceptualized. Factor analysis is often used to see if the items form clusters according to the

Dr.Chang 13
theory and to see if the items correlate with measures of other constructs according to
specified patterns of relationships defined by the theory. These efforts are referred to as
construct validation.
To put it simply, content validity is associated with an achievement test just as construct
validity is with a personality or attitude test. The former is concerned with the
representativeness of the items with respect to a specified domain content. The content
validation procedures are qualitative or judgmental. The latter is concerned with the
representativeness of the items with respect to a defined theoretical construct. The procedures
are referred to as construct validation which may involve different data collection techniques
and strategies, some of which could be thought of as a different validity procedure, e.g.,
criterion related validity. Finally, some tests do not need validity evidence. For example, a test
of typing which is administered to measure the status of an individual's typing skills is its own
criterion. Such "obviously valid" tests do not need additional validity evidence.
An important issue associated with criterion validity is what is referred to as the
restriction of range effect. The validity coefficient is a correlation coefficient, the magnitude of
which depends on the ranges of scores for the predictor variable (test) and the criterion (a
future behavior to be predicted by the test). A correlation based on the full range of scores will
always be higher than one based on a restricted range of scores, independent of the true
predictability of the test. In reality, most of such tests are used to make selections (of students
or employees for example). The validity study is always conducted on those who passed the
test and were, thus, given the opportunity to demonstrate the future performance. Thus, the
validity is based on a restricted range of scores rather than the full range. The real predictive
validity of a test should always be higher than what is obtained from a validity study. There
are, however, statistical procedures to adjust the validity coefficient.
Another important issue is the relationship between validity and reliability. Remember
that ρxx' = ρ²xt: the reliability coefficient is the squared correlation between the observed scores
and the true scores. If ρxx' = .81, then ρxt = .90 (the correlation between the observed scores and
true scores is .90). In general, ρxt > ρxx'. That is, an observed score will correlate higher with its
own true score than with an observed score on a parallel test. Because a test cannot correlate
more highly with any other test or variable than with its own true score, the maximum
correlation between an observed score and another variable is ρxt, the square root of the
reliability coefficient.
Sometimes, the distinction between reliability and validity is blurred. For instance,
Lindquist (1942) defined validity as the correlation between the fallible (the test to be
validated) and infallible measure of a trait. Assuming the infallible criterion to be perfectly
reliable and the (fallible) test to be perfectly representative of the criterion, the maximum
validity of the (fallible) test would have to be the correlation of the observed test scores with
true scores on the test itself which is the square root of the reliability of the fallible test. This is
basically the criterion related validity -- correlation between the observed scores of the test and
observed scores of the criterion.
4: Research Validity and Sampling Techniques
Research Validity
Research validity can be simply understood as the quality of a research study. There are
two kinds of quality issues that are referred to as internal validity and external validity. When we
try to pinpoint the cause of a phenomenon or behavior either by experimentally manipulating the
independent variable or by measuring or observing the independent variable, we are trying to
demonstrate that it is the independent variable that causes changes in the outcome variable or
in the dependent variable. Internal validity is about the extent to which we can make this causal
inference. Internal validity is the extent to which the outcomes of a study result from the
variables which were manipulated, measured, or selected in the study rather than from other
variables not systematically treated. e.g., there might not be a relationship between televised
violence and aggressive behavior; rather, children who watch violent programs are more
aggressive in the first place. Or, different brands of deodorants tested under the left and right
arms are given unequal testing conditions. An expectancy effect will also make people feel the
advocated brand lasts longer. In other words, internal validity is about how confident we are
about the stated causal relationship between the independent variables and the dependent
variable. That is, the dependent variable is due to the independent variable and not due to
something else. To improve internal validity and thus the quality of a research study, we need
to be able to rule out alternative causes that may have done the same thing to the dependent
variable as does our independent variable. Some of the commonly encountered alternative
causes include:
History: Events take place during the study that might affect its outcome in the same
way that the independent variable is hypothesized to affect the outcome. e.g., a study
examined whether a certain leadership training program was effective in enhancing students'
sense of competitiveness; during the three months of the experimental study, the TV show
Survivor was on, which could have made the students more aware of competition and more
experienced with some competition strategies.
Maturation: Especially for developmental studies, where children grow with the
passage of time to become more mature in certain developmentally related abilities. e.g., a
study showing students' vocabulary increase from a dialectical reading program may not have
internal validity due to maturation because children's vocabulary increases with time
independent of the training program.
Testing: When people are measured repeatedly, e.g., pretest-posttest, they become
better not because of the independent variable but because they become test-wise. e.g., a
study showing math scores improved over the pretest due to a new teaching method might not
have internal validity because students work much faster or better because of their experience
with the pretest.
Instrumentation: The effect on the dependent variable is not due to the independent
variable but due to aspects of the instrument used in the study. Like the testing threat, this one
only operates in the pretest-posttest situation. e.g., observed change from pretest to posttest is
due not to a math program that is being experimented but rather to a change in the test that
was used. The posttest could simply be easier. Instrumentation threats are especially likely
when the "instrument" is a human observer. The observers may get tired over time or bored
with the observations and thus become more lenient or more stringent or more conservative or
liberal in their ratings. Conversely, they might get better at making the observations as they
practice more and become more accurate than they were at the beginning. In either event, it
is the change in instrumentation, not the independent variable, that produces the observed
change in the dependent variable.
Regression towards the mean: Particularly problematic when subjects are chosen
because of extreme scores where high scoring individuals are more likely to score lower and
low scoring individuals are likely to score higher the next time they are tested merely due to
random measurement error. e.g., a study showing a teacher is effective in improving the scores
of students who are at the bottom in the first term may not have internal validity due to the
regression towards the mean artifact because by chance alone these students at the bottom will
tend to increase (there is a greater chance for top students to fall towards the mean; while
bottom students have a greater chance to rise towards the mean).
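Regression toward the mean is easy to demonstrate by simulation. In the minimal Python
sketch below (NumPy assumed; all parameters invented), no treatment of any kind is given,
yet the lowest pretest scorers still "improve" at posttest simply because both tests measure the
same ability with random error:

    import numpy as np

    rng = np.random.default_rng(4)
    ability = rng.normal(50, 10, size=10_000)           # stable true ability
    pretest = ability + rng.normal(0, 5, size=10_000)   # noisy measure
    posttest = ability + rng.normal(0, 5, size=10_000)  # equally noisy, no treatment

    bottom = pretest < np.percentile(pretest, 10)       # lowest 10% at pretest
    print(pretest[bottom].mean(), posttest[bottom].mean())
    # The bottom group rises toward the mean through error alone.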

Selection: Results are due to the assignment of subjects to the different treatment or control
groups, not to the independent variable that distinguishes the groups. e.g., you want to compare a new
coaching method against the existing method used in PE lessons and recruit volunteers to
participate in three days of training of the sport. You then compare the results with those of
some of the existing PE lessons. Volunteers are simply more motivated than those attending
regular PE class and thus may produce better training outcome (dependent variable) which has
little to do with training method (independent variable).
Mortality: Especially for longitudinal studies that last for an extended period of time,
attrition, or dropping out of the study in a non-random manner, may affect the outcome of the
study. That is, some participants no longer want to continue with the study, and the results are
based on those who stayed in the study, who are different in some fundamental ways from
those who dropped out. e.g., a special intervention may be tried out in a school but the
results show that the students' average HKCE results are much worse than those from
regular schools; this study may not have internal validity due to mortality because the weakest
students may have dropped out from the regular schools.
Diffusion or imitation of treatment: The control group or one of the treatment groups
somehow ends up receiving some of the same treatment as the other groups, resulting in few
differences among the treatments. e.g., a study shows that a new teaching method (tried out in
one class) does not lead to better achievement than the traditional method (in another class) in
part because students from the two classes constantly compare notes and exchange
information.
The Hawthorne effect refers to the fact that when participants receive unusual treatment in
a field experiment, they may temporarily change their behavior or performance not because of
the manipulation of the independent variable but because of the special attention they received
during the experimentation. The term gets its name from a factory called the Hawthorne
Works, where a series of experiments on factory workers was carried out between 1924 and
1932. Among the many experiments, one that stands out and is often talked about is the
illumination experiment, where researchers came to the factory to change the lights all the time
(and probably made casual conversation with the workers). In comparison to the control
group, which did not experience lighting changes, productivity of the experimental group
seemed to increase independent of how the lighting was adjusted.
John Henry effect: Whereas the Hawthorne effect is due to some unusual performance
of the experimental group, sometimes the control group may also put up an extraordinary
performance to outperform the experimental group out of a sense of demoralization for not
being included in the special treatment (experimental) group. The term comes from the true
story of a railroad worker, John Henry, who probably felt threatened when a spiking machine
(used to drive the spikes that stabilize the rails) was introduced to replace the manual work; he
outperformed the machine but later died of a heart attack.
External Validity
External validity is the extent to which the findings of a particular study can be
generalized to people or situations other than those observed in the study. Can findings from
the laboratory be applied to the real world, where behavior is influenced by many factors that
were controlled in the lab? All the factors threatening internal validity can be
controlled in a lab, but the results may become less generalizable outside the lab. e.g., a study
on a new teaching method with a certain group of S.4 students does not have external validity
if the results cannot be generalized to other topics taught by the same new method, or to other
S.4 students. Many threats to external validity can be understood in the form of an interaction:
Treatment-attribute interaction: Certain personality and other characteristics may
interact with the independent variable so that the effect of the independent variable may be
different on people having different personality characteristics. e.g., A study showing
democratic parenting style improves academic achievement may lack external validity if we
can argue and show that less parental demand and supervision only work with highly
motivated students but not other students.
Treatment-setting interaction: The independent variable may interact with other
external factors or contexts to result in different effects for different settings so that the effect
cannot be generalized to all settings. e.g., The positive effect of democratic parenting on
school achievement may lack external validity if democratic parenting (independent variable)
works only when schools provide clear structure and strict supervision regarding student
learning.
Pretest sensitization: The effect of the independent variable may be due to the pretest
which serves to sensitize the subjects whereas in the population (the real world to which the
findings are to be generalized), there is no pretest and thus the treatment (independent
variable) may not work as it does in the study. A study showing that a math teaching method
worked because the posttest improved over the pretest may lack external validity if the
pretest helped the teachers and students to identify learning difficulties or made them more
aware of certain weaknesses and the teaching method focused on those weaknesses.
Posttest sensitization: The effect of the independent variable is partly due to the sensitization
or exercising effect of the posttest, which is not available in the population to which the results
are to be generalized. e.g., in the above study on a new math teaching method, the posttest
served to reinforce the effect of the teaching method, and thus the students in the study
improved their math performance because they had both the new teaching method and the
posttest.
Sampling Techniques
An element is an object on which a measurement is taken. It is not the person or thing
but a particular measurement of the person or thing that is of interest, e.g., a person's height,
class size.
A population is all the elements in a defined set about which we wish to make an
inference. Examples of target population vs. experimentally accessible population are heights
of Chinese vs. heights of Shanghai residents.
Sampling units are non-overlapping collections of elements from the population. The
sampling unit can be the element.
Sampling frame is a list of sampling units.
A sample is a collection of sampling units drawn from a frame.
Simple random sample
A simple random sample of size n is drawn from a population of size N in such a way that
every possible sample of size n has the same chance of being selected.
M = Σxi / n is the sample estimate of the population mean, μ.
Stratified random sampling
The population of N units is divided into subpopulations of N1, N2, ..., Nh units which are
non-overlapping so that N1 + N2 + ... + Nh = N. The subpopulations are called strata. A sample is
drawn from each stratum. The samples are denoted as n1, n2, ..., nh. If a simple random sample is
taken from each stratum, the whole procedure is called stratified random sampling.
When the population is heterogeneous, stratified sampling increases the precision in
estimating population parameters. Breaking up the population makes each stratum
homogeneous (measurement varies little among units) so that a small sample is needed to
estimate the population characteristics of the stratum. These strata estimates can be combined
into a precise estimate of the whole population.
Wh = Nh / N is the stratum weight.
fh = nh / Nh is the sampling fraction in the stratum.
Mst = Σ NhX̄h / N = Σ WhX̄h is the stratified sample estimate of the population mean, μ.
If in every stratum nh / n = Nh / N, the sampling fraction is the same in all strata. Such
stratification is called stratification with proportional allocation of the nh. Using proportional
allocation, Mst = Σ nhX̄h / n.
NX̄st = N1X̄1 + N2X̄2 + N3X̄3 + ... + NhX̄h is the stratified sample estimate of the population total, τ.
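As a numeric illustration, here is a minimal Python sketch (NumPy assumed; the stratum sizes
and sample means are invented) of the stratified estimate Mst = Σ WhX̄h:

    import numpy as np

    N_h = np.array([6000, 3000, 1000])  # stratum sizes; N = 10,000
    W_h = N_h / N_h.sum()               # stratum weights W_h = N_h / N

    # Mean of a simple random sample drawn within each stratum (invented).
    xbar_h = np.array([52.0, 48.5, 40.2])

    M_st = (W_h * xbar_h).sum()         # stratified estimate of mu
    print(M_st)
    print(N_h.sum() * M_st)             # estimate of the population total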
Cluster sampling
A cluster sample is a simple random sample in which each sampling unit is a collection,
or cluster, of elements. It is used when 1) a good frame listing population elements either is
unavailable, unreliable, or costly; 2) the cost of obtaining observations increases as the distance
separating the elements increases. For example, when sampling in the field in agricultural
research, it is hard to do random sampling by running around. In quality inspection of light
bulbs contained in boxes, light bulbs are the elements and the boxes can be the sampling units.
Travelling within a city (to obtain a simple random sample of city residents) is more expensive
than travelling within a city block (to get a cluster sample of city blocks as sampling units). The
rationale for choosing the unit size in cluster sampling is to pick the unit that gives the
smaller sampling variance for a given cost, or the smaller cost for a prescribed variance. As a
general rule, the number of elements within a cluster should be small relative to the population
size, and the number of clusters in the sample should be reasonably large.
Mcl = Σxi / Σmi is the cluster sample estimate of the population mean, μ, where xi is the
total of observations (sum of elements) in the ith cluster, and Σxi indicates summing over the
i = 1 to n sampled cluster totals; mi is the size of the ith cluster, and Σmi indicates summing
over the i = 1 to n sampled cluster sizes.
Systematic sampling
Randomly selecting one element from the first k elements in the frame and every kth
element thereafter is called a one-in-k systematic sample. It is useful in field studies, e.g., selecting
every 10th tree, every 20th file, or every 15th shopper who passes by an aisle in a supermarket,
until a predetermined n is achieved. If different persons originally handled different sections of
the files, or each clerk deals with a surname, then systematic sampling provides more accurate
information than simple random sampling which could, by random chance, select n files filed by
one clerk. Parameter estimations are the same as those of the simple random sampling.
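A one-in-k systematic sample is a one-liner in Python; the sketch below (the frame and k are
invented) takes a random start among the first k units and every kth unit thereafter:

    import random

    frame = list(range(1, 201))  # a hypothetical frame of 200 files
    k = 10
    start = random.randrange(k)  # random start among the first k elements
    sample = frame[start::k]     # every kth element thereafter
    print(len(sample), sample[:5])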
Sample size
The method used to select the sample is of utmost importance in judging the validity of
the inference made from the sample to the population. The representativeness of the sample is
more important than the size of the sample. A representative sample of 100 may be preferable
to an unrepresentative sample of 100,000. The size of a sample can never compensate for a
lack of representativeness (bias).
Having established sample representativeness, using the right sample size becomes an
important economic decision. Each observation taken from the population contains a certain
amount of information about the population parameter or parameters of interest. Since
obtaining the information costs money, one decides how much information is to be sampled.
Too little prevents good estimates. Too much may be a waste given limited economic
resources. The quality of information obtained in a sample depends upon the number of
elements sampled (sample size) and the amount of variation in the information (population
variance). Let's look at an example.
What proportion of people are left-handed? How big a random sample is needed to
answer the question? Or, in other words, how big an error, or margin of error, is to be
tolerated? Let's set the margin of error at no bigger than 10%. If the sample estimate is 20%,
you will at least be confident that the population proportion is between 10% and 30%. But
you cannot guarantee that every sample, including the one you draw, will have this margin of
error unless the whole population is sampled. Sometimes, a sample may have sampling error
higher than 10%. Sometimes, a sample may have sampling error lower than 10%. Of course,
you would be concerned only with having an error higher than 10%. Then the question
becomes: how unlikely do you want this unlucky sample, having a higher than 10% error, to be?
Assume you want the unlikelihood to be 5 out of 100 samples, so that you can be confident, 95%
of the time, that a sample does not exceed the specified sampling error of 10%. That is,
Pr(|p - P| > 10%) = 5%, or Pr(|p - P| ≤ 10%) = 95%. You can then use some basic statistics:
first estimate the population variance (of the proportion of left-handers), whose standard error
is a function of sample size, and then estimate how big a sample you need so that the chance of
making an estimation error bigger than 10% does not exceed 5%.
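Under a normal approximation, this works out to the familiar formula n = z² p(1 - p) / E² for a
proportion. A minimal Python sketch (the 20% prior guess comes from the example above;
z = 1.96 corresponds to the 5-in-100, two-sided risk):

    import math

    z = 1.96   # 95% confidence
    E = 0.10   # tolerated margin of error
    p = 0.20   # prior guess at the proportion of left-handers

    n = (z ** 2) * p * (1 - p) / (E ** 2)
    print(math.ceil(n))  # about 62; the conservative p = .5 gives about 97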
5: Experimental, Quasi-Experimental and Non-Experimental Designs
To ensure research validity or internal validity, researchers develop various ways to
identify, isolate, or nullify variability among subjects in a dependent variable that is presumably
"caused" by one or more independent variables that are extraneous to the particular relation or
relations under study. Such a research effort is called control, control of variance, or control of
extraneous variables. The most powerful way of controlling extraneous variables is
experimentation where subjects are randomly assigned to experimental versus control groups.
Other things being equal, if random assignment has been used, the groups can be assumed to be
equal in all possible characteristics except that due to the manipulation of the independent
variable. In other words, variations among subjects due to anything other than the independent
variable are scattered or spread out evenly across the randomly assigned groups. The variability
due to the manipulation of the independent variable is called systematic variance. The purpose
of experimental research is to maximize this source of variance, minimize error variance, and
control extraneous variance. Other means of controlling extraneous variance include matching,
including the extraneous variable into the study, making the extraneous variable a constant, and
using statistical methods to decompose different sources of variance. It is the extent to which
extraneous variables are controlled that distinguishes research designs into experimental,
quasi-experimental, and non-experimental designs.
Experimental design
The single most important feature of experimental research is the manipulation of the
independent variable. Researchers create changes, called treatment conditions, in variables
being researched, called independent variables, to examine the impact of these manipulations
on some outcome behaviour or phenomena called the dependent variable. Another important
feature of experimental research is the ability to control extraneous variables so that subjects
receiving different manipulations of the independent variable are equal except for the
manipulation.

Experimental research in social science is developed from both physics research and
biological research models. In physics research the primary means of controlling extraneous
factors is through artificial controls, such as isolation, insulation, sterilization, strong steel
chamber walls, soundproofing, lead shielding, etc. These methods ensure the reproduction of
similar conditions and the consequent production of certain effects. As biological research moved
from the laboratory to the open field, the modern theory of experimental control through
randomized assignment to treatments emerged. Agricultural research compares yield per acre
for different crops, fertilizers, raking, or plowing methods. One of the greatest breakthroughs
in experimental design was the realization that random assignment provided a means of
comparing the outcomes of different treatments in a manner that ruled out most alternative
interpretations. Random assignment requires experimental units, which can be plots of land in
agriculture, individual persons in social psychology experiments, intact classrooms in education
studies, and neighborhoods in some criminal justice research. Treatments are then assigned to
these units by some equivalent of a coin toss.
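The "coin toss" is straightforward to script. Here is a minimal Python sketch (the units are
hypothetical intact classes) that randomly assigns half the units to treatment and half to control:

    import random

    units = [f"class_{i}" for i in range(1, 21)]  # hypothetical intact classes
    random.shuffle(units)                         # the coin toss
    treatment, control = units[:10], units[10:]
    print(treatment)
    print(control)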
Quasi-Experimental Design
Quasi-experiments have treatments, outcome measures, and experimental units, but do
not use random assignment to create the comparisons from which treatment-caused change is
inferred. Instead, the comparisons depend on nonequivalent groups that differ from each other
in many ways other than the presence of a treatment whose effects are being tested. The task is
one of separating the effects of a treatment from those due to the initial noncomparability
between the average units in each treatment group. In a sense, quasi-experiments require
making explicit the irrelevant causal forces hidden within the intact groups. The advantages of
experimental control for inferring causation have to be weighed against the disadvantages that
arise because we do not always want to learn about causation in controlled settings. Instead,
we would like to be able to generalize to causal relationships in complex field settings, and we
cannot easily assume that findings from the laboratory will hold in the field.
For lack of randomization, the pretest is an integral part of a quasi-experiment that enables
comparisons among the nonequivalent groups, whereas, in most experiments, a pretest is often
unnecessary or undesirable. When a pretest is not available, researchers may look for some
proxy variables to use in an attempt to equate the treatment and control groups. For instance,
previous academic records can be used as proxies to equate the intact classes where different
instructional methods are implemented and post-tests are used to evaluate or compare the
different teaching methods.
Non-Experimental or Ex Post Facto Research
In experimental research, there is a temporal sequence in the order of the independent variable
(which occurs first) and the dependent variable (which is observed subsequently) that allows a
causal inference. Specifically, the researcher manipulates the independent variable to "create"
changes in the dependent variable. This expected change (in the dependent variable as a
consequence of the manipulation of the independent variable) represents a causal relationship
between the two variables. This causal relationship is stated in the hypothesis. Thus,
experimental research is guided by a hypothesis stated a priori.
In experimental and quasi-experimental research, inferences are made from the independent variables (the causes) to the dependent variable (the effect). In non-experimental research, also called "ex post facto research", inferences are generally made in the opposite direction. That is, beginning with the observation of the dependent variable, attempts are made to uncover, detect, or find the reasons (independent variables) for the existing variations. The point is that the variations are not the result of the manipulation of the independent variables but are pre-existing. When an experimental researcher manipulates a variable (e.g., administers different treatments), the researcher has some expectations regarding the effect of the manipulation on the dependent variable. These expectations are expressed in the form of hypotheses to be tested. In non-experimental research, a researcher would not have such expectations about the independent variables, or sometimes would not even know prior to data collection what "independent" variables are tenable to explain the pre-existing variations on the "dependent" variable. Often, there are no hypotheses associated with a non-experimental study, and researchers adopt the position of "letting the data speak for themselves."
This design is most vulnerable to internal validity threats. Two general strategies to
protect internal validity are using (1) large samples to compensate for the lack of random
assignment and (2) large numbers of "independent" variables to eliminate rival explanations.
The latter strategy is intended to overcome the weakness of having no experimental
manipulation of the independent variables.
Summary of Research Designs
Design        Manipulation of    Random        Sample    Variable
              Independent V.     Assignment    Size      Number

Experimental  Yes                Yes           Small     Small
Quasi-Exp     Yes                No
Non-Exp       No                 No            Large     Large

Non-Experimental Design: Causal Comparative Study


Non-experimental designs refer to efforts at causal inference based on measures taken
all at one time, with differential levels of both effects and exposures to presumed causes being
measured as they occur naturally, without any experimental intervention.
According to some authors, there are two kinds of non-experimental designs, causal comparative and correlational studies. (Others, including me, do not think such a distinction is necessary.) The difference lies in the measurement of the "independent" variable, which can be either categorical or continuous. The quotation marks indicate that an "independent" variable in non-experimental research is not truly what the term stands for, since it is not manipulated. For this reason, a categorical or continuous "independent" variable is also called a grouping variable in causal comparative studies and an exogenous variable in correlational studies.
Causal comparative design differs from experimental and quasi-experimental designs in that there is no manipulation of the grouping variable and no random assignment of subjects into different groups. For example, the question inspiring gender studies is whether males and females differ with respect to a particular behaviour or trait. Here, the independent or grouping variable is gender, which is not and cannot be manipulated, and there is no way to randomly assign subjects into either of the two groups. Another difference is that, as stated earlier, groups in causal comparative research are often formed on the basis of the dependent variable. Researchers are often interested in finding out why, for example, some children are less motivated to learn than others, do not achieve as well as others, have more behavioural problems than others, are more aggressive than others, or turn out to be criminals. The behaviours about which such questions are raised are the results, outcomes, or dependent variables. Researchers try to find answers to the individual differences on these variables by grouping subjects into, say, high versus low achievers, or students with and without behavioural problems or criminal records, and then comparing the groups on some suspected causes, e.g., parental supervision, peer influence, TV viewing, etc. Such logical thinking and research processes are almost the opposite of those of experimental research. In the above example, TV viewing as an independent variable would be manipulated in experimental research: subjects would be randomly assigned to groups with different amounts or kinds of TV exposure, and their subsequent aggressive, antisocial, or criminal behaviours would be observed. In this experimental example, because of randomization, the different experimental groups can be considered equal except for the manipulation of TV viewing. Thus, various validity threats can be reasonably ruled out, and the observed difference among the groups on the dependent variable, aggressive behaviour, can be attributed to the independent variable, TV viewing. In the non-experimental example, however, one cannot say that TV viewing causes aggressive behaviour even if the two groups of aggressive versus non-aggressive children were found to differ on TV viewing. The reason is that the two groups cannot be assumed to be equal: they may differ in many things in addition to aggressiveness, e.g., family background, hormonal levels, personalities, etc. The same statistical analyses, such as ANOVA, are used in causal comparative studies, but the interpretations of the results should be far more cautious in causal comparative research.
Non-Experimental Design: Correlational Study
Partial correlation. The calculation is complicated, but the idea of partial correlation is simple. It is an estimate of the correlation between two variables in a population that is homogeneous on the variable (or variables) being controlled, that is, the variable whose effects are being removed or whose variability is made a constant. For example, a correlation between height and intelligence computed from a sample that is heterogeneous in age, say, ranging from 4 to 15, is a simple correlation that is high and positive. A partial correlation would be the average correlation between height and intelligence within each age group, where age is a constant. This partial correlation, which is likely zero, more truly depicts the relationship between height and intelligence.
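
As a concrete illustration, here is a small Python sketch using simulated data (the variable names and coefficients are invented for the example). It computes the zero-order correlation between height and intelligence and then the partial correlation controlling for age by residualizing both variables on age.

    import numpy as np

    def partial_corr(x, y, z):
        # Correlate the residuals of x and y after removing the linear effect of z.
        zmat = np.column_stack([np.ones_like(z), z])
        rx = x - zmat @ np.linalg.lstsq(zmat, x, rcond=None)[0]
        ry = y - zmat @ np.linalg.lstsq(zmat, y, rcond=None)[0]
        return np.corrcoef(rx, ry)[0, 1]

    rng = np.random.default_rng(0)
    age = rng.uniform(4, 15, 500)                    # ages 4 to 15, as in the example
    height = 80 + 6 * age + rng.normal(0, 5, 500)    # height grows with age
    iq = 60 + 4 * age + rng.normal(0, 8, 500)        # test score also grows with age

    print(np.corrcoef(height, iq)[0, 1])   # zero-order r: high and positive
    print(partial_corr(height, iq, age))   # partial r: near zero

Residualizing on the control variable and correlating the residuals is numerically equivalent to the textbook partial correlation formula.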
Spurious effect. When two variables are correlated solely because they are both affected by the same cause, the correlation between these two variables is spurious. For example, the tobacco industry has argued that the correlation between cigarette smoking and lung disease is not causal but spurious, in that both variables may be caused by a common third factor such as stress or an unhappy mental state. Another example is the positive correlation between height and intelligence often observed in children. Here, the correlation is again spurious because both variables have the common cause of chronological age.
Mediating variable. The correlation between two variables can be the result of a mediating variable. For example, a strong correlation between SES and academic achievement is often observed, which makes some people believe that there is a causal relationship between how rich the parents are and how well the kids do in school. However, such a relationship has been found to be mediated by a third variable, achievement motivation. That is, rich people's children are more motivated to study (by their parents' success), and this motivation leads to good academic performance. This finding is obtained by correlating SES and achievement while statistically partialling out motivation: the resulting correlation is almost zero. The important implication of this statistical insight is that the key lies in motivating the poor kids (providing them with role models), whereas giving them material incentives may not make them study.
Suppressor variable. A special case in which a partial correlation is larger than its zero-order correlation is called a suppressor variable effect. A suppressor variable is a variable that has a zero, or close to zero, correlation with the criterion or dependent variable but is correlated with the predictor or independent variable. When such a suppressor variable is not taken into consideration, the correlation between the independent and dependent variables may be "suppressed", or reduced, by this uncontrolled suppressor. For example, a paper-and-pencil pilot test as a predictor was found to predict little of the criterion, flying. The correlation was suppressed by a third variable, verbal ability, which has little to do with flying but a lot to do with test taking. When this suppressor variable was partialled out, the correlation between the pilot test and piloting increased significantly. This is a real example from pilot training during World War II.
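
The partial_corr function from the sketch above can demonstrate suppression with simulated data; the coefficients below are invented to mimic the pilot-test story, not taken from the historical study.

    import numpy as np  # partial_corr as defined in the earlier sketch

    rng = np.random.default_rng(1)
    n = 1000
    aptitude = rng.normal(size=n)               # true flying aptitude (unobserved)
    verbal = rng.normal(size=n)                 # verbal ability: irrelevant to flying
    flying = aptitude + rng.normal(0, 0.5, n)   # criterion
    pilot_test = aptitude + 2 * verbal          # paper test loaded with verbal ability

    print(np.corrcoef(pilot_test, flying)[0, 1])     # zero-order r, suppressed (about .4)
    print(partial_corr(pilot_test, flying, verbal))  # rises (to about .9) once verbal is partialled out

Note that verbal satisfies the definition of a suppressor here: it is built to be uncorrelated with the criterion (flying) but strongly correlated with the predictor (pilot_test).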
Validity Issues
The only way to enhance the research validity of correlational studies, which are post facto (after the fact), is through careful logical deduction and well-thought-out statistical analyses. The first part requires a strong theory to map out relations among variables and careful thinking to include all the possible extraneous variables that may contribute to the post facto observed variations. The second part involves the use of various rather complicated analytic techniques, such as multiple regression, path analysis, and structural equation modeling. Partial correlation is the basic idea behind these analyses.
Thus, correlational studies often involve more variables than experimental research. The variables in a correlational study are not distinguished as independent and dependent variables. First, since there is no experimental manipulation of the variable of research interest, there is no independent variable in the strict sense. Second, the inference is often not made from the manipulation of the independent variable to the outcome of the dependent variable; on the contrary, the post facto outcome is observed first and the research purpose is to account for the observations, so the order of what is independent and what is dependent seems opposite to that used in experiments. Third, unlike experimental studies, where there is usually one dependent variable, there can be and usually are more than one outcome variable in a complex pattern of associations. In correlational studies, variables are instead distinguished as exogenous and endogenous variables. An exogenous variable is one whose variability is assumed to be determined by causes outside the model or study under consideration. It is not the interest of the study to explain the variability of an exogenous variable or its causal relations with other exogenous variables. An endogenous variable is one whose variation is to be explained by the exogenous and/or other endogenous variables in the model. Exogenous and endogenous variables are like independent and dependent variables in experimental studies, except that there is no manipulation of the exogenous variables and there is usually more than one endogenous variable in a correlational study.
In correlational studies, one needs to think hard to include as many relevant variables as possible. Omission of relevant variables that are correlated with the exogenous or endogenous variables in the model constitutes what is called a specification error, which will lead to biased estimates of the relations among the variables in the model. Specification errors are almost unavoidable in correlational research. All one can do is attempt to minimize them by including the major relevant variables in the design. Although it is hard to say how many relevant or extraneous variables need to be included in a model, it is fairly certain that a correlational study involving only one "independent" variable is bound to be misspecified in virtually any instance that comes to mind.
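
A short simulation makes the bias from a specification error visible. In this invented example, ability is a relevant variable correlated with both the "independent" variable (tutoring) and the outcome; omitting it inflates the estimated effect of tutoring.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 2000
    ability = rng.normal(size=n)                     # relevant variable, later omitted
    tutoring = 0.8 * ability + rng.normal(size=n)    # correlated with ability
    achieve = 0.3 * tutoring + ability + rng.normal(size=n)

    # Misspecified model: achieve regressed on tutoring alone.
    X1 = np.column_stack([np.ones(n), tutoring])
    print(np.linalg.lstsq(X1, achieve, rcond=None)[0][1])   # biased well above the true 0.3

    # Correctly specified model: ability included.
    X2 = np.column_stack([np.ones(n), tutoring, ability])
    print(np.linalg.lstsq(X2, achieve, rcond=None)[0][1])   # close to the true 0.3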
An Evaluation Checklist for Quantitative Studies
Adapted from McMillan, J. H. (2004). Educational research: Fundamentals for the consumer (4th ed.). Boston: Pearson.

1.0 Research Problem
1.1 What are the independent and dependent variables?
1.2 Is the problem researchable?
1.3 Is the problem significant? Will the results have practical or theoretical importance?
1.4 Is the problem stated clearly and succinctly?
1.5 Does the problem communicate whether the study is descriptive, relational, or experimental?
1.6 Does the problem indicate the population studied?
1.7 Does the problem indicate the variables in the study?
2.0 Review of Literature
2.1 Does the review of literature seem comprehensive? Are all important previous studies
included?
2.2 Are primary sources emphasized?
2.3 Is the review up to date?
2.4 Have studies been critically reviewed, and flaws noted, and have the results been
summarized? (I disagree with 2.1 and 2.4.)
2.5 Does the review emphasize studies directly related to the problem?
2.6 Does the review explicitly relate previous studies to the problem?
2.7 If appropriate, does the review establish a basis for research hypotheses?
2.8 Does the review establish a theoretical framework for the significance of the study?
2.9 Is the review well organized?
3.0 Research Hypothesis
3.1 Is the hypothesis stated in declarative form?
3.2 Does the hypothesis follow from the literature?
3.3 Does the hypothesis state expected relationships or differences?
3.4 Is the hypothesis testable?
3.5 Is the hypothesis clear and concise?
4.0 Selection of Participants
4.1 Are the participants clearly described?
4.2 Is the population clearly defined?
4.3 Is the method of sampling clearly described?
4.4 Is probability sampling used? If so, is it proportional or disproportional?
4.5 What is the return rate in a survey study?
4.6 Are volunteers used?
4.7 Is there an adequate number of participants?
5.0 Instrumentation
5.1 Is evidence for validity and reliability clearly stated and adequate? Is the instrument
appropriate for the participants?
5.2 Are the instruments clearly described? If an instrument is designed for a study by the
researchers, is there a description of its development?
5.3 Are the procedures for gathering data clearly described?

5.4 Are norms appropriate if norm-referenced tests are used?
5.5 Are standard setting procedures appropriate if criterion-referenced tests are used?
5.6 Do the scores distort the reality of the findings?
5.7 Do response set or faking influence the results?
5.8 Are observers and interviewers adequately trained?
5.9 Are there observer or interviewer effects?
6.0 Design
6.1 Descriptive and Correlational
6.1a If descriptive, are relationships inferred?
6.1b Do graphic presentations distort the findings?
6.1c If comparative, are criteria for identifying different groups clear?
6.1d Are causative conclusions reached from correlational findings?
6.1e Is the correlation affected by restriction in the range and reliability of the
instruments?
6.1f If predictions are made, are they based on a different sample?
6.1g Is the size of the correlation large enough?
6.1h If causal-comparative, has the causal condition already occurred? How comparable are the participants in the groups being compared?
6.2 Experimental
6.2a Is there direct manipulation of an independent variable?
6.2b Is the design clearly described? Is random assignment used?
6.2c What extraneous variables are not controlled in the design?
6.2d Are the treatments very different from one another?
6.2e Is each replication of the treatment independent of other replications? Is the
number of participants equal to the number of treatment replications?
7.0 Results
7.1 Is there an appropriate descriptive statistical summary?
7.2 Is statistical significance confused with practical significance?
7.3 Is statistical significance confused with internal or external validity?
7.4 Are appropriate statistical tests used?
7.5 Are levels of significance interpreted correctly?
7.6 How clearly are the results presented?
7.7 Is there a sufficient number of participants to give valid statistical results?
7.8 Are data clearly and accurately presented in graphs and tables?
8.0 Discussion and Conclusions
8.1 Is interpretation of the results separate from reporting of the results?
8.2 Are the results discussed in relation to previous research, methodology, and the research
problem?
8.3 Do the conclusions address the research problem?
8.4 Do the conclusions follow from the interpretation of the results?
8.5 Are the conclusions appropriately limited by the nature of the participants, treatments, and measures?
8.6 Is lack of statistical significance properly interpreted?
8.7 Are the limitations of the findings reasonable?
8.8 Are the recommendations and implications specific?
8.9 Are the conclusions consistent with what is known from previous research?
Quantitative versus Qualitative Research
The purpose of research is to draw some causal inference, A leads to B, A affects B, so that intervention can be introduced. A particular study may or may not be able to draw a causal conclusion, but the eventual goal of research in any discipline is to draw such conclusions. There is a fundamental difference in interpreting a causal relationship that distinguishes quantitative from qualitative research. (The chair example.) Quantitative research focuses on the specific or most salient causal link, whereas qualitative research takes into consideration the chain of events contributing to a specific social process. However, by tradition, relationship, process, and context are talked about in qualitative research, whereas "cause and effect" is considered quantitative terminology. More formally, the two philosophies are distinguished as follows:
Alternative conditions: sufficient but not necessary
The presence of the condition is associated with the presence of the outcome, but the absence of the condition is not associated with the absence of the outcome. It is sufficient by itself but not necessary. A flu virus is a sufficient but not necessary condition for a headache.
Contingent conditions: necessary but not sufficient
The absence of the condition indicates the absence of the outcome, but the presence of the condition does not indicate the presence of the outcome. It is necessary but not sufficient by itself. The ability to discriminate letters is necessary but not sufficient for reading.
Conclusions drawn from the two philosophies:
Quantitative: To be able to infer causality, conditions have to be both sufficient and necessary; i.e., the presence of the conditions is accompanied by the presence of the outcome, and the absence of the conditions is accompanied by the absence of the outcome. The only way to test a causal relationship is an experiment in which the independent variable (the cause) is manipulated to observe the corresponding change in the dependent variable (the outcome). There are different variations of the experiment to fit the constraints of social science research, e.g., quasi-experimental, causal comparative, and correlational designs, but the idea behind them is the same, derived from physical science research, i.e., physics.
Qualitative: A constellation of conditions that are individually insufficient but necessary, and jointly unnecessary but sufficient (INUS), to bring about the outcome. (Meehl's example.) Any social phenomenon resembles the INUS situation. Every social factor by itself is not sufficient, even though necessary, but together the factors are sufficient to bring about an effect, even though that combination is not necessary. The emphasis is on different factors, angles, points of view, culture, the big picture, the chain of events, which are necessary but not sufficient. Since the combination of them, which is sufficient, is not necessary, and there can be other combinations, there is not a pattern of relationship that is uniformly true. Here the emphasis is on the contexts and the situations. They have to be taken into consideration, or findings are context dependent, whereas quantitative research emphasizes generalization and statistical inference from sample to population.
Qualitative researchers try to study a process (rather than an isolated event) by taking into consideration the different factors contributing to the process. Within the qualitative approach, there are discipline-specific emphases in educational research. An anthropological orientation (ethnography) emphasizes the role of culture in influencing behaviour. Researchers with a sociological bent tend to emphasize symbolic interaction. People are seen as acting according to the meaning of things and persons to them; their reality is socially constructed. People act not according to what the school is supposed to be, but according to how they see it.
Quan:
Philosophy: Isolated causal link.
Method: Experiment.
Effort: Control extraneous variables to isolate the particular linkage, e.g., attitude of subjects, history, instrumentation, testing, maturation; standardize data collection. Following the physical science tradition, study only what is observable, measurable, and testable. Latent constructs have to be operationalized. Many topics are not attempted for research.
Qual:
P: INUS, causal chain.
M: Field study. Consider combinations of factors. Look at context. Use different data collection schemes to obtain all sorts of information. Many social phenomena and human behaviours, including intentions, feelings, and aspirations influenced by norms, culture, and values, are not directly observable. Observable behaviours aren't any more real than internal phenomena.
Quan:
P: Following the physical science tradition, study the social phenomenon or human behaviour as an objective and impartial observer.
M: Control internal validity threats, such as observer bias and observer characteristics. Keep the subjects unaware of your research purpose. Hire data collectors and standardize the data collecting conditions by training them. Structured interviews. Let the data speak.
Terminology: Subject, researcher.
Qual:
P: Take the perspective of the people being studied. See the world the way they see it.
M: Participation. The researcher is the only or major instrument of data collection. Subjects play a role in data interpretation, e.g., having subjects read your report and modify it afterwards.
T: Informants, collaborators, teachers, participants (vs. subjects).
Quan:
P: Deductive reasoning: formulate theory from previous research and conduct specific empirical tests.
M: Hypothesis testing. Ask questions before data are collected. Use standardized tests. Confirmatory or explanatory studies.
Qual:
P: Inductive reasoning: theory grounded in observation. From pieces of specific events and observations, develop an explanation.
M: Start from scratch. Extensive and prolonged observation. Go back and forth between data and explanation until a theory is fully grounded in observation. No measurement in the conventional sense: measurement is not just asking questions but knowing what to ask. Exploratory or discovery-oriented research.
Quan:
P: Generalization.
M: Random sampling, hypothesis testing, inferential statistics.
Qual:
P: Context dependent; generalization with caution.
M: Purposive sampling to gather data from the most representative situations to draw generalizations. Informants are selected for their willingness to talk; their sensitivity, knowledge, and insights into a situation; and their ability and influence to gain access to new situations. No intention to use statistical inference. Lengthy text reports. Text analysis.
Quan vs. qual is more of an approach and philosophical difference than a methodological one. Techniques that were traditionally used more often by one approach than the other are now adopted by both. A case study can be used to triangulate the findings from large-scale surveys. Field notes compiled through participant observations and personal interviews can be quantified and statistically analyzed to draw inferences about the population.
Some Details about Research Designs
Randomly assign 3 subjects into each of the 3 treatments, Pu, Pr, and Control.

                    Treatment
            Pu            Pr            Control
            X1, X2, X3    X4, X5, X6    X7, X8, X9
Mean        MPu           MPr           MC            MGrand

Factorial ANOVA:
Randomly assign 3 subjects from each gender into each of the 3 treatments, Pu, Pr, and Control.

                           Treatment
            Pu              Pr              Control
Male        X1, X2, X3      X4, X5, X6      X7, X8, X9
            MPuM            MPrM            MCM             MMale
Female      X10, X11, X12   X13, X14, X15   X16, X17, X18
            MPuF            MPrF            MCF             MFemale
            MPu             MPr             MC              MGrand

RB ANOVA:
Assign every subject into each of the 3 treatments, Pr, Pu, and Control, in a random order.

                     Treatment
Subject      Pu      Pr      Control    Mean
1            X1      X1      X1
2            X2      X2      X2
3            X3      X3      X3
4            X4      X4      X4
Mean         MPu     MPr     MC         MGrand

SP ANOVA:
Assign every subject from each gender into each of the 3 treatments, Pr, Pu, and Control, in a random order.

Sex or                      Treatment or Within-Subject Factor
Between-Sub    Subject      Pu       Pr       Control    Mean
Factor
Male           1            X1       X1       X1
               2            X2       X2       X2
               3            X3       X3       X3
               4            X4       X4       X4
               Mean         MPuM     MPrM     MCM        MMale
Female         5            X5       X5       X5
               6            X6       X6       X6
               7            X7       X7       X7
               8            X8       X8       X8
               Mean         MPuF     MPrF     MCF        MFemale
               Mean         MPu      MPr      MC         MGrand

Pu: Experimental manipulation of public self-consciousness is achieved by having the subjects respond to Spence's Attitudes Towards Women Scale with the awareness that their answers will be evaluated by other people.
Pr: Experimental manipulation of private self-consciousness is achieved by having the subjects respond to Spence's Attitudes Towards Women Scale in front of a mirror.
Control: Subjects respond to Spence's Attitudes Towards Women Scale without experimental manipulations.
Spence's Attitudes Towards Women Scale is scaled so that higher values indicate more stereotyped attitudes towards women.
Between Subject Design
Between-subject designs are used to draw inferences about treatment effects in several populations. This statement does not mean that subjects are drawn from different populations to start with; in fact, you use random assignment to create equal groups. The statement means that the treatment (or experiment) is so effective that, after the treatment, the behavior of the subjects represents that of a different population.
Although referred to as an experimental design, the between-subject design is also used for non-experimental comparisons of different population means. As a non-experimental design, subjects are sampled from different populations to start with.
Commonly used between-subject designs are (1) the completely randomized design and (2) the factorial design. A factorial design deals with more than one independent variable. In the case of two independent variables, the design is called a two-way factorial. The two variables can both be experimental variables. More often, one of the two variables is experimental and the other is a measured variable. A special case is the aptitude-treatment interaction design, where one of the two variables is a state variable, which is experimentally manipulated, and the other is a trait variable, which is measured. All these are experimental designs. Of course, when both variables are measured, e.g., gender by race, the design is non-experimental. Three hypotheses are explored in a two-way factorial: (1) the main effect due to one of the variables, (2) the main effect due to the other variable, and (3) the interaction between the two variables. Factorials are also distinguished as fixed or random depending on whether the categories of the independent variables are assumed to exhaust those in the defined population (fixed design) or are a random sample from the population (random design).
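
A two-way factorial with a treatment factor and a measured factor (gender) can be analyzed as in the sketch below. The data are simulated, and the pandas and statsmodels packages are assumed; the three rows of the resulting ANOVA table correspond to the two main effects and the interaction.

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    df = pd.DataFrame({
        "treatment": np.repeat(["Pu", "Pr", "Control"], 20),
        "gender": np.tile(np.repeat(["M", "F"], 10), 3),
    })
    # Simulated outcome with a built-in effect for the Pu condition.
    df["score"] = rng.normal(50, 10, len(df)) + (df["treatment"] == "Pu") * 5

    model = smf.ols("score ~ C(treatment) * C(gender)", data=df).fit()
    print(sm.stats.anova_lm(model, typ=2))  # treatment and gender main effects, plus interaction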
Within Subject Design
In a between design, each subject is observed under one condition, and inference is drawn between subjects from different conditions. In a within design, each subject is observed under more than one condition, and inference is drawn within the same subject across different conditions. There are usually three ways a within design can be carried out. All three are generally referred to as randomized blocks (RB) designs.
(1) Subjects are observed under all treatment conditions. The order of the conditions assigned to each subject must be random. For example, each subject experiences three dosage conditions of 1 mg, 5 mg, and 10 mg. Some subjects have the 5 mg condition first and some have the 10 mg condition first; the key is that the three conditions are assigned to each subject in a random order (see the sketch after this list). As another example, have the same teachers rate content-identical essays bearing names representing different immigration statuses and genders to see if there is discrimination in rating essays. In this example, each teacher rates 4 essays (female-HK, male-HK, female-Mainland, male-Mainland), the order of which is random.
(2) Assign matched subjects randomly to different treatments. K matched subjects form a block. Within a block, subjects are randomly assigned to treatment conditions, one subject per condition. Subjects within a block are considered identical as far as the dependent variable is concerned. The design's strength lies in the fact that within-block variability is smaller than between-block variability. In the earlier example, we can match subjects by, for example, age, gender, and physical condition (all of which have to be related to the dependent variable), and randomly assign each matched triplet to the three dosage conditions.
(3) A special case of the first design is the repeated measures design, where each subject undergoes repeated observations. For example, in a test-retest design, a subject is observed twice, before and after the treatment. Repeated measures designs can also be used in non-experimental studies involving multiple observations.
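
For situation (1), randomizing the order of conditions within each subject is straightforward. A minimal sketch with the dosage example:

    import random

    doses = ["1 mg", "5 mg", "10 mg"]
    subjects = ["subject_%d" % i for i in range(1, 7)]

    # Every subject receives every condition; only the order is randomized.
    orders = {s: random.sample(doses, len(doses)) for s in subjects}
    for subject, order in orders.items():
        print(subject, "->", order)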
Mixed Design
If you add a between-factor to the randomized blocks design, you have the split-plot (SP) design, which is a mixed design. There are now two independent variables. One of them varies within subjects and is thus called the within-factor. The other varies between subjects and is called the between-factor. This basic mixed design can be extended to include more than one within-factor and/or more than one between-factor.
All the requirements for the RB design apply to the SP design. All three situations which make up an RB design apply to an SP design: 1. The same subjects are observed under all treatment conditions, the order of which is random. 2. Matched subjects form a block, within which subjects are considered equivalent and randomly assigned to treatments. 3. The same subjects undergo repeated measures, where random order does not exist. The same three situations make up the within-factor of an SP design.
Either of the two factors that make up an SP design can be an experimental or a measured variable, creating four possible scenarios:
1. The between-factor is an experimental (state) variable and the within-factor is a measured (trait) variable. For example, the between-factor is Pu (public self-consciousness induced by the camera condition), Pr (private self-consciousness induced by the mirror condition), and a control group. Subjects are randomly assigned to these three conditions. The within-factor is pre-test (before the experimental conditions) and post-test (under the experimental conditions). This is the typical pretest-posttest experimental design. (One typical RB design is the test-retest design, which is not experimental because there is no control group.) In this example, the within-factor represents Situation 3 in B above.
2. The between-factor is a measured variable and the within-factor is an experimental variable. In the above example, the Pu, Pr, and Control conditions can be made a within-factor by randomly assigning these conditions either to the same subjects or to blocks of matched triplets. The between-factor can be gender. In this example, the within-factor represents Situation 1 or 2 in B.
3. Both factors are experimental. In our earlier Prozac example, the within-factor is the three experimental conditions of 10 mg Prozac, 5 mg Prozac, and placebo. We can add a between-factor of whether or not subjects receive therapy to cope with depression. The research question is whether the combination of counselling and Prozac is more effective than either counselling or Prozac alone. In this case, both the between- and within-factors are experimental variables. The between-factor conditions are created by randomly assigning subjects to either the counselling or the no-counselling condition. The within-factor can be created in two ways. Corresponding to Situation 1 in B, within each of the two between-factor conditions, subjects undergo all three Prozac conditions in a random order. Corresponding to Situation 2 in B, matched triplets are randomly assigned to the three Prozac conditions.
4. Finally, both factors can be measured variables in a non-experimental study. For example, suppose I am obtaining repeated measures on withdrawn and aggressive behaviors of primary school children by giving the children three questionnaires at the end of three consecutive semesters. The within-factor is the three waves of questionnaire data. One between-factor could be gender. Another between-factor could be the popularity classification of these children: through peer nomination, children can be classified as popular, rejected, or neglected.
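
For the split-plot analysis itself, one option is the third-party pingouin package (an assumption here; statsmodels' AnovaRM handles only within-factors). A sketch with simulated data for the Prozac-by-counselling example in scenario 3:

    import numpy as np
    import pandas as pd
    import pingouin as pg  # third-party package; assumed installed

    rng = np.random.default_rng(4)
    subject = np.repeat(np.arange(1, 21), 3)              # 20 subjects, 3 rows each
    drug = np.tile(["placebo", "5mg", "10mg"], 20)        # within-factor
    therapy = np.repeat(["counselling", "none"], 30)      # between-factor: subjects 1-10 vs 11-20
    score = rng.normal(20, 4, 60) - (drug == "10mg") * 3  # simulated depression scores

    df = pd.DataFrame({"subject": subject, "drug": drug,
                       "therapy": therapy, "score": score})
    print(pg.mixed_anova(data=df, dv="score", within="drug",
                         between="therapy", subject="subject"))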
Analysis of Covariance (ANCOVA)
ANCOVA is ANOVA with a covariate or covariates. The covariate summarizes individual differences on the dependent variable. These differences would otherwise be allocated to error in an ANOVA. Naturally, the covariate should be correlated with the dependent variable. By removing variation due to persons from the error term, ANCOVA achieves the same goal as a within-subject design or a mixed design (RB and SP ANOVA). There are three statistical assumptions:
1. Linearity of regression. The relationship between the dependent variable and the covariate is linear. Simply check the scatter plot to examine this assumption.
2. Homogeneity of regression. The regression of the dependent variable on the covariate is the same across different treatment groups. The most intuitive way to check this assumption is to conduct separate regressions within the treatment groups to see if the regression coefficients are similar.
3. The covariate is independent of the treatment variable. The way to ensure this assumption is to obtain the covariate before conducting the experiment, that is, to measure the covariate before randomly assigning subjects into different treatment conditions and conducting the experiment.
The purpose of the covariate is to reduce within-group variance, i.e., differences among people within groups. If the covariate is highly related to the independent variable, the difference between groups will also be reduced by the part of the covariate that is related to the independent variable. Nonetheless, in non-experimental and quasi-experimental studies (e.g., teaching methods are compared using intact classes, which could differ in ability, and ability is used as the covariate), the covariate is often correlated with the independent variable and ANCOVA is still used. In this situation, a different question is pursued in non-experimental research: how much of the group difference is purely due to the grouping variable after the covariate is accounted for. Here, researchers may deliberately use a covariate that is related to the grouping variable so that "pure" group differences can be identified after controlling for the covariate. For example, in comparing gender role attitudes among several age groups, one may want to control for education, which is related to age group. In this design, one can find out how much of the difference is due to education and how much is due to age alone.
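
A minimal ANCOVA sketch, assuming pandas/statsmodels and invented data: group is the treatment factor, pretest is the covariate, and the second model adds the group-by-pretest interaction as a check on the homogeneity-of-regression assumption (the interaction term should be negligible).

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    n = 90
    df = pd.DataFrame({"group": np.repeat(["A", "B", "C"], n // 3)})
    df["pretest"] = rng.normal(50, 10, n)                    # covariate, measured first
    df["posttest"] = (0.7 * df["pretest"] + rng.normal(0, 5, n)
                      + (df["group"] == "A") * 4)            # group A gets a 4-point boost

    # ANCOVA: group effect on the posttest, adjusting for the pretest.
    ancova = smf.ols("posttest ~ C(group) + pretest", data=df).fit()
    print(sm.stats.anova_lm(ancova, typ=2))

    # Homogeneity-of-regression check: the interaction should be negligible.
    check = smf.ols("posttest ~ C(group) * pretest", data=df).fit()
    print(sm.stats.anova_lm(check, typ=2))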
