
Introduction to Testing

and Measurement

Testing:
Basic Definitions
Assessment - process of documenting
knowledge, skills, attitudes, and/or
beliefs
Evaluation - the making of a
judgment about the amount,
number, or value
Measurement - quantitative (involves
assigning numbers)
Testing - form of measurement

Basic Definitions
(Continued)

Reliability - the degree to which a
measure gives consistent results

Validity - a test is valid to the degree
that it accomplishes its purpose
Objectivity - a test is objective to the
degree that two or more reasonable
persons, given a key, will agree on the score

Basic Statistics
Mean, Median, and
Standard Deviation

Mean

(Arithmetic Average - the sum divided by
the count.)

Advantages
Calculation includes all scores
Indicates typical score for
group

Disadvantages
Easily distorted by extreme
scores

Median

(Midpoint - place the numbers in value
order and find the middle number)

Advantages
Not easily distorted by
extremely high or low scores

Disadvantages
Does not take into account the
value of all the scores in the
group

Mean or median?
Rule of Thumb
Use the median when extremely
high or low scores (outliers)
are present;
use the mean in most other
situations
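
The rule of thumb can be checked with a short sketch. The scores are hypothetical, invented for illustration only. One extremely low score pulls the mean down sharply while the median barely moves.

```python
from statistics import mean, median

# Hypothetical class scores (not taken from any real test).
scores = [70, 72, 75, 78, 80]
# The same class with one extremely low score (an outlier) added.
scores_with_outlier = scores + [5]

mean_before = mean(scores)                    # sum 375 / count 5
median_before = median(scores)                # middle of the sorted scores
mean_after = mean(scores_with_outlier)        # sum 380 / count 6
median_after = median(scores_with_outlier)    # average of the two middle scores

print(f"without outlier: mean={mean_before}, median={median_before}")
print(f"with outlier:    mean={mean_after:.1f}, median={median_after}")
```

The mean drops by more than ten points while the median shifts by less than two, which is why the median is preferred when outliers are present.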

Standard Deviation
Indicates by how much the
scores in a distribution typically
deviate from the mean
In a normal distribution:
the mean falls at the 50th percentile of the
norm group,
about 68% of scores fall within 1 SD above or below
the mean,
about 95% within 2 SD above or below
the mean,
about 99.7% within 3 SD above or below
the mean
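
A minimal sketch of both ideas, assuming a small set of hypothetical scores and an exactly normal distribution. It computes a standard deviation, then the share of scores expected within 1, 2, and 3 SD of the mean.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical scores, chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
score_mean = mean(scores)    # sum 40 / count 8
score_sd = pstdev(scores)    # population SD: sqrt(32 / 8)

# Standard normal curve: mean 0, SD 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD of the mean: {share:.1%}")
```

The percentages hold only to the extent that the score distribution is actually normal. Real score distributions often are not.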

Normal Curve - Properties


Symmetrical, bell-shaped
Total area under the curve represents total
number of scores in the distribution
Vertical lines mark sub-areas and represent
proportions of scores falling in a particular
range
Points along baseline correspond to
standard deviations away from the mean

Testing and Measurement


Validity & Reliability

Validity of Test Scores


The extent to which the scores
on the test are representative of
what you are trying to measure
Example - Does the science test
measure only the knowledge of
science, or is it dependent on
reading ability and therefore
measuring science and reading
ability?

Types of Validity
Content Validity
Determined by the degree to
which the questions or items are
representative of the universe of
behavior the test was designed to
sample (does the test assess what
it claims to assess?)

Criterion-Related Validity
Determined by whether there is a
relationship between a test and an
immediate criterion measure
Example - a driving test or an
employment test

Factors That Can Reduce
Validity
Factors in the Test
Vague Directions
Irrelevant Items
Poorly Constructed Items
Items that Contain Clues to
the Correct Answer
Too Few or Improperly
Sequenced Items

What Affects Validity
(Continued)

Factors in Test Administration and
Scoring
Insufficient Time to Complete
the Test
Testing Environment
Undetected Cheating
Inappropriate Help or Coaching
Poorly Motivated Students
Unreliable Item Scoring

What Affects Validity
(Continued)

Factors Affecting Pupil
Responses
High Level of Fear or
Anxiety About Taking the
Test
A Tendency to Rush
Through the Test
Guessing
Guessing

Reliability of Test Scores


Consistency
Measure of confidence that, if the
same individuals were retested
under similar conditions,
the results could be replicated

Types of Reliability
Test-Retest: Coefficient of Stability
Alternate Form: Coefficient of
Equivalence
Internal Consistency: Consistency of
an examinee's responses across test items
Interrater Reliability: Consistency of
judges or scorers
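
A coefficient of stability is commonly estimated as the correlation between scores from two administrations of the same test. The sketch below assumes Pearson's r as the coefficient and uses hypothetical scores for five students. The helper name pearson_r is ours, not from any test manual.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variation_x = sum((a - mean_x) ** 2 for a in x)
    variation_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(variation_x * variation_y)

# Hypothetical scores for the same five students, tested twice.
first_testing = [12, 15, 18, 20, 25]
second_testing = [13, 14, 19, 21, 24]

stability = pearson_r(first_testing, second_testing)
print(f"test-retest coefficient: {stability:.2f}")
```

The same calculation applied to two alternate forms of a test would give a coefficient of equivalence.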

Reliability

General Guidelines
Test scores used for decisions
about individuals require a
much higher degree of
reliability than those for making
decisions about groups.
Higher reliability coefficients
are essential if decisions based
on test scores have long term
consequences.

Reliability
General Guidelines
(Continued)

Lower reliability coefficients are
tolerable if decisions are
reversible or have only a
temporary impact.
Reliability coefficients for
standardized tests should be .90
or higher
Reliability coefficients are
influenced by many factors.

How to Increase
Reliability
Use objective tests
Use a more heterogeneous
group
Make sure the difficulty level is
appropriate for the individuals
being tested
Increase the number of items
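
The effect of adding items can be estimated with the Spearman-Brown formula, a standard psychometric result that these slides do not name. The sketch assumes the added items are comparable in quality to the existing ones. The starting reliability of .70 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the added items are comparable to the existing ones.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

original = 0.70                          # hypothetical current reliability
doubled = spearman_brown(original, 2)    # twice as many items
tripled = spearman_brown(original, 3)    # three times as many items

print(f"original: {original:.2f}, doubled: {doubled:.2f}, tripled: {tripled:.2f}")
```

Each additional block of items raises the predicted reliability by less than the one before it, so lengthening a test has diminishing returns.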

Reliability vs. Validity


Reliability means that test takers
will get the same score on
multiple takes (within reason, of
course).
Validity means measuring what it
is supposed to measure
Reliability doesn't necessarily
equate to validity:
A test can be reliable without being
valid.
However, a test cannot be valid
unless it is reliable.

Types of Tests
Standardized Tests:
Norm-Referenced and
Criterion-Referenced Tests

Standardized Test
Administered and scored in a
consistent, or "standard", manner.
Designed in such a way that the
questions, conditions for
administering, scoring procedures,
and interpretations are consistent.
Not necessarily high-stakes, time-limited,
or multiple-choice.

Standardized Testing
Benefits

Objectivity
Evidence of validity or reliability of
results
Ability to compare across students,
schools, states, etc.
Ease of administration and scoring
Efficiency (group testing)
Developed over time and
supported with data and research

Standardized Testing
Possible issues

Can only sample a portion of the
domain
May not match school curriculum
May not answer relevant questions
Interpretations may not be relevant
for all populations
Extraneous factors may prevent a
good measure of the student's
ability
May not be available for some
constructs/concepts

Base the test type on the
decision to be made
Norm-Referenced: Level of
achievement compared to other
students
Criterion-Referenced: Level of
achievement compared to external
criterion

Norm-Referenced Scores
Based on the normal curve
Reflects student performance
compared to other similar students
Shows relative strengths and
weaknesses
Are not standards of what should
be - only indicators of what is
Examples: CogAT, Iowa, NNAT, WISC,
Stanford, Terra Nova

Norms
A set standard of development
or achievement usually derived
from the average or median
achievement of a large group
Used to compare one student's
results to those of a large sample
of students:
National norms - based on a large
sample from across the nation
Local norms - based on a large
sample from local schools within a
city, district, state, etc.

Norms

(Continued)

Indicate what the current reality
is
Are not standards, or indicators of
what should be

Derived by assessing students
thought to be typical
For mental ability scores, use
student age norms
For achievement scores, use
student grade norms

Good Norms Are

Recent
When outdated norms are used, results can be
misleading. Norms change every 5-7 years. (Tests
with norms over 10 years old are not used for
gifted evaluation in Cobb County.)

Representative
Because participation in the norm group is
voluntary, norm groups might not be
representative.

Relevant
The normal students used to establish the
norms may not have been provided a normal
instructional program.

Norm-Referenced Tests (NRT)
Appropriate Uses
Used to compare student
performance with a large, usually
national or international,
sample of similar students
Used to compare schools or
school systems with a national
sample

Criterion-Referenced Tests
Allow inferences about:
a curricular domain of skills and
knowledge (e.g. the CCGPS, state
standards)
a cognitive domain of skill
reading comprehension
math computation

standing with respect to a judgmental
criterion
CRCT (Criterion-Referenced Competency Test)
EOCT (End of Course Test)
Georgia Milestones

Criterion-Referenced Tests
(CRT)
Appropriate Uses

To make instructional decisions
about individual students
To make placement decisions about
students, along with other
information
To make evaluative (formative and
summative) decisions about
programs
To make decisions about the
curriculum

Types of Scores
NRTs & CRTs

Raw Scores
Actual number of points
received on test
For example, 25 correct answers
out of 30 questions equals a raw
score of 25

Have not been cooked in the
cauldron of statistics

Standard Scores
Raw scores converted to new
scale
Can be used to make direct
comparisons among classes,
schools, or districts
Can be misinterpreted because the
somewhat arbitrary scale values
vary from test to test
Commonly Reported Standard Scores
SAT, GRE, NCEs, Stanines, SAS

Normal Curve Equivalent
(NCE)
Normalized standard scores
used for reporting some
standardized achievement tests
Converted to a scale with a
mean of 50 and a standard
deviation of 21.06
Reported in a range between
values of 1 and 99
Are not particularly useful in
reporting test results to parents
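
A sketch of how a percentile rank maps onto the NCE scale, assuming the norm group is normally distributed. The function name percentile_to_nce is hypothetical. Published tests report NCEs from their own conversion tables.

```python
from statistics import NormalDist

NCE_MEAN = 50
NCE_SD = 21.06

def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (1-99) to a Normal Curve Equivalent."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    # z is the number of SDs above or below the mean for this percentile.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return NCE_MEAN + NCE_SD * z

for pr in (1, 25, 50, 75, 99):
    print(f"percentile {pr:2d} -> NCE {percentile_to_nce(pr):.1f}")
```

The standard deviation of 21.06 is what makes NCEs line up with percentile ranks at 1, 50, and 99. Between those points the two scales differ.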

Standard Age Scores (SAS)


Used to report the results of
ability tests
Sometimes reported as
deviation IQ scores
Converted to a scale with a
mean of 100 and a standard
deviation of 15
The average range is considered to be
15 points above and below 100,
from 85 to 115 on the normal curve
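
The scale can be illustrated with a simple linear conversion. This is a sketch only: the age-group mean and SD are hypothetical, and published ability tests derive SAS values from norm tables rather than from this formula.

```python
SAS_MEAN = 100
SAS_SD = 15

def standard_age_score(raw_score, age_group_mean, age_group_sd):
    """Place a raw score on a scale with mean 100 and SD 15."""
    z = (raw_score - age_group_mean) / age_group_sd
    return SAS_MEAN + SAS_SD * z

def in_average_range(sas):
    """True when the score is within one SD of 100 (85 to 115)."""
    return SAS_MEAN - SAS_SD <= sas <= SAS_MEAN + SAS_SD

# Hypothetical age group: mean raw score 50, SD 10.
one_sd_above = standard_age_score(60, 50, 10)
two_sd_below = standard_age_score(30, 50, 10)

print(one_sd_above, in_average_range(one_sd_above))
print(two_sd_below, in_average_range(two_sd_below))
```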

Stanines
Standard Scores with whole
number values ranging from 1
to 9
Relate to percentile bands
Useful as a simple
approximation of performance;
May lead to a loss of precision
in reporting
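
The link between stanines and percentile bands can be sketched with the conventional stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, 4), which are standard but not listed in these slides. The function name is hypothetical.

```python
from bisect import bisect_right

# Lowest percentile rank belonging to stanines 2 through 9, from the
# cumulative percentages 4, 11, 23, 40, 60, 77, 89, 96.
STANINE_LOWER_BOUNDS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_stanine(percentile_rank):
    """Map a percentile rank (1-99) to a stanine (1-9)."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("percentile rank must be between 1 and 99")
    return bisect_right(STANINE_LOWER_BOUNDS, percentile_rank) + 1

for pr in (3, 40, 59, 60, 96):
    print(f"percentile {pr:2d} -> stanine {percentile_to_stanine(pr)}")
```

Percentile ranks 40 and 59 both fall in stanine 5, which shows the loss of precision: students nearly 20 percentile points apart receive the same stanine.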

Percentile Scores
Commonly used in expressing results of
standardized tests
Probably the best single derived score
for general use in relaying test results
Indicate the percentage of students in
the norm group scoring lower than the
examinee
Range between values of 1 and 99
Used to interpret a student's
performance in comparison to other
students
Can result in misinterpretation because
percentile ranks are not equally
spaced along any one scale
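
A sketch of both points, using a hypothetical ten-student norm group and assuming a normal distribution for the spacing comparison. Percentile rank is computed here as the percentage of the norm group scoring lower than the examinee, as defined above.

```python
from statistics import NormalDist

def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring lower than the given score."""
    lower = sum(1 for s in norm_scores if s < score)
    return 100 * lower / len(norm_scores)

# Hypothetical norm group of ten raw scores.
norm_group = [10, 12, 15, 15, 18, 20, 22, 25, 28, 30]
rank_of_20 = percentile_rank(20, norm_group)

# Unequal spacing: distance in SD units covered by ten percentile
# points near the middle of the scale versus near the top.
normal = NormalDist()
gap_middle = normal.inv_cdf(0.60) - normal.inv_cdf(0.50)
gap_top = normal.inv_cdf(0.99) - normal.inv_cdf(0.89)

print(f"percentile rank of 20: {rank_of_20}")
print(f"50th to 60th: {gap_middle:.2f} SD, 89th to 99th: {gap_top:.2f} SD")
```

Ten percentile points near the top of the scale cover several times the score distance of ten points near the middle, so equal percentile differences do not mean equal differences in performance.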

Percentile Bands
Range of values thought to contain the
student's true percentile rank
Narrower bands reflect higher reliability
Example: Susan might have a percentile
band ranging between 76 and 86 for
math computation on the ITBS, and a
percentile band ranging between 82 and
92 for reading.
Scores indicate that Susan probably
performs better at reading than
at math computation
However, her exact percentile score for
math could be higher than for reading
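
The caution about Susan's scores comes down to whether the two bands overlap. A minimal sketch using the band values from the example (the helper name bands_overlap is ours):

```python
def bands_overlap(band_a, band_b):
    """True when two (low, high) percentile bands share any values."""
    return band_a[0] <= band_b[1] and band_b[0] <= band_a[1]

math_band = (76, 86)      # Susan's math computation band
reading_band = (82, 92)   # Susan's reading band

susan_overlap = bands_overlap(math_band, reading_band)
print(f"bands overlap: {susan_overlap}")
```

Because the bands share the range from 82 to 86, her true math rank could sit above her true reading rank. With bands that do not overlap, a difference can be stated with more confidence.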

Grade Equivalents
Identifies grade level at which
typical student obtains same
raw score
Expressed by grade and month
Are useful in measuring growth
Can be easily misinterpreted

Grade Equivalent Interpretation


Compares student performance on grade-level
material against the average performance of
students at other grade levels on the same
material
Reported in terms of grade level and months
Does not mean a 5th grade student with an 8.5
GE score in reading can do 8th grade reading
work
Does not mean the 5th grade student needs to
be in 8th grade
Does mean the 5th grade student is performing
better than typical peers at the same grade level
Does mean that 5th grade student reads 5th
grade material as well as the average 8th
grader

Grade Equivalents: Common Misinterpretations

Cannot be interpreted as an estimate of the
grade where a student should be placed
Are not equal across the range of the
scale
Are not necessarily equal across tests
Extremely high or low GE scores are not
dependable estimates of student
achievement

Things to Know
Know the Test - study the manual and
understand the content and purpose
Know the Norms - you cannot interpret
scores well if you don't understand the
norming population
Know the Score - is it a standard score,
raw score, percentile rank, or
something else?
Know the Background - test results
don't tell the whole story, so consider
multiple sources of data and
information on the student

More to know
Research on your own - the more you
know, the more you can explain test
results with accuracy and confidence
Communicate effectively - provide
pertinent information in a clear,
understandable manner to approved
individuals
Use the test - understanding
increases with multiple uses
Use caution - test scores can reflect
ability but they do not determine
ability
Reference: Test Scores and What They Mean, 6th edition, by H. Lyman