Introduction to Testing and Measurement

Basic Definitions
Assessment - process of documenting
knowledge, skills, attitudes, and/or
Evaluation - the making of a
judgment about the amount,
number, or value
Measurement - quantitative (involves
assigning numbers)
Testing - form of measurement

Basic Definitions

Reliability - Measures consistency

Validity - A test is valid to the degree that
it accomplishes its purpose
Objectivity - A test is objective to the degree
that two or more reasonable persons, given a
scoring key, will agree on the score

Basic Statistics
Mean, Median, and
Standard Deviation


Mean
(Arithmetic average - the sum of the scores divided by the count.)

Calculation includes all scores
Indicates typical score for

Easily distorted by extremely
high or low scores


Median
(Midpoint - place the scores in value order and find the middle one.)

Not easily distorted by
extremely high or low scores

Does not take into account the
value of all the scores in the

Mean or median?
Rule of Thumb
use median when extremely
high or low scores (outliers)
are present;
use the mean for most other
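
A small illustration of the rule of thumb above, using Python's standard `statistics` module. The score lists are made up for this sketch and do not come from any real test.

```python
from statistics import mean, median

# Hypothetical class scores (illustrative values only).
scores = [78, 80, 82, 84, 86]

# The same class with one extremely low score (an outlier) added.
scores_with_outlier = scores + [10]

# Without the outlier, mean and median agree.
print(mean(scores), median(scores))

# With the outlier, the mean drops sharply while the median barely moves.
print(mean(scores_with_outlier), median(scores_with_outlier))
```

Here one outlier pulls the mean from 82 down to 70, while the median only shifts from 82 to 81, which is why the median is the safer summary when outliers are present.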

Standard Deviation
Indicates by how much the
scores in a distribution typically
deviate from the mean
In a normal distribution, the mean
falls at the 50th percentile of the
norm group,
about 68% of scores fall within 1 SD
above or below the mean,
about 95% within 2 SD above or below
the mean,
about 99.7% within 3 SD above or below
the mean
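
The percentages above can be checked with a short sketch. It assumes a perfectly normal distribution and uses `statistics.NormalDist` from the Python standard library. The `scores` list and the helper name `share_within` are hypothetical, chosen only for illustration.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical scores (illustrative values only).
scores = [70, 75, 80, 85, 90]

center = mean(scores)    # typical score
spread = pstdev(scores)  # population standard deviation: typical distance from the mean

def share_within(k):
    """Proportion of a normal distribution lying within k SDs of the mean."""
    standard_normal = NormalDist()  # mean 0, SD 1
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print(center, round(spread, 2))
for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The loop reproduces the familiar figures of roughly 68%, 95%, and 99.7%. Real score distributions are only approximately normal, so observed proportions will differ somewhat.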

Normal Curve - Properties

Symmetrical, bell-shaped
Total area under the curve represents total
number of scores in the distribution
Vertical lines mark sub-areas and represent
proportions of scores falling in a particular
Points along baseline correspond to
standard deviations away from the mean

Testing and Measurement

Validity & Reliability

Validity of Test Scores

The extent to which the scores
on the test are representative of
what you are trying to measure
Example - Does the science test
measure only the knowledge of
science, or is it dependent on
reading ability and therefore
measuring both science and reading?

Types of Validity
Content Validity
Determined by the degree to
which the questions or items are
representative of the universe of
behavior the test was designed to
sample (does the test assess what
it claims to assess?)

Criterion-Related Validity
Determined by whether there is a
relationship between a test and an
immediate criterion measure
example - a driving test,

Factors That Can Reduce Validity

Factors in the Test
Vague Directions
Irrelevant Items
Poorly Constructed Items
Items that Contain Clues to
the Correct Answer
Too Few or Improperly
Sequenced Items

What Affects Validity


Factors in Test Administration and Scoring

Insufficient Time to Complete
the Test
Testing Environment
Undetected Cheating
Inappropriate Help or Coaching
Properly Motivated Students
Unreliable Item Scoring

What Affects Validity


Factors Affecting Pupil

High Level of Fear or
Anxiety About Taking the Test
A Tendency to Rush
Through the Test

Reliability of Test Scores

A measure of confidence that, if the
same individuals were retested
under similar conditions,
the results could be replicated

Types of Reliability
Test-Retest: Coefficient of Stability
Alternate Form: Coefficient of Equivalence
Internal Consistency: Consistency of
examinee across test items
Interrater Reliability: Consistency of
judges or scorers
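
A test-retest coefficient of stability is usually computed as the correlation between the two sets of scores. The sketch below assumes the Pearson correlation is the statistic of interest. The function name `pearson_r` and both score lists are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from each mean.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Square roots of the sums of squared deviations.
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cross / (dev_x * dev_y)

# Hypothetical scores for the same five students on two occasions.
first_testing = [12, 15, 18, 20, 25]
retest = [13, 14, 19, 21, 24]

stability = pearson_r(first_testing, retest)
print(round(stability, 2))
```

A value near 1 means students kept nearly the same relative standing across the two administrations. The same calculation applied to two forms of a test, or to two raters' scores, gives a rough alternate-form or interrater coefficient.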


General Guidelines
Test scores used for decisions
about individuals require a
much higher degree of
reliability than those for making
decisions about groups.
Higher reliability coefficients
are essential if decisions based
on test scores have long term

General Guidelines

Lower reliability coefficients are
tolerable if decisions are
reversible or have only a
temporary impact.
Reliability coefficients for
standardized tests should be .90
or higher
Reliability coefficients are
influenced by many factors.

How to Increase Reliability
Use objective tests
Use a more heterogeneous
Make sure the difficulty level is
appropriate for the individuals
being tested
Increase the number of items
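
One way to see why adding items helps is the Spearman-Brown prophecy formula, which is not named in the slides and is brought in here only as an illustration. It assumes the added items are comparable in quality to the existing ones. The function name and the numbers are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length.

    reliability   -- current reliability coefficient (0 to 1)
    length_factor -- new length divided by old length (2 means twice as many items)
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# Hypothetical test with reliability .70, doubled and tripled in length.
print(round(spearman_brown(0.70, 2), 2))
print(round(spearman_brown(0.70, 3), 2))
```

Under these assumptions, doubling a test with a reliability of .70 is predicted to raise it to about .82, with diminishing returns for each further increase in length.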

Reliability vs. Validity

Reliability means that the test takers will get the same score in
multiple takes (within reason of
Validity means measuring what it
is supposed to measure
Reliability doesn't necessarily
equate to validity:
A test can be reliable without being valid.
However, a test cannot be valid
unless it is reliable.

Types of Tests
Standardized Tests:
Norm-Referenced and
Criterion-Referenced Tests

Standardized Test
administered and scored in a
consistent, or "standard", manner.
designed in such a way that the
questions, conditions for
administering, scoring procedures,
and interpretations are consistent
administered and scored in a
predetermined, standard manner.
not necessarily high-stakes, time-limited, or multiple-choice.

Standardized Testing

Evidence of validity or reliability of
Ability to compare across students,
schools, states, etc.
Ease of administration and scoring
Efficiency (group testing)
Developed over time and
supported with data and research

Standardized Testing
Possible issues

Can only sample a portion of the

May not match school curriculum
May not answer relevant questions
Interpretations may not be relevant
for all populations
Extraneous factors may prevent
good measure of the students
May not be available for some

Base the test type on the
decision to be made
Norm-Referenced: Level of
achievement compared to others
Criterion-Referenced: Level of
achievement compared to external

Norm-Referenced Scores
Based on the normal curve
Reflects student performance
compared to other similar students
Shows relative strengths and
Are not standards of what should
be - only indicators of what is
Examples: CogAT, Iowa, NNAT, WISC,
Stanford, Terra Nova

A set standard of development
or achievement usually derived
from the average or median
achievement of a large group
Used to compare one student's
results to those of a large sample
of students:
National norms - based on a large
sample from across the nation
Local norms - based on a large
sample from local schools within a
city, district, state, etc.



Indicate what the current reality is;
are not standards, or indicators of
what should be

Derived by assessing students
thought to be typical
For mental ability scores, use
student age norms
For achievement scores, use
student grade scores

Good Norms are

When outdated norms are used, results can be
misleading. Norms change every 5-7 years. (Tests
with norms over 10 years old are not used for
gifted evaluation in Cobb County.)

Because participation in the norm group is
voluntary, norm groups might not be

The normal students used to establish the
norms may not have been provided a normal
instructional program.

Norm Referenced Tests (NRT)

Appropriate Uses
Used to compare student
performance with large, usually
national or international,
sample of similar students
Used to make relative
comparisons among schools or
school systems to a national

Criterion-Referenced Tests
Allow inferences about:
a curricular domain of skills and
knowledge (e.g. the CCGPS, state
a cognitive domain of skill
reading comprehension
math computation

standing with respect to a judgmental

CRCT (Criterion-Referenced Competency Test)
EOCT (End of Course Test)
Georgia Milestones

Criterion Referenced Tests

Appropriate Uses

To make instructional decisions
about individual students
To make placement decisions about
students, along with other
To make evaluative (formative and
summative) decisions about
To make decisions about the

Types of Scores

Raw Scores
Actual number of points
received on test
For example, 25 correct answers
out of 30 questions equals a raw
score of 25

Have not been "cooked" in the
cauldron of statistics

Standard Scores
Raw scores converted to new
Can be used to make direct
comparisons among classes,
schools, or districts
Can be misinterpreted because
somewhat arbitrary scale values
are used from test to test
Commonly Reported Standard Scores
SAT, GRE, NCEs, Stanines, SAS

Normal Curve Equivalent

Normalized standard scores
used for reporting some
standardized achievement tests
Converted to a scale with a
mean of 50 and a standard
deviation of 21.06
Reported in a range between
values of 1 and 99
Are not particularly useful in
reporting test results to parents
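
The choice of 21.06 as the standard deviation makes NCEs line up with percentile ranks at 1, 50, and 99. A minimal sketch, assuming scores are normally distributed; the function name `percentile_to_nce` is hypothetical, and published tests rely on their own conversion tables.

```python
from statistics import NormalDist

def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (1 to 99) to a Normal Curve Equivalent."""
    # z-score that cuts off this proportion of a standard normal curve.
    z = NormalDist().inv_cdf(percentile_rank / 100)
    # Rescale to mean 50 and standard deviation 21.06.
    return 50 + 21.06 * z

for rank in (1, 25, 50, 75, 99):
    print(rank, round(percentile_to_nce(rank), 1))
```

The two scales agree at 1, 50, and 99 but differ in between. A percentile rank of 25 corresponds to an NCE of roughly 36, because NCEs are equal-interval units and percentile ranks are not.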

Standard Age Scores (SAS)

Used to report the results of
ability tests
Sometimes reported as
deviation IQ scores
Converted to a scale with a
mean of 100 and a standard
deviation of 15
The average range is considered to be
15 points above and below 100, that is,
from 85 to 115 on the normal curve
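
The scale described above is a linear rescaling of z-scores (distance from the mean in SD units). The sketch below shows only that arithmetic. Actual ability tests derive scores from age-based norm tables, and both function names are hypothetical.

```python
def standard_age_score(z):
    """Rescale a z-score to a scale with mean 100 and standard deviation 15."""
    return 100 + 15 * z

def in_average_range(score):
    """True if the score lies within 1 SD of the mean (85 to 115)."""
    return 85 <= score <= 115

# A student 1.5 SD above the mean of the same-age norm group.
score = standard_age_score(1.5)
print(score, in_average_range(score))
```

A z-score of 1.5 gives 122.5, which falls above the 85 to 115 average range.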

Stanines
Standard scores with whole-number
values ranging from 1 to 9
Relate to percentile bands
Useful as a simple
approximation of performance;
May lead to a loss of precision
in reporting
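
The link between the 1-to-9 scale and percentile bands can be sketched with the conventional stanine percentages (4, 7, 12, 17, 20, 17, 12, 7, 4), which are not listed in the slides. The cutoff list, the function name, and the handling of exact boundary values are assumptions of this sketch.

```python
from bisect import bisect_right

# Cumulative percentages at which each stanine (1 to 8) ends,
# based on the conventional 4-7-12-17-20-17-12-7-4 split.
UPPER_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_stanine(percentile_rank):
    """Map a percentile rank (1 to 99) to a stanine (1 to 9)."""
    # Count how many cutoffs the rank has reached, then shift to a 1-based scale.
    return bisect_right(UPPER_CUTOFFS, percentile_rank) + 1

for rank in (3, 25, 50, 59, 90, 97):
    print(rank, percentile_to_stanine(rank))
```

Percentile ranks of 50 and 59 both land in stanine 5, which shows the loss of precision that comes with the simpler scale.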

Percentile Scores
Commonly used in expressing results of
standardized tests
Probably the best single derived score
for general use in relaying test results
Indicate the percentage of students in
the norm group scoring lower than the
Range between values of 1 and 99
Used to interpret a student's
performance in comparison to other
Can result in misinterpretation because
percentile ranks are not equally
spaced along the score scale
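
A percentile rank can be sketched as the share of the norm group scoring lower than a given score. The function name and the norm-group scores below are hypothetical, and the sketch counts only strictly lower scores; some publishers also credit half of the tied scores.

```python
def percentile_rank(score, norm_group_scores):
    """Percentage of the norm group scoring lower, limited to the 1-99 range."""
    lower = sum(1 for s in norm_group_scores if s < score)
    percentage = 100 * lower / len(norm_group_scores)
    # Percentile ranks are reported between 1 and 99.
    return min(99, max(1, round(percentage)))

# Hypothetical norm group of ten raw scores.
norm_group = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

print(percentile_rank(70, norm_group))
print(percentile_rank(10, norm_group))
print(percentile_rank(200, norm_group))
```

A raw score of 70 is higher than six of the ten norm-group scores, giving a percentile rank of 60. The lowest and highest cases are clamped to 1 and 99.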

Percentile Bands
Range of values thought to contain the
student's true percentile rank
Narrower bands reflect higher reliability
Example: Susan might have a percentile
band ranging between 76 and 86 for
math computation on the ITBS, and a
percentile band ranging between 82 and
92 for reading.
The bands indicate that Susan probably
performs better at reading than
at math computation
However, because the bands overlap
(82 to 86), her true percentile rank for
math could be higher than for reading

Grade Equivalents
Identifies grade level at which
typical student obtains same
raw score
Expressed by grade and month
Are useful in measuring growth
Can be easily misinterpreted

Grade Equivalent Interpretation

Compares student performance on grade-level
material against the average performance of
students at other grade levels on the same
Reported in terms of grade level and months
Does not mean a 5th grade student with an 8.5
GE score in reading can do 8th grade reading
Does not mean the 5th grade student needs to
be in 8th grade
Does mean the 5th grade student is performing
better than peers at same level
Does mean that 5th grade student reads 5th
grade material as well as the average 8th

Grade Equivalents - Common Misinterpretations

Cannot be interpreted as an estimate of the
grade where a student should be placed
Are not equal across the range of the
Are not necessarily equal across tests
Extremely high or low GE scores are not
dependable estimates of student

Things to Know
Know the Test - study the manual and
understand the content and purpose
Know the Norms - you cannot interpret
scores well if you don't understand the
norming population
Know the Score - is it a standard score,
raw score, percentile rank, or
something else?
Know the Background - test results
don't tell the whole story, so consider
multiple sources of data and
information on the student

More to know
Research on your own - the more you
know, the more you can explain test
results with accuracy and confidence
Communicate effectively - provide
pertinent information in a clear,
understandable manner to approved
Use the test - understanding
increases with multiple uses
Use caution - test scores can reflect
ability but they do not determine
Reference: Test Scores and What They Mean, 6th edition, by H. Lyman,