# Introduction to Testing and Measurement

Testing:
Basic Definitions
Assessment - process of documenting
knowledge, skills, attitudes, and/or
beliefs
Evaluation - the making of a
number, or value
Measurement - quantitative (involves
assigning numbers)
Testing - form of measurement

Basic Definitions
(Continued)

Reliability - Measures consistency

Validity - Valid to the degree that
the test accomplishes its purpose
Objective - To the degree that two or
more reasonable persons given a key
will agree

Basic Statistics
Mean, Median, and
Standard Deviation

Mean

(Arithmetic Average - the sum divided by the count.)

Calculation includes all scores
Indicates typical score for
group

Easily distorted by extreme
scores

Median

(Midpoint - place the numbers in value order and find the middle number)

Not easily distorted by
extremely high or low scores

Does not take into account the
value of all the scores in the
group

Mean or median?
Rule of Thumb
use the median when extremely
high or low scores (outliers)
are present;
use the mean for most other
situations
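
The rule of thumb can be illustrated with a small sketch. The scores below are hypothetical and were chosen only to include one extreme low value; the code uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical quiz scores; the last value is an extreme low score (outlier).
scores = [78, 82, 85, 88, 90, 12]

class_mean = mean(scores)      # uses every score, so the outlier pulls it down
class_median = median(scores)  # middle of the sorted scores, barely affected

print("mean:", class_mean)
print("median:", class_median)
```

With these made-up numbers the median stays close to what most students scored, while the mean is dragged toward the single low score.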

Standard Deviation
Indicates by how much the
scores in a distribution typically
deviate from the mean
In a normal distribution, the mean falls
at the 50th percentile of the norm group,
about 68% of scores fall within 1 SD above or below
the mean,
about 95% within 2 SD above or below
the mean,
about 99.7% within 3 SD above or below
the mean
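
As a rough check on these percentages, the sketch below computes a standard deviation for a hypothetical set of scores and then asks the standard normal distribution what share of scores lies within 1, 2, and 3 SD of the mean. The helper name `share_within` and the sample scores are illustrative assumptions.

```python
from statistics import NormalDist, mean, pstdev

# Hypothetical scores for a small group.
scores = [70, 75, 80, 85, 90]
group_mean = mean(scores)
group_sd = pstdev(scores)  # population SD: typical distance from the mean

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k SD of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

print("mean:", group_mean, "SD:", round(group_sd, 2))
for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.1%}")
```

The percentages apply to scores that are approximately normally distributed; a small or skewed group will not match them exactly.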

## Normal Curve - Properties

Symmetrical, bell-shaped
Total area under the curve represents total
number of scores in the distribution
Vertical lines mark sub-areas and represent
proportions of scores falling in a particular
range
Points along baseline correspond to
standard deviations away from the mean

## Testing and Measurement

Validity & Reliability

## Validity of Test Scores

The extent to which the scores
on the test are representative of
what you are trying to measure
Example - Does the science test
measure only the knowledge of
science, or is it dependent on
ability?

Types of Validity
Content Validity
Determined by the degree to
which the questions or items are
representative of the universe of
behavior the test was designed to
sample (does the test assess what
it claims to assess?)

Criterion-Related Validity
Determined by whether there is a
relationship between a test and an
immediate criterion measure
example - a driving test,
employment

## Factors That Can Reduce Validity
Factors in the Test
Vague Directions
Irrelevant Items
Poorly Constructed Items
Items that Contain Clues to the Answer
Too Few or Improperly
Sequenced Items

(Continued)

## Factors in Test Administration and Scoring
Insufficient Time to Complete
the Test
Testing Environment
Undetected Cheating
Inappropriate Help or Coaching
Poorly Motivated Students
Unreliable Item Scoring

(Continued)

## Factors Affecting Pupil Responses
High Level of Fear or
Test Anxiety
A Tendency to Rush
Through the Test
Guessing

## Reliability of Test Scores

Consistency
Measure of confidence that, if the
same individuals were retested
under similar conditions,
the results could be replicated

Types of Reliability
Test-Retest: Coefficient of Stability
Alternate Form: Coefficient of
Equivalence
Internal Consistency: Consistency of
examinee across test items
Interrater Reliability: Consistency of
judges or scorers
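
A coefficient of stability is commonly estimated as the correlation between scores from two administrations of the same test. The sketch below assumes a Pearson correlation and uses hypothetical scores for five students; the function name `pearson_r` is illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariation / sqrt(spread_x * spread_y)

# Hypothetical scores for the same five students on two occasions.
first_testing = [10, 12, 15, 18, 20]
second_testing = [11, 13, 14, 19, 21]

stability = pearson_r(first_testing, second_testing)
print("test-retest coefficient:", round(stability, 2))
```

The same calculation applied to two alternate forms would give a coefficient of equivalence, and applied to two scorers' ratings it gives one simple index of interrater consistency.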

Reliability

General Guidelines
Test scores used for decision
much higher degree of
reliability than those for making
Higher reliability coefficients
are essential if decisions based
on test scores have long term
consequences.

Reliability
General Guidelines
(Continued)

Lower reliability coefficients are
tolerable if decisions are
reversible or have only a
temporary impact.
Reliability coefficients for
standardized tests should be .90
or higher
Reliability coefficients are
influenced by many factors.

How to Increase
Reliability
Use objective tests
Use a more heterogeneous
group
Make sure the difficulty level is
appropriate for the individuals
being tested
Increase the number of items
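
One way to see why adding items helps is the Spearman-Brown formula, which is not part of the slides and is offered here only as a sketch. It assumes the added items are comparable in quality to the existing ones; the function name and the starting reliability of .70 are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor.

    Assumes the new items behave like the existing items.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

current = 0.70                          # hypothetical reliability of a short test
doubled = spearman_brown(current, 2)    # same test with twice as many items
tripled = spearman_brown(current, 3)

print("doubled length:", round(doubled, 2))
print("tripled length:", round(tripled, 2))
```

The predicted gains shrink as the test grows, so lengthening a test helps most when the starting reliability is modest.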

## Reliability vs. Validity

Reliability means that the test takers will get the same score in
multiple takes (within reason, of
course).
Validity means measuring what it
is supposed to measure
Reliability doesn't necessarily
equate to validity:
A test can be reliable without being
valid.
However, a test cannot be valid
unless it is reliable.

Types of Tests
Standardized Tests:
Norm-Referenced and
Criterion-Referenced Tests

Standardized Test
consistent, or "standard", manner.
designed in such a way that the
questions, conditions for
and interpretations are consistent
predetermined, standard manner.
not necessarily a high-stakes, time-limited, or multiple-choice test.

Standardized Testing
Benefits

Objectivity
Evidence of validity or reliability of
results
Ability to compare across students,
schools, states, etc.
Efficiency (group testing)
Developed over time and
supported with data and research

Standardized Testing
Possible issues

Can only sample a portion of the
domain
May not match school curriculum
Interpretations may not be relevant
for all populations
Extraneous factors may prevent
good measure of the student's
ability
May not be available for some
constructs/concepts

## Base test type according to

Norm-Referenced: Level of
achievement compared to other
students
Criterion-Referenced: Level of
achievement compared to external
criterion

Norm-Referenced Scores
Based on the normal curve
Reflects student performance
compared to other similar students
Shows relative strengths and
weaknesses
Are not standards of what should
be - only indicators of what is
Examples: CogAT, Iowa, NNAT, WISC,
Stanford, Terra Nova

Norms
A set standard of development
or achievement usually derived
from the average or median
achievement of a large group
Used to compare one student's
results to those of a large sample
of students:
National norms - based on a large
sample from across the nation
Local norms - based on a large
sample from local schools within a
city, district, state, etc.

Norms

(Continued)

Indicate what the current reality is
Are not standards, or indicators of
what should be

Derived by assessing students
thought to be typical
For mental ability scores, use
student age norms
For achievement scores, use

## Good Norms are

Recent
When outdated norms are used, results can be
misleading. Norms change every 5-7 years. (Tests
with norms over 10 years old are not used for

Representative
Because participation in the norm group is
voluntary, norm groups might not be
representative.

Relevant
The normal students used to establish the
norms may not have been provided a normal
instructional program.

## Norm-Referenced Tests (NRT)

Appropriate Uses
Used to compare student
performance with large, usually
national or international,
sample of similar students
Used to make relative
comparisons among schools or
school systems to a national
sample

Criterion-Referenced Tests
a curricular domain of skills and
knowledge (e.g. the CCGPS, state
standards)
a cognitive domain of skill
math computation

standing with respect to a judgmental
criterion
CRCT (Criterion Referenced Competency Test)
EOCT (End of Course Test)
Georgia Milestones

(CRT)
Appropriate Uses

## To make instructional decisions

students, along with other
information
To make evaluative (formative and
programs
curriculum

Types of Scores
NRTs & CRTs

Raw Scores
Actual number of points
out of 30 questions equals a raw
score of 25

Have not been cooked in the
cauldron of statistics

Standard Scores
Raw scores converted to new
scale
Can be used to make direct
comparisons among classes,
schools, or districts
Can be misinterpreted because
scale values are somewhat arbitrary
and differ from test to test
Commonly Reported Standard Scores
SAT, GRE, NCEs, Stanines, SAS

## Normal Curve Equivalent (NCE)
Normalized standard scores
used for reporting some
standardized achievement tests
Converted to a scale with a
mean of 50 and a standard
deviation of 21.06
Reported in a range between
values of 1 and 99
Are not particularly useful in
reporting test results to parents
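
A minimal sketch of how a percentile rank could be turned into an NCE, assuming the usual approach of finding the matching z-score on the normal curve and rescaling it to a mean of 50 and a standard deviation of 21.06. The function name `percentile_to_nce` is illustrative, not a publisher's API.

```python
from statistics import NormalDist

def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (1-99) to a Normal Curve Equivalent."""
    z = NormalDist().inv_cdf(percentile_rank / 100)  # position on the normal curve
    return 50 + 21.06 * z                            # rescale: mean 50, SD 21.06

for pr in (1, 25, 50, 75, 99):
    print(pr, "->", round(percentile_to_nce(pr), 1))
```

NCEs and percentile ranks agree at 1, 50, and 99, but differ in between, because NCEs are equally spaced along the scale and percentile ranks are not.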

## Standard Age Scores (SAS)

Used to report the results of
ability tests
Sometimes reported as
deviation IQ scores
Converted to a scale with a
mean of 100 and a standard
deviation of 15
Average range is considered to be
within 15 points above or below 100,
that is, 85-115 on the normal curve
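
The conversion to a scale with a mean of 100 and a standard deviation of 15 can be sketched as follows. The raw score, the age-group mean, and the age-group SD are hypothetical values, and the function names are illustrative only.

```python
def z_score(raw, group_mean, group_sd):
    """Distance of a raw score from the group mean, in SD units."""
    return (raw - group_mean) / group_sd

def standard_age_score(z, scale_mean=100, scale_sd=15):
    """Rescale a z-score to the SAS scale (mean 100, SD 15)."""
    return scale_mean + scale_sd * z

def in_average_range(sas):
    """True when the score is within one SD of the mean (85 to 115)."""
    return 85 <= sas <= 115

# Hypothetical example: raw score 46 where the age group averages 40 with SD 8.
z = z_score(46, group_mean=40, group_sd=8)
sas = standard_age_score(z)
print("z:", z, "SAS:", sas, "average range:", in_average_range(sas))
```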

Stanines
Standard Scores with whole
number values ranging from 1
to 9
Relate to percentile bands
Useful as a simple
approximation of performance;
May lead to a loss of precision
in reporting
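
The link between stanines and percentile bands can be sketched with a lookup. The cutoffs below follow the commonly published 4-7-12-17-20-17-12-7-4 percent split; how a score exactly on a boundary is assigned varies by publisher, so treat the boundary handling and the function name as assumptions.

```python
from bisect import bisect_right

# Lowest percentile rank belonging to stanines 2 through 9 (assumed convention).
STANINE_CUTOFFS = [4, 11, 23, 40, 60, 77, 89, 96]

def percentile_to_stanine(percentile_rank):
    """Map a percentile rank (1-99) to a stanine (1-9)."""
    return bisect_right(STANINE_CUTOFFS, percentile_rank) + 1

for pr in (3, 22, 50, 76, 77, 99):
    print(pr, "->", percentile_to_stanine(pr))
```

Percentile ranks 76 and 77 land in different stanines while 40 and 59 share one, which is the loss of precision the slide warns about.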

Percentile Scores
Commonly used in expressing results of
standardized tests
Probably the best single derived score
for general use in relaying test results
Indicate the percentage of students in
the norm group scoring lower than the
examinee
Range between values of 1 and 99
Used to interpret a student's
performance in comparison to other
students
Can result in misinterpretation because
percentile ranks are not equally
spaced along the scale
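
A minimal sketch of the definition above: the percentile rank is the percentage of the norm group scoring lower than the examinee, kept within the reported range of 1 to 99. The norm group here is a tiny hypothetical list; real norm groups are far larger, and publishers may treat tied scores differently.

```python
def percentile_rank(score, norm_group):
    """Percentage of norm-group scores below `score`, limited to 1-99."""
    below = sum(1 for s in norm_group if s < score)
    rank = round(100 * below / len(norm_group))
    return min(99, max(1, rank))

# Hypothetical norm group of ten scores.
norm_group = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

print(percentile_rank(85, norm_group))   # six of ten scores are lower
print(percentile_rank(50, norm_group))   # below everyone: reported as 1
print(percentile_rank(120, norm_group))  # above everyone: reported as 99
```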

Percentile Bands
Range of values thought to contain the
student's true percentile rank
Smaller bands reflect higher reliability
Example: Susan might have a percentile
band ranging between 76 and 86 for
math computation on the ITBS, and a
percentile band ranging between 82 and
Scores indicate that Susan probably
performs better at reading than she
did at math computation
However, exact percentile score for
math could be higher than for reading

typical student obtains same
raw score
Are useful in measuring growth
Can be easily misinterpreted

material against the average performance of
students at other grade levels on the same
material
Reported in terms of grade level and months
Does not mean a 5th grade student with a 9.5
work
Does not mean the 5th grade student needs to
Does mean the 5th grade student is performing
better than peers at same level
grade material as well as the average 8th

Cannot be interpreted as an estimate of the
grade where a student should be placed
Are not equal across the range of the
scale
Are not necessarily equal across tests
Extremely high or low GE scores are not
dependable estimates of student
achievement

Things to Know
Know the Test - study the manual and
understand the content and purpose
Know the Norms - you cannot interpret
scores well if you don't understand the
norming population
Know the Score - is it a standard score,
raw score, percentile rank, or
something else?
Know the Background - test results
don't tell the whole story, so consider
multiple sources of data and
information on the student

More to know
Research on your own - the more you
know, the more you can explain test
results with accuracy and confidence
Communicate effectively - provide
pertinent information in a clear,
understandable manner to approved
individuals
Use the test - understanding
increases with multiple uses
Use caution - test scores can reflect
ability but they do not determine
ability
