
Test Development

Why Develop a New Test?


- meet the needs of a special group of test takers
- sample behaviours from a newly defined test domain
- improve the accuracy of test scores for their intended purpose

Existing tests also need to be revised.

First Four Steps


1. Defining the test universe, audience, and purpose
2. Developing a test plan
3. Composing the test items
4. Writing the administration instructions

Continued Steps of Test Construction


- Diagram of Test Construction (p. 234)
- Constructing Scales
- Piloting the Test
- Standardizing the Test
- Collecting Norms
- Validation & Reliability Studies
- Manual Writing
- Test Revision

Defining the Test Universe, Audience, & Purpose

Defining the test universe:
- prepare a working definition of the construct
- locate studies that explain the construct
- locate current measures of the construct

Defining the Test Universe, Audience, & Purpose

Defining the target audience:
- make a list of the characteristics of the persons who will take the test, particularly those characteristics that will affect how test takers respond to the test questions (e.g., reading level, disabilities, honesty, language)

Defining the Test Universe, Audience, & Purpose

Defining the purpose:
- includes not only what the test will measure, but also how scores will be used
- e.g., will scores be used to compare test takers (normative approach) or to indicate achievement (criterion approach)?
- e.g., will scores be used to test a theory or to provide information about an individual?

Developing a Test Plan

A test plan includes a definition of the construct, the content to be measured (the test domain), the format for the questions, and how the test will be administered and scored.

Defining the Construct


- define the construct after reviewing the literature about the construct and any available measures
- operationalize the construct in terms of observable and measurable behaviours
- this provides boundaries for the test domain (what should and shouldn't be included)
- specify the approximate number of items needed

Choosing the Test Format


- test format refers to the type of questions the test will contain (usually one format per test, for ease of test takers and of scoring)
- test formats have two elements:
  - a stimulus (e.g., a question or phrase)
  - a mechanism for response (e.g., multiple choice, true/false)
- the format may be objective or subjective

Composing the Test Items


- test items are the stimuli presented to the test taker (they may or may not take the form of questions)
- the form chosen depends on decisions made in the test plan (e.g., purpose, audience, method of administration, scoring)

Test Types

Structured Response:
- Multiple Choice
- True/False
- Forced Choice
- Likert Scales

Free Response:
- Essay, Short Answer
- Interview Questions
- Fill in the Blank
- Projective Techniques

Multiple Choice

- multiple choice is most common in educational testing (and also some personality and employment testing)
- consists of a stem and a number of responses; there should be only one right answer
- the wrong answers are called distractors because they may appear correct; they should be realistic enough to appeal to the uninformed test taker
- scoring is easy, but the downside is that test takers can get some items correct by guessing (see the sketch below)
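
To make the guessing problem concrete, here is a minimal Python sketch (the numbers are invented for illustration) of the expected score from blind guessing and the classic correction-for-guessing formula, R - W / (k - 1):

    # Expected effect of blind guessing on a multiple-choice test,
    # plus the classic correction-for-guessing formula.

    def expected_guess_score(n_items: int, n_options: int) -> float:
        """Expected number correct if a test taker guesses on every item."""
        return n_items / n_options

    def corrected_score(right: int, wrong: int, n_options: int) -> float:
        """Correction for guessing; omitted items are not counted as wrong."""
        return right - wrong / (n_options - 1)

    print(expected_guess_score(40, 4))  # 10.0 -- pure guessing on 40 four-option items
    print(corrected_score(30, 8, 4))    # 27.33... -- 30 right, 8 wrong, 2 omitted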

Multiple Choice

Pros:
- more answer options (4-5) reduce the chance of guessing an item correctly
- many items aid in comparing students, reduce ambiguity, and increase reliability

Cons:
- measures narrow facets of performance
- reading time increases as more answer options are added
- transparent clues (e.g., verb tense, or whether the stem uses "a" or "an") may encourage guessing
- it is difficult to write four or five reasonable choices
- questions take more time to write

True/False

- true/false is also used in educational testing and some personality testing
- in educational testing the test taker can again gain some advantage by guessing

True/False (cont.)

- ideally, a true/false question should be constructed so that an incorrect response indicates something about the student's misunderstanding of the learning objective
- this may be a difficult task, especially when constructing a true statement

Forced Choice Items

- forced choice is similar to multiple choice but is used in personality and attitude tests (e.g., the MBTI)
- the test taker must choose between unrelated but equally acceptable responses

Forced Choice Items (cont.)

Example:
Place an X in the space to the left of the word in each pair that best describes your personality.

1. ____ Sunny
   ____ Friendly

2. ____ Outgoing
   ____ Loyal

Likert Scales

- Likert scales are usually reliable and highly popular (e.g., in personality and attitude tests)
- an item is presented with an array of response options (e.g., a 1-to-5 or 1-to-7 scale), usually on an agree/disagree or approve/disapprove continuum

Test Types

Structured Response

Advantages:
- great breadth
- quick scoring

Disadvantages:
- limited depth
- difficult to assess higher levels of skills
- guessing/memorization vs. knowledge

Subjective Items

- subjective items are less easily scored, but they provide the test taker with fewer cues and open wider areas for response; often used in education
- essay questions: responses can vary in breadth and depth, and the scorer must determine to what extent the response is correct (often by examining the match with a predetermined correct response)

Essay Questions

- provide a freedom of response that facilitates assessing higher cognitive behaviours (e.g., analysis and evaluation)
- allow respondents to focus on what they have learned rather than limiting them to specific questions

Interview Questions
- interview questions are often used in organizational settings; the interviewer decides what is a good or poor answer
- the test plan should be based on the knowledge, skills, abilities, and other characteristics required to perform the job
- this information can be obtained from a job description, a job analysis, or a current job incumbent

Projective Techniques

- projective techniques are often employed in clinical settings
- a highly ambiguous stimulus is used to elicit an unstructured response (i.e., the test taker projects his or her perception and perspective onto a neutral stimulus)
- a variety of stimuli may be used (e.g., pictures, words), and responses may be verbal or drawn

Sentence Completion

- the sentence-completion format presents an incomplete sentence that the test taker completes (e.g., "I feel happiest when ___")
- subjective tests are at risk of judgment error, so inter-rater reliability is of particular importance; scoring keys and rater training are important (a sketch of two agreement statistics follows)
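
As an illustration of checking inter-rater reliability on subjectively scored items, here is a minimal Python sketch (the rater scores are invented) of percent agreement and Cohen's kappa:

    # Two common agreement statistics for subjectively scored items.
    from collections import Counter

    def percent_agreement(rater_a, rater_b):
        """Proportion of items on which two raters gave the same score."""
        matches = sum(a == b for a, b in zip(rater_a, rater_b))
        return matches / len(rater_a)

    def cohens_kappa(rater_a, rater_b):
        """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
        n = len(rater_a)
        p_o = percent_agreement(rater_a, rater_b)
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    a = [3, 2, 4, 4, 1, 3, 2, 4]    # hypothetical essay scores from rater A
    b = [3, 2, 3, 4, 1, 3, 2, 4]    # hypothetical scores from rater B
    print(percent_agreement(a, b))  # 0.875
    print(cohens_kappa(a, b))       # about 0.83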

Test Types

Subjective Items

Advantages:
- can test higher cognitive skills
- encourages organizing and developing thoughts

Disadvantages:
- difficult to grade
- judgment error (e.g., inter-rater reliability)
- requires an objective scoring key prepared in advance

Writing Good Items

- the basic building block of test construction
- little attention is typically given to writing items
- an art that requires originality and creativity, combined with knowledge of the test domain and good item-writing practices
- not all items will perform as expected: some may be too easy or too difficult, some may be misinterpreted, etc.
- rule of thumb: write at least twice as many items as you expect to use
- broad vs. narrow items

Writing Good Items (cont.)

Suggestions:
- identify item topics by consulting the test plan (increases content validity)
- ensure that each item presents a central idea or problem
- write items drawn only from the testing universe
- write each item in a clear and direct manner

Writing Good Items (cont.)

Suggestions:
- use vocabulary and language appropriate for the target audience (e.g., age, culture)
- avoid sexist or racist language (e.g., mailman, fireman)
- make all items independent (e.g., ask only one question per item)
- ask an expert to review items to reduce ambiguity and inaccuracy

Writing Administration Instructions

- specify the testing environment to decrease variation or error in test scores
- the instructions should address:
  - group or individual administration
  - requirements for the location (e.g., quiet)
  - required equipment
  - time limits or approximate completion time
  - a script for the administrator, and answers to questions test takers may ask

Specifying Administration and Scoring Methods

- determine how the test will be administered (e.g., orally, in writing, or by computer; individually or in groups)
- determine the method of scoring: whether the test is scored by hand by the test administrator, accompanied by scoring software, or sent to the test publisher for scoring

Scoring Methods

Cumulative model: the most common
- assumes that the more a test taker responds in a particular fashion, the more he or she has of the attribute being measured (e.g., more correct answers, or endorsing higher numbers on a Likert scale)
- correct responses, or responses on a Likert scale, are summed (see the sketch below)
- yields interval data that can be interpreted with reference to norms
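
A minimal Python sketch of cumulative scoring (the item names and responses are hypothetical):

    # Cumulative scoring: Likert responses are summed into a total
    # scale score; higher totals indicate more of the attribute.

    responses = {"item1": 4, "item2": 5, "item3": 3, "item4": 4}  # 1-5 Likert scale

    total = sum(responses.values())
    print(total)  # 16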

Scoring Methods (cont.)

Categorical model: places test takers in a group (a sketch follows)
- e.g., a particular pattern of responses may suggest a diagnosis of a certain psychological disorder
- typically yields nominal data, because it places test takers in categories
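
One way to illustrate the idea; the cut scores and category labels below are invented, and a rule over a response pattern could serve the same role:

    # Categorical scoring: a score (or response pattern) maps the test
    # taker into a nominal category rather than onto a numeric scale.
    # Cut scores and labels are hypothetical.

    def categorize(scale_score: int) -> str:
        if scale_score >= 20:
            return "clinical range"
        if scale_score >= 12:
            return "at risk"
        return "typical range"

    print(categorize(16))  # "at risk" -- a nominal label, not an interval score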

Scoring Methods (cont.)

Ipsative model: a test taker's scores are not compared to those of other test takers; instead, the scores on the various scales are compared WITHIN the test taker (which scores are high and which are low)
- e.g., a test taker may complete a measure of interpersonal problems of various types, and the test administrator may want to determine which of the types the test taker feels is most problematic for him or her (see the sketch below)

The cumulative model may be combined with the categorical or ipsative model.
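
A minimal sketch of ipsative interpretation (the scale names and scores are hypothetical):

    # Ipsative interpretation: rank a test taker's own scale scores
    # against each other rather than against group norms.

    scales = {"hostility": 14, "avoidance": 21, "dependency": 9}

    ranked = sorted(scales.items(), key=lambda kv: kv[1], reverse=True)
    print(ranked[0])  # ('avoidance', 21) -- the highest scale FOR THIS test taker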

Response Bias

- in preparing an item review, each question can be evaluated from two perspectives: Is the item fair? Is the item biased?
- tests are subject to error, and one source of error comes from the test takers themselves

Response Sets/Styles

- response sets/styles are patterns of responding that result in misleading information and limit the accuracy and usefulness of the test scores

Reasons for misleading information:
1. the information requested is too personal
2. test takers distort their responses
3. test takers answer items carelessly
4. test takers may feel coerced into completing the test

Response Style
- some people always agree (acquiescence) or always disagree (criticalness) with statements without attending to the actual content
- this usually occurs when items are ambiguous
- solution: use both positively and negatively keyed items (see the sketch below)
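
A minimal sketch of why mixed keying helps (the item names are hypothetical): negatively keyed items are reverse-scored before summing, so a pure agree-with-everything pattern no longer inflates the total.

    # Reverse-score negatively keyed items, then sum.
    MAX_RESPONSE = 5  # 1-5 Likert scale

    def score(responses, negatively_keyed):
        total = 0
        for item, value in responses.items():
            if item in negatively_keyed:
                value = (MAX_RESPONSE + 1) - value  # 5 -> 1, 4 -> 2, ...
            total += value
        return total

    acquiescent = {"q1": 5, "q2": 5, "q3": 5, "q4": 5}  # agrees with everything
    print(score(acquiescent, negatively_keyed={"q2", "q4"}))  # 12, not 20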

Social Desirability
- some test takers choose socially acceptable answers or present themselves in a favourable light
- people often attend less to the trait being measured than to the social acceptability of the statement
- this represents unwanted variance

Social Desirability (cont.)

Example items:
- Friends would call me spontaneous.
- People I know can count on me to finish what I start.
- I would rather work in a group than by myself.
- I often get stressed out in many situations.

Faking
- some test takers may respond in a particular way to cause a desired outcome
- may fake good (e.g., in employment settings) to create a favourable impression
- may fake bad (e.g., in clinical or forensic settings) as a cry for help or to appear mentally disturbed
- test developers may use subtle questions that are difficult to fake because they aren't clearly face valid

Faking Bad
- people try to look worse than they really are
- a common problem in clinical settings

Reasons:
- a cry for help
- wanting to plead insanity in court
- wanting to avoid being drafted into the military
- wanting to show psychological damage

Most people who fake bad overdo it.

Impression Management
Mitigating IM:
- use positive and negative impression scales (items endorsed by only 10% of the population)
- use lie scales to flag those who score high (e.g., "I get angry sometimes.")
- use inconsistency scales (e.g., two different responses to two similar questions; see the sketch below)
- use multiple assessment methods (other than self-report)
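
A minimal sketch of an inconsistency scale (the item pairs and cutoff are hypothetical): pairs of near-identical items should receive near-identical answers, so large within-pair gaps suggest careless or managed responding.

    # Sum of absolute within-pair differences on semantically similar items.
    pairs = [("q3", "q17"), ("q8", "q24")]  # hypothetical similar-item pairs
    responses = {"q3": 5, "q17": 1, "q8": 4, "q24": 4}

    inconsistency = sum(abs(responses[a] - responses[b]) for a, b in pairs)
    print(inconsistency)  # 4 -- flag the protocol if this exceeds a chosen cutoff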

Random Responding
- random responding may occur when test takers are unwilling or unable to respond accurately
- likely to occur when the test taker lacks the necessary skills (e.g., reading), does not want to be evaluated, or lacks attention to the task
- try to detect it by embedding a scale that yields clear results from the vast majority of test takers, such that a different result suggests the test taker wasn't cooperating

Random Responding
Detection:
- duplicate items:
  - I love my mother.
  - I hate my mother.
- infrequency scales (see the sketch below):
  - I've never had hair on my head.
  - I have not seen a car in 10 years.
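
A minimal sketch of infrequency-scale detection (the item IDs and cutoff are invented): almost no one endorses these items, so endorsing several suggests random responding.

    # Flag protocols that endorse too many statistically rare items.
    INFREQUENCY_ITEMS = {"q12", "q29", "q41"}  # e.g., "I have not seen a car in 10 years."
    CUTOFF = 2

    def flag_random(responses):
        endorsed = sum(1 for item in INFREQUENCY_ITEMS if responses.get(item) is True)
        return endorsed >= CUTOFF

    print(flag_random({"q12": True, "q29": True, "q41": False}))  # True -> review the protocol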

Random Responding
May occur for several reasons:
- people are not motivated to participate
- reading or language difficulties
- not understanding the instructions or item content
- being too confused or disturbed to respond appropriately

Piloting and Revising Tests


- you can't assume the test will perform as expected
- a pilot test scientifically investigates the test's reliability and validity
- administer the test to a sample from the target audience
- analyze the data and revise the test to fix any problems uncovered; there are many aspects to consider

Setting Up the Pilot Test


- the test situation should match the actual circumstances in which the test will be used (e.g., in sample characteristics and setting)
- developers must follow the American Psychological Association's ethics code (e.g., strict rules of confidentiality, and publishing only aggregate results)

Conducting the Pilot Test


- the depth and breadth of the pilot depend on the size and complexity of the target audience
- adhere strictly to the procedures outlined in the test administration instructions
- generally requires a large sample
- may ask participants about the testing experience

Analyzing the Results


- can gather both quantitative and qualitative information
- use the quantitative information for such things as item characteristics, internal consistency, convergent and discriminant validity, and in some instances predictive validity (a sketch of two item statistics follows)
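
As an illustration of item-level analysis on pilot data, here is a minimal Python sketch (the data are invented) of item difficulty (proportion correct) and item discrimination (corrected item-total correlation):

    from statistics import correlation  # Python 3.10+

    # rows = test takers, columns = items; 1 = correct, 0 = incorrect
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]

    for j in range(len(data[0])):
        item = [row[j] for row in data]
        difficulty = sum(item) / len(item)          # proportion who answered correctly
        rest = [sum(row) - row[j] for row in data]  # total score excluding item j
        discrimination = correlation(item, rest)    # corrected item-total correlation
        print(f"item {j + 1}: p = {difficulty:.2f}, r_it = {discrimination:.2f}")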

Revising the Test


- choosing the final items requires weighing each item's content validity, item difficulty and discrimination, inter-item correlation, and bias
- when new items need to be added or existing items need to be revised, the items must again be pilot tested to ensure that the changes produced the desired results

Validation and Cross-Validation


- validation is the process of obtaining evidence that the test effectively measures what it is supposed to measure (i.e., reliability and validity)
- the first part, establishing content validity, is carried out as the test is developed; evidence that the test measures the construct (construct validity) and predicts an outside criterion is gathered in subsequent data collection

Validation and Cross-Validation


- when the final revision of a test yields scores with sufficient evidence of reliability and validity, test developers then conduct cross-validation: a final round of test administration with another sample
- because of chance factors, the reliability and validity coefficients will likely be smaller in the new sample; this drop is referred to as shrinkage (see the sketch below)
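
A minimal sketch of the idea (all data invented): a validity coefficient estimated in the development sample is re-estimated in a fresh sample, and the drop is the shrinkage.

    from statistics import correlation  # Python 3.10+

    # Test scores and criterion values for two samples (hypothetical data).
    dev_scores    = [12, 15, 9, 20, 14, 18, 11, 16]
    dev_criterion = [2.1, 2.9, 1.7, 3.8, 2.5, 3.4, 2.0, 2.8]
    new_scores    = [13, 17, 10, 19, 12, 15, 14, 18]
    new_criterion = [2.4, 2.8, 2.2, 3.1, 2.0, 2.6, 2.9, 2.7]

    r_dev = correlation(dev_scores, dev_criterion)  # validity in development sample
    r_new = correlation(new_scores, new_criterion)  # validity in cross-validation sample
    print(f"development r = {r_dev:.2f}, cross-validation r = {r_new:.2f}")
    print(f"shrinkage = {r_dev - r_new:.2f}")       # a positive drop is expected by chance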
