
Test Development

Why Develop a New Test?


- meet the needs of a special group of test takers
- sample behaviours from a newly defined test domain
- improve the accuracy of test scores for their intended purpose

Existing tests also need to be revised.

First Four Steps


1. Defining the test universe, audience, and purpose
2. Developing a test plan
3. Composing the test items
4. Writing the administration instructions

Continued Steps of Test Construction


- Diagram of Test Construction (p. 234)
- Constructing Scales
- Piloting the Test
- Standardizing the Test
- Collecting Norms
- Validation & Reliability Studies
- Manual Writing
- Test Revision

Defining the Test Universe, Audience, & Purpose

Defining the test universe:
- prepare a working definition of the construct
- locate studies that explain the construct
- locate current measures of the construct

Defining the Test Universe, Audience, & Purpose

Defining the target audience:
- make a list of the characteristics of the persons who will take the test, particularly those characteristics that will affect how test takers respond to the test questions (e.g., reading level, disabilities, honesty, language)

Defining the Test Universe, Audience, & Purpose

Defining the purpose:
- includes not only what the test will measure, but also how scores will be used
- e.g., will scores be used to compare test takers (normative approach) or to indicate achievement (criterion approach)?
- e.g., will scores be used to test a theory or to provide information about an individual?

Developing a Test Plan

A test plan includes a definition of the construct, the content to be measured (the test domain), the format for the questions, and how the test will be administered and scored.

Defining the Construct


- define the construct after reviewing the literature about the construct and any available measures
- operationalize the construct in terms of observable and measurable behaviours
- this provides boundaries for the test domain (what should and shouldn't be included)
- specify the approximate number of items needed

Choosing the Test Format


- test format refers to the type of questions the test will contain (usually one format per test, for ease of test takers and of scoring)
- test formats have two elements:
  - a stimulus (e.g., a question or phrase)
  - a mechanism for response (e.g., multiple choice, true/false)
- the format may be objective or subjective

Composing the Test Items


- test items are the stimuli presented to the test taker (they may or may not take the form of questions)
- the form chosen depends on decisions made in the test plan (e.g., purpose, audience, method of administration, scoring)

Test Types

Structured Response:
- Multiple Choice
- True/False
- Forced Choice
- Likert Scales

Free Response:
- Essay, Short Answer
- Interview Questions
- Fill in the Blank
- Projective Techniques

Multiple Choice

- multiple choice is most common in educational testing (and also some personality and employment testing)
- consists of a stem and a number of responses; there should be only one right answer
- the wrong answers are called distractors because they may appear correct; they should be realistic enough to appeal to the uninformed test taker
- scoring is easy, but the downside is that test takers can get some items correct by guessing (see the sketch below)
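
To make the guessing problem concrete, here is a minimal Python sketch (the numbers are invented for illustration) of the expected score from blind guessing and the classic correction-for-guessing formula, R - W / (k - 1):

    # Expected effect of blind guessing on a multiple-choice test,
    # plus the classic correction-for-guessing formula.

    def expected_guess_score(n_items: int, n_options: int) -> float:
        """Expected number correct if a test taker guesses on every item."""
        return n_items / n_options

    def corrected_score(right: int, wrong: int, n_options: int) -> float:
        """Correction for guessing; omitted items are not counted as wrong."""
        return right - wrong / (n_options - 1)

    print(expected_guess_score(40, 4))  # 10.0 -- pure guessing on 40 four-option items
    print(corrected_score(30, 8, 4))    # 27.33... -- 30 right, 8 wrong, 2 omitted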

Multiple Choice

Pros:
- more answer options (4-5) reduce the chance of guessing an item correctly
- many items aid in comparing students, reduce ambiguity, and increase reliability

Cons:
- measures narrow facets of performance
- reading time increases as more answer options are added
- transparent clues (e.g., verb tense, or whether the stem uses "a" or "an") may encourage guessing
- it is difficult to write four or five reasonable choices
- questions take more time to write

True/False

- true/false is also used in educational testing and some personality testing
- in educational testing the test taker can again gain some advantage by guessing

True/False (cont.)

- ideally, a true/false question should be constructed so that an incorrect response indicates something about the student's misunderstanding of the learning objective
- this may be a difficult task, especially when constructing a true statement

Forced Choice Items

- forced choice is similar to multiple choice but is used in personality and attitude tests (e.g., the MBTI)
- the test taker must choose between unrelated but equally acceptable responses

Forced Choice Items (cont.)

Example:
Place an X in the space to the left of the word in each pair that best describes your personality.

1. ____ Sunny
   ____ Friendly

2. ____ Outgoing
   ____ Loyal

Likert Scales

- Likert scales are usually reliable and highly popular (e.g., in personality and attitude tests)
- an item is presented with an array of response options (e.g., a 1-to-5 or 1-to-7 scale), usually on an agree/disagree or approve/disapprove continuum

Test Types

Structured Response

Advantages:
- great breadth
- quick scoring

Disadvantages:
- limited depth
- difficult to assess higher levels of skills
- guessing/memorization vs. knowledge

Subjective Items

- subjective items are less easily scored, but they provide the test taker with fewer cues and open wider areas for response; often used in education
- essay questions: responses can vary in breadth and depth, and the scorer must determine to what extent the response is correct (often by examining the match with a predetermined correct response)

Essay Questions

- provide a freedom of response that facilitates assessing higher cognitive behaviours (e.g., analysis and evaluation)
- allow respondents to focus on what they have learned rather than limiting them to specific questions

Interview Questions
- interview questions are often used in organizational settings; the interviewer decides what is a good or poor answer
- the test plan should be based on the knowledge, skills, abilities, and other characteristics required to perform the job
- this information can be obtained from a job description, a job analysis, or a current job incumbent

Projective Techniques

- projective techniques are often employed in clinical settings
- a highly ambiguous stimulus is used to elicit an unstructured response (i.e., the test taker projects his or her perception and perspective onto a neutral stimulus)
- a variety of stimuli may be used (e.g., pictures, words), and responses may be verbal or drawn

Sentence Completion

- the sentence-completion format presents an incomplete sentence that the test taker completes (e.g., "I feel happiest when ___")
- subjective tests are at risk of judgment error, so inter-rater reliability is of particular importance; scoring keys and rater training are important (a sketch of two agreement statistics follows)
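
As an illustration of checking inter-rater reliability on subjectively scored items, here is a minimal Python sketch (the rater scores are invented) of percent agreement and Cohen's kappa:

    # Two common agreement statistics for subjectively scored items.
    from collections import Counter

    def percent_agreement(rater_a, rater_b):
        """Proportion of items on which two raters gave the same score."""
        matches = sum(a == b for a, b in zip(rater_a, rater_b))
        return matches / len(rater_a)

    def cohens_kappa(rater_a, rater_b):
        """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
        n = len(rater_a)
        p_o = percent_agreement(rater_a, rater_b)
        counts_a, counts_b = Counter(rater_a), Counter(rater_b)
        p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    a = [3, 2, 4, 4, 1, 3, 2, 4]    # hypothetical essay scores from rater A
    b = [3, 2, 3, 4, 1, 3, 2, 4]    # hypothetical scores from rater B
    print(percent_agreement(a, b))  # 0.875
    print(cohens_kappa(a, b))       # about 0.83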

Test Types

Subjective Items

Advantages:
- can test higher cognitive skills
- encourages organizing and developing thoughts

Disadvantages:
- difficult to grade
- judgment error (e.g., inter-rater reliability)
- requires an objective scoring key prepared in advance

Writing Good Items

- the basic building block of test construction
- little attention is typically given to writing items
- an art that requires originality and creativity, combined with knowledge of the test domain and good item-writing practices
- not all items will perform as expected: some may be too easy or too difficult, some may be misinterpreted, etc.
- rule of thumb: write at least twice as many items as you expect to use
- broad vs. narrow items

Writing Good Items (cont.)

Suggestions:
- identify item topics by consulting the test plan (increases content validity)
- ensure that each item presents a central idea or problem
- write items drawn only from the testing universe
- write each item in a clear and direct manner

Writing Good Items (cont.)

Suggestions:
- use vocabulary and language appropriate for the target audience (e.g., age, culture)
- avoid sexist or racist language (e.g., mailman, fireman)
- make all items independent (e.g., ask only one question per item)
- ask an expert to review items to reduce ambiguity and inaccuracy

Writing Administration Instructions

- specify the testing environment to decrease variation or error in test scores
- the instructions should address:
  - group or individual administration
  - requirements for the location (e.g., quiet)
  - required equipment
  - time limits or approximate completion time
  - a script for the administrator, and answers to questions test takers may ask

Specifying Administration and Scoring Methods

- determine how the test will be administered (e.g., orally, in writing, or by computer; individually or in groups)
- determine the method of scoring: whether the test is scored by hand by the test administrator, accompanied by scoring software, or sent to the test publisher for scoring

Scoring Methods

Cumulative model: the most common
- assumes that the more a test taker responds in a particular fashion, the more he or she has of the attribute being measured (e.g., more correct answers, or endorsing higher numbers on a Likert scale)
- correct responses, or responses on a Likert scale, are summed (see the sketch below)
- yields interval data that can be interpreted with reference to norms
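
A minimal Python sketch of cumulative scoring (the item names and responses are hypothetical):

    # Cumulative scoring: Likert responses are summed into a total
    # scale score; higher totals indicate more of the attribute.

    responses = {"item1": 4, "item2": 5, "item3": 3, "item4": 4}  # 1-5 Likert scale

    total = sum(responses.values())
    print(total)  # 16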

Scoring Methods (cont.)

Categorical model: places test takers in a group (a sketch follows)
- e.g., a particular pattern of responses may suggest a diagnosis of a certain psychological disorder
- typically yields nominal data, because it places test takers in categories
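
One way to illustrate the idea; the cut scores and category labels below are invented, and a rule over a response pattern could serve the same role:

    # Categorical scoring: a score (or response pattern) maps the test
    # taker into a nominal category rather than onto a numeric scale.
    # Cut scores and labels are hypothetical.

    def categorize(scale_score: int) -> str:
        if scale_score >= 20:
            return "clinical range"
        if scale_score >= 12:
            return "at risk"
        return "typical range"

    print(categorize(16))  # "at risk" -- a nominal label, not an interval score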

Scoring Methods (cont.)

Ipsative model: a test taker's scores are not compared to those of other test takers; instead, the scores on the various scales are compared WITHIN the test taker (which scores are high and which are low)
- e.g., a test taker may complete a measure of interpersonal problems of various types, and the test administrator may want to determine which of the types the test taker feels is most problematic for him or her (see the sketch below)

The cumulative model may be combined with the categorical or ipsative model.
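
A minimal sketch of ipsative interpretation (the scale names and scores are hypothetical):

    # Ipsative interpretation: rank a test taker's own scale scores
    # against each other rather than against group norms.

    scales = {"hostility": 14, "avoidance": 21, "dependency": 9}

    ranked = sorted(scales.items(), key=lambda kv: kv[1], reverse=True)
    print(ranked[0])  # ('avoidance', 21) -- the highest scale FOR THIS test taker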

Response Bias

- in preparing an item review, each question can be evaluated from two perspectives: Is the item fair? Is the item biased?
- tests are subject to error, and one source of error comes from the test takers themselves

Response Sets/Styles

- response sets/styles are patterns of responding that result in misleading information and limit the accuracy and usefulness of the test scores

Reasons for misleading information:
1. the information requested is too personal
2. test takers distort their responses
3. test takers answer items carelessly
4. test takers may feel coerced into completing the test

Response Style
- some people always agree (acquiescence) or always disagree (criticalness) with statements without attending to the actual content
- this usually occurs when items are ambiguous
- solution: use both positively and negatively keyed items (see the sketch below)
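
A minimal sketch of why mixed keying helps (the item names are hypothetical): negatively keyed items are reverse-scored before summing, so a pure agree-with-everything pattern no longer inflates the total.

    # Reverse-score negatively keyed items, then sum.
    MAX_RESPONSE = 5  # 1-5 Likert scale

    def score(responses, negatively_keyed):
        total = 0
        for item, value in responses.items():
            if item in negatively_keyed:
                value = (MAX_RESPONSE + 1) - value  # 5 -> 1, 4 -> 2, ...
            total += value
        return total

    acquiescent = {"q1": 5, "q2": 5, "q3": 5, "q4": 5}  # agrees with everything
    print(score(acquiescent, negatively_keyed={"q2", "q4"}))  # 12, not 20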

Social Desirability
- some test takers choose socially acceptable answers or present themselves in a favourable light
- people often attend less to the trait being measured than to the social acceptability of the statement
- this represents unwanted variance

Social Desirability (cont.)

Example items:
- Friends would call me spontaneous.
- People I know can count on me to finish what I start.
- I would rather work in a group than by myself.
- I often get stressed out in many situations.

Faking
- some test takers may respond in a particular way to cause a desired outcome
- may fake good (e.g., in employment settings) to create a favourable impression
- may fake bad (e.g., in clinical or forensic settings) as a cry for help or to appear mentally disturbed
- test developers may use subtle questions that are difficult to fake because they aren't clearly face valid

Faking Bad
- people try to look worse than they really are
- a common problem in clinical settings

Reasons:
- a cry for help
- wanting to plead insanity in court
- wanting to avoid being drafted into the military
- wanting to show psychological damage

Most people who fake bad overdo it.

Impression Management
Mitigating IM:
- use positive and negative impression scales (items endorsed by only 10% of the population)
- use lie scales to flag those who score high (e.g., "I get angry sometimes.")
- use inconsistency scales (e.g., two different responses to two similar questions; see the sketch below)
- use multiple assessment methods (other than self-report)
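
A minimal sketch of an inconsistency scale (the item pairs and cutoff are hypothetical): pairs of near-identical items should receive near-identical answers, so large within-pair gaps suggest careless or managed responding.

    # Sum of absolute within-pair differences on semantically similar items.
    pairs = [("q3", "q17"), ("q8", "q24")]  # hypothetical similar-item pairs
    responses = {"q3": 5, "q17": 1, "q8": 4, "q24": 4}

    inconsistency = sum(abs(responses[a] - responses[b]) for a, b in pairs)
    print(inconsistency)  # 4 -- flag the protocol if this exceeds a chosen cutoff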

Random Responding
- random responding may occur when test takers are unwilling or unable to respond accurately
- likely to occur when the test taker lacks the necessary skills (e.g., reading), does not want to be evaluated, or lacks attention to the task
- try to detect it by embedding a scale that yields clear results from the vast majority of test takers, such that a different result suggests the test taker wasn't cooperating

Random Responding
Detection:
- duplicate items:
  - I love my mother.
  - I hate my mother.
- infrequency scales (see the sketch below):
  - I've never had hair on my head.
  - I have not seen a car in 10 years.
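
A minimal sketch of infrequency-scale detection (the item IDs and cutoff are invented): almost no one endorses these items, so endorsing several suggests random responding.

    # Flag protocols that endorse too many statistically rare items.
    INFREQUENCY_ITEMS = {"q12", "q29", "q41"}  # e.g., "I have not seen a car in 10 years."
    CUTOFF = 2

    def flag_random(responses):
        endorsed = sum(1 for item in INFREQUENCY_ITEMS if responses.get(item) is True)
        return endorsed >= CUTOFF

    print(flag_random({"q12": True, "q29": True, "q41": False}))  # True -> review the protocol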

Random Responding
May occur for several reasons:
- people are not motivated to participate
- reading or language difficulties
- not understanding the instructions or item content
- being too confused or disturbed to respond appropriately

Piloting and Revising Tests


- you can't assume the test will perform as expected
- a pilot test scientifically investigates the test's reliability and validity
- administer the test to a sample from the target audience
- analyze the data and revise the test to fix any problems uncovered; there are many aspects to consider

Setting Up the Pilot Test


- the test situation should match the actual circumstances in which the test will be used (e.g., in sample characteristics and setting)
- developers must follow the American Psychological Association's ethics code (e.g., strict rules of confidentiality, and publishing only aggregate results)

Conducting the Pilot Test


- the depth and breadth of the pilot depend on the size and complexity of the target audience
- adhere strictly to the procedures outlined in the test administration instructions
- generally requires a large sample
- may ask participants about the testing experience

Analyzing the Results


- can gather both quantitative and qualitative information
- use the quantitative information for such things as item characteristics, internal consistency, convergent and discriminant validity, and in some instances predictive validity (a sketch of two item statistics follows)
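
As an illustration of item-level analysis on pilot data, here is a minimal Python sketch (the data are invented) of item difficulty (proportion correct) and item discrimination (corrected item-total correlation):

    from statistics import correlation  # Python 3.10+

    # rows = test takers, columns = items; 1 = correct, 0 = incorrect
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 0, 0],
    ]

    for j in range(len(data[0])):
        item = [row[j] for row in data]
        difficulty = sum(item) / len(item)          # proportion who answered correctly
        rest = [sum(row) - row[j] for row in data]  # total score excluding item j
        discrimination = correlation(item, rest)    # corrected item-total correlation
        print(f"item {j + 1}: p = {difficulty:.2f}, r_it = {discrimination:.2f}")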

Revising the Test


- choosing the final items requires weighing each item's content validity, item difficulty and discrimination, inter-item correlation, and bias
- when new items need to be added or existing items need to be revised, the items must again be pilot tested to ensure that the changes produced the desired results

Validation and Cross-Validation


- validation is the process of obtaining evidence that the test effectively measures what it is supposed to measure (i.e., reliability and validity)
- the first part, establishing content validity, is carried out as the test is developed; evidence that the test measures the construct (construct validity) and predicts an outside criterion is gathered in subsequent data collection

Validation and Cross-Validation


- when the final revision of a test yields scores with sufficient evidence of reliability and validity, test developers then conduct cross-validation: a final round of test administration with another sample
- because of chance factors, the reliability and validity coefficients will likely be smaller in the new sample; this drop is referred to as shrinkage (see the sketch below)
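
A minimal sketch of the idea (all data invented): a validity coefficient estimated in the development sample is re-estimated in a fresh sample, and the drop is the shrinkage.

    from statistics import correlation  # Python 3.10+

    # Test scores and criterion values for two samples (hypothetical data).
    dev_scores    = [12, 15, 9, 20, 14, 18, 11, 16]
    dev_criterion = [2.1, 2.9, 1.7, 3.8, 2.5, 3.4, 2.0, 2.8]
    new_scores    = [13, 17, 10, 19, 12, 15, 14, 18]
    new_criterion = [2.4, 2.8, 2.2, 3.1, 2.0, 2.6, 2.9, 2.7]

    r_dev = correlation(dev_scores, dev_criterion)  # validity in development sample
    r_new = correlation(new_scores, new_criterion)  # validity in cross-validation sample
    print(f"development r = {r_dev:.2f}, cross-validation r = {r_new:.2f}")
    print(f"shrinkage = {r_dev - r_new:.2f}")       # a positive drop is expected by chance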
