students. She uses a variety of formative and summative

assessments which she writes herself to measure her

students’ learning. Recently, she has become interested in

using classroom assessment data in other ways. She would

like to make decisions about the quality of her own instruction,

the quality of her tests and the fairness of her scoring. Item

analysis techniques offer her the tools to make those

decisions wisely.

performance on individual test items. There are three common

types of item analysis, which provide teachers with three

different types of information:

Item Difficulty - Teachers often wish to know how "hard"

a test question or performance task was for their

students. To help answer that question, they can

produce a difficulty index for a test item by calculating

the proportion of students in class who got that

particular item correct. The larger the proportion, the

more students there are who have learned the content

measured by the item. Although we call this proportion

a difficulty index, the name is counterintuitive. That is,

one actually gets a measure of how easy the item is,

not the difficulty of the item. Thus, a big number means

easy, not difficult.

teachers related to testing involves the fundamental

validity of a given test; that is, whether a single test

item measures the same thing or assesses the same

objectives as the rest of the test. A number can be

calculated that provides that information in a fairly

straightforward way. The discrimination index is a

rough indication of the validity of an item. As such, it is

a measure of an item’s ability to discriminate between

those who scored high on the total test and those who

scored low. Once computed, this index may be

interpreted as an indication of the extent to which

overall knowledge of the content area or mastery of the

skills is related to the response on an item. Perhaps the

most crucial validity standard for a test item is whether

or not a student’s correct answer is due to his or her

level of knowledge or ability and not due to something

else such as chance or test bias. An item that can

discriminate between students with high knowledge or

ability and those with low knowledge or ability (as

measured by the whole test) should be considered an

item that "works." Discrimination, in this case, is a good

thing.

the performance of an entire test item, teachers are

often interested in examining the performance of

individual distractors (incorrect answer options) on

multiple-choice items. By calculating the proportion of

students who choose each answer option, teachers can

identify which distractors are "functioning" and appear

attractive to students who do not know the correct

answer, and which distractors are simply taking up

space and are not chosen by many students. To

reduce blind guessing, which can result in a correct

answer purely by chance (which hurts the validity and

reliability of a test item), teachers want as many

plausible distractors as is feasible. Analysis of response

options allows teachers to fine tune and improve items

they may wish to use again with future classes.

calculations involved in item analysis with data for an example

item. For our example, imagine a classroom of 25 students

who took a test that included the item below. (The asterisk

indicates that B is the correct answer.)

Item Data (Each Answer Option)

Who wrote The Great Gatsby?

    Answer Option               Number of Students
    A. Faulkner                 4
    *B. Fitzgerald              16
    C. Hemingway                5
    D. Steinbeck                0
    Total Number of Students    25

For our sample item, we can see how difficult it was for

students by determining what percentage of the class got it

right. Because they are proportions, difficulty indices range

from .00 to 1.0.

Difficulty Index - Proportion of students who got an item correct

Method

1. Count the number of students who got the correct answer.
   Example: 16

2. Divide by the total number of students who took the test.
   Example: 16/25 = .64

that number later in this lesson.
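The two steps above can be sketched in a few lines of Python. This is only an illustration: the function name `difficulty_index` is made up for this example and is not part of any grading tool.

```python
def difficulty_index(num_correct: int, num_students: int) -> float:
    """Return the proportion of students who answered the item correctly."""
    if num_students <= 0:
        raise ValueError("num_students must be positive")
    if not 0 <= num_correct <= num_students:
        raise ValueError("num_correct must be between 0 and num_students")
    return num_correct / num_students


# Example item: 16 of the 25 students chose Fitzgerald.
p = difficulty_index(16, 25)
print(f"{p:.2f}")  # 0.64
```

Remember that a larger value means an easier item.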

Calculating an Item Discrimination Index

about whether a given question is measuring what we want by

calculating a discrimination index. In other words, did students

perform on this item in the same way they did on the whole

test? Discrimination indices range from -1.0 to 1.0.

Discrimination Index - A comparison of how overall high scorers on the whole test did on one particular item compared to overall low scorers

Method

1. Sort your tests by total score and create two groupings of tests--the high scores, made up of the top half of tests, and the low scores, made up of the bottom half of tests.
   Example: Imagine this information for our example: 10 out of 13 students (or tests) in the high group and 6 out of 12 students in the low group got the item correct.

2. For each group, calculate a difficulty index for the item.
   Example: High Group: 10/13 = .77; Low Group: 6/12 = .50

3. Subtract the difficulty index for the low-score group from the difficulty index for the high-score group.
   Example: .77 - .50 = .27

Later, we’ll see why that number indicates pretty good validity

for this item.
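As a rough sketch, the three steps can be written as a short Python function that takes the group counts directly. The function name `discrimination_index` and its parameter names are invented for this illustration.

```python
def discrimination_index(high_correct: int, high_total: int,
                         low_correct: int, low_total: int) -> float:
    """Difficulty index of the high group minus that of the low group."""
    if high_total <= 0 or low_total <= 0:
        raise ValueError("both groups must contain at least one student")
    high_p = high_correct / high_total  # step 2, high-score group
    low_p = low_correct / low_total     # step 2, low-score group
    return high_p - low_p               # step 3


# Example item: 10 of 13 correct in the high group, 6 of 12 in the low group.
d = discrimination_index(10, 13, 6, 12)
print(round(d, 2))  # 0.27
```

The code subtracts the unrounded proportions, while the worked example rounds each group to two decimals first. For this item both routes give .27.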

By the way, the suggestion of cutting the class into two halves

is made for the sake of convenience, as most classrooms do

not have enough students to make it practical to work with

high and low groups that represent a smaller percentage of

students. Statisticians who study measurement issues would

prefer that you divide your class into the top 27% and the

lower 27% (the strangely precise percentages are based on

mathematical assumptions about normal distributions of

performance), so if you have large enough classes that you

can do this and still have big numbers in each group, please

do so!
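If you keep each student's total score alongside whether he or she got the item right, the grouping can be automated. The sketch below rests on several assumptions: the data layout (a list of `(total_score, item_correct)` pairs), the function names, and the 20-student class are all made up for illustration. It also builds two groups of equal size, so a 50% split of an odd-sized class leaves out the middle test rather than producing the 13/12 split used earlier.

```python
def split_groups(records, fraction=0.27):
    """Return (high_group, low_group), each holding `fraction` of the class."""
    if not 0 < fraction <= 0.5:
        raise ValueError("fraction must be greater than 0 and at most 0.5")
    if len(records) < 2:
        raise ValueError("need at least two students")
    ranked = sorted(records, key=lambda r: r[0], reverse=True)
    size = max(1, int(len(ranked) * fraction))  # round down, at least one
    return ranked[:size], ranked[-size:]


def discrimination_from_records(records, fraction=0.27):
    """Discrimination index from (total_score, item_correct) records."""
    high, low = split_groups(records, fraction)
    high_p = sum(1 for _, correct in high if correct) / len(high)
    low_p = sum(1 for _, correct in low if correct) / len(low)
    return high_p - low_p


# Made-up class of 20 students with total scores 20 down to 1.
scores = list(range(20, 0, -1))
flags = [True, True, True, True, False,                  # top five scorers
         True, False, True, True, False,
         True, False, True, False, True,
         False, True, False, False, True]                # bottom five scorers
records = list(zip(scores, flags))

# 27% of 20 students rounds down to groups of 5: 4/5 - 2/5.
print(round(discrimination_from_records(records), 2))  # 0.4
```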

Analyzing Answer Options

context it would be interesting to know what other answer

options were appealing to students. By calculating the

percentage of students who picked each answer option,

teachers can see what sort of errors students are making.

Analysis of Answer Options - An examination of the proportion of students choosing each response option

Method

For each answer option, divide the number of students who choose that answer option by the number of students taking the test.

Analysis

Who wrote The Great Gatsby?

    A. Faulkner       4/25 = .16
    *B. Fitzgerald    16/25 = .64
    C. Hemingway      5/25 = .20
    D. Steinbeck      0/25 = .00

Hemingway all mixed up.
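The same proportions can be produced from a raw list of student responses. The following is a minimal sketch that assumes the responses are recorded as option letters. The function name `option_proportions` is hypothetical.

```python
from collections import Counter


def option_proportions(responses, options):
    """Proportion of students choosing each answer option."""
    total = len(responses)
    if total == 0:
        raise ValueError("no responses to analyze")
    counts = Counter(responses)  # unchosen options count as 0
    return {option: counts[option] / total for option in options}


# Example item: 4 chose A, 16 chose B, 5 chose C, nobody chose D.
responses = ["A"] * 4 + ["B"] * 16 + ["C"] * 5
proportions = option_proportions(responses, ["A", "B", "C", "D"])
for option, p in proportions.items():
    print(f"{option}: {p:.2f}")
# A: 0.16
# B: 0.64
# C: 0.20
# D: 0.00
```

Passing the full list of options matters here: it keeps option D in the output even though no student chose it.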

Interpreting the Results of Item Analyses

means that 64% of students knew the answer. Some teachers

would consider 64% "success" on a basic fact or classroom

objective as high; others would consider it low. If a teacher

believes that .64 is too low, he or she might decide that the

content has not been learned or that objectives are not being

met for the whole class. Another interpretation might be that

the item was too difficult, confusing or invalid. The latter

interpretation suggests that learning may have occurred, but

that the particular item was not designed well enough to

measure it.

Interpreting a Discrimination Index

The discrimination index for the item was .27. The formula for

the discrimination index is such that if more students in the

high-scoring group chose the correct answer than did students

in the low-scoring group, the number will be positive. At a

minimum, then, one would hope for a positive value, as that

would indicate that knowledge resulted in the correct answer.

The greater the positive value (the closer it is to 1.0), the

stronger the relationship between overall test performance and

performance on that item. If the discrimination index is

negative, on the other hand, that means that for some reason

students who scored low on the test were more likely to get

the answer correct. This is a strange situation, which suggests

poor validity for an item.

A reasonable interpretation for a positive discrimination index

is that, to some degree, the item is measuring the same thing

as the rest of the test. Typically, this is what teachers want, but

it is not uncommon for classroom tests to cover many different

areas that might not even be related to each other. Therefore,

low discrimination for one item on one section of a long

multifaceted assessment is not always cause for concern.

Teachers interested in the discrimination of items that work

better within a subgroup of items on a broad test can analyze

them in that context, as if the smaller group of items dedicated

to a single topic or skill were a stand-alone test.

Negative discrimination indices are harder to interpret than

positive values. Although they are rare, they do occur.

The best first step is to double-check that there was not an

error on the answer key or in the scoring of an item. If

everything was scored correctly, it might be a situation where

the item fooled the students who knew the most. Contrary to

one myth about teachers, tests are not designed to fool

students. So it is a troubling situation when something about a

question that appears just fine on its face is for some reason a

problem for students. As to the reason why an item is

occasionally tricky for your best students, that is a little easier

to understand. Typically, teachers choose distractors

(incorrect answer options) that only students who have studied

most would even recognize as plausible, so they are the only

ones who are likely to be attracted to them. It is also possible that

an answer the teacher believes is wrong might be seen as

correct by students who have really learned the material.

Consider that possibility when you are faced with negative

discrimination values.

Interpreting an Analysis of Answer Options

item. The analysis of response options shows that students

who missed the item were about equally likely to choose

Answer A as Answer C. No students chose Answer D. Answer

option D does not act as a distractor. Students are not

choosing between four answer options on this item; they are

really choosing between only three options, as they are not

even considering Answer D. This makes guessing correctly

more likely, which hurts the validity and reliability of an item.

Without having talked with the teacher who provided the

imaginary item we analyzed in this example, one might guess

that the writings of John Steinbeck have not even been

discussed in the unit for which the test was developed. If true,

Steinbeck could not be expected to act as a plausible

distractor.
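A short sketch can flag distractors that are not functioning and show how the chance of a lucky guess changes. The 5% cutoff used below is an assumption chosen for illustration, not a value given in this lesson, and the function name is hypothetical.

```python
def nonfunctioning_distractors(counts, correct_option, min_proportion=0.05):
    """List incorrect options chosen by fewer than `min_proportion` of students.

    The default cutoff of 0.05 is an illustrative assumption.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no responses to analyze")
    return [option for option, n in counts.items()
            if option != correct_option and n / total < min_proportion]


counts = {"A": 4, "B": 16, "C": 5, "D": 0}
weak = nonfunctioning_distractors(counts, correct_option="B")
print(weak)  # ['D']

# Chance of a blind guess being correct with all four options in play,
# compared with only the options students actually consider.
options_considered = len(counts) - len(weak)
print(f"{1 / len(counts):.2f}")         # 0.25
print(f"{1 / options_considered:.2f}")  # 0.33
```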

Making Decisions Based on the Results of Item Analysis

information derived from item analysis. For example, you

could decide to change the way you teach. You could decide

that you are teaching just fine, but the test item or the whole

test is not assessing accurately. Or you could decide to use

information from the test you just gave to change the way you

score the test and assign grades.

of item analysis evidence. Recall that in our example, the item

had a difficulty index of .64. If a teacher believes that .64 is too

low, he or she could decide to improve instruction and try

different strategies to better meet the objective represented by

the item. In the case of low discrimination indices, teachers

could take an instructional approach that seeks to better

incorporate each independent fact or concept into a larger

whole. You also might consider what it is about the way you

teach a topic that confuses mostly those students who seem

to have mastered the rest of the material. For example, for

wrong answer options that attract a lot of students, teachers

can specifically emphasize the common error or

misunderstanding as they teach.

Making Test Design Decisions

tests or items they wish to use again. Many teachers have

item banks of questions they have used in the past and can

choose from each time they build assessments. This

allows teachers to have new tests for each use (which helps in

terms of test security and protecting against cheating), and

also helps to produce tests with good measurement

characteristics such as validity and reliability.

The fairest tests for all students are tests that are valid and

reliable. To improve the quality of tests, item analysis can

identify items that (a) are too difficult (or too easy if a teacher

has that concern), (b) are not able to differentiate between

those who have learned the content and those who have not,

or (c) include distractors that are not plausible. With

discrimination information, teachers can try to identify what, if

anything, was tricky about an item and change it. When

distractors are identified as being non-functional, teachers

may tinker with the item and create a new distractor. One goal

for a valid and reliable classroom test is to decrease the

chance that random guessing could result in credit for a

correct answer. The greater the number of plausible

distractors, the more accurate, valid and reliable the test

typically becomes.

Making Scoring Decisions

teachers re-score a less-than-perfect test in a way that

provides more accurate scores. If items are too hard, for

example, teachers can decide that the material was not taught

and, for the sake of fairness, remove the item from the current

test, and recompute the scores. The philosophy essentially

being espoused here is that "it was my fault, not yours." If

items have low or negative discrimination values, on the other

hand, teachers can remove them from the current test,

recompute scores, and remove the questions from the pool of

items for future tests.

Every day, some real-world classroom teachers improve their

tests immediately by removing items that were too hard or not

valid and scoring their students’ tests as if those items were

not part of the test. A simple way to make these score

corrections is to add a point to everyone’s test for each item

"removed." So, if you identified two "bad" items on a test, you

would add two points to everyone’s test score, whether they

got those items correct or not. This is not quite what would be

the best solution statistically, but it is close enough, and

students who performed well enough to still get the bad item

correct continue to get credit for it, which often seems fair to

teachers.
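The simple correction described above amounts to one addition per student. Here is a minimal sketch, assuming each item is worth one point. The student labels, the raw scores and the function name `adjust_scores` are made up for this example.

```python
def adjust_scores(raw_scores, num_removed_items):
    """Add one point per removed item to every student's score.

    Assumes each item is worth exactly one point.
    """
    if num_removed_items < 0:
        raise ValueError("num_removed_items cannot be negative")
    return {student: score + num_removed_items
            for student, score in raw_scores.items()}


# Hypothetical raw scores, with two "bad" items identified on the test.
raw_scores = {"Student 1": 18, "Student 2": 14, "Student 3": 11}
print(adjust_scores(raw_scores, 2))
# {'Student 1': 20, 'Student 2': 16, 'Student 3': 13}
```

Because everyone receives the points, a student who answered a removed item correctly ends up with credit for it twice. That is the sense in which the shortcut is close to, but not quite, the best statistical solution.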

If, as a teacher, you decide to remove items that are too hard,

you must decide what is too hard. There are at least a couple

of approaches to use here. One simple philosophy is that if the

majority of students miss an item, it must be too hard.

According to that thinking, any item with a difficulty index of

.50 or less is too hard.

up entirely of items with a certain level of difficulty. How many

students would pass such a test? For example, a common

grading scheme is to give D’s for 60% of the possible points

and to assign an F, or not passing, to scores below 60%. So, a

test made up of items with difficulties less than .60 would

result in an average grade of F. Under this "what is failure?"

approach, items with difficulties of less than .60 must be

considered too hard.
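Either rule can be applied to a whole test at once. In the sketch below the item labels and all difficulty values other than .64 are hypothetical, and the function name `too_hard_items` is invented for illustration. The `inclusive` flag reflects the wording of the two rules: ".50 or less" for the majority rule and "less than .60" for the failure rule.

```python
def too_hard_items(difficulties, cutoff, inclusive=False):
    """Return the items whose difficulty index falls below the cutoff.

    With inclusive=True, items exactly at the cutoff are also flagged.
    """
    if inclusive:
        return [item for item, p in difficulties.items() if p <= cutoff]
    return [item for item, p in difficulties.items() if p < cutoff]


# Q1 is the example item from this lesson; the others are made up.
difficulties = {"Q1": 0.64, "Q2": 0.50, "Q3": 0.58, "Q4": 0.90}

print(too_hard_items(difficulties, 0.50, inclusive=True))  # ['Q2']
print(too_hard_items(difficulties, 0.60))                  # ['Q2', 'Q3']
```

The stricter "what is failure?" rule flags more items, so the choice of rule changes how many scores you end up adjusting.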

Summary

can use to improve the way they teach, the way they assess,

and the way they score tests. With just a bit of time and math,

you can measure how hard a test question is, get some basic

information on its validity, and identify student misconceptions

and common errors. By acting on this information, you can

become a better teacher and increase the validity and

reliability of your assessments.

where assessment plays a key role in helping her make

decisions about her own teaching, as well as to measure her

students’ learning, she has become comfortable with altering

and improving her assessments frequently. With the help of

item analysis methods, she has developed the habit of making

most important classroom strategy decisions based on data.

