# Item Analysis in the Classroom

Mrs. Malone frequently assesses the performance of her students. She uses a variety of formative and summative assessments, which she writes herself, to measure her students' learning. Recently, she has become interested in using classroom assessment data in other ways. She would like to make decisions about the quality of her own instruction, the quality of her tests, and the fairness of her scoring. Item analysis techniques offer her the tools to make those decisions wisely.

Item analysis is a process of examining classwide performance on individual test items. There are three common types of item analysis, which provide teachers with three different types of information:
- Item Difficulty - Teachers often wish to know how "hard" a test question or performance task was for their students. To help answer that question, they can produce a difficulty index for a test item by calculating the proportion of students in class who got that particular item correct. The larger the proportion, the more students there are who have learned the content measured by the item. Although we call this proportion a difficulty index, the name is counterintuitive. That is, one actually gets a measure of how easy the item is, not how difficult it is. Thus, a big number means easy, not difficult.

- Item Discrimination (Item Validity) - Another concern to teachers related to testing involves the fundamental validity of a given test; that is, whether a single test item measures the same thing or assesses the same objectives as the rest of the test. A number can be calculated that provides that information in a fairly straightforward way. The discrimination index is a rough indication of the validity of an item. As such, it is a measure of an item's ability to discriminate between those who scored high on the total test and those who scored low. Once computed, this index may be interpreted as an indication of the extent to which overall knowledge of the content area or mastery of the skills is related to the response on an item. Perhaps the most crucial validity standard for a test item is whether a student's correct answer is due to his or her level of knowledge or ability, and not due to something else such as chance or test bias. An item that can discriminate between students with high knowledge or ability and those with low knowledge or ability (as measured by the whole test) should be considered an item that "works." Discrimination, in this case, is a good thing.

- Effectiveness of Distractors - In addition to examining the performance of an entire test item, teachers are often interested in examining the performance of individual distractors (incorrect answer options) on multiple-choice items. By calculating the proportion of students who choose each answer option, teachers can identify which distractors are "functioning" and appear attractive to students who do not know the correct answer, and which distractors are simply taking up space and are not chosen by many students. To discourage blind guessing, which results in a correct answer purely by chance (and hurts the validity and reliability of a test item), teachers want as many plausible distractors as is feasible. Analysis of response options allows teachers to fine-tune and improve items they may wish to use again with future classes.

The following are the step-by-step procedures for the calculations involved in item analysis, with data for an example item. For our example, imagine a classroom of 25 students who took a test that included the item below. (The asterisk indicates that B is the correct answer.)

Item Data

| Who wrote The Great Gatsby? | Number of Students Choosing |
| --- | --- |
| A. Faulkner | 4 |
| *B. Fitzgerald | 16 |
| C. Hemingway | 5 |
| D. Steinbeck | 0 |
| Total Number of Students | 25 |

## Calculating an Item Difficulty Index

For our sample item, we can see how difficult it was for students by determining what percentage of the class got it right. Because they are proportions, difficulty indices range from .00 to 1.0.

| Item Analysis Method | Procedures | Calculations |
| --- | --- | --- |
| Difficulty Index - Proportion of students who got the item correct | 1. Count the number of students who got the correct answer. | 16 |
| | 2. Divide by the total number of students who took the test. | 16/25 = .64 |

The difficulty index for our question is .64. We will interpret that number later in this lesson.
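The two-step procedure above amounts to a single division. Here is a minimal sketch in Python, using the example item's counts (16 correct answers out of 25 students):

```python
# Difficulty index: the proportion of students who answered correctly.
# The counts come from the example Great Gatsby item.
num_correct = 16
num_students = 25

difficulty = num_correct / num_students
print(difficulty)  # 0.64
```

The same division works for any item: substitute the count of correct responses and the class size.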
## Calculating an Item Discrimination Index

As mentioned, we can get some basic validity information about whether a given question is measuring what we want by calculating a discrimination index. In other words, did students perform on this item in the same way they did on the whole test? Discrimination indices range from -1.0 to 1.0.

| Item Analysis Method | Procedures | Calculations |
| --- | --- | --- |
| Discrimination Index - A comparison of how overall high scorers on the whole test did on one particular item compared to overall low scorers | 1. Sort your tests by total score and create two groupings of tests: the high scores, made up of the top half of tests, and the low scores, made up of the bottom half of tests. | Imagine this information for our example: 10 out of 13 students (or tests) in the high group and 6 out of 12 students in the low group got the item correct. |
| | 2. For each group, calculate a difficulty index for the item. | High group: 10/13 = .77; Low group: 6/12 = .50 |
| | 3. Subtract the difficulty index for the low-score group from the difficulty index for the high-score group. | .77 - .50 = .27 |

The discrimination index for our Great Gatsby question is .27. Later, we'll see why that number indicates pretty good validity for this item.
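The three steps above can be sketched as a small Python function; the numbers below are the example's high- and low-group counts:

```python
# Discrimination index: the item's difficulty in the high-scoring group
# minus its difficulty in the low-scoring group. In the example,
# 10 of 13 high scorers and 6 of 12 low scorers answered correctly.
def discrimination_index(high_correct, high_total, low_correct, low_total):
    return high_correct / high_total - low_correct / low_total

d = discrimination_index(10, 13, 6, 12)
print(round(d, 2))  # 0.27
```

A positive result means the high group outperformed the low group on the item, which is what we hope to see.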
By the way, the suggestion of cutting the class into two halves
is made for the sake of convenience, as most classrooms do
not have enough students to make it practical to work with
high and low groups that represent a smaller percentage of
students. Statisticians who study measurement issues would
prefer that you divide your class into the top 27% and the
lower 27% (the strangely precise percentages are based on
mathematical assumptions about normal distributions of
performance), so if you have large enough classes that you
can do this and still have big numbers in each group, please
do so!

As we have seen, not everyone got the item correct. In this context it would be interesting to know what other answer options were appealing to students. By calculating the percentage of students who picked each answer option, teachers can see what sorts of errors students are making.

| Item Analysis Method | Procedures | Calculations |
| --- | --- | --- |
| Analysis of Options - An examination of the proportion of students choosing each response option | For each option, divide the number of students who chose that option by the number of students taking the test. | Who wrote The Great Gatsby? A. Faulkner 4/25 = .16; *B. Fitzgerald 16/25 = .64; C. Hemingway 5/25 = .20; D. Steinbeck 0/25 = .00 |

Some students may be getting Faulkner, Fitzgerald, and Hemingway all mixed up.
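The per-option proportions can be computed in one pass. A minimal sketch using the example item's counts:

```python
# Proportion of the class choosing each answer option; the counts are
# the example item's data. Options near .00 are non-functioning distractors.
counts = {
    "A. Faulkner": 4,
    "*B. Fitzgerald": 16,  # asterisk marks the correct answer
    "C. Hemingway": 5,
    "D. Steinbeck": 0,
}
total = sum(counts.values())  # 25 students

for option, n in counts.items():
    print(f"{option}: {n / total:.2f}")
```

Here Steinbeck's 0.00 would flag option D as a distractor no one found plausible.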
## Interpreting the Results of Item Analyses

In our example, the item had a difficulty index of .64. This means that 64% of students knew the answer. Some teachers would consider 64% "success" on a basic fact or classroom objective high; others would consider it low. If a teacher believes that .64 is too low, he or she might decide that the content has not been learned or that objectives are not being met for the whole class. Another interpretation might be that the item was too difficult, confusing, or invalid. The latter interpretation suggests that learning may have occurred, but that the particular item was not designed well enough to measure it.
## Interpreting a Discrimination Index

The discrimination index for the item was .27. The formula for
the discrimination index is such that if more students in the
high-scoring group chose the correct answer than did students
in the low-scoring group, the number will be positive. At a
minimum, then, one would hope for a positive value, as that
would indicate that knowledge resulted in the correct answer.
The greater the positive value (the closer it is to 1.0), the
stronger the relationship between overall test performance and
performance on that item. If the discrimination index is
negative, on the other hand, that means that for some reason
students who scored low on the test were more likely to get
the answer correct. This is a strange situation, which suggests
poor validity for an item.
A reasonable interpretation for a positive discrimination index
is that, to some degree, the item is measuring the same thing
as the rest of the test. Typically, this is what teachers want, but
it is not uncommon for classroom tests to cover many different
areas that might not even be related to each other. Therefore,
low discrimination for one item on one section of a long
multifaceted assessment is not always cause for concern.
Teachers interested in the discrimination of items that work better within a subgroup of items on a broad test can analyze them in that context, as if the smaller group of items dedicated to a single topic or skill were a stand-alone test.
Negative discrimination indices are harder to interpret than positive values. While they happen rarely, they do happen. The best first step is to double-check that there was not an error on the answer key or in the scoring of an item. If everything was scored correctly, it might be a situation where the item fooled the students who knew the most. Despite one myth about teachers, tests are not designed to fool students, so it is a troubling situation when something about a question that appears just fine on its face is for some reason a problem for students. As to why an item is occasionally tricky for your best students, that is a little easier to understand. Typically, teachers choose distractors (incorrect answer options) that only students who have studied most would even recognize as plausible, so they are the only ones who are likely to be attracted to them. It is also possible that an answer the teacher believes is wrong might be seen as correct by students who have really learned the material. Consider that possibility when you are faced with negative discrimination values.
## Interpreting an Analysis of Answer Options

We also conducted answer option analyses for our sample item. The analysis of response options shows that students who missed the item were about equally likely to choose Faulkner or Hemingway, while option D does not act as a distractor at all. Students are not choosing between four answer options on this item; they are really choosing between only three options, as they are not even considering answer D. This makes guessing correctly more likely, which hurts the validity and reliability of an item. Without having talked with the teacher who provided the imaginary item we analyzed in this example, one might guess that the writings of John Steinbeck have not even been discussed in the unit for which the test was developed. If true, Steinbeck could not be expected to act as a plausible distractor.
## Making Decisions Based on the Results of Item Analysis

Teachers can make a variety of classroom decisions based on information derived from item analysis. For example, you could decide to change the way you teach. You could decide that you are teaching just fine, but the test item or the whole test is not assessing accurately. Or you could decide to use information from the test you just gave to change the way you score the test and assign grades.

Instructional decisions can be made based on all three types of item analysis evidence. Recall that in our example, the item had a difficulty index of .64. If a teacher believes that .64 is too low, he or she could decide to improve instruction and try different strategies to better meet the objective represented by the item. In the case of low discrimination indices, teachers could take an instructional approach that seeks to better incorporate each independent fact or concept into a larger whole. You also might consider what it is about the way you teach a topic that confuses mostly those students who seem to have mastered the rest of the material. Finally, with wrong answer options that attract a lot of students, teachers can specifically emphasize the common error or misunderstanding as they teach.
## Making Test Design Decisions

Teachers can also use item analysis information to improve tests or items they wish to use again. Many teachers have item banks of questions they have used in the past and can choose from each time they build assessments. This allows teachers to have new tests for each use (which helps in terms of test security and protecting against cheating), and also helps to produce tests with good measurement characteristics such as validity and reliability.

The fairest tests for all students are tests that are valid and reliable. To improve the quality of tests, item analysis can identify items that (a) are too difficult (or too easy, if a teacher has that concern), (b) are not able to differentiate between those who have learned the content and those who have not, or (c) include distractors that are not plausible. With discrimination information, teachers can try to identify what, if anything, was tricky about an item and change it. When distractors are identified as non-functional, teachers may tinker with the item and create a new distractor. One goal for a valid and reliable classroom test is to decrease the chance that random guessing could result in credit for a correct answer. The greater the number of plausible distractors, the more accurate, valid, and reliable the test typically becomes.
## Making Scoring Decisions

A benefit of item analysis that students appreciate is when teachers re-score a less-than-perfect test in a way that provides more accurate scores. If items are too hard, for example, teachers can decide that the material was not taught and, for the sake of fairness, remove the item from the current test and recompute the scores. The philosophy essentially being espoused here is that "it was my fault, not yours." If items have low or negative discrimination values, on the other hand, teachers can remove them from the current test, recompute scores, and remove the questions from the pool of items for future tests.

Every day, real-world classroom teachers improve their tests immediately by removing items that were too hard or not valid and scoring their students' tests as if those items were not part of the test. A simple way to make these score corrections is to add a point to everyone's test for each item "removed." So, if you identified two "bad" items on a test, you would add two points to everyone's test score, whether they got those items correct or not. This is not quite what would be the best solution statistically, but it is close enough, and students who performed well enough to get the bad items correct continue to get credit for them, which often seems fair to teachers.
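The add-a-point correction is simple enough to sketch in a few lines of Python. The scores below are hypothetical, invented only for illustration:

```python
# Add one point per "removed" bad item to every student's score,
# as described above. Hypothetical raw scores for three students.
def rescore(scores, num_removed_items):
    return [s + num_removed_items for s in scores]

raw_scores = [18, 22, 15]
print(rescore(raw_scores, 2))  # [20, 24, 17]
```

Students who answered a bad item correctly keep the credit they earned, so no one's score goes down under this correction.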
If, as a teacher, you decide to remove items that are too hard,
you must decide what is too hard. There are at least a couple
of approaches to use here. One simple philosophy is that if the
majority of students miss an item, it must be too hard.
According to that thinking, any item with a difficulty index of
.50 or less is too hard.

Another approach is to think of the implications of a test made up entirely of items with a certain level of difficulty. How many students would pass such a test? For example, a common grading scheme is to give D's for 60% of the possible points and to assign an F, or not passing, to scores below 60%. So, a test made up of items with difficulties less than .60 would result in an average grade of F. Under this "what is failure?" approach, items with difficulties of less than .60 must be considered too hard.
## Summary

Item analysis methods provide test information that teachers can use to improve the way they teach, the way they assess, and the way they score tests. With just a bit of time and math, you can measure how hard a test question is, get some basic information on its validity, and identify student misconceptions and common errors. By acting on this information, you can become a better teacher and increase the validity and reliability of your classroom assessments.