
Educational Measurement: Issues and Practice

Fall 2014, Vol. 33, No. 3, pp. 47–54

When Can We Improve Subscores by Making Them Shorter?:


The Case Against Subscores with Overlapping Items

Richard A. Feinberg and Howard Wainer, National Board of Medical Examiners

Subscores can be of diagnostic value for tests that cover multiple underlying traits. Some items
require knowledge or ability that spans more than a single trait. It is thus natural for such items to
be included on more than a single subscore. Subscores only have value if they are reliable enough
to justify conclusions drawn from them and if they contain information about the examinee that is
distinct from what is in the total test score. In this study we show, for a broad range of conditions
of item overlap on subscores, that the value of the subscore is always improved through the
removal of such items.

Keywords: empirical Bayes, overlapping items, ReliaVAR plots, simulation, value added

The impetus that generated this research had two parts; one general that underlies virtually all tests, and the other somewhat more limited, for it only applies directly to those tests, like medical licensing exams, that aspire to test fairly characteristics of the topics of the exam that are truly multidimensional.

The general aspect devolves from a sense of thrift; that it is often wasteful to generate a great deal of information from a test of many items and then to summarize it all in a single number, a single overall score. Certainly examinees who give of their time and energy to complete the test should also be granted some measure of their differential proficiencies. This is especially important for examinees who fail and look to their past performance to guide remediation. In addition, educational institutions from which these examinees have emanated often are interested in their performance as a way of gauging the success of the various parts of their training. The almost universal response to these two needs is a demand for testing organizations to provide subscores, each of which helps to evaluate the examinees' success on a specific aspect of the topics covered by the exam.

The more limited aspect of this research is the part that deals with truly multidimensional characteristics. So, for example, in a test that licenses veterinarians there might be a set of items on the pulmonary system, another on the cardiovascular system, a third on infectious diseases, and on and on. We suspect that such multidimensionality exists in virtually all exams, but perhaps it is most pronounced on extensive multiday licensing exams that must cover broad ranges of topics. On tests of this sort it is natural to compute subscores separately for each of these component parts, and indeed that is precisely what is usually done.

But now we come to an interesting conundrum that is the source of the research described here. Often a single test item must span more than a single topic. A cardiac item is usually placed within a context; it might be a dog's heart, or a cat's, or a horse's. Thus, that same item might logically find itself on both the heart subscore as well as on the dog subscore. This natural response to such multidimensionality generates the underlying question we wish to explore: Under what circumstances is it sensible for a test to have separate subscores that share some items?

Before we address this issue it is important that we review some basic ideas about subscores, especially the important work recently published by Shelby Haberman and Sandip Sinharay.

What Can Compromise the Value of a Subscore?

A subscore represents performance on a short test, and so its value is determined by precisely the same characteristics that give value to any test score. Specifically, it must provide evidence to support a claim that the user of the score wants to make about the examinee. To do this it must contain information of relevance to that claim (validity) and it must provide enough evidence to be of use (reliability). In addition, to be of use, a subscore must provide information that is not available from the rest of the test. For short, let us call these two issues subscore reliability and subscore orthogonality and discuss each in turn.

Subscore Reliability

For a subscore to be of value it must be stable enough so that the user is not just chasing noise. This is especially pertinent when the user is an examinee whose goal is to try to remediate perceived deficiencies revealed by the subscore. Unfortunately, because testing time is often sharply limited by practical constraints, subscores are typically made up of only a few items, and because reliability is a function of length the reliability of a subscore may be too low to be stable enough for many such uses.

Richard A. Feinberg and Howard Wainer, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104; rfeinberg@nbme.org

© 2014 by the National Council on Measurement in Education
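Because reliability is a function of test length, the expected effect of lengthening or shortening a subscore can be approximated with the standard Spearman-Brown prophecy formula (used later in this article as a benchmark). A minimal sketch; the numbers are illustrative, not from the study:

```python
def spearman_brown(rho: float, k: float) -> float:
    """Prophesied reliability when a test of reliability rho
    is lengthened by a factor of k (Spearman-Brown prophecy formula)."""
    return (k * rho) / (1 + (k - 1) * rho)

# An illustrative 10-item subscore with reliability .60, doubled to 20 items:
print(round(spearman_brown(0.60, 2), 4))  # -> 0.75
```

The same formula works with k < 1, giving the reliability lost when a subscore is shortened.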
Historically there have been but two approaches taken to ease this problem. The first one was for test developers to increase the overall length of the test so that the subtests that were subsequently formed could be long enough and hence stable enough for the purposes envisioned for them. This approach fits with the current best practice for building tests, evidence-centered design (ECD), which encompasses content specification, item development, and test design. ECD considers a test as a collection of evidence for making inferences on what examinees know or can do (Mislevy, Steinberg, & Almond, 2003). ECD defines the connection between content domain and subscores, and, most importantly, whether or not the test was built on an assumption that subscores will be reported. Thus, subscores that are worth reporting should be planned for when the test is being designed.

Although this solves the problem, the practical constraints limiting test length that were alluded to earlier have not provided this option with widespread appeal. More recently, a second approach was proposed: to stabilize each subscore statistically by borrowing strength from other items on the exam that, to some extent, test the same construct. This is done using empirical Bayes methods in which the raw subscore is regressed toward the total test score (Skorupski & Carvajal, 2010; Wainer, Sheehan, & Wang, 2000). However, the way that an empirical Bayes estimation approach boosts subscore reliability does not come without a cost. What we would expect to see (and what is, indeed, often observed, e.g., Thissen & Wainer, 2001, chap. 9) is that the less reliable the raw subscore, the less distinguishable the augmented subscore becomes from the total score. Thus, the raw subscores must first start out with ample orthogonality or the empirical Bayes approach will be of little use for an insufficiently multidimensional test (Sinharay, Haberman, & Wainer, 2011). This provides us with a natural segue to the second issue.

Subscore Orthogonality

The value of a subscore to add new information is directly related to the extent that it is orthogonal to the rest of the test. The smaller the correlation between the subscore and the rest of the test, the greater the likelihood that the subscore is telling us something new (divergent evidence for construct validity). Thus, improving the reliability of a subscore by borrowing strength from the rest of the test may or may not improve the usefulness of the subscore. Everything depends on the delicate trade-off between increasing reliability at the cost of decreasing orthogonality.

Measuring the Added Value of a Subscore

Recent seminal work has provided insight on how to measure the added value of a subscore. Haberman (2008) presented a measure of value added and Sinharay (2010) provided results from an extensive investigation of Haberman's measure. Haberman's methodology calculates value added as the proportional reduction in mean squared error (PRMSE); if the reduction in error of the estimate of the examinees' true scores on that subtest from their observed subscores is larger than that estimated from the total score, then the observed subscore is a better approximation of its construct, the true subscore, than the total score. Thus, the subscore method with the highest PRMSE is the most desirable. Although we will describe how to calculate PRMSE in a later section, we refer the reader to Haberman (2008) for a more general derivation as well as the computational details. But the take-away message is that by using Haberman's PRMSE statistic we can determine whether any proposed subscore adds value to users of the test. In this study, we use the PRMSE as the dependent variable in an extensive simulation to examine when the practice of constructing subscores in which some of their components overlap on other subscores is likely to be fruitful.

For example, let us imagine two tests, both including a subscore that measures the same thing, such as knowledge of the cardiovascular system. The correlation between the two provides a measure of the value of a subscore to predict future performance on the same topic. The squared correlation between both cardiovascular subscores would provide a direct measure of the extent to which the mean squared error is reduced on one subscore by knowing the other. Experience with augmented subscores, akin to those that use the empirical Bayes shrinkage referenced earlier, has taught us that when subscores are not particularly reliable we can do a better job predicting future behavior from the total test score (or something much closer to it) than from the subscore alone (Thissen & Wainer, 2001).

Given that the predictive value of a subscore depends on both its reliability and its orthogonality to the rest of the test, as the subscore becomes less reliable its value decreases, and as it correlates more highly with the total test score its predictive value decreases. Augmenting a subscore with items from the rest of the test improves matters when the increased reliability that it offers can offset the inevitable increase in its relationship with the rest of the test.

Sinharay's (2010) futile search among a substantial number of tests for subscores that add value indicates how unidimensional most modern tests are. Yet there are, particularly among licensing tests, a fair number that seem to be truly multidimensional. Because, without multidimensionality, the construction of subscores is not sensible, it is among these tests that we are most likely to find useful subscores. As it turns out, it is also among such tests that we are likely to find items that appear on more than a single subscore.

How widespread is the practice of using the same item on more than one subscore? Arguably, the most important such test is the United States Medical Licensing Examination (USMLE), which is the shibboleth for all practicing physicians in the United States and annually is taken by more than 100,000 examinees. As Luecht (1996) reported, there are eight "core" disciplines in which a separate subscore is reported: physiology, biochemistry, pathology, microbiology, pharmacology, behavioral science, gross anatomy, and histology. He then calculated that 27% of the items (662 items from a bank of 2,458) were represented on more than just one subscore, some on as many as three.

There are a substantial number of other tests that construct subscores that share items. Among them:

1. State achievement tests like NJAsk (2012) in which four mathematics subscores are computed without overlapping items and then a fifth subscore, problem solving, is made up of components from the other four (R. Knab, personal communication, September 25, 2013).
2. Large-scale achievement tests like the Stanford Achievement Test (SAT9) that double-code open-ended items for both the content and process skills in each
domain (science, mathematics, etc.) (M. J. Zieky, personal communication, October 26, 2013).
3. Other licensing tests like (a) the NAVLE (North American Veterinary Licensing Examination®, 2013) mentioned earlier, which classifies items both by system (renal, cardiac, etc.) and species (dogs, cats, birds, etc.), and (b) the ABPTS-ORTH (American Board of Physical Therapy Specialists Orthopedics Certification Examination, 2013), whose items are classified by knowledge area and body part.

It is likely that including multidimensional items on more than one subscore was done because it seemed natural. The reality expressed by the already existing test structure suggested an attractive side benefit of using the same items on multiple subscores—an increase in the reliability of these subscores without requiring the addition of more items. Since the items were already there and were constructed specifically to be suitable for such a use, what would be the harm? Or, more specifically, what are the limits of such an approach?

Simulation Design

The simulation was built to mirror the situation we often find in practice. We chose six levels of subtest length that were uniquely attached to a specific subscore; we designate these as the core items. Then we augmented the core items with 11 levels of items from a different subtest that overlap to some extent with the topic of the core items. The number of such overlapping items ranged from 0% to 100% of the number of core items. Thus, a subscore with 10 core items could increase in length to include as many as 20 items, half of which were overlapping with another subtest. This process was repeated for varying degrees of correlation between the underlying constructs measured by the subtests and multidimensionality of the overlapping items. Summarizing, the simulation study has the following four subtest conditions:

• Core (unique) items (5, 10, 20, 30, 40, and 50 items);
• Overlapping (common) items (11 levels ranging from 0% to 100% of the number of core items, evenly spaced);
• True (disattenuated) correlation between subtests (r = .0, .3, .7, and .9);
• Within-item multidimensionality (complex, semicomplex, and moderate structure).

In his extensive simulation, Sinharay (2010) examined tests composed of 2, 3, and 4 subscores and found that the number of subscores did not affect the likelihood for the subscores to add value. To reduce the current study to its most essential features, we modeled a test with only two subtests, denoted A and B, which simplifies without losing much generality. Subtest A is the subtest of interest and subtest B contains items from which subtest A can borrow.

Let us return to the earlier example, the veterinarians' licensing test, which includes items from two broad domains, animal species and organ systems. Items are written to match the real-world structure of the profession. Thus, items focused on an organ system will still be placed in the context of an animal species, and vice versa. Let's consider subtest A to be the items focused on dogs and subtest B to contain cardiac-focused items. The Board using the exam scores is interested in both kinds of subscores: those that measure knowledge of each animal species and those that measure knowledge of each organ system. However, due to limitations in test development resources and available testing time, it was not practical to build, for example, a dog subtest with enough items to yield a score that is reliable enough for its intended use. Thus, the testing company decided to see if augmenting the subscore on subtest A (dog items) with the cardiac-focused items involving a dog patient, which we shall denote subtest B, would help enough to make a useful animal-specific subscore.

Let us construct subtest B so that there are always twice as many items in it as on subtest A. Thus, as we described previously, the number of core items on subtest A were set at 5, 10, 20, 30, 40, and 50 while those for subtest B would be 10, 20, 40, 60, 80, and 100, respectively.

As an illustration note that, under the 10-item subtest A condition, the total test would consist of 30 items. If there was 50% overlap, subtest A would contain the 10 core items and five items overlapping with subtest B that both subtests had in common. Thus, subtest A now has 15 items: the 10 core items and 5 that it shares with subtest B. Table 1 shows the total number of subtest A items for each condition.

It should be clear that we have defined the total test length as the items in subtest A plus those in subtest B. The extra items on subtest B (e.g., cardiac items for other animal species) mean that even after borrowing some items from B to augment A there are enough items left over to compute what Haberman calls a remainder score for the PRMSE_x computation of the true subscore predicted by the total score:

$$\mathrm{PRMSE}_x = \frac{\left(\alpha_x \,\sigma_{s_t}^2 + r_{(s,d)}\,\sigma_s\,\sigma_d\right)^2}{\sigma_{s_t}^2\,\sigma_{x_t}^2},$$

where $\alpha_x$ is the total test reliability, $r_{(s,d)}$ is the raw correlation between the subscore and remainder score, $\sigma_s$ is the subscore standard deviation, $\sigma_d$ is the remainder score standard deviation, $\sigma_{s_t}^2$ is the true subscore variance, and $\sigma_{x_t}^2$ is the true total score variance.

The third factor varied in our experiment, in addition to subtest length and overlap, is the extent of orthogonality (multidimensionality) between the constructs the subscores are intended to measure. Two examinee abilities corresponding to each subtest were drawn for each of 1,000 simulees from a bivariate normal distribution with equal means, standard deviations set to 1, and correlation set to .0, .3, .7, or .9. In practice, the true correlation between subtests is likely to vary between .7 and .9 (Sinharay, 2010). We included .0 and .3, though unrealistic, to explore the extent to which orthogonality interacts with the other simulation conditions in the context of a truly multidimensional test.

Lastly, multidimensionality of the data was varied to manipulate the extent to which the overlapping items contribute to boosting reliability. For both subtests A and B, item parameters were generated in the same way that Finch (2011) described. Item difficulty and discrimination parameters were drawn based on parameters estimated from a large-scale reading examination. Difficulty parameters were drawn from a standard normal distribution, N(0, 1). The discrimination parameters for core items were drawn from N(1.0, .3) with .7 as the minimum and 2.0 as the maximum value.
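Haberman's PRMSE_x formula translates directly into code. A sketch with invented summary statistics (none of these values come from the study):

```python
def prmse_x(alpha_x, r_sd, sd_s, sd_d, var_s_true, var_x_true):
    """PRMSE of the true subscore predicted by the observed total score,
    per the formula in the text: alpha_x = total-test reliability,
    r_sd = raw subscore-remainder correlation, sd_s / sd_d = subscore
    and remainder SDs, var_s_true / var_x_true = true subscore and
    true total-score variances."""
    numerator = (alpha_x * var_s_true + r_sd * sd_s * sd_d) ** 2
    return numerator / (var_s_true * var_x_true)

# Hypothetical values: a reliable total test, moderately correlated parts.
print(round(prmse_x(0.90, 0.60, 2.0, 4.0, 3.5, 30.0), 3))  # -> 0.602
```

Comparing this quantity to the subscore's own reliability is what drives the value added ratio used later in the article.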
Table 1. Total Number of Subtest A Items, Core Plus Common, by Condition

                       Number of Core (Unique) Subtest A Items
Percent Overlapping      5    10    20    30    40    50
  0%                     5    10    20    30    40    50
 10%                     –    11    22    33    44    55
 20%                     6    12    24    36    48    60
 30%                     –    13    26    39    52    65
 40%                     7    14    28    42    56    70
 50%                     –    15    30    45    60    75
 60%                     8    16    32    48    64    80
 70%                     –    17    34    51    68    85
 80%                     9    18    36    54    72    90
 90%                     –    19    38    57    76    95
100%                    10    20    40    60    80   100
Total Test Items        15    30    60    90   120   150
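The cell values in Table 1 follow mechanically from the design: subtest A holds its core items plus the stated percentage of overlapping items, a cell is run only when that count is a whole number, and the total test is always three times the core count (subtest A's core items plus twice as many items on subtest B). A sketch that regenerates the table:

```python
def subtest_a_size(core: int, pct: int):
    """Items on subtest A: core items plus pct% of that count borrowed
    from subtest B; None marks cells not run (fractional item counts)."""
    if (core * pct) % 100:
        return None
    return core + (core * pct) // 100

# One column per core-item level, rows at 0%, 10%, ..., 100% overlap.
columns = {c: [subtest_a_size(c, p) for p in range(0, 101, 10)]
           for c in (5, 10, 20, 30, 40, 50)}

print(columns[10])  # -> [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```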

Discrimination parameters for overlapping items were generated based on either a complex, semicomplex, or moderate structure:

• Complex structure: Two item discrimination parameters were drawn from the same distribution described above for the core items. Thus, the common items were generated to load similarly on either subtest.
• Semicomplex structure: One item discrimination parameter was drawn from the same distribution described above for the core items and a second parameter was drawn from N(.35, .15) with .1 as the minimum and .6 as the maximum value. Thus, the common items had a strong loading on the primary trait, subtest A, and a smaller discrimination of approximately half the size on subtest B.
• Moderate structure: Two item discrimination parameters were drawn from the smaller discrimination distribution. Thus, the common items were generated to have similar weak loadings on either subtest.

Thus, this simulation generates core item subtests and then augments them with overlapping items of varying multidimensionality. Subtests built with complex structure items, as defined above, represent the situation in which the test is truly multidimensional. On the other extreme, subtests built with moderate structure items represent a more common multidimensional test that is more likely to be found in practice (Feinberg, 2012).

Simulated data was scored with the same two-dimensional MIRT model (Reckase, 2007) used by Sinharay (2010):

$$P(U_{ij} = 1 \mid \theta_j, \mathbf{a}_i, b_i) = \frac{1}{1 + e^{-(a_{1i}\theta_1 + a_{2i}\theta_2 - b_i)}},$$

where the probability of a correct response to item $i$ for examinee $j$ is a function of item difficulty $b$, item discrimination vector $\mathbf{a} = (a_1, a_2)$, and ability vector $\boldsymbol{\theta} = (\theta_1, \theta_2)$, where $a_1$ and $\theta_1$ correspond to item discrimination and ability on subtest 1 items and $a_2$ and $\theta_2$ correspond to item discrimination and ability on subtest 2 items. For simulation conditions in which there are no common items (percent overlap = 0), the corresponding $a$ parameter is set to 0. Therefore, under conditions where subtests are composed of only core items, the model is simplified to include only one $a$ and one $\theta$ parameter corresponding to that particular subtest:

$$P(U_j = 1 \mid \theta_j, a_1, b) = \frac{1}{1 + e^{-(a_1\theta_1 - b)}},$$

where the probability of a correct response for examinee $j$ is a function of item difficulty $b$ and item discrimination $a_1$ and ability $\theta_1$ corresponding to subtest 1. This design allows the model to score all items regardless of simple or complex structure.

In total, the simulation consists of 732 subtest conditions (6 levels of subtest length × 11 levels of overlap (only 6 levels for the five-item subtest condition) × 4 levels of true subtest correlation × 3 levels of item structure). For a given data set of simulated item responses, the item-level scores for subtest A items plus the overlapping items were summed to equal subscore A. Similarly, the item-level scores for subtest B items minus the overlapping items equaled the remainder score used for the PRMSE_x calculation. An examinee's total test score was the sum of all their item-level scores.

The value of a subscore was determined by the value added ratio, a ratio of the observed subscore and total score PRMSE values. As described by Sinharay et al. (2011), the PRMSE of the true subscore predicted by the observed subscore is mathematically equivalent to the subscore's reliability. Thus, the value added ratio can be represented by

$$\text{Value Added Ratio} = \frac{\text{Subscore Reliability}}{\mathrm{PRMSE}_x}.$$

If the expected value (across the simulations) of this value-added ratio for a particular subscore is less than 1, then the subscore is not adding value and hence not worth reporting. A ratio greater than 1 indicates that the subscore is adding value; the larger the ratio, the greater the value added. A parallel version of this simulation was run with data generated under the Rasch model with similar results.
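The two-dimensional scoring model above is straightforward to sketch; the item parameters here are invented, chosen only to show the unidimensional reduction when one discrimination is zero:

```python
import math

def p_correct(theta, a, b):
    """Compensatory two-dimensional MIRT probability used above:
    P(U = 1) = 1 / (1 + exp(-(a1*theta1 + a2*theta2 - b)))."""
    z = a[0] * theta[0] + a[1] * theta[1] - b
    return 1.0 / (1.0 + math.exp(-z))

# A core item loads on subtest 1 only (a2 = 0), so the second ability
# drops out and the model reduces to the unidimensional form above.
core_item = ((1.0, 0.0), 0.0)             # ((a1, a2), b) -- invented values
print(p_correct((0.0, 1.3), *core_item))  # -> 0.5: theta2 is ignored
```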
Simulation Results

The simulation was designed as a four-way factorial experiment and was analyzed as such; we did two different four-way analyses of variance with subtest length, amount of overlap, the extent of the dependence between the subtests, and the multidimensionality of the overlapping items as the independent variables. In Table 2 the value added ratio serves as the dependent variable and in Table 3 the dependent variable was reliability.

Table 2. ANOVA Results With Value Added Ratio (VAR) as the Dependent Variable

Effect                  df       Mean Square       F        ηp²
Subtest Length (L)          5        70         2,803       .16
Correlation (C)             3     4,231       168,849       .88
Overlap (O)                10       955        38,102       .84
Structure (S)               2       118         4,722       .12
L×C                        15         1            39       .01
L×O                        45         2            88       .05
L×S                        10         1            60       .01
C×O                        30       440        17,559       .88
C×S                         6        72         2,855       .19
O×S                        20         3           137       .04
L×C×O                     135        .1             5       .01
L×C×S                      30        .1             3       .00
L×O×S                      90        .1             2       .00
C×O×S                      60         2            63       .05
L×C×O×S                   270       .03             1       .00
Error                  72,468       .03

Table 3. ANOVA Results With Reliability as the Dependent Variable

Effect                  df       Mean Square       F        ηp²
Subtest Length (L)          5       172       494,797       .97
Correlation (C)             3         6        18,223       .43
Overlap (O)                10         4        11,635       .62
Structure (S)               2        27        75,893       .68
L×C                        15        .3           857       .15
L×O                        45        .2           609       .27
L×S                        10         1         3,194       .31
C×O                        30        .1           314       .12
C×S                         6        .2           535       .04
O×S                        20        .7         1,940       .35
L×C×O                     135       .01            21       .04
L×C×S                      30       .01            34       .01
L×O×S                      90       .04           121       .13
C×O×S                      60       .01            19       .02
L×C×O×S                   270       .00             2       .01
Error                  72,468       .00

The main effects dominated the interactions in size and all of the main effects were statistically significant. Typically, one would give the reader a clear idea of what was obtained from the experiment by plotting the main effects. But, in this case, when the size of the main effects so dominates any interaction, we can choose specific subtest conditions of the experiment and still convey an accurate idea of the total results. We opted for this approach because we felt that showing a specific case (e.g., true correlation between subtests set at .7 over different levels of overlap) provides a closer connection to reality than the more abstract plot of effect sizes. In the figures that follow we present a selection of some of the simulation results. These results are typical of all results obtained. Each plotted point represents the mean over 100 replications.

In Figure 1 we can see that the Spearman-Brown prediction (represented by a dashed line) of increasing reliability with increases in subtest length is upheld. Subtests augmented with moderate structure items had the weakest reliability gain, likely due to the overlapping items loading poorly on the primary trait. Although the overlapping items added to the subscore's reliability, they did not contribute as much as the original items (reliability estimates from the Spearman-Brown Prophecy formula, represented by the dashed lines, were greater than what actually happened). This is predicted by Mosier's (1943) measure of the reliability of a weighted composite: that the reliability of a composite (core items plus overlap items) cannot be greater than a proportional increase of items from the same construct (core items plus more core items).

Interestingly, subtests augmented with complex or semicomplex structure items outpaced the Spearman-Brown estimates; estimated reliability was higher than that expected from a subtest of similar length composed of only core items. This at first surprising outcome results from the boosting effects of the relative heterogeneity of the overlapping items; the variance of a subscore that includes common items is greater than that of a subscore of comparable length composed of only core items. But, as seen in Figure 2, there is a price to be paid; even with greater than expected gain in reliability, value still rapidly deteriorates as overlap increases.

Figure 2 shows the impact on value added of increasing overlap. Subtests augmented with complex structure items had the least reduction in value. Interestingly, in conditions in which there was an initial steep drop in value (e.g., five core items with semicomplex structure), value increased with higher levels of overlap. This result does not provide evidence that the subscores improved, but rather that the subscore and remainder of the test are becoming more similar—the upturn is caused by the reduction in orthogonality, which is experienced least in the moderate structure condition.

How does the amount of dependence between the two subscores affect reliability and value added? As the correlation between the subscores increases so too does the subtest's reliability, as predicted by true score theory. Also, as expected, as the correlation between the subscores increases, value added decreases since the subscores become less distinct from the rest of the test. Additionally, regardless of the level of dependence, any increase in overlap results in a loss of value. The reduction in value becomes steeper as the subtests become more orthogonal and thus have greater potential to be worth reporting.

The simulation tells us that the amount that overlapping items add to the subscore's reliability is more than counterbalanced by the amount that they subtract from the subscores' orthogonality, hence shrinking their value added. Subscores composed purely of core items leave us better off than those supplemented with common items, despite the latter's higher reliability. We shall examine this in greater detail in the next section when we introduce the ReliaVAR plot.

The ReliaVAR Plot—A New Tool for Studying the Effect of Including Overlapping Items in Subscores

Figures 1 and 2 show separately the effects of adding overlapping items to subscores: first reliability and then the value-added ratio. But what we really care about is how the two interact so that we can make sensible design decisions about the make-up of the subscores. One way to do this is to plot the reliability on one axis and the value added ratio on the other—the ReliaVAR plot. When we do this we can see the trade-offs at a glance. In Figure 3 is a ReliaVAR plot depicting the simulation results: the reliability and value added ratio across varying levels of overlap and within-item multidimensionality for a 10-core-item subtest. Each core item curve traces out the changes as the percent overlap increases from zero (the top node) to 100% (the bottom node). The amount of overlap is coded by the node's size and shade across the individual curves.
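Reading a ReliaVAR curve amounts to tracking (reliability, value added ratio) pairs as overlap grows and noting where the ratio falls below 1. A sketch with invented points (not taken from Figure 3):

```python
# (percent overlap, reliability, value added ratio) along one
# hypothetical curve -- illustrative numbers only, not study results.
curve = [(0, 0.78, 1.12), (20, 0.80, 1.07), (40, 0.81, 1.03),
         (60, 0.82, 0.99), (80, 0.83, 0.96), (100, 0.83, 0.94)]

# Overlap levels at which the subscore would still be worth reporting.
adds_value = [pct for pct, _rel, var in curve if var > 1.0]
print(adds_value)  # -> [0, 20, 40]: small reliability gains, steady VAR loss
```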
[Figure 1: three panels (Complex Structure, Semi-Complex Structure, Moderate Structure), each plotting Reliability (0.5–1.0) against Percent Overlapping Items (0–100).]

FIGURE 1. Effect on subscore reliability by adding overlapping items of varying within-item multidimensionality, with true correlations between subtests set to .7. Dashed lines represent reliability estimates from the Spearman-Brown Prophecy formula.

In all cases, the addition of overlapping items has only a tiny effect on reliability relative to the dramatic decline in the value added of that subscore. Thus, there is loss of value when including the overlapping items, because more is subtracted from the subscore's orthogonality than can be counterbalanced by the increase in reliability. Augmenting a subtest with items that have moderate structure barely boosts reliability. Augmenting a subtest with items that have complex structure, when overlapping items have the most potential to add value, does cause a greater than expected boost to reliability, yet still a greater loss in orthogonality.

Conclusion and Discussion

The findings from the simulation illustrate a situation in which the usefulness of subscores can be increased by making them shorter. Augmenting subscores to include items common to other subtests can increase their reliability, but this improvement is a snare and a delusion, for the subtest overlap will, in the end, decrease the value added of that subscore. Within the conditions of our research we can almost always boost the value added of a subscore by removing the overlapping items, despite the lower reliability of the shorter subscores. The only time this is not true is when the core subscore is too unreliable to be worth reporting in the first place (e.g., n = 5). In this case, adding more items makes it better, but never good. This result reinforces the finding of increased reliability obtained when subscores are augmented with empirical Bayes shrinkage: that the increased reliability of such augmentation does not yield parallel increases in usefulness.

Creating subscores of value should not occur post-hoc with subtest augmentation, but rather during the initial stages of test design. Unfortunately, it is not always possible to build subtests with a sufficient number of items. The total number of items on a test is often limited by test development costs to create items and costs to administer the exam, such as more seat time at a testing center, making it unlikely that enough items could be devoted to each subtest. Certainly, a trade-off exists between the number of items included in a subtest to achieve acceptable psychometric properties for reporting against the costs associated with longer tests and increased testing time.

Our results were derived by modeling the structure of achievement test data and therefore the findings may vary slightly under different testing contexts. Not all subtests based on 20 items and no overlap will have sufficient reliability and value to be worth reporting. And sometimes, albeit rarely, subscores based on as few as 10 items may be worth reporting (e.g., the Multidimensional Anxiety Questionnaire

© 2014 by the National Council on Measurement in Education
[Figure 2: three panels (Complex Structure, Semi-Complex Structure, Moderate Structure), each plotting Value Added Ratio (0.8–1.3) against Percent Overlapping Items (0–100) for subtests with 5 core items.]

FIGURE 2. Effect on subscore value added by adding overlapping items of varying within item multidimensionality, with true correlations between subtests set to .7.
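The value added ratio plotted in the figures follows Haberman's (2008) criterion: a subscore is worth reporting only if it predicts its own true score better than the total score does. A hedged classical-test-theory sketch of that comparison (our simplification and variable names, not the authors' exact computation; `rho_true` denotes the disattenuated subscore-total correlation):

```python
def value_added_ratio(rel_sub: float, rel_total: float, rho_true: float) -> float:
    """PRMSE-style comparison after Haberman (2008): the subscore's
    reliability versus the squared correlation between the observed
    total score and the subscore's true score. Ratios above 1 mean
    the subscore carries information the total score does not."""
    prmse_sub = rel_sub                        # observed subscore predicting its own true score
    prmse_total = (rho_true ** 2) * rel_total  # observed total predicting the subscore's true score
    return prmse_sub / prmse_total

# A .75-reliable subscore, a .92-reliable total, true correlation .70:
print(value_added_ratio(0.75, 0.92, 0.70) > 1)  # the subscore adds value here
```

As the true correlation between the subscore and the total approaches 1, the ratio falls below 1 and the subscore becomes redundant, which is exactly the mechanism by which overlapping items erode value in the simulations.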

[Figure 3: a ReliaVAR plot of Value Added Ratio (0.90–1.10) against Reliability (.60–.90), with curves for Complex, Semi-Complex, and Moderate structure and points marked at 0%, 20%, 40%, 60%, and 100% overlap.]

FIGURE 3. A summary of the simulation results in a ReliaVAR plot, for subtests with 10 core items and true correlations between subtests set to .7, showing that for all situations studied the gains in subscore reliability by including overlapping items is more than offset by the loss of orthogonality, so that the value such a subscore adds to the testing situation is always diminished.

has four subscales that range from 9 to 12 items each with high reliability and orthogonality; Wall, 2003).

We would be remiss if we did not emphasize that the character of the examinee population is critical. Magical claims that there are inferences from some test scoring models that are somehow "sample-free" are an

illusion, item parameters and subject abilities are always estimated relative to a population, even if this fact may be obscured by the mathematical properties of the models used. Hence, I view as unattainable the goal of many psychometric researchers to remove the effect of the examinee population from the analysis of test data. (Holland, 1990)

Substantial homogeneity of the examinee population is especially evident in licensure or certification tests even though they typically cover broad and diverse content. This homogeneity grows from the stringent requirements examinees must satisfy to be allowed to sit for the exam (e.g., Raymond & Luecht, 2013). In such situations the items are less discriminating and thus subtests must be composed of more items to yield the same reliability that would have occurred with a more diverse examinee sample. The usefulness of a subscore thus depends on both the quality of the items and the characteristics of the examinee population. The extent to which this is true can be investigated by expanding the simulations

we described here and allowing the variance of the ability distribution of the examinees to vary.

The overlapping items may still be meaningful as parts of the overall test score, even though including them in a subscore reduces the subscore's diagnostic value. Items may be written expressly to cover multiple content areas when the intention is to generate evidence supporting claims about a multivariate context or possibly when two factors are difficult to separate.

Acknowledgments

This work is collaborative in all respects and the order of authorship is alphabetical. We would also like to express our sincere gratitude to Derek Briggs, Brian Clauser, Polina Harik, Mike Jodoin, Sandip Sinharay, Billy Skorupski, and three anonymous reviewers whose insightful comments resulted in a vastly improved paper.

References

American Board of Physical Therapy Specialists (ABPTS). (2013). Information booklet. Retrieved August 5, 2013, from http://www.abpts.org/Certification/Orthopaedics/

Feinberg, R. A. (2012). A simulation study of the situations in which reporting subscores can add value to licensure examinations. Ph.D. dissertation, University of Delaware. Retrieved October 31, 2012, from ProQuest Digital Dissertations database (Publication No. 3526412).

Finch, H. (2011). Multidimensional item response theory parameter estimation with nonsimple structure items. Applied Psychological Measurement, 35(1), 67–82.

Haberman, S. (2008). When can subscores have value? Journal of Educational and Behavioral Statistics, 33, 204–229.

Holland, P. W. (1990). On the sampling theory foundations of item response theory models. Psychometrika, 55, 577–601.

Luecht, R. M. (1996). Multidimensional computerized adaptive testing in a certification or licensure context. Applied Psychological Measurement, 20, 389–404.

Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–67.

Mosier, C. I. (1943). On the reliability of a weighted composite. Psychometrika, 8, 161–168.

North American Veterinary Licensing Examination (NAVLE®). (2013). NAVLE® Candidate Bulletin. Retrieved August 5, 2013, from http://www.nbvme.org/navle-general-information/candidate-bulletin/

Raymond, M. R., & Luecht, R. L. (2013). Licensure and certification testing. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology (Vol. 3, pp. 391–414). Washington, DC: American Psychological Association.

Reckase, M. D. (2007). Multidimensional item response theory. In C. R. Rao & S. Sinharay (Eds.), Handbook of statistics (Vol. 26, pp. 607–642). Amsterdam, The Netherlands: North-Holland.

Sinharay, S. (2010). How often do subscores have added value? Results from operational and simulated data. Journal of Educational Measurement, 47, 150–174.

Sinharay, S., Haberman, S. J., & Wainer, H. (2011). Do adjusted subscores lack validity? Don't blame the messenger. Educational and Psychological Measurement, 71, 789–797.

Skorupski, W. P., & Carvajal, J. (2010). A comparison of approaches for improving the reliability of objective level scores. Educational and Psychological Measurement, 70, 357–375.

Thissen, D., & Wainer, H. (2001). Test scoring. Hillsdale, NJ: Lawrence Erlbaum.

Wainer, H., Sheehan, K., & Wang, X. (2000). Some paths toward making Praxis scores more useful. Journal of Educational Measurement, 37, 113–140.

Wall, R. E. (2003). Review of the Multidimensional Anxiety Questionnaire. In B. S. Plake, J. C. Impara, & R. A. Spies (Eds.), The fifteenth mental measurements yearbook (pp. 601–603). Lincoln, NE: Buros Institute of Mental Measurements.

