Comparisons of Five Health Status Instruments for Orthopedic Evaluation

Author(s): Matthew H. Liang, Anne H. Fossel and Martin G. Larson

Source: Medical Care, Vol. 28, No. 7 (Jul., 1990), pp. 632-642
Published by: Lippincott Williams & Wilkins

This study represents a long-term effort to find optimal techniques for evalu-
ating outcome in patients who have undergone total joint arthroplasty. Sensi-
tivity of five health status questionnaires was studied in a longitudinal evalua-
tion of orthopedic surgery. The questionnaires (Arthritis Impact Measurement
Scales [AIMS], Functional Status Index [FSI], Health Assessment Question-
naire [HAQ], Index of Well Being [IWB], and Sickness Impact Profile [SIP])
were administered to 38 patients with end-stage arthritis at three points in
time: two weeks before hip or knee arthroplasty, and at three-month and 12- to
15-month follow-up. Response values (i.e., changes within patients) were cal-
culated on four scales: global health, pain, mobility, and social function. By the
three-month follow-up, most instruments detected large mean responses in
global health, pain scores, and mobility. Smaller changes on these scales were
found between three and 12 to 15 months. Social function showed small to
modest gains at successive follow-ups. Standardized response means were cal-
culated to assess sensitivity to detect change. Confidence intervals for these
indices were constructed using a jackknife procedure, and significance tests
were performed by pairing selected indices. Finally, the study projected sample
sizes required to assess a new therapy, using each response. These statistical
tools facilitated comparisons among instruments and may prove useful in other
settings. Key words: health status assessment; orthopedic evaluation; Arthritis
Impact Measurement Scale; Functional Status Index; Health Assessment Ques-
tionnaire; Index of Well Being; Sickness Impact Profile; arthritis, evaluation of.
(Med Care 1990; 28:632-642)

* From the Departments of Medicine and Rheumatology/Immunology, Harvard Medical
School, and from the Department of Health Policy and Management, Harvard School
of Public Health, Boston, Massachusetts.
† From the Robert B. Brigham Multipurpose Arthritis Center, Brigham and Women's
Hospital, Boston, Massachusetts.
‡ From the Department of Biostatistics, Harvard School of Public Health, Boston,
Massachusetts.
Supported by Multipurpose Arthritis Center NIH Grant no. AR36308 and by
Biomedical Research Support Grant RR-05950.
Address correspondence to: Matthew H. Liang, MD, MPH, Brigham and Women's
Hospital, 75 Francis St., Boston, MA 02115.

E. A. Codman first emphasized the importance of evaluating the "end result" of
surgical interventions.1 Despite this exhortation, the evaluation of many
surgical procedures has been based on short-term follow-up and uncontrolled
studies. The evaluation of total joint arthroplasty (TJA) is no exception.2-4
Although TJA is a remarkably successful intervention, both short-term and
long-term failures do occur. These include complications such as infection,
dislocation, long-term biomaterial failure, and failure to achieve the expected
benefit for the patient.5,6 Outcome assessment in TJA has traditionally focused
on physician-defined measures of technical success rather than on its impact on
a patient's handicap, satisfaction, or quality of life. Weak and inconsistent
correlations exist among improvements in pain, joint motion, functional ability,
and psychologic well-being.7,8 In addition, a patient's evaluation may conflict
with the surgeon's judgment.6,9 A comprehensive assessment of outcome should
include health-status measures as well as measures of technical success, and
pain, a key factor in determining health status,10 also should be assessed.

In an effort to find optimal techniques to evaluate outcomes such as pain relief
and functional change in patients having TJA, a year-long study was conducted
with five health-status questionnaires for outcome assessment. The results
presented in this article describe the time course among those patients on
measures of change in global health status, pain, mobility and physical
activity, and social function. We re-expressed the measures of change with a
simple index, the standardized response mean (SRM), and estimated the sampling
variability of each SRM by a jackknife procedure. This methodology also enabled
us to draw inferences about statistical significance of differences among
instruments in SRM estimates. By posing questions about sample sizes needed to
evaluate a new therapy in two hypothetical studies, we made additional
quantitative comparisons among the five instruments, between three-month and
12- to 15-month follow-ups, and also across the four health/pain/function
scales.

Methods

Consecutive patients scheduled for hip or knee arthroplasty at the Brigham and
Women's Hospital were screened for study eligibility by the following criteria:
no prior arthroplasty; age between 50 and 80; primary diagnosis of rheumatoid
arthritis (RA) or primary/secondary osteoarthritis (OA); and no cognitive
impairment or language barriers. We screened 201 consecutive patients, but 113
did not meet our criteria (56 had prior arthroplasty, 39 were outside the age
range of 50 to 80, 12 did not have OA or RA, and six for other reasons). Of 88
eligible patients, 12 refused to participate, 10 had scheduling problems
preventing participation, and 16 cancelled surgery or failed to return initial
questionnaires. This left 50 patients on whom three-month follow-up has been
reported previously.11 Thirty-eight patients who were contacted again 12 to 15
months after surgery comprised the longitudinal sample.

Five health-status instruments were used: the Functional Status Index (FSI),12
the Health Assessment Questionnaire (HAQ),13 the Arthritis Impact Measurement
Scales (AIMS),14 the Index of Well-Being (IWB),15 and the Sickness Impact
Profile (SIP).16 The first three instruments were developed for patients with
rheumatoid arthritis and the latter two instruments, although not
arthritis-specific, are particularly useful in health services research and have
been widely used in evaluation of musculoskeletal disorders.

The instruments were completed by each patient approximately two weeks before,
and at three months and 12 to 15 months after surgery. The self-administered
questionnaires were assembled in random order and mailed to patients. The IWB
was completed either by interview or over the phone. The mode and order of
administration were the same preoperatively and postoperatively.

Functional Scales

Global health-status scores were computed according to published instructions
for each instrument. As described previously,11 to facilitate comparisons among
instruments on specific functional dimensions, the subscales were grouped into a
few functional scales. Here, we considered three: Mobility, including
ambulation, bed and chair transfers; Social, including social activities, social
role, and home-management functions; and Pain. The IWB and SIP do not have pain
scales. For every instrument each scale was linearly transformed to
a common range of 0 to 100, where a score of 0 indicated no dysfunction (no
pain) and 100 indicated maximum possible dysfunction (pain).

To fill in missing data, 34 scores (1.7% of 2,052 individual scores) were
imputed by multiple linear regression models17 using available visit-specific
and scale-specific data on AIMS, FSI, HAQ, and SIP. The IWB was not used as a
predictor because it had relatively low correlations with other instruments. The
numbers of imputed values were 0 on SIP, 1 on AIMS, 3 on IWB, 6 on HAQ, and 24
on FSI.

Patients' Characteristics

The sample contained 58% females and 42% males, 87% with OA and 13% with RA. The
surgical procedures were hip (55%) and knee (45%) replacements. Mean age at
surgery was 67.4 years with an age range from 56 to 80 years. These 38 patients
were older (by 4.5 years, P = 0.04) than the 12 patients not followed at 1 year
and had more OA (87% vs. 58%, P = 0.05).

We also obtained comparative data on 623 hip and knee replacements done at our
hospital during the study's accrual period. The sex distribution was 70% females
and 30% males in that series, and the procedures were hip (56%) and knee (44%)
replacements. The relative percentages were 60% diagnosed with OA and 40%
diagnosed with RA (19% were not diagnosed with either OA or RA). Finally, the
mean age was 63.9 years and the age range 12 to 88 years. Our longitudinal
sample of 38 patients compared fairly well on general characteristics to the
hospital's entire series of hip and knee replacements.

Additional checks of representativeness were done by comparing health-status
scores between the group of 38 patients who had one-year longitudinal data and
the group of 12 patients who did not. Two-sample t statistics showed that the
longitudinal sample tended to have better health preoperatively (2-tailed P
values: AIMS, P = 0.08; FSI, P = 0.05; HAQ, P = 0.27; IWB, P = 0.09; and SIP,
P = 0.62). However, at the three-month follow-up, the observed mean changes were
similar for both groups (2-tailed P values: AIMS, P = 0.74; FSI, P = 0.92; HAQ,
P = 0.24; IWB, P = 0.60; and SIP, P = 0.47). We concluded that the longitudinal
sample represented the entire sample reasonably well.

Results

Response Patterns

Three response variables were defined to assess changes among different time
points: early response = preoperative minus three-month values; late response
= three-month minus one-year values; and net response = preoperative minus
one-year values (i.e., net = early + late). For each response, positive values
denoted improvement in the patient's status. Means and standard deviations
summarizing preoperative data as well as temporal patterns of response are shown
in Table 1.

There were substantial improvements in pain and function during the early
postoperative period. Without exception the early response means were positive,
and most were highly significant (10 of 18 at P = 0.0001). Response means were
smaller for social and global scales than for pain and mobility. Strong positive
correlations existed between global, mobility, and social responses (all
instruments except IWB), but pain response did not correlate well with other
responses. Also, only modest correlations were seen among corresponding scales
of different instruments.

Late responses (i.e., changes between 3-month and 1-year values) generally
showed continued improvement, but of lesser magnitude than the early responses.
Few late response means reached statistical significance (only two achieved
P = 0.01). We found that early and late responses were negatively correlated at
moderate levels. Global responses had correlations ranging from -0.52 to -0.70;
for mobility responses, the values were -0.46 to -0.59; and for social responses
-0.27 to -0.73. In contrast, early and late pain responses were nearly
uncorrelated (0.01 to -0.44).

Finally, net response means (i.e., preoperative to 1-year values) were large,
positive, and highly significant (15 of 18 at P = 0.0001). These findings
suggest that patients who have not improved by three months may still do so
later, especially for social and global function. Conversely, those with large
early improvement may reach a plateau or even regress later. The findings are
consistent with clinical observations that rehabilitation after total joint
arthroplasty may peak late in some individuals and that full rehabilitation
requires stable, predictable physical functioning.

TABLE 1. Preoperative Values and Measures of Response

                        Preoperative    Early Response         Late Response          Net Response
Dimension  Instrument   Mean    SD      Mean    SD   P Value   Mean    SD   P Value   Mean    SD   P Value

Pain       AIMS         55.0   20.1     25.6   22.6    a       -0.5   15.4   NSb      25.0   21.0
           FSI          22.4   14.5     12.6   11.1    a        1.3   10.1   NSb      13.9   12.9
           HAQ          42.3   21.3     22.0   22.3    a        2.4   14.8   NSb      24.4   26.9
Global     AIMS         34.4   15.1     11.2   12.1    a        3.7   12.2   NSb      14.9   10.7
           FSI          14.4   10.1      4.7   11.1    c        1.6   13.5   NSb       6.3    9.8    d
           HAQ          35.3   20.5      6.2   17.2   NSb       8.2   15.8    c       14.5   14.0
           IWB          60.3    8.3     15.6   13.6    a       -1.4   19.0   NSb      14.2   15.8
           SIP          14.9    8.8      5.6    7.6    a        2.3    6.1   NSb       7.9    6.8
Mobility   AIMS         51.6   23.3     18.6   17.4    a        7.1   17.7   NSb      25.7   18.0    a
           FSI          16.9   11.8     11.2   13.7    a       -0.5   11.0   NSb      10.8   12.5
           HAQ          43.9   23.7     16.7   27.1    d        6.9   24.7   NSb      23.6   27.1
           IWB          33.1   21.6     14.9   20.7    a        1.5   17.6   NSb      16.4   17.5
           SIP          21.8   14.4     11.6   12.0    a        1.1   10.4   NSb      12.7   11.2
Social     AIMS         24.0   11.7      2.1   11.6   NSb       2.2    9.7   NSb       4.3   11.0   NSb
           FSI          21.3   19.4      9.6   17.5    c        0.9   16.2   NSb      10.5   19.2
           HAQ          45.6   29.2     11.0   29.1   NSb       9.6   26.7   NSb      20.6   20.7
           IWB          43.9   23.5     13.6   38.1   NSb       9.2   41.3   NSb      22.8   27.7
           SIP          22.6   13.4      6.5   12.6    c        6.2    9.4    d       12.8   13.5

SD, standard deviation; Early Response, preoperative minus three-month; Late
Response, three-month minus one-year; Net Response, preoperative minus one-year.
For t tests that response has mean zero:
a P < 0.0001.
b NS, not significant at P < 0.01.
c P < 0.01.
d P < 0.001.

Comparing Responses: Methods

Next we compared magnitudes of early and net responses on each scale among the
five instruments using a unit-free yardstick, which we called the standardized
response mean (SRM). It was defined as SRM = [response mean/response standard
deviation]. If X denotes a measurable response variable, with
$\mu = \mathrm{mean}(X)$ and $\sigma^2 = \mathrm{Var}(X)$ for the population of
TJA patients, one can express the theoretical SRM by the parameter
$\delta = \mu/\sigma$.

Although the natural estimator for $\delta$ is $d = \bar{x}/s$, where $\bar{x}$
and $s$ are the estimates of the mean and standard deviation among responses in
the sample, the statistical sampling distribution of $d$ is not known generally.
Therefore, we used a jackknife procedure18 programmed in SAS19 to obtain the
approximate distribution of $d$ from which to construct a point estimate,
$\delta^*$, create confidence intervals for $\delta$, and test hypotheses.
Details are given in the Appendix.* All these indices are standardized
internally and all share the feature that larger absolute values
connote relatively larger responses. Suppose two instruments have been used to
measure the same response phenomenon. If they have equal values, they are
statistically equivalent for that response. But if one instrument has a
substantially larger index, it is preferred on statistical grounds because it is
"more sensitive" or "more responsive" to changes within patients.

* Kazis et al.20 used "effect size" to equal [mean response/standard deviation
of pretreatment observations] and Guyatt et al.21 used "responsiveness" to equal
[minimum meaningful response/$\sqrt{2\,\mathrm{MSE}}$], where MSE = variance of
observations within stable patients. Our definition of SRM actually coincides
with one form of the "effect size index" proposed by Cohen,22 but to avoid
confusion, we use the term SRM (symbolized by $\delta$) throughout this article.

Comparing Responses: Application

Jackknife estimates of standardized response means for early response and net
response are shown in Table 2. Cohen's rule-of-thumb for interpreting the
"effect size index," a value of 0.2 is small, 0.5 is moderate, and 0.8 or
greater is large,22 can be applied to the SRM's. Also shown in Table 2 are
standard errors of the SRM's and 95% confidence intervals to quantify their
sampling variability. Each SRM was compared with a selected reference, and SRM's
were compared between early and net responses, using paired-data t tests.

TABLE 2. Jackknife Estimates of Standardized Mean Response for Early and Net
Response

                       Early Response                             Net Response
                       (Baseline to 3 Months Postoperatively)     (Baseline to 12 to 15 Months Postoperatively)
                                                95% Confidence                               95% Confidence
                                                Interval c                                   Interval c
Scale     Instrument   $\delta^*$ a  SE($\delta^*$) b  Lower  Upper   $\delta^*$ a  SE($\delta^*$) b  Lower  Upper

Pain      AIMS         1.11          0.21               0.68   1.54   1.15d         0.24               0.67   1.63
Pain      FSI          1.08          0.28               0.51   1.65   1.00          0.24               0.51   1.49
Pain      HAQ          0.95          0.25               0.43   1.47   0.89          0.21               0.46   1.32
Global    AIMS         0.88          0.25               0.38   1.38   1.36d         0.24               0.87   1.85
Global    FSI          0.40e         0.21              -0.02   0.82   0.61e         0.23               0.15   1.07
Global    HAQ          0.33e         0.21              -0.10   0.76   1.00          0.23               0.53   1.47
Global    IWB          1.13          0.16               0.80   1.46   0.88          0.16               0.56   1.20
Global    SIP          0.71          0.19               0.32   1.10   1.11          0.18               0.74   1.48
Mobility  AIMS         1.01          0.29               0.42   1.60   1.41d         0.16               1.08   1.74
Mobility  FSI          0.78e         0.22               0.33   1.23   0.84e         0.19               0.45   1.23
Mobility  HAQ          0.57e         0.24               0.08   1.06   0.84e         0.23               0.37   1.31
Mobility  IWB          0.68e         0.25               0.17   1.19   0.91          0.23               0.45   1.37
Mobility  SIP          0.94          0.20               0.54   1.34   1.10          0.15               0.79   1.41
Social    AIMS         0.17f         0.17              -0.18   0.52   0.39e         0.15               0.09   0.69
Social    FSI          0.54          0.17               0.20   0.88   0.51          0.24               0.03   0.99
Social    HAQ          0.35          0.22              -0.09   0.79   0.98d         0.14               0.70   1.26
Social    IWB          0.34e         0.20              -0.07   0.75   0.82          0.17               0.48   1.16
Social    SIP          0.51          0.18               0.14   0.88   0.93          0.15               0.62   1.24

a Jackknife estimate of $\delta = \mu/\sigma$.
b Standard error of $\delta^*$.
c 95% confidence interval for $\delta$.
d Reference scale.
e P value < 0.01 versus reference.
f P value < 0.001 versus reference.

With regard to pain responses, all estimates of SRM were quite similar: the mean
of the six estimates was 1.03 and the range was 0.89 (net HAQ) to 1.15 (net
AIMS). These values are large effects by Cohen's guidelines. Due to the modest
sample size, the standard errors were fairly large, so there are broad 95%
confidence intervals. Note, though, that the lowest bound (0.43 on early HAQ)
still is a "moderate" effect.
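
The SRM, its jackknife standard error, and the resulting confidence interval can
be sketched in a few lines of Python. The authors' SAS program and Appendix are
not reproduced in this section, so what follows is only a minimal sketch that
assumes the textbook delete-one jackknife with pseudo-values; it is not the
authors' implementation. The function names `srm` and `jackknife_srm`, the
sample data, and the default critical value (2.026, the 0.975 quantile of
Student's t with 37 degrees of freedom, chosen to match n = 38) are all
illustrative assumptions.

```python
import math
import statistics


def srm(responses):
    """Standardized response mean: mean response / SD of responses."""
    return statistics.mean(responses) / statistics.stdev(responses)


def jackknife_srm(responses, t_crit=2.026):
    """Delete-one jackknife for the SRM (assumed textbook form).

    responses: list of within-patient changes (needs at least 3 values).
    t_crit: critical value for the interval; the default is an assumption
            suited to n = 38 and should be changed for other sample sizes.
    Returns (point estimate, standard error, (lower, upper)).
    """
    n = len(responses)
    d_full = srm(responses)
    pseudo = []
    for i in range(n):
        # SRM recomputed with patient i left out
        d_minus_i = srm(responses[:i] + responses[i + 1:])
        # Pseudo-value for patient i
        pseudo.append(n * d_full - (n - 1) * d_minus_i)
    d_star = statistics.mean(pseudo)               # jackknife point estimate
    se = statistics.stdev(pseudo) / math.sqrt(n)   # jackknife standard error
    return d_star, se, (d_star - t_crit * se, d_star + t_crit * se)


if __name__ == "__main__":
    # Hypothetical early pain responses (preoperative minus three-month)
    example = [30.0, 12.5, 41.0, 5.0, 22.0, 18.5, 35.0, -4.0, 27.5, 16.0]
    estimate, std_err, interval = jackknife_srm(example)
    print(round(estimate, 2), round(std_err, 2), interval)
```

With a multiplier near 2.03, a standard error of 0.21 gives a half-width of
about 0.43, which appears roughly consistent with the interval widths printed
in Table 2.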

We did not find any statistically significant differences (P > 0.25 for every
test) between SRM's for early and net response, nor among direct comparisons of
specific SRM's against the net AIMS response.

Measures of global response showed substantial variation; the average SRM was
0.84 and the range was from 0.33 (early HAQ) to 1.36 (net AIMS). The 95%
confidence intervals again were broad, but lower bounds for some measures were
"moderate" in size or greater, these being early IWB and all net responses
except net FSI. We did not find any significant differences (at P < 0.01)
between SRM's for early and net response on any instrument. Testing each SRM
against the net AIMS gave significant results (P < 0.01) for early FSI, early
HAQ, and net FSI responses, these having smaller SRM's than the net AIMS. We
judge the remaining measures to be equivalent.

Jackknife estimates of SRM's for mobility response were between 0.57 (early HAQ)
and 1.41 (net AIMS) and 0.91 was the average. These point estimates indicated
response sizes moderate or greater, but most of the lower confidence limits were
small-to-moderate by Cohen's criterion. We found no significant differences
between SRM's for early versus net responses on the same instrument. Additional
pair-wise tests showed that both early and net response on FSI and on HAQ, and
the IWB early response, had significantly smaller SRM's than the net AIMS
(P < 0.01 on each test).

Estimates of SRM's for change in social function were smaller: the average was
0.55 and the values ranged from 0.17 (AIMS early response) to 0.98 (HAQ net
response). None of the instruments had significant differences (at P < 0.01)
between the SRM's for early versus net response. Compared with the net HAQ
response, three measures had significantly (P < 0.01) smaller SRM's: early AIMS,
early IWB, and net AIMS. Based on Cohen's criterion, only three measures
produced "large" response sizes: the net HAQ, net IWB and net SIP.

Design Issues

Suppose one wants to evaluate changes in health status among patients who
receive a new treatment. Which endpoints should be evaluated, at which follow-up
times, by which instruments, and on how many subjects are some of the decisions
one faces. We used the $\delta$ values to calculate and compare sample sizes to
assist in experimental design. Two common scenarios are used for illustration:
experiment A studied only the new treatment (e.g., a pilot study), whereas
experiment B compared the new treatment with the standard (e.g., a randomized
trial). The following assumptions are made about the theoretic values of a
normally distributed "generic" response variable. The responses have mean $\mu$,
and variance $\sigma^2$ among patients who received the standard treatment such
as total joint arthroplasty. Among those who received the new treatment, the
responses have mean $K\mu$, and variance $\sigma^2$. Note that $K$ is a
parameter for treatment efficacy: if $K = 1$, both treatments have equal
efficacy, if $K < 1$ (alternatively, $K > 1$) the standard (new) treatment has
greater efficacy, and the special case $K = 0$ might represent a placebo.

The statistical question was posed as "What sample size is needed in each
experiment to have power $1 - \beta$ for rejecting the null hypothesis at
significance level $\alpha$ (2-tailed)?" A formula for the sample size per
treatment group23 can be written as

$$N = [Z(1 - \alpha/2) + Z(1 - \beta)]^2 \, f(\mu, \sigma, K)$$

where the function $f(\,)$ depends on the design and response parameters and
where $Z(p)$ is the 100p percentile of the standard normal distribution.
Specifically, $f(\mu, \sigma, K) = [\sigma/(K\mu)]^2$ for experiment A and
$f(\mu, \sigma, K) = 2[\sigma/((1 - K)\mu)]^2$ for experiment B.

We illustrate the calculations for both experiments with $1 - \beta = 0.80$ as
the statistical power and $\alpha = 0.05$ as the significance level. For the
pilot study, experiment A, the null hypothesis states that the new treatment
was not efficacious; we chose $K = 0.75$ relative to the standard treatment as
the efficacy worth detecting. Equal efficacy ($H_0$: $K = 1$) was the null
hypothesis for the randomized trial in experiment B; here, we chose
$1 - K = 0.50$ as the difference in efficacies worth detecting.

Finally, we substituted estimates $\delta^*$ for $\delta = \mu/\sigma$ to derive
sample sizes needed for each response variable. The formulas were given
approximately by $N = 13.95/(\delta^*)^2$ for experiment A and
$N = 62.79/(\delta^*)^2$ for experiment B. Numerical results shown in Table 3
are discussed below.

TABLE 3. Sample Sizes for Evaluating a New Therapy in a Pilot Study (A) and a
Randomized Trial (B) According to Type of Response Measurement

                         Experiment A        Experiment B
Scale      Instrument    Early a   Net b     Early a   Net b

Pain       AIMS             11       11         51       47
Pain       FSI              12       14         54       63
Pain       HAQ              15       18         70       79
Global     AIMS             18        8         81       34
Global     FSI              87       37        392      169
Global     HAQ             128       14        577       63
Global     IWB              11       18         49       81
Global     SIP              28       11        125       51
Mobility   AIMS             14        7         62       32
Mobility   FSI              23       20        103       89
Mobility   HAQ              43       20        193       89
Mobility   IWB              30       17        136       76
Mobility   SIP              16       12         71       52
Social     AIMS            483       92       2173      413
Social     FSI              48       54        215      241
Social     HAQ             114       15        513       65
Social     IWB             121       21        543       93
Social     SIP              54       16        241       73

a Response between baseline and 3 months postoperatively.
b Response between baseline and 12 to 15 months postoperatively.

Sample sizes needed to evaluate pain response ranged from 11 to 18 for the pilot
study or from 47 to 79 per group in the randomized trial. There was no clear
statistical preference for measurements at one year versus three months.
Differences among instruments also seemed modest.

The data on responses in global health status painted a different picture. It
would require eight to 128 patients in the pilot study or 34 to 577 per group in
the clinical trial. Four of the instruments would require much larger samples to
measure response at three months than at one year (IWB was anomalous).

Turning to the mobility responses, one sees that the pilot study required seven
to 43 patients and the clinical trial needed 32 to 193 patients. Smaller samples
suffice if measurements are taken at one year rather than at three months. The
biggest differences among instruments appeared on measures of social function:
anywhere from 15 to 483 patients were required in the pilot study, and the
numbers were 65 to 2,173 in the randomized trial. Regardless of whether there
was statistical significance when comparing values, taking measures of response
at one year seems to be highly preferred to three months, and only three
measures appear to be truly suitable: HAQ (net), IWB (net), and SIP (net).

Increasingly in health-care research, health-status instruments are being used to
measure change within patients to quantify the health benefits of interventions.
Because no single instrument is optimal, one should establish for each
instrument its appropriateness to measure the designated outcomes, taking into
account the study population, the specific interventions, and the research
design. Also, whereas cross-sectional data from several instruments may
correlate strongly (one example of convergent validity), response data may give
quite different results. We believe that it is vitally important to assess which
instruments are sensitive to detect typical responses in specific settings and
which instruments are relatively insensitive.

Standardized indices of response are essential to make statistically meaningful
comparisons among different measures. We chose (mean response)/(standard
deviation of responses) as the comparative index for this research, a quantity
that Cohen calls an "effect size index."22 Because this quantity equals the
paired-sample t statistic (except for a sample size factor), it is a natural
statistical index for evaluating response magnitude. Two other indices were
proposed recently. Guyatt and co-workers21 defined an index of "responsiveness,"
but its practicality is limited, because it requires quantifying what is
"minimum clinically meaningful response." Kazis and colleagues20 argued for the
use of "effect sizes" to assess magnitudes and meanings of changes in health
status. Their index has the same numerator as ours, but uses pretreatment
standard deviation in the denominator. It fails to incorporate information about
response variance, so it cannot be used to test statistical significance of the
response means. Because Kazis' definition conflicts with Cohen's "effect size
index," to avoid ambiguity the phrase "standardized response mean" designates
the index adopted in this article.

Our goals in this research were to measure changes in pain and function in
patients undergoing total joint replacement, and to evaluate the relative
sensitivities of five instruments for measuring those types of response. The
patients reported substantial pain and dysfunction before surgery, and there
were strong correlations among corresponding scales of different instruments.
Nonetheless, the instruments had distinct metric properties as judged by
measures of central location and spread. Patients improved dramatically on three
scales (pain, mobility, and overall health) by three months after surgery, and
they showed some improvement in social function. Later responses were smaller.
Nearly all instruments and all scales registered impressively large net
responses (Table 4).

Response magnitudes were compared by a common index, the standardized response
mean (SRM, or $\delta$). Measures of net response generally gave larger SRM's
than did those of early response, but these differences were not statistically
significant. Also, there were greater SRM's for changes in pain, mobility, and
overall health (average SRM values were 1.03, 0.91, and 0.84, respectively) than
for social function (average SRM, 0.55). This finding was consistent with the
clinical observation that full rehabilitation depends primarily on pain-free,
stable physical functioning.

Among the instruments with pain scales, all had roughly equal sensitivity (i.e.,
equal SRM's) for measuring pain response. This was not the case for other types
of response. To assess changes in global health status, the FSI (early, net) and
HAQ (early) were relatively insensitive. These same measures plus HAQ (net) and
IWB (early) had the least sensitivity for mobility changes. Regarding social
function, the AIMS (early, net) and IWB (early) appeared to be the least
sensitive. Another perspective was provided by the data in Table 3: whereas the
order among measures does not change, their differences appear as "relative
efficiencies" in terms of sample sizes for future studies. Whichever indicator
one chooses, inequalities among instruments might be attributed
to differences in construction of individual scales and to differential
weighting of pain and mobility items in the overall score.

TABLE 4. Selected Standardized Response Means in a Randomized Drug Trial and the
Present Study

                  Randomized Trial                                   Present Study
                  Baseline           SRM at Six Months b                          Baseline   SRM at 12 to 15 Months c
Scale             (Pooled Samples,   Placebo     Auranofin           Instrument   (N = 38)   (N = 38)
                  N = 303) a         (N = 149)   (N = 154)

Pain
  10 cm line d    53                 0.41        0.68                HAQ          42         0.89
  McGill          29                 0.46        0.59                FSI          22         1.00
  Composite       NA                 0.56        0.75                AIMS         55         1.15
Overall health
  HAQ d           47                 0.35        0.50                HAQ          35         1.00
  QWB d           40                -0.01        0.27                IWB          60         0.88
  10 cm line      45                 0.32        0.45                AIMS         34         1.36
  RAND            64                 0.09        0.30                SIP          15         1.15

From Bombardier et al.24
NA, not available.
a Baseline means were transfo
b SRM was estimated by mea
c Jackknife estimate.
d Instrument common to both

Limitations

Some limitations of the study design require comment. First, the study was not
designed to evaluate the effectiveness of TJA alone. Each patient having hip or
knee TJA also received rehabilitation and medications as part of ongoing
arthritis management. There are no controlled studies of patients who received
only "routine care," so it is not possible to estimate the effects of TJA per
se. Nonetheless, the patients in this study had dramatic improvements in pain
(average SRM = 1.03), global health status (average SRM = 0.84), and mobility
(average SRM = 0.91) and somewhat less improvement in social function (average
SRM = 0.55). Our results can be used to define benchmarks for other
interventions, and also to calculate sample sizes for comparative trials, with
treatment efficacies 1 and $K$, that account for "placebo" effect.

Regarding the comparison of mobility and social scales, it should be emphasized
that these were based on clinically determined combinations of subscales for
each specific instrument. We cannot claim that these constructed scales have the
same properties of reliability, validity, and sensitivity as their parent
instruments. An instrument with just a few items might perform poorly in
subscale analyses. The social scales probably are most vulnerable because these
instruments devoted more questions to physical function than to social function.
Perhaps the main lesson is to be wary of subscale analyses, at least until the
statistical properties of specific subscales in specific research settings are
understood.

The third limitation was our small sample size of 38 patients. We demonstrated
that the profile for these patients was similar to a much larger, more
comprehensive series. The small sample affected the variability, but not the
validity, of our estimates. The main consequence was that we could discriminate
between measures of response if their SRM's differed substantially, but not if
the differences were modest. Given the still-primitive "state-of-the-art" for
sensitivity comparisons, that was not a serious drawback. Contrasts among
instruments will be
clarified with additional studies of larger size.

Published data from a randomized, double-blind placebo-controlled study of
Auranofin therapy in rheumatoid arthritis,24 with three of the same scales used
in this study, served as a reference point. Patients in the Auranofin trial were
allowed to continue taking NSAIDs, oral steroids, and analgesics during the
trial. Overall, 303 patients completed the six-month study; most were female
(73%). They were younger than the patients in our surgical series (their mean
age was 50 years), but most had moderate or severe disease by physician's
assessment (73%). The Auranofin-trial patients had slightly worse health, but
the surgical series had larger response values, expressed as SRM's. These data
suggest the following crude benchmarks (Table 4). Pain response SRM's up to
approximately 0.5 may be seen in placebo groups, up to approximately 0.7 for
patients receiving a moderately active drug, and up to approximately 1.0 for
patients undergoing surgery (TJA). Kazis20 reported pain response "effect sizes"
greater than 1 for injectable gold and an "experimental drug." Regarding changes
in overall health status, small SRM's of, for example, 0.2 to 0.3, might be
found in placebo groups, increased values of 0.3 to 0.5 for a moderately active
drug, and large responses (with SRM's of 1.0 or greater) in surgical series.
Clearly, these benchmarks will be modified and benchmarks for subscales will be
generated as more data become available.

Conclusions

As health sciences research invests more

taking response measures with multiple instruments on each sample of patients.
Data from different instruments can be judged using standardized indices of
response, several of which have been proposed. But estimating an index is not
enough. This study demonstrated that there can be substantial sampling
variability associated with each estimate. Therefore, future comparisons among
instruments should incorporate sampling variability to construct interval
estimates and to perform hypothesis tests.

Acknowledgments

We gratefully acknowledge the assistance of Jacqueline Mazzie, Lisa Bail, and
Nancy Tanner in the preparation of this manuscript, and of Tari Bergman in
providing data on the series of 623 TJA patients.

References

1. Codman EA. The shoulder. Boston: Thomas Todd Company, 1934.
2. Gross M. A critique of the methodologies used in clinical studies of
hip-joint arthroplasty published in the English-language orthopaedic literature.
JBJS 1988;70A:1364.
3. Galante J. The need for a standardized system for evaluating results of total
hip surgery (Editorial). JBJS
4. Gartland J. Orthopaedic clinical research: deficiencies in experimental
design and determinations of outcome. JBJS 1988;70A:1357.
5. Liang MH, Cullen KE, Poss R. Primary total hip or knee replacement:
evaluation of patients. Ann Intern Med 1982;97:735.
6. Burton KE, Wright V, Richards J. Patient's expectations in relation to
outcome of total hip replacement surgery. Ann Rheum Dis 1979;38:471.
7. Wilcock GK. Benefits of total hip replacement to older patients and the
community. Br Med J 1978;2:37.
8. Visuri T, Honkanen R. The influence of total hip replacement on selected
activities of daily living and on the use of domestic aid. Scand J Rehab Med
9. Andersson G. Hip assessment: a comparison of nine different methods. JBJS
1972;54B:621.
10. Kazis LE, Meenan RF, Anderson JJ. Pain in the
heavily in health status questionnairesrheumatic
as diseases: investigation of a key health status
tools to measure response, so too mustcomponent.
it Arth Rheum 1983;26:1017.
11. Liang MH, Larson MG, Cullen KE et al. Com-
invest in comparative research to determine
parative measurement efficiency and sensitivity of five
the relative merits of various instruments. health status instruments for arthritis research. Arth
Clearly, the most useful data is obtained byRheum 1985;28:542.


12. Jette AM. Functional status instrument: reliability of a chronic disease evaluation instrument. Arch Phys Med Rehab 1980;61:395.
13. Fries JF, Spitz P, Kraines RG et al. Measurement of patient outcome in arthritis. Arth Rheum 1980;23:137.
14. Meenan RF, Gertman PM, Mason JM. Measuring health status in arthritis: the arthritis impact measurement scales. Arth Rheum 1980;23:146.
15. Kaplan RM, Bush JW, Berry CC. Health status: types of validity for an index of well-being. Health Serv Res 1976;11:478.
16. Bergner M, Bobbitt RA, Pollard WE et al. The sickness impact profile: validation of a health status measure. Med Care 1976;14:57.
17. SAS Institute Inc. SAS/STAT User's Guide. Release 6.03 edition: the REG procedure. Cary, NC: SAS Institute Inc., 1988;773.
18. Mosteller F, Tukey JW. Data analysis and regression. Reading, MA: Addison Wesley, 1977;133.
19. SAS Institute Inc. SAS Procedures Guide. Release 6.03 edition. Cary, NC: SAS Institute Inc.,
20. Kazis LE, Anderson JJ, Meenan RF. Effect size for interpreting changes in health status. Med Care
21. Guyatt G, Walter S, Norman G. Measuring change over time: assessing the usefulness of evaluative instruments. J Chron Dis 1987;40:171.
22. Cohen J. Statistical power analysis for the behavioral sciences. New York: Academic Press, 1977.
23. Snedecor GW, Cochran WG. Statistical methods. Ames, IA: Iowa State University Press, 1967.
24. Bombardier C, Ware J, Russell IJ et al. Auranofin therapy and quality of life in patients with rheumatoid arthritis. Am J Med 1986;81:565.


To draw inferences about the relative efficiency of obtaining response data by different instruments (or at different times after surgery), we defined the standardized response mean, for example, $\delta = \mu/\sigma$, where $\mu$ is the population response mean and $\sigma$ the response standard deviation. Obviously, $\mu$ and $\sigma$ are unknown quantities, but the sample mean and sample standard deviation, $\bar{x}$ and $s$, can be used to estimate $\delta$ by $d = \bar{x}/s$. The slight bias in $s$ as an estimator of $\sigma$ is ignored for this estimate. Because the sampling distribution of $d$ is unknown, a jackknife procedure18 was used for estimation and significance testing as outlined below.
All $n$ patients were used to compute $\bar{x}$, $s$, and $d$ for each response variable. Next, the $i$-th subject was removed and new values were computed: $\bar{x}_{(i)}$, $s_{(i)}$, and $d_{(i)}$; $i = 1, \ldots, n$; and corresponding "pseudo-values" were calculated as $d_i^* = nd - (n - 1)d_{(i)}$. Then $\delta$ was estimated by the sample mean of the pseudo-values:

$$\delta^* = \frac{1}{n} \sum_{i=1}^{n} d_i^*$$

and the variance of $\delta^*$ was estimated by

$$V(\delta^*) = \frac{1}{n} \cdot \frac{1}{n-1} \sum_{i=1}^{n} (d_i^* - \delta^*)^2 .$$

Confidence intervals (95%) based on the t distribution were $\delta^* \pm 2.026\sqrt{V(\delta^*)}$, where 2.026 is the 97.5 percentile of a t distribution with $n - 1 = 37$ degrees of freedom.
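The jackknife steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' SAS code: the function names `srm` and `jackknife_srm`, the `t_crit` argument, and the eight response values are all hypothetical. The caller supplies the t critical value (2.026 for n = 38, as quoted in the text; 2.365 for the n = 8 made-up sample below).

```python
import math
import statistics


def srm(responses):
    """Standardized response mean d = sample mean / sample SD."""
    return statistics.mean(responses) / statistics.stdev(responses)


def jackknife_srm(responses, t_crit):
    """Jackknife estimate, variance, and confidence interval for the SRM.

    responses: list of within-patient changes (one value per patient).
    t_crit: 97.5th percentile of t with n - 1 degrees of freedom,
            supplied by the caller (an assumption of this sketch).
    """
    n = len(responses)
    d_all = srm(responses)
    # Pseudo-values: d*_i = n*d - (n - 1)*d_(i), where d_(i) omits patient i.
    pseudo = [
        n * d_all - (n - 1) * srm(responses[:i] + responses[i + 1:])
        for i in range(n)
    ]
    estimate = statistics.mean(pseudo)
    # V(delta*) = (1/n) * (1/(n-1)) * sum((d*_i - delta*)^2)
    variance = statistics.variance(pseudo) / n
    half_width = t_crit * math.sqrt(variance)
    return estimate, variance, (estimate - half_width, estimate + half_width)


# Hypothetical response values for 8 patients (not data from the study).
example = [1.0, 2.0, 3.0, 4.0, 5.0, 2.5, 3.5, 0.5]
est, var, ci = jackknife_srm(example, t_crit=2.365)
print(est, var, ci)
```

Passing the critical value in explicitly keeps the sketch within the standard library; a statistics package could compute it from n instead.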
For any two response variables, for example, A and B, the hypothesis to be tested was $H_0$: $\mu_A/\sigma_A = \mu_B/\sigma_B$, or equivalently $\delta_A = \delta_B$. Pseudo-values $d^*(A)$ and $d^*(B)$ were calculated as described above, and $H_0$ was tested based on differences $[d^*(A) - d^*(B)]$ with a t statistic having $n - 1$ degrees of freedom.
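A comparison of two response variables along these lines might look like the following Python sketch. The source text is truncated at this point, so the exact form of the test statistic is an assumption: here a one-sample t statistic is computed on the paired pseudo-value differences. The function names and the sample values are hypothetical, and both lists must hold the same patients in the same order so that the i-th pseudo-values refer to the same omitted patient.

```python
import math
import statistics


def pseudo_values(responses):
    """Jackknife pseudo-values d*_i = n*d - (n - 1)*d_(i) for the SRM."""
    def srm(values):
        return statistics.mean(values) / statistics.stdev(values)

    n = len(responses)
    d_all = srm(responses)
    return [
        n * d_all - (n - 1) * srm(responses[:i] + responses[i + 1:])
        for i in range(n)
    ]


def compare_srms(resp_a, resp_b):
    """Return (t statistic, degrees of freedom) for H0: SRM_A = SRM_B.

    Assumed form: mean of the paired differences d*(A) - d*(B)
    divided by its standard error, with n - 1 degrees of freedom.
    """
    if len(resp_a) != len(resp_b):
        raise ValueError("both response lists must cover the same patients")
    diffs = [a - b for a, b in zip(pseudo_values(resp_a), pseudo_values(resp_b))]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / std_err, n - 1


# Hypothetical responses on two scales for the same 8 patients.
scale_a = [1.0, 2.0, 3.0, 4.0, 5.0, 2.5, 3.5, 0.5]
scale_b = [0.5, 1.5, 0.0, 2.0, 1.0, -0.5, 2.5, 1.0]
t_stat, df = compare_srms(scale_a, scale_b)
print(t_stat, df)
```

The resulting statistic would then be referred to a t distribution with the returned degrees of freedom.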

