


Journal of Cognitive and Behavioral Psychotherapies,
Vol. 9, No. 1, March 2009, 49-66.


THE RECEIVER-OPERATING CHARACTERISTIC
(ROC) ANALYSIS: FUNDAMENTALS AND
APPLICATIONS IN CLINICAL PSYCHOLOGY
Sebastian PINTEA* & Ramona MOLDOVAN
Babes-Bolyai University, Cluj-Napoca, Romania

* Correspondence concerning this article should be addressed to:
E-mail: sebastianpintea@psychology.ro

Abstract
The Receiver-Operating Characteristic (ROC) analysis has long been used
in Signal Detection Theory to depict the tradeoff between the hit rates and
false alarm rates of classifiers. In recent years, ROC analysis has become
widely used in the medical community for visualizing and analyzing the
performance of diagnostic tests. Our article points out some fundamental
aspects of ROC analysis, underlining the importance of using ROC analysis
in evaluating the diagnostic validity of tests commonly used in clinical
psychology. The main statistical programs available for this type of
analysis, with their advantages and deficiencies, are also discussed. In order
to illustrate how ROC analysis works in clinical research, we also describe
an application of ROC analysis in evaluating scales generally related to
depression.

Keywords: receiver operating characteristics (ROC), ROC analysis, area
under the curve (AUC), diagnostic performance, sensitivity, specificity,
clinical psychology, depression


THE ROC ANALYSIS: FUNDAMENTALS

Receiver Operating Characteristic (ROC) analysis is a procedure used to
assess the diagnostic properties of tests, namely the extent to which various
measures discriminate between different categories of subjects. In order
to do this, a cut-off point needs to be established; based on the cut-off point, we
can determine whether a person with a certain score belongs to one category or
another (e.g. normal/non-clinical or clinical group). ROC analysis may also be
used when comparing the diagnostic performance of two or more tests (Westin,
2001).
ROC analysis was used for the first time in the military field, for the
analysis of radar images, during the Second World War (Westin, 2001). In
medicine, the procedure has been used since the 1960s, and there is an extensive
literature on the use of ROC graphs for diagnostic testing (Fawcett, 2006). In
chemistry, ROC curve analysis is used to solve dichotomous decision problems
such as: the presence or absence of a protein marker, whether the structure of a
molecule is X or Y, whether a reaction obeys first- or second-order kinetics, or
whether a reaction should be terminated or continued (Brown & Davis, 2006).
In clinical psychology, ROC analysis is being used with increased
frequency, particularly in examining the utility or performance of diagnostic or
screening tools for: future difficulties in reading comprehension (Shapiro, Solari
& Petscher, 2008), alcohol and drug abuse (Kills Small et al., 2007),
neuropsychological impairment (Horwitz et al., 2008; O'Brien et al., 2007),
depression (Benazzi, 2008; Serrano-Duenas & Serrano, 2008; Stafford, Berk &
Jackson, 2007; Ballesteros et al., 2007; Walsh et al., 2006), obsessive-compulsive
disorder (Ivarsson & Larsson, 2008), bipolar disorder (Parker et al., 2008),
suicide (Jokinen, Nordstrom & Nordstrom, 2008), dementia (Chiu et al., 2008;
Giaquinto & Parnetti, 2006), dropout risk from different treatments such as
cognitive-behavior therapy for insomnia (Ong, Kuo & Manber, 2008) and so on.
Two of the most important literature reviews on ROC curve analysis
have been conducted by McFall & Treat (1999) and Streiner & Cairney (2007).
McFall and Treat pursue a broader objective in their work, addressing, as they
themselves state, the functions of clinical assessment, the standards by which
methods can be evaluated, and the most promising approaches to achieving the
broad goals of clinical assessment (p. 216). Compared to their article, and also to
the one by Streiner and Cairney, our paper is intended to be more applied. We use
actual data collected from a Romanian sample to illustrate a number of ROC
concepts reviewed in this paper; we also discuss an extended set of uses, from
establishing cut-off points and comparing different tests to analyzing overall and
partial ROC areas and specific points on the curves.

Measures related to ROC curves

The dichotomous decision process is based on a threshold value V (cut-
off point) which classifies the scores of a continuous variable Y (also called
classifier) into two categories: positive vs. negative. If Y ≥ V, the subject will be
classified as positive; if Y < V, the subject will be classified as negative (Brown &
Davis, 2006).
Now let us assume that we have a valid procedure of discriminating
between the presence and the absence of a disorder (also called valid
diagnosis), and we differentiate two groups of individuals: with and without a
certain disorder. By also administering the test that assesses the value of the Y
variable in each subject, we obtain two distributions of Y scores, one for each
group. In actual research (in real life) a perfect separation between the two groups
is quite rare. Most of the time, the distributions of test results will overlap, as
shown in Figure 1.




Figure 1. Four possible outcomes when intersecting a valid diagnosis with a classifier


The intersection of the valid diagnosis with the classifier generates
four possible outcomes. If the valid diagnosis is positive and it is correctly
classified as positive, the outcome is counted as a true positive (TP); if the same
outcome is incorrectly classified as negative, it is counted as a false negative
(FN). If the valid diagnosis is negative and it is correctly classified as negative,
the outcome is counted as a true negative (TN); if the same outcome is incorrectly
classified as positive, it is counted as a false positive (FP) (Brown & Davis, 2006;
Fawcett, 2006).
The outcomes identified in Figure 1 can be conceptualized in a
contingency table that we will call the decision matrix, also known as the
confusion matrix (Brown & Davis, 2006). Such an approach is presented in
Table 1.

Table 1. The decision matrix

                      DIAGNOSIS
TEST         Positive     Negative     Total
Positive     TP           FP           T+
Negative     FN           TN           T-
Total        D+           D-           n





In the decision matrix, besides the four outcomes described above, we
have included the following values: the total number of subjects who have the
disorder (D+), the total number of subjects who do not have the disorder (D-), the
total number of subjects with a positive test result (T+), the total number of
subjects with a negative test result (T-) and the total number of subjects analysed
(n) (Brown & Davis, 2006; Fawcett, 2006). It is important to note that in research
situations we have a decision matrix for each possible cut-off point. If variable Y
is a discrete variable, with k possible values, we will have k-1 decision matrices.
From the outcomes described in the decision matrix, we can calculate
the following measures (metrics):

(1) Sensitivity = TP / D+
(2) Specificity = TN / D-
(3) Positive likelihood ratio = Sensitivity / (1 - Specificity)
(4) Negative likelihood ratio = (1 - Sensitivity) / Specificity
(5) Positive predictive value = TP / T+
(6) Negative predictive value = TN / T-
(7) Accuracy = (TP + TN) / n

Sensitivity, also called the true positive rate (when expressed as a
percentage), is defined as the probability that a test result will be positive when the
disorder is present.
Specificity, also called the true negative rate (when expressed as a
percentage), represents the probability that a test result will be negative when the
disorder is not present. These two indicators are essential for ROC curves
analysis.
The positive likelihood ratio is the ratio between the probability of a
positive test result given the presence of the disorder and the probability of a
positive test result given the absence of the disorder. Similarly, the negative
likelihood ratio is defined as the ratio between the probability of a negative test
result given the presence of the disorder and the probability of a negative test
result given the absence of the disorder.
Positive predictive value, also called precision, is defined as the
probability that the disorder is present when the result of the test is positive, while
the negative predictive value is defined as the probability that the disorder is not
present when the result of the test is negative.
The last indicator presented here is the diagnostic accuracy of a test, also
referred to as its clinical performance: the ability to correctly classify subjects
into clinically relevant subgroups. Diagnostic accuracy thus refers to the quality
of the information provided by the classification device (for more details see
Jokinen, Nordstrom & Nordstrom, 2008).
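
To make the computation of these measures concrete, the short Python sketch
below derives metrics (1)-(7) from the four cells of a decision matrix; the cell
counts are hypothetical and serve only as an illustration.

```python
# A minimal sketch computing metrics (1)-(7) from the four cells of a
# decision matrix. The cell counts below are hypothetical, chosen only to
# illustrate the formulas.

TP, FP, FN, TN = 45, 15, 5, 35   # hypothetical cell counts

D_pos = TP + FN          # subjects who have the disorder (D+)
D_neg = TN + FP          # subjects who do not have the disorder (D-)
T_pos = TP + FP          # subjects with a positive test result (T+)
T_neg = TN + FN          # subjects with a negative test result (T-)
n = D_pos + D_neg        # total number of subjects

sensitivity = TP / D_pos                        # (1) true positive rate: 0.90
specificity = TN / D_neg                        # (2) true negative rate: 0.70
positive_lr = sensitivity / (1 - specificity)   # (3) positive likelihood ratio: 3.0
negative_lr = (1 - sensitivity) / specificity   # (4) negative likelihood ratio: 0.14
ppv = TP / T_pos                                # (5) positive predictive value: 0.75
npv = TN / T_neg                                # (6) negative predictive value: 0.875
accuracy = (TP + TN) / n                        # (7) accuracy: 0.80

print(f"Se={sensitivity:.2f} Sp={specificity:.2f} "
      f"+LR={positive_lr:.2f} -LR={negative_lr:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f} Acc={accuracy:.2f}")
```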






Other measures that are worth mentioning here, even though they are not
actually used in the following analysis, are the prevalence and the pre-test and
post-test odds. The prevalence (D+/n) refers to the proportion of cases exhibiting
the disorder; the pre-test odds (prevalence / (1 - prevalence)) refers to the odds that
the patient suffers from the target disorder before the test is carried out, while the
post-test odds (pre-test odds × positive likelihood ratio) reflects the odds that the
patient suffers from the target disorder after the test is carried out. An important
observation is that when the prevalence in the sample is different from the
prevalence in the population, measures such as accuracy and the predictive values
(both positive and negative) should be calculated taking into account the prevalence
in the population (for more details see Brown & Davis, 2006).
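
A brief numerical sketch of these quantities, using purely hypothetical values
rather than data from the study reported later in this article:

```python
# Illustrative sketch of prevalence, pre-test odds and post-test odds; the
# numbers are hypothetical and unrelated to the study reported below.

D_pos, n = 50, 200                 # hypothetical: 50 cases among 200 subjects
positive_lr = 3.0                  # hypothetical positive likelihood ratio

prevalence = D_pos / n                           # proportion of cases: 0.25
pre_test_odds = prevalence / (1 - prevalence)    # odds before testing: 0.33
post_test_odds = pre_test_odds * positive_lr     # odds after a positive result: 1.0

print(round(prevalence, 2), round(pre_test_odds, 2), round(post_test_odds, 2))
```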

The ROC space

ROC graphs are two-dimensional representations of sensitivity (the true
positive rate, plotted on the Y axis) against 1 - specificity (the false positive
rate, plotted on the X axis), corresponding to each possible cut-off point
(classifying value). In other words, they represent the tradeoff between benefits
(true positives) and costs (false positives) (Fawcett, 2006).
In order to be able to interpret the so-called ROC space, we need to
have a reference point in this space. The ROC space is illustrated in Figure 2.

















Figure 2. The ROC space


The lower left point of the graph (0,0) is the value that contains no error
(no false positives) but also does not detect any true positives. The opposite
point, in the upper right corner of the graph (1,1), identifies all true positives, but
with a 100% error rate (a false positive rate of 1). The upper left point (0,1) is the
perfect classification, where all the true positives are identified without any error
(no false positives, i.e. no costs). The lower right point (1,0) is the worst
classification, where all subjects classified as positive are in fact false positives,
with no true positives being identified (Fawcett, 2006).
In order to establish the optimal cut-off point, we have to look at the most
northwestern point in the ROC space (highest TP rate and lowest FP rate).
The diagonal line where the TP rate equals the FP rate (y = x) represents
the performance of a random test: a classifier that guesses at random is expected
to correctly identify half of the positives and half of the negatives. Consequently,
all cut-off points above the random diagonal perform better than random
guessing, and all cut-off points below this diagonal perform worse than random
guessing (Fawcett, 2006).
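
The construction of the ROC space can be illustrated with a short Python sketch
that computes one (1 - specificity, sensitivity) point for every possible cut-off of
a simulated classifier and plots these points against the random diagonal; the two
score distributions are simulated and purely illustrative.

```python
# Sketch of how the ROC space is populated: one (1 - specificity, sensitivity)
# point is computed for every possible cut-off of a classifier Y and plotted
# against the y = x diagonal of a random test. The score samples are simulated.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
scores_pos = rng.normal(60, 10, 50)    # hypothetical scores, disorder present
scores_neg = rng.normal(45, 10, 50)    # hypothetical scores, disorder absent

cutoffs = np.unique(np.concatenate([scores_pos, scores_neg]))
tpr = [(scores_pos >= c).mean() for c in cutoffs]   # sensitivity at each cut-off
fpr = [(scores_neg >= c).mean() for c in cutoffs]   # 1 - specificity at each cut-off

plt.plot(fpr, tpr, marker=".", linestyle="none", label="classifier Y")
plt.plot([0, 1], [0, 1], "--", label="random test (y = x)")
plt.xlabel("1 - Specificity")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()
```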

ROC analysis utility

When is ROC analysis useful? The literature dedicated to this procedure
indicates three main uses: (1) determining the ability of a test to discriminate
between groups; (2) choosing the optimal cut-off point of a test; and (3) comparing
the performance of two or more tests. All of these uses rely on several statistics
that can be derived from the ROC curve (Westin, 2001; Fawcett, 2006; Brown &
Davis, 2006).

Determining the ability of a test to discriminate between groups
Before any statistical analysis, the ROC curve needs to be inspected
visually. The curve of a good test will lie well above the diagonal of the ROC
graph and will tend toward the north-western corner of the graph.
As for the statistical indicators of the ROC curve, the primary
statistic derived from the ROC is the area under the curve (AUC). The total area
under the ROC curve is a measure of the overall performance of a diagnostic test:
the larger the area, the better the performance (Westin, 2001). The area under the
curve corresponding to a test may be compared with the random performance of a
test, which is designated by the diagonal of the graph where y = x. This is in fact an
inference problem, testing the null hypothesis which states that the test performs
randomly in distinguishing the two diagnostic categories. Under the null
hypothesis, the area under the curve is 0.50, namely the area under the diagonal.
That is, if we subtract 0.50 from the area of a test, we obtain the portion of the
test's area that lies above the diagonal of the random test. In testing the difference
in AUC between a diagnostic test and a random test, the values of interest are the
Z statistic and the probability of this value under the null hypothesis (p).
There are a number of ways of calculating the AUC due to the finite
number of points on the ROC curve (possible cut-off points). Thus, we most often
estimate the area designated by these points; the derived solutions are diverse and
do not always lead to the same result. Some of the ways used for the calculation
of AUC are: the trapezoidal rule, the approximation of the curve by fitting the
data to a binormal model with maximum-likelihood estimates, and the use of the
Mann-Whitney U statistic (for more details see Westin, 2001).
The interpretation of the AUC of a test is the following: the AUC is the
probability that the test yields a higher value for a randomly chosen individual
suffering from the disorder than for a randomly chosen individual not suffering
from the disorder (Streiner & Cairney, 2007).
Returning to the utility of the AUC in determining the ability of a test to
discriminate between groups, Streiner and Cairney (2007) show that the
accuracy of tests with an AUC between 0.50 and 0.70 is low, that an AUC between
0.70 and 0.90 indicates moderate accuracy, and that an AUC over 0.90 indicates
high accuracy.
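
On simulated data, two of the AUC estimates mentioned above can be sketched as
follows: the trapezoidal rule applied to the empirical ROC points, and the
Mann-Whitney formulation (the probability that a randomly chosen case scores higher
than a randomly chosen non-case); the score distributions are hypothetical.

```python
# Sketch of two AUC estimates on simulated scores: the trapezoidal rule over
# the empirical ROC points, and the Mann-Whitney formulation (probability that
# a randomly chosen case scores higher than a randomly chosen non-case).

import numpy as np

rng = np.random.default_rng(1)
scores_pos = rng.normal(60, 10, 50)     # hypothetical scores, disorder present
scores_neg = rng.normal(45, 10, 50)     # hypothetical scores, disorder absent

# Empirical ROC points for every observed cut-off, plus the (0, 0) corner
cutoffs = np.unique(np.concatenate([scores_pos, scores_neg]))
tpr = np.array([(scores_pos >= c).mean() for c in cutoffs] + [0.0])
fpr = np.array([(scores_neg >= c).mean() for c in cutoffs] + [0.0])

order = np.argsort(fpr)                 # sort the points by 1 - specificity
x, y = fpr[order], tpr[order]
auc_trapezoid = np.sum((x[1:] - x[:-1]) * (y[1:] + y[:-1]) / 2.0)

# Mann-Whitney estimate: P(case score > non-case score), ties counted as 0.5
diff = scores_pos[:, None] - scores_neg[None, :]
auc_mann_whitney = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(round(auc_trapezoid, 3), round(auc_mann_whitney, 3))   # identical here
```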

Choosing the optimal cut-off point
The ROC analysis can also be used in determining the optimal cut-off
point of a test. As mentioned above, the optimal cut-off point is the most
northwestern point in the ROC space. It is the cut-off point where the proportion
of subjects that are accurately classified is maximal (the cut-off point with both a
high sensitivity and a high specificity). In other words, as a rule, the optimal
cut-off point is the one which maximizes TP + TN (or, equivalently, minimizes FP + FN).
However, this principle is based on the assumption that the cost of making a false
positive mistake is equal to the cost of making a false negative mistake. In real
life, these costs are rarely equivalent. For example, the costs of a false positive
mistake in the case of a child suffering from ADHD (e.g. unjustified drugs
administration) may not be equivalent to a false negative mistake (delayed
intervention) (Streiner & Cairney, 2007).
Other studies use a specificity of at least 95% as the criterion, even with a
decreased sensitivity (Westin, 2001), or, for screening, a sensitivity of at least 80%
even with a decreased specificity (Sharifi et al., 2008).
Taking into account this imbalance of costs between FP and FN mistakes,
Westin (2001) concludes that the optimal cut-off point depends on what the test
will be used for.
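
As an illustration of these selection rules, the sketch below finds, on hypothetical
data, the cut-off that maximizes sensitivity + specificity (Youden's J, which is
equivalent to maximizing TP + TN when the two groups are of equal size), as well as
the best cut-off under a minimum-specificity constraint.

```python
# Sketch of the cut-off selection rules discussed above, on hypothetical data.

import numpy as np

rng = np.random.default_rng(2)
scores_pos = rng.normal(60, 10, 50)     # hypothetical clinical group
scores_neg = rng.normal(45, 10, 50)     # hypothetical non-clinical group

cutoffs = np.unique(np.concatenate([scores_pos, scores_neg]))
sens = np.array([(scores_pos >= c).mean() for c in cutoffs])
spec = np.array([(scores_neg < c).mean() for c in cutoffs])

# Rule 1: maximize sensitivity + specificity (Youden's J)
j = sens + spec - 1
best = int(np.argmax(j))
print(f"Youden optimum: cut-off >= {cutoffs[best]:.1f}, "
      f"Se = {sens[best]:.2f}, Sp = {spec[best]:.2f}")

# Rule 2: best sensitivity among cut-offs with a specificity of at least .95
ok = spec >= 0.95
best95 = int(np.argmax(np.where(ok, sens, -1.0)))
print(f"Sp >= .95 optimum: cut-off >= {cutoffs[best95]:.1f}, "
      f"Se = {sens[best95]:.2f}, Sp = {spec[best95]:.2f}")
```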

Comparing the performance of two or more tests
ROC curves are also used when comparing the average performance of
two or more tests. When the two compared curves do not intersect one another, an
analysis can be made simply by comparing the ROC curves corresponding to the
two tests (Streiner & Cairney, 2007). In such cases it is essential to also consider
the correlation between the two areas (the correlation between the two sets of
scores) in order to reduce the standard error and to increase the power of the
comparison (the statistical power of the comparison test). It must be noted that a
difference which is not significant does not indicate the equivalence of the two
tests.
If the ROC curves intersect, we use the partial AUC in order to
compare them. Instead of comparing the AUC over the entire range of the tests, we
concentrate on the area of the curves between specific values of sensitivity or
specificity, or simply on a specific value (e.g. a specificity equal to 0.4, or a
sensitivity equal to 0.6). The test with the higher curve at a specific value is more
useful. Similarly, we can compare two tests at their optimum cut-off points (in
this case we can compare their specificities and sensitivities independently, or we
can compare them globally, considering the accuracy, which incorporates both
indicators). In the figure below, Test B has a better average performance than Test
A; the difference in performance is confirmed by the difference between the
areas. Moreover, if we compare the performance of these tests at a specificity
value of 0.4 (1-specificity = 0.6), we note that test A has a higher curve, which
means a better performance (see point X in Figure 3).




Figure 3. Comparing the performance of two tests



If we compare the performance of test A and test B between specific
values of specificity (1-specificity between 0.6 and 1.00), we notice that test A
has a better performance, even if its average performance (the entire AUC) is
poorer than the performance of test B.
To conclude, using ROC curve analysis for two or more tests, we can
compare their average performance, their performance over certain intervals of
specificity or sensitivity, their performance at discrete values of specificity or
sensitivity, or their performance at the optimum cut-off point of each test.
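
The sketch below illustrates these comparisons on two hypothetical tests, A and B:
the overall AUC, the sensitivity at one fixed value of 1 - specificity, and a partial
AUC over a chosen interval of 1 - specificity. It does not implement the significance
test for correlated AUCs (e.g. DeLong's method) that dedicated packages such as
MedCalc provide.

```python
# Sketch of the comparisons above for two hypothetical tests A and B: overall
# AUC, sensitivity at a fixed value of 1 - specificity, and a partial AUC over
# an interval of 1 - specificity. All data and names are illustrative.

import numpy as np

def roc_points(pos, neg):
    """Empirical (1 - specificity, sensitivity) points over all cut-offs."""
    cuts = np.unique(np.concatenate([pos, neg]))
    tpr = np.array([(pos >= c).mean() for c in cuts] + [0.0])
    fpr = np.array([(neg >= c).mean() for c in cuts] + [0.0])
    order = np.argsort(fpr)
    return fpr[order], tpr[order]

def area(fpr, tpr, lo=0.0, hi=1.0):
    """Trapezoidal (partial) area between 1 - specificity = lo and hi."""
    grid = np.linspace(lo, hi, 501)
    sens = np.interp(grid, fpr, tpr)    # interpolate the curve on a fine grid
    return float(np.sum(np.diff(grid) * (sens[1:] + sens[:-1]) / 2.0))

rng = np.random.default_rng(3)
pos_a, neg_a = rng.normal(56, 12, 50), rng.normal(45, 12, 50)   # hypothetical test A
pos_b, neg_b = rng.normal(62, 10, 50), rng.normal(45, 10, 50)   # hypothetical test B

fa, ta = roc_points(pos_a, neg_a)
fb, tb = roc_points(pos_b, neg_b)

print("overall AUC:          A =", round(area(fa, ta), 3), "B =", round(area(fb, tb), 3))
print("Se at 1 - Sp = 0.6:   A =", round(float(np.interp(0.6, fa, ta)), 2),
      "B =", round(float(np.interp(0.6, fb, tb)), 2))
print("partial AUC [0.6, 1]: A =", round(area(fa, ta, 0.6, 1.0), 3),
      "B =", round(area(fb, tb, 0.6, 1.0), 3))
```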

Computer programs used for ROC analysis

In recent years, several computer programs have been used to perform
ROC analysis. Among these, the best known are: AccuROC, Analyse-it, CMDT,
GraphROC, MedCalc, mROC, ROCKIT, SPSS, STATA and SAS.
In a comparative analysis of eight computer programs, Stephan et al.
(2003) highlighted the advantages and drawbacks of each program. The
authors considered the following criteria in their analysis: ease of use,
mathematical correctness, final output, and compatibility with other graphics
programs. The results of their analysis show that only Analyse-it, AccuROC,
and MedCalc exhibit good performance, while only GraphROC can compare
curves at a certain sensitivity or specificity cut-off point. The authors'
conclusion was that adequate ROC analysis and ROC plotting cannot be
performed with a single program; they therefore recommend the use of Analyse-it,
AccuROC, and MedCalc, despite certain limitations.
As far as the well-known SPSS is concerned, the authors note that ROC
analysis within this package is not yet fully developed. For example, SPSS
does not allow the comparison of ROC curves. Although it offers a wide range of
other statistics, a valid ROC analysis cannot be performed with this software alone
(for more details, see Stephan et al., 2003). It should be mentioned that these
conclusions refer to 2003 and that the statistical packages have seen multiple
developments regarding ROC analysis since then; moreover, the STATA and SAS
programs were not included in that review. An up-to-date review is needed.

ROC ANALYSIS: AN APPLICATION IN CLINICAL PSYCHOLOGY

Brief description of the study
Mental health is one of the most important health issues worldwide.
Research suggests that overall rates of mental illnesses are rising and that there is
a trend toward earlier onset of various mental illnesses (e.g., depression);
additionally, the disability, mortality, and human and financial costs
associated with mental illness are currently at unprecedented levels (Avenevoli et
al., 2008). Major depressive disorder (MDD) is one of the leading causes of
emotional distress and disability in both adults and youth (WHO, 2000). MDD is a
serious condition characterized by one or more major depressive episodes.
Symptoms of depression may differ at various ages and stages of development,
and may vary for different ethnic groups. Generally, however, according to the
Diagnostic and Statistical Manual of Mental Disorders (DSM-IV-TR, American
Psychiatric Association, 2000) depressed children and adolescents show a
significant mood change (depressed, sad, irritable, and indecisive), they lose
interest in activities that used to please them, have problems concentrating, and
may lack energy or motivation; they may neglect their appearance and hygiene
and they criticize themselves and feel that others criticize them. They may also
indicate that they are feeling worthless or hopeless about the future, and thoughts
of suicide may be present.
A variety of methods are used to assess mental health and illness, and
these methods have considerable impact on diagnosis and treatment. It has often
been suggested that reliance on clinical judgment alone, rather than on statistical
and standardized measures, leads to less accurate diagnosis (Dawes et al., 1989;
Lowe et al., 2004). Given that MDD is often overlooked, misdiagnosed, and therefore
mistreated, it is worth exploring if there are additional tools which can quickly
and accurately assess specific symptoms and etiopathogenetic mechanisms in
order to improve diagnosis and treatment in these patients.
Objective: The objectives of the current study are: (1) to test if there is a
significant difference between the two scales regarding their mean diagnostic
performance; (2) to establish the optimal cut-off point of each measure for the
sample investigated; (3) to evaluate the diagnostic performance for each
investigated measure; and (4) to compare their sensitivity and specificity at their
established cut-off points.
Sample: Data were pooled from participants in two separate studies.
Inclusion criteria required that patients (children and adolescents aged from 12 to
18) meet criteria for a current diagnosis of MDD as described in the Diagnostic
and Statistical Manual of Mental Disorders (4th edition, text revision; DSM-IV-TR,
American Psychiatric Association, 2000). Diagnoses were determined by
means of the Structured Clinical Interview for DSM-IV (KID SCID; Hien et al.,
1994, 1998, 2004). Exclusion criteria included a number of concurrent psychiatric
disorders, current substance abuse, mental retardation, organic brain syndrome;
we also excluded participants who were in some concurrent form of
psychotherapy, who were receiving psychotropic medication or who needed to be
hospitalized because of imminent suicidal risk. Patients were recruited by local
advertisements and by referrals from clinics within the Pediatric Psychiatry Clinic
in Cluj-Napoca from March 2007 until June 2008. The clinical sample consisted
of 50 patients. Fifty volunteer adolescents from several high schools in Cluj were
also included as a non-clinical comparison group.
Measures: All subjects filled in self-report measures that evaluate the
symptoms and the etiopathogenetic mechanisms involved in depression.





The Beck Depression Inventory (BDI; Beck et al., 1979) is a 21-item self-
report inventory measuring current symptoms of depression (e.g., sadness,
fatigue, social withdrawal, irritability, hopelessness, etc.). A value of 0 to 3 is
assigned to each answer and the total score is computed by summing up the
individual item values. Higher total scores indicate more severe depressive
symptoms. The Romanian version of the BDI has very good psychometric properties
and has proven sensitive in screening and in the assessment of clinical change.
The Automatic Thoughts Questionnaire (ATQ; Hollon & Kendall, 1980) is a
15-item questionnaire assessing the frequency of negative thoughts experienced
by depressed individuals. All items consist of different self-related automatic
thoughts (e.g., "I am worthless", "The future is dark", "I feel helpless"), frequently
identified in patients with MDD. Such sustained, inaccurate and intrusive
negative automatic thoughts about the self, the world and the future are
hypothesized to cause depression, rather than be generated by depression. Each of
the 15 items is rated on a scale from 1 to 5. The total score is computed by
summing up the individual item values. The higher the score, the higher the
frequency of negative automatic thoughts. Scores on the original scale and the
Romanian version indicate high internal consistency and concurrent validity, and
also differentiate depressed from non-depressed groups (Netemeyer et al., 2002).
Statistical analysis: ROC curve analysis was performed for both scales
using the MedCalc and SPSS software packages. We first tested whether the AUC of
each scale is significantly different from the AUC corresponding to a random test.
We then determined their specificity and sensitivity in order to establish their cut-off
points, and we checked whether there are significant differences between their
AUCs. Finally, we compared their sensitivity and specificity at their previously
established cut-off points.
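
The analyses reported below were run in MedCalc and SPSS; purely as an
illustration of the steps involved, a minimal Python sketch of the same type of
analysis (AUC, curve coordinates, and a Youden-type cut-off), applied to simulated
rather than the actual study data, might look as follows.

```python
# The analyses reported here were run in MedCalc and SPSS; this sketch only
# mirrors the steps in Python. The arrays below are simulated placeholders,
# not the study data.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(4)
diagnosis = np.r_[np.ones(50, dtype=int), np.zeros(50, dtype=int)]  # 1 = MDD, 0 = control
atq = np.r_[rng.normal(45, 8, 50), rng.normal(33, 8, 50)]           # hypothetical ATQ totals
bdi = np.r_[rng.normal(30, 6, 50), rng.normal(12, 6, 50)]           # hypothetical BDI totals

for name, score in [("ATQ", atq), ("BDI", bdi)]:
    auc = roc_auc_score(diagnosis, score)             # AUC of the scale
    fpr, tpr, thresholds = roc_curve(diagnosis, score)
    best = int(np.argmax(tpr - fpr))                  # cut-off maximizing Se + Sp
    print(f"{name}: AUC = {auc:.3f}, cut-off >= {thresholds[best]:.1f}, "
          f"Se = {tpr[best]:.2f}, Sp = {1 - fpr[best]:.2f}")
```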

Results

In analyzing our results, we first checked whether the AUC of each of the two
scales is significantly different from the area under the diagonal corresponding to
a random test. Table 2 presents the results of this analysis for the ATQ scale.

Table 2. The AUC for ATQ scale

Area under the ROC curve (AUC) 0.909
Standard error 0.0307
95% Confidence interval 0.835 to 0.957
z statistic 13.338
Significance level P (Area=0.5) 0.0001

As Table 2 shows, the Z test performed with MedCalc indicates a
significant difference from the random area, with a probability of error smaller
than 1% (Z=13.33, p<.01). The same table shows an AUC for the ATQ scale of
.90, which, according to Streiner and Cairney (2007), indicates a high
discriminative capacity of the ATQ scale.
In Table 3 we present the results of the same analysis for the BDI scale.

Table 3. The AUC for BDI scale

Area under the ROC curve (AUC) 0.996
Standard error 0.00655
95% Confidence interval 0.955 to 0.996
z statistic 75.682
Significance level P (Area=0.5) 0.0001


Table 3 shows that the BDI is also significantly different from the random
area, with a probability of error smaller than 1% (Z=75.68, p<.01). Our data also
show an AUC of the BDI of .99, which, according to Streiner and Cairney (2007),
indicates a high accuracy of the scale.
Our results so far indicate that the performance of both ATQ and BDI
significantly differs from the performance of a random test and that both have a
high accuracy in identifying participants with depression.
In order to establish the optimal cut-off point of each scale, we analyzed
both sensitivity and specificity at each possible cut-off point, for both scales.
Table 4 presents the analysis performed for the ATQ.
As the results in Table 4 indicate, the best performance of the ATQ in
discriminating between depressed and non-depressed participants is reached at
the cut-off point of 34 (sensitivity=94, specificity=70).
The same analysis was performed for the BDI. Results are presented in
Table 5.






Table 4. Criterion values and coordinates of the ROC curve for ATQ

Criterion   Sensitivity   95% CI (Sensitivity)   Specificity   95% CI (Specificity)   +LR    -LR
>=15 100.00 92.8 - 100.0 0.00 0.0 - 7.2 1.00
>15 100.00 92.8 - 100.0 10.00 3.4 - 21.8 1.11 0.00
>16 100.00 92.8 - 100.0 14.00 5.8 - 26.7 1.16 0.00
>18 100.00 92.8 - 100.0 16.00 7.2 - 29.1 1.19 0.00
>20 100.00 92.8 - 100.0 22.00 11.5 - 36.0 1.28 0.00
>21 100.00 92.8 - 100.0 24.00 13.1 - 38.2 1.32 0.00
>23 100.00 92.8 - 100.0 32.00 19.5 - 46.7 1.47 0.00
>24 100.00 92.8 - 100.0 42.00 28.2 - 56.8 1.72 0.00
>25 100.00 92.8 - 100.0 48.00 33.7 - 62.6 1.92 0.00
>26 100.00 92.8 - 100.0 50.00 35.5 - 64.5 2.00 0.00
>27 98.00 89.3 - 99.7 58.00 43.2 - 71.8 2.33 0.034
>28 98.00 89.3 - 99.7 60.00 45.2 - 73.6 2.45 0.033
>29 96.00 86.3 - 99.4 62.00 47.2 - 75.3 2.53 0.065
>31 96.00 86.3 - 99.4 66.00 51.2 - 78.8 2.82 0.061
>33 94.00 83.4 - 98.7 68.00 53.3 - 80.5 2.94 0.088
>34 * 94.00 83.4 - 98.7 70.00 55.4 - 82.1 3.13 0.086
>35 90.00 78.2 - 96.6 70.00 55.4 - 82.1 3.00 0.14
>36 82.00 68.6 - 91.4 78.00 64.0 - 88.5 3.73 0.23
>37 72.00 57.5 - 83.8 84.00 70.9 - 92.8 4.50 0.33
>38 72.00 57.5 - 83.8 86.00 73.3 - 94.2 5.14 0.33
>39 66.00 51.2 - 78.8 90.00 78.2 - 96.6 6.60 0.38
>40 60.00 45.2 - 73.6 94.00 83.4 - 98.7 10.00 0.43
>41 56.00 41.3 - 70.0 96.00 86.3 - 99.4 14.00 0.46
>42 56.00 41.3 - 70.0 98.00 89.3 - 99.7 28.00 0.45
>44 54.00 39.3 - 68.2 98.00 89.3 - 99.7 27.00 0.47
>45 42.00 28.2 - 56.8 100.00 92.8 - 100.0 0.58
>46 36.00 22.9 - 50.8 100.00 92.8 - 100.0 0.64
>48 34.00 21.2 - 48.8 100.00 92.8 - 100.0 0.66
>49 32.00 19.5 - 46.7 100.00 92.8 - 100.0 0.68
>50 28.00 16.2 - 42.5 100.00 92.8 - 100.0 0.72
>51 26.00 14.6 - 40.3 100.00 92.8 - 100.0 0.74
>52 24.00 13.1 - 38.2 100.00 92.8 - 100.0 0.76
>54 22.00 11.5 - 36.0 100.00 92.8 - 100.0 0.78
>57 18.00 8.6 - 31.4 100.00 92.8 - 100.0 0.82
>58 14.00 5.8 - 26.7 100.00 92.8 - 100.0 0.86
>59 12.00 4.6 - 24.3 100.00 92.8 - 100.0 0.88
>63 8.00 2.3 - 19.3 100.00 92.8 - 100.0 0.92
>64 4.00 0.6 - 13.7 100.00 92.8 - 100.0 0.96
>65 2.00 0.3 - 10.7 100.00 92.8 - 100.0 0.98
>68 0.00 0.0 - 7.2 100.00 92.8 - 100.0 1.00





Table 5. Criterion values and coordinates of the ROC curve for BDI scale

Criterion   Sensitivity   95% CI (Sensitivity)   Specificity   95% CI (Specificity)   +LR    -LR
>=0 100.00 92.8 - 100.0 0.00 0.0 - 7.2 1.00
>0 100.00 92.8 - 100.0 12.00 4.6 - 24.3 1.14 0.00
>1 100.00 92.8 - 100.0 20.00 10.0 - 33.7 1.25 0.00
>2 100.00 92.8 - 100.0 28.00 16.2 - 42.5 1.39 0.00
>3 100.00 92.8 - 100.0 40.00 26.4 - 54.8 1.67 0.00
>4 100.00 92.8 - 100.0 42.00 28.2 - 56.8 1.72 0.00
>5 100.00 92.8 - 100.0 56.00 41.3 - 70.0 2.27 0.00
>6 100.00 92.8 - 100.0 60.00 45.2 - 73.6 2.50 0.00
>8 100.00 92.8 - 100.0 64.00 49.2 - 77.1 2.78 0.00
>9 100.00 92.8 - 100.0 70.00 55.4 - 82.1 3.33 0.00
>10 100.00 92.8 - 100.0 82.00 68.6 - 91.4 5.56 0.00
>14 100.00 92.8 - 100.0 84.00 70.9 - 92.8 6.25 0.00
>15 100.00 92.8 - 100.0 88.00 75.7 - 95.4 8.33 0.00
>17 100.00 92.8 - 100.0 90.00 78.2 - 96.6 10.00 0.00
>18 100.00 92.8 - 100.0 94.00 83.4 - 98.7 16.67 0.00
>21 * 100.00 92.8 - 100.0 96.00 86.3 - 99.4 25.00 0.00
>22 92.00 80.7 - 97.7 96.00 86.3 - 99.4 23.00 0.083
>23 90.00 78.2 - 96.6 98.00 89.3 - 99.7 45.00 0.10
>24 86.00 73.3 - 94.2 100.00 92.8 - 100.0 0.14
>25 84.00 70.9 - 92.8 100.00 92.8 - 100.0 0.16
>26 82.00 68.6 - 91.4 100.00 92.8 - 100.0 0.18
>28 76.00 61.8 - 86.9 100.00 92.8 - 100.0 0.24
>29 72.00 57.5 - 83.8 100.00 92.8 - 100.0 0.28
>31 66.00 51.2 - 78.8 100.00 92.8 - 100.0 0.34
>32 50.00 35.5 - 64.5 100.00 92.8 - 100.0 0.50
>33 46.00 31.8 - 60.7 100.00 92.8 - 100.0 0.54
>34 40.00 26.4 - 54.8 100.00 92.8 - 100.0 0.60
>35 36.00 22.9 - 50.8 100.00 92.8 - 100.0 0.64
>36 22.00 11.5 - 36.0 100.00 92.8 - 100.0 0.78
>37 20.00 10.0 - 33.7 100.00 92.8 - 100.0 0.80
>38 10.00 3.4 - 21.8 100.00 92.8 - 100.0 0.90
>39 6.00 1.3 - 16.6 100.00 92.8 - 100.0 0.94
>40 2.00 0.3 - 10.7 100.00 92.8 - 100.0 0.98
>41 0.00 0.0 - 7.2 100.00 92.8 - 100.0 1.00

Table 5 indicates that the best discriminating performance of the BDI is
reached at the cut-off point of 21 (sensitivity=100, specificity=96).
In order to test if there is a significant difference between the two scales
regarding their mean diagnostic performance, we also used MedCalc and
compared their AUCs (Table 6).






Table 6. Pairwise comparison of ATQ and BDI AUCs

ATQ ~ BDI
Difference between areas 0.0868
Standard error 0.0299
95% Confidence interval 0.0283 to 0.145
z statistic 2.907
Significance level P = 0.004

As the results in Table 6 indicate, there is a significant difference in the
overall diagnostic performance of the ATQ and the BDI (z=2.90, p<.01), in favor of
the BDI scale (AUC for BDI=.99 vs. AUC for ATQ=.90). A visual inspection of
both curves in Figure 4 suggests that there is a difference in diagnostic performance
in favor of the BDI at all sensitivity or specificity levels, and over all intervals of
these parameters.
















Figure 4. The ROC curves of ATQ and BDI


Finally, if we look at Tables 4 and 5, at the accuracy of both scales at their
optimum cut-off points, we can conclude that the BDI has a better overall
accuracy, as well as better sensitivity (100 vs. 94) and specificity (96 vs. 70).
To conclude, our analysis underlines the superiority of the BDI compared
to the ATQ at all levels of comparison: the overall ability to discriminate between
the two categories (depression vs. non-depression), the ability to discriminate
between the two categories at specific levels or intervals of sensitivity or
specificity, and the performance at their optimum cut-off points.





DISCUSSION AND CONCLUSIONS

Our work advocates the use of ROC analysis on a larger scale in clinical
psychology. As we have shown in the section dedicated to
ROC fundamentals, this procedure allows a rigorous analysis of the diagnostic
performance of tests (instruments) frequently used in clinical psychology. Based
on these results, we can establish, for example, if the clinical instruments have a
good overall performance and select an optimal cut-off point which best
discriminates between categories of people with and without a certain disorder.
ROC analysis also allows comparing two or more tests regarding their overall
diagnostic performance, or their performance at certain discrete values or
intervals of sensitivity or specificity. We must emphasize that, before performing
ROC analysis, valid measures are needed to identify the two categories
investigated. In our case, depression was diagnosed using the Structured Clinical
Interview for DSM-IV (KID SCID; Hien et al., 1994, 1998, 2004); the BDI and
ATQ were used after a valid method of diagnosing depression was employed.
The example of ROC analysis used in this article illustrates how this
procedure works in a real clinical research situation, and consequently, our results
must be interpreted in this particular context. However, our findings replicate
previous studies on depression. Consistent with the literature, BDI and ATQ
scores discriminate well between adolescents suffering from depression and
adolescents not suffering from depression.
The superiority of the BDI in discriminating depressed from non-
depressed participants was expected and is fairly justifiable. High BDI scores
indicate the presence of depressive symptoms while high ATQ scores indicate the
presence of mechanisms that are likely to lead to depression. Therefore, the better
sensitivity and specificity of the BDI in discriminating depressed from non-
depressed participants is rather intuitive: having frequent automatic thoughts does
not necessarily make us depressed, while having severe depressive symptoms will
most probably place us in the depressed category.
Among the limitations of our study, several are worth mentioning:
the patients included may not be representative of the general depressed
population or of depressed patients entering treatment; similarly, data pooled from
volunteer adolescents may not be representative of the non-depressed population
in general.
However, despite these limitations, our application can also be regarded
as innovative, considering that, to our knowledge, this is the first study to analyze
BDI and ATQ accuracy (both sensitivity and specificity) in discriminating
depressed from non-depressed patients using ROC analysis.
To conclude, as we have previously shown, the principles behind ROC
analysis are easy to understand, and the existence of software packages performing
the procedure makes it attractive and easy to use, with relevant benefits for
clinical research.






REFERENCES

American Psychiatric Association. (2000). Diagnostic and Statistical Manual of Mental
Disorders (4
th
edition, text revision). Washington, DC: APA.
Avenevoli, S., Knight, E., Kessler, R. C., & Merikangas, K. R. (2008). Epidemiology of
depression in children and adolescents. In Abela, J.R.Z., & Hankin, B.L. (Eds.).
Handbook of depression in children and adolescents. New York: The Guilford
Press.
Ballesteros, J., Bobes, J., Bulbena, A., Luque, A., Dal-Re, R., Ibarra, N., & Guemes, I.
(2007). Sensitivity to change, discriminative performance, and cutoff criteria to
define remission for embedded short scales of the Hamilton depression rating scale
(HAMD). Journal of Affective Disorders, 102, 93-99.
Beck, A. T., Rush, A. J., Shaw, B. F., & Emery, G. (1979). Cognitive therapy of
depression. New York: Guilford Press.
Benazzi, F. (2008). Defining mixed depression. Progress in Neuro-Psychopharmacology
and Biological Psychiatry, 32, 932-939.
Brown, C. D., & Davis, H. T. (2006). Receiver operating characteristics curves and
related decision measures: A tutorial. Chemometrics and Intelligent Systems, 80,
24-38.
Chiu, Y. C., Li, C. L., Lin, K. N., Chiu, Y. F., & Liu, H. C. (2008). Sensitivity and
specificity of the clock drawing test, incorporating Rouleau scoring system, as a
screening instrument for questionable and mild dementia: Scale development.
International Journal of Nursing Studies, 45, 75-84.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment.
Science, 243, 1668-1674.
Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27,
861-874.
Giaquinto, S., & Parnetti, L. (2006). Early detection of dementia in clinical practice.
Mechanisms of Ageing and Development, 127, 123-128.
Hien, D., Matzner, F., First, M. B., Spitzer, R. L., Williams, J. B. W., & Gibbon, M.
(1994, 1998, 2004). Interviu clinic structurat pentru tulburările clinice ale
sugarului, copilului și adolescentului [Structured clinical interview for the clinical
disorders of infants, children and adolescents]. Romanian adaptation: David, D.
(coordinator). Cluj-Napoca: Editura RTS.
Horwitz, J. E., Lynch, J. K., McCaffrey, R. J., & Fisher, J. M. (2008). Screening for
neuropsychological impairment using Reitan and Wolfson's preliminary
neuropsychological test battery. Archives of Clinical Neuropsychology, 23, 393-398.
Ivarsson, T., & Larsson, B. (2008). The Obsessive-Compulsive Symptom (OCS) scale of
the Child Behavior Checklist: A comparison between Swedish children with
Obsessive-Compulsive Disorder from a specialized unit, regular outpatients and a
school sample. Journal of Anxiety Disorders, in press.
Jokinen, J., Nordstrom, A. L., & Nordstrom, P. (2008). ROC analysis of dexamethasone
suppression test threshold in suicide prediction after attempted suicide. Journal of
Affective Disorders, 106, 145-152.
Kendler, K. S. (1995). Genetic epidemiology in psychiatry. Taking both genes and
environment seriously. Archives of General Psychiatry, 52, 895-899.
Kills Small, N. J., Simons, J. S., & Stricherz, M. (2007). Assessing criterion validity of the
Simple Screening Instrument for Alcohol and Other Drug Abuse (SSI-AOD) in a
college population. Addictive Behaviors, 32, 2425-2431.





Lowe, B., Spitzer, R. L., Grafe, K., Kroenke, K., Quenter, A., & Zipfel, S. (2004).
Comparative validity of three screening questionnaires for DSM-IV depressive
disorders and physicians' diagnoses. Journal of Affective Disorders, 78, 131-140.
McFall, R. M., & Treat, T. A. (1999). Quantifying the information value of clinical
assessment with Signal Detection Theory. Annual Review of Psychology, 50, 215-
241.
Netemeyer, R. G., Williamson, D. A., Burton, S., Biswas, D., Jindal, S., Landreth, S.,
Mills, G., & Primeaux, S. (2002). Psychometric properties of shortened versions of
the Automatic Thoughts Questionnaire. Educational and Psychological
Measurement, 62, 111-129.
O'Brien, A., Gaudino-Goering, E., Shawaryn, M., Komaroff, E., Moore, N. B., & DeLuca,
J. (2007). Relationship of the Multiple Sclerosis Neuropsychological Questionnaire
(MSNQ) to functional, emotional, and neuropsychological outcomes. Archives of
Clinical Neuropsychology, 22, 933-948.
Ong, J. C., Kuo, T. F., & Manber, R. (2008). Who is at risk for dropout from group
cognitive-behavior therapy for insomnia? Journal of Psychosomatic Research, 64,
419-425.
Parker, G., Fletcher, K., Barrett, M., Synnott, H., Breakspear, M., Hyett, M., & Hadzi-
Pavlovic, D. (2008). Screening for bipolar disorder: The utility and comparative
properties of the MSS and MDQ measures. Journal of Affective Disorders, 109, 83-
89.
Serrano-Duenas, M., & Serrano, M. S. (2008). Concurrent validation of the 21-item and 6-
item Hamilton Depression Rating Scale versus the DSM-IV diagnostic criteria to
assess depression in patients with Parkinson's disease: An exploratory analysis.
Parkinsonism and Related Disorders, 14, 233-238.
Shapiro, E. S., Solari, E., & Petscher, Y. (2008). Use of a measure of reading
comprehension to enhance prediction on the state high stakes assessment. Learning
and Individual Differences, In Press, Uncorrected Proof, Available online 26
March 2008
Sharifi, F., Mousavinasab, N., Mazloomzadeh, S., Jaberi, Y., Saeini, M., Dinmohammadi,
M., & Angomshoaa, A. (2008). Cutoff point of waist circumference for the
diagnosis of metabolic syndrome in an Iranian population. Obesity Research &
Clinical Practice, In Press, Corrected Proof, Available online 23 May 2008
Stafford, L., Berk, M., & Jackson, H. J. (2007). Validity of the Hospital Anxiety and
Depression Scale and Patient Health Questionnaire-9 to screen for depression in
patients with coronary artery disease. General Hospital Psychiatry, 29, 417-424.
Stephan, C., Wesseling, S., Schink, T., & Jung, K. (2003). Comparison of Eight Computer
Programs for Receiver-Operating Characteristic Analysis. Clinical Chemistry, 49,
433-439.
Streiner, D. L., & Cairney, J. (2007). What's under the ROC? An introduction to Receiver
Operating Characteristics Curves. The Canadian Journal of Psychiatry, 52, 121-
128.
Walsh, T. L., Homa, K., Hanscom, B., Lourie, J., Sepulveda, M. G., & Abdu, W. (2006).
Screening for depressive symptoms in patients with chronic spinal pain using the
SF-36 Health Survey. The Spine Journal, 6, 316-320.
Westin, L. (2001). Receiver operating characteristic (ROC) analysis: Evaluating
discriminance effects among decision support systems. Retrieved May 16, 2008, from
http://www.cs.umu.se/research/reports/2001/018/part1.pdf
