Information and Computer Science Department, King Fahd University of Petroleum and Minerals, P.O. Box 1082, Dhahran 31261, Saudi Arabia
Received 26 February 2007; received in revised form 28 May 2007; accepted 27 July 2007
Available online 5 October 2007
Abstract
Effective prediction of defect-prone software modules can enable software developers to focus quality assurance activities and allocate effort and resources more efficiently. Support vector machines (SVM) have been successfully applied for solving both classification and regression problems in many applications. This paper evaluates the capability of SVM in predicting defect-prone software modules and compares its prediction performance against eight statistical and machine learning models in the context of four NASA datasets. The results indicate that the prediction performance of SVM is generally better than, or at least competitive with, the compared models.
© 2007 Elsevier Inc. All rights reserved.
Keywords: Software metrics; Defect-prone modules; Support vector machines; Predictive models
1. Introduction
Studies have shown that the majority of defects are often found in only a few software modules (Fenton and Ohlsson, 2000; Koru and Tian, 2003). Such defective software modules may cause software failures, increase development and maintenance costs, and decrease customer satisfaction (Koru and Liu, 2005). Accordingly, effective prediction of defect-prone software modules can enable software developers to focus quality assurance activities and allocate effort and resources more efficiently. This in turn can lead to a substantial improvement in software quality (Koru and Tian, 2003).
Identification of defect-prone software modules is commonly achieved through binary prediction models that classify a module into either a defective or a not-defective category. These prediction models almost always utilize static product metrics, which have been associated with defects, as independent variables (Emam et al., 2001). Recently, support vector machines (SVM) have been introduced as
an effective model in both the machine learning and data mining communities for solving both classification and regression problems (Gun, 1998). It is therefore motivating to investigate the capability of SVM in software quality prediction.
The objective of this paper is to evaluate the capability of SVM in predicting defect-prone software modules (functions in procedural software and methods in object-oriented software) and compare its prediction performance against eight well-known statistical and machine learning models in the context of four NASA datasets. The compared models are two statistical classifier techniques: (i) Logistic Regression (LR) and (ii) K-Nearest Neighbor (KNN); two neural network techniques: (i) Multi-layer Perceptrons (MLP) and (ii) Radial Basis Function (RBF); two Bayesian techniques: (i) Bayesian Belief Networks (BBN) and (ii) Naïve Bayes (NB); and two tree-structured classifier techniques: (i) Random Forests (RF) and (ii) Decision Trees (DT). For more details on these techniques see (Han and Kamber, 2001; Hosmer and Lemeshow, 2000; Duda et al., 2001; Webb, 2002; Breiman, 2001).
The rest of this paper is organized as follows. Section 2
reviews related work. Section 3 provides an overview of
SVM. Section 4 discusses the conducted empirical evaluation and its results. Section 5 concludes the paper and outlines directions for future work.
2. Related work
A wide range of statistical and machine learning models have been developed and applied to predict defects in software. Basili et al. (1996) investigated the impact of the suite of object-oriented design metrics introduced by Chidamber and Kemerer (1994) on the prediction of fault-prone classes using logistic regression. Guo et al. (2004) proposed a random forest technique to predict the fault-proneness of software systems. They applied this technique on NASA software defect datasets and compared the proposed methodology with several machine learning and statistical methods. They found that the prediction accuracy of random forests is generally higher than that of the other methods. Khoshgoftaar et al. (1997) investigated the use of neural networks as a model for predicting software quality. They used a large telecommunication system to classify modules as fault-prone or not fault-prone, compared the neural network model with a non-parametric discriminant model, and found that the neural network model had better predictive accuracy. Khoshgoftaar et al. (2002) applied regression trees with a classification rule to classify fault-prone software modules using a very large telecommunications system as a case study. Fenton et al. (2002) proposed Bayesian Belief Networks for software defect prediction. However, the limitations of Bayesian Belief Networks have been recognized (Weaver, 2003; Ma et al., 2006).
Several other techniques have been developed and applied to software quality prediction. These techniques include: discriminant analysis (Munson and Khoshgoftaar, 1992; Khoshgoftaar et al., 1996), the discriminative power techniques (Schneidewind, 1992), optimized set reduction (Briand et al., 1993), genetic algorithms (Azar et al., 2002), classification trees (Selby and Porter, 1988), case-based reasoning (Emam et al., 2001; Mair et al., 2000; Shepperd and Kadoda, 2001), and Dempster–Shafer Belief Networks (Guo et al., 2003).
Recently, SVM has been applied successfully in many applications, for example in the field of optical character recognition (Burges, 1998; Cristianini and Shawe-Taylor, 2000), text categorization (Dumais, 1998), face detection in images (Osuna et al., 1997; Cristianini and Shawe-Taylor, 2000), speaker identification (Schmidt and Gish, 1996), spam categorization (Drucker et al., 1999), intrusion detection (Chen et al., 2005), cheminformatics and bioinformatics (Cai et al., 2002; Burbidge et al., 2001; Morris et al., 2001; Lin et al., 2003; Bao and Sun, 2002), and financial time series forecasting (Tay and Cao, 2001).
3. Support vector machines

3.1. Two-class: linear support vector machines

Support vector machines (SVM) are a kernel-based learning algorithm introduced by Vapnik (Vapnik, 1995) using the structural risk minimization principle. Consider two linearly separable classes A and B, labeled $y_i = +1$ and $y_i = -1$ respectively. A separating hyperplane $w^T x + b = 0$ can be scaled so that

$w^T x_i + b \geq +1$ for all $x_i \in A$   (1)

$w^T x_i + b \leq -1$ for all $x_i \in B$   (2)

These two constraints can be combined into:

$y_i(w^T x_i + b) \geq 1$ for all $x_i \in A \cup B$   (3)
Fig. 1. (a) Optimal separating hyperplane for separable case in 2D space, (b) linear separating hyperplane for non-separable case in 2D space.
The optimal separating hyperplane is the one that maximizes the margin between the two classes (Fig. 1a). It is found by minimizing

$\Phi(w) = \frac{1}{2}\|w\|^2$   (4)

subject to

$y_i(w^T x_i + b) \geq 1, \quad i = 1, \ldots, n$   (5)

Introducing non-negative Lagrange multipliers $\lambda_i$, this constrained optimization problem is transformed into its dual form: maximize

$W(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j x_i^T x_j$   (6)

subject to

$\sum_{i=1}^{n} \lambda_i y_i = 0, \quad \text{and} \quad \lambda_i \geq 0, \quad i = 1, 2, \ldots, n$   (7)

Once the solution has been found in the form of a vector $\lambda = (\lambda_1, \ldots, \lambda_n)^T$, the optimal separating hyperplane is found and the decision function is obtained as follows:

$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \lambda_i y_i x_i^T x + b\right)$   (8)

In the non-separable case (Fig. 1b), non-negative slack variables $\xi_i$ are introduced to relax the constraints of Eq. (5):

$y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad i = 1, \ldots, n$   (9)

and the optimal separating hyperplane is found by minimizing

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i$   (10)

where C is the regularizing (margin) parameter that determines the trade-off between the maximization of the margin and minimization of the classification error (Gun, 1998; Cristianini and Shawe-Taylor, 2000). The solution to this minimization problem is similar to the separable case except for a modification of the bounds of the Lagrange multipliers. Therefore, Eq. (7) is changed to:

$\sum_{i=1}^{n} \lambda_i y_i = 0, \quad \text{and} \quad 0 \leq \lambda_i \leq C, \quad i = 1, 2, \ldots, n$   (11)

Thus the only difference from the separable case is that the $\lambda_i$ now have an upper bound of C. The support vector machine obtained in this case is called the soft-margin support vector machine (Abe, 2005).
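As an illustration of the soft-margin formulation, the following sketch (ours, not the paper's implementation; it assumes the scikit-learn library and synthetic data) trains linear SVMs on a non-separable two-class problem and shows how the regularizing parameter C of Eq. (10) bounds the Lagrange multipliers as in Eq. (11):

# A minimal sketch of the soft-margin linear SVM of Eqs. (9)-(11),
# assuming scikit-learn and synthetic non-separable data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping 2-D classes A (y = +1) and B (y = -1); the overlap makes
# the data non-separable, so the slack variables of Eq. (9) are needed.
X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 2)),
               rng.normal(-1.0, 1.0, size=(50, 2))])
y = np.array([+1] * 50 + [-1] * 50)

for C in (0.1, 1.0, 100.0):          # C: regularizing (margin) parameter
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # dual_coef_ holds lambda_i * y_i for the support vectors; by Eq. (11)
    # each |lambda_i| is bounded above by C.
    print(C, len(clf.support_vectors_), float(np.abs(clf.dual_coef_).max()))

Larger values of C penalize the slack variables more heavily, trading a narrower margin for fewer misclassified training examples.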
3.2. Two-class: nonlinear support vector machines
Fig. 2. Mapping from the input space (R2) into a higher-dimensional feature space (R3).
In the nonlinear case, the input vectors are mapped into a higher-dimensional feature space through a nonlinear mapping $\phi(\cdot)$, and the optimal separating hyperplane is then constructed in that feature space (Fig. 2). The mapping never needs to be computed explicitly, because the training algorithm depends on the data only through inner products, which can be replaced by a kernel function:

$K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$   (12)

The dual problem of Eq. (6) then becomes: maximize

$W(\lambda) = \sum_{i=1}^{n} \lambda_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j y_i y_j K(x_i, x_j)$   (13)

subject to

$\sum_{i=1}^{n} \lambda_i y_i = 0, \quad \text{and} \quad \lambda_i \geq 0, \quad i = 1, 2, \ldots, n$   (14)

where K(xi, xj) is the kernel function performing the nonlinear mapping into feature space. Once the solution has been found, the decision function can be constructed as:

$f(x) = \operatorname{sign}\left(\sum_{i=1}^{n} \lambda_i y_i K(x_i, x) + b\right)$   (15)

The following are the most common kernel functions in the literature (Abe, 2005; Gun, 1998; Burges, 1998):

1. Linear: $K(x_i, x_j) = x_i^T x_j$
2. Polynomial: $K(x_i, x_j) = (x_i^T x_j + 1)^d$
3. Gaussian (radial basis function): $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$
4. Sigmoid: $K(x_i, x_j) = \tanh(\kappa\, x_i^T x_j + \delta)$
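For illustration, these kernels can be evaluated directly; the sketch below (not from the paper) computes each one with NumPy, with parameter values (d, sigma, kappa, delta) chosen arbitrarily:

# A hedged sketch evaluating the four kernels listed above; the parameter
# values are arbitrary choices for illustration, not values from the paper.
import numpy as np

def linear(x, z):
    return x @ z

def polynomial(x, z, d=2):
    return (x @ z + 1) ** d

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid(x, z, kappa=1.0, delta=0.0):
    return np.tanh(kappa * (x @ z) + delta)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear(x, z), polynomial(x, z), rbf(x, z), sigmoid(x, z))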
4. Empirical evaluation

4.1. Goal

The goal of this empirical evaluation is to assess the capability of SVM in predicting defect-prone software modules, and to compare its prediction performance against the eight statistical and machine learning models introduced in Section 1.

4.2. Datasets

The datasets used in this study are four mission-critical NASA software projects, which are publicly accessible from the repository of the NASA IV&V Facility Metrics Data Program. Two datasets (CM1 and PC1) are from software projects written in a procedural language (C), where a module in this case is a function. The other two datasets (KC1 and KC3) are from projects written in object-oriented languages (C++ and Java), where a module in this case is a method. Each dataset contains 21 software metrics (independent variables) at the module level and the associated dependent Boolean variable, Defective (whether or not the module has any defects). Table 1 summarizes some main characteristics of these datasets.

4.3. Independent variables

The independent variables are the 21 module-level metrics available in each dataset; they are listed in Table 2. A sketch of preparing these variables for modeling follows.
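In the following hypothetical sketch, the file name cm1.csv and the column label Defective are our assumptions about the local data layout, not part of the paper:

# Preparing one dataset for modeling (layout assumed, not from the paper).
import pandas as pd

data = pd.read_csv('cm1.csv')            # 21 metrics plus the Defective flag
X = data.drop(columns=['Defective'])     # independent variables (Table 2)
y = data['Defective'].astype(int)        # 1 = defective, 0 = not defective
print(X.shape, y.mean())                 # module count and defect rate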
ki y i 0;
and ki P 0;
i 1; 2; . . . ; n
i1
14
where K(xi, xj) is the kernel function performing the nonlinear mapping into feature space. Once the solution has been
found, the decision can be constructed as:
n
X
ki y i kxi ; x b
15
f x sign
i1
The following are the most common kernel functions in literature (Abe, 2005; Gun, 1998; Burges, 1998):
1.
2.
3.
4.
4. Empirical evaluation
Table 1
Characteristics of datasets

Project          Language   # of modules   % of defective modules   Description
Procedural
  CM1            C          496            9.7                      NASA spacecraft instrument
  PC1            C          1107           6.9                      Flight software for an earth orbiting satellite
Object-oriented
  KC1            C++        2107           15.4                     Storage management for receiving and processing ground data
  KC3            Java       458            6.3                      Collection, processing and delivery of satellite metadata
Table 2
Module-level metrics (independent variables)

Metric              Type               Definition
V(g)                McCabe             Cyclomatic complexity
EV(g)               McCabe             Essential complexity
IV(g)               McCabe             Design complexity
LOC                 McCabe             Total lines of code
N                   Derived Halstead   Length
V                   Derived Halstead   Volume
L                   Derived Halstead   Level
D                   Derived Halstead   Difficulty = 1/L
I                   Derived Halstead   Intelligent count
E                   Derived Halstead   Effort estimate
B                   Derived Halstead   Error estimate
T                   Derived Halstead   Time estimator
LOCode              Line               Lines of statement
LOComment           Line               Lines of comment
LOBlank             Line               Lines of blank
LOCodeAndComment    Line               Lines of code and comment
UniqOp              Basic Halstead     Unique operators
UniqOpnd            Basic Halstead     Unique operands
TotalOp             Basic Halstead     Total operators
TotalOpnd           Basic Halstead     Total operands
BranchCount         Branch             Total branch count
4.5.1. Accuracy

Accuracy is also known as the correct classification rate. It is defined as the ratio of the number of modules correctly predicted to the total number of modules. It is calculated as follows:

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

4.5.2. Precision

Precision is also known as correctness. It is defined as the ratio of the number of modules correctly predicted as defective to the total number of modules predicted as defective. It is calculated as follows:

$\text{Precision} = \frac{TP}{TP + FP}$
Table 3
Best subset of independent variables in each dataset

Dataset   Best subset of independent variables
CM1
PC1
KC1
KC3
4.5.3. Recall

Recall is also known as the defect detection rate. It is defined as the ratio of the number of modules correctly predicted as defective to the total number of modules that are actually defective. It is calculated as follows:

$\text{Recall} = \frac{TP}{TP + FN}$
Table 4
Confusion matrix

                     Predicted
Actual               Not defective         Defective
Not defective        TN (True Negative)    FP (False Positive)
Defective            FN (False Negative)   TP (True Positive)
4.5.4. F-measure

The F-measure considers the harmonic mean of precision and recall, i.e. it takes both of them into account. It is calculated as follows:

$F\text{-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
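The four measures of Sections 4.5.1–4.5.4 follow mechanically from the confusion matrix of Table 4; the sketch below (assuming scikit-learn, with placeholder predictions) mirrors the formulas above:

# Computing the measures of Sections 4.5.1-4.5.4 from the confusion matrix
# of Table 4; y_true and y_pred are placeholder labels for illustration.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 1, 0]        # 1 = defective, 0 = not defective
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)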
Table 5
Prediction performance measures: CM1 dataset

Prediction   Accuracy                  Precision                 Recall                     F-measure
model        Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean     StDev   Sig.?     Mean    StDev   Sig.?
SVM          90.69   1.074   -         90.66   0.010   -         100.00   0.000   -         0.951   0.006   -
LR           90.17   1.987   No(+)     91.10   0.013   No(-)     98.78    0.017   Yes(+)    0.948   0.011   No(+)
KNN          83.27   4.154   Yes(+)    91.36   0.020   No(-)     90.02    0.043   Yes(+)    0.906   0.025   Yes(+)
MLP          89.32   2.066   No(+)     90.75   0.012   No(-)     98.20    0.021   Yes(+)    0.943   0.012   Yes(+)
RBF          89.91   1.423   No(+)     90.32   0.008   No(+)     99.49    0.013   No(+)     0.947   0.008   No(+)
BBN          76.83   6.451   Yes(+)    94.17   0.026   Yes(-)    79.30    0.071   Yes(+)    0.859   0.044   Yes(+)
NB           86.74   3.888   Yes(+)    92.49   0.020   Yes(-)    92.90    0.038   Yes(+)    0.926   0.023   Yes(+)
RF           88.62   2.606   Yes(+)    90.93   0.014   No(-)     97.10    0.026   Yes(+)    0.939   0.015   Yes(+)
DT           89.82   1.526   No(+)     90.35   0.009   No(+)     99.34    0.017   No(+)     0.946   0.009   No(+)

Note: Sig.? indicates whether the difference between SVM and the compared model is statistically significant (Yes/No); (+) indicates SVM performed better on that measure, (-) worse.
Table 6
Prediction performance measures: PC1 dataset

Prediction   Accuracy                  Precision                 Recall                    F-measure
model        Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean    StDev   Sig.?
SVM          93.10   0.968   -         93.53   0.007   -         99.47   0.007   -         0.964   0.005   -
LR           93.19   1.088   No(-)     93.77   0.008   No(-)     99.28   0.008   No(+)     0.964   0.006   No(+)
KNN          91.82   2.080   No(+)     95.54   0.012   Yes(-)    95.70   0.020   Yes(+)    0.956   0.011   Yes(+)
MLP          93.59   1.212   No(-)     94.20   0.009   No(-)     99.23   0.009   No(+)     0.966   0.006   No(-)
RBF          92.84   0.948   No(+)     93.40   0.007   No(+)     99.34   0.008   No(+)     0.963   0.005   No(+)
BBN          90.44   4.534   No(+)     94.60   0.011   Yes(-)    95.17   0.048   Yes(+)    0.948   0.027   No(+)
NB           89.21   2.606   Yes(+)    94.95   0.012   Yes(-)    93.40   0.025   Yes(+)    0.941   0.015   Yes(+)
RF           93.66   1.665   No(-)     95.15   0.012   Yes(-)    98.21   0.013   Yes(+)    0.967   0.009   No(-)
DT           93.58   1.499   No(-)     94.59   0.011   Yes(-)    98.77   0.012   No(+)     0.966   0.008   No(-)
Table 7
Prediction performance measures: KC1 dataset

Prediction   Accuracy                  Precision                 Recall                    F-measure
model        Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean    StDev   Sig.?
SVM          84.59   0.714   -         84.95   0.005   -         99.40   0.006   -         0.916   0.004   -
LR           85.55   1.394   No(-)     87.08   0.010   Yes(-)    97.38   0.012   Yes(+)    0.919   0.008   No(-)
KNN          83.99   2.144   No(+)     89.58   0.014   Yes(-)    91.75   0.021   Yes(+)    0.906   0.013   Yes(+)
MLP          85.68   1.353   Yes(-)    86.75   0.010   Yes(-)    98.07   0.013   Yes(+)    0.921   0.008   No(-)
RBF          84.81   1.107   No(-)     85.81   0.007   Yes(-)    98.31   0.010   Yes(+)    0.916   0.006   No(+)
BBN          75.99   2.925   Yes(+)    91.98   0.017   Yes(-)    78.47   0.032   Yes(+)    0.847   0.021   Yes(+)
NB           82.86   2.210   Yes(+)    88.59   0.013   Yes(-)    91.53   0.021   Yes(+)    0.900   0.013   Yes(+)
RF           85.20   1.798   No(-)     88.12   0.012   Yes(-)    95.38   0.016   Yes(+)    0.916   0.010   No(+)
DT           84.56   1.580   No(+)     86.79   0.012   Yes(-)    96.46   0.021   Yes(+)    0.914   0.009   No(+)
Table 8
Prediction performance measures: KC3 dataset

Prediction   Accuracy                  Precision                 Recall                    F-measure
model        Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean    StDev   Sig.?     Mean    StDev   Sig.?
SVM          93.28   1.099   -         93.65   0.006   -         99.58   0.009   -         0.965   0.006   -
LR           93.42   1.892   No(-)     94.37   0.013   No(-)     98.89   0.016   No(+)     0.966   0.010   No(-)
KNN          92.50   2.979   No(+)     95.23   0.017   Yes(-)    96.87   0.028   Yes(+)    0.960   0.016   No(+)
MLP          93.50   2.215   No(-)     94.74   0.015   Yes(-)    98.56   0.018   No(+)     0.966   0.012   No(-)
RBF          93.59   1.557   No(-)     94.10   0.011   No(-)     99.41   0.012   No(+)     0.967   0.008   No(-)
BBN          93.21   2.684   No(+)     95.72   0.017   Yes(-)    97.14   0.026   Yes(+)    0.964   0.015   No(+)
NB           92.81   3.134   No(+)     95.89   0.018   Yes(-)    96.50   0.029   Yes(+)    0.962   0.017   No(+)
RF           92.63   3.073   No(+)     95.49   0.018   Yes(-)    96.73   0.029   Yes(+)    0.961   0.017   No(+)
DT           93.11   1.636   No(+)     93.85   0.009   No(-)     99.15   0.019   No(+)     0.964   0.009   No(+)
Fig. 8. Mean vs. standard deviation of F-measure: KC1 dataset.

Fig. 9. Mean vs. standard deviation of accuracy: KC3 dataset.

Fig. 11. Significance test results: SVM vs. other models: CM1 dataset.

Fig. 12. Significance test results: SVM vs. other models: PC1 dataset.

Fig. 13. Significance test results: SVM vs. other models: KC1 dataset.
In the CM1 dataset, SVM is outperformed in precision by six out of the eight compared models, but this difference is not significant except for two models (BBN and NB). However, SVM achieves significantly higher recall than almost all the compared models. As explained earlier, there is a trade-off between precision and recall.
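For readers who wish to reproduce this kind of comparison, the sketch below (our illustration, not the authors' exact procedure; the synthetic data, fold counts, and model settings are assumptions) estimates the mean and standard deviation of the F-measure for SVM and one compared model via repeated stratified cross-validation; a corrected resampled t-test such as that of Nadeau and Bengio (2003) could then be applied to the paired fold scores:

# Estimating mean and StDev of a performance measure, as in Tables 5-8.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Imbalanced two-class data standing in for a defect dataset (~10% defective).
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)

def f1_scores(model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring='f1')

svm_f1 = f1_scores(SVC(kernel='rbf'))
lr_f1 = f1_scores(LogisticRegression(max_iter=1000))
print(svm_f1.mean(), svm_f1.std())
print(lr_f1.mean(), lr_f1.std())
# A corrected resampled t-test on the paired differences svm_f1 - lr_f1
# would then decide the Yes/No entries of the Sig.? columns.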
Fig. 14. Significance test results: SVM vs. other models: KC3 dataset.
5. Conclusion
This paper has empirically evaluated the capability of SVM in predicting defect-prone software modules and compared its prediction performance against eight statistical and machine learning models in the context of four NASA datasets. The results indicate that the prediction performance of SVM is generally better than, or at least competitive with, the compared models. In all datasets, the overall accuracy of SVM is within 84.6–93.3%; its precision is within 84.9–93.6%; its recall is within 99.4–100%; and its F-measure is within 0.916–0.965. Although the precision of SVM is outperformed by many models, SVM outperforms all models in recall. As explained earlier, there is a trade-off between precision and recall, but the F-measure considers their harmonic mean, i.e. it takes both of them into account.
References
Mair, C., Kadoda, G., Lefley, M., Phalp, K., Schofield, C., Shepperd, M., Webster, S., 2000. An investigation of machine learning based prediction systems. Journal of Systems and Software 53 (1), 23–29.
McCabe, T., 1976. A complexity measure. IEEE Transactions on Software Engineering 2 (4), 308–320.
McCabe, T., Butler, C., 1989. Design complexity measurement and testing. Communications of the ACM 32 (12), 1415–1425.
Morris, C., Autret, A., Boddy, L., 2001. Support vector machines for identifying organisms: a comparison with strongly partitioned radial basis function networks. Ecological Modeling 146, 57–67.
Munson, J., Khoshgoftaar, T., 1992. The detection of fault-prone programs. IEEE Transactions on Software Engineering 18 (5), 423–433.
Nadeau, C., Bengio, Y., 2003. Inference for the generalization error. Machine Learning 52, 239–281.
Osuna, E., Freund, R., Girosi, F., 1997. Training support vector machines: an application to face detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 130–136.
Schmidt, M., Gish, H., 1996. Speaker identification via support vector classifiers. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-96), pp. 105–108.
Schneidewind, N., 1992. Methodology for validating software metrics. IEEE Transactions on Software Engineering 18 (5), 410–422.
Selby, R., Porter, A., 1988. Learning from examples: generation and evaluation of decision trees for software resource analysis. IEEE Transactions on Software Engineering 14 (12), 1743–1756.
Shepperd, M., Kadoda, G., 2001. Comparing software prediction techniques using simulation. IEEE Transactions on Software Engineering 27 (11), 1014–1022.
Smola, A., 1998. Learning with kernels. Ph.D. dissertation, Department of Computer Science, Technical University Berlin, Germany.
Tay, F., Cao, L., 2001. Application of support vector machines in financial time series forecasting. Omega 29, 309–317.
Vapnik, V., 1995. The Nature of Statistical Learning Theory. Springer, New York.
Weaver, R., 2003. The safety of software: constructing and assuring arguments. Ph.D. dissertation, Department of Computer Science, University of York.
Webb, R., 2002. Statistical Pattern Recognition, second ed. John Wiley & Sons, New York.
Witten, I., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques, second ed. Morgan Kaufmann, San Francisco.
Karim O. Elish is an MS candidate and research assistant in the Information and Computer Science Department at King Fahd University of
Petroleum and Minerals (KFUPM), Saudi Arabia. He received the BS
degree in computer science from KFUPM in 2006. He is a member of
Software Engineering Research Group (SERG) at KFUPM. His main
research interests include software metrics and measurement, empirical
software engineering, and application of machine learning and data
mining in software engineering.
Mahmoud O. Elish received the PhD degree in computer science from
George Mason University, USA, in 2005. He is an assistant professor in
the Information and Computer Science Department at King Fahd University of Petroleum and Minerals, Saudi Arabia. His research interests
include software metrics and measurement, object-oriented analysis and
design, empirical software engineering, and software quality predictive
models.