
Investigating the Capability of Object-Oriented

Metrics for Fault Proneness


Submitted in partial fulfillment of the requirements for the degree of
Master of Technology
by
Santosh Singh Rathore
(Roll no: 1120103)
under the guidance of
Dr. Atul Gupta

Computer Science & Engineering
PDPM INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,
DESIGN AND MANUFACTURING JABALPUR
2013
Approval Sheet
This thesis entitled Investigating the Capability of Object-Oriented Met-
rics for Fault Proneness submitted by Santosh Singh Rathore (1120103)
is approved for partial fulfillment of the requirements for the degree of Master of
Technology in Computer Science and Engineering.
Examining Committee
................................................
................................................
................................................
Guide
................................................
................................................
................................................
Chairman
................................................
Date .......................... ................................................
Place ......................... ................................................
Certificate
This is to certify that the thesis entitled, Investigating the Capability of Object-
Oriented Metrics for Fault Proneness, submitted by Santosh Singh Rathore,
Roll No. 1120103 in partial fulfillment of the requirements for the award of Mas-
ter of Technology Degree in Computer Science and Engineering, at PDPM
Indian Institute of Information Technology, Design and Manufacturing Jabalpur is an
authentic work carried out by him under my supervision and guidance.
To the best of my knowledge, the matter embodied in the thesis has not been submitted
elsewhere to any other university/institute for the award of any other degree.
(Atul Gupta) February 6, 2013
Associate Professor
Computer Science & Engineering Discipline
PDPM Indian Institute of Information Technology, Design and Manufacturing Jabalpur
India-482005
Acknowledgments
Foremost, I would like to express my sincere gratitude to my supervisor Dr.
Atul Gupta for the continuous support of my M.Tech study and research, for his
valuable guidance, patience, motivation, enthusiasm, and immense knowledge.
His approach towards software engineering will always be a valuable learning
experience for me. His guidance helped me throughout the research and writing
of this thesis. I could not have imagined having a better supervisor for my M.Tech
study. His dedication, professionalism and hard work have been and shall be a
source of inspiration throughout my life.
My deepest gratitude goes to my family for their unflagging love and support
throughout my life; this thesis would simply have been impossible without them. I thank
my parents for inspiring, encouraging and fully supporting me. I would
also like to express my thanks to Pratibha and Deepika (my sisters), who brought a
light inside me and always filled me with enthusiasm to do my work with complete
effort and dedication.
I would like to thank Mr. Saurabh Tiwari and Mr. Deepak Banthia, who, as
good friends, were always willing to help and give their best suggestions. It would
have been a lonely lab without them. I would also like to give my sincere thanks
to Mr. Amaltas Khan, Mr. Amit Dhama, Mr. Arpit Gupta, Mr. Ravindra Singh
and to my batch mates for their support and being there always, no matter what.
I would like to thank IIITDM and Department of Computer Science for providing
me such a congenial environment, labs and other resources.
Santosh Singh Rathore
Abstract
Software fault prediction is used to streamline the efforts of software quality assur-
ance (SQA) activities by identifying the most fault-prone modules first. It is typically
done by training a prediction model over some known project data augmented
with fault information, and subsequently using the prediction model to predict
faults for unseen projects. However, the earlier efforts in fault prediction were based
on the classification of the software modules as faulty or non-faulty. Such a
prediction does not provide enough logistics to identify and fix the faults in the
software system. The fault prediction can be more useful if, besides predicting the
software modules being faulty or non-faulty, their fault densities can also be pre-
dicted accurately. In this thesis, we aim to investigate the relationship between
object-oriented (OO) metrics and their capability of predicting fault densities in
object-oriented software. As a follow-up, we investigate two important and
related issues relevant to fault prediction. First, how to select a subset of
OO metrics that are significantly correlated with fault proneness? Subsequently,
how to use this subset of metrics to predict fault densities in a given software
system? Here, we present an approach to identify a subset containing software
metrics with significant fault-correlation and then use this identified subset with
count models to predict fault densities over the subsequent releases of the
software system. To select significant metrics, we first evaluate each metric in-
dependently for its potential to predict faults by performing Univariate Logistic
Regression analysis. Next, we perform Spearman's correlation and Multivariate
Linear Regression analysis between the selected significant metrics to further up-
date the metrics subset for improved performance. The identified metrics
subset is then used with the count models to predict fault densities. The re-
sults of the prediction were evaluated using confusion matrix parameters and a
cost-benefit model. Our results suggest that, among the five count models used,
negative binomial regression (NBR) analysis produced the best performance for
fault density prediction. Its predictive accuracy was the highest compared to the
other count models. The results of the cost-benefit analysis also confirmed that
the prediction model based on negative binomial regression was the most cost-effective
compared to the other count models used in the study.
Table of Contents
Approval I
Certificate II
Acknowledgments III
Abstract IV
List of Figures IX
List of Tables XI
List of Symbols XIII
Abbreviations XIV
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Related Work 6
2.1 Object-Oriented Metrics . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Metrics suites for object-oriented software . . . . . . . . . 7
2.1.2 Validation of object-oriented metrics . . . . . . . . . . . . 15
2.2 Public Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3 Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Subset Selection of Object-Oriented Metrics for Fault Prediction . 22
2.5 Fault Prediction Studies . . . . . . . . . . . . . . . . . . . . . . . 24
2.5.1 Binary class classification of the faults . . . . . . . . . . . 24
2.5.2 Number of faults and the fault densities prediction . . . . 27
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 A Framework for Subset Selection of Object-Oriented Metrics
for Fault Proneness 29
3.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Metrics set used for investigation . . . . . . . . . . . . . . 34
3.2.2 Dependent variable . . . . . . . . . . . . . . . . . . . . . . 35
3.2.3 Project datasets . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.4 Research questions . . . . . . . . . . . . . . . . . . . . . . 35
3.2.5 Experimental execution . . . . . . . . . . . . . . . . . . . 36
3.2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.6.1 Univariate logistic regression analysis . . . . . . . 37
3.2.6.2 Correlation analysis between metrics . . . . . . . 39
3.2.6.3 Multivariate linear regression analysis . . . . . . 40
3.2.6.4 Validation of prediction models over the successive releases . . . . . . . . . . . . . . . 41
3.2.7 Threats to validity . . . . . . . . . . . . . . . . . . . . . . 43
3.2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4 A Count Model Based Analysis to Predict Fault Densities in
Software Modules 48
4.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.1.1 Selection of fault-correlated metrics . . . . . . . . . . . . . 50
4.1.2 Count model analysis . . . . . . . . . . . . . . . . . . . . . 51
4.1.3 Evaluation of count models . . . . . . . . . . . . . . . . . 51
4.1.4 Cost-benet model . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 54
4.2.1 Metrics set used for the experiment . . . . . . . . . . . . . 54
4.2.2 Project dataset . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Count models . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2.3.1 Poisson regression model . . . . . . . . . . . . . . 56
4.2.3.2 Negative binomial regression model . . . . . . . . 57
4.2.3.3 Zero-inated count model . . . . . . . . . . . . . 58
4.2.3.4 Generalized negative binomial regression model . 58
4.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.2.4.1 Prediction of the number of faults and fault densities 59
4.2.4.2 Evaluating the results of five count models . . . . 62
4.2.4.3 Prediction of the number of faults and the fault
densities in the modules ranked as top 20% . . . 63
4.2.4.4 Cost-benet analysis . . . . . . . . . . . . . . . . 64
4.2.5 Threats to validity . . . . . . . . . . . . . . . . . . . . . . 67
4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5 An Application of the Count Models to Predict Fault Densities
With Binary Fault Classication 71
5.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.1.1 Subset selection of fault-correlated metrics . . . . . . . . . 72
5.1.2 Count model analysis . . . . . . . . . . . . . . . . . . . . . 72
5.1.3 Evaluation of count models . . . . . . . . . . . . . . . . . 73
5.1.4 Cost-benet model . . . . . . . . . . . . . . . . . . . . . . 73
5.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Metrics set used for the experiment . . . . . . . . . . . . . 73
5.2.2 Experimental data . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.3.1 Prediction of the number of faults and the fault
densities . . . . . . . . . . . . . . . . . . . . . . . 75
5.2.3.2 Evaluating the results of the five count models . . 77
5.2.3.3 Prediction of the number of faults and the fault
densities in modules ranked as top 20% . . . . . . 79
5.2.3.4 Cost-benet analysis . . . . . . . . . . . . . . . . 79
5.2.4 Threats to validity . . . . . . . . . . . . . . . . . . . . . . 82
5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Conclusions and Future Work 84
References 86
Publications 93
List of Figures
1.1 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . 3
3.1 Framework of proposed approach . . . . . . . . . . . . . . . . . . 31
3.2 Results of the validation of prediction models constructed using
original set of metrics and using four machine-learning techniques 44
3.3 Results of the validation of prediction models constructed using
identified subset of metrics and using four machine-learning tech-
niques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Overview of the proposed approach . . . . . . . . . . . . . . . . . 50
4.2 Result of the predicted number of faults using count models (PROP1-
PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Result of the predicted number of faulty modules using count mod-
els (PROP1-PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Result of the fault densities prediction using count models (PROP1-
PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5 Comparison of count models using various confusion matrix criteria
(PROP1-PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.6 Cost-benefit model for the count models (PROP1-PROP6) . . . . 66
5.1 Result of the predicted number of faults using count models (PROP1-
PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.2 Result of the predicted number of faulty modules using count mod-
els (PROP1-PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.3 Result of the fault densities prediction using count models (PROP1-
PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.4 Comparison of count models using various confusion matrix criteria
(PROP1-PROP6) . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Cost-benefit model for the count models (PROP1-PROP6) . . . . 80
List of Tables
2.1 CK metrics suite [15] . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 MOODS metrics suite [32] . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Wei Li metrics suite [45] . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Lorenz & Kidd metrics suite [48] . . . . . . . . . . . . . . . . . . 9
2.5 Briand metrics suite [11] . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Bansiya's metrics suite [4] . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Other metrics suites . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.8 Summary of empirical study on object-oriented metrics . . . . . . 15
2.9 Datasets used in the study . . . . . . . . . . . . . . . . . . . . . . 20
2.10 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 Datasets used for study . . . . . . . . . . . . . . . . . . . . . . . . 35
3.2 Univariate logistic regression analysis- Camel 1.0 to 1.4 . . . . . . 37
3.3 Univariate logistic regression analysis- Ivy 1.0 to 1.4 . . . . . . . 38
3.4 Univariate logistic regression analysis- Velocity 1.4 to 1.5 . . . . . 38
3.5 Univariate logistic regression analysis- Xalan 2.4 to 2.5 . . . . . . 38
3.6 Univariate logistic regression analysis- Xerces 1.2 to 1.3 . . . . . 39
3.7 Reduced metrics subset after ULR analysis . . . . . . . . . . . . . 39
3.8 Spearman's correlation analysis over Camel project dataset . . . . 40
3.9 Spearman's correlation analysis over Ivy project dataset . . . . . 40
3.10 Spearman's correlation analysis over Velocity project dataset . . . 40
3.11 Spearman's correlation analysis over Xalan project dataset . . . . 41
3.12 Spearman's correlation analysis over Xerces project dataset . . . . 41
3.13 Multivariate linear regression analysis over Camel project datasets 41
3.14 Multivariate linear regression analysis over Ivy project datasets . 42
3.15 Multivariate linear regression analysis over Velocity project datasets 42
3.16 Multivariate linear regression analysis over Xalan project datasets 42
3.17 Multivariate linear regression analysis over Xerces project datasets 43
3.18 Resulting subset of metrics after MLR analysis . . . . . . . . . . 43
4.1 Fault removal cost of testing techniques (in staff-hours per defect) 52
4.2 Fault identification efficiencies of different testing phases . . . . . 52
4.3 Identied metrics for each release of the PROP dataset . . . . . . 55
4.4 Detail of PROP project dataset used for study . . . . . . . . . . . 56
4.5 Percentage of faults contained in the modules ranked as top 20%
(T=Training set) . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.6 Percentage of fault density contained in the modules ranked as
top 20% of modules (Fault density = faults/100 lines of code) . . . 65
5.1 Identied metrics for each release of the PROP dataset . . . . . . 74
5.2 Datasets used for the study . . . . . . . . . . . . . . . . . . . . . 74
5.3 Percentage of faults contained in the modules ranked as TOP 20%
(T=Training set) . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Percentage of fault density contained in the modules ranked as
top 20% of modules (Fault density = faults/100 lines of code) . . . 80
List of Symbols
C_f    Normalized fault removal cost in field
C_i    Initial setup cost of the used fault prediction approach
C_s    Normalized fault removal cost in system testing
C_u    Normalized fault removal cost in unit testing
M_p    Percentage of modules unit tested
δ_s    Fault identification efficiency of system testing
δ_u    Fault identification efficiency of unit testing
Abbreviations
Acc Accuracy
AUC Area under the ROC curve
Ecost Estimated fault removal cost of the software when we
use fault prediction
FNR False negative rate
FP False positive
FPR False positive rate
NEcost Normalized estimated fault removal cost of the software
when we use fault prediction
PR Precision
Tcost Estimated fault removal cost of the software without the
use of fault prediction
TN True negative
TP True positive
ULR Univariate logistic regression
MLR Multivariate linear regression
NBRM Negative binomial regression model
PRM Poisson regression model
ZIP Zero-inflated Poisson regression model
GNBR Generalized negative binomial regression model
ZIN Zero-inflated negative binomial regression model
OO Object-oriented
Chapter 1
Introduction
Software quality assurance activities consist of monitoring and controlling the
software development process to ensure the desired software quality at a lower
cost [69]. It may include the application of formal code inspections, code walk-
throughs, software testing, and fault prediction. Software fault prediction is a
technique to identify the fault-prone software modules by using some underlying
properties of the software system. It is typically performed by training a predic-
tion model over some known project data augmented with fault information, and
subsequently using the prediction model to predict faults for unseen projects.
The underlying theory of software fault prediction is that a module currently
under development is likely to be fault-prone if a module with similar char-
acteristics in an earlier project (or release) developed in the same environment
was found to be faulty. In this case, the early detection of the faulty modules can
be useful to streamline the efforts to be applied in the later phases of software
development by better focusing the quality assurance efforts on those modules.
The potential of software fault prediction to identify the faulty software modules
early in the development life cycle has gained considerable attention over the last two
decades. The earlier fault prediction studies used a wide range of classification
algorithms to predict the faultiness of the software modules. Different experimen-
tal studies result in a limited ability to comprehend the algorithms' strengths and
weaknesses [37]. The prediction accuracy of fault-prediction techniques has been found
to be considerably low, ranging from 70 to 85 percent, with a high misclassification
error rate [66] [18] [29]. An important concern related to fault prediction is the lack
of suitable performance evaluation measures that would assess the capability of
fault prediction models [37]. Another concern is the unequal distribution
of the fault data, which may lead to biased learning [51]. Moreover, some
issues, such as which software properties/metrics to include, how context affects
fault prediction, the cost-effectiveness of fault prediction, and fault density pre-
diction, need further understanding and investigation before the results of fault
prediction can be put into practice.
1.1 Motivation
Fault prediction models are generally constructed by identifying the relationship
between the structural measures of the software, such as coupling, cohesion, and com-
plexity, and faults. These models quantitatively describe how these internal
structural properties are related to relevant external system qualities such as fault
proneness.
However, there are some critical issues that need to be resolved before using the
fault prediction results to guide the quality assurance process. One important
concern is the difficulty of knowing the software metrics that are signif-
icantly correlated with fault proneness, and this issue has not been adequately
investigated [23]. Some of the metrics may contain re-
dundant information or, worse, adversely affect the fault prediction ability of
other metrics. Earlier studies in this regard have confirmed that a high number of
features (attributes) may lead to lower classification accuracy and higher misclas-
sification errors [42] [59]. Higher dimensional data can also be a serious problem for
many classification algorithms due to the high computational cost and memory
usage [47].
The other issue is about using the fault prediction results in practice. Many of
the earlier fault prediction studies were based on a binary fault classification model,
i.e., a module is considered as either faulty or non-faulty. There are several issues
with this binary class classification. The binary class classification of the software
modules does not provide enough logistics to streamline the effort that would be
useful to identify and fix the faults in the software systems. In addition, even if the
reported performance of the prediction model is excellent, the interpretation of
the findings is hard to put into the proper usability context, i.e., identification of
the actual number of faults.
1.2 Objectives
The objective of this thesis work is to present an approach to identify a subset of
object-oriented (OO) metrics that show significant fault-correlation, and subse-
quently to use this subset to train various count models for fault density prediction.
In this thesis, we aim to investigate two important and related issues, as mentioned
above, with respect to fault prediction. In this regard, we frame our research
questions as follows:
RQ1: How to select a subset of OO metrics that are significantly correlated with
fault proneness?
RQ2: How to use this subset of metrics to predict fault densities in a given
software system?
1.3 Thesis Organization
The overall structure of the thesis is illustrated in Figure 1.1. The content can
broadly be divided into three major sections: background of the research, including the
literature review; the research contribution; and the future scope of the proposed work.
[Figure: the thesis is organized into three parts: Problem Definition and Literature Review (Chapter 1: Introduction, Chapter 2: Related Work), Research Contribution (Chapter 3: Subset Selection of Fault-correlated Metrics; Chapters 4 & 5: Fault Densities Prediction), and Future Scope of the Proposed Work (Chapter 6: Conclusion and Future Work).]
Figure 1.1: Thesis organization
Chapter 2: This chapter summarizes the concepts relevant to the fault predic-
tion study. Specifically, we present a detailed survey of existing object-oriented
metrics, including the empirical studies of these metrics performed earlier, the
details of the public datasets used in our experimental study, various model eval-
uation techniques, and a literature review of earlier studies related to
software fault prediction. We draw our general arguments and findings at
the end of the chapter.
Chapter 3: We present an approach for identifying a metrics subset consisting
of metrics with significant fault-correlation. The metrics subset selection process is
undertaken in three steps. In the first step, we assess the fault proneness of each
metric separately by performing Univariate Logistic Regression (ULR) analysis
and select those metrics having significant fault correlation. In the next step,
we analyze the pairwise correlation among the selected metrics by performing
Spearman's correlation analysis. In the last step, we construct Multivariate Lin-
ear Regression (MLR) models to further reduce the metrics and identify a group
of metrics that are more significant for fault proneness. Finally, we evaluate the
performance of the selected metrics subset against the original project metrics
suite. Our results demonstrate that the identified metrics subset produced
improved fault prediction performance compared to the original project metrics
suite.
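The three-step selection can be sketched in a few lines of Python. The column names (e.g., a fault count column named bug), the significance level, and the correlation cutoff below are illustrative assumptions, not the exact values or scripts used in this thesis.

```python
# Sketch of the three-step metric subset selection: ULR per metric,
# Spearman's correlation pruning, then an MLR pass over the survivors.
import statsmodels.api as sm
from scipy.stats import spearmanr

def select_metrics(df, metric_cols, alpha=0.05, rho_cut=0.8):
    y_binary = (df["bug"] > 0).astype(int)            # faulty / non-faulty label

    # Step 1: univariate logistic regression, keep significantly fault-correlated metrics
    significant = []
    for m in metric_cols:
        X = sm.add_constant(df[[m]].astype(float))
        ulr = sm.Logit(y_binary, X).fit(disp=0)
        if ulr.pvalues[m] < alpha:
            significant.append(m)

    # Step 2: for each highly correlated pair (Spearman's rho), drop one metric
    kept = list(significant)
    for i, a in enumerate(significant):
        for b in significant[i + 1:]:
            rho, _ = spearmanr(df[a], df[b])
            if abs(rho) > rho_cut and a in kept and b in kept:
                kept.remove(b)                         # keep the first metric of the pair

    # Step 3: multivariate linear regression against fault counts,
    # retain only metrics whose coefficients remain significant
    X = sm.add_constant(df[kept].astype(float))
    mlr = sm.OLS(df["bug"].astype(float), X).fit()
    return [m for m in kept if mlr.pvalues[m] < alpha]
```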
Chapter 4: The fault prediction can be more useful if, besides predicting soft-
ware modules being faulty or non-faulty, their fault densities can also be predicted
accurately. We used the identified subset of significant fault-correlated metrics
with various count models to predict fault densities. The results of the prediction
are evaluated using confusion matrix parameters and a cost-benefit model. Our
results suggest that, among the five count models used, negative binomial
regression (NBR) analysis produced the best performance for fault prediction.
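As a rough illustration of fitting one of the count models, the sketch below fits a negative binomial regression to fault counts with statsmodels. The variable names (train, test, bug, loc) and the metric subset are assumptions made for the example, not values taken from the thesis.

```python
# Minimal sketch of a negative binomial count model for per-class fault counts.
# `train` and `test` are pandas DataFrames with the selected metric columns and
# a `bug` column holding the number of faults per class (assumed names).
import statsmodels.api as sm

selected = ["wmc", "cbo", "rfc", "loc"]                # example metric subset
X_train = sm.add_constant(train[selected].astype(float))
X_test = sm.add_constant(test[selected].astype(float))

nbr = sm.GLM(train["bug"], X_train,
             family=sm.families.NegativeBinomial()).fit()

predicted_faults = nbr.predict(X_test)                 # expected faults per class
predicted_density = 100.0 * predicted_faults / test["loc"]   # faults per 100 LOC
```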
Chapter 5: We extend the approach of fault density prediction presented in
the previous chapter to evaluate the effectiveness of the count models when we
identify the subset of significant fault-correlated metrics by classifying the faultiness
of the software modules into a binary class classification, i.e., faulty and non-faulty.
This analysis helps to decide whether the nature of the fault classification
(i.e., binary class classification or multi-class classification) used for the selection of
significant fault-correlated metrics affects the result of the fault density prediction.
We observed that the results are similar to those found in
the previous chapter. However, the values of the number of faults and the fault
densities predicted by the count models here are lower and fit more closely to
their actual values compared to the values predicted by the count models in the
previous chapter.
Chapter 6: We conclude our work in this chapter. We also discuss the future
scope of research in this area.
1.4 Summary
Software fault prediction is a technique to identify the faults in software mod-
ules without executing them. It aims to help the software validation and verification
process by targeting the quality assurance effort at the faulty modules. How-
ever, some issues, such as the identification of a metrics subset consisting of metrics
with significant fault-correlation and the prediction of software fault densities, are asso-
ciated with the fault prediction process and need to be resolved before ensuring
its practical use in software quality assurance. In this chapter, we highlighted
the issues in the fault prediction process, stated the objectives of the thesis, and
summarized how these objectives are pursued in the organization of our
thesis work.
Chapter 2
Related Work
In this chapter, we present a detailed survey of existing object-oriented metrics,
including the empirical studies of the metrics performed earlier. Later, we present
details of the public datasets used in our experimental study, various model
evaluation techniques, and a literature review of earlier studies related to
software fault prediction.
2.1 Object-Oriented Metrics
Software quality assurance aims to develop quality software that meets the cus-
tomer's requirements with the desired quality and is easy to maintain. In order to
assess and improve software quality during the development process, developers
and managers need to measure the software design. For this, software metrics
have been proposed. By using metrics, a software project can be quantitatively
analyzed and its quality can be evaluated. Generally, each metric is associated
with some structural property of the software, such as coupling, cohesion, or inheri-
tance, and is used as an indicator of an external quality attribute, such as reliability,
maintainability or fault-proneness [4].
There have been many object-oriented (OO) metrics suites proposed to capture
the structural properties of a software system. Chidamber and Kemerer proposed
a software metrics suite for object-oriented software in 1991, known as the CK metrics
suite [15]. Later on, several other metrics suites have also been proposed by
various authors. Harrison and Counsell proposed the MOOD metrics suite [32], Wei
Li et al. proposed a metrics suite for maintainability [45], and Lorenz and Kidd [48],
Briand et al. [11], Marchesi [50], Bansiya et al. [4] and Judith Barnard [6] have also
proposed their own metrics suites. All of these metrics suites contain static metrics.
Meanwhile, Yacoub et al. [71] and Arisholm et al. [2] have also proposed
dynamic metrics suites that capture the dynamic behavior of the software.
2.1.1 Metrics suites for object-oriented software
(1) C&K Metrics Suite: Chidamber and Kemerer defined a set of metrics
known as the CK metrics suite. Later on, they revised their metrics and pro-
posed an improved version [15]. This metrics suite contains six metrics,
namely WMC, DIT, NOC, CBO, RFC and LCOM, which are given in
Table 2.1 (an illustrative computation of some of these metrics is sketched
after the metrics suites listed in this section).
Table 2.1: CK metrics suite [15]
Coupling Between Object classes (CBO)
CBO for a class is a count of the number of other
classes to which it is coupled.
Lack of Cohesion in Methods (LCOM)
LCOM is the number of method pairs that do not
share a field minus the number of method pairs that do.
Depth of Inheritance Tree (DIT)
DIT is the measure of the depth of inheritance of a
class.
Response For a Class (RFC)
It is the number of methods of the class plus the
number of methods called by any of these methods
of the class.
Weighted Method Count (WMC)
WMC is the sum of the complexities of all the
methods defined in a class.
Number of Children (NOC)
This metric measures the number of imme-
diate descendants of the class.
(2) MOODS Metrics Suite: This metrics suite [32] provides measures of the
structural characteristics of OO programming. This metrics suite includes
six metrics, MHF, AHF, MIF, AIF, PF and CF, which are defined in Table
2.2.
Table 2.2: MOODS metrics suite [32]
Method Hiding
Factor (MHF)
MHF is the ratio of the sum of the invisibility of all
methods defined in all classes to the total number of
methods defined in the system under consideration.
Attribute Hiding
Factor (AHF)
AHF is the ratio of the sum of the invisibility of all
attributes defined in all classes to the total number of
attributes defined in the system under consideration.
Method In-
heritance Factor
(MIF)
MIF is the ratio of the sum of the inherited methods
in all classes of the system under consideration to the
total number of available methods (locally defined plus
inherited) for all classes.
Attribute Inheri-
tance Factor
(AIF)
AIF is the ratio of the sum of inherited attributes in
all classes of the system under consideration to the
total number of available attributes for all classes.
Polymorphism
Factor (PF)
PF is the ratio of the actual number of possible differ-
ent polymorphic situations for a class to the maximum
number of possible distinct polymorphic situations for
the class.
Coupling Factor
(CF)
CF is the ratio of the actual number of non-inheritance
couplings in the system to the maximum possible
number of couplings (both inheritance and non-
inheritance related).
(3) Wei Li & Henry Metrics Suite: Wei Li et al. [45] evaluated the C&K
metrics by using Kitchenham's metric evaluation framework and found some
deficiencies and ambiguities in the definitions of these metrics. For example,
CBO implies that all couplings are considered equal; however, a more
complete form of object coupling, depending on several circumstances,
needs to be defined. Accordingly, they proposed a more comprehensive
metrics suite consisting of six new metrics: Coupling Through Inheritance,
Coupling Through Message Passing, Coupling Through ADT, Number of
Local Methods and two size metrics, SIZE1 and SIZE2. Four of them can be
used for measuring coupling and cohesion, as defined in Table 2.3.
Table 2.3: Wei Li metrics suite [45]
Coupling Through
Inheritance
It is a measure of the inheritance of a class.
Coupling Through
Message passing
(CTM)
It is equal to the number of send statements defined
in a class.
Coupling Through
ADT (Abstract Data
Type) (CTA)
It measures the coupling, which occurs due to access
of ADT.
Number of local
methods (NOM)
It measures the total number of local methods defined
in a class.
SIZE1 Number of semicolons in a class.
SIZE2 Number of attributes + Number of local methods
(4) Lorenz and Kidd's Metrics Suite: Lorenz et al. [48] defined ten met-
rics in their metrics suite, which are classified into size metrics, inheritance
metrics and internal metrics, as given in Table 2.4.
Table 2.4: Lorenz & Kidd metrics suite [48]
PIM This metric counts the total number of public instance methods in a
class. Public methods are those that are available as services to other
classes.
NIM This metric counts all the public, protected, and private methods defined
in a class.
NIV This metric counts the total number of instance variables in a class.
Instance variables include private and protected variables available to
the instances.
NCM This metric counts the total number of class methods in a class. A class
method is a method that is global to its instances.
NCV The metric counts the total number of class variables in a class.
NMO The metric counts the total number of methods overridden by a subclass.
A subclass is allowed to define a method of the same name as a method
in one of its super-classes. This is called overriding the method.
NMI This metric counts the total number of methods inherited by a subclass.
NMA This metric counts the total number of methods dened in a subclass.
SIX It is the ratio of the number of overridden methods multiplied by the hierarchy
nesting level to the total number of methods.
APPM It is the ratio of the total number of method parameters to the total num-
ber of methods.
(5) Briand et al. Metrics Suite: Briand et al. define metrics at the
class level to measure the coupling that occurs due to the interactions between
classes. These metrics are given in Table 2.5.
Table 2.5: Briand metrics suite [11]
IFCAIC, ACAIC, OCAIC, FCAEC, DCAEC, OCAEC, IFCMIC, ACMIC,
OCMIC, FCMEC, DCMEC, OCMEC, IFMMIC, AMMIC, OMMIC, FMMEC,
DMMEC, OMMEC,
These coupling measures count the number of interactions between classes.
The measures distinguish the relationship between classes (friendship, inher-
itance, none), different types of interactions, and the locus of impact of the
interaction. The acronyms for the measures indicate the type of the inter-
actions counted: the first or first two letters indicate the relationship (A:
coupling to Ancestor classes, D: Descendants, F: Friend classes, IF: Inverse
Friends (classes that declare a given class c as their friend), O: Others, i.e.,
none of the other relationships). The next two letters indicate the type of
interaction: CA: there is a Class-Attribute interaction between two classes
c and d, if c has an attribute of type d. CM: there is a Class-Method inter-
action between two classes c and d, if class c has a method with a parameter
of type class d. MM: there is a Method-Method interaction between two
classes c and d, if c invokes a method of d, or if a method of class d is passed
as a parameter (function pointer) to a method of class c. The last two letters
indicate the locus of impact: IC: import coupling, this metric counts all the
interactions for a class c, where c is using another class. EC: export coupling,
this metric counts all the interactions for a class d, where class d is the used
class.
(6) Bansiya et al. Metrics Suite: Bansiya et al. [4] proposed eleven metrics,
which can be applied at the class level. They are given in Table 2.6
Table 2.6: Bansiya's metrics suite [4]
DAM It is the ratio of the total number of private attributes in a class
to the total number of attributes defined in the class.
DCC It is a count of the total number of classes to which a class is
directly related.
CIS This metric is a count of the number of public methods in a
class.
MOA The Measure of Aggregation metric is a count of the number
of data declarations whose types are user-defined types.
MFA The Measure of Functional Abstraction metric is the ratio
of the number of methods inherited by a class to the total
number of methods accessible by member methods of the
class.
DSC This metric is a count of the total number of classes in the
design.
NOH This metric is a count of the number of class hierarchies in
the design.
ANA This metric's value signifies the average number of classes
from which a class inherits information.
CAM It is a sum of the interactions of a method's parameters with
the maximum independent set of all parameter types in a
class.
NOP It is a count of the methods that can exhibit polymorphic
behavior.
NOM It is a count of the total number of methods defined in a class.
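Returning to the C&K suite from Table 2.1, the toy sketch below computes a few of those metrics over a simplified class model. The class representation and the use of unit method complexity for WMC are simplifying assumptions made for illustration, not part of the original suite definitions.

```python
# Toy computation of some CK metrics over a minimal, language-agnostic class model.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class ClassInfo:
    name: str
    parent: Optional[str] = None                                 # single inheritance only
    methods: Dict[str, Set[str]] = field(default_factory=dict)   # method -> fields it uses
    coupled_classes: Set[str] = field(default_factory=set)       # other classes referenced

def wmc(c: ClassInfo) -> int:
    # With unit complexity per method, WMC reduces to the number of methods.
    return len(c.methods)

def cbo(c: ClassInfo) -> int:
    return len(c.coupled_classes - {c.name})

def noc(c: ClassInfo, all_classes: List[ClassInfo]) -> int:
    return sum(1 for other in all_classes if other.parent == c.name)

def dit(c: ClassInfo, by_name: Dict[str, ClassInfo]) -> int:
    depth = 0
    while c.parent is not None:
        c = by_name[c.parent]
        depth += 1
    return depth

def lcom(c: ClassInfo) -> int:
    # Method pairs sharing no field minus pairs sharing at least one, floored at zero.
    ms = list(c.methods.values())
    pairs = [(ms[i], ms[j]) for i in range(len(ms)) for j in range(i + 1, len(ms))]
    p = sum(1 for a, b in pairs if not (a & b))
    q = len(pairs) - p
    return max(p - q, 0)
```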
Other Metrics Suites: Besides the metrics suites listed above, there are
some other metrics suites proposed by various authors. They are listed in Table
2.7.
Table 2.7: Other metrics suites
Author   Metrics   Significance of Metrics
Luis Fernandez,
Rosalia Pena [24]
Sensitive Class Cohesion Metric (SCOM).
SCOM is normalized to produce values in the range [0...1]. It is
more sensitive when calculating cohesion than the previously stated
cohesion metrics. This metric has an analytical threshold. It is simple,
precise, general and able to be automated, which are important prop-
erties for applicability to large-size software systems.
Letha Etzkorn,
Harry Delugach [21]
Logical Relatedness of Methods (LORM), LORM2,
LORM3, Class Domain Complexity (CDC), Relative
Class Domain Complexity (RCDC), Class Interface Complexity
(CIC), Semantic Class Interface Definition Entropy (SCIDE).
This metrics suite provides a high-level, semantic, domain-oriented view
of object-oriented software compared to the traditional syntactically-oriented
view, and therefore it can be more accurate in many cases than syntactic
metrics.
Andrian Marcus,
Denys Poshyvanyk [56]
The Conceptual Cohesion of Classes (C3), Lack of Conceptual
Similarity Between Methods (LCSM), the conceptual coupling of a
class.
The C3 and LCSM metrics help to identify special cases like wrappers or
classes that have several concepts implemented into a set of classes. The
above-stated coupling metrics capture new dimensions in coupling measure-
ment compared to existing structural metrics.
Gui Gui,
Paul D. Scott [28]
Weighted Transitive Coupling (WTCoup), Weighted Transitive Cohesion
(WTCoh)
These metrics possess two significant characteristics. First, they use a numeric
measure of the degree of coupling or similarity between entities rather than
a binary quantity. Second, they include indirect coupling mediated by in-
tervening entities. The proposed coupling and cohesion metrics are very good
predictors of the number of lines of code required to make simple modifi-
cations to Java components retrieved from the internet.
Bela Ujhazi,
Rudolf Ferenc,
Denys Poshyvanyk and
Tibor Gyimothy [9]
Conceptual Coupling Between Object classes (CCBO), Conceptual
Lack of Cohesion on Methods (CLCOM5)
These metrics assume that the methods and classes of object-oriented soft-
ware are connected in more than one way, and that the most explored and eval-
uated sets of relations among methods and classes are based on data and
control dependencies. The proposed metrics rely on parameterized concep-
tual similarities among methods, which require specifying a threshold for oper-
ational measures.
Judith Barnard [6]
Calls to Foreign classes (CBO), Depth of Inheritance (DIT), Meaningful
Description of a class (MDc), Meaningful Name of a class (MNc)
These reusability metrics are used to provide a value of reusability for a class
irrespective of the programming language and can be used to guide the
programmer to write more reusable code.
Meghan Revelle,
Malcom Gethers and
Denys Poshyvanyk [58]
Structural Feature Coupling (SFC), Structural Feature Coupling prime
(SFC'), Conceptual Similarity between Methods (CSM), Conceptual
Similarity Between a Method and a Feature (CSMF), Textual Fea-
ture Coupling (TFC), Maximum Textual Feature Coupling (TFC-
max), Hybrid Feature Coupling (HFC).
These metrics capture feature-level coupling by using structural and tex-
tual information. These metrics are useful since they are good predictors
of fault-proneness. Additionally, they have an application in feature-level im-
pact analysis to determine whether a change made to one feature may have undesir-
able effects on other features.
Jehad Al Dallal,
Lionel C. Briand [17]
Similarity-based Class Cohesion (SCC),
Direct Method Invocation (DMI), Method Invocation (MI), Direct
Attribute Type (DAT), Attribute Type (AT) matrix.
These metrics account for all types of interactions between the class members:
method-method interactions, attribute-attribute interactions,
attribute-method interactions,
and method-method-invocation interac-
tions. Both direct and transitive inter-
actions are considered. These metrics
are more useful for predicting fault oc-
currence in statistical terms.
Sherif M. Yacoub,
Hany H. Ammar
and Tom Robinson [71]
Export Object Coupling, Import Object Coupling
This metrics suite provides a set of dy-
namic metrics used to measure the de-
sign at an early phase of software devel-
opment. These metrics can be used to
measure the run-time properties of
software.
Arisholm et al. [2]
IC_OD, IC_OM, IC_OC, IC_CD, IC_CM, IC_CC,
EC_OD, EC_OM, EC_OC, EC_CD, EC_CM, EC_CC
These metrics include most of the OO fea-
tures (inheritance, polymorphism and
dynamic binding) to accurately measure
the behavior of an OO program. These met-
rics may be used for various purposes,
such as focusing supporting documen-
tation on those parts of a system that
are more likely to undergo change, or
making use of design patterns to bet-
ter anticipate change.
2.1.2 Validation of object-oriented metrics
Many researchers proposed various metrics suites during 1990-92 for
the measurement of object-oriented software, but not all of them offered a
theoretical or an empirical validation. Chidamber and Kemerer proposed an
object-oriented metrics suite consisting of metrics to measure the characteristics
of object-oriented software [15]. These metrics were tested and evaluated by
many authors. A summary of the empirical studies related to the metrics is
given in Table 2.8. The first column of the table indicates the reference of the
authors that performed the validation of the OO metrics. The second column
refers to the external quality attributes that were targeted by the authors in
their studies. The third column indicates the set of metrics used for the study,
followed by the results of the study in the last column.
Table 2.8: Summary of empirical study on object-oriented
metrics
Author   Variable Analyzed   Metrics Used   Results
Basili et
al. [7]
Fault
proneness
All C&K
metrics
All metrics except LCOM were good
predictors of fault proneness.
Briand et
al. [52]
Fault
proneness
CBO, RFC,
LCOM
All metrics were significantly corre-
lated to fault proneness.
Tang et
al. [64]
Fault
proneness
WMC,
RFC
Both metrics were correlated to
fault proneness.
Briand et
al. [13]
Fault
proneness
All C&K
metrics
All metrics except LCOM were
correlated to fault proneness.
El Emam et
al. [20]
Fault
proneness
All C&K
metrics
All metrics were correlated to fault
proneness.
Chidamber
et al. [16]
Productivity,
design
effort
All C&K
metrics
CBO & LCOM associated with pro-
ductivity and design work.
Binkley et
al. [10]
Maintenance
code change
CBO, NOC Only CBO was correlated with the
code change.
Wei Li et
al. [45]
Maintenance
effort
All C&K
metrics
All metrics except CBO were corre-
lated with maintenance effort.
Mohammad
Al-
shayeb and
Wei Li [1]
Design ef-
fort, Main-
tenance
effort
CTA,
CTM, NLM
None of the metrics was found to be sig-
nificant for predicting maintenance
effort in software development.
Wei Li,
Raed Shat-
nawi [61]
Error
proneness
CTA,
CTM, NLM
All metrics were significantly associ-
ated with error proneness and were found
to be good predictors of class er-
ror probability in all error severity
categories.
Wei Li,
Huaming
Zhang [62]
Error
proneness
CTA, CTM Both metrics were associated with
error proneness.
Wei Li,
Sallie
Henry [45]
Maintenance
effort
DAT,
MPC,
SIZE1,
SIZE2,
NOM
All metrics were correlated with
maintenance effort. The SIZE metrics
could account for a large portion
of the total variance in maintenance
effort.
Hector
M. Olague
et al. [53]
Error
prone-
ness, fault
proneness
All MOOD
metrics
None of the metrics was correlated
with the prediction of fault proneness.
P.M. Shan-
thi et
al. [60]
Error
proneness
All MOOD
metrics
All metrics were associated with er-
ror proneness.
Ayaz
Farooq [22]
System size All MOOD
metrics
None of the MOOD metrics was associated
with the prediction of the system size.
Lorenz and
Kidd [48]
Static char-
acteristics
of a design
All the met-
rics of
Lorenz's
metrics
suite
A large number of instances increases
coupling and reduces reuse. A deeper
class hierarchy indicates poor sub-
class performance. NIM, SIX,
NCM, NIV and NMO were signifi-
cant predictors of quality attributes.
Briand et
al. [12]
Fault
proneness
All metrics
of
C&K met-
rics suites
and Briand
metrics
suite
The coupling metrics were found to be
important predictors of faults.
More specifically, the impact of
export coupling on fault-proneness
is weaker than that of import
coupling.
Briand et
al. [52]
Fault
proneness
All metrics
of
C&K met-
rics suites
and Briand
metrics
suite
Many of the coupling, cohesion, and
inheritance measures appear to cap-
ture similar dimensions in the data.
Coupling and inheritance measures
are strongly related to the proba-
bility of fault detection in a class.
Cohesion measures showed little rel-
evance to fault proneness.
Emam et
al. [20]
Fault
proneness
All Briand
metrics
Out of all the metrics, OCAEC,
ACMIC and OCMEC tend to be as-
sociated with fault-proneness.
Bansiya [4] Design
quality
assessment
All
the Bansiya
metrics
CAMC was shown to be an effective pre-
dictor of class cohesiveness. They built
a model for evaluating the overall
quality of an OO software system
based on its internal design proper-
ties and showed that the used met-
rics were significant for design
quality assessment.
Arisholm et
al. [3]
Fault
proneness
Complexity
metrics
LOC and WMC have been signifi-
cant predictors of fault proneness.
Zhou et
al. [72]
Fault
proneness
Complexity
metrics
LOC and WMC were better fault
predictors than the SDMC and AMC
metrics.
Mahmoud
et al. [19]
Fault
proneness
in package
level
Three met-
ric suites
(Mar-
tin, MOOD
CK)
Martin metrics suite was more accu-
rate than the MOOD and CK suites.
There have been few studies investigating OO metrics for fault prediction.
Moreover, they do not seem to provide any consolidated results. These studies
yielded mixed results, such that some of them confirmed the predictive
capabilities of the metrics while others prompted questions about these metrics.
By observing these studies, we found that:
- The earlier reported studies used different approaches
to validate the metrics and used different metrics suites. As a result, no
standard and widely accepted metrics have been found.
- Most of the authors carried out their studies using the C&K metrics suite, while
other metrics were not adequately investigated. Therefore, further investi-
gation and validation of these metrics are needed to ensure their usability
for fault prediction.
- Most of the authors used OO metrics for fault prediction without evalu-
ating their potential for fault-correlation. However, it is also required to
investigate the relationships between the metrics to determine a subset of
significant fault-correlated metrics for improved fault prediction.
2.2 Public Datasets
The datasets used in our study have been collected from the PROMISE data
repository [57]. The PROMISE data repository contains datasets for defect predic-
tion, effort estimation and text mining. Currently, it comprises 23 datasets, but
this number is constantly growing. The fault data were collected during require-
ments, design, development, unit testing, integration testing, system testing, beta
release, controlled release, and general release of each release of the software sys-
tem and were recorded in a database associated with the software. Therefore, these
datasets can be used to validate the performance of various fault-prediction
techniques. In the experiments of this thesis work, we used six fault datasets
with their twenty-two releases from the PROMISE data repository. All the used
software project datasets have been implemented in the Java programming lan-
guage. Each dataset contains the information of twenty OO metrics available at
the class level along with the fault information (number of faults) for each instance
(class). Most of the twenty metrics are object-oriented class metrics, such as
those defined in the metrics suites discussed above [15].
A detailed description of the datasets is tabulated in Table 2.9. This table con-
tains six columns. The first column gives the name of the project dataset.
The second column shows the number of instances (classes) present in each dataset.
The third column shows the number of non-commented lines of code (LOC). The
fourth column corresponds to the number of faulty instances out of all the
instances in the dataset. The fifth column corresponds to the number of non-
faulty instances out of all the instances in the dataset. The last column shows
the percentage of faults.
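For readers who want to reproduce such an analysis, the snippet below shows one way to load a PROMISE-style release and derive a binary faulty label from the fault counts. The file name and column names are assumptions about a local copy, not the repository's exact layout.

```python
# Illustrative loading of a PROMISE-style defect dataset: one row per class,
# OO metric columns, and a `bug` column with the number of faults (assumed names).
import pandas as pd

df = pd.read_csv("camel-1.4.csv")                      # hypothetical local copy

metric_cols = [c for c in df.columns if c not in ("name", "version", "bug")]
df["faulty"] = (df["bug"] > 0).astype(int)             # binary label from fault counts

print(len(df), "classes,",
      int(df["faulty"].sum()), "faulty,",
      round(100 * df["faulty"].mean(), 2), "% faulty")
```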
2.3 Evaluation Measures
Once a fault prediction model has been constructed, we need to evaluate it to
find out its capability of fault prediction. Confusion matrix parameters help with
this by reporting the performance of the prediction model.
A confusion matrix shows how the predictions are made by the model. The rows
correspond to the known class of the data, i.e., the labels in the data. The
columns correspond to the predictions made by the model. Table 2.10 shows the
confusion matrix for a binary class classification of the faults. All the measures
below can be derived from the confusion matrix.
Accuracy:
The prediction accuracy of a fault-prediction technique is measured as
(Footnote 1: In this thesis, we use the terms instance, class and module interchangeably; all of them refer to a class of an object-oriented software system.)
Table 2.9: Datasets used in the study
Project
Name
No. of
Instances
LOC No. of
Faulty
Instances
No. of
non-
Faulty
Instances
percentage
of Faults
Camel 1.0 340 19632 13 327 3.80%
Camel 1.2 609 36792 215 394 35.03%
Camel 1.4 873 49007 144 729 16.49%
Camel 1.6 966 57996 187 779 19.35%
Xalan-2.4 724 225088 109 615 15.05%
Xalan-2.5 804 304860 386 418 48.00%
Xalan-2.6 886 411737 410 456 46.27%
Xerces 1.2 441 159254 70 371 15.87%
Xerces 1.3 454 167095 68 386 14.97%
Xerces 1.4 589 141180 435 154 78.85%
Ivy 1.1 111 27292 61 50 54.94%
Ivy 1.5 241 59286 14 227 5.08%
Ivy 2.0 352 87359 39 313 11.07%
Velocity 1.4 196 51513 146 50 74.49%
Velocity 1.5 214 53141 140 74 65.42%
Velocity 1.6 229 57012 76 153 33.18%
PROP1 18472 3816692 2739 15733 14.82%
PROP2 23015 3748585 2432 20583 10.56%
PROP3 10275 1604319 1180 9095 11.48%
PROP4 8719 1508381 841 7878 9.60%
PROP5 8517 1081625 1299 7218 15.25%
PROP6 661 97570 66 595 9.90%
Table 2.10: Confusion matrix
                       Defect Present: No      Defect Present: Yes
Defect Predicted: No   TN = True Negative      FN = False Negative
Defect Predicted: Yes  FP = False Positive     TP = True Positive
Accuracy = (TN + TP) / (TN + TP + FN + FP)    (2.1)
False positive rate (FPR):
It is measured as the ratio of modules incorrectly predicted as faulty to the entire
set of non-faulty modules. False alarm and type-1 error are similar to FPR.
FPR = FP / (TN + FP)    (2.2)
False negative rate (FNR):
It is measured as the ratio of modules incorrectly predicted as non-faulty to the
entire set of faulty modules. Type-2 error is similar to FNR.
FNR = FN / (TP + FN)    (2.3)
Precision:
It is measured as the ratio of modules correctly predicted as faulty to the entire
set of modules predicted as faulty.
Precision = TP / (TP + FP)    (2.4)
Recall:
It is measured as the ratio of modules correctly predicted as faulty to the entire
set of faulty modules. Probability of detection (PD) is similar to recall.
Recall = TP / (TP + FN)    (2.5)
F-measure:
It is measured as the harmonic mean of precision and recall.
F-measure = (2 × Precision × Recall) / (Precision + Recall)    (2.6)
ROC curve:
An ROC curve provides a visualization of the tradeoff between the ability to cor-
rectly predict fault-prone modules (PD) and the number of incorrectly predicted
fault-free modules (PF). The area under the ROC curve (denoted AUC) is a
numeric performance evaluation measure used to compare the performance of fault-
prediction techniques. In ROC curves, the best performance is indicated by high PD
and low PF.
2.4 Subset Selection of Object-Oriented Met-
rics for Fault Prediction
An important issue associated with fault datasets in practice is the problem of
having too many metrics (attributes). Simply put, not all metrics are likely to be
necessary for accurate classification, and including them in the prediction model may
in fact lead to a worse model [42] [59]. Some work has been reported on
solving the subset selection problem in order to identify the significant software
metrics.
Guyon et al. [30] highlighted the key approaches used for attribute selection,
including feature construction, feature ranking, multivariate feature selection, ef-
ficient search methods and feature validity assessment methods. They concluded
that sophisticated wrapper or embedded methods improve predictive performance
compared to simple variable ranking methods like correlation methods, but the
improvements are not always significant: domains with large numbers of input
variables suffer from the curse of dimensionality, and multivariate methods may
overfit the prediction model.
Harman et al. [31] provided a comprehensive survey of the studies related to
search-based software engineering. They identified research trends and relation-
ships between the techniques applied and the applications to which they have been
applied, and highlighted gaps in the literature and avenues for further research.
Rodriguez et al. [59] performed an investigation using feature selection algorithms
with three filter models and three wrapper models over five software project
datasets. They concluded that the reduced datasets maintained the prediction
capability with fewer attributes than the original datasets. In addition, while it
was stated that the wrapper model was better than the filter model, it came at
a high computational cost.
Liu and Yu [47] provided a survey of feature selection algorithms and presented
an integrated approach to intelligent feature selection. Their study introduced
concepts and algorithms of feature selection, surveyed existing feature selection
algorithms for classification and clustering, and grouped and compared different algo-
rithms with a categorizing framework based on search strategies, evaluation
criteria, and data mining tasks, revealing unattempted combinations and providing
guidelines for selecting feature selection algorithms. They stated that as data min-
ing develops and expands to new application areas, feature selection also faces
new challenges that need to be further researched.
Khoshgoftaar et al. [42] reported a study of selecting software metrics for de-
fect prediction. Their study focused on the problem of attribute selection in the
context of software quality estimation. They presented a comparative investi-
gation for evaluating their proposed hybrid attribute selection approach. Their
results demonstrated that the automatic hybrid search algorithm performed the
best among the feature subset selection methods. Moreover, performances of the
defect prediction models either improved or remained unchanged when over 85%
of the software metrics were eliminated.
The studies listed above investigated the subset selection problem using filter- and wrapper-based approaches. There are several issues associated with these approaches (a short code sketch contrasting the two appears after these points).
Wrappers use a search algorithm to search through the space of possible attribute subsets and evaluate each subset by running a model on it. They are generally computationally expensive and carry a risk of overfitting to the model [68].
The subset obtained by wrapper methods lacks generality, since it is tied to the bias of the classifier used in the evaluation function.
Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter criterion is evaluated. Since they evaluate the structural properties of the data rather than being tied to a particular classifier, their results exhibit more generality.
Filters often return the full attribute set as the optimal solution, which forces the user to select an arbitrary cutoff on the number of attributes to be used for model building.
The earlier studies were based on cross-validation instead of an independent test dataset. Most of the studies used a single release of the software for investigation and validation of results. This creates a problem of multiple comparisons, which makes it difficult to generalize the real-world consequences of the results.
Evaluating the model on a single release of the dataset often produces models that are smaller (simpler) than the real patterns in the data.
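To make the filter/wrapper distinction concrete, the following scikit-learn sketch contrasts a simple filter (univariate ranking with an arbitrary cutoff k) against a wrapper (recursive feature elimination around a specific classifier). It is an illustrative example on synthetic data only, not the selection procedure proposed later in this thesis:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=19, n_informative=6, random_state=1)

# Filter: rank attributes with a univariate statistic; the cutoff k=6 is arbitrary.
filt = SelectKBest(score_func=f_classif, k=6).fit(X, y)
print("filter keeps attributes:", filt.get_support(indices=True))

# Wrapper: search subsets by repeatedly refitting a chosen classifier (costlier, classifier-biased).
wrap = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6).fit(X, y)
print("wrapper keeps attributes:", wrap.get_support(indices=True))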
2.5 Fault Prediction Studies
Software fault prediction is a technique to identify fault-prone modules before the testing phase by using the underlying properties of the software. It aims to streamline the effort to be applied in the later phases of software development. Typically, fault prediction is done by training a prediction model over some known project data augmented with fault information, and subsequently using the prediction model to predict faults for unseen projects. Existing studies in software fault prediction mainly predict faults from two perspectives: binary-class classification of faults and multi-class classification of faults (i.e., fault densities).
2.5.1 Binary class classification of the faults
This type of fault prediction classifies the modules of a software system into two classes, i.e., either faulty or non-faulty. To construct these fault prediction models, two methods are generally used: supervised learning and unsupervised learning. Each is used in a different application context. When a new system without any previous release is built, unsupervised learning needs to be adopted to predict fault-prone subsystems. After some subsystems are tested and put into operation, these pre-release subsystems can be used as training data to build software fault prediction models for new subsystems; in this case,
supervised learning is used. The difference between supervised and unsupervised learning is whether the class labels of the training data are known: if they are unknown, the learning is unsupervised; otherwise, it is supervised.
There have been many efforts reported earlier to predict fault proneness of software modules in terms of modules being faulty or non-faulty [18] [51] [35]. The authors of these studies have used different techniques such as Genetic Programming, Decision Trees, Neural Networks, Naive Bayes, Fuzzy Logic, Logistic Regression, etc. for predicting the faultiness of software modules [14].
S. S. Gokhale et al. [63] performed a fault prediction study over an industrial dataset using Regression Tree and Density modeling techniques to build their fault prediction models. They found that the Regression Tree based prediction model produced higher prediction accuracy and a lower misclassification rate compared to the Density based prediction model.
Lan Guo et al. [29] carried out an empirical investigation using Dempster-Shafer
(D-S) belief network, Logistic Regression and Discriminant Analysis based tech-
niques over KC2 NASA dataset. They evaluated the prediction models by using
various performance measurement parameters and concluded that accuracy of
D-S belief networks based prediction model was higher than Logistic Regression
and Discriminant Analysis based model.
A. Koru et al. [44] reported a study of fault prediction using J48 and KStar techniques on public datasets. They suggested that it is better to perform defect prediction on data that belongs to the large modules. They found that defect prediction using class-level metrics produced better performance compared to method-level metrics.
Venkata U.B. Challagulla et al. [66] performed a comparative study using various machine learning techniques: Linear Regression, Pace Regression, Support Vector Regression, Neural Network, Support Vector Logistic Regression, Neural Network for a discrete goal field, Logistic Regression, Naive Bayes, Instance Based Learning, J48 Tree, and 1-Rule. They used four public datasets and evaluated the potential of the prediction models using various parameters. They showed that the combination of 1R and Instance-based Learning gives better prediction accuracy, and that size and complexity metrics alone are not sufficient for efficient fault prediction.
B. Turhan et al. [65] built fault prediction models using the Naive Bayes machine learning technique. They used seven NASA datasets and suggested that the independence assumption of Naive Bayes was not harmful for defect prediction in datasets with PCA preprocessing. They found that assigning weights to static code-level metrics can significantly increase the performance of fault prediction models.
Elish et al. [18] compared the performance of Support Vector Machines (SVMs) with various other machine learning techniques over the NASA datasets and stated that the performance of SVM is in general better than, or similar to, the other machine learning techniques. Kanmani et al. [41] investigated Probabilistic Neural Networks (PNN) and Back-propagation Neural Networks (BPN) using fault data collected from student projects and found that the performance of PNN is better compared to BPN. Menzies et al. [51] empirically investigated Naive Bayes with a logNum filter for fault proneness and found that Naive Bayes with the logNum filter was the best fault prediction model among the prediction models used. Huihua Lu et al. [35] investigated Random Forest and FTF techniques for fault prediction over the NASA datasets and found that the semi-supervised technique outperforms the supervised technique.
Catal et al. [14] presented a literature review of fault-prediction studies from 1990 to 2009. They reviewed the results of previous studies as well as discussed the current trends in fault prediction. They concluded that, until then, no study in the literature had investigated the impact of fault prediction on the software development process. They also highlighted that coming up with a method to assess the effectiveness of fault-prediction studies, if adopted in a software project, would be helpful for the software community.
These studies show that a lot of research has been done in the field of software fault prediction. However, most of these studies resulted in a high misclassification rate (normally 15 to 35%) and a modest classification accuracy (normally 70 to 85%). This shows the need for more specific studies on the effect of fault prediction on software quality. In this thesis, we address one of the major and complex problems in software fault prediction studies, i.e., how to determine the best possible subset of OO metrics that produces an improved fault prediction performance. As a solution, we propose an approach for determining a subset of OO metrics for fault prediction.
2.5.2 Number of faults and the fault densities prediction
There have been few efforts examining the fault proneness of software modules in terms of predicting the fault density or the number of faults in a given module [54] [36] [27].
Graves et al. reported a study using the fault history of software modules [27]. They performed their study over a large telecommunication system consisting of 1.5 million lines of code and considered different file characteristics. They found that module size and other software complexity metrics were generally poor predictors of fault likelihood. The best predictors were combinations of a module's age, the changes made to the module, and the ages of the changes.
Ostrand et al. [54] used negative binomial regression (NBR) analysis to predict the fault proneness of software modules. In their study, an NBR model was developed and used to predict the expected number of faults and the fault density in every module of the next release of the system. The prediction models were based on the number of lines of code, faults and the modification history of the software modules. They applied this prediction model to two large industrial systems and found that the NBR model was very accurate in identifying the fault proneness of the software. In another study [54], the same authors compared three different variations of LOC-based NBR models to predict fault densities. They used the NBR model to predict the number of faults in each file of the software, sorted the files in decreasing order of predicted fault content, and then selected the first 20% of the files. They found the model to be accurate in terms of predicting faults in the top 20% of the files.
Janes et al. [36] reported a study using NBR analysis to predict fault proneness. They investigated the relation between object-oriented metrics and class defects in a real-time telecommunication system. They built different prediction models and found the zero-inflated negative binomial regression model to be the most accurate for fault prediction.
Recently, Liguo et al. [46] performed a case study using NBR analysis to predict fault proneness in an open-source software system. They compared the performance of the NBR model with a Binary Regression model and found that, in predicting fault-prone modules, the NBR model could not outperform Binary Regression, but they suggested that NBR is effective in predicting multiple errors in one module.
Kehan Gao et al. [25] reported a comprehensive study of count models for fault prediction over a full-scale industrial software system. They concluded that, among the different count models, the zero-inflated negative binomial and the hurdle negative binomial models demonstrated a better correlation with fault proneness.
These studies show that some earlier efforts have been made to predict fault densities, but they did not provide enough evidence of the significance of count models for predicting fault densities. Moreover, the selection of a count model for optimal performance is still equivocal. Ostrand et al. [54] applied an NBR model to predict the number of faults and the fault densities in each file of the software. They made use of the change history and the LOC metric of the files to determine faults, without establishing the appropriateness of these metrics for the NBR model. Kehan et al. [25] reported a comprehensive study of eight count models for fault prediction. They evaluated the quality of the fitted count models using hypothesis testing and goodness-of-fit parameters. However, no evaluation was provided to assess the potential of the count models to predict fault densities.
2.6 Summary
This chapter presented a brief introduction to the concepts related to our study. In particular, we gave a description of the object-oriented metrics suites proposed by different authors, along with the empirical studies that validated these metrics suites. Later on, we discussed the studies related to software fault prediction, the measures used to evaluate the performance of fault-prediction techniques and information on the available public dataset repositories. Here, we also summarized the studies on subset selection of significant metrics and framed a background for the same.
Chapter 3
A Framework for Subset
Selection of Object-Oriented
Metrics for Fault Proneness
Software metrics aim to provide the measurements needed to assess the quality of a software system with the desired accuracy and at a lower cost. However, the difficulty lies in knowing which metrics actually capture the important quality attributes of a class, such as fault proneness. Several efforts have been reported to validate these class-level object-oriented metrics with respect to fault proneness [1] [26] [52] [64] [61]. These studies yielded mixed results, with some studies confirming the predictive capabilities of the metrics and others prompting questions about them [23]. In their study [43], Kitchenham reported the limitations of earlier metrics studies. Their study suggested that the results of empirical studies are not readily comprehensible: the context of metrics validation and the relationship of the metrics with fault proneness were not properly investigated. There is a possibility that some of the metrics depend on the project characteristics. Some of them contain redundant information, do not add any new information, or, worse, have an adverse effect on the other metrics.
In this chapter, we aim to investigate the relationship of existing class-level object-oriented metrics with fault proneness over multiple releases of software systems in order to identify the metrics with significant fault-correlation. The metrics subset selection process is undertaken in three steps. In the first step, we assess the fault proneness of each metric separately by performing Univariate Logistic
Regression (ULR) analysis and select those metrics having higher fault-correlation. In the next step, we analyze the pairwise correlation among the selected metrics by performing Spearman's correlation analysis. Each time a higher correlation between a pair of metrics is observed, we check the performance of the metrics individually and in combination for fault prediction and either select one of the metrics or keep both, whichever produces the better fault prediction result. In the last step, we construct Multivariate Linear Regression (MLR) models to further reduce the metrics and identify a group of metrics that are more significant for fault proneness.
Finally, we use the identified metrics subset for fault prediction to estimate the overall accuracy and misclassification errors over the subsequent releases of the same project datasets used for the investigation. We use the confusion matrix criteria Accuracy, Precision, Recall and AUC (area under the ROC curve) to evaluate the performance of the prediction models. To perform our investigation, we used five datasets, namely Camel, Xalan, Xerces, Ivy, and Velocity, with their multiple successive releases, available publicly in the PROMISE data repository [57].
The rest of the chapter is organized as follows. Section 3.1 presents the approach of the experimental investigation to identify a subset of metrics that significantly correlate with fault proneness. In Section 3.2, we present our experimental setup, which includes information about the datasets, metrics (independent variables) and dependent variable used for the investigation, and the results of the investigation, followed by threats to validity. We discuss the implications of our results in Section 3.3.
3.1 The Approach
In this section, we present our approach to evaluate the potential of object-oriented (OO) metrics for fault proneness. We have constructed an algorithm, OO_subset, that takes the original set of metrics as input and evaluates each metric individually and in conjunction with the other metrics to determine a subset of significant fault-correlated metrics for an improved fault prediction performance. An overview of the proposed approach is illustrated in Figure 3.1.
[Figure 3.1 (Framework of proposed approach): Data Set (object-oriented metrics and faults found in software modules) → Univariate Logistic Regression Analysis (identifies the correlation of each metric with fault proneness) → Cross-Correlation Analysis between the Significant Metrics (identifies metrics that are highly correlated with each other) → Multivariate Linear Regression Analysis of Significant Metrics (finds the subset of significant metrics for an improved fault prediction performance) → Validation of Resulting Metrics to Estimate their Overall Prediction Accuracy.]
Algorithm: OO_subset()
// An algorithm for subset selection of fault-correlated OO metrics.
Initialization: X = [x_0, x_1, x_2, ..., x_n] is a vector of independent variables. Y = [y_0] is a vector of the dependent variable.
Declaration: Create empty vectors ULR = [], SR = [], MLR = [] to store the intermediate output of the logistic regression, Spearman's correlation and multivariate linear regression analyses, respectively.
Begin:
1. for each element x_i of the vector X, 0 ≤ i ≤ n, where n is the number of independent variables, do
1.1 perform univariate logistic regression analysis of x_i with Y.
1.2 Store the value of the regression coefficient, odds ratio and p-value.
1.3 If odds ratio > 1 && p-value < 0.05 && regression coefficient > 0, then
1.4 add the element to the vector ULR.
End if
End for
2. Extract each element from the vector ULR, perform its correlation analysis with all other elements of the vector and store their correlation values.
3. for each pair of elements,
3.1 if correlation ≥ 0.7, then
3.2 check their individual fault-correlation values and their combined fault-correlation value.
3.3 if the individual performance of one element is greater than their combined performance, then
3.4 keep the element with the higher fault-correlation value and discard the other; otherwise, keep both elements.
End if
End if
End for
4. for each element of the vector SR, do
4.1 perform a multivariate linear regression analysis,
4.2 for each element selected by the linear prediction model,
4.3 add it to the vector MLR.
End for
5. Combine the elements of MLR over all releases of the project.
6. The resulting vector is the subset of metrics significant for fault proneness.
End
This algorithm determines a subset of OO metrics that show significant fault-correlation for a given software system. It takes the vector of metrics (independent variables) as input and assesses the potential of each metric for fault proneness. First, we initialize a vector X with the independent variables and a vector Y with the dependent variable (fault proneness). Second, we declare three empty vectors, ULR, SR and MLR, to store the results of the intermediate steps. We then analyze the elements of vector X; in each intermediate step, some metrics are dropped based on that step's analysis results. The output of the algorithm is a subset of metrics for each project consisting of the metrics with significant fault-correlation.
Our proposed approach consists of four steps:
(1) Perform a ULR analysis to evaluate each metric separately for fault proneness.
This step evaluates each OO metric separately for fault proneness. Here, we perform binary univariate logistic regression (ULR) analysis by
considering fault as the dependent variable and the metrics as independent variables. To check the level of significance of each metric, we use three parameters of the LR model: (i) the regression coefficient, which shows the amount of correlation of the metric with fault proneness, (ii) the significance level (p-value), which shows the significance of the correlation, and (iii) the odds ratio, which represents the change in odds when the value of the independent variable increases by one. This step results in a subset of the metrics that are significant for fault prediction.
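The following statsmodels sketch shows how the three ULR parameters can be obtained for a single metric, assuming a pandas data frame with one column per metric and a binary "faulty" column (the function and column names are our own illustrative choices):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def ulr_parameters(df: pd.DataFrame, metric: str, target: str = "faulty"):
    """Univariate logistic regression of one metric against fault proneness."""
    X = sm.add_constant(df[[metric]])            # intercept plus the single metric
    model = sm.Logit(df[target], X).fit(disp=0)
    coef = model.params[metric]                   # regression coefficient
    p_value = model.pvalues[metric]               # significance of the coefficient
    odds_ratio = np.exp(coef)                     # change in odds per unit increase
    return coef, p_value, odds_ratio

# A metric would be retained when coef > 0, p_value < 0.05 and odds_ratio > 1.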
(2) Perform a pairwise Spearman's correlation analysis between the significant metrics.
This step determines the correlation between each pair of metrics. Here, we perform a pairwise Spearman's correlation analysis among the significant metrics and check for both positive and negative correlations. Each time a higher correlation between a pair of metrics is observed, we check the performance of these metrics individually and in combination for fault prediction and select either one metric or the pair of metrics, whichever performs better. If the correlation of an individual metric is poorer than that of the pair, we drop that metric, and we continue this process until all the metrics showing higher correlation have been examined. The remaining metrics are significant for further analysis. The significance of the correlation is tested at the 95% confidence level (p-value ≤ 0.05), and the degree of correlation is measured using the Hopkins criteria [34]. The outcome of this step is a subset of metrics that are significantly correlated with fault proneness and are not redundant with each other.
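A pairwise Spearman analysis of this kind can be sketched with pandas and scipy as follows; the 0.7 threshold mirrors the high-correlation criterion used in the algorithm above, and the column names are placeholders:

import pandas as pd
from scipy.stats import spearmanr

def highly_correlated_pairs(df: pd.DataFrame, metrics, threshold: float = 0.7):
    """Return metric pairs whose absolute Spearman correlation exceeds the threshold."""
    pairs = []
    for i, m1 in enumerate(metrics):
        for m2 in metrics[i + 1:]:
            rho, p = spearmanr(df[m1], df[m2])
            if p < 0.05 and abs(rho) >= threshold:
                pairs.append((m1, m2, rho))
    return pairs

# Each flagged pair would then be compared individually and in combination for fault
# prediction, keeping whichever alternative predicts faults better.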
(3) Determine a subset of metrics for an improved performance of fault prediction models.
There is a possibility that some metrics still remain in the subset due to their dependency on other metrics and can be reduced further. To investigate this issue, we construct Multivariate Linear Regression models. This analysis determines the best possible subset of metrics that can predict fault proneness when used in combination. In each model, a subset of metrics is selected and all others are discarded. At the end, this analysis results in a subset of the metrics that are more significant for predicting faults.
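A multivariate linear regression of this form can be fitted, for example, with statsmodels; the sketch below simply fits the surviving metrics together and reports which coefficients remain significant, which is a simplification of the stepwise model building described above:

import statsmodels.api as sm

def mlr_significant_metrics(df, metrics, target="faulty", alpha=0.05):
    """Fit one multivariate linear model and keep metrics with significant coefficients."""
    X = sm.add_constant(df[metrics])
    model = sm.OLS(df[target], X).fit()
    return [m for m in metrics if model.pvalues[m] < alpha]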
(4) Evaluate the resulting subset of metrics to estimate their overall prediction accuracy.
Finally, we construct fault prediction models to investigate the capability of the obtained metrics subset for fault proneness. The prediction models are built on the subsequent releases of the same software systems that were used in the above metrics selection. We use four machine-learning techniques, namely Naive Bayes, Logistic Regression, Random Forest and IBk. The aim of this step is to estimate the overall predictive accuracy of the metrics rather than to identify the best fault prediction technique; for this reason, the choice of fault prediction techniques is orthogonal to the intended contribution. To investigate the fault prediction capability of the different metric subsets, we use the confusion matrix criteria Accuracy, Precision, Recall and area under the ROC curve (AUC) [69].
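Although the thesis experiments were run in WEKA, an equivalent evaluation of the four techniques can be sketched in scikit-learn as follows (IBk corresponds to k-nearest neighbours; the helper name and parameter settings are illustrative assumptions, not the exact WEKA configuration):

from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

models = {"NB": GaussianNB(),
          "LR": LogisticRegression(max_iter=1000),
          "RF": RandomForestClassifier(n_estimators=100, random_state=1),
          "IBk": KNeighborsClassifier(n_neighbors=5)}

def evaluate(X_train, y_train, X_test, y_test):
    """Train each technique on one release set and score it on a later release."""
    results = {}
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        pred = clf.predict(X_test)
        score = clf.predict_proba(X_test)[:, 1]
        results[name] = {"accuracy": accuracy_score(y_test, pred),
                         "precision": precision_score(y_test, pred),
                         "recall": recall_score(y_test, pred),
                         "AUC": roc_auc_score(y_test, score)}
    return results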
3.2 Experimental Evaluation
In this section, we present our experimental study, which includes the experimental setup and information about the datasets, metrics (independent variables), dependent variable and the set of research questions used for the investigation.
3.2.1 Metrics set used for investigation
We have used existing class-level OO metrics to perform our investigation: nineteen measures of coupling, cohesion, inheritance, encapsulation and complexity of OO software systems. Since we focus on investigating the fault proneness of a given class, we selected only those metrics that are available at the class level [26]. Another reason for selecting these metrics is that they are all present in the datasets collected from the PROMISE data repository, which encouraged us to incorporate them in our study. The metrics used for the study are as follows:
WMC, CBO, RFC, DIT, NOC, IC, CBM, CA, CE, MFA, LCOM, LCOM3, CAM, MOA, NPM, DAM, AMC, LOC and max CC (abbreviated as CC). A detailed description of the metrics is given in the related work (see Section 2.1, Chapter 2).
3.2.2 Dependent variable
This study investigates the relationship between OO metrics and fault proneness. Therefore, we selected a measure of fault proneness as the dependent variable. In this study, owing to the requirements of the statistical techniques, we define fault proneness as a binary variable, which means a class is marked as either faulty or non-faulty. We mark a class as faulty if at least one fault is found in it, and as non-faulty if no fault is found. When we move from one release to the subsequent release of a software system, fault proneness is defined by the faults identified in that release.
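In the PROMISE class-level datasets the fault information is a per-class fault count, so this binary dependent variable can be derived with a one-line transformation. The file and column names below are assumptions that should be checked against the actual repository CSVs:

import pandas as pd

df = pd.read_csv("camel-1.0.csv")            # hypothetical file name
df["faulty"] = (df["bug"] > 0).astype(int)   # 1 = at least one fault, 0 = fault-free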
3.2.3 Project datasets
The datasets used in our study have been collected from the publicly available PROMISE data repository [57]. These datasets contain OO metrics and the faults found in the software modules during testing and after release. The proportion of faulty modules varies between approximately 3% and 74% in these datasets. We have used five projects, namely Camel, Xalan, Xerces, Ivy, and Velocity, with their sixteen successive releases to perform our study and to investigate our results [40]. All the datasets contain the same nineteen metrics. The size of the datasets varies from one to another. The names of the datasets with their subsequent releases are given in Table 3.1. A detailed description of these datasets is given in Section 2.2 of Chapter 2.
Table 3.1: Datasets used for study
Camel 1.0, Camel 1.2, Camel 1.4, Camel 1.6, Xalan-2.4, Xalan-2.5,
Xalan-2.6, Xerces 1.2, Xerces 1.3, Xerces 1.4, Ivy 1.1, Ivy 1.5, Ivy 2.0,
Velocity 1.4, Velocity 1.5 and Velocity 1.6
3.2.4 Research questions
The objective of this experiment is to identify the best possible subset of metrics with significant fault-correlation. We followed the GQM approach [8], in which we framed a set of research questions that were investigated by
obtaining the relevant measures. The research questions are as follows.
RQ 3.1: Does there exist a subset of object-oriented metrics that is significantly correlated with fault proneness?
This question aims to evaluate the metrics and test their relationship with fault proneness. We evaluate each metric individually for its correlation with fault proneness.
RQ 3.2: Do existing object-oriented metrics show a higher correlation with each other?
This question tests whether existing class-level metrics are correlated with each other. Here, we check the metrics for both positive and negative correlations to identify a subset for improved performance.
RQ 3.3: Does the identified subset of metrics improve the overall prediction accuracy and reduce the misclassification errors compared to considering the original set of metrics?
This question investigates the performance of the subset of metrics for predicting fault proneness.
The first two research questions evaluate the metrics for fault prediction in order to determine a subset of metrics that results in improved accuracy. Therefore, the significance of the first two questions is to support the investigation of the third question.
3.2.5 Experimental execution
To perform our investigation, we have used all five datasets with their multiple successive releases listed in Table 3.1. In order to incorporate multiple releases, we used a training and testing strategy. Training and testing are performed as follows: we first test the prediction models on the first release of the software, next we train the models on the first release and test them on the second release, then we train the models on the first two releases and test them on the third release, and we continue in this way until all the subsequent releases have been incorporated. All experiments were performed using the well-known
machine learning tool WEKA [49].
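The incremental train/test strategy described above can be sketched as a simple loop over the ordered releases of a project; the file names, column name "bug" and the Random Forest classifier here are illustrative choices, not the exact WEKA setup used in the thesis:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

releases = ["camel-1.0.csv", "camel-1.2.csv", "camel-1.4.csv", "camel-1.6.csv"]
frames = [pd.read_csv(r) for r in releases]
metrics = ["wmc", "cbo", "rfc", "noc", "npm", "ca"]   # identified subset for Camel (Table 3.7)

for i in range(1, len(frames)):
    train = pd.concat(frames[:i])                     # all releases before release i
    test = frames[i]                                  # the next release
    clf = RandomForestClassifier(n_estimators=100, random_state=1)
    clf.fit(train[metrics], (train["bug"] > 0).astype(int))
    score = clf.predict_proba(test[metrics])[:, 1]
    print(releases[i], "AUC =", roc_auc_score((test["bug"] > 0).astype(int), score))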
3.2.6 Results
This section presents a detailed description of the experimental results. We started the statistical analysis with nineteen metrics. As we progressed through the analysis steps, we eventually dropped some of them based on the intermediate analysis results.
3.2.6.1 Univariate logistic regression analysis
The results of the Univariate Logistic Regression (ULR) analysis are summarized in Tables 3.2 to 3.6. The column Metrics shows the independent variable used in the ULR. The columns Co., p-value and Odd ratio state the estimated regression coefficient, the statistical significance of the coefficient and the odds ratio of each metric. As discussed above, we selected only those metrics that have a positive regression coefficient, a p-value less than 0.05 and an odds ratio greater than 1. In each version of the datasets, some metrics were found to be significant for fault prediction while other metrics were not relevant for fault proneness. Moreover, as we moved from one release to the next, the set of associated metrics changed: some previously selected metrics were deselected and some new metrics were added. One possible reason for this is that the behaviour of the metrics depends on the characteristics of the project, and as we move from one release to another, these characteristics may change. We selected only those metrics that were significantly correlated with fault proneness in all releases of the project. The resulting metrics after the ULR analysis are given in Table 3.7.
Table 3.2: Univariate logistic regression analysis- Camel 1.0 to 1.4
Metrics Camel 1.0 Camel 1.0/1.2 Camel 1.0/1.2/1.4
Co. p-value Odd ratio Co. p-value Odd ratio Co. p-value Odd ratio
WMC ..45 0.017 1.046 0.027 0 1.027 0.037 0 1.038
DIT -0.34 0.237 0.712 -0.097 0.129 0.907 -0.003 0.944 0.997
NOC 0.145 0.007 1.156 0.101 0.001 1.106 0.099 0 1.104
CBO 0.03 0.002 1.031 0.014 0.002 1.014 0.015 0 1.015
RFC 0.02 0.044 1.02 0.012 0 1.012 0.017 0 1.017
LCOM 0.001 0.326 1.001 0.001 0.035 1.001 0.001 0.003 1.001
CA 0.028 0.003 1.028 0.012 0.005 1.012 0.013 0 1.013
CE 0.053 0.242 1.054 0.015 0.256 1.015 0.025 0.004 1.025
NPM 0.047 0.021 1.048 0.033 0 1.034 0.043 0 1.044
LCOM3 0.423 0.271 1.527 0.14 0.178 1.15 0 1 1
LOC 0.002 0.071 1.002 0.002 0 1.002 0.002 0 1.002
DAM -0.302 0.611 0.739 -0.187 0.244 0.83 0.079 0.523 1.082
MOA 0.291 0.141 1.338 0.15 0.011 1.162 0.177 0 1.194
MFA -1.7 0.06 0.183 -0.106 0.575 0.9 0.046 0.724 1.048
CAM -2.534 0.075 0.079 -0.584 0.053 0.558 -1.26 0 0.282
IC -0.727 0.245 0.483 -0.125 0.363 0.883 0.157 0.109 1.17
CBM -0.682 0.243 0.506 0.092 0.106 1.097 0.147 0 1.58
AMC -0.017 0.638 0.983 0.007 0.238 1.007 0.011 0.014 1.011
CC 0.339 0.277 1.403 0.445 0.001 1.561 0.404 0 1.498
Table 3.3: Univariate logistic regression analysis- Ivy 1.0 to 1.4
Metrics Ivy 1.0 Ivy 1.0/1.4
Co. p-value Odd ratio Co. p-value Odd ratio
WMC 0.108 0.002 1.114 0.03 0.005 1.03
DIT 0 1 1 -0.087 0.397 0.917
NOC 0.036 0.826 1.036 0.162 0.066 1.176
CBO 0.144 0.001 1.154 0.025 0.011 1.025
RFC 0.04 0 1.041 0.013 0 1.013
LCOM 0.012 0.027 1.012 0 0.177 1
CA 0.037 0.224 1.038 0.012 0.286 1.012
CE 0.23 0 1.259 0.093 0 1.097
NPM 0.111 0.006 1.117 0.035 0.007 1.035
LCOM3 -0.184 0.533 0.832 -..42 0.832 0.959
LOC 0.003 0.012 1.003 0.001 0.002 1.001
DAM 0.373 0.372 1.452 0.143 0.613 1.154
MOA 0.284 0.173 1.328 0.184 0.093 1.202
MFA 0.235 0.683 1.264 -0.425 0.228 0.653
CAM -3.272 0.001 0.038 -3.12 0 0.044
IC 0.557 0.095 1.746 0.214 0.161 1.239
CBM 0.323 0.081 1.381 0.124 0.107 1.132
AMC 0.03 0.023 1.031 0.017 0.001 1.017
CC 0.119 0.658 1.127 0.349 0.032 1.418
Table 3.4: Univariate logistic regression analysis- Velocity 1.4 to 1.5
Metrics velocity 1.4 Velocity 1.4/1.5
Co. p-value Odd ratio Co. p-value Odd ratio
WMC -0.013 0.211 0.987 0.004 0.616 1.004
DIT -1.03 0 0.357 -0.496 0 0.609
NOC 0.005 0.927 1.005 0.067 0.409 1.069
CBO 0.02 0.243 1.02 0.03 0.014 1.031
RFC -0.01 0.09 0.99 0.004 0.344 1.004
LCOM -0.001 0.306 0.999 0 0.573 1
CA 0.03 0.19 1.03 0.016 0.182 1.016
CE -0.02 0.364 0.981 0.023 0.185 1.023
NPM 0.007 0.711 1.007 0.023 0.108 1.024
LCOM3 0.194 0.406 1.214 -0.398 0.011 0.672
LOC -0.001 0.22 0.999 0 0.627 1
DAM -0.428 0.212 0.652 0.5 0.033 1.649
MOA 0 1 1 0.323 0.03 1.381
MFA -1.145 0.004 0.318 -0.816 0.001 0.442
CAM -0.824 0.232 0.438 -0.871 0.067 0.419
IC -1.152 0 0.316 -0.667 0 0.513
CBM -0.729 0 0.482 -0.418 0 0.658
AMC -0.032 0 0.969 -0.002 0.537 0.998
CC -0.187 0.117 0.83 0.044 0.632 1.045
Table 3.5: Univariate logistic regression analysis- Xalan 2.4 to 2.5
Metrics Xalan 2.4 Xalan 2.4/2.5
Co. p-value Odd ratio Co. p-value Odd ratio
WMC 0.037 0 1.038 0.025 0 1.025
DIT -0.039 0.576 0.962 0.058 0.104 1.06
NOC 0.041 0.202 1.042 0.045 0.027 1.046
CBO 0.018 0 1.018 0.009 0.001 1.009
RFC 0.023 0 1.023 0.015 0 1.015
LCOM 0.001 0 1.001 0 0 1
CA 0.013 0.009 1.013 0.008 0.02 1.008
CE 0.042 0 1.042 0.018 0.001 1.018
NPM 0.035 0 1.035 0.024 0 1.024
LCOM3 -0.385 0.011 0.681 -0.194 0.013 0.823
LOC 0.001 0 1.001 0.001 0 1.001
DAM 0.534 0.015 1.705 0.206 0.075 1.229
MOA 0.187 0 1.205 0.131 0 1.14
MFA -0.615 0.008 0.541 -0.115 0.352 0.891
CAM -3.559 0 0.028 -0.674 0.001 0.51
IC 0.228 0.011 1.256 0.114 0.002 1.12
CBM 0.079 0 1.083 0.057 0 1.058
AMC 0.01 0 1.01 0.001 0.018 1.001
CC 0.5 0 1.649 0.275 0 1.317
Table 3.6: Univariate logistic regression analysis- Xerces 1.2 to 1.3
Metrics Xerces 1.2 Xerces 1.2/1.3
Co. p-value Odd ratio Co. p-value Odd ratio
WMC 0.017 0.058 1.017 0.027 0 1.027
DIT -0.229 0.036 0.795 -0.11 0.137 0.895
NOC 0.012 0.728 1.012 0.021 0.363 1.021
CBO 0.012 0.421 1.012 0.034 0 1.034
RFC 0.009 0.009 1.009 0.015 0 1.015
LCOM 0.001 0.045 1.001 0.001 0 1.001
CA 0.006 0.73 1.006 0.017 0.15 1.017
CE 0.046 0.066 1.047 0.108 0 1.115
NPM 0.014 0.299 1.014 0.016 0.089 1.016
LCOM3 -0.175 0.371 0.84 -0.751 0 0.472
LOC 0 0.029 1 0 0 1
DAM 0.474 0.118 1.607 1.119 0 3.06
MOA 0.086 0.037 1.09 0.146 0 1.157
MFA -0.539 0.09 0.583 -0.043 0.843 0.958
CAM 0.305 0.565 1.356 -0.702 0.071 0.495
IC -0.012 0.949 0.988 0.543 0 1.72
CBM 0.042 0.266 1.043 0.11 0 1.117
AMC -0.001 0.696 0.999 0.003 0.032 1.003
CC -0.086 0.507 0.917 0.075 0.278 1.078
Table 3.7: Reduced metrics subset after ULR analysis
Camel WMC, CBO, RFC, NOC, NPM, CA
Xalan WMC, CBO, RFC, LCOM, CA, CE, LOC, NPM, MOA, CC
Xerces WMC, CBO, RFC, LCOM, LOC, MOA
Ivy WMC, CBO, RFC, CE, NPM, LOC, CAM, AMC
Velocity DIT, MFA, CC, IC
3.2.6.2 Correlation analysis between metrics
The results of Spearman's correlation analysis are summarized in Tables 3.8 to 3.12. It is observed from the tables that WMC, NPM, RFC, LOC and AMC were highly correlated with each other, which shows the strong structural association between these metrics. The CBO metric was not correlated with the CA and CE metrics, which shows that we need separate measures for measuring
import and export coupling, as CBO alone does not capture these aspects of coupling. The RFC-LOC and WMC-LOC pairs were correlated at a very high level, and the correlation values of WMC and RFC were higher than the correlation value of LOC. This suggests that RFC and WMC are good indicators of class complexity and that we do not need the LOC measure separately for measuring class size. The MFA and CAM metrics were negatively correlated with most of the metrics, which suggests that these metrics are not significant for fault proneness; as a consequence, we dropped these metrics from further analysis.
Table 3.8: Spearmans correlation analysis over Camel project dataset
wmc noc cbo rfc ca npm
wmc 1 0.134 0.566 0.888 0.244 0.918
noc 0.134 1 0.191 0.097 0.299 0.093
cbo 0.566 0.191 1 0.588 0.618 0.44
rfc 0.888 0.097 0.588 1 0.152 0.74
ca 0.244 0.299 0.618 0.152 1 0.237
npm 0.918 0.093 0.44 0.74 0.237 1
Table 3.9: Spearmans correlation analysis over Ivy project dataset
wmc cbo rfc ce npm loc cam amc
wmc 1 0.493 0.802 0.396 0.95 0.75 -0.783 0.37
cbo 0.493 1 0.472 0.377 0.437 0.401 -0.505 0.23
rfc 0.802 0.472 1 0.467 0.711 0.966 -0.763 0.807
ce 0.396 0.377 0.467 1 0.34 0.417 -0.408 0.335
npm 0.95 0.437 0.711 0.34 1 0.651 -0.722 0.27
loc 0.75 0.401 0.966 0.417 0.651 1 -0.75 0.866
cam -0.783 -0.505 -0.763 -0.408 -0.722 -0.75 1 -0.514
amc 0.37 0.23 0.807 0.335 0.27 0.866 -0.514 1
Table 3.10: Spearmans correlation analysis over Velocity project dataset
dit mfa cbm cc
dit 1 0.898 0.645 -0.264
mfa 0.898 1 0.584 -0.461
cbm 0.645 0.584 1 -0.101
cc -0.264 -0.461 -0.101 1
3.2.6.3 Multivariate linear regression analysis
Eliminating the metrics showing higher correlation with other metrics means that we identified a subset of the metrics that were individually significant for predicting fault proneness and not confounded with each other. It does not mean that we had the best set of metrics to be used in combination. To select the best possible subset of independent metrics, we constructed multivariate linear regression (MLR) models. The results of the multivariate linear regression analysis
Table 3.11: Spearmans correlation analysis over Xalan project dataset
wmc cbo rfc lcom ca ce npm loc moa amc cc
wmc 1 0.474 0.841 0.583 0.448 0.367 0.937 0.742 0.48 0.262 0.576
cbo 0.474 1 0.62 0.374 0.517 0.787 0.466 0.43 0.381 0.286 0.415
rfc 0.841 0.62 1 0.51 0.301 0.643 0.771 0.863 0.547 0.578 0.648
lcom 0.583 0.374 0.51 1 0.32 0.3 0.534 0.393 0.087 0.088 0.287
ca 0.448 0.517 0.301 0.32 1 0.067 0.426 0.226 0.195 -0.069 0.245
ce 0.367 0.787 0.643 0.3 0.067 1 0.347 0.469 0.439 0.437 0.392
npm 0.937 0.466 0.771 0.534 0.426 0.347 1 0.64 0.42 0.169 0.498
loc 0.742 0.43 0.863 0.393 0.226 0.469 0.64 1 0.491 0.783 0.629
moa 0.48 0.381 0.547 0.087 0.195 0.439 0.42 0.491 1 0.311 0.381
amc 0.262 0.286 0.578 0.088 -0.069 0.437 0.169 0.783 0.311 1 0.491
cc 0.576 0.415 0.648 0.287 0.245 0.392 0.498 0.629 0.381 0.491 1
Table 3.12: Spearmans correlation analysis over Xerces project dataset
wmc cbo rfc lcom loc moa
wmc 1 0.313 0.898 0.797 0.672 0.432
cbo 0.313 1 0.485 0.099 0.602 0.526
rfc 0.898 0.485 1 0.632 0.855 0.535
lcom 0.797 0.099 0.632 1 0.341 0.205
loc 0.672 0.602 0.855 0.341 1 0.583
moa 0.432 0.526 0.535 0.205 0.583 1
are summarized in Tables 3.13 to 3.16. An empty cell in a table indicates that the corresponding metric was not selected as an independent variable by the regression procedure in the corresponding model.
Table 3.13: Multivariate linear regression analysis over Camel project datasets
Datasets Constant WMC NOC CBO CA MOA CC
Camel 1.0 -0.04 0.008 0.011 - 0.003 - -
Camel 1.0/1.2 0.110 -0.03 0.025 0.029 0.027 0.041 0.079
Camel 1.0/1.2/1.4 0.129 -0.02 0.020 0.03 -0.03 0.030 0.044
The reduced subset of metrics after the MLR analysis for each project is given in Table 3.18.
3.2.6.4 Validation of prediction models over the successive releases
This section summarizes the validation results over the Camel 1.6, Xalan 2.6, Xerces 1.4, Ivy 2.0 and Velocity 1.6 project datasets. Figure 3.2 shows the accuracy, precision, recall and AUC values of the prediction models constructed using all nineteen metrics. Figure 3.3 shows the accuracy, precision, recall and AUC values of the prediction models built using the identified subset of metrics.
Table 3.14: Multivariate linear regression analysis over Ivy project datasets
Dataset Constant WMC CBO CE LOC AMC
ivy 1.1 0.2977 0.0173 -0.0549 0.0722 -0.0014 0.0086
Ivy 1.1/1.4 0.4364 -0.0067 -0.167 0.018 -0.0003 0.0031
Table 3.15: Multivariate linear regression analysis over Velocity project datasets
Dataset Constant DIT MFA CAM IC
Velocity 1.4 1.111 -0.3023 0.4943 - -0.2344
Velocity 1.4/1.5 0.7983 -0.1489 0.0194 - -0.1042
Shatwani & Li [61] stated that as a system evolves, it becomes increasingly difficult (inaccurate) to identify error-prone classes. We experienced the same scenario. By examining the figures, we observe that the models constructed using the identified subset of metrics produced the desired prediction accuracy, comparable with the models built using the original set of metrics. This confirms the ability of the selected metrics to predict faults in the subsequent releases of the software systems and shows that these metrics can be significant for predicting fault-prone modules on unseen project data. The ROC (AUC) values were above the level of discrimination in all cases (0.5 ≤ ROC < 0.6: no discrimination; 0.6 ≤ ROC < 0.7: poor discrimination; 0.7 ≤ ROC < 0.8: good discrimination; 0.8 ≤ ROC < 0.9: excellent discrimination; 0.9 ≤ ROC ≤ 1: outstanding discrimination). This shows that the identified subset of metrics resulted in reduced misclassification errors. Based on these findings, we answer our research questions.
RQ 3.1: We found that there exists a different subset of metrics for each project that is significantly correlated with fault proneness (Table 3.7). For the individual metrics, we found that CBO, RFC and the import and export coupling metrics are equally important for predicting faults. Among the complexity metrics, LOC, CC and WMC were selected by each prediction model, while the cohesion metrics were not found to be significant for fault proneness. This leads us to conclude that there exists a
Table 3.16: Multivariate linear regression analysis over Xalan project datasets
Dataset Constant WMC CBO LCOM CA CE LOC MOA CC
Xalan 2.4 0.0165 0.0101 -0.0082 -0.0001 0.0088 0.0049 - -0.022 -
Xalan 2.4/2.5 0.0859 - -0.0033 - 0.004 -0.0044 0.0001 -0.0121 0.0242
Table 3.17: Multivariate linear regression analysis over Xerces project datasets
Dataset Constant WMC CBO LCOM LOC MOA
Xerces 1.2 0.2454 -0.0092 -0.0517 0.0001
Xerces 1.2/1.3 0.2563 -0.0029 -0.0398
Table 3.18: Resulted subset of metrics after MLR analysis
Camel WMC, CBO, NOC, NPM, CA
Xalan WMC, CBO, LCOM, CA, CE, LOC, MOA, CC
Xerces WMC, RFC, LCOM, LOC, MOA, CBO
Ivy WMC, CBO, CE, LOC, AMC
Velocity DIT, MFA, IC
subset of metrics that is significantly correlated with faults.
RQ 3.2: From Tables 3.8 to 3.12, we found that some metrics showed a higher correlation with other metrics. This shows that, in order to judge the capability of each metric separately, it is necessary to eliminate the collinearity among the metrics, so that the models built on these metrics are accurate and do not suffer from collinearity.
RQ 3.3: Comparing Figures 3.2 and 3.3, we found that the identified subset of metrics produced an improved prediction accuracy and reduced misclassification errors. Therefore, we can say that the obtained subset of metrics improves the accuracy of fault prediction.
3.2.7 Threats to validity
Experiments are always associated with potential risks that can affect their findings. We present the possible risks as various validity threats and highlight our mitigation efforts to deal with them. They are as follows.
Construct validity: This concerns the quality of the choices of the independent and dependent variables; these choices affect the quality of the experimental findings. It includes questions such as: are we actually measuring what we intend to measure? Here, we are interested in identifying a subset of OO metrics for
[Figure 3.2: Results of the validation of prediction models constructed using the original set of metrics and four machine-learning techniques (NB, LR, IBK, RF). Four panels report Accuracy, Precision, Recall and AUC (scale 0 to 1) for Camel 1.6, Xalan 2.6, Xerces 1.4, Ivy 2.0 and Velocity 1.6.]
fault proneness. Therefore, we selected the faults found in a given class as the dependent variable. The differentiation between development-phase faults and post-release evolution-phase faults of a system is significant; the effectiveness of the metrics might vary with variations in how the faults were collected.
Internal validity: Internal validity concerns the causal relation between two variables. It raises questions such as: are the cause and effect related? does the cause precede the effect in time? are there no plausible alternatives that could affect the outcome of the experiment? We are interested in identifying a subset of metrics that are better fault-correlated for a given class. Therefore, we have used the OO metrics that are available at the class level for our experimental study. To incorporate these metrics into our study, we have used datasets available in the PROMISE data repository. All these values may vary with organizational benchmarks.
Conclusion validity: This validity checks for appropriate data collection and analysis. We calculated the various confusion matrix parameters for each project using the WEKA tool, and we used standard statistical data analysis, including a graphical method. Our results produced a different subset of metrics
[Figure 3.3: Results of the validation of prediction models constructed using the identified subset of metrics (after cross-correlation and MLR analysis) and four machine-learning techniques. Four panels report Accuracy, Precision, Recall and AUC (scale 0 to 1) for Camel 1.6, Xalan 2.6, Xerces 1.4, Ivy 2.0 and Velocity 1.6.]
for each of the projects used in our experimental investigation. One needs to understand the characteristics and the distribution of a dataset before selecting the metrics for a project from a new domain.
External validity: This investigates the potential threats when we try to generalize the causal relationships obtained beyond the systems studied. This is the most important aspect of an experimental study, and it requires great care and restraint to address the related threats. Our models are built and evaluated on datasets available in public data repositories. Systems developed inside an organization may follow different effort patterns. Therefore, the results of our investigation need to be considered in their context only; we do not suggest generalizing our research results to arbitrary project categories.
3.2.8 Discussion
The validation of OO metrics for fault proneness has already been undertaken by some researchers. However, the earlier studies made use of a single release of the software system and performed cross-validation analysis to investigate and validate their
results. Using multiple releases of a software system, identifying the metrics from previous releases and testing them on a later release has not been validated much. In this experimental study, we have used Logistic Regression analysis to evaluate each metric independently. The reason for choosing Logistic Regression is that it does not require the dataset to be normally distributed. We check each metric against the three parameters of the regression, i.e., the regression coefficient, p-value and odds ratio, to select the significant metrics. Furthermore, the metrics resulting from the ULR analysis are investigated using Spearman's correlation and multivariate linear regression analysis to determine the best possible subset of metrics that produces an improved performance of software fault prediction.
Based on the above correlation and regression analyses, we can draw a number of conclusions. We found that class-level OO metrics are significantly correlated with fault proneness. Comparing our results with the results of previous studies on open-source systems, we found that our results agree in part with those obtained by [53]. We found their results encouraging because their study used an open-source system, thus making the common part of our research findings more reliable.
It can be noticed from the investigation that it is possible to identify a metrics subset that is significantly correlated with fault proneness. However, we observed that the results for the five datasets are not the same: the subset of metrics was different for each project. One possible reason is that the difference between the metrics subsets depends on the project characteristics and is affected by the specific project domain. In our investigation, WMC, CBO, RFC, LOC, CA and CE are the metrics that were significantly correlated with fault proneness for most of the datasets. Our study identified a reduced subset of metrics for an improved fault prediction performance over the successive releases of the software.
3.3 Summary
Validation of object-oriented (OO) metrics for predicting software fault proneness is essential to ensure their practical use in building fault prediction models for OO software systems. Some of the OO metrics have previously been shown to be relevant for predicting fault proneness, but the other metrics have not been
much validated except by the authors who proposed them. For this purpose, we investigated the relationship of existing class-level OO metrics with the fault proneness of OO software systems to determine a subset of metrics that are better correlated with faults.
We evaluated the metrics individually as well as in conjunction with other metrics to determine a subset of significant metrics. Further, we investigated these metrics over the subsequent releases of the same software to estimate their overall prediction accuracy. Our results suggest that it is possible to identify a subset of the total available metrics that is able to predict fault proneness with higher accuracy and reduced misclassification errors.
Chapter 4
A Count Model Based Analysis
to Predict Fault Densities in
Software Modules
Software fault prediction is a technique to identify fault-prone modules before the testing phase by using the underlying properties of the dataset. It aims to streamline the testing and verification efforts to be applied in the later phases of software development. Typically, fault prediction is done by training prediction models over one part of some known fault data and measuring their performance against the other part of the fault data.
There have been many efforts comparing the performance of fault-prediction techniques on different project datasets using various performance evaluation criteria. However, many earlier fault prediction studies were based on the classification of fault data into two classes, namely faulty and non-faulty. There are several issues with this binary class classification. For example, even if the performance of the prediction model is reported to be excellent, the findings are hard to put into a proper usability context, i.e., identification of the actual number of faults. The binary classification of software modules as either faulty or non-faulty does not provide enough logistics to streamline the efforts that would ensure the identification of faults in the software system.
The main motivation behind software fault prediction is to identify and predict faults accurately, so that the effort required to find and fix them is minimized. Hence, the idea of software fault density prediction is more useful, as it assigns an expected number of faults to each module of the software. This prediction
can help the software quality assurance team to optimize the testing effort by targeting the modules having a larger number of faults.
The fault datasets available in software data repositories are stuffed with unnecessary information, which makes them difficult to use for fault prediction. There is also a possibility that some of the metrics depend on the project characteristics. Therefore, a major issue is identifying the subset of software metrics that show significant fault-correlation.
In this chapter, we first identify a subset of the project metrics suite that contains the metrics significant for fault-correlation by performing Multivariate Linear Regression (MLR) analysis. Subsequently, we use this subset of metrics with count models to predict fault densities. We performed our experimental investigation using five different count models and six successive releases of a software project dataset available in the PROMISE data repository [57]. The built count models assign an expected number of faults and a fault density to each module of the software. To predict fault densities, the count models were trained using the prior releases of the software project and tested on the later release. The results of the prediction were evaluated using confusion matrix parameters and a cost-benefit model.
The rest of the chapter is organized as follows. Section 4.1 describes the approach of the data analysis, including reviews of the count models, the cost-benefit model and the subset selection process. Section 4.2 contains information on the experimental evaluation, including the datasets, metrics (independent variables), the dependent variable and the results of our investigation, followed by threats to validity. Section 4.3 discusses the implications of our results.
4.1 The Approach
The proposed approach involves the initial identification of a subset of the project metrics suite that shows significant fault-correlation. Subsequently, the identified subset of metrics is used with the count models to predict fault densities (in our study, fault density = faults/100 lines of code). The built count models are validated using confusion matrix parameters and a cost-benefit model. In the following subsections,
[Figure 4.1: Overview of the proposed approach — Fault Dataset → Subset Selection of Fault-correlated Metrics → Construction of Count Models for Fault Densities Prediction → Prediction of Number of Faults and Fault Densities → Evaluating the Results of Five Count Models → Cost-benefit Analysis.]
we present the details of each step of the approach. An overview of the proposed
approach is given in Figure 4.1.
4.1.1 Selection of fault-correlated metrics
To determine a subset of project metrics, we use Multivariate Linear Regression
(MLR) analysis with the backward selection approach. MLR attempts to model
the relationship between two or more independent variables and a dependent
variable by fitting a linear equation to the observed data. Every value of an
independent variable x is associated with a value of the dependent variable y.
The regression line for n independent variables x_1, x_2, ..., x_n is defined to be
\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n. This line describes how the mean
response \mu_y changes with the independent variables. The observed values of y
vary about their means \mu_y and are assumed to have the same standard deviation
\sigma. Backward selection is a search technique in MLR which starts by considering
all the independent variables, tests the significance of each variable using a chosen
model comparison criterion, and deletes the variable (if any) whose removal does
not worsen the model by much. This process is repeated until no further improvement is
possible [69].
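For illustration, the backward selection step can be sketched as follows. This is a minimal sketch, assuming the release data are available as a pandas DataFrame with one column per OO metric and a fault-count column named bug; the column name and the 0.05 significance threshold are illustrative assumptions, not specifications from this thesis.

```python
# Backward elimination over an MLR model, dropping the least significant
# metric until every remaining metric is significant at the chosen level.
import pandas as pd
import statsmodels.api as sm

def backward_selection(data: pd.DataFrame, response: str = "bug",
                       alpha: float = 0.05) -> list:
    predictors = [c for c in data.columns if c != response]
    while predictors:
        X = sm.add_constant(data[predictors])      # add the intercept term
        model = sm.OLS(data[response], X).fit()
        pvalues = model.pvalues.drop("const")      # ignore the intercept
        worst = pvalues.idxmax()                   # least significant metric
        if pvalues[worst] <= alpha:                # all remaining metrics significant
            break
        predictors.remove(worst)                   # drop it and refit
    return predictors

# Example: selected = backward_selection(prop1_release, response="bug")
```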
We carried out our investigation using class-level object-oriented (OO) metrics as
independent variables and the fault proneness of a class as the dependent variable.
We used all six releases of the PROP dataset and performed MLR analysis
to identify the significant fault-correlated metrics. We compute this subset for
each release of the fault dataset incrementally. After identifying the metrics for each
release of the PROP dataset, we combine (take the union of) them to compute the
resulting metrics subset for later use. Subsequently, this identified metrics subset
is used with the count models for fault densities prediction.
4.1.2 Count model analysis
The identified subset of metrics is used to construct the count models. A count
model is a form of regression analysis used to model data where the dependent
variable is a count type. All count models aim to explain the number of occurrences
of an event. We built count models over all six releases of the software
by training the model on the earlier releases and testing it on the later release.
The benefit of training the model on earlier releases is that the model then contains
historical information about the domain, which can help the count models to better
predict fault densities. The faultiness of the modules was selected as the dependent
variable for the analysis. Since the number of faults in each release of the software
has a high variance, we performed a square root transformation to
reduce the influence of the outlier values and took the logarithmic transformation
of the LOC metric. These transformations help us to better fit the model in
terms of the log likelihood ratio.
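The two transformations mentioned above can be sketched as below; the column names bug and loc are illustrative assumptions rather than fixed dataset field names.

```python
# Variance-stabilising transformations applied before fitting the count models.
import numpy as np
import pandas as pd

def transform_release(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # The square-root transformation dampens the influence of high-fault outliers.
    out["bug_sqrt"] = np.sqrt(out["bug"])
    # log(LOC + 1) keeps zero-LOC rows defined while compressing very large sizes.
    out["log_loc"] = np.log(out["loc"] + 1)
    return out
```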
4.1.3 Evaluation of count models
Once the count models have been constructed, we can use the confusion
matrix parameters to evaluate their potential for fault densities prediction.
Since every count model assigns an expected number of faults to
each module of the software system, we use this information for fault prediction:
every module that contains one or more faults is marked as faulty, and modules
that contain zero faults are marked as non-faulty. These values can then serve as
the values of TP, FP, FN and TN (described in Chapter 2). We use them to
calculate the elements of the confusion matrix (i.e., accuracy, precision and recall)
in order to evaluate the overall accuracy of the count models.
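A sketch of how the expected fault counts can be turned into the confusion matrix parameters is given below; the arrays actual and predicted are assumed to hold, respectively, the actual and the model-assigned fault counts of the same modules, and the rule that a module with at least one (expected) fault counts as faulty follows the description above.

```python
# Derive TP/FP/FN/TN and the usual scores from actual and predicted fault counts.
import numpy as np

def confusion_scores(actual, predicted):
    actual_faulty = np.asarray(actual) >= 1       # one or more faults -> faulty
    pred_faulty = np.asarray(predicted) >= 1      # expected faults >= 1 -> predicted faulty

    tp = np.sum(actual_faulty & pred_faulty)
    fp = np.sum(~actual_faulty & pred_faulty)
    fn = np.sum(actual_faulty & ~pred_faulty)
    tn = np.sum(~actual_faulty & ~pred_faulty)

    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}
```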
4.1.4 Cost-benefit model
When we are using a fault density prediction model, we need a cost-benefit model
that quantifies the fault removal cost at different phases of software development.
This cost-benefit model can help to put the results of fault densities
prediction into the proper usability context. Essentially, the framework can provide
an estimate of the saving in effort achieved by using the results of the fault
densities prediction in the subsequent phases of software development.
Jiang et al. [38] introduced the cost curve, a measure of the cost effectiveness
of a classification technique, to evaluate the performance of a fault-prediction
technique. They concluded that cost characteristics must be considered to select
the best prediction technique. Deepak et al. [5] proposed a cost evaluation
framework in which they accounted for realistic fault removal costs of the different
testing phases, along with their fault identification efficiencies. In our
study, we have used their concept of a cost evaluation framework to construct
our cost-benefit model.
Table 4.1: Fault removal cost of testing techniques (in staff-hours per defect)
Type     Lowest   Mean    Median   Highest
Unit     1.5      3.46    2.5      6
System   2.82     8.37    6.2      20
Field    3.9      27.24   27       66.6

Table 4.2: Fault identification efficiencies of different testing phases
Type     Lowest   Median   Highest
Unit     0.1      0.25     0.5
System   0.25     0.5      0.65
The constraints that their framework includes are:
(1) Fault removal cost varies with the testing phase.
(2) It is not possible to identify 100% of the faults in a specific testing phase.
(3) It is practically not feasible to perform unit testing on all modules.
We have used the normalized fault removal costs suggested by Wagner et al. [67]
and the fault removal efficiencies of the different testing phases from the study of
Capers Jones [39] to formulate the cost-benefit model. The normalized costs are
summarized in Table 4.1. The efficiencies of the testing phases are summarized in
Table 4.2. Wilde et al. [70] stated that more than fifty percent of the modules
are very small in size, hence unit testing of these modules is unfruitful. We have
included this value (0.5) as the threshold for unit testing in our framework.
Equation 4.1 shows the proposed cost evaluation framework to estimate the overall
fault removal cost. Equation 4.2 shows the minimum fault removal cost without
the use of a count model. The normalized fault removal cost and its interpretation
are shown in Equation 4.3.

Ecost = C_i + C_u \cdot NoF \cdot \delta_u + \delta_s \cdot C_s \cdot (1 - \delta_u) \cdot NoF + (1 - \delta_s) \cdot (1 - \delta_u) \cdot C_f \cdot NoF    (4.1)

Tcost = M_p \cdot C_u \cdot TM + \delta_s \cdot C_s \cdot (1 - \delta_u) \cdot NoF + (1 - \delta_s) \cdot C_f \cdot (1 - \delta_u) \cdot NoF    (4.2)

NEcost = Ecost / Tcost    (4.3)

where:
Ecost - estimated fault removal cost of the software when we use a count model for fault prediction.
Tcost - estimated fault removal cost of the software without the use of a count model.
NEcost - normalized estimated fault removal cost of the software when we use a count model.
C_i - initial setup cost of the fault-prediction technique used.
C_u - normalized fault removal cost in unit testing.
C_s - normalized fault removal cost in system testing.
C_f - normalized fault removal cost in field testing.
M_p - percentage of modules unit tested.
TM - total number of modules.
NoF - total number of faults.
\delta_u - fault identification efficiency of unit testing.
\delta_s - fault identification efficiency of system testing.
We will see below how this framework helps to estimate a normalized cost for the
count models used, in order to determine their economic viability. For our study,
we use the median values of the fault removal costs of the testing techniques and
of the fault identification efficiencies of the different testing phases.
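A sketch of Equations 4.1-4.3 is given below. The default parameter values are the median costs and efficiencies from Tables 4.1 and 4.2 and the 0.5 unit-testing threshold mentioned above; the initial setup cost C_i defaults to zero, which is an assumption, since its value is not fixed in this chapter.

```python
# Cost-benefit sketch for one release, following Equations 4.1-4.3.
def necost(nof, tm, ci=0.0, cu=2.5, cs=6.2, cf=27.0,
           delta_u=0.25, delta_s=0.5, mp=0.5):
    """Return (Ecost, Tcost, NEcost).

    nof: total number of faults, tm: total number of modules,
    cu/cs/cf: normalized unit/system/field fault removal costs (median values),
    delta_u/delta_s: fault identification efficiencies (median values),
    mp: fraction of modules unit tested without a prediction model."""
    ecost = (ci
             + cu * nof * delta_u
             + delta_s * cs * (1 - delta_u) * nof
             + (1 - delta_s) * (1 - delta_u) * cf * nof)      # Eq. 4.1
    tcost = (mp * cu * tm
             + delta_s * cs * (1 - delta_u) * nof
             + (1 - delta_s) * cf * (1 - delta_u) * nof)      # Eq. 4.2
    return ecost, tcost, ecost / tcost                        # Eq. 4.3
```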
4.2 Experimental Evaluation
In this section, we present an experimental study to evaluate the performance
of count models for fault densities prediction. We have used five different count
models, namely: the Poisson regression model, the Negative Binomial regression model,
the Zero-Inflated Poisson regression model, the Generalized Negative Binomial
regression model and the Zero-Inflated Negative Binomial regression model, over the six
successive releases of a software project dataset consisting of nineteen class-level
object-oriented metrics. In this study, we investigated the prediction of fault
densities and of the number of faults for a given module. Therefore, we have selected
a measure of fault proneness as the dependent variable. The fault proneness
of a class is the probability that the class contains a fault, given the metrics for
that class. It is a key factor for monitoring and controlling the quality of the
software.
4.2.1 Metrics set used for the experiment
To perform our experimental investigation, we have used nineteen measures of the
coupling, cohesion, inheritance, encapsulation and complexity of an object-oriented
software system. They are as follows: WMC, CBO, RFC, DIT, NOC, IC, CBM,
CA, CE, MFA, LCOM, LCOM3, CAM, MOA, NPM, DAM, AMC, LOC and
CC (CC is the same as max CC in the PROP dataset). For each release of the PROP
dataset, we performed Multivariate Logistic Regression (MLR) analysis to test
whether each of the nineteen metrics would be a significant predictor in the count
models. The MLR analysis yields the subset of significant metrics corresponding
to each release of the dataset. The criterion used to select a metric is that it
should appear in at least 50% of the six releases of the dataset. Based on this
selection criterion, only eleven metrics (WMC, NOC, CBO, CA, CE, NPM, LOC,
CAM, DAM, LCOM3 and AMC) were selected for further analysis. The results
of this analysis are summarized in Table 4.3.
Table 4.3: Identified metrics for each release of the PROP dataset
Dataset   Metrics identified
PROP1     CBO, CE, LOC, DAM, CAM, LCOM3, LCOM and WMC
PROP2     RFC, CBO, WMC, LCOM3, NOC, CE, IC, DAM, CAM, AMC, LOC, NPM, MOA and CA
PROP3     LOC, CBO, NOC, LCOM3, CC, MOA, CAM, DAM, MFA, DIT, AMC and WMC
PROP4     RFC, CBO, LCOM3, NOC, CAM, NPM, LOC, IC, AMC, WMC, CC, CE, CA, LCOM, DIT, DAM and MFA
PROP5     CBO, NOC, LOC, LCOM3, WMC, NPM, LCOM and CE
PROP6     DIT, CBO, CA, CE, LCOM3, LOC, DAM and IC
4.2.2 Project dataset
We have used the PROP dataset with its six successive releases to perform our study
and to evaluate our results [40]. The PROP dataset is one of the largest datasets
available in the PROMISE data repository. This dataset was collected from a software
project developed inside an organization (commercial software) and
written in the Java programming language. Each release of the dataset consists of
the fault data of one or more versions of the project; for example, the PROP1 dataset
corresponds to versions 4, 40, 85, 121, 157 and 185, and the other releases similarly
correspond to other versions of the project. Each version of the project contains
some modification of the functionality, but not all versions introduce major
changes. Therefore, we group them into one dataset, and the versions that introduce
major changes are grouped into different releases, i.e., PROP1, PROP2, etc.
For each release, the same nineteen metrics have been calculated and recorded with
respect to the software modules. The size of the dataset varies from one release to
another, but for all the releases we collected the same nineteen metrics.
A detailed description of the dataset is given in Table 4.4.
Table 4.4: Details of the PROP project dataset used for the study
Version   Total no. of instances   No. of faulty instances   Total no. of faults
PROP1     18472                    2739                      5493
PROP2     23015                    2432                      4096
PROP3     10275                    1180                      1640
PROP4     8719                     841                       1362
PROP5     8517                     1299                      1930
PROP6     661                      66                        79
4.2.3 Count models
A count model, such as Poisson regression or negative binomial regression, is a form
of regression analysis used to model data where the dependent variable is
a count type. All count models aim to explain the number of occurrences, or
counts, of an event. The counts themselves have a variance that increases with
the mean of the distribution [46]. Count models inherit the basic idea of linear regression
by assigning a regression coefficient to each variable showing its contribution
to the dependent variable while holding the other independent variables
constant. These models retain all the power of linear regression models but
extend the analysis to predict the mean of variables that cannot reasonably be
assumed to be normally distributed.
In this subsection, we describe the different count models used for the experimental
investigation.
4.2.3.1 Poisson regression model
Poisson regression is the standard, or base, count response regression model. It is
based on the Poisson probability distribution, which is the fundamental method
used for modeling count response data. It assumes that the dependent variable
Y has a Poisson distribution and that the logarithm of its expected value
can be modeled by a linear combination of the independent variables.
Let Y_i equal the number of faults (dependent variable) observed in file i and
X_i be a vector of independent variables for the i-th observation. Given X_i, assume
Y_i is Poisson distributed with the probability density function (PDF)

Pr(Y_i | X_i) = \frac{e^{-\mu_i} \, \mu_i^{Y_i}}{Y_i!}    (4.4)

where \mu_i is the mean value of the dependent variable Y_i. To ensure that the
expected value of \mu_i is nonnegative, the link function, which expresses the relationship
between the expected value and the independent variables, should have the
form [33]

\mu_i = E(Y_i | X_i) = e^{X_i' \beta}    (4.5)

where \beta = [b_0, b_1, b_2, ..., b_k] denotes the vector of regression coefficients and X_i'
represents the transpose of X_i, which is equal to [1, X_i].
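A minimal sketch of fitting this model with the statsmodels library is shown below; the DataFrame and column names (train, test, metrics, bug) are illustrative assumptions, not part of the dataset specification.

```python
# Fit a Poisson regression on one release and score the expected faults on the next.
import statsmodels.api as sm

def fit_poisson(train, test, metrics):
    X_train = sm.add_constant(train[metrics])
    X_test = sm.add_constant(test[metrics])
    model = sm.Poisson(train["bug"], X_train).fit(disp=0)
    # predict() returns the expected number of faults mu_i = exp(X_i' beta)
    expected_faults = model.predict(X_test)
    return model, expected_faults
```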
4.2.3.2 Negative binomial regression model
Negative binomial models have been derived from two different origins. First, and
initially, the negative binomial can be thought of as a Poisson-gamma mixture
designed to model overdispersed Poisson count data. Conceived of in this manner,
estimation usually takes the form of a maximum likelihood Newton-Raphson
type algorithm. This parametrization estimates both the mean parameter and
the ancillary or heterogeneity parameter, \alpha.
In the context of our prediction model, we can view the negative binomial regression
(NBR) model as follows: let Y_i equal the number of faults observed in file i and
X_i be a vector of OO metrics for that module. The NBR model specifies that Y_i,
given X_i, has a Poisson distribution with mean \mu_i.
The variance of the negative binomial regression model is given as

Var(Y_i | X_i) = \mu_i (1 + \alpha \mu_i) = e^{X_i' \beta} (1 + \alpha e^{X_i' \beta})    (4.6)

If \alpha = 0, the negative binomial distribution reduces to a Poisson distribution.
The NBR model is appropriate to use when the data exhibit overdispersion. The
heterogeneity parameter \alpha, also known as the dispersion parameter, allows for
the type of concentration observed for faults.
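The NB2 model of Equation 4.6 can be fitted in the same way; the sketch below reuses the assumptions of the Poisson sketch above and additionally reads off the estimated dispersion parameter.

```python
# Fit a negative binomial (NB2) regression and report the dispersion parameter.
import statsmodels.api as sm

def fit_negative_binomial(train, test, metrics):
    X_train = sm.add_constant(train[metrics])
    X_test = sm.add_constant(test[metrics])
    model = sm.NegativeBinomial(train["bug"], X_train).fit(disp=0)
    alpha = model.params["alpha"]     # estimated heterogeneity (dispersion) parameter
    return model, alpha, model.predict(X_test)
```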
4.2.3.3 Zero-inflated count model
The Poisson and negative binomial distributions define an expected number of
zero counts for a given value of the mean: the greater the mean, the fewer zero
counts are expected. However, fault data normally come with a high percentage
of zero counts, which makes them hard to model with the Poisson or negative binomial
distribution. To counter this problem, the zero-inflated Poisson (ZIP) and zero-inflated
negative binomial (ZINB) models have been developed [33]. The data are assumed to come from a
mixture of two distributions, where the structural zeros from a binary distribution
are mixed with the non-negative integer outcomes (including zeros) from a count
distribution.
The ZINB model is similar to the ZIP model. The only difference is that, in
the case of the ZINB model, the negative binomial distribution is used for the
non-perfect modules group, as compared to the Poisson distribution used in the
ZIP model [33]. The general form of the zero-inflated model is given below.
ln(\mu_i) = X_i' \beta    (4.7)

logit(\pi_i) = ln\left( \frac{\pi_i}{1 - \pi_i} \right) = X_i' \gamma    (4.8)

where \beta and \gamma are the coefficient vectors to be estimated.
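A sketch of fitting the zero-inflated models with statsmodels' count-model classes follows; feeding the same metric matrix to the inflation (zero) part is an assumption made for simplicity, since any subset of metrics could drive the zero process.

```python
# Fit a ZIP or ZINB model: a logit model for the structural zeros mixed with a
# Poisson / negative binomial model for the counts (Eq. 4.7-4.8).
import statsmodels.api as sm
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                              ZeroInflatedNegativeBinomialP)

def fit_zero_inflated(train, test, metrics, negative_binomial=False):
    X_train = sm.add_constant(train[metrics])
    X_test = sm.add_constant(test[metrics])
    cls = ZeroInflatedNegativeBinomialP if negative_binomial else ZeroInflatedPoisson
    model = cls(train["bug"], X_train, exog_infl=X_train,
                inflation="logit").fit(disp=0, maxiter=200)
    expected_faults = model.predict(X_test, exog_infl=X_test)
    return model, expected_faults
```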
4.2.3.4 Generalized negative binomial regression model
The generalized negative binomial regression (gnbreg) model has been found useful in
fitting over-dispersed as well as under-dispersed count data. It is a type of
negative binomial regression in which the heterogeneity parameter itself can be
parameterized. It allows a generalization of the scalar overdispersion parameter
such that parameter estimates can be calculated showing how the model predictors
comparatively influence overdispersion. The generalized negative binomial variance
function has been formulated as

NB-G = \mu + \alpha \mu^p    (4.9)

where p is a third parameter to be estimated.
4.2.4 Results
[Figure 4.2: Result of the predicted number of faults using count models (PROP1-PROP6). For each release, bars compare the actual number of faults with the NBRM, P, ZIP, GNBR and ZIN predictions; y-axis: number of faults.]
This subsection presents a detailed description of the experimental results.
First, we discuss the results of the prediction of the number of faults and the fault
densities using the various count models. Next, we compare the overall accuracy
and effectiveness of the count models using confusion matrix criteria. Finally, we
present the results of the cost-benefit analysis of the count models to evaluate them
from an economic standpoint.
4.2.4.1 Prediction of the number of faults and fault densities
The built count models assign an expected number of faults and a fault density
to each module of the software. In each scenario, the count model is built on one
or more prior releases of the software and is evaluated on the latest release.
For example, the count model based on releases PROP1 and PROP2 is evaluated
on release PROP3; the exception is release PROP1, where training and testing were both
performed on the same dataset due to the unavailability of any prior release. The
same procedure has been followed for all the count models.
Figures 4.2 and 4.3 show the number of faults and the number of faulty modules
predicted by each count model.
[Figure 4.3: Result of the predicted number of faulty modules using count models (PROP1-PROP6). For each release, bars compare the actual number of faulty modules with the NBRM, P, ZIP, GNBR and ZIN predictions; y-axis: number of faulty instances.]
Each figure contains a graph for each release of the project dataset showing the
predicted number of faults and the number of faulty modules. The blue bars in
the figures show the actual number of faults and faulty modules contained
in that release of the project dataset; these are the optimal values, corresponding
to finding all the faults and faulty modules in each release. The other bars show
the comparison of prediction results among the count models. The quality of
each model is measured in terms of how close its predicted value comes to
the actual value.
By comparing the results from PROP1 to PROP6, we can observe that the number
of faults and faulty modules predicted by the negative binomial regression count
model is the closest fit to the actual number of faults. This result is consistent across
all releases of the PROP dataset, except PROP4, where the values predicted
by the Poisson regression count model are closest to the actual values. The second
most accurate prediction model is the Poisson (P) regression model, except for
PROP6, where ZIP is the second best predictor. All the remaining models assign
higher values than the actual values, which questions their viability for predicting fault
densities. These results are encouraging with respect to the accuracy of the overall
predictions made by the count models relative to the actual number of faults discovered.
Figure 4.4 shows the results of fault densities predicted by each count model. This
[Figure 4.4: Result of the fault densities prediction using count models (PROP1-PROP6). For each release, bars compare the actual (normalized) fault density with the NBRM, P, ZIP, GNBR and ZIN predictions.]
information can be useful if a practitioner wants to know which module is likely to
contain the highest number of faults relative to its size. The figure
shows the sum of the actual fault densities in the project datasets (blue
bars) and the fault densities predicted by the count models. For the sake
of simplicity, we normalized the values of the predicted fault densities: the actual value
is converted to 1, and all other values are normalized by dividing them by the
actual value. Therefore, the bar corresponding to each count model shows by how
much its predicted fault density increases or decreases with respect to the actual value. The
predictive capability of each count model is measured by how close its predictions
come to the actual values. It can be seen from the figure that the
fault densities predicted by the NBR model and the P model are the closest fit to the
actual values, except on the PROP4 dataset, where NBR is the third best
performer, and on PROP6, where P is the third best performer. The values of the
other count models are far from the actual values. This confirms
the potential of the NBR model and the P model to predict the fault densities in the
software system.
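The normalization described above amounts to a single division; a tiny sketch with purely illustrative numbers is given below.

```python
# Normalize predicted fault-density totals so that the actual total maps to 1.
def normalize_densities(actual_total, predictions):
    return {name: value / actual_total for name, value in predictions.items()}

# Example with made-up values: normalize_densities(4.0, {"NBR": 4.4, "P": 5.1})
```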
[Figure 4.5: Comparison of the count models using various confusion matrix criteria (PROP1-PROP6). Six panels, one per release, report accuracy, precision, recall and F-measure for the NBRM, P, ZIP, GNBR and ZIN models.]
4.2.4.2 Evaluating the results of five count models
The results of the previous section show the potential of count models to predict
the number of faults and the number of faulty modules, but they raise the question of how
effective the count models are at predicting the actual faulty modules. If we mark every
module that contains one or more faults as faulty and all others that contain
zero faults as non-faulty, then these values can serve as the values of TP, FP, FN
and TN, and the count models can also be used for the binary classification of
modules, i.e., fault prone or not fault prone. These values are used to calculate
the elements of the confusion matrix (i.e., accuracy, precision and recall) to
evaluate the overall accuracy of the count models. This is particularly important
because it is possible that a prediction model predicts a number of faults close to
the actual value yet skips the actual faulty modules and raises false alarms by
predicting faults in non-faulty modules.
Figure 4.5 shows the prediction results of the count models using various confusion
matrix criteria. Prediction accuracy, precision, recall and F-measure are the most
commonly used parameters to evaluate prediction models. Here, we used all
four parameters and built comparison graphs for all six releases of the PROP
project dataset (the sub-figures in Figure 4.5, numbered 1 to 6, correspond to the six
releases of the PROP project).
From the figures, we can observe that, in general, the prediction accuracy of the NBR model
is higher than that of all other count models, except on PROP5, where its
accuracy is lower than that of the other count models. The prediction accuracy of the NBR
model varies from 75% to 85%. Similarly, for precision, NBR
outperforms the other models except on PROP6; the precision of NBR varies
between 18% and 33%. The recall of the NBR model is lower
than that of the other count models and varies between 25% and 75% in general.
One possible reason is that the other count models assign higher numbers of faults
than the NBR model, which increases their recall by some amount. To address
this issue, we examined the F-measure, which shows the trade-off between
false positives and false negatives. The F-measure of the NBR model is again
higher than that of the other count models and varies between 23% and 32%. This shows
that the NBR model has the potential to predict fault-prone modules and reduce
misclassification errors. The F-measure values of the other models are, in general,
lower than those of the NBR model.
4.2.4.3 Prediction of the number of faults and the fault densities in
the modules ranked as top 20%
As noted by Ostrand et al. [55], although the individual fault counts predicted
for each file generally do not exactly match the actual fault counts, the great
majority of the actual faults occur in the set of files at the top of the listing.
To evaluate our results in this context, we sort the modules according to their
predicted fault counts and examine the modules ranked in the top 20 percent to find
how many faults they contain. Tables 4.5 and 4.6 contain the percentage of the predicted
number of faults and of the fault densities captured by the top 20% of modules for the six releases of
the PROP dataset, using all five count models. The results for the predicted number of
faults for PROP1 to PROP6 are given in Table 4.5. From the table, it is clear that
these modules contain between 55% and 72% of the faults, with an overall average
of approximately 67%, in the case of the NBR model, while for the other count
models the average percentage is between 64% and 66%, which is close
to the NBR model. A similar process was followed for predicting fault
densities. Table 4.6 contains the percentage of fault densities captured by the
count models for PROP1 to PROP6. The top 20% of the modules contain between
60% and 96% of the fault density, with an average of 72%, in the case of the NBR model,
excluding the values 34% and 100%, which each occur once. For the other count models, the
average percentage varies between 54% and 56%. Comparing our results with the
results of [55], we found that only the NBR model was able to predict
the fault densities in the given modules significantly well when considering the top 20% of the total
modules.
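The top-20% computation behind Tables 4.5 and 4.6 can be sketched as follows, assuming parallel arrays of predicted and actual fault counts (or densities) for the same modules.

```python
# Share of the actual faults captured by the modules ranked in the top 20%
# of the model's predicted fault counts.
import numpy as np

def top20_fault_share(predicted, actual):
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    order = np.argsort(predicted)[::-1]             # rank modules by predicted faults
    top = order[: max(1, int(0.2 * len(order)))]    # keep the top 20% of the ranking
    return actual[top].sum() / actual.sum() * 100   # % of actual faults captured
```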
Table 4.5: Percentage of faults contained in the modules ranked as top 20% (T = training set)
Model   Prop1        Prop2          Prop3            Prop4              Prop5                Prop6    Average
        (T: prop1)   (T: prop1, 2)  (T: prop1, 2, 3) (T: prop1, 2, 3, 4) (T: prop1, 2, 3, 4, 5)
NBR     70%          66.90%         66.56%           71.26%             71.63%               54.83%   66.85%
P       58.31%       57.08%         63.72%           75.24%             75.35%               58.53%   64.07%
ZIP     56.80%       58.17%         60.81%           75.23%             80.46%               54.21%   64.28%
GNBR    61.11%       62.74%         63.79%           77.08%             80%                  49.07%   65.63%
ZIN     70%          67.19%         66.60%           76.19%             75.84%               38.38%   65.70%
4.2.4.4 Cost-benefit analysis
The results obtained through the cost-benefit analysis experiments are shown
in Figure 4.6. We have used the predicted number of faults to
calculate the estimated cost of each count model for fault densities prediction.
The value of Tcost was calculated to show the actual cost incurred under the
normal testing process.
Table 4.6: Percentage of fault density contained in the modules ranked as top 20% of modules (fault density = faults/100 lines of code)
Model   Prop1        Prop2          Prop3            Prop4              Prop5                Prop6    Average
        (T: prop1)   (T: prop1, 2)  (T: prop1, 2, 3) (T: prop1, 2, 3, 4) (T: prop1, 2, 3, 4, 5)
NBR     96.20%       59.14%         34.66%           80.06%             60.67%               100%     71.78%
P       82.80%       48.66%         33.69%           73.11%             52.28%               43.70%   55.70%
ZIP     68.39%       44.57%         38.33%           74.74%             59.59%               34.93%   53.42%
GNBR    57.91%       36.62%         36.80%           76.71%             61.38%               48.88%   53.05%
ZIN     57.57%       38.59%         37.38%           76.34%             62.61%               51.85%   54.05%
This served as the reference point for calculating the NEcost of each count model.
Figure 4.6 shows the values of NEcost of each model for fault densities prediction.
From Figure 4.6, it can be seen that, in general, except on the PROP4 dataset,
the value of NEcost for the NBR model is lower than for the other count models, while
all the other count models in general have NEcost values greater
than the NBR model. The cost value of the P model is the second best, except for
PROP6. These results imply that it is more economical to build the
prediction model based on negative binomial regression in order to reduce the overall cost
of testing.
Since the cost-benefit model utilizes the number of faults found in each phase of
software testing, as well as the faults that remain and seep into subsequent phases,
it provides significant guidance about the cost effectiveness of the model along
with the effectiveness, efficiency and accuracy of the prediction model. These
results confirm and strengthen our prediction model from an economic standpoint.
Based on our results, we make the following observations.
Each of the count models assigns an expected number of faults and a
fault density to each module of the software. We used the actual
values to compare the predicted values. We observed that
the value predicted by each count model does not exactly match
the actual value and varies from one release to another.
[Figure 4.6: Cost-benefit model for the count models (PROP1-PROP6). For each release, bars show the NEcost of the NBR, P, ZIP, GNBR and ZIN models.]
Therefore, we checked the number of faults occurring in the modules ranked
in the top 20%. We found that the majority of the faults occur in these
modules. The NBR model achieves an average of
approximately 67%, while for the other models the average percentage is
between 64% and 66%. This shows that count models have the
potential to predict fault densities.
The NBR model in general provided higher prediction accuracy for fault prediction,
except on the PROP4 dataset. The other four count
models were similar in their prediction accuracy, which is lower than that of the NBR
model.
The precision of the NBR model outperformed the other count models
except for PROP6. The precision of NBR varies between 18% and
33%.
For recall, the NBR model has lower values than the other count models;
in general, GNBR and ZIN have the highest recall values. One possible reason is
that the other count models assign higher numbers of faults than the NBR model,
which increases their recall by some amount.
For the F-measure, the NBR model showed higher values than the other count
models. The second highest values were shown by the P model; the remaining
models produced lower F-measure values. This result shows that the prediction
model based on NBR reduced the misclassification rate by a significant amount.
The final model selection included the performance of the cost-benefit
model. Once again, we found that the NEcost incurred by the NBR model
is much lower than that of the other count models irrespective of the dataset, except
for the PROP4 dataset.
These results suggest that the NBR model is the most suitable for predicting fault
densities.
4.2.5 Threats to validity
In this section, we critically examine the possible side effects of our experimental
findings. We also highlight the factors affecting the validity of the cost evaluation
framework that we have used to measure the performance of the count models
in predicting fault densities for multiple releases of the software. The validity
considerations can be grouped into the following categories:
Construct validity: The effectiveness of a count model is measured as Ecost,
which is the estimated fault removal cost. The framework is developed considering
the costs incurred to rectify faults in the later phases of software
development if they are not identified before testing. In our cost evaluation framework,
the unit testing cost of faulty and non-faulty modules is the same, and the testing cost
of a particular phase is the same for all modules, i.e., finding a fault in a 100 LOC
module costs the same as finding a fault in a 1000 LOC module. We have selected
eleven fault-significant metrics out of nineteen metrics by keeping those metrics
that appeared more than 50% of the time across all six releases of the dataset.
The different framework parameters used in our cost-benefit analysis
have been taken from different sources reported in the literature. For example, the
cost parameters (i.e., the values of C_u, C_s and C_f) were taken from Wagner [67],
the fault identification efficiencies (i.e., the values of \delta_u and \delta_s) were taken from
Jones [39], and the value of M_p (percentage of modules unit tested) is taken from
the study of Wilde et al. [70]. However, one can substitute these parameters
with organization-specific benchmarks to ensure the practical use of the cost-benefit
analysis. One can also use other criteria to select the significant metrics,
and the results may vary with the choice of that criterion.
Internal validity: Our experimental study involves the use of the statistical
analysis tools Weka and Stata and data collected from a publicly
available software data repository. The fault densities and their distribution
depend on the fault data; any bias in the data may influence our findings.
Conclusion validity: We have used the statistics of previous versions to calculate
the estimated false positives, false negatives and true positives. These estimated
values may differ from the actual values. Here, we compared the values of Ecost with
the unit testing cost to decide whether count models are useful. Our results are
specific to the versions of the datasets included in the study.
External validity: We do not suggest generalizing our research results to
arbitrary project categories, because our results identify variance in the metrics
set when the examined project changes. Our models are built and validated
on datasets available in public data repositories. Systems developed in other
organizations may follow different effort patterns, and one needs to take the
underlying characteristics of the software into account before applying our approach.
4.3 Discussion
The approach proposed in this chapter suggests an effective use of a subset of
the project metrics suite with the count models to predict fault densities. To
evaluate the performance of the count models, a set of experiments was carried
out. A count model assigns an expected fault count and a fault density to each
module of a software system. The same models were also used to predict the
fault densities.
We evaluated the performance of the count models using the performance measures
of the confusion matrix. The NBR model in general provided higher prediction
accuracy than the other count models for fault densities prediction. The other
four count models were found to have lower prediction accuracy than the NBR
model. On the other hand, the NBR model has lower recall values than the
other count models; in general, GNBR and ZIN have the highest recall values.
For the F-measure, the NBR model showed higher values than the other count models,
while the remaining models produced lower values. Contradictory results were
found only on the PROP4 dataset, where the prediction model based on Poisson
regression produced higher accuracy than the negative binomial regression (NBR)
based model. One possible reason is that the PROP4 dataset contains the smallest
proportion of faults (9.60%), which means that PROP4 has a higher number of zeros
and may therefore yield poor results for the NBR model. This leaves open the question
of the effectiveness of negative binomial regression based models for software projects
that have few faulty modules. Overall, these results show that the count model based on
NBR produced higher prediction accuracy and reduced the misclassification rate by
a significant amount.
In this chapter, we have used a cost-benefit model for validating the count models.
This analysis aimed to assess the economic viability of the models for fault densities
prediction. In this framework, we have used the cost parameter values from
the study of Wagner and the values for the various testing phases from the study
of Jones. We used these values due to the unavailability of organizational
benchmarks. These values may not be realistic, but our main contribution is to
provide a cost evaluation measure that can assess the cost effectiveness of
fault-prediction techniques when they are used in the development process. Changes
in the framework parameters only change the resulting threshold values.
4.4 Summary
Count models such as negative binomial regression have the potential to predict
the fault densities of software modules by assigning each module an expected number
of faults that best represents the fault occurrence process of the given software.
In this chapter, we investigated the performance of five count models
in predicting the fault densities of software modules. The investigation has been
performed on six releases of a publicly available project dataset. Confusion
matrix based evaluation parameters and a cost-benefit framework have been used
to evaluate the capability of these count models.
Our results suggest that, among all five count models, negative binomial
regression showed the best performance for fault prediction. Its predictive
accuracy is higher than that of the other count models. Contradictory results
were observed only for the recall values, where the NBR model provided
lower values. The results of the cost-benefit analysis also confirmed that negative
binomial regression is the most cost-effective compared to the other count models. Our aim is
to provide a benchmark to estimate the fault removal cost for a newer version
when we train a count model with historical information. In future, this work
could be generalized to more globally assess the effectiveness of fault-prediction
techniques.
Chapter 5
An Application of the Count Models to Predict Fault Densities With Binary Fault Classification
In the previous chapter, we presented an approach to fault densities prediction
based on count model analysis, where we initially identified a subset of
the project metrics suite containing the metrics significant for fault-correlation by
performing Multivariate Linear Regression (MLR) analysis, with the fault
densities of a software module chosen as the dependent variable. Subsequently, we used this
subset of metrics with the count models to predict fault densities. In this chapter,
we evaluate the effectiveness of the count models when we identify the subset of
significant fault-correlated metrics by classifying the faultiness of the software
modules into a binary classification, i.e., faulty and non-faulty. This analysis
helps to decide whether the nature of the fault classification (i.e., binary
class classification or multi-class classification) used for the selection of significant
fault-correlated metrics affects the results of the fault densities prediction.
The rest of the chapter is organized as follows. Section 5.1 describes the data
analysis approach and the subset selection process. Section 5.2 contains information about
the experimental evaluation, including the datasets, the metrics (independent variables),
the dependent variable and the results of our investigation, followed by the threats to validity.
The implications of our results are discussed in Section 5.3.
5.1 The Approach
A software metric that is found to be significantly fault-correlated for modules
classified into a binary classification (i.e., faulty or non-faulty) is also likely
to be significant for fault proneness if the modules have the same or similar structural
properties and fault occurrence. Therefore, the binary
classification of the faults can serve the same purpose as the fault density information of the
software modules. We use this assumption to identify a subset of significant
fault-correlated metrics. To select the significant metrics, we use the approach
described in Chapter 3. The proposed approach involves the initial identification
of a subset of the project metrics suite that shows significant fault-correlation.
Subsequently, the identified subset of metrics is used with the count models to
predict fault densities. The built count models are then validated using confusion
matrix parameters and a cost-benefit model.
5.1.1 Subset selection of fault-correlated metrics
We use the approach described in Chapter 3 to determine a subset of significant
fault-correlated project metrics. We carry out the investigation using class-level
object-oriented (OO) metrics as independent variables and the binary classification
of the software modules as the dependent variable. For each release of the
PROP dataset, we perform a three-step analysis (Univariate Logistic Regression,
Spearman's Correlation and Multivariate Linear Regression) to test whether each
of the nineteen metrics would be a significant predictor in the count models. This
analysis yields the subset of significant metrics corresponding to each release
of the dataset. The criterion used to select a metric is that it should appear in at
least 50% of the six releases of the dataset.
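The first two steps of this selection can be sketched as follows, assuming a DataFrame with the metric columns and a binary faulty label; the column name and the 0.05 cut-off are illustrative assumptions.

```python
# Univariate logistic regression and Spearman correlation of each metric
# against the binary faulty/non-faulty label.
import statsmodels.api as sm
from scipy.stats import spearmanr

def univariate_significant(data, metrics, label="faulty", alpha=0.05):
    keep = []
    for m in metrics:
        X = sm.add_constant(data[[m]])
        fit = sm.Logit(data[label], X).fit(disp=0)
        if fit.pvalues[m] < alpha:          # metric is a significant univariate predictor
            keep.append(m)
    return keep

def spearman_with_faultiness(data, metrics, label="faulty"):
    correlations = {}
    for m in metrics:
        rho, _pvalue = spearmanr(data[m], data[label])
        correlations[m] = rho
    return correlations
```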
5.1.2 Count model analysis
The identified subset of metrics is used to construct the count models. We built
the count models over all six releases of the software project by training the model
on the earlier releases and testing it on the later release. The fault densities of the
modules were selected as the dependent variable for the count model analysis. Since
the number of faults in each release of the software has a high variance,
we performed a square root transformation to reduce the influence of the outlier
values and took the logarithmic transformation of the LOC metric. These
transformations help us to better fit the model in terms of the log likelihood ratio.
5.1.3 Evaluation of count models
Once the count models have been constructed, we use the various confusion
matrix parameters to evaluate their potential for fault densities prediction.
Since each count model assigns an expected number of faults to each module of the
software system, we can use this information for fault prediction and calculate
the elements of the confusion matrix to evaluate the overall accuracy of the count models.
5.1.4 Cost-benefit model
We have used the same cost-benefit model as in the previous chapter
to quantify the fault removal cost of the different count models.
5.2 Experimental Evaluation
In this section, we present an experimental study to evaluate the performance
of the count models for fault densities prediction. We have used five different
count models, namely: the Poisson regression model, the Negative Binomial regression
model, the Zero-Inflated Poisson regression model, the Generalized Negative Binomial
regression model and the Zero-Inflated Negative Binomial regression model, over the six
successive releases of a software project dataset consisting of nineteen class-level
object-oriented metrics. In this study, we investigated the prediction of fault
densities and of the number of faults for a given module. Therefore, we selected
a measure of fault proneness as the dependent variable.
5.2.1 Metrics set used for the experiment
To perform our experimental investigation, we have used the nineteen measures
of coupling, cohesion, inheritance, encapsulation and complexity of an OO software
system. They are as follows: WMC, CBO, RFC, DIT, NOC, IC, CBM, CA, CE,
MFA, LCOM, LCOM3, CAM, MOA, NPM, DAM, AMC, LOC and CC. We have
applied our subset selection approach to these metrics. Based on the selection
criterion, only nine metrics (WMC, DIT, CBO, RFC, CA, CE, LOC, AMC and
CC) were selected for further analysis. The results of this analysis are summarized
in Table 5.1.
Table 5.1: Identified metrics for each release of the PROP dataset
Dataset   Metrics identified
PROP1     CBO, RFC, WMC, AMC, CC, CE, LOC and DIT
PROP2     CBO, RFC, CA, CE, AMC, CC and DIT
PROP3     CBO, RFC, CA, CE, WMC, AMC, CC, DIT and LOC
PROP4     CBO, RFC, CA, CE, WMC, AMC, CC, DIT and LOC
PROP5     CBO, NPM, LOC, MOA, LCOM3, MFA and CC
PROP6     CA, CE, IC, WMC, LCOM3, LOC and MFA
5.2.2 Experimental data
We have used the PROP dataset with its six successive releases to perform our
experimental study and to evaluate our results. For each release, the same nineteen
metrics have been calculated and recorded with respect to the software modules.
The size of the dataset varies from one release to another, but for all the
releases we collected the same nineteen metrics. The releases are listed in Table 5.2,
and a detailed description of this dataset can be found in the previous chapter.
Table 5.2: Datasets used for the study
PROP1, PROP2, PROP3, PROP4, PROP5 and PROP6
5.2.3 Results
This subsection presents a detailed description of the experimental results.
First, we discuss the results of the prediction of the number of faults and the fault
densities using the various count models. Next, we compare the overall accuracy
and effectiveness of the count models using confusion matrix criteria. Finally, we
present the results of the cost-benefit analysis of the prediction models obtained by
the different count models to evaluate them from an economic standpoint.
[Figure 5.1: Result of the predicted number of faults using count models (PROP1-PROP6). For each release, bars compare the actual number of faults with the NBRM, P, ZIP, GNBR and ZIN predictions; y-axis: number of faults.]
In each scenario, the count models are built on one or more prior releases of
the software and are evaluated on the latest release. For example, the
prediction model based on releases 1 and 2 is evaluated on release 3; the exception is
release PROP1, where training and testing were both performed on the same dataset
due to the unavailability of any prior release. The same procedure has been followed
for all the count models.
5.2.3.1 Prediction of the number of faults and the fault densities
The count models assign an expected number of faults and a fault density to
each module of the software. Figures 5.1 and 5.2 show the number of faults and
the number of faulty modules predicted by each count model. The
figures contain a graph for each release of the project dataset showing the
predicted number of faults and the number of faulty modules. The blue bars in
the figures show the actual number of faults and faulty modules contained
in that release of the project dataset; these are the optimal values, corresponding to
finding all the faults and faulty instances in each release.
[Figure 5.2: Result of the predicted number of faulty modules using count models (PROP1-PROP6). For each release, bars compare the actual number of faulty modules with the NBRM, P, ZIP, GNBR and ZIN predictions; y-axis: number of faulty instances.]
The other bars show the comparison of prediction results among the count models.
The accuracy of each count model is measured in terms of how close its predicted value
comes to the actual value. By comparing the results from PROP1
to PROP6, we can see that the negative binomial regression (NBR) model is the closest fit
to the actual number of faults. This result is consistent across all releases
of the PROP dataset, except PROP2, where the predicted number of faults and
the faulty modules are higher than the actual values. The second most accurate
prediction model is the Poisson (P) regression model. All the remaining models assign
higher values than the actual values, which questions their viability for predicting fault
proneness. These results are encouraging with respect to the accuracy of the count
models relative to the actual number of faults discovered.
Figure 5.3 shows the fault densities predicted by each count model.
This information can be useful if a practitioner wants to know which module is
likely to contain the highest number of faults relative to its size.
The figure contains the sum of the actual fault densities in the project
datasets (blue bars) and the fault densities predicted by the count models.
The predictive capability of each count model is measured by how close its predictions
come to the actual values. It can be observed from the figure that the fault densities
predicted by the NBR model and the P model are the closest fit to the actual values.
[Figure 5.3: Result of the fault densities prediction using count models (PROP1-PROP6). For each release, bars compare the actual fault density with the NBRM, P, ZIP, GNBR and ZIN predictions.]
The values of the other count models are higher than the actual
values. This confirms the potential of the NBR model and the P model to predict fault
densities in the software system.
5.2.3.2 Evaluating the results of the five count models
The results of the previous section showed the potential of the count models to
predict the number of faults and the number of faulty modules. In this subsection,
we evaluate the effectiveness of the count models by measuring their overall prediction
accuracy and misclassification errors using various confusion matrix parameters.
Figure 5.4 shows the accuracy, precision, recall and F-measure of the
count models. The figure contains a graph corresponding to each release of the
PROP project dataset (the sub-figures in Figure 5.4, numbered 1 to 6, correspond to
the six releases of the PROP project). From the figure, we can see that, in general,
the prediction accuracy of the NBR model is higher than that of all the other count models,
except on PROP2, where its accuracy is lower than that of the other models. The
prediction accuracy of the NBR model varies from 75% to 85%. The recall of
the NBR model is lower than that of the other count models and varies between 25%
and 75% in general; this is because the NBR prediction model predicts some faulty
modules incorrectly. The F-measure shows the trade-off between false
positives and false negatives. The F-measure of the NBR model is again
higher than that of the other count models, which shows that the NBR model has the potential
to predict fault densities and reduce misclassification errors. The F-measure values of
the other models are, in general, lower than those of the NBR model.
[Figure 5.4: Comparison of the count models using various confusion matrix criteria (PROP1-PROP6). Six panels, one per release, report accuracy, precision, recall and F-measure for the NBRM, P, ZIP, GNBR and ZIN models.]
5.2.3.3 Prediction of the number of faults and the fault densities in
modules ranked as top 20%
Tables 5.3 and 5.4 contain the percentage of the predicted number of faults
and of the fault densities captured by the top 20% of modules for the six releases of the PROP dataset,
using all five count models. The results for the predicted number of faults for
PROP1 to PROP6 are given in Table 5.3. From the table, it is clear that these
modules contain between 53% and 100% of the faults, with an overall average of
86%, in the case of the NBR model, while for the other models the average percentage of
prediction is between 36% and 46%, which is very low compared to the NBR model. A
similar process was followed for predicting fault densities. Table 5.4
contains the percentage of fault densities captured by the count models for
PROP1 to PROP6. The top 20% of the modules contain between
72% and 100% of the fault density, with an average of 91%, in the case of the NBR model.
For the other models, the average percentage varies between 38% and 50%.
Table 5.3: Percentage of faults contained in the modules ranked as top 20% (T = training set)
Model   Prop1        Prop2          Prop3            Prop4              Prop5                Prop6    Average
        (T: prop1)   (T: prop1, 2)  (T: prop1, 2, 3) (T: prop1, 2, 3, 4) (T: prop1, 2, 3, 4, 5)
NBRM    100%         53%            100%             91.68%             100%                 96.26%   86.59%
P       25%          30.41%         43.66%           39.49%             51.40%               30.50%   36.74%
ZIP     21%          20.16%         50.42%           34.42%             30.94%               33.93%   31.81%
GNBR    72.02%       48.46%         43.51%           44.28%             38.89%               27.72%   45.64%
ZIN     30.78%       41.41%         51.90%           41.71%             47.77%               37.03%   41.76%
5.2.3.4 Cost-benefit analysis
The results obtained through the cost-benefit analysis experiments are shown
in Figure 5.5. We have used the predicted number of faults to calculate
the estimated cost of each count model for fault prediction. The value of
Tcost was calculated to show the actual cost incurred under the normal testing
process. This served as the reference point for calculating the NEcost of each count model.
Table 5.4: Percentage of fault density contained in the modules ranked as top 20% of modules (fault density = faults/100 lines of code)
Model   Prop1        Prop2          Prop3            Prop4              Prop5                Prop6    Average
        (T: prop1)   (T: prop1, 2)  (T: prop1, 2, 3) (T: prop1, 2, 3, 4) (T: prop1, 2, 3, 4, 5)
NBRM    78%          72%            100%             100%               100%                 97.30%   91.21%
P       91.38%       32.57%         28.78%           26.99%             20.24%               31.91%   38.64%
ZIP     72.29%       45.61%         33.48%           36.46%             32.84%               20.95%   40.27%
GNBR    88.16%       94.60%         34.38%           32.47%             30.97%               19.64%   50.03%
ZIN     46.25%       43.78%         34.69%           32.79%             36.48%               34.11%   38.01%
Figure 5.5 shows the values of NEcost for each model for fault prediction.
From Figure 5.5, it can be seen that, in general, except on the PROP2 dataset,
the value of NEcost for the NBR model is less than 1, while all the
other models in general have NEcost values greater than 1, except the P
model, which has values less than 1 for the two datasets PROP1 and PROP2. These
results imply that it is more economical to build the prediction model based
on negative binomial regression in order to reduce the overall cost of testing.
[Figure 5.5: Cost-benefit model for the count models (PROP1-PROP6). For each release, bars show the NEcost of the NBR, P, ZIP, GNBR and ZIN models.]
Based on the above results, we make the following observations.
The NBR model in general provided higher prediction accuracy than the other
count models for fault densities prediction. The performance of the NBR model is poor for
the PROP2 dataset, where it assigned higher values for the number of faults and the
number of faulty modules. The PROP2 dataset contains the largest number of software
modules, while its fault content is comparatively very low; therefore, one possible reason
is that the performance of the NBR model was affected by this high skewness of the
dataset. For the rest of the datasets, the other four count models were similar
in their prediction accuracy, which is lower than that of the NBR model.
In the value of recall, it is found that NBR model has lower recall values
compare to other count models. In general GNBR and ZIN have the highest
recall values.
For the value of F-measure, it is observed that NBR model showed higher
value as compare to other count models. The second highest value was
shown by P count model. The rest of the count models produced a lower
value of F-measure. These results show that prediction model based on
NBR reduced the misclassication errors by a signicant amount.
The nal model selection included the performance of the cost benet
model. Once again, we found that the NEcost incurred by NBRM model
is much lower to other count models irrespective to the datasets except
PROP2.
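For reference, the confusion-matrix measures discussed above can be computed as in the following sketch; the counts are hypothetical and stand in for the output of a count model thresholded into faulty/non-faulty predictions.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: modules predicted faulty vs. actually faulty.
tp, fp, fn = 40, 10, 25
p, r, f = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.2f} recall={r:.2f} F-measure={f:.2f}")
```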
These results are similar to the results we observed in the previous chapter, where significant fault-correlated metrics were selected using the fault densities of the software modules. However, the numbers of faults and the fault densities predicted by the count models here are lower and fit more closely to their actual values than those predicted by the count models in the previous chapter. The cost-benefit analysis also confirms this finding: the cost incurred by the count models here is lower compared to the previous values. These results signify that the performance of the count models is not much affected by the process used to select the significant fault-correlated metrics. Once the significant metrics are identified, they can be used with the count models for fault density prediction.
5.2.4 Threats to validity
In this section, we critically examine the possible threats to the validity of our experimental findings. We also highlight the factors affecting the validity of the cost evaluation framework that we used to measure the performance of the count models in predicting fault densities over multiple releases of the software. The validity considerations can be grouped into the following categories:
Construct validity: As discussed in the previous chapter, the effectiveness of a count model is measured as Ecost, the estimated fault removal cost. The framework is developed considering the costs incurred to remove faults in the later phases of software development if they are not identified before testing. In our cost evaluation framework, the unit testing cost of faulty and non-faulty modules is assumed to be the same. The framework parameters used in our cost-benefit analysis have been taken from different sources reported in the literature. We selected nine fault-significant metrics out of nineteen by keeping those metrics that were found significant in more than 50% of the cases across all six releases of the dataset (a sketch of this selection rule is given below). One could use other criteria to select the significant metrics, and the results may vary with the choice of criterion.
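A minimal sketch of this kind of majority rule is shown below; the metric names and significance flags are hypothetical, and the actual significance tests are those applied in the earlier chapters.

```python
# Hypothetical significance flags: metric -> significant? in each of six releases.
significance = {
    "CBO": [True, True, True, False, True, True],
    "WMC": [True, True, False, True, True, False],
    "NOC": [False, False, True, False, False, False],
}

def select_metrics(flags, threshold=0.5):
    """Keep metrics that were significant in more than `threshold` of the releases."""
    return [m for m, hits in flags.items() if sum(hits) / len(hits) > threshold]

print(select_metrics(significance))   # ['CBO', 'WMC'] under these assumed flags
```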
Internal validity: Our experimental study involved the use of the statistical analysis tools Weka and Stata, and the data were collected from a publicly available software data repository. The fault densities and their distribution depend on the fault data; any bias in these data may influence our findings.
Conclusion validity: We used the statistics of binary-class classification of the faults to select the significant fault-correlated metrics. Subsequently, we used these selected metrics with the count models to predict fault densities. Our results are specific to the versions of the datasets included in the study.
External validity: We do not suggest generalizing our research results to arbitrary project categories, because our results show that the selected metrics set varies when the examined project changes. Our models are built and validated on datasets available in public data repositories. Systems developed within an organization may follow a different effort pattern; one needs to take the underlying pattern of the software into account before applying our approach.
5.3 Summary
In this chapter, we presented an application of the count models for fault density prediction, where we identified the significant fault-correlated metrics by classifying the faultiness of the software modules into a binary class, i.e., faulty or non-faulty.
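As a minimal illustration of how such a count model can be fitted in practice (the thesis experiments used Stata; the sketch below uses Python's statsmodels instead, and the metric columns and fault counts are hypothetical):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical dataset: selected OO metrics per module plus the observed fault count.
data = pd.DataFrame({
    "wmc":    [5, 12, 3, 20, 7, 15, 9, 25],
    "cbo":    [2, 8, 1, 11, 4, 9, 5, 14],
    "rfc":    [10, 30, 6, 45, 14, 33, 18, 50],
    "faults": [0, 3, 0, 6, 1, 4, 1, 7],
})

X = sm.add_constant(data[["wmc", "cbo", "rfc"]])   # predictors with intercept
y = data["faults"]                                 # count response

# Negative binomial regression as a GLM; alpha is the assumed dispersion parameter.
model = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=1.0))
result = model.fit()
print(result.summary())
print(result.predict(X))   # expected fault counts; dividing by LOC/100 gives fault densities
```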
Our results were consistent with the results found in the previous chapter. The negative binomial regression analysis showed the best performance for fault density prediction; its predictive accuracy was the highest among the count models. The only contradictory results occurred for the recall values, where the NBR model provided lower values. The results of the cost-benefit analysis also confirmed that negative binomial regression was the most cost-effective compared to the other count models. The performance of the count models improved somewhat for fault density prediction, but this did not create much difference among the count models. These results suggest that the selection process for the fault-significant metrics does not affect the performance of the count models. The only requirement for the count models is an input dataset that is not stuffed with unnecessary information, which may otherwise lead to poor performance of the count models.
Chapter 6
Conclusions and Future Work
Software metrics can be helpful in assessing the quality of a software system with the desired accuracy. However, the difficulty lies in knowing the right set of metrics that actually capture important quality attributes of a class, such as fault proneness. Validation of object-oriented metrics for predicting software fault proneness is essential to ensure their practical use in fault prediction for object-oriented software systems. In this thesis, we investigated the relationship of existing class-level object-oriented metrics with the fault proneness of software systems. As a follow-up, we presented an approach to identify a subset of software metrics with significant fault correlation, and then used this identified subset with the count models to predict fault densities over subsequent releases of the software system.
We performed two sets of experimental investigations using project fault datasets taken from the PROMISE data repository that make use of object-oriented metrics available at the class level. The first set of investigations consisted of evaluating the performance of the selected metrics subset against the original project metrics suite. The second set of investigations consisted of using the identified metrics subset with various count models to predict fault densities.
In the first set of investigations, we identified the metrics subset consisting of metrics with significant fault correlation. We performed our investigation over five software project datasets, namely Camel, Xalan, Xerces, Ivy, and Velocity, with their multiple successive releases. We used the confusion-matrix criteria Accuracy, Precision, Recall, and AUC (area under the ROC curve) to estimate the overall prediction accuracy and misclassification errors of the prediction models. Our results demonstrated that the identified metrics subset produced improved fault prediction performance compared to the original project metrics suite.
We performed our second set of investigations using five different count models over six successive releases of the PROP software project dataset available in the PROMISE data repository. The prediction results were evaluated using confusion-matrix parameters and a cost-benefit model. Our results suggested that, among the five count models used, the negative binomial regression (NBR) model produced the best performance for fault density prediction. The predictive accuracy of the NBR model was found to be the highest among the count models used. The results of the cost-benefit analysis also confirmed that the prediction model based on negative binomial regression was the most cost-effective compared to the other count models. Though the NBR model produced lower recall values, the F-measure established the results of the NBR model as the best trade-off between precision and recall among the five count models used.
In the present thesis, we used an approach to select a subset consisting of significant fault-correlated metrics. However, many other subset selection techniques, for example wrappers, filters, or PCA, are also available and need to be investigated for their potential to identify a metrics subset. Since our subset selection approach identified a different subset of metrics for each project dataset, we wish to investigate other approaches or techniques that may identify a generalized subset of metrics by considering the inevitable differences that exist across projects and systems.
In the future, we intend to explore alternative approaches to investigate and validate our results, to further strengthen or update the arguments made in this thesis. We also plan to collect more software project datasets to enhance the applicability of the presented approach in real settings.
Publications
• Santosh Singh Rathore and Atul Gupta, "Validating the Effectiveness of Object-Oriented Metrics over Multiple Releases for Predicting Fault Proneness". In Proceedings of the 19th Asia Pacific Software Engineering Conference (APSEC'12), Hong Kong, pp. 270-275, 4-7 Dec 2012. DOI: 10.1109/APSEC.2012.148.

• Santosh Singh Rathore and Atul Gupta, "Investigating Object-Oriented Design Metrics to Predict Fault-Proneness of Software Modules". In Proceedings of the 6th International Conference on Software Engineering (CONSEG'12), Indore, India, 5-7 Sep 2012. DOI: 10.1109/CONSEG.2012.6349484.

• Saurabh Tiwari, Santosh Singh Rathore, Abhinav Singh, Abhijeet Singh and Atul Gupta, "An Approach to Generate Actor-Oriented Activity Charts from Use Case Requirements". In Proceedings of the 19th Asia Pacific Software Engineering Conference (APSEC'12), Hong Kong, pp. 350-355, 4-7 Dec 2012. DOI: 10.1109/APSEC.2012.149.

• Saurabh Tiwari, Santosh Rathore, Sudhanshu Gupta, Vaibhav Gagote and Atul Gupta, "Analysis of Use Case Requirements using SFTA and SFMEA Techniques". In Proceedings of the 17th International Conference on Engineering of Complex Computer Systems (ICECCS'12), Paris, France, pp. 29-38, 18-20 July 2012. DOI: 10.1109/ICECCS.2012.10.

• Santosh Singh Rathore and Atul Gupta, "Using Negative Binomial Regression Analysis to Predict Fault Densities in Software Modules". In the 17th International Conference on Evaluation and Assessment in Software Engineering (EASE'13), Brazil. (Submitted).