
The SIJ Transactions on Computer Science Engineering & its Applications (CSEA), Vol. 1, No. 3, July-August 2013
ISSN: 2321-2381 © 2013 | Published by The Standard International Journals (The SIJ) 75



Abstract: Biological databases, containing genetic information of patients, are undergoing tremendous growth, beyond our analysing capability. However, such analysis can reveal new findings about the cause and subsequent treatment of a disease. Rough Set Theory has been used in this analysis with the aim of effectively extracting biologically relevant information from inconsistent and ambiguous data and of finding hidden patterns and dependencies in the data. The investigation has been carried out on the publicly available microarray dataset for Lung Adenocarcinoma, obtained from the National Center for Biotechnology Information website. The analysis revealed that Rough Set is able to extract the various dominant genes, in terms of reducts, which play an important role in causing the disease, and is also able to provide a unique simplified rule set for building expert systems in medical sciences. Cross validation of the generated rule sets shows 100% accuracy, and the rules were also verified successfully on an unknown dataset. From the rules, the responsible genes are identified; they are also validated using the Gene Ontology tool DAVID, which shows their direct or indirect functional relation with Lung Adenocarcinoma.
Keywords: Adenocarcinoma; Cancer Diagnosis; Micro Array Data; Reduct; Rough Set; Rule Reduction.

I. INTRODUCTION
The growth of the size of data and the number of existing
databases far exceeds the ability of humans to analyze
this data, which creates both a need and an opportunity
to extract knowledge from databases by using several
computational intelligence & soft computing methods.
Medical databases have accumulated large quantities of
information about patients and their medical conditions,
supporting clinical, pathological, radiological and
microbiological aspects in terms of genetic expression.
Identifying hidden relationships and patterns within this data
could provide new medical knowledge about the patient.
Analysis of medical data can often serve as a tool for diagnosis
and treatment of a patient with incomplete knowledge. This
can be done by managing inconsistent pieces of
information and by manipulating various levels of
representation of data using various soft computing tools.
Existing intelligent techniques of data analysis are mainly
based on quite strong assumptions (some knowledge about
dependencies, probability distributions, large number of
experiments), are unable to derive conclusions from
incomplete knowledge, or cannot manage inconsistent pieces
of information. The most common intelligent techniques used
in medical data analysis are based on soft computing tools
like Neural Network [Wang et al., 2006], Genetic Algorithms
[Huerta et al., 2006], Decision Trees [David & Balakrishnan,
2010, 2011] and fuzzy theory [Bezdek, 1993; Vinterbo et al.,
2005; Hoa et al., 2006], etc.
In this paper, Rough Set Theory [Walczak & Massart,
1999; Tsumoto, 2001; Pawlak, 2002; Midelfart et al., 2002;
Banerjee et al., 2007] is introduced for the same purpose. The
theory of rough sets is a mathematical tool for extracting
knowledge from uncertain and incomplete data, in
environments with noisy and missing values. The
theory assumes that we first have necessary information or
knowledge of all the objects in the universe with which the
objects can be divided into different groups. If we have
exactly the same information of two objects then we say that
they are indiscernible (similar), i.e., we cannot distinguish
them with known knowledge. The theory of Rough Set can
be used to find dependence relationship among data, evaluate
the importance of attributes, discover the patterns of data,
learn common decision-making rules, reduce all redundant
objects and attributes and seek the minimum subset of
attributes so as to attain satisfactory classification. Moreover,
Rough Set Theory based Automated Disease Diagnosis using Lung Adenocarcinoma as a Test Case

Sudip Mandal* & Dr. Goutam Saha**

*Head of the Department, Electronics & Communication Engineering, Global Institute of Management & Technology, Krishna Nagar, Nadia, West Bengal, INDIA. E-Mail: sudip.mandal007@gmail.com
**Head of the Department, Information Technology, North Eastern Hill University, Shillong, Meghalaya, INDIA. E-Mail: dr_goutamsaha@yahoo.com
the rough set reduction algorithms help to approximate the
decision classes using possibly large and simplified patterns.
Lung Adenocarcinoma often begins in the outer parts of
the lungs and as such well-known symptoms of Lung Cancer
such as a chronic cough and coughing up blood may be less
common until later stages of the disease. Early symptoms of
Adenocarcinoma that may be overlooked include fatigue,
mild shortness of breath, or achiness in the back,
shoulder, or chest.
In this paper, Rough Set Theory has been introduced for
the automated diagnosis of Lung Adenocarcinoma using the
microarray dataset of Adenocarcinoma. For analyzing
microarray data, it is necessary to analyze and interpret the
vast amount of data that is available, involving the decoding
of around 24,000-30,000 human genes. High-dimensional
feature selection is important for characterizing gene
expression data involving many attributes. Data mining
[Mohammadi et al., 2011] methodologies hold promising
opportunity in this direction. Microarray experiments produce
gene expression patterns that provide dynamic information
about cell function. This information is useful when
investigating the complex interactions that take place within a
cell. A normal and a cancerous person thus have
different gene expression levels, since genes are responsible for
each characteristic or disease, and the set of genes responding
to a certain level of stress in an organism can be observed from
its microarray data. So if the sets of responsible genes, whose
values differ between cancerous and normal persons, can be
identified on the basis of their dependency, then diagnosis
can be done easily and with high accuracy.
For this purpose, a microarray dataset has been taken
that contains data on subjects diagnosed with cancer and
without cancer. Classification Rule Sets
[Midelfart et al., 2001; Du et al., 2011] have been generated
for the dataset using different techniques. If these decision
rule sets match an unknown data set, its class can be
predicted exactly.
This paper discusses how Rough Set Theory can be used
to analyse microarray data and to generate classification
rules from a set of observed data samples of
Adenocarcinoma. The Rough Set reduction technique is
applied to find all possible reducts of the data, which contain
the minimal subsets of attributes that are associated with a
class label for classification. This paper is organized as
follows. Theoretical aspects of rough set data analysis which
are relevant to the work are presented in Section II. The
characteristics of the microarray data of Adenocarcinoma and
knowledge representation are discussed in Section III, and the
reduct set generation process is introduced in Section IV.
The decision rule generation process is described in Section V.
In Section VI, the evolutionary rule reduction approach is
described. Experimental results and a discussion of the results
are reported in Sections VII & VIII respectively.


II. ROUGH SET THEORY: AN OVERVIEW
Rough sets [Hassanien & Ali, 2004] constitute a major
mathematical tool for managing uncertainty that arises from
granularity in the domain of discourse due to incomplete
information about the objects of the domain. The granularity
is represented formally in terms of an indiscernibility relation
that partitions the domain. If there is a given set of attributes
ascribed to the objects of the domain, objects having the same
attribute values would be indiscernible and would belong to
the same block of the partition. The intention is to
approximate a rough (imprecise) concept in the domain by a
pair of exact concepts. These exact concepts are called the
lower and upper approximations and are determined by the
indiscernibility relation. The lower approximation is a set of
objects definitely belonging to the rough concept, whereas
the upper approximation is a set of objects possibly
belonging. The formal definitions of the aforementioned
notions and others required for the present work are given as
follows.
Definition 1:
An information system I = (U, A) consists of a nonempty,
finite set U of objects (cases, observations, etc.) and a non-
empty, finite set A of attributes a (features, variables), such
that a: U → V_a, where V_a is a value set. We shall deal with
information systems called decision tables, in which the
attribute set has two parts (A = C ∪ D) consisting of the
condition and decision attributes (in the subsets C, D of A,
respectively). In particular, the decision tables we take will
have a single decision attribute d and will be consistent, i.e.,
whenever objects x, y ∈ U are such that a(x) = a(y) for each
condition attribute a ∈ C, then d(x) = d(y).
Definition 2:
Let B ⊆ A. Then a B-indiscernibility relation IND(B) is
defined as

IND(B) = {(x, y) ∈ U × U : a(x) = a(y), ∀a ∈ B}    (1)

It is clear that IND(B) partitions the universe U into
equivalence classes

[x]_B = {y ∈ U : (y, x) ∈ IND(B)}    (2)

The equivalence classes of IND(B) are called the
elementary sets in B because they represent the smallest
discernible groups of objects.
Definition 3:
The B-lower and B-upper approximations of a given set X
(X ⊆ U) are defined, respectively, as follows:

B̲(X) = {x ∈ U : [x]_B ⊆ X}    (3)
B̄(X) = {x ∈ U : [x]_B ∩ X ≠ ∅}    (4)

The B-boundary region is given by

BN_B(X) = B̄(X) \ B̲(X)    (5)

Assuming B and C are equivalence relations in U, the
important concept of positive region is defined as

POS_B(C) = ⋃_{X ∈ U/C} B̲(X)    (6)
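As a concrete illustration of equations (1)-(4), the indiscernibility classes and the lower and upper approximations can be computed directly from a decision table. The following minimal Python sketch uses a hypothetical toy table; the attribute names g1, g2 are illustrative stand-ins, not probe IDs from the dataset:

```python
from collections import defaultdict

def partition(table, B):
    """Group objects into IND(B) equivalence classes (eq. 1-2)."""
    classes = defaultdict(set)
    for x, row in table.items():
        classes[tuple(row[a] for a in B)].add(x)
    return list(classes.values())

def approximations(table, B, X):
    """B-lower and B-upper approximations of a target set X (eq. 3-4)."""
    lower, upper = set(), set()
    for cls in partition(table, B):
        if cls <= X:      # class wholly inside X: definitely in the concept
            lower |= cls
        if cls & X:       # class overlaps X: possibly in the concept
            upper |= cls
    return lower, upper

# Hypothetical toy table; g1, g2 stand in for gene probe IDs.
U = {
    "x1": {"g1": 10, "g2": 7},
    "x2": {"g1": 10, "g2": 7},   # indiscernible from x1
    "x3": {"g1": 9,  "g2": 6},
    "x4": {"g1": 10, "g2": 5},
}
X = {"x1", "x3"}                 # a rough concept to approximate
lo, up = approximations(U, ["g1", "g2"], X)
print(sorted(lo))  # ['x3'] -- x1 is indiscernible from x2, which lies outside X
print(sorted(up))  # ['x1', 'x2', 'x3']
```

The boundary region of equation (5) is then simply `up - lo`.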
Definition 4:
In an information system there often exist some condition
attributes that do not provide any additional information
about the objects in U. So, we should remove those attributes,
since the complexity and cost of the decision process can be
reduced if those condition attributes are eliminated. Given a
classification task mapping a set of variables C to a set of
labels D, a reduct is defined as any R ⊆ C such that
γ(R, D) = γ(C, D), and the reduct set is defined with respect to
the power set P(C) as the set R = {R ∈ P(C) : γ(R, D) = γ(C, D)}.
That is, the reduct set is the set of
all possible reducts of the equivalence relation denoted by C
and D. A minimal reduct is defined as any reduct R such that
|R| ≤ |A| for every A ∈ R. That is, the minimal reduct is a reduct of
least cardinality for the equivalence relation denoted by C
and D.
Definition 5:
The set of attributes which are common to all reducts is called
the core. The core is the set of attributes possessed by
every legitimate reduct, and therefore consists of attributes
which cannot be removed from the information system
without causing collapse of the equivalence-class structure. It
is possible for the core to be empty, which means that there is
no indispensable attribute.
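To make the reduct and core definitions concrete, the following hypothetical sketch enumerates all minimal attribute subsets that preserve the positive region, by brute force. This is feasible only for toy tables; for the 22,284-gene dataset an exhaustive search is impossible, which is why a genetic algorithm is used later (Section IV). The names g1-g3 and d are illustrative:

```python
from itertools import combinations

def pos_region(table, B, d):
    """Objects whose B-equivalence class is consistent on decision d (eq. 6)."""
    classes = {}
    for x, row in table.items():
        classes.setdefault(tuple(row[a] for a in B), set()).add(x)
    pos = set()
    for cls in classes.values():
        if len({table[x][d] for x in cls}) == 1:   # one decision per class
            pos |= cls
    return pos

def reducts(table, conds, d):
    """All minimal attribute subsets preserving the full positive region.
    Brute force: exponential in the number of attributes."""
    full = pos_region(table, conds, d)
    found = []
    for r in range(1, len(conds) + 1):
        for sub in combinations(conds, r):
            if pos_region(table, list(sub), d) == full:
                # discard supersets of reducts already found (minimality)
                if not any(set(f) < set(sub) for f in found):
                    found.append(sub)
    return found

# Hypothetical toy table; g1..g3 stand in for probe IDs, d is the decision.
U = {
    "x1": {"g1": 10, "g2": 7, "g3": 4, "d": "TUMOR"},
    "x2": {"g1": 10, "g2": 6, "g3": 4, "d": "NORMAL"},
    "x3": {"g1": 9,  "g2": 6, "g3": 5, "d": "TUMOR"},
}
rs = reducts(U, ["g1", "g2", "g3"], "d")
print(rs)                               # [('g1', 'g2'), ('g2', 'g3')]
print(set.intersection(*map(set, rs)))  # {'g2'} -- the core
```

Here g2 belongs to every reduct, so the core is non-empty; in the Adenocarcinoma data the core turns out to be empty (Section IV).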
III. DATA COLLECTION AND DECISION
TABLE
Microarray data are a typical example of data presenting an
overwhelmingly large number of features (genes), the
majority of which are not relevant to the description of the
problem and could potentially degrade the classification
performance by masking the contribution of the relevant
features.
In Rough Set Theory, datasets are represented with
the help of decision tables. A decision table contains
attributes, i.e., conditions, and objects, i.e., different sample
cases. The decision table describes the decision in terms of
conditions that must be satisfied in order to obtain the
decisions specified in the table. In this paper, the
genes are used as attributes and the data of about 82 persons
as objects to generate the required decision table. Based on
different conditions, or different values of the attributes, the
decision for a sample may be either normal or cancerous. This
decision table is used as a training dataset from which to
learn the hidden dependencies between the different genes that are
responsible for Adenocarcinoma. The microarray data for
gene expression of Adenocarcinoma and normal human beings
(Series Geo_accession No: GSE10072) are collected from the
NCBI website [http://www.ncbi.nlm.nih.gov/]. In this case
the number of different genes is 22284 and the number of cases is 82.
The values of the different attributes are real, but for
simplification we consider them as integers. The following table is a
typical example of the decision table corresponding to the
Adenocarcinoma microarray data set, where the number of columns
is 22285 and the number of rows is 82.
Table 1: Sample Decision Table (Truncated)

Cases | Attributes or Gene Expression                               | Decision
      | 1007_s_at | 1053_at | .. | AFFX-pnX-5_at | AFFX-TrpnX-M_at |
X1    | 10        | 7       | .. | 4             | 4               | TUMOR
X2    | 10        | 6       | .. | 4             | 4               | NORMAL
X3    | 10        | 6       | .. | 4             | 4               | TUMOR
The training data set consists of 42 tumor cases and 40
normal cases. Another random set of 15 cases, i.e., the test data
set, is taken to test the validity of the generated rules.
1007_s_at, 1053_at, …, AFFX-TrpnX-M_at
are different genes; the decision is taken depending upon
these genes.
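The construction of such a decision table can be sketched as follows. The sample values below are illustrative only, not actual GSE10072 measurements; rounding real expression values to integers follows the simplification described above:

```python
def build_decision_table(samples):
    """Build an integer-valued decision table from real-valued expressions.
    samples: list of (expression dict, class label) pairs."""
    table = {}
    for i, (expr, label) in enumerate(samples, start=1):
        row = {gene: round(v) for gene, v in expr.items()}  # real -> integer
        row["Decision"] = label
        table[f"X{i}"] = row
    return table

# Illustrative values only, not actual GSE10072 measurements.
raw = [
    ({"1007_s_at": 10.3, "1053_at": 7.1}, "TUMOR"),
    ({"1007_s_at": 9.8,  "1053_at": 6.2}, "NORMAL"),
]
table = build_decision_table(raw)
print(table["X1"])  # {'1007_s_at': 10, '1053_at': 7, 'Decision': 'TUMOR'}
```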
IV. CALCULATION OF REDUCT SETS
Reduct calculation is a crucial task in any learning system. In
this study, minimal reducts are calculated
automatically from the decision table using the Rough Set
Exploration System (RSES 2.2.2). Calculation of all
reducts is very exhaustive and complex in nature. Therefore,
for the calculation of all minimal reducts, a Genetic Algorithm with
full indiscernibility modulo decision is used. From the
huge database, or decision table, 37 reducts of various
sizes are generated, each of which has positive region 1
and Stability Coefficient 1.
It is interesting to find that there is no core in the reduct
set, which means that there is no indispensable attribute and
that there are huge dependencies between the different attributes of
the minimal reduct sets. In other words, there is a huge
inhomogeneity among the attributes and there are many
possibilities of substitution. These hidden dependencies, which
have to be found out, are responsible for Adenocarcinoma.
There are only 195 different attributes in all the reduct sets,
whereas in the decision table the number of different attributes is
22284. Therefore the dependency and complexity of the
decision is reduced by a factor of 195/22284 after the
calculation of the reduct sets.

Table 2: Reduct Set for the Data Set of Adenocarcinoma after using RSES
Reduct Set
{ 201591_s_at, 202295_s_at, 209613_s_at }
{ 201456_s_at, 203065_s_at, 203217_s_at, 215972_at }
{ 201413_at, 201969_at, 204987_at, 205022_s_at, 206550_s_at, 208429_x_at }
{ 201139_s_at, 201337_s_at, 205278_at, 211367_s_at, 211501_s_at, 212412_at, 219293_s_at }
{ 201017_at, 202886_s_at, 207877_s_at, 210481_s_at, 210581_x_at, 213927_at, 217338_at }
{ 200745_s_at, 201337_s_at, 206296_x_at, 209147_s_at, 211367_s_at, 211501_s_at, 219293_s_at }
{ 200085_s_at, 202848_s_at, 203866_at, 206328_at, 212611_at, 218624_s_at, 218918_at, 221382_at }
{ 200003_s_at, 206550_s_at, 207173_x_at, 207184_at, 207409_at, 208985_s_at, 210346_s_at, 211062_s_at, 217971_at }
{ 200617_at, 201226_at, 202737_s_at, 205455_at, 208829_at, 211619_s_at, 212540_at, 215543_s_at }
{ 201256_at, 201336_at, 206704_at, 206779_s_at, 206929_s_at, 207204_at, 210640_s_at, 211971_s_at }
{ 201354_s_at, 204697_s_at, 206622_at, 207294_at, 210788_s_at, 213547_at, 220530_at, 35820_at }
{ 201337_s_at, 202640_s_at, 203927_at, 204333_s_at, 209682_at, 212412_at, 213188_s_at, 38069_at }
{ 201268_at, 202937_x_at, 203091_at, 203249_at, 209790_s_at, 210279_at, 213165_at, 214451_at, 218304_s_at }
{ 201350_at, 201591_s_at, 202422_s_at, 202442_at, 204209_at, 209204_at, 210956_at, 216194_s_at, 221649_s_at }
{ 201354_s_at, 204697_s_at, 207294_at, 207685_at, 209807_s_at, 212696_s_at, 213547_at, 214500_at, 35820_at }
{ 201425_at, 203989_x_at, 206930_at, 210066_s_at, 210359_at, 210659_at, 216941_s_at }
{ 201425_at, 201656_at, 206930_at, 210066_s_at, 210359_at, 210659_at, 215529_x_at, 216941_s_at }
{ 201506_at, 206632_s_at, 207083_s_at, 208335_s_at, 212230_at }
{ 201772_at, 206068_s_at, 206757_at, 212760_at }
{ 201591_s_at, 202442_at, 209613_s_at, 212737_at, 222313_at }
{ 201656_at, 202069_s_at, 202181_at, 202850_at, 204288_s_at, 205286_at, 217826_s_at }
{ 201656_at, 203989_x_at, 204288_s_at, 208140_s_at, 210066_s_at, 210359_at, 210659_at, 216941_s_at }
{ 208056_s_at, 209072_at, 212611_at, 218918_at, 49452_at }
{ 201953_at, 204513_s_at, 207461_at, 217030_at, 218307_at, 40560_at }
{ 201938_at, 203217_s_at, 205261_at, 210359_at, 211665_s_at, 215972_at }
{ 202848_s_at, 207973_x_at, 208056_s_at, 209072_at, 218918_at, 49452_at }
{ 202182_at, 202199_s_at, 206550_s_at, 206718_at, 209343_at, 210759_s_at, 214143_x_at }
{ 202848_s_at, 204980_at, 205549_at, 207178_s_at, 209072_at, 211395_x_at, 49452_at }
{ 205549_at, 206600_s_at, 207178_s_at, 211395_x_at, 211718_at, 212611_at }
{ 203217_s_at, 204290_s_at, 205208_at, 205261_at, 210359_at, 211852_s_at, 215972_at }
{ 202850_at, 205549_at, 206022_at, 206600_s_at, 207500_at, 207973_x_at, 209048_s_at }
{ 202862_at, 205379_at, 206990_at, 208865_at, 209817_at, 215091_s_at, 215136_s_at, 219630_at, 219995_s_at }
{ 203003_at, 203965_at, 204082_at, 204952_at, 205416_s_at, 207296_at, 208711_s_at, 210984_x_at, 219297_at, 221907_at, 34210_at }
{ 203866_at, 204595_s_at, 205549_at, 209072_at, 212152_x_at, 212611_at, 218982_s_at }
{ 203989_x_at, 204288_s_at, 205187_at, 205793_x_at, 206930_at, 209609_s_at, 210359_at, 210438_x_at, 210659_at, 216941_s_at }
{ 205549_at, 207178_s_at, 207793_s_at, 208910_s_at, 209072_at, 218982_s_at, 49452_at }
{ 208004_at, 208818_s_at, 208923_at, 210113_s_at, 211714_x_at, 212008_at, 213121_at, 221443_x_at}
V. GENERATION OF DECISION RULES
In this section, we describe how decision rules are generated
based on the reduct system obtained in the previous section. If
we distinguish in an information system two disjoint classes of
attributes, called condition and decision attributes,
respectively, then the system will be called a decision table
and will be denoted by T = (U, C, D), where C and D are
disjoint sets of condition and decision attributes, respectively.
Let T = (U, C, D) be a reduced decision table, where C
denotes the reduced set of attributes, which are the possible
responsible genes for Adenocarcinoma. Every x ∈ U
determines a sequence c₁(x), …, cₙ(x); d₁(x), …, dₘ(x),
where {c₁, …, cₙ} = C and {d₁, …, dₘ} = D.
This sequence will be called a decision rule induced by x
(in T) and will be denoted by
c₁(x), …, cₙ(x) → d₁(x), …, dₘ(x), or in short C →ₓ D.

The number supₓ(C, D) will be called the support of the
decision rule and is given by

supₓ(C, D) = |C(x) ∩ D(x)|    (7)

The strength of the decision rule C →ₓ D can be written
as the following equation, where |U| denotes the cardinality of
U:

σₓ(C, D) = supₓ(C, D) / |U|    (8)

With every decision rule C →ₓ D we associate the
coverage factor of the decision rule, denoted covₓ(C, D) and
defined as follows:

covₓ(C, D) = |C(x) ∩ D(x)| / |D(x)|    (9)
Using the generated reducts and the decision table, the
decision rules are generated with the help of the RSES 2.2 tool.
Due to the huge mutual dependency among the attributes
of the reduct sets, as many as 2187 rules are generated. Using
these rules, it is possible to classify the new microarray data
set of any human as either Adenocarcinoma or normal
lungs.
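Equations (7)-(9) can be computed directly from a decision table. The sketch below uses a hypothetical toy table (the attribute name `g` stands in for a probe ID) to illustrate the support, strength and coverage of a single rule:

```python
def rule_metrics(table, conditions, decision):
    """Support (eq. 7), strength (eq. 8) and coverage (eq. 9) of a rule C => D.
    conditions: {attribute: value}; decision: (attribute, value)."""
    c = {x for x, row in table.items()
         if all(row[a] == v for a, v in conditions.items())}
    d = {x for x, row in table.items() if row[decision[0]] == decision[1]}
    support = len(c & d)
    strength = support / len(table)
    coverage = support / len(d) if d else 0.0
    return support, strength, coverage

# Hypothetical toy table; 'g' stands in for a probe ID such as 209613_s_at.
table = {
    "x1": {"g": 8,  "Decision": "TUMOR"},
    "x2": {"g": 8,  "Decision": "TUMOR"},
    "x3": {"g": 10, "Decision": "NORMAL"},
    "x4": {"g": 4,  "Decision": "TUMOR"},
}
print(rule_metrics(table, {"g": 8}, ("Decision", "TUMOR")))
# (2, 0.5, 0.6666666666666666)
```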
Consider the first reduct, i.e., {201591_s_at, 202295_s_at,
209613_s_at}, which consists of three attributes. Depending
upon the combination of different values of these genes, the
decision can be taken. It is found that, for the first reduct, the
number of generated rules is 35, which is quite large, and
these can be used for classification.
But due to this large number of rules, the complexity of the
generated rules is such that they cannot be handled
easily. The following table shows a portion of the generated
rules (the 35 rules for the 1st reduct set only), where each
generated rule is given first, and the rightmost column
Match/support denotes the number of cases in the decision
table which are supported by this particular rule.
Table 3: Generated Rules (Partially Shown) Set from Reduct Sets
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=8)=>(CLASS=TUMOR[2])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=11)=>(CLASS=NORMAL[3])
(201591_s_at=10)&(202295_s_at=10)&(209613_s_at=4)=>(CLASS=TUMOR[4])
(201591_s_at=10)&(202295_s_at=12)&(209613_s_at=11)=>(CLASS=NORMAL[4])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=5)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=12)&(209613_s_at=7)=>(CLASS=NORMAL[1])
(201591_s_at=9)&(202295_s_at=10)&(209613_s_at=4)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=12)&(209613_s_at=10)=>(CLASS=NORMAL[11])
(201591_s_at=9)&(202295_s_at=11)&(209613_s_at=5)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=10)=>(CLASS=NORMAL[14])
(201591_s_at=9)&(202295_s_at=10)&(209613_s_at=7)=>(CLASS=TUMOR[3])
(201591_s_at=9)&(202295_s_at=12)&(209613_s_at=4)=>(CLASS=TUMOR[1])
(201591_s_at=9)&(202295_s_at=10)&(209613_s_at=8)=>(CLASS=TUMOR[2])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=7)=>(CLASS=TUMOR[5])
(201591_s_at=11)&(202295_s_at=12)&(209613_s_at=10)=>(CLASS=NORMAL[1])
(201591_s_at=10)&(202295_s_at=10)&(209613_s_at=7)=>(CLASS=TUMOR[3])
(201591_s_at=10)&(202295_s_at=10)&(209613_s_at=5)=>(CLASS=TUMOR[1])
(201591_s_at=9)&(202295_s_at=10)&(209613_s_at=10)=>(CLASS=NORMAL[1])
(201591_s_at=9)&(202295_s_at=9)&(209613_s_at=4)=>(CLASS=TUMOR[1])
(201591_s_at=9)&(202295_s_at=10)&(209613_s_at=6)=>(CLASS=TUMOR[3])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=4)=>(CLASS=TUMOR[1])
(201591_s_at=9)&(202295_s_at=11)&(209613_s_at=4)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=12)&(209613_s_at=8)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=12)&(209613_s_at=9)=>(CLASS=NORMAL[3])
(201591_s_at=10)&(202295_s_at=10)&(209613_s_at=6)=>(CLASS=TUMOR[2])
(201591_s_at=9)&(202295_s_at=11)&(209613_s_at=6)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=9)&(209613_s_at=4)=>(CLASS=TUMOR[2])
(201591_s_at=9)&(202295_s_at=11)&(209613_s_at=9)=>(CLASS=TUMOR[1])
(201591_s_at=9)&(202295_s_at=11)&(209613_s_at=8)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=9)=>(CLASS=NORMAL[1])
(201591_s_at=10)&(202295_s_at=10)&(209613_s_at=8)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=11)&(209613_s_at=6)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=10)&(209613_s_at=10)=>(CLASS=NORMAL[1])
(201591_s_at=9)&(202295_s_at=9)&(209613_s_at=5)=>(CLASS=TUMOR[1])
(201591_s_at=10)&(202295_s_at=13)&(209613_s_at=9)=>(CLASS=TUMOR[1])

VI. RULES REDUCTION METHOD
Due to the large number of rules, it is almost impossible to
understand the hidden data dependencies manually.
That is why the generated rules need to be reduced
in number without affecting the overall accuracy and coverage of
the rules. This rule reduction approach basically consists of the
following 3 steps.
Step 1 - Shortening:
Shortening is the first step, in which large rules are simplified.
The process by which the maximum number of condition
attributes is removed without losing essential information is
called value reduction, and the resulting rule is called
maximally general or of minimal length. Computing maximally
general rules is of particular importance in knowledge
discovery, since they represent general patterns existing in the
data.
The rule simplification algorithm initializes the set of general
rules GRULE to the empty set and copies one rule r1 to a
working rule r. A condition is dropped from rule r, and then rule r is
checked for decision consistency with every other rule.
If rule r is inconsistent, then the dropped condition is
restored. This step is repeated until every condition of the
rule has been dropped once. The resulting rule is the
simplified rule.
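The drop-and-restore procedure above can be sketched as follows. This is a simplified illustration of value reduction on rules over the same attributes, not the exact RSES implementation; the attribute names a, b, c are hypothetical stand-ins for probe IDs, with c alone separating the decisions, mirroring the dominant role of 209613_s_at discussed below:

```python
def shorten(rule, rules):
    """Drop conditions one at a time; restore a condition whenever dropping it
    makes the rule inconsistent with another rule's decision.
    A rule is a (conditions dict, decision) pair."""
    conds, decision = dict(rule[0]), rule[1]
    for attr in list(conds):
        dropped = conds.pop(attr)
        clash = any(
            all(oc.get(a) == v for a, v in conds.items()) and od != decision
            for oc, od in rules        # some rule matches but decides otherwise
        )
        if clash:
            conds[attr] = dropped      # restore the condition
    return conds, decision

# Hypothetical rules; a, b, c stand in for probe IDs.
rules = [
    ({"a": 10, "b": 11, "c": 8},  "TUMOR"),
    ({"a": 10, "b": 11, "c": 11}, "NORMAL"),
    ({"a": 9,  "b": 10, "c": 4},  "TUMOR"),
]
print(shorten(rules[0], rules))  # ({'c': 8}, 'TUMOR')
```

Conditions on a and b can be dropped without creating a decision conflict, but dropping c would make the rule match the NORMAL rule, so c is restored.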
Consider the 35 rules generated for the first
reduct set. Using the above algorithm, it is found that the
conditions imposed by the attributes 201591_s_at
and 202295_s_at can be removed without affecting the
accuracy of the rules generated by the first reduct set, as
209613_s_at is the most dominant attribute in the first reduct
set. Using RSES 2.2, shortening is applied to the 2187
rules with shortening ratio 0.9. The user provides a coefficient
between 0 and 1, which determines how aggressive the
shortening procedure should be. A coefficient equal to
1.0 means that no shortening occurs. If the shortening ratio is
near zero, the algorithm attempts to maximally shorten
reducts. This shortening ratio is in fact a threshold imposed
on the relative size of the positive region after shortening.
Applying the above-mentioned algorithm, the number of rules
for the first reduct set is reduced to 7 (5 for tumor, 2 for
normal) with only a single attribute, 209613_s_at. The
following 7 shortened rules are the rules for the first
reduct set only.
(209613_s_at=8)=>(CLASS=TUMOR[7])
(209613_s_at=4)=>(CLASS=TUMOR[11])
(209613_s_at=5)=>(CLASS=TUMOR[4])
(209613_s_at=7)=>(CLASS=TUMOR[11])
(209613_s_at=6)=>(CLASS=TUMOR[7])
(209613_s_at=11)=>(CLASS=NORMAL[7])
(209613_s_at=10)=>(CLASS=NORMAL[28])
After shortening, the overall number of rules is reduced to only
770. So the reduction factor after shortening is 2187/770
= 2.84.
Step 2 - Generalization:
Though the number of rules is reduced after
shortening, it is still found that, due to different values, there
exist different rules for the same decision. Shortening
minimizes the dependent attributes in a rule, whereas
generalization is the process by which the values of the
reduced attributes are merged into disjunctive form. For the
first reduct set, the number of shortened rules is only 7, but
there are different values of the same attribute 209613_s_at
for which the decision is either tumor or normal. As an
example, if the value is 4, 5, 6, 7 or 8 then the decision
will be tumor, and if the value is either 10 or 11 then the sample
will be considered normal. Therefore, after
generalization, the shortened rules for the first reduct set
can be written as
(209613_s_at=8|4|7|6|5)=> (CLASS=TUMOR [40])
(209613_s_at=11|10)=> (CLASS=NORMAL [35])
Interestingly, it is found that the Match/support of these
generalized rules is increased, as each generalized & shortened
rule supports all individual shortened rules for normal and
tumor. So the total support for a particular decision is the sum
of the supports of the individual rules for that decision.
Obviously generalized rules are stronger than
shortened rules. The strength of a rule is defined as the
ratio of the number of cases supporting a rule for a particular
condition to the total number of cases in the universe of the decision
table. So the strength of the above generalized rule
for tumor is 40/82 = 0.488, which is quite good. Overall, after
generalization of the shortened rules, the number of rules
is reduced to 707.
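The merging step can be sketched as follows, using the seven shortened rules of the first reduct set from Step 1; the supports add up exactly to the 40 and 35 reported for the generalized tumor and normal rules. The (condition, decision, support) triple representation is a hypothetical encoding chosen for the sketch:

```python
from collections import defaultdict

def generalize(rules):
    """Merge single-attribute rules sharing an attribute and a decision into
    one disjunctive rule; the supports add up.
    rules: list of ((attr, value), decision, support) triples."""
    merged = defaultdict(lambda: (set(), 0))
    for (attr, value), decision, support in rules:
        vals, total = merged[(attr, decision)]
        merged[(attr, decision)] = (vals | {value}, total + support)
    return [((attr, sorted(vals)), dec, sup)
            for (attr, dec), (vals, sup) in merged.items()]

# The seven shortened rules of the first reduct set (Step 1 output).
shortened = [
    (("209613_s_at", 8),  "TUMOR", 7),
    (("209613_s_at", 4),  "TUMOR", 11),
    (("209613_s_at", 5),  "TUMOR", 4),
    (("209613_s_at", 7),  "TUMOR", 11),
    (("209613_s_at", 6),  "TUMOR", 7),
    (("209613_s_at", 11), "NORMAL", 7),
    (("209613_s_at", 10), "NORMAL", 28),
]
for rule in generalize(shortened):
    print(rule)
# (('209613_s_at', [4, 5, 6, 7, 8]), 'TUMOR', 40)
# (('209613_s_at', [10, 11]), 'NORMAL', 35)
```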
Step 3 - Filter:
In the previous step it was found that after generalization the
strength of the rules increased as their support increased.
Moreover, stronger rules are better for classification of
unknown datasets. Therefore we filter the shortened &
generalized rules by removing rules with support up to
15, which means that rules with strength less
than 15/82, i.e., 0.183, are not considered for
classification.
After this filter, we get only 142 of the strongest rules
from the 707 generalized rules, which can be used for diagnosis of a
human being from his microarray dataset.
The number of rules must be reduced in such a way that it
does not affect the accuracy of classification.
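The filtering step amounts to a simple threshold on rule support. In the sketch below, each rule is a hypothetical (condition, decision, support) triple; the weak third rule falls below the threshold of 16 cases (strength about 0.195 for a universe of 82) and is removed:

```python
def filter_rules(rules, min_support=16):
    """Keep only rules supported by at least `min_support` cases, i.e. drop
    rules with support up to 15 (strength below ~0.195 for |U| = 82)."""
    return [r for r in rules if r[2] >= min_support]

# Hypothetical (condition, decision, support) triples.
rules = [
    (("209613_s_at", [4, 5, 6, 7, 8]), "TUMOR", 40),
    (("209613_s_at", [10, 11]), "NORMAL", 35),
    (("201591_s_at", [13]), "TUMOR", 1),   # weak rule, filtered out
]
kept = filter_rules(rules)
print(len(kept))  # 2
```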
The following table shows how the number & length of the
rules decrease while the strength of the rules increases at the
different stages of the rule reduction method, which supports
the claim that the accuracy of the rules also increases in spite
of the reduction of rules.
Table 4: Comparative Study of Different Stages of Rule Reduction

Stage of Rule Reduction | No. of Rules | Support Min | Support Max | Length Min | Length Max | Strength Min | Strength Max
Rules from Reduct Set   | 2187 | 1  | 18 | 3 | 11 | 0.0120 | 0.219
After Shortening        | 770  | 1  | 39 | 1 | 8  | 0.0121 | 0.476
After Generalization    | 707  | 1  | 41 | 1 | 8  | 0.0121 | 0.500
After Filter            | 142  | 16 | 41 | 1 | 5  | 0.1950 | 0.500

VII. RESULT ANALYSIS
The rules for the classification of cancer and non-cancer
patients are generated using RSES 2.2
[http://logic.mimuw.edu.pl/rses/] software from the training
data set. Now these rules must be tested to ascertain their
accuracy level for the classification and diagnosis of disease.
Actually, these rules are the knowledge acquired
from the training data set of 82 different instances. Now this
knowledge needs to be tested against the test data set to
prove its validity. This study consists of four parts. First,
the generated 142 rules are tested on the training
data itself; in other words, the rules are cross-validated against
the training data set. The results show that these rules can
predict the 42 tumor & 40 normal cases, which is 100% accurate.
The coverage factor, defined as the ratio of classified
objects to all objects in the class, of the rules is 1.
It shows that all possible combinations of cases in the training
set are covered by these few rules. The confusion matrix
corresponding to this cross validation is shown in
Figure 1. Rows in this confusion matrix correspond to actual
decision classes (all possible values of the decision), while
columns represent decision values as returned by the classifier,
and the values on the diagonal represent correctly
classified cases.
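A confusion matrix of the kind reported in Figures 1-4 can be computed as follows; the sketch reproduces the perfect cross-validation case (42 tumor, 40 normal) described above:

```python
def confusion_matrix(actual, predicted, labels=("TUMOR", "NORMAL")):
    """Rows: actual classes; columns: classifier output; diagonal = correct."""
    m = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

# Perfect cross-validation on the training set (42 tumor, 40 normal) as reported.
actual = ["TUMOR"] * 42 + ["NORMAL"] * 40
predicted = list(actual)        # the 142 rules reproduce every training label
cm = confusion_matrix(actual, predicted)
print(cm["TUMOR"]["TUMOR"], cm["NORMAL"]["NORMAL"])  # 42 40
print(cm["TUMOR"]["NORMAL"], cm["NORMAL"]["TUMOR"])  # 0 0
```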
Next, we added two different datasets of normal lungs,
which were not used for rule generation, to the previous
training dataset. Obviously the new table consists of 84
different objects, among which the numbers of normal & tumor
cases are 42 each. This new dataset is further classified using
the generated rules, and it is found that the rules can easily
detect the two new normal cases without any error.
Figure 2 shows the corresponding confusion matrix of the
result of the train & test experiment for this case.
Similarly, in the third case, we added two new cases of
cancer to the training decision table, and the new table, which
consists of 44 tumor and 40 normal cases, is tested using the
filtered rules. Interestingly, these rules are able to detect these
two new cases of tumor with 100% accuracy, which confirms
the validity of the above-described rule reduction process.
The confusion matrix for this case is shown in Figure 3. At last, the
rules are verified against the test data set of 15 cases,
which consists of 8 tumor & 7 normal cases. After
classification using the rules, it can be seen that the rules
classify the new data set with 100% accuracy and a coverage
factor of 1, which is quite satisfactory. The rules easily
detect the new 8 tumor and 7 normal cases. The confusion
matrix corresponding to this validation is shown in
Figure 4.
Figure 1: Confusion Matrix of Cross Validation
Figure 2: Confusion Matrix for 2nd Case
Figure 3: Confusion Matrix for 3rd Case
Figure 4: Confusion Matrix of 15 New Test Data Sets
The genes that remain in the 142 rules are the strongest candidates for causing Adenocarcinoma. By analyzing the attribute values of these responsible genes, we can readily determine from microarray data whether a person has normal or cancerous lungs. Table 5 lists the 104 most responsible genes for Adenocarcinoma among the 22,284 genes considered.
Table 5: Most Responsible Genes
Genes Genes Genes Genes Genes
201413_at 202848_s_at 206550_s_at 210066_s_at 215091_s_at
203091_at 203065_s_at 206600_s_at 210113_s_at 215972_at
200085_s_at 203091_at 206632_s_at 210438_x_at 216941_s_at
200745_s_at 203217_s_at 206757_at 210481_s_at 217030_at
201336_at 203249_at 206929_s_at 210640_s_at 217826_s_at
201337_s_at 203866_at 207178_s_at 211367_s_at 218304_s_at
201413_at 204288_s_at 207204_at 211395_x_at 218307_at
201456_s_at 204333_s_at 207294_at 211501_s_at 218918_at
201506_at 204987_at 207461_at 211619_s_at 218982_s_at
201591_s_at 205022_s_at 207877_s_at 211714_x_at 219293_s_at
201656_at 205208_at 207973_x_at 211971_s_at 219630_at
201772_at 205261_at 208056_s_at 212230_at 219995_s_at
201938_at 205278_at 208335_s_at 212412_at 220530_at
201953_at 205286_at 208829_at 212611_at 221382_at
201969_at 205379_at 209072_at 212696_s_at 222313_at
202181_at 205455_at 209147_s_at 212737_at 34210_at
202182_at 205549_at 209343_at 212760_at 35820_at
202199_s_at 205793_x_at 209609_s_at 213165_at 38069_at
202295_s_at 206022_at 209613_s_at 213547_at 40560_at
202640_s_at 206068_s_at 209790_s_at 213927_at 49452_at
202737_s_at 206328_at 209817_at 214500_at
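As an illustration only, classification with rules of this kind can be sketched as a lookup over discretized expression levels. The probe IDs below are taken from Table 5, but the rule conditions, levels, and cut-offs are invented placeholders, not the 142 rules generated by the study.

```python
# Hypothetical sketch: applying IF-THEN decision rules to discretized
# microarray attribute values. The rules themselves are invented examples.
RULES = [
    # (conditions on discretized expression levels, decision)
    ({"201413_at": "high", "203091_at": "low"}, "tumor"),
    ({"201413_at": "low"}, "normal"),
]

def classify(sample, rules):
    """Return the decision of the first rule whose conditions all match."""
    for conditions, decision in rules:
        if all(sample.get(gene) == level for gene, level in conditions.items()):
            return decision
    return None  # sample not covered by the rule set

print(classify({"201413_at": "high", "203091_at": "low"}, RULES))  # tumor
print(classify({"201413_at": "low", "203091_at": "high"}, RULES))  # normal
```

A coverage factor of 1, as reported above, means every test sample matches at least one rule, so `classify` never falls through to `None`.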
The above results were verified using the publicly available Gene Ontology website DAVID [http://david.abcc.ncifcrf.gov], which provides a comprehensive set of functional annotation tools for investigators to understand the biological meaning behind large lists of genes. Among the 104 genes, 47 were found to be directly or indirectly related to Lung Adenocarcinoma, which demonstrates that RST can also extract biologically relevant information. Thus it can be concluded that, using this technique, accurate diagnosis of Adenocarcinoma can be made whenever the microarray data of the patient concerned is available.
VIII. DISCUSSION
This study addresses human disease diagnosis using genetic information in the form of microarray experimentation data for the disease Adenocarcinoma. It has been assumed that a given set of genes acts in more or less the same general way across a particular species, human beings in the present investigation. With this assumption in mind, a procedure has been proposed that serves two distinct purposes: first, diagnosis of the disease using microarray data; second, isolation of the most important genes responsible for causing the disease. The algorithm was developed and validated using a soft computing tool, Rough Set Exploration Tools. The results obtained can be used as dependable predictions, and their validation has also been carried out here. The results show 100% accuracy in diagnosis of the disease and also identify the genes responsible for, or affected by, Adenocarcinoma; these gene predictions can be further validated in a wet laboratory. Future research can pursue the genetic network among the responsible genes, which could be used for drug design targeting this particular disease.
All the validation results strengthen our proposal that, using only a mathematical tool like Rough Set Theory, it is possible to isolate the genes that are relatively more active in causing or inhibiting Lung Adenocarcinoma without knowing the functional behaviour of those genes. This work therefore highlights the usefulness and efficiency of RST in the disease diagnosis process and its potential use in inductive learning and as a valuable aid for building rule-based expert systems in medicine.
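The reduct notion underlying the gene-isolation step, a minimal subset of condition attributes that preserves the decision, can be illustrated with a brute-force sketch on a tiny hypothetical decision table. Real reduct computation in rough-set software uses far more efficient algorithms; the table and attribute names below are invented for illustration.

```python
from itertools import combinations

# Tiny hypothetical decision table: rows are objects, keys are condition
# attributes (genes), "d" is the decision class. Not data from the study.
TABLE = [
    {"g1": 1, "g2": 0, "g3": 1, "d": "tumor"},
    {"g1": 0, "g2": 0, "g3": 1, "d": "normal"},
    {"g1": 1, "g2": 1, "g3": 0, "d": "tumor"},
    {"g1": 0, "g2": 1, "g3": 0, "d": "normal"},
]

def is_consistent(table, attrs):
    """True if objects identical on attrs never differ in decision."""
    seen = {}
    for row in table:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row["d"]) != row["d"]:
            return False
    return True

def reducts(table, attrs):
    """All minimal attribute subsets that keep the table consistent."""
    found = []
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            if is_consistent(table, subset) and not any(
                set(f) <= set(subset) for f in found
            ):
                found.append(subset)
    return found

print(reducts(TABLE, ["g1", "g2", "g3"]))  # g1 alone decides the class here
```

In this toy table, `g1` alone separates tumor from normal, so `{g1}` is the single reduct; analogously, the reducts found by the study pick out the dominant genes among the 22,284 attributes.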
Sudip Mandal received the M.Tech. degree in ECE from Kalyani Government Engineering College in 2011. He currently holds the position of Head of the Electronics and Communication Engineering Department at GIMT, India, and is pursuing a Ph.D. degree at the University of Calcutta. His current work includes microwave tomography, bioinformatics and soft computing. He is also a member of IEEE.
Dr. Goutam Saha received his Ph.D. from Bengal Engineering and Science University, Shibpur, on genetic engineering in 1984. He was Head of the IT Department, Government College of Engineering and Leather Technology, Kolkata, India, for many years, and currently holds the position of Head of the IT Department at NEHU, India. His research interests are bioinformatics, systems biology and soft computing.
