+1
at time (t + 1) is affected by its current state and
the states of its neighbors at time t. For example, in a
3-neighborhood CA, the state transition depends on the cell
itself and its left and right neighbors, such that:
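$$s_i^{t+1} = f_i\left(s_{i-1}^{t},\, s_i^{t},\, s_{i+1}^{t}\right) \qquad (1)$$
where $s_{i-1}^{t}$ and $s_{i+1}^{t}$ are the states of the left and right neighbors of the $i$-th cell at time $t$, and $f_i$ is the $i$-th state transition function.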
Every CA gives rise to a state transition graph consisting
of a number of cyclic and acyclic states [21]. The state
transition graph of an arbitrary CA is shown in Fig. 1. The set of
non-cyclic states of the CA as depicted in Fig. 1 forms inverted
trees rooted at the cyclic states. The cyclic states are referred to
as attractors [22]. The states of a tree rooted at a cyclic state α
form the α-basin [23].
Fig. 1 State Transition Diagram
A CA with multiple basins may be viewed as a natural
classifier [24], [25]. It partitions a given set of patterns
among multiple disjoint state transition graphs (Fig. 1), each of
which represents a class whose members fall in the same
attractor basin.
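The notions of attractor and basin can be made concrete with a small sketch. It assumes a deterministic next-state function over a finite state set, follows each state until it repeats, and groups states by the cycle they land on. The names `find_attractor`, `basins` and `toy_step` are hypothetical, and the 3-bit toy rule is an illustration only, not the CA of Fig. 1.

```python
from collections import defaultdict

def find_attractor(state, step):
    """Follow `state` under `step` until a state repeats; return the cycle as a tuple."""
    seen = {}                      # state -> position in the trajectory
    path = []
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = step(state)
    cycle = path[seen[state]:]     # the cyclic part of the trajectory
    k = cycle.index(min(cycle))    # rotate so the smallest state labels the attractor
    return tuple(cycle[k:] + cycle[:k])

def basins(states, step):
    """Group every state by the attractor it eventually reaches."""
    groups = defaultdict(list)
    for s in states:
        groups[find_attractor(s, step)].append(s)
    return dict(groups)

def toy_step(x):
    """Toy 3-bit rule (assumption): new b2 = old b1, new b1 = old b0, new b0 = old b0."""
    return ((x << 1) & 0b110) | (x & 0b001)

if __name__ == "__main__":
    for attractor, members in basins(range(8), toy_step).items():
        print(attractor, members)
```

With this toy rule the eight states split into two basins, one rooted at the all-zero state and one at the all-one state, which is the kind of two-way partition a two-class classifier relies on.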
As an example, let us consider the DNA sequences listed
in Table I. We encode the four nucleotides as A = 00, T =
01, G = 10, C = 11. This binary encoding scheme gives rise to
the binary codes shown for the DNA sequences under
consideration.
TABLE I
BINARY ENCODING OF DNA SEQUENCES
Serial Nr.   DNA Sequence   Binary Code (b9 b8 ... b1 b0)
1            AATTC          0000010111
2            ATTTC          0001010111
3            ATTGA          0001011000
4            CATTC          1100010111
5            AATTA          0000010100
6            GCGCT          1011101101
7            GTGCT          1001101101
8            GTGCC          1001101111
9            TTGCT          0101101101
10           GCGCC          1011101111
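The encoding in Table I can be reproduced with a few lines of code. The sketch below uses the two-bit codes stated in the text; the function names `encode_dna` and `decode_dna` are illustrative choices, not part of the original work.

```python
# Two-bit code per nucleotide, as stated in the text.
NUCLEOTIDE_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}

def encode_dna(sequence):
    """Return the binary string (leftmost nucleotide first) for a DNA sequence."""
    return "".join(NUCLEOTIDE_BITS[base] for base in sequence.upper())

def decode_dna(bits):
    """Inverse of encode_dna; expects an even number of bits."""
    inverse = {code: base for base, code in NUCLEOTIDE_BITS.items()}
    return "".join(inverse[bits[i:i + 2]] for i in range(0, len(bits), 2))

if __name__ == "__main__":
    for seq in ["AATTC", "ATTTC", "ATTGA", "CATTC", "AATTA",
                "GCGCT", "GTGCT", "GTGCC", "TTGCT", "GCGCC"]:
        print(seq, encode_dna(seq))
```

A sequence of five nucleotides therefore maps to a 10-bit pattern, which is why the CA in this example operates on bits b9 to b0.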
To classify the given set of DNA sequences into two classes,
we need to design a CA-based classifier for two pattern sets $P_1$
and $P_2$ such that any two patterns, one taken from $P_1$ and the
other from $P_2$, fall into different attractor basins.
Let us use the rules of state transition as depicted in Table II.
Here ⊕ denotes the bitwise XOR operation.
TABLE II
STATE TRANSITION RULES
Bit Position   Rule
0              b1 ⊕ b0
1              b2 ⊕ b1
2              b3 ⊕ b2 ⊕ b1
3              b4 ⊕ b3 ⊕ b2
4              b4 ⊕ b3
5              b5
6              b7 ⊕ b5
7              b7
8              b9 ⊕ b8 ⊕ b7
9              b9 ⊕ b8
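One application of the rule set in Table II can be sketched as follows. The sketch assumes that the rule listed at bit position i gives the next value of bit i, that all bits are updated synchronously, and that b0 is the rightmost character of the pattern; `RULES` and `next_state` are hypothetical names.

```python
# Table II as data: RULES[i] lists the bit positions XORed to give the next value of bit i.
RULES = {
    0: (1, 0), 1: (2, 1), 2: (3, 2, 1), 3: (4, 3, 2), 4: (4, 3),
    5: (5,), 6: (7, 5), 7: (7,), 8: (9, 8, 7), 9: (9, 8),
}

def next_state(pattern):
    """Apply every rule once, synchronously, to a string written as b9 b8 ... b0."""
    n = len(pattern)

    def bit(j):
        return int(pattern[n - 1 - j])      # b_j is counted from the right

    new_bits = []
    for i in range(n - 1, -1, -1):          # emit b9 first, b0 last
        value = 0
        for j in RULES[i]:
            value ^= bit(j)                 # XOR of the bits the rule depends on
        new_bits.append(str(value))
    return "".join(new_bits)

if __name__ == "__main__":
    print(next_state("0000010111"))
```

Repeated calls to `next_state` trace the trajectory of a pattern through the state transition graph.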
Using the given CA rule set, we observe that the sequences
given in Table I can be classified into two classes, the
0-attractor basin (an attractor with all zeros) and the non-zero
attractor basins, as depicted in Fig. 2.
Fig. 2 Classifying the DNA Patterns into Attractor Basins
[0-Basin: 0000010100, 0001010111, 0001011000, 1100010111, 0000010111]
[Non 0-Basin: 1011101101, 1001101101, 1001101111, 0101101101, 1011101111]
Using the above CA, the given DNA sequences can be
categorized into two sets, as shown in Fig. 3.
Fig. 3 Categorization of the DNA Patterns into two Classes
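Any CA rule that uses only the XOR operation can be emulated by a
coefficient-matrix multiplication scheme, as illustrated below.
The next state of a binary pattern $B = b_{n-1} b_{n-2} \ldots b_1 b_0$ can be
derived by multiplying it, modulo 2, with the corresponding CA-coefficient
matrix
$$C = \begin{bmatrix} c_{0,0} & c_{0,1} & \cdots & c_{0,n-1} \\ c_{1,0} & c_{1,1} & \cdots & c_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n-1,0} & c_{n-1,1} & \cdots & c_{n-1,n-1} \end{bmatrix}$$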
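In order to build the CA-coefficient matrix for a CA rule, the
following equation is used:
$$c_{i,j} = \begin{cases} 1, & \text{if the next state of the } i\text{-th bit depends on the } j\text{-th bit} \\ 0, & \text{otherwise} \end{cases} \qquad i, j = 0, 1, \ldots, n-1$$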
From the equation it is clear that the (i, j)-th position of the
coefficient matrix holds one if and only if the CA rule at the i-th bit
depends on the j-th bit under the XOR operation; otherwise it holds
zero. For example, the CA-coefficient matrix (C) for the CA rule set
shown in Table II can be derived as follows.
C =
         b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
    b9 [  1  1  0  0  0  0  0  0  0  0 ]
    b8 [  1  1  1  0  0  0  0  0  0  0 ]
    b7 [  0  0  1  0  0  0  0  0  0  0 ]
    b6 [  0  0  1  0  1  0  0  0  0  0 ]
    b5 [  0  0  0  0  1  0  0  0  0  0 ]
    b4 [  0  0  0  0  0  1  1  0  0  0 ]
    b3 [  0  0  0  0  0  1  1  1  0  0 ]
    b2 [  0  0  0  0  0  0  1  1  1  0 ]
    b1 [  0  0  0  0  0  0  0  1  1  0 ]
    b0 [  0  0  0  0  0  0  0  0  1  1 ]
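The construction of C and its use as a next-state operator can be sketched as below. The sketch assumes the rule at bit i defines row i, takes the product modulo 2, and stores b_j at list index j; `coefficient_matrix`, `multiply_mod2`, `next_state` and `printed_row` are hypothetical helper names.

```python
# Table II as data: RULES[i] lists the bits XORed to give the next value of bit i.
RULES = {
    0: (1, 0), 1: (2, 1), 2: (3, 2, 1), 3: (4, 3, 2), 4: (4, 3),
    5: (5,), 6: (7, 5), 7: (7,), 8: (9, 8, 7), 9: (9, 8),
}

def coefficient_matrix(rules, n):
    """c[i][j] = 1 if the rule for bit i uses bit j, else 0 (indices 0 .. n-1)."""
    return [[1 if j in rules[i] else 0 for j in range(n)] for i in range(n)]

def multiply_mod2(c, b):
    """Matrix-vector product over GF(2); b[j] holds bit b_j."""
    return [sum(cij & bj for cij, bj in zip(row, b)) % 2 for row in c]

C = coefficient_matrix(RULES, 10)

def next_state(pattern):
    """Next state of a pattern written as b9 b8 ... b0."""
    b = [int(ch) for ch in reversed(pattern)]        # b[0] is the rightmost bit
    return "".join(str(v) for v in reversed(multiply_mod2(C, b)))

def printed_row(i):
    """Row of bit i in the printed layout, columns ordered b9 ... b0."""
    return " ".join(str(C[i][j]) for j in range(9, -1, -1))

if __name__ == "__main__":
    for i in range(9, -1, -1):
        print(f"b{i}", printed_row(i))
```

Printing the rows from b9 down to b0 reproduces the matrix layout shown above, so the rule table and the matrix carry the same information.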
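In the case of an XOR-CA rule, the following theorem relates a
pattern to its basin.
Theorem 1 If two arbitrary patterns $B_1$ and $B_2$ reach the same
$\alpha$-basin on consecutive applications of an XOR-CA rule,
then the pattern $B = B_1 \oplus B_2$ reaches the zero-basin on
consecutive applications of the same XOR-CA rule.
Proof:
Let $B_1^{0}$ and $B_2^{0}$ be two arbitrary patterns falling in the
$\alpha$-basin, and let $C$ be the CA-coefficient matrix of the rule.
Suppose the pattern $B_1^{0}$ reaches $\alpha$ after the $k$-th consecutive
application of the XOR-CA rule, so that
$$B_1^{k} = C \cdot B_1^{k-1} = C^{2} \cdot B_1^{k-2} = \cdots = C^{k} \cdot B_1^{0} = \alpha.$$
Similarly, for the pattern $B_2^{0}$,
$$B_2^{k} = C^{k} \cdot B_2^{0} = \alpha.$$
Now,
$$B_1^{k} \oplus B_2^{k} = \alpha \oplus \alpha = 0$$
leads to
$$C^{k} \cdot \left(B_1^{0} \oplus B_2^{0}\right) = 0.$$
The above equation implies that the pattern $B$ derived from
$B = B_1^{0} \oplus B_2^{0}$ also reaches the zero-basin.
Hence the proof.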
Example 1 In Fig. 1, two patterns $B_1$ = 01000 and $B_2$ = 10000
are in the zero-basin, with $B = B_1 \oplus B_2$ = 11000 also falling in
the same zero-basin.
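Theorem 1 rests on the fact that an XOR-only CA is linear over GF(2): the next state of $B_1 \oplus B_2$ equals the XOR of the next states of $B_1$ and $B_2$. The sketch below checks this property numerically for a randomly generated XOR rule. The bitmask representation and the names `step` and `check_superposition` are illustrative assumptions.

```python
import random

def step(rows, x):
    """One XOR-CA step: new bit i is the parity of (rows[i] AND x)."""
    y = 0
    for i, mask in enumerate(rows):
        y |= (bin(mask & x).count("1") & 1) << i
    return y

def check_superposition(n=10, trials=1000, seed=0):
    """Return True if step(a XOR b) == step(a) XOR step(b) for all sampled pairs."""
    rng = random.Random(seed)
    rows = [rng.getrandbits(n) for _ in range(n)]   # random XOR rule, one bitmask per bit
    for _ in range(trials):
        a, b = rng.getrandbits(n), rng.getrandbits(n)
        if step(rows, a ^ b) != step(rows, a) ^ step(rows, b):
            return False
    return True

if __name__ == "__main__":
    print(check_superposition())
```

By induction, k steps applied to $B_1 \oplus B_2$ equal the XOR of k steps applied to each pattern separately, which is the property the proof uses.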
Theorem 1 implies that the Hamming distance between a pair
of patterns falling in the same basin is reflected in the zero-basin
patterns. It follows that XOR-CA rules whose zero-basin patterns
are close to each other with respect to Hamming distance can act
as effective pattern classifiers. The state-transition characteristics
of such CA are desirable for DNA-pattern classification, wherein
similar patterns will tend to fall in the zero-basin.
Design of Multi-stage Hierarchical Classifier: A two-class
XOR-CA classifier is favourable for implementation due to its
simplicity but has several limitations due to its linear
characteristics. The limitations of a single-stage
XOR-CA classifier can be avoided to a certain extent by
designing a multi-stage hierarchical classifier. In a multi-stage
classification scheme, the single-stage classifier is employed
repeatedly at every stage, leading to a hierarchical tree-like
structure in which each node corresponds to a single-stage CA
classifier (Fig. 4).
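The tree-like structure can be sketched as a small data structure in which every internal node holds a two-class test and every leaf holds a class label. In this sketch the per-node test `in_zero_basin` is a hypothetical stand-in for an evolved single-stage CA classifier, and the toy predicates on the first and last bit are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Stage:
    """One node of the tree: a two-class test plus the outcome for each side."""
    in_zero_basin: Callable[[str], bool]     # hypothetical single-stage classifier
    if_zero: Union["Stage", str]             # subtree or leaf label for the zero-basin
    if_nonzero: Union["Stage", str]          # subtree or leaf label for non-zero basins

def classify(node, pattern):
    """Walk down the tree until a leaf label (a plain string) is reached."""
    while isinstance(node, Stage):
        node = node.if_zero if node.in_zero_basin(pattern) else node.if_nonzero
    return node

# Toy two-level tree separating four classes S1..S4 (predicates are placeholders).
ROOT = Stage(
    lambda p: p[0] == "0",
    Stage(lambda p: p[-1] == "0", "S1", "S2"),
    Stage(lambda p: p[-1] == "0", "S3", "S4"),
)

if __name__ == "__main__":
    for pattern in ["00", "01", "10", "11"]:
        print(pattern, classify(ROOT, pattern))
```

Each stage only ever makes a two-way decision, so a tree of depth d can separate up to 2^d classes while reusing the same single-stage design.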
[Fig. 3 content: Class A = {AATTC, ATTTC, ATTGA, CATTC, AATTA}; Class B = {GCGCT, GTGCT, GTGCC, TTGCT, GCGCC}]
IV. EVOLUTION OF THE CA BY SIMULATED ANNEALING
Simulated annealing is a generalization of a Monte Carlo
method for examining the equations of state and frozen states of
n-body systems [26]. We employ simulated annealing, tuned
appropriately, to arrive at a CA whose zero-basin patterns are
close to each other.
In simulated annealing, an initial temperature (Ti) is set, and the
temperature decreases exponentially during the process [27]. At
each temperature point (Tp), an action is taken based on the
value of the cost function. The process continues until the
temperature becomes zero. To evaluate the CA rules as a DNA
pattern classifier, we design a heuristic cost function as
described below. Assume that we are given two distinct classes
of DNA sequences, class A = {AATTC, ATTTC, ATTGA, CATTC,
AATTA} and class B = {GCGCT, GTGCT, GTGCC, TTGCT, GCGCC}.
We initially create a randomly generated CA rule, represented by
the coefficient matrix C. To train the classifier, we arbitrarily
select $N_A$ DNA sequences from class A and $N_B$ DNA sequences
from class B. Assume that, on applying the CA rules, $N_{AB}$ of the
$(N_A + N_B)$ sequences fall in the zero-basin and the rest fall in
non-zero basins. Next, we derive the consensus sequence CSeq for
these $N_{AB}$ DNA sequences using HMMER [28], an online DNA
sequence analysis tool based on Hidden Markov Models. Let L
be the average Levenshtein distance of these $N_{AB}$ DNA
sequences with respect to the consensus sequence CSeq. The
Levenshtein distance Lev(x, y) between two DNA sequences x
and y of lengths m and n, respectively, is given by
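$$\mathrm{Lev}_{x,y}(m, n) = \begin{cases} \max(m, n), & \text{if } \min(m, n) = 0 \\ \min \begin{cases} \mathrm{Lev}_{x,y}(m-1, n) + 1 \\ \mathrm{Lev}_{x,y}(m, n-1) + 1 \\ \mathrm{Lev}_{x,y}(m-1, n-1) + [x_m \neq y_n] \end{cases} & \text{otherwise} \end{cases}$$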
The Levenshtein distance is an integer that measures how far
apart two DNA sequences are. To compute the Fuzzy Levenshtein
distance [29], the percentage similarity between two DNA
sequences is computed. To transform the Levenshtein distance
into a similarity score, the number of edits required is divided by
the length of the longer string and the result is subtracted from
1.0. The Fuzzy Levenshtein distance expressed as a percentage is
obtained by multiplying the resulting value by 100. The Fuzzy
Levenshtein distances of a set of three DNA sequences in the
same basin from their consensus sequence are illustrated in the
table below:
TABLE III
COMPUTATION OF FUZZY LEVENSHTEIN DISTANCE
Sequence   Consensus Sequence   Fuzzy Levenshtein distance
CAGAT      CAGTT                0.8
AGGTT      CAGTT                0.2
CAATT      CAGTT                0.6
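The Levenshtein recurrence and the similarity score derived from it can be sketched as follows. The sketch reads the description as 1 minus the number of edits divided by the length of the longer string, which is an interpretation rather than the authors' code; under this reading the pair CAGAT and CAGTT gives 0.8. The names `levenshtein` and `similarity` are hypothetical.

```python
def levenshtein(x, y):
    """Edit distance via the recurrence, filled row by row (dynamic programming)."""
    previous = list(range(len(y) + 1))            # Lev(0, n) = n
    for i, xc in enumerate(x, start=1):
        current = [i]                             # Lev(m, 0) = m
        for j, yc in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,                  # deletion
                current[j - 1] + 1,               # insertion
                previous[j - 1] + (xc != yc),     # substitution, free on a match
            ))
        previous = current
    return previous[-1]

def similarity(x, y):
    """1 - edits / length of the longer string; multiply by 100 for a percentage."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1.0 - levenshtein(x, y) / longest

if __name__ == "__main__":
    print(levenshtein("CAGAT", "CAGTT"), similarity("CAGAT", "CAGTT"))
```

The row-by-row formulation needs only two rows of the dynamic-programming table, which keeps memory proportional to the length of one sequence.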
The fitness cost of a CA as a solution is then calculated as the
average of the Fuzzy Levenshtein distances of each sequence in
the alignment to the consensus sequence. For example, the
fitness cost of the solution in the previous example is 0.53. The
lower the value of L, the better the fitness cost of the CA
rule as a classifier.
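The value 0.53 quoted above is the arithmetic mean of the three entries of Table III:

```latex
L = \frac{0.8 + 0.2 + 0.6}{3} = \frac{1.6}{3} \approx 0.53
```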
There are two types of solutions based on cost value: Best
Solution (BS) and Current Solution (CS). A New Solution (NS)
at the next Tp compares its cost value with that of CS. If NS has
a better cost value than CS, then NS becomes CS. NS is also
compared with BS, and if NS is better, then NS becomes BS.
Even if NS is not as good as CS, NS is still accepted with some
probability. This step is typically included to avoid getting
trapped in local minima. The complete algorithm is presented below:
Algorithm SA_EvolveCA
// Input: Pattern Size (n), Pattern Set (S), Initial Temp. (Ti)
// Output: CA Rule.
1   Tp = Ti
2   CS = BS = NULL
3   while (Tp > 0) {
4       if (Tp > 0.5 * Ti) {
5           Randomly generate a CA as guess solution
6       }
7       else {
8           Generate a new solution from CS
9       }
10      Generate state transition table and rule table
11      NS = CA-Rule
12      Δcost = cost-value(NS) − cost-value(CS)
13      if (Δcost < 0) {
14          CS = NS
15          if (cost-value(NS) < cost-value(BS)) {
16              BS = NS
17          }
18      }
19      else
20          accept CS = NS with a temperature-dependent probability
21      Reduce Tp exponentially
22  }
Fig. 4 Multi-stage hierarchical classifier
[Tree: root {S1, S2, S3, S4}; second level {S1, S2} and {S3, S4}; leaves S1, S2, S3, S4]
The simulated annealing algorithm above explores the CA search
space heuristically for the desired CA rule as long as the
temperature remains positive. The temperature (Tp) is
initialized with a large value (line 1) and is reduced gradually
at every attempt (line 21) until the termination phase is reached.
A CA rule is randomly generated as a candidate solution (line 5)
through random synthesis of the CA-coefficient matrix. To
obtain a neighboring candidate solution to CS (i.e., the current
solution derived so far), a few bits of the CA-coefficient matrix
corresponding to CS are altered (line 8). The probability of
accepting a new candidate solution as the current solution depends
on the fitness-cost value. It is evident from the algorithm that
every new candidate solution can become the current solution
irrespective of its fitness cost. However, as the temperature
approaches zero, i.e., as the algorithm approaches the termination
phase, the probability of accepting a less-fit CA rule also
diminishes (line 21). During the exploration, the algorithm
records the best explored CA rule as BS (line 16).
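The search loop can be sketched in runnable form as below. Several details are assumptions made for illustration and are not stated in the text: the acceptance probability is taken to be the usual Metropolis form exp(-Δcost / Tp), the loop stops at a small positive floor `t_min` because an exponentially decaying temperature never reaches exactly zero, CS and BS start from a random matrix rather than NULL, and `count_ones` is a placeholder cost standing in for the Fuzzy Levenshtein cost.

```python
import math
import random

def random_matrix(n, rng):
    """Random n x n binary coefficient matrix (a random XOR-CA rule)."""
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

def neighbour(matrix, rng, flips=2):
    """Copy the matrix and flip a few randomly chosen coefficients."""
    new = [row[:] for row in matrix]
    n = len(new)
    for _ in range(flips):
        new[rng.randrange(n)][rng.randrange(n)] ^= 1
    return new

def count_ones(matrix):
    """Placeholder cost (assumption); the paper uses a Fuzzy Levenshtein cost."""
    return sum(map(sum, matrix))

def evolve_ca(n, cost, t_initial=100.0, t_min=1e-3, decay=0.95, seed=0):
    """Return the lowest-cost coefficient matrix seen, together with its cost."""
    rng = random.Random(seed)
    current = best = random_matrix(n, rng)
    current_cost = best_cost = cost(current)
    t = t_initial
    while t > t_min:                                  # assumed floor instead of Tp > 0
        if t > 0.5 * t_initial:
            candidate = random_matrix(n, rng)         # early phase: fresh random guess
        else:
            candidate = neighbour(current, rng)       # later phase: perturb CS
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Assumed Metropolis rule: accept improvements, else with prob. exp(-delta / t).
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost   # record BS
        t *= decay                                    # exponential cooling
    return best, best_cost

if __name__ == "__main__":
    matrix, value = evolve_ca(5, count_ones, seed=1)
    print(value)
```

Passing the cost as a function keeps the search loop independent of how the fitness is computed, so the placeholder can be swapped for a consensus-based cost without touching the loop.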
V. RESULT
This section reports experimental observations during
evaluation of our proposed CA based DNA classification
scheme. To analyse the performance of the proposed CA based
DNA-pattern classifier, the experiment has been performed on
synthetic datasets. All the experiments have been conducted
under the following setup.
Hardware
  o Processor: Intel Core i7-3610QM CPU @ 2.30 GHz × 8
  o RAM: 8 GB
  o Disk: 1000 GB
Software
  o Operating system: openSUSE, kernel version 3.1.0-1.2-desktop
  o OS type: 32-bit
  o Compiler used: javac version 4.6.2 (SUSE Linux)
During the experimentation, emphasis has been put on the
behavior of our model in response to varying DNA sequence
length as well as the number of DNA trainee patterns (i.e.,
trainee size). The given set of DNA sequences has been
randomly divided into a trainee set and a testing set. The most
desirable CA is evolved through the simulated annealing
heuristic, and the CA is taken to be the best explored solution,
i.e., the one that classifies the trainee DNA patterns most
efficiently with respect to their Fuzzy Levenshtein distances, as
discussed in the previous section. The testing set is used to
measure the class-prediction accuracy of the model built with
randomly chosen trainee patterns, in comparison with the actual
class membership. The overall performance of the proposed
scheme is presented in the following graphs, plotted for
different DNA sequence lengths and numbers of trainee
sequences. Each figure presents the classification accuracy of
the proposed model for DNA patterns of a given sequence
length against varying trainee pattern size. Fig. 5 displays the
classification accuracy for DNA sequence length 20, whereas
Fig. 6 reports the classification accuracy for DNA sequence
length 40. The observations reveal several facts about the
behavior of the model. In both cases, the accuracy ranges from
about 60 percent to 95 percent and improves roughly linearly
with the number of trainee DNA patterns.
Fig. 5 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 20
Fig. 6 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 40
[Plot: x-axis, number of trainee patterns (20, 40, 60, 80, 100); y-axis, classification accuracy in percent (scale 60 to 100); plotted values 67.19, 74.45, 80.13, 89.03, 93.77]