
Abstract: Contemporary researchers in Bio-informatics have
witnessed an exponential growth in the amount of biological
information over the years. The increasing volume of DNA sequences
has of late created interest among many scientists in computational
approaches to DNA sequence analysis. Much computer analysis of
DNA sequences is directed toward meaningful interpretation of
biologically significant patterns. Pattern classification forms one of the
most important foundations for extraction of knowledge from the
enormous DNA sequence databases. This paper reports a low-cost and
efficient DNA pattern classifier based on the sparse network of
Cellular Automata.

Keywords: Bio-informatics, Cellular Automata, DNA, Pattern
Classification, Sequence Analysis
I. INTRODUCTION
Cellular Automata (CA) is a simple model of a spatially
extended decentralized system made up of a number of
individual components (cells) [1]. The communication
between constituent cells is limited to local interaction. The
state of each individual cell changes over time depending on the
states of its neighbors [2], [3]. The overall structure can be
viewed as a parallel processing computer [4].
Cellular Automata is used by Bio-informatics researchers for
automated recognition, description, classification and grouping
of patterns [5], [6]. For example, CA based models have been
reported to recognize genetic disorder in cells responsible for
the development of cancer [7], [8]. A CA based pattern
classification model generally comprises two basic
operations: exploration of CA for supervised classification
and prediction of class for unknown patterns.
DNA (deoxyribonucleic acid) consists of two long strands,
each strand being made of units called phosphates, deoxyribose
sugars and nucleotides (adenine [A], guanine [G], cytosine [C],
and thymine [T]) linked in series. For ease of understanding,
biologists commonly represent DNA molecules simply by their
different nucleotides using the symbols {A, G, C, T}. The DNA
in each cell provides the full genetic blueprint for that cell.

Manuscript received October 9, 2013.
Tamal Chakrabarti is with the Computer Science and Engineering
Department, Institute of Engineering and Management, Y-12, Block-EP,
Sector-V, Salt Lake Electronics Complex, Kolkata-700091, West Bengal, India
( e-mail: tamalc@gmail.com).
Sourav Saha is with the Computer Science and Engineering Department,
Institute of Engineering and Management, Y-12, Block-EP, Sector-V, Salt
Lake Electronics Complex, Kolkata-700091, West Bengal, India (e-mail:
souravsaha1977@gmail.com).
Devadatta Sinha is with the Computer Science and Engineering Department,
Calcutta University, 92 Acharya Prafulla Chandra Road, Kolkata-700009,
West Bengal, India (e-mail: devadatta.sinha@gmail.com).
The identification of genes in a given DNA sequence is an
emerging field of study in Bio-informatics. One popular
approach is to develop a predictive computer model from a
database of known gene sequences and use the resulting model
to predict where genes are likely to be in newly generated
sequence information. Discovery of coding regions in DNA
sequences can therefore be viewed as a pattern recognition
problem. The explosive growth in biological data demands that
the most advanced and powerful ideas in machine learning,
such as cellular automata, should be brought to bear on such
problems [9], [10].
A pattern classification/recognition algorithm using CA
usually has two phases [11], the learning or training phase and
the testing phase. In the training phase, the
machine/network/algorithm is trained with some benchmark
patterns [12]. In the testing phase new patterns are tested
against the trained model [13], built in the previous step.
This paper proposes a DNA classification scheme
using a Cellular Automaton (CA) that is evolved through
the Simulated Annealing heuristic technique.
II. RELATED WORKS
Many research works deal with the problem of classification
of DNA patterns from genomic sequences. One of the vital
tasks in the study of genomes is DNA sequence identification
[14]. Recently researchers have attempted various
soft-computing techniques for DNA sequence identification.
Peterson et al. used the proper orthogonal decomposition (POD)
technique to recognize various cancerous patterns in DNA
sequences [15]. Since the performance of a classification
strategy heavily depends on the selection of a similarity or distance
measure, there has been a demand for exploration of various
similarity metrics for DNA classification. Priness et al.
compared various unsupervised classification techniques of
DNA sequences with respect to the Euclidean distance and the
Pearson correlation [16]. Kulp et al. proposed a generalized
Hidden Markov Model (GHMM) based framework to
recognize human genes in DNA [17]. Kumar et al. integrated
pattern mining and neural network-based approaches to classify
DNA sequences with reduced dimensions using Multi-linear
Principal Component Analysis (MPCA) [18]. However, the
VLSI-implementation-friendly sparse structure of CA has not
yet been extensively utilized as a DNA pattern classifier. The
proposed work evolves a CA based classification framework
for DNA pattern prediction in linear time. In order to derive
a desirable CA for DNA pattern classification, the proposed
scheme employs the simulated annealing heuristic with the Fuzzy
Levenshtein distance as the similarity-cost measure among DNA
patterns. The simple structure of CA renders the proposed
model easily implementable on a VLSI chip suitable for
embedded applications, which demand high speed.

A Cellular Automata Based DNA Pattern Classifier
Tamal Chakrabarti, Sourav Saha, and Devadatta Sinha

III. CELLULAR AUTOMATA AS DNA PATTERN CLASSIFIER
A Cellular Automaton (CA) consists of a number of cells
organized in the form of a lattice [19]. It evolves in discrete
space and time, and can be viewed as an autonomous finite state
machine (FSM) [20]. Each cell stores a discrete variable at time
t that refers to the current state $q_i^t$ of the cell. The next state
$q_i^{t+1}$ of the cell at time (t + 1) is affected by its current state and
the states of its neighbors at time t. For example, in the case of a
3-neighborhood CA, the state transition depends on the cell
itself and its left and right neighbors, such that:

$$q_i^{t+1} = f_i\left(q_{i-1}^t,\; q_i^t,\; q_{i+1}^t\right) \qquad (1)$$

where $q_{i-1}^t$ and $q_{i+1}^t$ are the current states of the left and right
neighbors of the i-th CA cell at time t, and $f_i$ is the i-th state
transition function.
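
The update in (1) can be sketched in a few lines of Python. This is only an illustration: the null boundary (missing neighbors read as 0) and the names `ca_step` and `xor_rule` are assumptions made here, not details given in the paper.

```python
from typing import Callable, List

# A rule maps (left, self, right) to the next state of one cell.
Rule = Callable[[int, int, int], int]

def ca_step(cells: List[int], rules: List[Rule]) -> List[int]:
    """One synchronous update of a 1-D, 3-neighborhood CA.

    Assumption: null boundary, i.e. missing neighbors at both ends read as 0.
    """
    n = len(cells)
    nxt = []
    for i in range(n):
        left = cells[i - 1] if i > 0 else 0
        right = cells[i + 1] if i < n - 1 else 0
        nxt.append(rules[i](left, cells[i], right))
    return nxt

def xor_rule(left: int, centre: int, right: int) -> int:
    """XOR of the three neighborhood states (Wolfram rule 150)."""
    return left ^ centre ^ right

state = [0, 1, 0, 0]
print(ca_step(state, [xor_rule] * len(state)))  # [1, 1, 1, 0]
```

Passing one rule per cell mirrors the per-cell transition function $f_i$; a uniform CA simply repeats the same rule.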
Every CA gives rise to a state transition graph consisting
of a number of cyclic and acyclic states [21]. The state
transition graph of an arbitrary CA is shown in Fig. 1. The set of
non-cyclic states of the CA as depicted in Fig. 1 forms inverted
trees rooted at the cyclic states. The cyclic states are referred to
as attractors [22]. The states of a tree rooted at a cyclic state α
form the α-basin [23].



















Fig. 1 State Transition Diagram
A CA with multiple basins may be viewed as a natural
classifier [24], [25]. It tends to classify a given set of patterns
into multiple disjoint state transition graphs (Fig. 1) with each
disjoint graph representing a class falling in their respective
attractor basin.
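
To make the notion of attractor basins concrete, the following sketch enumerates every state of a small CA, follows each trajectory until it enters a cycle, and groups the states by the cycle they end in. The 3-bit `toy_step` rule is a hypothetical example chosen for illustration; it is not the CA of Fig. 1.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def find_attractor(state: int, step: Callable[[int], int]) -> Tuple[int, ...]:
    """Follow the trajectory from `state` until a state repeats; return the cycle."""
    seen = {}
    path = []
    while state not in seen:
        seen[state] = len(path)
        path.append(state)
        state = step(state)
    cycle = path[seen[state]:]
    # Rotate so the smallest state comes first: one canonical name per cycle.
    k = cycle.index(min(cycle))
    return tuple(cycle[k:] + cycle[:k])

def basins(n_bits: int, step: Callable[[int], int]) -> Dict[Tuple[int, ...], List[int]]:
    """Group all 2**n_bits states by the attractor cycle they eventually reach."""
    groups = defaultdict(list)
    for s in range(1 << n_bits):
        groups[find_attractor(s, step)].append(s)
    return dict(groups)

# Hypothetical 3-bit rule: b2 is kept, b1 becomes 0, b0 takes the old b1.
def toy_step(s: int) -> int:
    return (s & 0b100) | ((s & 0b011) >> 1)

for attractor, members in basins(3, toy_step).items():
    print(attractor, [format(m, "03b") for m in members])
```

For this toy rule the states split into two basins, one rooted at 000 and one at 100, so the basin a pattern falls into acts as its class label.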
As an example let us consider the DNA sequences depicted
in Table I. We have encoded the four nucleotides as A = 00, T =
01, G = 10, C = 11. This binary encoding scheme gives rise to
the following binary codes for the DNA sequences under
consideration.
TABLE I
BINARY ENCODING OF DNA SEQUENCES
Serial Nr.   DNA Sequence   Binary Code (b9 b8 ... b1 b0)
1            AATTC          0000010111
2            ATTTC          0001010111
3            ATTGA          0001011000
4            CATTC          1100010111
5            AATTA          0000010100
6            GCGCT          1011101101
7            GTGCT          1001101101
8            GTGCC          1001101111
9            TTGCT          0101101101
10           GCGCC          1011101111
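
The encoding of Table I can be reproduced with a small lookup table. The function names `encode` and `decode` are illustrative choices, not part of the paper.

```python
# Two-bit code per nucleotide, as stated in the text.
CODE = {"A": "00", "T": "01", "G": "10", "C": "11"}
INVERSE = {bits: base for base, bits in CODE.items()}

def encode(seq: str) -> str:
    """Return the binary code of a DNA sequence, leftmost nucleotide first."""
    return "".join(CODE[base] for base in seq.upper())

def decode(bits: str) -> str:
    """Inverse of encode; expects an even number of bits."""
    return "".join(INVERSE[bits[i:i + 2]] for i in range(0, len(bits), 2))

for seq in ["AATTC", "GCGCT"]:
    print(seq, encode(seq))
```

A sequence of length 5 therefore maps to a 10-bit pattern, which is why the CA in this example has ten cells.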

To classify the given set of DNA sequences into two classes,
we need to design a CA based classifier for two pattern sets $P_1$
and $P_2$, such that any two patterns $p_1 \in P_1$ and $p_2 \in P_2$
fall into different attractor basins.
Let us use the rules of state transition as depicted in Table II.
Here ⊕ means the bitwise XOR operation.

TABLE II
STATE TRANSITION RULES
Bit Position   Rule
0              b1 ⊕ b0
1              b2 ⊕ b1
2              b3 ⊕ b2 ⊕ b1
3              b4 ⊕ b3 ⊕ b2
4              b4 ⊕ b3
5              b5
6              b7 ⊕ b5
7              b7
8              b9 ⊕ b8 ⊕ b7
9              b9 ⊕ b8
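
One application of the rules in Table II can be sketched as follows. The sketch assumes that each rule is the XOR of the bits listed for that position (the reading that agrees with the coefficient matrix C given later) and that patterns are written as strings b9 b8 ... b1 b0; the names `RULES` and `step` are illustrative.

```python
# Table II: the next value of bit i is the XOR of the listed current bits.
RULES = {
    0: (1, 0), 1: (2, 1), 2: (3, 2, 1), 3: (4, 3, 2), 4: (4, 3),
    5: (5,), 6: (7, 5), 7: (7,), 8: (9, 8, 7), 9: (9, 8),
}

def step(pattern: str) -> str:
    """Apply the rules once to a bit string written as b9 b8 ... b1 b0."""
    n = len(pattern)

    def bit(j: int) -> int:
        # b_j is counted from the right-hand end of the string.
        return int(pattern[n - 1 - j])

    nxt = []
    for i in range(n - 1, -1, -1):  # emit b9 first, b0 last
        value = 0
        for j in RULES[i]:
            value ^= bit(j)
        nxt.append(str(value))
    return "".join(nxt)

print(step("0000010111"))  # AATTC -> 0000010000
```

Under this reading, bits b5 and b7 are copied unchanged at every step, so a pattern with either of them set can never reach the all-zero state.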

Using the given CA-rule set we observe that the sequences
given in Table I, can be classified into two classes, the
0-attractor basin (an attractor with all zeros) and the non-zero
attractor basins, as depicted in Fig. 2.
















Fig. 2 Classifying the DNA Patterns into Attractor Basins
0-Basin: 0000010100, 0001010111, 0001011000, 1100010111, 0000010111
Non 0-Basin: 1011101101, 1001101101, 1001101111, 0101101101, 1011101111

Node labels of Fig. 1, in four groups of eight states:
10001 01001 10000 11000 01000 00001 00000 11001
10010 10011 01010 11011 01011 00010 00011 11010
10100 10101 01100 11100 01101 00101 00100 11101
10110 10111 01110 11111 01111 00110 00111 11110

Using the above CA the given DNA sequences can be
categorized into two sets as shown in Fig. 3.

Fig. 3 Categorization of the DNA Patterns into two Classes
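Any CA rule with only XOR-operation can be emulated by a
coefficient-matrix multiplication scheme as illustrated below.
The next state of a binary pattern $B = b_{n-1} b_{n-2} \dots b_1 b_0$ can be
derived by multiplying it with the corresponding CA-Coefficient
matrix, with addition taken as XOR:

$$B_{t+1} = \begin{bmatrix} c_{0,0} & c_{0,1} & \cdots & c_{0,n-1} \\ c_{1,0} & c_{1,1} & \cdots & c_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ c_{n-1,0} & c_{n-1,1} & \cdots & c_{n-1,n-1} \end{bmatrix} \times B_t$$

In order to build the CA-Coefficient matrix for a CA rule the
following equation is used.

$$c_{i,j} = \begin{cases} 1, & \text{if the next state of the } i\text{-th bit depends on the } j\text{-th bit} \\ 0, & \text{otherwise} \end{cases} \qquad i, j = 0, 1, \dots, n-1$$

From the equation it is clear that the (i, j)-th position of the
coefficient matrix holds one if and only if the CA-rule at the i-th bit
depends on the j-th bit for the XOR operation; otherwise it holds
zero. For example, the corresponding CA-Coefficient
matrix (C) for the CA-Rule set shown in Table II can be derived
as follows.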



C =
        b9 b8 b7 b6 b5 b4 b3 b2 b1 b0
   b9    1  1  0  0  0  0  0  0  0  0
   b8    1  1  1  0  0  0  0  0  0  0
   b7    0  0  1  0  0  0  0  0  0  0
   b6    0  0  1  0  1  0  0  0  0  0
   b5    0  0  0  0  1  0  0  0  0  0
   b4    0  0  0  0  0  1  1  0  0  0
   b3    0  0  0  0  0  1  1  1  0  0
   b2    0  0  0  0  0  0  1  1  1  0
   b1    0  0  0  0  0  0  0  1  1  0
   b0    0  0  0  0  0  0  0  0  1  1
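
The construction of C from the rule set, and the matrix form of one CA step, can be sketched as below. The dependency lists restate Table II under the assumption that each rule is the XOR of its listed bits; `coefficient_matrix` and `next_state` are hypothetical helper names.

```python
from typing import Dict, List, Tuple

# Table II as dependency lists: bit i depends on the listed bits.
RULES: Dict[int, Tuple[int, ...]] = {
    0: (1, 0), 1: (2, 1), 2: (3, 2, 1), 3: (4, 3, 2), 4: (4, 3),
    5: (5,), 6: (7, 5), 7: (7,), 8: (9, 8, 7), 9: (9, 8),
}

def coefficient_matrix(rules: Dict[int, Tuple[int, ...]], n: int) -> List[List[int]]:
    """Build C with row and column 0 corresponding to b_{n-1}, as laid out above."""
    C = [[0] * n for _ in range(n)]
    for i, deps in rules.items():
        for j in deps:
            C[n - 1 - i][n - 1 - j] = 1
    return C

def next_state(C: List[List[int]], bits: List[int]) -> List[int]:
    """Multiply C by the column vector `bits` (b_{n-1} first), reducing modulo 2."""
    return [sum(c * b for c, b in zip(row, bits)) % 2 for row in C]

C = coefficient_matrix(RULES, 10)
for row in C:
    print(" ".join(map(str, row)))
print(next_state(C, [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]))  # pattern of AATTC
```

Reducing the ordinary dot product modulo 2 is equivalent to XOR-ing the selected bits, which is why a pure XOR rule set can be written as a single matrix.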

In case of an XOR-CA rule, the following theorem relates a
pattern to its basin.

Theorem 1 If any pair of arbitrary patterns $B_1$ and $B_2$ ever
reach the α-basin on consecutive applications of an XOR-CA rule,
then the pattern $B = B_1 \oplus B_2$ will reach the zero-basin on
consecutive applications of the XOR-CA rule.

Proof:
Let $B_0$ and $B'_0$ be two arbitrary patterns falling in the
α-basin. The pattern $B_0$ reaches the attractor of the α-basin after the
k-th consecutive application of the XOR-CA rule. Also, let $B_i$ denote
the pattern which is derived after the i-th consecutive application of
the XOR-CA rule on $B_0$. The above assumption implies the following
equations.

$$\alpha = B_k = C \times B_{k-1} = C^2 \times B_{k-2} = \dots = C^k \times B_0$$

Similarly, if the pattern $B'_0$ reaches the attractor of the α-basin after
k' consecutive applications of the XOR-CA rule and k > k', then we can
state the following equation, since the attractor pattern will not change
even after further application of the XOR-CA rule.

$$\alpha = C^{k'} \times B'_0 = C^{k} \times B'_0$$

Now,

$$C^k \times B_0 \oplus C^k \times B'_0 = \alpha \oplus \alpha = 0$$

leads to

$$C^k \times (B_0 \oplus B'_0) = 0$$

The above equation implies that the pattern (B) derived from
$B = B_0 \oplus B'_0$ also reaches the zero-basin.
Hence the proof.

Example 1 In Fig. 1, two patterns $B_1$ = 01000 and $B_2$ = 10000
are in the zero basin, with $B = B_1 \oplus B_2$ = 11000 also falling in
the same zero-basin.
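
The proof of Theorem 1 rests on the linearity of XOR rules: applying the rule k times to $B_1 \oplus B_2$ gives the same result as XOR-ing the two separately evolved patterns. The sketch below checks this property numerically for the rule set of Table II, read as XOR of the listed bits; it illustrates the superposition step only and does not reproduce Fig. 1. Patterns are stored as integers whose bit j is b_j.

```python
import random

# Table II as dependency lists (assumed reading: XOR of the listed bits).
RULES = {
    0: (1, 0), 1: (2, 1), 2: (3, 2, 1), 3: (4, 3, 2), 4: (4, 3),
    5: (5,), 6: (7, 5), 7: (7,), 8: (9, 8, 7), 9: (9, 8),
}

def step(x: int) -> int:
    """One application of the XOR rules to a 10-bit integer (bit j = b_j)."""
    y = 0
    for i, deps in RULES.items():
        v = 0
        for j in deps:
            v ^= (x >> j) & 1
        y |= v << i
    return y

def iterate(x: int, k: int) -> int:
    """Apply the rule k times."""
    for _ in range(k):
        x = step(x)
    return x

rng = random.Random(0)
for _ in range(1000):
    a, b = rng.getrandbits(10), rng.getrandbits(10)
    k = rng.randint(1, 20)
    # Superposition: C^k (a XOR b) == C^k a XOR C^k b
    assert iterate(a ^ b, k) == iterate(a, k) ^ iterate(b, k)
print("superposition holds on all sampled pairs")
```

In particular, if two patterns evolve to the same state after k steps, their XOR evolves to the all-zero state after k steps, which is the argument used in the proof.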

Theorem 1 confirms that the Hamming distance between a pair
of patterns falling in the same basin gets reflected in the zero-basin
patterns. It follows that XOR-CA rules whose zero-basin patterns
are close to each other with respect to their Hamming distance can
act as effective pattern classifiers. The state-transition characteristics
of such CA are desirable for DNA-pattern classification, wherein
similar patterns will tend to fall in the zero-basin.

Design of Multi-stage Hierarchical Classifier: A two class
XOR-CA-classifier is favourable for implementation due to its
simplicity but has several limitations due to its linear
characteristics. The limitations of single-stage
XOR-CA-Classifier can be avoided to a certain extent by
designing a multi-stage hierarchical classifier. In multi-stage
classification scheme, the single stage classifier is repeatedly
employed at every stage leading to a hierarchical tree-like
structure with each node corresponding to a single stage CA
classifier (Fig. 4).
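
The hierarchical arrangement can be pictured as a binary decision tree in which every internal node holds one single-stage two-class classifier. The sketch below is an assumption-laden illustration: the `Node` type, the `in_zero_basin` callback and the stand-in decisions on single bits are all hypothetical, and a real node would run an evolved XOR-CA instead.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Node:
    """One stage of the hierarchy: a two-class decision plus two children."""
    in_zero_basin: Callable[[str], bool]  # hypothetical single-stage classifier
    zero: Union["Node", str]              # subtree or class label for the zero basin
    nonzero: Union["Node", str]           # subtree or class label otherwise

def classify(node: Union[Node, str], pattern: str) -> str:
    """Walk down the tree until a class label (a plain string) is reached."""
    while isinstance(node, Node):
        node = node.zero if node.in_zero_basin(pattern) else node.nonzero
    return node

# Stand-in decisions (first and last bit) in place of real CA classifiers.
tree = Node(
    lambda p: p[0] == "0",
    Node(lambda p: p[-1] == "0", "S1", "S2"),
    Node(lambda p: p[-1] == "0", "S3", "S4"),
)
print(classify(tree, "01"))  # S2
```

Each stage only has to separate two groups, so the linear single-stage classifier is reused unchanged while the tree as a whole distinguishes more than two classes.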
IV. EVOLUTION OF THE CA BY SIMULATED ANNEALING
Simulated annealing is a generalization of a Monte Carlo
method for examining the equations of state and frozen states of
n-body systems [26]. We employ and appropriately tune the
Simulated Annealing to arrive at the desired CA with patterns
in zero-basin close to each other.
In Simulated Annealing an initial temperature (Ti) is set. The
temperature decreases exponentially during the process [27]. At
each temperature point (Tp) some action is taken based on the
value of Cost Function. The entire process continues till
temperature becomes zero. To evaluate the CA rules as a DNA
pattern classifier, we design a heuristic cost function as
described below. Let us assume that we are given two
distinct classes of DNA sequences, represented by class A =
{AATTC, ATTTC, ATTGA, CATTC, AATTA} and class B =
{GCGCT, GTGCT, GTGCC, TTGCT, GCGCC}. We initially
create a randomly generated CA rule, represented by the
coefficient matrix C. For training the classifier we arbitrarily
select $N_A$ DNA sequences from class A and $N_B$
DNA sequences from class B. Let us assume that
out of the ($N_A + N_B$) sequences, $N_{AB}$ patterns fall in the
zero-basin by applying the CA rules and the rest of the
sequences fall in the non-zero basins. Next we emit the
consensus sequence CSeq for these $N_{AB}$ DNA
sequences using HMMER [28], which is an online DNA
sequence analysis tool based on Hidden Markov Models. Let L
be the average Levenshtein distance of these $N_{AB}$
DNA sequences as determined with respect to the consensus
sequence CSeq. The Levenshtein distance Lev(x, y) between
two DNA sequences x and y of lengths m and n respectively is
given by

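$$\mathrm{Lev}_{x,y}(m, n) = \begin{cases} \max(m, n), & \text{if } \min(m, n) = 0 \\ \min \begin{cases} \mathrm{Lev}_{x,y}(m-1, n) + 1 \\ \mathrm{Lev}_{x,y}(m, n-1) + 1 \\ \mathrm{Lev}_{x,y}(m-1, n-1) + [x_m \neq y_n] \end{cases} & \text{otherwise} \end{cases}$$

where $[x_m \neq y_n]$ equals 1 when the m-th symbol of x differs from the
n-th symbol of y, and 0 otherwise.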
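
The recurrence above is usually evaluated bottom-up with dynamic programming rather than by direct recursion. The following is a minimal sketch that keeps only two rows of the table; the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(x: str, y: str) -> int:
    """Edit distance between x and y (insertions, deletions, substitutions)."""
    m, n = len(x), len(y)
    prev = list(range(n + 1))          # distances from the empty prefix of x
    for i in range(1, m + 1):
        curr = [i] + [0] * n           # distance from x[:i] to the empty prefix of y
        for j in range(1, n + 1):
            curr[j] = min(
                prev[j] + 1,                           # delete x[i-1]
                curr[j - 1] + 1,                       # insert y[j-1]
                prev[j - 1] + (x[i - 1] != y[j - 1]),  # substitute or match
            )
        prev = curr
    return prev[n]

print(levenshtein("CAGAT", "CAGTT"))  # 1
```

The running time is proportional to m × n and the memory to n, which is adequate for the short sequences used in this paper.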



The Levenshtein distance is an integer, which measures
how far apart two DNA sequences are. To compute the
Fuzzy Levenshtein distance [29], the percentage similarity
between two DNA sequences is computed. To transform the
Levenshtein distance into a fraction, the number of edits
required is divided by the length of the longer string and the
quotient is subtracted from 1.0. Multiplying the resulting value
by 100 gives the Fuzzy Levenshtein distance as a percentage;
Table III reports it as a fraction. The Fuzzy
Levenshtein distances of a set of three DNA
sequences in the same basin from their consensus sequence are
illustrated in the table below:

TABLE III
COMPUTATION OF FUZZY LEVENSHTEIN DISTANCE
Sequence   Consensus Sequence   Fuzzy Levenshtein distance
CAGAT      CAGTT                0.8
AGGTT      CAGTT                0.2
CAATT      CAGTT                0.6


The fitness cost of a CA as a solution is then calculated as the
average of the Fuzzy Levenshtein distances of each sequence in
the alignment to the consensus sequence. For example, the
fitness cost of the solution in the previous example is
(0.8 + 0.2 + 0.6) / 3 ≈ 0.53. The lower the value of L, the
better the fitness cost of the CA rule as a classifier.
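
One plausible reading of the normalization described above is 1 − (edits / length of the longer sequence), averaged over the sequences in a basin. The sketch below implements that reading; `fuzzy_score` and `fitness_cost` are hypothetical names, and the text does not give enough detail to confirm that this is exactly how every row of Table III was obtained.

```python
from typing import Sequence

def levenshtein(x: str, y: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        curr = [i] + [0] * len(y)
        for j in range(1, len(y) + 1):
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1,
                          prev[j - 1] + (x[i - 1] != y[j - 1]))
        prev = curr
    return prev[len(y)]

def fuzzy_score(x: str, y: str) -> float:
    """Assumed normalization: 1 - edits / length of the longer sequence."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(x, y) / longest

def fitness_cost(sequences: Sequence[str], consensus: str) -> float:
    """Average fuzzy score of the sequences against their consensus."""
    return sum(fuzzy_score(s, consensus) for s in sequences) / len(sequences)

# First row of Table III: one edit over length 5.
print(round(fuzzy_score("CAGAT", "CAGTT"), 2))  # 0.8
```

In the annealing loop this value would be computed for the sequences that the candidate CA places in the zero-basin, with the consensus sequence supplied by an external tool.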
There are two types of solutions based on cost value - Best
Solution (BS) and Current Solution (CS). A New Solution (NS)
at the next Tp compares its cost value with CS. If NS has better
cost value than CS, then NS becomes CS. The new solution
(NS) is also compared with BS and if NS is better, then NS
becomes BS. Even if NS is not as good as CS, NS is accepted
with a probability. This step is done typically to avoid any local
minima. The complete algorithm is presented below:

Algorithm SA_EvolveCA
// Input: Pattern Size (n), Pattern Set (S), Initial Temp. (Ti)
// Output: CA Rule.
1  Tp = Ti
2  CS = BS = NULL
3  while (Tp > 0) {
4    if (Tp > 0.5 * Ti) {
5      Randomly generate a CA as guess solution
6    }
7    else {
8      Generate a new solution from CS
9    }
10   Generate state transition table and rule table
11   NS = CA-Rule
12   Δcost = cost-value(NS) − cost-value(CS)
13   if (Δcost < 0) {
14     CS = NS
15     if (cost-value(NS) < cost-value(BS)) {
16       BS = NS
17     }
18   }
19   else
20     accept CS = NS with probability exp(−Δcost / Tp)
21   Reduce Tp exponentially
22 }

The above mentioned simulated annealing algorithm continues
to explore the CA search space with a heuristic approach for
obtaining the desired CA-rule as long as the temperature remains
positive. The temperature (Tp) in simulated annealing is
initialized with a large value (line 1) and at every attempt it is
reduced (line 21) gradually to get to the termination phase. A
CA-Rule as a candidate solution is randomly generated (line 5)
through random synthesis of the CA-Coefficient matrix. In order to
obtain a neighbor candidate solution to CS (i.e. the current solution
derived so far), a few bits of the CA-Coefficient matrix
corresponding to CS are altered (line 8). The probability of
accepting a new candidate solution as the current solution depends
on the fitness-cost value. It is evident from the algorithm that
every new candidate solution has the possibility to become the
current solution irrespective of its fitness-cost. However, with
the temperature approaching zero, i.e. as the algorithm
approaches the termination phase, the probability of accepting a
less-fit CA-Rule also diminishes (line 21). During the
exploration, the algorithm records the best explored CA-Rule
as BS (line 16).

Fig. 4 Multi-stage hierarchical classifier
Root node: S1, S2, S3, S4
Second level: S1, S2 | S3, S4
Leaves: S1 | S2 | S3 | S4
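
The annealing loop can be sketched as runnable Python under several stated assumptions: the cooling factor `alpha`, the stopping threshold `t_min` (an exponentially decaying temperature never reaches exactly zero), the number of flipped bits, the Metropolis acceptance probability exp(−Δcost / Tp) and the toy cost function are all illustrative choices, not values from the paper. A real cost function would evaluate the Fuzzy Levenshtein fitness of the zero-basin sequences.

```python
import math
import random
from typing import Callable, List, Tuple

Matrix = List[List[int]]

def random_matrix(n: int, rng: random.Random) -> Matrix:
    """Random n x n binary coefficient matrix (a random XOR-CA candidate)."""
    return [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]

def neighbour(C: Matrix, rng: random.Random, flips: int = 2) -> Matrix:
    """Copy of C with a few randomly chosen bits inverted."""
    D = [row[:] for row in C]
    n = len(D)
    for _ in range(flips):
        D[rng.randrange(n)][rng.randrange(n)] ^= 1
    return D

def sa_evolve_ca(n: int, cost: Callable[[Matrix], float], t_init: float = 100.0,
                 t_min: float = 1e-3, alpha: float = 0.95,
                 seed: int = 0) -> Tuple[Matrix, float]:
    rng = random.Random(seed)
    tp = t_init
    cs = bs = None
    cs_cost = bs_cost = math.inf
    while tp > t_min:
        # Hot phase: random guesses; cool phase: perturb the current solution.
        if tp > 0.5 * t_init or cs is None:
            ns = random_matrix(n, rng)
        else:
            ns = neighbour(cs, rng)
        ns_cost = cost(ns)
        delta = ns_cost - cs_cost
        if delta < 0:
            cs, cs_cost = ns, ns_cost
            if ns_cost < bs_cost:
                bs, bs_cost = ns, ns_cost
        elif rng.random() < math.exp(-delta / tp):
            cs, cs_cost = ns, ns_cost  # occasionally accept a worse solution
        tp *= alpha                    # exponential cooling
    return bs, bs_cost

# Toy cost (hypothetical stand-in): number of ones in the matrix.
best, best_cost = sa_evolve_ca(4, lambda C: sum(map(sum, C)))
print(best_cost)
```

As in the pseudocode, the best solution is only updated when a candidate improves on the current solution, so BS always records the lowest cost seen along the accepted path.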
V. RESULT
This section reports experimental observations during
evaluation of our proposed CA based DNA classification
scheme. To analyse the performance of the proposed CA based
DNA-pattern classifier, the experiment has been performed on
synthetic datasets. All the experiments have been conducted
under the following setup.
Hardware
o Processor: Intel Core i7-3610QM CPU @ 2.30 GHz × 8
o RAM: 8 GB
o Disk: 1000 GB
Software
o Operating system: openSUSE, kernel version 3.1.0-1.2-desktop
o OS type: 32-bit
o Compiler used: javac version 4.6.2 (SUSE Linux)
During the experimentation, emphasis has been put on the
behavior of our model in response to the varying DNA
sequence length as well as number of DNA-trainee patterns (i.e.
trainee-size). The given set of DNA sequences has been
randomly divided into a trainee set and testing-set. The most
desirable CA is evolved through simulated annealing heuristic
algorithm and the CA is assumed to be the best explored
solution which can classify the trainee DNA patterns efficiently
with respect to their Fuzzy Levenshtein distances as discussed
in previous section. The testing-set is used to measure the
class-prediction accuracy of the proposed model built with
randomly chosen trainee patterns in comparison with the actual
class membership. The overall performance of the proposed
scheme is represented in the form of following graphs plotted
with variations of DNA sequence size and number of trainee
sequences. Each of the figures presents classification accuracy
of the proposed model for DNA patterns of various sequence
lengths against varying trainee pattern size. Fig. 5 displays
classification accuracy of the proposed model with DNA
sequence length 20 whereas Fig. 6 reports classification
accuracy with DNA sequence length 40 against various trainee
pattern sizes. The observation reveals several interesting facts
on the behavior of the model. In both the cases, the accuracy
level has been observed as ranging from 60 percent to 95
percent showing linear improvement with the increase in
number of trainee DNA patterns.

Fig. 5 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 20
Number of Trainee Patterns     20     40     60     80     100
Classification Accuracy (%)    67.19  74.45  80.13  89.03  93.77

Fig. 6 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 40
Number of Trainee Patterns     20     40     60     80     100
Classification Accuracy (%)    62.44  66.33  69.06  78.11  83.04

Fig. 7 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 60

Fig. 8 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 80
The behavior of the proposed scheme does not vary drastically
with respect to the other sequence lengths, as is evident in Fig. 7
(DNA Sequence Length 60), Fig. 8 (DNA Sequence Length
80), and Fig. 9 (DNA Sequence Length 100). However, it is
evident from each graph that as the number of trainee patterns
increases the accuracy level also increases almost linearly. One
of the interesting observations is that as long as the trainee
pattern size remains below 60 percent, the performance of the
model does not vary too much with the variation in DNA
sequence length. However, while dealing with a number of
trainee patterns exceeding 60 percent of the given set, the
classification accuracy of the model falls with the increase in
DNA sequence length. The outcome also indicates that as the
sequence length increases the average performance of the
scheme slides down a bit, but the accuracy rate rises sharply
with the increase in the number of trainee patterns. It is evident
from our observation that the proposed classification scheme
has the potential to classify DNA sequences with a reasonable
accuracy rate.

Fig. 9 Classification Accuracy vs. Number of Trainee Patterns for a
sequence length of 100

Plotted values for Figs. 7 to 9 (panels titled Sequence Length = 60,
80 and 100); rows are listed in the order they appear in the source:
Number of Trainee Patterns     20     40     60     80     100
Classification Accuracy (%)    61.78  65.31  70.07  79.99  87.09
Classification Accuracy (%)    60.03  68.65  71.01  73.76  84.88
Classification Accuracy (%)    54.97  57.06  61.21  69.33  73.07

VI. CONCLUSION
The identification and classification of genes in new DNA
sequence information is not a trivial problem. The researcher
is quite often faced with hundreds of gigabytes of data to be
analyzed. This difficulty is compounded by the many
competing choices for the parameters, in choosing the
algorithm, in choosing the similarity metric, in selecting the
classification model and finally in selecting a terminating
criterion. This paper has presented the idea of a cellular
automata based DNA pattern classifier, which is low-cost and
high-speed, and which reached classification accuracies between
roughly 55 and 94 percent on synthetic datasets. The proposed
technique of DNA pattern classification would open up a wide
scope of investigative studies with a goal to explore further
improvements in this area.
REFERENCES
[1] Stefania Bandini. Guest Editorial - Cellular Automata. Future Generation
Computer Systems, 18:v-vi, August 2002.
[2] A. Albicki, S. K. Yap, M. Khare, and S. Pamper. Prospects on Cellular
Automata Application to Test Generation. Technical Report EL-88-05,
Dept. of Electrical Engg., Univ. of Rochester, 1988.
[3] H. Baltzer, W. P. Braun, and W. Kohler. Cellular Automata Model for
Vegetable Dynamics. Ecological Modelling, 107:113-125, 1998.
[4] S. Wolfram. Theory and Application of Cellular Automata. World
Scientific, 1986.
[5] P. H. Bardell. Analysis of Cellular Automata used as Pseudo-Random
Pattern Generators. In International Test Conference, pages 762-768,
1990.
[6] C. Burks and D. Farmer. Towards Modeling DNA Sequences as
Automata. Physica D, 10:157-167, 1984.
[7] J. H. Moore and L. W. Hahn. A Cellular Automata-based Pattern
Recognition Approach for Identifying Gene-Gene and
Gene-Environment Interactions. American Journal of Human Genetics,
67(52), 2000.
[8] J. H. Moore and L. W. Hahn. Multilocus Pattern Recognition using
Cellular Automata and Parallel Genetic Algorithms. In Proc. of the
Genetic and Evolutionary Computation Conference (GECCO-2001),
page 1452, 7-11 July 2001.
[9] A. Albicki and M. Khare. Cellular Automata used for Test Pattern
Generation. In Proc. ICCD, pages 56-59, 1987.
[10] A. Albicki and S. K. Yap. Covering a Set of Test Patterns by a Cellular
Automata. Research Review, Dept. of Comp. Sc. and Engg., Univ. of
Rochester, 1987.
[11] E. R. Banks. Information Processing and Transmission in Cellular
Automata. PhD thesis, M. I. T., 1971.
[12] S. C. Benjamin and N. F. Johnson. A Possible Nanometer-scale
Computing Device based on an Adding Cellular Automaton. Applied
Physics Letters, 1997.
[13] A. M. Barbe. A Cellular Automata Ruled by an Eccentric Conservation
Law. Physica D, 45:49-62, 1990.
[14] Jianbo Gao, Yan Qi, Yinhe Cao, and Wen-wen Tung. Protein Coding
Sequence Identification by Simultaneously Characterizing the Periodic
and Random Features of DNA Sequences. Journal of Biomedicine and
Biotechnology, Vol. 2, pp. 139-146, 2005.
[15] D. Peterson and C. H. Lee. A DNA-based pattern recognition technique for
cancer detection. Engineering in Medicine and Biology Society, 2004.
IEMBS '04. 26th Annual International Conference of the IEEE, vol. 2,
pp. 2956-2959, 1-5 Sept. 2004. doi: 10.1109/IEMBS.2004.1403839
[16] Ido Priness, Oded Maimon, and Irad Ben-Gal. Evaluation of
gene-expression clustering via mutual information distance measure.
BMC Bioinformatics, 8:111, 2007. doi: 10.1186/1471-2105-8-111
[17] David Kulp, David Haussler, Martin G. Reese, and Frank H. Eeckman. A
Generalized Hidden Markov Model for the Recognition of Human Genes
in DNA. ISMB-96 Proceedings, 1996.
[18] Sathish Kumar S and N. Duraipandian. An Effective Identification of
Species from DNA Sequence: A Classification Technique by Integrating
DM and ANN. International Journal of Advanced Computer Science and
Applications, Vol. 3, No. 8, pp. 104-114, 2012.
[19] A. W. Burks. Essays on Cellular Automata. Technical Report, Univ. of
Illinois, Urbana, 1970.
[20] S. Bhattacharjee, J. Bhattacharya, and P. Pal Chaudhuri. An Efficient
Data Compression based on Cellular Automata. In Data Compression
Conference (DCC95), 1995.
[21] Stephen A. Billings and Yingxu Yang. Identification of Probabilistic
Cellular Automata. IEEE Transaction on System, Man and Cybernetics,
Part B, pages 1-12, 2002.
[22] M. S. Capcarrere. Cellular Automata and Other Cellular System: Design
and Evolution. PhD thesis, Swiss Federal Institute of Technology,
Lausanne, 2002.
[23] S. Chakraborty, D. Roy Chowdhury, and P. Pal Chaudhuri. Theory and
Application of Non-Group Cellular Automata for Synthesis of Easily
Testable Finite State Machines. IEEE Trans. on Computers,
45(7):769-781, July 1996.
[24] S. Chattopadhyay, S. Adhikari, S. Sengupta, and M. Pal. Highly Regular,
Modular, and Cascadable Design of Cellular Automata-based Pattern
Classifier. IEEE Transaction on VLSI Systems, 8(6):724-735, December
2000.
[25] N. Ganguly, P. Maji, S. Dhar, B. K. Sikdar, and P. Pal Chaudhuri.
Evolving Cellular Automata as Pattern Classifier. In Proc. of Fifth
International Conference on Cellular Automata for Research and
Industry, ACRI 2002, Switzerland, pages 56-68, October 2002.
[26] E. H. L. Aarts and J. Korst. Simulated Annealing and Boltzmann
Machines. John Wiley & Sons, Essex, U.K., 1989.
[27] Juan De Vicente, Juan Lanchares, and Román Hermida. Placement
by thermodynamic simulated annealing. Physics Letters A, 317(5-6):
415-423, 2003.
[28] HMMER 3.1 (February 2013); http://hmmer.org/
[29] Sten Hjelmqvist. Fast, memory efficient Levenshtein algorithm,
March 2012.
(http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm)

Prof. Tamal Chakrabarti is currently Assistant Professor, Department of
Computer Science and Engineering, Institute of Engineering and Management.
He started his career with Wipro Technologies, India as a Software Engineer.
Then he joined Flextronics Software Systems, India as a Technical Leader.
After that he was associated with IBM India Pvt. Limited, where he was leading
a software development team. Subsequently, he worked with Infosys
Technologies, India, as a Project Manager. Since 2009, he has been teaching in
Institute of Engineering and Management. He did his graduation (B.Sc., Hons.)
in Physics from Calcutta University in 1997, and B.Tech. in Computer Science
and Engineering from Calcutta University in 2000. In 2006 He received his MS
degree from BITS Pilani, India. He has been presented with numerous awards
from professional bodies and academia; including Feather in My Cap Award
(twice) by Wipro Technologies, Spot Award by Lucent Technologies,
Bravo Award by IBM India Pvt. Ltd. and Award of Excellence for
contribution in the International Conference on innovative
techno-management solution for social sector, in 2012. He has participated in
various projects in India, Belgium and Ireland. IBM India Pvt. Ltd. had honored
him with Mentor Award for guiding a project in The Great Mind Challenge,
2011. Prof. Chakrabarti is a member of the Computer Society of India (CSI).
He has authored numerous papers in journals and conferences. His research
interests include, Bio-informatics, Programming Languages and Design and
Analysis of Algorithms.

Prof. Sourav Saha is currently Assistant Professor, Department of Computer
Science and Engineering, Institute of Engineering and Management. He started
his career working in R&D sector at various companies. Since 2011, he has
been teaching in Institute of Engineering and Management. He did his
graduation (B.Tech) in Computer Science & Engineering from Kalyani
University in 2000, and obtained his Master of Engineering (M.E.) degree in
Computer Science and Engineering from Bengal Engineering and Science
University in 2002. He was awarded university medal for securing highest
mark in M.E. and also received award from Indian National Academy of
Engineering for best innovative bachelor level project in 2000. He has
numerous international and national publications in reputed journals and
conferences to his credit throughout his entire career. His research interests
include Cellular Automata, Pattern Recognition, Bio-Medical Engineering,
Bio-Informatics etc.

Prof. (Dr.) Devadatta Sinha is currently Professor, Department of Computer
Science and Engineering of University of Calcutta, India. He joined this
department as a Reader in 1989. Prior to this, he worked as Assistant Professor,
Department of Computer Engineering, B.I.T. Mesra Ranchi and as Lecturer and
Senior Lecturer (Computer Science) at the Department of Mathematics,
Jadavpur University. He obtained his Ph.D. from Jadavpur University in 1985
and his area of research was Program Testing. He has published more than 50
papers and articles in different national and international journals, proceedings,
periodicals and monographs. His area of interests includes Software
Engineering, Parallel and Distributed Computing, Bioinformatics,
Cryptography. He has guided a number of doctoral and masters thesis in
Computer Science. He worked as Head of the Department of Computer
Science and Engineering, University of Calcutta for two terms of two years
each. He worked as Chairman, undergraduate studies in Computer Science,
University of Calcutta and currently the Convener, Ph.D. Committee in
Computer Science and Engineering, University of Calcutta. He is associated
with a number of academic institutions as member in their academic bodies. He
is involved in a number of national and international conferences in the
capacity of Chairman of PC/OC. He served as Chairman, Computer Society of
India, Kolkata Chapter and is a Patron of the chapter. He was Sectional
President, Section of Computer Science, and Indian Science Congress
Association in 1993-94.
