Hong Zhou1,2, Zheng Zhao1, Hongpo Wang2
1. School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
2. Department of Computer Science and Information Engineering, Tianjin Agricultural University, Tianjin, 300384, China
zhouhong@tjau.edu.cn
Abstract
This paper presents a new approach to discovering conserved regions, such as motifs, in DNA or protein sequences. We follow a graph-based approach to this problem, in particular one built on de Bruijn graphs. The de Bruijn graph has been successfully adopted to solve problems such as local alignment and DNA fragment assembly. Our method harnesses the power of the de Bruijn graph to discover the conserved regions in a DNA or protein sequence. We have found that the algorithm mines signals successfully for a larger number of sequences, and at a faster rate, than some popular motif searching tools.
potential of the graph-based algorithms. This paper focuses on developing a time-efficient algorithm for motif discovery using a de Bruijn graph. We have found that the algorithm mines signals successfully for a larger number of sequences, and at a faster rate, than some popular motif searching tools such as MEME [3]. The next few sections describe the importance of the motif discovery problem and the advances in algorithmic approaches to motif discovery.
2. Method
Our method starts by constructing the de Bruijn graph. Because the constructed graph may contain cycles, the second step removes them and transforms the graph into a directed acyclic graph. The third step computes a path on the directed graph and then extracts the conserved regions. This is the main idea of our method, which we describe in detail below. The algorithm can be formally described as follows.

Input: S = s_1, s_2, ..., s_n, each s_i has length l_i
Output: S^r = s^r_1, s^r_2, ..., s^r_n, each s^r_i has length m
1. use S to construct de Bruijn graph G = (V,E)
2. eliminate cycles of G
3. get a high weighted path from G, then construct consensus sequence s_c from the path
4. FOR i ← 1 to n
5.   DO s^r_i ← conserveextract(s_c, s_i)
6. construct output result S^r ← s^r_1, s^r_2, ..., s^r_n
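Step 3 can be illustrated with a short sketch. The paper does not specify how the high weighted path is selected, so the code below assumes a standard longest-path dynamic program over a topological order of the acyclic graph. It also assumes that vertices are k-mer strings and that edge weights are positive. The names heaviest_path and consensus_from_path are hypothetical and are not taken from our C implementation.

```python
from collections import defaultdict, deque


def heaviest_path(edges):
    """Return (path, weight) of a maximum-weight path in a DAG.

    edges: dict mapping (u, v) -> positive weight. This representation is
    an assumption made for illustration only.
    """
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for (u, v), w in edges.items():
        succ[u].append((v, w))
        indeg[v] += 1
        nodes.update((u, v))
    if not nodes:
        return [], 0

    # Kahn's algorithm: process vertices in topological order.
    queue = deque(sorted(n for n in nodes if indeg[n] == 0))
    best = {n: 0 for n in nodes}   # best path weight ending at each vertex
    prev = {}                      # predecessor on that best path
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v, w in succ[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != len(nodes):
        raise ValueError("graph contains a cycle; eliminate cycles first")

    # Walk back from the best end vertex to recover the path.
    end = max(sorted(nodes), key=lambda n: best[n])
    path = [end]
    while path[-1] in prev:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[end]


def consensus_from_path(path):
    """Spell a sequence from overlapping k-mer vertices (assumed labeling)."""
    if not path:
        return ""
    return path[0] + "".join(v[-1] for v in path[1:])
```

Under these assumptions the dynamic program visits each edge once, so the path step is linear in the size of the graph.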
1. Introduction
A motif is a repeating pattern in a biological sequence that is conserved during the process of evolution. Motif discovery is a very important problem in biology. It finds applications in DNA or protein sequence analysis and in understanding disease susceptibility and disease cure. Numerous motif discovery algorithms have been proposed to date. Graph theory plays an important role in computational biology [1]. Graph-based algorithms provide a simpler and quicker solution to computationally intensive problems such as DNA fragment assembly [2] and motif discovery. However, the amount of literature available on motif discovery using graph algorithms is not proportional to the
978-0-7695-3735-1/09 $25.00 2009 IEEE DOI 10.1109/FSKD.2009.542
13.     THEN add sequence info {i,j} to e
14.     ELSE IF vL < 0
15.       THEN vL = new vertex vNL
16.         V ← V ∪ {vL}
17.         hashtable(TL).node = vL
18.       IF vR < 0
19.       THEN vR = new vertex vNR
20.         V ← V ∪ {vR}
21.         hashtable(TR).node = vR
22.       new arc eN = (vL, vR); add string T and sequence info {i,j} to eN
23.       E ← E ∪ {eN}
24. RETURN G=(V,E)

A hash table that contains all the nodes as its keys is maintained. For every node, the table stores its connected nodes and the edges connecting them, in the form of an adjacency list.
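The construction can be sketched as follows. This is a minimal illustration and not our C implementation. It assumes that each edge string T is a (k+1)-mer whose prefix TL and suffix TR of length k are the two endpoint vertices. The function name build_de_bruijn and the dictionary layout are hypothetical.

```python
from collections import defaultdict


def build_de_bruijn(sequences, k):
    """Build a de Bruijn graph whose edges remember where they occur.

    Returns (vertices, edges, adjacency):
      vertices  - hash table mapping a k-mer to its vertex id
      edges     - maps (vL, vR) to the edge string T and a list of
                  sequence info pairs (i, j): sequence index i, offset j
      adjacency - maps a vertex id to the list of its successor ids
    """
    vertices = {}
    edges = {}
    adjacency = defaultdict(list)
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - k):
            t = seq[j:j + k + 1]          # edge string T, assumed (k+1)-mer
            tl, tr = t[:-1], t[1:]        # left and right k-mers TL, TR
            # Look up each endpoint; create a new vertex if it is missing.
            vl = vertices.setdefault(tl, len(vertices))
            vr = vertices.setdefault(tr, len(vertices))
            edge = edges.get((vl, vr))
            if edge is None:
                # New arc: store its string and register it in the
                # adjacency list of the left vertex.
                edge = {"label": t, "occurrences": []}
                edges[(vl, vr)] = edge
                adjacency[vl].append(vr)
            # Existing or new arc: record the sequence info {i, j}.
            edge["occurrences"].append((i, j))
    return vertices, edges, adjacency
```

With this layout, one natural edge weight is len(edge["occurrences"]), the number of times the edge string appears across the input. That choice is an assumption of the sketch.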
particular consecutive edge set. In other words, an active region is a region that contains consecutive edges that have repeated themselves. Not every high-weight edge represents a motif, because there is a fair chance that it is simply a repeat. Hence the search is for more prominent edges: we identify the edges in an active region that have more weight than the other edges in that region.
In Figure 1(a), a cycle exists that includes vertices {v1,v2,v3} and edges {e1,e2,e3}. The vertex v2 has the least information, so the original cycle is eliminated by making a copy vN2 of vertex v2. The result is Figure 1(b). In Figure 1(b), a cycle still exists. Now vertex v3 has the least information, so this cycle is eliminated by making a copy vN3 of vertex v3. The result is Figure 1(c), and so on. The final result is Figure 1(d).
Figure 1. Eliminating cycles
Note that the loop in the original graph has been removed, and the graph is now a directed acyclic graph.
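A minimal sketch of this step is given below. The text above does not state how the copied vertex is rewired, so the sketch assumes that the cycle edge entering the least-information vertex is redirected to the fresh copy, which has no outgoing edges. The info scores, the copy naming scheme, and the function names find_cycle and eliminate_cycles are all hypothetical.

```python
def find_cycle(graph):
    """Return one directed cycle as a list of vertices, or None if acyclic.

    graph: dict mapping every vertex (sinks included) to a set of successors.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = {v: WHITE for v in graph}
    stack = []

    def dfs(u):
        color[u] = GREY
        stack.append(u)
        for w in sorted(graph[u]):
            if color[w] == GREY:
                # Back edge u -> w closes the cycle w ... u.
                return stack[stack.index(w):]
            if color[w] == WHITE:
                found = dfs(w)
                if found:
                    return found
        stack.pop()
        color[u] = BLACK
        return None

    for v in sorted(graph):
        if color[v] == WHITE:
            cycle = dfs(v)
            if cycle:
                return cycle
    return None


def eliminate_cycles(graph, info):
    """Break every cycle by copying its least-information vertex.

    info: dict mapping a vertex to an information score (assumed given).
    The input graph is not modified; a new acyclic graph is returned.
    """
    g = {v: set(succ) for v, succ in graph.items()}
    copies = 0
    while True:
        cycle = find_cycle(g)
        if cycle is None:
            return g
        # Pick the vertex on the cycle with the least information.
        k = min(range(len(cycle)), key=lambda idx: info.get(cycle[idx], 0))
        v = cycle[k]
        u = cycle[k - 1]            # predecessor of v on the cycle
        copies += 1
        v_copy = f"{v}_N{copies}"   # hypothetical naming for the copy
        # Redirect the cycle edge u -> v to the copy, which is a sink.
        g[u].discard(v)
        g[u].add(v_copy)
        g[v_copy] = set()
```

Because each redirected edge ends in a sink, it can never lie on a cycle again, so the loop terminates after at most one pass per edge. The recursive search is adequate for a sketch; very large graphs would need an iterative traversal.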
4 2.66GHz under the Linux operating system in the C language. Initially the algorithm was tested on DNA sequences that belong to the prokaryotic family. The advantage of these sequences is that they have a very low or negligible number of repeats. In addition, the common repeating patterns in the prokaryotic sequences (the TATA patterns) are already known. After successful initial testing, we tested the algorithm on protein sequences. We chose to start with the protein responsible for
redox (oxidation and reduction) reactions called Cytochrome [5, 6]. The initial testing began on just twenty sequences, with the longest sequence being 577 nucleotides. In this section, we compare our algorithm with popular motif searching tools, namely MEME and the Gibbs sampler [7]. Figure 2 shows the comparison among the three algorithms. Both MEME and the Gibbs sampler have a character limit. Therefore, we could not test them on more than 30 protein sequences, each averaging 200bp. Clearly our algorithm has a speed advantage over the other motif discovery tools (see Figure 2). This comparison gives only an approximate picture of the speed levels. Our algorithm ran successfully for 1500 sequences. However, it failed beyond 1500 sequences (see Figure 3).
[Figure title: Process Time Versus]
4. References
[1] J. C. Setubal, J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, Boston, USA, 1997.
[2] P. A. Pevzner, H. Tang, M. S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States of America, 98(17):9748-9753, August 2001.
[3] T. L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, In The Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36, Stanford, CA, USA, 1994. AAAI Press.
[4] D. Z. Du, F. K. Hwang, Generalized de Bruijn digraphs, Networks, Vol. 18, pp. 27-38, 1988.
[5] Cytochrome P450 cysteine heme-iron ligand signature. http://www.expasy.org/cgi-bin/nicedoc.pl?PDOC00081.
[6] /Cytochrome, June 2006. Cytochrome is a protein family.
[7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, Readings in Uncertain Reasoning, pages 452-472, 1990.
[Figure axis label: Time in Milliseconds]