
2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery

A New Approach for Motif Discovery Based on the de Bruijn Graph

Hong Zhou (1,2), Zheng Zhao (1), Hongpo Wang (2)
1. School of Computer Science and Technology, Tianjin University, Tianjin, 300072, China
2. Department of Computer Science and Information Engineering, Tianjin Agricultural University, Tianjin, 300384, China
zhouhong@tjau.edu.cn

Abstract
This paper presents a new approach for discovering conserved regions, such as motifs, in DNA or protein sequences. The approach is graph-based and builds on the de Bruijn graph, which has already been applied successfully to problems such as local alignment and DNA fragment assembly. Our method harnesses the de Bruijn graph to discover the conserved regions in DNA or protein sequences. In our experiments the algorithm mined signals from a larger number of sequences, and at a faster rate, than some popular motif searching tools.

1. Introduction
A motif is a repeating pattern in a biological sequence that is conserved during the process of evolution. Motif discovery is a very important problem in biology. It has applications in DNA and protein sequence analysis and in understanding disease susceptibility and disease cure. Numerous motif discovery algorithms have been proposed to date. Graph theory plays an important role in computational biology [1]. Graph-based algorithms provide a simpler and quicker solution to computationally intensive problems such as DNA fragment assembly [2] and motif discovery. However, the amount of literature available on motif discovery using graph algorithms is not proportional to the potential of the graph-based algorithms. This paper focuses on developing a time-efficient algorithm for motif discovery using a de Bruijn graph. In our experiments the algorithm mined signals from a larger number of sequences, and at a faster rate, than popular motif searching tools such as MEME [3]. The next few sections will describe the importance of the motif discovery problem and the advancements in the algorithmic approaches to motif discovery.

2. Method
Our method has three steps. The first step constructs the de Bruijn graph. Because the constructed graph may contain cycles, the second step removes the cycles and transforms the graph into a directed acyclic graph. The third step computes a path on the directed acyclic graph and then extracts the conserved regions. This is the main idea of our method, which is described in detail in the following sections. The algorithm can be formally described as follows.

Input: S = {s1, s2, ..., sn}, each si has length li
Output: Sr = {sr1, sr2, ..., srn}, each sri has length m
1. use S to construct de Bruijn graph G = (V, E)
2. eliminate cycles of G
3. get a high weighted path from G, then construct consensus sequence sc from the path
4. FOR i ← 1 TO n
5.     DO sri ← conserveextract(sc, si)
6. construct output result Sr = {sr1, sr2, ..., srn}
978-0-7695-3735-1/09 $25.00 2009 IEEE DOI 10.1109/FSKD.2009.542
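Steps 1 and 3 of the outline in Section 2 can be illustrated with a minimal Python sketch. It is not the authors' C implementation. The function names (`construct_graph`, `heaviest_path`, `consensus`) and the toy input are assumptions made for illustration, indices are 0-based rather than the 1-based {i, j} of the paper, and the arc weight is taken to be the multiplicity alone. The heaviest path is found with a dynamic programme over a topological order instead of the depth-first search mentioned in Section 2.3, and the sketch assumes the graph is already acyclic: `TopologicalSorter` raises `CycleError` otherwise, which is where step 2 would have to run first.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def construct_graph(seqs, k):
    """Step 1: map each arc (left (k-1)-tuple, right (k-1)-tuple) to the
    list of (sequence index, offset) pairs at which its k-tuple occurs."""
    arcs = defaultdict(list)
    for i, s in enumerate(seqs):
        for j in range(len(s) - k + 1):
            t = s[j:j + k]
            arcs[(t[:-1], t[1:])].append((i, j))
    return dict(arcs)


def heaviest_path(arcs):
    """Step 3: heaviest path of an acyclic graph.
    ASSUMPTION: arc weight = multiplicity (number of occurrences)."""
    preds = defaultdict(set)
    for u, v in arcs:
        preds[v].add(u)
        preds.setdefault(u, set())  # register vertices without predecessors
    best, back = {}, {}
    # static_order() yields predecessors first; CycleError if a cycle exists
    for v in TopologicalSorter(preds).static_order():
        best[v], back[v] = 0, None
        for u in preds[v]:
            w = best[u] + len(arcs[(u, v)])
            if w > best[v]:
                best[v], back[v] = w, u
    end = max(best, key=best.get)
    path = [end]
    while back[path[-1]] is not None:
        path.append(back[path[-1]])
    return path[::-1], best[end]


def consensus(path):
    """Spell the path: first vertex, then the last symbol of each later one."""
    return path[0] + "".join(v[-1] for v in path[1:])


# Hypothetical toy input: two sequences sharing the region CGTT
seqs = ["ACGTT", "CGTTA"]
arcs = construct_graph(seqs, k=3)
path, weight = heaviest_path(arcs)
sc = consensus(path)
```

On this toy input the shared arcs CG→GT and GT→TT each have multiplicity 2, so the heaviest path runs through the region common to both sequences.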

2.1. Construct the de Bruijn Graph


A de Bruijn graph is a graph whose vertices are subsequences of a given sequence and whose edges indicate the overlaps between those subsequences [4]. Consider the sequence ACCGTCT. It can be resolved into the following fragments of length 4: ACCG, CCGT, CGTC, and GTCT. Each fragment is called an l-tuple. The (l-1)-tuples are obtained by further fragmenting each l-tuple. For example, ACC and CCG are the (l-1)-tuples of ACCG. The (l-1)-tuples form the nodes of the de Bruijn graph. An edge exists between two (l-1)-tuples that overlap in this way, and each edge represents an l-tuple. The multiplicity of each edge is represented by an edge weight. Initially every edge gets a weight of 1. When two consecutive vertices (vertices with the specified overlap) repeat, the multiplicity of the edge between them is increased, that is, the weight of the edge is incremented by 1. The conserved regions are most likely to reside on the most repeated edges, i.e. the edges with the greatest multiplicity and therefore the greatest weight. In the procedure below the tuple length l is written k.

Procedure ConstructG
Input: S = {s1, s2, ..., sn}, each si has length li
Output: G = (V, E)
FOR i ← 1 TO n
  DO FOR j ← 1 TO (li - k + 1)
    DO T = si(j, ..., j+k-1)
       TL = si(j, ..., j+k-2)
       TR = si(j+1, ..., j+k-1)
       vL ← -1
       vR ← -1
       IF (hashtable(TL))
         THEN vL = hashtable(TL).node
       IF (hashtable(TR))
         THEN vR = hashtable(TR).node
       IF vL ≥ 0 AND vR ≥ 0 AND ∃ e ∈ E in (vL, vR)
         THEN add sequence info {i, j} to e
         ELSE IF vL < 0
                THEN vL = new vertex vNL
                     V ← V ∪ {vL}
                     hashtable(TL).node = vL
              IF vR < 0
                THEN vR = new vertex vNR
                     V ← V ∪ {vR}
                     hashtable(TR).node = vR
              new arc eN in (vL, vR), add string T and sequence info {i, j} to eN
              E ← E ∪ {eN}
RETURN G = (V, E)

A hash table that contains all the nodes as its keys is maintained. For every node it stores the nodes connected to it and the edges connecting them, in the form of an adjacency list.

2.2. Eliminate cycle


The step above constructs a directed de Bruijn graph. A vertex of this graph may have more than one incoming or outgoing edge, and cycles can appear in the graph as the number of incoming or outgoing edges of a vertex increases. The second step therefore removes the cycles and transforms the graph into a directed acyclic graph while keeping as much of the similarity information among the sequences as possible. A cycle is eliminated by breaking it at the vertex on the cycle that has the least information. Claim: Define a left edge for vertex vi to be an edge that points to vi, denoted as eLi, and a right edge for vertex vi to be an edge that starts from vi and points to another vertex, denoted as eRi. The copy of vertex vi is denoted as vNi.
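The vocabulary of left edges, right edges and vertex copies can be made concrete with a small Python sketch. Several points are assumptions, because the text does not define them: "least information" is read here as the lightest left edge on the cycle, that edge is redirected to a fresh copy of its target vertex, and the copy receives no right edges. The function names `find_cycle` and `eliminate_cycles`, the `_N` suffix for copies, and the toy graph are hypothetical.

```python
def find_cycle(adj):
    """Return the vertices of one directed cycle, in order, or None.
    adj maps each vertex u to a dict {v: weight} of its right edges."""
    state = {}   # vertex -> 1 while on the DFS stack, 2 when finished
    stack = []

    def dfs(u):  # recursive, so only suitable for small graphs
        state[u] = 1
        stack.append(u)
        for v in adj.get(u, {}):
            if state.get(v) == 1:          # back edge closes a cycle
                return stack[stack.index(v):]
            if v not in state:
                found = dfs(v)
                if found:
                    return found
        state[u] = 2
        stack.pop()
        return None

    for u in list(adj):
        if u not in state:
            cycle = dfs(u)
            if cycle:
                return cycle
    return None


def eliminate_cycles(adj):
    """Break every cycle in place by redirecting one left edge to a copy."""
    n_copies = 0
    while (cycle := find_cycle(adj)) is not None:
        arcs = [(cycle[i], cycle[(i + 1) % len(cycle)])
                for i in range(len(cycle))]
        # ASSUMPTION: "least information" = lightest left edge on the cycle
        u, v = min(arcs, key=lambda a: adj[a[0]][a[1]])
        n_copies += 1
        v_copy = f"{v}_N{n_copies}"
        adj[u][v_copy] = adj[u].pop(v)   # the left edge now points to the copy
        adj[v_copy] = {}                 # the copy has no right edges
    return adj


# Hypothetical toy graph: one cycle x -> y -> z -> x with weights 3, 1, 2
graph = {"x": {"y": 3}, "y": {"z": 1}, "z": {"x": 2}}
dag = eliminate_cycles(graph)
```

Each iteration moves one edge onto a vertex with no outgoing edges, so the loop terminates, and no edge weight is discarded along the way.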


Figure 1 illustrates the cycle-elimination process. In Figure 1(a), cycles exist which include the vertices {v1, v2, v3} and the edges {e1, e2, e3}. The vertex v2 has the least information, so the original cycle is eliminated by making a copy vN2 of vertex v2. The result is Figure 1(b). In Figure 1(b) a cycle still exists, and now vertex v3 has the least information, so this cycle is eliminated by making a copy vN3 of vertex v3. The result is Figure 1(c). The process continues in the same way, and the final result is Figure 1(d).

Figure.1 Eliminating cycle

Note that every loop in the original graph has been removed, so the graph is now a directed acyclic graph.

2.3. Extract the conserved regions


After performing the above transformations, we apply a depth-first search (DFS) algorithm to find a heaviest path in linear time. The weight of each edge is proportional to its multiplicity and length. After obtaining the high-weight edges, the algorithm tries to locate the entire repeated region, called the active region. An active region is a high-weight region where many sequences coincide on a particular set of consecutive edges. In other words, an active region is a region that contains consecutive edges that have repeated themselves. Not all high-weight edges represent motifs, because there is a fair chance that they are merely repeats. Hence the search is for more prominent edges: we try to identify the edges that have more weight than the other edges in the same active region.

3. Results and Analysis


The algorithm was implemented in C on a 2.66 GHz Pentium 4 under the Linux operating system. Initially the algorithm was tested on DNA sequences that belong to the prokaryotic family. The advantage of these sequences is that they have a very low or negligible number of repeats. In addition, the common repeating patterns in the prokaryotic sequences (the TATA patterns) are already known. After successful initial testing we tested the algorithm on protein sequences. We chose to start with Cytochrome, the protein responsible for redox (oxidation and reduction) reactions [5, 6]. The initial testing began on just twenty sequences, with the longest sequence being 577 nucleotides. In this section, we compare our algorithm with two popular motif searching tools, MEME and the Gibbs sampler [7]. Figure 2 shows the comparison among the three algorithms. Both MEME and the Gibbs sampler have a character limit. Therefore, we could not test them on more than 30 protein sequences, each averaging 200bp. Clearly our algorithm has a speed advantage over the other motif discovery tools. This comparison gives only an approximate picture of the speed levels (see Figure 2). Our algorithm ran successfully for 1500 sequences. However, it broke beyond 1500 (see Figure 3).
Figure.2 Process Time Versus (x-axis: Number of Sequences in Sample Data, 5 to 30; y-axis: Time in Milliseconds, logarithmic scale from 1 to 100000; series: MEME, Gibbs, Our Algorithm)

Figure.3 Performance Curve (x-axis: Number of Sequences, 0 to 1500; y-axis: Time in Milliseconds, 0 to 80000; series: Our Algorithm)

4. References
[1] J. C. Setubal, J. Meidanis, Introduction to Computational Molecular Biology, PWS Publishing Company, Boston, USA, 1997.
[2] P. A. Pevzner, H. Tang, and M. S. Waterman, An Eulerian path approach to DNA fragment assembly, Proceedings of the National Academy of Sciences of the United States, 98(17):9748-9753, August 2001.
[3] T. L. Bailey, C. Elkan, Fitting a mixture model by expectation maximization to discover motifs in biopolymers, In The Second International Conference on Intelligent Systems for Molecular Biology, pages 28-36, Stanford, CA, USA, 1994. AAAI Press.
[4] D. Z. Du, F. K. Hwang, Generalized de Bruijn digraphs, Networks, Vol. 18, pp. 27-38, 1988.
[5] Cytochrome P450 cysteine heme-iron ligand signature. http://www.expasy.org/cgi-bin/nicedoc.pl?PDOC00081.
[6] Cytochrome. http://en.wikipedia.org/wiki/Cytochrome, June 2006. Cytochrome is a protein family.
[7] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images, Readings in uncertain reasoning, pages 452-472, 1990.

