
2011 International Conference on Reconfigurable Computing and FPGAs

A Coarse-Grained Reconfigurable Processor for Sequencing and Phylogenetic Algorithms in Bioinformatics

Pei Liu, Fatemeh O. Ebrahim, Ahmed Hemani
Dept. of ES, School of ICT
KTH, Royal Institute of Technology
Stockholm, Sweden
{peiliu, fateme, hemani}@kth.se

Kolin Paul
Department of Computer Science and Engineering
Indian Institute of Technology
New Delhi, India
kolin@cse.iitd.ac.in

Abstract—A coarse-grained reconfigurable processor tailored for accelerating multiple bioinformatics algorithms is proposed. In this paper, a programmable and scalable architectural platform instantiates an array of coarse-grained, lightweight processing elements, which allows arbitrary partitioning and scheduling schemes and is capable of solving four complete, popular bioinformatics algorithms: Needleman-Wunsch, Smith-Waterman and HMMER for sequencing, and Maximum Likelihood for phylogenetics. The key difference of the proposed CGRA based solution compared to FPGA and GPU based solutions is a much better match between architecture and algorithm for the core computational needs, as well as the system-level architectural needs. For the same degree of parallelism, we provide 5X to 14X speed-ups compared to FPGA solutions and 15X to 78X compared to GPU acceleration on 3 sequencing algorithms. We also provide a 2.8X speed-up compared to an FPGA with the same amount of core logic and 70X compared to a GPU with the same silicon area for Maximum Likelihood.

Keywords—Bioinformatics, Coarse Grained Reconfigurable Architecture, Needleman Wunsch, Smith Waterman, HMMER, Maximum Likelihood, Phylogenetic Inference, VLSI

I. INTRODUCTION

Bioinformatics is an emerging discipline which studies biological problems by analyzing mathematical models. Most of these problems are computationally expensive, requiring huge amounts of computational resources. The applications in this area, like multiple sequence alignment, molecular docking, protein folding, phylogenetics etc., lend themselves naturally to both control and data parallelism.

Sequence analysis finds similarity in biological sequences, which is used to establish functional, structural and evolutionary relationships. Phylogenetics studies the evolutionary relationship among various groups of organisms through comparative analysis of the associated genome with complex statistical simulations. Accelerating these algorithms is a well-researched topic with very encouraging results. FPGA and GPU based acceleration are at the forefront of these efforts.

We propose a custom computing machine for sequence alignment and phylogenetics algorithms that would provide a quantum improvement in speed-up. To attain a quantum improvement in speedup and capacity, we propose a coarse-grained reconfigurable architecture (CGRA) based solution tailored for the Needleman-Wunsch [1], Smith-Waterman [2], HMMER [3] and Maximum Likelihood [4] algorithms. In this version of the proposed solution, we target the scoring phase for the N-W and S-W algorithms, the P7Viterbi function for HMMER, and the Phylogenetic Likelihood Function (PLF) and the Tree Likelihood Calculation for Maximum Likelihood. The CGRA based solution is motivated by the insight that a better match in granularity of computation, interconnect and storage to the bioinformatics algorithm will improve the architectural efficiency. CGRA also provides better silicon efficiency because it avoids the large reconfiguration overhead of fine-granular FPGAs and the larger-than-necessary GPU processing elements relative to what the ML algorithm needs.

The key contributions made by this paper are:
• A coarse-grained reconfigurable and scalable processing element for accelerating bioinformatics algorithms.
• A reconfigurable platform for solving different complete bioinformatics algorithms, including the Needleman-Wunsch, Smith-Waterman, HMMER and Maximum Likelihood algorithms.
• A demonstration of how the architecture is configured to compute these different sequencing and phylogenetics algorithms for both gene and protein sequences.
• A quantification of the benefits of the proposed solution in terms of its architectural efficiency compared to comparable FPGA and GPU based solutions. We also estimate, based on credible data [5], the potential silicon efficiency, again compared to FPGA and GPU based solutions.

II. RELATED WORK

By virtue of their widespread availability and established front-end tools, many-core architectures and FPGAs are obvious choices to accelerate bioinformatics algorithms compared to the ubiquitous uni- (and dual/quad-) core x86 processor based desktop machines.

The Needleman-Wunsch algorithm was published by Saul B. Needleman and Christian D. Wunsch in 1970 [1]. It is suitable for global alignment of pair-wise sequences with a certain similarity, and it soon became one of the standard techniques in sequence analysis. It also spawned many
variations, including the famous and widely used Smith-Waterman algorithm for local alignment, among others. Parallel acceleration of the Needleman-Wunsch algorithm was introduced on FPGA [6], with a reported speed-up factor of 350x compared to a Pentium IV 2.6 GHz PC.

The Smith-Waterman algorithm was first proposed by Temple F. Smith and Michael S. Waterman in 1981 [2]. The wave-front method is commonly used to parallelize the Smith-Waterman algorithm, and can be seen in both FPGA [7] and CUDA [8] implementations.

Profile HMMs (Hidden Markov Models) [3] are used for sequence analysis in bioinformatics for sensitive database searches, and HMMER is one of the most popular computer program packages in this area. HMMER, often used for Multiple Sequence Alignment (MSA), is sequential by nature, though some researchers [9], [10] have proposed tweaks to parallelize HMMER, either with some loss of accuracy and a significant boost in performance [9], or with a corrective measure that results in no loss of accuracy but still a significant boost in performance on average [10].

ML and Bayesian phylogenetics study the evolutionary relationships among various groups of organisms. Some researchers have proposed FPGA based implementations for likelihood-based phylogenetic inference [11]. There has also been recent work in using GPUs as co-processors for acceleration [12].
As stated earlier, using many-core based architectures and FPGAs are valid and justified choices in terms of pragmatics and availability of architectures as well as mapping tools. However, both these options are not well matched with the granularity of the problem.

FPGAs are very fine grained, and the amount of silicon and energy wasted to support this fine-granularity-based generality is enormous. The many-core option is the other extreme in terms of granularity mismatch. The typical core utilized is significantly larger than the kernel operations used in the NW, SW, HMMER and ML algorithms that are the target of parallelization. The generality of the many-core architecture also restricts the storage and interconnect scheme from being customized to the dataflow typical of these algorithms.
With a coarse-grained reconfigurable architecture (CGRA) targeting computational biology, we propose to overcome these deficiencies in terms of the granularity of the computational units and the architecture of the interconnect and the storage, in order to match the dataflow of those bioinformatics algorithms. By doing so, we gain not only a quantum improvement in acceleration but also significantly better utilization of silicon, which in turn will allow us to deploy more silicon to further press the advantage. And, by keeping the architecture reconfigurable, we retain the ability to program the platform and essentially implement arbitrary variants of those algorithms.
III. PARALLELIZING & PARTITIONING

The Needleman-Wunsch and Smith-Waterman algorithms have a scoring phase that is easy to parallelize. These algorithms involve computation of a matrix of size M×N (see Figure 1), where M and N are the lengths of the two sequences that we would like to align. As M and N can be very large, it is not practical to parallelize the computation of such an enormous matrix. For this reason, most researchers have partitioned the matrix computation problem into several sub-matrix computation problems, and the sub-matrix computation is then parallelized. The size of such a sub-matrix depends on the size of the on-chip buffer, B, and the number of parallel processing elements, P, as shown in Figure 1.

Figure 1. The partitioning and parallelization scheme for the computational biology algorithms.

Figure 2 shows how the parallelized NW and SW algorithms create a computational wave-front, together with their cycle-wise relative timing; the diagram also shows how the M values cross the matrix cell boundaries. A software sketch of this schedule follows.

Figure 2. The spatial and temporal relationship of the M values in a wave-front computation.
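To make the wave-front schedule concrete, the following minimal Python sketch (ours, for illustration only; it is not part of the BiCOM toolflow) enumerates, per time step, the anti-diagonal of cells that can be computed in parallel. Cells on anti-diagonal d = i + j depend only on diagonals d-1 and d-2, which is exactly the property the BiCell array exploits.

```python
# Minimal sketch of the wave-front schedule of Figure 2: every cell on
# anti-diagonal d = i + j depends only on diagonals d-1 and d-2, so all
# cells of one diagonal can be updated in parallel by the P processing
# elements assigned to a sub-matrix.
def wavefront_schedule(m, n):
    """Yield, per time step, the list of (i, j) cells computable in parallel."""
    for d in range(m + n - 1):
        yield [(i, d - i) for i in range(max(0, d - n + 1), min(m, d + 1))]

# An m x n sub-matrix finishes in m + n - 1 steps, e.g. 9 steps for 4 x 6:
for step, cells in enumerate(wavefront_schedule(4, 6)):
    print(step, cells)
```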

Figure 3. Plan7 Hidden Markov Model of 4 states [3].

HMMER [3] is one of the most widely used open-source computer programs for profile HMMs (Hidden Markov Models). A profile HMM is a statistical model used for multiple sequence alignments. Profiling shows that up to 90% of the runtime of HMMER is spent in the P7Viterbi function. Figure 3 demonstrates how the Plan7 Viterbi algorithm works. The original Plan7 HMM algorithm cannot be optimized for full parallel acceleration because of the existence of feedback at the end of each computation loop, as shown in Figure 3. This feedback loop is rarely triggered but cannot be ignored without loss of accuracy.

Maximum Likelihood based phylogenetic inference uses the Phylogenetic Likelihood Function (PLF) to evaluate the likelihood of trees; profiling shows that the PLF consumes over 95% of the execution time. Figure 4 gives an example of the PLF, the Felsenstein pruning algorithm for nucleotide sequences, on a small three-node sub-tree with two child nodes and an inner node at the virtual root of the sub-tree. For every child there is one likelihood vector entry for each column of the alignment; for every branch there is one transition probability matrix. The computation of the complete PLF proceeds as a series of operations (multiplications and additions) on the entries of each column of the likelihood vectors at the child nodes and the two transition probability matrices, in order to calculate the respective ancestor likelihood vector. These operations are executed for all columns of the input alignment.

Figure 4. Example of Phylogenetic Likelihood Function.

These ways of sequencing and partitioning have been reported in the literature and we adopt the same approaches. The innovation we introduce is a custom computing architecture that implements such sequencing and partitioning using a common core for different types of algorithms and sequences.

IV. THE BIOINFORMATICS COMPUTER: BICOM

We have architected a custom computing platform for the NW, SW, HMMER and ML algorithms, implementing the schemes described in section III. We call this platform the Bioinformatics Computer (BiCOM); it is shown in Figure 5 and is suitable for both DNA and protein sequences.

Figure 5. Block diagram of the BiCOM platform: a Leon3 host, a wide LUT for scores, tree configurations and transition probability matrices, the BiCell Array with its micro-coded BiCell Array Controller, a scratch pad, buffers with a dedicated Buffer Controller, and Flash/DDR2 SDRAM memory controllers on an AHB bus.

The Bioinformatics Cell or BiCell is the basic building block. BiCells are instantiated in an array, shown as the BiCell Array in Figure 5, to parallelize the computations mentioned
in section III (shown in Figure 1, 2, 3 and 4). A micro-coded
sequencer, the BiCell Array Controller controls the
configuration and timing of the BiCell Array and the transfer
of virtual root results to the external memory. The buffer
transfers are initiated by the BiCell Array controller but
controlled by a dedicated buffer controller. Both these
controllers are micro-coded sequencers of instructions that
directly correspond to operations like initialize, configure,
compute etc. The BiCell array controller works as an AHB
slave and has an appropriate interface (that we adapted from
public domain GRLIB [15] blocks with a slave interface). A
look up table (LUT) with very wide output bus feeds the
BiCell Array to provide the constant vectors like scores,
transition probability matrices, prior probabilities, etc.
Leon3, a public domain RISC processor from Gaisler
Technologies [15] is the system controller and also
implements the complete NW, SW, HMMER and ML
algorithms. It delegates the required computations to the BiCell Array Controller and supervises it. Leon3 also uses the on-chip scratch pad memory for low-latency access to data while computing the proposing of moves, chain swapping, sampling, the summary of results, and other computations.
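As an illustration only, the instruction streams of these controllers can be pictured as short command sequences; the opcodes below mirror the operations named above (initialize, configure, compute), but the names and operands are hypothetical, not BiCOM's actual micro-code format.

```python
# Hypothetical sketch of a BiCell Array Controller command stream; the
# opcode names echo the operations listed in the text, nothing more.
MICROCODE = [
    ("INITIALIZE", {}),                           # reset array state
    ("CONFIGURE", {"mode": "SW", "cells": 256}),  # select an algorithm mapping
    ("COMPUTE", {"submatrix": (0, 0)}),           # run one wave-front pass
    ("TRANSFER", {"dest": "external_memory"}),    # move virtual-root results out
]

for opcode, operands in MICROCODE:
    print(opcode, operands)
```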
The plan is to have multiple DDR2 SDRAM controllers interfacing multiple external DDR2 SDRAM banks that would serve as the main high-bandwidth data store. At present the DDR2 SDRAM and Flash controllers and their respective memory models do not exist; this is duly indicated by dotted lines for these components in Figure 5. For the moment, these Flash/RAM controllers are emulated by AHB-aware VHDL processes, which interface to data files that emulate the DDR2 SDRAM and Flash memories.

V. THE BIOINFORMATICS CELL: BICELL

Analysis of the computational needs of those popular bioinformatics algorithms reveals that they need operations in the form of add, compare and multiply. Based on this observation, we propose the Bioinformatics Cell or BiCell,
shown in Figure 6, as the kernel that can be reconfigured a) to address the varying add and compare operations of the NW, SW and HMMER algorithms, b) to be reconfigured and sub-grouped as a multiplier for the PLF and Tree Likelihood Calculation in the ML algorithm, and c) to work as an array to parallelize the computations as discussed in section III.

Figure 6. Block diagram of BiCell.
The BiCell has 11 inputs, 9 of which feed into the Bitwidth Extension & Partial Product module. Considering the small number of input bits corresponding to DNA and protein sequence characters, a bitwidth extension function is sometimes required to extend the inputs to the needed data structure; it works in different configurations for DNA and protein sequences. This function affects I1 to I8, in order to match the following adder stage. On the other hand, partial product generation is required since the BiCell must support the ML algorithm, which includes a group of multiplications. A single cell consumes the 9 inputs to provide the multiplicands, and the generated partial products are transferred to the following adder stage for further computation. The Bitwidth Extension & Partial Product module is shown, but not drawn in detail, in the interest of not cluttering the diagram.
The second stage of the BiCell is constructed from four adder units. The fourth adder is followed by a crossbar (Left Crossbar in Figure 6), which establishes a redirection stage that feeds an add/comparison stage and a multiplexor. The Add/Max module can be configured to function as an adder or a comparator; it is made up of an adder and some extra logic. Following another crossbar (Right Crossbar in Figure 6), the output of the second add/comparison module is redirected to the third add/comparison module and the multiplexor. At the end of the processing, 4 outputs are available in registered form and 1 in unregistered form. O2 is unregistered due to the needs of the SW algorithm, and O5 is not balanced in clock cycles with the others due to the needs of the NW, SW and HMMER computations.

During computation of the BiCell array, which inputs are selected and redirected is the principal configuration, provided through the input shown as CFG in Figure 6. Some of the outputs are fed back to the inputs, again in a configurable way; this is not shown in Figure 6, in the interest of not cluttering the diagram.
In order to keep enough accuracy, most of the computational biology algorithms are based on floating-point; log scaling and normalization methods are also employed to avoid underflow due to very small result values. We propose instead that a specialized fixed-point architecture without a log function is suitable, provided we use a larger operand width.

In our current design, the operand width of the BiCell is kept at 80 bits due to the requirement of multiplication in the ML algorithm. For the NW, SW and HMMER algorithms, the base operand width of the BiCell is configured to only 40 bits. This also applies to the final output stage of multiplication in the ML algorithm; during multiplication, 79 bits are used for accuracy. This means that 39 bits are used to represent any signed value, which provides an accuracy of 2^-39. This is more than enough for the NW, SW and HMMER algorithms, and it is also better than the 1.0×10^-10 accuracy limitation of the fast-log approximation function in MrBayes [13], the state-of-the-art software implementation. We can also scale the operand width for greater accuracy compared with double-precision floating-point.
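A one-line check of the claimed resolution (our own arithmetic, simply restating the figures above):

```python
# The least significant bit of a 39-bit signed fraction is 2**-39, well
# below the 1.0e-10 accuracy limit quoted for MrBayes' fast-log
# approximation [13].
lsb = 2.0 ** -39                 # ~1.82e-12
mrbayes_limit = 1.0e-10
assert lsb < mrbayes_limit
print(f"2^-39 = {lsb:.2e} < {mrbayes_limit:.1e}")
```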

The BiCell is lightweight compared with floating-point architectures: an experimental implementation takes just ~15K gate equivalents and clocks at 666 MHz in TSMC 90nm technology, based on post-layout timing analysis. We draw the reader's attention to the granularity of the BiCell compared to FPGAs and GPUs.

While the BiCell can be realized in FPGAs as well, the granularity of the underlying logic is gate and bit level, and the overhead of composing a BiCell from such fine-granular building blocks is considerable. FPGAs also have coarse-granular arithmetic building blocks, but they are limited in number. We have synthesized a single BiCell with Quartus II 9.1 on a 40nm EP4SGX230KF40C2N Stratix IV FPGA device; it consumes 2,043 of 140,600 ALUTs. It can be seen that this FPGA device can fit only 68 BiCells without any other logic, which shows that the FPGA has lower silicon efficiency than the VLSI architecture in this case.

GPUs, on the other hand, have computational units that a) are much bigger than the BiCell and b) are not customized to the needs of bioinformatics computation.

The second key factor that works in favor of the BiCell is that the interconnect fabric used to build the array is custom, though configurable, and hence gives better routing efficiency. In the case of FPGAs and GPUs it is general purpose.

The third factor is that, even at the system level, components like the buffers, the LUT, and the BiCell Array Controller are all customized to the needs of the bioinformatics algorithms, whereas in the case of FPGAs and GPUs the system-level architecture is general purpose, and we invoke the dictum: greater generality implies less efficiency and performance. With the proposed approach, we have restricted the generality to a specific domain and no more. These claims are validated and quantified in Section VI.

A. BiCell for Needleman-Wunsch Algorithm

The BiCell reconfigured for the Needleman-Wunsch algorithm is shown in Figure 7, which also displays how two BiCells operate in parallel to form a fragment of an array. While this fragment shows two units working in parallel, in real life there will be hundreds or more. As can be seen, a subset of the BiCell core is used. Note that a feedback from output OA4 is input to IA7, as well as OB4 to IB7; this is shown as a dotted line since it is done outside the BiCell, as part of building an array of BiCells to parallelize the sub-matrix computation discussed in section III. The cfg input of Figure 6 is present but omitted from Figure 7, and the unused logic is also omitted; both omissions are to avoid cluttering Figure 7.

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + S_{a_i,b_j} \\ M_{i-1,j} + w_a \\ M_{i,j-1} + w_b \end{cases} \quad (1)$$
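A reference software sketch of recurrence (1) follows (our own, for illustration; the boundary initialization is an assumption, since the paper gives only the inner recurrence):

```python
# Scoring phase of Needleman-Wunsch, recurrence (1): S is the substitution
# score table and wa/wb the (negative) gap scores. In BiCOM the max-of-three
# of one cell is evaluated by one configured BiCell per wave-front step.
def nw_score(a, b, S, wa, wb):
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # assumed boundary: leading gaps
        M[i][0] = M[i - 1][0] + wa
    for j in range(1, n + 1):
        M[0][j] = M[0][j - 1] + wb
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i - 1][j - 1] + S[a[i - 1]][b[j - 1]],
                          M[i - 1][j] + wa,
                          M[i][j - 1] + wb)
    return M[m][n]
```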

Figure 7. Needleman-Wunsch algorithm mapped to BiCell.

Figure 8. The Smith-Waterman algorithm mapped to two BiCells.

B. BiCell for Smith-Waterman Algorithm

The Smith-Waterman algorithm is more complex than the Needleman-Wunsch algorithm. A single unit for SW algorithm computation, composed of two BiCells, is shown in Figure 8. The computation in Smith-Waterman also proceeds in the wave-front fashion, but now two units work in tandem for each matrix cell computation. In essence, we need twice the number of BiCells to achieve the same degree of parallelism as the NW algorithm.

$$M_{i,j} = \max \begin{cases} 0 \\ K_{i,j} \\ L_{i,j} \\ M_{i-1,j-1} + S_{a_i,b_j} \end{cases} \quad (2)$$

$$K_{i,j} = \max \begin{cases} K_{i-1,j} - \sigma \\ M_{i-1,j} - \rho - \sigma \end{cases} \quad (3)$$

$$L_{i,j} = \max \begin{cases} L_{i,j-1} - \sigma \\ M_{i,j-1} - \rho - \sigma \end{cases} \quad (4)$$
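The same recurrences in a minimal Python form (ours, for illustration; ρ is the gap-open and σ the gap-extension penalty, as in (3)-(4)):

```python
# Smith-Waterman scoring with affine gaps, recurrences (2)-(4); the K and L
# matrices track the vertical/horizontal gap states, which is why each matrix
# cell needs the two tandem BiCells of Figure 8.
def sw_score(a, b, S, rho, sigma):
    m, n = len(a), len(b)
    NEG = float("-inf")
    M = [[0] * (n + 1) for _ in range(m + 1)]
    K = [[NEG] * (n + 1) for _ in range(m + 1)]
    L = [[NEG] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            K[i][j] = max(K[i - 1][j] - sigma, M[i - 1][j] - rho - sigma)  # (3)
            L[i][j] = max(L[i][j - 1] - sigma, M[i][j - 1] - rho - sigma)  # (4)
            M[i][j] = max(0, K[i][j], L[i][j],
                          M[i - 1][j - 1] + S[a[i - 1]][b[j - 1]])         # (2)
            best = max(best, M[i][j])
    return best
```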


C. BiCell for HMMER Algorithm

HMMER [3] is one of the most widely used open-source computer programs for profile HMMs (Hidden Markov Models). A profile HMM is a statistical model used for multiple sequence alignments. [9], [10] report that up to 90% of the runtime is spent in the P7Viterbi function of HMMER.

The core computation of the P7Viterbi function in HMMER is significantly more complex than the other two sequence alignment algorithms described earlier and is shown below:

$$I_{i,j} = e(I_j, S_i) + \max \begin{cases} M_{i-1,j} + tr(M_j \to I_j) \\ I_{i-1,j} + tr(I_j \to I_j) \end{cases} \quad (5)$$

$$D_{i,j} = \max \begin{cases} M_{i,j-1} + tr(M_{j-1} \to D_j) \\ D_{i,j-1} + tr(D_{j-1} \to D_j) \end{cases} \quad (6)$$

$$M_{i,j} = e(M_j, S_i) + \max \begin{cases} M_{i-1,j-1} + tr(M_{j-1} \to M_j) \\ I_{i-1,j-1} + tr(I_{j-1} \to M_j) \\ D_{i-1,j-1} + tr(D_{j-1} \to M_j) \\ B_i + tr(B \to M_j) \end{cases} \quad (7)$$

$$E_i = \max_{j = 0, \ldots, m-1} \{ M_{i,j} + tr(M_j \to E) \} \quad (8)$$

Here n is the length of the evaluated sequence, m is the length of the profile HMM, e is the emission score table for M and I, tr is the transition score for each state, and M, I, D are the transition states defined in HMMER.

The HMMER algorithm requires three BiCells working together for a single node, as shown in Figure 9. The composite HMMER unit is arranged in an array, in much the same way as for the other two alignment algorithms described above, and the computation also proceeds in a wave-front manner; a software sketch of one row follows. The key difference is that, after each line (an aggregate of Bs+K, where K depends on the database used) of parallel computation, it is checked whether the feedback will be necessary. If the feedback becomes necessary, the computations from the sequence element at the beginning of this stage up to the line where the feedback occurred need to be redone, which can also be done in parallel. The sequencer shown in Figure 5 is programmed to deal with this complexity.

Diagrams corresponding to the full HMMER algorithm are too big and complex to be accommodated in this paper and are skipped.
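A minimal sketch of recurrences (5)-(8) for one sequence element i (ours, for illustration; the tr dictionary keys are illustrative names, not HMMER's actual data layout):

```python
# One row of the P7Viterbi recurrences (5)-(8). Mp/Ip/Dp are row i-1 of the
# M, I, D matrices; e_M/e_I are emission score tables, tr transition scores,
# s_i the current sequence symbol, B the begin-state score for row i.
NEG = float("-inf")

def p7viterbi_row(Mp, Ip, Dp, B, e_M, e_I, tr, s_i, m):
    M, I, D, E = [NEG] * m, [NEG] * m, [NEG] * m, NEG
    for j in range(m):
        I[j] = e_I[j][s_i] + max(Mp[j] + tr["MI"][j],
                                 Ip[j] + tr["II"][j])                     # (5)
        M[j] = e_M[j][s_i] + max(Mp[j - 1] + tr["MM"][j - 1] if j else NEG,
                                 Ip[j - 1] + tr["IM"][j - 1] if j else NEG,
                                 Dp[j - 1] + tr["DM"][j - 1] if j else NEG,
                                 B + tr["BM"][j])                         # (7)
        D[j] = max(M[j - 1] + tr["MD"][j - 1] if j else NEG,
                   D[j - 1] + tr["DD"][j - 1] if j else NEG)              # (6)
        E = max(E, M[j] + tr["ME"][j])                                    # (8)
    return M, I, D, E
```

Note that D depends on M of the same row at j-1, which is the in-row feedback that the composite three-BiCell unit of Figure 9 has to resolve.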

Figure 9. HMMER algorithm mapped to three BiCells.
D. BiCell for Maximum Likelihood Algorithm

The Conditional Probability Calculation for the DNA-sequence-based PLF is composed of 9 multiplications and 6 additions. In order to implement such a complicated structure, we first need to map BiCells as a multiplier. A single BiCell mapped as part of the first stage of a multiplier is shown in Figure 10.

$$L_{N \in \{A,C,G,T\}} = \left( \sum_{S \in \{A,C,G,T\}} P_{NS}(i) L_S(i) \right) \left( \sum_{S \in \{A,C,G,T\}} P_{NS}(j) L_S(j) \right) \quad (9)$$
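In software form, one column's conditional-probability update (9) looks as follows (our sketch; i and j label the two child branches, and replacing the 4-letter alphabet by 20 gives the protein version of equation 12):

```python
# One conditional-probability evaluation (9) at an inner node, for a single
# alignment column: per nucleotide N it costs 9 multiplications and 6
# additions, exactly the operation mix the text maps onto BiCells.
BASES = range(4)  # A, C, G, T; use range(20) for the protein PLF of (12)

def conditional_probability(P_i, L_i, P_j, L_j):
    """P_i/P_j: 4x4 branch transition matrices; L_i/L_j: child likelihoods."""
    L_parent = []
    for N in BASES:
        left = sum(P_i[N][S] * L_i[S] for S in BASES)    # 4 mul, 3 add
        right = sum(P_j[N][S] * L_j[S] for S in BASES)   # 4 mul, 3 add
        L_parent.append(left * right)                    # 1 mul
    return L_parent
```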
Figure 10. One BiCell mapped as part of a multiplier.

Since the bitwidth of a multiplicand is 40 bits, 5 BiCells are required for the first stage of a multiplier, and another BiCell for the second stage. This means 6 BiCells are mapped as a 2-stage multiplier for the ML algorithm. The complete conditional probability calculation unit for DNA sequences mapped onto the BiCell array is shown in Figure 11. According to Figure 11, 55 BiCells are required to implement the conditional probability calculation unit. The Conditional Probability Unit shall be replicated four times, once for each nucleotide, to implement equation 9. Figure 12(a) shows how the resulting 220 configured BiCells operate in parallel. While this fragment shows how 4 units work in parallel, in real life there will be many more.

Figure 11. Conditional probability calculation unit for DNA sequences mapped to BiCell array.

Figure 12. PLF Computation and Tree Likelihood Calculation mapped to BiCells.

Figure 12(b) shows how 31 BiCells are configured for the Tree Likelihood score calculation given by equation 10. This unit takes 4 inputs from the inner likelihood vector entries LA, LC, LG and LT at position i of the virtual root and multiplies them with the base frequencies (prior probabilities) πA, πC, πG and πT with the help of 24 configured BiCells, followed by a single BiCell. The sum result is then multiplied with the product of the previous results L(0) · L(1) · ... · L(i-1), such that L = L(0) · L(1) · ... · L(i). This Tree Likelihood Calculation unit actually computes the likelihood score and not the log likelihood score, for the reason we have described before.

$$\text{likelihood} = \sum_{i=0}^{n-1} \log \left( \sum_{S \in \{A,C,G,T\}} \pi_S L_S(i) \right) \quad (10)$$

According to the ML algorithm, the number of all possible rooted phylogenetic trees for n taxa is given by:

$$N_T = \prod_{i=3}^{n} (2i - 3) \quad \text{for } n \geq 2 \quad (11)$$
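Equation (11) is a double-factorial count; the short check below (ours) reproduces the "Number of rooted trees" row of Table I:

```python
# Number of rooted phylogenetic trees (11): N_T = prod_{i=3..n} (2i - 3);
# the empty product for n = 2 gives the single rooted tree.
def rooted_trees(n):
    assert n >= 2
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3
    return count

print([rooted_trees(n) for n in (2, 3, 4, 5)])  # [1, 3, 15, 105], as Table I
```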
Figure 13 shows how the BiCell array works in parallel for the ML algorithm. Each of the big dots in the tree configuration part corresponds to the 220-cell PLF unit, and each of the small dots corresponds to the 31-cell Tree Likelihood Calculation unit. It shows 8,292 reconfigured BiCells working together as 12 temporarily grouped units to compute a single type of rooted phylogenetic tree topology; 4 units in the same column compute different rooted phylogenetic tree topologies using the same input data, and 3 groups work together to share the input bandwidth.

It can easily be seen that the total number of cells is constrained by silicon space only. For example, we can
place another 33 reconfigured BiCell arrays (or 22,803 BiCells) to compute the complete 15 rooted phylogenetic trees for 4-taxa DNA sequences in parallel in 3 columns. We can also place many units in columns to increase the throughput of the computation, subject to the available input bandwidth. Note that all the cells and arrays are reconfigurable and regroupable, which means our system matches any requirement arising from different numbers of taxa and lengths of sequence inputs.

Figure 13. 8,292 BiCells working in parallel for 6 rooted phylogenetic trees corresponding to 4-taxa DNA sequences.

According to equations 9, 10 and 11, the BiCells required to construct rooted phylogenetic trees for different numbers of taxa sequences are summarized in Table I:
TABLE I. RESOURCES REQUIRED FOR DIFFERENT NUMBERS OF TAXA SEQUENCES

Number of taxa sequences           |     2 |      3 |       4 |       5
Number of nodes in rooted tree     |     1 |      2 |       3 |       4
Number of BiCells per tree         |   251 |    471 |     691 |   1,791
Number of rooted trees             |     1 |      3 |      15 |     105
Number of BiCells per full column  |   251 |  1,413 |  10,365 | 188,055
Maximum columns for 128-bit bus    |    32 |     21 |      16 |      12
Number of BiCells if fully loaded  | 8,032 | 29,673 | 165,840 |  2,256k
It can be seen that with an increasing number of taxa sequences, the number of BiCells per tree and the number of rooted trees grow rapidly, and the number of BiCells per full column grows as the product of these two values. For a system with a 128-bit wide bus, a full load is impossible on a 2-D chip for more than 2 taxa sequences, considering the limitation of silicon area. Since real cases of Maximum Likelihood computation will use hundreds or even thousands of taxa sequences, we should divide the complete tree into small pieces, so-called sub-trees, to fit the hardware architecture. This does not affect the accuracy of the final result, and the computation of each sub-tree can also be parallelized.

E. BiCell for Protein Sequences Based ML Algorithm

$$L_{N \in \{20 \text{ proteins}\}} = \left( \sum_{S \in \{20 \text{ proteins}\}} P_{NS}(i) L_S(i) \right) \left( \sum_{S \in \{20 \text{ proteins}\}} P_{NS}(j) L_S(j) \right) \quad (12)$$

The Conditional Probability Unit shall be replicated 20 times, once for each type of amino acid, to implement equation 12, and it takes 4,400 BiCells to create the PLF computation unit for protein sequences. Diagrams corresponding to the Maximum Likelihood algorithm for protein sequences would be too big and complex to be accommodated in this paper, and are skipped.

VI. PERFORMANCE EVALUATION

In this section, we evaluate the predicted performance of accelerating the Needleman-Wunsch, Smith-Waterman, HMMER and Maximum Likelihood algorithms discussed in section V. The results quantify the benefits of the proposed CGRA approach compared to the results reported in the literature for FPGAs [7], [10], [11] and GPUs [8], [12], [14].

Researchers have reported acceleration of sequencing algorithms using FPGAs. Benkrid [7] reports results for both the NW and SW algorithms; for the NW algorithm, it reports 252 processing elements on a rather old 130nm FPGA device. Similar to FPGAs, there are reported results for speedups of sequencing algorithms using GPUs. Łukasz's attempt [8] is a recent one that reports achieving a performance of 4650 GCUPS on a single core of a 9800GX2 with 128 stream processors in 65nm technology.

For HMMER acceleration, Oliver et al. report an FPGA based accelerator achieving 2,100 MCUPS with 30 PEs in the 130nm process, for the original HMMER algorithm [10]. John's attempt [14], based on an 8800Ultra GPU with 128 stream processors, reports being 18x faster than a single core of a quad-core 2.0 GHz Intel Xeon processor.

Alachiotis et al. reported their FPGA-based double precision accelerator for ML-based methods without the log approximation function on a Xilinx Virtex 5 SX240T FPGA device running at 284.152 MHz [11]. Their 7-cell structure achieves a maximum of 13.68X speed-up compared with a single core of an AMD Opteron processor running at 2.6 GHz. Suchard et al. reported their GPU-based accelerator for ML-based methods without the log approximation function on an nVidia GTX280 GPU with 240 SPs running at 1.3 GHz [12]. They reported their performance results compared with an Intel QX9770 processor running at 3.2 GHz, achieving 6X speedup for single precision and 7.5X for double precision with 3 GPUs.

We make the same assumptions as above and scale these results for a clear comparison with our design. For the different working frequencies of the processors underlying the software platforms, we only scale the performance values by the frequency ratio, which can be seen as giving a worse result than the ideal case. Our baseline for speedup comparison on sequencing algorithms is a desktop PC with an Intel Core i7 860 2.80GHz CPU running Windows 7 Professional.

Performance corresponding to this baseline is reported as "Software" in Table II. We wrote the NW algorithm ourselves, whereas for the SW algorithm we use the forward_pass function from ClustalW v1.83. For HMMER, the code of the P7Viterbi function is taken from the HMMER v2.3.2 package. All of these evaluation codes are in the C language and compiled with Visual Studio 2008 with the "O2" optimization flag.
While results for the entire sequencing algorithms would be interesting, many researchers report the speedup of only the scoring phase that has been parallelized. To be able to do a fair comparison, we follow the same methodology and report speedups of the scoring phase of the NW and SW algorithms; the trace-back phase is completed in software with the simulation result. The performance results for the NW, SW and HMMER algorithms are reported in Cell Updates Per Second, or CUPS, where the prefix G makes Giga CUPS (GCUPS). The experimental platform we take for measuring performance in GCUPS consists of 256 BiCells.
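For reference, the metric itself in code form (ours; the sequence lengths and time below are hypothetical values chosen only to reproduce BiCOM's NW figure from Table II):

```python
# CUPS counts one update per DP matrix cell: an m x n scoring pass done in
# t seconds runs at m*n/t CUPS; divide by 1e9 for GCUPS.
def gcups(m, n, seconds):
    return m * n / seconds / 1e9

# e.g. two 10,000-symbol sequences scored in 0.651 ms -> ~153.6 GCUPS.
print(round(gcups(10_000, 10_000, 0.651e-3), 1))
```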
The performance on accelerating the ML algorithm is not reported in GCUPS. In order to make a fair comparison between our design, the FPGA and the GPU on the ML algorithm, we assume that BiCOM and the GPU occupy the same silicon area in a 90nm process, and that the FPGA provides logic functionally equivalent to BiCOM. For the 90nm process, the TSMC standard logic library provides a gate density of 448k gates per mm² [16]. For the GPU, the nVidia 8800GTX with 128 SPs in a 90nm process covers a 484 mm² die area [17]. We assume that our structure also fits on a 484 mm² die, with about 50% of the area consumed by the Leon3 processor, extra logic and buffers, and 50% by BiCells. A single BiCell consumes ~15k equivalent gates, so we predict that with 242 mm² of silicon, (448/15) × 242 = 7,227 BiCells can be placed. We take the number 7,000 for the comparison, and we also assume that the SPs of the GPU in 65nm (GTX280) and 90nm (8800GTX) give the same performance.
TABLE II. SPEEDUP COMPARISONS BETWEEN SOFTWARE, FPGA, GPU AND BICOM ON MULTIPLE BIOINFORMATICS ALGORITHMS

              | Needleman-Wunsch      | Smith-Waterman              | HMMER                     | ML
              | GCUPS   Speedup       | GCUPS   Speedup             | GCUPS   Speedup           | Speedup
BiCOM         | 153.60  1067x, 6.35z  | 76.80   717.7x, 15.4y, 5.5z | 64.00   1561x, 78y, 14.3z | 127.3x, 11.6y, 2.32z
FPGA [7,9,11] | 24.17   167.8x, 1z    | 14.00   130.8x, 1z          | 4.48    109.2x, 1z        | 54.8x, 1z
GPU [8,12]    | -       -             | 5.00    46.7x, 1y           | -       18.9x, 1y         | 10.9x, 1y
Software      | 0.144   1x            | 0.107   1x                  | 0.041   1x                | 1x

(Markers: x = speedup over the software baseline, y = over the GPU, z = over the FPGA.)
VII. CONCLUSION AND FUTURE WORK

We have presented a CGRA based solution for accelerating four popular bioinformatics algorithms and quantified the benefits of the proposed solution in terms of architectural efficiency (at the same degree of parallelism) and combined architectural and silicon efficiency. The architecture can easily be scaled to any silicon process, and the depth of the pipeline can be optimized to achieve a much better system throughput.

At present we compare our results with FPGA and GPU platforms based on scaling and prediction from the results of other researchers. We plan to construct our own FPGA and GPU platforms for further performance evaluation. We are also planning a migration to a 40nm silicon process.

REFERENCES

[1] Needleman, S.B., and Wunsch, C.D., "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, Vol. 48, Issue 3, pp. 443-453, 1970.
[2] Smith, T.F., and Waterman, M.S., "Identification of Common Molecular Subsequences," Journal of Molecular Biology, Vol. 147, pp. 195-197, 1981.
[3] S.R. Eddy, "Profile hidden Markov models," Bioinformatics, Vol. 14, pp. 755-763, 1998, doi: 10.1093/bioinformatics/14.9.755.
[4] Joseph Felsenstein, "Evolutionary trees from DNA sequences: a maximum likelihood approach," Journal of Molecular Evolution, Vol. 17, No. 6, pp. 368-376, 1981.
[5] Kuon, I., and Rose, J., "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, Issue 2, pp. 203-215, Feb. 2007, doi: 10.1109/TCAD.2006.884574.
[6] Fei X., and Yong D., "Reducing Storage Requirements in Accelerating Algorithm of Global BioSequence Alignment on FPGA," Proc. 7th International Conference on Advanced Parallel Processing Technologies (APPT 07), 2007.
[7] Benkrid, K., Ying, L., and Benkrid, A., "A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 17, No. 4, pp. 561-570, April 2009, doi: 10.1109/TVLSI.2008.2005314.
[8] Łukasz, L., and Witold, R., "An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases," Proc. IEEE Symp. Parallel & Distributed Processing (IPDPS 2009), pp. 1-8, May 2009, doi: 10.1109/IPDPS.2009.5160931.
[9] Timothy F. O., Bertil S., Yanto J., and Douglas L. M., "Accelerating the Viterbi Algorithm for Profile Hidden Markov Models Using Reconfigurable Hardware," Proc. International Conference on Computational Science (1), 2006, pp. 522-529.
[10] T. Oliver, Y.Y. Leow, and B. Schmidt, "Integrating FPGA Acceleration into HMMer," Parallel Computing, Vol. 34, No. 11, pp. 681-691, Dec. 2008, doi: 10.1016/j.parco.2008.08.003.
[11] N. Alachiotis, et al., "Exploring FPGAs for accelerating the Phylogenetic Likelihood Function," Proc. IEEE Symp. Parallel & Distributed Processing (IPDPS 2009), IEEE Press, pp. 1-8, May 2009, doi: 10.1109/IPDPS.2009.5160929.
[12] Marc A. Suchard and Andrew Rambaut, "Many-Core Algorithms for Statistical Phylogenetics," Bioinformatics, Vol. 25, Issue 11, pp. 1370-1376, 2009, doi: 10.1093/bioinformatics/btp244.
[13] Huelsenbeck, J.P., and Ronquist, F., "MRBAYES: Bayesian inference of phylogenetic trees," Bioinformatics, Vol. 17, Issue 8, pp. 754-755, 2001, doi: 10.1093/bioinformatics/17.8.754.
[14] John, P.W., Vidyananth, B., Suryaprakash, K., and Vipin, C., "Evaluating the Use of GPUs for Life Science Applications," Proc. IEEE Symp. Parallel & Distributed Processing (IPDPS 2009), 2009.
[15] Leon3 Processor and GRLIB, http://www.gaisler.com
[16] Datasheet of TSMC standard cell libraries, Cadence, http://www.cadence.com/downloads/tsmc_library_request/SC_Brochure_9.pdf
[17] Comparison of Nvidia graphics processing units, Wikipedia, http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units

