Abstract—A coarse-grained reconfigurable processor tailored for accelerating multiple bioinformatics algorithms is proposed. In this paper, a programmable and scalable architectural platform instantiates an array of coarse-grained, light-weight processing elements, which allows arbitrary partitioning and scheduling schemes and is capable of solving four popular bioinformatics algorithms in their entirety: Needleman-Wunsch, Smith-Waterman and HMMER for sequence analysis, and Maximum Likelihood for phylogenetic inference. The key difference of the proposed CGRA based solution compared to FPGA and GPU based solutions is a much better match between architecture and algorithms, for the core computational needs as well as the system level architectural needs. For the same degree of parallelism, we provide 5X to 14X speed-up compared to FPGA solutions and 15X to 78X compared to GPU acceleration on three sequencing algorithms. We also provide 2.8X speed-up compared to an FPGA with the same amount of core logic and 70X compared to a GPU with the same silicon area for Maximum Likelihood.

Keywords—Bioinformatics; Coarse Grained Reconfigurable Architecture; Needleman-Wunsch; Smith-Waterman; HMMER; Maximum Likelihood; Phylogenetic Inference; VLSI

I. INTRODUCTION

Bioinformatics is an emerging discipline which studies biological problems by analyzing mathematical models. Most of these problems are computationally expensive, requiring huge amounts of computational resources. The applications in this area, such as multiple sequence alignment, molecular docking, protein folding and phylogenetics, lend themselves naturally to both control and data parallelism.

Sequence analysis finds similarity in biological sequences, which is used to establish functional, structural and evolutionary relationships. Phylogenetics studies the evolutionary relationship among various groups of organisms through comparative analysis of the associated genome with complex statistical simulations. Accelerating these algorithms is a well-researched topic with very encouraging results, and FPGA and GPU based acceleration are at the forefront of these efforts.

We propose a custom computing machine for sequence alignment and phylogenetics algorithms that provides a quantum improvement in speed-up. To attain this improvement in speed-up and capacity, we propose a coarse-grained reconfigurable architecture (CGRA) based solution tailored for the Needleman-Wunsch [1], Smith-Waterman [2], HMMER [3] and Maximum Likelihood [4] algorithms. In this version of the proposed solution, we target the scoring phase for the N-W and S-W algorithms, the P7Viterbi function for HMMER, and the Phylogenetic Likelihood Function (PLF) and the Tree Likelihood Calculation for Maximum Likelihood. The CGRA based solution is motivated by the insight that a better match in granularity of computation, interconnect and storage to the bioinformatics algorithm will improve the architectural efficiency. A CGRA also provides better silicon efficiency because it avoids the large reconfiguration overhead of fine-granular FPGAs and the larger-than-necessary GPU processing element relative to what the ML algorithm needs.

The key contributions made by this paper are:
• A coarse-grained reconfigurable and scalable processing element for accelerating bioinformatics algorithms.
• A reconfigurable platform for solving different complete bioinformatics algorithms, including the Needleman-Wunsch, Smith-Waterman, HMMER and Maximum Likelihood algorithms.
• A demonstration of how the architecture is configured to compute these different sequencing and phylogenetics algorithms for both gene and protein sequences.
• A quantification of the benefits of the proposed solution in terms of its architectural efficiency compared to comparable FPGA and GPU based solutions. We also estimate, based on credible data [5], the potential silicon efficiency, again compared to FPGA and GPU based solutions.

II. RELATED WORK

By virtue of their widespread availability and established front-end tools, many-core architectures and FPGAs are obvious choices to accelerate bioinformatics algorithms compared to the ubiquitous uni- (and dual/quad-) core x86 processor based desktop machines.

The Needleman-Wunsch algorithm was published by Saul B. Needleman and Christian D. Wunsch in 1970 [1]. It is suitable for global alignment of pair-wise sequences with a certain similarity, and soon became one of the standard techniques in sequence analysis.
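The global-alignment recurrence at the heart of these accelerators can be sketched in a few lines. The sketch below is illustrative only (Python, with arbitrary match/mismatch/gap scores, not the accelerated implementation); it fills the score matrix in anti-diagonal order, because every cell on a diagonal i + j = k depends only on the two previous diagonals — exactly the wave-front parallelism that hardware exploits.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch scoring matrix, filled in anti-diagonal
    (wave-front) order: all cells on a diagonal i + j = k are
    mutually independent, so a whole diagonal could be computed
    in parallel by a row of processing elements."""
    m, n = len(a), len(b)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):          # boundary: leading gaps
        M[i][0] = i * gap
    for j in range(n + 1):
        M[0][j] = j * gap
    for k in range(2, m + n + 1):   # one anti-diagonal at a time
        for i in range(max(1, k - n), min(m, k - 1) + 1):
            j = k - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                          M[i - 1][j] + gap,     # gap in b
                          M[i][j - 1] + gap)     # gap in a
    return M[m][n]
```

The inner loop over `i` is the set of cells that the hardware evaluates simultaneously in one wave-front step.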
HMMER is one of the most popular computer program packages for profile-HMM based sequence searches. HMMER, often used for Multiple Sequence Alignment (MSA), is sequential by nature, though some researchers [9], [10] have proposed tweaks to parallelize it, either with some loss of accuracy and a significant boost in performance [9], or with a corrective measure that results in no loss of accuracy but still a significant boost in performance on average [10].

ML and Bayesian phylogenetic methods study the evolutionary relationship among various groups of organisms. Some researchers have proposed FPGA based implementations for likelihood-based phylogenetic inference [11]. There has also been recent work in using GPUs as co-processors for acceleration [12].

As stated earlier, using many-core architectures and FPGAs are valid and justified choices in terms of pragmatics, availability of architectures and mapping tools. However, both options are poorly matched to the granularity of the problem.

FPGAs are very fine grained, and the amount of silicon and energy wasted to support this fine-granularity-based generality is enormous. The many-core option is the other extreme of the granularity mismatch: the typical core is significantly larger than the kernel operations of the NW, SW, HMMER and ML algorithms that are the target of parallelization. The generality of the many-core architecture also prevents the storage and interconnect scheme from being customized to the dataflow typical of these algorithms.

With a coarse-grained reconfigurable architecture (CGRA) targeting computational biology, we propose to overcome these deficiencies in the granularity of the computational units and in the architecture of the interconnect and the storage, in order to match the dataflow of these bioinformatics algorithms. By doing so, we gain not only a quantum improvement in acceleration but also significantly better utilization of silicon, which in turn allows us to deploy more silicon to further press the advantage. And, by keeping the architecture reconfigurable, we retain the ability to implement an essentially arbitrary algorithm within this domain.

III. PARALLELIZING & PARTITIONING

The Needleman-Wunsch and Smith-Waterman algorithms have a scoring phase that is easy to parallelize. These algorithms involve the computation of a matrix of size M×N (see Figure 1), where M and N are the lengths of the two sequences to be aligned. As M and N can be very large, the matrix is partitioned into sub-matrices that can be scheduled across the processing elements (Figure 1).

Figure 1. The partitioning and parallelization scheme for the computational biology algorithms.

Figure 2 shows how the parallelized NW and SW algorithms create a computational wave-front, along with the cycle-wise relative timing. The diagram also shows how the M values cross the matrix cell boundaries.

Figure 2. The spatial and temporal relationship of the M values in a wave-front computation.

HMMER [3] is one of the most widely used open-source computer programs for profile HMMs (Hidden Markov Models). A profile HMM is a statistical model used for multiple sequence alignments. Profiling shows that up to 90% of the runtime of HMMER is spent in the P7Viterbi function. Figure 3 demonstrates how the Plan7 Viterbi algorithm works.

Figure 3. Plan7 Hidden Markov Model of 4 states [3].
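A drastically simplified, log-space sketch of a profile-HMM Viterbi row update may help fix ideas. This is illustrative only: the state names and transition-score dictionary are assumptions of the sketch, and HMMER's real P7Viterbi has additional special states and the feedback path discussed in the text.

```python
NEG = float("-inf")

def viterbi_row(prev_M, prev_I, prev_D, emit_M, emit_I, tr):
    """One row of a simplified profile-HMM Viterbi recurrence in
    log space. prev_* hold the previous sequence symbol's scores,
    emit_M/emit_I are per-position emission scores for the current
    symbol, and tr[(X, Y)] is the X->Y transition score. Note the
    delete row depends on D[j-1] of the SAME row, a serial
    dependence that limits full parallelization."""
    L = len(emit_M)
    M, I, D = [NEG] * L, [NEG] * L, [NEG] * L
    for j in range(L):
        pM = prev_M[j - 1] if j > 0 else 0.0   # crude begin -> M_1
        pI = prev_I[j - 1] if j > 0 else NEG
        pD = prev_D[j - 1] if j > 0 else NEG
        M[j] = emit_M[j] + max(pM + tr[("M", "M")],
                               pI + tr[("I", "M")],
                               pD + tr[("D", "M")])
        I[j] = emit_I[j] + max(prev_M[j] + tr[("M", "I")],
                               prev_I[j] + tr[("I", "I")])
        dM = M[j - 1] + tr[("M", "D")] if j > 0 else NEG
        dD = D[j - 1] + tr[("D", "D")] if j > 0 else NEG
        D[j] = max(dM, dD)   # same-row dependence on D[j-1]
    return M, I, D
```

The match and insert updates at each position depend only on the previous row, so a row of cells can be updated in parallel; the delete chain is the part that must be handled specially in hardware.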
The original Plan7 HMM algorithm cannot be optimized for full parallel acceleration, because of the existence of feedback at the end of each computation loop, as shown in Figure 3. This feedback loop is rarely needed but cannot be ignored without sacrificing accuracy.

Maximum Likelihood based phylogenetic inference uses the Phylogenetic Likelihood Function (PLF) to evaluate the likelihood of trees; profiling shows that the PLF consumes over 95% of the execution time. Figure 4 gives an example of the PLF, the Felsenstein pruning algorithm for nucleotide sequences, on a small three-node sub-tree with two child nodes and an inner node at the virtual root of the sub-tree. For every branch there is one transition probability matrix. The computation of the complete PLF combines these per-branch and per-node calculations over the whole tree.

[Figure 5: system block diagram. A Leon3 processor, LUT, Micro-coded Sequencer, BiCell Array Controller, Scratch Pad and the BiCell Array communicate over AHB; memory controllers attach Flash, SDRAM and DDR2 banks, and buffer controllers feed scores, tree configurations and transition probability matrices to the array through the input and output buffers.]
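The per-column PLF step on the three-node sub-tree described above can be sketched directly. This is a toy illustration (plain Python over four nucleotide states; the matrices and vectors passed in would come from a real substitution model, which is not modeled here):

```python
def combine_children(P_left, P_right, L_left, L_right):
    """Conditional likelihood vector at the inner node (virtual
    root) of a three-node sub-tree, for one alignment column over
    the states A/C/G/T. P_left/P_right are the 4x4 transition
    probability matrices of the two branches; L_left/L_right are
    the children's conditional likelihood vectors."""
    parent = []
    for s in range(4):
        # dot product of the branch matrix row with each child's vector
        down_left = sum(P_left[s][t] * L_left[t] for t in range(4))
        down_right = sum(P_right[s][t] * L_right[t] for t in range(4))
        parent.append(down_left * down_right)   # product of the two sums
    return parent
```

Per parent state this is two 4-term dot products (8 multiplications, 6 additions) plus one final product, which matches the count of 9 multiplications and 6 additions given in the text.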
The BiCell, shown in Figure 6, is the kernel that can be reconfigured a) to address the varying add and compare operations of the NW, SW and HMMER algorithms, b) to be regrouped as a multiplier for the PLF and Tree Likelihood Calculation in the ML algorithm, and c) to work as an array to parallelize the computations as discussed in Section III.

Figure 6. Block diagram of BiCell.

The BiCell has 11 inputs, 9 of which feed into the Bitwidth Extension & Partial Product module. Because the input bit-widths corresponding to DNA and protein sequence characters are small, a bitwidth extension function is sometimes required to extend the inputs to the required data structure; it works in different configurations for DNA and protein sequences. This function affects I1 to I8, in order to match the following adder stage. Partial product generation, on the other hand, is required because the BiCell must support the ML algorithm, which includes a group of multiplications. It consumes 9 inputs for a single cell to provide the multiplicands, and the generated partial products are transferred to the following adder stage for further computation. The Bitwidth Extension & Partial Product module is shown but not drawn in detail, in the interest of not cluttering the diagram.

The second stage of the BiCell is constructed from four adder units. The fourth adder is followed by a crossbar (left crossbar in Figure 6), which establishes a redirection stage feeding an add/comparison stage and a multiplexor. The Add/Max module, made up of an adder and some extra logic, can be configured to function as an adder or as a comparator. Following another crossbar (right crossbar in Figure 6), the output of the second add/comparison module is redirected to the third add/comparison module and the multiplexor. At the end of the processing, 4 outputs are available in registered form and 1 in unregistered form. O2 is unregistered due to the needs of the SW algorithm, and O5 is not balanced in clock cycles with the others due to the needs of the NW, SW and HMMER computations.

During computation in the BiCell array, which inputs are selected and redirected is the principal configuration input, shown as CFG in Figure 6. Some of the outputs are fed back to the inputs, again in a configurable way, not shown in Figure 6 in the interest of not cluttering the diagram.

In order to keep enough accuracy, most computational biology codes are based on floating-point; log scaling and normalization methods are also employed to avoid underflow due to very small result values. We propose instead that a specialized fixed-point architecture without a log function is suitable, provided a larger operand width is used.

In our current design, the operand width of the BiCell is kept at 80 bits due to the requirements of multiplication in the ML algorithm. For the NW, SW and HMMER algorithms, the base operand width of the BiCell is configured to only 40 bits. This also applies to the final output stage of multiplication in the ML algorithm; during multiplication, 79 bits are used for accuracy. This means that 39 bits are used to represent any signed value, which provides an accuracy of 2^-39. This is more than enough for the NW, SW and HMMER algorithms, and also better than the 1.0×10^-10 accuracy limitation of the fast-log approximation function in MrBayes [13], the state-of-the-art software implementation. We can also scale the operand width for accuracy greater than that of double-precision floating-point.

The BiCell is light weight compared with floating-point architectures: an experimental implementation takes just ~15K gate equivalents and clocks at 666 MHz in TSMC 90nm technology, based on post-layout timing analysis. We draw the reader's attention to the granularity of the BiCell compared to FPGAs and GPUs.

While the BiCell can be realized in FPGAs as well, the granularity of the underlying logic is gate and bit level, and the overhead of composing a BiCell from such fine-granular building blocks is considerable. FPGAs also have coarse-granular arithmetic building blocks, but they are limited in number. We have synthesized a single BiCell with Quartus II 9.1 on a 40nm EP4SGX230KF40C2N Stratix IV FPGA device; it consumes 2,043 of 140,600 ALUTs. This FPGA device can therefore fit only 68 BiCells without any other logic, which shows that the FPGA has lower silicon efficiency than the VLSI architecture for this case.

GPUs, on the other hand, have computational units that a) are much bigger than the BiCell and b) are not customized to the needs of bioinformatics computation.

The second key factor that works in favor of the BiCell is that the interconnect fabric used to build the array is custom, though configurable, and hence gives better routing efficiency. In FPGAs and GPUs it is general purpose.

The third factor is that, even at the system level, components like the buffers, the LUT, and the BiCell Array Controller are all customized to the needs of the bioinformatics algorithms, whereas in FPGAs and GPUs this system level architecture is general purpose. We invoke the dictum: greater generality implies less efficiency and performance. With the proposed approach, we have restricted the generality to a specific domain and no more. These claims are validated and quantified in Section VI.

A. BiCell for Needleman-Wunsch Algorithm

The BiCell reconfigured for the Needleman-Wunsch algorithm is shown in Figure 7, which also displays how two BiCells operate in parallel to form a fragment of an array. While this fragment shows two units working in parallel, in real life there will be hundreds or more. As can be seen, a subset of the BiCell core is used. Note that a feedback from output OA4 is input to IA7, as well as OB4 to IB7; this is shown as a dotted line because it is done outside the BiCell, as part of building an array of BiCells to parallelize the sub-matrix computation discussed in Section III. The cfg input in Figure 7 carries the corresponding configuration.

The Smith-Waterman cell update and the HMMER delete-state recurrence mapped onto the BiCells are:

M_{i,j} = \max\{\, 0,\ K_{i,j},\ L_{i,j},\ M_{i-1,j-1} + S_{a_i,b_j} \,\}    (2)

D_{i,j} = \max\{\, M_{i,j-1} + tr(M_{j-1} \to D_j),\ D_{i,j-1} + tr(D_{j-1} \to D_j) \,\}    (6)
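The fixed-point format described earlier (signed values with 39 fractional bits, i.e. a resolution of 2^-39, and a wide multiplication intermediate) can be modeled with a toy sketch. This illustrates the arithmetic only, not the BiCell datapath:

```python
FRAC = 39                 # fractional bits: resolution is 2**-39
SCALE = 1 << FRAC

def to_fix(x):
    """Quantize a real value in (-1, 1) to signed fixed point
    with 39 fractional bits."""
    return round(x * SCALE)

def fix_mul(a, b):
    """Multiply two fixed-point values; the wide signed
    intermediate is renormalized back to 39 fractional bits."""
    return (a * b) >> FRAC

# exactly representable example: 0.5 * 0.5 == 0.25
assert fix_mul(to_fix(0.5), to_fix(0.5)) == to_fix(0.25)

# general values stay within about one unit of the 2**-39 grid
x, y = 0.3141592653, 0.2718281828
err = abs(fix_mul(to_fix(x), to_fix(y)) / SCALE - x * y)
assert err < 2 ** -38
```

The error bound shown is what makes the format sufficient for the alignment scores and better than the 1.0×10^-10 fast-log accuracy limit cited for MrBayes.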
Figure 9. HMMER algorithm mapped to three BiCells.

D. BiCell for Maximum Likelihood Algorithm

The Conditional Probability Calculation for the DNA-sequence-based PLF is composed of 9 multiplications and 6 additions. In order to implement such a complicated structure, we first need to map BiCells as multipliers. A single BiCell mapped as part of the first stage of a multiplier is shown in Figure 10.

L_{N} = \left( \sum_{S \in \{A,C,G,T\}} P_{NS}(i) L_S(i) \right) \left( \sum_{S \in \{A,C,G,T\}} P_{NS}(j) L_S(j) \right),  N \in \{A,C,G,T\}    (9)

Figure 10. One BiCell mapped as part of a multiplier.

Since the bitwidth of a multiplicand is 40 bits, 5 BiCells are required for the first stage of a multiplier, and another BiCell for the second stage. This means 6 BiCells are mapped as a 2-stage multiplier for the ML algorithm. The complete conditional probability calculation unit for DNA sequences mapped onto the BiCell array is shown in Figure 11. According to Figure 11, 55 BiCells are required to implement the conditional probability calculation unit. The Conditional Probability Unit shall be replicated four times, once for each nucleotide, to implement equation 9. Figure 12(a) shows how 220 configured BiCells operate in parallel; while this fragment shows 4 units working in parallel, in real life there will be many more.

Figure 11. Conditional probability calculation unit for DNA sequences mapped to BiCell array.

Figure 12(b) shows how 31 BiCells are configured for the Tree Likelihood score calculation given by equation 10. This unit takes 4 inputs from the inner likelihood vector entries L_A, L_C, L_G and L_T at position i of the virtual root and multiplies them with the base frequencies (prior probabilities) π_A, π_C, π_G, π_T with the help of 24 configured BiCells, followed by a single BiCell. The sum is then multiplied with the product of the previous results L(0) · L(1) · ... · L(i-1), such that L = L(0) · L(1) · ... · L(i). This Tree Likelihood Calculation unit computes the likelihood score and not the log-likelihood score, for the reason described before.

L = \prod_{n=0}^{i} \left( \sum_{S \in \{A,C,G,T\}} \pi_S L_S(n) \right)    (10)

Figure 12. PLF Computation and Tree Likelihood Calculation mapped to BiCells.

According to the ML algorithm, the number of all possible rooted phylogenetic trees for n taxa is given by:

N_T = \prod_{i=3}^{n} (2i - 3)  for n \ge 2    (11)

Figure 13 shows how the BiCell array works in parallel for the ML algorithm. Each of the big dots in the tree configuration part corresponds to the 220-cell PLF unit, and each of the small dots corresponds to the 31-cell Tree Likelihood Calculation unit. It shows 8,292 reconfigured BiCells working together as 12 temporarily grouped units to compute a single rooted phylogenetic tree topology: 4 units in the same column compute different rooted phylogenetic tree topologies using the same input data, and 3 groups work together to share the input bandwidth.

It can easily be seen that the total number of cells is constrained by silicon space only.
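Equation 11 and the derived rows of Table I can be sanity-checked directly. In the sketch below, the per-tree cell counts are taken from Table I as given; only the products are recomputed:

```python
def rooted_trees(n):
    """Number of possible rooted phylogenetic trees for n taxa:
    N_T = product of (2i - 3) for i = 3 .. n   (equation 11)."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3
    return count

# "Number of rooted trees" row of Table I
assert [rooted_trees(n) for n in (2, 3, 4, 5)] == [1, 3, 15, 105]

# "BiCells per full column" = (BiCells per tree) x (rooted trees)
cells_per_tree = {2: 251, 3: 471, 4: 691, 5: 1791}   # from Table I
assert cells_per_tree[4] * rooted_trees(4) == 10365
assert cells_per_tree[5] * rooted_trees(5) == 188055
```

The double-factorial growth of the tree count is what forces the sub-tree decomposition described in the text.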
For example, we can place another 33 reconfigured BiCell arrays (22,803 BiCells) to compute the complete 15 rooted phylogenetic trees for 4-taxa DNA sequences in parallel in 3 columns. We can also place more units in columns to increase the throughput of the computation, subject to input bandwidth. Note that all the cells and arrays are reconfigurable and regroupable, which means our system matches any requirement arising from different numbers of taxa and lengths of sequence inputs.

Figure 13. 8,292 BiCells working in parallel for 6 rooted phylogenetic trees corresponding to 4 taxa DNA sequences.

According to equations 9, 10 and 11, the BiCells required to construct rooted phylogenetic trees for different numbers of taxa sequences are summarized in Table I:

TABLE I. RESOURCES REQUIRED FOR DIFFERENT NUMBER OF TAXA SEQUENCES

Number of taxa sequences             2      3       4        5
Number of nodes in rooted tree       1      2       3        4
Number of BiCells per tree           251    471     691      1,791
Number of rooted trees               1      3       15       105
Number of BiCells per full column    251    1,413   10,365   188,055
Maximum columns for 128-bit bus      32     21      16       12
Number of BiCells if fully loaded    8,032  29,673  165,840  2,256k

It can be seen that with an increasing number of taxa sequences, the number of BiCells per tree and the number of rooted trees grow rapidly, and the number of BiCells per full column grows as the product of these two values. For a system with a 128-bit wide bus, a full load is impossible on a 2-D chip for more than 2 taxa sequences, considering the limitation of silicon area. Since real Maximum Likelihood computations use hundreds or even thousands of taxa sequences, we divide the complete tree into small pieces, so-called sub-trees, to fit the hardware architecture. This does not affect the accuracy of the final result, and the computation of each sub-tree can also be parallelized.

E. BiCell for Protein Sequences Based ML Algorithm

L_{N} = \left( \sum_{S \in \{20\ proteins\}} P_{NS}(i) L_S(i) \right) \left( \sum_{S \in \{20\ proteins\}} P_{NS}(j) L_S(j) \right),  N \in \{20\ proteins\}    (12)

The Conditional Probability Unit shall be replicated 20 times, once for each protein character type, to implement equation 12, and it takes 4,400 BiCells to create the PLF computation unit for protein sequences. Diagrams corresponding to the Maximum Likelihood algorithm for protein sequences would be too big and complex to be accommodated in this paper, and are skipped.

VI. PERFORMANCE EVALUATION

In this section, we evaluate the predicted performance of accelerating the Needleman-Wunsch, Smith-Waterman, HMMER and Maximum Likelihood algorithms discussed in Section V. The results quantify the benefits of the proposed CGRA approach compared to the results reported in the literature for FPGAs [7], [10], [11] and GPUs [8], [12], [14].

Researchers have reported acceleration of sequencing algorithms using FPGAs. Benkrid [7] reports results for both the NW and SW algorithms; for the NW algorithm, 252 processing elements are used on a rather old 130nm FPGA device. Similarly, there are reported speedups of sequencing algorithms using GPUs. The attempt by Ligowski and Rudnicki [8] is a recent one, reporting a performance of 4.650 GCUPS on a single core of a 9800GX2 with 128 stream processors in 65nm technology.

For HMMER acceleration, Oliver et al. report an FPGA based accelerator achieving 2,100 MCUPS with 30 PEs in a 130nm process, for the original HMMER algorithm [10]. The attempt by Walters et al. [14], based on an 8800 Ultra GPU with 128 stream processors, reports an 18X speedup compared to a single core of a quad-core 2.0 GHz Intel Xeon processor.

Alachiotis et al. reported an FPGA-based double-precision accelerator for ML-based methods without a log approximation function on a Xilinx Virtex 5 SX240T FPGA device running at 284.152 MHz [11]. Their 7-cell structure achieves a maximum of 13.68X speedup compared with a single core of an AMD Opteron processor running at 2.6 GHz. Suchard and Rambaut reported a GPU-based accelerator for ML-based methods without a log approximation function on an nVidia GTX280 GPU with 240 SPs running at 1.3 GHz [12]. They report their performance relative to an Intel QX9770 processor running at 3.2 GHz, achieving 6X speedup for single precision and 7.5X for double precision with 3 GPUs.

We make the same assumptions as above and scale these results for a clear comparison with our design. To account for the different working frequencies of the processors across the software platforms, we scale the performance values only by the frequency ratio, which can be seen as a conservative (worse than ideal) estimate. Our baseline for speedup comparison on the sequencing algorithms is a desktop PC with an Intel Core i7 860 2.80GHz CPU running Windows 7 Professional.
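The frequency-only rescaling just described can be captured in one line. The function below is a sketch of that stated assumption (linear scaling in clock frequency); the example numbers are the clock rates named in the text:

```python
def scale_by_frequency(speedup, ref_clock_ghz, base_clock_ghz):
    """Re-express a speedup reported against a reference CPU
    clock as a speedup against our baseline CPU, assuming
    performance scales linearly with clock frequency (the
    paper's stated, conservative assumption)."""
    return speedup * ref_clock_ghz / base_clock_ghz

# e.g. a 13.68X speedup reported against a 2.6 GHz Opteron,
# re-expressed against the 2.80 GHz Core i7 860 baseline:
rescaled = scale_by_frequency(13.68, 2.6, 2.80)
assert round(rescaled, 2) == 12.7
```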
Performance corresponding to this baseline is reported as "Software" in Table II. We wrote the NW code ourselves, whereas for the SW algorithm we use the forward_pass function from ClustalW v1.83. For HMMER, the code of the P7Viterbi function is taken from the HMMER v2.3.2 package. All of these evaluation codes are in C and compiled by Visual Studio 2008 with the "O2" optimization flag.

While results for the entire sequencing algorithms would be interesting, many researchers report the speedup of only the scoring phase that has been parallelized. To allow a fair comparison, we follow the same methodology and report speedups of the scoring phase of the NW and SW algorithms; the trace-back phase is completed in software using the simulation result. The performance results for the NW, SW and HMMER algorithms are reported in Cell Updates Per Second (CUPS), where the prefix G denotes Giga CUPS (GCUPS). The experimental platform used for measuring performance in GCUPS consists of 256 BiCells.

The performance on accelerating the ML algorithm is not reported in GCUPS. In order to make a fair comparison between our design, an FPGA and a GPU on the ML algorithm, we assume that BiCOM and the GPU occupy the same silicon area in a 90nm process, and that the FPGA holds logic functionally equivalent to BiCOM. For a 90nm process, the TSMC standard logic library provides a gate density of 448k gates per mm² [16]. For the GPU, the nVidia 8800GTX with 128 SPs in a 90nm process covers a 484 mm² die [17]. We assume that our structure also fits on a 484 mm² die, with about 50% of the area consumed by the Leon3 processor, extra logic and buffers, and 50% by BiCells. A single BiCell consumes ~15k equivalent gates, so we predict that with 242 mm² of silicon, (448/15) × 242 = 7,227 BiCells can be placed. We take the number 7,000 for comparison, and we also assume that the SPs of the GPUs in 65nm (GTX280) and 90nm (8800GTX) give the same performance.

TABLE II. SPEEDUP COMPARISONS BETWEEN SOFTWARE, FPGA, GPU AND BICOM ON MULTIPLE BIOINFORMATICS ALGORITHMS

               Needleman-Wunsch       Smith-Waterman             HMMER                    ML
               GCUPS   Speedup        GCUPS   Speedup            GCUPS   Speedup          Speedup
BiCOM          153.60  1067x, 6.35z   76.80   717.7x, 15.4y, 5.5z  64.00  1561x, 78y, 14.3z  127.3x, 11.6y, 2.32z
FPGA [7,9,11]  24.17   167.8x, 1z     14.00   130.8x, 1z          4.48    109.2x, 1z         54.8x, 1z
GPU [8,12]     -       -              5.00    46.7x, 1y           -       18.9x, 1y          10.9x, 1y
Software       0.144   1x             0.107   1x                  0.041   1x                 1x

(Superscripts: x = speedup over the software baseline, y = over the GPU, z = over the FPGA.)

VII. CONCLUSION AND FUTURE WORK

We have presented a CGRA based solution for accelerating four popular bioinformatics algorithms and quantified the benefits of the proposed solution in terms of architectural efficiency (same degree of parallelism) and combined architectural and silicon efficiency. The architecture can be easily scaled to any silicon process, and the depth of the pipeline can be optimized to achieve much better system throughput.

At present we compare our results with FPGA and GPU platforms based on scaling and prediction from the results of other researchers. We plan to construct our own FPGA and GPU platforms for further performance evaluation. We are also planning migration to a 40nm silicon process.

REFERENCES

[1] Needleman, S.B., and Wunsch, C.D., "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, Vol. 48, Issue 3, pp. 443-453, 1970.
[2] Smith, T.F., and Waterman, M.S., "Identification of Common Molecular Subsequences," Journal of Molecular Biology, Vol. 147, pp. 195-197, 1981.
[3] Eddy, S.R., "Profile hidden Markov models," Bioinformatics, Vol. 14, pp. 755-763, 1998, doi: 10.1093/bioinformatics/14.9.755.
[4] Felsenstein, J., "Evolutionary trees from DNA sequences: a maximum likelihood approach," Journal of Molecular Evolution, Vol. 17, No. 6, pp. 368-376, 1981.
[5] Kuon, I., and Rose, J., "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 26, Issue 2, pp. 203-215, Feb. 2007, doi: 10.1109/TCAD.2006.884574.
[6] Xia, F., and Dou, Y., "Reducing Storage Requirements in Accelerating Algorithm of Global BioSequence Alignment on FPGA," Proc. 7th International Conference on Advanced Parallel Processing Technologies (APPT 07), 2007.
[7] Benkrid, K., Liu, Y., and Benkrid, A., "A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise Biological Sequence Alignment," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 17, No. 4, pp. 561-570, April 2009, doi: 10.1109/TVLSI.2008.2005314.
[8] Ligowski, Ł., and Rudnicki, W., "An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases," Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pp. 1-8, May 2009, doi: 10.1109/IPDPS.2009.5160931.
[9] Oliver, T.F., Schmidt, B., Jakop, Y., and Maskell, D.L., "Accelerating the Viterbi Algorithm for Profile Hidden Markov Models Using Reconfigurable Hardware," Proc. International Conference on Computational Science (1), 2006, pp. 522-529.
[10] Oliver, T., Leow, Y.Y., and Schmidt, B., "Integrating FPGA Acceleration into HMMer," Parallel Computing, Vol. 34, No. 11, pp. 681-691, Dec. 2008, doi: 10.1016/j.parco.2008.08.003.
[11] Alachiotis, N., et al., "Exploring FPGAs for accelerating the Phylogenetic Likelihood Function," Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), pp. 1-8, May 2009, doi: 10.1109/IPDPS.2009.5160929.
[12] Suchard, M.A., and Rambaut, A., "Many-Core Algorithms for Statistical Phylogenetics," Bioinformatics, Vol. 25, Issue 11, pp. 1370-1376, 2009, doi: 10.1093/bioinformatics/btp244.
[13] Huelsenbeck, J.P., and Ronquist, F., "MRBAYES: Bayesian inference of phylogenetic trees," Bioinformatics, Vol. 17, Issue 8, pp. 754-755, 2001, doi: 10.1093/bioinformatics/17.8.754.
[14] Walters, J.P., Balu, V., Kompalli, S., and Chaudhary, V., "Evaluating the Use of GPUs for Life Science Applications," Proc. IEEE International Symposium on Parallel & Distributed Processing (IPDPS 2009), 2009.
[15] Leon3 Processor and GRLIB, http://www.gaisler.com
[16] Datasheet of TSMC standard cell libraries, Cadence, http://www.cadence.com/downloads/tsmc_library_request/SC_Brochure_9.pdf
[17] Comparison of Nvidia graphics processing units, Wikipedia, http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units