
DNA Assembly with de Bruijn Graphs on FPGA

Carl Poirier, Benoit Gosselin and Paul Fortier

Abstract - This project aims to see if accelerators based on FPGAs are worthwhile for DNA assembly. It involves reprogramming an already existing algorithm, called Ray, to be run either on such an accelerator or on a CPU to be able to compare both. It has been achieved using the OpenCL language. The focus is put on modifying and optimizing the original algorithm to better suit the new parallelization tool. Upon running the new program on some datasets, it becomes clear that FPGAs are a very capable platform that can fare better than the traditional approach, both on raw performance and energy consumption.

This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada, the Fonds de recherche du Quebec - Nature et technologies and by the Microsystems Strategic Alliance of Quebec.
The authors are with the Department of Electrical and Computer Engineering, Laval University, 2325 Rue de l'Universite, Quebec, Qc, G1V 0A6, Canada. carl.poirier.2@ulaval.ca, benoit.gosselin@gel.ulaval.ca, paul.fortier@gel.ulaval.ca

I. INTRODUCTION

De novo DNA assembly has been done using different algorithms throughout time. A recent method, which will be the subject of this article, consists of using a De Bruijn graph in which we place DNA fragments. A De Bruijn graph is an oriented graph that allows representing overlaps of length k-1 between words of length k, called k-mers, in a given alphabet [3]. The number of times each k-mer has been seen is saved, which we call coverage. It is then possible to search for paths in the graph that represent some part of the original genomic sequence, which we call contigs.

The goal of this project is to complete DNA assembly using a De Bruijn graph in a reasonable amount of time on devices other than supercomputers. Up until recently, CPUs have been the main calculation power of these, but now accelerators have taken the lead. OpenCL will thus be used for parallelization. The algorithm on which this project is based is called Ray. It has been developed by Sebastien Boisvert from Laval University and it uses OpenMPI for inter-node parallelization. Our version is called OCLRay because of the new parallelization tool.

Ray is a new algorithm proposed for assembling results from different sequencing technologies, taking the form of short reads. This algorithm is split into many different parts. First, the graph is filled with the k-mers from the reads. Then, there is a purge step which consists of removing edges leading to dead-ends. Next is a statistical count of coverage. This allows determining appropriate vertices for annotating the reads and determining the seeds, which is the next step. This is followed by annihilating the spurious ones and finally, extending them [2] and writing the results.

In this article, the considered OpenCL version is 1.2, published in November 2011. OpenCL is an open, royalty-free standard allowing high-performance programming by exploiting the parallelism of the hardware architecture. It specifies the interface to expose to users, but not the implementation; each hardware vendor is free at this level. It is thus easy to target many different architectures with the same code.

On the programming side, it is a language based on C99. In it, the parallelism is described explicitly according to a hierarchy of workgroups and work-items, mapped to compute units and processing elements in hardware. Each task must at its core be dissected into many small, similar and parallel steps.

II. ACCELERATORS

A. CPU

The CPU is used as the host in a compatible OpenCL system. However, it can at the same time act as an accelerator, where each core is a compute unit. The vector instruction extensions can also be used for SIMD processing.

Compiling OpenCL code for the CPU can be done on-the-fly without any apparent delay to the user. For this reason, Altera suggests doing so using the option -march=emulator instead of compiling for its own accelerators during prototyping.

B. FPGA

Originally, FPGA programming was done using a hardware description language such as VHDL or Verilog. In OpenCL, it is the compiler and the optimizer which take on the duty of generating an architecture adapted and optimized for the instructions to execute, which is easy with the explicit parallelism. To top it off, the PCI-E connectivity to the host and the DDR3 memory controller including DMA are handled automatically by the SDK. In the end, the OpenCL SDK typically promises better performance than a hand-written architecture in a shorter development time, as well as a portable solution that can be migrated to newer FPGAs automatically [1].

What the Altera OpenCL SDK does is first generate a pipeline to obtain a throughput of up to one work-item per clock edge, independently of the number of instructions to execute on each of them. It is then possible to make this pipeline larger by processing many work-items at the same time in a SIMD fashion, or to unroll loops for even more parallelism. Finally, the whole pipeline can be duplicated to increase the number of compute units. All these techniques allow for better throughput, with the first two being preferred because of memory access patterns and resource utilization.

978-1-4244-9270-1/15/$31.00 (c) 2015 IEEE


Altera provides a tool as part of its SDK that allows quantifying the pipeline stalls caused by memory accesses. It uses a profiler which, when activated, places extra registers between each step of the pipeline for measuring delays.

The optimality of the generated architecture is judged according to two criteria: its throughput and its resource utilization, measured as a percentage of the available resources in the FPGA. An estimate of these is given after the OpenCL code compilation but before the hardware generation. This last step takes several hours to complete, as for a conventional HDL solution, so it has to be done beforehand. At run-time, the compilation result is loaded from a file, the FPGA is programmed, and the kernel launched.

III. PARALLELIZATION

A. Methodology

The methodology to migrate an existing application to an FPGA with OpenCL has a few steps to ensure correct functionality and the shortest possible development time. First, it is advised to write a simple C model which will execute on the CPU. This is to obtain reference results and to determine the minimum size variables can take to prevent overflows.

Next, this C code is transferred into an OpenCL kernel. The easiest way to begin is to use a kernel of type task. It is also advised for now to put aside steps complementary to the algorithm. At this point, the performance and resource utilization can be gauged, and the work required to obtain a design that fits the constraints can be evaluated.

Finally, the OpenCL kernel can be optimized. The migration to a kernel of dimension N can be realized if the problem is fit for it. It is also crucial to adopt a good design pattern: for example, for a continuous data flow, shift registers are to be considered. Loops must be unrolled as much as possible to obtain constant access indices to the registers and to prevent stalling the pipeline. In the end, it is advantageous to run as many tests as possible to converge towards the best solution [4].

B. Adjusting to OpenCL Constraints

With OpenCL, there is no dynamic memory management inside a kernel. Thus, all data structures must be adapted to this constraint; the allocations must be managed on the host before the kernel is run.

Furthermore, the OpenCL specification places a constraint on the maximum size of buffers, called CL_DEVICE_MAX_MEM_ALLOC_SIZE, equal to one fourth of CL_DEVICE_GLOBAL_MEM_SIZE, the global memory size [7]. To bypass this limit, the graph is allocated in four distinct memory buffers of equal size. According to the index of the vertex to visit, the memory load is done from the corresponding memory buffer.

Ray uses a hash table for storing the graph, which is based on the sparse tables from Google. This implementation consists of a table of dynamic tables which are responsible for a certain quantity of indices and need to be resized upon insertion. To obtain a structure that is contiguous in memory and free of dynamic allocation, adapted to processing in OpenCL, a hash table with open addressing is used instead. This implies that all entries are stored in the same table. To prevent its resizing, the required size is guessed at the beginning and only one memory allocation is done by the host program.

For the very same reason, we also need a pre-allocated buffer of an estimated size for containing the resulting contigs. Here, since many contigs are assembled simultaneously, each processing element is responsible for making a reservation to store part of its contig in the next block. It identifies its ownership by its global identifier.

Finally, OpenCL does not have access to the file system. Because of this, neither the graph construction, the very first step in the algorithm, nor the very last step, consisting of writing contigs to a file, is executed on the accelerator.

C. Problem Partitioning

Any parallelization requires that the problem be partitioned. For a kernel of dimension N, it is about determining along which angle the problem is sliced, and that can be different for each of the steps of the algorithm.

In OCLRay, for the part about purging the graph, every work-item simply consists of a graph vertex. The execution range then goes from 0 to TableSize - 1.

For the part about calculating the coverage distribution, the separation is also done according to the graph vertices, but particular attention is given to the size of the workgroups. It is set to exactly the maximum coverage + 1, because each work-item is responsible for initializing the value corresponding to its local identifier, as well as for adding this same value to the global total at the end of the calculation.

For the part about annotating reads on the graph, the parallelization is done according to the reads, where a main loop explores one next nucleotide per iteration. Many work-items are in flight in that main loop at the same time. It also generates seeds as they come by.

Then, the work-items of the next kernel consist in these seeds. It verifies and removes any of them that is spurious.

Finally, the remaining seeds are the work-items of the kernel that will stretch them. Here, once again, a main loop processes one nucleotide per iteration. The algorithm flow is presented in Figure 1, where small squares represent the work-items running in parallel. All five steps have thus been parallelized.

Fig. 1. Algorithm flow and work-item separation (steps: Purge, Count, Annotate, Annihilate, Extend).

Since the automatically-generated pipeline in the FPGA is at the instruction level and each step has thousands of instructions, representing the parallelization in hardware is not practical.
IV. TWEAKING THE ALGORITHM

Besides porting the algorithm to OpenCL, many changes have been made to ensure optimal performance on the accelerators. Some other changes are not squarely aimed at performance, but at memory usage. Here we describe the most important changes.

A. Decreasing Memory Usage

In some other short read assemblers such as Ray and Velvet, the annotation of a read mentions the position of the vertex in it. Here, we proceed in another fashion. Since the nucleotides are stored consecutively, we can simply modify the start of the read so that it points to the first unique vertex. The storage used for the offset position is then saved. In any case, the start of the read is not useful in the next steps.

Also, Ray uses a Bloom filter to filter out k-mers appearing only once; these are definitely errors created during the sequencing. OCLRay cascades two filters to eliminate the ones appearing only twice as well.

B. Ensuring Adequate Performance

Ray takes as a command-line argument the desired length of k-mers. OCLRay keeps the same interface, but since this value does not change during the execution, it can be used to compile and load the OpenCL kernel. Thus, one kernel for each allowed k-mer length is generated beforehand. Having this value as a constant allows the memory required for storing a k-mer to be sized appropriately, without overhead for larger k-mers. At the same time, it avoids some mathematical operations on pointers for memory accesses. Second, loops having a number of iterations dependent on the k-mer length are avoided.

Another optimization is rounding the graph size up to the next power of 2. This removes the modulo operation required for calculating the index of a vertex in the hash table following the hash function; it can be replaced by an AND operation, as illustrated in the following equation:

idx_vertex = hash mod size_table = hash AND (size_table - 1)   (1)

The division operation used for determining the appropriate memory space, as explained in III-B, can also be avoided by replacing it with a binary logarithm (a bit scan) and a binary shift, as illustrated in the following equation:

idx_buffer = hash / size_table = hash >> BSR(size_table)   (2)

On an x86 Haswell CPU from Intel, a division operation takes 95 cycles whereas a binary AND takes one cycle. Similarly, a bit scan reverse (BSR) takes three cycles, a subtraction takes one cycle and a shift takes one as well [6]. As for an FPGA, as mentioned in section II-B, division and remainder are operations to avoid.

As for resolving the collisions in the hash table, we use the same method as the Python dictionary, which is also stored as a hash table [8]. It consists of modifying the hash with a perturbation in such a way that the indices in the hash table are generated pseudo-randomly. The pseudo-code to do so is presented in Figure 2. The calculation of a new position is then very light to execute compared to what Ray suggests. This nets a slight performance gain as well as some resource savings on the FPGA.

    slot = hash;      // Starting slot
    perturb = hash;   // Initial perturbation
    while (slot.isFull() && slot.item != itemToFind) {
        slot = (5 * slot) + 1 + perturb;
        perturb >>= 5;
    }

Fig. 2. Collision resolution scheme used for the hash table.

Speaking of memory accesses, doing them in a pseudo-random order has an adverse effect on performance; DDR3 memory performs better with sequential accesses. The end solution consists of stretching the table in a second dimension to give it a width of many elements. During a search in this structure, all the elements in the same row are verified, which results in many consecutive accesses [9]. The width used in OCLRay is four.

In order to obtain an efficient pipeline in the FPGA, inner loops must be avoided. To do so for the search of a vertex in the graph, the collision resolution loop is completely unrolled. It thus needs a finite number of iterations, which becomes possible by imposing a maximum number of collisions for a same vertex. It has been calculated that in an open-addressed hash table utilizing a pseudo-random collision resolution, the expected probe length to find an element is dictated by the following formula [5]:

k = -(1/α) ln(1 - α)   (3)

where α is the occupancy rate of the table, between 0 and 1. Thus, for a maximum rate of 75%, we obtain an average number of accesses of 1.848. During the creation of the graph, a maximum of 7 allowed collisions was chosen, so two full rows have to be verified.

V. RESULTS

The first tests to be run are raw performance tests, pitting the Intel Core i7-4770 against the Altera Stratix V, a high-end FPGA with 50 Mb of on-chip memory. For the results in Figure 3, OCLRay is run on a dataset consisting of Salmonella enterica, a run with the identifier SRR749060 in DNA databases. On the x axis, the five steps parallelized with OpenCL are presented. The y axis is the time in seconds it takes for the kernel to be run. It is quite clear that the FPGA is very competitive, performance-wise, with regards to the CPU. In Table I, the FPGA kernel run times are normalized with respect to the CPU. It is interesting to see that the relatively simple kernels that cannot be vectorized, namely the count and annihilate kernels, do not perform very well on the FPGA. On the other hand, the purge kernel is completely vectorizable and while the annotate and extend

Fig. 3. FPGA and CPU kernel run times according to the algorithm step (y axis: kernel run time in seconds; series: i7-4770 and PCIe-385N).

Fig. 4. FPGA and CPU performance per watt according to the algorithm step (y axis: energy consumed in Wh; series: i7-4770 and PCIe-385N).

kernels are not, they are very complex, meaning there are lots of instructions to execute for each work-item. This is because both include one main loop that has many iterations, so the FPGA pipeline throughput really shines here. For the whole algorithm, the FPGA is 6.89 times as fast as the CPU.

Power consumption has been estimated at 28 W using a Kill-A-Watt power meter for the whole FPGA board under load, by having the meter plugged into the wall outlet and subtracting the idle power consumption from the load measurement. In the same manner, the whole computer using the CPU as the accelerator consumes 111 W, 78 W more than in an idle state. These numbers, representing the power draw induced by the workload, are used for calculating the results presented in Figure 4. The values for the FPGA, normalized according to the CPU, are presented in Table II. We can see that the FPGA takes 13.15 times less energy than the CPU. It is however important to note here that buffer transfer times between the host memory and global device memory have not been included in the calculations. This is because they have not been optimized yet. The results might thus change slightly later on.

TABLE I
FPGA KERNEL RUN TIMES, NORMALIZED

Purge     Count     Annotate   Annihilate   Extend    Total
0.02696   4.25641   0.07568    0.33919      0.08168   0.14524

TABLE II
ENERGY SAVINGS FACTOR BY USING THE PCIE-385N INSTEAD OF THE CORE I7-4770

Purge     Count     Annotate   Annihilate   Extend    Total
37.0940   0.2349    13.2143    2.9482       12.2425   13.1468

VI. CONCLUSIONS

Overall, it is clear that FPGAs should be used to speed up DNA assembly, but also to decrease power usage while doing so. This particular algorithm shows that FPGAs are potent accelerators that will work well for a range of applications, as shown by the very different algorithm steps here. Future work should focus on systems using uniform memory access such as the SoCs from Altera, for which memory transfers would not be needed, and for which the hard CPU cores could take on the few serial tasks required. This would perform better than using atomics in the OpenCL kernels.

ACKNOWLEDGMENT

Thanks to Sebastien Boisvert for having open-sourced Ray and for helping clarify some parts of the algorithm. Thanks to CMC Microsystems for providing design and prototyping tools.

REFERENCES

[1] Altera. Implementing FPGA design with the OpenCL standard. http://www.altera.com/literature/wp/wp-01173-opencl.pdf, 2014.
[2] Sebastien Boisvert, Francois Laviolette, and Jacques Corbeil. Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11), 2010.
[3] Nicolaas Govert de Bruijn. A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen, 49:758-764, 1946.
[4] Dmitry Denisenko. Lucas-Kanade optical flow from C to OpenCL on CV SoC. In CMC Microsystems Altera Training on OpenCL, 2014.
[5] Gaston H. Gonnet. Expected length of the longest probe sequence in hash code searching. Department of Computer Science, University of Waterloo, December 1978.
[6] Torbjorn Granlund. Instruction latencies and throughput for AMD and Intel x86 processors. https://gmplib.org/~tege/x86-timing.pdf, July 2014.
[7] Khronos Group. OpenCL 1.2 reference pages. http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/, November 2011.
[8] Andy Oram and Greg Wilson. Beautiful Code. O'Reilly Media, 2007.
[9] Xilinx. How to get more than two orders of magnitude better power/performance from key-value stores using FPGA. In IEEE Communications Society Tutorials, 2014.
