
DNA Assembly with de Bruijn Graphs on FPGA

Carl Poirier, Benoit Gosselin and Paul Fortier

Abstract: This project aims to determine whether FPGA-based accelerators are worthwhile for DNA assembly. It involves reprogramming an existing algorithm, called Ray, so that it can run either on such an accelerator or on a CPU, allowing both to be compared. This has been achieved using the OpenCL language. The focus is put on modifying and optimizing the original algorithm to better suit the new parallelization tool. Upon running the new program on some datasets, it becomes clear that FPGAs are a very capable platform that can fare better than the traditional approach, both in raw performance and in energy consumption.

I. INTRODUCTION
De novo DNA assembly has been performed with different algorithms over time. A recent method, which is the subject of this article, consists of using a de Bruijn graph in which DNA fragments are placed. A de Bruijn graph is a directed graph that represents overlaps of length k-1 between words of length k, called k-mers, in a given alphabet [3]. The number of times each k-mer has been seen is saved, which we call its coverage. It is then possible to search for paths in the graph that represent some part of the original genomic sequence, which we call contigs.
The goal of this project is to complete DNA assembly using a de Bruijn graph in a reasonable amount of time on devices other than supercomputers. Until recently, CPUs have provided most of the computing power of such machines, but accelerators have now taken the lead. OpenCL will thus be used for parallelization. The algorithm on which this project is based is called Ray. It has been developed by Sebastien Boisvert from Laval University and it uses OpenMPI for inter-node parallelization. Our version is called OCLRay because of the new parallelization tool.
Ray proposes a new algorithm for assembling the results of different sequencing technologies, which take the form of short reads. This algorithm is split into several parts. First, the graph is filled with the k-mers from the reads. Then, a purge step removes edges leading to dead ends. Next is a statistical count of coverage, which allows determining appropriate vertices for annotating the reads and for determining the seeds, the next step. This is followed by annihilating the spurious seeds and, finally, extending them [2] and writing the results.
This work was supported, in part, by the Natural Sciences and Engineering Research Council of Canada, the Fonds de recherche du Quebec - Nature et technologies, and by the Microsystems Strategic Alliance of Quebec. The authors are with the Department of Electrical and Computer Engineering, Laval University, 2325 Rue de l'Universite, Quebec, QC, G1V 0A6, Canada. carl.poirier.2@ulaval.ca, benoit.gosselin@gel.ulaval.ca, paul.fortier@gel.ulaval.ca

In this article, the considered OpenCL version is 1.2, published in November 2011. OpenCL is an open, royalty-free standard that enables high-performance programming by exploiting the parallelism of the hardware architecture. It specifies the interface exposed to users, but not the implementation; each hardware vendor is free at this level. It is thus easy to target many different architectures with the same code.
On the programming side, OpenCL is a language based on C99. In it, the parallelism is described explicitly according to a hierarchy of workgroups and work-items, mapped to compute units and processing elements in hardware. Each task must at its core be dissected into many small, similar and parallel steps.
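As a rough illustration (not code from OCLRay), the following minimal OpenCL C kernel shows how those identifiers are obtained; the kernel name and buffer are placeholders.

// Minimal OpenCL C kernel illustrating the work-item hierarchy.
__kernel void scale(__global float* data, const float factor)
{
    // get_global_id() identifies this work-item among all of them;
    // get_local_id() and get_group_id() would give its position inside
    // its workgroup and the workgroup's own index, mapped respectively
    // to processing elements and compute units in hardware.
    size_t gid = get_global_id(0);
    data[gid] = data[gid] * factor;
}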
II. ACCELERATORS
A. CPU
The CPU is used as the host in a compatible OpenCL system. However, it can at the same time act as an accelerator,
where each core is a compute unit. The vector instruction
extensions can also be used for SIMD processing.
Compiling OpenCL code for the CPU can be done on the fly, without any apparent delay to the user. For this reason, during prototyping, Altera suggests doing so with the option -march=emulator instead of compiling for its own accelerators.
B. FPGA
Originally, FPGA programming was done using a hardware description language such as VHDL or Verilog. In
OpenCL, it is the compiler and the optimizer which take on
the duty of generating an architecture adapted and optimized
for the instructions to execute, which is easy with the explicit
parallelism. To top it off, the PCI-E connectivity to the
host and the DDR3 memory controller including DMA are
handled automatically by the SDK. In the end, the OpenCL SDK typically promises better performance than a hand-written architecture in a shorter development time, as well as a portable solution that can be migrated to newer FPGAs automatically [1].
What the Altera OpenCL SDK does is first generate a pipeline to obtain a throughput of up to one work-item per clock cycle, independently of the number of instructions to execute for each of them. It is then possible to widen this pipeline by processing many work-items at the same time in a SIMD fashion, or to unroll loops for even more parallelism. Finally, the whole pipeline can be duplicated to increase the number of compute units. All these techniques allow for better throughput, with the first two being preferred because of memory access patterns and resource utilization.
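As a rough illustration of how these three knobs are exposed in the Altera OpenCL SDK, the following kernel sketch combines the documented attributes and the unroll pragma; the kernel body itself is a placeholder and not taken from OCLRay.

// Illustrative kernel showing the throughput knobs: a wider pipeline
// (SIMD), duplicated pipelines (compute units) and loop unrolling.
__attribute__((num_simd_work_items(4)))        // process 4 work-items per cycle
__attribute__((num_compute_units(2)))          // duplicate the whole pipeline
__attribute__((reqd_work_group_size(64, 1, 1)))
__kernel void example(__global int* restrict data)
{
    size_t gid = get_global_id(0);
    int acc = 0;

    #pragma unroll                             // fully unroll for more parallelism
    for (int i = 0; i < 8; i++)
        acc += (data[gid] >> i) & 1;           // count the low 8 bits, as a stand-in

    data[gid] = acc;
}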


Altera provides a tool as part of its SDK that allows quantifying the pipeline stalls caused by memory accesses. It uses a profiler which, when activated, places extra registers between each step of the pipeline for measuring delays.
The optimality of the generated architecture is judged according to two criteria: its throughput and its resource utilization, measured as a percentage of the resources available in the FPGA. An estimate of these is given after the OpenCL code compilation but before the hardware generation. This last step takes several hours to complete, as for a conventional HDL solution, so it has to be done beforehand. At run time, the compilation result is loaded from a file, the FPGA is programmed, and the kernel is launched.
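The run-time flow just described can be condensed into the host-side sketch below, which relies only on the standard OpenCL 1.2 API; the file name, kernel name and work size are illustrative, and error handling is omitted.

/* Host-side sketch: load a precompiled FPGA image, program the device
 * and launch a kernel. The kernel name "purge" is assumed here. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Read the compiled kernel image produced offline by the SDK. */
    FILE* f = fopen("oclray.aocx", "rb");
    fseek(f, 0, SEEK_END);
    size_t size = ftell(f);
    rewind(f);
    unsigned char* binary = malloc(size);
    fread(binary, 1, size, f);
    fclose(f);

    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    /* Programming the FPGA happens when the binary is loaded and built. */
    const unsigned char* bins[] = { binary };
    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device, &size,
                                                bins, NULL, &err);
    clBuildProgram(prog, 1, &device, "", NULL, NULL);

    cl_kernel kernel = clCreateKernel(prog, "purge", &err);
    /* clSetKernelArg(...) calls for the graph buffers would go here. */

    size_t global_size = 1 << 20;   /* e.g. one work-item per graph vertex */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, NULL);
    clFinish(queue);
    return 0;
}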
III. PARALLELIZATION
A. Methodology
The methodology for migrating an existing application to an FPGA with OpenCL has a few steps to ensure correct functionality and the shortest possible development time. First, it is advised to write a naive C model which will execute on the CPU. This serves to obtain reference results and to determine the minimum size variables can take to prevent overflows.
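One such check, as an illustration, is making sure a k-mer fits in a fixed-width integer. Assuming the usual 2-bit encoding of nucleotides (an assumption of this sketch, not a detail stated in this paper), any k up to 32 fits in a 64-bit word:

/* Sketch of a C-model check: with 2 bits per nucleotide, a k-mer of
 * length k needs 2*k bits, so k <= 32 fits in a uint64_t and no larger
 * type is required. */
#include <stdint.h>
#include <assert.h>

static uint64_t pack_kmer(const char* seq, int k)
{
    assert(k <= 32);                 /* 2*k bits must fit in 64 bits */
    uint64_t kmer = 0;
    for (int i = 0; i < k; i++) {
        uint64_t code;
        switch (seq[i]) {
            case 'A': code = 0; break;
            case 'C': code = 1; break;
            case 'G': code = 2; break;
            default:  code = 3; break;   /* 'T' (and anything else) */
        }
        kmer = (kmer << 2) | code;
    }
    return kmer;
}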
Next, this C code is transferred into an OpenCL kernel. The easiest way to begin is to use a kernel of the task type. It is also advised, for now, to put aside the complementary steps of the algorithm. At this point, the performance and resource utilization can be gauged, and the work required to obtain a design that fits the constraints can be evaluated.
Finally, the OpenCL kernel can be optimized. Migration to a kernel of dimension N can be realized if the problem is suited to it. It is also crucial to adopt a good design pattern. For example, for a continuous data flow, shift registers are to be considered. Loops must be unrolled as much as possible to obtain constant access indices into the registers and to prevent stalling the pipeline. In the end, it is advantageous to run as many tests as possible to converge towards the best solution [4].
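The shift-register pattern mentioned above can be sketched as follows in OpenCL C; the window size, data type and computation are illustrative only.

// Sketch of the shift-register pattern for a continuous data flow:
// the window is kept in registers, the unrolled loops give constant
// access indices, and one new element enters per outer iteration.
#define WINDOW 8

__kernel void sliding_sum(__global const int* restrict in,
                          __global int* restrict out,
                          const int n)
{
    int window[WINDOW] = {0};

    for (int i = 0; i < n; i++) {
        #pragma unroll
        for (int j = WINDOW - 1; j > 0; j--)   // shift every register by one
            window[j] = window[j - 1];
        window[0] = in[i];                     // insert the newest element

        int sum = 0;
        #pragma unroll
        for (int j = 0; j < WINDOW; j++)       // constant indices after unroll
            sum += window[j];
        out[i] = sum;
    }
}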
B. Adjusting to OpenCL Constraints
With OpenCL, there is no dynamic memory management
inside a kernel. Thus, all data structures must be adapted to
such a constraint; the allocations must be managed on the host before the kernel is run.
Furthermore, there exists in the OpenCL specification a constraint on the maximum size of buffers, called CL_DEVICE_MAX_MEM_ALLOC_SIZE, equal to one fourth of CL_DEVICE_GLOBAL_MEM_SIZE, the global memory size [7]. To bypass this limit, the graph is allocated in four distinct memory buffers of equal size. According to the index of the vertex to visit, the memory load is done from the corresponding memory buffer.
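A minimal sketch of this selection is shown below; the buffer names, the vertex type and the use of a helper function are assumptions made for illustration, not OCLRay's actual declarations.

// Sketch: the graph split across four equally-sized global buffers to
// stay under CL_DEVICE_MAX_MEM_ALLOC_SIZE. The index divided by the
// per-buffer size picks the buffer; the remainder indexes inside it
// (section IV-B later replaces these with a shift and an AND).
typedef ulong vertex_t;

vertex_t load_vertex(__global const vertex_t* buf0,
                     __global const vertex_t* buf1,
                     __global const vertex_t* buf2,
                     __global const vertex_t* buf3,
                     ulong index, ulong buffer_size)
{
    ulong which     = index / buffer_size;    // which of the four buffers
    ulong local_idx = index % buffer_size;    // position inside that buffer
    switch (which) {
        case 0:  return buf0[local_idx];
        case 1:  return buf1[local_idx];
        case 2:  return buf2[local_idx];
        default: return buf3[local_idx];
    }
}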
Ray uses a hash table for storing the graph, which is based on the sparse tables from Google. This implementation consists of a table of dynamic tables, each responsible for a certain range of indices and resized upon insertion.
To obtain a contiguous structure in memory, without dynamic allocation, that is adapted to processing in OpenCL, a hash table with open addressing is used instead. This implies that all entries are stored in the same table. To prevent resizing, the required size is estimated at the beginning and only one memory allocation is done by the host program.
For the very same reason, we also need a pre-allocated buffer of an estimated size to contain the resulting contigs. Here, since many contigs are assembled simultaneously, each processing element is responsible for reserving the next free block to store part of its contig, and it marks its ownership with its global identifier.
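A sketch of one way to implement such a reservation with OpenCL 1.2 atomics follows; the counter, ownership buffer and block granularity are illustrative assumptions.

// Sketch: each work-item reserves the next free block of the
// pre-allocated contig buffer with an atomic counter, then tags the
// block with its global identifier so the host can reassemble contigs
// afterwards. The layout is illustrative, not OCLRay's actual format.
void reserve_and_tag(__global uint* next_free_block,
                     __global uint* block_owner,
                     uint* my_block)
{
    uint block = atomic_inc(next_free_block);      // reserve one block
    block_owner[block] = (uint)get_global_id(0);   // mark ownership
    *my_block = block;                             // caller writes nucleotides here
}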
Finally, OpenCL kernels do not have access to the file system. Because of this, neither the graph construction, the very first step of the algorithm, nor the very last step, which consists of writing contigs to a file, is executed on the accelerator.
C. Problem Partitioning
Any parallelization requires that the problem be partitioned. For a kernel of dimension N, this means determining along which angle the problem is sliced, and that can differ for each step of the algorithm.
In OCLRay, for the part about purging the graph, every work-item simply corresponds to a graph vertex. The execution range then goes from 0 to TableSize - 1.
For the part about calculating the coverage distribution, the separation is also done according to the graph vertices, but particular attention is given to the size of workgroups. It is set to exactly the maximum coverage + 1, because each work-item is responsible for initializing the value corresponding to its local identifier as well as for adding this same value to the global total at the end of the calculation.
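One plausible shape for such a kernel is sketched below, assuming a global histogram indexed by coverage value and a per-workgroup copy in local memory; the actual OCLRay buffers may differ.

// Sketch of the coverage-distribution kernel: the workgroup size equals
// maxCoverage + 1, each work-item owns the histogram bin matching its
// local id, accumulates it in local memory, then adds it to the global
// total. local_hist is sized to the workgroup size by the host with
// clSetKernelArg(kernel, 2, sizeof(uint) * workgroup_size, NULL).
__kernel void count_coverage(__global const uint* restrict vertex_coverage,
                             __global uint* restrict histogram,
                             __local uint* local_hist,
                             const ulong table_size)
{
    uint lid = get_local_id(0);
    local_hist[lid] = 0;                       // initialize the bin it owns
    barrier(CLK_LOCAL_MEM_FENCE);

    // Each work-item scans a strided share of the vertices.
    for (ulong i = get_global_id(0); i < table_size; i += get_global_size(0)) {
        uint c = vertex_coverage[i];
        if (c < get_local_size(0))
            atomic_inc(&local_hist[c]);        // tally in fast local memory
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // Add this workgroup's bin to the global total at the end.
    atomic_add(&histogram[lid], local_hist[lid]);
}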
For the part about annotating reads on the graph, the parallelization is done according to the reads, where a main loop explores one nucleotide per iteration. Many work-items are in flight in that main loop at the same time. This step also generates seeds as they are encountered.
Then, the work-items of the next kernel are these seeds. This kernel verifies and removes any seed that is spurious.
Finally, the remaining seeds are the work-items of the kernel that will stretch them. Here, once again, a main loop processes one nucleotide per iteration. The algorithm flow is presented in Figure 1, where small squares represent the work-items running in parallel. All five steps have thus been parallelized.

Fig. 1. Algorithm flow and work-item separation (Purge, Count, Annotate, Annihilate, Extend).

Since the automatically-generated pipeline in the FPGA is at the instruction level and each step comprises thousands of instructions, representing the parallelization in hardware is not practical.

slot = hash;        // starting slot
perturb = hash;     // initial perturbation
while (slot.isFull() && slot.item != itemToFind) {
    slot = (5 * slot) + 1 + perturb;
    perturb >>= 5;
}

Fig. 2. Collision resolution scheme used for the hash table.

IV. TWEAKING THE ALGORITHM


Besides porting the algorithm to OpenCL, many changes have been made to ensure optimal performance on the accelerators. Some other changes are aimed not at performance but at memory usage. Here we describe the most important changes.
A. Decreasing Memory Usage
In other short-read assemblers such as Ray and Velvet, the annotation of a read stores the position of the vertex within it. Here, we proceed in another fashion. Since the nucleotides are stored consecutively, we can simply move the start of the read so that it points to the first unique vertex. The storage used for the offset position is thus saved; the original start of the read is not needed in the subsequent steps anyway.
Also, Ray uses a Bloom filter to discard k-mers appearing only once, since these are errors created during sequencing. OCLRay cascades two filters to eliminate the ones appearing only twice as well.
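A compact host-side sketch of the cascade is given below; the single hash function, the mixing constant and the filter size are placeholders rather than the parameters used by OCLRay.

/* Sketch of the cascaded filters: a k-mer goes into the graph only on
 * its third observation. The caller provides two zeroed bit arrays of
 * FILTER_BITS/8 bytes; a real Bloom filter would use several hashes. */
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS (1u << 26)

static bool test_and_set(uint8_t* filter, uint64_t kmer)
{
    uint64_t h = (kmer * 0x9E3779B97F4A7C15ull) % FILTER_BITS;
    bool present = filter[h >> 3] & (1u << (h & 7));
    filter[h >> 3] |= (uint8_t)(1u << (h & 7));
    return present;
}

/* Returns true once the k-mer has been seen at least twice before,
 * i.e. it should now be inserted into the graph. */
static bool seen_three_times(uint8_t* filter1, uint8_t* filter2, uint64_t kmer)
{
    if (!test_and_set(filter1, kmer)) return false;  /* first sighting  */
    if (!test_and_set(filter2, kmer)) return false;  /* second sighting */
    return true;                                     /* third or later  */
}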
B. Ensuring Adequate Performance
Ray takes the desired k-mer length as a command-line argument. OCLRay keeps the same interface, but since this value does not change during execution, it can be used to select which OpenCL kernel to compile and load. Thus, one kernel for each allowed k-mer length is generated beforehand. Having this value as a constant allows the memory required for storing a k-mer to be sized appropriately, without overhead for larger k-mers. It also avoids some pointer arithmetic for memory accesses, and loops whose iteration count depends on the k-mer length are avoided as well.
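The approach can be sketched as follows on the kernel side, with the k-mer length injected as a preprocessor definition at build time; the macro name, structure and comparison function are illustrative.

// Kernel-side sketch: KMER_LENGTH is injected at build time, e.g. with
// clBuildProgram(program, 1, &device, "-DKMER_LENGTH=31", NULL, NULL).
// Arrays and loop bounds then become compile-time constants.
#ifndef KMER_LENGTH
#define KMER_LENGTH 31          // fallback so the file still compiles alone
#endif

#define KMER_WORDS ((2 * KMER_LENGTH + 63) / 64)   // 2 bits per nucleotide

typedef struct {
    ulong bits[KMER_WORDS];     // sized exactly for the chosen k
} kmer_t;

int kmer_equal(const kmer_t* a, const kmer_t* b)
{
    int equal = 1;
    #pragma unroll               // bound known at compile time: fully unrolled
    for (int w = 0; w < KMER_WORDS; w++)
        equal &= (a->bits[w] == b->bits[w]);
    return equal;
}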
Another optimization is rounding the graph size up to the next power of 2. This avoids the modulo operation otherwise required to calculate the index of a vertex in the hash table from the hash value; it can be replaced by an AND operation, as illustrated in the following equation:

idx_vertex = hash % size_table = hash & (size_table - 1)    (1)
The division operation used to determine the appropriate memory space, as explained in III-B, can also be avoided by replacing it with a binary logarithm (obtained with a bit scan) and a binary shift, as illustrated in the following equation:

idx_buffer = hash / size_table = hash >> log2(size_table) = hash >> BSR(size_table)    (2)

On an x86 Haswell CPU from Intel, a division operation takes 95 cycles whereas a bitwise AND takes one cycle. Similarly, a bit scan reverse (BSR) takes three cycles, a subtraction takes one cycle and a shift takes one as well [6]. As for the FPGA, as mentioned in section II-B, division and remainder are operations to avoid.
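The two replacements can be illustrated with the small host-side C sketch below, where the GCC builtin __builtin_clzll stands in for the BSR instruction; the function names are illustrative.

/* Power-of-two tricks: for size_table = 2^n, the modulo becomes an AND
 * and the division becomes a shift, with the binary logarithm n obtained
 * from a bit scan. */
#include <stdint.h>
#include <assert.h>

static unsigned log2_pow2(uint64_t size_table)
{
    assert(size_table != 0 && (size_table & (size_table - 1)) == 0);
    return 63u - (unsigned)__builtin_clzll(size_table);   /* BSR equivalent */
}

static uint64_t vertex_index(uint64_t hash, uint64_t size_table)
{
    return hash & (size_table - 1);          /* replaces hash % size_table */
}

static uint64_t buffer_index(uint64_t hash, uint64_t size_table)
{
    return hash >> log2_pow2(size_table);    /* replaces hash / size_table */
}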
As for resolving collisions in the hash table, we use the same method as the Python dictionary, which is itself stored as a hash table [8]. It consists of modifying the hash with a perturbation such that indices in the hash table are generated pseudo-randomly. The pseudo-code to do so is presented in Figure 2. The calculation of a new position is then very light to execute compared to what Ray suggests. This nets a slight performance gain as well as some resource savings on the FPGA. Speaking of memory accesses, doing them in a pseudo-random order has an adverse effect on performance; DDR3 memory performs better with sequential accesses. The adopted solution consists of stretching the table in a second dimension to give it a width of several elements. During a search in this structure, all the elements of the same row are verified, which results in many consecutive accesses [9]. The width used in OCLRay is four.
In order to obtain an efficient pipeline in the FPGA, inner loops must be avoided. To do so for the search of a vertex in the graph, the collision resolution loop is completely unrolled. It therefore needs a fixed number of iterations, which becomes possible by imposing a maximum number of collisions for a given vertex. It has been calculated that in an open-addressed hash table using pseudo-random collision resolution, the expected probe length to find an element is dictated by the following formula [5]:

k = (1/α) ln(1 / (1 - α))    (3)

where α is the occupancy rate of the table, between 0 and 1. Thus, for a maximum rate of 75%, we obtain an average number of accesses of 1.848. During the creation of the graph, a maximum of 7 allowed collisions was chosen, so two full rows have to be verified.
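A hedged OpenCL C sketch of the fully unrolled lookup is shown below, assuming rows of four slots, a bound of eight probes (two rows) and an illustrative entry layout.

// Sketch of the fully unrolled vertex lookup: at most two rows of four
// slots are probed (8 probes total), so the compiler can unroll the
// loop completely. table_size is a power of two; the entry layout is
// illustrative, not OCLRay's actual structure.
#define ROW_WIDTH   4
#define MAX_PROBES  8

typedef struct { ulong key; ulong value; } entry_t;

long find_vertex(__global const entry_t* restrict table,
                 ulong table_size, ulong kmer, ulong hash)
{
    ulong slot    = hash & (table_size - 1);
    ulong perturb = hash;
    long  found   = -1;

    #pragma unroll
    for (int p = 0; p < MAX_PROBES; p++) {
        ulong row = slot & ~(ulong)(ROW_WIDTH - 1);    // start of the row
        ulong idx = row + (ulong)(p % ROW_WIDTH);      // consecutive accesses
        if (found < 0 && table[idx].key == kmer)
            found = (long)idx;
        if ((p % ROW_WIDTH) == ROW_WIDTH - 1) {        // row exhausted: rehash
            slot    = ((5 * slot) + 1 + perturb) & (table_size - 1);
            perturb >>= 5;
        }
    }
    return found;   // index of the vertex, or -1 if absent
}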
V. RESULTS
The first tests to be run are raw performance tests, pitting the Intel Core i7-4770 against the Altera Stratix V, a high-end FPGA with 50 Mb of on-chip memory. For the results in Figure 3, OCLRay is run on a Salmonella enterica dataset, a sequencing run with the identifier SRR749060 in DNA databases. On the x axis, the five steps parallelized with OpenCL are presented. The y axis is the time in seconds it takes for the kernel to run. It is quite clear that the FPGA is very competitive, performance-wise, with regard to the CPU. In Table I, the FPGA kernel run times are normalized with respect to the CPU. It is interesting to see that the relatively simple kernels that cannot be vectorized, namely the count and annihilate kernels, do not perform very well on the FPGA. On the other hand, the purge kernel is completely vectorizable, and while the annotate and extend kernels are not, they are very complex, meaning there are lots of instructions to execute for each work-item. Both include one main loop that has many iterations, so the FPGA pipeline throughput really shines here. For the whole algorithm, the FPGA is 6.89 times as fast as the CPU.

Fig. 3. FPGA and CPU kernel run times (in s) according to the algorithm step, for the i7-4770 and the PCIe-385N.

TABLE I
FPGA KERNEL RUN TIMES, NORMALIZED.

Purge      Count      Annotate   Annihilate   Extend     Total
0.02696    4.25641    0.07568    0.33919      0.08168    0.14524

Power consumption has been estimated at 28 W using a Kill-A-Watt power meter for the whole FPGA board under load, by plugging the meter into the wall outlet and subtracting the idle power consumption from the load measurement. In the same manner, the whole computer using the CPU as the accelerator consumes 111 W, which is 78 W more than in its idle state. These numbers, which represent the power draw induced by the workload, are used to calculate the results presented in Figure 4. The values for the FPGA, normalized according to the CPU, are presented in Table II. We can see that the FPGA takes 13.15 times less energy than the CPU. It is however important to note that buffer transfer times between the host memory and the global device memory have not been included in the calculations, because they have not been optimized yet. The results might thus change slightly later on.

Fig. 4. FPGA and CPU performance per watt according to the algorithm step (energy consumed, in Wh, for the i7-4770 and the PCIe-385N).

TABLE II
ENERGY SAVINGS FACTOR BY USING THE PCIE-385N INSTEAD OF THE CORE I7-4770.

Purge      Count      Annotate   Annihilate   Extend     Total
37.0940    0.2349     13.2143    2.9482       12.2425    13.1468

VI. CONCLUSIONS
Overall, it is clear that FPGAs should be used to speed up DNA assembly, but also to decrease power usage while doing so. This particular algorithm shows that FPGAs are potent accelerators that can work well for a range of applications, as illustrated by the very different algorithm steps studied here. Future work should focus on systems with uniform memory access, such as Altera SoCs, for which memory transfers would not be needed and for which the hard CPU cores could take on the few serial tasks required. This would perform better than using atomics in the OpenCL kernels.
ACKNOWLEDGMENT


Thanks to Sebastien Boisvert for having open-sourced Ray and helping clarify some parts of the algorithm. Thanks to CMC Microsystems for providing design and prototyping tools.
REFERENCES
[1] Altera. Implementing FPGA design with the OpenCL standard. http://www.altera.com/literature/wp/wp-01173-opencl.pdf, 2014.
[2] Sebastien Boisvert, Francois Laviolette, and Jacques Corbeil. Ray: Simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Journal of Computational Biology, 17(11), 2010.
[3] Nicolaas Govert de Bruijn. A combinatorial problem. Koninklijke Nederlandse Akademie v. Wetenschappen, 49:758-764, 1946.
[4] Dmitry Denisenko. Lucas-Kanade optical flow from C to OpenCL on CV SoC. In CMC Microsystems Altera Training on OpenCL, 2014.
[5] Gaston H. Gonnet. Expected length of the longest probe sequence in hash code searching. Department of Computer Science, University of Waterloo, December 1978.
[6] Torbjorn Granlund. Instruction latencies and throughput for AMD and Intel x86 processors. https://gmplib.org/~tege/x86-timing.pdf, July 2014.
[7] Khronos Group. OpenCL 1.2 reference pages. http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/, November 2011.
[8] Andy Oram and Greg Wilson. Beautiful Code. O'Reilly Media, 2007.
[9] Xilinx. How to get more than two orders of magnitude better power/performance from key-value stores using FPGA. In IEEE Communications Society Tutorials, 2014.

