
Fine-grained Parallel Application Specific Computing for RNA Secondary Structure Prediction Using SCFGs on FPGA

Yong Dou yongdou@nudt.edu.cn Fei Xia xcyphoenix@nudt.edu.cn Jingfei Jiang jiangjinfei@nudt.edu.cn

National Laboratory for Parallel & Distributed Processing National University of Defence Technology 410073 ChangSha, China

ABSTRACT
In the field of RNA secondary structure prediction, the CYK (Cocke-Younger-Kasami) algorithm is one of the most popular methods using the SCFG (stochastic context-free grammar) model. However, general-purpose parallel computers, including SMP multiprocessors and cluster systems, exhibit low parallel efficiency and are too expensive for many research institutes to use easily. FPGA chips provide a new approach to accelerating the CYK algorithm through fine-grained custom design. The CYK algorithm shows complicated data dependence, in which the dependence distance is variable and the dependence direction spans two dimensions. We propose a systolic array structure with one master PE and multiple slave PEs for fine-grained hardware implementation on FPGA. We partition tasks by columns and assign them to PEs for load balance, and we exploit data reuse schemes to reduce the need to load the matrix from external memory. To our knowledge, our implementation with 16 PEs is the only FPGA accelerator implementing the complete CYK/inside algorithm. The experimental results show a speedup factor of more than 14 over the Infernal 0.55 software running on a PC platform with a Pentium 4 2.66GHz CPU. The computational power of our platform with the FPGA accelerator is comparable to a PC cluster consisting of 20 Intel-Xeon CPUs for RNA secondary structure prediction using SCFGs, but the hardware cost and power consumption are only about 15% and 10% of the latter respectively.

General Terms
Algorithms, Design, Performance

Keywords
RNA, secondary structure prediction, SCFGs, FPGA, reconfigurable algorithm accelerator

1. INTRODUCTION
Ribonucleic acid (RNA) is an important molecule that performs a wide range of functions in biological systems, such as synthesizing proteins, catalyzing reactions, splicing introns and regulating cellular activities. The function of an RNA molecule can generally be derived from its secondary structure. Currently, the only completely accurate methods of determining the folded structure of an RNA molecule are X-ray crystallography and nuclear magnetic resonance (NMR); however, these methods are time-consuming and very expensive. Therefore, computational methods have been widely used in the field of RNA secondary structure prediction, such as the minimum free energy (MFE) method, homologous sequence comparison, and the stochastic context-free grammar (SCFG) methods. Among these, structure prediction based on the SCFG model is one of the most important methods. A standard dynamic programming alignment algorithm for SCFGs is the Cocke-Younger-Kasami (CYK) algorithm, which finds an optimal parse tree for a model and a sequence [1]. The CYK algorithm is a three-dimensional dynamic programming (DP) algorithm [1, 2, 3], and its matrix can be filled in two opposite directions, either inside or outside. The computational complexity for a sequence of length L and a model with K states is $O(KL^2 + BL^3)$, where B is the number of bifurcation states in the model, and the corresponding spatial complexity is $O(L^3)$. Although computation using SCFGs is efficient in this sense, using the CYK algorithm for large models and large database searches is still intolerably slow. For example, alignment of a 2904-residue LSU rRNA sequence to the LSU rRNA consensus structure on a single Alpha EV68 processor takes about 17391s using the CYK/inside algorithm without traceback [4], and about 7 hours of CPU time with traceback. Therefore, accelerating the CYK algorithm has become a challenging task in computational bioinformatics. High-performance parallel computers with shared or distributed memory, such as SMP multiprocessors or cluster systems, are widely used to accelerate the CYK algorithm without traceback.

Categories and Subject Descriptors

B.7.1 [Integrated Circuits]: Types and Design Styles - Algorithms implemented in hardware; C.3 [Computer Systems Organization]: Special-Purpose and Application-Based Systems - Microprocessor/microcomputer applications

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES'09, October 11-16, 2009, Grenoble, France. Copyright 2009 ACM 978-1-60558-626-7/09/10 ...$10.00.


Jane C. Hill et al. [5] explored a pipelined parallel approach on a multiprocessor in 1991; their implementation achieves a speedup of 3.52 using 7 processors. In [6], G. Tan et al. presented a parallel implementation of the CYK/inside algorithm on a PC cluster. The main idea is to partition each layer of the 3D DP matrix in a regular fashion and distribute the tasks to multiple processors; unfortunately, the simple coarse-grained zone blocking method results in severe load imbalance. They report a 19x speedup on a 32-CPU cluster system, with parallel efficiency greatly limited by the complicated data dependency and tight synchronization. In 2005, T. Liu et al. presented a parallel CYK algorithm operating between 3D matrix layers on two PC cluster systems [4]. To get better load balance, they adopted an unequal-sized-area task partition strategy and took communication delays into account. Their implementation achieves a speedup of 16 using 20 2.0GHz Xeon CPUs, and a speedup of 36 using 48 1.0GHz Alpha EV68 processors on a cluster of SMPs. Unfortunately, the usage, maintenance and management costs of large-scale parallel computer systems are very high. Take the cluster with 48 Alpha EV68 CPUs as an example: the hardware cost is more than 70 thousand dollars and the power consumption is more than 6kW. Thus, high-performance parallel computers are too expensive for many research institutes to use easily.

Recently, the use of FPGA coprocessors has become a promising approach to accelerating bioinformatics applications. The computational capability of FPGAs is increasing rapidly. The top-level FPGA chip from the Xilinx Virtex5 series contains 51840 slices, 10368Kbits of storage and 192 DSP modules. The largest FPGA chip from the Altera StratixIII series integrates 135200 ALMs (338000 LEs) and 17208Kbits of on-chip memory capacity; it also contains 576 18x18 multipliers and achieves 300GMACS of processing capability. Yet the hardware cost is less than three thousand dollars and the power consumption is less than 30W. The reconfigurability of FPGA chips also enables algorithms to be implemented with different computing structures on the same hardware platform. Using a combination of FPGAs and general-purpose CPUs to accelerate RNA secondary structure prediction algorithms is therefore attracting much attention. G. Tan et al. [7] introduced a fine-grained parallelization of the Zuker algorithm using the free energy minimization model. In a recent paper, Arpith Jacob et al. [8] implemented the simplest folding algorithm, the Nussinov algorithm [9], on a Virtex-II 6000 FPGA. However, those works only implemented 2D dynamic programming algorithms for RNA secondary structure prediction; a 3D DP algorithm such as CYK has not previously been accelerated on FPGA. In [10, 11] the authors presented parallel implementations of the CFG model that consider only the parsing process.

In this paper, we propose a systolic array structure with one master PE and multiple slave PEs for fine-grained hardware implementation on FPGA. To minimize the storage requirement, we partition the whole 3D scoring matrix into small ones and calculate the layers one by one. For load balance, we partition the layers by columns and assign the tasks to PEs. We aggressively exploit data reuse schemes to minimize the need for loading layers from external memory. Specifically, we add a cache to each PE to buffer the current computation results, most of which will be used in computing the next element of the column.
We also transfer local elements directly to the next adjoining PE. In our design, only the master PE loads the scoring matrix from external DRAM; the remaining slave PEs simply wait for data from the previous PE. The whole array structure is carefully pipelined in order to overlap the PEs' column computations, the master PE's load operations and the PEs' write-back operations as much as possible. We implemented a CYK/inside algorithm accelerator with 16 processing elements on a single FPGA (XC5VLX330) chip. The experimental results show a speedup factor of more than 14 over the Infernal 0.55 software, for a 959-residue RNA sequence and a CM model with 3145 states, running on a PC platform with a Pentium 4 2.66GHz CPU. The computational power of our accelerator is comparable to a PC cluster consisting of 20 Intel-Xeon CPUs for RNA secondary structure prediction using SCFGs; however, the hardware cost is only 7 thousand dollars and the power consumption is less than 300W, about 15% and 10% of the latter respectively.

2. OVERVIEW OF THE CYK/INSIDE ALGORITHM

2.1 CYK/inside Algorithm Introduction


The CYK/inside algorithm aligns an RNA sequence to a CM (Covariance Model) [12] to determine the probability that the sequence belongs to the modeled family. A covariance model is a profile stochastic context-free grammar designed to model a consensus RNA secondary structure with position-specific scores [1], and can be built from a multi-sequence alignment. The score represents the similarity between the sequence and an RNA family. The input of the CYK algorithm is a sequence x = x_1...x_i...x_j...x_L of length L and a CM G of length K with states numbered in preorder traversal, where i and j index nucleotide positions in the RNA sequence. The CYK/inside algorithm iteratively calculates a three-dimensional DP matrix [13] with triangular cross-section for 1 <= i <= j+1, 0 <= j <= L, 1 <= k <= K, with the following recurrence relationship:
$$M(i,j,k) = \begin{cases}
\max_{\nu \in C(k)} \left[ M(i,j,\nu) + t_k(\nu) \right] & \text{if } S(k) \in \{D, S\} \\
e_k(x_i, x_j) + \max_{\nu \in C(k)} \left[ M(i+1,j-1,\nu) + t_k(\nu) \right] & \text{if } S(k) = MP,\ d \geq 2 \\
e_k(x_i) + \max_{\nu \in C(k)} \left[ M(i+1,j,\nu) + t_k(\nu) \right] & \text{if } S(k) \in \{ML, IL\},\ d \geq 1 \\
e_k(x_j) + \max_{\nu \in C(k)} \left[ M(i,j-1,\nu) + t_k(\nu) \right] & \text{if } S(k) \in \{MR, IR\},\ d \geq 1 \\
\max_{i-1 \leq mid \leq j} \left[ M(i,mid,k_l) + M(mid+1,j,k_r) \right] & \text{if } S(k) = BIF \\
0 & \text{if } S(k) = E,\ d = 0 \\
-\infty & \text{otherwise}
\end{cases}$$

where d = j - i + 1 and M(i, j, k) is the element with index (i, j, k) in the DP matrix. S(k) is the type of state k, and C(k) is the set of states that k can transit to. e_k(x_i) is the log-odds score of the emission probability of character x_i in state k, and t_k(v) is the log-odds score of the transition probability from state k to state v [4]. A CM consists of seven basic state types, each with its own emission and transition probability distributions: the P state represents consensus base pairs; L and R represent consensus single-stranded residues; the D state represents deletions relative to consensus; and BIF, S, and E model the branching topology of the RNA secondary structure. From the recurrences above, we see that the calculation of M(i, j, k) is state-type dependent.


For instance, if S(k) = ML or IL, then k emits x_i and transits to one of its child states v in C(k). Maximizing over all possible choices of child states, M(i, j, k) is the sum of three terms: an emission term e_k(x_i), a transition term t_k(v), and the score of the optimal parse tree rooted in v that generates x_{i+1}...x_j, which has already been calculated in the DP-matrix cell M(i+1, j, v). The recurrence relations for all other state types can be explained in a similar way. Figure 1 gives a three-dimensional view of the execution of the CYK/inside algorithm. Processing starts from the elements located on the edge of the bottommost triangular layer (E states, length-0 subsequences on the diagonal, marked in red). The computation moves along diagonals in a wavefront mode until reaching the right-angled vertex of the current layer; then the upper layer starts. When the last cell M(1, L, 1) is worked out (marked in green), the CYK/inside algorithm returns. The element M(1, L, 1) contains the score of the best parse of the complete sequence with the complete model, i.e., the global alignment score.
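To make the state-type-dependent recurrence concrete, here is a minimal C sketch of a single cell update for an ML/IL-type state. The names and calling convention are ours for illustration, not Infernal's code; the hardware PEs implement the same arithmetic with adders and comparators rather than a loop.

    #include <float.h>

    /* One cell of the CYK/inside recurrence for an ML/IL-type state k:
     *   M(i,j,k) = e_k(x_i) + max over children v in C(k) of [ M(i+1,j,v) + t_k(v) ].
     * The caller supplies the already-computed child scores M(i+1,j,v), the
     * transition log-odds t_k(v), and the emission log-odds e_k(x_i). */
    static float cyk_cell_ml(const float *child_scores, /* M(i+1, j, v), one per child */
                             const float *trans,        /* t_k(v), one per child       */
                             int n_children,            /* |C(k)|                      */
                             float emit)                /* e_k(x_i)                    */
    {
        float best = -FLT_MAX;                          /* stands in for -infinity     */
        for (int c = 0; c < n_children; c++) {
            float s = child_scores[c] + trans[c];
            if (s > best)
                best = s;
        }
        return emit + best;
    }

The MR/IR case is identical with e_k(x_j) and M(i, j-1, v), and the MP case adds the pair emission e_k(x_i, x_j) to the maximum over M(i+1, j-1, v).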

Consider each layer of the 3D dynamic programming matrix, which is an upper triangular matrix as described in the recurrence relationship. If k is not a BIF state, the computation size s (the number of add and compare operations) for each element M(i, j, k) is variable and is decided by the number of child states |C(k)|. For example, if S(k) = MP, then the element M(i, j, k) is calculated with 6 add and 6 compare operations. If S(k) = BIF, the computation size C(i, j, k) for each element is closely related to the indices i and j, as shown in formula (1):

$$C(i,j,k) = \begin{cases} j - i + 2 & \text{if } S(k) = BIF \\ s,\ s \in \{2,3,4,5,6\} & \text{otherwise} \end{cases} \quad (1)$$

The computation size C(*, j, k) of the j-th column v(*, j, k) is:

$$C(\ast,j,k) = \begin{cases} \frac{(j+1)(j+2)}{2} & \text{if } S(k) = BIF \\ (j+1) \cdot s,\ s \in \{2,3,4,5,6\} & \text{otherwise} \end{cases} \quad (2)$$

and the difference in computation size between column j and column j+1 is:

$$\Delta C(j,j+1,k) = \begin{cases} j + 2 & \text{if } S(k) = BIF \\ s,\ s \in \{2,3,4,5,6\} & \text{otherwise} \end{cases} \quad (3)$$
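The three formulas translate directly into per-cell and per-column operation counts. A small sketch in C (helper names are ours, not from the paper's implementation):

    /* Operation counts following formulas (1) and (2): for a non-BIF state the
     * per-cell cost is a constant s in {2,...,6} fixed by |C(k)|; for a BIF
     * state it grows with the subsequence length d = j - i + 1. */
    static int cell_ops(int i, int j, int is_bif, int s)
    {
        return is_bif ? (j - i + 2) : s;            /* formula (1) */
    }

    static int column_ops(int j, int is_bif, int s)
    {
        return is_bif ? (j + 1) * (j + 2) / 2       /* formula (2) */
                      : (j + 1) * s;
    }

For a BIF layer, column_ops grows quadratically with the column index j, which is exactly the imbalance that motivates the cyclic column allocation discussed next.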


We find that the computation size is related to the number of child states, ranging from 2 to 6 for non-BIF states; for a BIF state k it gradually increases as the matrix location moves up from bottom to top within a column, and as the location moves right within a row. Specifically, the workload of M(1, L, k) is the heaviest: it depends on an entire row of the left substate layer and an entire column of the right substate layer. This workload imbalance suggests a cyclic column allocation scheme, in which each processing element (PE) is assigned one column of the current layer. Each PE processes its column from bottom to top, or conversely.

Observation 2: The data dependence distance and the number of layers that each layer's computation depends on are variable and closely related to the parent-child connections in the CM model. Figure 2 shows an example covariance model, taken from Sean R. Eddy's work [1]. The model has 81 states (boxes, stacked in a horizontal array). Each state is associated with one of the 24 nodes of the guide tree (text above the state array). The transitions (arcs with arrows) indicate a data dependency between two states. The calculation location moves from right to left. We find that the number of states (layers; each state corresponds to a layer in the 3D matrix) that the current state depends on is variable: it equals 0 at minimum (no other layers are needed) and 6 at maximum. Moreover, the data dependence distance (the length of the arcs) among those layers differs; it may equal 1, or it may well exceed several hundred. For example, consider the transitions (bold arrows) from bifurcation state B10 to start states S11 and S46: the data in state S11 will be used right away, while S46 will not be used for a long time. To reduce the storage requirement, we partition the whole 3D scoring matrix into smaller ones and move layers that will not be used soon out to external DRAM.

Figure 1: The 3D dynamic programming lattices in the CYK algorithm.

Accelerating the CYK algorithm on FPGA chips is a challenging task. First, the non-uniform multi-dimensional data dependences with variable dependence distance make it difficult to find a well-behaved task assignment for load balance. Second, the irregular spatial locality, with a great deal of small-granularity access operations, makes it difficult to optimize memory scheduling for efficient external access. Finally, the limited on-chip memory cannot hold the 3D DP matrix (the storage requirement for each layer is more than 32MB, but the on-chip memory capacity of the largest FPGA is less than 2MB), resulting in long-latency matrix loads from external DRAM.

2.2 Characteristics of the CYK Algorithm


We make four observations about the characteristics of the CYK/inside algorithm. These observations shape the details of our parallel implementation.

Observation 1: The computation size of each element in the DP matrix is variable and closely related to its position and state type.


Figure 2: A sample covariance model.

Observation 3: The dependence direction spans two dimensions, resulting in a great deal of small-granularity discontinuous access operations.
Figure 3: (a) The CYK/inside algorithm exhibits the character of a wavefront computation; the cells located on a diagonal can be calculated in parallel. (b) Data dependency relationship in a non-BIF layer: M(i, j, k) is computed from the matrix cells M(i, j, v), M(i+1, j, v), M(i+1, j-1, v) or M(i, j-1, v), but several layers have to be considered; the number of layers equals the number of child states of the current state k.

The recurrence above shows that the calculation of M(i, j, k) depends on the current state type. As shown in Figure 3, if k is a non-BIF state, M(i, j, k) depends on the elements located at (i, j, v) if S(k) = D or S (marked in gray), the left elements M(i, j-1, v) if S(k) = MR or IR (marked in cyan), the bottom-left elements M(i+1, j-1, v) if S(k) = MP (marked in yellow), and the bottom elements M(i+1, j, v) if S(k) = ML or IL (marked in magenta), for all v in C(k). Moreover, those elements are located in different layers. As shown in Figure 4, an irregular dependency pattern occurs in the BIF-state layers: M(i, j, k) is computed from all matrix cells M(i, mid, k_left) in its left substate and M(mid+1, j, k_right) in its right substate for mid = i-1 to j. Thus, the computation load density increases along the wavefront direction, depicted by increasingly dark shades in Figure 4(a). The calculation of each cell depends on several elements located in different layers, and this irregular data dependency results in a great deal of small-granularity discontinuous access operations.

Observation 4: We can exploit data reuse schemes to reduce the need to load the matrix from external memory.

(1) We can use a sliding window to reuse data within a column. M(i, j, k) depends on the j-th column of the right substate layer, as depicted in Figure 4(c). We can consider the column marked in cyan as a sliding window: as the computation travels upward within a column from M(i, j, k) to M(i-1, j, k), the rectangular window moves from bottom to top. Only one element is updated; the other (j - i) elements remain unchanged and have already been worked out, so the current PE can reuse them. (2) We can use a broadcasting strategy to reuse data within a row. From Figure 4(b), we observe that the element M(i, j, k) depends on the i-th row of its left substate layer. Assuming the eight elements located in the i-th row of the BIF layer (marked with increasingly dark shades), we arrange eight PEs to compute them in parallel. For each PE, the column elements needed for the computation are already available in the PE's local memory at that time. We can reuse the data by broadcasting the i-th row elements (marked in magenta), loaded from external DRAM, to the whole PE array. (3) We can use transitive registers to reuse data between adjoining columns. As shown in Figure 3(b), M(i, j, k) depends on the left or bottom-left elements if S(k) is MR, IR or MP. This means that the elements the j-th PE needs are located in the previous PE_{j-1} if we partition tasks by columns and assign them to PEs. Thus, the operands are located in on-chip memory for all but the master PE. This observation implies that only the first PE has to load elements from external DRAM; the remaining slave PEs simply wait for the elements transferred from the previous PE. As a result, by reusing row/column data in the PE array and transferring data between the processing of adjoining columns, we can greatly reduce the memory bandwidth required for loading elements from external DRAM.
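The BIF-state inner loop shows how the first two reuse schemes come together: the broadcast row of the left child layer meets the locally cached column of the right child layer, so no per-cell DRAM access is needed. A minimal C sketch under our own naming and indexing conventions (the real design operates on fixed-point scores in Block RAM):

    #include <float.h>

    /* M(i,j,k) = max over i-1 <= mid <= j of [ M(i,mid,k_left) + M(mid+1,j,k_right) ].
     * row_left[mid]  holds M(i, mid, k_left), broadcast once from external DRAM;
     * col_right[mid] holds M(mid+1, j, k_right), already resident in the PE's
     * local memory from earlier column computations (the sliding window). */
    static float cyk_cell_bif(const float *row_left,
                              const float *col_right,
                              int i, int j)
    {
        float best = -FLT_MAX;
        for (int mid = i - 1; mid <= j; mid++) {
            float s = row_left[mid] + col_right[mid];
            if (s > best)
                best = s;
        }
        return best;
    }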

3. THE CYK ALGORITHM ACCELERATOR

3.1 System Architecture


Our RNA alignment platform consists of a reconfigurable algorithm accelerator and a host PC. The accelerator receives an input sequence of length L with 2-bit binary encoding and a CM model of length K, executes the 3D scoring matrix filling, and reports the global alignment score M(1, L, 1) to the host for display. The structure is shown in Figure 5. The accelerator engine comprises one FPGA chip (Virtex5 XC5VLX330), two DDRII modules, and a PCI-E x8 interface to the host PC. Two DDRII SODIMMs store the CM model and part of the 3D scoring matrix; they are connected to the FPGA pads directly, and the memory controller is implemented in the FPGA chip. The PCI-E interface is responsible for transferring the initial data (RNA and CM), configuration commands (start and interrupt signals) and final results between the accelerator and the host; its effective bandwidth reaches 1GB/s. The core of the CYK algorithm accelerator is composed of a PE Array Controller, a Master-Slave PE Array and a Column Synchronization and Back-writing Module. The PE Array Controller is responsible for initializing the PE array and assigning column tasks to the PE array dynamically. The Column Synchronization and Back-writing Module is responsible for synchronizing the column calculations and writing the necessary column elements back to external DRAM. For example, if layer k is a left S state, the columns calculated by the PE array should be shifted out from the FPGA to DRAM, because they generally will not be used for a long time. The PE array performs the 3D DP matrix filling in parallel using the recurrence relationship. The array consists of a series of PE modules, of which the first, PE1, is the master and the others are slaves. Each PE is augmented with a local memory consisting of memory blocks implemented in on-chip Block RAMs, which store a copy of the current RNA sequence and the current column elements of the current state layer k. The registers between adjoining PEs, called the TransRegs, are used for delivering reusable data, including several column elements of substates if S(k) = MR, IR or MP.


Figure 4: Data dependency relationship in a BIF layer: M(i, j, k) depends on the cells M(i, mid, k_left) in its left child layer and M(mid+1, j, k_right) in its right child layer for all i-1 <= mid <= j. The computation load density in a BIF layer increases along the wavefront shift direction, shown by increasingly dark shades.
3.2 Fine-grained Parallel CYK/inside Algorithm without Traceback


The standard CYK/inside algorithm fills the layers of the three-dimensional DP matrix one by one from bottom to top, as depicted in Figure 1. The computation moves along diagonals in a wavefront mode within each layer. However, this solution is not well suited to accelerating the CYK algorithm on an FPGA. First, the limited on-chip memory cannot hold even one layer of the 3D matrix. Second, the limited on-chip logic resources cannot implement an array with enough PEs to calculate a whole diagonal in parallel.
Figure 5: The structure of the CYK/inside algorithm accelerator.



Figure 6: The task partitioning and calculation order with a 4-PE array. (a) Decomposition of the three-dimensional DP matrix using four PEs (PE1, ..., PE4); the 3D matrix is partitioned into three sections, each consisting of K layers of 4 columns.

To minimize the storage requirement, we partition the whole 3D scoring matrix into smaller ones (named sections) along the column (j) dimension. Each section is still a 3D matrix, and each of its layers consists of N columns, where N is the PE array size. We then calculate those sections one by one, from bottom layer to top layer (k = K down to 1). For load balance, we partition each layer by columns and assign the tasks to PEs. We take a 4-PE array as an example.
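Under this partitioning, the mapping from a column index to its owning PE and section is a simple modulo/division pair. A small sketch (1-based numbering as in the text; the helper names are ours):

    /* Every group of n contiguous columns forms a section; within a section,
     * PE p takes the p-th column.  PE p therefore starts at column j = p and
     * advances by n at each section boundary (step S9 in Algorithm 1). */
    static int owner_pe(int j, int n)   { return (j - 1) % n + 1; }
    static int section_of(int j, int n) { return (j - 1) / n + 1; }

For example, with n = 4, columns 1-4 form section 1 and are owned by PE1-PE4, column 5 starts section 2 on PE1 again, and so on.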


Algorithm 1: Fine-grained Parallel CYK/inside Algorithm without Traceback

Input:  R: an RNA sequence R = r1...rL of length L;
        G: a CM model of length K with states numbered in preorder traversal;
Output: M[1, L, 1]: the optimal global alignment score;
Variables:
  i, j, k: element location indices in the three-dimensional DP matrix;
  S(k): the type of current state k; C(k): the states that k can transit to;
  P(k): the set of parents of state k;
  s: the number of sections the three-dimensional DP matrix is partitioned into;
  m: current section number; p: current PE number; n: the number of PEs;
  d: the number of a deck storage region in a PE's local memory, d in {1, 2, ..., 16};
  M[*, jp, k]: the column of current state k in matrix M assigned to PE p in {1, 2, ..., n};

BEGIN: Parallel CYK/inside
For all processing elements p in {1, 2, ..., n} do in parallel {
  FOR m = 1 up to s do {
    FOR k = K down to 1 do {
      S1:  Get_deck_buf_id(d);  // find a free storage region with minimal number in loc_mem
      // Data input:
      if (p = 1) {  // the master PE
        S2:  if (S(k) = BIF) then {
        S21:   Load M[*, jp, k_right] of the right child deck from its own local on-chip memory;
               Load M[i, *, k_left] of the left child deck from external DRAM
               and distribute it to all PEs };
        S22:  else if (S(k) = MR or IR or MP) then
                for all v in C(k) do { Load M[*, jp-1, v] from external DRAM };
        S23:  else for all v in C(k) do { Load M[*, jp, v] from its own local on-chip memory };
      } else {  // a slave PE
        S3:  if (S(k) = BIF) then {
        S31:   Load M[*, jp, k_right] of the right child deck from its own local on-chip memory;
               Recv M[i, *, k_left] of the left child deck broadcast by the master PE };
        S32:  else if (S(k) = MR or IR or MP) then
                for all v in C(k) do { Load M[*, jp-1, v] from the previous processing element p-1 };
        S33:  else for all v in C(k) do { Load M[*, jp, v] from its own local on-chip memory };
      }
      // Calculation:
      S4:  if (S(k) = BIF) then
             FOR i = 1 up to j+1 do { calculate M(i, jp, k) using the recurrence formula of
               Section 2 and store M(i, jp, k) in deck_buf[d] };
           else
             FOR i = j+1 down to 1 do { calculate M(i, jp, k) using the recurrence formula of
               Section 2 and store M(i, jp, k) in deck_buf[d] };
      // Synchronization:
      S5:  PE p waits until the last PE n has finished;
      // Data output:
      S6:  if (S(k) is a left substate of a BIF state) then send M(*, jp, k) to external DRAM;
           else if ((p = n) and (MR or IR or MP in P(k)))
             then send M(*, jp, k) to external DRAM;  // the last PE
      S7:  FOR t = 1 up to 16 do {
             if all the parent states of deck_buf[t] have been worked out
             then set deck_buf[t] free and add t to the idle region set };
      // Section advance:
      S8:  Distribute a new section to the PE array;
      S9:  if (k = 1) then jp = jp + n;  // assign the next section to the PE array
      S10: if (jp > L) then Stop;  // current column index exceeds the RNA sequence length
    };
  };
  Return (M[1, L, 1]);
};
END

Figure 7: Fine-grained parallel CYK/inside algorithm without traceback.


As shown in Figure 6(a), the three-dimensional DP matrix is partitioned into three sections, each consisting of K layers, each of which includes 4 columns. The dotted lines with arrows, numbered 1, 2, ..., K, K+1, ..., 2K, ..., 3K, represent consecutive processing phases. The filling stage starts from the cells located on the diagonal of layer k, section 1 (marked in red). The wavefront computation moves along the diagonals from south-west to north-east until it reaches the right-angled vertex of layer k, section 1. Then the columns of layer k-1, section 1 are assigned to the PE array and calculated in the same mode. When the last layer (k = 1) of section 1 is completed, we go back to layer k, section 2 and deal with the next section in a similar way. At the end of the iteration, the element M(1, L, 1) contains the global alignment score (the position marked by a green star).

As implied by Observation 1, the upper-triangular layers are partitioned into columns, and each PE holds one column in turn. Every group of N contiguous columns forms a section, as depicted in Figures 6(b) and (c); the 4 dark columns assigned to PE1 to PE4 in the middle area represent the current section, and the positions marked with stars represent the current computation points of the 4 PEs. If S(k) = BIF, the sections are calculated from top to bottom: the row elements of its left S-state layer can be reused by broadcasting them to the whole array, and the column elements of its right S-state layer have already been calculated and reside in the PEs' local memories. If S(k) is a non-BIF state, the sections are computed from bottom to top along the diagonals in parallel; all the child-state elements a PE needs are available in its own local memory or in the previous PE's local memory.

The parallel CYK/inside algorithm is divided into five phases: data input, score calculation, column synchronization, data output and section advance. Figure 7 describes the parallel CYK/inside algorithm without traceback in a Single-Program Multiple-Data (SPMD) style. The first part of Figure 7 shows the parameter and variable definitions. The core of the fine-grained parallel CYK/inside algorithm is a double For loop (FOR m = 1 up to s and FOR k = K down to 1). In each loop body, N contiguous columns (1, 2, ..., N) of layer k of the current section are assigned to the PE array, p in {1, 2, ..., N}, and calculated in parallel using the described schemes. Each PE assigns its PE identifier p to its column index, indicating the initial column assignment. In the initial phase, each PE finds a valid memory block with minimal number in its local memory to store the column elements about to be generated, as shown in S1. Before the score calculation phase, several data items are loaded into local memory, as shown in S2 and S3. Only the master PE (p = 1) loads data from external DRAM; the slave PEs only receive data from the data bus, load data from the TransRegs of the previous adjoining PE, or load from their own local memories, depending on the current state type S(k). The whole computation procedure is carefully pipelined in order to hide the memory access delay. Each PE calculates M(i, j, k) using the recurrence formula described in Section 2.1, stores the result to its local memory, and then moves the computation point to the next position in the same column. The computation order is also state-type dependent, as shown in S4. When the last element of the current column has been worked out, the PE enters the column synchronization phase (S5), during which it writes its local results to DRAM if necessary (S6). Then some columns are replaced, and the allocated memory blocks are released once all their parent states have been worked out (S7). When all PEs arrive at the synchronization point, indicating the end of the current section, the section advances by assigning a new layer k (if k = 1, then k = K and each PE adds n to its column index) to the PE array (S8, S9). The process repeats until the column index is greater than the RNA sequence length.

4. EXPERIMENTS AND PERFORMANCE

4.1 Analysis of the Parallel Algorithm's Scalability


The execution time of our parallel CYK/inside algorithm without traceback can be predicted in cycles, thanks to the tight synchronization of the systolic array structure. The total execution time is the sum of the cell computation time ($T_C$), the column synchronization and write-back time ($T_M$), and the time for section advance and data loading; the overhead of the latter is O(L) and can be ignored. Assuming p is the number of PEs, L is the length of the RNA sequence, B is the number of bifurcation states in a model with K states, and s = L/p, the execution time $T_{NB}$ for the non-BIF state layers is given by (4):

$$T_{NB} = (T_{C1} + T_{M1}) \cdot (K - B) \quad (4)$$

where $T_{C1} = (p + 2p + \dots + sp) \cdot t_1 = \frac{1}{2}s(s+1)p \cdot t_1$ is the cell computation time and $T_{M1} = (1 + 2 + \dots + s) \cdot p \cdot t_2 = \frac{1}{2}s(s+1)p \cdot t_2$ is the synchronization and write-back time. Here $t_1$ is the time for each element computation, 10 cycles on average, and $t_2$ is the access overhead for storing an element to external memory, only one cycle for non-BIF layers because all elements are stored in DRAM in row access mode. The execution time $T_B$ for the BIF state layers is:

$$T_B = (T_{C2} + T_{M2}) \cdot B \quad (5)$$

The cell computation time is $T_{C2} = \frac{1}{2}\left[\frac{s(s+1)}{2} \cdot p + \frac{s(s+1)(2s+1)}{6} \cdot p^2\right] \cdot t_3$ and the synchronization and write-back time is $T_{M2} = \frac{L(L+1)}{2} \cdot t_4$, where the computation time for each element is $t_3 = 2$ (one add and one compare operation) and $t_4$ is the access overhead for storing an element to external memory in column access mode, 16 cycles on average for BIF layers. As a result, the total execution time T of the hardware CYK/inside algorithm is the sum of $T_B$ and $T_{NB}$:

$$T = (T_{C1} + T_{M1})(K - B) + (T_{C2} + T_{M2}) \cdot B \quad (6)$$

The computation time for filling the 3D matrix ($T_C$) is:

$$T_C = T_{C1}(K - B) + T_{C2} \cdot B \quad (7)$$

From formulas (6) and (7), we can theoretically analyze the parallel efficiency ($E_C$) of our accelerator:

$$E_C = \frac{T_C}{T} = \frac{1}{1 + \frac{T_{M1}(K-B) + T_{M2}B}{T_{C1}(K-B) + T_{C2}B}} \quad (8)$$

In the general case $L \gg p$ and $K \gg B$, the expression can be approximately simplified as (9):

$$E_C = \frac{1}{1 + \eta}, \qquad \eta = \frac{96Bp + 3K}{4BL + 60K} \quad (9)$$
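As a sanity check on the simplified model, formula (9) can be evaluated directly for the SSU rRNA example discussed below. This tiny C program (ours, for illustration) reproduces the quoted figures:

    #include <stdio.h>

    /* Evaluate E_C = 1 / (1 + eta), eta = (96*B*p + 3*K) / (4*B*L + 60*K),
     * for the SSU rRNA query: L = 1545, K = 4789, B = 30, p = 16. */
    int main(void)
    {
        double L = 1545, K = 4789, B = 30, p = 16;
        double eta = (96.0 * B * p + 3.0 * K) / (4.0 * B * L + 60.0 * K);
        printf("eta = %.3f, E_C = %.1f%%\n", eta, 100.0 / (1.0 + eta));
        /* Prints eta = 0.128, E_C = 88.7% -- "a little more than 0.1"
         * and "nearly 90%", matching the text and Table 1. */
        return 0;
    }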


Table 1: Parallel Efficiency (E_C) for Different PE Array Sizes and Different Queries

PE number   tRNA   5S rRNA   SRP RNA   RNaseP   SSU rRNA   average
p = 4        91%      94%       93%      93%       95%       93%
p = 8        87%      92%       91%      90%       93%       91%
p = 16       83%      90%       88%      86%       89%       87%
p = 20       81%      89%       86%      83%       87%       85%
p = 32       78%      85%       82%      78%       82%       81%

Take the SSU rRNA sequence as an example: when L = 1545, K = 4789, B = 30 and p = 16, the parameter eta is a little more than 0.1, so the parallel efficiency is nearly 90%, showing good parallelism. The parallel efficiency of the hardware accelerator for different PE array sizes and different queries is listed in Table 1. We find that it drops as the PE number increases: though we achieve an increase in speed, the efficiency drops in each case as the design moves from 4 PEs to 32 PEs, due to the increased overhead of message passing and column synchronization. Considering the longest sequence, SSU rRNA with 1545 bases, the efficiency descends from 95% for 4 PEs to 82% for 32 PEs.
The downtrend of the parallel efficiency of the CYK/inside algorithm is also depicted in Figure 8, where we compare the efficiency of our accelerator to three cluster platforms with different processor numbers. Because load balance and communication delay are carefully considered, our parallel efficiency is significantly superior to that of the AMD Cluster [6]. It is also better than the Xeon Cluster and Alpha Cluster [4] when the PE number is less than 16, and comparable to the latter when it is more than 16.

Figure 8: The parallel efficiency of the CYK/inside algorithm on 4 hardware platforms at different processing scales.

Note: hardware environment. Our Accelerator: one FPGA chip (XC5VLX330) with 16 PEs, one P4 2.66GHz CPU and 1.5GB memory; Alpha Cluster: 12 Compaq ES45 SMP nodes, each with four 1.0GHz Alpha EV68 CPUs; Xeon Cluster: 10 nodes, each with two 2.0GHz Intel-Xeon CPUs and 1GB memory; AMD Cluster: 8 nodes, each with four 2.4GHz AMD Opteron CPUs and 8GB memory.

4.2 Experimental Environment

We implemented the CYK/inside algorithm accelerator in an FPGA testbed. The testbed is mainly composed of one large-scale FPGA chip (Virtex5 XC5VLX330 from Xilinx), two 2GB DDRII SODIMM modules (KVR667D2S5/2G from Kingston) and a PCI-E x8 interface (implemented by a Virtex5 XC5VLX50T) to the host computer. Notably, our accelerator supports global dynamic reconfigurability and can be reconfigured in 60ms. The alignment software, Infernal 0.55, developed by Sean R. Eddy and his colleagues (downloaded from the Washington University School of Medicine web site [14]), runs on a desktop computer with an Intel Pentium4 2.66GHz CPU and 1.5GB memory at the O3 compiler optimization level. We also measure software execution time on AMD 9650 Quad and Intel Q9400 Quad platforms to verify the acceleration of our approach.

4.3 FPGA Resource Usage


Table 2: Device Utilization Summary (Final Report)

PE num   Slices (logic)         BRAM (memory)    Frequency
1-PE     16438/207360 (8%)      25/288 (9%)      192MHz
8-PE     50046/207360 (24%)     137/288 (48%)    168MHz
16-PE    88458/207360 (42%)     265/288 (92%)    164MHz


We place different numbers of PEs on the latest Xilinx FPGA chip, the XC5VLX330, to evaluate resource usage. As shown in the first row of Table 2, one PE consumes 16438 slices and 25 Block RAMs (36Kbits each), summing to 900Kbits of memory. At most 16 PEs can be fitted on the XC5VLX330, because the storage requirement consumes almost all of the memory resources (92%) while the proportion of logic resources used is only 42%. Thus, the bottleneck for accelerating the CYK algorithm is not logic but memory capacity. As shown in the last column of Table 2, every implementation reaches a clock frequency of over 160MHz (post-place-and-route, not synthesis), and the frequency does not drop visibly as the array grows from 8 PEs to 16 PEs, also showing the good scalability of our parallel algorithm.


4.4 Performance Compared to Single CPU Platforms


4.4.1 Speedup
Taking the P4 as the base, we compare the execution time and speedup of 4 different platforms: three general-purpose computers and our algorithm accelerator. The execution time on the FPGA accelerator includes the computation time, the time for sending the CM model and sequence query, and the time for taking results back to the host for display. Despite variations in CPU type, clock frequency, main memory capacity, second-level cache capacity and operating system version, the three general-purpose computers exhibit similar performance. As shown in Table 3, the AMD and Intel Quad CPUs show a small advantage over the P4, achieving about 1.5x speedup. The FPGA accelerator with 16 PEs, however, exhibits a significant speedup, from 3.3 to more than 14, as the sequence length grows from 72 bases to 1545 bases. This is partially because the CPUs incur more and more cache misses as the data sets grow.


4.4.2 Performance/Power Consumption Ratio


We compare the performance/power consumption ratio (t = 1000P/W) of the 4 different platforms.


Table 3: Execution Time (ms) and Speedup for Different Sequence Lengths on Different Platforms

              tRNA         5S rRNA       SRP RNA        RNase P        SSU rRNA
Platform      Time   Sp    Time   Sp    Time    Sp     Time    Sp     Time     Sp
P4 (1)        0.1    1     0.4    1     5.6     1      12.2    1      272.2    1
AMD (2)       0.05   2     0.33   1.2   4.39    1.3    8.78    1.4    185.3    1.47
Core2 (3)     0.05   2     0.29   1.4   4.36    1.3    9.03    1.35   184.7    1.47
FPGA (4)      0.03   3.3   0.09   4.6   0.89    6.3    1.75    6.97   38.9     7.0
FPGA (5)      0.03   3.3   0.06   7.3   0.51    10.9   0.85    14.3   18.7     14.5

Note: hardware environment. (1) Pentium4 2.66GHz CPU, 1.5GB memory; (2) AMD Phenom 9650 Quad CPU, 2.3GHz, 3.0GB memory; (3) Intel Core2 Q9400 Quad CPU, 2.66GHz, 3.0GB memory; (4) FPGA accelerator (XC5VLX330) with 8 PEs, 160MHz, 4.0GB memory; (5) FPGA accelerator (XC5VLX330) with 16 PEs, 160MHz, 4.0GB memory.
Figure 9: Performance/power consumption ratio (FPGA vs. CPU).

As shown in Figure 9, the power consumption of the three general-purpose microprocessors ranges from 95W to 115W, whereas the 16-PE CYK accelerator consumes less than 30W as simulated by Xilinx ISE 10.1, a saving of more than 70%. Hence, measured by performance/power consumption, our FPGA accelerator shows a factor of more than 30x over the general-purpose computers.

Figure 10: Speedup on 4 hardware platforms (FPGA accelerator vs. clusters).

4.5 Performance Compared to Multi-CPU (Cluster) Platforms


4.5.1 Speedup
To compare with previous work [4, 6], we aligned two classical RNA sequences (RNaseP and SRP RNA) to their family CM models on our accelerator. T. Liu [4] and G. Tan [6] implemented two parallel CYK/inside algorithms without traceback, using C and MPI, on a Xeon cluster [4] with 10 nodes (each node consists of two Intel-Xeon 2GHz CPUs and 1GB memory) and an AMD cluster [6] with 8 nodes (each node consists of four 2.4GHz AMD Opteron CPUs and 8GB memory) respectively. As shown in Figure 10, the speedup our accelerator achieves is clearly superior to both the Xeon cluster and the AMD cluster in each case. For example, alignment of an RNaseP sequence to the RNaseP consensus structure achieves 5.1x speedup on a cluster system with 8 Xeon processors, and at most 7x speedup on a 16-processor system; [6] reports 5.9x and 10.5x speedup on 8-processor and 16-processor AMD cluster systems respectively. Our experimental results, in contrast, show a factor of nearly 7 speedup over the Infernal 0.55 software for 8 PEs and more than 14 for 16 PEs, exhibiting near-linear scaling thanks to the scalable parallel structure of the accelerator.

4.5.2 Performance/Power Consumption and Performance/Cost Ratio

Taking the Xeon cluster as the base, we compare the power consumption and hardware cost of 4 different platforms with the same processor number: three general-purpose PC cluster systems and our algorithm accelerator. As shown in Figure 11, the AMD cluster, the Alpha cluster and the FPGA accelerator exhibit 1.5x, 1.8x and 2.0x the performance of the Xeon cluster respectively. The hardware cost of the three general-purpose PC clusters with 16 processors is about 20 to 32 thousand dollars, and they consume between 2.5kW and 4kW on average. By contrast, it takes only 7 thousand dollars to build a high-performance computing platform with an FPGA accelerator, and its power dissipation is less than 0.3kW. The computational power of our solution is comparable to a PC cluster consisting of 20 Intel-Xeon CPUs for RNA secondary structure prediction using SCFGs, while the hardware cost and power consumption are only about 15% and 10% of the latter respectively. Thus, considering the performance/power consumption ratio (t = 1000P/W) and the performance/cost ratio (s = 1000P/C), we believe the application-specific fine-grained scheme implemented in our accelerator provides a significant advantage over the general-purpose schemes.

Figure 11: Performance/power consumption and performance/cost ratio (FPGA accelerator vs. clusters).




5. CONCLUSION

With the growing number of known RNA sequences, stochastic context-free grammars (SCFGs) are used as an efficient analysis tool for modeling RNA secondary structures. However, the standard CYK algorithm for aligning an RNA sequence to an SCFG is highly compute-intensive, and the parallel efficiency of many implementations on general-purpose multiprocessors is greatly limited by complicated data dependencies and tight synchronization. Additionally, large-scale parallel computers are too expensive for many research institutes to use easily. In this paper, we explore the use of FPGAs to accelerate the CYK algorithm. After carefully studying the characteristics of the algorithm, we make four observations to direct our design. We propose a systolic array structure with one master PE and multiple slave PEs for fine-grained hardware implementation. We partition the whole 3D scoring matrix into smaller ones to minimize the storage requirement, and split the layers by columns, assigning tasks to PEs for load balance. We aggressively exploit data reuse schemes to minimize the need for loading layers from external memory. The experimental evaluation demonstrates that the performance of our accelerator scales with multiple PEs and that the FPGA accelerator outperforms general-purpose computers, with a speedup of more than 14x on 16 PEs. The computational power of our accelerator platform is comparable to a PC cluster consisting of 20 Intel-Xeon CPUs for RNA secondary structure prediction using SCFGs, but the hardware cost and power consumption are only about 15% and 10% of the latter respectively.

6. ACKNOWLEDGMENTS

We would like to thank the researchers who provided access, documentation and installation assistance for the Infernal software. We also want to thank the anonymous reviewers for their detailed revision directions and constructive comments. This work is partially sponsored by the National High Technology Research and Development Program (2007AA01Z106) and NSFC (60633050 and 60621003).

7. REFERENCES

[1] S. R. Eddy. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3(18), 2002.
[2] E. Rivas and S. R. Eddy. A dynamic programming algorithm for RNA structure prediction including pseudoknots. J. Mol. Biol., 285:2053-2068, 1999.
[3] S. R. Eddy. What is dynamic programming? Nature Biotechnology, 22(7), July 2004.
[4] T. Liu and B. Schmidt. Parallel RNA secondary structure prediction using stochastic context-free grammars. Concurrency Computat.: Pract. Exper., 17:1669-1685, 2005.
[5] J. C. Hill and A. Wayne. A CYK approach to parsing in parallel: a case study. In Proceedings of the Twenty-Second SIGCSE Technical Symposium on Computer Science Education, March 1991.
[6] G. Tan, S. Feng, and N. Sun. Exploiting parallelization for RNA secondary structure prediction in cluster. In Proc. International Conference on Computational Science (ICCS'05), LNCS 3516, pages 979-982, May 2005.
[7] G. Tan, L. Xu, S. Feng, and N. Sun. An experimental study of optimizing bioinformatics applications. In Proc. International Parallel and Distributed Processing Symposium. IEEE, 2006.
[8] A. Jacob, J. Buhler, et al. Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs. In Proc. International Conference on Application-specific Systems, Architectures and Processors (ASAP'08), pages 191-196. IEEE, 2008.
[9] R. Nussinov, G. Pieczenik, J. R. Griggs, and D. Kleitman. Algorithms for loop matchings. SIAM Journal on Applied Mathematics, 35:68-82, 1978.
[10] C. Ciressan, E. Sanchez, M. Rajman, and J.-C. Chappelier. An FPGA-based coprocessor for the parsing of context-free grammars. In Proc. Symposium on Field-Programmable Custom Computing Machines (FCCM 2000), pages 236-245.
[11] C. Ciressan, E. Sanchez, M. Rajman, and J.-C. Chappelier. An FPGA-based syntactic parser for real-life almost unrestricted context-free grammars. In Proc. International Conference on Field Programmable Logic and Applications (FPL 2001), LNCS 2147, pages 590-594, October 2001.
[12] S. R. Eddy and R. Durbin. RNA sequence analysis using covariance models. Nucleic Acids Research, 22(11):2079-2088, 1994.
[13] K. Lari and S. Young. Applications of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language, 5:237-257, 1991.
[14] Infernal 0.55 software. Washington University School of Medicine web site, http://infernal.janelia.org/, 2009.
