
This article has been accepted for publication in a future issue of this journal, but has not been

fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/LCA.2015.2458318, IEEE Computer Architecture Letters

A Coarse-Grained Reconfigurable Architecture for


Compute-Intensive MapReduce Acceleration
Shuang Liang, Shouyi Yin, Leibo Liu, Yike Guo, and Shaojun Wei

Abstract—Large-scale workloads often show parallelism at different levels, which offers acceleration potential for clusters and parallel
processors. Although processors such as GPGPUs and FPGAs show good speedup performance, there is still a vacancy for a
low-power, high-efficiency and dynamically reconfigurable one, and coarse-grained reconfigurable architecture (CGRA) seems to be one
possible choice. In this paper, we introduce how we use our CGRA fabric Chameleon to realize dynamically reconfigurable
acceleration of MapReduce-based (MR-based) applications. An FPGA-shell-CGRA-core (FSCC) architecture is designed for the
acceleration PCI-Express board, and a programming model with a compilation flow for CGRA is presented. With the supports above, a
small evaluation cluster with the Hadoop framework is set up, and experiments on compute-intensive applications show that the
programming process is significantly simplified, with a 30-60x speedup offered under low power.

Index Terms—Reconfigurable computing, MapReduce, Accelerator

1 INTRODUCTION

As the performance requirement of computing rapidly grows, constructing a more powerful engine for data processing becomes a necessity. Accumulating computing cores and raising clock frequency is one solution, and spreading workloads among a scalable multi-node cluster can break the single-node limitations on storage and throughput. However, physical and financial constraints prevent us from pushing the former approach further. The root cause is that, at the processing-element level, Von Neumann-based general-purpose processors (GPPs) are of low efficiency, and their serial working style leads to serious energy consumption and a throughput gap. Thus, it is desirable to take advantage of more efficient and flexible processors.

Fig. 1. Typical architecture of CGRA
GPGPUs [13] [6] have been widely used for distributed acceleration due to their abundant stream processing units and convenient programming models. They bring appealing speed together with a headache of power consumption. FPGAs [16] [15] [14] [10] have also been developed recently to enhance the performance of large-scale tasks, showing 10-1000x the throughput of GPPs with less than 10 watts of power on a single chip, proving the high efficiency of their flexible and parallel work style. Yet we should see that it is really time consuming to realize a certain function on an FPGA with bothersome RTL programming, even though researchers have paid great efforts on implementing tools to generate RTL code from high-level languages [12] [11]. Moreover, it takes minutes to hours to generate a feasible bitstream for FPGAs due to the high interconnection complexity, which makes dynamic reconfiguration impractical.

Coarse-grained reconfigurable architectures (CGRAs) seem to be suitable substitutes. They consist of a proper number of coarse-grained processing elements (PEs), typically word-level ALUs instead of bit-level LUTs, as depicted in Figure 1. The complexity of the interconnections, usually crossbar or mesh styled, is significantly reduced after the shrink of PE number, which results in less synthesis time, aiming for on-the-fly programmability, and less power consumption and latency on unnecessary interconnects, for higher on-chip efficiency. [8] has offered fundamental support that coarse-grained architectures show on average 10 times the speed at a quarter of the energy cost of LUT-based FPGAs.

Acknowledgedly, MapReduce [4] has been a well-suited programming model for processing most large-scale workloads in distributed clusters. It provides two primitive functions, map and reduce, to split workloads into <key, value> pairs for parallel execution. What is more, among compute-intensive workloads such as matrix multiplication, pi calculation and k-means clustering, the overall patterns of the algorithms are often made of repetitive simple kernels in a loop form. This offers an opportunity for CGRAs to exploit the parallelism of iterations within a limited hardware space, which leads us to introduce CGRAs into the world of MapReduce for mutual reinforcement.

In this paper, we describe how we integrate our Chameleon CGRA into an MR-based system. Hardware and software implementations are narrated in detail, and case studies show the effectiveness of CGRA acceleration.

S. Liang, S. Yin, L. Liu and S. Wei are with the Institute of Microelectronics, Tsinghua University, Beijing, China, 100084. S. Yin is the corresponding author. E-mail: yinsy@tsinghua.edu.cn. Y. Guo is with the Department of Computing, Imperial College London, UK.

2 CGRA-CORED ACCELERATOR BOARD DESIGN

2.1 Architecture and working mechanism of our CGRA

Based on the typical CGRA architecture shown in Figure 1, we implemented a prototype chip called Chameleon with TSMC 65nm LP1P8M CMOS technology, which has an area of

1556-6056 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See
http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 2. Hierarchical structure of configuration context storage

6.2 × 6.2 mm² and a maximum clock of 400MHz. The power consumption of the chip is 49.8mW under 200MHz.

Each chip consists of four 8 × 8 2-D mesh PE arrays and
configuration context (CC) memories. Instead of FPGA bitstreams, the functional patterns of the PEs are expressed by word-wide (32-bit) CCs, which can be easily translated from high-level instructions. CCs are organized in a hierarchical structure of three configuration levels (CLs), depicted in Figure 2. We represent the kernels in the form of control data flow graphs (CDFGs), which can be mapped onto the PE array (PEA) under the guidance of CCs. Each CL's responsibility is listed below:
CL2: The bottom level CC. The 5-bit opcode is used to configure the function of the ALUs, which support 26 different kinds of arithmetic and logic operators including add, subtract, multiply, and, not, etc. The I/O configuration directs the interconnections between two adjacent PEs and the dependence of iterations, thus forming a kernel pattern and temporal schedule.

CL1: The medium level CC. It includes the index of CL2, and indicates the information of iterations, such as the number of iterations and the dependence among iterations.

CL0: The top level CC. It includes the index of CL1, and offers the addresses of external data operations. The target PEA number and synchronization control info are also given in CL0.

Fig. 3. Implementation of the accelerator board: (a) The FSCC structure design. (b) The photo of the prototype PCI-Express board integrated with two Chameleon CGRAs
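The three configuration levels can be sketched as plain data records (a minimal Python sketch; the field names and widths, beyond the 5-bit opcode stated above, are our assumptions rather than the chip's actual context encoding):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CL2:
    """Bottom level: per-PE function and wiring."""
    opcode: int                                          # 26 operators fit in 2**5 = 32 codes
    io_config: List[int] = field(default_factory=list)   # links to adjacent PEs

    def __post_init__(self):
        assert 0 <= self.opcode < 32, "opcode must fit in 5 bits"

@dataclass
class CL1:
    """Medium level: iteration information of one kernel."""
    cl2_index: int          # which CL2 pattern to repeat
    num_iterations: int
    loop_carried: bool      # dependence among iterations

@dataclass
class CL0:
    """Top level: external data addresses, target PEA and synchronization."""
    cl1_index: int
    input_addr: int
    output_addr: int
    target_pea: int         # one of the four PE arrays on the chip
    sync_flags: int = 0

# A CL0 word is all the host needs to stream in at runtime, since CL1/CL2
# stay resident in the on-chip context memories (see Section 2.1).
ctx = CL0(cl1_index=0, input_addr=0x1000, output_addr=0x2000, target_pea=2)
```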
This makes our Chameleon work in a different way from an FPGA. For FPGAs, all the logic should be carefully designed first under the resource limitations, then programmed in RTL code, and finally put through a lengthy run of synthesis and P&R, which is a long development cycle. Admittedly this offers plenty of flexibility for custom design, but it is too much for describing simple algorithms. For our CGRA, the workloads are abstracted at a higher level with several key parameters: iteration times, kernel patterns, input and output addresses, and interconnections. As a result, the complexity of describing a workload is significantly reduced. Designers only need to define parameters, without worrying much about the underlying hardware.

We also notice that between two adjacent iterations the kernel pattern may need to be reconfigured recurrently, so we store CL1 and CL2 in the on-chip context memories to avoid redundant communication, and only need to write CL0 words to push the overall workflow forward in a streaming mode.

2.2 Board design

Practically, we need a high speed bus to transfer data in two ways and ensure that the bandwidth will not become a bottleneck. The PCI-Express bus is a suitable choice, for it can bring a peak full-duplex data transfer speed of 4GByte/s, so we choose to integrate Chameleon on a board with a PCI-Express interface.

Furthermore, peripheral logic is required for the CGRAs to communicate and cooperate with the host server. For communication, a PCI-Express endpoint IP collects and transmits transaction level packets (TLPs) from and to the server. A direct memory access (DMA) module is necessary for fast data transfer between the host server's memory and the RAM on board. A data cache memory and the corresponding interface logic are also needed. For cooperation control, an interrupt controller catches the start and the end of one certain task, and a system controller is in charge of the overall on-board system control. Control and status registers tell the functional modules to work as arranged, and offer feedback on their internal state machines for server monitoring.

All the above modules need little reconfiguration at runtime. They can be statically implemented in RTL and mapped onto hardware such as an FPGA. Therefore, we couple an FPGA chip with CGRAs to realize an FPGA-shell-CGRA-core (FSCC) structure, where the computing kernels are mapped on CGRAs under the control of static logic on the FPGA. Considering the logic cell and I/O requirements, we select a Virtex-6 LX550T chip together with two Chameleon chips on board. Two Samsung K7N643645M 8MByte pipelined SRAMs are utilized as on-board cache storage for the data transferred between the host server and the CGRAs. We use a synchronous clock of 200MHz for the system, and realize a PCI-E communication speed of 3.27Gbytes/s. The FSCC structure in our system is shown in Figure 3(a), and a photo of the prototype PCI-E board is given in Figure 3(b).

3 MR-BASED PROGRAMMING MODEL FOR CGRA

MapReduce offers an approach for developers to simply write map and reduce functions to enable distributed execution of applications. We need to fit the task into the MapReduce model first. Here we take matrix multiplication as an example, and the pseudo-code is given in Algorithm 1. In this example, the elements in the source matrices are replicated and labeled with the key values, which are the target locations, in the map

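As a quick sanity check on the figures above (our own arithmetic, not the paper's): four 8 × 8 arrays give 256 PEs per chip, so the two-chip board carries 512 PEs in total, and the measured PCI-E rate reaches about 82% of the 4GByte/s peak:

```python
# Aggregate PE count of the two-chip board (assuming fully populated arrays)
arrays_per_chip = 4
pes_per_array = 8 * 8                            # each array is an 8 x 8 2-D mesh
chips_on_board = 2

pes_per_chip = arrays_per_chip * pes_per_array   # 256 PEs per Chameleon chip
board_pes = chips_on_board * pes_per_chip        # 512 PEs, the figure cited in Section 5

# Achieved PCI-E rate versus the 4 GByte/s full-duplex peak
pcie_utilization = 3.27 / 4.0                    # about 0.82
```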

Algorithm 1 Mapper for matrix multiplication

1: INPUT: (key, value)  // value is (A, i, j, a_ij) or (B, j, k, b_jk)
2: if value[0] == A then
3:   i = value[1]; j = value[2]; a_ij = value[3];
4:   for k = 1 to p do
5:     emit((i, k), (A, j, a_ij));
6:   end for
7: else
8:   j = value[1]; k = value[2]; b_jk = value[3];
9:   for i = 1 to m do
10:    emit((i, k), (B, j, b_jk));
11:  end for
12: end if

Algorithm 2 Reducer for matrix multiplication

1: INPUT: (key, values)
2: // key is (i, k); values is a list of (A, j, a_ij) or (B, j, b_jk)
3: hashA = {j : a_ij for (x, j, a_ij) in values if x == A}
4: hashB = {j : b_jk for (x, j, b_jk) in values if x == B}
5: result = 0
6: for j = 1 to n do
7:   result += hashA[j] * hashB[j];
8: end for
9: emit(key, result);

Fig. 4. The compilation flow of CIPs in MapReduce functions

Fig. 5. The block diagram of the prototype Hadoop-based cluster with our accelerator boards inserted
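To make the two listings concrete, here is a small self-contained Python emulation of the map, shuffle and reduce phases for Algorithms 1 and 2 (the shuffle plumbing and function names are our illustrative scaffolding, not the paper's runtime):

```python
from collections import defaultdict

def mm_mapper(value, m, p):
    """Algorithm 1: replicate each element to every output cell (i, k) it feeds."""
    if value[0] == "A":
        _, i, j, a_ij = value
        for k in range(1, p + 1):
            yield (i, k), ("A", j, a_ij)
    else:
        _, j, k, b_jk = value
        for i in range(1, m + 1):
            yield (i, k), ("B", j, b_jk)

def mm_reducer(values, n):
    """Algorithm 2: inner product of the gathered row of A and column of B."""
    hashA = {j: a for (x, j, a) in values if x == "A"}
    hashB = {j: b for (x, j, b) in values if x == "B"}
    return sum(hashA[j] * hashB[j] for j in range(1, n + 1))

def matmul_mr(A, B):
    """Multiply an m x n matrix A by an n x p matrix B in MapReduce style."""
    m, n, p = len(A), len(B), len(B[0])
    records = [("A", i + 1, j + 1, A[i][j]) for i in range(m) for j in range(n)]
    records += [("B", j + 1, k + 1, B[j][k]) for j in range(n) for k in range(p)]
    groups = defaultdict(list)                    # the shuffle phase
    for rec in records:
        for key, val in mm_mapper(rec, m, p):
            groups[key].append(val)
    return {key: mm_reducer(vals, n) for key, vals in groups.items()}

C = matmul_mr([[1, 2], [3, 4]], [[5, 6], [7, 8]])   # C[(i, k)] = row i of A · column k of B
```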

TABLE 1
MapReduce models for the benchmark applications

Function     MM              KMC                             CONV
Mapper       Replicate       Euclid distance, Compare, Sum   Multiply
Reducer      Multiply, Sum   Euclid distance, Sum, Mean      Sum
Complexity   O(MNP)          O(NKD)                          O(K²N²)
Tested size  M=N=P=128       N=10e5, K=10, D=2               K=7, N=224

function. The reduce function searches for the elements with the same key value, and yields vector inner products in parallel.

After the map and reduce functions are presented, we should consider how to compile the compute-intensive parts of the functions to offload them onto CGRAs. Just as shown in Figure 4, the compute-intensive parts (CIPs) can always be presented as control data flow graphs (CDFGs). With hardware parameters such as PE number and memory bandwidth, the original CDFG will be optimized by transformations [5] to suit the Chameleon array. Then the original CDFG will be transformed into a series of subgraphs, and the key parameters, such as kernel pattern, subgraph iteration number, iteration dependence and input data address, will be abstracted. With these parameters, configuration contexts will be generated and packed up in a parametric execution package. The above procedures can be called a compilation flow for Chameleon, which can be accomplished manually or by a custom compiler; currently we have developed an LLVM-based compiler [9] [3], which can go through the procedures automatically.

We write a C driver for access to the bottom hardware according to the register arrangement, and use the Java native interface (JNI) to call the C libraries and link them with the MapReduce applications. As shown in Figure 4, with the source and destination pointers given, the exe_CGRA function realizes the CIPs of the original mapper/reducer on CGRAs.

4 EVALUATION SYSTEM

To verify the effectiveness of CGRA acceleration, we need to set up a distributed environment. Since Hadoop [2] is a widely used framework for MapReduce and offers abundant tools and libraries for development, we choose to set up a Hadoop-based cluster. The overall system workflow is given in Figure 5. Here we apply five IBM x3650 M4 servers, each with a Xeon E5-2650 2GHz CPU and 8GB ECC DDR3 memory, connected by a local network. In the Hadoop framework, the Namenode is the manager of the whole system, and Datanodes work under the control of the Namenode with status feedback for scheduling reference. In our cluster, four servers are configured as Datanodes, each inserted with one Chameleon accelerator board through the PCI-Express slots.

5 EXPERIMENTAL RESULTS

Our experiments mainly focus on three aspects: the simplicity of CGRA programming, the standalone performance improvement on CGRAs, and the overall improvement in the MR-based prototype cluster. The benchmark applications we have chosen are matrix multiplication (MM), K-means clustering (KMC), and 2-D convolution (CONV) in convolutional neural networks, for they all show compute-intensive characteristics and are easy to parallelize. The MapReduce models, the timing complexities and the tested sizes of these applications are listed in Table 1.

We extract the CIPs in the MapReduce functions for acceleration. To compare the strengths and weaknesses of the FPGA and CGRA implementations, the CIPs are realized in both HDL and C++. We generate configurations using the Xilinx ISE Design Suite v14.7 for the FPGA and our LLVM-based C compiler for Chameleon. For the configuration generation time (CGT), the FPGA takes 10-14 minutes to generate bitstream files of 90.3-112.1 Mb, while the CGRA only takes 23-37 seconds to generate execution packages of 1.65-5.15 Kb. The main difference comes from the elimination of the structure description of PEs and the reduction of interconnection number and combinations. Therefore, FPGAs can only be statically configured before the chip actually works, while our Chameleon CGRA can be dynamically configured with a much simpler parametric description of the workload.

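The parametric execution packages mentioned above can be pictured as small records of the key parameters named in Section 3 (this structure and its field names are purely our illustration; the paper does not publish the actual package format):

```python
# Hypothetical shape of a parametric execution package, built from the key
# parameters of Section 3: kernel pattern, subgraph iteration number,
# iteration dependence, and input data address.
REQUIRED = {"kernel_pattern", "iterations", "dependence", "input_addr"}

def pack_execution_package(subgraphs):
    """subgraphs: one dict of key parameters per CDFG subgraph."""
    package = []
    for sg in subgraphs:
        missing = REQUIRED - sg.keys()
        if missing:
            raise ValueError("subgraph missing parameters: %s" % sorted(missing))
        package.append(dict(sg))   # a real flow would encode these into CC words
    return package

pkg = pack_execution_package([
    {"kernel_pattern": "mac", "iterations": 128,
     "dependence": False, "input_addr": 0x0},
])
```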

TABLE 2
Comparison between CPU, FPGA and CGRA

                        Xeon @ 2GHz         Virtex-6 LX550T @ 200MHz               Chameleon @ 200MHz
Func  # of operations   Latency(ms)  η*     Cycles  Latency(ms)  Speedup  η*       Cycles  Latency(ms)  Speedup  η*
MM    4.19e6            14.41        59.96  16385   8.19e-2      175.89   4.59e3   77824   3.89e-1      37.03    1.09e5
KMC   8.10e6            12.29        135.77 41032   2.05e-1      59.90    3.49e3   53125   2.66e-1      46.27    3.08e5
CONV  4.74e6            20.07        48.68  47526   2.37e-1      84.46    1.77e3   72604   3.63e-1      55.29    1.32e5

* Here we set the energy efficiency as η = (# of operations)/(Power(mW) × Execution time(ms))
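The latency and speedup columns of Table 2 can be cross-checked from the cycle counts and the 200MHz system clock (the check is ours; the numbers are transcribed from the table):

```python
F_CLK_HZ = 200e6   # both accelerators run at 200 MHz

# name: (cpu_latency_ms, accel_cycles, reported_latency_ms, reported_speedup)
table2 = {
    "MM_fpga":   (14.41, 16385, 8.19e-2, 175.89),
    "MM_cgra":   (14.41, 77824, 3.89e-1, 37.03),
    "KMC_fpga":  (12.29, 41032, 2.05e-1, 59.90),
    "KMC_cgra":  (12.29, 53125, 2.66e-1, 46.27),
    "CONV_fpga": (20.07, 47526, 2.37e-1, 84.46),
    "CONV_cgra": (20.07, 72604, 3.63e-1, 55.29),
}

for name, (cpu_ms, cycles, lat_ms, speedup) in table2.items():
    derived_lat_ms = cycles / F_CLK_HZ * 1e3     # cycles -> milliseconds
    derived_speedup = cpu_ms / derived_lat_ms
    # both agree with the printed table to within rounding (< 1%)
    assert abs(derived_lat_ms - lat_ms) / lat_ms < 0.01, name
    assert abs(derived_speedup - speedup) / speedup < 0.01, name
```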

Fig. 6. The speedup of three applications with different numbers of nodes: (a) non-accelerated; (b) CGRA-accelerated.

For the standalone performance, we implement three versions on CPU, FPGA and CGRA respectively. The detailed comparison is given in Table 2. The power consumption of the Xeon CPU is measured by the Powerstat tool [1] with a wattsUp PRO watt meter. As we can see, both FPGA and CGRA show significant speedup and power advantages compared with the CPU. However, due to the limitation of PE number (512), Chameleon shows a lower speedup than the Virtex-6 FPGA, while offering a high efficiency that is almost 30-90 times that of the FPGA. We should mention that Chameleon's technology is 65nm while Virtex-6 is 40nm, and the area of Chameleon is only a quarter of Virtex-6. Due to the scaling principle [7], with a reduction of CGRA I/O number, more CGRA PEs can be integrated on board, which will bring an even better timing improvement.

Finally we test the applications in a CGRA-accelerated cluster environment. We rewrite the applications in MapReduce form, and make another version with the loops (the CIPs) compiled into CGRA execution packages and called by the exe_CGRA function with the key parameters defined. We take the non-accelerated single node case as the baseline, and test the applications with the Datanode number varied from one to four. We process 10^4 copies of the applications in order to raise the datasets to the Gigabyte level. The normalized speedup of the cluster is shown in Figure 6. We can observe that the relationship between the node number and the speedup is non-linear, for the I/O communication ratio rises relatively as the single-node compute time drops. However, we can see that the node number does not affect the effect of CGRA acceleration very much, for the PCI-Express bus brings fast communication and the parallelism on CGRAs remains stable.

6 CONCLUSION

In this paper, we present how we bring a CGRA accelerator into an MR-based system, with the hardware architecture and software programming model described. Dynamic programmability has been primarily achieved, and a considerable speedup and a remarkable energy efficiency are realized in both standalone and cluster cases. The power saving will be even more tempting if the cluster scale grows big enough. Future work lies in CGRA mapping optimization and a real-time scheduler with CGRA status feedback.

ACKNOWLEDGMENTS

This work is supported by the National Nature Science Foundation of China (No. 61274131), the International S&T Cooperation Project of China (No. 2012DFA11170), the Tsinghua Indigenous Research Project (No. 20111080997) and the China National High Technologies Research Program (No. 2012AA012701).

REFERENCES

[1] Powerstat for desktop. http://sourceforge.net/projects/powerstatfordes/.
[2] D. Borthakur. The Hadoop distributed file system: Architecture and design. Hadoop Project Website, 11:21, 2007.
[3] Y. Chongyong, Y. Shouyi, L. Leibo, and W. Shaojun. Compiler framework for reconfigurable computing architecture. IEICE Transactions on Electronics, 92(10):1284–1290, 2009.
[4] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[5] V. Govindaraju, C.-H. Ho, T. Nowatzki, J. Chhugani, N. Satish, K. Sankaralingam, and C. Kim. DySER: Unifying functionality and parallelism specialization for energy-efficient computing. IEEE Micro, (5):38–51, 2012.
[6] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, pages 260–269. ACM, 2008.
[7] M. Horowitz, E. Alon, D. Patil, S. Naffziger, R. Kumar, and K. Bernstein. Scaling, power, and the future of CMOS. In Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pages 7 pp. IEEE, 2005.
[8] I. Kuon and J. Rose. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–215, 2007.
[9] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. In Code Generation and Optimization, 2004. CGO 2004. International Symposium on, pages 75–86. IEEE, 2004.
[10] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, et al. A reconfigurable fabric for accelerating large-scale datacenter services. In Computer Architecture (ISCA), 2014 ACM/IEEE 41st International Symposium on, pages 13–24. IEEE, 2014.
[11] O. Segal, M. Margala, S. R. Chalamalasetti, and M. Wright. High level programming for heterogeneous architectures. arXiv preprint arXiv:1408.4964, 2014.
[12] D. Soderman and Y. Panchul. Implementing C designs in hardware: a full-featured ANSI C to RTL Verilog compiler in action. In Verilog HDL Conference and VHDL International Users Forum, 1998. IVC/VIUF. Proceedings., 1998 International, pages 22–29. IEEE, 1998.
[13] J. A. Stuart and J. D. Owens. Multi-GPU MapReduce on GPU clusters. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1068–1079. IEEE, 2011.
[14] B. Sukhwani, H. Min, M. Thoennes, P. Dube, B. Brezzo, S. Asaad, and D. E. Dillenberger. Database analytics: A reconfigurable-computing approach. IEEE Micro, 34(1):19–29, 2014.
[15] K. H. Tsoi and W. Luk. Axel: a heterogeneous cluster with FPGAs and GPUs. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pages 115–124. ACM, 2010.
[16] J. H. Yeung, C. Tsang, K. H. Tsoi, B. S. Kwan, C. C. Cheung, A. P. Chan, and P. H. W. Leong. Map-reduce as a programming model for custom computing machines. In Field-Programmable Custom Computing Machines, 2008. FCCM '08. 16th International Symposium on, pages 149–159. IEEE, 2008.
