I. INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have been used
to implement various types of logic functions. In particular, with the appearance of high-performance FPGAs and PLDs
such as the Virtex series from Xilinx[1] and the Stratix series from Altera[2],
research on High Performance Computing (HPC) using FPGAs is
becoming a more exciting field than before. In HPC, however,
an FPGA is used as an accelerator for the Central Processing
Unit (CPU), and there is a significant performance gap
between the FPGA and the CPU caused by the difference between
the configuration speed of the FPGA and the execution speed of
the CPU. To resolve this problem, we proposed the MPLD as a new
Programmable Logic Device (PLD) architecture with high-speed
reconfiguration[3]. The merits of the MPLD in HPC
are high-speed configuration and easy partial configuration.
These are achieved by a configuration method that is the same
as a write access to a conventional parallel memory.
In this paper, the problem of FPGAs in HPC and the
structure of the FPGA are described first. Next, merits of
0-7695-3770-7/70 $25.00 3770 IEEE
DOI 10.1109/IWIA.2008.12
III. FPGA
To clarify where the MPLD is useful, the structure of the FPGA and
its problems are described here. Figure 3
shows the basic structure of an FPGA. As shown in the figure, an FPGA
consists of LBs, CBs, SBs and IOBs[4].
Logic Block (LB)
The LB is the fundamental element for logic. There are
[Figures 1-4: images and captions not recovered. Surviving labels: I/O Block (IOB), Switch Block (SB), Logic Block (LB), Connection Block (CB).]
A. Problems of FPGA
The Switch Matrix consists of SBs and CBs and provides the
flexibility of wiring on an FPGA. However, Figure 3 shows a problem:
the area of the Switch Matrix, which includes the CBs and SBs, is
huge, which means that the cost of an FPGA is not low. As
another problem, configuration is not fast because of
A. Structure of MLUT
[Figures 6-8: images and captions not recovered. Surviving label: 2-port memory.]
B. Behavior of MLUT
To explain the behavior of the MLUT, the MLUT with 4
address-data pairs shown in Figure 7 is used as an example.
Configuration
Figure 9 shows the wires used for configuration in the MLUT. Since
the configuration method is the same as a write access to a
conventional parallel memory, MAD can be used to
select the MLUT to reconfigure, with the data provided
on MDATA during the memory access. The configuration of the
MLUT is provided as a truth table, as in a conventional LUT.
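The idea that configuration is an ordinary memory write can be sketched in a few lines of Python. This is a toy model, not the authors' design: the class name `MLUT`, its methods, the 4-bit address and data widths, and the example function are all assumptions made for illustration.

```python
class MLUT:
    """Toy model: a small memory whose contents double as a truth table."""

    def __init__(self, address_bits=4, data_bits=4):
        # Widths are assumptions matching the 4 address-data pair example.
        self.data_mask = (1 << data_bits) - 1
        self.words = [0] * (1 << address_bits)

    def write(self, address, data):
        """Configuration: an ordinary memory write (address plus data word)."""
        self.words[address] = data & self.data_mask

    def evaluate(self, logic_address):
        """Logic operation: the logic inputs form the address, the stored word is the output."""
        return self.words[logic_address]


def configure(mlut, truth_table):
    """Write one truth-table row per address, as a memory controller would."""
    for address, data in enumerate(truth_table):
        mlut.write(address, data)


mlut = MLUT()
# Hypothetical example function: output bit 0 is the AND of all four inputs.
configure(mlut, [1 if address == 0b1111 else 0 for address in range(16)])
```

After `configure` returns, calling `mlut.evaluate` with a 4-bit input pattern simply reads back the stored row, which is why reconfiguring costs no more than rewriting the memory.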
Behavior as logic circuit
Let us assume that the MLUT behaves as the logic circuit in
Figure 10. This logic circuit is expressed
as a truth table, which
is shown in Table I. In Table I, each input and output
corresponds to one bit of the address and data ports
of a conventional parallel memory. On configuration,
this truth table is written to the MLUT, then the MLUT
[Figures 9-11: images not recovered. Surviving labels: "Address-Data pair" and the caption "Behavior as Logic", which appears under both Figure 10 and Figure 11.]
Table I
TRUTH TABLE FOR LOGIC

       Input        |           Output
LAD0 LAD1 LAD2 LAD3 | LDATA0 LDATA1 LDATA2 LDATA3
   0    0    0    0 |      0      *      *      *
   0    0    0    1 |      0      *      *      *
   0    0    1    0 |      0      *      *      *
   0    0    1    1 |      0      *      *      *
   0    1    0    0 |      0      *      *      *
   0    1    0    1 |      0      *      *      *
   0    1    1    0 |      0      *      *      *
   0    1    1    1 |      1      *      *      *
   1    0    0    0 |      0      *      *      *
   1    0    0    1 |      0      *      *      *
   1    0    1    0 |      0      *      *      *
   1    0    1    1 |      1      *      *      *
   1    1    0    0 |      0      *      *      *
   1    1    0    1 |      0      *      *      *
   1    1    1    0 |      0      *      *      *
   1    1    1    1 |      1      *      *      *

[Figure residue: a figure labeled "Switch Matrix" and Figure 13; images not recovered.]

Table II
TRUTH TABLE FOR SWITCH MATRIX

       Input        |           Output
LAD0 LAD1 LAD2 LAD3 | LDATA0 LDATA1 LDATA2 LDATA3
   0    0    0    0 |      0      0      0      0
   0    0    0    1 |      1      0      0      0
   0    0    1    0 |      0      0      0      1
   0    0    1    1 |      1      0      0      1
   0    1    0    0 |      0      0      1      0
   0    1    0    1 |      1      0      1      0
   0    1    1    0 |      0      0      1      1
   0    1    1    1 |      1      0      1      1
   1    0    0    0 |      0      1      0      0
   1    0    0    1 |      1      1      0      0
   1    0    1    0 |      0      1      0      1
   1    0    1    1 |      1      1      0      1
   1    1    0    0 |      0      1      1      0
   1    1    0    1 |      1      1      1      0
   1    1    1    0 |      0      1      1      1
   1    1    1    1 |      1      1      1      1
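The LDATA0 column of Table I can be checked mechanically. The sketch below regenerates that column from one Boolean expression that happens to match it, (LAD0 OR LAD1) AND LAD2 AND LAD3. Because Figure 10 was not recovered, this expression is an inference from the table values and is not claimed to be the gate structure the authors drew. The function names are hypothetical.

```python
# LDATA0 column of Table I, rows ordered from LAD0..LAD3 = 0000 to 1111.
TABLE_I_LDATA0 = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]


def candidate(lad0, lad1, lad2, lad3):
    """One expression consistent with the column (an inference, see lead-in)."""
    return (lad0 | lad1) & lad2 & lad3


def bits(row):
    """Split a row index into (LAD0, LAD1, LAD2, LAD3); LAD0 is the most significant bit."""
    return tuple((row >> shift) & 1 for shift in (3, 2, 1, 0))


regenerated = [candidate(*bits(row)) for row in range(16)]
matches = regenerated == TABLE_I_LDATA0
```

Writing `TABLE_I_LDATA0` into sixteen consecutive addresses is all the configuration this function needs; the `*` entries for LDATA1 to LDATA3 can hold any value.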
C. Compilation of MPLD
The compiler environment for the MPLD is currently
under development. The structure of the MPLD differs from that of
the FPGA, so the conventional methods of compiling to an FPGA
cannot all be used for the MPLD. Figure 14 shows the conventional
method of compiling to an FPGA. To implement an application
on an FPGA, an HDL description of the algorithm is written
first. After functional simulation, the HDL description is
transformed into a netlist by synthesis. From the
netlist, mapping, placement and routing are performed. After that,
the implementation information is transferred to the FPGA as
a bitstream. Synthesis includes technology-independent processes and
technology-dependent processes. The processes that follow the dependent processes of synthesis are also technology-dependent.
An independent process does not depend on the technology.
This means the conventional process can be used in the
Switch Matrix
Assuming that the MLUT behaves as the switch matrix
in Figure 12, the switch matrix is
expressed as a truth table, which
is shown in Table II. In Table II, the inputs
LAD0, LAD1, LAD2 and LAD3 correspond to the
outputs LDATA1, LDATA2, LDATA3 and LDATA0, respectively.
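The routing just described is a fixed rotation of the four signals, so its truth table can be generated instead of written by hand. The following sketch assumes the same row ordering as Table II (LAD0 as the most significant address bit); the function names are hypothetical.

```python
def switch_matrix(lad0, lad1, lad2, lad3):
    """Return (LDATA0, LDATA1, LDATA2, LDATA3) for the routing in the text:
    LAD0 -> LDATA1, LAD1 -> LDATA2, LAD2 -> LDATA3, LAD3 -> LDATA0."""
    return (lad3, lad0, lad1, lad2)


def build_table():
    """Enumerate all 16 input patterns and the word each one must store."""
    rows = []
    for address in range(16):
        lad = tuple((address >> shift) & 1 for shift in (3, 2, 1, 0))
        rows.append((lad, switch_matrix(*lad)))
    return rows


table = build_table()
# Column view of the outputs, convenient for comparing against Table II.
ldata_columns = [[outputs[i] for _, outputs in table] for i in range(4)]
```

Any other routing pattern is obtained by changing the tuple returned from `switch_matrix` and rewriting the sixteen words, which is how one memory block can serve as either logic or wiring.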
[Figure 14: Configuration to FPGA. Figures 15 and 16: images and captions not recovered.]
V. DESIGN OF PROTOTYPE MPLD
A. MLUT
Table III
I/O OF PROTOTYPE MPLD

Name    I/O            Explanation
MAD     Input          Input for configuration
MDATA   Input/Output   Input/output for configuration
LAD     Input          Input for logic
LDATA   Output         Output for logic
WE      Input          Input to control writing
RE      Input          Input to control reading
PRE     Input          Input to control precharge
SEL     Input          Input for context switch
CLK     Input          Clock for D-FF

cells. Because the MLUT has SEL for context switching, the 2-port
memory block has two areas for configuration.
Figure 19 shows the structure of the 2-port SRAM cell in
the MLUT and Table IV shows the I/O of the 2-port SRAM
cell. BL is the input/output for memory, qBL is the complement
of BL, and qRL is the output for logic. This 2-port
SRAM cell uses one nMOS transistor fewer than
a conventional 2-port SRAM cell, because
the logic function of the MLUT needs only an output for
logic.
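As a rough illustration of the SEL input, the sketch below models an MLUT with two configuration areas and lets SEL choose which one the logic side reads. The class name, the explicit `plane` argument on writes, and the rule that SEL affects only logic reads are assumptions; the source states only that the 2-port memory block has two areas for configuration.

```python
class TwoContextMLUT:
    """Toy model of an MLUT with two configuration areas (contexts)."""

    def __init__(self, rows=16):
        # Two independent truth tables, one per context.
        self.planes = [[0] * rows, [0] * rows]

    def write(self, plane, address, data):
        """Configure one row of the chosen area (plane selection is an assumption)."""
        self.planes[plane][address] = data

    def evaluate(self, sel, logic_address):
        """Logic read: SEL picks the active area, the inputs pick the row."""
        return self.planes[sel][logic_address]


mlut = TwoContextMLUT()
for address in range(16):
    mlut.write(0, address, 1 if address == 0b1111 else 0)  # context 0: 4-input AND
    mlut.write(1, address, 1 if address != 0b0000 else 0)  # context 1: 4-input OR

and_result = mlut.evaluate(0, 0b0101)
or_result = mlut.evaluate(1, 0b0101)
```

In this model, switching function is a change of one select signal, with no memory writes at all, provided both areas were loaded beforehand.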
[Figures 17 and 19: images not recovered. Surviving labels: MUX (IN0, IN1, ..., IN7), 16*8, 8*2, 16, WL, SEL, BL, qBL, qRL.]

Table IV
I/O OF 2-PORT SRAM CELL

Name   I/O            Explanation
WL     Input          Input for configuration
BL     Input/Output   Input/output for configuration
qBL    Input/Output   Opposite input/output of BL
LE     Input          Input for logic
qRL    Output         Output for logic
B. Row Decoder
The row decoder behaves in the same way as the row-select decoder of a conventional parallel memory. Figure 20 shows the
structure of the row decoder. The address input of the row decoder is
divided into an upper address and a lower address. The upper address
is used to select an MLUT. The lower address is used to
select an internal row of the MLUT.
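The upper/lower address split can be sketched as plain bit slicing. The function name is hypothetical, and the default of 4 lower bits is an assumption based on the sixteen rows per MLUT implied by the sixteen write accesses quoted in the evaluation; the actual bit widths of the prototype are not given in the recovered text.

```python
def split_row_address(address, lower_bits=4):
    """Split a row address into (MLUT index, internal row).

    lower_bits is an assumption: 4 bits address the 16 rows of one MLUT.
    """
    mlut_index = address >> lower_bits
    internal_row = address & ((1 << lower_bits) - 1)
    return mlut_index, internal_row


def join_row_address(mlut_index, internal_row, lower_bits=4):
    """Inverse of split_row_address, useful when generating configuration writes."""
    return (mlut_index << lower_bits) | internal_row


example = split_row_address(0b10110011)
```

With this layout, consecutive addresses walk through all rows of one MLUT before moving to the next, so a whole MLUT is reconfigured by a burst of sequential writes.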
C. Column Decoder
Figure 21 shows the structure of the column decoder. As shown in the
figure, the column decoder consists of a decoder to select the column
and an R/W unit for reading and writing data. Figure 22 shows
the structure of the R/W unit. As shown in the figure, the R/W unit consists
of a CM unit, a PRE unit, a Read unit and a Write unit. The functions
of each unit are as follows.
CM Unit
The CM Unit is a sense amplifier. The sense amplifier is of the current
mirror type and amplifies the voltage of BL.
PRE Unit
The PRE Unit controls precharge. It adopts the
VD/2 precharge method and precharges BL and qBL to
VD/2.
[Figures 18 and 20-23: images not recovered. Figure 18 is captioned "8 to 1 MUX". Surviving labels: DX, IN7, C7, COL_AD0, COL_AD3, COL_AD4, COL_AD7, OUT000, OUT015, OUT224, OUT239, OUT240, OUT255, BL, qBL, WE, RE, PRE, IN/OUT, enable from decoder, Row decoder, Column Decoder, CM Unit, PRE Unit, Read Unit, Write Unit.]
and the number of metal layers used for wiring was three.
From the evaluation results, the latency of a memory write access was
12.8 nsec. This means that the configuration rate of the MPLD
is about 78.1 MHz, because it depends on the memory write
access speed. Since configuring each MLUT requires
sixteen memory write accesses and the prototype
MPLD consists of 64 MLUTs, the achieved configuration
time is about 6.6 µsec for the whole prototype MPLD. The transfer
rate of configuration data in the prototype MPLD
is 48 bit / 12.8 nsec = 3.75 Gbit per second (Gbps). The transfer
rate of configuration data in Altera FPGAs using
Active Parallel(AS) configuration is 16 bit / 50 nsec = 0.32
Gbps[6]. So, the configuration speed of the prototype MPLD
is about 11.7 times higher than that of the AS configuration used for
Altera FPGAs.
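The rate figures in this paragraph follow from simple division, which the sketch below reproduces. It works in bits per nanosecond (1 bit/ns equals 1 Gbit/s) and uses only numbers quoted in the text; the variable names are illustrative.

```python
# Values quoted in the text.
WRITE_LATENCY_NS = 12.8      # latency of one memory write access
MPLD_BITS_PER_WRITE = 48     # configuration bits moved per write access
ALTERA_BITS_PER_CYCLE = 16   # bits per transfer in the compared Altera scheme
ALTERA_CYCLE_NS = 50.0       # time per transfer in the compared Altera scheme

# One write every 12.8 ns, expressed in MHz (1000 / period in ns).
write_rate_mhz = 1e3 / WRITE_LATENCY_NS

# Transfer rates in bits per nanosecond (numerically equal to Gbit/s).
mpld_bits_per_ns = MPLD_BITS_PER_WRITE / WRITE_LATENCY_NS
altera_bits_per_ns = ALTERA_BITS_PER_CYCLE / ALTERA_CYCLE_NS

speedup = mpld_bits_per_ns / altera_bits_per_ns

print(f"write rate  : {write_rate_mhz:.1f} MHz")
print(f"MPLD rate   : {mpld_bits_per_ns:.2f} bit/ns")
print(f"Altera rate : {altera_bits_per_ns:.2f} bit/ns")
print(f"ratio       : {speedup:.1f}x")
```

The ratio does not depend on the unit chosen, since both rates are scaled the same way.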
Read Unit
The Read Unit consists of a 3-state buffer and a powerful buffer
to amplify the output signal for reading.
Write Unit
The Write Unit consists of a 3-state buffer and a powerful
buffer to amplify the input signal for writing.
VI. EVALUATION
Table V
EVALUATION RESULTS

Behavior           Prototype MPLD latency
Read               16.4 nsec
Write              12.8 nsec
32bit Counter      9.35 nsec
32bit Full Adder   121.6 nsec
We implemented a prototype MPLD to confirm its function, using ROHM 0.18 µm CMOS
technology with five metal layers, and confirmed its functions as a memory and as a PLD
by configuring it as a 32-bit counter as an example application. The evaluation results are shown in Figure 23 and Table
V. The memory capacity of the prototype MPLD
was 49152 bits, and the core area was 1767.54 × 1690.96 µm²
Acknowledgement
The VLSI chip in this study has been fabricated in the
chip fabrication program of VLSI Design and Education
Center(VDEC), the University of Tokyo in collaboration
with Rohm Corporation and Toppan Printing Corporation.
REFERENCES
[1] http://www.xilinx.com/
[2] http://www.altera.com/
[3] Naoki Hirakawa, Masanori Yoshihara, Masayuki Sato, Kazuya
Tanigawa and Tetsuo Hironaka, Low Cost PLD with High
Speed Partial Reconfiguration, ITC-CSCC 2008, July 6-9,
2008, to appear
[4] Toshinori Sueyosi, Hideharu Amano, Reconfigurable System, Ohmsya (in Japanese), 2005
[5] Masanori Yoshihara, Naoki Hirakawa, Kazuya Tanigawa,
Tetsuo Hironaka and Masayuki Sato, Implementation of
Memory(MPLD) with the Ability to Work as a Reconfigurable Device, IEICE Technical Report RECONF2007-16 (in
Japanese), pp.7-12, 2007
[6] http://www.altera.com/support/devices/configuration/cfgcompare.html