A Parallel Hardware-Software System For Signal Processing Algorithms

Mathias Kortke, Jan Müller, Rainer Schaffer, Sebastian Siegel, Renate Merker
Dresden University of Technology, Germany
Department of Electrical Engineering and Information Technology
E-mail: <kortke, muellerj, schaffer, siegel, merker>@iee1.et.tu-dresden.de
Jürgen Kelber
University of Applied Sciences Schmalkalden, Germany
Department of Electrical Engineering
E-mail: kelber@e-technik.fh-schmalkalden.de
Abstract
This paper presents the implementation of a parallel hardware-software system for several digital signal processing algorithms. Besides describing the developed hardware components, it focuses on the software part: the implemented driver, libraries, and user interfaces. One application of the hardware-software system is the reconstruction of tomographic images, for which the interaction of the hardware and software parts is illustrated.
1. Introduction
Parallel architectures such as parallel processor arrays bear the potential to produce scalable and efficient designs for computationally intensive applications in signal processing, especially those with real-time demands. A parallel processor array has a piecewise homogeneous structure with respect to the processor functions in the processor elements (PEs) and the interconnection network between the PEs. This structure provides high parallelism, intensive pipelining, and distributed memories. The systematic design process of processor arrays [13, 16, 21, 22, 25] mainly consists of the space-time transformations (allocation and scheduling) and the processor space partitioning. In the processor allocation, the elements of the processor space for the evaluation of the algorithm's operations are determined. In the scheduling, evaluation times are assigned to the operations. The result of these space-time transformations is a full-size processor array [3, 4, 5, 6, 7, 14, 15, 28]. Usually, the full-size array cannot be implemented on a chip, so the dimensions and the number of processor elements are reduced in the design of the array [2, 9, 18, 24, 29].

This research was supported by the German Research Foundation (DFG) within project A1 of the Collaborative Research Center (SFB) 358.
This paper describes a hardware-software system that accelerates computation-intensive algorithms, e.g. from the field of image processing. We show that applying theoretical results and methods of processor array design enables a systematic and efficient design that copes with different problems and algorithms, for example in image processing and computer graphics.
Following the need for hardware acceleration of these time-consuming algorithms, we implemented a prototype system for this task. The heart of this system is the reconfigurable processor array with 32 × 2 processor elements (RecoChip [19, 20, 23]), manufactured as an ASIC. This chip was designed for maximum flexibility and extensibility, so that a wide variety of algorithms can be implemented and several chips can easily be connected to enlarge the processor array. This flexibility is supported by the reconfigurability of the hardware controller (RecoController), realized as an FPGA. In addition, the accelerator board (RecoBoard) contains the necessary memory and logic circuitry. The RecoBoard is mounted on a commercial general-purpose prototype board [8]. The prototype board is connected to the host system through a PCI bus interface using a PCI bus master controller [1].

The RecoController software, a graphical user interface, and a software development system (RecoIDE) [12] complete the hardware-software system.
Proceedings of the International Conference on Parallel Computing in Electrical Engineering (PARELEC04)
0-7695-2080-4/04 $ 20.00 IEEE
[Figure 1: block diagram. Host side: motherboard with host CPU, chip set, main memory, graphic controller, PS/2 controller, and network controller (Internet); user interface (input, visualisation) and driver / software library. Connected via the PCI bus: prototype board (H+K) with the AMCC S5933, the RecoController (FPGA), the RecoChip (ASIC), and the memories A, B, C, D (labelled 16M*32bit, 16k*32bit, 1M*16bit, 32k*32bit).]
Figure 1. The parallel hardware-software system (RecoBoard, RecoChip, and RecoController)
2. Chip Design
The main purpose of the hardware accelerator is the fast computation of multiple sums of products, e.g. $\sum_j x(f_x(i,j)) \cdot y(f_y(i,j))$. These calculations are the most time-consuming operations of typical signal processing algorithms. Parallel processing on array architectures is suitable for a significant acceleration of such computations.

The parallel processor array RecoChip was developed by applying systematic methods for processor array design. This development was supported by several tools, such as HLDESA [17]. To maximize the number of processors (32 × 2), the largest parts of the PEs, the multipliers, were minimized. Therefore, serial multipliers (32 × 8 bit within 8 cycles) are implemented in each processor element (PE).
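As a minimal sketch of the two ideas above, the following Python fragment evaluates one sum of products and models a bit-serial multiplier that consumes one bit of an 8 bit operand per cycle. The index functions, the function names, and the shift-and-add scheme are illustrative assumptions; the actual PE design is not specified at this level of detail.

```python
def sum_of_products(x, y, f_x, f_y, i, j_range):
    """Evaluate sum_j x(f_x(i, j)) * y(f_y(i, j)) for a fixed index i."""
    return sum(x[f_x(i, j)] * y[f_y(i, j)] for j in j_range)


def serial_multiply(a32, b8):
    """Shift-and-add product of a 32 bit value and an unsigned 8 bit value.

    One bit of b8 is processed per loop pass, so the product is complete
    after 8 passes ("cycles"). Assumed scheme, for illustration only.
    """
    if not 0 <= b8 < 256:
        raise ValueError("b8 must be an unsigned 8 bit value")
    acc = 0
    for cycle in range(8):
        if (b8 >> cycle) & 1:
            acc += a32 << cycle
    return acc


# Example: x is indexed by i + j and y by j (hypothetical index functions).
x = [1, 2, 3, 4, 5]
y = [10, 20, 30]
result = sum_of_products(x, y, lambda i, j: i + j, lambda i, j: j, 1, range(3))
```

In an array implementation, each PE would accumulate one such sum while the operands stream through the interconnection network.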
Knowing the functionality of the PEs, one can follow two ways to implement them in silicon. In the first, the PEs are designed manually, coded in a hardware description language, and then synthesized using commercial tools. The second way is to use a module generator to produce the PEs. This way is very promising because of the large number of PEs on a chip, and it probably produces much better results. This approach could be combined with an automatic generation of on-chip memories, thus continuing the regular structure down to the chip layout and improving the data throughput at the interfaces. Since the aim was a prototype implementation, the first way was chosen in order to reduce the design costs.
The chip has a gate count of about $5 \times 10^5$. It was implemented in a 0.35 µm CMOS technology and the die area is 21.34 mm². The chip was designed for a typical clock frequency of 50 MHz. All data interfaces have a width of 32 bit. Because of the large amount of data to be read at the serial data input, this port samples at both clock edges. In this way the number of pins could be limited to 160.
3. Board Architecture
Following the restrictions and requirements of the algorithm analysis, all parts of the hardware system had to be designed in close coordination. This was of special importance (and yielded strong benefits) for the design of the FPGA as the hardware controller, where the communication with the processor array, the hardware (PCI and board) drivers, and the user interface had to be reconciled. Apart from control and communication, the implementation of the pre- and post-processing algorithms in the reconfigurable hardware and in software requires tight cooperation within the overall design.
On the accelerator board we use two SRAM memories (the C-memory for control, i.e. program instructions and control codes, and the D-memory for input data), a dual-port SRAM for frequently updated data (B-memory), and an SDRAM for serial input data (A-memory), which runs at double clock rate to provide the needed word length. The memory chips, the hardware controller, and the processor array are connected by three 32 bit data buses. The host data transfers are accomplished using a PCI interface which runs in DMA mode, providing a (theoretical) bandwidth of 1 Gbit/s. The board size is about 220 × 100 mm²; it was produced using a six-layer technology. Figure 1 gives an overview of the hardware parts of the host (left) and the accelerator board components (right) with the RecoBoard including the RecoChip, the RecoController, and the memories.
The smooth interaction between all components and interfaces on the RecoBoard is ensured by the RecoController, which is specified in VHDL and implemented within the FPGA, a Xilinx Virtex XCV1000E. The FPGA design of the controller has the major advantage that it can be adapted to new demands by simply reconfiguring the FPGA. This is necessary when implementing new algorithms on the RecoBoard. Furthermore, the FPGA can be configured to perform computational tasks, e.g. serial computations, for which the RecoChip was not designed.
The RecoController is divided into units such that each unit operates on one component of the RecoBoard. The units are adapted to the specifications of the components they control; e.g. the unit which manages the SDRAM ensures that the memory is refreshed periodically. Other units control the three SRAM components, the PCI interface, and the instructions to the RecoChip.
The behavior of the RecoController is organized as a hierarchy of finite state machines (FSMs). The top-level FSM provides the states for the main functions of the RecoBoard, such as the initialization and execution of algorithms. Another FSM is responsible for the distinction between different algorithms. The execution of an algorithm is controlled by a common controller that is fed with algorithm-specific instructions.
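The hierarchy described above can be sketched in software as follows. This is only an illustrative model under stated assumptions: the state names, the instruction names, and the `Controller` class are hypothetical and do not reproduce the VHDL design.

```python
from enum import Enum, auto


class TopState(Enum):
    """Assumed top-level states for the main functions of the board."""
    IDLE = auto()
    INIT = auto()
    EXECUTE = auto()


# Hypothetical algorithm-specific instruction lists fed to the common controller.
PROGRAMS = {
    "FBP": ["load_filter", "filter", "backproject"],
    "ART": ["forward_project", "error", "correct"],
}


class Controller:
    def __init__(self):
        self.state = TopState.IDLE
        self.algorithm = None
        self.trace = []  # executed (algorithm, instruction) pairs

    def initialize(self, algorithm):
        # Second-level decision: distinguish between the algorithms.
        if algorithm not in PROGRAMS:
            raise ValueError(f"unknown algorithm: {algorithm}")
        self.algorithm = algorithm
        self.state = TopState.INIT

    def execute(self):
        if self.state is not TopState.INIT:
            raise RuntimeError("initialize() must be called first")
        self.state = TopState.EXECUTE
        # Common controller: the same loop serves every algorithm.
        for instruction in PROGRAMS[self.algorithm]:
            self.trace.append((self.algorithm, instruction))
        self.state = TopState.IDLE


controller = Controller()
controller.initialize("FBP")
controller.execute()
```

Keeping the execution loop independent of the algorithm mirrors the idea that only the instruction stream changes when a new algorithm is added.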
Further details on systematic controller design can be
found in [10]. Figure 2 shows the prototype board carrying
the RecoBoard in the host computer.
Figure 2. RecoBoard on the prototype board
4. System Environment
The hardware of the system, consisting of the RecoBoard with the RecoChip and the RecoController, is accompanied by a software environment. This incorporates the RecoBoard drivers for the operating systems Linux and Windows NT/XP and the software library libreco [27], which contains basic functions for the operation of the RecoBoard. Great importance was attached to a high degree of portability among the operating systems. Therefore, the compatibility layer between the library libreco and the device drivers is very thin. The software development system RecoIDE [12] provides a graphical user interface implemented as a Tcl/Tk front end.
Figure 3 shows the parts of the software environment in
hierarchical layers.
[Figure 3: layer diagram, from top to bottom: User Interface; RecoIDE (Application) with Script Processor and Plugins; Tcl Interface, C Interface, C++ Interface; libreco (C++ Software Library); Compatibility Layer; C++ Standard Library (System Library) and Tcl Library; Device Drivers: reco.sys (Windows NT/XP), amccS5933.o and reco.o (Linux). Legend: system independent, system dependent, system.]
Figure 3. Software environment
4.1. RecoBoard Driver
The interface between the RecoBoard hardware and software is realized by operating-system-specific device drivers. For the operating systems Windows NT/XP, an existing device driver was modified.
The development of a device driver for the Linux operating system resulted in new concepts [11]. For flexibility and a high degree of reuse, the driver was divided into two parts (modules). The module amccS5933.o encapsulates the functionality of the bus master controller AMCC S5933 [1] from Applied Micro Circuits Corporation (AMCC) and the control and management of the PCI bus. This interface is restricted to the kernel space and is used only by other kernel modules. The module operates independently of other modules and could be used for hardware applications other than the RecoBoard as well.

The second module, reco.o, comprises all functions necessary for the operation of the RecoBoard and provides an interface for the communication with the upper software layer. It requires the low-level functions of the amccS5933.o module. To achieve high flexibility and scalability of the RecoBoard driver for future applications, it was designed to support multiple (up to 16) RecoBoards at the same time. The interaction of the applications with the device drivers is performed via the device entries /dev/reco<0-15> and /dev/recomail<0-15> (see below) by the standard POSIX file operations open(), close(), read(), write(), poll() or select(), and ioctl(). The function ioctl() is used for the submission of commands to the RecoController and the RecoChip.
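A minimal sketch of how an application might address these device entries through the POSIX file interface is given below. It assumes that `<0-15>` denotes the board index appended to the entry name; the helper names are hypothetical, and the ioctl request codes are not given here, so `request` is a placeholder that the caller would have to supply.

```python
import fcntl
import os

MAX_BOARDS = 16  # the driver supports up to 16 RecoBoards


def device_paths(board):
    """Return the data and mail box device entries for one board index."""
    if not 0 <= board < MAX_BOARDS:
        raise ValueError("board index must be in 0..15")
    return f"/dev/reco{board}", f"/dev/recomail{board}"


def read_block(board, nbytes):
    """Read up to nbytes from the data device using open/read/close."""
    data_path, _ = device_paths(board)
    fd = os.open(data_path, os.O_RDONLY)
    try:
        return os.read(fd, nbytes)
    finally:
        os.close(fd)


def send_command(fd, request, arg=0):
    """Submit a command via ioctl(); 'request' is a hypothetical code."""
    return fcntl.ioctl(fd, request, arg)
```

Only `device_paths` can be exercised without the hardware; the other two helpers merely show which standard calls are involved.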
For the data transfer between the host and the RecoBoard, two different modes are provided: the fast DMA mode and the mail box mode. The DMA mode allows a high data throughput of up to 100 MByte/s (depending on the given PCI chip set and the utilization of the PCI bus). The measured average data rate for the system is 60-70 MByte/s. The mail box mode is used for the status reports. Two mail boxes (an incoming and an outgoing one) store up to 1000 messages with a length of 32 bit each in ring buffers.
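The mail box storage can be illustrated by a small ring buffer model. This is a sketch under assumptions: the class and method names are hypothetical, and the behavior on overflow (rejecting the new message) is an assumption, since the driver's policy is not stated here.

```python
class MailBox:
    """Ring buffer holding up to 'capacity' messages of 32 bit each."""

    def __init__(self, capacity=1000):
        self.slots = [0] * capacity
        self.head = 0   # index of the oldest stored message
        self.count = 0  # number of stored messages

    def put(self, message):
        """Store a message; returns False if the buffer is full (assumed policy)."""
        if self.count == len(self.slots):
            return False
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = message & 0xFFFFFFFF  # keep 32 bit
        self.count += 1
        return True

    def get(self):
        """Return the oldest message, or None if the buffer is empty."""
        if self.count == 0:
            return None
        message = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return message


incoming = MailBox()
accepted = [incoming.put(n) for n in range(1001)]
```

With the default capacity, the first 1000 messages are accepted and the 1001st is rejected until a message has been read.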
4.2. Software Libraries
Basic RecoBoard commands are implemented in the library libreco with a C, a C++, and a Tcl interface, e.g. for loading the FPGA configuration streams, transferring data to and from the board memories, and transmitting commands to the processing units (the RecoController and the RecoChip). With the help of the software environment, these commands (as parts of different algorithms) are easy to handle.

The different interfaces (C, C++, and Tcl) to the applications access the same basic functions. This results in low programming and update costs.

Other libraries contain algorithm-specific macros consisting of Tcl scripts. The modularity of the software development system allows easy extension and software reuse for further algorithms realized on the RecoBoard.
4.3. User Interface: RecoIDE
The RecoIDE simplifies the use and the development of RecoBoard applications. It allows both the interactive execution of RecoBoard commands and the execution of entire programs.

The interactive execution is performed by a script processor. This script processor executes Tcl programs and allows single-step operation, which is very useful during debugging.

To extend the application, a powerful plug-in mechanism was established. Plug-ins enable the extension of the script processor and of the RecoIDE itself. Such plug-ins realize the data input and output, the data conversion and scaling, and the visualization.

At present, tomographic reconstruction algorithms (such as filtered back projection and algebraic reconstruction technique), 2D filtering, matrix-vector multiplication, and the 2D projection of a rotating three-dimensional object are implemented on this hardware-software system.
5. Implementation of Tomographic Reconstruction Algorithms
The application of the hardware-software system takes
place within a demonstrator, where it serves as a client
within a conference system. The task within that system
is to grab projections of an object obtained by a computer
tomograph and to reconstruct slice images of this object.
Furthermore, our system will be used for tomographic reconstruction at the Institute for Biomedical Technology of the Dresden University of Technology as an alternative to the former software solution. In addition, its application to the optical inspection of two-phase flows in cooling circuits is being examined. There it is essential to provide the resources needed to process huge amounts of data. The reconstruction of a few hundred images per second is necessary in order to allow a quantitative analysis of the portion of the gaseous phase and its flow properties.

For the implementation of the reconstruction of tomographic images we selected two commonly used algorithms: the filtered back projection (FBP) and the algebraic reconstruction technique (ART).
5.1. Algorithm Analysis
The principle of reconstruction algorithms is illustrated in Figure 4. Here, a two-dimensional slice image $b$ with the pixels $b(i,j)$ is calculated from a given set of $M$ projections $p_m(k)$ and their weighting coefficients $a(i,j,k,m)$. The variables $m$ and $k$ specify the angle and the ray of the projection value $p_m(k)$.
The FBP algorithm is given by

$$g(k,m) = 2^{-R} \sum_{n=\max(-k,\,-H+1)}^{\min(K-k,\,H)-1} p_m(k+n) \cdot h(n), \qquad (5.1)$$

$$b(i,j) = \mathrm{norm}\left[ \sum_{k=0}^{K-1} \sum_{m=0}^{M-1} a(i,j,k,m) \cdot g(k,m) \right], \qquad (5.2)$$

with $0 \le k < K$, $0 \le m < M$, $0 \le i, j < N$, and the signed 16 bit filter coefficients $h(n) = -2^{14} / \left(4n^2 - 1\right)$.
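[Figure 4: sketch of a slice image with pixels $b(i,j)$ (axes $i$, $j$), the projection values $p_m(k)$ (ray $k$, angle $m$), and the weighting coefficients $a(i,j,k,m)$.]
Figure 4. Principle of tomographic reconstruction algorithms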
Originally, these calculations are performed using floating-point numbers. To avoid overflow, the algorithms are adapted to the 32 bit integer arithmetic of the RecoChip by introducing the normalization parameters R, S, T, and U. The resulting data width for a, p, and b is 8 bit unsigned. The function norm[] reduces the data width from 32 to 8 bit.
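The integer form of the FBP can be sketched as follows. The sketch rests on several assumptions that are not stated explicitly here: norm[] is modelled as clamping to the range 0..255, the scaling by the normalization parameter R is modelled as an arithmetic right shift, the sign of h(n) is chosen so that h(0) is positive, and the weighting coefficients are passed as a callable `a(i, j, k, m)`.

```python
def norm(value):
    """Assumed behaviour of norm[]: clamp to the unsigned 8 bit range."""
    return max(0, min(255, value))


def filter_coefficient(n):
    """h(n) = -2^14 / (4 n^2 - 1), rounded to an integer (h(0) = 2^14)."""
    return round(-(1 << 14) / (4 * n * n - 1))


def filter_projection(p_m, H, R):
    """Equation (5.1) for one projection angle m; returns g(k) for all k."""
    K = len(p_m)
    g = []
    for k in range(K):
        lo = max(-k, -H + 1)
        hi = min(K - k, H) - 1
        acc = sum(p_m[k + n] * filter_coefficient(n) for n in range(lo, hi + 1))
        g.append(acc >> R)  # arithmetic shift models the scaling by R
    return g


def back_project(a, g, N):
    """Equation (5.2); g[m][k] holds the filtered projections."""
    M, K = len(g), len(g[0])
    return [[norm(sum(a(i, j, k, m) * g[m][k]
                      for k in range(K) for m in range(M)))
             for j in range(N)]
            for i in range(N)]


# Filtering a unit impulse reproduces the filter coefficients themselves.
impulse_response = filter_projection([0, 0, 1, 0, 0], H=5, R=0)
```

Note that the right shift rounds negative values toward minus infinity; whether the hardware rounds in the same way is not specified here.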
Better results, characterized by sharper-edged images, are achieved by the iterative ART algorithm, specified by the equations (5.3) to (5.7):
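$$f(k,m)^{(s)} = 2^{-S} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a(i,j,k,m) \cdot b(i,j)^{(s-1)}, \qquad (5.3)$$

$$d(k,m)^{(s)} = 2^{U} \cdot p_m(k) - f(k,m)^{(s)}, \qquad (5.4)$$

$$e(i,j)^{(s)} = \sum_{k=0}^{K-1} \sum_{m=0}^{M-1} a(i,j,k,m) \cdot d(k,m)^{(s)}, \qquad (5.5)$$

$$c(i,j)^{(s)} = 2^{-T} \cdot e(i,j)^{(s)} \cdot r^{-1}(i,j), \qquad (5.6)$$

$$b(i,j)^{(s)} = \mathrm{norm}\left[ \max\left( 0,\; b(i,j)^{(s-1)} + c(i,j)^{(s)} \right) \right], \qquad (5.7)$$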
with $0 \le k < K$, $0 \le m < M$, $0 \le i, j < N$, and the iteration counter $s$. Our experiments show a good convergence of the iterative procedure for $s = 10$. Furthermore, $f$ denotes an approximated forward projection, $d$ an error projection, $e$ an error image, and $c$ a correction image. The data $r^{-1}(i,j)$ serve for three different normalization methods $q$ of the error image $e$, where

$$r^{-1}(i,j) = \begin{cases} 2^{7} & \text{for } q = 0 \\ 2^{7q+15} \Big/ \sum_{k,m} a^{q}(i,j,k,m) & \text{for } q \in \{1, 2\} \end{cases}$$

if $K = M = 64$. The $r^{-1}$ data and the pixels $b^{(s)}$ are 8 bit unsigned, whereas all intermediate data $c$, $d$, $e$, and $f$ are 32 bit signed.
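One ART iteration can be sketched in integer arithmetic as shown below. The sketch makes several assumptions: the parameters S and T scale down (modelled as arithmetic right shifts), U scales the projection values up (a left shift), the error projection is the scaled measurement minus the forward projection, norm[] clamps to 0..255, and the weighting coefficients are supplied as a callable. All function and argument names are hypothetical.

```python
def norm(value):
    """Assumed behaviour of norm[]: clamp to the unsigned 8 bit range."""
    return max(0, min(255, value))


def art_iteration(a, p, b_prev, r_inv, S, T, U):
    """One pass of equations (5.3) to (5.7).

    a(i, j, k, m): weighting coefficients, p[m][k]: projection values,
    b_prev[i][j]: image of the previous iteration, r_inv[i][j]: r^-1 data.
    """
    N, M, K = len(b_prev), len(p), len(p[0])
    d = [[0] * K for _ in range(M)]
    for m in range(M):
        for k in range(K):
            # (5.3) approximated forward projection
            f = sum(a(i, j, k, m) * b_prev[i][j]
                    for i in range(N) for j in range(N)) >> S
            # (5.4) error projection
            d[m][k] = (p[m][k] << U) - f
    b = [[0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            # (5.5) error image
            e = sum(a(i, j, k, m) * d[m][k]
                    for k in range(K) for m in range(M))
            # (5.6) correction image
            c = (e * r_inv[i][j]) >> T
            # (5.7) non-negative 8 bit update
            b[i][j] = norm(max(0, b_prev[i][j] + c))
    return b


# Tiny example: one pixel, one ray, unit weights, no scaling.
unit = lambda i, j, k, m: 1
b1 = art_iteration(unit, [[4]], [[0]], [[1]], S=0, T=0, U=0)
b2 = art_iteration(unit, [[4]], b1, [[1]], S=0, T=0, U=0)
```

In this degenerate example the image reproduces the measured value after the first pass and the second pass leaves it unchanged, which is the fixed-point behavior expected from an error-correcting iteration.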
All the normalization parameters depend on the quality of the tomograph ($K$, $M$) and the size of the reconstructed slice images ($N$). Initial values can be estimated and are stored in the library libtomograph. As an example, this estimation is shown for the parameter $R$, where $H = K = M = N = 64$ holds. With (5.2), a maximum for $b(i,j)$ leads to $2^{31} = K \cdot M \cdot 2^{8} \cdot 2^{G}$ and $G = 11$. From (5.1) it follows that $2^{G} = 2^{11} = 2^{-R} \cdot 2^{8} \cdot 2^{14}$ with $R = 11$. By similar estimation, we choose $S = 15$, $T = 21$, and $U = 8$.

Figure 5. Laboratory tomograph
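The estimate for R can be retraced with a few lines of bit-width arithmetic. The variable names are chosen for illustration; the widths follow the stated data formats (8 bit unsigned for a and p, coefficients h(n) bounded by 2^14, and a signed 32 bit accumulator with 31 magnitude bits).

```python
from math import log2

K = M = 64      # rays per projection and number of projections
a_bits = 8      # a(i, j, k, m) is 8 bit unsigned
p_bits = 8      # p_m(k) is 8 bit unsigned
h_bits = 14     # |h(n)| is at most 2^14
acc_bits = 31   # magnitude bits of a signed 32 bit value

# (5.2): 2^31 = K * M * 2^8 * 2^G, solved for G
G = acc_bits - int(log2(K * M)) - a_bits

# (5.1): 2^G = 2^-R * 2^8 * 2^14, solved for R
R = p_bits + h_bits - G
```

Both values evaluate to 11, in agreement with the estimate given in the text.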
5.2. Mode of Operation
Besides the hardware-software system, our demonstrator comprises a laboratory tomograph, shown in Figure 5. The laboratory tomograph is able to grab $M = 64$ projections of $K = 64$ rays with a single rotation. The 64 × 64 projection data are transferred by a data capturing board to the main memory of the host computer. The hardware-software system is designed to calculate slice images with $N \times N$ pixels, where $N \in \{32, 64, 128\}$, from the captured data. For comfortable operation of the demonstrator, a tomography plug-in [26] was developed for the RecoIDE. This plug-in enables the interactive control of the RecoBoard and the stepwise reconstruction of tomographic images. The reconstruction steps are briefly described in the following.
The processing hierarchy comprises three phases: the initialization, the parameter transfer, and the start of the reconstruction algorithm combined with the transfer of the grabbed projection values. Depending on the slice image size, different sets of weighting coefficients $a(i,j,k,m)$ are transferred into the A-memory in the initialization phase. After the choice of the tomography algorithm, the normalization parameters R, S, T, and U are loaded into the RecoChip, the program instructions and control codes are transferred to the C-memory, and additional data, e.g. the filter coefficients $h(n)$ for the FBP or the normalization data $r^{-1}(i,j)$ for the ART, are transferred to the D-memory. After this second phase, the hardware-software system is ready to calculate multiple slice images as soon as it receives their projection values, without any additional data transfer. For example, the FBP reconstruction runs at a rate of 42.9 Hz (23.3 ms per image) for 64 × 64 images from 64 × 64 projections. These slice images are then provided for further use in the conference system.
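The three phases can be summarized in a short sketch. The `RecordingBoard` class is a stand-in that only logs the requested transfers; its method names and the function `run_fbp` are hypothetical and do not correspond to the libreco interface.

```python
class RecordingBoard:
    """Stand-in for the RecoBoard that only logs what it is asked to do."""

    def __init__(self):
        self.log = []

    def write_memory(self, memory, content):
        self.log.append((memory, content))

    def load_parameters(self, **params):
        self.log.append(("RecoChip", sorted(params)))

    def reconstruct(self, projections):
        self.log.append(("run", len(projections)))


def run_fbp(board, weights, program, h, projection_sets):
    # Phase 1: initialization; the weights depend on the slice image size.
    board.write_memory("A", weights)
    # Phase 2: parameter transfer for the chosen algorithm.
    board.load_parameters(R=11, S=15, T=21, U=8)
    board.write_memory("C", program)
    board.write_memory("D", h)
    # Phase 3: one reconstruction per set of grabbed projection values.
    for projections in projection_sets:
        board.reconstruct(projections)


board = RecordingBoard()
run_fbp(board, "weights_64", "fbp_program", "h", [[0] * 4096, [0] * 4096])

# Consistency check of the quoted rate: 23.3 ms per image.
images_per_second = 1000.0 / 23.3
```

Because phases 1 and 2 are executed only once, the per-image cost in phase 3 is limited to the transfer of the projection values and the computation itself.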
6. Conclusions
In this paper we describe our implementation of a parallel hardware-software system for several digital signal processing algorithms. It is composed of a systematically designed processor array (RecoChip) as part of an accelerator board (RecoBoard) which is embedded in a software environment (RecoIDE). The functionality of the entire system was successfully tested. Our design incorporates the demands of tomographic reconstruction algorithms. This application was integrated as a client within a conference system. Besides the tomographic reconstruction, we have implemented and verified algorithms for 2D filtering, matrix-vector multiplication, and the 2D projection of a rotating three-dimensional object.
References
[1] Applied Micro Circuits Corporation (AMCC). PCI Products Data Book, 2003. http://www.amcc.com.
[2] J. Bu, E. Deprettere, and P. Dewilde. A design methodology for fixed-size systolic arrays. In Proc. IEEE Int. Conf. on Application Specific Array Processors, pages 591-602, 1990.
[3] P. Cappello, O. Egecioglu, and C. Scheiman. Processor-time-optimal systolic arrays. Parallel Algorithms and Applications, 15:167-199, 2000.
[4] A. Darte and Y. Robert. Constructive methods for scheduling uniform loop nests. IEEE Trans. on Parallel and Distributed Systems, 5(8):814-822, 1994.
[5] D. Fimmel. Optimaler Entwurf paralleler Rechenfelder unter Verwendung ganzzahliger linearer Optimierung. PhD thesis, Technische Universität Dresden, April 2002.
[6] D. Fimmel and R. Merker. Design of processor arrays for reconfigurable architectures. Journal of Supercomputing, 19(1):41-56, 2001.
[7] J. Fortes and D. Moldovan. Parallelism detection and transformation techniques useful for VLSI algorithms. Journal of Parallel and Distributed Computing, 2:277-301, 1985.
[8] H+K Messsysteme. PCI-Proto LAB - Technical Manual, 2000. http://www.pci-tools.de.
[9] F. Irigoin and R. Triolet. Supernode partitioning. In Proc. of the 15th Annual ACM Symp. on Principles of Programming Languages, pages 319-329, San Diego, California, January 1988.
[10] J. Kelber, R. Merker, and S. Siegel. Systematische Generierung des Steuerflusses von Prozessorarrays. In Proc. DASS 2003 and SDA 2003, pages 49-54, 2003.
[11] T. Klose. Dokumentation zu den Linux-Treibern amccS5933 bzw. reco. Technical report, TU Dresden, SFB 358-A1, 2003.
[12] T. Klose. Programmumgebung für ein Hardware-System mit parallelem Prozessorfeld. Diplomarbeit, Technische Universität Dresden, 2003.
[13] S. Kung. VLSI Array Processors. Prentice Hall, Englewood Cliffs, 1987.
[14] P. Lee and Z. Kedem. Mapping nested loop algorithms into multidimensional systolic arrays. IEEE Trans. on Parallel and Distributed Systems, 1:64-76, 1990.
[15] G. Li and B. Wah. The design of optimal systolic arrays. IEEE Trans. on Computers, 34:66-77, 1985.
[16] G. Megson. An Introduction to Systolic Algorithm Design. Oxford Science Publications, 1992.
[17] R. Merker. High-level synthesis system (HLDESA) for processor arrays. In Proc. Int. Conf. on Parallel Computing in Electrical Engineering, pages 89-93, Trois-Rivières, Québec, Canada, 2000.
[18] D. Moldovan and J. Fortes. Partitioning and mapping algorithms into fixed sized systolic arrays. IEEE Transactions on Computers, 35:1-12, 1986.
[19] J. Müller, D. Fimmel, R. Merker, and R. Schaffer. Hardware-software system for tomographic reconstruction. Journal of Circuits, Systems and Computers, Special Issue: Application Specific Hardware Design, Part 1, 12(2):203-229, April 2003.
[20] J. Müller, R. Schaffer, M. Kortke, R. Merker, and J. Kelber. A hardware accelerator for tomographic reconstruction and 2D-filtering. In Proceedings of the 5th World Multiconference on Systemics, Cybernetics and Informatics SCI, volume XV, pages 337-342, Orlando, Florida, USA, 2001.
[21] P. Quinton. The systematic design of systolic arrays. In Automata Networks in Computer Science, pages 229-260. Manchester University Press, 1987.
[22] S. Rao. Regular Iterative Algorithms and their Implementations on Processor Arrays. PhD thesis, Stanford University, 1985.
[23] R. Schaffer, T. Ferchland, D. Fimmel, M. Kortke, R. Merker, T. Schmitt, and J. Kelber. A VLSI processor array for algorithms of the tomographic reconstruction and the 2D-filtering. In DATE Conference, Designers Forum, München, Germany, pages 77-81, 2001.
[24] J. Teich and L. Thiele. Partitioning of processor arrays: a piecewise regular approach. INTEGRATION, the VLSI Journal, 14(2):297-332, 1993.
[25] L. Thiele. Compiler techniques for massive parallel architectures. In The State of the Art in Computer Systems and Software Engineering, pages 101-151. Kluwer Academic Publishers, Boston, P. Dewilde edition, 1992.
[26] TU Dresden, SFB 358-A1, A. Weder. libtomograph Nachschlagewerk, Januar 2004.
[27] TU Dresden, SFB 358-A1, T. Klose. libreco Nachschlagewerk, Juni 2003.
[28] Y. Wong and J. Delosme. Optimal systolic implementation of n-dimensional recurrences. In Proc. ICCD, pages 618-621, 1985.
[29] J. Xue. Communication-minimal tiling of uniform dependence loops. Journal of Parallel & Distributed Computing, 42(1):42-59, 1997.