
Symmetric multiprocessor systems on FPGA

Pablo Huerta, Javier Castillo, Cesar Pedraza, Javier Cano, Jose Ignacio Martinez
Department of Computer Architecture and Technology, Computer Science and Artificial Intelligence
Rey Juan Carlos University
Mostoles, Spain
{pablo.huerta, javier.castillo, cesar.pedraza, javier.cano.montero, joseignacio.martinez}@urjc.es
Abstract—Advances in FPGA technologies allow designing highly complex systems using on-chip FPGA resources and intellectual property (IP) cores. Furthermore, it is possible to build multiprocessor systems using hard-core or soft-core processors, increasing the range of applications that can be implemented on an FPGA. In this paper we propose a symmetric multiprocessor architecture using the MicroBlaze soft-core processor, together with the operating system support needed for running multithreaded applications. Four systems with different shared memory configurations have been implemented on FPGA and tested with parallel applications to show their performance.
Keywords-Multiprocessor, SMP, FPGA
I. INTRODUCTION
Soft core processors (SCPs) have become so popular that every FPGA vendor offers its own SCP highly optimized for its FPGAs (e.g., the MicroBlaze from Xilinx or the Nios II from Altera), and other developers provide open source cores (e.g., the OpenRISC from the OpenCores community or the LEON 2 and LEON 3 from Gaisler Research).
The number of SCPs that can be used in an FPGA system is only limited by the device resources (logic elements and memory), allowing designers to implement complex multiprocessor architectures for both specific [1][2] and general purpose systems [3][4].
Modern FPGAs also include hard core processors, like the PowerPC included in FPGAs from the Virtex 4 and Virtex 5 families, which can be used together with soft core processors, allowing the designer to build heterogeneous multiprocessor systems [5][6].
This paper focuses on symmetric multiprocessor (SMP) systems and how they can be implemented in programmable logic devices using SCPs. The paper describes four systems with different memory and cache organizations, and evaluates their performance when running parallelizable applications. The paper is organized as follows. Section II presents a brief overview of SMP fundamentals. In section III, the operating system that has been developed for use with SMP systems is briefly described. In section IV, the four systems that have been developed are described and some synthesis results are shown. Section V shows the performance results obtained with each system, and finally the conclusions are discussed in section VI.
II. SYMMETRIC MULTIPROCESSOR SYSTEMS
Multiprocessor systems are often categorized according to Flynn's taxonomy [7], which considers the parallelism in both instructions and data:
- SIMD: single instruction stream, multiple data streams.
- MISD: multiple instruction streams, single data stream.
- MIMD: multiple instruction streams, multiple data streams.
MIMD systems are the most popular general purpose multiprocessors, and can be divided into two groups according to the architecture of the memory system: centralized and distributed shared memory.
SMPs fall under the MIMD centralized shared memory category. In these systems, a shared bus interconnects a number of identical processors to a single shared memory and I/O interface. The term symmetric is used because all processors use the same mechanism, as equals, to access the memory and peripherals, and also because all the processors are identical.
Implementing an SMP system on FPGA raises many problems, such as cache coherency and memory consistency [8][9], interrupt management in multiprocessor systems [10], interprocessor communication [11], and processor identification and synchronization [12]. Another problem is the lack of an operating system (OS) with SMP support for the most popular SCPs, so in previous work [13] we developed an OS that can be used in SMP systems based on the MicroBlaze soft core processor.
III. OPERATING SYSTEM AND HARDWARE SUPPORT
Xilkernel [14] is an operating system provided by Xilinx for working with the MicroBlaze soft core processor. It provides a high level of customization so that it can be adapted to different functionality and memory footprint needs. It also supports the main features required in an embedded real time kernel, such as scheduling of POSIX threads and communication and synchronization services like mailboxes, semaphores and mutexes. Xilkernel was designed for single processor systems, but we have deeply modified it to support symmetric multiprocessor
systems. For this purpose the OS requires some hardware support for processor identification and synchronization.

2009 International Conference on Reconfigurable Computing and FPGAs
978-0-7695-3917-1/09 $26.00 © 2009 IEEE
DOI 10.1109/ReConFig.2009.20

For identification we use one register per processor, connected to its FSL interface, that stores a constant value used as the CPU identifier. For synchronization we developed a hardware mutex peripheral [15] that, when integrated with the OS, allows the processors to get exclusive access to critical sections of code. The OS also requires each processor to have a private memory region, used as a stack in some kernel functions and system calls.
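The paper does not detail the software interface of the hardware mutex; as an illustrative sketch (all names and the granting policy are assumptions, not taken from [15]), its lock semantics can be modeled in C as a peripheral register whose read both attempts the lock and reports success:

```c
#include <assert.h>
#include <stdint.h>

/* Host-side model of the hardware mutex peripheral. On the real FPGA
 * this would be a memory-mapped OPB register and the grant decision
 * would be made atomically in hardware; this emulation is only meant
 * to illustrate the semantics and is not thread-safe. */
typedef struct {
    volatile uint32_t owner;   /* 0 = free, otherwise CPU id + 1 */
} hw_mutex_t;

/* Try-lock: returns 1 if this CPU now owns the mutex, 0 otherwise. */
static int hw_mutex_trylock(hw_mutex_t *m, uint32_t cpu_id)
{
    if (m->owner == 0) {               /* free: grant it to this CPU */
        m->owner = cpu_id + 1;
        return 1;
    }
    return m->owner == cpu_id + 1;     /* already held by this CPU?  */
}

static void hw_mutex_unlock(hw_mutex_t *m, uint32_t cpu_id)
{
    if (m->owner == cpu_id + 1)        /* only the owner may release */
        m->owner = 0;
}
```

On the real system the CPU identifier would be read from the processor's FSL register rather than passed as a parameter, and the peripheral itself would serialize concurrent requests.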
The way this hardware support is implemented, and other hardware-related details such as the number of processors or the memory regions, is described in a Hardware Abstraction Layer (HAL) in order to separate the functional side of the OS from the hardware-dependent side, allowing an application to be executed without changes on many systems with different hardware implementations of the OS hardware support requirements.
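The contents of the HAL are not listed in the paper; a minimal sketch of the kind of hardware description it might encapsulate, with entirely illustrative field names and addresses, could be:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_CPUS 8

/* Illustrative HAL description: everything the portable kernel code
 * needs to know about one particular hardware configuration. */
typedef struct {
    uint32_t num_cpus;                   /* processors in the system  */
    uint32_t private_mem_base[MAX_CPUS]; /* per-CPU stack region base */
    uint32_t private_mem_size;           /* bytes per private region  */
    uint32_t hw_mutex_base;              /* hardware mutex peripheral */
    uint32_t timer_base;                 /* periodic interrupt timer  */
} hal_config_t;

/* Example configuration resembling system 2: two CPUs with 8 Kbytes
 * of private memory each (addresses are made up). */
static const hal_config_t hal = {
    .num_cpus         = 2,
    .private_mem_base = { 0x00010000, 0x00012000 },
    .private_mem_size = 8 * 1024,
    .hw_mutex_base    = 0x80000000,
    .timer_base       = 0x80001000,
};
```

A portable kernel function would consult this structure instead of hard-coded constants, which is what lets the same application binary run on the four evaluated systems.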
IV. EVALUATED SYSTEMS
A. System 1
The first system we have implemented and tested is shown in Figure 1. The main components of the system are:
- 2 MicroBlaze processor cores, version 4.0a.
- 64 Kbytes of shared blockram for instructions and another 64 Kbytes for data, connected to both processors through the LMB bus.
- 8 Kbytes of private memory for each processor, implemented using blockrams.
- An FSL register with a constant value for each processor, used as the CPU identifier.
- An OPB bus shared by both processors and connected to: a UART Lite for I/O, a hardware mutex peripheral, and 2 timers, one used as a periodic interrupt source for both processors and the other used for time measurement.
Figure 1. System 1
The main difference between this system and the other three is that although the instruction and data memories are shared by both processors, the bus is not shared. The blockrams used are dual-ported, so each processor uses one port and both of them can access the memory simultaneously without waiting for the bus to be free. Because of the 1-cycle latency of the LMB bus, this is the only system that does not use cache memories.

Table I
SYNTHESIS RESULTS FOR SYSTEM 1

         Slices                  LUTs
         Number  %    Ratio      Number  %    Ratio
1 CPU    1657    4.9  1          2366    3.5  1
2 CPUs   2344    6.9  1.4146     3561    5.0  1.5051
The system was synthesized for the RC300 board from Celoxica, which includes an XC2V6000-FF1152-4 FPGA from the Virtex 2 family. Table I shows the resources used by the system with one and with two processors, and the ratio of each resource with respect to the one-processor system.
B. System 2
The second system includes up to 8 MicroBlaze cores, version 4.0a, with 4 Kbytes of instruction cache. The shared memory uses blockrams and is connected to the OPB bus. There are also a UART Lite, two timers and a hardware mutex connected to the system bus. The CPU identifier is implemented using a constant value register connected to the FSL interface of each processor. The processors have 8 Kbytes of private data memory connected to their LMB data interface. Figure 2 shows a schematic of the system. The maximum number of processors is 8 because the OPB bus supports up to 16 masters, and each processor uses 2 master interfaces: one for the instruction side and one for the data side of the OPB interface.

The system was synthesized for the same FPGA used in system 1, and Table II shows the resource usage for different numbers of processors in the system.
Figure 2. System 2
C. Systems 3 and 4
For building these systems a different version of the MicroBlaze core was used: version 5.0a. One of the main differences between versions 4.0a and 5.0a is the cache interface.
Table II
SYNTHESIS RESULTS FOR SYSTEM 2

         Slices                  LUTs
         Number  %     Ratio     Number  %     Ratio
1 CPU    1660    4.9   1         2329    3.4   1
2 CPUs   2341    6.9   1.4102    3539    5.2   1.5195
4 CPUs   3867    11.4  2.3295    6137    9.0   2.6350
8 CPUs   7134    21.1  4.2976    11806   17.4  5.0691
Version 4.0a uses a cache interface that can cache any memory connected to the shared OPB bus. The new cache interface, Xilinx Cache Link (XCL), used in version 5.0a is more efficient, but it can only be connected to memory controllers that implement the same interface. The memory controllers used in systems 3 and 4 only have four XCL interfaces each, so the maximum number of processors is four if only instruction cache is used, or two if both instruction and data caches are used.

Both systems include up to 4 processors with 4 Kbytes of instruction cache. Each processor has 2 Kbytes of private data memory and an FSL register holding the CPU identifier. A UART Lite, timers and a hardware mutex are shared through the OPB bus. The only difference between the systems is the shared memory they include: system 3 includes 1 Mbyte of SRAM and system 4 includes 64 Mbytes of DDR RAM. Figure 3 shows a schematic of the systems and the cache connections using the XCL interface.
These systems were implemented using a smaller FPGA than the one used in systems 1 and 2. The platform used was the ML401 development board from Xilinx, which includes an XC4VLX25-FF668-10 FPGA from the Virtex 4 family. Tables III and IV show the FPGA resource usage from 1 to 4 processors.
Figure 3. System 4
V. EXPERIMENTAL RESULTS
For validating the developed systems and measuring their performance we have written two parallel applications that run on the systems using the operating system mentioned in section III. We use two metrics for evaluating the performance of the systems: speedup and efficiency. The speedup is
Table III
SYNTHESIS RESULTS FOR SYSTEM 3

         Slices                  LUTs
         Number  %     Ratio     Number  %     Ratio
1 CPU    3132    30.1  1         4415    20.5  1
2 CPUs   4581    42.6  1.4626    6740    31.3  1.5266
3 CPUs   6041    56.1  1.9288    9112    42.3  2.0639
4 CPUs   7569    70.4  2.4167    11555   53.7  2.6172
Table IV
SYNTHESIS RESULTS FOR SYSTEM 4

         Slices                  LUTs
         Number  %     Ratio     Number  %     Ratio
1 CPU    3493    32.4  1         5071    23.5  1
2 CPUs   4948    46.0  1.4165    7392    34.3  1.4577
3 CPUs   6400    59.5  1.8322    9760    45.3  1.9247
4 CPUs   7921    73.6  2.2677    12199   56.7  2.4056
defined as the time it takes to complete the application with one processor divided by the time it takes to complete it with N processors. The efficiency is usually defined as the speedup with N processors divided by the number of processors N. In the systems we are evaluating, the resources used are not directly proportional to the number of processors, because many components that consume many resources, like the bus or the memory controllers, are independent of the number of processors. For example, system 2 with 8 processors uses only 4.3 times the slices used by the system with 1 processor. Due to this resource distribution, it makes more sense to take into account the resources used instead of the number of processors when talking about efficiency. In the results presented in this section we calculate the efficiency relative to the chip resources instead of the number of processors. The resource considered for these calculations is the slices used by the system.
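The two metrics can be made concrete with the figures reported for system 1 (speedup 1.9853 on the matrix benchmark, and a 2-CPU/1-CPU slice ratio of 1.4146 from Table I); a small sketch:

```c
#include <assert.h>
#include <math.h>

/* Classic efficiency: speedup divided by the number of processors N. */
static double efficiency_cpus(double speedup, int n_cpus)
{
    return speedup / n_cpus;
}

/* Resource-based efficiency used in this paper: speedup divided by the
 * ratio of slices used by the N-CPU system to the 1-CPU system. */
static double efficiency_slices(double speedup, double slice_ratio)
{
    return speedup / slice_ratio;
}
```

With these definitions, efficiency_slices(1.9853, 1.4146) gives approximately 1.4035, the value reported for system 1 in Table V, while the classic per-processor efficiency would be 1.9853 / 2, close to 0.99.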
A. Parallel applications
For testing the performance of the systems being evaluated we have used two parallel applications: a parallel matrix multiplication and a parallel encryption/decryption application using the AES algorithm.

The matrix multiplication application splits the work into many threads. Each thread receives as parameters the addresses and sizes of the source matrices, the address of the result matrix, and the initial and final rows that the thread has to calculate. The application creates as many threads as processors exist in the system and waits until all of them have finished in order to measure the time it took the whole application to complete.
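The row-based partitioning described above can be sketched with POSIX threads, which Xilkernel also exposes; the matrix size, thread count and all helper names below are illustrative, not the paper's actual code:

```c
#include <assert.h>
#include <pthread.h>

#define N 8          /* small illustrative matrix size       */
#define NTHREADS 2   /* one thread per processor, as in text */

static int A[N][N], B[N][N], C[N][N];

/* Each thread computes rows [first, last) of the result matrix,
 * mirroring the per-thread parameters described in the text. */
typedef struct { int first, last; } rows_t;

static void *worker(void *arg)
{
    rows_t *r = arg;
    for (int i = r->first; i < r->last; i++)
        for (int j = 0; j < N; j++) {
            int acc = 0;
            for (int k = 0; k < N; k++)
                acc += A[i][k] * B[k][j];
            C[i][j] = acc;
        }
    return NULL;
}

static void parallel_matmul(void)
{
    pthread_t tid[NTHREADS];
    rows_t part[NTHREADS];
    int rows = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        part[t].first = t * rows;
        part[t].last  = (t == NTHREADS - 1) ? N : (t + 1) * rows;
        pthread_create(&tid[t], NULL, worker, &part[t]);
    }
    for (int t = 0; t < NTHREADS; t++)   /* wait for all workers */
        pthread_join(tid[t], NULL);
}
```

Each thread works on a disjoint row range of the result matrix, so no synchronization is needed beyond the final pthread_join calls.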
The encryption/decryption application takes as input a plain text file or an encrypted text file and splits it into smaller pieces that are processed in parallel. The main application creates as many threads as processors exist in the system, and each thread processes a piece of the whole data.

Table V
SPEEDUP AND EFFICIENCY FOR SYSTEM 1

            Matrices  Encrypt  Decrypt
Speedup     1.9853    1.9662   1.9809
Efficiency  1.4035    1.3899   1.4003

Figure 4. System 2 results
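The paper does not specify the chunking policy; one plausible sketch is an even split where every chunk except the last is rounded to the 16-byte AES block size (the helper name and the policy are assumptions):

```c
#include <assert.h>

#define AES_BLOCK 16

/* Split `total` bytes among `nthreads` workers so that every chunk
 * except possibly the last is a multiple of the AES block size.
 * Returns the size of chunk `t`. Illustrative policy only. */
static int chunk_size(int total, int nthreads, int t)
{
    int blocks = total / AES_BLOCK;              /* whole AES blocks  */
    int per = (blocks / nthreads) * AES_BLOCK;   /* even, block-sized */
    if (t < nthreads - 1)
        return per;
    return total - per * (nthreads - 1);         /* last one: the rest */
}
```

With a 1000-byte input and 4 threads this yields chunks of 240, 240, 240 and 280 bytes; the first three are multiples of the AES block size and the last thread absorbs the remainder.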
B. Results
Table V shows the results obtained for system 1, which are very close to those of an ideal system, with a speedup almost equal to the number of processors. The efficiency for this system is close to 140%, making it a very recommendable choice for parallel applications with low memory footprints.

Figure 4 shows the results for system 2. The results are not as good as expected: the speedup of the system does not increase when the number of processors is 4 or higher, and it even goes down for 7 and 8 processors. This is because version 4.0a of the MicroBlaze core uses a speculative memory access policy, which means that when a processor launches a memory request it does so through all the available interfaces in order to improve memory access time. In a system with many processors connected to the same bus, this means that when a processor is loading an instruction that is already in its instruction cache it also launches the request through the OPB bus, slowing down the accesses to the bus from other processors that really need it. Due to this feature, MicroBlaze 4.0a is not advisable for building SMP systems that make use of a shared memory connected to the OPB bus, and it is only suitable for systems like system 1 that use dual ported memories.

Figure 5. System 3 and 4 results
Systems 3 and 4 make use of a newer version of the MicroBlaze core that offers better cache performance and a different memory access policy that avoids the problems seen in system 2. The speedup and efficiency of these systems are shown in figure 5. Both systems get very good results for 2 processors, with an efficiency higher than 100%. With 4 processors the efficiency falls to 70%-80%, due to bus congestion.
All the systems get good results for 2 processors, with efficiencies higher than 100%. The choice of system for an embedded application is restricted by the memory needs of the software application. For applications with a low memory footprint (up to 128 KB), system 1 is an excellent choice, offering the best resource efficiency. For larger memory footprints, system 3 (1 MB) and system 4 (64 MB) are a good choice, offering speedups of up to 2 and good resource efficiency.
VI. CONCLUSIONS
We have presented a general purpose SMP system for FPGA that allows different memory architecture implementations depending on the application needs. The operating system used allows us to run the same application on different systems with only minor changes in the HAL.

The performance of the systems ranges from a very high efficiency of 140% when using dual ported blockram memories to 70-80% efficiency in systems with slower memories shared through an OPB bus.

We have encountered a feature of version 4.0a of the MicroBlaze core, the speculative memory access, that makes it unsuitable for SMP systems. This problem is not present in version 5.0a and higher, but this version uses a different cache interface that only supports 4 processors using instruction cache.

The main shortcoming of the evaluated systems is that the speedup does not grow linearly when increasing the number of processors, due to bus congestion. Using data caches would help to avoid this congestion, so our future work is focused on implementing a cache coherence mechanism that allows us to use data caches. The current memory controllers from Xilinx only have 4 XCL interfaces, so in order to use a higher number of processors we are working on the development of new memory controllers with more XCL interfaces.
REFERENCES
[1] O. Lehtoranta, E. Salminen, A. Kulmala, M. Hannikainen, T. D. Hamalainen, "A Parallel MPEG-4 Encoder for FPGA Based Multiprocessor SoC," Proceedings of the International Conference on Field Programmable Logic and Applications (FPL'05), 2005.
[2] K. Ravindran, et al., "An FPGA-based soft multiprocessor system for IPv4 packet forwarding," Proceedings of the International Conference on Field Programmable Logic and Applications, pp. 487-492, 2005.
[3] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, D. Sciuto, "A design kit for a fully working shared memory multiprocessor on FPGA," Proceedings of the 17th ACM Great Lakes Symposium on VLSI, pp. 219-222, 2007.
[4] P. James-Roxby, P. Schumacher, C. Ross, "A Single Program Multiple Data Parallel Processing Platform for FPGAs," Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'04), 2004.
[5] W.-T. Zhang, et al., "Design of Heterogeneous MPSoC on FPGA," Proceedings of the 7th International Conference on ASIC (ASICON'07), pp. 102-105, 2007.
[6] B. Senouci, et al., "Multi-CPU/FPGA Platform Based Heterogeneous Multiprocessor Prototyping: New Challenges for Embedded Software Designers," Proceedings of the 19th IEEE/IFIP International Symposium on Rapid System Prototyping, pp. 41-47, 2008.
[7] M. J. Flynn, "Some computer organizations and their effectiveness," IEEE Transactions on Computers, pp. 948-960, 1972.
[8] A. Hung, W. Bishop, A. Kennings, "Enabling Cache Coherency for N-Way SMP Systems on Programmable Chips," Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms, 2004.
[9] A. Hung, W. Bishop, A. Kennings, "Symmetric Multiprocessing on Programmable Chips Made Easy," Proceedings of the Design Automation and Test in Europe Conference and Exhibition (DATE'05), 2005.
[10] A. Tumeo, M. Branca, L. Camerini, M. Monchiero, G. Palermo, F. Ferrandi, D. Sciuto, "An Interrupt Controller for FPGA-based Multiprocessors," Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (IC-SAMOS 2007), 2007.
[11] N. K. Bambha, S. S. Bhattacharyya, "Communication strategies for shared-bus embedded multiprocessors," Proceedings of the 5th ACM International Conference on Embedded Software, pp. 21-24, 2005.
[12] A. Tumeo, C. Pilato, G. Palermo, F. Ferrandi, D. Sciuto, "HW/SW methodologies for synchronization in FPGA multiprocessors," Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2009.
[13] P. Huerta, J. Castillo, C. Sanchez, J. I. Martinez, "Operating System for Symmetric Multiprocessors on FPGA," Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig'08), 2008.
[14] Xilinx Corporation, OS and Libraries Document Collection, 2005, available at http://www.xilinx.com
[15] P. Huerta, J. Castillo, C. Pedraza, J. I. Martinez, "Exploring FPGA Capabilities for Building Symmetric Multiprocessor Systems," Proceedings of the III Southern Conference on Programmable Logic (SPL 2007), 2007.
