2 Stamoulis VHDLDATE98

VHDL Methodologies for Effective Implementation on FPGA
Devices and Subsequent Transition to ASIC Technology

I. Stamoulis*, N. Ford, G. J. Dunnett, M. White, P. F. Lister
Centre for VLSI and Computer Graphics

School of Engineering
University of Sussex
Brighton, BN1 9QT,
UNITED KINGDOM
* I.Stamoulis@Sussex.ac.uk
Abstract:
This paper investigates the efficient use of VHDL for targeting Field Programmable Gate
Array (FPGA) architectures in order to meet constraints in area-critical or timing-critical
designs. We question the effectiveness of technology-independent VHDL, and show
methodologies for integrating low-level, device-specific components with VHDL that allow
optimum results while maintaining the flexibility of VHDL designs.
We show how optimisations can be applied not only at the synthesis level, but at the
algorithmic level as well. We also discuss the necessity of manual versus automated
floorplanning as performed by vendors' back-end tools.
However, as FPGAs are often used for ASIC emulation, such device-specific
implementation would not be beneficial if it prohibited the transition to an ASIC process. In
this context we show methodologies that allow FPGA-optimised designs to migrate to ASIC
processes and take advantage of both architectures.

1. Introduction

Until recently, Field Programmable Gate Array (FPGA) devices were relatively
small and were used as a sophisticated alternative to PLDs. The recent
introduction of FPGA chips with large gate counts from many manufacturers has
given rise to a number of new applications that were previously only feasible
with ASIC technology. Also, the newest generation of FPGA devices features
very sophisticated internal architectures, which help designers to make better
use of the available resources.
Hardware description languages are becoming increasingly popular for designing
large-scale integrated circuits. VHDL is a popular Hardware Description
Language (HDL) supporting many of the features available in high-level
programming languages.

2. VHDL synthesis for FPGAs

The flexibility of FPGAs and their much smaller Non-Recurring Engineering
(NRE) costs allow FPGAs to be used in final products as alternatives to ASICs.
This is especially true in small-scale production, where NRE costs are the
dominant factor.
Another popular application of FPGA devices is prototyping ASICs. Usually in
this case, the technology-independent VHDL is targeted towards the chosen FPGA
architecture. This approach uses the FPGA as an ASIC emulator that usually
runs at a much lower speed.
However, in the following sections we will show that designs can be engineered
to take advantage of the target FPGA and exploit its architectural features.
We present here the VHDL methods that take advantage of FPGAs but still
maintain the flexibility that allows the transition to an ASIC process.

2.1. Design of Finite State Machines (FSM)

A key issue in the effective use of FPGAs is to balance the design's logic and
registers to match those of the target FPGA. The generation of FSMs is another
area where conventional ASIC synthesis and FPGA synthesis differ [7]. The
balance of logic resources and registers on FPGAs means that the most
efficient state machine realisation results from using one-hot encoding of the
FSM.
Because of this, a one-hot representation of an FSM in an FPGA is not only the
fastest implementation, but the most compact as well [8].
Typically, a binary representation of an FSM requires a lot of decoding logic
and consumes more logic cells than one-hot encoding. Furthermore, most of the
registers in those decoding cells are wasted. In
small and simple FSMs, the decoding logic may be minimal and in this case a
binary state machine may be slightly more efficient.
Most synthesis tools, including Synopsys, support extraction of the state
machine. Altera's MaxPlus2 will one-hot encode the FSM by default. Other
tools, such as Synplify, will automatically use one-hot or binary encoding
according to certain constraints.

2.2. Design of Adders

Most FPGA architectures have special carry chain logic that allows very fast
and compact design of adders. Although this uses a ripple carry method, the
very fast routing makes it much faster than a LUT implementation, and
definitely a lot smaller.
Synthesis tools such as Synopsys use a DesignWare library with adders and
subtractors to take advantage of such hardware.

2.3. Multipliers

A very common module in digital circuits is an integer multiplier. The optimum
architecture for an integer multiplier varies significantly with the
architecture of the target technology. Although synthesis tools can minimise
the logic and take advantage of special technology, they cannot optimise a
circuit at the algorithmic level.
In this section we investigate three alternative methods for implementing such
circuits. The FPGA-based designs were targeted at Lucent ORCA devices. The
Lucent ORCA architecture is ideal for implementing multipliers because the
logic cells (PFUs) can be configured in a 4x1 multiply and sum mode with a
partial product. This greatly reduces the amount of logic required for a 9x2
multiplier. In this example we compare three 9 by 8 multipliers.
The multiplier options are:

a. One generated by Synopsys (Design Analyser 3.5a).
b. One generated by SCUBA, a Lucent tool that generates FPGA-specific modules.
c. A design that uses ORCA DesignWare components and VHDL.

a. Synopsys
Synopsys can generate signed and unsigned multipliers using logic gates. These
are then minimised and packed into Look Up Tables (LUT). This gives the best
results in ASIC processes. However, this architecture doesn't take advantage
of any technology-specific features and therefore the results are bound to be
sub-optimal. For small multipliers this method gives acceptable results, but
as the size of the operands increases, the number of logic cells required
becomes too large for practical implementations. A 9 x 8 multiplier was
synthesised and the resulting size was 43 logic cells with a speed of 12 MHz.

b. SCUBA generated multipliers
SCUBA is a tool provided by Lucent that can produce common designs such as
multipliers, adders/subtractors, comparators, and RAM/ROM components using
ORCA-specific macro files. This usually yields improved performance and area
for such devices. However, it can only produce unsigned multipliers, so it
cannot be used where a signed multiplier is needed. In this example a 9 x 8
unsigned multiplier was produced. Inputs and outputs were registered in order
to get accurate in-system results. Such a multiplier occupies 19 PFUs, which
is roughly equivalent to 700 logic gates, and can be clocked at 20 MHz.

c. Hand-made ORCA multiplier
A hand-made multiplier was implemented. This was created for a particular
graphics application, hence the unusual sizes of the multiplicand and
multiplier. This multiplier is signed 9 bits by unsigned 8 bits, and the
result is a signed 17-bit product. We implemented it as a tree of adders that
follows a stage of 9 x 2 multipliers. ORCA-specific cells such as ANEB4 or
FMULT41 were used extensively in order to get the best compromise between
performance and area.

[Figure 1: Hand-made multiplier. The 9-bit signed operand and the unsigned
operand slices [7:6], [5:4], [3:2] and [1:0] feed four 9x2 multipliers,
followed by two 13-bit adders and a 17-bit adder that produces the 17-bit
signed product, with optional pipeline registers between the stages.]

This particular design was placed and routed in an OR2C04A-4 device, where it
occupied 34 PFUs and could be clocked at up to 28.8 MHz. As expected, a
3-stage pipeline as shown in Figure 1 can be clocked at 60 MHz or more without
increasing in size at all, as all pipelining registers can be hosted in the
same logic cells as the preceding addition and multiplication logic blocks.
From our results it is obvious that the method of
choice is to use SCUBA because of the small area, and if pipelined the
resulting multipliers are very fast as well. Also, SCUBA can produce
multipliers of any size, which reduces development time. If signed multipliers
are needed, then an alternative method will have to be used.
If speed were a concern, then a hand-made version would achieve even better
performance at the expense of a bigger area (34 PFUs against 19). If signed
multipliers are required then Synopsys can be used.
With VHDL the designer can keep all the architectures under the multiplier
entity and choose the desired architecture (with respect to the target
technology) using the configuration statement.
Using this method, the design can take advantage of the low-level
architectural features whilst maintaining the flexibility and ease of
simulation that VHDL provides.
Furthermore, in applications such as the Fast Discrete Cosine Transform, the
designer can choose the algorithm that has the best balance between adders and
multipliers to meet the constraints in terms of area, speed and latency.

2.4. Multiplexing

There are two ways to perform multiplexing in FPGA devices. One of them is
using LUTs, where the inputs are the select and input signals.

[Figure 2: 2-to-1 MUX with inputs A, B and Sel and output Output, implemented
with three-state buffers.]

The VHDL code for Figure 2 would be simply:

Output <= A WHEN Sel = '0' ELSE 'Z';
Output <= B WHEN Sel = '1' ELSE 'Z';

Another method is to use the available three-state buffers. Three-state
buffers consume no logic resources and therefore help designs to be more
compact.

[Figure 3: 2-to-1 MUX with inputs A, B and Sel and output Output, implemented
in a 4-input LUT.]

VHDL code to infer the structure in Figure 3 would be:

Output <= A WHEN Sel = '0' ELSE B;

In some of the current FPGA architectures, three-state buffers are connected
only to special long routing resources, and their extensive use will severely
affect the routing of the design.
The designer can choose either the LUT or the three-state implementation
according to the target architecture. The flexibility of VHDL means that one
needs only to change the configuration to switch between the two
architectures.

2.5. Distributed or Local On-Chip RAM

Currently most FPGA manufacturers have architectures with on-chip memory. This
satisfies the designer's need for integrated memory to implement FIFOs and
other memory structures.
There are two distinct styles of on-chip RAM. The first style, which appears
to be the most common, is distributed RAM. In this case, a small portion of
static RAM, usually between 32 and 64 bits, is available in every logic cell.
This gives the designer the most flexibility to use the available logic cells
in the way that best suits the design, e.g. as logic or storage functions.
Moreover, this arrangement simplifies the routing.
The alternative method is to have the entire RAM locally in much bigger
chunks. This method is used in Altera FPGAs. The main problem with this method
is that the design is often limited by the amount of RAM that is available and
not by the number of available logic cells. Synthesising logic to dedicated
SRAM structures can be laborious, as synthesis tools are usually not aware of
these structures and targeting them is left to the back-end tool.
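
As a rough illustration of the kind of small memory discussed above, the
following is a minimal sketch of a 16 x 4 (64-bit) RAM described behaviourally
in VHDL-93, with a synchronous write and an asynchronous read. The entity name
small_ram, its generics and its port names are hypothetical and do not come
from the original designs. Whether a particular synthesis or back-end tool maps
this description onto distributed RAM, dedicated RAM blocks or plain flip-flops
depends on the tool and is an assumption that has to be checked in each flow.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical example entity: a small single-port RAM.
entity small_ram is
  generic (
    ADDR_WIDTH : positive := 4;  -- 2**4 = 16 words (assumed size)
    DATA_WIDTH : positive := 4   -- 4 bits per word, 64 bits in total
  );
  port (
    clk  : in  std_logic;
    we   : in  std_logic;        -- write enable, active high
    addr : in  std_logic_vector(ADDR_WIDTH - 1 downto 0);
    din  : in  std_logic_vector(DATA_WIDTH - 1 downto 0);
    dout : out std_logic_vector(DATA_WIDTH - 1 downto 0)
  );
end entity small_ram;

architecture rtl of small_ram is
  type ram_t is array (0 to 2**ADDR_WIDTH - 1)
    of std_logic_vector(DATA_WIDTH - 1 downto 0);
  signal ram : ram_t;
begin
  -- Synchronous write on the rising clock edge.
  write_proc : process (clk)
  begin
    if rising_edge(clk) then
      if we = '1' then
        ram(to_integer(unsigned(addr))) <= din;
      end if;
    end if;
  end process write_proc;

  -- Asynchronous (combinatorial) read of the addressed word.
  dout <= ram(to_integer(unsigned(addr)));
end architecture rtl;
```

If the target device only offers large local RAM blocks, the same entity could
be given a second architecture that instantiates the vendor memory component,
leaving the rest of the design untouched.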

2.6. FIFOs

FIFOs are not generally implemented in FPGAs, as the resources needed for the
memory cells are restricted. The increasing availability of on-chip memory in
FPGAs means that small FIFOs can be implemented in the same FPGA as the
design. This can significantly increase the performance of many applications
where maintaining the flow of data is important. The on-chip memory is usually
dual-ported, which significantly simplifies the support circuits and
substantially increases their performance.
The diagram in Figure 4 shows a typical FIFO structure based on dual-ported
memory. The pointers can be implemented using fast adders that use carry
chains. The flags and control logic is a purely combinatorial circuit. The
dual-ported RAM can be either distributed or local, depending on the target
FPGA architecture. Both methods work well and consume a relatively small
amount of logic.

[Figure 4: FIFO structure based on dual-ported on-chip memory, with Data In
and Data Out ports, read and write pointers, flags and control logic, and
Req/Ack handshake signals on the input and output sides.]

The integration of the FIFO on the same piece of silicon as the design means
not only increased performance but also a decrease in the number of external
components. This saves board space and decreases manufacturing costs.

2.7. Comparators

Another common component in logic circuits is the comparator. It is found in
many address decoding designs such as PCI targets, where the address must be
compared very quickly with the base address registers of the allocated memory
space.
Three designs were implemented to compare 24-bit values:

a. A comparator using LUTs, as produced by Synopsys. This was implemented as a
   tree of Look Up Tables.
b. A comparator produced by SCUBA by cascading comparator (ANEB4) cells and
   interconnecting them using the carry chains.
c. A hand-made comparator, a combination of the two methods using individual
   comparator cells and logic cells configured as six-input LUTs.

Table 1 summarises the results of the three methods.

Table 1
            Area (PFUs)   Speed (MHz)   Prop. Delay (ns)
Synopsys        15            70             14.3
SCUBA           19            56             17.8
Hand-made       19            87             11.4

The area figure includes registered inputs and outputs. The look-up table
approach gives better area results, mainly because the LUT can be combined
with registers in the same PFU. However, when speed is the main concern, the
hand-made design that uses the comparator cells gives the best results.
As seen in this example, careful examination of the target architecture can
give the designer enough information to optimise the design either for area or
for speed.

2.8. Pipelining

A popular method of increasing the rate at which valid results are produced by
the design, i.e. the throughput, is pipelining. The pipelining technique
partitions blocks of combinatorial logic into n stages of equal delay, with
each stage separated by a bank of pipeline registers [6]. A fully balanced
pipelined design can be clocked up to n times faster and will therefore have n
times the throughput. The penalty for pipelining is increased latency: each
result takes n clock cycles to emerge rather than one.
The introduction of register banks in the design will inevitably increase the
gate count. However, the most popular FPGA logic cells contain both logic and
register resources.
The introduction of pipeline registers in such technology will have minimal
impact on the gate count, as most of the flip-flops will be hosted in the same
logic element as the combinatorial logic (this is not usually the case for
registers placed after three-state buffers). In the best case, the size of the
design would remain the same.

2.9. Design partitioning

In many cases, the design will not fit in a single available FPGA device. In
this case, the designer will have to partition the design into multiple FPGAs.
Some of the popular back-end tools will do this
automatically and go even further by proposing the particular devices to be
used.
However, it is usually better if the designer, who has knowledge of the
functionality of the design, partitions it manually.
In many cases, particularly in timing-critical designs, it may be better to
replicate logic in more than one FPGA. This is often the case for a Program
Register Set (PRS), where many sub-modules may need access to the same data.
When timing partitioned designs, the delays introduced by the pads must be
taken into account to correctly balance the registers and time the design.

2.10. Floorplanning in FPGAs

A design coded in VHDL is transformed into gates by synthesis tools such as
Synopsys. The gates and registers are then packed into Look-Up Tables (LUT)
and flip-flops. The design is then imported into the back-end mapping tool,
which identifies LUTs and flip-flops in the same region and packs them into
the same logic cells. It also orders register flip-flops to make sure that
related register bits lie in the same logic cell (e.g. bits 0, 1, 2 and 3 of a
register will be packed into the same logic cell if it has 4 registers). Also,
at this stage a further Design Rule Check (DRC) takes place, which checks for
errors such as excessive fan-out and modifies the netlist accordingly. Some
further optimisation can take place at this point as well.
The result of the above procedure is a netlist of logic cells that is still
unrouted. The designer is then faced with two options for placing the logic
cells on a given FPGA. The first is to use the Place and Route tool to place
the logic cells on the FPGA automatically. This is based on a reasonably
clever iterative cost-based algorithm that tries to place the logic cells in
an efficient way on the available cells. Alternatively, the designer can place
and route manually.

2.11. Manual placement of Components

Here the designer has to manually place the unrouted logic cells on the
available cells of the device. On small devices, the automatic place and route
tool can be both faster and more efficient than manual routing.
Experimentation with small circuits has shown that the circuit produced with
automatic placement and routing is faster and uses the routing resources more
efficiently.
The main problem in manually placing the cells of a netlist is that the
shortest distance is not necessarily the fastest. The latest generation of
FPGAs typically has three types of routing resources: pin-wires, short-wires
and long-wires.
Connecting adjacent cells may require the use of a number of short-wires and
switches, which in many cases is slower than connecting to a cell that is far
away, where the connection can be implemented with a long-wire. Although it
may be argued that very careful examination of the delays and routing
resources may give the designer enough information to make a marginally better
design, this is only feasible in very small designs. In larger designs (more
than 2,000 equivalent ASIC gates) it is practically impossible.
A number of universities are engaged in research on routing methodologies and
the development of algorithms that are more efficient than the standard
cost-based methods used by proprietary software [4,5].
However, most FPGA vendors do not publish enough information to allow a
third-party placement and routing tool to be implemented.
Placement and routing of pre-routed hard-wired macros is an option that must
be closely considered, as it may simplify the routing process and improve
system performance. However, the existing tools do not offer the desired
flexibility and they are far from user-friendly. It is expected that as FPGA
placement and routing tools mature they will offer increased control over the
floorplanning of a design [2]. At the same time, automatic placement and
routing algorithms will become cleverer, understand more of the structure of
the design, and provide better results as well.
For high-density devices, one may floorplan specific parts of the design to
improve the performance of the Place and Route (PAR) software. On large
designs (more than 30,000 gates or 85% of the available resources) the place
and route tool has a hard time placing the design and routing all the nets. In
many cases, it has taken 50-70 hours on a Sparcstation-20 to completely route
the design. Due to the complexity and size of larger designs, PAR tools are
limited by their ability to recognise structures. A design may not route at
all (usually the case for designs that use more than 95% of the resources of
the FPGA), or may not meet the timing constraints (as some routes are slower
than 50-60 ns on their own).
However, designers have direct knowledge of the design's structure and its
critical paths. Using this information they can improve the placement of the
design [2]. Designers have the option to specify the design constraints with
allowances for paths that are not timing-critical (e.g. reset) and let the
automatic PAR work harder on the critical paths. The other option is to
pre-route blocks (e.g. a multiplier) and let the placement tool place those
hard-wired macros on the available area [1]. However, this is not as flexible
as real ASIC floorplanning as it is
not possible to orient (rotate, flip, etc.) the macros.

2.12. Routing Designs

Again the designer has two options: either let an automatic Place and Route
tool do the routing, or route the design manually using a floorplanning tool
(e.g. Lucent's Epic). However, as complex designs have tens of thousands of
wires, it is practically impossible to route a design manually, let alone
optimise it.
The only reasonable method is to give weights to the various routes, in order
to force the PAR tool to spend more time improving the timing-critical paths.
Many of the paths may be considered not timing-critical, and can be given less
weight than the more timing-critical paths.
Another issue that is often overlooked is pin assignment. Careful assignment
of pins has been shown to help the PAR tool place the design more quickly, and
in many cases to give better results than fully automated placement and
routing. Pin positions are also part of the PCB and system-level partitioning,
but in most cases careful pin assignment is beneficial for both internal
placement and external PCB routing.

3. Transition to ASIC

FPGA devices are increasingly being used as prototypes for ASIC devices. This
allows a design to be verified in a system environment before the company
commits to an ASIC process. Another reason that this approach is becoming
popular is that an early prototype allows driver and software-related
activities to start at earlier stages, thus significantly reducing the
design-to-market time. Potential bugs and system incompatibilities can be
detected at a very early stage, and the design team can have robust hardware.
FPGAs are a very attractive solution for designs where specifications are
evolving and the time-to-market must be very short. Features such as
reconfigurability can even allow users to update the version of their hardware
using software methods.
But when the specification has been finalised and high-volume production is
needed, cost is a significant issue. The transition to an ASIC technology is
essential to meet these requirements.
A vital advantage of VHDL is its device-independent nature. The design's
source code can be targeted to any technology without changes. But in our
designs, many blocks were replaced by lower-level FPGA-dependent components.
For ASIC synthesis, those blocks are substituted back with the RTL version.
By making extensive use of the VHDL configuration structures, the user can
change the VHDL architecture of each component.
The registers of the design should also be re-balanced to take into account
the different propagation delays of the two technologies.
The designer may use risky techniques in the FPGA implementation to fulfil
tight constraints. But these methods cannot be used in an ASIC process, as
they may introduce race conditions, glitch-prone clocks and improper
initialisation in a device that must work across all process variations [9].
The previously proposed method of using different VHDL architectures, a
general-purpose one and an FPGA-specific one, allows the designer to use every
trick in the book to make the most of the FPGA architecture, while still
maintaining good design methodologies in the ASIC version. This also allows
testability and safe timing.
It is often desirable that the ASIC is pin-for-pin compatible with the
prototype FPGA that was designed first. This allows the ASIC to be used as a
drop-in replacement and the exact same board to be used in both
implementations. Special consideration must be given to the FPGA configuration
pins and also to the ASIC testing pins.

4. Conclusions

In the first section, we investigated different design methodologies that give
optimum results in FPGA implementations. Then we showed that knowledge of the
target architecture could give the designer important information that can be
used to change the structure of the design to make the most of the target
architecture. It was also shown that although synthesis tools are aware of the
target architecture, they are not currently intelligent enough to make these
optimisations. Furthermore, the designer can make optimisations not only at
the low logic level, but even at the algorithmic level, something that is
impossible for synthesis tools.
It has been shown that placement and routing are a very important part of FPGA
design and that in most cases the routing delays contribute almost as much to
the total delay as the combinatorial delays. Careful pin assignment and
floorplanning, although more time-consuming, pay off in reduced routing time
and improved system performance.
FPGAs are often used to emulate the final ASIC during development and for
early impact in the market. When high-volume quantities are needed, production
can be switched to an ASIC technology.
To seek peak performance in both the FPGA and ASIC versions of a design, one
can use dual VHDL architectures for the critical components and combine them
in the same design. However, good design principles must be maintained in the
common components to allow a smooth transition from FPGAs to ASICs. Further
consideration must be given if the ASIC is destined to be a drop-in
replacement.
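
The dual-architecture approach can be sketched in VHDL-93 as follows. This is
only an illustration under stated assumptions: the names mult9x8, rtl,
fpga_tree, top, top_fpga and top_asic are hypothetical, and the fpga_tree
architecture is written in generic VHDL that merely mirrors the four 9x2
partial products and adder tree of Figure 1. A real FPGA-specific architecture
would instantiate vendor cells instead, whose interfaces are not given in this
paper and are therefore not shown. For simplicity every intermediate value is
17 bits wide rather than using 13-bit adders.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Signed 9-bit by unsigned 8-bit multiplier, 17-bit signed product.
entity mult9x8 is
  port (
    a : in  signed(8 downto 0);
    b : in  unsigned(7 downto 0);
    p : out signed(16 downto 0)
  );
end entity mult9x8;

-- Technology-independent version, left to the synthesis tool.
architecture rtl of mult9x8 is
begin
  -- b is zero-extended to 9 bits so it can be treated as a signed value.
  p <= resize(a * signed(resize(b, 9)), 17);
end architecture rtl;

-- Stand-in for an FPGA-oriented version: four 9x2 partial products
-- followed by a two-level adder tree (generic VHDL, no vendor cells).
architecture fpga_tree of mult9x8 is
  type pp_array is array (0 to 3) of signed(16 downto 0);
  signal pp             : pp_array;
  signal sum_lo, sum_hi : signed(16 downto 0);
begin
  gen_pp : for i in 0 to 3 generate
    -- a times one 2-bit slice of b, weighted by 4**i.
    pp(i) <= shift_left(
               resize(a * signed(resize(b(2*i + 1 downto 2*i), 3)), 17),
               2*i);
  end generate gen_pp;

  sum_lo <= pp(0) + pp(1);
  sum_hi <= pp(2) + pp(3);
  p      <= sum_lo + sum_hi;
end architecture fpga_tree;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical top level that uses the multiplier as a component.
entity top is
  port (
    a : in  signed(8 downto 0);
    b : in  unsigned(7 downto 0);
    p : out signed(16 downto 0)
  );
end entity top;

architecture struct of top is
  component mult9x8
    port (
      a : in  signed(8 downto 0);
      b : in  unsigned(7 downto 0);
      p : out signed(16 downto 0)
    );
  end component;
begin
  u_mult : mult9x8 port map (a => a, b => b, p => p);
end architecture struct;

-- FPGA build: bind the component to the FPGA-oriented architecture.
configuration top_fpga of top is
  for struct
    for u_mult : mult9x8
      use entity work.mult9x8(fpga_tree);
    end for;
  end for;
end configuration top_fpga;

-- ASIC build: bind the same component to the RTL architecture.
configuration top_asic of top is
  for struct
    for u_mult : mult9x8
      use entity work.mult9x8(rtl);
    end for;
  end for;
end configuration top_asic;
```

Simulating or synthesising top_fpga or top_asic selects the corresponding
multiplier without touching the source of top, which is the property that makes
the later move from FPGA to ASIC a matter of changing the configuration.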

5. References

[1] Lucent, Epic manual and Data Book
[2] Xilinx, "HDL Synthesis for FPGAs Design Guide",
    http://www.xilinx.com/appnotes/hdl_dg.pdf, chapter 4
[3] Michael Gschwind and Valentina Salapura, "VHDL design methodology for
    FPGAs", Technische Universität Wien
[4] Michael J. Alexander and Gabriel Robins, "New performance driven FPGA
    routing algorithms", University of Virginia
[5] Stephen Brown, Jonathan Rose and Z. Vranesic, "A stochastic model to
    predict the routability of FPGAs", University of Toronto
[6] Synopsys, "Pipelining Designs", Synopsys Application note
[7] Michael Gschwind and Valentina Salapura, "Optimizing VHDL for FPGA
    targets", Technische Universität Wien
[8] Peter Alfke and Bernie New, "Implementing state machines in LCA devices",
    Xilinx PLD Book, 1994
[9] Ron Modo, "FPGA Design Practices that help ensure good Migration",
    IDSMAG, April 97
