Beruflich Dokumente
Kultur Dokumente
Xilinx FPGA
Ease of pipelining
Asynchronous pipelines enable the design of high-throughput logic cores that are easily
composable and reusable, where asynchronous pipeline handshakes enforce correctness
instead of circuit delays or pipeline depths as in clocked pipelines.
Robustness
Asynchronous circuits are automatically adaptive to delay variations resulting from
temperature fluctuations, supply voltage changes, and the imperfect physical manufacturing
of a chip, which are increasingly difficult to control in deep submicron technologies.
Tool compatibility
Able to use existing place and route CAD tools developed for clocked FPGAs.
Asynchronous Pipelines
Asynchronous Pipelines
The data wires encode bits using a dual-rail code such that setting
wire-0 transmits a logic-0 and setting wire-1 transmits a
logic-1.
The four-phase protocol:
The sender sets one of the data wires
The receiver latches the data and lowers the enable wire
The sender lowers all data wires
The receiver raises the enable wire when it is ready to accept new data.
Asynchronous Pipelines
Data wires start and end the handshake with their values in the lowered
position this handshake is an example of an asynchronous return-tozero protocol.
The cycle time of a pipeline stage is the time required to complete one fourphase handshake.
The throughput, or the inverse of the cycle time, is the rate at which
tokens travel through the pipeline.
Changing local pipeline depths in a clocked circuit often requires global retiming of
the entire system.
Slack elastic
Increasing the pipeline depth (= slack) will not change the logical correctness of the original
system.
Allows to locally add pipelining anywhere in the system without having to adjust or
re-synthesize the global pipeline structure of the system.
The pipelines in the asynchronous FPGAs were designed so that they are slack
elastic.
Banks of retiming registers (a significant overhead in pipelined clock FPGAs) arent necessary.
Dataflow Computations
Channel handshakes ensure that pipeline stages consume and produce tokens
in sequential order so that new data items cannot overwrite old data items.
In this dataflow model, data items have one producer and one consumer.
Data items needed by more than one consumer are duplicated by copy processes that produce a
new token for every concurrent consumer.
Clocked logic uses a global clock to separate data items in a pipeline, which
allows data items to fan out to multiple receivers because they are all
synchronized to the clock.
The default behavior for a clocked pipeline is to overwrite data items on the
next clock cycle, regardless of whether they were actually used for a
computation in the previous cycle.
Dataflow Computations
Dataflow Computations
Logical pipeline stages
Dataflow nodes in an asynchronous dataflow graph
Copy:
18
19
20
21
22
23
Pipelines
Pipeline:
Linear sequence of buffers where the output of one buffer
connects to the input of the next buffer.
Tokens are sent into the input end of the pipeline and flow
through each buffer to the output end.
The tokens remain in FIFO order.
For synchronous pipelines the tokens usually advance through
one stage on each clock cycle
For asynchronous pipelines there is no global clock to
synchronize the movement. Instead each token moves forward
down the pipeline when there is an empty cell in front of it
Otherwise it stalls.
24
Pipelines
Buffer capacity (slack)
The maximum number of tokens that can packed into the
buffer without stalling the input end of the pipeline.
Throughput
The number of tokens per second which pass a given stage in
the pipeline
Forward latency
The time it takes a given token to travel the length of the
pipeline
Buffer Reshuffling
L
CSP
S in gle R ail
B u ffe r
26
Buffer Reshuffling
An array of these buffers preserves the desired FIFO order and properties
of a pipeline
27
Buffer Reshuffling
Direct implementation of this CSP will require a
state variable to distinguish the first half from the
second half and has too much sequencing per cycle
It is better to reshuffle the waits and events to
reduce the amount of sequencing and the number
of state variables
We wish to maximize the throughput and minimize
the latency of a pipeline
28
Buffer Reshuffling
First:
29
Buffer Reshuffling
30
Buffer Reshuffling
that data would need to be saved in internal state variables proportional to the
number of bits on R or L
31
Buffer Reshuffling
32
Buffer Reshuffling
WC
= weak condition logic
PC
= pre-charge logic
HB
= half buffer (slack ) - The L and R channels cannot
simultaneously hold two distinct
data tokens
FB = full buffer (slack 1)
Weak condition:
The validity and neutrality of the output data R implies the validity and neutrality of the input
data L
It means that the L does not need to be checked anywhere else besides in R
In the half buffer reshufflings only every other stage can have token on
its output channel since a token on that channel blocks the previous
stage from producing an output token
33
Buffer Reshuffling
34
Buffer Reshuffling
Expanding for more than single rail:
The L and La will represent all the input data and acknowledges
The R and Ra will represent all the output data and acknowledges
[L] indicates a wait for the validity of all inputs and
[L] wait for the neutrality of all inputs
[Ra] indicates a wait for all the output acknowledges to be false
[Ra] indicates a wait for all the output acknowledges to be true
La indicates raising all the input acknowledges in parallel
La indicates lowering them
R means that all the outputs are set to their valid states in parallel
R means that all the outputs are set to their neutral states
When R occurs it is meant that particular rails of the outputs are raised depending on which rails
of L are true. This expands R into a set of exclusive selection statements executing in parallel
35
36
37
38
WCHB
Bit
Generator
Bit
Bucket
39
WCHB
40
C-element restriction
41
42
43
The dependency between the data-input falling and the data-output falling
has been removed and, instead, the input falling is a precursor for the
outputs re-evaluating
44
45
Achronix FPGA
46
Achronix FPGA
Each RLB contains:
8 x 4-input LUTs,
storage elements
128 bits of RAM.
47
Achronix FPGA
48
Achronix FPGA
CEs are capable of being initialized into one of two states:
state holding
not state holding.
49
Achronix FPGA
FEs have functionality equivalent to combinatorial
logic.
The only difference relates to how ingress and
egress data is handled.
The local handshaking within a picoPIPE network
means the FEs must also handshake data in and
out.
This handshaking ensures only valid, settled data
is propagated
50
Achronix FPGA
BEs are only used at the boundary where the picoPIPE fabric
meets the FPGA Frame.
These elements are responsible for converting Data Tokens in
the Frame into Data Tokens in the picoPIPE fabric (ingress).
They are also used for converting Data Tokens in the fabric
back into Data Tokens in the Frame (egress).
The conventional I/O Frame ensures that every Data Token
enters the picoPIPE core at a clock edge, and every Data Token
leaving the picoPIPE core is clocked out at a clock edge.
51
Base Architecture
52
53
54
Initializes
with aa control
token on
its on
Receives
token
output.
Upon system
channel
C and areset
dataatoken
token on
is sent
on channel
channel
A. If the control
Y . Unit
is used
for logic-0"
state
token
equals
it sends
initialization
the data token on channel Y,
Inputs from routers.
Copies
result tokens
froma channels Y otherwise it sends the data
If channel
is unused
and token
Z to one
oramore
of the physical
token on channel Z.
with
logic-1"
output
ports
(Nout;Eout;
Sout;Wout) or
value
is internally
sourced
sinks
tokens before they reach
onthe
theresult
channel.
any output port.
Logic Cell
This router is
implemented as a
switch matrix
and of three
Two arbitrary
functions
is un-pipelined.
variables.
Receives tokens on
channels (A;B;C) and sends
function results on output
channels (Y;Z).
Channel Router
55
Programmable switch
Asynchronous circuits
56
High throughput
Minimum pipeline cycle times of 10-16 FO4 delays (competitive with clocked domino logic).
Energy efficient
Power savings from no extra output latch, no clock tree, and no dynamic power dissipation
when the pipeline stage is idle.
57
Asynchronous circuits
WCHB is most useful for token buffering and token copying
PCHB is optimized for performing logic computations
(similar to dual-rail clocked domino circuits).
Since the weak-condition and pre-charge pipeline stages both use the dual-rail
handshake protocol, they can be freely mixed together in the same pipeline.
WCHB stages are used in the
Token unit,
Output-Copy,
Copy processes of the Function unit.
58
Asynchronous circuits
59
Physical Design
60
Physical Design
To minimize cell area and simplify programming, configuration
bits are programmed using JTAG clocked circuitry.
The area breakdown for the architecture components is:
61
Physical Design
Simulation of the layout in SPICE found the maximum inter-cell operating
frequency for PAPA logic to be 395MHz.
Internally the logical units can operate much faster, but are slowed by the
channel routers.
To observe this the logical units were configured to internally source
logic-1" tokens on their inputs and the Output-Copy stages were
configured to sink all result tokens (bypassing all routers). The results are:
Function unit (498MHz, 26pJ/cycle),
Merge unit (543MHz, 11pJ/cycle),
Split unit (484MHz, 12pJ/cycle),
Token unit (887MHz, 7pJ/cycle).
Highly Pipelined
Asynchronous FPGAs
(2004)
Optimized Architecture
62
63
64
Logic Cell
65
66
Logic Cell
Function Unit
Carry chains can be routed using the normal interconnect or using low latency carry
channels that run vertically south-to-north between adjacent vertical logic blocks
67
Logic Cell
Asynchronous LUT circuit
68
Logic Cell
State Unit
The state unit is a small pipeline built from two
WCHB stages that feeds the function unit output
back as an input to the function unit, forming a fast
token-ring pipeline.
Upon system reset, a token is generated by the state
unit to initialize this token-ring.
This state feedback mechanism is better than in
basic design, where all feedback token rings needed
to be routed through the global interconnect.
69
Logic Cell
Conditional unit
70
Logical Cell
Output Copy
71
Pipelined Interconnect
72
Pipelined Interconnect
Channel connecting two logic blocks can be
routed through an arbitrary number of pipelined
switch boxes without changing the correctness of
the resulting logic system (slack elasticity)
system performance can still decrease if a
channel is routed through a large number of
switch boxes
Need to determine sensitivity of channel route
lengths on pipelined logic performance
73
74
75
76
Logic Synthesis
77
Logic synthesis
78
Assignment: (a := b)
assign the value of b to a
a for a := true
a for a := false
79
Send: X!e
send the value of e over channel X
Receive: Y?v
Receive a value over channel Y and store it in variable v.
80
81
82
83
Steps of synthesis
Begin with a high-level sequential specification of the
logic
Apply semantics-preserving program
transformations to partition the original specification
into high-level concurrent function blocks.
The function blocks are further decomposed into sets
of highly concurrent processes that are guaranteed
to be functionally equivalent to the original
sequential specification.
Map the results to the FPGAs logical cells
84
An example
85
86
87
Physical Implementation of
Dataflow FPGA Architectures
88
89
Circuit Performance
90
Circuit Performance