
UNIT II : CPLD & FPGA ARCHITECTURE & APPLICATIONS

INTRODUCTION: The Xilinx Programmable Gate Array, known as a Logic Cell Array (LCA),
is a high-density CMOS IC that combines user programmability with the flexibility of a gate
array architecture and the economy and testability of standard products. Xilinx reprogrammable
architectures are used because of their flexibility, low prices for small quantities, testability and
short development time. Most design changes can be implemented by reprogramming the LCAs.
Thus, the use of LCAs allows a design to go directly from schematic capture to a production
board. The programmable logic blocks in the Xilinx family of FPGAs are called Configurable
Logic Blocks (CLBs). The Xilinx architecture uses CLBs, I/O blocks, a switch matrix and an
external memory chip to realize a logic function. It uses external memory to store the
interconnection information. Therefore, the device can be reprogrammed by simply changing the
configuration data stored in the memory.
XILINX Logic Cell Array: This is the architectural feature introduced by XILINX in
1985 for its FPGA devices, and it is effectively proprietary to XILINX.
The XILINX LCA architecture consists of three major components:

(i) Configurable Logic Blocks (CLBs)
(ii) Input/Output Blocks (IOBs)
(iii) Programmable Interconnect


In addition, configuration memory is used to hold the configuration program bits which control
the configuration of the CLBs, IOBs and interconnect.
This LCA architecture consists of an interior matrix of logic blocks and a surrounding ring of I/O
interface blocks. Interconnect resources occupy the channels between the rows and columns of
logic blocks and between the logic blocks and I/O blocks. Like a microprocessor the LCA is a
program driven logic device. The functions of the LCA’s configurable logic blocks and I/O
blocks and their interconnection are controlled by a configuration program stored in an on-chip
memory. The configuration program is loaded automatically from an external memory on power-
up or on command, or is programmed by a microprocessor as part of system initialization.
As shown in the diagram below, the configuration memory consists of a distributed array of static
memory cells. During configuration each cell is written through the data line, and during a
readback operation it is read through the same data line.

During normal operation the pass transistor is off and continuous configuration control is
provided. There are five methods for loading configuration program data into configuration
memory. Among them two methods load the data serially and three methods load the data in a
byte wide parallel manner.

The LCA performance is determined by the speed of the logic, storage elements and programmable
interconnect. LCA performance is specified by the maximum toggle rate for a logic block
storage element configured as a toggle flip-flop. For typical applications, system clock rates are
one-third to one-half the maximum flip-flop toggle rate.

The core of the LCA is a matrix of identical Configurable Logic Blocks (CLBs). Each CLB contains
programmable combinational logic and storage registers. The combinational logic section of
the block is capable of implementing any Boolean function of its input variables. The registers
can be loaded from the combinational logic or directly from a CLB input. The register outputs can
be inputs to the combinational logic via an internal feedback path.
The periphery of the Logic Cell Array is made up of user-programmable input/output blocks
(IOBs). Each block can be programmed independently to be an input, an output or a bidirectional
pin with three-state control. Inputs can be programmed to recognize either TTL or CMOS
thresholds. Each IOB also includes flip-flops that can be used to buffer inputs and outputs.

The flexibility of the LCA is due to resources that permit program control of the interconnection
of any two points on the chip. The LCA interconnection resources include a two-layer metal
network of lines that run horizontally and vertically in the rows and columns between the CLBs.
Programmable switches connect the inputs and outputs of IOBs and CLBs to the nearest metal
lines. Cross-point switches and interchanges at the intersections of rows and columns allow
signals to be switched from one path to another. Long lines run the entire length or breadth of the
chip, bypassing interchanges to provide distribution of critical signals with minimum delay or
skew.

Configurable Logic Block (CLB): The core of the FPGA is a matrix of identical Configurable
Logic Blocks (CLBs). Each CLB contains a combinational logic array, program-controlled data
multiplexers, and flip-flops. The CLB also contains RAM memory cells and can be programmed
to realize any function of five variables or any two functions of four variables. The functions are
stored in truth-table form, so the number of gates required to realize the functions is not
important. In the figure below each trapezoidal block represents a multiplexer, which can be
programmed to select one of its inputs. The block diagram of the CLB is shown below.

The array of CLBs provides the functional elements from which the user’s logic is constructed.
The logic blocks are arranged in a matrix within the perimeter of IOBs. For example, the
XC3020A has 64 such blocks arranged in 8 rows and 8 columns. The development system is used
to compile the configuration data which is to be loaded into the internal configuration memory to
define the operation and interconnection of each block. User definition of CLBs and their
interconnecting networks may be done by automatic translation from a schematic-capture logic
diagram or optionally by installing library or user macros. Each CLB has a combinatorial logic
section, two flip-flops, and an internal control section. There are five logic inputs (A, B, C, D
and E), a common clock input (K), an asynchronous direct RESET input (RD), and an enable
clock (EC). All may be driven from the interconnect resources adjacent to the blocks. Each CLB
also has two outputs (X and Y) which may drive interconnect networks. Data input for either
flip-flop within a CLB is supplied from the function F or G outputs of the combinatorial logic, or
the block input, DI. Both flip-flops in each CLB share the asynchronous RD which, when
enabled and High, is dominant over clocked inputs. All flip-flops are reset by the active-Low
chip input, RESET, or during the configuration process. The flip-flops share the enable clock
(EC) which, when Low, recirculates the flip-flops’ present states and inhibits response to the
data-in or combinatorial function inputs on a CLB. The user may enable these control inputs and
select their sources. The user may also select the clock net input (K), as well as its active sense
within each CLB. This programmable inversion eliminates the need to route both phases of a
clock signal throughout the device.
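The priority of the flip-flop control inputs described above can be summarized in a small behavioural model. This is only a sketch in Python, not vendor code: the function name clb_ff_next and its argument names are assumptions chosen for illustration, and the asynchronous reset is simplified to a check made at each clock edge.

```python
def clb_ff_next(q, d, ec, rd):
    """Return the state of one CLB flip-flop after an active clock edge.

    q  : present state (0 or 1)
    d  : selected data input (F, G or DI)
    ec : enable clock; when Low the present state is recirculated
    rd : reset direct; when High it dominates the clocked inputs
    """
    if rd:          # RD is dominant over everything else
        return 0
    if not ec:      # EC Low: hold (recirculate) the present state
        return q
    return d        # normal clocked load of the data input


# Example: with EC Low the flip-flop ignores d and keeps its state.
held = clb_ff_next(q=1, d=0, ec=0, rd=0)
loaded = clb_ff_next(q=1, d=0, ec=1, rd=0)
```

The order of the two checks encodes the statement that RD, when enabled and High, overrides both the enable clock and the data input.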
The combinatorial-logic portion of the CLB uses a 32 by 1 look-up table to implement Boolean
functions. Variables selected from the five logic inputs and two internal block flip-flops are used
as table address inputs. The combinatorial propagation delay through the network is independent
of the logic function generated and is spike free for single-input variable changes. Partial
functions of six or seven variables are implemented using the input variable (E) to dynamically
select between two functions of four different variables. For the two functions of four variables
each, the independent results (F and G) may be used as data inputs to either flip-flop or either
logic block output. For the single function of five variables and merged functions of six or seven
variables, the F and G outputs are identical. Symmetry of the F and G functions and the flip-flops
allows the interchange of CLB outputs to optimize routing efficiencies of the networks
interconnecting the CLBs and IOBs.
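To make the look-up-table idea concrete, the following Python sketch models the 32 by 1 table as a list of 32 bits. The class name ClbLut, its method names, and the choice of using the lower and upper halves of the table for F and G are assumptions made for illustration; the real device also lets the flip-flop outputs serve as address inputs, which is not modelled here.

```python
class ClbLut:
    """Hypothetical model of a CLB's 32 x 1 look-up table."""

    def __init__(self, bits):
        if len(bits) != 32:
            raise ValueError("a 32 x 1 table needs exactly 32 bits")
        self.bits = list(bits)

    @classmethod
    def from_function(cls, func):
        # Fill the table by evaluating func on every input combination.
        # Address bit 0 is A, bit 1 is B, ... bit 4 is E.
        bits = []
        for addr in range(32):
            a, b, c, d, e = ((addr >> i) & 1 for i in range(5))
            bits.append(int(func(a, b, c, d, e)) & 1)
        return cls(bits)

    def f5(self, a, b, c, d, e):
        # One function of five variables: a single 32-entry lookup.
        return self.bits[a | (b << 1) | (c << 2) | (d << 3) | (e << 4)]

    def f4(self, a, b, c, d):
        # First of two four-variable functions: lower 16 entries.
        return self.bits[a | (b << 1) | (c << 2) | (d << 3)]

    def g4(self, a, b, c, d):
        # Second four-variable function: upper 16 entries.
        return self.bits[16 + (a | (b << 1) | (c << 2) | (d << 3))]


# Example: five-input odd parity stored as a truth table.
parity = ClbLut.from_function(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)
value = parity.f5(1, 0, 1, 1, 0)
```

Because the result is always a single table read, the lookup cost does not depend on how complicated the stored function is, which mirrors the statement that the propagation delay is independent of the logic function generated.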
Input/Output Blocks (IOBs):

The periphery of the Logic Cell Array is made up of user-programmable input/output blocks
(IOBs). Each block can be programmed independently to be an input, an output or a bidirectional
pin with three-state control. Each user-configurable IOB thus provides an interface between the
external package pin of the device and the internal user logic. The IOB includes both registered
and direct input paths. Each IOB also provides a programmable 3-state output buffer, which may
be driven by a registered or direct output signal. Configuration options allow each IOB to have an
inversion, a controlled slew rate and a high-impedance pull-up. Each input circuit also provides
input clamping diodes to provide electrostatic protection, and circuits to inhibit latch-up
produced by input currents.
The IOB also includes input and output storage elements and I/O options selected by
configuration memory cells. A choice of two clocks is available on each die edge. The polarity of
each clock line (not each flip-flop or latch) is programmable. A clock line that triggers the flip-
flop on the rising edge is an active Low Latch Enable (Latch transparent) signal and vice versa.
Passive pull-up can only be enabled on inputs, not on outputs. All user inputs are programmed
for TTL or CMOS thresholds.
The input-buffer portion of each IOB provides threshold detection to translate external signals
applied to the package pin to internal logic levels. The global input-buffer threshold of the IOBs
can be programmed to be compatible with either TTL or CMOS levels. The buffered input signal
drives the data input of a storage element, which may be configured as either a flip-flop or a
latch. The clocking polarity (rising/falling edge-triggered flip-flop, High/Low transparent latch)
is programmable for each of the two clock lines on each of the four die edges. Note that a clock
line driving a rising edge-triggered flip-flop makes any latch driven by the same line on the same
edge Low-level transparent and vice versa (falling edge, High transparent). All Xilinx primitives
in the supported schematic-entry packages, however, are positive edge-triggered flip-flops or
High transparent latches. When one clock line must drive flip-flops as well as latches, it is
necessary to compensate for the difference in clocking polarities with an additional inverter
either in the flip-flop clock input or the latch-enable input. I/O storage elements are reset during
configuration or by the active-Low chip RESET input. Both direct input (from IOB pin I) and
registered input (from IOB pin Q) signals are available for interconnect.
Programmable-interconnection resources in the Field Programmable Gate Array provide routing
paths to connect inputs and outputs of the IOBs and CLBs into logic networks. Interconnections
between blocks are composed of a two-layer grid of metal segments. Specially designed pass
transistors, each controlled by a configuration bit, form programmable interconnect points (PIPs)
and switching matrices used to implement the necessary connections between selected metal
segments and block pins.
The figure below is an example of a routed net. The development system provides automatic
routing of these interconnections. Interactive routing is also available for design optimization.
The inputs of the CLBs or IOBs are multiplexers which can be programmed to select an input
network from the adjacent interconnect segments. Since the switch connections to block inputs
are unidirectional, as are block outputs, they are usable only for block input connection and not
for routing. The figure below illustrates routing access to logic block input variables, control inputs
and block outputs.

Three types of metal resources are provided to fulfill various network interconnect
requirements.
• General Purpose Interconnect
• Direct Connection
• Long lines (multiplexed busses and wide AND gates)
General Purpose Interconnect
It consists of a grid of five horizontal and five vertical metal segments located between the rows
and columns of logic and IOBs. Each segment is the height or width of a logic block. Switching
matrices join the ends of these segments and allow programmed interconnections between the
metal grid segments of adjoining rows and columns. The switches of an un-programmed device
are all non-conducting. The connections through the switch matrix may be established by the
automatic routing or by selecting the desired pairs of matrix pins to be connected or
disconnected.
Special buffers within the general interconnect areas provide periodic signal isolation and
restoration for improved performance of lengthy nets. The interconnect buffers are available to
propagate signals in either direction on a given general interconnect segment. These bidirectional
(bidi) buffers are found adjacent to the switching matrices, above and to the right. The other PIPs
adjacent to the matrices are accessed to or from Long lines. The development system
automatically defines the buffer direction based on the location of the interconnection network
source. The delay calculator of the development system automatically calculates and displays the
block, interconnect and buffer delays for any paths selected. Generation of the simulation net list
with a worst-case delay model is provided.
Direct Interconnect

Direct interconnect provides the most efficient implementation of networks between adjacent
CLBs or I/O Blocks. Signals routed from block to block using the direct interconnect exhibit
minimum interconnect propagation and use no general interconnect resources. For each CLB, the
X output may be connected directly to the B input of the CLB immediately to its right and to the
C input of the CLB to its left. The Y output can use direct interconnect to drive the D input of the
block immediately above and the A input of the block below. Direct interconnect should be used
to maximize the speed of high-performance portions of logic. Where logic blocks are adjacent to
IOBs, direct connect is provided alternately to the IOB inputs (I) and outputs (O) on all four
edges of the die. The right edge provides additional direct connects from CLB outputs to
adjacent IOBs.
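The direct-interconnect rules for CLB outputs can be written down as a small table lookup. The Python sketch below assumes an 8 by 8 array (the XC3020A size quoted earlier) and assumes that row 0 is the top row and column 0 the leftmost column; the function name direct_connections is hypothetical. Direct connects from edge CLBs to IOBs are not modelled.

```python
ROWS, COLS = 8, 8  # array size of the XC3020A mentioned in the text


def direct_connections(row, col):
    """List the CLB-to-CLB direct connects driven by the CLB at (row, col).

    Each entry is (output, dest_row, dest_col, dest_input).
    Assumption: row 0 is the top row and col 0 is the leftmost column.
    """
    candidates = [
        ("X", row, col + 1, "B"),  # X drives B of the CLB to the right
        ("X", row, col - 1, "C"),  # X drives C of the CLB to the left
        ("Y", row - 1, col, "D"),  # Y drives D of the CLB above
        ("Y", row + 1, col, "A"),  # Y drives A of the CLB below
    ]
    # Keep only destinations that are inside the CLB array.
    return [c for c in candidates if 0 <= c[1] < ROWS and 0 <= c[2] < COLS]


corner = direct_connections(0, 0)
interior = direct_connections(3, 3)
```

An interior CLB has all four direct connects available, while a corner CLB has only two CLB neighbours; its remaining outputs would reach adjacent IOBs instead.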
Long lines

The Long lines bypass the switch matrices and are intended primarily for signals that must travel
a long distance, or must have minimum skew among multiple destinations. Long lines run
vertically and horizontally the height or width of the interconnect area. Each interconnection
column has three vertical Long lines, and each interconnection row has two horizontal Long
lines. Two additional Long lines are located adjacent to the outer sets of switching matrices.
Long lines can be driven by a logic block or IOB output on a column-by-column basis. This
capability provides a common low skew control or clock line within each column of logic
blocks. Isolation buffers are provided at each input to a Long line and are enabled automatically
by the development system when a connection is made.
Technology Mapping for FPGA :
An FPGA consists of a regular array of logic blocks that implement combinational and
sequential logic functions and a user programmable routing network that provides connections
between the logic blocks. In conventional ASIC implementation technologies such as Mask
Programmed Gate Arrays (MPGAs) and Standard Cells the connections between logic blocks
are implemented by metallization at a fabrication facility. In an FPGA the connections are
implemented in the field using the user programmable routing network. This reduces
manufacturing turn-around times drastically from weeks to minutes and reduces prototype
costs.
The limitations, however, are the density and performance penalties associated with user-programmable
routing. The programmable connections, which consist of metal wire segments connected by
programmable switches, occupy greater area and incur greater delay than simple metal wires. To
reduce the density penalty, FPGA architectures employ highly functional logic blocks such as
lookup tables that reduce the total number of logic blocks and hence the number of
programmable connections needed to implement a given application. These complex logic
blocks also reduce the performance penalty by reducing the number of logic blocks and
programmable connections on the critical paths in the circuit.
The high functionality of FPGA logic blocks presents new challenges for logic synthesis.
Technology mapping addresses these challenges for FPGAs that use lookup tables to implement
combinational logic. Technology mapping is the process of transforming a
technology-independent Boolean network into a technology-dependent network. For example, a K-input
lookup table (LUT) is a digital memory that can implement any Boolean function of K variables.
The K inputs are used to address a 2^K by 1-bit memory that stores the truth table of the Boolean
function. Lookup tables have been reported to be an area-efficient method of implementing
combinational functions, and the delays of LUT-based FPGAs to be low compared
to the delays of FPGAs using other types of logic blocks. The goal of technology mapping is
to reduce area, delay or a combination of both.
Technology mapping is the logic synthesis task that is directly concerned with selecting the
circuit elements used to implement the optimized circuit. Previous approaches to technology
mapping have focused on using circuit elements from a limited set of simple gates. However,
such approaches are inappropriate for complex logic blocks where each logic block can
implement a large number of functions. A K-input lookup table can implement 2^(2^K) different
functions. For values of K greater than 3 the number of different functions becomes too large
for conventional technology mapping. Therefore new approaches to technology mapping are
required for LUT-based FPGAs.
Library-Based Technology Mapping : In library based mapping, gates or components are
selected from a technology library to implement a circuit. Hence it is also referred to as library
binding. So, this method generates a technology mapping for a given Boolean network using a
characterized cell library with the objective of cost optimization or delay optimization. Standard
Cells and Mask Programmed Gate Arrays implement combinational functions using a limited
set of simple gates. For such ASIC technologies library-based technology mapping is very
useful.
In this methodology the set of available circuit elements is represented as a library of functions,
and the construction of the optimized circuit is divided into three sub-problems:
(i) Decomposition, (ii) Matching and (iii) Covering.
The original network is first decomposed into a canonical representation that uses limited fan in
NAND nodes. This decomposition guarantees that there will be no nodes in the network that are
too large to be implemented by any library element provided the library includes NAND gates
that reach the fan in limit.
After decomposition the network is partitioned into a forest of trees. The optimal sub-circuit
covering each tree is constructed, and finally the circuit covering the entire network is assembled
from these sub-circuits. To form the forest of trees, the decomposed network is partitioned at
fan-out nodes into a set of single-output sub-networks.
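The partitioning step can be sketched as finding the nodes at which the network is cut. In the Python sketch below a network is a dictionary mapping each internal node to the list of its fan-in nodes; this representation, the function name tree_roots and the example network are all assumptions made for illustration. A node becomes the root of its own single-output sub-network if it is a primary output or if it drives more than one other node.

```python
def tree_roots(fanins, primary_outputs):
    """Return the set of nodes at which the network is cut into sub-networks.

    fanins: dict mapping each internal node to the list of nodes feeding it
            (primary inputs appear only inside these lists, never as keys).
    """
    # Count how many times each node is used as an input elsewhere.
    fanout = {}
    for inputs in fanins.values():
        for src in inputs:
            fanout[src] = fanout.get(src, 0) + 1

    roots = set(primary_outputs)
    for node in fanins:
        if fanout.get(node, 0) > 1:  # internal fan-out node: cut here
            roots.add(node)
    return roots


# Hypothetical network: n1 feeds both n2 and n3, so it becomes a root.
network = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["n1", "d"],
    "out": ["n2", "n3"],
}
roots = tree_roots(network, primary_outputs=["out"])
```

Primary inputs with fan-out greater than one are deliberately not treated as roots here; they are the input nodes of the leaf DAGs discussed next.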
Each of these sub-networks is either a tree or a leaf DAG (Directed Acyclic Graph). A leaf DAG
is a multi-input, single-output DAG where only the input nodes have fan-out greater than one.
Each leaf DAG is converted into a tree by creating a unique instance of every input node for
each of its multiple fan-out edges. The optimal circuit implementing each tree is constructed
using a dynamic programming traversal that proceeds from the leaf nodes to the root node.
For every node in the tree an optimal circuit implementing the sub tree extending from the node
to the leaf nodes is constructed. This circuit consists of a library element that matches a sub
function rooted at the node and previously constructed circuits implementing its inputs. The cost
of the circuit is calculated from the cost of the matched library element and the cost of the
circuits implementing its inputs.
To find the lowest-cost circuit, the DAGON algorithm first finds all library elements that match
sub-functions rooted at the node. The cost of the circuit using each of these candidate library
elements is then calculated and the lowest-cost circuit is retained. The set of matching library
elements is found by searching through the library and using tree matching to determine whether
each library element matches a sub-function rooted at the node.
As an example, let us consider the library shown in figure (a) below and the circuit shown in
figure (b). The circuit elements are standard cells and their costs are given in terms of the area of
the cells. The costs of the INV, NAND-2 and AOI-21 cells are 2, 3 and 4 respectively. In figure
(b) the only library element matching at node E is the NAND-2, and the cost of the optimal
circuit implementing node E is therefore 3. At node C the only matching library element is also
the NAND-2. The cost of the NAND-2 is 3 and the cost of the optimal circuit implementing its
input E is also 3. Therefore, the cumulative cost of the optimal circuit implementing node C is 6.
Finally the algorithm reaches node A. For node A there are two matching library elements:
the INV as used in figure (b) and the AOI-21 as used in figure (c). The circuit constructed using
the INV matching A includes a NAND-2 implementing node B, a NAND-2 implementing node
C, an INV implementing node D and a NAND-2 implementing node E. The cumulative cost of
this circuit is 13. The circuit constructed using the AOI-21 matching A includes a NAND-2
implementing node E. The cumulative cost of this circuit is 7. The circuit using the AOI-21 is
therefore the optimal circuit implementing node A.
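The cost calculation in this example can be reproduced with a short dynamic-programming sketch in Python. Since the figures are not reproduced here, the tree below is an assumed reconstruction that is consistent with the costs quoted in the text: A = INV(B), B = NAND(C, D), C = NAND(E, x), D = INV(y) and E = NAND(p, q), where x, y, p and q are hypothetical leaf names. The pattern encoding and function names are likewise illustrative and are not taken from DAGON itself.

```python
LEAF = "*"  # wildcard in a pattern: matches any subtree, which becomes an input

# Each library cell: (area cost, list of pattern trees over INV / NAND-2 nodes).
LIBRARY = {
    "INV": (2, [("INV", LEAF)]),
    "NAND-2": (3, [("NAND", LEAF, LEAF)]),
    # AOI-21 computes NOT(a.b + c); as a NAND/INV tree this is
    # INV(NAND(NAND(a, b), INV(c))), in either input order.
    "AOI-21": (4, [
        ("INV", ("NAND", ("NAND", LEAF, LEAF), ("INV", LEAF))),
        ("INV", ("NAND", ("INV", LEAF), ("NAND", LEAF, LEAF))),
    ]),
}


def match(pattern, node):
    """Return the subtrees bound to the pattern's leaves, or None if no match."""
    if pattern == LEAF:
        return [node]
    if isinstance(node, str) or node[0] != pattern[0] or len(node) != len(pattern):
        return None
    inputs = []
    for sub_pattern, child in zip(pattern[1:], node[1:]):
        bound = match(sub_pattern, child)
        if bound is None:
            return None
        inputs += bound
    return inputs


def best_cover(node):
    """Return (cost, cells) of the cheapest cover of the tree rooted at node."""
    if isinstance(node, str):  # primary input: nothing to implement
        return 0, []
    best = None
    for name, (cost, patterns) in LIBRARY.items():
        for pattern in patterns:
            inputs = match(pattern, node)
            if inputs is None:
                continue
            total, cells = cost, [name]
            for sub in inputs:  # add the optimal covers of the cell's inputs
                sub_cost, sub_cells = best_cover(sub)
                total += sub_cost
                cells += sub_cells
            if best is None or total < best[0]:
                best = (total, cells)
    if best is None:
        raise ValueError("no library element matches this node")
    return best


E = ("NAND", "p", "q")
C = ("NAND", E, "x")
D = ("INV", "y")
B = ("NAND", C, D)
A = ("INV", B)

cost_a, cells_a = best_cover(A)
```

For this reconstructed tree the sketch retains the AOI-21 plus NAND-2 cover at node A, matching the costs of 3, 6 and 7 worked out above. A practical mapper would store the best cover per node instead of recomputing it, which this sketch omits for brevity.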
The major obstacle to applying library-based technology mapping to LUT circuits is the large
number of different functions that a K-input LUT can implement. The function implemented by
a K-input LUT is determined by the values stored in its 2^K memory bits. Since each bit can
independently be either 0 or 1, there are 2^(2^K) different Boolean functions of K variables.
For values of K greater than 3 the library required to represent a K-input LUT becomes very
large. The size of the library can be reduced by noting that some patterns are equivalent after a
permutation of inputs. The inversion of outputs or inputs, which is trivially accomplished with
a LUT, can also produce equivalent patterns.
Another alternative is to use a partial library tuned to take advantage of the network structure
likely to be produced by technology-independent logic optimization. The limitation of this
approach is that it precludes some opportunities for optimization of the final circuit.
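The growth of the library, and the saving obtained from input permutations, can be checked numerically for small K. The Python sketch below represents a truth table as an integer with 2^K bits and counts how many distinct patterns remain when tables that differ only by a permutation of inputs are merged. The function names are illustrative assumptions, and only input permutation is considered, not inversion of inputs or outputs.

```python
from itertools import permutations


def num_functions(k):
    """Number of Boolean functions of k variables: 2^(2^k)."""
    return 2 ** (2 ** k)


def permute_truth_table(table, k, perm):
    """Return the truth table obtained by renaming input i to input perm[i]."""
    out = 0
    for addr in range(2 ** k):
        src = 0
        for i in range(k):
            if (addr >> i) & 1:
                src |= 1 << perm[i]
        if (table >> src) & 1:
            out |= 1 << addr
    return out


def count_permutation_classes(k):
    """Count truth tables that remain distinct under any input permutation."""
    canonical = set()
    for table in range(num_functions(k)):
        # The smallest permuted table serves as the class representative.
        canonical.add(min(permute_truth_table(table, k, p)
                          for p in permutations(range(k))))
    return len(canonical)


sizes = {k: num_functions(k) for k in (2, 3, 4, 5)}
classes_2 = count_permutation_classes(2)
classes_3 = count_permutation_classes(3)
```

For K = 2 the 16 functions reduce to 12 patterns and for K = 3 the 256 functions reduce to 80, so permutation equivalence helps but does not stop the library from growing very quickly with K.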
LUT-based Technology Mapping:
As noted earlier, the major obstacle to applying library-based technology mapping to LUT circuits
is the large number of different functions that a K-input LUT can implement: its 2^K memory bits
can each be 0 or 1, giving 2^(2^K) different Boolean functions of K variables. For
values of K greater than 3 the library required to represent a K-input LUT becomes very large.
The limitations of earlier technology mapping approaches paved the way for the development
of technology mapping that deals specifically with LUT circuits. The first LUT-based technology
mappers appeared in the 1990s and were later improved to optimize the delay of LUT circuits
by minimizing the number of levels of LUTs in the final circuit.
In LUT-based FPGAs (for example, XILINX FPGAs) the building blocks are LUTs and flip-flops.
In a LUT-based FPGA chip the basic programmable logic block is a K-input lookup
table (K-LUT), which can implement any Boolean function of up to K variables. Technology
mapping in LUT-based FPGA designs covers a general Boolean network using K-LUTs to
obtain a functionally equivalent K-LUT network. The main objectives in LUT mapping are:
(i) Cost-optimal mapping, i.e. minimizing the number of LUTs and the number of
CLBs.
(ii) Delay-optimal mapping, i.e. minimizing the number of LUT levels and the
delays (including routing delays).
(iii) Maximizing the routability of the mapping.
LUT-based technology mapping can be implemented using two types of algorithm:
(a) the area algorithm and (b) the delay algorithm.
The Area Algorithm :
A circuit can be implemented by a given FPGA only if the number of logic blocks in the circuit
does not exceed the available number of logic blocks and the required connections between the
logic blocks do not exceed the capacity of the routing network. The area algorithm minimizes
the total number of K-input LUTs in the circuit implementing a given network. Minimizing the
number of LUTs in the circuit allows larger networks to be implemented by the fixed number of
logic blocks available in a given LUT-based FPGA.
In the area algorithm, the original network is first partitioned into a forest of
trees and then each tree is separately mapped into a circuit of K-input LUTs. The final circuit is
then assembled from the circuits implementing the trees.
The main principle of the area algorithm is that it simultaneously addresses the decomposition
and matching problems using a bin-packing approximation algorithm. The correct decomposition
of network nodes can reduce the number of LUTs required to implement the network. For
example, let us consider the circuit of 5-input LUTs shown in figure (a) below. The shaded OR
node is not decomposed, and four 5-input LUTs are required to implement the network. However,
if the OR node is decomposed into the two nodes shown in figure (b), then only two LUTs
are required. The main problem is to find the decomposition of every node in the network
that minimizes the number of LUTs in the final circuit.
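The bin-packing view can be illustrated with the first-fit-decreasing heuristic, used here as one standard bin-packing approximation; the text does not say which approximation the area algorithm actually uses. In this Python sketch each fan-in sub-function of the node is a box whose size is its number of inputs, and each K-input LUT is a bin of capacity K. The box sizes 3, 2 and 2 are hypothetical values chosen because three fan-in sub-functions plus the undecomposed OR node would need four LUTs, as in the example above.

```python
def first_fit_decreasing(box_sizes, k):
    """Pack boxes (sub-function input counts) into bins (K-input LUTs).

    Returns a list of bins, each a list of the box sizes placed in it.
    """
    bins = []
    for size in sorted(box_sizes, reverse=True):
        if size > k:
            raise ValueError("a box larger than K cannot fit in one LUT")
        for contents in bins:
            if sum(contents) + size <= k:  # first bin with enough free inputs
                contents.append(size)
                break
        else:
            bins.append([size])            # no bin fits: open a new LUT
    return bins


# Hypothetical fan-in sub-functions of the OR node, with K = 5.
packed = first_fit_decreasing([2, 3, 2], k=5)
lut_count = len(packed)
```

With these sizes the packing needs two bins. The second bin still has free inputs, so it can also accept the output of the first bin, which corresponds to decomposing the OR node into two nodes; that final connection step is not modelled in the sketch.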
The delay algorithm: Unlike the area algorithm, which decomposes nodes to reduce the total
number of LUTs, the delay algorithm decomposes nodes to minimize the number of levels in the
final circuit. For example consider the circuit of 5-input LUTs shown in figure (a). In this
figure the number in the lower right hand corner of a LUT indicates its depth which is the
maximum number of LUTs along any path from a primary input to the output of the LUT. The
LUTs preceding the AND nodes are not shown in this figure but they are assumed to

In figure(a) the shaded OR node is not decomposed and 5 levels of LUTs are required to
implement the network. However if the OR node is decomposed into the two nodes shown in
figure (b) then only 4 levels of LUTs are required.
The delay algorithm, like the area algorithm, first partitions the original network into a forest of
trees, maps each tree separately into a circuit of K-input LUTs and then assembles the circuit
implementing the entire network from the circuits implementing the trees. The trees are mapped
in breadth-first order, proceeding from the primary inputs toward the primary outputs. This
ensures that when each tree is mapped, the trees implementing its leaf nodes have already
been mapped.
The overall strategy employed by the delay algorithm is to minimize the number of levels of
LUTs by minimizing the depth of every path in the final circuit. This can result in a circuit that
contains a large number of LUTs.
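The depth measure that the delay algorithm minimizes, defined above as the maximum number of LUTs on any path from a primary input to a LUT's output, can be computed with a short recursive sketch. The dictionary representation, the function name lut_depths and the example network in this Python sketch are assumptions for illustration only.

```python
def lut_depths(fanins):
    """Compute the depth of every LUT in a mapped network.

    fanins: dict mapping each LUT name to the list of its sources, which are
            either other LUT names or primary inputs (names that are not keys).
    """
    depths = {}

    def depth(node):
        if node not in fanins:  # primary input contributes no LUT level
            return 0
        if node not in depths:
            depths[node] = 1 + max((depth(s) for s in fanins[node]), default=0)
        return depths[node]

    for node in fanins:
        depth(node)
    return depths


# Hypothetical mapped network of three LUTs.
mapped = {
    "L1": ["a", "b"],
    "L2": ["L1", "c"],
    "L3": ["L2", "L1", "d"],
}
depths = lut_depths(mapped)
levels = max(depths.values())
```

The number of levels of the circuit is the largest depth over all LUTs, and it is this quantity, rather than the LUT count, that the delay algorithm tries to reduce when it chooses a decomposition.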
MULTIPLEXER BASED TECHNOLOGY MAPPING:

Multiplexer-based technology mapping is used for ACTEL FPGAs, whose logic blocks are built
from multiplexers, and it is also relevant to the dedicated multiplexers that recent Xilinx
VIRTEX 6 devices provide alongside their LUTs. In Actel FPGAs, the multiplexers are small
and suitable for achieving the objectives of area optimization and minimum delay.

Circuits usually contain a large number of multiplexers (MUXes). This is mainly true for circuits
that are automatically synthesized from high-level descriptions. MUXes exist in the data-paths of
circuits, where they are used to route operands to operators. Also, the control logic is frequently
specified as a CASE statement in HDL descriptions. MUXes arise as a result of a direct
translation of CASE statements in HDLs into a logic-level description. Cell libraries too contain
various choices of MUXes. Cell implementations make use of the fact that a pass-gate
implementation of a MUX is both faster and smaller. In the case of MUX-based FPGAs like
Actel, there is a natural presence of MUX in the virtual library. Thus, a method for mapping
MUX in the unmapped network to those in the library is desirable.
The significance of multiplexer synthesis is mainly due to the fact that multiplexer tree circuits
map naturally onto FPGAs such as the ACT family from Actel, where the basic building block
consists of multiplexers. Each basic building block of the ACT family allows the
implementation of a multiplexer (a) and, in the case of the ACT 1 family, implementation of
three hierarchical multiplexers (b), which is denoted by act0. The ACT 2 family allows only a
restricted realization of three hierarchical multiplexers, as can be seen in Fig. (b).

Basic building block of the ACT family: (a) ACT 1 family; (b) ACT 2 family.
The main objective of MUX-based technology mapping is to describe a combinational
circuit in terms of Boolean equations and realize it using the minimum number of basic blocks of
the target MUX-based architecture, while minimizing the delay on the critical path.

In this algorithm an appropriate base function, a library of cells and a set of pattern graphs are
selected. As an example, let us select a 2-to-1 multiplexer as the base function.
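One way to see what it means to use a 2-to-1 multiplexer as the base function is Shannon expansion, which rewrites any Boolean function as a tree of 2-to-1 multiplexers. The Python sketch below is only an illustration of that idea and is not the mapping algorithm described in these notes; the function names and the example function a.b + c are assumptions. A truth table is given as a tuple of bits indexed with the first variable as the most significant address bit.

```python
def shannon_mux(table, names):
    """Decompose a truth table into a tree of 2-to-1 multiplexers.

    Returns "0", "1", a variable name, or ("MUX", select, low, high), where
    low is chosen when select = 0 and high when select = 1.
    """
    if len(set(table)) == 1:            # constant function
        return str(table[0])
    half = len(table) // 2
    var = names[0]
    low = shannon_mux(table[:half], names[1:])   # cofactor with var = 0
    high = shannon_mux(table[half:], names[1:])  # cofactor with var = 1
    if low == high:                     # function does not depend on var
        return low
    if (low, high) == ("0", "1"):       # function is just the variable
        return var
    return ("MUX", var, low, high)


def count_muxes(expr):
    """Count the multiplexers in a decomposition tree."""
    if isinstance(expr, str):
        return 0
    _, _, low, high = expr
    return 1 + count_muxes(low) + count_muxes(high)


def evaluate(expr, env):
    """Evaluate a decomposition tree for the variable assignment env."""
    if expr in ("0", "1"):
        return int(expr)
    if isinstance(expr, str):
        return env[expr]
    _, select, low, high = expr
    return evaluate(high if env[select] else low, env)


# Example: f(a, b, c) = a.b + c, table index = 4a + 2b + c.
f_table = (0, 1, 0, 1, 0, 1, 1, 1)
tree = shannon_mux(f_table, ("a", "b", "c"))
mux_count = count_muxes(tree)
```

The resulting tree is the kind of base-function network on which pattern graphs for a MUX-based logic block would then be matched; the variable order chosen for the expansion affects how many multiplexers are needed.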

The above figure shows two MUX structures, STRUCT and STRUCT1. Four pattern graphs are
constructed for STRUCT1, as shown in the figure below. If a function is realizable by one STRUCT1
block, it uses either all the multiplexers, two of them, or just one. The pattern graphs are in one-to-one
correspondence with these possibilities, so only a very small set of patterns is needed to capture all
possible functions realizable by one STRUCT1 block. From the figure it is clear that the pattern
graph uses all the multiplexers.
The introduction of the OR gate at the select input of MUX3 increases the number of functions
realized by the block. From an algorithmic point of view it creates some problems, but a
modification of the algorithm is considered to handle the OR structure.

The advantages of MUX-based technology mapping are that it generates optimal mappings, which
are often much better than those produced by conventional heuristic techniques. Moderately
large circuits can be mapped optimally in a small amount of time. Very large circuits can be
mapped near-optimally by partitioning the circuits and mapping each partition individually.

---------xxx--------------

References:

(i) Technology Mapping for Lookup Table Based Field Programmable Gate Arrays, Robert J.
Francis.

(ii) Technology Mapping for Field-Programmable Gate Arrays Using Integer Programming,
Amit Chowdhary and John P. Hayes.

(iii) Experiences with XILINX Programmable Gate Arrays, J. Molendijk and U. Wehrle.
