01 Intro

FPGAs for DSP
This material exempt per Department of Commerce license exception TSU
© 2011 Xilinx, Inc. All Rights Reserved
Objectives
After completing this module, you will be able to:
Describe why parallelism enables such high performance
Describe how FPGA architectures lend themselves to an optimum implementation of DSP functions
For Academic Use Only
Outline
Power of Parallelism
Virtex-5 FPGA Architecture
Latest Families
Virtex-6/Spartan-6 Families
Virtex-7 Families
Why should I use FPGAs for DSP?

The DSP48 Slice Advantage
Essence of a DSP Processor

Algorithm must be implemented within the constraints of the pre-defined fixed architecture
Program Counter and Control: cycles expended making decisions and controlling flow
I/O: cycles expended communicating with outside world or other processors
Program Memory: program must be stored in ROM, and many instructions do not directly contribute to processing
Instruction Decode
Registers
Fixed bit width: algorithm may not require all bits
ALU: supports many operations, but only one or a few can be used at one time
ALU contains a fixed set of operations, and multiple operations (cycles) are required to achieve the desired effect
Memory: all values currently not in use must be retained
Sequential Processing Limits System Performance
Max Sample Rate = Fixed Processor Clock Rate / Number of operations per sample
[Figure: channel density or sample rate versus algorithmic complexity (number of coefficients, 2 to 104), for a single 300 MHz processor and for two 300 MHz processors]
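The relation between clock rate, operations per sample, and sample rate can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function name `max_sample_rate` is hypothetical, one operation per coefficient is assumed, and multiple processors are assumed to scale ideally.

```python
def max_sample_rate(clock_hz, ops_per_sample, processors=1):
    """Upper bound on sample rate when every sample costs ops_per_sample cycles.

    Assumes one operation per cycle and ideal scaling across processors.
    """
    if ops_per_sample <= 0 or processors <= 0:
        raise ValueError("ops_per_sample and processors must be positive")
    return processors * clock_hz / ops_per_sample


# Sweep a few filter sizes for one and two 300 MHz processors.
for taps in (8, 32, 104):
    single = max_sample_rate(300e6, taps)
    dual = max_sample_rate(300e6, taps, processors=2)
    print(f"{taps:3d} coefficients: {single / 1e6:6.2f} MSPS (one), "
          f"{dual / 1e6:6.2f} MSPS (two)")
```

Under these assumptions, doubling the processor count only doubles the rate, while the rate still falls in proportion to the number of coefficients.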
Multiply Accumulate: Single Engine
Sequential processing limits data throughput
Time-shared MAC unit
High clock frequency creates difficult system challenge
256 Tap FIR Filter
256 multiply and accumulate (MAC) operations per data sample
One output every 256 clock cycles
[Figure: Data In passes through a register into a single MAC unit that loops the algorithm 256 times before producing Data Out]
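The time-shared MAC can be modelled in software to make the cycle count explicit. The sketch below is illustrative only: `fir_sequential` is a hypothetical name, each inner-loop iteration is assumed to cost exactly one clock cycle, and control overhead is ignored.

```python
def fir_sequential(samples, coeffs):
    """FIR filter on a single time-shared MAC unit.

    Returns the outputs and the total number of MAC cycles spent,
    assuming one multiply-accumulate per clock cycle.
    """
    delay = [0] * len(coeffs)          # sample storage (the "Reg" block)
    outputs, cycles = [], 0
    for x in samples:
        delay = [x] + delay[:-1]       # shift the new sample in
        acc = 0
        for c, d in zip(coeffs, delay):  # loop once per tap
            acc += c * d               # one MAC operation
            cycles += 1
        outputs.append(acc)
    return outputs, cycles


coeffs = [1] * 256                     # placeholder coefficients
outputs, cycles = fir_sequential([1, 2, 3], coeffs)
print(outputs, "cycles per output:", cycles // len(outputs))
```

With 256 taps the model spends 256 cycles on every output, which is the bottleneck the slide describes.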
Multiply Accumulate: Multiple Engines
Parallel processing maximizes data throughput
Support any level of parallelism
Optimal performance/cost tradeoff
Flexible architecture
Distributed DSP resources (LUT, registers, multipliers, & memory)
256 Tap FIR Filter
256 multiply and accumulate (MAC) operations per data sample
All 256 MAC operations in one clock cycle
One output every clock cycle
[Figure: Data In feeds a chain of registers Reg0 to Reg255; each register output is multiplied by its coefficient C0 to C255 and the products are summed to Data Out]
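A software model of the fully parallel structure treats one function call as one clock cycle: every tap has its own multiplier, so a new output appears on each call. This is a behavioural sketch under simplifying assumptions (no pipeline latency, arbitrary precision); `fir_parallel_clock` is a hypothetical name.

```python
def fir_parallel_clock(delay, x, coeffs):
    """Model one clock cycle of a fully parallel FIR filter.

    The register chain shifts by one sample, all taps multiply at the
    same time, and the products are summed into a single output.
    """
    delay = [x] + delay[:-1]                            # Reg0..RegN-1 shift
    products = [c * d for c, d in zip(coeffs, delay)]   # one multiplier per tap
    return delay, sum(products)


coeffs = list(range(1, 257))           # placeholder coefficients C0..C255
delay = [0] * len(coeffs)
outputs = []
for x in [1, 0, 0, 2]:                 # one input sample per clock
    delay, y = fir_parallel_clock(delay, x, coeffs)
    outputs.append(y)
print(outputs)
```

The number of outputs equals the number of clocks, in contrast to the single-engine case where each output costs one cycle per tap.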
Overview
All Xilinx FPGAs contain the same basic resources
Slices grouped into Configurable Logic Blocks (CLBs)
Contain combinatorial logic and register resources
IOBs
Interface between the FPGA and the outside world
Programmable interconnect
Other resources
Memory
Multipliers
Global clock buffers
Boundary scan logic
Virtex-5 FPGA Platform

Feature Overview
CLB
BRAM
I/O
CMT
BUFGMUX
DSP48E
BUFIO & BUFR
Virtex-5 Family
The Ultimate System Integration Platform
Resources: Logic, RAM, DSP, Parallel I/Os, Serial I/Os, PPC processor
Platform optimizations: Logic, Logic/Serial, DSP/Serial, Emb./Serial, High SIO B/W
Built on the success of ASMBL

Virtex-5 FPGA CLB
Two slices per CLB
Four LUTs per slice
Based on true 6-input LUT technology
No input-sharing restrictions
64-bit distributed memory/SRL32
Increased performance
One to two speed grade average logic performance increase over prior generation
Reduction of combinatorial logic between flip-flops
Increased utilization
[Figure: arrangement of slices within CLBs]
Virtex-5 FPGA Slice

O5 output supports 5-input functions
F7MUX and F8MUX for wide functions
Carry chain
4-bit lookahead for speed
Free 0/1 in carry
Fast counter and accumulator
SLICEL and SLICEM
Approximately* one quarter of the slices are SLICEM
SLICEM LUT resources can be used for
Distributed memory (RAM or ROM): 64 memory bits per LUT
Shift register (SRL): 32 SRL bits per LUT
*The CLB column before and after each DSP48E column supports SLICEM functions
[Figure: columns of SLICEL and SLICEM slices]
6-Input LUT with Dual Output
6-input LUT can be two 5-input LUTs with common inputs
Minimal speed impact to a 6-input LUT
Full input swappability
One or two outputs
Any function of six variables or two independent functions of five variables
Reduces average slice count by 10 percent
[Figure: a 6-LUT with inputs A1 to A6, built from two 5-LUTs that share inputs A1 to A5, with outputs O6 and O5]
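The dual-output behaviour can be sketched as two 32-entry truth tables packed into one 64-bit value. The model below is an assumption about how such a LUT is typically organized (lower half drives O5, A6 selects which half drives O6); the names `lut5`, `lut6_2`, and `truth_table` are hypothetical and not taken from the slides.

```python
def lut5(init32, a):
    """Read one bit from a 32-entry truth table; a holds inputs A5..A1."""
    return (init32 >> (a & 0x1F)) & 1


def lut6_2(init64, a):
    """Return (O6, O5) for the 6-bit input a = A6..A1.

    Assumed organization: both 5-LUTs see A5..A1; A6 picks which half
    of the 64-bit table drives O6, and the lower half always drives O5.
    """
    low = lut5(init64 & 0xFFFFFFFF, a)
    high = lut5(init64 >> 32, a)
    o6 = high if (a >> 5) & 1 else low
    return o6, low


def truth_table(func, n_inputs):
    """Pack func(a) for every input combination into an integer."""
    return sum((func(a) & 1) << a for a in range(1 << n_inputs))


# Two independent 5-input functions in one LUT: parity on O6, AND on O5.
parity5 = truth_table(lambda a: bin(a).count("1"), 5)
and5 = truth_table(lambda a: int(a == 0b11111), 5)
init = (parity5 << 32) | and5
print(lut6_2(init, 0b100000 | 0b00111))   # A6 tied high selects parity on O6
```

Tying A6 high is what lets the two outputs carry unrelated functions of the same five inputs.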
SLICEM
Distributed (LUT) RAM
64-bit blocks in each SLICEM LUT
Single-port, dual-port, multi-port block RAM
Shift Register LUT (SRL)
32-bit shift register in each SLICEM LUT
Both dynamic and static outputs available
Cascadable up to 128 bits in one SLICEM
Very fast (sub-nanosecond)
Tightly coupled to logic
Ideal for coefficient storage, small buffers, or small state machines
[Figure: array of slices; every slice can implement logic, and some can also serve as RAM or shift register]
SLICEM Used as Distributed SelectRAM Memory
Uses the same storage that is used for the look-up table function
Synchronous write, asynchronous read
Can be converted to synchronous read using the flip-flops available in the slice
Various configurations
Single port: 32x2, 32x4, 32x6, 32x8, 64x1, 64x2, 64x3, 64x4, 128x1, 128x2, 256x1
One LUT6 = 64x1 or 32x2 RAM
Cascadable up to 256x1 RAM
Dual port (D): 32x2D, 32x4D, 64x1D, 64x2D, 128x1D
1 read/write port + 1 read-only port
Simple dual port (SDP): 32x6SDP, 64x3SDP
1 write-only port + 1 read-only port
Quad-port (Q): 32x2Q, 64x1Q
1 read/write port + 3 read-only ports
Each port has independent address inputs

Configuring LUTs as a Shift Register (SRL)
[Figure: a LUT configured as a chain of storage stages with D, CE, and CLK inputs; address A[4:0] selects the output tap, and Q31 is the cascade output]
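A behavioural model helps show how the address input turns a fixed 32-stage shift register into a programmable delay. This is a simplified sketch, not vendor code: the class name `SRL32` and its methods are hypothetical, and timing is reduced to one `clock()` call per rising edge.

```python
class SRL32:
    """Behavioural model of a 32-stage LUT shift register."""

    def __init__(self):
        self.bits = [0] * 32           # stage 0 holds the newest bit

    def clock(self, d, ce=1):
        """One rising clock edge; data shifts only when CE is high."""
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def q(self, addr):
        """Addressable output: the bit written addr + 1 clocks ago."""
        return self.bits[addr & 0x1F]  # A[4:0] selects the tap

    @property
    def q31(self):
        """Fixed last-stage output, used to cascade into the next SRL."""
        return self.bits[31]


srl = SRL32()
for bit in [1, 0, 1, 1]:
    srl.clock(bit)
print([srl.q(a) for a in range(4)])    # newest bit first
```

Changing `addr` at run time changes the delay without reloading the register contents.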
Block RAM and FIFO Block
Each block RAM can be used as
One 36-kb block RAM or FIFO
or
Two independent 18-kb block RAMs
or
One 18-kb FIFO and one 18-kb block RAM
[Figure: a 36-kb BRAM/FIFO, or an 18-kb BRAM paired with an 18-kb BRAM/FIFO]
DSP48E Slice Overview

25x18 Multiply
ALU Mode
Dedicated A Cascading
Pattern Detection
Multiply (35 x 25)
DSP48_0: inputs A[24:0] and 0,B[16:0]; OPMODE 0000101, ALUMODE 0000; P[16:0] = OUT[16:0]
DSP48_1: inputs A (cascaded through ACIN) and B[34:17], plus the DSP48_0 result shifted by 17 bits (SHIFT 17); OPMODE 0010101, ALUMODE 0000; P[42:0] = OUT[59:17]
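The arithmetic behind splitting a 35 x 25 multiply across two 25x18 multipliers can be verified in Python. This is a sketch of the decomposition only, assuming signed two's-complement operands; the function name `mult_35x25` is hypothetical, and register stages and OPMODE settings are not modelled.

```python
MASK17 = (1 << 17) - 1


def mult_35x25(a, b):
    """Signed 25-bit a times signed 35-bit b from two 25x18 partial products."""
    if not -(1 << 24) <= a < (1 << 24):
        raise ValueError("a must fit in 25 signed bits")
    if not -(1 << 34) <= b < (1 << 34):
        raise ValueError("b must fit in 35 signed bits")
    b_low = b & MASK17        # 0,B[16:0]: 17 unsigned bits, zero-padded to 18
    b_high = b >> 17          # B[34:17]: 18 signed bits (arithmetic shift)
    p0 = a * b_low            # first slice: lower partial product
    p1 = a * b_high + (p0 >> 17)   # second slice: upper product + shifted p0
    # OUT[16:0] comes from p0, OUT[59:17] comes from p1.
    return (p1 << 17) | (p0 & MASK17)


print(mult_35x25(-12345678, 9876543210) == -12345678 * 9876543210)
```

Zero-padding the low half of B keeps the first multiplier operand non-negative, so only the upper half carries the sign.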
Designers Eccentrics
Higher System Performance
More design margin to simplify designs
Higher integrated functionality
Lower System Cost

Reduce BOM
Implement design in a smaller device & lower speed-grade
Lower Power
Help meet power budgets
Eliminate heat sinks & fans
Prevent thermal runaway
Architecture Alignment
Spartan-6 FPGAs: 150K Logic Cell Device
Virtex-6 FPGAs: 760K Logic Cell Device
Common Resources
LUT-6 CLB
BlockRAM
DSP Slices
High-performance Clocking
FIFO Logic
Parallel I/O
Hardened Memory Controllers
Tri-mode EMAC
HSS Transceivers*
3.3 Volt compatible I/O
System Monitor
PCIe Interface
*Optimized for target application in each family
Enables IP Portability, Protects Design Investments

Virtex-6 and Spartan-6 FPGA Sub-Families
Virtex-6 LXT FPGA: High Logic Density, High-Speed Serial Connectivity
Virtex-6 SXT FPGA: High Logic Density, High-Speed Serial Connectivity, Enhanced DSP
Virtex-6 HXT FPGA: High Logic Density, Ultra High-Speed Serial Connectivity
Virtex-6 CXT FPGA: Up to 3.75 Gbps serial connectivity and corresponding logic performance
Spartan-6 LX FPGA: Lowest Cost Logic
Spartan-6 LXT FPGA: Lowest Cost Logic, Low-Cost Serial Connectivity
Resources compared: Logic, Block RAM, DSP, Parallel I/O, Serial I/O
Virtex-6 Logic Fabric
Virtex-6 Configurable Logic Block (CLB)
Each CLB contains two slices
Each slice contains four 6-input Lookup Tables (6LUT)
Slices implement logic functions (slice_l)
Slices for memories and shift registers (slice_m)
LUT6 implements
All functions of up to 6 variables
Two functions of up to 5 variables each
Shift registers up to 32 stages long
Memories of 64 bits
Multiple configurations within a slice
Performance Benefits
Increased ratio of slice_m memories available closer to the source or target logic
Power Consumption Benefits
Shift register mode greatly reduces power consumption over FF implementation
Cost Benefits
Can pack logic and memory functions more efficiently
[Figure: a CLB containing two slices of four LUTs each]
Higher DSP Performance

Most advanced DSP architecture
New optional pre-adder for symmetric filters
25x18 multiplier
High resolution filters
Efficient floating point support
ALU-like second stage enables mapping of advanced operations
Programmable op-code
SIMD support
Addition / Subtraction / Logic functions
Pattern detector
Lowest power consumption

Highest DSP slice capacity
Up to 2K DSP Slices
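The benefit of a pre-adder for symmetric filters comes from adding the two samples that share a coefficient before multiplying, which halves the multiplier count. The sketch below illustrates this for an even number of taps; the function names are hypothetical and the example coefficients are arbitrary.

```python
def fir_direct(window, coeffs):
    """One multiply per tap."""
    return sum(c * x for c, x in zip(coeffs, window))


def fir_preadd(window, half_coeffs):
    """Symmetric FIR with an even tap count: add mirrored samples first,
    then multiply once per coefficient pair."""
    n = len(window)
    return sum(c * (window[i] + window[n - 1 - i])
               for i, c in enumerate(half_coeffs))


coeffs = [1, 3, 5, 5, 3, 1]            # symmetric: c[i] == c[N-1-i]
window = [2, -1, 4, 0, 7, 3]           # current contents of the delay line
print(fir_direct(window, coeffs), fir_preadd(window, coeffs[:3]))
```

Both calls print the same value, but the second uses three multiplies instead of six.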
Spartan-6 CLB Logic Slices

SliceM (25%)
LUT6
8 Registers
Carry Logic
Wide Function Muxes
Distributed RAM / SRL logic
SliceL (25%)
LUT6
8 Registers
Carry Logic
Wide Function Muxes
SliceX (50%)
LUT6
Optimized for Logic
8 Registers
Slice mix chosen for the optimal balance of Cost, Power & Performance
Spartan-6 FPGA DSP48A1 Slice
18x18 signed multiplier
48-bit add/subtract/accumulate
Pipeline registers for high speed
Cascade paths for wide functions
Pre-adder
[Figure: DSP48A1 block diagram: dual B, D register with pre-adder feeding an 18 x 18 multiplier and a 48-bit add/subtract stage producing P; cascade ports BCIN, BCOUT, PCIN, PCOUT; carry ports CIN, CCOUT, CFOUT; multiplier output MFOUT; multiplexers controlled by OPMODE bits]
Power, Performance and Productivity Drive Market Trends
Lower Power
Legislation and Regulations
Flat panel/TV, Central Office, Server Farms, Portable Medical, Portable Consumer
Higher Performance
Wired Infrastructure, Wireless, Broadcast, 300G+ Networks, Aerospace and Defense, High Performance Computing
System Capacity and Performance
All Market Segments
Improved Productivity
Reduce Capital and Operating Expenses (CAPEX, OPEX)
#1 Customer Problem: Lower Power enables better Cost, Performance, and Capability
The Unified Architecture Advantage
Common elements enable easy IP reuse for quick design portability across all 7 series families
Design scalability from low-cost to high-performance
Expanded eco-system support
Quickest TTM
Logic Fabric: LUT-6 CLB
Precise, Low Jitter Clocking: MMCMs
On-Chip Memory: 36Kbit/18Kbit Block RAM
Enhanced Connectivity: PCIe Interface Blocks
DSP Engines: DSP48E1 Slices
Hi-perf. Parallel I/O Connectivity: SelectIO Technology
Hi-performance Serial I/O Connectivity: Transceiver Technology
Families: Artix-7 FPGA, Kintex-7 FPGA, Virtex-7 FPGA
The Xilinx 7 Series FPGAs
Industry's Lowest Power and First Unified Architecture
Three new device families with breakthrough innovations in power efficiency, performance-capacity and price-performance
Spanning Low-Cost to Ultra High-End applications
Virtex-7 Sub-Families
The Virtex-7 family has several sub-families
Virtex-7: General logic
Virtex-7XT: Rich DSP and block RAM
Virtex-7HT: Highest serial bandwidth
Virtex-7 FPGA: High Logic Density, High-Speed Serial Connectivity
Virtex-7 XT FPGA: High Logic Density, High-Speed Serial Connectivity, Enhanced DSP
Virtex-7 HT FPGA: High Logic Density, Ultra High-Speed Serial Connectivity
Resources compared: Logic, Block RAM, DSP, Parallel I/O, Serial I/O
Reason 1: FPGAs handle high computational workloads
Speed up FIR Filters by implementing with parallel architecture
Example 256 TAP Filter Implementation
Programmable DSP - Sequential: a single MAC unit needs 256 clock cycles per output; 1 GHz / 256 clock cycles = about 4 MSPS
FPGA - Fully Parallel Implementation: 256 operations in 1 clock cycle; 500 MHz / 1 clock cycle = 500 MSPS
[Figure: sequential structure with one multiplier, coefficient storage, and an accumulator register, next to a fully parallel structure with one multiplier per coefficient C0 to C255 and a summing stage]
Reason 2: FPGAs are ideal for multi-channel DSP Designs
Can implement multiple channels running in parallel or time multiplex channels into one filter
Many low sample rate channels can be multiplexed (e.g. TDM) and processed in the FPGA, at a high rate
Interpolation (using zeros) can also drive sample rates higher
[Figure: four parallel LPFs (ch1 to ch4), each fed 20 MHz samples, compared with one multi-channel LPF fed 80 MHz samples]
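Time multiplexing can be sketched as one shared set of multipliers serving several channels in rotation, with a separate delay line per channel. This is an illustrative model under simple assumptions (round-robin interleaving, identical coefficients for all channels); `tdm_filter` is a hypothetical name.

```python
def tdm_filter(interleaved, coeffs, channels):
    """Filter a round-robin interleaved stream with one shared filter.

    Each channel keeps its own delay line; the arithmetic is shared,
    so it must run at channels times the per-channel sample rate.
    """
    delays = [[0] * len(coeffs) for _ in range(channels)]
    out = []
    for n, x in enumerate(interleaved):
        ch = n % channels                        # whose turn it is
        delays[ch] = [x] + delays[ch][:-1]       # update that channel's state
        out.append(sum(c * d for c, d in zip(coeffs, delays[ch])))
    return out


# Four channels at 20 MHz each become one 80 MHz interleaved stream.
streams = [[1, 2, 3], [10, 20, 30], [0, 1, 0], [5, 5, 5]]
interleaved = [s[i] for i in range(3) for s in streams]
out = tdm_filter(interleaved, [1, 2], channels=4)
print([out[ch::4] for ch in range(4)])           # de-interleaved results
```

Each de-interleaved result matches what a dedicated filter for that channel would produce.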
Reason 3: Customize Architectures to Suit your Goals
FPGAs allow Cost/Performance tradeoffs
[Figure: parallel, semi-parallel, and serial filter structures placed on a scale from optimized for speed to optimized for cost]
Reason 4: Lower System Cost through Integration
Implement Interface Logic within FPGA to connect DSP functions to I/O and Memory Devices
[Figure: a discrete design spread over a network card and a DSP card (AFE, A/D and D/A converters, DDC and DUC devices, DSP processors, MACs, SDRAM, SSTL3 translators, hundreds of termination resistors, ASSP, quad TRx, control) compared with an integrated design in which FPGAs implement the MACs, DUCs, DDCs, and logic alongside A/D, D/A, SDRAM, control, and ASSP; interface labels include PL4, 3.125 Gbps, and CORBA]
The XtremeDSP Slice Advantage
Without XtremeDSP Slice, Parallel Adder Tree Consumes Logic Resources
Parallel Adder Tree Implementation
Consumes Logic to Implement Adders
Variable Latency
32 TAP filter implementation will consume 1,461 logic cells to implement adders in fabric
Fabric and Routing May Reduce Performance
[Figure: Data In passes through a register chain; each tap is multiplied by its coefficient C0 to C31 and the products are combined by a tree of adders to Data Out]
The XtremeDSP Slice Advantage
With XtremeDSP Slice, Parallel Adder Tree Consumes Zero Logic Resources
Parallel Adder Cascade Implementation
32 TAP filter implementation implemented entirely with XtremeDSP Slices
[Figure: Data In feeds a cascade of multiply-add stages with coefficients C0 to C31 and pipeline registers between stages, ending at Data Out]
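The two summing structures compute the same result and differ in how the additions are arranged. The sketch below compares them for 32 products; it is a structural illustration only, with hypothetical function names, and it does not model logic-cell counts or timing.

```python
def adder_tree(products):
    """Sum products pairwise; returns (total, number of adder levels)."""
    level, depth = list(products), 0
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:             # odd element passes to the next level
            pairs.append(level[-1])
        level, depth = pairs, depth + 1
    return level[0], depth


def adder_cascade(products):
    """Sum products in a chain; each stage adds its product to the
    running sum from the previous stage. Returns (total, stages)."""
    acc, stages = 0, 0
    for p in products:
        acc += p
        stages += 1
    return acc, stages


products = list(range(1, 33))          # placeholder products for 32 taps
print(adder_tree(products), adder_cascade(products))
```

The tree needs separate adders at each of its levels, whereas the cascade reuses the adder that sits next to each multiplier, which is why the slide shows it consuming no additional logic.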