
FPGAs for DSP

This material exempt per Department of Commerce license exception TSU

© 2011 Xilinx, Inc. All Rights Reserved

Objectives
After completing this module, you will be able to:

Describe why parallelism enables such high performance
Describe how FPGA architectures lend themselves to an optimum implementation of DSP functions


For Academic Use Only

Outline
Power of Parallelism
Virtex-5 FPGA Architecture
Latest Families
  Virtex-6/Spartan-6 Families
  Virtex-7 Families
Why should I use FPGAs for DSP?
The DSP48 Slice Advantage


Essence of a DSP Processor
Algorithm must be implemented within the constraints of the pre-defined fixed architecture
[Figure: DSP processor block diagram; the blocks and their annotations are listed below]
Program Counter and Control: cycles expended making decisions and controlling flow
I/O: cycles expended communicating with outside world or other processors
Program Memory: program must be stored in ROM, and many instructions do not directly contribute to processing
Instruction Decode
Registers
Memory: all values currently not in use must be retained
ALU: supports many operations, but only one or a few can be used at one time
ALU: contains a fixed set of operations, and multiple operations (cycles) are required to achieve the desired effect
Fixed bit width: the algorithm may not require all bits


Sequential Processing Limits System Performance
[Figure: channel density or sample rate (vertical axis) versus algorithmic complexity in number of coefficients (horizontal axis), for a single 300 MHz processor and for two 300 MHz processors]
Max Sample Rate = Fixed Processor Clock Rate / Number of operations per sample
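
The relationship on this slide (max sample rate equals the fixed processor clock rate divided by the number of operations per sample) can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, one operation per coefficient is assumed, and the `processors` parameter assumes the work divides perfectly across processors, which the slide does not state.

```python
def max_sample_rate_hz(clock_hz, ops_per_sample, processors=1):
    """Max sample rate = processor clock rate / operations per sample.

    `processors` > 1 assumes the work splits perfectly across processors,
    which is an idealization and not a claim from the slide.
    """
    if ops_per_sample <= 0 or processors <= 0:
        raise ValueError("ops_per_sample and processors must be positive")
    return processors * clock_hz / ops_per_sample


if __name__ == "__main__":
    # Assumes one operation (one MAC) per coefficient per sample.
    for coefficients in (8, 16, 32, 64, 104):
        one = max_sample_rate_hz(300e6, coefficients)
        two = max_sample_rate_hz(300e6, coefficients, processors=2)
        print(f"{coefficients:4d} coefficients: {one / 1e6:7.2f} MSPS (one), "
              f"{two / 1e6:7.2f} MSPS (two)")
```

The achievable sample rate falls in inverse proportion to the number of coefficients, and adding a second processor at best doubles it.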

Multiply Accumulate
Single Engine
Sequential processing limits data throughput
  Time-shared MAC unit
  High clock frequency creates a difficult system challenge
256 Tap FIR Filter
  256 multiply and accumulate (MAC) operations per data sample
  One output every 256 clock cycles
[Figure: Data In to Reg to a single MAC unit to Data Out; the algorithm loops 256 times]
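
As a rough behavioral sketch of the single-engine case, the Python below time-shares one MAC operation per simulated clock cycle and counts the cycles spent. The function name and the cycle accounting are assumptions made for illustration; a real DSP processor also spends cycles on control and data movement, which are ignored here.

```python
def fir_sequential(samples, coeffs):
    """FIR filter with one time-shared MAC unit.

    Returns (outputs, cycles), counting one clock cycle per MAC operation.
    """
    taps = len(coeffs)
    delay = [0] * taps              # delay line holding the last `taps` samples
    outputs = []
    cycles = 0
    for x in samples:
        delay = [x] + delay[:-1]    # shift the new sample in
        acc = 0
        for k in range(taps):       # loop the algorithm `taps` times
            acc += coeffs[k] * delay[k]
            cycles += 1             # one MAC per cycle
        outputs.append(acc)
    return outputs, cycles


if __name__ == "__main__":
    outs, cycles = fir_sequential([1, 2, 3], [1] * 256)
    print(outs, cycles // len(outs), "cycles per output")
```

With 256 coefficients, each output costs 256 cycles, matching the slide's "one output every 256 clock cycles".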


Multiply Accumulate
Multiple Engines
Parallel processing maximizes data throughput
  Support any level of parallelism
  Optimal performance/cost tradeoff
Flexible architecture
  Distributed DSP resources (LUT, registers, multipliers, & memory)
256 Tap FIR Filter
  256 multiply and accumulate (MAC) operations per data sample
  One output every clock cycle
[Figure: Data In feeds a chain of registers Reg0 to Reg255; each register is multiplied by its coefficient C0 to C255 and the products are summed to Data Out; all 256 MAC operations in one clock cycle]
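
A minimal Python sketch of the fully parallel structure follows. It treats every multiplier as working in the same simulated clock cycle, so one output is produced per cycle. The function name is hypothetical, and pipeline latency through the multipliers and adders is deliberately ignored.

```python
def fir_parallel(samples, coeffs):
    """FIR filter with one multiplier per tap.

    Returns (outputs, cycles), counting one clock cycle per input sample.
    """
    regs = [0] * len(coeffs)        # Reg0 .. Reg(N-1)
    outputs = []
    cycles = 0
    for x in samples:
        regs = [x] + regs[:-1]      # every register shifts on the same clock edge
        products = [c * r for c, r in zip(coeffs, regs)]  # all taps at once
        outputs.append(sum(products))
        cycles += 1                 # one output every clock cycle
    return outputs, cycles


if __name__ == "__main__":
    print(fir_parallel([1, 0, 0, 0], [1, 2, 3]))
```

Feeding an impulse returns the coefficients themselves, which is a quick sanity check that the register chain and taps line up.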

Overview
All Xilinx FPGAs contain the same basic resources
Slices grouped into Configurable Logic Blocks (CLBs)
Contain combinatorial logic and register resources

IOBs
Interface between the FPGA and the outside world

Programmable interconnect
Other resources

Memory
Multipliers
Global clock buffers
Boundary scan logic


Virtex-5 FPGA Platform


Feature Overview

CLB
BRAM
I/O
CMT
BUFGMUX
DSP48E
BUFIO & BUFR


Virtex-5 Family
The Ultimate System Integration Platform
[Figure: Virtex-5 platforms (Logic, Logic/Serial, DSP/Serial, Emb./Serial, High SIO B/W) compared by resource mix: Logic, RAM, DSP, Parallel I/Os, Serial I/Os, PPC processor]
Built on the success of ASMBL



Virtex-5 FPGA CLB
Two slices per CLB
  Four LUTs per slice
Based on true 6-input LUT technology
  No input-sharing restrictions
  64-bit distributed memory/SRL32
Increased performance
  One to two speed grade average logic performance increase over prior generation
  Reduction of combinatorial logic between flip-flops
Increased utilization
[Figure: CLB containing Slice 0 and Slice 1]


Virtex-5 FPGA Slice
O5 output supports 5-input functions
F7MUX and F8MUX for wide functions
Carry chain
  4-bit lookahead for speed
  Free 0/1 in carry
  Fast counter and accumulator


SLICEL and SLICEM
Approximately* one quarter of the slices are SLICEM
LUT resources can be used for
  Distributed memory (RAM or ROM): 64 memory bits per LUT
  SRL: 32 SRL bits per LUT
*The CLB column before and after each DSP48E column supports SLICEM functions
[Figure: array of slices, mostly SLICEL with SLICEM interspersed]


6-Input LUT with Dual Output
6-input LUT can be two 5-input LUTs with common inputs
  Minimal speed impact to a 6-input LUT
  Full input swappability
One or two outputs
Any function of six variables or two independent functions of five variables
  Reduces average slice count by 10 percent
[Figure: 6-LUT with inputs A6 to A1, built from two 5-LUTs that share inputs A5 to A1, with outputs O6 and O5]
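
The dual-output behavior can be illustrated with a small Python model. It assumes the LUT contents are a 64-bit value (called `init` here), that O6 is indexed by all six inputs, and that O5 is indexed by A5 to A1 into the lower 32 bits. That matches my understanding of the dual-output LUT primitive, but treat the bit ordering and both function names as assumptions of this sketch.

```python
def lut6_2(init, a):
    """Model of a 6-input LUT with dual outputs.

    init: 64-bit LUT contents.
    a:    6-bit address, bit 0 = A1 ... bit 5 = A6.
    Returns (o6, o5).
    """
    o6 = (init >> (a & 0x3F)) & 1   # full 64-entry lookup
    o5 = (init >> (a & 0x1F)) & 1   # lower 32 entries only
    return o6, o5


def pack_two_5input(f_o6, f_o5):
    """Build LUT contents holding two 5-input functions with common inputs.

    f_o5 fills the lower half and f_o6 the upper half; A6 must then be
    tied high so that O6 reads the upper half.
    """
    init = 0
    for addr in range(32):
        init |= (f_o5(addr) & 1) << addr
        init |= (f_o6(addr) & 1) << (addr + 32)
    return init


if __name__ == "__main__":
    and5 = lambda x: int(x == 0b11111)           # 5-input AND
    parity5 = lambda x: bin(x).count("1") & 1    # 5-input XOR
    init = pack_two_5input(and5, parity5)
    print(hex(init), lut6_2(init, 0x20 | 0b11111))
```

Tying A6 high is what lets the same five inputs drive two independent functions, one on O6 and one on O5.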


SLICEM
Distributed (LUT) RAM
  64-bit blocks in each SLICEM LUT
  Single-port, dual-port, multi-port block RAM
Shift Register LUT (SRL)
  32-bit shift register in each SLICEM LUT
  Both dynamic and static outputs available
  Cascadable up to 128 bits in one SLICEM
Very fast (sub-nanosecond)
Tightly coupled to logic
Ideal for coefficient storage, small buffers, or small state machines
[Figure: array of slices, each usable as Logic; SLICEM locations usable as Logic, RAM, or Shift Register, alongside block RAM columns]


SLICEM Used as Distributed SelectRAM Memory
Uses the same storage that is used for the look-up table function
Synchronous write, asynchronous read
  Can be converted to synchronous read using the flip-flops available in the slice
Various configurations
  Single port
    One LUT6 = 64x1 or 32x2 RAM
    Cascadable up to 256x1 RAM
  Dual port (D)
    1 read / write port + 1 read-only port
  Simple dual port (SDP)
    1 write-only port + 1 read-only port
  Quad-port (Q)
    1 read / write port + 3 read-only ports
Each port has independent address inputs

Supported configurations
  Single Port: 32x2, 32x4, 32x6, 32x8, 64x1, 64x2, 64x3, 64x4, 128x1, 128x2, 256x1
  Dual Port: 32x2D, 32x4D, 64x1D, 64x2D, 128x1D
  Simple Dual Port: 32x6SDP, 64x3SDP
  Quad Port: 32x2Q, 64x1Q



Configuring LUTs as a Shift Register (SRL)
[Figure: a LUT configured as a chain of registers with D, CE, and CLK inputs; A[4:0] selects the output tap, and Q31 is the cascade output]
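
A behavioral sketch of a 32-bit shift register LUT is shown below in Python. The class and method names are hypothetical; the model only captures the signals named in the figure (D, CE, the clock, the A[4:0] tap address, and the Q31 cascade output) and says nothing about timing.

```python
class SRL32:
    """Behavioral model of a LUT configured as a 32-bit shift register."""

    def __init__(self):
        self.bits = [0] * 32

    def clock(self, d, ce=1):
        """One rising clock edge: shift D in when clock enable is high."""
        if ce:
            self.bits = [d & 1] + self.bits[:-1]

    def q(self, addr):
        """Dynamic output: the tap selected by the 5-bit address A[4:0]."""
        return self.bits[addr & 0x1F]

    @property
    def q31(self):
        """Static output of the last stage, used to cascade into the next SRL."""
        return self.bits[31]


if __name__ == "__main__":
    srl = SRL32()
    srl.clock(1)
    for _ in range(5):
        srl.clock(0)
    print(srl.q(5), srl.q31)
```

Changing the address changes the delay seen at the dynamic output without disturbing the stored data, which is why a single LUT can serve as a programmable delay line.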


Block RAM and FIFO Block
Each block RAM can be used as
  One 36-kb block RAM or FIFO
  or two independent 18-kb block RAMs
  or one 18-kb FIFO and one 18-kb block RAM
[Figure: one 36-kb BRAM / FIFO block, or an 18-kb BRAM paired with an 18-kb BRAM / FIFO]


DSP48E Slice Overview
[Figure: DSP48E slice block diagram with the following features highlighted]
25x18 Multiply
ALU Mode
Dedicated A Cascading
Pattern Detection


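Multiply (35 X 25)
[Figure: 35x25 multiply built from two cascaded DSP48 slices]
DSP48_0: OPMODE 0000101, ALUMODE 0000; inputs A[24:0] (25 bits) and 0,B[16:0] (18 bits); output P[16:0] = OUT[16:0]
DSP48_1: OPMODE 0010101, ALUMODE 0000; inputs A (via ACIN) and B[34:17] (18 bits); the P output of DSP48_0 enters through SHIFT 17; output P[42:0] = OUT[59:17]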
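
The arithmetic behind this two-slice arrangement can be checked with a short Python sketch. It assumes signed two's-complement operands (25-bit A, 35-bit B), splits B into an unsigned lower 17 bits and a signed upper 18 bits as the figure labels suggest, and models the 17-bit shift as an arithmetic right shift. The function name is hypothetical, and OPMODE/ALUMODE settings are not modeled.

```python
MASK17 = (1 << 17) - 1


def mult_35x25(a, b):
    """Multiply signed 25-bit `a` by signed 35-bit `b` using two 25x18 products."""
    if not -(1 << 24) <= a < (1 << 24):
        raise ValueError("a must fit in 25 signed bits")
    if not -(1 << 34) <= b < (1 << 34):
        raise ValueError("b must fit in 35 signed bits")
    b_low = b & MASK17          # 0,B[16:0]: zero-padded, always non-negative
    b_high = b >> 17            # B[34:17]: signed upper 18 bits
    p0 = a * b_low              # first slice
    p1 = a * b_high + (p0 >> 17)  # second slice adds the shifted first product
    # OUT[16:0] comes from the first slice, OUT[59:17] from the second.
    return (p1 << 17) | (p0 & MASK17)


if __name__ == "__main__":
    print(mult_35x25(-12345, 9876543210) == -12345 * 9876543210)
```

The low 17 bits of the first product pass straight to the result, while its upper bits are carried into the second slice, so the two partial products never overlap.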


Designers Eccentrics
Higher System Performance
  More design margin to simplify designs
  Higher integrated functionality
Lower System Cost
  Reduce BOM
  Implement design in a smaller device & lower speed-grade
Lower Power
  Help meet power budgets
  Eliminate heat sinks & fans
  Prevent thermal runaway

Architecture Alignment
[Figure: Spartan-6 FPGAs (150K logic cell device) and Virtex-6 FPGAs (760K logic cell device) drawn with a shared set of common resources]
Common Resources
  LUT-6 CLB
  BlockRAM
  DSP Slices
  High-performance Clocking
Other resources shown in the figure
  FIFO Logic
  Parallel I/O
  Hardened Memory Controllers
  Tri-mode EMAC
  HSS Transceivers*
  3.3 Volt compatible I/O
  System Monitor
  PCIe Interface
*Optimized for target application in each family
Enables IP Portability, Protects Design Investments



Virtex-6 and Spartan-6 FPGA Sub-Families
Virtex-6 CXT FPGA: Up to 3.75 Gbps serial connectivity and corresponding logic performance
Virtex-6 LXT FPGA: High Logic Density, High-Speed Serial Connectivity
Virtex-6 SXT FPGA: High Logic Density, High-Speed Serial Connectivity, Enhanced DSP
Virtex-6 HXT FPGA: High Logic Density, Ultra High-Speed Serial Connectivity
Spartan-6 LX FPGA: Lowest Cost Logic
Spartan-6 LXT FPGA: Lowest Cost Logic, Low-Cost Serial Connectivity
[Figure: relative mix of Logic, Block RAM, DSP, Parallel I/O, and Serial I/O in each sub-family]


Virtex-6 Logic Fabric
Virtex-6 Configurable Logic Block (CLB)
  Each CLB contains two slices
  Each slice contains four 6-input Lookup Tables (6LUT)
  Slices implement logic functions (slice_l)
  Slices for memories and shift registers (slice_m)
LUT6 implements
  All functions of up to 6 variables
  Two functions of up to 5 variables each
  Shift registers up to 32 stages long
  Memories of 64 bits
  Multiple configurations within a slice
Performance Benefits
  Increased ratio of slice_m memories available closer to the source or target logic
Power Consumption Benefits
  Shift register mode greatly reduces power consumption over FF implementation
Cost Benefits
  Can pack logic and memory functions more efficiently
[Figure: CLB containing two slices, each with four LUTs]


Higher DSP Performance
Most advanced DSP architecture
  New optional pre-adder for symmetric filters
  25x18 multiplier
    High resolution filters
    Efficient floating point support
  ALU-like second stage enables mapping of advanced operations
    Programmable op-code
    SIMD support
    Addition / Subtraction / Logic functions
  Pattern detector
Lowest power consumption
Highest DSP slice capacity
  Up to 2K DSP Slices
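
To show why a pre-adder helps with symmetric filters, the Python sketch below adds the two samples that share a coefficient before multiplying, so an N-tap symmetric filter needs N/2 multiplies per output. The function name is hypothetical and, for simplicity, an even number of taps is assumed.

```python
def fir_symmetric_preadd(window, half_coeffs):
    """One output of an even-length symmetric FIR using a pre-adder.

    window:      the N most recent samples.
    half_coeffs: the first N/2 coefficients; the second half mirrors them.
    Returns (output, multiplies_used).
    """
    n = len(window)
    if n != 2 * len(half_coeffs):
        raise ValueError("window must be twice as long as half_coeffs")
    acc = 0
    multiplies = 0
    for k, c in enumerate(half_coeffs):
        pre_sum = window[k] + window[n - 1 - k]   # pre-adder
        acc += c * pre_sum                        # one multiply serves two taps
        multiplies += 1
    return acc, multiplies


if __name__ == "__main__":
    # Equivalent to coefficients [5, 6, 6, 5] applied to [1, 2, 3, 4].
    print(fir_symmetric_preadd([1, 2, 3, 4], [5, 6]))
```

The result equals the direct four-multiply computation, but uses only two multipliers.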

Spartan-6 CLB Logic Slices


SliceM (25%)

 LUT6
 8 Registers
 Carry Logic
 Wide Function Muxes
 Distributed RAM / SRL logic

SliceL (25%)

 LUT6
 8 Registers
 Carry Logic
 Wide Function Muxes

SliceX (50%)

 LUT6
 Optimized for Logic
 8 Registers

Slice mix chosen for the optimal balance of Cost, Power & Performance

Spartan-6 FPGA DSP48A1 Slice
 18x18 signed multiplier
 48-bit add/subtract/accumulate
 Pipeline registers for high speed
 Cascade paths for wide functions
 Pre-adder
[Figure: DSP48A1 slice block diagram showing the dual B and D registers with pre-adder, the 18 X 18 multiplier, and the 48-bit +/- stage; labeled signals include A0, A1, D:A:B, BCIN, BCOUT, PCIN, PCOUT, CIN, CCOUT, CFOUT, MFOUT, Z, and P, with control from OPMODE[3:0], OPMODE[5], OPMODE[6,4], and OPMODE[7]]


Power, Performance and Productivity Drive Market Trends
Lower Power
  Legislation and Regulations
  Flat panel/TV, Central Office, Server Farms, Portable Medical, Portable Consumer
Higher Performance
  Wired Infrastructure, Wireless, Broadcast, 300G+ Networks, Aerospace and Defense, High Performance Computing
System Capacity and Performance
  All Market Segments
Improved Productivity
  Reduce Capital and Operating Expenses (OPEX, CAPEX)
#1 Customer Problem: Lower Power enables better Cost, Performance, and Capability

The Unified Architecture Advantage
Common elements enable easy IP reuse for quick design portability across all 7 series families
  Design scalability from low-cost to high-performance
  Expanded eco-system support
  Quickest TTM
Common elements
  Logic Fabric: LUT-6 CLB
  Precise, Low Jitter Clocking: MMCMs
  On-Chip Memory: 36Kbit/18Kbit Block RAM
  Enhanced Connectivity: PCIe Interface Blocks
  DSP Engines: DSP48E1 Slices
  Hi-perf. Parallel I/O Connectivity: SelectIO Technology
  Hi-performance Serial I/O Connectivity: Transceiver Technology
[Figure: the common elements shared by Artix-7, Kintex-7, and Virtex-7 FPGAs]

The Xilinx 7 Series FPGAs
Industry's First Unified Architecture
Industry's Lowest Power and First Unified Architecture
Three new device families with breakthrough innovations in power efficiency, performance-capacity and price-performance
Spanning Low-Cost to Ultra High-End applications


Virtex-7 Sub-Families
The Virtex-7 family has several sub-families
  Virtex-7: General logic
  Virtex-7XT: Rich DSP and block RAM
  Virtex-7HT: Highest serial bandwidth
Virtex-7 FPGA: High Logic Density, High-Speed Serial Connectivity
Virtex-7 XT FPGA: High Logic Density, High-Speed Serial Connectivity, Enhanced DSP
Virtex-7 HT FPGA: High Logic Density, Ultra High-Speed Serial Connectivity
[Figure: relative mix of Logic, Block RAM, DSP, Parallel I/O, and Serial I/O in each sub-family]


Reason 1: FPGAs handle high computational workloads
Speed up FIR Filters by implementing with parallel architecture
Example 256 TAP Filter Implementation
Programmable DSP - Sequential
  One MAC unit applies coefficients C0 to C255 in turn; 256 clock cycles needed
  1 GHz / 256 clock cycles = about 4 MSPS (3.9 MSPS)
FPGA - Fully Parallel Implementation
  256 operations in 1 clock cycle
  500 MHz / 1 clock cycle = 500 MSPS
[Figure: sequential MAC unit with a coefficient store versus a fully parallel register chain with one multiplier per coefficient, each from Data In to Data Out]



Reason 2: FPGAs are ideal for multi-channel DSP Designs
Can implement multiple channels running in parallel or time multiplex channels into one filter
[Figure: four LPFs (ch1 to ch4), each fed with 20 MHz samples, replaced by one multi-channel LPF fed with 80 MHz samples]
Many low sample rate channels can be multiplexed (e.g. TDM) and processed in the FPGA, at a high rate
Interpolation (using zeros) can also drive sample rates higher
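
One common way to time multiplex several channels into one filter is to interleave their samples and lengthen each delay element by the number of channels, so every tap always sees samples from the same channel. The Python sketch below illustrates that idea; the function names are hypothetical and the slide does not specify which multiplexing scheme is used.

```python
def fir_tdm(interleaved, coeffs, channels):
    """Filter an interleaved (TDM) stream with one shared set of coefficients.

    Taps are spaced `channels` samples apart so channels never mix.
    """
    delay = [0] * (len(coeffs) * channels)
    out = []
    for x in interleaved:
        delay = [x] + delay[:-1]
        out.append(sum(c * delay[k * channels] for k, c in enumerate(coeffs)))
    return out


def interleave(streams):
    """Merge equal-length channel streams: ch1[0], ch2[0], ..., ch1[1], ..."""
    return [s for group in zip(*streams) for s in group]


if __name__ == "__main__":
    chans = [[1, 2, 3], [10, 20, 30], [0, 0, 1], [5, 5, 5]]
    mixed = fir_tdm(interleave(chans), [1, 1], channels=4)
    # De-interleave: every 4th output belongs to the same channel.
    for ch in range(4):
        print(f"ch{ch + 1}:", mixed[ch::4])
```

Four channels at 20 MHz each would present the shared filter with an 80 MHz interleaved stream, as in the figure.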

Reason 3: Customize Architectures to Suit your Goals
FPGAs allow Cost/Performance tradeoffs
[Figure: Parallel, Semi-Parallel, and Serial filter structures, ranging from optimized for speed to optimized for cost]
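
The tradeoff between the parallel, semi-parallel, and serial structures can be estimated with a simple model: with M MAC units sharing N taps, each output takes ceil(N / M) clock cycles. The Python sketch below applies that model; the function name is hypothetical, and real designs would also account for pipeline latency and control overhead.

```python
import math


def semi_parallel(taps, clock_hz, mac_units):
    """Return (cycles_per_output, sample_rate_hz) for a FIR with shared MACs."""
    if taps <= 0 or mac_units <= 0:
        raise ValueError("taps and mac_units must be positive")
    cycles = math.ceil(taps / mac_units)
    return cycles, clock_hz / cycles


if __name__ == "__main__":
    for macs in (1, 4, 16, 64, 256):    # serial ... fully parallel
        cycles, rate = semi_parallel(256, 500e6, macs)
        print(f"{macs:3d} MAC units: {cycles:3d} cycles/output, "
              f"{rate / 1e6:8.2f} MSPS")
```

Each step toward the parallel end buys sample rate at the cost of more multipliers, which is the speed versus cost axis in the figure.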


Reason 4: Lower System Cost through Integration
Implement Interface Logic within FPGA to connect DSP functions to I/O and Memory Devices
[Figure: a multi-card system (AFE with A/D and D/A converters, separate DDC and DUC devices, DSP processor card with MACs and control, SDRAM, SSTL3 translators, hundreds of termination resistors, ASSP, quad TRx network card, PL4 and 3.125 Gbps links, CORBA) consolidated into FPGAs that implement the MACs, DUCs, DDCs, and logic]


The XtremeDSP Slice Advantage
Without XtremeDSP Slice, Parallel Adder Tree Consumes Logic Resources
Parallel Adder Tree Implementation
  Consumes Logic to Implement Adders
  Variable Latency
  Fabric and Routing May Reduce Performance
32 TAP filter implementation will consume 1,461 logic cells to implement adders in fabric
[Figure: Data In through a register chain, multipliers with coefficients C0 to C31, and an adder tree to Data Out]


The XtremeDSP Slice Advantage
With XtremeDSP Slice, Parallel Adder Tree Consumes Zero Logic Resources
Parallel Adder Cascade Implementation
[Figure: Data In through registered stages with coefficients C0 to C31; each stage adds its product to the running sum from the previous stage, ending at Data Out]
32 TAP filter implemented entirely with XtremeDSP Slices
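
The structural difference between an adder tree and an adder cascade can be sketched in Python. The tree pairs products level by level, needing N - 1 separate adders, while the cascade lets each stage add its own product to the partial sum arriving from the previous stage. All names here are hypothetical, and the sketch does not attempt to reproduce the logic cell count quoted on the earlier slide.

```python
def adder_tree_stats(taps):
    """Return (adders, levels) for a balanced adder tree over `taps` products."""
    return taps - 1, (taps - 1).bit_length()


def tree_sum(products):
    """Sum by pairing neighbours level by level, as a fabric adder tree would."""
    level = list(products)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)          # pad odd levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def cascade_sum(products):
    """Sum by chaining: each stage adds its product to the incoming partial sum."""
    partial = 0
    for m in products:
        partial = partial + m        # adder inside the DSP stage
    return partial


if __name__ == "__main__":
    products = list(range(32))
    print(adder_tree_stats(32), tree_sum(products), cascade_sum(products))
```

Both structures compute the same sum; the cascade simply moves every adder into the stage that already holds the multiplier.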
