Beruflich Dokumente
Kultur Dokumente
ADC
y 1.2 GSPS
I/Q
FFT
FIFO
Corrx,y [ m ] x [ n ]y [ n m ]
n
Conjugate
8K FFT bottleneck
Real-time
Complex
0.6 GSPS input (16-bits)
1.2 GSPS output (12-bits)
- 1k
I/Q
FFT
FIFO
FIFO
Evaluation Scorecard
Length of FFT
IO pins
Butterflies
Multipliers
Adder/subtractors
Shift registers
Size
16
8192
Pins
Fly
Mult
Add
Shift
Outline
Introduction
Parallel architecture
Data flow graph
Effects of serial input
Systolic architecture
Performance summary
Conclusions
16
8192
Pins
448
229K
Fly
32
53K
Mult
Add
Shift
10
10
10
10
11
11
11
11
12
12
12
12
13
13
13
13
14
14
14
14
15
15
15
15
16
16
16
16
Parallel FFT
Butterfly
structure
Removes
redundant
calculation
Complex Butterfly
Butterfly contains
1 complex addition
1 complex subtraction
1 complex, constant multiply
WNr
Size
16
8192
Pins
448
229K
Fly
32
53K
Mult
Add
Shift
Complex Addition
a
c
b
d
real
imag
Size
16
8192
Pins
448
229K
Fly
32
53K
Add
128
213K
Shift
Mult
Complex Multiply
real
imag
Size
16
8192
Pins
448
229K
Fly
32
53K
Mult
128
213K
Add
192
320K
Shift
16
8192
Pins
448
229K
Fly
32
53K
Mult
96
159K
75%
Add
288
480K
150%
Shift
c
d
Size
real
imag
Parallel-Pipelined Architecture
Size
16
8192
Pins
448
229K
Fly
32
53K
Mult
96
159K
Add
288
480K
Shift
10
10
10
10
11
11
11
11
12
12
12
12
13
13
13
13
14
14
14
14
15
15
15
15
16
16
16
16
A pipelined version
IO Bound
100% Efficient
Serial Input
Size
16
8192
.01%
Pins
28
28
Fly
32
53K
Mult
96
159K
Add
288
480K
Shift
10
10
10
10
11
11
11
11
12
12
12
12
13
13
13
13
14
14
14
14
15
15
15
15
16
16
16
16
A serial version
IO-rate matches
A/D
6.25% Efficient
Outline
Introduction
Parallel architecture
Systolic architecture
Serial implementation
Application specific optimizations
Performance summary
Conclusions
Serial Architecture
Stage 1
Stage 2
Stage 3
50% Efficiency
Size
16
8192
Pins
28
28
Fly
13
.03%
Mult
12
39
.03%
Add
36
117
.03%
Shift
22
12K
Stage 4
Size
16
8192
Pins
28
28
Fly
13
Mult
12
39
Add
36
117
Shift
22
12K
Stage 1
Stage 2
Stage 3
Stage 4
8192-Point Architecture
Requires 13 stages
Fixed point arithmetic
Varies the dynamic range to increase
accuracy
Overflow replaced with saturated value
Size
16
8192
Pins
28
28
Fly
13
Mult
12
39
Add
36
117
Shift
22
12K
10 11 12 13
4 int
4 int
5 int
6 int
7 int
8 int
9 int
10 int
4 frac
14 frac
13 frac
12 frac
11 frac
10 frac
9 frac
8 frac
0110.0101
6 165
Increase Parallelism
Size
16
8192
Pins
112
112
400%
Fly
16
52
400%
Mult
48
156
400%
Add
144
468
400%
Shift
16
12K
100%
1 2 3 4 5 6 7 8 9 10 11
12
13
1 2 3 4 5 6 7 8 9 10 11
12
13
1 2 3 4 5 6 7 8 9 10 11
12
13
1 2 3 4 5 6 7 8 9 10 11
12
13
Simplification
Size
16
8192
Pins
160
160
143%
Fly
16
52
Mult
36
144
92%
Add
108
432
92%
Shift
8K
67%
1 2 3 4 5 6 7 8 9 10 11
12
13
1 2 3 4 5 6 7 8 9 10 11
12
13
1 2 3 4 5 6 7 8 9 10 11
12
13
1 2 3 4 5 6 7 8 9 10 11
12
13
Outline
Introduction
Parallel architecture
Systolic architecture
Performance summary
Power, operations per second
FPGA resources, frequency
Latency, throughput
Conclusions
Results
Power: 22 Watts @ 65 C
GOPS: 86 total @ 3.9 GOPS/Watt
FPGA resources (XC2V8000)
Outline
Introduction
Parallel architecture
Systolic architecture
Performance summary
Conclusions
Applicability to other platforms
Future work
Conclusions
General architecture
Extendable to a generic FPGA core
Retargetable to ASIC technology
Future work
Develop a parameterizable IP core generator
Sources
Fredrik Edman
Preston Jackson, Cy Chan, Charles Rader,
Jonathan Scalera, and Michael Vai
HPEC 2004
29 September 2004