
2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip

1024-Point Pipeline FFT Processor with Pointer FIFOs based on FPGA


Guanwen Zhong, Hongbin Zheng, ZhenHua Jin, Dihu Chen and Zhiyong Pang
School of Physics and Engineering, Sun Yat-sen University, Guangzhou 510275, P.R. China
Email: stspzy@mail.sysu.edu.cn

Abstract: The design and optimized implementation of 16-bit and 32-bit 1024-point pipeline FFT processors are presented in this paper. The FFT architecture is based on the R2²SDF algorithm with a new pointer FIFO embedded with Gray code counters. It is implemented on Spartan-3E, Spartan-6 and Virtex-4 devices and fully tested by co-simulation, using SMIMS VeriLink as a bridge that connects software (Matlab Simulink) to real hardware (FPGA targets). The implementation results show that our pointer FIFO FFT processor uses fewer resources while achieving higher performance. Our 16-bit 1024-point FFT processor costs only 2580 slices, 2030 slice flip-flops and just 2 block RAMs, and achieves a maximum clock frequency of 92.6 MHz with a throughput per area of 0.035 Msamples/s/slice. Because the input wordlength, output wordlength, twiddle factor wordlength and number of processing stages are parameterized, a 16-point, 64-point, 256-point, 1024-point, 4096-point or larger power-of-4 pointer FIFO FFT processor can easily be synthesized from the same code just by modifying the corresponding parameters.

Index Terms: FFT; Pointer FIFO; Radix-2² SDF; Co-simulation.

Fig. 1. The architecture of the FFT processor: (a) 256-point R2²SDF architecture; (b) the architecture of the 1024-point FFT processor based on R2²SDF.

978-1-4577-0170-2/11/$26.00 ©2011 IEEE

I. INTRODUCTION

The Fast Fourier Transform (FFT) is a very efficient algorithm for computing the Discrete Fourier Transform (DFT) and is used in a wide range of applications in modern digital processing and communication systems, such as Orthogonal Frequency Division Multiplexing (OFDM), radar technology, spectrum analysis and high-speed image processing. The pipeline FFT is a special class of FFT algorithms that computes the FFT in a sequential manner; its architectures have been studied since the 1970s. Of the many ways to implement pipeline FFT hardware, the most commonly used fall into three kinds of pipelined architecture: multiple delay commutator (MDC), single delay commutator (SDC) and single delay feedback (SDF). In this paper we focus on the Radix-2² SDF architecture and use a new method to implement its FIFOs: a pointer FIFO embedded with Gray code counters.

The rest of the paper is organized as follows. Section II discusses the common pipeline FFT architectures and focuses on the Radix-2² SDF architecture, especially the new pointer FIFO implementation and some architecture-specific optimizations. Section III describes the verification by co-simulation with Matlab Simulink, SMIMS VeriLink and FPGA boards, the implementation results, and the comparison with other Radix-2² SDF FFT processors. The conclusions and future work are given in Section IV.

II. PIPELINE FFT ARCHITECTURE

A. The Selection of FFT Algorithms

The common algorithms used to implement FFT processors include Radix-2, Radix-4, Radix-2² and Split-Radix (SRMDC). The detailed derivation of the algorithms can be found in [1] and [2], which also list the major characteristics and resource requirements of the pipeline FFT architectures. As shown there, the Radix-2² Single Delay Feedback (R2²SDF) architecture provides the highest computational efficiency while its hardware architecture remains simple to implement, so it was selected as the basic architecture of our FFT processor.

B. R2²SDF Architecture

The R2²SDF architecture was proposed by He and Torkelson [1]. Fig. 1(a) shows a 256-point R2²SDF FFT processor, which has four levels. Similarly, a 1024-point R2²SDF FFT has five levels, as shown in Fig. 1(b). Fig. 2 specifies each level. One level includes two butterfly processors, two FIFOs of different sizes, a -j unit, a Twiddle Factor unit and a controller that determines when to enable the corresponding -j and Twiddle Factor modules. This paper focuses on optimizing the FIFOs of the R2²SDF algorithm by using pointer FIFOs embedded with Gray code counters. The other parts of the implementation of the R2²SDF algorithm are described in detail in [1], [2], [3] and [4].

Fig. 2. The architecture of the R2²SDF butterfly processor.

C. Pointer FIFO Implementation

The buffers that store the data coming out of the butterfly processors are usually implemented either with two RAMs, known as ping-pong buffers, or with a single RAM, known as a shift register. The ping-pong buffer has two RAMs that are read and written simultaneously, while the shift register uses only one RAM but costs many slice flip-flops. The pointer FIFO achieves the same function with fewer resources than either. It consists of one RAM, named the FIFO memory, which is the same size as in the other two schemes, a FIFO write pointer and full signal generator (FIFO wptr&full), a FIFO read pointer signal generator (FIFO rptr) and an address controller. Fig. 3 shows the architecture of the pointer FIFO.

Fig. 3. The architecture of the pointer FIFO.

1) FIFO Memory: The FIFO memory consists of one dual-port RAM. The dual-port RAM stores the outputs of the butterfly processors, and the FFT processor handles the data according to the signals given by the controller.

2) Controller: The controller traces the data flow. For clarity, take a 64-point FFT as an example. During the first 32 cycles, the butterfly processor of stage 1 does not operate on the data and puts them directly into the FIFO memory of the same stage. When the 32nd data item enters the memory, the delay signal of the controller is enabled. During the next 32 cycles, the controller enables the rinc signal, which lets the FIFO rptr signal generator count the read address and feed the rptr signal back to the FIFO wptr&full block. After the second group of 32 cycles, the incoming data have been combined with the earlier data read from the FIFO memory; the resulting data (the results of the subtract operations) are first held in registers for a two-clock delay and then written into the FIFO memory. After all the results have been stored, the delay signal is disabled and the controller waits for the next 32 cycles.

3) FIFO wptr&full: The FIFO wptr&full block consists of a dual n-bit Gray code counter and the full signal generator. Unlike a binary counter, in which several bits (in the worst case all of them) can change on one increment, a Gray code counter changes only one bit per increment. In addition, with the MSB inverted, the second half of the n-bit Gray code is a mirror image of the first half. Fig. 4 is a block diagram of a dual n-bit Gray code counter [5]. The Gray code outputs are passed to a Gray-to-binary converter (bin), whose output is passed to a conditional binary-value incrementer to generate the next binary count value (bnext); bnext is passed to a binary-to-Gray converter that generates the next Gray count value (gnext), which is passed to the register inputs. The signal ptr[n-1:0] (wptr in Fig. 3) is passed to the read clock domain, and ptr[n-1:0] is used to address the FIFO buffer. The addrmsb signal combines the inverted MSB with the inverted second MSB of gnext; addrmsb and ptr[n-2:0] are put together to form the signal that is compared with the rptr signal passed from the FIFO rptr module. If the two signals are equal, the write pointer has caught up with the synchronized read pointer, and the FIFO wptr&full module asserts the full signal wfull.

Fig. 4. The dual n-bit Gray code counter block diagram (based on [5]).

4) FIFO rptr: The FIFO rptr module is similar to the FIFO wptr&full module. It also consists of a dual n-bit Gray code counter.

D. Optimization

1) Multiplier Unit: A complex multiplication is usually computed as:

$$(a + bj)(c + dj) = (ac - bd) + (ad + bc)j \tag{1}$$

This method costs four multipliers and two add/sub operations. By rearranging the terms in a, b, c and d, however, the equation above can be written as:

$$(a + bj)(c + dj) = (A \cdot b + C \cdot c) + (B \cdot a - C \cdot c)j \tag{2}$$
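The equivalence of (1) and (2) can be checked numerically. The following is a minimal sketch, not the authors' hardware description: it takes A = c - d, B = c + d and C = a - b as the paper defines them, and the function names `cmul_direct` and `cmul_three_mult` are hypothetical. Integers are used so that equality is exact, as in fixed-point arithmetic; word-length growth and rounding are not modelled.

```python
def cmul_direct(a, b, c, d):
    """Eq. (1): (a + bj)(c + dj) with four multiplications and two add/sub."""
    return a * c - b * d, a * d + b * c


def cmul_three_mult(a, b, c, d):
    """Eq. (2): the same product with three multiplications and five add/sub."""
    A = c - d        # add/sub 1
    B = c + d        # add/sub 2
    C = a - b        # add/sub 3
    m1 = A * b       # multiplier 1
    m2 = C * c       # multiplier 2, shared by the real and imaginary parts
    m3 = B * a       # multiplier 3
    real = m1 + m2   # add/sub 4: (c - d)b + (a - b)c = ac - bd
    imag = m3 - m2   # add/sub 5: (c + d)a - (a - b)c = ad + bc
    return real, imag


if __name__ == "__main__":
    # Arbitrary integer test vectors (illustrative values only).
    for a, b, c, d in [(3, -7, 5, 2), (-12, 4, 9, -6), (1, 2, 3, 4)]:
        assert cmul_direct(a, b, c, d) == cmul_three_mult(a, b, c, d)
        print((a, b, c, d), "->", cmul_three_mult(a, b, c, d))
```

The product C·c appears in both the real and the imaginary part, which is where the fourth multiplier is saved.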

Fig. 5. Verification methods: (a), (b).

Fig. 6. The results of the Pointer FIFO FFT and Matlab FFT.

Fig. 7. Co-simulation with Xilinx boards, Matlab Simulink and SMIMS VeriLink: (a) the Simulink and VeriLink model composed of functional blocks; (b) the wave of sine after FFT processing.

Here A = c - d, B = c + d and C = a - b. This method requires only three multipliers and five add/sub operations to compute a complex multiplication.

2) Twiddle Factor: Because of the large number of twiddle factors, this paper uses block RAMs configured as ROMs to store them. This saves the distributed RAM that would otherwise be built from slices.

3) Reversed Unit: The Reversed Unit uses a controller that consists of several counters and maps the reversed order stored in the corresponding ROM, so that the data coming out of the Reversed Unit are the result of the FFT of the input data.

III. VERIFICATION AND EXPERIMENTAL RESULTS

A. Verification

1) Method 1: Fig. 5(a) shows the block diagram of the comparison between the Pointer FIFO FFT and the Matlab FFT. A fixed-point sine signal generated by Visual C++ is used as the input to both the Pointer FIFO FFT and the Matlab FFT. After processing, the results are plotted in a Matlab figure, shown in Fig. 6. Fig. 6 shows that the output of the Pointer FIFO FFT is nearly the same as that of the Matlab FFT.

2) Method 2: SMIMS VeriLink is used as a bridge that connects software (Simulink) and hardware for co-simulation, as shown in Fig. 5(b). Building on Simulink from Matlab together with VeriLink from SMIMS, input data generated by the Simulink sine wave block can be fed directly into the FFT processor downloaded to the FPGA board through the RT2C block provided by SMIMS VeriLink, which converts and scales real-format input to two's complement. Simulink can then read the output data directly from the FPGA device after FFT processing and display the corresponding waveforms with the Simulink scope block. The TWOCTR block provided by SMIMS VeriLink converts and scales two's complement input to real format. Fig. 7(a) is the Simulink and VeriLink model, composed of functional blocks, that is used to co-simulate the FFT processor (hardware). Fig. 7(b) shows the two impulses that form the spectrum of the sine signal.

B. Experimental Results

Table I gives the performance results on Spartan-3E FPGAs of the R2²SDF architectures based on shift registers and on pointer FIFOs, with different input data widths, together with the FFTs reported in [4]. Because of the shift operation, the FFT based on shift registers costs many more slices than the FFTs based on pointer FIFOs and Bin Zhou's [4]. Each kind of FFT keeps the same maximum frequency for different FFT sizes because of the pipeline architecture. From Table I, the pointer FIFO R2²SDF FFT with a data width of 16 occupies only 2580 slices and 2 block RAMs, which is smaller than the R2²SDF of [4], and its throughput per area is better than that of the other FFTs.

A summary of the performance comparison with selected FFT implementations is shown in Table II. Figures for the Amphion core CS2411XV [8], the Sundance core FC200 [7], the Xilinx core version 2 for Virtex-E [6] and Bin Zhou's [4] are given together with those of our FFT. The resulting figures show different features of the implementations of the R2²SDF FFT. Our implementation outperforms Sundance's [7], Bin Zhou's [4], Sukhsawas and Benkrid's


TABLE I
IMPLEMENTATION RESULTS ON SPARTAN-3E DEVICES

Design            FFT   Input  Twiddle  Slices  Slice       BRAMs  Max. freq.  Latency  Transform  Throughput  Throughput/area
                  size  width  width            flip-flops         (MHz)       (clk)    time (µs)  (MS/s)      (MS/s/slice)
Our PFR2²SDF (1)  1024  32     32       3692    2788        5      71.947      1046     14.54      70.44       0.019
Our SFR2²SDF (2)  1024  32     32       23425   39056       5      71.947      1046     14.54      70.44       0.003
Our PFR2²SDF      1024  16     16       2580    2030        2      92.595      1046     11.30      90.62       0.035
Our SFR2²SDF      1024  16     16       13485   22303       2      92.595      1046     11.30      90.62       0.007
R2²SDF [4]        1024  16     16       4409    Not given   8      123.84      1041     8.27       123.84      0.028
R4SDC [4]         1024  16     16       2802    Not given   8      95.25       1042     10.75      95.25       0.034

(1) PFR2²SDF: Pointer FIFOs R2²SDF
(2) SFR2²SDF: Shift Registers R2²SDF
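The derived columns of Table I can be reproduced from the raw columns. The sketch below assumes three relations that are inferred from the tabulated values for the authors' own rows and are not stated explicitly in the paper: transform time = latency / maximum frequency, throughput = FFT size / transform time, and throughput per area = throughput / slices. The rows taken from [4] appear to follow a different convention and are left out. The function name `derived_metrics` is hypothetical.

```python
def derived_metrics(n_points, latency_clk, fmax_mhz, slices):
    """Return (transform time in us, throughput in MS/s, throughput per slice).

    Assumed relations (inferred from Table I, not quoted from the paper):
      transform time  = latency / f_max
      throughput      = N / transform time
      throughput/area = throughput / slices
    """
    transform_time_us = latency_clk / fmax_mhz          # clk / MHz = microseconds
    throughput_msps = n_points / transform_time_us      # samples / us = MS/s
    throughput_per_slice = throughput_msps / slices
    return transform_time_us, throughput_msps, throughput_per_slice


# Raw columns of the authors' four rows in Table I:
# (FFT size, latency in clk, max. frequency in MHz, slices)
ROWS = {
    "PFR2^2SDF 32-bit": (1024, 1046, 71.947, 3692),
    "SFR2^2SDF 32-bit": (1024, 1046, 71.947, 23425),
    "PFR2^2SDF 16-bit": (1024, 1046, 92.595, 2580),
    "SFR2^2SDF 16-bit": (1024, 1046, 92.595, 13485),
}

if __name__ == "__main__":
    for name, args in ROWS.items():
        t_us, msps, per_slice = derived_metrics(*args)
        print(f"{name:18s} {t_us:6.2f} us  {msps:6.2f} MS/s  {per_slice:.3f} MS/s/slice")
```

Under these assumptions the computed values agree with the tabulated ones to within the rounding of the printed transform time.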

TABLE II
PERFORMANCE COMPARISON ON VIRTEX-E DEVICES

Design                FFT   Input  Twiddle  Slices  BRAMs  Max. freq.  Latency  Transform  Throughput  Throughput/area (1)
                      size  width  width                   (MHz)       (clk)    time (µs)  (MS/s)      (MS/s/slice)
Amphion [3]           1024  13     13       1639    9      57          5097     71.86      14.25       0.009
Xilinx [3], [6]       1024  16     16       1968    24     83          4096     49.35      20.75       0.011
Sundance [7]          1024  16     10       8031    20     49          1320     27.00      49.00       0.006
Sukhsawas R2²SDF [3]  1024  16     16       7365    28     82          1099     12.49      82.00       0.011
R2²SDF [4]            1024  16     16       5008    32     95          1042     10.78      95.00       0.019
Our PFR2²SDF          1024  16     16       3390    11     53.978      1046     19.38      52.84       0.016

(1) Block RAMs are not taken into consideration.

[3] in slice cost (only 3390) and block RAM cost (only 11), but its maximum clock frequency of 54 MHz is lower than those of [3] and [4]. Its throughput per area of 0.016 Msamples/s/slice is better than the others except for Bin Zhou's, whose throughput per area is 0.019 Msamples/s/slice. According to the corresponding reference, the Xilinx FFT IP core for Virtex-E in Table II has about four times the latency (4096 cycles) because of its internal architecture, while our FFT has a latency of only 1046 cycles owing to its internal pipeline architecture. Our FFT uses the fewest block RAMs except for Amphion's, whose latency is 5097 cycles.

IV. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a new R2²SDF architecture based on pointer FIFOs embedded with Gray code counters, which made the FFT processor more stable when dealing with large amounts of data. The 16-bit 1024-point R2²SDF FFT based on pointer FIFOs embedded with Gray code counters reached a maximum clock frequency of 92.6 MHz and used only 2580 slices and 2030 slice flip-flops with just 2 block RAMs, giving a throughput of 92.6 Msamples/s on Spartan-3E devices. Our pointer FIFO FFT processor outperforms the others in slice and block RAM cost, with a high throughput per area. This paper also presented a new verification method, hardware and software co-simulation, which used Matlab Simulink and SMIMS VeriLink together with FPGA boards to test the implementation of the R2²SDF FFT based on pointer FIFOs.

Future work includes improving the pointer FIFO embedded with Gray code counters to reduce slice usage, improving the multipliers to raise the maximum clock frequency, an implementation with CORDIC arithmetic, and power and SQNR analysis. These improvements should yield higher performance.

REFERENCES
[1] S. He and M. Torkelson, "A new approach to pipeline FFT processor," in Proceedings of IPPS '96, The 10th International Parallel Processing Symposium, Apr. 1996, pp. 766-770.
[2] J. Garcia, J. Michell, and A. Buron, "VLSI configurable delay commutator for a pipeline split radix FFT architecture," IEEE Transactions on Signal Processing, vol. 47, no. 11, pp. 3098-3107, Nov. 1999.
[3] S. Sukhsawas and K. Benkrid, "A high-level implementation of a high performance pipeline FFT on Virtex-E FPGAs," in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2004, pp. 229-232.
[4] B. Zhou and D. Hwang, "Implementations and optimizations of pipeline FFTs on Xilinx FPGAs," in International Conference on Reconfigurable Computing and FPGAs (ReConFig '08), 2008, pp. 325-330.
[5] C. E. Cummings, "Synthesis and scripting techniques for designing multi-asynchronous clock designs," SNUG 2001 (Synopsys Users Group Conference, San Jose, CA, 2001) User Papers, Section MC1, 3rd paper, Mar. 2001.
[6] Xilinx, Inc., "High-performance 1024-point complex FFT/IFFT v2.0," San Jose, Calif., USA, Jul. 2000. [Online]. Available: http://www.xilinx.com/ipcenter
[7] Sundance Multiprocessor Technology Ltd., "1024-point fixed point FFT processor," Jul. 2008. [Online]. Available: http://www.sundance.com/web/files/productpage.asp?STRFilter=FC200
[8] Amphion Semiconductor Ltd., "1024 point block based FFT/IFFT," Apr. 2002. [Online]. Available: http://www.amphion.com/signal.html

