
International Journal of Applied Research and Studies (iJARS)

ISSN: 2278-9480 Volume 3, Issue 5 (May - 2014)


www.ijars.in

Manuscript Id: iJARS/820


Research Article
Design, Simulation, Implementation, and Performance Analysis of a
fixed-point 8 Point FFT Core for Real Time Application in Verilog
HDL

Authors: Bikash Poudel (1), Manish Bhattrai* (2), Sandesh Ghimire (3)

Address for correspondence:
(1) Asst. Lecturer, Institute of Engineering, Thapathali Campus
(2) Assistant R&D Engineer, Powertech Nepal
(3) Engineer, Nepal Electricity Authority

Abstract - The Fast Fourier Transform (FFT), an efficient and ubiquitous tool for computing the
Discrete Fourier Transform (DFT), is widely used to transform a signal from the time domain to
the frequency domain. Since the FFT algorithm requires fewer computations than direct evaluation
of the DFT, it is widely used in speech recognition (now common in many applications and
products), telecommunication, signal processing, multimedia communication, etc. Designing and
implementing floating-point (FP) FFT algorithms in FPGAs is an active research area and remains
a challenging task. This paper proposes a new architecture of an FFT core that computes a
radix-2 8-point FFT using fixed-point operations in only eight clock cycles. The key feature of
this design is that it aims to maintain good performance with the smallest possible footprint.
The design is done in the Xilinx ISE 13.2 tool using Verilog-HDL. The processor core has been
simulated using the Xilinx ISIM simulator for functional verification, and its FPGA-based
implementation has been verified on a Spartan-3E Starter Kit. This paper also presents a brief
analysis of the performance of the FFT core and of the FPGA resources it consumes. The objective
of this work is an area- and time-efficient architecture that could be used as part of a voice
processing system.
Keywords: DFT, FFT, FPGA, Xilinx ISE 13.2, Verilog-HDL
Introduction
Audio signal processing is a well-developed field that is widely used these days in
telecommunication, multimedia applications, speech recognition for voice-operated systems, etc.
In signal processing one often prefers to work in the frequency domain because of

*Corresponding author email: ceo.dspspectrum@gmail.com
the many advantages the frequency domain offers. This brings forward the Discrete Fourier
Transform, which converts a signal in the discrete time domain to the discrete frequency domain.
The arithmetic complexity of the Discrete Fourier Transform (DFT) algorithm is a significant
factor in the overall computational cost of a design. Cooley and Tukey [1] developed the
well-known radix-2 FFT algorithm to reduce the computational load of the DFT. Based on how one
divides a set of N inputs into two sets of N/2 numbers, there are two types of radix-2 FFT
algorithm, or Cooley-Tukey algorithm: decimation-in-time FFT (DIT-FFT) and
decimation-in-frequency FFT (DIF-FFT). We have implemented the decimation-in-frequency FFT
algorithm. To understand decimation in frequency, we start by writing the definition of the DFT,
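
\[ X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{nk} = \sum_{n=0}^{N/2-1} \left( x[n] + (-1)^k\, x[n+N/2] \right) W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1} \]

where $W_N = e^{-j 2\pi / N}$ is the twiddle factor and $W_N^{(N/2)k} = (-1)^k$.

For even k, i.e. k = 2m, $(-1)^k = 1$,

\[ X[2m] = \sum_{n=0}^{N/2-1} \left( x[n] + x[n+N/2] \right) W_{N/2}^{nm} \tag{2} \]

For odd k, i.e. k = 2m+1, $(-1)^k = -1$,

\[ X[2m+1] = \sum_{n=0}^{N/2-1} \left( x[n] - x[n+N/2] \right) W_N^{n}\, W_{N/2}^{nm} \tag{3} \]

Using symbols $g[n] = x[n] + x[n+N/2]$ and $h[n] = x[n] - x[n+N/2]$, equations (2) and (3) can be
written as

\[ X[2m] = \sum_{n=0}^{N/2-1} g[n]\, W_{N/2}^{nm} \tag{4} \]

\[ X[2m+1] = \sum_{n=0}^{N/2-1} h[n]\, W_N^{n}\, W_{N/2}^{nm} \tag{5} \]

Thus, we started by dividing the N inputs into two halves. Noticing that the twiddle factors
($W_N^{nk}$) of the first and second halves differ only by the factor $(-1)^k$, we worked out
that the two halves are simply added if k is even, whereas if k is odd the second half is
subtracted from the first and the difference is multiplied by a power of the twiddle factor
($W_N^{n}$). Thus we grouped the two sets with some modifications, but now the number of twiddle
factors is halved and we have reached an N/2-point DFT.
Equation (4) can be viewed as the N/2-point DFT of $g[n]$, and equation (5) can be viewed as the
N/2-point DFT of $h[n]\, W_N^{n}$. Thus, an N-point DFT can be computed by evaluating two
N/2-point DFTs.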
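
The decomposition above can be checked numerically. The following is a minimal Python sketch, not the authors' implementation: the function names `dft` and `dif_fft` are chosen here for illustration, and the test vector is simply the input column of table 2. Even-indexed outputs come from the N/2-point transform of x[n] + x[n+N/2], and odd-indexed outputs from the N/2-point transform of (x[n] - x[n+N/2]) multiplied by W_N^n.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of two."""
    n_points = len(x)
    if n_points == 1:
        return list(x)
    half = n_points // 2
    # Sum of the two halves feeds the even-indexed outputs.
    g = [x[n] + x[n + half] for n in range(half)]
    # Difference of the halves, times W_N^n, feeds the odd-indexed outputs.
    h = [(x[n] - x[n + half]) * cmath.exp(-2j * cmath.pi * n / n_points) for n in range(half)]
    out = [0j] * n_points
    out[0::2] = dif_fft(g)
    out[1::2] = dif_fft(h)
    return out


samples = [100, 200, 300, 400, 500, 0, 0, 0]
max_diff = max(abs(a - b) for a, b in zip(dif_fft(samples), dft(samples)))
print(max_diff < 1e-9)
```

The recursion stops at a 1-point transform, so an 8-point input passes through three levels of splitting.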

Figure 1: Illustration of N-point FFT using two N/2-point FFT

This process can be continued, and each N/2-point DFT can be computed from two N/4-point DFTs.
For N = 8, figure 1 corresponds to the decimation of an 8-point DFT into two 4-point DFTs.
Further decimating the 4-point DFTs into 2-point DFTs, we reach a butterfly structure as shown in
figure 5. As shown in figure 5, there are three stages. For an N-point DFT, the number of stages
is $\log_2 N$. Thus, this kind of decimation reduces the computational complexity from $O(N^2)$
to $O(N \log_2 N)$, since the computation in each stage is of order N.
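
To give a feel for the saving, the short Python sketch below tabulates the stage count and the order-of-magnitude operation counts for a few transform sizes. The helper name `operation_counts` is hypothetical, and the counts are the idealized N^2 and N*log2(N) figures rather than exact adder or multiplier counts of any particular implementation.

```python
def operation_counts(n_points):
    """Return (stages, direct DFT operations, FFT operations) for a power-of-two size."""
    stages = n_points.bit_length() - 1   # log2(N) for a power of two
    direct = n_points * n_points         # order N^2 for direct evaluation
    fft = n_points * stages              # order N per stage, log2(N) stages
    return stages, direct, fft


for size in (8, 64, 1024):
    print(size, operation_counts(size))
```

For N = 8 the sketch gives three stages, which is the number of stages referred to in the text.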
Proposed Methodology
A. Functional Block Diagram of the Radix-2 8-point FFT Computer
The proposed system, whose functional diagram is shown in figure 2, divides the computation of
the FFT into three stages: Input Stage, Compute Stage, and Output Stage. In the Input Stage,
eight samples are read from the Analog to Digital Converter (ADC) and stored in the 8*64-bit
Input Buffer, which takes eight clock cycles. The Compute Stage performs the computation
of the FFT on the eight input samples and generates eight frequency samples in eight clock
cycles. The Compute Stage has three blocks: the x-Buffer, which holds the eight input samples
from the Input Buffer; the FFT Core, which computes the FFT; and the temporary buffer, which
holds the intermediate results. Finally, the Output Stage presents the output at the output
ports.

Figure 2: Proposed architecture of the radix-2 8-point FFT.
B. Implementation method of Butterfly Network
The FFT is computed by implementing the Butterfly Network in a novel and efficient way. Whereas
a direct implementation of the butterfly arrangement requires twelve subtracters, twelve adders,
and twelve multipliers, the FFT core presented in this paper uses four adders, four subtracters,
and four multipliers in order to conserve resources without sacrificing the performance of the
network. Had all twelve adders, multipliers, and subtracters been used, the whole design would be
a single combinational circuit, and the output frequency samples would be generated as soon as a
new input is available, i.e. in one clock cycle. However, since only four adders, four
subtracters, and four multipliers are used, the output samples are presented at the output port
only at the end of eight clock cycles, because the butterfly network is evaluated by the FSM
shown in figure 4, which takes eight clock cycles to complete.
This architecture is a three-stage pipelined architecture, so there are three independent and
concurrent stages: Input Stage, Compute Stage, and Output Stage, as shown in figure 5. The Input
Stage takes eight clock cycles regardless of the architecture, since it always takes eight clock
cycles to fetch eight input samples from the ADC. Thus, in order to complete the computation of
the FFT before the next set of input samples is available from the ADC, the Compute Stage has at
most seven clock cycles to compute the FFT and the Output Stage has one clock cycle to place the
output samples on the output ports, without adding extra cycles to the overall instruction cycle
of the core. For this, the Compute Stage has been mathematically divided into three sub-stages,
each requiring four adders (A1, A2, A3, and A4), four
subtracters (S1, S2, S3, and S4), and four multipliers (M1, M2, M3, and M4), as shown in figure
5. For each of the three sub-stages of the Compute Stage, i.e. Compute Stage I, Compute Stage II,
and Compute Stage III, the four adders, four subtracters, and four multipliers are reused by
means of the computational architecture shown in figure 7, with the help of 4-to-1 multiplexers
of which one input line is left unused.

Figure 3: Illustration of how an adder is reused in the various sub-stages of the Compute Stage using a 4-to-1 multiplexer
The crux of the reusability of the adders, subtracters, and multipliers is that the Compute
Stage has been divided into three sub-stages, as shown in figure 5, where each sub-stage requires
all four adders, subtracters, and multipliers. At first, the input samples in the x-Buffer are
added or subtracted and multiplied as per the butterfly network in Compute Stage I. The four
adders, subtracters, and multipliers generate intermediate results in the x-Buffer, which are
used as the inputs for Compute Stage II. Next, in Compute Stage II the same four adders,
subtracters, and multipliers are reused to compute intermediate results to be used as inputs for
Compute Stage III, as shown in figure 5. Finally, in Compute Stage III the same four adders,
subtracters, and multipliers are used again to generate the final output frequency samples.
The reuse of the four adders, four subtracters, and four multipliers is illustrated in figure 3.
The same set of components is reused by placing multiplexers on the input lines of each component
to select different inputs in the different Compute Stages. Say the core is at COMPUTE STAGE I of
the FSM shown in figure 4, which corresponds to Compute Stage I of figure 5. The adder A1 has to
add input samples x[0] and x[4], as dictated by the butterfly network of figure 5. This is done
by sending 2'b00 from the controller FSM to the multiplexer connected to the second port of
Complex Adder A1, which makes the multiplexer feed x[4] to the adder, as shown in figure 3. The
resulting sum evaluated by Complex Adder A1 is stored in x[0] of the x-Buffer, which previously
contained the first sample from the ADC, in the WRITE BACK I stage. In COMPUTE STAGE II of the
FSM, which corresponds to Compute Stage II of figure 5, the same Complex Adder A1 has to add the
intermediate samples x[0] and x[2]; the controller FSM sends the selection value 2'b01 to select
the sample for the second
input of the Complex Adder. The sum is written back to the x-Buffer in WRITEBACK STAGE II.
Finally, the same Complex Adder A1 has to sum the intermediate samples x[0] and x[1] in COMPUTE
STAGE III of the FSM in figure 4, which corresponds to Compute Stage III of figure 5, thus
generating the output frequency sample X[0]. The first input to the Complex Adder is again x[0],
but the second input is x[1], which is selected by the multiplexer when the controller FSM sends
2'b10 on the selection line. The output sample X[0] is presented at the output port in WRITEBACK
STAGE III. Another important point about this design is that the same x-Buffer, which initially
holds the input samples from the ADC, is also used to hold the intermediate results of the
sub-stages of the Compute Stage, as shown in figure 5.
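
The reuse scheme described above can be mimicked in software. The sketch below is a behavioural Python model under stated assumptions, not the Verilog of the core: the names `SELECT_TO_SPAN`, `compute_stage`, and `fft8_in_place` are hypothetical, the mapping of select values to operand distance is inferred only from the A1 example in the text (x[0] paired with x[4], x[2], then x[1]), and floating-point arithmetic stands in for the fixed-point hardware. Each call to `compute_stage` performs four butterflies, which corresponds to one use of the four adders, four subtracters, and four multipliers, and writes the results back into the same buffer.

```python
import cmath

N = 8
# Assumed mapping from the 2-bit multiplexer select value to the distance between
# the two operands of a butterfly: 2'b00 -> 4, 2'b01 -> 2, 2'b10 -> 1.
SELECT_TO_SPAN = {0b00: 4, 0b01: 2, 0b10: 1}


def twiddle(k):
    """Twiddle factor W_8^k."""
    return cmath.exp(-2j * cmath.pi * k / N)


def compute_stage(x_buffer, select):
    """One compute sub-stage: four butterflies written back into x_buffer."""
    span = SELECT_TO_SPAN[select]
    units_used = 0
    for start in range(0, N, 2 * span):
        for j in range(span):
            top, bottom = start + j, start + j + span
            a, b = x_buffer[top], x_buffer[bottom]
            x_buffer[top] = a + b                                        # adder
            x_buffer[bottom] = (a - b) * twiddle(j * (N // (2 * span)))  # subtracter, multiplier
            units_used += 1
    return units_used


def fft8_in_place(samples):
    """Run the three sub-stages on one shared buffer and return X[0]..X[7]."""
    x_buffer = [complex(s) for s in samples]
    for select in (0b00, 0b01, 0b10):
        assert compute_stage(x_buffer, select) == 4
    # In this model the in-place result ends up in bit-reversed order.
    position = [int(format(k, "03b")[::-1], 2) for k in range(N)]
    return [x_buffer[position[k]] for k in range(N)]


result = fft8_in_place([100, 200, 300, 400, 500, 0, 0, 0])
print([complex(round(v.real, 1), round(v.imag, 1)) for v in result])
```

In this model the first butterfly of each sub-stage adds x[0] to x[4], then to x[2], then to x[1], matching the sequence described for Complex Adder A1. The bit-reversed read-out at the end is a property of this software model; how the core orders its output ports is not stated in the text.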
The complete architecture of the core is shown in figure 7, which shows how the multiplexers are
attached to the input ports of the adders, subtracters, and multipliers to select a different set
of inputs in each sub-stage of the Compute Stage.


Figure 4: FSM that dictates how the FFT Core of figure 7 operates.



Figure 5: Segregation of the Butterfly Network into three stages: Input, Compute, and Output Stage.
The input samples from the ADC are in 14-bit 2's complement form, and the twiddle factors are
represented by 10-bit 2's complement fixed-point numbers. Since a twiddle factor is a complex
number, two 10-bit fixed-point numbers represent its real part and imaginary part respectively.
So, during the twiddle factor multiplication, at most 24 bits of storage are needed for each of
the real and imaginary parts of a sample (a 14-bit value times a 10-bit value generates a result
of at most 24 bits). However, in our actual implementation 32 bits have been used for each of the
real and imaginary parts of a sample, so that a future modification of the core can use up to a
16-bit representation of the twiddle factor (10 bits in this particular design) for a more
accurate fixed-point representation of floating-point numbers. All of the adders, subtracters,
COMPUTE STAGE III and the results are finally sent to the output port from the Temp-Buffer
in OUTPUT STAGE. Thus, from the FSM shown in figure 4, where each stage requires one clock cycle
to operate, it follows that the core takes eight clock cycles to compute the FFT and present the
output frequency samples at the output ports.
D. Verilog Design of the FFT Core in Xilinx ISE
The Verilog [2] module named fft8point, whose block diagram is shown in figure 8, computes the
radix-2 8-point FFT; the HDL code snippet for the port declarations is shown in figure 9. The
name and description of each port signal are shown in table 1. The eight input samples are taken
into the core through the input ports Px0 to Px7, each of which is 14 bits wide. The core then
evaluates the FFT of the eight time-domain samples, generating eight frequency samples. These
frequency samples are presented at the eight output ports X0 to X7, each of which is 64 bits
wide, where the upper 32 bits are the real part and the lower 32 bits are the imaginary part of
the output frequency sample.
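
A host program that reads the 64-bit output words has to split them back into signed real and imaginary parts. The following Python sketch shows one way to do this, assuming both halves are 32-bit 2's complement values; the helper names `pack_sample`, `to_signed32`, and `unpack_sample` are hypothetical and are not part of the core.

```python
MASK32 = 0xFFFFFFFF


def pack_sample(real, imag):
    """Pack signed 32-bit real and imaginary parts into one 64-bit word (real in the upper half)."""
    return ((real & MASK32) << 32) | (imag & MASK32)


def to_signed32(value):
    """Interpret a 32-bit pattern as a 2's complement integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def unpack_sample(word):
    """Split a 64-bit word into (real, imag) signed integers."""
    return to_signed32((word >> 32) & MASK32), to_signed32(word & MASK32)


word = pack_sample(-258, 125)
print(hex(word))
print(unpack_sample(word))
```

Masking before the shift keeps negative values from spilling into the other half of the word.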


Figure 8: Verilog Module of the FFT Core


Figure 9: Verilog Code Snippet of the Core

Table 1: List of ports of the designed core with their function
S.N.  Port Name    Function                                                     Width (in bits)                                             Direction
1.    Px0 to Px7   Takes eight input samples from the Input Buffer              14-bit each                                                 Input
2.    X0 to X7     Present output samples                                       64-bit each with 32-bit real part and 32-bit imaginary part Output
3.    Clk          Global Clock                                                 1-bit                                                       Input
4.    inputValid   Asserts that the input samples at the input port are valid   1-bit                                                       Input
5.    Reset        Global Reset signal for the core                             1-bit                                                       Input

Result, Discussion and Summary
A. Result and Discussion:
The functional verification of the FFT Core has been done using the ISIM simulator of Xilinx
ISE. A snapshot of the simulation result is shown in figure 10. The input sequence fed to the FFT
Core and the corresponding output samples, along with a comparison with the actual output, are
shown in table 2. The execution period is 0.16 us (= 8/50 MHz); that is, the 8-point FFT is
computed in only 8 clock cycles.

Figure 10: Simulation Waveform showing the input samples, output samples, and control
signals.
The waveform shown in figure 10 illustrates that the designed core runs smoothly and produces
the correct output. The exact result is in general not an integer, but because fixed-point
representation is used for the floating-point numbers, the obtained result is an integer
approximation of the actual result.
The result is almost 100% accurate: the error in magnitude listed in table 2 does not exceed
0.174%. Thus, the FFT Core is able to calculate the 8-point FFT with good precision.
Table 2: Comparison between the actual output from MATLAB calculation and the computed output from the designed core
S.N.  Input        Matlab Output        Observed Output   Error in magnitude (in percentage)
1     100 + 0 i    1500 + 0 i           1500 + 0 i        0
2     200 + 0 i    300 + 200 i          300 + 200 i       0
3     300 + 0 i    -541.4 - 724.3 i     -542 - 725 i      0.102 %
4     400 + 0 i    -258.6 - 124.3 i     -259 - 124 i      0.174 %
5     500 + 0 i    300 - 0 i            300 - 0 i         0
6     0 + 0 i      300 - 200 i          300 - 200 i       0
7     0 + 0 i      -258.6 + 124.3 i     -258 + 125 i      0.102 %
8     0 + 0 i      -541.4 + 724.3 i     -541 + 724 i      0.174 %
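
The integer approximation seen in table 2 can be explored with a small fixed-point model. The Python sketch below is an illustration under assumptions that the paper does not state: the 10-bit twiddle factors are taken to carry 8 fractional bits (`FRAC_BITS`, so that 1.0 is stored as 256), each product is truncated by an arithmetic right shift, and the names `quantized_twiddle`, `fixed_multiply`, and `fixed_point_fft8` are hypothetical. The actual scaling and rounding used in the core may differ.

```python
import math

N = 8
FRAC_BITS = 8  # assumed number of fractional bits in the 10-bit twiddle factors


def quantized_twiddle(k):
    """W_8^k as a pair of integers (real, imag) scaled by 2**FRAC_BITS."""
    angle = 2 * math.pi * k / N
    scale = 1 << FRAC_BITS
    return round(math.cos(angle) * scale), round(-math.sin(angle) * scale)


def fixed_multiply(value, tw):
    """Complex multiply of integer pairs, then drop the fractional bits by truncation."""
    (vr, vi), (wr, wi) = value, tw
    return (vr * wr - vi * wi) >> FRAC_BITS, (vr * wi + vi * wr) >> FRAC_BITS


def fixed_point_fft8(samples):
    """Integer-only 8-point decimation-in-frequency FFT on a shared buffer."""
    x = [(s, 0) for s in samples]
    span = N // 2
    while span >= 1:
        for start in range(0, N, 2 * span):
            for j in range(span):
                (ar, ai), (br, bi) = x[start + j], x[start + j + span]
                x[start + j] = (ar + br, ai + bi)
                tw = quantized_twiddle(j * (N // (2 * span)))
                x[start + j + span] = fixed_multiply((ar - br, ai - bi), tw)
        span //= 2
    # Results sit in bit-reversed positions; reorder to X[0]..X[7].
    position = [int(format(k, "03b")[::-1], 2) for k in range(N)]
    return [x[position[k]] for k in range(N)]


fixed_result = fixed_point_fft8([100, 200, 300, 400, 500, 0, 0, 0])
for k, (re, im) in enumerate(fixed_result):
    print(k, re, im)
```

With these assumptions the model returns integer pairs such as (-258, 125) and (-541, 724), values that also appear in the Observed Output column of table 2. This suggests, but does not confirm, that the core truncates in a similar way.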

B. Design Summary
The design summary, a report generated by Xilinx ISE [3], allows designers to view information
such as the targeted device, device utilization, design goal, etc. The FFT Core has been
implemented on the Spartan-3E Starter Board. The RTL schematic of the FFT processor, a basic
logical representation of the circuit in terms of logic primitives that is generated once the
design is correct at the simulation and synthesis levels, is shown in figure 12. The design
summary generated by Xilinx ISE is shown in figure 11.

Figure 11: Design Summary generated by Xilinx ISE

Figure 12: RTL Schematic of the Designed Core


Summary and Conclusion
This paper presents an 8-point FFT processor with a new architecture that computes the transform
in eight clock cycles while using only four adders, four subtracters, and four multipliers. The
whole design is implemented in Verilog-HDL through Xilinx ISE 13.2, and the functional
verification is done using the ISIM simulator. The design performs well in terms of both the
physical resources and the throughput required for real-time applications such as audio
processing. Along with these performance results come other considerations, which need to be
evaluated to select the best approach depending on system requirements such as ease of
implementation, cost, and performance. This design has a very simple port interface, so it can be
easily incorporated into any other system that requires FFT computation. The design produces a
maximum error of 0.2% in the result due to the fixed-point representation of floating-point
values, which is within a very good tolerance limit. Another important note is that this core can
be extended to compute higher-point FFTs with little modification. Further, this core can be a
very useful tool for analyzing the frequency samples of any type of discrete-time signal in real
time.
References:
[1] Alan V. Oppenheim, Ronald W. Schafer, and John R. Buck, Discrete-Time Signal Processing, 2nd ed., Tenth Impression, Pearson Education, 2012, pp. 655-681.

[2] J. Bhasker, A Verilog HDL Primer, 3rd ed., Star Galaxy Publishing, 2005.

[3] Xilinx, ISE In-Depth Tutorial, UG695 (v 12.1), April 2009.

[4] Young-jin Moon and Young-il Kim, A Mixed-Radix 4-2 Butterfly with Simple Bit Reversing for Ordering the Output Sequences, ICACT 2006, vol. 4, pp. 1772-1774, February 2006.

[5] A. Sreir.sr, C. Ka-a-Terki, H. Mshrez, and S. Negus, A Flexible High Performance Serial Radix-2 FFT Butterfly Arithmetic Unit, IEEE Transl. J. Magn. Japan, vol. 2, pp. 26-29, August 1987 [Digests 9th Annual Conf. Magnetics Japan, p. 301, 1982].

[6] Xilinx LogiCORE FFT Processor guide.pdf.

[7] Chung-Ping Hung, Sau-Gee Chen, and Kun-Lung Chen, Design of an Efficient Variable-length FFT Processor, ISCAS 2004, vol. 4, pp. 833-836.
