Sie sind auf Seite 1von 4

Sample IEEE Paper for A4 Page Size

First Author#, Second Author*, Third Author#


#

First-Third Department, First-Third University


Address
1

first.author@firstthird.edu

third.author@firstthird.edu
*

Second Company
Address Including Country Name
2

second.author@second.com

Abstract The discrete Fourier transform is an important operation


in digital communication systems.
However, the DFT is
computationally very expensive, and the fast Fourier transform is an
algorithm that has been proposed to compute the discrete Fourier
transform efficiently. FFT can be implemented as either decimation in
time or decimation in frequency.
Also, depending on the
decomposition, FFT can be radix-2 or radix-4. We propose a split
radix FFT design where we apply a radix-2 mapping to the even index
terms and we apply a radix-4 mapping to the the odd index terms.
The design is to be implemented on Xilinx Spartan FPGA. We present
speed, area, and power results showing a comparison between split
radix implementation and single radix implementation.
Keywords FFT, fast fourier transform, discrete fourier transform,
FPGA, decimation in time, decimation in frequency

I. INTRODUCTION
The fast Fourier transform is a fast implementation of the DFT.
It is based on a divide- and-conquer approach in which the DFT
computation is divided into smaller, simpler, problems and the nal
DFT is rebuilt from the simpler DFTs. Another application of this
divide-and-conquer approach is the computation of very large
FFTs, in which the time data and their DFT are too large to be
stored in main memory. In such cases the FFT is done in parts and
the results are pieced together to formthe overall FFT, and saved in
secondary storage such as on hard disk. In the simplest CooleyTukey version of the FFT, the dimension of the DFT is successively divided in half until it becomes unity. This requires the
initial dimension N to be a power of two:

N=2 BB=log2(N )
The problem of computing the N-point DFT is replaced by the
simpler problems of computing two (N/2)-point DFTs. Each of
these is replaced by two (N/4)-point DFTs, and so on. We will see
shortly that an N-point DFT can be rebuilt from two (N/2)-point
DFTs by an additional cost of N/2 complex multiplications. This
basic merging step is shown in Fig. 1. Thus, if we compute the two
(N/2)-DFTs directly, at a cost of (N/2)2 multiplications each, the
total cost of rebuilding the full N-DFT will be:

N 2 N N2 N N2
+ = + = (approximately)
2
2
2 2 2

( )

where for large N the quadratic term dominates. This amounts to


50 percent savings over computing the N-point DFT directly at a
cost of N2.

Merging two N/2-DFTs into an N-DFT and its repeated


application
Similarly, if the two (N/2)-DFTs were computed indirectly by
rebuilding each of them from two (N/4)-DFTs, the total cost for
rebuilding an N-DFT would be:

N 2
N N N2
N
N2
+2
+ = +2
= (approximately)
4
4
2
4
2
4

( ) ( )

( )

Thus, we gain another factor of two, or a factor of four in


efciency over the direct N-point DFT. In the above equation, there
are 4 direct (N/4)-DFTs at a cost of (N/4)2 each, requiring an
additional cost of N/4 each to merge them into (N/2)-DFTs, which
require another N/2 for the nal merge. Proceeding in a similar
fashion, we can show that if we start with (N/2m)-point DFTs and
perform m successive merging steps, the total cost to rebuild the
nal N-DFT will be:

N2 N
+ m
2m 2
The rst term, N2/2m, corresponds to performing the initial
(N/2m)-point DFTs directly. Because there are 2m of them, they
will require a total cost of 2m(N/2m)2= N2/2m. However, if the
subdivision process is continued for m = B stages, as shown in Fig.
9.8.1, the nal dimension will be N/2m = N/2B = 1, which requires
no computation at all because the 1-point DFT of a 1-point signal is
itself. In this case, the rst term in Eq. (2) will be absent, and the
total cost will arise from the second term. Thus, carrying out the
subdivision/merging process to its logical extreme of m = B = log
(N) stages, allows the computation to be done in:

1
1
NB= Nlog 2 ( N )
2
2

It can be seen Fig. 1 that the total number of multiplications


needed to perform all the mergings in each stage is N/2, and B is
the number of stages. Thus, we may interpret Eq. (3) as
(total multiplications) = (multiplications per stage) (no. stages)
= (N/2)*B
For the N = 8 example shown in Fig. 1, we have B = log2(8)= 3
stages and N/2 = 8/2 = 4 multiplications per stage. Therefore, the
total cost is BN/2 = 3 4 = 12 multiplications. Next, we discuss the
so-called decimation-in-time radix-2 FFT algorithm. There is also a
decimation-in-frequency version, which is very similar. The term
radix-2 refers to the choice of N as a power of 2, in Eq. (1). Given a
length-N sequence x(n), n = 0, 1,...,N1, its N-point DFT X(k)=
X(k) can be written in the component-form of Eq. (2):

The summation index n ranges over both even and odd values in
the range 0 n N1. By grouping the even-indexed and oddindexed terms, we may rewrite Eq. (9.8.4) as

To determine the proper range of summations over n, we


consider the two terms separately. For the even-indexed terms, the
index 2n must be within the range 0 2n N 1. But, because N
is even (a power of two), the upper limit N 1 will be odd.
Therefore, the highest even index will be N 2. This gives the
range:

0 2 n N20 n

N
1
2

Similarly, for the odd-indexed terms, we must have 0 2n + 1


N 1. Now the upper limit can be realized, but the lower one
cannot; the smallest odd index is unity. Thus, we have:

1 2 n+1 N 10 2n N20 n

N
1
2

Therefore, the summation limits are the same for both terms:

or
WNk(2n) = WN2(kn) = WN/2kn
and
WNk(2n+1) = WNk WN2kn = WNk WN/2kn
So,

and
X(k) = G(k) WNk H(k)
k = 0, 1, ... , N - 1
This is the basic merging result. It states that X(k) can be rebuilt
out of the two (N/2)-point DFTs G(k) and H(k). There are N
additional multiplications, Wk NH(k). Using the periodicity of
G(k) and H(k), the additional multiplications may be reduced by
half to N/2. To see this, we split the full index range 0 k N 1
into two half-ranges parametrized by the two indices k and k + N/2:

0k

N
N
N
1 k + N 1
2
2
2

Therefore, we may write the N equations (9.8.8) as two groups


of N/2 equations:
X(k) = G(k) + WNk H(k)
X(k + N/2) = G(k + N/2) + WN(k + N/2) H(k + N/2)
Using the periodicity property that any DFT is periodic in k with
period its length, we have G(k + N/2)= G(k) and H(k + N/2)= H(k).
We also have the twiddle factor property:
WNN/2 = (e-j2pi/N)N/2 = e-jpi = -1
Then, the DFT merging equations become:
X(k) = G(k) + WNk H(k)
X(k + N/2) = G(k) - WNk H(k)
k = 0, 1, ... , N/2 - 1
They are known as the buttery merging equations. The upper
group generates the upper half of the N-dimensional DFT vector X,
and the lower group generates the lower half. The N/2
multiplications WNkH(k) may be used both in the upper and the
lower equations, thus reducing the total extra merging cost to N/2.
Vectorially, we may write them in the form:

This expression leads us to dene the two length-(N/2)


subsequences:

g ( n )=x ( 2n )
h ( n )=x ( 2 n+1 )
N
n=0,1, , 1
2
and their (N/2)-point DFTs:

Then, the two terms of Eq. (9.8.5) can be expressed in terms of


G(k) and H(k).We note that the twiddle factors WN and WN/2 of
orders N and N/2 are related as follows:
WN/2 = e2j/(N/2) = e4j/N = WN2

where the indicated multiplication is meant to be componentwise. Together, the two equations generate the full DFT vector X.
The operations are shown in Fig. 9.8.2.

buttery merging operations are applied to each pair of DFTs to


generate the next DFT of doubled dimension.

Buttery merging builds upper and lower halves of length-N DFT.

As an example, consider the case N = 2. The twiddle factor is


now W2 =1, but only its zeroth power appears W0 2 = 1. Thus,
we get two 1-dimensional vectors, making up the nal 2dimensional DFT:
[X0] = [G0] + [H0W20]
[X1] = [G0] - [H0W20]
For N = 4

For N = 8

To begin the merging process shown in Fig. 9.8.1, we need to


know the starting one- dimensional DFTs. Once these are known,
they may be merged into DFTs of dimension 2,4,8, and so on. The
starting one-point DFTs are obtained by the so-called shufing or
bit reversal of the input time sequence. Thus, the typical FFT
algorithm consists of three conceptual parts:
1. Shufing the N-dimensional input into N one-dimensional
signals.
2. Performing N one-point DFTs.
3. Merging the N one-point DFTs into one N-point DFT.
Performing the one-dimensional DFTs is only a conceptual part
that lets us pass from the time to the frequency domain.
Computationally, it is trivial because the one-point DFT X = [X0]
of a 1-point signal x = [x0] is itself, that is, X0 = x0, as follows by
setting N = 1 in Eq. (4). The shufing process is shown in Fig. 3
for N = 8. It has B = log2(N) stages. During the rst stage, the
given length-N signal block x is divided into two length-(N/2)
blocks g and h by putting every other sample into g and the
remaining samples into h. During the second stage, the same
subdivision is applied to g, resulting into the length-(N/4) blocks
{a, b} and to h resulting into the blocks {c, d}, and so on. Eventually, the signal x is time-decimated down to N length-1
subsequences. These subsequences form the starting point of the
DFT merging process, which is depicted in Fig. 4 for N = 8. The

II. FFT IMPLEMENTATIONS


The Fast Fourier Transform (FFT), as an efficient algorithm to
compute the Discrete Fourier Transform (DFT), is one of the most
important operations in modern digital signal processing and
communication systems. The pipeline FFT is a special class of FFT
algorithms which can compute the FFT in a sequential manner; it
achieves real-time behavior with nonstop processing when data is
continually fed through the processor. Pipeline FFT architectures
have been studied since the 1970's when real-time large scale
signal processing requirements became prevalent. Several different
architectures have been proposed, based on different decomposition
methods, such as the Radix-2 Multipath Delay Commutator
(R2MDC) , Radix-2 Single-Path Delay Feedback (R2SDF) , Radix4 Single-Path Delay Commutator (R4SDC) , and Radix-2 2SinglePath Delay Feedback (R22SDF) . More recently, Radix-2 2to Radix24SDF FFTs were studied and R23SDF was implemented and
shown to be area efficient for 2 or 3 multipath channels. Each of
these architectures can be classified as multipath or single-path.
Multipath approaches can processdata inputs simultaneously,
though they have limitations on the number of parallel data-paths,
FFT points, and radix. This paper focuses on single-path
architectures.
From the hardware perspective, Field Programmable Gate Array
(FPGA) devices are increasingly being used for hardware
implementations in communications applications. FPGAs at
advanced technology nodes can achieve high performance, while
having more flexibility, faster design time, and lower cost. As such,
FPGAs are becoming more attractive for FFT processing
applications and are the target platform of this paper.
The primary goal of this research is to optimize pipeline FFT
processors to achieve better performance and lower cost than prior
art implementations.
The FFT processor is an inevitable component used for
implementing the systems of OFDM. The pipeline architectures in
structured form are adopted to satisfy the requirements of energy
and signal processing as far as mobile environment is concerned.
The architectures of FFT based on decomposition method and
radix-2 i algorithm were proposed by He et al (1998). The
algorithm was exploited for the implementation of dominant
elements to decrease the count of manipulations and the 21 storage
capacity. The storage capacity had been optimized by adjusting the
word length in a progressive manner. The efficiency of the design
with respect to area and power was improved by using a multiplier
based on distributed arithmetic. The specifications of the design
were obtained by using a 1024 point FFT processor. Song-Nien
Tang et al (2012) proposed a FFT processor for various types of
wireless networks such as wireless LAN, wireless MAN etc. By
adopting the Flexible-Radix Configuration Multiple Delay
Feedback (FRCMDF) commutator, a high performance could be
obtained by FFTs of variable-length in an efficient manner. In order
to improve the efficiency with respect to area and energy, an
optimized method of multiplication was also suggested. In
addition, the architecture could provide a support for scaling the
power across the modes of FFT. The chip had been realized with a

size of 3.2 mm2 , a signal to noise ratio of 40 dB, power


consumption of 507 mW at 300 MHz. This FFT processor of length
512-point was able to give higher performance and lower
consumption of power when compared to other designs.
Taesang Cho (2013) presented a radix-2 5 fast Fourier transform
(FFT) processor of length 512-point for applications of personal
area wireless networks. A modified version of radix-2 5 algorithm
of FFT was used to decrease the level of hardware required. This
technique could decrease the count of manipulations and the
capacity of memory required. A complex multiplier was employed
in the place of a Booth multiplier. The architecture obtained a SNR
value of 35 dB with a word length of 12 bits at 1.2 V. This design
had been implemented using 90-nm technology with the
specifications - gate count was 2, 90, 000, the rate of throughput
was 2.5 Gigabits per second at 310 MHz.
Kyung Heo et al (2003) proposed a FFT processor using mixed
radix algorithm and a new in-place technique. This processor
employed only two numbers of N-word memories for
implementation of FFT when compared to existing FFT processors
which employ four numbers of N-word memories. Further this
architecture obtained the optimum requirements with respect to
area and signal processing. The number of clock cycles and number
of gates were 640 and 37, 000 respectively for a FFT processor of
length 512-point. Hence this design could reduce size of memory
and gate count when compared to other FFT processors.
Shousheng He et al (1998) discussed a FFT processor of length
1024-point using pipeline architecture. This architecture utilized
the regularity of radix-2 2 FFT algorithm. The implementation of
FFT processor was obtained with only four numbers of complex
multipliers and a data memory of 1024 points. The chip had been
realized with 0.5 m CMOS technology with an area of 40 mm2 at
a frequency of 30 MHz.
Han Ying et al (2003) introduced a FFT processor based on
Xilinx FPGA. To reduce the complexity of logic, the serial mode
was adopted to subject the data into three operations such as
inclusion of multiplying window, manipulation of FFT and
computation of module-square. The frequency of the clock was
increased to obtain better performance by employing the serial and
parallel architectures thereby avoiding the bottleneck. It was shown
that the processor obtained high performance and was suitable for
applications of Digital signal processing.
Chang et al (2003) presented an architecture based on algorithm
of Radix-4. By integrating the styles of feed forward and feedback
commutators, this architecture obtained better usage of hardware
and memory when compared to other FFT processors. Moreover

various components of ROM were needed for storing the 23


twiddle values. The size of ROM was decremented by an integer of
2 using the concept of redundancy. The single-frequency
networks are constructed on a large scale using Coded Orthogonal
Frequnecy Division Multiplexing (COFDM) system, with
appropriate guard durations to make the echoes invalid. The fast
Fourier transforms of longer length have to be implemented for
demodulating every symbol in order to reduce the losses of spectral
efficiency to about 20%.
Petrovsky et al (2006) proposed a technique for synthesizing the
split radix FFT processors using pipeline architecture. This
technique was adopted using hardware design of FPGA by
considering the constraints of practical applications such as
frequency, length of transform, performance etc into account. The
illustrated examples of architecture showed the abilities of the
technique to optimize the hardware. The transition from variable
arithmetic to fixed arithmetic along with appropriate issues of
accuracy was also discussed.
III. EXPECTED RESULTS AND CONCLUSION
By using various optimization techniques such as pipelining,
parallel architecture, and split-radix implementation, we propose an
FFT architecture that is optimized in terms of speed, power, and
area. We will also show the interplay of these parameters and the
tradeoffs involved in efficient design.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]

Jaishri Katekhaye, Mr.Amit Lamba, Mr. Vipin REVIEW ON FFT


PROCESSOR FOR OFDM SYSTEM IJAICT Vol-1,NOV:2014
Anwar Bhasha Pattan, Dr. Madhavi Latha FastFourier Transform
Architectures: A Survey IJAECT Weidong Li and Lars Wanhammar
LOW- POWER FFT PROCESSORS IJAECT.
Weidong Li and Lars WanhammarVLSI based FFT processor with
improvement in computation speed and area reduction IJECSE 2013.
H. Sorensen, D. Jones, M. Heideman, and C. Burrus, Real-valued fast
Fourier transform algorithms, IEEE Trans. Acoust., Speech Signal Process.,
vol. 35, no. 6, pp. 849863, Jun. 1987.
S. He and M. Torkelson, Design and implementation of a 1024-point
pipeline FFT processor, in Proc. IEEE Custom Integr. Circuits Conf.,
J. Lee, H. Lee, S. I. Cho, and S. S. Choi, A high-speed two parallel radix24 FFT/IFFT processor for MBOFDM UWB systems, in Proc. IEEE Int.
Symp. Circuits Syst., May 2006, pp. 47194722.
M. Ayinala, M. Brown, and K. K. Parhi, Pipelined parallel FFT
architectures via folding transformation, IEEE Trans. Very Large Scale
Integr. (VLSI) Syst., vol. 20, no. 6, pp. 10681081, Jun. 2012.
R. Radhouane, P. Liu, and C. Modin, Minimizing the memory requirement
for continuous flow FFT implementation: Continuous flow mixed mode FFT
(CFMM-FFT), in Proc. IEEE Int. Symp. Circuits Syst., May 2000, pp. 116
119.

Das könnte Ihnen auch gefallen