2003 VLSI Design Investigation For Low-Cost, Low-Power FFT/IFFT Processing in Advanced VDSL Transceivers

Microelectronics Journal 34 (2003) 133–148 www.elsevier.com/locate/mejo
VLSI design investigation for low-cost, low-power FFT/IFFT processing in advanced VDSL transceivers
S. Saponara a,1, L. Fanucci b,*
a Department of Information Engineering, University of Pisa, Via Diotisalvi 2, I-56122 Pisa, Italy
b IEIIT, National Research Council, Via Diotisalvi 2, I-56122 Pisa, Italy
Received 23 May 2002; accepted 11 October 2002
Abstract
This paper addresses the efficient very large scale integration (VLSI) realization of the direct/inverse fast Fourier transform (FFT/IFFT) for digital subscriber line (DSL) applications. The design of scalable, very high-rate DSL (VDSL) modems calls for large, high-throughput complex FFT computations, while low cost and low power are key constraints for the massive and fast deployment of the xDSL family. Throughout the paper we explore the design space at different levels (algorithm, arithmetic accuracy, architecture, technology) to achieve the best trade-off between processing performance, hardware complexity and power consumption. A programmable VLSI processor based on an FFT/IFFT cascade architecture plus pre/post-processing stages is discussed and characterized from the high-level choices down to gate-level synthesis. Furthermore, low-power design techniques based on clock gating and data-driven switching activity reduction are used to decrease the power consumption by exploiting the correlation of the FFT/IFFT coefficients and the statistics of the input signals. To this aim, both frequency-division and time-division duplex schemes have been considered. The effects of supply voltage scaling and its consequences on circuit performance are examined in detail, as well as the use of different target technologies. Synthesis results for a 0.18 µm CMOS standard-cell technology show that the processor is suitable for real-time modulation and demodulation in scalable full-rate VDSL modems (64–4096 complex FFT, 20 Msample/s) with a power consumption of a few tens of mW. This performance compares favourably with state-of-the-art software implementations and custom VLSI ones. © 2002 Elsevier Science Ltd. All rights reserved.
Keywords: Very large scale integration architectures; Low-power circuits; Fast Fourier transform; Digital subscriber lines; Multicarrier modem
1. Introduction
High data rate communication is increasingly desirable to provide fast internet access and interactive multimedia services for both business and residential customers. To this aim, Digital Subscriber Line (DSL) technologies can deliver data at multi-Mbits/s rates over the unshielded twisted pairs of the wired Public Switched Telephone Network [1–8]. Depending on the type of technology, the possible bit-rates range between 0.5 and 8 Mbits/s over distances of several kilometers (Asymmetric DSL, hereafter referred to as ADSL) and exceed 50 Mbits/s over distances of a few hundred meters (Very high-rate DSL, hereafter referred to as VDSL).
* Corresponding author. Tel.: +39-050-568-668; fax: +39-050-568-522. E-mail addresses: luca.fanucci@iet.unipi.it (L. Fanucci), sergio.saponara@iet.unipi.it (S. Saponara).
1 Tel.: +39-050-568-557; fax: +39-050-568-522.
0026-2692/03/$ - see front matter © 2002 Elsevier Science Ltd. All rights reserved. PII: S0026-2692(02)00142-8
Fig. 1. Power spectral density vs. frequency for ADSL and VDSL services.
To achieve an efficient and reliable data transmission, multi-carrier or discrete multi-tone (DMT) modulation has been selected by the American National Standards Institute (ANSI) and by the European Telecommunications Standards Institute (ETSI) for ADSL, and it is one candidate for VDSL [1–9]. The DMT symbol is the sum of independent quadrature amplitude modulated (QAM) subcarriers spread over the selected transmission bandwidth. Fig. 1 shows the spectral allocation for VDSL and ADSL services. Note that data are loaded differently on the subcarriers depending on the spectral shaping of the channel [1,9–11]. Fig. 2 presents the basic scheme of a DMT modem. According to this scheme, multi-carrier modulation and demodulation are managed by the inverse and direct discrete Fourier transforms (IDFT and DFT), respectively. Different DFT/IDFT sizes correspond to different bandwidths and hence to different achievable bit-rates and target
loop lengths. As an example, Table 1 summarizes the above parameters for a scalable VDSL system [8], considering a subcarrier spacing $\Delta f = 4.3125$ kHz, a 26 American wire gauge cable and a −140 dBm/Hz thermal noise. Obviously, the reported values depend on the selected xDSL scheme and on the channel conditions [1,8,11–15]. Such long DFT computations are rather time and power consuming, while massive and fast deployment of xDSL technologies requires low-cost, low-power and highly integrated modems. Thus the design of special-purpose processors for DFT and IDFT is a key issue for the success of xDSL. Table 2 details the target parameters for the complex DFT/IDFT processor presented in this paper (with reference to Table 1, the DFT-length range has been extended down to 64 to be compliant with multi-carrier applications such as ADSL upstream [1] and wireless high data-rate transmission [11,15]).

Table 1
Properties of a scalable VDSL system
| DFT length | Max. bandwidth (MHz) | Target bit-rate (Mbits/s) | Max. asymmetry upstream/downstream (Mbits/s) | Target length (kft) |
| --- | --- | --- | --- | --- |
| 256 | 1.1 | 1.5 | 0.3/1.2 | 10 |
| 512 | 2.2 | 12 | 5/7 | 6 |
| 1024 | 4.4 | 25 | 5/20 | 4 |
| 2048 | 8.8 | 40 | 10/30 | 3 |
| 4096 | 17.6 | 70 | 25/45 | 2 |

1.1. Previous works
Both software implementations, based on digital signal processors (DSP), and custom implementations, based on very large scale integration (VLSI) architectures, have been proposed in the literature [11,15–20] for DFT/IDFT processing in xDSL applications. A complete software implementation offers the highest flexibility but, as shown in Ref. [16] for a TMS320C62x, a DSP with a computational power of about 2000 MIPS (million instructions per second) does not meet the requirements of a VDSL modem (more than 5000 MIPS), of which roughly 3200 MIPS are used by the DFT and IDFT blocks. Note that a maximum length of 2048 was considered in Ref. [16]. Moreover, a multi-DSP implementation is not suitable for low-complexity and low-power design. For instance, at a cost of 0.5 mW/MIPS [16], real-time DFT and IDFT processing on the C62x architecture (roughly 3200 MIPS) entails a power consumption of about 1.6 W. Therefore a VLSI design approach is mandatory. Several dedicated chips for Time Division Duplex (TDD) and Frequency Division Duplex (FDD) multi-carrier modems have been proposed in recent years [15,17–20]. Although they integrate VLSI macrocells for DFT and IDFT processing, the maximum considered length is 2048 for a maximum throughput of 8.8 Msample/s, which does not exploit the full capability of the VDSL scheme (see Table 1 and Fig. 1).

1.2. Paper outline
In this paper we explore the VLSI design space at different levels (algorithm, arithmetic, architecture, technology) to determine the circuit configuration which achieves the best power-area trade-off while meeting the requirements of advanced xDSL schemes. Then we further reduce the chip power consumption by adopting clock-gating strategies that, based on input signal statistics, turn off some portions of the circuit and reduce the switching activity to the minimum level required for real-time computation. To this aim, both TDD and FDD application cases are considered. After this introduction, Section 2 shows how an N-subcarrier DMT signal can be generated/received by a complex 2N-IDFT/DFT (double-size approach, hereafter referred to as DS) or by a complex N-IDFT/DFT plus proper pre- and post-processing stages (single-size approach, hereafter referred to as SS). Section 3 presents the design of an intellectual property (IP) VLSI macrocell for the fast direct/inverse Fourier transform (FFT/IFFT), exploring the design space at different levels. Starting from this macrocell and following a bottom-up design strategy, Section 4 details both DS and SS schemes for DMT and their performance when implemented in today's CMOS technology. Section 5 deals with the characterization and optimization of the proposed architectures for TDD and FDD xDSL modems. After a comparison with the state of the art in Section 6, some conclusions are drawn in Section 7.

Fig. 2. Scheme of a DMT modem.

2. DMT symbol analysis
2.1. Double-size approach
The transmitted DMT symbol can be modeled as

$$x(t) = \mathrm{Re}\left\{\left[\sum_{k=0}^{N-1} X_k\, e^{j2\pi k \Delta f t}\right] \mathrm{rect}_T(t)\right\} = \frac{1}{2}\left[\sum_{k=0}^{2N-1} C_k\, e^{j2\pi k \Delta f t}\right] \mathrm{rect}_T(t) \qquad (1)$$

where

$$\mathrm{rect}_T(t) = \begin{cases} 1 & \text{when } t \in [0, T], \\ 0 & \text{otherwise,} \end{cases}$$

$N$ being the total number of subcarriers, $\Delta f$ the subcarrier spacing, $T$ the symbol duration and $X_k$ the sequence of $N$ complex data produced by the QAM mapper. The symbol duration is given by $T = T_g + 1/\Delta f$, where $T_g$ takes into account the cyclic retransmission [1,11–14] of part of the IFFT output, i.e. the cyclic suffix and prefix in Fig. 2, which is adopted in multi-carrier modems to maintain the orthogonality of the subcarriers in case of a dispersive channel and to avoid self-echo problems. The sequence $C_k$ is obtained from $X_k$ according to the following expression (hereafter an overline denotes complex conjugation):

$$C_k = \begin{cases} X_k & \text{for } 1 \le k \le N-1, \\ \overline{X}_{2N-k} & \text{for } N+1 \le k \le 2N-1, \\ 0 & \text{for } k = 0, N \end{cases} \qquad (2)$$

i.e. pilot and DC carriers are not bit-loaded. By sampling $x(t)$ with a frequency $f_s = 1/T_s = 2N\Delta f$ we obtain the numerical sequence

$$x_q = \frac{1}{2}\sum_{k=0}^{2N-1} C_k\, e^{j2\pi kq/2N} \quad \text{with } q = 0, \ldots, 2N-1. \qquad (3)$$

The general expression of an M-point IDFT is

$$s_p = \frac{1}{M}\sum_{i=0}^{M-1} S_i\, W_M^{pi} \quad \text{with } p = 0, 1, \ldots, M-1 \qquad (4)$$

where the twiddle factor $W_M$ is given by $e^{j2\pi/M}$. Comparing Eqs. (3) and (4), it is clear that the DMT symbol $x_q$ ($2N$ real values) can be generated by means of a 2N-point complex IDFT of the sequence $C_k$. It follows directly from Eq. (2) that $C_k = \overline{C}_{2N-k}$ (i.e. the coefficients exhibit a Hermitian symmetry), which guarantees that the output of the complex DFT is a real sequence. The operations involved in the DS approach are summarized in Fig. 3.
Table 2
Typical DFT/IDFT parameters for a DMT modem
| Parameter | Value |
| --- | --- |
| Programmability range | 64–4096 DFT-length |
| Target throughput | 20 Msample/s |
| I/O data precision | 12–16 bits |
Fig. 3. DS DMT symbol generation.
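The double-size construction of Eqs. (2) and (3) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names (`idft`, `hermitian_extend`, `ds_dmt_symbol`) are hypothetical, and a direct O(M²) IDFT is used instead of an FFT purely for clarity. It builds $C_k$ from $X_k$ and scales the 2N-point IDFT of Eq. (4) by $N$ to recover the 1/2 prefactor of Eq. (3).

```python
import cmath


def idft(S):
    """M-point IDFT of Eq. (4): s_p = (1/M) * sum_i S_i * exp(j*2*pi*p*i/M)."""
    M = len(S)
    return [sum(S[i] * cmath.exp(2j * cmath.pi * p * i / M) for i in range(M)) / M
            for p in range(M)]


def hermitian_extend(X):
    """Eq. (2): build the 2N-point sequence C_k from the N QAM values X_k."""
    N = len(X)
    C = [0j] * (2 * N)                   # C_0 and C_N stay 0 (DC and pilot not loaded)
    for k in range(1, N):
        C[k] = complex(X[k])                          # C_k = X_k, 1 <= k <= N-1
        C[2 * N - k] = complex(X[k]).conjugate()      # C_m = conj(X_{2N-m}), m = 2N-k
    return C


def ds_dmt_symbol(X):
    """Eq. (3): x_q = (1/2) * sum_k C_k exp(j*2*pi*k*q/(2N)), for q = 0..2N-1.

    The 2N-point IDFT already divides by 2N, so multiplying by N leaves 1/2.
    """
    N = len(X)
    return [N * s for s in idft(hermitian_extend(X))]


# Toy usage with N = 8 (X[0] is ignored because the DC carrier is not loaded).
X = [0, 1 + 1j, -1 + 3j, 3 - 1j, -3 - 3j, 1 - 1j, 3 + 3j, -1 - 1j]
x = ds_dmt_symbol(X)
print(max(abs(v.imag) for v in x))  # imaginary parts vanish up to rounding
```

The imaginary parts vanish only because of the Hermitian symmetry of $C_k$; half of the 2N-point transform input is therefore redundant, which is what the single-size approach exploits.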
2.2. Single-size approach
In Eq. (3), the even-indexed samples ($q = 2p$, with $p = 0, 1, \ldots, N-1$) can be rewritten as

$$\begin{aligned} x_{2p} &= \frac{1}{2}\sum_{k=0}^{2N-1} C_k\, e^{j2\pi k\,2p/2N} = \frac{1}{2}\left[\sum_{k=0}^{N-1} C_k\, e^{j2\pi kp/N} + \sum_{k=0}^{N-1} C_{k+N}\, e^{j2\pi kp/N}\, e^{j2\pi p}\right] \\ &= \frac{1}{2}\sum_{k=0}^{N-1} \left(C_k + C_{k+N}\right) e^{j2\pi kp/N} = \frac{N}{2}\,\mathrm{IDFT}\{C_k + C_{k+N}\} \end{aligned} \qquad (5)$$

while the odd-indexed samples ($q = 2p+1$, with $p = 0, 1, \ldots, N-1$) can be rewritten as

$$\begin{aligned} x_{2p+1} &= \frac{1}{2}\sum_{k=0}^{2N-1} C_k\, e^{j2\pi k(2p+1)/2N} = \frac{1}{2}\left[\sum_{k=0}^{N-1} C_k\, e^{j2\pi kp/N}\, e^{j\pi k/N} + \sum_{k=0}^{N-1} C_{k+N}\, e^{j2\pi kp/N}\, e^{j2\pi p}\, e^{j\pi k/N}\, e^{j\pi}\right] \\ &= \frac{1}{2}\sum_{k=0}^{N-1} \left(C_k - C_{k+N}\right) e^{j2\pi kp/N}\, e^{j\pi k/N} = \frac{N}{2}\,\mathrm{IDFT}\{(C_k - C_{k+N})\, e^{j\pi k/N}\} \end{aligned} \qquad (6)$$

Combining Eqs. (5) and (6) in Eq. (7), we can see how the $2N$ real values can be obtained by means of an N-point complex IDFT plus proper pre-processing of the input sequence $C_k$ and a final multiplexing of the complex output to extract odd and even values. The operations involved in the SS approach are summarized in Fig. 4.

$$\begin{aligned} x_{2p} + j\,x_{2p+1} &= \frac{N}{2}\,\mathrm{IDFT}\{C_k + C_{k+N}\} + j\,\frac{N}{2}\,\mathrm{IDFT}\{(C_k - C_{k+N})\, e^{j\pi k/N}\} \\ &= \frac{N}{2}\,\mathrm{IDFT}\{C_k + C_{k+N} + (C_k - C_{k+N})\, e^{j\pi k/N}\, e^{j\pi/2}\} = \frac{N}{2}\,\mathrm{IDFT}\{y_k\} \end{aligned} \qquad (7)$$

The above analysis can be repeated for the reception of the DMT symbol, achieving similar results. In this case the M-point IDFT must be replaced by a DFT computation, whose general expression is

$$S_i = \sum_{p=0}^{M-1} s_p\, \overline{W}_M^{\,pi} \quad \text{with } i = 0, 1, \ldots, M-1 \qquad (8)$$

$\overline{W}_M = e^{-j2\pi/M}$ being the twiddle factor.

3. FFT/IFFT VLSI architecture
3.1. Algorithm and flow graph
Many FFT algorithms [21] have been proposed since Cooley and Tukey in 1965 [22] to reduce the number of operations (multiplications and sums) of the direct DFT implementation, whose complexity is in the order of $N^2$. Among all the proposed FFT algorithms, radix algorithms appear the most suitable for VLSI implementation due to the reduced number of operations and the high regularity of the structure. The basic concept of a radix FFT algorithm is that, if $N$ is not prime (i.e. $N = r_1 r_2$, where $r_1$ and $r_2$ are integers greater than 1), it is possible to divide the global N-point DFT into $r_1$ DFTs, each one $r_2$ points long, and $r_2$ DFTs, each one $r_1$ points long (multiplications by twiddle factors are also required). When $r_1$ and $r_2$ are not prime, this step can be repeated recursively $n$ times to obtain $N = r_1 r_2 r_3 \cdots r_{n+1}$. When $r_1 = r_2 = \cdots = r_{n+1} = r$ the algorithm is called radix-$r$, otherwise mixed-radix. Typically radix-2, radix-4 or mixed radix-2/4 solutions are preferred to higher radices (such as radix 8, 16 and 32) because they lead to simpler and more regular VLSI architectures [1,11,19–21,26–35,40]. Based on the radix factorization, different data flow graphs and relevant VLSI architectures can be considered for FFT computation [19,20,23–34,40]. Basically they can be grouped in four main classes: Full-Array [24,25], One-Column [26], Cascade [23,29–34] and Recursive [19,20,27,28,40]. Depending on the parallelization level, i.e. the sharing of the hardware resources, they reach a different trade-off between processing time, circuit complexity and power consumption. The Full-Array architectures carry out all the required operations at the same time, which means that all the computational units are directly mapped on the chip. The corresponding circuit complexity becomes unacceptable for large $N$ values in today's CMOS technology.
To reduce this complexity it is possible to exploit hardware resource multiplexing by realizing on-chip only one column of radix cells (One-column architecture), only one row (Cascade architecture), or only a basic cell (Recursive architecture). Table 3 compares the performance of different FFT/IFFT VLSI processors proposed in the literature in recent years and suitable for ADSL and/or VDSL applications. Performance is expressed in terms of circuit complexity, processing speed, power consumption and maximum complex FFT/IFFT size. Note that the data have been collected from the literature, and thus some values are not available (NA) or are expressed in different ways: e.g. in full-custom design the circuit complexity is typically expressed as the number of transistors, while in semi-custom design it is expressed as the number of gates for the logic (1 gate = 4 transistors) plus memory size. The reported values give an idea of the different trade-offs achieved.
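The radix factorization step described in Section 3.1 ($N = r_1 r_2$) can be illustrated with a short sketch. It is a generic one-step Cooley-Tukey decomposition under the DFT sign convention of Eq. (8), not the Bi and Jones flow graph used by the authors; the names `dft` and `split_dft` and the index mapping ($n = r_1 n_2 + n_1$, $k = k_2 + r_2 k_1$) are choices made here for illustration.

```python
import cmath


def dft(x):
    """Direct DFT of Eq. (8), O(n^2), used as the small building block."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def split_dft(x, r1, r2):
    """One decomposition step of an N-point DFT with N = r1 * r2."""
    n = r1 * r2
    assert len(x) == n
    # Step 1: r1 DFTs of length r2, one per residue n1 of the input index.
    inner = [dft([x[r1 * n2 + n1] for n2 in range(r2)]) for n1 in range(r1)]
    # Step 2: twiddle-factor multiplications by exp(-j*2*pi*n1*k2/N).
    tw = [[inner[n1][k2] * cmath.exp(-2j * cmath.pi * n1 * k2 / n)
           for k2 in range(r2)] for n1 in range(r1)]
    # Step 3: r2 DFTs of length r1, one per intermediate frequency k2.
    out = [0j] * n
    for k2 in range(r2):
        col = dft([tw[n1][k2] for n1 in range(r1)])
        for k1 in range(r1):
            out[k2 + r2 * k1] = col[k1]   # output index k = k2 + r2*k1
    return out


# Usage: a 16-point transform as 4 x 4 (radix-4 style) or an 8-point one as 2 x 4.
x16 = [complex(i % 5, -(i % 3)) for i in range(16)]
y = split_dft(x16, 4, 4)
```

Applying the same step recursively to the inner transforms yields the radix-r or mixed-radix algorithms mentioned in the text.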
Fig. 4. (a) SS DMT symbol generation; (b) pre-processing scheme.
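The single-size derivation of Section 2.2 (Eqs. (5)-(7)) can also be verified numerically. This is a minimal sketch under the same assumptions as the definitions in Section 2: the helper names are hypothetical, a direct IDFT stands in for the FFT core, and $e^{j\pi/2}$ is written as a multiplication by $j$. It pre-processes $C_k$ into $y_k$, runs one N-point IDFT and demultiplexes real and imaginary parts into even and odd samples.

```python
import cmath


def idft(S):
    """M-point IDFT of Eq. (4)."""
    M = len(S)
    return [sum(S[i] * cmath.exp(2j * cmath.pi * p * i / M) for i in range(M)) / M
            for p in range(M)]


def hermitian_extend(X):
    """Eq. (2): 2N-point Hermitian-symmetric sequence C_k (C_0 = C_N = 0)."""
    N = len(X)
    C = [0j] * (2 * N)
    for k in range(1, N):
        C[k] = complex(X[k])
        C[2 * N - k] = complex(X[k]).conjugate()
    return C


def ss_dmt_symbol(X):
    """Eq. (7): 2N real samples from a single N-point complex IDFT."""
    N = len(X)
    C = hermitian_extend(X)
    # Pre-processing: y_k = (C_k + C_{k+N}) + j * (C_k - C_{k+N}) * exp(j*pi*k/N)
    y = [(C[k] + C[k + N]) + 1j * (C[k] - C[k + N]) * cmath.exp(1j * cmath.pi * k / N)
         for k in range(N)]
    z = [(N / 2) * s for s in idft(y)]      # z_p = x_{2p} + j * x_{2p+1}
    x = [0.0] * (2 * N)
    for p in range(N):
        x[2 * p] = z[p].real                # even samples, Eq. (5)
        x[2 * p + 1] = z[p].imag            # odd samples, Eq. (6)
    return x


def ds_reference(X):
    """Double-size reference of Eq. (3) for comparison."""
    N = len(X)
    return [(N * s).real for s in idft(hermitian_extend(X))]


X = [0, 1 + 1j, -1 + 3j, 3 - 1j, -3 - 3j, 1 - 1j, 3 + 3j, -1 - 1j]
x_ss = ss_dmt_symbol(X)
x_ds = ds_reference(X)
```

Both routines return the same 2N real samples, but the SS path needs a transform of half the length at the cost of one complex multiplication per input value in the pre-processing stage.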
Table 3
State of the art of FFT/IFFT processors suitable for xDSL applications
| Design | System word length (bit) | Tech. (µm) | Supply (V) | Processing speed | Circuit complexity | Power (mW) | Complex FFT size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Recursive [20] | 18 | 0.35 | 3.3 | 1024 FFT in 80 µs at 100 MHz (12.8 Msample/s) | 150 Kgates, 360 Kbits memory | 3000 | 1024 |
| Cascade [32] | 10 | 0.5 | 3.3 | 8192 FFT in 400 µs at 20 MHz (20 Msample/s) | 1.5 Mtransistors | 600 | 8192 |
| Recursive [27] | 16 | 0.35 | NA | 22.2 Msample/s at 100 MHz | 90 Kgates, 8192 bits ROM, 56 960 bits DPRAM | NA | 512 |
| Recursive [40] | 24 | 0.6 | NA | 4 Msample/s at 40 MHz | 96 Kgates, 11 270 bits ROM, 36 864 bits DPRAM | NA | 512 |
| Recursive [19] | 17 | 0.35 | 3.3 | 256 FFT in 20 µs at 44 MHz (12.8 Msample/s) | NA | NA | 256 |
| One-column [26] | 24 | 0.75 | 3.3 | 1024 FFT in 9.25 µs at 40 MHz for a 16-chip array (110 Msample/s) | 1.2 Mtransistors (19.2 Mtransistors for a 16-chip array) | 7700 [34] | 64 (1024 with a 16-chip array) |
| DSP C62X [19] | 32 | 0.18 | 1.8 | 8.8 Msample/s at 200 MHz | NA | 1600 | 2048 |
| Cascade [34] | 20 | 0.6 | 1.1 | 1024 FFT in 330 µs at 16 MHz (3 Msample/s) | 460 Ktransistors | 10 | 1024 |
| Cascade [34] | 20 | 0.6 | 3.3 | 1024 FFT in 30 µs at 173 MHz (33 Msample/s) | 460 Ktransistors | 845 | 1024 |
With reference to the xDSL specification and today's submicron technology, the One-column approach still features a high circuit complexity (1024 complex processing units for a radix-4 4096-FFT) and poor length flexibility. For instance, the one-column single-chip implementation proposed in Ref. [26] targets a maximum size of 64 complex points. The recursive approach achieves low hardware complexity (just 1 [27,28,40] or 2 [19,20] complex computation units), but it pays a penalty in processing speed and power consumption, since a large FFT requires a great number of iterations and memory read/write operations. Moreover, high clock frequency values do not permit the use of voltage scaling techniques for power saving. Up to now, recursive architectures proposed in the literature reach a maximum length of 1024 [20] or 512 [19,27,40] complex points. On the contrary, the cascade approach offers a good area-time trade-off and a remarkable length flexibility:
FFT solutions ranging from 64 to 8192 complex points have already been proposed in the literature, targeting different application fields [29–34]. In particular, for our design we have selected the Bi and Jones [28] flow graph, properly modified to meet the mixed-radix requirements. Fig. 5 sketches the flow graph for the case example of a radix-4 16-FFT computation. The relevant circuit architecture can be derived by projecting the signal flow graph onto a linear array of computational processors, each made up of a butterfly (BUTT) and a complex multiplier, concatenated with proper commutator (COM) blocks for data reshuffling between successive stages. As shown in Refs. [29,32], with respect to the classic pipeline approach [30] the selected flow graph features the following advantages: (i) saving in the number of adders ($3\log_4 N$ instead of $8\log_4 N$) and multipliers ($\log_4 N - 1$ instead of $3\log_4 N - 3$); (ii) increase of computational efficiency when the processor is interfaced to a continuous word-serial stream; (iii) reduction in the number of delay units ($2N$ instead of $2.5N$).

Fig. 5. Bi and Jones pipeline data flow.

3.2. VLSI architecture design
Starting from the pipeline flow graph proposed in Section 3.1, the specification of a 4096 size (8192 in case of the DS approach) can be achieved by a cascade of 6 (7) stages: 5 (6) radix-4 processing units and a final mixed radix-2/radix-4 one. Between radix-2 and radix-4 factorization, the latter is preferable since it performs better in terms of output precision and requires fewer additions/multiplications [35,36]. Furthermore, the analysis proposed in Ref. [23] for pipeline architectures shows that higher radix values are preferable for low-power design in case of large FFT sizes. As a drawback, radix-4 algorithms are applicable only for $N$ equal to a power of four; so, to implement an FFT with $N$ equal to a power of two, a mixed radix-2/4 algorithm has to be used. The block diagram of the proposed cascade architecture is sketched in Fig. 6 for the case of six stages. Thanks to an extensive use of pipelining, the overall throughput amounts to 1 complex sample/cycle, and so an N-point FFT can be processed within $N$ clock cycles. Moreover, each stage can be dynamically bypassed, thus realizing all the required transform lengths from 64 to 4096. To further reduce energy consumption, a clock-gating strategy is applied to the bypassed stages. Note that, following an IP design reuse approach, the VHDL (very-high speed integrated circuits hardware description language) architecture description is fully parametric in terms of number of stages, input data word length (DWL), output word length (OWL), twiddle factor word length (TWL) and data path word length of each stage (SWL) (see footnote 2). Thus, before logic synthesis on the target CMOS technology, the IP user can select the desired trade-off between processing accuracy, circuit complexity and FFT size. In order to meet the xDSL system specifications, a detailed analysis of the above parameters has been carried out before silicon integration (see Section 3.3).

Fig. 6. VLSI architecture for a six stages case example.

2 These parameters refer to real data and must be doubled for complex data.

According to the data flow detailed in Fig. 5, the processing stages in Fig. 6 are made up of a butterfly, a complex multiplier and a commutator for data reshuffling between successive stages. The generic butterfly (see Fig. 7, where $X_i$ are the input complex samples and $b_i$ are proper control signals with $i = 0, 1, 2, 3$) consists of adder/subtractor blocks and switches for internal shuffling. Fig. 8 details the structure of the commutator, which exhibits six delay blocks whose size $N_d$ varies from one stage to the other according to the rule $N_d = N/4^t$, $t$ being the index of the current stage (C0, C1, C2 in Fig. 8 are control signals). The commutator and the computational units of the last stage are slightly different from the others: (i) to allow for radix-2 or radix-4 according to the selected length, additional switching blocks are added to the schemes of Figs. 7 and 8; (ii) no multipliers are required, according to the flow graph in Section 3.1. Moreover, the last stage features a rounding unit for proper scaling of the final internal word length (SWL6 in Fig. 6) to the OWL. The inner structure of the complex multiplier is presented in Fig. 9. It is made up of four real Booth multipliers [37], one real adder, one real subtractor and four units for data rounding. According to a data-driven power saving approach detailed in Section 4.3, it also comprises hardware resources to avoid multiplications for twiddle factors equal to 1.

Fig. 7. Radix-4 butterfly.

Fig. 8. Radix-4 commutator.

3.2.1. Storage optimizations for data shuffling and twiddle coefficients
To implement the commutator delay units, dual-port RAMs (DPRAM) have been used instead of D-edge-triggered flip-flops (DFF) for their area-power advantage (DFFs are used only for the last stage, where the required delay amounts to 1 clock cycle). It is worth mentioning that this approach is possible in the selected Bi and Jones flow graph due to the remarkable size of the commutator delay blocks. On the contrary, other pipeline architectures, such as the one presented by Ding et al. [31], feature deep memory fragmentation and so are not suitable for RAM technologies. This is paid in terms of area overhead and power consumption, which become particularly critical for long FFT implementations. This is evident from Table 4, where the memory requirements for our architecture and for the Ding architecture are reported together with the relevant area and power consumption. The values in Table 4 are for the 64-point FFT case adopted in Ref. [31], for a 0.6 µm, 3.3 V standard-cell CMOS technology. Following the approach in Ref. [31], a switching activity of 0.4 has been considered for the power consumption analysis. The proposed approach yields an area saving of 34% and a power saving of about 87% as far as data-shuffle storage is concerned. Note that this area-power saving is achieved despite the greater number of storage units required by our architecture with respect to Ref. [31].
As concerns the twiddle coefficients, in each radix stage they are stored in a ROM whose size is $2 \cdot \mathrm{TWL} \cdot M$ bits, $M = N/4^{t-1}$ being the number of cells for the generic stage $t$ in the N-point cascade architecture. For the architecture sketched in Fig. 6, 5456 cells are required. This hardware cost can be reduced by exploiting the symmetry properties of the twiddle coefficients. As depicted in Fig. 10, only the values in region A of the complex plane need to be stored, since the rest of the coefficients can be generated simply by inversion and transposition of the stored values. Hence the amount of ROM for twiddle coefficient storage can be reduced by a factor of 8 according to the circuit of Fig. 11.
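The twiddle ROM sizing rule above can be worked through with a few lines of arithmetic. This is a back-of-the-envelope sketch, with hypothetical function names, assuming that only the stages equipped with a multiplier hold a ROM (five stages for the six-stage architecture of Fig. 6, since the last stage has no multiplier) and that the symmetry reduction is exactly a factor of 8.

```python
def twiddle_cells(n, stages_with_multiplier):
    """Total number of ROM cells: sum over stages t of M = N / 4**(t-1)."""
    return sum(n // 4 ** (t - 1) for t in range(1, stages_with_multiplier + 1))


def twiddle_rom_bits(n, stages_with_multiplier, twl, symmetry_factor=8):
    """ROM bits: 2*TWL bits per complex cell, reduced by the symmetry factor."""
    cells = twiddle_cells(n, stages_with_multiplier)
    return 2 * twl * cells // symmetry_factor


cells = twiddle_cells(4096, 5)               # 4096 + 1024 + 256 + 64 + 16
rom_twl12 = twiddle_rom_bits(4096, 5, 12)    # TWL = 12
rom_twl13 = twiddle_rom_bits(4096, 5, 13)    # TWL = 13
print(cells, rom_twl12, rom_twl13)
```

Under these assumptions the cell count reproduces the 5456 figure quoted in the text, and the resulting bit counts for TWL = 12 and TWL = 13 appear consistent with the ROM column reported later in Table 6.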
Fig. 9. Complex multiplier.
Table 4
Memory requirements for data reshuffling
| Architecture (64-point, 24-bit FFT) | Storage units (bit) | Storage element | Area (mm²) | Power at 50 MHz (mW) |
| --- | --- | --- | --- | --- |
| Proposed | 6048 | DPRAM/DFF | 2.72 | 50 |
| Ref. [31] | 5376 | DFF | 4.1 | 390 |
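The commutator storage figure for the proposed architecture in Table 4 can be reproduced from the delay-block rule $N_d = N/4^t$ given in Section 3.2. The sketch below is only an illustration: the function name is hypothetical, and it assumes six delay blocks per stage, pure radix-4 stages ($N$ a power of four) and complex words of twice the real word length.

```python
def commutator_storage_bits(n, word_bits, delay_blocks=6):
    """Total delay-line storage: six blocks of N_d = N / 4**t words per stage t."""
    bits = 0
    t = 1
    while 4 ** t <= n:
        n_d = n // 4 ** t                          # delay block size of stage t
        bits += delay_blocks * n_d * 2 * word_bits  # complex word = 2 real words
        t += 1
    return bits


# 64-point, 24-bit case of Table 4: stages with N_d = 16, 4 and 1.
print(commutator_storage_bits(64, 24))
```

For the 64-point, 24-bit case this gives 6 x (16 + 4 + 1) complex words of 48 bits, i.e. the 6048 bits listed for the proposed architecture. The same formula shows that the first stage dominates the storage, which is why large, unfragmented delay blocks map well onto DPRAM.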
3.3. Arithmetic accuracy design
A detailed study of the machine number representation and scaling techniques is of primary importance in FFT VLSI design, to cope with the growing size of partial results while limiting both the word size of internal memories/data paths (SWL and TWL) and the loss of processing accuracy. As already shown in Ref. [32], working with a full-accuracy floating-point representation implies too large an internal word length and is not suitable for single-chip implementation of large transform sizes. Thus, three main arithmetic approaches are considered in this section: fixed-point, block-floating point (BFP) and convergent block-floating point (CBFP). The analysis is based on a C program which evaluates their performance with respect to 64-bit floating-point arithmetic. It has been carried out in terms of signal-to-quantization-noise ratio (SQNR), mean square error (MSE) and maximum absolute error (MAE), taking into account different kinds of input signals and different values of the hardware parameters DWL, OWL, SWL, TWL and FFT length.
Fig. 10. Twiddle storage optimization by a factor 8.
In fixed-point arithmetic without scaling, a growth by a factor of 4 (2 bits in the SWL) has to be considered for each radix-4 stage. This way, for a 16-bit input and a 4096-point transform, the SWL6 of the last radix-4 stage becomes 28. However, as detailed in Table 5, the same processing accuracy (defined as the number of true output bits) can be achieved with reduced word lengths by adopting proper scaling techniques. The second approach scales the data after every stage of the pipelined architecture discussed in Section 3.2. This method corresponds to a sort of BFP arithmetic, where data are characterized by a single SWL value for the whole cascade architecture and by an exponent that is incremented at every computational stage moving from the input towards the output. On the contrary, in CBFP arithmetic data are adaptively scaled depending on the maximum amplitude inside the computational stages. To maximize throughput, in CBFP we adopt one exponent for each block of data instead of one exponent for all data in each stage. As is evident from Fig. 5, the computation of the first $N/4$ outputs of the second stage depends only on the first $N/4$ outputs of the first stage. Thus, as soon as the first $N/4$ results of the first stage are computed, the maximum amplitude is evaluated and an exponent is associated with this block of data. Then, the computation of the first $N/4$ results of the second stage can start without waiting for the processing of all the $N$ results of the first stage. The same approach is followed for the other blocks of data in the first stage and is iterated from stage to stage with data blocks of smaller lengths ($N/4$, $N/16$ and so on). By avoiding the scaling of data when it is not necessary, the CBFP approach achieves the same performance as the BFP one with lower SWL and TWL values (i.e. a greater precision with the same values).
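The word-length growth and the per-block scaling idea described above can be sketched as follows. This is an illustrative model only: it is not the amplitude estimator of the CBFP ASIC, the function names are hypothetical, values are plain integers rather than complex samples, and scaling is modelled as an arithmetic right shift.

```python
def fixed_point_swl(dwl, radix4_stages):
    """Unscaled fixed-point growth: 2 extra bits per radix-4 stage."""
    return dwl + 2 * radix4_stages


def block_exponent(block, swl):
    """Smallest right shift that makes the peak magnitude fit in swl bits."""
    limit = (1 << (swl - 1)) - 1          # largest positive two's complement value
    peak = max(abs(v) for v in block)
    shift = 0
    while (peak >> shift) > limit:
        shift += 1
    return shift


def cbfp_scale(values, swl, block_len):
    """One exponent per block of block_len values instead of one per stage."""
    scaled = []
    for start in range(0, len(values), block_len):
        block = values[start:start + block_len]
        e = block_exponent(block, swl)
        scaled.append((e, [v >> e for v in block]))
    return scaled


print(fixed_point_swl(16, 6))                        # last-stage SWL without scaling
print(cbfp_scale([100, -3, 7, 900], swl=8, block_len=2))
```

In the toy call only the second block contains a large value, so only that block is shifted; a single exponent for the whole stage would have shifted the small values of the first block as well, which is the loss of precision that per-block exponents avoid.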
Fig. 11. Circuit for M-point twiddle factor storage based on an M/8-word ROM.

Table 5
Word length sizing for 16-bit output precision, DWL = OWL = 16, N = 4096
| | CBFP | BFP | Fixed-point |
| --- | --- | --- | --- |
| SWL | 17 | 21 | 28 (max) |
| TWL | 12 | 13 | 12 |

As an example, some results in terms of MSE and MAE for different SWL and TWL values are presented in Figs. 12–14. In these figures a random input signal with a peak-to-average ratio (PAR) of roughly 15 dB (typical for VDSL systems) is considered. Error results refer to a normalized I/O data range [−1, 1]. Note that CBFP is more efficient than BFP when the same SWL and TWL values are considered. However, for high values of SWL the processing accuracy of the two arithmetic approaches saturates at the same level (Figs. 13 and 14). This effect is mainly due to the error which characterizes the transition of the output data representation from SWL bits to OWL bits. For instance, in the case example with OWL = 17 in Fig. 14, the MSE saturation level decreases with respect to the OWL = 16 case, thus allowing for a greater processing accuracy. The reported data refer to the FFT case, but similar findings resulted for the IFFT case. Note that either truncation or rounding can be chosen for the scaling from SWL to OWL. As sketched in Fig. 6, the latter approach has been adopted in our architecture. Indeed, as shown by computer simulations, the use of the rounding function improves the arithmetic accuracy by up to 50% for an increase of a few percent in circuit complexity and power consumption (for instance, see the power budget in Fig. 16). Table 5 reports the sizing of the FFT/IFFT processor for the three considered arithmetics to achieve the maximum accuracy permitted by the saturation level fixed by OWL = 16. In such a case the processor allows for a 16-bit output precision with a MAE of roughly $16.5 \times 10^{-6}$, that is to say, in the worst case the error amounts to 0.54 LSB (see footnote 3). Using the square root of the MSE as a measure of the average error, the accuracy loss is less than 0.4 LSB. These error figures are well suited for xDSL applications: according to the FFT accuracy measure proposed in the literature [38], they lead to 16 true output bits for a dynamic range of 96 dB. If required, a greater arithmetic precision can be reached by properly setting the VHDL parameters before silicon integration. As an example, a precision of 17 bits with an average error of roughly 0.4 LSB can be obtained with SWL = 18 and OWL = 17 (CBFP arithmetic in Fig. 13). From Table 5 it clearly emerges that CBFP requires the minimum values for SWL and TWL.
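The conversion between absolute error and LSB units quoted above can be reproduced directly. The sketch below uses hypothetical helper names and assumes the LSB weight $2^{-(n-1)}$ of an n-bit two's complement representation of the normalized range [−1, 1], plus the usual 20·log10(2^n) estimate for the dynamic range of n bits.

```python
import math


def lsb(n_bits):
    """Weight of the least significant bit for n-bit two's complement in [-1, 1]."""
    return 2.0 ** -(n_bits - 1)


def error_in_lsb(abs_error, n_bits):
    """Express an absolute error on the normalized range in LSB units."""
    return abs_error / lsb(n_bits)


def dynamic_range_db(n_bits):
    """Dynamic range of an n-bit word, 20*log10(2**n)."""
    return 20.0 * math.log10(2.0 ** n_bits)


print(round(error_in_lsb(16.5e-6, 16), 2))   # MAE of 16.5e-6 with OWL = 16
print(round(dynamic_range_db(16), 1))        # 16 true bits
```

With OWL = 16 the MAE of roughly $16.5 \times 10^{-6}$ corresponds to about 0.54 LSB, and 16 true bits correspond to a dynamic range of about 96 dB, matching the figures in the text.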
However, the selection of the arithmetic is not straightforward. Indeed, this greater accuracy is achieved at the expense of a greater complexity of each computational stage, which requires an on-chip amplitude estimation unit, and a greater delay in each commutator block to guarantee data synchronization. Thus the VLSI macrocell has been implemented in a 0.18 µm, 6-metal-level standard-cell CMOS technology for the three considered arithmetic approaches. The CBFP ASIC also integrates a module for fast estimation of complex number amplitude, which has been detailed in Ref. [39]. Proper sizing of the VHDL parameters before logic synthesis has been accomplished according to the results of Table 5. Table 6 details the performance of the three ASICs in terms of gate complexity, ROM and DPRAM. From the results of Tables 5 and 6 it is evident that the CBFP approach achieves the best trade-off between processing accuracy and hardware complexity for the target xDSL applications. Therefore CBFP has been selected for the implementation of the DS and SS DMT processors in Sections 4 and 5.

3 In the 2's complement n-bit representation of the range [−1, 1], the least significant bit (LSB) amounts to $2^{-(n-1)}$.
Fig. 12. MSE performance vs. TWL; SWL = 19, DWL = OWL = 16, N = 4096.
Fig. 13. MSE performance vs. SWL; TWL = 14, DWL = 16 and 17, N = 4096.
Fig. 14. MAE performance vs. SWL; TWL = 14, DWL = OWL = 16, N = 4096.

Table 6
CMOS implementation results for the three arithmetic approaches

              Gates      DPRAM (bits)   ROM (bits)
CBFP          82 414     277 528        16 368
BFP           91 480     282 288        17 732
Fixed-point   110 810    272 640        16 368

4. Single-size and double-size processors

4.1. CMOS implementation results

Starting from the FFT/IFFT macrocell detailed in Section 3, and following a bottom-up design strategy, we have implemented both the DS and SS schemes for DMT symbol generation/receipt (the DS and SS architectures have already been sketched in Figs. 3 and 4). Since the new hardware resources introduce quantization/rounding errors, we reapplied to the whole DS and SS schemes the procedure for optimal data path/memory word length sizing addressed in Section 3.3. For OWL = DWL = 16 and full programmability from 64 to 4096 subcarriers, this gave SWL = 18 and TWL = 14 for both the DS and SS cases. After VHDL parameter definition, the two VLSI macrocells were implemented in the 0.18 µm CMOS technology considered above. Table 7 details the relevant performance in terms of circuit complexity, computational performance and power consumption when the processors are used for modulation (IFFT plus pre-processing). Similar results have been achieved when the processors are used for demodulation (FFT plus post-processing). Power figures were extracted from gate-level simulations (within the Synopsys Power Compiler environment) using a random sequence with 15 dB PAR as input stimulus and a power supply of 1.6 V. The clock frequency was set to 20 MHz for the SS processor and 40 MHz for the DS one in order to attain the same target throughput of 20 complex Msample/s (to process an N-subcarrier DMT symbol the SS chip processes N complex values while the DS chip processes 2N complex values).

Table 7
CMOS implementation results for SS and DS processors

      Gates (K)   RAM (bits)   DPRAM (bits)   ROM (bits)   Power (mW)   Max. throughput (Msample/s)
DS    121         393 216      565 080        38 220       130.63       36 at 72 MHz
SS    114         393 216      306 864        33 132       61.6         75 at 75 MHz

These results demonstrate that the SS approach, made up of a pre/post-processor plus a 4096-IFFT/FFT core (five radix-4 stages and a final mixed-radix one), is more efficient than the DS approach, which is made up of an 8192-IFFT/FFT core (six radix-4 stages and a final mixed-radix one). This can be explained considering that (i) the pre/post-processor (one complex multiplier plus two adders and one subtractor) is a bit less complex than a radix-4 stage; (ii) the delay unit size in the commutators of all radix stages increases linearly with the transform length N (Section 3.2); (iii) the SS chip achieves the target throughput with a clock frequency reduced by a factor of 2 with respect to the DS chip.

4.2. I/O memory management

The output of IFFT/FFT radix processors is in digit-reversed order [1,21] (see Fig. 5 as an example) and hence buffer resources are needed for symbol reordering. Moreover, in a multi-carrier modem we encounter other buffering issues due to pre/post-processing of the input DMT symbol (i.e. generation of the 2N-point sequence C_k from the N-point input X_k during modulation), insertion/removal of the cyclic prefix and suffix, and removal of pilot and DC carriers. An efficient solution for integrating the macrocell in a single-chip xDSL modem is the storage unit sketched in Fig. 15: it is made up of three single RAMs, each with a size equal to the number of subcarriers. During the processing of the generic ith DMT symbol (Fig. 15(a)) the processing core reads the input data from the memory IN and fills the memory OUT with the Fourier transformed results of the (i - D)th symbol (footnote 4). Concurrently the (i + 1)th DMT symbol is loaded into the memory PREFETCH. This way, at the end of the ith symbol processing the memory PREFETCH contains the next symbol to process and so it becomes the new memory IN (Fig. 15(b)). Conversely, the current memory IN does not contain useful data for the (i + 1)th symbol processing and it becomes the new memory OUT. Finally, memory OUT contains the Fourier transformed results of the (i - D)th symbol, which can be read by the other units of the modem for successive processing steps (pulse shaping/windowing, insertion/removal of cyclic suffix and prefix and so on). Note that suffix and prefix extensions are implemented by cyclic retransmission of part of the IFFT output (i.e. multi-read of part of the memory OUT).
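The storage scheme can be illustrated with a short behavioural sketch in Python. It models only the role rotation of the three memories and the cyclic prefix/suffix extension by repeated reads of memory OUT. The names StorageUnit, rotate and read_extended are hypothetical, the pipeline latency D is ignored, and the prefix is assumed to be a copy of the tail of the symbol and the suffix a copy of its head, as is usual for DMT.

```python
from typing import List


class StorageUnit:
    """Three RAMs of n words each, addressed by their current role (illustrative model)."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.prefetch: List[complex] = [0j] * n  # being loaded with the next symbol
        self.inp: List[complex] = [0j] * n       # read by the processing core
        self.out: List[complex] = [0j] * n       # written with transformed results

    def rotate(self) -> None:
        """End of a symbol: PREFETCH -> IN, IN -> OUT, OUT -> PREFETCH."""
        self.inp, self.out, self.prefetch = self.prefetch, self.inp, self.out


def read_extended(out_mem: List[complex], prefix_len: int, suffix_len: int) -> List[complex]:
    """Read memory OUT cyclically so that prefix and suffix need no extra storage."""
    n = len(out_mem)
    if not (0 <= prefix_len <= n and 0 <= suffix_len <= n):
        raise ValueError("extension lengths must lie between 0 and the symbol length")
    total = prefix_len + n + suffix_len
    # The address sequence starts prefix_len words before the end of the symbol
    # and wraps around, so some words are simply read more than once.
    return [out_mem[(k - prefix_len) % n] for k in range(total)]
```

Only the roles are exchanged in this model, never the stored data, which is why three memories of one symbol each are enough; after three rotations every memory is back in its initial role.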
As soon as the coefficients of the (i - D)th symbol have been extended, memory OUT can be used to prefetch the (i + 2)th symbol and thus it becomes the new memory PREFETCH. In summary, at the end of each symbol processing the memories PREFETCH, IN and OUT become the new memories IN, OUT and PREFETCH, respectively (compare Fig. 15(a) and (b)). In the target application of 4096 carriers and DWL = OWL = 16 the storage unit complexity amounts to the 393 216 bits already reported in Table 7.

Footnote 4. D represents the latency of the pipeline processor expressed in number of DMT symbols. For the SS architecture selected in Section 4.1, D amounts to 2.

Fig. 15. Storage unit management. (a) ith DMT symbol processing; (b) (i + 1)th DMT symbol processing.

4.3. Power optimizations

In the SS processor the maximum achievable throughput of 75 Msample/s at 75 MHz is far beyond the target requirement of 20 Msample/s at 20 MHz. The clock rate reduction, and the consequently relaxed circuit speed requirement, can be exploited for power saving. This can be achieved by scaling down the supply voltage and/or by using a low-speed, low-leakage version of the considered standard-cell library when available. For instance, the adopted 0.18 µm CMOS technology provides two different versions: a Device High Speed (DHS) version optimized for low circuit propagation delay and a Device Low Leakage (DLL) version optimized for low leakage power consumption. The optimization is mainly obtained by using two different threshold voltages, which are scaled down by about 20% going from the DLL library to the DHS one. Table 8 presents the power consumption and maximum frequency results achieved by implementing the SS processor on the different technology versions with the different power supplies permitted by the selected CMOS technology. For power consumption estimation the clock frequency has been set to 20 MHz and a random sequence with 15 dB PAR has been used as input signal.

Table 8
Power consumption and maximum frequency for different supply voltages and technology library versions

Library   Supply voltage (V)   Power (mW)   Max frequency (MHz)
DHS       1.2                  28.65        53
DHS       1.6                  61.6         75
DHS       1.95                 100.98       213
DLL       1.2                  26.35        46
DLL       1.6                  49.58        66
DLL       1.95                 73.32        190

The results of Table 8 show that the processor can cover a wide range of applications. By using a 1.95 V supply voltage and the DHS library the SS processor achieves a computational power greater than 200 Msample/s. For the target xDSL application (20 Msample/s) the best solution is the DLL library with the 1.2 V supply voltage, which reduces the power consumption to roughly 27 mW. Fig. 16 details the power consumption budget of the 4096-carrier SS DMT processor. From the analysis in Fig. 16 it is clear that 50% of the power consumption is due to the activity of the complex multipliers. However, in the considered algorithm a large part of the overall multiplier activity can be saved. Fig. 17 shows that, among the M twiddle coefficients of the generic radix stage (16 in the example reported in the figure), 0.25M + 3 (7 in the example reported in the figure) are equal to 1. As a result, in a 64-point transform up to 40% of the twiddle factors are equal to 1, while in a 4096-point transform this percentage is 30%. Therefore a considerable power saving (roughly 15% for VDSL applications, i.e. about 30% of the multiplier activity, which accounts for 50% of the total) can be achieved by gating the clock of the complex multipliers whenever a (1,0) twiddle coefficient has to be processed. In such a case the signal C in Fig. 9 is set to 1, thus propagating the input data towards the output. The additional buffer unit guarantees the same delay as the multiplier for synchronization with the successive stages. Power simulation results demonstrate that the final SS chip with the multiplier clock-gating strategy dissipates about 23 mW for VDSL applications (input signal at 20 Msample/s, 15 dB PAR, 4096 carriers, 16-bit I/O) in the considered 0.18 µm, 1.2 V CMOS technology.
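The count of unity twiddle factors quoted above can be checked with a short Python sketch. It assumes a pure radix-4 decomposition in which a stage of size M uses the twiddle factors W_M^(k·n) with k = 0..3 and n = 0..M/4 - 1, and it assumes that the quoted percentages are averages over the stages that are followed by a multiplier (the last radix-4 stage needs none). Both assumptions, and the function names, are illustrative and are not taken from the architecture description.

```python
from fractions import Fraction


def unity_twiddles(m: int) -> int:
    """Count the twiddle factors W_m^(k*n) equal to 1 in a radix-4 stage of size m."""
    # W_m^(k*n) equals 1 exactly when the exponent k*n is a multiple of m.
    return sum(1 for k in range(4) for n in range(m // 4) if (k * n) % m == 0)


def unity_fraction(n_points: int) -> Fraction:
    """Average share of unity twiddle factors over the multiplier stages."""
    m = n_points
    while m > 1 and m % 4 == 0:
        m //= 4
    if m != 1 or n_points < 16:
        raise ValueError("n_points must be a power of 4 not smaller than 16")
    # Stage sizes n_points, n_points/4, ..., 16; the final size-4 stage has no multiplier.
    sizes = []
    m = n_points
    while m > 4:
        sizes.append(m)
        m //= 4
    return sum(Fraction(unity_twiddles(s), s) for s in sizes) / len(sizes)


if __name__ == "__main__":
    print(unity_twiddles(16))
    print(float(unity_fraction(64)), float(unity_fraction(4096)))
```

Under these assumptions the sketch gives 7 unity factors for M = 16, about 37% for a 64-point transform and about 30% for a 4096-point transform, in line with the figures quoted in the text.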
Fig. 16. Power consumption budget of the SS DMT processor.
Fig. 17. Twiddle coefficients equal to 1 in the generic computational stage.

5. FFT/IFFT processing in TDD and FDD modem

The duplex method determines how the overall throughput is shared between the downstream and upstream directions. In TDD schemes upstream and downstream are partitioned in time and the entire frequency band can be used in both directions in separate time epochs. The asymmetry ratio is defined by the ratio of time used for upstream and downstream transmissions. In FDD the available spectrum is divided into distinct frequency bands, where each band is used exclusively for either upstream or downstream. The asymmetry ratio is determined by the width and location of the spectrum bands allocated to each direction. Both TDD and FDD xDSL schemes have been proposed in the literature [1,13–19].

In TDD schemes a single FFT/IFFT processor can be used for DMT symbol generation/receipt, while in FDD schemes two separate processors are required for FFT and IFFT since upstream and downstream transmissions are allowed at the same time. The final circuit complexity is nearly twice that for TDD.

However, a further power saving can be obtained by exploiting the fact that, for typical xDSL applications, many carriers in both directions are zero. For instance, if we consider the frequency plan foreseen by the ETSI standard (Table 9) [6,7] with a 4096-carrier DMT and a tone spacing of 4.3125 kHz, then about 41% of the carriers can be used for upstream, 26% for downstream and more than 30% are not exploited. For a 1024-carrier DMT in the same conditions, 13% of the carriers can be used for upstream, 69% for downstream and more than 18% are not exploited. The reported data also take into account the radio amateur bands (Fig. 1). This means that for the two example cases 59–87% of the IFFT inputs (upstream) are zero and 31–74% of the FFT inputs (downstream) are zero.

Table 9
ETSI frequency plan and relevant amateur radio bands

ETSI frequency plan for VDSL
  Downstream (MHz)           0.138–3.000, 5.100–7.050
  Upstream (MHz)             3.00–5.10, 7.05–12.00
Amateur radio bands (MHz)    1.81–2.00, 3.50–4.00, 7.00–7.30, 10.10–10.15

Since CMOS circuits dissipate no dynamic power when they are not switching, power can be saved by a clock-gating strategy which, exploiting the large number of zero inputs, turns off some portions of the processors and reduces the switching activity to the minimum level required for the computation. For the IFFT processor we have added a unit that checks for blocks of null data at the input of the pre-processing stage and of the first radix-4 stage. If the group of data to process is null, the clock of the computational units (butterfly plus multiplier) is gated and a zero is propagated towards the output. The butterfly-style operation absorbs the null coefficients early in the signal path and hence more nonzero coefficients are present on average in the successive radix stages. For these stages the clock-gating approach is not applied. As a matter of fact, the implementation of the clock-gating technique involves an overhead in terms of silicon area and capacitive load, and so it has to be applied only where a significant reduction in power consumption can be obtained. For the FFT processor the same approach has been applied to the first and second radix-4 stages. As proved by gate-level simulations for the above example cases, this approach allows for a power saving of up to 23% for the IFFT and up to 10% for the FFT with a negligible circuit overhead.
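The null-block gating can be mimicked at the behavioural level with the Python sketch below. It is not the gate-level mechanism: it only shows that skipping a radix-4 butterfly whose four inputs are all zero leaves the result unchanged, and it counts how many butterflies could be skipped. The decimation-in-frequency input grouping (x[i], x[i + N/4], x[i + N/2], x[i + 3N/4]), the forward-transform sign convention and all names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def radix4_butterfly(a: complex, b: complex, c: complex, d: complex) -> Tuple[complex, ...]:
    """4-point DFT of (a, b, c, d), i.e. a radix-4 butterfly without twiddle factors."""
    return (
        a + b + c + d,
        a - 1j * b - c + 1j * d,
        a - b + c - d,
        a + 1j * b - c - 1j * d,
    )


def gated_first_stage(x: Sequence[complex]) -> Tuple[List[complex], int]:
    """Apply the first radix-4 stage, skipping butterflies whose inputs are all zero.

    Returns the stage output and the number of skipped (clock-gated) butterflies.
    """
    n = len(x)
    if n == 0 or n % 4 != 0:
        raise ValueError("input length must be a positive multiple of 4")
    q = n // 4
    out: List[complex] = [0j] * n
    gated = 0
    for i in range(q):
        group = (x[i], x[i + q], x[i + 2 * q], x[i + 3 * q])
        if all(v == 0 for v in group):
            gated += 1  # model of the gated clock: zeros are propagated unchanged
            continue
        for k, y in enumerate(radix4_butterfly(*group)):
            out[i + k * q] = y
    return out, gated
```

As a usage example, a 16-point input whose only nonzero carriers are x[0] and x[4] activates one butterfly out of four, so three butterflies are counted as gated while the output equals that of the ungated stage.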
6. Comparison vs. state-of-art

Comparing the performance of the SS processor (presented in Sections 3–5 and summarized in Tables 7 and 8) with the state of the art of FFT/IFFT VLSI architectures for xDSL applications (reviewed in Section 3.1 and Table 3), the following considerations can be made:

1. The proposed architecture allows for full-rate VDSL applications (4096 carriers, 20 MHz bandwidth) with a good trade-off between power consumption and circuit complexity, which makes it suitable for low-cost and low-power implementation in a single-chip modem. In contrast, most known solutions support maximum FFT sizes equal to or lower than 1024 [19,20,26–28,33,34,40] and/or achieve poor computational power [19,34,40], being adequate for ADSL but not exploiting the capability of advanced VDSL schemes. Moreover, our VLSI architecture is not only an FFT/IFFT macrocell but also comprises hardware resources (logic and memory) for proper pre- and post-processing of the DMT symbols.
2. We exploit the large number of zero carriers which characterizes FDD xDSL systems to further reduce the switching power consumption according to a data-driven clock-gating approach.
3. With respect to Ref. [32], which attains similar performance in terms of size and throughput and exhibits similar architectural/algorithmic solutions, the proposed approach reduces power by an order of magnitude. This is due not only to the use of a more recent CMOS technology but also to optimization strategies such as voltage scaling and clock gating. In particular, clock gating has been adopted at three different levels: (i) to power down the radix stages not used when sizes lower than 4096 are required; (ii) to reduce by 30% the switching activity of the multipliers by discarding operations with twiddle factors equal to 1; (iii) to reduce the switching activity of the first two stages in FDD applications by exploiting the large number of zero carriers. Moreover, Ref. [32] adopts a CBFP arithmetic with an internal word length of 10 bits which, as proved in Section 3, is not suitable for the accuracy requirements of advanced xDSL systems. As already noted above, our architecture contains hardware resources for DMT symbol processing which are not present in Ref. [32].
4. Finally, the circuit has been designed as a parametric IP cell customizable by the user to achieve the desired trade-off between processing performance and hardware complexity. Therefore greater flexibility across different application cases and implementation technologies is achieved with respect to full-custom VLSI designs.

7. Conclusions

The design of a flexible, low-cost and low-power FFT/IFFT processor for scalable and high-rate xDSL transceivers has been addressed in this paper. To this aim the VLSI design space has been explored at different levels (algorithm, arithmetic accuracy, architecture, technology) throughout the paper. Different possible solutions have been considered for symbol modulation/demodulation (SS and DS schemes), data flow graph (full-array, one-column, cascade, recursive) and internal processing arithmetic (fixed-point, BFP, CBFP). Logic synthesis results prove that the best solution is a SS programmable processor based
on a radix-2/4 FFT/IFFT cascade architecture with CBFP arithmetic plus proper pre/post-processing stages. Implemented in a 0.18 µm standard-cell CMOS technology, it allows for FFT sizes from 64 to 4096 and 16-bit output precision with a maximum throughput of 75 complex Msample/s. Furthermore, low-power design techniques based on clock gating and data-driven switching activity reduction are used to decrease the power consumption by exploiting the correlation of the twiddle coefficients and the large number of zero carriers in FDD transmission schemes. The effects of supply voltage scaling and its consequences on circuit performance are examined in detail, as well as the use of different target technologies (low-leakage and high-speed). Final implementation results in the 0.18 µm CMOS technology prove that the SS processor allows for full-rate VDSL applications (4096 carriers, 20 complex Msample/s) with a good trade-off between circuit complexity and power consumption, which amounts to a few tens of mW. In contrast, most known solutions, based on DSPs or dedicated custom architectures, support maximum FFT sizes equal to or lower than 1024 and/or achieve poor computational power, being adequate for ADSL but not exploiting the capability of advanced VDSL schemes. Finally, the circuit has been designed as a parametric IP cell customizable by the user to achieve the desired trade-off between processing performance and hardware complexity. This way greater flexibility across different application cases and implementation technologies is achieved with respect to full-custom VLSI designs.

Acknowledgements

This work has been carried out in the framework of the MEDEA project INCA: Integrated Network Copper Access. Discussions with C. Del Toso and C. Trang of ST Microelectronics, France and L. Serafini of Pisa University are gratefully acknowledged.

References
[1] J. Bingham, ADSL, VDSL and Multicarrier Modulation, Wiley, New York, 2000.
[2] ANSI standard T1.413, issue 2, Asymmetric Digital Subscriber Line (ADSL), 1998.
[3] VDSL Alliance SDMT VDSL draft standard proposal, ETSI STC/TM6, April 1998.
[4] ANSI T1E1.4, Very-high-speed Digital Subscriber Line (VDSL) metallic interface: functional requirements and common specifications, May 1998.
[5] ANSI T1E1.4/00-011, Draft specification, Very-high-speed Digital Subscriber Line (VDSL) metallic interface. Part 2. Technical specification of multi-carrier modulation (MCM) transceiver.
[6] ETSI TM6, Transmission and multiplexing; Access transmission systems on metallic access cables; Very-high-speed Digital Subscriber Line (VDSL). Part 1. Functional requirements, 1999.
[7] ETSI TM6, Transmission and multiplexing; Access transmission systems on metallic access cables; Very-high-speed Digital Subscriber Line (VDSL). Part 2. Transceiver specification, 2000.
[8] C. Del-Toso, B. Rezvani, M. Beck, J. Chow, G. Jin, S. Oelcer, J. Cioffi, J. Gustafsson, Scalable multi-mode VDSL (DMT option) for EFMCu, IEEE 802.3ah Ethernet in the First Mile Task Force, November 2001, web site: http://grouper.ieee.org/groups/802/3/efm/
[9] J. Bingham, Multicarrier modulation for data transmission: an idea whose time has come, IEEE Commun. Mag. (1990) 5–14.
[10] P. Chow, J. Cioffi, J. Bingham, A practical discrete multitone transceiver loading algorithm for data transmission over spectrally shaped channel, IEEE Trans. Commun. 43 (6) (1995) 773–775.
[11] N. Weste, D. Skellen, VLSI for OFDM, IEEE Commun. Mag. (1998) 127–131.
[12] K. Sistanizadeh, P. Chow, J. Cioffi, Multi-tone transmission for asymmetric digital subscriber lines (ADSL), Proc. IEEE Int. Conf. Commun. 2 (1993) 756–760.
[13] F. Sjoberg, M. Isaksson, R. Nilsson, P. Oding, S. Wilson, P. Borjesson, Zipper: a duplex method for VDSL based on DMT, IEEE Trans. Commun. 47 (8) (1999) 1245–1252.
[14] F. Sjoberg, R. Nilsson, M. Isaksson, P. Deutgen, P. Oding, P. Borjesson, Performance evaluation of the Zipper duplex method, Proc. IEEE Int. Conf. Commun. 2 (1998) 1035–1039.
[15] W. Eberle, V. Derudder, G. Vanwijnsberghe, M. Vergara, L. Deniere, L. Van der Perre, M. Engels, I. Bolsen, H. De Man, 80-Mb/s QPSK and 72-Mb/s 64-QAM flexible and scalable digital OFDM transceiver ASICs for Wireless Local Area Networks in the 5-GHz band, IEEE J. Solid State Circuits 36 (11) (2001) 1829–1838.
[16] B. Wiese, J. Chow, Programmable implementations of xDSL transceiver systems, IEEE Commun. Mag. (2000) 114–119.
[17] L. Kiss, E. Hanssens, K. Adriaensen, M. Huysmans, C. Gendarme, F. Van Beylen, H. Van de Weghe, A customizable DSP for DMT-based ADSL modem, Proc. IEEE 24th Eur. Solid State Circuits Conf. (1998) 349–353.
[18] D. Mestdagh, R. Nilsson, M. Isaksson, P. Oding, Zipper VDSL: a solution for robust duplex communication over telephone lines, IEEE Commun. Mag. (2000) 90–96.
[19] D. Veithen, P. Spruyf, T. Pollet, M. Peeters, S. Braet, O. Van de Wiel, H. Van de Weghe, A 70 Mb/s variable-rate DMT-based modem for VDSL, Proc. IEEE Int. Solid State Circuits Conf. (1999) 248–249.
[20] M. Rudberg, M. Sanberg, K. Ekholm, Design and implementation of an FFT processor for VDSL, Proc. IEEE Asia-Pacific Conf. Circuits Syst. (1998) 611–614.
[21] P. Duhamel, M. Vetterli, Fast Fourier transform: a tutorial review and a state of the art, Signal Process. 19 (1990) 259–299.
[22] J. Cooley, J. Tukey, An algorithm for machine computation of complex Fourier series, Math. Comput. 19 (1965) 297–301.
[23] S. Hong, S. Kim, M. Papaefthymiou, W. Stark, Power-complexity analysis of pipelined VLSI FFT architectures for low energy wireless communication applications, Proc. IEEE MWSCAS (1999) 313–316.
[24] N. Murphy, M. Swamy, On the real-time computation of DFT and DCT through systolic architecture, IEEE Trans. Signal Process. 42 (1994) 988–991.
[25] L. Chang, M.-Y. Wu, A new systolic array for discrete Fourier transform, IEEE Trans. Acoust. Speech Signal Process. 36 (1988) 1165–1167.
[26] T. Chen, G. Sunada, J. Jin, COBRA: a 100-MOPS single-chip programmable and expandable FFT, IEEE Trans. VLSI Syst. 7 (2) (1999) 174–182.
[27] C. Wang, C. Chang, A new memory based FFT processor for VDSL transceivers, Proc. IEEE Int. Symp. Circuits Syst., ISCAS 4 (2001) 670–673.
[28] E. Cetin, R. Morling, I. Kale, An integrated 256-point complex FFT processor for real-time spectrum analysis and measurement, Proc. IEEE Conf. Instrum. Meas. Technol. 1 (1997) 96–101.
[29] G. Bi, E.V. Jones, A pipelined FFT processor for word-sequential data, IEEE Trans. Acoust. Speech Signal Process. 37 (12) (1989) 1982–1985.
[30] L. Rabiner, B. Gold, Theory and Application of Digital Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, 1975, chapter 10.
[31] T. Ding, J. McCanny, Rapid design of application specific FFT cores, IEEE Trans. Signal Process. 47 (5) (1999) 1371–1381.
[32] E. Bidet, D. Castelain, C. Joanblanq, P. Senn, A fast single-chip implementation of 8192 complex point FFT, IEEE J. Solid State Circuits 30 (3) (1995) 300–304.
[33] S. He, M. Torkelson, Designing pipeline FFT processor for OFDM (de)modulation, Proc. URSI Int. Symp. Signals Syst. Electron. (1998) 257–262.
[34] B.M. Baas, A 9.5 mW 330 µs 1024-point FFT processor, Proc. IEEE Custom Integr. Circuits Conf. (1998) 127–130.
[35] V. Prakash, V. Rao, Fixed point error analysis of radix-4 FFT, Signal Process. 3 (1981) 123–133.
[36] T. Thong, B. Liu, A fixed-point fast Fourier transform error analysis, IEEE Trans. Audio Electroacoust. 17 (1969) 151–157.
[37] N. Weste, K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, Reading, MA, 1985.
[38] W. Hui, T. Ding, J. McCanny, R. Woods, Error analysis of FFT architectures for digital video applications, Proc. IEEE ICECS (1996) 820–823.
[39] L. Fanucci, M. Rovini, A low-complexity and high-resolution algorithm for the magnitude approximation of complex numbers, IEICE Trans. Fundam. E85-A (7) (2002) 651–654.
[40] C. Chang, C. Wang, Y. Chang, Efficient VLSI architectures for fast computation of the discrete Fourier transform and its inverse, IEEE Trans. Signal Process. 48 (11) (2000) 3206–3215.
