MSc THESIS
Microelectronics Department of Electrical Engineering Faculty of Electrical Engineering, Mathematics and Computer Science
Abstract
An MP3 audio decoder is designed as a System-on-a-Chip using hardware/software codesign techniques. The hardware architecture is built on the LEON SoC platform, which contains an open-source SPARC V8-compatible processor and an AMBA bus. A pre-designed flash card interface hardware core and an audio driver module are added to this system. After a performance analysis of the decoder, the MP3 decoder is partitioned into a software part and a hardware part. The hardware parts, a 36-point IDCT decoder and a 32-point DCT decoder, are designed as two accelerators connected to the AMBA bus. The final MP3 decoder decodes an MP3 stream with the help of the 36-point IDCT accelerator and the 32-point DCT accelerator.
Acknowledgements
It has been a wonderful experience to study at Delft University of Technology (TU Delft) for two years, and I am lucky to have done my Master's thesis under Dr. René van Leuken in the Circuits and Systems (CAS) group. During this period I met many intelligent and enthusiastic people; I would like to express my respect and thanks to all of them. Without their help, I could not have completed my thesis.
First of all, I would like to express my greatest appreciation to my supervisor, Dr. René van Leuken. He is an earnest and professional person. Every time I got stuck, he helped me analyze the problem and find a solution. His working manner and his attitude to the work have left a deep impression on me.
I also appreciate the help of Mr. Huib Lincklaen Arriëns. The scheduling tool he developed was very helpful to my thesis.
Finally, special thanks go to my wife and my parents. Their endless love and support are the motivation for my progress. They understand my career and have endured a long separation from me. I feel that they are the most lovely and trustworthy people in the world.
Jianwei Wang
Delft, The Netherlands
July 3, 2005
Contents
Contents
Figure list
Table list
Chapter 1 Introduction
1.1 Motivation
1.2 Definition of the Work
1.3 The Major Challenges
1.4 Methodology of HW/SW Codesign
1.5 Organization of the Thesis
Chapter 2 MP3 Decoder
2.1 Introduction
2.2 Bit Stream Decoding
2.3 Inverse Quantization
2.4 Processing Data
2.4.1 Stereo Processing
2.4.2 Reordering & Alias Reduction
2.4.3 IMDCT
2.4.4 Polyphase Synthesis Filter Bank
2.5 Conclusion
Chapter 3 LEON2
3.1 Background of LEON2
3.2 AMBA-2.0
3.2.1 The Advanced High-performance Bus (AHB)
3.2.2 The Advanced Peripheral Bus (APB)
3.3 Floating-Point Unit and Co-processor
3.4 Interrupt
3.4.1 Interrupt Controller
3.4.2 Connect a C Function to Interrupt Source
3.5 VHDL Model Architecture of LEON2
3.6 Cross Compile
3.7 Conclusion
Chapter 4 Adding New Modules to LEON2 & Converting MP3 Decoder to Fixed Point
4.1 Introduction
4.2 Method of Adding Slave Modules to AHB
4.3 Method of Adding Slave Modules to APB
4.4 Access the New Modules
4.5 Adding the Flash Card Module to AHB
4.6 Driver Module of the Audio A/D Converter
4.7 Adding the Audio Module to APB
4.8 Convert MP3 Decoder from Floating-Point Version to Fixed-Point Version
4.8.1 Single Precision Floating-Point
4.8.2 Conversion to 32-bit Fixed-Point
4.9 Conclusion
Chapter 5 Performance Analysis and HW/SW Partitioning
5.1 Performance Analysis
5.2 HW/SW Partitioning
5.3 Conclusion
Chapter 6 Fast Algorithm of DCT and Inverse DCT
6.1 Introduction
6.2 Fast Algorithm of 32-point DCT
6.3 Fast Algorithm of 36-point IDCT
6.4 Conclusion
Chapter 7 Implement of DCT and IDCT Accelerators
7.1 The Accelerator Architecture
7.2 Scheduling and FSM Design
7.3 FSM Controller Design
7.4 AHB Slave Interface Design
7.5 Connect the Accelerators to AHB
7.6 Conclusion
Chapter 8 Test Results and Conclusion
8.1 Test Results
8.2 Conclusion
8.3 Summary
8.4 Future Research
Bibliography
Figure list
Figure 1 the design flow of the general codesign approach
Figure 2 the first level function block diagram of MP3 decoder
Figure 3 the second level function block diagram of MP3 decoder
Figure 4 Format of MP3 frame, granules, subband blocks and frequency lines [2]
Figure 5 MP3 frame format
Figure 6 Decoding of bitstream block diagram [2]
Figure 7 inverse quantization block diagram
Figure 8 processing data block diagram
Figure 9 Poly Phase Synthesis filter bank [2]
Figure 10 function block diagram of MP3 decoder
Figure 11 function block diagram of LEON2 [3]
Figure 12 structure of AHB [4]
Figure 13 APB structure [4]
Figure 14 interrupt controller of LEON2
Figure 15 MP3 decoder with flash card module and A/D converter
Figure 16 LEON2 with flashcard module and audio module
Figure 17 port definition of AHB arbiter
Figure 18 an example of slave module on AHB on position 4
Figure 19 AHB address decoding
Figure 20 variable 'ahbrange_con
Figure 21 example of assigning an interrupt signal to interrupt controller
Figure 22 APB address decoding
Figure 23 instantiate the flash card module in mcore.vhd
Figure 24 function block diagram of audio driver module
Figure 25 signals of audio module added to leon port
Figure 26 fixed-point format
Figure 27 profile of the soft MP3 decoder
Figure 28 execution time from start point
Figure 29 HW/SW partitioned function block diagram of MP3 decoder
Figure 30 diagram of 2-point, 4-point and 8-point DCT [7]
Figure 31 function block of Lee's algorithm [8]
Figure 32 the structure diagram of DCT/IDCT modules
Figure 33 scheduling tool
Figure 34 IDCT36 FSM controller and DCT32 FSM controller
Figure 35 function block diagram of LEON2 with two accelerators
Figure 36 compare two DCT accelerators with software
Figure 37 execution time (cycle) distribution of (I)DCT accelerator module
Table list
Table 1 map from address bus to position number
Table 2 the port address of audio module
Table 3 control register of audio module
Table 4 the number of computations of two algorithms
Table 5 number of FSM states with different number of adders and multipliers to IDCT36
Table 6 number of FSM states with different number of adders and multipliers to DCT32
Table 7 address space and interrupt of the two accelerators
Table 8 execution time IDCT36 C function and accelerator on LEON2
Table 9 execution time IDCT32 C function and accelerator on LEON2
Table 10 final area report
Table 11 execution time of the IDCT36 without Lee's algorithm
Chapter 1
Introduction
1.1 Motivation
Today's world is full of different types of embedded systems and processors. An embedded system is a special-purpose computer built into a device. Embedded systems come in many types and sizes, ranging from a single microprocessor to a complex System-on-a-Chip. Embedded systems usually consist of hardware and software. The hardware may be a processor, an ASIC or memory, and is used for performance. The software is used to provide features and flexibility. In many applications, embedded systems just act as a system controller.
Current generations of silicon process technology allow designers to integrate a large number of features onto a single IC, leading to the notion of system-on-chip (SoC) design. Such a system can integrate different elements like processors, memories and application-specific circuits on one chip instead of on one board. The advantages of this technique are obvious: it gives systems low power consumption, small size and low cost. Present techniques for SoC design also make it possible to combine large, pre-designed complex blocks (so-called cores or IP blocks) with embedded software. This feature is very important to current digital system design. High reusability of IP blocks reduces time-to-market for new products and makes systems more reliable. SoC design techniques are focused on the problems of evaluating, integrating, and verifying multiple pre-existing blocks and software components. This is characterized by more in-depth system-level design, concurrent hardware/software design and verification at all levels of the design process [1].
2. How to add new modules to LEON2. In this project, LEON2 is implemented on a Spartan-3 FPGA. It is compatible with the SPARC V8 architecture. To add new modules to it, we must know where and how to add them.
3. How to design an accelerator for LEON2. Before designing an accelerator, we should know why the accelerator is required and what it should do.
These problems should be solved during the project.
Step 3: Performance analysis is performed to find the system bottlenecks.
Step 4: In the hardware/software partitioning phase, a plan is made to determine which parts will be realized in hardware and which parts in software. Typically, some system bottlenecks are moved to hardware to improve performance.
Step 5: Based on the results of step 4, the hardware and software parts are designed separately.
Step 6: Co-simulation. At this step, the completed hardware and software parts are integrated and the performance analysis is repeated.
Step 7: If the performance meets the requirements, the design can stop. If it does not, a new HW/SW partitioning and a new design iteration start.
Chapter 2
MP3 Decoder
This chapter introduces the MP3 format and explains how an MP3 stream is decoded.
2.1 Introduction
MPEG-1 Audio Layer 3, also known as MP3, is currently the most popular audio compression technique. It was developed by the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC). MPEG audio offers three layers of compression. Layer 3 is the most complex scheme and provides the best sound quality of the three layers. Three sample rates are supported: 32 kHz, 44.1 kHz and 48 kHz. An MP3 decoder receives an MP3 bitstream and outputs a PCM bitstream. Figure 2 shows the first-level function block diagram of the MP3 decoder.
The main block in Figure 2 can be divided into three parts; the second-level function block diagram of the MP3 decoder is shown in Figure 3.
The first part, the bit stream decoder, reads the MP3 bit stream and parses it. The second part, inverse quantization, reconstructs the signal and outputs the original spectrum data. The last part, frequency-to-time mapping, reproduces the audio signal from the spectrum data.
Figure 4 Format of MP3 frame, granules, subband blocks and frequency lines [2]. The figure shows a frame and its granules; each granule is divided into subband blocks 0 to 31 of 18 frequency lines each.
Figure 4 shows the format of MP3 frames. Each frame holds data for 2 granules (the smallest acoustic unit), and every granule consists of 576 samples. Section 2.2 explains the decoding of the MP3 bit stream, Section 2.3 explains how to requantize the Huffman-decoded data, and Section 2.4 describes the mapping from the frequency domain to the time domain.
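The frame structure above fixes how much audio one frame carries, and therefore how much time the decoder has per frame. The following is a minimal sketch in C, not code from the thesis; the function names are illustrative assumptions.

```c
#include <stdint.h>

/* Values stated in the text: 2 granules per frame, 576 samples per granule. */
enum { SAMPLES_PER_GRANULE = 576, GRANULES_PER_FRAME = 2 };

/* PCM samples per channel carried by one frame: 2 * 576 = 1152. */
uint32_t samples_per_frame(void)
{
    return (uint32_t)SAMPLES_PER_GRANULE * GRANULES_PER_FRAME;
}

/* Playback time of one frame in microseconds, rounded down.
 * A real-time decoder has to finish one frame within this time. */
uint32_t frame_duration_us(uint32_t sample_rate_hz)
{
    return (uint32_t)(((uint64_t)samples_per_frame() * 1000000u) / sample_rate_hz);
}
```

For the three supported sample rates, this gives a per-frame time budget of 36 ms at 32 kHz, about 26.1 ms at 44.1 kHz and 24 ms at 48 kHz.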
An MP3 frame mainly includes three parts: header and CRC, side information, and main data.
Frame header: In each frame, the first 32 bits are the header. The first 12 bits of the header, equal to FFF (hex), are the synchronization word. The other 20 bits carry information such as format, version, bit rate, sampling frequency, stereo mode, and copyright.
Side information: The information used in inverse quantization and Huffman decoding is included in the side information section. In mono mode, the size of the side information is 17 bytes (136 bits); in two-channel/stereo mode, it is 32 bytes (256 bits).
Main data: The main data are divided into two parts, scalefactors and Huffman code. The scalefactors are grouped into scalefactor bands. The length of the scalefactor part depends on whether scalefactors are reused, and also on the window length (short or long). The scalefactors are used in the requantization of the samples [2].
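The header layout described above can be illustrated with a short C sketch. It only uses the facts stated in the text (a 32-bit header whose top 12 bits are 0xFFF, and the two side information sizes); the function names are illustrative assumptions, and a real decoder would also validate the remaining header fields before accepting a sync position.

```c
#include <stddef.h>
#include <stdint.h>

/* Assemble the 32-bit frame header from four bytes, most significant first. */
uint32_t mp3_read_header(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* True when the first 12 bits of the header equal the sync word 0xFFF. */
int mp3_has_sync(uint32_t header)
{
    return (header >> 20) == 0xFFFu;
}

/* Side information size in bytes: 17 in mono mode, 32 (256 bits) otherwise. */
size_t mp3_side_info_bytes(int is_mono)
{
    return is_mono ? 17u : 32u;
}

/* Byte offset of the first candidate header in buf, or -1 if none is found. */
long mp3_find_sync(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 4 <= len; ++i) {
        if (mp3_has_sync(mp3_read_header(buf + i)))
            return (long)i;
    }
    return -1;
}
```

A caller would scan the input with mp3_find_sync, skip the 4-byte header (and the CRC, if present), and then read mp3_side_info_bytes bytes of side information before the main data.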
Figure 6 Decoding of bitstream block diagram [2]. Blocks shown: bitstream in, synchronization, Huffman decoding, Huffman information, scalefactor information, scalefactor decoding.
Figure 6 shows the bitstream decoding block. Besides synchronization, three decoding parts are required.
Huffman decoding: Huffman decoding is executed in this part. Huffman coding is a variable-length coding method, so the decoding process must begin at the first data item.
Huffman info decoding: This part collects all information about the Huffman decoding from the header and side information parts.
Scalefactor decoding: This part collects all scalefactor information from the header and side information parts and decodes the scalefactors in the main data part.
$$xr_i = is_i^{4/3} \cdot 2^{0.25\,C}$$
Here $is_i$ is the output of the Huffman decoder. The factor C in the equation consists of global and scalefactor-band-dependent gain factors from the side information and the scale factors.
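The requantization equation can be sketched in C as follows. This is an illustrative sketch, not the thesis code. It makes three assumptions that the text does not state: the sign of is_i is carried through separately and the power is applied to the magnitude, C is an integer, and no math library is available, so the cube root and the power of two are computed by hand. All function names are hypothetical.

```c
#include <stddef.h>

/* Cube root of a >= 0 by Newton iteration (assumption: no libm on the target). */
double cube_root(double a)
{
    if (a == 0.0)
        return 0.0;
    double x = (a > 1.0) ? a : 1.0;        /* start at or above the root */
    for (int i = 0; i < 60; ++i)
        x = (2.0 * x + a / (x * x)) / 3.0;
    return x;
}

/* 2^(c/4) for an integer c, using a four-entry table for the fractional part. */
double pow2_quarter(int c)
{
    static const double frac[4] = {
        1.0, 1.189207115002721, 1.4142135623730951, 1.681792830507429
    };
    int q = c / 4, r = c % 4;
    if (r < 0) { r += 4; q -= 1; }         /* keep the remainder in 0..3 */
    double s = 1.0;
    for (; q > 0; --q) s *= 2.0;
    for (; q < 0; ++q) s *= 0.5;
    return s * frac[r];
}

/* xr = sign(is) * |is|^(4/3) * 2^(0.25 * C) */
double requantize_sample(int is, int c)
{
    double a = (is < 0) ? -(double)is : (double)is;
    double mag = a * cube_root(a) * pow2_quarter(c);   /* a^(4/3) = a * a^(1/3) */
    return (is < 0) ? -mag : mag;
}

/* Requantize n Huffman-decoded values that share the same factor C. */
void requantize(const int *is, double *xr, size_t n, int c)
{
    for (size_t i = 0; i < n; ++i)
        xr[i] = requantize_sample(is[i], c);
}
```

In practice, decoders often replace the cube root with a lookup table of |is|^(4/3) values, since the range of Huffman-decoded values is bounded; the iteration here only keeps the sketch self-contained.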
Intensity stereo mode: intensity stereo is done by specifying the magnitude (via the scalefactors of the left channel) and a stereo position is_pos[sfb], which is transmitted instead of the scalefactors of the right channel. The stereo position is used to derive the left and right channel signals.
MS stereo mode: the values M and S are transmitted in the left and right channels instead of the left and right signals. The left and right channel signals are derived according to the formulas below.
$$left_i = \frac{M_i + S_i}{\sqrt{2}} \quad \text{and} \quad right_i = \frac{M_i - S_i}{\sqrt{2}}$$
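The reconstruction of the left and right signals from the transmitted M and S values can be sketched in C. This is a minimal illustration that assumes the standard MS stereo scaling by 1/sqrt(2); the function name and the in-place buffer layout are assumptions, not taken from the thesis.

```c
#include <stddef.h>

#define INV_SQRT2 0.7071067811865476   /* 1 / sqrt(2) */

/* On entry ch0 holds M and ch1 holds S; on return ch0 holds the left
 * channel and ch1 the right channel. Works in place on n values. */
void ms_stereo_decode(double *ch0, double *ch1, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double m = ch0[i];
        double s = ch1[i];
        ch0[i] = (m + s) * INV_SQRT2;   /* left  = (M + S) / sqrt(2) */
        ch1[i] = (m - s) * INV_SQRT2;   /* right = (M - S) / sqrt(2) */
    }
}
```

Working in place avoids a second pair of 576-sample buffers per granule, which matters on a memory-constrained embedded target.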
2.4.3 IMDCT
IMDCT stands for inverse modified discrete cosine transform. It transforms the frequency lines into polyphase filter subband samples. It is calculated by the following formula:
$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left(\frac{\pi}{2n}\left(2i + 1 + \frac{n}{2}\right)(2k + 1)\right)$$
For long blocks n = 36; for short blocks n = 12.
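A direct evaluation of the IMDCT formula can be sketched in C as below, with the output index i running from 0 to n-1. This is a reference-style sketch under stated assumptions, not the thesis implementation and not a fast algorithm: it costs n * n/2 multiplications per block. To stay independent of a math library, the cosine is evaluated with a Taylor series after quarter-wave reduction; the helper names are hypothetical.

```c
#define IMDCT_PI 3.14159265358979323846

/* cos(x) by Taylor series; accurate for |x| <= pi/2. */
double cos_taylor(double x)
{
    double term = 1.0, sum = 1.0, x2 = x * x;
    for (int k = 1; k <= 12; ++k) {
        term *= -x2 / ((2.0 * k - 1.0) * (2.0 * k));
        sum += term;
    }
    return sum;
}

/* cos(m * pi / (2n)), reduced to the first quadrant using symmetry. */
double cos_index(unsigned m, unsigned n)
{
    m %= 4u * n;                         /* period is 4n steps (2*pi)      */
    if (m > 2u * n)
        m = 4u * n - m;                  /* cos(2*pi - a) = cos(a)         */
    if (m > n)                           /* cos(a) = -cos(pi - a)          */
        return -cos_taylor((double)(2u * n - m) * IMDCT_PI / (2.0 * n));
    return cos_taylor((double)m * IMDCT_PI / (2.0 * n));
}

/* x[i] = sum_{k=0}^{n/2-1} X[k] * cos(pi/(2n) * (2i + 1 + n/2) * (2k + 1)),
 * for i = 0 .. n-1. X holds n/2 frequency lines, x receives n samples. */
void imdct(const double *X, double *x, unsigned n)
{
    for (unsigned i = 0; i < n; ++i) {
        double acc = 0.0;
        for (unsigned k = 0; k < n / 2u; ++k)
            acc += X[k] * cos_index((2u * i + 1u + n / 2u) * (2u * k + 1u), n);
        x[i] = acc;
    }
}
```

For a long block the call would be imdct(X, x, 36) with 18 input lines, for a short block imdct(X, x, 12) with 6 input lines. Since the cosine arguments take only 4n distinct values, a precomputed table indexed as in cos_index is the usual choice on an embedded target.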
2.5 Conclusion
The previous sections describe the MP3 format and give an overview of the MP3 decoder. A detailed block diagram of the MP3 decoder is shown in Figure 10. An MP3 decoder has been written in C. It decodes MP3 files under the Linux operating system on a Pentium IV processor and outputs the PCM signal to the audio card. The next step is the performance analysis of the software decoder. However, the target MP3 decoder must run on an embedded processor instead of a Pentium, so before the analysis the MP3 decoder has to be ported to the embedded system. The next chapter introduces background knowledge about the embedded processor LEON2.
Chapter 3
LEON2
The previous chapter presented the basics of the MP3 decoder. The decoder reads an MP3 source file and outputs PCM signals to the audio card of a PC. Compared with LEON2, a personal computer is so powerful that decoding MP3 in real time is easy. It also has a complete hard disk system and audio system. The absence of these features makes it harder to set up an MP3 decoder on an embedded system. This chapter introduces the background of LEON2. With that knowledge, we look for a way to port the MP3 decoder to LEON2. Section 3.1 gives a brief background of LEON2. Section 3.2 describes the AMBA 2.0 bus specification. Section 3.4 describes the interrupt system of LEON2.
Figure 11 function block diagram of LEON2 [3]. Blocks shown: local RAM, AMBA AHB, AHB controller, memory controller, UARTs, AMBA APB, timers, IrqCtrl, I/O port, AHB/APB bridge, PROM, I/O, SDRAM, SRAM.
Figure 11 shows that AMBA is the communication center of LEON2: every unit is connected to it except the floating-point unit, the co-processor and the local RAM. Generally, high-speed units are connected to the AMBA AHB and low-speed units are connected to the AMBA APB. Some units, such as the PCI unit and the Ethernet unit, are connected to both AHB and APB. In this case, the APB is used to access the control registers of the target unit and the AHB is used to exchange data.
3.2 AMBA-2.0
The Advanced Microcontroller Bus Architecture (AMBA) is a specification that defines an on-chip communications standard for designing high-performance embedded microcontrollers. It defines three distinct buses that meet different requirements: the Advanced High-performance Bus (AHB), the Advanced System Bus (ASB) and the Advanced Peripheral Bus (APB).
LEON2 uses only the AHB and the APB, so only these two buses are explained in the following sections.
The AHB bus can connect up to 16 masters and any number of slaves. The LEON2 processor core is normally connected as master 0, while the memory controller and the APB bridge are connected as slaves 0 and 1 [3]. The AHB controller (AHBARB) controls the AHB bus and implements both the bus decoder/multiplexer and the bus arbiter. The arbitration scheme is fixed priority, where the bus master with the highest index has the highest priority. The processor is by default placed at the lowest index. Each AHB master is connected to the bus through two records, corresponding to the AHB signals as defined in the AMBA 2.0 standard:
-------------------------------------------------------------------------------
-- Definitions for AMBA(TM) AHB Masters
-------------------------------------------------------------------------------
-- AHB master inputs (HCLK and HRESETn routed separately)
type AHB_Mst_In_Type is
  record
    HGRANT:  Std_ULogic;                          -- bus grant
    HREADY:  Std_ULogic;                          -- transfer done
    HRESP:   Std_Logic_Vector(1 downto 0);        -- response type
    HRDATA:  Std_Logic_Vector(HDMAX-1 downto 0);  -- read data bus
  end record;

-- AHB master outputs
type AHB_Mst_Out_Type is
  record
    HBUSREQ: Std_ULogic;                          -- bus request
    HLOCK:   Std_ULogic;                          -- lock request
    HTRANS:  Std_Logic_Vector(1 downto 0);        -- transfer type
    HADDR:   Std_Logic_Vector(HAMAX-1 downto 0);  -- address bus (byte)
    HWRITE:  Std_ULogic;                          -- read/write
    HSIZE:   Std_Logic_Vector(2 downto 0);        -- transfer size
    HBURST:  Std_Logic_Vector(2 downto 0);        -- burst type
    HPROT:   Std_Logic_Vector(3 downto 0);        -- protection control
    HWDATA:  Std_Logic_Vector(HDMAX-1 downto 0);  -- write data bus
  end record;
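The fixed-priority arbitration scheme described above (highest master index wins) can be modelled behaviourally in a few lines of C. This is only an illustration of the selection rule, assuming that each master's HBUSREQ is collected into one bit of a mask and that a default master is granted when nobody requests the bus; it ignores HLOCK, HREADY and burst handling, and it is not the AHBARB implementation.

```c
#include <stdint.h>

#define AHB_MAX_MASTERS 16   /* the text states up to 16 masters */

/* busreq: bit i is set when master i asserts HBUSREQ.
 * Returns the index of the master that receives HGRANT. */
int ahb_arbitrate(uint16_t busreq, int default_master)
{
    for (int i = AHB_MAX_MASTERS - 1; i >= 0; --i) {
        if (busreq & (1u << i))
            return i;             /* highest requesting index wins */
    }
    return default_master;        /* no request: grant the default master */
}
```

Because the processor sits at the lowest index, any other requesting master takes the bus ahead of it under this rule.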
Master units can access the whole address space from 0x00000000 to 0xFFFFFFFF. Figure 12 illustrates the structure of the AHB.
Compared with the AHB, the APB is a bus with low speed, low power consumption and low interface complexity. It targets low-speed peripheral units such as serial ports and keypads. The slaves are connected to the APB through a pair of records containing the APB signals:
-------------------------------------------------------------------------------
-- Definitions for AMBA(TM) APB Slaves
-------------------------------------------------------------------------------
-- APB slave inputs (PCLK and PRESETn routed separately)
type APB_Slv_In_Type is
  record
    PSEL:    Std_ULogic;                          -- slave select
    PENABLE: Std_ULogic;                          -- strobe
    PADDR:   Std_Logic_Vector(PAMAX-1 downto 0);  -- address bus (byte)
    PWRITE:  Std_ULogic;                          -- write
    PWDATA:  Std_Logic_Vector(PDMAX-1 downto 0);  -- write data bus
  end record;

-- APB slave outputs
type APB_Slv_Out_Type is
  record
    PRDATA:  Std_Logic_Vector(PDMAX-1 downto 0);  -- read data bus
  end record;
The number of APB slaves is defined by the APB_SLV_MAX constant in the TARGET package. The APB address space ranges from 0x80000000 to 0x8FFFFFFF.
In this project, the PCI unit, the Ethernet unit and the DSU are disabled, so interrupts 11 to 14 are available.
catch_interrupt() only associates a function with an interrupt; unmasking and enabling interrupts must be done by the application, typically by programming certain registers.
apbmst.vhd: this file defines the AHB-to-APB bridge. The bridge acts as the master on the APB. Some address vectors are stored in this file; they are used when APB modules are accessed.
GCC-3.2.3 C/C++ compiler
GNU binary utilities 2.13.1 with support for LEON UMAC/SMAC instructions
RTEMS-4.6.0-beta C/C++ real-time kernel with LEON and ERC32 support
Newlib-1.11 standalone C-library
GDB-5.3 SPARC cross-debugger
Remote debugging monitor (rdbmon-1.3.6)
DDD graphical front-end for GDB (Unix only)
GDB-TK graphical front-end for GDB (Windows only)
MKPROM-1.3.9 boot-prom builder
DSUMON-1.0.11 LEON debug support unit monitor
To compile and link an application for LEON2, use sparc-rtems-gcc.
The executable file should then be converted into an SRECORD file, because only files in this format can be downloaded to LEON2 from the PC. To create an SRECORD file for a PROM programmer, use objcopy.
3.7 Conclusion
This chapter presented basic information about the hardware and software of LEON2. The bus system and the interrupt controller are used when new modules are added to LEON2. The next chapter presents the method for adding new modules to the LEON2 SoC platform.
Chapter 4
Adding New Modules to LEON2 & Converting MP3 Decoder to Fixed Point
Chapter 3 gave a brief background of LEON2. A real MP3 player must access a file system to get the MP3 source data and must convert the digital audio into sound. This chapter explains how to add two new hardware modules to the existing LEON2. The first is the flash card driver module and the second is the audio driver module.
4.1 Introduction
The previous chapter introduced AMBA. The AMBA specification defines three different buses: AHB, ASB and APB. AHB handles high-speed communication and APB handles low-speed data exchange. From this point of view, new modules can be added to either AHB or APB. Another way to add a new module to LEON2 is to make it a co-processor. The concept of a co-processor was introduced in Section 3.3. The co-processor and the main processor can execute instructions concurrently. The co-processor executes only the instructions it recognizes and ignores those it does not. The interface between the processor and the co-processor is dedicated, which makes the cooperation fast. But the main processor has a five-stage pipeline, and matching the co-processor to this pipeline is complicated. Compared with adding new modules to AMBA, adding a co-processor is very complicated, although it is more efficient. So, in this design, we add all new modules to AMBA as accelerators.
A real MP3 player has to solve two problems: where the MP3 data come from and where the PCM signals are sent. The solution requires two modules. The first is the module that provides the MP3 source file and the second is the module that converts the PCM signal into sound. Figure 15 shows the function block diagram of the MP3 decoder with the flash card module and the D/A converter. The flash card module provides the MP3 file to the decoder and the D/A converter generates sound from the PCM codes.
Figure 15 MP3 decoder with flash card module and D/A converter
Figure 16 illustrates the new LEON2 system after the two modules are added. An IDE bus module connects to AHB, and a standard flash card module connects to the IDE bus. The audio driver is attached to APB and sends PCM data to the audio D/A converter.
Now we explain these steps one by one.
1. Find an available position on AHB. The AHB arbiter module can be regarded as the bus driver. Figure 17 shows its port definition. The bus ports slvi and slvo are defined as interface arrays, and the interface of every slave module must be added to these arrays as an element. The position must be between 0 and AHB_SLV_MAX-1. Figure 18 shows an example in which a RAM component aram0 is added at position 4 on AHB. The file mcore.vhd shows which positions are already occupied.
2. Add the new module to AHB at that position. At this point an available position on AHB is ready for the new module. As explained in section 3.3, the AHB bus can connect any number of slaves. From the VHDL point of view, adding a new module to AHB means instantiating it in mcore.vhd and adding its AHB interface to the interface arrays at the prepared position.
entity ahbarb is
  generic (
    masters : integer := 2;   -- number of masters
    defmast : integer := 0    -- default master
  );
  port (
    rst  : in  std_logic;
    clk  : in  clk_type;
    msti : out ahb_mst_in_vector(0 to masters-1);
    msto : in  ahb_mst_out_vector(0 to masters-1);
    slvi : out ahb_slv_in_vector(0 to AHB_SLV_MAX-1);
    slvo : in  ahb_slv_out_vector(0 to AHB_SLV_MAX)
  );
end;
aram0 : if AHBRAMEN generate
  aram : ahbram
    generic map (AHBRAM_BITS)
    port map (rst, clk, ahbsi(4), ahbso(4));
end generate;
3. Allocate an address to the module. Every module on AMBA must have one or more addresses. An address consists of two parts. The first part acts as a select signal that is sent to every module: hsel for AHB slave modules and psel for APB modules. Only the module whose select signal is asserted can access AHB or APB. This signal is decoded from some bits of the address bus. The second part is the remainder of the address bus. Figure 19 shows that the highest four bits of the AHB address bus are used as an index into a mapping table, and the value stored at that index is passed to a decoder. The decoder produces the hsel signals, one for each AHB slave module. At any time only one hsel signal can be high, and only the selected slave module can access AHB. The mapping table is stored in the device.vhd file as an array named ahbrange_config.
The array ahbrange_config shown in figure 20 has 16 elements, which correspond to the 16 values of the highest four bits of the AHB address bus. Table 1 lists the mapping from the address to the position number on AHB.
MSB 4 bits of address bus   Index   Position number        Address space
0000                        0       ahbrange_config(0)     0x00000000-0x0FFFFFFF
0001                        1       ahbrange_config(1)     0x10000000-0x1FFFFFFF
0010                        2       ahbrange_config(2)     0x20000000-0x2FFFFFFF
0011                        3       ahbrange_config(3)     0x30000000-0x3FFFFFFF
0100                        4       ahbrange_config(4)     0x40000000-0x4FFFFFFF
0101                        5       ahbrange_config(5)     0x50000000-0x5FFFFFFF
0110                        6       ahbrange_config(6)     0x60000000-0x6FFFFFFF
0111                        7       ahbrange_config(7)     0x70000000-0x7FFFFFFF
1000                        8       ahbrange_config(8)     0x80000000-0x8FFFFFFF
1001                        9       ahbrange_config(9)     0x90000000-0x9FFFFFFF
1010                        10      ahbrange_config(10)    0xA0000000-0xAFFFFFFF
1011                        11      ahbrange_config(11)    0xB0000000-0xBFFFFFFF
1100                        12      ahbrange_config(12)    0xC0000000-0xCFFFFFFF
1101                        13      ahbrange_config(13)    0xD0000000-0xDFFFFFFF
1110                        14      ahbrange_config(14)    0xE0000000-0xEFFFFFFF
1111                        15      ahbrange_config(15)    0xF0000000-0xFFFFFFFF
For example, if the highest four bits of the AHB address bus are 0010, we look up the element of ahbrange_config at index 2; its value is 0, which means that the slave module at position 0 on AHB responds to the address. The position number and the array ahbrange_config together determine the address of a new module.
4. If an interrupt signal is required, connect it to the interrupt controller at an available position. In chapter 4, the interrupt controller is introduced. The main controller can handle 15 interrupts. It is instantiated in the mcore.vhd file and provides an interrupt array, irqi.irq, in which every element corresponds to one interrupt signal. Figure 21 shows an example in which the interrupt signal timo.irq(1) is assigned to position 9 of the interrupt controller. The position number is the interrupt number that is used in software.
irqi.irq(9) <= timo.irq(1);
Figure 21 Example of assigning an interrupt signal to the interrupt controller
5. Update the port of the leon entity if necessary. If the new module needs to communicate with something outside the SoC, a new port must be added to the leon entity.
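The address decoding of step 3 can be sketched in C as follows. This is only an illustrative software model of the lookup, not the LEON2 VHDL: the type ahb_range_table and the functions ahb_slave_for_address and ahb_hsel_vector are hypothetical names, and the table contents would have to be copied from the actual ahbrange_config.

```c
#include <assert.h>
#include <stdint.h>

#define AHB_RANGE_ENTRIES 16

/* Hypothetical copy of the 16-entry mapping table; in LEON2 the real
 * table is the VHDL array ahbrange_config in device.vhd. */
typedef struct {
    uint8_t slave_position[AHB_RANGE_ENTRIES];
} ahb_range_table;

/* Return the AHB slave position selected by a 32-bit address:
 * the top four address bits index the mapping table. */
unsigned ahb_slave_for_address(const ahb_range_table *table, uint32_t haddr)
{
    unsigned index = haddr >> 28;              /* address bits 31..28 */
    assert(index < AHB_RANGE_ENTRIES);
    return table->slave_position[index];
}

/* Build a one-hot hsel vector: bit p is set only for the selected slave. */
uint32_t ahb_hsel_vector(const ahb_range_table *table, uint32_t haddr)
{
    unsigned position = ahb_slave_for_address(table, haddr);
    assert(position < 32);
    return UINT32_C(1) << position;
}
```

For example, with the table element at index 12 set to 4, every address from 0xC0000000 to 0xCFFFFFFF selects the slave at position 4, and only bit 4 of the hsel vector is set.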
The method of adding new modules to APB is almost the same as for AHB. The only difference is the address allocation. The address range of all APB modules is 0x80000000 to 0x8FFFFFFF, and the AHB-to-APB bridge is the master of APB. When a master module on AHB wants to access a module on APB, address bits (9 downto 2) are compared with pre-defined vectors stored in the apbmst.vhd file; when they match, a position number is generated and the corresponding APB module is selected. Figure 22 shows the block diagram of the APB address decoding. Adding a new module to APB therefore requires adding a new address vector to this look-up table.
2. Allocate the address space 0xC0000000-0xCFFFFFFF to the module and change the element of ahbrange_config at index 12 to 4.
3. Update the port of the leon entity.
The main clock frequency of LEON2 is 25 MHz, so we set MCLK to 12.5 MHz, which is close to the desired 12.288 MHz. The corresponding nominal values are 3.072 MHz for SCLK and 48.0 kHz for LRCK. Figure 24 shows the block diagram of the module.
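The relation between the three clocks can be checked with a short C sketch. The divider ratios used here are assumptions that are not stated in the text: MCLK/SCLK = 4 and MCLK/LRCK = 256 are inferred from the nominal values 12.288 MHz, 3.072 MHz and 48.0 kHz, and the names audio_clocks and derive_audio_clocks are hypothetical.

```c
#include <stdint.h>

/* Assumed divider ratios (inferred from the nominal clock values). */
#define MCLK_DIV_FROM_SYSCLK 2u     /* 25 MHz -> 12.5 MHz */
#define SCLK_DIV_FROM_MCLK   4u
#define LRCK_DIV_FROM_MCLK   256u

typedef struct {
    uint32_t mclk_hz;
    uint32_t sclk_hz;
    uint32_t lrck_millihz;          /* LRCK in mHz, to keep the fraction */
} audio_clocks;

/* Derive MCLK, SCLK and LRCK from the system clock by integer division. */
audio_clocks derive_audio_clocks(uint32_t sysclk_hz)
{
    audio_clocks c;
    c.mclk_hz = sysclk_hz / MCLK_DIV_FROM_SYSCLK;
    c.sclk_hz = c.mclk_hz / SCLK_DIV_FROM_MCLK;
    c.lrck_millihz =
        (uint32_t)(((uint64_t)c.mclk_hz * 1000u) / LRCK_DIV_FROM_MCLK);
    return c;
}
```

Under these assumptions, a 25 MHz system clock would give an SCLK of 3.125 MHz and an LRCK of about 48.83 kHz, roughly 1.7% above the nominal values, whereas a 24.576 MHz system clock would give the nominal values exactly.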
The module includes two buffers, each of which can hold eight data words. When the output buffer contains fewer than four words, an interrupt signal is sent to the interrupt controller. When the input buffer contains more than four words, an interrupt signal is also sent to the interrupt controller. It is therefore best for every access to read or write four words. The clock generator generates the three clock signals MCLK, SCLK and LRCK. The serial-to-parallel block receives serial data from the A/D converter and reassembles it into 4-byte-wide words. The parallel-to-serial block converts the 32-bit-wide PCM words into a serial bit stream and sends it to the D/A converter. The APB slave interface handles the accesses on APB. The driver module uses three port addresses:
Port               Address
Right channel      0x80000340
Left channel       0x80000344
Control register   0x80000348
Three bits of the control register are used; they are listed in table 3.
Bit   Function
0     enable
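The port addresses of the audio driver and the APB address field described earlier in this chapter can be written down in C as a small sketch. The macro and function names are hypothetical; only the three addresses, the enable bit and the use of address bits (9 downto 2) come from the text.

```c
#include <stdint.h>

/* Port addresses of the audio driver module, as listed in the text. */
#define AUDIO_RIGHT_CHANNEL  UINT32_C(0x80000340)
#define AUDIO_LEFT_CHANNEL   UINT32_C(0x80000344)
#define AUDIO_CONTROL        UINT32_C(0x80000348)

#define AUDIO_CTRL_ENABLE    UINT32_C(0x1)   /* control register, bit 0 */

/* Address bits 9..2: the field that the APB bridge compares with its
 * pre-defined vectors to select a module. */
uint32_t apb_decode_field(uint32_t paddr)
{
    return (paddr >> 2) & UINT32_C(0xFF);
}

/* True if the address lies in the APB range 0x80000000-0x8FFFFFFF. */
int is_apb_address(uint32_t addr)
{
    return (addr >> 28) == 0x8u;
}
```

With these definitions the three registers map to the consecutive field values 0xD0, 0xD1 and 0xD2, which are the values a look-up table for this module would have to contain.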
The value V represented by the word may be determined as follows:
If E=255 and F is nonzero, then V=NaN ("Not a number").
If E=255 and F is zero and S is 1, then V=-Infinity.
If E=255 and F is zero and S is 0, then V=Infinity.
If 0<E<255, then V=(-1)**S * 2 ** (E-127) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values.
If E=0 and F is zero and S is 1, then V=-0.
If E=0 and F is zero and S is 0, then V=0.
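The rules above translate almost line by line into C. The following sketch assumes the usual single-precision layout (S in bit 31, E in bits 30 to 23, F in bits 22 to 0); the function name decode_ieee754_single is hypothetical.

```c
#include <math.h>     /* INFINITY and NAN macros only */
#include <stdint.h>

/* Decode a 32-bit single-precision word with the rules listed above.
 * S = bit 31, E = bits 30..23, F = bits 22..0. */
double decode_ieee754_single(uint32_t word)
{
    uint32_t s = word >> 31;
    uint32_t e = (word >> 23) & 0xFFu;
    uint32_t f = word & 0x7FFFFFu;
    double sign = s ? -1.0 : 1.0;
    double mantissa;
    int exponent;
    int i;

    if (e == 255u) {
        return f ? NAN : sign * INFINITY;
    }
    if (e == 0u) {
        mantissa = (double)f / 8388608.0;          /* 0.F, 2**23 = 8388608 */
        exponent = -126;
    } else {
        mantissa = 1.0 + (double)f / 8388608.0;    /* 1.F */
        exponent = (int)e - 127;
    }
    /* Scale by 2**exponent with exact multiplications by 2 or 0.5. */
    for (i = 0; i < exponent; i++) mantissa *= 2.0;
    for (i = 0; i > exponent; i--) mantissa *= 0.5;
    return sign * mantissa;
}
```

For instance, the word 0x3E200000 has S=0, E=124 and F=0x200000, so it decodes to 1.25 * 2 ** (-3) = 0.15625.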
With this format, the effective range is 0x80000000 to 0x7FFFFFFF and the dynamic range is 96 dB. Converting the decoder to fixed point involves several aspects:
Arithmetic computation: fixed-point numbers can be added or subtracted like normal integers, but a multiplication produces a 64-bit result with 56 fractional bits, which must be shifted back to 28 fractional bits.
Overflow and underflow: when overflow is detected during the computation, the result is set to the maximum value 0x7FFFFFFF. When underflow is detected, the result is set to 0x80000000.
Mathematical functions: decoding MP3 uses many mathematical functions, such as the cosine function. LEON2 cannot compute their values at run time, so the fixed-point results of these functions are stored in memory and obtained through a lookup table.
Constants: all floating-point constants are converted into fixed-point constants.
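The arithmetic and saturation rules can be illustrated with a minimal C sketch. It assumes a 32-bit word with 28 fractional bits, as implied by the shift from 56 back to 28 fractional bits; the names fixed_t, fix_saturate, fix_add and fix_mul are hypothetical and not taken from the decoder source.

```c
#include <stdint.h>

#define FIX_FRACBITS 28                       /* fractional bits */
#define FIX_MAX  INT32_C(0x7FFFFFFF)
#define FIX_MIN  (-FIX_MAX - 1)               /* bit pattern 0x80000000 */
#define FIX_ONE  (INT32_C(1) << FIX_FRACBITS) /* the value 1.0 */

typedef int32_t fixed_t;

/* Clamp a wide intermediate result to the 32-bit range. */
fixed_t fix_saturate(int64_t v)
{
    if (v > FIX_MAX) return FIX_MAX;
    if (v < FIX_MIN) return FIX_MIN;
    return (fixed_t)v;
}

/* Addition works like integer addition, plus saturation. */
fixed_t fix_add(fixed_t a, fixed_t b)
{
    return fix_saturate((int64_t)a + (int64_t)b);
}

/* The 64-bit product has 56 fractional bits; dividing by 2**28 is the
 * portable form of the shift back to 28 bits (it rounds toward zero). */
fixed_t fix_mul(fixed_t a, fixed_t b)
{
    int64_t product = (int64_t)a * (int64_t)b;
    return fix_saturate(product / ((int64_t)1 << FIX_FRACBITS));
}
```

With this sketch, fix_mul(FIX_ONE / 2, FIX_ONE / 2) yields FIX_ONE / 4, and any result outside the representable range is clamped to 0x7FFFFFFF or 0x80000000 instead of wrapping around.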
4.9 Conclusion
In this chapter, two major tasks are completed. The first is adding two modules to LEON2. The second is converting the MP3 decoder from the floating-point version to the fixed-point version. In the next chapter, a performance analysis is carried out to find out how to partition the system into a hardware part and a software part.
Chapter 5
calls      name
216000     my_synthesis
376576     imdct36
12000      III_dequantize_sample
216000     dct32
1830922    huffman_decoder
6912000    III_requantize
6000       III_stereo
8800642    hgetbits
384000     III_hybrid_int
12000      III_hufman_decode
376576     III_imdct_l
11768      III_aliasreduce
           main
12000      III_reorder
1278464    getbits
6000       out_fifo
12000      III_antialias
Figure 27 shows the profile of the software decoder running on a personal computer. The leftmost column displays the ratio of each function's execution time to the total execution time. The function my_synthesis consumes more than half of the total execution time. It implements the polyphase synthesis filter bank, which requires a large number of multiplications, additions and memory accesses. Its first part is a 32-point DCT; the data then pass through the D-window, where 512 multiplications are required and for which no fast algorithm has been found, followed by 512 additions. The second most time-consuming function is the 36-point IMDCT.
Function block               Time ratio (%)
Polyphase synthesis bank     55.96
36-point IMDCT               8.16
Inverse quantization         5.14
Huffman decoder              3.64
Stereo processing            2.51
In principle, the total execution time is reduced by finding the time-critical parts of the system and replacing them with hardware accelerators. The table above lists the five most time-consuming top-level functions called by main.c of the MP3 decoder. A second test is then carried out, in which the MP3 decoder decodes the same MP3 file on the embedded processor. In this test, 300 MP3 frames are decoded and the execution time is measured. The entry Scale_factor means that the decoding process stops at the end of scale factor decoding. The entry Huffman_decoding means that the decoding process stops at the end of Huffman decoding. The measured results are listed in figure 28.
300 frames              Time (s)
Scale_factor            6
Huffman_decoding        24
Inverse_quantization    60
synthesis               162
The sample rate of the MP3 file is 44.1 kHz and each MP3 frame contains 1152 samples. Real-time playback of a 44.1 kHz stereo MP3 therefore allows at most about 26.1 ms per frame, so 300 frames must be decoded within about 7.84 seconds. In the list above, only the Scale_factor entry meets this requirement. This means that at least all functions after scale factor decoding would have to be replaced by hardware to obtain a real-time decoder. In this project, however, the FPGA cannot provide enough system gates to hold that many hardware modules.
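The real-time budget follows directly from the frame length and the sample rate. The small C sketch below reproduces the calculation; the function names playback_time_us and meets_real_time are hypothetical.

```c
#include <stdint.h>

/* Playback time of a number of MP3 frames, in microseconds. */
uint64_t playback_time_us(uint32_t frames, uint32_t samples_per_frame,
                          uint32_t sample_rate_hz)
{
    return ((uint64_t)frames * samples_per_frame * 1000000u) / sample_rate_hz;
}

/* A decoder is real-time if decoding takes no longer than playback. */
int meets_real_time(uint64_t decode_time_us, uint64_t playback_us)
{
    return decode_time_us <= playback_us;
}
```

For one frame of 1152 samples at 44.1 kHz the budget is 26122 microseconds, and for 300 frames it is about 7.84 seconds. Of the four measured times, only the 6 seconds of scale factor decoding stays within this budget.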
5.3 Conclusion
In this chapter, a preliminary partition plan is made. In this plan, the two DCT blocks are converted into hardware and the rest of the decoder is kept in software. The plan is not optimal because the hardware modules do not cover the whole polyphase synthesis block, but it is the best plan that can be realized. The following chapters describe how the two modules are implemented.
Chapter 6
The previous chapter discussed how to partition the whole system into hardware and software. The preliminary plan is to implement two blocks in hardware: the 32-point DCT and the 36-point IDCT. This chapter introduces two fast algorithms that reduce their computational complexity.
6.1 Introduction
Three different DCT/IDCT transforms are involved in the MP3 decoder: the 12-point IDCT, the 36-point IDCT and the 32-point DCT. Because the 12-point IDCT is seldom used, this chapter focuses on the 36-point IDCT and the 32-point DCT. The 36-point IDCT is defined by the following formula:
$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left(\frac{\pi}{2n}\left(2i+1+\frac{n}{2}\right)(2k+1)\right), \quad i = 0, \ldots, n-1, \quad n = 36$$
Evaluated directly, this formula requires 648 multiplications and 612 additions. The 32-point DCT is calculated by the following equation:
$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos(\ldots)$$
It requires 512 multiplications and 480 additions. Both transforms require a lot of computation. However, fast algorithms exist that reduce the computational complexity dramatically. Section 6.2 explains a fast algorithm for the 32-point DCT and section 6.3 introduces a fast algorithm for the 36-point IDCT.
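The operation counts quoted above correspond to evaluating each output as a plain sum of products. The following C sketch counts the operations of such a direct evaluation. It assumes that the cosine factors are precomputed and stored in a table c (one row per output), and the names op_count and transform_direct are hypothetical.

```c
#include <stddef.h>

typedef struct {
    unsigned long multiplications;
    unsigned long additions;
} op_count;

/* Direct evaluation of x[i] = sum over k of X[k] * c[i * n_in + k].
 * c holds the precomputed cosine factors (n_out rows, n_in columns).
 * Returns how many multiplications and additions were executed. */
op_count transform_direct(const double *X, const double *c,
                          double *x, size_t n_out, size_t n_in)
{
    op_count ops = {0, 0};
    size_t i, k;

    for (i = 0; i < n_out; i++) {
        double acc = X[0] * c[i * n_in];       /* first product, no addition */
        ops.multiplications++;
        for (k = 1; k < n_in; k++) {
            acc += X[k] * c[i * n_in + k];
            ops.multiplications++;
            ops.additions++;
        }
        x[i] = acc;
    }
    return ops;
}
```

With 36 outputs and 18 inputs this gives 36*18 = 648 multiplications and 36*17 = 612 additions; with 32 outputs and 16 terms per output it gives 512 multiplications and 480 additions, which matches the figures in the text.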
From the above three examples, we can observe the following recursive property for the DCT matrices:
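$$T(N) = \begin{bmatrix} T\left(\frac{N}{2}\right) & T\left(\frac{N}{2}\right) \\ D\left(\frac{N}{2}\right) & -D\left(\frac{N}{2}\right) \end{bmatrix}$$
The kernel of $T\left(\frac{N}{2}\right)$ is even and that of $D\left(\frac{N}{2}\right)$ is odd. Because of
$$\cos(2k+1)\phi_m = 2\cos(2k\phi_m)\cos\phi_m - \cos(2k-1)\phi_m$$
we can obtain
$$\begin{bmatrix} Z_e \\ Z_o \end{bmatrix} = \begin{bmatrix} T\left(\frac{N}{2}\right) & T\left(\frac{N}{2}\right) \\ K\,T\left(\frac{N}{2}\right) Q & -K\,T\left(\frac{N}{2}\right) Q \end{bmatrix}$$
where $K = R L R^{T}$ and $Q = \mathrm{diag}[\cos\phi_m]$. R is the permutation matrix for performing the bit-reversal arrangement, and
$$L = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ -1 & 2 & 0 & \cdots & 0 \\ 1 & -2 & 2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 2 & -2 & \cdots & 2 \end{bmatrix}$$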
Figure 30 illustrates the 2-point, 4-point and 8-point DCT. Broken lines represent transfer factors of -1, while full lines represent unity transfer factors. Circles (o) represent adders; the other node symbols represent multipliers and shifts (i.e., multiplication by 2).
$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left(\frac{\pi}{2n}\left(2i+1+\frac{n}{2}\right)(2k+1)\right), \quad n = 36.$$
From the previous section, we know that Hou's algorithm can reduce the computational complexity. However, it is only useful for a $2^n$-point IDCT. To reduce the computational complexity of the 36-point IDCT, we introduce Szu-Wei Lee's fast algorithm. Figure 31 shows the flow of Szu-Wei Lee's algorithm. The N-point DCT-II (SDCT-II) is defined as
$$C_k^{N,II} = \epsilon_k \sum_{m=0}^{N-1} x_m \cos\frac{(2m+1)k\pi}{2N}$$
[Figure 31: flow of Szu-Wei Lee's algorithm, built from N/2-point SDCT-II, N/2-point DCT-IV, N/4-point SDCT-II and N/4-point DCT-IV blocks]
6.4 Conclusion
The previous two sections describe Hou's algorithm and Lee's algorithm respectively. Both algorithms reduce the computational complexity considerably.
Algorithm                           Number of multiplications   Number of additions
Hou algorithm for 32-point DCT      80                          209
Lee algorithm for 36-point IDCT     43                          115
Table 4 shows the number of operations of the two algorithms. The next chapter shows how to design the DCT/IDCT accelerators.
Chapter 7
The previous chapter described two fast algorithms for the 36-point IDCT and the 32-point DCT, which reduce the computational complexity considerably. In this chapter, we realize the DCT and IDCT accelerators in hardware. Section 7.1 describes the accelerator architecture; section 7.2 describes how to schedule the algorithms and how to create the FSM. Section 7.3 shows the design of the FSM controller and section 7.4 explains how to design the accelerators' interface to AHB.
The accelerator modules exchange data with LEON2 at high speed, so an AHB interface is required. Since the modules only respond to LEON2, we implement them as slave modules on AHB.
Figure 32 presents the functional block diagram of the accelerators. The FSM and the arithmetic units perform the DCT or IDCT computation. The FSM controller triggers the FSM when the input data are ready and generates the done signal when the results are ready. The AHB slave interface is the means by which the accelerators communicate with LEON2.
ALUs and multipliers, the shorter the total execution time. However, once the number of ALUs and multipliers is sufficient, additional ALUs and multipliers cannot reduce the execution time further. The scheduling tool shown in figure 33, developed by Mr. Huib Lincklaen Arriëns, is used to schedule the computations. Since this tool cannot read a C program directly, the C program must first be converted into a temporary file. The following table shows an example of the temporary file, in which each line contains exactly one computation.
t0 = i0 + i31;
w0 = i0 - i31;
t16 = w0 * c1;
t1 = i15 + i16;
w1 = i15 - i16;
t17 = w1 * c31;
After the C code is converted to the temporary file, the scheduling tool reads it and displays the scheduling graph. The number of ALUs and multipliers can be changed manually. In principle, more multipliers and ALUs give the FSM fewer states, but they also make the circuit more complex, so we must find the best trade-off. Table 5 and table 6 show the number of FSM states for IDCT36 and DCT32 when different numbers of adders and multipliers are used. In the IDCT36 module, 36 additional multiplications are included, so its total number of multiplications is 79. In both tables, five adders and two multipliers are the best choice because additional multipliers bring little further benefit.
Number of ALUs
2
Number of multipliers
3
81 69 69 69 69 69
4
81 52 52 52 52 52
5
81
6
81 42 35 35 35 35
10
1 2 3 4 5 6
42
42 42 42 42
42 31 31 31 31 30 27 27 27 30 25 25 24 25 25 24
Table 5 Number of FSM states for IDCT36 with different numbers of adders and multipliers
                        Number of ALUs
Number of multipliers   1     2     3     4     5     6
1                       147   92    80    80    80    80
2                       135   75    57    48    41    41
3                       134   71    51    42    37    33
Table 6 Number of FSM states for DCT32 with different numbers of adders and multipliers
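The effect of the resource counts on the number of states can be illustrated with a small greedy list scheduler. This is only a sketch under simplifying assumptions and not the scheduling tool used in this project: every operation is assumed to take exactly one FSM state, and the type operation and the function list_schedule are hypothetical names.

```c
#include <assert.h>

enum unit { UNIT_ALU, UNIT_MUL };

typedef struct {
    enum unit unit;   /* kind of resource that executes the operation */
    int dep[2];       /* indices of producing operations, -1 for an input */
} operation;

/* Greedy list scheduling: an operation may start only after the
 * operations it depends on finished in an earlier state. Returns the
 * number of states; state_of[i] receives the state (counted from 1)
 * assigned to operation i. The dependencies must not form a cycle. */
int list_schedule(const operation *ops, int n_ops,
                  int n_alus, int n_muls, int *state_of)
{
    int scheduled = 0, state = 0, i, d;

    assert(n_alus > 0 && n_muls > 0);
    for (i = 0; i < n_ops; i++) state_of[i] = 0;   /* 0 = not scheduled */

    while (scheduled < n_ops) {
        int free_alus = n_alus, free_muls = n_muls;
        state++;
        for (i = 0; i < n_ops; i++) {
            int ready = (state_of[i] == 0);
            for (d = 0; d < 2 && ready; d++) {
                int p = ops[i].dep[d];
                if (p >= 0 && (state_of[p] == 0 || state_of[p] >= state))
                    ready = 0;
            }
            if (!ready) continue;
            if (ops[i].unit == UNIT_ALU && free_alus > 0) {
                free_alus--; state_of[i] = state; scheduled++;
            } else if (ops[i].unit == UNIT_MUL && free_muls > 0) {
                free_muls--; state_of[i] = state; scheduled++;
            }
        }
    }
    return state;
}
```

Applied to the six computations of the temporary file example (four additions or subtractions and two multiplications that each depend on one subtraction), this sketch needs 5 states with one ALU and one multiplier, 3 states with two ALUs and one multiplier, and 2 states with four ALUs and two multipliers. Beyond that point, additional units cannot shorten the schedule, which is the saturation effect visible in the tables.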
The controller receives data through the 32-bit-wide data bus. The 36-point IDCT requires 18 input words and the 32-point DCT requires 32 input words, so the valid address bus width is 6 bits. The input data must be written in order from the first word to the last. As soon as the last word has arrived, the FSM controller generates the start signal that triggers the FSM. When the FSM has completed all computations, the done signal is set and the AHB interface passes it to the interrupt controller. When all results have been read by a master module on AHB, the controller sets the done signal low and the whole module returns to the idle state. Figure 34 shows the two FSM controllers.
code is the port definition of the IDCT36 accelerator. The port consists of a standard AHB slave I/O port and an interrupt signal output.
entity ahbmdct36 is
  port (
    rst   : in  std_logic;
    clk   : in  clk_type;
    ahbsi : in  ahb_slv_in_type;
    ahbso : out ahb_slv_out_type;
    irq   : out std_logic
  );
end;
The following VHDL code shows how the AHB slave interface works. The code is modified from the ahbram.vhd file.
comb : process (ahbsi, r, rst)
  variable v     : reg_type;
  variable haddr : std_logic_vector(5 downto 0);
begin
  v := r;
  v.hready := '1';
  -- use the registered address during a write or a wait state
  if (r.hwrite or not r.hready) = '1' then
    haddr := r.addr(5 downto 0);
  else
    haddr := ahbsi.haddr(7 downto 2);
  end if;
  -- latch the control signals of a new transfer
  if ahbsi.hready = '1' then
    v.hsel   := ahbsi.hsel and ahbsi.htrans(1);
    v.hwrite := ahbsi.hwrite and v.hsel;
    v.addr   := ahbsi.haddr(7 downto 2);
  end if;
  -- insert a wait state when a read follows a write
  if r.hwrite = '1' then
    v.hready := not (v.hsel and not ahbsi.hwrite);
    v.hwrite := v.hwrite and v.hready;
  end if;
  if rst = '0' then
    v.hwrite := '0';
  end if;
  write  <= r.hwrite;
  ramsel <= v.hsel or r.hwrite;
  ahbso.hready <= r.hready;
  addr   <= haddr;
  c      <= v;
end process;

ahbso.hresp  <= "00";
ahbso.hsplit <= (others => '0');

aram : IMDCT36_core
  port map (clk => clk, rst => rst, datain => ahbsi.hwdata,
            dataout => ahbso.hrdata, ena => ramsel, write => write,
            address => addr, done => irq);
7.6 Conclusion
At the end of this chapter, the two hardware accelerators are completed and added to AHB. Figure 35 shows the functional block diagram of the modified LEON2. In the next chapter, the performance of the accelerators is tested.
Chapter 8
The previous chapter described how the two accelerators are completed. In this chapter, a test is performed to determine the effect of the hardware accelerators. Section 8.1 shows the test results and section 8.2 presents the conclusions that can be drawn from them. Section 8.3 then summarizes the whole project, and the final section presents the scope for future research.
Secondly, the DCT32 accelerator is tested. In this test, the DCT32 accelerator and the software DCT32 decoding function are each executed 300000 times. The execution times are recorded in table 9, and the test is repeated twice. Figure 36 compares the two software results with the two hardware results.
DCT32                              Run 1   Run 2   Average
Soft decoding (Hou's algorithm)    91s     92s     91.5s
Accelerator decoding               18s     17s     17.5s
From table 8 and table 9, it can be derived that both accelerators reduce the execution time considerably. The first table shows that the latency of the accelerator is less than half of the software IDCT36 decoding time. In software, the fast IDCT36 algorithm consists of 79 multiplications and 115 additions. Since every operation needs two operands and one memory access takes 3 clock cycles, one complete multiplication or addition takes 3+3+1+3=10 clock cycles (two operand reads, one cycle of computation and one write of the result). One whole IDCT36 computation therefore needs about 10*(79+115)=1940 clock cycles. The clock frequency of LEON2 is 25 MHz, so one IDCT36 needs 1940*40=77600 ns, and 300000 fast IDCT36 computations need at least 77600 ns*300000=23.3 seconds. In the IDCT36 accelerator, the FSM consists of 41 states, writing the 18 input words needs 18*(3+3)=108 clock cycles and reading the 36 output words needs 36*6=216 clock cycles. One IDCT36 computation on the accelerator therefore needs 108+41+216=365 clock cycles, and 300000 IDCT36 computations on the accelerator need about 365*40 ns*300000=4.38 s. Because LEON2 needs additional clock cycles to respond to and handle the interrupt, this value is only a lower bound. The results for DCT32 are similar to those for IDCT36. From the previous chapter, we know that the architectures of the IDCT36 and DCT32 accelerators are almost the same. The only difference is the number of input data: IDCT36 needs 18 input words and DCT32 needs 32. The software DCT32 block therefore takes more time than the software IDCT36 block, and the DCT32 accelerator takes more time than the IDCT36 accelerator.
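The cycle estimates above can be reproduced with a few lines of C. The sketch uses only the figures given in the text (3 cycles per memory access, a 40 ns clock period, 41 FSM states); the function names are hypothetical.

```c
#include <stdint.h>

#define CLOCK_PERIOD_NS   40u   /* 25 MHz LEON2 clock */
#define MEM_ACCESS_CYCLES 3u    /* cycles per memory access */

/* Software estimate: each operation reads two operands, computes for
 * one cycle and writes one result. */
uint64_t software_cycles(uint32_t multiplications, uint32_t additions)
{
    uint32_t per_op = MEM_ACCESS_CYCLES + MEM_ACCESS_CYCLES + 1u
                      + MEM_ACCESS_CYCLES;
    return (uint64_t)per_op * (multiplications + additions);
}

/* Accelerator estimate: every transferred word costs one memory access
 * plus one bus access; each FSM state costs one cycle. */
uint64_t accelerator_cycles(uint32_t inputs, uint32_t fsm_states,
                            uint32_t outputs)
{
    uint32_t per_word = 2u * MEM_ACCESS_CYCLES;
    return (uint64_t)inputs * per_word + fsm_states
           + (uint64_t)outputs * per_word;
}

/* Total time in nanoseconds for a number of repeated transforms. */
uint64_t total_time_ns(uint64_t cycles_per_transform, uint32_t repetitions)
{
    return cycles_per_transform * CLOCK_PERIOD_NS * repetitions;
}
```

For IDCT36 this gives 1940 cycles in software and 365 cycles on the accelerator, that is, 23.28 s and 4.38 s for 300000 repetitions. Of the 365 accelerator cycles, 324 are spent on data transfer and only 41 on computation.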
Module                                         Sequential elements   Combinational logic
LEON (including the following four modules)    16.68% (9354)         59.73% (33502)
DCT32                                          4.82% (2702)          21.44% (12005)
IDCT36                                         3.29% (1847)          16.6% (9382)
Flash card module                              0.52% (293)           0.53% (297)
Audio module                                   1.52% (852)           2.38% (1334)
Table 10 Final area report
* Numbers in parentheses indicate the number of elements used.
Table 10 shows the area report of the LEON2 SoC. About 60% of the combinational logic and 17% of the sequential elements are used by the system.
8.2 Conclusion
The test results described in the previous section indicate that both hardware accelerators reduce the execution time, as expected. The DCT32 accelerator takes more execution time than IDCT36 because its FSM has more states and it needs more input data. In section 5.1, the MP3 decoder on LEON2 was used to decode 300 MP3 frames. During that decoding, the IDCT36 function is invoked 38016 times and the DCT32 function 21600 times. Because Lee's algorithm was not used for IDCT36 in that test, we also need the execution time of 300000 calls of the IDCT36 function without Lee's algorithm.
IDCT36
Soft decoding (no Lee's algorithm)
Soft decoding (Lee's algorithm)
Accelerator decoding
Table 11 shows the execution time when the IDCT36 accelerator and the IDCT36 C functions are each invoked 300000 times. When 300 MP3 frames are decoded, the IDCT36 module is called 38016 times, so the execution time is about 9.31 seconds for the software function without Lee's algorithm and 1.33 seconds for the accelerator. That means 8 seconds can be saved, of which 1.8 seconds are contributed by the hardware accelerator. For the DCT32 module, Hou's algorithm was already used in the test described in section 5.1, so with the same method we find that about 5.3 seconds can be saved. Using the two accelerators together saves 13.3 seconds when decoding 300 MP3 frames, which is about 8% of the total execution time of 162 seconds. Neither accelerator is efficient, because too much time is spent on data transmission. As described in section 8.1, IDCT36 needs eighteen input words and produces 36 results, and accessing one word takes 3 cycles. Before a word is sent to the IDCT36 module, it must be read from memory onto the bus; after a result is read from the IDCT36 module, it must be saved to memory. Every data transfer with the IDCT36 module therefore takes six cycles. Supplying all input data takes 6*18=108 cycles and reading all results takes 6*36=216 cycles. The computation part, the FSM, consists of only 41 states, where one state corresponds to one cycle.
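The scaling from the 300000-call measurements to the number of calls in 300 frames can be written as a short C sketch. The function names scale_time and seconds_saved are hypothetical; the numbers in the usage note are the DCT32 averages from table 9.

```c
/* Scale a time measured over measured_calls invocations to the time
 * needed for actual_calls invocations. */
double scale_time(double measured_seconds, unsigned long measured_calls,
                  unsigned long actual_calls)
{
    return measured_seconds * (double)actual_calls / (double)measured_calls;
}

/* Seconds saved when the accelerator replaces the software function
 * for actual_calls invocations. */
double seconds_saved(double software_seconds, double hardware_seconds,
                     unsigned long measured_calls, unsigned long actual_calls)
{
    return scale_time(software_seconds - hardware_seconds,
                      measured_calls, actual_calls);
}
```

For DCT32, seconds_saved(91.5, 17.5, 300000, 21600) gives about 5.3 seconds. Together with the 8 seconds saved for IDCT36, this is roughly 13.3 seconds, or about 8% of the 162 seconds needed to decode 300 frames.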
[Figure 37: execution time distribution of the DCT32 and IDCT36 modules; each bar is divided into input data, computation and output results; horizontal axis from 50 to 450]
Figure 37 shows the execution time distribution of the DCT32 and IDCT36 modules. The ratio of computation time to total execution time is 10.3% for the DCT32 module and 11.2% for IDCT36. These are maximum values, which assume that calling the interrupt handler takes no time; in practice they cannot be reached. How can this problem be solved? One direction is to assign more computation to the FSM. A second direction is to let the three phases, data input, computation and result output, work concurrently.
The advantage of a general-purpose processor is its flexibility, and its disadvantage is low performance in special applications. Adding hardware accelerators can improve the performance of a general-purpose processor effectively. The more software is replaced by accelerators, the greater the performance improvement, but the more hardware is consumed. In this project, an MP3 decoder running on LEON2 is completed, and two hardware accelerators are added to the system to improve performance. This hardware/software codesign technique shortens the time-to-market while reducing the design effort.
8.3 Summary
The goal of the project has been reached. An MP3 decoder has been completed and ported to the LEON2 platform. The decoder reads MP3 files from a flash card and sends the PCM signal to the audio D/A converter. Two accelerators are added to the system to improve performance, and their effects have been observed. The first chapter presents the motivation and definition of the project and the general principle of hardware/software codesign, on which the organization of the thesis is based. Chapter 2 gives background on MP3 and the MP3 decoder. Chapter 3 introduces the background of LEON2, which is needed to modify an existing LEON2. Chapter 4 focuses on adding new modules to LEON2: first the method is described, then a flash card module and an audio module are added to the system, and finally the conversion of the MP3 decoder to fixed point is discussed. Chapter 5 analyzes the software MP3 decoder, identifies its most time-consuming parts and explains why DCT and IDCT accelerators are added to LEON2. Chapter 6 briefly describes two fast DCT algorithms and chapter 7 discusses how the two accelerators are implemented. The last chapter presents the effect of the accelerators.
DCT accelerator for MP3 decoder run on LEON2 processor. In the following sections, we will review the whole design process and present the scope for future research.
1. Complete a real-time MP3 decoder on FPGA. In this project the accelerators improve performance, but the decoder is not powerful enough to decode MP3 in real time. Further effort can focus on the following aspects:
Decrease the area of the embedded processor on the FPGA. In this project, LEON2 consumes more than half of the FPGA area. If it consumed fewer system gates, more hardware could be implemented on the chip.
Design an MP3 decoder without any processor, so that all parts of the decoder are implemented in hardware. This may be very difficult, because in this project a single DCT module already consumes 30% of the system area.
2. Find a method that converts C code into VHDL directly. In this project, the scheduling tool can generate the FSM automatically from one type of temporary file. However, converting a large C program into the temporary file is very complicated. If the C code could be converted into VHDL code directly, the FSM design
Bibliography
[1] Luis Azuara, Pattara Kiatisevi, Design of an Audio Player as System-on-a-Chip, Institute of Computer Science, University of Stuttgart.
[2] K Salomonsen, S Sgaard, E P Larsen, Design and Implementation of an MPEG/Audio Layer III Bitstream Processor, 1997.
[3] LEON2 Processor User's Manual, from www.gaisler.com
[4] AMBA Specification, from www.gaisler.com
[5] The LEON/ERC32 GNU Cross-Compiler System, from www.gaisler.com
[6] MPEG-Layer3 Bitstream Syntax and Decoding, www.mp3-tech.org.
[7] Hsieh S. Hou, A Fast Recursive Algorithm For Computing the Discrete Cosine Transform, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-35, no. 10, pp. 1455-1461, October 1987.
[8] Szu-Wei Lee, Improved Algorithm for Efficient Computation of the Forward and Backward MDCT in MPEG Audio Coder, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 10, pp. 990-994, October 2001.