Jianwei Thesis

Hardware/Software Codesign of MP3 Decoder with 36/32-point (I)DCT Accelerators
MSc THESIS
Jianwei Wang Advisor: Dr. Rene van Leuken

Nov. 2004 to Jul. 2005
Microelectronics Department of Electrical Engineering Faculty of Electrical Engineering, Mathematics and Computer Science
Abstract
An MP3 audio decoder is designed as a System-on-a-Chip using hardware/software codesign techniques. The hardware architecture is built on the LEON SoC platform, which contains an open-source SPARC V8 compatible processor and an AMBA bus. A pre-designed flash card interface core and an audio driver module are added to this system. After a performance analysis of the decoder, the MP3 decoder is partitioned into a software part and a hardware part. The hardware part, consisting of a 36-point IDCT decoder and a 32-point DCT decoder, is designed as two accelerators connected to the AMBA bus. The final MP3 decoder decodes an MP3 stream with the help of the 36-point IDCT accelerator and the 32-point DCT accelerator.
Acknowledgements
It has been a wonderful experience to study at Delft University of Technology (TU Delft) for two years, and I am lucky to have done my Master's thesis under Dr. René van Leuken in the Circuits and Systems (CAS) group. During this period I met many intelligent and enthusiastic people; I would like to express my respect and thanks to all of them. Without their help, I could not have completed my thesis.
First of all, I would like to express my greatest appreciation to my supervisor, Dr. René van Leuken. He is an earnest and professional person. Every time I got stuck, he helped me to analyze the problems and find a solution. His way of working and his attitude to the work have left a deep impression on me. I also appreciate the help of Mr. Huib Lincklaen Arriëns. The scheduling tool he developed was very helpful to my thesis.
Finally, special thanks go to my wife and my parents. Their endless love and support are the motivation for my progress. They understand my career and have endured a long separation from me. To me they are the most lovely and trustworthy people in the world.
Jianwei Wang
Delft, The Netherlands
July 3, 2005
Contents
CONTENTS
FIGURE LIST
TABLE LIST
CHAPTER 1 INTRODUCTION
1.1 MOTIVATION
1.2 DEFINITION OF THE WORK
1.3 THE MAJOR CHALLENGES
1.4 METHODOLOGY OF HW/SW CODESIGN
1.5 ORGANIZATION OF THE THESIS
CHAPTER 2 MP3 DECODER
2.1 INTRODUCTION
2.2 BIT STREAM DECODING
2.3 INVERT QUANTIZATION
2.4 PROCESSING DATA
2.4.1 STEREO PROCESSING
2.4.2 REORDERING & ALIAS REDUCTION
2.4.3 IMDCT
2.4.4 POLYPHASE SYNTHESIS FILTER BANK
2.5 CONCLUSION
CHAPTER 3 LEON2
3.1 BACKGROUND OF LEON2
3.2 AMBA-2.0
3.2.1 THE ADVANCED HIGH-PERFORMANCE BUS (AHB)
3.2.2 THE ADVANCED PERIPHERAL BUS (APB)
3.3 FLOATING-POINT UNIT AND CO-PROCESSOR
3.4 INTERRUPT
3.4.1 INTERRUPT CONTROLLER
3.4.2 CONNECT A C FUNCTION TO INTERRUPT SOURCE
3.5 VHDL MODEL ARCHITECTURE OF LEON2
3.6 CROSS COMPILE
3.7 CONCLUSION
CHAPTER 4 ADDING NEW MODULES TO LEON2 & CONVERTING MP3 DECODER TO FIXED POINT
4.1 INTRODUCTION
4.2 METHOD OF ADDING SLAVE MODULES TO AHB
4.3 METHOD OF ADDING SLAVE MODULES TO APB
4.4 ACCESS THE NEW MODULES
4.5 ADDING THE FLASH CARD MODULE TO AHB
4.6 DRIVER MODULE OF THE AUDIO A/D CONVERTER
4.7 ADDING THE AUDIO MODULE TO APB
4.8 CONVERT MP3 DECODER FROM FLOATING-POINT VERSION TO FIXED-POINT VERSION
4.8.1 SINGLE PRECISION FLOATING-POINT
4.8.2 CONVERSION TO 32-BIT FIXED-POINT
4.9 CONCLUSION
CHAPTER 5 PERFORMANCE ANALYSIS AND HW/SW PARTITIONING
5.1 PERFORMANCE ANALYSIS
5.2 HW/SW PARTITIONING
5.3 CONCLUSION
CHAPTER 6 FAST ALGORITHM OF DCT AND INVERSE DCT
6.1 INTRODUCTION
6.2 FAST ALGORITHM OF 32-POINT DCT
6.3 FAST ALGORITHM OF 36-POINT IDCT
6.4 CONCLUSION
CHAPTER 7 IMPLEMENT OF DCT AND IDCT ACCELERATORS
7.1 THE ACCELERATOR ARCHITECTURE
7.2 SCHEDULING AND FSM DESIGN
7.3 FSM CONTROLLER DESIGN
7.4 AHB SLAVE INTERFACE DESIGN
7.5 CONNECT THE ACCELERATORS TO AHB
7.6 CONCLUSION
CHAPTER 8 TEST RESULTS AND CONCLUSION
8.1 TEST RESULTS
8.2 CONCLUSION
8.3 SUMMARY
8.4 FUTURE RESEARCH
BIBLIOGRAPHY
Figure list
Figure 1 the design flow of the general codesign approach
Figure 2 the first level function block diagram of MP3 decoder
Figure 3 the second level function block diagram of MP3 decoder
Figure 4 Format of MP3 frame, granules, subband blocks and frequency lines [2]
Figure 5 MP3 frame format
Figure 6 Decoding of bitstream block diagram [2]
Figure 7 inverse quantization block diagram
Figure 8 processing data block diagram
Figure 9 Poly Phase Synthesis filter bank [2]
Figure 10 function block diagram of MP3 decoder
Figure 11 function block diagram of LEON2 [3]
Figure 12 structure of AHB [4]
Figure 13 APB structure [4]
Figure 14 interrupt controller of LEON2
Figure 15 MP3 decoder with flash card module and A/D converter
Figure 16 LEON2 with flashcard module and audio module
Figure 17 port definition of AHB arbiter
Figure 18 an example of slave module on AHB on position 4
Figure 19 AHB address decoding
Figure 20 variable ' ahbrange_con
Figure 21 example of assigning an interrupt signal to interrupt controller
Figure 22 APB address decoding
Figure 23 instantiate the flash card module in mcore.vhd
Figure 24 function block diagram of audio driver module
Figure 25 signals of audio module added to leon port
Figure 26 fixed-point format
Figure 27 profile of the soft MP3 decoder
Figure 28 execution time from start point
Figure 29 HW/SW partitioned function block diagram of MP3 decoder
Figure 30 diagram of 2-point, 4-point and 8-point DCT [7]
Figure 31 function block of Lee's algorithm [8]
Figure 32 the structure diagram of DCT/IDCT modules
Figure 33 scheduling tool
Figure 34 IDCT36 FSM controller and DCT32 FSM controller
Figure 35 function block diagram of LEON2 with two accelerators
Figure 36 compare two DCT accelerators with software
Figure 37 execution time (cycle) distribution of (I)DCT accelerator module
Table list
Table 1 map from address bus to position number
Table 2 the port address of audio module
Table 3 control register of audio module
Table 4 the number of computations of two algorithms
Table 5 number of FSM states with different number of adders and multipliers to IDCT36
Table 6 number of FSM states with different number of adders and multipliers to DCT32
Table 7 address space and interrupt of the two accelerators
Table 8 execution time IDCT36 C function and accelerator on LEON2
Table 9 execution time IDCT32 C function and accelerator on LEON2
Table 10 final area report
Table 11 execution time of the IDCT36 without Lee's algorithm
Chapter 1
Introduction
1.1 Motivation
Today's world is full of different types of embedded systems and processors. An embedded system is a special-purpose computer built into a device. Embedded systems come in many types and sizes, ranging from a single microprocessor to a complex System-on-a-Chip. Embedded systems usually consist of hardware and software. The hardware may be a processor, an ASIC or memory and is used for performance. The software is used to provide features and flexibility. In many applications, embedded systems just act as a system controller. Current generations of silicon process technology allow designers to integrate a large number of features onto a single IC, leading to the notion of system-on-chip (SoC) design. Such a system can integrate different elements like processors, memories and application-specific circuits on one chip instead of on one board. The advantages of this technique are obvious: systems consume less power, are smaller and cost less. Present SoC design techniques also make it possible to combine large, pre-designed complex blocks (so-called cores or IP blocks) with embedded software. This feature is very important to current digital system design. High reusability of IP blocks reduces time-to-market for new products and makes systems more reliable. SoC design techniques focus on the problems of evaluating, integrating, and verifying multiple pre-existing blocks and software components. This is characterized by more in-depth system-level design, concurrent hardware/software design and verification at all levels of the design process [1].
1.2 Definition of the work

The project aims to demonstrate the use of System-on-a-Chip (SoC) technology by developing DCT/IDCT accelerators for an MP3 decoder using the hardware/software codesign technique, and to compare the accelerators with the corresponding software functions to measure their effect. The goal of the project consists of the following aspects:
- Implement an MP3 decoder on the LEON2 SoC platform.
- Design DCT/IDCT decoding modules as accelerators for the MP3 decoder.
When the project is completed, an MP3 decoding system on an FPGA will be obtained. The following tasks will be completed:
- Build an MP3 decoder on the LEON2 SoC platform.
- Analyze the performance of the MP3 decoder.
- Add a flash card module and an audio module to the embedded system.
- Design the accelerators for the MP3 decoder.
- Observe the performance of the accelerators.
The hardware/software codesign technique is used in this project, and the structure of the report follows the steps of the technique.
1.3 The major challenges

There are several challenges in the project that we must confront.
1. How to build an MP3 decoder on the LEON2 SoC platform. This challenge consists of two steps. The first is how to design an MP3 decoder for a PC, and the second is how to port the decoder to the embedded processor.
2. How to add new modules to LEON2. In this project, LEON2 is implemented on a Spartan 3 FPGA chip. It is compatible with the SPARC V8 architecture. If we want to add new modules to it, we must know where and how to add them.
3. How to design an accelerator for LEON2. Before we know how to design an accelerator, we should know why the accelerator is required. To design an accelerator, we should also know what function the accelerator will perform.
The above problems should be solved during the project.
1.4 Methodology of HW/SW codesign

Hardware/software codesign is the main technique used in the project. It can be defined as the cooperative design of hardware and software. Codesign research deals with the problem of designing heterogeneous systems. One of the goals of codesign is to shorten the time-to-market while reducing the design effort and cost of the designed products. The advantages of using processors are manifold, because software is more flexible and cheaper than hardware. This flexibility of software allows late design changes and simplifies debugging. Furthermore, the possibility of reusing software by porting it to other processors reduces the time-to-market and the design effort. Finally, in most cases the use of processors is very cheap compared to the development costs of ASICs, because processors are often produced in high volume, leading to a significant price reduction. However, the designer turns to hardware whenever processors are not able to meet the required performance. This trade-off between hardware and software illustrates the optimization aspect of the codesign problem. Codesign is an interdisciplinary activity, bringing together concepts and ideas from different disciplines, e.g. system-level modeling, hardware design and software design. The design flow of the general codesign approach is depicted in figure 1.
Step 1: The codesign process starts with specifying the system behavior at the system level.
Step 2: After this, a pure software system is developed to verify all algorithms.
Step 3: Performance analysis is performed to find the system bottlenecks.
Step 4: In the hardware/software partitioning phase, a plan is made to determine which parts will be realized in hardware and which parts will be realized in software. Typically, some system bottlenecks are moved to hardware to improve the performance.
Step 5: Based on the results of step 4, the hardware and software parts are designed.
Step 6: Co-simulation. At this step, the completed hardware and software parts are integrated and performance analysis is performed.
Step 7: If the performance meets the requirements, the design can stop. If the performance cannot meet the requirements, a new HW/SW partitioning and a new design iteration start.
Figure 1 the design flow of the general codesign approach
1.5 Organization of the thesis

The organization of the report follows the order of the HW/SW codesign steps described in the previous section. Chapter 2 presents the background of MP3 and its decoding method. The problem of how to obtain an MP3 decoder is solved in this chapter, where a pre-designed MP3 decoder is studied. Chapter 3 presents the background knowledge of LEON2. In this design, the bus system and the interrupt system of LEON2 are very important. Chapter 4 mainly shows the method of adding new hardware modules to the existing LEON2 SoC platform. The flash card module and the audio driver module are then added to LEON2. The flash card module provides the MP3 source file to the decoder, and the audio driver receives the PCM data produced by the decoder. Finally, how to port the pre-designed MP3 decoder to LEON2 is discussed. This problem mainly concerns how to convert the existing floating-point MP3 decoder into a fixed-point version. Chapter 5 shows the analysis of the soft decoder and makes a partitioning plan. The analysis identifies the most time-consuming parts of the soft decoder and presents the reason why the IDCT and DCT blocks are implemented as hardware modules. Chapter 6 describes two fast DCT and IDCT algorithms: Hou's algorithm and Lee's algorithm. In chapter 7, we design the hardware modules: the DCT and IDCT accelerators. Both accelerators consist of an FSM, an FSM controller and an interface to AHB. At the end of the chapter, we complete the modification of LEON2. In chapter 8, a performance test is performed. From the test we can observe the gain brought by the hardware accelerators. A conclusion is reached and the whole project is reviewed.
Chapter 2
MP3 Decoder
This chapter presents the background of MP3 and explains how an MP3 stream is decoded.
2.1 Introduction
MPEG-1 Audio Layer 3, also known as MP3, is currently the most popular audio compression technique. It was developed by the International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC). MPEG Audio offers three layers of compression. Layer 3 is the most complex scheme and provides the best sound quality of the three layers. Three sample rates are supported: 32 kHz, 44.1 kHz and 48 kHz. An MP3 decoder receives an MP3 bit stream and outputs a PCM format bit stream. Figure 2 shows the first level function block of the MP3 decoder.
Figure 2 the first level function block diagram of MP3 decoder
The main block in figure 2 can be divided into three parts; the second level function block of the MP3 decoder is shown in figure 3.
Figure 3 the second level function block diagram of MP3 decoder
The first part, the bit stream decoder, is in charge of reading the MP3 bit stream and parsing it. The second part, inverse quantization, rescales the signals and outputs the original spectrum data. The last part, frequency to time mapping, reproduces the audio signal from those spectrum data.
[Figure 4 content: a frame holds Granule 0 and Granule 1; each granule consists of 32 subband blocks, numbered 31 down to 0, with 18 frequency lines each.]
Figure 4 Format of MP3 frame, granules, subband blocks and frequency lines [2].
Figure 4 shows the format of MP3 frames. Each frame holds data from 2 granules (the smallest acoustic unit) and every granule consists of 576 samples. Section 2.2 explains the MP3 bit stream decoding, Section 2.3 explains how to requantize the Huffman decoded data, and Section 2.4 describes the mapping from the frequency domain to the time domain.
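The frame geometry above can be checked with a few constants. The following is a minimal sketch in C, not part of the thesis decoder: the names are hypothetical, and it assumes the 576 samples of a granule are the 32 subband blocks times 18 frequency lines shown in figure 4, counted per channel.

```c
#include <assert.h>

/* Constants taken from the text and figure 4 (names are illustrative). */
enum {
    MP3_SUBBANDS = 32,
    MP3_LINES_PER_SUBBAND = 18,
    MP3_GRANULES_PER_FRAME = 2,
    MP3_SAMPLES_PER_GRANULE = MP3_SUBBANDS * MP3_LINES_PER_SUBBAND,
    MP3_SAMPLES_PER_FRAME = MP3_GRANULES_PER_FRAME * MP3_SAMPLES_PER_GRANULE
};

/* Playback time of one frame in microseconds (rounded down)
   for a given sample rate in Hz. */
static long mp3_frame_duration_us(long sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return (MP3_SAMPLES_PER_FRAME * 1000000L) / sample_rate_hz;
}
```

Under these assumptions a frame carries 1152 samples per channel, so at 44.1 kHz the decoder has roughly 26 ms to produce each frame if it is to run in real time.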
2.2 Bit stream decoding

An MP3 bit stream consists of a stream of frames whose format is shown in figure 5:
Figure 5 MP3 frame format
A frame mainly includes three parts: header and CRC, side information, and main data.
Frame header: In each frame, the first 32 bits are the header. In the header, the first 12 bits, equal to FFF (hex), are the synchronization word. The other 20 bits carry information such as format, version, bit rate, sampling frequency, stereo mode, and copyright.
Side information: The information used in inverse quantization and Huffman decoding is included in the side information section. In mono mode, the size of the side information part is 17 bytes; in two-channel/stereo mode, the size is 256 bits (32 bytes).
Main data: The main data are divided into two parts, scalefactors and Huffman code. The former are grouped into scalefactor bands. The length of the scalefactor part depends on whether scalefactors are reused, and also on the window length (short or long). The scalefactors are used in the requantization of the samples [2].
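The header and side information rules above can be sketched as follows. This is an illustrative C fragment, not the thesis code: the function names are hypothetical, and it assumes the four header bytes are assembled in stream order with the first byte most significant.

```c
#include <assert.h>
#include <stdint.h>

/* Assembles the 32-bit header from four bytes in stream order
   (assumption: first byte is the most significant). */
static uint32_t mp3_header_from_bytes(const uint8_t b[4])
{
    assert(b != 0);
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Returns 1 when the first 12 bits of the header are the
   synchronization word FFF (hex), 0 otherwise. */
static int mp3_has_sync_word(uint32_t header)
{
    return (header >> 20) == 0xFFFu;
}

/* Side information size in bytes: 17 for mono,
   32 (256 bits) for two-channel/stereo mode. */
static int mp3_side_info_bytes(int is_mono)
{
    return is_mono ? 17 : 32;
}
```

A decoder would typically scan the byte stream until `mp3_has_sync_word` succeeds and then parse the remaining 20 header bits; that parsing is omitted here.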
[Figure 6 block labels: Bitstream in, Synchronization, Huffman decoding (Huffman code bits in, magnitude & sign out), Huffman info decoding (Huffman information), Scalefactor decoding (scalefactor information).]
Figure 6 Decoding of bitstream block diagram [2]
Figure 6 shows the bitstream decoding block. Besides synchronization, three decoding parts are required.
Huffman decoding: Huffman decoding is executed in this part. Huffman coding is a variable-length coding method, so the decoding process must begin at the first data bit.
Huffman info decoding: This part collects all information about the Huffman decoding from the header and side information parts.
Scalefactor decoding: This part collects all scalefactor information from the header and side information parts and decodes the scalefactors in the main data part.
2.3 Invert Quantization

Inverse quantization takes the Huffman decoded data and rescales them into the original spectrum values. Figure 7 shows the diagram of this part. The following formula performs the inverse quantization:
$$ xr_i = is_i^{4/3} \cdot 2^{0.25\,C} $$
Here $is_i$ is the output of the Huffman decoder. For negative $is_i$, the power is applied to the magnitude and the sign is restored afterwards. The factor C in the equation consists of global and scalefactor band dependent gain factors from the side information and the scalefactors.
Figure 7 inverse quantization block diagram
2.4 Processing data

After inverse quantization, the data are restored to the original spectrum values; further processing is required to map those data to a PCM signal in the time domain. The processing includes stereo processing, reordering & alias reduction, IMDCT and polyphase synthesis, as figure 8 shows.
Figure 8 processing data block diagram
2.4.1 Stereo processing

There are three types of joint stereo mode: MS stereo, intensity stereo, and MS intensity stereo. In these modes the 576 samples of each channel are not the original left and right channel values; instead, information is transmitted from which the original left and right channel values are rebuilt.
MS stereo mode: The left and right channels are used to transmit the normalized middle and side channel values, respectively. The following formulas convert the middle and side values into the left and right values:
$$ left_i = \frac{M_i + S_i}{\sqrt{2}} \quad \text{and} \quad right_i = \frac{M_i - S_i}{\sqrt{2}} $$
Intensity stereo mode: Intensity stereo is done by specifying the magnitude (via the scalefactors of the left channel) and a stereo position is_pos[sfb], which is transmitted instead of the scalefactors of the right channel. The stereo position is used to derive the left and right channel signals.
MS intensity stereo mode: The values M and S are transmitted in the left and right channels, instead of left and right as in intensity stereo mode. The calculation is the same as in intensity stereo mode.
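The MS stereo conversion can be sketched in C as follows. This is an illustration only, assuming the conversion left = (M + S)/√2 and right = (M − S)/√2 and an in-place layout where the first channel buffer arrives holding M and the second holding S; the function name is hypothetical.

```c
#include <assert.h>
#include <math.h>

/* Converts middle/side values to left/right values in place.
   On entry ch0 holds M and ch1 holds S; on return ch0 holds left
   and ch1 holds right. n is the number of samples per channel. */
static void mp3_ms_stereo(double *ch0, double *ch1, int n)
{
    const double inv_sqrt2 = 1.0 / sqrt(2.0);
    int i;

    assert(ch0 != 0 && ch1 != 0 && n >= 0);
    for (i = 0; i < n; i++) {
        double m = ch0[i];
        double s = ch1[i];
        ch0[i] = (m + s) * inv_sqrt2;
        ch1[i] = (m - s) * inv_sqrt2;
    }
}
```

For one granule the function would be called with n = 576. Intensity stereo needs the stereo position per scalefactor band and is not covered by this sketch.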
2.4.2 Reordering & Alias Reduction

If short blocks are used, the rescaled data shall be reordered into polyphase subband order. Alias reduction then removes aliasing artifacts: during MP3 encoding, aliasing is introduced when the data pass through the polyphase filter bank, so alias reduction is necessary to obtain correctly reconstructed values.
2.4.3 IMDCT
IMDCT stands for inverse modified discrete cosine transform. It transforms the frequency lines into polyphase filter subband samples. It is calculated by the following formula:
$$ x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left( \frac{\pi}{2n} \left( 2i + 1 + \frac{n}{2} \right) (2k + 1) \right) \qquad \text{for } i = 0 \text{ to } n-1 $$
For long blocks n = 36; for short blocks n = 12.
2.4.4 Polyphase Synthesis filter bank

The polyphase synthesis filter bank transforms the 32 subband blocks of 18 time-domain samples in each granule to 18 blocks of 32 PCM samples [2]. Each time, 32 samples, one from each of the 32 subbands, are applied to the synthesis polyphase filter bank and 32 consecutive audio samples are calculated. Figure 9 shows this procedure.
Figure 9 Poly Phase Synthesis filter bank [2]
2.5 Conclusion
Figure 10 function block diagram of MP3 decoder
The previous sections describe the format of MP3 and give an overview of the MP3 decoder. A detailed MP3 decoder block diagram is shown in figure 10. An MP3 decoder in the C language has been completed. It decodes an MP3 file under the Linux operating system on a Pentium IV processor and outputs the PCM signal to the audio card. The next step should be the performance analysis of the soft decoder. However, the desired MP3 decoder must run on the embedded processor instead of the Pentium, so before the analysis we should port the MP3 decoder to the embedded system. The next chapter introduces some background knowledge about the embedded processor LEON2.
Chapter 3
LEON2
The previous chapter presents some basic knowledge about the MP3 decoder. The decoder reads an MP3 source file and outputs PCM signals to the audio card of a PC. Compared with LEON2, the personal computer is so powerful that decoding MP3 in real time is easy. It also has a complete hard disk system and audio system. The absence of these features on LEON2 makes it harder to set up an MP3 decoder on the embedded system. In this chapter, the background knowledge of LEON2 is introduced. With that knowledge, we try to find a way to port the MP3 decoder to LEON2. Section 3.1 offers brief background knowledge of LEON2. Section 3.2 describes the bus specification of AMBA 2.0. Section 3.4 describes the interrupt system of LEON2.
3.1 Background of LEON2

LEON2 is a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture. It is designed for embedded applications and has the following features on-chip: separate instruction and data caches, hardware multiplier and divider, interrupt controller, debug support unit with trace buffer, two 24-bit timers, two UARTs, power-down function, watchdog, 16-bit I/O port, flexible memory controller, Ethernet MAC and PCI interface. It is highly configurable, which means the configuration can be changed for different applications. Figure 11 shows the function blocks of LEON2.
[Figure 11 blocks: 5-stage integer unit with FPU, co-processor (CP), local RAM, I-cache, D-cache and MMU; debug support unit with debug serial link; PCI; Ethernet; AHB controller; memory controller with PROM, I/O, SDRAM and SRAM; AHB/APB bridge; UARTs; timers; interrupt controller; I/O port; AMBA AHB and AMBA APB buses]
Figure 11 function block diagram of LEON2 [3]
Figure 11 shows that AMBA is the communication center of LEON2: every unit is connected to it except the floating-point unit, the co-processor and the local RAM. Generally, high-speed units are connected to the AMBA AHB and low-speed units to the AMBA APB. Some units, such as the PCI unit and the Ethernet unit, are connected to both AHB and APB. In this case, the APB is used to access the control registers of the unit and the AHB is used to exchange data.
3.2 AMBA-2.0
The Advanced Microcontroller Bus Architecture (AMBA) is a specification that defines an on-chip communications standard for designing high-performance embedded microcontrollers. It consists of three distinct buses that meet different requirements:
The Advanced High-performance Bus (AHB)
The Advanced System Bus (ASB)
The Advanced Peripheral Bus (APB)
In LEON2, only the AHB and APB are used, so only these two buses are explained in the following sections.
3.2.1 The Advanced High-performance Bus (AHB)

AHB is a high-performance, high clock frequency bus. It mainly acts as a high-performance system backbone in an embedded processor. In LEON2, it connects all high-speed units such as the integer unit, local RAM unit, debug unit and memory controller unit. Additionally, AHB is specified to ensure ease of use in an efficient design flow using synthesis and automated test techniques. The AMBA AHB bus protocol is designed to be used with a central multiplexer interconnection scheme. There are three types of modules on the bus: master, slave and arbiter. A typical AMBA AHB system design contains the following components [4]:
AHB master: A bus master is able to initiate read and write operations by providing an address and control information. Only one bus master is allowed to actively use the bus at any one time.
AHB slave: A bus slave responds to a read or write operation within a given address-space range. The bus slave signals back to the active master the success, failure or waiting of the data transfer.
AHB arbiter: The bus arbiter ensures that only one bus master at a time is allowed to initiate data transfers. Even though the arbitration protocol is fixed, any arbitration algorithm, such as highest priority or fair access, can be implemented depending on the application requirements. An AHB would include only one arbiter, although this would be trivial in single bus master systems.
AHB decoder: The AHB decoder is used to decode the address of each transfer and provide a select signal for the slave that is involved in the transfer. A single centralized decoder is required in all AHB implementations [4].
The AHB bus can connect up to 16 masters and any number of slaves. The LEON2 processor core is normally connected as master 0, while the memory controller and APB bridge are connected as slaves 0 and 1 [3]. The AHB controller (AHBARB) controls the AHB bus and implements the bus decoder/multiplexer as well as the bus arbiter. The arbitration scheme is fixed priority, where the bus master with the highest index has the highest priority. The processor is by default put on the lowest index. Each AHB master is connected to the bus through two records, corresponding to the AHB signals as defined in the AMBA 2.0 standard:
-----------------------------------------------------------------------------
-- Definitions for AMBA(TM) AHB Masters
-----------------------------------------------------------------------------
-- AHB master inputs (HCLK and HRESETn routed separately)
type AHB_Mst_In_Type is record
  HGRANT:  Std_ULogic;                           -- bus grant
  HREADY:  Std_ULogic;                           -- transfer done
  HRESP:   Std_Logic_Vector(1 downto 0);         -- response type
  HRDATA:  Std_Logic_Vector(HDMAX-1 downto 0);   -- read data bus
end record;
-- AHB master outputs
type AHB_Mst_Out_Type is record
  HBUSREQ: Std_ULogic;                           -- bus request
  HLOCK:   Std_ULogic;                           -- lock request
  HTRANS:  Std_Logic_Vector(1 downto 0);         -- transfer type
  HADDR:   Std_Logic_Vector(HAMAX-1 downto 0);   -- address bus (byte)
  HWRITE:  Std_ULogic;                           -- read/write
  HSIZE:   Std_Logic_Vector(2 downto 0);         -- transfer size
  HBURST:  Std_Logic_Vector(2 downto 0);         -- burst type
  HPROT:   Std_Logic_Vector(3 downto 0);         -- protection control
  HWDATA:  Std_Logic_Vector(HDMAX-1 downto 0);   -- write data bus
end record;
Each AHB slave is similarly connected through two records:

-----------------------------------------------------------------------------
-- Definitions for AMBA(TM) AHB Slaves
-----------------------------------------------------------------------------
-- AHB slave inputs (HCLK and HRESETn routed separately)
type AHB_Slv_In_Type is record
  HSEL:      Std_ULogic;                         -- slave select
  HADDR:     Std_Logic_Vector(HAMAX-1 downto 0); -- address bus (byte)
  HWRITE:    Std_ULogic;                         -- read/write
  HTRANS:    Std_Logic_Vector(1 downto 0);       -- transfer type
  HSIZE:     Std_Logic_Vector(2 downto 0);       -- transfer size
  HBURST:    Std_Logic_Vector(2 downto 0);       -- burst type
  HWDATA:    Std_Logic_Vector(HDMAX-1 downto 0); -- write data bus
  HPROT:     Std_Logic_Vector(3 downto 0);       -- protection control
  HREADY:    Std_ULogic;                         -- transfer done
  HMASTER:   Std_Logic_Vector(3 downto 0);       -- current master
  HMASTLOCK: Std_ULogic;                         -- locked access
end record;
-- AHB slave outputs
type AHB_Slv_Out_Type is record
  HREADY:    Std_ULogic;                         -- transfer done
  HRESP:     Std_Logic_Vector(1 downto 0);       -- response type
  HRDATA:    Std_Logic_Vector(HDMAX-1 downto 0); -- read data bus
  HSPLIT:    Std_Logic_Vector(15 downto 0);      -- split completion
end record;
Master units can access the whole address space from 0x00000000 to 0xFFFFFFFF. Figure 12 illustrates the structure of AHB.
Figure 12 structure of AHB[4]
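The fixed-priority arbitration described in this section (the requesting master with the highest index wins) can be sketched in C as follows. This is a behavioral model for illustration only, not the AHBARB VHDL code: the function name ahb_grant, the bitmask encoding of the HBUSREQ lines, and the fallback to a default master when no request is pending are assumptions.

```c
#include <stdint.h>

/* Returns the index of the master that is granted the bus.
 * busreq has bit m set when master m asserts HBUSREQ (assumed encoding).
 * Fixed priority: the requesting master with the highest index wins.
 * Assumption: if nobody requests, the default master is granted. */
int ahb_grant(uint16_t busreq, int masters, int default_master)
{
    for (int m = masters - 1; m >= 0; m--) {
        if (busreq & (1u << m))
            return m;
    }
    return default_master;
}
```

With the processor on index 0, any other requesting master takes the bus from it, which matches the statement that the processor sits on the lowest-priority position by default.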
3.2.2 The Advanced Peripheral Bus (APB)
Figure 13 APB structure [4]
Compared with AHB, APB is a bus with low speed, low power consumption and low interface complexity. It is intended for low-speed peripheral units such as a serial port or a keypad. The slaves are connected to APB through a pair of records containing the APB signals:
-----------------------------------------------------------------------------
-- Definitions for AMBA(TM) APB Slaves
-----------------------------------------------------------------------------
-- APB slave inputs (PCLK and PRESETn routed separately)
type APB_Slv_In_Type is record
  PSEL:    Std_ULogic;                           -- slave select
  PENABLE: Std_ULogic;                           -- strobe
  PADDR:   Std_Logic_Vector(PAMAX-1 downto 0);   -- address bus (byte)
  PWRITE:  Std_ULogic;                           -- write
  PWDATA:  Std_Logic_Vector(PDMAX-1 downto 0);   -- write data bus
end record;
-- APB slave outputs
type APB_Slv_Out_Type is record
  PRDATA:  Std_Logic_Vector(PDMAX-1 downto 0);   -- read data bus
end record;
The number of APB slaves is defined by the APB_SLV_MAX constant in the TARGET package. The APB address space is from 0x80000000 to 0x8FFFFFFF.
3.3 Floating-point unit and co-processor

In the function block diagram of LEON2 a floating-point unit is present. LEON2 models two interface options for a floating-point unit: a parallel interface, or an integrated interface where FP instructions do not execute in parallel with IU instructions. In fact, the free version of LEON2 only provides a floating-point unit interface, not a floating-point unit itself. Even if LEON2 provided an FPU, floating-point instructions would take much more time than fixed-point instructions. That means we should convert the floating-point MP3 decoder into a fixed-point decoder. The conversion method is presented in the following chapter. LEON can be configured to provide a generic interface to a special-purpose co-processor [3]. The interface allows an execution unit to operate in parallel to increase performance. One co-processor instruction can be started each cycle as long as there are no data dependencies. When finished, the result is written back to the co-processor register file.
3.4 Interrupt
3.4.1 Interrupt controller

LEON2 has one interrupt controller and one optional secondary interrupt controller. Figure 14 shows the block diagram of the interrupt controller. Its output is sent to the integer unit and causes interrupts. The controller can handle 15 internal and external interrupts, which are divided into two priority levels [3]. The optional secondary interrupt controller is used to add up to 32 additional interrupts.
Figure 14 interrupt controller of LEON2
In this project, the PCI unit, the Ethernet unit and the DSU are disabled, so interrupts 11 to 14 are available.
3.4.2 Connect a C function to interrupt source

The standard C-library provides interrupt support. Through the catch_interrupt call the function to be used as interrupt handler is assigned:
extern void *catch_interrupt(void func(int irq), int irq);
catch_interrupt() only associates a function with an interrupt; unmasking and enabling of interrupts have to be done by the application, typically by programming certain registers.
3.5 VHDL model architecture of LEON2

The latest version of LEON2 can be downloaded from www.gaisler.com. In this project, LEON2 version 1.0.27 is used. The LEON VHDL model is designed to be used for both synthesis and board-level simulation [3]. It is therefore written using rather high-level VHDL constructs, mostly sequential statements. Typically, each module contains only two processes: one combinational process describing all functionality and one process implementing the registers. Records are used extensively to group signals according to their functionality. In particular, signals between modules are passed in records. There are 72 VHDL files in the /leon directory. Some of these files are important when adding new modules to LEON2:
leon.vhd: the top-level entity of LEON2 with the basic configuration. Only the ports of LEON2 are defined in this file, so any modification of the input and output ports of LEON2 involves this file. leon_pci.vhd defines the top-level entity of LEON2 with the PCI module.
mcore.vhd: the main core of LEON2. It contains the details of how LEON2 is configured. All AHB and APB components are instantiated in this file. Since the interrupt controller is connected to APB, all interrupt signals have to be passed to the interrupt controller in this file.
device.vhd: holds the current configuration information. Only the configuration tool can change this file; other modules get their configuration information from it.
apbmst.vhd: this file defines the AHB to APB bridge. The bridge acts as master on APB. Some address vectors are saved in this file. Those address vectors are used when APB modules are accessed.
3.6 Cross compile

LEON2 is compliant with the SPARC V8 architecture. That means GCC can be used to compile C programs for it. This feature is very important to this project because the existing MP3 decoder in C can be ported to LEON2 without much modification. The LEON/ERC32 GNU cross-compiler system (LECCS) can be downloaded from www.gaisler.com. LECCS is a multi-platform development system based on the GNU family of freely available tools with additional point tools developed by Cygnus, OAR and Gaisler Research [5]. LECCS consists of the following packages:
GCC-3.2.3 C/C++ compiler
GNU binary utilities 2.13.1 with support for LEON UMAC/SMAC instructions
RTEMS-4.6.0-beta C/C++ real-time kernel with LEON and ERC32 support
Newlib-1.11 standalone C-library
GDB-5.3 SPARC cross-debugger
Remote debugging monitor (rdbmon-1.3.6)
DDD graphical front-end for GDB (unix only)
GDB-TK graphical front-end for GDB (Windows only)
MKPROM-1.3.9 boot-prom builder
DSUMON-1.0.11 LEON debug support unit monitor
To compile and link an application for LEON2, use sparc-rtems-gcc as follows:
sparc-rtems-gcc -g -mv8 -O3 rtems-hello.c -o rtems-hello.exe
The executable file has to be converted into an SRECORD file, because only files in this format can be downloaded to LEON2 from the PC. To create an SRECORD file for a PROM programmer, use objcopy:
sparc-rtems-objcopy -O srec hello.exe hello.srec
3.7 Conclusion
This chapter presented basic information about the hardware and software of LEON2. The bus system and the interrupt controller are used when new modules are added to LEON2. The next chapter presents the method of adding new modules to the LEON2 SoC platform.
Chapter 4
Adding New Modules to LEON2 & Converting MP3 Decoder to Fixed Point
Chapter 3 gave a brief background of LEON2. A real MP3 player has to access a file system to get the MP3 source data and has to convert the digital audio into sound. This chapter explains how to add two new hardware modules to the existing LEON2. The first one is the flash card driver module and the second one is the audio driver module.
4.1 Introduction
The previous chapter introduced AMBA. The three buses defined in the AMBA specification are AHB, ASB and APB. AHB is in charge of high-speed communication and APB is in charge of low-speed data exchange. From this point of view, new modules can be added to AHB or APB. Another way to add a new module to LEON2 is to make it a co-processor. The concept of the co-processor was introduced in section 3.3. The co-processor and the main processor can execute instructions concurrently. The co-processor executes only the instructions it recognizes and ignores the others. The interface between the processor and the co-processor is dedicated, which makes the cooperation fast. But the main processor has a five-stage pipeline, and matching the co-processor to this pipeline is complicated. Compared to adding new modules to AMBA, adding a co-processor is much more complicated, although it is more efficient. So, in this design, all new modules are added to AMBA as accelerators.
In a real MP3 player, two problems have to be solved: where the MP3 data comes from and where the PCM signals are sent. Two modules are therefore required. The first one provides the MP3 source file and the second one converts the PCM signal into sound. Figure 15 shows the function block diagram of the MP3 decoder with the flash card module and the A/D converter. The flash card module provides the MP3 file to the decoder and the D/A converter generates sound from the PCM codes.
Figure 15 MP3 decoder with flash card module and A/D converter
Figure 16 illustrates the new LEON2 system after adding the two modules. An IDE bus module is connected to AHB and a standard flash card module is connected to the IDE bus. The audio driver is added to APB and sends PCM data to the audio D/A converter.
Figure 16 LEON2 with flashcard module and audio module
4.2 Method of adding slave modules to AHB

Adding a new slave module to AHB consists of the following steps:
Find an available position on AHB.
Add the new module to AHB at that position.
Allocate an address to the module.
If an interrupt signal is required, add it to the interrupt controller at an available position.
Update the port of the leon entity.
Now we explain these five steps one by one.
1. Find an available position on AHB. The AHB arbiter module can be considered as the bus driver. Figure 17 shows its port definition. The bus ports slvi and slvo are defined as interface arrays. The interface of any slave module has to be added to the interface arrays as an element. The position should be between 0 and AHB_SLV_MAX-1, the index range of slvi. Figure 18 shows an example in which a RAM component aram0 is added at position 4 on AHB. The file mcore.vhd shows which positions are already occupied.
2. Add the new module to AHB at that position. At this step, an available position on AHB has been found for the new module. Section 3.2.1 explained that the AHB bus can connect any number of slaves. From the VHDL point of view, adding a new module to AHB means instantiating it in mcore.vhd and adding its AHB interface to the interface arrays at the prepared position.
entity ahbarb is
  generic (
    masters : integer := 2;   -- number of masters
    defmast : integer := 0    -- default master
  );
  port (
    rst  : in  std_logic;
    clk  : in  clk_type;
    msti : out ahb_mst_in_vector(0 to masters-1);
    msto : in  ahb_mst_out_vector(0 to masters-1);
    slvi : out ahb_slv_in_vector(0 to AHB_SLV_MAX-1);
    slvo : in  ahb_slv_out_vector(0 to AHB_SLV_MAX)
  );
end;
Figure 17 port definition of AHB arbiter
aram0 : if AHBRAMEN generate
  aram : ahbram
    generic map (AHBRAM_BITS)
    port map (rst, clk, ahbsi(4), ahbso(4));
end generate;
Figure 18 an example of slave module on AHB on position 4
3. Allocate an address to the module. Every module on AMBA has one or more addresses. An address consists of two parts. The first part is decoded into the select signal sent to each module: hsel for AHB slave modules and psel for APB modules. Only the module whose select signal is active can access AHB or APB. This signal is decoded from some bits of the address bus. The second part is the rest of the address bus. Figure 19 shows that the highest four bits of the AHB address bus are passed to a mapping table as an index, and the value stored at that index is output to a decoder. The decoder decodes the value and outputs the hsel signals, one for each AHB slave module. Only one hsel signal can be high at any time, and only the selected slave module can access AHB. The mapping table is stored in the device.vhd file as an array constant named ahbrange_config.
Figure 19 AHB address decoding
constant ahbrange_config : ahbslv_addr_type := (0,0,0,0,0,0,0,0,1,2,7,4,7,7,6,5);

Figure 20 variable ' ahbrange_con
The constant ahbrange_config shown in figure 20 is defined as an array with 16 elements. These 16 elements correspond to the 16 values of the highest four bits of the AHB address bus. Table 1 lists the mapping from the address to the position number on AHB.
MSB 4 bits of address bus   Index   Position number        Address space
0000                        0       ahbrange_config (0)    0x00000000-0x0FFFFFFF
0001                        1       ahbrange_config (1)    0x10000000-0x1FFFFFFF
0010                        2       ahbrange_config (2)    0x20000000-0x2FFFFFFF
0011                        3       ahbrange_config (3)    0x30000000-0x3FFFFFFF
0100                        4       ahbrange_config (4)    0x40000000-0x4FFFFFFF
0101                        5       ahbrange_config (5)    0x50000000-0x5FFFFFFF
0110                        6       ahbrange_config (6)    0x60000000-0x6FFFFFFF
0111                        7       ahbrange_config (7)    0x70000000-0x7FFFFFFF
1000                        8       ahbrange_config (8)    0x80000000-0x8FFFFFFF
1001                        9       ahbrange_config (9)    0x90000000-0x9FFFFFFF
1010                        10      ahbrange_config (10)   0xA0000000-0xAFFFFFFF
1011                        11      ahbrange_config (11)   0xB0000000-0xBFFFFFFF
1100                        12      ahbrange_config (12)   0xC0000000-0xCFFFFFFF
1101                        13      ahbrange_config (13)   0xD0000000-0xDFFFFFFF
1110                        14      ahbrange_config (14)   0xE0000000-0xEFFFFFFF
1111                        15      ahbrange_config (15)   0xF0000000-0xFFFFFFFF
Table 1 map from address bus to position number
For example, if the highest four bits of the AHB address bus are 0010, the element of ahbrange_config at index 2 is looked up. Its value is 0, which means the slave module at position 0 on AHB responds to the address. The position number and the constant ahbrange_config together determine the address of a new module.
4. If an interrupt signal is required, add it to the interrupt controller at an available position. The interrupt controller was introduced in section 3.4. The main controller can handle 15 interrupts. It is instantiated in the mcore.vhd file. It provides an interrupt array, irqi.irq, in which every element corresponds to one interrupt signal. Figure 21 shows an example in which the interrupt signal timo.irq(1) is assigned to the interrupt controller at position 9. The position number is the interrupt number that is used in software.
irqi.irq(9) <= timo.irq(1);
Figure 21 example of assigning a interrupt signal to interrupt controller
5. Update the port of the leon entity if necessary. If the new module needs to communicate with something outside the SoC, new ports have to be added to the leon entity.
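The address decoding of step 3 can be modeled in a few lines of C. This is only a software sketch of the table lookup, not the VHDL decoder: the table values are copied from figure 20, while the function names ahb_slave_position and ahb_hsel and the one-hot encoding of the hsel vector are assumptions made for illustration.

```c
#include <stdint.h>

/* Copy of the ahbrange_config table from figure 20: entry i is the AHB
 * slave position that answers when address bits 31..28 equal i. */
static const uint8_t ahbrange_config[16] =
    {0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 7, 4, 7, 7, 6, 5};

/* Position of the slave that responds to a given AHB address. */
int ahb_slave_position(uint32_t haddr)
{
    return ahbrange_config[haddr >> 28];
}

/* Assumed one-hot HSEL vector: bit p is set for the selected position p. */
uint16_t ahb_hsel(uint32_t haddr)
{
    return (uint16_t)(1u << ahb_slave_position(haddr));
}
```

For example, ahb_slave_position(0x20000000) gives 0, as in the example for the address bits 0010, and every address in 0x80000000-0x8FFFFFFF maps to position 1.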
4.3 Method of adding slave modules to APB
Figure 22 APB address decoding
The method of adding new modules to APB is almost the same as for AHB. The only difference is the address allocation. The address range of all APB modules is from 0x80000000 to 0x8FFFFFFF and the AHB to APB bridge is the master of APB. When a master module on AHB wants to access a module on APB, address(9 downto 2) is compared with pre-defined vectors stored in the apbmst.vhd file. When they are equal, a position number is generated and the corresponding APB module is selected. Figure 22 shows the APB address decoding block diagram. So adding a new module to APB requires adding new address vectors to this look-up table.
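The comparison of address(9 downto 2) with the stored vectors can be sketched in C. The sketch below only contains the three vectors and the position number that section 4.7 gives for the audio module; the real look-up table in apbmst.vhd has more entries. The function name apb_select and the return value -1 for an unmatched address are assumptions made for illustration.

```c
#include <stdint.h>

/* Address bits 9..2 of the three audio registers, i.e. the vectors
 * "11010000", "11010001" and "11010010" from section 4.7. */
#define AUDIO_VEC_FIRST 0xD0u
#define AUDIO_VEC_LAST  0xD2u
#define AUDIO_APB_POS   15      /* APB position used for the audio module */

/* Returns the selected APB position, or -1 if the address matches no
 * entry of this deliberately incomplete table (hypothetical convention). */
int apb_select(uint32_t addr)
{
    uint32_t vec = (addr >> 2) & 0xFFu;   /* address(9 downto 2) */
    if (vec >= AUDIO_VEC_FIRST && vec <= AUDIO_VEC_LAST)
        return AUDIO_APB_POS;
    return -1;
}
```

The vectors agree with the port addresses of the audio module: 0x80000340 has the low ten bits 1101000000, so bits 9..2 are 11010000.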
4.4 Access the new modules

Modules on AHB and APB are accessed in the same way. From the software point of view, a pointer that points to the hardware module is required. In addition, the volatile keyword must be used when reading a hardware register that is modified by hardware. If the pointer is not declared volatile, the C compiler may keep the value of the register in a CPU register, so the program keeps seeing an old value.
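A minimal sketch of such an access is shown below. The helper names wait_for_bits and write_reg are hypothetical, and the polling limit is an assumption added so that the loop always terminates; the point of the example is only the volatile qualifier on the pointer type.

```c
#include <stdint.h>

/* Poll a memory-mapped register until (value & mask) is non-zero or
 * max_polls reads have been made. Returns 1 on success, 0 on timeout.
 * The volatile qualifier forces a real load on every iteration; without
 * it the compiler may read once and loop on the stale copy. */
int wait_for_bits(volatile uint32_t *reg, uint32_t mask, unsigned max_polls)
{
    while (max_polls-- > 0) {
        if (*reg & mask)
            return 1;
    }
    return 0;
}

/* Write a register through the same kind of pointer. */
void write_reg(volatile uint32_t *reg, uint32_t value)
{
    *reg = value;
}
```

On the target, the pointer is obtained by casting the module address, for example (volatile uint32_t *)0x80000348 for the control register of the audio module described in section 4.6.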
4.5 Adding the flash card module to AHB

The flash card module is a pre-designed IP block, so its internals need not be known. Only the steps for adding the block to LEON2 are presented here.
1. Add the module as an AHB slave component at position 4 by instantiating it in the mcore.vhd file, as figure 23 shows.
atahost : ata_host
  port map (
    clk          => clk,
    reset        => rst,
    ahbo         => ahbso(4),
    ahbi         => ahbsi(4),
    intr         => atairq,
    RST_I        => '0',
    ata_in       => ata_in,
    ata_out      => ata_out,
    ata_io       => ata_io,
    error        => ata_err,
    pccard_addr  => pccard_addr,
    pccard_cseln => pccard_cseln,
    pccard_oen   => pccard_oen,
    pccard_wen   => pccard_wen,
    pccard_regn  => pccard_regn,
    pccard_cd    => pccard_cd,
    pccard_cd1n  => pccard_cd1n,
    pccard_cd2n  => pccard_cd2n,
    pccard_vs1n  => pccard_vs1n,
    pccard_vs2n  => pccard_vs2n,
    pcpwr_ready  => pcpwr_ready,
    pcpwr_clk    => pcpwr_clk,
    pcpwr_data   => pcpwr_data,
    pcpwr_latch  => pcpwr_latch,
    pcpwr_reset  => pcpwr_reset
  );
Figure 23 instantiate the flash card module in mcore.vhd
2. Allocate the address space 0xC0000000-0xCFFFFFFF to the module by changing the element of ahbrange_config at index 12 to 4.
3. Update the port of the leon entity.
4.6 Driver module of the audio A/D converter

The AK4520A on the extension board is a stereo CMOS A/D & D/A converter for musical instruments. This section describes how to drive the chip from LEON2. The AK4520A exchanges data with the driver circuit in serial format. Four serial data modes are supported. On the extension board, the chip is already set to mode 2, which means that input and output are both 20 bits wide. Only five signals matter here:
MCLK: master clock input
SCLK: audio serial data clock
LRCK: input/output channel clock
SDTO: audio serial data output
SDTI: audio serial data input
The main clock frequency of LEON2 is 25 MHz, so MCLK is set to 12.5 MHz, which is close to the desired 12.288 MHz. For the nominal 12.288 MHz master clock, SCLK is 3.072 MHz and LRCK is 48.0 kHz. Figure 24 shows the block diagram of the driver module.
Figure 24 function block diagram of audio driver module
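The relation between the three clocks can be checked with a small calculation. The sketch below assumes that SCLK and LRCK are derived from MCLK with the ratios implied by the nominal figures (12.288 MHz / 48 kHz = 256 and 3.072 MHz / 48 kHz = 64); how the clock generator of the driver module actually divides the clocks is not stated in the text, and the names audio_clocks and derive_audio_clocks are hypothetical.

```c
/* Ratios implied by the nominal figures in the text (assumption):
 * MCLK = 256 * LRCK and SCLK = 64 * LRCK. */
#define MCLK_PER_LRCK 256
#define SCLK_PER_LRCK 64

typedef struct {
    double mclk_hz;
    double sclk_hz;
    double lrck_hz;
} audio_clocks;

/* Derive the three audio clocks from the system clock and the divider
 * that generates MCLK (2 for 25 MHz -> 12.5 MHz). */
audio_clocks derive_audio_clocks(double sys_clk_hz, int mclk_divider)
{
    audio_clocks c;
    c.mclk_hz = sys_clk_hz / mclk_divider;
    c.lrck_hz = c.mclk_hz / MCLK_PER_LRCK;
    c.sclk_hz = c.lrck_hz * SCLK_PER_LRCK;
    return c;
}
```

Under these assumptions, derive_audio_clocks(25e6, 2) yields an MCLK of 12.5 MHz, an SCLK of 3.125 MHz and an LRCK of about 48.83 kHz, that is, roughly 1.7 percent above the nominal values.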
The module includes two buffers, each of which can hold eight data words. When the output buffer contains fewer than four words, an interrupt signal is sent to the interrupt controller. When the input buffer contains more than four words, an interrupt signal is also sent to the interrupt controller. So it is best that every access reads or writes four words. The clock generator generates the three clock signals MCLK, SCLK and LRCK. The serial-to-parallel block receives serial data from the A/D converter and assembles them into 4-byte wide data. The parallel-to-serial block converts the 32-bit wide PCM data into a serial bit stream and sends it to the D/A converter. The APB slave interface is in charge of accessing APB. Three port addresses are used by the driver module.
Port               Address
Right channel      0x80000340
Left channel       0x80000344
Control register   0x80000348
Table 2 the port address of audio module
Three bits of the control register are used; they are listed in table 3.
Bit     Function
Bit 0   enable
Bit 1   irq pending / disable irq
Bit 3   irq enable
Table 3 control register of audio module
4.7 Adding the audio module to APB

Section 4.3 explains how to add a new component to APB. The following steps have to be completed.
1. Find an available position on APB for the module. Position 15 on APB is open.
2. Add the module to APB. Instantiate the module in the mcore.vhd file and connect its APB interface at position 15.
3. Add the three address vectors "11010000", "11010001" and "11010010" to the look-up table in apbmst.vhd.
4. Assign the interrupt signal to the interrupt controller at position 13.
5. Add the following five signals, shown in figure 25, to the port definition list of the leon entity.
mclk:  OUT std_logic;
lrck:  OUT std_logic;
sclk:  OUT std_logic;
sdin:  OUT std_logic;
sdout: IN  std_logic;
Figure 25 signals of audio module added to leon port
4.8 Convert MP3 decoder from floating-point version to fixed-point version

The existing MP3 decoder requires a lot of floating-point computation. However, the free version of LEON2 does not provide a floating-point unit, and even if LEON2 provided an FPU, floating-point operations would take many more cycles than fixed-point operations. It is therefore necessary to convert the current floating-point MP3 decoder into a fixed-point MP3 decoder.
4.8.1 Single Precision floating-point

The IEEE single precision floating-point standard representation requires a 32-bit word, whose bits may be numbered from 0 to 31, left to right. The first bit is the sign bit 'S', the next eight bits are the exponent bits 'E', and the final 23 bits are the fraction 'F':
S EEEEEEEE FFFFFFFFFFFFFFFFFFFFFFF
0 1      8 9                    31
The value V represented by the word may be determined as follows:
If E=255 and F is nonzero, then V=NaN ("Not a number")
If E=255 and F is zero and S is 1, then V=-Infinity
If E=255 and F is zero and S is 0, then V=Infinity
If 0<E<255 then V=(-1)**S * 2 ** (E-127) * (1.F), where "1.F" is intended to represent the binary number created by prefixing F with an implicit leading 1 and a binary point.
If E=0 and F is nonzero, then V=(-1)**S * 2 ** (-126) * (0.F). These are "unnormalized" values.
If E=0 and F is zero and S is 1, then V=-0
If E=0 and F is zero and S is 0, then V=0
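The case list above translates almost directly into code. The following C sketch rebuilds the value of a single-precision word from its S, E and F fields; it is meant only as an illustration of the format, and the function names decode_single and float_bits are hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Rebuild the value of a single-precision word from its S, E and F
 * fields, following the case list in the text. */
double decode_single(uint32_t word)
{
    uint32_t s = word >> 31;             /* sign bit */
    uint32_t e = (word >> 23) & 0xFFu;   /* 8 exponent bits */
    uint32_t f = word & 0x7FFFFFu;       /* 23 fraction bits */
    double sign = s ? -1.0 : 1.0;

    if (e == 255)                        /* NaN or +/- Infinity */
        return f ? (double)NAN : sign * (double)INFINITY;
    if (e == 0)                          /* zero or unnormalized: 0.F * 2^-126 */
        return sign * ldexp((double)f, -126 - 23);
    /* normalized: 1.F * 2^(E-127); the implicit 1 is bit 23 */
    return sign * ldexp((double)(f | 0x800000u), (int)e - 127 - 23);
}

/* Raw bit pattern of a float, for comparing against the decoder. */
uint32_t float_bits(float x)
{
    uint32_t w;
    memcpy(&w, &x, sizeof w);
    return w;
}
```

For any finite float x, decode_single(float_bits(x)) should reproduce x exactly, because every single-precision value is representable as a double.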
4.8.2 Conversion to 32-bit Fixed-point

During decoding, floating-point numbers appear in the blocks after the Huffman decoder. The range of the output values of the decoder (PCM samples) is between -1.0 and +1.0 [6]. The most important issue in the conversion from floating-point to fixed-point numbers is the dynamic range. Compared to single-precision floating point, the dynamic range of 32-bit fixed point is limited. In this project, the MP3 decoder MAD, which is a fixed-point MP3 decoder, is used as a reference. Figure 26 describes the 32-bit fixed-point format. This format is already used in MAD and has proved to work correctly.
Figure 26 fixed-point format
With this format, the effective range is 0x80000000 to 0x7FFFFFFF and the dynamic range is 96 dB. Converting the decoder into fixed point involves several aspects:
Arithmetic computation: fixed-point numbers can be added or subtracted like normal integers, but multiplication requires shifting the 64-bit result from 56 fractional bits back to 28.
Overflow and underflow: when overflow is detected during the computation, the result is set to the maximum value 0x7FFFFFFF. When underflow is detected, the result is set to 0x80000000.
Mathematical functions: many mathematical functions, such as the cosine function, are used during MP3 decoding. LEON2 cannot compute their values at run time, so the fixed-point results of these functions are stored in memory and obtained from a lookup table.
Constants: all floating-point constants are converted into fixed-point constants.
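The arithmetic and saturation rules above can be sketched in C. The sketch assumes 28 fractional bits, as implied by the shift from 56 back to 28 bits; all type, macro and function names are hypothetical and are not the names used in MAD or in the decoder of this project.

```c
#include <stdint.h>

typedef int32_t fixed_t;              /* 32-bit word, 28 fractional bits (assumed) */
#define FRAC_BITS 28
#define FIXED_ONE ((fixed_t)1 << FRAC_BITS)
#define FIXED_MAX ((fixed_t)0x7FFFFFFF)
#define FIXED_MIN ((fixed_t)(-0x7FFFFFFF - 1))   /* bit pattern 0x80000000 */

/* Clamp a 64-bit intermediate result to the 32-bit range. */
static fixed_t saturate(int64_t v)
{
    if (v > FIXED_MAX) return FIXED_MAX;
    if (v < FIXED_MIN) return FIXED_MIN;
    return (fixed_t)v;
}

/* Addition works like integer addition, plus saturation. */
fixed_t fixed_add(fixed_t a, fixed_t b)
{
    return saturate((int64_t)a + b);
}

/* The 64-bit product has 56 fractional bits; shift back to 28.
 * Right shift of a negative value is assumed to be arithmetic,
 * which holds for GCC. */
fixed_t fixed_mul(fixed_t a, fixed_t b)
{
    int64_t p = (int64_t)a * b;
    return saturate(p >> FRAC_BITS);
}

/* Convert a constant; intended for values of moderate magnitude. */
fixed_t fixed_from_double(double x)
{
    return saturate((int64_t)(x * (double)FIXED_ONE));
}
```

A conversion helper like fixed_from_double would typically be used offline, for example to generate the constant tables and cosine lookup tables, so that no floating-point operation remains in the code that runs on LEON2.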
4.9 Conclusion
In this chapter, two major tasks were completed. The first one is adding two modules to LEON2. The second one is converting the MP3 decoder from a floating-point version to a fixed-point version. In the next chapter, a performance analysis is carried out to find out how to partition the system into a hardware part and a software part.
Chapter 5
Performance Analysis and HW/SW Partitioning
5.1 Performance analysis

After completing the software MP3 decoder, we can use the profiling tool gprof to analyze it. Figure 27 shows the analysis results. This test was run on a Pentium IV CPU at 2.0 GHz with 512 MB of memory, Windows XP and Cygwin.
  %     cumulative   self                  self      total
 time    seconds    seconds     calls     us/call   us/call   name
55.96      4.46       4.46      216000     20.65     22.13    my_synthesis
 8.16      5.11       0.65      376576      1.73      1.73    imdct36
 5.14      5.52       0.41       12000     34.17     56.67    III_dequantize_sample
 4.02      5.84       0.32      216000      1.48      1.48    dct32
 3.64      6.13       0.29     1830922      0.16      0.27    huffman_decoder
 3.39      6.40       0.27     6912000      0.04      0.04    III_requantize
 2.51      6.60       0.20        6000     33.33     33.33    III_stereo
 2.26      6.78       0.18     8800642      0.02      0.02    hgetbits
 2.13      6.95       0.17      384000      0.44      2.55    III_hybrid_int
 2.13      7.12       0.17       12000     14.17     55.81    III_hufman_decode
 1.88      7.27       0.15      376576      0.40      2.12    III_imdct_l
 1.76      7.41       0.14       11768     11.90     11.90    III_aliasreduce
 1.76      7.55       0.14                                    main
 1.63      7.68       0.13       12000     10.83     10.83    III_reorder
 0.63      7.73       0.05     1278464      0.04      0.07    getbits
 0.63      7.78       0.05        6000      8.33      8.33    out_fifo
 0.50      7.82       0.04       12000      3.33     15.00    III_antialias
Figure 27 profile of the soft MP3 decoder
Figure 27 shows the profile of the software decoder running on a personal computer. The leftmost column displays the ratio of the execution time of each function to the total execution time. Function my_synthesis consumes more than half of the total execution time. This function implements the polyphase synthesis filter bank, which requires a large number of multiplications, additions and memory accesses. The first part is a 32-point DCT; after that the data pass through the D-window, where 512 multiplications are required and for which no fast algorithm has been found; then 512 additions are executed. The second most time-consuming function is the 36-point IMDCT.
Function block             Time ratio (%)
Polyphase synthesis bank   55.96
36-point IMDCT             8.16
Inverse quantization       5.14
Huffman decoder            3.64
Stereo processing          2.51
In principle, the total execution time is reduced by finding the time-critical parts of the system and replacing them with hardware accelerators. The above table lists the five most time-consuming top-level functions called by main.c of the MP3 decoder. Then a second test is run, in which the MP3 decoder decodes the same MP3 file on the embedded processor. In this test, 300 MP3 frames are decoded and the execution time is measured. The entry Scale_factor means that the decoding process stops at the end of scale factor decoding. The entry Huffman_decoding means that it stops at the end of Huffman decoding. The measured results are listed in figure 28.
300 frames             Time (s)
Scale_factor           6
Huffman_decoding       24
Inverse_quantization   60
synthesis              162
Figure 28 execution time from start point
The sample rate of the MP3 file is 44.1 kHz and there are 1152 samples in each MP3 frame. Real-time playback of a 44.1 kHz stereo MP3 file therefore allows at most about 26.1 ms per frame (1152/44100 s), so 300 frames must be decoded within about 7.84 seconds. In the above list, only the Scale_factor entry meets this requirement. That means at least all functions after scale factor decoding would have to be replaced with hardware to obtain a real-time decoder. But in this project the FPGA chip cannot provide enough system gates to contain so many hardware modules.
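The real-time budget follows from the frame length and the sample rate, and the measured times of figure 28 can be checked against it. The C sketch below only repeats this arithmetic; the function names frame_ms and is_real_time are hypothetical.

```c
/* Duration of one MP3 frame in milliseconds. */
double frame_ms(int samples_per_frame, double sample_rate_hz)
{
    return 1000.0 * samples_per_frame / sample_rate_hz;
}

/* Returns 1 if decoding `frames` frames in `measured_s` seconds keeps up
 * with playback, 0 otherwise. */
int is_real_time(double measured_s, int frames,
                 int samples_per_frame, double sample_rate_hz)
{
    double budget_s = frames * frame_ms(samples_per_frame, sample_rate_hz) / 1000.0;
    return measured_s <= budget_s;
}
```

With 1152 samples per frame at 44.1 kHz, frame_ms gives about 26.12 ms, so the budget for 300 frames is about 7.84 s. The 6 s measured up to scale factor decoding fits into this budget, while the 162 s measured for the complete decoder is roughly twenty times too slow.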
5.2 HW/SW partitioning

In this design the Spartan3 FPGA chip provides 2 million gates, which is not enough for all function blocks plus the general-purpose processor LEON2. Before designing any hardware modules, we do not know how many modules the FPGA can support in the end. The performance analysis shows that the polyphase synthesis bank block consumes the most time, so replacing this block with hardware brings the most profit. However, this block includes so many computations that the FPGA may not provide enough gates for it. We therefore divide the block further into a DCT32 block and the rest of the block. We first realize the 32-point DCT block and the 36-point IDCT, which is the second most time-consuming block. The partitioning plan is thus to implement the IDCT36 block and the DCT32 block in hardware and to keep the rest of the MP3 decoder in software. Figure 29 shows the partitioning plan. In the figure, the modules in blue will be implemented in hardware.
Figure 29 HW/SW partitioned function block diagram of MP3 decoder
5.3 Conclusion
In this chapter, a preliminary partition plan is made. In the plan, two DCT blocks are converted into hardware and the rest of the decoder is kept in software. This plan is not optimal because the hardware modules do not include the whole polyphase synthesis block, but it is the best plan that can be realized. The next chapters describe how the two modules are implemented.
Chapter 6
Fast algorithm of DCT and Inverse DCT
The previous chapter discussed how to partition the whole system into hardware and software. The preliminary plan is that two blocks will be implemented in hardware: one is the 32-point DCT and the other is the 36-point IDCT. This chapter introduces two fast algorithms that reduce their computation complexity.
6.1 Introduction
Three different DCT/IDCT transforms are involved in the MP3 decoder: the 12-point IDCT, the 36-point IDCT and the 32-point DCT. Because the 12-point IDCT is seldom used, this chapter focuses on the 36-point IDCT and the 32-point DCT. The 36-point IDCT is defined by the following formula:

$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left(\frac{\pi}{2n}\left(2i+1+\frac{n}{2}\right)(2k+1)\right) \quad \text{for } i = 0 \text{ to } n-1,\; n = 36$$

Evaluating this formula directly requires 648 multiplications and 612 additions. The 32-point DCT is calculated by the following equation:
$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left(\frac{\pi}{2n}\left(2i+1+\frac{n}{2}\right)(2k+1)\right) \quad \text{for } i = 0 \text{ to } n-1,\; n = 32$$

Here 512 multiplications and 480 additions are required. Both transforms require a lot of computation. However, fast algorithms exist that reduce the computation complexity dramatically. Section 6.2 explains a fast algorithm for the 32-point DCT and section 6.3 introduces a fast algorithm for the 36-point IDCT.
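The operation counts quoted above follow directly from the shape of the two formulas: each of the n outputs is a sum of n/2 products. The short sketch below reproduces the numbers under that assumption; the struct and function names are illustrative only.

```c
#include <assert.h>

/* Operation count of a direct evaluation of
 *   x_i = sum_{k=0}^{n/2-1} X_k * cos(...),   i = 0 .. n-1. */
struct op_count {
    int multiplications;
    int additions;
};

struct op_count direct_transform_cost(int n)
{
    struct op_count c;
    int terms = n / 2;                 /* products per output sample */

    assert(n >= 2 && n % 2 == 0);
    c.multiplications = n * terms;     /* one multiplication per term */
    c.additions = n * (terms - 1);     /* adding `terms` products takes terms-1 additions */
    return c;
}
```

For n = 36 this yields 648 multiplications and 612 additions, and for n = 32 it yields 512 and 480, which are the baselines that the fast algorithms of this chapter improve on.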
6.2 Fast algorithm of 32-point DCT

In this project, Hou's fast algorithm is used to compute the 32-point DCT. This algorithm is numerically stable, fast, and recursive. Similar to the Cooley-Tukey FFT algorithm, it allows us to generate the next higher-order DCT from two identical lower-order DCTs[1]. This method requires fewer multipliers and adders than other methods. As a tradeoff, shifting and multiplexing operations are required. However, a shift is faster than a multiplication or an addition, and it is easy to realize. This section explains the algorithm only briefly. To give a better understanding, some simple examples are presented. The 2nd-order DCT (N=2) is
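$$[Z_0,\; Z_1]^T = T(2)\,[x_0,\; x_1]^T$$

The 4th-order DCT (N=4) is

$$[Z_0,\; Z_2,\; Z_1,\; Z_3]^T = T(4)\,[x_0,\; x_2,\; x_1,\; x_3]^T$$

The 8th-order DCT (N=8) is

$$[Z_0,\; Z_4,\; Z_2,\; Z_6,\; Z_1,\; Z_5,\; Z_3,\; Z_7]^T = T(8)\,[x_0,\; x_2,\; x_4,\; x_6,\; x_7,\; x_5,\; x_3,\; x_1]^T$$

where $T(N)$ denotes the N-th order DCT matrix for this ordering of the inputs and outputs.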
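From the above three examples, we can observe the following recursive property of the DCT matrices:

$$T(N) = \begin{bmatrix} T\left(\frac{N}{2}\right) & T\left(\frac{N}{2}\right) \\ D\left(\frac{N}{2}\right) & -D\left(\frac{N}{2}\right) \end{bmatrix}$$

The kernel of $T\left(\frac{N}{2}\right)$ is even and that of $D\left(\frac{N}{2}\right)$ is odd. Because

$$\cos\big((2k+1)\phi_m\big) = 2\cos(2k\phi_m)\cos\phi_m - \cos\big((2k-1)\phi_m\big)$$

we can obtain

$$\begin{bmatrix} Z_e \\ Z_o \end{bmatrix} = \begin{bmatrix} T\left(\frac{N}{2}\right) & T\left(\frac{N}{2}\right) \\ K\,T\left(\frac{N}{2}\right)Q & -K\,T\left(\frac{N}{2}\right)Q \end{bmatrix} x$$

where $K = R\,L\,R$ and $Q = \mathrm{diag}[\cos\phi_m]$. $R$ is the permutation matrix performing the bit-reversal arrangement, and $L$ is the lower triangular matrix

$$L = \begin{bmatrix} 1 & 0 & 0 & 0 & \cdots & 0 \\ -1 & 2 & 0 & 0 & \cdots & 0 \\ 1 & -2 & 2 & 0 & \cdots & 0 \\ -1 & 2 & -2 & 2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ -1 & 2 & -2 & 2 & \cdots & 2 \end{bmatrix}$$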
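The entries of the matrix L can be derived from the cosine identity used in this section. Writing E_k for the even-kernel row cos(2k·phi_m) and D_k for the odd-kernel row cos((2k+1)·phi_m), the identity reads D_k = 2·E_k·Q - D_(k-1), with D_0 = E_0·Q. Row k of L therefore equals twice the k-th unit vector minus row k-1. The sketch below evaluates that recurrence; it is only an illustration of this reasoning, and the function name is hypothetical.

```c
#include <assert.h>

/* Entry (k, j) of the lower triangular matrix L, obtained from the recurrence
 *   row_0 = e_0,   row_k = 2 * e_k - row_(k-1),
 * which mirrors D_k = 2 * E_k * Q - D_(k-1). */
int L_entry(int k, int j)
{
    assert(k >= 0 && j >= 0);
    if (k == 0) {
        return j == 0 ? 1 : 0;
    }
    return (k == j ? 2 : 0) - L_entry(k - 1, j);
}
```

The first four rows come out as (1), (-1, 2), (1, -2, 2) and (-1, 2, -2, 2). All entries are 0, ±1 or ±2, which is why this part of the algorithm needs only additions and shifts.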
Figure 30 diagram of 2-point, 4-point and 8-point DCT[7]
Figure 30 illustrates the 2-point, 4-point and 8-point DCT. Broken lines represent transfer factors of -1 while full lines represent unity transfer factors. Circles (o) represent adders; the other two node symbols represent multipliers and shifts (i.e., multiplication by 2).
6.3 Fast algorithm of 36-point IDCT

The 36-point IDCT is the second most time-consuming module of the MP3 decoder. This section presents Lee's fast algorithm to compute it.

$$x_i = \sum_{k=0}^{\frac{n}{2}-1} X_k \cos\left(\frac{\pi}{2n}\left(2i+1+\frac{n}{2}\right)(2k+1)\right) \quad \text{for } i = 0 \text{ to } n-1,\; n = 36$$

From the previous section, we know that Hou's algorithm can reduce the computation complexity. However, it is only applicable to $2^n$-point IDCT. To reduce the computation complexity of the 36-point IDCT, we introduce Szu-Wei Lee's fast algorithm. Figure 31 shows the flow of Szu-Wei Lee's algorithm. The N-point DCT-II (SDCT-II) is defined as

$$C_k^{N,II} = \epsilon_k \sum_{m=0}^{N-1} x_m \cos\frac{(2m+1)k\pi}{2N}$$

The N-point DCT-IV is defined as

$$C_k^{N,IV} = \sum_{m=0}^{N-1} x_m \cos\frac{(2m+1)(2k+1)\pi}{4N}$$
Figure 31 function block of Lee's algorithm [8] (blocks: N-point forward MDCT, N-point backward MDCT, N/2-point DCT-IV, N/2-point SDCT-II, N/4-point DCT-IV, N/4-point SDCT-II)
Step 1: the cosine function satisfies the following relation:

$$\cos\frac{(2m+1)(2k+1)\pi}{2N} = -\cos\frac{(2N-1-2m)(2k+1)\pi}{2N}$$

So we only need to compute the N/2-point DCT.

Step 2: using $\cos(\alpha+\beta) = 2\cos(\alpha)\cos(\beta) - \cos(\alpha-\beta)$, we can convert the N/2-point DCT-IV into an N/2-point SDCT-II. This step needs N/2 multiplications and (N/2)-1 additions.

Step 3: the even-indexed elements can be converted into an N/4-point SDCT-II directly. For the odd-indexed elements, we can use the same method as in step 2 to convert the samples into an N/4-point SDCT-II. This step requires N/4 multiplications and (3N/4)-1 additions in total.

After these three conversions, the computation of the 36-point IDCT is reduced to that of 9-point DCTs.
6.4 Conclusion
The previous two sections describe Hou's algorithm and Lee's algorithm respectively. Both algorithms reduce the computation complexity considerably.
                                   Number of multiplications   Number of additions
Hou algorithm for 32-point DCT     80                          209
Lee algorithm for 36-point IDCT    43                          115
Table 4 the number of computations of two algorithms
Table 4 shows the number of computations of the two algorithms. The next chapter shows how to design the DCT/IDCT accelerators.
Chapter 7
Implementation of DCT and IDCT Accelerators
The previous chapter presented two fast algorithms for the 36-point IDCT and the 32-point DCT, which reduce the computation complexity considerably. In this chapter, we realize the DCT and IDCT accelerators in hardware. Section 7.1 describes the accelerator architecture; section 7.2 describes how to schedule the algorithms and how to create the FSM. Section 7.3 shows the design of the FSM controller and section 7.4 explains how to design the accelerators' interface to the AHB.
7.1 The accelerator architecture

The previous chapter presented two fast algorithms for the 36-point IDCT and the 32-point DCT respectively. Now the problem is how to implement the two DCT modules. The two modules share several features. First, both consist of a large number of additions and multiplications. All these computations can be considered as a computation stream ordered by data dependency. Second, both are connected to the AHB, which means they can use the same AHB interface. The number of adders and multipliers provided by the FPGA is so limited that all computations have to share them. In this case, a finite state machine (FSM) is a good choice because it lets all computations pass through the arithmetic units in different states.
The accelerator modules exchange data with LEON2 at high speed, so an AHB interface is required. Since the modules only respond to LEON2, we make them slave modules on the AHB.
Figure 32 the structure diagram of DCT/IDCT modules
Figure 32 presents the function block diagram of the accelerators. The FSM and the arithmetic units are in charge of the DCT or IDCT computation. The FSM controller triggers the FSM when the input data are ready and generates the done signal when the results are ready. The AHB slave interface is the way in which the accelerators communicate with LEON2.
7.2 Scheduling and FSM design

All computations in the fast DCT/IDCT algorithms can be considered as a sequence of additions and multiplications. These computations pass through the adders and multipliers one by one until the final results are produced. Generally, two factors affect the order of the sequence. The first is the data dependency, which determines the time relation among the computations. The second is the number of arithmetic units. On the FPGA chip, the number of ALUs and multipliers is so limited that some operations whose operands are already available have to wait. In principle, the more ALUs and multipliers, the shorter the total execution time. However, once the number of ALUs and multipliers is sufficient, additional ALUs and multipliers cannot reduce the execution time further. The scheduling tool shown in Figure 33, developed by Mr. Huib Lincklaen Arriëns, is used to schedule the computations. Since this tool cannot read a C program directly, the C program must first be converted into a temporary file. The following listing shows an example of the temporary file. In this file, each line consists of exactly one computation.
    t0 = i0 + i31;
    w0 = i0 - i31;
    t16 = w0 * c1;
    t1 = i15 + i16;
    w1 = i15 - i16;
    t17 = w1 * c31;
After the C code has been converted to the temporary file, the scheduling tool reads it and shows the scheduling graph. The number of ALUs and multipliers can be changed manually. In principle, more multipliers and ALUs reduce the number of FSM states; however, they also make the circuit more complex. So we must find out how many multipliers and ALUs are the best choice. Table 5 and Table 6 show the number of FSM states when different numbers of adders and multipliers are used for IDCT36 and DCT32. In the IDCT36 module, 36 additional multiplications are included, so its total number of multiplications is 79. In both tables, five adders and two multipliers are the best choice because additional multipliers bring only little profit.
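The effect of the number of arithmetic units on the number of FSM states can be illustrated with a small list scheduler applied to the six statements of the temporary-file example. This is a simplified sketch and not the scheduling tool used in this project: it assumes that every operation takes exactly one state, that operations are considered in file order, and that the dependencies form no cycle. All type and function names are hypothetical.

```c
#include <assert.h>

enum op_kind { OP_ALU, OP_MUL };

struct op {
    enum op_kind kind;
    int dep[2];   /* index of the producing operation, -1 for an input or constant */
};

/* The six statements of the temporary-file example, in file order. */
static const struct op example_ops[] = {
    { OP_ALU, { -1, -1 } },   /* t0  = i0  + i31 */
    { OP_ALU, { -1, -1 } },   /* w0  = i0  - i31 */
    { OP_MUL, {  1, -1 } },   /* t16 = w0  * c1  */
    { OP_ALU, { -1, -1 } },   /* t1  = i15 + i16 */
    { OP_ALU, { -1, -1 } },   /* w1  = i15 - i16 */
    { OP_MUL, {  4, -1 } }    /* t17 = w1  * c31 */
};

#define MAX_OPS 256

/* Returns the number of states needed when at most n_alus ALU operations and
 * n_muls multiplications may execute in the same state. */
int schedule_states(const struct op *ops, int n_ops, int n_alus, int n_muls)
{
    int done_state[MAX_OPS];   /* state in which an op executes, 0 = not scheduled */
    int scheduled = 0, state = 0, i, d;

    assert(n_ops > 0 && n_ops <= MAX_OPS && n_alus > 0 && n_muls > 0);
    for (i = 0; i < n_ops; i++) {
        done_state[i] = 0;
    }
    while (scheduled < n_ops) {
        int free_alus = n_alus, free_muls = n_muls;
        state++;
        for (i = 0; i < n_ops; i++) {
            int ready = (done_state[i] == 0);
            /* An operand is usable only if it was produced in an earlier state. */
            for (d = 0; d < 2 && ready; d++) {
                int p = ops[i].dep[d];
                if (p >= 0 && (done_state[p] == 0 || done_state[p] == state)) {
                    ready = 0;
                }
            }
            if (!ready) {
                continue;
            }
            if (ops[i].kind == OP_ALU && free_alus > 0) {
                free_alus--;
                done_state[i] = state;
                scheduled++;
            } else if (ops[i].kind == OP_MUL && free_muls > 0) {
                free_muls--;
                done_state[i] = state;
                scheduled++;
            }
        }
    }
    return state;
}
```

For this example, one ALU and one multiplier need 5 states, two ALUs and one multiplier need 3 states, and four ALUs with two multipliers need only 2 states. Beyond that point the data dependency between each subtraction and its multiplication prevents any further reduction, which is the saturation effect described above.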
Number of ALUs
2
Number of multipliers
3
81 69 69 69 69 69
4
81 52 52 52 52 52
5
81
6
81 42 35 35 35 35
10
1 2 3 4 5 6
103 103 103 103 103 103
42
42 42 42 42
42 31 31 31 31 30 27 27 27 30 25 25 24 25 25 24
Table 5 number of FSMs states with different number of adders and multipliers to IDCT36
Figure 33 scheduling tool
Number of multipliers    Number of ALUs
                         1      2      3      4      5      6
1                        147    92     80     80     80     80
2                        135    75     57     48     41     41
3                        134    71     51     42     37     33
Table 6 number of FSMs states with different number of adders and multipliers to DCT32
7.3 FSM controller design

The DCT/IDCT FSM controller controls the FSM directly and is placed between the AHB interface and the FSM. On one side, it exchanges data with the AHB slave interface unit; on the other side, it distributes data to the FSM and collects the results from it.
The controller obtains data through a 32-bit wide data bus. The 36-point IDCT requires 18 input data and the 32-point DCT needs 32 input data, so the valid address width of both modules is 6 bits. The input data must be written in order, from the first to the last. As soon as the last input is ready, the FSM controller generates the start signal to trigger the FSM. When the FSM has completed all computations, the done signal is set and the AHB interface passes it to the interrupt controller. When all results have been read by a master module on the AHB, the controller sets the done signal low and the whole module returns to the idle state. Figure 34 shows the two FSM controllers.
Figure 34 IDCT36 FSM controller and DCT32 FSM controller
7.4 AHB slave interface design

The AHB slave interfaces of the IDCT36 accelerator and the DCT32 accelerator are almost the same. The only difference is that they have different addresses. The following code is the port definition of the IDCT36 accelerator. The port consists of a standard AHB slave I/O port and an interrupt signal output.
    entity ahbmdct36 is
      port (
        rst   : in  std_logic;
        clk   : in  clk_type;
        ahbsi : in  ahb_slv_in_type;
        ahbso : out ahb_slv_out_type;
        irq   : out std_logic
      );
    end;
The following VHDL code shows how the AHB slave interface works. The code is modified from the file ahbram.vhd.

    comb : process (ahbsi, r, rst)
      variable v : reg_type;
      variable haddr : std_logic_vector(5 downto 0);
    begin
      v := r;
      v.hready := '1';
      -- use the registered address for a pending write, the bus address otherwise
      if (r.hwrite or not r.hready) = '1' then
        haddr := r.addr(5 downto 0);
      else
        haddr := ahbsi.haddr(7 downto 2);
      end if;
      -- sample the AHB control signals when the bus is ready
      if ahbsi.hready = '1' then
        v.hsel := ahbsi.hsel and ahbsi.htrans(1);
        v.hwrite := ahbsi.hwrite and v.hsel;
        v.addr := ahbsi.haddr(7 downto 2);
      end if;
      -- insert a wait state when a read follows a write
      if r.hwrite = '1' then
        v.hready := not (v.hsel and not ahbsi.hwrite);
        v.hwrite := v.hwrite and v.hready;
      end if;
      if rst = '0' then
        v.hwrite := '0';
      end if;
      write <= r.hwrite;
      ramsel <= v.hsel or r.hwrite;
      ahbso.hready <= r.hready;
      addr <= haddr;
      c <= v;
    end process;

    ahbso.hresp <= "00";
    ahbso.hsplit <= (others => '0');

    aram : IMDCT36_core
      port map (
        clk => clk, rst => rst, datain => ahbsi.hwdata,
        dataout => ahbso.hrdata, ena => ramsel, write => write,
        address => addr, done => irq);
7.5 Connect the accelerators to AHB

The two accelerators are now complete, and Chapter X has already described how to add new slave modules to the AHB. We now integrate the two modules into LEON2. Table 7 shows the address allocation.
                    IDCT36                   DCT32
Position on AHB     5                        6
Interrupt number    14                       12
Address space       0xF0000000-0xF0000048    0xE0000000-0xE000007C
Table 7 address space and interrupt of the two accelerators
7.6 Conclusion
Figure 35 function block diagram of LEON2 with two accelerators
At the end of this chapter, the two hardware accelerators are complete and have been added to the AHB. Figure 35 shows the function block diagram of the modified LEON2. In the next chapter, the performance of the accelerators is tested.
Chapter 8
Test Results and Conclusion
The previous chapter described how the two accelerators were completed. In this chapter, a test is performed to find out the effect of the hardware accelerators. Section 8.1 shows the test results and section 8.2 presents the conclusions that can be drawn from them. Section 8.3 then summarizes the whole project. The final section presents the scope of future research.
8.1 Test results

After integrating the IDCT36 accelerator and the DCT32 accelerator into LEON2, we can compare each of them with its software counterpart to find out the effect of the two accelerators. First, the IDCT36 accelerator is tested. In this test, the IDCT36 accelerator and the software IDCT36 decoding function are each executed 300000 times. The test is repeated twice and the execution times are recorded in Table 8.

IDCT36                             1      2      Average
Soft decoding (Lee's algorithm)    25s    24s    24.5s
Accelerator decoding               10s    11s    10.5s
Table 8 execution time IDCT36 C function and accelerator on LEON2
Second, the DCT32 accelerator is tested. In this test, the DCT32 accelerator and the software DCT32 decoding function are each executed 300000 times. The test is repeated twice and the execution times are recorded in Table 9. In Figure 36, the two software results are compared with the two hardware results.

DCT32                              1      2      Average
Soft decoding (Hou's algorithm)    91s    92s    91.5s
Accelerator decoding               18s    17s    17.5s

Table 9 execution time DCT32 C function and accelerator on LEON2
Figure 36 compare two DCT accelerators with software (bar chart for IDCT36 and DCT32, vertical axis from 0 to 100; series: soft decoding (fast algorithm), accelerator decoding)
From Table 8 and Table 9, it can be seen that both accelerators reduce the execution time dramatically. The first table shows that the latency of the accelerator is less than half of the software IDCT36 decoding time. In software, the fast IDCT36 decoding algorithm consists of 79 multiplications and 115 additions. Since every computation needs two operands and one memory access takes 3 clock cycles, one complete multiplication or addition takes 3+3+1+3=10 clock cycles: two operand reads, one cycle of computation and one write of the result. Completing one whole IDCT36 computation therefore needs about 10*(79+115)=1940 clock cycles. The clock frequency of LEON2 is 25 MHz, so one IDCT36 needs 1940*40=77600 ns and 300000 fast IDCT36 computations need at least 77600 ns*300000=23.3 seconds. In the IDCT36 accelerator, the FSM consists of 41 states, the 18 input data need 18*(3+3)=108 clock cycles and the 36 output data need 36*6=216 clock cycles. So one IDCT36 computation on the accelerator needs 108+41+216=365 clock cycles and 300000 IDCT36 computations on the accelerator need about 365*40 ns*300000=4.38 s. Because LEON2 needs additional clock cycles to respond to and handle the interrupt, this value is only a lower bound. The DCT32 results can be explained in the same way. From the previous chapter, we know that the architectures of the IDCT36 accelerator and the DCT32 accelerator are almost the same. The main difference is the number of input data: IDCT36 needs 18 input data and DCT32 needs 32. So the software DCT32 block takes more time than the software IDCT36 block, and the DCT32 accelerator takes more time than the IDCT36 accelerator.
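The cycle estimates above can be reproduced with the following sketch. It encodes only the assumptions stated in the text (3 cycles per memory access, one cycle per arithmetic operation or FSM state, a 40 ns clock period at 25 MHz); the constant and function names are illustrative.

```c
#include <assert.h>

#define CLOCK_PERIOD_NS   40LL   /* 25 MHz LEON2 clock */
#define MEM_ACCESS_CYCLES 3LL    /* cycles per memory access */

/* Software: each operation reads two operands, computes for one cycle and
 * writes one result. */
long long software_cycles(long long multiplications, long long additions)
{
    long long per_op = MEM_ACCESS_CYCLES + MEM_ACCESS_CYCLES + 1 + MEM_ACCESS_CYCLES;
    return per_op * (multiplications + additions);
}

/* Accelerator: each transferred word costs one read plus one write, and each
 * FSM state takes one cycle. Interrupt handling is not included. */
long long accelerator_cycles(long long inputs, long long fsm_states, long long outputs)
{
    long long per_word = 2 * MEM_ACCESS_CYCLES;
    return inputs * per_word + fsm_states + outputs * per_word;
}

/* Total time in milliseconds for `runs` transforms. */
long long total_time_ms(long long cycles_per_run, long long runs)
{
    assert(runs >= 0);
    return cycles_per_run * CLOCK_PERIOD_NS * runs / 1000000LL;
}
```

For IDCT36 this gives 1940 cycles in software and 365 cycles on the accelerator, or 23280 ms and 4380 ms for 300000 runs. Both estimates are lower bounds: the measured 24.5 s is close to the software estimate, while the measured 10.5 s for the accelerator is well above 4.38 s because interrupt handling is not modelled.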
Module                                         Sequential elements    Combinational logic
LEON (including the following four modules)    16.68% (9354)          59.73% (33502)
DCT32                                          4.82% (2702)           21.44% (12005)
IDCT36                                         3.29% (1847)           16.6% (9382)
Flash card module                              0.52% (293)            0.53% (297)
Audio module                                   1.52% (852)            2.38% (1334)
Table 10 final area report *Numbers in the parentheses indicate the number of the elements being used
Table 10 shows the area report of the LEON2 SoC. About 60% of the combinational logic and 17% of the sequential elements are used by the system.
8.2 Conclusion
The test results described in the previous section indicate that both hardware accelerators reduce the execution time, which matches our expectation. The DCT32 accelerator takes more execution time than the IDCT36 accelerator because its FSM consists of more states and it needs more input data. In section 5.1, the MP3 decoder on LEON2 was used to decode 300 MP3 frames. During that decoding, the IDCT36 function is invoked 38016 times and the DCT32 function is invoked 21600 times. Because Lee's algorithm was not used for IDCT36 in that test, we also need the execution time of 300000 calls of the IDCT36 function without Lee's algorithm.
IDCT36                                1      2      Average
Soft decoding (no Lee's algorithm)    73s    74s    73.5s
Soft decoding (Lee's algorithm)       25s    24s    24.5s
Accelerator decoding                  10s    11s    10.5s
Table 11 execution time of the IDCT36 without Lee's algorithm
Table 11 shows the execution time when the IDCT36 accelerator and the IDCT36 C functions are each invoked 300000 times. When decoding 300 MP3 frames, the IDCT36 module is called 38016 times, so the execution time should be about 9.31 seconds in software without Lee's algorithm and about 1.33 seconds with the accelerator. That means about 8 seconds can be saved, of which 1.8 seconds are contributed by the hardware accelerator and the rest by Lee's algorithm. For the DCT32 module, Hou's algorithm was already used in the test described in section 5.1, so with the same method we find that about 5.3 seconds can be saved. Using the two accelerators together, decoding 300 MP3 frames saves 13.3 seconds, which is about 8% of the total execution time of 162 seconds. Neither accelerator is efficient because too much time is spent on data transmission. As section 8.1 describes, IDCT36 needs eighteen input data and outputs 36 results, and accessing one datum takes 3 cycles. Before a datum is sent to the IDCT36 module, it has to be read from memory onto the bus. After a datum is read from the IDCT36 module, it has to be saved to memory. That means every data transfer with the IDCT36 module takes six cycles. Making all input data ready takes 6*18=108 cycles. Reading all results takes 6*36=216 cycles. The computation part, the FSM, consists of only 41 states, where one state is equal to one cycle.
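The savings quoted above are obtained by scaling the 300000-call benchmark times to the number of calls made while decoding 300 frames. A minimal sketch of that scaling follows; the function names are hypothetical, and the inputs are the averages from Table 9 and Table 11.

```c
#include <assert.h>

/* Scale a time measured over `bench_calls` invocations to `calls` invocations. */
double scaled_time_s(double bench_time_s, long bench_calls, long calls)
{
    assert(bench_calls > 0);
    return bench_time_s * (double)calls / (double)bench_calls;
}

/* Time saved over `calls` invocations when an implementation with benchmark
 * time `before_s` is replaced by one with benchmark time `after_s`. */
double saved_time_s(double before_s, double after_s, long bench_calls, long calls)
{
    return scaled_time_s(before_s, bench_calls, calls)
         - scaled_time_s(after_s, bench_calls, calls);
}
```

For example, saved_time_s(73.5, 10.5, 300000, 38016) is about 7.98 s for IDCT36, saved_time_s(24.5, 10.5, 300000, 38016) is about 1.77 s for the hardware contribution alone, and saved_time_s(91.5, 17.5, 300000, 21600) is about 5.33 s for DCT32.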
Figure 37 execution time (cycle) distribution of (I)DCT accelerator module (bars for DCT32 and IDCT36, each split into input data, computation and output results; horizontal axis in cycles from 0 to 450)
Figure 37 shows the execution time distribution of the DCT32 and IDCT36 modules. The ratio of computation time to total execution time is 10.3% for the DCT32 module and 11.2% for the IDCT36 module. These two values are upper bounds, which assume that calling the interrupt handler takes no time. In practice, these values cannot be reached. There are two directions for solving this problem. The first is to assign more computation to the FSM. The second is to make the three parts, data input, computation and data output, work concurrently.
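The sketch below works out the IDCT36 share of computation from the cycle counts given in this section, and estimates what the second direction could bring. The overlapped figure is an idealised assumption that is not measured in this project: if the three phases of consecutive transforms overlapped perfectly, the cost per transform would be bounded by the longest phase. The names are illustrative only.

```c
#include <assert.h>

struct phases {
    int input;     /* cycles to write the input data */
    int compute;   /* cycles spent in the FSM */
    int output;    /* cycles to read the results */
};

/* Cycle counts of the IDCT36 accelerator as given in the text. */
static const struct phases idct36_phases = { 108, 41, 216 };

/* Current behaviour: the three phases run one after another. */
int sequential_cycles(struct phases p)
{
    return p.input + p.compute + p.output;
}

/* Share of the computation phase in the total, in per mille. */
int compute_share_permille(struct phases p)
{
    int total = sequential_cycles(p);
    assert(total > 0);
    return p.compute * 1000 / total;
}

/* Idealised bound with perfect overlap: the longest phase dominates. */
int overlapped_cycles(struct phases p)
{
    int longest = p.input;
    if (p.compute > longest) {
        longest = p.compute;
    }
    if (p.output > longest) {
        longest = p.output;
    }
    return longest;
}
```

For IDCT36 the sequential cost is 365 cycles with a computation share of 112 per mille (11.2%), while the idealised overlapped bound is 216 cycles. Even with perfect overlap, the result read-out would remain the bottleneck.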
The advantage of a general-purpose processor is its flexibility, and its disadvantage is its low performance in special applications. Adding hardware accelerators can improve the performance of a general-purpose processor effectively. If more software is replaced by accelerators, more performance improvement can be obtained; however, more hardware is consumed. In this project, an MP3 decoder running on LEON2 is completed. Two hardware accelerators are added to the system to improve performance. This hardware and software codesign technique shortens the time-to-market while reducing the design effort.
8.3 Summary
The goal of the project has been reached. An MP3 decoder has been completed and ported to the LEON2 platform. This decoder can read MP3 files from a flash card and send the PCM signal to the audio A/D converter. Two accelerators are added to the system to improve performance and their effects have been observed. In the first chapter, the motivation and definition of the project are presented. The general principle of hardware and software codesign is given, and the organization of the thesis is based on this principle. Chapter 2 gives some background on MP3 and the MP3 decoder. Chapter 3 introduces the background of LEON2, which is necessary to know how to modify an existing LEON2. Chapter 4 focuses on adding new modules to LEON2. First, the method for adding new modules to LEON2 is given. Then a flash card module and an audio module are added to the system. At the end of the chapter, the conversion of the MP3 decoder into fixed point is discussed. Chapter 5 analyzes the software MP3 decoder, finds out which parts are the most time-consuming, and gives the reason for adding the DCT and IDCT accelerators to LEON2. Chapter 6 briefly describes two fast DCT algorithms and chapter 7 discusses how to implement the two accelerators. In the last chapter, the effect of the accelerators is presented.
8.4 Future research

So far, all design steps are completed. In this design, I applied the method of hardware and software codesign and implemented a 36-point IDCT accelerator and a 32-point DCT accelerator for the MP3 decoder running on the LEON2 processor. In the following, we review the whole design process and present the scope for future research.
1. Complete a real-time MP3 decoder on FPGA. In this project the accelerators improve performance, but the decoder is not powerful enough to decode MP3 in real time. Further effort can focus on the following aspects:
   Decrease the area of the embedded processor on the FPGA. In this project, LEON2 consumes about half of the total FPGA area. If it consumed fewer system gates, more hardware could be implemented on the chip.
   Design an MP3 decoder without any processor. That means all parts of the decoder are implemented in hardware. This may be very difficult because in this project a single DCT module already consumes 30% of the system area.
2. Find a method that converts C code into VHDL directly. In this project, the scheduling tool can generate the FSM automatically based on one type of temporary file. However, converting a big C program into the temporary file is very complicated. If the C code could be converted into VHDL code directly, the FSM design
Bibliography
[1] Luis Azuara, Pattara Kiatisevi, Design of an Audio Player as System-on-a-Chip, Institute of Computer Science, University of Stuttgart.
[2] K. Salomonsen, S. Søgaard, E. P. Larsen, Design and Implementation of an MPEG/Audio Layer III Bitstream Processor, 1997.
[3] LEON2 Processor User's Manual, from www.gaisler.com
[4] AMBA Specification, from www.gaisler.com
[5] The LEON/ERC32 GNU Cross-Compiler System, from www.gaisler.com
[6] MPEG-Layer3 Bitstream Syntax and Decoding, www.mp3-tech.org.
[7] Hsieh S. Hou, A Fast Recursive Algorithm For Computing the Discrete Cosine Transform, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-35, no. 10, pp. 1455-1461, October 1987.
[8] Szu-Wei Lee, Improved Algorithm for Efficient Computation of the Forward and Backward MDCT in MPEG Audio Coder, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 48, no. 10, pp. 990-994, October 2001.
