Asip-10 1 1 28 9397

POWER REDUCTION FOR ASIPS: A CASE STUDY
Tilman Glökler and Heinrich Meyr
Institute for Integrated Signal Processing Systems (ISS), Technical University of Aachen, Aachen, Germany
Abstract - Application specific instruction set processors (ASIPs) are an excellent architecture for mixed control/data-flow oriented tasks with medium to low data rate and high complexity. The main advantage of ASIPs is the higher flexibility due to programmability compared to dedicated hardware. A drawback of this design style is an increase in power consumption. The current case study focuses on an ASIP design methodology considering the classical parameters computational performance and area as well as energy consumption simultaneously. Several ASIP power optimization options have been applied and evaluated: clock gating, logic netlist restructuring, ISA optimization, instruction memory power reduction, and use of a dedicated coprocessor. These optimizations are demonstrated with the ICORE (ISS-core) ASIP for DVB-T acquisition and tracking algorithms. The results reveal a potential of about one order of magnitude in energy savings for these optimizations.
INTRODUCTION
Application specific instruction set processor implementations in general tend to consume more power than dedicated hardware for the same computational task. This is due to the overhead in interconnection structure and to the control activity of the processor. On the other hand, processors are much more flexible and can be used to implement any software-programmable task. Thus, there is a trade-off between flexibility and low power consumption [5]. Power consumption is increasingly important for today's devices. The focus of this paper is to evaluate practically relevant power optimizations using a real-world design, namely the ICORE ASIP (ISS-core, [6]), as a case study for a quantitative approach to power-aware ASIP design. Instruction set oriented ASIPs significantly extend the design space compared to fixed commercial programmable signal processors, because the instruction set architecture and its implementation are entirely up to the designer. Power and performance optimization can be applied at the logic level, the register transfer (RT) level, in the ASIP software, and also at system level. There are various degrees of freedom for the designer: the RT level, for instance, can be subdivided into definition of the instruction set architecture, implementation of the data-path, and implementation of the control units. At system level, algorithms might be mapped to programmable architectures or to more dedicated hardware.
Our design, the low-power application specific instruction set processor ICORE, has been designed to perform the control and processing tasks for acquisition and tracking in a digital terrestrial television (DVB-T) receiver. The computational tasks for ICORE are acquisition of the FFT window position, sampling-clock synchronization for interpolation/decimation, and carrier frequency offset estimation. The simplified receiver structure is depicted in Figure 1. ICORE is integrated as one design module of this complete system-on-a-chip, which has been designed together with Infineon Technologies AG/Munich. This single-chip DVB-T solution supports new features and enhanced algorithms and consumes less power compared to the previous design [11].
[Figure 1: Digital Part of the DVB-T Receiver. Block labels: from ADC; I/Q conversion; Interpol./Decimation; FFT; Equalizer, Softbitgen.; Symbol/Bit Deinterleaver; FEC; transport stream; ICORE; TPS, Channel Estimation, Interference Detection; Carrier/Sampling Frequency and Timing Synchronization]

ICORE implements the lower data rate tasks in Figure 1, whereas the higher rate tasks are mapped to dedicated hardware. The current project uses HDL-based logic synthesis and semi-custom ASIC design to get the best time-to-market for this consumer application.

This document is structured as follows: in the following section a short overview of previous approaches to (low-power) ASIP design is given. Afterwards, the ICORE design methodology together with a qualitative discussion of ASIP power optimization options is described. Later on, an overview of the processor architecture and the instruction set of ICORE is presented. Finally, quantitative results of all the implemented power optimizations are given.
RELATED WORK
The concept of instruction set oriented ASIPs is well known in the literature. In [1] a concise overview of ASIP design issues is given. The presented ASIP design flow is targeted at performance constraints and does not take into account the energy consumption of the implementation. There are various ASIP design tools in the literature that cover the (more or less) complete ASIP design flow from application to implementation. In [2] the PEAS design environment is described, which generates an instruction set simulation model and a synthesizable model from an architectural processor description. The MetaCore DSP development system [3] is an ASIP design tool which supports design space exploration and design generation. During design generation, the development tools such as C compiler, assembler, and ISA simulator as well as the processor in the form of an HDL description are generated. In [13] the ISDL machine description language is used to generate a bit-true instruction level simulator and a synthesizable Verilog processor description. There are also some design tools in the literature focusing on a subset of the ASIP design flow. A framework for compiler-ASIP co-design with feedback from an optimizing compiler to the ASIP design is described in [14]. In [15] the MSSQ compiler within the MIMOLA hardware design system is presented, which is a retargetable high-level language compiler for embedded systems. There are also some commercial approaches to ASIP design. For instance, Tensilica and ARC Cores both offer configurable processor cores together with a framework to generate the necessary development tools. For an overview refer to [4]. All the above-mentioned tools primarily focus on computational performance issues and some also on silicon area. A design flow considering these classical VLSI design parameters together with the energy consumption is still missing.

M CORE, a power-consciously designed processor core, is described in [16]. M CORE is a dedicated processor because its functional units are implemented in full custom; thus, adaptation to unexpected computational tasks is difficult and extremely time consuming. Its instruction set has been optimized a priori for portable consumer applications using specifically targeted benchmarks. Power optimization of M CORE has been applied at several levels of abstraction. M CORE uses low-power sleep and doze modes. At the instruction level, a high instruction coding density together with a rich register set minimizes the memory data transfers needed to perform a given function.
Finally, the instruction pipeline contains logic to prevent unnecessary switching activity. However, there is no quantitative evaluation of these power savings available. The design of ICORE has been performed largely manually, without using HDL generation tools or high level language compilers. During the ICORE design the processor description language LISA [12] has been used to generate assembler and instruction set simulator. The focus of the ICORE design methodology was to apply a power conscious design methodology. Power saving techniques similar to the ones that have been used for M CORE but also typical ASIP-only power saving techniques have been applied. Furthermore, by obtaining quantitative results of these optimizations, the important parameters computational performance, area and energy consumption have been considered simultaneously. This is especially challenging due to the large design space which is offered by ASIPs compared to heterogeneous implementations using dedicated processors together with dedicated hardware.
ICORE DESIGN METHODOLOGY

The ICORE design comprises all assembler programs and the VHDL hardware of the synthesized core including the interfaces as a deliverable of this project.
ICORE is initially based on a mainly conventional DSP instruction set of a typical load/store Harvard-DSP architecture. This instruction set includes instructions for arithmetic and logical operations, data moves, and program flow control. It is a subset of any micro-controller or DSP instruction set, without special instructions such as rounding, division, normalization, bit test, and loop operations. With these basic instructions an initial assembler version of the algorithm for implementation is written. The implementation at this stage is typically not the optimum solution with respect to execution time and power consumption and may violate given time constraints.
Performance Optimization
The next step in the ICORE design methodology is the profiling of the initial program version. The fast instruction set simulation of this implementation is performed using the LISA processor description language [12] together with the LISA compiler tools to generate assembler, linker and the fast simulator. A typical design flow using LISA is depicted in Figure 2.
[Figure 2: LISA ASIP Design Flow. Labels: Golden Reference; Assembler Program; ASIP Architecture; LISA description; LISA language compiler; Assembler, Linker, Simulator; HDL description; Standard Cell Synthesis & Gate Level Simulation; Verification: Golden Ref. vs. ISA simulation; Profiling Information, Cycle Count; HDL Verification: LISA/HDL cosimulation; Runtime, Critical Path, Area, Power]

One result of this step is the cycle count for the implemented algorithm. Together with the system clock frequency this results in the execution time for the profiled task. Logic synthesis of the corresponding HDL description of the processor for the target technology has to check whether this architecture is feasible for a given clock speed. If time constraints for a profiled task are violated, architectural enhancements corresponding to instruction set modifications have to be performed to accelerate this task. Here, a coprocessor for a runtime-critical task can be added if an ISA adaptation fails to satisfy the constraints. An example for enhancements of this kind is given later on.

In the near future, the important and error-prone control parts and pipeline structures of this HDL description will be generated automatically from the LISA description. The designer will still be able to hand-optimize the performance-critical data-path. The design process of the data-path will be sped up by generating empty HDL entities which will be filled with structural or more behavioral HDL descriptions, depending on the preferences of the HDL designer and the requirements of the logic synthesis tool.
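The timing check described in this subsection (cycle count from profiling, combined with the system clock frequency, compared against a time constraint) can be written down in a few lines. The following is a minimal Python sketch; the function names and all numbers are hypothetical and chosen only for illustration, they are not taken from the ICORE project.

```python
def execution_time_s(cycle_count: int, clock_hz: float) -> float:
    """Execution time of a profiled task: cycle count divided by clock frequency."""
    return cycle_count / clock_hz


def meets_deadline(cycle_count: int, clock_hz: float, deadline_s: float) -> bool:
    """True if the profiled task finishes within its time constraint."""
    return execution_time_s(cycle_count, clock_hz) <= deadline_s


# Hypothetical values (assumptions, not measured ICORE data).
CLOCK_HZ = 100e6          # assumed system clock
DEADLINE_S = 100e-6       # assumed time constraint for the task
initial_cycles = 12_000   # assumed cycle count with the basic instruction set
optimized_cycles = 8_000  # assumed cycle count after an ISA modification

initial_ok = meets_deadline(initial_cycles, CLOCK_HZ, DEADLINE_S)
optimized_ok = meets_deadline(optimized_cycles, CLOCK_HZ, DEADLINE_S)
```

With these assumed numbers the initial version takes 120 µs and violates the constraint, while the modified instruction set brings the task down to 80 µs. In the flow described above, the cycle counts would come from the instruction set simulator and the feasible clock frequency from logic synthesis.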
Power Optimization
If all time constraints are fulfilled, additional optimizations can be introduced to increase power efficiency. The energy that is required to execute a given algorithm in hardware can be divided into the following parts (a similar approach is given in [9]):
- intrinsic energy: refers to the minimum processing energy needed to process a computational task, e.g. an arithmetic operation. The intrinsic energy only depends on the operations of an algorithm and the process technology.
- routing energy: refers to the energy needed to move data spatially on the chip.
- overhead energy: refers to the remaining energy, e.g. caused by wasteful logic activity, control activity, clock distribution etc.

It is often impossible to separate performance optimization from energy optimization, because ISA performance improvements typically also reduce the overhead energy (as in [6]). The effect of incremental ISA optimizations can be seen in a reduction of the overhead energy due to a decrease in processor runtime, because the overhead power remains nearly constant. As long as changes to the architecture have a limited impact on data routing, the routing energy does not change significantly. The intrinsic energy is constant as long as the algorithm remains the same. In the following subsections important ASIP power optimization options are reviewed in a bottom-up fashion. The applicability of these optimization options for ICORE is discussed. Techniques like full-custom design of the data-path or downscaling of the supply voltage have not been considered, because they were not compatible with the constraints of the standard cell based design flow of our industry partner.

Logic Level
The application of clock gating is very common. For ICORE, local clock gating supported by Synopsys DesignCompiler has been used.
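To make the energy decomposition introduced at the start of this subsection concrete, the following minimal Python sketch models the total energy as intrinsic plus routing plus overhead energy, with the overhead energy taken as a constant overhead power multiplied by the runtime. The class name, field names and all numbers are assumptions for illustration; they are not measurements from ICORE.

```python
from dataclasses import dataclass


@dataclass
class TaskEnergy:
    intrinsic_j: float       # minimum energy for the operations of the algorithm
    routing_j: float         # energy for moving data across the chip
    overhead_power_w: float  # control, clocking, wasteful activity (assumed constant)
    runtime_s: float         # processor runtime for the task

    def overhead_j(self) -> float:
        """Overhead energy grows linearly with runtime at constant overhead power."""
        return self.overhead_power_w * self.runtime_s

    def total_j(self) -> float:
        return self.intrinsic_j + self.routing_j + self.overhead_j()


# Hypothetical task before and after an ISA optimization that halves the runtime.
before = TaskEnergy(intrinsic_j=1.0e-6, routing_j=0.5e-6,
                    overhead_power_w=10e-3, runtime_s=400e-6)
after = TaskEnergy(intrinsic_j=1.0e-6, routing_j=0.5e-6,
                   overhead_power_w=10e-3, runtime_s=200e-6)

saving = 1.0 - after.total_j() / before.total_j()
```

In this toy model only the overhead term changes when the runtime is shortened, which mirrors the argument that ISA performance improvements mainly reduce overhead energy while intrinsic and routing energy stay roughly the same.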
At the logic level it is possible to restructure the netlist to reduce wasteful glitching activity, e.g. due to unbalanced arrival times of signals. This option is provided by DesignPower, which is part of DesignCompiler. It is most effective if the toggle activity of a design which has been simulated with realistic stimuli is fed back to DesignCompiler. This optimization option has also been used for ICORE. Further possibilities for optimization at the logic level are the concepts of precomputation [10] and guarded evaluation. These optimization alternatives have not been used for ICORE, because they require special properties of the arithmetic data-path to be effective.

RT Level
The following optimizations at the RT level have been made:
- ISA optimization by adapting the number and kind of processor resources and by adding specialized instructions
- blocking gates to prevent wasteful toggle activity in the data-path
- optimized instruction encoding reducing the power in the program memory
- sleep mode to reduce power during stand-by, together with wake-up signals

ISA optimization takes advantage of the fact that an ASIP is not a fixed architecture and can be modified to match the properties of an application as dedicated hardware does. There are two major rules of thumb for ISA optimization. First, the frequently executed instructions or instruction sequences are the obvious candidates for optimization. Second, if there is a very simple hardware solution for a less frequently occurring task, it can still be worth implementing it in dedicated hardware. The frequency of instruction sequences and operations has to be determined by carefully profiling the assembler programs to be executed on the ASIP. Examples for ISA optimizations are described in [6].

Blocking gates or blocking registers are a good choice to suppress unnecessary propagation of values into functional units which are connected to a common bus, like the output of the general purpose registers of ICORE. A drawback of this step is the increase in implementation area. In the case of combinational blocking gates there is the danger of increasing the toggle activity of frequently used functional units due to additional transitions back to the blocked state.

The power consumption of the instruction ROM of ICORE represents a significant part of the overall power. Thus, the word width of this ROM has been minimized during the ISA design, carefully trading off performance against power. Furthermore, the ROM-internal switching power has been reduced by using an optimized instruction encoding. This encoding takes advantage of the fact that the ROM internally uses a precharge phase that charges all the internal bit lines to logic high. If a bit cell contains a logic 0, the connected bit line is discharged during the read-out phase; for a bit cell with logic 1 the bit line remains in the charged state. Thus, it is favorable to use logic 1s instead of logic 0s to reduce the internal toggling power of the bit lines and the sense amplifiers. For details refer to [7].
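The idea behind the optimized instruction encoding (prefer logic 1s for frequently fetched words, because each logic 0 discharges a precharged bit line) can be illustrated with a small Python sketch. It is a simplified stand-in, not the semi-automatic method of [7]: it only assigns codes within a hypothetical opcode field, and the instruction names, fetch counts and field width are invented for the example.

```python
from typing import Dict


def ones(code: int) -> int:
    """Number of logic-1 bits in a code word."""
    return bin(code).count("1")


def assign_opcodes(fetch_counts: Dict[str, int], width: int) -> Dict[str, int]:
    """Give the most frequently fetched instructions the codes with the most 1 bits."""
    if len(fetch_counts) > 2 ** width:
        raise ValueError("opcode field too narrow for the instruction set")
    codes = sorted(range(2 ** width), key=lambda c: (-ones(c), c))
    ranked = sorted(fetch_counts, key=lambda name: (-fetch_counts[name], name))
    return dict(zip(ranked, codes))


def zero_bits_read(fetch_counts: Dict[str, int],
                   encoding: Dict[str, int], width: int) -> int:
    """Total number of 0 bits read from the ROM; each one discharges a bit line."""
    return sum(n * (width - ones(encoding[name]))
               for name, n in fetch_counts.items())


# Hypothetical profile: how often each instruction is fetched.
profile = {"add": 500, "load": 300, "branch": 150, "sleep": 50}
naive = {"add": 0b00, "load": 0b01, "branch": 0b10, "sleep": 0b11}
tuned = assign_opcodes(profile, width=2)

naive_cost = zero_bits_read(profile, naive, width=2)
tuned_cost = zero_bits_read(profile, tuned, width=2)
```

For this invented profile the number of discharged bit lines drops from 1450 to 550. A real encoding would have to consider complete instruction words, including operand fields, and any decoder constraints.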
The implementation of a dedicated coprocessor for a regular task with stringent timing constraints is an interesting option. The CORDIC algorithm described in [6] is a good candidate for this kind of optimization. A discussion of the effect of a CORDIC coprocessor on overall energy consumption is given later on. This optimization, however, is not used in the final version of ICORE, because it violates the orthogonality of an instruction set oriented ASIP and reduces maintainability of the design.

ASIP Software Level
A very efficient means of power reduction is the intensive use of the processor sleep mode. In combination with clock gating a significant reduction in stand-by power is achieved. Additional logic to control the processor wake-up within a clock cycle avoids excessive power due to polling external signals or events. This concept has been used for ICORE to synchronize the processing to data arrival events and to synchronize output signal changes to the COFDM symbol frame. Data memory accesses are minimized by hand-optimized assembler programs that efficiently use all the processor registers. Dedicated memory-mapped I/O registers, which are accessible by dedicated I/O instructions, automatically provide handshake signals to the environment with just one write access. Overhead energy can be reduced by profiling the implemented software for speed. Obviously, the software has to use the tailored instructions efficiently. Furthermore, optimizations based on profiling statistics are a good starting point to reorder assembler instructions representing conditional structures like if-then, else-if or case. These and additional techniques like common subtree removal or similar arithmetic transformations were applied during the assembler code design phase of ICORE.

Architectural System Level
On the system level, the partitioning into building blocks implemented as dedicated hardware, configurable HW blocks, or programmable devices has to take place. This partitioning must find a feasible solution with respect to processing power and data rates. Excessive flexibility has to be restricted to the required minimum in order to avoid an unnecessary increase in power consumption. Thus, it is important to identify whether the amount of required flexibility of a building block can be satisfied with dedicated or configurable hardware. Otherwise, a programmable solution has to be used. In the case of DVB-T the system is partitioned into many high data rate blocks with limited flexibility (e.g. the FFT for the 8k/2k modes). On the other hand, there is the block implemented by ICORE, which requires as much flexibility as possible. This flexibility is needed in the case of ICORE to reduce the risk and the design effort in case of late design changes. If changes in the algorithms are required, a redesign of the ASIP program after the fabrication of initial chip samples is possible, using a ROM mask modification at reduced cost.
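As an illustration of the profile-driven reordering of conditional structures mentioned under the ASIP software level, the following Python sketch computes the expected number of condition tests for an if / else-if chain before and after sorting the cases by their profiled probability. It assumes mutually exclusive cases and equal cost per test; the case labels and probabilities are hypothetical.

```python
from typing import List, Tuple

Case = Tuple[str, float]  # (label, probability from profiling)


def expected_tests(cases: List[Case]) -> float:
    """Expected number of condition tests when case k (0-based) needs k + 1 tests."""
    return sum((k + 1) * p for k, (_, p) in enumerate(cases))


def reorder_by_profile(cases: List[Case]) -> List[Case]:
    """Test the most likely case first."""
    return sorted(cases, key=lambda case: case[1], reverse=True)


# Hypothetical profiling result for a three-way conditional.
original_order = [("rare", 0.05), ("common", 0.80), ("occasional", 0.15)]
profiled_order = reorder_by_profile(original_order)

before = expected_tests(original_order)
after = expected_tests(profiled_order)
```

With these assumed probabilities the expected number of tests per pass falls from 2.10 to 1.25, which shortens the runtime and therefore the overhead energy of the code section.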
ICORE ARCHITECTURE
ICORE has a typical load/store Harvard architecture using two-operand instructions. Operations are pipelined with a single-issue instruction fetch (IF) stage, an instruction decode (ID) stage, and a combined operand read/execute/result write-back (RD/EX/WB) stage (Fig. 3). The complete data-path has a width of 32 bits, except for the 16x16-bit multiplier. ICORE implements roughly 60 DSP instructions. These instructions are subdivided into 20 arithmetic instructions, 16 instructions for program flow control, and instructions to support the sleep modes of the core. The remaining instructions are used for memory and I/O operations, logical operations, etc. After ISA optimization, some more application specific instructions were added, e.g. to support efficient bit manipulation (to set and extract bit fields in I2C registers) and to support a fast CORDIC angle calculation. Figure 3 depicts the ICORE architecture. ICORE uses 8 general purpose registers, 4 address registers, a status register, a program counter, and a hardware loop counter. Currently, ICORE supports immediate addressing and register indirect addressing, optionally with post-increment or post-decrement. Furthermore, a test interface to access relevant internal states of the core during functional chip test was implemented. This interface also supports the chip scan mode and controls the self-test of the core. A logic synthesis run using a 4-metal layer technology yields an implementation area of about 52k NAND2 equivalent gates and a critical path of about 9.5 ns for worst case operating conditions.
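The bit manipulation support mentioned above (setting and extracting bit fields in registers) corresponds to operations that otherwise need several shift and mask instructions. The following Python sketch shows the semantics of such operations on a 32-bit word. The function names and the argument order are hypothetical; the actual ICORE instruction formats are not given in this paper.

```python
WORD_MASK = 0xFFFFFFFF  # 32-bit data-path


def extract_field(word: int, lsb: int, width: int) -> int:
    """Return the `width`-bit field of `word` that starts at bit position `lsb`."""
    return (word >> lsb) & ((1 << width) - 1)


def insert_field(word: int, lsb: int, width: int, value: int) -> int:
    """Return `word` with the field at `lsb` replaced by the low bits of `value`."""
    mask = ((1 << width) - 1) << lsb
    return ((word & ~mask) | ((value << lsb) & mask)) & WORD_MASK


register = 0xABCD1234
nibble = extract_field(register, lsb=8, width=4)
updated = insert_field(register, lsb=8, width=4, value=0xF)
```

On a processor with only basic instructions each of these functions expands to a short sequence of shift, AND and OR operations, which is the kind of frequently executed sequence that a specialized instruction can replace.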
RESULTS
Power optimization of ICORE has been achieved incrementally. This means that each optimization step can be quantitatively evaluated. The measurements have been performed using Synopsys PowerCompiler with back-annotated toggle activity from gate level simulations.

Results for ICORE
This section summarizes the results of all important optimizations of ICORE that have been used in the DVB-T project. The numbers in Figure 4 show the impact of incremental optimizations, beginning with a completely unoptimized implementation. These optimizations are sorted starting with the least effective optimization (logic restructuring) and ending with the most effective ones (clock gating and ISA optimization). It is important that clock gating is introduced before ISA optimization; otherwise, the benefit of a longer processor sleep period due to faster processing is significantly reduced. Figure 4 depicts the overall energy of the ASIP during the tracking phase in the 2k-COFDM mode.

Netlist restructuring is the least efficient optimization, yielding about 10% in energy reduction, but without significant increase in design time. Blocking gates reduce the energy by roughly another 10% while slightly increasing the area (by about 1%) and the design effort. The control power reduction, namely the reduction of the internal toggle activity of the instruction ROM using tool-supported encoding, yields roughly 20% in energy reduction without affecting area or design time. This saving depends on the size of the used instruction memory. ISA optimization scales the energy consumption by another factor of about 0.5, while increasing the design effort significantly due to manual optimization. The applicability of this optimization strongly depends on the computational tasks of the application. Because the processor has long idle intervals, clock gating in combination with the sleep mode of the core scales the energy by a factor of about 0.2. This value strongly depends on the load of the ASIP. It might be argued that the processing power of ICORE is significantly over-dimensioned for the given application. This is not the case: the constraints of the system environment simply specify tight bounds on the runtime for the ICORE tasks, which have to be fulfilled by this implementation. The overall power reduction for ICORE with all optimizations is about 92%. It has to be pointed out that none of the above-mentioned optimizations impairs the flexibility and maintainability of this building block.

[Figure 3: ICORE architecture. Labels: IF; ID; RD/EX/WB; Instruction ROM; Decoder; Flow Control Unit; ZOLP Ctrl.; Branch Ctrl.; Data Address Gen.; General Purpose Registers; Shifter; Multiplier; Minmax; ALU; Addsub; Move; Bitmanip; Data Mem.; I2C]

[Figure 4: Incremental Power Optimization of ICORE]
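The incremental savings quoted in this section can be combined in a simple multiplicative model. The following Python sketch assumes that each step scales the energy remaining after the previous step, and it uses the rounded figures from the text; the step labels are shorthand introduced here.

```python
from math import prod

# (optimization step, factor applied to the remaining energy), rounded from the text
steps = [
    ("netlist restructuring", 0.90),          # about 10% reduction
    ("blocking gates", 0.90),                 # roughly another 10%
    ("instruction ROM encoding", 0.80),       # roughly 20%
    ("clock gating and sleep mode", 0.20),    # factor of about 0.2
    ("ISA optimization", 0.50),               # factor of about 0.5
]

remaining = prod(factor for _, factor in steps)
reduction_percent = 100.0 * (1.0 - remaining)
```

The product of the rounded factors is about 0.065, i.e. a reduction of roughly 93%, which is of the same size as the reported overall reduction of about 92%; the small difference is plausibly due to rounding of the individual figures. The model treats the steps as independent, whereas the text notes that the benefit of ISA optimization depends on clock gating being in place.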
Further Optimization
In this section another optimization is quantitatively evaluated that has not been applied to the current DVB-T implementation for several reasons. It is well known that a dedicated coprocessor can be used for the CORDIC task, using a structure as described in [8]. This coprocessor has been implemented and coupled to ICORE and, as expected, it significantly reduces the overhead power for the CORDIC task. Table 1 lists the area and power results for the different implementations (ASIP with/without coprocessor) for the CORDIC task and, additionally, for the tracking tasks altogether (which include several CORDIC evaluations). The overall savings for the complete tracking tasks are about 38%. A drawback of the optimization is the reduced flexibility of this implementation. A slight modification of the CORDIC algorithm requires a redesign of the coprocessor in this case. Furthermore, this optimization at the structural level leaves the design paradigm of an instruction set oriented ASIP and introduces more heterogeneity into the system. This complicates maintainability and reusability of this building block and, consequently, has been avoided in the current project.
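For readers unfamiliar with the CORDIC task, the following Python sketch shows a generic vectoring-mode CORDIC that computes an angle by iterated micro-rotations. It is a floating-point illustration of the general algorithm, valid for inputs with x > 0; it does not reproduce the fixed-point implementation in ICORE or the coprocessor structure of [8], and the function name and iteration count are assumptions.

```python
import math


def cordic_angle(x: float, y: float, iterations: int = 16) -> float:
    """Approximate atan2(y, x) for x > 0 by rotating (x, y) onto the x-axis."""
    angle = 0.0
    for i in range(iterations):
        shift = 2.0 ** -i          # a right shift by i bits in fixed-point hardware
        step = math.atan(shift)    # would come from a small lookup table
        if y > 0:
            # Rotate clockwise and record the applied angle.
            x, y = x + y * shift, y - x * shift
            angle += step
        else:
            # Rotate counter-clockwise.
            x, y = x - y * shift, y + x * shift
            angle -= step
    return angle


estimate = cordic_angle(2.0, 0.5)
reference = math.atan2(0.5, 2.0)
```

Each iteration needs only shifts, additions and one table lookup, which is why the algorithm maps well to a small dedicated data-path and why its regular structure makes it a candidate for a coprocessor.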
ASIP             Area (ND2 equ.)   Norm. Energy (only CORDIC)   Norm. Energy (overall)
without copro.   52k               100%                         100%
with copro.      56k               7.8%                         62%
Table 1: Results for ICORE with/without coprocessor
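The figures in Table 1 allow a rough back-of-the-envelope estimate of how much of the tracking energy is spent in the CORDIC task. The following Python sketch assumes that the coprocessor changes only the CORDIC energy and leaves the energy of all other tracking tasks untouched; this assumption is not stated in the paper, so the result is only indicative.

```python
def cordic_share(overall_ratio: float, cordic_ratio: float) -> float:
    """Solve overall_ratio = (1 - s) + cordic_ratio * s for the CORDIC share s."""
    return (1.0 - overall_ratio) / (1.0 - cordic_ratio)


# Values from Table 1.
share = cordic_share(overall_ratio=0.62, cordic_ratio=0.078)
area_increase = 56 / 52 - 1.0
```

Under this assumption the CORDIC evaluations account for roughly 41% of the tracking energy of the ASIP without coprocessor, and the coprocessor costs about 8% additional area.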
SUMMARY
ICORE, a power-consciously designed ASIP for DVB-T acquisition and tracking algorithms, has been presented. The ICORE design methodology for optimizing computational performance and power has been described. Several practical ASIP power optimizations have been quantitatively evaluated. Optimizations have been applied at the logic level, the RT level and the software level. Furthermore, a scenario with a dedicated coprocessor for the computation of the CORDIC task has been implemented. The results indicate a potential of about one order of magnitude in energy savings for optimization within the ISA oriented ASIP domain. By giving up this paradigm and shifting the implementation to a heterogeneous processor-coprocessor architecture, additional energy savings of about 38% have been achieved.
References
[1] Manoj Kumar Jain, M. Balakrishnan and Anshul Kumar: ASIP Design Methodologies: Survey and Issues, Proc. of 14th CSI/IEEE Intl. Conf. on VLSI Design, Jan. 2001, Bangalore, India.
[2] Makiko Itoh et al.: PEAS-III: An ASIP Design Environment, 2000 IEEE Int. Conf. on Computer Design: VLSI in Computers & Processors, pp. 430-436, Sep. 2000.
[3] Jin-Hyuk Yang et al.: MetaCore: An Application-Specific Programmable DSP Development System, IEEE Transactions on Very Large Scale Integration Systems, Apr. 2000, Vol. 8, No. 2, pp. 173-183.
[4] http://www.eetimes.com/story/OEG20001120S0028 (last access: 07.08.2001)
[5] Arthur Abnous and Jan Rabaey: Ultra-Low-Power Domain-Specific Multimedia Processors, Proc. IEEE VLSI Sig. Proc. Workshop, San Francisco, California, USA, October 1996.
[6] Tilman Glökler, Stefan Bitterlich, and Heinrich Meyr: Increasing the Power Efficiency of Application Specific Instruction Set Processors Using Datapath Optimization, 2000 IEEE Workshop on Signal Processing, Lafayette, Louisiana, USA, Oct. 2000.
[7] Tilman Glökler and Stefan Bitterlich: Power Efficient Semi-Automatic Instruction Encoding for Application Specific Instruction Set Processors, Int. Conf. on Acoustics, Speech, and Signal Processing 2001, Salt Lake City, Utah, USA, May 2001.
[8] Herbert Dawid, H. Meyr; K. Parhi, T. Nishitani (editors): Digital Signal Processing for Multimedia Systems, Marcel Dekker Inc., 1999, pp. 623-652.
[9] T. Claasen: High Speed: Not the only way to exploit the intrinsic computational power of silicon, Int. Solid State Circ. Conf. 97, Digest of Tech. Papers, 1997.
[10] G. K. Yeap: Practical Low Power Digital Design, Kluwer Academic Publishers, 1998.
[11] Product Brief SQC 6100 - Terrestrial Receiver for DVB-T, INFINEON AG, Germany, www.infineon.com/products/ics/pdf/sqc 10b.pdf, 04.04.2001.
[12] S. Pees, A. Hoffmann, V. Zivojnovic, and H. Meyr: LISA - Machine Description Language for Cycle-Accurate Models of Programmable DSP Architectures, 36th Design Automation Conference, New Orleans, June 1999.
[13] George Hadjiyiannis, Pietro Russo, Srinivas Devadas: A Methodology for Accurate Performance Evaluation in Architecture Exploration, 36th Design Automation Conference, New Orleans, June 1999.
[14] F. Onion, A. Nicolau, and N. Dutt: Incorporating Compiler Feedback Into the Design of ASIPs, Proc. of European Design and Test Conference, pages 508-513, 1995.
[15] R. Leupers, P. Marwedel: Retargetable Code Generation Based on Structural Processor Descriptions, Design Automation for Embedded Systems, 3(1), 1998.
[16] Jeff Scott, Lea Hwang Lee, John Arends, and Bill Moyer: Designing the Low-Power M Core Architecture, M Core technology center, Motorola Inc., www.mot.com/SPS/MCORE/pdf container/lowpower.pdf, 04.04.2001.
