
Low Power ASIC Implementation from a CAL Dataflow Description

Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues


Department of Electrical and Information Technology, Lund University Box 118, SE-221 00 Lund, Sweden

Thomas Olsson and Anders Carlsson


Ericsson Research, Lund, Sweden

Abstract

This paper presents a flow for low power hardware generation based on the CAL actor language. CAL is a dataflow language that provides a higher level of abstraction and can generate both hardware and software descriptions. A dataflow language is appropriate for signal processing systems, since algorithms are typically specified as dataflow graphs; using the same method for specification and high-level implementation offers rapid prototyping. The block partitioning capability of the CAL language also makes it well suited to hardware-software co-design and to programming reconfigurable processor arrays. The original CAL flow targets hardware-software co-design of complex systems on FPGAs; here it is modified to facilitate low power ASIC implementations. For an ASIC, the partitioning capability allows different clock domains to be implemented, and introducing token-based clock gating in each processing block further reduces power consumption. As a case study to evaluate the methodology and the optimizations incorporated in the flow, an Orthogonal Frequency-Division Multiplexing (OFDM) multi-standard channel estimator is implemented. Hardware-software co-design and Globally Asynchronous Locally Synchronous (GALS) design at a higher level of abstraction provide more freedom for design-space exploration and reduce design time.

Keywords: CAL Dataflow Language, High-Level Synthesis, Hw-Sw Co-Design, Design Partition, GALS, Token-Based Clock Gating, Low Power ASIC.

1. Introduction

The complexity of signal processing systems is increasing, driven by an ever-increasing demand for faster devices with more features. The implementation of complex designs requires

Email addresses: Hemanth.Prabhu@eit.lth.se, Sherine.Thomas@eit.lth.se, Joachim.Rodrigues@eit.lth.se (Hemanth Prabhu, Sherine Thomas, and Joachim Rodrigues ), thomas.p.olsson@ericsson.com,anders.b.carlsson@ericsson.com (Thomas Olsson and Anders Carlsson ) Preprint submitted to Embedded Hardware Design (Microprocessors and Microsystems) March 26, 2012

hardware platforms with multiple processors, accelerators, peripherals and reconfigurable arrays. Such hardware platforms require very detailed, cycle-accurate description code, typically written in Hardware Description Languages (HDLs) like Verilog and VHDL. Register Transfer Level (RTL) implementation of complex algorithms and their reference designs tends to be time consuming. Implementing hardware at a higher abstraction level requires new design flows and tools. A dataflow language is appropriate for signal processing systems, since algorithms are typically specified as dataflow graphs. Using the same method for specification and implementation eases the design task, since it enables rapid prototyping. CAL is a dataflow-oriented language that was specified and developed as part of the Ptolemy project at the University of California, Berkeley [1]. The language is described extensively in the CAL language reference manual [2]. CAL gives a high level of abstraction and can generate synthesizable hardware and software descriptions. However, the current version of the CAL-to-RTL generator (OpenDF) introduces redundant logic, which increases area and power cost. Therefore, in this study the efficiency of the RTL mapping from a CAL dataflow description was improved and evaluated in a case study. The block partitioning capability of the CAL language may be used to efficiently implement Globally Asynchronous Locally Synchronous (GALS) designs, which have the major advantage that a traditional synchronous design flow may be applied. Furthermore, the power consumption of the design needs to be addressed to increase battery life. As part of the study of hardware implementation in CAL, a clock gating scheme based on the activity of a network is implemented. Several modifications to the CAL-to-RTL tool were made to support these features.
As a case study, an Orthogonal Frequency-Division Multiplexing (OFDM) multi-standard MMSE (minimum mean squared error) channel estimator is implemented in CAL to evaluate the methodology and the optimizations incorporated in the flow. The channel estimator was synthesized in 65 nm CMOS technology. A GALS architecture was realized by dividing the design into different clock domains. A low power clock gating scheme was included in the implementation, and an analysis of the hardware parameters was performed. The remainder of this paper is organized as follows. Sec. 2 gives a brief introduction to the CAL dataflow language. Sec. 3 addresses the optimizations of the CAL flow and presents the GALS design and clock gating techniques incorporated into the tool. Sec. 4 describes the hardware implementation of the channel estimator in CAL, and the results obtained are discussed in Sec. 5. Finally, conclusions are drawn in Sec. 6.

2. Background of CAL Dataflow Language

A dataflow model of an algorithm consists of nodes and communication arcs. The nodes represent combinational logic, and the communication arcs are used to transfer data tokens between nodes. A variety of such models exist, with different trade-offs between expressiveness and analyzability. Of particular interest are synchronous dataflow networks, which are used in several academic modelling tools to represent streaming applications [3]. Synchronous dataflow networks are constrained, which leads to efficient synthesized code [4]. The advantage of a dataflow model is that a one-to-one mapping between nodes and computational units in hardware is possible. The nodes act asynchronously to each other, and communication arcs are used to transfer data tokens at insignificant control mechanism cost.
In [5] it is shown that dataflow models offer a representation that may effectively support the parallelization of the design for higher performance, which is required by many applications in wireless systems.

[Figure 1 graphic: nodes N1, N2 and N3 connected by communication arcs A, B and C.]

Figure 1: Dataflow Graph.

2.1. CAL Programming Model

Fig. 1 shows a dataflow graph with three nodes (N1, N2, N3) and three communication arcs (A, B, C). In a CAL implementation the nodes are the actors, which represent computational or logical tasks. Communication arcs are the buffers or FIFOs through which data tokens are transferred between actors. CAL actors are isolated computational units that consist of input/output (IO) ports, actions, state variables, and parameters. The state of an actor cannot be shared with other actors, and interaction between actors takes place through IO ports by means of data tokens. An action defines the computational or logical operations performed on the data tokens, depending on the actor state. When an action is fired, it may consume input tokens, modify the state of the actor, and produce output tokens.

2.2. RTL Generation from CAL

The CAL flow provides a high level of abstraction. The design cycle offers a wide range of design space exploration and optimization techniques. A CAL program may be compiled to both hardware and software. A software implementation is realized by translating CAL to the C programming language, and a hardware implementation by translating it to an HDL. This ability to perform both hardware-software

[Figure 2 graphic: a CAL model (example merger actor with actions A and B), simulated with OpenDF, is translated into C code by CAL2C (software generation) and into HDL code by CAL2HDL (hardware generation).]

Figure 2: CAL Framework.

[Figure 3 graphic: network ports In and Out, actor Add with inputs A and B, and actor FB with ports fb_in and fb_out forming the feedback path.]

Figure 3: Simple Feedback Network.

Actor description

    // Actor for simple feedback (delay)
    actor FB (init) fb_in ==> fb_out :
        action [fb_in] ==> [fb_out] end
    end

    // Actor for addition; tokens at ports A and B are added.
    actor add () A, B ==> Sum :
        action [a], [b] ==> [a + b] end
    end

Network description

    // Top-level network
    // Integrating actors
    network fb_add () In ==> Out :
    entities
        fb0 = FB(init = [0]);
        add = add();
    structure
        In --> add.A;
        fb0.fb_out --> add.B;
        add.Sum --> fb0.fb_in;
        add.Sum --> Out;
    end

Listing 1: Example Feedback Network using CAL

synthesis enables the development of a unified platform for hardware-software co-design of complex systems, such as embedded systems consisting of processors and hardware accelerators. A complete framework called Open Dataflow supports CAL network simulation and the generation of hardware and software code, see Fig. 2. This capability of CAL to support hardware-software co-design enables a common tool, architectural definition and specification for both platforms [6], which simplifies the design of complex systems. Details of the translation of CAL to HDL or C are described in [7]. These tools are open source, and in this paper the original version refers to the version available at [8].

2.3. CAL Network Implementation

The topology of actors connected to each other is referred to as a network of actors. A simple network of actors is shown in Fig. 3, and its CAL description is shown in

[Figure 4 graphic: actors, each with actions, encapsulated states and a separate local scheduler for actions, connected through FIFO communication arcs.]

Figure 4: Conceptual illustration of an Actor Network.

List. 1. In hardware, each actor is an independent entity, and communication between actors is based on a handshake protocol (4-phase bundled-data protocol). Each communication arc in the CAL network is implemented as a FIFO with a handshake protocol wrapper, see Fig. 4. Furthermore, an actor also uses the handshake protocol to consume and produce tokens. If two connected actors belong to different clock domains, an asynchronous FIFO implementation is selected.

2.4. Modifications to Existing Framework

The previous subsections gave a brief description of the CAL dataflow language. Fig. 5 shows the overall existing CAL framework and tools. Support for ASIC implementation was added to this framework through modifications and optimizations of the tool, described in the next section.
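The token-level behaviour of the feedback network in List. 1 can be illustrated with a small software model. The following is a minimal Python sketch, not output of the CAL tools: the names `Fifo`, `FB`, `Add` and `run_fb_add` are hypothetical, FIFOs are modelled as unbounded queues, the 4-phase handshake is abstracted into a `fire()` call that reports whether the actor could fire, and the `init = [0]` parameter of `FB` is assumed to place one initial token on its output arc.

```python
from collections import deque


class Fifo:
    """Communication arc: an unbounded token queue (depth limits are ignored here)."""

    def __init__(self):
        self.tokens = deque()

    def put(self, token):
        self.tokens.append(token)

    def get(self):
        return self.tokens.popleft()

    def has_token(self):
        return len(self.tokens) > 0


class FB:
    """Delay actor: forwards each token from fb_in to fb_out."""

    def __init__(self, init, fb_in, fb_out):
        self.fb_in, self.fb_out = fb_in, fb_out
        self.fb_out.put(init)  # assumption: init is pre-loaded on the output arc

    def fire(self):
        if not self.fb_in.has_token():
            return False
        self.fb_out.put(self.fb_in.get())
        return True


class Add:
    """Adder actor: fires only when a token is available on both A and B."""

    def __init__(self, a, b, sum_outs):
        self.a, self.b, self.sum_outs = a, b, sum_outs

    def fire(self):
        if not (self.a.has_token() and self.b.has_token()):
            return False
        total = self.a.get() + self.b.get()
        for arc in self.sum_outs:  # Sum fans out to fb0.fb_in and to Out
            arc.put(total)
        return True


def run_fb_add(inputs):
    """Run the feedback network on a list of input tokens and return the Out tokens."""
    net_in, fb_to_add, add_to_fb, net_out = Fifo(), Fifo(), Fifo(), Fifo()
    actors = [FB(0, add_to_fb, fb_to_add), Add(net_in, fb_to_add, [add_to_fb, net_out])]
    for token in inputs:
        net_in.put(token)
    # Sweep over all actors until none of them can fire any more.
    while any([actor.fire() for actor in actors]):
        pass
    return list(net_out.tokens)


print(run_fb_add([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

Under these assumptions the network behaves as an accumulator: each output token is the running sum of the input tokens. No global schedule is needed, since each actor decides locally whether it can fire, which mirrors the independent actors described above.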

[Figure 5 graphic: a CAL description (*.cal) and top-level network description files (*.nl) are processed by the open source CAL front-end tools (opendf, orcc) into an intermediate representation. Outputs shown are XLIM backend files (input to other possible translator tools), software implementations as C files (*.c, *.h), C++ files (*.cpp) and Java files (*.class, *.jar), and hardware implementations (*.vhd, *.v) for FPGA using Xilinx libraries or for ASIC.]

Support for ASIC implementation:
* Remove all Xilinx library dependencies
* Infer block memories (picked from a library)
* Optimizations to reduce area cost
* Partitioning into clock domains (GALS support)
* Automatic generation of clock gating logic (low power)

Figure 5: ASIC features added to the existing CAL framework.

[Figure 6 graphic: generated actor with clk and reset inputs, reset synchronization logic, kicker circuit, scheduler, action units (mathematical and logical operations), memory unit (block memory, clocked registers), and handshake interfaces for input and output tokens.]

Figure 6: Generated Actor Implementation.

3. ASIC Implementation from CAL

The modifications made to the tools to support ASIC implementation fall into two parts. The first part reduces the hardware area cost: various modifications were made to the tool flow to reduce the area of the hardware generated from CAL. The second part incorporates existing low power methods (GALS, clock gating) into the flow to enable low power ASIC implementation from CAL.

3.1. CAL generated hardware - Area Optimizations

A CAL actor consists of various units, depending on its computational and logical tasks. Fig. 6 shows a generalized actor implementation in hardware. The sub-modules of the actor are explained below.

Reset Synchronizer Logic - Synchronizes the reset with the actor clock.
Kicker Circuit - Generates a pulse based on the synchronized reset signal. The scheduler logic uses this pulse to begin the protocol handshake mechanism and action scheduling.
Memory Unit - Contains the state, global variables and constants. The global variables and constants are used by the action units for computation. The state is used by the scheduler to fire sequences of actions in an actor.
Action Units - The computational or logical units of an actor.
Scheduler Unit - Handles the token handshake protocol and the firing of action units.

3.1.1. Removal of Redundant Logic

The actor hardware generated by the original version of the tool assumes that every actor is a separate asynchronous block. Hence reset synchronization logic and a corresponding kicker (pulse generator) circuit are implemented for every actor. This logic is redundant, since within a single clock domain the reset may be synchronized once and then routed to all actors of that domain. The CAL tool was therefore optimized to generate only one reset logic and kicker circuit per clock domain.

3.1.2. Infer ASIC memory

The memory unit in the actor hardware implementation consists of registers that hold the state variables of the controller. If the actor contains an array of variables (a list) of length greater than 128, a RAM behavioral model is inferred. Some actors require a list of constants, which are strictly read-only elements. The CAL language has a provision to declare a list as read-only. However, the original version of the tool implements such a ROM as a RAM with initialized constant values. In an FPGA flow, RAM and ROM are inferred automatically by the synthesis tool, whereas in an ASIC flow memories have to be integrated manually. Consequently, the tool was modified to generate the appropriate behavioral memory models for either the FPGA or the ASIC flow.
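The inference rule described above can be summarized as a small decision function. This is a hedged Python sketch rather than code from the tool: the function name `infer_storage` and its parameters are hypothetical, and it assumes that the length threshold of 128 also applies to read-only lists, which the text does not state explicitly.

```python
RAM_THRESHOLD = 128  # list length above which a RAM behavioral model is inferred


def infer_storage(length, read_only=False, original_tool=False):
    """Return the storage kind assumed for an actor variable of the given list length."""
    if length <= RAM_THRESHOLD:
        # Assumption: short lists stay in registers, whether read-only or not.
        return "registers"
    if read_only and not original_tool:
        return "ROM"
    # The original tool maps read-only lists to a RAM with initialized constants.
    return "RAM"


print(infer_storage(1))                                         # registers
print(infer_storage(334))                                       # RAM
print(infer_storage(1300, read_only=True))                      # ROM
print(infer_storage(1300, read_only=True, original_tool=True))  # RAM
```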

3.1.3. FIFO Optimization

In a CAL network, actors communicate by transferring data tokens. This communication is implemented using FIFOs and a handshake protocol, which results in a communication cost between actors. The depth of each FIFO is constrained in the CAL network description. Optimizations are applied to the FIFO implementation to minimize the communication cost. Depending on the FIFO depth, the implementation is either a memory or a register array. For a FIFO depth of 2 or less, the controller and handshake protocol are designed as glue logic. To further reduce the communication overhead, the flow was modified to support merging of actors: the FIFO (registers) between the actors is removed, and glue logic handles the handshake protocol between them. Merging is requested by specifying fifosize as Null, as shown in Fig. 7 and in the corresponding pseudo network file in List. 2. Merging smaller actors makes a design more compact; however, merging larger actors may lengthen the critical paths of the design.

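[Figure 7 graphic: actors A1 and A2 feed B1 through FIFOs of depth 10, followed by C1 and D1. C1 and D1 are merged and connected only by a handshake mechanism (HM), with no registers.]

Figure 7: FIFO optimization by merging Actors.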

Network description

    // Top-level network
    network top () In ==> Out :
    entities
        // actor declaration
        A1 = A();
        A2 = A();
        B1 = B();
        C1 = C();
        D1 = D();
    structure
        // actor connectivity along with FIFO sizes
        ....
        A1.output --> B1.input1 { fifosize = 10 };
        A2.output --> B1.input2 { fifosize = 10 };
        B1.sum_out --> C1.in { fifosize = 5 };
        // specifying Null acts like merging the C1 and D1 actors
        C1.out --> D1.in { fifosize = Null };
        ....
    end

Listing 2: Example Feedback Network using CAL
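The choice of FIFO implementation described in Sec. 3.1.3 can be sketched as a function of the `fifosize` attribute. This Python sketch is illustrative only: the function name `fifo_implementation` and the returned labels are hypothetical, and `memory_threshold` is a placeholder, since the text does not state the depth at which a memory replaces a register array.

```python
def fifo_implementation(fifosize, memory_threshold=16):
    """Pick an implementation style for one communication arc.

    fifosize is the depth from the network description, or None for Null.
    memory_threshold is an assumed value, not taken from the paper.
    """
    if fifosize is None:
        return "merged"          # actors merged: handshake glue logic, no registers
    if fifosize < 1:
        raise ValueError("fifosize must be a positive depth or None")
    if fifosize <= 2:
        return "glue_logic"      # controller and handshake built as glue logic
    if fifosize < memory_threshold:
        return "register_array"
    return "memory"


# Arcs of List. 2 with their fifosize attributes (None stands for Null).
arcs = {
    "A1.output -> B1.input1": 10,
    "A2.output -> B1.input2": 10,
    "B1.sum_out -> C1.in": 5,
    "C1.out -> D1.in": None,
}
for name, depth in arcs.items():
    print(f"{name}: {fifo_implementation(depth)}")
```

With the assumed threshold, the three sized arcs of List. 2 map to register arrays and the `Null` arc is merged.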

3.2. Low Power ASIC Support

GALS designs are well suited to low power hardware implementations. Typically, in GALS-based designs, a large system is divided into smaller synchronous blocks (or clock domains). The inherent independence of these smaller blocks offers the possibility of implementing various standard low power techniques like clock gating, power gating, and dynamic voltage and frequency scaling [9].

3.2.1. Clock Domain Partitioning

With improvements in silicon fabrication technology, the number of transistors that fit on a single die increases and the feature size decreases. Clock generation and distribution become increasingly difficult in large designs. The clock load increases with higher levels of integration and larger dies. This requires more clock buffers and hence increases the clock distribution latency, which in turn makes it more difficult to design a global clock network that controls all the blocks in the design. Furthermore, as the clock frequency increases, there is more cross coupling in long wires, which increases clock jitter. The clock network occupies a significant portion of the design area, and its power consumption may amount to 35% of the total power consumption [10]. GALS design provides a promising solution that eliminates the need for a synchronous, low-skew global clock network. The main advantage of GALS design is that the design may be divided into smaller clock domains with arbitrary clock skew between them. The clock domains are independent synchronous blocks and use synchronization circuits for inter-domain communication. Signal processing systems have a dataflow realization and may be easily mapped onto such hardware structures. The capability of the CAL language to partition the design into smaller blocks may be used to efficiently implement GALS designs. There are various implementations of

GALS design [11]; the CAL flow used here implements GALS design using a FIFO-based handshake mechanism. It is interesting to note that the hardware implementation of the handshake mechanism for data tokens and FIFOs is not part of the CAL language, since CAL only gives a high-level abstraction of the dataflow algorithm. Hence the end user has more flexibility to tailor these mechanisms to the application. A CAL network divided into clock domains is shown in Fig. 8, along with pseudo code of the network description in List. 3. Hardware partitioning is done using the keyword clkdomain; similarly, for a software implementation the partitioning is done by specifying processorId along with the actor declaration. Partitioning into clock domains is straightforward (using the clkdomain keyword), and in theory the maximum number of clock domains equals the number of actors in a network. This, however, would be an unrealistic implementation, since the asynchronous FIFOs used for communication between clock domains are expensive compared to synchronous FIFOs.

3.2.2. Token-Based Clock Gating

Power consumption is becoming an increasingly important metric in large hardware platforms. Clock gating is a well-known method to decrease dynamic power by reducing the number of transitions in registers. GALS divides the design into smaller blocks, and clock gating schemes are applied to these blocks. The inherent advantages of clock gating in GALS design are discussed extensively in the literature [12]. For ASIC implementation using the CAL language, new keywords such as powerdomain and clkgating are introduced for low power implementation.

[Figure 8 graphic: actors A1, B1, A2, B2, A3, B3, C3, C4 and D3 partitioned into clock domains Clk1, Clk2 and Clk3. Synchronous FIFOs connect actors within a domain and asynchronous FIFOs connect domains. Clock Manager & Reset Logic blocks take the global reset and provide the domain clock, domain reset and kicker pulse.]

Figure 8: Clock Gating Scheme.

Pseudo Network Description

    // Top-level description
    network top_gals () ==> :
    entities
        // entity declaration, clock domain clk1
        // similar to VHDL entity declarations
        A1 = actor A(); { clkdomain = clk1 };
        B1 = actor B(); { clkdomain = clk1 };
        .......
        .......
        A2 = actor A(); { clkdomain = clk2 };
        .......
        A3 = actor A(); { clkdomain = clk3 };
        D3 = actor D(); { clkdomain = clk3 };
        .......
    structure
        // connecting different actors
        // similar to port map connection in VHDL
        A1.out --> B1.in { fifosize = 4 };
        B1.out --> A3.in { fifosize = 10 };
        .....
        ......
        B2.out --> B3.in { fifosize = 5 };
        B3.out --> C3.in { fifosize = 1 };
    end

Listing 3: Pseudo Code example for partitioning design
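Given the `clkdomain` attribute of each actor, the rule from Sec. 2.3 (an asynchronous FIFO is selected when two connected actors belong to different clock domains) can be applied mechanically to every arc. The Python sketch below illustrates this; the function name `classify_fifos` is hypothetical, the actor-to-domain mapping follows the declarations visible in List. 3, and the `A3 -> D3` arc is a made-up example because the listing elides most connections.

```python
def classify_fifos(clkdomain, connections):
    """Map each (source, destination) arc to 'sync' or 'async'.

    clkdomain: dict from actor name to clock domain name
    connections: iterable of (source actor, destination actor) pairs
    """
    kinds = {}
    for src, dst in connections:
        same_domain = clkdomain[src] == clkdomain[dst]
        kinds[(src, dst)] = "sync" if same_domain else "async"
    return kinds


# Domains as declared in List. 3; the last arc is illustrative only.
clkdomain = {"A1": "clk1", "B1": "clk1", "A2": "clk2", "A3": "clk3", "D3": "clk3"}
connections = [("A1", "B1"), ("B1", "A3"), ("A3", "D3")]

for arc, kind in classify_fifos(clkdomain, connections).items():
    print(arc, kind)
```

Counting the `async` entries gives a quick estimate of how many expensive asynchronous FIFOs a candidate partitioning would require.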

A clock gating scheme based on data token activity is depicted in Fig. 8. The clocks to the actors are not required when no data token needs to be processed. The availability of data tokens is detected at the synchronous FIFOs, and the arrival of a data token in the domain is detected at the asynchronous FIFO. Based on the availability or arrival of a data token, the clock signal to the domain is gated. If a data token exists or arrives in a domain, the clock domain is set to the active state. When the domain is inactive, the clock to the domain is gated. This is managed by the clock manager, see Fig. 8. In a FIFO-based implementation, the arrival or availability of a data token is detected by the FIFO empty signal. It takes 3 clock cycles for a token to be consumed from a FIFO by the next actor, hence there is no latency between token detection in a FIFO and clock activation. Consequently, token-based clock gating does not affect the behaviour (functionality) of the design. The token-based clock gating scheme has been incorporated as part of the CAL hardware generation. Clock gating can be enabled or disabled per domain. Based on these settings, a clock manager with appropriate state machines and clock gating logic is generated. This automation, which divides the design into different clock domains with built-in clock gating, makes hardware implementation with the CAL dataflow language even more attractive.
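The gating condition can be sketched in a few lines. The Python model below is a simplification under stated assumptions, not the generated clock manager: the function names `clock_enable` and `active_fraction` are hypothetical, only the FIFO empty signals are modelled (the state machines of the clock manager and any cycles in which actors are still busy are ignored), and the trace is made-up data.

```python
def clock_enable(fifo_empty_flags):
    """Domain clock is enabled when at least one FIFO in the domain is not empty."""
    return not all(fifo_empty_flags)


def active_fraction(trace):
    """Fraction of cycles in which the domain clock is enabled.

    trace: list of per-cycle tuples holding one FIFO empty flag per FIFO.
    """
    if not trace:
        return 0.0
    enabled = sum(1 for flags in trace if clock_enable(flags))
    return enabled / len(trace)


# Made-up 10-cycle trace for a domain with two FIFOs (True means FIFO empty).
trace = [
    (True, True), (False, True), (False, True), (True, True), (True, True),
    (True, False), (False, False), (True, True), (True, True), (True, True),
]
print(active_fraction(trace))  # 0.4
```

The active fraction computed this way corresponds to the kind of per-domain activity figure that determines how much clock power can be saved by gating.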

[Figure 9 graphic: channel estimator block diagram with inputs clk, rst, start, Pilot_loc(2:0), Op_mode(1:0) and input data; outputs done, valid, busy and data out (11:0) for LTE/DVB-H and WiFi; internal blocks Controller, Pilot Extraction, Pilot ROM, LS Module, LS RAM, MMSE module and ROM.]

Figure 9: MMSE Channel Estimator.

4. Case Study: Channel Estimator

An OFDM-based multi-standard channel estimator is implemented as a case study for low power ASIC hardware implementation with the CAL dataflow language. The channel estimator was chosen as a case study since the algorithm is of moderate complexity and requires significant hardware. The implemented channel estimator is reconfigurable to concurrently support various standards like 3GPP LTE, IEEE 802.11n and DVB-H. A robust MMSE algorithm is employed for the channel estimator. The multi-standard environment for channel estimation is described extensively in [13]. The algorithm approximations and hardware mapping (data width, MMSE matrix coefficients) chosen for the CAL hardware implementation are the same as in the reference design [14]. The hardware architecture of the channel estimator, see Fig. 9, is divided into several blocks, as described below.

LS module - The least square estimation module consists of a complex multiplier. The inputs to the multiplier are the pilot data and the inverse of the expected pilot values stored in the pilot ROM. The output of the complex multiplier is stored in the LS RAM for use by the MMSE module.
Controller module - This module is the main controller of the channel estimator. It takes care of pilot separation, least square estimation, and the memory operations.
MMSE module - This module consists of a matrix multiplier that is implemented with 12 multiply-accumulate (MAC) units. The appropriate matrix inputs are sent serially from the LS RAM and the MMSE ROM.
Memories - The channel estimator contains 3 memory units. The pilot ROM is implemented with 2 ROMs (1300x12). The MMSE ROM stores the coefficients for the MMSE algorithm and is implemented as 2 ROMs (200x120). The LS RAM stores the output of the LS module for further processing by the MMSE module. It is implemented as 2 RAMs (334x12).

[Figure 10 graphic: actors Pilot Extraction, Least Estimator, Pilot ROM, LS Memory Unit, MMSE Scheduler, 12 MAC units and PISO, grouped into clock domains clk1, clk2 and clk3.]

Figure 10: Dataflow description of MMSE Estimator.

The inputs to the channel estimator module are 12-bit real and imaginary data, a 3-bit input that gives the location of the pilot, a 2-bit input that gives the type of data, and a start signal. The outputs of the channel estimator are 12-bit real and imaginary data, a valid signal and a busy signal. The CAL dataflow implementation is a straightforward mapping of the algorithm, see Fig. 10. The Pilot Extraction actor processes the OFDM symbols and sends the pilot data tokens to the Least Estimator actor. The data tokens are then multiplied by the expected inverse pilot values and stored in the Memory actor. After least square estimation has been completed on all the pilots for a particular OFDM standard, the Memory actor sends data tokens to the MMSE network. The MMSE network contains a matrix multiplier implemented with MAC actors. The inputs to these MAC actors come from the Memory actor and from the MMSE coefficients in ROM. The MMSE controller actor controls the MAC actors. A Parallel Input Serial Output (PISO) actor receives data from the MAC actors and sends the final results serially. Moreover, in order to reduce power consumption, the design is divided into three clock domains. This division was performed based on functionality. The algorithm implemented in hardware works in sequence, so there is no need for all the domains to be active all the time. The three clock domains are clk1, clk2 and clk3. These domains may theoretically run at arbitrary frequencies. For an area-efficient implementation, the depth of the asynchronous FIFOs for communication between clock domains is kept minimal; hence only certain ratios of clock frequencies are supported in the implementation.

5. Results and Analysis of ASIC implementation from CAL

The RTL description generated by the CAL implementation was synthesized in 65 nm CMOS technology. Synthesis was performed on the original CAL flow and the optimized CAL flow for

Table 1: Area details at 250 MHz.

                 Actors   Memory   Communication (FIFO)   Total Area
CAL Original     0.080    0.1      0.051                  0.238
CAL Optimized    0.073    0.1      0.040                  0.22

Table 2: Hardware comparison.

                   Area [mm²]   Clock Domains   Max Freq [MHz]   Throughput [Samples/s]
CAL max freq       0.25         1               414              169 M
CAL at 250 MHz     0.22         1               250              102 M
[14] at 250 MHz    0.19         1               250              78 M

comparison. The designs are synthesized with the same clock constraints, which are bound by the reference design implementation.

5.1. Hardware Results

Table 1 details the reduction in area due to the different optimizations in the CAL-to-RTL tool. There is an 8% reduction in area compared to the original version of the tool. The reduction comes mainly from the removal of redundant logic in actors and from the FIFO optimizations. The design implemented with the optimized CAL flow is further compared with a reference RTL design, see Table 2. The critical path of the CAL-based design is 2.4 ns. At the maximum frequency of 414 MHz, the reported area is 0.25 mm². The CAL-based channel estimator was also synthesized with a clock constraint of 250 MHz, for which the total reported area is 0.22 mm². The area of the CAL implementation is 15% larger than that of the reference RTL design, which is still very encouraging considering the design time. The maximum clock frequency of the CAL implementation is higher because the critical path is bounded within an actor, and actors in a CAL implementation are connected to each other by FIFOs (registers). The throughput of the CAL implementation is higher than that of the reference design due to the inherent parallelism of the dataflow language. At 250 MHz the throughput of the CAL implementation is 102 M samples/s, compared with 78 M samples/s for the reference design. It is possible to implement parallelism manually in RTL, but this may require more time to analyse data dependencies and control mechanisms. The bottleneck of the RTL implementation is the matrix multiplication unit. The RTL throughput would match that of CAL if a set of registers were used to store the final results of all the MAC units, so that the MAC units could continue to operate on the next set of data while the final results are streamed out. These results are in line with published research on CAL hardware implementations on FPGA of the MPEG-4 standard [15] and of an inner receiver [16].
In [15] the throughput of the CAL implementation is much higher, and CAL is now used in the standardization of MPEG RVC. However, in the case of an FFT (designed by co-authors in [16]), the CAL implementation had a higher area cost for the same throughput, since RTL implementations of FFTs are highly optimized. Hence it should be noted that the throughput gain depends on both the complexity of the design and the reference implementation. The main advantage lies in the high level of abstraction, which eases design space exploration and lowers design time.
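The percentages quoted in Sec. 5.1 can be cross-checked against the values in Tables 1 and 2. The short Python calculation below is only a worked check with a hypothetical helper `percent_change`; it assumes that the total areas of Table 1 are in mm², as in Table 2, and that the maximum frequency is simply the reciprocal of the critical path.

```python
def percent_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return (new - old) / old * 100.0


area_original, area_optimized, area_reference = 0.238, 0.22, 0.19  # Tables 1 and 2
throughput_cal, throughput_ref = 102e6, 78e6  # samples/s at 250 MHz, Table 2
critical_path_ns = 2.4

area_reduction = -percent_change(area_optimized, area_original)
area_overhead = percent_change(area_optimized, area_reference)
throughput_gain = percent_change(throughput_cal, throughput_ref)
fmax_mhz = 1e3 / critical_path_ns

print(f"area reduction vs original flow: {area_reduction:.1f} %")   # 7.6 %
print(f"area overhead vs reference RTL:  {area_overhead:.1f} %")    # 15.8 %
print(f"throughput gain vs reference:    {throughput_gain:.1f} %")  # 30.8 %
print(f"1 / critical path:               {fmax_mhz:.0f} MHz")       # 417 MHz
```

The results agree with the rounded figures in the text (8% and 15%). The reciprocal of 2.4 ns is slightly above the reported 414 MHz, which suggests that the critical path value is itself rounded.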

Table 3: Functionality based clock domain partition.

Clock Domain   Area [mm²]   Active period (%)
clk1           0.04         100
clk2           0.06         45
clk3           0.14         40

Table 4: Power comparison at 250 MHz.

                 Area [mm²]   Num of clock domains   Power [mW]   Clock Freq [MHz]   Throughput [Samples/sec]   Normalized Power [mW/(MSamples/sec)]
CAL              0.22         1                      18           250                102 M                      0.21
CAL Low Power    0.24         3                      10           250                102 M                      0.12

5.2. Low Power Implementation

A low power implementation is realized by partitioning the channel estimator into three clock domains based on functionality. The total reported area of the low power implementation is 0.24 mm². This is an increase of about 10% in area, due to the communication overhead, mainly from the asynchronous queues. Table 3 shows the area occupied by each clock domain. It can be seen that clk2 and clk3 can be turned off for more than half of the time, since they are active for 45% and 40% of the time, respectively. The power simulations were performed on the gate-level netlist with back-annotated timing and toggle information. The power consumption was estimated at a clock rate of 250 MHz for all clock domains; for comparison, power is normalized to throughput, as presented in Table 4. The normalized power consumption is reduced by 45% for the low power CAL implementation compared to the hardware generated by the original CAL tool. Further reduction in power consumption is possible by varying the clock rate of the different clock domains (dynamic voltage and frequency scaling techniques).

6. Conclusions

This paper presents a method to generate an efficient hardware design with the CAL dataflow language. Since the currently available CAL generation tool was designed for FPGA hardware implementation, changes were made to facilitate ASIC implementation. Further modifications were made to the tool to optimize hardware generation. An OFDM channel estimator was implemented in 65 nm CMOS technology with the modified CAL generation tool. The hardware implemented with CAL has a higher throughput than the reference design. Due to the higher abstraction and the handshake-based interface of an actor, the design is not based on clock cycles as in RTL. Hence changes made to one or more actors do not affect the rest, which makes design space exploration easier. A study of GALS design implementation with CAL was carried out by dividing the design into smaller clock domains.
This division into clock domains is easily done in the CAL network description. The tool generates the asynchronous handshakes between clock domains, which makes GALS implementation a simple task. A clock gating scheme was integrated into the tool to support

low power ASIC implementation. The data token based clock gating gave remarkable reduction in dynamic power consumption. The reduced design time for comparable area and low power consumption in CAL based design is very encouraging, considering that the CAL implementation is at a higher level of abstraction. 7. Acknowledgments We thank Ericsson research and Lund University for providing the opportunity to work on this project. Also would like to thank MULTI-BASE and ACTORS project, both funded by 7th Framework Programme (FP7) of the European Commission and Swedish VINNOVA Industrial Excellence Center (SOS). References
