
Power Consumption Awareness in Cache Memory Design with SystemC

Smail NIAR*, Samy MEFTALI, Jean-Luc DEKEYSER


INRIA-FUTURS, DART Project, University of Lille, France
[niar, meftali, dekeyser]@lifl.fr

* Also at the University of Valenciennes, France

Abstract

This study presents the development of a cache memory module in a component library designed for fast and synthetic embedded system simulation. This paper also demonstrates the possibility of integrating an existing power consumption analytical model in a SystemC description at the cycle-accurate register-transfer level (RTL).
Keywords: SystemC, power consumption, processor, cache.

1. Introduction

Energy consumption issues play an increasingly important role in the design of new electronic digital systems [1]. This change in designer attitude is primarily motivated by a desire to increase battery autonomy in embedded and mobile systems, to take into account the thermal issues affecting the cooling, packaging and reliability of embedded and high-performance systems, and finally to manage the environmental impact of mobile computer systems.
In addition, over the last few years, important progress has been made in the field of integrated circuit technology. High-performance, low-cost embedded systems have been designed using a system-on-chip (SoC) approach [2]. A side effect of this development is that SoCs have become more and more complex, requiring high-level tools (for simulation, performance estimation and synthesis) during the design phase. The SystemC language is a C++ library whose aim is to facilitate complex system design by supporting hardware and system-level modelling [3]. However, although many research projects over the last few years have been devoted to improving and facilitating simulation with SystemC, very little attention has been paid to the question of power consumption evaluation in SoC design using this language [4][5].
To remedy this lack, we have designed a SystemC module library that serves as a framework for a new design methodology dedicated to embedded systems. These modules allow easy performance (execution time) and energy consumption estimations. With our design methodology, SoC descriptions are synthetic, modular and accurate. These three characteristics are very important in that:
• Synthetic descriptions permit better functional and structural understanding of the SoC. They make it possible to have several abstraction levels in the same project, thus offering a compromise between precision and speed during simulation.
• Modular descriptions make it possible to reuse existing modules to design new SoCs, with the only cost being the separation of the SoC's "implementation" and "functional" aspects.
• Accurate and detailed descriptions both guarantee that the performances measured by simulation are equivalent to those of the final SoC hardware and prevent any ambiguity in the ultimate implementation phase.
This paper provides a detailed description of one of the library modules, the cache memory module, at the cycle-accurate RTL. Briefly, this cache module has the following features:
• Modular power consumption evaluation, based on an analytic model.
• Modular SystemC-based specifications, for module reuse.
• Cache memory configuration exploration, for determining the best cache configuration for each application running on the SoC.

2. Function and importance of cache memory for an embedded system

New applications such as multimedia, image processing, telecommunication and networking are memory-centric and require processing more and more data in less and less time. For this reason, growing numbers of embedded hardware platforms integrate ever-increasing cache sizes. For instance, the new Intel XScale embedded processor core has two 32-way associative 32 KB caches (one for instructions and one for data).
The size of the instruction and data caches in the new MIPS32 architecture can range from 256 bytes to 4 Mbytes. This tendency will most likely continue in future embedded processors because of the new applications' needs in terms of memory bandwidth. One of the consequences of this trend is that the number of transistors and the chip area devoted to caches have also increased. In some embedded hardware platforms (such as the Intel StrongARM), the area taken by the cache memory can reach 50% of the total core, for a power consumption of up to 50% of the total power consumption of the microprocessor system. In addition to their impact on performance, most cache structures are independent of the processor architecture and the instruction set. Given this context, we chose to present an example describing an on-chip cache.
To evaluate the access time and the power consumption of the cache module in our library, we used an existing analytical cache model, namely Cacti. This model is an integrated access time, power consumption and chip area model for on-chip cache memories, and it supports multibanked caches. In the Cacti model, each bank is composed of several units: arrays for storing tags and data, tag comparators, and multiplexers for selecting a word (typically 8 bytes) out of a cache line consisting of B bytes. In this paper, only one bank is considered. In order to evaluate the access time, the per-access energy consumption and the chip area using Cacti, the user must determine the following parameters:
• S: total size in bytes
• B: block size in bytes
• Assoc: associativity
• T: technology size (0.1 µm by default)
• Pread: the number of input ports
• Pwrite: the number of output ports
• Pread_write: the number of input-output ports.
In this study, Pread=Pwrite=0 and Pread_write=1. Using these parameters, Cacti determines the best layout (or configuration) that optimizes both the access time and the energy consumption. More details about the power consumption model used by Cacti are given in [6,7].

3. Cache memory with SystemC

The SystemC library is object oriented and allows a clear separation between the structures and behaviours of architectural components. It also permits hierarchical designs (hierarchical sc_module). SystemC also offers several design possibilities at several abstraction levels. In fact, it contains both high-level data types and low-level ones. The latter allow bit-accurate, cycle-accurate specifications which are able to give accurate performance estimations. For all these reasons we decided to specify our library using SystemC.
Figure 1 shows the position of our module as a level 1 (L1) cache. The figure also shows the cache's communication interfaces with the processor as well as with the next memory level, which may be either the second cache level or the main memory. The protocol used for the processor-cache and cache-nextLevelMemory interfaces is the same. It is an asynchronous protocol and uses three control signals (request, write, and ack) and two buses (address and data). Transfers are initiated on behalf of either the processor (when executing a memory instruction, i.e. a load or a store) or the L1 cache (when a cache miss occurs).

[Figure 1: the processor, the L1 cache and the next memory level connected by address and data buses, with req/write/ack control signals on the processor side and memReq/memWrite/ack on the memory side]
Figure 1. The cache module as a level 1 (L1) cache and the transfer protocols.

When the processor decodes a memory instruction, the request signal is asserted. If the referenced block is present, then the Ack signal from the L1 cache to the processor is asserted, and the operation, either a read (write=0) or a write (write=1), is performed in the cache. Otherwise, the block is first transferred from the next memory level, and only then is the Ack signal asserted. More details about the data transfer protocols are presented in figure 2, which illustrates three data transfers in a system. Due to space limitations in this paper, the memory access latency is fixed to zero (Lat=0) and the cache block is twice as wide as the cache-to-memory bus (BlocSize=2). In the first transaction, there is a miss at address 0 (3 cycles). In the second transaction, there is a hit (1 cycle). The third transaction in figure 2 shows the beginning of a cache miss at address "1000", which generates a conflict with block 0. This block must then be saved (3 cycles) before loading the new block (3 cycles).
Figure 3 depicts the internal structure of the cache. It consists of four unit types: the decoder, Assoc banks, the replacement policy logic, and the cache controller logic. One SystemC method (sc_method) is associated with each of these units. Connections between these units are implemented by signals (sc_signal) through ports. The bank unit stores both tags and data, and the comparator logic is used to check the match between the requested block and the selected block in the bank. After this comparison, a hit signal is sent to the cache controller logic, which sends the Ack signal to the CPU. The replacement policy logic holds block histories and, in the case of a conflict, determines which block to evict from the cache. Several policies are available: FIFO, LRU, and random.
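To make this structure concrete, the following is a minimal SystemC sketch of how such a cache module could be declared. It is only an illustration under assumptions: the port names (borrowed from Figure 1), the template parameters, the signal widths and the process names are hypothetical and do not come from the library's actual code.

#include <systemc.h>

// Hypothetical skeleton of the L1 cache module. The ports follow Figure 1,
// the four SC_METHOD processes follow the unit types of Figure 3.
template <int CPU_BUS, int MEM_BUS>           // bus widths in bytes (cpu2cache, cache2mem)
SC_MODULE(Cache) {
  // processor-side interface
  sc_in<sc_uint<32> >           cpuAddress;
  sc_in<bool>                   cpuReq, cpuWrite;
  sc_in<sc_uint<8 * CPU_BUS> >  cpuDataIn;
  sc_out<sc_uint<8 * CPU_BUS> > cpuDataOut;
  sc_out<bool>                  cpuAck;

  // next-memory-level interface
  sc_out<sc_uint<32> >          memAddress;
  sc_out<bool>                  memReq, memWrite;
  sc_in<sc_uint<8 * MEM_BUS> >  memDataIn;
  sc_out<sc_uint<8 * MEM_BUS> > memDataOut;
  sc_in<bool>                   memAck;

  // internal signals connecting the four units of Figure 3
  sc_signal<sc_uint<16> > setIndex;   // produced by the decoder
  sc_signal<bool>         hit;        // result of the tag comparison in the banks
  sc_signal<sc_uint<4> >  victimWay;  // way chosen by the replacement policy

  void decode()      { /* split cpuAddress into tag, set index and block offset */ }
  void bank_access() { /* look up the selected set in the Assoc banks, compare tags */ }
  void replacement() { /* FIFO / LRU / random choice of the victim way on a conflict */ }
  void controller()  { /* drive cpuAck and the memory-side handshake, count hits/misses */ }

  SC_CTOR(Cache) {
    SC_METHOD(decode);       sensitive << cpuReq << cpuAddress;
    SC_METHOD(bank_access);  sensitive << setIndex;
    SC_METHOD(replacement);  sensitive << hit;
    SC_METHOD(controller);   sensitive << hit << memAck;
  }
};

With such a template, the instantiation shown in section 4, Cache<cpu2cache, cache2mem>, would fix the two bus widths, for example Cache<4, 8> for the 4-byte processor bus and the 8-byte memory bus reported in Figure 4.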

[Figure 2: timing diagram of three transactions - reading block 0 on a miss (1+Lat+BlocSize cycles), a hit (1 cycle), then saving block 0 and reloading a new block (1+Lat+BlocSize cycles each)]
Figure 2. Three data transfers in the cache

[Figure 3: the address decoder (index/tag) feeding Assoc banks of Tag+Data arrays, each with a comparator producing Hit 0 ... Hit assoc-1; the replacement policy logic; and the cache controller driving Req/Ack/write towards the CPU, the interface to/from main memory, and the Cacti-based power consumption unit]
Figure 3. Internal structure of the cache memory
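The cycle counts annotated in Figure 2 follow a simple latency model: a hit costs one cycle, and every block transfer over the cache-to-memory bus costs 1 + Lat + BlocSize cycles. The helper below is a sketch of that model, not code from the library; the struct and function names are ours.

#include <cstdio>

// Hypothetical latency model matching the annotations of Figure 2.
// lat      : main memory access latency in cycles (0 in the figure)
// blocSize : block size expressed in cache-to-memory bus words (2 in the figure)
struct LatencyModel {
  int lat;
  int blocSize;

  int blockTransfer() const { return 1 + lat + blocSize; }  // one bus transaction per block

  // Cost of one cache access: 1 cycle on a hit, one block load on a clean miss,
  // plus a write-back of the victim block when it is dirty.
  int access(bool hit, bool victimDirty) const {
    if (hit) return 1;
    return (victimDirty ? blockTransfer() : 0) + blockTransfer();
  }
};

int main() {
  LatencyModel m{0, 2};                                                     // Lat = 0, BlocSize = 2
  std::printf("miss, clean victim : %d cycles\n", m.access(false, false));  // 3, first transaction
  std::printf("hit                : %d cycles\n", m.access(true,  false));  // 1, second transaction
  std::printf("miss, dirty victim : %d cycles\n", m.access(false, true));   // 3 + 3, third transaction
  return 0;
}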

The cache uses the write-allocate policy to deal with write misses. The power consumption evaluation is performed by attaching the Cacti model to the cache controller. In fact, when the cache is declared in the SystemC description, the cache configuration parameters are used to evaluate the access time and the energy consumption of each access to the cache. These two values are stored by the cache module. In conjunction with the activity statistics of the cache module (number of accesses with hits, misses, external bus accesses, etc.), they are used to evaluate the total execution time in cycles, as well as the total energy consumed by the cache, at the end of the simulation (a rough sketch of this accounting is given after Figure 4).

4. Using the cache module in a SystemC SoC description

Our cache modules can be used in two different ways. First, they can be used separately to analyze the cache performance of a given application. In this case, the cache is activated by the following command:
sc-cacheAnal -f <trace file> -config <config_file>
where sc-cacheAnal is the SystemC cache analyzer, <trace file> represents the file containing the list of memory access addresses generated by memory tracing during functional simulation, and <config_file> corresponds to the cache configuration file. The configuration file contains the following parameters (an example file is given just before Figure 4):
-nlines <num. of cache lines>
-bsize <block size in bytes>
-assoc <cache associativity>
-readPorts <num. of cache read ports>
-writePorts <num. of cache write ports>
-readWritePorts <num. of cache read/write ports>
-techno <technology size in microns>
-memLat <main memory access latency in cycles>
-cpu2cache <cpu to cache bus width in bytes>
-cache2mem <cache to main memory bus width in bytes>
The second use of our cache module is as a SystemC module in a SoC description. In this case, the cache declaration must simply be added to the SoC description as follows:
Cache<cpu2cache, cache2mem> *dcache =
    new Cache<cpu2cache, cache2mem>("dcache", nAssoc, nLines, bSize, techSize, rwPort, rPort, wPort, memLat);
Figures 4, 5 and 6 depict the experimental results for our cache where the cache performance is analyzed separately (first method). Figure 4 shows the execution outputs of our SystemC cache description for a trace file. In this example, the merge sort program was used on a vector of 20 000 elements. The simulation results shown here give statistics after the first 121118 memory references. The outputs contain two sets of statistics.
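As announced above, a configuration file describing the 8 KB, 2-way cache used in Figure 4 could look as follows. The parameter names are those listed in section 4 and the values come from Figure 4 (8192 bytes / 32-byte blocks = 256 lines, i.e. 128 sets of associativity 2); the one-option-per-line layout, the file name and the trace file name are assumptions made for illustration.

-nlines 256
-bsize 32
-assoc 2
-readPorts 0
-writePorts 0
-readWritePorts 1
-techno 0.35
-memLat 2
-cpu2cache 4
-cache2mem 8

The analysis would then be launched with, for example, sc-cacheAnal -f mergesort.trace -config dcache.cfg.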
start ..... ……etc….
Cacti Statistics:
Main Memory configuration: latency = 2
Cache configuration:
Size in bytes: 8192
Number of sets: 128
Associativity: 2
Block Size (bytes): 32
Read/Write Ports: 1
Read Ports: 0   Write Ports: 0
Technology Size: 0.35um   Vdd: 2.6V
Access Time (ns): 2.19856
Power (nJ): 3.37432
Best Ndwl (L1): 1   Best Ndbl (L1): 2
Time Components:
data side (with Output driver) (ns): 1.70219
tag side (with Output driver) (ns): 2.19856
decode_data (ns): 0.405051
(nJ): 0.075142
wordline and bitline data (ns): 0.601265
compare (ns): 0.557825
(nJ): 0.0110586
*******************
SYSTEMC CACHE POWER AWARE SIMULATOR
*******************
Cache Configuration:
LSU to Dcache Bus width in bytes: 4
Dcache to Mem bus width in bytes: 8
Statistics:
Load / Store Instruction Nbr: 121118
SystemC: simulation stopped by user.
simulation time: 5.53403 seconds
#cycles: 131330
#Miss: 1733   #Hit: 119385
#Cache Bloc Read: 121118
#Cache Bloc Write: 38107
Power per access: 3.37432e-09
Total power in Cache (J) = 0.000537276
Figure 4. Statistics report for an application example
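The totals reported in Figure 4 can be related to the per-access values produced by Cacti: 3.37432e-9 J per access multiplied by the 121118 cache block reads plus the 38107 cache block writes gives the 0.000537276 J reported above. The snippet below is only a plausible sketch of this end-of-simulation bookkeeping, not the library's actual code.

#include <cstdio>

// Plausible end-of-simulation accounting for the cache module (illustration only).
// The per-access energy comes from Cacti when the cache is declared; the counters
// are the activity statistics accumulated by the cache controller during simulation.
struct CacheStats {
  long   reads  = 0;               // #Cache Bloc Read in Figure 4
  long   writes = 0;               // #Cache Bloc Write in Figure 4
  double energyPerAccess = 0.0;    // joules per access, from Cacti (Power (nJ) * 1e-9)

  double totalEnergy() const { return (reads + writes) * energyPerAccess; }
};

int main() {
  CacheStats s;
  s.reads  = 121118;
  s.writes = 38107;
  s.energyPerAccess = 3.37432e-9;
  std::printf("Total energy in cache: %g J\n", s.totalEnergy());  // ~0.000537276 J, as in Figure 4
  return 0;
}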

The first set corresponds to the statistics given by Cacti and is related only to the cache configuration, not to the application. Cacti also reports the power and access time contribution of each cache component (decoder, wordline, bitline, etc.).
The second set of statistics corresponds to application performance. It consists of the number of memory references, the number of cycles needed to execute these memory references, the number of hits and misses in the cache, and the total energy consumed by the cache.
Figures 5 and 6, respectively, present the total execution time (in millions of cycles) and the total energy consumption in millijoules (mJ) for executing the merge sort program on an array of 20 000 elements. This program generates 1 409 836 memory references. The optimal value for the cache associativity or the block size for a given application (figures 5 and 6) will depend on the relative weight of the execution time and the power consumption. These experiments show that it is possible to use our cache description in a design space exploration to determine the best cache configuration for a given application or set of applications.

[Figure 5: execution time in 10^6 cycles (about 1.3 to 2.3) versus block size in bytes (16, 32, 64, 128, 256), one curve per associativity (assoc = 1, 2, 4, 8)]
Figure 5. Execution time in millions of cycles

[Figure 6: total energy consumption in mJ (0 to 150) versus block size in bytes (16, 32, 64, 128, 256), one curve per associativity (assoc = 1, 2, 4, 8)]
Figure 6. Total energy consumption

5. Conclusion and perspectives

After presenting the original aspects of our component library, we described the cache module structure in detail. This SystemC description allows accurate performance analysis as well as accurate evaluation of the cache's energy consumption. In the near future, multi-banked caches will be available in our library [8], and the library will be enhanced with several other components (processors, DRAM, buses, etc.).

6. References

[1] T. Mudge, "Power: A First Class Design Constraint", IEEE Computer, April 2001.
[2] G. Martin, H. Chang, "Winning the SoC Revolution", Kluwer Academic Publishers.
[3] www.systemc.org
[4] www.microlib.org
[5] Orinoco, www.chipvision.com
[6] S. Wilton and N. Jouppi, "An Enhanced Access and Cycle Time Model for On-Chip Caches", WRL Research Report, 1994.
[7] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An Integrated Cache Timing, Power, and Area Model", WRL Research Report, 2001.
[8] S. Niar, L. Eeckhout, K. De Bosschere, "Comparing Multiported Cache Schemes", Int. Conf. on Parallel and Distributed Processing Techniques and Applications, 2003.