
A Fast Low Power Embedded Cache Memory Design

ZHAO Xue-mei, YE Yi-zheng, WU Ming-yan, LI Xiao-ming


P.O.B. 313, Microelectronics Center, Harbin Institute of Technology (HIT), Harbin 150001
e-mail: meizie@263.net

Abstract
A 64kb cache system designed for a 32-bit RISC CPU has been realized. The circuits include two 4ns 32kb cache memories, two 1.4ns 64-entry fully associative translation lookaside buffers (TLBs), and two 2ns 64-line tagRAMs. High-speed decoders and sense amplifiers are employed. The TLB design contains a line encoder and valid bits with flash clear. The cache memory reduces power and achieves a faster access time by using an optimized decoder and sense amplifier. The SMARCH algorithm makes the whole cache system self-testable.

Cache Architecture
A cache is a high-speed memory that holds the instructions and data most likely to be needed by the processor. Dividing the cache into instruction and data sections allows the processor to access instructions and data simultaneously, thereby doubling the effective cache-memory bandwidth. If the processor issues a reference to an item contained in the cache, a zero-wait-state access is made; if the reference is not contained in the cache, the longer latency of the true processor memory is incurred. Caches rely on the principles of locality in software: when a data/instruction element is used by a processor, it and its close neighbors are likely to be used again soon.

Since the cache is typically many orders of magnitude smaller than main memory or the virtual address space, not all the data (or instructions) required by the processor can be held in the cache. The information used to determine whether a cache hit occurs must therefore be recorded. This information is called the cache TAG; it typically consists of part of the main-memory address of the data item held in that cache element. Thus, when the processor issues an address for a reference, the cache controller compares the TAG with the real address to determine whether a hit occurs.

Our cache is logically organized into separate instruction and data sections. Each section contains a 32kb cache, a 64-entry fully associative translation lookaside buffer (TLB), and a 64-entry tagRAM. Fig 1 shows the block diagram of the cache memory's instruction section. The higher-order bits<25:12> are translated from virtual to real by accessing the TLB, while the low-order bits<11:0> begin the cache access. The TLB output is compared with the tagRAM output to determine whether the required information is in the cache, and the correct cache word is sent to the CPU. The TLB unit consists of a 64-entry content addressable memory (CAM), a 64-entry RAM driven directly by the CAM, ROM cells for the line address and hit functions, and two valid bits in the CAM section with single-cycle clear. The tagRAM and cache memory blocks are organized into two compartments built from dual-port SRAM cells. In the compare period, a match operation is initiated by precharging the MATCH line and the CAM bit lines high. Applying the data to be matched to the bitlines causes MATCH to discharge if a mismatch exists; only when all the compare data are identical to the contents of the CAM does MATCH stay high, and MATCH then drives one row of SRAM cells to output its data.

Fig 1. Cache block diagram
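As a behavioral illustration of this access path (not part of the fabricated design), the following C sketch models the parallel TLB lookup and tag comparison. The 64-entry and 14-bit parameters follow the text above; the structure layout, the choice of index bits, and all names are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES 64
#define CACHE_LINES 64                /* 64-line tagRAM */

typedef struct {                      /* one TLB entry */
    uint32_t vpn;                     /* virtual bits<25:12> held in the CAM */
    uint32_t rpn;                     /* real page number held in the RAM    */
    bool     valid;                   /* flash-cleared valid bit             */
} TlbEntry;

typedef struct {                      /* one tagRAM entry */
    uint32_t tag;
    bool     valid;
} TagEntry;

/* One access: the low-order bits index the cache while bits<25:12>
   are translated by the TLB; a hit requires the TLB output to equal
   the tagRAM output for the selected line. */
bool cache_lookup(const TlbEntry tlb[TLB_ENTRIES],
                  const TagEntry tags[CACHE_LINES],
                  uint32_t vaddr, uint32_t *rpn_out)
{
    uint32_t vpn   = (vaddr >> 12) & 0x3FFFu;            /* bits<25:12>        */
    uint32_t index = (vaddr >> 6) & (CACHE_LINES - 1u);  /* assumed index bits */

    /* CAM search: in hardware all 64 entries compare in parallel, and
       MATCH stays high only for a full match on a valid entry. */
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *rpn_out = tlb[i].rpn;
            return tags[index].valid && tags[index].tag == tlb[i].rpn;
        }
    }
    return false;  /* TLB miss: the longer main-memory latency is incurred */
}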

Circuit Design
Cache memories in a processor system need high-speed, low-power SRAM, because the performance of the processor system is primarily determined by the cache memories; SRAMs of ever higher speed are always in demand [1]. The access time of an SRAM consists mainly of the delay of the decoder and the delay of the sense amplifier circuits. With an optimized trade-off between the decoder and amplifier schemes, the cache obtains not only high speed but also low power. A large portion of cache energy is dissipated in driving the bitlines, which are heavily loaded with multiple storage cells; hence energy-reduction techniques such as bit equilibration and precharging have concentrated on reducing bitline energy [2].

In order for the cache to offer high speed, low power consumption, and higher integration, several circuit design techniques are introduced. To begin with, a block scheme that effectively lowers the power of the cache memories is described. Secondly, a decoder structure is introduced. Thirdly, an optimized sense amplifier is discussed, and the built-in self-test circuit is presented last.
Divided Word Line Technology
During an access to a row of the memory, the word line activates all the cells in that row, and the desired sub-word is accessed via the column multiplexers. This arrangement has two drawbacks: the word line RC delay grows with the square of the number of cells in the row, and the bitline power grows linearly with the number of columns. Both drawbacks can be overcome by dividing the memory into smaller blocks of cells using the Divided Word Line (DWL) technique, first proposed by Yoshimoto et al. [3]. In the DWL technique the long word line of a conventional array is broken into k sections, each activated independently, reducing the word line length by a factor of k and its RC delay by a factor of k². This method has been widely used to achieve high speed and reduce power consumption.
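A one-line derivation shows where the quadratic saving comes from. If a word line of length $L$ has resistance and capacitance each proportional to $L$, its distributed RC delay scales as $L^2$; dividing it into $k$ independently activated sections therefore gives, to first order,

$$t_{DWL} \propto \left(\frac{L}{k}\right)^{2} = \frac{t_{full}}{k^{2}},$$

so, for example, $k = 4$ cuts the word-line RC delay by roughly a factor of 16 while only the selected section's cells load their bitlines.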

Our cache memories are built from 1kb memory blocks together with address decoders and sense amplifiers. Each 8kb block contains all the circuits needed for memory operations except the address decoders. The capacity and the number of I/Os can be organized easily by changing the array organization, and only one block at a time is active, selected by the block address. The floor plan of the instruction (or data) cache is shown in Figure 2. The structure of the tagRAM is similar to that of the cache, but its capacity is smaller.

Fig.2 Floor plan of instruction (or data) cache

Fast Low-power decoder

The decoder encompasses the circuits from the address input to the word line. The decoder delay consists of the gate delays in the critical path and the interconnect delay of the predecoder and word line wires. The wire RC delay grows as the square of the wire length within the decoder structure, especially along the word line, and becomes significant in large SRAMs. Sizing the gates in the decoder allows trade-offs between delay and power, and the decoder delay can be greatly improved by optimizing the circuit style used to construct the decoder gates. The decode gate delay can be significantly reduced by using pulsed circuit techniques, where the word line is not a combinational signal but a pulse that stays active for a certain minimum duration and then shuts off. Thus, before any access, all the word lines are off and the decoder only needs to activate the word line for the new row address. Since only one kind of transition needs to propagate through the decoder logic chain, the transistor sizes in the gates can be skewed to speed up this transition and minimize the decode delay. A decoder using pulsed techniques in the decode path reduces the delay significantly for only a modest increase in power compared to the conventional static technique.

We divide the decoder path into three partitions: the predecoder, the global word decoder or block decoder, and the local word decoder (Figure 3). Both the block select and the global word line drivers have skewed gates for maximum speed, and the predecoders use a NOR-style wide fan-in input stage.

Fig 3. Three-stage decoder structure
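To make the three-stage split concrete, the sketch below slices a row address the way the hierarchy implies: a few bits select a block and the remainder select a global word line, with the local word line firing only where the two agree. The bit widths (2 block bits for 4 blocks, 7 global bits matching the 512-row organization of Table 1) are illustrative assumptions; the paper does not state the exact partition.

#include <stdint.h>

#define BLOCK_BITS  2   /* block decoder: selects 1 of 4 blocks     */
#define GLOBAL_BITS 7   /* global word decoder: 1 of 128 global WLs */

typedef struct {
    unsigned block;      /* one-hot in hardware; an index here  */
    unsigned global_wl;  /* global word line shared by blocks   */
} DecodedRow;

/* Predecoding groups address bits into partially decoded signals so
   the NOR-style wide fan-in input stage stays fast; behaviorally the
   result is just a field split of the row address. */
static inline DecodedRow decode_row(uint16_t row_addr)
{
    DecodedRow d;
    d.block     = (row_addr >> GLOBAL_BITS) & ((1u << BLOCK_BITS) - 1u);
    d.global_wl =  row_addr & ((1u << GLOBAL_BITS) - 1u);
    return d;
}

/* The local decoder ANDs block select with the global word line, so
   only one short local word line toggles per access. */
static inline int local_wl_active(DecodedRow d, unsigned blk, unsigned wl)
{
    return (d.block == blk) && (d.global_wl == wl);
}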

Sense Amplifier
A sense amplifier is also important for obtaining a fast access time. The circuit schematic of the sense amplifier is shown in Fig 4. The sense amplifier contains two stages: the first stage is a modified current-mirror circuit to which two additional p-channel transistors are connected, while the second stage is a normal current mirror. The input signals are normally precharged high, so a voltage difference of less than 100mV between the I/O lines is easily amplified. Although the current-mirror amplifier is larger than a latch-type amplifier, it realizes high speed with asynchronous operation. For this reason, our tag, cache, and TLB memories use the current-mirror amplifier with the SRAM cell.

Figure 4. Schematic of the sense amplifier

Design for Testability
With device gate counts and pin counts increasing and the cost of external functional testing rising, increases in hardware speed and functional complexity must be complemented by robust and complete self-test capability [4]. Cache BIST coverage is maximized by executing a complete SMARCH on the CAM read/write port while simultaneously running compare cycles on the compare port. Priority encoding of the match line outputs ensures the correct hit/miss sequence. The CAM SMARCH sequence ends by preloading the CAM with known states to address the RAM core in the subsequent RAM SMARCH tests; this verifies the CAM-RAM match-line-to-word-line interconnection. A schematic of the cache as implemented in a product code is shown in Figure 5. The cache memory BIST circuit overrides the standard IEEE 1149.1 Test Access Port (TAP) to gain serial access to the compare-data and output sections of the boundary scan registers. It then serially shifts appropriate data into the compare data inputs and samples the RAM read-only output signals for the CAM and RAM test algorithms.

Figure 5. Cache block with BIST circuit
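SMARCH itself is a serial march algorithm tailored to serial BIST interfaces [4]; the C sketch below is a much-simplified, word-oriented march-style test, shown only to illustrate the read/write discipline that gives march tests their fault coverage. The array size and the element sequence are assumptions, not the SMARCH sequence used on this chip.

#include <stdint.h>
#include <stddef.h>

#define WORDS 512

/* Simplified march-style test over a RAM model: ascending and
   descending passes that read the expected value and write its
   complement, exposing stuck-at and coupling faults. */
int march_test(uint32_t ram[WORDS])
{
    /* M0: initialize everything to 0. */
    for (size_t i = 0; i < WORDS; i++) ram[i] = 0;

    /* M1 (ascending): read 0, write all-ones. */
    for (size_t i = 0; i < WORDS; i++) {
        if (ram[i] != 0) return -1;          /* fault detected */
        ram[i] = 0xFFFFFFFFu;
    }

    /* M2 (descending): read all-ones, write 0. */
    for (size_t i = WORDS; i-- > 0; ) {
        if (ram[i] != 0xFFFFFFFFu) return -1;
        ram[i] = 0;
    }

    /* M3: final read of 0. */
    for (size_t i = 0; i < WORDS; i++)
        if (ram[i] != 0) return -1;

    return 0; /* pass */
}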

Chip Performance


The instruction (or data) cache occupies 0.6mm²; its layout is shown in Figure 6. The 6-Tr SRAM single-port cell area is 40.53µm² (4.2 × 9.65µm), and the dual-port cell area is 46.2µm² (4.2 × 11µm). The timing results simulated with Cadence tools are shown in Figure 7. The chip has been simulated and laid out in a 0.35µm four-level-metal technology and functionally verified. The power consumption of each cache port is 0.42mW/MHz. The cache memory features are summarized in Table 1.



Figure 6. Layout of instruction cache

Figure 7. Results of simulated waveforms



Table 1. Cache features

Organization    512 × 32 bits
Technology      0.35µm CMOS
Memory cell     dual-port 6-Tr

References
[1] H. Nambu, K. Kanetani, et al., A 1.8ns access, 550MHz, 4.5Mb CMOS SRAM, IEEE Journal of Solid-State Circuits, 33, 11, (1998).
[2] K. Ghose and M. B. Kamble, Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation, IEEE Symposium on Low Power Electronics, (1999), p.70.
[3] M. Yoshimoto, et al., A 64kb CMOS RAM with divided word line structure, 1983 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, (1983), p.58.
[4] B. Nadeau-Dostie, A. Silburt, V. K. Agarwal, Serial interfacing technique for embedded-memory testing, IEEE Design and Test of Computers, 7, 2, (1990), p.56.
