
Microprocessors and Microsystems 28 (2004) 127–134

www.elsevier.com/locate/micpro

Reconfigurable co-processor for Kanerva's sparse distributed memory


Marcus Tadeu Pinheiro Silva^a, Antonio Padua Braga^b,*, Wilian Soares Lacerda^c

a Federal Center of Technological Education, Belo Horizonte, MG, Brazil
b Department of Electronics Engineering, Federal University of Minas Gerais, Caixa Postal 209, CEP 30.161-970 Belo Horizonte, MG, Brazil
c Department of Computing, Federal University of Lavras, Lavras, MG, Brazil

Received 5 August 2003; revised 28 January 2004; accepted 29 January 2004

Abstract
The hardware implementation of the first layer of Kanerva's sparse distributed memory (SDM) is presented in this work. The hardware consists of a co-processor board that connects to the ISA standard bus of an IBM PC-compatible computer. The board, named reconfigurable co-processor for SDM (RC-SDM), comprises Xilinx FPGAs, local random access memory and bus interface circuits. Based on the in-system reconfiguration capability of FPGAs, RC-SDM easily allows changes to the characteristics of the implemented SDM topology. First results show a speed-up of four times for RC-SDM in relation to a software implementation of the algorithm.
© 2004 Elsevier B.V. All rights reserved.
Keywords: Associative memory; Neural networks; FPGAs

1. Introduction
Physical implementation of associative memories has been on the agenda for more than four decades and has been studied from different perspectives. From the computer architecture point of view, the term content addressable memory (CAM) [1] was coined to refer to those memory implementations that can retrieve information from partial clues of the content. The physical implementation of CAMs was usually accomplished with conventional random access memories (RAM) and additional hardware. From the connectionist point of view [2], associative memories were implemented with neuron-like elements, or processing elements (PE), connected by a network structure. Information was spread throughout the network and stored in the connections between processors.
The sparse distributed memory (SDM) [3] is an
associative memory model that can be seen from both
perspectives. From the computer architecture point of view,
it can be seen as a generalization of a RAM and from the
connectionist perspective as an artificial neural network
(ANN). Associative memory properties are achieved by
* Corresponding author. Tel.: 55-31-3499-4869; fax: 55-31-34994850.
E-mail address: apbraga@cpdee.ufmg.br (A.P. Braga).
0141-9331/$ - see front matter q 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.micpro.2004.01.003

spreading sparse decoders into the high dimensional input


(address) space. The associative memory properties appear
as the dimension of the input space increases, which is in fact a difficulty for its physical implementation. Computationally intensive calculation of sparse address decoding is one of the main bottlenecks of software and hardware implementations of SDMs, since it requires Hamming distance¹ calculations between the high dimensional input address and all the sparse decoders.
This paper describes a hardware implementation of SDMs that is aimed at implementing their most computationally intensive calculations in a co-processor board. The circuit, implemented on an IBM PC platform with reconfigurable hardware, has 32 PEs that calculate the Hamming distances between the input address and 32 sparse decoders in parallel. The co-processor board, named here the reconfigurable co-processor for SDMs, or simply RC-SDM, falls into the category of reconfigurable machines with an extended instruction set [6], since a reconfigurable platform is established by the connection of an FPGA-based co-processor to a host machine. Adaptive computation is carried out by an application program running on the host. This program configures the FPGAs and then transfers data to be processed to the co-processor board.

* Corresponding author. Tel.: +55-31-3499-4869; fax: +55-31-3499-4850. E-mail address: apbraga@cpdee.ufmg.br (A.P. Braga).
¹ The Hamming distance is defined as the number of bits in which two binary words differ.


Data processing in the co-processor board is accomplished in parallel with program execution by the host processor.
The improved performance of the reconfigurable co-processor system is significant when compared to a program running in the host with optimized assembly code. The flexibility of the board allows for adaptive reconfiguration of the current implementation, and further improvements may be achieved by using an extension of the proposed architecture.

2. Description of SDM
The essence of SDMs is to use sparsely distributed decoders in a high dimensional Boolean space, so that any sparse decoder, or hard storage location, is accessed from anywhere in the space that is at a Hamming distance smaller than r bits from its base address. Therefore, each decoder responds to all the vectors inside a hyper-sphere, or circle in SDM terminology, with radius r and centre at the location's base address. Depending on the selected value of r, input vectors may access more than one storage location at the same time, allowing data to be stored and retrieved concurrently to and from several memory storage locations. Using Kanerva's analogy between a hyper-sphere in the n-dimensional Boolean space and a circle in the plane [5], an example of the effect of the chosen radius r on the concurrent access by an input vector ξ to two sparse decoders z_m and z_n is shown in Fig. 1. The chosen radius r_a is large enough to allow access by ξ to both decoders z_m and z_n. If r_a were too small, ξ could have been located outside the two hyper-spheres and, consequently, could not have accessed either of them. Therefore, the condition for ξ to access both of the arbitrary sparse decoders z_m and z_n is that the distance from ξ to both of them is less than the chosen radius r. In other words, ξ must be inside the intersection of the hyper-spheres with centres in z_m and z_n [5].
The size of the radius r in SDMs must be such that the union of all the sets of elements inside the hyper-spheres S_i(z_i, r_i) includes all the 2^n elements of the space {0,1}^n, so that any input vector accesses at least one sparse decoder.
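In symbols (a restatement of the access rule above, with h(·,·) denoting the Hamming distance), the selection of a decoder i by an input vector ξ can be written as

  y_i = \begin{cases} 1, & h(\xi, z_i) \le r_i \\ 0, & \text{otherwise,} \end{cases}

so that vector y marks exactly the hard locations whose hyper-spheres contain ξ.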

Fig. 1. ξ is inside the intersection of the two circles and, consequently, accesses both decoders.

The distributed storage and retrieval of data in SDMs is responsible for its associative memory properties. Therefore, the intersection between two hyper-spheres S_i(z_i, r_i) and S_j(z_j, r_j) separated by the Hamming distance h is a basic issue in the analysis of SDMs, since it indicates the number of vectors in the space that simultaneously access both storage locations z_i and z_j. Of course, in the limit cases when r = 0 and r = n, the number of patterns that access a given hard location is, respectively, 1 and 2^n. For r in the range between 0 and n, it is possible to estimate the percentage of the Boolean space that is in the intersection of two hyper-spheres [7] and predict associative memory performance.
The terms hard storage location and sparse decoder used in SDM terminology resemble those used to describe conventional computer RAMs. In fact, SDMs can be seen as a general case of a RAM, since, in the limit case when r is set to zero and there are 2^n sparse decoders, a SDM actually becomes a conventional RAM. The main differences between a RAM and a SDM are:

- the address decoder of a RAM is selected only if its base address is presented at the inputs (r = 0), whereas a sparse decoder of a SDM is selected from anywhere within r bits Hamming distance from the base address (r > 0);
- a conventional RAM has 2^n decoder outputs and storage locations, where n is the size of the input address vector, whereas a SDM has only a portion M of the maximum 2^n sparse decoders and hard storage locations;
- a hard storage location of a SDM, instead of storing only one data vector, like the storage locations of a RAM, holds a statistical measure of all the data vectors that accessed that location during the storage process.
SDM is, therefore, an associative memory designed to store information coded in long strings of bits, corresponding to binary vectors with hundreds or thousands of coordinates. The basic principles behind Kanerva's SDM are based on the properties of the high dimensional Boolean space [3]. SDMs can be analyzed from two different perspectives: as a generalization of a conventional computer RAM with associative memory properties, and also as a two-layer feed-forward neural network [4].
There is no general expression for determining r as a function of the Boolean space dimension n. As in Kanerva's original work [3], n should be large and r should be approximately n/2. In most examples in his original work n = 1000 and r is a bit less than n/2; the value used in most examples was r = 451. The value of r should not be larger than n/2, since this would include most of the input space (more than 50%) and would cause too much overlap among the sparse decoders. If r is too small, only a small percentage of the input space is covered by the hyper-spheres. This is the reason why r is normally chosen a bit smaller than n/2.
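To illustrate this trade-off numerically, the fraction of the space within r bits of a decoder is the binomial tail sum_{k=0}^{r} C(n,k)/2^n. The short C program below (ours, not from the paper; compile with -lm) evaluates this fraction in log space with lgamma to avoid overflow:

  #include <math.h>
  #include <stdio.h>

  /* Fraction of {0,1}^n that lies within Hamming distance r of a fixed
     point: sum_{k=0}^{r} C(n,k) / 2^n, computed in log space. */
  static double coverage(int n, int r) {
      double p = 0.0;
      for (int k = 0; k <= r; k++)
          p += exp(lgamma(n + 1.0) - lgamma(k + 1.0)
                   - lgamma(n - k + 1.0) - n * log(2.0));
      return p;
  }

  int main(void) {
      printf("n=1000, r=451: %g\n", coverage(1000, 451)); /* tiny fraction */
      printf("n=1000, r=500: %g\n", coverage(1000, 500)); /* roughly half */
      return 0;
  }

For n = 1000 and r = 451 each circle covers only a fraction of the space on the order of 10^-3, which is why a large number M of decoders is needed, whereas r = n/2 already covers about half of it.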


Fig. 2. Schematic view of the SDM.

Fig. 2 presents the schematic view of a SDM, from which both the neural network and the RAM views of a SDM can be observed. Anyone familiar with digital systems would easily associate input vector I with a RAM input address, and input data vector W with memory input data. The output data, for reading information from memory, would be associated with the output vector O. The address decoder and memory array of a conventional RAM would be directly associated with the sparse decoders S and the cells of memory H (hard storage locations). Similarly, the neural network view would characterize the structure of Fig. 2 as a two-layer neural network, with information flowing from input vector I to output vector O, having the threshold elements as the outputs of each layer.
2.1. Storage of information
The first array of a SDM, matrix S, corresponds to a decoder-like structure. Each output of the decoder, after threshold, is capable of activating only one location of memory (the corresponding row of array H), like a conventional computer RAM decoder. Nevertheless, an input address may activate more than one decoder output simultaneously, which causes data to be stored to or read from several memory locations at the same time.
The dimension n of I must be high, so that the properties of the higher dimensional Boolean space become valid. Working with very high dimensional spaces is one of the fundamental principles behind SDMs, but it also represents a difficulty for their implementation. For n = 1000, for example, the number of possible input addresses is 2^1000, which makes it inconceivable to think of a memory location for each one of the 2^1000 input addresses. Instead, M decoders and memory locations are used in SDMs, where M is a very small fraction of 2^1000 (M ≪ 2^1000). The M sparse decoders with radius r spread over the input space are expected to cover all the possible 2^1000 input addresses. A value for r is selected, based on the properties of the higher dimensional Boolean space, so that each one of the possible input addresses accesses at least one memory location.
Each row of matrix H corresponds to a word of the associative memory, similarly to a conventional RAM; but, instead of a register for each storage location, a SDM has a binary counter for each position. Each counter is incremented or decremented depending on whether a 1 or a 0 is stored: the storage of a 1 causes an increment and that of a 0 a decrement. Input data vector W accesses all the hard locations of matrix H in parallel, allowing data to be stored in more than one location. Since several memory words can be selected simultaneously by the same input address, several counters in different words are activated by a single input address. This storage process causes the information to be distributed over several memory locations. A single counter is normally activated several times when many words are stored.
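In equation form, with the input word recoded as w ∈ {−1, +1}^U (each 0 replaced by −1), every selected row i updates its counters as

  H_{iu} \leftarrow H_{iu} + w_u, \qquad \forall i : y_i = 1, \quad u = 1, \ldots, U.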


2.2. Retrieval of information


Similar to the storage process, information retrieval starts by presenting an input vector I that is expected to select more than one hard location of matrix H. An adder circuit sums up the outputs of all the counters accessed during reading in order to generate the output vector O. The result of the sum can be zero, positive or negative. A negative value occurs when more 0s (counter decremented) than 1s (counter incremented) were written during storage, and a positive value occurs when there were more 1s than 0s. A zero output indicates either that the numbers of 0s and 1s stored were the same or that the selected locations were not accessed before. A threshold function is applied to the output of each adder, so that the output value is 1 if the sum is positive, or zero if the sum is negative. Since storage occurs in a distributed form, the read output value of the SDM is a statistical measure of the stored data, obtained by a majority rule, in order to effectively generate the output vector. When information is written in a SDM, storage occurs in distributed form in the locations activated by the corresponding address vector ξ. Since the information is distributed, these locations can also be accessed when other input vectors are stored. As more and more words are added to the memory, more and more coincidences occur among the stored data, until the maximum storage capacity of the SDM is reached, which is a fraction of M [3].
Vector d in Fig. 2 contains the results of the computed distances between the input vector I and all the M sparse decoders' base addresses. Between vectors d and y, threshold elements activate their outputs to 1 if the corresponding distances in vector d are within the range 0 to r. Therefore, the active elements of vector y are only those that correspond to the hard locations accessed by the input vector I.
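The reading rule of this section can be condensed into a single expression (the tie case s = 0 is mapped to 1 here, matching Algorithm 3 below):

  o_u = \theta\Bigl(\sum_{i : y_i = 1} H_{iu}\Bigr), \qquad \theta(s) = \begin{cases} 1, & s \ge 0 \\ 0, & s < 0. \end{cases}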

3. Algorithmic representation of SDM processing

SDM processing for storage and retrieval of information is presented in the next two sections in algorithmic form. This helps in selecting the approach for the hardware implementation of the co-processor board.
3.1. Algorithm for the first layer
Processing in the first layer is the same for both storage and retrieval of information. The procedure presented in Algorithm 1 involves first the calculation of the Hamming distance h between input vector I and each of the M sparse decoders of matrix S. A bitwise exclusive-or operation between input vector I and every row of matrix S is carried out by the function ExclusiveOr(I, S[row]). The resulting vector DifVector contains 1 at every position where the two vectors I and S[row] differ. Counting the number of 1s in DifVector results in the Hamming distance h. This counting operation is carried out with the function CountOnes(DifVector). For every distance calculated, for each row, the corresponding output of vector y is set to 1 if h ≤ r, where r is the radius of the sparse decoders. I is a 1 by n vector, S is an M by n matrix and y is an M by 1 vector, where n and M are user defined parameters.
Algorithm 1. Algorithm of the first layer

for row ← 0 to M − 1 do
  DifVector ← ExclusiveOr(I, S[row]);
  h ← CountOnes(DifVector);
  if h ≤ r then
    y[row] ← 1;
  else
    y[row] ← 0
  end if
end for
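As a concrete software rendering of Algorithm 1 (ours, not the paper's code), the sketch below packs addresses into 64-bit words and uses the GCC/Clang intrinsic __builtin_popcountll in place of CountOnes; n is assumed to be a multiple of 64:

  #include <stdint.h>

  /* First layer (Algorithm 1): y[row] = 1 iff the Hamming distance between
     input address I and sparse decoder S[row] is at most r. Addresses are
     packed into n_words = n/64 words of 64 bits each. */
  void first_layer(const uint64_t *I, const uint64_t *S,
                   uint8_t *y, int M, int n_words, int r) {
      for (int row = 0; row < M; row++) {
          int h = 0;
          for (int w = 0; w < n_words; w++)   /* ExclusiveOr + CountOnes */
              h += __builtin_popcountll(I[w] ^ S[(size_t)row * n_words + w]);
          y[row] = (h <= r) ? 1 : 0;
      }
  }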

3.2. Algorithm for the second layer


SDM processing differs between reading and writing in the second layer, so two different procedures are presented in Algorithms 2 and 3. The storage phase involves basically the addition of every element of the input data vector W to the corresponding counter of the selected rows, or hard locations, of matrix H. In order to accomplish the counting operation correctly, vector W must be represented with +1 and −1 elements. The retrieval phase involves basically summing every element of the selected rows of matrix H to generate vector v, to which a threshold is applied to obtain the output vector O. In Algorithms 2 and 3, H is a matrix with M rows and U columns, and W, v and O are vectors with 1 row and U columns.
Algorithm 2. Algorithm for the storage phase of the second layer

for row ← 0 to M − 1 do
  if y[row] = 1 then
    for col ← 0 to U − 1 do
      H[row, col] ← H[row, col] + W[col]
    end for
  end if
end for

Algorithm 3. Algorithm for the reading phase of the second layer

v ← 0
for row ← 0 to M − 1 do
  if y[row] = 1 then
    for col ← 0 to U − 1 do
      v[col] ← v[col] + H[row, col]
    end for
  end if
end for
for col ← 0 to U − 1 do
  if v[col] ≥ 0 then
    O[col] ← 1
  else
    O[col] ← 0
  end if
end for
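A corresponding C sketch of Algorithms 2 and 3 is given below (again ours, with identifiers chosen for illustration), assuming the 8-bit signed counters of the configuration discussed in Section 3.3 and omitting counter saturation:

  #include <stdint.h>

  /* Storage (Algorithm 2): W[col] holds +1 or -1. */
  void store(int8_t *H, const uint8_t *y, const int8_t *W, int M, int U) {
      for (int row = 0; row < M; row++)
          if (y[row] == 1)
              for (int col = 0; col < U; col++)
                  H[row * U + col] += W[col];
  }

  /* Retrieval (Algorithm 3): sum the selected rows, then threshold. */
  void retrieve(const int8_t *H, const uint8_t *y, uint8_t *O, int M, int U) {
      int32_t v[U];                        /* C99 variable-length array */
      for (int col = 0; col < U; col++)
          v[col] = 0;
      for (int row = 0; row < M; row++)
          if (y[row] == 1)
              for (int col = 0; col < U; col++)
                  v[col] += H[row * U + col];
      for (int col = 0; col < U; col++)
          O[col] = (v[col] >= 0) ? 1 : 0;  /* majority rule */
  }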

3.3. Hardware implementation considerations


The calculation of Hamming distances in Algorithm 1 assumes that the two functions ExclusiveOr(I, S[row]) and CountOnes(DifVector) are available, but these functions are not normally in the instruction repertoire of commercial microprocessors. Even for a software implementation of the two functions, we would also have to consider that microprocessors work with limited-length operands, in the range of 64 bits nowadays, while addresses in SDMs are binary vectors with hundreds of bits. For addresses that are beyond the operand length limit, the operation has to be executed in several steps. In addition to that, the Hamming distance must be calculated between I and all the M sparse decoders of S, which results in an algorithm complexity of O(nM). Since these operations are computationally intensive for large values of n and M, which is normally the case for SDMs, they look attractive for parallel hardware implementation. Bitwise hardware exclusive-or operations could be implemented in parallel between I and all the M sparse decoders. Counting the number of 1s in the results of the exclusive-or operations would normally be a serial operation.
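As an aside, counting the 1s of a word need not be strictly bit-serial in software: the classic word-parallel (SWAR) population count below, which is not from the paper, computes CountOnes for one 64-bit operand in a handful of instructions, though it must still be repeated for every 64-bit chunk of an n-bit address:

  #include <stdint.h>

  /* Classic SWAR population count for a 64-bit word. */
  static int count_ones64(uint64_t x) {
      x = x - ((x >> 1) & 0x5555555555555555ULL);        /* 2-bit partial sums */
      x = (x & 0x3333333333333333ULL)
          + ((x >> 2) & 0x3333333333333333ULL);          /* 4-bit partial sums */
      x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;        /* 8-bit partial sums */
      return (int)((x * 0x0101010101010101ULL) >> 56);   /* fold bytes together */
  }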
There is also an inherent parallelism in the operations of the second layer, since input data vector W is applied to all the selected hard locations. Nevertheless, the number of selected rows in matrix H would normally be a small portion of the total number M of hard locations, typically a fraction between 0.01 and 0.1 [4]. Parallel hardware would have to be implemented for the whole matrix H, but only a small part of it would be active at every memory access. The gain in performance of a possible hardware implementation of the second layer would also depend on the spatial position of the input vector I in relation to all the sparse decoders of matrix S. For an input vector that is outside the intersection of two or more sparse decoders, the gain in performance would be nil. The performance would also depend on the selected value of r, which is a user-defined parameter. So, a parallel hardware implementation of the second layer would give a variable gain in performance. In addition to that, two different hardware implementations would be needed, one for reading and the other for writing.
It can also be observed that the memory requirements for matrix S are much smaller than those for matrix H.


For a SDM configuration with n = 256, U = 256, r = 111 and M = 16,384, for example, the memory requirement for matrix S is equal to 16,384 × 256 × 1 bit, which is equivalent to 4 Mbits. Considering now that each position of matrix H stores an integer number, and considering also that 8-bit numbers are sufficient, the memory requirement for matrix H for the same SDM above is 16,384 × 256 × 8 bits, which corresponds to 32 Mbits. In order to speed up local processing, fast and, consequently, lower-capacity static memory chips are used, which makes it difficult to implement large local memory banks, as would be needed for the second layer. Therefore, the first layer algorithm was selected for hardware implementation.
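Explicitly, the two figures follow from

  |S| = M × n × 1 bit = 16,384 × 256 = 2^22 bits = 4 Mbit,
  |H| = M × U × 8 bits = 16,384 × 256 × 8 = 2^25 bits = 32 Mbit.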
3.4. Hardware implementation of the first layer
The Hamming distance calculation between input vector I and all the sparse decoders of matrix S can be carried out independently and in parallel. In our design, the elements of I, S and of the output vector Y are stored in three separate banks of static memory. The outputs of banks I and S are connected to the inputs of the PEs in order to generate the input data for memory bank Y (Fig. 3). The PEs are therefore responsible for calculating the distance between I and all the elements of S and then for generating vector Y, which indicates the hard locations selected from matrix H. The PEs are implemented within one Xilinx FPGA, which allows flexibility in the modification of some user-defined parameters: n (space dimension), r (radius) and M (number of hard locations).
In the implementation of the PEs, a bit-serial strategy was adopted, as described in the next section.
3.5. Bit-serial processing element
A serial PE (Fig. 4), instead of a parallel approach, was chosen for the implementation of the algorithm. The first reason for this choice was that the pin count for the serial design is much smaller than for the parallel one. This is due to the need, in the parallel approach, for a ROM that has to be implemented externally to the FPGA. In addition to that, many more serial PEs than parallel PEs can be built into an FPGA. However, the design concept of the bit-parallel PE could still be useful for a software implementation. Therefore, the chosen approach requires simpler circuits and results in a design more coherent with current FPGA architectures than the parallel implementation.

Fig. 3. Static RAM connections with the processing elements (PEs).


Fig. 4. Bit-serial PE for the first layer of the SDM.

The bit-serial PE is composed basically of a counter and a simple combinational circuit. Before processing starts, the counter (Difference Counter, DC) is reset and, at every new clock cycle, bit i of the input vector I is compared with bit i of address decoder j of matrix S by an exclusive-or gate. If the two bits I_i and S_ij are different, DC is enabled to count. The combinational circuit at the outputs of DC has the function of detecting whether the accumulated difference is still within r bits. The output y of the PE goes to 0 when the counting value exceeds r and is 1 otherwise. Therefore, at the beginning of processing, DC = 0 and y = 1; once counting starts, y may change to 0 if DC exceeds the pre-defined value of r. In order to speed up processing, counting is interrupted when y changes to 0, through the return path between y and the input AND gate. Since in SDMs the value of r is always less than n/2, DC only needs to count up to n/2.
Since the 32 PEs process the input address in parallel, distance calculation for the current group of 32 decoders is interrupted when d > r for all of them. This allows the distances for the next group of 32 decoders to be calculated. A performance improvement is obtained since not all 256 bits of each decoder need to be scanned for every input address: only the sparse decoders for which d ≤ r have all 256 bits considered.
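A cycle-level C model of this early-termination scheme (our own reconstruction of the circuit's behaviour, assuming one address bit per clock and n ≤ 256) is sketched below; the group of 32 PEs stops scanning as soon as every difference counter has exceeded r:

  #include <stdint.h>

  /* Software model of 32 bit-serial PEs: S_bits[p][i] is bit i of decoder p,
     I_bits[i] is bit i of the input address. Returns the number of clock
     cycles spent, mimicking the early interruption when all y outputs are 0. */
  int pe_group(const uint8_t S_bits[32][256], const uint8_t *I_bits,
               uint8_t *y, int n, int r) {
      int dc[32] = {0};                 /* difference counters, reset to 0 */
      int cycles = 0;
      for (int i = 0; i < n; i++) {
          int any_active = 0;
          for (int p = 0; p < 32; p++) {
              if (dc[p] <= r && (I_bits[i] ^ S_bits[p][i]))
                  dc[p]++;              /* XOR output enables DC */
              if (dc[p] <= r)
                  any_active = 1;       /* this PE's y is still 1 */
          }
          cycles++;
          if (!any_active)
              break;                    /* all 32 PEs already exceed r */
      }
      for (int p = 0; p < 32; p++)
          y[p] = (dc[p] <= r) ? 1 : 0;
      return cycles;
  }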

4. Architecture of the RC-SDM


The RC-SDM is composed basically of two FPGAs, three banks of memory and the interface circuits to the host processor, as shown in the diagram of Fig. 5.
In the start-up phase, after the board is energized, the two FPGAs remain idle with their pins in high impedance. Next, the configuration program running in the host processor loads into them the bitstream files with their internal configurations. The configuration program also loads into the memories the data to be processed. After processing is finished, the host reads back the results of the processed data.
Memory banks 1 and 2 are 32 bits wide, while bank 3 is 1 bit wide for the implementation of the PEs' bit-serial approach.

Fig. 5. Block diagram of the RC-SDM board.


Fig. 6. Block diagram of the board after FPGA configuration.

It can be observed from Fig. 5 that the board structure is reasonably flexible and open. Except for some limitations of the available hardware, the user has reasonable flexibility regarding the implementation of the algorithm. For example, a bit-parallel approach could be used instead of the current serial approach by simply changing the configuration program. Although part of the control circuits for the communication between the board and the host is implemented in external programmable circuits (GALs), FPGA1 is also used for this purpose, which adds flexibility for changing the internal communication protocol. The board structure was designed in such a way that the algorithm implementation is split between the two FPGAs. FPGA1 includes part of the communication circuits for exchanging information with the host, whereas FPGA2 is dedicated to the implementation of the PEs. This also gives flexibility for the implementation of the PEs.
The chips used to implement FPGA1 and FPGA2 were the Xilinx XC4003E and XC4005E, with three and five thousand gates, respectively. The chip utilization of FPGA1 was 43% and of FPGA2 was 92%, and the clock frequency was 20 MHz.
A block diagram of the board after FPGA configuration is presented in Fig. 6. As can be observed, FPGA2 contains 32 PEs, which allows the processing of 32 sparse decoders in parallel. Control of the memories is carried out by FPGA1, which also interprets the following commands from the host: read from and write to the memories, initialize processing, read processing status, start processing, and general initialization of the board.

5. Performance considerations
In order to evaluate the performance of RC-SDM, two programs were implemented, one that uses the board and one that does not. In both programs, the SDM implemented was set with parameters n = 256 and M = 16,384. The machine used to implement the programs and host the board was a PC microcomputer with a Pentium Celeron CPU running at 300 MHz. The programs were written in ANSI C with all SDM functions implemented in assembly. The main program written in C deals mainly with accesses to disk and screen, whereas the assembly routines handle SDM processing in both the hardware and software versions of the program, in order to have an unbiased software implementation. This assures that the hardware is compared with as fast a software implementation as possible. The second program consists of a modification of the first one, in which the assembly routines were replaced by the processing on the board. The processing time of the assembly routine, measured with a logic analyzer, was 49.8 ms, whereas the hardware processing on the RC-SDM took 13.06 ms, which results in a speed-up of about four times in relation to the software implementation.
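The quoted figure follows directly from the two measurements: speed-up = 49.8 ms / 13.06 ms ≈ 3.8, i.e. roughly four times.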

6. Conclusions
More than a proof of concept, the implementation of RC-SDM presented an efficient version of SDMs with improved performance when compared to the equivalent software implementation. The proposed architecture has shown itself to be appropriate for the current application and flexible enough for further improvements and extensions. Since most of the hardware is implemented in the FPGAs, reconfiguration and changes to the current implementation can be easily accomplished in future developments. Although current state-of-the-art FPGAs would have allowed a larger-density circuit for higher dimensional SDMs, this work showed that, even for a small-scale problem, the architecture of RC-SDM is suitable for a wide range of implementations.

Acknowledgements
The authors would like to thank the Xilinx University Program, CNPq and FINEP for their support.

References
[1] A. Krikelis, C.C. Weems, Associative processing and processors, IEEE Computer (Guest Editorial), November 1994.
[2] C. Orovas, Cellular Associative Networks for Pattern Recognition, PhD thesis, University of York, England, 1999.
[3] P. Kanerva, Sparse Distributed Memory, Bradford/MIT Press, Cambridge, USA, 1988.
[4] P. Kanerva, Associative-memory models of the cerebellum, in: I. Aleksander, J. Taylor (Eds.), Artificial Neural Networks, Elsevier Science Publishers B.V., Amsterdam, 1992.
[5] P. Kanerva, Sparse distributed memory and related models, in: M.H. Hassoun (Ed.), Associative Neural Memories: Theory and Implementation, Oxford University Press, New York, 1993, pp. 50–76.
[6] R.W. Hartenstein et al., Custom computing machines vs. hardware/software co-design: from a globalized point of view, in: 6th International Workshop on Field Programmable Logic and Applications, Darmstadt, Germany, September 23–25, 1996.
[7] A.P. Braga, I. Aleksander, Geometrical treatment and statistical modeling of the distribution of patterns in the n-dimensional Boolean space, Pattern Recognition Letters 16 (1994) 507–515.