
www.elsevier.com/locate/micpro

Marcus Tadeu Pinheiro Silva^a, Antonio Padua Braga^b,*, Wilian Soares Lacerda^c

^a Federal Center of Technological Education, Belo Horizonte, MG, Brazil
^b Department of Electronics Engineering, Federal University of Minas Gerais, Caixa Postal 209, CEP 30.161-970, Belo Horizonte, MG, Brazil
^c Department of Computing, Federal University of Lavras, Lavras, MG, Brazil

Abstract

The implementation in hardware of the first layer of Kanerva's sparse distributed memory (SDM) is presented in this work. The hardware consists of a co-processor board for connection to the ISA standard bus of an IBM PC-compatible computer. The board, named reconfigurable co-processor for SDM (RC-SDM), comprises Xilinx FPGAs, local random access memory and bus interface circuits. Based on the in-system reconfiguration capability of the FPGAs, RC-SDM easily allows the characteristics of the implemented SDM topology to be changed. First results show a speed-up of about four times for RC-SDM in relation to a software implementation of the algorithm.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Associative memory; Neural networks; FPGAs

1. Introduction

Physical implementation of associative memories has been on the agenda for more than four decades and has been studied from different perspectives. From the computer architecture point of view, the term content addressable memory (CAM) [1] was coined to refer to those memory implementations that can retrieve information from partial clues of the content. The physical implementation of CAMs was usually accomplished with conventional random access memories (RAM) and additional hardware. From the connectionist point of view [2], associative memories were implemented with neuron-like elements, or processing elements (PE), connected by a network structure. Information was spread throughout the network and stored in the connections between processors.

The sparse distributed memory (SDM) [3] is an associative memory model that can be seen from both perspectives. From the computer architecture point of view, it can be seen as a generalization of a RAM and, from the connectionist perspective, as an artificial neural network (ANN). Associative memory properties are achieved by sparsely distributing the hard storage locations in the input (address) space.

* Corresponding author. Tel.: +55-31-3499-4869; fax: +55-31-3499-4850.

E-mail address: apbraga@cpdee.ufmg.br (A.P. Braga).

0141-9331/$ - see front matter © 2004 Elsevier B.V. All rights reserved.

doi:10.1016/j.micpro.2004.01.003

The associative memory properties appear as the dimension of the input space increases, which is in fact a difficulty for its physical implementation. The computationally intensive calculation of sparse address decoding is one of the main bottlenecks of software and hardware implementations of SDMs, since it requires Hamming distance¹ calculations between the high-dimensional input address and all sparse decoders.

¹ The Hamming distance is defined as the number of bits in which two binary words differ.

This paper describes a hardware implementation of SDMs that is aimed at executing their most computationally intensive calculations in a co-processor board. The circuit, implemented on an IBM PC platform with reconfigurable hardware, has 32 PEs that calculate the Hamming distances between the input address and 32 sparse decoders in parallel. The co-processor board, named here reconfigurable co-processor for SDMs, or simply RC-SDM, is in the category of reconfigurable machines with an extended instruction set [6], since a reconfigurable platform is established by the connection of an FPGA-based co-processor to a host machine. Adaptive computation is carried out by an application program running in the host. This program configures the FPGAs and then transfers the data to be processed to the board; processing in the co-processor board is accomplished in parallel with program execution by the host processor.

The improved performance of the reconfigurable co-processor system is significant when compared to a program running in the host with optimized assembly code. The flexibility of the board allows for adaptive reconfiguration of the current implementation, and further improvements may be achieved by using an extension of the proposed architecture.

2. Description of SDM

The essence of SDMs is to use sparsely distributed decoders in a high dimensional Boolean space, so that any sparse decoder, or hard storage location, is accessed from anywhere in the space that is at a Hamming distance smaller than r bits from its base address. Therefore, each decoder responds to all the vectors inside a hyper-sphere, or circle in SDM terminology, with radius r and centre at the location's base address. Depending on the selected value for r, input vectors may access more than one storage location at the same time, allowing data to be stored and retrieved concurrently to and from several memory storage locations. Using Kanerva's analogy between a hyper-sphere in the n-dimensional Boolean space and a circle in the plane [5], an example of the effect of the chosen radius r on the concurrent access by an input vector ξ to two sparse decoders ζ_m and ζ_n is shown in Fig. 1. The chosen radius r_a is large enough to allow access by ξ to both decoders ζ_m and ζ_n. If r_a were too small, ξ could have been located outside the two hyper-spheres and, consequently, could not have accessed either of them. Therefore, the condition for ξ to access both the arbitrary sparse decoders ζ_m and ζ_n is that the distance from ξ to both of them is less than the chosen radius r. In other words, ξ must be inside the intersection of the hyper-spheres with centres in ζ_m and ζ_n [5].

The size of the radius r in SDMs must be such that the union of all the sets of elements inside the hyper-spheres S_i(ζ_i, r_i) includes all the 2^n elements of the space {0, 1}^n, so that any input vector accesses at least one sparse decoder.

Fig. 1. An input vector ξ that accesses both decoders ζ_m and ζ_n.

The concurrent access of several storage locations by a single input vector is responsible for the SDM's associative memory properties. Therefore, the intersection between two hyper-spheres S_i(ζ_i, r_i) and S_j(ζ_j, r_j) separated by the Hamming distance h is a basic issue in the analysis of SDMs, since it indicates the number of vectors in the space that simultaneously access both storage locations ζ_i and ζ_j. Of course, in the limit cases when r = 0 and r = n, the number of patterns that access a given hard location is, respectively, 1 and 2^n. For r in the range between 0 and n, it is possible to estimate the percentage of the Boolean space that is in the intersection of two hyper-spheres [7] and predict associative memory performance.

The terms hard storage location and sparse decoder used in SDM terminology resemble those used to describe conventional computer RAMs. In fact, a SDM can be seen as a general case of a RAM, since, in the limit case when r is set to zero and there are 2^n sparse decoders, a SDM actually becomes a conventional RAM. The main differences between a RAM and a SDM are:

- the address decoder of a RAM is selected only if its base address is presented at the inputs (r = 0), whereas a sparse decoder of a SDM is selected from anywhere within r bits Hamming distance from the base address (r > 0);
- a conventional RAM has 2^n decoder outputs and storage locations, where n is the size of the input address vector, whereas a SDM has only a portion M of the maximum 2^n sparse decoders and hard storage locations;
- a hard storage location of a SDM, instead of storing only one data vector, like the storage locations of a RAM, holds a statistical measure of all the data vectors that accessed that location during the storage process.

SDM is, therefore, an associative memory designed to store information coded in long strings of bits, corresponding to binary vectors with hundreds or thousands of coordinates. The basic principles behind Kanerva's SDM are based on the properties of the high dimensional Boolean space [3]. SDMs can be analyzed from two different perspectives: as a generalization of a conventional computer RAM with associative memory properties, and also as a two-layer feed-forward neural network [4].

There is no general expression for determining r as a function of the Boolean space dimension n. As in Kanerva's original work [3], n should be large and r should be approximately n/2. In most examples in his original work n = 1000 and r is a bit less than n/2; the value used in most examples was r = 451. The value of r should not be larger than n/2, since that would include most of the input space (more than 50%) and would cause too much overlap among sparse decoders. If r is too small, only a small percentage of the input space is covered by the hyper-spheres. This is the reason why r is normally chosen a bit smaller than n/2.
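The effect of the choice of r can be checked numerically. The short C sketch below is not part of the original paper; it estimates the fraction of the Boolean space covered by a hyper-sphere of Hamming radius r around a point, using the normal approximation to the binomial distribution of distances (mean n/2, standard deviation sqrt(n)/2). The function name coverage_fraction is illustrative. For n = 1000 and r = 451 the covered fraction is roughly 0.001, whereas r = n/2 would cover about half of the space, which is why r is chosen a bit below n/2.

    /* Illustrative sketch (not from the paper): estimate the fraction of the
       n-dimensional Boolean space within Hamming radius r of a fixed point.
       Distances to a random point follow Binomial(n, 1/2), approximated here
       by a normal distribution with mean n/2 and std. dev. sqrt(n)/2. */
    #include <math.h>
    #include <stdio.h>

    static double coverage_fraction(int n, int r)
    {
        double mean  = n / 2.0;
        double sigma = sqrt((double)n) / 2.0;
        double z     = (r + 0.5 - mean) / sigma;  /* continuity correction */
        return 0.5 * erfc(-z / sqrt(2.0));        /* standard normal CDF   */
    }

    int main(void)
    {
        printf("n=1000, r=451: %.4f\n", coverage_fraction(1000, 451)); /* ~0.001 */
        printf("n=1000, r=500: %.4f\n", coverage_fraction(1000, 500)); /* ~0.5   */
        return 0;
    }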


In Fig. 2, both the neural network and the RAM views of a SDM can be observed. Anyone familiar with digital systems would easily associate input vector I with a RAM input address, and input data vector W with memory input data. The output data, for reading information from memory, would be associated with the output vector O. The address decoder and memory array of a conventional RAM would be directly associated with the sparse decoders S and the memory cells H (hard storage locations). Similarly, the neural network view would characterize the structure of Fig. 2 as a two-layer neural network, with information flowing from input vector I to output vector O, having the threshold elements as the outputs of each layer.

2.1. Storage of information

The first array of a SDM, matrix S, corresponds to a decoder-like structure. Each output of the decoder, after threshold, is capable of activating only one location of memory (the corresponding row of array H), like a conventional computer RAM decoder. Nevertheless, an input address may activate more than one decoder output simultaneously, which causes data to be stored to or read from several memory locations at the same time.

The dimension n of I must be high, so that the properties of the higher dimensional Boolean space become valid. Working with very high dimensional spaces is one of the strengths of SDMs, but it is also a difficulty for their implementation. For n = 1000, for example, the number of possible input addresses is 2^1000, which makes it inconceivable to think of a memory location for each one of the 2^1000 input addresses. Instead, M decoders and memory locations are used in SDMs, where M is a very small fraction of 2^1000 (M ≪ 2^1000). The M sparse decoders with radius r, spread over the input space, are expected to include all the possible 2^1000 input addresses. A value for r is selected, based on the properties of the higher dimensional Boolean space, so that each one of the possible input addresses accesses at least one memory location.

Each row of matrix H corresponds to a word of the associative memory, similarly to a conventional RAM, but, instead of a register for each storage location, the SDM has a binary counter for each position. Each counter is incremented or decremented depending on whether a 1 or a 0 is stored: the storage of a 1 causes an increment and that of a 0 causes a decrement. Input data vector W accesses all the hard locations of matrix H in parallel, allowing data to be stored in more than one location. Since several memory words can be selected simultaneously by the same input address, several counters on different words are activated by a single input address. This storage process causes the information to be distributed over several memory locations. A single counter is normally activated several times when many words are stored.


Similar to the storage process, information retrieval starts by presenting an input vector I that is expected to select more than one hard location of matrix H. An adder circuit sums up the outputs of all the counters accessed during reading in order to generate the output vector O. The result of the sum can be zero, positive or negative. A negative value occurs when there were more 0s (counter decremented) than 1s (counter incremented) written during storage, and a positive value occurs when there were more 1s than 0s. A zero output indicates either that the numbers of 0s and 1s stored were the same or that the selected locations were not accessed before. A threshold function is applied to the output of each adder, so that the output value is 1 if the sum is positive, or zero if the sum is negative. Since storage occurs in a distributed form, the read output value of the SDM is a statistical measure of the stored data, obtained by a majority rule in order to effectively generate the output vector. When information is written in SDMs, storage occurs in distributed form in the locations activated by the corresponding address vector ξ. Since the information is distributed, these locations can also be accessed when other input vectors are stored. As more and more words are added to the memory, more and more coincidences occur among stored data, until the maximum storage capacity of the SDM is reached, which is a fraction of M [3].

Vector d in Fig. 2 contains the results of the computed distances between the input vector I and all the M sparse decoders' base addresses. Between vectors d and y, threshold elements activate their outputs to 1 if the corresponding distances in vector d are within the range 0 to r. Therefore, the active elements of vector y are only those that correspond to the hard locations accessed by the input vector I.

SDM processing for storage and retrieval of information is presented in the next two sections in algorithmic form. This helps in selecting the approach for the hardware implementation of the co-processor board.

3.1. Algorithm for the first layer

Processing in the first layer is the same for both storage and retrieval of information. The procedure presented in Algorithm 1 involves first the calculation of the Hamming distances h between input vector I and all the M sparse decoders of matrix S. A bitwise exclusive-or operation between input vector I and every row of matrix S is carried out by the function ExclusiveOr(I, S(row)). The resulting vector DifVector contains a 1 at every position where the two vectors I and S(row) differ. Counting the number of 1s in DifVector results in the Hamming distance h: CountOnes(DifVector). For every distance calculated, for each row, the corresponding output of vector y is set to 1 if h ≤ r, where r is the radius of the sparse decoders. I is a 1 by n vector, S is an M by n matrix and y is an M by 1 vector, where n and M are user-defined parameters.

Algorithm 1. Algorithm of the first layer

for row := 0 to M − 1 do
  DifVector := ExclusiveOr(I, S(row));
  h := CountOnes(DifVector);
  if h ≤ r then
    y(row) := 1;
  else
    y(row) := 0
  end if
end for

SDM processing is different for reading and writing in the second layer, so two different procedures are presented in Algorithms 2 and 3. The storage phase basically involves adding every element of the input data vector W to the corresponding counter of the selected rows, or hard locations, of matrix H. In order to accomplish the counting operation correctly, vector W must be represented with 1 and −1 elements. The retrieval phase basically involves summing every element of the selected rows of matrix H to generate vector v, to which a threshold is applied to obtain the output vector O. In Algorithms 2 and 3, H is a matrix with M rows and U columns, and W, v and O are vectors with 1 row and U columns.

Algorithm 2. Algorithm for the storage phase of the second layer

for row := 0 to M − 1 do
  if y(row) = 1 then
    for col := 0 to U − 1 do
      H(row, col) := H(row, col) + W(col)
    end for
  end if
end for

Algorithm 3. Algorithm for the reading phase of the second layer

v := 0
for row := 0 to M − 1 do
  if y(row) = 1 then
    for col := 0 to U − 1 do
      v(col) := H(row, col) + v(col)
    end for
  end if
end for
for col := 0 to U − 1 do
  if v(col) ≥ 0 then
    O(col) := 1
  else
    O(col) := 0
  end if
end for
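For illustration, the two second-layer procedures can be written compactly in C. The sketch below is not part of the original design; it simply mirrors Algorithms 2 and 3 for a software SDM, assuming that vector y has already been produced by the first layer, that the example dimensions M = 16,384 and U = 256 from the paper are used, and that a sum of zero thresholds to 1, as in Algorithm 3.

    #include <string.h>

    #define M 16384   /* number of hard locations (example value) */
    #define U 256     /* number of counters per hard location     */

    /* Storage phase (Algorithm 2): add W (elements +1/-1) to the counters
       of every hard location selected by the first layer (y[row] == 1). */
    void sdm_store(signed char H[M][U], const unsigned char y[M],
                   const signed char W[U])
    {
        for (int row = 0; row < M; row++)
            if (y[row])
                for (int col = 0; col < U; col++)
                    H[row][col] += W[col];
    }

    /* Reading phase (Algorithm 3): sum the counters of the selected
       locations and apply the majority-rule threshold to produce O. */
    void sdm_read(signed char H[M][U], const unsigned char y[M],
                  unsigned char O[U])
    {
        long v[U];
        memset(v, 0, sizeof v);
        for (int row = 0; row < M; row++)
            if (y[row])
                for (int col = 0; col < U; col++)
                    v[col] += H[row][col];
        for (int col = 0; col < U; col++)
            O[col] = (v[col] >= 0) ? 1 : 0;
    }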

The calculation of Hamming distances in Algorithm 1 assumes that the two functions ExclusiveOr(I, S(row)) and CountOnes(DifVector) are available, but these functions are not normally in the instruction repertoire of commercial microprocessors. Even for a software implementation of the two functions, we would have to consider that microprocessors work with limited-length operands, in the range of 64 bits nowadays, while addresses in SDMs are binary vectors with hundreds of bits. For addresses that exceed the operand length limit, the operation has to be executed in several steps. In addition to that, the Hamming distance must be calculated between I and all the M sparse decoders of S, which results in an algorithm complexity of O(nM). Since these operations are computationally intensive for large values of n and M, which is normally the case for SDMs, they look attractive for parallel hardware implementation. Bitwise hardware exclusive-or operations could be implemented in parallel between I and all the M sparse decoders. Counting the number of 1s in the resulting exclusive-or outputs would normally be a serial operation.
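As an illustration of the software route (not part of the paper), a 256-bit address can be processed as four 64-bit words: each word of I is XORed with the corresponding word of a sparse decoder base address, and the resulting 1s are counted word by word. The population-count helper below uses the GCC/Clang builtin __builtin_popcountll; other compilers would need a portable replacement.

    #include <stdint.h>

    #define N_BITS  256
    #define N_WORDS (N_BITS / 64)

    /* Hamming distance between a 256-bit input address and one sparse
       decoder base address, computed 64 bits at a time. */
    static int hamming256(const uint64_t addr[N_WORDS],
                          const uint64_t base[N_WORDS])
    {
        int h = 0;
        for (int w = 0; w < N_WORDS; w++)
            h += __builtin_popcountll(addr[w] ^ base[w]);  /* count differing bits */
        return h;
    }

    /* First layer in software: select every decoder within radius r. */
    static void first_layer(const uint64_t addr[N_WORDS],
                            const uint64_t S[][N_WORDS], int M, int r,
                            unsigned char y[])
    {
        for (int row = 0; row < M; row++)
            y[row] = (hamming256(addr, S[row]) <= r) ? 1 : 0;
    }

Even in this form, the host must still perform M × (n/64) word operations per input address, which is the O(nM) cost that motivates the parallel hardware.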

There is also an inherent parallelism in the operations of the second layer, since input data vector W is applied to all the selected hard locations. Nevertheless, the number of selected rows in matrix H would normally be a small portion of the total number of M hard locations, a fraction typically between 0.1 and 0.01 [4]. Parallel hardware would have to be implemented for the whole matrix H, but only a small part of it would be active at every memory access. The gain in performance of a possible hardware implementation of the second layer would also depend on the spatial position of the input vector I in relation to all the sparse decoders of matrix S. For an input vector that is outside the intersection of two or more sparse decoders, the gain in performance would be null. The performance would also depend on the selected value of r, which is a user-defined parameter. So, a parallel hardware implementation of the second layer would give a variable gain in performance. In addition to that, two different hardware implementations would be needed, one for reading and the other for writing.

It can also be observed that the memory requirements for matrix S are much smaller than those for matrix H.


For n = 256 and M = 16,384, for example, the memory requirement for matrix S is 16,384 × 256 × 1 bit, which is equivalent to 4 Mbits. Considering now that each position of matrix H stores an integer number, and considering also that 8-bit numbers are sufficient, the memory requirement for matrix H for the same SDM is 16,384 × 256 × 8 bits, which corresponds to 32 Mbits. In order to speed up local processing, fast and, consequently, lower-capacity static memory chips are used, which makes it difficult to implement large local memory banks, as would be needed for the second layer. Therefore, the first layer algorithm was selected for hardware implementation.

3.4. Hardware implementation of the first layer

Hamming distance calculation between input vector I and all the sparse decoders of matrix S can be carried out independently and in parallel. In our design, the elements of I, S and of the output vector Y are stored in three separate banks of static memory. The outputs of banks I and S are connected to the inputs of the PEs in order to generate the input data for memory bank Y (Fig. 3). The PEs are therefore responsible for calculating the distances between I and all the elements of S and then for generating vector Y, which indicates the hard locations selected from matrix H. The PEs are implemented within one Xilinx FPGA, which allows flexibility in the modification of some user-defined parameters: n (space dimension), r (radius) and M (number of hard locations).

In the implementation of the PEs, a bit-serial strategy was adopted, as will be described in the next section.

3.5. Bit-serial processing element

The serial PE (Fig. 4), instead of a parallel approach, was chosen for the implementation of the algorithm. The first reason for this choice was that the pin count for the serial design is much smaller than for the parallel one. This is due to the need, in the parallel approach, for a ROM that would have to be implemented externally to the FPGA. In addition to that, many more serial PEs than parallel PEs can be built into an FPGA. However, the design concept of the bit-parallel PE could still be useful for a software implementation. Therefore, the chosen approach requires simpler circuits and results in a design more coherent with current FPGA architectures than the parallel implementation.


Each PE consists basically of an exclusive-or gate, a difference counter and a simple combinational circuit. Before processing starts, the counter (Difference Counter, DC) is reset and, for every new clock cycle, bit i of the input vector I is compared with bit i of the address decoder j of matrix S by an exclusive-or gate. If the two bits I_i and S_ij are different, DC is enabled to count. The combinational circuit at the outputs of DC has the function of detecting whether I is within r bits of S_j. The output y of the PE goes to 0 when the counting value is larger than r and stays at 1 otherwise. Therefore, at the beginning of processing, DC = 0 and y = 1; once counting starts, y may change to 0 if DC exceeds the pre-defined value of r. In order to speed up processing, counting is interrupted when y changes to 0, through the return path between y and the input AND gate. Since in SDMs the value of r is always less than n/2, DC only needs to count up to n/2.
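The behaviour of one bit-serial PE can be modelled in software as follows. This is an illustrative sketch, not the FPGA circuit itself: the variable dc plays the role of the Difference Counter and the early return mirrors the AND-gate feedback path that freezes counting once the radius is exceeded.

    /* Software model of one bit-serial PE: stream the n bits of the input
       address and of one sparse decoder base address, count mismatches in
       dc, and stop as soon as the radius r is exceeded (output y = 0). */
    static int pe_bit_serial(const unsigned char I_bits[],  /* n input bits   */
                             const unsigned char S_bits[],  /* n decoder bits */
                             int n, int r)
    {
        int dc = 0;                      /* Difference Counter (DC)           */
        for (int i = 0; i < n; i++) {
            if (I_bits[i] ^ S_bits[i])   /* exclusive-or of the serial bits   */
                dc++;
            if (dc > r)                  /* radius exceeded: y goes to 0 and  */
                return 0;                /*   counting is interrupted         */
        }
        return 1;                        /* within radius: y stays at 1       */
    }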

Since the 32 PEs process the bits of the input address in parallel with one another, distance calculation for the current group of 32 sparse decoders is interrupted when d > r for all of them, which allows the distances to the next 32 sparse decoders to be calculated. Performance is improved because the sparse decoders do not all need to be scanned in full for every input address: only sparse decoders for which d ≤ r have all the 256 bits considered.

The RC-SDM is composed basically of two FPGAs, three banks of memory and the interface circuits with the host processor, as shown in the diagram of Fig. 5.

In the start-up phase, after the board is powered up, the two FPGAs remain idle with their pins in high impedance. Next, the configuration program running in the host processor loads into them the bitstream files with their internal configurations. The configuration program also loads into the memories the data to be processed. After processing is finished, the host reads back the results of the processed data.

Memory banks 1 and 2 are 32 bits wide, while bank 3 is 1 bit wide for the implementation of the PEs' bit-serial approach.


The structure of RC-SDM is reasonably flexible and open. Except for some limitations of the available hardware, the user has reasonable flexibility regarding the implementation of the algorithm. For example, a bit-parallel approach could be used instead of the current serial approach by simply changing the configuration program. Although part of the control circuits for the communication between the board and the host is implemented in external programmable circuits (GALs), FPGA1 is also used for this purpose, which adds flexibility for changing the internal communication protocol. The board structure was designed in such a way that the algorithm implementation is split between the two FPGAs. FPGA1 includes part of the communication circuits for exchanging information with the host, whereas FPGA2 is dedicated to the implementation of the PEs. This also gives flexibility for the implementation of the PEs.

The chips used to implement FPGA1 and FPGA2 were the Xilinx XC4003E and XC4005E, with three and five thousand gates, respectively. The chip utilization of FPGA1 was 43%, that of FPGA2 was 92%, and the clock frequency was 20 MHz.

A block diagram of the board after FPGA configuration is presented in Fig. 6. As can be observed, FPGA2 contains 32 PEs, which leads to the processing of 32 sparse decoders in parallel. Control of the memories is carried out by FPGA1, which also interprets the following commands from the host: reading from and writing to the memories, initialize processing, read processing status, start processing, and general initialization of the board.

5. Performance considerations

In order to evaluate the performance of RC-SDM, two programs, one that uses the board and one that does not, were implemented. In both programs, the SDM was set with parameters n = 256 and M = 16,384. The machine used to run the programs and host the board was a PC microcomputer with a Pentium Celeron CPU running at 300 MHz. The programs were written in ANSI C with all SDM functions implemented in assembly. The main program written in C deals mainly with accesses to disk and screen, whereas the assembly routines handle SDM processing in both the hardware and software versions of the program, in order to have an unbiased software implementation. This ensures that the hardware is compared with an as-fast-as-possible software implementation. The second program is a modification of the first one, in which the assembly routines were replaced by the processing on the board. The processing time of the assembly routine, measured with a logic analyzer, was 49.8 ms, whereas the hardware processing on RC-SDM took 13.06 ms, which results in a speed-up of about four times (49.8/13.06 ≈ 3.8) in relation to the software implementation.

6. Conclusions

More than a proof-of-concept, the implementation of RC-SDM presented an efficient version of SDMs, with a speed-up of about four times in relation to a software implementation. The proposed architecture has been shown to be appropriate for the current application and flexible enough for further improvements and extensions. Since most of the hardware is implemented in the FPGAs, reconfiguration and changes to the current implementation can easily be accomplished in future developments. Although current state-of-the-art FPGAs would have allowed a higher density circuit for higher dimensional SDMs, this work showed that, even for a small-scale problem, the architecture of RC-SDM is suitable for a wide range of implementations.

Acknowledgements

The authors would like to thank the Xilinx University Program, CNPq and FINEP for their support.

References

[1] A. Krikelis, C.C. Weems, Associative processing and processors, IEEE Computer (Guest Editorial), November 1994.
[2] C. Orovas, Cellular Associative Networks for Pattern Recognition, PhD thesis, University of York, England, 1999.
[3] P. Kanerva, Sparse Distributed Memory, Bradford/MIT Press, Cambridge, USA, 1988.
[4] P. Kanerva, Associative-memory models of the cerebellum, in: I. Aleksander, J. Taylor (Eds.), Artificial Neural Networks, Elsevier Science Publishers B.V., Amsterdam, 1992.
[5] P. Kanerva, Sparse distributed memory and related models, in: M.H. Hassoun (Ed.), Associative Neural Memories: Theory and Implementation, Oxford University Press, New York, 1993, pp. 50-76.
[6] R.W. Hartenstein et al., Custom computing machines vs. hardware/software co-design: from a globalized point of view, in: 6th International Workshop on Field Programmable Logic and Applications, Darmstadt, Germany, September 23-25, 1996.
[7] A.P. Braga, I. Aleksander, Geometrical treatment and statistical modeling of the distribution of patterns in the n-dimensional Boolean space, Pattern Recognition Letters 16 (1994) 507-515.