
High Speed Parallel Hardware Performance Issues for Neural Network Applications

Robert W. Means
HNC, Inc.
5930 Cornerstone Court
San Diego, CA 92121

Abstract
Neural network applications push the envelope of high speed computers in several areas. Pattern recognition applications often involve large amounts of training data and large neural networks. Significant preprocessing of the data is often necessary, and real time operation is the ultimate goal. Each one of these steps stresses different components of the computer system architecture. Since neural network architecture is inherently parallel, the computations required for learning and classification should map efficiently onto fast, highly parallel computers. Two of the most used neural networks, Multilayer Backpropagation Networks and Competitive Learning (Kohonen Layer) Neural Networks, are examined and analyzed for parallel implementation. The common mathematical thread observed in the training and use of these networks is that the algorithms use standard linear algebra functions that operate on vectors and matrices. This common thread is observed also in Radial Basis Function Networks and Probabilistic Neural Networks. The implications of this observation for parallel computer architecture are presented.

1. Introduction

Since the late 80s, general purpose digital computers from PCs to supercomputers have been used to implement and train many different neural networks. Special purpose chips, both analog and digital, have been built to implement neural networks, but they usually implement only one or two algorithms, and tradeoffs have been made in the precision of the data representation and calculations. Additional tradeoffs have been made in the types and methods of learning and in the size and structure of the networks implemented by special purpose neural network chips. These chips need a high volume or high value application to compete with the general purpose programmable digital computer. For high speed neural network applications, a general purpose, programmable, parallel processor is often an excellent choice. In designing a parallel processor, the list of design issues is large. This paper addresses only the major issues as they affect neural network applications. Perhaps the biggest issue that affects the hardware design is the software model of the parallel hardware and the consequent ease of programming. The hardware issues, such as the number of bits of precision in the data and the calculations, the I/O rate, the memory size and hierarchy, the chip package, power dissipation, clock speed, etc., are all important, but are secondary to programmability and ease of use. The common mathematical thread observed in the training and use of networks in applications is that the algorithms use standard linear algebra functions that operate on vectors and matrices. This common thread is observed not only in the neural network portion of the application, but also in the preprocessing that makes up a considerable portion of the complete application. The implications of this observation for parallel computer architecture are presented below.

Efficient programming of a parallel processor is not an easy task. The taxonomy of parallel processing includes two major models, Single Instruction Multiple Data (SIMD) and Multiple Instruction Multiple Data (MIMD). There are other models and qualifiers, like shared data and distributed data, that add to the classes of parallel computers. Compilers of high level languages, such as C or Fortran, have difficulty spreading the computational load over multiple processors and managing the data. It is usually left to the programmer, with the aid of architecture specific constructs, to find the parallelism, spread the data, and assign the tasks. However, in the case of neural network applications, parallel programming can be made a relatively easy task by using function libraries. Section 2 examines the neural network algorithms and addresses the programmability and ease-of-use issues. Section 3 covers the most important hardware issues, illustrating some of them by means of HNC's SIMD Numerical Array Processor (SNAP).



2. Parallel Implementation of Neural Networks

Parallel processors are usually in the supercomputer class and, for the most part, have been notoriously difficult to program efficiently. Compilers that efficiently partition the data and the tasks are difficult to write. Special constructs and extensions have been developed for the programmer to explicitly distribute data and partition tasks. This places the burden on the programmer to understand the machine architecture, the compiler, and the hardware in order to balance interprocessor data communication time with execution time. To illustrate these difficulties, let us examine a portion of an algorithm that occurs frequently in neural networks. There are N neurons in an output layer, each having a set of M weights and each connected to the M neurons in the previous layer. Each output neuron computes the Euclidean distance between the input vector (of length M) and its local weight vector (also of length M). The equation for this process is given by:

d_i = \sqrt{ \sum_{j=1}^{M} ( w_{ij} - x_j )^2 }    (1)

where i is the i'th neuron in the output layer. In a high level programming language, this is implemented with two loops. In C (forgoing pointers for clarity), equation (1) can be expressed as:

    for (i = 0; i < N; i++) {
        d[i] = 0.0;
        for (j = 0; j < M; j++) {
            t = w[i][j] - x[j];
            d[i] += t * t;
        }
        d[i] = sqrt(d[i]);
    }
For an efficient parallel program, the questions that must be asked and answered are:

- How should the matrix of weights w[i][j] be distributed?
- How should the vector x[j] be distributed?
- How should the vector d[i] be distributed?
- How should the calculations be partitioned among K processors, particularly if M, N and K do not have common factors?

These questions and programming problems disappear if the programmer is able to treat the data in an object oriented fashion as matrices, vectors, and scalars, and these objects are managed by a vendor specific memory management library. In that case, the programmer is only required to express equation (1) in a matrix and vector algebra context. Equation (1) is analyzed as:

- The index, i, of w[i][j] represents the rows of a matrix, W.
- The index, j, of w[i][j] represents the columns of a matrix, W.
- The index, j, of x[j] represents the elements of a vector, X.
- The index, i, of d[i] represents the elements of a vector, D.

The operations required are then expressed as:

- Subtract the vector, X, from every column of the matrix, W, and store the result in a temporary matrix, T.
- Square each element of the matrix, T, and store the result in T.
- Sum the columns of the matrix, T, and store the result in the vector, D.
- Take the square root of each element of the vector, D, and store the result in D.

These basic linear algebra operations can be provided by the machine vendor in a set of optimized, efficient library functions that treat the vectors and matrices as objects. The memory manager distributes the data of the objects in a well defined fashion. For example, HNC's SNAP memory manager always distributes the elements of a column vector of length M among the K processors as one distributes cards to K players in a poker game: element 0 of the vector goes to processor 0, element 1 goes to processor 1, and so on; element K-1 goes to processor K-1, element K goes to processor 0, element K+1 goes to processor 1, and so forth. The N rows of a matrix of size NxM are distributed similarly. That is, processor 0 gets all M elements of row 0, processor 1 gets all M elements of row 1, etc.
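As an illustration of this card-dealing rule (our own sketch, not SNAP code), the owner of each data element reduces to a simple modulo mapping; the function names below are hypothetical.

    /* Card-dealing distribution: element i of a vector, or row i of a matrix,
       is dealt to processor i mod K. Illustrative only. */
    int vector_element_owner(int i, int K) { return i % K; }  /* owner of vector element i */
    int matrix_row_owner(int i, int K)     { return i % K; }  /* owner of row i (all M elements) */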


With a known data distribution from a well defined memory manager that deals with the data as objects, the library functions can be optimized by the vendor. Thus the end user's code and programs become very efficient. Note that the use of C++ or another object oriented language is not required. What is required is that the memory manager provide user-callable functions such as Create_Scalar(), Create_Vector(), and Create_Matrix() that assign an ID to these items. Subsequently, all references to the items by parallel library functions use the ID. A function that adds two items and stores the result in a third item will look like Add(X,Y,Z), where X, Y, and Z are the item IDs. The function, Add(), is written by the machine vendor and optimizes the Add() operation for all the different combinations of data type. For example, if X is a scalar, Y is an NxN matrix and Z is an NxN matrix, then the function adds the scalar's value to every element of the matrix, Y, and stores the result in the matrix, Z. The user does not need to distribute the data or partition the tasks explicitly.

The following sections examine and analyze two of the most used neural networks, Multilayer Backpropagation Networks and Competitive Learning Neural Networks. The common mathematical thread observed in the training and use of these networks is that the algorithms use standard linear algebra functions that operate on vectors and matrices. Thus we illustrate an effective way to program parallel machines for neural network applications by using an object oriented memory manager and an efficient function library. A user may ask, "What if the function that I wish to use is not one of the functions in the vendor supplied library?" Section 2.3 addresses this question.
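To make the programming model concrete, equation (1) could be written against such a library as sketched below. This is only an illustration: the type item_id and the function names (Create_Matrix(), Subtract(), Square(), Sum_Columns(), Sqrt()) are assumptions standing in for a vendor supplied library of this kind, not the actual SNAP interface.

    /* A sketch of equation (1) written against an object oriented memory manager
       and function library of the kind described above. The type item_id and every
       function declared here are illustrative assumptions, not the SNAP library. */

    typedef int item_id;   /* opaque ID assigned to a scalar, vector, or matrix */

    item_id Create_Matrix(int rows, int cols);
    item_id Create_Vector(int length);
    void Subtract(item_id W, item_id X, item_id T); /* subtract the vector X from every column of W */
    void Square(item_id T_in, item_id T_out);       /* square each element */
    void Sum_Columns(item_id T, item_id D);         /* sum the columns into a vector */
    void Sqrt(item_id D_in, item_id D_out);         /* square root of each element */

    /* Compute the vector of Euclidean distances, D, for an NxM weight matrix W
       and an input vector X of length M. */
    void Euclidean_Distance(item_id W, item_id X, item_id D, int N, int M)
    {
        item_id T = Create_Matrix(N, M);  /* temporary matrix */
        Subtract(W, X, T);
        Square(T, T);
        Sum_Columns(T, D);
        Sqrt(D, D);
    }

The point of the sketch is that the user's code contains no loops over processors and no explicit data distribution; those are hidden inside the library and the memory manager.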

2.1 Multilayer Backpropagation Network


The multilayer backpropagation network (MBPN) is currently the most widely used type of neural network. These networks implement a feedforward mapping that is determined by the network's weights. The term "backpropagation" actually describes the learning law, but is also used to describe the particular form of feedforward mapping. A backpropagation neural network learns by comparing the actual outputs produced using its current weights with the desired outputs for the mapping it is supposed to implement. It uses the differences, or errors, to adjust its weights and reduce the average error. Note that this is a supervised learning procedure. The MBPN algorithm is well known. Each neuron in the hidden and output layers computes a weighted sum of its inputs with the equations:
I_i = \sum_{j=1}^{M} w_{ij} x_j + w_i^b    (2)

z_i = f(I_i)    (3)
where x_j represents the elements of a vector, X, that are the outputs from the M neurons of the previous layer, z_i represents the elements of a vector, Z, that are the outputs of the N neurons of this layer, w_i^b represents the elements of a bias vector, B, and w_ij represents the elements of an NxM weight matrix, W. Each row of the weight matrix contains the M weights of the i'th neuron in this layer. The function f() is the neuron's activation function, commonly a logistic sigmoid, that operates on each element of the vector T. These equations can be translated into their linear algebra function operations as:

- Multiply the matrix, W, times the vector, X, and store the result in a temporary vector, T.
- Add the bias vector, B, to the vector, T, and store the result in the vector, T.
- Perform the function, f(), on each element of the vector, T, and store the result in the vector, Z.

This algorithm is commonly known as feedforward operation. If learning is desired, then the difference between the output vector produced by the feedforward operation and a target vector is used to adjust the weight matrices. The form of the learning rule depends on whether the neuron is a member of the output layer or one of the hidden layers. Also, there are a number of variations of the standard learning rules that we will not consider. The basic equations for the weights connecting the output layer to its previous layer are:
\delta_i = f'(I_i) ( t_i - z_i )    (4)

\Delta w_{ij} = \alpha \, \delta_i \, x_j    (5)

w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}    (6)

where Δw_ij are the changes in the weights, z_i are the outputs of the neurons in the last layer, α is the learning rate for the layer, t_i are the target values for the last layer, f'(I_i) is the derivative of the neuron's activation function and I_i are the inputs to the activation function for the output layer's neurons. The library functions required for these equations are seen as:

- Equation (4) requires the subtraction of two vectors.
- Equation (5) requires the outer product of two vectors.
- Equation (6) requires the addition of two matrices.

The learning equations for the hidden layers are more complex than for the output layer. The error term δ_i is a function of all the errors in the subsequent layer. It is given by:
\delta_i = f'(I_i) \sum_{k=1}^{M} \delta_k \, w_{ki}    (7)

where f'(I_i) is the derivative of the neuron's activation function, I_i are the inputs to the activation function for the hidden layer's neurons, w_ki are the weights in the subsequent layer and δ_k are the errors in the subsequent layer. The equations for weight update, (5) and (6), remain the same. The sum in equation (7) can be expressed in linear algebraic terms as the product of a vector times a matrix. This is perhaps the most difficult function for the parallel processor system to implement efficiently in the MBPN algorithm. In the feedforward portion of the algorithm, a matrix-vector multiply must be implemented. Then, in the feedback portion of the algorithm, a vector-matrix multiply must be implemented. A matrix transpose or its equivalent must be done for both of these operations to be efficient. Data that are local for one case, and are therefore easy for a processor to access, will be non-local for the other case. Thus:

- Equation (7) requires a vector-matrix multiply function.

Once the algorithm is put into terms of matrix and vector algebra, it is easily implemented by a programmer using a high level function library.
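The locality problem can be seen directly in scalar code; the loops below are our own single-processor illustration, not SNAP library code. With the rows of W dealt out to processors as described in section 2, the feedforward loop reads W along rows (local data), while the backpropagation loop reads W along columns (data spread across all the processors), which is why a transpose or its equivalent is needed.

    /* Illustrative single-processor sketch of the two access patterns.
       Feedforward:      I = W x       (reads W row by row).
       Backpropagation:  s_j = sum_i delta_i * w_ij  (reads W column by column). */
    void feedforward_sums(int N, int M, const float *W /* NxM, row-major */,
                          const float *x, float *I)
    {
        for (int i = 0; i < N; i++) {
            float sum = 0.0f;
            for (int j = 0; j < M; j++)
                sum += W[i * M + j] * x[j];      /* row i of W: contiguous, local */
            I[i] = sum;
        }
    }

    void backpropagate_errors(int N, int M, const float *W /* NxM, row-major */,
                              const float *delta, float *weighted_error)
    {
        for (int j = 0; j < M; j++) {
            float sum = 0.0f;
            for (int i = 0; i < N; i++)
                sum += delta[i] * W[i * M + j];  /* column j of W: strided, non-local */
            weighted_error[j] = sum;             /* multiply by f'(I_j) to obtain delta_j */
        }
    }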
2.2 Competitive Learning Neural Network

The Competitive Learning Network (CLN) is typically trained to obtain a set of cluster centers, which are prototypes representing typical inputs. Note that this is an unsupervised learning procedure. CLN does not produce any output in the usual sense. If CLN is used after training, it is generally to identify the cluster to which the input belongs, i.e., the prototype most similar to the input. CLN has one main layer that holds the prototypes as the weights of the neurons.

The Competitive Learning Network has a layer of neurons that was described earlier by equation (1). This layer is commonly referred to as a Kohonen layer. It has N neurons, each having a set of M weights and each connected to the M neurons in the previous layer. Each neuron in the Kohonen layer computes the Euclidean distance between the input vector (of length M) and its local weight vector. The equation for this was given by equation (1) as:

d_i = \sqrt{ \sum_{j=1}^{M} ( w_{ij} - x_j )^2 }    (1)

This equation is easily translated into high level linear algebra function calls as was described earlier in section 2. The remaining portion of the learning algorithm is done by the steps:

- Find the minimum element in the vector D and store its index in the scalar object, k.
- Subtract row k of the matrix W from the vector X, and store the result in a temporary vector Y.
- Multiply Y by a learning rate, alpha, and store the result in Y.
- Add Y to row k of the matrix W.
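For concreteness, the same steps can be written as plain C on a single processor (our own illustrative sketch; on a parallel machine these steps would be library calls operating on the distributed objects D, W, X, and Y):

    /* Single-processor sketch of the competitive learning update described above.
       d[] holds the N Euclidean distances from equation (1). */
    void cln_update(int N, int M, float *W /* NxM, row-major */,
                    const float *x, const float *d, float alpha)
    {
        /* find the winning (minimum distance) neuron k */
        int k = 0;
        for (int i = 1; i < N; i++)
            if (d[i] < d[k])
                k = i;

        /* move row k of W toward the input: w_kj += alpha * (x_j - w_kj) */
        for (int j = 0; j < M; j++) {
            float y = x[j] - W[k * M + j];
            W[k * M + j] += alpha * y;
        }
    }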


As with MBPN, once the algorithm is put into terms of matrix and vector algebra, it is easily implemented by a high level function library. This common thread is observed also in Radial Basis Function Networks and Probabilistic Neural Networks.
2.3 Functions not in the Vendor Supplied Library

If the programmer desires to implement a function not in the vendor supplied library, then there are several options. One, an efficient high level language parallel compiler can be supplied; however, it will not be as efficient as is ultimately possible with a function hand coded by an expert. Two, if the parallel processors have a front end host, the function can be expressed in a high level language, compiled, and run on the single processor of the host. This requires the memory manager to know about the host's memory and to automatically transfer the data where it is needed. Three, supply sufficiently low level functions in the parallel processor function library that any arithmetic algorithm can be programmed on the parallel hardware. Four, supply the tools and training for a user to be able to program efficiently in the low level microcode. HNC provides options two, three, and four for its SNAP.

3. Parallel Hardware Example


A flexible SIMD parallel architecture has been adopted by HNC to implement high speed neural network applications. Several features have been added to make HNC's SNAP more flexible than a standard SIMD machine. The SNAP, illustrated in Figure 1, is a one-dimensional ring of parallel floating point processors. All processors, Pi, are connected to a triple-ported global memory via the global data bus. A systolic data bus between the processors provides efficient interprocessor communication. The triple-ported global memory is connected to both the VMEbus and the VSBbus for communication with the host processor and/or peripherals. There are three memory pools that require management: the host's memory (not explicitly shown); the Global Memory, accessible by all of the parallel processors; and the Local Memories, each accessible only by its local processor. The parallel processors are a custom HNC design with four processors in each VLSI chip. Each processor has a floating point multiplier, a floating point arithmetic logic unit and an integer arithmetic logic unit. The system runs at 20 MHz, provides 2.56 GFLOPS (with the 64 processor version) and is supplied in a desktop chassis with between 16 and 64 processors, ready to attach to a Sun SPARCstation host.

Figure 1. SNAP-64 Architecture


3.1 Precision and Dynamic Range Issues

The arithmetic precision required by users of neural network applications is dependent on the application and the algorithm. Image preprocessing forms a substantial part of many pattern recognition applications. Simply capturing and displaying visual gray scale images from most cameras requires only 8 bits per pixel. Few monitors can display more than the 256 shades of gray provided by 8 bits. However, infrared and x-ray sensors often have 12 or even 16 bits of dynamic range. Averaging these images to reduce noise, or other arithmetic operations, requires higher precision arithmetic processors. The consequences of finite precision arithmetic operations are different for each algorithm. For example, the two dimensional Fourier transform of an NxN image (FFT or DFT algorithm) expands the required dynamic range by a factor of N^2. For a 1024x1024 image with an 8-bit gray scale input, the output of the Fourier transform has a 28-bit dynamic range. Clearly 8-bit or even 16-bit integer arithmetic is inadequate. Other algorithms, such as optical flow computation, are also done much better with a floating point representation.
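The 28-bit figure quoted above follows directly from the N^2 growth factor:

8 \text{ bits} + \log_2(N^2) = 8 + 2\log_2(1024) = 8 + 20 = 28 \text{ bits}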

Floating point numbers were invented to resolve the issues of accuracy and dynamic range. In integer arithmetic, the accuracy of a data variable is one half of a least significant bit; the dynamic range of the variable is set by the number of bits in its memory storage. This leads to the sometimes anomalous situation for an integer arithmetic machine where the number 1 can be anywhere between 0.5 and 1.5; that is, its accuracy is 50%. On the other hand, the number 250 can be anywhere between 249.5 and 250.5; its accuracy is 0.2%. Thus larger numbers are more accurately represented in the computer than smaller numbers. This leads to many numerical instabilities and to the use of ad hoc scaling for each algorithm. IEEE 32-bit floating point numbers, on the other hand, have a uniform relative accuracy independent of the value of the number. This property not only gives accurate results over a very wide dynamic range, but it also eliminates the need for ad hoc scaling procedures that may or may not work and are not easily analyzable. In fixed point arithmetic, the programmer often scales the data explicitly to try to preserve accuracy and dynamic range.

One area of image processing where integer arithmetic is adequate (besides simple display) is binary morphology. There, the pixels are reduced to a single bit and stay at that precision. However, gray scale morphology is a more powerful tool and boosts the precision required. Any transform image processing, whether it is a Fourier transform, a Wavelet transform, or a convolution, will require enhanced dynamic range and accuracy over and above that given by the sensor. The only practical way to tell if an algorithm gets the right answer (if the "right" answer is known) with an integer arithmetic hardware system is to solve the problem on a general purpose floating point processor. Then one can claim that, for that set of data and that set of scaling laws in that algorithm, the hardware gets a trustworthy answer. The limited precision answer is almost always not as good, but it may be acceptable for the particular image or application. However, success with one type of image or application doesn't mean success with the next type of image or application.

Another reason to choose the IEEE 32-bit floating point data representation is that it is a standard that the host computer and other computers can easily deal with. Non-standard floating point conventions will complicate the programmer's task.
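A minimal example of the kind of problem being described (our own illustration, not from the paper): even averaging two 8-bit pixels overflows an 8-bit intermediate, forcing either a wider integer, ad hoc pre-scaling, or floating point.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint8_t a = 200, b = 200;

        uint8_t bad  = (uint8_t)((uint8_t)(a + b) / 2); /* 8-bit sum wraps to 144; "average" = 72 */
        int     good = (a + b) / 2;                     /* wider integer: average = 200 */
        float   f    = (a + b) / 2.0f;                  /* floating point: 200.0 */

        printf("%u %d %.1f\n", bad, good, f);
        return 0;
    }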
3.2 Neural Network Size and Structure Issues

Neural network chips that have restrictions on algorithms and network architecture will be acceptable only for certain classes of applications. For example, suppose a chip that implements the backpropagation algorithm has a fixed architecture with 8 inputs, 8 outputs, and 64 weights, and a user wants to run an experiment with a backpropagation network of size 100x60x60. If the board that the user built or bought has only four of these chips on it, with 32 inputs, 32 hidden neurons and 32 output neurons, then the user has to build or buy another board just to experiment with different network structures. If the user wants to adjust the algorithm significantly or use another algorithm, he or she is out of luck. A general purpose parallel processor solves the structure problem by casting the neural network algorithm in terms of matrix and vector operations. The function library admits arbitrarily sized items, subject only to not exceeding the total amount of memory available.
3.3 I/O Rate

In some applications, performance is limited by I/O rate, not processing rate. If the algorithm performs many operations on each piece of data, then the I/O rate is less important. An important consideration is that, during I/O, all parallel processors can continue to work. HNC solved this problem by using the SNAP global memory as an I/O buffer. It is a triple ported memory such that the VMEbus, the VSBbus and the SIMD processors can each read or write global memory during the same cycle. The data transfer rates are 20 MBytes/sec, 32 MBytes/sec, and 40 MBytes/sec respectively.
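As a rough illustration using the figures above (our own back-of-the-envelope estimate, not a benchmark), consider streaming 512x512 8-bit images over the 20 MBytes/sec VMEbus into the 2.56 GFLOPS, 64-processor system:

256 KBytes / 20 MBytes/sec ≈ 12.8 ms per image
2.56 GFLOPS × 12.8 ms ≈ 33 million operations ≈ 125 operations per pixel

If the algorithm performs substantially more than this per pixel, the application is compute bound and the I/O rate is not the limiting factor; if it performs fewer, performance is limited by the I/O rate and the ability to overlap I/O with computation becomes critical.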
3.4 Preprocessing

The efficiency of preprocessing operations depends on the extent and the efficiency of the library of functions provided, the ease of adding new and efficient functions, and the integration of the attached array processor into the host's programming environment. Most of the preprocessing functions involved in neural network applications are parallelizable, such as the Fourier transform and the matrix transpose. The systolic data bus hardware feature of the SNAP provides very high bandwidth between the processors and makes the implementation of the matrix transpose efficient. The transpose is used in the two dimensional Fourier transform of an image. Neural network applications require significant amounts of preprocessing; the actual neural network classification usually requires less than 20% of the total computational cycles.
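The role of the transpose in the two dimensional transform can be sketched as follows; fft_rows() and transpose() are hypothetical library routines used only to show the structure, with each row FFT running on the processor that owns that row.

    /* Structural sketch of a 2D FFT built from row FFTs and transposes.
       fft_rows() and transpose() stand in for vendor library functions and are
       assumptions for illustration; they are declared here, not implemented. */
    typedef struct { float re, im; } cplx;

    void fft_rows(cplx *image, int n);   /* 1D FFT of each of the n rows (local to each processor) */
    void transpose(cplx *image, int n);  /* n x n transpose (uses interprocessor communication) */

    void fft_2d(cplx *image, int n)      /* n x n image, row-major */
    {
        fft_rows(image, n);              /* transform the rows */
        transpose(image, n);             /* columns become rows */
        fft_rows(image, n);              /* transform what were the columns */
        transpose(image, n);             /* restore the original orientation */
    }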

3.5 Memory Size

An adequate local memory is essential in many applications. Each processor in Figure 1 has its own local memory, external to the VLSI chip. Present SNAP boards have 512 KBytes per processor. HNC chose in its parallel processor architecture to put the local memory external to the VLSI processor chip and to make all of it accessible in a single cycle. If, alternatively, local memory is on-chip, it is usually very small, perhaps only 4 KBytes per processor, and there is often no way of adding external memory. Thus, with on-chip memory and a system with 64 processors, the total local memory is 256 KBytes. A single 512x512 image takes up the whole memory. There is no way of adding two images and storing the result in a third image without elaborate and time consuming block oriented operations and data transfers. A larger number of processors can be used, but there is always a limit that is reached very quickly with on-chip RAM. Efficiency is also lost for small operations if large numbers of processors with on-chip memory are used as a work-around for insufficient total memory. For example, both a 128x64x10 and a 128x512x10 MBPN network can be performed efficiently on a parallel processing system with 64 processors, but only the 128x512x10 network can be performed efficiently on a system with 512 processors.
4. Summary

Neural network algorithms, plus the image processing, signal processing, and data processing that go on both before and after the neural network part of a complete application, are all implemented efficiently in HNC's flexible SIMD architecture. The essential ingredients are an extensive math library, an object oriented data manager that is integrated into the host's high level language programming environment, and a powerful, flexible hardware architecture that has the features needed for efficient implementation of a wide range of algorithms.

