Printed in the USA. All rights reserved. Copyright © 1990 Pergamon Press plc

ORIGINAL CONTRIBUTION

Probabilistic Neural Networks

DONALD F. SPECHT
Lockheed Missiles & Space Company, Inc.
(Received 5 August 1988; revised and accepted 14 June 1989)
Abstract--By replacing the sigmoid activation function often used in neural networks with an exponential
function, a probabilistic neural network (PNN) that can compute nonlinear decision boundaries which approach
the Bayes optimal is formed. Alternate activation functions having similar properties are also discussed. A four-
layer neural network of the type proposed can map any input pattern to any number of classifications. The
decision boundaries can be modified in real-time using new data as they become available, and can be implemented
using artificial hardware "neurons" that operate entirely in parallel. Provision is also made for estimating the
probability and reliability of a classification as well as making the decision. The technique offers a tremendous
speed advantage for problems in which the incremental adaptation time of back propagation is a significant
fraction of the total computation time. For one application, the PNN paradigm was 200,000 times faster than
back-propagation.
Keywords--Neural network, Probability density function, Parallel processor, "Neuron", Pattern recognition,
Parzen window, Bayes strategy, Associative memory.
1962) and can be applied to problems containing any number of categories.

Consider the two-category situation in which the state of nature θ is known to be either θ_A or θ_B. If it is desired to decide whether θ = θ_A or θ = θ_B based on a set of measurements represented by the p-dimensional vector X' = [X_1 . . . X_j . . . X_p], the Bayes decision rule becomes

d(X) = θ_A if h_A l_A f_A(X) > h_B l_B f_B(X)
d(X) = θ_B if h_A l_A f_A(X) < h_B l_B f_B(X)   (1)

where f_A(X) and f_B(X) are the probability density functions for categories A and B, respectively; l_A is the loss function associated with the decision d(X) = θ_B when θ = θ_A; l_B is the loss associated with the decision d(X) = θ_A when θ = θ_B (the losses associated with correct decisions are taken to be equal to zero); and h_A is the a priori probability of occurrence of patterns from category A.

The accuracy of the decision boundaries depends on the accuracy with which the underlying PDFs are estimated. Parzen (1962) showed how one may construct a family of estimates of f(X),

f_n(X) = (1/(n λ)) Σ_{i=1}^{n} ω[(X - X_{Ai})/λ],   (4)

which is consistent at all points X at which the PDF is continuous. Let X_{A1}, . . . , X_{Ai}, . . . , X_{An} be independent random variables identically distributed as a random variable X whose distribution function F(X) = P[x ≤ X] is absolutely continuous. Parzen's conditions on the weighting function ω(y) are

sup_{-∞ < y < ∞} |ω(y)| < ∞,   (5)

where sup indicates the supremum;

∫_{-∞}^{+∞} |ω(y)| dy < ∞;   (6)

lim_{y→∞} |y ω(y)| = 0.   (7)
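As a concrete illustration of eqns (1) and (4), the following Python sketch (not from the original paper; all function and variable names here are invented) forms Parzen estimates of the two class densities with a Gaussian weighting function and applies the Bayes decision rule:

```python
import numpy as np

def parzen_estimate(x, samples, lam):
    """One-dimensional Parzen estimate, eqn (4), with Gaussian
    weighting function omega(y) = exp(-y**2/2)/sqrt(2*pi)."""
    y = (x - samples) / lam
    return np.mean(np.exp(-y**2 / 2) / np.sqrt(2 * np.pi)) / lam

def bayes_decide(x, a_samples, b_samples, lam,
                 h_a=0.5, h_b=0.5, l_a=1.0, l_b=1.0):
    """Two-category Bayes decision rule, eqn (1)."""
    f_a = parzen_estimate(x, a_samples, lam)
    f_b = parzen_estimate(x, b_samples, lam)
    return 'A' if h_a * l_a * f_a > h_b * l_b * f_b else 'B'

rng = np.random.default_rng(0)
a = rng.normal(-1.0, 1.0, 100)  # training samples from category A
b = rng.normal(+1.0, 1.0, 100)  # training samples from category B
print(bayes_decide(-0.8, a, b, lam=0.5))  # expected to print 'A'
```

Any weighting function satisfying conditions (5) through (7) could be substituted for the Gaussian here.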
[Figure 2: network diagram. Input units feed pattern units; pattern units feed summation-unit pairs whose outputs (+A, -B) drive two-input output units producing the binary output vector.]

FIGURE 2. Organization for classification of patterns into categories. From "Probabilistic Neural Networks for Classification, Mapping, or Associative Memory" by D. F. Specht, 1988, IEEE International Conference on Neural Networks, 1. Copyright 1988 by IEEE. Reprinted by permission.

where

C_k = -(h_{Bk} l_{Bk} / h_{Ak} l_{Ak})(n_{Ak} / n_{Bk})   (13)

n_{Ak} = number of training patterns from category A_k,
n_{Bk} = number of training patterns from category B_k.

Note that C_k is the ratio of a priori probabilities, divided by the ratio of samples and multiplied by the ratio of losses.
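As a worked example (with assumed numbers): if h_{Bk}/h_{Ak} = 0.25, l_{Bk}/l_{Ak} = 10, and n_{Ak}/n_{Bk} = 4, then eqn (13) gives C_k = -(0.25)(10)(4) = -10.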
[Figure: input units X_1, X_2, X_3, . . . , X_p feed pattern units with activation g(Z_i) = exp[(Z_i - 1)/σ²]; summation units form f_A(X) and f_B(X); an output unit produces the binary output.]
In any problem in which the numbers of training samples from categories A and B are obtained in proportion to their a priori probabilities, C_k = -l_{Bk}/l_{Ak}. This final ratio cannot be determined from the statistics of the training samples, but only from the significance of the decision. If there is no particular reason for biasing the decision, C_k may simplify to -1 (an inverter).

The network is trained by setting the W_i weight vector in one of the pattern units equal to each of the X patterns in the training set and then connecting the pattern unit's output to the appropriate summation unit. A separate neuron (pattern unit) is required for every training pattern. As indicated in Figure 2, the same pattern units can be grouped by different summation units to provide additional pairs of categories and additional bits of information in the output vector.
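To make the one-pass training procedure concrete, here is a minimal Python sketch (illustrative only; the class and all names are invented), assuming unit-normalized pattern vectors and the exponential pattern-unit activation g(Z_i) = exp[(Z_i - 1)/σ²] shown in the figure above:

```python
import numpy as np

class PNN:
    """Sketch of a two-category probabilistic neural network."""

    def __init__(self, sigma):
        self.sigma = sigma
        self.weights = []  # one stored weight vector per pattern unit
        self.labels = []   # category of each pattern unit

    def train(self, patterns, labels):
        # Training: set W_i equal to each (normalized) training pattern
        # and connect the unit to its category's summation unit.
        for x, lab in zip(patterns, labels):
            self.weights.append(x / np.linalg.norm(x))
            self.labels.append(lab)

    def classify(self, x, c=1.0):
        x = x / np.linalg.norm(x)
        z = np.array([w @ x for w in self.weights])  # Z_i = X . W_i
        g = np.exp((z - 1.0) / self.sigma**2)        # pattern-unit outputs
        lab = np.array(self.labels)
        f_a = g[lab == 'A'].sum()                    # summation unit, A
        f_b = g[lab == 'B'].sum()                    # summation unit, B
        return 'A' if f_a > c * f_b else 'B'         # output unit (c ~ |C_k|)

rng = np.random.default_rng(0)
net = PNN(sigma=0.5)
net.train(rng.normal([1, 0, 0], 0.1, (20, 3)), ['A'] * 20)
net.train(rng.normal([0, 1, 0], 0.1, (20, 3)), ['B'] * 20)
print(net.classify(np.array([0.9, 0.1, 0.0])))  # expected 'A'
```

Training is pure storage: adding a new training pattern adds one pattern unit, which is what allows the decision boundaries to be modified in real time as new data become available.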
ALTERNATE ACTIVATION FUNCTIONS

Although eqn (12) has been used in all the experimental work so far, it is not the only consistent estimator that could be used. Alternate estimators suggested by Cacoullos (1966) and Parzen (1962) are given in Table 1, where

f_A(X) = (K_p / (n σ^p)) Σ_{i=1}^{n} ω(y_i/σ),   (14)

y_i² = Σ_{j=1}^{p} (x_j - x_{Aij})²,   (15)

and K_p is a constant such that

∫ K_p ω(y) dy = 1,   (16)

with Z_i = X · W_i as before.

When X and W_i are both normalized to unit length, Z_i ranges from -1 to +1, and the activation function is of the form shown in Table 1. Note that here all of the estimators can be expressed as a dot product feeding into an activation function, because all involve y = (2 - 2X · X_{Ai})^{1/2}. Non-dot-product forms will be discussed later.

All the Parzen windows shown in Table 1, in conjunction with the Bayes decision rule of eqn (1), would result in decision surfaces that are asymptotically Bayes optimal. The only difference in the corresponding neural networks would be the form of the nonlinear activation function in the pattern unit. This leads one to suspect that the exact form of the activation function is not critical to the usefulness of the network. The common elements in all the networks are that: the activation function takes its maximum value at Z_i = 1, or maximum similarity between the input pattern X and the pattern stored in the pattern unit; the activation function decreases as the pattern becomes less similar; and the entire curve should be compressed towards the Z_i = 1 line as the number of training patterns, n, is increased.

LIMITING CONDITIONS AS σ → 0 AND AS σ → ∞

It has been shown (Specht, 1967a) that the decision boundary defined by eqn (2) varies continuously from a hyperplane when σ → ∞ to a very nonlinear boundary representing the nearest neighbor classifier when σ → 0. The nearest neighbor decision rule has been investigated in detail by Cover and Hart (1967). In general, neither limiting case provides optimal separation of the two distributions. A degree of averaging of nearest neighbors, dictated by the density of training samples, provides better generalization than basing the decision on a single nearest neighbor. The network proposed is similar in effect to the k-nearest neighbor classifier.

Specht (1966) contains an involved discussion of how one should choose a value of the smoothing parameter, σ, as a function of the dimension of the problem, p, and the number of training patterns, n. However, it has been found that in practical problems it is not difficult to find a good value of σ, and that the misclassification rate does not change dramatically with small changes in σ.
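The limiting behavior can be checked numerically. The sketch below (illustrative data and names; it uses the general distance form of the Gaussian window rather than the dot-product form) classifies one test point under a small, a moderate, and a large σ:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal([0.0, 0.0], 1.0, (50, 2))  # training patterns, category A
B = rng.normal([3.0, 0.0], 1.0, (50, 2))  # training patterns, category B

def class_sum(x, patterns, sigma):
    """Sum of Gaussian windows centred on the training patterns."""
    d2 = ((patterns - x) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sigma**2)).sum()

x = np.array([1.4, 0.2])
for sigma in (0.1, 1.0, 100.0):
    decision = 'A' if class_sum(x, A, sigma) > class_sum(x, B, sigma) else 'B'
    print(f'sigma = {sigma:6.1f}: decide {decision}')
```

With small σ the class sums are dominated by the single nearest training pattern, reproducing the nearest neighbor rule; with very large σ every pattern contributes almost equally and the boundary approaches a hyperplane.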
Specht (1967b) describes an experiment in which electrocardiograms were classified as normal or abnormal using the two-category classification of eqns (1) and (12). In that case, 249 patterns were available for training and 63 independent cases were available for testing. Each pattern was described by a 46-dimensional pattern vector (but not normalized to unit length). Figure 5 shows the percentage of testing samples classified correctly versus the value of the smoothing parameter, σ. Several important conclusions are obvious. Peak diagnostic accuracy can be obtained with any σ between 4 and 6; the peak of the curve is sufficiently broad that finding a good value of σ experimentally is not difficult. Furthermore, any σ in the range from 3 to 10 yields results only slightly poorer than those for the best value. It turned out that all values of σ from 0 to ∞ gave results that were significantly better than those of cardiologists on the same testing set.

The only parameter to be tweaked in the proposed system is the smoothing parameter, σ. Because it controls the scale factor of the exponential activation function, its value should be the same for every pattern unit.

AN ASSOCIATIVE MEMORY

In the human thinking process, knowledge accumulated for one purpose is often used in different ways for different purposes.
TABLE 1
Parzen windows and the corresponding activation functions

ω(y) = 1 for y ≤ 1; 0 for y > 1   (rectangular window)
ω(y) = 1 - y for y ≤ 1; 0 for y > 1   (triangular window)
ω(y) = e^{-y²/2}   (Gaussian window)
ω(y) = e^{-|y|}   (two-sided exponential window)
ω(y) = 1/(1 + y²)   (Cauchy window)
ω(y) = [sin(y/2)/(y/2)]²   (sinc-squared window)

[In the original table, each window is accompanied by a plot of the resulting activation function versus Z_i over the range -1 to 1.]
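For illustration (names invented; not code from the paper), each window of Table 1 can be evaluated as an activation function of the dot product Z_i by substituting y = (2 - 2Z_i)^{1/2}, the relation noted above for unit-length vectors:

```python
import numpy as np

def y_from_z(z):
    """Distance y between unit vectors X and W_i from the dot product Z_i."""
    return np.sqrt(np.clip(2.0 - 2.0 * z, 0.0, None))

# The Parzen windows of Table 1, each expressed as omega(y):
windows = {
    'rectangular': lambda y: np.where(y <= 1.0, 1.0, 0.0),
    'triangular':  lambda y: np.where(y <= 1.0, 1.0 - y, 0.0),
    'gaussian':    lambda y: np.exp(-y**2 / 2.0),
    'exponential': lambda y: np.exp(-np.abs(y)),
    'cauchy':      lambda y: 1.0 / (1.0 + y**2),
    # sin(y/2)/(y/2) squared, via np.sinc (sinc(x) = sin(pi x)/(pi x))
    'sinc2':       lambda y: np.sinc(y / (2.0 * np.pi))**2,
}

z = np.linspace(-1.0, 1.0, 5)
for name, w in windows.items():
    print(name, np.round(w(y_from_z(z)), 3))  # activation versus Z_i
```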
Similarly, in this situation, if the decision category were known, but not all the input variables, then the known input variables could be impressed on the network for the correct category and the unknown input variables could be varied to maximize the output of the network. These values represent those most likely to be associated with the known inputs. If only one parameter were unknown, then the most probable value of that parameter could be found by ramping through all possible values of the parameter and choosing the one that maximizes the PDF. If several parameters are unknown, this method may be impractical. In this case, one might be satisfied with finding the closest mode of the PDF. This goal could be achieved using the method of steepest ascent.

A more general approach to forming an associative memory is to avoid distinguishing between inputs and outputs.
[Figure 5 (plot): percentage of testing samples classified correctly (0 to 100%) versus smoothing parameter σ (0 to 100). The broad peak of the curve reaches 97%; annotations in the original figure mark "% correct on abnormals," a "matched-filter solution with computed threshold," a "nearest-neighbor decision rule," and an 81% level.]
FIGURE 5. Percentage of testing samples classified correctly versus smoothing parameter σ. From "Vectorcardiographic Diagnosis Using the Polynomial Discriminant Method of Pattern Recognition" by D. F. Specht, 1967, IEEE Transactions on Bio-Medical Engineering, 14, 94. Copyright 1967 by IEEE. Reprinted by permission.
By concatenating the X vector and the output vector into one longer measurement vector X', a single probabilistic network can be used to find the global PDF, f(X'). This PDF may have many modes clustered at various locations on the hypersphere. To use this network as an associative memory, one impresses on the inputs of the network those parameters that are known, and allows the other parameters to relax to whatever combination maximizes f(X'), which occurs at the nearest mode.
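A hedged sketch of this relaxation (gradient ascent on a Gaussian-kernel estimate of f(X'); the function, data, and names are invented for illustration): clamp the known components and take steepest-ascent steps on the unknown ones until a mode is reached.

```python
import numpy as np

def grad_ascent_recall(stored, x0, known_mask,
                       sigma=0.5, step=0.05, iters=200):
    """Associative recall: hold the known components of x fixed and move
    the unknown components uphill on f(X') = sum of Gaussian kernels
    centred on the stored (concatenated input+output) vectors."""
    x = x0.copy()
    for _ in range(iters):
        diff = stored - x                                     # (n, d)
        w = np.exp(-(diff**2).sum(axis=1) / (2 * sigma**2))   # kernel weights
        grad = (w[:, None] * diff).sum(axis=0) / sigma**2     # gradient of f
        x[~known_mask] += step * grad[~known_mask]            # update unknowns
    return x

stored = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 4.0]])  # training vectors X'
x0 = np.array([1.0, 0.0])        # first component known, second unknown
known = np.array([True, False])
print(grad_ascent_recall(stored, x0, known))
# The unknown second component climbs toward the nearest mode, near 1.0.
```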
SPEED ADVANTAGE RELATIVE TO BACK PROPAGATION

One of the principal advantages of the PNN paradigm is that it is very much faster than the well-known back-propagation paradigm (Rumelhart, 1986, chap. 8) for problems in which the incremental adaptation time of back-propagation is a significant fraction of the total computation time. In a hull-to-emitter correlation problem supplied by the Naval Ocean Systems Center (NOSC), the PNN accurately identified hulls from difficult, nonlinear-boundary, multiregion, and overlapping emitter report parameter data sets.

Marchette and Priebe (1987) provide a description of the problem and the results of classification using back-propagation and conventional techniques. Maloney (1988) describes the results of using PNN on the same data base.

The data set consisted of 113 emitter reports of three continuous input parameters each. The output layer consisted of six binary outputs indicating six possible hull classifications. This data set was small, but as in many practical problems, more data were either not available or expensive to obtain. To make the most use of the data available, both groups decided to hold out one report, train a network on the other 112 reports, and use the trained network to classify the holdout pattern. This process was repeated with each of the 113 reports in turn. Marchette and Priebe (1987) estimated that to perform the experiment as planned would take in excess of 3 weeks of continuous running time on a Digital Equipment Corp. VAX 8650. Because they didn't have that much VAX time available, they reduced the number of hidden units until the computation could be performed over the weekend. Maloney (1988), on the other hand, used a version of PNN on an IBM PC/AT (8 MHz) and ran all 113 networks in 9 seconds (most of which was spent writing results on the screen). Not taking into account the I/O overhead or the higher speed of the VAX, this amounts to a speed improvement of 200,000 to 1!

Classification accuracy was roughly comparable. Back-propagation produced 82% accuracy whereas PNN produced 85% accuracy (the data distributions overlap such that 90% is the best accuracy that NOSC ever achieved using a carefully crafted special-purpose classifier). It is assumed that back-propagation would have achieved about the same accuracy as PNN if allowed to run 3 weeks. By breaking the problem into subproblems classified by separate
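Because PNN training amounts to pattern storage, the hold-one-out procedure described above is inexpensive to reproduce: each run simply excludes one stored pattern. A toy sketch of the procedure in Python (invented data and names; not the NOSC data set):

```python
import numpy as np

def pnn_classify(x, train_X, train_y, sigma):
    """Classify x from Gaussian-kernel sums per category (general PNN form)."""
    d2 = ((train_X - x) ** 2).sum(axis=1)
    g = np.exp(-d2 / (2 * sigma**2))
    cats = np.unique(train_y)
    return cats[np.argmax([g[train_y == c].sum() for c in cats])]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(m, 1.0, (30, 3)) for m in (0.0, 2.5)])  # toy reports
y = np.array([0] * 30 + [1] * 30)

# Leave-one-out: "train" on all patterns but one, classify the holdout.
correct = 0
for i in range(len(X)):
    keep = np.arange(len(X)) != i
    correct += pnn_classify(X[i], X[keep], y[keep], sigma=1.0) == y[i]
print(f'leave-one-out accuracy: {correct / len(X):.2f}')
```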
NOMENCLATURE

n  number of training patterns
n_R  number of training patterns from category R
n_{Rk}  number of training patterns from category R_k
P[R|X]  probability of R given X
p  dimension of the pattern vectors
R  category (A or B)
W  weight vector (p-dimensional)
X  pattern vector (p-dimensional)
X_j  jth component of X
X'  transpose of X, X' = [X_1 . . . X_j . . . X_p]
X_{Ri}  ith training pattern vector from category R
X_{Rij}  jth component of X_{Ri}
ω(y)  weighting factor
Z  dot product of X with weight vector W
θ  state of nature
θ_R  Rth state of nature: the Rth category (for the two-category case, R = A or B)
λ  parameter of the weighting function ω(y)
σ  "smoothing parameter" of the probability density function estimator