Abstract: This paper describes neural network architectures based on the Stone-Weierstrass theorem. Exponential functions, polynomials, partial fractions, and Boolean functions are used to create networks capable of approximating arbitrary bounded measurable functions. A "modified logistic network," satisfying the theorem, is proposed as an alternative to commonly used networks based on logistic squashing functions.

2) Separability: For any two points x_1 ≠ x_2 in D, there is an f in F such that f(x_1) ≠ f(x_2).
3) Algebraic Closure: If f and g are any two functions in F, then fg and af + bg are in F for any two real numbers a and b.
Then F is dense in C(D), the set of continuous real-valued functions on D. In other words, for any ε > 0 and any function g in C(D), there is a function f in F such that |g(x) − f(x)| < ε for all x ∈ D.
In other words, the minimum total volume of open spheres required to cover the set where f ≠ g is less than δ.
Results from real analysis allow us to strengthen this result of convergence in measure to convergence almost everywhere, and we conclude that continuous functions can approximate bounded measurable functions.
An alternative formulation is to consider the space L^p[D] consisting of all real measurable Lebesgue-integrable functions whose L^p norm is finite on domain D:

    L^p[D] = { f : ||f||_p < ∞ }    (3)

where the norm for 1 ≤ p < ∞ is

    ||f||_p = ( ∫_D |f(x)|^p dx )^(1/p).
Because of their practical meanings, it is convenient to use the L^p spaces: for p = 1, we obtain the set of finite-area functions on domain D, and for p = 2, we obtain the set of finite-energy functions on domain D. Because these descriptions are so easily understood, we state our theorems in terms of L^p spaces.
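As a concrete illustration of these two norms, the short Python sketch below (added here for illustration and not part of the original text; numpy is assumed) estimates the L^1 "area" and L^2 "energy" norms of a function on D = [0, 1] by numerical integration.

    import numpy as np

    def lp_norm(f, p, a=0.0, b=1.0, samples=100_000):
        """Estimate the Lp norm of f on [a, b] with the trapezoidal rule."""
        x = np.linspace(a, b, samples)
        return np.trapz(np.abs(f(x)) ** p, x) ** (1.0 / p)

    f = lambda x: np.sin(2 * np.pi * x)
    print(lp_norm(f, p=1))   # area under |f|, about 0.6366 (= 2/pi)
    print(lp_norm(f, p=2))   # square root of the energy, about 0.7071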
When we say that a set F of functions is dense in L^p[D], we will mean that for any f in L^p[D], we can find a sequence f_n of functions in F such that

    lim_(n→∞) ||f − f_n||_p = 0.

It is always possible to find a sequence of continuous functions satisfying this condition. For such a sequence it is also possible to conclude that f_n converges to f almost everywhere.
The practical consequence of the preceding discussion is that an infinitely large neural network can model any L^p[D] function at all but isolated points. A finite network, however, might only accurately model such functions over a subset of the domain.
This kind of problem also occurs with finite Fourier series, manifesting itself as Gibbs' phenomenon. The 18% overshoot that occurs at step discontinuities decays away over some interval, causing significant error if too few terms are included in the Fourier series. In the limit of infinite terms, the error is confined to a set of measure zero and might be inconsequential. Users of finite neural networks might observe analogous phenomena. They may be confident, however, that errors can be reduced if they use a larger network and possess a suitable learning algorithm.
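This overshoot is easy to reproduce numerically. The sketch below (an added illustration, not from the original paper; Python with numpy assumed) sums the leading terms of the Fourier series of a unit-amplitude square wave and measures the peak overshoot just past the discontinuity; it settles near 18% of the amplitude regardless of how many terms are kept.

    import numpy as np

    def square_wave_partial_sum(x, n_terms):
        """Partial Fourier sum of a square wave alternating between -1 and +1."""
        s = np.zeros_like(x)
        for k in range(1, 2 * n_terms, 2):      # odd harmonics only
            s += (4.0 / np.pi) * np.sin(k * x) / k
        return s

    x = np.linspace(1e-4, np.pi / 2, 200_000)   # just to the right of the step at x = 0
    for n_terms in (10, 100, 1000):
        overshoot = square_wave_partial_sum(x, n_terms).max() - 1.0
        print(n_terms, overshoot)               # approaches about 0.179, i.e., 18% of the amplitude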
D. Advantages of Satisfying the Stone-Weierstrass Theorem
A network that satisfies the Stone-Weierstrass theorem can compute the weighted sum af + bg and the product fg of functions f and g computed by smaller networks. Conversely, we can separate a polynomial expression into smaller terms that can be approximated by neural networks. These networks may be recombined to approximate the original polynomial. We shall describe the procedure for constructing a network that computes sums, and the proof of Theorem 1 doubles as a procedure for constructing a network that computes products. With these procedures, we can explicitly determine synaptic weights for a network approximating a polynomial.
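To make the idea concrete, the sketch below (added here; Python with numpy, and not the paper's notation) represents a one-hidden-layer network with exponential activations by its hidden weight vectors and output weights, f(x) = Σ_i a_i exp(w_i · x). A sum af + bg is obtained by concatenating the two networks and scaling their output weights, and a product fg by pairwise addition of hidden weight vectors and multiplication of output weights, since exp(u) · exp(v) = exp(u + v); the synaptic weights of the combined network are written down explicitly rather than learned.

    import numpy as np

    def eval_net(hidden_w, out_w, x):
        """f(x) = sum_i out_w[i] * exp(hidden_w[i] . x): exponential activations, linear output."""
        return out_w @ np.exp(hidden_w @ x)

    def sum_net(net_f, net_g, a=1.0, b=1.0):
        """Network computing a*f + b*g: concatenate hidden units and scale output weights."""
        (wf, af), (wg, ag) = net_f, net_g
        return np.vstack([wf, wg]), np.concatenate([a * af, b * ag])

    def product_net(net_f, net_g):
        """Network computing f*g: pairwise sums of hidden weights, products of output weights."""
        (wf, af), (wg, ag) = net_f, net_g
        hidden = np.vstack([wi + wj for wi in wf for wj in wg])
        out = np.array([ai * aj for ai in af for aj in ag])
        return hidden, out

    rng = np.random.default_rng(0)
    f = (rng.normal(size=(3, 2)), rng.normal(size=3))   # 3 hidden units, 2 inputs
    g = (rng.normal(size=(4, 2)), rng.normal(size=4))   # 4 hidden units, 2 inputs
    x = rng.normal(size=2)

    h_prod, o_prod = product_net(f, g)
    print(np.isclose(eval_net(h_prod, o_prod, x), eval_net(*f, x) * eval_net(*g, x)))          # True
    h_sum, o_sum = sum_net(f, g, a=2.0, b=-1.0)
    print(np.isclose(eval_net(h_sum, o_sum, x), 2.0 * eval_net(*f, x) - eval_net(*g, x)))      # True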
In comparison, networks employing sigmoidal squashing functions, including the popular logistic [7] and hyperbolic tangent functions [8], are unable to exactly compute a product of sigmoids. Using logistic squashing functions, for example, we can trivially calculate logistics along each of two axes, such as f(x_1) = [1 + exp(−x_1)]^(−1) and g(x_2) = [1 + exp(−x_2)]^(−1). Nevertheless, no finite-sized "logistic network" can exactly compute fg. Nor does adding layers to the network resolve this problem. In this case, we find it difficult to exploit structural information about the function we wish to approximate.
III. APPLYING THE STONE-WEIERSTRASS THEOREM

A. Generic Network Architecture
A tree structure, in which many neurons on one layer feed a single neuron on the next layer, is a generic architecture for networks satisfying the Stone-Weierstrass theorem. An example of a tree-structured network is shown in Fig. 1. The tree structure has one or more hidden layers followed by a linear output neuron. We assume that an arbitrarily large number of neurons are present in each hidden layer.
In hidden layers, we will employ a variety of squashing functions. Since several of the squashing functions are unbounded, we use the term "squashing function" loosely.
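A minimal sketch of such a tree-structured forward pass follows (an added illustration in Python/numpy; the grouping of neurons and the choice of exp as the squashing function are assumptions for the example, not the paper's specification).

    import numpy as np

    def tree_forward(x, layer1_w, layer2_groups, out_w, squash=np.exp):
        """Two hidden layers arranged as a tree: each group of layer-1 neurons feeds a single
        layer-2 neuron, and a linear output neuron sums the layer-2 outputs."""
        h1 = squash(layer1_w @ x)                                # first hidden layer
        h2 = np.array([squash(w @ h1[idx])                       # one layer-2 neuron per group
                       for idx, w in layer2_groups])
        return out_w @ h2                                        # linear output neuron

    rng = np.random.default_rng(1)
    layer1_w = rng.normal(size=(6, 3))                           # 6 neurons, 3 inputs
    layer2_groups = [(np.array([0, 1, 2]), rng.normal(size=3)),  # neurons 0-2 feed one neuron
                     (np.array([3, 4, 5]), rng.normal(size=3))]  # neurons 3-5 feed another
    out_w = rng.normal(size=2)
    print(tree_forward(rng.normal(size=3), layer1_w, layer2_groups, out_w))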
Before verifying that the generic architecture satisfies the Stone-Weierstrass theorem, we observe that the approximation capabilities of our networks are unaffected by various modifications to the tree structure.
1) Because tree-structured networks can be embedded in totally connected networks, our theorems extend to totally connected networks.
2) If the range of the function being approximated is appropriately bounded, we may include an invertible continuous squashing function in the output neuron. We obtain this result by observing that, prior to squashing, we have a continuous function and a linear output neuron. (For this argument to be valid, the range of the approximated function must be properly contained in the range of the squashing function.)
3) N copies of a network, placed side by side, can compute continuous N-dimensional vector-valued functions. In other words, single-output networks generalize to multiple outputs.
4) Preprocessing of inputs via any continuous invertible function does not affect the ability of an architecture to approximate continuous functions.
Separability is satisfied whenever a network can compute a function that takes different values at any two distinct points of the domain.

Let F be the set of functions of the form

    f(x) = Σ_(i=1)^M w_i Π_(n=1)^N x_n^(w_in),   w_i, w_in ∈ ℝ,  x ≥ 0.

Then F is dense in L^p[D], for 1 ≤ p < ∞.

Let F instead be the set of functions of the form

    f(x) = Σ_(i=1)^M w_i Π_(n=1)^N g(x_n)^(w_in),   g ∈ C[0, 1] invertible,  w_i, w_in ∈ ℝ.    (17)

Then F is dense in L^p[D], for 1 ≤ p < ∞.
Corollary 4: In the exponential-function network, the synaptic weights w_in in the hidden layer may be restricted to integer or nonnegative integer values.
We note that if f is a logistic function or similar sigmoid, then f^w is also a sigmoid. Thus, if we wish to compute a single-variable function, we may use a network of neurons with special sigmoidal squashing functions and satisfy the Stone-Weierstrass theorem.

E. Partial Fraction Networks
Partial fractions are an example of nonexponential functions that translate multiplication into addition:

    1/[(1 + w_1 x)(1 + w_2 x)] = [w_1/(w_1 − w_2)] · 1/(1 + w_1 x) + [w_2/(w_2 − w_1)] · 1/(1 + w_2 x).    (18)

One might suppose that a similar identity holds when w_i x is replaced by Σ_(n=1)^N w_in x_n. With this change, however, the product no longer translates into a finite sum. Attempts to adapt the partial fraction method to multiple variables fail, and one may conclude that partial fractions afford a clever way of computing powers of a single variable but not products of powers of several variables. Nevertheless, we could use partial fractions in a network that computes functions of a single variable, formed as a weighted sum of M terms of the form 1/(1 + wx).
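A quick numerical check of identity (18) (an added sketch in plain Python, not part of the original text) confirms that the product of two such partial-fraction terms equals a weighted sum of terms of the same form whenever w_1 ≠ w_2.

    w1, w2, x = 0.7, -1.3, 2.5

    lhs = 1.0 / ((1.0 + w1 * x) * (1.0 + w2 * x))
    rhs = (w1 / (w1 - w2)) / (1.0 + w1 * x) + (w2 / (w2 - w1)) / (1.0 + w2 * x)
    print(abs(lhs - rhs) < 1e-12)   # True: the multiplication has become an addition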
F. Modified Logistic Network
Combining the partial fraction idea with exponentials produces a network that computes arbitrary functions of more than one variable. Furthermore, the combination of partial fractions and exponentials is similar to a logistic function. The resulting modified logistic network, shown in Fig. 1, obeys the Stone-Weierstrass theorem:
Theorem 5: Let F be the set of all functions that can be computed by arbitrarily large modified logistic networks on domain D = [0, 1]^N:

    f(x) = Σ_(i=1)^M w_i [1 + Σ_(k=1)^K exp( Σ_(n=1)^N w_ikn x_n )]^(−1).

Then F is dense in L^p[D], for 1 ≤ p < ∞.
Corollary 5: In the modified logistic network, the synaptic weights w_ikn in the first hidden layer may be restricted to integer or nonnegative integer values. Also, threshold weights may be added in either the output layer or first hidden layer, and nonnegative real or integer weights w_ik may be included in the penultimate layer.
The set of functions computed by a modified logistic network (with logistic squashing function appended at the output) is a superset of the set of functions computed by a logistic network. Furthermore, the backward error propagation weight update rules for the modified logistic network are identical to those of a standard logistic network. Hence, the modified logistic network may be substituted for a logistic network when one wishes to take advantage of multiplicative closure.
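The multiplicative closure just mentioned can be sketched in code. The example below (added here, Python with numpy) uses one plausible reading of a modified logistic unit, 1/(1 + Σ_k exp(w_k · x)), consistent with the partial-fraction-plus-exponential construction and the weight indices in Corollary 5; the exact arrangement in Fig. 1 may differ. It verifies that the product of two units is again a single unit whose exponent weight vectors can be written down explicitly.

    import numpy as np

    def unit(exp_w, x):
        """Modified logistic unit: 1 / (1 + sum_k exp(exp_w[k] . x))."""
        return 1.0 / (1.0 + np.exp(exp_w @ x).sum())

    def product_unit(exp_w_f, exp_w_g):
        """Exponent weights of the unit computing the product of two units:
        (1 + A)(1 + B) = 1 + A + B + A*B, and products of exponentials are exponentials."""
        cross = np.array([wf + wg for wf in exp_w_f for wg in exp_w_g])
        return np.vstack([exp_w_f, exp_w_g, cross])

    rng = np.random.default_rng(2)
    wf = rng.normal(size=(2, 3))   # 2 exponential terms, 3 inputs
    wg = rng.normal(size=(3, 3))   # 3 exponential terms, 3 inputs
    x = rng.normal(size=3)
    print(np.isclose(unit(product_unit(wf, wg), x), unit(wf, x) * unit(wg, x)))   # True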
G. Step Functions and Perceptron Networks
In the 1940s, McCulloch and Pitts [10] showed that a network of neurons with stepped squashing functions, later called perceptrons, could compute any Boolean logic function. Although its input may be real valued, a network of perceptrons computes only binary functions. We may, therefore, use an AND gate to compute the product of two such functions. Since a perceptron with properly chosen threshold and synaptic weights is an AND gate, we can construct a perceptron network that computes fg from networks that compute f and g.
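A small sketch of this construction (added here, Python with numpy; the particular thresholds are illustrative): a perceptron with both weights equal to 1 and threshold 1.5 acts as an AND gate, so feeding it the binary outputs of two perceptron networks yields a network whose output equals their product.

    import numpy as np

    def perceptron(w, threshold, x):
        """Step-activation neuron: 1 if w . x exceeds the threshold, else 0."""
        return 1.0 if np.dot(w, x) > threshold else 0.0

    def and_gate(a, b):
        """AND realized as a perceptron with weights (1, 1) and threshold 1.5."""
        return perceptron(np.array([1.0, 1.0]), 1.5, np.array([a, b]))

    # Two toy perceptron "networks" computing binary functions of a real input.
    f = lambda x: perceptron(np.array([1.0]), 0.3, np.array([x]))    # 1 when x > 0.3
    g = lambda x: perceptron(np.array([-1.0]), -0.8, np.array([x]))  # 1 when x < 0.8

    for x in (0.1, 0.5, 0.9):
        print(x, and_gate(f(x), g(x)), f(x) * g(x))   # the AND output equals the product fg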
If we add a linear output neuron to a perceptron network, we have a network that computes step functions (meaning stair-step functions with multiple steps). Lippmann [11] offers a heuristic description, paraphrased here, of how such a network with two hidden layers can approximate an arbitrary function f.
First Hidden Layer: Computes hyperplanes, each of which divides the input space into half-spaces assigned values of zero and one. The hyperplanes define boundaries of convex regions in the input space. Neurons are added to the first layer until the value of f is nearly constant over each convex region.
Second Hidden Layer: Computes AND functions that determine which convex region an input point lies in.
Output Layer: Multiplies each output from the second hidden layer by the value of f in the corresponding convex region. In other words, the output of the second hidden layer identifies the convex region, and the synaptic weight in the output layer is the value of f.
In this heuristic argument lies a proof that, if two networks can approximate f and g, we can construct a network that computes fg.
First Hidden Layer: Includes the hyperplanes for both f and g.
Second Hidden Layer: Computes AND functions for every intersection of convex regions from f and g.
Output Layer: Multiplies each output from the second hidden layer by the value of fg in the corresponding convex region.
We cannot apply the Stone-Weierstrass theorem to the aforementioned network because of a mathematical difficulty: the perceptron network computes step functions rather than continuous functions. Step functions, however, are dense in the set of measurable functions by the following lemma [6].
Lemma IV (Step Functions): If f is an almost everywhere bounded and measurable function on a compact space, then given ε > 0, there is a step function h such that

    |f − h| < ε    (21)

except on a set of measure less than ε.
Intuitively, the idea here is that any continuous function can be accurately approximated by a step function if the steps are sufficiently small. This procedure introduces yet another source of error for finite networks. Similar errors occur in digital-to-analog converters that produce stair-step approximations of bandlimited functions. In this case, suitable low-pass filtering removes the error. The same procedure removes errors for a perceptron network approximating a bandlimited function.
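The step-function idea can be visualized with a short sketch (an added illustration, Python with numpy): a continuous function is replaced by its value at the midpoint of each small interval, and the worst-case error shrinks as the steps shrink.

    import numpy as np

    def step_approx(f, n_steps, x):
        """Approximate f on [0, 1] by a stair-step function with n_steps equal steps."""
        edges = np.linspace(0.0, 1.0, n_steps + 1)
        idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_steps - 1)
        midpoints = (edges[:-1] + edges[1:]) / 2.0
        return f(midpoints[idx])

    f = lambda x: np.sin(2 * np.pi * x)
    x = np.linspace(0.0, 1.0, 10_001)
    for n in (10, 100, 1000):
        print(n, np.max(np.abs(f(x) - step_approx(f, n, x))))   # max error falls roughly as 1/n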
Although the Stone-Weierstrass theorem does not apply to our network, there is an advantage to having shown that we satisfy algebraic closure: we can explicitly construct networks that compute certain polynomial expressions. This is partial compensation for the lack of a multilayer perceptron learning algorithm [12].
V. CONCLUSION
In the first part of this paper, we observed that errors which approach zero in the limit of large networks may be significant in finite networks. Two possible types of errors are overshoot and slow convergence. A third type of error may arise in two-layer networks: the network may be unable to efficiently approximate the product of two functions.
These difficulties are partially resolved by requiring networks to satisfy the hypotheses of the Stone-Weierstrass theorem, as do the networks presented in this paper. These networks are derived by transforming multiplication into addition. The transformation is accomplished via three types of squashing functions: exponentials, partial fractions, and step functions. Networks based on these functions satisfy the algebraic closure hypotheses of the Stone-Weierstrass theorem, allowing us to explicitly construct networks that compute polynomial expressions of functions computed by smaller networks.
A combination of partial fractions and exponentials produces the modified logistic network. This network is similar to a logistic network but satisfies the Stone-Weierstrass theorem directly. The modified logistic network is suggested as an alternative to the logistic network in situations where one wishes to compute polynomial expressions.

REFERENCES
[1] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr., Signals, Syst., vol. 2, pp. 303-314, 1989.
[2] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.
[3] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in Proc. IEEE Int. Conf. Neural Networks, San Diego, CA, July 24-27, 1988, vol. I, pp. 657-664.
[4] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[5] L. V. Kantorovich and G. P. Akilov, Functional Analysis, 2nd ed. Oxford: Pergamon, 1982.
[6] H. L. Royden, Real Analysis, 2nd ed. New York: Macmillan, 1968.
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge, MA: M.I.T. Press, 1986.
[8] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Nat. Acad. Sci. USA, vol. 81, pp. 3088-3092, 1984.
[9] A. Barron and R. Barron, "Statistical learning networks: A unifying view," presented at the Symp. Interface: Statistics Comput. Sci., Reston, VA, 1988.
[10] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bull. Math. Biophys., vol. 5, pp. 115-133, 1943.
[11] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., pp. 4-22, Apr. 1987.
[12] M. Minsky and S. Papert, Perceptrons. Cambridge, MA: M.I.T. Press, 1969.