
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 1, NO. 4, DECEMBER 1990

The Stone-Weierstrass Theorem and Its Application to Neural Networks


NEIL E. COTTER

Manuscript received January 24, 1989; revised June 15, 1990. This paper was supported by the National Science Foundation under Grant EET-8810478. The author is with the Department of Electrical Engineering, University of Utah, Salt Lake City, UT 84112. IEEE Log Number 9038255.
1045-9227/90/1200-0290$01.00 (c) 1990 IEEE

Abstract: This paper describes neural network architectures based on the Stone-Weierstrass theorem. Exponential functions, polynomials, partial fractions, and Boolean functions are used to create networks capable of approximating arbitrary bounded measurable functions. A "modified logistic network," satisfying the theorem, is proposed as an alternative to commonly used networks based on logistic squashing functions.

I. INTRODUCTION

In the past several years, researchers have shown that infinitely large neural networks with a single hidden layer are capable of approximating all but the most pathological of functions [1]-[4]. This powerful result assures us that neural networks have two desirable properties: 1) larger networks can produce less error than smaller networks; and 2) there are no nemesis functions that cannot be modeled by neural networks.

The Stone-Weierstrass theorem from classical real analysis can be used to show that certain network architectures possess the universal approximation capability. Hornik et al. [4] introduced this theorem as the basis of their proofs and considered existing architectures that fail to satisfy the theorem's hypotheses. In this paper, we focus instead on new architectures satisfying the theorem. This ensures that our networks have the two desirable properties listed above.

By employing the Stone-Weierstrass theorem in the design of our networks, we also guarantee that the networks can compute certain polynomial expressions: if we are given networks exactly computing two functions, f and g, then a larger network can exactly compute a polynomial expression of f and g. We will see how the larger network can be explicitly determined from the smaller ones. Our discussion begins with the theorem and a pragmatic description of some abstract terminology.

II. STONE-WEIERSTRASS THEOREM AND TERMINOLOGY

Theorem (Stone-Weierstrass) [5], [6]: Let domain D be a compact space of N dimensions, and let 𝔉 be a set of continuous real-valued functions on D, satisfying the following criteria:

1) Identity Function: The constant function f(x) = 1 is in 𝔉.

2) Separability: For any two points x_1 ≠ x_2 in D, there is an f in 𝔉 such that f(x_1) ≠ f(x_2).

3) Algebraic Closure: If f and g are any two functions in 𝔉, then fg and af + bg are in 𝔉 for any two real numbers a and b.

Then 𝔉 is dense in C(D), the set of continuous real-valued functions on D. In other words, for any ε > 0 and any function g in C(D), there is a function f in 𝔉 such that |g(x) - f(x)| < ε for all x ∈ D.

A. Compact Spaces

In applications of neural networks, the domain on which we operate is almost always compact. It is a standard result in real analysis that every closed and bounded set in ℝ^N is compact [6]. The unit hypercube, which we assume for the domain from now on, is the prototypical example of such a compact space:

D = [0, 1]^N.   (1)

Any region obtained by smoothly distorting (but not tearing) the unit hypercube is also a suitable compact space to which the Stone-Weierstrass theorem can be applied.

B. Measurable Functions and Lp Spaces

The Stone-Weierstrass theorem states conditions that guarantee that a network can approximate continuous functions. Nevertheless, many interesting functions, including step functions, are discontinuous. These functions are members of the set of bounded measurable functions. This set includes continuous functions, bounded functions that have a countable number of discontinuities, and all other bounded functions that are likely to be encountered in practice. Nonmeasurable functions, such as white Gaussian noise, are typified by having discontinuities at uncountably many points in their domain. Such functions are abstractions, however, that are unrealizable in physical systems.

C. Convergence Almost Everywhere

The Stone-Weierstrass theorem can be extended to bounded measurable functions by applying a theorem of Lusin [6].

Theorem (Lusin): If g is a measurable real-valued function that is bounded almost everywhere on a compact domain D, then given δ > 0 there is a continuous real-valued function f on D such that the measure of the set where f is not equal to g is less than δ.



In other words, the minimum total volume of open spheres required to cover the set where f ≠ g is less than δ.

Results from real analysis allow us to strengthen this result of convergence in measure to convergence almost everywhere, and we conclude that continuous functions can approximate bounded measurable functions.

An alternative formulation is to consider the space Lp[D] consisting of all real measurable Lebesgue-integrable functions whose Lp norm is finite on domain D:

Lp[D] = { f measurable on D : ||f||_p < ∞ }   (3)

where the norm for 1 ≤ p < ∞ is

||f||_p = ( ∫_D |f(x)|^p dx )^{1/p}.   (4)

Because of their practical meanings, it is convenient to use the Lp spaces: for p = 1, we obtain the set of finite area functions on domain D, and for p = 2, we obtain the set of finite energy functions on domain D. Because these descriptions are so easily understood, we state our theorems in terms of Lp spaces.

When we say that a set 𝔉 of functions is dense in Lp[D], we will mean that for any f in Lp[D], we can find a sequence f_n of functions in 𝔉 such that

lim_{n→∞} ||f - f_n||_p = 0.   (5)

It is always possible to find a sequence of continuous functions satisfying this condition: For such a sequence it is also possible to conclude that f_n converges to f almost everywhere.
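As a concrete illustration of the Lp closeness in (5), the short sketch below estimates ||f - f_n||_p on D = [0, 1] with a simple Riemann sum. It is not taken from the paper: the step-function target and the logistic approximating sequence are illustrative choices only.

```python
import numpy as np

def lp_distance(f, g, p=2, num_points=100001):
    """Estimate ||f - g||_p on D = [0, 1] by a uniform-grid Riemann sum."""
    x = np.linspace(0.0, 1.0, num_points)
    return np.mean(np.abs(f(x) - g(x)) ** p) ** (1.0 / p)

# A step function is approximated in L2 by ever-sharper logistic functions,
# even though the error right at the jump never disappears.
step = lambda x: (x >= 0.5).astype(float)
for k in (10, 100, 1000):
    f_k = lambda x, k=k: 1.0 / (1.0 + np.exp(-k * (x - 0.5)))
    print(k, lp_distance(step, f_k))
```

The printed distances shrink toward zero as k grows, which is exactly the sense of convergence used throughout the paper.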
The practical consequence of the preceding discussion is that an infinitely large neural network can model any Lp[D] function at all but isolated points. A finite network, however, might only accurately model such functions over a subset of the domain.

This kind of problem also occurs with finite Fourier series, manifesting itself as Gibbs' phenomenon. The 18% overshoot that occurs at step discontinuities decays away over some interval, causing significant error if too few terms are included in the Fourier series. In the limit of infinite terms, the error is confined to a set of measure zero and might be inconsequential. Users of finite neural networks might observe analogous phenomena. They may be confident, however, that errors can be reduced if they use a larger network and possess a suitable learning algorithm.

D. Advantages of Satisfying the Stone-Weierstrass Theorem

A network that satisfies the Stone-Weierstrass theorem can compute the weighted sum af + bg and the product fg of functions f and g computed by smaller networks. Conversely, we can separate a polynomial expression into smaller terms that can be approximated by neural networks. These networks may be recombined to approximate the original polynomial. We shall describe the procedure for constructing a network that computes sums, and the proof of Theorem 1 doubles as a procedure for constructing a network that computes products. With these procedures, we can explicitly determine synaptic weights for a network approximating a polynomial.

In comparison, networks employing sigmoidal squashing functions, including the popular logistic [7] and hyperbolic tangent functions [8], are unable to exactly compute a product of sigmoids. Using logistic squashing functions, for example, we can trivially calculate logistics along each of two axes

f(x_1, x_2) = 1 / (1 + exp(-x_1))   (6)

g(x_1, x_2) = 1 / (1 + exp(-x_2)).   (7)

Nevertheless, no finite-sized "logistic network" can exactly compute fg. Nor does adding layers to the network resolve this problem. In this case, we find it difficult to exploit structural information about the function we wish to approximate.

III. APPLYING THE STONE-WEIERSTRASS THEOREM

A. Generic Network Architecture

A tree structure, in which many neurons on one layer feed a single neuron on the next layer, is a generic architecture for networks satisfying the Stone-Weierstrass theorem. An example of a tree-structured network is shown in Fig. 1. The tree structure has one or more hidden layers followed by a linear output neuron. We assume that an arbitrarily large number of neurons are present in each hidden layer.

In hidden layers, we will employ a variety of squashing functions. Since several of the squashing functions are unbounded, we use the term "squashing function" loosely.

Before verifying that the generic architecture satisfies the Stone-Weierstrass theorem, we observe that the approximation capabilities of our networks are unaffected by various modifications to the tree structure.

1) Because tree-structured networks can be embedded in totally connected networks, our theorems extend to totally connected networks.

2) If the range of the function being approximated is appropriately bounded, we may include an invertible continuous squashing function in the output neuron. We obtain this result by observing that, prior to squashing, we have a continuous function and a linear output neuron. (For this argument to be valid, the range of the approximated function must be properly contained in the range of the squashing function.)

3) N copies of a network, placed side by side, can compute continuous N-dimensional vector-valued functions. In other words, single-output networks generalize to multiple outputs.

4) Preprocessing of inputs via any continuous invertible function does not affect the ability of an architecture to approximate continuous functions.
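The generic architecture described above can be written in a few lines. The sketch below is a minimal illustration, not code from the paper: one hidden layer applies a squashing function s to weighted sums of the inputs, and a linear output neuron forms a weighted sum of the hidden outputs. The choice of tanh as the placeholder squashing function and all variable names are assumptions.

```python
import numpy as np

def tree_network(x, W_hidden, w_out, squash=np.tanh):
    """Generic one-hidden-layer tree network: y = sum_i w_out[i] * s(sum_n W_hidden[i, n] * x[n]).

    x        : input vector of length N
    W_hidden : I x N matrix of first-layer synaptic weights
    w_out    : length-I vector of linear output weights
    squash   : squashing function applied by each hidden neuron
    """
    hidden = squash(W_hidden @ x)   # one value per hidden neuron
    return w_out @ hidden           # linear output neuron

# Example with N = 2 inputs and I = 3 hidden neurons.
rng = np.random.default_rng(0)
print(tree_network(np.array([0.2, 0.7]), rng.normal(size=(3, 2)), rng.normal(size=3)))
```

The networks of Section IV are obtained by choosing particular squashing functions for the hidden layers of this structure.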

Fig. 1. Tree structured network for modified logistic network. Each box represents a neuron which computes a weighted sum of inputs and passes the result through the function shown in the box.

B. Identity Function

The first hypothesis of the Stone-Weierstrass theorem requires that our neural network be able to compute the identity function f(x) = 1. An obvious way to compute this function is to set all synaptic weights in the network to zero except for a unitary threshold input to the last summation unit. A more seamless way to compute this function is to use a penultimate-layer squashing function whose output value is one for an input value of zero. Then setting all synaptic weights to zero results in an output value equal to one. This eliminates the need for an extraneous threshold input. With the exception of the modified logistic network, all the networks we discuss are of this type.

C. Separability

The second hypothesis of the Stone-Weierstrass theorem requires that our neural network be able to compute functions that have different values for different points. Without this requirement, the trivial set of functions { f : f(x) = n, n ∈ ℝ } would satisfy the Stone-Weierstrass theorem. Separability is satisfied whenever a network can compute strictly monotonic functions of each input variable. All of our networks have this capability.

D. Algebraic Closure-Additive

The third hypothesis of the Stone-Weierstrass theorem requires that our neural network be able to approximate sums and products of functions. If the network can compute either of two functions, f and g, we construct a network that computes af + bg as follows: embed the networks for f and g in a larger network and, at the output neuron, scale the synapses by a and b. Since the output unit is linear, we obtain af + bg. The resulting network also has a tree structure as before.

E. Algebraic Closure-Multiplicative

Modeling the product fg of two functions is the last capability we must demonstrate before we can conclude that the Stone-Weierstrass theorem applies to a network. Because the output unit is assumed to be linear, we must represent fg as a sum of functions. Thus, the key to satisfying the Stone-Weierstrass theorem is to find functions that transform multiplication into addition so that products can be written as summations. There are at least three generic functions that accomplish this transformation: exponential functions, partial fractions, and step functions. We use these squashing functions in the next section to construct networks.

IV. NETWORKS SATISFYING THE STONE-WEIERSTRASS THEOREM

A. Decaying-Exponential Networks

Exponential functions are basic to the process of transforming multiplication into addition in several kinds of networks:

exp(x_1) exp(x_2) = exp(x_1 + x_2).   (8)

Decaying-exponential networks use this identity explicitly:

Theorem 1: Let 𝔉 be the set of all functions that can be computed by arbitrarily large decaying-exponential networks on domain D = [0, 1]^N:

𝔉 = { f(x_1, ..., x_N) = Σ_{i=1}^{I} w_i exp(-Σ_{n=1}^{N} w_in x_n) : w_i, w_in ∈ ℝ }.   (9)

Then 𝔉 is dense in Lp[D], for 1 ≤ p < ∞.

Proof: Let f and g be two functions in 𝔉 where f is as shown in (9) and g is as shown in (10)

g(x_1, ..., x_N) = Σ_{j=1}^{J} w_j exp(-Σ_{n=1}^{N} w_jn x_n).   (10)

Then the product fg may be written as a single summation as follows:

fg = Σ_{l=1}^{IJ} w_l exp(-Σ_{n=1}^{N} w_ln x_n)   (11)

where w_l = w_i w_j and w_ln = w_in + w_jn (one term for each pair (i, j)). Hence, fg is in 𝔉, and the network satisfies the Stone-Weierstrass theorem. It follows that 𝔉 is dense in Lp[D], for 1 ≤ p < ∞.

Corollary 1: In the decaying-exponential network, hidden-layer weights w_in may be restricted to integer or nonnegative integer values. (The latter case yields bounded outputs from the decaying-exponential neurons when inputs are nonnegative. This is the motivation for using negative rather than positive exponentials.) Alternatively, if a threshold input is included in the hidden layer, output layer weights w_i may be restricted to integer values. Restricting weights on both layers to integer values, however, would violate the hypothesis that the network computes all functions af + bg where a and b are real numbers. Note that adding a threshold input on the first layer is equivalent to multiplying w_i by a real constant.
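Both closure constructions are explicit enough to carry out numerically. The sketch below is an illustration with assumed variable names: it represents a decaying-exponential network of the form (9) as a pair (W, w), builds af + bg by stacking hidden layers and scaling output synapses as in Section III-D, and builds fg by pairing terms as in (11). The final checks confirm that the constructed networks agree with the pointwise sum and product.

```python
import numpy as np

def net(x, W, w):
    """Decaying-exponential network of the form (9): sum_i w[i] * exp(-W[i] . x)."""
    return w @ np.exp(-W @ x)

def weighted_sum_net(net_f, net_g, a, b):
    """Network computing a*f + b*g: embed both hidden layers, scale the output synapses."""
    (Wf, wf), (Wg, wg) = net_f, net_g
    return np.vstack([Wf, Wg]), np.concatenate([a * wf, b * wg])

def product_net(net_f, net_g):
    """Network computing f*g, as in (11): w_l = w_i * w_j and W_l = W_i + W_j."""
    (Wf, wf), (Wg, wg) = net_f, net_g
    W = (Wf[:, None, :] + Wg[None, :, :]).reshape(-1, Wf.shape[1])
    return W, np.outer(wf, wg).ravel()

rng = np.random.default_rng(0)
f = (rng.uniform(size=(3, 2)), rng.normal(size=3))   # I = 3 terms, N = 2 inputs
g = (rng.uniform(size=(4, 2)), rng.normal(size=4))
x = np.array([0.3, 0.8])
assert np.isclose(net(x, *weighted_sum_net(f, g, 2.0, -1.0)),
                  2.0 * net(x, *f) - 1.0 * net(x, *g))
assert np.isclose(net(x, *product_net(f, g)), net(x, *f) * net(x, *g))
```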

B. Fourier Networks

Fourier networks, introduced by Gallant and White [3], implement Fourier series in network form. Fourier series satisfy the Stone-Weierstrass theorem by the familiar trigonometric identity that transforms multiplication into addition:

cos(A + B) = cos A cos B - sin A sin B.   (12)

This identity is merely a disguised form of (8) obtained from complex exponentials. Hence, Fourier networks are also based on exponentials.

A clever twist suggested by Gallant and White is to chop sinusoids into half-cycle sections and add flat tails. The result is a sigmoidal cosine squasher. A sinusoid is obtained by shifting and summing cosine squashers. We shall refer to a slightly modified version of the cosine squasher as a "cosig" function.

By showing that Fourier networks mimic Fourier series, Gallant and White proved an equivalent form of the following theorem and corollary. We note that Fourier series, and hence Fourier networks, satisfy the Stone-Weierstrass theorem. The cosig network, described by Theorem 2, computes a superset of functions computed by Fourier networks.

Theorem 2: Let 𝔉 be the set of all functions that can be computed by arbitrarily large cosig networks, on domain D = [0, 1]^N:

𝔉 = { f(x_1, ..., x_N) = Σ_{i=1}^{I} w_i cosig(Σ_{n=1}^{N} w_in x_n + θ_i) : w_i, w_in, θ_i ∈ ℝ }   (13)

where cosig( ) is a cosine squasher

cosig(x) = 0 for x ≤ -1/2,  (1 + cos 2πx)/2 for -1/2 < x < 0,  1 for x ≥ 0.   (14)

Then 𝔉 is dense in Lp[D], for 1 ≤ p < ∞.

Corollary 2: The Fourier network, which is a restricted form of the cosig network, has the same approximation capabilities as the cosig network. In the Fourier network, in order to yield sinusoids corresponding to a Fourier series, first layer weights w_in are restricted to fixed integer values, and thresholds θ_i are restricted to values of form q/4 where q is an integer.

An interesting feature of a cosig function is that it has zero derivative outside the interval [-1/2, 0]. Consequently, a backpropagation learning algorithm would alter only a small subset of neurons for each training pattern presented to the network. This might be advantageous in preventing the learning of new knowledge from corrupting previously stored information.

We also observe that learning in a cosig network might yield something unlike a Fourier series. The first layer might not compute sinusoidal functions. Another possibility is that the network might compute sinusoids but alter their frequencies. This scenario is equivalent to finding a representation for a function in terms of sinusoids that are not harmonically related. These variations on Fourier series might be an efficient way to use computational hardware.
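The cosig squasher and the cosig network of Theorem 2 are straightforward to implement. In the sketch below the middle branch of the squasher is taken to be a half cycle of a raised cosine on [-1/2, 0]; the exact constants of (14) are not fully legible in this copy, so that choice is an assumption consistent with the flat tails and the zero derivative outside [-1/2, 0].

```python
import numpy as np

def cosig(x):
    """Cosine squasher: 0 for x <= -1/2, 1 for x >= 0, half-cycle raised cosine between.

    The cosine constants are an assumed reading of (14); any smooth half-cycle
    rise with flat tails has the properties used in Theorem 2.
    """
    x = np.asarray(x, dtype=float)
    rise = 0.5 * (1.0 + np.cos(2.0 * np.pi * x))   # 0 at x = -1/2, 1 at x = 0
    return np.where(x <= -0.5, 0.0, np.where(x >= 0.0, 1.0, rise))

def cosig_net(x, W, theta, w):
    """Cosig network of the form (13): sum_i w[i] * cosig(W[i] . x + theta[i])."""
    return w @ cosig(W @ x + theta)
```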
C. Modified Sigma-Pi and Polynomial Networks

A third kind of network based on exponentials is the polynomial or modified sigma-pi network. We observe that powers of x provide another example of (8)

x^n x^m = exp(n ln x) exp(m ln x) = exp[(n + m) ln x] = x^{n+m}.   (15)

Rumelhart et al. [7] have described a sigma-pi neuron that computes sums of products of selected inputs. A single layer of such neurons is incapable of approximating a function of the form f(x_1, ..., x_N) = x_1^q, where q is large, unless the network allows inputs to be multiplied by themselves. The modified sigma-pi or polynomial network, described by the next theorem, explicitly provides the capability of approximating terms of the form x_1^{q_1} ... x_N^{q_N} where q_i is any real number. As has been noted by Barron and Barron [9] and Hornik et al. [4], this kind of network satisfies the Stone-Weierstrass theorem. In the form of network presented here, synaptic weights serve as exponents for the inputs. We can achieve a similar result by placing a hidden layer of sigma (summation) neurons before the pi (product) neuron.

Theorem 3: Let 𝔉 be the set of all functions that can be computed by arbitrarily large modified sigma-pi networks on domain D = [0, 1]^N:

𝔉 = { f(x_1, ..., x_N) = Σ_{i=1}^{I} w_i Π_{n=1}^{N} x_n^{w_in} : w_i, w_in ∈ ℝ }.   (16)

Then 𝔉 is dense in Lp[D], for 1 ≤ p < ∞.

Corollary 3: In the modified sigma-pi or polynomial network, the synaptic weights w_in in the hidden layer may be restricted to integer or nonnegative integer values. A standard polynomial network is obtained in this way.
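Identity (15) also indicates how a modified sigma-pi term can be computed with summation neurons: each product term in (16) is exp(Σ_n w_in ln x_n). The sketch below illustrates this for strictly positive inputs (the logarithm is undefined at zero, so the check stays away from the boundary of the unit hypercube); it is an illustration, not code from the paper.

```python
import numpy as np

def modified_sigma_pi(x, W, w):
    """f(x) = sum_i w[i] * prod_n x[n]**W[i, n], computed as w . exp(W @ ln x).

    Assumes x > 0 componentwise so that ln x is defined.
    """
    return w @ np.exp(W @ np.log(x))

def sigma_pi_direct(x, W, w):
    """The same network evaluated directly from the products in (16)."""
    return w @ np.prod(x[None, :] ** W, axis=1)

x = np.array([0.3, 0.9])
W = np.array([[2.0, 0.5], [1.0, 3.0]])
w = np.array([1.0, -0.5])
assert np.isclose(modified_sigma_pi(x, W, w), sigma_pi_direct(x, W, w))
```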
D. Exponentiated-Function Networks

A fourth kind of network based on exponentials is the exponentiated-function network. This network is obtained by preprocessing inputs to a modified sigma-pi network. If g is the preprocessing function, the first layer of the sigma-pi network computes polynomial functions of the form g(x_1)^{n_1} ... g(x_N)^{n_N}.


Theorem 4: Let 𝔉 be the set of all functions that can be computed by arbitrarily large exponentiated-function networks, on domain D = [0, 1]^N:

𝔉 = { f(x_1, ..., x_N) = Σ_{i=1}^{I} w_i Π_{n=1}^{N} g(x_n)^{w_in} : g ∈ C[0, 1] invertible, w_i, w_in ∈ ℝ }.   (17)

Then 𝔉 is dense in Lp[D], for 1 ≤ p < ∞.

Corollary 4: In the exponentiated-function network, the synaptic weights w_in in the hidden layer may be restricted to integer or nonnegative integer values.

We note that if f is a logistic function or similar sigmoid, then f^n is also a sigmoid. Thus, if we wish to compute a single-variable function, we may use a network of neurons with special sigmoidal squashing functions and satisfy the Stone-Weierstrass theorem.

E. Partial Fraction Networks

Partial fractions are an example of nonexponential functions that translate multiplication into addition

(1/(1 + w_i x)) (1/(1 + w_j x)) = (w_i/(w_i - w_j)) (1/(1 + w_i x)) + (w_j/(w_j - w_i)) (1/(1 + w_j x)).   (18)

One might suppose that a similar identity holds when w_i x is replaced by Σ_{n=1}^{N} w_in x_n. With this change, however, the product no longer translates into a finite sum. Attempts to adapt the partial fraction method to multiple variables fail, and one may conclude that partial fractions afford a clever way of computing x^n but not x^n y^p. Nevertheless, we could use partial fractions in a network that computes functions of a single variable.
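The algebra behind (18) can be checked symbolically. The denominators of the form 1 + w_i x used below follow the reconstruction of (18) above, which is an assumed reading of a partly legible equation; the check also reflects the text's point that the trick works in a single variable only.

```python
import sympy as sp

x, wi, wj = sp.symbols('x w_i w_j', positive=True)

lhs = (1 / (1 + wi * x)) * (1 / (1 + wj * x))
rhs = (wi / (wi - wj)) / (1 + wi * x) + (wj / (wj - wi)) / (1 + wj * x)

# The product of two partial-fraction terms is again a sum of such terms.
assert sp.simplify(lhs - rhs) == 0

# With w_i*x replaced by a sum over several input variables, no finite
# expansion of this kind exists, which is why the partial-fraction idea is
# combined with exponentials in the modified logistic network that follows.
```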
F. Modified Logistic Network

Combining the partial fraction idea with exponentials produces a network that computes arbitrary functions of more than one variable. Furthermore, the combination of partial fractions and exponentials is similar to a logistic function. The resulting modified logistic network, shown in Fig. 1, obeys the Stone-Weierstrass theorem:

Theorem 5: Let 𝔉 be the set of all functions that can be computed by arbitrarily large modified logistic networks, on domain D = [0, 1]^N:

𝔉 = { f(x_1, ..., x_N) = Σ_{i=1}^{I} w_i / (1 + Σ_{k=1}^{K} exp(-Σ_{n=1}^{N} w_ikn x_n)) : w_i, w_ikn ∈ ℝ }.   (20)

Then 𝔉 is dense in Lp[D], for 1 ≤ p < ∞.

Corollary 5: In the modified logistic network, the synaptic weights w_ikn in the first hidden layer may be restricted to integer or nonnegative integer values. Also, threshold weights may be added in either the output layer or first hidden layer, and nonnegative real or integer weights w_ik may be included in the penultimate layer.

The set of functions computed by a modified logistic network (with logistic squashing function appended at the output) is a superset of the set of functions computed by a logistic network. Furthermore, the backward error propagation weight update rules for the modified logistic network are identical to those of a standard logistic network. Hence, the modified logistic network may be substituted for a logistic network when one wishes to take advantage of multiplicative closure.
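Equation (20) above is itself a reconstruction of a partly legible equation, so the sketch below should be read the same way: each output term is assumed to be w_i divided by one plus a sum of decaying exponentials, which reduces to an ordinary logistic unit when K = 1.

```python
import numpy as np

def modified_logistic_net(x, W, w):
    """Assumed form of (20): f(x) = sum_i w[i] / (1 + sum_k exp(-W[i, k] . x)).

    W has shape (I, K, N); with K = 1 each term is an ordinary logistic unit,
    so a standard one-hidden-layer logistic network is the special case K = 1.
    """
    return w @ (1.0 / (1.0 + np.exp(-W @ x).sum(axis=1)))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 1, 2))   # I = 3 output terms, K = 1, N = 2 inputs
w = rng.normal(size=3)
print(modified_logistic_net(np.array([0.1, 0.4]), W, w))
```

Multiplying two terms of this assumed form gives one plus a larger sum of exponentials in the denominator, which is again a term of the same form; that is the multiplicative closure used in Theorem 5.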
G. Step Functions and Perceptron Networks

In the 1940's, McCulloch and Pitts [10] showed that a network of neurons with stepped squashing functions, later called perceptrons, could compute any Boolean logic function. Although its input may be real valued, a network of perceptrons computes only binary functions. We may, therefore, use an AND gate to compute the product of two such functions. Since a perceptron with properly chosen threshold and synaptic weights is an AND gate, we can construct a perceptron network that computes fg from networks that compute f and g.

If we add a linear output neuron to a perceptron network, we have a network that computes step functions (meaning stair-step functions with multiple steps). Lippmann [11] offers a heuristic description, paraphrased here, of how such a network with two hidden layers can approximate an arbitrary function f.

First Hidden Layer: Computes hyperplanes, each of which divides the input space into half-spaces assigned values of zero and one. The hyperplanes define boundaries of convex regions in the input space. Neurons are added to the first layer until the value of f is nearly constant over each convex region.

Second Hidden Layer: Computes AND functions that determine which convex region an input point lies in.

Output Layer: Multiplies each output from the second hidden layer by the value of f in the corresponding convex region. In other words, the output of the second hidden layer identifies the convex region, and the synaptic weight in the output layer is the value of f.

In this heuristic argument lies a proof that, if two networks can approximate f and g, we can construct a network that computes fg.


First Hidden Layer: Includes the hyperplanes for both f and g.

Second Hidden Layer: Computes AND functions for every intersection of convex regions from f and g.

Output Layer: Multiplies each output from the second hidden layer by the value of fg in the corresponding convex region.
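The AND gates in the second hidden layer are ordinary perceptrons. A minimal sketch, assuming the usual Heaviside threshold unit: with unit weights and any threshold between 1 and 2, a single perceptron computes the AND of two 0/1 inputs, so the product of two binary-valued networks is obtained by feeding their outputs through one more perceptron.

```python
import numpy as np

def perceptron(x, w, theta):
    """McCulloch-Pitts threshold unit: 1 if w . x >= theta, else 0."""
    return float(np.dot(w, x) >= theta)

def and_gate(a, b):
    """AND as a perceptron: unit weights, threshold 1.5 (any value in (1, 2] works)."""
    return perceptron(np.array([a, b]), np.array([1.0, 1.0]), 1.5)

# For binary-valued networks f and g, fg(x) = AND(f(x), g(x)).
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        assert and_gate(a, b) == a * b
```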
We cannot apply the Stone-Weierstrass theorem to the aforementioned network because of a mathematical difficulty: the perceptron network computes step functions rather than continuous functions. Step functions, however, are dense in the set of measurable functions by the following lemma [6].

Lemma IV (Step Functions): If f is an almost everywhere bounded and measurable function on a compact space, then given ε > 0, there is a step function h such that

|f - h| < ε   (21)

except on a set of measure less than ε.

Intuitively, the idea here is that any continuous function can be accurately approximated by a step function if the steps are sufficiently small. This procedure introduces yet another source of error for finite networks. Similar errors occur in digital-to-analog converters that produce stair-step approximations of bandlimited functions. In this case, suitable low-pass filtering removes the error. The same procedure removes errors for a perceptron network approximating a bandlimited function.

Although the Stone-Weierstrass theorem does not apply to our network, there is an advantage to having shown that we satisfy algebraic closure: we can explicitly construct networks that compute certain polynomial expressions. This is partial compensation for the lack of a multilayer perceptron learning algorithm [12].

V. CONCLUSION

In the first part of this paper, we observed that errors which approach zero in the limit of large networks may be significant in finite networks. Two possible types of errors are overshoot and slow convergence. A third type of error may arise in two-layer networks: the network may be unable to efficiently approximate the product of two functions.

These difficulties are partially resolved by requiring networks to satisfy the hypotheses of the Stone-Weierstrass theorem, as do the networks presented in this paper. These networks are derived by transforming multiplication into addition. The transformation is accomplished via three types of squashing functions: exponentials, partial fractions, and step functions. Networks based on these functions satisfy the algebraic closure hypotheses of the Stone-Weierstrass theorem, allowing us to explicitly construct networks that compute polynomial expressions of functions computed by smaller networks.

A combination of partial fractions and exponentials produces the modified logistic network. This network is similar to a logistic network but satisfies the Stone-Weierstrass theorem directly. The modified logistic network is suggested as an alternative to the logistic network in situations where one wishes to compute polynomial expressions.

REFERENCES

[1] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Math. Contr., Signals, Syst., vol. 2, pp. 303-314, 1989.
[2] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.
[3] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in Proc. IEEE Int. Conf. Neural Networks, San Diego, CA, July 24-27, 1988, vol. I, pp. 657-664.
[4] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[5] L. V. Kantorovich and G. P. Akilov, Functional Analysis, 2nd ed. Oxford: Pergamon, 1982.
[6] H. L. Royden, Real Analysis, 2nd ed. New York: Macmillan, 1968.
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: M.I.T. Press, 1986.
[8] J. J. Hopfield, "Neurons with graded response have collective computational properties like those of two-state neurons," Proc. Nat. Acad. Sci. USA, vol. 81, pp. 3088-3092, 1984.
[9] A. Barron and R. Barron, "Statistical learning networks: A unifying view," presented at the Symp. Interface: Statistics Comput. Sci., Reston, VA, 1988.
[10] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bull. Math. Biophys., vol. 5, pp. 115-133, 1943.
[11] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., pp. 4-22, Apr. 1987.
[12] M. Minsky and S. Papert, Perceptrons. Cambridge, MA: M.I.T. Press, 1969.
