
5.6 Stochastic Units

Another generalization is from our deterministic units to stochastic units S_i governed by (2.48):

Prob(S_i^\mu = \pm 1) = \frac{1}{1 + \exp(\mp 2\beta h_i^\mu)}   (5.54)

with

h_i^\mu = \sum_k w_{ik} \xi_k^\mu   (5.55)

as before. This leads to

\langle S_i^\mu \rangle = \tanh\Big(\beta \sum_k w_{ik} \xi_k^\mu\Big)   (5.56)

just as in (2.42). In the context of a simulation we can use (5.56) to calculate \langle S_i^\mu \rangle, whereas in a real stochastic network we would find it by averaging S_i for a while, updating randomly chosen units according to (5.54). Either way, we then use \langle S_i^\mu \rangle as the basis of a weight change

\Delta w_{ik} = \eta \sum_\mu \delta_i^\mu \xi_k^\mu   (5.57)

where

\delta_i^\mu = \zeta_i^\mu - \langle S_i^\mu \rangle .   (5.58)

This is just the average over outcomes of the changes we would have made on the basis of individual outcomes using the ordinary delta rule. We will find it particularly important when we discuss reinforcement learning in Section 7.4.
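The statistics of a single stochastic unit are easy to check by direct simulation. The sketch below (not from the text; the values of beta and h are arbitrary illustrative choices) samples a unit according to the probability rule (5.54) and compares the empirical mean with the tanh expression (5.56):

```python
import math
import random

random.seed(0)

beta = 1.5   # inverse temperature (arbitrary illustrative value)
h = 0.4      # a fixed local field h_i^mu (arbitrary illustrative value)

# (5.54): Prob(S = +1) = 1 / (1 + exp(-2*beta*h))
p_plus = 1.0 / (1.0 + math.exp(-2.0 * beta * h))

# Empirical average of S over many independent stochastic updates
n_samples = 200_000
total = 0
for _ in range(n_samples):
    s = 1 if random.random() < p_plus else -1
    total += s
empirical_mean = total / n_samples

# (5.56): the exact average is tanh(beta * h)
exact_mean = math.tanh(beta * h)
print(empirical_mean, exact_mean)
```

With 200,000 samples the empirical mean agrees with tanh(beta*h) to a few parts in a thousand.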

It is interesting to prove that this rule always decreases the average error given by the usual quadratic measure

E = \frac{1}{2} \sum_{i\mu} (\zeta_i^\mu - S_i^\mu)^2 .   (5.59)

Since we are assuming output units and patterns are \pm 1, this is just twice the total number of bits in error, and can also be written

E = \sum_{i\mu} (1 - \zeta_i^\mu S_i^\mu) .   (5.60)

Thus the average error in the stochastic network is

\langle E \rangle = \sum_{i\mu} \big(1 - \zeta_i^\mu \langle S_i^\mu \rangle\big)
= \sum_{i\mu} \Big[1 - \zeta_i^\mu \tanh\Big(\beta \sum_k w_{ik} \xi_k^\mu\Big)\Big] .   (5.61)


The change in \langle E \rangle in one cycle of weight updatings is thus

\Delta\langle E \rangle = \sum_{i\mu k} \frac{\partial \langle E \rangle}{\partial w_{ik}}\, \eta\, \delta_i^\mu \xi_k^\mu
= -\sum_{i\mu k} \eta \big[1 - \zeta_i^\mu \tanh(\beta h_i^\mu)\big]\, \beta\, \mathrm{sech}^2(\beta h_i^\mu)   (5.62)

using d\tanh(x)/dx = \mathrm{sech}^2 x,^6 and (\xi_k^\mu)^2 = (\zeta_i^\mu)^2 = 1. The result (5.62) is clearly always negative (recall |\tanh(x)| < 1), so the procedure always improves the average error.
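The claim that the averaged delta rule always decreases \langle E \rangle can be illustrated numerically. The sketch below (not from the text; the sizes, learning rate, and temperature are arbitrary small illustrative choices) runs batch updates on a single output unit and tracks the average error (5.61):

```python
import math
import random

random.seed(1)

N, p = 10, 5        # input size and number of patterns (arbitrary small values)
beta, eta = 1.0, 0.01

xi = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(p)]
zeta = [random.choice([-1, 1]) for _ in range(p)]
w = [random.uniform(-0.1, 0.1) for _ in range(N)]

def avg_error():
    # (5.61): <E> = sum_mu [1 - zeta^mu tanh(beta h^mu)]  (single output unit)
    E = 0.0
    for mu in range(p):
        h = sum(w[k] * xi[mu][k] for k in range(N))
        E += 1.0 - zeta[mu] * math.tanh(beta * h)
    return E

errors = [avg_error()]
for _ in range(50):
    # averaged delta rule (5.57)-(5.58): delta^mu = zeta^mu - tanh(beta h^mu)
    deltas = []
    for mu in range(p):
        h = sum(w[k] * xi[mu][k] for k in range(N))
        deltas.append(zeta[mu] - math.tanh(beta * h))
    for k in range(N):
        w[k] += eta * sum(deltas[mu] * xi[mu][k] for mu in range(p))
    errors.append(avg_error())

print(errors[0], errors[-1])
```

For a small enough learning rate the recorded error decreases over the run, in line with the sign of (5.62); note that \langle E \rangle is nonnegative by construction since |\tanh| < 1.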

5.7 Capacity of the Simple Perceptron *

In the case of the associative network in Chapter 2 we were able to find the capacity p_max of a network of N units; for random patterns we found p_max = 0.138N for large N if we used the standard Hebb rule. If we tried to store p patterns with p > p_max the performance became terrible.

Similar questions can be asked for simple perceptrons:

How many random input-output pairs can we expect to store reliably in a

network of given size?

How many of these can we expect to learn using a particular learning rule?

The answer to the second question may well be smaller than the first (e.g., for

nonlinear units), but is presently unknown in general. The first question, which

this section deals with, gives the maximum capacity that any learning algorithm

can hope to achieve.

For continuous-valued units (linear or nonlinear) we already know the answer,

because the condition is simply linear independence. If we choose p random patterns, then they will be linearly independent if p \le N (except for cases with very small probability). So the capacity is p_max = N.
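The p = N case can be made concrete: for continuous-valued random patterns the p \times N system \sum_k w_k \xi_k^\mu = \zeta^\mu is almost surely invertible when p = N, so the targets can be hit exactly. A small stdlib-Python sketch (not from the text; sizes are arbitrary) solves the square system by Gaussian elimination:

```python
import random

random.seed(2)

N = 8
p = N  # at the capacity p_max = N for continuous-valued units

# p random continuous patterns: almost surely linearly independent
xi = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(p)]
zeta = [random.choice([-1.0, 1.0]) for _ in range(p)]

# Solve xi . w = zeta by Gauss-Jordan elimination with partial pivoting
A = [row[:] + [zeta[mu]] for mu, row in enumerate(xi)]  # augmented matrix
for col in range(N):
    pivot = max(range(col, N), key=lambda r: abs(A[r][col]))
    A[col], A[pivot] = A[pivot], A[col]
    for r in range(N):
        if r != col:
            f = A[r][col] / A[col][col]
            for c in range(col, N + 1):
                A[r][c] -= f * A[col][c]
w = [A[r][N] / A[r][r] for r in range(N)]

# Verify that every pattern is reproduced exactly
residual = max(abs(sum(w[k] * xi[mu][k] for k in range(N)) - zeta[mu])
               for mu in range(p))
print(residual)
```

The residual is at machine-precision level, confirming that any target assignment on p \le N linearly independent patterns can be stored exactly.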

The case of threshold units depends on linear separability, which is harder to deal with. The answer for random continuous-valued inputs was derived by Cover [1965] (see also Mitchison and Durbin [1989]) and is remarkably simple:

p_max = 2N .   (5.63)

As usual N is the number of input units, and is presumed large. The number of output units must be small and fixed (independent of N). Equation (5.63) is strictly valid only in the N \to \infty limit.

^6 The function \mathrm{sech}^2 x = 1 - \tanh^2 x is a bell-shaped curve with a peak at x = 0.

112 FIVE Simple Perceptrons

FIGURE 5.11 The function C(p, N)/2^p given by (5.67) plotted versus p/N for N = 5, 20, and 100.

The rest of this section is concerned with proving (5.63), and may be omitted

on first reading. We follow the approach of Cover [1965]. A more general (but much

more difficult) method for answering this sort of question was given by Gardner

[1987] and is discussed in Chapter 10.

We consider a perceptron with N continuous-valued inputs and one \pm 1 output unit, using the deterministic threshold limit. The extension to several output units is trivial since output units and their connections are independent; the result (5.63) applies separately to each. For convenience we take the thresholds to be zero, but they could be reinserted at the expense of one extra input unit, as in (5.2).

In (5.11) we showed that the perceptron divides the N-dimensional input space into two regions separated by an (N-1)-dimensional hyperplane. For the case of zero threshold this plane goes through the origin. All the points on one side give an output of +1 and all those on the other side give -1. Let us think of these as red (+1) and black (-1) points respectively. Then the question we need to answer is: how many points can we expect to put randomly in an N-dimensional space, some red and some black, and then find a hyperplane through the origin that divides the red points from the black points?

Let us consider a slightly different question. For a given set of p randomly placed points in an N-dimensional space, for how many out of the 2^p possible red and black colorings of the points can we find a hyperplane dividing red from black? Call the answer C(p, N). For p small we expect C(p, N) = 2^p, because we should be able to find a suitable hyperplane for any possible coloring; consider N = p = 2 for example. For p large we expect C(p, N) to drop well below 2^p, so an arbitrarily chosen coloring will not possess a dividing hyperplane. The transition between these regimes turns out to be sharp for large N, and gives us p_max.
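In two dimensions the count C(p, N) can be checked by brute force. The sketch below (stdlib Python, not from the text; p = 6 is an arbitrary choice) uses the fact that a coloring is separable by a line through the origin exactly when the vectors \zeta_i x_i all fit into an open half-plane, i.e. when the largest angular gap between them exceeds \pi:

```python
import itertools
import math
import random

random.seed(3)

N, p = 2, 6
# Random continuous points are in general position with probability 1
pts = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(p)]

def separable(coloring):
    # The vectors z_i = color_i * x_i must fit in an open half-plane through
    # the origin, i.e. the largest circular gap in their angles must exceed pi.
    ang = sorted(math.atan2(c * y, c * x) for (x, y), c in zip(pts, coloring))
    gaps = [ang[(i + 1) % p] - ang[i] for i in range(p)]
    gaps[-1] += 2 * math.pi  # wrap-around gap
    return max(gaps) > math.pi

count = sum(separable(col) for col in itertools.product([1, -1], repeat=p))

# C(p, 2) = 2*[binom(p-1,0) + binom(p-1,1)] = 2p from (5.67)
print(count, 2 * p)
```

The exhaustive count over all 2^p = 64 colorings comes out exactly 2p = 12, matching C(p, 2) for points in general position.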

We will calculate C(p, N) shortly, but let us first examine the result. Figure 5.11 shows a graph of C(p, N)/2^p against p/N for N = 5, 20, and 100. Our expectations for small and large p are fulfilled, and we see that the transition occurs quite rapidly in the neighborhood of p = 2N, in agreement with (5.63). As N is made larger and


FIGURE 5.12 Finding separating hyperplanes constrained to go through a point P as well as the origin O is equivalent to projecting onto one lower dimension.

larger the transition becomes more and more sharp. Thus (5.63) is justified if we can demonstrate that Fig. 5.11 is correct.

A random distribution of points is not actually necessary.^7 All that we need is that the points be in general position. As discussed on page 97, this means (for the no threshold case) that all subsets of N (or fewer) points must be linearly independent. As an example consider N = 2: a set of p points in a two-dimensional plane is in general position if no two lie on the same line through the origin. A set of points chosen from a continuous random distribution will obviously be in general position, except for coincidences that have zero probability.

We can now calculate C(p, N) by induction. Let us call a coloring that can be divided by a hyperplane a dichotomy. Suppose that we have a set of p points, and add a new point P. For those previous dichotomies where the dividing hyperplane could have been drawn through point P, there will be two new dichotomies, one with P red and one with it black. This is because, when the points are in general position, any hyperplane through P can be shifted infinitesimally to go either side of P, without changing the side of any of the other p points. For the rest of the dichotomies only one color of point P will work; there will be one new dichotomy for each old one. Thus

C(p+1, N) = C(p, N) + D   (5.64)

where D is the number of the C(p, N) dichotomies that could have had the dividing hyperplane drawn through P as well as the origin O. But this number is simply C(p, N-1), because constraining the hyperplanes to go through a particular point P makes the problem effectively (N-1)-dimensional; as illustrated in Fig. 5.12, we can project the whole problem onto an (N-1)-dimensional plane

^7 Nor is it well defined unless a distribution function is specified.


perpendicular to OP, since any displacement of a point along the OP direction cannot affect which side of any hyperplane containing OP it lies on.

We thereby obtain the recursion relation

C(p+1, N) = C(p, N) + C(p, N-1) .   (5.65)

Iterating this equation for p, p-1, p-2, \ldots, 1 yields

C(p, N) = \binom{p-1}{0} C(1, N) + \binom{p-1}{1} C(1, N-1) + \cdots + \binom{p-1}{p-1} C(1, N-p+1) .   (5.66)

For p < N this is easy to handle, because C(1, N) = 2 for all N \ge 1; one point can be colored red or black. For p > N the second argument of C becomes 0 or negative in some terms, but these terms can be eliminated by taking C(p, N) = 0 for N \le 0. It is easy to check that this choice is consistent with the recursion relation (5.65), and with C(p, 1) = 2 (in one dimension the only "hyperplane" is a point at the origin, allowing two dichotomies). Thus (5.66) makes sense for all values of p and N and can be written as

C(p, N) = 2 \sum_{i=0}^{N-1} \binom{p-1}{i}   (5.67)

if we use the standard convention that \binom{n}{m} = 0 for m > n. Equation (5.67) was used to plot Fig. 5.11, thus completing the demonstration.

It is actually easy to show from the symmetry \binom{n}{m} = \binom{n}{n-m} of binomial coefficients that

C(2N, N) = \frac{1}{2}\, 2^{2N}   (5.68)

so the curve goes through 1/2 at p = 2N. To show analytically that the transition sharpens up for increasing N, one can appeal to the large-N Gaussian limit of the binomial coefficients, which leads to

\frac{C(p, N)}{2^p} \approx \frac{1}{2}\Big[1 + \mathrm{erf}\Big(\frac{2N - p}{\sqrt{2p}}\Big)\Big]   (5.69)

for large N.
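The combinatorial identities above are easy to verify mechanically. A minimal Python check (not from the text) confirms the closed form (5.67), the recursion (5.65), the value C(p, N) = 2^p for p \le N, and the half-way point (5.68):

```python
from math import comb

def C(p, N):
    # (5.67): C(p, N) = 2 * sum_{i=0}^{N-1} binom(p-1, i)
    return 2 * sum(comb(p - 1, i) for i in range(N))

# Recursion (5.65): C(p+1, N) = C(p, N) + C(p, N-1)
ok_recursion = all(C(p + 1, N) == C(p, N) + C(p, N - 1)
                   for p in range(1, 30) for N in range(2, 20))

# C(p, N) = 2^p for p <= N, and C(2N, N) = 2^(2N)/2 as in (5.68)
print(ok_recursion, C(10, 10) == 2 ** 10, 2 * C(20, 10) == 2 ** 20)
```

All three checks pass over the tested range, consistent with the derivation by induction.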

It is worth noting that C(p, N) = 2^p if p \le N (this is shown on page 155). So any coloring of up to N points is linearly separable, provided only that the points are in general position. For N or fewer points general position is equivalent to linear independence, so the sufficient conditions for a solution are exactly the same in the threshold and continuous-valued networks. But this is not true, of course, for p > N.

SIX

Multi-Layer Networks

The limitations of a simple perceptron do not apply to feed-forward networks with intermediate or "hidden" layers between the input and output layer. In fact, as we will see later, a network with just one hidden layer can represent any Boolean function (including for example XOR). Although the greater power of multi-layer networks was realized long ago, it was only recently shown how to make them learn a particular function, using "back-propagation" or other methods. This absence of a learning rule, together with the demonstration by Minsky and Papert [1969] that only linearly separable functions could be represented by simple perceptrons, led to a waning of interest in layered networks until recently.

Throughout this chapter, like the previous one, we consider only feed-forward

networks. More general networks are discussed in the next chapter.

6.1 Back-Propagation

The back-propagation algorithm is central to much current work on learning in neural networks. It was invented independently several times, by Bryson and Ho [1969], Werbos [1974], Parker [1985], and Rumelhart et al. [1986a, b]. A closely related algorithm was proposed by Le Cun [1985]. The algorithm gives a prescription for changing the weights w_pq in any feed-forward network to learn a training set of input-output pairs. The basis is simply gradient descent, as described in Sections 5.4 (linear) and 5.5 (nonlinear) for a simple perceptron.
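The gradient-descent idea can be sketched in a few lines of code. The example below is a minimal illustration (not the book's notation or pseudocode): a two-layer network of tanh units trained online on XOR, a function no simple perceptron can represent; the layer sizes, learning rate, and epoch count are arbitrary choices.

```python
import math
import random

random.seed(4)

# XOR training set with +-1 coding: not linearly separable
data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

n_in, n_hid = 2, 3
w = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hid)]  # input -> hidden
W = [random.uniform(-0.5, 0.5) for _ in range(n_hid)]                          # hidden -> output
eta = 0.2

def forward(x):
    V = [math.tanh(sum(w[j][k] * x[k] for k in range(n_in))) for j in range(n_hid)]
    O = math.tanh(sum(W[j] * V[j] for j in range(n_hid)))
    return V, O

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

E0 = total_error()
for _ in range(2000):
    for x, t in data:
        V, O = forward(x)
        dO = (t - O) * (1 - O * O)                                   # output delta: error times tanh'
        dV = [dO * W[j] * (1 - V[j] * V[j]) for j in range(n_hid)]   # deltas propagated back
        for j in range(n_hid):
            W[j] += eta * dO * V[j]
            for k in range(n_in):
                w[j][k] += eta * dV[j] * x[k]
E1 = total_error()
print(E0, E1)
```

The error deltas are computed at the output and propagated backwards through the hidden-to-output weights, which is the essence of the algorithm; the squared error drops substantially over training.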

We consider a two-layer network such as that illustrated by Fig. 6.1. Our notational conventions are shown in the figure: output units are denoted by O_i, hidden units by V_j, and input terminals by \xi_k. There are connections w_jk from the


264 TEN Formal Statistical Mechanics of Neural Networks

Only the second of these, which comes from \partial f/\partial r = 0, is a little tricky, needing the identity

\int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\, z\, f(z) = \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\, f'(z)   (10.75)

for any bounded function f(z).

Equation (10.72) is just like (10.22) for the \alpha = 0 case, except for the addition of the effective Gaussian random field term, which represents the crosstalk from the uncondensed patterns. For \alpha = 0 it reduces directly to (10.22). Equation (10.73) is the obvious equation for the mean square magnetization. Equation (10.74) gives the (nontrivial) relation between q and the mean square value of the random field, and is identical to (2.67).

For memory states, i.e., m-vectors of the form (m, 0, 0, \ldots), the saddle-point equations (10.72) and (10.73) become simply

m = \big\langle \tanh \beta(\sqrt{\alpha r}\, z + m) \big\rangle_z   (10.76)

q = \big\langle \tanh^2 \beta(\sqrt{\alpha r}\, z + m) \big\rangle_z   (10.77)

where the averaging is solely over the Gaussian random field. These are identical to (2.65) and (2.68) that we found in the heuristic theory of Section 2.5. Their solution, and the consequent phase diagram of the model in \alpha-T space, can be studied as we sketched there. Spurious states, such as the symmetric combinations (10.26), can also be analyzed at finite \alpha using the full equations (10.72)-(10.74).

There are several subtle points in this replica method calculation:

We started by calculating \langle\langle Z^n \rangle\rangle for integer n but eventually interpreted n as a real number and took the n \to 0 limit. This is not the only possible continuation from the integers to the reals; we might for example have added a function like \sin \pi n / n.

We treated the order of limits and averages in a cavalier fashion, and in particular reversed the order of n \to 0 and N \to \infty.

We made the replica symmetry approximation (10.60)-(10.62), which was really only based on intuition.

Experience has shown that the replica method usually does work, but there are few

rigorous mathematical results. It can be shown for the Sherrington-Kirkpatrick spin

glass model, and probably for this one too, that the reversal of limits is justified,

and that the replica symmetry assumption is correct for integer n [van Hemmen

and Palmer, 1979]. But for some problems, including the spin glass, the method

sometimes gives the wrong answer. This can be blamed on the integer-to-real con-

tinuation, and can be corrected by replica symmetry breaking, in which the

replica symmetry assumption is replaced by a more complicated assumption. Then

the natural continuation seems to give the right answer.

For the present problem Amit et al. showed that the replica symmetric approx-

imation is valid except at very low temperatures where there is replica symmetry

breaking. This seems to lead only to very small corrections in the results. However,



the predicted change in the capacity (\alpha_c becomes 0.144 instead of 0.138) can be detected in numerical simulations [Crisanti et al., 1986].

10.2 Gardner Theory of the Connections

The second classic statistical mechanical tour de force in neural networks is the computation by Gardner [1987, 1988] of the capacity of a simple perceptron. The calculation applies in the same form to a Hopfield-like recurrent network for auto-associative memory if the connections are allowed to be asymmetric.

The theory is very general because it is not specific to any particular algorithm for choosing the connections. On the other hand, it does not provide us with a specific set of connections even when it has told us that such a set exists. As discussed in Section 6.5, the basic idea is to consider the fraction of weight space that implements a particular input-output function; recall that weight space is the space of all possible connection weights w = {w_ij}.

In Section 6.5 we used relatively simple methods to calculate weight space volumes. The present approach is more complicated, though often more powerful. We use many of the techniques introduced in the previous section, including replicas, auxiliary variables, and the saddle-point method.

We consider a simple perceptron with N binary inputs \xi_j = \pm 1 and M binary threshold units that compute the outputs

O_i = \mathrm{sgn}\Big(N^{-1/2} \sum_j w_{ij} \xi_j\Big) .   (10.78)

The N^{-1/2} factor will be discussed shortly. Given a desired set of associations \xi_j^\mu \mapsto \zeta_i^\mu for \mu = 1, 2, \ldots, p, we want to know in what fraction of weight space the equations

\zeta_i^\mu = \mathrm{sgn}\Big(N^{-1/2} \sum_j w_{ij} \xi_j^\mu\Big)   (10.79)

are satisfied (for all i and \mu). Or equivalently, in what fraction of this space are the inequalities

\zeta_i^\mu N^{-1/2} \sum_j w_{ij} \xi_j^\mu > 0   (10.80)

true?

It is also interesting to ask the corresponding question if the condition (10.80) is strengthened so there is a margin of size \kappa > 0 as in (5.20):

\zeta_i^\mu N^{-1/2} \sum_j w_{ij} \xi_j^\mu > \kappa .   (10.81)

A nonzero \kappa guarantees correction of small errors in the input pattern.

266 TEN Formal Statistical Mechanics of Neural Networks

Until (10.81) the factor N^{-1/2} was irrelevant. We include it because it is convenient to work with w_{ij}'s of order unity, and a sum of N such terms of random sign gives a result of order N^{1/2}. Thus the explicit factor N^{-1/2} makes the left-hand side of (10.81) of order unity, and it is appropriate to think about \kappa's that are independent of N. Of course this is only appropriate if the terms in the sum over j are really of random sign, but that turns out to be the case of most interest here. On the other hand, in Chapter 5 we were mainly dealing with a correlated sum, and so used a factor N instead of N^{1/2}.

For a recurrent autoassociative network, the same equations with \zeta_j^\mu = \xi_j^\mu give the condition for the stability of the patterns, and a nonzero \kappa ensures finite basins of attraction.

The Capacity of a Simple Perceptron

The fundamental quantity that we want to calculate is the volume fraction of weight space in which (10.81) is satisfied. Adding an additional constraint

\sum_j w_{ij}^2 = N   (10.82)

for each unit i, so as to keep the weights within bounds, this fraction is

V = \frac{\int dw\, \big(\prod_{i\mu} \Theta(\zeta_i^\mu N^{-1/2} \sum_j w_{ij}\xi_j^\mu - \kappa)\big)\, \prod_i \delta(\sum_j w_{ij}^2 - N)}{\int dw\, \prod_i \delta(\sum_j w_{ij}^2 - N)} .   (10.83)

Here we enforce the constraint (10.82) with the delta functions, and restrict the numerator to regions satisfying (10.81) with the step functions \Theta(x).
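For small systems the volume fraction V of (10.83) can be estimated directly by Monte Carlo. The sketch below (not from the text; sizes are arbitrary) takes the \kappa = 0 case: sampling weight directions isotropically samples the sphere (10.82) uniformly, and each sampled w satisfies the sign conditions (10.80) for exactly one target coloring, so the estimated fractions over all colorings must sum to 1:

```python
import itertools
import random

random.seed(5)

N, p = 12, 3
xi = [[random.choice([-1, 1]) for _ in range(N)] for _ in range(p)]

# Isotropic Gaussian vectors give uniformly distributed directions, and only
# the direction of w matters for the kappa = 0 conditions (10.80).
n_samples = 20000
counts = {col: 0 for col in itertools.product([1, -1], repeat=p)}
for _ in range(n_samples):
    w = [random.gauss(0, 1) for _ in range(N)]
    signs = tuple(1 if sum(w[j] * xi[mu][j] for j in range(N)) > 0 else -1
                  for mu in range(p))
    counts[signs] += 1

fractions = {col: c / n_samples for col, c in counts.items()}
print(sorted(fractions.values()))
```

The individual fractions vary from coloring to coloring (the quantity the Gardner calculation averages over), while their sum is exactly 1 by construction.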

The expression (10.83) is rather like a statistical-mechanical partition function (10.1), but the conventional exponential weight is replaced by an all-or-nothing one given by the step functions. It is also important to recognize that here it is the weights w_{ij} that are the fundamental statistical-mechanical variables, not the activations of the units.

We observe immediately that (10.83) factors into a product of identical terms, one for each i. Therefore we can drop the index i altogether, reducing without loss of generality the calculation to the case of a single output unit. The corresponding step also works for the recurrent network if w_{ij} and w_{ji} are independent, but the calculation cannot be done this way if a symmetry constraint w_{ij} = w_{ji} is imposed.

In the same way as for Z in the previous section, the statistically relevant quantity is the average over the pattern distribution, not of V itself, but of its logarithm. Therefore, we introduce replicas again and compute the average

\langle\langle V^n \rangle\rangle = \Big\langle\Big\langle \frac{\int \prod_\alpha dw^\alpha\, \prod_{\mu\alpha} \Theta\big(\zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu - \kappa\big)\, \prod_\alpha \delta\big(\sum_j (w_j^\alpha)^2 - N\big)}{\int \prod_\alpha dw^\alpha\, \prod_\alpha \delta\big(\sum_j (w_j^\alpha)^2 - N\big)} \Big\rangle\Big\rangle   (10.84)

where the integrals are over all the w^\alpha's and the average \langle\langle \cdots \rangle\rangle is over the \zeta^\mu's and the \xi_j^\mu's.

To proceed we use the same kinds of tricks as in the previous section. First we work on the step functions, using the integral representation

\Theta(z - \kappa) = \int_\kappa^\infty d\lambda\, \delta(\lambda - z) = \int_\kappa^\infty d\lambda \int_{-\infty}^\infty \frac{dx}{2\pi}\, e^{ix(\lambda - z)} .   (10.85)

We have step functions for each \alpha and \mu, so at this point we need auxiliary variables \lambda_\alpha^\mu and x_\alpha^\mu. Thus a particular step function becomes

\Theta\big(\zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu - \kappa\big) = \int_\kappa^\infty d\lambda_\alpha^\mu \int \frac{dx_\alpha^\mu}{2\pi}\, e^{ix_\alpha^\mu(\lambda_\alpha^\mu - u_\alpha^\mu)}   (10.86)

where

u_\alpha^\mu = \zeta^\mu N^{-1/2} \sum_j w_j^\alpha \xi_j^\mu .   (10.87)

It is now easy to average over the patterns, which occur only in the last factor of (10.86). We consider the case of independent binary patterns, for which we have

\langle\langle \prod_{\mu\alpha} e^{-ix_\alpha^\mu u_\alpha^\mu} \rangle\rangle
= \prod_{j\mu} \Big\langle\Big\langle \exp\Big(-i \zeta^\mu \xi_j^\mu N^{-1/2} \sum_\alpha x_\alpha^\mu w_j^\alpha\Big) \Big\rangle\Big\rangle
= \exp\Big(\sum_{j\mu} \log\cos\Big[N^{-1/2} \sum_\alpha x_\alpha^\mu w_j^\alpha\Big]\Big)
\approx \exp\Big(-\frac{1}{2N} \sum_{\mu\alpha\beta} x_\alpha^\mu x_\beta^\mu \sum_j w_j^\alpha w_j^\beta\Big) .   (10.88)

The resulting \sum_j w_j^\alpha w_j^\beta term is not easy to deal with directly, so we replace it by a new variable q_{\alpha\beta} defined by

q_{\alpha\beta} = N^{-1} \sum_j w_j^\alpha w_j^\beta .   (10.89)

This gives q_{\alpha\alpha} = 1 from (10.82), but we prefer to treat the \alpha = \beta terms explicitly and use q_{\alpha\beta} only for \alpha \ne \beta. Thus we rewrite (10.88) as

\langle\langle \prod_{\mu\alpha} e^{-ix_\alpha^\mu u_\alpha^\mu} \rangle\rangle = \prod_\mu \exp\Big(-\frac{1}{2} \sum_\alpha (x_\alpha^\mu)^2 - \sum_{\alpha<\beta} x_\alpha^\mu x_\beta^\mu q_{\alpha\beta}\Big)   (10.90)

using q_{\alpha\beta} = q_{\beta\alpha}. The q_{\alpha\beta}'s play the same role in this problem that q_{\alpha\beta} and r_{\alpha\beta} did in the previous section.

When we insert (10.90) into (10.86) we see that we get an identical result for each \mu, so we can drop all the \mu's and write

\Big\langle\Big\langle \prod_{\mu\alpha} \Theta(\cdots) \Big\rangle\Big\rangle = \Big[\int_\kappa^\infty \Big(\prod_\alpha d\lambda_\alpha\Big) \int \Big(\prod_\alpha \frac{dx_\alpha}{2\pi}\Big)\, e^{K\{\lambda, x, q\}}\Big]^p   (10.91)


where

K\{\lambda, x, q\} = i \sum_\alpha x_\alpha \lambda_\alpha - \frac{1}{2} \sum_\alpha x_\alpha^2 - \sum_{\alpha<\beta} x_\alpha x_\beta q_{\alpha\beta} .   (10.92)

Now we turn to the delta functions. Using the basic integral representation

\delta(z) = \int \frac{dr}{2\pi i}\, e^{rz}   (10.93)

(with the r integral along the imaginary axis), we choose r = E_\alpha/2 for each \alpha to write the delta functions in (10.84) as

\delta\Big(\sum_j (w_j^\alpha)^2 - N\Big) = \int \frac{dE_\alpha}{4\pi i} \exp\Big[\frac{E_\alpha}{2}\Big(N - \sum_j (w_j^\alpha)^2\Big)\Big] .   (10.94)

In the same way we enforce the condition (10.89) for each pair \alpha\beta (with \alpha > \beta) using r = N F_{\alpha\beta}:

\delta\Big(q_{\alpha\beta} - N^{-1} \sum_j w_j^\alpha w_j^\beta\Big) = N \int \frac{dF_{\alpha\beta}}{2\pi i} \exp\Big[F_{\alpha\beta}\Big(\sum_j w_j^\alpha w_j^\beta - N q_{\alpha\beta}\Big)\Big] .   (10.95)

We also have to add an integral over each of the q_{\alpha\beta}'s, so that the delta function can pick out the desired value.

A factorization of the integrals over the w's is now possible. Taking everything not involving w_j^\alpha outside, the numerator of (10.84) includes a factor

\int \Big(\prod_\alpha dw_j^\alpha\Big) \exp\Big(-\frac{1}{2} \sum_\alpha E_\alpha (w_j^\alpha)^2 + \sum_{\alpha<\beta} F_{\alpha\beta} w_j^\alpha w_j^\beta\Big)   (10.96)

for each j. These factors are all identical (w_j^\alpha is a dummy variable, and j no longer appears elsewhere), so we can drop the j's and rewrite (10.96) as

\Big[\int \Big(\prod_\alpha dw_\alpha\Big) \exp\Big(-\frac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 + \sum_{\alpha<\beta} F_{\alpha\beta} w_\alpha w_\beta\Big)\Big]^N .   (10.97)

The same transformation applies to the denominator of (10.84), except that there are no F_{\alpha\beta} terms.

It is now time to collect together our factors from (10.92), (10.94), (10.95), and (10.97). Writing A^k as exp(k log A), and omitting prefactors, (10.84) becomes

\langle\langle V^n \rangle\rangle = \frac{\int \big(\prod_\alpha dE_\alpha\big) \big(\prod_{\alpha<\beta} dq_{\alpha\beta}\, dF_{\alpha\beta}\big)\, e^{N G\{q, F, E\}}}{\int \big(\prod_\alpha dE_\alpha\big)\, e^{N H\{E\}}}   (10.98)

where

G\{q, F, E\} = \alpha \log\Big[\int_\kappa^\infty \Big(\prod_\alpha d\lambda_\alpha\Big) \int \Big(\prod_\alpha \frac{dx_\alpha}{2\pi}\Big)\, e^{K\{\lambda, x, q\}}\Big]
+ \log\Big[\int \Big(\prod_\alpha dw_\alpha\Big) \exp\Big(-\frac{1}{2} \sum_\alpha E_\alpha w_\alpha^2 + \sum_{\alpha<\beta} F_{\alpha\beta} w_\alpha w_\beta\Big)\Big]
- \sum_{\alpha<\beta} F_{\alpha\beta} q_{\alpha\beta} + \frac{1}{2} \sum_\alpha E_\alpha   (10.99)


and

H\{E\} = \log\Big[\int \Big(\prod_\alpha dw_\alpha\Big)\, e^{-\frac{1}{2} \sum_\alpha E_\alpha w_\alpha^2}\Big] + \frac{1}{2} \sum_\alpha E_\alpha .   (10.100)

Since the exponents inside the integrals in (10.98) are proportional to N, we will be able to evaluate them exactly using the saddle-point method in the large-N limit. As before, we make a replica-symmetric ansatz:

q_{\alpha\beta} = q, \qquad F_{\alpha\beta} = F, \qquad E_\alpha = E   (10.101)

(where the first two apply for \alpha \ne \beta only). This allows us to evaluate each term of G.

For the first term we can rewrite K from (10.92) as

K\{\lambda, x, q\} = i \sum_\alpha x_\alpha \lambda_\alpha - \frac{1}{2}(1 - q) \sum_\alpha x_\alpha^2 - \frac{1}{2} q \Big(\sum_\alpha x_\alpha\Big)^2   (10.102)

and linearize the last term with the usual Gaussian integral trick

e^{-a^2/2} = \int_{-\infty}^\infty \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2 + iat}   (10.103)

derived from (10.5). Then the x_\alpha integrals can be done, leaving a product of identical integrals over the \lambda_\alpha's. Upon replacing these by a single integral to the nth power we obtain for the whole first line of (10.99):

\alpha \log\Big\{\int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \Big[\int_\kappa^\infty \frac{d\lambda}{\sqrt{2\pi(1-q)}} \exp\Big(-\frac{(\lambda + t\sqrt{q})^2}{2(1-q)}\Big)\Big]^n\Big\}   (10.104)

where \alpha \equiv p/N.

The second term in G can be evaluated in the same way, linearizing the (\sum_\alpha w_\alpha)^2 term with a Gaussian integral trick, then performing in turn the w_\alpha integrals and the Gaussian integral. The final result in the small-n limit is

\frac{n}{2}\Big[\frac{F}{E+F} - \log(E+F)\Big] .   (10.105)

Finally, the third term of G gives simply (again for small n)

\frac{n}{2}\big(E + qF\big) .   (10.106)

Now we are in a position to find the saddle point of G with respect to q, F, and E. The most important order parameter is q. Its value at the saddle point is the most probable value of the overlap (10.89) between a pair of solutions. If, as at small \alpha, there is a large region of w-space that solves (10.80), then different solutions can be quite uncorrelated and q will be small. As we increase \alpha, it becomes harder and harder to find solutions, and the typical overlap between a pair of them increases. Finally, when there is just a single solution, q becomes equal to 1. This point defines the optimal perceptron: the one with the largest capacity for a given stability parameter \kappa, or equivalently the one with highest stability for a given \alpha. We focus on this case henceforth, taking q \to 1 shortly.

The saddle-point equations \partial G/\partial E = 0 and \partial G/\partial F = 0 can readily be solved to express E and F in terms of q:

F = \frac{q}{(1-q)^2}, \qquad E = \frac{1-2q}{(1-q)^2} .   (10.107)

Substituting these into the expression for G (and making a change of variable in the d\lambda integral), we get

\frac{1}{n} G(q) = \alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \log\Big[\int_u^\infty \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2}\Big] + \frac{1}{2}\log(1-q) + \frac{q}{2(1-q)} + \frac{1}{2}   (10.108)

where u = (\kappa + t\sqrt{q})/\sqrt{1-q}.

Setting \partial G/\partial q = 0 to find the saddle point gives

\alpha \int \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2} \Big[\int_u^\infty dz\, e^{-z^2/2}\Big]^{-1} e^{-u^2/2}\, \frac{t + \kappa\sqrt{q}}{2\sqrt{q}\,(1-q)^{3/2}} = \frac{q}{2(1-q)^2}   (10.109)

where u = (\kappa + t\sqrt{q})/\sqrt{1-q}. Taking the limit q \to 1 is a little tricky, but can be done using L'Hospital's rule, yielding the final result

\frac{1}{\alpha_c(\kappa)} = \int_{-\kappa}^\infty \frac{dt}{\sqrt{2\pi}}\, e^{-t^2/2}\, (\kappa + t)^2 .   (10.110)

Equation (10.110) gives the capacity for fixed \kappa. Alternatively we can use it to find the appropriate \kappa for the optimal perceptron to store N\alpha patterns. In the limit \kappa = 0 it gives

\alpha_c = 2   (10.111)

in agreement with the result found geometrically by Cover that was outlined in

Chapter 5.
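The capacity curve \alpha_c(\kappa) of (10.110) is easy to evaluate numerically. In the sketch below (not from the text), the Gaussian integral is computed both from the closed form (1+\kappa^2)\Phi(\kappa) + \kappa\phi(\kappa), which follows from standard integration by parts, and by direct quadrature as a cross-check:

```python
import math

def phi(t):
    # standard normal density
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def Phi(t):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(t / math.sqrt(2)))

def alpha_c(kappa):
    # (10.110): 1/alpha_c = integral_{-kappa}^inf Dt (kappa + t)^2.
    # Closed form (1+k^2)*Phi(k) + k*phi(k) obtained by integration by parts.
    return 1.0 / ((1 + kappa ** 2) * Phi(kappa) + kappa * phi(kappa))

def alpha_c_numeric(kappa, T=10.0, n=20000):
    # Direct trapezoidal quadrature of the same integral as a cross-check
    h = (T + kappa) / n
    s = 0.0
    for i in range(n + 1):
        t = -kappa + i * h
        wgt = 0.5 if i in (0, n) else 1.0
        s += wgt * phi(t) * (kappa + t) ** 2
    return 1.0 / (s * h)

print(alpha_c(0.0))   # kappa = 0 recovers Cover's result alpha_c = 2
print(alpha_c(1.0), alpha_c_numeric(1.0))
```

At \kappa = 0 the integral is 1/2 and the capacity is exactly 2; the capacity falls monotonically as the required margin \kappa grows, as in Fig. 10.1.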

One can also perform the corresponding calculation for biased patterns with a distribution

p(\xi_j^\mu) = \frac{1}{2}(1+m)\,\delta(\xi_j^\mu - 1) + \frac{1}{2}(1-m)\,\delta(\xi_j^\mu + 1)   (10.112)

FIGURE 10.1 Capacity \alpha_c as a function of \kappa for three values of m (from Gardner [1988]).

so that \langle\langle \xi_j^\mu \rangle\rangle = m. The calculation is just a little bit more complicated, with an extra set of variables M_\alpha = N^{-1/2} \sum_j w_j^\alpha with respect to which G has to be maximized. The results for the storage capacity as a function of m and \kappa are shown in Fig. 10.1.

An interesting limit is that of m \to 1 (sparse patterns). Then the result for \kappa = 0 is

\alpha_c = \frac{1}{(1-m)\log\big(1/(1-m)\big)}   (10.113)

which shows that one can store a great many sparse patterns. But there is nothing very surprising about this, because very sparse patterns have very small information content. Indeed, if we work out the total information capacity (the maximum information we can store, in bits) given by

I = -\frac{N^2 \alpha_c}{\log 2}\Big[\frac{1}{2}(1-m)\log\Big(\frac{1-m}{2}\Big) + \frac{1}{2}(1+m)\log\Big(\frac{1+m}{2}\Big)\Big] ,   (10.114)

then we obtain

I = \frac{N^2}{2 \log 2}   (10.115)

in the limit m \to 1. This is less than the result for the unbiased case (m = 0, \alpha_c = 2), which is I = 2N^2. In fact the total information capacity is always of the order N^2, depending only slightly on m.
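The m \to 1 limit can be illustrated numerically; note that the approach to N^2/(2 log 2) is logarithmically slow. The sketch below (not from the text) combines the sparse-limit capacity (10.113) with the per-bit information content from the bracket in (10.114):

```python
import math

def H_bits(m):
    # information per stored bit for bias m (the bracket in (10.114), in bits)
    a, b = (1 - m) / 2, (1 + m) / 2
    return -(a * math.log2(a) + b * math.log2(b))

def alpha_c_sparse(m):
    # (10.113), valid as m -> 1
    return 1.0 / ((1 - m) * math.log(1.0 / (1 - m)))

# I / N^2 = alpha_c * H_bits; the m -> 1 limit (10.115) is 1/(2 ln 2) bits
target = 1.0 / (2 * math.log(2))
ratios = [alpha_c_sparse(1 - eps) * H_bits(1 - eps)
          for eps in (1e-3, 1e-6, 1e-9, 1e-12)]
print(ratios, target)
```

As 1 - m shrinks, the product \alpha_c H decreases steadily toward 1/(2 \log 2) \approx 0.72 bits per synapse, confirming that the diverging capacity is exactly compensated by the vanishing information per pattern.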

It is interesting to note that a capacity of the order of the optimal one (10.113) is obtained for a Hopfield network from a simple Hebb-like rule [Willshaw et al., 1969; Tsodyks and Feigel'man, 1988], as we mentioned in Chapter 2.

A number of extensions of this work have been made, notably to patterns with

a finite fraction of errors, binary weights, diluted connections, and (in the recurrent

network) connections with differing degrees of correlation between Wij and Wji

[Gardner and Derrida, 1988; Gardner et al., 1989].


Generalization Ability

A particularly interesting application is to the calculation of the generalization ability of a simple perceptron. Recall from Section 6.5 that the generalization ability of a network was defined as the probability of its giving the correct output for the mapping it is trained to implement, when tested on a random example of the mapping not restricted to the training set. This can be calculated analytically by Gardner's methods [Gyorgyi and Tishby, 1990; Gyorgyi, 1990; Opper et al., 1990].

The basic idea, first used by Gardner and Derrida [1989], is to perform a calculation of the weight-space volume like the one just described, but, instead of considering random input-target pairs (\xi_j^\mu, \zeta^\mu), using pairs which are examples of a particular function f(\xi) = \mathrm{sgn}(v \cdot \xi) that the perceptron could learn. That is, we think of our perceptron as learning to imitate a teacher perceptron whose weights are v_j.

Under learning, the pupil perceptron's weight vector w will come to line up with that of its teacher. Its generalization ability will depend on one parameter, the dot product of the two vectors:

R = \frac{1}{N}\, w \cdot v .   (10.116)

Here both w and v are normalized as in (10.82). R is introduced into the calculation in the same way that q was earlier, by inserting a delta function and integrating over it. Ultimately one obtains saddle-point equations for both q and R.

To find the generalization ability from R, consider the two variables

x = N^{-1/2} \sum_j w_j \xi_j \qquad \text{and} \qquad y = N^{-1/2} \sum_j v_j \xi_j   (10.117)

which are the net inputs to the pupil and the teacher respectively. For large N, x and y are Gaussian variables, each of zero mean and unit variance, with covariance \langle xy \rangle = R. Thus their joint distribution is

P(x, y) = \frac{1}{2\pi\sqrt{1-R^2}} \exp\Big[-\frac{x^2 + y^2 - 2Rxy}{2(1-R^2)}\Big] .   (10.118)

Having averaged over all inputs, the generalization ability g(f) no longer depends on the specific mapping f of the teacher (parametrized by v), but only on the number of examples. We therefore write it as g(\alpha). Clearly g(\alpha) is the probability that x and y have the same sign. Simple geometry then leads to

g(\alpha) = 1 - \frac{1}{\pi} \cos^{-1} R   (10.119)

where R is obtained from the saddle-point condition as described above.



FIGURE 10.2 The generalization ability g(\alpha) as a function of relative training-set size \alpha. Adapted from Opper et al. [1990].

Figure 10.2 shows the resulting g(\alpha). The necessary number of examples for good generalization is clearly of order N, in agreement with the estimate (6.81). In the limit of many training examples perfect generalization is approached:

1 - g(\alpha) = \frac{1}{\alpha} .   (10.120)

This form of approach means that the a priori generalization ability distribution P_0(g) discussed in Section 6.5 has no gap around g = 1.

This example shows how one can actually do an explicit calculation (for the simple perceptron) which fits into the theoretical framework for generalization introduced in Section 6.5. We hope it will guide us in future calculations for less trivial architectures.

All the preceding has been algorithm-independent: it is about the existence of connection weights that implement the desired association, not about how they are found. It is also possible to apply statistical mechanics methods to particular algorithms [Kinzel and Opper, 1990; Hertz et al., 1989; Hertz, 1990], including their dynamics, but these calculations lie outside the scope of the framework we have presented here.
