The Functions of Deep Learning

By Gilbert Strang

Suppose we draw one of the digits 0, 1, ..., 9. How does a human recognize which digit it is? That neuroscience question is not answered here. How can a computer recognize which digit it is? This is a machine learning question. Probably both answers begin with the same idea: learn from examples.

So we start with M different images (the training set). An image is a set of p small pixels - or a vector v = (v_1, ..., v_p). The component v_i tells us the "grayscale" of the ith pixel in the image: how dark or light it is. We now have M images, each with p features: M vectors v in p-dimensional space. For every v in that training set, we know the digit it represents.

In a way, we know a function. We have M inputs in R^p, each with an output from 0 to 9. But we don't have a "rule." We are helpless with a new input. Machine learning proposes to create a rule that succeeds on (most of) the training images. But "succeed" means much more than that: the rule should give the correct digit for a much wider set of test images, taken from the same population. This essential requirement is called generalization.

What form shall the rule take? Here we meet the fundamental question. Our first answer might be: F(v) could be a linear function from R^p to R^10 (a 10 by p matrix). The 10 outputs would be probabilities of the numbers 0 to 9. We would have 10p entries and M training samples to get mostly right.

The difficulty is that linearity is far too limited. Artistically, two 0s could make an 8. A 1 and a 0 could combine into a handwritten 9 or possibly a 6. Images don't add. Recognizing faces instead of numbers requires a great many pixels - and the input-output rule is nowhere near linear.

Artificial intelligence languished for a generation, waiting for new ideas. There is no claim that the absolute best class of functions has now been found. That class needs to allow a great many parameters (called weights). And it must remain feasible to compute all those weights (in a reasonable time) from knowledge of the training set.

The choice that has succeeded beyond expectation - and has transformed shallow learning into deep learning - is continuous piecewise linear (CPL) functions. Linear for simplicity, continuous to model an unknown but reasonable rule, and piecewise to achieve the nonlinearity that is an absolute requirement for real images and data.

This leaves the crucial question of computability. What parameters will quickly describe a large family of CPL functions? Linear finite elements start with a triangular mesh. But specifying many individual nodes in R^p is expensive. It will be better if those nodes are the intersections of a smaller number of lines (or hyperplanes). Please note that a regular grid is too simple.

Figure 1 is a first construction of a piecewise linear function of the data vector v. Choose a matrix A and vector b. Then set to 0 (this is the nonlinear step) all negative components of Av + b. Then multiply by a matrix C to produce the output w = F(v) = C(Av + b)_+. That vector (Av + b)_+ forms a "hidden layer" between the input v and the output w.

[Figure 1. Neural net construction of a piecewise linear function of the data vector v. Labels in the figure: w = F(v) = C[Av + b]_+; hidden-layer nodes (Av + b)_+; pq + 2q = 20 weights; r(4, 3) = 15 linear pieces.]
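As a minimal sketch of that construction (my illustration, not from the article), assuming NumPy and the small sizes p = 3 and q = 4 that give the weight count in Figure 1:

    import numpy as np

    def relu(x):
        # ReLU: set all negative components to zero (the nonlinear step)
        return np.maximum(x, 0.0)

    def F(v, A, b, C):
        # One hidden layer: w = F(v) = C (Av + b)_+
        hidden = relu(A @ v + b)       # the hidden layer (Av + b)_+
        return C @ hidden

    rng = np.random.default_rng(0)
    p, q = 3, 4                        # illustrative sizes, as in Figure 1
    A = rng.standard_normal((q, p))    # pq weights
    b = rng.standard_normal(q)         # q weights
    C = rng.standard_normal((1, q))    # q weights, so pq + 2q = 20 in total
    w = F(rng.standard_normal(p), A, b, C)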
The nonlinear function called ReLU(x) = x_+ = max(x, 0) was originally smoothed into a logistic curve like 1/(1 + e^(-x)). It was reasonable to think that continuous derivatives would help in optimizing the weights A, b, C. That proved to be wrong.

The graph of each component of (Av + b)_+ has two half-planes (one is flat, from the 0s where Av + b is negative). If A is q by p, the input space R^p is sliced by q hyperplanes into r pieces. We can count those pieces! This measures the "expressivity" of the overall function F(v). The formula from combinatorics is r(q, p) = (q choose 0) + (q choose 1) + ... + (q choose p); for the figure, r(4, 3) = 1 + 4 + 6 + 4 = 15 linear pieces. This number gives an impression of the graph of F. But our function is not yet sufficiently expressive, and one more idea is needed.
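A quick check of that count (my addition, not part of the article), assuming Python's math.comb:

    from math import comb

    def r(q, p):
        # Pieces of R^p cut by q hyperplanes in general position:
        # r(q, p) = (q choose 0) + (q choose 1) + ... + (q choose p)
        return sum(comb(q, i) for i in range(p + 1))

    print(r(4, 3))   # 15 linear pieces, as labeled in Figure 1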
ceed" means much more than that: the rule Figure I is a first construction of a
graph of F. But our function is not yet chastic gradient descent) cut in half the next
should give the correct digit for a much piecewise linear function of the data vec-
sufficiently expressive, and one more idea best error rate. Deep learning had arrived.
wider set of test images, taken from the tor v. Choose a matrix A and vector b.
is needed. Our goal here was to identify continu-
same population. This essential requirement Then set to 0 (this is the nonlinear step) Here is the indispensable ingredient in ous piecewise linear functions as power-
is called generalization. all negative components of A v + b. Then the learning function F. The best way ful approximators. That family is also
What form shall the rule take? Here multiply by a matrix C to produce the out-
to create complex functions from simple convenient-closed under addition and
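A minimal sketch of that composition (my notation and layer sizes are illustrative assumptions, not the article's):

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def deep_F(v, As, bs):
        # F(v) = F_L(F_{L-1}(... F_1(v))): every hidden layer applies ReLU
        # after its affine map A_i x + b_i; the final output layer stays affine.
        x = v
        for A, b in zip(As[:-1], bs[:-1]):
            x = relu(A @ x + b)
        return As[-1] @ x + bs[-1]

    # Illustrative sizes: p = 784 pixels in, two hidden layers, 10 digit scores out
    rng = np.random.default_rng(1)
    sizes = [784, 128, 64, 10]
    As = [0.01 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    bs = [np.zeros(m) for m in sizes[1:]]
    w = deep_F(rng.standard_normal(784), As, bs)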
The great optimization problem of deep learning is to compute weights A_i and b_i that will make the outputs F(v) nearly correct - close to the digit w(v) that the image v represents. This problem of minimizing some measure of F(v) - w(v) is solved by following a gradient downhill. The gradient of this complicated function is computed by backpropagation: the workhorse of deep learning that executes the chain rule.
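One gradient step for the one-hidden-layer network sketched earlier, with a squared-error loss, might look as follows. This is a simplified illustration under my own assumptions, not the article's training procedure:

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def gradient_step(v, target, A, b, C, lr=0.01):
        # Forward pass: w = C (Av + b)_+ with loss 0.5 * ||w - target||^2
        pre = A @ v + b
        h = relu(pre)
        err = C @ h - target

        # Backpropagation: the chain rule applied layer by layer
        grad_C = np.outer(err, h)
        grad_h = C.T @ err
        grad_pre = grad_h * (pre > 0)   # ReLU passes gradients only where Av + b > 0
        grad_A = np.outer(grad_pre, v)
        grad_b = grad_pre

        # One step downhill along the negative gradient
        return A - lr * grad_A, b - lr * grad_b, C - lr * grad_C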
A historic competition in 2012 was to identify the 1.2 million images collected in ImageNet. The breakthrough neural network AlexNet had 60 million weights in eight layers. Its accuracy (after five days of stochastic gradient descent) cut in half the next best error rate. Deep learning had arrived.

Our goal here was to identify continuous piecewise linear functions as powerful approximators. That family is also convenient - closed under addition and maximization and composition. The magic is that the learning function F(A_i, b_i, v) gives accurate results on images v that F has never seen.

This article is published with very light edits.

Gilbert Strang teaches linear algebra at the Massachusetts Institute of Technology. A description of the January 2019 textbook Linear Algebra and Learning from Data is available at math.mit.edu/learningfromdata.
