
Perceptron: from Minsky & Papert (1969)!

Strictly bottom-up feature processing.!

Retina with input pattern → non-adaptive local feature detectors (preprocessing) → trainable evidence weigher.!
‘Supervised’ learning training paradigm!

Teacher:!
Correct output for trial 1 is 1.!
Correct output for trial 2 is 1.!
Correct output for trial 3 is 0.!
. . .!
Correct output for trial n is 0.!

Adjust weights to reduce error.!

Call the correct answers the desired outputs and represent them by the symbol “d”.!

Call the outputs from the network the actual outputs and represent them by the symbol “a”.!
From a neuron to a linear threshold unit.!
Trivial Data Set!

No generalization needed.!
Not linearly separable.!
Odd bit parity.!
Perceptron Learning Rule!
More Compact Form!
An Alternative Representation!
Learning the AND function!

Perceptron learning algorithm!
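The slide titles above name the perceptron learning rule without reproducing it in this text. Below is a minimal Python sketch of the standard rule applied to the AND task used later in the deck; the learning rate, epoch count, and variable names are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of the perceptron learning rule, applied here to AND.
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
desired = [0, 0, 0, 1]          # AND truth table ("d" in the slides)

w = [0.0, 0.0]                  # adjustable weights
b = 0.0                         # bias (threshold represented as a weight)
eta = 0.1                       # learning rate (illustrative choice)

for epoch in range(20):
    for x, d in zip(inputs, desired):
        a = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0   # actual output "a"
        error = d - a
        # Perceptron rule: adjust the weights only when the output is wrong.
        w[0] += eta * error * x[0]
        w[1] += eta * error * x[1]
        b    += eta * error

print(w, b)   # a separating line for AND (weights positive, bias negative)
```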


Linear Separability!

Perceptron with 2 inputs (diagram shows the inputs and the threshold):!

Solve for y: equation for a straight line.!

Adjustable wts control slope and intercept.!
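A reconstruction of the algebra the slide performs, using x and y for the two inputs and θ for the threshold (these symbol names are assumptions):

```latex
w_1 x + w_2 y = \theta
\quad\Longrightarrow\quad
y = -\frac{w_1}{w_2}\,x + \frac{\theta}{w_2}
```

so the slope is -w1/w2 and the intercept is θ/w2; changing the weights and threshold moves the separating line.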
AND vs XOR!

Geometric illustration of linear separability (or the lack of it).!

Try to draw a straight line separating the positive from the negative instances.!

Algebraic Proof that XOR Is Not Linearly Separable!

Greater than zero.!

Impossible, given the above.!
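A reconstruction of the argument those annotations refer to, assuming the unit outputs 1 when w1·x1 + w2·x2 > θ (the symbols are assumptions; the slide's own equations are not preserved in this text):

```latex
\begin{aligned}
\mathrm{XOR}(0,0)=0 &\;\Rightarrow\; 0 \le \theta \quad(\text{so } \theta \ge 0)\\
\mathrm{XOR}(1,0)=1 &\;\Rightarrow\; w_1 > \theta\\
\mathrm{XOR}(0,1)=1 &\;\Rightarrow\; w_2 > \theta\\
\mathrm{XOR}(1,1)=0 &\;\Rightarrow\; w_1 + w_2 \le \theta
\end{aligned}
```

Adding the middle two inequalities gives w1 + w2 > 2θ; since θ ≥ 0, this exceeds θ, contradicting the last line, so no weights and threshold exist.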


Thresholds can be represented as weights!
Separating plane is also called decision boundary!

Linear discriminant function!

Linear component of the LTU:  y(x) = w^T x + w_0!

At the decision boundary:  y(x) = 0!


Properties of linear decision boundary!

1.  Decision boundary is orthogonal to weight vector.!

2.  We will derive a formula for the distance from the decision boundary to the origin.!


Geometry of linear decision boundary!

Points in x space!

Let x_A and x_B be two points on the hyperplane.!

Then y(x_A) = 0 = y(x_B).!

Therefore: w^T (x_B - x_A) = 0!

Bishop!

x_A - x_B is parallel to the decision boundary.!

w^T (x_B - x_A) = 0 implies the weight vector is perpendicular to the decision boundary.!

Interpretation of vector subtraction!
Distance to hyperplane!

l = ||x|| cos θ

Distance to hyperplane.!

l = ||x|| cos θ = ||x|| · (w^T x) / (||x|| ||w||) = (w^T x) / ||w|| = -w_0 / ||w||    (since w^T x = -w_0 on the boundary)
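A small numeric check of the two geometric claims above (the weight values are arbitrary illustrative choices):

```python
import numpy as np

# Arbitrary illustrative weights: y(x) = w^T x + w0
w = np.array([3.0, 4.0])
w0 = -5.0

# Two points on the decision boundary y(x) = 0
xA = np.array([3.0, -1.0])      # 3*3 + 4*(-1) - 5 = 0
xB = np.array([-1.0, 2.0])      # 3*(-1) + 4*2 - 5 = 0

print(w @ (xB - xA))            # 0.0 -> w is perpendicular to the boundary
print(-w0 / np.linalg.norm(w))  # 1.0 -> distance of the boundary from the origin
```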
SVM Kernel!

Projecting input features to a higher-dimensional space.!

Russell and Norvig!


Perceptron: from Minsky & Papert (1969)!

Strictly bottom-up feature processing.!

Retina with input pattern → non-adaptive local feature detectors (preprocessing) → trainable evidence weigher.!
Diameter-limited Perceptron!
Digits made of seven line segments!

Positive target is the exact digit pattern.!

Negative target is all other possible segment configurations.!

Winston
Digit Segment Perceptron!

Ten perceptrons are required.!

Winston!
Perceptron learning does not generalize to multilayer networks!

Perform a feedforward sweep to compute the output values below.!

Input is constrained so that exactly 2 of the 6 inputs are 1.!

Output units indicate whether the inputs are acquaintances or siblings.!

Winston!
An early model of an error signal: 1960s!
Descriptive complexity is low!

Linear combiner element!


A very simple learning task

•  Consider a neural network with two layers of neurons.
   –  Neurons in the top layer represent known shapes (the digits 0 1 2 3 4 5 6 7 8 9).
   –  Neurons in the bottom layer represent pixel intensities.
•  A pixel gets to vote if it has ink on it.
   –  Each inked pixel can vote for several different shapes.
•  The shape that gets the most votes wins.

Hinton
Why the simple system does not work

•  A two layer network with a single winner in the top layer is equivalent to
having a rigid template for each shape.
–  The winner is the template that has the biggest overlap with the ink.

•  The ways in which shapes vary are much too complicated to be captured by
simple template matches of whole shapes.
–  To capture all the allowable variations of a shape we need to learn the
features that it is composed of.

Hinton
Examples of handwritten digits from a test set
General layered feedforward network!

There can be varying numbers of units in the different layers.!

A feedforward network does not have to be layered.!

Any connection topology that does not have recurrent connections is a feedforward network.!
Summary of Architecture for Backprop

Compare outputs with the correct answer to get an error signal.
Back-propagate the error signal to get derivatives for learning.

(Diagram: input vector → hidden layers → outputs.)

Hinton
Matrix representations of networks!
Need for supervised training!

We cannot define the hand-printed character A. We can only present examples and hope that the network generalizes.!

The fact that this happens was a breakthrough in pattern recognition.!


Give the neuron a graded output value between 0 and 1!

Put the sigmoid function here.!

Logistic sigmoid equation!

Sigmoid plots: the sigmoid and its derivative.!
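The logistic sigmoid plotted here is presumably the standard one:

```latex
\mathrm{logsig}(x) = \frac{1}{1 + e^{-x}}
```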
Simple transformations on functions!
Odd and Even Functions!

Even: f(x) = f(-x)! Flip about y-axis!

Odd: f(x) = -f(-x)! Flip about y-axis and flip about x-axis.!

Even: cos, cosh!

Odd: sin, sinh!


sinh, cosh, and tanh!
Relation between logsig and tanh!
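The identity relating the two is the standard one (stated here since the slide body is not preserved in this text):

```latex
\tanh(x) = 2\,\mathrm{logsig}(2x) - 1,
\qquad
\mathrm{logsig}(x) = \tfrac{1}{2}\bigl(1 + \tanh(x/2)\bigr).
```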
Exact formula for setting weights!

1.  Two classes!

2.  Gaussian distributed!

3.  Equal variances!

4.  Naïve Bayes!
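Under these assumptions the posterior is a logistic sigmoid of a linear function of x, so the weights can be written down in closed form. This is the standard result in Bishop; the symbols μ_k, Σ, and the priors p(C_k) are the usual class means, (shared) covariance, and priors, which the slide itself does not define here:

```latex
p(C_1 \mid x) = \mathrm{logsig}\!\left(w^T x + w_0\right),
\qquad
w = \Sigma^{-1}(\mu_1 - \mu_2),
\qquad
w_0 = -\tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}.
```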


Gaussian Distribution!
Fitting a bivariate Gaussian to 2D data

Data is presented as a scatter plot.

From: Florian Hahne


Bivariate Gaussian!

Optimal decision boundary!

Exact formula for setting weights!

However, in most applications, we must train the wts. (Bishop, 1995)!
Derivatives of sigmoidal functions!

logsig'(x) = logsig(x) · (1 - logsig(x))

tanh'(x) = 1 - tanh²(x)

If you know the value of the function at a particular point, you can quickly compute its derivative at that point.!
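A quick numeric sanity check of both identities (a small sketch; the point x = 0.7 is an arbitrary choice):

```python
import math

def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6

# Finite-difference derivatives vs. the closed forms above.
print((logsig(x + h) - logsig(x - h)) / (2 * h), logsig(x) * (1 - logsig(x)))
print((math.tanh(x + h) - math.tanh(x - h)) / (2 * h), 1 - math.tanh(x) ** 2)
```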
Error Signal (Cost function)!

For regression: sum of squares.!

For classification: cross entropy,!

E = - Σ_{t=1..S} Σ_{k=1..c} d_k(t) ln a_k(t)      (S = # of cases, c = # of classes)
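For completeness, the sum-of-squares cost named above, written in the same d/a notation (a standard form; an assumption about the slide's exact expression):

```latex
E = \frac{1}{2}\sum_{t=1}^{S}\sum_{k=1}^{c}\bigl(d_k(t) - a_k(t)\bigr)^2
```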
Error Measure for Linear Unit!

This will be the cost function.!

Instantaneous slope (i.e. derivative)!

Slope: how fast a line is increasing.!

f'(x) ≈ (f(x + Δx) - f(x)) / Δx

(Figure: f(x) and f(x + Δx) on the y-axis, x and x + Δx on the x-axis, separated by Δx.)
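A quick worked instance of this approximation (the function and step size are illustrative choices):

```latex
f(x) = x^2,\quad x = 2,\ \Delta x = 0.01:\qquad
\frac{f(2.01) - f(2)}{0.01} = \frac{4.0401 - 4}{0.01} = 4.01 \approx f'(2) = 4.
```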
Principle of gradient descent!

Goal is to find the minimum of the cost function.!

Suppose the estimate is x = 1.!

Then the derivative is > 0.!

Update the estimate in the negative direction.!
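A one-step illustration, assuming a cost E(x) = x² and learning rate η = 0.1 (both illustrative choices, not from the slide):

```latex
E(x) = x^2,\quad E'(1) = 2 > 0,\qquad
x_{\text{new}} = x - \eta\,E'(x) = 1 - 0.1 \cdot 2 = 0.8,
```

so the estimate moves toward the minimum at x = 0.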
How the derivative varies with the error function!

Positive gradient.!

Minimum of cost function.!

Negative gradient.!
Deriving a learning rule: trains wts for single linear unit!

Derivative of cost function wrt weight w_1.!

t denotes test case or trial.!

S denotes # of cases in a batch.!

Weight update rule from gradient descent!
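In the deck's d/a notation, the derivation being referred to is presumably the standard one (the exact symbols on the slide are not preserved in this text):

```latex
E = \frac{1}{2}\sum_{t=1}^{S}\bigl(d(t) - a(t)\bigr)^2,\quad a(t) = \sum_i w_i x_i(t)
\;\Rightarrow\;
\frac{\partial E}{\partial w_1} = -\sum_{t=1}^{S}\bigl(d(t) - a(t)\bigr)x_1(t),
\qquad
\Delta w_1 = \eta \sum_{t=1}^{S}\bigl(d(t) - a(t)\bigr)x_1(t).
```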


Simple example: learn AND with linear unit.!

# of cases in a batch = 4!

Since there are only two weights, it’s possible to visualize the error surface.!
Composite error surface for linear unit learning AND!

Error surface is flipped upside down, so that it is easier to see.!
In this case, quadratic.!
Optimal weight values are at top of surface (w1 = w2 = 1/3).!
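A minimal batch-gradient-descent sketch for this example (learning rate and epoch count are illustrative choices). With no bias weight, it converges to w1 = w2 = 1/3 as stated above:

```python
# Batch gradient descent for a single linear unit (no bias) learning AND.
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
desired = [0, 0, 0, 1]

w = [0.0, 0.0]
eta = 0.1

for epoch in range(500):
    grad = [0.0, 0.0]
    for x, d in zip(inputs, desired):
        a = w[0] * x[0] + w[1] * x[1]          # linear unit output
        for i in range(2):
            grad[i] += -(d - a) * x[i]         # dE/dw_i for E = 1/2 * sum (d - a)^2
    for i in range(2):
        w[i] -= eta * grad[i]                  # step down the gradient

print(w)   # approximately [0.333..., 0.333...]
```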
Error surface for each of the 4 cases.!

Previous error surface is obtained by adding these together.!
Adding a Logistic Sigmoid Unit!

Formula for the sigmoidal unit, g(x).!

Derivative of the sigmoidal unit.!


Derivative of cost function for linear unit with sigmoid!

Chain rule!

Derivative of sigmoid.!
Error surface with a sigmoidal unit!


plateau!
Backpropagation of errors!

Update rule for wt in output layer!

New form!

Function of h, which is defined on next slide.!


Backpropagation of errors!

New form!
Δw_ji = η o_i δ_j(h_j)
Arbitrary location in network.!
For output layer!
δ_n = [d_n - o_n] g'(h_n)

For hidden layers!
δ_j = g'(h_j) Σ_k w_kj δ_k

Adapted from Bishop!


Delta_n: output layer!

1.  If sigmoid, then o(1-o).!


2.  If tanh, then 1-o^2.!
3.  If linear, then 1.!
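A compact sketch of these update equations for one hidden layer, using sigmoid units throughout. The network size, XOR training data, learning rate, and initial weights are illustrative assumptions, not the deck's example:

```python
import math, random

def g(h):                       # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-h))

# Illustrative choices: 2 inputs, 3 hidden sigmoid units, 1 sigmoid output.
random.seed(0)
N_IN, N_HID = 2, 3
W1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
W2 = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
b2 = 0.0
eta = 0.5
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR targets

for epoch in range(10000):
    for x, d in data:
        # Feedforward sweep.
        h_hid = [sum(W1[j][i] * x[i] for i in range(N_IN)) + b1[j] for j in range(N_HID)]
        o_hid = [g(h) for h in h_hid]
        o_out = g(sum(W2[j] * o_hid[j] for j in range(N_HID)) + b2)

        # Output layer: delta_n = [d_n - o_n] g'(h_n), with g'(h) = o(1 - o) for a sigmoid.
        delta_out = (d - o_out) * o_out * (1 - o_out)
        # Hidden layer: delta_j = g'(h_j) * sum_k w_kj * delta_k.
        delta_hid = [o_hid[j] * (1 - o_hid[j]) * W2[j] * delta_out for j in range(N_HID)]

        # Weight updates: Delta w_ji = eta * o_i * delta_j.
        for j in range(N_HID):
            W2[j] += eta * o_hid[j] * delta_out
            for i in range(N_IN):
                W1[j][i] += eta * x[i] * delta_hid[j]
            b1[j] += eta * delta_hid[j]
        b2 += eta * delta_out

# Trained outputs: typically close to the XOR targets, though backprop can
# occasionally stall in a poor local minimum on this small task.
for x, d in data:
    o_hid = [g(sum(W1[j][i] * x[i] for i in range(N_IN)) + b1[j]) for j in range(N_HID)]
    print(x, d, round(g(sum(W2[j] * o_hid[j] for j in range(N_HID)) + b2), 2))
```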
Initializing weights!

1.  Most methods set weights to randomly chosen small values.!

2.  Random values are chosen to avoid symmetries in the network.!

3.  Small values are chosen to avoid the saturated regions of the
sigmoid where the slope is near zero.!
Time complexity of backprop!

Feedforward sweep is O(w), where w is the number of weights.!

Backprop sweep is also O(w).!
Basic Autoencoder!

Ashish Gupta!
Sparse Autoencoder!

The number of hidden units is greater than the number of input units.!

The cost function is chosen to make most of the hidden units inactive.!

Cost function for sparse AE.!

Three terms: minimize SS error, minimize wt magnitude, minimize # of active units.!
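A sketch of such a cost in NumPy terms. The penalty weights lam and beta, the KL-based sparsity term, and all variable names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sparse_ae_cost(X, X_hat, A_hidden, W_list, lam=1e-4, beta=3.0, rho=0.05):
    """Sum-of-squares reconstruction error + weight-magnitude penalty
    + sparsity penalty on the average hidden activations (illustrative form)."""
    sse = 0.5 * np.sum((X - X_hat) ** 2)                            # minimize SS error
    weight_decay = 0.5 * lam * sum(np.sum(W ** 2) for W in W_list)  # minimize wt magnitude
    rho_hat = np.mean(A_hidden, axis=0)                             # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))      # minimize # of active units
    return sse + weight_decay + beta * kl

# Tiny usage example with random placeholder arrays.
X = np.random.rand(10, 6); X_hat = X * 0.9
A_hidden = np.random.rand(10, 8) * 0.2 + 0.01
W_list = [np.random.randn(6, 8), np.random.randn(8, 6)]
print(sparse_ae_cost(X, X_hat, A_hidden, W_list))
```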
Limited-memory BFGS!

Limitation of gradient descent: it assumes the direction of maximum change of the error (the gradient) is a good direction for the update.!

A better choice maximizes the ratio of gradient to curvature of the error surface.!

Newton’s method takes curvature into account.!

Curvature is given by the Hessian matrix H. Newton’s method requires inverting H, which is infeasible for large problems.!

BFGS approximates the inverse of H without actually inverting the Hessian. Thus it is a quasi-Newton method.!

Broyden, Fletcher, Goldfarb, Shanno!
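For reference, a minimal use of an off-the-shelf limited-memory BFGS optimizer (SciPy's L-BFGS-B). The quadratic objective is just a stand-in for a network's cost function and gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in cost and gradient; in practice these would be the network's
# error and the gradients computed by backprop.
target = np.array([1.0, -2.0, 0.5])

def cost(w):
    return 0.5 * np.sum((w - target) ** 2)

def grad(w):
    return w - target

result = minimize(cost, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
print(result.x)   # close to [1.0, -2.0, 0.5]
```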
Purpose of sparse autoencoder!

Discover an overcomplete basis set.!

Use basis set as a set of input features to a learning algorithm.!

Can think of this as a form of highly sophisticated preprocessing.!


Summary of Architecture for Backprop

Compare outputs with the correct answer to get an error signal.
Back-propagate the error signal to get derivatives for learning.

(Diagram: input vector → hidden layers → outputs.)

Hinton
Training Procedure: Make sure lab results apply to the application domain.!

If you have enough data:!


1.  Training set!
2.  Testing set!
3.  Validation set!

Otherwise:!

Ten-fold cross-validation. Allows use of all data for training.!
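A sketch of ten-fold cross-validation using scikit-learn's KFold splitter; the toy data and the use of scikit-learn are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 100 examples, 6 features, binary labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # Train the network on (X_train, y_train) and evaluate on (X_test, y_test).
    print(f"fold {fold}: {len(train_idx)} training cases, {len(test_idx)} test cases")
```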


Curse of Dimensionality

Amount of data needed to sample a high-dimensional space grows exponentially with D.

(Figure: example input spaces for D = 1 (x1), D = 2 (x1, x2), and D = 3 (x1, x2, x3).)
A feature preprocessing example!

Distinguishing mines from rocks in sonar returns.!

There exists a set of weights to solve the problem. Gradient descent searches for the desired weights.!
NETtalk – Sejnowski and Rosenberg (1987)!

26 outputs coding phonemes!

80 hidden units!

203-bit binary input vector: 29 bits for each of 7 characters (including punctuation), 1-out-of-29 code!

Trained on 1024 words; capable of intelligible speech after 10 training epochs, 95% training accuracy after 50 epochs, and 78% generalization accuracy.!
Limitations of back-propagation

•  It requires labeled training data.


–  Almost all data in an applied setting is unlabeled.

•  The learning time does not scale well


–  It is very slow in networks with multiple hidden layers.

•  It can get stuck in poor local optima.


–  These are often quite good, but for deep networks they are far from optimal.

Hinton
Synaptic noise sources:!
1. Probability of vesicular release.!
2. Magnitude of response in case of vesicular release.!

Possible biological source of a quasi-global difference-of-reward signal!

Mesolimbic system or mesocortical system!

Error broadcast to every weight in the network!
A spectrum of machine learning tasks

Typical Statistics:
•  Low-dimensional data (e.g. less than 100 dimensions)
•  Lots of noise in the data
•  There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
•  The main problem is distinguishing true structure from noise.

Artificial Intelligence:
•  High-dimensional data (e.g. more than 100 dimensions)
•  The noise is not sufficient to obscure the structure in the data if we process it right.
•  There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
•  The main problem is figuring out how to represent the complicated structure in a way that can be learned.

Hinton
Generative Model for Classification!

Definition:!
1.  Model class-conditional densities p(x | C_k).!

2.  Model class priors: p(C_k).!

3.  Compute posterior: p(C_k | x) using Bayes’ Thm.!
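The posterior computation in step 3 is the usual Bayes' theorem:

```latex
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)}.
```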

Motivation:!

1.  Links ML to probability theory, the universal language for modeling uncertainty and evaluating evidence.!

2.  Links neural networks to ML.!


Knowledge structuring issues and information sources!

Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by its slef, but the wrod as a wlohe.
Top-down or Context Effects! Kanizsa figures!

Semantics of a structured pattern!
