
Perceptron: from Minsky & Papert (1969)!

Strictly bottom-up feature processing.!

Retina with input pattern → non-adaptive local feature detectors (preprocessing) → trainable evidence weigher.!
‘Supervised’ learning training paradigm!

Teacher:!
Correct output for trial 1 is 1.!
Correct output for trial 2 is 1.!
Correct output for trial 3 is 0.!
. . .!
Correct output for trial n is 0.!

Adjust weights to reduce error.!

Call the correct answers the desired outputs and represent them by the symbol “d”.!

Call the outputs from the network the actual outputs and represent them by the symbol “a”.!
From a neuron to a linear threshold unit.!
Trivial Data Set!

No generalization needed.!
Not linearly separable.!
Odd bit parity.!
Perceptron Learning Rule!
More Compact Form!
An Alternative Representation!
Learning the AND function!

Perceptron learning algorithm!
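The slide titles above name the perceptron learning rule without reproducing it in this text. Below is a minimal Python sketch of the standard rule applied to the AND task used later in the deck; the learning rate, epoch count, and variable names are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch of the perceptron learning rule, applied here to AND.
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
desired = [0, 0, 0, 1]          # AND truth table ("d" in the slides)

w = [0.0, 0.0]                  # adjustable weights
b = 0.0                         # bias (threshold represented as a weight)
eta = 0.1                       # learning rate (illustrative choice)

for epoch in range(20):
    for x, d in zip(inputs, desired):
        a = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0   # actual output "a"
        error = d - a
        # Perceptron rule: adjust the weights only when the output is wrong.
        w[0] += eta * error * x[0]
        w[1] += eta * error * x[1]
        b    += eta * error

print(w, b)   # a separating line for AND (weights positive, bias negative)
```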


Linear Separability!

Perceptron with 2 inputs (diagram shows the inputs and the threshold):!

Solve for y: equation for a straight line.!

Adjustable wts control slope and intercept.!
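A reconstruction of the algebra the slide performs, using x and y for the two inputs and θ for the threshold (these symbol names are assumptions):

```latex
w_1 x + w_2 y = \theta
\quad\Longrightarrow\quad
y = -\frac{w_1}{w_2}\,x + \frac{\theta}{w_2}
```

so the slope is -w1/w2 and the intercept is θ/w2; changing the weights and threshold moves the separating line.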
AND vs XOR!

Geometric illustration of linear separability (or the lack of it).!

Try to draw a straight line separating the positive from the negative instances.!

Algebraic Proof that XOR Is Not Linearly Separable!

Greater than zero.!

Impossible, given the above.!
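A reconstruction of the argument those annotations refer to, assuming the unit outputs 1 when w1·x1 + w2·x2 > θ (the symbols are assumptions; the slide's own equations are not preserved in this text):

```latex
\begin{aligned}
\mathrm{XOR}(0,0)=0 &\;\Rightarrow\; 0 \le \theta \quad(\text{so } \theta \ge 0)\\
\mathrm{XOR}(1,0)=1 &\;\Rightarrow\; w_1 > \theta\\
\mathrm{XOR}(0,1)=1 &\;\Rightarrow\; w_2 > \theta\\
\mathrm{XOR}(1,1)=0 &\;\Rightarrow\; w_1 + w_2 \le \theta
\end{aligned}
```

Adding the middle two inequalities gives w1 + w2 > 2θ; since θ ≥ 0, this exceeds θ, contradicting the last line, so no weights and threshold exist.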


Thresholds can be represented as weights!
Separating plane is also called decision boundary!

Linear discriminant function!

Linear component of the LTU:  y(x) = w^T x + w_0!

At the decision boundary:  y(x) = 0!


Properties of linear decision boundary!

1.  Decision boundary is orthogonal to weight vector.!

2.  We will derive a formula for the distance from the decision boundary to the origin.!


Geometry of linear decision boundary!

Points in x space!

Let x_A and x_B be two points on the hyperplane.!

Then y(x_A) = 0 = y(x_B).!

Therefore: w^T (x_B - x_A) = 0!

Bishop!

x_A - x_B is parallel to the decision boundary.!

w^T (x_B - x_A) = 0 implies the weight vector is perpendicular to the decision boundary.!

Interpretation of vector subtraction!
Distance to hyperplane!

l = ||x|| cos θ

Distance to hyperplane.!

l = ||x|| cos θ = ||x|| · (w^T x) / (||x|| ||w||) = (w^T x) / ||w|| = -w_0 / ||w||    (since w^T x = -w_0 on the boundary)
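A small numeric check of the two geometric claims above (the weight values are arbitrary illustrative choices):

```python
import numpy as np

# Arbitrary illustrative weights: y(x) = w^T x + w0
w = np.array([3.0, 4.0])
w0 = -5.0

# Two points on the decision boundary y(x) = 0
xA = np.array([3.0, -1.0])      # 3*3 + 4*(-1) - 5 = 0
xB = np.array([-1.0, 2.0])      # 3*(-1) + 4*2 - 5 = 0

print(w @ (xB - xA))            # 0.0 -> w is perpendicular to the boundary
print(-w0 / np.linalg.norm(w))  # 1.0 -> distance of the boundary from the origin
```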
SVM Kernel!

Projecting input features to a higher-dimensional space.!

Russell and Norvig!


Perceptron: from Minsky & Papert (1969)!

Strictly bottom-up feature processing.!

Retina with input pattern → non-adaptive local feature detectors (preprocessing) → trainable evidence weigher.!
Diameter-limited Perceptron!
Digits made of seven line segments!

Positive target is the exact digit pattern.!

Negative target is all other possible segment configurations.!

Winston
Digit Segment Perceptron!

Ten perceptrons are required.!

Winston!
Perceptron learning does not generalize to multilayer networks!

Perform a feedforward sweep to compute the output values below.!

Input is constrained so that exactly 2 of the 6 inputs are 1.!

Output units indicate whether the inputs are acquaintances or siblings.!

Winston!
An early model of an error signal: 1960s!
Descriptive complexity is low!

Linear combiner element!


A very simple learning task

•  Consider a neural network with two layers of neurons.
   –  Neurons in the top layer represent known shapes (the digits 0 1 2 3 4 5 6 7 8 9).
   –  Neurons in the bottom layer represent pixel intensities.
•  A pixel gets to vote if it has ink on it.
   –  Each inked pixel can vote for several different shapes.
•  The shape that gets the most votes wins.

Hinton
Why the simple system does not work

•  A two layer network with a single winner in the top layer is equivalent to
having a rigid template for each shape.
–  The winner is the template that has the biggest overlap with the ink.

•  The ways in which shapes vary are much too complicated to be captured by
simple template matches of whole shapes.
–  To capture all the allowable variations of a shape we need to learn the
features that it is composed of.

Hinton
Examples of handwritten digits from a test set
General layered feedforward network!

There can be varying numbers of units in the different layers.!

A feedforward network does not have to be layered.!

Any connection topology that does not have recurrent connections is a feedforward network.!
Summary of Architecture for Backprop

Compare outputs with the correct answer to get an error signal.
Back-propagate the error signal to get derivatives for learning.

(Diagram: input vector → hidden layers → outputs.)

Hinton
Matrix representations of networks!
Need for supervised training!

We cannot define the hand-printed character A. We can only present examples and hope that the network generalizes.!

The fact that this happens was a breakthrough in pattern recognition.!


Give the neuron a graded output value between 0 and 1!

Put the sigmoid function here.!

Logistic sigmoid equation!

Sigmoid plots: the sigmoid and its derivative.!
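The logistic sigmoid plotted here is presumably the standard one:

```latex
\mathrm{logsig}(x) = \frac{1}{1 + e^{-x}}
```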
Simple transformations on functions!
Odd and Even Functions!

Even: f(x) = f(-x)! Flip about y-axis!

Odd: f(x) = -f(-x)! Flip about y-axis and flip about x-axis.!

Even: cos, cosh!

Odd: sin, sinh!


sinh, cosh, and tanh!
Relation between logsig and tanh!
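The identity relating the two is the standard one (stated here since the slide body is not preserved in this text):

```latex
\tanh(x) = 2\,\mathrm{logsig}(2x) - 1,
\qquad
\mathrm{logsig}(x) = \tfrac{1}{2}\bigl(1 + \tanh(x/2)\bigr).
```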
Exact formula for setting weights!

1.  Two classes!

2.  Gaussian distributed!

3.  Equal variances!

4.  Naïve Bayes!
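Under these assumptions the posterior is a logistic sigmoid of a linear function of x, so the weights can be written down in closed form. This is the standard result in Bishop; the symbols μ_k, Σ, and the priors p(C_k) are the usual class means, (shared) covariance, and priors, which the slide itself does not define here:

```latex
p(C_1 \mid x) = \mathrm{logsig}\!\left(w^T x + w_0\right),
\qquad
w = \Sigma^{-1}(\mu_1 - \mu_2),
\qquad
w_0 = -\tfrac{1}{2}\mu_1^T \Sigma^{-1}\mu_1 + \tfrac{1}{2}\mu_2^T \Sigma^{-1}\mu_2 + \ln\frac{p(C_1)}{p(C_2)}.
```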


Gaussian Distribution!
Fitting a bivariate Gaussian to 2D data

Data is presented as a scatter plot.

From: Florian Hahne


Bivariate Gaussian!

Optimal decision boundary!

Exact formula for setting weights!

However, in most applications, we must train the wts. (Bishop, 1995)!
Derivatives of sigmoidal functions!

logsig'(x) = logsig(x) · (1 - logsig(x))

tanh'(x) = 1 - tanh²(x)

If you know the value of the function at a particular point, you can quickly compute its derivative at that point.!
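A quick numeric sanity check of both identities (a small sketch; the point x = 0.7 is an arbitrary choice):

```python
import math

def logsig(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.7, 1e-6

# Finite-difference derivatives vs. the closed forms above.
print((logsig(x + h) - logsig(x - h)) / (2 * h), logsig(x) * (1 - logsig(x)))
print((math.tanh(x + h) - math.tanh(x - h)) / (2 * h), 1 - math.tanh(x) ** 2)
```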
Error Signal (Cost function)!

For regression: sum of squares.!

For classification: cross entropy,!

E = - Σ_{t=1..S} Σ_{k=1..c} d_k(t) ln a_k(t)      (S = # of cases, c = # of classes)
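For completeness, the sum-of-squares cost named above, written in the same d/a notation (a standard form; an assumption about the slide's exact expression):

```latex
E = \frac{1}{2}\sum_{t=1}^{S}\sum_{k=1}^{c}\bigl(d_k(t) - a_k(t)\bigr)^2
```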
Error Measure for Linear Unit!

This will be the cost function.!

Instantaneous slope (i.e. derivative)!

Slope: how fast a line is increasing.!

f'(x) ≈ (f(x + Δx) - f(x)) / Δx

(Figure: f(x) and f(x + Δx) on the y-axis, x and x + Δx on the x-axis, separated by Δx.)
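A quick worked instance of this approximation (the function and step size are illustrative choices):

```latex
f(x) = x^2,\quad x = 2,\ \Delta x = 0.01:\qquad
\frac{f(2.01) - f(2)}{0.01} = \frac{4.0401 - 4}{0.01} = 4.01 \approx f'(2) = 4.
```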
Principle of gradient descent!

Goal is to find the minimum of the cost function.!

Suppose the estimate is x = 1.!

Then the derivative is > 0.!

Update the estimate in the negative direction.!
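A one-step illustration, assuming a cost E(x) = x² and learning rate η = 0.1 (both illustrative choices, not from the slide):

```latex
E(x) = x^2,\quad E'(1) = 2 > 0,\qquad
x_{\text{new}} = x - \eta\,E'(x) = 1 - 0.1 \cdot 2 = 0.8,
```

so the estimate moves toward the minimum at x = 0.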
How the derivative varies with the error function!

Positive gradient.!

Minimum of cost function.!

Negative gradient.!
Deriving a learning rule: trains wts for single linear unit!

Derivative of cost function wrt weight w_1.!

t denotes test case or trial.!

S denotes # of cases in a batch.!

Weight update rule from gradient descent!
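In the deck's d/a notation, the derivation being referred to is presumably the standard one (the exact symbols on the slide are not preserved in this text):

```latex
E = \frac{1}{2}\sum_{t=1}^{S}\bigl(d(t) - a(t)\bigr)^2,\quad a(t) = \sum_i w_i x_i(t)
\;\Rightarrow\;
\frac{\partial E}{\partial w_1} = -\sum_{t=1}^{S}\bigl(d(t) - a(t)\bigr)x_1(t),
\qquad
\Delta w_1 = \eta \sum_{t=1}^{S}\bigl(d(t) - a(t)\bigr)x_1(t).
```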


Simple example: learn AND with linear unit.!

# of cases in a batch = 4!

Since there are only two weights, it’s possible to visualize the error surface.!
Composite error surface for linear unit learning AND!

Error surface is flipped upside down, so that it is easier to see.!
In this case, quadratic.!
Optimal weight values are at top of surface (w1 = w2 = 1/3).!
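A minimal batch-gradient-descent sketch for this example (learning rate and epoch count are illustrative choices). With no bias weight, it converges to w1 = w2 = 1/3 as stated above:

```python
# Batch gradient descent for a single linear unit (no bias) learning AND.
inputs  = [(0, 0), (0, 1), (1, 0), (1, 1)]
desired = [0, 0, 0, 1]

w = [0.0, 0.0]
eta = 0.1

for epoch in range(500):
    grad = [0.0, 0.0]
    for x, d in zip(inputs, desired):
        a = w[0] * x[0] + w[1] * x[1]          # linear unit output
        for i in range(2):
            grad[i] += -(d - a) * x[i]         # dE/dw_i for E = 1/2 * sum (d - a)^2
    for i in range(2):
        w[i] -= eta * grad[i]                  # step down the gradient

print(w)   # approximately [0.333..., 0.333...]
```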
Error surface for each of the 4 cases.!

Previous error surface is obtained by adding these together.!
Adding a Logistic Sigmoid Unit!

Formula for the sigmoidal unit, g(x).!

Derivative of the sigmoidal unit.!


Derivative of cost function for linear unit with sigmoid!

Chain rule!

Derivative of sigmoid.!
Error surface with a sigmoidal unit!


plateau!
Backpropagation of errors!

Update rule for wt in output layer!

New form!

Function of h, which is defined on next slide.!


Backpropagation of errors!

New form!
Δw_ji = η o_i δ_j(h_j)
Arbitrary location in network.!
For output layer!
δ_n = [d_n - o_n] g'(h_n)

For hidden layers!
δ_j = g'(h_j) Σ_k w_kj δ_k

Adapted from Bishop!


Delta_n: output layer!

1.  If sigmoid, then o(1-o).!


2.  If tanh, then 1-o^2.!
3.  If linear, then 1.!
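A compact sketch of these update equations for one hidden layer, using sigmoid units throughout. The network size, XOR training data, learning rate, and initial weights are illustrative assumptions, not the deck's example:

```python
import math, random

def g(h):                       # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-h))

# Illustrative choices: 2 inputs, 3 hidden sigmoid units, 1 sigmoid output.
random.seed(0)
N_IN, N_HID = 2, 3
W1 = [[random.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
b1 = [0.0] * N_HID
W2 = [random.uniform(-0.5, 0.5) for _ in range(N_HID)]
b2 = 0.0
eta = 0.5
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR targets

for epoch in range(10000):
    for x, d in data:
        # Feedforward sweep.
        h_hid = [sum(W1[j][i] * x[i] for i in range(N_IN)) + b1[j] for j in range(N_HID)]
        o_hid = [g(h) for h in h_hid]
        o_out = g(sum(W2[j] * o_hid[j] for j in range(N_HID)) + b2)

        # Output layer: delta_n = [d_n - o_n] g'(h_n), with g'(h) = o(1 - o) for a sigmoid.
        delta_out = (d - o_out) * o_out * (1 - o_out)
        # Hidden layer: delta_j = g'(h_j) * sum_k w_kj * delta_k.
        delta_hid = [o_hid[j] * (1 - o_hid[j]) * W2[j] * delta_out for j in range(N_HID)]

        # Weight updates: Delta w_ji = eta * o_i * delta_j.
        for j in range(N_HID):
            W2[j] += eta * o_hid[j] * delta_out
            for i in range(N_IN):
                W1[j][i] += eta * x[i] * delta_hid[j]
            b1[j] += eta * delta_hid[j]
        b2 += eta * delta_out

# Trained outputs: typically close to the XOR targets, though backprop can
# occasionally stall in a poor local minimum on this small task.
for x, d in data:
    o_hid = [g(sum(W1[j][i] * x[i] for i in range(N_IN)) + b1[j]) for j in range(N_HID)]
    print(x, d, round(g(sum(W2[j] * o_hid[j] for j in range(N_HID)) + b2), 2))
```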
Initializing weights!

1.  Most methods set weights to randomly chosen small values.!

2.  Random values are chosen to avoid symmetries in the network.!

3.  Small values are chosen to avoid the saturated regions of the
sigmoid where the slope is near zero.!
Time complexity of backprop!

Feedforward sweep is O(w), where w is the number of weights.!

Backprop sweep is also O(w).!
Basic Autoencoder!

Ashish Gupta!
Sparse Autoencoder!

The number of hidden units is greater than the number of input units.!

The cost function is chosen to make most of the hidden units inactive.!

Cost function for sparse AE.!

Three terms: minimize SS error, minimize wt magnitude, minimize # of active units.!
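A sketch of such a cost in NumPy terms. The penalty weights lam and beta, the KL-based sparsity term, and all variable names are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def sparse_ae_cost(X, X_hat, A_hidden, W_list, lam=1e-4, beta=3.0, rho=0.05):
    """Sum-of-squares reconstruction error + weight-magnitude penalty
    + sparsity penalty on the average hidden activations (illustrative form)."""
    sse = 0.5 * np.sum((X - X_hat) ** 2)                            # minimize SS error
    weight_decay = 0.5 * lam * sum(np.sum(W ** 2) for W in W_list)  # minimize wt magnitude
    rho_hat = np.mean(A_hidden, axis=0)                             # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat) +
                (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))      # minimize # of active units
    return sse + weight_decay + beta * kl

# Tiny usage example with random placeholder arrays.
X = np.random.rand(10, 6); X_hat = X * 0.9
A_hidden = np.random.rand(10, 8) * 0.2 + 0.01
W_list = [np.random.randn(6, 8), np.random.randn(8, 6)]
print(sparse_ae_cost(X, X_hat, A_hidden, W_list))
```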
Limited-memory BFGS!

Limitation of gradient descent: it assumes the direction of maximum change of the error (the gradient) is a good direction for the update.!

A better choice maximizes the ratio of gradient to curvature of the error surface.!

Newton’s method takes curvature into account.!

Curvature is given by the Hessian matrix H. Newton’s method requires inverting H, which is infeasible for large problems.!

BFGS approximates the inverse of H without actually inverting the Hessian. Thus it is a quasi-Newton method.!

Broyden, Fletcher, Goldfarb, Shanno!
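For reference, a minimal use of an off-the-shelf limited-memory BFGS optimizer (SciPy's L-BFGS-B). The quadratic objective is just a stand-in for a network's cost function and gradient:

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in cost and gradient; in practice these would be the network's
# error and the gradients computed by backprop.
target = np.array([1.0, -2.0, 0.5])

def cost(w):
    return 0.5 * np.sum((w - target) ** 2)

def grad(w):
    return w - target

result = minimize(cost, x0=np.zeros(3), jac=grad, method="L-BFGS-B")
print(result.x)   # close to [1.0, -2.0, 0.5]
```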
Purpose of sparse autoencoder!

Discover an overcomplete basis set.!

Use basis set as a set of input features to a learning algorithm.!

Can think of this as a form of highly sophisticated preprocessing.!


Summary of Architecture for Backprop

Compare outputs with the correct answer to get an error signal.
Back-propagate the error signal to get derivatives for learning.

(Diagram: input vector → hidden layers → outputs.)

Hinton
Training Procedure: Make sure lab results apply to the application domain.!

If you have enough data:!


1.  Training set!
2.  Testing set!
3.  Validation set!

Otherwise:!

Ten-fold cross-validation. Allows use of all data for training.!
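A sketch of ten-fold cross-validation using scikit-learn's KFold splitter; the toy data and the use of scikit-learn are illustrative assumptions:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 100 examples, 6 features, binary labels (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kf = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # Train the network on (X_train, y_train) and evaluate on (X_test, y_test).
    print(f"fold {fold}: {len(train_idx)} training cases, {len(test_idx)} test cases")
```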


Curse of Dimensionality

Amount of data needed to sample a high-dimensional space grows exponentially with D.

(Figure: example input spaces for D = 1 (x1), D = 2 (x1, x2), and D = 3 (x1, x2, x3).)
A feature preprocessing example!

Distinguishing mines from rocks in sonar returns.!

There exists a set of weights to solve the problem. Gradient descent searches for the desired weights.!
NETtalk – Sejnowski and Rosenberg (1987)!

26 outputs coding phonemes!

80 hidden units!

203-bit binary input vector: 29 bits for each of 7 characters (including punctuation), 1-out-of-29 code!

Trained on 1024 words; capable of intelligible speech after 10 training epochs, 95% training accuracy after 50 epochs, and 78% generalization accuracy.!
Limitations of back-propagation

•  It requires labeled training data.


–  Almost all data in an applied setting is unlabeled.

•  The learning time does not scale well


–  It is very slow in networks with multiple hidden layers.

•  It can get stuck in poor local optima.


–  These are often quite good, but for deep networks they are far from optimal.

Hinton
Synaptic noise sources:!
1. Probability of vesicular release.!
2. Magnitude of response in case of vesicular release.!

Possible biological source of a quasi-global difference-of-reward signal!

Mesolimbic system or mesocortical system!

Error broadcast to every weight in the network!
A spectrum of machine learning tasks

Typical Statistics:
•  Low-dimensional data (e.g. less than 100 dimensions)
•  Lots of noise in the data
•  There is not much structure in the data, and what structure there is can be represented by a fairly simple model.
•  The main problem is distinguishing true structure from noise.

Artificial Intelligence:
•  High-dimensional data (e.g. more than 100 dimensions)
•  The noise is not sufficient to obscure the structure in the data if we process it right.
•  There is a huge amount of structure in the data, but the structure is too complicated to be represented by a simple model.
•  The main problem is figuring out how to represent the complicated structure in a way that can be learned.

Hinton
Generative Model for Classification!

Definition:!
1.  Model class-conditional densities p(x | C_k).!

2.  Model class priors: p(C_k).!

3.  Compute posterior: p(C_k | x) using Bayes’ Thm.!
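The posterior computation in step 3 is the usual Bayes' theorem:

```latex
p(C_k \mid x) = \frac{p(x \mid C_k)\,p(C_k)}{\sum_j p(x \mid C_j)\,p(C_j)}.
```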

Motivation:!

1.  Links ML to probability theory, the universal language for modeling uncertainty and evaluating evidence.!

2.  Links neural networks to ML.!


Knowledge structuring issues and information sources!

Aoccdrnig to a rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht frist and lsat ltteer is at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by its slef, but the wrod as a wlohe.
Top-down or Context Effects! Kanizsa figures!

Semantics of a structured pattern!
