KD III 1 LinearModels 1415

Knowledge Discovery WS 14/15
Linear Classifiers 3
Prof. Dr. Rudi Studer, Dr. Achim Rettinger*, Dipl.-Inform. Lei Zhang
{rudi.studer, achim.rettinger, l.zhang}@kit.edu
INSTITUT FÜR ANGEWANDTE INFORMATIK UND FORMALE BESCHREIBUNGSVERFAHREN (AIFB)
KIT – University of the State of Baden-Württemberg and

National Laboratory of the Helmholtz Association www.kit.edu
Knowledge Discovery Lecture WS14/15
22.10.2014 Introduction
Basics, Overview
29.10.2014 Design of KD-experiments
05.11.2014 Linear Classifiers
12.11.2014 Data Warehousing & OLAP
19.11.2014 Non-Linear Classifiers (ANNs) Supervised Techniques,
26.11.2014 Kernels, SVM Vector+Label Representation
03.12.2014 no lecture
10.12.2014 Decision Trees
17.12.2014 IBL & Clustering Unsupervised Techniques
07.01.2015 Relational Learning I
Semi-supervised Techniques,
14.01.2015 Relational Learning II
Relational Representation
21.01.2015 Relational Learning III
28.01.2015 Textmining
04.02.2015 Guest lecture Meta-Topics
11.02.2015 CRISP, Visualization
3 Institut AIFB
The Data Matrix for Supervised Learning
$X_j$: j-th input variable
$X = (X_0, \ldots, X_{M-1})^T$: vector of input variables
M: number of input variables
N: number of data points
Y: output variable
$x_i = (x_{i,0}, \ldots, x_{i,M-1})^T$: i-th input vector
$x_{i,j}$: j-th component of $x_i$
$y_i$: i-th target value
$d_i = (x_{i,0}, \ldots, x_{i,M-1}, y_i)^T$: i-th pattern
$D = \{d_1, \ldots, d_N\}$: (training) data set
z: test input vector
t: unknown test target value for z
$X = (x_1, \ldots, x_N)^T$: design matrix
Chapter III - 1
Linear Models
Today: Linear Models
„The Mother of ML algorithms“

Long history of linear models in statistics
Perceptron: Combination with biologically inspired ideas
Upcoming lectures:
ANNs: Extension to non-linear problems
SVMs: well-founded generalization
Chapter 3.1.a
Linear Regression
Concepts from Statistics in ML
  Empirical Risk Minimization
  Regularization
Least-Squares Estimator for Linear Regression (one-dimensional)
  Model: $f(x, w) = w_0 + w_1 x$ with $w = (w_0, w_1)^T$
  Squared error: $J_N(w) = \sum_{i=1}^{N} (y_i - f(x_i, w))^2$
  Estimator: $\hat{w} = \arg\min_w J_N(w)$
[Figure: $w_0 = 1$, $w_1 = 2$, $\mathrm{var}(\epsilon) = 1$]

Empirical Risk Minimization (1/4)
  Supervised learning: the target variable Y is to be predicted from the input variables X
    Vector of inputs: $X = (X_1, X_2, \ldots, X_n)$
    Single output: Y
  The only essential assumption is that $P(x, y)$ is stationary (fixed and unknown)
  Define a class of learning machines (function class)
    Example: functions $f(x, w)$ with parameter vector w, mapping x to y
  Define a loss function (error function)
    For regression, the quadratic loss is typical: $\mathrm{loss}(y, f(x, w)) = (y - f(x, w))^2$
Empirical Risk Minimization (2/4)
  Goal: find the parameters that minimize the loss, i.e. select from the chosen function class the function that minimizes the expected loss, which is given by the risk functional
$R(w) = \int \mathrm{loss}(y, f(x, w)) \, P(x, y) \, dx \, dy$
  Quadratic loss:
$R(w) = \int (y - f(x, w))^2 \, P(x, y) \, dx \, dy$
Empirical Risk Minimization (3/4)
  Probability theory assumes $P(x, y)$ to be known
    A typical task is then, e.g., to find the best linear estimator
  In statistics, $P(x, y)$ is unknown and needs to be estimated from data; only a training data set D (a sample) of size N is known:
$D = \{(x_i, y_i)\}_{i=1}^{N}$
  Assumption: the data is i.i.d. (independent and identically distributed)
Empirical Risk Minimization (4/4)
  Following the principle of empirical risk minimization, training minimizes the empirical risk, which approximates the real (expected) loss:
$R(w) \approx R_{emp}(w) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{loss}(y_i, f(x_i, w))$
  Quadratic loss: the empirical risk is the mean squared error on the training data
$R_{emp}(w) = \frac{1}{N} \sum_{i=1}^{N} (y_i - f(x_i, w))^2$
  We also define
$J_N(w) = \sum_{i=1}^{N} (y_i - f(x_i, w))^2 = N \times R_{emp}(w)$
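The relation between the empirical risk and the sum of squared errors can be checked numerically. The following is a minimal sketch, assuming the one-dimensional model f(x, w) = w0 + w1 x; the function names and the toy data are made up for illustration and are not from the lecture.

```python
import numpy as np

def f(x, w):
    """One-dimensional linear model f(x, w) = w0 + w1 * x."""
    return w[0] + w[1] * x

def empirical_risk(x, y, w):
    """R_emp(w): mean squared error over the training set."""
    return np.mean((y - f(x, w)) ** 2)

def sum_squared_error(x, y, w):
    """J_N(w): sum of squared errors, equal to N * R_emp(w)."""
    return np.sum((y - f(x, w)) ** 2)

# Toy training set (illustrative values only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 8.0])
w = np.array([1.0, 2.0])

r_emp = empirical_risk(x, y, w)
j_n = sum_squared_error(x, y, w)
```

Only the last data point deviates from the model (by 1), so J_N is 1 and R_emp is 1/4 for this toy set.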
Linear Models
  Vector of inputs: $X = (x_1, \ldots, x_N)$
  Output vector: Y
  Linear model: $\hat{y}_i = \sum_{j=1}^{M} x_{i,j} w_j + b$
  Estimate: w, b
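Least Squares Estimate for Linear Regression
  Least squares: $RSS(w) = \sum_{i=1}^{N} (y_i - x_i^T w)^2$
  Vector form: $\hat{Y} = X w$
$RSS(w) = (Y - Xw)^T (Y - Xw)$
  Calculate: setting the derivative with respect to w to zero gives $X^T (Y - Xw) = 0$
$\hat{w} = (X^T X)^{-1} X^T Y$
  (notation: RSS corresponds to $J_N = N \times R_{emp}$ in the previous slides; both have the same minimizer)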
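The closed-form least-squares estimate can be sketched with NumPy as follows. The data are synthetic (generated with w0 = 1, w1 = 2 and unit noise variance, seed chosen arbitrarily), and solving the normal equations with `np.linalg.solve` rather than inverting X^T X explicitly is an implementation choice of this sketch, not something prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
x = rng.uniform(-2.0, 2.0, size=N)
# Synthetic targets: y = w0 + w1 * x + noise, with w0 = 1, w1 = 2, var(eps) = 1
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=N)

# Design matrix with a constant column so that w[0] plays the role of w0
X = np.column_stack([np.ones(N), x])

# Normal equations: (X^T X) w = X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# At the minimum the residual is orthogonal to the columns of X
residual = y - X @ w_hat
```

The orthogonality of the residual to the columns of X is exactly the condition X^T (Y - Xw) = 0.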
Regularization
  Risk of overfitting when using Empirical Risk Minimization: for a finite N, a model that is too complex is selected
  $\hat{w}_{LS}$ can be very unstable, i.e. it reacts very sensitively to small changes in the data; it is especially not robust if $M \approx N$
  Solution: Regularization (theory of ill-conditioned problems): introduce a penalty term
$R_{emp}(w) + \text{complexity term}$
  Regularized loss function (penalized least squares (PLS), ridge regression, weight decay): the influence of any single input variable should be small
$J_N^{pen}(w) = \sum_{i=1}^{N} (y_i - f(x_i, w))^2 + \lambda \sum_{i=0}^{M-1} w_i^2$, with $\lambda \ge 0$
  Example of a complexity term: $w^T w = \sum_i w_i^2$
$\hat{w}_{Pen} = (X^T X + \lambda I)^{-1} X^T y$
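A minimal sketch of the penalized least-squares (ridge) solution, assuming the regularization strength is called `lam` and that all weights, including w0, are penalized. The function name `ridge_fit` and the random data are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Penalized least squares: solve (X^T X + lam * I) w = X^T y."""
    M = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(M), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))   # N = 20 data points, M = 5 inputs
y = rng.normal(size=20)

w_ls = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
w_pen = ridge_fit(X, y, lam=10.0)   # lam > 0 shrinks the weight vector
```

Comparing `np.linalg.norm(w_pen)` with `np.linalg.norm(w_ls)` shows the shrinkage effect of the penalty term.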
Chapter 3.1.b
Linear Classifiers
2D Example
Classification by Regression
  Linear regression: $f(x_i, w) = w_0 + \sum_{j=1}^{M-1} w_j x_{i,j} = x_i^T w$
  Define the target $y_i = 1$ if pattern $x_i$ belongs to class 1 and $y_i = 0$ otherwise (i.e. if $x_i$ belongs to class 0)
  Calculate the weights $w_{LS} = (X^T X)^{-1} X^T y$ as the least-squares solution, exactly as in linear regression
  For a new pattern z, compute $f(z) = z^T w_{LS}$; assign z to class 1 if $f(z) > 1/2$, otherwise to class 0
  Called "Linear Discriminant Function"
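The procedure above can be sketched in a few lines. The two Gaussian clusters are made-up data, and the helper name `classify` is hypothetical; the 0/1 targets and the threshold 1/2 follow the slide.

```python
import numpy as np

rng = np.random.default_rng(2)
# Two made-up 2D classes, well separated
X0 = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(30, 2))   # class 0
X1 = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(30, 2))     # class 1
X = np.vstack([X0, X1])
X = np.column_stack([np.ones(len(X)), X])     # constant component carries w0
y = np.concatenate([np.zeros(30), np.ones(30)])

# Least-squares weights, exactly as in linear regression
w_ls = np.linalg.solve(X.T @ X, X.T @ y)

def classify(z, w):
    """Assign class 1 if f(z) = z^T w > 1/2, otherwise class 0."""
    return (z @ w > 0.5).astype(int)

pred = classify(X, w_ls)
accuracy = np.mean(pred == y)
```

New patterns must be extended by the same constant component, e.g. `classify(np.array([1.0, 2.0, 2.0]), w_ls)`.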
Example: Classification by Regression with Linear Functions
Chapter 3.2.a
The Perceptron
(Single Layer ANN)
Introduction
  Artificial Neural Networks (ANNs) are learning systems

inspired by the structure of neural information processing.
  Key ideas:
  Simple, adaptive computational units (artificial neurons),
  Connections between neurons for information propagation.
  Key properties:
  ANNs are a powerful and flexible learning method,
  ANNs show a black-box behavior: the eventual model is not easy to
understand for humans and hard to analyze theoretically.
Biological Inspiration:
Neural Information Processing
  Information processing between neurons works by means

of electrical excitations.
  Arrival of an action potential at the axon end triggers release of
neurotransmitters at synaptic gap.
  Neurotransmitters can stimulate or suppress the electrical
excitability at a target cell.
  New action potential triggered in the target cell if the incoming
excitations reach a certain threshold.
  Statistics:
  Human Brain: ~100 billion ($10^{11}$) neurons and ~100 trillion ($10^{14}$) synapses.
  Nematode Worm (Caenorhabditis elegans): 302 neurons,
completely mapped.
Abstract Model of the Neuron
Cell body
Dendrites
Axon
Summation
Activation function
Variants:
The McCulloch-Pitts Model (1947) considers only binary inputs & weights
McCulloch, W., and W. Pitts (1947), “How We Know Universals: the Perception of Auditory and Visual Forms”, Bulletin of Mathematical
Biophysics, Vol. 9, pp. 127–147.
The Perceptron (Rosenblatt, 1958) considers any real-valued inputs & weights.
Rosenblatt, F. (1958), “The Perceptron: a Probabilistic Model for Information Storage and Organization in the Brain”, Psychological
Review, Vol. 65, pp. 386–408.
Formal Perceptron Model
Sign- (Threshold- / Step-) Function
  Convention: sign(0) = -1
  NB: an alternative formulation with a threshold θ, where b = -θ, corresponds to a step function at θ.
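As a minimal sketch of the formal model, assuming the perceptron output is the sign of the weighted input sum plus bias (the slide's own formula is not reproduced in this text, so this form and the names below are assumptions):

```python
import numpy as np

def sign(a):
    """Threshold function with the convention sign(0) = -1."""
    return 1 if a > 0 else -1

def perceptron_output(x, w, b):
    """Weighted sum of the inputs plus bias, passed through the sign function."""
    return sign(np.dot(w, x) + b)

w = np.array([1.0, -1.0])
b = -0.5   # equivalent to a step at theta = 0.5, since b = -theta

out_pos = perceptron_output(np.array([2.0, 1.0]), w, b)   # activation 0.5
out_neg = perceptron_output(np.array([1.0, 1.0]), w, b)   # activation -0.5
```

With these illustrative weights, the first input lies above the threshold and the second below it.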
Perceptron: Linear Threshold Functions
Perceptron encodes a simple linear discriminant function.

  Generic notation:
  NB: Many other learning algorithms (e.g. SVMs) use linear

discriminant functions.
  (notation n = M in previous slides)
Linear Classifiers: Geometric Interpretation
  Input space is divided into two half-spaces.
  Decision boundary is an (n-1)-dimensional hyperplane defined by: $w^T x + b = 0$
  Interpretation of the parameters:

  Weight vector w is perpendicular to the hyperplane.
  Bias b is proportional to distance from origin, which
is |b| / ||w||.
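Both statements can be verified numerically. The sketch below assumes the hyperplane is given by w·x + b = 0 and uses arbitrary example values for w and b.

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
b = -10.0                  # illustrative bias

# Distance of the hyperplane w.x + b = 0 from the origin
distance = abs(b) / np.linalg.norm(w)

# Point of the hyperplane closest to the origin (it lies along w)
p = -b * w / np.dot(w, w)

# A direction inside the hyperplane; it is perpendicular to w
d = np.array([-4.0, 3.0])
```

Here the distance is 10 / 5 = 2, and moving from p along d stays on the hyperplane because w·d = 0.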
Example: Classification of Iris Setosa
Linear Classifiers: Alternative Formulation
  NB: Often, the bias term b is represented as component w0 of the weight vector w; the input space then needs to be extended by a component x0 = 1 for all inputs.
  Generic notation simplifies to:
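A short sketch of this reformulation, showing that the activation is unchanged when b is folded into the weight vector. The helper name `augment` and the example values are assumptions for illustration.

```python
import numpy as np

def augment(X):
    """Prepend the constant component x0 = 1 to every input vector."""
    X = np.atleast_2d(X)
    return np.column_stack([np.ones(X.shape[0]), X])

w = np.array([1.0, -1.0])
b = -0.5
w_ext = np.concatenate([[b], w])   # extended weight vector with w0 = b

X = np.array([[2.0, 1.0], [1.0, 1.0], [0.0, 3.0]])
a_original = X @ w + b             # activation with explicit bias
a_extended = augment(X) @ w_ext    # activation with bias folded into w
```

Both formulations yield identical activations, so learning algorithms only need to handle a single parameter vector.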
Perceptron Training Task (1st Version)
"Given a set of training instances, find parameters w and b

that are capable of separating positive and negative input
patterns in the training set."
NB: we will see that this is not always possible.
“Perceptron Training” Algorithm
  learning rate α determines the intensity of change (assumption: α = 1)
  redirects the weight vector towards/away from the misclassified training point
  adjusts the bias for increased vector norm
Rosenblatt, F. (1958), “The Perceptron: a Probabilistic Model for

Information Storage and Organization in the Brain”, Psychological
Review, Vol. 65, pp. 386–408.
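The slide's pseudocode is not reproduced in this text, so the following is a sketch of a common mistake-driven variant, not necessarily the exact rule shown in the lecture. In particular, the plain bias update b ← b + α·y is an assumption; the slide's annotation suggests its bias update may be scaled differently. Labels are +1 / -1, and the function name and toy data are illustrative.

```python
import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=1000):
    """Mistake-driven perceptron training; labels y must be +1 / -1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # Convention sign(0) = -1: an activation <= 0 counts as class -1
            prediction = 1 if np.dot(w, x_i) + b > 0 else -1
            if prediction != y_i:
                w = w + alpha * y_i * x_i   # turn w towards / away from x_i
                b = b + alpha * y_i         # shift the threshold (assumed form)
                mistakes += 1
        if mistakes == 0:                   # all training points are correct
            return w, b, True
    return w, b, False

# Logical AND with +1 / -1 labels (a linearly separable toy problem)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([-1, -1, -1, 1])

w, b, converged = perceptron_train(X, y)
predictions = np.where(X @ w + b > 0, 1, -1)
```

The `max_epochs` cap is a safeguard of this sketch for data that are not linearly separable, where the loop would otherwise never terminate.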
Convergence of “Perceptron Training”
Perceptron Training will always converge, given that the

training examples are linearly separable.
Why should the update rule converge towards the desired solution?
  Case 1: suppose a training example is classified correctly

with current parameters, then these won't be changed.
  Case 2a: example was classified -1 but real value was +1
  next time, we want a higher activation for this example
  so we need to change the weights
  increase each weight more, the larger the corresponding input of the current example is
  Case 2b: example was classified +1 but real value was -1

  next time we want lower activation for this example
  see above but opposite direction for changes.
Example: Classification of Iris Setosa
[Figure panels: 100 iterations, 300 iterations, 600 iterations, 722 iterations]
Example: Classification of Iris Versicolor
??
Linear Separability
  Linear Separability – Classification problems for which a

model can be expressed in the form of linear threshold
functions are called linearly separable.
  Most real problems are, however, not linearly separable:

  Case I: linear pattern exists but is blurred (noise) – we might be
happy with a linear model but try to minimize the (unavoidable)
error.
  Case II: underlying pattern is not linear in principle – we will need a
different (non-linear !) model.
Perceptron Training Task (2nd Version)
"Given a set of training instances, find parameters w and b

that minimize the error on the training set."
Unthresholded Perceptron
  Consider for now an unthresholded neuron (linear output unit), $o(x) = w^T x + b$, vs. the thresholded perceptron, $o(x) = \mathrm{sign}(w^T x + b)$
  This corresponds to a linear regression model, i.e. a linear

model for a numeric response variable.
Squared Error
  Given numeric target values and numeric outputs, we can

define a loss function l, which measures how severe a
certain error is.
  Popular (standard) choice is the squared error loss:
  We can use the squared error for regression settings and

for classification settings (by using +1/-1 as target values).
Squared Error
Figure taken from Mitchell (1997)
Error Surface and Gradient Descent
[Figure: error surface over the weights w1 and w2, with the direction of steepest descent along the error surface. Figure taken from Mitchell (1997).]
Gradient Descent
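  Gradient – direction in weight space that produces the steepest increase in error (partial derivatives of the error with respect to the individual components of the current weight vector): $\nabla E(w) = \left( \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right)$
  The update step then becomes (α = learning rate): $w \leftarrow w - \alpha \nabla E(w)$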
Derivation of Gradient for Squared Error
NB: Derivation is analogous for b = w0 (by assuming x0=1 for any input)
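The slide's derivation is not reproduced in this text. A sketch of the standard derivation follows, assuming the error is defined with the conventional factor 1/2, as in Mitchell (1997), and a linear output unit $o_i = \sum_j w_j x_{i,j}$; both are assumptions about the notation used in the lecture.

```latex
\begin{aligned}
E(w) &= \tfrac{1}{2} \sum_{i=1}^{N} (y_i - o_i)^2, \qquad o_i = \sum_{j} w_j x_{i,j} \\
\frac{\partial E}{\partial w_j}
  &= \tfrac{1}{2} \sum_{i=1}^{N} 2\,(y_i - o_i)\, \frac{\partial}{\partial w_j}(y_i - o_i)
   = -\sum_{i=1}^{N} (y_i - o_i)\, x_{i,j} \\
w_j &\leftarrow w_j - \alpha \frac{\partial E}{\partial w_j}
   = w_j + \alpha \sum_{i=1}^{N} (y_i - o_i)\, x_{i,j}
\end{aligned}
```

The factor 1/2 only cancels the 2 from the derivative; without it, the same update results with a rescaled learning rate.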
Gradient Descent Training for Perceptrons
(Batch Version)
  imagine this to be a dummy feature that equals 1 for all inputs
  w and b are updated after all examples have been processed
  see later
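The annotated pseudocode is not reproduced in this text; the batch version can be sketched as follows, assuming a linear output unit, the error E = 1/2 Σ (y - o)², and a fixed number of passes. The function name, learning rate and toy data are illustrative assumptions.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.05, epochs=500):
    """Batch gradient descent for a linear unit with E = 1/2 * sum (y - o)^2."""
    # Dummy feature x0 = 1 for all inputs, so that w[0] plays the role of b
    X = np.column_stack([np.ones(len(X)), X])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                          # outputs for all training examples
        delta_w = alpha * X.T @ (y - o)    # contributions summed over all examples
        w = w + delta_w                    # one update per pass over the data
    return w

# Noise-free toy data generated from y = 1 + 2 * x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = batch_gradient_descent(X, y)
```

For these data the weights approach (1, 2); a learning rate that is too large for the data scale would make the iteration diverge instead.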
Gradient Descent: Batch version vs Delta Rule
  Batch version of Gradient Descent computes the actual

gradient after looking at all training examples. However,
this may be computationally expensive.
  The incremental version of Gradient Descent, also called Delta Rule (or Widrow-Hoff rule, LMS, ...), uses an approximation to the batch version: weights are updated incrementally each time an individual example has been processed.
Gradient Descent Training for Perceptrons
(Delta Rule Version)
updates are performed

incrementally after looking at
each training example
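A sketch of the incremental version under the same assumptions as for the batch case (linear output unit, squared error, dummy feature for the bias); the only change is that the weights are updated inside the loop over examples. Names and data are illustrative.

```python
import numpy as np

def delta_rule(X, y, alpha=0.05, epochs=500):
    """Incremental gradient descent (delta rule) for a linear output unit."""
    X = np.column_stack([np.ones(len(X)), X])   # dummy feature x0 = 1
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            o_i = np.dot(w, x_i)                  # output for this example only
            w = w + alpha * (y_i - o_i) * x_i     # update right after each example
    return w

# Noise-free toy data generated from y = 1 + 2 * x
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = delta_rule(X, y)
```

On noisy data the incremental updates keep fluctuating around the minimum unless the learning rate is reduced over time.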
Choice of Learning Rate and Termination Criterion
  Learning rate α:
  moderates the width of update steps; hard to choose in practice
  large steps may jump too far over a possible solution
  small steps may lead to too many iterations
  sometimes implemented as a function of the iterations which becomes smaller over time.
  Termination criterion (selection):

  Error below certain threshold (This is a bad criterion! Why?)
  Change of error between iterations below certain threshold.
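Both ideas can be combined in a small sketch: a learning rate that shrinks over the iterations and termination on the change of error. The decay schedule, the tolerance and all names are hypothetical choices for illustration. The toy data are noisy, so the minimum error stays above zero; this hints at one reason why a fixed error threshold can be a poor criterion, since it may never be reached.

```python
import numpy as np

def train_with_schedule(X, y, alpha0=0.05, tol=1e-10, max_epochs=10000):
    """Batch gradient descent with decaying learning rate and error-change stop."""
    X = np.column_stack([np.ones(len(X)), X])   # dummy feature x0 = 1
    w = np.zeros(X.shape[1])
    prev_error = np.inf
    for epoch in range(max_epochs):
        alpha = alpha0 / (1.0 + 0.01 * epoch)   # hypothetical decay schedule
        w = w + alpha * X.T @ (y - X @ w)
        error = 0.5 * np.sum((y - X @ w) ** 2)
        if abs(prev_error - error) < tol:       # change of error below threshold
            return w, error, epoch + 1
        prev_error = error
    return w, error, max_epochs

# Noisy toy data: no line fits all four points exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 8.0])

w, error, epochs_used = train_with_schedule(X, y)
```

For these data the least-squares solution is w = (0.8, 2.3) with E = 0.15, so a stopping rule such as "error below 0.1" would never trigger.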
„Perceptron Training“ vs Gradient Descent
  „Perceptron Training“ is guaranteed to converge…

  to feasible parameters
  if training examples are linearly separable
  Gradient Descent / Delta Rule is guaranteed to converge

approximately…
  to parameters with minimum squared error E
  even if training examples are not linearly separable
  problem remains, however, if the training data is not separable in
principle ("minimum" error will be very high).
Review: Perceptron – Components
Model class
•  Perceptrons = linear discriminant functions =

separating hyperplanes
Learning algorithm
•  Gradient descent / Delta Rule
Optimization criterion
•  Minimize squared error at output layer