
Module 9

Feedforward Neural Networks


Estimating a Function - Neural Network Models

Applications

• Pattern recognition
• Function estimation
• Classification
• Nonlinear modeling
• Prediction / Forecast
• Time series analysis
• Visualization

Neural networks complement other existing tools.
Artificial Neural Networks
Short history of neural network development

1943 “McCulloch/Pitts Cells”, first model of a neuron


1958 “Perceptron” (Rosenblatt)
1960 “Adaline” (Widrow & Hoff)
1969 Book “Perceptrons” (Minsky & Papert)
1970s Associative memory systems
1982 Hopfield networks (energy functions, spin-glass theory)
1982 Self-organizing networks (Kohonen)
1983 Stochastic networks (Hinton & Sejnowski)
1986 “Back-Propagation of Errors” (Rumelhart, McClelland et al.)
Estimating a Linear Function - The Perceptron

Idea: find a linear function f(x) = ax + b

INPUT:  x1, x2, x3
        "weights" w_i
        "Neuron" with "bias" θ

OUTPUT: f(x) = Σ_{i=1}^{3} w_i x_i + θ
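The output computation above can be sketched in Python; the function name `perceptron_output` is illustrative, not from the slides:

```python
def perceptron_output(x, w, theta):
    """Weighted sum of the inputs plus bias: f(x) = sum_i w_i * x_i + theta."""
    return sum(wi * xi for wi, xi in zip(w, x)) + theta

# three inputs, as in the diagram above
y = perceptron_output([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], 0.2)
```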
Perceptron Training (“Supervised Learning”)
Classification Task

1. Separate data into Training- and Test-Sets


• Typically an 80/20 split; better: a 50/50 split

2. Assign class labels (target values) to the data points


• Typically “1” for positive examples and
“0” or “-1” for negative examples

3. Determine the weights and bias values


• Optimization procedure, e.g. Gradient Descent

4. Test re-classification and classification accuracy


• Check separation of classes using training and test data
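Steps 1 and 4 can be sketched as follows (a minimal sketch; `split_data` and `accuracy` are hypothetical helper names, and the shuffle and seed are assumptions):

```python
import random

def split_data(data, frac_train=0.5, seed=0):
    """Step 1: shuffle, then separate the data into training and test sets
    (here a 50/50 split, as recommended above)."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * frac_train)
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, labelled):
    """Step 4: fraction of correctly classified (point, label) pairs."""
    return sum(classify(x) == y for x, y in labelled) / len(labelled)
```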
Gradient Descent Learning
The Perceptron Error Function

Minimize E[w]!   (sum over patterns µ)

E[w] = Σ_µ ( y^µ − ( Σ_i w_i x_i^µ + θ ) )²

y^µ: desired output (target value)
Σ_i w_i x_i^µ + θ: actual output (computed value)
Gradient Descent Learning
The Delta Rule (LMS rule, Widrow-Hoff rule)
w_new = w_old + Δw

Δw_i = −η ∂E/∂w_i        (η: learning rate)

Δw_i^µ = η δ^µ x_i^µ     with  δ^µ = y_target^µ − y_actual^µ

Note: θ is treated like a weight
(gradient descent moves w toward the optimum of E, where E_o = 0)
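The delta rule can be written as a small update function (a sketch; function and variable names are illustrative, and the convergence loop assumes a single repeated pattern):

```python
def delta_rule_step(w, theta, x, y_target, eta=0.1):
    """One delta-rule (LMS / Widrow-Hoff) update for a single pattern:
    delta = y_target - y_actual, then w_i += eta * delta * x_i.
    The bias theta is treated like a weight with constant input 1."""
    y_actual = sum(wi * xi for wi, xi in zip(w, x)) + theta
    delta = y_target - y_actual
    w = [wi + eta * delta * xi for wi, xi in zip(w, x)]
    return w, theta + eta * delta

# repeated presentation of a pattern drives its error toward zero
w, theta = [0.0, 0.0], 0.0
for _ in range(100):
    w, theta = delta_rule_step(w, theta, [1.0, 2.0], y_target=1.0)
```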
Gradient Descent Learning

Adaptive Learning Rate (Darken & Moody, 1991)

η_t = η_0 / (1 + t / r)

η_0: initial value
r: degree of adaptation (after r steps, 50% reduction of η)
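The schedule is a one-liner (the default values for η_0 and r are arbitrary examples, not from the slides):

```python
def adaptive_eta(t, eta0=0.5, r=100):
    """Darken & Moody schedule: eta_t = eta0 / (1 + t / r)."""
    return eta0 / (1.0 + t / r)
```

With r = 100, the rate has halved after 100 steps and quartered after 300.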
Gradient Descent Learning

Momentum Term (Rumelhart & McClelland, 1986)

Δw_i(t) = η δ x_i + α · Δw_i(t − 1)

α: constant
Δw_i(t − 1): weight change in the previous step
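The momentum update, sketched for one weight vector (the function name and the default α are assumptions):

```python
def momentum_step(eta, delta, x, prev_dw, alpha=0.9):
    """Delta-rule step plus alpha times the weight change of the previous step."""
    return [eta * delta * xi + alpha * dw for xi, dw in zip(x, prev_dw)]

# with alpha = 0 this reduces to the plain delta rule
dw = momentum_step(0.5, 1.0, [1.0, 1.0], [0.1, 0.2], alpha=0.5)
```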
Perceptron: Linear Classifier

[Figure: linear decision boundaries for separable and non-separable data]

• Separable data (optimum E_o = 0): many (infinitely many) solutions; the problem is "ill-posed"

• Non-separable data (optimum E_o > 0): one optimal solution (E = min.)
Perceptron: Linear Function Fitting

Input:   x1, x2
Weights: w1, w2
Neuron (linear)

[Figure: the plane f(x) over the (X1, X2) input space]

y = f(x) = Σ_{i=1}^{2} w_i x_i + θ

• The neuron can use other "activation" functions (e.g., sigmoidal)


• Perceptron (one neuron) classification is always linear in X
Estimating a Non-Linear Classifier Function
The Multi-Layer Network

INPUT:  x1, x2, x3
        input-to-hidden weights w_hi
Hidden Layer, with "activation function" act(ξ) = 1 / (1 + e^−ξ)
        hidden-to-output weights v_h
OUTPUT

y = f(x) = act( Σ_{h=1}^{HID} v_h · act( Σ_{i=1}^{IN} w_hi x_i + ϑ_h ) + θ )

(ϑ_h: hidden bias, θ: output bias)

Universal function approximators!
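A forward pass through such a network might look like this (a sketch; `mlp_forward` and its argument layout are assumptions, not from the slides):

```python
import math

def act(xi):
    """Sigmoidal activation function, act(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-xi))

def mlp_forward(x, W, hidden_bias, v, theta):
    """Forward pass of the two-layer network: W[h][i] is the input-to-hidden
    weight w_hi, hidden_bias[h] the hidden bias, v[h] the hidden-to-output
    weight, theta the output bias."""
    hidden = [act(sum(w_hi * xi for w_hi, xi in zip(W[h], x)) + hidden_bias[h])
              for h in range(len(W))]
    return act(sum(vh * Hh for vh, Hh in zip(v, hidden)) + theta)
```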
The Sigmoidal Activation Function

act(x) = 1 / (1 + e^−x)

[Figure: sigmoid curve rising from 0 to 1 over x = −6 … 6]

• compresses the output to the (0, 1) range

• its derivative is: act'(x) = act(x) · (1 − act(x))
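The derivative identity can be checked numerically (a small sketch; names are illustrative):

```python
import math

def act(x):
    return 1.0 / (1.0 + math.exp(-x))

def act_prime(x):
    """act'(x) = act(x) * (1 - act(x)), expressed through the function itself."""
    a = act(x)
    return a * (1.0 - a)

# compare against a central finite difference at an arbitrary point
h, x0 = 1e-6, 0.7
numeric = (act(x0 + h) - act(x0 - h)) / (2 * h)
```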
Gradient Descent Learning (Back-Propagation of Errors)
Multi-Layer Network: Function Approximation

Input:  x1, x2
Input-to-hidden weights: W_11, W_12, W_21, W_22
Hidden (sigmoidal), hidden-to-output weights V1, V2
Output (linear)

[Figure: the nonlinear surface f(x) over the (X1, X2) input space]

y = f(x) = Σ_{h=1}^{HID} v_h · act( Σ_{i=1}^{IN} w_hi x_i + ϑ_h ) + θ
Gradient Descent Learning
The Multi-Layer Network Error Function

Minimize E[w]!   (sum over patterns µ)

E[w] = Σ_µ ( y^µ − act( Σ_h v_h · act( Σ_i w_hi x_i^µ + ϑ_h ) + θ ) )²

y^µ: desired output (target value)
act(…): actual output (computed value); the inner act(…) is the hidden neuron output
Back-Propagation of Errors (“Backprop”) (1)

v_h^new = v_h^old + Δv_h

Δv_h = −η ∂E/∂v_h        (hidden-to-output weights)

After presentation of a pattern µ:

Δv_h = η δ H_h           (H_h: hidden output)

δ = (y_target − y_actual) · y_actual · (1 − y_actual)

(using act'(x) = act(x) · (1 − act(x)))
Back-Propagation of Errors (“Backprop”) (2)

w_hi^new = w_hi^old + Δw_hi

Δw_hi = −η ∂E/∂w_hi      (input-to-hidden weights)

After presentation of a pattern µ:

Δw_hi = η δ_h x_i        (x_i: input variables)
Back-Propagation of Errors (“Backprop”) (3)

Δw_ij = η · δ_j · inp_i   with  δ_j = ( Σ_k δ_k v_jk ) · sig(net_j) · (1 − sig(net_j))

Δv_jk = η · δ_k · o_j     with  δ_k = (t_k − o_k) · sig(net_k) · (1 − sig(net_k))
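Both update rules can be sketched for a network with a single sigmoidal output unit (the function name and argument layout are assumptions, not from the slides):

```python
def backprop_deltas(x, hidden, y, target, v, eta=0.1):
    """Weight changes for one pattern with one sigmoidal output unit.

    delta_out = (t - o) * sig'(net_out), using sig'(net) = o * (1 - o);
    the hidden deltas back-propagate delta_out through the weights v[h]."""
    delta_out = (target - y) * y * (1.0 - y)
    dv = [eta * delta_out * hidden[h] for h in range(len(v))]
    delta_hidden = [delta_out * v[h] * hidden[h] * (1.0 - hidden[h])
                    for h in range(len(v))]
    dW = [[eta * delta_hidden[h] * xi for xi in x] for h in range(len(v))]
    return dv, dW
```

When target and output agree, all weight changes vanish, as expected for a gradient step.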
Neural Network Training

[Figure: error vs. training time; the training error keeps decreasing,
while the validation and test errors pass through a minimum and rise again]

"Forced Stop": stop training at the minimum of the validation error
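The "forced stop" idea can be sketched as a generic loop (the callback API is hypothetical; in practice `train_epoch` would run one pass of backprop and `validation_error` would evaluate the validation set):

```python
def train_with_early_stop(train_epoch, validation_error, max_epochs=1000, patience=5):
    """Train while the validation error keeps improving; stop ("forced stop")
    after `patience` consecutive epochs without improvement."""
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()
        err = validation_error()
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# simulated run: validation error falls, then rises again
errors = iter([5.0, 4.0, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5])
best = train_with_early_stop(lambda: None, lambda: next(errors), patience=3)
```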
Mapping Chemical Space: “Drugs” and “Nondrugs”

120-dimensional data, Ghose & Crippen parameters


5’000 drugs, 5’000 nondrugs (Sadowski & Kubinyi, 1998)
Application of Neural Networks: “Drug-Likeness”
120 Ghose & Crippen descriptors as input; the network computes a score y = f(x)

[Figure: score histograms, score axis from 0 to 1]

Drugs:     Σ = 24% of molecules score near 0, Σ = 76% near 1
Nondrugs:  Σ = 76% of molecules score near 0, Σ = 24% near 1
“Drug-Likeness” ?
[Chemical structures of three correctly classified drugs]

Rocephin™ (Ceftriaxone): Score = 0.98
Fortovase™ (Saquinavir): Score = 0.96
Viagra™ (Sildenafil):    Score = 0.94
[Chemical structure]

Xenical™ (Orlistat): Score = 0.54
The Jury Decision Approach
Encoder Network

Training Mode:  pattern vector → pattern vector

Mapping Mode:   pattern vector → Factor 1, Factor 2
Sequence Analysis by ANN
Residue encoding: “Unary” vectors

[Network: a "sliding window" over the sequence (M R N L L V I …) is
presented to the Input layer; Hidden layer; Output gives the Score]

A  10000000000000000000
C  01000000000000000000
D  00100000000000000000
E  00010000000000000000
F  00001000000000000000
G  00000100000000000000
H  00000010000000000000
I  00000001000000000000
K  00000000100000000000
L  00000000010000000000
M  00000000001000000000
N  00000000000100000000
P  00000000000010000000
Q  00000000000001000000
R  00000000000000100000
S  00000000000000010000
T  00000000000000001000
V  00000000000000000100
W  00000000000000000010
Y  00000000000000000001
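The unary encoding and the sliding window can be sketched as follows (function names are illustrative; the residue ordering follows the table above):

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # row order of the table above

def unary_encode(residue):
    """20-bit one-hot ("unary") vector for a single residue."""
    vec = [0] * 20
    vec[AMINO_ACIDS.index(residue)] = 1
    return vec

def sliding_windows(sequence, width):
    """Overlapping windows fed to the network one position at a time."""
    return [sequence[i:i + width] for i in range(len(sequence) - width + 1)]

# a window of 3 residues becomes a 60-dimensional input vector
window = sliding_windows("MRNLLVI", 3)[0]
inputs = [bit for r in window for bit in unary_encode(r)]
```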
SignalP Output
Neural network tutorial

http://diwww.epfl.ch/mantra/tutorial/english/
