Wang Ling
Outline
● Part I - Neural Networks are our friends
○ Numbers are our friends
○ Variables are our friends
○ Operators are our friends
○ Functions are our friends
○ Parameters are our friends
○ Cost Functions are our friends
○ Optimizers are our friends
○ Gradients are our friends
Outline
● Part I - Neural Networks are our friends
● Part II - Into Deep Learning
○ Nonlinear Neural Models
○ Multilayer Perceptrons
○ Using Discrete Variables
○ Example Applications
Numbers are our friends
Abby
4
Variables are our friends
Abby Bert
4 5
Variables are our friends
Abby Bert
4x 5y
Operators are our friends
Bert
1
4
4x - 1x = 3x
3 1
Functions are our friends
If you give me 1 apple I will give you 3 bananas
1
4
?
5
Functions are our friends
y = 3x
● Input, x - Number of Apples given by Abby
Functions are our friends
y = 3x
● Input, x - Number of Apples given by Abby
● Output, y - Number of Bananas received by Abby
Functions are our friends
1
4
?
5
y = 3x , x =1
Functions are our friends
1
4
3
5
y = 3x , x =1
y=3
Functions are our friends
y = 3x
Functions are our friends
y : Spanish Sentence
x : English Sentence
Functions are our friends
y : Move
x : Board
Functions are our friends
y : Category
x : Image
Functions are our friends
y : Move
x : Board
??????????????????????????
Functions are our friends
y = 3x
Cookie Monster
Functions are our friends
y = ?? y = 3x
Find it out for
yourself
Functions are our friends
y = ??
1
0
Functions are our friends
y = ??
1
0
5
16
Functions are our friends
y = ??
1
0
5
16
6
20
Functions are our friends
I want to know how many bananas I get,
but I ran out of apples....
y = ??
1
0
5
16
6
20
3
?
Parameters are our friends
● Input
● Output
y = 3x + 1
Parameters are our friends
● Input
Model ● Output
y = wx + b ● Parameters
5
16
6
20
3
?
Parameters are our friends
y = wx + b
Data
1
0
5
16
6
20
3
?
Parameters are our friends
y = wx + b
Data
x ŷ
1 0
5 16
6 20
3
?
Parameters are our friends
Data Model
x ŷ y = wx + b
1 0
5 16
6 20
Parameters are our friends
Data Model
x ŷ y = wx + b
1 0
5 16
6 20
Model
Candidate 2 x ŷ y
1 0 4
y = 2x + 2 5 16 12
6 20 14
Parameters are our friends
Model: y = wx + b
Data
x ŷ
1 0
5 16
6 20
Model Candidate 1: y = 1x + 0
x ŷ y
1 0 1
5 16 5
6 20 6
Model Candidate 2: y = 2x + 2
x ŷ y
1 0 4
5 16 12
6 20 14
Which one is better ?
Cost functions are our friends
Data Model
Model x ŷ y
n x y yn = wxn + b Candidate 1
1 0 1
0 1 0
5 16 5
1 5 16 y = 1x + 0 6 20 6
2 6 20
Model
Candidate 2 x ŷ y
1 0 4
y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model
Model x ŷ y
n x y yn = wxn + b Candidate 1
1 0 1
0 1 0
5 16 5
1 5 16 y = 1x + 0 6 20 6
2 6 20
Cost Model
Candidate 2 x ŷ y
C(w,b) 1 0 4
y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model
Model x ŷ y
n x y yn = wxn + b Candidate 1
1 0 1
0 1 0
5 16 5
1 5 16 y = 1x + 0 6 20 6
2 6 20 Square Loss
Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4
n∈{0,1,2} y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1
0 1 0 1 5 16 5
1 5 16 y = 1x + 0 2 6 20 6
2 6 20
Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4
n∈{0,1,2} y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5
1 5 16 y = 1x + 0 2 6 20 6
2 6 20
Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4
n∈{0,1,2} y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5 121
1 5 16 y = 1x + 0 2 6 20 6
2 6 20
Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4
n∈{0,1,2} y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5 121
1 5 16 y = 1x + 0 2 6 20 6 196
2 6 20
Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4
n∈{0,1,2} y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5 121
1 5 16 y = 1x + 0 2 6 20 6 196
2 6 20
C(1,0) 318
Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4
n∈{0,1,2} y = 2x + 2 5 16 12
6 20 14
Cost functions are our friends
Model: yn = wxn + b
Model Candidate 1: y = 1x + 0
n x ŷ y (y-ŷ)²
0 1 0 1 1
1 5 16 5 121
2 6 20 6 196
C(1,0) 318
Model Candidate 2: y = 2x + 2
n x ŷ y (y-ŷ)²
0 1 0 4 16
1 5 16 12 16
2 6 20 14 36
C(2,2) 68
Cost functions are our friends
Data Model
Model
n x y yn = wxn + b Candidate 1
0 1 0
1 5 16 y = 1x + 0
2 6 20
C(1,0) 318
Cost Model
Candidate 2
2
C(w,b) = ∑(yn-ŷn)
n∈{0,1,2} y = 2x + 2
C(2,2) 68
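The square loss comparison of the two candidates can be reproduced in a few lines. This is a minimal sketch; the names `DATA`, `predict` and `cost` are illustrative and do not come from the slides.

```python
# Data from the slides: x is the number of apples, y_hat the observed bananas.
DATA = [(1, 0), (5, 16), (6, 20)]


def predict(w, b, x):
    """Linear model y = w*x + b."""
    return w * x + b


def cost(w, b, data=DATA):
    """Square loss: sum over the data of (y - y_hat)^2."""
    return sum((predict(w, b, x) - y_hat) ** 2 for x, y_hat in data)


cost_candidate_1 = cost(1, 0)  # y = 1x + 0
cost_candidate_2 = cost(2, 2)  # y = 2x + 2
print(cost_candidate_1, cost_candidate_2)
```

The candidate with the lower cost, here y = 2x + 2, fits the data better.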
Cost functions are our friends
Data Model
n x y yn = wxn + b
0 1 0
1 5 16
2 6 20
How to find the parameters w and b?
Cost
$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2$
Optimizers are our friends
Data Model
n x y yn = wxn + b
0 1 0
1 5 16
2 6 20
Cost
$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2$
Optimizer
$\arg\min_{w,b \in (-\infty,\infty)} C(w,b)$
Optimizers are our friends
Optimizer
w
arg min C(w,b)
w,b∈[-∞,∞]
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
2 68
2 b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = ?
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = 26
2
n x ŷ y (y-ŷ)
0 1 0 5 25
1 5 16 17 1
2 6 20 20 0
b
C(3,2) 26
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 4,2 : C(w2,b2) = ??
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 4,2 : C(w2,b2) = 108
n x ŷ y (y-ŷ)²
0 1 0 6 36
1 5 16 22 36
2 6 20 26 36
b
C(4,2) 108
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 3,3 : C(w2,b2) = 41
2
n x ŷ y (y-ŷ)
0 1 0 6 36
1 5 16 18 4
2 6 20 21 1
b
C(3,3) 41
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 3,1 : C(w2,b2) = 17
2
n x ŷ y (y-ŷ)
0 1 0 4 16
1 5 16 16 0
2 6 20 19 1
b
C(3,1) 17
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w2,b2 = 3,1 : C(w2,b2) = 17
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w2,b2 = 3,1 : C(w2,b2) = 17
w3,b3 = 3,0 : C(w3,b3) = 14
n x ŷ y (y-ŷ)²
0 1 0 3 9
1 5 16 15 1
2 6 20 18 4
b
C(3,0) 14
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 3,-1 : C(w4,b4) = 17
2
n x ŷ y (y-ŷ)
0 1 0 2 4
1 5 16 14 4
2 6 20 17 9
b
C(3,-1) 17
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 2,0 : C(w4,b4) = 104
2
n x ŷ y (y-ŷ)
0 1 0 2 4
1 5 16 10 36
2 6 20 12 64
b
C(2,0) 104
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 4,0 : C(w4,b4) = 48
n x ŷ y (y-ŷ)²
0 1 0 4 16
1 5 16 20 16
2 6 20 24 16
b
C(4,0) 48
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
The End?
b
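The neighbour-by-neighbour search walked through on these slides can be sketched as a greedy search with a fixed step size of 1. This is only an illustration: the function name `greedy_search` and the rule of always moving to the best of the four neighbours are assumptions, not something the slides specify. Evaluating the cost directly gives C(3,0) = 9 + 1 + 4 = 14, and none of the four neighbours improves on it, even though C(4,-2) = 12 is lower.

```python
DATA = [(1, 0), (5, 16), (6, 20)]


def cost(w, b):
    """Square loss of y = w*x + b on the slide data."""
    return sum((w * x + b - y_hat) ** 2 for x, y_hat in DATA)


def greedy_search(w, b, step=1, max_iters=1000):
    """Move to the best axis-aligned neighbour until no neighbour improves."""
    path = [(w, b, cost(w, b))]
    for _ in range(max_iters):
        neighbours = [(w + step, b), (w - step, b), (w, b + step), (w, b - step)]
        best_w, best_b = min(neighbours, key=lambda p: cost(*p))
        if cost(best_w, best_b) >= cost(w, b):
            break  # stuck: every neighbour is at least as costly
        w, b = best_w, best_b
        path.append((w, b, cost(w, b)))
    return path


path = greedy_search(2, 2)
for w, b, c in path:
    print(f"w={w}, b={b}, C={c}")
```

Starting from (2,2), the search passes through (3,2) and (3,1) and stops at (3,0), which is why a fixed step size alone does not reach the best parameters.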
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w?,b? = 4,-2 : C(w?,b?) = ??
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w?,b? = 4,-2 : C(w?,b?) = 12
2
n x ŷ y (y-ŷ)
0 1 0 2 4
1 5 16 18 4
2 6 20 22 4
b
C(4,-2) 12
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
Search
Problem
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 3.01,0 : C(w4,b4) = 13.73
n x ŷ y (y-ŷ)²
0 1 0 3.01 9.06
1 5 16 15.05 0.90
2 6 20 18.06 3.76
b
C(3.01,0) 13.73
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w*,b* = 4,-2 : C(w*,b*) = 12
b
Optimizers are our friends
Large Step Size
-Worse minimum
-But gets there faster
Vs
Step Size
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w*,b* = 4,-2 : C(w*,b*) = 12
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w*,b* = 4,-4 : C(w*,b*) = 0
b
Optimizers are our friends
y = wx + b
Data
x ŷ
1 0
5 16
6 20
3
?
Optimizers are our friends
y = 4x - 4
Data
x ŷ
1 0
5 16
6 20
3
?
Optimizers are our friends
y = 4x - 4
Data
x ŷ
1 0
5 16
6 20
3
8
Functions are our friends
y = wx + b
y : Is this a cat
x : Image
Functions are our friends
pixel (1,1)
pixel(1,3)
High
if cat
y = w1x1 + w2x2 + w3x3 + w4x4 + b
y : Is this a cat
x : Image
Functions are our friends
pixel (1,1)
pixel(1,3)
High
if cat
y = w1x1 + w2x2 + w3x3 + w4x4 + b
y : Is this a cat
x : Image
Very expensive
to compute
(hours or days)
b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
Should be used
sparingly
b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
2 68
2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1 hw
2 68
2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1 hw
C(w0+hw,b0) = C(3,2) = 26 2 68
2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1 hw
C(w0+hw,b0) = C(3,2) = 26 2 68
r = (C(w0+1,b0)-C(w0,b0))
1
r = (C(3,2)-C(2,2))=-42 2 b
1
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1, r = -42 hw
hw = 0.1, r = -98 2 68
hw = 0.01, r = -104
hw = 0.001, r = -104
2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1, r = -42 hw
hw = 0.1, r = -98 2 68
hw = 0.01, r = -104
hw = 0.001, r = -104
∂C (w0,b0)
hw → 0, r =
∂w
2 b
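The rate of change r for shrinking hw can be checked numerically. The sketch below (function names `slope_w` and `grad_w` are illustrative) computes the finite difference (C(w0+hw, b0) - C(w0, b0)) / hw at (w0, b0) = (2, 2) and compares it with the closed-form partial derivative, which for the square loss is the sum of 2(y - ŷ)x over the data.

```python
DATA = [(1, 0), (5, 16), (6, 20)]


def cost(w, b):
    """Square loss of y = w*x + b on the slide data."""
    return sum((w * x + b - y_hat) ** 2 for x, y_hat in DATA)


def slope_w(w, b, h):
    """Finite-difference rate of change of the cost along w."""
    return (cost(w + h, b) - cost(w, b)) / h


def grad_w(w, b):
    """Exact partial derivative dC/dw = sum of 2 * (y - y_hat) * x."""
    return sum(2 * (w * x + b - y_hat) * x for x, y_hat in DATA)


for h in (1, 0.1, 0.01, 0.001):
    print(f"hw = {h}: r = {slope_w(2, 2, h):.2f}")
print("exact gradient:", grad_w(2, 2))
```

As hw gets smaller the finite difference approaches the exact value of -104, at the price of one extra cost evaluation per parameter.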
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
$\frac{\partial C}{\partial w} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial w}$
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
$\frac{\partial C}{\partial w} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial w} = \sum_n 2(y_n - \hat{y}_n)x_n$
Gradients are our friends
Optimizer
n x ŷ y (y-ŷ) 2(y-ŷ)x
0 1 0 4 4 8
1 5 16 12 -4 -40
2 6 20 14 -6 -72
hw → 0, r = ∂C/∂w (w0,b0) = -104
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
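$\frac{\partial C}{\partial w} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial w} = \sum_n 2(y_n - \hat{y}_n)x_n$
$\frac{\partial C}{\partial b} = \frac{\partial \sum_n (y_n - \hat{y}_n)^2}{\partial b} = \sum_n 2(y_n - \hat{y}_n)$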
Gradients are our friends
Optimizer
n x ŷ y (y-ŷ) 2(y-ŷ)
0 1 0 4 4 8
1 5 16 12 -4 -8
2 6 20 14 -6 -12
hw → 0, rw = ∂C/∂w (w0,b0) = -104
hb → 0, rb = ∂C/∂b (w0,b0) = -12
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw → 0, rw = ∂C/∂w (w0,b0) = -104
hb → 0, rb = ∂C/∂b (w0,b0) = -12
w1 = w0 - rw
b1 = b0 - rb
→ Learning Rate/ Step size
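A minimal gradient descent loop for this update rule might look as follows. The `learning_rate` factor that scales each step, its value of 0.01 and the number of steps are assumptions chosen for illustration. With this data the raw gradient at (2,2) is (-104, -12), so an unscaled step would overshoot badly.

```python
DATA = [(1, 0), (5, 16), (6, 20)]


def gradients(w, b):
    """Partial derivatives of the square loss with respect to w and b."""
    grad_w = sum(2 * (w * x + b - y_hat) * x for x, y_hat in DATA)
    grad_b = sum(2 * (w * x + b - y_hat) for x, y_hat in DATA)
    return grad_w, grad_b


def gradient_descent(w, b, learning_rate=0.01, steps=5000):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b)
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


w_star, b_star = gradient_descent(2.0, 2.0)
print(round(w_star, 4), round(b_star, 4))
```

From the starting point (2,2) the loop converges toward w = 4, b = -4, the parameters with zero cost on this data.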
Summary
Data Model
n x ŷ yn = wxn + b
0 1 0
1 5 16
2 6 20
Cost
$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2$
Optimizer
$\arg\min_{w,b \in (-\infty,\infty)} C(w,b)$
Summary
Data Model
n x ŷ yn = wxn + b System
0 1 0
1 5 16 y = 4x - 4
2 6 20
Cost
$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2$
Optimizer
$\arg\min_{w,b \in (-\infty,\infty)} C(w,b)$
Into Deep Learning
Nonlinear Neural Models
y = 4x-4
Data
1
0
5
16
6
20
3
?
Nonlinear Neural Models
There is a limit
of bananas I
Data can give you
1
0
5
16
6
20
3
?
Nonlinear Neural Models
Data
y y = 4x-4
n x ŷ
0 1 0
1 5 16
2 6 20 x
Nonlinear Neural Models
Data
y y = 4x-4
n x ŷ
0 1 0
1 5 16
2 6 20 x
3 9 20
4 11 20
Nonlinear Neural Models
Data
y y = 2x+3
n x ŷ
0 1 0
1 5 16
2 6 20 x
3 9 20
Model
4 11 Problem
20
Nonlinear Neural Models
Data
y y = 2x+3
n x ŷ
0 1 0
1 5 16
2 6 20 Underfitting x
3 9 20
Model
4 11 Problem
20
Nonlinear Neural Models
Data
y y = ???
n x ŷ
0 1 0
1 5 16
2 6 20 x
3 9 20
4 11 20 Can we learn
arbitrary functions?
Nonlinear Neural Models
0 1 0
1 5 16
y = (4x - 4)s1 + (0x+20)s2
2 6 20
3 9 20
4 11 20
Nonlinear Neural Models
0 1 0
1 5 16
y = (4x - 4)s1 + (0x+20)s2
2 6 20
3 9 20
4 11 20 ?
?
Nonlinear Neural Models
$s = \sigma(wx + b)$
$\sigma(t) = \frac{1}{1 + e^{-t}}$
Nonlinear Neural Models
s = σ(1000x)
Nonlinear Neural Models
s = σ(1000x)
s = σ(1000x - 6000)
Data
n x ŷ
y = (4x - 4)s1 + (0x+20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (4x - 4)s1 + (0x+20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (16)s1 + (0x+20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (16)s1 + (20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (16)s1 + (20)s2
0 1 0
1 5 16
s1 = (1000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (16)s1 + (20)s2
0 1 0
1 5 16
s1 = (1000)
2 6 20 s2 = (-1000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (16)1 + (20)0
0 1 0
1 5 16
s1 = (1000)
2 6 20 s2 = (-1000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = 16
0 1 0
1 5 16
s1 = (1000)
2 6 20 s2 = (-1000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (4x - 4)s1 + (0x+20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (32)s1 + (0x+20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (32)s1 + (20)s2
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (32)s1 + (20)s2
0 1 0
1 5 16
s1 = (-3000)
2 6 20 s2 = (1000x - 6000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (32)s1 + (20)s2
0 1 0
1 5 16
s1 = (-3000)
2 6 20 s2 = (3000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = (32)0 + (20)1
0 1 0
1 5 16
s1 = (-3000)
2 6 20 s2 = (3000)
3 9 20
4 11 20
Nonlinear Neural Models
Data
n x ŷ
y = 20
0 1 0
1 5 16
s1 = (-3000)
2 6 20 s2 = (3000)
3 9 20
4 11 20
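The gated model worked through above can be evaluated directly. In this sketch the gates s1 and s2 are logistic sigmoids of the bracketed expressions; the numerically stable form of `sigmoid` is an implementation choice made here so that inputs such as -3000 do not overflow.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^-t), written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def model(x):
    """y = (4x - 4) * s1 + (0x + 20) * s2 with steep gates around x = 6."""
    s1 = sigmoid(-1000 * x + 6000)  # close to 1 when x < 6
    s2 = sigmoid(1000 * x - 6000)   # close to 1 when x > 6
    return (4 * x - 4) * s1 + (0 * x + 20) * s2


for x in (1, 5, 6, 9, 11):
    print(x, model(x))
```

For x = 5 the first gate is open and the output is 16; for x = 9 the second gate is open and the output is 20, matching the two worked cases.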
Nonlinear Neural Models
If you give me
too many
Data apples, I will
give you less
1
0
5
16
6
20
3
?
Multilayer Perceptrons
Data
n x ŷ
y y = (4x - 4)s1 + (0x+20)s2
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20
Multilayer Perceptrons
Data
n x ŷ
y y = (4x - 4)s1 + (0x+20)s2
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
0 1 0
1 5 16
2 6 20
6 19 1 s2 = ????
s3 = (1000x - 15000)
Multilayer Perceptrons
Data
n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = not s1 and not s3
3 9 20
s3 = (1000x - 15000)
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = not s1 and not s3
3 9 20
s3 = (1000x - 15000)
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (-4000) = 0
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (- 0 - 0 + 500)
3 9 20
s3 = (-4000) = 0
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500)
3 9 20
s3 = (-4000) = 0
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500) = 1
3 9 20
s3 = (-4000) = 0
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (40)0 + (20)1 + (1)0
0 1 0
1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500) = 1
3 9 20
s3 = (-4000) = 0
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = 20
0 1 0
1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500) = 1
3 9 20
s3 = (-4000) = 0
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (4000) = 1
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-1000 + 0 + 500)
3 9 20
s3 = (4000) = 1
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0
1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-500) = 0
3 9 20
s3 = (4000) = 1
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y = (72)0 + (20)0 + (1)1
0 1 0
1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-500) = 0
3 9 20
s3 = (4000) = 1
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
Data
n x ŷ
y=1
0 1 0
1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-500) = 0
3 9 20
s3 = (4000) = 1
4 11 20
5 15 1
6 19 1
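The three-piece model, where the middle gate s2 is computed from the other two gates rather than from x, can be sketched as below. The gates are taken to be logistic sigmoids, and the stable `sigmoid` helper is an implementation choice. With these weights the gate s3 is exactly 0.5 at x = 15, so the checks use points away from that boundary.

```python
import math


def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)


def model(x):
    """y = (4x - 4) s1 + 20 s2 + 1 s3, with s2 = not s1 and not s3."""
    s1 = sigmoid(-1000 * x + 6000)              # x < 6
    s3 = sigmoid(1000 * x - 15000)              # x > 15
    s2 = sigmoid(-1000 * s1 - 1000 * s3 + 500)  # second layer: neither s1 nor s3
    return (4 * x - 4) * s1 + (0 * x + 20) * s2 + (0 * x + 1) * s3


for x in (1, 5, 9, 11, 19):
    print(x, round(model(x), 6))
```

Because s2 takes the outputs of s1 and s3 as its inputs, this is already a two-layer computation: the first layer detects ranges of x and the second combines them.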
Multilayer Perceptrons
Data
y
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
n x ŷ
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20
5 15 1
6 19 1
Multilayer Perceptrons
s3 = (w7x + b6) b4
s
2
Multilayer Perceptrons
s3 = (w7x + b6) b4
s
b5
2
Multilayer Perceptrons
2
b5
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s s
x<6 1 3 x > 15
s
2 !(x > 15) & !(x < 6)
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s s
x<6 1 3 x > 15
s
2 x∈[6,15]
Multilayer Perceptrons
s s
x<6 1 3 x > 15
s s
2 4
s s s s
x<6 1 2 x > 15 3 x>2 4 x<3
s s s s
5 6 7 7
s s s s
x<6 1 2 x > 15 3 x>2 4 x<3 Layer 1 (Input Features)
s s s s
5 6 7 7
x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3] Layer 2 (And and Or Combinations)
Multilayer Perceptrons
x Input
s s s s
x<6 1 2 x > 15 3 x>2 4 x<3 Layer 1 (Input Features)
s s s s
5 6 7 7
x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3] Layer 2 (And and Or Combinations)
s s s s
1 2 3 4 Layer 1 (Input Features)
s s s s
5 6 7 7
s s s s
8 9 a b Layer 3 (Xor Combinations)
Multilayer Perceptrons
x Input
s s s s
1 2 3 4 Layer 1 (Input Features)
s s s s
5 6 7 7
s s s s
8 9 a b Layer 3 (Xor Combinations)
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20 Universal
approximator
5 15 1
6 19 1
Multilayer Perceptrons
Data
y
n x ŷ
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20
but...
5 15 1
6 19 1
Multilayer Perceptrons
Data
y
n x ŷ
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20
No guarantee that
the best function will
5 15 1
be found
6 19 1
Multilayer Perceptrons
n x ŷ
x
0 1 0
1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s
5 6 7
y
Multilayer Perceptrons
n x ŷ
x
0 1 0
1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s
5 6 7
1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s Overfitting
5 6 7
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Overfitting
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
more features
Regression
Regression
Model
Linear
Complexity
Linear
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Regression
MLP 1 Layer
MLP 2 Layer
MLP 3 Layer
Model
Linear
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Sentiment Overfitting
analysis
Regression
MLP 1 Layer
MLP 2 Layer
MLP 3 Layer
Model
Linear
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Machine Translation
Happy Zone
Sentiment Overfitting
analysis
Regression
MLP 1 Layer
MLP 2 Layer
MLP 3 Layer
Model
Linear
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Data
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Data
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Data
Model
Complexity
Multilayer Perceptrons
n x ŷ
y y y
0 1 0
1 5 16
2 6 20
Multilayer Perceptrons
n x ŷ
y y y
0 1 0
1 5 16
2 6 20
3 2 4
Multilayer Perceptrons
n x ŷ
y y
0 1 0
1 5 16
2 6 20
3 2 4
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Model Bias
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting
Happy Zone
Overfitting
Model Bias
L1 & L2 Regularization
Stochastic Dropout (Srivastava et al, 2014)
Model Structure (CNN, RNNs)
Model
Complexity
Multilayer Perceptrons
Regularization
$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2 + (w+b)\beta$
β = Regularization constant
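A regularized cost can be sketched as the square loss plus a penalty scaled by the regularization constant. The slide writes the penalty as (w+b) times the constant; the sketch below instead uses the squared (L2) form w² + b², which is an assumption made here so that the penalty stays non-negative. The name `regularized_cost` is illustrative.

```python
DATA = [(1, 0), (5, 16), (6, 20)]


def regularized_cost(w, b, beta):
    """Square loss plus an L2 penalty (assumed form) weighted by beta."""
    data_term = sum((w * x + b - y_hat) ** 2 for x, y_hat in DATA)
    penalty = beta * (w ** 2 + b ** 2)  # assumption: L2 penalty on both parameters
    return data_term + penalty


print(regularized_cost(4, -4, 0.0))  # no regularization: perfect fit costs 0
print(regularized_cost(4, -4, 0.5))  # the same fit now pays for large parameters
```

With a positive constant the optimizer has to trade data fit against parameter size, which pushes the model toward simpler solutions.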
Multilayer Perceptrons
Regularization
x
s s s s
x>1 1 2 x<2 3 x<5 4 x<6
s s s
5 6 7
y
Multilayer Perceptrons
Regularization
x
s s s s
x>1 1 2 nothing 3 nothing 4 x<6
s s s
5 6 7
y
Multilayer Perceptrons
Regularization
x
s s s s
x>1 1 2 nothing 3 nothing 4 x<6
s s s
5 6 7
Data
1
0
5
16
6
20
3
?
Using Discrete Variables
Data
1
0
5
16
6
20
3
?
Using Discrete Variables
Data
1
0
5
16
6
20
3
? ?
Using Discrete Variables
Number of fruit to offer
x
s s s s
1 2 3 4
s s s
5 6 7
s1
s2
s1
s2
s2
e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5
e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5
Lookup
s2
V=3
logits
Apple Banana Coconut
Size = V
$p_i = \frac{\exp(d_i)}{\sum_j \exp(d_j)}$
w3 1.1 0.9 1.1
s2
Softmax
v∈{Apple, Banana, Coconut}
Type of fruit received v y Number of fruit received
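The softmax step turns V logits into a probability for each fruit type. In this sketch the three numbers shown on the slide (1.1, 0.9, 1.1) are treated as the logits for Apple, Banana and Coconut, which is an assumption about the figure. Subtracting the maximum logit before exponentiating is a common stability trick and does not change the result.

```python
import math

FRUITS = ["Apple", "Banana", "Coconut"]


def softmax(logits):
    """p_i = exp(d_i) / sum_j exp(d_j), computed in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(d - m) for d in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1.1, 0.9, 1.1]  # assumed to be the logits shown on the slide
probs = softmax(logits)
for fruit, p in zip(FRUITS, probs):
    print(f"{fruit}: {p:.3f}")
```

The probabilities sum to 1, and equal logits (Apple and Coconut here) receive equal probability.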
Using Discrete Variables
Type of fruit to offer Number of fruit to offer
u x
Lookup
eu
s2
Softmax
v∈{Apple, Banana, Coconut}
Type of fruit received v y Number of fruit received
Summary
Continuous - linear
Continuous - values Sparse - softmax
Sparse - (embeddings) MLP
Example Applications
Embedding Pretraining (Collobert et al, 2011)
Predict
Context
Abby likes to eat apples and bananas
Example Applications
Embedding Pretraining (Collobert et al, 2011)
s1
Softmax
s2
Example Applications
Embedding Pretraining (Collobert et al, 2011)
edrink
eat
Cosine similarity
eeat
ebuild
Example Applications
Embedding Pretraining (Collobert et al, 2011)
eat
edrink
eeat
Cosine similarity
ebuild
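Cosine similarity between word embeddings can be computed as below. The embedding values are hypothetical and chosen only to illustrate that related words such as "eat" and "drink" are expected to end up closer to each other than to "build"; they are not taken from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional embeddings, for illustration only.
EMBEDDINGS = {
    "eat": [0.9, 0.1, 0.3, 0.0],
    "drink": [0.8, 0.2, 0.4, 0.1],
    "build": [-0.2, 0.9, -0.5, 0.3],
}

sim_eat_drink = cosine_similarity(EMBEDDINGS["eat"], EMBEDDINGS["drink"])
sim_eat_build = cosine_similarity(EMBEDDINGS["eat"], EMBEDDINGS["build"])
print(sim_eat_drink, sim_eat_build)
```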
Example Applications
Window-based Tagging (Collobert et al, 2011)
s1 Non-Linear Layer 1
s2 Non-Linear Layer 2
Example Applications
Window-based Tagging (Collobert et al, 2011)
s1 Non-Linear Layer 1
s2 Non-Linear Layer 2
VB Softmax
Example Applications
Window-based Tagging (Collobert et al, 2011)
s1 Non-Linear Layer 1
s2 Non-Linear Layer 2
VB Softmax
Example Applications
Window-based Tagging (Collobert et al, 2011)
Example Applications
Translation Rescoring (Devlin et al, 2014)
Predict
Context
Translation
Abby likes to eat apples and bananas
Source
Abby gosta de comer macas e bananas
Example Applications
Translation Rescoring (Devlin et al, 2014)
Predict
Context
Translation
Abby likes to eat apples and bananas
Source
Abby gosta de comer macas e bananas
Example Applications
Translation Rescoring (Devlin et al, 2014)
Translation
Abby likes to eat apples and bananas
s1
macas
s2
Example Applications
Translation Rescoring (Devlin et al, 2014)
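$\frac{\partial C}{\partial w} = \frac{\partial \sum_n (\hat{y}_n - y_n)^2}{\partial w} = \sum_n -2(\hat{y}_n - y_n)x_n$
$\frac{\partial C}{\partial b} = \frac{\partial \sum_n (\hat{y}_n - y_n)^2}{\partial b} = \sum_n -2(\hat{y}_n - y_n)$
Easy!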
Computation Graphs are our friends
2
y = wx + b + tanh(yx + b)
Harder!
Computation Graphs are our friends
2
y = w x + b + tanh(w x + b )
1 1 2 2
Computation
Graphs can
compute
gradients for you!
Computation Graphs are our friends
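$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2$
$y = wx + b$
$\frac{\partial C}{\partial w} = \frac{\partial \sum_n (\hat{y}_n - y_n)^2}{\partial w} = \sum_n -2(\hat{y}_n - y_n)x_n$
$\frac{\partial C}{\partial b} = \frac{\partial \sum_n (\hat{y}_n - y_n)^2}{\partial b} = \sum_n -2(\hat{y}_n - y_n)$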
Computation Graphs are our friends
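$C(w,b) = \sum_{n \in \{0,1,2\}} (y_n - \hat{y}_n)^2$
$y = wx + b$
$\frac{\partial C}{\partial w} = \sum_n \frac{\partial (\hat{y}_n - y_n)^2}{\partial y_n}\frac{\partial y_n}{\partial w} = \sum_n -2(\hat{y}_n - y_n)x_n$
$\frac{\partial C}{\partial b} = \sum_n \frac{\partial (\hat{y}_n - y_n)^2}{\partial y_n}\frac{\partial y_n}{\partial b} = \sum_n -2(\hat{y}_n - y_n)$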
Computation Graphs are our friends
2
C(w,b) = ∑(yn-ŷn) y = wx + b
n∈{0,1,2}
2
∂C ∂(ŷn-yn) ∂yn
=∑ n
∂w ∂yn ∂w
2
∂C ∂(ŷn-yn) ∂yn
=∑
n
∂b ∂yn ∂b
Computation Graphs are our friends
2
C(w,b) = ∑(yn-ŷn) y=o+b
n∈{0,1,2}
o = wx
2
∂C ∂(ŷn-yn) ∂yn
=∑ n
∂w ∂yn ∂w
2
∂C ∂(ŷn-yn) ∂yn
=∑
n
∂b ∂yn ∂b
Computation Graphs are our friends
2
C(w,b) = ∑(dn) d=y-ŷ
n∈{0,1,2}
y=o+b
∂C ∂(ŷn-yn)
2
∂yn o = wx
=∑ n
∂w ∂yn ∂w
2
∂C ∂(ŷn-yn) ∂yn
=∑
n
∂b ∂yn ∂b
Computation Graphs are our friends
2
C(w,b) = ∑cn c=d
n∈{0,1,2}
d=y-ŷ
∂C ∂(ŷn-yn)
2
∂yn y=o+b
=∑
∂w
n
∂yn ∂w o = wx
2
∂C ∂(ŷn-yn) ∂yn
=∑
n
∂b ∂yn ∂b
Computation Graphs are our friends
2
C(w,b) = ∑cn c=d
n∈{0,1,2}
d=y-ŷ
∂C
∑ ∂cn ∂dn ∂yn ∂on y=o+b
=n
∂w ∂dn ∂yn ∂on ∂w o = wx
2
∂C ∂(ŷn-yn) ∂yn
=∑
n
∂b ∂yn ∂b
Computation Graphs are our friends
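$C(w,b) = \sum_{n \in \{0,1,2\}} c_n$
$c = d^2$
$d = y - \hat{y}$
$y = o + b$
$o = wx$
$\frac{\partial C}{\partial w} = \sum_n \frac{\partial c_n}{\partial d_n}\frac{\partial d_n}{\partial y_n}\frac{\partial y_n}{\partial o_n}\frac{\partial o_n}{\partial w}$
$\frac{\partial C}{\partial b} = \sum_n \frac{\partial c_n}{\partial d_n}\frac{\partial d_n}{\partial y_n}\frac{\partial y_n}{\partial b}$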
Computation Graphs are our friends
2
C(w,b) = ∑cn c=d Power 2
n∈{0,1,2}
d=y-ŷ Sub
∂C
∑ ∂cn ∂dn ∂yn ∂on y=o+b Add
=n
∂w ∂dn ∂yn ∂on ∂w o = wx Product
n∈{0,1,2}
d=y-ŷ Sub
∂C
∑ ∂cn ∂dn ∂yn ∂on y=o+b Add
=n
∂w ∂dn ∂yn ∂on ∂w
∂C o
∂cn ∂dn ∂yn
=∑
∂dn ∂yn ∂b
n
∂b Product
o = wx
w x
Computation Graphs are our friends
2
C(w,b) = ∑cn c=d Power 2
n∈{0,1,2}
d=y-ŷ Sub
y
∂C ∂cn ∂dn ∂yn ∂on
∑
=n
∂w ∂dn ∂yn ∂on ∂w Add y=o+b
∂C o b
∂cn ∂dn ∂yn
=∑
∂dn ∂yn ∂b
n
∂b Product
w x
Computation Graphs are our friends
n∈{0,1,2} Sub
∂C y ŷ
∑ ∂cn ∂dn ∂yn ∂on
=n
∂w ∂dn ∂yn ∂on ∂w Add
∂C o b
∂cn ∂dn ∂yn
=∑
∂dn ∂yn ∂b
n
∂b Product
w x
Computation Graphs are our friends
n∈{0} Sub
∂C y ŷ
∑ ∂cn ∂dn ∂yn ∂on
=n
∂w ∂dn ∂yn ∂on ∂w Add
∂C o b
∂cn ∂dn ∂yn
=∑
∂dn ∂yn ∂b
n
∂b Product
w x
Computation Graphs are our friends
n∈{0} Sub
∂C y ŷ
∑ ∂cn ∂dn ∂yn ∂on No Input Edges
=n External
∂w ∂dn ∂yn ∂on ∂w Add
Input
∂C o b
∂cn ∂dn ∂yn
=∑
∂dn ∂yn ∂b
n
∂b Product
w x
Computation Graphs are our friends
n∈{0} Sub
∂C y ŷ
∑ ∂cn ∂dn ∂yn ∂on
=n
∂w ∂dn ∂yn ∂on ∂w Add
Input
∂C o b
∂cn ∂dn ∂yn
=∑ Parameters
∂dn ∂yn ∂b
n
∂b Product
No Input Edges
Internal
w x
Computation Graphs are our friends
d Power 2 c Id C
Sub
y ŷ
16
Add
o b 2
Variables
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs
Sub
y ŷ
16
Add
o b 2
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs
2-Initialize variables
Sub
y ŷ
16
Add
o b 2
Variables
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
Sub
y ŷ
0,0
16
Add
o b 2
0,0
Variables
Product
2 values: x and dx
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
3-Topological Sort variables Sub
y ŷ
0,0
16
Add
o b 2
0,0
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
0,0
16
2nd
Add
o b 2
1st 0,0
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
0,0
16
2nd
Add
o b 2
1st 10,0
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
12,0
16
2nd
Add
o b 2
1st 10,0
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs -4,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
12,0
16
2nd
Add
o b 2
1st 10,0
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
0,0
1st 16
Add
o b 2
0,0
2nd
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
2,0
1st 16
Add
o b 2
0,0
2nd
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
2,0
1st 16
Add
o b 2
10,0
2nd
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs -14,0
2-Initialize variables 3rd 0,0 0,0
3-Topological Sort variables Sub 4th 5th
y ŷ
2,0
1st 16
Add
o b 2
10,0
2nd
Product
w x
2 5
Computation Graphs are our friends
d Power 2 c Id C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
3-Topological Sort variables Sub
y
0,0
Add
o
0,0
Computation Graphs are our friends
d c C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
3-Topological Sort variables
y
0,0
o
0,0
Computation Graphs are our friends
5th
4th
3rd
d c C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
3-Topological Sort variables
2nd
y
0,0
1st
o
0,0
Computation Graphs are our friends
d Power 2 c Add C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
3-Topological Sort variables Sub
y g
0,0 0,0
Add
Add
o 0,0
s 0,0
Computation Graphs are our friends
5th 6th 7th
d c C
1-Initialize inputs 0,0
2-Initialize variables 0,0 0,0
3-Topological Sort variables
4th
y g 3th
0,0 0,0
1st
o 0,0
2nd
s 0,0
Computation Graphs are our friends
1-Initialize inputs
2-Initialize variables
3-Topological Sort variables
4-For each variable in topological order, run the forward method of all operations that link to them
Graph: o = Product(w, x) · y = Add(o, b) · d = Sub(y, ŷ) · c = Power 2(d) · C = Id(c)
Inputs: w = 2, x = 5, b = 2, ŷ = 16
Each variable holds a pair "value,gradient", initialized to 0,0.
Forward pass, one variable per step:
1st o = w x = 2 × 5 = 10 (o: 10,0)
2nd y = o + b = 10 + 2 = 12 (y: 12,0)
3rd d = y - ŷ = 12 - 16 = -4 (d: -4,0)
4th c = d^2 = 16 (c: 16,0)
5th C = c = 16 (C: 16,0)
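The forward pass above can be reproduced with a short Python sketch. It is a minimal illustration under stated assumptions: operations are stored as plain tuples in topological order, and the names `operations`, `values`, and `forward` are invented here rather than taken from any of the tools in this deck.

```python
# Each entry: (output variable, forward function, input variables),
# listed in the topological order from the slides.
operations = [
    ("o", lambda w, x: w * x, ("w", "x")),              # Product
    ("y", lambda o, b: o + b, ("o", "b")),              # Add
    ("d", lambda y, y_hat: y - y_hat, ("y", "y_hat")),  # Sub
    ("c", lambda d: d ** 2, ("d",)),                    # Power 2
    ("C", lambda c: c, ("c",)),                         # Id
]

# Input values from the slides.
inputs = {"w": 2.0, "x": 5.0, "b": 2.0, "y_hat": 16.0}


def forward(operations, inputs):
    """Run every operation's forward function in topological order."""
    values = dict(inputs)
    for out, fn, ins in operations:
        values[out] = fn(*(values[name] for name in ins))
    return values


values = forward(operations, inputs)
for name in ("o", "y", "d", "c", "C"):
    print(name, values[name])
```

The printed values follow the slide sequence: o = 10, y = 12, d = -4, c = 16, C = 16.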
Computation Graphs are our friends
1 to 4-(Forward) w = 2, x = 5, b = 2, ŷ = 16 give o = 10, y = 12, d = -4, c = 16, C = 16
5-Set gradients to final variables: the gradient of C is set to 1 (C: 16,1)
6-Run the operations' backward methods in reverse order (Backward)
Id: C = c, so ∂C/∂c = 1
dc = dC · ∂C/∂c = 1 × 1 = 1 (c: 16,1)
State after this step: o 10,0 · y 12,0 · d -4,0 · c 16,1 · C 16,1
Computation Graphs are our friends
6-Run the operations' backward methods in reverse order (Backward)
Power 2: c = d^2, so ∂c/∂d = 2d = 2 × -4 = -8
dd = dc · ∂c/∂d = 1 × -8 = -8 (d: -4,-8)
State after this step: o 10,0 · y 12,0 · d -4,-8 · c 16,1 · C 16,1
Computation Graphs are our friends
6-Run the operations' backward methods in reverse order (Backward)
Sub: d = y - ŷ, so ∂d/∂y = 1
dy = dd · ∂d/∂y = -8 × 1 = -8 (y: 12,-8)
State after this step: o 10,0 · y 12,-8 · d -4,-8 · c 16,1 · C 16,1
Computation Graphs are our friends
6-Run the operations' backward methods in reverse order (Backward)
Add: y = o + b, so ∂y/∂o = 1
do = dy · ∂y/∂o = -8 × 1 = -8 (o: 10,-8)
State after this step: o 10,-8 · y 12,-8 · d -4,-8 · c 16,1 · C 16,1
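The backward steps from C down to o can be checked with a small Python sketch. It hard-codes the five operations of the running example rather than using a general graph class, and the dictionary name `grad` is an assumption made for this illustration.

```python
# Forward pass with the inputs from the slides.
w, x, b, y_hat = 2.0, 5.0, 2.0, 16.0
o = w * x        # Product -> 10
y = o + b        # Add     -> 12
d = y - y_hat    # Sub     -> -4
c = d ** 2       # Power 2 -> 16
C = c            # Id      -> 16

# Backward pass: seed the final variable with 1, then walk the
# operations in reverse order, multiplying by each local derivative.
grad = {"C": 1.0}
grad["c"] = grad["C"] * 1.0        # Id:      dC/dc = 1
grad["d"] = grad["c"] * (2.0 * d)  # Power 2: dc/dd = 2d
grad["y"] = grad["d"] * 1.0        # Sub:     dd/dy = 1
grad["o"] = grad["y"] * 1.0        # Add:     dy/do = 1

for name in ("C", "c", "d", "y", "o"):
    print(name, grad[name])
```

The gradients match the second number of each pair on the slides: 1 for C and c, and -8 for d, y, and o.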
Computation Graphs are our friends
6-Run the operations' backward methods in reverse order (Backward)
Add: y = o + b, so ∂y/∂o = 1 and ∂y/∂b = 1
b(t+1) = b - dy · ∂y/∂b
b(t+1) = b - ∂C/∂c · ∂c/∂d · ∂d/∂y · ∂y/∂b
b(t+1) = b - ∂C/∂b
State: o 10,-8 · y 12,-8 · d -4,-8 · c 16,1 · C 16,1
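One way to convince yourself that the product of local derivatives really equals ∂C/∂b is to compare it with a numerical estimate. The sketch below is an illustrative check, not part of the original slides: the function name `cost` and the step `eps` are assumptions, and the estimate uses a central finite difference.

```python
def cost(w, b, x=5.0, y_hat=16.0):
    """Cost of the running example: C = ((w * x + b) - y_hat) ** 2."""
    return ((w * x + b) - y_hat) ** 2


w, b = 2.0, 2.0
d = (w * 5.0 + b) - 16.0  # d = y - y_hat = -4

# Chain rule: dC/dc * dc/dd * dd/dy * dy/db
chain_rule = 1.0 * (2.0 * d) * 1.0 * 1.0

# Central finite difference around b.
eps = 1e-6
numeric = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)

print(chain_rule, numeric)
```

Both numbers should agree up to rounding error, with the chain-rule value equal to -8.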
Computation Graphs are our friends
6-Run the operations' backward methods in reverse order (Backward)
Product: o = w x, so ∂o/∂w = x
w(t+1) = w - do · ∂o/∂w
7-Update parameters: w goes from 2 to 2.8, b goes from 2 to 2.2
State: o 10,-8 · y 12,-8 · d -4,-8 · c 16,1 · C 16,1
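The update step can be sketched as follows. The update formula on the slide shows no step size, so the `learning_rate` of 0.02 below is an assumption, chosen because it reproduces the move of w from 2 to 2.8; with the same value b becomes 2.16, which is consistent with the 2.2 shown on the slide if that figure is rounded.

```python
learning_rate = 0.02  # assumption: inferred from w going from 2 to 2.8

# Forward pass with the inputs from the slides.
w, x, b, y_hat = 2.0, 5.0, 2.0, 16.0
o = w * x
y = o + b
d = y - y_hat
cost_before = d ** 2

# Backward pass down to the parameters.
grad_y = 2.0 * d      # dC/dy = -8
grad_b = grad_y * 1.0  # dy/db = 1
grad_w = grad_y * x    # do/dw = x, so dC/dw = -40

# Gradient descent update.
w_new = w - learning_rate * grad_w
b_new = b - learning_rate * grad_b

# Cost after the update, to confirm that the step reduced it.
cost_after = ((w_new * x + b_new) - y_hat) ** 2

print(w_new, b_new, cost_before, cost_after)
```

After one update the cost drops from 16 to well below 1, since the new prediction for x = 5 is close to the target of 16.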
Computation Graphs are our friends
Existing Tools:
-Tensorflow ( https://www.tensorflow.org )
-Torch ( https://github.com/torch/nn )
-CNN ( https://github.com/clab/cnn )
-JNN ( https://github.com/wlin12/JNN )
-Theano ( http://deeplearning.net/software/theano/ )
Deep Neural Networks are our friends?
Convolutional Neural Network
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
x5 x6 x7 x8
4x4 image
Deep Neural Networks are our friends?
Convolutional Neural Network
4x4 image (pixels x1 to x12 shown)
[Figure: inputs x1, x2 and weight w1 feeding output z1; then inputs x2, x3 and weight w1 feeding output z2]
Deep Neural Networks are our friends?
Convolutional Neural Network
x1 x2 x3 x4
z1
x5 x6 x7 x8 z1 z2
4x4 image
z4