
Deep Neural Networks

Are Our Friends

Wang Ling
Outline
● Part I - Neural Networks are our friends
○ Numbers are our friends
○ Variables are our friends
○ Operators are our friends
○ Functions are our friends
○ Parameters are our friends
○ Cost Functions are our friends
○ Optimizers are our friends
○ Gradients are our friends
Outline
● Part I - Neural Networks are our friends
● Part II - Into Deep Learning
○ Nonlinear Neural Models
○ Multilayer Perceptrons
○ Using Discrete Variables
○ Example Applications
Numbers are our friends
Abby

How many apples


does Abby have?
Numbers are our friends
Abby

4
Variables are our friends
Abby Bert

4 5
Variables are our friends
Abby Bert

4x 5y
Operators are our friends
Bert

1
4

If Abby has 4 apples,


and gives Bert 1 apple,
how many apples will
Abby have?
Operators are our friends
Bert

4x - 1x = 3x
3 1
Functions are our friends
If you give me
1 apple I will
give you 3
bananas

1
4
?
5
Functions are our friends
● Input, x - Number of
Apples given by Abby
y = 3x
Functions are our friends
● Input, x - Number of
Apples given by Abby
y = 3x ● Output, y - Number of
Bananas received by Abby
Functions are our friends

1
4
?
5

y = 3x , x =1
Functions are our friends

1
4
3
5

y = 3x , x =1
y=3
Functions are our friends
y = 3x
Functions are our friends

y : Spanish Sentence
x : English Sentence
Functions are our friends

y : Move
x : Board
Functions are our friends

y : Category
x : Image
Functions are our friends

y : Move
x : Board
??????????????????????????
Functions are our friends
y = 3x
Cookie Monster
Functions are our friends
y = ?? y = 3x
Find it out for
yourself
Functions are our friends
y = ??
1
0
Functions are our friends
y = ??
1
0

5
16
Functions are our friends
y = ??
1
0

5
16

6
20
Functions are our friends
I want to know how many bananas I get,
but I ran out of apples....
y = ??
1
0

5
16

6
20

3
?
Parameters are our friends
● Input
● Output
y = 3x + 1
Parameters are our friends
● Input
Model ● Output
y = wx + b ● Parameters

Input - Fixed, comes from data


Parameters - Need to be estimated
Parameters are our friends
y = wx + b
1
0

5
16

6
20

3
?
Parameters are our friends
y = wx + b
Data
1
0

5
16

6
20

3
?
Parameters are our friends
y = wx + b
Data

x ŷ

1 0

5 16

6 20

3
?
Parameters are our friends
Data Model

x ŷ y = wx + b
1 0

5 16

6 20
Parameters are our friends
Data Model

x ŷ y = wx + b
1 0

5 16

6 20

How to find the parameters w and b?


Parameters are our friends
Data Model
Model x y
x ŷ y = wx + b Candidate 1
1 0
1 0
5 16
5 16 y = 1x + 0 6 20
6 20
Parameters are our friends
Data Model
Model x ŷ y
x ŷ y = wx + b Candidate 1
1 0 1
1 0
5 16 5
5 16 y = 1x + 0 6 20 6
6 20
1 = 1*1 + 0
5 = 1*5 + 0
6 = 1*6 + 0
Parameters are our friends
Data Model
Model x ŷ y
x y y = wx + b Candidate 1
1 0 1
1 0
5 16 5
5 16 y = 1x + 0 6 20 6
6 20

Model
Candidate 2 x ŷ y

1 0 4
y = 2x + 2 5 16 12

6 20 14
Parameters are our friends
Data Model
Model x ŷ y
x y y = wx + b Candidate 1
1 0 1
1 0
5 16 5
5 16 y = 1x + 0 6 20 6
6 20

Model
Candidate 2 x ŷ y

1 0 4
y = 2x + 2 5 16 12

6 20 14
Which one is better ?
Parameters are our friends
Data Model
Model x ŷ y
x y y = wx + b Candidate 1
1 0 1
1 0
5 16 5
5 16 y = 1x + 0 6 20 6
6 20

Model
Candidate 2 x ŷ y

1 0 4
y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model
Model x ŷ y
n x y yn = wxn + b Candidate 1
1 0 1
0 1 0
5 16 5
1 5 16 y = 1x + 0 6 20 6
2 6 20

Model
Candidate 2 x ŷ y

1 0 4
y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model
Model x ŷ y
n x y yn = wxn + b Candidate 1
1 0 1
0 1 0
5 16 5
1 5 16 y = 1x + 0 6 20 6
2 6 20

Cost Model
Candidate 2 x ŷ y

C(w,b) 1 0 4
y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model
Model x ŷ y
n x y yn = wxn + b Candidate 1
1 0 1
0 1 0
5 16 5
1 5 16 y = 1x + 0 6 20 6
2 6 20 Square Loss

Cost: C(w,b) = ∑n∈{0,1,2}(yn − ŷn)²

Model Candidate 2: y = 2x + 2
x  ŷ   y
1  0   4
5  16  12
6  20  14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1
0 1 0 1 5 16 5
1 5 16 y = 1x + 0 2 6 20 6
2 6 20

Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4

n∈{0,1,2} y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5
1 5 16 y = 1x + 0 2 6 20 6
2 6 20

Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4

n∈{0,1,2} y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5 121
1 5 16 y = 1x + 0 2 6 20 6
2 6 20

Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4

n∈{0,1,2} y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5 121
1 5 16 y = 1x + 0 2 6 20 6 196
2 6 20

Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4

n∈{0,1,2} y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data Model n x ŷ y (y-ŷ)2
Model
n x y yn = wxn + b Candidate 1
0 1 0 1 1
0 1 0 1 5 16 5 121
1 5 16 y = 1x + 0 2 6 20 6 196
2 6 20
C(1,0) 318

Cost Model
Candidate 2 x ŷ y
2
C(w,b) = ∑(yn-ŷn) 1 0 4

n∈{0,1,2} y = 2x + 2 5 16 12

6 20 14
Cost functions are our friends
Data: (n, x, ŷ) = (0,1,0), (1,5,16), (2,6,20)
Model: yn = wxn + b
Cost: C(w,b) = ∑n∈{0,1,2}(yn − ŷn)²

Candidate 1: y = 1x + 0
n  x  ŷ   y   (y-ŷ)²
0  1  0   1   1
1  5  16  5   121
2  6  20  6   196
C(1,0) = 318

Candidate 2: y = 2x + 2
n  x  ŷ   y   (y-ŷ)²
0  1  0   4   16
1  5  16  12  16
2  6  20  14  36
C(2,2) = 68
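The candidate comparison above can be checked mechanically. A minimal sketch of the square loss on the slide's three data points (the function name `cost` is my own):

```python
# Square loss for the slide's data: (x, ŷ) pairs of apples given / bananas received.
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    """C(w,b) = sum over the data of (y_n - ŷ_n)^2, with y_n = w*x_n + b."""
    return sum((w * x + b - y_hat) ** 2 for x, y_hat in data)

print(cost(1, 0))  # Candidate 1: y = 1x + 0 -> 318
print(cost(2, 2))  # Candidate 2: y = 2x + 2 -> 68 (the better candidate)
```

The smaller the cost, the better the candidate explains the data.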
Cost functions are our friends
Data Model
Model
n x y yn = wxn + b Candidate 1

0 1 0

1 5 16 y = 1x + 0
2 6 20
C(1,0) 318

Cost Model
Candidate 2
2
C(w,b) = ∑(yn-ŷn)
n∈{0,1,2} y = 2x + 2
C(2,2) 68
Cost functions are our friends
Data Model

n x y yn = wxn + b
0 1 0

1 5 16

2 6 20
How to find the parameters w and b?
Cost

C(w,b) = ∑n∈{0,1,2}(yn − ŷn)²
Optimizers are our friends
Data Model

n x y yn = wxn + b
0 1 0

1 5 16

2 6 20

Cost: C(w,b) = ∑n∈{0,1,2}(yn − ŷn)²
Optimizer: arg min(w,b ∈ [-∞,∞]) C(w,b)
Optimizers are our friends
Optimizer
w
arg min C(w,b)
w,b∈[-∞,∞]

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68

2 68

2 b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = ?

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = 26
n x ŷ y (y-ŷ)²

0 1 0 5 25

1 5 16 17 1

2 6 20 20 0
b
C(3,2) 26
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
w1,b1 = 3,2 : C(w1,b1) = 26
n x ŷ y (y-ŷ)²

0 1 0 5 25

1 5 16 17 1

2 6 20 20 0
b
C(3,2) 26
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 4,2 : C(w2,b2) = ??

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 4,2 : C(w2,b2) = 136
n x ŷ y (y-ŷ)²

0 1 0 6 36

1 5 16 22 64

2 6 20 26 36
b
C(4,2) 136
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 3,3 : C(w2,b2) = 41
n x ŷ y (y-ŷ)²

0 1 0 6 36

1 5 16 18 4

2 6 20 21 1
b
C(3,3) 41
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w1,b1 = 3,2 : C(w1,b1) = 26
w2,b2 = 3,1 : C(w2,b2) = 17
n x ŷ y (y-ŷ)²

0 1 0 4 16

1 5 16 16 0

2 6 20 19 1
b
C(3,1) 17
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w2,b2 = 3,1 : C(w2,b2) = 17

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w2,b2 = 3,1 : C(w2,b2) = 17
w3,b3 = 3,0 : C(w3,b3) = 14
n x ŷ y (y-ŷ)²

0 1 0 3 9

1 5 16 15 1

2 6 20 18 4
b
C(3,0) 14
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 3,-1 : C(w4,b4) = 17
n x ŷ y (y-ŷ)²

0 1 0 2 4

1 5 16 14 4

2 6 20 17 9
b
C(3,-1) 17
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 2,0 : C(w4,b4) = 104
n x ŷ y (y-ŷ)²

0 1 0 2 4

1 5 16 10 36

2 6 20 12 64
b
C(2,0) 104
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 4,0 : C(w4,b4) = 48
n x ŷ y (y-ŷ)²

0 1 0 4 16

1 5 16 20 16

2 6 20 24 16
b
C(4,0) 48
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14

The End?
b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w?,b? = 4,-2 : C(w?,b?) = ??

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w?,b? = 4,-2 : C(w?,b?) = 12

n x ŷ y (y-ŷ)²

0 1 0 2 4

1 5 16 18 4

2 6 20 22 4
b
C(4,-2) 12
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14

Search
Problem

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w3,b3 = 3,0 : C(w3,b3) = 14
w4,b4 = 3.01,0 : C(w4,b4) = 13.73
n x ŷ y (y-ŷ)²

0 1 0 3.01 9.06

1 5 16 15.05 0.90

2 6 20 18.06 3.76
C(3.01,0) 13.73
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w*,b* = 4,-2 : C(w*,b*) = 12

b
Optimizers are our friends

-Worse minimum
Large Step Size -But gets there faster

Vs

Small Step Size


-Better Minimum
-But gets there slowly
Optimizers are our friends

Step Size

Step Size

Step Size

Step Size

Step Size
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w*,b* = 4,-2 : C(w*,b*) = 12

b
Optimizers are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w*,b* = 4,-4 : C(w*,b*) = 0

b
Optimizers are our friends
y = wx + b
Data

x ŷ

1 0

5 16

6 20

3
?
Optimizers are our friends
y = 4x - 4
Data

x ŷ

1 0

5 16

6 20

3
?
Optimizers are our friends
y = 4x - 4
Data

x ŷ

1 0

5 16

6 20

3
8
Functions are our friends
y = wx + b

y : Is this a cat
x : Image
Functions are our friends
pixel (1,1)
pixel(1,3)
High
if cat
y = w1x1 + w2x2 + w3x3 + w4x4 + b

y : Is this a cat
x : Image
Functions are our friends
pixel (1,1)
pixel(1,3)
High
if cat
y = w1x1 + w2x2 + w3x3 + w4x4 + b

y : Is this a cat
x : Image

Millions of parameters Millions of samples


Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]

Very expensive
to compute
(hours or days)

b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]

Should be used
sparingly

b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68

2 68

2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1 hw
2 68

2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1 hw
C(w0+hw,b0) = C(3,2) = 26 2 68

2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1 hw
C(w0+hw,b0) = C(3,2) = 26 2 68
r = (C(w0+1,b0) - C(w0,b0)) / 1
r = (C(3,2) - C(2,2)) / 1 = -42
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1, r = -42 hw
hw = 0.1, r = -98 2 68
hw = 0.01, r = -104
hw = 0.001, r = -104
2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1, r = -42 hw
hw = 0.1, r = -98 2 68
hw = 0.01, r = -104
hw = 0.001, r = -104
hw → 0, r = ∂C/∂w (w0,b0)
2 b
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
∂C/∂w = ∂[∑n(yn − ŷn)²] / ∂w
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
∂C/∂w = ∂[∑n(yn − ŷn)²] / ∂w = ∑n 2(yn − ŷn)xn
Gradients are our friends
Optimizer

arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68

n  x  ŷ   y   (y-ŷ)  2(y-ŷ)x
0  1  0   4    4      8
1  5  16  12  -4     -40
2  6  20  14  -6     -72

∂C/∂w = ∂[∑n(yn − ŷn)²] / ∂w = ∑n 2(yn − ŷn)xn

hw → 0, r = ∂C/∂w (w0,b0) = -104
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68
2 hw
∂C ∂∑(yn-ŷn)
= n = ∑2(y
n
n-ŷn)xn 2 68
∂w ∂w
∂C/∂b = ∂[∑n(yn − ŷn)²] / ∂b = ∑n 2(yn − ŷn)
Gradients are our friends
Optimizer

arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68

n  x  ŷ   y   (y-ŷ)  2(y-ŷ)
0  1  0   4    4      8
1  5  16  12  -4     -8
2  6  20  14  -6     -12

hw → 0, rw = ∂C/∂w (w0,b0) = -104
hb → 0, rb = ∂C/∂b (w0,b0) = -12
Gradients are our friends
Optimizer
w
y = wx + b
arg min C(w,b)
w,b∈[-∞,∞]
w0,b0 = 2,2 : C(w0,b0) = 68

hw → 0, rw = ∂C/∂w (w0,b0) = -104
hb → 0, rb = ∂C/∂b (w0,b0) = -12

w1 = w0 - η·rw
b1 = b0 - η·rb

η → Learning Rate / Step Size
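The update rule above can be run as a loop: compute the gradient of the cost, take a small step against it, repeat. A minimal sketch on the slide's data, starting at the slide's point (w,b) = (2,2); the learning rate value 0.001 is my own choice, not from the slides:

```python
# Gradient descent on the square loss from the slides.
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    return sum((w * x + b - y_hat) ** 2 for x, y_hat in data)

def gradients(w, b):
    # dC/dw = sum 2(y - ŷ)x,  dC/db = sum 2(y - ŷ), as derived on the slides
    dw = sum(2 * (w * x + b - y_hat) * x for x, y_hat in data)
    db = sum(2 * (w * x + b - y_hat) for x, y_hat in data)
    return dw, db

w, b = 2.0, 2.0   # starting point from the slides, C(2,2) = 68
eta = 0.001       # learning rate / step size (a hypothetical choice)
for _ in range(20000):
    dw, db = gradients(w, b)
    w, b = w - eta * dw, b - eta * db

print(round(w, 2), round(b, 2))  # approaches the perfect fit w = 4, b = -4
```

At (2,2) the gradients are exactly (-104, -12), matching the finite-difference estimates on the slides, and the loop drives the cost toward 0.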
Summary
Data Model

n x ŷ yn = wxn + b
0 1 0

1 5 16

2 6 20

Cost: C(w,b) = ∑n∈{0,1,2}(yn − ŷn)²
Optimizer: arg min(w,b ∈ [-∞,∞]) C(w,b)
Summary
Data Model

n x ŷ yn = wxn + b System
0 1 0

1 5 16 y = 4x - 4
2 6 20

Cost: C(w,b) = ∑n∈{0,1,2}(yn − ŷn)²
Optimizer: arg min(w,b ∈ [-∞,∞]) C(w,b)
Into Deep Learning
Nonlinear Neural Models
y = 4x-4
Data
1
0

5
16

6
20

3
?
Nonlinear Neural Models
There is a limit
of bananas I
Data can give you
1
0

5
16

6
20

3
?
Nonlinear Neural Models
Data
y y = 4x-4
n x ŷ

0 1 0

1 5 16

2 6 20 x
Nonlinear Neural Models
Data
y y = 4x-4
n x ŷ

0 1 0

1 5 16

2 6 20 x
3 9 20

4 11 20
Nonlinear Neural Models
Data
y y = 2x+3
n x ŷ

0 1 0

1 5 16

2 6 20 x
3 9 20
4 11 20
Model Problem
Nonlinear Neural Models
Data
y y = 2x+3
n x ŷ

0 1 0

1 5 16

2 6 20 Underfitting x
3 9 20
4 11 20
Model Problem
Nonlinear Neural Models
Data
y y = ???
n x ŷ

0 1 0

1 5 16

2 6 20 x
3 9 20

4 11 20 Can we learn
arbitrary functions?
Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2


Use different linear functions
depending on the value of x?
Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2


s1 - 1 if x < 6 and 0 otherwise
s2 - 1 if x >= 6 and 0 otherwise
Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2


s1 - 1 if x < 6 and 0 otherwise
Data
s2 - 1 if x >= 6 and 0 otherwise
n x ŷ

0 1 0

1 5 16
y = (4x - 4)s1 + (0x+20)s2
2 6 20

3 9 20

4 11 20
Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x+b2)s2


s1 - 1 if x < 6 and 0 otherwise
Data
s2 - 1 if x >= 6 and 0 otherwise
n x ŷ

0 1 0

1 5 16
y = (4x - 4)s1 + (0x+20)s2
2 6 20

3 9 20

4 11 20 ?
?
Nonlinear Neural Models

s = σ(wx + b)

σ(t) = 1 / (1 + e^-t)
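The sigmoid σ(t) = 1 / (1 + e^-t) can be sketched directly; the branching below is only there to avoid floating-point overflow for the very large weighted sums the slides use (e.g. σ(-5000)):

```python
import math

def sigmoid(t):
    """σ(t) = 1 / (1 + e^-t), written to stay stable for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

# With a large weight, the sigmoid acts as a near-hard switch:
print(sigmoid(1000 * 0.1))         # x = 0.1 -> ~1
print(sigmoid(1000 * -0.1))        # x = -0.1 -> ~0
print(sigmoid(1000 * 6.1 - 6000))  # switch moved to x = 6: x = 6.1 -> ~1
```

This saturating behavior is what lets the next slides build piecewise-linear models out of sigmoid "switches".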
Nonlinear Neural Models

s = σ(1000x)
Nonlinear Neural Models

s = σ(1000x)

x = 0.1 then σ(1000x) ≈ 1

x = -0.1 then σ(1000x) ≈ 0
Nonlinear Neural Models

s = σ(1000x - 6000)

x = 6.1 then σ(1000x - 6000) ≈ 1

x = 5.9 then σ(1000x - 6000) ≈ 0
Nonlinear Neural Models

y = (w1x + b1)s1 + (w2x + b2)s2

s1 = σ(w3x + b3)
s2 = σ(w4x + b4)
Nonlinear Neural Models

Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (16)s1 + (0x+20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (16)s1 + (20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (16)s1 + (20)s2
0 1 0

1 5 16
s1 = (1000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (16)s1 + (20)s2
0 1 0

1 5 16
s1 = (1000)
2 6 20 s2 = (-1000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (16)1 + (20)0
0 1 0

1 5 16
s1 = (1000)
2 6 20 s2 = (-1000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = 16
0 1 0

1 5 16
s1 = (1000)
2 6 20 s2 = (-1000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (32)s1 + (0x+20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (32)s1 + (20)s2
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (32)s1 + (20)s2
0 1 0

1 5 16
s1 = (-3000)
2 6 20 s2 = (1000x - 6000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (32)s1 + (20)s2
0 1 0

1 5 16
s1 = (-3000)
2 6 20 s2 = (3000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = (32)0 + (20)1
0 1 0

1 5 16
s1 = (-3000)
2 6 20 s2 = (3000)
3 9 20

4 11 20
Nonlinear Neural Models

Data

n x ŷ
y = 20
0 1 0

1 5 16
s1 = (-3000)
2 6 20 s2 = (3000)
3 9 20

4 11 20
Nonlinear Neural Models
If you give me
too many
Data apples, I will
give you less
1
0

5
16

6
20

3
?
Multilayer Perceptrons
Data

n x ŷ
y y = (4x - 4)s1 + (0x+20)s2
0 1 0

1 5 16

2 6 20
x
3 9 20

4 11 20
Multilayer Perceptrons
Data

n x ŷ
y y = (4x - 4)s1 + (0x+20)s2
0 1 0

1 5 16

2 6 20
x
3 9 20

4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ

0 1 0

1 5 16

2 6 20

3 9 20 y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3


4 11 20
s1 = (-1000x + 6000)
5 15 1

6 19 1 s2 = ????
s3 = (1000x - 15000)
Multilayer Perceptrons
Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = not s1 and not s3
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x + b2)s2 + (w3x + b3)s3

s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)
Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3


s1 = (w4x + b4) Layer 1 Perceptron

s2 = (w5s1 + w6s3 + b5)


s3 = (w7x + b6) Layer 1 Perceptron
Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x + b2)s2 + (w3x + b3)s3

s1 = σ(w4x + b4)          Layer 1 Perceptron
s2 = σ(w5s1 + w6s3 + b5)  Layer 2 Perceptron
s3 = σ(w7x + b6)          Layer 1 Perceptron


Multilayer Perceptrons
Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = not s1 and not s3
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (-4000) = 0
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (- 0 - 0 + 500)
3 9 20
s3 = (-4000) = 0
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500)
3 9 20
s3 = (-4000) = 0
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (40)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500) = 1
3 9 20
s3 = (-4000) = 0
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (40)0 + (20)1 + (1)0
0 1 0

1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500) = 1
3 9 20
s3 = (-4000) = 0
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = 20
0 1 0

1 5 16
s1 = (-5000) = 0
2 6 20 s2 = (500) = 1
3 9 20
s3 = (-4000) = 0
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
2 6 20 s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-1000x + 6000)
s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (1000x - 15000)
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-13000) = 0
s2 = (-1000s1 - 1000s3 + 500)
3 9 20
s3 = (4000) = 1
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-1000 + 0 + 500)
3 9 20
s3 = (4000) = 1
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (72)s1 + (20)s2 + (1)s3
0 1 0

1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-500) = 0
3 9 20
s3 = (4000) = 1
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y = (72)0 + (20)0 + (1)1
0 1 0

1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-500) = 0
3 9 20
s3 = (4000) = 1
4 11 20

5 15 1

6 19 1
Multilayer Perceptrons
Data

n x ŷ
y=1
0 1 0

1 5 16
s1 = (-13000) = 0
2 6 20 s2 = (-500) = 0
3 9 20
s3 = (4000) = 1
4 11 20

5 15 1

6 19 1
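The two-layer evaluation above can be sketched directly. The layer-2 unit s2 reads the outputs of the layer-1 units s1 and s3, implementing "not s1 and not s3" with a weighted sum (sigmoid written with an overflow guard):

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def model(x):
    """The slides' two-layer model: each region of x activates one piece."""
    s1 = sigmoid(-1000 * x + 6000)               # layer 1: ~1 when x < 6
    s3 = sigmoid(1000 * x - 15000)               # layer 1: ~1 when x > 15
    s2 = sigmoid(-1000 * s1 - 1000 * s3 + 500)   # layer 2: "not s1 and not s3"
    return (4 * x - 4) * s1 + 20 * s2 + 1 * s3

print(round(model(5)))   # 16 (s1 active)
print(round(model(9)))   # 20 (s2 active)
print(round(model(19)))  # 1  (s3 active)
```

Each input lands in exactly one region, so exactly one of the three linear pieces contributes to the output.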
Multilayer Perceptrons
Data
y
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
n x ŷ

0 1 0

1 5 16

2 6 20
x
3 9 20

4 11 20

5 15 1

6 19 1
Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3


s1 = (w4x + b4) x
w4x
s2 = (w5s1 + w6s3 + b5) s
1
s
3

s3 = (w7x + b6) b4
s
2
Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3


s1 = (w4x + b4) x
w4x w7x
s2 = (w5s1 + w6s3 + b5) s
1
s
3

s3 = (w7x + b6) b4
s
b5

2
Multilayer Perceptrons

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3


s1 = (w4x + b4) x

s2 = (w5s1 + w6s3 + b5) s


1
s
3

s3 = (w7x + b6) w5s1


s
w6s3

2
b5
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1: x < 6     s3: x > 15

s2: !(x > 15) & !(x < 6)
Multilayer Perceptrons
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3

s1: x < 6     s3: x > 15

s2: x∈[6,15]
Multilayer Perceptrons

s s
x<6 1 3 x > 15

s s
2 4

x∈[6,15] x∈]-∞,6] & ]15,∞]


Multilayer Perceptrons
x

s s s s
x<6 1 2 x > 15 3 x>2 4 x<3

s s s s
5 6 7 7

x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]


Multilayer Perceptrons
x Input

s s s s
x<6 1 2 x > 15 3 x>2 4 x<3 Layer 1 (Input Features)

s s s s
5 6 7 7

x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3] Layer 2 (And and Or Combinations)
Multilayer Perceptrons
x Input

s s s s
x<6 1 2 x > 15 3 x>2 4 x<3 Layer 1 (Input Features)

s s s s
5 6 7 7

x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3] Layer 2 (And and Or Combinations)

And(s1,s2) = σ(1000s1 + 1000s2 - 1500)

Or(s1,s2) = σ(1000s1 + 1000s2 - 500)
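The And/Or combinations can be sketched with the same saturated sigmoid: only the bias differs, so And fires when both inputs are near 1 and Or fires when at least one is (sigmoid written with an overflow guard; the slides write the weighted sums, the σ wrapper is the logistic function they define earlier):

```python
import math

def sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    z = math.exp(t)
    return z / (1.0 + z)

def AND(s1, s2):
    # fires only if both inputs are ~1: 1000 + 1000 - 1500 = 500 > 0
    return sigmoid(1000 * s1 + 1000 * s2 - 1500)

def OR(s1, s2):
    # fires if at least one input is ~1: 1000 - 500 = 500 > 0
    return sigmoid(1000 * s1 + 1000 * s2 - 500)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, round(AND(a, b)), round(OR(a, b)))
```

Composing these gates across layers is what makes Xor (and the interval tests on the slides) expressible.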
Multilayer Perceptrons
x Input

s s s s
1 2 3 4 Layer 1 (Input Features)

s s s s
5 6 7 7

Layer 2 (And and Or Combinations)

s s s s
8 9 a b Layer 3 (Xor Combinations)
Multilayer Perceptrons
x Input

s s s s
1 2 3 4 Layer 1 (Input Features)

s s s s
5 6 7 7

Layer 2 (And and Or Combinations)

s s s s
8 9 a b Layer 3 (Xor Combinations)

Xor(s1,s2) = Or(And(s1,!s2), And(!s1,s2))


Multilayer Perceptrons
Data
y
n x ŷ

0 1 0

1 5 16

2 6 20
x
3 9 20

4 11 20 Universal
approximator
5 15 1

6 19 1
Multilayer Perceptrons
Data
y
n x ŷ

0 1 0

1 5 16

2 6 20
x
3 9 20

4 11 20
but...
5 15 1

6 19 1
Multilayer Perceptrons
Data
y
n x ŷ

0 1 0

1 5 16

2 6 20
x
3 9 20

4 11 20
No guarantee that
the best function will
5 15 1
be found
6 19 1
Multilayer Perceptrons
n x ŷ
x
0 1 0

1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s
5 6 7

x∈]-∞,1] x∈[5,6[ x∈[6,∞]

y
Multilayer Perceptrons
n x ŷ
x
0 1 0

1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s
5 6 7

x∈]-∞,1] x∈[5,6[ x∈[6,∞]

y = 0s5 + 16s6 + 20s7


Multilayer Perceptrons
n x ŷ
x
0 1 0

1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s
5 6 7

x∈]-∞,1] x∈[5,6[ x∈[6,∞]

y = 0s5 + 16s6 + 20s7


Multilayer Perceptrons
n x ŷ
x
0 1 0

1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s s s
5 6 7

x∈]-∞,1] x∈[5,6[ x∈[6,∞]

y = 0s5 + 16s6 + 20s7


Multilayer Perceptrons
n x ŷ
x
0 1 0

1 5 16 s s s s
x>1 1 2 x<2 3 x<5 4 x<6
2 6 20
s5: x∈]-∞,1]   s6: x∈[5,6[   s7: x∈[6,∞]
Overfitting Model Problem

y = 0s5 + 16s6 + 20s7


Multilayer Perceptrons
Task
Complexity

Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Overfitting

Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Linear Regression → more features → Linear Regression

Model Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting
Linear Regression   MLP 1 Layer   MLP 2 Layer   MLP 3 Layer

Model Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Sentiment Overfitting
analysis
Linear Regression   MLP 1 Layer   MLP 2 Layer   MLP 3 Layer

Model Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Machine Translation
Happy Zone

Sentiment Overfitting
analysis
Linear Regression   MLP 1 Layer   MLP 2 Layer   MLP 3 Layer

Model Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Data
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Data
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Data
Model
Complexity
Multilayer Perceptrons
n x ŷ
y y y
0 1 0

1 5 16

2 6 20
Multilayer Perceptrons
n x ŷ
y y y
0 1 0

1 5 16

2 6 20

3 2 4
Multilayer Perceptrons
n x ŷ
y y
0 1 0

1 5 16

2 6 20

3 2 4
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Model Bias
Model
Complexity
Multilayer Perceptrons
Task
Complexity Underfitting

Happy Zone

Overfitting

Model Bias
L1 & L2 Regularization
Stochastic Dropout (Srivastava et al, 2014)
Model Structure (CNNs, RNNs)
Model Complexity
Multilayer Perceptrons
Regularization

C(w,b) = ∑n∈{0,1,2}(yn − ŷn)² + β(w² + b²)

β = Regularization constant
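A regularized cost can be sketched as the slide's square loss plus a penalty on the parameters. I use the standard L2 form β(w² + b²) here as an assumption about the intended penalty; the data and the "less effort" intuition are from the slides:

```python
# Square loss plus an L2 penalty on the parameters.
data = [(1, 0), (5, 16), (6, 20)]

def cost(w, b, beta):
    fit = sum((w * x + b - y_hat) ** 2 for x, y_hat in data)
    return fit + beta * (w ** 2 + b ** 2)

# With beta = 0 the optimizer is free to pick large weights;
# a larger beta prefers solutions with smaller w and b ("less effort"):
print(cost(4, -4, 0.0))  # 0.0  (perfect fit, no penalty)
print(cost(4, -4, 1.0))  # 32.0 (same fit, now charged for parameter size)
```

The optimizer then trades fit against parameter size, which is what pushes unneeded units toward "nothing" in the pruning picture above.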
Multilayer Perceptrons
Regularization
x

s s s s
x>1 1 2 x<2 3 x<5 4 x<6

s s s
5 6 7

x∈]-∞,1] x∈[5,6[ x∈[6,∞]

y
Multilayer Perceptrons
Regularization
x

s s s s
x>1 1 2 nothing 3 nothing 4 x<6

s s s
5 6 7

x∈]-∞,1] nothing x∈[6,∞]

y
Multilayer Perceptrons
Regularization
x

s s s s
x>1 1 2 nothing 3 nothing 4 x<6

s s s
5 6 7

x∈]-∞,1] nothing x∈[6,∞]

Find solutions that


require less effort y
Using Discrete Variables

Data
1
0

5
16

6
20

3
?
Using Discrete Variables

Data
1
0

5
16

6
20

3
?
Using Discrete Variables

Data
1
0

5
16

6
20

3
? ?
Using Discrete Variables
Number of fruit to offer
x

s s s s
1 2 3 4

s s s
5 6 7

y Number of fruit received


Using Discrete Variables
Number of fruit to offer
x

s1

s2

y Number of fruit received


Using Discrete Variables
Type of fruit to offer Number of fruit to offer
u x

s1

s2

Type of fruit received v y Number of fruit received


Using Discrete Variables
Type of fruit to offer Number of fruit to offer
u x

u∈{Apple, Banana, Coconut}


s1

s2

v∈{Apple, Banana, Coconut}


Type of fruit received v y Number of fruit received
Using Discrete Variables
Lookup Tables
u

e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5


V=3
Using Discrete Variables
Lookup Tables
u

e1 e2 e3 e4
Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5


V=3
Using Discrete Variables
Lookup Tables
u

e1 e2 e3 e4 Embedding for u Size = 4

Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5


V=3
Using Discrete Variables
Lookup Tables
u Banana

e1 e2 e3 e4 Embedding for u Size = 4

Apple 0.1 -0.4 0.2 0.5

Banana 0.4 1.4 -1.0 0.1

Coconut 1.1 0.9 1.1 0.5


V=3
Using Discrete Variables
Lookup Tables
u 1

e1 e2 e3 e4 Embedding for u Size = 4

0 0.1 -0.4 0.2 0.5

1 0.4 1.4 -1.0 0.1

2 1.1 0.9 1.1 0.5


V=3
Using Discrete Variables
Lookup Tables
u 1

Lookup

Embedding for u Size = 4


Using Discrete Variables
Type of fruit to offer Number of fruit to offer
u x
Lookup
eu

u∈{Apple, Banana, Coconut}


s1

s2

v∈{Apple, Banana, Coconut}


Type of fruit received v y Number of fruit received
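The lookup step above is just a table indexed by the discrete symbol: each word in the vocabulary (V = 3) owns a learned row of 4 numbers. A minimal sketch with the slide's table (in practice the rows are parameters updated by the optimizer, not constants):

```python
# The slides' lookup table: vocabulary size V = 3, embedding size 4.
embeddings = {
    "Apple":   [0.1, -0.4, 0.2, 0.5],
    "Banana":  [0.4, 1.4, -1.0, 0.1],
    "Coconut": [1.1, 0.9, 1.1, 0.5],
}

def lookup(u):
    """Map a discrete symbol to its 4-dimensional embedding row."""
    return embeddings[u]

print(lookup("Banana"))  # [0.4, 1.4, -1.0, 0.1]
```

The returned vector e_u is then fed into the continuous part of the network exactly like any other input.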
Using Discrete Variables
Softmax

V=3

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4


Using Discrete Variables
Softmax
V=3
Input vector Size = 4

Apple Banana Coconut

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4


Using Discrete Variables
Softmax
V=3
Input vector Size = 4

logits
Apple Banana Coconut
Size = V

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4


Using Discrete Variables
Softmax
V=3
s s s s Input Vector
1 2 3 4

d d d Apple Banana Coconut


Logits
1 2 3
1 -1 -2

w1 0.1 -0.4 0.2

w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4


Using Discrete Variables
Softmax
V=3
s s s s Input Vector
1 2 3 4

d d d Apple Banana Coconut


Logits
1 2 3
1 -1 -2

w1 0.1 -0.4 0.2


p1 p2 p3
0.84 0.11 0.05 w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1

w4 1.3 0.1 0.4


Using Discrete Variables
Softmax
V=3
s s s s Input Vector
1 2 3 4

d d d Apple Banana Coconut


Logits
1 2 3
1 -1 -2

w1 0.1 -0.4 0.2


p1 p2 p3
0.84 0.11 0.05 w2 0.4 1.4 -1.0

pi = exp(di) / ∑j exp(dj)
w3 1.1 0.9 1.1
w4 1.3 0.1 0.4


Using Discrete Variables
Softmax
V=3
s s s s Input Vector
1 2 3 4

d d d Apple Banana Coconut


Logits
1 2 3
1 -1 -2

w1 0.1 -0.4 0.2


p1 p2 p3
0.84 0.11 0.05 w2 0.4 1.4 -1.0

w3 1.1 0.9 1.1


Apple
w4 1.3 0.1 0.4
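The softmax step can be sketched on the slide's logits. Subtracting the max before exponentiating is a standard numerical-stability trick, not something the slide shows; note the exact third probability is 0.042, which the slide rounds up to 0.05:

```python
import math

def softmax(logits):
    """p_i = exp(d_i) / sum_j exp(d_j), computed stably."""
    m = max(logits)
    exps = [math.exp(d - m) for d in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The slide's logits for (Apple, Banana, Coconut):
p = softmax([1, -1, -2])
print([round(pi, 2) for pi in p])  # [0.84, 0.11, 0.04] -> Apple wins
```

The probabilities always sum to 1, and the highest logit gets the highest probability, which is why the model outputs "Apple".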
Using Discrete Variables
Type of fruit to offer Number of fruit to offer
u x
Lookup
eu

u∈{Apple, Banana, Coconut}


s1

s2

Softmax
v∈{Apple, Banana, Coconut}
Type of fruit received v y Number of fruit received
Using Discrete Variables
Type of fruit to offer Number of fruit to offer
u x
Lookup
eu

u∈{Apple, Banana, Coconut}


s1

s2

Softmax
v∈{Apple, Banana, Coconut}
Type of fruit received v y Number of fruit received
Summary

Inputs: Continuous - values, Sparse - embeddings (lookup)
MLP
Outputs: Continuous - linear, Sparse - softmax
Example Applications
Embedding Pretraining (Collobert et al, 2011)

Abby likes to eat apples and bananas


Example Applications
Embedding Pretraining (Collobert et al, 2011)

Predict
Context
Abby likes to eat apples and bananas
Example Applications
Embedding Pretraining (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-4 e-3 e-2 e-1

s1
Softmax
s2
Example Applications
Embedding Pretraining (Collobert et al, 2011)
edrink
eat
Cosine similarity

eeat

ebuild
Example Applications
Embedding Pretraining (Collobert et al, 2011)
eat
edrink
eeat

Cosine similarity

ebuild
Example Applications
Example Applications
Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

NNP VBZ TO VB NNS CC NNS


Example Applications
Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2


Example Applications
Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2 Word Embeddings

s1 Non-Linear Layer 1

s2 Non-Linear Layer 2
Example Applications
Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2 Word Embeddings

s1 Non-Linear Layer 1

s2 Non-Linear Layer 2

VB Softmax
Example Applications
Window-based Tagging (Collobert et al, 2011)

Abby likes to eat apples and bananas

e-2 e-1 e-0 e1 e2 Word Embeddings

s1 Non-Linear Layer 1

s2 Non-Linear Layer 2

VB Softmax
Example Applications
Window-based Tagging (Collobert et al, 2011)
Example Applications
Translation Rescoring (Devlin et al, 2014)

Translation 1 John does to eat coconuts and bananas

Translation 2 Abby likes to eat apples and bananas

Translation 3 Abby dislikes to drink apples and bananas

Source Abby gosta de comer maçãs e bananas


Example Applications
Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas


<s>
0.2
Example Applications
Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas


0.2 0.1
Example Applications
Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas


0.2 0.1 0.3
Example Applications
Translation Rescoring (Devlin et al, 2014)

Abby likes to eat apples and bananas 0.000378


0.2 0.1 0.3 0.5 0.7 0.9 0.2
Example Applications
Translation Rescoring (Devlin et al, 2014)

John does to eat coconuts and bananas 0.00003

Abby likes to eat apples and bananas 0.000378

Abby dislikes to drink apples and bananas 0.00012
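The score of each translation is simply the product of its per-word probabilities. A tiny sketch with illustrative values (chosen so the product matches the 0.000378 above; the slide's individual values may differ):

```python
word_probs = [0.2, 0.1, 0.3, 0.5, 0.7, 0.9, 0.2]  # illustrative values

score = 1.0
for p in word_probs:      # P(w1) · P(w2 | history) · ...
    score *= p
print(round(score, 6))    # → 0.000378
```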


Example Applications
Translation Rescoring (Devlin et al, 2014)

Predict
Context
Translation
Abby likes to eat apples and bananas

Source
Abby gosta de comer maçãs e bananas
Example Applications
Translation Rescoring (Devlin et al, 2014)

Translation
Abby likes to eat apples and bananas

e-4 e-3 e-2 e-1


f-1

s1
maçãs
s2
Example Applications
Translation Rescoring (Devlin et al, 2014)

Translation Score (BLEU) Arabic - English Chinese - English

Best Rescored System 52.8 34.7

1st OpenMT12 49.5 32.6

Hierarchical 43.4 30.1


Computation Graphs are our friends
C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)² ,   y = wx + b

∂C/∂w = ∂[∑_n (ŷ_n - y_n)²]/∂w = ∑_n -2·(ŷ_n - y_n)·x_n

∂C/∂b = ∂[∑_n (ŷ_n - y_n)²]/∂b = ∑_n -2·(ŷ_n - y_n)

Easy!
Computation Graphs are our friends
y = wx + b + tanh(wx + b)²

Harder!
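The closed-form gradients above can be checked numerically. A sketch with made-up data points (xs and ys_hat are assumptions, not the slides' data):

```python
xs, ys_hat = [1.0, 2.0, 3.0], [5.0, 8.0, 11.0]    # made-up (x_n, ŷ_n) pairs
w, b = 2.0, 1.0

def C(w, b):
    # C(w,b) = sum_n (y_n - ŷ_n)^2 with y = w*x + b
    return sum((w * x + b - y_hat) ** 2 for x, y_hat in zip(xs, ys_hat))

# analytic gradients from the slide
dC_dw = sum(-2 * (y_hat - (w * x + b)) * x for x, y_hat in zip(xs, ys_hat))
dC_db = sum(-2 * (y_hat - (w * x + b)) for x, y_hat in zip(xs, ys_hat))

# finite-difference check
eps = 1e-6
num_dw = (C(w + eps, b) - C(w - eps, b)) / (2 * eps)
num_db = (C(w, b + eps) - C(w, b - eps)) / (2 * eps)
print(abs(dC_dw - num_dw) < 1e-4, abs(dC_db - num_db) < 1e-4)  # → True True
```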
Computation Graphs are our friends
y = w₁x + b₁ + tanh(w₂x + b₂)²

Computation Graphs can compute gradients for you!
Computation Graphs are our friends
C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)² ,   y = wx + b

Apply the chain rule through y:

∂C/∂w = ∑_n [∂(ŷ_n - y_n)²/∂y_n]·[∂y_n/∂w] = ∑_n -2·(ŷ_n - y_n)·x_n

∂C/∂b = ∑_n [∂(ŷ_n - y_n)²/∂y_n]·[∂y_n/∂b] = ∑_n -2·(ŷ_n - y_n)

Now break the computation into one variable per operation:

o = wx ,   y = o + b ,   d = y - ŷ ,   c = d² ,   C(w,b) = ∑_{n∈{0,1,2}} c_n

so that the chain rule becomes a product of one local derivative per operation:

∂C/∂w = ∑_n [∂c_n/∂d_n]·[∂d_n/∂y_n]·[∂y_n/∂o_n]·[∂o_n/∂w]

∂C/∂b = ∑_n [∂c_n/∂d_n]·[∂d_n/∂y_n]·[∂y_n/∂b]
Computation Graphs are our friends
C(w,b) = ∑_{n∈{0,1,2}} c_n ,   c = d² (Power 2) ,   d = y - ŷ (Sub) ,   y = o + b (Add) ,   o = wx (Product)

∂C/∂w = ∑_n [∂c_n/∂d_n]·[∂d_n/∂y_n]·[∂y_n/∂o_n]·[∂o_n/∂w]

∂C/∂b = ∑_n [∂c_n/∂d_n]·[∂d_n/∂y_n]·[∂y_n/∂b]

Each operation implements two methods:

forward(x,y) → z
backward(x,y,dz) → dx,dy

For example, Sub:

forward(x,y) : return x - y
backward(x,y,dz) : return 1, -1   (the local derivatives: for d = y - ŷ, ∂d_n/∂y_n = 1 and ∂d_n/∂ŷ_n = -1)
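The Sub operation above, plus a Product for o = wx, written out as a sketch (a toy version of what toolkits implement; backward here already multiplies the local derivative by the incoming gradient dz):

```python
class Sub:
    def forward(self, x, y):
        return x - y
    def backward(self, x, y, dz):
        # local derivatives +1 and -1, scaled by the incoming gradient dz
        return dz * 1, dz * -1

class Product:
    def forward(self, x, y):
        return x * y
    def backward(self, x, y, dz):
        # ∂z/∂x = y and ∂z/∂y = x
        return dz * y, dz * x

d = Sub().forward(12, 16)             # y - ŷ = -4
dy, dyhat = Sub().backward(12, 16, -8)
print(d, dy, dyhat)                   # → -4 -8 8
```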
Computation Graphs are our friends
Build one graph node per variable and one per operation:

w, x → [Product] → o    (o = wx)
o, b → [Add] → y        (y = o + b)
y, ŷ → [Sub] → d        (d = y - ŷ)
d → [Power 2] → c       (c = d²)
c → [Id] → C            (C = ∑_{n∈{0}} c_n : one training example at a time)
Computation Graphs are our friends

x and ŷ have no input edges and are set from outside the graph: they are External Inputs.

w and b also have no input edges, but are internal to the graph: they are Parameters.
Computation Graphs are our friends
Variables hold the concrete values of the running example: x = 5 , ŷ = 16 , w = 2 , b = 2.
Computation Graphs are our friends
1-Initialize inputs: x = 5 and ŷ = 16 are set (parameters start at w = 2, b = 2).
2-Initialize variables: each variable stores 2 values, x and dx (its value and its gradient), both initialized to (0, 0).
Computation Graphs are our friends
3-Topological Sort variables: order the variables so that each one comes after all of its inputs. Here: o (1st), y (2nd), d (3rd), c (4th), C (5th).
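One way to obtain this ordering is Kahn's algorithm. A sketch over the slide's graph (the edge list is written out by hand; `yhat` stands for ŷ):

```python
from collections import deque

# edges point from each variable to the variables that depend on it
edges = {"w": ["o"], "x": ["o"], "o": ["y"], "b": ["y"],
         "y": ["d"], "yhat": ["d"], "d": ["c"], "c": ["C"], "C": []}

indegree = {v: 0 for v in edges}
for v in edges:
    for u in edges[v]:
        indegree[u] += 1

queue = deque(v for v in edges if indegree[v] == 0)  # inputs and parameters
order = []
while queue:
    v = queue.popleft()
    order.append(v)
    for u in edges[v]:          # releasing v may make its consumers ready
        indegree[u] -= 1
        if indegree[u] == 0:
            queue.append(u)

print(order.index("o") < order.index("y") < order.index("d") < order.index("c"))  # → True
```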
Computation Graphs are our friends
Processing the variables in this order, every input is ready when it is needed: o = wx = 10, then y = o + b = 12, then d = y - ŷ = -4, and so on.
Computation Graphs are our friends
With a wrong order instead (computing y before o), y = o + b reads the stale value o = 0 and yields y = 2; the error propagates and d = y - ŷ comes out as -14 instead of -4. Topological order prevents this.
Computation Graphs are our friends
The sort depends only on the variables: for the chain o → y → d → c → C the order is o (1st), y (2nd), d (3rd), c (4th), C (5th).
Computation Graphs are our friends
The graph need not be a chain: adding variables s and g (each produced by its own Add operation) still admits a topological order, numbered 1st through 7th.
Computation Graphs are our friends
4-For each variable in topological order, run the forward method of all operations that link to them (Forward).
Forward pass: o = wx = 10 → y = o + b = 12 → d = y - ŷ = -4 → c = d² = 16 → C = c = 16.
Computation Graphs are our friends
5-Set gradients to final variables: the cost C is the final variable, so its gradient is set to 1 and C holds (16, 1).
Computation Graphs are our friends
6-Run the operations' backward methods in reverse order (Backward).
Id: C = c , ∂C/∂c = 1 , so dc = dC·∂C/∂c = 1 and c holds (16, 1).
Computation Graphs are our friends
Power 2: c = d² , ∂c/∂d = 2d = 2 × (-4) = -8 , so dd = dc·∂c/∂d = -8 and d holds (-4, -8).
Computation Graphs are our friends
Sub: d = y - ŷ , ∂d/∂y = 1 , so dy = dd·∂d/∂y = -8 and y holds (12, -8).
Computation Graphs are our friends
Add: y = o + b , ∂y/∂o = 1 , so do = dy·∂y/∂o = -8 and o holds (10, -8).
Computation Graphs are our friends
Likewise ∂y/∂b = 1 , so db = dy·∂y/∂b = -8. Unrolling the chain shows this is the full gradient:
b_{t+1} = b - dy·∂y/∂b = b - [∂C/∂c · ∂c/∂d · ∂d/∂y · ∂y/∂b] = b - ∂C/∂b
Computation Graphs are our friends
Product: o = wx , ∂o/∂w = x = 5 , so dw = do·∂o/∂w = -40 and the update is w_{t+1} = w - do·∂o/∂w.
Computation Graphs are our friends
7-Update parameters: after the updates, b goes from 2 to 2.2 and w goes from 2 to 2.8.
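The whole worked example (x = 5, ŷ = 16, w = 2, b = 2) fits in a few lines; a sketch of the forward and backward passes done by hand:

```python
x, y_hat = 5.0, 16.0
w, b = 2.0, 2.0

# forward pass, in topological order
o = w * x          # 10
y = o + b          # 12
d = y - y_hat      # -4
c = d ** 2         # 16
C = c              # 16

# backward pass, in reverse order
dC = 1.0           # gradient of the final variable
dc = dC * 1.0      # Id:      ∂C/∂c = 1
dd = dc * 2 * d    # Power 2: ∂c/∂d = 2d
dy = dd * 1.0      # Sub:     ∂d/∂y = 1
do = dy * 1.0      # Add:     ∂y/∂o = 1
db = dy * 1.0      # Add:     ∂y/∂b = 1
dw = do * x        # Product: ∂o/∂w = x

print(C, db, dw)   # → 16.0 -8.0 -40.0
```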
Computation Graphs are our friends
Existing Tools:
-Tensorflow ( https://www.tensorflow.org )
-Torch ( https://github.com/torch/nn )
-CNN ( https://github.com/clab/cnn )
-JNN ( https://github.com/wlin12/JNN )
-Theano ( http://deeplearning.net/software/theano/ )
Deep Neural Networks are our friends?
Convolutional Neural Network
Deep Neural Networks are our friends?
Convolutional Neural Network

x1 x2 x3 x4

x5 x6 x7 x8

x9 x10 x11 x12

x13 x14 x15 x16

4x4 image
Deep Neural Networks are our friends?
Convolutional Neural Network

A 3x3 filter with weights w1..w9 slides over the 4x4 image; applied to the top-left 3x3 window (x1 … x11) it produces z1.
Deep Neural Networks are our friends?
Convolutional Neural Network

Shifting the filter one position gives the next 3x3 window (x2 … x12), producing z2.
Deep Neural Networks are our friends?
Convolutional Neural Network

Sliding the filter over the whole 4x4 image yields a 2x2 feature map:
z1 z2
z3 z4
Deep Neural Networks are our friends?
Convolutional Neural Network

The feature map values z1, z2, z3, z4 are fed to an output unit y that answers the question: Is this a cat?
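The sliding-filter computation can be sketched directly (made-up image values and an all-ones filter, not learned weights):

```python
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)  # stand-in for x1..x16
filt  = np.ones((3, 3))                           # stand-in for w1..w9

out = np.zeros((2, 2))                            # z1..z4
for i in range(2):
    for j in range(2):
        # one 3x3 window, multiplied elementwise by the filter and summed
        out[i, j] = np.sum(image[i:i+3, j:j+3] * filt)

print(out.shape)  # → (2, 2)
```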
