Lecture 1b - Deep Neural Networks Are Our Friends

Deep Neural Networks
Are Our Friends
Wang Ling
Outline
● Part I - Neural Networks are our friends
○ Numbers are our friends
○ Variables are our friends
○ Operators are our friends
○ Functions are our friends
○ Parameters are our friends
○ Cost Functions are our friends
○ Optimizers are our friends
○ Gradients are our friends
● Part II - Into Deep Learning
○ Nonlinear Neural Models
○ Multilayer Perceptrons
○ Using Discrete Variables
○ Example Applications
Numbers are our friends
Abby
How many apples

does Abby have?
Numbers are our friends
Abby
4
Variables are our friends
Abby Bert
4 5
Variables are our friends
Abby Bert
4x 5y
Operators are our friends
Bert
1
4
If Abby has 4 apples,

and gives Bert 1 apple,
how many apples will
Abby have?
Operators are our friends
Bert
4x - 1x = 3x
3 1
Functions are our friends
If you give me 1 apple I will give you 3 bananas
1
4
?
5
y = 3x
● Input, x - Number of Apples given by Abby
● Output, y - Number of Bananas received by Abby
1
4
?
5
y = 3x , x =1
1
4
3
5
y = 3x , x =1
y=3
y = 3x
y : Spanish Sentence
x : English Sentence
y : Move
x : Board
y : Category
x : Image
y : Move
x : Board
??????????????????????????
y = 3x
Cookie Monster
y = ?? y = 3x
Find it out for yourself
y = ??
1
0
y = ??
1
0
5
16
y = ??
1
0
5
16
6
20
I want to know how many bananas I get,
but I ran out of apples....
y = ??
1
0
5
16
6
20
3
?
Parameters are our friends
● Input
● Output
y = 3x + 1
● Input
Model ● Output
y = wx + b ● Parameters
Input - Fixed, comes from data

Parameters - Need to be estimated
y = wx + b
1
0
5
16
6
20
3
?
y = wx + b
Data
1
0
5
16
6
20
3
?
y = wx + b
Data
x ŷ
1 0
5 16
6 20
3
?
Data Model
x ŷ y = wx + b
1 0
5 16
6 20
Data Model
x ŷ y = wx + b
1 0
5 16
6 20
How to find the parameters w and b?

Model Candidate 1: y = 1x + 0
x  ŷ   y
1  0   1     (1 = 1*1 + 0)
5  16  5     (5 = 1*5 + 0)
6  20  6     (6 = 1*6 + 0)

Model Candidate 2: y = 2x + 2
x  ŷ   y
1  0   4
5  16  12
6  20  14

Which one is better ?
Cost functions are our friends
Data
n  x  ŷ
0  1  0
1  5  16
2  6  20
Model: y_n = w x_n + b
Cost (Square Loss): C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)^2

Model Candidate 1: y = 1x + 0
n  x  ŷ   y   (y-ŷ)^2
0  1  0   1   1
1  5  16  5   121
2  6  20  6   196
C(1,0) = 318

Model Candidate 2: y = 2x + 2
n  x  ŷ   y   (y-ŷ)^2
0  1  0   4   16
1  5  16  12  16
2  6  20  14  36
C(2,2) = 68
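The two cost tables can be checked with a few lines of Python. This is a minimal sketch: the names `DATA` and `square_loss` are illustrative and not from the slides, and the data targets (written ŷ on the slides) are called `target` here.

```python
# Data from the slides: (x, target) pairs; the slides write the target as ŷ.
DATA = [(1, 0), (5, 16), (6, 20)]

def square_loss(w, b, data=DATA):
    """Sum of squared differences between the model y = w*x + b and the targets."""
    total = 0
    for x, target in data:
        prediction = w * x + b
        total += (prediction - target) ** 2
    return total

# Compare the two candidate models from the slides.
for name, (w, b) in {"Candidate 1": (1, 0), "Candidate 2": (2, 2)}.items():
    print(name, "C =", square_loss(w, b))
```

The lower cost of Candidate 2 is what makes it the better of the two.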
How to find the parameters w and b?
Optimizers are our friends
Data
n  x  ŷ
0  1  0
1  5  16
2  6  20
Model: y_n = w x_n + b
Cost: C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)^2
Optimizer: arg min C(w,b) over w,b∈[-∞,∞]
Optimizer: arg min C(w,b) over w,b∈[-∞,∞], with y = wx + b
[Plot: parameter space with axes w and b; the point w = 2, b = 2 is marked with cost 68]

w0,b0 = 2,2 : C(w0,b0) = 68

w1,b1 = 3,2 : C(w1,b1) = 26
n  x  ŷ   y   (y-ŷ)^2
0  1  0   5   25
1  5  16  17  1
2  6  20  20  0
C(3,2) = 26

w2,b2 = 4,2 : C(w2,b2) = 108 (worse than 26, so go back to w1,b1 = 3,2)
n  x  ŷ   y   (y-ŷ)^2
0  1  0   6   36
1  5  16  22  36
2  6  20  26  36
C(4,2) = 108

w2,b2 = 3,3 : C(w2,b2) = 41 (worse than 26, so go back to w1,b1 = 3,2)
n  x  ŷ   y   (y-ŷ)^2
0  1  0   6   36
1  5  16  18  4
2  6  20  21  1
C(3,3) = 41

w2,b2 = 3,1 : C(w2,b2) = 17
n  x  ŷ   y   (y-ŷ)^2
0  1  0   4   16
1  5  16  16  0
2  6  20  19  1
C(3,1) = 17

w3,b3 = 3,0 : C(w3,b3) = 14
n  x  ŷ   y   (y-ŷ)^2
0  1  0   3   9
1  5  16  15  1
2  6  20  18  4
C(3,0) = 14

w4,b4 = 3,-1 : C(w4,b4) = 17 (worse than 14)
n  x  ŷ   y   (y-ŷ)^2
0  1  0   2   4
1  5  16  14  4
2  6  20  17  9
C(3,-1) = 17

w4,b4 = 2,0 : C(w4,b4) = 104 (worse than 14)
n  x  ŷ   y   (y-ŷ)^2
0  1  0   2   4
1  5  16  10  36
2  6  20  12  64
C(2,0) = 104

w4,b4 = 4,0 : C(w4,b4) = 48 (worse than 14)
n  x  ŷ   y   (y-ŷ)^2
0  1  0   4   16
1  5  16  20  16
2  6  20  24  16
C(4,0) = 48

w3,b3 = 3,0 : C(w3,b3) = 14
The End?

w?,b? = 4,-2 : C(w?,b?) = 12
n  x  ŷ   y   (y-ŷ)^2
0  1  0   2   4
1  5  16  18  4
2  6  20  22  4
C(4,-2) = 12

w3,b3 = 3,0 : C(w3,b3) = 14
Search Problem

w4,b4 = 3.01,0 : C(w4,b4) = 13.73
n  x  ŷ   y      (y-ŷ)^2
0  1  0   3.01   9.06
1  5  16  15.05  0.90
2  6  20  18.06  3.76
C(3.01,0) = 13.73

w*,b* = 4,-2 : C(w*,b*) = 12

Large Step Size
-Worse minimum
-But gets there faster
Vs
Small Step Size
-Better Minimum
-But gets there slowly
[Figures: five panels, each labeled "Step Size"]

w*,b* = 4,-2 : C(w*,b*) = 12
w*,b* = 4,-4 : C(w*,b*) = 0
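The search walked through above can be sketched as a greedy neighbour search. This is an illustrative reading of the slides rather than the lecturer's exact procedure: the function name `neighbour_search` and the rule "move to the best of the four neighbours" are assumptions.

```python
# Data from the slides: (x, target) pairs; the slides write the target as ŷ.
DATA = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    """Square loss C(w, b) of the model y = w*x + b on DATA."""
    return sum((w * x + b - target) ** 2 for x, target in DATA)

def neighbour_search(w, b, step):
    """Move to the best of the four neighbours until none of them improves the cost."""
    path = [(w, b, cost(w, b))]
    while True:
        candidates = [(w + step, b), (w - step, b), (w, b + step), (w, b - step)]
        best = min(candidates, key=lambda p: cost(*p))
        if cost(*best) >= cost(w, b):
            return path  # stuck: every neighbour is at least as costly
        w, b = best
        path.append((w, b, cost(w, b)))

for w, b, c in neighbour_search(2, 2, step=1):
    print(f"w={w}, b={b}, C={c}")
```

With a step size of 1 the search stops at w = 3, b = 0, even though (4,-2) and (4,-4) have lower cost. Neither can be reached by a single step along one axis, which is the search problem the slides point at.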
y = wx + b
Data
x ŷ
1 0
5 16
6 20
3
?
y = 4x - 4
Data
x ŷ
1 0
5 16
6 20
3
?
y = 4x - 4
Data
x ŷ
1 0
5 16
6 20
3
8
y = wx + b
y : Is this a cat
x : Image
pixel (1,1)
pixel(1,3)
High
if cat
y = w1x1 + w2x2 + w3x3 + w4x4 + b
y : Is this a cat
x : Image
Millions of parameters Millions of samples

Gradients are our friends
Optimizer: arg min C(w,b) over w,b∈[-∞,∞], with y = wx + b
Very expensive to compute (hours or days)
Should be used sparingly

w0,b0 = 2,2 : C(w0,b0) = 68
hw = 1
C(w0+hw,b0) = C(3,2) = 26
r = (C(w0+hw,b0) - C(w0,b0)) / hw
r = (C(3,2) - C(2,2)) / 1 = -42

hw = 1, r = -42
hw = 0.1, r = -98
hw = 0.01, r = -103
hw = 0.001, r = -104
hw → 0, r = ∂C/∂w (w0,b0)

∂C/∂w = ∂∑_n (y_n - ŷ_n)^2 / ∂w = ∑_n 2(y_n - ŷ_n) x_n

w0,b0 = 2,2 : C(w0,b0) = 68
n  x  ŷ   y   (y-ŷ)  2(y-ŷ)x
0  1  0   4   4      8
1  5  16  12  -4     -40
2  6  20  14  -6     -72
hw → 0, r = ∂C/∂w (w0,b0) = -104
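The finite-difference rates and the analytic gradient can be compared numerically. This is a small sketch: the function names are illustrative, and the data targets (ŷ on the slides) are called `t`.

```python
# Data from the slides: (x, target) pairs; the slides write the target as ŷ.
DATA = [(1, 0), (5, 16), (6, 20)]

def cost(w, b):
    """Square loss of the model y = w*x + b."""
    return sum((w * x + b - t) ** 2 for x, t in DATA)

def finite_difference_w(w, b, h):
    """Rate of change of the cost when w moves by h (the r on the slides)."""
    return (cost(w + h, b) - cost(w, b)) / h

def gradient(w, b):
    """Analytic partial derivatives (dC/dw, dC/db) of the square loss."""
    dw = sum(2 * (w * x + b - t) * x for x, t in DATA)
    db = sum(2 * (w * x + b - t) for x, t in DATA)
    return dw, db

for h in (1, 0.1, 0.01, 0.001):
    print(f"hw={h}: r={finite_difference_w(2, 2, h):.2f}")
print("analytic (dC/dw, dC/db):", gradient(2, 2))
```

Each finite-difference estimate needs an extra evaluation of the cost, while the analytic gradient gives both partial derivatives in one pass over the data.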
Optimizer: arg min C(w,b) over w,b∈[-∞,∞], with y = wx + b
∂C/∂w = ∂∑_n (y_n - ŷ_n)^2 / ∂w = ∑_n 2(y_n - ŷ_n) x_n
∂C/∂b = ∂∑_n (y_n - ŷ_n)^2 / ∂b = ∑_n 2(y_n - ŷ_n)

w0,b0 = 2,2 : C(w0,b0) = 68
n  x  ŷ   y   (y-ŷ)  2(y-ŷ)
0  1  0   4   4      8
1  5  16  12  -4     -8
2  6  20  14  -6     -12
hw → 0, rw = ∂C/∂w (w0,b0) = -104
hb → 0, rb = ∂C/∂b (w0,b0) = -12

w1 = w0 - α rw
b1 = b0 - α rb
α → Learning Rate / Step size
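Repeating the update rule gives gradient descent. The sketch below assumes a learning rate of 0.01 and 5000 steps; both values are illustrative choices, not taken from the slides.

```python
# Data from the slides: (x, target) pairs; the slides write the target as ŷ.
DATA = [(1, 0), (5, 16), (6, 20)]

def gradient(w, b):
    """Partial derivatives (dC/dw, dC/db) of the square loss for y = w*x + b."""
    dw = sum(2 * (w * x + b - t) * x for x, t in DATA)
    db = sum(2 * (w * x + b - t) for x, t in DATA)
    return dw, db

def gradient_descent(w, b, learning_rate, steps):
    """Repeatedly move w and b against their gradients."""
    for _ in range(steps):
        dw, db = gradient(w, b)
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Start from the slides' initial point (2, 2).
w, b = gradient_descent(2.0, 2.0, learning_rate=0.01, steps=5000)
print(f"w={w:.4f}, b={b:.4f}")
```

From the starting point (2,2) this run approaches w = 4, b = -4, the zero-cost solution shown earlier. A much larger learning rate makes the updates overshoot on this data.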
Summary
Data
n  x  ŷ
0  1  0
1  5  16
2  6  20
Model: y_n = w x_n + b
Cost: C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)^2
Optimizer: arg min C(w,b) over w,b∈[-∞,∞]
System: y = 4x - 4
Into Deep Learning
Nonlinear Neural Models
y = 4x-4
Data
1
0
5
16
6
20
3
?
There is a limit of bananas I can give you
Data
1
0
5
16
6
20
3
?
Data
n  x   ŷ
0  1   0
1  5   16
2  6   20
3  9   20
4  11  20
[Plot: the data points and the line y = 4x-4]
[Plot: the data points and the line y = 2x+3]
Underfitting
Model Problem
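The line y = 2x+3 is close to the best straight line for the five points. A quick check with the closed-form least-squares solution is sketched below; using the closed form instead of an iterative optimizer is a choice made here for brevity, and the function name is illustrative.

```python
# The five data points from the slides: (x, target) pairs.
POINTS = [(1, 0), (5, 16), (6, 20), (9, 20), (11, 20)]

def least_squares_line(points):
    """Closed-form w and b minimising the square loss of y = w*x + b."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_t = sum(t for _, t in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_xt = sum(x * t for x, t in points)
    w = (n * sum_xt - sum_x * sum_t) / (n * sum_xx - sum_x ** 2)
    b = (sum_t - w * sum_x) / n
    return w, b

w, b = least_squares_line(POINTS)
print(f"w={w:.2f}, b={b:.2f}")
```

Even the best line misses several points by a wide margin, which is the underfitting the slide refers to.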
y = ???
Can we learn arbitrary functions?
y = (w1x + b1)s1 + (w2x+b2)s2

Use different linear functions
depending on the value of x?
y = (w1x + b1)s1 + (w2x+b2)s2

s1 - 1 if x < 6 and 0 otherwise
s2 - 1 if x >= 6 and 0 otherwise
y = (w1x + b1)s1 + (w2x+b2)s2

Data
n x ŷ
0 1 0
1 5 16
y = (4x - 4)s1 + (0x+20)s2
2 6 20
3 9 20
4 11 20
y = (w1x + b1)s1 + (w2x+b2)s2

Data
n x ŷ
0 1 0
1 5 16
y = (4x - 4)s1 + (0x+20)s2
2 6 20
3 9 20
4 11 20 ?
?
s = σ(wx + b)
σ(t) = 1 / (1 + e^(-t))
s = σ(1000x)
x = 0.1 then σ(1000x) ≈ 1
x = -0.1 then σ(1000x) ≈ 0
s = σ(1000x - 6000)
x = 6.1 then σ(1000x - 6000) ≈ 1
x = 5.9 then σ(1000x - 6000) ≈ 0
y = (w1x + b1)s1 + (w2x+b2)s2
s1 = σ(w3x + b3)
s2 = σ(w4x + b4)
Data
n  x   ŷ
0  1   0
1  5   16
2  6   20
3  9   20
4  11  20
y = (4x - 4)s1 + (0x+20)s2
s1 = σ(-1000x + 6000)
s2 = σ(1000x - 6000)

For x = 5:
y = (16)s1 + (20)s2
s1 = σ(1000) = 1
s2 = σ(-1000) = 0
y = (16)1 + (20)0
y = 16

For x = 9:
y = (32)s1 + (20)s2
s1 = σ(-3000) = 0
s2 = σ(3000) = 1
y = (32)0 + (20)1
y = 20
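The gated model can be evaluated directly. In this sketch the helper name `banana_model` is illustrative, and the sigmoid is written in a numerically stable form, because `math.exp` overflows for arguments as large as those used here.

```python
import math

def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t)), arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def banana_model(x):
    """y = (4x - 4)s1 + (0x + 20)s2 with sigmoid gates that switch at x = 6."""
    s1 = sigmoid(-1000 * x + 6000)  # close to 1 when x < 6
    s2 = sigmoid(1000 * x - 6000)   # close to 1 when x > 6
    return (4 * x - 4) * s1 + (0 * x + 20) * s2

for x in (1, 5, 6, 9, 11):
    print(x, banana_model(x))
```

At x = 6 both gates equal 0.5, and the two linear pieces happen to agree there (both give 20), so the output is still 20.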
If you give me too many apples, I will give you less
Data
1
0
5
16
6
20
3
?
Multilayer Perceptrons
Data
n  x   ŷ
0  1   0
1  5   16
2  6   20
3  9   20
4  11  20
5  15  1
6  19  1
[Plot: y against x for the data points]
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
s1 = σ(-1000x + 6000)
s2 = ????
s3 = σ(1000x - 15000)
s2 = not s1 and not s3

y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1 = σ(w4x + b4)   Layer 1 Perceptron
s2 = σ(w5s1 + w6s3 + b5)   Layer 2 Perceptron
s3 = σ(w7x + b6)   Layer 1 Perceptron

Data
n  x   ŷ
0  1   0
1  5   16
2  6   20
3  9   20
4  11  20
5  15  1
6  19  1
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
s1 = σ(-1000x + 6000)
s2 = σ(-1000s1 - 1000s3 + 500)
s3 = σ(1000x - 15000)

For x = 11:
y = (40)s1 + (20)s2 + (1)s3
s1 = σ(-5000) = 0
s3 = σ(-4000) = 0
s2 = σ(-0 - 0 + 500) = σ(500) = 1
y = (40)0 + (20)1 + (1)0
y = 20

For x = 19:
y = (72)s1 + (20)s2 + (1)s3
s1 = σ(-13000) = 0
s3 = σ(4000) = 1
s2 = σ(-0 - 1000 + 500) = σ(-500) = 0
y = (72)0 + (20)0 + (1)1
y = 1
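The two worked evaluations can be reproduced with a short function. This is a sketch using the weights from the slides: the name `mlp` is illustrative, and the stable form of the sigmoid is an implementation choice made here.

```python
import math

def sigmoid(t):
    """Logistic function, arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def mlp(x):
    """Three linear pieces gated by two layer-1 units (s1, s3) and one layer-2 unit (s2)."""
    s1 = sigmoid(-1000 * x + 6000)               # layer 1: close to 1 when x < 6
    s3 = sigmoid(1000 * x - 15000)               # layer 1: close to 1 when x > 15
    s2 = sigmoid(-1000 * s1 - 1000 * s3 + 500)   # layer 2: not s1 and not s3
    return (4 * x - 4) * s1 + (0 * x + 20) * s2 + (0 * x + 1) * s3

for x in (1, 5, 9, 11, 15, 19):
    print(x, mlp(x))
```

Exactly at the boundary x = 15 the gates s2 and s3 both equal 0.5, so these weights return 10.5 rather than the data value 1. Moving the threshold of s3 slightly below 15 would be one way to fit that point.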
Data
y
y = (4x - 4)s1 + (0x+20)s2 + (0x+1)s3
n x ŷ
0 1 0
1 5 16
2 6 20
x
3 9 20
4 11 20
5 15 1
6 19 1
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1 = σ(w4x + b4)
s2 = σ(w5s1 + w6s3 + b5)
s3 = σ(w7x + b6)
[Diagram: x feeds s1 (through w4x, plus b4) and s3 (through w7x); s1 and s3 feed s2 (through w5s1 and w6s3, plus b5)]
y = (w1x + b1)s1 + (w2x+b2)s2 + (w3x+b3)s3
s1: x < 6
s3: x > 15
s2: !(x > 15) & !(x < 6), i.e. x∈[6,15]
s4: x∈]-∞,6[ ∪ ]15,∞[

x
s s s s
x<6 1 2 x > 15 3 x>2 4 x<3
s s s s
5 6 7 7
x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3]

x Input
s s s s
x<6 1 2 x > 15 3 x>2 4 x<3 Layer 1 (Input Features)
s s s s
5 6 7 7
x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3] Layer 2 (And and Or Combinations)
x Input
s s s s
x<6 1 2 x > 15 3 x>2 4 x<3 Layer 1 (Input Features)
s s s s
5 6 7 7
x∈[6,15] x∈]-∞,6] & ]15,∞] x∈[2,15] x∈[2,3] Layer 2 (And and Or Combinations)
And(s1,s2) = σ(1000s1 + 1000s2 - 1500)

Or(s1,s2) = σ(1000s1 + 1000s2 - 500)
x Input
s s s s
1 2 3 4 Layer 1 (Input Features)
s s s s
5 6 7 7
Layer 2 (And and Or Combinations)
s s s s
8 9 a b Layer 3 (Xor Combinations)
x Input
s s s s
1 2 3 4 Layer 1 (Input Features)
s s s s
5 6 7 7
Layer 2 (And and Or Combinations)
s s s s
8 9 a b Layer 3 (Xor Combinations)
Xor(s1,s2) = Or(And(s1,!s2), And(!s1,s2))
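The gate constructions can be tried out on 0/1 inputs. In this sketch `and_gate` and `or_gate` use the weights from the slides, while `not_gate` (written `!` on the slides) is an assumed construction, σ(-1000s + 500), since the slides do not give weights for it.

```python
import math
from itertools import product

def sigmoid(t):
    """Logistic function, arranged to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def and_gate(a, b):
    return sigmoid(1000 * a + 1000 * b - 1500)

def or_gate(a, b):
    return sigmoid(1000 * a + 1000 * b - 500)

def not_gate(a):
    # Assumption: the slides write "!" without giving weights for it.
    return sigmoid(-1000 * a + 500)

def xor_gate(a, b):
    """Xor built from And, Or and Not, as in the slide's formula."""
    return or_gate(and_gate(a, not_gate(b)), and_gate(not_gate(a), b))

for a, b in product((0, 1), repeat=2):
    print(a, b, round(and_gate(a, b)), round(or_gate(a, b)), round(xor_gate(a, b)))
```

Xor needs the outputs of And gates as inputs to an Or gate, which is why the slides place it one layer deeper than And and Or.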

Data
n  x   ŷ
0  1   0
1  5   16
2  6   20
3  9   20
4  11  20
5  15  1
6  19  1
[Plot: y against x for the data points]
Universal approximator
but...
No guarantee that the best function will be found
n  x  ŷ
0  1  0
1  5  16
2  6  20
x
s1: x>1   s2: x<2   s3: x<5   s4: x<6
s5: x∈]-∞,1]   s6: x∈[5,6[   s7: x∈[6,∞]
y = 0s5 + 16s6 + 20s7
Overfitting
Model Problem

[Plot: Task Complexity against Model Complexity, with three regions: Underfitting, Happy Zone, Overfitting]
Model Complexity axis: Linear Regression, Linear Regression with more features, MLP 1 Layer, MLP 2 Layer, MLP 3 Layer
Task Complexity axis: Sentiment analysis, Machine Translation
[Plot: the same axes and regions, with an additional label "Data"]
n  x  ŷ
0  1  0
1  5  16
2  6  20
[Plots: three panels, each with vertical axis y]

n  x  ŷ
0  1  0
1  5  16
2  6  20
3  2  4
[Plots: two panels, each with vertical axis y]
[Plot: Task Complexity against Model Complexity, with Happy Zone and Overfitting regions and the label "Model Bias"]
Model Bias
-L1 & L2 Regularization
-Stochastic Dropout (Srivastava et al, 2014)
-Model Structure (CNN, RNNs)
Regularization
C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)^2 + (w+b)ß
ß = Regularization constant
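The regularized cost can be written down as code. In this sketch the first function follows the formula exactly as it appears on the slide. The second is the common L2 form, added here as an assumption for comparison and not taken from the slides.

```python
# Data from the slides: (x, target) pairs; the slides write the target as ŷ.
DATA = [(1, 0), (5, 16), (6, 20)]

def square_loss(w, b):
    """Unregularized cost of the model y = w*x + b."""
    return sum((w * x + b - t) ** 2 for x, t in DATA)

def regularized_cost(w, b, beta):
    """Cost as written on the slide: square loss plus (w + b) * beta."""
    return square_loss(w, b) + (w + b) * beta

def l2_regularized_cost(w, b, beta):
    """Common L2 variant (an assumption here): penalise w^2 + b^2 instead."""
    return square_loss(w, b) + beta * (w * w + b * b)

print(regularized_cost(2, 2, beta=0.5))
print(l2_regularized_cost(4, -4, beta=0.5))
```

The L2 form penalises parameters by magnitude regardless of sign, so even the zero-loss solution (4,-4) pays a penalty.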
Regularization
x
s1: x>1   s2: x<2   s3: x<5   s4: x<6
s5: x∈]-∞,1]   s6: x∈[5,6[   s7: x∈[6,∞]
y

Regularization
x
s1: x>1   s2: nothing   s3: nothing   s4: x<6
s5: x∈]-∞,1]   s6: nothing   s7: x∈[6,∞]
y
Find solutions that require less effort
Using Discrete Variables
Data
1
0
5
16
6
20
3
?
Data
1
0
5
16
6
20
3
?
Data
1
0
5
16
6
20
3
? ?
Number of fruit to offer
x
s s s s
1 2 3 4
s s s
5 6 7
y Number of fruit received

Number of fruit to offer
x
s1
s2
y Number of fruit received

Type of fruit to offer Number of fruit to offer
u x
s1
s2
Type of fruit received v y Number of fruit received

u x
u∈{Apple, Banana, Coconut}

s1
s2
v∈{Apple, Banana, Coconut}

Lookup Tables
u∈{Apple, Banana, Coconut}, V=3
Embedding for u, Size = 4
              e1    e2    e3    e4
0  Apple      0.1  -0.4   0.2   0.5
1  Banana     0.4   1.4  -1.0   0.1
2  Coconut    1.1   0.9   1.1   0.5
u = Banana (index 1)
Lookup → Embedding for u: 0.4  1.4  -1.0  0.1
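A lookup table is just a row selection. The sketch below uses the numbers from the slide; the names `VOCAB`, `EMBEDDINGS` and `lookup` are illustrative.

```python
# Index of each discrete value (V = 3), as on the slide.
VOCAB = {"Apple": 0, "Banana": 1, "Coconut": 2}

# One row per value, one column per embedding dimension (size 4).
EMBEDDINGS = [
    [0.1, -0.4, 0.2, 0.5],   # Apple
    [0.4, 1.4, -1.0, 0.1],   # Banana
    [1.1, 0.9, 1.1, 0.5],    # Coconut
]

def lookup(u):
    """Return the embedding row for the discrete value u."""
    return EMBEDDINGS[VOCAB[u]]

print(lookup("Banana"))
```

In a trained model the rows of the table are parameters, updated by the optimizer like w and b.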

u x
Lookup
eu

s1
s2

Softmax
V=3
Input Vector (s1 s2 s3 s4), Size = 4
Logits (d1 d2 d3), Size = V
      Apple  Banana  Coconut
w1    0.1    -0.4    0.2
w2    0.4    1.4     -1.0
w3    1.1    0.9     1.1
w4    1.3    0.1     0.4
Logits: d1 = 1, d2 = -1, d3 = -2
p_i = exp(d_i) / ∑_j exp(d_j)
p1 = 0.84, p2 = 0.11, p3 = 0.04
Output: Apple
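The probabilities can be recomputed from the logits. In this sketch, subtracting the largest logit before exponentiating is a standard stability trick added here. It does not change the result.

```python
import math

LABELS = ["Apple", "Banana", "Coconut"]

def softmax(logits):
    """p_i = exp(d_i) / sum_j exp(d_j), computed after shifting by the largest logit."""
    largest = max(logits)
    exps = [math.exp(d - largest) for d in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1, -1, -2])
best = LABELS[probs.index(max(probs))]
print([round(p, 2) for p in probs], best)
```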
u x
Lookup
eu

s1
s2
Softmax
u x
Lookup
eu

s1
s2
Softmax
Summary
Input: Continuous - values, Sparse - (embeddings)
MLP
Output: Continuous - linear, Sparse - softmax
Example Applications
Embedding Pretraining (Collobert et al, 2011)
Abby likes to eat apples and bananas

Predict
Context
e-4 e-3 e-2 e-1
s1
Softmax
s2
eat
[Diagram: embeddings e_drink, e_eat and e_build, compared by Cosine similarity]
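Cosine similarity between two embeddings is the dot product divided by the product of the vector lengths. The vectors in this sketch are made-up values for illustration, not trained embeddings from the lecture.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings, chosen only to illustrate the comparison.
e_eat = [0.9, 0.1, 0.3]
e_drink = [0.8, 0.2, 0.4]
e_build = [-0.2, 0.9, -0.5]

print("eat vs drink:", cosine_similarity(e_eat, e_drink))
print("eat vs build:", cosine_similarity(e_eat, e_build))
```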
Window-based Tagging (Collobert et al, 2011)
NNP VBZ TO VB NNS CC NNS

e-2 e-1 e-0 e1 e2

e-2 e-1 e-0 e1 e2 Word Embeddings
s1 Non-Linear Layer 1
VB Softmax
VB Softmax
Translation Rescoring (Devlin et al, 2014)
Translation 1 John does to eat coconuts and bananas
Translation 2 Abby likes to eat apples and bananas
Translation 3 Abby dislikes to drink apples and bananas
Source Abby gosta de comer macas e bananas


<s>
0.2

0.2 0.1

0.2 0.1 0.3
Abby likes to eat apples and bananas 0.000378

0.2 0.1 0.3 0.5 0.7 0.4 0.2
John does to eat coconuts and bananas 0.00003
Abby dislikes to drink apples and bananas 0.00012

John does to eat coconuts and bananas 0.00003
Abby dislikes to drink apples and bananas 0.00012

Predict
Context
Translation
Source
Abby gosta de comer macas e bananas
Predict
Context
Translation
Source
Abby gosta de comer macas e bananas
Translation
e-4 e-3 e-2 e-1

f-1
s1
macas
s2
Translation Score (BLEU) Arabic - English Chinese - English
Best Rescored System 52.8 34.7
1st OpenMT12 49.5 32.6
Hierarchical 43.4 30.1

Computation Graphs are our friends
C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)^2,   y = wx + b
∂C/∂w = ∂∑_n (ŷ_n - y_n)^2 / ∂w = ∑_n -2(ŷ_n - y_n) x_n
Easy!
∂C/∂b = ∂∑_n (ŷ_n - y_n)^2 / ∂b = ∑_n -2(ŷ_n - y_n)
y = w1x + b1 + tanh(w2x + b2)^2
Harder!
Computation Graphs can compute gradients for you!
C(w,b) = ∑_{n∈{0,1,2}} (y_n - ŷ_n)^2,   y = wx + b
∂C/∂w = ∑_n (∂(ŷ_n - y_n)^2/∂y_n)(∂y_n/∂w) = ∑_n -2(ŷ_n - y_n) x_n
∂C/∂b = ∑_n (∂(ŷ_n - y_n)^2/∂y_n)(∂y_n/∂b) = ∑_n -2(ŷ_n - y_n)

C(w,b) = ∑_{n∈{0,1,2}} c_n
c = d^2       Power 2
d = y - ŷ     Sub
y = o + b     Add
o = wx        Product

∂C/∂w = ∑_n (∂c_n/∂d_n)(∂d_n/∂y_n)(∂y_n/∂o_n)(∂o_n/∂w)
∂C/∂b = ∑_n (∂c_n/∂d_n)(∂d_n/∂y_n)(∂y_n/∂b)
d = y - ŷ     Sub
forward(x,y) → z
backward(x,y,dz) → dx,dy
forward(x,y) : return x - y
backward(x,y,dz) : return dz * 1, dz * -1
∂d_n/∂y_n = 1, ∂d_n/∂ŷ_n = -1
Computation graph for C(w,b) = ∑_n c_n with n∈{0}:
o = Product(w, x)
y = Add(o, b)
d = Sub(y, ŷ)
c = Power 2(d)
C = Id(c)
Input (x, ŷ): No Input Edges, External
Parameters (w, b): No Input Edges, Internal
Example values: w = 2, x = 5, b = 2, ŷ = 16
Variables: o, y, d, c, C
Each variable holds 2 values: x and dx (its value and its gradient)

1-Initialize inputs
2-Initialize variables: o = 0,0   y = 0,0   d = 0,0   c = 0,0   C = 0,0
3-Topological Sort variables: 1st o, 2nd y, 3rd d, 4th c, 5th C
  In the wrong order (1st y, 2nd o), Add runs before Product and gives y = 2,0 instead of y = 12,0
  [Diagram: the same sort applied to a larger graph with two extra variables, s and g, giving seven positions from 1st to 7th]
4-For each variable in topological order, run the forward method of all operations that link to them (Forward)
  o = 10,0   y = 12,0   d = -4,0   c = 16,0   C = 16,0
5-Set gradients to final variables: C = 16,1
6-run the operations backward method in reverse order (Backward)
  C = c       ∂C/∂c = 1                     dc = dC ∂C/∂c = 1      c = 16,1
  c = d^2     ∂c/∂d = 2d = 2 x -4 = -8      dd = dc ∂c/∂d = -8     d = -4,-8
  d = y - ŷ   ∂d/∂y = 1                     dy = dd ∂d/∂y = -8     y = 12,-8
  y = o + b   ∂y/∂o = 1, ∂y/∂b = 1          do = dy ∂y/∂o = -8     o = 10,-8
  o = wx      ∂o/∂w = x
7-update parameters (α = learning rate / step size)
  b_{t+1} = b - α dy ∂y/∂b = b - α (∂C/∂c)(∂c/∂d)(∂d/∂y)(∂y/∂b) = b - α ∂C/∂b     b = 2.2
  w_{t+1} = w - α do ∂o/∂w                                                        w = 2.8
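The forward and backward passes above can be sketched as a tiny computation graph. This is an illustrative implementation, not the code of any of the tools listed in the lecture: the class names are made up, the final Id node (C = c) is folded into c, and the learning rate 0.02 is an assumption chosen so that w moves from 2 to 2.8 as on the slide.

```python
class Var:
    """A graph variable holding 2 values: its value and its gradient."""
    def __init__(self, value=0.0):
        self.value = value
        self.grad = 0.0

class Op:
    """Base class: an operation reads variables a (and b) and writes out."""
    def __init__(self, a, b, out):
        self.a, self.b, self.out = a, b, out

class Product(Op):
    def forward(self):
        self.out.value = self.a.value * self.b.value
    def backward(self):
        self.a.grad += self.out.grad * self.b.value
        self.b.grad += self.out.grad * self.a.value

class Add(Op):
    def forward(self):
        self.out.value = self.a.value + self.b.value
    def backward(self):
        self.a.grad += self.out.grad
        self.b.grad += self.out.grad

class Sub(Op):
    def forward(self):
        self.out.value = self.a.value - self.b.value
    def backward(self):
        self.a.grad += self.out.grad
        self.b.grad -= self.out.grad

class Power2(Op):
    def __init__(self, a, out):
        super().__init__(a, None, out)
    def forward(self):
        self.out.value = self.a.value ** 2
    def backward(self):
        self.a.grad += self.out.grad * 2 * self.a.value

# 1-2: initialize inputs, parameters and internal variables.
w, x, b, target = Var(2.0), Var(5.0), Var(2.0), Var(16.0)
o, y, d, c = Var(), Var(), Var(), Var()

# 3: operations listed in topological order of their outputs (o, y, d, c).
ops = [Product(w, x, o), Add(o, b, y), Sub(y, target, d), Power2(d, c)]

# 4: forward pass.
for op in ops:
    op.forward()

# 5-6: set the gradient of the final variable, then run backward in reverse order.
c.grad = 1.0
for op in reversed(ops):
    op.backward()

# 7: update the parameters (learning rate is an assumed value).
learning_rate = 0.02
w.value -= learning_rate * w.grad
b.value -= learning_rate * b.grad
print(f"c={c.value}, dw={w.grad}, db={b.grad}, w={w.value:.2f}, b={b.value:.2f}")
```

With this learning rate b becomes 2.16, which matches the slide's 2.2 only after rounding.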
Existing Tools:
-Tensorflow ( https://www.tensorflow.org )
-Torch ( https://github.com/torch/nn )
-CNN ( https://github.com/clab/cnn )
-JNN ( https://github.com/wlin12/JNN )
-Theano ( http://deeplearning.net/software/theano/ )
Deep Neural Networks are our friends?
Convolutional Neural Network
4x4 image
x1  x2  x3  x4
x5  x6  x7  x8
x9  x10 x11 x12
x13 x14 x15 x16
z1: weights w1 ... w9 applied to inputs x1, x2, ..., x11
z2: weights w1 ... w9 applied to inputs x2, x3, ..., x12
z1 z2
z3 z4
y (Is this a cat?) is computed from z1, z2, z3, z4
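Sliding one 3x3 set of weights over the 4x4 image gives the four values z1 to z4. In this sketch the pixel values 1 to 16 and the all-ones weights are placeholders chosen for illustration, and no nonlinearity or bias is applied.

```python
def convolve_valid(image, kernel):
    """Slide a square kernel over a square image without padding."""
    k = len(kernel)
    out_size = len(image) - k + 1
    output = []
    for row in range(out_size):
        out_row = []
        for col in range(out_size):
            total = 0
            # Weighted sum of the k x k window whose top-left corner is (row, col).
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[row + i][col + j]
            out_row.append(total)
        output.append(out_row)
    return output

# Placeholder pixel values: position x1 holds 1, x2 holds 2, ..., x16 holds 16.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
# Placeholder weights w1 ... w9, shared by every window.
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

z = convolve_valid(image, kernel)
print(z)  # [[z1, z2], [z3, z4]]
```

All four outputs reuse the same nine weights, so the layer has 9 parameters instead of the 64 that four independent 16-input units would need.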
