Linear programming
- Linear objective function
- Linear constraints
- Form: Min cᵀx subject to Ax ≤ b

Method of Lagrange Multipliers
- Example: Min x² + y² subject to x² + 8xy + 7y² = 225
- Define Lagrangian: L(x, y, λ) = x² + y² − λ(x² + 8xy + 7y² − 225)
- At optimality:
    ∂L/∂x = 2x − λ(2x + 8y) = 0    (I)
    ∂L/∂y = 2y − λ(8x + 14y) = 0   (II)
- For a nontrivial solution (x, y) ≠ (0, 0), the determinant of the linear system must vanish:
    9λ² + 8λ − 1 = 0
    λ = 1/9, λ = −1
- λ = −1:
    Substitute into (I): x = −2y
    Substitute into constraint: −5y² = 225
    No solution
- λ = 1/9:
    Substitute into (I): y = 2x
    Substitute into constraint: x² = 5, y² = 20
    Min squared distance = x² + y² = 25
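The worked example above can be sanity-checked numerically. A minimal sketch with scipy, assuming the constraint x² + 8xy + 7y² = 225 recovered from the worked numbers (9λ² + 8λ − 1 = 0, x² = 5, y² = 20):

```python
# Numerical check of the Lagrange example: minimize x^2 + y^2 subject to
# x^2 + 8xy + 7y^2 = 225 (constraint reconstructed from the notes above).
from scipy.optimize import minimize

objective = lambda v: v[0]**2 + v[1]**2
constraint = {"type": "eq",
              "fun": lambda v: v[0]**2 + 8*v[0]*v[1] + 7*v[1]**2 - 225}

# start near the analytic optimum (sqrt(5), 2*sqrt(5)) ~ (2.24, 4.47)
res = minimize(objective, x0=[2.0, 4.0], constraints=[constraint],
               method="SLSQP")
print(res.x, res.fun)  # expect x^2 + y^2 close to 25
```

The solver should land on (±√5, ±2√5) with objective value 25, matching the λ = 1/9 branch.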
Optimal hyperplane
- NOTE on the dual problem:
    Max Σᵢαᵢ − ½ΣᵢΣⱼαᵢαⱼyᵢyⱼ(xᵢ·xⱼ): quadratic programming (quadratic in the αᵢαⱼ terms)
    Σᵢαᵢyᵢ = 0, αᵢ ≥ 0: linear constraints
    Can be solved numerically by any general-purpose optimization package
- Idea:
    Achieve global optimality (convex QP, single global optimum)
    Max margin
- Margin:
    Let the margin hyperplanes be w·x + b = +1 and w·x + b = −1
    Find x⁺ and x⁻ that belong to the (+) & (−) class, respectively, and lie on the margins
    Margin = magnitude of the projection of (x⁺ − x⁻) on w/||w|| = 2/||w||
    Support vector: training example lying on a margin
- Find b: from w·x⁺ + b = +1 and w·x⁻ + b = −1, b = −(w·x⁺ + w·x⁻)/2
- Objective: Min ½||w||² subject to yᵢ(w·xᵢ + b) ≥ 1 ∀i (constrained optimization problem)
- It can be shown that: w = Σᵢαᵢyᵢxᵢ
- αᵢ > 0: xᵢ is a SUPPORT vector:
    Lies on a margin
    Contributes to the solution
- αᵢ = 0: xᵢ does
    NOT contribute to the solution
    If removed/moved, the solution does NOT change
- Perform testing (given new x, check which class (+/−) x belongs to):
    Check sign(w·x + b)
    Or: sign(Σᵢαᵢyᵢ(xᵢ·x) + b)
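The claim that only the αᵢ > 0 points matter can be demonstrated directly. A sketch assuming scikit-learn is available: fit a (near) hard-margin linear SVM on separable toy data, then refit on the support vectors alone and compare the hyperplanes.

```python
# Only the support vectors determine the SVM solution: refitting on the
# support vectors alone reproduces the same hyperplane.
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2) - 3, rng.randn(20, 2) + 3])  # separable blobs
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1e6).fit(X, y)            # ~hard margin
sv = clf.support_                                      # indices with alpha_i > 0
clf_sv = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv]) # refit on SVs only

print(clf.coef_, clf_sv.coef_)  # near-identical w
```

Removing any of the αᵢ = 0 points gives the same result, which is exactly the "if removed/moved, solution does NOT change" property above.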
When training data are not linearly separable
- Introduce slack variables ξᵢ ≥ 0 and penalty C: Min ½||w||² + CΣᵢξᵢ subject to yᵢ(w·xᵢ + b) ≥ 1 − ξᵢ
- NOTE:
    C constrains the αᵢ (0 ≤ αᵢ ≤ C): does not let αᵢ increase to ∞ (which may lead to no solution)
    Still quadratic programming → only 1 global min
- Idea:
    Feature transformation Φ: Rᵐ → H
    Input data only appear (in training + testing) in the form of dot products → replace xᵢ·xⱼ with K(xᵢ, xⱼ) = Φ(xᵢ)·Φ(xⱼ)
    Training: Max Σᵢαᵢ − ½ΣᵢΣⱼαᵢαⱼyᵢyⱼK(xᵢ, xⱼ)
    Testing: sign(Σᵢαᵢyᵢ K(xᵢ, x) + b)
- Example of kernels:

    Name                               Formula                             Params needing tuning
    Inhomogeneous polynomial           K(x, x') = (x·x' + 1)^d             d
    Gaussian (radial basis function)   K(x, x') = exp(−||x − x'||²/2σ²)    σ (or γ)
    Sigmoid                            K(x, x') = tanh(κ(x·x') + θ)        κ, θ (valid kernel only for certain κ, θ)

- Effect of the transformation (SVM input vs. feature space):
    In input space: decision boundary is a 2-D curve
    In feature space: decision boundary is a 3-D plane
Any algorithm depending only on dot products can use the kernel trick
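A quick sketch of the kernel trick in action, assuming scikit-learn: concentric rings are not linearly separable in input space, but the Gaussian (RBF) kernel's implicit feature map separates them.

```python
# Linearly inseparable rings: a linear kernel stays near chance,
# while the Gaussian (RBF) kernel separates the two classes.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)
print(linear_acc, rbf_acc)  # RBF accuracy far above the linear one
```

The RBF boundary in input space is the "2-D curve" mentioned above; in the implicit feature space it is a flat hyperplane.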
Gaussian Kernel
- Param tuning:
    σ:
        ↑σ → ↑ influence on the decision boundary of EVERY data point
        ↑σ → ↑ "smoothness" of the decision boundary
        σ → ∞: linear boundary
        Heuristics: choose σ on the scale of typical distances between training examples
    C:
        ↑C → ↑ importance of the goal of minimizing training error
        Trades off classification error against the "smoothness" of the decision boundary
- Process (k-fold cross-validation):
    For each hyperparam combination:
        Divide m examples into k disjoint subsets
        Each of size m/k
        (Stratifying step) Proportions of examples from each class in the subsets should be approx EQUAL

        Subset   # Class 1 examples   # Class 2 examples
        A        5                    2
        B        5                    2
        C        5                    2
        D        5                    3

        Fold   Train on   Test on   Accuracy   Error
        …      …          …         …          …
        3      ABD        C         40%        60%
        4      ABC        D         54%        46%
        Avg accuracy: 52.75%
    Choose (C, σ) with the highest avg accuracy
    How many models trained in total: |Set C| × |Set σ| × k = 7 × 10 × 4 = 280
- Useful for SMALL data sets

Learning theory
- Definitions:
    f: Target function (unknown)
    D: Target data distribution (unknown)
    h: Hypothesis (a specific SVM param set, a neural network, ...)
    S: Training set of size n (drawn from D)
- Training error: error_S(h)
- Def: Hypothesis h overfits the training data if ∃h' such that:
    On the training set: error_S(h) < error_S(h')
    Over the entire distribution: error_D(h) > error_D(h')
- Occam's Razor: Prefer SIMPLE hypotheses, because:
    There are fewer simple hypotheses than complicated ones
    A simple hypothesis that fits the data is unlikely to do so by coincidence
- Early stopping: Stop before reaching the point where the training data are perfectly classified
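The "models trained = |Set C| × |Set σ| × k" count can be reproduced with a grid search, assuming scikit-learn; the hypothetical grid here uses 3 values of C and 3 of gamma with k = 4 stratified folds.

```python
# Stratified k-fold grid search over (C, gamma):
# models trained during the search = |C set| * |gamma set| * k = 3 * 3 * 4 = 36.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=120, random_state=0)
grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
cv = StratifiedKFold(n_splits=4)  # the stratifying step above

search = GridSearchCV(SVC(kernel="rbf"), grid, cv=cv).fit(X, y)
print(search.best_params_)  # the (C, gamma) with the highest avg accuracy
```

GridSearchCV also performs one extra refit of the best combination on the full data, on top of the 36 fold-wise fits.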
Real neuron
- Cell structure: dendrites, cell body, axon, synapses
- Signal transmission:
    Transmitter released from a synapse enters the dendrite
    Impulses arriving simultaneously are added together
    If sufficiently strong: an electrical pulse is sent down the axon
    On reaching a synapse, it releases transmitter into other cells' bodies
- Properties:
    Fault-tolerant: cells die all the time with no ill effect on the brain's overall functioning

Artificial neural network
- Idea: use a complex network of simple computing elements → mimic the brain's function
- Structure:
    Units (input, hidden, output)
    Weighted links
- Learning = Updating weights
Perceptron
- Feed-forward network
- Only 1 layer of adjustable weights
- Output: o = sgn(w·x) (with x₀ = 1, w₀ = −θ)
- Algorithm:
    Randomly initialize weights + choose learning rate η
    Update weights: wᵢ ← wᵢ + η(t − o)xᵢ
- Learning capability: can only represent linearly separable functions
- Terminologies (for training example d):
    Input: x_d
    Target output: t_d
    Weight: w
    Predicted output: o_d
- Mathematics: Gradient descent
- Objective: E(w) = ½ Σ_d (t_d − o_d)²
- Gradient calculation: ∂E/∂wᵢ = −Σ_d (t_d − o_d) x_{i,d}
- Update: wᵢ ← wᵢ − η ∂E/∂wᵢ
- Termination condition: reach a preset # of iterations

Gradient Descent: Common Issues
- Overshooting:
    Problem: η too large → overstep the min
- Gradient calculation:
    Problem: sometimes the gradient can't be derived easily & precisely calculated (not the case for Adaline)
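When the analytic gradient is in doubt, a central finite difference gives an independent check. A minimal sketch for the squared-error objective of a linear (Adaline-style) unit, with made-up data:

```python
# Check the analytic gradient of E(w) = 1/2 sum_d (t_d - o_d)^2 against a
# central finite difference (E(w + eps*e_i) - E(w - eps*e_i)) / (2*eps).
import numpy as np

X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.0]])  # x0 = 1 bias trick
t = np.array([1.0, -1.0, 1.0])

def E(w):
    o = X @ w                      # linear unit output
    return 0.5 * np.sum((t - o) ** 2)

def analytic_grad(w):
    return -X.T @ (t - X @ w)      # dE/dw_i = -sum_d (t_d - o_d) x_id

w = np.array([0.3, -0.7])
eps = 1e-6
num = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps)
                for e in np.eye(len(w))])
print(np.allclose(num, analytic_grad(w), atol=1e-5))  # True
```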
    Solution: approximate the gradient numerically:
        Set ε very small
        Compute (E(w + ε) − E(w − ε)) / (2ε)
        Verify: it matches the gradient used for the update

Multi-layer perceptron
- m ≥ 2 classes:
    1 output unit per class
    Object x → Class i whose output unit is largest
- Hidden units must be nonlinear: a linear hidden unit can be duplicated by a single-layer network
Threshold unit:
- Non-differentiable
- Not suitable for gradient descent
Sigmoid unit: σ(x) = 1 / (1 + e^(−x))
- σ(x) ∈ (0, 1)
- Nice derivative: σ'(x) = σ(x)(1 − σ(x))
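The "nice derivative" identity is easy to confirm numerically:

```python
# Verify sigma'(x) = sigma(x) * (1 - sigma(x)) against a central
# finite difference at several points.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(np.allclose(numeric, analytic, atol=1e-8))  # True
```

This closed-form derivative is what makes the backpropagation updates below cheap: the derivative is computed from values already produced in the forward pass.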
Backpropagation (sigmoid units):
- For output unit j: δⱼ = oⱼ(1 − oⱼ)(tⱼ − oⱼ)
- For hidden unit j: δⱼ = oⱼ(1 − oⱼ) Σ_k δ_k w_kj
- Update weights:
    For each connection (i → j):
        wⱼᵢ ← wⱼᵢ + η δⱼ Outᵢ
Weight-Updating Methods
- Dynamically adapt η
- Exploit the error surface's higher-order info
    Efficiency: O(W²)
    W = Total number of weights
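The δ rules above give the exact gradient of E = ½(t − o)². A sketch on a tiny hypothetical 2-2-1 sigmoid network, checking the output-layer update against a finite difference:

```python
# Backprop delta for the output unit vs. a numerical derivative:
# dE/dw2_j should equal -delta_out * h_j.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([0.5, -1.0])                 # one training input
t = 1.0                                   # its target output
W1 = np.array([[0.1, -0.4], [0.3, 0.2]])  # input -> hidden weights
w2 = np.array([0.5, -0.3])                # hidden -> output weights

def forward(W1, w2):
    h = sigmoid(W1 @ x)
    o = sigmoid(w2 @ h)
    return h, o

h, o = forward(W1, w2)
delta_out = o * (1 - o) * (t - o)         # output-unit delta
delta_hid = h * (1 - h) * delta_out * w2  # hidden-unit deltas (same pattern)

eps = 1e-6
w2p = w2.copy(); w2p[0] += eps
w2m = w2.copy(); w2m[0] -= eps
num = (0.5 * (t - forward(W1, w2p)[1])**2
       - 0.5 * (t - forward(W1, w2m)[1])**2) / (2 * eps)
print(np.isclose(num, -delta_out * h[0]))  # True
```

The update wⱼᵢ ← wⱼᵢ + η δⱼ Outᵢ is then just one gradient-descent step using these deltas.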
Convolutional neural network
- Def (convolution):
    Continuous: (f * g)(t) = ∫ f(τ)g(t − τ) dτ
    Discrete: (f * g)(n) = Σ_m f(m)g(n − m)
- Feature hierarchy: each hidden unit is only connected to a local subset of units in the previous layer
- Shared weights:
    Due to the same kernel being used throughout the image
    Help detect features regardless of their positions in the image → robustness
    ↓ # params to learn
- Zero-padding:
    Convolution → representation shrinks at each layer → limits # layers
    Solution: add zeros at the lost positions
- Properties:
    Commutative: f * g = g * f
    Associative: f * (g * h) = (f * g) * h
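Both properties can be checked directly with numpy's discrete convolution; integer arrays keep the comparison exact:

```python
# Check that discrete convolution is commutative and associative.
import numpy as np

f = np.array([1, 2, 3])
g = np.array([0, 1, -1, 2])
h = np.array([2, -1])

commutative = np.array_equal(np.convolve(f, g), np.convolve(g, f))
associative = np.array_equal(np.convolve(np.convolve(f, g), h),
                             np.convolve(f, np.convolve(g, h)))
print(commutative, associative)  # True True
```

Note the shrinking mentioned under zero-padding: `np.convolve(..., mode="valid")` returns only the positions where the kernel fully overlaps the input, which is why padding is needed to preserve size across layers.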
- Advantages:
    ↓ effect of shifting, rotation → robustness
    ↓ computational burden on the next layer
Basic concepts
- Markov property: current reward & next state depend ONLY on the current state & action
- Example actions: Forward / Reverse / None
- World/Environment:

    Deterministic world                       Non-deterministic world
    Actions have certain outcomes             Actions have uncertain outcomes
    From state s, take action a               From state s, take action a, reach
    → definitely reach state s'                 - State s', with prob p'
                                                - State s'', with prob p''
    δ(s, a)                                   P(s, s', a)
    = New state reached from state s          = P(Transition s → s' | Action a)
      through action a
    r(s, a)                                   R(s, s', a)
    = Reward got by taking action a           = E(Reward of transition s → s' | Action a)
      when at state s

- Policy (deterministic): State → Action, e.g.
    State Good → Action Produce
    State Broken → Action Repair
- t: Time
- Discount factor γ < 1 (future reward not worth as much as current reward)
- Learning situation:

    Situation      Condition                Learning method
    Known          Known δ(s, a),           - Policy iteration
    environment    r(s, a)                  - Value iteration
    Unknown        Unknown δ(s, a),         - Q-learning
    environment    r(s, a)

Policy Evaluation / Policy Iteration
- Randomly initialize π(s) ∀s
- Repeat:
    Evaluate current policy: compute V^π
    Improve: π(s) ← argmax_a [r(s, a) + γ V^π(δ(s, a))]
- At optimal π*:
    Deterministic world: V*(s) = r(s, π*(s)) + γ V*(δ(s, π*(s)))
Learning in Unknown Environment: Q-Learning
(Learn to approximate the Q-function)
- Q*(s, a) = r(s, a) + γV*(δ(s, a))
           = r(s, a) + γ max_a' Q*(δ(s, a), a')
- Algorithm (deterministic world):
    Initialize Q̂(s, a) = 0 ∀s, a
    Repeat sufficiently:
        Choose current state s (might/might not be random)
        Repeat until Terminate:
            Select & execute an action a; observe reward r and new state s'
            Update: Q̂(s, a) ← r + γ max_a' Q̂(s', a')
            s ← s'
        Terminate condition:
            Reach allowed max # iterations
            Reach goal state
- Non-deterministic world:
    Reward function is non-deterministic
    Each time (s, a) is visited, a different r may be observed
    Solution: update through an average
        Q(s, a) ← (1 − α)Q(s, a) + α[r + γ max_a' Q(s', a')]
        ⇔ Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
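A minimal sketch of the deterministic update on a hypothetical 4-state chain (state 3 is the goal; moving right from state 2 pays 100, everything else pays 0):

```python
# Deterministic Q-learning update Q(s,a) <- r + gamma * max_a' Q(s',a')
# swept over a tiny chain MDP until the values converge.
GAMMA = 0.9
N = 4                                   # states 0..3; 3 is the absorbing goal
actions = ("left", "right")

def step(s, a):
    if s == 3:                          # goal: absorbing, no further reward
        return s, 0.0
    s2 = max(s - 1, 0) if a == "left" else s + 1
    return s2, 100.0 if (a == "right" and s == 2) else 0.0

Q = {(s, a): 0.0 for s in range(N) for a in actions}
for _ in range(50):                     # sweep all (s, a) pairs repeatedly
    for s in range(N):
        for a in actions:
            s2, r = step(s, a)
            Q[(s, a)] = r + GAMMA * max(Q[(s2, b)] for b in actions)

print(Q[(2, "right")], Q[(1, "right")], Q[(0, "right")])
# 100.0 90.0 81.0  -- each value is gamma times the next one plus reward
```

The converged values show the discounting at work: one step further from the reward multiplies the value by γ = 0.9.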
Q-Learning with Function Approximation
- Problem with Q-learning: when |S| is too big:
    Too many states to visit all
    Too many entries in the Q-table to hold in memory
- Solution: replace the Q-table with a function approximator
- Algorithm:
    Repeat sufficiently:
        Choose current state s
        Repeat until Terminate (one training episode):
            Select & execute an action
            Observe: immediate reward r, new state s'
            Update weights:
                Difference = r + γ max_a' Q(s', a') − Q(s, a)
                wᵢ ← wᵢ + η[Difference]fᵢ(s, a), i = 1..n
            s ← s'
- Feature-based Representation:
    Q(s, a) = w₁f₁(s, a) + … + w_n f_n(s, a)
    f_k(s, a): feature function
        Returns a real number
        Captures important properties of (s, a)
        Pacman: distance to the closest ghost, …
    Advantage: n < # states
- What might go wrong:
    Observation: the current params determine the next data sample that gets trained on
        Q is used in both evaluating (s, a) & selecting the action:
            Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
            Q(s, a): evaluation
            max_a' Q(s', a'): action selection
    Strong correlation between samples → divergence/oscillation
    Learning directly from consecutive samples is inefficient
    Weight update:
        Recall: Q(s, a) ← Q(s, a) + α[Difference]
        Difference = (r + γ max_a' Q(s', a')) − Q(s, a)
        → poor local minima / divergence
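One weight update with the feature-based representation can be sketched as follows; the feature functions here are hypothetical stand-ins for things like "distance to the closest ghost":

```python
# One linear function-approximation update:
# w_i <- w_i + eta * Difference * f_i(s, a).
import numpy as np

ETA, GAMMA = 0.1, 0.9

def features(s, a):
    # hypothetical feature functions f_k(s, a) returning real numbers
    return np.array([1.0, s * 0.1, 1.0 if a == "right" else 0.0])

w = np.zeros(3)

def Q(s, a):
    return w @ features(s, a)           # Q(s,a) = w1 f1 + ... + wn fn

# one observed transition (s, a, r, s'):
s, a, r, s2 = 1, "right", 5.0, 2
difference = r + GAMMA * max(Q(s2, b) for b in ("left", "right")) - Q(s, a)
w += ETA * difference * features(s, a)
print(w)
```

With all weights starting at zero, Difference equals the observed reward, and the update spreads it over the active features; only n weights are stored instead of one table entry per state.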
- Solutions:

    Experience replay                                Target network
    - Pool the agent's experience over many          - Every C updates: clone network Q → Q̂
      episodes into a set D                            (weights frozen) → target network
      Experience e = (s, a, r, s')                   - Next C updates: use Q̂ to generate the
    - Draw samples randomly from D                     Q-learning targets:
    - Update weights:                                  t = r + γ max_a' Q̂(s', a')
      wᵢ ← wᵢ + η[Difference]fᵢ(s, a)

    Why the solution works:
    - Randomized samples → break correlation         - Delay between the time Q is updated &
    - Each experience potentially used in many         the time the targets are affected by
      weight updates → data efficiency                 that update
    - Avoid oscillation/divergence                   - ↓ Divergence/Oscillation
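The experience-replay pool can be sketched with a bounded buffer and uniform random sampling; the capacity and batch size below are arbitrary illustration values:

```python
# Experience-replay buffer: bounded pool of (s, a, r, s') tuples with
# uniform random minibatch sampling.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        self.pool = deque(maxlen=capacity)   # oldest experiences fall out

    def add(self, s, a, r, s2):
        self.pool.append((s, a, r, s2))

    def sample(self, batch_size):
        # random draw breaks the correlation between consecutive samples
        return random.sample(list(self.pool), batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(250):                         # more transitions than capacity
    buf.add(t, "right", 0.0, t + 1)

batch = buf.sample(8)
print(len(buf.pool), len(batch))  # 100 8
```

Each stored experience can be drawn many times, which is the data-efficiency point above; the bounded deque keeps memory fixed no matter how long training runs.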