
Mathematical Background

Monday, February 20, 2017 10:01 AM

Linear programming
- Linear objective function
- Linear constraints
- Standard form: Min cᵀx subject to Ax ≤ b, x ≥ 0

Quadratic programming
- Quadratic objective function, linear constraints
- Standard form: Min ½xᵀGx + cᵀx subject to Ax ≤ b
- G: (Positive semi-definite) matrix

Method of Lagrange Multiplier
- Define Lagrangian: L(x, ν) = f(x) − Σᵢ νᵢgᵢ(x)
  νᵢ: Lagrange multiplier (dual variables)
- At optimality: ∇L = 0

Example: Find the shortest distance from the origin to the hyperbola x² + 8xy + 7y² = 225

Min x² + y²
subject to x² + 8xy + 7y² − 225 = 0

L(x, y, λ) = x² + y² − λ(x² + 8xy + 7y² − 225)

At optimality:
∂L/∂x = 2x − λ(2x + 8y) = 0 … (I)
∂L/∂y = 2y − λ(8x + 14y) = 0 … (II)

For (x, y) ≠ (0, 0), a non-trivial solution exists only if 9λ² + 8λ − 1 = 0
λ = 1/9 or λ = −1

λ = −1:
Substitute into (I): x = −2y
Substitute into constraint: −5y² = 225
No solution

λ = 1/9:
Substitute into (I): y = 2x
Substitute into constraint: x² = 5, y² = 20
Min distance² = x² + y² = 25, so min distance = 5
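A quick way to sanity-check the worked example (a minimal sketch using SymPy; the library choice is an assumption, not part of the original notes):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
L = x**2 + y**2 - lam*(x**2 + 8*x*y + 7*y**2 - 225)

# Stationarity conditions (I), (II) plus the constraint
eqs = [sp.diff(L, x), sp.diff(L, y), x**2 + 8*x*y + 7*y**2 - 225]
sols = sp.solve(eqs, [x, y, lam], dict=True)

# Only lambda = 1/9 yields real solutions; the min squared distance is 25
for s in sols:
    print(s, 'dist^2 =', sp.simplify(s[x]**2 + s[y]**2))
```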



Support Vector Machine (SVM)
Monday, February 20, 2017 9:37 AM
Classification Problem

- Given: Training set {(xᵢ, yᵢ)}
  (xᵢ, yᵢ): Training pattern
  xᵢ: Input
  yᵢ: Output (label), yᵢ ∈ {+1, −1}

- Assumption: Linearly separable
  ∃ a linear surface separating the 2 classes
  (2-D: line, 3-D: plane, n-D: hyperplane)

- Find: (w, b) that perfectly separate the 2 classes
  → Optimal hyperplane

- Idea: Max margin
  Let x₊ and x₋ be the closest (+) and (−) points to the hyperplane
  Margin = Magnitude of the projection of (x₊ − x₋) onto w/||w|| = 2/||w||
  Hyperplane should separate the 2 classes: yᵢ(wᵀxᵢ + b) ≥ 1 ∀i

- Objective: Min ½||w||² subject to yᵢ(wᵀxᵢ + b) ≥ 1 ∀i
  (Constrained optimization problem)

Finding the solution

- Primal (original problem): the constrained objective above

- Apply method of Lagrangian:
  Associate 1 Lagrange multiplier αᵢ with each constraint

- At optimality: w = Σᵢ αᵢyᵢxᵢ and Σᵢ αᵢyᵢ = 0
  Substitute these back into the primal to get the Dual

- Dual (derived problem):
  Max Σᵢ αᵢ − ½ ΣᵢΣⱼ αᵢαⱼyᵢyⱼ(xᵢᵀxⱼ) subject to αᵢ ≥ 0 ∀i, Σᵢ αᵢyᵢ = 0
  NOTE:
  αᵢαⱼ terms: Quadratic programming
  Σᵢ αᵢyᵢ = 0: Linear constraint
  Can be solved numerically by any general-purpose optimization package
  Achieves global optimality (convex problem)
  αᵢ: Determines HOW MUCH xᵢ contributes to the solution
  (αᵢ ↑ → Contribution ↑)

- Find b:
  Find x₊ and x₋ that belong to the (+) & (−) class, respectively: b = −½ wᵀ(x₊ + x₋)
  Better method: Perform an average over all Support Vectors

- It can be shown that:
  αᵢ > 0: xᵢ is a SUPPORT vector:
    Lies on a margin
    Contributes to the solution
  αᵢ = 0: xᵢ does NOT contribute to the solution
    If xᵢ is removed/moved, the solution does NOT change

- Perform testing:
  (Given a new x, check which class (+/−) it belongs to)
  Check sign(wᵀx + b)
  Or: sign(Σᵢ αᵢyᵢ(xᵢᵀx) + b)
When the training data are not linearly separable

- New objective: Separate the training data with the min # of errors


Introduce slack variable ξᵢ (ξᵢ ≥ 0): yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ

NS: # support vectors

- Primal: Penalize ξᵢ in the objective function
  Min ½||w||² + C Σᵢ ξᵢ subject to yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, ξᵢ ≥ 0
  C: Helps decide whether MARGIN or ERROR is more important in determining the solution

- Dual: Soft margin hyperplane:
  Same dual as before, but now with 0 ≤ αᵢ ≤ C
  NOTE:
  C constrains αᵢ → Does not let αᵢ increase to ∞ (which may lead to no solution)
  Still Quadratic Programming → only 1 global min
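A minimal soft-margin example (a sketch assuming scikit-learn; the toy data and the choice C = 1 are illustrative, not from the notes):

```python
import numpy as np
from sklearn.svm import SVC

# Two slightly overlapping classes in 2-D (illustrative data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# C trades off margin width against training errors (slack penalties)
clf = SVC(kernel='linear', C=1.0).fit(X, y)

print('support vectors:', clf.support_vectors_.shape[0])  # only these define the hyperplane
print('prediction:', clf.predict([[1.0, 1.0]]))            # sign of w.x + b
```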



Non-linear SVM (C-SVM) & Kernel Trick
Wednesday, March 1, 2017 11:51 AM

Non-linear SVM

- Idea: Feature transformation
  Change the space (coordinate system) of the data
  Decision boundary: Arbitrary surface → Hyperplane

- Formulation: Φ: Rᵐ → H (feature map)
  SVM input: Φ(x) instead of x

                      Original space     Feature space
  Data                x                  Φ(x)
  Decision boundary   arbitrary surface  hyperplane
  Boundary shape      2-D curve          3-D plane

- Problem with directly working on Φ: High dimensionality → Inefficient computation
  Input data: Gray-scale 16×16 images
  Use a polynomial curve of degree 5 as the decision boundary in the ORIGINAL space
  → The corresponding feature space is huge

Kernel trick

- Idea:
  Input data only appear (in training + testing) in the form of dot products
  Training: the dual only involves xᵢᵀxⱼ
  Testing: the decision function only involves xᵢᵀx
  The dot (inner) product in H can be computed in Rᵐ (without going to H)

- Kernel function k(x, y)
  Purpose:
  Returns the dot product between data points in some feature space: k(x, y) = Φ(x)ᵀΦ(y)
  Replace the dot products in the SVM algorithm with the kernel → Obtain an efficient representation
  Choose k ⇔ Choose Φ (feature map)

- Examples of kernels:

  Name                              Formula                              Params needing tuning
  Inhomogeneous polynomial          k(x, y) = (xᵀy + 1)^d                d
  Gaussian (Radial basis function)  k(x, y) = exp(−||x − y||²/(2σ²))     σ (or γ)
                                    Alternative form: exp(−γ||x − y||²)
  Sigmoid                           k(x, y) = tanh(κ xᵀy + θ)            Valid kernel only for certain κ, θ

- Any algorithm depending only on dot products can use the kernel trick
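A small numeric check of the kernel-equals-dot-product idea (a NumPy sketch; the degree-2 homogeneous polynomial kernel is used here only because its feature map is easy to write out explicitly):

```python
import numpy as np

def phi(v):
    """Explicit feature map for the degree-2 homogeneous polynomial kernel in R^2."""
    x1, x2 = v
    return np.array([x1*x1, x2*x2, np.sqrt(2)*x1*x2])

def k(u, v):
    """Kernel computed in the original space: k(u, v) = (u.v)^2."""
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, 0.5])

# Both numbers agree: the kernel returns phi(u).phi(v) without building phi explicitly.
print(k(u, v), np.dot(phi(u), phi(v)))
```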



Gaussian Kernel

- Recall: Gaussian → Bell curve
  The Gaussian kernel puts bumps of various sizes on the training set points

- Param tuning:
  σ ↑:
    Influence of EVERY data point on the decision boundary ↑
    "Smoothness" of the decision boundary ↑
    σ → ∞: Linear boundary
    Heuristics:
  C ↑:
    Importance of the goal of min error ↑
    Classification error ↓, but also "smoothness" of the decision boundary ↓

Useful SVM observation

- Support vectors are usually very few in number
  → Just need to store the support vectors when doing prediction

- High-dimensional data are MORE likely to be LINEARLY separable
  → That's why SVM is quite commonly used in text and image processing



Overfitting & k-fold (Stratified) Cross-Validation
Thursday, March 2, 2017 9:17 AM

Learning Error Measurement

- f: Target function (unknown)
- D: Target data distribution (unknown)
- h: Hypothesis
  e.g. a specific SVM param set, a neural network, …
- S: Training set of size n (drawn from D)

- Training error: errorS(h)
  Proportion of examples in the training set that h misclassifies
  Can be measured

- Testing error: errorD(h)
  Probability that h misclassifies a data instance drawn randomly from D
  Can't be measured, but what we wish to know
  Estimated through a test set

Overfitting

- Def: Hypothesis h overfits the training data if ∃ h' such that:
  On the training set: errorS(h) < errorS(h')
  Over the entire distribution: errorD(h) > errorD(h')

- Occam's Razor: Prefer a SIMPLE hypothesis, because:
  There are fewer simple hypotheses than complicated ones
  A simple hypothesis fitting the data is unlikely to be a coincidence

Over-fitting avoidance principles

- Test set drawn independently from the training set
- Do not use the test set for training
- Early stopping: Stop before reaching the point where the training data are perfectly classified

k-fold (Stratified) Cross-Validation

- Purpose: Select the "best" hypothesis from the available data
  Select the model hyperparams resulting in the smallest testing error (highest accuracy)
  For Gaussian-kernel C-SVM: the (C, σ) producing the smallest testing error

- Process:
  For each hyperparam combination:
    Divide the m examples into k disjoint subsets
      Each of size m/k
      (Stratifying step) The proportion of examples from each class in the subsets should be approx EQUAL

      Subset   # Class 1 examples   # Class 2 examples
      A        5                    2
      B        5                    2
      C        5                    2
      D        5                    3

    Run the learning process k times; each time:
      Validation set = 1 subset
      Training set = the other (k − 1) subsets
    Calculate the avg accuracy
  Choose the hyperparam combination with the highest avg accuracy
  (Now, all the data can be trained on with these hyperparams)

- Why use k-fold:
  A high proportion of the data is used for training
  Also, all the data are used in computing the error

- Example: 1000 examples: |Class 1| = 600, |Class 2| = 400
  Run C-SVM, 4-fold stratified cross-validation
  Gaussian kernel
  C ∈ {2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 2⁰, 2¹, 2²}
  σ ∈ {2⁻⁶, 2⁻⁵, 2⁻⁴, 2⁻³, 2⁻², 2⁻¹, 2², 2³, 2⁴, 2⁵}

  Subset   |Class 1|   |Class 2|
  A        150         100
  B        150         100
  C        150         100
  D        150         100

  For each (C, σ):

  Learning time   Training set   Validation set   Validation error   Accuracy
  1               BCD            A                50%                50%
  2               ACD            B                45%                55%
  3               ABD            C                40%                60%
  4               ABC            D                54%                46%
                                                  Avg accuracy       52.75%

  Choose the (C, σ) with the highest avg accuracy
  How many models are trained in total: |Set of C| × |Set of σ| × k = 7 × 10 × 4 = 280

Leave-one-out Cross-Validation

- Train on (m − 1) examples
- Validate on 1 example
- Useful for SMALL data sets
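A compact version of the grid search described above (a sketch assuming scikit-learn; the data are random placeholders, gamma plays the role of σ in scikit-learn's RBF parameterisation, and the contiguous grid is a stand-in for the exact sets in the example):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 600 + [2] * 400)          # |Class 1| = 600, |Class 2| = 400

Cs     = [2.0 ** p for p in range(-4, 3)]    # 7 values of C
gammas = [2.0 ** p for p in range(-6, 4)]    # 10 values of the RBF parameter

cv = StratifiedKFold(n_splits=4)             # 4 folds, class proportions kept approx equal
best = max(
    ((C, g, cross_val_score(SVC(kernel='rbf', C=C, gamma=g), X, y, cv=cv).mean())
     for C in Cs for g in gammas),
    key=lambda t: t[2])

print('best (C, gamma) and avg accuracy:', best)   # 7 * 10 * 4 = 280 models trained
```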



Neural Networks
Friday, March 10, 2017 5:00 PM

Real neuron

- Cell structure:
  Dendrite: Receives info from other cells
  Axon (1 per cell): Transmits info from the cell body
  Synapse: Package of chemical substances (transmitters) that influence other cells when released

- Signal transmission:
  Impulses arrive simultaneously and are added together
  Transmitter released from the synapse enters the dendrite
  If sufficiently strong: An electrical pulse is sent down the axon
  It reaches the synapses, releasing transmitter into other cells' bodies

- Properties:
  Fault-tolerant: Cells die all the time with no ill effect on the brain's overall functioning
  Graceful degradation: As conditions worsen, performance drops gradually (rather than sharply)
  Learning capability: The network can be modified → Performance ↑

Artificial Neural Network

- Use a complex network of simple computing elements → Mimic the brain's function

- Structure:
  Units (input, hidden, output)
  Weighted links

- Learning = Updating weights

- Neural network: Massively parallel computation



Perceptron & Model of a Single Perceptron
Friday, March 10, 2017 5:00 PM

Model of a Single Perceptron

- Input: x₁, x₂, …, xₙ
- Weight: w₁, w₂, …, wₙ
- Activation function: Relates input & output
  O = 1 if Σᵢ wᵢxᵢ ≥ θ, else 0
  (with x₀ ≡ 1, w₀ ≡ −θ, this becomes O = 1 if Σᵢ₌₀ wᵢxᵢ ≥ 0, else 0)

Perceptron

- Feed-forward network
- Only 1 layer of adjustable weights

Learning Capability

- A function can be represented by a single perceptron ⇔ It is linearly separable

  AND (I₁ ∧ I₂)                    OR (I₁ ∨ I₂)
  Decision boundary:               Decision boundary:
  I₁ + I₂ − 1.5 = 0                I₁ + I₂ − 0.5 = 0

- With more layers of sufficiently many perceptrons, any boolean function can be represented
  (Why: Any boolean function can be represented in Sum-of-Products or Product-of-Sums form)
  Examples (network diagrams omitted): XOR; (a ∨ c) ∧ (b ∨ c)

Learning Linearly Separable Functions through a Single Perceptron

- Principle of the weight-updating rule:
  Observed output (T) ≠ Predicted output (O)
  → Make a small adjustment in the weights ∝ Difference

- Algorithm:
  Randomly initialize the weights + Choose a learning rate η
  Repeat until all examples are correctly predicted:
    For each sample input x = (x₁, x₂, …, xₙ), with corresponding target output T:
      Calculate the predicted output: O = 1 if Σᵢ wᵢxᵢ ≥ 0, else 0
      Update the weights: wᵢ ← wᵢ + η(T − O)xᵢ

- Example:

  Input (I₁, I₂)   Output T
  (5, 1)           0
  (2, 1)           0
  (1, 1)           1
  (3, 3)           1
  (4, 2)           0
  (2, 3)           1

  Decision boundary found: −3I₁ + 4I₂ + 1 = 0

  Initialize: w = (0, 0, 0), Learning rate η = 1

  Iteration   w_old = (−θ, w₁, w₂)   I = (1, I₁, I₂)   T   O   T−O   w_new
  1.1         (0, 0, 0)              (1, 5, 1)         0   1   −1    (−1, −5, −1)
  1.2         (−1, −5, −1)           (1, 2, 1)         0   0   0     (−1, −5, −1)
  1.3         (−1, −5, −1)           (1, 1, 1)         1   0   1     (0, −4, 0)
  1.4         (0, −4, 0)             (1, 3, 3)         1   0   1     (1, −1, 3)
  1.5         (1, −1, 3)             (1, 4, 2)         0   1   −1    (0, −5, 1)
  1.6         (0, −5, 1)             (1, 2, 3)         1   0   1     (1, −3, 4)
  2.1         (1, −3, 4)             (1, 5, 1)         0   0   0     (1, −3, 4)
  2.2         (1, −3, 4)             (1, 2, 1)         0   0   0     (1, −3, 4)
  2.3         (1, −3, 4)             (1, 1, 1)         1   1   0     (1, −3, 4)
  2.4         (1, −3, 4)             (1, 3, 3)         1   1   0     (1, −3, 4)
  2.5         (1, −3, 4)             (1, 4, 2)         0   0   0     (1, −3, 4)
  2.6         (1, −3, 4)             (1, 2, 3)         1   1   0     (1, −3, 4)




- Perceptron convergence theorem:
  If the training examples are linearly separable, running the perceptron weight-updating rule will:
    Always converge to a solution
    In a finite number of steps, for any initial weight choice
  If the examples are not linearly separable: the perceptron may fail to converge
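A small sketch of the update rule on the example data above (plain NumPy; the "≥ 0" activation convention matches the iteration table):

```python
import numpy as np

# Training data from the example: inputs (I1, I2) and targets T
X = np.array([[5, 1], [2, 1], [1, 1], [3, 3], [4, 2], [2, 3]], dtype=float)
T = np.array([0, 0, 1, 1, 0, 1])

X = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1 so w0 plays the role of -theta
w = np.zeros(3)                            # initialize w = (0, 0, 0)
eta = 1.0                                  # learning rate

converged = False
while not converged:
    converged = True
    for x, t in zip(X, T):
        o = 1 if np.dot(w, x) >= 0 else 0  # predicted output
        if o != t:
            w += eta * (t - o) * x         # w <- w + eta * (T - O) * x
            converged = False

print(w)   # reaches (1, -3, 4), i.e. the boundary -3*I1 + 4*I2 + 1 = 0
```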



Adaline
Saturday, March 25, 2017 9:27 AM
Adaline (Adaptive Linear Element)

- Def:
  Feedforward network
  1 layer of adjustable weights
  Linear output (no threshold)

- Terminologies:
  Training example d:
    Input: x_d
    Target output: t_d
  Weight: w
  Predicted output: o_d = w · x_d

- Objective: Min E(w) = ½ Σ_d (t_d − o_d)²

  * NOTE: Adaline is "linear regression" in statistics

Fundamentals of Weight-Finding

- Principle: Gradient descent
  Keep going "downhill" from any point on the error surface
  Can get stuck in a locally optimal solution
  However, for Adaline's E: 1 global min only

- Mathematics: Gradient descent
  Gradient at w: ∇E(w) = (∂E/∂w₀, ∂E/∂w₁, …, ∂E/∂wₙ)
  Move w: w ← w − η∇E(w)
    Direction: Opposite to ∇E(w)
    Magnitude: A small fraction (η) of ∇E(w)

Weight-Updating Methods

Batch Gradient Descent
- Principle: In each iteration, use ALL examples to update the weights
- Algorithm:
  Randomly initialize w
  (*) Repeat until the termination condition is met:
    Initialize Δwᵢ = 0 for each wᵢ
    For each training example d:
      Calculate o_d = w · x_d
      For each wᵢ: Δwᵢ ← Δwᵢ + η(t_d − o_d)x_i,d
    For each wᵢ: wᵢ ← wᵢ + Δwᵢ
- Pros & Cons:
  Gradient change is summed over the whole data set in each iteration
    → Expensive computation for a big data set: can't store the whole set in memory → have to read from disk → slow
  For a big data set: Always needs multiple (*) passes (through all training data) to get a good result

Stochastic Gradient Descent
- Principle: In each iteration, use 1 example to update the weights
- Algorithm:
  Randomly initialize w
  (*) Repeat until the termination condition is met:
    For each training example d:
      Calculate o_d = w · x_d
      For each wᵢ: wᵢ ← wᵢ + η(t_d − o_d)x_i,d
  * NOTE: wᵢ changes after each training example → IMMEDIATELY affects the o_d calculation of the NEXT example
- Pros & Cons:
  For a big data set: Possible to get a good result after 1 (*) pass (through all training data)
  Generally moves in the right direction, but not always
    → Have to use a small step (η), decreasing η gradually as the # of (*) iterations grows

Gradient Descent: Common Issues

- Overshooting:
  Problem: η too large → Overstep the min
  Solution: Test on a small subset of the training set → If performance is good → Use for the whole set

- Mini-batch gradient descent:


In each iteration: Use b examples to update weights
Often faster than stochastic
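A minimal sketch of the two Adaline update schemes on synthetic data (NumPy; the data and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
t = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=100)   # linear targets + noise

def batch_gd(X, t, eta=0.01, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w
        w += eta * X.T @ (t - o)          # gradient summed over ALL examples
    return w

def stochastic_gd(X, t, eta=0.01, epochs=200):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o_d = w @ x_d
            w += eta * (t_d - o_d) * x_d  # update immediately after EACH example
    return w

print(batch_gd(X, t))        # both approach (2, -1, 0.5)
print(stochastic_gd(X, t))
```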

- Termination condition:
  Reach a preset # of (*) iterations

- Gradient calculation:
  Problem: Sometimes the gradient can't be derived easily & calculated precisely
    (not the case for Adaline, but e.g. for multi-layer perceptrons)
  Solution: Check the gradient using finite differences
    Repeat for different w and i:
      Compute E(w)
      Compute the gradient ∂E/∂wᵢ using the formula previously derived
      Set: w' = w + εeᵢ, with ε very small (eᵢ = unit vector along wᵢ)
      Compute E(w')
      Verify: ∂E/∂wᵢ ≈ (E(w') − E(w)) / ε
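A sketch of the finite-difference check for the Adaline error E(w) = ½Σ(t_d − w·x_d)² (NumPy; a central difference is used here for accuracy, and the tolerance and ε values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
t = rng.normal(size=50)
w = rng.normal(size=4)

def E(w):
    return 0.5 * np.sum((t - X @ w) ** 2)

def grad(w):
    return -X.T @ (t - X @ w)          # analytic gradient: dE/dw_i = -sum_d (t_d - o_d) x_i,d

eps = 1e-6
for i in range(len(w)):
    e_i = np.zeros_like(w); e_i[i] = 1.0
    numeric = (E(w + eps * e_i) - E(w - eps * e_i)) / (2 * eps)   # finite difference
    assert abs(numeric - grad(w)[i]) < 1e-4, f"gradient check failed at w[{i}]"
print("gradient check passed")
```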



Multilayer Perceptron & Back-Propagation
Monday, May 22, 2017 9:24 PM

Network Construction for Multiclass Classification

- m > 2 classes:
  1 output per class
  Object ∈ Class i → output i should be the largest

- m = 2 classes: Special method:
  1 output y
  y > 0 → Object ∈ YES class
  y ≤ 0 → Object ∈ NO class

Activation (Hidden Unit Transfer) Function

- Must be nonlinear: A network of linear hidden units can be duplicated by a single-layer network
y 0 Object NO class

Hidden Unit Type   Characteristics

Step Unit
- Non-differentiable
- Not suitable for gradient descent

Sigmoid Unit   σ(x) = 1 / (1 + e^(−x))
- 0 < σ(x) < 1
- Nice derivative: σ'(x) = σ(x)(1 − σ(x))
- |x| very small: σ(x) ≈ Linear function
- |x| very big: σ(x) ≈ Step function

Radial Basis Function (RBF)
- Produces a localized response to the input:
  Significant nonzero output only when the input falls within a small localized region
- f_output(x) = g₁(x) + g₂(x) + …

Rectified Linear Unit (ReLU)   f(x) = max(0, x)
- Efficient computation
- Simple gradient: f'(x) = 1 if x > 0, else 0
- Most popular in deep networks
- NOTE: Careful when initializing weights → Avoid f'(x) = 0

Leaky ReLU
- As efficient as ReLU
- But: f'(x) ≠ 0

Universal Approximation

- 1 hidden layer of sigmoid units is sufficient to approximate any well-behaved function to arbitrary precision (1)
- Why use ≥ 2 hidden layers: (2)
  Fewer weights (params)
  Same accuracy level
- NOTE: Approximating a complicated function:
  (1) Use lots of hidden units in 1 layer: w₁Φ₁(x) + w₂Φ₂(x) + w₃Φ₃(x) + …
  (2) Use many hidden layers: Φ(Φ(Φ(…)))
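A tiny NumPy sketch of the differentiable activations from the table and their gradients (the 0.01 leak slope is a common but assumed choice):

```python
import numpy as np

def sigmoid(x):       return 1.0 / (1.0 + np.exp(-x))
def sigmoid_grad(x):  s = sigmoid(x); return s * (1 - s)          # nice derivative: s(1 - s)

def relu(x):          return np.maximum(0.0, x)
def relu_grad(x):     return (x > 0).astype(float)                # zero gradient for x <= 0

def leaky_relu(x, a=0.01):       return np.where(x > 0, x, a * x)
def leaky_relu_grad(x, a=0.01):  return np.where(x > 0, 1.0, a)   # never exactly 0

x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])
print(sigmoid(x), relu(x), leaky_relu(x), sep='\n')
```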

Gradient Computation: Single Sigmoid Unit

- Differentiate E w.r.t. each weight via the chain rule through σ

Gradient Computation: Multiple Layers

- N₀ output units
- Weights between Hidden & Output layer: differentiate E directly (gives δk below)
- Weights between Input & Hidden layer: chain rule through the hidden units (gives δj below)

Backpropagation: Gradient Descent for Multilayer Networks

- Algorithm (Stochastic version):
  Randomly initialize all wⱼᵢ
  Repeat until convergence:
    For each training example (x, t):
      Propagate the input forward:
        Compute the output of every unit, layer by layer
      Propagate the error backward:
        For each Output Unit k:
          δk = ok(1 − ok)(tk − ok)
        For each Hidden Unit j:
          δj = oj(1 − oj) Σk wkj δk
      Update the weights:
        For each connection (i → j):
          wji ← wji + η δj Outi
        NOTE: Outi = Unit i's output, which is:
          uᵢ: For hidden unit i
          xᵢ: For input unit i

- Efficiency: O(W²)
  W = Total number of weights

Gradient Descent: Practical issues

- Speed up training:
  Use momentum
  Dynamically adapt η
  Exploit the error surface's higher-order info

- Escape poor local minima:
  Train multiple networks, each initialized with different weights
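A compact sketch of the stochastic backpropagation loop for one sigmoid hidden layer (NumPy; the network size, learning rate, and data are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
T = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # XOR-like target

X1 = np.hstack([X, np.ones((len(X), 1))])     # constant input plays the role of a bias
eta = 0.5
W1 = rng.normal(scale=0.5, size=(3, 4))       # (inputs + bias) -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(5, 1))       # (hidden + bias) -> 1 output unit

def forward(x):
    u = np.append(sigmoid(x @ W1), 1.0)       # hidden outputs, plus a bias "unit"
    return u, sigmoid(u @ W2)

for epoch in range(1000):
    for x, t in zip(X1, T):
        u, o = forward(x)                                      # propagate input forward
        delta_k = o * (1 - o) * (t - o)                        # output unit delta
        delta_j = u[:4] * (1 - u[:4]) * (W2[:4, 0] * delta_k)  # hidden unit deltas
        W2 += eta * np.outer(u, delta_k)                       # w_ji <- w_ji + eta*delta_j*Out_i
        W1 += eta * np.outer(x, delta_j)

acc = np.mean([(forward(x)[1] > 0.5) == t for x, t in zip(X1, T)])
print('training accuracy:', acc)
```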



Convolutional Neural Network
Monday, May 22, 2017 12:50 AM

Convolution Operator

- Def:
  Continuous: (f ∗ g)(t) = ∫ f(τ) g(t − τ) dτ
  Discrete (2-D): (I ∗ K)(i, j) = Σₘ Σₙ I(m, n) K(i − m, j − n)
  I: Image ; K: Mask (Kernel)

- Properties:
  Commutative: f ∗ g = g ∗ f
  Associative: f ∗ (g ∗ h) = (f ∗ g) ∗ h

Convolutional Network: Special Properties

- Sparse connectivity: Due to the small convolution kernel
  Each hidden unit is only connected to a local subset of units in the previous layer
  → Feature hierarchy

- Shared weights:
  Due to the same kernel being used throughout the image
  Helps detect features regardless of their positions in the image → Robustness
  # params to learn ↓

- Zero-padding:
  Convolution → The representation shrinks at each layer → Limits the # of layers
  Solution: Add zeros at the lost positions
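A direct (non-vectorised) sketch of the discrete 2-D convolution above, with zero-padding so the output keeps the input size (NumPy; the kernel values are illustrative):

```python
import numpy as np

def conv2d(I, K):
    """'Same'-size 2-D convolution of image I with kernel K, using zero-padding."""
    kh, kw = K.shape
    ph, pw = kh // 2, kw // 2
    Ipad = np.pad(I, ((ph, ph), (pw, pw)))           # add zeros at the "lost" positions
    out = np.zeros_like(I, dtype=float)
    Kf = K[::-1, ::-1]                               # flip: true convolution, not correlation
    for i in range(I.shape[0]):
        for j in range(I.shape[1]):
            out[i, j] = np.sum(Ipad[i:i+kh, j:j+kw] * Kf)
    return out

I = np.arange(25, dtype=float).reshape(5, 5)
K = np.array([[1., 0., -1.]] * 3)                    # simple vertical-edge kernel
print(conv2d(I, K))
```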

Activation function

- Convolution is linear → Need a nonlinear activation function
- Commonly used: ReLU: y = max(x, 0)

Pooling Layer

- Motivation:
  Once a feature is detected, only its approx position relative to other features is relevant
  Image of the number 7:
    Endpoint of a roughly horizontal segment in the upper left
    Corner in the upper right area
    Endpoint of a roughly vertical segment in the lower portion
  Different object instances:
    Different absolute positions of the features
    But their relative positions to each other are the same

- Technique: Max-pooling: For each sub-region, output the max value

- Advantages:
  Effect of shifting, rotation ↓ → Robustness
  Computational burden on the next layer ↓
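A minimal max-pooling sketch (2×2, stride 2; NumPy, assuming the input dimensions are divisible by the pool size):

```python
import numpy as np

def max_pool(x, size=2):
    """For each non-overlapping size x size sub-region, output the max value."""
    h, w = x.shape
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]], dtype=float)
print(max_pool(x))   # [[4. 2.] [2. 8.]]
```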



Reinforcement Learning (RL)
Monday, May 22, 2017 10:59 PM
Basic concepts

- Reinforcement Learning:
  Interact with the environment → Get an EVALUATIVE output/reward
  Learn a mapping State → Action
  Goal: Max long-term reward

- Example:
  Objective: Min time to reach the "goal"
  States: Car's position & velocity
  Actions: Forward/Reverse/None
  Rewards:
    0, if goal reached
    −1, otherwise

- World/Environment:
  Deterministic world: Actions have certain outcomes
    From state s, take action a → Definitely reach state s'
    δ(s, a) = New state reached from state s through action a
    r(s, a) = Reward got by taking action a when at state s
  Non-deterministic world: Actions have uncertain outcomes
    From state s, take action a, reach:
      State s', with prob p'
      State s'', with prob p''
    P(s, s', a) = P(Transition s → s' | Action a)
    R(s, s', a) = E(Reward of transition s → s' | Action a)

- Policy: Map State → Action:
  Deterministic policy: At state s, ALWAYS take action a
    π: S → A, π(s) = a
  Non-deterministic policy: At state s, take:
    Action a', with prob p'
    Action a'', with prob p''
    π(s, a) = P(Take action a | State s)

- Discount factor γ:
  Total Reward = Reward(t) + Reward(t + 1)·γ + Reward(t + 2)·γ² + …
  t: Time
  Discount factor γ < 1 (Future reward is not worth as much as current reward)

Markov Assumption

- s_{t+1} = δ(s_t, a_t)
- r_t = r(s_t, a_t)
- The current reward & next state depend ONLY on the current state & action

Policy Evaluation

- Given a policy π → Compute the state-value function V^π

- Bellman equation:
  Deterministic world:
    V^π(s) = r_t + γr_{t+1} + γ²r_{t+2} + … = r(s, π(s)) + γV^π(δ(s, π(s)))
  Non-deterministic world:
    V^π(s) = Σ_{s'} P(s, s', π(s)) [R(s, s', π(s)) + γV^π(s')]

- Example (2-state machine):
  Policy (deterministic): State Good → Action Produce
                          State Broken → Action Repair
  V^π(Good) = 0.9(1 + γV^π(Good)) + 0.1(0 + γV^π(Bad))
  V^π(Bad) = 0.7(0 + γV^π(Good)) + 0.3(−10 + γV^π(Bad))

- Computation method:
  Solve the linear system for V^π(s₁), V^π(s₂), …
  Iterative method:
    Randomly initialize V₀(s₁), V₀(s₂), …
    Run until convergence: V_{k+1}(s) ← Σ_{s'} P(s, s', π(s)) [R(s, s', π(s)) + γV_k(s')]
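An iterative policy-evaluation sketch for the two-state machine example (Python; γ = 0.9 is an assumed value, since the discount used in the notes' example is not visible):

```python
# States: 'Good', 'Bad'; policy: Produce in Good, Repair in Bad.
# Transition probs and rewards under that policy, taken from the example equations.
gamma = 0.9                      # assumed discount factor
P = {'Good': [('Good', 0.9, 1.0), ('Bad', 0.1, 0.0)],
     'Bad':  [('Good', 0.7, 0.0), ('Bad', 0.3, -10.0)]}

V = {'Good': 0.0, 'Bad': 0.0}    # initialized arbitrarily
for _ in range(1000):            # run until (approximate) convergence
    V = {s: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s]) for s in P}

print(V)                         # fixed point of the Bellman equations above
```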



Learning Optimal (Deterministic) Policy
Monday, May 22, 2017 11:30 PM

Basic Concepts

- Learning situation:

  Situation              Condition                   Learning method
  Known environment      Known δ(s, a), r(s, a)      Policy iteration; Value iteration
  Unknown environment    Unknown δ(s, a), r(s, a)    Q-learning

- Optimal policy: π* = argmax_π V^π(s) ∀s

- Optimal state-value function: V*(s) = max_π V^π(s)

- Bellman Optimality Condition:
  Deterministic world:
    V*(s) = max_a [r(s, a) + γV*(δ(s, a))]
  Non-deterministic world:
    V*(s) = max_a Σ_{s'} P(s, s', a) [R(s, s', a) + γV*(s')]

  Example:
  V*(A) = max( 0.5(5 + γV*(A)) + 0.5(5 + γV*(B)),
               10 + γV*(B) )
  V*(B) = 10 + γV*(B)

- Q-function (Action-value function):
  Q^π(s, a) ≡ r(s, a) + γV^π(δ(s, a))      Given π, s, a
  At the optimal π*:
    Deterministic world:
      Q*(s, a) = r(s, a) + γV*(δ(s, a))
               = r(s, a) + γ max_{a'} Q*(δ(s, a), a')
    Non-deterministic world:
      Q*(s, a) = Σ_{s'} P(s, s', a) [R(s, s', a) + γ max_{a'} Q*(s', a')]

Learning in Known Environment: Policy Iteration

- Randomly initialize π(s) ∀s
- Repeat:
  Evaluate the current policy π (compute V^π)
  Improve the policy:
    policyStable ← True
    For each s ∈ S:
      newAction ← argmax_a [r(s, a) + γV^π(δ(s, a))]
      If newAction ≠ π(s):
        π(s) ← newAction
        policyStable ← False
  Stop if policyStable = True

Learning in Known Environment: Value Iteration

(Often converges faster than Policy Iteration)

- Randomly initialize V(s) ∀s
- Repeat until Δ < ε:
  Δ ← 0
  For each s ∈ S:
    oldStateValue ← V(s)
    V(s) ← max_a [r(s, a) + γV(δ(s, a))]
    Δ ← max(Δ, |oldStateValue − V(s)|)
- Output π*:
  π*(s) ← argmax_a [r(s, a) + γV(δ(s, a))]

Learning in Unknown Environment: Q-Learning

(Learn to approximate the Q-function)

- Initialize Q̂(s, a) ∀s, a
- Repeat sufficiently often:
  Choose a current state s (might/might not be random)
  Repeat until Terminate:          ← one Training episode
    Select & Execute an action a (based on some policy)
    Observe: Immediate reward r
             New state s'
    Update: Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
    s ← s'
  Terminate condition: Reach the allowed max # of iterations, or
                       Reach the goal state

Q-Learning: Remarks

- The reward function is non-deterministic
  Each time (s, a) is visited, a different r may be obtained
  Solution: Update through an average
    Q(s, a) ← (1 − α)Q(s, a) + α[r + γ max_{a'} Q(s', a')]
    Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
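A minimal tabular Q-learning sketch on a tiny deterministic chain (Python; the environment, the ε-greedy exploration, and the α/γ values are illustrative assumptions):

```python
import random

# Toy deterministic chain: states 0..4, actions -1/+1, reward 10 on reaching state 4.
def step(s, a):
    s2 = max(0, min(4, s + a))
    return s2, (10.0 if s2 == 4 else 0.0)

alpha, gamma, eps = 0.5, 0.9, 0.1
actions = (-1, +1)
Q = {(s, a): 0.0 for s in range(5) for a in actions}

for episode in range(300):
    s = random.randint(0, 3)                          # choose a current state
    for t in range(100):                              # terminate: max # iterations ...
        if s == 4:                                    # ... or the goal state reached
            break
        if random.random() < eps:                     # epsilon-greedy action selection
            a = random.choice(actions)
        else:
            best = max(Q[(s, b)] for b in actions)
            a = random.choice([b for b in actions if Q[(s, b)] == best])
        s2, r = step(s, a)
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])     # averaged (alpha-weighted) update
        s = s2

print({s: max(actions, key=lambda b: Q[(s, b)]) for s in range(4)})  # learned greedy policy
```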



Training progress

- visitₙ(s, a): # times (s, a) has been visited up to the nth iteration

- Action selection policy:

  Greedy policy:
    At s, select the "best" a: argmax_a Q̂(s, a)
    Problem:
      Over-commits to actions with a high Q̂ found early
      Fails to explore potentially high-reward actions

  ε-Greedy policy: State-of-the-art
    Greedy most of the time
    Occasionally (with prob ε), take a random action

  Exploit & Explore:
    Select action aᵢ with a probability that grows with Q̂(s, aᵢ)
    Exploits high-Q̂ actions, but still has a chance to explore the others



Generalization & Function Approximation
Tuesday, May 23, 2017 12:32 AM

Q-Function Feature-Based Representation

- Problem with Q-learning: When |S| is too big:
  Too many states to visit them all
  Too many entries in the Q-table to hold in memory
  Solution: Replace the Q-table with a function approximator

- Feature-based representation:
  Q(s, a) = w₁f₁(s, a) + … + wₙfₙ(s, a)
  fₖ(s, a): Feature function
    Returns a real number
    Captures important properties of (s, a)
    Pacman: Distance to closest ghost, …
  Advantage: n ≪ # States

- Weight update:
  Recall: Q(s, a) ← Q(s, a) + α[Difference]
    Difference = (r + γ max_{a'} Q(s', a')) − Q(s, a)
  Formulation: target t = r + γ max_{a'} Q(s', a')
    wᵢ ← wᵢ + α[Difference]fᵢ(s, a)

Q-Learning with Linear Approximation

- Initialize wᵢ
- Repeat sufficiently often:
  Choose a current state s
  Repeat until Terminate:          ← one Training episode
    Select & Execute an action a
    Observe: Immediate reward r
             New state s'
    Update the weights:
      Difference = r + γ max_{a'} Q(s', a') − Q(s, a)
      wᵢ ← wᵢ + α[Difference]fᵢ(s, a)   for i = 1..n
    s ← s'

Q-Learning with Linear Approximation: Issues

Observation (1): The current params determine the next data sample that gets trained on
  What might go wrong:
    Strong correlation between samples → Divergence/Oscillation
    Learning directly from consecutive samples is inefficient
  Solution: Experience replay
    Pool the agent's experience over many episodes into a set D
      Experience e = (s, a, r, s')
    Draw samples randomly from D → Update the weights: wᵢ ← wᵢ + α[Difference]fᵢ(s, a)
  Why the solution works:
    Randomized samples → Break the correlation
    Each experience is potentially used in many weight updates → Data efficiency

Observation (2): Q is used in both evaluating (s, a) & selecting the action
    Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
    Q(s, a): Evaluation
    max_{a'} Q(s', a'): Action selection
  What might go wrong:
    Poor local minima / divergence
  Solution: Target network
    Every C updates: Clone network Q (weights) → Target network Q̂
    For the next C updates: Use Q̂ to generate the Q-learning targets
  Why the solution works:
    Delay between the time Q is updated & the time the targets are affected by the update
    → Divergence/Oscillation ↓ → Avoid oscillation/divergence
Avoid oscillation/divergence

