
Machine Learning

Lecture 4
Outline
‣ Understanding optimization view of learning
- large margin linear classification
- regularization, generalization
‣ Optimization algorithms
- preface: gradient descent optimization
- stochastic gradient descent
- quadratic program
Recall: learning as optimization
‣ Machine learning problems are often cast as optimization
problems

objective function = average loss + regularization

‣ Large margin linear classification as optimization


(Support Vector Machine)
J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}_h\big( y^{(i)} (\theta \cdot x^{(i)} + \theta_0) \big) + \frac{\lambda}{2} \|\theta\|^2
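
For concreteness, here is a minimal NumPy sketch of evaluating this objective, assuming Loss_h is the hinge loss max{0, 1 - z} and with illustrative function and variable names (lam stands for the regularization parameter λ); this is a sketch, not code from the slides:

import numpy as np

def hinge_loss(z):
    # Loss_h(z) = max(0, 1 - z): zero once the signed agreement z reaches 1
    return np.maximum(0.0, 1.0 - z)

def objective(theta, theta_0, X, y, lam):
    # average hinge loss over the n training examples
    agreements = y * (X @ theta + theta_0)      # y^(i) (theta . x^(i) + theta_0)
    avg_loss = np.mean(hinge_loss(agreements))
    # plus the regularization term (lambda / 2) ||theta||^2
    return avg_loss + 0.5 * lam * np.dot(theta, theta)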
Recall: large margin classifier
[Figure: decision boundary θ · x + θ₀ = 0 with positive and negative margin boundaries θ · x + θ₀ = ±1; the distance from the decision boundary to the margin boundary is 1/‖θ‖.]
J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}_h\big( y^{(i)} (\theta \cdot x^{(i)} + \theta_0) \big) + \frac{\lambda}{2} \|\theta\|^2
[Figure panels: the learned decision boundary θ · x + θ₀ = 0 and margin boundaries θ · x + θ₀ = ±1, shown for λ = 0.1 (C = 10), λ = 1 (C = 1), λ = 100 (C = 0.01), λ = 1000 (C = 0.001), and again for λ = 0.01 (C = 100), λ = 0.1 (C = 10), λ = 1 (C = 1), λ = 100 (C = 0.01).]
Regularization, generalization

J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}_h\big( y^{(i)} (\theta \cdot x^{(i)} + \theta_0) \big) + \frac{\lambda}{2} \|\theta\|^2
Outline
‣ Understanding optimization view of learning
- large margin linear classification
- regularization, generalization
‣ Optimization algorithms
- preface: gradient descent optimization
- stochastic gradient descent
- quadratic program
Preface: Gradient descent

[Figure: a one-dimensional objective J(θ), illustrating gradient descent steps toward its minimum.]
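A minimal sketch of the gradient descent idea, assuming a differentiable objective, a fixed step size, and a fixed number of steps (all illustrative choices, not prescribed by the slides):

import numpy as np

def gradient_descent(grad_J, theta_init, eta=0.1, num_steps=100):
    # repeatedly step against the gradient: theta <- theta - eta * grad_J(theta)
    theta = np.array(theta_init, dtype=float)
    for _ in range(num_steps):
        theta = theta - eta * grad_J(theta)
    return theta

# Example: J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_min = gradient_descent(lambda th: 2.0 * (th - 3.0), theta_init=[0.0])
print(theta_min)   # approaches [3.0]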
Stochastic gradient descent
J(\theta, \theta_0) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Loss}_h\big( y^{(i)} (\theta \cdot x^{(i)} + \theta_0) \big) + \frac{\lambda}{2} \|\theta\|^2
                    = \frac{1}{n} \sum_{i=1}^{n} \Big[ \mathrm{Loss}_h\big( y^{(i)} (\theta \cdot x^{(i)} + \theta_0) \big) + \frac{\lambda}{2} \|\theta\|^2 \Big]
Stochastic gradient descent
J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \Big[ \mathrm{Loss}_h\big( y^{(i)}\, \theta \cdot x^{(i)} \big) + \frac{\lambda}{2} \|\theta\|^2 \Big]
Select i \in \{1, \dots, n\} at random, then update

\theta \leftarrow \theta - \eta_t \nabla_\theta \Big[ \mathrm{Loss}_h\big( y^{(i)}\, \theta \cdot x^{(i)} \big) + \frac{\lambda}{2} \|\theta\|^2 \Big]
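
A minimal sketch of this stochastic update for the hinge loss; the decreasing step size η_t = 1/(λt) and the subgradient used where the hinge is not differentiable are common choices taken here as assumptions, not prescribed by the slides:

import numpy as np

def sgd_hinge(X, y, lam=0.1, num_steps=10000, seed=0):
    # stochastic (sub)gradient descent on (1/n) sum_i [ Loss_h(y_i theta.x_i) + (lam/2) ||theta||^2 ]
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for t in range(1, num_steps + 1):
        i = rng.integers(n)               # select i in {1, ..., n} at random
        eta_t = 1.0 / (lam * t)           # decreasing step size (illustrative choice)
        if y[i] * (X[i] @ theta) < 1.0:   # hinge is active: subgradient is -y_i x_i + lam * theta
            grad = -y[i] * X[i] + lam * theta
        else:                             # hinge is zero here: only the regularizer contributes
            grad = lam * theta
        theta = theta - eta_t * grad
    return theta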
Support Vector Machine
‣ Support Vector Machine finds the maximum margin linear separator by solving the quadratic program that corresponds to J(\theta, \theta_0)

‣ In the realizable case, if we disallow any margin violations, the quadratic program we have to solve is:

Find \theta, \theta_0 that minimize \frac{1}{2} \|\theta\|^2 subject to

y^{(i)} (\theta \cdot x^{(i)} + \theta_0) \ge 1, \quad i = 1, \dots, n
[Figure: decision boundary θ · x + θ₀ = 0 with positive and negative margin boundaries θ · x + θ₀ = ±1; the distance from the decision boundary to the margin boundary is 1/‖θ‖.]
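
For illustration, a sketch of solving this hard-margin quadratic program numerically with scipy.optimize.minimize; the toy data and the choice of the SLSQP solver are assumptions made for the example, and a dedicated QP solver would be the usual tool in practice:

import numpy as np
from scipy.optimize import minimize

# toy linearly separable data (an illustrative assumption)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n, d = X.shape

# decision variables packed as w = [theta_1, ..., theta_d, theta_0]
def objective(w):
    theta = w[:d]
    return 0.5 * np.dot(theta, theta)        # (1/2) ||theta||^2

# one inequality constraint per example: y_i (theta . x_i + theta_0) - 1 >= 0
constraints = [
    {"type": "ineq", "fun": lambda w, i=i: y[i] * (X[i] @ w[:d] + w[d]) - 1.0}
    for i in range(n)
]

result = minimize(objective, x0=np.zeros(d + 1), method="SLSQP", constraints=constraints)
theta, theta_0 = result.x[:d], result.x[d]
print(theta, theta_0)    # maximum margin separator for the toy data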
Summary
‣ Learning problems can be formulated as optimization
problems of the form: loss + regularization
‣ Large margin linear classification, along with many other
learning problems, can be solved with stochastic
gradient descent algorithms
‣ A large margin linear classifier can also be obtained by
solving a quadratic program (Support Vector Machine)
