
Machine Learning

Support Vector Machines


Sjoerd Maessen

• AFOL (Adult Fan of LEGO)
• Works at E-sites Breda
• Stock market enthusiast
Titanic
The chance of survival
Challenge accepted
Could you take siblings into account?
Sure…
Oh, and the number of parents and children aboard…
Of course!
Could you add a “simple if” for age as well?
Great! We are almost there!
But…
“Field of study that gives computers the ability to learn without being explicitly programmed”

Arthur Lee Samuel


https://personality-insights-livedemo.mybluemix.net/
Alright!
Let's become data scientists!
A comparison
Traditional programming:
  input + program -> output

Machine learning:
  input + output -> program
  new input + program -> new output


Regression vs classification

Classification (discrete output)

Input | Output
0.98  68 | 0
0.76  42 | 0
1.23  78 | 1
1.91  109 | 1

Regression (continuous output)

Input | Output
0.98  68 | 0.23
0.76  42 | 0.15
1.23  78 | 4.74
1.91  109 | 7.98


Support Vector Machine

• Automatically creates a “program”


• Model represents a space
• New input fits somewhere in this space
Support Vectors

• Optimal hyperplane
• Linear classifier
• Maximum margin
• Classification
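A minimal sketch of what "linear classifier" and "maximum margin" mean in code. The weights `w`, the offset `b`, and the four points are hand-picked for illustration and are not the output of a trained model. All four points lie exactly on the margin boundaries, which makes them the support vectors of this toy set.

```python
import math

def decision_value(w, b, x):
    """Signed score w·x + b; its sign tells on which side of the hyperplane x lies."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(w, b, x):
    """Linear classifier: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def margin_width(w):
    """Distance between the two margin boundaries w·x + b = +1 and w·x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane x1 + x2 - 3 = 0 (chosen by hand, not learned).
w, b = [1.0, 1.0], -3.0

# Toy points with their labels; each has a decision value of exactly +1 or -1.
points = [([1.0, 1.0], -1), ([2.0, 0.0], -1), ([2.0, 2.0], 1), ([3.0, 1.0], 1)]

predictions = [predict(w, b, x) for x, _ in points]
```

An SVM training routine searches for the `w` and `b` that make `margin_width(w)` as large as possible while still classifying the training points correctly.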
Linearly separable dataset
Non-linear decision boundary
The kernel trick
A whole new dimension
The kernel trick
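A small sketch of the "whole new dimension" idea, assuming a 1-D toy dataset that no single threshold can split. Lifting each point x to (x, x²) makes the classes separable by a straight line. The function names and the separating line x² = 4 are assumptions made for this illustration. The `kernel` function shows the trick itself: it returns the same dot product as the lifted vectors without ever building them.

```python
def lift(x):
    """Explicit feature map from 1-D to 2-D: x -> (x, x^2)."""
    return (x, x * x)

def kernel(x, z):
    """Dot product of lift(x) and lift(z), computed directly from x and z."""
    return x * z + (x * x) * (z * z)

# Toy data: the positive class sits on both sides of the negative class,
# so no threshold on x alone separates them.
data = [(-3.0, 1), (-1.0, -1), (0.0, -1), (1.0, -1), (3.0, 1)]

def predict_lifted(x):
    """Classify with a hand-picked line in the lifted space: x^2 = 4."""
    _, x_squared = lift(x)
    return 1 if x_squared - 4.0 >= 0 else -1

lifted_predictions = [predict_lifted(x) for x, _ in data]
```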
Choosing a kernel
• No kernel or linear kernel
• Gaussian kernel
• Polynomial kernel
• Sigmoid kernel
• Radial basis function kernel
• …
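For reference, a sketch of how the listed kernels can be written as plain functions. The parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions that follow the naming commonly seen in SVM libraries. The Gaussian kernel is the most common radial basis function kernel, so one function covers both names here.

```python
import math

def dot(x, z):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """No feature mapping at all: just the dot product."""
    return dot(x, z)

def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=1.0):
    """(gamma * x·z + coef0) ^ degree"""
    return (gamma * dot(x, z) + coef0) ** degree

def gaussian_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2); equals 1.0 when x and z coincide."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x, z, gamma=1.0, coef0=0.0):
    """tanh(gamma * x·z + coef0)"""
    return math.tanh(gamma * dot(x, z) + coef0)
```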
Choosing a kernel
Rule of thumb
• N much bigger than M => linear kernel
• N small, M intermediate => Gaussian kernel

• N = number of features
• M = number of training examples
Spam detection

• N = 10,000 (bad words, # of URLs, …)


• M = 250 (sample mails)

=> linear kernel


Validation of housing prices

• N = 1-1000 (# of rooms, m³, location, …)


• M = 100,000 (# of transactions)

=> Gaussian kernel


Features
It’s all about preparation
Features

• Representation of raw data


• The hardest part

Raw data -> Pre-processing -> Feature extraction -> Feature selection -> Feature scaling
Association discovery
• Big data
• Determine relevance
OCR
Pre-processing
• De-skew
• Despeckle
• Convert to black & white
• Zoning
• Character segmentation
• Transformations…
OCR
Feature extraction
• Number of black pixels
• Filled x,y
• Histogram
• Character contour
• ...
OCR
Feature scaling
• To the range [0, 1]
• To the range [-1, +1]

How to scale the number of black pixels?


• 22 / 56 => 0.39286
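The scaling above can be sketched as a small min-max function. It assumes that 56 is the largest possible number of black pixels (for example the total pixel count of the character cell) and 0 the smallest. The function name and its parameters are illustrative.

```python
def scale(value, lo, hi, new_lo=0.0, new_hi=1.0):
    """Min-max scaling: map value from [lo, hi] onto [new_lo, new_hi]."""
    if hi == lo:
        raise ValueError("lo and hi must differ")
    return new_lo + (value - lo) * (new_hi - new_lo) / (hi - lo)

# 22 black pixels out of an assumed maximum of 56.
scaled_0_1 = scale(22, 0, 56)                 # target range [0, 1]
scaled_pm1 = scale(22, 0, 56, -1.0, 1.0)      # target range [-1, +1]
```

Whichever range is chosen, the same `lo` and `hi` have to be reused for new inputs at prediction time, otherwise the model sees values on a different scale than it was trained on.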
Real life
Training a model

Training file
• Labels
• Features

Label | % Black Pixels | X1 % | X2 % | X3 % | …
0 | 0.33 | 0.546 | 0.840
1 | 0.78 | 0.123 | 0.567 | 0.347
1 | 0.75 | 0.512 | 0.543
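A row of label plus feature values like the ones above still has to be written to disk in a format the SVM library understands. The sketch below assumes the LIBSVM text format (`<label> <index>:<value> ...`), where feature indices start at 1 and must appear in ascending order. Whether the original training file used exactly this format is an assumption, and the function name is illustrative.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        # ':g' prints the shortest sensible decimal form, e.g. 0.78 rather than 0.780000
        parts.append(f"{index}:{value:g}")
    return " ".join(parts)

# The complete sample row from the table: label 1 with four feature values.
line = to_libsvm_line(1, [0.78, 0.123, 0.567, 0.347])
```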

Feature selection
• Cross-validation
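One common way to use cross-validation for feature selection is to score every candidate feature set with the same k-fold procedure and keep the best-scoring one. The sketch below shows only the k-fold part. `train_fn` and `predict_fn` are hypothetical callbacks standing in for whatever SVM library is used, and the data is not shuffled, which real data usually should be before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k consecutive, near-equal folds."""
    if k < 2 or k > n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # first 'extra' folds get one more
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

def cross_validated_accuracy(samples, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the held-out fold, and average the accuracy."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(1 for i in test_idx if predict_fn(model, samples[i]) == labels[i])
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

Running `cross_validated_accuracy` once per candidate feature set gives comparable scores, because every candidate is judged on data it was not trained on.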
Alice in Wonderland
Down the rabbit hole
Features
• Word length
• Char frequency
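A sketch of how the two features could be turned into a numeric vector for language detection. The exact features used in the talk are not shown, so the layout here (average word length followed by the relative frequency of each letter a-z) is an assumption.

```python
import string

def text_features(text):
    """Return [average word length] + relative frequency of each letter a-z."""
    words = text.split()
    # Word length is measured on whitespace-separated tokens, punctuation included.
    avg_word_length = sum(len(w) for w in words) / len(words) if words else 0.0

    # Only plain ASCII letters are counted; accented characters are ignored.
    letters = [c for c in text.lower() if c in string.ascii_lowercase]
    total = len(letters)
    frequencies = [letters.count(c) / total if total else 0.0
                   for c in string.ascii_lowercase]
    return [avg_word_length] + frequencies

features = text_features("abc ab")
```

The letter frequencies already lie between 0 and 1. The average word length does not, so it would still need scaling before training.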
Basic example
Result
• Thank you for contacting us. This is an automated response confirming the receipt of your ticket. Our team will get back to you as soon as possible. When replying, please make sure that the ticket ID is kept in the subject so that we can track your replies.

=> This is an English text

• Hierbij bevestigen wij de ontvangst en verwerking van uw e-mail met ticketnummer PCL-98124-735. Uw vraag wordt opgepakt door één van onze engineers. Wij streven ernaar spoedig een oplossing aan u terug te kunnen koppelen.

=> This is a Dutch text


Titanic
The chance of survival
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27 | 0 | 2 | 347742 | 11.1333 | | S
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14 | 1 | 0 | 237736 | 30.0708 | | C
11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4 | 1 | 1 | PP 9549 | 16.7 | G6 | S
12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58 | 0 | 0 | 113783 | 26.55 | C103 | S
13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20 | 0 | 0 | A/5. 2151 | 8.05 | | S
14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39 | 1 | 5 | 347082 | 31.275 | | S


Feature extraction

• Title (Mrs, Miss, Mr, Jonkheer, Capt, …)


• Passenger class
• Sex
• Age
• Siblings/spouses
• Parents/children
• Cabin
• Port of embarkation
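A sketch of how the listed features could be pulled out of one row of the training data. The column names match the table header, but the numeric codes (`TITLES`, `PORTS`, 1 for female) are hypothetical choices for illustration, not the encoding used in the talk.

```python
import re

TITLES = ["Mr", "Mrs", "Miss", "Master"]   # hypothetical; other titles share one code
PORTS = {"S": 0, "C": 1, "Q": 2}           # hypothetical port codes

def extract_title(name):
    """Pull the title out of a name shaped like 'Surname, Title. Given names'."""
    match = re.search(r",\s*([^.,]+)\.", name)
    return match.group(1).strip() if match else ""

def encode_passenger(row):
    """Turn one row (a dict keyed by column name) into a list of numbers."""
    title = extract_title(row["Name"])
    return [
        TITLES.index(title) if title in TITLES else len(TITLES),
        int(row["Pclass"]),
        1 if row["Sex"] == "female" else 0,
        float(row["Age"]) if row["Age"] else None,  # None marks a blank age
        int(row["SibSp"]),
        int(row["Parch"]),
        1 if row["Cabin"] else 0,                   # only whether a cabin is known
        PORTS.get(row["Embarked"], -1),
    ]

first_row = {"Name": "Braund, Mr. Owen Harris", "Pclass": "3", "Sex": "male",
             "Age": "22", "SibSp": "1", "Parch": "0", "Cabin": "", "Embarked": "S"}
encoded = encode_passenger(first_row)
```

Integer codes for titles and ports impose an arbitrary order on categories. One-hot encoding (one 0/1 feature per category) is a common alternative.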
Reading the training file
Preprocessing and scaling
Filling in blanks
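Some passengers have no age recorded, and an SVM cannot train on a missing value. One simple option, shown here as an assumption rather than as the method used in the talk, is to replace each blank with the median of the known values.

```python
from statistics import median

def fill_blanks(values):
    """Replace None entries with the median of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("cannot fill blanks without at least one known value")
    fill_value = median(known)
    return [fill_value if v is None else v for v in values]

# Ages of passengers 1 to 8 from the table; passenger 6 has no age.
ages = [22, 38, 26, 35, 35, None, 54, 2]
filled_ages = fill_blanks(ages)
```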
Magic!

Result: 83.26% accuracy


Common issues

• Feature numbering
• Training data ≠ real world
• Overfitting
• Feature selection
• Multiclass classification
Next step?
Learn R, Python, …
Resources

• https://www.csie.ntu.edu.tw/~cjlin/libsvm/
• http://php.net/manual/en/book.svm.php
• https://www.kaggle.com/
• http://scikit-learn.org/stable/

• https://packagist.org/packages/sjoerdmaessen/machinelearning
@sjoerdmaessen
linkedin.com/in/sjoerdmaessen
