
PCA/SVD, KNN, Bayes Theorem, Neural Networks, and the Concept of SVM

November 13, 2018

Agenda for Classes 2018

(Might be revised as we progress)
• Class 1 – (9/11) Overview of data mining
• Class 2 – (9/18) R (I), Wush Wu
• Class 3 – (9/25) Association, Apriori and its related issues
• Class 4 – (10/2) Data stream mining, FP-tree, vertical mining (announcing HW1)
• Class 5 – (10/9) Classification: decision tree, GPGPU
• Class 6 – (10/16) Description of data (announcing HW2)
• Class 7 – (10/23) R (II), Wush Wu (announcing HW3, in R)
Tentative Class Agenda (cont'd)
• Class 8 – (10/30) Data exploration, more on decision trees, rule-based classifiers, project announcement
• Class 9 – (11/6) Scikit-learn, LibSVM, preparation for HW4 and HW5
• Class 10 – (11/13) PCA/SVD, KNN, Bayes, neural networks, concept of SVM
• Class 11 – (11/20) Go over abstracts (due 11/18), SVM, clustering, K-means, PAM
• Class 12 – (11/27) More on clustering; sequential pattern mining
• Class 13 – (12/4) Web mining, PageRank, etc.

Tentative Class Agenda (cont'd)
• Class 14 – (12/11) When Database and Data Mining Meet, Prof. Mingling Lo
• Class 15 – (12/18) Project presentation I
• Class 16 – (12/25) Project presentation II
(Final exam according to university schedule)
(Project due 1/24/2019)
• Happy New Year!

PCA and SVD

Principal Components Analysis (PCA)
• A goal of PCA is to find a transformation of the data that satisfies the following properties (see the sketch below):
  – Each pair of new attributes has 0 covariance (for distinct attributes).
  – The attributes are ordered with respect to how much of the variance of the data each attribute captures.
  – The first attribute captures as much of the variance of the data as possible.
  – Subject to the orthogonality requirement, each successive attribute captures as much of the remaining variance as possible.
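As an illustration only (not part of the original slides), here is a minimal scikit-learn sketch of these properties; the data matrix X, the random seed, and the number of components are made-up assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up 2-D data with correlated attributes (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

pca = PCA(n_components=2)          # centers the data (mean of each attribute becomes 0)
Z = pca.fit_transform(X)           # new attributes = principal components

print(np.cov(Z, rowvar=False))     # (near-)diagonal: distinct components have ~0 covariance
print(pca.explained_variance_)     # sorted: the 1st component captures the most variance
```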
Example of PCA
[Figure: scatter plot of the data with the 1st and 2nd principal components overlaid. Preprocessing: the mean of each attribute is shifted to 0. Dimension reduction keeps the 1st PC only, or the 1st and 2nd PCs, ordered by captured variance.]
Singular Value Decomposition (SVD)
[Slides: the SVD factorization of a data matrix D (n×p), written D = U Σ Vᵀ (Eq. B.2 in Tan et al.), its relationship to PCA, and a worked example with n = 3 rows in D showing the right singular vectors v1, v2, v3 and how truncating to the leading singular values performs dimension reduction.]
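A minimal numpy sketch of truncated SVD for dimension reduction (illustrative only; the matrix D and the choice of keeping one component are assumptions):

```python
import numpy as np

# Made-up data matrix D with n = 3 rows and p = 4 attributes.
D = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.1, 6.0, 8.2],
              [0.5, 1.0, 1.6, 2.0]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)   # D = U @ diag(s) @ Vt

k = 1                                              # keep only the largest singular value
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of D

print(s)                                           # singular values, largest first
print(np.linalg.norm(D - D_k))                     # reconstruction error after reduction
```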

KNN: K Nearest Neighbors

Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: compute the distance from a test record to the training records, then choose the k "nearest" records.]
(Slides in this part: © Tan, Steinbach, Kumar, Introduction to Data Mining.)

Nearest-Neighbor Classifiers
[Figure: an unknown record surrounded by labeled training records.]
• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)



Definition of Nearest Neighbor
[Figure: three panels around a test point X — (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.]
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x


Nearest Neighbor Classification
• Compute the distance between two points:
  – Euclidean distance: d(p, q) = √( Σ_i (p_i − q_i)² )
• Determine the class from the nearest-neighbor list:
  – Take the majority vote of class labels among the k nearest neighbors
  – Optionally weight each vote according to distance, e.g., weight factor w = 1/d² (see the sketch below)
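A minimal sketch of distance-weighted k-NN voting (illustrative only; the training points, labels, and query are made up):

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a distance-weighted majority vote of its k nearest neighbors."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))      # Euclidean distances
    nearest = np.argsort(d)[:k]                        # indices of the k smallest distances
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12) # weight factor w = 1/d^2
    return max(votes, key=votes.get)

# Made-up toy data.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "duck"
```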
Nearest Neighbor Classification…

 Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes


Nearest Neighbor Classification…

• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (see the sketch below)
  – Example:
     the height of a person may vary from 1.5 m to 1.8 m
     the weight of a person may vary from 90 lb to 300 lb
     the income of a person may vary from $10K to $1M
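A minimal sketch of rescaling attributes before computing distances (illustrative only; scikit-learn's StandardScaler is one common choice, and the records are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up records: [height (m), weight (lb), income ($)].
X = np.array([[1.6, 120.0,  30_000.0],
              [1.8, 250.0, 900_000.0],
              [1.5,  95.0,  15_000.0]])

X_scaled = StandardScaler().fit_transform(X)   # each attribute -> mean 0, unit variance

# Without scaling, income dominates the Euclidean distance; after scaling it does not.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_scaled[0] - X_scaled[1]))
```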



Nearest Neighbor Classification…
• k-NN classifiers are lazy learners
  – They do not build a model explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive


Bayes Classifiers

Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:
    P(C | A) = P(A, C) / P(A)
    P(A | C) = P(A, C) / P(C)
• Bayes theorem:
    P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem
• Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what is the probability he/she has meningitis?
    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002



Another Example
• Suppose we have 3 round apples (2 red, 1 green), 2 bananas, 3 litchis (red and round), and 4 mangoes
• X: a red and round object; H: being an apple
• P(X) = 5/12, P(H) = 3/12, P(X|H) = 2/3
• Then P(H|X) = P(X|H) P(H) / P(X) = (2/3)(3/12)/(5/12) = 2/5
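A quick check of both worked examples in Python (illustrative only; the helper name is not from the slides):

```python
def bayes_posterior(p_x_given_h, p_h, p_x):
    """Posterior P(H|X) = P(X|H) P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

print(bayes_posterior(0.5, 1 / 50_000, 1 / 20))   # meningitis example -> 0.0002
print(bayes_posterior(2 / 3, 3 / 12, 5 / 12))     # fruit example      -> 0.4 (= 2/5)
```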


Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An):
  – Goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?



Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:
      P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How do we estimate P(A1, A2, …, An | C)?



Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – Can estimate P(Ai | Cj) for all Ai and Cj
  – A new point is classified as Cj if P(Cj) Π_i P(Ai | Cj) is maximal



How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0

How to Estimate Probabilities from Data?
• For continuous attributes:
  – Discretize the range into bins
     one ordinal attribute per bin
     violates the independence assumption
  – Two-way split: (A < v) or (A > v)
     choose only one of the two splits as the new attribute
  – Probability density estimation:
     assume the attribute follows a normal distribution
     use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
     once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?
[Same training table as above.]
• Normal distribution:
    P(Ai | cj) = (1 / √(2π σij²)) · exp( −(Ai − μij)² / (2 σij²) )
  – One for each (Ai, cj) pair
• For (Income, Class=No):
  – If Class=No: sample mean = 110, sample variance = 2975
    P(Income=120 | No) = (1 / (√(2π) · 54.54)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
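A minimal sketch of the Gaussian estimate used above (illustrative only; the mean, variance, and query value come from the worked example, and the helper name is not from the slides):

```python
import math

def gaussian_likelihood(x, mean, var):
    """P(Ai = x | class), assuming the attribute is normally distributed within the class."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income given Class=No: sample mean = 110, sample variance = 2975 (from the table above).
print(gaussian_likelihood(120, 110, 2975))   # ~0.0072
```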

Example of Naïve Bayes Classifier

Given a test record: X = (Refund=No, Marital Status=Married, Income=120K)

Naïve Bayes classifier (estimated from the training table):
  P(Refund=Yes | No) = 3/7        P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0         P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7    P(Marital Status=Divorced | No) = 1/7    P(Marital Status=Married | No) = 4/7
  P(Marital Status=Single | Yes) = 2/7   P(Marital Status=Divorced | Yes) = 1/7   P(Marital Status=Married | Yes) = 0
  For taxable income:
    If class=No:  sample mean = 110, sample variance = 2975
    If class=Yes: sample mean = 90,  sample variance = 25

• P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes)
                   = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X)  =>  Class = No
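A minimal sketch reproducing this calculation (illustrative only; the probabilities are copied from the slide above):

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Class-conditional estimates from the training table.
p_no, p_yes = 7 / 10, 3 / 10
p_x_no  = (4 / 7) * (4 / 7) * gaussian(120, 110, 2975)   # Refund=No, Married, Income=120K
p_x_yes = 1 * 0 * gaussian(120, 90, 25)                  # Married never occurs with Yes -> 0

print(p_x_no, p_x_yes)                                   # ~0.0024 and 0
print("No" if p_x_no * p_no > p_x_yes * p_yes else "Yes")  # -> "No"
```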


Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:
    Original:    P(Ai | C) = Nic / Nc
    Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
    m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)
  where c: number of classes, p: prior probability, m: parameter
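A minimal sketch of these three estimators (illustrative only; the function names are not from the slides, and the example counts are taken from the training table above):

```python
def original_estimate(n_ic, n_c):
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # c: number of classes (as defined on the slide)
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    # m: parameter, p: prior probability
    return (n_ic + m * p) / (n_c + m)

# E.g., P(Refund=Yes | Yes) from the table: Nic = 0, Nc = 3.
print(original_estimate(0, 3))          # 0.0 -> wipes out the whole product
print(laplace_estimate(0, 3, c=2))      # 0.2
print(m_estimate(0, 3, m=3, p=0.5))     # 0.25
```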

Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

A: attributes of the test record, M: mammals, N: non-mammals

  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.004 × 13/20 = 0.0027

Test record:
  Give Birth  Can Fly  Live in Water  Have Legs  Class
  yes         no       yes            no         ?

Since P(A | M) P(M) > P(A | N) P(N)  =>  Mammals


Naïve Bayes (Summary)

 Robust to isolated noise points


 Handle missing values by ignoring the instance
during probability estimate calculations
 Robust to irrelevant attributes
 Independence assumption may not hold for some
attributes
 Will the testing outcome for an input tuple always
be the same as that from a decision tree built on
the same training set?
 Will an input tuple be classified based on the
majority of those tuples with same attribute
values?

Neural Networks

Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: a black box with inputs X1, X2, X3 and output Y.]

Output Y is 1 if at least two of the three inputs are equal to 1.

Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: input nodes X1, X2, X3 with weights 0.3 each feeding an output node with threshold t = 0.4.]

  Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0),
  where I(z) = 1 if z is true, and 0 otherwise.
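A minimal sketch that checks this unit against the truth table above (illustrative only):

```python
from itertools import product

def ann_output(x1, x2, x3):
    # Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0): fires iff at least two inputs are 1.
    return int(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

for x1, x2, x3 in product([0, 1], repeat=3):
    print(x1, x2, x3, "->", ann_output(x1, x2, x3))
```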
Artificial Neural Networks (ANN)
• The model is an assembly of interconnected nodes and weighted links
[Figure: perceptron model — input nodes X1, X2, X3 with weights w1, w2, w3 feeding an output node Y with threshold t.]
• The output node sums up its input values according to the weights of its links
• Compare the output node against some threshold t:
    Y = sign( Σ_i wi Xi − t )
  or, with a sigmoid unit:
    Ij = Σ_i wij Oi + Θj,    Oj = 1 / (1 + e^(−Ij))

Structure of NN
[Figure: neuron i receives inputs I1, I2, I3 over weights wi1, wi2, wi3, forms the weighted sum Si, and an activation function g(Si) with threshold t produces the output Oi.]
• Training an NN means learning the weights of the neurons

Deep Neural Network
[Figure: a neural network with multiple hidden layers.]

Algorithm for Learning ANN
• Initialize the weights (w0, w1, …, wk)
• Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
  – Objective function: E = Σ_i [ Yi − f(wi, Xi) ]²
  – Find the weights wi that minimize the above objective function
  – e.g., the backpropagation algorithm
Backpropagation

  Ij = Σ_i wij Oi + Θj        // compute the net input of unit j with respect to the previous layer, i
  Oj = 1 / (1 + e^(−Ij))      // SIGMOID: compute the output of each unit j

  For each unit j in the output layer:
      Errj = Oj (1 − Oj) (Tj − Oj)          // compute the error

  For each unit j in the hidden layers:
      Errj = Oj (1 − Oj) Σ_k Errk wjk       // compute the error with respect to the next higher layer, k

Backpropagation (cont'd)

  For each weight wij in the network:
      Δwij = (l) Errj Oi          // weight increment, l is the learning rate
      wij = wij + Δwij            // weight update

  For each bias Θj in the network:
      ΔΘj = (l) Errj              // bias increment
      Θj = Θj + ΔΘj               // bias update



Learning Rate
• Learning rate: a constant typically having a value between 0.0 and 1.0
• Helps avoid getting stuck at a local minimum in the decision space
• A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far


Multi-Layer Perceptron

Output vector / output nodes:   Errj = Oj (1 − Oj) (Tj − Oj)
Hidden nodes:                   Errj = Oj (1 − Oj) Σ_k Errk wjk
Updates:                        wij = wij + (l) Errj Oi,    Θj = Θj + (l) Errj
Unit output and net input:      Oj = 1 / (1 + e^(−Ij)),     Ij = Σ_i wij Oi + Θj
Input vector: xi

[Figure: example network — input units 1, 2, 3 (inputs x1, x2, x3), hidden units 4 and 5, output unit 6, with weights w14, w15, w24, w25, w34, w35, w46, w56.]

Table 7.3: Initial input, weight, and bias values.

x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   Θ4    Θ5   Θ6
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

Table 7.4: The net input and output calculations.

Unit j  Net input, Ij                                 Output, Oj
4       0.2 + 0 − 0.5 − 0.4 = −0.7                    1/(1+e^0.7) = 0.332
5       −0.3 + 0 + 0.2 + 0.2 = 0.1                    1/(1+e^−0.1) = 0.525
6       (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105   1/(1+e^0.105) = 0.474
Table 7.5: Calculation of the error at each node.

Unit j  Errj
6       (0.474)(1 − 0.474)(1 − 0.474) = 0.1311
5       (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
4       (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087


Table 7.6: Calculations for weight and bias updating.

Weight or bias  New value
w46             −0.3 + (0.9)(0.1311)(0.332) = −0.261
w56             −0.2 + (0.9)(0.1311)(0.525) = −0.138
w14             0.2 + (0.9)(−0.0087)(1) = 0.192
w15             −0.3 + (0.9)(−0.0065)(1) = −0.306
w24             0.4 + (0.9)(−0.0087)(0) = 0.4
w25             0.1 + (0.9)(−0.0065)(0) = 0.1
w34             −0.5 + (0.9)(−0.0087)(1) = −0.508
w35             0.2 + (0.9)(−0.0065)(1) = 0.194
Θ6              0.1 + (0.9)(0.1311) = 0.218
Θ5              0.2 + (0.9)(−0.0065) = 0.194
Θ4              −0.4 + (0.9)(−0.0087) = −0.408
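A minimal sketch that reproduces Tables 7.4-7.6 for this network (illustrative only; the learning rate l = 0.9 and target T = 1 are taken from the worked example):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = {1: 1, 2: 0, 3: 1}                      # input tuple (x1, x2, x3) = (1, 0, 1)
w = {(1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5, (1, 5): -0.3, (2, 5): 0.1,
     (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
l, T = 0.9, 1                               # learning rate and target output

# Forward pass (Table 7.4).
O = dict(x)
for j in (4, 5, 6):
    I_j = sum(w[i, j] * O[i] for i in (1, 2, 3, 4, 5) if (i, j) in w) + theta[j]
    O[j] = sigmoid(I_j)

# Backward pass (Table 7.5).
err = {6: O[6] * (1 - O[6]) * (T - O[6])}
for j in (5, 4):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[j, 6]

# Weight and bias updates (Table 7.6).
for (i, j) in w:
    w[i, j] += l * err[j] * O[i]
for j in theta:
    theta[j] += l * err[j]

print(round(O[6], 3), round(err[6], 4), round(w[4, 6], 3), round(theta[6], 3))
# ~0.474, ~0.1311, ~-0.261, ~0.218 (matching the tables)
```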



Control Parameters for Execution
• Case updating: update the weights and biases after the presentation of each tuple
• Epoch updating: the weight and bias increments can be accumulated in variables, so that the weights and biases are updated after all tuples in the training set have been processed
• Termination conditions
  – Usually pre-determined conditions under which there is no need to proceed (e.g., little change, accurate enough, time runs out)

SVM: Support Vector Machine

Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
[Figure: a two-class data set with one possible separating hyperplane B1.]
• One possible solution
[Figure: another separating hyperplane B2.]
• Another possible solution
[Figure: further separating hyperplanes.]
• Other possible solutions
• Which one is better, B1 or B2? How do you define "better"?
[Figure: B1 and B2 with their margin boundaries b11/b12 and b21/b22; B1 has the wider margin.]
• Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines
[Figure: hyperplane B1 with margin boundaries b11 and b12.]

  Decision boundary:  w · x + b = 0
  Margin boundaries:  w · x + b = 1   and   w · x + b = −1

  f(x) =  1  if w · x + b ≥ 1
         −1  if w · x + b ≤ −1

  Margin = 2 / ||w||²
Support Vector Machines
• We want to maximize:  Margin = 2 / ||w||²
  – which is equivalent to minimizing:  L(w) = ||w||² / 2
  – but subject to the following constraints:
      f(xi) =  1  if w · xi + b ≥ 1
              −1  if w · xi + b ≤ −1
• This is a constrained optimization problem
  – Numerical approaches exist to solve it (e.g., quadratic programming)
  – Taken together, these constraints are known as the KKT (Karush-Kuhn-Tucker) conditions
  – See "Data Mining: Concepts and Techniques, 2e", Eq. 6.39 on p. 341, or "…, 3e", Eq. 9.19 on p. 412
  – Read Vipin's book: Chapter 5.5
• Try a tiny numerical example! (see the sketch below)
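A minimal sketch of such a tiny numerical example with scikit-learn (illustrative only; the four points and the large C value, used to approximate a hard margin, are made-up assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes of two points each (made up).
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                                   # learned hyperplane w·x + b = 0
print(2 / np.linalg.norm(w) ** 2)             # Margin = 2/||w||^2, as on the slide
print(clf.support_vectors_)                   # the support vectors lie on w·x + b = ±1
```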

Support Vector Machines
• What if the problem is not linearly separable?
  – Introduce slack variables ξi
  – Need to minimize:  L(w) = ||w||² / 2 + C ( Σ_{i=1}^{N} ξi )^k
  – Subject to:
      f(xi) =  1  if w · xi + b ≥ 1 − ξi
              −1  if w · xi + b ≤ −1 + ξi

Nonlinear SVM
• What if the decision boundary is not linear?
• Transform the data into a higher-dimensional space (see the sketch below)