PCA and SVD
Principal Components Analysis (PCA)
• A goal of PCA is to find a transformation of the data that satisfies the following properties:
  • Each pair of new attributes has 0 covariance (for distinct attributes).
  • The attributes are ordered with respect to how much of the variance of the data each attribute captures.
  • The first attribute captures as much of the variance of the data as possible.
  • Subject to the orthogonality requirement, each successive attribute captures as much of the remaining variance as possible.
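A minimal NumPy sketch of these properties (the random data matrix and all names are my own illustration): center the data, eigendecompose the covariance matrix, and order the eigenvectors by the variance they capture; the transformed attributes then have zero pairwise covariance.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                              [1.0, 1.0, 0.0],
                                              [0.0, 0.0, 0.1]])  # correlated attributes

    Xc = X - X.mean(axis=0)               # center each attribute
    C = np.cov(Xc, rowvar=False)          # covariance matrix of the attributes
    evals, evecs = np.linalg.eigh(C)      # eigen-decomposition (symmetric matrix)
    order = np.argsort(evals)[::-1]       # order by variance captured, descending
    evals, evecs = evals[order], evecs[:, order]

    Z = Xc @ evecs                        # the new attributes (principal components)
    # off-diagonals ~0: each pair of new attributes has 0 covariance;
    # the diagonal (the eigenvalues) decreases: variance captured is ordered
    print(np.round(np.cov(Z, rowvar=False), 4))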
Example of PCA
• Dimension reduction: keep only the 1st PC, or the 1st and 2nd PCs, since they capture most of the variance.
(Figure: data projected onto the principal components, ordered by variance.)
Singular Value Decomposition (SVD)
D = U Σ V^T    (B.2)
where the columns of U and V are orthonormal and Σ is a diagonal matrix of singular values.
PCA and SVD (cont'd)
• For mean-centered data, the right singular vectors v1, v2, v3 of the data matrix are the principal components, so truncating the SVD to the leading singular values/vectors performs dimension reduction.
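A sketch of SVD-based dimension reduction (the example data and the choice k = 2 are mine): keep only the leading right singular vectors.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 5))
    Xc = X - X.mean(axis=0)                   # center so SVD matches PCA

    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    k = 2                                     # keep the leading k singular vectors
    X_reduced = Xc @ Vt[:k].T                 # coordinates along v1, v2 (dimension reduction)
    X_approx = (U[:, :k] * s[:k]) @ Vt[:k]    # best rank-k approximation of the data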
KNN
K Nearest Neighbors
Nearest Neighbor Classifiers
Basic idea:
– If it walks like a duck and quacks like a duck, then it's probably a duck.
(Figure: compute the distance from the test record to the stored training records.)
Nearest-Neighbor Classifiers
Classifying an unknown record requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
(Figure: the 1-, 2-, and 3-nearest neighbors of a record X.)
• Euclidean distance:  d(p, q) = √( Σ_i (p_i − q_i)² )
Scaling issues
– Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes
– Example (see the sketch below):
  height of a person may vary from 1.5 m to 1.8 m,
  weight of a person may vary from 90 lb to 300 lb
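A self-contained sketch (the toy height/weight data and all names are mine) showing min-max scaling before the Euclidean-distance k-NN majority vote:

    import numpy as np

    def knn_predict(X_train, y_train, x_test, k=3):
        # Euclidean distance d(p, q) = sqrt(sum_i (p_i - q_i)^2)
        d = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
        nearest = np.argsort(d)[:k]                    # indices of the k nearest records
        return np.bincount(y_train[nearest]).argmax()  # majority vote

    # height (m), weight (lb): unscaled, weight would dominate the distance
    X = np.array([[1.50, 90.0], [1.60, 120.0], [1.80, 250.0], [1.75, 300.0]])
    y = np.array([0, 0, 1, 1])

    lo, hi = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - lo) / (hi - lo)                    # min-max scale each attribute to [0, 1]
    x_new = (np.array([1.70, 200.0]) - lo) / (hi - lo)
    print(knn_predict(X_scaled, y, x_new, k=3))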
Bayes Classifiers
Bayes Classifier
• Bayes theorem:  P(C | A) = P(A | C) P(C) / P(A)
• Given:
  – A doctor knows that meningitis causes stiff neck 50% of the time
  – Prior probability of any patient having meningitis is 1/50,000
  – Prior probability of any patient having stiff neck is 1/20
• If a patient has a stiff neck, what is the probability he/she has meningitis?
  P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002
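The same arithmetic as a quick check in code (the values are from the slide):

    p_s_given_m = 0.5          # P(stiff neck | meningitis)
    p_m = 1 / 50_000           # prior P(meningitis)
    p_s = 1 / 20               # prior P(stiff neck)

    # Bayes theorem: P(M | S) = P(S | M) P(M) / P(S)
    print(p_s_given_m * p_m / p_s)   # 0.0002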
Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem:
      P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
  – Choose the value of C that maximizes P(C | A1, A2, …, An)
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class prior: P(C) = Nc/N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0
• For continuous attributes, assume a normal distribution; for (Income, Evade = No) the sample mean is 110 and the sample variance is 2975 (σ ≈ 54.54), so
  P(Income = 120 | No) = 1 / (√(2π) × 54.54) × exp( −(120 − 110)² / (2 × 2975) ) = 0.0072
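A quick numeric check of that density (the mean 110 and variance 2975 are the sample statistics used above):

    import math

    mu, var = 110.0, 2975.0    # sample mean and variance of Income for class No
    x = 120.0
    p = math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
    print(round(p, 4))         # 0.0072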
• Probability estimation:
  Original:    P(Ai | C) = N_ic / N_c
  Laplace:     P(Ai | C) = (N_ic + 1) / (N_c + c)
  m-estimate:  P(Ai | C) = (N_ic + m p) / (N_c + m)
  – c: number of classes
  – p: prior probability
  – m: parameter
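The three estimates as small functions, applied to the P(Refund=Yes | Yes) = 0 case from the earlier table (the concrete arguments m = 3 and p = 0.5 are my own example values):

    def original(n_ic, n_c):
        return n_ic / n_c

    def laplace(n_ic, n_c, c):
        return (n_ic + 1) / (n_c + c)         # c: number of classes

    def m_estimate(n_ic, n_c, m, p):
        return (n_ic + m * p) / (n_c + m)     # p: prior probability, m: parameter

    # P(Refund=Yes | Yes) = 0/3 from the table; smoothing makes it nonzero
    print(original(0, 3), laplace(0, 3, 2), m_estimate(0, 3, 3, 0.5))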
Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

A: attributes of the test record (Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no)
M: mammals, N: non-mammals

P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
P(A | M) P(M) = 0.06 × 7/20 = 0.021
P(A | N) P(N) = 0.0042 × 13/20 = 0.0027
P(A | M) P(M) > P(A | N) P(N) ⇒ classify the record as a mammal
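A sketch reproducing exactly the naive Bayes arithmetic above:

    from math import prod

    # test record A: Give Birth = yes, Can Fly = no, Live in Water = yes, Have Legs = no
    p_a_m = prod([6/7, 6/7, 2/7, 2/7])        # P(A | M) = 0.06
    p_a_n = prod([1/13, 10/13, 3/13, 4/13])   # P(A | N) = 0.0042
    print(p_a_m * 7/20)                       # P(A | M) P(M) = 0.021
    print(p_a_n * 13/20)                      # P(A | N) P(N) = 0.0027  ->  mammal wins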
Neural Networks
Artificial Neural Networks (ANN)
(Figure: a black box with inputs X1, X2, X3 and output Y.)

X1  X2  X3 | Y
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 1
 1   1   1 | 1
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 0   0   0 | 0

The output Y is 1 when at least two of the three inputs equal 1.
Artificial Neural Networks (ANN)
(Figure: input nodes X1, X2, X3 with link weights 0.3, 0.3, 0.3 feeding an output node Y with threshold t = 0.4.)

X1  X2  X3 | Y
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 1
 1   1   1 | 1
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 0   0   0 | 0

Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0), where I(z) = 1 if z is true and 0 otherwise.
• The model is an assembly of interconnected nodes and weighted links; the output node sums up each of its input values according to the weights of its links.
Perceptron Model
• Compare the weighted sum at the output node against some threshold t:
  Y = sign( Σ_i w_i X_i − t ),  or
  I_j = Σ_i w_ij O_i + θ_j,  O_j = 1 / (1 + e^(−I_j))
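A sketch checking the perceptron with weights 0.3 and threshold t = 0.4 against the truth table above:

    import numpy as np

    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1],
                  [0, 0, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]])
    w, t = np.array([0.3, 0.3, 0.3]), 0.4

    Y = (X @ w - t > 0).astype(int)   # Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 - 0.4 > 0)
    print(Y)                          # [0 1 1 1 0 0 1 0] -- matches the table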
Structure of NN
(Figure: structure of a multilayer neural network; each node applies a threshold t.)
Algorithm for learning ANN
• Initialize the weights (w0, w1, …, wk)
• Adjust the weights so that the output of the ANN is consistent with the class labels of the training examples
  – Objective function: E = Σ_i [ Y_i − f(w_i, X_i) ]²
  – Find the weights w_i's that minimize the above objective function
  – e.g., the backpropagation algorithm
Backpropagation
I_j = Σ_i w_ij O_i + θ_j;  // compute the net input of unit j with respect to the previous layer, i
Backpropagation (cont'd)
For each weight w_ij in the network:
  Δw_ij = (l) Err_j O_i;  // weight increment; (l) is the learning rate
  w_ij = w_ij + Δw_ij;    // weight update
For each bias θ_j in the network:
  Δθ_j = (l) Err_j;       // bias increment
  θ_j = θ_j + Δθ_j;       // bias update
MultiLayer Perceptron

Output vector
Output nodes:  Err_j = O_j (1 − O_j) (T_j − O_j)
               θ_j = θ_j + (l) Err_j
               w_ij = w_ij + (l) Err_j O_i
Hidden nodes:  Err_j = O_j (1 − O_j) Σ_k Err_k w_jk
               O_j = 1 / (1 + e^(−I_j))
               I_j = Σ_i w_ij O_i + θ_j
Input vector: x_i

(Figure: feed-forward network with input nodes 1–3 taking x1, x2, x3; hidden nodes 4 and 5; output node 6; link weights w14, w15, w24, w25, w34, w35, w46, w56.)
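A sketch of a single backpropagation pass over the network in the figure, implementing exactly the update rules listed above (the initial weights, biases, input, target, and learning rate are hypothetical):

    import numpy as np

    def sigmoid(I):
        # O_j = 1 / (1 + e^{-I_j})
        return 1.0 / (1.0 + np.exp(-I))

    # network from the figure: inputs x1..x3 (nodes 1-3), hidden nodes 4-5, output node 6
    x = np.array([1.0, 0.0, 1.0])            # input vector (hypothetical)
    W_h = np.array([[0.2, -0.3],             # w14, w15
                    [0.4, 0.1],              # w24, w25
                    [-0.5, 0.2]])            # w34, w35
    W_o = np.array([-0.3, -0.2])             # w46, w56
    th_h = np.array([-0.4, 0.2])             # biases theta_4, theta_5
    th_o = 0.1                               # bias theta_6
    T, l = 1.0, 0.9                          # target output and learning rate (hypothetical)

    # forward pass: I_j = sum_i w_ij O_i + theta_j
    O_h = sigmoid(x @ W_h + th_h)            # outputs of hidden nodes 4 and 5
    O_o = sigmoid(O_h @ W_o + th_o)          # output of node 6

    # backward pass
    err_o = O_o * (1 - O_o) * (T - O_o)      # output node: Err_j = O_j(1-O_j)(T_j-O_j)
    err_h = O_h * (1 - O_h) * (err_o * W_o)  # hidden: Err_j = O_j(1-O_j) sum_k Err_k w_jk

    # updates: w_ij += (l) Err_j O_i, theta_j += (l) Err_j
    W_o = W_o + l * err_o * O_h
    th_o = th_o + l * err_o
    W_h = W_h + l * np.outer(x, err_h)
    th_h = th_h + l * err_h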
SVM
Support Vector Machine
Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
Support Vector Machines
(Figure: one possible solution, decision boundary B1.)
(Figure: another possible solution, decision boundary B2.)
(Figure: other possible solutions.)
• Which one is better, B1 or B2?
• How do you define better?
Support Vector Machines
(Figure: boundaries B1 and B2 with their margins: B1's margin edges are b11 and b12, B2's are b21 and b22.)
• Find the hyperplane that maximizes the margin ⇒ B1 is better than B2
Support Vector Machines
(Figure: boundary B1, w · x + b = 0, with margin hyperplanes w · x + b = +1 (through b11) and w · x + b = −1 (through b12).)

f(x) = +1 if w · x + b ≥ 1
       −1 if w · x + b ≤ −1

Margin = 2 / ‖w‖²
Support Vector Machines
• We want to maximize:  Margin = 2 / ‖w‖²
  – Which is equivalent to minimizing:  L(w) = ‖w‖² / 2
  – But subject to the following constraints:
      f(x_i) = +1 if w · x_i + b ≥ 1
               −1 if w · x_i + b ≤ −1
• This is a constrained optimization problem
  – Numerical approaches to solve it (e.g., quadratic programming)
Taken together, these constraints are called the KKT (Karush-Kuhn-Tucker) conditions.
See "Data Mining: Concepts and Techniques, 2e," Eq. 6.39, p. 341, and "…, 3e," Eq. 9.19, p. 412.
Support Vector Machines
• What if the problem is not linearly separable?
– Introduce slack variables ξ_i
• Need to minimize:  L(w) = ‖w‖² / 2 + C Σ_{i=1..N} ξ_i^k
• Subject to:
    f(x_i) = +1 if w · x_i + b ≥ 1 − ξ_i
             −1 if w · x_i + b ≤ −1 + ξ_i
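A sketch of this soft-margin objective on toy data; sub-gradient descent on L(w) = ‖w‖²/2 + C Σ_i max(0, 1 − y_i (w · x_i + b)) (i.e., k = 1, with ξ_i = max(0, 1 − y_i (w · x_i + b))) is my stand-in for the quadratic-programming solvers mentioned above:

    import numpy as np

    rng = np.random.default_rng(2)
    X = np.vstack([rng.normal([2.0, 2.0], 1.0, (20, 2)),     # class +1 cluster
                   rng.normal([-2.0, -2.0], 1.0, (20, 2))])  # class -1 cluster
    y = np.array([1] * 20 + [-1] * 20)

    w, b = np.zeros(2), 0.0
    C, lr = 1.0, 0.01
    for _ in range(1000):
        margins = y * (X @ w + b)
        viol = margins < 1                    # records with nonzero slack xi_i
        # sub-gradient of ||w||^2/2 + C * sum_i max(0, 1 - y_i (w . x_i + b))
        grad_w = w - C * (y[viol][:, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w, b = w - lr * grad_w, b - lr * grad_b

    print(w, b)   # parameters of the learned boundary w . x + b = 0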
Nonlinear SVM
• What if the decision boundary is not linear?
Nonlinear SVM
• Transform the data into a higher-dimensional space
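A sketch of the idea: 1-D data that no single threshold separates becomes linearly separable after the mapping x → (x, x²) (the mapping and data are my illustrative choices):

    import numpy as np

    x = np.array([-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0])
    y = np.array([1, 1, -1, -1, -1, 1, 1])   # class depends on |x|: no 1-D threshold works

    Phi = np.column_stack([x, x ** 2])       # transform into a higher-dimensional space

    # in the transformed space the linear boundary x^2 - 2 = 0 separates the classes
    w, b = np.array([0.0, 1.0]), -2.0
    print(np.sign(Phi @ w + b))              # matches y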