
PCA/SVD, KNN, Bayes Theorem, Neural Networks, and the Concept of SVM

November 13, 2018

Agenda for Classes 2018

(Might be revised as we progress)
• Class 1 – (9/11) Overview of data mining
• Class 2 – (9/18) R (I), Wush Wu
• Class 3 – (9/25) Association, Apriori and its related issues
• Class 4 – (10/2) Data stream mining, FP-tree, vertical mining (announcing HW1)
• Class 5 – (10/9) Classification: decision tree, GPGPU
• Class 6 – (10/16) Description of data (announcing HW2)
• Class 7 – (10/23) R (II), Wush Wu (announcing HW3, in R)
Tentative Class Agenda (cont'd)
• Class 8 – (10/30) Data exploration, more on decision trees, rule-based classifiers, project announcement
• Class 9 – (11/6) Scikit-learn, LibSVM, preparation for HW4 and HW5
• Class 10 – (11/13) PCA/SVD, KNN, Bayes, neural networks, concept of SVM
• Class 11 – (11/20) Go over abstracts (due 11/18), SVM, clustering, K-means, PAM
• Class 12 – (11/27) More on clustering; sequential pattern mining
• Class 13 – (12/4) Web mining, PageRank, etc.

Tentative Class Agenda (cont'd)
• Class 14 – (12/11) When Database and Data Mining Meet, Prof. Mingling Lo
• Class 15 – (12/18) Project presentation I
• Class 16 – (12/25) Project presentation II
(Final exam according to university schedule)
(Project due 1/24/2019)
• Happy New Year!

PCA and SVD

Principal Components Analysis (PCA)
• A goal of PCA is to find a transformation of the data that satisfies the following properties (see the sketch below):
  – Each pair of new attributes has 0 covariance (for distinct attributes).
  – The attributes are ordered with respect to how much of the variance of the data each attribute captures.
  – The first attribute captures as much of the variance of the data as possible.
  – Subject to the orthogonality requirement, each successive attribute captures as much of the remaining variance as possible.
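As an illustration only (not part of the original slides), here is a minimal scikit-learn sketch of these properties; the data matrix X, the random seed, and the number of components are made-up assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up 2-D data with correlated attributes (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ np.array([[3.0, 1.0], [1.0, 0.5]])

pca = PCA(n_components=2)          # centers the data (mean of each attribute becomes 0)
Z = pca.fit_transform(X)           # new attributes = principal components

print(np.cov(Z, rowvar=False))     # (near-)diagonal: distinct components have ~0 covariance
print(pca.explained_variance_)     # sorted: the 1st component captures the most variance
```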
Example of PCA
[Figure: scatter plot of the data with the 1st and 2nd principal components overlaid. Preprocessing: the mean of each attribute is shifted to 0. Dimension reduction keeps the 1st PC only, or the 1st and 2nd PCs, ordered by captured variance.]
Singular Value Decomposition (SVD)
[Slides: the SVD factorization of a data matrix D (n×p), written D = U Σ Vᵀ (Eq. B.2 in Tan et al.), its relationship to PCA, and a worked example with n = 3 rows in D showing the right singular vectors v1, v2, v3 and how truncating to the leading singular values performs dimension reduction.]
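A minimal numpy sketch of truncated SVD for dimension reduction (illustrative only; the matrix D and the choice of keeping one component are assumptions):

```python
import numpy as np

# Made-up data matrix D with n = 3 rows and p = 4 attributes.
D = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.1, 6.0, 8.2],
              [0.5, 1.0, 1.6, 2.0]])

U, s, Vt = np.linalg.svd(D, full_matrices=False)   # D = U @ diag(s) @ Vt

k = 1                                              # keep only the largest singular value
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation of D

print(s)                                           # singular values, largest first
print(np.linalg.norm(D - D_k))                     # reconstruction error after reduction
```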

KNN: K Nearest Neighbors

Nearest Neighbor Classifiers
• Basic idea:
  – If it walks like a duck and quacks like a duck, then it's probably a duck
[Figure: compute the distance from a test record to the training records, then choose the k "nearest" records.]
(Slides in this part: © Tan, Steinbach, Kumar, Introduction to Data Mining.)

Nearest-Neighbor Classifiers
[Figure: an unknown record surrounded by labeled training records.]
• Requires three things:
  – The set of stored records
  – A distance metric to compute the distance between records
  – The value of k, the number of nearest neighbors to retrieve
• To classify an unknown record:
  – Compute its distance to the training records
  – Identify the k nearest neighbors
  – Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)



Definition of Nearest Neighbor
[Figure: three panels around a test point X — (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor.]
• The k-nearest neighbors of a record x are the data points that have the k smallest distances to x


Nearest Neighbor Classification
• Compute the distance between two points:
  – Euclidean distance: d(p, q) = √( Σ_i (p_i − q_i)² )
• Determine the class from the nearest-neighbor list:
  – Take the majority vote of class labels among the k nearest neighbors
  – Optionally weight each vote according to distance, e.g., weight factor w = 1/d² (see the sketch below)
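A minimal sketch of distance-weighted k-NN voting (illustrative only; the training points, labels, and query are made up):

```python
import numpy as np
from collections import defaultdict

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by a distance-weighted majority vote of its k nearest neighbors."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))      # Euclidean distances
    nearest = np.argsort(d)[:k]                        # indices of the k smallest distances
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (d[i] ** 2 + 1e-12) # weight factor w = 1/d^2
    return max(votes, key=votes.get)

# Made-up toy data.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["duck", "duck", "goose", "goose"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))   # -> "duck"
```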
Nearest Neighbor Classification…

 Choosing the value of k:


– If k is too small, sensitive to noise points
– If k is too large, neighborhood may include points from
other classes


Nearest Neighbor Classification…

• Scaling issues
  – Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes (see the sketch below)
  – Example:
     the height of a person may vary from 1.5 m to 1.8 m
     the weight of a person may vary from 90 lb to 300 lb
     the income of a person may vary from $10K to $1M
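A minimal sketch of rescaling attributes before computing distances (illustrative only; scikit-learn's StandardScaler is one common choice, and the records are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up records: [height (m), weight (lb), income ($)].
X = np.array([[1.6, 120.0,  30_000.0],
              [1.8, 250.0, 900_000.0],
              [1.5,  95.0,  15_000.0]])

X_scaled = StandardScaler().fit_transform(X)   # each attribute -> mean 0, unit variance

# Without scaling, income dominates the Euclidean distance; after scaling it does not.
print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X_scaled[0] - X_scaled[1]))
```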



Nearest Neighbor Classification…
• k-NN classifiers are lazy learners
  – They do not build a model explicitly
  – Unlike eager learners such as decision tree induction and rule-based systems
  – Classifying unknown records is relatively expensive


Bayes Classifiers

Bayes Classifier
• A probabilistic framework for solving classification problems
• Conditional probability:
    P(C | A) = P(A, C) / P(A)
    P(A | C) = P(A, C) / P(C)
• Bayes theorem:
    P(C | A) = P(A | C) P(C) / P(A)

Example of Bayes Theorem
• Given:
  – A doctor knows that meningitis causes a stiff neck 50% of the time
  – The prior probability of any patient having meningitis is 1/50,000
  – The prior probability of any patient having a stiff neck is 1/20
• If a patient has a stiff neck, what is the probability he/she has meningitis?
    P(M | S) = P(S | M) P(M) / P(S) = (0.5 × 1/50,000) / (1/20) = 0.0002



Another Example
• Suppose we have 3 round apples (2 red, 1 green), 2 bananas, 3 litchis (red and round), and 4 mangoes
• X: a red and round object; H: being an apple
• P(X) = 5/12, P(H) = 3/12, P(X|H) = 2/3
• Then P(H|X) = P(X|H) P(H) / P(X) = (2/3)(3/12)/(5/12) = 2/5
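A quick check of both worked examples in Python (illustrative only; the helper name is not from the slides):

```python
def bayes_posterior(p_x_given_h, p_h, p_x):
    """Posterior P(H|X) = P(X|H) P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

print(bayes_posterior(0.5, 1 / 50_000, 1 / 20))   # meningitis example -> 0.0002
print(bayes_posterior(2 / 3, 3 / 12, 5 / 12))     # fruit example      -> 0.4 (= 2/5)
```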


Bayesian Classifiers
• Consider each attribute and the class label as random variables
• Given a record with attributes (A1, A2, …, An):
  – Goal is to predict class C
  – Specifically, we want to find the value of C that maximizes P(C | A1, A2, …, An)
• Can we estimate P(C | A1, A2, …, An) directly from data?



Bayesian Classifiers
• Approach:
  – Compute the posterior probability P(C | A1, A2, …, An) for all values of C using the Bayes theorem:
      P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An)
  – Choose the value of C that maximizes P(C | A1, A2, …, An)
  – Equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C)
• How do we estimate P(A1, A2, …, An | C)?



Naïve Bayes Classifier
• Assume independence among the attributes Ai when the class is given:
  – P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj)
  – Can estimate P(Ai | Cj) for all Ai and Cj
  – A new point is classified as Cj if P(Cj) Π_i P(Ai | Cj) is maximal



How to Estimate Probabilities from Data?

Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

• Class: P(C) = Nc / N
  – e.g., P(No) = 7/10, P(Yes) = 3/10
• For discrete attributes: P(Ai | Ck) = |Aik| / Nc
  – where |Aik| is the number of instances having attribute value Ai and belonging to class Ck
  – Examples: P(Status=Married | No) = 4/7, P(Refund=Yes | Yes) = 0

How to Estimate Probabilities from Data?
• For continuous attributes:
  – Discretize the range into bins
     one ordinal attribute per bin
     violates the independence assumption
  – Two-way split: (A < v) or (A > v)
     choose only one of the two splits as the new attribute
  – Probability density estimation:
     assume the attribute follows a normal distribution
     use the data to estimate the parameters of the distribution (e.g., mean and standard deviation)
     once the probability distribution is known, it can be used to estimate the conditional probability P(Ai | c)
How to Estimate Probabilities from Data?
[Same training table as above.]
• Normal distribution:
    P(Ai | cj) = (1 / √(2π σij²)) · exp( −(Ai − μij)² / (2 σij²) )
  – One for each (Ai, cj) pair
• For (Income, Class=No):
  – If Class=No: sample mean = 110, sample variance = 2975
    P(Income=120 | No) = (1 / (√(2π) · 54.54)) · exp( −(120 − 110)² / (2 · 2975) ) = 0.0072
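A minimal sketch of the Gaussian estimate used above (illustrative only; the mean, variance, and query value come from the worked example, and the helper name is not from the slides):

```python
import math

def gaussian_likelihood(x, mean, var):
    """P(Ai = x | class), assuming the attribute is normally distributed within the class."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Income given Class=No: sample mean = 110, sample variance = 2975 (from the table above).
print(gaussian_likelihood(120, 110, 2975))   # ~0.0072
```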

Example of Naïve Bayes Classifier

Given a test record: X = (Refund=No, Marital Status=Married, Income=120K)

Naïve Bayes classifier (estimated from the training table):
  P(Refund=Yes | No) = 3/7        P(Refund=No | No) = 4/7
  P(Refund=Yes | Yes) = 0         P(Refund=No | Yes) = 1
  P(Marital Status=Single | No) = 2/7    P(Marital Status=Divorced | No) = 1/7    P(Marital Status=Married | No) = 4/7
  P(Marital Status=Single | Yes) = 2/7   P(Marital Status=Divorced | Yes) = 1/7   P(Marital Status=Married | Yes) = 0
  For taxable income:
    If class=No:  sample mean = 110, sample variance = 2975
    If class=Yes: sample mean = 90,  sample variance = 25

• P(X | Class=No) = P(Refund=No | No) × P(Married | No) × P(Income=120K | No)
                  = 4/7 × 4/7 × 0.0072 = 0.0024
• P(X | Class=Yes) = P(Refund=No | Yes) × P(Married | Yes) × P(Income=120K | Yes)
                   = 1 × 0 × 1.2×10⁻⁹ = 0

Since P(X | No) P(No) > P(X | Yes) P(Yes), it follows that P(No | X) > P(Yes | X)  =>  Class = No
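A minimal sketch reproducing this calculation (illustrative only; the probabilities are copied from the slide above):

```python
import math

def gaussian(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Class-conditional estimates from the training table.
p_no, p_yes = 7 / 10, 3 / 10
p_x_no  = (4 / 7) * (4 / 7) * gaussian(120, 110, 2975)   # Refund=No, Married, Income=120K
p_x_yes = 1 * 0 * gaussian(120, 90, 25)                  # Married never occurs with Yes -> 0

print(p_x_no, p_x_yes)                                   # ~0.0024 and 0
print("No" if p_x_no * p_no > p_x_yes * p_yes else "Yes")  # -> "No"
```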


Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire expression becomes zero
• Probability estimation:
    Original:    P(Ai | C) = Nic / Nc
    Laplace:     P(Ai | C) = (Nic + 1) / (Nc + c)
    m-estimate:  P(Ai | C) = (Nic + m·p) / (Nc + m)
  where c: number of classes, p: prior probability, m: parameter
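A minimal sketch of these three estimators (illustrative only; the function names are not from the slides, and the example counts are taken from the training table above):

```python
def original_estimate(n_ic, n_c):
    return n_ic / n_c

def laplace_estimate(n_ic, n_c, c):
    # c: number of classes (as defined on the slide)
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    # m: parameter, p: prior probability
    return (n_ic + m * p) / (n_c + m)

# E.g., P(Refund=Yes | Yes) from the table: Nic = 0, Nc = 3.
print(original_estimate(0, 3))          # 0.0 -> wipes out the whole product
print(laplace_estimate(0, 3, c=2))      # 0.2
print(m_estimate(0, 3, m=3, p=0.5))     # 0.25
```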

Example of Naïve Bayes Classifier

Name           Give Birth  Can Fly  Live in Water  Have Legs  Class
human          yes         no       no             yes        mammals
python         no          no       no             no         non-mammals
salmon         no          no       yes            no         non-mammals
whale          yes         no       yes            no         mammals
frog           no          no       sometimes      yes        non-mammals
komodo         no          no       no             yes        non-mammals
bat            yes         yes      no             yes        mammals
pigeon         no          yes      no             yes        non-mammals
cat            yes         no       no             yes        mammals
leopard shark  yes         no       yes            no         non-mammals
turtle         no          no       sometimes      yes        non-mammals
penguin        no          no       sometimes      yes        non-mammals
porcupine      yes         no       no             yes        mammals
eel            no          no       yes            no         non-mammals
salamander     no          no       sometimes      yes        non-mammals
gila monster   no          no       no             yes        non-mammals
platypus       no          no       no             yes        mammals
owl            no          yes      no             yes        non-mammals
dolphin        yes         no       yes            no         mammals
eagle          no          yes      no             yes        non-mammals

A: attributes of the test record, M: mammals, N: non-mammals

  P(A | M) = 6/7 × 6/7 × 2/7 × 2/7 = 0.06
  P(A | N) = 1/13 × 10/13 × 3/13 × 4/13 = 0.0042
  P(A | M) P(M) = 0.06 × 7/20 = 0.021
  P(A | N) P(N) = 0.004 × 13/20 = 0.0027

Test record:
  Give Birth  Can Fly  Live in Water  Have Legs  Class
  yes         no       yes            no         ?

Since P(A | M) P(M) > P(A | N) P(N)  =>  Mammals


Naïve Bayes (Summary)

 Robust to isolated noise points


 Handle missing values by ignoring the instance
during probability estimate calculations
 Robust to irrelevant attributes
 Independence assumption may not hold for some
attributes
 Will the testing outcome for an input tuple always
be the same as that from a decision tree built on
the same training set?
 Will an input tuple be classified based on the
majority of those tuples with same attribute
values?

Neural Networks

Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: a black box with inputs X1, X2, X3 and output Y.]

Output Y is 1 if at least two of the three inputs are equal to 1.

Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: input nodes X1, X2, X3 with weights 0.3 each feeding an output node with threshold t = 0.4.]

  Y = I(0.3 X1 + 0.3 X2 + 0.3 X3 − 0.4 > 0),
  where I(z) = 1 if z is true, and 0 otherwise.
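A minimal sketch that checks this unit against the truth table above (illustrative only):

```python
from itertools import product

def ann_output(x1, x2, x3):
    # Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0): fires iff at least two inputs are 1.
    return int(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

for x1, x2, x3 in product([0, 1], repeat=3):
    print(x1, x2, x3, "->", ann_output(x1, x2, x3))
```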
Artificial Neural Networks (ANN)
• The model is an assembly of interconnected nodes and weighted links
[Figure: perceptron model — input nodes X1, X2, X3 with weights w1, w2, w3 feeding an output node Y with threshold t.]
• The output node sums up its input values according to the weights of its links
• Compare the output node against some threshold t:
    Y = sign( Σ_i wi Xi − t )
  or, with a sigmoid unit:
    Ij = Σ_i wij Oi + Θj,    Oj = 1 / (1 + e^(−Ij))

Structure of NN
[Figure: neuron i receives inputs I1, I2, I3 over weights wi1, wi2, wi3, forms the weighted sum Si, and an activation function g(Si) with threshold t produces the output Oi.]
• Training an NN means learning the weights of the neurons

Deep Neural Network
[Figure: a neural network with multiple hidden layers.]

Algorithm for Learning ANN
• Initialize the weights (w0, w1, …, wk)
• Adjust the weights in such a way that the output of the ANN is consistent with the class labels of the training examples
  – Objective function: E = Σ_i [ Yi − f(wi, Xi) ]²
  – Find the weights wi that minimize the above objective function
  – e.g., the backpropagation algorithm
Backpropagation

  Ij = Σ_i wij Oi + Θj        // compute the net input of unit j with respect to the previous layer, i
  Oj = 1 / (1 + e^(−Ij))      // SIGMOID: compute the output of each unit j

  For each unit j in the output layer:
      Errj = Oj (1 − Oj) (Tj − Oj)          // compute the error

  For each unit j in the hidden layers:
      Errj = Oj (1 − Oj) Σ_k Errk wjk       // compute the error with respect to the next higher layer, k

Backpropagation (cont'd)

  For each weight wij in the network:
      Δwij = (l) Errj Oi          // weight increment, l is the learning rate
      wij = wij + Δwij            // weight update

  For each bias Θj in the network:
      ΔΘj = (l) Errj              // bias increment
      Θj = Θj + ΔΘj               // bias update



Learning Rate
• Learning rate: a constant typically having a value between 0.0 and 1.0
• Helps avoid getting stuck at a local minimum in the decision space
• A rule of thumb is to set the learning rate to 1/t, where t is the number of iterations through the training set so far


Multi-Layer Perceptron

Output vector / output nodes:   Errj = Oj (1 − Oj) (Tj − Oj)
Hidden nodes:                   Errj = Oj (1 − Oj) Σ_k Errk wjk
Updates:                        wij = wij + (l) Errj Oi,    Θj = Θj + (l) Errj
Unit output and net input:      Oj = 1 / (1 + e^(−Ij)),     Ij = Σ_i wij Oi + Θj
Input vector: xi

[Figure: example network — input units 1, 2, 3 (inputs x1, x2, x3), hidden units 4 and 5, output unit 6, with weights w14, w15, w24, w25, w34, w35, w46, w56.]

Table 7.3: Initial input, weight, and bias values.

x1  x2  x3  w14  w15   w24  w25  w34   w35  w46   w56   Θ4    Θ5   Θ6
1   0   1   0.2  -0.3  0.4  0.1  -0.5  0.2  -0.3  -0.2  -0.4  0.2  0.1

Table 7.4: The net input and output calculations.

Unit j  Net input, Ij                                 Output, Oj
4       0.2 + 0 − 0.5 − 0.4 = −0.7                    1/(1+e^0.7) = 0.332
5       −0.3 + 0 + 0.2 + 0.2 = 0.1                    1/(1+e^−0.1) = 0.525
6       (−0.3)(0.332) − (0.2)(0.525) + 0.1 = −0.105   1/(1+e^0.105) = 0.474
Table 7.5: Calculation of the error at each node.

Unit j  Errj
6       (0.474)(1 − 0.474)(1 − 0.474) = 0.1311
5       (0.525)(1 − 0.525)(0.1311)(−0.2) = −0.0065
4       (0.332)(1 − 0.332)(0.1311)(−0.3) = −0.0087


Table 7.6: Calculations for weight and bias updating.

Weight or bias  New value
w46             −0.3 + (0.9)(0.1311)(0.332) = −0.261
w56             −0.2 + (0.9)(0.1311)(0.525) = −0.138
w14             0.2 + (0.9)(−0.0087)(1) = 0.192
w15             −0.3 + (0.9)(−0.0065)(1) = −0.306
w24             0.4 + (0.9)(−0.0087)(0) = 0.4
w25             0.1 + (0.9)(−0.0065)(0) = 0.1
w34             −0.5 + (0.9)(−0.0087)(1) = −0.508
w35             0.2 + (0.9)(−0.0065)(1) = 0.194
Θ6              0.1 + (0.9)(0.1311) = 0.218
Θ5              0.2 + (0.9)(−0.0065) = 0.194
Θ4              −0.4 + (0.9)(−0.0087) = −0.408
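A minimal sketch that reproduces Tables 7.4-7.6 for this network (illustrative only; the learning rate l = 0.9 and target T = 1 are taken from the worked example):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = {1: 1, 2: 0, 3: 1}                      # input tuple (x1, x2, x3) = (1, 0, 1)
w = {(1, 4): 0.2, (2, 4): 0.4, (3, 4): -0.5, (1, 5): -0.3, (2, 5): 0.1,
     (3, 5): 0.2, (4, 6): -0.3, (5, 6): -0.2}
theta = {4: -0.4, 5: 0.2, 6: 0.1}
l, T = 0.9, 1                               # learning rate and target output

# Forward pass (Table 7.4).
O = dict(x)
for j in (4, 5, 6):
    I_j = sum(w[i, j] * O[i] for i in (1, 2, 3, 4, 5) if (i, j) in w) + theta[j]
    O[j] = sigmoid(I_j)

# Backward pass (Table 7.5).
err = {6: O[6] * (1 - O[6]) * (T - O[6])}
for j in (5, 4):
    err[j] = O[j] * (1 - O[j]) * err[6] * w[j, 6]

# Weight and bias updates (Table 7.6).
for (i, j) in w:
    w[i, j] += l * err[j] * O[i]
for j in theta:
    theta[j] += l * err[j]

print(round(O[6], 3), round(err[6], 4), round(w[4, 6], 3), round(theta[6], 3))
# ~0.474, ~0.1311, ~-0.261, ~0.218 (matching the tables)
```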



Control Parameters for Execution
• Case updating: update the weights and biases after the presentation of each tuple
• Epoch updating: the weight and bias increments can be accumulated in variables, so that the weights and biases are updated after all tuples in the training set have been processed
• Termination conditions
  – Usually pre-determined conditions under which there is no need to proceed (e.g., little change, accurate enough, time runs out)

SVM: Support Vector Machine

Support Vector Machines
• Find a linear hyperplane (decision boundary) that will separate the data
[Figure: a two-class data set with one possible separating hyperplane B1.]
• One possible solution
[Figure: another separating hyperplane B2.]
• Another possible solution
[Figure: further separating hyperplanes.]
• Other possible solutions
• Which one is better, B1 or B2? How do you define "better"?
[Figure: B1 and B2 with their margin boundaries b11/b12 and b21/b22; B1 has the wider margin.]
• Find the hyperplane that maximizes the margin => B1 is better than B2
Support Vector Machines
[Figure: hyperplane B1 with margin boundaries b11 and b12.]

  Decision boundary:  w · x + b = 0
  Margin boundaries:  w · x + b = 1   and   w · x + b = −1

  f(x) =  1  if w · x + b ≥ 1
         −1  if w · x + b ≤ −1

  Margin = 2 / ||w||²
Support Vector Machines
• We want to maximize:  Margin = 2 / ||w||²
  – which is equivalent to minimizing:  L(w) = ||w||² / 2
  – but subject to the following constraints:
      f(xi) =  1  if w · xi + b ≥ 1
              −1  if w · xi + b ≤ −1
• This is a constrained optimization problem
  – Numerical approaches exist to solve it (e.g., quadratic programming)
  – Taken together, these constraints are known as the KKT (Karush-Kuhn-Tucker) conditions
  – See "Data Mining: Concepts and Techniques, 2e", Eq. 6.39 on p. 341, or "…, 3e", Eq. 9.19 on p. 412
  – Read Vipin's book: Chapter 5.5
• Try a tiny numerical example! (see the sketch below)
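A minimal sketch of such a tiny numerical example with scikit-learn (illustrative only; the four points and the large C value, used to approximate a hard margin, are made-up assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes of two points each (made up).
X = np.array([[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                                   # learned hyperplane w·x + b = 0
print(2 / np.linalg.norm(w) ** 2)             # Margin = 2/||w||^2, as on the slide
print(clf.support_vectors_)                   # the support vectors lie on w·x + b = ±1
```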

Support Vector Machines
• What if the problem is not linearly separable?
  – Introduce slack variables ξi
  – Need to minimize:  L(w) = ||w||² / 2 + C ( Σ_{i=1}^{N} ξi )^k
  – Subject to:
      f(xi) =  1  if w · xi + b ≥ 1 − ξi
              −1  if w · xi + b ≤ −1 + ξi

Nonlinear SVM
• What if the decision boundary is not linear?
• Transform the data into a higher-dimensional space (see the sketch below)