
Lecture Slides for

INTRODUCTION TO MACHINE LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 5:

MULTIVARIATE METHODS
Multivariate Data

- Multiple measurements (sensors)
- d inputs/features/attributes: d-variate
- N instances/observations/examples

$$\mathbf{X} =
\begin{bmatrix}
X_1^1 & X_2^1 & \cdots & X_d^1 \\
X_1^2 & X_2^2 & \cdots & X_d^2 \\
\vdots & \vdots & \ddots & \vdots \\
X_1^N & X_2^N & \cdots & X_d^N
\end{bmatrix}$$
Multivariate Parameters

Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)$

Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\boldsymbol{\Sigma} \equiv \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}$$
Parameter Estimation
Sample mean $\mathbf{m}$: $\quad m_i = \dfrac{\sum_{t=1}^{N} x_i^t}{N}, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{\sum_{t=1}^{N} \left(x_i^t - m_i\right)\left(x_j^t - m_j\right)}{N}$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
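As an informal illustration (not part of the original slides), these estimators can be computed directly with NumPy; the data matrix X below is made up, and m, S, R correspond to the sample mean, covariance, and correlation defined above.

```python
import numpy as np

# Hypothetical N x d data matrix: N instances, d features
X = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 4.0], [4.0, 6.5]])
N, d = X.shape

m = X.mean(axis=0)            # sample mean, one entry per feature
Xc = X - m                    # centered data
S = (Xc.T @ Xc) / N           # ML covariance estimate (divides by N, not N-1)
s = np.sqrt(np.diag(S))       # per-feature standard deviations
R = S / np.outer(s, s)        # correlation matrix r_ij = s_ij / (s_i s_j)

print(m, S, R, sep="\n")
```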
Estimation of Missing Values
- What to do if certain instances have missing attributes?
- Ignore those instances: not a good idea if the sample is small
- Use 'missing' as an attribute: may give information
- Imputation: fill in the missing value
  - Mean imputation: use the most likely value (e.g., the mean)
  - Imputation by regression: predict it from the other attributes (see the sketch below)
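A minimal sketch of mean imputation and imputation by regression, assuming missing entries are coded as NaN in a NumPy array; the data values and the choice of np.polyfit for the regression step are illustrative, not from the slides.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # second attribute missing for this instance
              [3.0, 4.1],
              [4.0, 6.0]])

# Mean imputation: replace each missing entry with its column mean
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Imputation by regression: predict column 1 from column 0 using complete rows
complete = ~np.isnan(X[:, 1])
a, b = np.polyfit(X[complete, 0], X[complete, 1], 1)   # fit x2 ~ a*x1 + b
X_reg = X.copy()
X_reg[~complete, 1] = a * X[~complete, 0] + b
```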
Multivariate Normal Distribution
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$
Multivariate Normal Distribution
- Mahalanobis distance: $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$
  measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and correlations)
- Bivariate case: d = 2

$$\boldsymbol{\Sigma} =
\begin{bmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2 \\
\rho\sigma_1\sigma_2 & \sigma_2^2
\end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

where $z_i = (x_i - \mu_i)/\sigma_i$
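A small sketch (with assumed example values, not from the slides) showing that the general Mahalanobis form and the bivariate expression in z_1, z_2 agree:

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma1, sigma2, rho = 1.0, 2.0, 0.6           # arbitrary bivariate parameters
Sigma = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2            ]])

x = np.array([1.5, -1.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
diff = x - mu
D2 = diff @ np.linalg.solve(Sigma, diff)

# Same quantity via the standardized z_i = (x_i - mu_i) / sigma_i
z1, z2 = diff[0] / sigma1, diff[1] / sigma2
D2_biv = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
assert np.isclose(D2, D2_biv)
```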
Bivariate Normal
[Figure slides: examples of the bivariate normal density]
Independent Inputs: Naive Bayes
- If the x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i)
 = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}
   \exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$

- If the variances are also equal, this reduces to the Euclidean distance
Parametric Classification
- If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}_d(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

- Discriminant functions:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i)
 = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \log P(C_i)$$
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t\,(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1}(\mathbf{x} - \mathbf{m}_i) + \log\hat{P}(C_i)$$
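A sketch of these estimators and the resulting discriminant on a labeled sample, assuming class labels y in place of the indicator variables r_i^t; the function names and the toy data are hypothetical, not from the book.

```python
import numpy as np

def estimate_class_params(X, y, n_classes):
    """ML estimates of priors, means, and covariance matrices per class."""
    priors, means, covs = [], [], []
    for i in range(n_classes):
        Xi = X[y == i]                          # instances with r_i^t = 1
        priors.append(len(Xi) / len(X))
        means.append(Xi.mean(axis=0))
        diff = Xi - means[-1]
        covs.append(diff.T @ diff / len(Xi))    # S_i
    return priors, means, covs

def g(x, prior, m, S):
    """Discriminant g_i(x), dropping the constant -d/2 log(2*pi) term."""
    diff = x - m
    return (-0.5 * np.log(np.linalg.det(S))
            - 0.5 * diff @ np.linalg.solve(S, diff)
            + np.log(prior))

# Usage on a tiny made-up two-class sample
X = np.array([[1.0, 1.2], [1.5, 0.8], [0.8, 1.0],
              [4.0, 4.2], [4.5, 3.8], [3.8, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
priors, means, covs = estimate_class_params(X, y, 2)
scores = [g(np.array([1.0, 1.0]), priors[i], means[i], covs[i]) for i in range(2)]
print("predicted class:", int(np.argmax(scores)))
```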
Different Si
- Quadratic discriminant:

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T \mathbf{S}_i^{-1}\mathbf{x} - 2\,\mathbf{x}^T \mathbf{S}_i^{-1}\mathbf{m}_i + \mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i\right) + \log\hat{P}(C_i)
 = \mathbf{x}^T \mathbf{W}_i\,\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad
\mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad
w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$
[Figure slide: class likelihoods, the posterior for C1, and the discriminant where P(C1|x) = 0.5]
Common Covariance Matrix S
- Shared common sample covariance S:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$

- The discriminant reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1}(\mathbf{x} - \mathbf{m}_i) + \log\hat{P}(C_i)$$

which is a linear discriminant

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
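Continuing the earlier sketch (again illustrative, not the book's code), the pooled covariance and the linear weights w_i, w_i0 can be computed from per-class estimates such as those returned by estimate_class_params above.

```python
import numpy as np

def linear_discriminant_params(priors, means, covs):
    """Pool the per-class covariances into S = sum_i P(C_i) S_i and
    return the linear weights w_i = S^{-1} m_i and intercepts w_i0."""
    S = sum(p * S_i for p, S_i in zip(priors, covs))
    S_inv = np.linalg.inv(S)
    ws = [S_inv @ m for m in means]
    w0s = [-0.5 * m @ S_inv @ m + np.log(p) for m, p in zip(means, priors)]
    return ws, w0s

def linear_g(x, ws, w0s):
    """Evaluate g_i(x) = w_i^T x + w_i0 for every class; pick the largest."""
    return [w @ x + w0 for w, w0 in zip(ws, w0s)]
```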
Common Covariance Matrix S
[Figure slide]
Diagonal S
- When the x_j, j = 1, ..., d, are independent, Σ is diagonal and
  $p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i)$ (the Naive Bayes' assumption)

$$g_i(\mathbf{x}^t) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$

- Classify based on the weighted Euclidean distance (in s_j units) to the nearest mean
Diagonal S
[Figure slide: variances may be different]
Diagonal S, equal variances
- Nearest mean classifier: classify based on the Euclidean distance to the nearest mean (see the sketch below)

$$g_i(\mathbf{x}^t) = -\frac{\left\|\mathbf{x}^t - \mathbf{m}_i\right\|^2}{2s^2} + \log\hat{P}(C_i)
 = -\frac{1}{2s^2}\sum_{j=1}^{d}\left(x_j^t - m_{ij}\right)^2 + \log\hat{P}(C_i)$$

- Each mean can be considered a prototype or template, and this is template matching
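A minimal sketch of the nearest mean (template matching) classifier, assuming equal priors and equal variances so that only the Euclidean distances to the class means matter; the means and query point are made up.

```python
import numpy as np

def nearest_mean_predict(x, means):
    """Assign x to the class whose mean (template) is closest in Euclidean distance."""
    dists = [np.sum((x - m) ** 2) for m in means]
    return int(np.argmin(dists))

# Hypothetical class means ("templates") and a query point
means = [np.array([1.0, 1.0]), np.array([4.0, 4.0])]
print(nearest_mean_predict(np.array([1.4, 2.0]), means))   # -> 0
```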
Diagonal S, equal variances
[Figure slide]
Model Selection
Assumption                    Covariance matrix         No. of parameters
Shared, hyperspheric          S_i = S = s^2 I           1
Shared, axis-aligned          S_i = S, with s_ij = 0    d
Shared, hyperellipsoidal      S_i = S                   d(d+1)/2
Different, hyperellipsoidal   S_i                       K d(d+1)/2

- As we increase complexity (a less restricted S), bias decreases and variance increases
- Assume simple models (allow some bias) to control variance (regularization)
Discrete Features
- Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$
- If the x_j are independent (Naive Bayes'):

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}\,(1 - p_{ij})^{(1 - x_j)}$$

- The discriminant is linear:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i)
 = \sum_j \left[x_j \log p_{ij} + (1 - x_j)\log(1 - p_{ij})\right] + \log P(C_i)$$

- Estimated parameters:

$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
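A sketch of this Bernoulli Naive Bayes discriminant, assuming binary features in a NumPy array and class labels y; the clipping constant eps and the toy data are illustrative, not from the slides.

```python
import numpy as np

def bernoulli_nb_fit(X, y, n_classes, eps=1e-9):
    """Estimate p_ij = P(x_j = 1 | C_i) and the class priors from binary data."""
    priors = np.array([(y == i).mean() for i in range(n_classes)])
    p = np.array([X[y == i].mean(axis=0) for i in range(n_classes)])
    return priors, np.clip(p, eps, 1 - eps)    # clip to avoid log(0)

def bernoulli_nb_g(x, priors, p):
    """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i)."""
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)

# Tiny made-up example: 4 binary features, 2 classes
X = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 0, 1]])
y = np.array([0, 0, 0, 1, 1])
priors, p = bernoulli_nb_fit(X, y, 2)
print(np.argmax(bernoulli_nb_g(np.array([1, 0, 1, 0]), priors, p)))
```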
Discrete Features
- Multinomial (1-of-n_j) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$

$$p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)$$

- If the x_j are independent:

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d}\prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}$$

$$g_i(\mathbf{x}) = \sum_j \sum_k z_{jk}\log p_{ijk} + \log P(C_i)$$

- Estimated parameters:

$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$
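A similar sketch for multinomial features, assuming each categorical attribute is coded as an integer 0..n_j-1 rather than explicitly as 1-of-n_j indicators z_jk; the function names and eps are hypothetical.

```python
import numpy as np

def multinomial_nb_fit(X, y, n_classes, n_values, eps=1e-9):
    """Estimate p_ijk = P(x_j = v_k | C_i) from categorical data coded 0..n_values-1."""
    n, d = X.shape
    priors = np.array([(y == i).mean() for i in range(n_classes)])
    p = np.zeros((n_classes, d, n_values))
    for i in range(n_classes):
        Xi = X[y == i]
        for j in range(d):
            p[i, j] = np.bincount(Xi[:, j], minlength=n_values) / len(Xi)
    return priors, np.clip(p, eps, 1.0)   # clip to avoid log(0)

def multinomial_nb_g(x, priors, p):
    """g_i(x) = sum_j log p_{i, j, x_j} + log P(C_i) for every class i."""
    d = p.shape[1]
    return np.array([np.log(p[i, np.arange(d), x]).sum() + np.log(pr)
                     for i, pr in enumerate(priors)])
```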
Multivariate Regression
$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon$$

- Multivariate linear model:

$$g(\mathbf{x}^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - w_0 - w_1 x_1^t - \cdots - w_d x_d^t\right)^2$$

- Multivariate polynomial model: define new higher-order variables
  z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
  and use the linear model in this new z space
  (basis functions, kernel trick: Chapter 13)
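A sketch of fitting both models by least squares with NumPy, assuming a small made-up data set; the polynomial case simply builds the z variables listed above and reuses the same linear solver.

```python
import numpy as np

# Hypothetical data: N instances with d = 2 inputs and a numeric output r
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
r = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Multivariate linear model: augment with a column of 1s for w0,
# then solve the least-squares problem min_w ||r - D w||^2
D = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(D, r, rcond=None)        # w = [w0, w1, w2]

# Multivariate polynomial model: map to z = (x1, x2, x1^2, x2^2, x1*x2)
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2, X[:, 1]**2, X[:, 0] * X[:, 1]])
Dz = np.column_stack([np.ones(len(Z)), Z])
wz, *_ = np.linalg.lstsq(Dz, r, rcond=None)      # linear model in the z space
```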
