
Lecture Slides for

INTRODUCTION TO MACHINE LEARNING
3RD EDITION
ETHEM ALPAYDIN
© The MIT Press, 2014

alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml3e
CHAPTER 5:

MULTIVARIATE METHODS
Multivariate Data

- Multiple measurements (sensors)
- d inputs/features/attributes: d-variate
- N instances/observations/examples

$$\mathbf{X} =
\begin{bmatrix}
X_1^1 & X_2^1 & \cdots & X_d^1 \\
X_1^2 & X_2^2 & \cdots & X_d^2 \\
\vdots & \vdots & \ddots & \vdots \\
X_1^N & X_2^N & \cdots & X_d^N
\end{bmatrix}$$
Multivariate Parameters

Mean: $E[\mathbf{x}] = \boldsymbol{\mu} = [\mu_1, \ldots, \mu_d]^T$

Covariance: $\sigma_{ij} \equiv \mathrm{Cov}(X_i, X_j)$

Correlation: $\mathrm{Corr}(X_i, X_j) \equiv \rho_{ij} = \dfrac{\sigma_{ij}}{\sigma_i \sigma_j}$

$$\boldsymbol{\Sigma} \equiv \mathrm{Cov}(\mathbf{X}) = E\left[(\mathbf{X} - \boldsymbol{\mu})(\mathbf{X} - \boldsymbol{\mu})^T\right] =
\begin{bmatrix}
\sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\
\sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2
\end{bmatrix}$$
Parameter Estimation
Sample mean $\mathbf{m}$: $\quad m_i = \dfrac{\sum_{t=1}^{N} x_i^t}{N}, \quad i = 1, \ldots, d$

Covariance matrix $\mathbf{S}$: $\quad s_{ij} = \dfrac{\sum_{t=1}^{N} \left(x_i^t - m_i\right)\left(x_j^t - m_j\right)}{N}$

Correlation matrix $\mathbf{R}$: $\quad r_{ij} = \dfrac{s_{ij}}{s_i s_j}$
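As an informal illustration (not part of the original slides), these estimators can be computed directly with NumPy; the data matrix X below is made up, and m, S, R correspond to the sample mean, covariance, and correlation defined above.

```python
import numpy as np

# Hypothetical N x d data matrix: N instances, d features
X = np.array([[1.0, 2.0], [2.0, 3.5], [3.0, 4.0], [4.0, 6.5]])
N, d = X.shape

m = X.mean(axis=0)            # sample mean, one entry per feature
Xc = X - m                    # centered data
S = (Xc.T @ Xc) / N           # ML covariance estimate (divides by N, not N-1)
s = np.sqrt(np.diag(S))       # per-feature standard deviations
R = S / np.outer(s, s)        # correlation matrix r_ij = s_ij / (s_i s_j)

print(m, S, R, sep="\n")
```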
Estimation of Missing Values
- What to do if certain instances have missing attributes?
- Ignore those instances: not a good idea if the sample is small
- Use 'missing' as an attribute: may give information
- Imputation: fill in the missing value
  - Mean imputation: use the most likely value (e.g., the mean)
  - Imputation by regression: predict it from the other attributes (see the sketch below)
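A minimal sketch of mean imputation and imputation by regression, assuming missing entries are coded as NaN in a NumPy array; the data values and the choice of np.polyfit for the regression step are illustrative, not from the slides.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # second attribute missing for this instance
              [3.0, 4.1],
              [4.0, 6.0]])

# Mean imputation: replace each missing entry with its column mean
col_means = np.nanmean(X, axis=0)
X_mean = np.where(np.isnan(X), col_means, X)

# Imputation by regression: predict column 1 from column 0 using complete rows
complete = ~np.isnan(X[:, 1])
a, b = np.polyfit(X[complete, 0], X[complete, 1], 1)   # fit x2 ~ a*x1 + b
X_reg = X.copy()
X_reg[~complete, 1] = a * X[~complete, 0] + b
```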
Multivariate Normal Distribution
$$\mathbf{x} \sim \mathcal{N}_d(\boldsymbol{\mu}, \boldsymbol{\Sigma})$$

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}|^{1/2}} \exp\left[-\frac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$
Multivariate Normal Distribution
- Mahalanobis distance: $(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$
  measures the distance from x to μ in terms of Σ (it normalizes for differences in variances and correlations)
- Bivariate case: d = 2

$$\boldsymbol{\Sigma} =
\begin{bmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2 \\
\rho\sigma_1\sigma_2 & \sigma_2^2
\end{bmatrix}$$

$$p(x_1, x_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \exp\left[-\frac{1}{2(1-\rho^2)}\left(z_1^2 - 2\rho z_1 z_2 + z_2^2\right)\right]$$

where $z_i = (x_i - \mu_i)/\sigma_i$
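A small sketch (with assumed example values, not from the slides) showing that the general Mahalanobis form and the bivariate expression in z_1, z_2 agree:

```python
import numpy as np

mu = np.array([0.0, 0.0])
sigma1, sigma2, rho = 1.0, 2.0, 0.6           # arbitrary bivariate parameters
Sigma = np.array([[sigma1**2,             rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2            ]])

x = np.array([1.5, -1.0])

# Squared Mahalanobis distance: (x - mu)^T Sigma^{-1} (x - mu)
diff = x - mu
D2 = diff @ np.linalg.solve(Sigma, diff)

# Same quantity via the standardized z_i = (x_i - mu_i) / sigma_i
z1, z2 = diff[0] / sigma1, diff[1] / sigma2
D2_biv = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
assert np.isclose(D2, D2_biv)
```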
Bivariate Normal
[Figure slides: examples of the bivariate normal density]
Independent Inputs: Naive Bayes
- If the x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:

$$p(\mathbf{x}) = \prod_{i=1}^{d} p_i(x_i)
 = \frac{1}{(2\pi)^{d/2}\prod_{i=1}^{d}\sigma_i}
   \exp\left[-\frac{1}{2}\sum_{i=1}^{d}\left(\frac{x_i - \mu_i}{\sigma_i}\right)^2\right]$$

- If the variances are also equal, this reduces to the Euclidean distance
Parametric Classification
- If $p(\mathbf{x} \mid C_i) \sim \mathcal{N}_d(\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$:

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i)\right]$$

- Discriminant functions:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i)
 = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x} - \boldsymbol{\mu}_i) + \log P(C_i)$$
Estimation of Parameters
$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}$$

$$\mathbf{m}_i = \frac{\sum_t r_i^t\,\mathbf{x}^t}{\sum_t r_i^t}$$

$$\mathbf{S}_i = \frac{\sum_t r_i^t\,(\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}_i^{-1}(\mathbf{x} - \mathbf{m}_i) + \log\hat{P}(C_i)$$
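A sketch of these estimators and the resulting discriminant on a labeled sample, assuming class labels y in place of the indicator variables r_i^t; the function names and the toy data are hypothetical, not from the book.

```python
import numpy as np

def estimate_class_params(X, y, n_classes):
    """ML estimates of priors, means, and covariance matrices per class."""
    priors, means, covs = [], [], []
    for i in range(n_classes):
        Xi = X[y == i]                          # instances with r_i^t = 1
        priors.append(len(Xi) / len(X))
        means.append(Xi.mean(axis=0))
        diff = Xi - means[-1]
        covs.append(diff.T @ diff / len(Xi))    # S_i
    return priors, means, covs

def g(x, prior, m, S):
    """Discriminant g_i(x), dropping the constant -d/2 log(2*pi) term."""
    diff = x - m
    return (-0.5 * np.log(np.linalg.det(S))
            - 0.5 * diff @ np.linalg.solve(S, diff)
            + np.log(prior))

# Usage on a tiny made-up two-class sample
X = np.array([[1.0, 1.2], [1.5, 0.8], [0.8, 1.0],
              [4.0, 4.2], [4.5, 3.8], [3.8, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
priors, means, covs = estimate_class_params(X, y, 2)
scores = [g(np.array([1.0, 1.0]), priors[i], means[i], covs[i]) for i in range(2)]
print("predicted class:", int(np.argmax(scores)))
```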
Different Si
- Quadratic discriminant:

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}\left(\mathbf{x}^T \mathbf{S}_i^{-1}\mathbf{x} - 2\,\mathbf{x}^T \mathbf{S}_i^{-1}\mathbf{m}_i + \mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i\right) + \log\hat{P}(C_i)
 = \mathbf{x}^T \mathbf{W}_i\,\mathbf{x} + \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad
\mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad
w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log\hat{P}(C_i)$$
[Figure slide: class likelihoods, the posterior for C1, and the discriminant where P(C1|x) = 0.5]
Common Covariance Matrix S
- Shared common sample covariance S:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\,\mathbf{S}_i$$

- The discriminant reduces to

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x} - \mathbf{m}_i)^T \mathbf{S}^{-1}(\mathbf{x} - \mathbf{m}_i) + \log\hat{P}(C_i)$$

which is a linear discriminant

$$g_i(\mathbf{x}) = \mathbf{w}_i^T\mathbf{x} + w_{i0}$$

where

$$\mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T\mathbf{S}^{-1}\mathbf{m}_i + \log\hat{P}(C_i)$$
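Continuing the earlier sketch (again illustrative, not the book's code), the pooled covariance and the linear weights w_i, w_i0 can be computed from per-class estimates such as those returned by estimate_class_params above.

```python
import numpy as np

def linear_discriminant_params(priors, means, covs):
    """Pool the per-class covariances into S = sum_i P(C_i) S_i and
    return the linear weights w_i = S^{-1} m_i and intercepts w_i0."""
    S = sum(p * S_i for p, S_i in zip(priors, covs))
    S_inv = np.linalg.inv(S)
    ws = [S_inv @ m for m in means]
    w0s = [-0.5 * m @ S_inv @ m + np.log(p) for m, p in zip(means, priors)]
    return ws, w0s

def linear_g(x, ws, w0s):
    """Evaluate g_i(x) = w_i^T x + w_i0 for every class; pick the largest."""
    return [w @ x + w0 for w, w0 in zip(ws, w0s)]
```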
Common Covariance Matrix S
[Figure slide]
Diagonal S
- When the x_j, j = 1, ..., d, are independent, Σ is diagonal and
  $p(\mathbf{x} \mid C_i) = \prod_j p(x_j \mid C_i)$ (the Naive Bayes' assumption)

$$g_i(\mathbf{x}^t) = -\frac{1}{2}\sum_{j=1}^{d}\left(\frac{x_j^t - m_{ij}}{s_j}\right)^2 + \log\hat{P}(C_i)$$

- Classify based on the weighted Euclidean distance (in s_j units) to the nearest mean
Diagonal S
[Figure slide: variances may be different]
Diagonal S, equal variances
- Nearest mean classifier: classify based on the Euclidean distance to the nearest mean (see the sketch below)

$$g_i(\mathbf{x}^t) = -\frac{\left\|\mathbf{x}^t - \mathbf{m}_i\right\|^2}{2s^2} + \log\hat{P}(C_i)
 = -\frac{1}{2s^2}\sum_{j=1}^{d}\left(x_j^t - m_{ij}\right)^2 + \log\hat{P}(C_i)$$

- Each mean can be considered a prototype or template, and this is template matching
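A minimal sketch of the nearest mean (template matching) classifier, assuming equal priors and equal variances so that only the Euclidean distances to the class means matter; the means and query point are made up.

```python
import numpy as np

def nearest_mean_predict(x, means):
    """Assign x to the class whose mean (template) is closest in Euclidean distance."""
    dists = [np.sum((x - m) ** 2) for m in means]
    return int(np.argmin(dists))

# Hypothetical class means ("templates") and a query point
means = [np.array([1.0, 1.0]), np.array([4.0, 4.0])]
print(nearest_mean_predict(np.array([1.4, 2.0]), means))   # -> 0
```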
Diagonal S, equal variances
[Figure slide]
Model Selection
Assumption                    Covariance matrix         No. of parameters
Shared, hyperspheric          S_i = S = s^2 I           1
Shared, axis-aligned          S_i = S, with s_ij = 0    d
Shared, hyperellipsoidal      S_i = S                   d(d+1)/2
Different, hyperellipsoidal   S_i                       K d(d+1)/2

- As we increase complexity (a less restricted S), bias decreases and variance increases
- Assume simple models (allow some bias) to control variance (regularization)
Discrete Features
- Binary features: $p_{ij} \equiv p(x_j = 1 \mid C_i)$
- If the x_j are independent (Naive Bayes'):

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d} p_{ij}^{x_j}\,(1 - p_{ij})^{(1 - x_j)}$$

- The discriminant is linear:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i)
 = \sum_j \left[x_j \log p_{ij} + (1 - x_j)\log(1 - p_{ij})\right] + \log P(C_i)$$

- Estimated parameters:

$$\hat{p}_{ij} = \frac{\sum_t x_j^t r_i^t}{\sum_t r_i^t}$$
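A sketch of this Bernoulli Naive Bayes discriminant, assuming binary features in a NumPy array and class labels y; the clipping constant eps and the toy data are illustrative, not from the slides.

```python
import numpy as np

def bernoulli_nb_fit(X, y, n_classes, eps=1e-9):
    """Estimate p_ij = P(x_j = 1 | C_i) and the class priors from binary data."""
    priors = np.array([(y == i).mean() for i in range(n_classes)])
    p = np.array([X[y == i].mean(axis=0) for i in range(n_classes)])
    return priors, np.clip(p, eps, 1 - eps)    # clip to avoid log(0)

def bernoulli_nb_g(x, priors, p):
    """g_i(x) = sum_j [x_j log p_ij + (1 - x_j) log(1 - p_ij)] + log P(C_i)."""
    return (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1) + np.log(priors)

# Tiny made-up example: 4 binary features, 2 classes
X = np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 1], [0, 1, 0, 0], [0, 0, 0, 1]])
y = np.array([0, 0, 0, 1, 1])
priors, p = bernoulli_nb_fit(X, y, 2)
print(np.argmax(bernoulli_nb_g(np.array([1, 0, 1, 0]), priors, p)))
```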
Discrete Features
- Multinomial (1-of-n_j) features: $x_j \in \{v_1, v_2, \ldots, v_{n_j}\}$

$$p_{ijk} \equiv p(z_{jk} = 1 \mid C_i) = p(x_j = v_k \mid C_i)$$

- If the x_j are independent:

$$p(\mathbf{x} \mid C_i) = \prod_{j=1}^{d}\prod_{k=1}^{n_j} p_{ijk}^{z_{jk}}$$

$$g_i(\mathbf{x}) = \sum_j \sum_k z_{jk}\log p_{ijk} + \log P(C_i)$$

- Estimated parameters:

$$\hat{p}_{ijk} = \frac{\sum_t z_{jk}^t r_i^t}{\sum_t r_i^t}$$
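A similar sketch for multinomial features, assuming each categorical attribute is coded as an integer 0..n_j-1 rather than explicitly as 1-of-n_j indicators z_jk; the function names and eps are hypothetical.

```python
import numpy as np

def multinomial_nb_fit(X, y, n_classes, n_values, eps=1e-9):
    """Estimate p_ijk = P(x_j = v_k | C_i) from categorical data coded 0..n_values-1."""
    n, d = X.shape
    priors = np.array([(y == i).mean() for i in range(n_classes)])
    p = np.zeros((n_classes, d, n_values))
    for i in range(n_classes):
        Xi = X[y == i]
        for j in range(d):
            p[i, j] = np.bincount(Xi[:, j], minlength=n_values) / len(Xi)
    return priors, np.clip(p, eps, 1.0)   # clip to avoid log(0)

def multinomial_nb_g(x, priors, p):
    """g_i(x) = sum_j log p_{i, j, x_j} + log P(C_i) for every class i."""
    d = p.shape[1]
    return np.array([np.log(p[i, np.arange(d), x]).sum() + np.log(pr)
                     for i, pr in enumerate(priors)])
```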
Multivariate Regression
$$r^t = g(\mathbf{x}^t \mid w_0, w_1, \ldots, w_d) + \varepsilon$$

- Multivariate linear model:

$$g(\mathbf{x}^t) = w_0 + w_1 x_1^t + w_2 x_2^t + \cdots + w_d x_d^t$$

$$E(w_0, w_1, \ldots, w_d \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - w_0 - w_1 x_1^t - \cdots - w_d x_d^t\right)^2$$

- Multivariate polynomial model: define new higher-order variables
  z_1 = x_1, z_2 = x_2, z_3 = x_1^2, z_4 = x_2^2, z_5 = x_1 x_2
  and use the linear model in this new z space
  (basis functions, kernel trick: Chapter 13)
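A sketch of fitting both models by least squares with NumPy, assuming a small made-up data set; the polynomial case simply builds the z variables listed above and reuses the same linear solver.

```python
import numpy as np

# Hypothetical data: N instances with d = 2 inputs and a numeric output r
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
r = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Multivariate linear model: augment with a column of 1s for w0,
# then solve the least-squares problem min_w ||r - D w||^2
D = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(D, r, rcond=None)        # w = [w0, w1, w2]

# Multivariate polynomial model: map to z = (x1, x2, x1^2, x2^2, x1*x2)
Z = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2, X[:, 1]**2, X[:, 0] * X[:, 1]])
Dz = np.column_stack([np.ones(len(Z)), Z])
wz, *_ = np.linalg.lstsq(Dz, r, rcond=None)      # linear model in the z space
```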
