
Probabilistic & Unsupervised Learning

Latent Variable Models

Maneesh Sahani
maneesh@gatsby.ucl.ac.uk

Gatsby Computational Neuroscience Unit, and
MSc ML/CSML, Dept Computer Science
University College London

Term 1, Autumn 2018


Exponential family models

I Simple, ’single-stage’ generative models.
I Easy, often closed-form expressions for learning and model comparison.
I . . . but limited in expressiveness.

What about distributions like these?

[Figures: example data distributions.]

In each case, data may be generated by combining and transforming latent exponential
family variates.

Latent variable models

I Describe structured distributions.
I Correlations in high-dimensional x may be captured by fewer parameters.
I Capture an underlying generative process.
I z may describe causes of x.
I Help to separate signal from noise.
I Combine exponential family distributions into richer, more flexible forms.
I P(z), P(x|z) and even P(x, z) may be in the exponential family.
I P(x) rarely is. (Exception: linear Gaussian models.)


Latent variable models

Explain correlations in x by assuming dependence on latent variables z.

[Figure: hierarchical generative model of vision: objects, illumination, pose → object parts,
surfaces → edges → retinal image, i.e. pixels.]

    z ∼ P[θz]
    x | z ∼ P[θx]

    p(x, z; θx, θz) = p(x | z; θx) p(z; θz)

    p(x; θx, θz) = ∫ dz p(x | z; θx) p(z; θz)

Latent variables and Gaussians

Gaussian correlation can be composed from latent components and uncorrelated noise.

Note: Ex[f(x)] = Ez[Ex|z[f(x)]]  and  Vx[x] = Ez[V[x|z]] + Vz[E[x|z]].

Example:

    x ∼ N(0, [3 2; 2 3])   ⇔   z ∼ N(0, 1),  x | z ∼ N(√2 [1; 1] z, [1 0; 0 1])


Probabilistic Principal Components Analysis (PPCA)

If the uncorrelated noise is assumed to be isotropic, this model is called PPCA.

Data: D = X = {x1, x2, . . . , xN};  xi ∈ R^D
Latents: Z = {z1, z2, . . . , zN};  zi ∈ R^K

Linear generative model:  xd = Σ_{k=1}^{K} Λdk zk + εd

I zk are independent N(0, 1) Gaussian factors
I εd are independent N(0, ψ) Gaussian noise
I K < D

[Graphical model: latents z1, z2, . . . , zK, each connected to every observed x1, x2, . . . , xD.]

Model for observations x is a correlated Gaussian:

    p(z) = N(0, I)
    p(x|z) = N(Λz, ψI)

    p(x) = ∫ p(z) p(x|z) dz = N(Ez[Λz], Ez[Λ z zᵀ Λᵀ] + ψI) = N(0, ΛΛᵀ + ψI)

where Λ is a D × K matrix.
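As a concrete illustration of the generative model above, here is a minimal numpy sketch (not part of the original slides) that draws samples from a PPCA model and checks that their empirical covariance approaches ΛΛᵀ + ψI; the particular values of D, K, Λ and ψ are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 5, 2, 100_000          # arbitrary example sizes
Lam = rng.normal(size=(D, K))    # loading matrix Λ (D x K)
psi = 0.5                        # isotropic noise variance ψ

# Generative model: z ~ N(0, I_K), x | z ~ N(Λz, ψ I_D)
Z = rng.normal(size=(N, K))
X = Z @ Lam.T + np.sqrt(psi) * rng.normal(size=(N, D))

emp_cov = X.T @ X / N            # empirical second moment (mean is 0)
model_cov = Lam @ Lam.T + psi * np.eye(D)
print(np.abs(emp_cov - model_cov).max())   # small, and shrinks as N grows
```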

Multivariate Gaussians and latent variables

Two models:

Full-covariance Gaussian:  p(x) = N(0, Σ)

I Descriptive density model: correlations are captured by off-diagonal elements of Σ.
I Σ has D(D + 1)/2 free parameters.
I Only constrained to be positive definite.
I Simple ML estimate.

Latent variable model:  p(z) = N(0, I),  p(x|z) = N(Λz, ψI)  ⇒  p(x) = N(0, ΛΛᵀ + ψI)

I Interpretable causal model: correlations captured by common influence of latent variable.
I ΛΛᵀ + ψI has DK + 1 free parameters.
I For K < D the covariance structure is constrained (“blurry pancake”).
I ML estimation is more complex.


PPCA likelihood

The marginal distribution on x gives us the PPCA likelihood:

    log p(X | Λ, ψ) = − (N/2) log |2π(ΛΛᵀ + ψI)| − (1/2) Tr[(ΛΛᵀ + ψI)⁻¹ Σ_n xn xnᵀ],
                      with Σ_n xn xnᵀ = N S.

To find the ML values of (Λ, ψ) we could optimise numerically (gradient ascent / Newton’s
method), or we could use a different iterative algorithm called EM which we’ll introduce soon.

In fact, however, ML for PPCA is more straightforward in principle, as we will see by first
considering the limit ψ → 0.

[Note: We may also add a constant mean µ to the output, so as to model data that are not
distributed around 0. In this case, the ML estimate is µ̂ = (1/N) Σ_n xn and we can define
S = (1/N) Σ_n (xn − µ̂)(xn − µ̂)ᵀ in the likelihood above.]
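The likelihood above can be written directly in numpy; a minimal sketch (function and variable names are my own, and the data are assumed zero-mean):

```python
import numpy as np

def ppca_loglik(X, Lam, psi):
    """log p(X | Λ, ψ) for zero-mean PPCA, where C = ΛΛᵀ + ψI."""
    N, D = X.shape
    C = Lam @ Lam.T + psi * np.eye(D)
    S = X.T @ X / N                      # empirical covariance of zero-mean data
    _, logdetC = np.linalg.slogdet(C)
    return -0.5 * N * (D * np.log(2 * np.pi) + logdetC
                       + np.trace(np.linalg.solve(C, S)))

# example: evaluate at arbitrary parameters
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4))
print(ppca_loglik(X, rng.normal(size=(4, 2)), 1.0))
```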
The ψ → 0 limit

As ψ → 0, the latent model can only capture K dimensions of variance.
This leads us to an (old) algorithm called Principal Components Analysis (PCA).

[Figure: 3D data cloud plotted on axes x1, x2, x3, with its plane of greatest variance.]

In a Gaussian model, the ML parameters will find the K-dimensional space of most variance.


Principal Components Analysis

Assume data D = {xi} have zero mean (if not, subtract it).

I Find the direction of greatest variance – λ(1):

      λ(1) = argmax_{‖v‖=1} Σ_n (xnᵀ v)²

I Find the direction orthogonal to λ(1) with greatest variance – λ(2).
I . . .
I Find the direction orthogonal to {λ(1), λ(2), . . . , λ(n−1)} with greatest variance – λ(n).
I Terminate when the remaining variance drops below a threshold.

Eigendecomposition of a covariance matrix

The eigendecomposition of a covariance matrix makes finding the PCs easy.

Recall that u is an eigenvector, with scalar eigenvalue ω, of a matrix S if

      Su = ωu

u can have any norm, but we will define it to be unity (i.e., uᵀu = 1).

For a covariance matrix S = ⟨xxᵀ⟩ (which is D × D, symmetric, positive semi-definite):

I In general there are D eigenvector-eigenvalue pairs (u(i), ω(i)), except if two or more
  eigenvectors share the same eigenvalue (in which case the eigenvectors are degenerate
  — any linear combination is also an eigenvector).
I The D eigenvectors are orthogonal (or orthogonalisable, if ω(i) = ω(j)). Thus, they form
  an orthonormal basis: Σ_i u(i) u(i)ᵀ = I.
I Any vector v can be written as

      v = (Σ_i u(i) u(i)ᵀ) v = Σ_i (u(i)ᵀ v) u(i) = Σ_i v(i) u(i)

I The original matrix S can be written:

      S = Σ_i ω(i) u(i) u(i)ᵀ = UWUᵀ

  where U = [u(1), u(2), . . . , u(D)] collects the eigenvectors and
  W = diag(ω(1), ω(2), . . . , ω(D)).


PCA and eigenvectors

I The variance in direction u(i) is

      ⟨(xᵀ u(i))²⟩ = ⟨u(i)ᵀ x xᵀ u(i)⟩ = u(i)ᵀ S u(i) = u(i)ᵀ ω(i) u(i) = ω(i)

I The variance in an arbitrary direction v is

      ⟨(xᵀ v)²⟩ = ⟨(xᵀ Σ_i v(i) u(i))²⟩ = Σ_ij v(i) u(i)ᵀ S u(j) v(j)
                = Σ_ij v(i) ω(j) v(j) u(i)ᵀ u(j) = Σ_i v(i)² ω(i)

I If vᵀv = 1, then Σ_i v(i)² = 1 and so argmax_{‖v‖=1} ⟨(xᵀ v)²⟩ = u(max).
  The direction of greatest variance is the eigenvector with the largest eigenvalue.
I In general, the PCs are exactly the eigenvectors of the empirical covariance matrix,
  ordered by decreasing eigenvalue.
I The eigenspectrum shows how the variance is distributed across dimensions; it can
  identify transitions that might separate signal from noise, or the number of PCs that
  capture a predetermined fraction of variance.

[Figure: eigenspectrum, plotting eigenvalue (variance) and fractional variance remaining
against eigenvalue number.]
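The eigendecomposition route to the PCs can be sketched in a few lines of numpy (a hedged example, not from the slides); `np.linalg.eigh` returns eigenvalues in ascending order, so they are reversed here:

```python
import numpy as np

def pca(X, K):
    """Top-K eigenvectors/eigenvalues of the empirical covariance of zero-mean X."""
    N = X.shape[0]
    S = X.T @ X / N                      # empirical covariance (data assumed zero-mean)
    evals, evecs = np.linalg.eigh(S)     # ascending eigenvalues, orthonormal columns
    order = np.argsort(evals)[::-1]      # sort descending
    evals, evecs = evals[order], evecs[:, order]
    frac_remaining = 1 - np.cumsum(evals) / evals.sum()   # eigenspectrum summary
    return evecs[:, :K], evals[:K], frac_remaining
```

The `frac_remaining` curve corresponds to the “fractional variance remaining” plot described above.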
PCA subspace

The K principal components define the K-dimensional subspace of greatest variance.

[Figure: 3D data cloud on axes x1, x2, x3, with the 2D principal subspace shown as a plane.]

I Each data point xn is associated with a projection x̂n into the principal subspace:

      x̂n = Σ_{k=1}^{K} xn(k) λ(k)

I This can be used for lossy compression, denoising, recognition, . . .


Example of PCA: Eigenfaces

[Figure: eigenface basis images.]

vismod.media.mit.edu/vismod/demos/facerec/basic.html
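Continuing the hedged sketch above, the projection x̂n is just reconstruction from the top-K coordinates; `evecs_K` is assumed to be the matrix of top-K eigenvectors returned by the earlier `pca` example:

```python
import numpy as np

def project_to_pc_subspace(X, evecs_K):
    """x̂n = Σ_k xn(k) λ(k): project each row of X onto the span of the top-K PCs."""
    Z = X @ evecs_K              # coordinates xn(k) in the principal subspace (N x K)
    return Z @ evecs_K.T         # reconstructions x̂n in the original space (N x D)
```

Storing only `Z` together with the K eigenvectors gives the lossy compression mentioned above.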

Example of PCA: Genetic variation within Europe

[Figures: PCA of genetic variation within Europe.]

Novembre et al. (2008) Nature 456:98-101


Equivalent definitions of PCA

I Find K directions of greatest variance in data.
I Find the K-dimensional orthogonal projection that preserves greatest variance.
I Find K-dimensional vectors zi and a matrix Λ so that x̂i = Λzi is as close as possible
  (in squared distance) to xi.
I . . . (many others)


Another view of PCA: Mutual information

Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised
(assuming that P(x) is Gaussian).

      I(z; x) = H(z) + H(x) − H(z, x) = H(z)

So we want to maximise the entropy of z. What is the entropy of a Gaussian?

      H(z) = − ∫ dz p(z) ln p(z) = (1/2) ln |Σ| + (D/2)(1 + ln 2π)

Therefore we want the distribution of z to have largest volume (i.e. det of covariance matrix).

      Σz = A Σx Aᵀ = A U Wx Uᵀ Aᵀ

So, A should be aligned with the columns of U which are associated with the largest
eigenvalues (variances).

Projection to the principal component subspace preserves the most information about the
(Gaussian) data.
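A quick numerical check of this argument (my own construction, not from the slides): for Gaussian x, H(z) depends on A only through log |A Σx Aᵀ|, which is largest when the rows of A span the leading eigenvectors of Σx.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 6, 2
B = rng.normal(size=(D, D))
Sx = B @ B.T                                    # an arbitrary covariance for x

A_rand = np.linalg.qr(rng.normal(size=(D, D)))[0][:, :K].T   # random orthonormal K x D
evals, evecs = np.linalg.eigh(Sx)
A_pca = evecs[:, ::-1][:, :K].T                 # rows = top-K eigenvectors of Σx

def logdet_cov_z(A, Sx):
    # H(z) equals 0.5 * this quantity plus a constant independent of A
    return np.linalg.slogdet(A @ Sx @ A.T)[1]

print(logdet_cov_z(A_pca, Sx), ">", logdet_cov_z(A_rand, Sx))
```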

Linear autoencoders: From supervised learning to PCA

[Network diagram: input units x1, x2, . . . , xD feed through an encoder P (“recognition”) to
hidden units z1, . . . , zK, which feed through a decoder Q (“generation”) to output units
x̂1, x̂2, . . . , x̂D.]

      x̂ = Qz,   z = Px

      Learning:  argmin_{P,Q} ‖x̂ − x‖²

At the optimum, P and Q perform the projection and reconstruction steps of PCA (Baldi &
Hornik 1989).


ML learning for PPCA

      ℓ = − (N/2) log |2πC| − (N/2) Tr[C⁻¹ S]    where C = ΛΛᵀ + ψI

      ∂ℓ/∂Λ = − (N/2) [ ∂/∂Λ log |C| + ∂/∂Λ Tr[C⁻¹ S] ] = N ( −C⁻¹Λ + C⁻¹ S C⁻¹ Λ )

So at the stationary points we have S C⁻¹ Λ = Λ. This implies either:

I Λ = 0, which turns out to be a minimum.
I C = S ⇒ ΛΛᵀ = S − ψI. Now rank(ΛΛᵀ) ≤ K ⇒ rank(S − ψI) ≤ K
  ⇒ S has D − K eigenvalues = ψ and Λ aligns with the space of the remaining eigenvectors.
I or, taking the SVD Λ = ULVᵀ:

      S (ULVᵀVLUᵀ + ψI)⁻¹ ULVᵀ = ULVᵀ                    [right-multiply by VL⁻¹]
      ⇒ S (UL²Uᵀ + ψI)⁻¹ U = U
         (UL²Uᵀ + ψI)⁻¹ U = U (L² + ψI)⁻¹    since  U(L² + ψI) = (UL²Uᵀ + ψI)U
      ⇒ S U (L² + ψI)⁻¹ = U                               [right-multiply by (L² + ψI)]
      ⇒ S U = U (L² + ψI)

  ⇒ columns of U are eigenvectors of S with eigenvalues given by li² + ψ.

Thus, Λ = ULVᵀ spans a space defined by K eigenvectors of S; the singular values satisfy
li² = (corresponding eigenvalue of S) − ψ; and V selects an arbitrary basis in the latent space.

Remains to show (we won’t, but it’s intuitively reasonable) that the global ML solution is
attained when Λ aligns with the K leading eigenvectors.
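Putting the pieces together gives a closed-form ML fit. A hedged numpy sketch follows; the choice of ψ_ML as the mean of the discarded eigenvalues is the standard Tipping & Bishop result, which the argument above implies but does not derive explicitly, and V is set to the identity.

```python
import numpy as np

def ppca_ml(X, K):
    """Closed-form ML fit of zero-mean PPCA (Tipping & Bishop form, with V = I)."""
    N, D = X.shape
    S = X.T @ X / N
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    psi_ml = evals[K:].mean()                       # mean of the discarded eigenvalues
    L = np.sqrt(np.maximum(evals[:K] - psi_ml, 0))  # l_i^2 = eigenvalue − ψ
    Lam_ml = evecs[:, :K] * L                       # Λ = U_K L (V chosen as I)
    return Lam_ml, psi_ml
```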
PPCA latents

I In PCA the “noise” is orthogonal to the subspace, and we can project xn → x̂n trivially.
I In PPCA, the noise is more sensible (equal in all directions). But what is the projection?
  Find the expected value zn = E[zn|xn] and then take x̂n = Λzn.
I Tactic: write p(zn, xn|θ), consider xn to be fixed. What is this as a function of zn?

      p(zn, xn) = p(zn) p(xn|zn)
                = (2π)^{−K/2} exp{−(1/2) znᵀzn} |2πΨ|^{−1/2} exp{−(1/2) (xn − Λzn)ᵀ Ψ⁻¹ (xn − Λzn)}
                = c  × exp{−(1/2) [znᵀzn + (xn − Λzn)ᵀ Ψ⁻¹ (xn − Λzn)]}
                = c′ × exp{−(1/2) [znᵀ(I + ΛᵀΨ⁻¹Λ)zn − 2znᵀΛᵀΨ⁻¹xn]}
                = c″ × exp{−(1/2) [znᵀΣ⁻¹zn − 2znᵀΣ⁻¹µ + µᵀΣ⁻¹µ]}

So Σ = (I + ΛᵀΨ⁻¹Λ)⁻¹ = I − βΛ and µ = ΣΛᵀΨ⁻¹xn = βxn, where β = ΣΛᵀΨ⁻¹.

I Thus, x̂n = Λ(I + ΛᵀΨ⁻¹Λ)⁻¹ΛᵀΨ⁻¹xn = xn − Ψ(ΛΛᵀ + Ψ)⁻¹xn
I This is not the same projection. PPCA takes into account noise in the principal subspace.
I As ψ → 0, the PPCA estimate → the PCA value.


PPCA latents

[Figure: the PPCA latent prior, the PPCA posterior for one data point, the PPCA noise, and
the PCA and PPCA projections of that point onto the principal subspace.]
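A small numerical sketch of the posterior computation above (variable names are my own), comparing the shrunk PPCA projection Λ E[zn|xn] with the orthogonal PCA projection onto the same subspace:

```python
import numpy as np

def ppca_posterior(x, Lam, Psi):
    """Posterior p(z|x) = N(mu, Sigma) for x ~ N(Λz, Ψ), z ~ N(0, I)."""
    K = Lam.shape[1]
    Sigma = np.linalg.inv(np.eye(K) + Lam.T @ np.linalg.solve(Psi, Lam))
    mu = Sigma @ Lam.T @ np.linalg.solve(Psi, x)
    return mu, Sigma

rng = np.random.default_rng(0)
D, K, psi = 4, 2, 0.1
Lam = rng.normal(size=(D, K))
x = rng.normal(size=D)

mu, _ = ppca_posterior(x, Lam, psi * np.eye(D))
x_ppca = Lam @ mu                                        # PPCA projection Λ E[z|x]
x_pca = Lam @ np.linalg.solve(Lam.T @ Lam, Lam.T @ x)    # orthogonal projection onto span(Λ)
print(x_ppca, x_pca)   # the PPCA estimate is shrunk towards 0; they agree as ψ → 0
```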

Factor Analysis

Data: D = X = {x1, x2, . . . , xN};  xi ∈ R^D
Latents: Z = {z1, z2, . . . , zN};  zi ∈ R^K

Linear generative model:  xd = Σ_{k=1}^{K} Λdk zk + εd

I zk are independent N(0, 1) Gaussian factors
I εd are independent N(0, Ψdd) Gaussian noise
I K < D

[Graphical model: latents z1, z2, . . . , zK, each connected to every observed x1, x2, . . . , xD.]

Model for observations x is still a correlated Gaussian:

    p(z) = N(0, I)
    p(x|z) = N(Λz, Ψ)

    p(x) = ∫ p(z) p(x|z) dz = N(0, ΛΛᵀ + Ψ)

where Λ is a D × K matrix, and Ψ is D × D and diagonal.

Dimensionality Reduction: finds a low-dimensional projection of high dimensional data that
captures the correlation structure of the data.


Factor Analysis (cont.)

If dimensions are not equivalent, the equal variance assumption is inappropriate.

I ML learning finds Λ (“common factors”) and Ψ (“unique factors” or “uniquenesses”)
  given data.
I Parameters (corrected for symmetries): DK + D − K(K − 1)/2.
I If the number of parameters > D(D + 1)/2 the model is not identifiable (even after
  accounting for the rotational degeneracy discussed later).
I No closed form solution for the ML params of N(0, ΛΛᵀ + Ψ).
Factor Analysis projections

Our analysis for PPCA still applies:

    x̂n = Λ(I + ΛᵀΨ⁻¹Λ)⁻¹ΛᵀΨ⁻¹xn = xn − Ψ(ΛΛᵀ + Ψ)⁻¹xn

but now Ψ is diagonal but not spherical.

Note, though, that Λ is generally different from that found by PPCA.

And Λ is not unique: the latent space may be transformed by an arbitrary orthogonal
transform U (UᵀU = UUᵀ = I) without changing the likelihood.

    z̃ = Uz and Λ̃ = ΛUᵀ  ⇒  Λ̃z̃ = ΛUᵀUz = Λz

    −ℓ = (1/2) log |2π(ΛUᵀUΛᵀ + Ψ)| + (1/2) xᵀ(ΛUᵀUΛᵀ + Ψ)⁻¹x
       = (1/2) log |2π(Λ̃Λ̃ᵀ + Ψ)| + (1/2) xᵀ(Λ̃Λ̃ᵀ + Ψ)⁻¹x


Gradient methods for learning FA

Optimise the negative log-likelihood:

    −ℓ = (1/2) log |2π(ΛΛᵀ + Ψ)| + (1/2) xᵀ(ΛΛᵀ + Ψ)⁻¹x

w.r.t. Λ and Ψ (need matrix calculus) subject to constraints.

I No spectral short-cut exists.
I The likelihood can have more than one (local) optimum, making it difficult to find the
  global value.
I For some data (“Heywood cases”) the likelihood may grow unboundedly by taking one or
  more Ψdd → 0. This can be eliminated by assuming a prior on Ψ with zero density at
  Ψdd = 0, but results are sensitive to the precise choice of prior.

Expectation maximisation (next lecture) provides an alternative approach to maximisation, but
doesn’t solve these issues.
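The rotational degeneracy above is easy to check numerically. A minimal sketch (arbitrary example parameters, single data point as in the formula above):

```python
import numpy as np

def fa_neg_loglik(x, Lam, Psi_diag):
    """−ℓ for a single data point x under x ~ N(0, ΛΛᵀ + Ψ), Ψ diagonal."""
    D = x.shape[0]
    C = Lam @ Lam.T + np.diag(Psi_diag)
    _, logdetC = np.linalg.slogdet(C)
    return 0.5 * (D * np.log(2 * np.pi) + logdetC + x @ np.linalg.solve(C, x))

rng = np.random.default_rng(0)
D, K = 5, 2
Lam = rng.normal(size=(D, K))
Psi_diag = rng.uniform(0.5, 2.0, size=D)
x = rng.normal(size=D)

U, _ = np.linalg.qr(rng.normal(size=(K, K)))   # random orthogonal K x K
print(fa_neg_loglik(x, Lam, Psi_diag))
print(fa_neg_loglik(x, Lam @ U.T, Psi_diag))   # identical: Λ̃ = ΛUᵀ gives the same model
```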

FA vs PCA

I PCA and PPCA are rotationally invariant; FA is not.
  If x → Ux for unitary U, then λ(i)^PCA → Uλ(i)^PCA.
I FA is measurement scale invariant; PCA and PPCA are not.
  If x → Sx for diagonal S, then λ(i)^FA → Sλ(i)^FA.
I FA and PPCA define a probabilistic model; PCA does not.

[Note: it may be tempting to try to eliminate the scale-dependence of (P)PCA by
pre-processing data to equalise total variance on each axis. But (P)PCA assume equal noise
variance. Total variance has contributions from both ΛΛᵀ and noise, so this approach does
not exactly solve the problem.]


Canonical Correlations Analysis

Data vector pairs: D = {(u1, v1), (u2, v2), . . . } in spaces U and V.

Classic CCA
I Find unit vectors υ1 ∈ U, φ1 ∈ V such that the correlation of uiᵀυ1 and viᵀφ1 is
  maximised.
I As with PCA, repeat in orthogonal subspaces.

Probabilistic CCA
I Generative model with latent zi ∈ R^K:

      z ∼ N(0, I)
      u ∼ N(Υz, Ψu)    Ψu positive semi-definite
      v ∼ N(Φz, Ψv)    Ψv positive semi-definite

I Block diagonal noise.
Limitations of Gaussian, FA and PCA models

I Gaussian, FA and PCA models are easy to understand and use in practice.
I They are a convenient way to find interesting directions in very high dimensional data
  sets, e.g. as preprocessing.
I However, they make strong assumptions about the distribution of the data: only the
  mean and variance of the data are taken into account.
  The class of densities which can be modelled is too restrictive.

[Figure: scatter plot of data (xi1, xi2) with clearly non-Gaussian structure.]

By using mixtures of simple distributions, such as Gaussians, we can expand the class of
densities greatly.


Mixture Distributions

[Figure: the same clustered data (xi1, xi2).]

A mixture distribution has a single discrete latent variable:

      si ∼ Discrete[π]   (iid)
      xi | si ∼ Psi [θsi ]

Mixtures arise naturally when observations from different sources have been collated.
They can also be used to approximate arbitrary distributions.

The Mixture Likelihood

The mixture model is

      si ∼ Discrete[π]   (iid)
      xi | si ∼ Psi [θsi ]

Under the discrete distribution

      P(si = m) = πm;   πm ≥ 0,  Σ_{m=1}^{k} πm = 1

Thus, the probability (density) at a single data point xi is

      P(xi) = Σ_{m=1}^{k} P(xi | si = m) P(si = m)
            = Σ_{m=1}^{k} πm Pm(xi; θm)

The mixture distribution (density) is a convex combination (or weighted average) of the
component distributions (densities).


Approximation with a Mixture of Gaussians (MoG)

The component densities may be viewed as elements of a basis which can be combined to
approximate arbitrary distributions.

Here are examples where non-Gaussian densities are modelled (approximated) as a mixture
of Gaussians. The red curves show the (weighted) Gaussians, and the blue curve the
resulting density.

[Figures: a uniform, a triangular, and a heavy-tailed density, each approximated by a mixture
of Gaussians.]

Given enough mixture components we can model (almost) any density (as accurately as
desired), but still only need to work with the well-known Gaussian form.
Clustering with a MoG

In clustering applications, the latent variable si represents the (unknown) identity of the
cluster to which the ith observation belongs.

Thus, the latent distribution gives the prior probability of a data point coming from each
cluster.

P (si = m | π) = πm

Data from the mth cluster are distributed according to the mth component:

P (xi | si = m) = Pm (xi )

Once we observe a data point, the posterior probability distribution for the cluster it belongs to
is

      P(si = m | xi) = πm Pm(xi) / Σ_{m′} πm′ Pm′(xi)

This is often called the responsibility of the mth cluster for the ith data point.
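A hedged sketch of computing these responsibilities for a mixture of Gaussians (function and variable names are my own); working with log densities keeps the computation numerically stable:

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Log density of N(mu, Sigma) at each row of X."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum('nd,nd->n', diff @ np.linalg.inv(Sigma), diff)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

def responsibilities(X, pis, mus, Sigmas):
    """r[i, m] = P(s_i = m | x_i) for a mixture of Gaussians."""
    log_r = np.stack([np.log(pi) + log_gauss(X, mu, S)
                      for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)        # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```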

The MoG likelihood

Each component of a MoG is a Gaussian, with mean µm and covariance matrix Σm. Thus,
the probability density evaluated at a set of n iid observations, D = {x1 . . . xn} (i.e. the
likelihood) is

      p(D | {µm}, {Σm}, π) = Π_{i=1}^{n} Σ_{m=1}^{k} πm N(xi | µm, Σm)
                           = Π_{i=1}^{n} Σ_{m=1}^{k} πm |2πΣm|^{−1/2} exp{−(1/2)(xi − µm)ᵀ Σm⁻¹ (xi − µm)}

The log of the likelihood is

      log p(D | {µm}, {Σm}, π) = Σ_{i=1}^{n} log Σ_{m=1}^{k} πm |2πΣm|^{−1/2} exp{−(1/2)(xi − µm)ᵀ Σm⁻¹ (xi − µm)}

Note that the logarithm fails to simplify the component density terms. A mixture distribution
does not lie in the exponential family. Direct optimisation is not easy.


Maximum Likelihood for a Mixture Model

The log likelihood is:

      L = Σ_{i=1}^{n} log Σ_{m=1}^{k} πm Pm(xi; θm)

Its partial derivative wrt θm is

      ∂L/∂θm = Σ_{i=1}^{n} [ πm / Σ_{m′} πm′ Pm′(xi; θm′) ] ∂Pm(xi; θm)/∂θm

or, using ∂P/∂θ = P × ∂ log P/∂θ,

      ∂L/∂θm = Σ_{i=1}^{n} [ πm Pm(xi; θm) / Σ_{m′} πm′ Pm′(xi; θm′) ] ∂ log Pm(xi; θm)/∂θm
             = Σ_{i=1}^{n} rim ∂ log Pm(xi; θm)/∂θm

And its partial derivative wrt πm is

      ∂L/∂πm = Σ_{i=1}^{n} Pm(xi; θm) / Σ_{m′} πm′ Pm′(xi; θm′) = Σ_{i=1}^{n} rim/πm
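Because the log cannot be pushed inside the sum, the MoG log likelihood is usually evaluated with the log-sum-exp trick to avoid underflow. A minimal sketch (the use of scipy's multivariate normal is an implementation choice, not from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def mog_loglik(X, pis, mus, Sigmas):
    """Σ_i log Σ_m π_m N(x_i | µ_m, Σ_m), computed stably with log-sum-exp."""
    log_terms = np.stack([np.log(pi) + multivariate_normal(mu, S).logpdf(X)
                          for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)   # n x k
    return logsumexp(log_terms, axis=1).sum()
```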
MoG Derivatives

For a MoG, with θm = {µm, Σm} we get

      ∂L/∂µm = Σ_{i=1}^{n} rim Σm⁻¹ (xi − µm)

      ∂L/∂Σm⁻¹ = (1/2) Σ_{i=1}^{n} rim [ Σm − (xi − µm)(xi − µm)ᵀ ]

These equations can be used (along with Lagrangian derivatives wrt πm that enforce
normalisation) for gradient based learning; e.g., taking small steps in the direction of the
gradient (or using conjugate gradients).


The K-means Algorithm

The K-means algorithm is a limiting case of the mixture of Gaussians (c.f. PCA and Factor
Analysis).

Take πm = 1/k and Σm = σ²I, with σ² → 0. Then the responsibilities become binary

      rim → δ(m, argmin_l ‖xi − µl‖²)

with 1 for the component with the closest mean and 0 for all other components. We can then
solve directly for the means by setting the gradient to 0.

The k-means algorithm iterates these two steps:

I assign each point to its closest mean:  set rim = δ(m, argmin_l ‖xi − µl‖²)
I update the means to the average of their assigned points:  set µm = Σ_i rim xi / Σ_i rim

This usually converges within a few iterations, although the fixed point depends on the initial
values chosen for µm. The algorithm has no learning rate, but the assumptions are quite
limiting.
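A minimal implementation of the two-step iteration above (initialising the means at k randomly chosen data points is my own choice):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Iterate the two K-means steps described above."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), size=k, replace=False)]           # init at k random data points
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)    # squared distances, n x k
        assign = d2.argmin(axis=1)                                # hard (binary) responsibilities
        new_mus = np.stack([X[assign == m].mean(axis=0) if np.any(assign == m) else mus[m]
                            for m in range(k)])
        if np.allclose(new_mus, mus):                             # reached a fixed point
            return new_mus, assign
        mus = new_mus
    return mus, assign
```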

A preview of the EM algorithm

We wrote the k-means algorithm in terms of binary responsibilities. Suppose, instead, we
used the fractional responsibilities from the full (non-limiting) MoG, but still neglected the
dependence of the responsibilities on the parameters. We could then solve for both µm and
Σm.

The EM algorithm for MoGs iterates these two steps:

I Evaluate the responsibilities for each point given the current parameters.
I Optimise the parameters assuming the responsibilities stay fixed:

      µm = Σ_i rim xi / Σ_i rim    and    Σm = Σ_i rim (xi − µm)(xi − µm)ᵀ / Σ_i rim

Although this appears ad hoc, we will see (later) that it is a special case of a general
algorithm, and is actually guaranteed to increase the likelihood at each iteration.


Issues

There are several problems with these algorithms:

I slow convergence for the gradient based method
I gradient based method may develop invalid covariance matrices
I local minima; the end configuration may depend on the starting state
I how do you adjust k? Using the likelihood alone is no good.
I singularities; components with a single data point will have their covariance going to
  zero and the likelihood will tend to infinity.

We will attempt to address many of these as the course goes on.
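Returning to the EM preview above, a bare-bones sketch of the two steps for a MoG (π is held fixed, there is no convergence check, and densities are used directly rather than in the log domain, so this is illustrative only):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_mog_preview(X, pis, mus, Sigmas, n_iter=50):
    """Alternate: (E) responsibilities from current params; (M) re-estimate µ_m, Σ_m."""
    for _ in range(n_iter):
        # E-step: r[i, m] ∝ π_m N(x_i | µ_m, Σ_m)
        r = np.stack([pi * multivariate_normal(mu, S).pdf(X)
                      for pi, mu, S in zip(pis, mus, Sigmas)], axis=1)
        r /= r.sum(axis=1, keepdims=True)
        # M-step with responsibilities held fixed
        Nm = r.sum(axis=0)
        mus = (r.T @ X) / Nm[:, None]
        Sigmas = [(r[:, m, None] * (X - mus[m])).T @ (X - mus[m]) / Nm[m]
                  for m in range(len(pis))]
    return mus, Sigmas
```

For real data the E-step should be computed with log densities, as in the earlier responsibility sketch, to avoid underflow.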
