Maneesh Sahani
maneesh@gatsby.ucl.ac.uk
p(x, z; θ_x, θ_z) = p(x | z; θ_x) p(z; θ_z)

p(x; θ_x, θ_z) = ∫ dz p(x | z; θ_x) p(z; θ_z)

- Combine exponential family distributions into richer, more flexible forms.
- P(z), P(x|z) and even P(x, z) may be in the exponential family.
- P(x) rarely is. (Exception: linear Gaussian models.)
Data: D = X = {x_1, x_2, ..., x_N};  x_i ∈ R^D
Latents: Z = {z_1, z_2, ..., z_N};  z_i ∈ R^K
Linear generative model:  x_d = Σ_{k=1}^K Λ_{dk} z_k + ε_d
p(z) = N (0, I )
Note: E_x[f(x)] = E_z[E_{x|z}[f(x)]];  V_x[x] = E_z[V[x|z]] + V_z[E[x|z]]
p(x|z) = N (Λz, ψ I )
p(x) = ∫ p(z) p(x|z) dz = N( E_z[Λz], E_z[Λ z z^T Λ^T] + ψI ) = N(0, ΛΛ^T + ψI)
Example (D = 2, K = 1):

x ∼ N(0, [[3, 2], [2, 3]])  ⇔  z ∼ N(0, 1),  x | z ∼ N(√2 [1, 1]^T z, I)

where Λ is a D × K matrix.
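As a quick sanity check of this example, here is a minimal numpy sketch (not from the original slides; parameter names are illustrative). It builds Λ = √2 [1, 1]^T with unit noise variance and confirms, both analytically and by sampling, that the marginal covariance of x is [[3, 2], [2, 3]].

```python
# Numerical check of the 2-D example above (illustrative, not from the slides):
# with K = 1, Lambda = sqrt(2) * [1, 1]^T and psi = 1, the marginal covariance
# Lambda Lambda^T + psi I should equal [[3, 2], [2, 3]].
import numpy as np

rng = np.random.default_rng(0)
Lam = np.sqrt(2) * np.ones((2, 1))   # D x K loading matrix
psi = 1.0                            # isotropic noise variance

# Analytic marginal covariance of x
C = Lam @ Lam.T + psi * np.eye(2)
print(C)                             # [[3. 2.] [2. 3.]]

# Monte-Carlo check: sample z ~ N(0, 1), then x | z ~ N(Lambda z, psi I)
z = rng.standard_normal((100_000, 1))
x = z @ Lam.T + np.sqrt(psi) * rng.standard_normal((100_000, 2))
print(np.cov(x.T))                   # close to [[3, 2], [2, 3]]
```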
Assume data D = {xi } have zero mean (if not, subtract it).
- Find the direction orthogonal to λ^(1) with greatest variance – λ^(2).

[Figure: data scatter over (x_2, x_3) showing the second principal direction λ^(2).]
In a Gaussian model, the ML parameters will find the K-dimensional space of most variance.
The sample covariance has the eigendecomposition

S = Σ_i ω^(i) u^(i) u^(i)T = U W U^T

where U = [u^(1), u^(2), ..., u^(D)] collects the eigenvectors and W = diag(ω^(1), ω^(2), ..., ω^(D)).

The eigenvalue spectrum shows how variance is distributed across dimensions; it can identify transitions that might separate signal from noise, or the number of components needed to capture a determined fraction of variance.

[Figure: eigenvalue (variance) against eigenvalue number, and the corresponding fraction of variance captured.]
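A minimal sketch (not from the slides) of the computation behind such a scree plot: eigendecompose the sample covariance and report the cumulative fraction of variance captured by the leading eigenvectors. The helper name pca_spectrum is illustrative.

```python
# Illustrative helper (not from the slides): eigendecomposition of the sample
# covariance S = U W U^T and the cumulative fraction of variance captured.
import numpy as np

def pca_spectrum(X):
    """X: N x D zero-mean data. Returns eigenvalues (descending), eigenvectors,
    and the cumulative fraction of variance."""
    S = X.T @ X / X.shape[0]                 # sample covariance
    w, U = np.linalg.eigh(S)                 # eigenvalues in ascending order
    order = np.argsort(w)[::-1]
    w, U = w[order], U[:, order]             # sort descending
    cumfrac = np.cumsum(w) / np.sum(w)       # fraction of variance captured
    return w, U, cumfrac
```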
PCA subspace
The K principal components define the K-dimensional subspace of greatest variance.
[Figure: data scatter over (x_1, x_2, x_3) with the principal subspace.]
- Each data point x_n is associated with a projection x̂_n into the principal subspace:

  x̂_n = Σ_{k=1}^K x_n^(k) λ^(k)
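A small illustrative helper (assumed names, not from the slides) computing this projection, with the leading eigenvectors of S playing the role of the basis vectors λ^(k) and x_n^(k) taken to be the coordinate of x_n along λ^(k).

```python
# Sketch (assumed helper, not from the slides): project each data point onto
# the K-dimensional principal subspace, x_hat_n = sum_k x_n^(k) lambda^(k).
import numpy as np

def project_to_principal_subspace(X, U, K):
    """X: N x D zero-mean data; U: D x D eigenvectors of S (descending order)."""
    Uk = U[:, :K]            # top-K eigenvectors as the basis lambda^(k)
    Z = X @ Uk               # coordinates in the principal subspace (N x K)
    X_hat = Z @ Uk.T         # projections expressed in the original space (N x D)
    return Z, X_hat
```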
[Example of PCA: Eigenfaces — figure only.]

[Example of PCA: Genetic variation within Europe — figure only.]
Equivalent ways to characterise PCA:

- Find K directions of greatest variance in data.
- Find the K-dimensional orthogonal projection that preserves the greatest variance.
- Find K-dimensional vectors z_i and a matrix Λ so that x̂_i = Λ z_i is as close as possible (in squared distance) to x_i.
- ... (many others)

PCA as maximising mutual information:

Problem: Given x, find z = Ax with columns of A unit vectors, s.t. I(z; x) is maximised (assuming that P(x) is Gaussian).

I(z; x) = H(z) + H(x) − H(z, x) = H(z)

So we want to maximise the entropy of z. What is the entropy of a Gaussian?

H(z) = −∫ dz p(z) ln p(z) = (1/2) ln |Σ| + (D/2)(1 + ln 2π)

Therefore we want the distribution of z to have the largest volume (i.e. determinant of the covariance matrix):

Σ_z = A Σ_x A^T = A U W_x U^T A^T

So A should be aligned with the columns of U which are associated with the largest eigenvalues (variances).

Projection to the principal component subspace preserves the most information about the (Gaussian) data.
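The entropy argument can be checked numerically. Below is a hedged sketch (not from the slides): among unit-vector projections z = Ax of a Gaussian x, choosing the projection vectors as the top-K eigenvectors of Σ_x gives a larger Gaussian entropy than a random choice of unit vectors.

```python
# Sketch (illustrative, not from the slides): compare the Gaussian entropy of
# z = A x when the projection vectors are the top-K eigenvectors of Sigma_x
# versus random unit vectors; the eigenvector projection has larger entropy.
import numpy as np

def gaussian_entropy(Sigma):
    K = Sigma.shape[0]
    return 0.5 * np.linalg.slogdet(Sigma)[1] + 0.5 * K * (1 + np.log(2 * np.pi))

rng = np.random.default_rng(1)
D, K = 5, 2
M = rng.standard_normal((D, D))
Sigma_x = M @ M.T                                # some covariance for x

w, U = np.linalg.eigh(Sigma_x)
A_pca = U[:, np.argsort(w)[::-1][:K]].T          # rows = top-K eigenvectors
A_rand = rng.standard_normal((K, D))
A_rand /= np.linalg.norm(A_rand, axis=1, keepdims=True)  # unit-norm rows

print(gaussian_entropy(A_pca @ Sigma_x @ A_pca.T))   # largest entropy
print(gaussian_entropy(A_rand @ Sigma_x @ A_rand.T)) # typically smaller
```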
[Figure: linear autoencoder — output units x̂_1 ... x̂_D, decoder Q ("generation"), hidden units z_1 ... z_K, encoder P ("recognition").]

Learning: argmin_{P,Q} ‖x̂ − x‖²,  with x̂ = Qz and z = Px.

Maximum-likelihood learning of Λ:

ℓ = −(N/2) log |2πC| − (N/2) Tr[C^{-1} S],  where C = ΛΛ^T + ψI

∂ℓ/∂Λ = −(N/2) ∂/∂Λ log |C| − (N/2) ∂/∂Λ Tr[C^{-1} S] = N(−C^{-1}Λ + C^{-1} S C^{-1} Λ)

So at the stationary points we have S C^{-1} Λ = Λ. This implies either:

- Λ = 0, which turns out to be a minimum.
- C = S ⇒ ΛΛ^T = S − ψI. Now rank(ΛΛ^T) ≤ K ⇒ rank(S − ψI) ≤ K ⇒ S has D − K eigenvalues equal to ψ and Λ aligns with the space of the remaining eigenvectors.
- or, taking the SVD Λ = ULV^T:

  S (UL²U^T + ψI)^{-1} ULV^T = ULV^T  ⇒  S (UL²U^T + ψI)^{-1} U = U

  Using U(L² + ψI) = (UL²U^T + ψI)U  ⇒  (UL²U^T + ψI)^{-1} U = U (L² + ψI)^{-1}

  ⇒ S U (L² + ψI)^{-1} = U      [×(L² + ψI)]

  ⇒ S U = U (L² + ψI)
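A hedged numerical check (not from the slides) of this stationary condition: construct the standard PPCA maximum-likelihood solution from the eigendecomposition of S (ψ set to the mean of the discarded eigenvalues, Λ built from the top-K eigenvectors, as in Tipping & Bishop) and verify that S C^{-1} Λ = Λ holds.

```python
# Sketch (illustrative, not from the slides): verify S C^{-1} Lambda = Lambda
# at the standard PPCA ML solution: psi = mean of the D-K smallest eigenvalues
# of S, Lambda = U_K (W_K - psi I)^{1/2}.
import numpy as np

rng = np.random.default_rng(2)
D, K, N = 6, 2, 5000
X = rng.standard_normal((N, K)) @ rng.standard_normal((K, D)) \
    + 0.3 * rng.standard_normal((N, D))
S = X.T @ X / N                                  # sample covariance

w, U = np.linalg.eigh(S)
order = np.argsort(w)[::-1]
w, U = w[order], U[:, order]                     # eigenvalues descending

psi = w[K:].mean()                               # ML noise variance
Lam = U[:, :K] @ np.diag(np.sqrt(w[:K] - psi))   # ML loadings (up to rotation)

C = Lam @ Lam.T + psi * np.eye(D)
print(np.allclose(S @ np.linalg.solve(C, Lam), Lam))   # True
```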
[Note: it may be tempting to try to eliminate the scale-dependence of (P)PCA by pre-processing data to equalise total variance on each axis. But P(PCA) assume equal noise variance. Total variance has contributions from both ΛΛ^T and noise, so this approach does not exactly solve the problem.]

A shared latent variable can also underlie two observed vectors:

z ∼ N(0, I)
u ∼ N(Υz, Ψ_u),  Ψ_u ⪰ 0
v ∼ N(Φz, Ψ_v),  Ψ_v ⪰ 0

- Block diagonal noise.
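For concreteness, a small sampling sketch (assumed shapes and parameter names, not from the slides) of this shared-latent model; stacking u and v gives a joint Gaussian whose covariance is [Υ; Φ][Υ; Φ]^T plus a block-diagonal noise term.

```python
# Sketch (illustrative, not from the slides): sample from z ~ N(0, I),
# u ~ N(Upsilon z, Psi_u), v ~ N(Phi z, Psi_v); the joint noise covariance of
# the stacked vector (u, v) is block diagonal.
import numpy as np

rng = np.random.default_rng(3)
K, Du, Dv = 2, 4, 3
Upsilon = rng.standard_normal((Du, K))                  # loading of u on z
Phi = rng.standard_normal((Dv, K))                      # loading of v on z
Au = rng.standard_normal((Du, Du)); Psi_u = Au @ Au.T   # PSD noise covariances
Av = rng.standard_normal((Dv, Dv)); Psi_v = Av @ Av.T

# one joint sample
z = rng.standard_normal(K)
u = rng.multivariate_normal(Upsilon @ z, Psi_u)
v = rng.multivariate_normal(Phi @ z, Psi_v)

# implied covariance of (u, v): shared part plus block-diagonal noise
Lam = np.vstack([Upsilon, Phi])
noise = np.block([[Psi_u, np.zeros((Du, Dv))],
                  [np.zeros((Dv, Du)), Psi_v]])
C_joint = Lam @ Lam.T + noise
```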
Limitations of Gaussian, FA and PCA models
- Gaussian, FA and PCA models are easy to understand and use in practice.
- They are a convenient way to find interesting directions in very high-dimensional data sets, e.g. as preprocessing.
- However, they make strong assumptions about the distribution of the data: only the mean and variance of the data are taken into account. The class of densities which can be modelled is too restrictive.
Mixture Distributions

[Figure: data scatter over (x_i1, x_i2).]
In clustering applications, the latent variable s_i represents the (unknown) identity of the cluster to which the i-th observation belongs. Thus, the latent distribution gives the prior probability of a data point coming from each cluster.
P(s_i = m | π) = π_m

Data from the m-th cluster are distributed according to the m-th component:

P(x_i | s_i = m) = P_m(x_i)

Once we observe a data point, the posterior probability distribution for the cluster it belongs to is
P(s_i = m | x_i) = π_m P_m(x_i) / Σ_{m'} π_{m'} P_{m'}(x_i)
This is often called the responsibility of the m-th cluster for the i-th data point.
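A minimal numpy sketch (not from the slides; the one-dimensional Gaussian components and parameters below are illustrative) that evaluates these responsibilities directly from the formula above.

```python
# Illustrative example (not from the slides): responsibilities for a 1-D
# mixture of Gaussians, r_im = pi_m P_m(x_i) / sum_m' pi_m' P_m'(x_i).
import numpy as np

def responsibilities(x, pis, mus, sigmas):
    """Posterior P(s_i = m | x_i) for a 1-D mixture of Gaussians."""
    mus, sigmas, pis = map(np.asarray, (mus, sigmas, pis))
    lik = np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (np.sqrt(2 * np.pi) * sigmas)
    post = pis * lik                      # numerator: pi_m * P_m(x)
    return post / post.sum()              # normalise over components

# example: a point near the first mean is mostly "owned" by the first component
print(responsibilities(0.2, pis=[0.5, 0.5], mus=[0.0, 3.0], sigmas=[1.0, 1.0]))
```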
MoG Derivatives

∂ℓ/∂π_m = Σ_{i=1}^N P_m(x_i; θ_m) / Σ_{m'=1}^k π_{m'} P_{m'}(x_i; θ_{m'}) = Σ_{i=1}^N r_im / π_m

The K-means Algorithm
The K-means algorithm is a limiting case of the mixture of Gaussians (cf. PCA and Factor Analysis).

This usually converges within a few iterations, although the fixed point depends on the initial values chosen for µ_m. The algorithm has no learning rate, but the assumptions are quite limiting.
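A short sketch (not from the slides) of the limiting-case algorithm itself: with equal, vanishing component covariances the responsibilities become hard assignments to the nearest mean, and each update step simply averages the assigned points.

```python
# Illustrative sketch (not from the slides): K-means as the limiting case of a
# mixture of Gaussians with equal, vanishing covariances -- hard assignments
# to the nearest mean, then means recomputed as plain averages.
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), K, replace=False)]               # initial means
    for _ in range(n_iter):
        d = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(-1)    # squared distances
        s = d.argmin(axis=1)                                    # hard "responsibilities"
        mus = np.array([X[s == m].mean(axis=0) if np.any(s == m) else mus[m]
                        for m in range(K)])                     # update means
    return mus, s
```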