
Information theory and machine learning
I: Rate Distortion Theory, Deterministic Annealing, and Soft K-means.
II: Information Bottleneck Method.
Lossy Compression

Summarize data by keeping only relevant information and throwing away irrelevant information.

Need a measure for information -> Shannon’s mutual information.

Need a notion of relevance.


Relevance (1)
Shannon (1948): Define a function that measures the distortion between the original signal and its compressed representation.

Note: This is related to the similarity measure in unsupervised learning/cluster analysis.
Distortion function
Degree of freedom: which function to use is up to the experimenter.

It is not always obvious what function should be used, especially if the data do not live in a metric space, and so there is no “natural” measure.

Example: Speech.
Relevance (2)
Tishby, Pereira, Bialek (1999): Measure relevance directly via Shannon’s mutual information by defining:

Relevant information = the information about a variable of interest.

Then there is no need to define a distortion function ad hoc, and the appropriate similarity measure arises naturally.
Learning and lossy data compression
When we build a model of a data set, we map observations to a representation that summarizes the data in an efficient way.

Example: K-means. Map N data points to K clusters, with centroids x_c. If K << N, then we get a substantial compression! log(K) << log(N).

Recall: Entropy H[X] ~ log(N).
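A small worked example (the numbers N = 10^6 and K = 10 are illustrative, not from the slides): if the data points are a priori equally likely, identifying one of them costs H[X] = \log_2 N bits, while a cluster label costs at most \log_2 K bits,

H[X] = \log_2 10^6 \approx 19.9 \text{ bits}, \qquad H[C] \le \log_2 10 \approx 3.3 \text{ bits}.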


How “efficient” is the representation?
Entropy can be used to measure the compactness of the model. Sometimes called “statistical complexity” (J. P. Crutchfield).

This is a good measure of complexity if we are searching for a deterministic map (in clustering called a “hard partition”).

In general, we may search over probabilistic assignments (clustering: “soft partition”).
Trade-off between compression and accuracy
Think about a continuous variable.

To describe it exactly, you need infinitely many bits.

But only finitely many bits are needed to describe it up to some accuracy.
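An illustrative calculation (not on the slides): to specify a real number x ∈ [0, 1) up to accuracy ε = 2^{-n}, it suffices to transmit its first n binary digits, i.e.

n = \log_2(1/\epsilon) \text{ bits}, \qquad \epsilon \to 0 \;\Rightarrow\; n \to \infty,

so an exact description needs infinitely many bits, but any finite accuracy needs only finitely many.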
Rate distortion theory used for clustering
Find assignments p(c|x) from data x ∈ X to clusters c = 1, ..., K, and find class-representatives x_c (“cluster centers”), such that the average distortion D = \langle d(x, x_c) \rangle is small.

Compress the data by summarizing it efficiently in terms of bit-cost: minimize the coding rate, the information which the clusters retain about the raw data,

I(x, c) = \left\langle \log \frac{p(x, c)}{p(x)\, p(c)} \right\rangle
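This coding rate is the mutual information between the data and the cluster labels. A minimal Python sketch of how it can be computed from a joint distribution (the function name and array layout are my own illustrative choices):

```python
import numpy as np

def mutual_information_bits(p_xc):
    """I(x;c) in bits from a joint distribution p_xc[i, j] = p(x_i, c_j); entries sum to 1."""
    p_x = p_xc.sum(axis=1, keepdims=True)   # marginal p(x), shape (N, 1)
    p_c = p_xc.sum(axis=0, keepdims=True)   # marginal p(c), shape (1, K)
    mask = p_xc > 0                         # skip zero-probability terms (0 log 0 = 0)
    return float(np.sum(p_xc[mask] * np.log2(p_xc[mask] / (p_x * p_c)[mask])))
```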
Constrained optimization

\min_{p(c|x),\, x_c} \left[ I(x, c) + \beta \langle d(x, x_c) \rangle \right]

Solution:  p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]

x_c from  \frac{d}{dx_c} \langle d(x, x_c) \rangle_{p(x|c)} = 0   (centroid condition)
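A brief sketch of where the exponential form comes from (holding p(c) fixed and adding a Lagrange multiplier ν(x) for the normalization of p(c|x); a shortcut under these assumptions, not the full derivation): setting the variation to zero gives

p(x)\Big[ \log\frac{p(c|x)}{p(c)} + 1 + \beta\, d(x, x_c) \Big] + \nu(x) = 0
\;\Rightarrow\;
p(c|x) = \frac{p(c)}{Z(x, \beta)}\, \exp[-\beta\, d(x, x_c)],

with Z(x, β) = \sum_c p(c) \exp[-\beta d(x, x_c)] fixed by normalization, and p(c) = \sum_x p(x)\, p(c|x) holding self-consistently.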
Rate distortion curve
Family of optimal solutions, one for each value of the Lagrange multiplier beta. This parameter controls the trade-off between compression and fidelity.

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]

Evaluate the objective function at the optimum for each value of beta and plot I vs. D.
=> Rate-distortion curve.
Remarks
For squared-error distortion, d(x, x_c) = (x - x_c)^2 / 2, the centroid condition reduces to

x_c = \langle x \rangle_{p(x|c)}

“Soft K-means” because assignments can be probabilistic (fuzzy):

p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]
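Spelling out how the centroid condition reduces to the cluster mean in the squared-error case:

\frac{d}{dx_c} \Big\langle \tfrac{1}{2}(x - x_c)^2 \Big\rangle_{p(x|c)}
= \big\langle -(x - x_c) \big\rangle_{p(x|c)}
= x_c - \langle x \rangle_{p(x|c)} = 0
\;\Rightarrow\;
x_c = \langle x \rangle_{p(x|c)}.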
Remarks
In the zero-temperature limit, β → ∞, we have deterministic (or “hard”) assignments: with

c^* := \arg\min_c d(x, x_c), \qquad \Delta(x, c) := d(x, x_c) - d(x, x_{c^*}) > 0 \quad (c \neq c^*),

the assignment probability of the closest center is

p(c^*|x) = \frac{\exp[-\beta\, d(x, x_{c^*})]}{\sum_c \exp[-\beta\, d(x, x_c)]}
= \left[ 1 + \sum_{c \neq c^*} \exp[-\beta\, \Delta(x, c)] \right]^{-1} \;\to\; 1.

The analogy to thermodynamics inspired Deterministic Annealing (Rose, 1990).
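A tiny numerical illustration of this hardening as β grows (the distortion values and the uniform prior p(c) below are made up for illustration):

```python
import numpy as np

def soft_assignment(d, beta):
    """p(c|x) ∝ exp(-beta * d(x, x_c)) for one data point, assuming a uniform prior p(c)."""
    w = np.exp(-beta * (d - d.min()))   # subtract the minimum distortion for numerical stability
    return w / w.sum()

d = np.array([0.5, 0.1, 0.9])           # distortions to three cluster centers
for beta in (1.0, 10.0, 100.0):
    print(beta, np.round(soft_assignment(d, beta), 3))
# As beta grows, the assignment concentrates on the closest center: a hard partition.
```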
Soft K-means algorithm
Choose a distortion measure.

Fix the “temperature” T to a very large value (corresponds to a small β = 1/T).

Solve iteratively, until convergence:

Assignments:  p(c|x) = \frac{p(c)}{Z(x, \beta)} \exp\left[ -\beta\, d(x, x_c) \right]

Centroids:  \frac{d}{dx_c} \langle d(x, x_c) \rangle_{p(x|c)} = 0

Lower the temperature, T <- aT, and repeat.
a = “annealing rate”, a positive number smaller than 1.
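A minimal Python sketch of this scheme for the squared-error distortion (the initialization, parameter values, and stopping rule below are illustrative choices, not prescribed by the slides):

```python
import numpy as np

def soft_kmeans(X, K, beta0=0.1, beta_max=1e3, anneal=1.5, inner_iter=30, seed=0):
    """Soft K-means with deterministic annealing, distortion d(x, x_c) = ||x - x_c||^2 / 2.

    X: (N, D) data array. Returns cluster centers (K, D) and soft assignments p(c|x), shape (N, K).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    centers = X[rng.choice(N, size=K, replace=False)].astype(float)
    centers += 1e-3 * rng.standard_normal(centers.shape)       # break exact ties between centers
    p_c = np.full(K, 1.0 / K)                                   # cluster prior p(c)
    beta = beta0                                                # beta = 1/T: start at high temperature
    while beta < beta_max:
        for _ in range(inner_iter):
            # distortion d(x, x_c) for every (point, center) pair, shape (N, K)
            d = 0.5 * ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            # assignments: p(c|x) = p(c) exp(-beta d) / Z(x, beta)
            w = p_c * np.exp(-beta * (d - d.min(axis=1, keepdims=True)))
            p_cx = w / w.sum(axis=1, keepdims=True)
            # centroid condition for squared error: x_c = <x>_{p(x|c)}, with p(x) = 1/N
            p_xc = p_cx / (p_cx.sum(axis=0, keepdims=True) + 1e-12)   # columns are p(x|c)
            centers = p_xc.T @ X
            p_c = p_cx.mean(axis=0)                             # p(c) = sum_x p(x) p(c|x)
        beta *= anneal                                          # lower the temperature: T <- aT, i.e. beta <- beta/a
    return centers, p_cx
```

Sweeping beta from small to large and recording (D, I) at each optimum is exactly what traces out the rate-distortion curve of the next slide.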


Rate-distortion curve

[Figure: rate-distortion curves in the (distortion, bit rate) plane for K = 2, K = 3, etc.; the region above each curve is feasible, the region below is infeasible.]
How to choose the distortion function?
Example: Speech.

Cluster speech such that signals which encode the same word will group together.

May be extremely difficult to define a distortion function which achieves this.

Intelligibility criterion? Should measure how well the meaning is preserved.
Information Bottleneck Method
Tishby, Pereira, Bialek, 1999

Instead of guessing a distortion function, define relevant information as the information the data carries about a quantity of interest (example: phonemes or words).

Data is compressed such that as much relevant information as possible is kept.

[Diagram: X → C; the compression minimizes I(x, c) while maximizing the relevant information I(c, y) about y.]
Constrained optimization
\max_{p(c|x)} \left[ I(y, c) - \lambda I(x, c) \right]

Optimal assignment rule?
Constrained optimization
\max_{p(c|x)} \left[ I(y, c) - \lambda I(x, c) \right]

Optimal assignment rule:

p(c|x) = \frac{p(c)}{Z(x, \lambda)} \exp\left[ -\frac{1}{\lambda} D_{KL}\!\left[ p(y|x) \,\|\, p(y|c) \right] \right]

The Kullback-Leibler divergence emerges as the distortion function:

D_{KL}\!\left[ p(y|x) \,\|\, p(y|c) \right] = \sum_y p(y|x) \log_2 \frac{p(y|x)}{p(y|c)}
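A minimal sketch of the resulting self-consistent iteration (the re-estimation of p(c) and p(y|c) follows Tishby, Pereira, Bialek 1999; the function name, initialization, and parameter values are my own illustrative choices):

```python
import numpy as np

def information_bottleneck(p_xy, K, lam=0.1, n_iter=200, seed=0):
    """Iterative IB: alternate updates of p(c|x), p(c), p(y|c).

    p_xy: joint distribution, p_xy[i, j] = p(x_i, y_j); entries sum to 1.
    lam:  the trade-off parameter lambda from the objective above.
    """
    rng = np.random.default_rng(seed)
    N, M = p_xy.shape
    p_x = p_xy.sum(axis=1)                               # p(x)
    p_y_given_x = p_xy / p_x[:, None]                    # p(y|x)
    p_c_given_x = rng.dirichlet(np.ones(K), size=N)      # random soft initialization, shape (N, K)
    eps = 1e-12
    for _ in range(n_iter):
        p_c = p_x @ p_c_given_x                          # p(c) = sum_x p(x) p(c|x)
        # p(y|c) = sum_x p(x) p(c|x) p(y|x) / p(c)
        p_y_given_c = (p_c_given_x * p_x[:, None]).T @ p_y_given_x / (p_c[:, None] + eps)
        # D_KL[p(y|x) || p(y|c)] for every (x, c) pair; natural log used here
        # (the log2 on the slide only rescales lam by a factor of ln 2)
        ratio = (p_y_given_x[:, None, :] + eps) / (p_y_given_c[None, :, :] + eps)
        dkl = (p_y_given_x[:, None, :] * np.log(ratio)).sum(axis=-1)
        # p(c|x) = p(c) exp(-D_KL / lam) / Z(x, lam)
        w = p_c * np.exp(-(dkl - dkl.min(axis=1, keepdims=True)) / lam)
        p_c_given_x = w / w.sum(axis=1, keepdims=True)
    return p_c_given_x, p_y_given_c
```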
Information Plane
[Figure: the information plane, relevant information I(c, y) versus compression I(c, x); optimal solutions for K = 2, 3, 4, etc. are marked, with the infeasible region above the curve.]
Homework

Implement the Soft K-means algorithm.

Helpful reading (on course website): K. Rose, “Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2210-2239, November 1998.
