
Information Theory

Mark van Rossum

School of Informatics, University of Edinburgh

January 24, 2018

Version: January 24, 2018
1 / 35
Why information theory

Understanding the neural code.


Encoding and decoding: we imposed coding schemes, such as a
2nd-order kernel or the LNP model, and possibly lost information in doing so.
Instead, use information theory:
No need to impose an encoding or decoding scheme (non-parametric).
Particularly important for 1) spike-timing codes, 2) higher areas.
Estimate how much information is coded in a given signal.
Caveats:
No easy decoding scheme for the organism (upper bound only)
Requires more data, and biases are tricky

2 / 35
Overview

Entropy, Mutual Information


Entropy Maximization for a Single Neuron
Maximizing Mutual Information
Estimating information
Reading: Dayan and Abbott, ch. 4; Rieke et al., 1996

3 / 35
Definition

The entropy of a quantity is defined as

H(X) = − Σ_x P(x) log2 P(x)

This is not 'derived', nor fully unique, but it fulfills these criteria:
Continuous.
If p_i = 1/n, it increases monotonically with n: H = log2 n.
Parallel independent channels add.
"Unit": bits.
Entropy can be thought of as physical entropy, the "richness" of a distribution.
[Shannon and Weaver, 1949, Cover and Thomas, 1991,
Rieke et al., 1996]
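
A minimal numerical check of these criteria (a Python sketch; the helper name entropy_bits is just illustrative):

import numpy as np

def entropy_bits(p):
    # H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes are skipped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Uniform distribution over n outcomes: H = log2 n, increasing with n
for n in (2, 4, 8):
    print(n, entropy_bits(np.ones(n) / n))       # 1.0, 2.0, 3.0 bits

# Parallel independent channels add: H(X, Y) = H(X) + H(Y)
px, py = np.array([0.5, 0.5]), np.array([0.25, 0.75])
print(entropy_bits(np.outer(px, py).ravel()),
      entropy_bits(px) + entropy_bits(py))       # both ~1.81 bits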

4 / 35
Entropy

Discrete variable R:

H(R) = − Σ_r p(r) log2 p(r)

Continuous variable at resolution ∆r:

H(R) = − Σ_r p(r)∆r log2(p(r)∆r) = − Σ_r p(r)∆r log2 p(r) − log2 ∆r

Letting ∆r → 0 we have

lim_{∆r→0} [H + log2 ∆r] = − ∫ p(r) log2 p(r) dr

(also called differential entropy)
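
As a numerical sanity check of this limit (Python sketch; the Gaussian is only a convenient example, with differential entropy (1/2) log2(2πeσ²)):

import numpy as np

sigma = 1.0
analytic = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of N(0, sigma^2)

for dr in (0.5, 0.1, 0.01):
    r = np.arange(-10.0, 10.0, dr)              # discretized axis at resolution dr
    p = np.exp(-r**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    P = p * dr                                  # probability per bin
    H = -np.sum(P[P > 0] * np.log2(P[P > 0]))   # discrete entropy, diverges like -log2(dr)
    print(dr, H + np.log2(dr), analytic)        # H + log2(dr) approaches the analytic value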

5 / 35
Joint, Conditional entropy

Joint entropy:

H(S, R) = − Σ_{r,s} P(s, r) log2 P(s, r)

Conditional entropy:

H(S|R) = Σ_r P(R = r) H(S|R = r)
       = − Σ_r P(r) Σ_s P(s|r) log2 P(s|r)
       = H(S, R) − H(R)

If S, R are independent

H(S, R) = H(S) + H(R)

6 / 35
Mutual information

Mutual information:

I_m(R; S) = Σ_{r,s} p(r, s) log2 [ p(r, s) / (p(r) p(s)) ]
          = H(R) − H(R|S) = H(S) − H(S|R)

Measures the reduction in uncertainty about R from knowing S (or vice versa).
I_m(R; S) ≥ 0.
The continuous version is the difference of two entropies; the divergent log2 ∆r terms cancel.

7 / 35
Mutual Information

The joint histogram determines the mutual information:
given P(r, s) ⇒ I_m.

8 / 35
Mutual Information: Examples

Two joint distributions P(Y1, Y2), with Y1 = smoker / non-smoker and
Y2 = lung cancer / no lung cancer:

Left:                 smoker   non-smoker
  lung cancer          1/3        0
  no lung cancer        0        2/3

Right:                smoker   non-smoker
  lung cancer          1/9       2/9
  no lung cancer       2/9       4/9

Only for the left joint probability is I_m > 0 (how much?). On the right,
knowledge about Y1 does not inform us about Y2.
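
A quick computation of I_m for these two tables (Python sketch; probabilities taken directly from the slide):

import numpy as np

def mutual_information_bits(pjoint):
    # I_m = sum_{r,s} p(r,s) log2[ p(r,s) / (p(r) p(s)) ] for a 2-D joint probability table
    pjoint = np.asarray(pjoint, dtype=float)
    p1 = pjoint.sum(axis=1, keepdims=True)    # marginal of the row variable
    p2 = pjoint.sum(axis=0, keepdims=True)    # marginal of the column variable
    m = pjoint > 0
    return np.sum(pjoint[m] * np.log2(pjoint[m] / (p1 * p2)[m]))

left  = [[1/3, 0.0], [0.0, 2/3]]   # Y1 fully determines Y2
right = [[1/9, 2/9], [2/9, 4/9]]   # equals the product of its marginals

print(mutual_information_bits(left))    # ~0.918 bits (= H of a 1/3-2/3 variable)
print(mutual_information_bits(right))   # 0 bits: independent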

10 / 35
Kullback-Leibler divergence

The KL divergence measures the distance between two probability distributions:

D_KL(P||Q) = ∫ P(x) log2 [P(x)/Q(x)] dx, or D_KL(P||Q) ≡ Σ_i P_i log2(P_i/Q_i)

Not symmetric, but can be symmetrized.
I_m(R; S) = D_KL(p(r, s) || p(r)p(s)).
Often used as a probabilistic cost function: D_KL(data||model).
Other probability distances exist (e.g. the earth mover's distance)
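
A small sketch of D_KL and its asymmetry (Python; the two distributions are arbitrary illustrations):

import numpy as np

def dkl_bits(p, q):
    # D_KL(P||Q) = sum_i P_i log2(P_i / Q_i); assumes Q_i > 0 wherever P_i > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

P = np.array([0.5, 0.4, 0.1])
Q = np.ones(3) / 3
print(dkl_bits(P, Q), dkl_bits(Q, P))   # ~0.22 vs ~0.30 bits: not symmetric
print(dkl_bits(P, P))                   # 0: divergence of a distribution to itself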

11 / 35
Mutual info between jointly Gaussian variables

I(Y1; Y2) = ∫∫ P(y1, y2) log2 [ P(y1, y2) / (P(y1)P(y2)) ] dy1 dy2 = −(1/2) log2(1 − ρ²)

where ρ is the (Pearson) correlation coefficient.


12 / 35
Populations of Neurons

Given

H(R) = − ∫ p(r) log2 p(r) dr − N log2 ∆r

and

H(R_i) = − ∫ p(r_i) log2 p(r_i) dr_i − log2 ∆r

we have

H(R) ≤ Σ_i H(R_i)

(proof: consider the KL divergence between p(r) and Π_i p(r_i))

13 / 35
Mutual information in populations of Neurons

Redundancy can be defined as (compare to the above)

R = Σ_{i=1}^{n_r} I(r_i; s) − I(r; s).

Some codes have R > 0 (redundant code), others R < 0 (synergistic).

Example of a synergistic code: P(r1, r2, s) with
P(0,0,1) = P(0,1,0) = P(1,0,0) = P(1,1,1) = 1/4
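
Checking the redundancy R for this example numerically (a self-contained Python sketch):

import numpy as np

def mi_bits(pjoint):
    # mutual information (bits) of a 2-D joint probability table
    pjoint = np.asarray(pjoint, float)
    pa = pjoint.sum(axis=1, keepdims=True)
    pb = pjoint.sum(axis=0, keepdims=True)
    m = pjoint > 0
    return np.sum(pjoint[m] * np.log2(pjoint[m] / (pa * pb)[m]))

# P(r1, r2, s): the four listed triplets each have probability 1/4 (an XOR-like code)
p = np.zeros((2, 2, 2))
for r1, r2, s in [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    p[r1, r2, s] = 1 / 4

I_r1_s = mi_bits(p.sum(axis=1))       # marginalize out r2: I(r1; s) = 0
I_r2_s = mi_bits(p.sum(axis=0))       # marginalize out r1: I(r2; s) = 0
I_r_s  = mi_bits(p.reshape(4, 2))     # treat (r1, r2) as one word: I(r; s) = 1 bit
print(I_r1_s + I_r2_s - I_r_s)        # R = -1: synergistic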

14 / 35
Entropy Maximization for a Single Neuron

I_m(R; S) = H(R) − H(R|S)

If the noise entropy H(R|S) is independent of the transformation
S → R, we can maximize the mutual information by maximizing H(R)
under the given constraints.
Possible constraint: the response is bounded, 0 < r < rmax. H(R) is maximal
if p(r) ∼ U(0, rmax) (U is the uniform distribution).
If the average firing rate is limited and 0 < r < ∞: the exponential
distribution is optimal, p(x) = (1/x̄) exp(−x/x̄), with H = log2(e x̄).
If the variance is fixed and −∞ < r < ∞: the Gaussian distribution,
H = (1/2) log2(2πeσ²) (note the funny units).

15 / 35
Let r = f(s) and s ∼ p(s). Which f (assumed monotonic)
maximizes H(R) under the maximal firing rate constraint? Require:

P(r) = 1/rmax

p(s) = p(r) dr/ds = (1/rmax) df/ds

Thus df/ds = rmax p(s) and

f(s) = rmax ∫_{smin}^{s} p(s′) ds′

This strategy is known as histogram equalization in signal processing.
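
A sketch of histogram equalization on simulated stimuli (Python; the Gaussian stimulus distribution and rmax = 100 are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
rmax = 100.0                          # maximal response, e.g. in spikes/s (arbitrary)
s = rng.normal(0.0, 1.0, 100000)      # stimuli drawn from some p(s); Gaussian as an example

# Empirical version of f(s) = rmax * integral_{smin}^{s} p(s') ds', i.e. rmax times the CDF
order = np.argsort(s)
cdf = np.empty_like(s)
cdf[order] = (np.arange(s.size) + 0.5) / s.size
r = rmax * cdf                        # responses after histogram equalization

# The response distribution is now (approximately) uniform on (0, rmax)
hist, _ = np.histogram(r, bins=10, range=(0, rmax), density=True)
print(hist * rmax)                    # all bins close to 1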

16 / 35
Fly retina
Evidence that the large monopolar cell in the fly visual system carries
out histogram equalization

Contrast response of the fly large monopolar cell (points) matches the
environmental statistics (line) [Laughlin, 1981] (but changes under
high-noise conditions).
17 / 35
V1 contrast responses

Similar in V1, but On and Off channels [Brady and Field, 2000]
18 / 35
Information of time varying signals

Single analog channel with Gaussian signal s and Gaussian noise η:

r = s + η

I = (1/2) log2(1 + σ_s²/σ_η²) = (1/2) log2(1 + SNR)

For time-dependent signals: I = (T/2) ∫ (dω/2π) log2(1 + s(ω)/n(ω))

To maximize information when the variance of the signal is constrained,
use all frequency bands such that signal + noise = constant
(whitening). Water-filling analogy:
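
A sketch of the water-filling allocation (Python; the noise spectrum and the power budget are made-up numbers for illustration):

import numpy as np

noise = np.array([0.2, 0.5, 1.0, 2.0, 4.0])    # noise power n(w) in a few frequency bands
P_total = 3.0                                  # total signal power budget (arbitrary)

# Water filling: s(w) = max(0, nu - n(w)), with the level nu set so that sum s = P_total
lo, hi = noise.min(), noise.max() + P_total
for _ in range(100):                           # bisection on the water level nu
    nu = 0.5 * (lo + hi)
    if np.maximum(0.0, nu - noise).sum() < P_total:
        lo = nu
    else:
        hi = nu

S = np.maximum(0.0, nu - noise)
print(S)                                       # power goes only into the low-noise bands
print(S + noise)                               # equals nu wherever S > 0: signal + noise constant
print(0.5 * np.sum(np.log2(1.0 + S / noise)))  # resulting information (bits per band sample)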

19 / 35
Information of graded synapses

Light → (photon noise) → photoreceptor → (synaptic noise) → LMC


At low light levels photon noise dominates, synaptic noise is negligible.
Information rate: 1500 bits/s
[de Ruyter van Steveninck and Laughlin, 1996].

20 / 35
Spiking neurons: maximal information

Spike train with N = T/δt bins [MacKay and McCulloch, 1952], with δt the
"time resolution".

With pN = N1 spikes (events), the number of words is N!/(N1!(N − N1)!).
Entropy is maximal if all words are equally likely:
H = −Σ_i p_i log2 p_i = log2 N! − log2 N1! − log2(N − N1)!
Use, for large x, that log x! ≈ x(log x − 1):

H = −(T/δt) [p log2 p + (1 − p) log2(1 − p)]

For low rates p ≪ 1, setting λ = p/δt:

H = Tλ log2(e/(λδt))
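
A numerical check of the low-rate approximation (Python sketch; T, δt and the firing rate are arbitrary example values):

import numpy as np

T, dt = 1.0, 0.001            # 1 s of spike train at 1 ms resolution
rate = 20.0                   # firing rate lambda in spikes/s
p = rate * dt                 # spike probability per bin, p << 1

H_exact  = -(T / dt) * (p * np.log2(p) + (1 - p) * np.log2(1 - p))
H_approx = T * rate * np.log2(np.e / (rate * dt))
print(H_exact, H_approx)      # both around 141-142 bits for these numbers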

21 / 35
Spiking neurons

The calculation is incorrect when there can be multiple spikes per bin. Instead, for
large bins the information is maximal for an exponential distribution:

P(n) = (1/Z) exp[−n log(1 + 1/⟨n⟩)]

H = log2(1 + ⟨n⟩) + ⟨n⟩ log2(1 + 1/⟨n⟩) ≈ log2(1 + ⟨n⟩) + 1

22 / 35
Spiking neurons: rate code

[Stein, 1967]

Measure the rate in a window T, during which the stimulus is constant.

A periodic neuron can maximally encode 1 + (fmax − fmin)T stimuli, so
H ≈ log2[1 + (fmax − fmin)T]. Note: this grows only ∝ log(T).

23 / 35
[Stein, 1967]
Similar behaviour for a Poisson neuron: H ∝ log(T)

24 / 35
Spiking neurons: dynamic stimuli

[de Ruyter van Steveninck et al., 1997], but see


[Warzecha and Egelhaaf, 1999].
25 / 35
Maximizing Information Transmission: single output

Single linear neuron with post-synaptic noise:

v = w · u + η

where η is an independent noise variable.

I_m(u; v) = H(v) − H(v|u)

The second term depends only on p(η).
To maximize I_m we need to maximize H(v); a sensible constraint is
‖w‖² = 1.
If u ∼ N(0, Q) and η ∼ N(0, σ_η²) then v ∼ N(0, wᵀQw + σ_η²).

26 / 35
For a Gaussian RV with variance σ² we have H = (1/2) log2(2πeσ²). To
maximize H(v) we need to maximize wᵀQw subject to the constraint
‖w‖² = 1.
Thus w ∝ e1, the principal eigenvector of Q, so we obtain PCA.
If v is non-Gaussian then this calculation gives an upper bound on
H(v) (as the Gaussian distribution is the maximum entropy
distribution for a given mean and covariance)
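
A quick numerical check that the principal eigenvector maximizes the output variance wᵀQw under ‖w‖ = 1 (Python sketch with an arbitrary 2x2 covariance):

import numpy as np

Q = np.array([[1.0, 0.8],
              [0.8, 1.0]])                    # correlated input covariance (illustrative)

evals, evecs = np.linalg.eigh(Q)              # eigenvalues in ascending order
e1 = evecs[:, -1]                             # principal eigenvector

rng = np.random.default_rng(1)
w = rng.normal(size=(1000, 2))                # many random unit-norm weight vectors
w /= np.linalg.norm(w, axis=1, keepdims=True)
var_random = np.einsum('ij,jk,ik->i', w, Q, w)

print(e1 @ Q @ e1)                            # largest eigenvalue (1.8)
print(var_random.max())                       # never exceeds it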

27 / 35
Infomax
Infomax: maximize information in multiple outputs wrt weights
[Linsker, 1988]
v = Wu + η

H(v) = (1/2) log det(⟨vvᵀ⟩)

Example: 2 inputs and 2 outputs. The input is correlated; each output k has
unit-norm weights, w_k1² + w_k2² = 1.

At low noise independent coding, at high noise joint coding.
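
A sketch of this effect (Python; the input correlation, noise levels and the two candidate weight matrices are illustrative choices, not the parameters of the original figure):

import numpy as np

def H_gauss_bits(C):
    # differential entropy (bits) of a zero-mean Gaussian with covariance C
    return 0.5 * np.log2(np.linalg.det(2 * np.pi * np.e * C))

c = 0.9
Q = np.array([[1.0, c], [c, 1.0]])            # correlated inputs
evals, evecs = np.linalg.eigh(Q)
e1, e2 = evecs[:, -1], evecs[:, 0]            # principal and minor eigenvectors

W_joint = np.vstack([e1, e1])                 # both outputs code the same (principal) direction
W_indep = np.vstack([e1, e2])                 # outputs code different directions

for sigma2 in (0.01, 1.0):                    # low vs high output noise
    for name, W in [("independent", W_indep), ("joint", W_joint)]:
        C = W @ Q @ W.T + sigma2 * np.eye(2)  # output covariance of v = W u + eta
        print(sigma2, name, H_gauss_bits(C))
# At low noise the independent (decorrelating) solution has higher H(v);
# at high noise the joint (redundant) solution wins.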


28 / 35
Estimating information
Information estimation requires a lot of data.
Many common statistical estimators (mean, variance, ...) are unbiased,
but both the entropy and the noise entropy estimates are biased.

[Panzeri et al., 2007]


29 / 35
Try to fit a 1/N correction and extrapolate to infinite data [Strong et al., 1998]
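
A sketch of the downward bias of the plug-in entropy estimate and of a 1/N extrapolation in the spirit of [Strong et al., 1998] (Python; the true distribution and the sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_words = 16                                # true distribution: uniform over 16 words, H = 4 bits

def plugin_entropy_bits(samples):
    p = np.bincount(samples, minlength=n_words) / samples.size
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Ns = np.array([20, 40, 80, 160, 320])
H_est = np.array([np.mean([plugin_entropy_bits(rng.integers(0, n_words, size=N))
                           for _ in range(200)]) for N in Ns])

# Fit H_est ~ H_inf + a / N and extrapolate to infinite data
a, H_inf = np.polyfit(1.0 / Ns, H_est, 1)
print(H_est)                                # biased downward, approaching 4 bits as N grows
print(H_inf)                                # extrapolated estimate, close to 4 bits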

30 / 35
Common technique for I_m: shuffle correction [Panzeri et al., 2007]
See also: [Paninski, 2003, Nemenman et al., 2002]

31 / 35
Summary

Information theory provides a non-parametric framework for studying the neural code.

Optimal coding schemes depend strongly on the noise assumptions
and on the optimization constraints.
In data analysis, biases can be substantial.

32 / 35
References I

Brady, N. and Field, D. J. (2000).


Local contrast in natural images: normalisation and coding efficiency.
Perception, 29(9):1041–1055.
Cover, T. M. and Thomas, J. A. (1991).
Elements of information theory.
Wiley, New York.
de Ruyter van Steveninck, R. R. and Laughlin, S. B. (1996).
The rate of information transfer at graded-potential synapses.
Nature, 379:642–645.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., and Bialek, W.
(1997).
Reproducibility and variability in neural spike trains.
Science, 275:1805–1809.
Laughlin, S. B. (1981).
A simple coding procedure enhances a neuron’s information capacity.
Zeitschrift für Naturforschung, 36:910–912.
Linsker, R. (1988).
Self-organization in a perceptual network.
Computer, 21(3):105–117.

33 / 35
References II

MacKay, D. M. and McCulloch, W. S. (1952).


The limiting information capacity of a neuronal link.
Bull Math Biophys, 14:127–135.
Nemenman, I., Shafee, F., and Bialek, W. (2002).
Entropy and Inference, Revisited.
Advances in Neural Information Processing Systems, 14.
Paninski, L. (2003).
Estimation of Entropy and Mutual Information.
Neural Comp., 15:1191–1253.
Panzeri, S., Senatore, R., Montemurro, M. A., and Petersen, R. S. (2007).
Correcting for the sampling bias problem in spike train information measures.
J Neurophysiol, 98(3):1064–1072.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1996).
Spikes: Exploring the neural code.
MIT Press, Cambridge.
Shannon, C. E. and Weaver, W. (1949).
The mathematical theory of communication.
University of Illinois Press, Illinois.

34 / 35
References III

Stein, R. B. (1967).
The information capacity of nerve cells using a frequency code.
Biophys J, 7:797–826.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., and Bialek, W. (1998).
Entropy and Information in Neural Spike Trains.
Phys Rev Lett, 80:197–200.
Warzecha, A. K. and Egelhaaf, M. (1999).
Variability in spike trains during constant and dynamic stimulation.
Science, 283(5409):1927–1930.

35 / 35
