
Information Theory

Mark van Rossum

School of Informatics, University of Edinburgh

January 24, 2018

Version: January 24, 2018
1 / 35
Why information theory

Understanding the neural code.


Encoding and decoding: we imposed coding schemes, such as a
2nd-order kernel or the LNP model, and possibly lost information in doing so.
Instead, use information theory:
No need to impose an encoding or decoding scheme (non-parametric).
Particularly important for 1) spike-timing codes, 2) higher areas.
Estimate how much information is coded in a given signal.
Caveats:
No easy decoding scheme for the organism (upper bound only)
Requires more data, and biases are tricky

2 / 35
Overview

Entropy, Mutual Information


Entropy Maximization for a Single Neuron
Maximizing Mutual Information
Estimating information
Reading: Dayan and Abbott, ch. 4; Rieke et al., 1996

3 / 35
Definition

The entropy of a quantity is defined as

H(X) = − Σ_x P(x) log2 P(x)

This is not 'derived', nor fully unique, but it fulfills these criteria:
Continuous.
If p_i = 1/n, it increases monotonically with n: H = log2 n.
Parallel independent channels add.
"Unit": bits.
Entropy can be thought of as physical entropy, the "richness" of a distribution.
[Shannon and Weaver, 1949, Cover and Thomas, 1991,
Rieke et al., 1996]
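
A minimal numerical check of these criteria (a Python sketch; the helper name entropy_bits is just illustrative):

import numpy as np

def entropy_bits(p):
    # H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability outcomes are skipped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Uniform distribution over n outcomes: H = log2 n, increasing with n
for n in (2, 4, 8):
    print(n, entropy_bits(np.ones(n) / n))       # 1.0, 2.0, 3.0 bits

# Parallel independent channels add: H(X, Y) = H(X) + H(Y)
px, py = np.array([0.5, 0.5]), np.array([0.25, 0.75])
print(entropy_bits(np.outer(px, py).ravel()),
      entropy_bits(px) + entropy_bits(py))       # both ~1.81 bits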

4 / 35
Entropy

Discrete variable R:

H(R) = − Σ_r p(r) log2 p(r)

Continuous variable at resolution ∆r:

H(R) = − Σ_r p(r)∆r log2(p(r)∆r) = − Σ_r p(r)∆r log2 p(r) − log2 ∆r

Letting ∆r → 0 we have

lim_{∆r→0} [H + log2 ∆r] = − ∫ p(r) log2 p(r) dr

(also called differential entropy)
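
As a numerical sanity check of this limit (Python sketch; the Gaussian is only a convenient example, with differential entropy (1/2) log2(2πeσ²)):

import numpy as np

sigma = 1.0
analytic = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)   # differential entropy of N(0, sigma^2)

for dr in (0.5, 0.1, 0.01):
    r = np.arange(-10.0, 10.0, dr)              # discretized axis at resolution dr
    p = np.exp(-r**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    P = p * dr                                  # probability per bin
    H = -np.sum(P[P > 0] * np.log2(P[P > 0]))   # discrete entropy, diverges like -log2(dr)
    print(dr, H + np.log2(dr), analytic)        # H + log2(dr) approaches the analytic value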

5 / 35
Joint, Conditional entropy

Joint entropy:

H(S, R) = − Σ_{r,s} P(s, r) log2 P(s, r)

Conditional entropy:

H(S|R) = Σ_r P(R = r) H(S|R = r)
       = − Σ_r P(r) Σ_s P(s|r) log2 P(s|r)
       = H(S, R) − H(R)

If S, R are independent

H(S, R) = H(S) + H(R)

6 / 35
Mutual information

Mutual information:

I_m(R; S) = Σ_{r,s} p(r, s) log2 [ p(r, s) / (p(r) p(s)) ]
          = H(R) − H(R|S) = H(S) − H(S|R)

Measures the reduction in uncertainty about R from knowing S (or vice versa).
I_m(R; S) ≥ 0.
The continuous version is the difference of two entropies; the divergent log2 ∆r terms cancel.

7 / 35
Mutual Information

The joint histogram determines the mutual information:
given P(r, s) ⇒ I_m.

8 / 35
Mutual Information: Examples

Two joint distributions P(Y1, Y2), with Y1 = smoker / non-smoker and
Y2 = lung cancer / no lung cancer:

Left:                 smoker   non-smoker
  lung cancer          1/3        0
  no lung cancer        0        2/3

Right:                smoker   non-smoker
  lung cancer          1/9       2/9
  no lung cancer       2/9       4/9

Only for the left joint probability is I_m > 0 (how much?). On the right,
knowledge about Y1 does not inform us about Y2.
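
A quick computation of I_m for these two tables (Python sketch; probabilities taken directly from the slide):

import numpy as np

def mutual_information_bits(pjoint):
    # I_m = sum_{r,s} p(r,s) log2[ p(r,s) / (p(r) p(s)) ] for a 2-D joint probability table
    pjoint = np.asarray(pjoint, dtype=float)
    p1 = pjoint.sum(axis=1, keepdims=True)    # marginal of the row variable
    p2 = pjoint.sum(axis=0, keepdims=True)    # marginal of the column variable
    m = pjoint > 0
    return np.sum(pjoint[m] * np.log2(pjoint[m] / (p1 * p2)[m]))

left  = [[1/3, 0.0], [0.0, 2/3]]   # Y1 fully determines Y2
right = [[1/9, 2/9], [2/9, 4/9]]   # equals the product of its marginals

print(mutual_information_bits(left))    # ~0.918 bits (= H of a 1/3-2/3 variable)
print(mutual_information_bits(right))   # 0 bits: independent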

10 / 35
Kullback-Leibler divergence

The KL divergence measures the distance between two probability distributions:

D_KL(P||Q) = ∫ P(x) log2 [P(x)/Q(x)] dx, or D_KL(P||Q) ≡ Σ_i P_i log2(P_i/Q_i)

Not symmetric, but can be symmetrized.
I_m(R; S) = D_KL(p(r, s) || p(r)p(s)).
Often used as a probabilistic cost function: D_KL(data||model).
Other probability distances exist (e.g. the earth mover's distance)
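
A small sketch of D_KL and its asymmetry (Python; the two distributions are arbitrary illustrations):

import numpy as np

def dkl_bits(p, q):
    # D_KL(P||Q) = sum_i P_i log2(P_i / Q_i); assumes Q_i > 0 wherever P_i > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return np.sum(p[m] * np.log2(p[m] / q[m]))

P = np.array([0.5, 0.4, 0.1])
Q = np.ones(3) / 3
print(dkl_bits(P, Q), dkl_bits(Q, P))   # ~0.22 vs ~0.30 bits: not symmetric
print(dkl_bits(P, P))                   # 0: divergence of a distribution to itself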

11 / 35
Mutual info between jointly Gaussian variables

I(Y1; Y2) = ∫∫ P(y1, y2) log2 [ P(y1, y2) / (P(y1)P(y2)) ] dy1 dy2 = −(1/2) log2(1 − ρ²)

where ρ is the (Pearson) correlation coefficient.


12 / 35
Populations of Neurons

Given

H(R) = − ∫ p(r) log2 p(r) dr − N log2 ∆r

and

H(R_i) = − ∫ p(r_i) log2 p(r_i) dr_i − log2 ∆r

we have

H(R) ≤ Σ_i H(R_i)

(proof: consider the KL divergence between p(r) and Π_i p(r_i))

13 / 35
Mutual information in populations of Neurons

Redundancy can be defined as (compare to the above)

R = Σ_{i=1}^{n_r} I(r_i; s) − I(r; s).

Some codes have R > 0 (redundant code), others R < 0 (synergistic).

Example of a synergistic code: P(r1, r2, s) with
P(0,0,1) = P(0,1,0) = P(1,0,0) = P(1,1,1) = 1/4
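
Checking the redundancy R for this example numerically (a self-contained Python sketch):

import numpy as np

def mi_bits(pjoint):
    # mutual information (bits) of a 2-D joint probability table
    pjoint = np.asarray(pjoint, float)
    pa = pjoint.sum(axis=1, keepdims=True)
    pb = pjoint.sum(axis=0, keepdims=True)
    m = pjoint > 0
    return np.sum(pjoint[m] * np.log2(pjoint[m] / (pa * pb)[m]))

# P(r1, r2, s): the four listed triplets each have probability 1/4 (an XOR-like code)
p = np.zeros((2, 2, 2))
for r1, r2, s in [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    p[r1, r2, s] = 1 / 4

I_r1_s = mi_bits(p.sum(axis=1))       # marginalize out r2: I(r1; s) = 0
I_r2_s = mi_bits(p.sum(axis=0))       # marginalize out r1: I(r2; s) = 0
I_r_s  = mi_bits(p.reshape(4, 2))     # treat (r1, r2) as one word: I(r; s) = 1 bit
print(I_r1_s + I_r2_s - I_r_s)        # R = -1: synergistic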

14 / 35
Entropy Maximization for a Single Neuron

I_m(R; S) = H(R) − H(R|S)

If the noise entropy H(R|S) is independent of the transformation
S → R, we can maximize the mutual information by maximizing H(R)
under the given constraints.
Possible constraint: the response is bounded, 0 < r < rmax. H(R) is maximal
if p(r) ∼ U(0, rmax) (U is the uniform distribution).
If the average firing rate is limited and 0 < r < ∞: the exponential
distribution is optimal, p(x) = (1/x̄) exp(−x/x̄), with H = log2(e x̄).
If the variance is fixed and −∞ < r < ∞: the Gaussian distribution,
H = (1/2) log2(2πeσ²) (note the funny units).

15 / 35
Let r = f(s) and s ∼ p(s). Which f (assumed monotonic)
maximizes H(R) under the maximal firing rate constraint? Require:

P(r) = 1/rmax

p(s) = p(r) dr/ds = (1/rmax) df/ds

Thus df/ds = rmax p(s) and

f(s) = rmax ∫_{smin}^{s} p(s′) ds′

This strategy is known as histogram equalization in signal processing.
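
A sketch of histogram equalization on simulated stimuli (Python; the Gaussian stimulus distribution and rmax = 100 are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
rmax = 100.0                          # maximal response, e.g. in spikes/s (arbitrary)
s = rng.normal(0.0, 1.0, 100000)      # stimuli drawn from some p(s); Gaussian as an example

# Empirical version of f(s) = rmax * integral_{smin}^{s} p(s') ds', i.e. rmax times the CDF
order = np.argsort(s)
cdf = np.empty_like(s)
cdf[order] = (np.arange(s.size) + 0.5) / s.size
r = rmax * cdf                        # responses after histogram equalization

# The response distribution is now (approximately) uniform on (0, rmax)
hist, _ = np.histogram(r, bins=10, range=(0, rmax), density=True)
print(hist * rmax)                    # all bins close to 1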

16 / 35
Fly retina
Evidence that the large monopolar cell in the fly visual system carries
out histogram equalization

Contrast response of the fly large monopolar cell (points) matches the
environmental statistics (line) [Laughlin, 1981] (but changes under
high-noise conditions).
17 / 35
V1 contrast responses

Similar in V1, but On and Off channels [Brady and Field, 2000]
18 / 35
Information of time varying signals

Single analog channel with Gaussian signal s and Gaussian noise η:

r = s + η

I = (1/2) log2(1 + σ_s²/σ_η²) = (1/2) log2(1 + SNR)

For time-dependent signals: I = (T/2) ∫ (dω/2π) log2(1 + s(ω)/n(ω))

To maximize information when the variance of the signal is constrained,
use all frequency bands such that signal + noise = constant
(whitening). Water-filling analogy:
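
A sketch of the water-filling allocation (Python; the noise spectrum and the power budget are made-up numbers for illustration):

import numpy as np

noise = np.array([0.2, 0.5, 1.0, 2.0, 4.0])    # noise power n(w) in a few frequency bands
P_total = 3.0                                  # total signal power budget (arbitrary)

# Water filling: s(w) = max(0, nu - n(w)), with the level nu set so that sum s = P_total
lo, hi = noise.min(), noise.max() + P_total
for _ in range(100):                           # bisection on the water level nu
    nu = 0.5 * (lo + hi)
    if np.maximum(0.0, nu - noise).sum() < P_total:
        lo = nu
    else:
        hi = nu

S = np.maximum(0.0, nu - noise)
print(S)                                       # power goes only into the low-noise bands
print(S + noise)                               # equals nu wherever S > 0: signal + noise constant
print(0.5 * np.sum(np.log2(1.0 + S / noise)))  # resulting information (bits per band sample)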

19 / 35
Information of graded synapses

Light → (photon noise) → photoreceptor → (synaptic noise) → LMC


At low light levels photon noise dominates, synaptic noise is negligible.
Information rate: 1500 bits/s
[de Ruyter van Steveninck and Laughlin, 1996].

20 / 35
Spiking neurons: maximal information

Spike train with N = T/δt bins [MacKay and McCulloch, 1952], with δt the
"time resolution".

With pN = N1 spikes (events), the number of words is N!/(N1!(N − N1)!).
Entropy is maximal if all words are equally likely:
H = −Σ_i p_i log2 p_i = log2 N! − log2 N1! − log2(N − N1)!
Use, for large x, that log x! ≈ x(log x − 1):

H = −(T/δt) [p log2 p + (1 − p) log2(1 − p)]

For low rates p ≪ 1, setting λ = p/δt:

H = Tλ log2(e/(λδt))
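
A numerical check of the low-rate approximation (Python sketch; T, δt and the firing rate are arbitrary example values):

import numpy as np

T, dt = 1.0, 0.001            # 1 s of spike train at 1 ms resolution
rate = 20.0                   # firing rate lambda in spikes/s
p = rate * dt                 # spike probability per bin, p << 1

H_exact  = -(T / dt) * (p * np.log2(p) + (1 - p) * np.log2(1 - p))
H_approx = T * rate * np.log2(np.e / (rate * dt))
print(H_exact, H_approx)      # both around 141-142 bits for these numbers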

21 / 35
Spiking neurons

The calculation is incorrect when there can be multiple spikes per bin. Instead, for
large bins the information is maximal for an exponential distribution:

P(n) = (1/Z) exp[−n log(1 + 1/⟨n⟩)]

H = log2(1 + ⟨n⟩) + ⟨n⟩ log2(1 + 1/⟨n⟩) ≈ log2(1 + ⟨n⟩) + 1

22 / 35
Spiking neurons: rate code

[Stein, 1967]

Measure the rate in a window T, during which the stimulus is constant.

A periodic neuron can maximally encode 1 + (fmax − fmin)T stimuli, so
H ≈ log2[1 + (fmax − fmin)T]. Note: this grows only ∝ log(T).

23 / 35
[Stein, 1967]
Similar behaviour for a Poisson neuron: H ∝ log(T)

24 / 35
Spiking neurons: dynamic stimuli

[de Ruyter van Steveninck et al., 1997], but see


[Warzecha and Egelhaaf, 1999].
25 / 35
Maximizing Information Transmission: single output

Single linear neuron with post-synaptic noise:

v = w · u + η

where η is an independent noise variable.

I_m(u; v) = H(v) − H(v|u)

The second term depends only on p(η).
To maximize I_m we need to maximize H(v); a sensible constraint is
‖w‖² = 1.
If u ∼ N(0, Q) and η ∼ N(0, σ_η²) then v ∼ N(0, wᵀQw + σ_η²).

26 / 35
For a Gaussian RV with variance σ² we have H = (1/2) log2(2πeσ²). To
maximize H(v) we need to maximize wᵀQw subject to the constraint
‖w‖² = 1.
Thus w ∝ e1, the principal eigenvector of Q, so we obtain PCA.
If v is non-Gaussian then this calculation gives an upper bound on
H(v) (as the Gaussian distribution is the maximum entropy
distribution for a given mean and covariance)
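
A quick numerical check that the principal eigenvector maximizes the output variance wᵀQw under ‖w‖ = 1 (Python sketch with an arbitrary 2x2 covariance):

import numpy as np

Q = np.array([[1.0, 0.8],
              [0.8, 1.0]])                    # correlated input covariance (illustrative)

evals, evecs = np.linalg.eigh(Q)              # eigenvalues in ascending order
e1 = evecs[:, -1]                             # principal eigenvector

rng = np.random.default_rng(1)
w = rng.normal(size=(1000, 2))                # many random unit-norm weight vectors
w /= np.linalg.norm(w, axis=1, keepdims=True)
var_random = np.einsum('ij,jk,ik->i', w, Q, w)

print(e1 @ Q @ e1)                            # largest eigenvalue (1.8)
print(var_random.max())                       # never exceeds it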

27 / 35
Infomax
Infomax: maximize information in multiple outputs wrt weights
[Linsker, 1988]
v = Wu + η

H(v) = (1/2) log det(⟨vvᵀ⟩)

Example: 2 inputs and 2 outputs. The input is correlated; each output k has
unit-norm weights, w_k1² + w_k2² = 1.

At low noise independent coding, at high noise joint coding.
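
A sketch of this effect (Python; the input correlation, noise levels and the two candidate weight matrices are illustrative choices, not the parameters of the original figure):

import numpy as np

def H_gauss_bits(C):
    # differential entropy (bits) of a zero-mean Gaussian with covariance C
    return 0.5 * np.log2(np.linalg.det(2 * np.pi * np.e * C))

c = 0.9
Q = np.array([[1.0, c], [c, 1.0]])            # correlated inputs
evals, evecs = np.linalg.eigh(Q)
e1, e2 = evecs[:, -1], evecs[:, 0]            # principal and minor eigenvectors

W_joint = np.vstack([e1, e1])                 # both outputs code the same (principal) direction
W_indep = np.vstack([e1, e2])                 # outputs code different directions

for sigma2 in (0.01, 1.0):                    # low vs high output noise
    for name, W in [("independent", W_indep), ("joint", W_joint)]:
        C = W @ Q @ W.T + sigma2 * np.eye(2)  # output covariance of v = W u + eta
        print(sigma2, name, H_gauss_bits(C))
# At low noise the independent (decorrelating) solution has higher H(v);
# at high noise the joint (redundant) solution wins.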


28 / 35
Estimating information
Information estimation requires a lot of data.
Many common statistical estimators (mean, variance, ...) are unbiased,
but both the entropy and the noise entropy estimates are biased.

[Panzeri et al., 2007]


29 / 35
Try to fit a 1/N correction and extrapolate to infinite data [Strong et al., 1998]
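
A sketch of the downward bias of the plug-in entropy estimate and of a 1/N extrapolation in the spirit of [Strong et al., 1998] (Python; the true distribution and the sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n_words = 16                                # true distribution: uniform over 16 words, H = 4 bits

def plugin_entropy_bits(samples):
    p = np.bincount(samples, minlength=n_words) / samples.size
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

Ns = np.array([20, 40, 80, 160, 320])
H_est = np.array([np.mean([plugin_entropy_bits(rng.integers(0, n_words, size=N))
                           for _ in range(200)]) for N in Ns])

# Fit H_est ~ H_inf + a / N and extrapolate to infinite data
a, H_inf = np.polyfit(1.0 / Ns, H_est, 1)
print(H_est)                                # biased downward, approaching 4 bits as N grows
print(H_inf)                                # extrapolated estimate, close to 4 bits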

30 / 35
Common technique for I_m: shuffle correction [Panzeri et al., 2007]
See also: [Paninski, 2003, Nemenman et al., 2002]

31 / 35
Summary

Information theory provides a non-parametric framework for studying the neural code.

Optimal coding schemes depend strongly on the noise assumptions
and on the optimization constraints.
In data analysis, biases can be substantial.

32 / 35
References I

Brady, N. and Field, D. J. (2000).


Local contrast in natural images: normalisation and coding efficiency.
Perception, 29(9):1041–1055.
Cover, T. M. and Thomas, J. A. (1991).
Elements of information theory.
Wiley, New York.
de Ruyter van Steveninck, R. R. and Laughlin, S. B. (1996).
The rate of information transfer at graded-potential synapses.
Nature, 379:642–645.
de Ruyter van Steveninck, R. R., Lewen, G. D., Strong, S. P., Koberle, R., and Bialek, W.
(1997).
Reproducibility and variability in neural spike trains.
Science, 275:1805–1809.
Laughlin, S. B. (1981).
A simple coding procedure enhances a neuron’s information capacity.
Zeitschrift für Naturforschung, 36:910–912.
Linsker, R. (1988).
Self-organization in a perceptual network.
Computer, 21(3):105–117.

33 / 35
References II

MacKay, D. M. and McCulloch, W. S. (1952).


The limiting information capacity of a neuronal link.
Bull Math Biophys, 14:127–135.
Nemenman, I., Shafee, F., and Bialek, W. (2002).
Entropy and Inference, Revisited.
Advances in Neural Information Processing Systems, 14.
Paninski, L. (2003).
Estimation of Entropy and Mutual Information.
Neural Comp., 15:1191–1253.
Panzeri, S., Senatore, R., Montemurro, M. A., and Petersen, R. S. (2007).
Correcting for the sampling bias problem in spike train information measures.
J Neurophysiol, 98(3):1064–1072.
Rieke, F., Warland, D., de Ruyter van Steveninck, R., and Bialek, W. (1996).
Spikes: Exploring the neural code.
MIT Press, Cambridge.
Shannon, C. E. and Weaver, W. (1949).
The mathematical theory of communication.
University of Illinois Press, Illinois.

34 / 35
References III

Stein, R. B. (1967).
The information capacity of nerve cells using a frequency code.
Biophys J, 7:797–826.
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., and Bialek, W. (1998).
Entropy and Information in Neural Spike Trains.
Phys Rev Lett, 80:197–200.
Warzecha, A. K. and Egelhaaf, M. (1999).
Variability in spike trains during constant and dynamic stimulation.
Science, 283(5409):1927–1930.

35 / 35
