
Introduction to Information Geometry

Based on the book Methods of Information Geometry by Shun-ichi Amari and Hiroshi Nagaoka
Yunshu Liu
2012-02-17
Introduction to differential geometry Geometric structure of statistical models and statistical inference
Outline
1
Introduction to differential geometry
Manifold and Submanifold
Tangent vector, Tangent space and Vector field
Riemannian metric and Affine connection
Flatness and autoparallel
2
Geometric structure of statistical models and statistical inference
The Fisher metric and α-connection
Exponential family
Divergence and Geometric statistical inference
Yunshu Liu (ASPITRG) Introduction to Information Geometry 2 / 79
Part I
Introduction to differential geometry
Basic concepts in differential geometry
Basic concepts
Manifold and Submanifold
Tangent vector, Tangent space and Vector field
Riemannian metric and Affine connection
Flatness and autoparallel
Manifold
Manifold S
Manifold: a set with a coordinate system, i.e. a one-to-one mapping from S to $\mathbb{R}^n$; locally, S is supposed to look like an open subset of $\mathbb{R}^n$.
Elements of the set (points): points in $\mathbb{R}^n$, probability distributions, linear systems.
Figure: A coordinate system for a manifold S
Manifold
Manifold S
Definition: Let S be a set. If there exists a set of coordinate systems $\mathcal{A}$ for S which satisfies conditions (1) and (2) below, we call S an n-dimensional $C^\infty$ differentiable manifold.
(1) Each element $\varphi$ of $\mathcal{A}$ is a one-to-one mapping from S to some open subset of $\mathbb{R}^n$.
(2) For all $\varphi \in \mathcal{A}$, given any one-to-one mapping $\psi$ from S to $\mathbb{R}^n$, the following holds:
$\psi \in \mathcal{A} \Leftrightarrow \psi \circ \varphi^{-1}$ is a $C^\infty$ diffeomorphism.
Here, by a $C^\infty$ diffeomorphism we mean that $\psi \circ \varphi^{-1}$ and its inverse $\varphi \circ \psi^{-1}$ are both $C^\infty$ (infinitely many times differentiable).


Examples of Manifold
Examples of one-dimensional manifold
A straight line: a manifold in $\mathbb{R}^1$, even if it is given in $\mathbb{R}^k$ for $k \geq 2$.
Any open subset of a straight line: one-dimensional manifold
A closed subset of a straight line: not a manifold
A circle: locally the circle looks like a line
Any open subset of a circle: one-dimensional manifold
A closed subset of a circle: not a manifold.
Examples of Manifold: surface of a sphere
Surface of a sphere in $\mathbb{R}^3$, defined by $S = \{(x, y, z) \in \mathbb{R}^3 \mid x^2 + y^2 + z^2 = 1\}$. Locally it can be parameterized by two coordinates; for example, we can use latitude and longitude as the coordinates.
Sphere in $\mathbb{R}^n$ (the (n-1)-sphere): $S = \{(x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \mid x_1^2 + x_2^2 + \cdots + x_n^2 = 1\}$.
Examples of Manifold: surface of a torus
The torus in $\mathbb{R}^3$ (surface of a doughnut):
$x(u, v) = ((a + b\cos u)\cos v,\ (a + b\cos u)\sin v,\ b\sin u), \quad 0 \leq u, v < 2\pi$,
where a is the distance from the center of the tube to the center of the torus, and b is the radius of the tube. A torus is a closed surface defined as the product of two circles: $T^2 = S^1 \times S^1$.
n-torus: $T^n$ is defined as a product of n circles: $T^n = S^1 \times S^1 \times \cdots \times S^1$.
Coordinate systems for manifold
Parametrization of unit hemisphere
Parametrization: map of unit hemisphere into $\mathbb{R}^2$
(1) by latitude and longitude:
$x(\theta, \varphi) = (\sin\theta\cos\varphi,\ \sin\theta\sin\varphi,\ \cos\theta), \quad 0 < \theta < \pi/2,\ 0 \leq \varphi < 2\pi$ (1)
Coordinate systems for manifold
Parametrization of unit hemisphere
Parametrization: map of unit hemisphere into $\mathbb{R}^2$
(2) by stereographic projection:
$x(u, v) = \left(\frac{2u}{1 + u^2 + v^2},\ \frac{2v}{1 + u^2 + v^2},\ \frac{1 - u^2 - v^2}{1 + u^2 + v^2}\right)$ where $u^2 + v^2 \leq 1$ (2)
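As a quick numerical sanity check of the stereographic parametrization, the following Python sketch evaluates the formula at a few points of the closed unit disk and confirms that each image lies on the upper unit hemisphere. The function name `stereographic_point` and the sample points are illustrative choices, not part of the slides.

```python
import math

def stereographic_point(u, v):
    """Map (u, v) with u^2 + v^2 <= 1 to a point (x, y, z) of the unit sphere."""
    d = 1.0 + u * u + v * v
    return (2.0 * u / d, 2.0 * v / d, (1.0 - u * u - v * v) / d)

# Hypothetical sample points inside (or on the boundary of) the unit disk.
samples = [(0.0, 0.0), (0.3, -0.4), (1.0, 0.0), (0.6, 0.8)]

for u, v in samples:
    x, y, z = stereographic_point(u, v)
    radius = math.sqrt(x * x + y * y + z * z)
    # radius should be 1 up to rounding; z should be non-negative up to rounding
    print(f"(u, v) = ({u}, {v}) -> radius = {radius:.12f}, z = {z:.6f}")
```

The center of the disk maps to the pole (0, 0, 1) and the boundary circle maps to the equator z = 0.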
Examples of Manifold: colors
Parametrization of color models
Color models: RGB, CMYK, LAB, HSV and so on (most have 3 channels; CMYK has 4).
The RGB color model: an additive color model in which red, green, and blue light are added together in various ways to reproduce a broad array of colors.
The Lab color model: the three coordinates of Lab represent the lightness of the color (L), its position between red and green (a) and its position between yellow and blue (b).
Coordinate systems for manifold
Parametrization of color models
Parametrization: map of color into $\mathbb{R}^3$
Examples:
Submanifolds
Submanifolds
Definition: a submanifold M of a manifold S is a subset of S which itself has the structure of a manifold.
An open subset of an n-dimensional manifold forms an n-dimensional submanifold.
One way to construct an m(<n)-dimensional submanifold: fix n-m coordinates.
Examples:
Submanifolds
Examples: color models
3-dimensional submanifold: any open subset;
2-dimensional submanifold: fix one coordinate;
1-dimensional submanifold: fix two coordinates.
Note: In the Lab color model, if we set a and b to 0 and change L from 0 to 100 (from black to white), we get a 1-dimensional submanifold.
Curves and Tangent vector of Curves
Curves
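Curve $\gamma: I \to S$ from some interval $I\ (\subset \mathbb{R})$ to S.
Examples: curves on a sphere, in a set of probability distributions, or in a set of linear systems.
Using a coordinate system $\{\xi^i\}$ to express the point $\gamma(t)$ on the curve (where $t \in I$): $\gamma^i(t) = \xi^i(\gamma(t))$, then we get $\bar\gamma(t) = [\gamma^1(t), \ldots, \gamma^n(t)]$.
$C^\infty$ Curves
$C^\infty$: infinitely many times differentiable (sufficiently smooth).
If $\bar\gamma(t)$ is $C^\infty$ for $t \in I$, we call $\gamma$ a $C^\infty$ curve on the manifold S.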
Tangent vector of Curves
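A tangent vector is a vector that is tangent to a curve or surface at a given point.
When S is an open subset of $\mathbb{R}^n$, the range of $\gamma$ is contained within a single linear space, hence we consider the standard derivative:
$\dot\gamma(a) = \lim_{h \to 0} \frac{\gamma(a + h) - \gamma(a)}{h}$ (3)
In general, however, this is not true, e.g. the range of $\gamma$ in a color model.
Thus we use a more general derivative instead:
$\dot\gamma(a) = \sum_{i=1}^n \dot\gamma^i(a) \left(\frac{\partial}{\partial\xi^i}\right)_p$ (4)
where $\gamma^i(t) = \xi^i \circ \gamma(t)$, $\dot\gamma^i(a) = \frac{d}{dt}\gamma^i(t)\big|_{t=a}$, $p = \gamma(a)$, and $\left(\frac{\partial}{\partial\xi^i}\right)_p$ is an operator which maps $f \mapsto \left(\frac{\partial f}{\partial\xi^i}\right)_p$ for a given function $f: S \to \mathbb{R}$.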
Tangent space
Tangent space
Tangent space at p: a hyperplane $T_p$ containing all the tangents of curves passing through the point $p \in S$. ($\dim T_p(S) = \dim S$)
$T_p(S) = \left\{ \sum_{i=1}^n c^i \left(\frac{\partial}{\partial\xi^i}\right)_p \,\middle|\, [c^1, \ldots, c^n] \in \mathbb{R}^n \right\}$
Examples: hemisphere and color
Figure: Tangent vector and tangent space in unit hemisphere
Vector fields
Vector fields
Vector fields: a map from each point in a manifold S to a tangent vector.
Consider a coordinate system $\{\xi^i\}$ for an n-dimensional manifold; clearly $\partial_i = \frac{\partial}{\partial\xi^i}$ are vector fields for $i = 1, \ldots, n$.
Riemannian Metrics
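Riemannian metric: an inner product on tangent vectors ($D, D' \in T_p(S)$) which satisfies $\langle D, D' \rangle_p \in \mathbb{R}$, and for which the following conditions hold:
Linearity: $\langle aD + bD', D'' \rangle_p = a\langle D, D'' \rangle_p + b\langle D', D'' \rangle_p$
Symmetry: $\langle D, D' \rangle_p = \langle D', D \rangle_p$
Positive definiteness: if $D \neq 0$ then $\langle D, D \rangle_p > 0$
The components $\{g_{ij}\}$ of a Riemannian metric g w.r.t. the coordinate system $\{\xi^i\}$ are defined by $g_{ij} = \langle \partial_i, \partial_j \rangle$, where $\partial_i = \frac{\partial}{\partial\xi^i}$.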
Riemannian Metrics
Examples of inner products:
For $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$, we can define the inner product as $\langle X, Y \rangle_1 = X \cdot Y = \sum_{i=1}^n x_i y_i$, or $\langle X, Y \rangle_2 = Y^T M X$, where M is any symmetric positive-definite matrix.
For random variables X and Y, the expected value of their product: $\langle X, Y \rangle = E(XY)$
For square real matrices, $\langle A, B \rangle = \mathrm{tr}(AB^T)$
Riemannian Metrics
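For the unit sphere:
$x(\theta, \varphi) = (\sin\theta\cos\varphi,\ \sin\theta\sin\varphi,\ \cos\theta), \quad 0 < \theta < \pi,\ 0 \leq \varphi < 2\pi$ (5)
we have:
$\partial_\theta = (\cos\theta\cos\varphi,\ \cos\theta\sin\varphi,\ -\sin\theta)$
$\partial_\varphi = (-\sin\theta\sin\varphi,\ \sin\theta\cos\varphi,\ 0)$
$g_{11} = \langle \partial_\theta, \partial_\theta \rangle$, $g_{22} = \langle \partial_\varphi, \partial_\varphi \rangle$, $g_{12} = g_{21} = \langle \partial_\theta, \partial_\varphi \rangle$
If we define $\langle X, Y \rangle = Y^T M X = 2\sum_{i=1}^n x_i y_i$, where $M = 2I$ is twice the identity matrix (3×3 here, since the tangent vectors are written in $\mathbb{R}^3$), then
$(g_{ij}) = \begin{pmatrix} g_{11} & g_{12} \\ g_{21} & g_{22} \end{pmatrix} = \begin{pmatrix} 2 & 0 \\ 0 & 2\sin^2\theta \end{pmatrix}$ (6)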
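The metric components of this example can be checked numerically. The sketch below is only an illustration: it approximates the basis vectors by central differences of the embedding and applies the scaled inner product $2\sum_i x_i y_i$. The function names (`embed`, `basis_vectors`, `metric`) and the step size `h` are assumptions made for this example.

```python
import math

def embed(theta, phi):
    """Unit sphere embedding x(theta, phi) in R^3."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def basis_vectors(theta, phi, h=1e-6):
    """Approximate the partial derivatives of the embedding by central differences."""
    d_theta = tuple((a - b) / (2 * h)
                    for a, b in zip(embed(theta + h, phi), embed(theta - h, phi)))
    d_phi = tuple((a - b) / (2 * h)
                  for a, b in zip(embed(theta, phi + h), embed(theta, phi - h)))
    return d_theta, d_phi

def metric(theta, phi, scale=2.0):
    """Metric components g_ij under the inner product <X, Y> = scale * sum_i x_i y_i."""
    basis = basis_vectors(theta, phi)

    def inner(a, b):
        return scale * sum(x * y for x, y in zip(a, b))

    return [[inner(basis[i], basis[j]) for j in range(2)] for i in range(2)]

g = metric(1.0, 0.5)
print(g)                            # numerical metric at theta = 1.0, phi = 0.5
print(2 * math.sin(1.0) ** 2)       # closed form for g_22 = 2 sin^2(theta)
```

The off-diagonal entries come out as zero up to rounding, which reflects the orthogonality of the latitude and longitude directions.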
Affine connection
Parallel translation along curves
Let $\gamma: [a, b] \to S$ be a curve in S, and let X(t) be a vector field along $\gamma$, mapping each point $\gamma(t)$ to a tangent vector. Suppose that for all $t \in [a, b]$ and the corresponding infinitesimal dt, the corresponding tangent vectors are linearly related, that is, there exists a linear mapping $\Pi_{p,p'}$ (with $p = \gamma(t)$ and $p' = \gamma(t + dt)$) such that $X(t + dt) = \Pi_{p,p'}(X(t))$ for $t \in [a, b]$. Then we say X is parallel along $\gamma$, and call $\Pi_\gamma$ the parallel translation along $\gamma$.
Linear mapping: additivity and scalar multiplication.
Figure: Translation of a tangent vector along a curve
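Affine connection
Affine connection: relationships between tangent spaces at different points.
Recall:
Natural basis of the coordinate system $[\xi^i]$: $(\partial_i)_p = \left(\frac{\partial}{\partial\xi^i}\right)_p$: an operator which maps $f \mapsto \left(\frac{\partial f}{\partial\xi^i}\right)_p$ for a given function $f: S \to \mathbb{R}$ at p.
Tangent space:
$T_p(S) = \left\{ \sum_{i=1}^n c^i \left(\frac{\partial}{\partial\xi^i}\right)_p \,\middle|\, [c^1, \ldots, c^n] \in \mathbb{R}^n \right\}$
Tangent vectors (elements of the tangent space) can be represented as linear combinations of $\partial_i$.
Tangent space $T_p$
Tangent vector $X_p$
Natural basis $(\partial_i)_p = \left(\frac{\partial}{\partial\xi^i}\right)_p$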
Affine connection
If the difference between the coordinates of p and p' is so small that we can ignore the second-order infinitesimals $(d\xi^i)(d\xi^j)$, where $d\xi^i = \xi^i(p') - \xi^i(p)$, then we can express the difference between $\Pi_{p,p'}((\partial_j)_p)$ and $(\partial_j)_{p'}$ as a linear combination of $\{d\xi^1, \ldots, d\xi^n\}$:
$\Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} - \sum_{i,k} d\xi^i\, (\Gamma^k_{ij})_p\, (\partial_k)_{p'}$ (7)
where $\{(\Gamma^k_{ij})_p;\ i, j, k = 1, \ldots, n\}$ are $n^3$ numbers which depend on the point p.
From $X(t) = \sum_{i=1}^n X^i(t)(\partial_i)_p$ and $X(t + dt) = \sum_{i=1}^n X^i(t + dt)(\partial_i)_{p'}$, we have
$\Pi_{p,p'}(X(t)) = \sum_k \left\{ X^k(t) - dt \sum_{i,j} \dot\gamma^i(t) X^j(t) (\Gamma^k_{ij})_p \right\} (\partial_k)_{p'}$ (8)
Affine connection
$\Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} - \sum_{i,k} d\xi^i\, (\Gamma^k_{ij})_p\, (\partial_k)_{p'}$
Connection coefficients (Christoffel symbols): $(\Gamma^k_{ij})_p$
Given a connection on the manifold S, the values of $(\Gamma^k_{ij})_p$ are different for different coordinate systems. They show how tangent vectors change on a manifold, and thus how the basis vectors change.
In
$\Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'} - \sum_{i,k} d\xi^i\, (\Gamma^k_{ij})_p\, (\partial_k)_{p'}$
if we let $\Gamma^k_{ij} = 0$ for $i, j, k = x, y$, we will have
$\Pi_{p,p'}((\partial_j)_p) = (\partial_j)_{p'}$
Connection coefficients (Christoffel symbols): $(\Gamma^k_{ij})_p$
Given a connection on a manifold S, the $\Gamma^k_{ij}$ depend on the coordinate system. If we define a connection which makes $\Gamma^k_{ij}$ zero in one coordinate system, we will get non-zero connection coefficients in some other coordinate systems.
Example: If we let the connection coefficients for Cartesian coordinates of a 2D flat plane be zero, $\Gamma^k_{ij} = 0$ for $i, j, k = x, y$, we can calculate the connection coefficients for polar coordinates $(r, \theta)$:
$\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = \frac{1}{r}$, $\Gamma^r_{\theta\theta} = -r$, and $\Gamma^k_{ij} = 0$ for all others.
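The polar-coordinate coefficients can be reproduced with a short computation. The sketch below relies on the standard change-of-coordinates rule for a connection whose coefficients vanish in Cartesian coordinates $x^a$, namely $\Gamma^k_{ij} = \sum_a \frac{\partial \xi^k}{\partial x^a} \frac{\partial^2 x^a}{\partial \xi^i \partial \xi^j}$ with $\xi = (r, \theta)$. This rule is not stated on the slides and is an assumption of the example, as are the function name `polar_connection` and the index convention 0 = r, 1 = θ.

```python
import math

def polar_connection(r, theta):
    """Return gamma[k][i][j] = Γ^k_ij in polar coordinates (0 = r, 1 = θ)
    for the connection whose Cartesian coefficients are all zero."""
    c, s = math.cos(theta), math.sin(theta)
    # jac_inv[k][a] = ∂ξ^k/∂x^a for ξ = (r, θ) and x = (x, y)
    jac_inv = [[c, s],
               [-s / r, c / r]]
    # hess[a][i][j] = ∂²x^a/∂ξ^i∂ξ^j for x = r cos θ and y = r sin θ
    hess = [
        [[0.0, -s], [-s, -r * c]],  # second derivatives of x
        [[0.0, c], [c, -r * s]],    # second derivatives of y
    ]
    return [[[sum(jac_inv[k][a] * hess[a][i][j] for a in range(2))
              for j in range(2)]
             for i in range(2)]
            for k in range(2)]

gamma = polar_connection(2.0, 0.7)
print(gamma[1][0][1])  # Γ^θ_{rθ}, expected 1/r
print(gamma[0][1][1])  # Γ^r_{θθ}, expected -r
```

At r = 2 the two printed values agree with 1/r = 0.5 and -r = -2 up to rounding, and all remaining coefficients vanish.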
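Connection coefficients (Christoffel symbols): $(\Gamma^k_{ij})_p$
Example (cont.):
Now if we let the connection coefficients for polar coordinates be zero, $\Gamma^k_{ij} = 0$ for $i, j, k = r, \theta$, we can calculate the connection coefficients for Cartesian coordinates:
$\Gamma^x_{xx} = -\frac{\sin^2\theta\cos\theta}{r}$, $\Gamma^y_{xx} = \frac{\sin\theta\,(1 + \cos^2\theta)}{r}$,
$\Gamma^x_{xy} = \Gamma^x_{yx} = -\frac{\sin^3\theta}{r}$, $\Gamma^y_{xy} = \Gamma^y_{yx} = -\frac{\cos^3\theta}{r}$,
$\Gamma^x_{yy} = \frac{\cos\theta\,(1 + \sin^2\theta)}{r}$, and $\Gamma^y_{yy} = -\frac{\sin\theta\cos^2\theta}{r}$.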
Affine connection
Covariant derivative along curves
Derivative: $\frac{dX(t)}{dt} = \lim_{dt \to 0} \frac{X(t + dt) - X(t)}{dt}$. What if X(t) and X(t + dt) lie in different tangent spaces?
$X_t(t + dt) = \Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt))$
$\delta X(t) = X_t(t + dt) - X(t) = \Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) - X(t)$
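Affine connection
Covariant derivative along curves
We call $\frac{\delta X(t)}{dt}$ the covariant derivative of X(t):
$\frac{\delta X(t)}{dt} = \lim_{dt \to 0} \frac{X_t(t + dt) - X(t)}{dt} = \lim_{dt \to 0} \frac{\Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) - X(t)}{dt}$ (9)
$\Pi_{\gamma(t+dt),\gamma(t)}(X(t + dt)) = \sum_k \left\{ X^k(t + dt) + dt \sum_{i,j} \dot\gamma^i(t) X^j(t) (\Gamma^k_{ij})_{\gamma(t)} \right\} (\partial_k)_{\gamma(t)}$ (10)
$\frac{\delta X(t)}{dt} = \sum_k \left\{ \dot X^k(t) + \sum_{i,j} \dot\gamma^i(t) X^j(t) (\Gamma^k_{ij})_{\gamma(t)} \right\} (\partial_k)_{\gamma(t)}$ (11)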
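Setting the covariant derivative to zero gives the ordinary differential equation $\dot X^k = -\sum_{i,j} \dot\gamma^i X^j \Gamma^k_{ij}$ for a vector field that is parallel along the curve. The sketch below integrates this equation once around a circle of latitude on the unit sphere. It assumes the Levi-Civita coefficients of the round metric in $(\theta, \varphi)$ coordinates, $\Gamma^\theta_{\varphi\varphi} = -\sin\theta\cos\theta$ and $\Gamma^\varphi_{\theta\varphi} = \Gamma^\varphi_{\varphi\theta} = \cot\theta$, which are standard results not given on the slides. The function name and the choice of a fourth-order Runge-Kutta integrator are also assumptions.

```python
import math

def transport_along_latitude(theta0, x0, steps=2000):
    """Parallel-transport the components x0 = (X^θ, X^φ) once around the
    latitude θ = θ0, parametrized as γ(t) = (θ0, t) for 0 <= t <= 2π."""
    s, c = math.sin(theta0), math.cos(theta0)

    def rhs(x):
        x_theta, x_phi = x
        # With γ̇ = (0, 1):  dX^θ/dt = -Γ^θ_φφ X^φ  and  dX^φ/dt = -Γ^φ_φθ X^θ
        return (s * c * x_phi, -(c / s) * x_theta)

    h = 2.0 * math.pi / steps
    x = tuple(x0)
    for _ in range(steps):
        # classical fourth-order Runge-Kutta step
        k1 = rhs(x)
        k2 = rhs(tuple(xi + 0.5 * h * ki for xi, ki in zip(x, k1)))
        k3 = rhs(tuple(xi + 0.5 * h * ki for xi, ki in zip(x, k2)))
        k4 = rhs(tuple(xi + h * ki for xi, ki in zip(x, k3)))
        x = tuple(xi + h * (a + 2 * b + 2 * cc + d) / 6.0
                  for xi, a, b, cc, d in zip(x, k1, k2, k3, k4))
    return x

# Transport the unit vector ∂_θ around the latitude θ = π/3.
print(transport_along_latitude(math.pi / 3, (1.0, 0.0)))
```

For this latitude the transported vector comes back pointing in the opposite direction, even though the curve is closed. This illustrates that parallel translation on a curved surface depends on the path.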
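Affine connection
Covariant derivative of any two tangent vector fields
Covariant derivative of Y w.r.t. X, where $X = \sum_{i=1}^n X^i \partial_i$ and $Y = \sum_{i=1}^n Y^i \partial_i$:
$\nabla_X Y = \sum_{i,k} X^i \left\{ \partial_i Y^k + \sum_j Y^j \Gamma^k_{ij} \right\} \partial_k$ (12)
$\nabla_{\partial_i} \partial_j = \sum_{k=1}^n \Gamma^k_{ij} \partial_k$ (13)
Note: $(\nabla_X Y)_p = \nabla_{X_p} Y \in T_p(S)$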
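Examples of Affine connection
metric connection
Definition: If for all vector fields $X, Y, Z \in \mathcal{T}(S)$,
$Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla_Z Y \rangle$,
where $Z\langle X, Y \rangle$ denotes the derivative of the function $\langle X, Y \rangle$ along the vector field Z, we say that $\nabla$ is a metric connection w.r.t. g.
Equivalent condition: for all basis vector fields $\partial_i, \partial_j, \partial_k \in \mathcal{T}(S)$,
$\partial_k \langle \partial_i, \partial_j \rangle = \langle \nabla_{\partial_k} \partial_i, \partial_j \rangle + \langle \partial_i, \nabla_{\partial_k} \partial_j \rangle$.
Property: parallel translation w.r.t. a metric connection preserves inner products, which means parallel transport is an isometry:
$\langle \Pi_\gamma(D_1), \Pi_\gamma(D_2) \rangle_q = \langle D_1, D_2 \rangle_p$.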
Examples of Affine connection
Levi-Civita connection
For a given connection, when $\Gamma^k_{ij} = \Gamma^k_{ji}$ holds for all i, j and k, we call it a symmetric connection or torsion-free connection.
From $\nabla_{\partial_i} \partial_j = \sum_{k=1}^n \Gamma^k_{ij} \partial_k$, we know that for a symmetric connection:
$\nabla_{\partial_i} \partial_j = \nabla_{\partial_j} \partial_i$
If a connection is both metric and symmetric, we call it the Riemannian connection or the Levi-Civita connection w.r.t. g.
Flatness
Affine coordinate system
Let $\{\xi^i\}$ be a coordinate system for S. We call $\{\xi^i\}$ an affine coordinate system for the connection $\nabla$ if the n basis vector fields $\partial_i = \frac{\partial}{\partial\xi^i}$ are all parallel on S.
Equivalent conditions for a coordinate system to be an affine coordinate system:
$\nabla_{\partial_i} \partial_j = \sum_{k=1}^n \Gamma^k_{ij} \partial_k = 0$ for all i and j (14)
$\Gamma^k_{ij} = 0$ for all i, j and k (15)
Flatness
S is flat w.r.t. the connection $\nabla$: an affine coordinate system exists for the connection $\nabla$.
Flatness
Examples:
$\nabla_{\partial_i} \partial_j = \sum_{k=1}^n \Gamma^k_{ij} \partial_k = 0$ for all i and j
$\Gamma^k_{ij} = 0$ for all i, j and k
Flatness
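Curvature R and torsion T of a connection
$R(\partial_i, \partial_j)\partial_k = \sum_l R^l_{ijk} \partial_l$ and $T(\partial_i, \partial_j) = \sum_k T^k_{ij} \partial_k$ (16)
where $R^l_{ijk}$ and $T^k_{ij}$ can be computed in the following way:
$R^l_{ijk} = \partial_i \Gamma^l_{jk} - \partial_j \Gamma^l_{ik} + \sum_h \left( \Gamma^l_{ih} \Gamma^h_{jk} - \Gamma^l_{jh} \Gamma^h_{ik} \right)$ (17)
$T^k_{ij} = \Gamma^k_{ij} - \Gamma^k_{ji}$ (18)
If a connection is flat, then T = R = 0;
If T = 0, i.e. $T^k_{ij} = 0$ for all i, j and k, we get a symmetric connection.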
Flatness
Curvature
Curvature R = 0 iff parallel translation does not depend on the choice of curve.
Curvature is independent of the coordinate system. Under the Riemannian connection, we can calculate:
Curvature of the 2-dimensional plane: R = 0;
Curvature of a sphere of radius r in 3-dimensional space: $R = \frac{2}{r^2}$.
Autoparallel submanifold
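Equivalent conditions for a submanifold M of S to be autoparallel
$\nabla_X Y \in \mathcal{T}(M)$ for all $X, Y \in \mathcal{T}(M)$ (19)
$\nabla_{\partial_a} \partial_b \in \mathcal{T}(M)$ for all a and b (20)
$\nabla_{\partial_a} \partial_b = \sum_c \Gamma^c_{ab} \partial_c$ (21)
where $\partial_a = \frac{\partial}{\partial u^a}$ and $\partial_b = \frac{\partial}{\partial u^b}$ are the basis vector fields of the submanifold M w.r.t. the coordinate system $\{u^i\}$.
Examples of autoparallel submanifolds:
Open subsets of the manifold S are autoparallel;
A curve with the property that all its tangent vectors are parallel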
Autoparallel submanifold
Geodesics
Geodesics (autoparallel curves): a curve whose tangent vector is transported by parallel translation.
Examples under the Riemannian connection:
2-dimensional flat plane: straight line
Sphere in 3-dimensional space: great circle
Autoparallel submanifold
Geodesics
The geodesics with respect to the Riemannian connection are known to
coincide with the shortest curve joining two points.
Shortest curve: curve with the shortest length.
Length of a curve $\gamma: [a, b] \to S$:
$\|\gamma\| = \int_a^b \left\| \frac{d\gamma}{dt} \right\| dt = \int_a^b \sqrt{g_{ij}\, \dot\gamma^i \dot\gamma^j}\, dt$ (22)
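The length integral can be approximated numerically once a metric and a curve are given in coordinates. The following sketch uses the midpoint rule with central-difference velocities. The round metric of the unit sphere in $(\theta, \varphi)$ coordinates, $g = \mathrm{diag}(1, \sin^2\theta)$, is an assumption of this example, as are the function names `curve_length` and `sphere_metric`.

```python
import math

def curve_length(curve, metric, a, b, steps=1000, eps=1e-6):
    """Approximate the integral of sqrt(g_ij γ̇^i γ̇^j) over [a, b] by the midpoint rule."""
    h = (b - a) / steps
    total = 0.0
    for n in range(steps):
        t = a + (n + 0.5) * h
        point = curve(t)
        # central-difference approximation of the velocity components γ̇^i(t)
        velocity = [(p - m) / (2 * eps)
                    for p, m in zip(curve(t + eps), curve(t - eps))]
        g = metric(point)
        speed_sq = sum(g[i][j] * velocity[i] * velocity[j]
                       for i in range(len(velocity))
                       for j in range(len(velocity)))
        total += math.sqrt(speed_sq) * h
    return total

def sphere_metric(point):
    """Round metric of the unit sphere in (θ, φ) coordinates (assumed for this example)."""
    theta, _ = point
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]

# Circle of latitude θ = π/3: the expected length is 2π sin(π/3).
latitude = curve_length(lambda t: (math.pi / 3, t), sphere_metric, 0.0, 2 * math.pi)
print(latitude, 2 * math.pi * math.sin(math.pi / 3))
```

A meridian segment, where only θ varies, has length equal to the change in θ, which gives a second easy check of the routine.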
Part II
Geometric structure of statistical models and statistical
inference
Motivation
Motivation
Consider the set of probability distributions as a manifold.
Analyze the relationship between the geometric structure of the manifold and statistical estimation.
Introduce concepts like metric and affine connection on statistical models, and study quantities such as distance, the tangent space (which provides linear approximations), geodesics and the curvature of a manifold.
Statistical models
Statistical models
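$\mathcal{P}(\mathcal{X}) = \left\{ p: \mathcal{X} \to \mathbb{R} \,\middle|\, p(x) > 0\ (\forall x \in \mathcal{X}),\ \int p(x)\,dx = 1 \right\}$ (23)
Example: Normal Distribution
$\mathcal{X} = \mathbb{R}$, n = 2, $\xi = [\mu, \sigma]$, $\Xi = \{[\mu, \sigma] \mid -\infty < \mu < \infty,\ 0 < \sigma < \infty\}$
$p(x, \xi) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$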
Geometric structure of statistical models and statistical inference
Basic concepts
The Fisher metric and α-connection
Exponential family
Divergence and Geometric statistical inference
The Fisher information matrix
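Fisher information matrix $G(\xi) = [g_{ij}(\xi)]$, and
$g_{ij}(\xi) = E_\xi[\partial_i \ell_\xi\, \partial_j \ell_\xi] = \int \partial_i \ell(x; \xi)\, \partial_j \ell(x; \xi)\, p(x; \xi)\, dx$
where $\ell_\xi = \ell(x; \xi) = \log p(x; \xi)$ and $E_\xi$ denotes the expectation w.r.t. the distribution $p_\xi$.
Motivation:
Sufficient statistic and Cramér-Rao bound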
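To make the definition concrete, the sketch below approximates the Fisher information matrix of the normal model with coordinates $\xi = [\mu, \sigma]$ by numerical integration. The scores are taken by central differences of the log density. The closed form used for comparison, $G = \mathrm{diag}(1/\sigma^2, 2/\sigma^2)$, is a standard result that does not appear on the slides. The function names, integration window and step sizes are assumptions of this example.

```python
import math

def log_normal_pdf(x, mu, sigma):
    """Log density of the normal distribution with mean mu and standard deviation sigma."""
    return -math.log(math.sqrt(2.0 * math.pi) * sigma) - (x - mu) ** 2 / (2.0 * sigma ** 2)

def fisher_information(mu, sigma, steps=4000, width=12.0, eps=1e-5):
    """Approximate g_ij = E[∂_i ℓ ∂_j ℓ] for ξ = (μ, σ) with the midpoint rule."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / steps
    g = [[0.0, 0.0], [0.0, 0.0]]
    for n in range(steps):
        x = lo + (n + 0.5) * h
        # central-difference scores ∂ℓ/∂μ and ∂ℓ/∂σ
        score = (
            (log_normal_pdf(x, mu + eps, sigma) - log_normal_pdf(x, mu - eps, sigma)) / (2 * eps),
            (log_normal_pdf(x, mu, sigma + eps) - log_normal_pdf(x, mu, sigma - eps)) / (2 * eps),
        )
        weight = math.exp(log_normal_pdf(x, mu, sigma)) * h
        for i in range(2):
            for j in range(2):
                g[i][j] += score[i] * score[j] * weight
    return g

print(fisher_information(0.5, 2.0))  # compare with diag(1/σ², 2/σ²) = diag(0.25, 0.5)
```

The matrix does not depend on μ, so shifting the distribution leaves the metric unchanged, while a smaller σ makes nearby distributions easier to tell apart.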
The Fisher information matrix
Sufficient statistic
Sufficient statistic: for Y = F(X), given the distribution $p(x; \xi)$ of X, we have $p(x; \xi) = q(F(x); \xi)\, r(x; \xi)$. If $r(x; \xi)$ does not depend on $\xi$ for all x, we say that F is a sufficient statistic for the model S. Then we can write $p(x; \xi) = q(y; \xi)\, r(x)$.
A sufficient statistic is a function whose value contains all the information needed to compute any estimate of the parameter (e.g. a maximum likelihood estimate).
Fisher information matrix and sufficient statistic
Let $G(\xi)$ be the Fisher information matrix of $S = \{p(x; \xi)\}$, and $G_F(\xi)$ be the Fisher information matrix of the induced model $S_F = \{q(y; \xi)\}$. Then we have $G_F(\xi) \leq G(\xi)$ in the sense that $\Delta G(\xi) = G(\xi) - G_F(\xi)$ is positive semidefinite. $\Delta G(\xi) = 0$ iff F is a sufficient statistic for S.
Cramer-Rao inequality
Cramer-Rao inequality
The variance of any unbiased estimator is at least as high as the inverse of the Fisher information.
Unbiased estimator $\hat\xi$: $E_\xi[\hat\xi(X)] = \xi$
The variance-covariance matrix $V_\xi[\hat\xi] = [v^{ij}_\xi]$ where
$v^{ij}_\xi = E_\xi[(\hat\xi^i(X) - \xi^i)(\hat\xi^j(X) - \xi^j)]$
Thus the Cramér-Rao inequality states that $V_\xi[\hat\xi] \geq G(\xi)^{-1}$, and an unbiased estimator $\hat\xi$ satisfying $V_\xi[\hat\xi] = G(\xi)^{-1}$ is called an efficient estimator.
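α-connection
α-connection
Let $S = \{p_\xi\}$ be an n-dimensional model, and consider the function $\Gamma^{(\alpha)}_{ij,k}$ which maps each point $\xi$ to the following value:
$(\Gamma^{(\alpha)}_{ij,k})_\xi = E_\xi\left[ \left( \partial_i \partial_j \ell_\xi + \frac{1 - \alpha}{2}\, \partial_i \ell_\xi\, \partial_j \ell_\xi \right) (\partial_k \ell_\xi) \right]$ (24)
where α is an arbitrary real number. We define an affine connection $\nabla^{(\alpha)}$ which satisfies:
$\left\langle \nabla^{(\alpha)}_{\partial_i} \partial_j, \partial_k \right\rangle = \Gamma^{(\alpha)}_{ij,k}$ (25)
where $g = \langle\,,\rangle$ is the Fisher metric. We call $\nabla^{(\alpha)}$ the α-connection.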
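α-connection
Properties of α-connection
The α-connection is a symmetric connection.
Relationship between the α-connection and the β-connection:
$\Gamma^{(\beta)}_{ij,k} = \Gamma^{(\alpha)}_{ij,k} + \frac{\alpha - \beta}{2} E[\partial_i \ell\, \partial_j \ell\, \partial_k \ell]$
The 0-connection is the Riemannian connection with respect to the Fisher metric.
$\Gamma^{(\beta)}_{ij,k} = \Gamma^{(0)}_{ij,k} - \frac{\beta}{2} E[\partial_i \ell\, \partial_j \ell\, \partial_k \ell]$
$\nabla^{(\alpha)} = \frac{1 + \alpha}{2} \nabla^{(1)} + \frac{1 - \alpha}{2} \nabla^{(-1)}$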
Exponential family
Exponential family
$p(x; \theta) = \exp\left[ C(x) + \sum_{i=1}^n \theta^i F_i(x) - \psi(\theta) \right]$
$[\theta^i]$ are called the natural parameters (coordinates), and $\psi$ is the potential function for $[\theta^i]$, which can be calculated as
$\psi(\theta) = \log \int \exp\left[ C(x) + \sum_{i=1}^n \theta^i F_i(x) \right] dx$
The exponential families include many of the most common distributions, including the normal, exponential, gamma, beta, Dirichlet, Bernoulli, binomial, multinomial, Poisson, and so on.
Exponential family
Exponential family
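Example: Normal Distribution
$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$ (26)
where C(x) = 0, $F_1(x) = x$, $F_2(x) = x^2$, and $\theta^1 = \frac{\mu}{\sigma^2}$, $\theta^2 = -\frac{1}{2\sigma^2}$ are the natural parameters. The potential function is:
$\psi = -\frac{(\theta^1)^2}{4\theta^2} + \frac{1}{2} \log\left( -\frac{\pi}{\theta^2} \right) = \frac{\mu^2}{2\sigma^2} + \log(\sqrt{2\pi}\,\sigma)$ (27)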
Mixture family
Mixture family
$p(x; \theta) = C(x) + \sum_{i=1}^n \theta^i F_i(x)$
In this case we say that S is a mixture family and $[\theta^i]$ are called the mixture parameters.
e-connection and m-connection
The natural parameters of an exponential family form a 1-affine coordinate system ($\Gamma^{(1)}_{ij,k} = 0$), which means the family is 1-flat. We call the connection $\nabla^{(1)}$ the e-connection, and call the exponential family e-flat.
The mixture parameters of a mixture family form a (-1)-affine coordinate system ($\Gamma^{(-1)}_{ij,k} = 0$), which means the family is (-1)-flat. We call the connection $\nabla^{(-1)}$ the m-connection, and call the mixture family m-flat.
Dual connection
Dual connection
Definition: Let S be a manifold on which there is given a Riemannian metric g and two affine connections $\nabla$ and $\nabla^*$. If for all vector fields $X, Y, Z \in \mathcal{T}(S)$,
$Z\langle X, Y \rangle = \langle \nabla_Z X, Y \rangle + \langle X, \nabla^*_Z Y \rangle$ (28)
holds, we say that $\nabla$ and $\nabla^*$ are duals of each other w.r.t. g and call one the dual connection of the other.
Additionally, we call the triple $(g, \nabla, \nabla^*)$ a dualistic structure on S.
Dual connection
Properties
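For any statistical model, the α-connection and the (-α)-connection are dual with respect to the Fisher metric.
$\langle \Pi_\gamma(D_1), \Pi^*_\gamma(D_2) \rangle_q = \langle D_1, D_2 \rangle_p$
where $\Pi_\gamma$ and $\Pi^*_\gamma$ are the parallel translations along $\gamma$ w.r.t. $\nabla$ and $\nabla^*$.
$R = 0 \Leftrightarrow R^* = 0$
where R and $R^*$ are the curvature tensors of $\nabla$ and $\nabla^*$.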
Dually flat spaces and dual coordinate system
Dually flat spaces
Let $(g, \nabla, \nabla^*)$ be a dualistic structure on a manifold S. Then we have $R = 0 \Leftrightarrow R^* = 0$, and if the connections $\nabla$ and $\nabla^*$ are both symmetric ($T = T^* = 0$), then we see that $\nabla$-flatness and $\nabla^*$-flatness are equivalent.
We call $(S, g, \nabla, \nabla^*)$ a dually flat space if both duals $\nabla$ and $\nabla^*$ are flat.
Example: Since α-connections and (-α)-connections are dual w.r.t. the Fisher metric and α-connections are symmetric, we have for any statistical model S and for any real number α
S is α-flat $\Leftrightarrow$ S is (-α)-flat (29)
Dually flat spaces and dual coordinate system
Dual coordinate system
For a particular $\nabla$-affine coordinate system $[\theta^i]$, suppose we choose a corresponding $\nabla^*$-affine coordinate system $[\eta_j]$ such that
$g(\partial_i, \partial^j) = \langle \partial_i, \partial^j \rangle = \delta_i^j$, where $\partial_i = \frac{\partial}{\partial\theta^i}$ and $\partial^j = \frac{\partial}{\partial\eta_j}$.
Then we say the two coordinate systems are mutually dual w.r.t. the metric g, and call one the dual coordinate system of the other.
Existence of dual coordinate system
A pair of dual coordinate systems exists if and only if $(S, g, \nabla, \nabla^*)$ is a dually flat space.
Legendre transformations
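Consider mutually dual coordinate systems $[\theta^i]$ and $[\eta_i]$ with functions $\psi: S \to \mathbb{R}$ and $\varphi: S \to \mathbb{R}$ which satisfy the following equations:
$\partial_i \psi = \eta_i$
$\partial^i \varphi = \theta^i$
$g_{ij} = \partial_i \eta_j = \partial_j \eta_i = \partial_i \partial_j \psi$
$\varphi(\eta) = \max_\theta \left\{ \sum_i \theta^i \eta_i - \psi(\theta) \right\}$
$\psi(\theta) = \max_\eta \left\{ \sum_i \theta^i \eta_i - \varphi(\eta) \right\}$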
Legendre transformations
Geometric interpretation for $f^*(p) = \max_x (px - f(x))$:
A convex function f(x) is shown in red, and the tangent line at the point $(x_0, f(x_0))$ is shown in blue. The tangent line intersects the vertical axis at $(0, -f^*)$ and $f^*$ is the value of the Legendre transform $f^*(p_0)$, where $p_0 = f'(x_0)$. Note that for any other point on the red curve, a line drawn through that point with the same slope as the blue line will have a y-intercept above the point $(0, -f^*)$, showing that $f^*$ is indeed a maximum.
Examples of Legendre transformations
Examples:
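The Legendre transform of $f(x) = \frac{1}{p}|x|^p$ (where $1 < p < \infty$) is $f^*(x^*) = \frac{1}{q}|x^*|^q$ (where $1 < q < \infty$ and $\frac{1}{p} + \frac{1}{q} = 1$),
The Legendre transform of $f(x) = e^x$ is $f^*(x^*) = x^* \ln x^* - x^*$ (where $x^* > 0$),
The Legendre transform of $f(x) = \frac{1}{2} x^T A x$ is $f^*(x^*) = \frac{1}{2} x^{*T} A^{-1} x^*$ (for symmetric positive-definite $A$),
The Legendre transform of $f(x) = |x|$ is $f^*(x^*) = 0$ if $|x^*| \le 1$, and $f^*(x^*) = \infty$ if $|x^*| > 1$.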
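The closed forms above can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the helper name `legendre_transform`, the grid bounds, and the test slopes are illustrative choices, not part of the original slides. It approximates $f^*(p) = \max_x (px - f(x))$ by a brute-force search over a finite grid.

```python
import numpy as np


def legendre_transform(f, x_grid, p):
    """Approximate f*(p) = max_x (p*x - f(x)) by a search over a finite grid."""
    return float(np.max(p * x_grid - f(x_grid)))


# Hypothetical grid; it must contain the maximizer x = ln(p) for each slope p.
x_grid = np.linspace(-10.0, 10.0, 200001)
slopes = np.array([0.5, 1.0, 2.0, 5.0])

# f(x) = e^x: compare the grid search with the closed form x* ln x* - x*.
numeric_exp = np.array([legendre_transform(np.exp, x_grid, p) for p in slopes])
closed_exp = slopes * np.log(slopes) - slopes

# f(x) = |x|: the transform is 0 for |x*| <= 1 ...
inside = legendre_transform(np.abs, x_grid, 0.7)
# ... while for |x*| > 1 the grid maximum sits on the grid boundary and grows
# without bound as the grid widens (the true value is infinity).
outside = legendre_transform(np.abs, x_grid, 1.5)

print(np.max(np.abs(numeric_exp - closed_exp)))
print(inside, outside)
```

A grid search is only a rough check: it is accurate when the maximizer lies inside the grid, and it can only hint at an infinite value by growing with the grid size.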
The natural parameter and dual parameter of Exponential family
For the distribution $p(x; \theta) = \exp\left[C(x) + \sum_{i=1}^{n} \theta^i F_i(x) - \psi(\theta)\right]$, the $[\theta^i]$ are called the natural parameters.
If we define $\eta_i = E_\theta[F_i] = \int F_i(x)\, p(x; \theta)\, dx$, we can verify that $[\eta_i]$ is a $(-1)$-affine coordinate system dual to $[\theta^i]$. We call $[\eta_i]$ the expectation parameters or the dual parameters.
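One step of that verification can be written out explicitly. The sketch below assumes that differentiation under the integral sign is permitted; it shows that the expectation parameters are the gradient of the potential, $\partial_i \psi = \eta_i$, which is the first Legendre relation for a dual pair of coordinates.

```latex
\begin{align*}
\psi(\theta) &= \log \int \exp\Big[C(x) + \sum_{j=1}^{n} \theta^j F_j(x)\Big]\, dx
  && \text{(normalization of } p(x;\theta)\text{)} \\
\partial_i \psi(\theta)
  &= \frac{\int F_i(x) \exp\big[C(x) + \sum_{j} \theta^j F_j(x)\big]\, dx}
          {\int \exp\big[C(x) + \sum_{j} \theta^j F_j(x)\big]\, dx} \\
  &= \int F_i(x)\, p(x;\theta)\, dx = E_\theta[F_i] = \eta_i .
\end{align*}
```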
The natural parameter and dual parameter of Exponential family
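Recall: Normal Distribution
$$p(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{30}$$
where $C(x) = 0$, $F_1(x) = x$, $F_2(x) = x^2$, and $\theta^1 = \frac{\mu}{\sigma^2}$, $\theta^2 = -\frac{1}{2\sigma^2}$ are the natural parameters. The potential function is:
$$\psi = -\frac{(\theta^1)^2}{4\theta^2} + \frac{1}{2}\log\left(-\frac{\pi}{\theta^2}\right) = \frac{\mu^2}{2\sigma^2} + \log\left(\sqrt{2\pi}\,\sigma\right) \tag{31}$$
The dual parameters are calculated as
$$\eta_1 = \frac{\partial \psi}{\partial \theta^1} = \mu = -\frac{\theta^1}{2\theta^2}, \qquad \eta_2 = \frac{\partial \psi}{\partial \theta^2} = \mu^2 + \sigma^2 = \frac{(\theta^1)^2 - 2\theta^2}{4(\theta^2)^2}.$$
The dual coordinate system has the potential function:
$$\varphi = -\frac{1}{2}\left(1 + \log\left(-\frac{\pi}{\theta^2}\right)\right) = -\frac{1}{2}\left(1 + \log(2\pi) + 2\log\sigma\right) \tag{32}$$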
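Equations (31) and (32) can be sanity-checked numerically. This is a minimal sketch, assuming NumPy; the function names and the test values of $\mu$ and $\sigma$ are hypothetical. It differentiates $\psi$ by central differences to recover the dual parameters, then compares $\varphi = \theta^i \eta_i - \psi$ with the closed form in (32).

```python
import numpy as np


def psi(theta):
    """Potential of the normal family in natural parameters, as in eq. (31)."""
    t1, t2 = theta
    return -t1**2 / (4.0 * t2) + 0.5 * np.log(-np.pi / t2)


def natural_parameters(mu, sigma):
    """theta^1 = mu / sigma^2, theta^2 = -1 / (2 sigma^2)."""
    return np.array([mu / sigma**2, -1.0 / (2.0 * sigma**2)])


def numerical_gradient(f, theta, h=1e-6):
    """Central-difference gradient; a rough check, not a production routine."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        step = np.zeros_like(theta)
        step[i] = h
        grad[i] = (f(theta + step) - f(theta - step)) / (2.0 * h)
    return grad


mu, sigma = 1.5, 0.8  # arbitrary test values
theta = natural_parameters(mu, sigma)

eta_numeric = numerical_gradient(psi, theta)       # d(psi)/d(theta^i)
eta_closed = np.array([mu, mu**2 + sigma**2])      # (mu, mu^2 + sigma^2)

phi_legendre = theta @ eta_closed - psi(theta)     # theta^i eta_i - psi
phi_closed = -0.5 * (1.0 + np.log(2.0 * np.pi) + 2.0 * np.log(sigma))

print(eta_numeric, eta_closed)
print(phi_legendre, phi_closed)
```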
Geometric structure of statistical models and statistical inference
Basic concepts
The Fisher metric and $\alpha$-connection
Exponential family
Divergence and Geometric statistical inference
Divergences
Let $S$ be a manifold and suppose that we are given a smooth function $D = D(\cdot\|\cdot): S \times S \to \mathbb{R}$ satisfying, for any $p, q \in S$:
$$D(p\|q) \ge 0, \text{ with equality iff } p = q \tag{33}$$
Such a $D$ provides a distance-like measure of the separation between two points.
Divergence, semimetrics and metrics
A distance satisfying positive-definiteness, symmetry and the triangle inequality is called a metric;
A distance satisfying positive-definiteness and symmetry is called a semimetric;
A distance satisfying only positive-definiteness is called a divergence.
Kullback-Leibler divergence
For discrete distributions $p$ and $q$:
$$D_{KL}(p\|q) = \sum_{x} p(x) \log\frac{p(x)}{q(x)} \tag{34}$$
For continuous distributions $p$ and $q$:
$$D_{KL}(p\|q) = \int p(x) \log\frac{p(x)}{q(x)}\, dx \tag{35}$$
Generally, we use the Kullback-Leibler divergence to measure the difference between two probability distributions p and q. With the logarithm taken to base 2, KL measures the expected number of extra bits required to code samples from p when using a code based on q rather than a code based on p. Typically p represents the true distribution of data, observations, or a precisely calculated theoretical distribution. The measure q typically represents a theory, model, description, or approximation of p.
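The discrete formula (34) takes only a few lines to compute. The sketch below assumes NumPy; the function name and the two example distributions are made up for illustration. It uses the natural logarithm, so the result is in nats (divide by $\ln 2$ for bits), and it assumes $q(x) > 0$ wherever $p(x) > 0$.

```python
import numpy as np


def kl_divergence(p, q):
    """Discrete KL divergence sum_x p(x) log(p(x)/q(x)), in nats.

    Terms with p(x) = 0 contribute 0; q(x) > 0 is assumed wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


# Two hypothetical distributions on a three-point sample space.
p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)

print(d_pq, d_qp)            # the two directions differ: KL is not symmetric
print(kl_divergence(p, p))   # zero when both arguments coincide
```

The asymmetry is why KL counts as a divergence rather than a metric or semimetric in the classification given earlier.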
Bregman divergence
The Bregman divergence associated with $F$ for points $x, y$ is:
$$B_F(x\|y) = F(y) - F(x) - \langle (y - x), \nabla F(x) \rangle,$$
where $F(x)$ is a differentiable, strictly convex function defined on a closed convex set $\Omega$.
Examples:
$F(x) = \|x\|^2$, then $B_F(x\|y) = \|x - y\|^2$.
More generally, if $F(x) = \frac{1}{2} x^T A x$, then $B_F(x\|y) = \frac{1}{2}(x - y)^T A (x - y)$.
KL divergence: if $F(x) = \sum_i x_i \log x_i - \sum_i x_i$, we get the Kullback-Leibler divergence.
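The examples can be checked with a small generic routine. This is a sketch assuming NumPy; the function names and test vectors are hypothetical. It follows the argument order used on this slide, where the gradient is taken at the first argument, so for the negative-entropy generator and normalized vectors the result is $\sum_i y_i \log(y_i / x_i)$. Other references take the gradient at the second argument, which swaps the roles of $x$ and $y$.

```python
import numpy as np


def bregman(F, grad_F, x, y):
    """B_F(x||y) = F(y) - F(x) - <y - x, grad F(x)>, in the slide's convention."""
    return float(F(y) - F(x) - np.dot(y - x, grad_F(x)))


def squared_norm(v):
    return np.dot(v, v)


def squared_norm_grad(v):
    return 2.0 * v


def negative_entropy(v):
    """F(v) = sum_i v_i log v_i - sum_i v_i, for strictly positive v."""
    return np.sum(v * np.log(v) - v)


def negative_entropy_grad(v):
    return np.log(v)


# Hypothetical probability vectors with strictly positive entries.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.6, 0.3, 0.1])

b_square = bregman(squared_norm, squared_norm_grad, x, y)
b_entropy = bregman(negative_entropy, negative_entropy_grad, x, y)

print(b_square, np.sum((x - y) ** 2))           # squared Euclidean distance
print(b_entropy, np.sum(y * np.log(y / x)))     # KL-type sum for normalized x, y
```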


Canonical divergence
Canonical divergence (a divergence for dually flat spaces)
Let $(S, g, \nabla, \nabla^*)$ be a dually flat space, and let $[\theta^i]$, $[\eta_j]$ be mutually dual affine coordinate systems with potentials $\psi$, $\varphi$. Then the canonical divergence (the $(g, \nabla)$-divergence) is defined as:
$$D(p\|q) = \psi(p) + \varphi(q) - \theta^i(p)\, \eta_i(q) \tag{36}$$
Canonical divergence
Properties:
Relation between the $(g, \nabla)$-divergence and the $(g, \nabla^*)$-divergence: $D^*(p\|q) = D(q\|p)$
If $M$ is an autoparallel submanifold w.r.t. either $\nabla$ or $\nabla^*$, then the $(g_M, \nabla_M)$-divergence $D_M = D|_{M \times M}$ is given by $D_M(p\|q) = D(p\|q)$
If $\nabla$ is a Riemannian connection ($\nabla = \nabla^*$) which is flat on $S$, there exists a coordinate system which is self-dual ($\theta^i = \eta_i$), so that $\varphi = \psi = \frac{1}{2}\sum_i (\theta^i)^2$, and the canonical divergence is $D(p\|q) = \frac{1}{2} d(p, q)^2$, where $d(p, q) = \sqrt{\sum_i \{\theta^i(p) - \theta^i(q)\}^2}$
Canonical divergence
Triangular relation
Let $[\theta^i]$, $[\eta_i]$ be mutually dual affine coordinate systems of a dually flat space $(S, g, \nabla, \nabla^*)$, and let $D$ be a divergence on $S$. Then a necessary and sufficient condition for $D$ to be the $(g, \nabla)$-divergence is that for all $p, q, r \in S$ the following triangular relation holds:
$$D(p\|q) + D(q\|r) - D(p\|r) = \{\theta^i(p) - \theta^i(q)\}\{\eta_i(r) - \eta_i(q)\} \tag{37}$$
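The necessity direction of the triangular relation follows directly from the definition of the canonical divergence in (36). The derivation below is a sketch of that direction only; it uses the Legendre identity $\psi(q) + \varphi(q) = \theta^i(q)\,\eta_i(q)$, which is equivalent to $D(q\|q) = 0$.

```latex
\begin{align*}
D(p\|q) + D(q\|r) - D(p\|r)
  &= \big[\psi(p) + \varphi(q) - \theta^i(p)\eta_i(q)\big]
   + \big[\psi(q) + \varphi(r) - \theta^i(q)\eta_i(r)\big] \\
  &\quad - \big[\psi(p) + \varphi(r) - \theta^i(p)\eta_i(r)\big] \\
  &= \psi(q) + \varphi(q) - \theta^i(p)\eta_i(q) - \theta^i(q)\eta_i(r) + \theta^i(p)\eta_i(r) \\
  &= \theta^i(q)\eta_i(q) - \theta^i(p)\eta_i(q) - \theta^i(q)\eta_i(r) + \theta^i(p)\eta_i(r) \\
  &= \{\theta^i(p) - \theta^i(q)\}\{\eta_i(r) - \eta_i(q)\}.
\end{align*}
```

When the right-hand side vanishes, this reduces to the additive form $D(p\|r) = D(p\|q) + D(q\|r)$.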
Canonical divergence
Pythagorean relation
Let $p$, $q$, and $r$ be three points in $S$. Let $\gamma_1$ be the $\nabla$-geodesic connecting $p$ and $q$, and let $\gamma_2$ be the $\nabla^*$-geodesic connecting $q$ and $r$. If at the intersection $q$ the curves $\gamma_1$ and $\gamma_2$ are orthogonal (with respect to the inner product $g$), then we have the following Pythagorean relation:
$$D(p\|r) = D(p\|q) + D(q\|r) \tag{38}$$
Canonical divergence
Projection theorem
Let $p$ be a point in $S$ and let $M$ be a submanifold of $S$ which is $\nabla^*$-autoparallel. Then a necessary and sufficient condition for a point $q$ in $M$ to satisfy
$$D(p\|q) = \min_{r \in M} D(p\|r) \tag{39}$$
is for the $\nabla$-geodesic connecting $p$ and $q$ to be orthogonal to $M$ at $q$.
Canonical divergence
Examples
From the definitions of the exponential family and the mixture family, a product of exponential families is still an exponential family, and a sum of mixture families is still a mixture family.
e-flat submanifold: the set of all product distributions:
$$E_0 = \left\{ p_X \;\middle|\; p_X(x_1, \dots, x_N) = \prod_{i=1}^{N} p_{X_i}(x_i) \right\}$$
m-flat submanifold: the set of joint distributions with given marginals:
$$M_0 = \left\{ p_X \;\middle|\; \sum_{X \setminus i} p_X(x) = q_i(x_i) \;\; \forall i \in \{1, \dots, N\} \right\}$$
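For the product-distribution family $E_0$, the Pythagorean relation can be observed numerically. The sketch below assumes NumPy and takes the divergence to be the Kullback-Leibler divergence $D_{KL}(p\|r)$; the joint distribution and the comparison point are made-up values. Under that assumption, the point of $E_0$ closest to a joint distribution $p$ is the product of the marginals of $p$, and the divergence to any other product distribution splits additively.

```python
import numpy as np


def kl_divergence(p, q):
    """Discrete KL divergence in nats; arrays are flattened before summing."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    support = p > 0
    return float(np.sum(p[support] * np.log(p[support] / q[support])))


# Hypothetical joint distribution of two binary variables (rows: x1, cols: x2).
p_joint = np.array([[0.4, 0.1],
                    [0.2, 0.3]])

# Candidate projection onto E_0: the product of the marginals of p.
marginal_1 = p_joint.sum(axis=1)
marginal_2 = p_joint.sum(axis=0)
q_proj = np.outer(marginal_1, marginal_2)

# Any other product distribution r in E_0 (arbitrary values).
r = np.outer([0.7, 0.3], [0.25, 0.75])

lhs = kl_divergence(p_joint, r)
rhs = kl_divergence(p_joint, q_proj) + kl_divergence(q_proj, r)

print(lhs, rhs)                             # equal up to rounding
print(kl_divergence(p_joint, q_proj), lhs)  # the projection is never farther than r
```

The first term on the right, $D_{KL}(p\|q)$ with $q$ the product of marginals, is the mutual information between the two variables.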
Thanks!
Questions?