
Blind Source Separation by

Independent Components
Analysis
Professor Dr. Barrie W. Jervis
School of Engineering
Sheffield Hallam University
England
B.W.Jervis@shu.ac.uk

The Problem
Temporally independent unknown source
signals are linearly mixed in an unknown
system to produce a set of measured output
signals.
It is required to determine the source
signals.

Methods of solving this problem are known as Blind Source Separation (BSS) techniques.
In this presentation the method of
Independent Components Analysis (ICA) will
be described.
The arrangement is illustrated in the next
slide.

Arrangement for BSS by ICA

[Block diagram: unknown sources s1, …, sn are mixed to give the measured signals x1, …, xn; the unmixing stage produces the activations u1, …, un, which pass through the nonlinearities g(.) to give the outputs y1 = g1(u1), y2 = g2(u2), …, yn = gn(un).]

Neural Network Interpretation

The si are the independent source signals,
A is the linear mixing matrix,
The xi are the measured signals,
W ≈ A⁻¹ is the estimated unmixing matrix,
The ui are the estimated source signals or activations, i.e. ui ≈ si,
The gi(ui) are monotonic nonlinear functions (sigmoids, hyperbolic tangents),
The yi are the network outputs.

Principles of Neural Network Approach
Use Information Theory to derive an algorithm
which minimises the mutual information
between the outputs y=g(u).
This minimises the mutual information between
the source signal estimates, u, since g(u)
introduces no dependencies.
The different u are then temporally independent
and are the estimated source signals.

Cautions I
The magnitudes and signs of the estimated source
signals are unreliable since
the magnitudes are not scaled
the signs are undefined
because magnitude and sign information is shared
between the source signal vector and the unmixing
matrix, W.
The order of the outputs is permuted compared with the inputs.

Cautions II
Similar overlapping source signals may not
be properly extracted.
If the number of output channels is less than the number of source signals, those source signals of lowest variance will not be extracted. This is a problem when these signals are important.

Information Theory I

If X is a vector of variables (messages) x_i which occur with probabilities P(x_i), then the average information content of a stream of N messages is

H(X) = -\sum_{i=1}^{N} P(x_i) \log_2 P(x_i)   bits

and is known as the entropy of the random variable, X.
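
As a small concrete example of this definition (my own sketch, not part of the original slides), the entropy of a discrete distribution can be computed directly from its probabilities:

import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution p (probabilities summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

# Example: a fair coin carries 1 bit per message, a biased coin carries less.
print(entropy_bits([0.5, 0.5]))       # 1.0
print(entropy_bits([0.9, 0.1]))       # about 0.469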

Information Theory II
Note that the entropy is expressible in terms
of probability.
Given the probability density function (pdf) of X we can find the associated entropy.
This link between entropy and pdf is of the
greatest importance in ICA theory.

Information Theory III


The joint entropy between two random variables X and Y is given by

H(X, Y) = -\sum_{x,y} P(x,y) \log_2 P(x,y)

For independent variables,

H(X, Y) = H(X) + H(Y)   iff   P(x,y) = P(x)P(y)

Information Theory IV
The conditional entropy of Y given X measures the average uncertainty remaining about y when x is known, and is

H(Y | X) = -\sum_{x,y} P(x,y) \log_2 P(y|x)

The mutual information between Y and X is

I(Y, X) = H(Y) + H(X) - H(Y, X) = H(Y) - H(Y | X)

In ICA, X represents the measured signals, which are applied to the nonlinear function g(u) to obtain the outputs Y.
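
As a concrete check of these identities (my own small example, not from the slides), the sketch below builds a 2-by-2 joint probability table and verifies that I(Y, X) = H(Y) + H(X) - H(Y, X) is positive when the variables are dependent:

import numpy as np

def H(p):
    """Entropy in bits of the probabilities in p (zeros are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint distribution P(x, y) of two dependent binary variables (rows: x, columns: y).
Pxy = np.array([[0.30, 0.20],
                [0.10, 0.40]])
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)   # marginal distributions

I = H(Px) + H(Py) - H(Pxy)                  # mutual information in bits
print(I)                                    # about 0.125 bits, > 0 since P(x,y) != P(x)P(y)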

Bell and Sejnowski's ICA Theory (1995)

Aim: to maximise the amount of mutual information between the inputs X and the outputs Y of the neural network,

I(Y, X) = H(Y) - H(Y | X)

where H(Y | X) is the uncertainty about Y remaining when X is known.

Y is a function of W and g(u). Here we seek to determine the W which produces the ui ≈ si, assuming the correct g(u).

Differentiating:

\frac{\partial}{\partial w} I(Y, X) = \frac{\partial}{\partial w} H(Y) - \frac{\partial}{\partial w} H(Y | X) = \frac{\partial}{\partial w} H(Y)

since \partial H(Y | X) / \partial w = 0 (that part of the output entropy did not come through W from X).

So, maximising this mutual information is equivalent to maximising the joint output entropy,

H(y_1, \ldots, y_N) = H(y_1) + \cdots + H(y_N) - I(y_1, \ldots, y_N)

which is seen to be equivalent to minimising the mutual information between the outputs and hence the ui, as desired.

The Functions g(u)


The outputs yi are amplitude-bounded random variables, and so the marginal entropies H(yi) are maximum when the yi are uniformly distributed - a known statistical result.
With the H(yi) maximised and the yi uniformly distributed, I(y1, …, yN) = 0, and the nonlinearity gi(ui) has the form of the cumulative distribution function of the probability density function of the si - a proven result.
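
A quick numerical illustration of this result (my own example, with a Laplace source assumed): passing a variable through its own cumulative distribution function yields a uniformly distributed, maximum-entropy output.

import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=100_000)                            # a super-Gaussian-like source
F = 0.5 * (1 + np.sign(s) * (1 - np.exp(-np.abs(s))))    # the Laplace CDF applied to s

counts, _ = np.histogram(F, bins=10, range=(0.0, 1.0))
print(counts / counts.sum())                             # every bin close to 0.1: F(s) is roughly uniform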

Pause and review g(u) and W


W has to be chosen to maximise the joint output entropy H(Y), which minimises the mutual information between the estimated source signals, ui.
The g(u) should be the cumulative
distribution functions of the source signals, si.
Determining the g(u) is a major problem.

One input and one output


For a monotonic nonlinear function, g(x),

f_y(y) = \frac{f_x(x)}{\left| \partial y / \partial x \right|}

Also,

H(y) = -E[\ln f_y(y)] = -\int f_y(y) \ln f_y(y)\, dy

Substituting:

H(y) = E\left[ \ln \left| \frac{\partial y}{\partial x} \right| \right] - E[\ln f_x(x)]

We only need to maximise the first term; the second is independent of W.

A stochastic gradient ascent learning rule is adopted to maximise H(y) by assuming

\Delta w \propto \frac{\partial H}{\partial w} = \frac{\partial}{\partial w} \left( \ln \left| \frac{\partial y}{\partial x} \right| \right) = \left( \frac{\partial y}{\partial x} \right)^{-1} \frac{\partial}{\partial w} \left( \frac{\partial y}{\partial x} \right)

Further progress requires knowledge of g(u).
Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e.

y = \frac{1}{1 + e^{-u}}

Also assume u = wx + w_0.

Learning Rule: 1 input, 1 output


Hence, we find:

\frac{\partial y}{\partial x} = w\, y (1 - y)

\frac{\partial}{\partial w} \left( \frac{\partial y}{\partial x} \right) = y (1 - y) \left( 1 + w x (1 - 2y) \right)

from which:

\Delta w \propto \frac{1}{w} + x (1 - 2y), \quad \text{and} \quad \Delta w_0 \propto 1 - 2y
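
A minimal sketch of this single-unit rule in code (the Laplace data, learning rate and initial weights are my own assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(1)
x_data = rng.laplace(size=10_000)       # a super-Gaussian "measured" signal
w, w0, lr = 0.1, 0.0, 0.01              # initial weight, bias and learning rate (assumed)

for x in x_data:
    u = w * x + w0
    y = 1.0 / (1.0 + np.exp(-u))        # sigmoidal g(u)
    w += lr * (1.0 / w + x * (1.0 - 2.0 * y))
    w0 += lr * (1.0 - 2.0 * y)

print(w, w0)                            # w scales x so that y is roughly uniform on (0, 1)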

Learning Rule: N inputs, N outputs


Need

f_y(\mathbf{y}) = \frac{f_x(\mathbf{x})}{|J|}

where J is the Jacobian

J = \det \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_N}{\partial x_1} & \cdots & \frac{\partial y_N}{\partial x_N} \end{pmatrix}

Assuming g(u) is sigmoidal again, we obtain:

\Delta \mathbf{W} \propto \left[ \mathbf{W}^T \right]^{-1} + (\mathbf{1} - 2\mathbf{y}) \mathbf{x}^T, \quad \text{and} \quad \Delta \mathbf{w}_0 \propto \mathbf{1} - 2\mathbf{y}

where 1 is the unit vector (a vector of ones).

The network is trained until the changes in the weights become acceptably small at each iteration.
Thus the unmixing matrix W is found.
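
A sketch of this sample-by-sample infomax update (the two-source Laplace mixture, rescaling and learning rate are my own assumptions, not the authors' implementation):

import numpy as np

rng = np.random.default_rng(2)
N, T = 2, 20_000
S = rng.laplace(size=(N, T))                     # super-Gaussian sources
X = rng.normal(size=(N, N)) @ S                  # unknown linear mixture
X /= X.std()                                     # crude rescaling to keep the updates stable

W, w0, lr = np.eye(N), np.zeros(N), 0.001
for t in range(T):
    x = X[:, t]
    u = W @ x + w0
    y = 1.0 / (1.0 + np.exp(-u))                 # sigmoidal g(u)
    W += lr * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))
    w0 += lr * (1.0 - 2.0 * y)

U = W @ X
print(np.corrcoef(U))                            # activations should come out roughly uncorrelated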

The Natural Gradient

The computation of the inverse matrix [W^T]^{-1} is time-consuming, and may be avoided by rescaling the entropy gradient, multiplying it by W^T W.

Thus, for a sigmoidal g(u) we obtain

\Delta \mathbf{W} \propto \frac{\partial H(\mathbf{y})}{\partial \mathbf{W}} \mathbf{W}^T \mathbf{W} = \left[ \mathbf{I} + (\mathbf{1} - 2\mathbf{y}) \mathbf{u}^T \right] \mathbf{W}

This is the natural gradient, introduced by Amari (1998), and now widely adopted.
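
A mini-batch sketch of the natural-gradient form (batch size, learning rate and test mixture are my own assumptions):

import numpy as np

rng = np.random.default_rng(3)
N, T = 2, 20_000
S = rng.laplace(size=(N, T))                     # super-Gaussian sources
X = rng.normal(size=(N, N)) @ S                  # unknown linear mixture
X /= X.std()

W, lr, batch = np.eye(N), 0.01, 100
for start in range(0, T, batch):
    U = W @ X[:, start:start + batch]            # current activation estimates
    Y = 1.0 / (1.0 + np.exp(-U))
    grad = (batch * np.eye(N) + (1.0 - 2.0 * Y) @ U.T) @ W
    W += lr * grad / batch                       # batch-averaged update: [I + (1 - 2y) u^T] W

U = W @ X
print(np.corrcoef(U))                            # activations should come out roughly uncorrelated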

The nonlinearity, g(u)


We have already learnt that the g(u) should be the cumulative distribution functions of the individual source distributions.
So far the g(u) have been assumed to be sigmoidal, so what are the pdfs of the si?
The corresponding pdfs of the si are super-Gaussian.

Super- and sub-Gaussian pdfs


[Figure: P(si) versus si, comparing a Gaussian, a super-Gaussian, and a sub-Gaussian pdf.]

* Note: there are no mathematical definitions of super- and sub-Gaussians.

Super- and sub-Gaussians


Super-Gaussians: kurtosis (fourth-order central moment, measures the flatness of the pdf) > 0; infrequent signals of short duration, e.g. evoked brain signals.
Sub-Gaussians: kurtosis < 0; signals mainly "on", e.g. 50/60 Hz electrical mains supply, but also eye blinks.

Kurtosis

Kurtosis is based on the fourth-order central moment,

m_x(4) = E\left[ (x - m_x)^4 \right]

and for the zero-mean activations takes the (excess) form

\mathrm{kurt}(u_i) = E[u_i^4] - 3 \left( E[u_i^2] \right)^2

and is seen to be calculated from the current estimates of the source signals.
To separate the independent sources, information about their pdfs such as skewness (3rd moment) and flatness (kurtosis) is required.
First and 2nd moments (mean and variance) are insufficient.
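
A small sketch computing this excess kurtosis from sampled data (the test distributions are my own choices):

import numpy as np

def kurtosis(u):
    """Excess kurtosis kurt(u) = E[u^4] - 3 (E[u^2])^2 of the centred data."""
    u = np.asarray(u, dtype=float)
    u = u - u.mean()
    return float(np.mean(u**4) - 3.0 * np.mean(u**2)**2)

rng = np.random.default_rng(4)
print(kurtosis(rng.laplace(size=100_000)))     # > 0 : super-Gaussian
print(kurtosis(rng.normal(size=100_000)))      # ~ 0 : Gaussian
print(kurtosis(rng.uniform(-1, 1, 100_000)))   # < 0 : sub-Gaussian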

A more generalised learning rule


Girolami (1997) showed that tanh(ui) and -tanh(ui)
could be used for super- and sub-Gaussians
respectively.
Cardoso and Laheld (1996) developed a stability
analysis to determine whether the source signals
were to be considered super- or sub-Gaussian.
Lee, Girolami, and Sejnowski (1998) applied these
findings to develop their extended infomax
algorithm for super- and sub-Gaussians using a
kurtosis-based switching rule.

Extended Infomax Learning Rule


With super-Gaussians modelled as

p(u) \propto N(0, 1)\, \mathrm{sech}^2(u)

and sub-Gaussians as a Pearson mixture model

p(u) = \tfrac{1}{2} \left[ N(\mu, \sigma^2) + N(-\mu, \sigma^2) \right]

the new extended learning rule is

\Delta \mathbf{W} \propto \left[ \mathbf{I} - \mathbf{K} \tanh(\mathbf{u}) \mathbf{u}^T - \mathbf{u} \mathbf{u}^T \right] \mathbf{W}

k_i = +1: super-Gaussian,
k_i = -1: sub-Gaussian.

Switching Decision
\Delta \mathbf{W} \propto \left[ \mathbf{I} - \mathbf{K} \tanh(\mathbf{u}) \mathbf{u}^T - \mathbf{u} \mathbf{u}^T \right] \mathbf{W}, \quad k_i = +1 \text{ (super-Gaussian)}, \quad k_i = -1 \text{ (sub-Gaussian)}

where the ki are the elements of the N-dimensional diagonal matrix K, and

k_i = \mathrm{sign}\left( E[\mathrm{sech}^2(u_i)]\, E[u_i^2] - E[\tanh(u_i)\, u_i] \right)

Modifications of the formula for ki exist, but in our experience the extended algorithm has been unsatisfactory.
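
For concreteness, here is a minimal sketch of the extended update with the kurtosis-sign switching above (the two test sources, batch size and learning rate are my own assumptions, and this is not the authors' implementation):

import numpy as np

rng = np.random.default_rng(5)
T = 20_000
S = np.vstack([rng.laplace(size=T),                                     # super-Gaussian source
               np.sign(np.sin(2 * np.pi * 50 * np.arange(T) / 1000))])  # sub-Gaussian square wave
X = rng.normal(size=(2, 2)) @ S                                         # unknown mixture
X /= X.std()

W, lr, batch = np.eye(2), 0.01, 200
for start in range(0, T, batch):
    U = W @ X[:, start:start + batch]
    # Switching decision: k_i = sign( E[sech^2(u_i)] E[u_i^2] - E[tanh(u_i) u_i] )
    k = np.sign(np.mean(1 / np.cosh(U)**2, axis=1) * np.mean(U**2, axis=1)
                - np.mean(np.tanh(U) * U, axis=1))
    # Extended infomax update, averaged over the batch: [I - K tanh(u) u^T - u u^T] W
    grad = np.eye(2) - (k[:, None] * np.tanh(U)) @ U.T / batch - U @ U.T / batch
    W += lr * grad @ W

print(np.corrcoef(W @ X))        # off-diagonal terms small if the two sources were separated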

Reasons for unsatisfactory extended algorithm


1) Initial assumptions about super- and sub-Gaussian
distributions may be too inaccurate.
2) The switching criterion may be inadequate.

Alternatives
Postulate vague distributions for the source signals
which are then developed iteratively during training.
Use an alternative approach, e.g. a statistically based method such as JADE (Cardoso).

Summary so far
We have seen how W may be obtained by training
the network, and the extended algorithm for
switching between super- and sub-Gaussians has
been described.
Alternative approaches have been mentioned.
Next we consider how to obtain the source signals
knowing W and the measured signals, x.

Source signal determination


The system is:

[Block diagram: unknown sources si → mixing matrix A → measured signals xi → unmixing matrix W → activations ui ≈ si → nonlinearity g(u) → estimated outputs yi.]

Hence U = W·x and x = A·S, where A ≈ W⁻¹ and U ≈ S.
The rows of U are the estimated source signals, known as activations (as functions of time).
The rows of x are the time-varying measured signals.

Source Signals
U = W x, where U is M×N, W is M×M, and x is M×N:

\begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1N} \\ \vdots & & & \vdots \\ u_{M1} & u_{M2} & \cdots & u_{MN} \end{pmatrix}
=
\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1M} \\ \vdots & & & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MM} \end{pmatrix}
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ \vdots & & & \vdots \\ x_{M1} & x_{M2} & \cdots & x_{MN} \end{pmatrix}

Rows correspond to the channel number; columns to time (the sample number).

Expressions for the Activations


u_{11} = w_{11} x_{11} + w_{12} x_{21} + \cdots + w_{1M} x_{M1}
u_{12} = w_{11} x_{12} + w_{12} x_{22} + \cdots + w_{1M} x_{M2}

We see that consecutive values of u are obtained by filtering consecutive columns of x by the same row of W.

u_{21} = w_{21} x_{11} + w_{22} x_{21} + w_{23} x_{31} + \cdots + w_{2M} x_{M1}

The ith row of u is obtained by multiplying the ith row of W into the columns of x.
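
The same relation in code (a short sketch with assumed sizes and placeholder values for W):

import numpy as np

rng = np.random.default_rng(6)
M, N = 3, 1000                          # M channels, N time points (assumed)
W = rng.normal(size=(M, M))             # trained unmixing matrix (placeholder values)
x = rng.normal(size=(M, N))             # measured signals, one row per channel

u = W @ x                               # rows of u are the activations over time
print(np.allclose(u[1, 0], W[1, :] @ x[:, 0]))   # True: u_21 = w_21 x_11 + ... + w_2M x_M1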

Procedure
Record N time points from each of M sensors, where N ≥ 5M.
Pre-process the data, e.g. filtering, trend removal.
Sphere the data using Principal Components Analysis (PCA). This is not essential but speeds up the computation by first removing the first- and second-order moments.
Compute the ui ≈ si. Include desphering.
Analyse the results.
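
An end-to-end sketch of this procedure (sphering with PCA, natural-gradient training, then folding the sphering matrix back into W); the simulated data, learning rate and batch size are my own assumptions:

import numpy as np

rng = np.random.default_rng(7)
M, N = 2, 20_000                               # M sensors, N >= 5M time points
S = rng.laplace(size=(M, N))                   # unknown sources (simulated for the demo)
X = rng.normal(size=(M, M)) @ S                # measured signals

X = X - X.mean(axis=1, keepdims=True)          # pre-processing: remove the means
evals, E = np.linalg.eigh(np.cov(X))           # sphering (whitening) via PCA
Sph = E @ np.diag(evals ** -0.5) @ E.T         # sphering matrix
Z = Sph @ X                                    # sphered data: identity covariance

W, lr, batch = np.eye(M), 0.01, 100            # train W on the sphered data (natural gradient)
for start in range(0, N, batch):
    U = W @ Z[:, start:start + batch]
    Y = 1.0 / (1.0 + np.exp(-U))
    W += lr * (np.eye(M) + (1.0 - 2.0 * Y) @ U.T / batch) @ W

W_total = W @ Sph                              # desphering: fold the sphering matrix into W
U = W_total @ X                                # estimated sources, u_i ~ s_i (order and scale arbitrary)
print(np.round(np.abs(np.corrcoef(np.vstack([U, S]))[:M, M:]), 2))   # each row should show one strong correlation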

Optional Procedures I
The contribution of each activation at a sensor may be found by back-projecting it to the sensor:

x = W^{-1} s

For example, keeping only activation 2 and setting all the other activations to zero,

x = \begin{pmatrix} w^{-1}_{11} & w^{-1}_{12} & \cdots & w^{-1}_{1M} \\ w^{-1}_{21} & w^{-1}_{22} & \cdots & w^{-1}_{2M} \\ \vdots & & & \vdots \end{pmatrix}
\begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2N} \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}

so that

x_{21} = w^{-1}_{21} \cdot 0 + w^{-1}_{22} s_{21} + \cdots + w^{-1}_{2M} \cdot 0 = w^{-1}_{22} s_{21}
x_{22} = w^{-1}_{22} s_{22}, \ldots
x_{2N} = w^{-1}_{22} s_{2N}
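
A short sketch of this back-projection (assumed sizes, with placeholder values for W and the activations): only activation 2 is kept and mapped back through W⁻¹ to give its contribution at every sensor.

import numpy as np

rng = np.random.default_rng(8)
M, N = 3, 1000
W = rng.normal(size=(M, M))             # trained unmixing matrix (placeholder values)
U = rng.normal(size=(M, N))             # activations, one row per estimated source

U_only2 = np.zeros_like(U)
U_only2[1, :] = U[1, :]                 # zero every activation except number 2

X_from2 = np.linalg.inv(W) @ U_only2    # contribution of activation 2 at each sensor
print(X_from2.shape)                    # (M, N): one back-projected trace per sensor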

Optional Procedures II
A measured signal which is contaminated by artefacts or
noise may be extracted by back-projecting all the signal
activations to the measurement electrode, setting other
activations to zero. (An artefact and noise removal method).
x = \begin{pmatrix} w^{-1}_{11} & w^{-1}_{12} & \cdots & w^{-1}_{1M} \\ w^{-1}_{21} & w^{-1}_{22} & \cdots & w^{-1}_{2M} \\ \vdots & & & \vdots \end{pmatrix}
\begin{pmatrix} s_{11} & s_{12} & s_{13} & \cdots & s_{1N} \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2N} \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}

so that

x_{21} = w^{-1}_{21} s_{11} + w^{-1}_{22} s_{21} + \cdots + w^{-1}_{2M} \cdot 0 = w^{-1}_{21} s_{11} + w^{-1}_{22} s_{21}
x_{22} = w^{-1}_{21} s_{12} + w^{-1}_{22} s_{22}, \ldots
x_{2N} = w^{-1}_{21} s_{1N} + w^{-1}_{22} s_{2N}

Current Developments
Overcomplete representations - more signal
sources than sensors.
Nonlinear mixing.
Nonstationary sources.
General formulation of g(u).

Conclusions
It has been shown how to extract temporally
independent unknown source signals from their
linear mixtures at the outputs of an unknown
system using Independent Components
Analysis.
Some of the limitations of the method have been
mentioned.
Current developments have been highlighted.
