
Blind Source Separation by

Independent Components
Analysis
Professor Dr. Barrie W. Jervis
School of Engineering
Sheffield Hallam University
England
B.W.Jervis@shu.ac.uk

The Problem
Temporally independent unknown source
signals are linearly mixed in an unknown
system to produce a set of measured output
signals.
It is required to determine the source
signals.

Methods of solving this problem are known as Blind Source Separation (BSS) techniques.
In this presentation the method of
Independent Components Analysis (ICA) will
be described.
The arrangement is illustrated in the next
slide.

Arrangement for BSS by ICA

[Block diagram: unknown sources s1, …, sn are mixed to give the measured signals x1, …, xn; the unmixing stage produces the activations u1, …, un, which pass through the nonlinearities g(.) to give the outputs y1 = g1(u1), y2 = g2(u2), …, yn = gn(un).]

Neural Network Interpretation

The si are the independent source signals,
A is the linear mixing matrix,
The xi are the measured signals,
W ≈ A⁻¹ is the estimated unmixing matrix,
The ui are the estimated source signals or activations, i.e. ui ≈ si,
The gi(ui) are monotonic nonlinear functions (sigmoids, hyperbolic tangents),
The yi are the network outputs.

Principles of Neural Network Approach
Use Information Theory to derive an algorithm
which minimises the mutual information
between the outputs y=g(u).
This minimises the mutual information between
the source signal estimates, u, since g(u)
introduces no dependencies.
The different u are then temporally independent
and are the estimated source signals.

Cautions I
The magnitudes and signs of the estimated source
signals are unreliable since
the magnitudes are not scaled
the signs are undefined
because magnitude and sign information is shared
between the source signal vector and the unmixing
matrix, W.
The order of the outputs is permuted compared with the inputs.

Cautions II
Similar overlapping source signals may not
be properly extracted.
If the number of output channels is less than the number of source signals, those source signals of lowest variance will not be extracted. This is a problem when these signals are important.

Information Theory I

If X is a vector of variables (messages) x_i which occur with probabilities P(x_i), then the average information content of a stream of N messages is

H(X) = -\sum_{i=1}^{N} P(x_i) \log_2 P(x_i)   bits

and is known as the entropy of the random variable, X.
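
As a small concrete example of this definition (my own sketch, not part of the original slides), the entropy of a discrete distribution can be computed directly from its probabilities:

import numpy as np

def entropy_bits(p):
    """Entropy in bits of a discrete distribution p (probabilities summing to 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

# Example: a fair coin carries 1 bit per message, a biased coin carries less.
print(entropy_bits([0.5, 0.5]))       # 1.0
print(entropy_bits([0.9, 0.1]))       # about 0.469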

Information Theory II
Note that the entropy is expressible in terms
of probability.
Given the probability density function (pdf) of X we can find the associated entropy.
This link between entropy and pdf is of the
greatest importance in ICA theory.

Information Theory III


The joint entropy between two random variables X and Y is given by

H(X, Y) = -\sum_{x,y} P(x,y) \log_2 P(x,y)

For independent variables,

H(X, Y) = H(X) + H(Y)   iff   P(x,y) = P(x)P(y)

Information Theory IV
The conditional entropy of Y given X measures the average uncertainty remaining about y when x is known, and is

H(Y | X) = -\sum_{x,y} P(x,y) \log_2 P(y|x)

The mutual information between Y and X is

I(Y, X) = H(Y) + H(X) - H(Y, X) = H(Y) - H(Y | X)

In ICA, X represents the measured signals, which are applied to the nonlinear function g(u) to obtain the outputs Y.
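
As a concrete check of these identities (my own small example, not from the slides), the sketch below builds a 2-by-2 joint probability table and verifies that I(Y, X) = H(Y) + H(X) - H(Y, X) is positive when the variables are dependent:

import numpy as np

def H(p):
    """Entropy in bits of the probabilities in p (zeros are ignored)."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Joint distribution P(x, y) of two dependent binary variables (rows: x, columns: y).
Pxy = np.array([[0.30, 0.20],
                [0.10, 0.40]])
Px, Py = Pxy.sum(axis=1), Pxy.sum(axis=0)   # marginal distributions

I = H(Px) + H(Py) - H(Pxy)                  # mutual information in bits
print(I)                                    # about 0.125 bits, > 0 since P(x,y) != P(x)P(y)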

Bell and Sejnowski's ICA Theory (1995)

Aim: to maximise the amount of mutual information between the inputs X and the outputs Y of the neural network,

I(Y, X) = H(Y) - H(Y | X)

where H(Y | X) is the uncertainty about Y remaining when X is known.

Y is a function of W and g(u). Here we seek to determine the W which produces the ui ≈ si, assuming the correct g(u).

Differentiating:

\frac{\partial}{\partial w} I(Y, X) = \frac{\partial}{\partial w} H(Y) - \frac{\partial}{\partial w} H(Y | X) = \frac{\partial}{\partial w} H(Y)

since \partial H(Y | X) / \partial w = 0 (that part of the output entropy did not come through W from X).

So, maximising this mutual information is equivalent to maximising the joint output entropy,

H(y_1, \ldots, y_N) = H(y_1) + \cdots + H(y_N) - I(y_1, \ldots, y_N)

which is seen to be equivalent to minimising the mutual information between the outputs and hence the ui, as desired.

The Functions g(u)


The outputs yi are amplitude-bounded random variables, and so the marginal entropies H(yi) are maximum when the yi are uniformly distributed - a known statistical result.
With the H(yi) maximised and the yi uniformly distributed, I(y1, …, yN) = 0, and the nonlinearity gi(ui) has the form of the cumulative distribution function of the probability density function of the si - a proven result.
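
A quick numerical illustration of this result (my own example, with a Laplace source assumed): passing a variable through its own cumulative distribution function yields a uniformly distributed, maximum-entropy output.

import numpy as np

rng = np.random.default_rng(0)
s = rng.laplace(size=100_000)                            # a super-Gaussian-like source
F = 0.5 * (1 + np.sign(s) * (1 - np.exp(-np.abs(s))))    # the Laplace CDF applied to s

counts, _ = np.histogram(F, bins=10, range=(0.0, 1.0))
print(counts / counts.sum())                             # every bin close to 0.1: F(s) is roughly uniform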

Pause and review g(u) and W


W has to be chosen to maximise the joint output entropy H(Y), which minimises the mutual information between the estimated source signals, ui.
The g(u) should be the cumulative
distribution functions of the source signals, si.
Determining the g(u) is a major problem.

One input and one output


For a monotonic nonlinear function, g(x),

f_y(y) = \frac{f_x(x)}{\left| \partial y / \partial x \right|}

Also,

H(y) = -E[\ln f_y(y)] = -\int f_y(y) \ln f_y(y)\, dy

Substituting:

H(y) = E\left[ \ln \left| \frac{\partial y}{\partial x} \right| \right] - E[\ln f_x(x)]

We only need to maximise the first term; the second is independent of W.

A stochastic gradient ascent learning rule is adopted to maximise H(y) by assuming

\Delta w \propto \frac{\partial H}{\partial w} = \frac{\partial}{\partial w} \left( \ln \left| \frac{\partial y}{\partial x} \right| \right) = \left( \frac{\partial y}{\partial x} \right)^{-1} \frac{\partial}{\partial w} \left( \frac{\partial y}{\partial x} \right)

Further progress requires knowledge of g(u).
Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e.

y = \frac{1}{1 + e^{-u}}

Also assume u = wx + w_0.

Learning Rule: 1 input, 1 output


Hence, we find:

\frac{\partial y}{\partial x} = w\, y (1 - y)

\frac{\partial}{\partial w} \left( \frac{\partial y}{\partial x} \right) = y (1 - y) \left( 1 + w x (1 - 2y) \right)

from which:

\Delta w \propto \frac{1}{w} + x (1 - 2y), \quad \text{and} \quad \Delta w_0 \propto 1 - 2y
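
A minimal sketch of this single-unit rule in code (the Laplace data, learning rate and initial weights are my own assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(1)
x_data = rng.laplace(size=10_000)       # a super-Gaussian "measured" signal
w, w0, lr = 0.1, 0.0, 0.01              # initial weight, bias and learning rate (assumed)

for x in x_data:
    u = w * x + w0
    y = 1.0 / (1.0 + np.exp(-u))        # sigmoidal g(u)
    w += lr * (1.0 / w + x * (1.0 - 2.0 * y))
    w0 += lr * (1.0 - 2.0 * y)

print(w, w0)                            # w scales x so that y is roughly uniform on (0, 1)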

Learning Rule: N inputs, N outputs


Need

f_y(\mathbf{y}) = \frac{f_x(\mathbf{x})}{|J|}

where J is the Jacobian

J = \det \begin{pmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_N} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_N}{\partial x_1} & \cdots & \frac{\partial y_N}{\partial x_N} \end{pmatrix}

Assuming g(u) is sigmoidal again, we obtain:

\Delta \mathbf{W} \propto \left[ \mathbf{W}^T \right]^{-1} + (\mathbf{1} - 2\mathbf{y}) \mathbf{x}^T, \quad \text{and} \quad \Delta \mathbf{w}_0 \propto \mathbf{1} - 2\mathbf{y}

where 1 is the unit vector (a vector of ones).

The network is trained until the changes in the weights become acceptably small at each iteration.
Thus the unmixing matrix W is found.
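
A sketch of this sample-by-sample infomax update (the two-source Laplace mixture, rescaling and learning rate are my own assumptions, not the authors' implementation):

import numpy as np

rng = np.random.default_rng(2)
N, T = 2, 20_000
S = rng.laplace(size=(N, T))                     # super-Gaussian sources
X = rng.normal(size=(N, N)) @ S                  # unknown linear mixture
X /= X.std()                                     # crude rescaling to keep the updates stable

W, w0, lr = np.eye(N), np.zeros(N), 0.001
for t in range(T):
    x = X[:, t]
    u = W @ x + w0
    y = 1.0 / (1.0 + np.exp(-u))                 # sigmoidal g(u)
    W += lr * (np.linalg.inv(W.T) + np.outer(1.0 - 2.0 * y, x))
    w0 += lr * (1.0 - 2.0 * y)

U = W @ X
print(np.corrcoef(U))                            # activations should come out roughly uncorrelated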

The Natural Gradient

The computation of the inverse matrix [W^T]^{-1} is time-consuming, and may be avoided by rescaling the entropy gradient, multiplying it by W^T W.

Thus, for a sigmoidal g(u) we obtain

\Delta \mathbf{W} \propto \frac{\partial H(\mathbf{y})}{\partial \mathbf{W}} \mathbf{W}^T \mathbf{W} = \left[ \mathbf{I} + (\mathbf{1} - 2\mathbf{y}) \mathbf{u}^T \right] \mathbf{W}

This is the natural gradient, introduced by Amari (1998), and now widely adopted.
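
A mini-batch sketch of the natural-gradient form (batch size, learning rate and test mixture are my own assumptions):

import numpy as np

rng = np.random.default_rng(3)
N, T = 2, 20_000
S = rng.laplace(size=(N, T))                     # super-Gaussian sources
X = rng.normal(size=(N, N)) @ S                  # unknown linear mixture
X /= X.std()

W, lr, batch = np.eye(N), 0.01, 100
for start in range(0, T, batch):
    U = W @ X[:, start:start + batch]            # current activation estimates
    Y = 1.0 / (1.0 + np.exp(-U))
    grad = (batch * np.eye(N) + (1.0 - 2.0 * Y) @ U.T) @ W
    W += lr * grad / batch                       # batch-averaged update: [I + (1 - 2y) u^T] W

U = W @ X
print(np.corrcoef(U))                            # activations should come out roughly uncorrelated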

The nonlinearity, g(u)


We have already learnt that the g(u) should be the cumulative distribution functions of the individual source distributions.
So far the g(u) have been assumed to be sigmoidal, so what are the pdfs of the si?
The corresponding pdfs of the si are super-Gaussian.

Super- and sub-Gaussian pdfs


[Figure: P(si) versus si, comparing a Gaussian, a super-Gaussian, and a sub-Gaussian pdf.]

* Note: there are no mathematical definitions of super- and sub-Gaussians.

Super- and sub-Gaussians


Super-Gaussians: kurtosis (fourth-order central moment, measures the flatness of the pdf) > 0; infrequent signals of short duration, e.g. evoked brain signals.
Sub-Gaussians: kurtosis < 0; signals mainly "on", e.g. 50/60 Hz electrical mains supply, but also eye blinks.

Kurtosis

Kurtosis is based on the fourth-order central moment,

m_x(4) = E\left[ (x - m_x)^4 \right]

and for the zero-mean activations takes the (excess) form

\mathrm{kurt}(u_i) = E[u_i^4] - 3 \left( E[u_i^2] \right)^2

and is seen to be calculated from the current estimates of the source signals.
To separate the independent sources, information about their pdfs such as skewness (3rd moment) and flatness (kurtosis) is required.
First and 2nd moments (mean and variance) are insufficient.
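
A small sketch computing this excess kurtosis from sampled data (the test distributions are my own choices):

import numpy as np

def kurtosis(u):
    """Excess kurtosis kurt(u) = E[u^4] - 3 (E[u^2])^2 of the centred data."""
    u = np.asarray(u, dtype=float)
    u = u - u.mean()
    return float(np.mean(u**4) - 3.0 * np.mean(u**2)**2)

rng = np.random.default_rng(4)
print(kurtosis(rng.laplace(size=100_000)))     # > 0 : super-Gaussian
print(kurtosis(rng.normal(size=100_000)))      # ~ 0 : Gaussian
print(kurtosis(rng.uniform(-1, 1, 100_000)))   # < 0 : sub-Gaussian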

A more generalised learning rule


Girolami (1997) showed that tanh(ui) and -tanh(ui)
could be used for super- and sub-Gaussians
respectively.
Cardoso and Laheld (1996) developed a stability
analysis to determine whether the source signals
were to be considered super- or sub-Gaussian.
Lee, Girolami, and Sejnowski (1998) applied these
findings to develop their extended infomax
algorithm for super- and sub-Gaussians using a
kurtosis-based switching rule.

Extended Infomax Learning Rule


With super-Gaussians modelled as

p(u) \propto N(0, 1)\, \mathrm{sech}^2(u)

and sub-Gaussians as a Pearson mixture model

p(u) = \tfrac{1}{2} \left[ N(\mu, \sigma^2) + N(-\mu, \sigma^2) \right]

the new extended learning rule is

\Delta \mathbf{W} \propto \left[ \mathbf{I} - \mathbf{K} \tanh(\mathbf{u}) \mathbf{u}^T - \mathbf{u} \mathbf{u}^T \right] \mathbf{W}

k_i = +1: super-Gaussian,
k_i = -1: sub-Gaussian.

Switching Decision
\Delta \mathbf{W} \propto \left[ \mathbf{I} - \mathbf{K} \tanh(\mathbf{u}) \mathbf{u}^T - \mathbf{u} \mathbf{u}^T \right] \mathbf{W}, \quad k_i = +1 \text{ (super-Gaussian)}, \quad k_i = -1 \text{ (sub-Gaussian)}

where the ki are the elements of the N-dimensional diagonal matrix K, and

k_i = \mathrm{sign}\left( E[\mathrm{sech}^2(u_i)]\, E[u_i^2] - E[\tanh(u_i)\, u_i] \right)

Modifications of the formula for ki exist, but in our experience the extended algorithm has been unsatisfactory.
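
For concreteness, here is a minimal sketch of the extended update with the kurtosis-sign switching above (the two test sources, batch size and learning rate are my own assumptions, and this is not the authors' implementation):

import numpy as np

rng = np.random.default_rng(5)
T = 20_000
S = np.vstack([rng.laplace(size=T),                                     # super-Gaussian source
               np.sign(np.sin(2 * np.pi * 50 * np.arange(T) / 1000))])  # sub-Gaussian square wave
X = rng.normal(size=(2, 2)) @ S                                         # unknown mixture
X /= X.std()

W, lr, batch = np.eye(2), 0.01, 200
for start in range(0, T, batch):
    U = W @ X[:, start:start + batch]
    # Switching decision: k_i = sign( E[sech^2(u_i)] E[u_i^2] - E[tanh(u_i) u_i] )
    k = np.sign(np.mean(1 / np.cosh(U)**2, axis=1) * np.mean(U**2, axis=1)
                - np.mean(np.tanh(U) * U, axis=1))
    # Extended infomax update, averaged over the batch: [I - K tanh(u) u^T - u u^T] W
    grad = np.eye(2) - (k[:, None] * np.tanh(U)) @ U.T / batch - U @ U.T / batch
    W += lr * grad @ W

print(np.corrcoef(W @ X))        # off-diagonal terms small if the two sources were separated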

Reasons for unsatisfactory extended algorithm


1) Initial assumptions about super- and sub-Gaussian
distributions may be too inaccurate.
2) The switching criterion may be inadequate.

Alternatives
Postulate vague distributions for the source signals
which are then developed iteratively during training.
Use an alternative approach, e.g. a statistically based method such as JADE (Cardoso).

Summary so far
We have seen how W may be obtained by training
the network, and the extended algorithm for
switching between super- and sub-Gaussians has
been described.
Alternative approaches have been mentioned.
Next we consider how to obtain the source signals
knowing W and the measured signals, x.

Source signal determination


The system is:

[Block diagram: unknown sources si → mixing matrix A → measured signals xi → unmixing matrix W → activations ui ≈ si → nonlinearity g(u) → estimated outputs yi.]

Hence U = W·x and x = A·S, where A ≈ W⁻¹ and U ≈ S.
The rows of U are the estimated source signals, known as activations (as functions of time).
The rows of x are the time-varying measured signals.

Source Signals
U = W x, where U is M×N, W is M×M, and x is M×N:

\begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1N} \\ \vdots & & & \vdots \\ u_{M1} & u_{M2} & \cdots & u_{MN} \end{pmatrix}
=
\begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1M} \\ \vdots & & & \vdots \\ w_{M1} & w_{M2} & \cdots & w_{MM} \end{pmatrix}
\begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1N} \\ \vdots & & & \vdots \\ x_{M1} & x_{M2} & \cdots & x_{MN} \end{pmatrix}

Rows correspond to the channel number; columns to time (the sample number).

Expressions for the Activations


u_{11} = w_{11} x_{11} + w_{12} x_{21} + \cdots + w_{1M} x_{M1}
u_{12} = w_{11} x_{12} + w_{12} x_{22} + \cdots + w_{1M} x_{M2}

We see that consecutive values of u are obtained by filtering consecutive columns of x by the same row of W.

u_{21} = w_{21} x_{11} + w_{22} x_{21} + w_{23} x_{31} + \cdots + w_{2M} x_{M1}

The ith row of u is obtained by multiplying the ith row of W into the columns of x.
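
The same relation in code (a short sketch with assumed sizes and placeholder values for W):

import numpy as np

rng = np.random.default_rng(6)
M, N = 3, 1000                          # M channels, N time points (assumed)
W = rng.normal(size=(M, M))             # trained unmixing matrix (placeholder values)
x = rng.normal(size=(M, N))             # measured signals, one row per channel

u = W @ x                               # rows of u are the activations over time
print(np.allclose(u[1, 0], W[1, :] @ x[:, 0]))   # True: u_21 = w_21 x_11 + ... + w_2M x_M1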

Procedure
Record N time points from each of M sensors, where N ≥ 5M.
Pre-process the data, e.g. filtering, trend removal.
Sphere the data using Principal Components Analysis (PCA). This is not essential but speeds up the computation by first removing the first- and second-order moments.
Compute the ui ≈ si. Include desphering.
Analyse the results.
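
An end-to-end sketch of this procedure (sphering with PCA, natural-gradient training, then folding the sphering matrix back into W); the simulated data, learning rate and batch size are my own assumptions:

import numpy as np

rng = np.random.default_rng(7)
M, N = 2, 20_000                               # M sensors, N >= 5M time points
S = rng.laplace(size=(M, N))                   # unknown sources (simulated for the demo)
X = rng.normal(size=(M, M)) @ S                # measured signals

X = X - X.mean(axis=1, keepdims=True)          # pre-processing: remove the means
evals, E = np.linalg.eigh(np.cov(X))           # sphering (whitening) via PCA
Sph = E @ np.diag(evals ** -0.5) @ E.T         # sphering matrix
Z = Sph @ X                                    # sphered data: identity covariance

W, lr, batch = np.eye(M), 0.01, 100            # train W on the sphered data (natural gradient)
for start in range(0, N, batch):
    U = W @ Z[:, start:start + batch]
    Y = 1.0 / (1.0 + np.exp(-U))
    W += lr * (np.eye(M) + (1.0 - 2.0 * Y) @ U.T / batch) @ W

W_total = W @ Sph                              # desphering: fold the sphering matrix into W
U = W_total @ X                                # estimated sources, u_i ~ s_i (order and scale arbitrary)
print(np.round(np.abs(np.corrcoef(np.vstack([U, S]))[:M, M:]), 2))   # each row should show one strong correlation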

Optional Procedures I
The contribution of each activation at a sensor may be found by back-projecting it to the sensor:

x = W^{-1} s

For example, keeping only activation 2 and setting all the other activations to zero,

x = \begin{pmatrix} w^{-1}_{11} & w^{-1}_{12} & \cdots & w^{-1}_{1M} \\ w^{-1}_{21} & w^{-1}_{22} & \cdots & w^{-1}_{2M} \\ \vdots & & & \vdots \end{pmatrix}
\begin{pmatrix} 0 & 0 & 0 & \cdots & 0 \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2N} \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}

so that

x_{21} = w^{-1}_{21} \cdot 0 + w^{-1}_{22} s_{21} + \cdots + w^{-1}_{2M} \cdot 0 = w^{-1}_{22} s_{21}
x_{22} = w^{-1}_{22} s_{22}, \ldots
x_{2N} = w^{-1}_{22} s_{2N}
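
A short sketch of this back-projection (assumed sizes, with placeholder values for W and the activations): only activation 2 is kept and mapped back through W⁻¹ to give its contribution at every sensor.

import numpy as np

rng = np.random.default_rng(8)
M, N = 3, 1000
W = rng.normal(size=(M, M))             # trained unmixing matrix (placeholder values)
U = rng.normal(size=(M, N))             # activations, one row per estimated source

U_only2 = np.zeros_like(U)
U_only2[1, :] = U[1, :]                 # zero every activation except number 2

X_from2 = np.linalg.inv(W) @ U_only2    # contribution of activation 2 at each sensor
print(X_from2.shape)                    # (M, N): one back-projected trace per sensor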

Optional Procedures II
A measured signal which is contaminated by artefacts or
noise may be extracted by back-projecting all the signal
activations to the measurement electrode, setting other
activations to zero. (An artefact and noise removal method).
x = \begin{pmatrix} w^{-1}_{11} & w^{-1}_{12} & \cdots & w^{-1}_{1M} \\ w^{-1}_{21} & w^{-1}_{22} & \cdots & w^{-1}_{2M} \\ \vdots & & & \vdots \end{pmatrix}
\begin{pmatrix} s_{11} & s_{12} & s_{13} & \cdots & s_{1N} \\ s_{21} & s_{22} & s_{23} & \cdots & s_{2N} \\ 0 & 0 & 0 & \cdots & 0 \end{pmatrix}

so that

x_{21} = w^{-1}_{21} s_{11} + w^{-1}_{22} s_{21} + \cdots + w^{-1}_{2M} \cdot 0 = w^{-1}_{21} s_{11} + w^{-1}_{22} s_{21}
x_{22} = w^{-1}_{21} s_{12} + w^{-1}_{22} s_{22}, \ldots
x_{2N} = w^{-1}_{21} s_{1N} + w^{-1}_{22} s_{2N}

Current Developments
Overcomplete representations - more signal
sources than sensors.
Nonlinear mixing.
Nonstationary sources.
General formulation of g(u).

Conclusions
It has been shown how to extract temporally
independent unknown source signals from their
linear mixtures at the outputs of an unknown
system using Independent Components
Analysis.
Some of the limitations of the method have been
mentioned.
Current developments have been highlighted.
