$x_i\theta_i - G_i(\theta_i)\}$, where the function $G_i(\cdot)$ is the cumulant generating function, defined as
$$G_i(\theta_i) = \log \int_{\mathcal{X}_i} e^{\theta_i x_i}\,\nu_i(dx_i),$$
and where $\nu_i(\cdot)$ is a finite measure that generates the exponential family and $\mathcal{X}_i$ defines the space of the data component $x_i$. The gradient of the cumulant generating function is denoted by $g_i(\cdot)$ and is referred to as the link function [26], [27], [28].
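For concreteness, the following sketch (not part of the original development) checks these definitions numerically for the Bernoulli family, where $\nu_i$ is the counting measure on $\{0,1\}$, so that $G_i(\theta) = \log(1+e^{\theta})$ and the link function is the logistic sigmoid; the parameter value is illustrative.

```python
import math

# Bernoulli in natural form: p(x|theta) = exp(theta*x - G(theta)), x in {0,1},
# with nu the counting measure on {0,1}.
def G(theta):
    # G(theta) = log sum_{x in {0,1}} e^{theta*x} = log(1 + e^theta)
    return math.log(sum(math.exp(theta * x) for x in (0, 1)))

def g(theta):
    # link function: gradient of G, here the logistic sigmoid
    return 1.0 / (1.0 + math.exp(-theta))

theta = 0.7
# G matches the closed form log(1 + e^theta)
assert abs(G(theta) - math.log(1 + math.exp(theta))) < 1e-12
# g is the numerical derivative of G (and equals the mean of x)
h = 1e-6
assert abs((G(theta + h) - G(theta - h)) / (2 * h) - g(theta)) < 1e-6
```

The same check works for any one-parameter family once `G` is replaced by the corresponding cumulant generating function.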
Because both GLMs and the Generalized Latent Vari
able (GLV) methodologies exploit the linear structure (2),
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 3
Fig. 1. Graphical model for the GLS approach.
they can be viewed as special cases of a GLS approach to
data analysis. In the GLMs theory, V and b are known
and a is deterministic and unknown [3], [4]. In both
the Generalized Latent Variable theory described in [1],
[24], [2] and the random- and Mixed-effects Generalized
Linear Models (MGLMs) literature [29], [30], [23], [5], [4],
[31], [20], [8], $V$ and $b$ are deterministic while $a$ (and hence $\theta$) is treated as a random vector. The difference
between GLV and MGLMs is that in GLV, all of the
quantities V, b, and a are unknown, and hence need to
be identified, resulting in a so-called blind estimation
problem, whereas in MGLMs, V is a known matrix of
regressor variables and only the deterministic vector b
and the unknown realizations of the random effect vector
a (also known as latent variable) must be estimated. In
both GLV and MGLMs, it is assumed that the linear
relationship (2) holds in parameter space, and that the
tools of linear and statistical inverse theory are applicable or insightful, at least conceptually. The MGLMs
theory is a generalization of the classical theory of linear
regression, while the GLV theory is a generalization
of the classical theory of statistical factor analysis and
PCA. In both cases, the generalization is based on a
move from the data/description space containing the
measurement vector x to the parameter space containing
via a generally nonlinear transformation known as a
link function [3], [5], [4]. It is in the latter space that the
linear relationship (2) is assumed to hold.
Graphical models, also referred to as Bayesian Networks when their graph is directed, are a powerful
tool to encode and exploit the underlying statistical
structure of complex data sets [32]. The GLS framework
represents a subclass of graphical model techniques
and its corresponding graphical model is presented in
Fig. 1. It is equivalent to a computationally tractable
mixed exponential families data-type hierarchical Bayes
graphical model with latent variables constrained to a
low-dimensional parameter subspace.
Since $a$ (and hence $\theta$) is treated as a random vector (Bayesian approach), the (non-conditional) probability density function $p(x)$ requires a generally intractable integration over the parameters,
$$p(x) = \int p(x|\theta)\,\pi(\theta)\,d\theta = \int \prod_{i=1}^{d} p_i(x_i|\theta_i)\,\pi(\theta)\,d\theta, \quad (4)$$
where $\pi(\theta)$ is the probability density function of the parameter vector $\theta = aV + b$. Given the observation
matrix denoted by
$$X = \begin{bmatrix} x[1] \\ x[2] \\ \vdots \\ x[n] \end{bmatrix} = \begin{bmatrix} x_1[1] & \ldots & x_d[1] \\ x_1[2] & \ldots & x_d[2] \\ \vdots & & \vdots \\ x_1[n] & \ldots & x_d[n] \end{bmatrix} \in \mathbb{R}^{n \times d} \quad (5)$$
composed of n independent and identically distributed
statistical data samples, each assumed to be stochastically equivalent to the random row vector $x$, $x[k] = \big[x_1[k], \ldots, x_d[k]\big]$, the data likelihood is
$$p(X) = \prod_{k=1}^{n} p\big(x[k]\big) = \prod_{k=1}^{n} \int p\big(x[k]\big|\theta\big)\,\pi(\theta)\,d\theta \quad (6)$$
using the latent variable assumption, with $\theta = aV+b$. For specified exponential family densities $p_i(\cdot|\cdot)$, $i = 1, \ldots, d$, maximum likelihood identification of the model (4) corresponds to identifying $\pi(\theta)$, which, under the linear condition $\theta = aV + b$, corresponds to identifying the matrix $V$, the vector $b$, and a density function, $\pi(a)$, on the random effect $a$ via a maximization of the likelihood function $p(X)$ with respect to $V$, $b$, and $\pi(a)$. This is generally a quite difficult problem [5], [4], [20] and it is usually attacked using approximation methods which correspond to replacing the integrals in (4) and (6) by sums [23]:
$$p(x) = \sum_{l=1}^{m} p\big(x\big|\theta[l]\big)\,\pi_l = \sum_{l=1}^{m} \prod_{i=1}^{d} p_i\big(x_i\big|\theta_i[l]\big)\,\pi_l, \quad (7)$$
$$p(X) = \prod_{k=1}^{n} \sum_{l=1}^{m} p\big(x[k]\big|\theta[l]\big)\,\pi_{l,k} \quad (8)$$
over a finite number of discrete support points (atoms) $\theta[l]$. Equivalently,
$$\pi_l \triangleq \Pr\big(\theta = \theta[l]\big) = \Pr\big(a = a[l]\big), \qquad \pi_{l,k} \triangleq \Pr\big(\theta[k] = \theta[l]\big) = \Pr\big(a[k] = a[l]\big) = \pi_l,$$
the last equality resulting from the independent and identically distributed statistical samples assumption. Note that $\theta$ and $a$ are (discrete) random variables while $\theta[l]$ and $a[l]$ are deterministic support points, with $\theta[l] = a[l]V + b$. Requiring the matrix $V$ to have full row-rank results in the assumption that the relationship between the discrete values $\theta[l]$ and $a[l]$ is one-to-one. The density (7) is then
$$p(x) = \sum_{l=1}^{m} p\big(x\big|\theta[l]\big)\,\pi_l = \sum_{l=1}^{m} p\big(x\big|a[l]V+b\big)\,\pi_l, \quad (9)$$
and the data likelihood (8) is equal to
$$p(X) = \prod_{k=1}^{n} \sum_{l=1}^{m} p\big(x[k]\big|\theta[l]\big)\,\pi_l = \prod_{k=1}^{n} \sum_{l=1}^{m} p\big(x[k]\big|a[l]V+b\big)\,\pi_l. \quad (10)$$
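As a concrete illustration of the finite-mixture form of the data likelihood, the sketch below evaluates it for binary data under Bernoulli component densities; the atoms $a[l]$, weights $\pi_l$, and the values of $V$ and $b$ are arbitrary illustrative choices, not estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, q, m = 5, 4, 2, 3          # data points, dimensions, latent dim, atoms

X = rng.integers(0, 2, size=(n, d)).astype(float)   # binary observations
V = rng.normal(size=(q, d))                          # subspace matrix (assumed given)
b = rng.normal(size=d)                               # offset vector
A_atoms = rng.normal(size=(m, q))                    # support points a[l]
pi = np.full(m, 1.0 / m)                             # point-mass probabilities

def bernoulli_lik(x, theta):
    # p(x|theta) = prod_i exp(theta_i x_i - log(1 + e^{theta_i}))
    return np.exp(np.sum(theta * x - np.log1p(np.exp(theta))))

# p(X) ~= prod_k sum_l p(x[k] | a[l]V + b) pi_l
p_X = 1.0
for k in range(n):
    p_X *= sum(pi[l] * bernoulli_lik(X[k], A_atoms[l] @ V + b) for l in range(m))

assert 0.0 < p_X < 1.0
```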
The data likelihood is thus (approximately) the likelihood of a finite mixture of exponential family densities with unknown mixture proportions or point-mass probability estimates $\pi_l$ and unknown point-mass support points $\theta[l] = a[l]V + b$ in the $l$th mixture component [23]. In the mixture models literature, the point-mass probabilities $\pi_l$ are called mixing proportions or weights, the densities $p\big(x[k]\big|\theta[l]\big)$ are called the component densities of the mixture and equation (9) is referred to as the $m$-component finite mixture density [24]. The combined problem of Maximum Likelihood Estimation (MLE) of the parameters $V$, $b$, and the point-mass support points (atoms) $a[l]$ and probabilities $\pi_l$ for all $l$ (and hence the atoms $\theta[l] = a[l]V + b$) is the subject matter of GLV analysis [1], [24], [45], [2]. The commonly encountered alternative problem
of estimating $b$, $a[l]$ and $\pi_l$, $l = 1, \ldots, m$, for known $V$, where the elements of $V$ are comprised of measured regressor variables, is a generalization of classical linear regression and is the subject matter of the theory of MGLMs [3], [29], [23], [5], [4], [31], [20], [8].
However, a classical approach to the GLS estimation problem can also be considered, in which the vector $a$ (and hence $\theta$) is treated as a deterministic vector. Then, to each data point $x[k]$, $k = 1, \ldots, n$, corresponds a (generally distinct) parameter vector $\theta[k] = a[k]V+b$, and the data likelihood is
$$p(X) = \prod_{k=1}^{n} p\big(x[k]\big|\theta[k]\big) = \prod_{k=1}^{n} p\big(x[k]\big|a[k]V+b\big). \quad (11)$$
Contrary to the Bayesian approach, no point-mass probabilities have to be estimated. For consistency of vocabulary throughout this paper, the points $a[k]$, $k = 1, \ldots, n$, in the low-dimensional parameter subspace are called latent variables for both Bayesian and classical approaches. Similarly, the parameter points $\theta[k]$, $k = 1, \ldots, n$, are called atoms in both approaches. The classical approach can also be seen as an extreme case of the Bayesian approach for which the probability density function $\pi(\theta)$ is a delta function (one per data point) and the total number of atoms $m$ equals the number of data points $n$, i.e., $m = n$. Note that while the $m < n$ parameter points of the Bayesian approach are shared by all the data points, the classical approach assigns one parameter point to each data point (hence $m = n$). This extreme case is the approach followed in Section 3. Section 4 presents a general point of view and considers and compares both approaches ($m < n$ and $m = n$).
2.2 Approximately sufficient statistics
Interestingly, in this extreme case of delta point-mass probabilities it can be shown that the point-mass support points or latent variables are approximately sufficient statistics for decisions based on the data. Consider the MAP decision rule
$$\hat{c}_{\mathrm{MAP}} \triangleq \arg\max_{c \in \mathcal{C}} p(c|x) = \arg\max_{c \in \mathcal{C}} p(c, x). \quad (12)$$
Acknowledging that the GLS graphical model is similar to the Markov chain presented in Figure 2, we have:
$$p(c|x)\,p(x) = p(c, x) = \int p(c, x, a)\,da = \int p(c|x, a)\,p(x, a)\,da$$
and, because of the Markov chain structure,
$$= \int p(c|a)\,p(a|x)\,p(x)\,da$$
and, because each data point $x$ exactly corresponds to one support point $\hat{a} = \hat{a}(x)$,
$$= \int p(c|a)\,\delta(a - \hat{a})\,p(x)\,da = p(c|\hat{a})\,p(x).$$
Hence, $p(c|\hat{a}) \approx p(c|x)$ and $\hat{a}$ is an approximate sufficient statistic.
The goal of the work proposed and analyzed here is to fit an adequately faithful, class-conditional probability model of the form (10) for a Bayesian approach or (11) for a classical approach to labeled (when available) or unlabeled data to develop algorithms for making decisions on new measurements or future data. The family of models provided by (10) and (11), where the component densities are exponential family densities, is very flexible and can be used to model both labeled and unlabeled data cases.
3 LEARNING AT ONE EXTREME OF GLS
This section focuses on the estimation procedure of Generalized Linear Statistics framework components for the extreme case where the number of parameter points equals the number of data points, $m = n$. This extreme case is similar to the exponential family Principal Component Analysis (exponential PCA) technique proposed in [10]. We interpret it as a form of PCA performed in parameter space instead of data space as in classical PCA.
3.1 Problem description
The special Generalized Linear Statistics framework case presented in Section 2 where the number of parameter points equals the number of data points, i.e., $m = n$, is now solely considered. Hence, the point-mass probabilities do not need to be estimated and the EM algorithm is unnecessary. To each vector $x$ corresponds a single vector $\theta$. The proposed algorithm aims to identify the set of parameters $\big\{\theta[k]\big\}_{k=1}^{n}$, where each $\theta[k] = \sum_{j=1}^{q} a_j[k]\,v_j + b$ and where $v_j = [v_{j1}, v_{j2}, \ldots, v_{jd}]$. Hence, the matrix $V$ defined by
$$V = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_q \end{bmatrix} = \begin{bmatrix} v_{11} & \ldots & v_{1d} \\ v_{21} & \ldots & v_{2d} \\ \vdots & & \vdots \\ v_{q1} & \ldots & v_{qd} \end{bmatrix}$$
is $(q \times d)$ and identifies the low-dimensional parameter subspace. The latent variable matrix $A$ is $(n \times q)$ and represents the coordinates of each $\theta[k]$, $k = 1, \ldots, n$, in this lower dimensional subspace:
$$A = \begin{bmatrix} a[1] \\ a[2] \\ \vdots \\ a[n] \end{bmatrix} = \begin{bmatrix} a_1[1] & \ldots & a_q[1] \\ a_1[2] & \ldots & a_q[2] \\ \vdots & & \vdots \\ a_1[n] & \ldots & a_q[n] \end{bmatrix}.$$
Therefore, each
$$\theta[k] = a[k]V+b = \sum_{j=1}^{q} a_j[k]\,v_j + b, \quad (13)$$
with $b = [b_1, \ldots, b_d]$. The matrix $\Theta = AV + B$ is of the same dimensions as the observation matrix, namely $(n \times d)$, where the offset matrix $B$ is $(n \times d)$ and simply composed of $n$ identical rows $b$.
Assuming a maximum likelihood estimation path as traditionally used in the GLMs literature [3], finding the GLS framework components corresponds to the minimization of the loss function
$$L(A, V, b) = -\log p(X|\Theta), \quad (14)$$
subject to the constraint $\Theta = AV+B$.
In accordance with the GLS framework, the following assumptions are made:
(i) the independent and identically distributed statistical samples assumption: the samples $x[k]$, $k = 1, \ldots, n$, are drawn independently and identically;
(ii) the latent variable assumption: the components $x_i[k]$, $i = 1, \ldots, d$, are independent when conditioned on the random parameter vector $\theta[k]$, i.e., $p\big(x[k]\big|\theta[k]\big) = p_1\big(x_1[k]\big|\theta_1[k]\big) \cdots p_d\big(x_d[k]\big|\theta_d[k]\big)$ for all $k$, $k = 1, \ldots, n$;
(iii) the mixed or hybrid exponential family distributions assumption: each density function $p_i\big(x_i[k]\big|\theta_i[k]\big)$ is any one-parameter exponential family distribution with $\theta_i[k]$ taken to be the natural parameter of the exponential family density or some simple function of it. This assumption allows the rich and powerful theory of exponential family distributions to be fruitfully utilized.
The marginal densities $p_i(\cdot|\cdot)$ can all be different, allowing for the possibility of $x[k]$ containing continuous and discrete valued components. Consequently, the loss function in equation (14) becomes:
$$L(A, V, b) = -\sum_{k=1}^{n} \sum_{i=1}^{d} \log p_i\big(x_i[k]\big|\theta_i[k]\big). \quad (15)$$
By exploiting the previously stated latent variable assumption, it can be shown that if $p_i(\cdot|\cdot)$ is the 1-dimensional conditional distribution of the component $x_i[k]$, $i = 1, \ldots, d$, of the data point $x[k]$ given the parameter component $\theta_i[k]$, then the vector $x[k]$ follows a $d$-dimensional conditional exponential family distribution $p(\cdot|\cdot)$ given the vector parameter $\theta[k]$.
Proof: Considering the most general case, each component $x_i$ for $i = 1, \ldots, d$ is assumed to be exponentially distributed according to the distribution $p_i$ with parameter $\theta_i$, and the components are independent when conditioned on their parameters. Following the definition of standard exponential families, a finite measure $\nu_i$ is assumed for each distribution, $i = 1, \ldots, d$. Let $\nu = (\nu_1, \ldots, \nu_d)$ with $\nu(dx) = \nu_1(dx_1)\,\nu_2(dx_2) \cdots \nu_d(dx_d)$ be the finite measure in the $d$-dimensional data space. The 1-dimensional distribution can be written as $p_i(x_i|\theta_i) = \exp\big\{\theta_i x_i - G_i(\theta_i)\big\}$. Based on the latent variable assumption, the distribution of the vector $x$ can be written as follows:
$$p(x|\theta) = \prod_{i=1}^{d} p_i(x_i|\theta_i) = \prod_{i=1}^{d} e^{\theta_i x_i - G_i(\theta_i)} = e^{\theta x^T - \sum_{i=1}^{d} G_i(\theta_i)},$$
where, by definition of an exponential family distribution and using Fubini's theorem [46],
$$\sum_{i=1}^{d} G_i(\theta_i) = \sum_{i=1}^{d} \log \int_{\mathcal{X}_i} e^{\theta_i x_i}\,\nu_i(dx_i) = \log \prod_{i=1}^{d} \int_{\mathcal{X}_i} e^{\theta_i x_i}\,\nu_i(dx_i)$$
$$= \log \int_{\mathcal{X}_1} \cdots \int_{\mathcal{X}_d} e^{\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_d x_d}\,\nu_1(dx_1) \cdots \nu_d(dx_d)$$
$$= \log \int_{\mathcal{X}} e^{\theta x^T}\,\nu(dx) = G(\theta), \quad (16)$$
where $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_d$ defines the (product) space of the $d$-dimensional vector $x$. As a result, $G(\theta) = \sum_{i=1}^{d} G_i(\theta_i)$ is the cumulant generating function of the multivariate exponential family distribution $p(x|\theta) = \prod_{i=1}^{d} p_i(x_i|\theta_i)$.
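This identity can be checked numerically. The sketch below (an illustration, not part of the original development) pairs one Bernoulli component with one Poisson component in natural parameterization (base measure $1/x!$ on the nonnegative integers) and compares the brute-force log-normalizer of the product density with $G_1(\theta_1) + G_2(\theta_2)$; the parameter values are illustrative.

```python
import math

# one Bernoulli component (counting measure on {0,1}) and one Poisson
# component (base measure nu(dx) = 1/x! on the nonnegative integers)
G1 = lambda t: math.log(1 + math.exp(t))   # Bernoulli cumulant generating function
G2 = lambda t: math.exp(t)                 # Poisson cumulant generating function

t1, t2 = 0.3, -0.5

# normalizer of the product density computed by brute-force summation
Z = sum(
    math.exp(t1 * x1) * math.exp(t2 * x2) / math.factorial(x2)
    for x1 in (0, 1)
    for x2 in range(60)   # truncation of the Poisson support; tail is negligible
)

# log Z = G(theta) = G1(theta_1) + G2(theta_2)
assert abs(math.log(Z) - (G1(t1) + G2(t2))) < 1e-9
```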
3.2 Estimation procedures for a single exponential
family
First, the case of a single common exponential family, i.e., $p_i(\cdot|\cdot) = p(\cdot|\cdot)$ for all $i = 1, \ldots, d$, is considered.
3.2.1 Loss function and convexity
The loss function is given by equation (15). Using the definition of an exponential family distribution, the loss function can be written as:
$$L(A, V, b) = \sum_{k=1}^{n} \sum_{i=1}^{d} \Big( G\big(\theta_i[k]\big) - x_i[k]\,\theta_i[k] \Big) = \sum_{k=1}^{n} \Big( G\big(a[k]V+b\big) - \big(a[k]V+b\big)\,x[k]^T \Big). \quad (17)$$
Alternatively, it can be shown that the negative log-likelihood of the density of an exponential family distribution $p(x_i[k]|\theta_i[k])$ can be expressed through a Bregman distance $B_F(\cdot\|\cdot)$:
$$-\log p(x_i[k]|\theta_i[k]) = B_F\Big( x_i[k] \,\Big\|\, g(\theta_i[k]) \Big) - F\big(x_i[k]\big),$$
where $F(\cdot)$ is the Fenchel conjugate of the cumulant generating function $G(\cdot)$ [28]. Then, the loss function in (15) becomes:
$$L(A, V, b) = \sum_{k=1}^{n} \sum_{i=1}^{d} \Big[ B_F\Big( x_i[k] \,\Big\|\, g(\theta_i[k]) \Big) - F\big(x_i[k]\big) \Big] = \sum_{k=1}^{n} \sum_{i=1}^{d} B_F\Big( x_i[k] \,\Big\|\, g(\theta_i[k]) \Big) - \sum_{k=1}^{n} \sum_{i=1}^{d} F\big(x_i[k]\big),$$
where $\theta_i[k] = \sum_{j=1}^{q} a_j[k]\,v_{ji} + b_i$. The last term in the above equation does not depend on $A$, $V$, or $b$, resulting in the following minimization problem:
$$\arg\min_{A,V,b} L(A, V, b) = \arg\min_{A,V,b} \sum_{k=1}^{n} \sum_{i=1}^{d} B_F\Big( x_i[k] \,\Big\|\, g(\theta_i[k]) \Big) = \arg\min_{A,V,b} \sum_{k=1}^{n} \sum_{i=1}^{d} B_F\bigg( x_i[k] \,\bigg\|\, g\Big( \sum_{j=1}^{q} a_j[k]\,v_{ji} + b_i \Big) \bigg). \quad (18)$$
It can be shown that the loss function is convex in either of its arguments with the others fixed. Indeed, the dual divergences property of the Bregman distance implies that, if $G\big(\theta_i[k]\big)$ is strictly convex, then
$$B_F\Big( x_i[k] \,\Big\|\, g(\theta_i[k]) \Big) = B_G\Big( f\big(g(\theta_i[k])\big) \,\Big\|\, f\big(x_i[k]\big) \Big) = B_G\Big( \theta_i[k] \,\Big\|\, f(x_i[k]) \Big), \quad (19)$$
since the function $f(\cdot)$ is the inverse of the link function $g(\cdot)$. The fact that $f(\cdot)$ and $g(\cdot)$ are inverse functions of each other is easily shown. Using equation (19) in equation (18), the minimization problem becomes, if $G\big(\theta_i[k]\big)$ is strictly convex:
$$\arg\min_{A,V,b} L(A, V, b) = \arg\min_{A,V,b} \sum_{k=1}^{n} \sum_{i=1}^{d} B_G\Big( \theta_i[k] \,\Big\|\, f(x_i[k]) \Big) = \arg\min_{A,V,b} \sum_{k=1}^{n} \sum_{i=1}^{d} B_G\bigg( \sum_{j=1}^{q} a_j[k]\,v_{ji} + b_i \,\bigg\|\, f(x_i[k]) \bigg).$$
This is a critical step of GLS because the minimization problem is moved from data space fully into parameter space.
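The duality (19) can be verified numerically. The sketch below uses the Bernoulli family, for which $G(\theta) = \log(1+e^{\theta})$, $g$ is the logistic sigmoid, $F$ is the negative Bernoulli entropy, and $f$ is the logit; the evaluation point $x = 0.3$ is an illustrative choice taken in the open unit interval so that $f(x)$ is finite.

```python
import math

# Bernoulli family: G(theta) = log(1+e^theta); link g = sigmoid;
# F = Fenchel conjugate of G (negative Bernoulli entropy); f = logit = g^{-1}.
G = lambda t: math.log(1 + math.exp(t))
g = lambda t: 1 / (1 + math.exp(-t))
F = lambda m: m * math.log(m) + (1 - m) * math.log(1 - m)
f = lambda m: math.log(m / (1 - m))

def bregman(phi, dphi, p, q_):
    # B_phi(p || q) = phi(p) - phi(q) - dphi(q) (p - q)
    return phi(p) - phi(q_) - dphi(q_) * (p - q_)

x, theta = 0.3, 0.8
lhs = bregman(F, f, x, g(theta))    # B_F(x || g(theta))
rhs = bregman(G, g, theta, f(x))    # B_G(theta || f(x))
assert abs(lhs - rhs) < 1e-9        # the two divergences agree, cf. (19)
```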
It is well-known that $G\big(\theta_i[k]\big)$ is a convex function of $\theta_i[k]$ and strictly convex if the exponential family is minimal [27]. The convexity property of Bregman distances states that $B_G(\cdot\|\cdot)$ is always convex in the first argument, resulting in the fact that $B_G\big( \theta_i[k] \,\big\|\, f(x_i[k]) \big)$ is convex in $\theta_i[k]$ for all $k = 1, \ldots, n$ and $i = 1, \ldots, d$. Then, since $\theta_i[k] = \sum_{j=1}^{q} a_j[k]\,v_{ji} + b_i$ is an affine relationship in either $a_j[k]$, $j = 1, \ldots, q$, $v_{ji}$, $j = 1, \ldots, q$, or $b_i$ with the others fixed, for all $k = 1, \ldots, n$ and $i = 1, \ldots, d$, the Bregman distance $B_G\big( \sum_{j=1}^{q} a_j[k]\,v_{ji}+b_i \,\big\|\, f(x_i[k]) \big)$ is convex in any of the three arguments when the other two are fixed. Therefore, as a sum of convex functions, the loss function is convex in each of its arguments with the others fixed, i.e., the loss function is convex in each of $A$, $V$, and $b$ separately.
Since the loss function is convex in either of its arguments with the others fixed, its minimization can be attacked by using an iterative approach. The first step, given a fixed matrix $V$ and a fixed vector $b$, is to obtain the matrix $A$, or the set of vectors $a[k]$ for $k = 1, \ldots, n$, that minimizes the loss function given by equation (17). The second step, given a fixed matrix $A$ and a fixed vector $b$, is to obtain the matrix $V$, or the set of vectors $v_j$ for $j = 1, \ldots, q$, that minimizes the loss function. The last step, given a fixed matrix $A$ and a fixed matrix $V$, is to obtain the vector $b$ that minimizes the loss function.
3.2.2 Iterative minimization of the loss function
The classical Newton-Raphson method is used for the iterative minimization of the loss function (17).
The first step in the $(t+1)$st iteration consists of the update $A^{(t+1)} = \arg\min_A L(A, V^{(t)}, b^{(t)})$, with $A^{(t)}$, $V^{(t)}$ and $b^{(t)}$ being the updates obtained at the end of the $t$th iteration. It then requires the computation of the gradient vector $\nabla_a l\big(a[k]\big)$ and the Hessian matrix $\nabla_a^2 l\big(a[k]\big)$ of the loss function $l\big(a[k]\big)$ with respect to the vector $a[k]$, for all $k = 1, \ldots, n$, where $l\big(a[k]\big) = l\big(a[k], V^{(t)}, b^{(t)}\big)$ collects the elements of the loss function $L(A, V^{(t)}, b^{(t)})$ that depend only on the vector $a[k]$:
$$l\big(a[k]\big) = G\big(a[k]V^{(t)}+b^{(t)}\big) - \big(a[k]V^{(t)}+b^{(t)}\big)\,x[k]^T \quad (20)$$
$$= \sum_{i=1}^{d} \bigg[ G\Big( \sum_{j=1}^{q} a_j[k]\,v_{ji}^{(t)} + b_i^{(t)} \Big) - x_i[k] \Big( \sum_{j=1}^{q} a_j[k]\,v_{ji}^{(t)} + b_i^{(t)} \Big) \bigg].$$
The gradient vector of the loss function $l\big(a[k]\big)$ with respect to the vector $a[k]$ is
$$\nabla_a l\big(a[k]\big) = \frac{\partial l\big(a[k]\big)}{\partial a[k]} = V^{(t)} \Big( \nabla G\big(a[k]V^{(t)}+b^{(t)}\big) - x[k]^T \Big),$$
where, for $\theta[k] = a[k]V^{(t)}+b^{(t)}$,
$$\nabla G\big(a[k]V^{(t)}+b^{(t)}\big) = \frac{\partial G\big(\theta[k]\big)}{\partial \theta[k]} \quad \text{and} \quad \frac{\partial \theta[k]}{\partial a[k]} = V^{(t)}.$$
Here, the following convention for derivatives with respect to a row vector is used: for $a[k]$, a $(1 \times q)$ vector, and $l(\cdot)$, a scalar function of $a[k]$, $\partial l\big(a[k]\big)/\partial a[k] = \big[ \partial l\big(a[k]\big)/\partial a_1[k], \ldots, \partial l\big(a[k]\big)/\partial a_q[k] \big]^T$ is a $(q \times 1)$ vector. Similarly, for $\theta[k]$, a $(1 \times d)$ vector, and $G(\cdot)$, a scalar function of $\theta[k]$, $\partial G\big(\theta[k]\big)/\partial \theta[k]$ is a $(d \times 1)$ vector as follows: $\partial G\big(\theta[k]\big)/\partial \theta[k] = \big[ \partial G\big(\theta[k]\big)/\partial \theta_1[k], \ldots, \partial G\big(\theta[k]\big)/\partial \theta_d[k] \big]^T$. Then,
$$\nabla G\big(a[k]V^{(t)}+b^{(t)}\big) = \bigg[ \frac{\partial G\big(\theta[k]\big)}{\partial \theta_1[k]}, \ldots, \frac{\partial G\big(\theta[k]\big)}{\partial \theta_d[k]} \bigg]^T = \bigg[ \frac{\partial}{\partial \theta_1[k]} \sum_{i=1}^{d} G\big(\theta_i[k]\big), \ldots, \frac{\partial}{\partial \theta_d[k]} \sum_{i=1}^{d} G\big(\theta_i[k]\big) \bigg]^T$$
for $\theta[k] = a[k]V^{(t)}+b^{(t)}$ and using equation (16)
$$= \Big[ g\big(\theta_1[k]\big), \ldots, g\big(\theta_d[k]\big) \Big]^T,$$
where $g\big(\theta_i[k]\big) = \partial G\big(\theta_i[k]\big)/\partial \theta_i[k]$. The Hessian matrix of the loss function with respect to the vector $a[k]$ is given by
$$\nabla_a^2 l\big(a[k]\big) = \frac{\partial^2 l\big(a[k]\big)}{\partial a[k]^2} = V^{(t)}\, \nabla^2 G\big(a[k]V^{(t)}+b^{(t)}\big)\, V^{(t),T},$$
where
$$\nabla^2 G\big(a[k]V^{(t)}+b^{(t)}\big) = \begin{bmatrix} \frac{\partial^2 G(\theta[k])}{\partial \theta_1[k]^2} & \cdots & \frac{\partial^2 G(\theta[k])}{\partial \theta_d[k]\,\partial \theta_1[k]} \\ \vdots & & \vdots \\ \frac{\partial^2 G(\theta[k])}{\partial \theta_1[k]\,\partial \theta_d[k]} & \cdots & \frac{\partial^2 G(\theta[k])}{\partial \theta_d[k]^2} \end{bmatrix}.$$
Furthermore,
$$\frac{\partial^2 G\big(\theta[k]\big)}{\partial \theta_r[k]\,\partial \theta_s[k]} = \frac{\partial^2}{\partial \theta_r[k]\,\partial \theta_s[k]} \sum_{i=1}^{d} G\big(\theta_i[k]\big) = \frac{\partial}{\partial \theta_s[k]}\, g\big(\theta_r[k]\big),$$
so that
$$\frac{\partial^2 G\big(\theta[k]\big)}{\partial \theta_r[k]\,\partial \theta_s[k]} = \begin{cases} 0 & \text{if } r \neq s, \\ \partial g\big(\theta_r[k]\big)/\partial \theta_r[k] & \text{if } r = s. \end{cases}$$
As a result,
$$\nabla^2 G\big(a[k]V^{(t)}+b^{(t)}\big) = \begin{bmatrix} \frac{\partial g(\theta_1[k])}{\partial \theta_1[k]} & & 0 \\ & \ddots & \\ 0 & & \frac{\partial g(\theta_d[k])}{\partial \theta_d[k]} \end{bmatrix},$$
i.e., $\nabla^2 G\big(a[k]V^{(t)}+b^{(t)}\big)$ is a $(d \times d)$ diagonal matrix.
The Newton-Raphson technique simply solves the minimization problem $\arg\min_a l\big(a[k], V^{(t)}, b^{(t)}\big)$ at iteration $(t + 1)$ by using the update
$$a^{(t+1)}[k] = a^{(t)}[k] - \eta_a^{(t+1)} \Big( \nabla_a^2 l\big(a^{(t)}[k], V^{(t)}, b^{(t)}\big) \Big)^{-1} \nabla_a l\big(a^{(t)}[k], V^{(t)}, b^{(t)}\big),$$
where $\nabla l(\cdot)$ is the gradient of the function $l(\cdot)$, $\nabla^2 l(\cdot)$ its Hessian matrix and $\eta_a^{(t+1)}$ the so-called step size. It yields the following update equation for the set of vectors $a^{(t+1)}[k]$ at iteration $(t + 1)$ for $k = 1, \ldots, n$:
$$a^{(t+1)}[k]^T = a^{(t)}[k]^T - \eta_a^{(t+1)} \Big( V^{(t)}\, \nabla^2 G\big(a^{(t)}[k]V^{(t)}+b^{(t)}\big)\, V^{(t),T} \Big)^{-1} \Big( V^{(t)} \big( \nabla G\big(a^{(t)}[k]V^{(t)}+b^{(t)}\big) - x[k]^T \big) \Big). \quad (21)$$
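The gradient $V(\nabla G - x[k]^T)$ and Hessian $V \nabla^2 G\, V^{T}$ entering (21) can be checked against finite differences. The sketch below does so for a Bernoulli model; the values of $V$, $b$, and $x$ are illustrative, and the final check confirms only that the resulting Newton direction is a descent direction.

```python
import numpy as np

rng = np.random.default_rng(1)
q, d = 2, 6
V = rng.normal(size=(q, d))                     # illustrative subspace matrix
b = rng.normal(size=d)                          # illustrative offset
x = rng.integers(0, 2, size=d).astype(float)    # one binary data point
a = rng.normal(size=q)                          # current latent vector a[k]

def loss(a):
    theta = a @ V + b
    # l(a[k]) = sum_i G(theta_i) - x_i theta_i, with G(theta) = log(1 + e^theta)
    return np.sum(np.log1p(np.exp(theta)) - x * theta)

theta = a @ V + b
mu = 1 / (1 + np.exp(-theta))               # link g(theta_i), the gradient of G
grad = V @ (mu - x)                          # V (nabla G - x[k]^T)
hess = V @ np.diag(mu * (1 - mu)) @ V.T      # V nabla^2 G V^T (diagonal inner matrix)

# the analytic gradient matches central finite differences
h = 1e-5
num_grad = np.array([(loss(a + h * e) - loss(a - h * e)) / (2 * h)
                     for e in np.eye(q)])
assert np.allclose(grad, num_grad, atol=1e-4)

# the Newton direction is a descent direction (the Hessian is positive definite)
step = -np.linalg.solve(hess, grad)
assert grad @ step < 0
```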
The second step in the iterative minimization method consists of the update $V^{(t+1)} = \arg\min_V L\big(A^{(t+1)}, V, b^{(t)}\big)$. It requires the computation of the gradient vector $\nabla_v l\big(v_j\big)$ and the Hessian matrix $\nabla_v^2 l\big(v_j\big)$ of the loss function $l\big(v_j\big)$ with respect to the vector $v_j$, for all $j = 1, \ldots, q$, where $l\big(v_j\big) = l\big(A^{(t+1)}, \{v_j\}_{j=1}^q, b^{(t)}\big)$ collects the elements of the loss function $L\big(A^{(t+1)}, V, b^{(t)}\big)$ that depend only on the vector $v_j$:
$$l(v_j) = \sum_{k=1}^{n} \bigg[ G\Big( \sum_{r=1}^{q} a_r^{(t+1)}[k]\,v_r + b^{(t)} \Big) - \Big( \sum_{r=1}^{q} a_r^{(t+1)}[k]\,v_r + b^{(t)} \Big)\, x[k]^T \bigg],$$
$$\nabla_v l(v_j) = \frac{\partial l(v_j)}{\partial v_j} = \sum_{k=1}^{n} a_j^{(t+1)}[k] \Big( \nabla G\big(a^{(t+1)}[k]V+b^{(t)}\big) - x[k]^T \Big),$$
$$\nabla_v^2 l(v_j) = \frac{\partial^2 l(v_j)}{\partial v_j^2} = \sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2\, \nabla^2 G\big(a^{(t+1)}[k]V+b^{(t)}\big).$$
Then, the update equation is given as follows for $j = 1, \ldots, q$:
$$v_j^{(t+1),T} = v_j^{(t),T} - \eta_v^{(t+1)} \bigg( \sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2\, \nabla^2 G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) \bigg)^{-1} \bigg( \sum_{k=1}^{n} a_j^{(t+1)}[k] \Big( \nabla G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) - x[k]^T \Big) \bigg), \quad (22)$$
where
$$\sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2\, \nabla^2 G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) = \begin{bmatrix} \sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2\, \frac{\partial g(\theta_1[k])}{\partial \theta_1[k]} & & 0 \\ & \ddots & \\ 0 & & \sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2\, \frac{\partial g(\theta_d[k])}{\partial \theta_d[k]} \end{bmatrix}$$
for $\theta[k] = a^{(t+1)}[k]V^{(t)}+b^{(t)}$.
The last step in the iterative minimization method consists of the update $b^{(t+1)} = \arg\min_b L\big(A^{(t+1)}, V^{(t+1)}, b\big)$. It requires the computation of the gradient vector $\nabla_b l(b)$ and the Hessian matrix $\nabla_b^2 l(b)$ of the loss function $l(b)$ with respect to the offset vector $b$, where $l(b) = l\big(A^{(t+1)}, V^{(t+1)}, b\big)$ collects the elements of the loss function $L\big(A^{(t+1)}, V^{(t+1)}, b\big)$ that depend only on the vector $b$:
$$l(b) = \sum_{k=1}^{n} \Big[ G\big(a^{(t+1)}[k]V^{(t+1)}+b\big) - \big(a^{(t+1)}[k]V^{(t+1)}+b\big)\, x[k]^T \Big],$$
$$\nabla_b l(b) = \frac{\partial l(b)}{\partial b} = \sum_{k=1}^{n} \Big( \nabla G\big(a^{(t+1)}[k]V^{(t+1)}+b\big) - x[k]^T \Big),$$
$$\nabla_b^2 l(b) = \frac{\partial^2 l(b)}{\partial b^2} = \sum_{k=1}^{n} \nabla^2 G\big(a^{(t+1)}[k]V^{(t+1)}+b\big).$$
Then, the update equation is given as follows:
$$b^{(t+1),T} = b^{(t),T} - \eta_b^{(t+1)} \bigg( \sum_{k=1}^{n} \nabla^2 G\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) \bigg)^{-1} \bigg( \sum_{k=1}^{n} \Big( \nabla G\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) - x[k]^T \Big) \bigg), \quad (23)$$
where
$$\nabla^2 G\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) = \begin{bmatrix} \frac{\partial g(\theta_1[k])}{\partial \theta_1[k]} & & 0 \\ & \ddots & \\ 0 & & \frac{\partial g(\theta_d[k])}{\partial \theta_d[k]} \end{bmatrix}$$
for $\theta[k] = a^{(t+1)}[k]V^{(t+1)}+b^{(t)}$.
Note that only the cumulant generating function G()
needs to be changed in order to get an algorithm for
a loss function involving a new exponential family.
This is pertinent since the cumulant generating function
uniquely defines the exponential family [27].
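Putting the three steps together, the following compact sketch runs the alternating procedure on synthetic Bernoulli data. It uses Newton directions for the $a$-step as in (21) but, for brevity, plain gradient directions for the $V$- and $b$-steps, and a simple backtracking rule in place of a fixed step size; these simplifications, as well as the sizes and initialization, are illustrative choices rather than the exact procedure of the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, q = 30, 8, 2
X = rng.integers(0, 2, size=(n, d)).astype(float)   # synthetic binary data

A = 0.01 * rng.normal(size=(n, q))   # latent variable matrix
V = 0.01 * rng.normal(size=(q, d))   # subspace matrix
b = np.zeros(d)                      # offset vector

def total_loss(A, V, b):
    Theta = A @ V + b                                   # Theta = AV + B
    return np.sum(np.log1p(np.exp(Theta)) - X * Theta)  # Bernoulli loss, cf. (17)

def damped_step(value, direction, loss_fn):
    # backtracking: shrink the step until the loss does not increase
    eta, base = 1.0, loss_fn(value)
    for _ in range(30):
        trial = value - eta * direction
        if loss_fn(trial) <= base:
            return trial
        eta *= 0.5
    return value

loss0 = total_loss(A, V, b)
for _ in range(10):
    # a-step: one damped Newton direction per row a[k], cf. (21)
    for k in range(n):
        mu = 1 / (1 + np.exp(-(A[k] @ V + b)))
        H = V @ np.diag(mu * (1 - mu)) @ V.T + 1e-8 * np.eye(q)
        A[k] = damped_step(A[k], np.linalg.solve(H, V @ (mu - X[k])),
                           lambda a_: total_loss(np.vstack([A[:k], a_[None], A[k+1:]]), V, b))
    # V-step and b-step: damped gradient directions (a simplification of (22)-(23))
    Mu = 1 / (1 + np.exp(-(A @ V + b)))
    V = damped_step(V, 0.1 * (A.T @ (Mu - X)), lambda V_: total_loss(A, V_, b))
    Mu = 1 / (1 + np.exp(-(A @ V + b)))
    b = damped_step(b, 0.1 * (Mu - X).sum(axis=0), lambda b_: total_loss(A, V, b_))

assert total_loss(A, V, b) <= loss0 + 1e-9   # monotone by construction of damped_step
```

Because every accepted step is non-increasing, the final loss is guaranteed not to exceed the initial one; swapping in a different cumulant generating function changes only `total_loss` and the `mu` expressions.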
3.2.3 Penalty function approach
As noted in [10], it is possible for the atoms obtained with the extreme GLS case corresponding to exponential PCA to diverge since the optimum may be at infinity. To avoid such behavior, we introduce a penalty function that defines and places a set of constraints into the loss function via a penalty parameter in a way that penalizes any divergence to infinity.
The penalty function approach is used to convert
the nonlinear programming problem with equality and
inequality constraints into an unconstrained problem,
or into a problem with simple constraints [47], [48],
[49]. This transformation is accomplished by defining an appropriate auxiliary function in terms of the problem functions to define a new objective or loss function.
The penalty function is defined as follows for $\theta = [\theta_1, \ldots, \theta_d]$:
$$\varphi(\theta) = \sum_{i=1}^{d} \Big[ \exp\big( -\lambda_{\min}(\theta_i - \theta_{\min}) \big) + \exp\big( \lambda_{\max}(\theta_i - \theta_{\max}) \big) \Big], \quad (24)$$
and was designed so that $\varphi(\theta)$ remains small for $\theta_{\min} \leq \theta_i \leq \theta_{\max}$, $i = 1, \ldots, d$, and reaches infinity otherwise. Figure 3 shows possible shapes for the penalty function depending on the parameters $\lambda_{\min}$ and $\lambda_{\max}$ values.
The loss function becomes:
$$L(A, V, b) = \sum_{k=1}^{n} \Big[ B_F\Big( x[k] \,\Big\|\, g\big(a[k]V+b\big) \Big) + c\,\varphi\big(a[k]V+b\big) \Big] \quad (25)$$
instead of
$$L(A, V, b) = \sum_{k=1}^{n} B_F\Big( x[k] \,\Big\|\, g\big(a[k]V+b\big) \Big),$$
where the scalar $c$ is called the penalty parameter.
Fig. 3. Sketches for a possible penalty function ($\theta_{\min} = -5$, $\theta_{\max} = 5$): solid line for penalty function parameters $\lambda_{\min} = \lambda_{\max} = 1$, dashed line for parameters $\lambda_{\min} = \lambda_{\max} = 10$ and dash-dot line for $\lambda_{\min} = \lambda_{\max} = 0.5$.
The previously developed iterative minimization algorithm is then used on the penalized loss function (25). The first step yields the update
$$a^{(t+1)}[k]^T = a^{(t)}[k]^T - \eta_a^{(t+1)} \Big[ V^{(t)} \Big( \nabla^2 G\big(a^{(t)}[k]V^{(t)}+b^{(t)}\big) + c\, \nabla^2 \varphi\big(a^{(t)}[k]V^{(t)}+b^{(t)}\big) \Big) V^{(t),T} \Big]^{-1} \Big[ V^{(t)} \Big( \nabla G\big(a^{(t)}[k]V^{(t)}+b^{(t)}\big) - x[k]^T + c\, \nabla \varphi\big(a^{(t)}[k]V^{(t)}+b^{(t)}\big) \Big) \Big].$$
Note that, for $\theta[k] = a[k]V^{(t)} + b^{(t)}$, the gradient of the penalty function is given by:
$$\nabla \varphi\big(a[k]V^{(t)}+b^{(t)}\big) = \frac{\partial \varphi\big(\theta[k]\big)}{\partial \theta[k]} = \bigg[ \frac{\partial \varphi\big(\theta[k]\big)}{\partial \theta_1[k]}, \ldots, \frac{\partial \varphi\big(\theta[k]\big)}{\partial \theta_d[k]} \bigg]^T,$$
where, for $i = 1, \ldots, d$ and $k = 1, \ldots, n$,
$$\frac{\partial \varphi\big(\theta[k]\big)}{\partial \theta_i[k]} = \frac{\partial}{\partial \theta_i[k]} \sum_{r=1}^{d} \exp\big( -\lambda_{\min}(\theta_r[k] - \theta_{\min}) \big) + \frac{\partial}{\partial \theta_i[k]} \sum_{r=1}^{d} \exp\big( \lambda_{\max}(\theta_r[k] - \theta_{\max}) \big)$$
$$= -\lambda_{\min} \exp\big( -\lambda_{\min}(\theta_i[k] - \theta_{\min}) \big) + \lambda_{\max} \exp\big( \lambda_{\max}(\theta_i[k] - \theta_{\max}) \big).$$
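The penalty (24) and the gradient above can be checked numerically; the sketch below uses illustrative parameter values and verifies both the intended behavior inside and outside $[\theta_{\min}, \theta_{\max}]$ and the gradient formula against finite differences.

```python
import numpy as np

# penalty (24) with illustrative parameters
lam_min, lam_max = 1.0, 1.0
t_min, t_max = -5.0, 5.0

def penalty(theta):
    return np.sum(np.exp(-lam_min * (theta - t_min)) +
                  np.exp(lam_max * (theta - t_max)))

def penalty_grad(theta):
    # componentwise gradient derived above
    return (-lam_min * np.exp(-lam_min * (theta - t_min)) +
            lam_max * np.exp(lam_max * (theta - t_max)))

# small inside [t_min, t_max], large outside
assert penalty(np.array([0.0])) < 0.1
assert penalty(np.array([-8.0])) > 10.0

# gradient matches central finite differences
theta = np.array([-6.0, 0.0, 4.0])
h = 1e-6
for i in range(3):
    e = np.zeros(3); e[i] = h
    num = (penalty(theta + e) - penalty(theta - e)) / (2 * h)
    assert abs(num - penalty_grad(theta)[i]) < 1e-4
```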
The Hessian matrix is given by:
$$\nabla^2 \varphi\big(a[k]V^{(t)}+b^{(t)}\big) = \begin{bmatrix} \frac{\partial^2 \varphi(\theta[k])}{\partial \theta_1[k]^2} & \cdots & \frac{\partial^2 \varphi(\theta[k])}{\partial \theta_d[k]\,\partial \theta_1[k]} \\ \vdots & & \vdots \\ \frac{\partial^2 \varphi(\theta[k])}{\partial \theta_1[k]\,\partial \theta_d[k]} & \cdots & \frac{\partial^2 \varphi(\theta[k])}{\partial \theta_d[k]^2} \end{bmatrix},$$
with
$$\frac{\partial^2 \varphi\big(\theta[k]\big)}{\partial \theta_r[k]\,\partial \theta_s[k]} = \lambda_{\min}^2 \exp\big( -\lambda_{\min}(\theta_r[k] - \theta_{\min}) \big) + \lambda_{\max}^2 \exp\big( \lambda_{\max}(\theta_r[k] - \theta_{\max}) \big)$$
for $r = s$, and $\partial^2 \varphi\big(\theta[k]\big)/\partial \theta_r[k]\,\partial \theta_s[k] = 0$ otherwise.
Similarly, the second step of the iterative minimization yields the following update equation for $j = 1, \ldots, q$:
$$v_j^{(t+1),T} = v_j^{(t),T} - \eta_v^{(t+1)} \bigg[ \sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2 \Big( \nabla^2 G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) + c\, \nabla^2 \varphi\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) \Big) \bigg]^{-1} \bigg[ \sum_{k=1}^{n} a_j^{(t+1)}[k] \Big( \nabla G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) - x[k]^T + c\, \nabla \varphi\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) \Big) \bigg].$$
Finally, the last step of the minimization problem yields the following update equation:
$$b^{(t+1),T} = b^{(t),T} - \eta_b^{(t+1)} \bigg[ \sum_{k=1}^{n} \Big( \nabla^2 G\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) + c\, \nabla^2 \varphi\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) \Big) \bigg]^{-1} \bigg[ \sum_{k=1}^{n} \Big( \nabla G\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) - x[k]^T + c\, \nabla \varphi\big(a^{(t+1)}[k]V^{(t+1)}+b^{(t)}\big) \Big) \bigg].$$
3.3 Uniqueness and identifiability
The matrix $V \in \mathbb{R}^{q \times d}$ defines the low-dimensional parameter subspace. It can be shown that the matrix $V$, when $q > 1$, is not unique and that other equivalent representations can be derived by orthogonal transformations of it [22]. In cases where $q > 1$, there is an infinity of choices for $V$. The constraint $\theta[k] = a[k]V+b$ for $k = 1, \ldots, n$, is still satisfied if $a[k]$ is replaced by $a[k]M$ and $V$ by $M^T V$, where $M$ is any orthogonal matrix of dimension $(q \times q)$.
In order to reduce the identifiability problem of the matrix decomposition $\Theta = AV + B$, an orthonormality constraint is used, i.e., the condition
$$VV^T = I_{q \times q} \quad (26)$$
is enforced. Consider the matrix space $\mathcal{M} = \mathbb{R}^{q \times d}$; then $V \in \mathcal{M}$. As the iterative minimization process proposed earlier goes on, the successive updates of the matrix $V$ evolve, giving rise to a curve $V(t)$ in $\mathcal{M}$, where $t$ describes time. The constraint $VV^T = I_{q \times q}$ corresponds to a hyperplane in $\mathcal{M}$ and an easy way to comply with it would be to impose that the curve $V(t)$ remains tangential to the $VV^T = I_{q \times q}$ hyperplane. In other words, the progression along the curve $V(t)$ should remain on the tangent of the $VV^T = I_{q \times q}$ hyperplane.
Considering the notation $\dot{V} = dV/dt$, the tangent to the $VV^T = I_{q \times q}$ hyperplane can be defined by the equation
$$\frac{\dot{V}V^T + V\dot{V}^T}{2} = 0, \quad (27)$$
where the denominator is used for later convenience. Let $M = \dot{V} \in \mathcal{M}$ and the following operator is defined:
$$\mathcal{A}(M) \triangleq \frac{MV^T + VM^T}{2}. \quad (28)$$
It is easily shown that the operator $\mathcal{A} : \mathcal{M} \to \mathcal{S}$, where $\mathcal{S} \subset \mathbb{R}^{q \times q}$ is the space of symmetric matrices, is linear. It is as well onto, i.e., $\mathcal{R}(\mathcal{A}) = \mathcal{S}$, the range of $\mathcal{A}$, and $\mathcal{N}(\mathcal{A}^*) = \{0\}$.
Fig. 4. The operator $\mathcal{A}$, its range $\mathcal{R}(\mathcal{A})$ and null space $\mathcal{N}(\mathcal{A})$ in relation with its adjoint operator $\mathcal{A}^*$, its range $\mathcal{R}(\mathcal{A}^*)$ and null space $\mathcal{N}(\mathcal{A}^*) = \{0\}$, with $\mathcal{M} = \mathcal{N}(\mathcal{A}) \oplus \mathcal{R}(\mathcal{A}^*)$ and $\mathcal{S} = \mathcal{N}(\mathcal{A}^*) \oplus \mathcal{R}(\mathcal{A})$.
Proof: For any $M \in \mathcal{M}$ and any $W \in \mathcal{S}$, the adjoint operator $\mathcal{A}^*$ is defined by
$$\langle W, \mathcal{A}(M) \rangle = \langle \mathcal{A}^*(W), M \rangle, \quad (29)$$
where, using the trace operator $\mathrm{tr}$,
$$\langle W, \mathcal{A}(M) \rangle = \mathrm{tr}\, W^T \mathcal{A}(M) = \frac{1}{2} \mathrm{tr}\big( W^T M V^T + W^T V M^T \big) = \frac{1}{2} \mathrm{tr}\big( W^T M V^T + M V^T W \big) = \frac{1}{2} \mathrm{tr}\big( W^T M V^T + W M V^T \big)$$
using properties of the trace operator,
$$= \mathrm{tr} \Big( \frac{W^T + W}{2} \Big) M V^T = \mathrm{tr}\, W M V^T$$
since $W \in \mathcal{S}$ is symmetric,
$$= \mathrm{tr}\, V^T W M = \mathrm{tr}\, V^T W^T M = \mathrm{tr}\, (WV)^T M = \langle WV, M \rangle. \quad (30)$$
Now, combining equations (29) and (30) results in
$$\langle \mathcal{A}^*(W), M \rangle = \langle WV, M \rangle.$$
Consequently,
$$\mathcal{A}^*(W) = WV, \quad (31)$$
so that $\mathcal{N}(\mathcal{A}^*) = \{0\}$ and $\mathcal{A}$ is onto. Figure 4 shows the relationship between the range and null space of $\mathcal{A}$ and the range and null space of its adjoint operator $\mathcal{A}^*$.
Imposing that the curve $V(t)$ remains tangential to the $VV^T = I_{q \times q}$ hyperplane is equivalent to imposing $\mathcal{A}(\dot{V}) = 0$, i.e., $\dot{V} \in \mathcal{N}(\mathcal{A})$. The increment $\Delta V$ in the $V$ update equation $\big( V^{(t+1)} = V^{(t)} - \eta_v^{(t+1)} \Delta V$, cf. equation (22)$\big)$ therefore needs to be projected onto the null space of $\mathcal{A}$. The projection operator onto the null space of $\mathcal{A}$ is defined as $P_{\mathcal{N}(\mathcal{A})} = I - P_{\mathcal{R}(\mathcal{A}^*)} = I - \mathcal{A}^+ \mathcal{A}$, where the superscript $+$ denotes a pseudoinverse. The goal now is to compute $\mathcal{A}^+$. Since $\mathcal{A}$ is onto, $\mathcal{R}(\mathcal{A}) = \mathcal{S}$, and for any $M \in \mathcal{M}$ there exists a matrix $W \in \mathcal{S}$ such that
$$\mathcal{A}(M) = W. \quad (32)$$
Similarly, for any $M \in \mathcal{M}$ there exists a matrix $\Omega \in \mathcal{S}$ such that
$$M = \mathcal{A}^*(\Omega). \quad (33)$$
Then, combining equations (31) and (33) gives
$$M = \mathcal{A}^*(\Omega) = \Omega V. \quad (34)$$
Now, using both equations (32) and (34) gives
$$\mathcal{A}(M) = \mathcal{A}(\Omega V) = W = \frac{\Omega V V^T + V V^T \Omega^T}{2} = \frac{\Omega + \Omega^T}{2} = \Omega,$$
since $VV^T = I_{q \times q}$ and $\Omega \in \mathcal{S}$ is symmetric. As a result, $\Omega = W$. Consequently, $M = WV = \mathcal{A}^+(W)$. The projection operator onto the range of $\mathcal{A}^*$ is defined by
$$P_{\mathcal{R}(\mathcal{A}^*)}(M) = \mathcal{A}^+ \mathcal{A}(M) = \frac{MV^T + VM^T}{2}\, V.$$
It can be shown that $P_{\mathcal{R}(\mathcal{A}^*)}$ is correctly defined as a projection operator since it is idempotent.
Proof:
$$P^2_{\mathcal{R}(\mathcal{A}^*)}(M) = P_{\mathcal{R}(\mathcal{A}^*)}\Big( \frac{MV^T + VM^T}{2}\, V \Big) = \frac{ \Big( \frac{MV^T + VM^T}{2} V \Big) V^T + V \Big( \frac{MV^T + VM^T}{2} V \Big)^T }{2}\, V$$
$$= \frac{1}{2} \bigg[ \Big( \frac{MV^T + VM^T}{2} \Big) V + \Big( \frac{MV^T + VM^T}{2} \Big) V \bigg] = \Big( \frac{MV^T + VM^T}{2} \Big) V = P_{\mathcal{R}(\mathcal{A}^*)}(M),$$
since $VV^T = I_{q \times q}$ and $\mathcal{A}(M)$ is symmetric.
Then, the projection operator onto the null space of $\mathcal{A}$ is defined by
$$P_{\mathcal{N}(\mathcal{A})}(M) = \big( I - P_{\mathcal{R}(\mathcal{A}^*)} \big)(M) = M - \Big( \frac{MV^T + VM^T}{2} \Big) V.$$
It is indeed a projector onto the null space of $\mathcal{A}$ since
$$\mathcal{A}\Big( P_{\mathcal{N}(\mathcal{A})}(M) \Big) = \frac{1}{2} \bigg[ M - \Big( \frac{MV^T + VM^T}{2} \Big) V \bigg] V^T + \frac{1}{2} V \bigg[ M - \Big( \frac{MV^T + VM^T}{2} \Big) V \bigg]^T$$
$$= \frac{MV^T + VM^T}{2} - \frac{MV^T + VM^T}{2} = 0.$$
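This projector admits a direct numerical check; the sketch below draws an illustrative $V$ with orthonormal rows and verifies that the projected increment is annihilated by $\mathcal{A}$ and that the projector is idempotent.

```python
import numpy as np

rng = np.random.default_rng(3)
q, d = 2, 5
# rows of V orthonormal, so that V V^T = I_q
V = np.linalg.qr(rng.normal(size=(d, q)))[0].T

def op_A(M):
    # A(M) = (M V^T + V M^T) / 2, cf. (28)
    return (M @ V.T + V @ M.T) / 2

def P_null(M):
    # P_N(A)(M) = M - ((M V^T + V M^T)/2) V
    return M - op_A(M) @ V

M = rng.normal(size=(q, d))
assert np.allclose(V @ V.T, np.eye(q))                  # orthonormality constraint (26)
assert np.allclose(op_A(P_null(M)), np.zeros((q, q)))   # projected increment is tangential
assert np.allclose(P_null(P_null(M)), P_null(M))        # the projector is idempotent
```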
Finally, the update equation for $j = 1, \ldots, q$ becomes:
$$v_j^{(t+1),T} = v_j^{(t),T} - \eta_v^{(t+1)} P_{\mathcal{N}(\mathcal{A})} \Bigg[ \bigg( \sum_{k=1}^{n} \big( a_j^{(t+1)}[k] \big)^2\, \nabla^2 G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) \bigg)^{-1} \bigg( \sum_{k=1}^{n} a_j^{(t+1)}[k] \Big( \nabla G\big(a^{(t+1)}[k]V^{(t)}+b^{(t)}\big) - x[k]^T \Big) \bigg) \Bigg].$$
3.4 Mixed data types
The above approach was proposed assuming that the data attributes have the same distribution. It can be extended to the problem for which different types of distributions can be used for different attributes. This situation is referred to as the mixed data-type case and is exposed below. Only two types of exponential family distributions are considered here (for example, the Bernoulli distribution and the Gaussian distribution). Of course, this approach generalizes to any number of exponential family distributions.
For simplicity of presentation, we consider that the first $f$ attributes are distributed according to the exponential family distribution $p^{(1)}$ and the $(d - f)$ last attributes are distributed according to the exponential family distribution $p^{(2)}$. Following the previously stated example, the superscript $(1)$ would correspond to Bernoulli distributed attributes and $(2)$ to Gaussian distributed attributes. Then,
$$X = \begin{bmatrix} x_1[1] & \ldots & x_f[1] & x_{f+1}[1] & \ldots & x_d[1] \\ x_1[2] & \ldots & x_f[2] & x_{f+1}[2] & \ldots & x_d[2] \\ \vdots & & \vdots & \vdots & & \vdots \\ x_1[n] & \ldots & x_f[n] & x_{f+1}[n] & \ldots & x_d[n] \end{bmatrix} = \begin{bmatrix} X^{(1)} & X^{(2)} \end{bmatrix}.$$
The loss function is expressed as follows:
$$L(A, V, b) = -\log p(X|A, V, b) = -\sum_{k=1}^{n} \log p\big(x[k]\big|\theta[k]\big),$$
using the iid statistical samples assumption. Then, using
the latent variable assumption,
$$p\big(x[k]\big|\theta[k]\big) = p^{(1)}\big(x^{(1)}[k]\big|\theta^{(1)}[k]\big)\, p^{(2)}\big(x^{(2)}[k]\big|\theta^{(2)}[k]\big), \quad (35)$$
where
$$\Theta = \begin{bmatrix} \theta_1[1] & \ldots & \theta_f[1] & \theta_{f+1}[1] & \ldots & \theta_d[1] \\ \theta_1[2] & \ldots & \theta_f[2] & \theta_{f+1}[2] & \ldots & \theta_d[2] \\ \vdots & & \vdots & \vdots & & \vdots \\ \theta_1[n] & \ldots & \theta_f[n] & \theta_{f+1}[n] & \ldots & \theta_d[n] \end{bmatrix} = \begin{bmatrix} \Theta^{(1)} & \Theta^{(2)} \end{bmatrix}.$$
The matrix of parameters $\Theta = AV + B$ results in the following decompositions:
$$V = \begin{bmatrix} v_{11} & \ldots & v_{1f} & v_{1(f+1)} & \ldots & v_{1d} \\ v_{21} & \ldots & v_{2f} & v_{2(f+1)} & \ldots & v_{2d} \\ \vdots & & \vdots & \vdots & & \vdots \\ v_{q1} & \ldots & v_{qf} & v_{q(f+1)} & \ldots & v_{qd} \end{bmatrix} = \begin{bmatrix} V^{(1)} & V^{(2)} \end{bmatrix},$$
and $B = \begin{bmatrix} B^{(1)} & B^{(2)} \end{bmatrix}$, $B^{(1)} = [b^{(1)}, \ldots, b^{(1)}]^T$ with $b^{(1)} = [b_1, \ldots, b_f]$, and $B^{(2)} = [b^{(2)}, \ldots, b^{(2)}]^T$ with $b^{(2)} = [b_{f+1}, \ldots, b_d]$. Then,
$$AV+B = \begin{bmatrix} AV^{(1)}+B^{(1)} & AV^{(2)}+B^{(2)} \end{bmatrix}.$$
Now, notice that
$$AV^{(1)}+B^{(1)} = \begin{bmatrix} a_1[1] & \ldots & a_q[1] \\ a_1[2] & \ldots & a_q[2] \\ \vdots & & \vdots \\ a_1[n] & \ldots & a_q[n] \end{bmatrix} \begin{bmatrix} v_{11} & \ldots & v_{1f} \\ v_{21} & \ldots & v_{2f} \\ \vdots & & \vdots \\ v_{q1} & \ldots & v_{qf} \end{bmatrix} + B^{(1)}.$$
The product $AV^{(1)}$ is an $(n \times f)$ matrix whose elements are $\sum_{j=1}^{q} a_j[k]\, v_{ji}$ for all $k = 1, \ldots, n$ and $i = 1, \ldots, f$. Consequently, the elements of the $(n \times f)$ matrix $AV^{(1)} + B^{(1)}$ are of the form $\sum_{j=1}^{q} a_j[k]\, v_{ji} + b_i$ for all $k = 1, \ldots, n$ and $i = 1, \ldots, f$. The matrix $\Theta^{(1)}$ is also $(n \times f)$ and its elements take the following form: $\theta_i[k] = \sum_{j=1}^{q} a_j[k]\, v_{ji} + b_i$ for all $k = 1, \ldots, n$ and $i = 1, \ldots, f$. Besides, knowing that $\theta_i[k] = \sum_{j=1}^{q} a_j[k]\, v_{ji} + b_i$ for all $k = 1, \ldots, n$ and $i = 1, \ldots, d$, it becomes clear that $\Theta^{(1)}$ equals $AV^{(1)} + B^{(1)}$, and $\Theta^{(2)}$ is $AV^{(2)} + B^{(2)}$. Note that, even though we are able to separate the matrix $\Theta$ into $\Theta^{(1)}$ and $\Theta^{(2)}$, the matrix $A$ is common to both $\Theta^{(1)}$ and $\Theta^{(2)}$. Therefore, the loss function takes the following form:
$$
L(A, V, b) = -\sum_{k=1}^{n} \log p^{(1)}\big(x^{(1)}[k] \mid a[k], V^{(1)}, b^{(1)}\big) - \sum_{k=1}^{n} \log p^{(2)}\big(x^{(2)}[k] \mid a[k], V^{(2)}, b^{(2)}\big).
$$
Since a linear combination of convex functions with nonnegative coefficients is always convex [47], the loss function remains convex in any one of its arguments with the others fixed. Therefore, the iterative minimization technique proposed for the single exponential family can be applied in the mixture of exponential families case. The first step in the Newton-Raphson minimization technique, given a fixed matrix $V$ and a fixed vector $b$, is to obtain the matrix $A$, i.e., the set of vectors $a[k]$ for $k = 1, \ldots, n$, that minimizes the loss function; the second step, given a fixed matrix $A$, updates $V$ and $b$. For the first step, the per-sample loss at iteration $t$ is
$$
L\big(a[k]\big) = G^{(1)}\big(a[k]V^{(1)(t)} + b^{(1)(t)}\big) - \big(a[k]V^{(1)(t)} + b^{(1)(t)}\big)\, x^{(1)}[k]^T + G^{(2)}\big(a[k]V^{(2)(t)} + b^{(2)(t)}\big) - \big(a[k]V^{(2)(t)} + b^{(2)(t)}\big)\, x^{(2)}[k]^T.
$$
The update equation for the set of vectors $a[k]$, for $k = 1, \ldots, n$, is:
$$
a^{(t+1)}[k]^T = a^{(t)}[k]^T - \alpha_a^{(t+1)} \Big[ V^{(1)(t)}\, G''^{(1)}\big(a^{(t)}[k]V^{(1)(t)} + b^{(1)(t)}\big)\, V^{(1)(t),T} + V^{(2)(t)}\, G''^{(2)}\big(a^{(t)}[k]V^{(2)(t)} + b^{(2)(t)}\big)\, V^{(2)(t),T} \Big]^{-1} \Big[ V^{(1)(t)} \big( G'^{(1)}\big(a^{(t)}[k]V^{(1)(t)} + b^{(1)(t)}\big) - x^{(1)}[k] \big)^T + V^{(2)(t)} \big( G'^{(2)}\big(a^{(t)}[k]V^{(2)(t)} + b^{(2)(t)}\big) - x^{(2)}[k] \big)^T \Big], \tag{36}
$$
where $\alpha_a^{(t+1)}$ is the step size and $G'$ and $G''$ denote the gradient and the Hessian of the cumulant generating function.
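A minimal sketch of one update (36) for a single vector $a[k]$, assuming the Bernoulli/Gaussian blocks of the running example (so $G'^{(1)}$ is the sigmoid, $G''^{(1)}$ its derivative, $G'^{(2)}$ the identity and $G''^{(2)}$ the constant one); the damping factor `step` plays the role of the step size $\alpha_a$ and its value is an illustrative choice:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def newton_step_a(a, x1, x2, V1, V2, b1, b2, step=1.0):
    """One Newton-Raphson update of a single latent vector a[k], following (36).

    Bernoulli block: G1' = sigmoid, G1'' = sigmoid*(1-sigmoid) (diagonal).
    Gaussian block (unit variance): G2' = identity, G2'' = 1.
    """
    t1 = a @ V1 + b1                       # natural parameters, block 1
    t2 = a @ V2 + b2                       # natural parameters, block 2
    g1, h1 = sigmoid(t1), sigmoid(t1) * (1.0 - sigmoid(t1))
    g2, h2 = t2, np.ones_like(t2)
    # Hessian: V1 G1'' V1^T + V2 G2'' V2^T   (q x q)
    H = (V1 * h1) @ V1.T + (V2 * h2) @ V2.T
    # gradient: V1 (G1'(.) - x1)^T + V2 (G2'(.) - x2)^T
    grad = V1 @ (g1 - x1) + V2 @ (g2 - x2)
    return a - step * np.linalg.solve(H, grad)
```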
For the second step, the two sets of row vectors $\{v_j^{(1)}\}_{j=1}^{q}$ and $\{v_j^{(2)}\}_{j=1}^{q}$ are updated separately. For the sake of simplicity, the update equation is written for the set $\{v_j\}_{j=1}^{q}$ irrespective of the mixture superscript and is given as follows for $j = 1, \ldots, q$:
$$
v_j^{(t+1),T} = v_j^{(t),T} - \alpha_v^{(t+1)} \Big[ \sum_{k=1}^{n} a_j^{(t+1)}[k]^2\, G''\big(a^{(t+1)}[k]V^{(t)} + b^{(t)}\big) \Big]^{-1} \Big[ \sum_{k=1}^{n} a_j^{(t+1)}[k] \big( G'\big(a^{(t+1)}[k]V^{(t)} + b^{(t)}\big) - x[k] \big)^T \Big]. \tag{37}
$$
For the last step, as for $\{v_j^{(1)}\}_{j=1}^{q}$ and $\{v_j^{(2)}\}_{j=1}^{q}$, the derivations are made for the vector $b$ irrespective of the mixture superscript:
$$
b^{(t+1),T} = b^{(t),T} - \alpha_b^{(t+1)} \Big[ \sum_{k=1}^{n} G''\big(a^{(t+1)}[k]V^{(t+1)} + b^{(t)}\big) \Big]^{-1} \Big[ \sum_{k=1}^{n} \big( G'\big(a^{(t+1)}[k]V^{(t+1)} + b^{(t)}\big) - x[k] \big)^T \Big]. \tag{38}
$$
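The updates (37) and (38) can be sketched per exponential-family block as follows; `gprime` and `gsecond` stand for $G'$ and $G''$ of whichever block is being updated, and a unit step size is assumed for illustration. Since $G''$ acts elementwise, the Hessians are diagonal and are stored as vectors:

```python
import numpy as np

def newton_step_vj(vj, j, A, V, b, X, gprime, gsecond, step=1.0):
    """One Newton-Raphson update of the row vector v_j, following (37).

    gprime/gsecond evaluate G' and G'' elementwise on the natural
    parameters Theta = A V + b, one exponential-family block at a time.
    """
    Theta = A @ V + b                       # (n x d) natural parameters
    aj = A[:, j]                            # coefficients a_j[k], k = 1..n
    # Hessian: sum_k a_j[k]^2 G''(theta[k])  -- diagonal, stored as a vector
    H = aj**2 @ gsecond(Theta)              # (d,)
    # gradient: sum_k a_j[k] (G'(theta[k]) - x[k])^T
    grad = aj @ (gprime(Theta) - X)         # (d,)
    return vj - step * grad / H

def newton_step_b(A, V, b, X, gprime, gsecond, step=1.0):
    """One Newton-Raphson update of the offset b, following (38)."""
    Theta = A @ V + b
    H = gsecond(Theta).sum(axis=0)          # (d,) diagonal Hessian
    grad = (gprime(Theta) - X).sum(axis=0)  # (d,)
    return b - step * grad / H
```

For the Gaussian block ($G' = \mathrm{id}$, $G'' = 1$) a unit Newton step is exact: the $b$ update lands directly on the minimizer $\mathrm{mean}(X - AV)$.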
Equation (37) can be used for $\{v_j^{(1)}\}_{j=1}^{q}$ and $\{v_j^{(2)}\}_{j=1}^{q}$, and equation (38) can be used for $b^{(1)}$ and $b^{(2)}$, by changing $v_j$ to $v_j^{(1)}$, respectively to $v_j^{(2)}$; $V$ to $V^{(1)}$, respectively to $V^{(2)}$; $b$ to $b^{(1)}$, respectively to $b^{(2)}$; and $G(\cdot)$, $G'(\cdot)$ and $G''(\cdot)$ to $G^{(1)}(\cdot)$, $G'^{(1)}(\cdot)$ and $G''^{(1)}(\cdot)$, respectively to $G^{(2)}(\cdot)$, $G'^{(2)}(\cdot)$ and $G''^{(2)}(\cdot)$.
Table 1 summarizes the exponential PCA algorithm.
4 TWO OTHER GLS SPECIAL CASES AND EXTENSIONS TO MIXED DATA-TYPE CASES
In addition to the exponential family Principal Component Analysis (exponential PCA) technique, the two
Algorithm: Exponential PCA [10]
Input: a set of observations $\{x[k]\}_{k=1}^{n} \subset \mathbb{R}^d$; two exponential family distributions $p^{(1)}$, $p^{(2)}$ defined by their cumulant generating functions $G^{(1)}$, $G^{(2)}$; a number of atoms $n$; $q \le d$ the dimension of the latent-variable lower-dimensional subspace.
Output: the ML estimator $\{\hat{\theta}[k]\}_{k=1}^{n}$ that minimizes the loss function $L(A, V, b)$ in (17): $\hat{\theta}[k] = \hat{a}[k]\hat{V} + \hat{b}$ for all $k$, $\{\hat{a}[k]\}_{k=1}^{n} \subset \mathbb{R}^q$, $\hat{V} \in \mathbb{R}^{q \times d}$ and $\hat{b} \in \mathbb{R}^d$.
Method:
Initialize $V$, $b$ and $\{a[k]\}_{k=1}^{n}$; $\theta[k] = a[k]V + b$ for all $k$;
repeat
  {The Newton-Raphson iterative algorithm}
  update $\{a[k]\}_{k=1}^{n}$ with (36); update $V$ with (37); update $b$ with (38);
until convergence;
return $\{\hat{\theta}[k] = \hat{a}[k]\hat{V} + \hat{b}\}_{k=1}^{n}$.
TABLE 1
Exponential PCA algorithm.
other GLS special cases considered here are the Semi-Parametric exponential family Principal Component Analysis (SPPCA) and the Bregman soft clustering techniques. They all utilize Bregman distances and can all be explained within the single hierarchical Bayes graphical model framework shown in Figure 1. They are not separate unrelated algorithms but different manifestations of model assumptions and parameter choices taken within a common framework. Because of this insight, these algorithms are readily extended to deal with the important mixed data-type case.
Figure 5 considers the number of atoms as a common characteristic for comparison purposes. The exponential PCA technique corresponds to a classical approach to the GLS estimation problem. The classical approach can be seen as an extreme case of the Bayesian approach for which the probability density function $\pi(\cdot)$ is a delta function (one per data point) and the total number of distinct natural parameter values $m$ equals the number of data points $n$, i.e., $m = n$. While the $m < n$ parameters of the Bayesian approach consistent with SPPCA and the Bregman soft clustering techniques are shared by all the data points, the classical approach assigns one parameter point to each data point (hence $m = n$). The Bregman soft clustering approach considers an even smaller number of natural parameters or atoms than SPPCA. Since its primary goal is clustering, the atoms play the role of cluster centers in parameter space and their total number is generally small. Furthermore, both exponential PCA and SPPCA impose a low-dimensional (unknown) latent variable subspace in their structure. However, Bregman soft clustering does not impose this lower-dimensional constraint and hence can be seen as a degenerate case.
It becomes clear from Table 1, Table 2 and Table 3 that both SPPCA and Bregman soft clustering utilize the EM algorithm for estimation purposes whereas exponential PCA does not. Indeed, because exponential PCA assumes a classical approach, no point-mass probabilities need to be estimated.
[Fig. 5 places the three techniques along an axis of the number of atoms $m$: small $m$ for Bregman soft clustering, intermediate $m$ for SPPCA, and $m = n$ for exponential PCA.]
Fig. 5. General point of view on SPPCA, exponential PCA and Bregman soft clustering based on the number of NPML atoms.
4.1 Semi-parametric exponential family PCA
The Semi-Parametric exponential family Principal Component Analysis (SPPCA) approach presented in [15] attacks the Semi-parametric Maximum Likelihood mixture density Estimation (SMLE) problem exposed in Section 2 by using the Expectation-Maximization (EM) algorithm [41]. We directly present an SPPCA approach modified for mixed data types and use the mixed data-type notations exposed in the previous section for exponential PCA. Using equation (10), the log-likelihood function is
$$
L(Q) = \log \prod_{k=1}^{n} \sum_{l=1}^{m} p\big(x[k] \mid a[l]V + b\big)\, \pi_l, \tag{39}
$$
with $Q = \big\{\theta[l] = a[l]V + b,\ \pi_l\big\}_{l=1}^{m}$ the mixing distribution. The EM approach introduces a missing (unobserved) variable $z_k = [z_{k1}, \ldots, z_{km}]$, for $k = 1, \ldots, n$. This variable is an $m$-dimensional binary vector whose $l$th component equals 1 if the observed variable $x[k]$ was drawn from the $l$th mixture component and 0 otherwise; its value is estimated during the E-step. Using this information, a complete log-likelihood function is defined as follows:
$$
L^{(c)}\big(Q, \{z_k\}_{k=1}^{n}\big) = \log \prod_{k=1}^{n} \prod_{l=1}^{m} p\big(x[k] \mid \theta[l]\big)^{z_{kl}}\, \pi_l^{z_{kl}}, \tag{40}
$$
with $\theta[l] = a[l]V + b$. Because $z_{kl}$ equals 1 for exactly one $l$ when $k$ is fixed, reflecting the assumption that each $x[k]$ is drawn from exactly one mixture component, the inner sum in equation (39) has in fact, for each $k$, exactly one nonzero term. In equation (40) it is exactly that nonzero term which is present in the product; all others have an exponent of $z_{kl} = 0$ and hence do not contribute to the product. The maximization of the complete log-likelihood function (the M-step) yields estimates of the parameters $A$, $V$ and $b$. Then,
$$
L^{(c)}\big(Q, \{z_k\}_{k=1}^{n}\big) = \sum_{k=1}^{n} \sum_{l=1}^{m} z_{kl} \log \pi_l + \sum_{k=1}^{n} \sum_{l=1}^{m} z_{kl} \log \Big[ p^{(1)}\big(x^{(1)}[k] \mid \theta^{(1)}[l]\big)\, p^{(2)}\big(x^{(2)}[k] \mid \theta^{(2)}[l]\big) \Big]. \tag{41}
$$
The E-step yields, for $k = 1, \ldots, n$ and $l = 1, \ldots, m$:
$$
\hat{z}_{kl} = E\big[z_{kl} \mid x[k], \theta[1], \ldots, \theta[m]\big] = \frac{ p^{(1)}\big(x^{(1)}[k] \mid \theta^{(1)}[l]\big)\, p^{(2)}\big(x^{(2)}[k] \mid \theta^{(2)}[l]\big)\, \pi_l }{ \sum_{r=1}^{m} p^{(1)}\big(x^{(1)}[k] \mid \theta^{(1)}[r]\big)\, p^{(2)}\big(x^{(2)}[k] \mid \theta^{(2)}[r]\big)\, \pi_r }.
$$
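The E-step can be sketched for the Bernoulli/Gaussian running example as follows; the computation is carried out in log-space (an implementation choice for numerical stability, not part of the derivation):

```python
import numpy as np

def e_step(X1, X2, Theta1, Theta2, pi):
    """Responsibilities z_hat[k,l] for the mixed Bernoulli/Gaussian mixture.

    Theta1: (m x f) Bernoulli natural parameters per atom;
    Theta2: (m x (d-f)) Gaussian natural parameters per atom.
    """
    G1 = lambda t: np.logaddexp(0.0, t)   # Bernoulli cumulant
    G2 = lambda t: 0.5 * t**2             # unit-variance Gaussian cumulant
    # log p^(1)(x1[k]|theta1[l]) + log p^(2)(x2[k]|theta2[l]) for all k, l;
    # base-measure constants drop out of the ratio
    ll = (X1 @ Theta1.T - G1(Theta1).sum(axis=1)
          + X2 @ Theta2.T - G2(Theta2).sum(axis=1))        # (n x m)
    log_w = ll + np.log(pi)
    log_w -= log_w.max(axis=1, keepdims=True)              # stabilize
    Z = np.exp(log_w)
    return Z / Z.sum(axis=1, keepdims=True)
```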
For all $l$ and all $k$, each data point $x[k]$ has an estimated probability $\hat{z}_{kl}$ of belonging to the $l$th mixture component.
The M-step first yields the estimates for the point-mass probabilities:
$$
\pi_l^{(new)} = \frac{\sum_{k=1}^{n} \hat{z}_{kl}}{n},
$$
corresponding to the number of samples $x[k]$ drawn from the $l$th mixture, divided by the number of samples overall. The second part of the M-step, i.e., the estimation of the parameters $V$, $b$ and the latent variables $A = \big[\hat{a}[1]^T, \ldots, \hat{a}[m]^T\big]^T \in \mathbb{R}^{m \times q}$, is affected by the mixed data-type assumption. It consists of maximizing the complete log-likelihood function (41) with respect to these parameters:
$$
\arg\max_{A,V,b} L^{(c)}\Big( \big\{\theta[l], \pi_l^{(new)}\big\}_{l=1}^{m}, \{z_k\}_{k=1}^{n} \Big) = \arg\min_{A,V,b} \sum_{k=1}^{n} \sum_{l=1}^{m} \hat{z}_{kl} \Big[ G^{(1)}\big(a[l]V^{(1)} + b^{(1)}\big) - \big(a[l]V^{(1)} + b^{(1)}\big)\, x^{(1)}[k]^T \Big] + \sum_{k=1}^{n} \sum_{l=1}^{m} \hat{z}_{kl} \Big[ G^{(2)}\big(a[l]V^{(2)} + b^{(2)}\big) - \big(a[l]V^{(2)} + b^{(2)}\big)\, x^{(2)}[k]^T \Big].
$$
We set, for $l = 1, \ldots, m$, $\bar{x}[l] = \sum_{k=1}^{n} \hat{z}_{kl}\, x[k] \big/ \sum_{k=1}^{n} \hat{z}_{kl}$, the $l$th mixture component center. It can be shown, using exponential family properties, that the loss function is:
$$
L(A, V, b) = \sum_{l=1}^{m} \pi_l^{(new)} \Big[ G^{(1)}\big(a[l]V^{(1)} + b^{(1)}\big) - \big(a[l]V^{(1)} + b^{(1)}\big)\, \bar{x}^{(1)}[l]^T \Big] + \sum_{l=1}^{m} \pi_l^{(new)} \Big[ G^{(2)}\big(a[l]V^{(2)} + b^{(2)}\big) - \big(a[l]V^{(2)} + b^{(2)}\big)\, \bar{x}^{(2)}[l]^T \Big], \tag{42}
$$
since $(1/n) \sum_{k=1}^{n} \hat{z}_{kl} = \pi_l^{(new)}$. Note that these coefficients $\pi_l^{(new)}$, $l = 1, \ldots, m$, are not present in the algorithm proposed in [15].
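A sketch of the weighted loss (42) for the Bernoulli/Gaussian running example; it forms the centers $\bar{x}[l]$ and the weights $\pi_l^{(new)}$ directly from a responsibility matrix $Z$ (all names are illustrative). The equivalence with the double sum over $k$ and $l$, which is what (42) is derived from, is easy to check numerically:

```python
import numpy as np

def sppca_loss(A, V1, V2, b1, b2, Z, X1, X2):
    """Weighted SPPCA loss (42) from responsibilities Z (n x m).

    xbar[l] is the responsibility-weighted mixture-component center and
    pi_new[l] = (1/n) sum_k z_hat[k,l].
    """
    n = Z.shape[0]
    pi_new = Z.sum(axis=0) / n
    Xbar1 = (Z.T @ X1) / Z.sum(axis=0)[:, None]   # (m x f) block-1 centers
    Xbar2 = (Z.T @ X2) / Z.sum(axis=0)[:, None]   # (m x (d-f)) block-2 centers
    T1 = A @ V1 + b1                              # atom natural parameters
    T2 = A @ V2 + b2
    G1 = lambda t: np.logaddexp(0.0, t)           # Bernoulli cumulant
    G2 = lambda t: 0.5 * t**2                     # Gaussian cumulant
    term1 = pi_new @ np.sum(G1(T1) - T1 * Xbar1, axis=1)
    term2 = pi_new @ np.sum(G2(T2) - T2 * Xbar2, axis=1)
    return term1 + term2
```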
The Newton-Raphson method is used for the iterative minimization of the loss function (42) and the resulting update equations are as follows. First, at iteration $(t+1)$, for $l = 1, \ldots, m$,
$$
\hat{a}^{(t+1)}[l]^T = \hat{a}^{(t)}[l]^T - \alpha_a^{(t+1)} \Big[ V^{(1)(t)}\, G''^{(1)}\big(\hat{a}^{(t)}[l]V^{(1)(t)} + b^{(1)(t)}\big)\, V^{(1)(t),T} + V^{(2)(t)}\, G''^{(2)}\big(\hat{a}^{(t)}[l]V^{(2)(t)} + b^{(2)(t)}\big)\, V^{(2)(t),T} \Big]^{-1} \Big[ V^{(1)(t)} \big( G'^{(1)}\big(\hat{a}^{(t)}[l]V^{(1)(t)} + b^{(1)(t)}\big) - \bar{x}^{(1)}[l] \big)^T + V^{(2)(t)} \big( G'^{(2)}\big(\hat{a}^{(t)}[l]V^{(2)(t)} + b^{(2)(t)}\big) - \bar{x}^{(2)}[l] \big)^T \Big]. \tag{43}
$$
For the second step, the two sets of row vectors $\{v_j^{(1)}\}_{j=1}^{q}$ and $\{v_j^{(2)}\}_{j=1}^{q}$ are updated separately. For $j = 1, \ldots, q$:
$$
v_j^{(t+1),T} = v_j^{(t),T} - \alpha_v^{(t+1)} \Big[ \sum_{l=1}^{m} \pi_l^{(new)}\, \hat{a}_j^{(t+1)}[l]^2\, G''\big(\hat{a}^{(t+1)}[l]V^{(t)} + b^{(t)}\big) \Big]^{-1} \Big[ \sum_{l=1}^{m} \pi_l^{(new)}\, \hat{a}_j^{(t+1)}[l] \big( G'\big(\hat{a}^{(t+1)}[l]V^{(t)} + b^{(t)}\big) - \bar{x}[l] \big)^T \Big]. \tag{44}
$$
And finally, for the last step, the update equation is:
$$
b^{(t+1),T} = b^{(t),T} - \alpha_b^{(t+1)} \Big[ \sum_{l=1}^{m} \pi_l^{(new)}\, G''\big(\hat{a}^{(t+1)}[l]V^{(t+1)} + b^{(t)}\big) \Big]^{-1} \Big[ \sum_{l=1}^{m} \pi_l^{(new)} \big( G'\big(\hat{a}^{(t+1)}[l]V^{(t+1)} + b^{(t)}\big) - \bar{x}[l] \big)^T \Big]. \tag{45}
$$
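Compared with the exponential-PCA update (38), the only changes are the weights $\pi_l^{(new)}$ and the centers $\bar{x}[l]$ in place of per-sample terms. A sketch of the $b$ update (45), with `gprime`/`gsecond` standing for $G'$/$G''$ of one block and a unit step size assumed for illustration:

```python
import numpy as np

def newton_step_b_weighted(A, V, b, Xbar, pi, gprime, gsecond, step=1.0):
    """One Newton-Raphson update of b, following (45).

    A: (m x q) atoms a[l]; Xbar: (m x d) centers xbar[l];
    pi: (m,) point-mass weights pi_new[l].
    """
    Theta = A @ V + b                     # (m x d) atom natural parameters
    H = pi @ gsecond(Theta)               # (d,) diagonal weighted Hessian
    grad = pi @ (gprime(Theta) - Xbar)    # (d,) weighted gradient
    return b - step * grad / H
```

For the Gaussian block, where $G'$ is the identity and $G''$ is one, a unit step solves the weighted problem exactly.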
Equation (44) can be used for $\{v_j^{(1)}\}_{j=1}^{q}$ and $\{v_j^{(2)}\}_{j=1}^{q}$, and equation (45) can be used for $b^{(1)}$ and $b^{(2)}$, by changing $v_j$ to $v_j^{(1)}$, respectively to $v_j^{(2)}$; $V$ to $V^{(1)}$, respectively to $V^{(2)}$; $b$ to $b^{(1)}$, respectively to $b^{(2)}$; and $G(\cdot)$, $G'(\cdot)$ and $G''(\cdot)$ to $G^{(1)}(\cdot)$, $G'^{(1)}(\cdot)$ and $G''^{(1)}(\cdot)$, respectively to $G^{(2)}(\cdot)$, $G'^{(2)}(\cdot)$ and $G''^{(2)}(\cdot)$.
Table 2 summarizes the SPPCA algorithm.
4.2 Bregman soft clustering
The Bregman soft clustering approach presented in [16] utilizes an alternative interpretation of the EM algorithm for learning models involving mixtures of exponential family distributions. It is a simple soft clustering algorithm for all Bregman divergences, i.e., for all exponential family distributions. We choose here to present this technique without referring to the Bregman divergence as in [16], but by using its corresponding exponential family probability distribution, for the sake of comparison with SPPCA and exponential PCA.
Given a data set of observations $\{x[k]\}_{k=1}^{n}$, Bregman soft clustering aims at modeling the statistical structure
Algorithm: Semi-Parametric exp. family PCA [15]
Input: a set of observations $\{x[k]\}_{k=1}^{n} \subset \mathbb{R}^d$; two exponential family distributions $p^{(1)}$, $p^{(2)}$ defined by their cumulant generating functions $G^{(1)}$, $G^{(2)}$; a number of atoms $m$; $q \le d$ the dimension of the latent-variable lower-dimensional subspace.
Output: the NPML estimator that maximizes the complete log-likelihood function $L^{(c)}\big(Q, \{z_k\}_{k=1}^{n}\big)$: $\hat{Q} = \{\hat{\theta}[l], \hat{\pi}_l\}_{l=1}^{m}$ with $\hat{\theta}[l] = \hat{a}[l]\hat{V} + \hat{b}$ for all $l$, $\{\hat{a}[l]\}_{l=1}^{m} \subset \mathbb{R}^q$, $\hat{V} \in \mathbb{R}^{q \times d}$ and $\hat{b} \in \mathbb{R}^d$.
Method:
Initialize $V$, $b$ and $\{a[l], \pi_l\}_{l=1}^{m}$ with $\pi_l \ge 0$ for all $l$ and $\sum_{l=1}^{m} \pi_l = 1$; $\theta[l] = a[l]V + b$; $p\big(x[k] \mid \theta[l]\big)$ as defined in (35) for all $k$ and $l$;
repeat
  {The Expectation Step}
  for $k = 1$ to $n$ do
    for $l = 1$ to $m$ do
      $\hat{z}_{kl} \leftarrow p\big(x[k] \mid \theta[l]\big)\, \pi_l \big/ \sum_{r=1}^{m} p\big(x[k] \mid \theta[r]\big)\, \pi_r$
    end for
  end for
  {The Maximization Step}
  for $l = 1$ to $m$ do
    $\pi_l \leftarrow (1/n) \sum_{k=1}^{n} \hat{z}_{kl}$
  end for
  {The Newton-Raphson iterative algorithm}
  update $\{a[l]\}_{l=1}^{m}$ with (43); update $V$ with (44); update $b$ with (45); $\theta[l] = a[l]V + b$;
until convergence;
return $\hat{Q} = \{\hat{\theta}[l], \hat{\pi}_l\}_{l=1}^{m}$.
TABLE 2
Semi-Parametric exponential family PCA algorithm.
Algorithm: Bregman Soft Clustering [16]
Input: a set of observations $\{x[k]\}_{k=1}^{n} \subset \mathbb{R}^d$; two exponential family distributions $p^{(1)}$, $p^{(2)}$ defined by their cumulant generating functions $G^{(1)}$, $G^{(2)}$; a number of atoms $m$.
Output: the NPML estimator that maximizes the complete log-likelihood function $L^{(c)}\big(Q, \{z_k\}_{k=1}^{n}\big)$: $\hat{Q} = \{\hat{\theta}[l], \hat{\pi}_l\}_{l=1}^{m}$.
Method:
Initialize $\{\theta[l], \pi_l\}_{l=1}^{m}$ with $\pi_l \ge 0$ for all $l$ and $\sum_{l=1}^{m} \pi_l = 1$;
repeat
  {The Expectation Step}
  for $k = 1$ to $n$ do
    for $l = 1$ to $m$ do
      $\hat{z}_{kl} \leftarrow p\big(x[k] \mid \theta[l]\big)\, \pi_l \big/ \sum_{r=1}^{m} p\big(x[k] \mid \theta[r]\big)\, \pi_r$
    end for
  end for
  {The Maximization Step}
  for $l = 1$ to $m$ do
    $\pi_l \leftarrow (1/n) \sum_{k=1}^{n} \hat{z}_{kl}$;
    solve for $\theta[l]$: $G'\big(\theta[l]\big) = \sum_{k=1}^{n} \hat{z}_{kl}\, x[k] \big/ \sum_{k=1}^{n} \hat{z}_{kl}$
  end for
until convergence;
return $\hat{Q} = \{\hat{\theta}[l], \hat{\pi}_l\}_{l=1}^{m}$.
TABLE 3
Bregman soft clustering algorithm.
of the data as a mixture of $m$ densities of the same exponential family. The clusters correspond to the components of the mixture model, and the soft membership of a data point in each cluster is proportional to the probability of the data point being generated by the corresponding density function. The Bregman soft clustering problem is based on a maximum likelihood estimation of the cluster parameters $\{\theta[l], \pi_l\}_{l=1}^{m}$ satisfying the following mixture structure:
$$
p(x) = \sum_{l=1}^{m} p\big(x \mid \theta[l]\big)\, \pi_l,
$$
where $p(x \mid \theta)$ is an exponential family distribution. The data likelihood function takes the following form:
$$
p(X) = \prod_{k=1}^{n} \sum_{l=1}^{m} p\big(x[k] \mid \theta[l]\big)\, \pi_l. \tag{46}
$$
The data likelihood function in (46) is similar to the data likelihood function in (8) without the linear constraint $\theta[l] = a[l]V + b$. The parameters $\theta[l]$, $l = 1, \ldots, m$, are estimated in the following way:
$$
\theta[l]^{(new)} = \arg\max_{\theta[l]} \sum_{k=1}^{n} \sum_{r=1}^{m} \hat{z}_{kr} \log p\big(x[k] \mid \theta[r]\big) = \arg\max_{\theta[l]} \Big[ \sum_{k=1}^{n} \sum_{r=1}^{m} \hat{z}_{kr} \log p^{(1)}\big(x^{(1)}[k] \mid \theta^{(1)}[r]\big) + \sum_{k=1}^{n} \sum_{r=1}^{m} \hat{z}_{kr} \log p^{(2)}\big(x^{(2)}[k] \mid \theta^{(2)}[r]\big) \Big],
$$
with $\log p\big(x[k] \mid \theta[r]\big) = \theta[r]\, x[k]^T - G\big(\theta[r]\big)$. Using the convexity properties of $G(\cdot)$, it is easily shown that:
$$
G'\big(\theta[l]^{(new)}\big) = \Big( \sum_{k=1}^{n} \hat{z}_{kl}\, x[k] \Big) \Big/ \Big( \sum_{k=1}^{n} \hat{z}_{kl} \Big),
$$
which can be solved for $\theta[l]^{(new),(1)}$ and $\theta[l]^{(new),(2)}$ by changing $x$ to $x^{(1)}$, respectively to $x^{(2)}$, and $G'(\cdot)$ to $G'^{(1)}(\cdot)$, respectively to $G'^{(2)}(\cdot)$.
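Because $G'(\theta[l]) = $ weighted data mean has a closed-form inverse for the running example (the logit for the Bernoulli block and the identity for the unit-variance Gaussian block), the whole loop can be sketched compactly. The initialization from data points, the `eps` clipping and the fixed iteration count are illustrative implementation choices, not part of [16]:

```python
import numpy as np

def bregman_soft_clustering(X1, X2, m, iters=50, seed=0, init_idx=None):
    """EM for a mixed Bernoulli/Gaussian mixture, parameterized by means.

    The M-step solves G'(theta[l]) = weighted data mean in closed form:
    the inverse link is the logit for the Bernoulli block and the identity
    for the unit-variance Gaussian block.
    """
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    if init_idx is None:
        init_idx = rng.choice(n, size=m, replace=False)
    eps = 1e-6
    M1 = np.clip(X1[np.asarray(init_idx)], eps, 1 - eps)   # Bernoulli means
    M2 = X2[np.asarray(init_idx)]                          # Gaussian means
    pi = np.full(m, 1.0 / m)
    for _ in range(iters):
        T1 = np.log(M1 / (1 - M1))                         # inverse link: logit
        T2 = M2                                            # inverse link: identity
        # E-step: responsibilities from the factorized likelihood (log-space)
        ll = (X1 @ T1.T - np.logaddexp(0.0, T1).sum(axis=1)
              + X2 @ T2.T - 0.5 * (T2 ** 2).sum(axis=1))
        lw = ll + np.log(pi)
        lw -= lw.max(axis=1, keepdims=True)
        Z = np.exp(lw)
        Z /= Z.sum(axis=1, keepdims=True)
        # M-step: point masses and mean parameters (cluster centers)
        pi = Z.mean(axis=0)
        W = Z.sum(axis=0)[:, None]
        M1 = np.clip((Z.T @ X1) / W, eps, 1 - eps)
        M2 = (Z.T @ X2) / W
    return pi, M1, M2
```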
Table 3 summarizes the Bregman soft clustering algorithm.
5 CONCLUSION
This paper considers the problem of learning the underlying statistical structure of data of mixed types for fitting generative models. A unified generative model, the Generalized Linear Statistics (GLS) model, was established using exponential family properties. Specifically, this work considered mixed data-type records which have both continuous (e.g., Exponential and Gaussian) and discrete (e.g., count and binary) components. The GLS approach allows the data components to have different parametric forms by drawing on the large range of exponential family distributions. The specific GLS framework developed here is equivalent to a computationally tractable exponential-families mixed data-type hierarchical Bayes graphical model with latent variables constrained to a low-dimensional parameter subspace. The exponential family Principal Component Analysis (exponential PCA) technique of [10], the Semi-Parametric exponential family Principal Component Analysis (SPPCA) technique of [15] and the Bregman soft clustering method presented in [14] were demonstrated not to be separate unrelated algorithms, but rather different manifestations of model assumptions and parameter choices within the GLS framework. Because of this insight, the three algorithms can readily be extended to novel versions that deal with the important mixed data-type case. As an example, the convex optimization problem related to fitting the extreme GLS case corresponding to exponential family Principal Component Analysis to a set of data was described in detail.
Learning the GLS model provides a generative model of the data, making it possible both to generate synthetic data and to perform effective detection or prediction on data of mixed types in parameter space. This data-driven decision-making aspect of GLS is presented in a forthcoming Part II paper [50].
REFERENCES
[1] D. J. Bartholomew and M. Knott, Latent Variable Models and Factor Analysis. Kendall's Library of Statistics, vol. 7, Oxford University Press, New York, 2nd edition, 1999.
[2] A. Skrondal and S. Rabe-Hesketh, Generalized Latent Variable Modeling: Multilevel, Longitudinal, and Structural Equation Models. Interdisciplinary Statistics, Chapman and Hall/CRC, 2004.
[3] P. McCullagh and J. A. Nelder, Generalized Linear Models. Monographs on Statistics and Applied Probability 37, Chapman and Hall/CRC, London, 2nd edition, 1989.
[4] C. E. McCulloch and S. R. Searle, Generalized, Linear and Mixed Models. Wiley Series in Probability and Statistics, Wiley-Interscience, New York, 2001.
[5] L. Fahrmeir and G. Tutz, Multivariate Statistical Modelling Based on Generalized Linear Models. Springer Series in Statistics, Springer-Verlag, New York, 2nd edition, 2001.
[6] J. Gill, Generalized Linear Models: A Unified Approach. Quantitative Applications in the Social Sciences, Sage Publications, Thousand Oaks, California, 2001.
[7] C. R. Rao and H. Toutenburg, Linear Models: Least Squares and Alternatives. Springer Series in Statistics, Springer-Verlag, New York, 2nd edition, 1999.
[8] E. W. Frees, Longitudinal and Panel Data: Analysis and Applications in the Social Sciences. Cambridge University Press, New York, 2nd edition, 2004.
[9] H. S. Lynn and C. E. McCulloch, "Using principal component analysis and correspondence analysis for estimation in latent variable models," Journal of the American Statistical Association, vol. 95, no. 450, pp. 561-572, 2000.
[10] M. Collins, S. Dasgupta, and R. E. Schapire, "A generalization of principal components analysis to the exponential family," in Advances in Neural Information Processing Systems, 2001, vol. 14.
[11] D. B. Dunson and S. D. Perreault, "Factor analytic models of clustered multivariate data with informative censoring," Biometrics, vol. 57, pp. 302-308, 2001.
[12] D. B. Dunson, Z. Chen, and J. Harry, "A Bayesian approach for joint modeling of cluster size and subunit-specific outcomes," Biometrics, vol. 59, pp. 521-530, 2003.
[13] A. Skrondal and S. Rabe-Hesketh, "Some applications of generalized linear latent and mixed models in epidemiology," Norsk Epidemiologi, vol. 13, no. 2, pp. 265-278, 2003.
[14] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," SIAM International Conference on Data Mining (SDM), 2004.
[15] Sajama and A. Orlitsky, "Semi-parametric exponential family PCA," Advances in Neural Information Processing Systems, vol. 17, 2004.
[16] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh, "Clustering with Bregman divergences," Journal of Machine Learning Research, vol. 6, pp. 1705-1749, 2005.
[17] C. Levasseur, K. Kreutz-Delgado, U. Mayer, and G. Gancarz, "Data-pattern discovery methods for detection in non-Gaussian high-dimensional data sets," Conference Record of the 39th Asilomar Conference on Signals, Systems and Computers, pp. 545-549, 2005.
[18] C. Levasseur, U. F. Mayer, B. Burdge, and K. Kreutz-Delgado, "Generalized statistical methods for unsupervised minority class detection in mixed data sets," Proceedings of the First IAPR Workshop on Cognitive Information Processing, pp. 126-131, 2008.
[19] C. Levasseur, B. Burdge, K. Kreutz-Delgado, and U. F. Mayer, "A unifying viewpoint of some clustering techniques using Bregman divergences and extensions to mixed data sets," Proceedings of the First IEEE Int'l Workshop on Data Mining and Artificial Intelligence (DMAI), pp. 56-63, 2008.
[20] C. E. McCulloch, Generalized Linear Mixed Models. Institute of Mathematical Statistics, Hayward, California, 2003.
[21] N. Roy, G. Gordon, and S. Thrun, "Finding approximate POMDP solutions through belief compression," Journal of Artificial Intelligence Research, vol. 23, pp. 1-40, 2005.
[22] D. N. Lawley and A. E. Maxwell, Factor Analysis as a Statistical Method. American Elsevier Pub. Co., New York, 1971.
[23] M. Aitkin, "A general maximum likelihood analysis of variance components in generalized linear models," Biometrics, vol. 55, pp. 117-128, 1999.
[24] G. McLachlan and D. Peel, Finite Mixture Models. Wiley Series in Probability and Statistics, Wiley-Interscience, New York, 2000.
[25] I. T. Jolliffe, Principal Component Analysis. Springer Series in Statistics, Springer-Verlag, New York, 2nd edition, 2002.
[26] O. E. Barndorff-Nielsen, Information and Exponential Families in Statistical Theory. Wiley Series in Probability and Mathematical Statistics, John Wiley and Sons, 1978.
[27] L. D. Brown, Fundamentals of Statistical Exponential Families: with Applications in Statistical Decision Theory. Institute of Mathematical Statistics, 1986.
[28] K. S. Azoury and M. K. Warmuth, "Relative loss bounds for on-line density estimation with the exponential family of distributions," Machine Learning, vol. 43, pp. 211-246, 2001.
[29] Y. Lee and J. A. Nelder, "Hierarchical generalized linear models," Journal of the Royal Statistical Society, vol. 58, no. 4, pp. 619-678, 1996.
[30] P. J. Green, "Hierarchical generalized linear models," Journal of the Royal Statistical Society, B, vol. 58, no. 4, pp. 619-678, 1996.
[31] M. Aitkin and R. Rocci, "A general maximum likelihood analysis of measurement error in generalized linear models," Statistics and Computing, vol. 12, pp. 163-174, 2002.
[32] M. I. Jordan and T. J. Sejnowski, Graphical Models: Foundations of Neural Computation. Computational Neuroscience, The MIT Press, 1st edition, 2001.
[33] N. Laird, "Nonparametric maximum likelihood estimation of a mixing distribution," Journal of the American Statistical Association, vol. 73, 1978.
[34] B. G. Lindsay, Mixture Models: Theory, Geometry, and Applications. Institute of Mathematical Statistics, New York, 1995.
[35] J. Kiefer and J. Wolfowitz, "Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters," The Annals of Mathematical Statistics, vol. 27, pp. 887-906, 1956.
[36] B. G. Lindsay, "The geometry of mixture likelihoods: a general theory," Annals of Statistics, vol. 11, no. 1, pp. 86-94, 1983.
[37] B. G. Lindsay, "The geometry of mixture likelihoods, part II: the exponential family," Annals of Statistics, vol. 11, no. 3, pp. 783-792, 1983.
[38] A. Mallet, "A maximum likelihood estimation method for random coefficient regression models," Biometrika, vol. 73, no. 3, pp. 645-656, 1986.
[39] A. Schumitzky, "Nonparametric EM algorithms for estimating prior distributions," Applied Mathematics and Computation, vol. 45, no. 2, pp. 143-157, 1991.
[40] B. G. Lindsay and M. L. Lesperance, "A review of semiparametric mixture models," Journal of Statistical Planning and Inference, vol. 47, pp. 29-99, 1995.
[41] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood estimation from incomplete data via the EM algorithm," Journal of the Royal Statistical Society, B, vol. 39, pp. 1-38, 1977.
[42] R. A. Redner and H. F. Walker, "Mixture densities, maximum likelihood and the EM algorithm," SIAM Review, vol. 26, no. 2, pp. 195-239, 1984.
[43] D. Boehning, Computer-Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease Mapping, and Others. Monographs on Statistics and Applied Probability 81, Chapman and Hall/CRC, New York, 2000.
[44] R. S. Pilla and B. Lindsay, "Alternative EM methods for nonparametric finite mixture models," Biometrika, vol. 88, no. 2, pp. 535-550, 2001.
[45] S. Rabe-Hesketh, A. Pickles, and C. Taylor, "GLLAMM: A general class of multilevel models and a Stata program," Multilevel Modelling Newsletter, vol. 13, pp. 17-23, 2001.
[46] E. L. Lehmann and G. Casella, Theory of Point Estimation. Springer Texts in Statistics, Springer, 2nd edition, 1998.
[47] D. P. Bertsekas, A. Nedić, and A. E. Ozdaglar, Convex Analysis and Optimization. Athena Scientific, Belmont, Massachusetts, 2003.
[48] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
[49] H. Hindi, "A tutorial on convex optimization II: duality and interior point methods," Proceedings of the 2006 American Control Conference, pp. 686-696, 2006.
[50] C. Levasseur, U. F. Mayer, and K. Kreutz-Delgado, "Generalized statistical methods for mixed exponential families, part II: applications," submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), available at http://dsp.ucsd.edu/cecile/, 2009.