Sie sind auf Seite 1von 22

Dimensionality Reduction: A Comparative Review

L.J.P. van der Maaten , E.O. Postma, H.J. van den Herik
MICC, Maastricht University, P.O. Box 616, 6200 MD Maastricht, The Netherlands.

Abstract

In recent years, a variety of nonlinear dimensionality reduction techniques have been proposed that aim to address the limitations
of traditional techniques such as PCA. The paper presents a review and systematic comparison of these techniques. The
performances of the nonlinear techniques are investigated on artificial and natural tasks. The results of the experiments reveal
that nonlinear techniques perform well on selected artificial tasks, but do not outperform the traditional PCA on real-world tasks.
The paper explains these results by identifying weaknesses of current nonlinear techniques, and suggests how the performance
of nonlinear dimensionality reduction techniques may be improved.

Key words: Dimensionality reduction, manifold learning, feature extraction.

1. Introduction ysis (see, e.g., [18]). However, these linear techniques


cannot adequately handle complex nonlinear data.
Real-world data, such as speech signals, digital pho- Therefore, in the last decade, a large number of nonlin-
tographs, or fMRI scans, usually has a high dimension- ear techniques for dimensionality reduction have been
ality. In order to handle real-world data adequately, its proposed (see for an overview, e.g., [23,95,113]). In
dimensionality needs to be reduced. Dimensionality re- contrast to the traditional linear techniques, the nonlin-
duction is the transformation of high-dimensional data ear techniques have the ability to deal with complex
into a meaningful representation of reduced dimension- nonlinear data. In particular for real-world data, the
ality. Ideally, the reduced representation should have a nonlinear dimensionality reduction techniques may of-
dimensionality that corresponds to the intrinsic dimen- fer an advantage, because real-world data is likely to be
sionality of the data. The intrinsic dimensionality of highly nonlinear. Previous studies have shown that non-
data is the minimum number of parameters needed to linear techniques outperform their linear counterparts
account for the observed properties of the data [43]. Di- on complex artificial tasks. For instance, the Swiss roll
mensionality reduction is important in many domains, dataset comprises a set of points that lie on a spiral-like
since it mitigates the curse of dimensionality and other two-dimensional manifold within a three-dimensional
undesired properties of high-dimensional spaces [60]. space. A vast number of nonlinear techniques are per-
As a result, dimensionality reduction facilitates, among fectly able to find this embedding, whereas linear tech-
others, classification, visualization, and compression of niques fail to do so. In contrast to these successes on
high-dimensional data. Traditionally, dimensionality re- artificial datasets, successful applications of nonlinear
duction was performed using linear techniques such as dimensionality reduction techniques on natural datasets
Principal Components Analysis (PCA) and factor anal- are scarce. Beyond this observation, it is not clear to
what extent the performances of the various dimen-
Corresponding author. sionality reduction techniques differ on artificial and
Email address: l.vandermaaten@micc.unimaas.nl natural tasks (a comparison is performed in [81], but
(L.J.P. van der Maaten). this comparison is very limited in scope with respect to

Preprint submitted to Elsevier 22 February 2008


the number of techniques and tasks that are addressed). 2. Dimensionality reduction
Motivated by the lack of a systematic comparison
of dimensionality reduction techniques, this paper The problem of (nonlinear) dimensionality reduction
presents a comparative study of the most important can be defined as follows. Assume we have a dataset
linear dimensionality reduction technique (PCA), and represented in a nD matrix X consisting of n datavec-
twelve frontranked nonlinear dimensionality reduction tors xi (i {1, 2, . . . , n}) with dimensionality D. As-
techniques. The aims of the paper are (1) to inves- sume further that this dataset has intrinsic dimensional-
tigate to what extent novel nonlinear dimensionality ity d (where d < D, and often d  D). Here, in math-
reduction techniques outperform the traditional PCA ematical terms, intrinsic dimensionality means that the
on real-world datasets and (2) to identify the inherent points in dataset X are lying on or near a manifold with
weaknesses of the twelve nonlinear dimenisonality re- dimensionality d that is embedded in the D-dimensional
duction techniques. The investigation is performed by space. Dimensionality reduction techniques transform
both a theoretical and an empirical evaluation of the dataset X with dimensionality D into a new dataset Y
dimensionality reduction techniques. The identification with dimensionality d, while retaining the geometry of
is performed by a careful analysis of the empirical re- the data as much as possible. In general, neither the
sults on specifically designed artificial datasets and on geometry of the data manifold, nor the intrinsic dimen-
the real-world datasets. sionality d of the dataset X are known. Therefore, di-
Next to PCA, the paper investigates the following mensionality reduction is an ill-posed problem that can
twelve nonlinear techniques: (1) multidimensional scal- only be solved by assuming certain properties of the
ing, (2) Isomap, (3) Maximum Variance Unfolding, data (such as its intrinsic dimensionality). Throughout
(4) Kernel PCA, (5) diffusion maps, (6) multilayer au- the paper, we denote a high-dimensional datapoint by
toencoders, (7) Locally Linear Embedding, (8) Lapla- xi , where xi is the ith row of the D-dimensional data
cian Eigenmaps, (9) Hessian LLE, (10) Local Tangent matrix X. The low-dimensional counterpart of xi is de-
Space Analysis, (11) Locally Linear Coordination, and noted by yi , where yi is the ith row of the d-dimensional
(12) manifold charting. Although our comparative re- data matrix Y . In the remainder of the paper, we adopt
view includes the most important nonlinear techniques the notation presented above.
for dimensionality reduction, it is not exhaustive. In the Figure 1 shows a taxonomy of techniques for dimen-
appendix, we list other important (nonlinear) dimen- sionality reduction. The main distinction between tech-
sionality reduction techniques that are not included in niques for dimensionality reduction is the distinction
our comparative review. There, we briefly explain why between linear and nonlinear techniques. Linear tech-
these techniques are not included. niques assume that the data lie on or near a linear sub-
The outline of the remainder of this paper is as follows. space of the high-dimensional space. Nonlinear tech-
In Section 2, we give a formal definition of dimension- niques for dimensionality reduction do not rely on the
ality reduction. Section 3 briefly discusses the most linearity assumption as a result of which more complex
important linear technique for dimensionality reduc- embeddings of the data in the high-dimensional space
tion (PCA). Subsequently, Section 4 describes and can be identified. The further subdivisions in the taxo-
discusses the selected twelve nonlinear techniques for nomy are discussed in the review in the following two
dimensionality reduction. Section 5 lists all techniques sections.
by theoretical characteristics. Then, in Section 6, we
present an empirical comparison of twelve techniques
for dimensionality reduction on five artificial datasets 3. Linear techniques for dimensionality reduction
and five natural datasets. Section 7 discusses the results
of the experiments; moreover, it identifies weaknesses Linear techniques perform dimensionality reduction
and points of improvement of the selected nonlinear by embedding the data into a linear subspace of lower
techniques. Section 8 provides our conclusions. Our dimensionality. Although there exist various techniques
main conclusion is that the focus of the research com- to do so, PCA is by far the most popular (unsupervised)
munity should shift towards nonlocal techniques for linear technique. Therefore, in our comparison, we only
dimensionality reduction with objective functions that include PCA as a benchmark. We briefly discuss PCA
can be optimized well in practice (such as PCA, Kernel below.
PCA, and autoencoders). Principal Components Analysis (PCA) [56] constructs
a low-dimensional representation of the data that de-
scribes as much of the variance in the data as possible.

2
Euclidean distance matrix (X X)(X X)T instead
of the eigenvectors of the covariance matrix 2 . Alterna-
tively, iterative techniques such as Simple PCA [83] or
probabilistic PCA [90] may be employed.

4. Nonlinear techniques for dimensionality


reduction

In Section 3, we discussed the main linear technique


for dimensionality reduction, which is established and
well studied. In contrast, most nonlinear techniques for
dimensionality reduction have been proposed more re-
cently and are therefore less well studied. In this sec-
tion, we discuss twelve nonlinear techniques for dimen-
sionality reduction, as well as their weaknesses and ap-
plications as reported in the literature. Nonlinear tech-
niques for dimensionality reduction can be subdivided
into three main types 3 : (1) techniques that attempt to
preserve global properties of the original data in the low-
dimensional representation, (2) techniques that attempt
Fig. 1. Taxonomy of dimensionality reduction techniques.
to preserve local properties of the original data in the
This is done by finding a linear basis of reduced dimen- low-dimensional representation, and (3) techniques that
sionality for the data, in which the amount of variance perform global alignment of a mixture of linear mod-
in the data is maximal. els. In subsection 4.1, we discuss six global nonlinear
In mathematical terms, PCA attempts to find a lin- techniques for dimensionality reduction. Subsection 4.2
ear mapping M that maximizes M T cov(X)M , where presents four local nonlinear techniques for dimension-
cov(X) is the covariance matrix of the data X. It can ality reduction. Subsection 4.3 presents two techniques
be shown that this linear mapping is formed by the d that perform a global alignment of linear models.
principal eigenvectors (i.e., principal components) of
the covariance matrix of the zero-mean data 1 . Hence, 4.1. Global techniques
PCA solves the eigenproblem
cov(X)M = M. (1) Global nonlinear techniques for dimensionality re-
duction are techniques that attempt to preserve global
The eigenproblem is solved for the d principal eigenval- properties of the data. The subsection presents six global
ues . The low-dimensional data representations yi of nonlinear techniques for dimensionality reduction: (1)
the datapoints xi are computed by mapping them onto MDS, (2) Isomap, (3) MVU, (4) diffusion maps, (5)
the linear basis M , i.e., Y = (X X)M . Kernel PCA, and (6) multilayer autoencoders. The tech-
PCA has been successfully applied in a large number niques are discussed in 4.1.1 to 4.1.6.
of domains such as face recognition [110], coin clas-
sification [57], and seismic series analysis [87]. The
main drawback of PCA is that the size of the covari- 4.1.1. MDS
ance matrix is proportional to the dimensionality of the Multidimensional scaling (MDS) [29,66] represents
datapoints. As a result, the computation of the eigen- a collection of nonlinear techniques that maps the high-
vectors might be infeasible for very high-dimensional dimensional data representation to a low-dimensional
data. In datasets in which n < D, this drawback may be representation while retaining the pairwise distances be-
overcome by computing the eigenvectors of the squared tween the datapoints as much as possible. The quality
of the mapping is expressed in the stress function, a
1 PCA maximizes M T cov(X)M with respect to M , under the
2 It can be shown that the eigenvectors ui and vi of the matrices
constraint that |M | = 1. This constraint can be enforced by intro-
ducing a Lagrange multiplier . Hence, an unconstrained maximiza- X T X and XX T are related through vi = 1 Xui , see, e.g., [26].
i
tion of M T cov(X)M + (1 M T M ) is performed. A stationary 3 Although diffusion maps and Kernel PCA are global methods,

point of this quantity is to be found when cov(X)M = M . they may behave as local methods depending on the kernel selection.

3
measure of the error between the pairwise distances in path between two points in the graph forms a good
the low-dimensional and high-dimensional representa- (over)estimate of the geodesic distance between these
tion of the data. Two important examples of stress func- two points, and can easily be computed using Dijk-
tions (for metric MDS) are the raw stress function and stras or Floyds shortest-path algorithm [35,41]. The
the Sammon cost function. The raw stress function is geodesic distances between all datapoints in X are
defined by computed, thereby forming a pairwise geodesic dis-
X tance matrix. The low-dimensional representations yi
(Y ) = (kxi xj k kyi yj k)2 , (2) of the datapoints xi in the low-dimensional space Y
ij
are computed by applying multidimensional scaling
in which kxi xj k is the Euclidean distance between the (see 4.1.1) on the resulting distance matrix.
high-dimensional datapoints xi and xj , and kyi yj k An important weakness of the Isomap algorithm is
is the Euclidean distance between the low-dimensional its topological instability [7]. Isomap may construct
datapoints yi and yj . The Sammon cost function is given erroneous connections in the neighborhood graph G.
by Such short-circuiting [71] can severely impair the per-
1 X (kxi xj k kyi yj k)2 formance of Isomap. Several approaches have been
(Y ) = P . proposed to overcome the problem of short-circuiting,
ij kxi xj ki6=j
kxi xj k
e.g., by removing datapoints with large total flows in
(3) the shortest-path algorithm [27] or by removing nearest
The Sammon cost function differs from the raw stress neighbors that violate local linearity of the neighbor-
function in that it puts more emphasis on retaining dis- hood graph [96]. A second weakness is that Isomap
tances that were originally small. The minimization of may suffer from holes in the manifold. This problem
the stress function can be performed using various meth- can be dealt with by tearing manifolds with holes [71].
ods, such as the eigendecomposition of a pairwise dis- A third weakness of Isomap is that it can fail if the
similarity matrix, the conjugate gradient method, or a manifold is nonconvex [106]. Despite these three weak-
pseudo-Newton method [29]. nesses, Isomap was successfully applied on tasks such
MDS is widely used for the visualization of data, e.g., in as wood inspection [81], visualization of biomedical
fMRI analysis [103] and in molecular modeling [112]. data [74], and head pose estimation [89].
The popularity of MDS has led to the proposal of vari-
ants such as SPE [3], CCA [32], SNE [53,79], and
FastMap [39]. In addition, there exist nonmetric vari- 4.1.3. MVU
ants of MDS, that aim to preserve ordinal relations in Because of its resemblance to Isomap, we discuss
data, instead of pairwise distances [29]. MVU before diffusion maps (see 4.1.4). Maximum
Variance Unfolding (MVU) is very similar to Isomap
4.1.2. Isomap in that it defines a neighborhood graph on the data and
Multidimensional scaling has proven to be successful retains pairwise distances in the resulting graph [117].
in many applications, but it suffers from the fact that it MVU is different from Isomap in that explicitly at-
is based on Euclidean distances, and does not take into tempts to unfold the data manifold. It does so by
account the distribution of the neighboring datapoints. maximizing the Euclidean distances between the da-
If the high-dimensional data lies on or near a curved tapoints, under the constraint that the distances in the
manifold, such as in the Swiss roll dataset [105], MDS neighborhood graph are left unchanged (i.e., under the
might consider two datapoints as near points, whereas constraint that the local geometry of the data manifold
their distance over the manifold is much larger than the is not distorted). The resulting optimization problem
typical interpoint distance. Isomap [105] is a technique can be solved efficiently using semidefinite program-
that resolves this problem by attempting to preserve ming.
pairwise geodesic (or curvilinear) distances between MVU starts with the construction of a neighborhood
datapoints. Geodesic distance is the distance between graph G, in which each datapoint xi is connected to
two points measured over the manifold. its k nearest neighbors xij (j = 1, 2, . . . , k). Subse-
In Isomap [105], the geodesic distances between the quently, MVU attempts to maximize the sum of the
datapoints xi (i = 1, 2, . . . , n) are computed by con- squared Euclidean distances between all datapoints,
structing a neighborhood graph G, in which every under the constraint that the distances inside the neigh-
datapoint xi is connected with its k nearest neighbors borhood graph G are preserved. In other words, MVU
xij (j = 1, 2, . . . , k) in the dataset X. The shortest performs the following optimization problem.

4
X
Maximize k yi yj k2 subject to: matrix of a dynamical process. Hence, the matrix P (1)
ij represents the probability of a transition from one data-
(1) k yi yj k2 =k xi xj k2 for (i, j) G point to another datapoint in a single timestep. The for-
ward probability matrix for t timesteps P (t) is given by
MVU reformulates the optimization problem as a (P (1) )t . Using the random walk forward probabilities
semidefinite programming problem (SDP) [111] by (t)
pij , the diffusion distance is defined by
defining a matrix K that is the inner product of the low- v
dimensional data representation Y . The optimization uX (p(t)
u (t) 2
problem is identical to the following SDP. (t) ik pjk )
D (xi , xj ) = t . (6)
k
(xk )(0)
Maximize trace(K) subject to (1), (2), and (3), with:
(1) kii 2kij + kjj =k xi xj k2 for (i, j) G In the equation, (xi )(0) is a term that attributes more
X weight to parts of the graph with high density. It is de-
(2) kij = 0
fined by (xi )(0) = Pmim , where mi is the degree of
ij j
j
P
(3) K 0 node xi defined by mi = j pij . From Equation 6, it
can be observed that pairs of datapoints with a high for-
From the solution K of the SDP, the low-dimensional ward transition probability have a small diffusion dis-
data representation Y can be obtained by performing a tance. The key idea behind the diffusion distance is that
singular value decomposition. it is based on many paths through the graph. This makes
MVU has a weakness similar to Isomap: short-circuiting the diffusion distance more robust to noise than, e.g.,
may impair the performance of MVU, because it adds the geodesic distance that is employed in Isomap. In the
constraints to the optimization problem that prevent suc- low-dimensional representation of the data Y , diffusion
cessful unfolding of the manifold. Despite this weak- maps attempt to retain the diffusion distances. Using
ness, MVU was successfully applied for, e.g., sensor lo- spectral theory on the random walk, it has been shown
calization [118] and DNA microarray data analysis [62]. (see, e.g., [68]) that the low-dimensional representation
Y that retains the diffusion distances is formed by the
4.1.4. Diffusion maps d nontrivial principal eigenvectors of the eigenproblem
The diffusion maps (DM) framework [68,78] orig-
P (t) v = v. (7)
inates from the field of dynamical systems. Diffusion
maps are based on defining a Markov random walk on Because the graph is fully connected, the largest eigen-
the graph of the data. By performing the random walk value is trivial (viz. 1 = 1), and its eigenvector v1 is
for a number of timesteps, a measure for the proxim- thus discarded. The low-dimensional representation Y
ity of the datapoints is obtained. Using this measure, is given by the next d principal eigenvectors. In the low-
the so-called diffusion distance is defined. In the low- dimensional representation, the eigenvectors are nor-
dimensional representation of the data, the pairwise dif- malized by their corresponding eigenvalues. Hence, the
fusion distances are retained as good as possible. low-dimensional data representation is given by
In the diffusion maps framework, a graph of the data is
Y = {2 v2 , 3 v3 , . . . , d+1 vd+1 }. (8)
constructed first. The weights of the edges in the graph
are computed using the Gaussian kernel function, lead- Diffusion maps have been successfully applied to, e.g.,
ing to a matrix W with entries shape matching [88] and gene expression analysis [121].
kxi xj k2
wij = e 2 2 , (4) 4.1.5. Kernel PCA
2 Kernel PCA (KPCA) is the reformulation of tradi-
where indicates the variance of the Gaussian. Sub-
tional linear PCA in a high-dimensional space that is
sequently, normalization of the matrix W is performed
constructed using a kernel function [97]. In recent years,
in such a way that its rows add up to 1. In this way, a
the reformulation of linear techniques using the kernel
matrix P (1) is formed with entries
trick has led to the proposal of successful techniques
(1) wij
pij = P . (5) such as kernel ridge regression and Support Vector Ma-
k wik chines [99]. Kernel PCA computes the principal eigen-
Since diffusion maps originate from dynamical systems vectors of the kernel matrix, rather than those of the
theory, the resulting matrix P (1) is considered a Markov covariance matrix. The reformulation of PCA in ker-
matrix that defines the forward transition probability nel space is straightforward, since a kernel matrix is

5
similar to the inproduct of the datapoints in the high- 4.1.6. Multilayer autoencoders
dimensional space that is constructed using the kernel Multilayer encoders are feed-forward neural net-
function. The application of PCA in the kernel space works with an odd number of hidden layers [33,54].
provides Kernel PCA the property of constructing non- The middle hidden layer has d nodes, and the input
linear mappings. and the output layer have D nodes. An example of an
Kernel PCA computes the kernel matrix K of the data- autoencoder is shown schematically in Figure 2. The
points xi . The entries in the kernel matrix are defined network is trained to minimize the mean squared error
by between the input and the output of the network (ide-
kij = (xi , xj ), (9) ally, the input and the output are equal). Training the
neural network on the datapoints xi leads to a network
where is a kernel function [99]. Subsequently, the ker-
in which the middle hidden layer gives a d-dimensional
nel matrix K is centered using the following modifica-
representation of the datapoints that preserves as much
tion of the entries
information in X as possible. The low-dimensional re-
1X 1X 1 X presentations yi can be obtained by extracting the node
kij = kij kil kjl + 2 klm . (10)
n n n values in the middle hidden layer, when datapoint xi is
l l lm
used as input. If linear activation functions are used in
The centering operation corresponds to subtracting the
the neural network, an autoencoder is very similar to
mean of the features in traditional PCA. It makes sure
PCA [67]. In order to allow the autoencoder to learn a
that the features in the high-dimensional space defined
nonlinear mapping between the high-dimensional and
by the kernel function are zero-mean. Subsequently, the
low-dimensional data representation, sigmoid activa-
principal d eigenvectors vi of the centered kernel ma-
tion functions are generally used.
trix are computed. The eigenvectors of the covariance
Multilayer autoencoders usually have a high number of
matrix i (in the high-dimensional space constructed
connections. Therefore, backpropagation approaches
by ) can now be computed, since they are related to
converge slowly and are likely to get stuck in local
the eigenvectors of the kernel matrix vi (see, e.g., [26])
minima. In [54], this drawback is overcome by a learn-
through
1 ing procedure that consists of three main stages.
i = Xvi . (11) First, the recognition layers of the network (i.e., the
i
layers from X to Y ) are trained one-by-one using Re-
In order to obtain the low-dimensional data representa- stricted Boltzmann Machines (RBMs). RBMs are two-
tion, the data is projected onto the eigenvectors of the layer neural networks with visual and hidden nodes that
covariance matrix i . The result of the projection (i.e., are binary and stochastic 4 . RBMs can be trained effi-
the low-dimensional data representation Y ) is given by ciently using an unsupervised learning procedure that
minimizes the so-called contrastive divergence [52].

X n n
Second, the reconstruction layers of the network (i.e.,
j
X j
yi = 1 (xj , xi ), . . . , d (xj , xi ) , (12)

j=1 j=1
the layers from Y to X 0 ) are formed by the inverse of
the trained recognition layers. In other words, the au-
where 1j indicates the jth value in the vector 1 and toencoder is unrolled. Third, the unrolled autoencoder
is the kernel function that was also used in the com- is finetuned in a supervised manner using backpropa-
putation of the kernel matrix. Since Kernel PCA is a gation.
kernel-based method, the mapping performed by Kernel Autoencoders have succesfully been applied to prob-
PCA relies on the choice of the kernel function . Pos- lems such as missing data imputation [1] and HIV
sible choices for the kernel function include the linear analysis [16].
kernel (making Kernel PCA equal to traditional PCA),
the polynomial kernel, and the Gaussian kernel that is
4.2. Local techniques
given in Equation 4 [99].
An important weakness of Kernel PCA is that the size
of the kernel matrix is proportional to the square of Subsection 4.1 presented six techniques for dimen-
the number of instances in the dataset. An approach sionality reduction that attempt to retain global proper-
to resolve this weakness is proposed in [108]. Kernel ties of the data. In contrast, local nonlinear techniques
PCA has been successfully applied to, e.g., face recog-
nition [63], speech recognition [75], and novelty detec- 4 For continuous data, the binary nodes may be replaced by mean-
tion [55]. field logistic or exponential family nodes.

6
xi are invariant to translation, rotation, and rescaling.
Because of the invariance to these transformations, any
linear mapping of the hyperplane to a space of lower
dimensionality preserves the reconstruction weights
in the space of lower dimensionality. In other words,
if the low-dimensional data representation preserves
the local geometry of the manifold, the reconstruc-
tion weights Wi that reconstruct datapoint xi from its
neighbors in the high-dimensional data representation
also reconstruct datapoint yi from its neighbors in
the low-dimensional data representation. As a conse-
quence, finding the d-dimensional data representation
Y amounts to minimizing the cost function
X k
X
(Y ) = (yi wij yij )2 . (13)
i j=1
Fig. 2. Schematic structure of an autoencoder.
Roweis and Saul [92] showed 5 that the coordinates of
the low-dimensional representations yi that minimize
for dimensionality reduction are based on solely pre-
this cost function are found by computing the eigenvec-
serving properties of small neighborhoods around the
tors corresponding to the smallest d nonzero eigenval-
datapoints. The central claim of these techniques is that
ues of the inproduct (I W )T (I W ). In this formula,
by preservation of local properties of the data, the global
I is the n n identity matrix.
layout of the data manifold is retained as well. This sub-
The popularity of LLE has led to the proposal of linear
section presents four local nonlinear techniques for di-
variants of the algorithm [49,65], and to successful ap-
mensionality reduction: (1) LLE, (2) Laplacian Eigen-
plications to, e.g., superresolution [24] and sound source
maps, (3) Hessian LLE, and (4) LTSA in 4.2.1 to 4.2.4.
localization [37]. However, there also exist experimen-
tal studies that report weak performance of LLE. In [74],
4.2.1. LLE LLE was reported to fail in the visualization of even sim-
Local Linear Embedding (LLE) [92] is a local tech- ple synthetic biomedical datasets. In [59], it is claimed
nique for dimensionality reduction that is similar to that LLE performs worse than Isomap in the derivation
Isomap in that it constructs a graph representation of of perceptual-motor actions. A possible explanation lies
the datapoints. In contrast to Isomap, it attempts to in the difficulties that LLE has when confronted with
preserve solely local properties of the data. As a result, manifolds that contains holes [92]. In addition, LLE
LLE is less sensitive to short-circuiting than Isomap, tends to collapse large portions of the data very close
because only a small number of properties are affected together in the low-dimensional space.
if short-circuiting occurs. Furthermore, the preservation
of local properties allows for successful embedding 4.2.2. Laplacian Eigenmaps
of nonconvex manifolds. In LLE, the local proper- Similar to LLE, Laplacian Eigenmaps find a low-
ties of the data manifold are constructed by writing dimensional data representation by preserving local
the datapoints as a linear combination of their nearest properties of the manifold [9]. In Laplacian Eigenmaps,
neighbors. In the low-dimensional representation of the the local properties are based on the pairwise distances
data, LLE attempts to retain the reconstruction weights between near neighbors. Laplacian Eigenmaps compute
in the linear combinations as good as possible. a low-dimensional representation of the data in which
LLE describes the local properties of the manifold the distances between a datapoint and its k nearest
around a datapoint xi by writing the datapoint as a neighbors are minimized. This is done in a weighted
linear combination Wi (the so-called reconstruction manner, i.e., the distance in the low-dimensional data
weights) of its k nearest neighbors xij . Hence, LLE
fits a hyperplane through the datapoint xi and its near- 5 (Y ) = (Y W Y )2 = Y T (I W )T (I W )Y is the function
est neighbors, thereby assuming that the manifold is that has to be minimized. Hence, the eigenvectors of (I W )T (I
locally linear. The local linearity assumption implies W ) corresponding to the smallest nonzero eigenvalues form the
that the reconstruction weights Wi of the datapoints solution that minimizes (Y ).

7
representation between a datapoint and its first nearest 4.2.3. Hessian LLE
neighbor contributes more to the cost function than the Hessian LLE (HLLE) [36] is a variant of LLE that
distance between the datapoint and its second nearest minimizes the curviness of the high-dimensional man-
neighbor. Using spectral graph theory, the minimiza- ifold when embedding it into a low-dimensional space,
tion of the cost function is defined as an eigenproblem. under the constraint that the low-dimensional data re-
The Laplacian Eigenmap algorithm first constructs a presentation is locally isometric. This is done by an
neighborhood graph G in which every datapoint xi is eigenanalysis of a matrix H that describes the curviness
connected to its k nearest neighbors. For all points xi of the manifold around the datapoints. The curviness of
and xj in graph G that are connected by an edge, the the manifold is measured by means of the local Hessian
weight of the edge is computed using the Gaussian at every datapoint. The local Hessian is represented in
kernel function (see Equation 4), leading to a sparse the local tangent space at the datapoint, in order to ob-
adjacency matrix W . In the computation of the low- tain a representation of the local Hessian that is invariant
dimensional representations yi , the cost function that to differences in the positions of the datapoints. It can
is minimized is given by be shown 7 that the coordinates of the low-dimensional
X representation can be found by performing an eigen-
(Y ) = (yi yj )2 wij . (14) analysis of H.
ij
Hessian LLE starts with identifying the k nearest neigh-
In the cost function, large weights wij correspond bors for each datapoint xi using Euclidean distance. In
to small distances between the datapoints xi and xj . the neighborhood, local linearity of the manifold is as-
Hence, the difference between their low-dimensional sumed. Hence, a basis for the local tangent space at
representations yi and yj highly contributes to the cost point xi can be found by applying PCA on its k nearest
function. As a consequence, nearby points in the high- neighbors xij . In other words, for every datapoint xi ,
dimensional space are brought closer together in the a basis for the local tangent space at point xi is deter-
low-dimensional representation. mined by computing the d principal eigenvectors M =
The computation of the degree matrix M and the graph {m1 , m2 , . . . , md } of the covariance matrix cov(xi ).
Laplacian L of the graph W allows for formulating the Note that the above requires that k d. Subsequently,
minimization problem as an eigenproblem [4]. The de- an estimator for the Hessian of the manifold at point xi
gree matrix M of W is a diagonal matrix, of which
P the in local tangent space coordinates is computed. In or-
entries are the row sums of W (i.e., mii = j wij ). der to do this, a matrix Zi is formed that contains (in
The graph Laplacian L is computed by L = M W . the columns) all cross products of M up to the dth or-
It can be shown that the following holds 6 der (including a column with ones). The matrix Zi is
X orthonormalized by applying Gram-Schmidt orthonor-
(Y ) = (yi yj )2 wij = 2Y T LY. (15)
malization [2]. The estimation of the tangent Hessian
ij
Hi is now given by the transpose of the last d(d+1) 2
Hence, minimizing (Y ) is proportional to minimizing columns of the matrix Zi . Using the Hessian estimators
Y T LY . The low-dimensional data representation Y can in local tangent coordinates, a matrix H is constructed
thus be found by solving the generalized eigenvalue with entries
problem XX
Lv = M v (16) Hlm = ((Hi )jl (Hi )jm ). (17)
i j
for the d smallest nonzero eigenvalues. The d eigenvec-
tors vi corresponding to the smallest nonzero eigenval- The matrix H represents information on the curviness of
ues form the low-dimensional data representation Y . the high-dimensional data manifold. An eigenanalysis
Laplacian Eigenmaps have been successfully applied of H is performed in order to find the low-dimensional
to, e.g., clustering [80,100,119], face recognition [51], data representation that minimizes the curviness of
and the analysis of fMRI data [22]. In addition, variants the manifold. The eigenvectors corresponding to the
of Laplacian Eigenmaps may be applied to supervised d smallest nonzero eigenvalues of H are selected and
or semi-supervised learning problems [28,10]. A linear form the matrix Y , which contains the low-dimensional
variant of Laplacian Eigenmaps is presented in [50]. representation of the data. A successful application of
6 Note that (Y ) =
P P 2
(y y )2 w = (y + yj2
P 2 ij iP j2 ij Pij i
2yi yj )wij = y mii +
i i j
yj mjj 2 ij
yi yj wij = 7 The derivation is too extensive for this paper; it can be found
2Y T MY 2Y T W Y = 2Y T LY in [36].

8
Hessian LLE to sensor localization has been presented 4.3. Global alignment of linear models
in [84].
In the previous subsections, we discussed techniques
4.2.4. LTSA that compute a low-dimensional data representation by
Similar to Hessian LLE, Local Tangent Space Anal- preserving global or local properties of the data. Tech-
ysis (LTSA) is a technique that describes local proper- niques that perform global alignment of linear mod-
ties of the high-dimensional data using the local tan- els combine these two types: they compute a number
gent space of each datapoint [124]. LTSA is based on of locally linear models and perform a global align-
the observation that, if local linearity of the manifold ment of these linear models. This subsection presents
is assumed, there exists a linear mapping from a high- two such techniques, viz., LLC and manifold chart-
dimensional datapoint to its local tangent space, and ing. The techniques are discussed separately in subsec-
that there exists a linear mapping from the correspond- tion 4.3.1 and 4.3.2.
ing low-dimensional datapoint to the same local tangent
space [124]. LTSA attempts to align these linear map- 4.3.1. LLC
pings in such a way, that they construct the local tan- Locally Linear Coordination (LLC) [104] computes
gent space of the manifold from the low-dimensional a number of locally linear models and subsequently
representation. In other words, LTSA simultaneously performs a global alignment of the linear models. This
searches for the coordinates of the low-dimensional data process consists of two steps: (1) computing a mix-
representations, and for the linear mappings of the low- ture of local linear models on the data by means of
dimensional datapoints to the local tangent space of the an Expectation Maximization (EM) algorithm and (2)
high-dimensional data. aligning the local linear models in order to obtain the
Similar to Hessian LLE, LTSA starts with computing low-dimensional data representation using a variant of
bases for the local tangent spaces at the datapoints xi . LLE.
This is done by applying PCA on the k datapoints xij LLC first constructs a mixture of m factor analyz-
that are neighbors of datapoint xi . This results in a map- ers (MoFA) using the EM algorithm [34,44,61]. Al-
ping Mi from the neighborhood of xi to the local tan- ternatively, a mixture of probabilistic PCA models
gent space i . A property of the local tangent space i (MoPPCA) model [109] could be employed. The lo-
is that there exists a linear mapping Li from the local cal linear models in the mixture are used to construct
tangent space coordinates ij to the low-dimensional m data representations zij and their corresponding
representations yij . Using this property of the local tan- responsibilities rij (where j {1, . . . , m}) for every
gent space, LTSA performs the following minimization datapoint xi . The responsibilities rij describe to what
X
min k Yi Jk Li i k2 , (18) extent datapoint
P xi corresponds to the model j; they
Yi ,Li
i satisfy j rij = 1. Using the local models and the
corresponding responsibilities, responsibility-weighted
where Jk is the centering matrix of size k [99]. Zhang
data representations uij = rij zij are computed. The
and Zha [124] have shown that the solution of the mini-
responsibility-weighted data representations uij are
mization is formed by the eigenvectors of an alignment
stored in a n mD block matrix U . The alignment
matrix B, that correspond to the d smallest nonzero
of the local models is performed based on U and on a
eigenvalues of B. The entries of the alignment matrix
matrix M that is given by M = (I W )T (I W ).
B are obtained by iterative summation (for all matrices
Herein, the matrix W contains the reconstruction
Vi and starting from bij = 0 for ij)
weights computed by LLE (see 4.2.1), and I denotes
BNi Ni = BNi Ni + Jk (I Vi ViT )Jk , (19) the n n identity matrix. LLC aligns the local models
where Ni is a selection matrix that contains the indices by solving the generalized eigenproblem
of the nearest neighbors of datapoint xi . Subsequently, Av = Bv, (20)
the low-dimensional representation Y is obtained by
computation of the eigenvectors corresponding to the d for the d smallest nonzero eigenvalues 8 . In the equa-
smallest nonzero eigenvectors of the symmetric matrix tion, A is the inproduct of M T U and B is the in-
1 T product of U . The d eigenvectors vi form a matrix L,
2 (B + B ).
In [107], a successful application of LTSA to microarray that can be shown to define a linear mapping from the
data is reported. A linear variant of LTSA is proposed
in [122]. 8 The derivation of this eigenproblem can be found in [104].

9
n
responsibility-weighted data representation U to the un- X
Dj = rij cov([Zj 1]). (24)
derlying low-dimensional data representation Y . The
i=1
low-dimensional data representation is thus obtained by
computing Y = U L. In Equation 24, the 1-column is added to the data re-
LLC has been successfully applied to face images of a presentation Zj in order to facilitate translations in the
single person with variable pose and expression, and to construction of yi from the data representations zij . Us-
handwritten digits [104]. ing the definition of the matrix D and the n mD
block-diagonal matrix U with entries uij = rij [zij 1],
the manifold charting cost function can be rewritten as
4.3.2. Manifold charting
Similar to LLC, manifold charting constructs a low- (Y ) = LT (D U T U )L, (25)
dimensional data representation by aligning a MoFA
where L represents the linear mapping on the matrix Z
or MoPPCA model [20]. In contrast to LLC, manifold
that can be used to compute the final low-dimensional
charting does not minimize a cost function that corre-
data representation Y . The linear mapping L can thus
sponds to another dimensionality reduction technique
be computed by solving the generalized eigenproblem
(such as the LLE cost function). Manifold charting min-
imizes a convex cost function that measures the amount (D U T U )v = U T U v, (26)
of disagreement between the linear models on the global
coordinates of the datapoints. The minimization of this for the d smallest nonzero eigenvalues. The d eigenvec-
cost function can be performed by solving a general- tors vi form the columns of the linear combination L
ized eigenproblem. from [U 1] to Y .
Manifold charting first performs the EM algorithm to
learn a mixture of factor analyzers, in order to obtain 5. Characterization of the techniques
m low-dimensional data representations zij and corre-
sponding responsibilities rij (where j {1, . . . , m}) In the Sections 3 and 4, we provided an overview
for all datapoints xi . Manifold charting finds a linear of techniques for dimensionality reduction. This sec-
mapping from the data representations zij to the global tion lists the techniques by three theoretical characteri-
coordinates yi that minimizes the cost function zations. First, relations between the dimensionality re-
n X
X m duction techniques are identified (subsection 5.1). Sec-
(Y ) = rij k yi yik k2 , (21) ond, we list and discuss a number of general proper-
i=1 j=1 ties of the techniques such as the nature of the objec-
Pm tive function that is optimized and the computational
where yi = k=1 rik yik . The intuition behind the cost
complexity of the technique (subsection 5.2). Third, the
function is that whenever there are two linear models
out-of-sample extension of the techniques is discussed
in which a datapoint has a high responsibility, these
in subsection 5.3.
linear models should agree on the final coordinate of
the datapoint. The cost function can be rewritten in the
form 5.1. Relations
Xn Xm X m
(Y ) = rij rik k yij yik k2 , (22) Many of the techniques discussed in Section 3 and 4
i=1 j=1 k=1 are highly interrelated, and in certain special cases even
which allows the cost function to be rewritten in the equivalent. Below, we discuss three types of relations
form of a Rayleigh quotient. The Rayleigh quotient can between the techniques.
be constructed by the definition of a block-diagonal ma- First, traditional PCA is identical to performing Ker-
trix D with m blocks by nel PCA with a linear kernel. Autoencoders in which
only linear activation functions are employed are very
D1 . . . 0 similar to PCA as well [67]. Performing (metric) multi-
.

.
dimensional scaling using the raw stress function with
D = .. . . . .. , (23) squared Euclidean distances is identical to performing

0 . . . Dm PCA, due to the relation between the eigenvectors of
the covariance matrix and the squared Euclidean dis-
where Dj is the sum of the weighted covariances of the tance matrix (see Section 3).
data representations zij . Hence, Dj is given by Second, performing MDS using geodesic distances is

10
identical to performing Isomap. Similarly, performing Technique Convex Parameters Computational Memory
Isomap with the number of nearest neighbors k set to PCA yes none O(D3 ) O(D2 )
n 1 is identical to performing traditional MDS (and
MDS yes none O(n3 ) O(n2 )
thus also to performing PCA). Diffusion maps are simi-
lar to MDS and Isomap, in that they attempt to preserve Isomap yes k O(n3 ) O(n2 )
pairwise distances (the so-called diffusion distances). MVU yes k O((nk)3 ) O((nk)3 )
The main discerning property of diffusion maps is that Diffusion maps yes , t O(n3 ) O(n2 )
it aims to retain a weighted sum of the distances of all
Kernel PCA yes (, ) O(n3 ) O(n2 )
paths through the graph defined on the data.
Third, the spectral techniques Kernel PCA, Isomap, Autoencoders no net size O(inw) O(w)
LLE, and Laplacian Eigenmaps can all be viewed upon LLE yes k O(pn2 ) O(pn2 )
as special cases of the more general problem of learning
Laplacian Eigenmaps yes k, O(pn2 ) O(pn2 )
eigenfunctions [12,48]. As a result, Isomap, LLE, and
Laplacian Eigenmaps can be considered as special cases Hessian LLE yes k O(pn2 ) O(pn2 )
of Kernel PCA (using a specific kernel function). For in- LTSA yes k O(pn2 ) O(pn2 )
stance, this relation is visible in the out-of-sample exten- LLC no m, k O(imd3 ) O(nmd)
sions of Isomap, LLE, and Laplacian Eigenmaps [15].
Manifold charting no m O(imd3 ) O(nmd)
The out-of-sample extension for these techniques is per- Table 1
formed by means of a so-called Nystrom approxima- Properties of techniques for dimensionality reduction.
tion [6,86], which is known to be equivalent to the Ker-
nel PCA projection [97]. Diffusion maps in which t = 1 tional complexity of the main computational part of the
are fairly similar to Kernel PCA with the Gaussian ker- technique, and (4) the memory complexity of the tech-
nel function. There are two main differences between nique. We discuss the four general properties below.
the two: (1) no centering of the Gram matrix is per- For property 1, Table 1 shows that most techniques for
formed in diffusion maps (although centering is gen- dimensionality reduction optimize a convex cost func-
erally not essential in Kernel PCA [99]) and (2) diffu- tion. This is advantageous, because it allows for find-
sion maps do not employ the principal eigenvector of ing the global optimum of the cost function. Because
the Gaussian kernel, whereas Kernel PCA does. MVU of their nonconvex cost functions, autoencoders, LLC,
can also be viewed upon as a special case of Kernel and manifold charting may suffer from getting stuck in
PCA, in which the SDP is the kernel function. In turn, local optima.
Isomap can be viewed upon as a technique that finds an For property 2, Table 1 shows that most nonlinear tech-
approximate solution to the MVU problem [120]. Eval- niques for dimensionality reduction all have free param-
uation of the dual MVU problem has also shown that eters that need to be optimized. By free parameters, we
LLE and Laplacian Eigenmaps show great resemblance mean parameters that directly influence the cost func-
to MVU [120]. tion that is optimized. The reader should note that iter-
As a consequence of these relations between the tech- ative techniques for dimensionality reduction have ad-
niques, our empirical comparative evaluation in Sec- ditional free parameters, such as the learning rate and
tion 6 does not include (1) Kernel PCA using a lin- the permitted maximum number of iterations. The pres-
ear kernel, (2) MDS, and (3) autoencoders with linear ence of free parameters has advantages as well as dis-
activation functions, because they are similar to PCA. advantages. The main advantage of the presence of free
Furthermore, we do not evaluate Kernel PCA using a parameters is that they provide more flexibility to the
Gaussian kernel in the experiments, because of its re- technique, whereas their main disadvantage is that they
semblance to diffusion maps; instead we use a polyno- need to be tuned to optimize the performance of the di-
mial kernel. mensionality reduction technique.
For properties 3 and 4, Table 1 provides insight into the
computational and memory complexities of the com-
5.2. General properties putationally most expensive algorithmic components of
the techniques. The computational complexity of a di-
In Table 1, the thirteen dimensionality reduction tech- mensionality reduction technique is of importance to its
niques are listed by four general properties: (1) the con- practical applicability. If the memory or computational
vexity of the optimization problem, (2) the main free resources needed are too large, application becomes in-
parameters that have to be optimized, (3) the computa- feasible. The computational complexity of a dimension-

11
ality reduction technique is determined by: (1) proper- memory complexity is O(pn2 ) as well.
ties of the dataset such as the number of datapoints n From the discussion of the four general properties of
and their dimensionality D, and (2) by parameters of the techniques for dimensionality reduction above, we
the techniques, such as the target dimensionality d, the make four observations: (1) some nonlinear techniques
number of nearest neighbors k (for techniques based for dimensionality reduction may suffer from getting
on neighborhood graphs) and the number of iterations stuck in local optima, (2) all nonlinear techniques re-
i (for iterative techniques). In Table 1, p denotes the quire the optimization of one or more free parameters,
ratio of nonzero elements in a sparse matrix to the to- (3) when D < n (which is true in most cases), nonlinear
tal number of elements, m indicates the number of lo- techniques have computational disadvantages compared
cal models in a mixture of factor analyzers, and w is to PCA, and (4) a number of nonlinear techniques suffer
the number of weights in a neural network. Below, we from a memory complexity that is square or cube with
discuss the computational complexity and the memory the number of datapoints n. From these observations, it
complexity of each of the entries in the table. is clear that nonlinear techniques impose considerable
The computationally most demanding part of PCA is demands on computational resources, as compared to
the eigenanalysis of the DD covariance matrix, which the linear technique. Attempts to reduce the computa-
is performed using a power method in O(D3 ). The cor- tional and/or memory complexities of nonlinear tech-
responding memory complexity of PCA is O(D2 ) 9 . niques have been proposed for, e.g., Isomap [31,70],
MDS, Isomap, diffusion maps, and Kernel PCA per- MVU [116,118], and Kernel PCA [108].
form an eigenanalysis of an n n matrix using a power
method in O(n3 ). Because Isomap, diffusion maps, and
Kernel PCA store a full n n kernel matrix, the mem- 5.3. Out-of-sample extension
ory complexity of these techniques is O(n2 ).
In contrast to the spectral techniques discussed above, An important requirement for dimensionality re-
MVU solves a semidefinite program (SDP) with nk duction techniques is the ability to embed new
constraints. Both the computational and the memory high-dimensional datapoints into an existing low-
complexity of solving an SDP are cube in the number dimensional data representation. So-called out-of-
of constraints [19]. Since there are nk constraints, the sample extensions have been developed for a number
computational and memory complexity of the main part of techniques to allow for the embedding of such new
of MVU is O((nk)3 ). Training an autoencoder using datapoints, and can be subdivided into parametric and
RBM training or backpropagation has a computational nonparametric out-of-sample extensions.
complexity of O(inw). The training of autoencoders In a parametric out-of-sample extension, the dimen-
may converge very slowly, especially in cases where sionality reduction technique provides all parameters
the input and target dimensionality are very high (since that are necessary in order to transform new data from
this yields a high number of weights in the network). the high-dimensional to the low-dimensional space. For
The memory complexity of autoencoders is O(w). instance, in linear techniques such as PCA, this trans-
The main computational part of LLC and manifold formation is defined by the linear mapping M that was
charting is the computation of the MoFA or MoPPCA applied to the original data. For Kernel PCA, a similar
model, which has computational complexity O(imd3 ). transformation is available, although this transforma-
The corresponding memory complexity is O(nmd). tion requires some additional kernel function com-
Similar to, e.g., Kernel PCA, local techniques perform putations [97]. For autoencoders, the trained network
an eigenanalysis of an n n matrix. However, for local defines the transformation from the high-dimensional
techniques the n n matrix is sparse. The sparsity of to the low-dimensional data representation.
the matrices is beneficial, because it lowers the compu- For the other nonlinear dimensionality reduction tech-
tational complexity of the eigenanalysis. Eigenanalysis niques a parametric out-of-sample extension is not
of a sparse matrix (using Arnoldi methods [5] or Jacobi- available, and therefore, a nonparametric out-of-sample
Davidson methods [42]) has computational complexity extension is required. Nonparametric out-of-sample
O(pn2 ), where p is the ratio of nonzero elements in extensions perform an estimation of the transforma-
the sparse matrix to the total number of elements. The tion from the high-dimensional to the low-dimensional
space. For instance, a nonparametric out-of-sample
9 In datasets in which n < D, the computational and memory com- extension for Isomap, LLE, and Laplacian Eigenmaps
plexity of PCA can be reduced to O(n3 ) and O(n2 ), respectively has been presented in [15], in which the techniques are
(see Section 3). redefined as kernel methods. Subsequently, the out-of-

12
sample extension is performed using the Nystrom ap-
proximation [86], which approximates the eigenvectors
of a large n n matrix based on the eigendecomposi-
tion of a smaller m m submatrix of the large matrix.
Similar nonparametric out-of-sample extensions for
Isomap are proposed in [27,31]. For MVU, an approxi-
mate out-of-sample extension has been proposed that is (a) True underlying manifold. (b) Reconstructed manifold up
to a nonlinear warping.
based on computing a linear transformation from a set
of landmark points to the complete dataset [116]. An Fig. 3. Two low-dimensional data representations.
alternative out-of-sample extension for MVU finds this
linear transformation by computing the eigenvectors
corresponding to the smallest eigenvalues of the graph mensionality reduction on the high-dimensional repre-
Laplacian (similar to Laplacian Eigenmaps) [118]. A sentation of the data. Subsequently, we assess the qual-
nonparametric out-of-sample extension that can be ity of the resulting low-dimensional data representation
applied to all nonlinear dimensionality reduction tech- by evaluating to what extent the local structure of the
niques is proposed in [73]. The technique finds the data is retained. The evaluation is performed by measur-
nearest neighbor of the new datapoint in the high- ing the generalization errors of k-nearest neighbor clas-
dimensional representation, and computes the linear sifiers that are trained on the low-dimensional data re-
mapping from the nearest neighbor to its corresponding presentation. A similar evaluation scheme is employed
low-dimensional representation. The low-dimensional in [94]. We motivate our experimental setup below.
representation of the new datapoint is found by apply- First, we opt for an evaluation of the local structure of
ing the same linear mapping to this datapoint. the data, because for successful visualization or classi-
From the description above, we may observe that linear and nonlinear techniques for dimensionality reduction are quite similar in that they allow the embedding of new datapoints. However, for a number of nonlinear techniques, only nonparametric out-of-sample extensions are available, which leads to estimation errors in the embedding of new datapoints.

6. Experiments

In this section, a systematic empirical comparison of the performance of the linear and nonlinear techniques for dimensionality reduction is performed. We perform the comparison by measuring generalization errors in classification tasks on two types of datasets: (1) artificial datasets and (2) natural datasets.
The setup of our experiments is described in subsection 6.1. In subsection 6.2, the results of our experiments on five artificial datasets are presented. Subsection 6.3 presents the results of the experiments on five natural datasets.

6.1. Experimental setup

In our experiments on both the artificial and the natural datasets, we apply the twelve 10 techniques for dimensionality reduction on the high-dimensional representation of the data. Subsequently, we assess the quality of the resulting low-dimensional data representation by evaluating to what extent the local structure of the data is retained. The evaluation is performed by measuring the generalization errors of k-nearest neighbor classifiers that are trained on the low-dimensional data representation. A similar evaluation scheme is employed in [94]. We motivate our experimental setup below.

10 Please note that we do not consider MDS in the experiments, because of its similarity to PCA (see 5.1).

First, we opt for an evaluation of the local structure of the data, because for successful visualization or classification of data only its local structure needs to be retained. We evaluate how well the local structure of the data is retained by measuring the generalization error of k-nearest neighbor classifiers trained on the resulting data representations, because of the high variance of this classifier (for small values of k). The high variance of the k-nearest neighbor classifier makes it very well suited to assessing the quality of the local structure of the data.
Second, we opt for an evaluation of the quality based on generalization errors instead of one based on reconstruction errors, for two main reasons. The first reason is that a high reconstruction error does not necessarily imply that the dimensionality reduction technique performed poorly. For instance, if a dimensionality reduction technique recovers the true underlying manifold in Figure 3(a) up to a nonlinear warping, such as in Figure 3(b), this leads to a high reconstruction error, whereas the local structure of the two manifolds is nearly identical (as the circles indicate). In other words, reconstruction errors measure the quality of the global structure of the low-dimensional data representation, and not the quality of the local structure. The second reason is that our main aim is to investigate the performance of the techniques on real-world datasets, in which the true underlying manifold of the data is usually unknown, and as a result, reconstruction errors cannot be computed.

Fig. 3. Two low-dimensional data representations: (a) true underlying manifold; (b) reconstructed manifold up to a nonlinear warping.
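The evaluation criterion itself is straightforward to state in code. The sketch below (Python/NumPy; our own illustration) computes the leave-one-out generalization error of a 1-nearest neighbor classifier on a low-dimensional representation Y with class labels; the function and variable names are ours.

import numpy as np

def one_nn_loo_error(Y, labels):
    """Leave-one-out error of a 1-nearest neighbor classifier on the
    low-dimensional representation Y (n x d) with integer labels (n,)."""
    sq = np.sum(Y**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T)   # squared Euclidean distances
    np.fill_diagonal(D, np.inf)                       # a point may not be its own neighbor
    nearest = np.argmin(D, axis=1)                    # index of the nearest other point
    return float(np.mean(labels[nearest] != labels))

A low error indicates that points sharing a label remain close neighbors after the mapping, i.e., that the local structure of the data is retained.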
For all dimensionality reduction techniques except for Isomap and MVU, we performed experiments without out-of-sample extension, because our main interest is in the performance of the dimensionality reduction techniques, and not in the quality of the out-of-sample extension. In the experiments with Isomap and MVU, we employ the respective out-of-sample extensions of these techniques (see subsection 5.3) in order to embed datapoints that are not connected to the largest component of the neighborhood graph which is constructed by these techniques. The use of the out-of-sample extension of these techniques is necessary because the traditional implementations of Isomap and MVU can only embed the points that comprise the largest component of the neighborhood graph.
The parameter settings employed in our experiments are listed in Table 2. Most parameters were optimized using an exhaustive grid search within a reasonable range, which is shown in Table 2. For two parameters (σ in diffusion maps and Laplacian Eigenmaps), we employed fixed values in order to restrict the computational requirements of our experiments. The value of k in the k-nearest neighbor classifiers was set to 1. We determined the target dimensionality in the experiments by means of the maximum likelihood intrinsic dimensionality estimator [72]. Note that for Hessian LLE and LTSA, the dimensionality of the actual low-dimensional data representation cannot be higher than the number of nearest neighbors that was used to construct the neighborhood graph. The results of the experiments were obtained using leave-one-out validation.
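For reference, a minimal sketch of the maximum likelihood estimator of [72] is given below (Python/NumPy; our own illustration). It averages the per-point estimates m_k(x) = [(1/(k-1)) sum_{j=1}^{k-1} log(T_k(x)/T_j(x))]^(-1), where T_j(x) is the distance from x to its j-th nearest neighbor, over all datapoints and over a range of k; the range of k used here is an assumption, since the exact settings are not stated in the text.

import numpy as np

def mle_intrinsic_dim(X, k_min=6, k_max=12):
    """Maximum likelihood intrinsic dimensionality estimate in the spirit of [72]:
    average per-point estimates over k = k_min..k_max nearest neighbors.
    Assumes the datapoints in X (n x D) are distinct."""
    sq = np.sum(X**2, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    np.fill_diagonal(D, np.inf)
    T = np.sort(D, axis=1)[:, :k_max]            # distances to the k_max nearest neighbors
    estimates = []
    for k in range(k_min, k_max + 1):
        # log T_k(x) / T_j(x) for j = 1 .. k-1, per point
        log_ratios = np.log(T[:, k - 1:k] / T[:, :k - 1])
        m_hat = (k - 1) / np.sum(log_ratios, axis=1)
        estimates.append(np.mean(m_hat))
    return float(np.mean(estimates))

# The target dimensionality is then obtained by rounding the estimate:
# d = int(round(mle_intrinsic_dim(X)))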

Technique             Parameter settings
PCA                   None
Isomap                5 ≤ k ≤ 15
MVU                   5 ≤ k ≤ 15
Diffusion maps        10 ≤ t ≤ 100, σ = 1
Kernel PCA            κ = (XX^T + 1)^5
Autoencoders          Three hidden layers
LLE                   5 ≤ k ≤ 15
Laplacian Eigenmaps   5 ≤ k ≤ 15, σ = 1
Hessian LLE           5 ≤ k ≤ 15
LTSA                  5 ≤ k ≤ 15
LLC                   5 ≤ k ≤ 15, 5 ≤ m ≤ 25
Manifold charting     5 ≤ m ≤ 25

Table 2. Parameter settings for the experiments.

6.1.1. Five artificial datasets

We performed experiments on five artificial datasets. The datasets were specifically selected to investigate how the dimensionality reduction techniques deal with: (i) data that lies on a low-dimensional manifold that is isometric to Euclidean space, (ii) data lying on a low-dimensional manifold that is not isometric to Euclidean space, (iii) data that lies on or near a discontinuous manifold, and (iv) data forming a manifold with a high intrinsic dimensionality. The artificial datasets on which we performed experiments are: the Swiss roll dataset (addressing i), the helix dataset (ii), the twin peaks dataset (ii), the broken Swiss roll dataset (iii), and the high-dimensional (HD) dataset (iv). Figure 4 shows plots of the first four artificial datasets. The HD dataset consists of points randomly sampled from a 5-dimensional non-linear manifold embedded in a 10-dimensional space. In order to ensure that the generalization errors of the k-nearest neighbor classifiers reflect the quality of the data representations produced by the dimensionality reduction techniques, we assigned all datapoints to one of two classes according to a checkerboard pattern on the manifold. All artificial datasets consist of 5,000 samples. We opted for a fixed number of datapoints in each dataset, because in real-world applications, obtaining more training data is usually expensive.

Fig. 4. Four of the artificial datasets: (a) Swiss roll dataset; (b) helix dataset; (c) twinpeaks dataset; (d) broken Swiss roll dataset.
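For concreteness, the sketch below (Python/NumPy; our own illustration) generates a Swiss roll of 5,000 points with a two-class checkerboard labeling defined on the underlying manifold coordinates. The exact manifold parameterization and the checkerboard granularity used in the experiments are not specified here, so the constants are illustrative.

import numpy as np

def swiss_roll_checkerboard(n=5000, seed=0):
    """Sample a Swiss roll in 3-D and assign binary class labels in a
    checkerboard pattern on the 2-D manifold coordinates (t, h)."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1.0 + 2.0 * rng.random(n))     # position along the spiral
    h = 21.0 * rng.random(n)                          # position along the height
    X = np.column_stack((t * np.cos(t), h, t * np.sin(t)))
    # checkerboard: the class alternates whenever t or h crosses a cell boundary
    labels = (np.floor(t / (0.5 * np.pi)) + np.floor(h / 5.0)) % 2
    return X, labels.astype(int)

The same labeling idea carries over to the other artificial datasets by replacing (t, h) with the respective manifold coordinates.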
6.1.2. Five natural datasets

For our experiments on natural datasets, we selected five datasets that represent tasks from a variety of domains: (1) the MNIST dataset, (2) the COIL20 dataset, (3) the NiSIS dataset, (4) the ORL dataset, and (5) the HIVA dataset. The MNIST dataset is a dataset of 60,000 handwritten digits. For computational reasons, we randomly selected 10,000 digits for our experiments. The images in the MNIST dataset have 28 × 28 pixels, and can thus be considered as points in a 784-dimensional space. The COIL20 dataset contains images of 20 different objects, depicted from 72 viewpoints, leading to a total of 1,440 images. The size of the images is 32 × 32 pixels, yielding a 1,024-dimensional space. The NiSIS dataset is a publicly available dataset for pedestrian detection, which consists of 3,675 grayscale images of size 36 × 18 pixels (leading to a space of dimensionality 648). The ORL dataset is a face recognition dataset that contains 400 grayscale images of 112 × 92 pixels that depict 40 faces under various conditions (i.e., the dataset contains 10 images per face). The HIVA dataset is a drug discovery dataset with two classes. It consists of 3,845 datapoints with dimensionality 1,617.
6.2. Experiments on artificial datasets

In Table 3, we present the generalization errors of 1-nearest neighbor classifiers trained on the low-dimensional data representations obtained from the dimensionality reduction techniques. The table shows generalization errors measured using leave-one-out validation on the five artificial datasets. In the table, the left column indicates the name of the dataset and the target dimensionality to which we attempted to transform the high-dimensional data (see subsection 6.1). The best performing technique for a dataset is shown in boldface. From the results in Table 3, we make five observations.
First, the results reveal that the nonlinear techniques employing neighborhood graphs (viz. Isomap, MVU, LLE, Laplacian Eigenmaps, Hessian LLE, LTSA, and LLC) outperform the other techniques on standard manifold learning problems, such as the Swiss roll dataset. Techniques that do not employ neighborhood graphs (viz. PCA, diffusion maps, Kernel PCA, autoencoders, and manifold charting) perform poorly on these datasets. The performances of LLC and manifold charting on the Swiss roll dataset are comparable to those of techniques that do not employ neighborhood graphs.
Second, from the results with the helix and twin peaks datasets, we observe that three local nonlinear dimensionality reduction techniques perform less well on manifolds that are not isometric to Euclidean space. The performance of Isomap, MVU, and LTSA on these datasets is still very strong. In addition, we observe that all neighborhood graph-based techniques outperform techniques that do not employ neighborhood graphs (including PCA).
Third, the results on the broken Swiss roll dataset indicate that most nonlinear techniques for dimensionality reduction cannot deal with discontinuous (i.e., non-smooth) manifolds. On the broken Swiss roll dataset, Hessian LLE is the only technique that does not suffer severely from the presence of a discontinuity in the manifold (when compared to the performance of the techniques on the original Swiss roll dataset).
Fourth, from the results on the HD dataset, we observe that most nonlinear techniques have major problems when faced with a dataset with a high intrinsic dimensionality. In particular, local dimensionality reduction techniques perform disappointingly on a dataset with a high intrinsic dimensionality. On the HD dataset, PCA is only outperformed by Isomap and an autoencoder, which is the best performing technique on this dataset.
Fifth, we observe that Hessian LLE fails to find a solution on the helix dataset. The failure is the result of the inability of the eigensolver to solve the eigenproblem up to sufficient precision. Both Arnoldi [5] and Jacobi-Davidson [42] eigendecomposition methods suffer from this problem, which is caused by the nature of the eigenproblem that needs to be solved.
Taken together, the results show that although local techniques for dimensionality reduction perform strongly on a simple dataset such as the Swiss roll dataset, this strong performance does not generalize very well to more complex datasets (e.g., datasets with non-smooth manifolds, manifolds that are not isometric to the Euclidean space, or manifolds with a high intrinsic dimensionality).

6.3. Experiments on natural datasets

Table 4 presents the generalization errors of 1-nearest neighbor classifiers that were trained on the low-dimensional data representations obtained from the dimensionality reduction techniques. From the results in Table 4, we make three observations.
First, we observe that the performance of nonlinear techniques for dimensionality reduction on the natural datasets is rather disappointing compared to the performance of these techniques on, e.g., the Swiss roll dataset. In particular, PCA outperforms all nonlinear techniques on three of the five datasets. Especially local nonlinear techniques for dimensionality reduction perform disappointingly. However, Kernel PCA and autoencoders perform strongly on almost all datasets.
Dataset (d) None PCA Isomap MVU KPCA DM Autoenc. LLE LEM HLLE LTSA LLC MC
Swiss roll (2D) 3.68% 30.56% 3.28% 5.12% 29.30% 28.06% 30.58% 7.44% 10.16% 3.10% 3.06% 27.74% 42.74%
Helix (1D) 1.24% 38.56% 1.22% 3.76% 44.54% 36.18% 32.50% 20.38% 10.34% failed 1.68% 26.68% 28.16%
Twinpeaks (2D) 0.40% 0.18% 0.30% 0.58% 0.08% 0.06% 0.12% 0.54% 0.52% 0.10% 0.36% 12.96% 0.06%
Broken Swiss (2D) 2.14% 27.62% 14.24% 36.28% 27.06% 23.92% 26.32% 37.06% 26.08% 4.78% 16.30% 26.96% 23.92%
HD (5D) 24.19% 22.14% 20.45% 23.62% 29.25% 34.75% 16.29% 35.81% 41.70% 47.97% 40.22% 38.69% 31.46%
Table 3. Generalization errors of 1-NN classifiers trained on artificial datasets.

Dataset (d) None PCA Isomap MVU KPCA DM Autoenc. LLE LEM HLLE LTSA LLC MC
MNIST (20D) 5.11% 5.06% 28.54% 18.35% 65.48% 59.79% 14.10% 19.21% 19.45% 89.55% 32.52% 36.29% 38.22%
COIL20 (5D) 0.14% 3.82% 14.86% 21.88% 7.78% 4.51% 1.39% 9.86% 14.79% 43.40% 12.36% 6.74% 18.61%
ORL (8D) 2.50% 4.75% 44.20% 39.50% 5.50% 49.00% 69.00% 9.00% 12.50% 56.00% 12.75% 50.00% 62.25%
NiSIS (15D) 8.24% 8.73% 20.57% 19.40% 11.70% 22.94% 9.82% 28.71% 43.08% 45.00% failed 26.86% 20.41%
HIVA (15D) 4.63% 5.05% 4.97% 4.89% 5.07% 3.51% 4.84% 5.23% 5.23% failed 6.09% 3.43% 5.20%
Table 4. Generalization errors of 1-NN classifiers trained on natural datasets.

Second, the results in Table 4 reveal that two techniques (viz., Hessian LLE and LTSA) fail on one dataset. The failures of Hessian LLE and LTSA are the result of the inability of the eigensolver to identify the smallest eigenvalues (due to numerical problems). Both Arnoldi [5] and Jacobi-Davidson [42] eigendecomposition methods suffer from this limitation.
Third, the results show that on some natural datasets, the classification performance of our classifiers was not improved by performing dimensionality reduction. Most likely, this is due to errors in the intrinsic dimensionality estimator we employed. As a result, the target dimensionalities may not be optimal (in the sense that they minimize the generalization error of the trained classifier). However, since we aim to compare the performance of dimensionality reduction techniques, and not to minimize generalization errors on classification problems, this observation is of no relevance to our comparison.

7. Discussion

In the previous sections, we presented a comparative study of techniques for dimensionality reduction. We observed that most nonlinear techniques do not outperform PCA on natural datasets, despite their ability to learn the structure of complex nonlinear manifolds. This section discusses the main weaknesses of current nonlinear techniques for dimensionality reduction that explain the results of our experiments. In addition, the section presents ideas on how to overcome these weaknesses. The discussion is subdivided into three parts. Subsection 7.1 discusses five weaknesses of local techniques for dimensionality reduction. In subsection 7.2, weaknesses of global techniques and of techniques that globally align a collection of linear models are discussed. Subsection 7.3 summarizes the main weaknesses of current nonlinear techniques for dimensionality reduction and presents recommendations for the development of future dimensionality reduction techniques.

7.1. Local techniques

The results of our experiments show that the performance of popular techniques based on neighborhood graphs is rather disappointing on many datasets. Most likely, the poor performance of these techniques is due to one or more of the following five weaknesses.
First, local dimensionality reduction techniques suffer from the curse of dimensionality of the embedded manifold (i.e., the intrinsic dimension of the data) [13,14,116], because the number of datapoints that is required to characterize a manifold properly grows exponentially with the intrinsic dimensionality of the manifold. The susceptibility to the curse of dimensionality is a fundamental weakness of all local learners, and therefore, it also applies to learning techniques that employ Gaussian kernels (such as diffusion maps, Support Vector Machines, and Gaussian processes). For artificial datasets with a low intrinsic dimensionality, such as the Swiss roll dataset, this weakness does not apply. However, in most real-world tasks, the intrinsic dimensionality of the data is much higher.
For instance, the face space is estimated to consist of at least 100 dimensions [77]. As a result, the performance of local techniques is poor on many real-world datasets, which explains the results of our experiments with the HD dataset and the natural datasets.
Second, the inferior performance of local nonlinear techniques for dimensionality reduction arises from the eigenproblems that the techniques attempt to solve. Typically, the smallest eigenvalues in these problems are very small (around 10^-7 or smaller), whereas the largest eigenvalues are fairly big (around 10^2 or larger). Eigenproblems with these properties are extremely hard to solve, even for state-of-the-art eigensolvers. The eigensolver may not be able to identify the smallest eigenvalues of the eigenproblem, and as a result, the dimensionality reduction technique might produce suboptimal solutions. The good results of Isomap (which searches for the largest eigenvalues) compared to local techniques (which search for the smallest eigenvalues) may be explained by the difficulty of solving such eigenproblems.
Third, local properties of a manifold do not necessarily follow the global structure of the manifold (as noted in, e.g., [20,91]) in the presence of noise around the manifold. In other words, local methods suffer from overfitting on the manifold. Furthermore, local techniques for dimensionality reduction are sensitive to the presence of outliers in the data [25]. In local techniques for dimensionality reduction, outliers are connected to their k nearest neighbors, even when they are very distant. As a consequence, outliers degrade the performance of local techniques for dimensionality reduction. A possible approach to resolve this problem is the usage of an ε-neighborhood. In an ε-neighborhood, datapoints are connected to all datapoints that lie within a sphere with radius ε. A second approach to overcome the problem of outliers is preprocessing the data by removing outliers [82,123].
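The difference between the two graph constructions can be made explicit with a short sketch (Python/NumPy; our own illustration). A k-nearest neighbor graph always connects a point, including an outlier, to its k nearest neighbors, whereas an ε-neighborhood graph leaves points without sufficiently close neighbors isolated.

import numpy as np

def pairwise_distances(X):
    sq = np.sum(X**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))

def knn_graph(X, k):
    """Boolean adjacency matrix of a symmetrized k-nearest neighbor graph."""
    D = pairwise_distances(X)
    np.fill_diagonal(D, np.inf)
    idx = np.argsort(D, axis=1)[:, :k]       # k nearest neighbors of every point
    A = np.zeros(D.shape, dtype=bool)
    np.put_along_axis(A, idx, True, axis=1)
    return A | A.T                           # outliers are still connected here

def epsilon_graph(X, eps):
    """Boolean adjacency matrix of an epsilon-neighborhood graph."""
    D = pairwise_distances(X)
    A = D <= eps                             # only connect sufficiently close points
    np.fill_diagonal(A, False)
    return A                                 # distant outliers remain isolated

The price of the ε-neighborhood is that the resulting graph may become disconnected or very dense, so ε has to be tuned to the sampling density of the manifold.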
Fourth, the local linearity assumption of local techniques for dimensionality reduction implies that the techniques assume that the manifold contains no discontinuities (i.e., that the manifold is smooth). The results of our experiments with the broken Swiss roll dataset illustrate the incapability of current dimensionality reduction techniques to model non-smooth manifolds. In real-world datasets, the underlying manifold is not likely to be smooth. For instance, objects in real-world images undergo translations, which gives rise to discontinuities in the underlying manifold representation. In addition, most local nonlinear techniques cannot deal with manifolds that are not isometric to the Euclidean space. This is most significantly revealed by the results of our experiments with the helix and twinpeaks datasets. It may be a problem because, for instance, a dataset of objects depicted under various orientations gives rise to a manifold that is closed (similar to the helix dataset).
Fifth, local techniques for dimensionality reduction suffer from folding [21]. Folding is caused by a value of k that is too high with respect to the sampling density of (parts of) the manifold. Folding causes the local linearity assumption to be violated, leading to radial or other distortions. In real-world datasets, folding is likely to occur because the data density is not necessarily equal over the manifold (i.e., because the prior is not uniform over the manifold). An approach that might overcome this weakness for datasets with a small intrinsic dimensionality is adaptive neighborhood selection. Techniques for adaptive neighborhood selection are presented in, e.g., [76,93,115].
In addition to these five weaknesses, Hessian LLE and LTSA cannot transform data to a dimensionality higher than the number of nearest neighbors in the neighborhood graph, which might lead to difficulties with datasets with a high intrinsic dimensionality.

7.2. Global techniques

Our discussion of the results of global techniques for dimensionality reduction is subdivided into three parts. First, we discuss the results of the neighborhood graph-based techniques Isomap and MVU. Second, we discuss weaknesses explaining the results of the two kernel-based techniques, Kernel PCA and diffusion maps. Third, we discuss the results of the three techniques that optimize nonconvex objective functions, viz., autoencoders, LLC, and manifold charting.
For the first part, we remark that global techniques for dimensionality reduction that employ neighborhood graphs, such as Isomap and MVU, are subject to many of the weaknesses mentioned in subsection 7.1, because the construction of the neighborhood graph is susceptible to (1) the curse of dimensionality, (2) overfitting, and (3) the presence of outliers. The results of our experiments on natural datasets show that global techniques for dimensionality reduction based on neighborhood graphs are often outperformed by PCA. However, the results also reveal that Isomap and MVU outperform local techniques such as LLE on natural datasets.
The relatively good results of Isomap may be explained by the presence of a large number of short-circuits in the neighborhood graph, as a result of which the neighborhood graph shows a large resemblance to a random network. In such a network, the distances through the graph are strongly correlated with the (squared) Euclidean distances 11, and Isomap thus reduces to MDS and PCA. The relatively good results of MVU may be explained by the maximization of the variance in the low-dimensional data representation that MVU performs. This maximization causes originally distant points to be far apart in the low-dimensional data representation as well, even when the constraints on the maximization do not follow the local structure of the manifold.

11 In the Erdos-Renyi random network [38] on a Euclidean space, the shortest path between two points through the network is correlated with the Euclidean distance between the two points [85].

For the second part, we remark that kernel-based techniques for dimensionality reduction (i.e., Kernel PCA and diffusion maps) do not suffer from the weaknesses of neighborhood graph-based techniques. However, the performance of Kernel PCA and diffusion maps on the Swiss roll dataset indicates that (similar to PCA) these techniques are incapable of modeling certain complex manifolds. The main reason for this incapability is that kernel-based methods require the selection of a proper kernel function. In general, model selection in kernel methods is performed using some form of hold-out testing [45], leading to high computational costs. Alternative approaches to model selection for kernel methods are based on, e.g., maximizing the between-class margins or the data variance using semidefinite programming [46,69]. Despite these alternative approaches, the construction of a proper kernel remains an important obstacle for the successful application of Kernel PCA. In addition, depending on the selection of the kernel, kernel-based techniques for dimensionality reduction may suffer from similar weaknesses as local techniques (e.g., when a Gaussian kernel with a small value of σ is employed). The poor performance of diffusion maps in our experiments supports this claim.
For the third part, we remark that techniques that optimize nonconvex objective functions, such as autoencoders, LLC, and manifold charting, suffer from the presence of local optima in their objective functions. For instance, the EM algorithm that is employed in LLC and manifold charting is likely to get stuck in a local maximum of the log-likelihood function. In addition, these techniques are hampered by the presence of outliers in the data. In techniques that perform a global alignment of linear models (such as LLC), the sensitivity to the presence of outliers may be addressed by replacing the mixture of factor analyzers by a mixture of t-distributed subspaces (MoTS) model [30]. The intuition behind the use of the MoTS model is that a t-distribution is less sensitive to outliers than a Gaussian (which tends to overestimate variances). For autoencoders, the presence of local optima in the objective function has largely been overcome by pretraining the network using RBMs. A limitation of autoencoders is that they are only applicable to datasets of reasonable dimensionality. If the dimensionality of the dataset is very high, the number of weights in the network is too large to find an appropriate setting of the network. Hence, the inferior performance of autoencoders on the ORL dataset is the result of the 10,304 dimensions in this dataset. This limitation of autoencoders may be addressed by preprocessing the data using PCA.

7.3. Main weaknesses

Taken together, the results of our experiments indicate that nonlinear techniques for dimensionality reduction do not yet clearly outperform traditional PCA. This result agrees with the results of studies reported in the literature. On selected datasets, nonlinear techniques for dimensionality reduction outperform linear techniques [81,107], but nonlinear techniques perform poorly on various other natural datasets [47,58,59,74]. In particular, our results establish two main weaknesses of the popular local techniques for dimensionality reduction: (1) the susceptibility to the curse of dimensionality and (2) the problems in finding the smallest eigenvalues of an eigenproblem.
From the first weakness, we may infer that a requirement for future techniques for dimensionality reduction is that they do not rely on local properties of the data. It has been suggested that the susceptibility to the curse of dimensionality may be addressed using techniques in which the global structure of the data manifold is represented in a number of linear models [14]. However, the poor results of LLC and manifold charting in our experiments do not support this suggestion. The main added value of local techniques for dimensionality reduction is that they can be applied in domains where information on the global structure of the data is not available, such as in sensor localization problems [118].
The second weakness leads to the suggestion that in dimensionality reduction, it is not so important which property of the data is retained in the low-dimensional data representation, but it is more important how well it is retained. Therefore, we suggest that the focus of the research community should shift towards the development of techniques that have objective functions that can be optimized well in practice.
The strong performance of autoencoders indicates that these objective functions are not necessarily convex.

8. Conclusions

The paper presents a review and comparative study of techniques for dimensionality reduction. From the results obtained, we may conclude that nonlinear techniques for dimensionality reduction are, despite their large variance, not yet capable of outperforming traditional PCA. In the future, we foresee the development of new nonlinear techniques for dimensionality reduction that do not rely on local properties of the data manifold. In addition, we foresee a shift in focus towards the development of nonlocal techniques for dimensionality reduction with objective functions that can be optimized well, such as (Kernel) PCA and autoencoders.

Acknowledgements

The work is supported by NWO/CATCH, project RICH (grant 640.002.401). We thank the Dutch State Service for Archaeology (RACM) for their cooperation.

Related techniques

The comparative review presented in this paper addresses all main techniques for (nonlinear) dimensionality reduction. However, it is not exhaustive.
The comparative review does not include self-organizing maps [64] and their probabilistic extension GTM [17], because we consider these techniques to be clustering techniques. Techniques for Independent Component Analysis [11] are not included in our review, because they were mainly designed for blind-source separation. Linear Discriminant Analysis [40] and Generalized Discriminant Analysis [8] are not included in the review, because of their supervised nature. Furthermore, our comparative review does not cover a number of techniques that are variants or extensions of the thirteen reviewed dimensionality reduction techniques. These variants include factor analysis [101], principal curves [25], kernel maps [102], conformal eigenmaps [98], Geodesic Nullspace Analysis [21], variants of multidimensional scaling [3,32,39,53,79], techniques that (similarly to LLC and manifold charting) globally align a mixture of linear models [91,94,114], and linear variants of LLE [49,65], Laplacian Eigenmaps [50], and LTSA [122].

References

[1] M. Abdella and T. Marwala. The use of genetic algorithms and neural networks to approximate missing data in database. In Proceedings of the IEEE International Conference on Computational Cybernetics, pages 207-212, 2005.
[2] G. Afken. Gram-Schmidt Orthogonalization. Academic Press, Orlando, FL, USA, 1985.
[3] D.K. Agrafiotis. Stochastic proximity embedding. Journal of Computational Chemistry, 24(10):1215-1221, 2003.
[4] W.N. Anderson and T.D. Morley. Eigenvalues of the Laplacian of a graph. Linear and Multilinear Algebra, 18:141-145, 1985.
[5] W.E. Arnoldi. The principle of minimized iteration in the solution of the matrix eigenvalue problem. Quarterly of Applied Mathematics, 9:17-25, 1951.
[6] C. Baker. The numerical treatment of integral equations. Clarendon Press, 1977.
[7] M. Balasubramanian and E.L. Schwartz. The Isomap algorithm and topological stability. Science, 295(5552):7, 2002.
[8] G. Baudat and F. Anouar. Generalized discriminant analysis using a kernel approach. Neural Computation, 12(10):2385-2404, 2000.
[9] M. Belkin and P. Niyogi. Laplacian Eigenmaps and spectral techniques for embedding and clustering. In Advances in Neural Information Processing Systems, volume 14, pages 585-591, Cambridge, MA, USA, 2002. The MIT Press.
[10] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209-239, 2004.
[11] A.J. Bell and T.J. Sejnowski. An information maximization approach to blind separation and blind deconvolution. Neural Computation, 7(6):1129-1159, 1995.
[12] Y. Bengio, O. Delalleau, N. Le Roux, J.-F. Paiement, P. Vincent, and M. Ouimet. Learning eigenfunctions links spectral embedding and Kernel PCA. Neural Computation, 16(10):2197-2219, 2004.
[13] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines, pages 321-360. MIT Press, 2007.
[14] Y. Bengio and M. Monperrus. Non-local manifold tangent learning. In Advances in Neural Information Processing Systems, volume 17, pages 129-136, Cambridge, MA, USA, 2004. The MIT Press.
[15] Y. Bengio, J.-F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, and M. Ouimet. Out-of-sample extensions for LLE, Isomap, MDS, eigenmaps, and spectral clustering. In Advances in Neural Information Processing Systems, volume 16, Cambridge, MA, USA, 2004. The MIT Press.
[16] B.L. Betechuoh, T. Marwala, and T. Tettey. Autoencoder networks for HIV classification. Current Science, 91(11):1467-1473, 2006.
[17] C. Bishop, M. Svensen, and C. Williams. GTM: The generative topographic mapping. Neural Computation, 10(1):215-234, 1998.
[18] C.M. Bishop. Pattern recognition and machine learning. Springer, 2006.
[19] B. Borchers and J.G. Young. Implementation of a primal-dual method for SDP on a shared memory parallel architecture. Computational Optimization and Applications, 37(3):355-369, 2007.
[20] M. Brand. Charting a manifold. In Advances in Neural Information Processing Systems, volume 15, pages 985-992, Cambridge, MA, USA, 2002. The MIT Press.
[21] M. Brand. From subspaces to submanifolds. In Proceedings of the 15th British Machine Vision Conference, London, UK, 2004. British Machine Vision Association.
[22] A. Brun, H.-J. Park, H. Knutsson, and C.-F. Westin. Coloring of DT-MRI fiber traces using Laplacian Eigenmaps. In Proceedings of the Eurocast 2003, Neuro Image Workshop, 2003.
[23] C.J.C. Burges. Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers, chapter Geometric Methods for Feature Selection and Dimensional Reduction: A Guided Tour. Kluwer Academic Publishers, 2005.
[24] H. Chang, D.-Y. Yeung, and Y. Xiong. Super-resolution through neighbor embedding. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 275-282, 2004.
[25] K.-Y. Chang and J. Ghosh. Principal curves for nonlinear feature extraction and classification. In Applications of Artificial Neural Networks in Image Processing III, pages 120-129, Bellingham, WA, USA, 1998. SPIE.
[26] C. Chatfield and A.J. Collins. Introduction to Multivariate Analysis. Chapman and Hall, 1980.
[27] H. Choi and S. Choi. Robust kernel Isomap. Pattern Recognition, 40(3):853-862, 2007.
[28] J.A. Costa and A.O. Hero. Classification constrained dimensionality reduction. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 5, pages 1077-1080, 2005.
[29] T. Cox and M. Cox. Multidimensional scaling. Chapman & Hall, London, UK, 1994.
[30] D. de Ridder and V. Franc. Robust subspace mixture models using t-distributions. In Proceedings of the British Machine Vision Conference 2003, pages 319-328, 2003.
[31] V. de Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Advances in Neural Information Processing Systems, volume 15, pages 721-728, Cambridge, MA, USA, 2003. The MIT Press.
[32] P. Demartines and J. Herault. Curvilinear component analysis: A self-organizing neural network for nonlinear mapping of data sets. IEEE Transactions on Neural Networks, 8(1):148-154, 1997.
[33] D. DeMers and G. Cottrell. Non-linear dimensionality reduction. In Advances in Neural Information Processing Systems, volume 5, pages 580-587, San Mateo, CA, USA, 1993. Morgan Kaufmann.
[34] A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[35] E.W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959.
[36] D.L. Donoho and C. Grimes. Hessian eigenmaps: New locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 102(21):7426-7431, 2005.
[37] R. Duraiswami and V.C. Raykar. The manifolds of spatial hearing. In Proceedings of International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 285-288, 2005.
[38] P. Erdos and A. Renyi. On random graphs. Publicationes Mathematicae, 6:290-297, 1959.
[39] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163-174, New York, NY, USA, 1995. ACM Press.
[40] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[41] R.W. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962.
[42] D.R. Fokkema, G.L.G. Sleijpen, and H.A. van der Vorst. Jacobi-Davidson style QR and QZ algorithms for the reduction of matrix pencils. SIAM Journal on Scientific Computing, 20(1):94-125, 1999.
[43] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press Professional, Inc., San Diego, CA, USA, 1990.
[44] Z. Ghahramani and G.E. Hinton. The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, Department of Computer Science, University of Toronto, 1996.
[45] G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21:215-224, 1979.
[46] T. Graepel. Kernel matrix completion by semidefinite programming. In Proceedings of the International Conference on Artificial Neural Networks, pages 694-699, Berlin, Germany, 2002. Springer-Verlag.
[47] A.B.A. Graf and F.A. Wichmann. Gender classification of human faces. In Biologically Motivated Computer Vision 2002, LNCS 2525, pages 491-501, 2002.
[48] J. Ham, D. Lee, S. Mika, and B. Scholkopf. A kernel view of the dimensionality reduction of manifolds. Technical Report TR-110, Max Planck Institute for Biological Cybernetics, Germany, 2003.
[49] X. He, D. Cai, S. Yan, and H.-J. Zhang. Neighborhood preserving embedding. In Proceedings of the 10th IEEE International Conference on Computer Vision, pages 1208-1213, 2005.
[50] X. He and P. Niyogi. Locality preserving projections. In Advances in Neural Information Processing Systems, volume 16, page 37, Cambridge, MA, USA, 2004. The MIT Press.
[51] X. He, S. Yan, Y. Hu, P. Niyogi, and H.-J. Zhang. Face recognition using Laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(3):328-340, 2005.
[52] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800, 2002.
[53] G.E. Hinton and S.T. Roweis. Stochastic Neighbor Embedding. In Advances in Neural Information Processing Systems, volume 15, pages 833-840, Cambridge, MA, USA, 2002. The MIT Press.
[54] G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[55] H. Hoffmann. Kernel PCA for novelty detection. Pattern Recognition, 40(3):863-874, 2007.
[56] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24:417-441, 1933.
[57] R. Huber, H. Ramoser, K. Mayer, H. Penz, and M. Rubik. Classification of coins using an eigenspace approach. Pattern Recognition Letters, 26(1):61-75, 2005.
[58] N.P. Hughes and L. Tarassenko. Novel signal shape descriptors through wavelet transforms and dimensionality reduction. In Wavelet Applications in Signal and Image Processing X, pages 763-773, 2003.
[59] O.C. Jenkins and M.J. Mataric. Deriving action and behavior primitives from human motion data. In International Conference on Intelligent Robots and Systems, volume 3, pages 2551-2556, 2002.
[60] L.O. Jimenez and D.A. Landgrebe. Supervised classification in high-dimensional space: geometrical, statistical, and asymptotical properties of multivariate data. IEEE Transactions on Systems, Man and Cybernetics, 28(1):39-54, 1997.
[61] N. Kambhatla and T.K. Leen. Dimension reduction by local principal component analysis. Neural Computation, 9(7):1493-1516, 1997.
[62] R. Kharal. Semidefinite embedding for the dimensionality reduction of DNA microarray data. Master's thesis, University of Waterloo, 2006.
[63] K.I. Kim, K. Jung, and H.J. Kim. Face recognition using kernel principal component analysis. IEEE Signal Processing Letters, 9(2):40-42, 2002.
[64] T. Kohonen. Self-organization and associative memory. Springer-Verlag, Berlin, Germany, 1988.
[65] E. Kokiopoulou and Y. Saad. Orthogonal Neighborhood Preserving Projections: A projection-based dimensionality reduction technique. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2134-2156, 2007.
[66] J.B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964.
[67] S.Y. Kung, K.I. Diamantaras, and J.S. Taur. Adaptive Principal component EXtraction (APEX) and applications. IEEE Transactions on Signal Processing, 42(5):1202-1217, 1994.
[68] S. Lafon and A.B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9):1393-1403, 2006.
[69] G.R.G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
[70] M.H. Law and A.K. Jain. Incremental nonlinear dimensionality reduction by manifold learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(3):377-391, 2006.
[71] J.A. Lee and M. Verleysen. Nonlinear dimensionality reduction of data manifolds with essential loops. Neurocomputing, 67:29-53, 2005.
[72] E. Levina and P.J. Bickel. Maximum likelihood estimation of intrinsic dimension. In Advances in Neural Information Processing Systems, volume 17, Cambridge, MA, USA, 2004. The MIT Press.
[73] H. Li, L. Teng, W. Chen, and I.-F. Shen. Supervised learning on local tangent space. In Lecture Notes in Computer Science, volume 3496, pages 546-551, Berlin, Germany, 2005. Springer Verlag.
[74] I.S. Lim, P.H. Ciechomski, S. Sarni, and D. Thalmann. Planar arrangement of high-dimensional biomedical data sets by Isomap coordinates. In Proceedings of the 16th IEEE Symposium on Computer-Based Medical Systems, pages 50-55, 2003.
[75] A. Lima, H. Zen, Y. Nankaku, C. Miyajima, K. Tokuda, and T. Kitamura. On the use of Kernel PCA for feature extraction in speech recognition. IEICE Transactions on Information Systems, E87-D(12):2802-2811, 2004.
[76] N. Mekuz and J.K. Tsotsos. Parameterless Isomap with adaptive neighborhood selection. In Proceedings of the 28th DAGM Symposium, pages 364-373, Berlin, Germany, 2006. Springer.
[77] M. Meytlis and L. Sirovich. On the dimensionality of face space. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(7):1262-1267, 2007.
[78] B. Nadler, S. Lafon, R.R. Coifman, and I.G. Kevrekidis. Diffusion maps, spectral clustering and the reaction coordinates of dynamical systems. Applied and Computational Harmonic Analysis: Special Issue on Diffusion Maps and Wavelets, 21:113-127, 2006.
[79] K. Nam, H. Je, and S. Choi. Fast Stochastic Neighbor Embedding: A trust-region algorithm. In Proceedings of the IEEE International Joint Conference on Neural Networks 2004, volume 1, pages 123-128, Budapest, Hungary, 2004.
[80] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, pages 849-856, Cambridge, MA, USA, 2001. The MIT Press.
[81] M. Niskanen and O. Silven. Comparison of dimensionality reduction methods for wood surface inspection. In Proceedings of the 6th International Conference on Quality Control by Artificial Vision, pages 178-188, Gatlinburg, TN, USA, 2003. International Society for Optical Engineering.
[82] J.-H. Park, Z. Zhang, H. Zha, and R. Kasturi. Local smoothing for manifold learning. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 452-459, 2004.
[83] M. Partridge and R. Calvo. Fast dimensionality reduction and Simple PCA. Intelligent Data Analysis, 2(3):292-298, 1997.
[84] N. Patwari and A.O. Hero. Manifold learning algorithms for localization in wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 3, pages 857-860, 2004.
[85] M. Penrose. Random Geometric Graphs. Oxford University Press, 2003.
[86] J.C. Platt. FastMap, MetricMap, and Landmark MDS are all Nystrom algorithms. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pages 261-268, 2005.
[87] A.M. Posadas, F. Vidal, F. de Miguel, G. Alguacil, J. Pena, J.M. Ibanez, and J. Morales. Spatial-temporal analysis of a seismic series using the principal components method. Journal of Geophysical Research, 98(B2):1923-1932, 1993.
[88] N.M. Rajpoot, M. Arif, and A.H. Bhalerao. Unsupervised learning of shape manifolds. In Proceedings of the British Machine Vision Conference, 2007.
[89] B. Raytchev, I. Yoda, and K. Sakaue. Head pose estimation by nonlinear manifold learning. In Proceedings of the 17th ICPR, pages 462-466, 2004.
[90] S.T. Roweis. EM algorithms for PCA and SPCA. In Advances in Neural Information Processing Systems, volume 10, pages 626-632, 1997.
[91] S.T. Roweis, L. Saul, and G. Hinton. Global coordination of local linear models. In Advances in Neural Information Processing Systems, volume 14, pages 889-896, Cambridge, MA, USA, 2001. The MIT Press.
[92] S.T. Roweis and L.K. Saul. Nonlinear dimensionality reduction by Locally Linear Embedding. Science, 290(5500):2323-2326, 2000.
[93] O. Samko, A.D. Marshall, and P.L. Rosin. Selection of the optimal parameter value for the Isomap algorithm. Pattern Recognition Letters, 27(9):968-979, 2006.
[94] G. Sanguinetti. Dimensionality reduction of clustered datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(3):535-540, 2008.
[95] L.K. Saul, K.Q. Weinberger, J.H. Ham, F. Sha, and D.D. Lee. Spectral methods for dimensionality reduction. In Semisupervised Learning, Cambridge, MA, USA, 2006. The MIT Press.
[96] A. Saxena, A. Gupta, and A. Mukerjee. Non-linear dimensionality reduction by locally linear isomaps. Lecture Notes in Computer Science, 3316:1038-1043, 2004.
[97] B. Scholkopf, A.J. Smola, and K.-R. Muller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299-1319, 1998.
[98] F. Sha and L.K. Saul. Analysis and extension of spectral methods for nonlinear dimensionality reduction. In Proceedings of the 22nd International Conference on Machine Learning, pages 785-792, 2005.
[99] J. Shawe-Taylor and N. Christianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[100] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
[101] C. Spearman. General intelligence objectively determined and measured. American Journal of Psychology, 15:201-293, 1904.
[102] J.A.K. Suykens. Data visualization and dimensionality reduction using kernel maps with a reference point. Technical Report 07-22, ESAT-SISTA, K.U. Leuven, 2007.
[103] G.A. Tagaris, W. Richter, S.G. Kim, G. Pellizzer, P. Andersen, K. Ugurbil, and A.P. Georgopoulos. Functional magnetic resonance imaging of mental rotation and memory scanning: a multidimensional scaling analysis of brain activation patterns. Brain Res., 26(2-3):106-12, 1998.
[104] Y.W. Teh and S.T. Roweis. Automatic alignment of hidden representations. In Advances in Neural Information Processing Systems, volume 15, pages 841-848, Cambridge, MA, USA, 2002. The MIT Press.
[105] J.B. Tenenbaum. Mapping a manifold of perceptual observations. In Advances in Neural Information Processing Systems, volume 10, pages 682-688, Cambridge, MA, USA, 1998. The MIT Press.
[106] J.B. Tenenbaum, V. de Silva, and J.C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, 2000.
[107] L. Teng, H. Li, X. Fu, W. Chen, and I.-F. Shen. Dimension reduction of microarray data based on local tangent space alignment. In Proceedings of the 4th IEEE International Conference on Cognitive Informatics, pages 154-159, 2005.
[108] M.E. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems, volume 13, pages 633-639, Cambridge, MA, USA, 2000. The MIT Press.
[109] M.E. Tipping and C.M. Bishop. Mixtures of probabilistic principal component analysers. Neural Computation, 11(2):443-482, 1999.
[110] M.A. Turk and A.P. Pentland. Face recognition using eigenfaces. In Proceedings of the Computer Vision and Pattern Recognition 1991, pages 586-591, 1991.
[111] L. Vandenberghe and S. Boyd. Semidefinite programming. SIAM Review, 38(1):49-95, 1996.
[112] M.S. Venkatarajan and W. Braun. New quantitative descriptors of amino acids based on multidimensional scaling of a large number of physical-chemical properties. Journal of Molecular Modeling, 7(12):445-453, 2004.
[113] J. Venna. Dimensionality reduction for visual exploration of similarity structures. PhD thesis, Helsinki University of Technology, 2007.
[114] J. Verbeek. Learning nonlinear image manifolds by global alignment of local linear models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8):1236-1250, 2006.
[115] J. Wang, Z. Zhang, and H. Zha. Adaptive manifold learning. In Advances in Neural Information Processing Systems, volume 17, pages 1473-1480, Cambridge, MA, USA, 2005. The MIT Press.
[116] K.Q. Weinberger, B.D. Packer, and L.K. Saul. Nonlinear dimensionality reduction by semidefinite programming and kernel matrix factorization. In Proceedings of the 10th International Workshop on AI and Statistics, Barbados, WI, 2005. Society for Artificial Intelligence and Statistics.
[117] K.Q. Weinberger, F. Sha, and L.K. Saul. Learning a kernel matrix for nonlinear dimensionality reduction. In Proceedings of the 21st International Conference on Machine Learning, 2004.
[118] K.Q. Weinberger, F. Sha, Q. Zhu, and L.K. Saul. Graph Laplacian regularization for large-scale semidefinite programming. In Advances in Neural Information Processing Systems, volume 19 (to appear), 2007.
[119] Y. Weiss. Segmentation using eigenvectors: a unifying view. In Proceedings of the IEEE International Conference on Computer Vision, volume 2, pages 975-982, Los Alamitos, CA, USA, 1999. IEEE Computer Society Press.
[120] L. Xiao, J. Sun, and S. Boyd. A duality view of spectral methods for dimensionality reduction. In Proceedings of the 23rd International Conference on Machine Learning, pages 1041-1048, 2006.
[121] R. Xu, S. Damelin, and D.C. Wunsch. Applications of diffusion maps in gene expression data-based cancer diagnosis analysis. In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 4613-4616, 2007.
[122] T. Zhang, J. Yang, D. Zhao, and X. Ge. Linear local tangent space alignment and application to face recognition. Neurocomputing, 70:1547-1553, 2007.
[123] Z. Zhang and H. Zha. Local linear smoothing for nonlinear manifold learning. Technical Report CSE-03-003, Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA, USA, 2003.
[124] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26(1):313-338, 2004.
