DR Survey

A survey of dimensionality reduction techniques
C.O.S. Sorzano1,*, J. Vargas1, A. Pascual-Montano1
1 Natl. Centre for Biotechnology (CSIC)
C/Darwin, 3. Campus Univ. Autónoma, 28049 Cantoblanco, Madrid, Spain
{coss,jvargas,pascual}@cnb.csic.es
* Corresponding author
Abstract. Experimental life sciences like biology or chemistry have seen in recent decades an explosion of the data available from experiments. Laboratory instruments become more and more complex and report hundreds or thousands of measurements for a single experiment, and therefore statistical methods face challenging tasks when dealing with such high-dimensional data. However, much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information. The mathematical procedures making this reduction possible are called dimensionality reduction techniques; they have been widely developed by fields like Statistics or Machine Learning, and are currently a hot research topic. In this review we categorize the plethora of dimension reduction techniques available and give the mathematical insight behind them.
Keywords: Dimensionality Reduction, Data Mining, Machine Learning, Statistics
1. Introduction
During the last decade life sciences have undergone a tremendous revolution with the accelerated development of high technologies and laboratory instrumentation. A good example is the biomedical domain, which has experienced a drastic advance since the advent of complete genome sequences. This post-genomics era has led to the development of new high-throughput techniques that are generating enormous amounts of data, which in turn have driven the exponential growth of many biological databases. In many cases, these datasets have many more variables than observations. For example, standard microarray datasets are usually composed of thousands of variables (genes) in dozens of samples. This situation is not exclusive to biomedical research, and many other scientific fields have also seen an explosion in the number of variables measured for a single experiment. This is the case of image processing, mass spectrometry, time series analysis, internet search engines, and automatic text analysis, among others.
Statistical and machine reasoning methods face a formidable problem when dealing with such high-dimensional data, and normally the number of input variables is reduced before a data mining algorithm can be successfully applied. The dimensionality reduction can be made in two different ways: by only keeping the most relevant variables from the original dataset (this technique is called feature selection) or by exploiting the redundancy of the input data and finding a smaller set of new variables, each being a combination of the input variables, containing basically the same information as the input variables (this technique is called dimensionality reduction).
This situation is not new in Statistics. In fact, one of the most widely used dimensionality reduction techniques, Principal Component Analysis (PCA), dates back to Karl Pearson in 1901 [Pearson1901]. The key idea is to find a new coordinate system in which the input data can be expressed with far fewer variables without a significant error. This new basis can be global or local and can fulfill very different properties. The recent explosion of available data, together with ever more powerful computational resources, has attracted the attention of many researchers in Statistics, Computer Science and Applied Mathematics, who have developed a wide range of computational techniques dealing with the dimensionality reduction problem (for reviews see [Carreira1997, Fodor2002, Mateen2009]).
In this review we provide an up-to-date overview of the mathematical properties and foundations of the different dimensionality reduction techniques. For feature selection, the reader is referred to the reviews of [Dash1997], [Guyon2003] and [Saeys2007].
There are several dimensionality reduction techniques specifically designed for time series. These methods exploit the frequency content of the signal and its usual sparseness in the frequency space. The most popular methods are those based on wavelets [Rioul1991, Graps1995], followed at a large distance by the Empirical Mode Decomposition [Huang1998, Rilling2003] (the reader is referred to the references above for further details). We do not cover these techniques here since they are not usually applied for the general-purpose dimensionality reduction of data. From a general point of view, we may say that wavelets project the input time series onto a fixed dictionary (see Section 3). This dictionary has the property of making the projection sparse (only a few coefficients are sufficiently large), and the dimensionality reduction is obtained by setting most coefficients (the small ones) to zero. The empirical mode decomposition, instead, constructs a dictionary specially adapted to each input signal.
To keep the consistency of this review, we also do not cover those dimensionality reduction techniques that take into account the class of the observations, i.e., cases in which there are observations from a class A of objects, observations from a class B, and so on, and the dimensionality reduction technique should keep the separability of the original classes as well as possible. Fisher's Linear Discriminant Analysis (LDA) was one of the first techniques to address this issue [Fisher1936, Rao1948]. Many other works have followed since then; for the most recent works and for a bibliographical review see [Bian2011, Cai2011, Kim2011, Lin2011, Batmanghelich2012].
In the following we will refer to the observations as input vectors $\mathbf{x}$, whose dimension is $M$. We will assume that we have $N$ observations and we will refer to the $n$-th observation as $\mathbf{x}_n$. The whole dataset of observations will be $\mathcal{X}$, while $X$ will be an $M \times N$ matrix with all the observations as columns. Note that small, bold letters represent vectors ($\mathbf{x}$), while capital, non-bold letters ($X$) represent matrices. The goal of dimensionality reduction is to find another representation $\boldsymbol{\chi}$ of a smaller dimension $m$ such that as much information as possible is retained from the original set of observations $\mathbf{x}$. This involves some transformation operator from the original vectors onto the new vectors, $\boldsymbol{\chi} = T(\mathbf{x})$. These projected vectors are sometimes called feature vectors, and the projection of $\mathbf{x}_n$ will be noted as $\boldsymbol{\chi}_n$. There might not be an inverse for this projection, but there must be a way of recovering an approximate value of the original vector, $\hat{\mathbf{x}} = R(\boldsymbol{\chi})$, such that $\hat{\mathbf{x}} \approx \mathbf{x}$.
An interesting property to consider in any dimensionality reduction technique is its stability. In this context, a technique is said to be $\epsilon$-stable if, for any two input data points $\mathbf{x}_1$ and $\mathbf{x}_2$, the following inequality holds [Baraniuk2010]: $(1-\epsilon)\|\mathbf{x}_1-\mathbf{x}_2\|_2^2 \le \|\boldsymbol{\chi}_1-\boldsymbol{\chi}_2\|_2^2 \le (1+\epsilon)\|\mathbf{x}_1-\mathbf{x}_2\|_2^2$. Intuitively, this inequality implies that Euclidean distances in the original input space are relatively conserved in the output feature space.
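The stability condition can be checked empirically on a finite dataset. The following is a minimal sketch, not part of the original formulation: it computes the smallest $\epsilon$ for which the two bounds hold over all pairs of observations. The function names and the choice of a scaled Gaussian random projection as the operator $T$ are assumptions made only for illustration.

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between all pairs of columns of A."""
    G = A.T @ A
    sq = np.diag(G)
    return sq[:, None] + sq[None, :] - 2.0 * G

def distortion_epsilon(X, Chi):
    """Smallest eps for which the stability bounds hold on every pair of columns.

    X   : (M, N) array, input vectors as columns
    Chi : (m, N) array, feature vectors as columns
    """
    iu = np.triu_indices(X.shape[1], k=1)  # each unordered pair once
    ratio = pairwise_sq_dists(Chi)[iu] / pairwise_sq_dists(X)[iu]
    return float(np.max(np.abs(ratio - 1.0)))

rng = np.random.default_rng(0)
M, m, N = 1000, 200, 30
X = rng.standard_normal((M, N))
# Hypothetical choice of T: a scaled Gaussian random projection.
P = rng.standard_normal((m, M)) / np.sqrt(m)
eps = distortion_epsilon(X, P @ X)
print(f"empirical eps over {N * (N - 1) // 2} pairs: {eps:.3f}")
```

A value of `eps` close to 0 means that squared distances are almost preserved; the estimate only covers the pairs present in the sample, so it is a lower bound on the $\epsilon$ of the technique itself.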
2. Methods based on Statistics and Information Theory

This family of methods reduces the input data according to some statistical or information theory criterion. In a sense, the methods based on information theory can be seen as a generalization of the ones based on statistics, since they can capture nonlinear relationships between variables, can handle interval and categorical variables at the same time, and many of them are invariant to monotonic transformations of the input variables.
2.1 Vector Quantization and Mixture models

Probably the simplest way of reducing dimensionality is by assigning a class (among a total of $K$ classes) to each one of the observations $\mathbf{x}_n$. This can be seen as an extreme case of dimensionality reduction in which we go from $M$ dimensions to 1 (the discrete class label $\chi$). Each class, $\chi$, has a representative $\bar{\mathbf{x}}_\chi$ which is the average of all the observations assigned to that class. If a vector $\mathbf{x}_n$ has been assigned to the $\chi_n$-th class, then its approximation after the dimensionality reduction is simply $\hat{\mathbf{x}}_n = \bar{\mathbf{x}}_{\chi_n}$ (see Fig. 1).
Figure 1. Example of the use of a vector quantization. Black circles represent the input data, $\mathbf{x}_n$; red squares represent class representatives, $\bar{\mathbf{x}}_\chi$.
The goal is thus to find the representatives $\bar{\mathbf{x}}_\chi$ and class assignments $u_\chi(\mathbf{x})$ ($u_\chi(\mathbf{x})$ is equal to 1 if the observation $\mathbf{x}$ is assigned to the $\chi$-th class, and is 0 otherwise) such that $J_{VQ} = E\left\{\sum_{\chi=1}^{K} u_\chi(\mathbf{x}) \|\mathbf{x}-\bar{\mathbf{x}}_\chi\|^2\right\}$ is minimized. This problem is known as vector quantization or K-means [Hartigan1979]. The optimization of this goal function is a combinatorial problem, although there are heuristics to cut down its cost [Gray1984, Gersho1992]. An alternative formulation of the K-means objective function is $J_{VQ} = \|X - WU\|_F^2$ subject to $U^t U = I$ and $u_{ij} \in \{0,1\}$ (i.e., each input vector is assigned to one and only one class) [Batmanghelich2012]. In this expression, $W$ is an $M \times m$ matrix with all representatives as column vectors, $U$ is an $m \times N$ matrix whose $ij$-th entry is 1 if the $j$-th input vector is assigned to the $i$-th class, and $\|\cdot\|_F^2$ denotes the squared Frobenius norm of a matrix.
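As an illustration of the two formulations above, the following sketch runs the classical Lloyd heuristic (one of the possible heuristics, assumed here for simplicity) and then evaluates the objective in its matrix form $\|X - WU\|_F^2$. The function name `kmeans`, its parameters and the synthetic two-cluster data are assumptions of this example.

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Lloyd's heuristic for vector quantization.

    X : (M, N) array with observations as columns.
    Returns W (M, K) representatives, labels (N,), and the one-hot matrix U (K, N).
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    # Initialize the representatives with K distinct observations.
    W = X[:, rng.choice(N, size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every representative to every observation: (K, N).
        d2 = ((X[:, None, :] - W[:, :, None]) ** 2).sum(axis=0)
        labels = d2.argmin(axis=0)
        # Each representative becomes the average of its assigned observations.
        for k in range(K):
            members = X[:, labels == k]
            if members.shape[1] > 0:
                W[:, k] = members.mean(axis=1)
    U = np.zeros((K, N))
    U[labels, np.arange(N)] = 1.0
    return W, labels, U

rng = np.random.default_rng(1)
X = np.hstack([rng.normal(-5, 1, (2, 40)), rng.normal(5, 1, (2, 40))])
W, labels, U = kmeans(X, K=2)
J_vq = np.linalg.norm(X - W @ U, "fro") ** 2
print("J_VQ =", J_vq)
```

Each column of `U` contains a single 1, so `W @ U` simply replaces every observation by its class representative, which is the approximation $\hat{\mathbf{x}}_n$ described above.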
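This intuitive goal function can be put in a probabilistic framework. Let us assume we have a generative model of how the data is produced. Let us assume that the observed data are noisy versions of $K$ vectors $\bar{\mathbf{x}}_\chi$ which are equally likely a priori. Let us assume that the observation noise is normally distributed with a spherical covariance matrix $\Sigma = \sigma^2 I$. The likelihood of observing $\mathbf{x}_n$, given that it was produced by $\bar{\mathbf{x}}_\chi$, is $l(\mathbf{x}_n | \bar{\mathbf{x}}_\chi, \sigma^2) = (2\pi\sigma^2)^{-M/2} \exp\left(-\frac{1}{2}\frac{\|\mathbf{x}_n-\bar{\mathbf{x}}_\chi\|^2}{\sigma^2}\right)$. With our previous definition of $u_\chi(\mathbf{x})$ we can express it as $l(\mathbf{x}_n | \bar{\mathbf{x}}, \sigma^2) = (2\pi\sigma^2)^{-M/2} \exp\left(-\frac{1}{2}\frac{\sum_{\chi=1}^{K} u_\chi(\mathbf{x}_n)\|\mathbf{x}_n-\bar{\mathbf{x}}_\chi\|^2}{\sigma^2}\right)$. The log likelihood of observing the whole dataset $\mathbf{x}_n$ ($n = 1, 2, ..., N$), after removing all constants (including the negative factor $-\frac{1}{2\sigma^2}$), is $L(X | \bar{\mathbf{x}}_\chi) = \sum_{n=1}^{N}\sum_{\chi=1}^{K} u_\chi(\mathbf{x}_n)\|\mathbf{x}_n-\bar{\mathbf{x}}_\chi\|^2$, so that maximizing the likelihood amounts to minimizing this sum. We thus see that the goal function of vector quantization, $J_{VQ}$, produces the maximum likelihood estimates of the underlying $\bar{\mathbf{x}}_\chi$ vectors.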

Under this generative model, the probability density function of the observations is the convolution of a Gaussian function and a set of delta functions located at the $\bar{\mathbf{x}}_\chi$ vectors, i.e., a set of Gaussians located at the $\bar{\mathbf{x}}_\chi$ vectors. Vector quantization is then an attempt to find the centers of the Gaussians forming the probability density function of the input data. This idea has been further pursued by Mixture Models [Bailey1994], which are a generalization of vector quantization in which, instead of looking only for the means of the Gaussians associated to each class, we also allow each class to have a different covariance matrix $\Sigma_\chi$ and a different a priori probability $\pi_\chi$. The algorithm looks for estimates of all these parameters by Expectation-Maximization, and at the end produces for each input observation $\mathbf{x}_n$ the label $\chi$ of the Gaussian that has the maximum likelihood of having generated that observation.
We can extend this concept and, instead of making a hard class assignment, we can make a fuzzy class assignment by allowing $0 \le u_\chi(\mathbf{x}) \le 1$ and requiring $\sum_{\chi=1}^{K} u_\chi(\mathbf{x}) = 1$ for all $\mathbf{x}$. This is another famous vector quantization algorithm called fuzzy K-means [Bezdek1981, Bezdek1984].
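The fuzzy assignment can be sketched with the usual alternating updates of fuzzy c-means. This is only an illustrative sketch: the fuzzifier exponent `q` is a standard ingredient of that algorithm that is not discussed in the text above, and the function name and test data are likewise assumptions.

```python
import numpy as np

def fuzzy_kmeans(X, K, q=2.0, n_iter=100, seed=0):
    """Fuzzy K-means with fuzzifier q > 1 (an assumed extra parameter).

    X : (M, N) array with observations as columns.
    Returns W (M, K) representatives and U (K, N) memberships in [0, 1].
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((K, N))
    U /= U.sum(axis=0, keepdims=True)          # memberships of each x sum to 1
    for _ in range(n_iter):
        Uq = U ** q
        W = (X @ Uq.T) / Uq.sum(axis=1)        # membership-weighted averages
        d2 = ((X[:, None, :] - W[:, :, None]) ** 2).sum(axis=0)
        d2 = np.maximum(d2, 1e-12)             # guard against division by zero
        inv = d2 ** (-1.0 / (q - 1.0))
        U = inv / inv.sum(axis=0, keepdims=True)
    return W, U

rng = np.random.default_rng(2)
X = rng.standard_normal((2, 60))
W, U = fuzzy_kmeans(X, K=3)
print(U[:, :3])  # soft memberships of the first three observations
```

As `q` approaches 1 the memberships become nearly binary and the procedure behaves like the hard K-means assignment.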
The K-means algorithm is based on a quadratic objective function, which is known to be strongly affected by outliers. This drawback can be alleviated by taking the $l_1$ norm of the approximation errors and modifying the problem to $J_{K\text{-}medians} = \|X - WU\|_1^2$ subject to $U^t U = I$ and $u_{ij} \in \{0,1\}$ [Arora1998, Batmanghelich2012]. [Iglesias2007] proposed a different approach to find data representatives less affected by outliers, which we may call robust Vector Quantization, $J_{RVQ} = E\left\{\sum_{\chi=1}^{K} u_\chi(\mathbf{x}) \Phi\left(\|\mathbf{x}-\bar{\mathbf{x}}_\chi\|\right)\right\}$, where $\Phi(x)$ is a function less sensitive to outliers than $\Phi(x) = x^2$; for instance, [Iglesias2007] proposes $\Phi(x) = x^\alpha$ with $\alpha$ about 0.5.
Some authors [Girolami2002, Dhillon2004, Yu2012] have proposed to use a nonlinear embedding of the input vectors, $\Phi(\mathbf{x})$, into a higher dimensional space (dimensionality expansion, instead of reduction), and then perform the vector quantization in this higher dimensional space (this kind of algorithm is called a Kernel algorithm and is further explained below with Kernel PCA). The reason for performing this nonlinear mapping is that the topological spherical balls induced by the distance $\|\Phi(\mathbf{x}) - \Phi(\bar{\mathbf{x}}_\chi)\|^2$ in the higher dimensional space correspond to non-spherical neighborhoods in the original input space, thus allowing for a richer family of distance functions.
Although vector quantization has all the ingredients to be considered a dimensionality reduction (mapping from the high dimensional space to the low dimensional space by assigning a class label $\chi$, and back to the high dimensional space through an approximation), this algorithm has a serious drawback. The problem is that the distances in the feature space ($\chi$ runs from 1 to $K$) do not correspond to distances in the original space. For example, if $\mathbf{x}_n$ is assigned label 0, $\mathbf{x}_{n+1}$ label 1, and $\mathbf{x}_{n+2}$ label 2, it does not mean that $\mathbf{x}_n$ is closer to $\mathbf{x}_{n+1}$ than to $\mathbf{x}_{n+2}$ in the input $M$-dimensional space. Labels are arbitrary and do not allow us to conclude anything about the relative organization of the input data, other than knowing that all vectors assigned to the same label are closer to the representative of that label than to the representative of any other label (this fact creates a Voronoi partition of the input space) [Gray1984, Gersho1992].
The algorithms presented from now on do not suffer from this problem. Moreover, the problem can be further attenuated by imposing a neighborhood structure on the feature space. This is done by Self-Organizing Maps and Generative Topographic Mappings, which are presented below.
2.2 PCA
Principal Component Analysis (PCA) is by far one of the most popular algorithms for dimensionality reduction [Pearson1901, Wold1987, Dunteman1989, Jollife2002]. Given a set of observations $\mathbf{x}$, with dimension $M$ (they lie in $\mathbb{R}^M$), PCA is the standard technique for finding the single best (in the sense of least-square error) subspace of a given dimension, $m$. Without loss of generality, we may assume the data is zero-mean and the subspace to fit is a linear subspace (passing through the origin).
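This algorithm is based on the search for orthogonal directions explaining as much variance of the data as possible. In terms of dimensionality reduction it can be formulated [Hyvarinen2001] as the problem of finding the $m$ orthonormal directions $\mathbf{w}_i$ minimizing the representation error $J_{PCA} = E\left\{\left\|\mathbf{x} - \sum_{i=1}^{m} \langle \mathbf{w}_i, \mathbf{x}\rangle \mathbf{w}_i\right\|^2\right\}$. In this objective function, the reduced vectors are the projections $\boldsymbol{\chi} = (\langle \mathbf{w}_1, \mathbf{x}\rangle, ..., \langle \mathbf{w}_m, \mathbf{x}\rangle)^t$. This can be much more compactly written as $\boldsymbol{\chi} = W^t\mathbf{x}$, where $W$ is an $M \times m$ matrix whose columns are the orthonormal directions $\mathbf{w}_i$ (or equivalently $W^tW = I$). The approximation to the original vectors is given by $\hat{\mathbf{x}} = \sum_{i=1}^{m} \langle \mathbf{w}_i, \mathbf{x}\rangle \mathbf{w}_i$, or what is the same, $\hat{\mathbf{x}} = W\boldsymbol{\chi}$. In Figure 2, we show a graphical representation of a PCA transformation in only two dimensions ($\mathbf{x} \in \mathbb{R}^2$). As can be seen from Figure 2, the variance of the data in the original data space is best captured in the rotated space given by vectors $\boldsymbol{\chi} = W^t\mathbf{x}$.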
Figure 2. Graphical representation of a PCA transformation in only two dimensions. $\chi_1$ is the first principal component and it goes in the direction of most variance; $\chi_2$ is the second principal component, it is orthogonal to the first and it goes in the second direction with most variance (in $\mathbb{R}^2$ there is not much choice, but in the general case, $\mathbb{R}^M$, there is). Observe that without loss of generality the data is centred about the origin of the output space.
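We can rewrite the objective function as $J_{PCA} = E\{\|\mathbf{x} - W\boldsymbol{\chi}\|^2\} = E\{\|\mathbf{x} - WW^t\mathbf{x}\|^2\} \propto \|X - WW^tX\|_F^2$. Note that the class membership matrix ($U$ in vector quantization) has been substituted in this case by $W^tX$, which in general can take any positive or negative value. It has thus lost its membership meaning and simply constitutes the weights of the linear combination of the column vectors of $W$ that best approximates each input $\mathbf{x}$. Finally, the PCA objective function can also be written as $J_{PCA} = \mathrm{Tr}\{W^t \Sigma_X W\}$ [He2011] (in this form it is maximized rather than minimized), where $\Sigma_X = \frac{1}{N}\sum_{i}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^t$ is the covariance matrix of the observed data. The PCA formulation has also been extended to complex-valued input vectors [Li2011]; the method is called non-circular PCA.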
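The link between the reconstruction-error form and the trace form of the objective can be verified numerically: for zero-mean data and any $W$ with orthonormal columns, the mean squared reconstruction error plus $\mathrm{Tr}\{W^t \Sigma_X W\}$ equals $\mathrm{Tr}\{\Sigma_X\}$, so minimizing the former is the same as maximizing the latter. The sketch below uses synthetic data and an arbitrary orthonormal $W$; all variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, m = 6, 500, 2

# Synthetic correlated data, centred so that the sample mean is zero.
X = rng.standard_normal((M, M)) @ rng.standard_normal((M, N))
X -= X.mean(axis=1, keepdims=True)
Sigma = X @ X.T / N                      # sample covariance matrix

# Any matrix with orthonormal columns will do (not necessarily the PCA optimum).
W, _ = np.linalg.qr(rng.standard_normal((M, m)))

recon_err = np.linalg.norm(X - W @ W.T @ X, "fro") ** 2 / N
captured = np.trace(W.T @ Sigma @ W)

print(recon_err + captured, np.trace(Sigma))  # the two numbers coincide
```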
The matrix projection of the input vectors onto a lower dimensional space ($\boldsymbol{\chi} = W^t\mathbf{x}$) is a widespread technique in dimensionality reduction, as will be shown in this article. The elements involved in this projection have an interesting interpretation, as explained in the following example. Let us assume that we are analyzing scientific articles related to a specific domain. Each article will be represented by a vector $\mathbf{x}$ of word frequencies, i.e., we choose a set of $M$ words representative of our scientific area, and we annotate how many times each word appears in each article. Each vector $\mathbf{x}$ is then orthogonally projected onto the new subspace defined by the vectors $\mathbf{w}_i$. Each vector $\mathbf{w}_i$ has dimension $M$ and it can be understood as a topic (i.e., a topic is characterized by the relative frequencies of the $M$ different words; two different topics will differ in the relative frequencies of the $M$ words). The projection of $\mathbf{x}$ onto each $\mathbf{w}_i$ gives an idea of how important topic $\mathbf{w}_i$ is for representing the article $\mathbf{x}$. Important topics have large projection values and, therefore, large values in the corresponding component of $\boldsymbol{\chi}$ (see Fig. 3).
Figure 3. Projection of the vector $\mathbf{x}$ onto the subspace spanned by the vectors $\mathbf{w}_i$ (rows of $W^t$). The components of $\mathbf{x}$ represent the frequency of each word in a given scientific article. The vectors $\mathbf{w}_i$ represent the word composition of a given topic. Each component of the projected vector $\boldsymbol{\chi}$ represents how important that topic is for the article being treated.
It can be shown [Hyvarinen2001, Jenssen2010] that when the input vectors, $\mathbf{x}$, are zero-mean (if they are not, we can transform the input data simply by subtracting the sample average vector), the solution of the minimization of $J_{PCA}$ is given by the $m$ eigenvectors associated to the largest $m$ eigenvalues of the covariance matrix of $\mathbf{x}$ ($C_x = \frac{1}{N}XX^t$; note that the covariance matrix of $\mathbf{x}$ is an $M \times M$ matrix with $M$ eigenvalues). If the eigenvalue decomposition of the input covariance matrix is $C_x = W_M \Lambda_M W_M^t$ (since $C_x$ is a real symmetric matrix), then the feature vectors are constructed as $\boldsymbol{\chi} = \Lambda_m^{-1/2} W_m^t \mathbf{x}$, where $\Lambda_m$ is a diagonal matrix with the $m$ largest eigenvalues of the matrix $\Lambda_M$ and $W_m$ are the corresponding $m$ columns from the eigenvectors matrix $W_M$. We could have constructed all the feature vectors at the same time by projecting the whole matrix $X$, $U = \Lambda_m^{-1/2} W_m^t X$. Note that the $i$-th feature is the projection of the input vector $\mathbf{x}$ onto the $i$-th eigenvector, $\chi_i = \lambda_i^{-1/2} \mathbf{w}_i^t \mathbf{x}$. The so-computed feature vectors have identity covariance matrix, $C_\chi = I$, meaning that the different features are decorrelated.
Univariate variance is a second-order statistical measure of the departure of the input observations with respect to the sample mean. A generalization of the univariate variance to multivariate variables is the trace of the input covariance matrix. By choosing the $m$ largest eigenvalues of the covariance matrix $C_x$, we guarantee that we are making a representation in the feature space explaining as much variance of the input space as possible with only $m$ variables. In fact, $\mathbf{w}_1$ is the direction in which the data has the largest variability, $\mathbf{w}_2$ is the direction with largest variability once the variability along $\mathbf{w}_1$ has been removed, $\mathbf{w}_3$ is the direction with largest variability once the variability along $\mathbf{w}_1$ and $\mathbf{w}_2$ has been removed, etc. Thanks to the orthogonality of the $\mathbf{w}_i$ vectors, and the subsequent decorrelation of the feature vectors, the total variance explained by the PCA decomposition can be conveniently measured as the sum of the variances of each feature, $\sigma_{PCA}^2 = \sum_{i=1}^{m} \lambda_i = \sum_{i=1}^{m} \mathrm{Var}\{\chi_i\}$ (with $\chi_i$ taken here as the unnormalized projection $\langle \mathbf{w}_i, \mathbf{x}\rangle$, whose variance is $\lambda_i$).
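The construction of the feature vectors through the eigenvalue decomposition of the covariance matrix can be sketched as follows. This is a minimal illustration on synthetic data; the function name `pca_whiten` and its return values are assumptions of this example, and the normalization by $\lambda_i^{-1/2}$ follows the construction described above.

```python
import numpy as np

def pca_whiten(X, m):
    """PCA features with unit variance.

    X : (M, N) array with observations as columns.
    Returns W_m (M, m), the m largest eigenvalues, the features Chi (m, N),
    and the fraction of the total variance explained by the m directions.
    """
    Xc = X - X.mean(axis=1, keepdims=True)      # enforce zero mean
    N = Xc.shape[1]
    Cx = Xc @ Xc.T / N                          # M x M covariance matrix
    evals, evecs = np.linalg.eigh(Cx)           # eigh: for symmetric matrices
    order = np.argsort(evals)[::-1][:m]         # indices of the m largest eigenvalues
    lam, Wm = evals[order], evecs[:, order]
    Chi = (Wm.T @ Xc) / np.sqrt(lam)[:, None]   # Lambda_m^{-1/2} W_m^t x
    explained = lam.sum() / np.trace(Cx)
    return Wm, lam, Chi, explained

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5)) @ rng.standard_normal((5, 400))
Wm, lam, Chi, explained = pca_whiten(X, m=2)
print("explained variance fraction:", explained)
print("feature covariance:\n", Chi @ Chi.T / Chi.shape[1])
```

The printed feature covariance is the identity matrix up to rounding, which is the decorrelation property $C_\chi = I$.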
2.2.1 Incremental, stream or online PCA

Let us assume that we have observed a number of input vectors $\mathbf{x}$ and we have already performed their dimensionality reduction with PCA. Let us assume that we are given new observations and we want to refine our directions $\mathbf{w}_i$ to accommodate the new vectors. In the standard PCA approach we would have to re-estimate the covariance matrix of $\mathbf{x}$ (now using all the vectors, old and new), and to recompute its eigenvalue decomposition. Incremental methods (such as Incremental PCA [Artac2002]) provide clever ways of updating our estimates of the best directions $\mathbf{w}_i$ based on the old estimates of the best directions and the new data available. In this way, we can efficiently process new data as they arrive. For this reason, incremental methods are also known as stream methods or online methods. This idea can be applied to many of the methods discussed in this review and will not be discussed further.
2.2.2 Relationship of PCA and SVD
Another approach to the PCA problem, resulting in the same projection directions $\mathbf{w}_i$ and feature vectors $\boldsymbol{\chi}$, uses Singular Value Decomposition (SVD, [Golub1970, Klema1980, Wall2003]) for the calculations. Let us consider the matrix $X$ whose columns are the different observations of the vector $\mathbf{x}$. SVD decomposes this matrix as the product of three other matrices, $X = WDU$ ($W$ is of size $M \times M$, $D$ is a diagonal matrix of size $M \times N$, and $U$ is of size $N \times N$) [Abdi2007]. The columns of $W$ are the eigenvectors of the covariance matrix of $\mathbf{x}$, and $D_{ii}$ is proportional to the square root of the associated eigenvalue (with $C_x = \frac{1}{N}XX^t$, $D_{ii}^2 = N\lambda_i$). The columns of $U$ are the feature vectors. So far we have not performed any dimensionality reduction. As in PCA, the dimensionality reduction is achieved by discarding those components associated to the lowest eigenvalues. If we keep the $m$ directions with largest singular values, we are approximating the data matrix by $\hat{X} = W_m D_m U_m$ ($W_m$ is of size $M \times m$, $D_m$ is of size $m \times m$, and $U_m$ is of size $m \times N$). It has been shown [Johnson1963] that $\hat{X}$ is the rank-$m$ matrix that best approximates $X$ in the Frobenius norm sense (i.e., exactly the same as required by $J_{PCA}$).
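The relationship between the SVD factors and the eigenvalue decomposition of the covariance matrix can be checked numerically. In the sketch below (synthetic data, variable names chosen for this example), `np.linalg.svd` returns the three factors in the same order as $X = WDU$, with the singular values given as a vector.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, m = 5, 200, 2
X = rng.standard_normal((M, M)) @ rng.standard_normal((M, N))
X -= X.mean(axis=1, keepdims=True)               # zero-mean observations

# Thin SVD: W is (M, M), d holds the singular values, U is (M, N) here.
W, d, U = np.linalg.svd(X, full_matrices=False)

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
evals = np.linalg.eigvalsh(X @ X.T / N)[::-1]
print(np.allclose(d ** 2 / N, evals))            # D_ii^2 = N * lambda_i

# Rank-m approximation keeping the m largest singular values.
X_hat = W[:, :m] @ np.diag(d[:m]) @ U[:m, :]
err = np.linalg.norm(X - X_hat, "fro") ** 2
print(err, np.sum(d[m:] ** 2))                   # error = discarded energy
```

The squared Frobenius error of the truncated reconstruction equals the sum of the squared discarded singular values, which is why dropping the smallest ones gives the best approximation of a given rank.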
Another interesting property highlighted by this decomposition is that the eigenvectors $\mathbf{w}_i$ (columns of the $W$ matrix) are an orthonormal basis of the subspace spanned by the observations $\mathbf{x}_n$. That means that for each element in this basis we can find a linear combination of input vectors such that $\mathbf{w}_i = \sum_{n=1}^{N} \alpha_{in}\mathbf{x}_n$. This fact will be further exploited by Sparse PCA and Kernel PCA. This digression on the SVD approach to PCA helps us to understand a common situation in some experimental settings. For instance, in microarray experiments, we have about 50 samples and 1000 variables. As shown by the SVD decomposition, the rank of the covariance matrix is at most the minimum of $M = 1000$ and $N = 50$; therefore, we cannot compute more than 50 principal components.
An interesting remark on the SVD decomposition is that among all possible matrix decompositions of the form $X = WDU$, SVD is the only family of decompositions (the SVD decomposition is not unique) yielding a diagonal matrix $D$. In other words, the matrices $W$ and $U$ can differ significantly from those of the SVD decomposition as long as $D$ is not a diagonal matrix. This fact is further exploited by Sparse Tensor SVD (see dictionary-based methods below).
2.2.3 Nonlinear PCA
PCA can be extended to nonlinear projections very easily conceptually, although its implementation is more involved. Projections can be replaced by $f(W^t\mathbf{x})$, with $f(\cdot): \mathbb{R}^m \to \mathbb{R}^m$ being a nonlinear function chosen by the user. The goal function is then $J_{NLPCA} = E\{\|\mathbf{x} - Wf(W^t\mathbf{x})\|^2\}$, which is minimized subject to $W^tW = I$ [Girolami1997b].
2.2.4 PCA rotations and Sparse PCA

A drawback of PCA is that the eigenvectors ($\mathbf{w}_i$) usually have contributions from all input variables (all their components are significantly different from zero). This makes their interpretation more difficult, since a large feature value cannot be attributed to a few (ideally a single) input values. Possible solutions are rotating the eigenvectors using PCA rotation methods (Varimax, Quartimax, etc.) or forcing many of the components of $\mathbf{w}_i$ to be zero, i.e., to have a Sparse PCA.
An easy way of forcing the interpretability of the principal components $\mathbf{w}_i$ is by rotating them once they have been computed. The subspace spanned by the rotated vectors is the same as the one spanned by the unrotated vectors. Therefore, the approximation $\hat{\mathbf{x}}$ is still the same, although the feature vector must be modified to account for the rotation. This is the approach followed by Varimax [Kaiser1958]. It looks for the rotation that maximizes the variance of the feature vector after rotation (the idea is that maximizing this variance implies that each feature vector uses only a few eigenvectors). Quartimax looks for a rotation minimizing the number of factors different from zero in the feature vector. Equimax and Parsimax are compromises between Varimax and Quartimax. All these criteria are generally known as Orthomax and they have been unified under a single rotation criterion [Crawford1970]. Among these criteria, Varimax is by far the most popular. If the orthogonality condition of the vectors $\mathbf{w}_i$ is removed, then the new principal components are free to move and the rotation matrix is called an oblique rotation matrix. Promax [Abdi2003] is an example of such an oblique rotation technique. However, these oblique rotations have been superseded by Generalized PCA.
Recently, there have been some papers imposing the sparseness of the feature vectors by directly regularizing a functional solving the PCA problem (formulated as a regression problem). This kind of method is commonly called Sparse PCA. One of the most popular algorithms of this kind is the one of Zou [Zou2006]. We have already seen that the PCA problem can be seen as a regression problem whose objective function is $J_{PCA} = E\{\|\mathbf{x} - WW^t\mathbf{x}\|^2\}$. Normally the optimization is performed with the constraint $W^tW = I$ (i.e., the directions $\mathbf{w}_i$ are orthonormal and, in particular, have unit norm). We can generalize this problem, and instead of using the same matrix to build the feature vectors ($W^t$) and reconstruct the original samples ($W$), we can make them different. We can use ridge regression (a Tikhonov regularization to avoid possible instabilities caused by the eventual ill-conditioning of the regression, the most common one is simply the $l_2$ norm), and a regularization based on some norm promoting sparseness (like the $l_1$ norm): $J_{SPCA} = E\{\|\mathbf{x} - W\tilde{W}^t\mathbf{x}\|^2\} + \lambda_1\|\tilde{W}\|_{l_1}^1 + \lambda_2\|\tilde{W}\|_{l_2}^2$. In the previous formula the $l_p$ norms of the matrices are computed as $\|W\|_{l_p}^p = \sum_{i,j}|w_{ij}|^p$ and the objective function is optimized with respect to $W$ and $\tilde{W}$, which are $M \times m$ matrices. It has been shown [Zou2006] that promoting the sparseness of $\tilde{W}$ promotes the sparseness of the feature vectors which is, after all, the final goal of this algorithm.
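The previous sparse approaches tried to find sparse projection directions by zeroing some of the elements in the projection directions. However, we may prefer to remove some of the input features entirely (so that their contribution to all projection directions is zero). [Ulfarsson2011] proposed sparse variable PCA (svPCA). svPCA is based on noisy PCA (nPCA), which is a special case of Factor Analysis (see below). The observed data is supposed to have been generated as $\mathbf{x} = W\boldsymbol{\chi} + \mathbf{n}$. Assuming that the covariance of the noise is $\sigma^2 I$, the goal of nPCA is to maximize the log-likelihood of observing the data matrix $X$ under the nPCA model, that is, $J_{nPCA} = -\frac{1}{2}\mathrm{Tr}\{\Sigma_X \Sigma^{-1}\} - \frac{1}{2}\log|\Sigma|$, where $\Sigma = WW^t + \sigma^2 I$. The svPCA objective function is $J_{svPCA} = J_{nPCA} - \frac{\lambda}{2N\sigma^2}\sum_{i=1}^{M}\|\mathbf{w}_i\|_0$, where $\mathbf{w}_i$ denotes here the $i$-th row of $W$ and $\|\mathbf{w}_i\|_0$ is 1 if that row is different from zero and 0 otherwise.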
2.2.5 Localized PCA and Subspace segmentation

As we have explained above, given a set of data samples in $\mathbb{R}^M$, the standard technique for finding the single best (in the sense of least-square error) subspace of a given dimension is Principal Component Analysis (PCA). However, in practice, a single linear model has severe limitations for modelling real data, because the data can be heterogeneous and contain multiple subsets, each of which is best fitted with a different linear model. Suppose that we have a high dimensional dataset that can be conveniently decomposed into different small datasets or clusters, which can be approximated well by different linear subspaces of small dimension by means of a principal component analysis. This situation is presented in Fig. 4. Note from Fig. 4 that the dataset composed of the red and blue points can be decomposed into two sets, one formed by the red points and the other by the blue ones, and each of these groups can effectively be approximated using a two-dimensional linear subspace.
Figure 4. Mixed dataset composed of the red and blue points that can be decomposed into two smaller datasets, which can effectively be described using two-dimensional linear subspaces.
The term Localized PCA has been used several times in the literature to refer to different algorithms. Here we will refer to the most successful ones. [Fukunaga1971] proposed an extension of the K-means algorithm which we will refer to as Cluster-PCA. In K-means, a cluster is represented by its centroid. In Cluster-PCA, a cluster is represented by a centroid plus an orthogonal basis defining a subspace that locally embeds the cluster. An observation $\mathbf{x}$ is assigned to a cluster if the projection of $\mathbf{x}$ onto the cluster subspace ($\hat{\mathbf{x}} = WW^t\mathbf{x}$) is the closest one (the selection of the closest subspace must be done with care so that extrapolation of the cluster is avoided). Once all observations have been assigned to their corresponding clusters, the cluster centroid is updated as in K-means and the cluster subspace is recalculated by using PCA on the observations belonging to the cluster. As with K-means, a severe drawback of the algorithm is its dependence on the initialization, and several hierarchically divisive algorithms have been proposed (Recursive Local PCA) [Liu2003b]. For a review on this kind of algorithm see [Einbeck2008].
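The alternation between assignment and local PCA can be sketched as follows. This is a simplified illustration and not the algorithm of [Fukunaga1971] itself: the function name, the random initialization, and the rule of skipping clusters with too few members are assumptions of this example, and no safeguard against extrapolating a cluster subspace is included.

```python
import numpy as np

def cluster_pca(X, K, m, n_iter=20, seed=0):
    """Alternate between local PCA fits and reassignment to the closest subspace.

    X : (M, N) array with observations as columns.
    K : number of clusters; m : dimension of each local subspace.
    Returns the cluster label of every observation.
    """
    rng = np.random.default_rng(seed)
    M, N = X.shape
    labels = rng.integers(K, size=N)            # random initial assignment
    for _ in range(n_iter):
        errs = np.full((K, N), np.inf)
        for k in range(K):
            members = X[:, labels == k]
            if members.shape[1] <= m:
                continue                        # too few points to fit a subspace
            c = members.mean(axis=1, keepdims=True)           # cluster centroid
            # Local PCA basis: leading left singular vectors of the centred members.
            W = np.linalg.svd(members - c, full_matrices=False)[0][:, :m]
            R = (X - c) - W @ (W.T @ (X - c))   # residual after projection
            errs[k] = (R ** 2).sum(axis=0)
        new_labels = errs.argmin(axis=0)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

rng = np.random.default_rng(0)
t = rng.uniform(-3, 3, size=(1, 100))
line_a = np.array([[1.0], [0.0], [0.0]]) * t[:, :50] + np.array([[0.0], [2.0], [0.0]])
line_b = np.array([[0.0], [0.0], [1.0]]) * t[:, 50:] + np.array([[0.0], [-2.0], [0.0]])
X = np.hstack([line_a, line_b]) + 0.05 * rng.standard_normal((3, 100))
labels = cluster_pca(X, K=2, m=1)
print(np.bincount(labels, minlength=2))         # cluster sizes
```

As noted above for K-means, the result depends on the initialization, so in practice several random restarts would be compared.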
Subspace segmentation extends the idea of locally embedding the input points into linear subspaces. The assumption is that the data has been generated using several subspaces that may not be orthogonal. The goal is to identify all these subspaces. Generalized PCA [Vidal2005] is a representative of this family of algorithms. Interestingly, the subspaces to be identified are represented as polynomials whose degree is the number of subspaces to identify and whose derivatives at a data point give normal vectors to the subspace passing through the point.
2.3 Principal curves, surfaces and manifolds

PCA is the perfect tool to reduce data that in their original $M$-dimensional space lie in some linear manifold. However, there are situations in which the data follow some curved structure (e.g., a slightly bent line). In this case, approximating the curve by a straight line will not give a good approximation of the original data. We can observe a situation of this type in Fig. 5.
Figure 5. Dataset that lies in a curved structure (left) and transformed dataset (right).
In Fig. 5 (left) we show a dataset following a curved structure; this dataset will therefore not be conveniently described using the PCA method. Note that, for the data shown in Fig. 5, at least three principal components would be needed to describe the data precisely. Fig. 5 (right) shows the same dataset after transforming it. Note that this data no longer follows a curved structure; in this case, it follows a linear one. Therefore, the data shown on the right of Fig. 5 can be conveniently described using the PCA approach with only one principal component.
Before introducing principal curves, surfaces and manifolds in depth, let us review PCA from a different perspective. Given a set of observations of the input vectors $\mathbf{x}$ with zero average (if the original data is not zero average, we can simply subtract the average from all data points), we can look for the line passing through the origin and with direction $\mathbf{w}_1$ (whose equation is $f(\chi) = \chi\mathbf{w}_1$) that best fits this dataset, i.e., that minimizes $J_{line} = E\{\inf_\chi \|\mathbf{x} - f(\chi)\|^2\}$. The infimum in the previous objective function implies that for each observation $\mathbf{x}_n$ we have to look for the point in the line (defined by its parameter $\chi_n$) that is closest to it. The point $f(\chi_n)$ is the orthogonal projection of the observation onto the line. It can be proved that the solution of this problem is the direction with the largest data variance, that is, the same solution as in PCA. Once we have found the first principal line, we can look for the second simply by constructing a new dataset in which we have subtracted the line previously computed ($\mathbf{x}'_n = \mathbf{x}_n - f(\chi_n)$). Then, we apply the same procedure $m-1$ times to detect the subsequent most important principal lines. The dimensionality reduction is achieved simply by substituting the observation $\mathbf{x}_n$ by the collection of parameters $\chi_n$ needed for its projection onto the different principal lines.
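The line-fitting view of PCA can be sketched with an alternating scheme: for a fixed direction, the best parameters are the orthogonal projections; for fixed parameters, the best direction is the normalized weighted sum of the observations. Subtracting the fitted line and repeating gives the next principal line. The function name, the number of iterations and the synthetic data are assumptions of this illustration.

```python
import numpy as np

def principal_lines(X, m, n_iter=500, seed=0):
    """Fit m principal lines one after another by alternating least squares.

    X : (M, N) array with observations as columns.
    Returns W (M, m) with the line directions and Chi (m, N) with the parameters.
    """
    rng = np.random.default_rng(seed)
    R = X - X.mean(axis=1, keepdims=True)       # zero-average data
    directions, params = [], []
    for _ in range(m):
        w = rng.standard_normal(R.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            chi = w @ R                         # closest point on the line for each x_n
            w = R @ chi                         # best direction for these parameters
            w /= np.linalg.norm(w)
        chi = w @ R
        R = R - np.outer(w, chi)                # x'_n = x_n - f(chi_n)
        directions.append(w)
        params.append(chi)
    return np.column_stack(directions), np.vstack(params)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
X = Q @ (np.array([[5.0], [3.0], [1.0], [0.5]]) * rng.standard_normal((4, 2000)))
W, Chi = principal_lines(X, m=2)

# Comparison with the eigenvectors of the covariance matrix (up to sign).
Xc = X - X.mean(axis=1, keepdims=True)
evecs = np.linalg.eigh(Xc @ Xc.T / Xc.shape[1])[1][:, ::-1]
print(np.abs(np.sum(W * evecs[:, :2], axis=0)))
```

The printed absolute inner products are close to 1, i.e., the successive principal lines coincide with the leading PCA directions, as stated above.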
The objective function $J_{line}$ is merely a linear regression on the input data. For detecting curves instead of lines, one possibility would be to fix the family of curves sought (parabolic, hyperbolic, exponential, polynomial, ...) and optimize its parameters (as was done with the line). This is exactly nonlinear regression, and several methods based on Neural Networks (as sophisticated nonlinear regressors) have been proposed (Nonlinear PCA [Kramer1991, Scholz2008], autoencoder neural networks [Baldi1989, DeMers1993, Kambhatla1997]).
Alternatively, we can look for the best curve (not in a parametric family) [Hastie1989]. In Statistics, it is well known that the best regression function is given by the conditional expectation $f(\chi) = E\{\mathbf{x} \mid \chi_f(\mathbf{x}) = \chi\}$, where $\chi_f(\mathbf{x})$ represents the curve parameter $\chi$ needed to project $\mathbf{x}$ onto $f$. In other words, the best curve is the one that assigns to each $\chi$ the average of all observed values projected onto $\chi$. The parameterization of the curve $f$ must be such that it has unit speed (i.e., $\left\|\frac{df}{d\chi}\right\| = 1$ for all $\chi$), otherwise we could not uniquely determine this function. There are two caveats to this approach. The first one is that it might be locally biased if the noise of the observations is larger than the local curvature of the function. The second one is that if we only have a finite set of observations $\mathbf{x}$, we will have to use some approximation of the expectation so that we make the curve continuous. The two most common choices to make the curve continuous are kernel estimates of the expectation and the use of splines. In fact, the goal function of the classical smoothing spline is $J_{spline} = E\{\inf_\chi \|\mathbf{x} - f(\chi)\|^2\} + \lambda\int\left\|\frac{d^2 f(\chi)}{d\chi^2}\right\|^2 d\chi$, which regularizes the curve fitting problem with the curvature of the curve. An advantage of the use of splines is their efficiency (the algorithm runs in $O(N)$, as compared to the $O(N^2)$ of the kernel estimates). However, it is difficult to choose the regularization weight, $\lambda$.
Principal Curves can be combined with the idea of Localized PCA (constructing local approximations to data). This has been done by several authors: Principal Curves of Oriented Points (PCOP) [Delicado2001], and Local Principal Curves (LPC) [Einbeck2005].
The Principal Curves idea can be extended to more dimensions (see Fig. 6). Principal surfaces are the functions minimizing $J_{surface} = E\{\inf_{\chi_1,\chi_2}\|\mathbf{x} - f(\chi_1,\chi_2)\|^2\}$. The solution is again $f(\chi_1,\chi_2) = E\{\mathbf{x} \mid \boldsymbol{\chi}_f(\mathbf{x}) = (\chi_1,\chi_2)\}$, where $\boldsymbol{\chi}_f(\mathbf{x})$ returns the parameters of the surface needed for the projection of $\mathbf{x}$. Intuitively, the principal surface at the point $(\chi_1,\chi_2)$ is the average of all observations whose orthogonal projection is at $f(\chi_1,\chi_2)$. The extension to manifolds is straightforward [Smola1999]: $J_{manifold} = E\{\inf_{\boldsymbol{\chi}}\|\mathbf{x} - f(\boldsymbol{\chi})\|^2\} + \lambda\|Pf\|^2$, which is a Tikhonov regularized version of the nonlinear regression problem. $P$ is a homogeneous invariant scalar operator penalizing unsmooth functions. The fact that the regularization is homogeneous invariant implies that all surfaces which can be transformed into each other by rotations are equally penalized. A feasible way of defining the function $f(\boldsymbol{\chi})$ is by choosing a number of locations $\boldsymbol{\chi}_i$ (usually placed on a regular grid, although the method is not restricted to this choice) and expanding this function as a weighted sum of a kernel function $k(\boldsymbol{\chi}, \boldsymbol{\chi}_i)$ at those locations, $f(\boldsymbol{\chi}) = \sum_{i=1}^{K}\boldsymbol{\alpha}_i k(\boldsymbol{\chi}, \boldsymbol{\chi}_i)$. The number of locations, $K$, controls the complexity of the manifold and the vectors $\boldsymbol{\alpha}_i$ (which are in the space of $\mathbf{x}$ and, thus, have dimension $M$) control its shape. However, the dimensionality reduction is still controlled by the dimension of the vector $\boldsymbol{\chi}$. Radial basis functions such as the Gaussian are common kernels ($k(\boldsymbol{\chi}, \boldsymbol{\chi}_i) = k(\|\boldsymbol{\chi} - \boldsymbol{\chi}_i\|)$). With this expansion, the regularization term becomes [Smola1999] $\|Pf\|^2 = \sum_{i,j=1}^{K}\langle\boldsymbol{\alpha}_i, \boldsymbol{\alpha}_j\rangle k(\boldsymbol{\chi}_i, \boldsymbol{\chi}_j)$. An interesting feature of this approach is that by using periodic kernels, one can learn circular manifolds.
Figure 6. Example of data distributed along two principal surfaces.
2.4 Generative Topographic Mapping

A different but related approach to reduce the dimensionality by learning a manifold is the Generative Topographic Mapping (GTM) [Bishop1998]. This method is generative because it assumes that the observations $x$ have been generated by noisy observations of a mapping of the low-dimensional vectors $\chi$ onto a higher dimension, $f(\chi)$ (see Fig. 7; in fact, the form of this nonlinear mapping is exactly the same as in the principal manifolds of the previous section, $f(\chi) = \sum_{i=1}^{K} \alpha_i k(\chi,\chi_i)$). In this method, it is presumed that the possible $\chi$ vectors lie on a discrete grid with $K$ points, and that the a priori probability of each one of the points of the grid is the same (uniform distribution). If the noise is supposed to be Gaussian (or any other spherical distribution), the maximum likelihood estimate of the vectors $\alpha_i$ boils down to the minimization of $J_{GTM} = E\{\inf_{\chi} \|x - f(\chi)\|^2\}$. This objective function can be regularized by $\lambda \sum_{i=1}^{K} \|\alpha_i\|^2$ (instead of $\lambda \|Pf\|^2$), which is the result of estimating the Maximum a Posteriori under the assumption that the $\alpha_i$ are normally distributed with zero mean.
Figure 7. Example of Generative Topographic Mapping. The observed data (right) is assumed to be generated by mapping points in a lower-dimensional space.
2.5 Self-Organizing Maps

In fact, GTM has been proposed as a generalization of Self-Organizing Maps, which in turn are generalizations of the vector quantization approaches presented at the beginning. Self-Organizing Maps (SOM) work as in Vector Quantization by assigning to each input vector a label $\chi_n$ corresponding to the class whose representative vector is closest to it. The reconstruction of $x_n$ is still $\hat{x}_n = \bar{x}_{\chi_n}$, i.e., the class representative of class $\chi_n$. However, class labels are forced to lie in a manifold on which a topological neighborhood is defined (see Fig. 8). In this way, classes that are close to each other in the feature space are also close to each other in the input space.

Figure 8. Original data lies in a ring; vector representatives calculated by SOM are represented as red points. The output map topology has been represented by linking each representative vector to its neighbors with a blue edge. Note that, because of the topological constraint, there might be representative vectors that are not actually representing any input point (e.g., the vector in the center of the ring).
Kohonen's SOMs [Kohonen1990, Kohonen1993, Kohonen2001] are the most famous SOMs. They work well in most contexts and are very simple to understand and implement, but they lack a solid mathematical framework (they do not optimize any functional and they cannot be put in a statistical framework). They start by creating a set of labels on a given manifold (usually a plane). Labels are distributed on a regular grid, and the topological neighborhood of each point of the grid is defined by its neighbors in the plane (note that this idea can be easily extended to higher dimensions). For initialization, we assign to each label a class representative at random. Each input observation $x_n$ is compared to all class representatives and is assigned to the closest class, whose label we will refer to as $\chi_n$. In its batch version, once all the observations have been assigned, the class representatives are updated according to $\bar{x}_{\chi} = \frac{\sum_{n=1}^{N} k(\chi,\chi_n) x_n}{\sum_{n=1}^{N} k(\chi,\chi_n)}$. The function $k(\chi,\chi_n)$ is a kernel that gives more weight to pairs of classes that are closer in the manifold. The effect is that, when an input vector $x_n$ is assigned to a given class, the classes surrounding this class will also be updated with that input vector (although with less weight than the winning class). Classes far from the winning class receive updates with a weight very close to 0. This process is iterated until convergence.
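The batch update above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the Gaussian neighborhood kernel, its width `sigma`, the grid size, the fixed number of iterations and the function name `batch_som` are assumptions made for the example, not choices prescribed by the text.

```python
import numpy as np

def batch_som(X, grid_shape=(5, 5), sigma=1.0, n_iter=20, seed=0):
    """Batch SOM sketch. X has shape (N, M); returns (grid labels, representatives)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    # Labels chi: 2-D coordinates of each node on a regular grid.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Kernel k(chi, chi') between every pair of labels (Gaussian: an assumption).
    d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-0.5 * d2 / sigma**2)
    # Random initialization: pick observations as class representatives.
    reps = X[rng.choice(len(X), size=len(grid), replace=False)].copy()
    for _ in range(n_iter):
        # Assignment: index of the closest representative for each x_n.
        dist = ((X[:, None, :] - reps[None, :, :]) ** 2).sum(axis=2)
        winners = dist.argmin(axis=1)
        # Update: kernel-weighted average of all observations for each label.
        weights = kernel[:, winners]                      # shape (K, N)
        reps = (weights @ X) / weights.sum(axis=1, keepdims=True)
    return grid, reps

# Noisy ring, similar in spirit to the data of Fig. 8.
rng = np.random.default_rng(1)
angles = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(angles), np.sin(angles)] + 0.05 * rng.normal(size=(200, 2))
grid, reps = batch_som(X)
```

Since every representative is a weighted average of the observations, it always stays inside the bounding box of the data, which is a convenient sanity check.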
GTM generalizes Kohonen's SOM because the class representatives $\bar{x}_{\chi}$ in SOMs can be assimilated to the $f(\chi_i)$ elements of GTM, and the function $f(\chi)$ of GTM can be directly computed from the kernel $k(\chi,\chi_n)$ in SOM [Bishop1998]. However, GTM has the advantage over SOMs that it is clearly defined in a statistical framework and that the function $f(\chi)$ maximizes the likelihood of observing the given data. There is another difference with practical consequences: while SOM performs the dimensionality reduction by assigning one of the points of the grid in the manifold (i.e., it produces a discrete dimensionality reduction), GTM is capable of producing a continuous dimensionality reduction by choosing the parameters $\chi_n$ such that $\|x_n - f(\chi_n)\|$ is minimized.
Other generalizations of SOMs in the same direction are the Fuzzy SOM [PascualMarqui2001] and the Kernel Density SOM (KerDenSOM) [PascualMontano2001]. These generalizations rely on the regularization of the objective functions of Vector Quantization and the Mixture Models, respectively, by the term $\sum_{\chi,\chi'=1}^{K} \|\bar{x}_{\chi} - \bar{x}_{\chi'}\|^2 k(\chi,\chi')$; that is, class representatives corresponding to labels close to each other in the manifold should have smaller differences. For a review on SOM and its relationships to Non-Linear PCA and Manifold learning see [Yin2008].
Neural Gas networks [Martinetz1991, Martinetz1993, Fritzke1995] are an approach similar to the standard SOM, except that the neighborhood topology is automatically learnt from the data. Edges appear and disappear following an aging strategy. This automatic topology learning allows adapting to complex manifolds with locally different intrinsic dimensionality [Pettis1979, Kegl2002, Costa2004] (see Fig. 9).
Figure 9. Example of Neural Gas network. Note that the network has been able to learn the 2D topology present at the left points, and the 1D topology of the right points.
2.6 Elastic maps, nets, principal graphs and principal trees

Elastic maps and nets [Gorban2004, Gorban2007] are halfway between SOMs and GTM. Like these two methods, the elastic map can be thought of as a net of nodes in a low-dimensional space. For each node, there is a mapping between the low-dimensional node and the high-dimensional space (as in SOM), i.e., for each node $\chi$ in the net there is a corresponding vector in the input space, $\bar{x}_{\chi}$. If an input vector $x_n$ is assigned to a node $\chi_n$, then its representative is $\hat{x}_n = \bar{x}_{\chi_n}$ (as in SOM). The goal function combines similarity to the input data with regularity and similarity within the net: $J_{EN} = \sum_{n=1}^{N} \|x_n - \bar{x}_{\chi_n}\|^2 + \lambda \sum_{\chi,\chi'=1}^{K} \|\bar{x}_{\chi} - \bar{x}_{\chi'}\|^2 g(\chi,\chi') + \mu \sum_{\chi=1}^{K} \left\| \bar{x}_{\chi} - \frac{\sum_{\chi'=1}^{K} g(\chi,\chi') \bar{x}_{\chi'}}{\sum_{\chi'=1}^{K} g(\chi,\chi')} \right\|^2$. The first term accounts for the fidelity of the data representation. In the second and third terms, $g(\chi,\chi')$ defines a neighborhood (it is equal to 1 if the two nodes are neighbors, and equal to 0 otherwise). The second term reinforces smoothness of the manifold by favoring similarity between neighbors; the third term imposes smoothness in a different way: a node representative must be close to the average of its neighbors. As opposed to SOM, elastic nets can delete or add nodes adaptively, creating nets that are well adapted to the manifold topology. For this reason, this method is also known as Principal Graphs. Interestingly, the rules to create and delete nodes can force the graph to be a tree.
2.7 Kernel PCA and Multidimensional Scaling

Kernel PCA [Scholkopf1997, Scholkopf1999] is another approach trying to capture nonlinear relationships. It can be well understood if Multidimensional Scaling (MDS), another linear method, is presented first. We have already seen that PCA can be computed from the eigenvalue decomposition of the input covariance matrix, $C_x = \frac{1}{N} XX^t$. However, we could have built the inner product matrix (Gram matrix) $G_x = X^t X$, i.e., the $(i,j)$-th component of this matrix is the inner product of $x_i$ with $x_j$. The eigendecomposition of the Gram matrix yields $G_x = W_N \Lambda_N W_N^t$ (since $G_x$ is a real, symmetric matrix). MDS is a classical statistical technique [Kruskal1964a, Kruskal1964b, Schiffman1981, Kruskal1986, Cox2000, Borg2005] that builds with the $m$ largest eigenvalues a feature matrix given by $U = \Lambda_m^{1/2} W_m^t$. This feature matrix is the one best preserving the inner products of the input vectors, i.e., it minimizes the Frobenius norm of the difference $G_x - G_{\chi}$, where $G_{\chi} = U^t U$ is the Gram matrix of the feature vectors. It can be proven [Jenssen2010] that the nonzero eigenvalues of $G_x$ and $C_x$ are the same (up to the factor $\frac{1}{N}$ in the definition of $C_x$), that $\mathrm{rank}(G_x) = \mathrm{rank}(C_x) \le M$, and that for any $m$ the space spanned by MDS and the space spanned by PCA are identical, which means that one could find a rotation such that $U_{MDS} = U_{PCA}$. Additionally, MDS can be computed even if the data matrix $X$ is unknown; all we need is the Gram matrix or, alternatively, a distance matrix. This situation is rather common in some kinds of data analysis [Cox2000].
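The equivalence between PCA and classical MDS can be checked numerically. The sketch below uses synthetic, centered data whose sizes and scales are arbitrary assumptions. It computes the PCA features from the covariance matrix and the MDS features from the Gram matrix, and then compares them up to a sign flip of each component.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, m = 5, 40, 2
# Observations are the columns of X; the row scales are arbitrary illustration values.
X = rng.normal(size=(M, N)) * np.array([[5.0], [3.0], [1.0], [0.5], [0.1]])
X = X - X.mean(axis=1, keepdims=True)

# PCA: eigenvectors of C_x = X X^t / N, followed by projection of the data.
C = X @ X.T / N
evals_c, W = np.linalg.eigh(C)
order_c = np.argsort(evals_c)[::-1][:m]
U_pca = W[:, order_c].T @ X                      # shape (m, N)

# Classical MDS: U = Lambda_m^{1/2} W_m^t from the Gram matrix G_x = X^t X.
G = X.T @ X
evals_g, V = np.linalg.eigh(G)
order_g = np.argsort(evals_g)[::-1][:m]
U_mds = np.sqrt(evals_g[order_g])[:, None] * V[:, order_g].T

# Eigenvectors are defined up to sign, so align each row before comparing.
signs = np.sign(np.sum(U_pca * U_mds, axis=1, keepdims=True))
same_features = np.allclose(U_pca, signs * U_mds, atol=1e-8)
```

Note that the MDS branch never touches `X` once `G` is available, which is why the method also applies when only inner products or distances are known.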
Returning to Kernel PCA [Scholkopf1997, Scholkopf1999], it uses a function $\Phi$ transforming the input vector $x$ onto a new vector $\Phi(x)$ whose dimension is larger than $M$ (a kind of dimensionality expansion). However, if we choose $\Phi$ well enough, the data in this new space may become more linear (e.g., following a straight line instead of a curve). In this new space we can perform the standard PCA and obtain the feature vector $\chi$. Fig. 10 shows an example of the use of this dimensionality reduction method. Fig. 10(a) shows a dataset (red circles) following a curved structure that cannot be described conveniently using linear PCA: the black dashed line does not describe the dataset well. Therefore, to correctly describe this dataset by the standard PCA method we would need at least two principal components. Fig. 10(b) shows the dataset after transforming it by $\Phi$. As can be seen, in this transformed and expanded space we can describe the dataset accurately using only one principal component, since the dataset follows a linear relationship.
Figure 10. (a) Input dataset (red circles) lying in a curved structure and its corresponding first principal component using the standard PCA method (black dashed line). (b) Dataset transformed by the function $\Phi$ (blue points) and the corresponding first principal component of the expanded dataset (red line).
Making use of the relationship between MDS and PCA, we do not need to compute the covariance of the $\Phi$ vectors, but we can compute their inner products instead. We will do so through a Mercer kernel, which defines a valid inner product in the space of $\Phi$ making use of the input vectors, $\langle \Phi(x), \Phi(y) \rangle = k(x,y)$. Common kernels are $k(x,y) = \langle x, y \rangle^{d}$, $k(x,y) = \exp\left(-\frac{1}{2\sigma^2} \|x - y\|^2\right)$, and $k(x,y) = \tanh(\kappa_1 \langle x, y \rangle + \kappa_2)$ (where the parameters $d$, $\sigma$, $\kappa_1$ and $\kappa_2$ define the kernel shape). The PCA vectors $w_i$ are the eigenvectors of the covariance matrix of the transformed vectors $\Phi(x)$, but neither these vectors nor their covariance matrix are ever explicitly built. Instead, the orthogonal directions $w_i$ are computed as a linear combination of the transformed observed data, $w_i = \sum_{n=1}^{N} \alpha_{in} \Phi(x_n) = \Phi(X) \alpha_i$. The $\alpha_i$ vectors are computed as the eigenvectors of a matrix $G_{\Phi}$ whose $(i,j)$-th entry is the inner product $\langle \Phi(x_i), \Phi(x_j) \rangle$. The feature vectors can finally be computed as $\chi_i = \sum_{n=1}^{N} \alpha_{in} \langle \Phi(x_n), \Phi(x) \rangle$. Obtaining the approximation of the original vector $x$ is not as straightforward. In practice, it is done by looking for a vector $\hat{x}$ that minimizes $\|\hat{\Phi}(x) - \Phi(\hat{x})\|^2$, where $\hat{\Phi}(x)$ denotes the PCA reconstruction of $\Phi(x)$ in the transformed space. The minimization is performed numerically, starting from a solution specifically derived for each kernel. Again, thanks to the kernel trick, only the dot products of the transformed vectors are needed during the minimization.
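A minimal sketch of these steps with a Gaussian kernel follows. The function names, the kernel width and the test data are assumptions for illustration only. The sketch also centers the Gram matrix, a standard step that the description above does not spell out, and scales each $\alpha_i$ so that the corresponding direction has unit norm in the transformed space.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / sigma**2)

def kernel_pca(X, m=2, sigma=1.0):
    """X has shape (N, M). Returns (features of shape (N, m), alphas of shape (N, m))."""
    N = len(X)
    G = rbf_kernel(X, X, sigma)
    # Implicit centering of the transformed vectors (assumed standard step).
    H = np.eye(N) - np.ones((N, N)) / N
    Gc = H @ G @ H
    evals, evecs = np.linalg.eigh(Gc)
    order = np.argsort(evals)[::-1][:m]
    evals, evecs = evals[order], evecs[:, order]
    # Scale alpha_i so that each direction w_i has unit norm in the transformed space.
    alphas = evecs / np.sqrt(evals)
    # chi_i = sum_n alpha_in <Phi(x_n), Phi(x)>, evaluated at the training points.
    features = Gc @ alphas
    return features, alphas

# Noisy half circle: a curved structure in the spirit of Fig. 10(a).
rng = np.random.default_rng(0)
t = rng.uniform(0, np.pi, 100)
X = np.c_[np.cos(t), np.sin(t)] + 0.02 * rng.normal(size=(100, 2))
features, alphas = kernel_pca(X, m=2, sigma=0.5)
```

Only kernel evaluations between pairs of observations are used; the transformed vectors themselves are never formed.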
All these techniques, together with Locally Linear Embedding and ISOMAP (see below), are called spectral dimensionality reduction techniques because they are based on the eigenvalue decomposition of some matrix. [Bengio2006] provides an excellent review of them.
2.8 Kernel Entropy Component Analysis

We have already seen that PCA looks for directions that maximize the input variance explained by the feature vectors. Variance is a second-order statistical measurement reflecting the amount of information contained by some variables (if there is no variability in the input vectors, there is no information in them). However, variance is a limited measure of information. Renyi's quadratic entropy is a more appropriate measure of the input information. It is defined as $H(x) = -\log E\{p(x)\}$, where $p(x)$ is the multivariate probability density function of the input vectors $x$. Since the logarithm is a monotonic function, we may concentrate on its argument: the expectation of $p(x)$. However, the true underlying probability density function is unknown. We can approximate it through a kernel estimator, $\hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N} k(x, x_n)$, where $k(x,y)$ is a Parzen window (it is also required to be a Mercer kernel). If we now estimate Renyi's quadratic entropy as the sample average of the Parzen estimator, we get $E\{p(x)\} \approx \frac{1}{N} \sum_{n=1}^{N} \hat{p}(x_n) = \frac{1}{N^2} \sum_{n_1=1}^{N} \sum_{n_2=1}^{N} k(x_{n_1}, x_{n_2})$, which, in the light of our previous discussion on Kernel PCA, can be rewritten in terms of the Gram matrix of the $\Phi$ vectors as $\frac{1}{N^2} \mathbf{1}^t G_{\Phi} \mathbf{1}$ ($\mathbf{1}$ being a vector of ones with dimension $N$). The eigendecomposition of the Gram matrix yields $E\{p(x)\} \approx \frac{1}{N^2} \mathbf{1}^t W_N \Lambda_N W_N^t \mathbf{1} = \frac{1}{N^2} \sum_{n=1}^{N} \lambda_n \langle \mathbf{1}, w_n \rangle^2$. This means that, to maximize the information carried by the feature vectors, it is not enough to choose the eigenvectors with the $m$ largest eigenvalues; we must instead choose the eigenvectors with the $m$ largest contributions to the entropy, $\lambda_n \langle \mathbf{1}, w_n \rangle^2$. This method is called Kernel Entropy Component Analysis [Jenssen2010], and it can be seen as an information-theoretic generalization of PCA.
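The selection rule can be illustrated with a short sketch. The helper name `keca_select`, the Gaussian Parzen window with unit width and the random data are assumptions for the example. The key line ranks the eigenpairs by $\lambda_n \langle \mathbf{1}, w_n \rangle^2$ instead of by $\lambda_n$ alone.

```python
import numpy as np

def keca_select(G, m):
    """Rank the eigenpairs of the Gram matrix G by their entropy contribution."""
    evals, evecs = np.linalg.eigh(G)
    ones = np.ones(len(G))
    # Contribution of each eigenpair: lambda_n * <1, w_n>^2.
    contrib = evals * (evecs.T @ ones) ** 2
    order = np.argsort(contrib)[::-1][:m]
    return order, evals, evecs, contrib

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                      # rows are observations
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
G = np.exp(-0.5 * d2)                             # Gaussian Parzen window (assumed)
order, evals, evecs, contrib = keca_select(G, m=3)
# Features built as in MDS, but with the entropy-ranked eigenpairs.
features = np.sqrt(evals[order])[:, None] * evecs[:, order].T
```

The contributions add up to $\mathbf{1}^t G \mathbf{1}$, i.e., to the sum of all the entries of the Gram matrix, which gives a quick consistency check of the decomposition.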
2.9 Robust PCA

There are many situations in which the input vectors $x$ are contaminated by outliers. If this is the case, the outliers may dominate the estimation of the mean and covariance of the input data, resulting in very poor performance of the PCA (see Fig. 11). There are approaches to obtain a robust PCA, which basically amount to having robust estimates of the mean and covariance.
Figure 11. Comparison between Robust PCA (a) and Standard PCA (b).
An obvious modification to deal with univariate outliers is to replace the $l_2$ norm of the PCA objective function, $J_{PCA} = E\{\|x - W\chi\|_2^2\}$, by an $l_1$ norm, which is known to be more robust to outliers, $J_{RPCA} = E\{\|x - W\chi\|_1\}$ [Baccini1996, Ke2005]. However, these modifications are not invariant to rotations of the input features [Ding2006]. [Ding2006] solved this problem by using the R1 norm of a matrix, which is defined as $\|E\|_{R1} = \sum_{n=1}^{N} \left( \sum_{i=1}^{M} e_{in}^2 \right)^{1/2}$, and constructing the objective function $J_{RPCA} = \|X - WW^t X\|_{R1}$. Another possibility is to substitute the norm by a kernel, as is done in robust statistics. This can be done at the level of individual variables, $J_{RPCA} = E\left\{ \sum_{i=1}^{M} k(x_i - (W\chi)_i) \right\}$ (for instance, [He2011] used a kernel based on the correntropy function), or at the level of the multivariate distance, $J_{RPCA} = E\{k(\|x - W\chi\|)\}$ ([Iglesias2007] proposed to use the function $k(x) = x^{\alpha}$ with $\alpha$ about 0.5, and they refer to this method as $\alpha$-PCA; [Subbarao2006] proposed to use M-estimators, and they refer to their method as pbM, projection-based M-estimators).
[Kwak2008] proposes a different approach: instead of minimizing the L1 norm of the reconstruction error as before, $J_{RPCA} = E\{\|x - W\chi\|_1\}$, he maximizes the L1 norm of the projections, $J_{RPCA} = \|W^t X\|_1$ subject to $W^t W = I$, trying to maximize the dispersion of the new features.
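For a single direction $w$, the objective $\sum_n |w^t x_n|$ can be increased with a simple sign-flipping fixed-point iteration. The sketch below illustrates this kind of iteration under stated assumptions (centered data, one component only, zero projections treated as positive). It should not be read as the exact procedure of [Kwak2008], and the function name `pca_l1_direction` is hypothetical.

```python
import numpy as np

def pca_l1_direction(X, n_iter=100, seed=0):
    """X has shape (M, N) and is assumed centered.
    Returns (w, history), where history stores sum_n |w^t x_n| per iteration."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[0])
    w /= np.linalg.norm(w)
    history = [np.abs(w @ X).sum()]
    for _ in range(n_iter):
        # Polarity of each projection (zero projections are treated as positive).
        signs = np.where(w @ X >= 0, 1.0, -1.0)
        w_new = X @ signs
        w_new /= np.linalg.norm(w_new)
        history.append(np.abs(w_new @ X).sum())
        converged = np.allclose(w_new, w)
        w = w_new
        if converged:
            break
    return w, history

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200)) * np.array([[4.0], [1.0], [0.5]])
X -= X.mean(axis=1, keepdims=True)
w, history = pca_l1_direction(X)
```

Each step cannot decrease the objective, because the new $w$ maximizes $w^t \sum_n s_n x_n$ over unit vectors for the current signs $s_n$, and the L1 dispersion is never smaller than that quantity. The iteration may still stop at a local maximum.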
The approach of De la Torre [Delatorre2003] can deal with univariate outliers. It modifies the PCA objective function to explicitly account for individual components of the observations that can be regarded as outliers and to account for the possible differences in the variance of each variable. The Robust PCA goal function is then $J_{RPCA} = \sum_{n=1}^{N} \sum_{i=1}^{M} \left[ O_{ni} \rho_i\left(x_{ni} - (\mu_i + w_i \chi_n)\right) + P(O_{ni}) \right]$, where $O_{ni}$ is a variable between 0 and 1 stating whether the $i$-th component of the $n$-th observation is an outlier (low values) or not (high values). If it is an outlier, then its error will not be counted, but the algorithm will be penalized by $P(O_{ni}) = \left(\sqrt{O_{ni}} - 1\right)^2$ to avoid the trivial solution of considering all samples to be outliers. This algorithm does not assume that the input data is zero-mean, and it estimates the mean, $\mu$, from the non-outlier samples and components. For each component, the function $\rho_i$ robustly measures the error committed. This is done by $\rho_i(e) = \frac{e^2}{e^2 + \sigma_i^2}$, i.e., the squared error is modulated by the variance of that variable, $\sigma_i^2$.
The approach of Hubert [Hubert2004] addresses the problem of multivariate outliers. It distinguishes the case in which we have more observations than variables ($N > M$) from the case in which we do not ($N < M$). In the first case ($N > M$), we have to specify the number $h$ of outliers we want the algorithm to be resistant to (it has to be $h \le \frac{1}{2}(N - M - 1)$). Then, we look for the subset of $N - h$ input vectors such that the determinant of its covariance matrix is minimum (this determinant is a generalization of the variance to multivariate variables: when the determinant is large, the corresponding dataset has a large variance). We compute the average and covariance of this subset and make some adjustments to account for the finite sample effect. These estimates are called the MCD (Minimum Covariance Determinant) estimates. Finally, PCA is performed as usual on this covariance estimate. In the second case ($N < M$), we cannot proceed as before since the determinant of the covariance matrix of any subset will be zero (recall our discussion of the SVD decomposition of the covariance matrix). So we first perform a dimensionality reduction without loss of information using SVD and keeping $N - 1$ variables. Next, we identify outliers by choosing a large number of random directions and projecting the input data onto each direction. For each direction, we compute MCD estimates (robust to $h$ outliers) of the mean ($\mu_{MCD,w}$) and standard deviation ($s_{MCD,w}$) of the projections (note that these are univariate estimates). The outlyingness of each input observation is computed as $outl(x_n) = \max_{w} \frac{|\langle x_n, w \rangle - \mu_{MCD,w}|}{s_{MCD,w}}$. The $h$ points with the highest outlyingness are removed from the dataset and, finally, PCA is performed normally on the remaining points.
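The projection-based outlyingness can be sketched as follows. As a loudly stated simplification, the univariate MCD estimates are replaced here by the median and the median absolute deviation (MAD), which are easier to compute. The number of random directions, the planted outliers and the function name are likewise assumptions for illustration.

```python
import numpy as np

def outlyingness(X, n_dirs=500, seed=0):
    """X has shape (N, M). Maximum robust z-score over random unit directions.
    Median and MAD are used as stand-ins for the univariate MCD estimates."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_dirs, X.shape[1]))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    proj = X @ W.T                                   # shape (N, n_dirs)
    med = np.median(proj, axis=0)
    mad = np.median(np.abs(proj - med), axis=0)
    return np.max(np.abs(proj - med) / mad, axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:5] += 20.0                                        # five planted outliers
scores = outlyingness(X)

h = 5
keep = np.argsort(scores)[:-h]                       # drop the h most outlying points
X_clean = X[keep]
# Standard PCA on the remaining points.
Xc = X_clean - X_clean.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / len(Xc))
```

With outliers this far from the bulk of the data, the five planted points receive the largest scores and are the ones removed before the final PCA.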
[Pinto2011] proposes a totally different approach. It is well known that rank statistics are more robust to noise and outliers than standard statistical analysis. For this reason, they propose to substitute the original observations by their ranks (the $i$-th component of the $n$-th individual, $x_{ni}$, is ranked among the $i$-th components of all individuals; then the observation $x_{ni}$ is substituted by its rank, which we will refer to as $r_{ni}$, and the corresponding data matrix will be referred to as $R$). Looking at the PCA objective function, $J_{PCA} = \mathrm{Tr}(W^t \Sigma_X W)$, they recognize that the covariance matrix in that equation, $\Sigma_X$, is very much related to the correlation coefficient among the different input features: the $(i,j)$-th entry of this matrix is the covariance between the $i$-th and $j$-th variables. Therefore, they propose to substitute this covariance matrix by a new correlation measure better suited to handle rank data. They refer to the approach as Weighted PCA. The objective function is finally $J_{PCA} = \mathrm{Tr}(W^t \Sigma_R W)$ subject to $W^t W = I$.
2.10 Factor Analysis

Factor analysis (FA) [Spearman1904, Thurstone1947, Kaiser1960, Lawley1971, Mulaik1971, Harman1976] is another statistical technique intimately related to PCA and dimensionality reduction. FA is a generative model that assumes that the observed data has been produced from a set of latent, unobserved variables (called factors) through the equation $x = W\chi + n$ (if there is no noise, this model is the same as in PCA, although PCA is not a generative model in its conception). In this technique all the variances of the factors are absorbed into $W$ such that the covariance of $\chi$ is the identity matrix. Factors are assumed to follow a multivariate normal distribution and to be uncorrelated with the noise. Under these conditions, the covariance of the observed variables is $C_x = WW^t + C_n$, where $C_n$ is the covariance matrix of the noise and has to be estimated from the data. Matrix $W$ is solved by factorization of the matrix $WW^t = C_x - C_n$. This factorization is not unique since any orthogonal rotation of $W$ results in the same decomposition of $C_x - C_n$ [Kaiser1958]. This fact, rather than a drawback, is exploited to produce simpler factors in the same way as the PCA was rotated (actually, the rotation methods for FA are the same as the ones for PCA).
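The non-uniqueness of the factorization is easy to verify numerically. In the sketch below, the sizes and the random loading matrix are arbitrary assumptions. A loading matrix $W$ and its rotated version $WQ$, with $Q$ orthogonal, produce exactly the same product $WW^t$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, m = 6, 2
W = rng.normal(size=(M, m))                       # hypothetical loading matrix

# Random orthogonal matrix Q obtained from a QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
W_rot = W @ Q

# Both loadings explain the same covariance C_x - C_n = W W^t.
same_covariance = np.allclose(W @ W.T, W_rot @ W_rot.T)
```

Rotation methods exploit this freedom: among all the equivalent loadings, they pick the one that is easiest to interpret.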
2.11 Independent Component Analysis

Independent Component Analysis (ICA) [Common1994, Hyvarinen1999, Hyvarinen2000, Hyvarinen2001] is an example of an information-theory-based algorithm. It is also a generative model with equation $x = W\chi$. While PCA looks for uncorrelated factors (i.e., a constraint on the second-order statistics), ICA looks for independent factors (i.e., a constraint on all their moments, not only the second-order ones). This is an advantage if the factors are truly independent (for jointly Gaussian variables decorrelation implies independence, but this is not generally true for variables with any other distribution). In the language of ICA, the $\chi$ vectors are called the sources, while the $x$ are called the observations. Matrix $W$ is called the mixing matrix, and the problem is formulated as one of source separation, i.e., finding an unmixing matrix $W$ such that the components of $\hat{\chi} = W^t x$ are as independent as possible (see Fig. 12). The sources can be recovered up to a permutation (the sources are recovered in a different order) and a scale change (the sources are recovered with a different scale). A limitation of the technique is that sources are usually supposed to be non-Gaussian, since the linear combination of two Gaussian variables is also Gaussian, which makes the separation an ill-posed problem.
Figure 12. a) Example of PCA results for a given input distribution. b) ICA results for the same distribution.
Different ICA methods differ in the way they measure the independence of the estimates of the source variables, resulting in different estimates of the mixing matrix $W$ and source variables $\hat{\chi}$. The following are different options commonly adopted:
Non-Gaussianity: The central limit theorem states that the distribution of the weighted sum of the sources tends to be normally distributed, whichever the distributions of the original sources. Thus, a possible way to achieve the separation of the sources is by looking for transformations $W$ that maximize the kurtosis (in absolute value) of each of the components of the vector $\hat{\chi}$ [Hyvarinen2001] (see Fig. 13). The kurtosis is related to the fourth-order moment of the distribution. The kurtosis of the Gaussian is zero and, thus, maximizing the kurtosis minimizes the Gaussianity of the output variables. In fact, maximizing the kurtosis can be seen as a Nonlinear PCA problem (see above) with the nonlinear function for each feature vector $f_i(\chi_i) = \chi_i + \mathrm{sgn}(\chi_i) \chi_i^2$. FastICA is an algorithm based on this goal function. The problem of kurtosis is that it can be very sensitive to outliers. Alternatively, we can measure the non-Gaussianity by negentropy, which is the Kullback-Leibler divergence between the multivariate distribution of the estimated sources $\hat{\chi}$ and the distribution of a multivariate Gaussian variable $\chi_G$ of the same mean and covariance matrix as $\hat{\chi}$. Projection pursuit [Friedman1974, Friedman1987] is an exploratory data analysis algorithm looking for projection directions of maximum kurtosis (as in our first ICA algorithm), while Exploratory Projection Pursuit [Girolami1997] uses the maximum negentropy to look for the projection directions. Non-Gaussian Component Analysis [Theis2011], instead of looking for a single direction like in projection pursuit, looks for an entire linear subspace where the projected data is as non-Gaussian as possible.
Figure 13. a) Sample distribution in which the first PCA component maximizes the explained variance but projections onto this direction are normally distributed (b). The first ICA component is also shown on the sample data; data projection onto this direction clearly shows a non-Gaussian distribution.
Maximum likelihood (ML): Let us assume that we know the a priori distribution of each of the components of $\chi$. Then, we could find the matrix $W$ simply by maximizing the likelihood of all the observations, $l_x = p_x(x) = |\det W| \, p_{\chi}(\chi) = |\det W| \prod_{i=1}^{m} p_i(\chi_i)$. Taking logarithms and the expected value over all input vectors, we obtain $L_X = \log |\det W| + E\left\{ \sum_{i=1}^{m} \log p_i(\chi_i) \right\}$, which is the objective function to maximize with respect to $W$ (the Bell-Sejnowski and the natural gradient algorithms [Hyvarinen2001] are typical algorithms for performing this optimization). If the distribution of the features is not known a priori, it can be shown [Hyvarinen2001] that reasonable errors in the estimation of the $p_i$ distributions result in locally consistent ML estimators as long as, for all $i$, $E\{\chi_i g_i(\chi_i) - g_i'(\chi_i)\} > 0$, where $g_i(\chi_i) = \frac{\hat{p}_i'(\chi_i)}{\hat{p}_i(\chi_i)}$ and $\hat{p}_i$ is our estimate of the truly underlying distribution $p_i$. A common approach is to choose for each $i$ between a super-Gaussian and a sub-Gaussian distribution. This choice is made by checking which one of the two fulfills $E\{\chi_i g_i(\chi_i) - g_i'(\chi_i)\} > 0$. Common choices are $g_i(\chi_i) = -\chi_i - \tanh(\chi_i)$ for the super-Gaussian and $g_i(\chi_i) = -\chi_i + \tanh(\chi_i)$ for the sub-Gaussian. This criterion is strongly related to Infomax (Information Maximization) in neural networks. There, the problem is to look for the network weights such that the joint entropy of the output variables, $H(\chi_1, \chi_2, ..., \chi_m)$, is maximized. It has been shown [Lee2000] that the Infomax criterion is equivalent to the maximum likelihood one when $g_i(\chi_i)$ is the nonlinear function of each output neuron of the neural network (usually a sigmoid). Equivalently, instead of maximizing the joint entropy of the feature variables, we could have minimized their mutual information. Mutual information is a measure of the dependency among a set of variables. So, it can be easily seen that all these criteria maximize the independence of the feature vectors.
Nonlinear decorrelation: two variables $\chi_1$ and $\chi_2$ are independent if, for all continuous functions $f_1$ and $f_2$ with finite support, we have $E\{f_1(\chi_1) f_2(\chi_2)\} = E\{f_1(\chi_1)\} E\{f_2(\chi_2)\}$. The two variables are nonlinearly decorrelated if $E\{f_1(\chi_1) f_2(\chi_2)\} = 0$. In fact, PCA looks for output variables that are second-order decorrelated, $E\{\chi_1 \chi_2\} = 0$, although they may not be independent because of their higher-order moments. Making the Taylor expansion of $f_1(\chi_1) f_2(\chi_2)$, we arrive at $E\{f_1(\chi_1) f_2(\chi_2)\} = \sum_{i,j=0}^{\infty} c_{ij} E\{\chi_1^i \chi_2^j\} = 0$, for which we need all high-order correlations to be zero, $E\{\chi_1^i \chi_2^j\} = 0$. The Hérault-Jutten and Cichocki-Unbehauen algorithms look for independent components by making specific choices of the nonlinear functions $f_1$ and $f_2$ [Hyvarinen2001].
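The non-Gaussianity argument in the list above can be illustrated with a small numerical experiment. The uniform sources, the sample size and the equal-weight mixture are assumptions for the example. The sketch measures the excess kurtosis of one source, of a mixture of two sources and of a Gaussian sample.

```python
import numpy as np

def excess_kurtosis(v):
    """Fourth standardized moment minus 3 (zero for a Gaussian variable)."""
    v = (v - v.mean()) / v.std()
    return np.mean(v ** 4) - 3.0

rng = np.random.default_rng(0)
N = 20000
s1 = rng.uniform(-1, 1, N)            # two independent, sub-Gaussian sources
s2 = rng.uniform(-1, 1, N)
mix = (s1 + s2) / np.sqrt(2)          # one observed mixture

k_source = excess_kurtosis(s1)
k_mix = excess_kurtosis(mix)
k_gauss = excess_kurtosis(rng.normal(size=N))
```

In theory, the excess kurtosis of a uniform variable is $-1.2$ and that of the equal-weight sum of two independent uniform variables is $-0.6$: mixing pushes the distribution towards the Gaussian, so an unmixing direction can be sought by making the absolute kurtosis as large as possible.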
3. Methods based on Dictionaries

Another family of methods is based on the decomposition of a matrix formed by all input data as columns, $X$. The input data matrix using the input variables is transformed into a new data matrix using the new variables. The transformation is nothing but a linear change of basis between the two variable sets. In the field of dimensionality reduction, the matrix expressing the change of basis is known as a dictionary (dictionary elements are called atoms), and there are several ways of producing such a dictionary [Rubinstein2010]. We have already seen Singular Value Decomposition (SVD) and its strong relationship to PCA. Vector Quantization (K-means) can also be considered an extreme case of a dictionary-based algorithm (input vectors are represented by a single atom, instead of as a combination of several atoms). In this section we will explore other dictionary-based algorithms.
3.1 Nonnegative Matrix Factorization

A drawback of many dimensionality reduction techniques is that they produce feature vectors with negative components. In some applications like text analysis, it is natural to think of the observed vectors as the weighted sum of some underlying factors with no subtraction involved (for instance, it is natural to think that if a scientific article is about two related topics, the word frequencies of the article will be a weighted sum, with positive weights, of the word frequencies of the two topics). Let us consider the standard dictionary decomposition $X = WU$, where $W$ is the dictionary (of size $M \times m$; its columns are called atoms of the dictionary) and $U$ (of size $m \times N$) is the expression of the observations in the subspace spanned by the dictionary atoms (see Fig. 14). Not considering subtractions implies constraining the feature vectors to be $U \ge 0$.
Figure 14. Dictionary decomposition of a set of documents (see Fig. 3). Each document is decomposed as the linear combination given by the weights in U of the topics (atoms) contained in W.
If $X$ is made only of positive values, it might be interesting to constrain the atoms to be positive as well ($W \ge 0$). This is the problem solved by Non-negative Matrix Factorization [Lee1999, Lee2001]. The goal is to minimize $\|X - WU\|_F^2$ or $D(X \| WU)$ (defined as $D(A \| B) = \sum_{i,j} \left( A_{ij} \log \frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$; this is the Kullback-Leibler divergence if $A$ and $B$ are normalized so that they can be regarded as probability distributions) subject to $W, U \ge 0$. The advantage of this decomposition is that, if the application is naturally defined with positive values, the dictionary atoms are much more understandable and related to the problem than those of the standard dimensionality reduction methods. [Sandler2011] proposed to minimize the Earth Mover's Distance between the matrices $X$ and $WU$ with the aim of making the method more robust, especially to small samples. The Earth Mover's Distance, also called Wasserstein metric, is a way of measuring distances between two probability distributions (for a review on how to measure distances between probability distributions see [Rubner2000]). It is defined as the minimum cost of turning one probability distribution into the other, and it is computed through a transportation problem. This distance was extended by [Sandler2011] to measure the distance between matrices by applying the distance to each column (feature) of the matrices and then summing all distances.
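For the Frobenius objective, the factorization can be sketched with the multiplicative update rules of [Lee2001]. The small constant `eps` added to the denominators, the random initialization, the number of iterations and the synthetic data are assumptions for the example, and the function name `nmf` is hypothetical.

```python
import numpy as np

def nmf(X, m, n_iter=200, eps=1e-10, seed=0):
    """Multiplicative updates for min ||X - W U||_F^2 subject to W, U >= 0."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.uniform(0.1, 1.0, size=(M, m))        # dictionary (atoms as columns)
    U = rng.uniform(0.1, 1.0, size=(m, N))        # feature vectors (as columns)
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so non-negativity is preserved.
        U *= (W.T @ X) / (W.T @ W @ U + eps)
        W *= (X @ U.T) / (W @ U @ U.T + eps)
        errors.append(np.linalg.norm(X - W @ U) ** 2)
    return W, U, errors

# Synthetic non-negative data with an exact rank-3 structure.
rng = np.random.default_rng(1)
X = rng.uniform(size=(20, 3)) @ rng.uniform(size=(3, 50))
W, U, errors = nmf(X, m=3)
```

Because the updates only multiply by non-negative ratios, no explicit projection onto the constraint set is needed.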
In recent years there has been much interest in the construction of sparse feature vectors, sparse dictionaries or both. The underlying idea is to produce feature vectors with as many zeroes as possible or, equivalently, to approximate the observations with as few dictionary atoms as possible. This has obvious advantages when trying to explain the atomic composition of a given observation. In the following paragraphs we will review some of the approaches already proposed for NMF in this direction.
Local NMF [Feng2002] enforces sparsity by adding to the NMF goal function the terms $\|W^t W\|_1$ (which promotes the orthogonality of the atoms, i.e., minimizes the overlapping between atoms) and $-\|U\|_2^2$ (which maximizes the variance of the feature vectors, i.e., it favors the existence of large components). Non-negative sparse coding [Hoyer2002] and Sparse NMF [Liu2003] add the term $\|U\|_1$ in order to minimize the number of features with significant values. Pauca et al. [Pauca2006] regularize by adding $\|U\|_2^2$ and $\|W\|_2^2$ (this is especially suited for noisy data). NMF with Sparseness Constraints [Hoyer2004] performs the NMF with the constraint that the sparseness of each column of $W$ is a given constant $S_W$ (i.e., it promotes the sparseness of the dictionary atoms) and the sparseness of each row of $U$ is another constant $S_U$ (i.e., it promotes that each atom is used in as few feature vectors as possible). In [Hoyer2004], the sparseness of a vector $x$ of dimension $n$ is defined as $\mathrm{Sparseness}(x) = \frac{\sqrt{n} - \|x\|_1 / \|x\|_2}{\sqrt{n} - 1}$, which measures how much energy of the vector is packed in as few components as possible. This function evaluates to 1 if there is a single nonzero component, and evaluates to 0 if all the elements are identical. Non-smooth NMF [PascualMontano2006] modifies the NMF factorization to $X = WSU$. The matrix $S$ controls the sparseness of the solution through the parameter $\theta$. It is defined as $S = (1 - \theta) I + \frac{\theta}{m} \mathbf{1}\mathbf{1}^t$. For $\theta = 0$ it is the standard NMF. For $\theta = 1$, we can think of the algorithm as using effective feature vectors defined by $U' = SU$ that substitute each feature vector $\chi$ by a vector of the same dimensionality all of whose components are equal to the mean of $\chi$. This is just imposing non-sparseness on the feature vectors, and this will promote sparseness on the dictionary atoms. On the other hand, we could have also thought of the algorithm as using the effective atoms given by $W' = WS$ that substitute each atom by the average of all atoms. In this case, the non-sparseness of the dictionary atoms will promote sparseness of the feature vectors. Non-smooth NMF is used with typical $\theta$ values about 0.5.
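The sparseness measure and the smoothing matrix are both short enough to write down directly. In the sketch below, the function names are hypothetical and the test vectors are arbitrary examples. The two limiting cases mentioned in the text (a single nonzero component, and all components identical) are evaluated, together with the effect of $S$ for $\theta = 1$.

```python
import numpy as np

def hoyer_sparseness(x):
    """(sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1) for a nonzero vector x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

def smoothing_matrix(m, theta):
    """S = (1 - theta) I + (theta / m) 1 1^t, as used by Non-smooth NMF."""
    return (1 - theta) * np.eye(m) + theta / m * np.ones((m, m))

single_spike = hoyer_sparseness([0.0, 0.0, 3.0, 0.0])     # single nonzero component
flat_vector = hoyer_sparseness([2.0, 2.0, 2.0, 2.0])      # identical components

chi = np.array([1.0, 2.0, 3.0])
smoothed = smoothing_matrix(3, 1.0) @ chi                  # every entry becomes mean(chi)
unchanged = smoothing_matrix(3, 0.0) @ chi                 # theta = 0 leaves chi as is
```

Intermediate values of $\theta$ interpolate between these two extremes, which is how the parameter trades smoothness of one factor for sparseness of the other.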
Another flavor of NMF, Graph-regularized NMF (GNMF) [Cai2011b], enforces learning the local manifold structure of the input data. If the input vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ are close in the original space, one would like their reduced representations, $\boldsymbol{\chi}_i$ and $\boldsymbol{\chi}_j$, to be close as well. To do so, the algorithm constructs a graph $G$ encoding the neighbors of the input observations. Observations are represented by nodes in the graph, and two nodes are connected by an edge if their distance is smaller than a given threshold and they are in the K-neighbors list of each other. The weight of each edge is 1 or, if we prefer, we can assign a different weight to each edge depending on the distance between the two points (for instance, $e^{-\frac{1}{\sigma^2}\|\mathbf{x}_i - \mathbf{x}_j\|^2}$). We build the diagonal matrix $D$ whose elements are the row sums of $G$. The Laplacian of this graph is defined as $L = D - G$. The sum over all neighbor pairs of the squared Euclidean distances between the reduced representations, weighted by the corresponding edge weights, can be computed as $\mathrm{Tr}(ULU^t)$ [Cai2011b]. In this way, the GNMF objective function becomes $\|X - WU\|_F^2 + \lambda\,\mathrm{Tr}(ULU^t)$, where $\lambda$ weights the regularization term. This algorithm has a corresponding version for the case in which the Kullback-Leibler divergence is preferred over Euclidean distances [Cai2011b].
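The role of the graph Laplacian in the regularization term can be checked numerically. The sketch below is a simplified illustration, not the GNMF algorithm itself: it connects two observations whenever their distance is below a threshold (the K-neighbors condition is omitted for brevity), uses heat-kernel weights, and verifies that $\mathrm{Tr}(ULU^t)$ equals the weighted sum of squared distances between feature vectors. All function names and the toy data are hypothetical.

```python
import math


def heat_kernel_graph(points, radius, sigma):
    """Symmetric weight matrix G: heat-kernel weight if two points are closer than radius."""
    n = len(points)
    G = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                if d2 < radius ** 2:
                    G[i][j] = math.exp(-d2 / sigma ** 2)
    return G


def laplacian(G):
    """L = D - G, where D is diagonal and holds the row sums of G."""
    n = len(G)
    return [[(sum(G[i]) if i == j else 0.0) - G[i][j] for j in range(n)] for i in range(n)]


def trace_ULUt(chi, L):
    """Tr(U L U^t), where chi[n] is the n-th column of U (feature vector of observation n)."""
    n = len(chi)
    return sum(L[i][j] * sum(a * b for a, b in zip(chi[i], chi[j]))
               for i in range(n) for j in range(n))


def weighted_pair_distances(chi, G):
    """(1/2) sum_ij G_ij ||chi_i - chi_j||^2, i.e. each neighbor pair counted once."""
    n = len(chi)
    return 0.5 * sum(G[i][j] * sum((a - b) ** 2 for a, b in zip(chi[i], chi[j]))
                     for i in range(n) for j in range(n))


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]   # toy observations
chi = [(0.1,), (0.4,), (0.3,), (2.0,)]                      # toy one-dimensional features
G = heat_kernel_graph(points, radius=2.0, sigma=1.0)
L = laplacian(G)
print(trace_ULUt(chi, L), weighted_pair_distances(chi, G))
```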
3.2 Principal Tensor Analysis and Nonnegative Tensor Factorization

In some situations the data is better represented by a multidimensional array rather than by a matrix. For instance, we might have a table representing the gene expression level for a number of drugs. It is naturally represented by a (drug, gene) matrix and all the previous methods to factorize matrices are applicable. However, we might have a (drug, gene, time) table that specifies the gene expression for each combination of gene, drug and time. This three-way table (and in general multiway tables) is a tensor (strictly speaking, a multiway table of dimension $d$ is a tensor if and only if it can be decomposed as the outer product of $d$ vectors; however, in the literature multiway tables are usually referred to as tensors and we will also adhere here to this loose definition). Tensors can be flattened into matrices and then all the previous techniques would be available. However, the locality imposed by some variables (like time or spatial location) would be lost. Nonnegative tensor factorization (NTF) [Cichocki2009] is an extension of NMF to multiway tables. The objective is, as usual, minimizing the representation error $\left\| X - \sum_{i=1}^{m} \mathbf{w}_i^1 \circ \mathbf{w}_i^2 \circ \dots \circ \mathbf{w}_i^d \right\|_F^2 = \left\| X - \sum_{i=1}^{m} \bigcirc_{j=1}^{d} \mathbf{w}_i^j \right\|_F^2$ subject to $\mathbf{w}_i^j \ge 0$ for all $i$ and $j$. In the previous expression $X$ is a tensor of dimension $d$ (in our three-way table example, $d = 3$), $\circ$ represents the outer product, and $m$ is a parameter controlling the dimensionality reduction. For each dimension $j$ (drug, gene or time, in our example), there will be $m$ associated vectors $\mathbf{w}_i^j$. The length of each vector depends on the dimension it is associated to (see Fig. 15). If there are $N_j$ elements in the $j$-th dimension ($N_{drugs}$, $N_{genes}$ and $N_{time}$ entries in our example), the length of the vectors $\mathbf{w}_i^j$ is $N_j$. The approximation after dimensionality reduction is $\hat{X} = \sum_{i=1}^{m} \bigcirc_{j=1}^{d} \mathbf{w}_i^j$. For a three-way table, the $pqr$ element of the tensor is given by $\hat{X}_{pqr} = \sum_{i=1}^{m} w_{ip}^1 w_{iq}^2 w_{ir}^3$, where $w_{ip}^1$ represents the $p$-th component of the vector $\mathbf{w}_i^1$ (analogously with $w_{iq}^2$ and $w_{ir}^3$).
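The element-wise formula for the three-way case translates directly into code. The following sketch only rebuilds the approximation from given nonnegative factors (it does not fit them); the factor values and the name `ntf_reconstruct` are made up for illustration.

```python
def ntf_reconstruct(W1, W2, W3):
    """Three-way table with entries sum_i W1[i][p] * W2[i][q] * W3[i][r].

    W1[i], W2[i] and W3[i] hold the i-th vector of each of the three dimensions.
    """
    m = len(W1)
    P, Q, R = len(W1[0]), len(W2[0]), len(W3[0])
    return [[[sum(W1[i][p] * W2[i][q] * W3[i][r] for i in range(m))
              for r in range(R)]
             for q in range(Q)]
            for p in range(P)]


# Toy example: m = 2 terms, 2 drugs, 3 genes, 2 time points.
W_drug = [[1.0, 0.0], [0.0, 2.0]]
W_gene = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
W_time = [[1.0, 2.0], [1.0, 0.0]]

X_hat = ntf_reconstruct(W_drug, W_gene, W_time)
print(X_hat[0][1][1])  # drug 0, gene 1, time 1
print(X_hat[1][2][0])  # drug 1, gene 2, time 0
```

Each of the $m$ terms is an outer product of one vector per dimension, so the whole table is described by $m(N_{drugs} + N_{genes} + N_{time})$ numbers.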
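NTF is a particular case of a family of algorithms decomposing tensors. PARAFAC [Harshman1970, Bro1997] may be one of the first algorithms doing so and can be regarded as a generalization of SVD to tensors. The SVD approximation $\hat{X} = W_m D_m U_m$ can be rewritten as $\hat{X} = \sum_{i=1}^{m} d_i \mathbf{w}_i \mathbf{u}_i^t$. The PARAFAC model minimizes the approximation error $J_{PARAFAC} = \left\| X - \sum_{i=1}^{m} d_i\, \mathbf{w}_i^1 \circ \mathbf{w}_i^2 \circ \dots \circ \mathbf{w}_i^d \right\|_F^2$. We can enrich the model to consider more outer products as in $J_{Tucker} = \left\| X - \sum_{i_1=1}^{m_1} \sum_{i_2=1}^{m_2} \dots \sum_{i_d=1}^{m_d} d_{i_1 i_2 \dots i_d}\, \mathbf{w}_{i_1}^1 \circ \mathbf{w}_{i_2}^2 \circ \dots \circ \mathbf{w}_{i_d}^d \right\|_F^2$. This is called the Tucker model [Tucker1966].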
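The relation between the two models can be made concrete for three-way tables: a Tucker model whose core $d_{i_1 i_2 i_3}$ is nonzero only when $i_1 = i_2 = i_3$ reduces to PARAFAC. The sketch below checks this on toy factors; all names and values are illustrative assumptions.

```python
def tucker_reconstruct(core, W1, W2, W3):
    """Entries sum_{a,b,c} core[a][b][c] * W1[a][p] * W2[b][q] * W3[c][r]."""
    m1, m2, m3 = len(W1), len(W2), len(W3)
    P, Q, R = len(W1[0]), len(W2[0]), len(W3[0])
    return [[[sum(core[a][b][c] * W1[a][p] * W2[b][q] * W3[c][r]
                  for a in range(m1) for b in range(m2) for c in range(m3))
              for r in range(R)]
             for q in range(Q)]
            for p in range(P)]


def parafac_reconstruct(d, W1, W2, W3):
    """Entries sum_i d[i] * W1[i][p] * W2[i][q] * W3[i][r]."""
    P, Q, R = len(W1[0]), len(W2[0]), len(W3[0])
    return [[[sum(d[i] * W1[i][p] * W2[i][q] * W3[i][r] for i in range(len(d)))
              for r in range(R)]
             for q in range(Q)]
            for p in range(P)]


W1 = [[1.0, 0.0], [0.0, 2.0]]
W2 = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
W3 = [[1.0, 2.0], [1.0, 0.0]]
d = [2.0, 0.5]

# Core tensor that is zero except on its superdiagonal.
core = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
core[0][0][0], core[1][1][1] = d

X_tucker = tucker_reconstruct(core, W1, W2, W3)
X_parafac = parafac_reconstruct(d, W1, W2, W3)
print(X_tucker == X_parafac)
```

A general (non-diagonal) core lets every vector of one dimension interact with every vector of the others, which is what makes the Tucker model richer.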

Figure 15. Tensor decomposition as the sum of a series of outer vector products. The figure corresponds to the more general Tucker model, which is simplified to PARAFAC or NTF.
3.3 Generalized SVD

According to the SVD decomposition, $X = WDU$, the columns of $W$ are the eigenvectors of the matrix $XX^t$, while the rows of $U$ are the eigenvectors of the matrix $X^tX$. Eigenvectors are orthogonal and therefore $W^tW = I$ and $UU^t = I$. After performing the dimensionality reduction, the approximation $\hat{X} = W_m D_m U_m$ is the matrix minimizing the representation error $\|X - \hat{X}\|_F^2 = \mathrm{trace}\left((X - \hat{X})(X - \hat{X})^t\right)$.
Generalized SVD [Paige1981] performs a similar decomposition but relaxes, or modifies, the orthogonality conditions according to two constraint matrices, $C_W$ and $C_U$, such that $W^tC_WW = I$ and $UC_UU^t = I$ [Abdi2007]. After the dimensionality reduction, the approximation matrix is the one minimizing $\mathrm{trace}\left(C_W(X - \hat{X})C_U(X - \hat{X})^t\right)$.
Generalized SVD is a very versatile tool since, under the appropriate choices of the constraint matrices, it can be particularized to correspondence analysis (a generalization of factor analysis for categorical variables; $C_W$ is the relative frequency of the rows of the data matrix, $X$, and $C_U$ is the relative frequency of its columns), discriminant analysis (a technique relating a set of continuous variables to a categorical variable), and canonical correlation analysis (a technique analyzing two groups of continuous variables and simultaneously performing two dimensionality reductions so that the two new sets of features have maximum cross-correlation) [Abdi2007].
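One common way of computing such a decomposition is to reduce it to a standard SVD. The derivation below is a sketch under the assumption that $C_W$ and $C_U$ are symmetric positive definite, so that their square roots and inverse square roots exist; the auxiliary matrices $\tilde{X}$, $P$ and $Q$ are introduced here only for illustration.

```latex
\begin{align*}
\tilde{X} &= C_W^{1/2}\, X\, C_U^{1/2} = P D Q^t, \qquad P^t P = I,\quad Q^t Q = I
  && \text{(standard SVD)} \\
W &= C_W^{-1/2} P, \qquad U = Q^t C_U^{-1/2} \\
W D U &= C_W^{-1/2}\, P D Q^t\, C_U^{-1/2} = C_W^{-1/2}\, \tilde{X}\, C_U^{-1/2} = X \\
W^t C_W W &= P^t C_W^{-1/2} C_W C_W^{-1/2} P = P^t P = I \\
U C_U U^t &= Q^t C_U^{-1/2} C_U C_U^{-1/2} Q = Q^t Q = I
\end{align*}
```

Truncating $P$, $D$ and $Q$ to the $m$ largest singular values then yields the reduced approximation $\hat{X} = W_m D_m U_m$.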
3.4 Sparse representations and overcomplete dictionaries

A different approach to dimensionality reduction consists of cutting the input signal into small pieces and performing a dimensionality reduction of them. For instance, we can divide an image into small 8x8 pieces (vectors of dimension 64). Then we try to express each piece as a linear combination of a few atoms from a large dictionary (of size larger than 64, which is why it is called overcomplete). At the level of pieces, the dictionary acts as a dimensionality expansion, although overall there is a dimensionality reduction since each piece can be represented with just a few atoms instead of the 64 values needed originally. This approach can be applied to any domain where the original vectors $\mathbf{x}$ can be decomposed into pieces of similar nature: images and time series are good examples. Let us call $\mathbf{x}_n$ each one of these pieces. The idea is that for each piece we solve the problem $\min_{\boldsymbol{\chi}_n} \|\boldsymbol{\chi}_n\|_0$ subject to $\|\mathbf{x}_n - W\boldsymbol{\chi}_n\|_2 \le \epsilon$ (by setting $\epsilon = 0$ we require exact representation of the original vector). The columns of $W$ are the atoms, and the feature vector $\boldsymbol{\chi}_n$ defines the specific linear combination of atoms used to represent the corresponding piece. The sparseness of the feature vector is measured simply by counting the number of elements different from zero (this is usually referred to as the $l_0$ norm, although actually it is not a norm since it is not positive homogeneous). The problem of the $l_0$ norm is that it yields nonconvex optimization problems whose solution is NP-hard. To tackle this problem some authors have replaced the $l_0$ by the $l_p$ norm (for $0 < p < 1$ the problem is still nonconvex, although there are efficient algorithms; $p = 1$ is a very popular norm to promote sparseness). Related problems are $\min_{\boldsymbol{\chi}_n} \|\mathbf{x}_n - W\boldsymbol{\chi}_n\|_2^2 + \lambda\|\boldsymbol{\chi}_n\|_p^p$ (the Least Absolute Shrinkage and Selection Operator (LASSO) is such a problem with $p = 1$, and ridge regression is also this problem with $p = 2$) and $\min_{\boldsymbol{\chi}_n} \|\mathbf{x}_n - W\boldsymbol{\chi}_n\|_2$ subject to $\|\boldsymbol{\chi}_n\|_p \le t$, where $t$ is a user-defined value (for $p = 0$ it restricts the feature vector to use at most $t$ atoms). The previous problems are called the sparse coding step and many algorithms have been devised for its solution. The most popular ones are basis pursuit [Chen1994, Chen2001], matching pursuit [Mallat1993], orthogonal matching pursuit [Pati1993, Tropp2007], orthogonal least squares [Chen1991], focal underdetermined system solver (FOCUSS) [Gorodnitsky1997], gradient pursuit [Blumensath2008] and conjugate gradient pursuit [Blumensath2008]. For a review of these techniques, please see [Bruckstein2009].
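To give a flavor of the greedy algorithms listed above, the following is a minimal sketch of plain matching pursuit, written from its usual textbook description rather than from any particular reference implementation. It assumes unit-norm atoms, and the toy dictionary is hypothetical: at each step it picks the atom most correlated with the residual and subtracts its contribution.

```python
import math


def matching_pursuit(x, atoms, steps):
    """Greedy sparse coding of x with unit-norm atoms; returns (coefficients, residual)."""
    residual = list(x)
    chi = [0.0] * len(atoms)
    for _ in range(steps):
        # Correlation of the current residual with every atom.
        corr = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        if abs(corr[k]) < 1e-12:
            break  # nothing left to explain
        chi[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return chi, residual


s = 1.0 / math.sqrt(2.0)
# Overcomplete dictionary for 2-D pieces: the canonical basis plus one diagonal atom.
atoms = [(1.0, 0.0), (0.0, 1.0), (s, s)]

chi, residual = matching_pursuit((2.0, 2.0), atoms, steps=1)
print(chi, residual)
```

In this example the piece (2, 2) needs two atoms of the canonical basis but a single atom of the overcomplete dictionary. Since the same atom may be selected more than once, `steps` bounds the number of iterations, not exactly the number of atoms used.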
The other problem is how to learn the overcomplete dictionary $W$ from the input pieces $\mathbf{x}_n$. In a way, this can be considered as an extension of the vector quantization problem. In vector quantization we look for a set of class averages minimizing the representation error when all except one of the feature vector components are zero (the value of the nonnull component, $\chi_i$, is 1). Now, we have relaxed this condition and we allow the feature components to be real valued (instead of 0 or 1), and we represent our data by a weighted sum of a few atoms. Nearly all methods iteratively alternate between the estimation of the feature vectors and the estimation of the dictionary, and they differ in the goal function being optimized, which ultimately results in different update equations for the dictionary. The Method of Optimal Directions (MOD) [Engan2000] is a possible way of learning dictionaries. This method optimizes $(W^*, U^*) = \arg\min_{W,U} \|X - WU\|_F^2$ and it has proven to be very efficient. Another
possibility to learn the dictionary is by Maximum Likelihood [Lewicki2000]. Under a generative model it is assumed that the observations have been generated as noisy versions of a linear combination of atoms, $\mathbf{x} = W\boldsymbol{\chi} + \boldsymbol{\epsilon}$. The observations are assumed to be produced independently, the noise to be white and Gaussian, and the a priori distribution of the feature vectors to be Cauchy or Laplacian. Under these assumptions the likelihood of observing all pieces, $\mathbf{x}_n$, given $W$ is maximized by $W^* = \arg\min_W \sum_n \min_{\boldsymbol{\chi}_n} \left( \|\mathbf{x}_n - W\boldsymbol{\chi}_n\|_2^2 + \lambda\|\boldsymbol{\chi}_n\|_1 \right)$. If a prior distribution of the dictionary is available we can use a
Bayesian approach (Maximum a posteriori) [Kreutz2003]. K-SVD [Aharon2006] solves the problem $(W^*, U^*) = \arg\min_{W,U} \|X - WU\|_F^2$ subject to $\|\boldsymbol{\chi}_n\|_0 \le t$ for all the pieces, $n$. It is conceived as a generalization of the K-means algorithm, and the update of the dictionary involves a Singular Value Decomposition (hence its name).
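As a worked example of the alternation between sparse coding and dictionary estimation, consider the dictionary update when the goal function is $\|X - WU\|_F^2$, as in MOD. With the feature vectors $U$ held fixed, the problem is an ordinary least-squares problem in $W$. The derivation below is a sketch that assumes $UU^t$ is invertible.

```latex
\begin{align*}
\frac{\partial}{\partial W}\,\|X - WU\|_F^2 &= -2\,(X - WU)\,U^t = 0 \\
\Rightarrow\quad W\,U U^t &= X U^t \\
\Rightarrow\quad W &= X U^t \left(U U^t\right)^{-1}
\end{align*}
```

The sparse coding step then recomputes $U$ for the new dictionary, and the two steps are repeated until the representation error stops decreasing.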
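K-SVD can be integrated into a larger framework capable of optimally reconstructing the original vector from its pieces. The goal function is $\lambda\|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 + \sum_n \mu_n\|\boldsymbol{\chi}_n\|_0 + \sum_n \|\mathbf{x}_n - W\boldsymbol{\chi}_n\|_2^2$, where the variables $\lambda$ and $\mu_n$ are Lagrangian multipliers. Once all the patches have been approximated by their corresponding feature vectors, we can recover the original input vector by $\hat{\mathbf{x}} = \left(\lambda I + \sum_n P_n^t P_n\right)^{-1}\left(\lambda\mathbf{x} + \sum_n P_n^t W\boldsymbol{\chi}_n\right)$, where $P_n$ is an operator extracting the $n$-th piece as a vector, and $P_n^t$ is the operator putting it back in its original position.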
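For signals cut into overlapping pieces, the matrix $\lambda I + \sum_n P_n^t P_n$ is diagonal: each diagonal entry is $\lambda$ plus the number of pieces covering that sample. The recovery formula is therefore a per-sample weighted average, as the following one-dimensional sketch shows. The function name, the patch layout and the numerical values are illustrative assumptions; `approximations[n]` plays the role of $W\boldsymbol{\chi}_n$.

```python
def recover_signal(x, approximations, starts, lam):
    """x_hat = (lam*I + sum_n P_n^t P_n)^(-1) (lam*x + sum_n P_n^t approximations[n]).

    P_n extracts len(approximations[n]) consecutive samples beginning at starts[n],
    so the matrix to invert is diagonal and the inverse is a per-sample division.
    Requires lam > 0, or every sample covered by at least one piece.
    """
    numerator = [lam * v for v in x]
    denominator = [lam] * len(x)
    for patch, start in zip(approximations, starts):
        for k, value in enumerate(patch):
            numerator[start + k] += value      # P_n^t puts the piece back in place
            denominator[start + k] += 1.0      # diagonal of P_n^t P_n
    return [a / b for a, b in zip(numerator, denominator)]


x = [1.0, 2.0, 3.0, 4.0]
starts = [0, 1, 2]                                     # three overlapping pieces of length 2
approximations = [[1.0, 2.0], [2.0, 4.0], [3.0, 4.0]]  # toy sparse approximations

x_hat = recover_signal(x, approximations, starts, lam=1.0)
print(x_hat)
```

Only the third sample changes here, because it is the only one for which the overlapping approximations disagree with the input.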
Tensor representations can be coupled to sparse representations as shown by [Gurumoorthy2010], which we will refer to as Sparse Tensor SVD. In certain situations, the input data is better represented by a matrix or tensor than by a vector. For instance, image patches can be represented as a vector by lexicographically ordering the pixel values. However, this representation spoils the spatial correlation of nearby pixels. Let us then consider that we no longer have input vectors, $\mathbf{x}_n$, but input matrices (we will generalize later to tensors). This method learns a dictionary of $K$ SVD-like bases $(W_i, U_i)$. Each input matrix, $X_n$, is sparsely represented in the $i$-th SVD-like basis, $X_n \approx W_i D_n U_i$. Sparsity is measured through the number of nonzero components of the representation, $\|D_n\|_0$. Finally, a membership matrix, $u_{in} \in [0,1]$, specifies between 0 (no membership) and 1 (full membership) whether the input matrix $X_n$ is represented with the $i$-th basis or not. In this way, the goal of this algorithm is the minimization of the representation error given by $J_{SparseTensorSVD} = \sum_{n=1}^{N} \sum_{i=1}^{K} u_{in} \|X_n - W_i D_n U_i\|_F^2$. This functional is minimized with respect to the SVD-like bases, the membership function and the sparse representations. The optimization is constrained by the orthogonality of the bases ($W_i^tW_i = I$, $U_iU_i^t = I$, for all $i$), the sparsity constraints ($\|D_n\|_0 \le t$, where $t$ is a user-defined threshold), and the requirement that the columns of the membership matrix define a probability distribution ($\sum_{i=1}^{K} u_{in} = 1$). The generalization of this approach to tensors is straightforward, simply replacing the objective function by $J_{SparseTensorSVD} = \sum_{n=1}^{N} \sum_{i=1}^{K} u_{in} \left\| X_n - U_n \times_{j=1}^{d} \mathbf{w}_i^j \right\|_F^2$ subject to the same orthogonality, sparsity and membership constraints.
4. Methods based on projections
A different family of algorithms poses the dimensionality reduction problem as one of projecting the original data onto a subspace with some interesting properties.
4.1 Projection onto interesting directions

Projection pursuit defines the output subspace by looking for "interesting" directions. What is interesting depends on the specific problem, but usually directions in which the projected values are non-Gaussian are considered to be interesting. Projection pursuit looks for directions maximizing the kurtosis of the projected values as a measure of non-Gaussianity. This algorithm was visited during our review of Independent Component Analysis and presented as a special case of that family of techniques.
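The criterion can be illustrated with a naive search over a handful of candidate directions. This is only a sketch of the idea, not an actual projection pursuit optimizer: it computes the excess kurtosis (zero for a Gaussian) of the projected values and keeps the direction with the largest value. The data and the candidate set are made up.

```python
import math


def excess_kurtosis(values):
    """Fourth standardized moment minus 3, which is zero for Gaussian data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / var ** 2 - 3.0


def project(data, direction):
    """Projection of every observation onto the (normalized) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return [sum(x * d for x, d in zip(row, direction)) / norm for row in data]


first = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]   # two-valued coordinate, flat-topped
second = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, -3.0]     # mostly zero with two outliers, heavy-tailed
data = list(zip(first, second))

candidates = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
best = max(candidates, key=lambda d: excess_kurtosis(project(data, d)))
print(best, excess_kurtosis(project(data, best)))
```

A real implementation would optimize the direction continuously (for instance by gradient ascent) instead of enumerating candidates.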
All the techniques presented so far are relatively costly in computational terms. Depending on the application, it might be enough to reduce the dimensionality without optimizing any goal function but in a very fast way. Most techniques project the observations $\mathbf{x}$ onto the subspace spanned by a set of orthogonal vectors. However, choosing the best (in some sense) orthogonal vectors is what is computationally costly, while the projection itself is rather quick. In certain application domains some preconceived directions are known. This is the case of the Discrete Cosine Transform (DCT) used in the JPEG image standard [Watson1993]. The cosine vectors usually yield good reduction results with low representation error for signals and images. Many other transform-based compression methods, like wavelets, also fall under this category. Alternatively, random mapping [Kaski1998, Dasgupta2000, Bingham2001] solves this problem by choosing zero-mean random vectors as the "interesting" directions onto which to project the original observations (this amounts to simply taking a random matrix $W$ whose columns are normalized to have unit norm). Random vectors are orthogonal in theory ($E\{\langle \mathbf{w}_i, \mathbf{w}_j \rangle\} = 0$), and nearly orthogonal in practice. Therefore, the dot product between any pair of observations is nearly conserved in the feature space. This is a rather interesting property since in many applications the similarity between two observations is computed through the dot product of the corresponding vectors. In this way, these similarities are nearly conserved in the feature space but at a computational cost that is only a small fraction of the cost of most dimensionality reduction techniques.
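A minimal sketch of random mapping follows. It draws zero-mean Gaussian directions, normalizes them to unit norm and compares the dot product of two vectors before and after projection. One detail is an assumption of this sketch rather than something stated above: for unit-norm random directions in dimension $M$, $E\{WW^t\} = \frac{m}{M} I$, so the projected dot product is rescaled by $M/m$ before the comparison. The dimensions and the seed are arbitrary.

```python
import math
import random


def random_mapping(M, m, rng):
    """m random unit-norm directions in R^M (the columns of W), drawn from a zero-mean Gaussian."""
    columns = []
    for _ in range(m):
        w = [rng.gauss(0.0, 1.0) for _ in range(M)]
        norm = math.sqrt(sum(v * v for v in w))
        columns.append([v / norm for v in w])
    return columns


def project(x, columns):
    """Feature vector W^t x: one dot product per random direction."""
    return [sum(a * b for a, b in zip(w, x)) for w in columns]


rng = random.Random(0)
M, m = 400, 200
W = random_mapping(M, m, rng)

x1 = [rng.random() for _ in range(M)]
x2 = [rng.random() for _ in range(M)]
y1, y2 = project(x1, W), project(x2, W)

dot_original = sum(a * b for a, b in zip(x1, x2))
dot_projected = (M / m) * sum(a * b for a, b in zip(y1, y2))
print(dot_original, dot_projected)
```

The two printed values agree only approximately, and the agreement improves as $m$ grows. No eigenvalue or optimization problem is solved, which is where the computational savings come from.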
4.2 Projection onto manifolds

Instead of projecting the input data onto an "intelligent" set of directions, we might look for a manifold close to the data, project onto that manifold and unfold the manifold for representation. This family of algorithms is represented by a number of methods like Sammon projection, Multidimensional Scaling, Isomap, Laplacian eigenmaps, and Locally linear embedding. All these algorithms can be shown to be special cases of Kernel PCA under certain circumstances [Williams2002, Bengio2004]. The reader is also referred to [Cayton2005] for an excellent review of manifold learning.
We already saw MDS as a technique preserving the inner product of the input vectors. Alternatively, it can also be seen as a technique preserving distances in the original space. From this point of view, Sammon projection [Sammon1969] and MDS [Kruskal1986] look for a projected set of points, $\boldsymbol{\chi}_n$, such that distances in the output subspace are as close as possible to the distances in the original space. Let $D_{n_1n_2} = d(\mathbf{x}_{n_1}, \mathbf{x}_{n_2})$ be the distance between any pair of observations in the original space (for an extensive review of distances see [Ramanan2011]). Let $d_{n_1n_2} = d(\boldsymbol{\chi}_{n_1}, \boldsymbol{\chi}_{n_2})$ be the distance between their feature vectors, and $\hat{D}_{n_1n_2} = d(\hat{\mathbf{x}}_{n_1}, \hat{\mathbf{x}}_{n_2})$ the distance between their approximations after dimensionality reduction. Classical MDS looks for the feature vectors minimizing $\sum_{n_1,n_2} \left(D_{n_1n_2} - \hat{D}_{n_1n_2}\right)^2$, while Sammon projection minimizes $\sum_{n_1,n_2} w_{n_1n_2} \left(D_{n_1n_2} - d_{n_1n_2}\right)^2$ with $w_{n_1n_2} = \frac{1}{D_{n_1n_2} \sum_{n_1',n_2'} D_{n_1'n_2'}}$ (this goal function is called the stress function and it is defined between 0 and 1) (see Fig. 16).
Classical MDS solves the problem in a single step involving the computation of the eigenvalues of a Gram matrix built from the distances in the original space. In fact, it can be proved that classical MDS is equivalent to PCA [Williams2002]. Sammon projection uses a gradient descent algorithm and can be shown [Williams2002] to be equivalent to Kernel PCA. Metric MDS modifies the $D_{n_1n_2}$ distances by an increasing, monotonic nonlinear function $f(D_{n_1n_2})$. This modification can be part of the algorithm itself (some proposals are [Bengio2004] $f(D_{n_1n_2}) = \sqrt{D_{n_1n_2}}$ and $f(D_{n_1n_2}) = -\frac{1}{2}\left(D_{n_1n_2} - \bar{D}_{n_1} - \bar{D}_{n_2} + \bar{D}\right)$, where $\bar{D}_{n_1}$ and $\bar{D}_{n_2}$ denote the average distance of observations $n_1$ and $n_2$ to the rest of the observations, and $\bar{D}$ is the average distance between all observations) or part of the data collection process (e.g., distances have not been directly observed but have been asked of a number of human observers). If the distances $D_{n_1n_2}$ are measured as geodesic distances (the geodesic distance between two points in a manifold is the one measured along the manifold itself; in practical terms it is computed as the shortest path in a neighborhood graph connecting each observation to its K nearest neighbors, see Fig. 17), then the MDS method is called Isomap [Tenenbaum2000].
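The geodesic distances used by Isomap can be approximated as described above: build a K-nearest-neighbor graph and compute shortest paths on it. The sketch below is a simplified version that uses the Floyd-Warshall algorithm and assumes the neighborhood graph is connected. It is applied to points sampled along a half circle, for which the geodesic distance between the two ends approaches $\pi$ while the Euclidean distance is 2.

```python
import math


def geodesic_distances(points, k):
    """Shortest-path distances on the symmetrized k-nearest-neighbor graph."""
    n = len(points)
    euclid = [[math.dist(p, q) for q in points] for p in points]
    inf = float("inf")
    geo = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Position 0 of the sorted list is the point itself (distance 0), so it is skipped.
        neighbors = sorted(range(n), key=lambda j: euclid[i][j])[1:k + 1]
        for j in neighbors:
            geo[i][j] = geo[j][i] = euclid[i][j]
    # Floyd-Warshall: allow paths through every intermediate node in turn.
    for via in range(n):
        for i in range(n):
            for j in range(n):
                if geo[i][via] + geo[via][j] < geo[i][j]:
                    geo[i][j] = geo[i][via] + geo[via][j]
    return euclid, geo


# Eleven points along a half circle of radius 1.
points = [(math.cos(math.pi * t / 10), math.sin(math.pi * t / 10)) for t in range(11)]
euclid, geo = geodesic_distances(points, k=2)
print(euclid[0][10], geo[0][10])
```

The resulting matrix `geo` would then replace the Euclidean distance matrix as the input of MDS.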
Figure 16. Distances in the original space are mapped onto the lower-dimensional space trying to find projection points that keep the set of distances as faithful as possible.
Figure 17. Geodesic versus Euclidean distance. The geodesic distance between two points is the length of the path belonging to a given manifold that joins the two points, while the Euclidean distance is the length of the linear path joining the two points.
Laplacian eigenmaps [Belkin2001, Belkin2002] start from an adjacency graph similar to that of Isomap for the computation of the geodesic distances. The neighbor similarity graph $G$ is calculated as was done in GNMF. Then, the generalized eigenvalues and eigenvectors of the Laplacian of the graph $G$ are computed, i.e., we solve the problem $(D - G)\mathbf{w} = \lambda D\mathbf{w}$. Finally, we keep the eigenvectors associated with the $m$ smallest eigenvalues, after discarding the smallest one, which is always 0. The dimensionality reduction is performed by $\boldsymbol{\chi}_n = (w_{1n}, w_{2n}, \dots, w_{mn})$, i.e., by keeping the $n$-th component of the $m$ eigenvectors. The interesting property of the Laplacian eigenmap is that the cost function, which measures the distance among the projected features, can be expressed in terms of the graph Laplacian: $J_{LE} = \frac{1}{2}\sum_{i,j} G_{ij}\|\boldsymbol{\chi}_i - \boldsymbol{\chi}_j\|^2 = \sum_{i,j} L_{ij}\,\boldsymbol{\chi}_i^t\boldsymbol{\chi}_j = \mathrm{Tr}(ULU^t)$. So the goal is to minimize $J_{LE}$ subject to $UDU^t = I$ [Zhang2009]. Finally, it is worth mentioning that Laplacian Eigenmaps and Principal Component Analysis (and their
kernel versions) have been found to be particular cases of a more general problem called Least Squares Weighted Kernel Reduced Rank Regression (LS-WKRRR) [Delatorre2012] (in fact, this framework also generalizes Canonical Correlation Analysis, Linear Discriminant Analysis and Spectral Clustering, techniques that are out of the scope of this review). The objective is to minimize $J_{LS-WKRRR} = \left\| W_{\chi} \left(\Gamma - BA^t\Phi_x\right) W_x \right\|_F^2$ subject to $\mathrm{rank}(BA^t) = m$. $W_{\chi}$ is a diagonal weight matrix for the feature points, $W_x$ is a diagonal weight matrix for the input data points, and $\Phi_x$ is a matrix with the expanded-dimensionality representation (kernel algorithms) of the input data. The objective function is minimized with respect to the $A$ and $B$ matrices (they are considered to be regression matrices, and decoupling the transformation $BA^t$ into two matrices allows the generalization of techniques like Canonical Correlation Analysis). The rank constraint is set to promote robustness of the solution towards a rank-deficient $\Phi_x$ matrix.
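Returning to the eigenvalue problem $(D - G)\mathbf{w} = \lambda D\mathbf{w}$ of Laplacian eigenmaps, the reason why the smallest eigenvalue is always 0 and must be discarded can be verified directly: the rows of $L = D - G$ sum to zero, so the constant vector is always a solution with $\lambda = 0$ and carries no information about the data. The small graph below is a made-up example.

```python
def laplacian(G):
    """L = D - G, where D is diagonal and holds the row sums of G."""
    n = len(G)
    return [[(sum(G[i]) if i == j else 0.0) - G[i][j] for j in range(n)] for i in range(n)]


def matvec(A, v):
    """Matrix-vector product for matrices stored as lists of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


# Toy neighborhood graph with unit weights: a triangle (nodes 0, 1, 2) plus node 3 attached to node 2.
G = [[0.0, 1.0, 1.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [1.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
L = laplacian(G)

ones = [1.0, 1.0, 1.0, 1.0]
print(matvec(L, ones))             # all zeros: (D - G) 1 = 0 = 0 * D 1

w = [1.0, 0.0, 0.0, -1.0]          # a non-constant candidate
quadratic_form = sum(a * b for a, b in zip(w, matvec(L, w)))
print(quadratic_form)              # positive: w changes across three edges
```

Any non-constant vector pays a positive cost for every edge across which it changes, which is why the eigenvectors of the next smallest eigenvalues vary slowly over the graph.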
Hessian eigenmaps [Donoho2003] work with the Hessian of the graph instead of its Laplacian. By doing so, they extend ISOMAP and Laplacian eigenmaps, and they remove the need to map the input data onto a convex subset of $\mathbb{R}^m$.
Locally linear embedding (LLE) [Roweis2000, Saul2003] is another technique used to learn manifolds close to the data and project the data onto them. For each observation $\mathbf{x}_n$ we look for its K nearest neighbors (noted as $\mathbf{x}_{n'}$) and produce a set of weights for its approximation minimizing $\left\| \mathbf{x}_n - \sum_{n'} \omega_{nn'}\mathbf{x}_{n'} \right\|^2$ (see Fig. 18). This optimization is performed simultaneously for all observations, i.e., $J_{LLE} = \sum_n \left\| \mathbf{x}_n - \sum_{n'} \omega_{nn'}\mathbf{x}_{n'} \right\|^2 = \sum_n \|\mathbf{x}_n - X\boldsymbol{\omega}_n\|^2$, where $\boldsymbol{\omega}_n$ is a weight vector to be determined and whose value is nonzero only for the neighbors of $\mathbf{x}_n$. We can write this even more compactly by stacking all weight vectors as columns of a matrix $\Omega$; then $J_{LLE} = \|X - X\Omega\|_F^2$. The optimization is constrained so that the sum of the components of $\boldsymbol{\omega}_n$ is 1 for all $n$. Once the weights have been determined, we look for lower-dimensional points, $\boldsymbol{\chi}_n$, such that $\boldsymbol{\chi}_n \approx \sum_{n'} \omega_{nn'}\boldsymbol{\chi}_{n'}$, that is, the new points have to be reconstructed from their neighbors in the same way (with the same weights) as the observations they represent. This latter problem is solved through an eigenvalue problem, again keeping the smallest eigenvalues. See Fig. 19 for a comparison of the results of LLE, Hessian LLE and ISOMAP in a particular case.
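The weight estimation step has a closed-form solution for each observation, which follows from the sum-to-one constraint. The derivation below is a sketch in which the local Gram matrix $C^{(n)}$ and the Lagrange multiplier $\eta$ are symbols introduced here for illustration; it assumes that $C^{(n)}$ is invertible (when K is larger than the input dimensionality it is singular and is usually regularized by adding a small multiple of the identity).

```latex
\begin{align*}
\mathbf{x}_n - \sum_{n'} \omega_{nn'} \mathbf{x}_{n'}
  &= \sum_{n'} \omega_{nn'} (\mathbf{x}_n - \mathbf{x}_{n'})
  && \text{since } \textstyle\sum_{n'} \omega_{nn'} = 1 \\
\Big\| \mathbf{x}_n - \sum_{n'} \omega_{nn'} \mathbf{x}_{n'} \Big\|^2
  &= \boldsymbol{\omega}_n^t C^{(n)} \boldsymbol{\omega}_n,
  \qquad C^{(n)}_{n'n''} = (\mathbf{x}_n - \mathbf{x}_{n'})^t (\mathbf{x}_n - \mathbf{x}_{n''}) \\
\nabla_{\boldsymbol{\omega}_n}
  \left[ \boldsymbol{\omega}_n^t C^{(n)} \boldsymbol{\omega}_n
         - \eta\, (\mathbf{1}^t \boldsymbol{\omega}_n - 1) \right] = 0
  &\;\Rightarrow\;
  \boldsymbol{\omega}_n = \frac{\big(C^{(n)}\big)^{-1} \mathbf{1}}
                               {\mathbf{1}^t \big(C^{(n)}\big)^{-1} \mathbf{1}}
\end{align*}
```

Here $\boldsymbol{\omega}_n$ is restricted to the K neighbors of $\mathbf{x}_n$; all other weights are zero.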
Figure 18. Schematic representation of the transformations involved in LTSA.

Figure 19. Comparison of the results of LLE, Hessian LLE and ISOMAP for the Swiss roll input data.
Local Tangent Space Alignment (LTSA) [Zhang2004, Zhang2012] is another technique locally learning the manifold structure. As in LLE, we look for the local neighbors of a point. However, we now compute the local coordinates, $\boldsymbol{\theta}_n$, of all the input points in the PCA subspace associated to this neighborhood. Next, we need to align all local coordinates. For each input point we compute the reconstruction error from its coordinates in the different neighborhoods in which it participates, $J_{LTSA_n} = \sum_{n'} \left\| \boldsymbol{\chi}_n - (\mathbf{c}_{n'} + L_{n'}\boldsymbol{\theta}_n) \right\|^2$ ($\mathbf{c}_{n'}$ is a vector and $L_{n'}$ is a matrix, both to be determined for all neighborhoods; we may think of them as translation and shape parameters that properly locate the different neighborhoods in a common geometrical framework). The objective function of LTSA is $J_{LTSA} = \sum_{n=1}^{N} J_{LTSA_n}$, which has to be optimized with respect to $\boldsymbol{\chi}_n$, $\mathbf{c}_{n'}$ and $L_{n'}$ (see Fig. 19).
One of the problems of the previous techniques (ISOMAP, Laplacian eigenmaps, Locally Linear Embedding, and Local Tangent Space Alignment) is that they are only defined in a neighborhood of the training data, and they normally extrapolate very poorly. One of the reasons is that the mapping is not explicit, but implicit. Locality Preserving Projections (LPP) [He2004] tries to tackle this issue by constraining the projections to be a linear projection of the input vectors, $\boldsymbol{\chi}_n = A^t\mathbf{x}_n$. The goal function is the same as in Laplacian eigenmaps. The $A$ matrix is of size $M \times m$ and it is the parameter with respect to which the $J_{LE}$ objective function is minimized. An orthogonal version has been proposed [Kokiopoulou2007], with the constraint $A^tA = I$. A Kernel version of LPP also exists [He2004]. As in all kernel methods, the idea is to map the input vectors $\mathbf{x}_n$ onto a higher-dimensional space with a nonlinear function, so that the linear constraint imposed by $\boldsymbol{\chi}_n = A^t\mathbf{x}_n$ becomes a nonlinear projection. A variation of this technique is called Neighborhood Preserving Embedding (NPE) [He2005], where $J_{NPE} = \sum_{i,j} M_{ij}\,\boldsymbol{\chi}_i^t\boldsymbol{\chi}_j$, with $M = (I - G)^t(I - G)$. This matrix approximates the squared Laplacian of the weight graph $G$ and it offers numerical advantages over the LPP algorithm. Orthogonal Neighborhood Preserving Projections (ONPP) [Kokiopoulou2007] extends the linear projection idea, $\boldsymbol{\chi}_n = A^t\mathbf{x}_n$, to the LLE algorithm.
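Substituting the linear constraint into the Laplacian eigenmaps goal function shows why LPP yields an explicit mapping. The derivation below is a sketch that assumes the observations $\mathbf{x}_n$ are stored as the columns of the $M \times N$ matrix $X$, so that $U = A^tX$, and it uses the Laplacian eigenmaps cost in its trace form, $\mathrm{Tr}(ULU^t)$, with the constraint $UDU^t = I$.

```latex
\begin{align*}
J_{LE} &= \mathrm{Tr}\left(U L U^t\right)
        = \mathrm{Tr}\left(A^t X L X^t A\right),
  \qquad \text{subject to } A^t X D X^t A = I \\
\Rightarrow\quad X L X^t \mathbf{a} &= \lambda\, X D X^t \mathbf{a}
\end{align*}
```

The columns $\mathbf{a}$ of $A$ are the generalized eigenvectors associated with the $m$ smallest eigenvalues. A new observation $\mathbf{x}$, not present in the training set, is then mapped simply as $A^t\mathbf{x}$, which is the out-of-sample extension that the implicit methods lack.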
The idea of constructing linear projections for the dimensionality reduction can be applied locally, instead of globally (as in LPP, OLPP, NPE and ONPP). This has been proposed by [Wang2011]. The manifold is divided into areas in which it can be well approximated by a linear subspace (a manifold is locally similar to a linear subspace if the geodesic distance between points is similar to the Euclidean distance among those points). The division is a disjoint partition of the input data points $\mathbf{x}_n$ such that the number of parts is minimized and each local linear subspace is as large as possible. Within each partition a linear PCA model is adjusted. Finally, all models are aligned following an alignment procedure similar to that of LTSA.
5. Trends and Conclusions

We have analyzed the number of citations that the most relevant papers in each section have received in the last decade (2003-2012). In Table I we show the number of citations summarized by large areas as well as their share (%) for the different years. From this table we can draw several conclusions:
- The interest in the field has grown by a factor of 3 in the last decade, as shown by the absolute number of citations.
- By far, the most applied techniques are those based on the search for components in their different variants (ICA, PCA, FA, MDS, ...), although the tendency in the last decade is for them to lose importance in favor of those techniques using projections (especially, projections onto manifolds) or dictionaries. This is a response to the nonlinear nature of experimental data in most fields.
- Dimensionality reduction techniques based on projections and dictionaries have grown very fast in the last decade, both in the number of new methods and in the application of those methods to real problems.
- Interestingly, old methods based on vector quantization keep a nearly constant market share, meaning that they are very well suited to a specific kind of problem. However, those methods that tried to preserve the input data topology while doing the vector quantization have lost impact, mostly because of the appearance of new methods capable of analyzing manifolds.
We can further subdivide these large areas into smaller subareas. Table II shows the subareas sorted by total number of citations. After analyzing this table we draw the following conclusions:
- The analysis on manifolds is the clear winner of the decade. The reason is its ability to analyze nonlinearities and its capability of adapting to the local structure of the data. Among the different techniques, ISOMAP, Locally Linear Embedding, and Laplacian Eigenmaps are the most successful. This increase has been at the cost of the nonlinear PCA versions (principal curves, principal surfaces and principal manifolds) and the Self-Organizing Maps, since the new techniques can explore nonlinear relationships in a much richer way.
- PCA in its different versions (standard PCA, robust PCA, sparse PCA, kernel PCA, ...) is still one of the preferred techniques due to its simplicity and intuitiveness. The increase in the use of PCA contrasts with the decrease in the use of Factor Analysis, which is more constrained in its modeling capabilities.
- Independent Component Analysis reached its boom in the mid 2000s, but now it is declining. It will probably remain in a niche of applications related to signal processing, for which it is particularly well suited, but it might not stand as a general-purpose technique. It is possible that this decrease also reflects a diversification of the techniques falling under the umbrella of ICA.
- Nonnegative Matrix Factorization has experienced an important rise, probably because of its ability to produce more interpretable bases and because it is well suited to many situations in which the sum of positive factors is the natural way of modeling the problem.
- The rest of the techniques have kept their market share. This is most likely explained by the fact that they have their own niche of applications, to which they are very well suited.
Overall, we can say that dimensionality reduction techniques are being applied in many scientific areas ranging from biomedical research to text mining and computer science. In this review we have covered different families of methodologies, each of them based on different criteria but all pursuing the same goal: to reduce the complexity of the data structure while at the same time delivering a more understandable representation of the same information. The field is still very active and ever more powerful methods are continuously appearing, providing an excellent application test bed for applied mathematicians.
Acknowledgements
This work was supported by the Spanish Ministry of Science and Innovation (BIO201017527) and Madrid government grant (P2010/BMD2305). C.O.S. Sorzano was also supported by the Ramón y Cajal program.
Bibliography
1. [Abdi2003] Abdi, H. Lewis-Beck, M.; Bryman, A. & Futing, T. (ed.) Encyclopedia for research methods for the social sciences Factor rotations in factor analyses Sage, 2003, 792-795
2. [Abdi2007] Abdi, H. Salkind, N. (ed.) Encyclopedia of measurements and statistics Singular value decomposition (SVD) and Generalized Singular Value Decomposition (GSVD) Sage Publications, 2007, 907-912
3. [Aharon2006] Aharon, M.; Elad, M. & Bruckstein, A. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation IEEE Trans. Signal Processing, 2006, 54, 4311-4322
4. [Arora1998] Arora, S.; Raghavan, P. & Rao, S. Approximation schemes for Euclidean k-medians and related problems Proc. 30th Annual ACM symposium on Theory of computing (STOC), 1998
5. [Artac2002] Artac, M.; Jogan, M. & Leonardis, A. Incremental PCA for On-Line Visual Learning and Recognition Proc. 16th International Conference on Pattern Recognition (ICPR), 2002, 3, 30781
6. [Baccini1996] Baccini, A.; Besse, P. & Falguerolles, A. L1-norm PCA and a heuristic approach. Ordinal and Symbolic Data Analysis. Diday, E.; Lechevalier, Y. & Opitz, O. (Eds.) Springer, 1996, 359-368
7. [Bailey1994] Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers Univ. California San Diego, 1994. Tech. Report CS94-351.
8. [Baldi1989] Baldi, P. & Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima Neural Networks, 1989, 2, 53-58
9. [Baraniuk2010] Baraniuk, R. G.; Cevher, V. & Wakin, M. B. Low-dimensional models for dimensionality reduction and signal recovery: a geometric perspective Proc. IEEE, 2012, 98, 959-971
10. [Batmanghelich2012] Batmanghelich, N. K.; Taskar, B. & Davatzikos, C. Generative-discriminative basis learning for medical imaging. IEEE Trans. Medical Imaging, 2012, 31, 51-69
11. [Belkin2001] Belkin, M. & Niyogi, P. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering Advances in Neural Information Processing Systems, 2001, 14, 585-591
12. [Belkin2002] Belkin, M. & Niyogi, P. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation Neural Computation, 2002, 15, 1373-1396
13. [Bengio2004] Bengio, Y.; Delalleau, O.; Le Roux, N.; Paiement, J. F.; Vincent, P. & Ouimet, M. Learning eigenfunctions links spectral embedding and Kernel PCA Neural Computation, 2004, 16, 2197-2219
14. [Bengio2006] Bengio, Y.; Delalleau, O.; Le Roux, N.; Paiement, J. F.; Vincent, P. & Ouimet, M. Spectral Dimensionality Reduction. Studies in Fuzziness and Soft Computing: Feature Extraction Springer, 2006, 519-550
15. [Bezdek1981] Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms Plenum, 1981
16. [Bezdek1984] Bezdek, J. C.; Ehrlich, R. & Full, W. FCM: The fuzzy c-means clustering algorithm Computers & Geosciences, 1984, 10, 191-203
17. [Bian2011] Bian, W. & Tao, D. Max-min distance analysis by using sequential SDP relaxation for dimension reduction. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1037-1050
18. [Bingham2001] Bingham, E. & Mannila, H. Random projection in dimensionality reduction: applications to image and text data Proc. ACM Intl. Conf. Knowledge discovery and data mining, 2001
19. [Bishop1998] Bishop, C. M.; Svensén, M. & Williams, C. K. I. GTM: The Generative Topographic Mapping Neural Computation, 1998, 10, 215-234
20. [Blumensath2008] Blumensath, T. & Davies, M. E. Gradient pursuits IEEE Trans. Signal Processing, 2008, 56, 2370-2382
21. [Borg2005] Borg, I. & Groenen, P. F. Modern Multidimensional Scaling Springer, 2005
22. [Bro1997] Bro, R. PARAFAC. Tutorial and applications Chemometrics and intelligent laboratory systems, 1997, 38, 149-171
23. [Bruckstein2009] Bruckstein, A. M.; Donoho, D. L. & Elad, M. From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images SIAM Review, 2009, 51, 34-81
24. [Cai2011] Cai, H.; Mikolajczyk, K. & Matas, J. Learning linear discriminant projections for dimensionality reduction of image descriptors. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 338-352
25. [Cai2011b] Cai, D.; He, X.; Han, J. & Huang, T. S. Graph Regularized Non-Negative Matrix Factorization for Data Representation. IEEE Trans. Pattern Analysis & Machine Intelligence, 2010, 33, 1548-1560
26. [Carreira1997] Carreira-Perpiñán, M. A. A review of dimension reduction techniques Dept. Computer Science, Univ. Sheffield, 1997
27. [Cayton2005] Cayton, L. Algorithms for manifold learning University of California, San Diego, Tech. Rep. CS2008-0923, 2005
28. [Chen1991] Chen, S.; Cowan, C. F. N. & Grant, P. M. Orthogonal least squares learning algorithm for radial basis function networks IEEE Trans.
NeuralNetworks,1991,2,302309
29. [Chen1994]Chen,S.S.&Donoho,D.L.BasispursuitProc.IEEEConf.Signals,SystemsandComputers,1994
30. [Chen2001]Chen,S.S.;Donoho,D.L.&Saunders,M.A.AtomicdecompositionbybasispursuitSIAMReview,2001,43,129159
31. [Cichocki2009]Cichocki,A.;Zdunek,R.;Phan,A.H.&Amari,S.NonnegativematrixandtensorfactorizationsWiley,2009
32. [Common1994]Common,P.IndependentComponentAnalysis,anewconcept?SignalProcessing,36,287314(1994)
33. [Costa2004]Costa,J.A.&Hero,A.O.I.GeodesicEntropicGraphsforDimensionandEntropyEstimationinManifoldLearningIEEETrans.Signal
Processing,2004,52,22102221
34. [Cox2000]Cox,T.F.&Cox,M.A.A.MultidimensionalScalingChapman&Hall,2000
35. [Crawford1970]Crawford,C.B.&Ferguson,G.A.AgeneralrotationcriterionanditsuseinorthogonalrotationPsychometrika,1970,35,321332
36. [Dasgupta2000]Dasgupta,S.ExperimentswithrandomprojectionProc.Conf.Uncertaintyinartificialintelligence,2000
37. [Dash1997]Dash,M.&Liu,H.FeatureselectionforclassificationIntelligentDataAnalysis,1997,1,131156
38. [Delatorre2003]DelaTorre,F.&Black,M.J.AframeworkforrobustsubspacelearningIntl.J.ComputerVision,2003,54,117142
39. [Delatorre2012]DelaTorre,F.AleastsquaresframeworkforComponentAnalysis.IEEETrans.PatternAnalysis&MachineIntelligence,2012,34,
10411055
40. [Delicado2001]Delicado,P.AnotherlookatprincipalcurvesandsurfacesJ.MultivariateAnalysis,2001,77,84116
41. [DeMers1993]DeMers,D.&Cottrell,G.NonlineardimensionalityreductionAdvancesinNeuralInformationProcessingSystems,1993,5,580587
42. [Dhillon2004]Dhillon,I.S.;Guan,Y.&Kulis,B.Kernelkmeans:spectralclusteringandnormalizedcutsProc.ACMSIGKDDIntl.Conf.onKnowledge
discoveryanddatamining,2004,551554
43. [Ding2006] Ding, C.; Zhou, D.; He, X. & Zha, H. R1-PCA: Rotational Invariant L1-norm Principal Component Analysis for Robust Subspace Factorization. Proc. Intl. Workshop Machine Learning, 2006
44. [Donoho2003] Donoho, D. L. & Grimes, C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA, 2003, 100, 5591-5596
45. [Dunteman1989] Dunteman, G. H. Principal Component Analysis. Sage Publications, 1989
46. [Einbeck2005] Einbeck, J.; Tutz, G. & Evers, L. Local principal curves. Statistics and Computing, 2005, 15, 301-313
47. [Einbeck2008] Einbeck, J.; Evers, L. & Bailer-Jones, C. Representing Complex Data Using Localized Principal Components with Application to Astronomical Data. Lecture Notes in Computational Science and Engineering, 2008, 58, 178-201
48. [Engan2000] Engan, K.; Aase, S. O. & Husoy, J. H. Multi-frame compression: theory and design. EURASIP Signal Processing, 2000, 80, 2121-2140
49. [Feng2002] Feng, T.; Li, S. Z.; Shum, H. Y. & Zhang, H. Local nonnegative matrix factorization as a visual representation. Proc. 2nd Intl. Conf. Development and Learning (ICDL), 2002
50. [Fisher1936] Fisher, R. A. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 1936, 7, 179-188
51. [Fodor2002] Fodor, I. K. A survey of dimension reduction techniques. Lawrence Livermore Natl. Laboratory, 2002
52. [Friedman1974] Friedman, J. H. & Tukey, J. W. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Trans. Computers, 1974, C-23, 881-890
53. [Friedman1987] Friedman, J. H. Exploratory projection pursuit. J. American Statistical Association, 1987, 82, 249-266
54. [Fritzke1995] Fritzke, B. A growing neural gas network learns topologies. Advances in Neural Information Processing Systems, Tesauro, G.; Touretzky, D. & Leen, T. K. (Eds.) MIT Press, 1995, 625-632
55. [Fukunaga1971] Fukunaga, K. & Olsen, D. R. An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Computers, 1971, C-20, 176-183
56. [Gersho1992] Gersho, A. & Gray, R. M. Vector quantization and signal compression. Kluwer Academic Publishers, 1992
57. [Girolami1997] Girolami, M. & Fyfe, C. Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition. IEE Proc. Vision, Image and Signal Processing Journal, 1997, 14, 299-306
58. [Girolami1997b] Girolami, M. & Fyfe, C. ICA Contrast Maximisation Using Oja's Nonlinear PCA Algorithm. Intl. J. Neural Systems, 1997, 8, 661-678
59. [Girolami2002] Girolami, M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Networks, 2002, 13, 780-784
60. [Golub1970] Golub, G. H. & Reinsch, C. Singular value decomposition and least squares solutions. Numerische Mathematik, 1970, 14, 403-420
61. [Gorban2004] Gorban, A. N.; Karlin, I. V. & Zinovyev, A. Y. Constructive methods of invariant manifolds for kinetic problems. Physics Reports, 2004, 396, 197-403
62. [Gorban2007] Gorban, A. N.; Kégl, B.; Wunsch, D. C. & Zinovyev, A. Principal Manifolds for Data Visualization and Dimension Reduction. Springer, 2007
63. [Gorodnitsky1997] Gorodnitsky, I. F. & Rao, B. D. Sparse signal reconstruction from limited data using FOCUSS: reweighted minimum norm algorithm. IEEE Trans. Signal Processing, 1997, 3, 600-616
64. [Graps1995] Graps, A. An introduction to wavelets. IEEE Computational Science & Engineering, 1995, 2, 50-61
65. [Gray1984] Gray, R. M. Vector quantization. IEEE Acoustics, Speech and Signal Processing Magazine, 1984, 1, 4-29
66. [Gurumoorthy2010] Gurumoorthy, K. S.; Rajwade, A.; Banerjee, A. & Rangarajan, A. A method for compact image representation using sparse matrix and tensor projections onto exemplar orthonormal bases. IEEE Trans. Image Processing, 2010, 19, 322-334
67. [Guyon2003] Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Machine Learning Research, 2003, 3, 1157-1182
68. [Harman1976] Harman, H. H. Modern Factor Analysis. Univ. Chicago Press, 1976
69. [Harshman1970] Harshman, R. A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 1970, 16, 1-84
70. [Hartigan1979] Hartigan, J. A. & Wong, M. A. Algorithm AS 136: A K-Means Clustering Algorithm. J. Royal Statistical Soc. C, 1979, 28, 100-108
71. [Hastie1989] Hastie, T. & Stuetzle, W. Principal curves. J. American Statistical Association, 1989, 84, 502-516
72. [He2004] He, X. & Niyogi, P. Locality Preserving Projections. Advances in Neural Information Processing Systems, Thrun, S.; Saul, L. K. & Schölkopf, B. (Eds.) MIT Press, 2004, 153-160
73. [He2005] He, X.; Cai, D.; Yan, S. & Zhang, H. J. Neighborhood Preserving Embedding. Proc. IEEE Intl. Conf. Computer Vision (ICCV), 2005
74. [He2011] He, R.; Hu, B. G.; Zheng, W. S. & Kong, X. W. Robust principal component analysis based on maximum correntropy criterion. IEEE Trans. Image Processing, 2011, 20, 1485-1494
75. [Hoyer2002] Hoyer, P. O. Nonnegative sparse coding. Proc. IEEE Workshop Neural Networks for Signal Processing, 2002
76. [Hoyer2004] Hoyer, P. O. Nonnegative matrix factorization with sparseness constraints. J. Machine Learning Research, 2004, 5, 1457-1469
77. [Huang1998] Huang, N. E.; Shen, Z.; Long, S. R.; Wu, M. L.; Shih, H. H.; Zheng, Q.; Yen, N. C.; Tung, C. C. & Liu, H. H. The empirical mode decomposition and the Hilbert spectrum for nonlinear and nonstationary time series analysis. Proc. Roy. Soc. London A, 1998, 454, 903-995
78. [Hubert2004] Hubert, M. & Engelen, S. Robust PCA and classification in biosciences. Bioinformatics, 2004, 20, 1728-1736
79. [Hyvarinen1999] Hyvärinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks, 1999, 10, 626-634
80. [Hyvarinen2000] Hyvärinen, A. & Oja, E. Independent Component Analysis: algorithms and applications. Neural Networks, 2000, 13, 411-430
81. [Hyvarinen2001] Hyvärinen, A.; Karhunen, J. & Oja, E. Independent Component Analysis. John Wiley & Sons, Inc., 2001
82. [Iglesias2007] Iglesias, J. E.; de Bruijne, M.; Loog, M.; Lauze, F. & Nielsen, M. A family of principal component analyses for dealing with outliers. Lecture Notes in Computer Science, 2007, 4792, 178-185
83. [Jenssen2010] Jenssen, R. Kernel entropy component analysis. IEEE Trans. Pattern Analysis & Machine Intelligence, 2010, 32, 847-860
84. [Johnson1963] Johnson, R. M. On a theorem stated by Eckart and Young. Psychometrika, 1963, 28, 259-263
85. [Jollife2002] Jolliffe, I. T. Principal Component Analysis. Wiley, 2002
86. [Kaiser1958] Kaiser, H. F. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 1958, 23, 187-200
87. [Kaiser1960] Kaiser, H. F. The application of electronic computers to factor analysis. Educational and Psychological Measurement, 1960, 20, 141-151
88. [Kambhatla1997] Kambhatla, N. & Leen, T. K. Dimension reduction by local Principal Component Analysis. Neural Computation, 1997, 9, 1493-1516
89. [Kaski1998] Kaski, S. Dimensionality reduction by random mapping: fast similarity computation for clustering. Proc. Intl. Joint Conf. Neural Networks (IJCNN), 1998
90. [Ke2005] Ke, Q. & Kanade, T. Robust L1-norm factorization in the presence of outliers and missing data by alternative convex programming. Proc. Comput. Vis. Pattern Recogn. Conf., 2005
91. [Kegl2002] Kégl, B. Intrinsic Dimension Estimation Using Packing Numbers. Advances in Neural Information Processing Systems, 2002
92. [Kim2011] Kim, M. & Pavlovic, V. Central subspace dimensionality reduction using covariance operators. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 657-670
93. [Klema1980] Klema, V. & Laub, A. The singular value decomposition: Its computation and some applications. IEEE Trans. Automatic Control, 1980, 25, 164-176
94. [Kohonen1990] Kohonen, T. The Self-Organizing Map. Proc. IEEE, 1990, 78, 1464-1480
95. [Kohonen1993] Kohonen, T. Things you haven't heard about the self-organizing map. Proc. IEEE Intl. Conf. Neural Networks, 1993
96. [Kohonen2001] Kohonen, T. Self-Organizing Maps. Springer, 2001
97. [Kokiopoulou2007] Kokiopoulou, E. & Saad, Y. Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique. IEEE Trans. Pattern Analysis & Machine Intelligence, 2007, 29, 2143-2156
98. [Kramer1991] Kramer, M. A. Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. AIChE Journal, 1991, 37, 233-243
99. [Kreutz2003] Kreutz-Delgado, K.; Murray, J. F.; Rao, B. D.; Engan, K.; Lee, T. & Sejnowski, T. J. Dictionary learning algorithms for sparse representation. Neural Computation, 2003, 15, 349-396
100. [Kruskal1964a] Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 1964, 29, 1-27
101. [Kruskal1964b] Kruskal, J. B. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 1964, 29, 115-129
102. [Kruskal1986] Kruskal, J. B. & Wish, M. Multidimensional scaling. Sage, 1986
103. [Kwak2008] Kwak, N. Principal component analysis based on L1-norm maximization. IEEE Trans. Pattern Analysis & Machine Intelligence, 2008, 30, 1672-1680
104. [Lawley1971] Lawley, D. N. & Maxwell, A. E. Factor analysis as a statistical method. Butterworths, 1971
105. [Lee1999] Lee, D. D. & Seung, S. Learning the parts of objects by nonnegative matrix factorization. Nature, 1999, 401, 788-791
106. [Lee2000] Lee, T. W.; Girolami, M.; Bell, A. J. & Sejnowski, T. J. A unifying information-theoretic framework for independent component analysis. Computers & Mathematics with Applications, 2000, 39, 1-21
107. [Lee2001] Lee, D. D. & Seung, H. S. Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing Systems, 2001, 556-562
108. [Lewicki2000] Lewicki, M. S. & Sejnowski, T. J. Learning overcomplete representations. Neural Computation, 2000, 12, 337-365
109. [Li2011] Li, X. L.; Adali, T. & Anderson, M. Noncircular Principal Component Analysis and Its Application to Model Selection. IEEE Trans. Signal Processing, 2011, 59, 4516-4528
110. [Lin2011] Lin, Y. Y.; Liu, T. L. & Fuh, C. S. Multiple kernel learning for dimensionality reduction. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1147-1160
111. [Liu2003] Liu, W.; Zheng, N. & Lu, X. Nonnegative matrix factorization for visual coding. Proc. IEEE Intl. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2003
112. [Liu2003b] Liu, Z. Y.; Chiu, K. C. & Xu, L. Improved system for object detection and star/galaxy classification via local subspace analysis. Neural Networks, 2003, 16, 437-451
113. [Mallat1993] Mallat, S. G. & Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 1993, 41, 3397-3415
114. [Martinetz1991] Martinetz, T. & Schulten, K. A "neural-gas" network learns topologies. Artificial Neural Networks, Kohonen, T.; Makisara, K.; Simula, O. & Kangas, J. (Eds.) Elsevier, 1991, 397-402
115. [Martinetz1993] Martinetz, T. M.; Berkovich, S. G. & Schulten, K. J. Neural-gas network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks, 1993, 4, 558-569
116. [Mateen2009] van der Maaten, L.; Postma, E. & van den Herik, J. Dimensionality Reduction: A Comparative Review. Tilburg Centre for Creative Computing, Tilburg Univ., 2009
117. [Mulaik1971] Mulaik, S. A. The foundations of factor analysis. Chapman & Hall, 1971
118. [Paige1981] Paige, C. C. & Saunders, M. A. Towards a Generalized Singular Value Decomposition. SIAM Journal on Numerical Analysis, 1981, 18, 398-405
119. [PascualMarqui2001] Pascual-Marqui, R. D.; Pascual-Montano, A.; Kochi, K. & Carazo, J. M. Smoothly Distributed Fuzzy c-Means: A New Self-Organizing Map. Pattern Recognition, 2001, 34, 2395-2402
120. [PascualMontano2001] Pascual-Montano, A.; Donate, L. E.; Valle, M.; Bárcena, M.; Pascual-Marqui, R. D. & Carazo, J. M. A novel neural network technique for analysis and classification of EM single-particle images. J. Structural Biology, 2001, 133, 233-245
121. [PascualMontano2006] Pascual-Montano, A.; Carazo, J.; Kochi, K.; Lehmann, D. & Pascual-Marqui, R. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans. Pattern Analysis & Machine Intelligence, 2006, 28, 403-415
122. [Pati1993] Pati, Y.; Rezaiifar, R. & Krishnaprasad, P. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. Proc. Conf. Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993, 40-44
123. [Pauca2006] Pauca, V. P.; Piper, J. & Plemmons, R. J. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications, 2006, 416, 29-47
124. [Pearson1901] Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine, 1901, 2, 559-572
125. [Pettis1979] Pettis, K. W.; Bailey, T. A.; Jain, A. K. & Dubes, R. C. An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Pattern Analysis & Machine Intelligence, 1979, 1, 25-37
126. [Pinto2011] Pinto da Costa, J. F.; Alonso, H. & Roque, L. A weighted principal component analysis and its application to gene expression data. IEEE/ACM Trans. Computational Biology and Bioinformatics, 2011, 8, 246-252
127. [Ramanan2011] Ramanan, D. & Baker, S. Local distance functions: a taxonomy, new algorithms, and an evaluation. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 794-806
128. [Rilling2003] Rilling, G.; Flandrin, P. & Goncalves, P. On empirical mode decomposition and its algorithms. Proc. IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, 2003
129. [Rioul1991] Rioul, O. & Vetterli, M. Wavelets and signal processing. IEEE Signal Processing Magazine, 1991, 8, 14-38
130. [Roweis2000] Roweis, S. T. & Saul, L. K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 2000, 290, 2323-2326
131. [Rubinstein2010] Rubinstein, R.; Bruckstein, A. M. & Elad, M. Dictionaries for sparse representation modeling. Proc. IEEE, 2010, 98, 1045-1057
132. [Rubner2000] Rubner, Y.; Tomasi, C. & Guibas, L. J. The Earth Mover's Distance as a Metric for Image Retrieval. Intl. J. Computer Vision, 2000, 40, 99-121
133. [Saeys2007] Saeys, Y.; Inza, I. & Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics, 2007, 23, 2507-2517
134. [Sammon1969] Sammon, J. W. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, 1969, 18, 401-409
135. [Sandler2011] Sandler, R. & Lindenbaum, M. Nonnegative Matrix Factorization with Earth Mover's Distance Metric for Image Analysis. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1590-1602
136. [Saul2003] Saul, L. K. & Roweis, S. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Department of Computer & Information Science, Univ. Pennsylvania, 2003
137. [Schiffman1981] Schiffman, S. S.; Reynolds, M. L. & Young, F. W. Introduction to multidimensional scaling: Theory, methods, and applications. Academic Press, 1981
138. [Scholkopf1997] Schölkopf, B.; Smola, A. & Müller, K. R. Kernel Principal Component Analysis. Proc. of ICANN, 1997, 58358
139. [Scholkopf1999] Schölkopf, B.; Smola, A. & Müller, K. R. Kernel Principal Component Analysis. Advances in Kernel Methods: Support Vector Learning, 1999
140. [Scholz2008] Scholz, M.; Fraunholz, M. & Selbig, J. Nonlinear Principal Component Analysis: Neural Network Models and Applications. Lecture Notes in Computational Science and Engineering, 2008, 58, 44-67
141. [Smola1999] Smola, A. J.; Williamson, R. C.; Mika, S. & Schölkopf, B. Regularized principal manifolds. Lecture Notes in Artificial Intelligence, 1999, 1572, 214-229
142. [Spearman1904] Spearman, C. General Intelligence Objectively Determined and Measured. American J. Psychology, 1904, 15, 201-292
143. [Subbarao2006] Subbarao, R. & Meer, P. Subspace Estimation Using Projection Based M-Estimators over Grassmann Manifolds. Lecture Notes in Computer Science, 2006, 3951, 301-312
144. [Tenenbaum2000] Tenenbaum, J. B.; de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290, 2319-2323
145. [Theis2011] Theis, F. J.; Kawanabe, M. & Müller, K. R. Uniqueness of Non-Gaussianity-Based Dimension Reduction. IEEE Trans. Signal Processing, 2011, 59, 4478-4482
146. [Thurstone1947] Thurstone, L. Multiple Factor Analysis. Univ. Chicago Press, 1947
147. [Tropp2007] Tropp, J. A. & Gilbert, A. C. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. IEEE Trans. Information Theory, 2007, 53, 4655-4666
148. [Tucker1966] Tucker, L. R. Some mathematical notes on three-mode factor analysis. Psychometrika, 1966, 31, 279-311
149. [Ulfarsson2011] Ulfarsson, M. O. & Solo, V. Vector l0 sparse variable PCA. IEEE Trans. Signal Processing, 2011, 59, 1949-1958
150. [Vidal2005] Vidal, R.; Ma, Y. & Sastry, S. Generalized principal component analysis (GPCA). IEEE Trans. Pattern Analysis & Machine Intelligence, 2005, 27, 1945-1959
151. [Wall2003] Wall, M.; Rechtsteiner, A. & Rocha, L. A practical approach to microarray data analysis. Singular Value Decomposition and Principal Component Analysis. Springer, 2003, 91-110
152. [Wang2011] Wang, R.; Shan, S.; Chen, X.; Chen, J. & Gao, W. Maximal Linear Embedding for Dimensionality Reduction. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1776-1792
153. [Watson1993] Watson, A. B. DCT quantization matrices visually optimized for individual images. Proc. SPIE Workshop on Augmented Visual Display (AVID) Research, 1993, 1913-1914, 363-377
154. [Williams2002] Williams, C. K. I. On a Connection between Kernel PCA and Metric Multidimensional Scaling. Machine Learning, 2002, 46, 11-19
155. [Wold1987] Wold, S.; Esbensen, K. & Geladi, P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987, 2, 375
156. [Yin2008] Yin, H. Learning Nonlinear Principal Manifolds by Self-Organising Maps. Lecture Notes in Computational Science and Engineering, 2008, 58, 68-95
157. [Yu2012] Yu, S.; Tranchevent, L.-C.; Liu, X.; Glänzel, W.; Suykens, J. A. K.; Moor, B. D. & Moreau, Y. Optimized data fusion for kernel k-means clustering. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012, 34, 1031-1039
158. [Zhang2004] Zhang, Z. & Zha, H. Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment. SIAM J. Scientific Computing, 2004, 26, 313-338
159. [Zhang2009] Zhang, J.; Niyogi, P. & McPeek, M. S. Laplacian eigenfunctions learn population structure. PLoS One, 2009, 4, e7928
160. [Zhang2012] Zhang, Z.; Wang, J. & Zha, H. Adaptive manifold learning. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012, 34, 253-265
161. [Zou2006] Zou, H.; Hastie, T. & Tibshirani, R. Sparse Principal Component Analysis. J. Computational and Graphical Statistics, 2006, 15, 262-286