C.O.S. Sorzano1,*, J. Vargas1, A. Pascual-Montano1
1 Natl. Centre for Biotechnology (CSIC)
C/Darwin, 3. Campus Univ. Autónoma, 28049 Cantoblanco, Madrid, Spain
{coss,jvargas,pascual}@cnb.csic.es
* Corresponding author
Abstract. Experimental life sciences such as biology and chemistry have seen in recent decades an explosion of the data available from experiments. Laboratory instruments have become more and more complex and report hundreds or thousands of measurements for a single experiment, so statistical methods face challenging tasks when dealing with such high-dimensional data. However, much of the data is highly redundant and can be efficiently brought down to a much smaller number of variables without a significant loss of information. The mathematical procedures that make this reduction possible are called dimensionality reduction techniques; they have been widely developed in fields such as Statistics and Machine Learning, and are currently a hot research topic. In this review we categorize the plethora of dimensionality reduction techniques available and give the mathematical insight behind them.
Keywords: Dimensionality Reduction, Data Mining, Machine Learning, Statistics
1. Introduction
During the last decade life sciences have undergone a tremendous revolution with the accelerated development of high technologies and laboratory instrumentation. A good example is the biomedical domain, which has experienced a drastic advance since the advent of complete genome sequences. This post-genomics era has led to the development of new high-throughput techniques that are generating enormous amounts of data, which in turn have driven the exponential growth of many biological databases. In many cases, these datasets have many more variables than observations. For example, standard microarray datasets are usually composed of thousands of variables (genes) in dozens of samples. This situation is not exclusive to biomedical research, and many other scientific fields have also seen an explosion in the number of variables measured for a single experiment. This is the case of image processing, mass spectrometry, time series analysis, internet search engines, and automatic text analysis, among others.
Statistical and machine reasoning methods face a formidable problem when dealing with such high-dimensional data, and normally the number of input variables is reduced before a data mining algorithm can be successfully applied. The dimensionality reduction can be made in two different ways: by only keeping the most relevant variables from the original dataset (this technique is called feature selection) or by exploiting the redundancy of the input data and finding a smaller set of new variables, each being a combination of the input variables, containing basically the same information as the input variables (this technique is called dimensionality reduction).
This situation is not new in Statistics. In fact, one of the most widely used dimensionality reduction techniques, Principal Component Analysis (PCA), dates back to Karl Pearson in 1901 [Pearson1901]. The key idea is to find a new coordinate system in which the input data can be expressed with many fewer variables without a significant error. This new basis can be global or local and can fulfill very different properties. The recent explosion of available data, together with ever more powerful computational resources, has attracted the attention of many researchers in Statistics, Computer Science and Applied Mathematics, who have developed a wide range of computational techniques dealing with the dimensionality reduction problem (for reviews see [Carreira1997, Fodor2002, Mateen2009]).
In this review we provide an up-to-date overview of the mathematical properties and foundations of the different dimensionality reduction techniques. For feature selection, the reader is referred to the reviews of [Dash1997], [Guyon2003] and [Saeys2007].
There are several dimensionality reduction techniques specifically designed for time series. These methods exploit the frequency content of the signal and its usual sparseness in the frequency space. The most popular methods are those based on wavelets [Rioul1991, Graps1995], followed at a large distance by the Empirical Mode Decomposition [Huang1998, Rilling2003] (the reader is referred to the references above for further details). We do not cover these techniques here since they are not usually applied for the general-purpose dimensionality reduction of data. From a general point of view, we may say that wavelets project the input time series onto a fixed dictionary (see Section 3). This dictionary has the property of making the projection sparse (only a few coefficients are sufficiently large), and the dimensionality reduction is obtained by setting most coefficients (the small ones) to zero. The empirical mode decomposition, instead, constructs a dictionary specially adapted to each input signal.
To keep the consistency of this review, we also do not cover those dimensionality reduction techniques that take into account the class of the observations, i.e., there are observations from a class A of objects, observations from a class B, and the dimensionality reduction technique should keep the separability of the original classes as well as possible. Fisher's Linear Discriminant Analysis (LDA) was one of the first techniques to address this issue [Fisher1936, Rao1948]. Many other works have followed since then; for the most recent works and for a bibliographical review see [Bian2011, Cai2011, Kim2011, Lin2011, Batmanghelich2012].
In the following we will refer to the observations as input vectors $\mathbf{x}$, whose dimension is $M$. We will assume that we have $N$ observations and we will refer to the $n$-th observation as $\mathbf{x}_n$. The whole dataset of observations will be $\mathcal{X}$, while $X$ will be an $M \times N$ matrix with all the observations as columns. Note that small, bold letters represent vectors ($\mathbf{x}$), while capital, non-bold letters ($X$) represent matrices. The goal of dimensionality reduction is to find another representation $\boldsymbol{\chi}$ of a smaller dimension $m$ such that as much information as possible is retained from the original set of observations. This involves some transformation operator from the original vectors onto the new vectors, $\boldsymbol{\chi} = T(\mathbf{x})$. These projected vectors are sometimes called feature vectors, and the projection of $\mathbf{x}_n$ will be denoted $\boldsymbol{\chi}_n$. There might not be an inverse for this projection, but there must be a way of recovering an approximate value of the original vector, $\hat{\mathbf{x}} = R(\boldsymbol{\chi})$, such that $\hat{\mathbf{x}} \approx \mathbf{x}$.
An interesting property of any dimensionality reduction technique is its stability. In this context, a technique is said to be $\epsilon$-stable if, for any two input data points $\mathbf{x}_1$ and $\mathbf{x}_2$, the following inequality holds [Baraniuk2010]: $(1-\epsilon)\|\mathbf{x}_1-\mathbf{x}_2\|_2^2 \le \|\boldsymbol{\chi}_1-\boldsymbol{\chi}_2\|_2^2 \le (1+\epsilon)\|\mathbf{x}_1-\mathbf{x}_2\|_2^2$. Intuitively, this inequality implies that Euclidean distances in the original input space are relatively conserved in the output feature space.
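The stability condition can be checked empirically on a finite sample. The following is a minimal sketch, not a method from the references: the helper names `squared_distance` and `smallest_epsilon`, the tolerance name `eps`, and the toy transform that drops a coordinate are all assumptions made for illustration.

```python
from itertools import combinations

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def smallest_epsilon(points, transform):
    """Smallest eps such that (1 - eps) d_in <= d_out <= (1 + eps) d_in
    for every pair of distinct points, where d_in and d_out are the squared
    distances before and after applying `transform`."""
    eps = 0.0
    for x1, x2 in combinations(points, 2):
        d_in = squared_distance(x1, x2)
        if d_in == 0.0:
            continue  # coincident points impose no constraint
        d_out = squared_distance(transform(x1), transform(x2))
        eps = max(eps, abs(d_out / d_in - 1.0))
    return eps

# Hypothetical example: reduce 3-D points to 2-D by dropping the last coordinate.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 1.0)]
eps = smallest_epsilon(points, lambda x: x[:-1])
```

A value of `eps` close to 0 indicates that pairwise distances are well preserved on this sample; it says nothing about points outside the sample.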
Figure 1. Example of the use of vector quantization. Black circles represent the input data, $\mathbf{x}_n$; red squares represent class representatives, $\hat{\mathbf{x}}_\chi$.
The goal is thus to find the representatives $\hat{\mathbf{x}}_\chi$ and class assignments $u_\chi(\mathbf{x})$ ($u_\chi(\mathbf{x})$ is equal to 1 if the observation $\mathbf{x}$ is assigned to the $\chi$-th class, and 0 otherwise) such that $J_{VQ} = E\left\{\sum_\chi u_\chi(\mathbf{x})\|\mathbf{x}-\hat{\mathbf{x}}_\chi\|^2\right\}$ is minimized. This problem is known as vector quantization or K-means [Hartigan1979]. The optimization of this goal function is a combinatorial problem, although there are heuristics to cut down its cost [Gray1984, Gersho1992]. An alternative formulation of the K-means objective function is $J_{VQ} = \|X - WU\|_F^2$ subject to $U^tU = I$ and $u_{ij} \in \{0,1\}$ (i.e., each input vector is assigned to one and only one class) [Batmanghelich2012]. In this expression, $W$ is an $M \times m$ matrix with all representatives as column vectors, $U$ is an $m \times N$ matrix whose $ij$-th entry is 1 if the $j$-th input vector is assigned to the $i$-th class, and $\|\cdot\|_F^2$ denotes the squared Frobenius norm of a matrix.
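One common heuristic for this objective alternates between assigning each observation to its closest representative and recomputing each representative as the mean of its class (a Lloyd-style iteration, used here as an illustrative stand-in for the heuristics cited above). The sketch below assumes the initial representatives are supplied by the caller; the function names are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans(points, representatives, iterations=20):
    """Lloyd-style alternation for the vector quantization objective.

    Returns the final representatives and the class label of each point."""
    reps = [list(r) for r in representatives]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its closest representative.
        for n, x in enumerate(points):
            labels[n] = min(range(len(reps)), key=lambda k: sqdist(x, reps[k]))
        # Update step: each representative becomes the mean of its class.
        for k in range(len(reps)):
            members = [x for x, lab in zip(points, labels) if lab == k]
            if members:  # an empty class keeps its previous representative
                reps[k] = [sum(c) / len(members) for c in zip(*members)]
    return reps, labels

def vq_objective(points, reps, labels):
    """Sample version of the objective: sum of squared approximation errors."""
    return sum(sqdist(x, reps[lab]) for x, lab in zip(points, labels))

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
reps, labels = kmeans(points, [(0.0, 0.0), (10.0, 0.0)])
```

The result depends on the initial representatives, so in practice the procedure is usually run from several starting points.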
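This intuitive goal function can be put in a probabilistic framework. Let us assume we have a generative model of how the data is produced. Let us assume that the observed data are noisy versions of $K$ vectors $\hat{\mathbf{x}}_\chi$ which are equally likely a priori. Let us assume that the observation noise is normally distributed with a spherical covariance matrix $\sigma^2 I$. The likelihood of observing $\mathbf{x}_n$ having produced $\hat{\mathbf{x}}_\chi$ is $l(\mathbf{x}_n|\hat{\mathbf{x}}_\chi,\sigma^2) = (2\pi\sigma^2)^{-M/2}\exp\left(-\frac{1}{2}\frac{\|\mathbf{x}_n-\hat{\mathbf{x}}_\chi\|^2}{\sigma^2}\right)$. The log likelihood of observing the whole dataset $\mathbf{x}_n$ ($n=1,\ldots,N$) is then $\log\prod_{n=1}^{N} l(\mathbf{x}_n|\hat{\mathbf{x}}_{\chi_n},\sigma^2) = -\frac{MN}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\sum_\chi u_\chi(\mathbf{x}_n)\|\mathbf{x}_n-\hat{\mathbf{x}}_\chi\|^2$, so that, for a fixed $\sigma^2$, maximizing the likelihood is equivalent to minimizing the vector quantization objective. Under this model, vector quantization can be regarded as estimating the means of a set of equally weighted spherical Gaussians that approximate the probability density function of the input data. This idea has been further pursued by Mixture Models [Bailey1994], which are a generalization of vector quantization in which, instead of looking only for the means of the Gaussians associated to each class, we also allow each class to have a different covariance matrix $\Sigma_\chi$ and a different a priori probability $\pi_\chi$. The algorithm looks for estimates of all these parameters by Expectation-Maximization, and at the end produces for each input observation $\mathbf{x}_n$ the label $\chi$ of the Gaussian that has the maximum likelihood of having generated that observation.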
We can extend this concept and, instead of making a hard class assignment, make a fuzzy class assignment by allowing $0 \le u_\chi(\mathbf{x}) \le 1$ and requiring $\sum_\chi u_\chi(\mathbf{x}) = 1$. This gives another vector quantization algorithm called fuzzy K-means [Bezdek1981, Bezdek1984].
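To make the fuzzy assignment concrete, the sketch below computes memberships with the usual fuzzy c-means update, in which each membership is proportional to an inverse power of the squared distance to the representative. The text above does not spell out this update, so the formula, the `fuzzifier` parameter (assumed greater than 1) and the function name are assumptions made for illustration.

```python
def fuzzy_memberships(x, representatives, fuzzifier=2.0):
    """Memberships of x in each class; they lie in [0, 1] and sum to 1.

    Uses u_k proportional to (1 / d_k^2) ** (1 / (fuzzifier - 1)), where
    d_k^2 is the squared distance from x to the k-th representative."""
    d2 = [sum((xi - ri) ** 2 for xi, ri in zip(x, r)) for r in representatives]
    # If x coincides with one or more representatives, they share the membership.
    zeros = [k for k, d in enumerate(d2) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if k in zeros else 0.0 for k in range(len(d2))]
    power = 1.0 / (fuzzifier - 1.0)
    inverse = [(1.0 / d) ** power for d in d2]
    total = sum(inverse)
    return [v / total for v in inverse]

# A point at distance 1 from the first representative and 2 from the second.
u = fuzzy_memberships((1.0, 0.0), [(0.0, 0.0), (3.0, 0.0)])
```

As the fuzzifier approaches 1 the memberships approach the hard 0/1 assignment of K-means; larger values spread the membership more evenly.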
The K-means algorithm is based on a quadratic objective function, which is known to be strongly affected by outliers. This drawback can be alleviated by taking the $l_1$ norm of the approximation errors and modifying the problem to $J_{K\text{-}medians} = \|X-WU\|_1^2$ subject to $U^tU = I$ and $u_{ij}\in\{0,1\}$ [Arora1998, Batmanghelich2012]. [Iglesias2007] proposed a different approach to find data representatives less affected by outliers which we may
u ( x) x x
Some authors [Girolami2002, Dhillon2004, Yu2012] have proposed to use a nonlinear embedding $\Phi(\mathbf{x})$ of the input vectors into a higher-dimensional space (dimensionality expansion, instead of reduction), and then perform the vector quantization in this higher-dimensional space (this kind of algorithm is called a kernel algorithm and is further explained below with Kernel PCA). The reason for performing this nonlinear mapping is that the topological spherical balls induced by the distance $\|\Phi(\mathbf{x})-\Phi(\hat{\mathbf{x}}_\chi)\|^2$ in the higher-dimensional space correspond to non-spherical neighborhoods in the original input space, thus allowing for a richer family of distance functions.
Although vector quantization has all the ingredients to be considered a dimensionality reduction technique (mapping from the high-dimensional space to the low-dimensional space by assigning a class label $\chi$, and back to the high-dimensional space through an approximation), this algorithm has a serious drawback. The problem is that distances in the feature space ($\chi$ runs from 1 to $K$) do not correspond to distances in the original space. For example, if $\mathbf{x}_n$ is assigned label 0, $\mathbf{x}_{n+1}$ label 1, and $\mathbf{x}_{n+2}$ label 2, it does not mean that $\mathbf{x}_n$ is closer to $\mathbf{x}_{n+1}$ than to $\mathbf{x}_{n+2}$ in the input $M$-dimensional space. Labels are arbitrary and do not allow us to conclude anything about the relative organization of the input data, other than knowing that all vectors assigned to the same label are closer to the representative of that label than to the representative of any other label (this fact creates a Voronoi partition of the input space) [Gray1984, Gersho1992].
The algorithms presented from now on do not suffer from this problem. Moreover, the problem can be further attenuated by imposing a neighborhood structure on the feature space. This is done by Self-Organizing Maps and Generative Topographic Mappings, which are presented below.
2.2 PCA
Principal Component Analysis (PCA) is by far one of the most popular algorithms for dimensionality reduction [Pearson1901, Wold1987, Dunteman1989, Jollife2002]. Given a set of observations $\mathbf{x}$ with dimension $M$ (they lie in $\mathbb{R}^M$), PCA is the standard technique for finding the single best (in the sense of least-square error) subspace of a given dimension, $m$. Without loss of generality, we may assume the data is zero-mean and the subspace to fit is a linear subspace (passing through the origin).
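This algorithm is based on the search for orthogonal directions explaining as much variance of the data as possible. In terms of dimensionality reduction it can be formulated [Hyvarinen2001] as the problem of finding the $m$ orthonormal directions $\mathbf{w}_i$ minimizing the representation error $J_{PCA} = E\left\{\left\|\mathbf{x} - \sum_{i=1}^m \langle\mathbf{w}_i,\mathbf{x}\rangle\mathbf{w}_i\right\|^2\right\}$. In this objective function, the reduced vectors are the projections $\boldsymbol{\chi} = (\langle\mathbf{w}_1,\mathbf{x}\rangle,\ldots,\langle\mathbf{w}_m,\mathbf{x}\rangle)^t$. This can be more compactly written as $\boldsymbol{\chi} = W^t\mathbf{x}$, where $W$ is an $M\times m$ matrix whose columns are the orthonormal directions $\mathbf{w}_i$ (or equivalently $W^tW = I$). The approximation to the original vectors is given by $\hat{\mathbf{x}} = \sum_{i=1}^m\langle\mathbf{w}_i,\mathbf{x}\rangle\mathbf{w}_i$, or what is the same, $\hat{\mathbf{x}} = W\boldsymbol{\chi}$. In Figure 2, we show a graphical representation of a PCA transformation in only two dimensions ($\mathbf{x}\in\mathbb{R}^2$). As can be seen from Figure 2, the variance of the data in the original data space is best captured in the rotated space given by vectors $\boldsymbol{\chi} = W^t\mathbf{x}$.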
Figure 2. Graphical representation of a PCA transformation in only two dimensions.
$\chi_1$ is the first principal component and it goes in the direction of most variance; $\chi_2$ is the second principal component, it is orthogonal to the first and it goes in the second direction with most variance (in $\mathbb{R}^2$ there is not much choice, but in the general case, $\mathbb{R}^M$, there is). Observe that, without loss of generality, the data is centred about the origin of the output space.
We can rewrite the objective function as $J_{PCA} = E\{\|\mathbf{x}-W\boldsymbol{\chi}\|^2\} = E\{\|\mathbf{x}-WW^t\mathbf{x}\|^2\} \propto \|X-WW^tX\|_F^2$. Note that the class membership matrix ($U$ in vector quantization) has been substituted in this case by $W^tX$, which in general can take any positive or negative value. It has thus lost its membership meaning and simply constitutes the weights of the linear combination of the column vectors of $W$ that best approximate each input $\mathbf{x}$. Finally, […] $\frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^t$ is the covariance matrix of the observed data. The PCA formulation has also been extended to complex-valued input vectors [Li2011]; the method is called non-circular PCA.
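The link between this least-squares objective and the variance captured by the projection can be made explicit. The short derivation below is a sketch that assumes zero-mean data and orthonormal columns ($W^tW = I$), and writes $C_x = E\{\mathbf{x}\mathbf{x}^t\}$ for the covariance matrix:

```latex
\begin{align*}
\|\mathbf{x} - WW^t\mathbf{x}\|^2
  &= \mathbf{x}^t\mathbf{x} - 2\,\mathbf{x}^tWW^t\mathbf{x}
     + \mathbf{x}^tW(W^tW)W^t\mathbf{x} \\
  &= \mathbf{x}^t\mathbf{x} - \mathbf{x}^tWW^t\mathbf{x}, \\
J_{PCA} = E\left\{\|\mathbf{x} - WW^t\mathbf{x}\|^2\right\}
  &= \mathrm{Tr}\{C_x\} - \mathrm{Tr}\{W^tC_xW\}.
\end{align*}
```

Since $\mathrm{Tr}\{C_x\}$ does not depend on $W$, minimizing the reconstruction error is equivalent to maximizing $\mathrm{Tr}\{W^tC_xW\}$, the variance retained by the projection.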
The matrix projection of the input vectors onto a lower-dimensional space ($\boldsymbol{\chi} = W^t\mathbf{x}$) is a widespread technique in dimensionality reduction, as will be shown in this article. The elements involved in this projection have an interesting interpretation, as explained in the following example. Let us assume that we are analyzing scientific articles related to a specific domain. Each article will be represented by a vector $\mathbf{x}$ of word frequencies, i.e., we choose a set of $M$ words representative of our scientific area, and we annotate how many times each word appears in each article. Each vector $\mathbf{x}$ is then orthogonally projected onto the new subspace defined by the vectors $\mathbf{w}_i$. The elements of $\mathbf{x}$ represent the frequency of each word in a given scientific article. The vectors $\mathbf{w}_i$ represent the word composition of a given topic. Each component of the projected vector $\boldsymbol{\chi}$ represents how important that topic is for the article being treated.
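It can be shown [Hyvarinen2001, Jenssen2010] that when the input vectors, $\mathbf{x}$, are zero-mean (if they are not, we can transform the input data simply by subtracting the sample average vector), the solution of the minimization of $J_{PCA}$ is given by the $m$ eigenvectors associated to the largest $m$ eigenvalues of the covariance matrix of $\mathbf{x}$ ($C_x = \frac{1}{N}XX^t$). If the eigendecomposition of this covariance matrix is $C_x = W\Lambda W^t$, then the feature vectors are constructed as $\boldsymbol{\chi} = \Lambda_m^{-1/2}W_m^t\mathbf{x}$, where $\Lambda_m$ is a diagonal matrix with the $m$ largest eigenvalues and $W_m$ contains the corresponding eigenvectors as columns. The feature matrix is $U = \Lambda_m^{-1/2}W_m^tX$. Note that the $i$-th feature is the projection of the input vector $\mathbf{x}$ onto the $i$-th eigenvector, $\chi_i = \lambda_i^{-1/2}\mathbf{w}_i^t\mathbf{x}$. The so-computed feature vectors have identity covariance matrix, $C_\chi = I$, meaning that the different features are decorrelated.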
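As an illustration of the eigenvector solution, the sketch below estimates the leading eigenpair of the sample covariance matrix with power iteration and then builds the corresponding scaled feature. Power iteration is an illustrative choice here, not the method of the cited works, and the function names and the starting vector are assumptions.

```python
import math

def covariance(X):
    """Sample covariance of zero-mean observations given as a list of rows."""
    n, dim = len(X), len(X[0])
    return [[sum(x[i] * x[j] for x in X) / n for j in range(dim)]
            for i in range(dim)]

def leading_eigenpair(C, iterations=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric PSD matrix.

    The start vector is arbitrary; power iteration fails only if it happens
    to be orthogonal to the dominant eigenvector."""
    dim = len(C)
    w = [1.0 / (i + 1) for i in range(dim)]
    for _ in range(iterations):
        v = [sum(C[i][j] * w[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        if norm == 0.0:
            break
        w = [c / norm for c in v]
    lam = sum(w[i] * sum(C[i][j] * w[j] for j in range(dim)) for i in range(dim))
    return lam, w

# Zero-mean toy data whose main direction of variability is (1, 1).
X = [(2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (-1.0, -2.0)]
lam, w = leading_eigenpair(covariance(X))
# First feature of each observation, scaled to unit variance.
features = [sum(wi * xi for wi, xi in zip(w, x)) / math.sqrt(lam) for x in X]
```

Further components could be obtained by removing the variability along `w` from the covariance matrix and repeating; a library eigensolver is the usual choice in practice.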
Univariate variance is a second-order statistical measure of the departure of the input observations with respect to the sample mean. A generalization of the univariate variance to multivariate variables is the trace of the input covariance matrix. By choosing the $m$ largest eigenvalues of the covariance matrix $C_x$, we guarantee that we are making a representation in the feature space explaining as much variance of the input space as possible with only $m$ variables. In fact, $\mathbf{w}_1$ is the direction in which the data has the largest variability, $\mathbf{w}_2$ is the direction with largest variability once the variability along $\mathbf{w}_1$ has been removed, $\mathbf{w}_3$ is the direction with largest variability once the variability along $\mathbf{w}_1$ and $\mathbf{w}_2$ has been removed, etc. Thanks to the orthogonality of the $\mathbf{w}_i$ vectors, and the subsequent decorrelation of the feature vectors, the total variance explained by the PCA decomposition can be conveniently measured as the sum of the variances of each feature, $\sigma^2_{PCA} = \sum_{i=1}^{m}\mathrm{Var}\{\chi_i\}$.
[…] each direction $\mathbf{w}_i$ can be expressed as a linear combination of the observations $\mathbf{x}_n$. This fact will be further exploited by Sparse PCA and Kernel PCA. This digression on the SVD approach to PCA helps us to understand a common situation in some experimental settings. For instance, in microarray experiments, we have about 50 samples and 1000 variables. As shown by the SVD decomposition, the rank of the covariance matrix is the minimum between $M = 1000$ and $N = 50$; therefore, we cannot compute more than 50 principal components.
An interesting remark on the SVD decomposition is that among all possible matrix decompositions of the form $X = WDU$, SVD is the only family of decompositions (the SVD decomposition is not unique) yielding diagonal matrices in $D$. In other words, the matrices $W$ and $U$ can differ significantly from the SVD decomposition as long as $D$ is not a diagonal matrix. This fact is further exploited by Sparse Tensor SVD (see dictionary-based methods below).
2.2.3 Nonlinear PCA
PCA can be extended to nonlinear projections very easily conceptually, although its implementation is more involved. Projections can be replaced by $f(W^t\mathbf{x})$, where $f(\cdot): \mathbb{R}^m \rightarrow \mathbb{R}^m$ is a nonlinear function chosen by the […]
[Girolami1997b].
that the PCA problem can be seen as a regression problem whose objective function is $J_{PCA} = E\{\|\mathbf{x}-WW^t\mathbf{x}\|^2\}$. Normally the optimization is performed with the constraint $W^tW = I$ (i.e., the directions $\mathbf{w}_i$ have unit modulus). We can generalize this problem and, instead of using the same matrix to build the feature vectors ($W^t$) and to reconstruct the original samples ($W$), we can make them different. We can use ridge regression (a Tikhonov regularization to avoid possible instabilities caused by the eventual ill-conditioning of the regression; the most common one is simply the $l_2$ norm), and a regularization based on some norm promoting sparseness (like the $l_1$ norm): $J_{SPCA} = E\{\|\mathbf{x}-\tilde{W}W^t\mathbf{x}\|^2\} + \lambda_1\|W\|_{l_1}^{1} + \lambda_2\|W\|_{l_2}^{2}$, where the $l_p$ norm of a matrix is computed as $\|W\|_{l_p}^{p} = \sum_{i,j}|w_{ij}|^p$ and the objective function is optimized with respect to $W$ and $\tilde{W}$, which are $M\times m$ matrices. It has been shown [Zou2006] that promoting the sparseness of $W$ promotes the sparseness of the feature vectors, which is, after all, the final goal of this algorithm.
The previous sparse approaches tried to find sparse projection directions by zeroing some of the elements in the projection directions. However, we may prefer to remove some of the input features completely (so that their contribution to all projection directions is zero). [Ulfarsson2011] proposed the sparse variable PCA (svPCA). svPCA is based on noisy PCA (nPCA), which is a special case of Factor Analysis (see below). The observed data is supposed to have been generated as $\mathbf{x} = W\boldsymbol{\chi} + \mathbf{n}$. Assuming that the covariance of the noise is $\sigma^2I$, the goal of nPCA is to maximize the log-likelihood of observing the data matrix $X$ under the nPCA model, that is
J nPCA 12 Tr X 1 12 log
where
WW t 2 I
J svPCA J nPCA 2N 2 w i
i 1
svPCA
objective
function
is
Figure 4. Mixed dataset composed of the red and blue points, which can be decomposed into two smaller datasets that can each be effectively described using a two-dimensional linear subspace.
The term Localized PCA has been used several times in the literature to refer to different algorithms. Here we will refer to the most successful ones. [Fukunaga1971] proposed an extension of the K-means algorithm which we will refer to as Cluster-PCA. In K-means, a cluster is represented by its centroid. In Cluster-PCA, a cluster is represented by a centroid plus an orthogonal basis defining a subspace that locally embeds the cluster. An observation $\mathbf{x}$ is assigned to a cluster if the projection of $\mathbf{x}$ onto the cluster subspace ($\hat{\mathbf{x}} = WW^t\mathbf{x}$) is the closest one (the selection of the closest subspace must be done with care so that extrapolation of the cluster is avoided). Once all observations have been assigned to their corresponding clusters, the cluster centroid is updated as in K-means and the cluster subspace is recalculated by using PCA on the observations belonging to the cluster. As with K-means, a severe drawback of the algorithm is its dependence on the initialization, and several hierarchically divisive algorithms have been proposed (Recursive Local PCA) [Liu2003b]. For a review of this kind of algorithm see [Einbeck2008].
Subspace segmentation extends the idea of locally embedding the input points into linear subspaces. The assumption is that the data has been generated using several subspaces that may not be orthogonal. The goal is to identify all these subspaces. Generalized PCA [Vidal2005] is a representative of this family of algorithms. Interestingly, the subspaces to be identified are represented as polynomials whose degree is the number of subspaces to identify and whose derivatives at a data point give normal vectors to the subspace passing through the point.
Figure 5. Dataset that lies in a curved structure (left) and transformed dataset (right).
In Fig. 5 we show a dataset following a curved structure; this dataset will therefore not be conveniently described using the PCA method. Note that in the case of the data shown in Fig. 5, at least three principal components will be needed to describe the data precisely. In Fig. 5 we also show the same dataset after transforming it. Note that this data no longer follows a curved structure; in this case, it follows a linear one. Therefore, the data shown on the right of Fig. 5 can be conveniently described using the PCA approach with only one principal component.
Before introducing principal curves, surfaces and manifolds in depth, let us review PCA from a different perspective. Given a set of observations of the input vectors $\mathbf{x}$ with zero average (if the original data is not zero average, we can simply subtract the average from all data points), we can look for the line passing through the origin and with direction $\mathbf{w}_1$ (whose equation is $\mathbf{f}(\lambda) = \lambda\mathbf{w}_1$) that best fits this dataset, i.e., that minimizes $J_{line} = E\{\inf_\lambda\|\mathbf{x}-\mathbf{f}(\lambda)\|^2\}$. The infimum in the previous objective function implies that for each observation $\mathbf{x}_n$ […] ($\left\|\frac{d\mathbf{f}}{d\lambda}\right\| = 1$ for all $\lambda$), otherwise we could not uniquely determine this function. There are two warnings on this approach. The first one is that it might be locally biased if the noise of the observations is larger than the local curvature of the function. The second one is that if we only have a finite set of observations $\mathbf{x}$, we will have to use some approximation of the expectation so that we make the curve continuous. The two most common choices to make the curve continuous are kernel estimates of the expectation and the use of splines. In fact, the goal function of the classical smoothing spline is $J_{spline} = E\{\inf_\lambda\|\mathbf{x}-\mathbf{f}(\lambda)\|^2\} + \mu\int\left\|\frac{d^2\mathbf{f}(\lambda)}{d\lambda^2}\right\|^2 d\lambda$, which regularizes the curve fitting problem with the curvature of the curve. An advantage of the use of splines is their efficiency (the algorithm runs as $O(N)$ as compared to the $O(N^2)$ of the kernel estimates). However, it is difficult to choose the regularization weight, $\mu$.
Principal Curves can be combined with the idea of Localized PCA (constructing local approximations to data). This has been done by several authors: Principal Curves of Oriented Points (PCOP) [Delicado2001], and Local Principal Curves (LPC) [Einbeck2005].
The Principal Curves idea can be extended to more dimensions (see Fig. 6). Principal surfaces are the functions […] returns the parameters of the surface needed for the projection of $\mathbf{x}$. Intuitively, the principal surface at the point $(\lambda_1,\lambda_2)$ is the average of all observations whose orthogonal projection is at $\mathbf{f}(\lambda_1,\lambda_2)$. The extension to […] $\|P\mathbf{f}\|^2$ […] version of the nonlinear regression problem. $P$ is a homogeneous invariant scalar operator penalizing unsmooth functions. The fact that the regularization is homogeneous invariant implies that all surfaces which can be transformed into each other by rotations are equally penalized. A feasible way of defining the function $\mathbf{f}(\boldsymbol{\lambda})$ is by choosing a number of locations $\boldsymbol{\lambda}_i$ (normally distributed on a regular grid, although the method is not restricted to this choice) and expanding this function as a weighted sum of a kernel function $k(\boldsymbol{\lambda},\boldsymbol{\lambda}_i)$ at those locations, $\mathbf{f}(\boldsymbol{\lambda}) = \sum_{i=1}^{K}\boldsymbol{\alpha}_i k(\boldsymbol{\lambda},\boldsymbol{\lambda}_i)$. The number of locations, $K$, controls the complexity of the manifold and the vectors $\boldsymbol{\alpha}_i$ (which are in the space of $\mathbf{x}$ and, thus, have dimension $M$) control its shape. However, the dimensionality reduction is still controlled by the dimension of the vector $\boldsymbol{\lambda}$. Radial basis functions such as the Gaussian are common kernels ($k(\boldsymbol{\lambda},\boldsymbol{\lambda}_i) = k(\|\boldsymbol{\lambda}-\boldsymbol{\lambda}_i\|)$) […] $\|P\mathbf{f}\|^2 = \sum_{i,j=1}^{K}\langle\boldsymbol{\alpha}_i,\boldsymbol{\alpha}_j\rangle k(\boldsymbol{\lambda}_i,\boldsymbol{\lambda}_j)$. An interesting feature of this approach is that by using periodic kernels, one can learn circular manifolds.
Figure 6. Example of data distributed along two principal surfaces.
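vectors lie on a discrete grid with $K$ points, and that the a priori probability of each one of the points of the grid is the same (uniform distribution). If the noise is supposed to be Gaussian (or any other spherical distribution), the maximum likelihood estimate of the vectors $\boldsymbol{\alpha}_i$ boils down to the minimization of $J_{GTM} = E\{\inf_{\boldsymbol{\lambda}}\|\mathbf{x}-\mathbf{f}(\boldsymbol{\lambda})\|^2\}$ […] $\sum_{i=1}^{K}\|\boldsymbol{\alpha}_i\|^2$ (instead of $\|P\mathbf{f}\|^2$) […] Maximum a Posteriori under the assumption that the $\boldsymbol{\alpha}_i$ are normally distributed with 0 mean.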
Figure 7. Example of Generative Topographic Mapping. The observed data (right) is assumed to be generated by mapping points in a lower-dimensional space.
Figure 8. Original data lies in a ring; vector representatives calculated by SOM are represented as red points. The output map topology has been represented by linking each representative vector to its neighbors with a blue edge. Note that because of the topological constraint there might be representative vectors that are not actually representing any input point (e.g., the vector in the center of the ring).
Kohonen's SOMs [Kohonen1990, Kohonen1993, Kohonen2001] are the most famous SOMs. They work well in most contexts and are very simple to understand and implement, but they lack a solid mathematical framework (they do not optimize any functional and they cannot be put in a statistical framework). They start by creating a set of labels on a given manifold (usually a plane). Labels are distributed on a regular grid and the topological neighbourhood is defined as the neighbours in the plane of each point of the grid (note that this idea can be easily extended to higher dimensions). For initialization we assign to each label a class representative at random. Each input observation $\mathbf{x}_n$ is compared to all class representatives and it is assigned to the closest class, whose label we will refer to as $\chi_n$. In its batch version, once all the observations have been assigned, the class representatives are then updated according to $\hat{\mathbf{x}}_\chi = \frac{\sum_{n=1}^{N}k(\chi,\chi_n)\mathbf{x}_n}{\sum_{n=1}^{N}k(\chi,\chi_n)}$. The function $k(\chi,\chi_n)$ is a kernel that gives more weight to pairs of classes that are closer in the manifold. The effect of this is that when an input vector $\mathbf{x}_n$ is assigned to a given class, the classes surrounding this class will also be updated with that input vector (although with less weight than the winning class). Classes far from the winning class receive updates with a weight very close to 0. This process is iterated until convergence.
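One pass of the batch update can be sketched as follows. The sketch assumes a one-dimensional grid whose labels are the integer positions 0..K-1 and a Gaussian neighbourhood kernel of a given `width`; both choices, and the function names, are assumptions made for illustration rather than details fixed by the description above.

```python
import math

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def batch_som_step(points, reps, width=1.0):
    """One batch update of a 1-D SOM.

    Returns the updated representatives and the winning label of each point."""
    K, dim = len(reps), len(points[0])
    # Winning class of each observation: the closest representative.
    winners = [min(range(K), key=lambda k: sqdist(x, reps[k])) for x in points]
    new_reps = []
    for k in range(K):
        # Kernel weight of each observation, based on grid distance to its winner.
        weights = [math.exp(-0.5 * ((k - c) / width) ** 2) for c in winners]
        total = sum(weights)
        new_reps.append([sum(wt * x[d] for wt, x in zip(weights, points)) / total
                         for d in range(dim)])
    return new_reps, winners

points = [(0.0,), (1.0,), (2.0,)]
new_reps, winners = batch_som_step(points, [(0.0,), (2.0,)])
```

Calling the function repeatedly, feeding `new_reps` back in, iterates the process; shrinking `width` over the iterations is a common practical choice, although a very small width can make the weights of a class with no winners underflow to zero.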
GTM generalizes Kohonen's SOM because the class representatives $\hat{\mathbf{x}}_\chi$ in SOMs can be assimilated to the $\mathbf{f}(\boldsymbol{\lambda}_i)$ elements of GTM, and the function $\mathbf{f}(\boldsymbol{\lambda})$ of GTM can be directly computed from the kernel $k(\chi,\chi_n)$ in SOM [Bishop1998]. However, GTM has the advantage over SOMs that it is clearly defined in a statistical framework and the function $\mathbf{f}(\boldsymbol{\lambda})$ maximizes the likelihood of observing the given data. There is another difference with practical consequences: while SOM performs the dimensionality reduction by assigning one of the points of the grid in the manifold (i.e., it produces a discrete dimensionality reduction), GTM is capable of producing a continuous dimensionality reduction by choosing the parameters $\boldsymbol{\lambda}_n$ such that $\|\mathbf{x}_n - \mathbf{f}(\boldsymbol{\lambda}_n)\|$ is minimized.
Other generalizations of SOMs in the same direction are the Fuzzy SOM [PascualMarqui2001] and the Kernel Density SOM (KerDenSOM) [PascualMontano2001]. These generalizations rely on the regularization of the objective functions of Vector Quantization and the Mixture Models, respectively, by the term $\sum_{\chi,\chi'=1}^{K}$ […] should have smaller differences. For a review on SOM and its relationships to Non-Linear PCA and Manifold learning see [Yin2008].
Neural Gas networks [Martinetz1991, Martinetz1993, Fritzke1995] are an approach similar to the standard SOM, except that the neighborhood topology is automatically learnt from the data. Edges appear and disappear following an aging strategy. This automatic topology learning allows adapting to complex manifolds with locally different intrinsic dimensionality [Pettis1979, Kegl2002, Costa2004] (see Fig. 9).
Figure 9. Example of Neural Gas network. Note that the network has been able to learn the 2D topology present in the left points, and the 1D topology of the right points.
J EN x n x n
n 1
and
2
, '1
similarity
within
K
g ( , ')x '
x x ' g ( , ') x
2
'1
K
g ( , ')
the
net:
'1
this matrix has the inner product of $\mathbf{x}_i$ with $\mathbf{x}_j$. The eigendecomposition of the Gram matrix yields $G_x = W_N\Lambda_NW_N^t$ (since $G_x$ is a real, symmetric matrix). MDS is a classical statistical technique [Kruskal1964a, Kruskal1964b, Schiffman1981, Kruskal1986, Cox2000, Borg2005] that builds with the $m$ largest eigenvalues a feature matrix given by $U = \Lambda_m^{1/2}W_m^t$. This feature matrix is the one best preserving the inner products of the input vectors, i.e., it minimizes the Frobenius norm of the difference $G_x - G_\chi$. It can be proven [Jenssen2010] that the eigenvalues of $G_x$ and $C_x$ are the same, that $\mathrm{rank}(G_x) = \mathrm{rank}(C_x) \le M$, and that for any $m$, the space spanned by MDS and the space spanned by PCA are identical, which means that one could find a rotation such that $U_{MDS} = U_{PCA}$. Additionally, MDS can also be computed even if the data matrix $X$ is unknown; all we need is the Gram matrix, or alternatively a distance matrix. This is a rather common situation in some kinds of data analysis [Cox2000].
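When only a distance matrix is available, the Gram matrix of the centred data can be recovered by double centring the squared distances, which is the standard first step of classical MDS. The sketch below assumes a symmetric matrix of squared Euclidean distances; the function name is hypothetical.

```python
def gram_from_squared_distances(D2):
    """Gram matrix of the centred data from squared Euclidean distances.

    Implements G = -1/2 * J * D2 * J with J = I - (1/N) 1 1^t. Because D2 is
    symmetric, its column means equal its row means."""
    n = len(D2)
    row_mean = [sum(row) / n for row in D2]
    total_mean = sum(row_mean) / n
    return [[-0.5 * (D2[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

# Three points on a line at positions 0, 1 and 3 (mean 4/3).
D2 = [[0.0, 1.0, 9.0],
      [1.0, 0.0, 4.0],
      [9.0, 4.0, 0.0]]
G = gram_from_squared_distances(D2)
```

The eigendecomposition of `G` then gives the MDS features; since distances do not determine the position of the data, the recovered inner products refer to the data after subtracting its mean.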
Kernel PCA [Scholkopf1997, Scholkopf1999] is another approach trying to capture nonlinear relationships. It uses a function $\Phi$ transforming the input vector $\mathbf{x}$ onto a new vector $\Phi(\mathbf{x})$ whose dimension is larger than $M$ (a kind of dimensionality expansion). However, if we choose $\Phi$ well enough, the data in this new space may become more linear (e.g., following a straight line instead of a curve). In this new space we can perform the standard PCA and obtain the feature vector $\boldsymbol{\chi}$. In Fig. 10, we show an example of the use of this dimensionality reduction method. Fig. 10(a) shows a dataset (red circles) following a curved structure that cannot be described conveniently using linear PCA, as the black dashed line does not describe the dataset well. Therefore, to correctly describe this dataset by the standard PCA method we will need at least two principal components. In Fig. 10(b) we show the dataset after transforming it by $\Phi$. As can be seen, in this transformed and expanded space we can describe the dataset accurately using only one principal component, as the dataset follows a linear relationship.
Figure 10. Input dataset (red circles) lying in a curved structure and its corresponding first principal component using the standard PCA method (black dashed line) (a). Dataset transformed by the function $\Phi$ (blue points) and the corresponding first principal component of the expanded dataset (red line) (b).
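Making use of the relationship between MDS and PCA, we do not need to compute the covariance of the $\Phi$ vectors, but we can compute their inner products instead. We will do so through a Mercer kernel which defines a valid inner product in the space of $\Phi$ making use of the input vectors, $\langle\Phi(\mathbf{x}),\Phi(\mathbf{y})\rangle = k(\mathbf{x},\mathbf{y})$. Common kernels are $k(\mathbf{x},\mathbf{y}) = \langle\mathbf{x},\mathbf{y}\rangle^{d}$, $k(\mathbf{x},\mathbf{y}) = \exp\left(-\frac{1}{2}\frac{\|\mathbf{x}-\mathbf{y}\|^2}{\sigma^2}\right)$ […] (the parameters define the kernel shape). PCA vectors $\mathbf{w}_i$ are the eigenvectors of the covariance matrix of the transformed vectors $\Phi(\mathbf{x})$, but neither these vectors nor their covariance matrix are ever explicitly built. Instead, the orthogonal directions $\mathbf{w}_i$ are computed as a linear combination of the observed data, $\mathbf{w}_i = \sum_{n=1}^{N}\alpha_{in}\Phi(\mathbf{x}_n) = \Phi(X)\boldsymbol{\alpha}_i$. The $\boldsymbol{\alpha}_i$ vectors are computed as the eigenvectors of a matrix $G_\Phi$ whose $ij$-th entry is the inner product $\langle\Phi(\mathbf{x}_i),\Phi(\mathbf{x}_j)\rangle$. The feature vectors can finally be computed as $\chi_i = \sum_{n=1}^{N}\alpha_{in}\langle\Phi(\mathbf{x}_n),\Phi(\mathbf{x})\rangle$. Obtaining the approximation of the original vector $\mathbf{x}$ is not as straightforward. In practice it is done by looking for a vector $\hat{\mathbf{x}}$ that minimizes $\|\Phi(\mathbf{x})-\Phi(\hat{\mathbf{x}})\|^2$. The minimization is performed numerically, starting from a solution specifically derived for each kernel. Again, thanks to the kernel trick, only the dot products of the transformed vectors are needed during the minimization.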
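The point that only inner products are needed can be illustrated with a Gaussian kernel: the Gram matrix and the squared distance in the expanded space are both computed from kernel evaluations alone, without ever forming the transformed vectors. This is a minimal sketch; the function names and the default `sigma` are assumptions.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-0.5 * d2 / sigma ** 2)

def gram_matrix(points, kernel):
    """Matrix whose ij-th entry is the kernel evaluated at points i and j."""
    return [[kernel(xi, xj) for xj in points] for xi in points]

def feature_space_sqdist(x, y, kernel):
    """Squared distance between the transformed vectors of x and y, obtained
    by expanding the squared norm of their difference into inner products."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

points = [(0.0,), (1.0,), (3.0,)]
G = gram_matrix(points, gaussian_kernel)
d2 = feature_space_sqdist(points[0], points[1], gaussian_kernel)
```

The eigenvectors of `G` (after centring in the expanded space, a step omitted here) would provide the combination weights described above.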
All these techniques, together with Locally Linear Embedding and ISOMAP (see below), are called spectral dimensionality reduction techniques because they are based on the eigenvalue decomposition of some matrix. [Bengio2006] provides an excellent review of them.
[…] $\hat{p}(\mathbf{x}) = \frac{1}{N}\sum_{n=1}^{N}k(\mathbf{x},\mathbf{x}_n)$ […] window (it is also required to be a Mercer kernel). If we now estimate Renyi's quadratic entropy as the sample average of the Parzen estimator, we get $E\{p(\mathbf{x})\} \approx \frac{1}{N}\sum_{n=1}^{N}\hat{p}(\mathbf{x}_n) = \frac{1}{N^2}\sum_{n_1=1}^{N}\sum_{n_2=1}^{N}k(\mathbf{x}_{n_1},\mathbf{x}_{n_2})$, which in the light of our previous discussion on Kernel PCA can be rewritten in terms of the Gram matrix of the $\Phi$ vectors as $\frac{1}{N^2}\mathbf{1}^tG_\Phi\mathbf{1}$ ($\mathbf{1}$ being a vector of ones with dimension $N$). The eigendecomposition of the Gram matrix yields $E\{p(\mathbf{x})\} \approx \frac{1}{N^2}\mathbf{1}^tW_N\Lambda_NW_N^t\mathbf{1} = \frac{1}{N^2}\sum_{n=1}^{N}\lambda_n\langle\mathbf{1},\mathbf{w}_n\rangle^2$. This means that, to maximize the information carried by the feature vectors, it is not enough to choose the eigenvectors with the $m$ largest eigenvalues; one must choose the eigenvectors with the $m$ largest contributions to the entropy, $\lambda_n\langle\mathbf{1},\mathbf{w}_n\rangle^2$. This is the method called Kernel Entropy Component Analysis [Jenssen2010], which can be seen as an information-theoretic generalization of PCA.
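The selection rule can be sketched as follows, assuming the eigenvalues and unit eigenvectors of the Gram matrix have already been computed by some eigensolver. The function names are hypothetical, and the two-dimensional example is chosen only to show that the ranking by entropy contribution may differ from the ranking by eigenvalue.

```python
import math

def entropy_contributions(eigenvalues, eigenvectors):
    """Contribution of each eigenpair: eigenvalue times the squared inner
    product between the eigenvector and the all-ones vector."""
    return [lam * sum(w) ** 2 for lam, w in zip(eigenvalues, eigenvectors)]

def select_components(eigenvalues, eigenvectors, m):
    """Indices of the m eigenpairs with the largest entropy contribution."""
    contrib = entropy_contributions(eigenvalues, eigenvectors)
    order = sorted(range(len(contrib)), key=lambda n: contrib[n], reverse=True)
    return order[:m]

# Eigenpairs of the symmetric matrix [[4, -1], [-1, 4]].
s = 1.0 / math.sqrt(2.0)
eigenvalues = [5.0, 3.0]
eigenvectors = [(s, -s), (s, s)]
contrib = entropy_contributions(eigenvalues, eigenvectors)
chosen = select_components(eigenvalues, eigenvectors, 1)
```

Here the eigenvector with the largest eigenvalue sums to zero and therefore contributes nothing, so the second eigenpair is selected instead.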
Figure 11. Comparison between Robust PCA (a) and Standard PCA (b).
An obvious modification to deal with univariate outliers is to change the $l_2$ norm of the PCA objective function, $J_{PCA} = E\{\|\mathbf{x}-W\boldsymbol{\chi}\|_2^2\}$, to an $l_1$ norm, $J_{RPCA} = E\{\|\mathbf{x}-W\boldsymbol{\chi}\|_1\}$ [Baccini1996, Ke2005]. However, these modifications are not invariant to rotations of the input features [Ding2006]. [Ding2006] solved this problem by using the $R_1$ norm of a matrix, defined as $\|E\|_{R_1} = \sum_{n=1}^{N}\left(\sum_{i=1}^{M}e_{in}^2\right)^{1/2}$, and constructing the objective function $J_{RPCA} = \|X-WW^tX\|_{R_1}$. Another possibility is to substitute the norm by a kernel, as is done in robust statistics. This can be done at the level of individual variables, $J_{RPCA} = E\left\{\sum_{i=1}^{M}k\left(x_i-(W\boldsymbol{\chi})_i\right)\right\}$ (for instance, [He2011] used a kernel based on the correntropy function), or at the […] before, $J_{RPCA} = E\{\|\mathbf{x}-W\boldsymbol{\chi}\|\}$, he maximizes the $L_1$ norm of the projections, $J_{RPCA} = \|W^tX\|_1$ subject to $W^tW = I$, trying to maximize the dispersion of the new features.
The approach of De la Torre [Delatorre2003] can deal with univariate outliers. It modifies the PCA objective function to explicitly account for individual components of the observations that can be regarded as outliers and to account for the possible differences in the variance of each variable. The Robust PCA goal function is then [...] $i$-th component of the $n$-th observation is an outlier (low values) or not (high values). If it is an outlier, then its error will not be counted, but the algorithm will be penalized by $P(O_{ni})$ [...] considering all samples to be an outlier. This algorithm does not assume that the input data is zero-mean and estimates the mean, $\mu$, from the non-outlier samples and components. For each component, the function $\rho_i$ robustly measures the error committed. This is done by $\rho_i(e) = \frac{e^2}{e^2 + \sigma_i^2}$, i.e., the squared error is modulated by the variance of that variable, $\sigma_i^2$.
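As a small illustration of how the error function $\rho_i(e) = e^2 / (e^2 + \sigma_i^2)$ limits the influence of large errors, the following sketch evaluates it for a few error magnitudes. The function name `robust_error` and the sample values are illustrative choices, not part of the cited method.

```python
def robust_error(e: float, sigma: float) -> float:
    """rho(e) = e^2 / (e^2 + sigma^2): the squared error modulated by the variance."""
    return e * e / (e * e + sigma * sigma)


# Errors measured in units of the standard deviation (sigma = 1 is an arbitrary choice).
errors = [0.0, 0.5, 1.0, 5.0, 50.0]
values = [robust_error(e, sigma=1.0) for e in errors]

for e, v in zip(errors, values):
    print(f"e = {e:5.1f}  rho(e) = {v:.4f}")
```

Unlike the plain squared error, the value saturates below 1, so a single very large component error cannot dominate the goal function.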
The approach of Hubert [Hubert2004] addresses the problem of multivariate outliers. It distinguishes the case when we have more observations than variables ($N > M$) and when we do not ($N \leq M$). In the first case ($N > M$), we have to specify the number $h$ of outliers we want the algorithm to be resistant to (it has to be $h \leq \frac{1}{2}(N - M - 1)$). Then, we look for the subset of $N - h$ input vectors such that the determinant of its covariance matrix is minimum (this determinant is a generalization of the variance to multivariate variables: when the determinant is large it means that the corresponding dataset has a large variance). We compute the average and covariance of this subset and make some adjustments to account for the finite-sample effect. These estimates are called the MCD (Minimum Covariance Determinant) estimates. Finally, PCA is performed as usual on this covariance estimate. In the second case ($N \leq M$), we cannot proceed as before since the determinant of the covariance matrix of any subset will be zero (recall our discussion when talking about the SVD decomposition of the covariance matrix). So we first perform a dimensionality reduction without loss of information using SVD and keeping $N - 1$ variables. Next, we identify outliers by choosing a large number of random directions, and projecting the input data onto each direction. For each direction, we compute MCD estimates (robust to $h$ outliers) of the mean ($\mu_{MCD,w}$) and standard deviation ($s_{MCD,w}$) of the projections (note that these are univariate estimates). The outlyingness of each input observation is computed as $outl(x_n) = \max_{w} \frac{|\langle x_n, w \rangle - \mu_{MCD,w}|}{s_{MCD,w}}$. The $h$ points with the highest outlyingness are removed from the dataset and, finally, PCA is performed normally on the remaining points.
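The projection-based outlyingness described above can be sketched as follows. This is a minimal illustration under simplifying assumptions: the univariate MCD is taken as the contiguous run of $N - h$ sorted projections with minimum variance, no finite-sample correction is applied, and the function names, the number of directions and the synthetic data are all hypothetical choices.

```python
import math
import random


def univariate_mcd(values, h):
    """Mean and standard deviation of the len(values) - h sorted values with minimum variance."""
    s = sorted(values)
    k = len(s) - h
    best_mean, best_var = None, None
    for start in range(h + 1):
        window = s[start:start + k]
        mean = sum(window) / k
        var = sum((v - mean) ** 2 for v in window) / k
        if best_var is None or var < best_var:
            best_mean, best_var = mean, var
    return best_mean, math.sqrt(best_var)


def outlyingness(points, h, n_directions=200, seed=0):
    """Maximum over random unit directions of |<x_n, w> - mu_MCD| / s_MCD."""
    rng = random.Random(seed)
    dim = len(points[0])
    scores = [0.0] * len(points)
    for _ in range(n_directions):
        w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in w))
        w = [c / norm for c in w]
        proj = [sum(p_i * w_i for p_i, w_i in zip(p, w)) for p in points]
        mu, s = univariate_mcd(proj, h)
        if s == 0.0:
            continue  # degenerate direction, skipped in this sketch
        for n, v in enumerate(proj):
            scores[n] = max(scores[n], abs(v - mu) / s)
    return scores


# Synthetic example: 40 Gaussian inliers and one gross outlier at (10, 10).
data_rng = random.Random(1)
points = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(40)]
points.append([10.0, 10.0])
scores = outlyingness(points, h=1)
print("most outlying index:", scores.index(max(scores)))
```

With $h = 1$, the observation with the highest score would be the one removed before running standard PCA on the remaining points.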
[Pinto2011] proposes a totally different approach. It is well known that rank statistics are more robust to noise and outliers than the standard statistical analysis. For this reason, they propose to substitute the original observations by their ranks (the $i$-th component of the $n$-th individual, $x_{ni}$, is ranked among the $i$-th component of all individuals, then the observation $x_{ni}$ is substituted by its rank that we will refer to as $r_{ni}$, and the [...] $J_{PCA} = \mathrm{Tr}(W^t R W)$ subject to $W^t W = I$.
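The rank substitution can be illustrated with a short sketch. It assumes that ties receive the average rank and that the matrix $R$ in the goal function is the correlation matrix of the ranks; both are assumptions made for this example, since the corresponding details are not stated in the text above. The helper names and the toy data are hypothetical.

```python
import math


def rank_transform(column):
    """Replace each value by its rank (1 = smallest); tied values get the average rank."""
    order = sorted(range(len(column)), key=lambda n: column[n])
    ranks = [0.0] * len(column)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and column[order[end + 1]] == column[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def correlation(a, b):
    """Pearson correlation between two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Two monotonically related variables; the last individual is an outlier in the first one.
variable_1 = [1.0, 2.0, 3.0, 4.0, 100.0]
variable_2 = [2.1, 3.9, 6.2, 8.1, 9.0]

raw_corr = correlation(variable_1, variable_2)
rank_corr = correlation(rank_transform(variable_1), rank_transform(variable_2))
print(f"correlation of raw values: {raw_corr:.3f}")
print(f"correlation of ranks:      {rank_corr:.3f}")
```

The outlier distorts the correlation of the raw values, while the correlation of the ranks is unaffected because the ordering of the individuals is the same in both variables.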
is solved by factorization of the matrix $WW^t = C_x - C_n$. This factorization is not unique since any orthogonal rotation of $W$ results in the same decomposition of $C_x - C_n$ [Kaiser1958]. This fact, rather than a drawback, is exploited to produce simpler factors in the same way as the PCA was rotated (actually the rotation methods for FA are the same as the ones for PCA).
Figure 12. a) Example of PCA results for a given input distribution. b) ICA results for the same distribution.
Different ICA methods differ in the way they measure the independence of the estimates of the source variables, resulting in different estimates of the mixing matrix $W$ and source variables $\chi$. The following are different options commonly adopted:
Non-Gaussianity: The central limit theorem states that the weighted sum of the sources tends to be normally distributed whatever the distributions of the original sources. Thus, a possible way to achieve the separation of the sources is by looking for transformations $W$ that maximize the kurtosis of each of the components of the vector $\chi$ [Hyvarinen2001] (see Fig. 13). The kurtosis is related to the fourth-order moment of the distribution. The kurtosis of the Gaussian is zero and, thus, maximizing the kurtosis minimizes the Gaussianity of the output variables. In fact, maximizing the kurtosis can be seen as a Nonlinear PCA problem (see above) with the nonlinear function for each feature $f_i(\chi_i) = \chi_i + \mathrm{sgn}(\chi_i)\chi_i^2$. FastICA is an algorithm based on this goal function. The problem of kurtosis is that it can be very sensitive to outliers. Alternatively, we can measure the non-Gaussianity by negentropy, which is the Kullback-Leibler divergence between the multivariate distribution of the estimated sources, $\chi$, and the distribution of a multivariate Gaussian variable $\chi_G$ of the same mean and covariance matrix as $\chi$. Projection pursuit [Friedman1974, Friedman1987] is an exploratory data analysis algorithm looking for projection directions of maximum kurtosis (as in our first ICA algorithm), while Exploratory Projection Pursuit [Girolami1997] uses the maximum negentropy to look for the projection directions. Non-Gaussian Component Analysis [Theis2011], instead of looking for a single direction like in projection pursuit, looks for an entire linear subspace where the projected data is as non-Gaussian as possible.
Figure 13. a) Sample distribution in which the first PCA component maximizes the explained variance but projections onto this direction are normally distributed (b). ICA first component is also shown on the sample data; data projection onto this direction clearly shows a non-Gaussian distribution.
Maximum likelihood (ML): Let us assume that we know the a priori distribution of each of the components $\chi_i$. Then, we could find the matrix $W$ simply by maximizing the likelihood of all the observations [...] vectors we obtain $L(X) = \log|\det W| + E\{\sum_{i=1}^{m} \log(p_i(\chi_i))\}$, which is the objective function to maximize with respect to $W$ (the Bell-Sejnowski and the natural gradient algorithms [Hyvarinen2001] are typical algorithms for performing this optimization). If the distribution of the features is not known a priori, it can be shown [Hyvarinen2001] that reasonable errors in the estimation of the $p_i$ distributions result into locally [...] $\frac{p'_i(\chi_i)}{p_i(\chi_i)}$ and $p_i$ is [...] Infomax criterion is equivalent to the maximum likelihood one when $g_i(\chi_i)$ is the nonlinear function of each output neuron of the neural network (usually a sigmoid). Equivalently, instead of maximizing the joint entropy of the feature variables, we could have minimized their mutual information. Mutual information is a measure of the dependency among a set of variables. So, it can be easily seen how all these criteria are maximizing the independence of the feature vectors.
Nonlinear decorrelation: two variables $\chi_1$ and $\chi_2$ are independent if for all continuous functions $f_1$ and [...] linearly decorrelated if $E\{f_1(\chi_1) f_2(\chi_2)\} = 0$. In fact, PCA looks for output variables that are second-order decorrelated, $E\{\chi_1 \chi_2\} = 0$, although they may not be independent because of their higher-order moments. Making the Taylor expansion of $f_1(\chi_1) f_2(\chi_2)$ we arrive at $E\{f_1(\chi_1) f_2(\chi_2)\} = \sum_{i,j=0}^{\infty} c_{ij} E\{\chi_1^i \chi_2^j\}$ [...]
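The non-Gaussianity criterion listed above can be illustrated numerically. The sketch below estimates the excess kurtosis (fourth standardized moment minus 3, so that the Gaussian gives zero) of Gaussian samples, of uniformly distributed sources, and of an average of eight independent uniform sources. The sample size, the seed and the choice of uniform sources are arbitrary assumptions for the example.

```python
import random


def excess_kurtosis(samples):
    """Fourth central moment divided by the squared variance, minus 3."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)
n = 20000
gaussian = [rng.gauss(0.0, 1.0) for _ in range(n)]
uniform = [rng.uniform(-1.0, 1.0) for _ in range(n)]
# Each mixture sample averages 8 independent uniform sources.
mixture = [sum(rng.uniform(-1.0, 1.0) for _ in range(8)) / 8 for _ in range(n)]

k_gauss = excess_kurtosis(gaussian)
k_unif = excess_kurtosis(uniform)
k_mix = excess_kurtosis(mixture)
print(f"Gaussian: {k_gauss:+.3f}  uniform source: {k_unif:+.3f}  mixture of 8: {k_mix:+.3f}")
```

The mixture is markedly closer to zero kurtosis than a single source, which is the central limit effect that kurtosis-based ICA tries to undo by searching for the least Gaussian projections.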
Figure 14. Dictionary decomposition of a set of documents (see Fig. 3). Each document is decomposed as the linear combination given by the weights in U of the topics (atoms) contained in W.
If $X$ is made only of positive values, it might be interesting to constrain the atoms to be positive as well ($W \geq 0$). This is the problem solved by Nonnegative Matrix Factorization [Lee1999, Lee2001]. The goal is to minimize $\|X - WU\|_F^2$ or $D(X \| WU)$ (defined as $D(A \| B) = \sum_{i,j} \left( A_{ij} \log\frac{A_{ij}}{B_{ij}} - A_{ij} + B_{ij} \right)$; this is the Kullback-Leibler divergence if $A$ and $B$ are normalized so that they can be regarded as probability distributions) subject to $W, U \geq 0$. The advantage of this decomposition is that, if the application is naturally defined with positive values, the dictionary atoms are much more understandable and related to the problem than the standard dimensionality reduction methods. [Sandler2011] proposed to minimize the Earth Mover's Distance between the matrices $X$ and $WU$ with the aim of making the method more robust, especially to small samples. The Earth Mover's Distance, also called Wasserstein metric, is a way of measuring distances between two probability distributions (for a review on how to measure distances between probability distributions see [Rubner2000]). It is defined as the minimum cost of turning one probability distribution into the other and it is computed through a transportation problem. This distance was extended by [Sandler2011] to measure the distance between matrices by applying the distance to each column (feature) of the matrices and then summing all distances.
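A minimal sketch of NMF for the Frobenius goal function $\|X - WU\|_F^2$ is given below. It uses the multiplicative update rules commonly associated with [Lee2001]; the helper names, the random initialization, the small constant `eps` added to the denominators and the toy matrix are assumptions made for this illustration rather than details taken from the text.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frobenius_error(X, W, U):
    """Squared Frobenius norm of X - WU."""
    WU = matmul(W, U)
    return sum((x - y) ** 2 for rx, ry in zip(X, WU) for x, y in zip(rx, ry))


def nmf(X, m, n_iter=200, seed=0, eps=1e-9):
    """Factorize a nonnegative M x N matrix X as W (M x m) times U (m x N)."""
    rng = random.Random(seed)
    M, N = len(X), len(X[0])
    W = [[rng.random() + 0.1 for _ in range(m)] for _ in range(M)]
    U = [[rng.random() + 0.1 for _ in range(N)] for _ in range(m)]
    errors = [frobenius_error(X, W, U)]
    for _ in range(n_iter):
        # U <- U * (W^t X) / (W^t W U), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, X), matmul(matmul(Wt, W), U)
        U = [[u * a / (b + eps) for u, a, b in zip(ru, ra, rb)]
             for ru, ra, rb in zip(U, num, den)]
        # W <- W * (X U^t) / (W U U^t), elementwise
        Ut = transpose(U)
        num, den = matmul(X, Ut), matmul(W, matmul(U, Ut))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        errors.append(frobenius_error(X, W, U))
    return W, U, errors


# Toy nonnegative matrix whose rows are nonnegative combinations of two patterns.
X = [[1.0, 2.0, 3.0],
     [2.0, 4.0, 6.0],
     [1.0, 0.0, 1.0],
     [3.0, 2.0, 5.0]]
W, U, errors = nmf(X, m=2)
print(f"initial error: {errors[0]:.4f}  final error: {errors[-1]:.6f}")
```

Because the updates only multiply by nonnegative ratios, $W$ and $U$ stay nonnegative throughout, which is what makes the atoms interpretable as additive parts.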
In recent years there has been much interest in the construction of sparse feature vectors, sparse dictionaries or both. The underlying idea is to produce feature vectors with as many zeroes as possible or, equivalently, to approximate the observations with as few dictionary atoms as possible. This has obvious advantages when trying to explain the atomic composition of a given observation. In the following paragraphs we will review some of the approaches already proposed for NMF in this direction.
Local NMF [Feng2002] enforces sparsity by adding to the NMF goal function the term $\|W^t W\|_1$ (which promotes the orthogonality of the atoms, i.e., minimizes the overlapping between atoms) and $-\|U\|_2^2$ (which maximizes the variance of the feature vectors, i.e., it favours the existence of large components). Nonnegative sparse coding [Hoyer2002] and Sparse NMF [Liu2003] add the term $\|U\|_1$ in order to minimize the number of features with significant values. Pauca et al. [Pauca2006] regularize by adding $\|U\|_2^2$ and $\|W\|_2^2$ (this is especially suited for noisy data). NMF with Sparseness Constraints [Hoyer2004] performs the NMF with the constraint that the sparseness of each column of $W$ is a given constant $S_W$ (i.e., it promotes the sparseness of the dictionary atoms) and the sparseness of each row of $U$ is another constant $S_U$ (i.e., it promotes that each atom is used in as few feature vectors as possible). In [Hoyer2004], the sparseness of a vector is defined as $\mathrm{Sparseness}(x) = \frac{\sqrt{n} - \|x\|_1 / \|x\|_2}{\sqrt{n} - 1}$, which measures how much energy of the vector is packed in as few components as possible. This function evaluates to 1 if there is a single nonzero component, and evaluates to 0 if all the elements are identical. Nonsmooth NMF [PascualMontano2006] modifies the NMF factorization to $X = WSU$. The matrix $S$ controls the sparseness of the solution through the parameter $\theta$. It is defined as $S = (1 - \theta) I + \frac{\theta}{m} 1 1^t$. For $\theta = 0$ it is the standard NMF. For $\theta = 1$, we can think of the algorithm as using effective feature vectors defined by $\tilde{U} = SU$ that substitute each feature vector $\chi$ by a vector of the same dimensionality whose components are all equal to the mean of $\chi$. This is just imposing non-sparseness on the feature vectors, and this will promote sparseness on the dictionary atoms. On the other hand, we could have also thought of the algorithm as using the effective atoms given by $\tilde{W} = WS$ that substitute each atom by the average of all atoms. In this case, the non-sparseness of the dictionary atoms will promote sparseness of the feature vectors. Nonsmooth NMF is used with typical $\theta$ values about 0.5.
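The sparseness measure of [Hoyer2004] and the effect of the smoothing matrix $S$ of Nonsmooth NMF can be checked with a few lines. The sketch below applies $S = (1 - \theta) I + \frac{\theta}{m} 1 1^t$ to a single feature vector without forming the matrix explicitly; the function names and test vectors are illustrative assumptions.

```python
import math


def sparseness(x):
    """Hoyer sparseness: (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1)."""
    n = len(x)
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)


def smooth(x, theta):
    """Apply S = (1 - theta) I + (theta / m) 1 1^t to the vector x (m = len(x))."""
    mean = sum(x) / len(x)
    return [(1 - theta) * v + theta * mean for v in x]


one_hot = [0.0, 0.0, 5.0, 0.0]   # a single nonzero component
flat = [2.0, 2.0, 2.0, 2.0]      # all components identical

print("sparseness(one_hot)            =", sparseness(one_hot))
print("sparseness(flat)               =", sparseness(flat))
print("sparseness(S x), theta = 0.5   =", round(sparseness(smooth(one_hot, 0.5)), 3))
print("sparseness(S x), theta = 1.0   =", sparseness(smooth(one_hot, 1.0)))
```

Increasing $\theta$ moves every component towards the mean of the vector, so the smoothed vector becomes progressively less sparse, as the text describes for the effective feature vectors.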
Another flavor of NMF, Graph regularized NMF (GNMF), enforces learning the local manifold structure of the input data [Cai2011b]. Assuming that the input vectors $x_i$ and $x_j$ are close in the original space, one might like that the reduced representations, $\chi_i$ and $\chi_j$, are also close. For doing so, the algorithm constructs a graph $G$ encoding the neighbors of the input observations. Observations are represented by nodes in the graph, and two nodes are connected by an edge if their distance is smaller than a given threshold and they are in the K-neighbors list of each other. The weight of each edge is 1, or if we prefer we can assign a different weight to each edge depending on the distance between the two points (for instance, $e^{-\frac{1}{\sigma^2}\|x_i - x_j\|^2}$). We build the diagonal matrix $D$ whose elements are the row sums of $G$. The Laplacian of this graph is defined as $L = D - G$. The sum of the Euclidean distances of the reduced representations corresponding to all neighbor pairs can be computed as $\mathrm{Tr}(U L U^t)$. GNMF adds this term to the NMF goal function, $J_{GNMF} = \|X - WU\|_F^2 + \lambda \mathrm{Tr}(U L U^t)$. This algorithm has the corresponding version in case that the Kullback-Leibler divergence is preferred over Euclidean distances [Cai2011b].
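The relation between the graph Laplacian and the distances among neighboring feature vectors can be verified on a toy graph. The sketch assumes a symmetric weight matrix $G$ and uses the standard identity $\frac{1}{2} \sum_{i,j} G_{ij} \|\chi_i - \chi_j\|^2 = \mathrm{Tr}(U L U^t)$, where the columns of $U$ are the feature vectors; the factor 1/2 and the squared distances are part of that identity. The graph and the feature vectors are made-up examples.

```python
def laplacian(G):
    """L = D - G, with D the diagonal matrix of the row sums of G."""
    n = len(G)
    return [[(sum(G[i]) if i == j else 0.0) - G[i][j] for j in range(n)] for i in range(n)]


def pairwise_cost(G, chi):
    """0.5 * sum_ij G_ij * ||chi_i - chi_j||^2 over the feature vectors chi."""
    n = len(G)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += G[i][j] * sum((a - b) ** 2 for a, b in zip(chi[i], chi[j]))
    return 0.5 * total


def trace_form(L, chi):
    """Tr(U L U^t), where the n-th column of U is the feature vector chi[n]."""
    n, m = len(L), len(chi[0])
    return sum(chi[i][k] * L[i][j] * chi[j][k]
               for k in range(m) for i in range(n) for j in range(n))


# Chain graph 0 - 1 - 2 - 3 with unit edge weights.
G = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
chi = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [4.0, 2.0]]

L = laplacian(G)
print("pairwise cost:", pairwise_cost(G, chi))
print("trace form:   ", trace_form(L, chi))
```

Both expressions give the same value, which is why the neighborhood-preservation penalty can be written compactly as a trace involving $L$.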
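[...] $\left\| X - \sum_{i=1}^{m} w_i^1 \otimes w_i^2 \otimes \cdots \otimes w_i^d \right\|_F^2 = \left\| X - \sum_{i=1}^{m} \bigotimes_{j=1}^{d} w_i^j \right\|_F^2$. In this expression $X$ is a tensor of dimension $d$ (in our three-way table example, $d = 3$), $\otimes$ represents the outer product, and $m$ is a parameter controlling the dimensionality reduction. For each dimension $j$ (drug, gene or time, in our example), there will be $m$ associated vectors $w_i^j$. The length of each vector depends on the dimension it is associated to (see Fig. 15). If there are $N_j$ elements in the $j$-th dimension ($N_{drugs}$, $N_{genes}$ and $N_{time}$ entries in our example), the length of the vectors $w_i^j$ is $N_j$. The approximation after dimensionality reduction is $\hat{X} = \sum_{i=1}^{m} \bigotimes_{j=1}^{d} w_i^j$. For a three-way table, the $pqr$ element of the tensor is given by $\hat{X}_{pqr} = \sum_{i=1}^{m} w_{ip}^1 w_{iq}^2 w_{ir}^3$, where $w_{ip}^1$ is the $p$-th component of the vector $w_i^1$ (and analogously for $w_{iq}^2$ and $w_{ir}^3$).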
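The element-wise formula for the three-way case can be written directly as a sum of outer products. In this sketch the function name `cp_reconstruct` and the small factor vectors are hypothetical; the three lists play the role of the vectors associated to the drug, gene and time dimensions.

```python
def cp_reconstruct(w1, w2, w3):
    """Build X_hat[p][q][r] = sum_i w1[i][p] * w2[i][q] * w3[i][r].

    w1, w2 and w3 each hold m vectors, one list per tensor dimension.
    """
    m = len(w1)
    P, Q, R = len(w1[0]), len(w2[0]), len(w3[0])
    return [[[sum(w1[i][p] * w2[i][q] * w3[i][r] for i in range(m))
              for r in range(R)]
             for q in range(Q)]
            for p in range(P)]


# m = 2 components; the dimensions have 2, 3 and 2 elements respectively.
w1 = [[1.0, 2.0], [0.0, 1.0]]
w2 = [[3.0, 4.0, 5.0], [1.0, 0.0, 1.0]]
w3 = [[5.0, 6.0], [2.0, 2.0]]

X_hat = cp_reconstruct(w1, w2, w3)
print("shape:", len(X_hat), len(X_hat[0]), len(X_hat[0][0]))
print("X_hat[1][0][1] =", X_hat[1][0][1])
```

The full tensor has $2 \cdot 3 \cdot 2 = 12$ entries, while the factors store only $m (2 + 3 + 2) = 14$ numbers here; for realistic table sizes the factors are far smaller than the tensor, which is the source of the dimensionality reduction.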
NTF is a particular case of a family of algorithms decomposing tensors. PARAFAC [Harshman1970, Bro1997] may be one of the first algorithms doing so and can be regarded as a generalization of SVD to tensors. The SVD approximation $\hat{X} = W_m D_m U_m$ can be rewritten as $\hat{X} = \sum_{i=1}^{m} d_i w_i u_i^t$. The PARAFAC model minimizes the approximation error $J_{PARAFAC} = \left\| X - \sum_{i=1}^{m} d_i \, w_i^1 \otimes w_i^2 \otimes \cdots \otimes w_i^d \right\|_F^2$. We can enrich the model to consider more outer products as in $J_{Tucker} = \left\| X - \sum_{i_1=1}^{m_1} \sum_{i_2=1}^{m_2} \cdots \sum_{i_d=1}^{m_d} d_{i_1 i_2 ... i_d} \, w_{i_1}^1 \otimes w_{i_2}^2 \otimes \cdots \otimes w_{i_d}^d \right\|_F^2$ [Tucker1966].
Figure 15. Tensor decomposition as the sum of a series of outer vector products. The figure corresponds to the more general Tucker model, which is simplified to PARAFAC or NTF.
[...] $\mathrm{trace}\{(X - \hat{X})(X - \hat{X})^t\}$.
Generalized SVD [Paige1981] performs a similar decomposition but relaxes, or modifies, the orthogonality conditions according to two constraint matrices, $C_W$ and $C_U$, such that $W^t C_W W = I$ and $U C_U U^t = I$ [Abdi2007]. After the dimensionality reduction, the approximation matrix is the one minimizing $\mathrm{trace}\{C_W (X - \hat{X}) C_U (X - \hat{X})^t\}$.
Generalized SVD is a very versatile tool since under the appropriate choices of the constraint matrices it can be particularized to correspondence analysis (a generalization of factor analysis for categorical variables; $C_W$ is the relative frequency of the rows of the data matrix, $X$, and $C_U$ is the relative frequency of its columns), discriminant analysis (a technique relating a set of continuous variables to a categorical variable), and canonical correlation analysis (a technique analyzing two groups of continuous variables and performing simultaneously two dimensionality reductions so that the two new sets of features have maximum cross correlation) [Abdi2007].
columns of $W$ are the atoms, and the feature vector defines the specific linear combination of atoms used to represent the corresponding piece. The sparseness of the feature vector is measured simply by counting the number of elements different from zero (this is usually referred to as the $l_0$ norm, although actually it is not a norm since it is not positive homogeneous). The problem of the $l_0$ norm is that it yields non-convex optimization problems whose solution is NP-complete. For tackling this problem some authors have replaced the $l_0$ by the $l_p$ norm (for $0 < p < 1$ the problem is still non-convex, although there are efficient algorithms; $p = 1$ is a very popular norm to promote sparseness). Related problems are $\min_{\chi} \|x - W\chi\|_2^2 + \lambda \|\chi\|_p$ (the Least Absolute Shrinkage and Selection Operator (LASSO) is such a problem with $p = 1$, and ridge regression is also this problem with $p = 2$) [...] and it has proven to be very efficient. Another possibility to learn the dictionary is by Maximum Likelihood [Lewicki2000]. Under a generative model it is assumed that the observations have been generated as noisy versions of a linear combination of atoms, $x = W\chi + \varepsilon$. The observations are assumed to be produced independently, the noise to be white and Gaussian, and the a priori distribution of the feature vectors to be Cauchy or Laplacian. Under these assumptions the problem of maximizing the likelihood of observing all pieces, $x_n$, given $W$ is solved by $W^* = \arg\min_{W} \sum_{n} \min_{\chi_n} \|x_n - W\chi_n\|$ [...] Bayesian approach (Maximum a posteriori) [Kreutz2003]. K-SVD [Aharon2006] solves the problem $W^*, U^* = \arg\min_{W,U} \|X - WU\|_F^2$ subject to $\|\chi_n\|_0 \leq t$ for all the pieces, $n$. It is conceived as a generalization of the K-means algorithm, and in the update of the dictionary there is a Singular Value Decomposition (hence its name).
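K-SVD can be integrated into a larger framework capable of optimally reconstructing the original vector from its pieces. The goal function is $\lambda \|x - \hat{x}\|_2^2 + \sum_{n} \mu_n \|\chi_n\|_0 + \sum_{n} \|x_n - W\chi_n\|_2^2$ [...] multipliers. Once all the patches have been approximated by their corresponding feature vectors, we can recover the original input vector by $\hat{x} = \left( \lambda I + \sum_{n} P_n^t P_n \right)^{-1} \left( \lambda x + \sum_{n} P_n^t W \chi_n \right)$, where $P_n$ is an operator extracting the $n$-th piece as a vector, and $P_n^t$ is the operator putting it back in its original position.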
longer have input vectors, $x_n$, but input matrices (we will generalize later to tensors). This method learns a dictionary of $K$ SVD-like bases $(W_i, U_i)$. Each input matrix, $X_n$, is sparsely represented in the $i$-th SVD-like basis, [...] $\sum_{n=1}^{N} \sum_{i=1}^{K} u_{in} \|X_n - W_i D_n U_i\|$. This functional is minimized with respect to the SVD-like basis, the membership function and the sparse representations. The optimization is constrained by the orthogonality of the basis ($W_i^t W_i = I$, $U_i U_i^t = I$, for all $i$), the sparsity constraints ($\|D_n\|_0 \leq t$, where $t$ is a user-defined threshold), and the requirement that the columns of the membership matrix define a probability distribution ($\sum_{i=1}^{K} u_{in} = 1$) [...] orthogonality, sparsity and membership constraints.
A different family of algorithms poses the dimensionality reduction problem as one of projecting the original data onto a subspace with some interesting properties.
[...] $\langle w_i, w_j \rangle \approx 0$), and [...] nearly orthogonal in practice. Therefore, the dot product between any pair of observations is nearly conserved in the feature space. This is a rather interesting property since in many applications the similarity between two observations is computed through the dot product of the corresponding vectors. In this way, these similarities are nearly conserved in the feature space but at a computational cost that is only a small fraction of the cost of most dimensionality reduction techniques.
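The near-orthogonality of random directions and the approximate conservation of dot products can be checked empirically. The sketch below assumes a projection matrix with independent Gaussian entries of variance $1/m$, which is one common construction and not necessarily the one used in the works cited by this review; the dimensions, the seed and the helper names are arbitrary choices.

```python
import math
import random


def random_projection_matrix(m, M, rng):
    """m x M matrix with independent Gaussian entries of variance 1/m."""
    scale = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(M)] for _ in range(m)]


def project(R, x):
    """Feature vector R x of length m."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


rng = random.Random(0)
M, m = 1000, 300   # original and reduced dimensionality
R = random_projection_matrix(m, M, rng)

# Two random rows of R are nearly orthogonal in high dimension.
cosine = dot(R[0], R[1]) / math.sqrt(dot(R[0], R[0]) * dot(R[1], R[1]))

# The dot product of two observations is approximately conserved.
x = [rng.uniform(0.0, 1.0) for _ in range(M)]
y = [rng.uniform(0.0, 1.0) for _ in range(M)]
ratio = dot(project(R, x), project(R, y)) / dot(x, y)

print(f"cosine between two random directions: {cosine:+.3f}")
print(f"projected / original dot product:      {ratio:.3f}")
```

No training is involved: the only cost is one matrix-vector product per observation, which is why random projections are so cheap compared with the other techniques.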
[...] Let $d_{n_1 n_2} = d(\chi_{n_1}, \chi_{n_2})$ be the distance between their feature vectors, and $\hat{D}_{n_1 n_2} = d(\hat{x}_{n_1}, \hat{x}_{n_2})$ the distance between their approximations after dimensionality reduction. Classical MDS looks for the feature vectors minimizing [...] (this goal function is called the stress function and it is defined between 0 and 1) (see Fig. 16). Classical MDS solves the problem in a single step involving the computation of the eigenvalues of a Gram matrix involving the distances in the original space. In fact, it can be proved that classical MDS is equivalent to PCA [Williams2002]. Sammon projection uses a gradient descent algorithm and can be shown [Williams2002] to be equivalent to Kernel PCA. Metric MDS modifies the $D_{n_1 n_2}$ distances by an increasing, monotonic nonlinear function [...] where $\bar{D}_{n_1}$ and $\bar{D}_{n_2}$ denote the average distance of observations $n_1$ [...]
Figure 16. Distances in the original space are mapped onto the lower dimensional space trying to find projection points that keep the set of distances as faithful as possible.
Figure 17. Geodesic versus Euclidean distance. The geodesic distance between two points is the length of the path belonging to a given manifold that joins the two points, while the Euclidean distance is the length of the linear path joining the two points.
Laplacian eigenmaps [Belkin2001, Belkin2002] start from an adjacency graph similar to that of Isomap for the computation of the geodesic distances. The neighbor similarity graph $G$ is calculated as was done in GNMF. Then, the generalized eigenvalues and eigenvectors of the Laplacian of the graph $G$ are computed, i.e., we solve the problem $(D - G) w = \lambda D w$. Finally, we keep the eigenvectors of the $m$ smallest eigenvalues discarding the smallest one, which is always 0. The dimensionality reduction is performed by $\chi_n = (w_{1n}, w_{2n}, ..., w_{mn})$, i.e., by keeping the $n$-th component of the $m$ eigenvectors. The interesting property of the Laplacian eigenmap is that the cost function, which measures the distance among the projected features, can be expressed in terms of the Graph Laplacian: $J_{LE} = \frac{1}{2} \sum_{i,j} G_{ij} \|\chi_i - \chi_j\|^2$ [...]
[Zhang2009]. Finally, it is worth mentioning that Laplacian Eigenmaps and Principal Component Analysis (and their kernel versions) have been found to be particular cases of a more general problem called Least Squares Weighted Kernel Reduced Rank Regression (LS-WKRRR) [Delatorre2012] (in fact, this framework also generalizes Canonical Correlation Analysis, Linear Discriminant Analysis and Spectral Clustering, techniques that are out of the scope of this review). The objective function is to minimize $J_{LS-WKRRR} = \|W_\chi (\ldots - BA^t \Phi_x) W_x\|_F^2$ subject to $\mathrm{rank}(BA^t) = m$. $W_\chi$ is a diagonal weight matrix for the feature points, $W_x$ is a diagonal weight matrix for the input data points, and $\Phi_x$ is a matrix of the expanded dimensionality (kernel algorithms) of the input data. The objective function is minimized with respect to the $A$ and $B$ matrices (they are considered to be regression matrices, and decoupling the transformation $BA^t$ in two matrices allows the generalization of techniques like Canonical Correlation Analysis). The rank constraint is set to promote robustness of the solution towards a rank-deficient $\Phi_x$ matrix.
Hessian eigenmaps [Donoho2003] work with the Hessian of the graph instead of its Laplacian. By doing so, they extend ISOMAP and Laplacian eigenmaps, and they remove the need to map the input data onto a convex subset of $\mathbb{R}^m$.
Locally linear embedding (LLE) [Roweis2000, Saul2003] is another technique used to learn manifolds close to the data and project the data onto them. For each observation $x_n$ we look for the K-nearest neighbors (noted as $x_{n'}$) and [...] $\sum_{n} \left\| x_n - \sum_{n'} \beta_{nn'} x_{n'} \right\|^2 = \sum_{n} \| x_n - X\beta_n \|^2$, where $\beta_n$ is a weight vector to be determined and whose value is nonzero only for the neighbors of $x_n$. We can write this even more compactly by stacking all weight vectors as columns of a matrix $\beta$, then $J_{LLE} = \|X - X\beta\|_F^2$. The [...] $\chi_n \approx \sum_{n'} \beta_{nn'} \chi_{n'}$, that is, the new points have to be reconstructed from their neighbors in the same way (with the same weights) as the observations they represent. This latest problem is solved by solving an eigenvalue problem and also keeping the smallest eigenvalues. See Fig. 19 for a comparison of the results of LLE, Hessian LLE and ISOMAP in a particular case.
Figure 18. Schematic representation of the transformations involved in LTSA.
Figure 19. Comparison of the results of LLE, Hessian LLE and ISOMAP for the Swiss roll input data.
Local Tangent Space Alignment (LTSA) [Zhang2004, Zhang2012] is another technique locally learning the manifold structure. As in LLE, we look for the local neighbors of a point. However, we now compute the local coordinates of all the input points in the PCA subspace associated to this neighborhood. Next, we need to align all local coordinates. For each input point we compute the reconstruction error from its coordinates in the different neighborhoods where it participated, $J_{LTSA_n}$ [...] both to be determined for all neighborhoods; we may think of them as translation and shape parameters that properly locate the different neighborhoods in a common geometrical framework). The objective function of LTSA is $J_{LTSA} = \sum_{n=1}^{N} J_{LTSA_n}$ [...]
One of the problems of the previous techniques (ISOMAP, Laplacian eigenmaps, Locally Linear Embedding, and Local Tangent Space Alignment) is that they are only defined in a neighborhood of the training data, and they normally extrapolate very poorly. One of the reasons is that the mapping is not explicit, but implicit. Locality Preserving Projections (LPP) [He2004] tries to tackle this issue by constraining the projections to be a linear projection of the input vectors, $\chi_n = A^t x_n$. The goal function is the same as in Laplacian eigenmaps. The $A$ matrix is of size $M \times m$ and it is the parameter with respect to which the $J_{LE}$ objective function is minimized. An orthogonal version has been proposed [Kokiopoulo2007], with the constraint $A^t A = I$. A Kernel version of LPP also exists [He2004]. As in all kernel methods, the idea is to map the input vectors $x_n$ onto a higher dimensional space with a nonlinear function, so that the linear constraint imposed by $\chi_n = A^t x_n$ becomes a nonlinear projection. A variation of this technique is called Neighborhood Preserving Embedding (NPE) [He2005] where
We can further subdivide these large areas into smaller subareas. Table II shows the subareas sorted by total number of citations. After analyzing this table we draw the following conclusions:
The analysis on manifolds is the clear winner of the decade. The reason is its ability to analyze non-linearities and its capability of adapting to the local structure of the data. Among the different techniques, ISOMAP, Locally Linear Embedding, and Laplacian Eigenmaps are the most successful. This increase has been at the cost of the non-linear PCA versions (principal curves, principal surfaces and principal manifolds) and the Self-Organizing Maps, since the new techniques can explore non-linear relationships in a much richer way.
PCA in its different versions (standard PCA, robust PCA, sparse PCA, kernel PCA, ...) is still one of the preferred techniques due to its simplicity and intuitiveness. The increase in the use of PCA contrasts with the decrease in the use of Factor Analysis, which is more constrained in its modeling capabilities.
Independent Component Analysis reached its boom in the mid-2000s, but now it is declining. Probably, it will remain in a niche of applications related to signal processing for which it is particularly well suited. But it might not stand as a general purpose technique. It is possible that this decrease also responds to a diversification of the techniques falling under the umbrella of ICA.
Non-negative Matrix Factorization has experienced an important rise, probably because of its ability to produce more interpretable bases and because it is well suited to many situations in which the sum of positive factors is the natural way of modeling the problem.
The rest of the techniques have kept their market share. This is most likely explained by the fact that they have their own niche of applications, which they are very well suited to.
Overall, we can say that dimensionality reduction techniques are being applied in many scientific areas ranging from biomedical research to text mining and computer science. In this review we have covered different families of methodologies, each of them based on different criteria but all chasing the same goal: reduce the complexity of the data structure while at the same time delivering a more understandable representation of the same information. The field is still very active and ever more powerful methods are continuously appearing, providing an excellent application test bed for applied mathematicians.
Acknowledgements
This work was supported by the Spanish Ministry of Science and Innovation (BIO201017527) and Madrid government grant (P2010/BMD2305). C.O.S. Sorzano was also supported by the Ramón y Cajal program.
Bibliography
1. [Abdi2003] Abdi, H. Lewis-Beck, M.; Bryman, A. & Futing, T. (ed.) Encyclopedia for research methods for the social sciences Factor rotations in factor analyses Sage, 2003, 792-795
2. [Abdi2007] Abdi, H. Salkind, N. (ed.) Encyclopedia of measurements and statistics Singular value decomposition (SVD) and Generalized Singular Value Decomposition (GSVD) Sage Publications, 2007, 907-912
3. [Aharon2006] Aharon, M.; Elad, M. & Bruckstein, A. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation IEEE Trans. Signal Processing, 2006, 54, 4311-4322
4. [Arora1998] Arora, S.; Raghavan, P. & Rao, S. Approximation schemes for Euclidean k-medians and related problems Proc. 30th Annual ACM symposium on Theory of computing (STOC), 1998
5. [Artac2002] Artac, M.; Jogan, M. & Leonardis, A. Incremental PCA for On-Line Visual Learning and Recognition Proc. 16th International Conference on Pattern Recognition (ICPR), 2002, 3, 30781
6. [Baccini1996] Baccini, A.; Besse, P. & Falguerolles, A. L1-norm PCA and a heuristic approach. Ordinal and Symbolic Data Analysis. Diday, E.; Lechevalier, Y. & Opitz, O. (Eds.) Springer, 1996, 359-368
7. [Bailey1994] Bailey, T. L. & Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers Univ. California San Diego, 1994. Tech. Report CS94-351.
8. [Baldi1989] Baldi, P. & Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima Neural Networks, 1989, 2, 53-58
9. [Baraniuk2010] Baraniuk, R. G.; Cevher, V. & Wakin, M. B. Low-dimensional models for dimensionality reduction and signal recovery: a geometric perspective Proc. IEEE, 2010, 98, 959-971
10. [Batmanghelich2012] Batmanghelich, N. K.; Taskar, B. & Davatzikos, C. Generative-discriminative basis learning for medical imaging. IEEE Trans. Medical Imaging, 2012, 31, 51-69
11. [Belkin2001] Belkin, M. & Niyogi, P. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering Advances in Neural Information Processing Systems, 2001, 14, 585-591
12. [Belkin2002] Belkin, M. & Niyogi, P. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation Neural Computation, 2002, 15, 1373-1396
13. [Bengio2004] Bengio, Y.; Delalleau, O.; Le Roux, N.; Paiement, J. F.; Vincent, P. & Ouimet, M. Learning eigenfunctions links spectral embedding and Kernel PCA Neural Computation, 2004, 16, 2197-2219
14. [Bengio2006] Bengio, Y.; Delalleau, O.; Le Roux, N.; Paiement, J. F.; Vincent, P. & Ouimet, M. Spectral Dimensionality Reduction. Studies in Fuzziness and Soft Computing: Feature Extraction Springer, 2006, 519-550
15. [Bezdek1981] Bezdek, J. C. Pattern Recognition with Fuzzy Objective Function Algorithms Plenum, 1981
16. [Bezdek1984] Bezdek, J. C.; Ehrlich, R. & Full, W. FCM: The fuzzy c-means clustering algorithm Computers & Geosciences, 1984, 10, 191-203
17. [Bian2011] Bian, W. & Tao, D. Max-min distance analysis by using sequential SDP relaxation for dimension reduction. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1037-1050
18. [Bingham2001] Bingham, E. & Mannila, H. Random projection in dimensionality reduction: applications to image and text data Proc. ACM Intl. Conf. Knowledge discovery and data mining, 2001
19. [Bishop1998] Bishop, C. M.; Svensén, M. & Williams, C. K. I. GTM: The Generative Topographic Mapping Neural Computation, 1998, 10, 215-234
20. [Blumensath2008] Blumensath, T. & Davies, M. E. Gradient pursuits IEEE Trans. Signal Processing, 2008, 56, 2370-2382
21. [Borg2005] Borg, I. & Groenen, P. F. Modern Multidimensional Scaling Springer, 2005
22. [Bro1997] Bro, R. PARAFAC. Tutorial and applications Chemometrics and intelligent laboratory systems, 1997, 38, 149-171
23. [Bruckstein2009] Bruckstein, A. M.; Donoho, D. L. & Elad, M. From Sparse Solutions of Systems of Equations to Sparse Modeling of Signals and Images SIAM Review, 2009, 51, 34-81
24. [Cai2011] Cai, H.; Mikolajczyk, K. & Matas, J. Learning linear discriminant projections for dimensionality reduction of image descriptors. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 338-352
25. [Cai2011b] Cai, D.; He, X.; Han, J. & Huang, T. S. Graph Regularized Non-Negative Matrix Factorization for Data Representation. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1548-1560
26. [Carreira1997] Carreira-Perpiñán, M. A. A review of dimension reduction techniques Dept. Computer Science, Univ. Sheffield, 1997
27. [Cayton2005] Cayton, L. Algorithms for manifold learning University of California, San Diego, Tech. Rep. CS2008-0923, 2005
28. [Chen1991] Chen, S.; Cowan, C. F. N. & Grant, P. M. Orthogonal least squares learning algorithm for radial basis function networks IEEE Trans.
NeuralNetworks,1991,2,302309
29. [Chen1994]Chen,S.S.&Donoho,D.L.BasispursuitProc.IEEEConf.Signals,SystemsandComputers,1994
30. [Chen2001]Chen,S.S.;Donoho,D.L.&Saunders,M.A.AtomicdecompositionbybasispursuitSIAMReview,2001,43,129159
31. [Cichocki2009]Cichocki,A.;Zdunek,R.;Phan,A.H.&Amari,S.NonnegativematrixandtensorfactorizationsWiley,2009
32. [Common1994]Common,P.IndependentComponentAnalysis,anewconcept?SignalProcessing,36,287314(1994)
33. [Costa2004]Costa,J.A.&Hero,A.O.I.GeodesicEntropicGraphsforDimensionandEntropyEstimationinManifoldLearningIEEETrans.Signal
Processing,2004,52,22102221
34. [Cox2000]Cox,T.F.&Cox,M.A.A.MultidimensionalScalingChapman&Hall,2000
35. [Crawford1970]Crawford,C.B.&Ferguson,G.A.AgeneralrotationcriterionanditsuseinorthogonalrotationPsychometrika,1970,35,321332
36. [Dasgupta2000]Dasgupta,S.ExperimentswithrandomprojectionProc.Conf.Uncertaintyinartificialintelligence,2000
37. [Dash1997]Dash,M.&Liu,H.FeatureselectionforclassificationIntelligentDataAnalysis,1997,1,131156
38. [Delatorre2003]DelaTorre,F.&Black,M.J.AframeworkforrobustsubspacelearningIntl.J.ComputerVision,2003,54,117142
39. [Delatorre2012]DelaTorre,F.AleastsquaresframeworkforComponentAnalysis.IEEETrans.PatternAnalysis&MachineIntelligence,2012,34,
10411055
40. [Delicado2001]Delicado,P.AnotherlookatprincipalcurvesandsurfacesJ.MultivariateAnalysis,2001,77,84116
41. [DeMers1993]DeMers,D.&Cottrell,G.NonlineardimensionalityreductionAdvancesinNeuralInformationProcessingSystems,1993,5,580587
42. [Dhillon2004]Dhillon,I.S.;Guan,Y.&Kulis,B.Kernelkmeans:spectralclusteringandnormalizedcutsProc.ACMSIGKDDIntl.Conf.onKnowledge
discoveryanddatamining,2004,551554
43. [Ding2006] Ding, C.; Zhou, D.; He, X. & Zha, H. R1-PCA: Rotational Invariant L1-norm Principal Component Analysis for Robust Subspace Factorization. Proc. Intl. Workshop Machine Learning, 2006
44. [Donoho2003] Donoho, D. L. & Grimes, C. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA, 2003, 100, 5591-5596
45. [Dunteman1989] Dunteman, G. H. Principal Component Analysis. Sage Publications, 1989
46. [Einbeck2005] Einbeck, J.; Tutz, G. & Evers, L. Local principal curves. Statistics and Computing, 2005, 15, 301-313
47. [Einbeck2008] Einbeck, J.; Evers, L. & Bailer-Jones, C. Representing Complex Data Using Localized Principal Components with Application to Astronomical Data. Lecture Notes in Computational Science and Engineering, 2008, 58, 178-201
48. [Engan2000] Engan, K.; Aase, S. O. & Husoy, J. H. Multiframe compression: theory and design. EURASIP Signal Processing, 2000, 80, 2121-2140
49. [Feng2002] Feng, T.; Li, S. Z.; Shum, H. Y. & Zhang, H. Local nonnegative matrix factorization as a visual representation. Proc. 2nd Intl. Conf. Development and Learning (ICDL), 2002
50. [Fisher1936] Fisher, R. A. The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 1936, 7, 179-188
51. [Fodor2002] Fodor, I. K. A survey of dimension reduction techniques. Lawrence Livermore Natl. Laboratory, 2002
52. [Friedman1974] Friedman, J. H. & Tukey, J. W. A Projection Pursuit Algorithm for Exploratory Data Analysis. IEEE Trans. Computers, 1974, C-23, 881-890
53. [Friedman1987] Friedman, J. H. Exploratory projection pursuit. J. American Statistical Association, 1987, 82, 249-266
54. [Fritzke1995] Fritzke, B. A growing neural gas network learns topologies. Advances in Neural Information Processing, Tesauro, G.; Touretzky, D. & Leen, T. K. (Eds.) MIT Press, 1995, 625-632
55. [Fukunaga1971] Fukunaga, K. & Olsen, D. R. An algorithm for finding intrinsic dimensionality of data. IEEE Trans. Computers, 1971, C-20, 176-183
56. [Gersho1992] Gersho, A. & Gray, R. M. Vector quantization and signal compression. Kluwer Academic Publishers, 1992
57. [Girolami1997] Girolami, M. & Fyfe, C. Extraction of independent signal sources using a deflationary exploratory projection pursuit network with lateral inhibition. IEE Proc. Vision, Image and Signal Processing Journal, 1997, 14, 299-306
58. [Girolami1997b] Girolami, M. & Fyfe, C. ICA Contrast Maximisation Using Oja's Nonlinear PCA Algorithm. Intl. J. Neural Systems, 1997, 8, 661-678
59. [Girolami2002] Girolami, M. Mercer kernel-based clustering in feature space. IEEE Trans. Neural Networks, 2002, 13, 780-784
60. [Golub1970] Golub, G. H. & Reinsch, C. Singular value decomposition and least squares solutions. Numerische Mathematik, 1970, 14, 403-420
61. [Gorban2004] Gorban, A. N.; Karlin, I. V. & Zinovyev, A. Y. Constructive methods of invariant manifolds for kinetic problems. Physics Reports, 2004, 396, 197-403
62. [Gorban2007] Gorban, A. N.; Kégl, B.; Wunsch, D. C. & Zinovyev, A. Principal Manifolds for Data Visualization and Dimension Reduction. Springer, 2007
63. [Gorodnitsky1997] Gorodnitsky, I. F. & Rao, B. D. Sparse signal reconstruction from limited data using FOCUSS: reweighted minimum norm algorithm. IEEE Trans. Signal Processing, 1997, 3, 600-616
64. [Graps1995] Graps, A. An introduction to wavelets. IEEE Computational Science & Engineering, 1995, 2, 50-61
65. [Gray1984] Gray, R. M. Vector quantization. IEEE Acoustics, Speech and Signal Processing Magazine, 1984, 1, 4-29
66. [Gurumoorthy2010] Gurumoorthy, K. S.; Rajwade, A.; Banerjee, A. & Rangarajan, A. A method for compact image representation using sparse matrix and tensor projections onto exemplar orthonormal bases. IEEE Trans. Image Processing, 2010, 19, 322-334
67. [Guyon2003] Guyon, I. & Elisseeff, A. An introduction to variable and feature selection. J. Machine Learning Research, 2003, 3, 1157-1182
68. [Harman1976] Harman, H. H. Modern Factor Analysis. Univ. Chicago Press, 1976
69. [Harshman1970] Harshman, R. A. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multimodal factor analysis. UCLA Working Papers in Phonetics, 1970, 16, 1-84
70. [Hartigan1979] Hartigan, J. A. & Wong, M. A. Algorithm AS 136: A K-Means Clustering Algorithm. J. Royal Statistical Soc. C, 1979, 28, 100-108
71. [Hastie1989] Hastie, T. & Stuetzle, W. Principal curves. J. American Statistical Association, 1989, 84, 502-516
72. [He2004] He, X. & Niyogi, P. Locality Preserving Projections. Advances in Neural Information Processing Systems, Thrun, S.; Saul, L. K. & Schölkopf, B. (Eds.) MIT Press, 2004, 153-160
73. [He2005] He, X.; Cai, D.; Yan, S. & Zhang, H. J. Neighborhood Preserving Embedding. Proc. IEEE Intl. Conf. Computer Vision, ICCV, 2005
74. [He2011] He, R.; Hu, B. G.; Zheng, W. S. & Kong, X. W. Robust principal component analysis based on maximum correntropy criterion. IEEE Trans. Image Processing, 2011, 20, 1485-1494
75. [Hoyer2002] Hoyer, P. O. Nonnegative sparse coding. Proc. IEEE Workshop Neural Networks for Signal Processing, 2002
76. [Hoyer2004] Hoyer, P. O. Nonnegative matrix factorization with sparseness constraints. J. Machine Learning Research, 2004, 5, 1457-1469
77. [Huang1998] Huang, H. E.; Shen, Z.; Long, S. R.; Wu, M. L.; Shih, H. H.; Zheng, Q.; Yen, N. C.; Tung, C. C. & Liu, H. H. The empirical mode decomposition and the Hilbert spectrum for nonlinear and nonstationary time series analysis. Proc. Roy. Soc. London A, 1998, 454, 903-995
78. [Hubert2004] Hubert, M. & Engelen, S. Robust PCA and classification in biosciences. Bioinformatics, 2004, 20, 1728-1736
79. [Hyvarinen1999] Hyvärinen, A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Networks, 1999, 10, 626-634
80. [Hyvarinen2000] Hyvärinen, A. & Oja, E. Independent Component Analysis: algorithms and applications. Neural Networks, 2000, 13, 411-430
81. [Hyvarinen2001] Hyvärinen, A.; Karhunen, J. & Oja, E. Independent Component Analysis. John Wiley & Sons, Inc., 2001
82. [Iglesias2007] Iglesias, J. E.; de Bruijne, M.; Loog, M.; Lauze, F. & Nielsen, M. A family of principal component analyses for dealing with outliers. Lecture Notes in Computer Science, 2007, 4792, 178-185
83. [Jenssen2010] Jenssen, R. Kernel entropy component analysis. IEEE Trans. Pattern Analysis & Machine Intelligence, 2010, 32, 847-860
84. [Johnson1963] Johnson, R. M. On a theorem stated by Eckart and Young. Psychometrika, 1963, 28, 259-263
85. [Jollife2002] Jolliffe, I. T. Principal Component Analysis. Wiley, 2002
86. [Kaiser1958] Kaiser, H. F. The varimax criterion for analytic rotation in factor analysis. Psychometrika, 1958, 23, 187-200
87. [Kaiser1960] Kaiser, H. F. The application of electronic computers to factor analysis. Educational and psychological measurement, 1960, 20, 141-151
88. [Kambhatla1997] Kambhatla, N. & Leen, T. K. Dimension reduction by local Principal Component Analysis. Neural Computation, 1997, 9, 1493-1516
89. [Kaski1998] Kaski, S. Dimensionality reduction by random mapping: fast similarity computation for clustering. Proc. Intl. Joint Conf. Neural Networks (IJCNN), 1998
90. [Ke2005] Ke, Q. & Kanade, T. Robust L1-norm factorization in the presence of outliers and missing data by alternative convex programming. Proc. Comput. Vis. Pattern Recogn. Conf., 2005
91. [Kegl2002] Kegl, B. Intrinsic Dimension Estimation Using Packing Numbers. Advances in Neural Information Processing Systems, 2002
92. [Kim2011] Kim, M. & Pavlovic, V. Central subspace dimensionality reduction using covariance operators. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 657-670
93. [Klema1980] Klema, V. & Laub, A. The singular value decomposition: Its computation and some applications. IEEE Trans. Automatic Control, 1980, 25, 164-176
94. [Kohonen1990] Kohonen, T. The Self-Organizing Map. Proc. IEEE, 1990, 78, 1464-1480
95. [Kohonen1993] Kohonen, T. Things you haven't heard about the self-organizing map. Proc. IEEE Intl. Conf. Neural Networks, 1993
96. [Kohonen2001] Kohonen, T. Self-Organizing Maps. Springer, 2001
97. [Kokiopoulou2007] Kokiopoulou, E. & Saad, Y. Orthogonal neighborhood preserving projections: a projection-based dimensionality reduction technique. IEEE Trans. Pattern Analysis & Machine Intelligence, 2007, 29, 2143-2156
98. [Kramer1991] Kramer, M. A. Nonlinear Principal Component Analysis Using Autoassociative Neural Networks. AIChE Journal, 1991, 37, 233-243
99. [Kreutz2003] Kreutz-Delgado, K.; Murray, J. F.; Rao, B. D.; Engan, K.; Lee, T. & Sejnowski, T. J. Dictionary learning algorithms for sparse representation. Neural Computation, 2003, 15, 349-396
100. [Kruskal1964a] Kruskal, J. B. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 1964, 29, 1-27
101. [Kruskal1964b] Kruskal, J. B. Nonmetric multidimensional scaling: a numerical method. Psychometrika, 1964, 29, 115-129
102. [Kruskal1986] Kruskal, J. B. & Wish, M. Multidimensional scaling. Sage, 1986
103. [Kwak2008] Kwak, N. Principal component analysis based on L1-norm maximization. IEEE Trans. Pattern Analysis & Machine Intelligence, 2008, 30, 1672-1680
104. [Lawley1971] Lawley, D. N. & Maxwell, A. E. Factor analysis as a statistical method. Butterworths, 1971
105. [Lee1999] Lee, D. D. & Seung, S. Learning the parts of objects by nonnegative matrix factorization. Nature, 1999, 401, 788-791
106. [Lee2000] Lee, T. W.; Girolami, M.; Bell, A. J. & Sejnowski, T. J. A unifying information-theoretic framework for independent component analysis. Computers & Mathematics with Applications, 2000, 39, 1-21
107. [Lee2001] Lee, D. D. & Seung, H. S. Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing Systems, 2001, 556-562
108. [Lewicki2000] Lewicki, M. S. & Sejnowski, T. J. Learning overcomplete representations. Neural Computation, 2000, 12, 337-365
109. [Li2011] Li, X. L.; Adali, T. & Anderson, M. Noncircular Principal Component Analysis and Its Application to Model Selection. IEEE Trans. Signal Processing, 2011, 59, 4516-4528
110. [Lin2011] Lin, Y. Y.; Liu, T. L. & Fuh, C. S. Multiple kernel learning for dimensionality reduction. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1147-1160
111. [Liu2003] Liu, W.; Zheng, N. & Lu, X. Nonnegative matrix factorization for visual coding. Proc. IEEE Intl. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2003
112. [Liu2003b] Liu, Z. Y.; Chiu, K. C. & Xu, L. Improved system for object detection and star/galaxy classification via local subspace analysis. Neural Networks, 2003, 16, 437-451
113. [Mallat1993] Mallat, S. G. & Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Processing, 1993, 41, 3397-3415
114. [Martinetz1991] Martinetz, T. & Schulten, K. A "neural-gas" network learns topologies. Artificial neural networks, Kohonen, T.; Makisara, K.; Simula, O. & Kangas, J. (Eds.) Elsevier, 1991, 397-402
115. [Martinetz1993] Martinetz, T. M.; Berkovich, S. G. & Schulten, K. J. Neural-gas network for vector quantization and its application to time-series prediction. IEEE Trans. Neural Networks, 1993, 4, 558-569
116. [Mateen2009] van der Maaten, L.; Postma, E. & van den Herik, J. Dimensionality Reduction: A Comparative Review. Tilburg Centre for Creative Computing, Tilburg Univ., 2009
117. [Mulaik1971] Mulaik, S. A. The foundations of factor analysis. Chapman & Hall, 1971
118. [Paige1981] Paige, C. C. & Saunders, M. A. Towards a Generalized Singular Value Decomposition. SIAM Journal on Numerical Analysis, 1981, 18, 398-405
119. [PascualMarqui2001] Pascual-Marqui, R. D.; Pascual-Montano, A.; Kochi, K. & Carazo, J. M. Smoothly Distributed Fuzzy c-Means: A New Self-Organizing Map. Pattern Recognition, 2001, 34, 2395-2402
120. [PascualMontano2001] Pascual-Montano, A.; Donate, L. E.; Valle, M.; Bárcena, M.; Pascual-Marqui, R. D. & Carazo, J. M. A novel neural network technique for analysis and classification of EM single-particle images. J. Structural Biology, 2001, 133, 233-245
121. [PascualMontano2006] Pascual-Montano, A.; Carazo, J.; Kochi, K.; Lehmann, D. & Pascual-Marqui, R. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans. Pattern Analysis & Machine Intelligence, 2006, 28, 403-415
122. [Pati1993] Pati, Y.; Rezaiifar, R. & Krishnaprasad, P. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. Proc. Conf. Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, 1993, 40-44
123. [Pauca2006] Pauca, V. P.; Piper, J. & Plemmons, R. J. Nonnegative matrix factorization for spectral data analysis. Linear Algebra and its Applications, 2006, 416, 29-47
124. [Pearson1901] Pearson, K. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine, 1901, 2, 559-572
125. [Pettis1979] Pettis, K. W.; Bailey, T. A.; Jain, A. K. & Dubes, R. C. An intrinsic dimensionality estimator from near-neighbor information. IEEE Trans. Pattern Analysis & Machine Intelligence, 1979, 1, 25-37
126. [Pinto2011] Pinto da Costa, J. F.; Alonso, H. & Roque, L. A weighted principal component analysis and its application to gene expression data. IEEE/ACM Trans. Computational Biology and Bioinformatics, 2011, 8, 246-252
127. [Ramanan2011] Ramanan, D. & Baker, S. Local distance functions: a taxonomy, new algorithms, and an evaluation. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 794-806
128. [Rilling2003] Rilling, G.; Flandrin, P. & Goncalves, P. On empirical mode decomposition and its algorithms. Proc. IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing, 2003
129. [Rioul1991] Rioul, O. & Vetterli, M. Wavelets and signal processing. IEEE Signal Processing Magazine, 1991, 8, 14-38
130. [Roweis2000] Roweis, S. T. & Saul, L. K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science, 2000, 290, 2323-2326
131. [Rubinstein2010] Rubinstein, R.; Bruckstein, A. M. & Elad, M. Dictionaries for sparse representation modeling. Proc. IEEE, 2010, 98, 1045-1057
132. [Rubner2000] Rubner, Y.; Tomasi, C. & Guibas, L. J. The Earth Mover's Distance as a Metric for Image Retrieval. Intl. J. Computer Vision, 2000, 40, 99-121
133. [Saeys2007] Saeys, Y.; Inza, I. & Larrañaga, P. A review of feature selection techniques in bioinformatics. Bioinformatics, 2007, 23, 2507-2517
134. [Sammon1969] Sammon, J. W. A nonlinear mapping for data structure analysis. IEEE Trans. Computers, 1969, 18, 401-409
135. [Sandler2011] Sandler, R. & Lindenbaum, M. Nonnegative Matrix Factorization with Earth Mover's Distance Metric for Image Analysis. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1590-1602
136. [Saul2003] Saul, L. K. & Roweis, S. Think Globally, Fit Locally: Unsupervised Learning of Low Dimensional Manifolds. Department of Computer & Information Science, Univ. Pennsylvania, 2003
137. [Schiffman1981] Schiffman, S. S.; Reynolds, M. L. & Young, F. W. Introduction to multidimensional scaling: Theory, methods, and applications. Academic Press, 1981
138. [Scholkopf1997] Schölkopf, B.; Smola, A. & Müller, K. R. Kernel Principal Component Analysis. Proc. of ICANN, 1997, 58358
139. [Scholkopf1999] Schölkopf, B.; Smola, A. & Müller, K. R. Kernel Principal Component Analysis. Advances in kernel methods - support vector learning, 1999
140. [Scholz2008] Scholz, M.; Fraunholz, M. & Selbig, J. Nonlinear Principal Component Analysis: Neural Network Models and Applications. Lecture Notes in Computational Science and Engineering, 2008, 58, 44-67
141. [Smola1999] Smola, A. J.; Williamson, R. C.; Mika, S. & Schölkopf, B. Regularized principal manifolds. Lecture Notes in Artificial Intelligence, 1999, 1572, 214-229
142. [Spearman1904] Spearman, C. General Intelligence Objectively Determined and Measured. American J. Psychology, 1904, 15, 201-292
143. [Subbarao2006] Subbarao, R. & Meer, P. Subspace Estimation Using Projection Based M-Estimators over Grassmann Manifolds. Lecture Notes in Computer Science, 2006, 3951, 301-312
144. [Tenenbaum2000] Tenenbaum, J. B.; de Silva, V. & Langford, J. C. A global geometric framework for nonlinear dimensionality reduction. Science, 2000, 290, 2319-2323
145. [Theis2011] Theis, F. J.; Kawanabe, M. & Muller, K. R. Uniqueness of Non-Gaussianity-Based Dimension Reduction. IEEE Trans. Signal Processing, 2011, 59, 4478-4482
146. [Thurstone1947] Thurstone, L. Multiple Factor Analysis. Univ. Chicago Press, 1947
147. [Tropp2007] Tropp, J. A. & Gilbert, A. C. Signal Recovery From Random Measurements Via Orthogonal Matching Pursuit. IEEE Trans. Information Theory, 2007, 53, 4655-4666
148. [Tucker1966] Tucker, L. R. Some mathematical notes on three-mode factor analysis. Psychometrika, 1966, 31, 279-311
149. [Ulfarsson2011] Ulfarsson, M. O. & Solo, V. Vector l0 sparse variable PCA. IEEE Trans. Signal Processing, 2011, 59, 1949-1958
150. [Vidal2005] Vidal, R.; Ma, Y. & Sastry, S. Generalized principal component analysis (GPCA). IEEE Trans. Pattern Analysis & Machine Intelligence, 2005, 27, 1945-1959
151. [Wall2003] Wall, M.; Rechtsteiner, A. & Rocha, L. A practical approach to microarray data analysis: Singular Value Decomposition and Principal Component Analysis. Springer, 2003, 9110
152. [Wang2011] Wang, R.; Shan, S.; Chen, X.; Chen, J. & Gao, W. Maximal Linear Embedding for Dimensionality Reduction. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011, 33, 1776-1792
153. [Watson1993] Watson, A. B. DCT quantization matrices visually optimized for individual images. Proc. SPIE Workshop on Augmented Visual Display (AVID) Research, 1993, 1913-1914, 363-377
154. [Williams2002] Williams, C. K. I. On a Connection between Kernel PCA and Metric Multidimensional Scaling. Machine Learning, 2002, 46, 11-19
155. [Wold1987] Wold, S.; Esbensen, K. & Geladi, P. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 1987, 2, 375
156. [Yin2008] Yin, H. Learning Nonlinear Principal Manifolds by Self-Organising Maps. Lecture Notes in Computational Science and Engineering, 2008, 58, 68-95
157. [Yu2012] Yu, S.; Tranchevent, L. C.; Liu, X.; Glänzel, W.; Suykens, J. A. K.; Moor, B. D. & Moreau, Y. Optimized data fusion for kernel k-means clustering. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012, 34, 1031-1039
158. [Zhang2004] Zhang, Z. & Zha, H. Principal Manifolds and Nonlinear Dimensionality Reduction via Tangent Space Alignment. SIAM J. Scientific Computing, 2004, 26, 313-338
159. [Zhang2009] Zhang, J.; Niyogi, P. & McPeek, M. S. Laplacian eigenfunctions learn population structure. PLoS One, 2009, 4, e7928
160. [Zhang2012] Zhang, Z.; Wang, J. & Zha, H. Adaptive manifold learning. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012, 34, 253-265
161. [Zou2006] Zou, H.; Hastie, T. & Tibshirani, R. Sparse Principal Component Analysis. J. Computational and Graphical Statistics, 2006, 15, 262-286