R.R.
1 Subject
Implementing the K-Means clustering algorithm with various data mining tools.

K-means is a clustering (unsupervised learning) algorithm [1]. The aim is to create homogeneous subgroups of examples. The individuals in the same subgroup are similar; the individuals in different subgroups are as different as possible.

The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=kmeans). The goal here is to compare its implementation with various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without additional package; Knime 1.3.5; Orange 1.0b2 and RapidMiner Community Edition.

The steps of the data analysis are the following:
- Importing the data file;
- Computing some descriptive statistics indicators;
- Standardizing the variables;
- Implementing the K-Means algorithm on the standardized variables;
- Visualizing the cluster membership of each individual;
- Interpreting the clusters with conditional descriptive statistics indicators or graphical representations;
- Comparing the clusters with a prespecified grouping defined by an illustrative categorical variable;
- Exporting the dataset in a file, including the new cluster membership column.

These steps are usual in a clustering approach. The main interest of this tutorial is to show that we can implement them whatever tool is used. Of course, I cannot master the functionalities of all the tools. In some situations I may not use the most efficient procedure.
2 Dataset
We use the cars_dataset.txt [2] data file. It describes the characteristics of 392 vehicles. The active variables, which participate in the creation of the clusters, are the consumption (MPG), the DISPLACEMENT, the HORSEPOWER, the WEIGHT and the ACCELERATION. The illustrative variable, which is used only to strengthen the interpretation of the clusters, is ORIGIN (Japan, Europe and USA).
3 K-Means with Tanagra
3.1 Creating a diagram and importing the dataset
After we launch Tanagra, we click on the FILE / NEW menu in order to create a new diagram. We select the CARS_DATASET.TXT data file.
[1] http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm
[2] http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/cars_dataset.zip; from the STATLIB server, http://lib.stat.cmu.edu/datasets/cars.desc
17 June 2009
392 observations and 6 variables are loaded.
3.2 Descriptive statistics
We want to obtain an overview of the main characteristics of the dataset. We add a DEFINE STATUS component into the diagram. We set all the continuous variables as INPUT. These are the active variables of the analysis, i.e. they are used during the clustering process.
We add the MORE UNIVARIATE CONT STAT component (STATISTICS tab). We click on the contextual VIEW menu.
It seems that there are no anomalies, nor anything else that requires a specific pretreatment, in our dataset.
3.3 Standardizing the active variables
We want to standardize the variables before performing the K-Means approach. The aim is to eliminate the discrepancy of scales between the variables [3]. We add the STANDARDIZE component (FEATURE CONSTRUCTION tab) into the diagram. Then, we click on the VIEW menu.
[3] In fact, this operation is not necessary with Tanagra: the K-Means component can standardize the variables automatically. We perform this step explicitly for the comparison with the other tools.
5 new variables are now available for further processing.
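The standardization step can be illustrated outside any particular tool. The following is a minimal Python sketch, not what the STANDARDIZE component actually runs: each value is centered on the variable's mean and divided by its standard deviation. The function name `standardize`, the toy values and the use of the n - 1 denominator are assumptions made for the illustration.

```python
from statistics import mean, stdev

def standardize(values):
    """Center on the mean and scale by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: n - 1 denominator
    return [(v - m) / s for v in values]

# Hypothetical toy values, not taken from cars_dataset.txt.
weight = [3504.0, 3693.0, 2130.0, 1835.0, 2672.0]
std_weight = standardize(weight)
print([round(z, 3) for z in std_weight])
```

After the transformation every variable has mean 0 and standard deviation 1, so a variable measured in large units (such as WEIGHT) no longer dominates the Euclidean distance.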
3.4 K-Means
We want to use these transformed variables for the analysis. We insert a new DEFINE STATUS component into the diagram. We set as INPUT the computed attributes (from STD_MPG_1 to STD_ACCELERATION_1).
We insert the K-MEANS component (CLUSTERING tab). We click on the PARAMETERS contextual menu. We set the following parameters.
We ask for a partitioning into two groups. It is not necessary to normalize the distance because we already use standardized variables. We validate and we click on the VIEW menu.
The TSS (total sum of squares) is 1954.9999; the WSS (within sum of squares) is 831.6058. The BSS (between sum of squares) explained by the partitioning is (1954.9999 - 831.6058) = 1123.3941. The resulting ratio is (1123.3941 / 1954.9999) = 57.46%.
There are 100 examples in the first cluster; 292 examples in the second one.
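The decomposition TSS = WSS + BSS can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper names (`sum_of_squares`, `centroid`, `decompose`) and the toy data are hypothetical, and only the two figures 1954.9999 and 831.6058 come from the Tanagra output above. Incidentally, a TSS of about 1955 equals 5 × 391, which is what five variables standardized with the n - 1 denominator produce on 392 observations.

```python
def sum_of_squares(rows, center):
    """Sum of squared Euclidean distances from each row to `center`."""
    return sum((x - c) ** 2 for row in rows for x, c in zip(row, center))

def centroid(rows):
    """Component-wise mean of a non-empty list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def decompose(rows, labels):
    """Return (TSS, WSS, BSS) for a partition given as one label per row."""
    tss = sum_of_squares(rows, centroid(rows))
    wss = 0.0
    for g in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == g]
        wss += sum_of_squares(members, centroid(members))
    return tss, wss, tss - wss

# Hypothetical toy data: two well-separated groups in two dimensions.
rows = [[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]]
labels = [1, 1, 2, 2]
toy_tss, toy_wss, toy_bss = decompose(rows, labels)
print(toy_tss, toy_wss, toy_bss)

# Figures reported by Tanagra in the text above.
tss, wss = 1954.9999, 831.6058
bss = tss - wss
ratio = bss / tss
print(f"BSS = {bss:.4f}, explained = {ratio:.2%}")
```

The closer the ratio BSS / TSS is to 1, the more of the total dispersion is explained by the differences between the cluster centroids.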
In the lower part of the window, the CLUSTERS CENTROIDS section gives the average of each variable within each cluster.
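To make the roles of the partition and of the centroids concrete, here is a minimal sketch of a Lloyd-style k-means loop in Python. It is not Tanagra's implementation, whose initialization and update rules may differ; the function name `kmeans`, its parameters and the toy data are assumptions made for the illustration.

```python
import random

def kmeans(rows, k, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct rows drawn at random.
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(row, centroids[j])))
            for row in rows
        ]
        if new_labels == labels:
            break  # the partition is stable
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical toy data with two obvious groups.
rows = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
labels, centroids = kmeans(rows, k=2)
print(labels)
print(centroids)
```

The returned centroids play the same role as the CLUSTERS CENTROIDS table: one mean per variable and per cluster.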
3.5 Interpretation of groups
We are now at the major step of the clustering process: we want to interpret the groups. What are the characteristics of each cluster? What differentiates them from each other?
3.5.1 Group membership of individuals
3.5.2 Conditional descriptive statistics
We insert the DEFINE STATUS component into the diagram. We set as TARGET the computed column (CLUSTER_KMEANS_1), and as INPUT the other attributes, including the illustrative variable (ORIGIN).
Then we add the GROUP CHARACTERIZATION component (STATISTICS tab).
We note that the second cluster (C_K_MEANS_2) corresponds mainly to small cars with low consumption (the mean of MPG is 26.66 in the group while it is 23.49 in the whole dataset), with a small DISPLACEMENT, etc.
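The comparison made here, a mean within the group against the mean in the whole dataset, can be sketched in Python as follows. The function name `conditional_means` and the toy values are hypothetical; they are not the figures of the cars dataset.

```python
from collections import defaultdict
from statistics import mean

def conditional_means(values, labels):
    """Mean of `values` within each group defined by `labels`."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    return {lab: mean(vs) for lab, vs in groups.items()}

# Hypothetical toy values for one variable and a two-cluster partition.
mpg = [14.0, 15.0, 16.0, 25.0, 27.0, 29.0]
cluster = ["c1", "c1", "c1", "c2", "c2", "c2"]

overall = mean(mpg)
by_cluster = conditional_means(mpg, cluster)
for lab, m in sorted(by_cluster.items()):
    print(lab, round(m, 2), "vs overall", round(overall, 2))
```

A cluster is characterized by the variables whose conditional mean departs most from the overall mean.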
In order to characterize the strength of the difference, we use the "test value" criterion (http://data-mining-tutorials.blogspot.com/2009/05/understanding-test-value-criterion.html).
We can use either the active or the illustrative variables in order to characterize the groups. In our dataset, we use the ORIGIN variable for the group interpretation. We note for instance that the first cluster (C_K_MEANS_1) consists only of American cars. They have a high consumption (MPG is 14.75 in the group), etc.
3.5.3 Cross tabulation between the group membership and an illustrative variable
We can also highlight the association between the cluster membership and an illustrative variable using a cross tabulation. We insert a DEFINE STATUS component. We set ORIGIN as TARGET and C_KMEANS_1 as INPUT.
We add the CONTINGENCY CHI-SQUARE component (NONPARAMETRIC STATISTICS tab) into the diagram. We click on the VIEW menu.
The results are of course consistent with those of the GROUP CHARACTERIZATION component. Here we have more information about the strength of the association. Some statistical indicators, such as Cramer's V, are available. We can check whether the association is statistically significant.
We can also display the results as row or column percentages.
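The two indicators mentioned here can be computed by hand from the contingency table. The sketch below uses the standard definitions of Pearson's chi-square and of Cramer's V; the function name and the toy labels are hypothetical, and the code does not claim to reproduce Tanagra's output layout.

```python
from collections import Counter
from math import sqrt

def chi_square_and_cramers_v(row_var, col_var):
    """Pearson chi-square and Cramer's V for two categorical columns.

    Assumes each variable takes at least two distinct values.
    """
    n = len(row_var)
    observed = Counter(zip(row_var, col_var))
    row_tot = Counter(row_var)
    col_tot = Counter(col_var)
    chi2 = 0.0
    for r in row_tot:
        for c in col_tot:
            expected = row_tot[r] * col_tot[c] / n  # count under independence
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    v = sqrt(chi2 / (n * (min(len(row_tot), len(col_tot)) - 1)))
    return chi2, v

# Hypothetical toy labels: the cluster is fully determined by the origin.
cluster = [1, 1, 2, 2]
origin = ["American", "American", "Japanese", "Japanese"]
chi2, v = chi_square_and_cramers_v(cluster, origin)
print(chi2, v)
```

Cramer's V rescales the chi-square statistic to the interval [0, 1], which makes the strength of association comparable across tables of different sizes.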
3.5.4 Scatterplot
http://en.wikipedia.org/wiki/Scatter_plot
We add the SCATTERPLOT component (DATA VISUALIZATION tab). We click on the VIEW menu. We set WEIGHT on the horizontal axis and HORSEPOWER on the vertical axis. We use the cluster membership to color the points.
3.5.5 Graphical representation using principal component analysis
In order to take into consideration the interactions between more than two variables, we can use a principal component analysis (PCA) and build a graphical representation on the first two factors. If these axes are relevant, the relative location of the groups in this representation space is quite faithful to their location in the original space.
We add the PRINCIPAL COMPONENT ANALYSIS component (FACTORIAL ANALYSIS tab) after the K-MEANS 1 component. Thus they use the same active variables. We click on the VIEW menu.
The first two factors account for 92.8% of the variation in the dataset. On the first factor, we have an opposition between the cars (1) with low consumption (MPG), not very fast (ACCELERATION), and (2) those which are powerful and heavy (HORSEPOWER, WEIGHT).
When we create a scatterplot and set the first factor (PCA_1_AXIS_1) on the horizontal axis and the second factor (PCA_1_AXIS_2) on the vertical axis, we note that the clusters are clearly distinct.
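A figure such as "the first two factors account for 92.8%" is a ratio of eigenvalues. Assuming a PCA on standardized variables (i.e. on the correlation matrix), the eigenvalues sum to the number of variables p, and the share of the first two factors is (λ1 + λ2) / p. The following Python sketch illustrates this on hypothetical toy columns; the helper names and the Jacobi rotation routine are choices made for the illustration, not the method used by Tanagra.

```python
from math import sqrt
from statistics import mean, stdev

def correlation_matrix(columns):
    """Pearson correlation matrix of a list of equally long columns."""
    z = []
    for col in columns:
        m, s = mean(col), stdev(col)
        z.append([(v - m) / s for v in col])
    n = len(columns[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    p = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(p) for j in range(p) if i != j)
        if off < tol:
            break
        for i in range(p - 1):
            for j in range(i + 1, p):
                if abs(a[i][j]) < 1e-15:
                    continue
                # Rotation that zeroes the off-diagonal entry a[i][j].
                theta = (a[j][j] - a[i][i]) / (2.0 * a[i][j])
                sign = 1.0 if theta >= 0 else -1.0
                t = sign / (abs(theta) + sqrt(theta * theta + 1.0))
                c = 1.0 / sqrt(t * t + 1.0)
                s = t * c
                for k in range(p):  # rotate columns i and j
                    aki, akj = a[k][i], a[k][j]
                    a[k][i], a[k][j] = c * aki - s * akj, s * aki + c * akj
                for k in range(p):  # rotate rows i and j
                    aik, ajk = a[i][k], a[j][k]
                    a[i][k], a[j][k] = c * aik - s * ajk, s * aik + c * ajk
    return sorted((a[i][i] for i in range(p)), reverse=True)

# Hypothetical toy columns: x2 closely follows x1, x3 is unrelated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
x3 = [5.0, 3.0, 6.0, 2.0, 7.0, 1.0]
eigenvalues = jacobi_eigenvalues(correlation_matrix([x1, x2, x3]))
share_first_two = (eigenvalues[0] + eigenvalues[1]) / len(eigenvalues)
print([round(e, 4) for e in eigenvalues], round(share_first_two, 4))
```

When the first two eigenvalues carry most of the total, as in the tutorial, a scatterplot on the first two factors loses little of the information contained in the original variables.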
3.6 Exporting the dataset including the CLUSTER column
TANAGRA can also export in the XLS (EXCEL) and ARFF (WEKA) file formats.
We add the EXPORT DATASET component (DATA VISUALIZATION tab) into the diagram. In the settings dialog box (PARAMETERS menu), we specify that only the INPUT attributes must be exported. We can also define the directory and the file name. Then we validate and click on the VIEW menu.
A new data file (OUTPUT.TXT) with 392 observations and 7 variables is created.
4 K-Means with R
In this section, we duplicate the steps above using the R software (http://www.r-project.org/) [6].
4.1 Data importation and descriptive statistics
We use the following instructions in order to import the dataset and compute the descriptive statistics.
We obtain
4.2 Standardizing the variables
To transform the variables, we first create a callback function centrage_reduction(.) which standardizes one variable. Then we call the apply(.) function. The new data frame is voitures.cr.
The mean of the new variables is 0; their variance is 1.
4.3 K-Means with the standardized variables
We can now launch the K-Means algorithm on these new variables. We ask for a partitioning into two groups. We limit the number of iterations to 40.
[6] Unfortunately, the comments in the source code are in French. I apologize. I hope the instructions remain understandable.
R supplies, among others: the number of examples in each cluster; the conditional means of the active variables; the cluster membership of each instance.
Note: It seems that we get the same groups as with Tanagra. We should compare the two partitions to be sure. In some situations, we obtain a different partition from one data mining tool to another. Indeed, since the approach relies on a heuristic, the initialization of the algorithm can influence the final result.
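Comparing two partitions requires ignoring the names of the clusters, since each tool numbers them in its own way. A minimal Python sketch of such a check follows; the function name `same_partition` and the label vectors are hypothetical and do not come from the actual outputs.

```python
from collections import Counter

def same_partition(labels_a, labels_b):
    """True if both labelings define the same groups, whatever their names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must describe the same individuals")
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # Each label on one side must map to exactly one label on the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True

# Hypothetical cluster memberships produced by two tools.
tool_1 = ["C_KMEANS_2", "C_KMEANS_2", "C_KMEANS_1", "C_KMEANS_1"]
tool_2 = [1, 1, 2, 2]
print(same_partition(tool_1, tool_2))  # True
print(Counter(zip(tool_1, tool_2)))    # cross tabulation of the two labelings
```

When the check fails, the cross tabulation of the two labelings shows how many individuals move from one group to another.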
4.4 Interpretation of clusters
For the interpretation of the groups, we compute the conditional means of the continuous variables.
We obtain
We can compare these results to those of Tanagra: CLUS_1 of R is identical to C_KMEANS_2 of Tanagra.
In order to create a cross tabulation between the clusters and the ORIGIN categorical illustrative variable:
R supplies the following table.
We use the following instructions in order to create the scatterplot for each pair of variables.
We note that most of the variables are highly correlated. The groups are clearly discernible whatever the pair of variables used.
[Figure: scatterplot matrix of mpg, displacement, horsepower, weight and acceleration]
Last, we implement a PCA (Principal Component Analysis) for a multivariate characterization. We use the princomp(.) procedure.
With some adjustments, we obtain the same results as Tanagra.
Then we create the scatterplot in the representation space of the first two factors.
[Figure: scatterplot of acp$scores[, 2] against acp$scores[, 1]]
4.5 Exporting the dataset including the CLUSTER column
Last, we export both the original dataset and the column computed by K-Means.
5 K-Means with Knime
5.1 Creating a workflow and importing the dataset
We create a new workflow by clicking on the FILE / NEW menu. We choose the New Knime Project item.
Then, we load the dataset using the FILE READER component.
5.2 Descriptive statistics
We use the STATISTICS VIEW component for the computation of the descriptive statistics indicators. We connect the FILE READER component to it. Then we click on the EXECUTE AND OPEN VIEW menu. The results are displayed in a new window.
5.3 Standardizing the variables
The NORMALIZER component standardizes the variables. Different kinds of normalization are available. We select the appropriate settings by clicking on the CONFIGURE menu.
We can visualize the dataset with the INTERACTIVE TABLE component. Only the continuous variables are transformed, of course.
5.4 K-Means
We can launch the K-Means procedure. We add the K-Means component into the workflow. We click on the CONFIGURE menu in order to set the appropriate parameters.
By clicking on the EXECUTE AND OPEN VIEW menu, we obtain the results in a new window. There are 2 groups with respectively 292 and 100 instances. The conditional means are also displayed, but they are computed on the standardized variables, which is not really useful for the interpretation.
5.5 Interpretation of groups
5.5.1 Group membership
The INTERACTIVE TABLE component shows the cluster membership of each individual.
5.5.2 Descriptive statistics and graphical representation
Some preliminary manipulations are necessary before the calculation of the conditional descriptive statistics and the graphical representation.
PCA is not available in Knime. But it can perform a Multidimensional Scaling (MDS) [7]. We obtain the same factors when we launch this method on a similarity matrix (distance matrix) computed using a Euclidean distance [8]. We must thus compute this distance matrix using the PIVOT TABLE component. We set the appropriate parameters in order to compute the distance from the standardized variables. Only two latent variables are computed.
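The reason why MDS on Euclidean distances gives the same factors as PCA can be checked numerically. In classical (Torgerson) MDS, the matrix of squared distances is double-centered; the result is exactly the matrix of inner products of the centered data, whose eigen-decomposition yields the PCA scores up to sign. The Python sketch below verifies this identity on hypothetical toy rows. It assumes classical MDS; whether the Knime MDS component uses this variant or an iterative one is not stated in the tutorial.

```python
def centered(rows):
    """Subtract the column means from every row."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def gram(rows):
    """Matrix of inner products between rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def squared_distances(rows):
    """Matrix of squared Euclidean distances between rows."""
    return [[sum((a - b) ** 2 for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def double_center(d2):
    """B = -1/2 * (d2_ij - row mean_i - column mean_j + grand mean)."""
    n = len(d2)
    row_mean = [sum(r) / n for r in d2]
    col_mean = [sum(d2[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - col_mean[j] + grand)
             for j in range(n)] for i in range(n)]

# Hypothetical toy rows (already standardized values would work the same way).
rows = [[0.5, -1.2, 0.3], [1.1, 0.4, -0.7], [-0.9, 0.8, 1.5], [-0.7, 0.0, -1.1]]
b_from_distances = double_center(squared_distances(rows))
b_from_data = gram(centered(rows))
max_gap = max(abs(x - y)
              for r1, r2 in zip(b_from_distances, b_from_data)
              for x, y in zip(r1, r2))
print(max_gap)
```

Because the two matrices coincide, the distance matrix alone carries everything needed to recover the factorial coordinates.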
[7] http://en.wikipedia.org/wiki/Multidimensional_scaling
[8] http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm
We insert the INTERACTIVE TABLE component in order to check the merging operation.
We have the original variables and the two additional columns supplied by the MDS component.
Now, we must merge this dataset with the additional column, the cluster membership, supplied by the K-Means component. We perform the operation in two steps:
(1) With the COLUMN FILTER component, we select the cluster membership column from the K-Means component.
(2) With the JOINER component, we merge this column with the dataset.
We add the INTERACTIVE TABLE component in order to visualize the resulting dataset. The first group of variables is supplied by the various components; the second group comes from the original data file.
5.5.2.1 Descriptive statistics
Knime offers a very interesting tool: the conditional boxplot. We can visualize more information about the characteristics of the distributions: central tendency measures, the shape of the distribution, the outliers, etc. The drawback is that we must insert one component for each variable.
We add the CONDITIONAL BOXPLOT component into the workflow. We set the appropriate settings (CONFIGURE menu). Then we click on the EXECUTE AND OPEN VIEW menu.
We want to visualize the clusters in the representation space defined by a pair of variables. We must first specify the illustrative variable (CLUSTER) using the COLOR MANAGER component. We then add the SCATTERPLOT component after it in order to create the graphical representation.
5.5.2.4 Cross tabulation with the ORIGIN variable
The PIVOTING component creates a cross tabulation between the CLUSTER and the ORIGIN columns. We use the INTERACTIVE TABLE component in order to visualize the table.
5.6 Exportation of the dataset
6 K-Means with Orange
6.1 Creating a schema and importing the dataset
6.2 Descriptive statistics
Various descriptive statistics indicators are supplied by the ATTRIBUTE STATISTICS component. We can interactively select the variable in the left part of the visualization window. For a categorical variable, we obtain the frequency table.
6.3 Standardizing the variables
The ATTRIBUTE STATISTICS component allows us to check the transformation. All the continuous variables now have a mean of 0 and a standard deviation of 1.
6.4 K-Means
The K-MEANS CLUSTERING component is available in the ASSOCIATE tab. We connect CONTINUIZE to it. Then we click on the OPEN menu: we set the appropriate parameters and we click on the APPLY button. Orange indicates the number of instances in each group (250 and 142). It also supplies some fitness indicators for each group (see the help file for a detailed description).
6.5 Interpretation of the partitioning
Cluster membership. Like the other tools, Orange creates a new column (CLUSTER) which describes the cluster membership of each individual.
We can visualize this column with the DATA TABLE component.
Descriptive statistics. The DISTRIBUTIONS component computes the histogram of a variable according to the values of a categorical variable, the cluster membership in our case. Below, we have the histogram of the WEIGHT variable.
Note: The histograms are computed on the standardized variables in this part. Orange lacks a tool such as the JOINER of KNIME, which would allow us to recover the variables of the original data file in the subsequent part of the schema.
Scatter plot. The SCATTERPLOT component displays the instances according to two variables simultaneously.
Projection in the representation space of MDS. PCA is not available in Orange. As with Knime, we must first compute the distance matrix (the distance for each pair of instances). Then we perform a multidimensional scaling on this matrix. We define the following sequence of components in the schema. We click on the OPEN menu of the MDS component.
6.5.1 Cross tabulation
We click on the OPEN menu. The resulting visualization window seems mysterious at first. But if we examine the results carefully, we find the desired information.
In the selected field, we observe 105 instances. They correspond to the association between CLUSTER = 1 and ORIGIN = AMERICAN. Under the independence assumption, we should have 156.3. The chi-square statistic of the test for independence is 124.884.
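As a worked illustration, the contribution of this single cell to the chi-square statistic follows from the usual formula (observed - expected)² / expected. Only the two counts come from the text; the variable names are chosen for the illustration.

```python
observed = 105    # CLUSTER = 1 and ORIGIN = AMERICAN, as read in Orange
expected = 156.3  # count expected under independence, as reported by Orange

contribution = (observed - expected) ** 2 / expected
print(round(contribution, 2))  # 16.84
```

The statistic of 124.884 is the sum of such contributions over all the cells of the table.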
6.6 Exporting the dataset including the CLUSTER column
Finally, we use the SAVE component in order to export the dataset. Orange exports the standardized variables. I did not find how to recover the original variables together with the CLUSTER column.
7 K-Means with RapidMiner
7.1 Specifying the diagram
Here is the whole diagram.
We observe the following tools.
Accessing the data file. The CSVEXAMPLESOURCE component gives access to the dataset. The main parameters are: FILENAME specifies the file name; LABEL_NAME refers to the label of each instance (we use the ORIGIN variable in our tutorial; it is not really relevant, but it allows us to separate active and illustrative variables); COLUMN_SEPARATORS corresponds to the column separator (we set \t, i.e. the tabulation character).
Descriptive statistics. DATASTATISTICS describes the dataset through descriptive statistics indicators.
Standardization of the variables. The NORMALIZATION component is used for the standardization of the variables. Various formulas are available; we ask for the Z-transformation.
K-Means. KMEANS corresponds to the K-Means algorithm. We ask for 2 groups (K), and we want to use the CLUSTER column in the subsequent part of the diagram (ADD_CLUSTER_ATTRIBUTE). Furthermore, we want the clusters to be characterized with comparative descriptive statistics indicators (ADD_CHARACTERIZATION). The other settings are related to the computation (MAX_RUNS and MAX_OPTIMIZATION_STEPS).
Exporting the dataset including the cluster column. Finally, we export the dataset using the CSVEXAMPLESETWRITER component. We set the data file name and the column separator character.
7.2 Examining the results
After we save the diagram, we click on the [icon] button. A window summarizes the results. We can select the results associated with each component by clicking on the appropriate tab.
PLOT VIEW is a graphical tool. We can create a scatterplot with the SCATTER option.
CLUSTERMODEL. This tab describes the results of the clustering process. The TEXT VIEW option supplies the number of instances in each group (292 and 100). We also obtain the conditional means of the standardized variables.
The FOLDER VIEW and GRAPH VIEW options display the cluster membership of each case. CENTROID PLOT VIEW is a graphical representation of the conditional mean of each variable.
2 other tabs complete the results:
ZTRANFORM. It describes the parameters used for the standardization of the variables, i.e. the mean and the standard deviation of each variable.
DATASTATISTICS. It computes the descriptive statistics indicators.
8 Conclusion
In this tutorial, we show that most free tools can perform a K-Means clustering. Even if some details are different, especially in the presentation of the results, we note that they supply comparable results. This is rather encouraging for the use of these tools.