
Tanagra

R.R.

1 Subject
Implementing the K-Means clustering algorithm with various data mining tools.

K-means is a clustering (unsupervised learning) algorithm [1]. The aim is to create homogeneous subgroups of examples. The individuals in the same subgroup are similar; the individuals in different subgroups are as different as possible.

The K-Means approach is already described in several tutorials (http://data-mining-tutorials.blogspot.com/search?q=kmeans). The goal here is to compare its implementation with various free tools. We study the following tools: Tanagra 1.4.28; R 2.7.2 without additional package; Knime 1.3.5; Orange 1.0b2 and RapidMiner Community Edition.

The steps of the data analysis are the following:
- Importing the data file;
- Computing some descriptive statistics indicators;
- Standardizing the variables;
- Implementing the K-means algorithm on the standardized variables;
- Visualizing the cluster membership of each individual;
- Interpreting the clusters with conditional descriptive statistics indicators or graphical representations;
- Comparing the clusters with a prespecified grouping defined by an illustrative categorical variable;
- Exporting the dataset in a file, including the new cluster membership column.

These steps are usual in a clustering approach. The main interest of this tutorial is to show that we can implement these steps whatever the tool used. Of course, I have not mastered all the functionalities of all the tools, so in some situations I may not use the most efficient procedure.

2 Dataset
We use the cars_dataset.txt [2] data file. It describes the characteristics of 392 vehicles. The active variables, which take part in the creation of the clusters, are the consumption (MPG), the DISPLACEMENT, the HORSEPOWER, the WEIGHT and the ACCELERATION. The illustrative variable, which is used only to strengthen the interpretation of the clusters, is ORIGIN (Japan, Europe and USA).

3 K-Means with TANAGRA


In this section, we give the details of the operations with Tanagra. We give only the instructions and the resulting output for the other tools.

3.1 Creating a diagram and importing the dataset

After we launch Tanagra, we click on the FILE / NEW menu in order to create a new diagram. We select the CARS_DATASET.TXT data file.
[1] http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm
[2] http://eric.univ-lyon2.fr/~ricco/tanagra/fichiers/cars_dataset.zip; from the STATLIB server, http://lib.stat.cmu.edu/datasets/cars.desc

17 June 2009


392 observations and 6 variables are loaded.
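For readers who want to check the import step outside the tools studied here, the following is a minimal Python sketch (Python is not one of the tools compared in this tutorial). It assumes a tab-separated text file with a header row; the helper name `load_table` and the two sample rows are hypothetical and only illustrate the layout.

```python
import csv
import io

def load_table(handle, delimiter="\t"):
    """Read a delimited text table with a header row.

    Returns (header, rows); every value is kept as a string.
    """
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]  # skip blank lines
    return header, rows

# Hypothetical two-row sample using the variable names of the tutorial.
sample = io.StringIO(
    "mpg\tdisplacement\thorsepower\tweight\tacceleration\torigin\n"
    "18.0\t307.0\t130\t3504\t12.0\tamerican\n"
    "24.0\t113.0\t95\t2372\t15.0\tjapanese\n"
)
header, rows = load_table(sample)
print(len(rows), "observations and", len(header), "variables are loaded.")
```

With the real data file, we would pass an opened file object instead of the in-memory sample.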

3.2 Descriptive statistics

We want to obtain an overview of the main characteristics of the dataset. We add a DEFINE STATUS component into the diagram. We set all the continuous variables as INPUT. These are the active variables of the analysis, i.e. they are used during the clustering process.


We add the MORE UNIVARIATE CONT STAT component (STATISTICS tab). We click on the contextual VIEW menu.

It seems that there are no anomalies, and nothing that requires a specific pretreatment, in our dataset.
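The kind of indicators we inspect at this step can be sketched in a few lines of Python. This is only an illustration: the function name `describe` and the sample values are hypothetical, and the exact list of indicators shown by each tool may differ.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for one continuous variable."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

mpg_sample = [18.0, 24.0, 15.0, 31.0, 22.0]  # hypothetical values
summary = describe(mpg_sample)
print(summary)
```

An unexpected minimum or maximum, or a count lower than the number of rows, is the kind of anomaly this overview is meant to reveal.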

3.3 Standardizing the active variables

We want to standardize the variables before performing the K-Means approach. The aim is to eliminate the discrepancy of scales between the variables [3]. We add the STANDARDIZE component (FEATURE CONSTRUCTION tab) into the diagram. Then, we click on the VIEW menu.

[3] In fact, this operation is not necessary with Tanagra. It can automatically standardize the variables with the K-Means component. We use this step explicitly for the comparison with the other tools.


5 new variables are now available for further processing.
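The standardization applied to each active variable can be sketched as follows in Python. This is a minimal illustration, not the code of any of the tools: the function name `standardize` and the sample values are hypothetical, and we assume the sample formula (division by n - 1) for the standard deviation, which a given tool may or may not use.

```python
import statistics

def standardize(values):
    """Center a variable on its mean and scale it by its standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # assumption: sample formula (n - 1)
    return [(v - mean) / std for v in values]

weight_sample = [3504.0, 2372.0, 4341.0, 1985.0]  # hypothetical values
z = standardize(weight_sample)
print(round(statistics.mean(z), 10), round(statistics.variance(z), 10))
```

After the transformation every variable has mean 0 and variance 1, so no variable dominates the Euclidean distance merely because of its unit.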

3.4 K-Means

We want to use these transformed variables for the analysis. We insert a new DEFINE STATUS component into the diagram. We set as INPUT the computed attributes (from STD_MPG_1 to STD_ACCELERATION_1).

We insert the K-MEANS component (CLUSTERING tab). We click on the PARAMETERS contextual menu. We set the following parameters.


We ask for a partitioning into two groups. It is not necessary to normalize the distance because we already use standardized variables. We validate and we click on the VIEW menu.

The TSS (total sum of squares) is 1954.9999; the WSS (within sum of squares) is 831.6058. The BSS (between sum of squares) explained by the partitioning is (1954.9999 - 831.6058) = 1123.3941. The resulting ratio is (1123.3941 / 1954.9999) = 57.46%. There are 100 examples in the first cluster; 292 examples in the second one.
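The arithmetic above can be reproduced with a short Python sketch. The two input values are those reported in the text; the final check is our own observation and rests on the assumption that the variables were standardized with the sample (n - 1) formula, in which case the total sum of squares of p standardized variables is (n - 1) * p.

```python
tss = 1954.9999  # total sum of squares reported in the text
wss = 831.6058   # within-cluster sum of squares reported in the text

bss = tss - wss    # between-cluster sum of squares
ratio = bss / tss  # share of the total sum of squares explained

print(f"BSS = {bss:.4f}, explained = {ratio:.2%}")

# Sanity check (assumption: standardization with the n - 1 formula):
# each standardized variable contributes n - 1 to the total sum of squares.
n, p = 392, 5
print((n - 1) * p)
```

The check gives 1955, which agrees with the reported TSS up to rounding.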


In the lower part of the window, the CLUSTERS CENTROIDS section gives the average for each variable according to the clusters.

3.5 Interpretation of groups

We are now in the major step of the clustering process: we want to interpret the groups. What are the characteristics of each cluster? What differentiates them from each other?

3.5.1 Group membership of individuals

We can inspect the group membership of each individual. This approach is especially useful if we deal with a small dataset and if we can identify each instance (e.g. each individual is labeled).

TANAGRA automatically computes and adds a new column to the current dataset. We can visualize it with the VIEW DATASET component (DATA VISUALIZATION tab).

3.5.2 Conditional descriptive statistics

Another, more useful, approach is to compute the descriptive statistics indicators according to the cluster. By comparing them, we can understand the main characteristics of each cluster, i.e. which variables allow us to differentiate the clusters.


We insert the DEFINE STATUS component into the diagram. We set as TARGET the computed column (CLUSTER_KMEANS_1), as INPUT the other attributes, including the illustrative variable (ORIGIN). Then we add the GROUP CHARACTERIZATION component (STATISTICS tab).

We note that the second cluster (C_K_MEANS_2) corresponds mainly to small cars with low consumption (the mean of MPG is 26.66 in the group while it is 23.49 in the whole dataset), with a small DISPLACEMENT, etc.
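The comparison between a conditional mean and the overall mean can be sketched in Python as follows. The function name `conditional_means`, the values and the cluster labels are hypothetical; the sketch only shows the computation, not the full output of the GROUP CHARACTERIZATION component.

```python
from collections import defaultdict

def conditional_means(values, clusters):
    """Mean of one variable within each cluster, plus the overall mean."""
    groups = defaultdict(list)
    for value, cluster in zip(values, clusters):
        groups[cluster].append(value)
    means = {c: sum(v) / len(v) for c, v in groups.items()}
    overall = sum(values) / len(values)
    return means, overall

# Hypothetical MPG values and cluster labels.
mpg = [14.0, 15.0, 16.0, 26.0, 28.0, 30.0]
labels = ["c1", "c1", "c1", "c2", "c2", "c2"]
means, overall = conditional_means(mpg, labels)
print(means, overall)
```

A cluster whose mean lies far from the overall mean on a variable is characterized by that variable.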


In order to characterize the strength of the difference, we use the "test value" criterion (http://data-mining-tutorials.blogspot.com/2009/05/understanding-test-value-criterion.html).

We can use either the active or the illustrative variables in order to characterize the groups. In our dataset, we use the ORIGIN variable for the group interpretation. We note for instance that the first cluster (C_K_MEANS_1) consists only of American cars. They have a high consumption (MPG is 14.75 in the group), etc.

3.5.3 Cross tabulation between the group membership and an illustrative variable

We can also highlight the association between the cluster membership and an illustrative variable using a cross tabulation. We insert a DEFINE STATUS component. We set ORIGIN as TARGET and C_KMEANS_1 as INPUT.

We add the CONTINGENCY CHI-SQUARE component (NONPARAMETRIC STATISTICS tab) into the diagram. We click on the VIEW menu.

The results are of course consistent with those of the GROUP CHARACTERIZATION component. We have here more information about the strength of the association. Some statistical indicators, such as Cramer's V, are available. We can check if the association is statistically significant.

We can also display the results as row or column percentages.
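To make the indicators of the cross tabulation concrete, here is a minimal Python sketch of the Pearson chi-square statistic and of Cramer's V. The function name `chi_square` and the counts are hypothetical (they are not the counts of the cars dataset), and the sketch assumes that no row or column of the table is empty.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and Cramer's V for a contingency table.

    `table` is a list of rows of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under the independence assumption.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0])) - 1
    cramers_v = math.sqrt(chi2 / (n * k))
    return chi2, cramers_v

# Hypothetical counts: rows = 3 origins, columns = 2 clusters.
counts = [[30, 10], [5, 25], [5, 25]]
chi2, v = chi_square(counts)
print(round(chi2, 3), round(v, 3))
```

Cramer's V rescales the chi-square statistic to the range 0 to 1, which makes the strength of the association easier to read than the raw statistic.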


3.5.4 Scatterplot

Another way to highlight the results is the graphical representation. The scatter plot is a very useful tool in this context [4]. We can position the groups according to two variables simultaneously. Thus we can check if there are interactions between variables.

[4] http://en.wikipedia.org/wiki/Scatter_plot


We add the SCATTERPLOT component (DATA VISUALIZATION tab). We click on the VIEW menu. We set WEIGHT on the horizontal axis, HORSEPOWER on the vertical axis. We use the cluster membership to colorize the points.

3.5.5 Graphical representation using principal component analysis

In order to take into consideration the interactions between more than two variables, we can use a principal component analysis (PCA) and build a graphical representation on the first two factors. If these axes are relevant, the relative location of the groups in this representation space is quite faithful to their location in the original space.

We add the PRINCIPAL COMPONENT ANALYSIS component (FACTORIAL ANALYSIS tab) after the K-MEANS 1 component. Thus they use the same active variables. We click on the VIEW menu.

The first two factors account for 92.8% of the variation in the dataset. On the first factor, we have an opposition between the cars (1) with low consumption (MPG), not very fast (ACCELERATION), and (2) those which are powerful and heavy (HORSEPOWER, WEIGHT).

When we create a scatter plot and set the first factor (PCA_1_AXIS_1) on the horizontal axis and the second factor (PCA_1_AXIS_2) on the vertical axis, we note that the clusters are really distinct.


3.6 Exporting the dataset including the CLUSTER column

As the last step of our analysis, we want to export the dataset with the additional column which indicates the cluster membership of each individual. TANAGRA can create a data file in the text file format with tab separator. We can handle it with the majority of tools (spreadsheet, data mining tools, etc.) [5].

We must first specify the columns to export using the DEFINE STATUS component. We set as INPUT the original variables (MPG to ORIGIN) and the computed column (CLUSTER_K_MEANS_1).

[5] TANAGRA can also export in the XLS (EXCEL) and ARFF (WEKA) file formats.


We add the EXPORT DATASET component (DATA VISUALIZATION tab) into the diagram. In the settings dialog box (PARAMETERS menu), we specify that only the INPUT attributes must be exported. We can also define the directory and the file name. Then we validate and click on the VIEW menu.

A new data file (OUTPUT.TXT) with 392 observations and 7 variables is created.
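The structure of the exported file (the original columns plus one cluster column, separated by tabs) can be sketched in Python as follows. The function name `export_table`, the shortened header and the values are hypothetical; the sketch writes to an in-memory buffer rather than to OUTPUT.TXT.

```python
import csv
import io

def export_table(handle, header, rows, clusters, cluster_name="cluster"):
    """Write the rows plus a cluster column as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(list(header) + [cluster_name])
    for row, cluster in zip(rows, clusters):
        writer.writerow(list(row) + [cluster])

header = ["mpg", "origin"]  # hypothetical, shortened to two columns
rows = [["18.0", "american"], ["24.0", "japanese"]]
out = io.StringIO()
export_table(out, header, rows, ["c_kmeans_1", "c_kmeans_2"])
print(out.getvalue())
```

With 6 original variables, the added column gives the 7 variables mentioned above.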


4 K-Means with R
In this section, we duplicate the steps above using the R software (http://www.r-project.org/) [6].

4.1 Data importation and descriptive statistics

We set the following instructions in order to import the dataset and compute the descriptive statistics.

We obtain

4.2 Standardizing the variables

To transform the variables, we first create a callback function centrage_reduction(.) which standardizes one variable. Then we call the apply(.) function. The new data frame is voitures.cr.

The mean of the new variables is 0; their variance is 1.

4.3 K-Means with the standardized variables

We can now launch the K-Means algorithm on these new variables. We ask for a partitioning into two groups. We limit the number of iterations to 40.

[6] Unfortunately, the comments in the source code are in French. I apologize. I hope the instructions remain understandable.


R supplies among others: the number of examples in each cluster; the conditional mean according to the active variables; the cluster membership of each instance.

Note: It seems that we get the same groups as Tanagra. We should compare the 2 partitions to be sure. In some situations, we obtain a different partition from one data mining tool to the other. Indeed, since the approach relies on a heuristic, the initialization of the algorithm can influence the final result.
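The influence of the initialization can be illustrated with a minimal Python sketch of the standard iterative K-Means heuristic (Lloyd's algorithm). This is not the implementation of R or of any other tool studied here: the function `kmeans`, its `start` parameter (the indices of the points used as initial centroids) and the four data points are hypothetical choices made for the illustration.

```python
def kmeans(points, start, max_iter=40):
    """Lloyd's algorithm started from the points whose indices are in `start`.

    Returns (labels, wss) where wss is the within-cluster sum of squares.
    """
    centroids = [list(points[i]) for i in start]
    k = len(centroids)

    def sqdist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no change: the heuristic has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    wss = sum(sqdist(p, centroids[l]) for p, l in zip(points, labels))
    return labels, wss

# Hypothetical data: the four corners of a 4 x 1 rectangle.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
print(kmeans(corners, start=[0, 2]))  # one start on each short side
print(kmeans(corners, start=[0, 1]))  # both starts on the same short side
```

On this toy example the two starting configurations converge to two different partitions with different within-cluster sums of squares, which is why several runs, or a comparison between tools, can be worthwhile.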

4.4 Interpretation of clusters

For the interpretation of the groups, we compute the conditional mean for the continuous variables.

We obtain

We can compare these results to those of Tanagra: CLUS_1 of R is identical to C_KMEANS_2 of Tanagra.
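Because the numbering of the clusters is arbitrary, comparing two partitions means checking that they define the same groups up to the label names. A minimal Python sketch of such a check is given below; the function name `same_partition` and the two short label vectors are hypothetical, and we assume both vectors list the same individuals in the same order.

```python
from collections import Counter

def same_partition(labels_a, labels_b):
    """True if two labelings define the same groups up to label names."""
    pairs = Counter(zip(labels_a, labels_b))
    # Same partition: each label of A meets exactly one label of B and
    # vice versa, so the cross tabulation has one non-empty cell per group.
    a_seen = [a for a, _ in pairs]
    b_seen = [b for _, b in pairs]
    return len(set(a_seen)) == len(a_seen) and len(set(b_seen)) == len(b_seen)

tool_a = ["C_KMEANS_2", "C_KMEANS_2", "C_KMEANS_1", "C_KMEANS_1"]  # hypothetical
tool_b = ["CLUS_1", "CLUS_1", "CLUS_2", "CLUS_2"]                  # hypothetical
print(same_partition(tool_a, tool_b))
```

The `pairs` counter is simply the cross tabulation of the two membership columns, so the same check can be done by eye on a contingency table.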


In order to create a cross tabulation between the clusters and the ORIGIN categorical illustrative variable:

R supplies the following table.

We use the following instructions in order to create the scatter plot according to each pair of variables.

We note that most of the variables are highly correlated. The groups are clearly discernible whatever the pairs of variables used.

[Figure: scatter plot matrix of mpg, displacement, horsepower, weight and acceleration.]


Last, we implement a PCA (Principal Component Analysis) for a multivariate characterization. We use the princomp(.) procedure.

With some adjustments, we obtain the same results as Tanagra.

Then we create the scatter plot in the representation space of the first two factors.

[Figure: scatter plot of acp$scores[, 2] against acp$scores[, 1].]

4.5 Exporting the dataset including the CLUSTER column

Last, we export both the original dataset and the K-Means computed column.


5 K-Means with KNIME


In this section, we duplicate the steps above using the Knime software (http://www.knime.org/).

5.1 Creating a workflow and importing the dataset

We create a new workflow by clicking on the FILE / NEW menu. We choose the New Knime Project item.

Then, we load the dataset using the FILE READER component.


5.2 Descriptive statistics

We use the STATISTICS VIEW component for the computation of the descriptive statistics indicators. We connect the FILE READER component to this last one. Then we click on the EXECUTE AND OPEN VIEW menu. The results are displayed in a new window.

5.3 Standardizing the variables

The NORMALIZER component standardizes the variables. We can implement different kinds of normalization. We select the appropriate settings by clicking on the CONFIGURE menu.


We can visualize the dataset with the INTERACTIVE TABLE component. Only the continuous variables are transformed of course.

5.4 K-Means

We can launch the K-Means procedure. We add the K-Means component into the workflow. We click on the CONFIGURE menu in order to set the appropriate parameters.


By clicking on the EXECUTE AND OPEN VIEW menu, we obtain the results in a new window. There are 2 groups with respectively 292 and 100 instances. The conditional means are also displayed, but computed on the standardized variables. This is not really useful for the interpretation.

5.5 Interpretation of groups

5.5.1 Group membership

The INTERACTIVE TABLE component allows us to visualize the cluster membership of each individual.

5.5.2 Descriptive statistics and graphical representation

Some preliminary manipulations are necessary before the calculation of the conditional descriptive statistics and the graphical representation.

PCA is not available under Knime. But it can perform a Multidimensional Scaling (MDS) [7]. We obtain the same factors when we launch this method on a similarity matrix (distance matrix) computed using a Euclidean distance [8]. We must thus compute this distance matrix using the PIVOT TABLE component. We set the appropriate parameters in order to compute the distance from the standardized variables. Only two latent variables are computed.
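The distance matrix on which the MDS operates can be sketched in Python as follows. This only illustrates the input of the method (one Euclidean distance per pair of instances), not the scaling itself; the function name `distance_matrix` and the three observations are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math

def distance_matrix(rows):
    """Symmetric matrix of Euclidean distances between all pairs of rows."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])
            dist[i][j] = dist[j][i] = d
    return dist

# Hypothetical standardized observations (two variables each).
z_rows = [(0.0, 0.0), (3.0, 4.0), (-3.0, -4.0)]
for line in distance_matrix(z_rows):
    print(line)
```

The matrix has one row and one column per instance, so for 392 vehicles it holds 392 x 392 values.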

Two new columns are generated and available for the subsequent procedures. But we must join them to the original dataset with the JOINER component. The connection settings are very important here. We must set them with caution.

[7] http://en.wikipedia.org/wiki/Multidimensional_scaling
[8] http://www.mathpsyc.uni-bonn.de/doc/delbeke/delbeke.htm


We insert the INTERACTIVE TABLE component in order to check the merging operation.

We have the original variables and the two additional columns supplied by the MDS component.

Now, we must merge this dataset with the additional column, the cluster membership, supplied by the K-Means component. We perform the operation in two steps:

(1) With the COLUMN FILTER component, we select the cluster membership column from the K-Means component.


(2) With the JOINER component, we merge this column with the dataset.
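The two steps above amount to a join of two tables on a common row key. The Python sketch below illustrates the idea under the assumption that the rows are matched by a row identifier; the function name `join_on_key`, the identifiers and the values are hypothetical and do not reproduce the actual settings of the JOINER component.

```python
def join_on_key(left, right):
    """Inner join of two tables stored as {row_id: [values...]} dicts."""
    return {key: left[key] + right[key] for key in left if key in right}

# Hypothetical row identifiers and values.
original = {"Row0": [18.0, "american"], "Row1": [24.0, "japanese"]}
clusters = {"Row1": ["cluster_0"], "Row0": ["cluster_1"]}
merged = join_on_key(original, clusters)
print(merged)
```

Matching on the key rather than on the row position is what makes the merge safe when the two tables are not sorted in the same order.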

We add the INTERACTIVE TABLE component in order to visualize the resulting dataset. The first group of variables is supplied by the various components; the second group comes from the original data file.


5.5.2.1 Descriptive statistics

Knime offers a very interesting tool: the conditional boxplot. We can visualize more information about the characteristics of the distributions: central tendency measures, the shape of the distribution, the outliers, etc. The drawback is that we must insert one component for each variable.

We add the CONDITIONAL BOXPLOT component into the workflow. We set the appropriate settings (CONFIGURE menu). Then we click on the EXECUTE AND OPEN VIEW menu.

5.5.2.2 Scatter plot

We want to visualize the clusters in the representation space defined by a pair of variables. We must first specify the illustrative variable (CLUSTER) using the COLOR MANAGER component. We then add the SCATTERPLOT component in order to create the graphical representation.

We clearly distinguish the two groups. We note that it is possible, as in Tanagra, to interactively modify the variables on the horizontal axis and the vertical axis.

5.5.2.3 Scatter plot in the latent variables representation space (MDS)

With the tool above (SCATTERPLOT), we can also create the scatter plot in the representation space defined by the MDS component. We set the appropriate columns on the X and Y axes.

The result is very similar to those obtained by the PCA component under R or Tanagra. The two groups are clearly discernible according to the first factor.


5.5.2.4 Cross tabulation with the ORIGIN variable

The PIVOTING component creates a cross tabulation between the CLUSTER and the ORIGIN columns. We use the INTERACTIVE TABLE component in order to visualize the table.


5.6 Exportation of the dataset

Last, we want to export the dataset with the cluster membership column. First, we must filter the dataset in order to select the columns that we want to export. We use the COLUMN FILTER component. Second, we use the CSV WRITER component in order to create the data file. The resulting file is in the CSV format. We select ";" as the column separator.

The dataset can be imported easily into a spreadsheet or other data mining tools.

Knime is without any doubt a very powerful tool. The analysis capabilities are very large. But the definition of the appropriate succession of components is sometimes difficult. We need a little training to get the correct sequence of operations.

6 K-Means with ORANGE


ORANGE is a nice Data Mining tool. It is above all very easy to use (http://www.ailab.si/orange/). A comprehensive description is available for each component. It describes the goal and the settings of the approach; sometimes a detailed example is supplied. We must remember to press the F1 key when we need help.

6.1 Creating a schema and importation of the dataset

An empty schema is available when we launch Orange. We add the FILE component (DATA tab). We set the appropriate settings by clicking on the OPEN menu. We select our data file (CARS_DATASET.TXT).


6.2 Descriptive statistics

Various descriptive statistics indicators are supplied by the ATTRIBUTE STATISTICS component. We can interactively select the variable in the left part of the visualization window. For a categorical variable, we obtain the frequency table.

6.3 Standardizing the variables

The CONTINUIZE component standardizes the variables. In the settings dialog box, we can set the right approach according to the type of the variable.


The ATTRIBUTE STATISTICS component allows us to check the transformation. All the continuous variables now have a mean = 0 and a standard deviation = 1.

6.4 K-Means

The K-MEANS CLUSTERING component is available in the ASSOCIATE tab. We connect CONTINUIZE to this last one. Then we click on the OPEN menu: we set the appropriate parameters and we click on the APPLY button. Orange indicates the number of instances in each group (250 and 142). It also supplies some indicators of fitness for each group (see the help file for a detailed description).


6.5 Interpretation of the partitioning

Cluster membership. Like the other tools, Orange creates a new column (CLUSTER) which describes the cluster membership of each individual.

We can visualize this column with the DATA TABLE component.

Descriptive statistics. The DISTRIBUTIONS component computes the histogram of a variable according to the values of a categorical variable, the cluster membership in our case. Below, we have the histogram of the WEIGHT variable.

Note: The histograms are computed on the standardized variable in this part. A tool such as the JOINER of KNIME is missing in order to recover the variables of the original data file in the subsequent part of the schema.


Scatter plot. The SCATTERPLOT component allows us to visualize the instances according to two variables simultaneously.

Projection in the representation space of MDS. PCA is not available in Orange. As with Knime, we must first compute the distance matrix (the distance for each pair of instances). Then we perform a multidimensional scaling on this matrix. We define the following sequence of components in the schema. We click on the OPEN menu of the MDS component.

The tool is also interactive. We can define on the fly the illustrative variable which colorizes the points (GRAPH tab). We select the CLUSTER column here, but we can use any categorical variable. In the MDS tab, we select the STRESS function and we click on the OPTIMIZE button.

As we said above, the results are very similar to those of PCA. It is not really surprising.


6.5.1 Cross tabulation

The SIEVE DIAGRAM component creates a cross tabulation between CLUSTER and ORIGIN. We must first use the SELECT ATTRIBUTES component in order to specify the variables used. We set all the variables as INPUT.

We click on the OPEN menu. The resulting visualization window seems mysterious at first. But if we examine the results carefully, we note that we can observe the desired information.


In the selected field, we observe 105 instances. They correspond to the association between CLUSTER = 1 and ORIGIN = AMERICAN. Under the independence assumption, we should have 156.3. The chi-square statistic of the test for independence is 124.884.

6.6 Exporting the dataset including the CLUSTER column

Finally, we use the SAVE component in order to export the dataset. Orange exports the standardized variables. I did not find out how to recover the original variables with the CLUSTER column.


7 K-Means with RAPIDMINER


RAPIDMINER (http://rapid-i.com/content/blogcategory/38/69/) is the successor of YALE. Two versions are available; we use the free one, i.e. the Community Edition version.

It is not possible to launch each component when it is inserted into the diagram. Each time you activate the PLAY button, all the components of the diagram are executed. Fortunately, the computation is very fast. For this reason, unlike the other tools in this tutorial, we adopt a different approach: we first define the whole diagram, and then we launch all the computations.

7.1 Specifying the diagram

Here is the whole diagram.

We observe the following tools.

Accessing the data file. The CSVEXAMPLESOURCE component gives access to the dataset. The main parameters are: FILENAME specifies the file name; LABEL_NAME refers to the label of each instance, we use the ORIGIN variable in our tutorial, it is not really relevant but it allows us to separate the active and illustrative variables; COLUMN_SEPARATORS corresponds to the column separator, we set \t, i.e. the tabulation character.

Descriptive statistics. DATA STATISTICS describes the dataset through descriptive statistics indicators.


Standardization of the variables. The NORMALIZATION component is used for the standardization of the variables. Various formulas are available; we ask for the Z-transformation.

K-Means. KMEANS corresponds to the K-Means algorithm. We ask for 2 groups (K), and we want to use the CLUSTER column in the subsequent part of the diagram (ADD_CLUSTER_ATTRIBUTE). Furthermore, we want the clusters to be characterized with comparative descriptive statistics indicators (ADD_CHARACTERIZATION). The other settings are related to the computation (MAX_RUNS and MAX_OPTIMIZATION_STEPS).


Exporting the dataset including the cluster column. Finally, we export the dataset using the CSVEXAMPLESETWRITER component. We set the data file name and the column separator character.

7.2 Examining the results

After we save the diagram, we click on the PLAY button. A window summarizes the results. We can select the results associated with each component by clicking on the appropriate tab.


Description of the dataset. The DATA TABLE tab describes the dataset: META DATA VIEW gives the basic characteristics of the variables according to their type; DATA VIEW displays the values of the variables, including the CLUSTER column.

PLOT VIEW is a graphical tool. We can create a scatter plot with the SCATTER option.


CLUSTER MODEL. This tab describes the results of the clustering process. The TEXT VIEW option supplies the number of instances in each group (292 and 100). We also obtain the conditional mean according to the standardized variables.

The FOLDER VIEW and GRAPH VIEW options allow us to visualize the cluster membership of each case. CENTROID PLOT VIEW is a graphical representation of the conditional mean for each variable.

2 other tabs complete the results:


ZTRANFORM. It describes the parameters used for the standardization of the variables, i.e. the mean and the standard deviation of each variable.

DATA STATISTICS. It computes the descriptive statistics indicators.

8 Conclusion
In this tutorial, we show that almost all the free tools can perform a K-means clustering algorithm. Even if some details are different, especially for the presentation of the results, we note that they supply comparable results. It is rather encouraging for the use of these tools.
