
CIS764 - Step By Step Tutorial for Weka
By Mandar Haridas

Introduction to Weka: Weka is a tool comprising numerous machine learning algorithms that can be applied to data mining problems. Weka was developed at the University of Waikato, New Zealand. It is open source software issued under the GNU General Public License. Input data can be given directly to Weka, or the Weka algorithms can be called from Java code. The machine learning algorithms in Weka are themselves implemented in Java.

What this tutorial covers: In this tutorial we illustrate the capabilities of Weka and show how its algorithms are applied to an input data set. To start with, we discuss the file formats that Weka accepts as input. We then discuss how to convert CSV (comma separated values) files into ARFF (attribute relation file format) files. Finally, we feed the ARFF file that we create into Weka and show how Weka applies its algorithms to correct the input data.

Weka Input: As mentioned earlier, input to Weka is given as a data set. Weka permits the input data set to be in numerous file formats such as CSV (comma separated values: *.csv), Binary Serialized Instances (*.bsi), etc. However, the preferred and most convenient input file format is the attribute relation file format (ARFF). So the first step in Weka is always to take an input file and make sure that it is in ARFF. Here is what a typical ARFF data set looks like:
@relation studentdetails
@attribute studentid real
@attribute name string
@attribute sex {MALE, FEMALE}
@attribute points real
@attribute grade {A,B}

@data
1,abc,MALE,95,A
2,def,FEMALE,90,A
3,pqr,FEMALE,85,B
4,xyz,MALE,92,A

The first line, @relation, states the relation name. Next come the declarations of all attributes whose data appears in the data set. Attributes can be of types such as numeric (real), string, or nominal. ARFF has no separate Boolean type: Boolean and other categorical values are declared as nominal attributes, which list their permitted values in braces. Thus in the above file, the attributes sex and grade are nominal attributes. The last attribute is the default target attribute that is typically used for prediction. Finally, after the @data token, the data set contains comma separated data for the above mentioned attributes. These data rows can be considered analogous to relational database table rows.
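The structure described above can be sketched in a few lines of plain Java. This is only an illustration of how the header is laid out, not Weka's own ARFF parser; the class name ArffHeaderSketch and its methods are hypothetical, and the sketch assumes attribute names contain no spaces or quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Minimal ARFF header reader (illustration only, not Weka's parser). */
final class ArffHeaderSketch {

    /** One "@attribute name type" declaration. */
    static final class Attribute {
        final String name;
        final String type; // e.g. "real", "string" or "{MALE, FEMALE}"

        Attribute(String name, String type) {
            this.name = name;
            this.type = type;
        }

        /** Nominal attributes list their permitted values in braces. */
        boolean isNominal() {
            return type.startsWith("{");
        }
    }

    /** Collects the attribute declarations that appear before the @data line. */
    static List<Attribute> parseAttributes(String arff) {
        List<Attribute> attributes = new ArrayList<>();
        for (String raw : arff.split("\\R")) {
            String line = raw.trim();
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("@data")) {
                break; // everything after this token is data rows
            }
            if (!lower.startsWith("@attribute")) {
                continue; // skip @relation and blank lines
            }
            // Split into keyword, name, and the remainder (the type).
            String[] parts = line.split("\\s+", 3);
            if (parts.length == 3) {
                attributes.add(new Attribute(parts[1], parts[2]));
            }
        }
        return attributes;
    }

    public static void main(String[] args) {
        String arff = String.join("\n",
                "@relation studentdetails",
                "@attribute studentid real",
                "@attribute name string",
                "@attribute sex {MALE, FEMALE}",
                "@attribute points real",
                "@attribute grade {A,B}",
                "",
                "@data",
                "1,abc,MALE,95,A");
        List<Attribute> attrs = parseAttributes(arff);
        for (Attribute a : attrs) {
            System.out.println(a.name + " : " + a.type + (a.isNominal() ? " (nominal)" : ""));
        }
        // By the convention described above, the last attribute is the default target.
        System.out.println("default target: " + attrs.get(attrs.size() - 1).name);
    }
}
```

Run on the studentdetails example, the sketch reports sex and grade as nominal and grade as the default target attribute.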

CSV to ARFF: Input files are often not in ARFF to begin with. Typically, they are in CSV format, like:
studentid,name,sex,points,grade
1,abc,MALE,95,A
2,def,FEMALE,90,A
3,pqr,FEMALE,85,B
4,xyz,MALE,92,A

So our first task is to convert a CSV file to an ARFF file. The following code fragment, taken from the Weka wiki on sourceforge.net, is typically used to convert an input file from CSV to ARFF.

Picture 1: Code to convert CSV file to ARFF, in Eclipse.

Note that this requires two converters: CSVLoader, which loads the CSV file into an Instances object, and ArffSaver, which saves the Instances as an ARFF file.

Picture 2: Code to convert CSV file to ARFF, in Eclipse, highlighting import files.
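The code in the pictures is not reproduced in the text, so the following is a stand-in rather than the wiki fragment itself. It is a minimal sketch of the same conversion in plain Java, without Weka's CSVLoader and ArffSaver. The class name CsvToArffSketch and the type rule are assumptions made for illustration: a column is declared real if every value parses as a number, and nominal otherwise. The sketch also assumes that every row has the same number of fields and that no field contains a quoted comma.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Plain-Java CSV to ARFF conversion (illustration only, not Weka's converters). */
final class CsvToArffSketch {

    /** True if the text parses as a floating point number. */
    static boolean isNumber(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    /** Converts CSV text with a header row into ARFF text. */
    static String convert(String relation, String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] names = lines[0].split(",");

        List<String[]> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i].trim();
            if (!line.isEmpty()) {
                rows.add(line.split(",", -1));
            }
        }

        StringBuilder out = new StringBuilder();
        out.append("@relation ").append(relation).append("\n\n");

        // One @attribute line per CSV column.
        for (int col = 0; col < names.length; col++) {
            boolean numeric = true;
            Set<String> labels = new LinkedHashSet<>(); // keeps first-seen order
            for (String[] row : rows) {
                String value = row[col].trim();
                if (!isNumber(value)) {
                    numeric = false;
                }
                labels.add(value);
            }
            String type = numeric ? "real" : "{" + String.join(",", labels) + "}";
            out.append("@attribute ").append(names[col].trim()).append(' ').append(type).append('\n');
        }

        // The data rows are copied unchanged after the @data token.
        out.append("\n@data\n");
        for (String[] row : rows) {
            out.append(String.join(",", row)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String csv = String.join("\n",
                "studentid,name,sex,points,grade",
                "1,abc,MALE,95,A",
                "2,def,FEMALE,90,A",
                "3,pqr,FEMALE,85,B",
                "4,xyz,MALE,92,A");
        System.out.print(convert("studentdetails", csv));
    }
}
```

Under this simple rule the name column comes out as a nominal attribute rather than a string attribute, which is one reason to prefer Weka's own converters for real data.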

Following is an example CSV file:

outlook,temperature,humidity,windy,play
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

When converted to ARFF, it looks as follows:


@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

This file is provided as an example file when WEKA is installed. We now deliberately delete a few values of some attributes from the file and illustrate how, with an appropriate filter, WEKA replaces these missing values and corrects the data. Here is what WEKA looks like:

Picture 3: Weka Interface.

We now show how Weka algorithms and filters are used to predict missing values in a given input ARFF file. The input file is fed into Weka as follows:

Picture 4: Input file being loaded into Weka.

Once the input file is opened, Weka performs some preprocessing analysis. All attributes in the input file are shown in the Attributes window. Properties of the selected attribute, such as the attribute name, attribute type and number of missing values, are displayed in the Selected attribute window. A histogram at the bottom right of the screen shows the number of records for which the attribute play is yes and the number of records for which play is no. The attribute play is selected by default because, as mentioned earlier, the last attribute is the default target attribute that is typically used for prediction.
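The counts behind that histogram can be reproduced by hand. The following is a minimal sketch in plain Java, not Weka code, that tallies the values in the last column of the data rows; the class name ClassCountSketch is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Counts how often each value of the last (target) attribute occurs. */
final class ClassCountSketch {

    static Map<String, Integer> countLastColumn(String[] rows) {
        Map<String, Integer> counts = new LinkedHashMap<>(); // keeps first-seen order
        for (String row : rows) {
            String[] fields = row.split(",");
            String label = fields[fields.length - 1].trim();
            counts.merge(label, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The 14 data rows of the weather example.
        String[] rows = {
            "sunny,85,85,FALSE,no", "sunny,80,90,TRUE,no",
            "overcast,83,86,FALSE,yes", "rainy,70,96,FALSE,yes",
            "rainy,68,80,FALSE,yes", "rainy,65,70,TRUE,no",
            "overcast,64,65,TRUE,yes", "sunny,72,95,FALSE,no",
            "sunny,69,70,FALSE,yes", "rainy,75,80,FALSE,yes",
            "sunny,75,70,TRUE,yes", "overcast,72,90,TRUE,yes",
            "overcast,81,75,FALSE,yes", "rainy,71,91,TRUE,no"
        };
        System.out.println(countLastColumn(rows));
    }
}
```

For the unmodified weather data this gives 9 records with play = yes and 5 with play = no.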

Picture 5: Weka PreProcessing.

The input ARFF file is now edited and we deliberately delete values of certain attributes:

Picture 6: Values deliberately deleted from the input ARFF file

Weka provides a number of filters that are applied to input data depending on which facet of the data needs to be corrected. For instance, if certain attributes need to be removed from the input, the Remove filter is applied. If numeric data needs to be cleansed of values that are too small, too big or very close to a certain value, the NumericCleaner filter is applied. Following are screenshots of some of the filters available in WEKA.

Picture 7: Weka Filter InterquartileRange

Picture 8: Weka Filter PropositionalToMultiInstance

We use a filter called ReplaceMissingValues. This filter replaces all missing values for nominal and numeric attributes in a dataset with the modes (for nominal attributes) and means (for numeric attributes) from the training data. Select this filter and click Apply.

Picture 9: ReplaceMissingValues Filter applied on Input Dataset

The filter fills in the missing values using the means and modes of the corresponding attributes.

Figure 10: Missing Values predicted by WEKA algorithms.
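The mean and mode replacement described above can be sketched in plain Java as follows. This is an illustration of the idea, not the source of Weka's ReplaceMissingValues filter; the class name ReplaceMissingSketch is hypothetical, missing values are assumed to be written as ? (the ARFF marker for a missing value), and ties for the mode are broken in favour of the value seen first, which is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Mean/mode replacement of missing values in one column (illustration only). */
final class ReplaceMissingSketch {

    static final String MISSING = "?";

    /** Replaces missing entries of a numeric column with the mean of the known values. */
    static String[] fillNumeric(String[] column) {
        double sum = 0;
        int count = 0;
        for (String v : column) {
            if (!MISSING.equals(v)) {
                sum += Double.parseDouble(v);
                count++;
            }
        }
        String[] out = column.clone();
        if (count == 0) {
            return out; // nothing known, so nothing to fill in
        }
        String mean = String.valueOf(sum / count);
        for (int i = 0; i < out.length; i++) {
            if (MISSING.equals(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }

    /** Replaces missing entries of a nominal column with the most frequent known value. */
    static String[] fillNominal(String[] column) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String v : column) {
            if (!MISSING.equals(v)) {
                counts.merge(v, 1, Integer::sum);
            }
        }
        String mode = null;
        int best = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > best) { // strict comparison: first-seen value wins ties
                best = e.getValue();
                mode = e.getKey();
            }
        }
        String[] out = column.clone();
        if (mode == null) {
            return out;
        }
        for (int i = 0; i < out.length; i++) {
            if (MISSING.equals(out[i])) {
                out[i] = mode;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical columns with one value deleted from each.
        String[] temperature = {"85", "?", "83", "70"};
        String[] outlook = {"sunny", "rainy", "?", "rainy"};
        System.out.println(Arrays.toString(fillNumeric(temperature)));
        System.out.println(Arrays.toString(fillNominal(outlook)));
    }
}
```

Because every missing value in a column receives the same mean or mode, this is a simple form of imputation rather than a per-record prediction.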

Next, we validate the values predicted by the filter. This is done using the Classify option in Weka and selecting an appropriate classifier. We select the J48 classifier, Weka's decision tree learner.

Figure 11: Classifier J48 selected for input data set classification

The classifier builds a model that predicts the target attribute (here play) from the other attributes, using the data set as completed by the preprocess filter, and compares its predictions with the recorded values. After the classification is complete, Weka outputs statistics such as the percentage of correctly classified instances and of incorrectly classified instances. The classifier output can also be viewed as a tree, which may give a clearer idea of the classifier model and algorithm.

Figure 12-1: Classifier Output Statistics

Figure 12-2: Classifier Output Statistics

Figure 13: Classifier Tree Visualizer
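The headline statistic in the classifier output, the percentage of correctly classified instances, is a simple ratio. The sketch below shows the arithmetic in plain Java; it is not Weka's evaluation code, the class name AccuracySketch is hypothetical, and the labels in main are made up for illustration rather than taken from the figures.

```java
/** Percentage of correctly classified instances (illustration only). */
final class AccuracySketch {

    /** Returns 100 * (number of matching labels) / (number of instances). */
    static double percentCorrect(String[] actual, String[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i].equals(predicted[i])) {
                correct++;
            }
        }
        return 100.0 * correct / actual.length;
    }

    public static void main(String[] args) {
        // Hypothetical labels, not the output shown in the figures.
        String[] actual = {"yes", "no", "yes", "yes"};
        String[] predicted = {"yes", "no", "no", "yes"};
        double correct = percentCorrect(actual, predicted);
        System.out.println("Correctly classified:   " + correct + " %");
        System.out.println("Incorrectly classified: " + (100.0 - correct) + " %");
    }
}
```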

Conclusion: Weka helps in realizing the goal of data mining, in this case by predicting missing values and validating that the predicted values are indeed correct. We used Weka by first converting a file in CSV format into ARFF and then providing it as the input file. Note that it is also possible to call Weka algorithms directly from Java code, for example from within Eclipse. However, this was not explored as part of this tutorial.

References:

http://www.cs.waikato.ac.nz/ml/weka/
http://weka.sourceforge.net/wiki/
Ian H. Witten and Eibe Frank (2005), "Data Mining: Practical machine learning tools and techniques", 2nd Edition, Morgan Kaufmann, San Francisco.
