
IS ZC415 Data Mining

Lab 3: Classification

Objective: Classification using ID3 and C4.5 algorithms


Open the dataset lab3.arff provided to you. Note the relevant characteristics of
the dataset, especially the following:
Number of instances
Number and type of attributes
Possible values for each attribute
Number of missing attribute values
Number of classes
Class distribution
View the dataset using the Edit button as shown below.

We find that certain values in the dataset are missing, so some preprocessing is
needed. First we fill in the missing values. This can be done by using the filter
ReplaceMissingValues, as explained in the previous lab sheet and shown below.
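What the ReplaceMissingValues filter does can be sketched in a few lines of plain Python: numeric attributes receive the column mean and nominal attributes the column mode. This is only an illustration of the idea, not the filter itself, and the toy attribute values below are made up.

```python
# Sketch of Weka's ReplaceMissingValues behaviour:
# numeric columns get the mean, nominal columns the mode.
from collections import Counter
from statistics import mean

def replace_missing(rows, numeric_cols):
    """rows: list of lists; None marks a missing value."""
    cols = list(zip(*rows))
    fill = []
    for i, col in enumerate(cols):
        present = [v for v in col if v is not None]
        if i in numeric_cols:
            fill.append(mean(present))      # mean for numeric attributes
        else:                               # mode for nominal attributes
            fill.append(Counter(present).most_common(1)[0][0])
    return [[fill[i] if v is None else v for i, v in enumerate(row)]
            for row in rows]

data = [[1.0, "sunny"], [None, "rainy"], [3.0, None], [4.0, "sunny"]]
print(replace_missing(data, numeric_cols={0}))
```

Here the missing numeric value becomes the mean of 1.0, 3.0 and 4.0, and the missing nominal value becomes "sunny", the most frequent value in its column.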

After applying the filter, view the data by clicking Edit button as shown below.

We will still find that one value is missing in record number 23. Replace it
manually by selecting any value and clicking OK.

This finishes the preprocessing. Now our data is ready for classification.

We need to apply ID3 and C4.5 algorithms for classification.

Now, click on the Classify tab. At the top is a Classifier box, which has a text field giving the
name of the currently selected classifier and its options. Clicking on the text box brings up a
dialog box that you can use to configure the options of the current classifier. The Choose button
allows you to choose one of the classifiers available in Weka as shown below. Choose the ID3
classifier first.
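Before running ID3, it helps to recall its core idea: at each node the tree splits on the attribute with the highest information gain, i.e. the largest reduction in class entropy. A minimal sketch, using toy labels invented for illustration:

```python
# Hedged sketch of ID3's splitting criterion (information gain).
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Entropy of the labels minus the entropy remaining after the split."""
    n = len(labels)
    by_value = {}
    for v, y in zip(attr_values, labels):
        by_value.setdefault(v, []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
print(info_gain(["a", "a", "b", "b"], labels))  # → 1.0 (perfect split)
print(info_gain(["a", "b", "a", "b"], labels))  # → 0.0 (uninformative)
```

An attribute that separates the classes perfectly gains the full entropy (1.0 bit here); one whose branches are as mixed as the original data gains nothing.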

The result of applying the selected classifier will be tested according to the options that are set by
clicking in the Test options box. There are four test modes. Further testing options can be set by
clicking on the More options button.

The classifiers in Weka are designed to be trained to predict a single class attribute, which by
default is taken to be the last attribute. If you want to train a classifier to predict a different
attribute, click on the box below the Test options box to bring up a list of attributes to choose from.

We can choose Cross-validation or Percentage split. First we choose Cross-validation with 10
folds as shown in the figure below.
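The 10-fold mode selected above partitions the data into ten folds, trains on nine and tests on the held-out one, rotating until every instance has been tested once. A rough sketch of the fold bookkeeping (Weka additionally stratifies nominal classes; this sketch does not):

```python
# Unstratified k-fold cross-validation index generator.
def kfold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs covering all n instances."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(20, k=10))
print(len(splits))           # → 10 train/test pairs
print(sorted(splits[0][1]))  # first test fold
```

Every instance lands in exactly one test fold, so the reported accuracy is an average over predictions made on unseen data.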

Once the classifier, test options and class have all been set, the learning process is started by
clicking on the Start button as shown in the figure above.

When training is complete, the Classifier output area is filled with text describing the results of
training and testing. The output is split into several sections:
Run information
Classifier model on the full training set
Summary of results
Detailed accuracy by class
Confusion matrix
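The summary and the confusion matrix are directly related: the classification accuracy is the sum of the matrix diagonal (correctly classified instances) divided by the total. A minimal sketch with made-up predictions:

```python
# Build a confusion matrix and derive accuracy from its diagonal.
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    m = Counter(zip(actual, predicted))
    # rows: actual class, columns: predicted class (Weka's convention)
    return [[m[(a, p)] for p in classes] for a in classes]

actual    = ["yes", "yes", "no", "no", "no"]
predicted = ["yes", "no",  "no", "no", "yes"]
cm = confusion_matrix(actual, predicted, ["yes", "no"])
print(cm)                                            # → [[1, 1], [1, 2]]
accuracy = sum(cm[i][i] for i in range(2)) / len(actual)
print(accuracy)                                      # → 0.6
```

Reading the off-diagonal cells tells you which classes the model confuses, which a single accuracy figure hides.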

This looks as shown below in the snapshot.

Also, a new entry appears in the Result list box. Right-clicking an entry invokes a menu
containing many items.

It looks something like shown in the figure below.

Now, in Test options, we can choose other test modes such as Percentage split, and compare
the results obtained.


Apply the J48 (C4.5) algorithm on the prepared dataset and generate an unpruned tree.
Apply the J48 (C4.5) algorithm on the prepared dataset and generate a pruned tree.
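The main difference between ID3 and C4.5 at split time is the criterion: C4.5 uses gain ratio, which divides the information gain by the entropy of the split itself, so that many-valued attributes (like record IDs) are not unfairly favoured. A hedged sketch with invented toy labels:

```python
# Gain ratio, the C4.5 split criterion, sketched on toy data.
from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())

def gain_ratio(attr_values, labels):
    n = len(labels)
    by_value = {}
    for v, y in zip(attr_values, labels):
        by_value.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(ys) / n * entropy(ys)
                                 for ys in by_value.values())
    split_info = entropy(attr_values)   # penalty for many branches
    return gain / split_info if split_info else 0.0

labels = ["yes", "yes", "no", "no"]
# A unique-value attribute gets maximal gain but also maximal split
# info, so its gain ratio is deflated relative to a sensible attribute:
print(gain_ratio(["1", "2", "3", "4"], labels))  # → 0.5
print(gain_ratio(["a", "a", "b", "b"], labels))  # → 1.0
```

Under plain information gain both attributes would score 1.0; gain ratio halves the score of the ID-like attribute, which is why C4.5 trees tend to generalize better on such data.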

To select an unpruned or pruned tree, click on the text box beside the Choose button after selecting the
J48 classifier as shown in the figure below.

Then a window will popup as shown in the figure below.


Set the value of unpruned to True or False depending on the requirement, as shown in the figure below.

After setting these values, click on Start to generate the tree. Generate two trees, one pruned
and one unpruned.

Right-click on the entries in the Result list box for the previous two cases to visualize the
graphical representation of the tree built, by clicking on the Visualize tree option.

Compare the classification accuracy for the results on this dataset in the above three cases.


Try out different testing options and compare the accuracy and other parameters generated each
time. Suggest which of the three classifiers (ID3, unpruned J48, pruned J48) is appropriate for
the given dataset.


Divide the data into two parts, training and test. First apply each classifier to the training set and
then to the test set for both of the above algorithms, and compare the results.
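The hold-out split for this exercise can be sketched as follows; the 66/34 ratio mirrors Weka's default Percentage split, and the seed and stand-in data are made up for illustration:

```python
# Minimal hold-out split: shuffle, then cut at the training fraction.
import random

def train_test_split(rows, train_fraction=0.66, seed=1):
    rows = rows[:]                       # do not disturb the caller's list
    random.Random(seed).shuffle(rows)    # fixed seed for reproducibility
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

data = list(range(100))        # stand-ins for dataset instances
train, test = train_test_split(data)
print(len(train), len(test))   # → 66 34
```

Expect accuracy on the training set to be at least as high as on the test set; a large gap between the two (especially for the unpruned tree) is the usual sign of overfitting.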
