
Lab 1 Report

Chinmay Dhawan
i6171015

Experiment 1:

KNN Classification Results:

AUC Values (+1, -1)


             K = 1         K = 5         K = 11        K = 21
Data Set 1   .988, .988    .991, .991    .993, .993    .995, .995
Data Set 2   .963, .963    .993, .993    .991, .991    .984, .984
Data Set 3   .815, .815    .942, .942    .989, .989    .989, .989

• As K increases, the performance measured in terms of ROC AUC increases for Dataset 1; the change is small but consistent.

For Dataset 2, the AUC peaks at k = 5 and decreases slightly for k = 11 and k = 21. For Dataset 3, the AUC increases sharply from k = 1 to k = 5 and stays approximately the same for k = 11 and k = 21.

KNN is sensitive to the structure of the data, which explains these different trends. For Dataset 1, the numbers of positive and negative instances are equal and much larger than the values of K used, so changing K has little impact on performance. For Dataset 3, the number of positive instances is only a fraction (10%) of the number of negative instances, and for this structure a higher value of K such as 21 leads to greater misclassification of positive instances. The AUC value for (Dataset 3, k = 21) is still very good; this can be explained by the true negative rate remaining high, since the dataset consists mostly of negative instances. For such unbalanced datasets, looking at the true positive and false positive rates separately may give a clearer picture than the AUC alone (see the sketch after the next bullet).

• As we move from Dataset 1 to Dataset 2 and then to Dataset 3, the performance (in terms of AUC) decreases, because the datasets become more unbalanced and the chance of misclassifying the class with fewer instances increases.
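A minimal sketch of how such AUC values can be computed for several values of K, assuming scikit-learn and a synthetic unbalanced dataset (about 10% positives) standing in for Dataset 3; the actual lab data and tool are not reproduced here:

```python
# Minimal sketch: AUC of plain KNN for several values of K on a synthetic,
# unbalanced toy dataset (roughly 10% positives, standing in for Dataset 3).
# Assumes scikit-learn; not the report's actual datasets or tool.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

for k in (1, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores = knn.predict_proba(X_test)[:, 1]   # score for the positive class
    print(f"k = {k:2d}: AUC = {roc_auc_score(y_test, scores):.3f}")
```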

KNN Classification with weighted distances (1/distance):

AUC Values (+1, -1)


             K = 5         K = 11        K = 21
Data Set 1   .996, .996    .997, .997    .998, .998
Data Set 2   .995, .995    .996, .996    .994, .994
Data Set 3   .961, .961    .989, .989    .989, .989

• The overall performance has improved, since distance-weighted KNN is more robust than plain KNN when a larger-than-optimal value of K is used, because more weight is given to the closer instances.
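A minimal sketch of the distance-weighted variant, again assuming scikit-learn and toy data rather than the actual lab files; the only change from plain KNN is weights='distance' (vote weight proportional to 1/distance):

```python
# Minimal sketch: distance-weighted KNN on a toy unbalanced dataset.
# Assumes scikit-learn; illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for k in (5, 11, 21):
    # weights='distance' gives closer neighbours a larger vote (1/d),
    # which softens the effect of choosing a larger-than-optimal K.
    knn = KNeighborsClassifier(n_neighbors=k, weights='distance').fit(X_tr, y_tr)
    print(f"k = {k:2d}: AUC = {roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1]):.3f}")
```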

Decision Trees AUC values for Experiment 1:

AUC Values (+1, -1)


Data Set 1 .993, .993
Data Set 2 .985, .985
Data Set 3 .898, .898

Bayes Classification AUC values for Experiment 1:

AUC Values (+1, -1)


Data Set 1 .991, .991
Data Set 2 .986, .986
Data Set 3 .987, .987

• Decision Trees also perform quite well, since simple rules with threshold values of x and y for classifying the instances can easily be derived from the dataset. Bayes classification gives performance comparable to KNN. For Dataset 3, Naive Bayes acts like a majority classifier, giving a true positive rate of zero. This is not evident from the AUC value alone, since the number of positive instances is much smaller than the number of negative instances.
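A minimal sketch, assuming scikit-learn and a synthetic unbalanced stand-in dataset, of how the AUC and the true positive rate at the default 0.5 threshold can be reported side by side, since a high AUC can hide a near-zero TPR on an unbalanced set:

```python
# Minimal sketch: decision tree vs. Gaussian Naive Bayes, reporting both AUC
# and the true-positive rate at the default threshold. Toy data only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score, recall_score

X, y = make_classification(n_samples=2000, n_features=2, n_informative=2,
                           n_redundant=0, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=1)),
                  ("naive Bayes", GaussianNB())]:
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    auc = roc_auc_score(y_te, scores)
    tpr = recall_score(y_te, clf.predict(X_te), pos_label=1)  # TPR at 0.5 threshold
    print(f"{name}: AUC = {auc:.3f}, TPR = {tpr:.3f}")
```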
Experiment 2:

AUC Values

            Decision Trees   Naïve Bayes   KNN | K = 1   KNN | K = 10
Dataset 1   1                .514          1             1

• For Dataset 1, Decision Trees and the K-Nearest Neighbour algorithms perform best when compared by the AUC metric. The reason is that the class depends solely on the values of the a0001 and a0003 variables: if both are the same, the class is 2; if they differ, the class is 1. It is easy for a decision tree to find this rule and hence reach 100% accuracy.
The KNN classifiers also perform very well. The dataset contains 1101 instances with 4 nominal features that can each take only two values, so it comprises a set of only 16 unique feature vectors. Any unknown instance is therefore easy to classify, because many identical instances, effectively at distance 0, already exist in the dataset. That is why KNN with k = 1 and k = 10 also results in perfect classification.

• Bayes classification assumes that, given the class, the features are independent of each other; but in this dataset, given the class, the variables a0001 and a0003 can be predicted from each other and are therefore completely dependent. That is why Bayes classification results in almost random performance (see the sketch below).
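A minimal sketch of this dependence on hypothetical data (the real Experiment 2 files are not reproduced): the class is a deterministic function of whether two binary attributes agree, which a decision tree can represent but Naive Bayes cannot. Here the label is 1 when the bits differ (the report's class 1) and 0 when they agree (class 2):

```python
# Minimal sketch: class determined by whether two binary attributes
# (standing in for a0001 and a0003) agree or differ.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1101, 4))        # four binary attributes
y = (X[:, 0] != X[:, 2]).astype(int)          # class from "a0001" vs "a0003"

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for name, clf in [("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("naive Bayes", BernoulliNB())]:
    clf.fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC = {roc_auc_score(y_te, scores):.3f}")
```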

AUC Values

            Decision Trees   Naïve Bayes   KNN | K = 1   KNN | K = 10
Dataset 2   1                .549          .974          1

• KNN (k = 1) shows a reduction in performance because a number of redundant attributes has been introduced in the dataset. The class can still be determined from the two original parameters alone, but the redundant parameters affect the distance between instances belonging to the same class, which can lead to inaccurate classification and hence a drop in performance (see the sketch after this list).

• Decision Trees show the same performance as before, because the addition of the other parameters does not affect the classification rule mentioned earlier, and thus the decision tree classifier is unaffected.
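A minimal sketch, on the same kind of hypothetical binary data as above, of how added irrelevant attributes distort the distances that 1-NN relies on, while a larger K averages some of that noise away:

```python
# Minimal sketch: irrelevant binary attributes added to XOR-style data,
# scored with KNN at K = 1 and K = 10. Illustrative only.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
X_rel = rng.integers(0, 2, size=(1101, 2))        # the two informative bits
y = (X_rel[:, 0] != X_rel[:, 1]).astype(int)      # same class rule as before
X_noise = rng.integers(0, 2, size=(1101, 10))     # irrelevant attributes
X = np.hstack([X_rel, X_noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
for k in (1, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k = {k:2d}: AUC = {roc_auc_score(y_te, knn.predict_proba(X_te)[:, 1]):.3f}")
```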
Experiment 3:

AUC Values

            Naïve Bayes   KNN | K = 1   KNN | K = 10   Decision Trees
Dataset 1   .996          .988          .999           .973
Dataset 2   .995          .802          .948           .736
Dataset 3   .985          .731          .909           .673
Dataset 4   .963          .628          .823           .580

• For Dataset 1, all the classifiers perform quite well because of the simplicity of the classification problem: the decision boundary for Dataset 1 can be roughly described by the line x1 + x2 = 1.
In general, Naïve Bayes performs best while decision tree classification has the worst performance; however, for Dataset 1, KNN with k = 10 marginally outperforms Naïve Bayes (.999 vs .996).

• Moving through the datasets, the performance of every classification algorithm declines, because each dataset becomes more complex with the addition of further attributes. However, the drop in performance of Naïve Bayes is not very significant, while for decision trees there is an almost 40 percent drop in AUC. Naïve Bayes adapts well to the large amount of data and therefore keeps performing well. On the other hand, since every feature/attribute contributes to the class, the rules become more complicated for decision trees as the number of attributes increases. Due to the increased complexity, the performance of the KNN classifiers also drops, but KNN with k = 10 performs better than KNN with k = 1, given the large number of instances (see the sketch below).
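A minimal sketch of this trend on synthetic data with a linear decision boundary (x1 + x2 = 1 in the two-attribute case) and a growing number of contributing attributes; only the direction of the trend, not the exact AUC values, is meant to carry over:

```python
# Minimal sketch: the four classifiers scored as the number of contributing
# attributes grows. Hypothetical data standing in for Datasets 1-4.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
for n_features in (2, 4, 6, 8):                       # roughly "Dataset 1..4"
    X = rng.uniform(0, 1, size=(3000, n_features))
    y = (X.sum(axis=1) > n_features / 2).astype(int)  # linear boundary, e.g. x1 + x2 = 1
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)
    results = {}
    for name, clf in [("NB", GaussianNB()),
                      ("KNN1", KNeighborsClassifier(n_neighbors=1)),
                      ("KNN10", KNeighborsClassifier(n_neighbors=10)),
                      ("Tree", DecisionTreeClassifier(random_state=2))]:
        clf.fit(X_tr, y_tr)
        results[name] = round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3)
    print(n_features, "attributes:", results)
```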

Experiment 4:

AUC Values

                Naïve Bayes   KNN | K = 3   Decision Trees
Sonar Dataset   .800, .800    .933, .933    .743, .743

• KNN with k = 3 performs best, giving an AUC of .933, while decision trees perform poorly and Naïve Bayes gives average performance.
• The frequency information contained in the attributes is similar for instances of the same class (mine or rock), so the Euclidean distance between instances of the same class is small, and KNN therefore performs much better than the other classifiers. For decision trees, a large number of attributes contribute to the classification, which results in a complex tree structure and poor classification.
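A minimal sketch of this comparison, assuming the Sonar data can be fetched from OpenML under the name "sonar" (the lab may have used a local copy) and using 10-fold cross-validated AUC:

```python
# Minimal sketch: Naive Bayes vs. KNN (K = 3) vs. a decision tree on the
# Sonar data, compared by cross-validated AUC. Assumes the OpenML copy.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = fetch_openml(name="sonar", version=1, return_X_y=True, as_frame=False)

for name, clf in [("Naive Bayes", GaussianNB()),
                  ("KNN, K = 3", KNeighborsClassifier(n_neighbors=3)),
                  ("Decision tree", DecisionTreeClassifier(random_state=0))]:
    aucs = cross_val_score(clf, X, y, cv=10, scoring="roc_auc")
    print(f"{name}: mean AUC = {aucs.mean():.3f}")
```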
