
UNIVERSITY OF KARACHI

DEPARTMENT OF COMPUTER SCIENCE

(EVENING PROGRAMME)
BACHELOR OF SCIENCE IN COMPUTER SCIENCE
SECOND SEMESTER 2011
COURSE: BSCS-618 COMPUTATIONAL LINEAR ALGEBRA
PROJECT REPORT

SUBMITTED TO: DR. TAHSEEN AHMED JILANI
NAME: MUHAMMAD SHOAIB S/O ABDUL QADIR
SEAT #: EP086165

ABSTRACT:
This report gives a detailed explanation of a comparative study of data mining techniques applied to the Iris dataset; the data has been taken from the UCI Machine Learning Repository. The techniques have been implemented in MATLAB, one of the most valuable software tools for data mining.
In this project I carried out a comparative study on the plant Iris. The goal has been achieved by using a supervised learning approach for classifying the data into classes and an unsupervised learning approach for grouping the data into clusters. The supervised techniques used in this study are Naïve Bayesian classification and the decision tree, while the unsupervised techniques are the single-linkage agglomerative clustering algorithm and the K-means clustering algorithm. Finally, the accuracy has been measured using testing data and the results have been compared.

IRIS:
Iris is a genus of 260-300 species of flowering plants with showy flowers. It takes its name from the Greek word for a rainbow, referring to the wide variety of flower colours found among the many species. As well as being the scientific name, iris is also very widely used as a common name for all Iris species, though some plants called thus belong to other closely related genera. A common name for some species is flags, while the plants of the subgenus Scorpiris are widely known as junos, particularly in horticulture. It is a popular garden flower.
The genera Belamcanda (blackberry-lily), Hermodactylus (snake's head iris), Neomarica (walking iris) and Pardanthopsis are sometimes included in Iris.

DATA SET DESCRIPTION:


Number of Instances: 150 (50 in each of three classes)
Number of Attributes: 4 numeric, predictive attributes and the class

Attribute Information:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class:
   Iris Setosa
   Iris Versicolour
   Iris Virginica

Summary Statistics:
                Min   Max   Mean   SD     Class Correlation
sepal length:   4.3   7.9   5.84   0.83    0.7826
sepal width:    2.0   4.4   3.05   0.43   -0.4194
petal length:   1.0   6.9   3.76   1.76    0.9490 (high!)
petal width:    0.1   2.5   1.20   0.76    0.9565 (high!)
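These statistics can be reproduced directly in MATLAB. A minimal sketch follows, assuming the Statistics Toolbox, whose bundled fisheriris data set contains the same 150 measurements; the class correlation is taken as the correlation of each attribute with a numeric coding of the class:

load fisheriris                     % meas: 150x4 measurements, species: class labels
stats = [min(meas); max(meas); mean(meas); std(meas)]  % one column per attribute
corr(meas, grp2idx(species))        % class correlation of each attribute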

DATA MINING:
Data mining (the analysis step of the knowledge discovery in databases process, or KDD), a relatively young and interdisciplinary field of computer science, is the process of discovering new patterns from large data sets using methods at the intersection of artificial intelligence, machine learning, statistics and database systems. The goal of data mining is to extract knowledge from a data set in a human-understandable structure, and it involves database and data management, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structure, visualization and online updating.
The actual data-mining task is the automatic or semi-automatic analysis of large quantities of data
to extract previously unknown interesting patterns such as groups of data records (cluster analysis),
unusual records (anomaly detection) and dependencies (association rule mining). This usually
involves using database techniques such as spatial indexes. These patterns can then be seen as a kind of summary of the input data and used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results from a decision support system. Neither data collection and preparation nor result interpretation and reporting are part of the data mining step, but they do belong to the overall KDD process as additional steps.
The related terms data dredging, data fishing and data snooping refer to the use of data mining
methods to sample parts of a larger population data set that are (or may be) too small for reliable
statistical inferences to be made about the validity of any patterns discovered. These methods can,
however, be used in creating new hypotheses to test against the larger data populations.

SUPERVISED AND UNSUPERVISED LEARNING:


In supervised learning (often also called directed data mining) the variables under investigation can be split into two groups: explanatory variables and one (or more) dependent variables. The target of the analysis is to specify a relationship between the explanatory variables and the dependent variable, as is done in regression analysis. To apply directed data mining techniques, the values of the dependent variable must be known for a sufficiently large part of the data set.
Unsupervised learning is closer to the exploratory spirit of data mining as stressed in the definitions given above. In unsupervised learning situations all variables are treated in the same way; there is no distinction between explanatory and dependent variables. However, in contrast to what the name undirected data mining suggests, there is still some target to achieve. This target might be as general as data reduction or as specific as clustering.

The dividing line between supervised learning and unsupervised learning is the same one that distinguishes discriminant analysis from cluster analysis. Supervised learning requires that the target variable be well defined and that a sufficient number of its values be given. For unsupervised learning, typically either the target variable is unknown or it has been recorded for too small a number of cases.

CLUSTERING:
Clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other than to those in other clusters.
Clustering is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval and bioinformatics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be
achieved by various algorithms that differ significantly in their notion of what constitutes a cluster
and how to efficiently find them. Popular notions of clusters include groups with low
distances among the cluster members, dense areas of the data space, intervals or particular statistical
distributions. The appropriate clustering algorithm and parameter settings (including values such as
the distance function to use, a density threshold or the number of expected clusters) depend on the
individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery that involves trial and error. It will often be necessary to modify pre-processing and parameters until the result achieves the desired properties.

CLASSIFICATION:
Classification is a data mining (machine learning) technique used to predict group membership for data instances. It is a supervised machine learning method: classification is applied to datasets whose instances have already been labelled, and the task is to derive a rule from the previous values that predicts the class of new values. Its outcome is more accurate than that of clustering, because the instances in the dataset are already classified and we only have to derive a rule for new values.

DESCRIPTION OF THE TECHNIQUES USED


K-MEANS CLUSTERING:
In statistics and data mining, k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions, as both algorithms employ an iterative refinement approach. Additionally, both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

Function in MATLAB: kmeans(data, 3)

Where data is the variable into which the Iris data has been loaded and 3 is the number of clusters to be created.
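A slightly fuller sketch, assuming the Statistics Toolbox (its bundled fisheriris data set holds the same 150x4 measurements used in this study):

load fisheriris                      % meas: 150x4 Iris measurements
data = meas;
[idx, centroids] = kmeans(data, 3);  % idx: cluster index for each observation
crosstab(idx, grp2idx(species))      % compare clusters against the known species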

DECISION TREE:
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a goal. Another use of decision trees is as a descriptive means for calculating conditional probabilities.
Function in MATLAB:
treefit(Data, Output_Col_of_Data)
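A minimal sketch, assuming the older treefit/treedisp interface of the Statistics Toolbox is available (newer releases provide the same functionality through classregtree):

load fisheriris
t = treefit(meas, species);          % species: 150x1 cell array of class labels
treedisp(t)                          % draw the fitted classification tree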

AGGLOMERATIVE HIERARCHICAL CLUSTERING:


Agglomerative hierarchical clustering is a bottom-up clustering method where clusters have sub-clusters, which in turn have sub-clusters, and so on. The classic example of this is species taxonomy. Gene expression data might also exhibit this hierarchical quality (e.g. neurotransmitter gene families). Agglomerative hierarchical clustering starts with every single object (gene or sample) in its own cluster. Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters satisfying some similarity criterion, until all of the data is in one cluster.
The hierarchy within the final cluster has the following properties:

Clusters generated in early stages are nested in those generated in later stages.
Clusters with different sizes in the tree can be valuable for discovery.
A Matrix Tree Plot visually demonstrates the hierarchy within the final cluster, where each
merger is represented by a binary tree.

Process

Assign each object to a separate cluster.
Evaluate all pair-wise distances between clusters (distance metrics are described in Distance Metrics Overview).
Construct a distance matrix using the distance values.
Look for the pair of clusters with the shortest distance.
Remove the pair from the matrix and merge them.
Evaluate all distances from this new cluster to all other clusters, and update the matrix.
Repeat until the distance matrix is reduced to a single element.
Function in MATLAB:

Y = pdist(Data);                                       % pair-wise distances between observations
Z = linkage(Y, 'single');                              % single-linkage agglomerative clustering
[H, T] = dendrogram(Z, 'colorthreshold', 'default');   % plot the cluster hierarchy
set(H, 'LineWidth', 2);                                % thicken the dendrogram lines
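To obtain flat clusters from the hierarchy, the tree can be cut at a chosen level; a minimal sketch, assuming three clusters to match the three Iris classes:

idx = cluster(Z, 'maxclust', 3);     % cut the dendrogram into three clusters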

NAÏVE BAYESIAN CLASSIFICATION:

The Bayesian method provides a principled way to incorporate external prior information into the data-analysis process. This process starts with an already given probability distribution for the analyzed data set. As this distribution is given before any data is considered, it is called a prior distribution. The new data set updates this prior distribution into a posterior distribution. The basic tool for this updating is the Bayes Theorem.
The Bayes Theorem represents a theoretical background for a statistical approach to inductive-inference classification problems. We will first explain the basic concepts defined in the Bayes Theorem and then use this theorem in the explanation of the Naïve Bayesian Classification Process, or the Simple Bayesian Classifier.
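Concretely, for a class C and an attribute vector X, the Bayes Theorem gives the posterior as P(C|X) = P(X|C) P(C) / P(X), and the naïve assumption treats the attributes as conditionally independent given the class, so P(X|C) factors into a product of per-attribute densities. A minimal sketch of a Gaussian Naïve Bayesian classifier in MATLAB follows, assuming normpdf from the Statistics Toolbox, normally distributed attributes within each class, and resubstitution accuracy rather than the train/test split used for the results reported below:

load fisheriris
classes = unique(species);
n = numel(species);
post = zeros(n, numel(classes));
for c = 1:numel(classes)
    in_c  = strcmp(species, classes{c});
    mu    = mean(meas(in_c, :));     % per-attribute class means
    sigma = std(meas(in_c, :));      % per-attribute class standard deviations
    prior = sum(in_c) / n;           % class prior from the data
    % Naive assumption: likelihood is a product of univariate normal densities
    lik = prod(normpdf(meas, repmat(mu, n, 1), repmat(sigma, n, 1)), 2);
    post(:, c) = prior * lik;        % unnormalized posterior for class c
end
[~, pred] = max(post, [], 2);        % pick the most probable class
accuracy = mean(strcmp(classes(pred), species))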
Results of Naïve Bayesian Classification:

Correctly Classified Instances     144    96%
Incorrectly Classified Instances     6     4%

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1.000    0.000    1.000      1.000   1.000      1.000     Iris-setosa
               0.960    0.040    0.923      0.960   0.941      0.993     Iris-versicolor
               0.920    0.020    0.958      0.920   0.939      0.993     Iris-virginica
Weighted Avg.  0.960    0.020    0.960      0.960   0.960      0.995

RESULT:

Technique                        Accuracy
K-Means                          82.08%
Single-Linkage Agglomerative     80.10%
Naïve Bayesian                   96%
Decision Tree                    79.45%

CONCLUSION:

After applying the techniques, we can observe that the accuracies of K-means and Naïve Bayesian are the most similar, and that there is no big difference in the accuracy of any of the techniques used here. Any of the techniques explained above can give an almost similar result.

FUTURE WORK:
All of the results could be improved by applying some transformation techniques.
Other algorithms could also be applied to this dataset to obtain better results.

REFERENCES:
1. Data Mining: Concepts, Models, Methods, and Algorithms by Mehmed Kantardzic, John Wiley and Sons, 2003.
2. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, 2nd Edition.
3. Data taken from the UCI Machine Learning Repository; the dataset is available at (http://archive.ics.uci.edu/ml/datasets/Iris).
4. Combining methods in supervised classification: a comparative study on discrete and continuous problems.
5. Boosting Principal Component Analysis by Genetic Algorithm.
6. A Multiclass Classifier Based on Local Modeling and Information Theoretic Learning.
7. Application of Clustering for Feature Selection Based on Rough Set Theory Approach.
8. Wikipedia (http://en.wikipedia.org/wiki/Iris_(plant)).
