
Data Mining

By
Kabith Sivaprasad (BE/1234/2009)
Rimjhim (BE/1134/2009)
Utkarsh Ahuja (BE/1226/2009)
Guided By: Prof. Ritesh Kumar Jha
Aim
To compare the efficiency of various algorithms in
classifying data.
Definition
Data mining is the process of discovering patterns in large data sets.
It is a field at the intersection of computer science and statistics.
It attempts to make sense of a large data set and compiles the results for further use.
It is the automatic analysis of large quantities of data, e.g. finding groups of data records, unusual records, dependencies, etc.
General Process Flow
Applications
Games
Business
Science And Engineering
Spatial Data Mining
Sensor Data Mining
Surveillance
Games
Oracles can predict moves in games such as chess, checkers, etc. Such prediction techniques rely on data mining, using a data set that contains information from previous games.

Business
Used in telemarketing, e-mailing potential consumers about offers, recruitment of employees (scanning a database of applicants), identifying purchase patterns, etc.

Science And Engineering
Data mining has been used widely in the areas of
science and engineering, such as bioinformatics, genetics,
medicine, education and electrical power engineering



Spatial Data Mining
Spatial data mining is the application of data mining
methods to spatial data. The end objective of spatial data
mining is to find patterns in data with respect to geography.

Sensor data mining
Sensors can be deployed to measure changes over time, for instance in air pollution or migration patterns.

Surveillance
Data mining has been used in programs intended to stop terrorist activity. In the context of combating terrorism, two particularly plausible methods of data mining are "pattern mining" and "subject-based data mining".


Association Mining

Definition
Finding frequent patterns, associations,
correlations, or causal structures among sets
of items or objects in transaction databases,
relational databases, and other information
repositories

Association rule
Given a set of transactions find rules that
will predict the occurrence of an item based
on the occurrences of other items in the
transaction
Given a set of transactions T, the goal of
association rule mining is to find all rules
having
support >= minsup threshold
confidence >= minconf threshold


Example of Association
TID   Items
1     Bread, Peanuts, Milk, Fruit, Jam
2     Bread, Jam, Soda, Chips, Milk, Fruit
3     Steak, Jam, Soda, Chips, Bread
4     Jam, Soda, Peanuts, Milk, Fruit
5     Jam, Soda, Chips, Milk, Bread
6     Fruit, Soda, Chips, Milk
7     Fruit, Soda, Peanuts, Milk
8     Fruit, Peanuts, Cheese, Yogurt
Itemset
A collection of one or more items, e.g., {milk,
bread, jam}
Support count (σ)
Frequency of occurrence of an itemset:
σ({Milk, Bread}) = 3, σ({Soda, Chips}) = 4

Support
Fraction of transactions that contain an
itemset
s({Milk, Bread}) = 3/8

Frequent Itemset
An itemset whose support is >= a minsup
threshold

Confidence
For a rule X → Y, defined as c(X → Y) = s(X ∪ Y) / s(X)
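To make these definitions concrete, here is a minimal Python sketch (not part of the original slides) that computes support count, support and confidence over the eight example transactions above:

transactions = [
    {"Bread", "Peanuts", "Milk", "Fruit", "Jam"},
    {"Bread", "Jam", "Soda", "Chips", "Milk", "Fruit"},
    {"Steak", "Jam", "Soda", "Chips", "Bread"},
    {"Jam", "Soda", "Peanuts", "Milk", "Fruit"},
    {"Jam", "Soda", "Chips", "Milk", "Bread"},
    {"Fruit", "Soda", "Chips", "Milk"},
    {"Fruit", "Soda", "Peanuts", "Milk"},
    {"Fruit", "Peanuts", "Cheese", "Yogurt"},
]

def support_count(itemset):
    # sigma(X): number of transactions containing every item of X
    return sum(itemset <= t for t in transactions)

def support(itemset):
    # s(X): fraction of transactions containing X
    return support_count(itemset) / len(transactions)

def confidence(X, Y):
    # c(X -> Y) = s(X u Y) / s(X)
    return support(X | Y) / support(X)

print(support_count({"Milk", "Bread"}))   # 3
print(support({"Milk", "Bread"}))         # 0.375 (= 3/8)
print(confidence({"Milk"}, {"Bread"}))    # 0.5 (Milk in 6 transactions, Milk+Bread in 3)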

Classification
Classification:
predicts categorical class labels
classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and
uses it in classifying new data

Process Flow
It is a two step process
1. Model construction: describing a set of predetermined classes
2. Model usage: for classifying future or unknown objects



Model construction:
Training data --> Classification algorithm --> Classifier

Example rule learned:
IF rank = professor OR years > 6 THEN tenured = yes

Training data:
Name       Rank            Years   Tenured
Utkarsh    Assistant Prof  7       Yes
Kabith     Assistant Prof  5       No
Dheeraj    Professor       2       Yes
Gautam     Associate Prof  6       No
Triambak   Professor       3       No
Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
The set of tuples used for model construction: training set
The model is represented as classification rules, decision trees, or
mathematical formulae

Model usage:
Test data:
Name     Rank            Years   Tenured
Ashish   Assistant Prof  2       No
Mohan    Associate Prof  7       No
Hulash   Professor       5       Yes
Yadav    Assistant Prof  7       Yes

Unseen data: (Clint, Professor, 4) --> Tenured? --> YES
Estimate accuracy of the model:
The known label of each test sample is compared with the result predicted by the model.
The accuracy rate is the percentage of test set samples that are correctly classified by the model.
The test set must be independent of the training set, otherwise over-fitting will occur.
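As an illustration of this two-step process (not from the original slides), a minimal sketch assuming scikit-learn is available and using a built-in data set as a stand-in:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                      # stand-in data set
X_train, X_test, y_train, y_test = train_test_split(   # keep the test set independent
    X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)   # step 1: model construction
y_pred = model.predict(X_test)                                       # step 2: model usage

print("Accuracy:", accuracy_score(y_test, y_pred))   # fraction of correctly classified test samples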

Algorithms Used
KNN
The k-nearest neighbour algorithm is a method for classifying
objects based on the closest training examples in the feature space. k-NN is
a type of instance-based learning, or lazy learning.
SVM
Support vector machines are supervised learning
models with associated learning algorithms that analyze data and
recognize patterns, used for classification and regression analysis.
C4.5
C4.5 is an algorithm used to generate a decision tree
developed by Ross Quinlan. The decision trees generated by C4.5 can
be used for classification, and for this reason, C4.5 is often referred to
as a statistical classifier.

C4.5
Given a set S of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm as follows:
If all the cases in S belong to the same class or S is small, the tree is a leaf labeled
with the most frequent class in S.
Otherwise, choose a test based on a single attribute with two or more outcomes. Make this
test the root of the tree, with one branch for each outcome of the test; partition
S into corresponding subsets S1, S2, ... according to the outcome for each case, and apply the
same procedure recursively to each subset.
There are usually many tests that could be chosen in this last step. C4.5 uses two
heuristic criteria to rank possible tests
Information gain: minimizes the total entropy of the subsets {Si }
Gain Ratio: Information gain/Information provided by test outcomes.
Attributes can be either numeric or nominal and this determines the format of the test
outcomes.
The initial tree is then pruned to avoid overfitting; pruning is carried out from the leaves towards the root.

The principal disadvantage of C4.5's rulesets is the amount of CPU time and memory that
they require.
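C4.5 itself is distributed as C source code; purely as a rough stand-in (an assumption, not the authors' setup), scikit-learn's DecisionTreeClassifier implements CART rather than C4.5, but can be configured to split on entropy, which mirrors the information-gain criterion described above:

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
X, y = data.data, data.target

# criterion="entropy" ranks candidate splits by information gain, mirroring
# C4.5's first heuristic; pruning is approximated here via cost-complexity
# pruning (ccp_alpha) rather than C4.5's rule-based pruning.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01).fit(X, y)

print(export_text(tree, feature_names=[str(n) for n in data.feature_names]))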
kNN
In this algorithm a group of k objects in the training set
that are closest to the test object are found, and the
assignment of a label is based on the predominance of a
particular class in this neighbourhood.
There are three key elements of this algorithm:
A set of labeled objects
A metric to compute distance
The value of k, the number of nearest neighbours.

Input: D, the set of training objects, and a test object z = (x', y').
Process: compute d(x', x), the distance between z and every object (x, y) ∈ D, and select Dz ⊆ D, the set of the k training objects closest to z.
Output: y' = argmax_v Σ_{(xi, yi) ∈ Dz} I(v = yi), where v is a class label, yi is the class
label of the i-th nearest neighbour, and I(·) is an indicator function that returns the
value 1 if its argument is true and 0 otherwise.
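A direct, hedged translation of these steps into plain Python (illustrative names, Euclidean distance, simple majority vote):

import math
from collections import Counter

def knn_classify(D, z, k):
    # D: list of (x, y) training pairs, x a feature tuple, y a class label.
    # z: test feature tuple. Returns the majority class among the k closest objects.
    def dist(a, b):
        # d(x', x): Euclidean distance between z and a training object
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    # Dz: the k training objects closest to z
    Dz = sorted(D, key=lambda pair: dist(pair[0], z))[:k]

    # y' = argmax_v sum I(v = yi): majority vote over the k neighbours
    votes = Counter(y for _, y in Dz)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.1), "B"), ((4.8, 5.3), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))   # "A"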

The choice of distance measure is another important consideration. Although various
measures can be used, the most desirable distance measure is one for which a
smaller distance between two objects implies a greater likelihood that they belong to the same
class.
E.g. if kNN is being applied to classify documents, it may be better to use the
cosine measure than the Euclidean distance.

Issues with kNN
One issue is the choice of k. If k is too small, the
result can be sensitive to noise points; if k is too
large, the neighbourhood may include too many points
from other classes.
A second issue is the approach to combining class labels. The
simplest method is to take a majority vote, but this can be
a problem when the nearest neighbours vary widely in their
distances and the closest neighbours indicate the class of the object
more reliably.

Advantages
kNN classification is an easy-to-understand and easy-to-implement
classification technique.
It can perform well in many situations. The error of the
nearest neighbor rule is bounded above by twice the
Bayes error under certain reasonable assumptions. Also,
the error of the general kNN method asymptotically
approaches that of the Bayes error and can be used to
approximate it.
kNN is particularly well suited for multi-modal classes
as well as applications in which an object can have many
class labels. For example, for the assignment of functions
to genes based on expression profiles, some researchers
found that kNN outperformed SVM, a much
more sophisticated classification scheme.
SVM
Support Vector Machine (SVM) offers one of the most robust and
accurate methods among all well-known algorithms. It has a sound
theoretical foundation, requires only a dozen examples for training, and
is insensitive to the number of dimensions.
The aim of SVM is to find the best classification function to distinguish
between members of the two classes in the training data. The metric for
the concept of the best classification function can be realized
geometrically.
For a linearly separable dataset, a linear classification function
corresponds to a separating hyperplane f(x) that passes through the
middle of the two classes.
Once this function is determined, a new data instance Xn can be classified
by simply testing the sign of the function f(Xn); Xn belongs to the
positive class if f(Xn) > 0.
Since there are many such linear hyperplanes, SVM guarantees that the
best such function is found by maximizing the margin between the two
classes, where the margin is defined as the amount of space, or separation,
between the two classes as defined by the hyperplane.
The reason why SVM insists on finding the maximum-margin
hyperplane is that it offers the best generalization ability. It allows
not only the best classification performance (e.g., accuracy) on the
training data, but also leaves much room for the correct classification
of future data.
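A minimal sketch (assuming scikit-learn and toy, invented data) of fitting a maximum-margin linear SVM and classifying a new instance Xn by the sign of f(Xn):

import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes (toy data)
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],    # positive class (+1)
              [6.0, 6.0], [5.5, 7.0], [7.0, 6.5]])   # negative class (-1)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates a hard margin

x_new = np.array([[2.0, 2.0]])
f_xn = clf.decision_function(x_new)[0]        # f(Xn) = w.Xn + b
print(f_xn, "->", "positive class" if f_xn > 0 else "negative class")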
SVM can be easily extended to perform numerical calculations. Here
we discuss two such extensions. The first is to extend SVM to
perform regression analysis, where the goal is to produce a linear
function that can approximate the target function.
Another extension is to learn to rank elements rather than producing
a classification for individual elements. This method can be applied
to many areas where ranking is important, such as document
ranking in information retrieval.
Weka
Weka is an acronym for Waikato Environment for Knowledge Analysis.
It is machine learning software written in Java, developed at the University of Waikato, New Zealand.
The software uses the .arff file format (Attribute-Relation File Format).
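For illustration only (attribute names borrowed from the heart-attack data set described below, with invented values), an ARFF file is plain text and, assuming SciPy is available, can also be read from Python:

import io
from scipy.io import arff

arff_text = """\
@relation heart_attack

@attribute age numeric
@attribute sex {0,1}
@attribute trestbps numeric
@attribute num {0,1}

@data
63,1,145,0
45,0,130,1
"""

data, meta = arff.loadarff(io.StringIO(arff_text))
print(meta)           # attribute names and types
print(data["age"])    # [63. 45.]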
DATA SET

Data set considered:-
A raw data set depicting the symptoms of a heart attack and the magnitude of each symptom.

Attributes

Data set was divided into the following attributes (HA = heart attack):
Age : age in years
   - Class A -- 0-40 years
   - Class B -- 40-60 years
   - Class C -- 60 and above
Sex : sex (1 = male, 0 = female)
Cp : chest pain type
   - Value 1: typical angina (HA chances high)
   - Value 2: atypical angina (HA chances less)
   - Value 3: non-anginal pain (HA no chances)
   - Value 4: asymptomatic (HA chances high)
Trestbps : resting blood pressure (>120 -- HA chances high)
Chol : serum cholesterol (>239 -- HA chances high)
   HA does not depend much on cholesterol level.

Attributes

Fbs : fasting blood sugar (>120 mg/dl)
   - 1 -- true
   - 0 -- false
Restecg : resting electrocardiographic results
   - Value 0: normal
   - Value 1: having ST-T wave abnormality
Thalach : maximum heart rate achieved (normal thalach = 208 - 0.7*age)
Exang : exercise-induced angina
   - 1 -- HA chances high
   - 0 -- HA chances low
Oldpeak : ST depression induced by exercise relative to rest
   - 1 -- HA chances high
   - 0 -- HA chances low

Attributes

Slope : the slope of the peak exercise ST segment
   - 1 -- up sloping (HA chances high)
   - 2 -- flat (HA chances moderate)
   - 3 -- down sloping (HA chances low)
Ca : number of major vessels (0-3) colored by fluoroscopy
Thal :
   - 3 -- normal
   - 6 -- fixed defect
   - 7 -- reversible defect
Num : chances of HA
   - 0 -- no HA
   - otherwise -- possible HA

Entropy and Info gain
Choosing the root node:

To decide which attribute should be the root node, we calculate the entropy for every attribute; the attribute with the highest information gain is selected as the root node.

Entropy of the full data set S (class split of 9 and 6 over 15 records):
Entropy(S) = -(9/15)·log2(9/15) - (6/15)·log2(6/15) ≈ 0.97


Attributes, their entropy and information gain:-

Age:
Entropy(Age) = (5/15)·[-(2/5)·log2(2/5) - (3/5)·log2(3/5)] + (1/15)·[-(1)·log2(1)] + (9/15)·[-(7/9)·log2(7/9) - (2/9)·log2(2/9)] = 0.697
Information gain(Age) = Entropy(S) - Entropy(Age) = 0.97 - 0.697 ≈ 0.27
Calculating similarly for the other attributes gives information gains such as:
Sex = 0.03
Exang = 0.419
Thal = 0.069
...
Hence, the maximum information gain is for the attribute Exang, making it the root node.
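A hedged worked example in Python of the same calculation (class counts taken from the figures quoted above; the exact values depend on the actual class distribution in the data set):

from math import log2

def entropy(counts):
    # Entropy of a class distribution, e.g. [9, 6] -> about 0.97
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(total_counts, partitions):
    # Information gain = Entropy(S) - weighted entropy of the attribute's partitions
    n = sum(total_counts)
    weighted = sum(sum(p) / n * entropy(p) for p in partitions)
    return entropy(total_counts) - weighted

print(entropy([9, 6]))                              # entropy of the full data set S
# Age partitions with the per-class counts quoted on the slide: [2,3], [1], [7,2]
print(info_gain([9, 6], [[2, 3], [1], [7, 2]]))     # information gain for Age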


Decision tree
[Decision-tree diagrams: Exang (1/0) at the root, with further splits on gender and chest-pain type (cp) and, deeper in the tree, on trestbps, fbs, thalach, chol, thal and slope; leaves are labelled "No HA" or "HA possible".]

Cholesterol Test decision tree:-
[Diagram: a cholesterol-test sub-tree (high / normal) that refines the outcome into "No HA" or "HA possible".]

A program for the above decision tree has been implemented.
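The program itself is not reproduced in the slides; purely as an illustration of how such a hand-built tree could be coded (the branch order, the cholesterol threshold and the function name are assumptions, not the authors' actual implementation):

def predict_heart_attack(exang, cp, chol):
    # Illustrative hand-coded decision tree; not the authors' actual program.
    if exang == 1:                 # exercise-induced angina at the root: HA chances high
        return "HA possible"
    # exang == 0: fall through to chest-pain type
    if cp in (1, 4):               # typical angina or asymptomatic: HA chances high
        # cholesterol check (threshold taken from the attribute description, >239 = high)
        return "HA possible" if chol > 239 else "No HA"
    return "No HA"                 # atypical or non-anginal pain

print(predict_heart_attack(exang=0, cp=1, chol=250))   # "HA possible"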
The Final Decision Tree
The Form
