
CLUSTERING AND CLASSIFICATION

Spring 2007, SJSU

Benjamin Lam

Overview

Definition of clustering
Existing clustering methods
Clustering examples
Classification
Classification examples
Conclusion

Definition

Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data.

Clustering is the process of organizing objects into groups whose members are similar in some way.

A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters.

Why clustering?

A few good reasons ...

Simplifications

Pattern detection

Useful in data concept construction

Unsupervised learning process

Where to use clustering?

Data mining
Information retrieval
Text mining
Web analysis
Marketing
Medical diagnostics

Which method should I use?

Type of attributes in the data
Scalability to larger datasets
Ability to work with irregular data
Time cost (complexity)
Data order dependency
Result presentation

Major existing clustering methods

Distance-based

Hierarchical

Partitioning

Probabilistic

Measuring Similarity

Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric: d(i, j).

There is a separate quality function that measures the goodness of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Weights should be associated with different variables based on the application and data semantics.

It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.

Distance-based method

In this case we can easily identify the four clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are close according to a given distance. This is called distance-based clustering.

Hierarchical clustering

Agglomerative (bottom up):
1. Start with one point (singleton).
2. Recursively add two or more appropriate clusters.
3. Stop when k clusters are achieved.

Divisive (top down):
1. Start with one big cluster.
2. Recursively divide into smaller clusters.
3. Stop when k clusters are achieved.

General steps of hierarchical clustering

Given a set of N items to be clustered and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this:

1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.

2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one cluster less.

3. Compute distances (similarities) between the new cluster and each of the old clusters.

4. Repeat steps 2 and 3 until all items are clustered into K clusters.

Exclusive vs. non-exclusive clustering

In the first case data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it cannot be included in another cluster. A simple example is the separation of points by a straight line on a two-dimensional plane.

On the contrary, the second type, overlapping clustering, uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
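Overlapping (fuzzy) membership can be sketched in a few lines. The weighting scheme below (inverse-square-distance, in the style of fuzzy c-means) and the 1-D cluster centers are illustrative choices, not taken from the slides.

```python
# Sketch of fuzzy (overlapping) membership: each point gets a degree of
# membership in every cluster, weighted by 1/d^2 to each cluster center
# and normalized so the degrees sum to 1.

def memberships(point, centers):
    """Degrees of membership of `point` in each cluster (sum to 1)."""
    dists = [abs(point - c) for c in centers]
    if 0.0 in dists:
        # a point sitting exactly on a center belongs fully to that cluster
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    inv = [1.0 / d ** 2 for d in dists]
    total = sum(inv)
    return [w / total for w in inv]

centers = [0.0, 10.0]          # two 1-D cluster centers (illustrative)
m = memberships(3.0, centers)  # closer to the first center, so m[0] > m[1]
```

Memberships sum to 1 across clusters, so a point midway between two centers belongs to both with degree 0.5.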

Partitioning clustering

1. Divide the data into proper subsets.
2. Recursively go through each subset and relocate points between clusters (in contrast to the visit-once approach of hierarchical clustering).

Probabilistic clustering

1. Data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution serve as the parameters of a cluster.
3. Each point gets a single cluster membership.
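The three points above can be sketched as follows. The two Gaussian components and their parameters are illustrative; the assignment rule (pick the component with the highest density, i.e. single membership) is the part the slide describes.

```python
import math

# Probabilistic clustering with single cluster membership: each point is
# assigned to the mixture component (1-D Gaussian here) under which it
# has the highest density. Component parameters are illustrative.

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def assign(x, params):
    """Index of the component whose density at x is largest."""
    densities = [gaussian_pdf(x, m, v) for (m, v) in params]
    return densities.index(max(densities))

params = [(0.0, 1.0), (5.0, 1.0)]   # (mean, variance) per cluster
labels = [assign(x, params) for x in [-0.5, 0.2, 4.8, 5.3]]
```

With equal variances this reduces to assigning each point to the nearest mean.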

Single-Linkage Clustering (hierarchical)

The N*N proximity matrix is D = [d(i,j)].

The clusterings are assigned sequence numbers 0, 1, ..., (n-1).

L(k) is the level of the kth clustering.

A cluster with sequence number m is denoted (m).

The proximity between clusters (r) and (s) is denoted d[(r),(s)].

The algorithm is composed of the following steps:

1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.

2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to

d[(r),(s)] = min d[(i),(j)]

where the minimum is over all pairs of clusters in the current clustering.

The algorithm is composed of the following steps (cont.):

3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form the next clustering m. Set the level of this clustering to

L(m) = d[(r),(s)]

4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as:

d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)])

5. Repeat from step 2 until all items are in a single cluster.

Hierarchical clustering example

Let's now see a simple example: a hierarchical clustering of distances in kilometers between some Italian cities. The method used is single-linkage.

Input distance matrix (L = 0 for all the clusters):

The nearest pair of cities is MI and TO, at distance 138. These are merged into a single cluster called "MI/TO". The level of the new cluster is L(MI/TO) = 138 and the new sequence number is m = 1.

Then we compute the distance from this new compound object to all other objects. In single-link clustering the rule is that the distance from the compound object to another object is equal to the shortest distance from any member of the cluster to the outside object. So the distance from "MI/TO" to RM is chosen to be 564, which is the distance from MI to RM, and so on.

After merging MI with TO we obtain the following matrix:

min d(i,j) = d(NA,RM) = 219 => merge NA and RM into a new cluster called NA/RM
L(NA/RM) = 219
m = 2

min d(i,j) = d(BA,NA/RM) = 255 => merge BA and NA/RM into a new cluster called BA/NA/RM
L(BA/NA/RM) = 255
m = 3

min d(i,j) = d(BA/NA/RM,FI) = 268 => merge BA/NA/RM and FI into a new cluster called BA/FI/NA/RM
L(BA/FI/NA/RM) = 268
m = 4

Finally, we merge the last two clusters at level 295.

The process is summarized by the following hierarchical tree:
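The merge sequence above can be reproduced with a short single-linkage sketch. The walk-through fixes several of the distances (MI-TO = 138, NA-RM = 219, MI-RM = 564, ...), but the input matrix itself is not reproduced in the text, so the remaining entries below are illustrative reconstructions chosen to be consistent with the walk-through.

```python
# Single-linkage clustering of the Italian-cities example. Entries not
# confirmed by the walk-through above are illustrative reconstructions.
dist = {
    ("BA", "FI"): 662, ("BA", "MI"): 877, ("BA", "NA"): 255,
    ("BA", "RM"): 412, ("BA", "TO"): 996, ("FI", "MI"): 295,
    ("FI", "NA"): 468, ("FI", "RM"): 268, ("FI", "TO"): 400,
    ("MI", "NA"): 754, ("MI", "RM"): 564, ("MI", "TO"): 138,
    ("NA", "RM"): 219, ("NA", "TO"): 869, ("RM", "TO"): 669,
}

def d(a, b):
    return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

def single_linkage(cities):
    """Return the merge levels L(1), L(2), ... of single-linkage clustering."""
    clusters = [frozenset([c]) for c in cities]
    levels = []
    while len(clusters) > 1:
        # single-link rule: cluster distance = min distance between members
        (r, s) = min(
            ((a, b) for i, a in enumerate(clusters) for b in clusters[i + 1:]),
            key=lambda p: min(d(x, y) for x in p[0] for y in p[1]),
        )
        levels.append(min(d(x, y) for x in r for y in s))
        clusters = [c for c in clusters if c not in (r, s)] + [r | s]
    return levels

levels = single_linkage(["BA", "FI", "MI", "NA", "RM", "TO"])
# -> [138, 219, 255, 268, 295]
```

Each appended level is the L(m) = d[(r),(s)] of the merge performed at that step, matching 138, 219, 255, 268 and 295 in the walk-through.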

K-means algorithm

1. It accepts the number of clusters to group data into, and the dataset to cluster, as input values.

2. It then creates the first K initial clusters (K = number of clusters needed) by choosing K rows of data randomly from the dataset. For example, if there are 10,000 rows of data in the dataset and 3 clusters need to be formed, then the first K = 3 initial clusters will be created by selecting 3 records randomly from the dataset as the initial clusters. Each of the 3 initial clusters formed will have just one row of data.

3. The K-means algorithm calculates the arithmetic mean of each cluster formed in the dataset. The arithmetic mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters there is only one record, and the arithmetic mean of a cluster with one record is simply the set of values that make up that record. For example, suppose the dataset is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented as P = {Age, Height, Weight}. Then a record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm and Weight = 80 pounds. Since there is only one record in each initial cluster, the arithmetic mean of a cluster with only the record for John as a member = {20, 170, 80}.

4. Next, K-means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity such as the Euclidean distance measure or the Manhattan/city-block distance measure.

5. K-means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean Pmean is represented as Pmean = {Agemean, Heightmean, Weightmean}, where Agemean = (20 + 30)/2, Heightmean = (170 + 160)/2 and Weightmean = (80 + 120)/2. The arithmetic mean of this cluster = {25, 165, 100}. This new arithmetic mean becomes the center of this new cluster. Following the same procedure, new cluster centers are formed for all the existing clusters.

6. K-means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster which it is most similar to) using a measure of distance or similarity.

7. The preceding steps are repeated until stable clusters are formed and the K-means clustering procedure is completed. Stable clusters are formed when new iterations of the K-means algorithm do not change the clusters, i.e. the cluster center (arithmetic mean) of each cluster is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed, i.e. when the K-means procedure is completed.
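The steps above can be condensed into a minimal sketch. The records reuse the John/Henry values from step 5 plus two made-up neighbors, and the initial centers are fixed rather than chosen randomly (step 2) so the run is repeatable.

```python
import math

# Minimal K-means sketch: assign each record to the nearest center
# (Euclidean distance), recompute each cluster's arithmetic mean, and
# repeat until the centers stop moving (stable clusters).

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(records):
    """Arithmetic mean of a cluster: the per-attribute mean of its records."""
    return tuple(sum(col) / len(col) for col in zip(*records))

def k_means(data, centers):
    while True:
        # steps 4/6: assign every record to its nearest cluster center
        clusters = [[] for _ in centers]
        for rec in data:
            nearest = min(range(len(centers)), key=lambda i: euclidean(rec, centers[i]))
            clusters[nearest].append(rec)
        # step 5: recompute the arithmetic mean of each (non-empty) cluster
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:  # step 7: centers unchanged -> stable
            return clusters, centers
        centers = new_centers

john, henry = (20, 170, 80), (30, 160, 120)
data = [john, henry, (21, 172, 82), (31, 158, 118)]
clusters, centers = k_means(data, [john, henry])  # K = 2 initial clusters
```

With random initial centers, as in step 2, different runs can converge to different clusterings, which is why K-means is usually restarted several times.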

Classification

Goal: provide an overview of the classification problem and introduce some of the basic algorithms.

Classification Problem Overview

Classification Techniques:
Regression
Distance
Decision Trees
Rules
Neural Networks

Classification Examples

Teachers classify students' grades as A, B, C, D, or F.
Identify mushrooms as poisonous or edible.
Predict when a river will flood.
Identify individuals who are credit risks.
Speech recognition
Pattern recognition

Classification Ex: Grading

If x >= 90 then grade = A.
If 80 <= x < 90 then grade = B.
If 70 <= x < 80 then grade = C.
If 60 <= x < 70 then grade = D.
If x < 60 then grade = F.

The corresponding decision tree tests x at each internal node, branching on <90 / >=90, then <80 / >=80, then <70 / >=70, then <60 / >=60, with leaves A, B, C, D and F.
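The grading rules can be written directly as a chain of threshold tests, each mirroring one internal node of the tree:

```python
# Rule-based grade classifier; each test corresponds to one internal
# node of the grading decision tree above.

def grade(x):
    if x >= 90:
        return "A"
    elif x >= 80:
        return "B"
    elif x >= 70:
        return "C"
    elif x >= 60:
        return "D"
    else:
        return "F"

marks = [95, 83, 70, 64, 42]
letters = [grade(m) for m in marks]  # -> ['A', 'B', 'C', 'D', 'F']
```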

Classification Techniques

Approach:
1. Create a specific model by evaluating training data (or using domain experts' knowledge).
2. Apply the model developed to new data.

Classes must be predefined.

Most common techniques use DTs, NNs, or are based on distances or partitioning.

Defining Classes

Distance based

Partitioning based

Classification Using Regression

Division: use a regression function to divide the area into regions.

Prediction: use a regression function to predict a class membership function. Input includes the desired class.

Height Example Data

Name Gender Height Output1 Output2

Kristina F 1.6m Short Medium

Jim M 2m Tall Medium

Maggie F 1.9m Medium Tall

Martha F 1.88m Medium Tall

Stephanie F 1.7m Short Medium

Bob M 1.85m Medium Medium

Kathy F 1.6m Short Medium

Dave M 1.7m Short Medium

Worth M 2.2m Tall Tall

Steven M 2.1m Tall Tall

Debbie F 1.8m Medium Medium

Todd M 1.95m Medium Medium

Kim F 1.9m Medium Tall

Amy F 1.8m Medium Medium

Wynette F 1.75m Medium Medium

Division

Prediction

Classification Using Distance

Place items in the class to which they are closest.

Must determine the distance between an item and a class.

Classes represented by:
Centroid: central value.
Medoid: representative point.
Individual points

Algorithm: KNN

K Nearest Neighbor (KNN):

The training set includes classes.
Examine the K items nearest to the item to be classified.
The new item is placed in the class with the largest number of those close items.
O(q) for each tuple to be classified. (Here q is the size of the training set.)
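A minimal KNN sketch, assuming 1-D numeric items (heights, echoing the earlier example data) and a majority vote among the K nearest:

```python
from collections import Counter

# Minimal KNN: to classify a new item, look at the K nearest training
# items (1-D distance here) and take a majority vote among their classes.
# The labeled training data are illustrative (height -> class).

def knn(train, x, k):
    """Classify x by majority vote among its k nearest labeled neighbors."""
    # sort the training set by distance to x; a single O(q) scan would
    # also work and matches the cost noted above
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [(1.6, "Short"), (1.7, "Short"), (1.8, "Medium"),
         (1.9, "Medium"), (2.0, "Tall"), (2.1, "Tall")]
label = knn(train, 1.85, 3)  # majority among the 3 nearest -> "Medium"
```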

KNN

KNN Algorithm

Classification Using Decision Trees

Partitioning based: divide the search space into rectangular regions.

A tuple is placed into a class based on the region within which it falls.

DT approaches differ in how the tree is built: DT induction.

Internal nodes are associated with attributes, and arcs with values for that attribute.

Decision Tree

Given:
D = {t1, ..., tn} where ti = <ti1, ..., tih>
Database schema contains {A1, A2, ..., Ah}
Classes C = {C1, ..., Cm}

A Decision or Classification Tree is a tree associated with D such that:
Each internal node is labeled with an attribute, Ai
Each arc is labeled with a predicate which can be applied to the attribute at its parent
Each leaf node is labeled with a class, Cj

DT Induction

Comparing DTs

Balanced

Deep

ID3

Creates the tree using information-theory concepts and tries to reduce the expected number of comparisons.

ID3 chooses the split attribute with the highest information gain, using entropy as the basis for the calculation.
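The two quantities ID3 relies on, entropy and information gain, can be computed directly. The tiny (label, attribute) dataset below is illustrative:

```python
import math

# Entropy of a class distribution and the information gain of a split,
# the quantities ID3 uses to pick the split attribute.

def entropy(labels):
    """H = -sum p_i * log2(p_i) over the class proportions in `labels`."""
    total = len(labels)
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, attr_index):
    """Entropy reduction from splitting `rows` (label first) on an attribute."""
    base = entropy([r[0] for r in rows])
    remainder = 0.0
    for v in set(r[attr_index] for r in rows):
        subset = [r[0] for r in rows if r[attr_index] == v]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [("yes", "sunny"), ("yes", "sunny"), ("no", "rain"), ("no", "rain")]
gain = information_gain(rows, 1)  # a perfect split: gain == entropy == 1.0
```

ID3 would evaluate this gain for every candidate attribute and split on the one with the highest value.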

Conclusion

Clustering is very useful in data mining.
Applicable to both text-based and graphical data.
Helps simplify data complexity.
Classification can detect hidden patterns in data.

References

Dr. M.H. Dunham, http://engr.smu.edu/~mhd/dmbook/part2.ppt
Dr. Lee, Sin-Min, San Jose State University
Mu-Yu Lu, SJSU
Database System Concepts, Silberschatz, Korth, Sudarshan
