
Interval Merge by χ2 Analysis

 Merging-based (bottom-up) vs. splitting-based methods


 Merge: Find the best neighboring intervals and merge them to form
larger intervals recursively
 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
 Initially, each distinct value of a numerical attr. A is considered to be one
interval
 χ2 tests are performed for every pair of adjacent intervals
 Adjacent intervals with the least χ2 values are merged together, since low χ2
values for a pair indicate similar class distributions
 This merge process proceeds recursively until a predefined stopping
criterion is met (such as significance level, max-interval, max inconsistency,
etc.)
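
A minimal sketch of this bottom-up merging loop, assuming a small labeled data set given as (value, class) pairs. The function names and the stopping rule (a fixed maximum number of intervals) are illustrative; Kerber's ChiMerge uses a χ² significance threshold as well.

```python
from collections import Counter

def chi2_adjacent(a, b, classes):
    """Chi-square statistic for the 2 x |classes| table formed by two
    adjacent intervals whose class counts are the Counters a and b."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for interval in (a, b):
        row_total = sum(interval.values())
        for c in classes:
            col_total = a[c] + b[c]
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (interval[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, max_intervals=6):
    """Bottom-up merging: start with one interval per distinct value and
    repeatedly merge the adjacent pair with the smallest chi-square value."""
    classes = sorted(set(labels))
    intervals = []                      # list of [lower_bound, Counter of class counts]
    for v, c in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][c] += 1
        else:
            intervals.append([v, Counter({c: 1})])
    while len(intervals) > max_intervals:
        chis = [chi2_adjacent(intervals[i][1], intervals[i + 1][1], classes)
                for i in range(len(intervals) - 1)]
        i = chis.index(min(chis))       # lowest chi2 = most similar class distributions
        intervals[i][1] += intervals[i + 1][1]
        del intervals[i + 1]
    return [low for low, _ in intervals]   # lower bound of each merged interval
```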

DWDM : 05BIF403: Prof. D.RAJESH 181


Segmentation by Natural
Partitioning

 A simple 3-4-5 rule can be used to segment numeric data


into relatively uniform, “natural” intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most significant digit,
partition the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most significant digit,
partition the range into 5 intervals
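
A small illustrative sketch of a single (non-recursive) step of this rule; the helper name and the way the most significant digit width is derived are our assumptions. It reproduces Step 3 of the profit example on the next slide.

```python
import math

def partition_3_4_5(low, high):
    """Split [low, high) into 3, 4 or 5 equi-width intervals based on the number
    of distinct values covered at the most significant digit position."""
    msd_width = 10 ** int(math.floor(math.log10(high - low)))   # size of one msd step
    low_r = math.floor(low / msd_width) * msd_width              # round low down
    high_r = math.ceil(high / msd_width) * msd_width             # round high up
    distinct = round((high_r - low_r) / msd_width)               # msd values covered
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                        # 1, 5, 10
        parts = 5
    width = (high_r - low_r) / parts
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]

# Profit example, Step 3: msd = 1,000, Low' = -1,000, High' = 2,000
# partition_3_4_5(-1000, 2000) -> [(-1000, 0), (0, 1000), (1000, 2000)]
# Applying it again to each interval gives the Step 5 sub-intervals.
```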

DWDM : 05BIF403: Prof. D.RAJESH 182


Example of 3-4-5 Rule
Step 1: for the attribute profit: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = $1,000; rounding Low down and High up gives Low' = -$1,000 and High' = $2,000
Step 3: the range (-$1,000 - $2,000) covers 3 distinct values at the msd, so it is split into 3 equi-width intervals:
        (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: adjusting to the actual Min and Max gives the overall range (-$400 - $5,000):
        the first interval shrinks to (-$400 - 0) because Min = -$351, and ($2,000 - $5,000) is added because Max = $4,700
Step 5: each interval is partitioned recursively by the 3-4-5 rule:
        (-$400 - 0) into 4: (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
        (0 - $1,000) into 5: (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
        ($1,000 - $2,000) into 5: ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
        ($2,000 - $5,000) into 3: ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
DWDM : 05BIF403: Prof. D.RAJESH 183
Concept Hierarchy Generation for
Categorical Data

 Specification of a partial/total ordering of attributes


explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}

DWDM : 05BIF403: Prof. D.RAJESH 184


Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated based
on the analysis of the number of distinct values per
attribute in the data set
 The attribute with the most distinct values is placed at the
lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
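
A tiny sketch of this heuristic, assuming the data sits in a pandas DataFrame; the DataFrame name is hypothetical, the column names are the ones from the slide.

```python
import pandas as pd

def auto_hierarchy(df: pd.DataFrame, attributes):
    """Order attributes from most distinct values (lowest hierarchy level)
    to fewest distinct values (highest level)."""
    counts = {a: df[a].nunique() for a in attributes}
    return sorted(attributes, key=counts.get, reverse=True)

# auto_hierarchy(location_df, ["street", "city", "province_or_state", "country"])
# -> ["street", "city", "province_or_state", "country"] for the counts above,
#    i.e. street (674,339) at the bottom and country (15) at the top.
```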


DWDM : 05BIF403: Prof. D.RAJESH 185
Chapter 2: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
DWDM : 05BIF403: Prof. D.RAJESH 186
Summary

 Data preparation or preprocessing is a big issue for both


data warehousing and data mining
 Descriptive data summarization is needed for quality data
preprocessing
 Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 A lot of methods have been developed, but data
preprocessing is still an active area of research

DWDM : 05BIF403: Prof. D.RAJESH 187




“Association Rules”

Market Baskets
Frequent Itemsets
A-priori Algorithm

DWDM : 05BIF403: Prof. D.RAJESH 189
The Market-Basket Model

 A large set of items, e.g., things sold in a


supermarket.
 A large set of baskets, each of which is a small set
of the items, e.g., the things one customer buys
on one day.

DWDM : 05BIF403: Prof. D.RAJESH 190


Support

 Simplest question: find sets of items that appear


“frequently” in the baskets.
 Support for itemset I = the number of baskets
containing all items in I.
 Given a support threshold s, sets of items that
appear in at least s baskets are called frequent
itemsets.

DWDM : 05BIF403: Prof. D.RAJESH 191


Example
 Items={milk, coke, pepsi, beer, juice}.
 Support = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 Frequent itemsets: {m}, {c}, {b}, {j}, {m,
b}, {c, b}, {j, c}.
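
A brute-force check of this example (deliberately not an efficient algorithm); the baskets and the threshold s = 3 are the ones listed above.

```python
from itertools import combinations
from collections import Counter

baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]
s = 3  # support threshold

counts = Counter()
for basket in baskets:
    for k in range(1, len(basket) + 1):
        for itemset in combinations(sorted(basket), k):   # count every subset
            counts[itemset] += 1

frequent = {itemset for itemset, n in counts.items() if n >= s}
# frequent == {('b',), ('c',), ('j',), ('m',),
#              ('b', 'c'), ('b', 'm'), ('c', 'j')}   -- matching the slide
```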

DWDM : 05BIF403: Prof. D.RAJESH 192


Applications --- (1)

 Real market baskets: chain stores keep terabytes


of information about what customers buy
together.
 Tells how typical customers navigate stores, lets them
position tempting items.
 Suggests tie-in “tricks,” e.g., run sale on diapers and
raise the price of beer.
 High support needed, or no $$’s .

DWDM : 05BIF403: Prof. D.RAJESH 193


Applications --- (2)

 “Baskets” = documents; “items” = words in those


documents.
 Lets us find words that appear together unusually
frequently, i.e., linked concepts.
 “Baskets” = sentences, “items” = documents
containing those sentences.
 Items that appear together too often could represent
plagiarism.

DWDM : 05BIF403: Prof. D.RAJESH 194


Applications --- (3)

 “Baskets” = Web pages; “items” = linked pages.


 Pairs of pages with many common references may be
about the same topic.
 “Baskets” = Web pages p ; “items” = pages that
link to p .
 Pages with many of the same links may be mirrors or
about the same topic.

DWDM : 05BIF403: Prof. D.RAJESH 195


Important Point

 “Market Baskets” is an abstraction that models


any many-many relationship between two
concepts: “items” and “baskets.”
 Items need not be “contained” in baskets.
 The only difference is that we count co-
occurrences of items related to a basket, not vice-
versa.

DWDM : 05BIF403: Prof. D.RAJESH 196


Scale of Problem

 WalMart sells 100,000 items and can store billions


of baskets.
 The Web has over 100,000,000 words and
billions of pages.

DWDM : 05BIF403: Prof. D.RAJESH 197


Association Rules

 If-then rules about the contents of baskets.


 {i1, i2,…,ik} → j means: "if a basket contains all of
i1,…,ik, then it is likely to contain j."
 Confidence of this association rule is the
probability of j given i1,…,ik.

DWDM : 05BIF403: Prof. D.RAJESH 198


Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 An association rule: {m, b} → c.
 Confidence = 2/4 = 50%.

DWDM : 05BIF403: Prof. D.RAJESH 199


Interest

 The interest of an association rule is the absolute


value of the amount by which the confidence
differs from what you would expect, were items
selected independently of one another.

DWDM : 05BIF403: Prof. D.RAJESH 200


Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 For association rule {m, b} → c, item c
appears in 5/8 of the baskets.
 Interest = | 2/4 - 5/8 | = 1/8 --- not very
interesting.
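
A quick check of the confidence and interest numbers above with the same eight baskets; nothing here relies on a library, it is just direct counting.

```python
baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]

def support(itemset):
    """Number of baskets containing every item in itemset."""
    return sum(1 for b in baskets if set(itemset) <= b)

antecedent, consequent = {'m', 'b'}, 'c'
confidence = support(antecedent | {consequent}) / support(antecedent)   # 2/4 = 0.5
fraction_with_c = support({consequent}) / len(baskets)                  # 5/8
interest = abs(confidence - fraction_with_c)                            # 1/8 = 0.125
```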

DWDM : 05BIF403: Prof. D.RAJESH 201


Relationships Among Measures

 Rules with high support and confidence may be
useful even if they are not "interesting."
 We don’t care if buying bread causes people to buy
milk, or whether simply a lot of people buy both bread
and milk.
 But high interest suggests a cause that might be
worth investigating.

DWDM : 05BIF403: Prof. D.RAJESH 202


Finding Association Rules

 A typical question: “find all association rules with


support ≥ s and confidence ≥ c.”
 Note: “support” of an association rule is the support of
the set of items it mentions.
 Hard part: finding the high-support (frequent )
itemsets.
 Checking the confidence of association rules involving
those sets is relatively easy.

DWDM : 05BIF403: Prof. D.RAJESH 203


Computation Model

 Typically, data is kept in a “flat file” rather than a


database system.
 Stored on disk.
 Stored basket-by-basket.
 Expand baskets into pairs, triples, etc. as you read
baskets.

DWDM : 05BIF403: Prof. D.RAJESH 204


Computation Model --- (2)

 The true cost of mining disk-resident data is


usually the number of disk I/O’s.
 In practice, association-rule algorithms read the
data in passes --- all baskets read in turn.
 Thus, we measure the cost by the number of
passes an algorithm takes.

DWDM : 05BIF403: Prof. D.RAJESH 205


Main-Memory Bottleneck

 In many algorithms to find frequent itemsets we


need to worry about how main memory is used.
 As we read baskets, we need to count something, e.g.,
occurrences of pairs.
 The number of different things we can count is limited
by main memory.
 Swapping counts in/out is a disaster.

DWDM : 05BIF403: Prof. D.RAJESH 206


Finding Frequent Pairs

 The hardest problem often turns out to be finding


the frequent pairs.
 We’ll concentrate on how to do that, then discuss
extensions to finding frequent triples, etc.

DWDM : 05BIF403: Prof. D.RAJESH 207


Naïve Algorithm

 A simple way to find frequent pairs is:


 Read file once, counting in main memory the
occurrences of each pair.
 Expand each basket of n items into its n (n -
1)/2 pairs.
 Fails if #items-squared exceeds main memory.

DWDM : 05BIF403: Prof. D.RAJESH 208



Details of Main-Memory Counting
There are two basic approaches:
1. Count all item pairs, using a triangular matrix.
2. Keep a table of triples [i, j, c] = the count of the pair
of items {i,j } is c.
 (1) requires only (say) 4 bytes/pair; (2) requires
12 bytes, but only for those pairs with >0
counts.

DWDM : 05BIF403: Prof. D.RAJESH 209


[Figure: Method (1), the triangular matrix, uses 4 bytes per pair; Method (2), the table of triples, uses 12 bytes per occurring pair.]

DWDM : 05BIF403: Prof. D.RAJESH 210


Details of Approach (1)

 Number items 1,2,…


 Keep pairs in the order {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…, {2,n}, {3,4},…, {3,n},…, {n-1,n}.
 Find pair {i, j} at position (i – 1)(n – i/2) + j – i.
 Total number of pairs is n(n – 1)/2; total bytes about 2n².
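
A small sketch of the layout just described, with items numbered 1..n; the function name and the array of counts are illustrative names of ours.

```python
def pair_index(i, j, n):
    """Position of pair {i, j} (1 <= i < j <= n) in the layout
    {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}; positions start at 1."""
    return (i - 1) * (n - i / 2) + (j - i)

n = 5
counts = [0] * (n * (n - 1) // 2)          # one slot per pair

def add_pair(i, j):
    if i > j:
        i, j = j, i
    counts[int(pair_index(i, j, n)) - 1] += 1    # convert to a 0-based slot

# pair_index(1, 2, 5) == 1, pair_index(1, 5, 5) == 4,
# pair_index(2, 3, 5) == 5, pair_index(4, 5, 5) == 10 == n*(n-1)/2
```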

DWDM : 05BIF403: Prof. D.RAJESH 211


Details of Approach (2)

 You need a hash table, with i and j as the key,


to locate (i, j, c) triples efficiently.
 Typically, the cost of the hash structure can be
neglected.
 Total bytes used is about 12p, where p is the
number of pairs that actually occur.
 Beats triangular matrix if at most 1/3 of possible pairs
actually occur.

DWDM : 05BIF403: Prof. D.RAJESH 212


A-Priori Algorithm --- (1)

 A two-pass approach called a-priori limits the


need for main memory.
 Key idea: monotonicity : if a set of items appears
at least s times, so does every subset.
 Contrapositive for pairs: if item i does not appear in s
baskets, then no pair including i can appear in s
baskets.

DWDM : 05BIF403: Prof. D.RAJESH 213


A-Priori Algorithm --- (2)

 Pass 1: Read baskets and count in main memory


the occurrences of each item.
 Requires only memory proportional to #items.
 Pass 2: Read baskets again and count in main
memory only those pairs both of which were
found in Pass 1 to be frequent.
 Requires memory proportional to square of frequent
items only.
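
A compact sketch of these two passes over a list of baskets; in practice the baskets would be streamed from disk, and the variable names and threshold are illustrative.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs.
    Pass 1: count single items.  Pass 2: count only pairs of frequent items."""
    item_counts = Counter()
    for basket in baskets:                                   # Pass 1
        item_counts.update(basket)
    frequent_items = {i for i, n in item_counts.items() if n >= s}

    pair_counts = Counter()
    for basket in baskets:                                   # Pass 2
        survivors = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(survivors, 2):              # monotonicity at work
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= s}

baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]
# apriori_pairs(baskets, 3) -> {('b', 'c'): 4, ('b', 'm'): 4, ('c', 'j'): 3}
```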

DWDM : 05BIF403: Prof. D.RAJESH 214


Picture of A-Priori

[Figure: In Pass 1, main memory holds the item counts; in Pass 2, it holds the frequent items from Pass 1 plus the counts of the candidate pairs.]

DWDM : 05BIF403: Prof. D.RAJESH 215


Detail for A-Priori

 You can use the triangular matrix method with n


= number of frequent items.
 Saves space compared with storing triples.
 Trick: number frequent items 1,2,… and keep a
table relating new numbers to original item
numbers.

DWDM : 05BIF403: Prof. D.RAJESH 216


Frequent Triples, Etc.

 For each k, we construct two sets of k –tuples:


 Ck = candidate k-tuples = those that might be
frequent sets (support ≥ s) based on information from
the pass for k – 1.
 Lk = the set of truly frequent k –tuples.

DWDM : 05BIF403: Prof. D.RAJESH 217


[Figure: C1 → Filter (first pass) → L1 → Construct → C2 → Filter (second pass) → L2 → Construct → C3 → …]

DWDM : 05BIF403: Prof. D.RAJESH 218


A-Priori for All Frequent
Itemsets

 One pass for each k.


 Needs room in main memory to count each
candidate k –tuple.
 For typical market-basket data and reasonable
support (e.g., 1%), k = 2 requires the most
memory.

DWDM : 05BIF403: Prof. D.RAJESH 219


Frequent Itemsets --- (2)

 C1 = all items
 L1 = those counted on first pass to be
frequent.
 C2 = pairs, both chosen from L1.
 In general, Ck = k-tuples, every (k – 1)-subset of which
is in Lk-1.
 Lk = those candidates with support ≥ s.

DWDM : 05BIF403: Prof. D.RAJESH 220



What is Cluster Analysis?


Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

DWDM : 05BIF403: Prof. D.RAJESH 221
Note (Slide 221): In supervised clustering, the user specifies the k clusters and the initial centers; in unsupervised clustering, the computer allocates the k clusters and the initial centers.

Examples of Clustering Applications


 Marketing: Help marketers discover distinct groups in their
customer bases to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of insurance policy holders with a
high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and location
 Earth-quake studies: Observed earth quake epicenters
clustered along continent faults

DWDM : 05BIF403: Prof. D.RAJESH 222


Note (Slide 222): When the categories are unspecified, it is referred to as unsupervised learning; when the categories are specified, it is referred to as supervised learning. The selection of the initial points is extremely important for the k-means algorithm.

What Is Good Clustering?

 A good clustering method will produce high quality clusters with


 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the similarity
measure used by the method and its implementation.
 The quality of a clustering method is measured by its ability to
discover the hidden patterns.

DWDM : 05BIF403: Prof. D.RAJESH 223


Note (Slide 223): The objective of clustering is to minimize the distance between objects and their cluster center. Clustering criteria: (1) minimize the distance between objects and the center within each cluster; (2) set up the number of clusters k.

Requirements of Clustering in Data Mining
 Scalability
 Deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Requirements for domain knowledge to determine
input parameters
 Deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

DWDM : 05BIF403: Prof. D.RAJESH 224


Note (Slide 224): Outliers are objects that are far away from the other objects in a cluster.
Data Structures
 Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
DWDM : 05BIF403: Prof. D.RAJESH 225
Note (Slide 225): The learning of a classifier is "supervised" in that it is told to which class each training tuple belongs. It contrasts with unsupervised learning (clustering), in which the class label of each training tuple is not known, and the number or set of classes (k clusters) to be learned may not be known in advance.

Measure the Quality of Clustering

 Dissimilarity/Similarity metric: Similarity is expressed in terms of


a distance function, which is typically metric: d(i, j)
 “Quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are different for interval-
scaled, boolean, categorical, ordinal and ratio variables.
 Weights associated with different variables based on
applications and data semantics.
 It is hard to define “similar enough” or “good enough”

DWDM : 05BIF403: Prof. D.RAJESH 226


Note (Slide 226): Many clustering algorithms require users to input certain parameters (such as the desired number of clusters k), and the clustering results can be quite sensitive to them. These parameters are often difficult to determine, but if the number of clusters k is set appropriately, it helps to speed up the clustering process.
Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

 Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

 Using the mean absolute deviation is more robust than using the
standard deviation
DWDM : 05BIF403: Prof. D.RAJESH 227


Similarity and Dissimilarity Between Objects

 Distances are normally used to measure the similarity or dissimilarity
between two data objects
 Some popular ones include the Minkowski distance (general case):

$$d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data
objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
DWDM : 05BIF403: Prof. D.RAJESH 228
Similarity and Dissimilarity Between
Objects (Cont.)
 If q = 2, d is the Euclidean distance (most popular):

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 one can use weighted distance, parametric Pearson product
moment correlation, or other dissimilarity measures.
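
A direct transcription of the formulas above; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. The function name is ours.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# minkowski((0, 0), (3, 4), q=2) -> 5.0   (Euclidean)
# minkowski((0, 0), (3, 4), q=1) -> 7.0   (Manhattan)
```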

DWDM : 05BIF403: Prof. D.RAJESH 229


Binary Variables
• A contingency table for binary data (object i vs. object j):

                 Object j
                 1        0        sum
  Object i  1    a        b        a + b
            0    c        d        c + d
            sum  a + c    b + d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

$$d(i,j) = \frac{b + c}{a + b + c + d}$$

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b + c}{a + b + c}$$
DWDM : 05BIF403: Prof. D.RAJESH 230
Dissimilarity between Binary Variables
 Example

Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

DWDM : 05BIF403: Prof. D.RAJESH 231
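
A sketch that reproduces the Jack/Mary/Jim numbers above; the dictionaries and the helper name are ours, with Y and P coded as 1, N as 0, and the symmetric gender attribute excluded.

```python
def asym_binary_dissim(x, y):
    """Jaccard-style dissimilarity (b + c) / (a + b + c) over asymmetric binary
    attributes; x and y map attribute -> 0/1, and matching 0s (d) are ignored."""
    a = sum(1 for k in x if x[k] == 1 and y[k] == 1)
    b = sum(1 for k in x if x[k] == 1 and y[k] == 0)
    c = sum(1 for k in x if x[k] == 0 and y[k] == 1)
    return (b + c) / (a + b + c)

jack = {'fever': 1, 'cough': 0, 'test1': 1, 'test2': 0, 'test3': 0, 'test4': 0}
mary = {'fever': 1, 'cough': 0, 'test1': 1, 'test2': 0, 'test3': 1, 'test4': 0}
jim  = {'fever': 1, 'cough': 1, 'test1': 0, 'test2': 0, 'test3': 0, 'test4': 0}

# asym_binary_dissim(jack, mary) -> 0.333...
# asym_binary_dissim(jack, jim)  -> 0.666...
# asym_binary_dissim(jim, mary)  -> 0.75
```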
Partitioning Method

A partitioning method constructs k clusters. It classifies the data into k groups, which together satisfy the requirements of a partition:
 Each group must contain at least one object.
 Each object must belong to exactly one group.
 k <= n, where k is the number of clusters and n the number of objects.

DWDM : 05BIF403: Prof. D.RAJESH 232


Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means algorithms
 k-means : Each cluster is represented by the center of the cluster

DWDM : 05BIF403: Prof. D.RAJESH 233


The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in 4 steps:

1. Partition objects into k nonempty subsets


2. Compute seed points as the centroids of the clusters of the current
partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
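
A minimal NumPy sketch of these four steps; the initial partition is a random assignment, empty clusters are not handled, and the names are ours.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: X is an (n, p) array; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(X))                  # step 1: random partition
    for _ in range(max_iter):
        centroids = np.array([X[assign == c].mean(axis=0)     # step 2: cluster means
                              for c in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)                      # step 3: nearest centroid
        if np.array_equal(new_assign, assign):                 # step 4: stop if unchanged
            break
        assign = new_assign
    return assign, centroids

# X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]]); k_means(X, 2)
```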

DWDM : 05BIF403: Prof. D.RAJESH 234


Clustering of a set of objects based on the k-means method

DWDM : 05BIF403: Prof. D.RAJESH 235


The K-Means Clustering Method
 Example
[Figure: scatter plots (axes 0-10) illustrating successive iterations of the k-means method.]

DWDM : 05BIF403: Prof. D.RAJESH 236


Comments on the K-Means Method
 Strength
 Relatively efficient: O(tkn), where n is # objects, k is # clusters, and
t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
 Weakness
 Applicable only when mean is defined, then what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes

DWDM : 05BIF403: Prof. D.RAJESH 237


K-medoids algorithm
Arbitrarily choose k objects as the initial medoids;
Repeat
assign each remaining object to the cluster with the
nearest medoid Oj;
randomly select a nonmedoid object, Orandom;
compute the total cost S of swapping Oj with Orandom;
if S < 0 then swap Oj with Orandom to form the new set
of k medoids;
Until no change;

DWDM : 05BIF403: Prof. D.RAJESH 238


Four cases of the cost function for k-medoids clustering

DWDM : 05BIF403: Prof. D.RAJESH 239


Note (Slide 239): The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
 Case 1: p currently belongs to medoid Oj. If Oj is
replaced by Orandom as a medoid and p is closest to
one of the other medoids Oi, i ≠ j, then p is reassigned to Oi.
 Case 2: p currently belongs to medoid Oj. If Oj is
replaced by Orandom, then p is reassigned to
Orandom.
 Case 3: p currently belongs to medoid Oi, i ≠ j. If
Oj is replaced by Orandom as a medoid and p is still
closest to Oi, then the assignment does not change.
 Case 4: p currently belongs to medoid Oi, i ≠ j. If
Oj is replaced by Orandom as a medoid and p is
closest to Orandom, then p is reassigned to Orandom.

DWDM : 05BIF403: Prof. D.RAJESH 240


Two dimensional example with 10 objects

DWDM : 05BIF403: Prof. D.RAJESH 241


Coordinates of the 10 objects

DWDM : 05BIF403: Prof. D.RAJESH 242


Assignment of objects to two representative
objects

DWDM : 05BIF403: Prof. D.RAJESH 243


Clustering corresponding to selections of
Number 1 and 5

DWDM : 05BIF403: Prof. D.RAJESH 244


Assignment of objects to two other
representative objects 4 and 8

DWDM : 05BIF403: Prof. D.RAJESH 245


Clustering corresponding to selections of
Number 4 and 8

DWDM : 05BIF403: Prof. D.RAJESH 246


An example of clustering five sample data items:
a graph with all pairwise distances

DWDM : 05BIF403: Prof. D.RAJESH 247


Sample table for the example

Item A B C D E

A 0 1 2 2 3

B 1 0 2 4 3

C 2 2 0 1 5

D 2 4 1 0 3

E 3 3 5 3 0

DWDM : 05BIF403: Prof. D.RAJESH 248


Example of K-medoids
Suppose the two medoids that are initially chosen are A and B.
Based on the following table and randomly placing items when
distances are identical to the two medoids, we obtain the
clusters {A, C, D} and {B, E}. The three non-medoids {C, D, E}
are examined to see which should be used to replace A or B.
We have six costs to determine: TCAC (the cost change by
replacing medoid A with medoid C), TCAD, TCAE, TCBC, TCBD and
TCBE.

TCAC=CAAC+CBAC+CCAC+CDAC+CEAC = 1 + 0 – 2 – 1 + 0 = -2
Where CAAC = the cost change of object A after replacing medoid
A with medoid C

DWDM : 05BIF403: Prof. D.RAJESH 249


DWDM : 05BIF403: Prof. D.RAJESH 250
Cost calculations for example

The diagram illustrates the calculation of these six


costs. We see that the minimum cost change is -2 and that
there are several swaps that achieve this reduction.
Arbitrarily choosing the first swap, we get C and B
as the new medoids with the clusters being {C, D}
and {B, A, E}

DWDM : 05BIF403: Prof. D.RAJESH 251


An example
Initially there are five objects A, B, C, D, E, two clusters {A, C, D} and {B, E}, and centers {A, B}.
Evaluate swapping center A for center C.
Consider the new cost (new centers {B, C}).

TCAC = CAAC + CBAC + CCAC + CDAC + CEAC


CAAC = CAB - CAA = 1 – 0 = 1
CBAC = CBB - CBB = 0 – 0 = 0
CCAC = CCC - CCA = 0 – 2 = -2
CDAC = CDC - CDA = 1 – 2 = -1
CEAC = CEB - CEB = 3 – 3 = 0

As a result, TCAC = 1 + 0 – 2 – 1 + 0 = – 2

The new centers {B, C} are less costly. As a result, we should swap {A, B} to {B, C} by the k-medoids
method
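
A small sketch that reproduces TC_AC = -2 from the distance table above; the distances and item names come straight from the slides, the helper names are ours.

```python
# Distance table from the five-item example (A..E)
items = ['A', 'B', 'C', 'D', 'E']
D = {
    'A': {'A': 0, 'B': 1, 'C': 2, 'D': 2, 'E': 3},
    'B': {'A': 1, 'B': 0, 'C': 2, 'D': 4, 'E': 3},
    'C': {'A': 2, 'B': 2, 'C': 0, 'D': 1, 'E': 5},
    'D': {'A': 2, 'B': 4, 'C': 1, 'D': 0, 'E': 3},
    'E': {'A': 3, 'B': 3, 'C': 5, 'D': 3, 'E': 0},
}

def total_cost(medoids):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(D[o][m] for m in medoids) for o in items)

def swap_cost(old_medoids, out, new):
    """Cost change TC when medoid `out` is replaced by non-medoid `new`."""
    new_medoids = [m for m in old_medoids if m != out] + [new]
    return total_cost(new_medoids) - total_cost(old_medoids)

# swap_cost(['A', 'B'], 'A', 'C') -> -2, so the swap to medoids {B, C} is accepted
```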

DWDM : 05BIF403: Prof. D.RAJESH 252


Comparison between K-means
and K-medoids

The k-medoids method is more robust than k-means


in the presence of noise and outliers because a
medoid is less influenced by outliers or other
extreme values than a mean. However, its
processing is more costly than the k-means
method. Both methods require the user to specify
k, the number of clusters.

DWDM : 05BIF403: Prof. D.RAJESH 253


Case study of a behavioral
segmentation for a phone company
This system is characterized by using a large number of
behavior-related key drivers to cluster customers into
different homogeneous segments, which are similar
in terms of profitability, call pattern or in other ways
that are meaningful for marketing planning purposes.
Our aim of the project is to develop a three
dimensional segmentation according to customer
revenue, call usage and call trend.

DWDM : 05BIF403: Prof. D.RAJESH 254


Sample report on clustering for a phone company to cluster customers based on the
phone usages and revenue to the company

[Figure: customer clusters plotted with Call Usage on the x-axis and Call Revenue on the y-axis.]
DWDM : 05BIF403: Prof. D.RAJESH 255
Derived business rules (observation)

23% high profitable groups of customers in cluster #1, #5 and


#7.

24% high usage caller group of customers in cluster #3, #8, #2


and #4.

Rule: High call usage implies higher call revenue,


but higher call revenue does not mean higher call usage.

DWDM : 05BIF403: Prof. D.RAJESH 256


Sample report on clustering for a phone company to cluster
customers based on the phone call duration and number of calls
[Figure: customer clusters plotted with Number of Calls on the x-axis and Call Duration on the y-axis.]

DWDM : 05BIF403: Prof. D.RAJESH 257


Derived business rules (observation)

 High duration and High Calls in cluster #1, and #8.

 Low duration and Low Calls in cluster #3, #5, #9 and


#10.

 Rule: high call duration most likely implies a higher
number of calls, while low call duration most likely
implies a lower number of calls.

DWDM : 05BIF403: Prof. D.RAJESH 258


Reading assignment

“Data Mining: Concepts and Techniques” 2nd Edition


by Han and Kamber, Morgan Kaufmann publishers,
2007, chapter 7, pp. 383-407.

DWDM : 05BIF403: Prof. D.RAJESH 259


Lecture Review Question 9

What is supervised clustering and what is


unsupervised clustering? How do you compare
their difference with respect to performance?
Illustrate the strength and weakness of k-means
in comparison with the k-medoids algorithm.

DWDM : 05BIF403: Prof. D.RAJESH 260


Tutorial Question 9
The following table contains the attributes name, gender, trait-1, trait-2, trait-3 and trait-4,
where name is an object-id, gender is a symmetric attribute, and the remaining trait
attributes are asymmetric, describing personal traits of individuals who desire a penpal.
Suppose that a service exists that attempts to find pairs of compatible penpals. For
asymmetric attribute values, let the value P be set to 1 and the value N be set to 0.
Suppose that the distance between objects (potential penpals) is computed based only on
the asymmetric variables.

(a) Compute the Jaccard coefficient for each pair.


(b) Who do you suggest would make the best pair of penpals? Which pair of individuals would
be the least compatible?

Name Gender Trait-1 Trait-2 Trait-3 Trait-4

Kevan M N P P N

Caroline F N P P P

Erik M P N N P

DWDM : 05BIF403: Prof. D.RAJESH 261



Genetic Algorithm
Genetic Algorithms (GA) apply an evolutionary
approach to inductive learning. GA has been
successfully applied to problems that are difficult to
solve using conventional techniques such as
scheduling problems, traveling salesperson problem,
network routing problems and financial marketing.

DWDM : 05BIF403: Prof. D.RAJESH 262
Note (Slide 262): A genetic algorithm locates representative sample data (a solution) among the training data, which represents a rule. The idea is to change the solution set until all elements pass a fitness function; the change can be a crossover or a mutation applied to the solution data set.

Supervised genetic learning

DWDM : 05BIF403: Prof. D.RAJESH 263


Note (Slide 263): Genetic algorithm methodology. Phase 1: the training data set is used to derive a rule of representative data (a solution) for the population elements (the input data to be mined). Phase 2: the test data set is used to test the solution derived in Phase 1; if the result passes the required success rate, it becomes a rule, otherwise Phase 1 is repeated.

Genetic learning algorithm

 Step 1: Initialize a population P of n elements


as a potential solution.

 Step 2: Until a specified termination condition


is satisfied:
 2a: Use a fitness function to evaluate each
element of the current solution. If an element passes
the fitness criteria, it remains in P.
 2b: The population now contains m elements (m
<= n). Use genetic operators to create (n – m) new
elements. Add the new elements to the population.
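
A bare skeleton of this loop, with the fitness function, the fitness threshold, and the crossover and mutation operators left as parameters. Everything here (the function names, the idea of fixed-length bit-string elements in the usage note) is illustrative rather than a specific published GA.

```python
import random

def genetic_learning(init_population, fitness, threshold, crossover, mutate,
                     max_generations=50):
    """Sketch of the loop above: keep elements that pass the fitness criterion,
    then refill the population with crossover/mutation of surviving elements."""
    population = list(init_population)
    n = len(population)
    for _ in range(max_generations):
        survivors = [e for e in population if fitness(e) >= threshold]     # step 2a
        if len(survivors) == n:                 # termination: every element passes
            break
        parents = survivors or population       # fall back if nothing passed
        # step 2b: create (n - m) new elements from the parents
        new_elements = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(n - len(survivors))]
        population = survivors + new_elements
    return population

# Example operators for fixed-length bit strings such as "0110":
# crossover = lambda a, b: a[:len(a) // 2] + b[len(b) // 2:]
# mutate    = lambda s: s   # or flip a random bit with small probability
```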

DWDM : 05BIF403: Prof. D.RAJESH 264


Note (Slide 264): Criteria for algorithm termination are: (1) the maximum number of iterations M specified by the user has been reached; (2) the minimum error rate (fitness function score) specified by the user has been reached after matching all the training (test) data.
Digitalized Genetic knowledge
representation

 A common technique for representing genetic


knowledge is to transform elements into binary
strings.

 For example, we can represent income range as a


string of two bits for assigning “00” to 20-30k,
“01” to 30-40k, and “11” to 50-60k.

DWDM : 05BIF403: Prof. D.RAJESH 265


Genetic operator - Crossover

 The elements most often used for crossover are


those destined to be eliminated from the
population.

 Crossover forms new elements for the population


by combining parts of two elements currently in
the population.

DWDM : 05BIF403: Prof. D.RAJESH 266


Genetic operator - Mutation

 Mutation is sparingly applied to elements chosen


for elimination.

 Mutation can be applied by randomly flipping bits


(or attribute values) within a single element.

DWDM : 05BIF403: Prof. D.RAJESH 267


Genetic operator - Selection

 Selection is to replace to-be-deleted elements by


copies of elements that pass the fitness test with
high scores.

 With selection, the overall fitness of the


population is guaranteed to increase.

DWDM : 05BIF403: Prof. D.RAJESH 268



Step 1 of Supervised genetic learning


This step initializes a population P of elements. The
P referred to population elements. The process
modifies the elements of the population until a
termination condition is satisfied, which might be
all elements of the population meet some
minimum criteria. An alternative is a fixed number
of iterations of the learning process.

DWDM : 05BIF403: Prof. D.RAJESH 269


Note (Slide 269): Own class means that the data in the training data set matches the target data (the suggested solution); competing class means that the data in the training data set does not match the target data.
Step 2 of supervised genetic learning
Step 2a applies a fitness function to evaluate each
element currently in the population. With each
iteration, elements not satisfying the fitness
criteria are eliminated from the population. The
final result of a supervised genetic learning
session is a set of population elements that best
represents the training data.

DWDM : 05BIF403: Prof. D.RAJESH 270


Step 2 of supervised genetic learning
Step 2b adds new elements to the population to
replace any elements eliminated in step 2a. New
elements are formed from previously deleted
elements by applying crossover and mutation.

DWDM : 05BIF403: Prof. D.RAJESH 271


An initial population for supervised
genetic learning example

Population element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex   Age

1 20-30k No Yes Male 30-39

2 30k-40k Yes No Female 50-59

3 ? No No Male 40-49

4 30k-40k Yes Yes Male 40-49

DWDM : 05BIF403: Prof. D.RAJESH 272


Question mark in population

A question mark in the population means that it is a "don't care" condition, which implies that the attribute is not important to the learning process.

DWDM : 05BIF403: Prof. D.RAJESH 273


Training Data for Genetic Learning
Training Instance   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex   Age

1 30-40k Yes Yes Male 30-39

2 30-40k Yes No Female 40-49

3 50-60k Yes No Female 30-39

4 20-30k No No Female 50-59

5 20-30k No No Male 20-29

6 30-40k No No Male 40-49

DWDM : 05BIF403: Prof. D.RAJESH 274


Goal and condition

 Our goal is to create a model able to differentiate


individuals who have accepted the life insurance
promotion from those who have not.
 We require that after each iteration of the
algorithm, exactly two elements from each class
(life insurance promotion=yes) & (life insurance
promotion=no) remain in the population.

DWDM : 05BIF403: Prof. D.RAJESH 275


Fitness Function
1. Let N be the number of matches of the input attribute
values of E with training instances from its own class.
2. Let M be the number of input attribute value matches to
all training instances from the competing classes.
3. Add 1 to M.
4. Divide N by M.

Note: the higher the fitness score, the smaller will be the
error rate for the solution.

DWDM : 05BIF403: Prof. D.RAJESH 276


Fitness function for element 1 own class of
life insurance promotion = no
1. Income Range = 20-30k matches with training
instances 4 and 5.
2. No matches for Credit Card Insurance=yes
3. Sex=Male matches with training instances 5 and
6.
4. No matches for Age=30-39.
5. ∴N=4

DWDM : 05BIF403: Prof. D.RAJESH 277


Fitness function for element 1 of competing class
of life insurance promotion = yes

1. No matches for Income Range=20-30k


2. Credit Card Insurance=yes matches with training
instance 1.
3. Sex=Male matches with training instance 1.
4. Age=30-39 matches with training instances 1 and 3.
5. ∴M = 4
6. ∴F(1) = 4 / 5 = 0.8
7. Similarly F(2)=0.86, F(3)=1.2, F(4)=1.0
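
A sketch of this fitness calculation that reproduces F(1) = 4/5 = 0.8 for population element 1, using the training data from the earlier slide; the tuple layout, column indices and variable names are conventions we chose.

```python
training = [  # (income, life_ins_promo, credit_card_ins, sex, age)
    ('30-40k', 'Yes', 'Yes', 'Male',   '30-39'),
    ('30-40k', 'Yes', 'No',  'Female', '40-49'),
    ('50-60k', 'Yes', 'No',  'Female', '30-39'),
    ('20-30k', 'No',  'No',  'Female', '50-59'),
    ('20-30k', 'No',  'No',  'Male',   '20-29'),
    ('30-40k', 'No',  'No',  'Male',   '40-49'),
]
INPUTS = (0, 2, 3, 4)   # income, credit card insurance, sex, age (column 1 is the class)

def fitness(element):
    """F = N / (M + 1): input-attribute matches with the element's own class (N)
    over matches with the competing classes (M)."""
    own_class = element[1]
    n = m = 0
    for row in training:
        matches = sum(1 for i in INPUTS
                      if element[i] != '?' and element[i] == row[i])   # '?' = don't care
        if row[1] == own_class:
            n += matches
        else:
            m += matches
    return n / (m + 1)

element1 = ('20-30k', 'No', 'Yes', 'Male', '30-39')
# fitness(element1) -> 4 / (4 + 1) = 0.8, matching F(1) on the slide
```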

DWDM : 05BIF403: Prof. D.RAJESH 278


Crossover operation for elements 1 & 2

DWDM : 05BIF403: Prof. D.RAJESH 279



A Second-Generation Population

Population element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex   Age

1 20-30k No No Female 50-59

2 30k-40k Yes Yes Male 30-39

3 ? No No Male 40-49

4 30k-40k Yes Yes Male 40-49

DWDM : 05BIF403: Prof. D.RAJESH 280


Note (Slide 280): After the second generation, the fitness function of element 1 is 1.4, where N = 2 + 3 + 1 + 1 = 7 and M = 0 + 2 + 2 + 0 = 4; the fitness function of element 2 is 1.5, where N = 2 + 1 + 1 + 2 = 6 and M = 1 + 0 + 2 + 0 = 3. Both elements show an improvement in fitness.
