
Interval Merge by χ2 Analysis

 Merging-based (bottom-up) vs. splitting-based methods


 Merge: Find the best neighboring intervals and merge them to form
larger intervals recursively
 ChiMerge [Kerber AAAI 1992, See also Liu et al. DMKD 2002]
 Initially, each distinct value of a numerical attr. A is considered to be one
interval
 χ2 tests are performed for every pair of adjacent intervals
 Adjacent intervals with the least χ2 values are merged together, since low χ2
values for a pair indicate similar class distributions
 This merge process proceeds recursively until a predefined stopping
criterion is met (such as significance level, max-interval, max inconsistency,
etc.)
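
A minimal sketch of this bottom-up merging loop, assuming a small labeled data set given as (value, class) pairs. The function names and the stopping rule (a fixed maximum number of intervals) are illustrative; Kerber's ChiMerge uses a χ² significance threshold as well.

```python
from collections import Counter

def chi2_adjacent(a, b, classes):
    """Chi-square statistic for the 2 x |classes| table formed by two
    adjacent intervals whose class counts are the Counters a and b."""
    total = sum(a.values()) + sum(b.values())
    chi2 = 0.0
    for interval in (a, b):
        row_total = sum(interval.values())
        for c in classes:
            col_total = a[c] + b[c]
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (interval[c] - expected) ** 2 / expected
    return chi2

def chimerge(values, labels, max_intervals=6):
    """Bottom-up merging: start with one interval per distinct value and
    repeatedly merge the adjacent pair with the smallest chi-square value."""
    classes = sorted(set(labels))
    intervals = []                      # list of [lower_bound, Counter of class counts]
    for v, c in sorted(zip(values, labels)):
        if intervals and intervals[-1][0] == v:
            intervals[-1][1][c] += 1
        else:
            intervals.append([v, Counter({c: 1})])
    while len(intervals) > max_intervals:
        chis = [chi2_adjacent(intervals[i][1], intervals[i + 1][1], classes)
                for i in range(len(intervals) - 1)]
        i = chis.index(min(chis))       # lowest chi2 = most similar class distributions
        intervals[i][1] += intervals[i + 1][1]
        del intervals[i + 1]
    return [low for low, _ in intervals]   # lower bound of each merged interval
```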

DWDM : 05BIF403: Prof. D.RAJESH 181


Segmentation by Natural
Partitioning

 A simple 3-4-5 rule can be used to segment numeric data


into relatively uniform, “natural” intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the most
significant digit, partition the range into 3 equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most significant digit,
partition the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most significant digit,
partition the range into 5 intervals
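
A small illustrative sketch of a single (non-recursive) step of this rule; the helper name and the way the most significant digit width is derived are our assumptions. It reproduces Step 3 of the profit example on the next slide.

```python
import math

def partition_3_4_5(low, high):
    """Split [low, high) into 3, 4 or 5 equi-width intervals based on the number
    of distinct values covered at the most significant digit position."""
    msd_width = 10 ** int(math.floor(math.log10(high - low)))   # size of one msd step
    low_r = math.floor(low / msd_width) * msd_width              # round low down
    high_r = math.ceil(high / msd_width) * msd_width             # round high up
    distinct = round((high_r - low_r) / msd_width)               # msd values covered
    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                                        # 1, 5, 10
        parts = 5
    width = (high_r - low_r) / parts
    return [(low_r + i * width, low_r + (i + 1) * width) for i in range(parts)]

# Profit example, Step 3: msd = 1,000, Low' = -1,000, High' = 2,000
# partition_3_4_5(-1000, 2000) -> [(-1000, 0), (0, 1000), (1000, 2000)]
# Applying it again to each interval gives the Step 5 sub-intervals.
```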

DWDM : 05BIF403: Prof. D.RAJESH 182


Example of 3-4-5 Rule
Step 1: for the attribute profit: Min = -$351, Low (i.e., 5%-tile) = -$159, High (i.e., 95%-tile) = $1,838, Max = $4,700
Step 2: msd = $1,000; rounding Low down and High up gives Low' = -$1,000 and High' = $2,000
Step 3: the range (-$1,000 - $2,000) covers 3 distinct values at the msd, so it is split into 3 equi-width intervals:
        (-$1,000 - 0), (0 - $1,000), ($1,000 - $2,000)
Step 4: adjusting to the actual Min and Max gives the overall range (-$400 - $5,000):
        the first interval shrinks to (-$400 - 0) because Min = -$351, and ($2,000 - $5,000) is added because Max = $4,700
Step 5: each interval is partitioned recursively by the 3-4-5 rule:
        (-$400 - 0) into 4: (-$400 - -$300), (-$300 - -$200), (-$200 - -$100), (-$100 - 0)
        (0 - $1,000) into 5: (0 - $200), ($200 - $400), ($400 - $600), ($600 - $800), ($800 - $1,000)
        ($1,000 - $2,000) into 5: ($1,000 - $1,200), ($1,200 - $1,400), ($1,400 - $1,600), ($1,600 - $1,800), ($1,800 - $2,000)
        ($2,000 - $5,000) into 3: ($2,000 - $3,000), ($3,000 - $4,000), ($4,000 - $5,000)
DWDM : 05BIF403: Prof. D.RAJESH 183
Concept Hierarchy Generation for
Categorical Data

 Specification of a partial/total ordering of attributes


explicitly at the schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit
data grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by
the analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}

DWDM : 05BIF403: Prof. D.RAJESH 184


Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated based
on the analysis of the number of distinct values per
attribute in the data set
 The attribute with the most distinct values is placed at the
lowest level of the hierarchy
 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
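
A tiny sketch of this heuristic, assuming the data sits in a pandas DataFrame; the DataFrame name is hypothetical, the column names are the ones from the slide.

```python
import pandas as pd

def auto_hierarchy(df: pd.DataFrame, attributes):
    """Order attributes from most distinct values (lowest hierarchy level)
    to fewest distinct values (highest level)."""
    counts = {a: df[a].nunique() for a in attributes}
    return sorted(attributes, key=counts.get, reverse=True)

# auto_hierarchy(location_df, ["street", "city", "province_or_state", "country"])
# -> ["street", "city", "province_or_state", "country"] for the counts above,
#    i.e. street (674,339) at the bottom and country (15) at the top.
```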


DWDM : 05BIF403: Prof. D.RAJESH 185
Chapter 2: Data Preprocessing

 Why preprocess the data?


 Data cleaning
 Data integration and transformation
 Data reduction
 Discretization and concept hierarchy
generation
 Summary
DWDM : 05BIF403: Prof. D.RAJESH 186
Summary

 Data preparation or preprocessing is a big issue for both


data warehousing and data mining
 Descriptive data summarization is needed for quality data
preprocessing
 Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
 A lot of methods have been developed, but data
preprocessing is still an active area of research

DWDM : 05BIF403: Prof. D.RAJESH 187




“Association Rules”

Market Baskets
Frequent Itemsets
A-priori Algorithm

DWDM : 05BIF403: Prof. D.RAJESH 189
The Market-Basket Model

 A large set of items, e.g., things sold in a


supermarket.
 A large set of baskets, each of which is a small set
of the items, e.g., the things one customer buys
on one day.

DWDM : 05BIF403: Prof. D.RAJESH 190


Support

 Simplest question: find sets of items that appear


“frequently” in the baskets.
 Support for itemset I = the number of baskets
containing all items in I.
 Given a support threshold s, sets of items that
appear in at least s baskets are called frequent
itemsets.

DWDM : 05BIF403: Prof. D.RAJESH 191


Example
 Items={milk, coke, pepsi, beer, juice}.
 Support = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 Frequent itemsets: {m}, {c}, {b}, {j}, {m,
b}, {c, b}, {j, c}.
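
A brute-force check of this example (deliberately not an efficient algorithm); the baskets and the threshold s = 3 are the ones listed above.

```python
from itertools import combinations
from collections import Counter

baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]
s = 3  # support threshold

counts = Counter()
for basket in baskets:
    for k in range(1, len(basket) + 1):
        for itemset in combinations(sorted(basket), k):   # count every subset
            counts[itemset] += 1

frequent = {itemset for itemset, n in counts.items() if n >= s}
# frequent == {('b',), ('c',), ('j',), ('m',),
#              ('b', 'c'), ('b', 'm'), ('c', 'j')}   -- matching the slide
```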

DWDM : 05BIF403: Prof. D.RAJESH 192


Applications --- (1)

 Real market baskets: chain stores keep terabytes


of information about what customers buy
together.
 Tells how typical customers navigate stores, lets them
position tempting items.
 Suggests tie-in “tricks,” e.g., run sale on diapers and
raise the price of beer.
 High support needed, or no $$’s .

DWDM : 05BIF403: Prof. D.RAJESH 193


Applications --- (2)

 “Baskets” = documents; “items” = words in those


documents.
 Lets us find words that appear together unusually
frequently, i.e., linked concepts.
 “Baskets” = sentences, “items” = documents
containing those sentences.
 Items that appear together too often could represent
plagiarism.

DWDM : 05BIF403: Prof. D.RAJESH 194


Applications --- (3)

 “Baskets” = Web pages; “items” = linked pages.


 Pairs of pages with many common references may be
about the same topic.
 “Baskets” = Web pages p ; “items” = pages that
link to p .
 Pages with many of the same links may be mirrors or
about the same topic.

DWDM : 05BIF403: Prof. D.RAJESH 195


Important Point

 “Market Baskets” is an abstraction that models


any many-many relationship between two
concepts: “items” and “baskets.”
 Items need not be “contained” in baskets.
 The only difference is that we count co-
occurrences of items related to a basket, not vice-
versa.

DWDM : 05BIF403: Prof. D.RAJESH 196


Scale of Problem

 WalMart sells 100,000 items and can store billions


of baskets.
 The Web has over 100,000,000 words and
billions of pages.

DWDM : 05BIF403: Prof. D.RAJESH 197


Association Rules

 If-then rules about the contents of baskets.


 {i1, i2,…,ik} → j means: "if a basket contains all of
i1,…,ik, then it is likely to contain j."
 Confidence of this association rule is the
probability of j given i1,…,ik.

DWDM : 05BIF403: Prof. D.RAJESH 198


Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 An association rule: {m, b} → c.
 Confidence = 2/4 = 50%.

DWDM : 05BIF403: Prof. D.RAJESH 199


Interest

 The interest of an association rule is the absolute


value of the amount by which the confidence
differs from what you would expect, were items
selected independently of one another.

DWDM : 05BIF403: Prof. D.RAJESH 200


Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
 For association rule {m, b} → c, item c
appears in 5/8 of the baskets.
 Interest = | 2/4 - 5/8 | = 1/8 --- not very
interesting.
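
A quick check of the confidence and interest numbers above with the same eight baskets; nothing here relies on a library, it is just direct counting.

```python
baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]

def support(itemset):
    """Number of baskets containing every item in itemset."""
    return sum(1 for b in baskets if set(itemset) <= b)

antecedent, consequent = {'m', 'b'}, 'c'
confidence = support(antecedent | {consequent}) / support(antecedent)   # 2/4 = 0.5
fraction_with_c = support({consequent}) / len(baskets)                  # 5/8
interest = abs(confidence - fraction_with_c)                            # 1/8 = 0.125
```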

DWDM : 05BIF403: Prof. D.RAJESH 201


Relationships Among Measures

 Rules with high support and confidence may be
useful even if they are not "interesting."
 We don’t care if buying bread causes people to buy
milk, or whether simply a lot of people buy both bread
and milk.
 But high interest suggests a cause that might be
worth investigating.

DWDM : 05BIF403: Prof. D.RAJESH 202


Finding Association Rules

 A typical question: “find all association rules with


support ≥ s and confidence ≥ c.”
 Note: “support” of an association rule is the support of
the set of items it mentions.
 Hard part: finding the high-support (frequent )
itemsets.
 Checking the confidence of association rules involving
those sets is relatively easy.

DWDM : 05BIF403: Prof. D.RAJESH 203


Computation Model

 Typically, data is kept in a “flat file” rather than a


database system.
 Stored on disk.
 Stored basket-by-basket.
 Expand baskets into pairs, triples, etc. as you read
baskets.

DWDM : 05BIF403: Prof. D.RAJESH 204


Computation Model --- (2)

 The true cost of mining disk-resident data is


usually the number of disk I/O’s.
 In practice, association-rule algorithms read the
data in passes --- all baskets read in turn.
 Thus, we measure the cost by the number of
passes an algorithm takes.

DWDM : 05BIF403: Prof. D.RAJESH 205


Main-Memory Bottleneck

 In many algorithms to find frequent itemsets we


need to worry about how main memory is used.
 As we read baskets, we need to count something, e.g.,
occurrences of pairs.
 The number of different things we can count is limited
by main memory.
 Swapping counts in/out is a disaster.

DWDM : 05BIF403: Prof. D.RAJESH 206


Finding Frequent Pairs

 The hardest problem often turns out to be finding


the frequent pairs.
 We’ll concentrate on how to do that, then discuss
extensions to finding frequent triples, etc.

DWDM : 05BIF403: Prof. D.RAJESH 207


Naïve Algorithm

 A simple way to find frequent pairs is:


 Read file once, counting in main memory the
occurrences of each pair.
 Expand each basket of n items into its n (n -
1)/2 pairs.
 Fails if #items-squared exceeds main memory.

DWDM : 05BIF403: Prof. D.RAJESH 208



Details of Main-Memory Counting
There are two basic approaches:
1. Count all item pairs, using a triangular matrix.
2. Keep a table of triples [i, j, c] = the count of the pair
of items {i,j } is c.
 (1) requires only (say) 4 bytes/pair; (2) requires
12 bytes, but only for those pairs with >0
counts.

DWDM : 05BIF403: Prof. D.RAJESH 209


[Figure: Method (1), the triangular matrix, uses 4 bytes per pair; Method (2), the table of triples, uses 12 bytes per occurring pair.]

DWDM : 05BIF403: Prof. D.RAJESH 210


Details of Approach (1)

 Number items 1,2,…


 Keep pairs in the order {1,2}, {1,3},…, {1,n}, {2,3}, {2,4},…, {2,n}, {3,4},…, {3,n},…, {n-1,n}.
 Find pair {i, j} at position (i – 1)(n – i/2) + j – i.
 Total number of pairs is n(n – 1)/2; total bytes about 2n².
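
A small sketch of the layout just described, with items numbered 1..n; the function name and the array of counts are illustrative names of ours.

```python
def pair_index(i, j, n):
    """Position of pair {i, j} (1 <= i < j <= n) in the layout
    {1,2}, {1,3}, ..., {1,n}, {2,3}, ..., {n-1,n}; positions start at 1."""
    return (i - 1) * (n - i / 2) + (j - i)

n = 5
counts = [0] * (n * (n - 1) // 2)          # one slot per pair

def add_pair(i, j):
    if i > j:
        i, j = j, i
    counts[int(pair_index(i, j, n)) - 1] += 1    # convert to a 0-based slot

# pair_index(1, 2, 5) == 1, pair_index(1, 5, 5) == 4,
# pair_index(2, 3, 5) == 5, pair_index(4, 5, 5) == 10 == n*(n-1)/2
```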

DWDM : 05BIF403: Prof. D.RAJESH 211


Details of Approach (2)

 You need a hash table, with i and j as the key,


to locate (i, j, c) triples efficiently.
 Typically, the cost of the hash structure can be
neglected.
 Total bytes used is about 12p, where p is the
number of pairs that actually occur.
 Beats triangular matrix if at most 1/3 of possible pairs
actually occur.

DWDM : 05BIF403: Prof. D.RAJESH 212


A-Priori Algorithm --- (1)

 A two-pass approach called a-priori limits the


need for main memory.
 Key idea: monotonicity : if a set of items appears
at least s times, so does every subset.
 Contrapositive for pairs: if item i does not appear in s
baskets, then no pair including i can appear in s
baskets.

DWDM : 05BIF403: Prof. D.RAJESH 213


A-Priori Algorithm --- (2)

 Pass 1: Read baskets and count in main memory


the occurrences of each item.
 Requires only memory proportional to #items.
 Pass 2: Read baskets again and count in main
memory only those pairs both of which were
found in Pass 1 to be frequent.
 Requires memory proportional to square of frequent
items only.
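
A compact sketch of these two passes over a list of baskets; in practice the baskets would be streamed from disk, and the variable names and threshold are illustrative.

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(baskets, s):
    """Two-pass A-Priori for frequent pairs.
    Pass 1: count single items.  Pass 2: count only pairs of frequent items."""
    item_counts = Counter()
    for basket in baskets:                                   # Pass 1
        item_counts.update(basket)
    frequent_items = {i for i, n in item_counts.items() if n >= s}

    pair_counts = Counter()
    for basket in baskets:                                   # Pass 2
        survivors = sorted(i for i in basket if i in frequent_items)
        for pair in combinations(survivors, 2):              # monotonicity at work
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= s}

baskets = [{'m', 'c', 'b'}, {'m', 'p', 'j'}, {'m', 'b'}, {'c', 'j'},
           {'m', 'p', 'b'}, {'m', 'c', 'b', 'j'}, {'c', 'b', 'j'}, {'b', 'c'}]
# apriori_pairs(baskets, 3) -> {('b', 'c'): 4, ('b', 'm'): 4, ('c', 'j'): 3}
```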

DWDM : 05BIF403: Prof. D.RAJESH 214


Picture of A-Priori

[Figure: In Pass 1, main memory holds the item counts; in Pass 2, it holds the frequent items from Pass 1 plus the counts of the candidate pairs.]

DWDM : 05BIF403: Prof. D.RAJESH 215


Detail for A-Priori

 You can use the triangular matrix method with n


= number of frequent items.
 Saves space compared with storing triples.
 Trick: number frequent items 1,2,… and keep a
table relating new numbers to original item
numbers.

DWDM : 05BIF403: Prof. D.RAJESH 216


Frequent Triples, Etc.

 For each k, we construct two sets of k –tuples:


 Ck = candidate k-tuples = those that might be
frequent sets (support ≥ s) based on information from
the pass for k – 1.
 Lk = the set of truly frequent k –tuples.

DWDM : 05BIF403: Prof. D.RAJESH 217


[Figure: C1 → Filter (first pass) → L1 → Construct → C2 → Filter (second pass) → L2 → Construct → C3 → …]

DWDM : 05BIF403: Prof. D.RAJESH 218


A-Priori for All Frequent
Itemsets

 One pass for each k.


 Needs room in main memory to count each
candidate k –tuple.
 For typical market-basket data and reasonable
support (e.g., 1%), k = 2 requires the most
memory.

DWDM : 05BIF403: Prof. D.RAJESH 219


Frequent Itemsets --- (2)

 C1 = all items
 L1 = those counted on first pass to be
frequent.
 C2 = pairs, both chosen from L1.
 In general, Ck = k-tuples, every (k – 1)-subset of which
is in Lk-1.
 Lk = those candidates with support ≥ s.

DWDM : 05BIF403: Prof. D.RAJESH 220



What is Cluster Analysis?


Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster analysis
Grouping a set of data objects into clusters
Clustering is unsupervised classification: no predefined
classes
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

DWDM : 05BIF403: Prof. D.RAJESH 221
Note (Slide 221): In supervised clustering, the user specifies the k clusters and the initial centers; in unsupervised clustering, the computer allocates the k clusters and the initial centers.

Examples of Clustering Applications


 Marketing: Help marketers discover distinct groups in their
customer bases to develop targeted marketing programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of insurance policy holders with a
high average claim cost
 City-planning: Identifying groups of houses according to their
house type, value, and location
 Earth-quake studies: Observed earth quake epicenters
clustered along continent faults

DWDM : 05BIF403: Prof. D.RAJESH 222


Note (Slide 222): When the categories are unspecified, it is referred to as unsupervised learning; when the categories are specified, it is referred to as supervised learning. The selection of the initial points is extremely important for the k-means algorithm.

What Is Good Clustering?

 A good clustering method will produce high quality clusters with


 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the similarity
measure used by the method and its implementation.
 The quality of a clustering method is measured by its ability to
discover the hidden patterns.

DWDM : 05BIF403: Prof. D.RAJESH 223


Note (Slide 223): The objective of clustering is to minimize the distance between objects and their cluster center. Clustering criteria: (1) minimize the distance between objects and the center within each cluster; (2) set up the number of clusters k.

Requirements of Clustering in Data Mining
 Scalability
 Deal with different types of attributes
 Discovery of clusters with arbitrary shape
 Requirements for domain knowledge to determine
input parameters
 Deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability

DWDM : 05BIF403: Prof. D.RAJESH 224


Note (Slide 224): Outliers are objects that are far away from the other objects in a cluster.
Data Structures
 Data matrix (two modes):

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode):

$$\begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$
DWDM : 05BIF403: Prof. D.RAJESH 225
Note (Slide 225): The learning of a classifier is "supervised" in that it is told to which class each training tuple belongs. It contrasts with unsupervised learning (clustering), in which the class label of each training tuple is not known, and the number or set of classes (k clusters) to be learned may not be known in advance.

Measure the Quality of Clustering

 Dissimilarity/Similarity metric: Similarity is expressed in terms of


a distance function, which is typically metric: d(i, j)
 “Quality” function that measures the “goodness” of a cluster.
 The definitions of distance functions are different for interval-
scaled, boolean, categorical, ordinal and ratio variables.
 Weights associated with different variables based on
applications and data semantics.
 It is hard to define “similar enough” or “good enough”

DWDM : 05BIF403: Prof. D.RAJESH 226


Note (Slide 226): Many clustering algorithms require users to input certain parameters (such as the desired number of clusters k), and the clustering results can be quite sensitive to them. These parameters are often difficult to determine, but if the number of clusters k is set appropriately, it helps to speed up the clustering process.
Interval-valued variables

 Standardize data
 Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$

where

$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

 Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

 Using the mean absolute deviation is more robust than using the
standard deviation
DWDM : 05BIF403: Prof. D.RAJESH 227


Similarity and Dissimilarity Between Objects

 Distances are normally used to measure the similarity or dissimilarity
between two data objects
 Some popular ones include the Minkowski distance (general case):

$$d(i,j) = \sqrt[q]{\,|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\,}$$

where i = (x_{i1}, x_{i2}, …, x_{ip}) and j = (x_{j1}, x_{j2}, …, x_{jp}) are two p-dimensional data
objects, and q is a positive integer
 If q = 1, d is the Manhattan distance:

$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
DWDM : 05BIF403: Prof. D.RAJESH 228
Similarity and Dissimilarity Between
Objects (Cont.)
 If q = 2, d is the Euclidean distance (most popular):

$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
 Properties
 d(i,j) ≥ 0
 d(i,i) = 0
 d(i,j) = d(j,i)
 d(i,j) ≤ d(i,k) + d(k,j)
 one can use weighted distance, parametric Pearson product
moment correlation, or other dissimilarity measures.
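
A direct transcription of the formulas above; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance. The function name is ours.

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points x and y."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# minkowski((0, 0), (3, 4), q=2) -> 5.0   (Euclidean)
# minkowski((0, 0), (3, 4), q=1) -> 7.0   (Manhattan)
```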

DWDM : 05BIF403: Prof. D.RAJESH 229


Binary Variables
• A contingency table for binary data (object i vs. object j):

                 Object j
                 1        0        sum
  Object i  1    a        b        a + b
            0    c        d        c + d
            sum  a + c    b + d    p

• Simple matching coefficient (invariant, if the binary variable is symmetric):

$$d(i,j) = \frac{b + c}{a + b + c + d}$$

• Jaccard coefficient (noninvariant if the binary variable is asymmetric):

$$d(i,j) = \frac{b + c}{a + b + c}$$
DWDM : 05BIF403: Prof. D.RAJESH 230
Dissimilarity between Binary Variables
 Example

Name   Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack   M       Y      N      P       N       N       N
Mary   F       Y      N      P       N       P       N
Jim    M       Y      P      N       N       N       N

 gender is a symmetric attribute
 the remaining attributes are asymmetric binary
 let the values Y and P be set to 1, and the value N be set to 0

d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

DWDM : 05BIF403: Prof. D.RAJESH 231
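
A sketch that reproduces the Jack/Mary/Jim numbers above; the dictionaries and the helper name are ours, with Y and P coded as 1, N as 0, and the symmetric gender attribute excluded.

```python
def asym_binary_dissim(x, y):
    """Jaccard-style dissimilarity (b + c) / (a + b + c) over asymmetric binary
    attributes; x and y map attribute -> 0/1, and matching 0s (d) are ignored."""
    a = sum(1 for k in x if x[k] == 1 and y[k] == 1)
    b = sum(1 for k in x if x[k] == 1 and y[k] == 0)
    c = sum(1 for k in x if x[k] == 0 and y[k] == 1)
    return (b + c) / (a + b + c)

jack = {'fever': 1, 'cough': 0, 'test1': 1, 'test2': 0, 'test3': 0, 'test4': 0}
mary = {'fever': 1, 'cough': 0, 'test1': 1, 'test2': 0, 'test3': 1, 'test4': 0}
jim  = {'fever': 1, 'cough': 1, 'test1': 0, 'test2': 0, 'test3': 0, 'test4': 0}

# asym_binary_dissim(jack, mary) -> 0.333...
# asym_binary_dissim(jack, jim)  -> 0.666...
# asym_binary_dissim(jim, mary)  -> 0.75
```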
Partitioning Method

A partitioning method constructs k clusters. It classifies the data into k groups, which together satisfy the requirements of a partition:
 Each group must contain at least one object.
 Each object must belong to exactly one group.
 k <= n, where k is the number of clusters and n the number of objects.

DWDM : 05BIF403: Prof. D.RAJESH 232


Partitioning Algorithms: Basic
Concept
 Partitioning method: Construct a partition of a database D
of n objects into a set of k clusters
 Given a k, find a partition of k clusters that optimizes the
chosen partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means algorithms
 k-means : Each cluster is represented by the center of the cluster

DWDM : 05BIF403: Prof. D.RAJESH 233


The K-Means Clustering Method

 Given k, the k-means algorithm is implemented in 4 steps:

1. Partition objects into k nonempty subsets


2. Compute seed points as the centroids of the clusters of the current
partition. The centroid is the center (mean point) of the cluster.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to Step 2; stop when there are no more new assignments.
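
A minimal NumPy sketch of these four steps; the initial partition is a random assignment, empty clusters are not handled, and the names are ours.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: X is an (n, p) array; returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    assign = rng.integers(0, k, size=len(X))                  # step 1: random partition
    for _ in range(max_iter):
        centroids = np.array([X[assign == c].mean(axis=0)     # step 2: cluster means
                              for c in range(k)])
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)                      # step 3: nearest centroid
        if np.array_equal(new_assign, assign):                 # step 4: stop if unchanged
            break
        assign = new_assign
    return assign, centroids

# X = np.array([[1, 1], [1.5, 2], [8, 8], [9, 9.5]]); k_means(X, 2)
```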

DWDM : 05BIF403: Prof. D.RAJESH 234


Clustering of a set of objects based on the k-means method

DWDM : 05BIF403: Prof. D.RAJESH 235


The K-Means Clustering Method
 Example
[Figure: scatter plots (axes 0-10) illustrating successive iterations of the k-means method.]

DWDM : 05BIF403: Prof. D.RAJESH 236


Comments on the K-Means Method
 Strength
 Relatively efficient: O(tkn), where n is # objects, k is # clusters, and
t is # iterations. Normally, k, t << n.
 Often terminates at a local optimum. The global optimum may be
found using techniques such as: deterministic annealing and genetic
algorithms
 Weakness
 Applicable only when mean is defined, then what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes

DWDM : 05BIF403: Prof. D.RAJESH 237


K-medoids algorithm
Arbitrarily choose k objects as the initial medoids;
Repeat
assign each remaining object to the cluster with the
nearest medoid Oj;
randomly select a nonmedoid object, Orandom;
compute the total cost S of swapping Oj with Orandom;
if S < 0 then swap Oj with Orandom to form the new set
of k medoids;
Until no change;

DWDM : 05BIF403: Prof. D.RAJESH 238


Four cases of the cost function for k-medoids clustering

DWDM : 05BIF403: Prof. D.RAJESH 239


Note (Slide 239): The k-medoids method is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean. However, its processing is more costly than the k-means method.
 Case 1: p currently belongs to medoid Oj. If Oj is
replaced by Orandom as a medoid and p is closest to
one of the other medoids Oi, i ≠ j, then p is reassigned to Oi.
 Case 2: p currently belongs to medoid Oj. If Oj is
replaced by Orandom, then p is reassigned to
Orandom.
 Case 3: p currently belongs to medoid Oi, i ≠ j. If
Oj is replaced by Orandom as a medoid and p is still
closest to Oi, then the assignment does not change.
 Case 4: p currently belongs to medoid Oi, i ≠ j. If
Oj is replaced by Orandom as a medoid and p is
closest to Orandom, then p is reassigned to Orandom.

DWDM : 05BIF403: Prof. D.RAJESH 240


Two dimensional example with 10 objects

DWDM : 05BIF403: Prof. D.RAJESH 241


Coordinates of the 10 objects

DWDM : 05BIF403: Prof. D.RAJESH 242


Assignment of objects to two representative
objects

DWDM : 05BIF403: Prof. D.RAJESH 243


Clustering corresponding to selections of
Number 1 and 5

DWDM : 05BIF403: Prof. D.RAJESH 244


Assignment of objects to two other
representative objects 4 and 8

DWDM : 05BIF403: Prof. D.RAJESH 245


Clustering corresponding to selections of
Number 4 and 8

DWDM : 05BIF403: Prof. D.RAJESH 246


An example of clustering five sample data items:
a graph with all pairwise distances

DWDM : 05BIF403: Prof. D.RAJESH 247


Sample table for the example

Item A B C D E

A 0 1 2 2 3

B 1 0 2 4 3

C 2 2 0 1 5

D 2 4 1 0 3

E 3 3 5 3 0

DWDM : 05BIF403: Prof. D.RAJESH 248


Example of K-medoids
Suppose the two medoids that are initially chosen are A and B.
Based on the following table and randomly placing items when
distances are identical to the two medoids, we obtain the
clusters {A, C, D} and {B, E}. The three non-medoids {C, D, E}
are examined to see which should be used to replace A or B.
We have six costs to determine: TCAC (the cost change by
replacing medoid A with medoid C), TCAD, TCAE, TCBC, TCBD and
TCBE.

TCAC=CAAC+CBAC+CCAC+CDAC+CEAC = 1 + 0 – 2 – 1 + 0 = -2
Where CAAC = the cost change of object A after replacing medoid
A with medoid C

DWDM : 05BIF403: Prof. D.RAJESH 249


DWDM : 05BIF403: Prof. D.RAJESH 250
Cost calculations for example

The diagram illustrates the calculation of these six


costs. We see that the minimum cost change is -2 and that
there are several swaps that achieve this reduction.
Arbitrarily choosing the first swap, we get C and B
as the new medoids with the clusters being {C, D}
and {B, A, E}

DWDM : 05BIF403: Prof. D.RAJESH 251


An example
Initially there are five objects A, B, C, D, E, two clusters {A, C, D} and {B, E}, and centers {A, B}.
Evaluate swapping center A for center C.
Consider the new cost (new centers {B, C}).

TCAC = CAAC + CBAC + CCAC + CDAC + CEAC


CAAC = CAB - CAA = 1 – 0 = 1
CBAC = CBB - CBB = 0 – 0 = 0
CCAC = CCC - CCA = 0 – 2 = -2
CDAC = CDC - CDA = 1 – 2 = -1
CEAC = CEB - CEB = 3 – 3 = 0

As a result, TCAC = 1 + 0 – 2 – 1 + 0 = – 2

The new centers {B, C} are less costly. As a result, we should swap {A, B} to {B, C} by the k-medoids
method
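
A small sketch that reproduces TC_AC = -2 from the distance table above; the distances and item names come straight from the slides, the helper names are ours.

```python
# Distance table from the five-item example (A..E)
items = ['A', 'B', 'C', 'D', 'E']
D = {
    'A': {'A': 0, 'B': 1, 'C': 2, 'D': 2, 'E': 3},
    'B': {'A': 1, 'B': 0, 'C': 2, 'D': 4, 'E': 3},
    'C': {'A': 2, 'B': 2, 'C': 0, 'D': 1, 'E': 5},
    'D': {'A': 2, 'B': 4, 'C': 1, 'D': 0, 'E': 3},
    'E': {'A': 3, 'B': 3, 'C': 5, 'D': 3, 'E': 0},
}

def total_cost(medoids):
    """Sum of each object's distance to its nearest medoid."""
    return sum(min(D[o][m] for m in medoids) for o in items)

def swap_cost(old_medoids, out, new):
    """Cost change TC when medoid `out` is replaced by non-medoid `new`."""
    new_medoids = [m for m in old_medoids if m != out] + [new]
    return total_cost(new_medoids) - total_cost(old_medoids)

# swap_cost(['A', 'B'], 'A', 'C') -> -2, so the swap to medoids {B, C} is accepted
```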

DWDM : 05BIF403: Prof. D.RAJESH 252


Comparison between K-means
and K-medoids

The k-medoids method is more robust than k-means


in the presence of noise and outliers because a
medoid is less influenced by outliers or other
extreme values than a mean. However, its
processing is more costly than the k-means
method. Both methods require the user to specify
k, the number of clusters.

DWDM : 05BIF403: Prof. D.RAJESH 253


Case study of a behavioral
segmentation for a phone company
This system is characterized by using a large number of
behavior-related key drivers to cluster customers into
different homogeneous segments, which are similar
in terms of profitability, call pattern or in other ways
that are meaningful for marketing planning purposes.
Our aim of the project is to develop a three
dimensional segmentation according to customer
revenue, call usage and call trend.

DWDM : 05BIF403: Prof. D.RAJESH 254


Sample report on clustering for a phone company to cluster customers based on the
phone usages and revenue to the company

[Figure: customer clusters plotted with Call Usage on the x-axis and Call Revenue on the y-axis.]
DWDM : 05BIF403: Prof. D.RAJESH 255
Derived business rules (observation)

23% high profitable groups of customers in cluster #1, #5 and


#7.

24% high usage caller group of customers in cluster #3, #8, #2


and #4.

Rule: High call usage implies higher call revenue,


but higher call revenue does not mean higher call usage.

DWDM : 05BIF403: Prof. D.RAJESH 256


Sample report on clustering for a phone company to cluster
customers based on the phone call duration and number of calls
[Figure: customer clusters plotted with Number of Calls on the x-axis and Call Duration on the y-axis.]

DWDM : 05BIF403: Prof. D.RAJESH 257


Derived business rules (observation)

 High duration and High Calls in cluster #1, and #8.

 Low duration and Low Calls in cluster #3, #5, #9 and


#10.

 Rule: high call duration most likely implies a higher
number of calls, while low call duration most likely
implies a lower number of calls.

DWDM : 05BIF403: Prof. D.RAJESH 258


Reading assignment

“Data Mining: Concepts and Techniques” 2nd Edition


by Han and Kamber, Morgan Kaufmann publishers,
2007, chapter 7, pp. 383-407.

DWDM : 05BIF403: Prof. D.RAJESH 259


Lecture Review Question 9

What is supervised clustering and what is


unsupervised clustering? How do you compare
their difference with respect to performance?
Illustrate the strength and weakness of k-means
in comparison with the k-medoids algorithm.

DWDM : 05BIF403: Prof. D.RAJESH 260


Tutorial Question 9
The following table contains the attributes name, gender, trait-1, trait-2, trait-3 and trait-4,
where name is an object-id, gender is a symmetric attribute, and the remaining trait
attributes are asymmetric, describing personal traits of individuals who desire a penpal.
Suppose that a service exists that attempts to find pairs of compatible penpals. For
asymmetric attribute values, let the value P be set to 1 and the value N be set to 0.
Suppose that the distance between objects (potential penpals) is computed based only on
the asymmetric variables.

(a) Compute the Jaccard coefficient for each pair.


(b) Who do you suggest would make the best pair of penpals? Which pair of individuals would
be the least compatible?

Name Gender Trait-1 Trait-2 Trait-3 Trait-4

Kevan M N P P N

Caroline F N P P P

Erik M P N N P

DWDM : 05BIF403: Prof. D.RAJESH 261



Genetic Algorithm
Genetic Algorithms (GA) apply an evolutionary
approach to inductive learning. GA has been
successfully applied to problems that are difficult to
solve using conventional techniques such as
scheduling problems, traveling salesperson problem,
network routing problems and financial marketing.

DWDM : 05BIF403: Prof. D.RAJESH 262
Note (Slide 262): A genetic algorithm locates representative sample data (a solution) among the training data, which represents a rule. The idea is to change the solution set until all elements pass a fitness function; the change can be a crossover or a mutation applied to the solution data set.

Supervised genetic learning

DWDM : 05BIF403: Prof. D.RAJESH 263


Note (Slide 263): Genetic algorithm methodology. Phase 1: the training data set is used to derive a rule of representative data (a solution) for the population elements (the input data to be mined). Phase 2: the test data set is used to test the solution derived in Phase 1; if the result passes the required success rate, it becomes a rule, otherwise Phase 1 is repeated.

Genetic learning algorithm

 Step 1: Initialize a population P of n elements


as a potential solution.

 Step 2: Until a specified termination condition


is satisfied:
 2a: Use a fitness function to evaluate each
element of the current solution. If an element passes
the fitness criteria, it remains in P.
 2b: The population now contains m elements (m
<= n). Use genetic operators to create (n – m) new
elements. Add the new elements to the population.
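
A bare skeleton of this loop, with the fitness function, the fitness threshold, and the crossover and mutation operators left as parameters. Everything here (the function names, the idea of fixed-length bit-string elements in the usage note) is illustrative rather than a specific published GA.

```python
import random

def genetic_learning(init_population, fitness, threshold, crossover, mutate,
                     max_generations=50):
    """Sketch of the loop above: keep elements that pass the fitness criterion,
    then refill the population with crossover/mutation of surviving elements."""
    population = list(init_population)
    n = len(population)
    for _ in range(max_generations):
        survivors = [e for e in population if fitness(e) >= threshold]     # step 2a
        if len(survivors) == n:                 # termination: every element passes
            break
        parents = survivors or population       # fall back if nothing passed
        # step 2b: create (n - m) new elements from the parents
        new_elements = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(n - len(survivors))]
        population = survivors + new_elements
    return population

# Example operators for fixed-length bit strings such as "0110":
# crossover = lambda a, b: a[:len(a) // 2] + b[len(b) // 2:]
# mutate    = lambda s: s   # or flip a random bit with small probability
```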

DWDM : 05BIF403: Prof. D.RAJESH 264


Note (Slide 264): Criteria for algorithm termination are: (1) the maximum number of iterations M specified by the user has been reached; (2) the minimum error rate (fitness function score) specified by the user has been reached after matching all the training (test) data.
Digitalized Genetic knowledge
representation

 A common technique for representing genetic


knowledge is to transform elements into binary
strings.

 For example, we can represent income range as a


string of two bits for assigning “00” to 20-30k,
“01” to 30-40k, and “11” to 50-60k.

DWDM : 05BIF403: Prof. D.RAJESH 265


Genetic operator - Crossover

 The elements most often used for crossover are


those destined to be eliminated from the
population.

 Crossover forms new elements for the population


by combining parts of two elements currently in
the population.

DWDM : 05BIF403: Prof. D.RAJESH 266


Genetic operator - Mutation

 Mutation is sparingly applied to elements chosen


for elimination.

 Mutation can be applied by randomly flipping bits


(or attribute values) within a single element.

DWDM : 05BIF403: Prof. D.RAJESH 267


Genetic operator - Selection

 Selection is to replace to-be-deleted elements by


copies of elements that pass the fitness test with
high scores.

 With selection, the overall fitness of the


population is guaranteed to increase.

DWDM : 05BIF403: Prof. D.RAJESH 268



Step 1 of Supervised genetic learning


This step initializes a population P of elements. The
P referred to population elements. The process
modifies the elements of the population until a
termination condition is satisfied, which might be
all elements of the population meet some
minimum criteria. An alternative is a fixed number
of iterations of the learning process.

DWDM : 05BIF403: Prof. D.RAJESH 269


Note (Slide 269): Own class means that the data in the training data set matches the target data (the suggested solution); competing class means that the data in the training data set does not match the target data.
Step 2 of supervised genetic learning
Step 2a applies a fitness function to evaluate each
element currently in the population. With each
iteration, elements not satisfying the fitness
criteria are eliminated from the population. The
final result of a supervised genetic learning
session is a set of population elements that best
represents the training data.

DWDM : 05BIF403: Prof. D.RAJESH 270


Step 2 of supervised genetic learning
Step 2b adds new elements to the population to
replace any elements eliminated in step 2a. New
elements are formed from previously deleted
elements by applying crossover and mutation.

DWDM : 05BIF403: Prof. D.RAJESH 271


An initial population for supervised
genetic learning example

Population element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex   Age

1 20-30k No Yes Male 30-39

2 30k-40k Yes No Female 50-59

3 ? No No Male 40-49

4 30k-40k Yes Yes Male 40-49

DWDM : 05BIF403: Prof. D.RAJESH 272


Question mark in population

A question mark in the population means that it is a "don't care" condition, which implies that the attribute is not important to the learning process.

DWDM : 05BIF403: Prof. D.RAJESH 273


Training Data for Genetic Learning
Training Instance   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex   Age

1 30-40k Yes Yes Male 30-39

2 30-40k Yes No Female 40-49

3 50-60k Yes No Female 30-39

4 20-30k No No Female 50-59

5 20-30k No No Male 20-29

6 30-40k No No Male 40-49

DWDM : 05BIF403: Prof. D.RAJESH 274


Goal and condition

 Our goal is to create a model able to differentiate


individuals who have accepted the life insurance
promotion from those who have not.
 We require that after each iteration of the
algorithm, exactly two elements from each class
(life insurance promotion=yes) & (life insurance
promotion=no) remain in the population.

DWDM : 05BIF403: Prof. D.RAJESH 275


Fitness Function
1. Let N be the number of matches of the input attribute
values of E with training instances from its own class.
2. Let M be the number of input attribute value matches to
all training instances from the competing classes.
3. Add 1 to M.
4. Divide N by M.

Note: the higher the fitness score, the smaller will be the
error rate for the solution.

DWDM : 05BIF403: Prof. D.RAJESH 276


Fitness function for element 1 own class of
life insurance promotion = no
1. Income Range = 20-30k matches with training
instances 4 and 5.
2. No matches for Credit Card Insurance=yes
3. Sex=Male matches with training instances 5 and
6.
4. No matches for Age=30-39.
5. ∴N=4

DWDM : 05BIF403: Prof. D.RAJESH 277


Fitness function for element 1 of competing class
of life insurance promotion = yes

1. No matches for Income Range=20-30k


2. Credit Card Insurance=yes matches with training
instance 1.
3. Sex=Male matches with training instance 1.
4. Age=30-39 matches with training instances 1 and 3.
5. ∴M = 4
6. ∴F(1) = 4 / 5 = 0.8
7. Similarly F(2)=0.86, F(3)=1.2, F(4)=1.0
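
A sketch of this fitness calculation that reproduces F(1) = 4/5 = 0.8 for population element 1, using the training data from the earlier slide; the tuple layout, column indices and variable names are conventions we chose.

```python
training = [  # (income, life_ins_promo, credit_card_ins, sex, age)
    ('30-40k', 'Yes', 'Yes', 'Male',   '30-39'),
    ('30-40k', 'Yes', 'No',  'Female', '40-49'),
    ('50-60k', 'Yes', 'No',  'Female', '30-39'),
    ('20-30k', 'No',  'No',  'Female', '50-59'),
    ('20-30k', 'No',  'No',  'Male',   '20-29'),
    ('30-40k', 'No',  'No',  'Male',   '40-49'),
]
INPUTS = (0, 2, 3, 4)   # income, credit card insurance, sex, age (column 1 is the class)

def fitness(element):
    """F = N / (M + 1): input-attribute matches with the element's own class (N)
    over matches with the competing classes (M)."""
    own_class = element[1]
    n = m = 0
    for row in training:
        matches = sum(1 for i in INPUTS
                      if element[i] != '?' and element[i] == row[i])   # '?' = don't care
        if row[1] == own_class:
            n += matches
        else:
            m += matches
    return n / (m + 1)

element1 = ('20-30k', 'No', 'Yes', 'Male', '30-39')
# fitness(element1) -> 4 / (4 + 1) = 0.8, matching F(1) on the slide
```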

DWDM : 05BIF403: Prof. D.RAJESH 278


Crossover operation for elements 1 & 2

DWDM : 05BIF403: Prof. D.RAJESH 279



A Second-Generation Population

Population element   Income Range   Life Insurance Promotion   Credit Card Insurance   Sex   Age

1 20-30k No No Female 50-59

2 30k-40k Yes Yes Male 30-39

3 ? No No Male 40-49

4 30k-40k Yes Yes Male 40-49

DWDM : 05BIF403: Prof. D.RAJESH 280


Note (Slide 280): After the second generation, the fitness function of element 1 is 1.4, where N = 2 + 3 + 1 + 1 = 7 and M = 0 + 2 + 2 + 0 = 4; the fitness function of element 2 is 1.5, where N = 2 + 1 + 1 + 2 = 6 and M = 1 + 0 + 2 + 0 = 3. Both elements show an improvement in fitness.
