
ADVANCED ALGORITHMS IN

COMPUTATIONAL BIOLOGY (C3)


1
Class Info
Lecturer: Chi-Yao Tseng
cytseng@citi.sinica.edu.tw
Grading:
No assignments
Midterm:
2012/04/20
I'm in charge of 17x2 points out of 120
No take-home questions
2
Outline
Introduction
From data warehousing to data mining
Mining Capabilities
Association rules
Classification
Clustering
More about Data Mining
3
Main Reference
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006.
Official website:
http://www.cs.uiuc.edu/homes/hanj/bk2/
4
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes (10^15 B = 1 million GB)
Data collection and data availability
Automated data collection tools, database systems, Web,
computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks,
Science: Remote sensing, bioinformatics, scientific simulation,
Society and everyone: news, digital cameras, YouTube, Facebook
We are drowning in data, but starving for knowledge!
Necessity is the mother of invention: data mining, the automated analysis of massive data sets
5
Why Not Traditional Data Analysis?
Tremendous amount of data
Algorithms must be highly scalable to handle data at such scale (e.g., terabytes)
High-dimensionality of data
Microarray data may have tens of thousands of dimensions
High complexity of data
New and sophisticated applications
6
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems

7
What is Data Mining?
Knowledge discovery in databases
Extraction of interesting (non-trivial, implicit,
previously unknown and potentially useful)
patterns or knowledge from huge amount of data.
Alternative names:
Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.

8
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structured data, graphs, social networks and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia database
Text databases
The World-Wide Web

9
Knowledge Discovery (KDD) Process
10
[Figure: the KDD process as a pipeline: Databases → Data Cleaning & Integration → Data Warehouse → Selection & Transformation → Transformed Data → Data Mining → Patterns → Interpretation / Evaluation → Knowledge!]
This is a view from typical database systems
and data warehousing communities.
Data mining plays an essential role in the
knowledge discovery process.
Data Mining and Business Intelligence
11
[Figure: the business intelligence pyramid, from bottom to top, with increasing potential to support business decisions:
Data Sources: paper, files, Web documents, scientific experiments, database systems (DBA)
Data Preprocessing/Integration, Data Warehouses (DBA)
Data Exploration: statistical summary, querying, and reporting (Data Analyst)
Data Mining: information discovery (Data Analyst)
Data Presentation: visualization techniques (Business Analyst)
Decision Making (End User)]
Data Mining: Confluence of Multiple Disciplines
12
[Figure: data mining at the confluence of multiple disciplines: database technology, statistics, machine learning, pattern recognition, algorithms, high-performance computing, data visualization, and applications.]
Typical Data Mining System
13
[Figure: architecture of a typical data mining system, bottom to top: data sources (database, data warehouse, World-Wide Web, other info. repositories) feed, via data cleaning, integration, and selection, a database or data warehouse server; on top of it sit the data mining engine and the pattern evaluation module, both consulting a knowledge base, with a graphical user interface at the top.]
Data Warehousing
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
14
Data Warehousing
Subject-oriented:
Provide a simple and concise view around particular subject issues by
excluding data that are not useful in the decision support process.
Integrated:
Constructed by integrating multiple, heterogeneous data sources.
Time-variant:
Provide information from a historical perspective (e.g., past 5-10 years.)
Nonvolatile:
Operational update of data does not occur in the data warehouse
environment
Usually requires only two operations: load data & access data.
15
Data Warehousing
The process of constructing and using data warehouses

A decision support database that is maintained separately from the organization's operational database

Support information processing by providing a solid
platform of consolidated, historical data for analysis

Set up stages for effective data mining
16
Illustration of Data Warehousing
17
[Figure: data sources in Taipei, New York, London, etc. are cleaned, transformed, integrated, and loaded into a central data warehouse, which clients access through query and analysis tools.]
OLTP vs. OLAP
OLTP (On-line Transaction Processing): runs against the operational transaction database; short online transactions (update, insert, delete); current and detailed data; versatile workloads.
OLAP (On-line Analytical Processing): runs against the data warehouse; aggregated and historical data; static and low-volume updates; complex queries supporting analytics, data mining, and decision making.
18
Multi-Dimensional View of Data Mining
Data to be mined
Relational, data warehouse, transactional, stream, object-oriented/relational,
active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
Knowledge to be mined
Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
19
Mining Capabilities (1/4)
Multi-dimensional concept description:
Characterization and discrimination
Generalize, summarize, and contrast data
characteristics, e.g., dry vs. wet regions
Frequent patterns (or frequent itemsets),
association
Diaper → Beer [0.5%, 75%] (support, confidence)
20
Mining Capabilities (2/4)
Classification and prediction
Construct models (functions) that describe and distinguish classes or
concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on
(gas mileage)
Predict some unknown or missing numerical values

21
Mining Capabilities (3/4)
Clustering
Class label is unknown: Group data to form new categories
(i.e., clusters), e.g., cluster houses to find distribution
patterns
Maximizing intra-class similarity & minimizing interclass
similarity
Outlier analysis
Outlier: Data object that does not comply with the general
behavior of the data
Noise or exception? Useful in fraud detection, rare events
analysis

22
Mining Capabilities (4/4)
Time and ordering, trend and evolution
analysis
Trend and deviation: e.g., regression analysis
Sequential pattern mining: e.g., digital camera → large SD memory
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis

23
More Advanced Mining Techniques
Data stream mining
Mining data that is ordered, time-varying, potentially infinite.
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML),
substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
e.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could appear in multiple information networks: friends, family, classmates, ...
Links carry a lot of semantic information: Link mining
Web mining
Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining,
24
Challenges for Data Mining
Handling of different types of data
Efficiency and scalability of mining algorithms
Usefulness and certainty of mining results
Expression of various kinds of mining results
Interactive mining at multiple abstraction levels
Mining information from different sources of data
Protection of privacy and data security
25
Brief Summary
Data mining: Discovering interesting patterns and knowledge from massive amounts of data
A natural evolution of database technology, in great demand, with wide
applications
A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
Mining can be performed on a variety of data
Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
26
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-
Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases and
Data Mining (KDD95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
ACM Transactions on KDD starting in 2007
27
More details here: http://www.kdnuggets.com/gpspubs/sigkdd-explorations-kdd-10-years.html
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on
Knowledge Discovery in
Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining
(ICDM)
European Conf. on Machine
Learning and Principles and
practices of Knowledge Discovery
and Data Mining (ECML-PKDD)
Pacific-Asia Conf. on Knowledge
Discovery and Data Mining
(PAKDD)
Int. Conf. on Web Search and
Data Mining (WSDM)
Other related conferences
DB: ACM SIGMOD, VLDB, ICDE,
EDBT, ICDT
WEB & IR: CIKM, WWW, SIGIR
ML & PR: ICML, CVPR, NIPS
Journals
Data Mining and Knowledge
Discovery (DAMI or DMKD)
IEEE Trans. On Knowledge and
Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD

28
CAPABILITIES OF DATA MINING
29
FREQUENT PATTERNS &
ASSOCIATION RULES
30
Basic Concepts
Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis



31
Mining Association Rules
Transaction data analysis. Given:
A database of transactions (Each tx. has a list of
items purchased)
Minimum confidence and minimum support
Find all association rules: the presence of one
set of items implies the presence of another
set of items
32
Diaper → Beer [0.5%, 75%]
(support, confidence)
Two Parameters
Confidence (how true)
The rule X & Y → Z has 90% confidence:
means 90% of customers who bought X and Y also
bought Z.
Support (how useful is the rule)
Useful rules should have some minimum
transaction support.
33
Mining Strong Association Rules in
Transaction Databases (1/2)
Measurement of rule strength in a transaction
database.


34
A → B [support, confidence]
support = Pr(A ∪ B) = (# of tx containing all items in A ∪ B) / (total # of tx)
confidence = Pr(B | A) = (# of tx containing both A and B) / (# of tx containing A)
Mining Strong Association Rules in
Transaction Databases (2/2)
We are often interested in only strong
associations, i.e.,
support > min_sup
confidence > min_conf

Examples:
milk → bread [5%, 60%]
tire and auto_accessories → auto_services [2%, 80%].

35
Example of Association Rules
Transaction-id Items bought
1 A, B, D
2 A, C, D
3 A, D, E
4 B, E, F
5 B, C, D, E, F
36
Let min. support = 50%, min. confidence = 50%
Frequent patterns: {A:3, B:3, D:3, E:3, AD:3}
Association rules:
A → D (s = 60%, c = 100%)
D → A (s = 60%, c = 75%)
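The support and confidence figures above can be checked with a few lines of code. The following minimal Python sketch (our illustration, not part of the lecture material) recomputes them for the five example transactions:

```python
# Minimal sketch (our code): recompute the support/confidence numbers above
# for the 5-transaction example database.
transactions = [
    {"A", "B", "D"},            # tx 1
    {"A", "C", "D"},            # tx 2
    {"A", "D", "E"},            # tx 3
    {"B", "E", "F"},            # tx 4
    {"B", "C", "D", "E", "F"},  # tx 5
]

def support(itemset, txs):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= tx for tx in txs) / len(txs)

def confidence(lhs, rhs, txs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs, txs) / support(lhs, txs)

print(support({"A", "D"}, transactions))       # 0.6  -> A, D co-occur in 60% of txs
print(confidence({"A"}, {"D"}, transactions))  # 1.0  -> A -> D has 100% confidence
print(confidence({"D"}, {"A"}, transactions))  # 0.75 -> D -> A has 75% confidence
```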

Two Steps for Mining Association Rules
Determining large (frequent) itemsets
The main factor for overall performance
The downward closure property of frequent
patterns
Any subset of a frequent itemset must be frequent
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}

Generating rules
37
The Apriori Algorithm
Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.)
Derivation of large 1-itemsets L1: At the first iteration, scan all the transactions and count the number of occurrences for each item.
Level-wise derivation: At the k-th iteration, the candidate set Ck consists of those itemsets whose every (k-1)-item subset is in Lk-1. Scan the DB and count the # of occurrences for each candidate itemset.
38
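To make the level-wise procedure concrete, here is a small, self-contained Python sketch (our illustration; the function name and structure are assumptions, not the original Apriori implementation). It returns every frequent itemset with its support count and reproduces L1, L2, and L3 of the example on the next slide:

```python
# A compact Apriori sketch (our code, illustrating the level-wise idea above).
# `min_sup` is an absolute support count; returns {frozenset: count}.
from itertools import combinations

def apriori(transactions, min_sup):
    txs = [frozenset(t) for t in transactions]
    # First scan: count 1-itemsets and keep the frequent ones (L1).
    counts = {}
    for t in txs:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_sup}
    frequent = {s: c for s, c in counts.items() if c >= min_sup}
    k = 2
    while Lk:
        # Candidate generation: join L(k-1) with itself, keep k-itemsets whose
        # every (k-1)-subset is frequent (downward closure).
        cands = {a | b for a in Lk for b in Lk if len(a | b) == k}
        cands = {c for c in cands
                 if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # k-th scan: count the surviving candidates.
        counts = {c: sum(c <= t for t in txs) for c in cands}
        Lk = {c for c, n in counts.items() if n >= min_sup}
        frequent.update({c: n for c, n in counts.items() if n >= min_sup})
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(tdb, min_sup=2))  # reproduces L1, L2, L3 of the example below
```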
The Apriori Algorithm: An Example
39
Database TDB (min. support = 2 txs, i.e., 50%):
Tid 100: A, C, D
Tid 200: B, C, E
Tid 300: A, B, C, E
Tid 400: B, E
1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1: {A}:2, {B}:3, {C}:3, {E}:3
2nd scan → C2: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
3rd scan → C3: {B,C,E} → L3: {B,C,E}:2
From Large Itemsets to Rules
For each large itemset m
  For each subset p of m
    if ( sup(m) / sup(m-p) > min_conf )
      output the rule (m-p) → p
        conf. = sup(m) / sup(m-p)
        support = sup(m)

Example: m = {a,c,d,e,f,g} appears in 2,000 txs
p = {c,e,f,g}, so m-p = {a,d}, which appears in 5,000 txs
conf. = # {a,c,d,e,f,g} / # {a,d} = 2000 / 5000 = 40%
rule: {a,d} → {c,e,f,g}
confidence: 40%, support: 2,000 txs
40
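The rule-generation loop above translates almost directly into code. A short Python sketch follows (our code; it assumes a dictionary mapping each frequent itemset to its support count, e.g., the output of the apriori() sketch shown earlier):

```python
# Sketch of the rule-generation step above (our code). `frequent` maps each
# frequent itemset (frozenset) to its support count; by downward closure,
# every subset of a frequent itemset is also present in the dictionary.
from itertools import combinations

def generate_rules(frequent, min_conf):
    rules = []
    for m, sup_m in frequent.items():
        for r in range(1, len(m)):                      # every proper non-empty subset p of m
            for p in map(frozenset, combinations(m, r)):
                lhs = m - p
                conf = sup_m / frequent[lhs]            # sup(m) / sup(m-p)
                if conf >= min_conf:
                    rules.append((set(lhs), set(p), sup_m, conf))
    return rules
```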
Redundant Rules
For the same support and confidence thresholds, if we have a rule {a,d} → {c,e,f,g}, do we also have [agga98a]:
{a,d} → {c,e,f} ?  Yes!
{a} → {c,e,f,g} ?  No!
{a,d,c} → {e,f,g} ?  Yes!
{a} → {c,d,e,f,g} ?  No!
41
Practice
Suppose we additionally have two transactions:
500: A, C, E
600: B, C, D
Min. support = 3 txs (50%), min. confidence = 66%
Repeat the large itemset generation
Identify all large itemsets
Derive up to 4 rules
Generate rules from the large itemsets with the
biggest number of elements (from big to small)
42
Discussion of The Apriori Algorithm
Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.)
Derivation of large 1-itemsets L1: At the first iteration, scan all the transactions and count the number of occurrences for each item.
Level-wise derivation: At the k-th iteration, the candidate set Ck consists of those itemsets whose every (k-1)-item subset is in Lk-1. Scan the DB and count the # of occurrences for each candidate itemset.
The cardinality (number of elements) of C2 is huge.
The execution time of the first 2 iterations is the dominating factor for overall performance!
Database scans are expensive.
43
Improvement of the Apriori Algorithm
Reduce passes of transaction database scans

Shrink the number of candidates

Facilitate the support counting of candidates

44
Example Improvement 1- Partition: Scan
Database Only Twice
Any itemset that is potentially frequent in DB
must be frequent in at least one of the
partitions of DB
Scan 1: partition database and find local frequent
patterns
Scan 2: consolidate global frequent patterns

A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.




45
Example Improvement 2- DHP
DHP (Direct Hashing and Pruning): Apriori + hashing
Use a hash-based method to reduce the size of C2.
Allows effective reduction of the transaction database size (both the number of transactions and the size of each transaction).









46
Tid Items
100 A, C, D
200 B, C, E
300 A, B, C, E
400 B, E
J. Park, M.-S. Chen, and P. Yu.
An effective hash-based algorithm for mining association rules. In SIGMOD95.
Mining Frequent Patterns w/o Candidate
Generation
A highly compact data structure: frequent
pattern tree.
An FP-tree-based pattern fragment growth
mining method.
Search technique in mining: partitioning-
based, divide-and-conquer method.

J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without
Candidate Generation, in SIGMOD2000.
47
Frequent Pattern Tree (FP-tree)
3 parts:
One root labeled as null
A set of item prefix subtrees
Frequent item header table

Each node in the prefix subtree consists of
Item name
Count
Node-link

Each entry in the frequent-item header table consists of
Item-name
Head of node-link
48
The FP-tree Structure
49
The FP-tree contains the following paths from the root:
<f:4, c:3, a:3, m:2, p:2>
<f:4, c:3, a:3, b:1, m:1>
<f:4, b:1>
<c:1, b:1, p:1>
Frequent item header table (each entry keeps the head of its node-links): f, c, a, b, m, p
FP-tree Construction: Step1
Scan the transaction database DB once (the first time), and derive a list of frequent items.

Sort frequent items in frequency descending
order.

This ordering is important since each path of a
tree will follow this order.
50
Example (min. support = 3)
Tx ID Items Bought (ordered) Frequent Items
100 f,a,c,d,g,i,m,p f,c,a,m,p
200 a,b,c,f,l,m,o f,c,a,b,m
300 b,f,h,j,o f,b
400 b,c,k,s,p c,b,p
500 a,f,c,e,l,p,m,n f,c,a,m,p
51
List of frequent items:
(f:4), (c:4), (a:3), (b:3), (m:3), (p:3)
Frequent item header table (head of node-links): f, c, a, b, m, p
FP-tree Construction: Step 2
Create a root of a tree, label with null
Scan the database the second time. The scan of the first tx
leads to the construction of the first branch of the tree.
52
Scan of the 1st transaction: f,a,c,d,g,i,m,p → ordered frequent items f,c,a,m,p
The 1st branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>
FP-tree Construction: Step 2 (contd)
Scan of the 2nd transaction: a,b,c,f,l,m,o → ordered frequent items f,c,a,b,m
It shares the prefix <f,c,a> with the first branch, so those counts become f:2, c:2, a:2, and two new nodes are added: (b:1), (m:1).
53
The tree now holds the paths <f:2, c:2, a:2, m:1, p:1> and <f:2, c:2, a:2, b:1, m:1>.
The FP-tree
Tx ID Items Bought (ordered) Frequent Items
100 f,a,c,d,g,i,m,p f,c,a,m,p
200 a,b,c,f,l,m,o f,c,a,b,m
300 b,f,h,j,o f,b
400 b,c,k,s,p c,b,p
500 a,f,c,e,l,p,m,n f,c,a,m,p
54
The resulting FP-tree contains the paths:
<f:4, c:3, a:3, m:2, p:2>
<f:4, c:3, a:3, b:1, m:1>
<f:4, b:1>
<c:1, b:1, p:1>
Frequent item header table (head of node-links): f, c, a, b, m, p
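For illustration, a stripped-down Python sketch of FP-tree construction is shown below (our code, not the paper's implementation: it builds only the prefix tree, omits the header table and node-links, and breaks frequency ties alphabetically, so the f/c order may differ from the figure above):

```python
# A stripped-down FP-tree construction sketch (our code): prefix tree only,
# no header table / node-links and no mining step.
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}          # item -> FPNode

def build_fp_tree(transactions, min_sup):
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    root = FPNode(None, None)
    for t in transactions:
        # keep only frequent items, ordered by descending global frequency
        # (ties broken alphabetically, unlike the slide's f-before-c order)
        ordered = sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root

txs = [list("facdgimp"), list("abcflmo"), list("bfhjo"), list("bcksp"), list("afcelpmn")]
tree = build_fp_tree(txs, min_sup=3)
```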
Mining Process
Starts from the least frequent item p
Mining order: p -> m -> b -> a -> c -> f
55
Frequent item header table (head of node-links): f, c, a, b, m, p
Mining Process for item p
Starts from the least frequent item p
56
Two paths contain p:
<f:4, c:3, a:3, m:2, p:2>
<c:1, b:1, p:1>
Conditional pattern base of p:
<f:2, c:2, a:2, m:2>
<c:1, b:1>
Conditional frequent pattern (min. support = 3):
<c:3>
So we have two frequent patterns: {p:3}, {cp:3}
Mining Process for Item m
57
Two paths contain m:
<f:4, c:3, a:3, m:2>
<f:4, c:3, a:3, b:1, m:1>
Conditional pattern base of m:
<f:2, c:2, a:2>
<f:1, c:1, a:1, b:1>
Conditional frequent pattern (min. support = 3):
<f:3, c:3, a:3>
Mining m's Conditional FP-tree
58
Mine(<f:3, c:3, a:3> | m) generates (am:3), (cm:3), (fm:3)
  Mine(<f:3, c:3> | am) generates (cam:3), (fam:3)
    Mine(<f:3> | cam) generates (fcam:3)
  Mine(<f:3> | cm) generates (fcm:3)
So we have frequent patterns:
{m:3}, {am:3}, {cm:3}, {fm:3}, {cam:3}, {fam:3}, {fcm:3}, {fcam:3}
Analysis of the FP-tree-based method
Find the complete set of frequent itemsets
Efficient because
Works on a reduced set of pattern bases
Performs mining operations less costly than
generation & test
Cons:
No advantage if most transactions are short
The FP-tree may not fit into main memory
59
Generalized Association Rules
Given the class hierarchy (taxonomy), one would
like to choose proper data granularities for
mining.
Different confidence/support may be considered.

R. Srikant and R. Agrawal, Mining generalized association
rules, VLDB95.
60
Concept hierarchy:
Clothes: Outerwear (Jackets, Ski Pants), Shirts
Footwear: Shoes, Hiking Boots

Transactions:
100: Shirt
200: Jacket, Hiking Boots
300: Ski Pants, Hiking Boots
400: Shoes
500: Shoes
600: Jacket

Frequent itemsets (support counts): Jacket 2, Outerwear 3, Clothes 4, Shoes 2, Hiking Boots 2, Footwear 4, {Outerwear, Hiking Boots} 2, {Clothes, Hiking Boots} 2, {Outerwear, Footwear} 2, {Clothes, Footwear} 2

Rules (min_sup 30%, min_conf 60%):
Outerwear → Hiking Boots (sup 33%, conf 66%)
Outerwear → Footwear (sup 33%, conf 66%)
Hiking Boots → Outerwear (sup 33%, conf 100%)
Hiking Boots → Clothes (sup 33%, conf 100%)
Jacket → Hiking Boots (sup 16%, conf 50%)
Ski Pants → Hiking Boots (sup 16%, conf 100%)
61
Generalized Association Rules
62
Multiple min-support strategies across the concept hierarchy (example: Milk at level 1; 2% Milk and Skim Milk at level 2):

Uniform support: level 1 min_sup = 5%, level 2 min_sup = 5%
Milk [support = 10%], 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced support: level 1 min_sup = 5%, level 2 min_sup = 3%

Level filtering: level 1 min_sup = 12%, level 2 min_sup = 3%
Milk [support = 10%] does not pass level 1, so 2% Milk and Skim Milk are not examined.
Other Relevant Topics
Max patterns
R. J. Bayardo. Efficiently mining long patterns from databases.
SIGMOD'98.
Closed patterns
N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent
closed itemsets for association rules. ICDT'99.
Sequential Patterns
What items will one purchase after having bought certain items?
R. Srikant and R. Agrawal, Mining sequential patterns, ICDE'95
Traversal Patterns
Mining path traversal patterns in a web environment where
documents or objects are linked together to facilitate interactive
access.
M.-S. Chen, J. Park and P. Yu. Efficient Data Mining for Path Traversal
Patterns. TKDE98.

and more
63
CLASSIFICATION
64
Classification
Classifying tuples in a database.
Each tuple has some attributes with known
values.
In training set E
Each tuple consists of the same set of multiple
attributes as the tuples in the large database W.
Additionally, each tuple has a known class identity.
65
Classification (contd)
Derive the classification mechanism from the training set E, and then use this mechanism to classify general data (in the testing set).

A decision tree based approach has been
influential in machine learning studies.

66
Classification
Step 1: Model Construction
Train model from the existing data pool

67
Training
Data
name | age | income | own cars?
Sandy | <=30 | low | no
Bill | <=30 | low | yes
Fox | 31..40 | high | yes
Susan | >40 | med | no
Claire | >40 | med | no
Andy | 31..40 | high | yes
Classification algorithm
Classification rules
Classification
Step 2: Model Usage
68
Testing
Data
name | age | income | own cars? (predicted by the classification rules)
John | >40 | high | ? → No
Sally | <=30 | low | ? → No
Annie | 31..40 | high | ? → Yes
What is Prediction?
Prediction is similar to classification
First, construct model
Second, use model to predict future of unknown
objects
Prediction is different from classification
Classification predicts categorical class labels.
Prediction predicts continuous values.
Major method: regression
69
Supervised vs. Unsupervised
Learning
Supervised learning (e.g., classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations.

Unsupervised learning (e.g., clustering)
We are given a set of measurements, observations,
etc. with the aim of establishing the existence of
classes or clusters in the data.
No training data, or the training data are not
accompanied by class labels.
70
Evaluating Classification Methods
Predictive accuracy
Speed
Time to construct the model and time to use the model
Robustness
Handling noise and missing values
Scalability
Efficiency in large databases (not memory resident data)
Goodness of rules
Decision tree size
The compactness of classification rules
71
A Decision-Tree Based Classification
72
A decision tree for whether to play tennis or not:
outlook?
  sunny → humidity? (high → N, low → P)
  overcast → P
  rainy → windy? (Yes → N, No → P)
ID-3 and its extended version C4.5 (Quinlan '93):
A top-down decision tree generation algorithm
Algorithm for Decision Tree Induction (1/2)
73
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-conquer manner.
Attributes are categorical. (If an attribute is a continuous number, it needs to be discretized in advance, e.g., 0 <= age <= 100 discretized into 0~20, 21~40, 41~60, 61~80, 81~100.)
At start, all the training examples are at the root.
Examples are partitioned recursively based on selected attributes.
Algorithm for Decision Tree Induction
(2/2)
Basic algorithm (a greedy algorithm)

Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain): maximizing an
information gain measure, i.e., favoring the partitioning
which makes the majority of examples belong to a single
class.

Conditions for stopping partitioning:
All samples for a given node belong to the same class
There are no remaining attributes for further partitioning
majority voting is employed for classifying the leaf
There are no samples left

74
Decision Tree Induction: Training Dataset
75
age | income | student | credit_rating | buys_computer
<=30 | high | no | fair | no
<=30 | high | no | excellent | no
31..40 | high | no | fair | yes
>40 | medium | no | fair | yes
>40 | low | yes | fair | yes
>40 | low | yes | excellent | no
31..40 | low | yes | excellent | yes
<=30 | medium | no | fair | no
<=30 | low | yes | fair | yes
>40 | medium | yes | fair | yes
<=30 | medium | yes | excellent | yes
31..40 | medium | no | excellent | yes
31..40 | high | yes | fair | yes
>40 | medium | no | excellent | no
76
The first split on this data: Age? with branches <=30, 31..40, >40
Primary Issues in Tree Construction (1/2)
Split criterion: Goodness function
Used to select the attribute to be split at a tree
node during the tree generation phase
Different algorithms may use different goodness
functions:
Information gain (used in ID3/C4.5)
Gini index (used in CART)
77
Primary Issues in Tree Construction (2/2)
Branching scheme:
Determining the tree branch to which a sample
belongs
Binary vs. k-ary splitting

When to stop the further splitting of a node?
e.g. impurity measure
Labeling rule: a node is labeled as the class to
which most samples at the node belongs.
78
(e.g., a 3-way split on income: high / medium / low)
How to Use a Tree?
Directly
Test the attribute value of unknown sample against the
tree.
A path is traced from root to a leaf which holds the label.

Indirectly
Decision tree is converted to classification rules.
One rule is created for each path from the root to a leaf.
IF-THEN rules are easier for humans to understand.
79
80
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = -∑_{i=1}^{m} p_i log2(p_i)
Expected information (entropy):
Entropy is a measure of how "mixed up" an attribute is.
It is sometimes equated to the purity or impurity of a variable.
High Entropy means that we are sampling from a uniform
(boring) distribution.
81
Expected Information (Entropy)
Expected information (entropy) needed to classify a tuple in D (m: number of labels):
Info(D) = -∑_{i=1}^{m} p_i log2(p_i)
Examples:
Info(D) = I(3,2) = -(3/5) log2(3/5) - (2/5) log2(2/5) = (3/5)(0.737) + (2/5)(1.322) ≈ 0.971
Info(D) = I(5,0) = -(5/5) log2(5/5) - (0/5) log2(0/5) = 0
82
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = I(D) = -∑_{i=1}^{m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = ∑_{j=1}^{v} (|D_j| / |D|) I(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
83
Expected Information (Entropy)
Information needed (after using A to split D into v partitions) to classify D:
Info_A(D) = ∑_{j=1}^{v} (|D_j| / |D|) I(D_j)
Examples (a 5-tuple data set split into partitions of sizes 2 and 3):
Info_A(D) = (2/5) I(1,1) + (3/5) I(2,1)
Info_A(D) = (2/5) I(2,0) + (3/5) I(3,0)
84
Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no"
(using the buys_computer training table shown earlier)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age | p_i | n_i | I(p_i, n_i)
<=30 | 2 | 3 | 0.971
31..40 | 4 | 0 | 0
>40 | 3 | 2 | 0.971

Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
(5/14) I(2,3) means age <=30 has 5 out of 14 samples, with 2 yes's and 3 no's.

Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
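The entropy and gain figures above can be recomputed with a short script. A minimal Python sketch follows (our code; the (age, buys_computer) pairs are copied from the training table):

```python
# Recomputing the figures above (our sketch): entropy of D and the information
# gain of the "age" attribute on the 14-tuple buys_computer data set.
from math import log2
from collections import Counter

def info(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# (age, buys_computer) pairs taken from the training table
data = [("<=30", "no"), ("<=30", "no"), ("31..40", "yes"), (">40", "yes"),
        (">40", "yes"), (">40", "no"), ("31..40", "yes"), ("<=30", "no"),
        ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31..40", "yes"),
        ("31..40", "yes"), (">40", "no")]

labels = [y for _, y in data]
info_D = info(labels)                                   # 0.940

def info_after_split(data):
    n = len(data)
    groups = {}
    for a, y in data:
        groups.setdefault(a, []).append(y)
    return sum(len(ys) / n * info(ys) for ys in groups.values())

gain_age = info_D - info_after_split(data)              # 0.246
print(round(info_D, 3), round(gain_age, 3))
```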
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a large number of values.
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

SplitInfo_A(D) = -∑_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)

GainRatio(A) = Gain(A) / SplitInfo_A(D)

Example: SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 0.926
GainRatio(income) = 0.029 / 0.926 = 0.031

The attribute with the maximum gain ratio is selected as the splitting attribute.
85
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index Gini(D) is defined as
Gini(D) = 1 - ∑_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.

If a data set D is split on A into two subsets D_1 and D_2, the gini index Gini_A(D) is defined as:
Gini_A(D) = (|D_1| / |D|) Gini(D_1) + (|D_2| / |D|) Gini(D_2)

Reduction in impurity:
ΔGini(A) = Gini(D) - Gini_A(D)

The attribute that provides the smallest Gini_A(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute).
86
Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
Gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459

Suppose the attribute income partitions D into 10 tuples in D_1: {low, medium} and 4 in D_2: {high}:
Gini_{income ∈ {low,medium}}(D) = (10/14) Gini(D_1) + (4/14) Gini(D_2)
= (10/14) [1 - (6/10)^2 - (4/10)^2] + (4/14) [1 - (1/4)^2 - (3/4)^2]
= 0.45
= Gini_{income ∈ {high}}(D)

But Gini_{income ∈ {medium,high}} is 0.30 and thus the best since it is the lowest.
87
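A small Python sketch (our code) that reproduces the Gini values quoted above from the class counts stated on the slide:

```python
# Checking the Gini figures above (our sketch): gini of a class distribution
# and of a binary split, using the class counts quoted on the slide.
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(round(gini([9, 5]), 3))                  # Gini(D) = 0.459

n1, n2 = 10, 4
gini_split = n1 / 14 * gini([6, 4]) + n2 / 14 * gini([1, 3])
print(round(gini_split, 2))                    # 0.45 for the {low, medium} / {high} split
```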
Other Attribute Selection Measures
88
CHAID: a popular decision tree algorithm, measure based on the χ² test for independence
C-SEP: performs better than info. gain and gini index in certain cases
G-statistic: has a close approximation to the χ² distribution
MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
The best tree as the one that requires the fewest # of bits to both (1)
encode the tree, and (2) encode the exceptions to the tree
Multivariate splits (partition based on multiple variable combinations)
CART: finds multivariate splits based on a linear combination of attributes.
Which attribute selection measure is the best?

Most give good results; none is significantly superior to the others
Other Types of Classification Methods
Bayes Classification Methods
Rule-Based Classification
Support Vector Machine (SVM)

Some of these methods will be taught in the
following lessons.
89
CLUSTERING
90
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters
Cluster Analysis
Grouping a set of data objects into clusters

Typical applications:
As a stand-alone tool to get insight into data
distribution
As a preprocessing step for other algorithms
91
General Applications of Clustering
Spatial data analysis
Create thematic maps in GIS by clustering feature spaces.
Detect spatial clusters and explain them in spatial data mining.
Image Processing
Pattern recognition
Economic Science (especially market research)
WWW
Document classification
Cluster Web-log data to discover groups of similar access
patterns
92
Examples of Clustering Applications
Marketing: Help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop
targeted marketing programs.

Land use: Identification of areas of similar land use in an earth
observation database.

Insurance: Identifying groups of motor insurance policy
holders with a high average claim cost.

City-planning: Identifying groups of houses according to their
house type, value, and geographical location.
93
What is Good Clustering?
A good clustering method will produce high quality clusters with
High intra-class similarity
Low inter-class similarity

The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
The quality of a clustering method is also measured by its ability to discover hidden patterns.
94
Requirements of Clustering
in Data Mining (1/2)
Scalability
Ability to deal with different types of
attributes
Discovery of clusters with arbitrary shape
Minimal requirements of domain knowledge
for input
Able to deal with outliers
95
Requirements of Clustering
in Data Mining (2/2)
Insensitive to order of input records
High dimensionality
Curse of dimensionality
Incorporation of user-specified constraints
Interpretability and usability
96
Clustering Methods (I)
Partitioning Method
Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
K-means, k-medoids, CLARANS

Hierarchical Method
Create a hierarchical decomposition of the set of data (or objects)
using some criterion
Diana, Agnes, BIRCH, ROCK, CHAMELEON

Density-based Method
Based on connectivity and density functions
Typical methods: DBSCAN, OPTICS, DenClue
97
Clustering Methods (II)
Grid-based approach
based on a multiple-level granularity structure
Typical methods: STING, WaveCluster, CLIQUE
Model-based approach
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the given model.
Typical methods: EM, SOM, COBWEB
Frequent pattern-based
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based
Clustering by considering user-specified or application-specific constraints
Typical methods: cluster-on-demand, constrained clustering
98
99
Typical Alternatives to
Calculate the Distance between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min(t_ip, t_jq)
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max(t_ip, t_jq)
Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg(t_ip, t_jq)
Centroid: distance between the centroids of two clusters, i.e., dis(K_i, K_j) = dis(C_i, C_j)
Medoid: distance between the medoids of two clusters, i.e., dis(K_i, K_j) = dis(M_i, M_j)
Medoid: one chosen, centrally located object in the cluster
Medoid: one chosen, centrally located object in the cluster
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)
Centroid: the "middle" of a cluster
C_m = (∑_{i=1}^{N} t_ip) / N

Radius: square root of the average squared distance from any point of the cluster to its centroid
R_m = sqrt( ∑_{i=1}^{N} (t_ip - c_m)^2 / N )

Diameter: square root of the average squared distance between all pairs of points in the cluster
D_m = sqrt( ∑_{i=1}^{N} ∑_{j=1}^{N} (t_ip - t_jq)^2 / (N (N-1)) )
100
diameter != 2 * radius
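As a quick illustration (our code, using numpy and made-up sample points), the three statistics can be computed as follows; on the sample square, the diameter is indeed not twice the radius:

```python
# A small numpy sketch of the three statistics above (our code, on made-up points).
import numpy as np

def centroid(points):
    return points.mean(axis=0)

def radius(points):
    c = centroid(points)
    return np.sqrt(((points - c) ** 2).sum(axis=1).mean())

def diameter(points):
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    return np.sqrt(d2.sum() / (n * (n - 1)))   # i = j pairs contribute 0

pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
print(centroid(pts))                  # [1. 1.]
print(radius(pts), diameter(pts))     # ~1.414 vs ~2.309, i.e. diameter != 2 * radius
```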
Partitioning Algorithms: Basic Concept
Partitioning method: construct a partition of a
database D of n objects into a set of k clusters.
Given a number k, find a partition of k clusters
that optimizes the chosen partitioning criterion.
Global optimal: exhaustively enumerate all partitions.
Heuristic methods: k-means, k-medoids
k-means (MacQueen '67)
k-medoids or PAM, Partitioning Around Medoids (Kaufman & Rousseeuw '87)
101
The K-Means Clustering Method
Given k, the k-means algorithm is implemented in
four steps:
1. Arbitrarily choose k points as initial cluster centroids.
2. Update Means (Centroids): Compute seed points as
the center of the clusters of the current partition.
(center: mean point of the cluster)
3. Re-assign Points: Assign each object to the cluster with
the nearest seed point.
4. Go back to Step 2 (loop); stop when there are no more new assignments.
102
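A bare-bones Python/numpy sketch of these four steps is given below (our code; it ignores corner cases such as empty clusters):

```python
# A bare-bones k-means sketch following the four steps above (our code; empty
# clusters and other corner cases are not handled).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    assign = None
    for _ in range(n_iter):
        # step 3: assign each object to the nearest centroid
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = d2.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                              # step 4: no change -> stop
        assign = new_assign
        # step 2: recompute each centroid as the mean of its cluster
        centroids = np.array([X[assign == j].mean(axis=0) for j in range(k)])
    return centroids, assign
```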
Example of the
K-Means Clustering Method
103
[Figure: k-means on a 2-D point set with k = 2. Given k = 2: arbitrarily choose k objects as initial cluster centroids; assign each object to the most similar centroid; update the cluster means; re-assign points; update the cluster means again; re-assign, looping until no assignment changes.]
Comments on the K-Means Clustering
Time Complexity: O(tkn), where n is # of objects, k is
# of clusters, and t is # of iterations. Normally, k,t<<n.
Often terminates at a local optimum.
(The global optimum may be found using techniques such as:
deterministic annealing and genetic algorithms)
Weakness:
Applicable only when mean is defined, how about
categorical data?
Need to specify k, the number of clusters, in advance
Unable to handle noisy data and outliers
104
Why is K-Means Unable to
Handle Outliers?
The k-means algorithm is sensitive to outliers
Since an object with an extremely large value may
substantially distort the distribution of the data.





K-Medoids: Instead of taking the mean value of the
object in a cluster as a reference point, medoids can
be used, which is the most centrally located object in
a cluster.
105
PAM: The K-Medoids Method
PAM: Partition Around Medoids
Use real object to represent the cluster
1. Randomly select k representative objects as medoids.
2. Assign each data point to the closest medoid.
3. For each medoid m,
a. For each non-medoid data point o
b. Swap m and o, and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
106
loop
A Typical K-Medoids Algorithm (PAM)

107
[Figure: PAM on a 2-D point set with k = 2. Arbitrarily choose k objects as the initial medoids m1 and m2; assign each remaining object to the nearest medoid; then swap each medoid with each non-medoid data point, compute the total cost of the resulting configuration, and keep the swap only if it lowers the cost.]
108
PAM Clustering: Total swapping cost TC_ih = ∑_j C_jih
When the current medoid i is considered for replacement by a non-medoid h, the cost contribution C_jih of each non-selected object j depends on where j ends up (t denotes another, remaining medoid):
  j currently assigned to i, reassigned to t (d(j,h) > d(j,t)): C_jih = d(j,t) - d(j,i)
  j currently assigned to i, reassigned to h (d(j,h) < d(j,t)): C_jih = d(j,h) - d(j,i)
  j currently assigned to t and stays with t: C_jih = 0
  j currently assigned to t, reassigned to h: C_jih = d(j,h) - d(j,t)
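In code, the swapping cost can be computed by comparing each object's nearest-medoid distance before and after the candidate swap, which covers all four cases at once. A small Python sketch (our code; `dist` is an assumed precomputed distance matrix):

```python
# Sketch of the swapping-cost computation above (our code). It compares each
# non-selected object's nearest-medoid distance before and after replacing
# medoid i with the non-medoid h, covering all four C_jih cases at once
# (and it also generalizes to more than two medoids).
def total_swap_cost(dist, medoids, i, h):
    """dist: n x n distance matrix; medoids: current medoid indices."""
    remaining = [m for m in medoids if m != i]        # the other medoids t
    tc = 0.0
    for j in range(len(dist)):
        if j in medoids or j == h:
            continue                                  # only non-selected objects
        d_old = min(dist[j][m] for m in medoids)      # cost before the swap
        d_new = min([dist[j][h]] + [dist[j][m] for m in remaining])
        tc += d_new - d_old                           # = C_jih
    return tc
```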
What is the Problem with PAM?
PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced
by outliers or other extreme values than a mean.

PAM works efficiently for small data sets but does not
scale well for large data sets.
O( k(n-k)(n-k) ) for each iteration,
where n is # of data, k is # of clusters
Improvements: CLARA (uses a sampled set to determine
medoids), CLARANS
109
Hierarchical Clustering
Use distance matrix as clustering criteria.
This method does not require the number of clusters k as an
input, but needs a termination condition.
110
[Figure: agglomerative (AGNES) vs. divisive (DIANA) hierarchical clustering of objects a, b, c, d, e over steps 0-4. AGNES merges bottom-up: {a,b}, {d,e}, {c,d,e}, and finally {a,b,c,d,e}; DIANA splits top-down in the reverse order.]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Use the Single-Link method and the dissimilarity matrix.
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster





111
Dendrogram:
Shows How the Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level, then each connected
component forms a cluster.

112
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990)
Inverse order of AGNES
Eventually each node forms a cluster on its own.
113
More on Hierarchical Clustering
Major weakness:
Do not scale well: time complexity is at least O(n^2), where n is the number of total objects.
Can never undo what was done previously.

Integration of hierarchical with distance-based clustering
BIRCH(1996): uses CF-tree data structure and incrementally
adjusts the quality of sub-clusters.
CURE(1998): selects well-scattered points from the cluster and
then shrinks them towards the center of the cluster by a
specified fraction.
114
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as
density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD96)
OPTICS: Ankerst, et al (SIGMOD99).
DENCLUE: Hinneburg & D. Keim (KDD98)
CLIQUE: Agrawal, et al. (SIGMOD98) (more grid-based)
115
Density-Based Clustering: Basic Concepts
Two parameters:
Eps: Maximum radius of the neighborhood
MinPts: Minimum number of points in an Eps-neighborhood
of that point
N_Eps(q) = {p | dist(p,q) <= Eps}   (p, q are two data points)
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  p belongs to N_Eps(q), and
  q satisfies the core point condition: |N_Eps(q)| >= MinPts
117
(Illustration: MinPts = 5, Eps = 1 cm; p lies in the Eps-neighborhood of the core point q.)
Density-Reachable and Density-Connected
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts.
118
[Figure: core and border points illustrating a density-reachable chain and two points density-connected through a common point o.]
DBSCAN: Density Based Spatial Clustering of
Applications with Noise
Relies on a density-based notion of cluster: A cluster is defined
as a maximal set of density-connected points.
Discovers clusters of arbitrary shape in spatial databases with
noise.
119
(Example parameters: Eps = 1 cm, MinPts = 5; border points lie on the edge of a cluster.)
DBSCAN: The Algorithm
Arbitrarily select an unvisited point p.
Retrieve all points density-reachable from p w.r.t. Eps and
MinPts.
If p is a core point, a cluster is formed. Mark all these points
as visited.
If p is a border point (no points are density-reachable from p),
mark p as visited and DBSCAN visits the next point of the
database.
Continue the process until all of the points have been visited.
120
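A compact Python sketch of this procedure (our code; it uses a brute-force O(n^2) neighborhood search and labels noise points with -1):

```python
# A compact DBSCAN sketch following the steps above (our code).
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    neighbors = [np.flatnonzero(d[i] <= eps) for i in range(n)]
    labels = np.full(n, -1)          # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        if len(neighbors[p]) < min_pts:
            continue                  # border or noise point for now
        # p is a core point: grow a new cluster from it
        labels[p] = cluster
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster   # density-reachable from p
            if not visited[q]:
                visited[q] = True
                if len(neighbors[q]) >= min_pts:
                    queue.extend(neighbors[q])
        cluster += 1
    return labels
```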
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional
data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering
structure, SIGMOD99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996
Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based Local Outliers. SIGMOD 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large
spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for
efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic
systems. VLDB98.
121
References (2)
V. Ganti, J. Gehrke, R. Ramakrishan. CACTUS Clustering Categorical Data Using Summaries. KDD'99.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In
Proc. VLDB98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. In ICDE'99, pp. 512-
521, Sydney, Australia, March 1999.
A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling.
COMPUTER, 32(8): 68-75, 1999.
L. Kaufman and P. J. Rousseeuw, 1987. Clustering by Means of Medoids. In: Dodge, Y. (Ed.), Statistical Data Analysis
Based on the L1 Norm, North Holland, Amsterdam. pp. 405-416.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons,
1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB98.
J. B. MacQueen (1967): "Some Methods for classification and Analysis of Multivariate Observations", Proceedings
of 5-th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press,
1:281-297
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
122
References (3)
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A Review , SIGKDD
Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996
Int. Conf. on Pattern Recognition,.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for
very large spatial databases. VLDB98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering in Large Databases,
ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles , ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data sets, SIGMOD 02.
W. Wang, Yang, R. Muntz, STING: A Statistical Information grid Approach to Spatial Data Mining, VLDB97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : an efficient data clustering method for very large
databases. SIGMOD'96.
Wikipedia: DBSCAN. http://en.wikipedia.org/wiki/DBSCAN.
123
MORE ABOUT DATA MINING
124
ICDM 10 KEYNOTE SPEECH
10 YEARS OF DATA MINING RESEARCH:
RETROSPECT AND PROSPECT

http://www.cs.uvm.edu/~xwu/PPT/ICDM10-Sydney/ICDM10-Keynote.pdf
125
Xindong Wu, University of Vermont, USA
The Top 10 Algorithms
The 3-Step Identification Process
1. Nominations. ACM KDD Innovation Award and IEEE
ICDM Research Contributions Award winners were
invited in September 2006 to each nominate up to 10
best-known algorithms.

2. Verification. Each nomination was verified for its
citations on Google Scholar in late October 2006, and
those nominations that did not have at least 50
citations were removed. 18 nominations survived and
were then organized in 10 topics.

3. Voting by the wider community.



126
Top-10 Most Popular DM Algorithms:
18 Identified Candidates (I)
Classification
#1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann.,
1993.
#2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and
Regression Trees. Wadsworth, 1984.
#3. K Nearest Neighbors (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant
Adaptive Nearest Neighbor Classification. TPAMI. 18(6)
#4. Naive Bayes Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All?
Internat. Statist. Rev. 69, 385-398.
Statistical Learning
#5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-
Verlag.
#6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
Association Analysis
#7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining
Association Rules. In VLDB '94.
#8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without
candidate generation. In SIGMOD '00.
127
The 18 Identified Candidates (II)
Link Mining
#9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual
Web search engine. In WWW-7, 1998.
#10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked
environment. SODA, 1998.
Clustering
#11. K-Means: MacQueen, J. B., Some methods for classification and analysis of
multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and
Probability, 1967.
#12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient
data clustering method for very large databases. In SIGMOD '96.
Bagging and Boosting
#13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic
generalization of on-line learning and an application to boosting. J. Comput. Syst.
Sci. 55, 1 (Aug. 1997), 119-139.
128
Sequential Patterns
#14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns:
Generalizations and Performance Improvements. In Proceedings of the 5th
International Conference on Extending Database Technology, 1996.
#15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C.
Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern
Growth. In ICDE '01.
Integrated Mining
#16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association
rule mining. KDD-98.
Rough Sets
#17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning
about Data, Kluwer Academic Publishers, Norwell, MA, 1992
Graph Mining
#18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern
Mining. In ICDM '02.
129
The 18 Identified Candidates (III)
Top-10 Algorithm Finally Selected at ICDM06
#1: C4.5 (61 votes)
#2: K-Means (60 votes)
#3: SVM (58 votes)
#4: Apriori (52 votes)
#5: EM (48 votes)
#6: PageRank (46 votes)
#7: AdaBoost (45 votes)
#7: kNN (45 votes)
#7: Naive Bayes (45 votes)
#10: CART (34 votes)
130
10 Challenging Problems in Data
Mining Research
Developing a Unifying Theory of Data Mining
Scaling Up for High Dimensional Data/High Speed Streams
Mining Sequence Data and Time Series Data
Mining Complex Knowledge from Complex Data
Data Mining in a Graph Structured Data
Distributed Data Mining and Mining Multi-agent Data
Data Mining for Biological and Environmental Problems
Data-Mining-Process Related Problems
Security, Privacy and Data Integrity
Dealing with Non-static, Unbalanced and Cost-sensitive Data
131
Advanced Topics in Data Mining
Web & Text Mining
Spatio-temporal Data Mining
Data Stream Mining
Uncertain Data Mining
Privacy Preserving in Data Mining
Graph Mining
Social Network Mining
Visualization of Data Mining

132
DATA STREAM MINING
133
Data Stream Management
Characteristics of data streams:
  Arrive at high speed
  Arrive continuously, and possibly endlessly
  Have a huge volume
  Examples: sensor network data, network flow data, stock market data, etc.
Key tasks:
  Design synopsis structures for streams
  Design real-time and approximate algorithms for stream mining and query processing
[Figure: multiple data streams feed a stream processing system that maintains online stream summarization and stream synopses and supports query processing and stream mining, producing (approximate) results.]
134
GRAPH MINING
135
Why Graph Mining?
Graphs are ubiquitous
Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformatics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model
Trees, lattices, sequences, and items are degenerated graphs
Diversity of graphs
Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted,
with angles & geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many problems are of high complexity
136
Graph Pattern Mining
Frequent sub-graphs
A (sub-)graph is frequent if its support (occurrence
frequency) in a given dataset is no less than a
minimum support threshold.

Applications of graph pattern mining
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering,
compression, comparison, and correlation analysis


137
Example: Frequent Subgraphs
138
[Figure: a graph dataset of three graphs (A), (B), (C) and the frequent subgraph patterns (1) and (2) found with minimum support 2.]
Graph Mining Algorithms
Incomplete beam search Greedy (Subdue: Holder et al. KDD94)
Inductive logic programming (WARMR: Dehaspe et al. KDD98)
Graph theory-based approaches
Apriori-based approach
AGM/AcGM: Inokuchi, et al. (PKDD00), FSG: Kuramochi and Karypis
(ICDM01), PATH#: Vanetik and Gudes (ICDM02, ICDM04), FFSM: Huan,
et al. (ICDM03)
Pattern-growth approach
MoFa: Borgelt and Berthold (ICDM02), gSpan: Yan and Han
(ICDM02), Gaston: Nijssen and Kok (KDD04)
139
SOCIAL NETWORK MINING
140
What is Social Network?
141
Nodes: individuals
Links: social relationship
(family/work/friendship/etc.)
Social Network:
Many individuals with diverse
social interactions between them.
Example of Social Networks
142
Friendship network
node: person
link: acquaintanceship
http://nexus.ludios.net/view/demo/
Example of Social Networks
143
Ke, Visvanath & Börner, 2004
Co-author network
node: author
link: write papers together
Mining on Social Networks
Social network analysis has a long history in social sciences.
A summary of the progress has been written: Linton Freeman, The Development of Social Network Analysis. Vancouver: Empirical Press, 2006.

Today: Convergence of social and technological networks, computing and
info. systems with intrinsic social structure. (By Jon Kleinberg, Cornell
University)

Relevant Topics:
How to build a suitable model for search and diffusion in social networks.
Link mining in a multi-relational, heterogeneous and semi-structured network.
Community formation, clustering of social network data.
Abstract or summarization of social networks.
Privacy preserving in social network data.

and many others
144
THANK YOU!
145
