PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth

Authors:
Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umeshwar Dayal, Mei-Chun Hsu

Presenter:
Wojciech Stach

Outline
Mining Sequential Patterns
  Problem statement
  Definitions & examples
  Strategies
PrefixSpan algorithm
  Motivation
  Definitions & examples
  Algorithm
  Example
  Performance study
Conclusions

Strategies
Apriori-property based
  AprioriSome (1995)
  AprioriAll (1995)
  DynamicSome (1995)
  GSP (1996)
Regular expression constraints
  SPIRIT (1999)
Data projection based
  FreeSpan (2000)

Motivation and Background
Shortcomings of Apriori-like approaches
  Potentially huge set of candidate sequences
  Multiple scans of databases
  Difficulty mining long sequential patterns
FreeSpan (Frequent pattern-projected Sequential pattern mining): a pattern growth method
  General idea: use frequent items to recursively project sequence databases into smaller projected databases, and grow subsequence fragments in each projected database
PrefixSpan (Prefix-projected Sequential pattern mining)
  Fewer projections and quickly shrinking sequences

Prefix
Given two sequences α = <a_1 a_2 ... a_n> and β = <b_1 b_2 ... b_m>, m ≤ n
Sequence β is called a prefix of α if and only if:
  b_i = a_i for i ≤ m-1;
  b_m ⊆ a_m;
  All the items in (a_m − b_m) are alphabetically after those in b_m
Examples:
  α = <a(abc)(ac)d(cf)>, β = <a(abc)a>: β is a prefix of α
  α = <a(abc)(ac)d(cf)>, β = <a(abc)c>: β is not a prefix of α, because the item a left over in (ac) − (c) comes alphabetically before c
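
The three prefix conditions can be checked mechanically. The following is a minimal sketch, assuming a hypothetical encoding that is not part of the slides: a sequence is a Python list of elements, and each element is a string of single-letter items, so <a(abc)(ac)d(cf)> becomes ["a", "abc", "ac", "d", "cf"]. The function name is_prefix is likewise made up for illustration.

```python
def is_prefix(beta, alpha):
    """Check the three prefix conditions for beta against alpha.

    Assumed encoding: a sequence is a list of elements, and each element
    is a string of single-letter items, e.g. <a(abc)> is ["a", "abc"].
    """
    m, n = len(beta), len(alpha)
    if m == 0:
        return True  # assumption: the empty sequence <> is a prefix of anything
    if m > n:
        return False
    # Condition 1: b_i = a_i for i <= m-1
    for i in range(m - 1):
        if set(beta[i]) != set(alpha[i]):
            return False
    b_m, a_m = set(beta[m - 1]), set(alpha[m - 1])
    # Condition 2: b_m is a subset of a_m
    if not b_m <= a_m:
        return False
    # Condition 3: items of a_m - b_m come alphabetically after those in b_m
    return all(x > max(b_m) for x in a_m - b_m)


alpha = ["a", "abc", "ac", "d", "cf"]
print(is_prefix(["a", "abc", "a"], alpha))  # True
print(is_prefix(["a", "abc", "c"], alpha))  # False
```

The second call fails on the third condition only: (c) is a subset of (ac), but the remaining item a sorts before c.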

Projection
Given sequences α and β, such that β is a subsequence of α.
A subsequence α' of sequence α is called a projection of α w.r.t. prefix β if and only if:
  α' has prefix β;
  There exists no proper super-sequence α'' of α' such that α'' is a subsequence of α and also has prefix β
Example:
  α = <a(abc)(ac)d(cf)>
  β = <(bc)a>
  α' = <(bc)(ac)d(cf)>

Postfix
Let α' = <a_1 a_2 ... a_n> be the projection of α w.r.t. prefix β = <a_1 a_2 ... a_{m-1} a'_m> (m ≤ n)
Sequence γ = <a''_m a_{m+1} ... a_n> is called the postfix of α w.r.t. prefix β, denoted as γ = α / β, where a''_m = (a_m − a'_m)
We also denote α = β · γ
Example:
  α = <a(abc)(ac)d(cf)>
  β = <a(abc)a>
  γ = <(_c)d(cf)>
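
Projection and postfix can be computed together by matching β against α as far left as possible. The sketch below is one way to do this, not the authors' implementation. It assumes the same hypothetical list-of-strings encoding (each element a string of sorted single-letter items), a non-empty β, and a made-up function name project. A partial first element of the postfix is written with a leading underscore, as on the slides.

```python
def project(alpha, beta):
    """Return (projection, postfix) of alpha w.r.t. a non-empty prefix beta,
    or None when beta is not a subsequence of alpha.

    Assumed encoding: sequences are lists of strings of sorted items.
    """
    pos = 0
    for j, b in enumerate(beta):
        # Leftmost element of alpha, at or after pos, that contains all of b
        while pos < len(alpha) and not set(b) <= set(alpha[pos]):
            pos += 1
        if pos == len(alpha):
            return None
        if j < len(beta) - 1:
            pos += 1  # the next element of beta must match strictly later
    # Items of the matched element that sort after the last element of beta
    rest = "".join(sorted(x for x in alpha[pos] if x > max(beta[-1])))
    tail = list(alpha[pos + 1:])
    projection = list(beta[:-1]) + [beta[-1] + rest] + tail
    postfix = (["_" + rest] if rest else []) + tail
    return projection, postfix


alpha = ["a", "abc", "ac", "d", "cf"]
print(project(alpha, ["bc", "a"]))
# (['bc', 'ac', 'd', 'cf'], ['_c', 'd', 'cf'])
```

Matching leftmost keeps the largest possible remainder of α, which is what the "no proper super-sequence" condition asks for.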

PrefixSpan Algorithm
Input: A sequence database S, and the minimum support threshold min_sup
Output: The complete set of sequential patterns
Method: Call PrefixSpan(<>, 0, S)
Subroutine PrefixSpan(α, l, S|α)
Parameters:
  α: a sequential pattern;
  l: the length of α;
  S|α: the α-projected database if α ≠ <>; otherwise, the sequence database S.

PrefixSpan Algorithm (2)
Method:
1. Scan S|α once, find the set of frequent items b such that:
   a) b can be assembled to the last element of α to form a sequential pattern; or
   b) <b> can be appended to α to form a sequential pattern.
2. For each frequent item b, append it to α to form a sequential pattern α', and output α';
3. For each α', construct the α'-projected database S|α', and call PrefixSpan(α', l+1, S|α').
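
The recursion above can be sketched compactly in Python. This is an illustrative sketch under stated assumptions, not the authors' code. It uses the hypothetical encoding of a sequence as a list of strings of sorted single-letter items. It stores each projected database as (sequence id, element index) pointers instead of copying postfixes. It also omits the length parameter l, which is implicit in the prefix. The names prefixspan, mine, assemble and append are made up for illustration.

```python
from collections import defaultdict


def prefixspan(db, min_sup):
    """Mine all sequential patterns with support >= min_sup.

    db: list of sequences; a sequence is a list of elements; an element is
    a string of sorted single-letter items (assumed encoding).
    Returns {pattern: support}, with each pattern a tuple of element strings.
    """
    patterns = {}

    def mine(prefix, projected):
        # projected holds (sequence id, index of the element where the last
        # element of prefix was matched leftmost); -1 means nothing matched.
        last = set(prefix[-1]) if prefix else set()
        assemble = defaultdict(set)  # item -> ids where it joins the last element
        append = defaultdict(set)    # item -> ids where it starts a new element
        for sid, k in projected:
            seq = db[sid]
            for j in range(max(k, 0), len(seq)):
                items = set(seq[j])
                if prefix and last <= items:       # step 1a
                    for x in items:
                        if x > max(last):
                            assemble[x].add(sid)
                if j > k:                          # step 1b
                    for x in items:
                        append[x].add(sid)

        for x in sorted(assemble):
            sids = assemble[x]
            if len(sids) < min_sup:
                continue
            grown = prefix[:-1] + (prefix[-1] + x,)
            patterns[grown] = len(sids)            # step 2
            target = last | {x}
            mine(grown, [                          # step 3
                (sid, next(j for j in range(k, len(db[sid]))
                           if target <= set(db[sid][j])))
                for sid, k in projected if sid in sids
            ])

        for x in sorted(append):
            sids = append[x]
            if len(sids) < min_sup:
                continue
            grown = prefix + (x,)
            patterns[grown] = len(sids)            # step 2
            mine(grown, [                          # step 3
                (sid, next(j for j in range(k + 1, len(db[sid]))
                           if x in db[sid][j]))
                for sid, k in projected if sid in sids
            ])

    mine((), [(sid, -1) for sid in range(len(db))])
    return patterns


db = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
found = prefixspan(db, 2)
print(found[("d", "c", "b")])  # 2
```

Keeping pointers rather than physical copies is a design choice made here for brevity; a version that materializes each S|α' would follow the three steps more literally at the cost of extra copying.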

PrefixSpan - Example
min_support = 2

id  Sequence
10  <a(abc)(ac)d(cf)>
20  <(ad)c(bc)(ae)>
30  <(ef)(ab)(df)cb>
40  <eg(af)cbc>

1. Find length-1 sequential patterns

Item     <a>  <b>  <c>  <d>  <e>  <f>  <g>
Support   4    4    4    3    3    3    1

2. Divide search space

Prefix  Projected database
<a>     <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
<b>     <(_c)(ac)d(cf)>, <(_c)(ae)>, <(df)cb>, <c>
<c>     <(ac)d(cf)>, <(bc)(ae)>, <b>, <bc>
<d>     <(cf)>, <c(bc)(ae)>, <(_f)cb>
<e>     <(_f)(ab)(df)cb>, <(af)cbc>
<f>     <(ab)(df)cb>, <cbc>

PrefixSpan - Example (2)
3. Find subsets of sequential patterns

Prefix <d>, projected database: <(cf)>, <c(bc)(ae)>, <(_f)cb>
Item     <a>  <b>  <c>  <d>  <e>  <(_e)>  <f>  <(_f)>
Support   1    2    3    0    1     0      1     1
Sequential patterns: <db>, <dc>

Prefix <db>, projected database: <(_c)>
Prefix <dc>, projected database: <(bc)>, <b>
Item     <b>  <c>
Support   2    1
Sequential pattern: <dcb>

Prefix <dcb>, projected database: <>
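
The supports in this example can be cross-checked by brute force, without any projection. The following sketch counts how many sequences contain a given pattern as a subsequence. It assumes a hypothetical encoding where each sequence is a list of strings of single-letter items; the helper names contains and support are made up for illustration.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq: each pattern element must be
    a subset of a distinct element of seq, in the same order."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain pattern."""
    return sum(contains(seq, pattern) for seq in db)


db = [
    ["a", "abc", "ac", "d", "cf"],       # id 10
    ["ad", "c", "bc", "ae"],             # id 20
    ["ef", "ab", "df", "c", "b"],        # id 30
    ["e", "g", "af", "c", "b", "c"],     # id 40
]
print({item: support(db, [item]) for item in "abcdefg"})
# {'a': 4, 'b': 4, 'c': 4, 'd': 3, 'e': 3, 'f': 3, 'g': 1}
print(support(db, ["d", "c", "b"]))  # 2
```

With min_support = 2, only <g> drops out at length 1, and <dcb> is supported by sequences 20 and 30.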

PrefixSpan
Projected databases keep shrinking
The major cost of PrefixSpan is the construction of projected databases
How to reduce this cost?
Different projection methods
  Bi-level projection
    reduces the number and the size of projected databases
  Pseudo-projection
    reduces the cost of projection when the projected database can be held in main memory

Bi-level projection (min_support = 2)
Scan to get length-1 sequences
Construct a triangular matrix instead of projected databases for each length-1 pattern
The matrix covers all length-2 sequential patterns

     a        b        c        d        e        f
a    2
b    (4,2,2)  1
c    (4,2,1)  (3,3,2)  3
d    (2,1,1)  (2,2,0)  (1,3,0)  0
e    (1,2,1)  (1,2,0)  (1,2,0)  (1,1,0)  0
f    (2,1,1)  (2,2,0)  (1,2,1)  (1,1,1)  (2,0,1)  1

Reading the entries for a and c:
  Support(<ac>) = 4
  Support(<ca>) = 2
  Support(<(ac)>) = 1
  Support(<cc>) = 3
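
The entries of the triangular matrix can be reproduced from the four example sequences. The sketch below fills the matrix by brute-force support counting, which is simpler than the single-scan construction used by bi-level projection and is shown only to make the meaning of each cell concrete. The list-of-strings encoding and the names contains, support and s_matrix are assumptions made for illustration.

```python
from itertools import combinations


def contains(seq, pattern):
    """True if pattern is a subsequence of seq (elements matched as subsets)."""
    pos = 0
    for element in pattern:
        while pos < len(seq) and not set(element) <= set(seq[pos]):
            pos += 1
        if pos == len(seq):
            return False
        pos += 1
    return True


def support(db, pattern):
    return sum(contains(seq, pattern) for seq in db)


def s_matrix(db, items):
    """Lower-triangular matrix keyed by (row, column) item pairs.

    Diagonal cell (x, x) holds Support(<xx>).
    Cell (y, x) with x < y holds
    (Support(<xy>), Support(<yx>), Support(<(xy)>)).
    """
    matrix = {}
    for x in items:
        matrix[(x, x)] = support(db, [x, x])
    for x, y in combinations(sorted(items), 2):
        matrix[(y, x)] = (
            support(db, [x, y]),
            support(db, [y, x]),
            support(db, [x + y]),
        )
    return matrix


db = [
    ["a", "abc", "ac", "d", "cf"],
    ["ad", "c", "bc", "ae"],
    ["ef", "ab", "df", "c", "b"],
    ["e", "g", "af", "c", "b", "c"],
]
m = s_matrix(db, "abcdef")
print(m[("c", "a")])  # (4, 2, 1)
print(m[("c", "c")])  # 3
```

Every cell whose count reaches min_support names a length-2 sequential pattern directly, so no projected database is needed for the length-1 prefixes.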

Performance study
[Figure: Runtime vs. support threshold]
[Figure: I/O costs vs. threshold and scalability]

Conclusions
PrefixSpan
  Efficient pattern growth method
  Outperforms both GSP and FreeSpan
  Explores prefix-projection in sequential pattern mining
  Mines the complete set of patterns but reduces the effort of candidate subsequence generation
  Prefix-projection reduces the size of projected databases and leads to efficient processing
  Bi-level projection and pseudo-projection may improve mining efficiency

References
Pei J., Han J., Mortazavi-Asl B., Pinto H., Chen Q., Dayal U., Hsu M.-C., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, Proceedings 17th Int. Conf. Data Engineering (ICDE'01), April 2001
Agrawal R., Srikant R., Mining sequential patterns, Proceedings 1995 Int. Conf. Data Engineering (ICDE'95), pp. 3-14, 1995
Han J., Pei J., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M.-C., FreeSpan: Frequent pattern-projected sequential pattern mining, Proceedings 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD'00), pp. 355-359, 2000
Srikant R., Agrawal R., Mining sequential pattern: Generalizations

THANK YOU !!!