Sie sind auf Seite 1von 7

Outline

`
Mining Sequential Patterns
PrefixSpan: Mining Sequential Patterns

Problem statement
Efficiently by Prefix-Projected Pattern Definitions & examples
Growth Strategies
PrefixSpan algorithm
Authors:
Jian Pei, Jiawei Han, Behzad Mortazavi-Asi, Helen Pinto Qiming Chen, Motivation
Umeshwar Dayal, Mei-Chun Hsu
Definitions & examples
Algorithm
Example
Performance study
Presenter: Conclusions
Wojciech Stach
2

Sequential Pattern Mining Sequential Pattern Mining


` `
Given Find all the frequent subsequences, i.e. the
a set of sequences, where each sequence consists of a list subsequences whose occurrence frequency in the
of elements and each element consists of set of items set of sequences is no less than min_support
user-specified min_support threshold

Solution 53 frequent subsequences


<a><aa> <ab> <a(bc)> <a(bc)a> <aba> <abc>
<a(abc)(ac)d(cf)> - 5 elements, 9 items id Sequence
id Sequence <(ab)> <(ab)c> <(ab)d> <(ab)f> <(ab)dc> <ac>
10 <a(abc)(ac)d(cf)> 10 <a(abc)(ac)d(cf)> <aca> <acb> <acc> <ad> <adc> <af>
20 <(ad)c(bc)(ae)> <a(abc)(ac)d(cf)> - 9-sequence 20 <(ad)c(bc)(ae)> <b> <ba> <bc> <(bc)> <(bc)a> <bd> <bdc> <bf>
30 <(ef)(ab)(df)cb> 30 <(ef)(ab)(df)cb> <c> <ca> <cb> <cc>
40 <eg(af)cbc> <a(abc)(ac)d(cf)> = <a(cba)(ac)d(cf)> 40 <eg(af)cbc> <d> <db> <dc> <dcb>
<a(abc)(ac)d(cf)> <a(ac)(abc)d(cf)>
<e> <ea> <eab> <eac> <eacb> <eb> <ebc> <ec>
<ecb> <ef> <efb> <efc> <efcb>
min_support = 2 <f> <fb> <fbc> <fc> <fcb>
3 4
Subsequence vs. super sequence Sequence Support Count
` `
Given two sequences =<a1a2an> and A sequence database is a set of tuples <sid, s>
=<b1b2bm> A tuple <sid, s> is said to contain a sequence , if
is called a subsequence of , denoted as , is a subsequence of s, i.e., s
if there exist integers 1j1<j2<<jn m such that The support of a sequence is the number of
a1bj1, a2 bj2,, anbjn tuples containing
is a super sequence of

id Sequence 1=<a> support(1) = 4


10 <a(abc)(ac)d(cf)>
=<a(abc)(ac)d(cf)> =<a(abc)(ac)d(cf)> 2=<ac> support(2) = 4
20 <(ad)c(bc)(ae)>
1=<aa(ac)d(c)> 4=<df(cf)> 30 <(ef)(ab)(df)cb> 3=<(ab)c> support(3) = 2
40 <eg(af)cbc>
2=<(ac)(ac)d(cf)> 5=<(cf)d>

3=<ac> 6=<(abc)dcf>
5 6

Strategies Outline
` `
Apriori-property based Mining Sequential Patterns
AprioriSome (1995) Problem statement
AprioriAll (1995) Definitions & examples
DynamicSome (1995) Strategies
GSP (1996) PrefixSpan algorithm
Motivation
Regular expression constraints Definitions & examples
SPIRIT (1999) Algorithm
Example
Data projection based Performance study
FreeSpan (2000) Conclusions

7 8
Motivation and Background Prefix
` `
Shortcomings of Apriori-like approaches Given two sequences =<a1a2an> and
Potentially huge set of candidate sequences =<b1b2bm>, mn
Multiple scans of databases
Sequence is called a prefix of if and only if:


Difficulties at mining long sequential patterns
bi = ai for i m-1;
FreeSpan (Frequent pattern-projected Sequential pattern bm am;
mining) pattern growth method All the items in (am bm) are alphabetically after those in
General idea is to use frequent items to recursively project bm
sequence databases into a smaller projected databases and
grow subsequence fragments in each projected database

=<a(abc)(ac)d(cf)> =<a(abc)(ac)d(cf)>
PrefixSpan (Prefix-projected Sequential pattern mining)
Less projections and quickly shrinking sequences =<a(abc)a> =<a(abc)c>

9 10

Projection Postfix
` `
Given sequences and , such that is a Let =<a1a2an> be the projection of w.r.t.
subsequence of . prefix =<a1a2am-1am> (m n)
A subsequence of sequence is called a Sequence =<amam+1an> is called the postfix of
projection of w.r.t. prefix if and only if w.r.t. prefix , denoted as = / , where
has prefix ; am=(am-am)
There exist no proper super-sequence of such that We also denote =
is a subsequence of and also has prefix

=<a(abc)(ac)d(cf)> =<a(abc)(ac)d(cf)>
=<(bc)a>
=<a(abc)a>
=<(bc)(ac)d(cf)>
=<(_c)d(cf)>

11 12
PrefixSpan Algorithm PrefixSpan Algorithm (2)
` `
Input: A sequence database S, and the minimum support Method
threshold min_sup
1. Scan S| once, find the set of frequent items b
Output: The complete set of sequential patterns such that:
a) b can be assembled to the last element of to form a
Method: Call PrefixSpan(<>,0,S) sequential pattern; or
b) <b> can be appended to to form a sequential pattern.
Subroutine PrefixSpan(, l, S|) 2. For each frequent item b, append it to to form a
sequential pattern , and output ;
Parameters:
: sequential pattern, 3. For each , construct -projected database S|,
l: the length of ; and call PrefixSpan(, l+1, S|).
S|: the -projected database, if <>; otherwise; the
sequence database S.

13 14

id Sequence
10 <a(abc)(ac)d(cf)>

PrefixSpan - Example 20
30
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
PrefixSpan Example (2)
` 40 <eg(af)cbc> `
3. Find subsets of sequential patterns
1. Find length-1 sequential patterns min_support = 2
<a> <b> <c> <d> <e> <f> <g> <d> <a> <b> <c> <d> <e> <(_e)> <f> <(_f)>
4 4 4 3 3 3 1 <(cf)> 1 2 3 0 1 0 1 1
<c(bc)(ae)>
<(_f)cb>
2. Divide search space
Prefix
<db> <dc>

<a> <b> <c> <d> <e> <f> <db> <dc> <b> <c>
<(abc)(ac)d(cf)> <(_c)(ac)d(cf)> <(ac)d(cf)> <(cf)> <(_f)(ab)(df)cb> <(ab)(df)cb> <(_c)> <(bc)> 2 1
<(_d)c(bc)(ae)> <(_c)(ae)> <(bc)(ae)> <c(bc)(ae)> <(af)cbc> <cbc> <b>
<(_b)(df)cb> <(df)cb> <b> <(_f)cb>
<(_f)cbc> <c> <bc>
<dcb>
<dcb>
<>
15 16
id Sequence
10 <a(abc)(ac)d(cf)>

PrefixSpan - characteristics Bi-level Projection 20


30
<(ad)c(bc)(ae)>
<(ef)(ab)(df)cb>
40 <eg(af)cbc>
` No candidate sequence needs to be generated by `

PrefixSpan min_support = 2
Scan to get 1-length sequences
Projected databases keep shrinking
Construct a triangular matrix instead of projected
The major cost of PrefixSpan is the construction of databases for each length-1 patterns
projected databases
a 2
How to reduce this cost? b (4,2,2) 1 ALL length-2 sequential
c (4,2,1) (3,3,2) 3 pattern
Different projection methods d (2,1,1) (2,2,0) (1,3,0) 0
e (1,2,1) (1,2,0) (1,2,0) (1,1,0) 0
Bi-level projection
f (2,1,1) (2,2,0) (1,2,1) (1,1,1) (2,0,1) 1
reduces the number and the size of projected databases a b c d e f

Pseudo-Projection
Support(<ac>) = 4
Support(<ca>) = 2 Support(<cc>) = 3
reduces the cost of projection when projected database can be
Support(<(ac)>) = 1
held in main memory
17 18

Bi-level projection (2) Bi-level projection (3) - optimization


` `
For each length-2 sequential pattern , construct Do we need to include every item in a postfix in
the -projected database and find the frequent the projected databases?
items
NO! Item pruning in projected database by 3-way
Construct corresponding S-matrix Apriori checking
<ab> a b c (_c) d (_d) e (_e) f (_f)
<(_c)(ac)(cf)> 2 0 2 2 0 1 0 0 1 0 Any super-sequence of c can be excluded from construction of
<ac> is not frequent
<(_c)a> it can never be a sequential <ab> - projected database
<c> pattern
<aba> <abc> <a(bc)>

a 0 <a(bd)> is not frequent To construct <a(bc)>-projected database,


sequence <a(bcde)df> should be projected to <(_e)df>
c (1,0,1) 1 instead of <(_de)df>
(_c) (,2, ) (,1, )
a c (_c)
<a(bc)a>
19 20
Pseudo-Projection Experimental Results
` `
Observation: postfixes of a sequence often Environment: 233MHz Pentium PC, 128 MB RAM,
appear repeatedly in recursive projected databases Windows NT, Visual C++ 6.0
Method: instead of constructing physical Reported test on synthetic data set: C10T8S8I8:
projection by collecting all the postfixes, we can 1000 items
use pointers referring to the sequences in the 10000 sequences
database as a pseudo-projection Average number of items within elements: 8
Every projection consists of two pieces of Average number of elements in a sequence: 8
information: pointer to the sequence in database Competitors:
and offset to the postfix in the sequence GSP
FreeSpan
s1=<a(abc)(ac)d(cf)> Pointer Offset Postfix PrefixSpan-1 (level-by-level projection)
s1 2 <(abc)(ac)d(cf)> PrefixSpan-2 (bi-level projection)
s1 5 <(ac)d(cf)>
s1 6 <(_c)d(cf)>
21 22

Runtime vs. support threshold I/O costs vs. threshold and scalability
` `

23 24
Outline Conclusions
` `
Mining Sequential Patterns
Problem statement PrefixSpan
Definitions & examples Efficient pattern growth method
Strategies Outperforms both GSP and FreeSpan
PrefixSpan algorithm Explores prefix-projection in sequential pattern mining
Motivation Mines the complete set of patterns but reduces the effort
Definitions & examples of candidate subsequence generation
Algorithm Prefix-projection reduces the size of projected database
Example and leads to efficient processing
Performance study Bi-level projection and pseudo-projection may improve
mining efficiency
Conclusions

25 26

References
` `
Pei J., Han J., Mortazavi-Asl J., Pinto H., Chen Q., Dayal U., Hsu M.,
PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected
Pattern Growth, 17th International Conference on Data Engineering
(ICDE), April 2001
Agrawal R., Srikant R., Mining sequential patterns, Proceedings 1995
Int. Conf. Very Large Data Bases (VLDB94), pp. 487-499, 1995 THANK YOU !!!
Han J., Dong G., Mortazavi-Asl B., Chen Q., Dayal U., Hsu M.-C.,
Freespan: Frequent pattern-projected sequential pattern mining,
Proceedings 2000 Int. Conf. Knowledge Discovery and Data Mining
(KDD00), pp. 355-359, 2000
Srikant R., Agrawal R., Mining sequential pattern: Generalizations

and performance improvements, Proceedings 5th Int. /conf. Any Questions?


Extending Database Technology (EDBT96), pp. 3-17, 1996
Zhao Q., Bhowmick S. S., Sequential Pattern Mining: A Survey.
Technical Report Center for Advanced Information Systems, School
of Computer Engineering, Nanyang Technological University,
Singapore, 2003

27 28

Das könnte Ihnen auch gefallen