
BIDE: Efficient Mining of Frequent Closed Sequences

Jianyong Wang and Jiawei Han. Proc. 2004 Int. Conf. on Data Engineering (ICDE'04), Boston, MA, March 2004.

Opening
[Figure: example sets annotated with sup = 5, sup < 5 and sup > 5; the item labels did not survive extraction.]

Review of Algorithms
The problem was first introduced by Agrawal and Srikant (1995). Most algorithms mine all frequent sequences (closed and non-closed).

Frequent sequences:
Apriori-based (Apriori, AprioriAll, GSP)
Sequence-enumeration tree/lattice-based (Max-Miner, SPAM, SPADE)
Constraint-based (SPIRIT family)

Frequent closed sequences:
CloSpan: the first algorithm for mining frequent closed sequences (based on PrefixSpan).
BIDE

Apriori-based algorithms
Make multiple passes over the database.
First pass:
Find the frequent 1-sequences.
Join them to build the candidate 2-sequences.

For step k:
Join the set of frequent (k-1)-sequences to build the candidate k-sequences.
Scan the SDB to check which of the candidate k-sequences are actually frequent.

Stop at step m:
No frequent m-sequence is found.
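The level-wise loop above can be sketched as follows. This is a minimal illustration, not AprioriAll or GSP themselves: it assumes single-item events (each sequence is a string, each character an event), and the names `support` and `apriori_sequences` are made up for this example.

```python
def is_subsequence(sub, seq):
    """True if the items of `sub` occur in `seq` in the same order."""
    it = iter(seq)
    return all(item in it for item in sub)

def support(sdb, pattern):
    """Number of database sequences that contain `pattern`."""
    return sum(1 for s in sdb if is_subsequence(pattern, s))

def apriori_sequences(sdb, min_sup):
    """Level-wise mining: join frequent (k-1)-sequences, then scan the database."""
    items = sorted({x for s in sdb for x in s})
    # First pass: frequent 1-sequences.
    level = [i for i in items if support(sdb, i) >= min_sup]
    frequent = {}
    while level:  # stop at step m when no frequent m-sequence is found
        for p in level:
            frequent[p] = support(sdb, p)
        # Join step: p and q join when p without its first item
        # equals q without its last item.
        candidates = {p + q[-1] for p in level for q in level if p[1:] == q[:-1]}
        # Scan step: keep only the candidates that are actually frequent.
        level = sorted(c for c in candidates if support(sdb, c) >= min_sup)
    return frequent

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
result = apriori_sequences(SDB, min_sup=2)
print(len(result), "frequent sequences")
```

Every pass rescans the whole database, which is the cost the later projection-based methods try to avoid.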

Tree-based algorithms (1/2)


Use a sequence enumeration tree to generate all the candidate sequences.
Traverse the tree depth-first (DFS).
Use pruning techniques to avoid traversing sub-trees:
Apriori principle: if a sequence is not frequent, its super-sequences cannot be frequent.

If an extension event leads to a non-frequent sequence, this event is no longer used in further extensions under the same node.
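A rough sketch of the depth-first traversal with both pruning rules, assuming single-item events stored as strings. The function names are illustrative only and are not taken from SPAM or SPADE.

```python
def is_subsequence(sub, seq):
    it = iter(seq)
    return all(item in it for item in sub)

def support(sdb, pattern):
    return sum(1 for s in sdb if is_subsequence(pattern, s))

def dfs_frequent(sdb, min_sup):
    """Depth-first enumeration of frequent sequences with candidate-item pruning."""
    frequent = {}

    def visit(prefix, candidates):
        surviving = []
        for item in candidates:
            sup = support(sdb, prefix + item)
            if sup >= min_sup:
                frequent[prefix + item] = sup
                surviving.append(item)
            # else: Apriori principle, the whole sub-tree under prefix + item is skipped
        for item in surviving:
            # Items that failed at this node are not offered to its children.
            visit(prefix + item, surviving)

    visit("", sorted({x for s in sdb for x in s}))
    return frequent

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
found = dfs_frequent(SDB, min_sup=2)
print(sorted(found))
```

Passing `surviving` down is safe because if prefix + e is infrequent, then prefix + x + e contains it and is infrequent as well.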

Tree-based algorithms (2/2)

I = {a, b}

S-step: add an item to the sequence as a new event.

I-step: add an item to the last event.
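The two extension steps can be illustrated on sequences whose events are itemsets. In this sketch a sequence is a tuple of events and an event is a sorted tuple of items; the rule that an I-step only adds items that sort after the current last item is an assumption borrowed from common enumeration-tree conventions, and all names are hypothetical.

```python
def s_step(seq, item):
    """Sequence extension: append a new event that contains only `item`."""
    return seq + ((item,),)

def i_step(seq, item):
    """Itemset extension: add `item` to the last event of `seq`."""
    last = seq[-1]
    if item <= last[-1]:
        raise ValueError("item must sort after the items already in the last event")
    return seq[:-1] + (last + (item,),)

def children(seq, items):
    """All child nodes of `seq` in the enumeration tree."""
    kids = [s_step(seq, i) for i in items]
    kids += [i_step(seq, i) for i in items if seq and i > seq[-1][-1]]
    return kids

root_child = (("a",),)
for child in children(root_child, ["a", "b"]):
    print(child)
```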

Definition of Terms (1/2)


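Item set: I = {i1, i2, ..., in}
Sequence (ordered list of events): S = {e1, e2, ..., em}
Length: a sequence with m events is an m-sequence
Subsequence: Sa = {a1, a2, ..., an}
Super-sequence: Sb = {b1, b2, ..., bm}
Sa is a subsequence of Sb (and Sb a super-sequence of Sa) if there exist integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 = bi1, a2 = bi2, ..., an = bin.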
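The subsequence relation can be checked with one left-to-right scan. The sketch below assumes single-item events, so a sequence is written as a string; `is_subsequence` is a name chosen for this example.

```python
def is_subsequence(sa, sb):
    """Return True if `sa` is a subsequence of `sb`.

    Equivalent to finding indices i1 < i2 < ... < in with sa[k] == sb[ik].
    """
    remaining = iter(sb)
    # `item in remaining` consumes the iterator, so matches must occur in order.
    return all(item in remaining for item in sa)

print(is_subsequence("AB", "CAABC"))
print(is_subsequence("BA", "CAABC"))
```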

Definition of Terms (2/2)


SDB: sequence database, a set of tuples (sid, S)
A tuple (sid, S) contains Sa if Sa is a subsequence of S.
supSDB(Sa): absolute support, the number of tuples in SDB that contain Sa
supSDB(Sa) / |SDB|: relative support
Sa is frequent if supSDB(Sa) ≥ min_sup

S is closed if there exists no proper super-sequence of S with the same support.
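As a small illustration of these definitions, the following sketch computes absolute and relative support over the four-sequence example database used in these slides (CAABC, ABCB, CABC, ABBCA). The function names are assumptions made for the example.

```python
SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]

def is_subsequence(sa, s):
    it = iter(s)
    return all(item in it for item in sa)

def support(sdb, sa):
    """Absolute support: number of sequences in `sdb` that contain `sa`."""
    return sum(1 for s in sdb if is_subsequence(sa, s))

def relative_support(sdb, sa):
    return support(sdb, sa) / len(sdb)

def is_frequent(sdb, sa, min_sup):
    """`min_sup` is an absolute threshold here (50% of 4 sequences is 2)."""
    return support(sdb, sa) >= min_sup

print(support(SDB, "CA"), relative_support(SDB, "CA"))
```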

Frequent Closed Sequences


Find the complete set of frequent closed sequences.

SDB (sequences 1 to 4): CAABC, ABCB, CABC, ABBCA
I = {A, B, C}, min_sup = 50%

Previous algorithms: 17 frequent sequences

BIDE (and CloSpan): 6 frequent closed sequences
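The two counts can be reproduced with a brute-force check on the example database (CAABC, ABCB, CABC, ABBCA). This is only a verification sketch under the single-item-event assumption; it is neither BIDE nor CloSpan, and the helper names are invented here.

```python
SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]

def is_subsequence(sub, seq):
    it = iter(seq)
    return all(item in it for item in sub)

def support(sdb, pattern):
    return sum(1 for s in sdb if is_subsequence(pattern, s))

def frequent_sequences(sdb, min_sup):
    """Grow every frequent pattern by one item until nothing frequent remains."""
    items = sorted({x for s in sdb for x in s})
    found, stack = {}, [""]
    while stack:
        prefix = stack.pop()
        for item in items:
            cand = prefix + item
            sup = support(sdb, cand)
            if sup >= min_sup:
                found[cand] = sup
                stack.append(cand)
    return found

def closed_only(frequent):
    """Keep patterns that have no longer super-sequence with the same support."""
    return {p: s for p, s in frequent.items()
            if not any(t == s and len(q) > len(p) and is_subsequence(p, q)
                       for q, t in frequent.items())}

frequent = frequent_sequences(SDB, min_sup=2)   # 50% of 4 sequences
closed = closed_only(frequent)
print(len(frequent), len(closed))
print(sorted(closed.items()))
```

Filtering inside the frequent set is sufficient because a super-sequence with the same support is itself frequent.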

CloSpan
Maintains the set of already mined frequent closed patterns in memory.
Sub-pattern checking: a new pattern can be absorbed by an already mined frequent pattern.
Super-pattern checking: a new pattern can absorb some already mined frequent patterns.

To save space, CloSpan stores a superset of the frequent closed patterns in a hash-indexed tree structure and then prunes the tree to get the actual set of frequent closed patterns.
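The two checks can be illustrated with a plain dictionary standing in for the candidate store. This is a simplified sketch, not CloSpan's actual hash-indexed structure, and `insert_pattern` is a hypothetical name; it mainly shows why every new pattern has to be compared against patterns kept in memory.

```python
def is_subsequence(sub, seq):
    it = iter(seq)
    return all(item in it for item in sub)

def insert_pattern(store, pattern, sup):
    """Insert `pattern` into `store` (pattern -> support) with absorption checks."""
    # Sub-pattern check: an already stored pattern with equal support contains the new one.
    for kept, kept_sup in store.items():
        if kept_sup == sup and is_subsequence(pattern, kept):
            return False
    # Super-pattern check: the new pattern absorbs stored patterns it contains.
    absorbed = [k for k, s in store.items() if s == sup and is_subsequence(k, pattern)]
    for k in absorbed:
        del store[k]
    store[pattern] = sup
    return True

store = {}
for pattern, sup in [("A", 4), ("AB", 4), ("ABC", 4), ("CA", 3), ("BC", 4)]:
    insert_pattern(store, pattern, sup)
print(store)
```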

BIDE (1/2)
BIDE: an efficient algorithm for discovering the complete set of frequent closed sequences.

BI-Directional Extension: a new paradigm for mining frequent closed sequences without candidate maintenance.

Back-Scan pruning method.

Scan-Skip optimization technique.

Performance study: BIDE can be over an order of magnitude faster than the previous algorithms and consumes orders of magnitude less memory.

BIDE (2/2)
How to enumerate the complete set of frequent sequences?
Upon getting a frequent sequence, how to check if it is closed?
How to design search space pruning methods or other optimization techniques to accelerate the mining process?

First instance
S = CAABC, Sp = AB
First instance of prefix sequence Sp in S: CAAB
Projected sequence of prefix sequence Sp in S: C
Projected database of prefix sequence Sp: Sp_SDB, the set of projected sequences of Sp in SDB.
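A short sketch of these three notions, assuming single-item events written as strings. The helper names are chosen for illustration.

```python
def first_instance_end(seq, prefix):
    """Index where the first (leftmost) instance of `prefix` ends, or None."""
    pos = -1
    for item in prefix:
        pos = seq.find(item, pos + 1)
        if pos < 0:
            return None          # seq does not contain prefix
    return pos

def first_instance(seq, prefix):
    end = first_instance_end(seq, prefix)
    return None if end is None else seq[:end + 1]

def project(seq, prefix):
    """Projected sequence: what remains of `seq` after the first instance."""
    end = first_instance_end(seq, prefix)
    return None if end is None else seq[end + 1:]

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(first_instance("CAABC", "AB"), project("CAABC", "AB"))
print([project(s, "AB") for s in SDB])
```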

i-th last-in-first appearance


S = CAABC, Sp = CA
First instance of prefix sequence Sp in S: CA

2nd last-in-first appearance: the last A in the first instance of Sp

i-th semi-maximum period : SMP i( Sp )


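SMP i(Sp): the piece of S between the end of the first instance of prefix e1e2...ei-1 in S and the i-th last-in-first appearance w.r.t. Sp (for i = 1, the piece of S before the 1st last-in-first appearance).

S = ABCB, Sp = AC
End of the first instance of prefix e1 = A; 2nd last-in-first appearance = C
SMP 2(AC) = B
SMP 1(AC) = ∅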
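The last-in-first appearances and the semi-maximum periods can be computed with two scans, one forward and one backward. The sketch assumes single-item events, a non-empty prefix, and a sequence that contains the prefix; positions are 0-based and the function names are illustrative.

```python
def first_instance_positions(seq, prefix):
    """Positions matched by the first (leftmost) instance of `prefix` in `seq`."""
    positions, start = [], 0
    for item in prefix:
        idx = seq.find(item, start)
        if idx < 0:
            return None
        positions.append(idx)
        start = idx + 1
    return positions

def last_in_first(seq, prefix):
    """Positions LF_1..LF_n of the last-in-first appearances."""
    first = first_instance_positions(seq, prefix)
    bound = first[-1] + 1            # search only inside the first instance
    lf = [0] * len(prefix)
    for i in reversed(range(len(prefix))):
        lf[i] = seq.rfind(prefix[i], 0, bound)
        bound = lf[i]                # LF_i must appear before LF_(i+1)
    return lf

def semi_maximum_periods(seq, prefix):
    """List [SMP_1, ..., SMP_n] of `prefix` in `seq`."""
    first = first_instance_positions(seq, prefix)
    lf = last_in_first(seq, prefix)
    # SMP_i starts right after the first instance of e1..e(i-1); SMP_1 starts at 0.
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[starts[i]:lf[i]] for i in range(len(prefix))]

print(semi_maximum_periods("ABCB", "AC"))
```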

Back-Scan search space pruning method


Theorem: Let the prefix sequence be an n-sequence, Sp = e1e2...en. If there exist an i (1 ≤ i ≤ n) and an item e which appears in each of the i-th semi-maximum periods of the prefix Sp in SDB, we can safely stop growing prefix Sp.

I = {A, B, C}, min_sup = 50%

Sp = B : 4
SMP1 = {CAA, A, CA, A}
Item A appears in every SMP1, so the sub-tree under B is pruned.
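The pruning test can be sketched as an intersection over the i-th semi-maximum periods of all sequences that contain the prefix. The code assumes single-item events and the example database CAABC, ABCB, CABC, ABBCA; `back_scan` and its helper are names made up for this illustration.

```python
def semi_maximum_periods(seq, prefix):
    """[SMP_1..SMP_n] of `prefix` in `seq`, or None if `seq` lacks `prefix`."""
    first, start = [], 0
    for item in prefix:                      # first (leftmost) instance
        idx = seq.find(item, start)
        if idx < 0:
            return None
        first.append(idx)
        start = idx + 1
    bound, lf = first[-1] + 1, [0] * len(prefix)
    for i in reversed(range(len(prefix))):   # last-in-first appearances
        lf[i] = seq.rfind(prefix[i], 0, bound)
        bound = lf[i]
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[starts[i]:lf[i]] for i in range(len(prefix))]

def back_scan(sdb, prefix):
    """Return {(i, item)}: items found in every i-th semi-maximum period (i is 1-based).

    A non-empty result means the prefix does not need to be grown.
    """
    periods = [p for p in (semi_maximum_periods(s, prefix) for s in sdb) if p is not None]
    if not periods:
        return set()
    hits = set()
    for i in range(len(prefix)):
        common = set.intersection(*(set(p[i]) for p in periods))
        hits |= {(i + 1, item) for item in common}
    return hits

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(back_scan(SDB, "B"))
print(back_scan(SDB, "A"))
```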

BI-Directional Extension closure checking


Forward extension: Sp = e1e2...en, Sp* = e1e2...en e. If supSDB(Sp) = supSDB(Sp*), Sp is non-closed and item e is a forward-extension event.

Backward extension: Sp = e1e2...en, Sp* = e1e2...ei e ei+1...en (1 ≤ i < n) or Sp* = e e1e2...en. If supSDB(Sp) = supSDB(Sp*), Sp is non-closed and item e is a backward-extension event.

Theorem: If there exists neither a forward-extension event nor a backward-extension event w.r.t. Sp, Sp must be closed; otherwise it must be non-closed.
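The definitions can be checked directly, if inefficiently, by inserting each item at each position and comparing supports. This brute-force sketch only illustrates the definitions; BIDE itself detects the events through projected databases and maximum periods. Single-item events are assumed and the names are hypothetical.

```python
def is_subsequence(sub, seq):
    it = iter(seq)
    return all(item in it for item in sub)

def support(sdb, pattern):
    return sum(1 for s in sdb if is_subsequence(pattern, s))

def extension_events(sdb, sp):
    """Return (forward, backward) extension events of `sp`.

    Backward events are pairs (i, e): item e inserted after the first i
    events of sp (i = 0 means in front of sp) keeps the support unchanged.
    """
    base = support(sdb, sp)
    items = sorted({x for s in sdb for x in s})
    forward = {e for e in items if support(sdb, sp + e) == base}
    backward = {(i, e) for i in range(len(sp)) for e in items
                if support(sdb, sp[:i] + e + sp[i:]) == base}
    return forward, backward

def is_closed(sdb, sp):
    forward, backward = extension_events(sdb, sp)
    return not forward and not backward

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(extension_events(SDB, "AC"))
print(is_closed(SDB, "ABC"))
```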

Last instance
S = CABDCABBA, Sp = AB
Last instance of prefix sequence Sp in S: CABDCABB

i-th last-in-last appearance


S = CAABC, Sp = AB
Last instance of prefix sequence Sp in S: CAAB

1st last-in-last appearance: the last A in the last instance of Sp

i-th maximum period : MP i( Sp )


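MP i(Sp): the piece of S between the end of the first instance of prefix e1e2...ei-1 in S and the i-th last-in-last appearance w.r.t. Sp (for i = 1, the piece of S before the 1st last-in-last appearance).

S = ABCB, Sp = AB
End of the first instance of prefix e1 = A; 2nd last-in-last appearance = the last B
MP 2(AB) = BC
MP 1(AB) = ∅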
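The last instance, the last-in-last appearances, and the maximum periods can be sketched in the same way as their first-instance counterparts. Assumptions: single-item events, a non-empty prefix, and a sequence that contains the prefix; positions are 0-based and the names are illustrative.

```python
def first_instance_positions(seq, prefix):
    positions, start = [], 0
    for item in prefix:
        idx = seq.find(item, start)
        if idx < 0:
            return None
        positions.append(idx)
        start = idx + 1
    return positions

def last_instance(seq, prefix):
    """Piece of `seq` from its start up to the last appearance of the prefix's last item."""
    return seq[:seq.rfind(prefix[-1]) + 1]

def last_in_last(seq, prefix):
    """Positions LL_1..LL_n of the last-in-last appearances."""
    bound = len(seq)                 # search the whole sequence for e_n
    ll = [0] * len(prefix)
    for i in reversed(range(len(prefix))):
        ll[i] = seq.rfind(prefix[i], 0, bound)
        bound = ll[i]                # LL_i must appear before LL_(i+1)
    return ll

def maximum_periods(seq, prefix):
    """List [MP_1, ..., MP_n] of `prefix` in `seq`."""
    first = first_instance_positions(seq, prefix)
    ll = last_in_last(seq, prefix)
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[starts[i]:ll[i]] for i in range(len(prefix))]

print(last_instance("CABDCABBA", "AB"))
print(maximum_periods("ABCB", "AB"))
```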

Backward-Extension event checking


Lemma: If there exist an i (1 ≤ i ≤ n) and an item e which appears in each of the i-th maximum periods of Sp in SDB, then e is a backward-extension event w.r.t. Sp.

SDB, I = {A, B, C}, min_sup = 50%

Sp = AC : 4
MP2 = {AB, B, B, BB}
The backward-extension event for Sp = AC is B.
AC:4 is a frequent non-closed sequence.

Sp = ABC : 4
No backward-extension item for Sp.
No forward-extension item for Sp.
ABC:4 is a frequent closed sequence.
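The lemma translates into an intersection over the i-th maximum periods. The following sketch assumes single-item events and the example database CAABC, ABCB, CABC, ABBCA; `backward_extension_events` and its helper are hypothetical names.

```python
def maximum_periods(seq, prefix):
    """[MP_1..MP_n] of `prefix` in `seq`, or None if `seq` lacks `prefix`."""
    first, start = [], 0
    for item in prefix:                      # first (leftmost) instance
        idx = seq.find(item, start)
        if idx < 0:
            return None
        first.append(idx)
        start = idx + 1
    bound, ll = len(seq), [0] * len(prefix)
    for i in reversed(range(len(prefix))):   # last-in-last appearances
        ll[i] = seq.rfind(prefix[i], 0, bound)
        bound = ll[i]
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[starts[i]:ll[i]] for i in range(len(prefix))]

def backward_extension_events(sdb, prefix):
    """Return {(i, item)}: items present in every i-th maximum period (i is 1-based)."""
    periods = [p for p in (maximum_periods(s, prefix) for s in sdb) if p is not None]
    if not periods:
        return set()
    events = set()
    for i in range(len(prefix)):
        common = set.intersection(*(set(p[i]) for p in periods))
        events |= {(i + 1, item) for item in common}
    return events

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(backward_extension_events(SDB, "AC"))
print(backward_extension_events(SDB, "ABC"))
```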

Scan-Skip optimization
SDB, I = {A, B, C}, min_sup = 50%
Sp = ABC with support 4.

MP1 = {CA, ∅, C, ∅}: skip the last two
MP2 = {A, ∅, ∅, B}: skip the last two
MP3 = {∅, ∅, ∅, B}: skip the last three
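One way to read the skipping idea: while intersecting the i-th maximum periods sequence by sequence, stop as soon as no item can still appear in all of them. The sketch below takes the periods as plain strings (an empty string for an empty period) and also reports how many periods were scanned; the function name is an assumption.

```python
def common_items_with_skip(periods):
    """Intersect the item sets of `periods`, stopping early once the result is empty.

    Returns (common_items, number_of_periods_scanned).
    """
    common = None
    scanned = 0
    for period in periods:
        scanned += 1
        common = set(period) if common is None else common & set(period)
        if not common:
            break                    # remaining periods cannot change the outcome
    return (common or set()), scanned

# Maximum periods of Sp = ABC in the example database.
for name, periods in [("MP1", ["CA", "", "C", ""]),
                      ("MP2", ["A", "", "", "B"]),
                      ("MP3", ["", "", "", "B"])]:
    common, scanned = common_items_with_skip(periods)
    print(name, common, "skipped", len(periods) - scanned)
```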

Pruning and optimization

Projected Sequence & Projected Database


SDB

Sequence identifier    Sequence
1                      CAABC
2                      ABCB
3                      CABC
4                      ABBCA

Sp_SDB = {C, CB, C, BCA}   [Sp = AB]

Sequence identifier    Sequence
1                      C
2                      CB
3                      C
4                      BCA

Frequent Sequence Enumeration (Prefix-Span)

Forward-extension event checking


Lemma: For a prefix sequence Sp, its complete set of forward-extension events is equivalent to the set of its locally frequent items whose supports are equal to supSDB(Sp).

SDB, I = {A, B, C}, min_sup = 50%

Sp1 = AB : 4
Sp1_SDB = {C, CB, C, BCA}
The forward-extension event for AB is C.

Sp2 = CAB : 2
Sp2_SDB = {C, ∅, C, ∅}
The forward-extension event for CAB is C.
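The lemma suggests a simple test: count in how many projected sequences each item occurs and keep the items whose count equals the support of the prefix. A minimal sketch under the single-item-event assumption, with hypothetical function names:

```python
from collections import Counter

def project(seq, prefix):
    """Projected sequence of `prefix` in `seq`, or None if `seq` lacks `prefix`."""
    pos = -1
    for item in prefix:
        pos = seq.find(item, pos + 1)
        if pos < 0:
            return None
    return seq[pos + 1:]

def forward_extension_events(sdb, sp):
    projected = [p for p in (project(s, sp) for s in sdb) if p is not None]
    counts = Counter()
    for p in projected:
        counts.update(set(p))        # count each item once per sequence
    # len(projected) is the support of sp.
    return {item for item, c in counts.items() if c == len(projected)}

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(forward_extension_events(SDB, "AB"))
print(forward_extension_events(SDB, "CAB"))
```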

Scheme of Bide Algorithm


Input: SDB, min_sup

Sp: start from the frequent 1-sequences.

BackScan pruning: if Sp can be pruned, stop growing Sp.

BI-Directional Extension closure checking (backward-extension event checking and forward-extension event checking):
Extension event found: Sp is a frequent non-closed sequence.
No extension event: Sp is a frequent closed sequence.

Grow Sp with each locally frequent item and repeat.

The BIDE algorithm (1/2)


Scan the database once in order to find all the frequent 1-sequences.
For each frequent 1-sequence, build a pseudo-projected database and check whether it can be pruned (Back-Scan).
For every sequence that cannot be pruned, compute the number of backward-extension items and then call subroutine bide.

The BIDE algorithm (2/2)

Subroutine bide:
For prefix Sp, scan its projected DB.
Compute the number of forward-extension items.
If there is neither a forward-extension item nor a backward-extension item, Sp is closed.
Grow Sp with each locally frequent item to get a new prefix.
Build the pseudo-projected DB for the new prefix.
For each new prefix:
Check if it can be pruned.
If not, compute the number of backward-extension items and call bide recursively.
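Putting the steps together, a compact sketch of the whole procedure might look as follows. It assumes single-item events stored as strings, recomputes first instances instead of keeping pseudo-projection pointers, and leaves out the Scan-Skip bookkeeping apart from an early exit. All function names are invented for this illustration; it is not the authors' implementation.

```python
from collections import Counter

def first_positions(seq, prefix):
    pos, start = [], 0
    for item in prefix:
        idx = seq.find(item, start)
        if idx < 0:
            return None
        pos.append(idx)
        start = idx + 1
    return pos

def periods(seq, prefix, semi):
    """Semi-maximum periods (semi=True) or maximum periods (semi=False)."""
    first = first_positions(seq, prefix)
    bound = first[-1] + 1 if semi else len(seq)
    last = [0] * len(prefix)
    for i in reversed(range(len(prefix))):
        last[i] = seq.rfind(prefix[i], 0, bound)
        bound = last[i]
    starts = [0] + [p + 1 for p in first[:-1]]
    return [seq[starts[i]:last[i]] for i in range(len(prefix))]

def has_common_item(seqs, prefix, semi):
    """True if some item occurs in every i-th period for at least one i."""
    table = [periods(s, prefix, semi) for s in seqs]
    for i in range(len(prefix)):
        common = set(table[0][i])
        for row in table[1:]:
            common &= set(row[i])
            if not common:
                break                # early exit, in the spirit of Scan-Skip
        if common:
            return True
    return False

def bide(sdb, min_sup):
    closed = {}

    def grow(prefix, seqs):          # seqs: the sequences that contain prefix
        counts = Counter()
        for s in seqs:
            end = first_positions(s, prefix)[-1] if prefix else -1
            counts.update(set(s[end + 1:]))      # locally frequent item counts
        if prefix:
            forward = any(c == len(seqs) for c in counts.values())
            backward = has_common_item(seqs, prefix, semi=False)
            if not forward and not backward:
                closed[prefix] = len(seqs)
        for item, c in sorted(counts.items()):
            if c >= min_sup:
                new_prefix = prefix + item
                new_seqs = [s for s in seqs if first_positions(s, new_prefix) is not None]
                if not has_common_item(new_seqs, new_prefix, semi=True):   # Back-Scan
                    grow(new_prefix, new_seqs)

    grow("", list(sdb))
    return closed

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
closed = bide(SDB, min_sup=2)
print(sorted(closed.items()))
```

On the example database the prefixes B, AC, CC, CAC and CBC are cut by the Back-Scan test, and no previously found pattern is consulted when deciding whether a prefix is closed.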

Performance Evaluation
BIDE can significantly outperform PrefixSpan and SPADE.
BIDE consumes much less memory and can be faster than CloSpan.
The BackScan pruning method is effective in enhancing the performance.
Experiments on three datasets.

SPADE vs. PrefixSpan vs. CloSpan vs. BIDE

CloSpan vs. BIDE (1/3)


Gazelle Dataset

CloSpan vs. BIDE (2/3)


Snake Dataset

CloSpan vs. BIDE (3/3)


Pi Dataset

Conclusions
Closed sequence mining:
More compact result set
Significantly better efficiency

BIDE:
Avoids the curse of candidate maintenance
Prunes search space more deeply
Consumes much less memory than CloSpan in closure checking

Future work:
Push constraints into the mining process

Thank you!
