BIDE: Efficient Mining of Frequent Closed Sequences: Jianyong Wang and Jiawei Han

BIDE : Efficient Mining of Frequent Closed Sequences
Jianyong Wang and Jiawei Han Proc. 2004 Int. Conf. on Data Engineering (ICDE'04), Boston, MA, March 2004.
Opening
[Figure: two example item sets, each with sup = 5, shown against the cases sup < 5 and sup > 5; the item symbols did not survive extraction.]
Review of Algorithms
The problem was first introduced by Agrawal and Srikant in 1995. Most algorithms mine all frequent sequences (closed and non-closed).
Frequent sequences:
  Apriori-based (Apriori, AprioriAll, GSP)
  Sequence-enumeration tree/lattice-based (Max-Miner, SPAM, SPADE)
  Constraint-based (SPIRIT family)
Frequent closed sequences:
  CloSpan: the first algorithm for mining frequent closed sequences (based on PrefixSpan)
  BIDE
Apriori-based algorithms
Make multiple passes over the database.
First pass:
  Find the frequent 1-sequences.
  Join them to build the candidate 2-sequences.
Step k:
  Join the set of frequent (k-1)-sequences to build the candidate k-sequences.
  Scan the SDB to check which candidate k-sequences are actually frequent.
Stop at step m:
  No frequent m-sequence is found.
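The level-wise passes above can be sketched as follows. This is a minimal illustration, not the GSP implementation: it assumes every event is a single item (so a sequence is a plain string), uses an absolute min_sup, and the names apriori_sequences, support and is_subsequence are hypothetical.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern occur in seq in the same order."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, sdb):
    """Number of sequences in sdb that contain pattern."""
    return sum(is_subsequence(pattern, s) for s in sdb)


def apriori_sequences(sdb, min_sup):
    """Level-wise mining: join frequent (k-1)-sequences, then scan the SDB."""
    items = sorted({x for s in sdb for x in s})
    level = [i for i in items if support(i, sdb) >= min_sup]  # first pass
    frequent = {}
    while level:  # stops at the first step that yields no frequent sequence
        for p in level:
            frequent[p] = support(p, sdb)
        # Join step: p and q overlap on k-2 items, the candidate is p + last item of q.
        candidates = sorted({p + q[-1] for p in level for q in level
                             if p[1:] == q[:-1]})
        # Scan step: keep the candidates that are actually frequent.
        level = [c for c in candidates if support(c, sdb) >= min_sup]
    return frequent


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
frequent = apriori_sequences(sdb, min_sup=2)
print(len(frequent), frequent["ABC"])
```

Each pass recounts supports by rescanning the whole database, which is the cost the later tree-based and projection-based methods try to avoid.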
Tree-based algorithms (1/2)

Use a sequence enumeration tree to generate all the candidate sequences.
Traverse the tree depth-first (DFS).
Use pruning techniques to avoid traversing sub-trees.
Apriori principle: if a sequence is not frequent, none of its super-sequences can be frequent.
If an extension event leads to a non-frequent sequence, this event is no longer used in other sequence extensions.
Tree-based algorithms (2/2)
I = {a, b}
S-step: add the item to the sequence as a new event.
I-step: add the item to the last event of the sequence.
Definition of Terms (1/2)

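Item set: I = {i_1, i_2, ..., i_n}
Sequence (ordered list of events): S = <e_1, e_2, ..., e_m>
Length: a sequence with m events is an m-sequence.
Subsequence and super-sequence: S_a = <a_1, a_2, ..., a_n> is a subsequence of S_b = <b_1, b_2, ..., b_m> (and S_b is a super-sequence of S_a) if there exist integers 1 ≤ i_1 < i_2 < ... < i_n ≤ m such that a_1 = b_{i_1}, a_2 = b_{i_2}, ..., a_n = b_{i_n}.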
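The subsequence condition can be checked by searching for the index mapping directly. The sketch below assumes single-item events stored as strings; find_embedding is a hypothetical helper name.

```python
def find_embedding(sa, sb):
    """Return 1-based indices i_1 < ... < i_n with sa[k] == sb[i_k], or None.

    Greedy leftmost matching: if any embedding exists, this one does too.
    """
    indices = []
    pos = 0  # 0-based position in sb where the next search starts
    for item in sa:
        pos = sb.find(item, pos)
        if pos == -1:
            return None  # sa is not a subsequence of sb
        pos += 1         # doubles as the 1-based index of the match
        indices.append(pos)
    return indices


print(find_embedding("AC", "ABCB"))  # indices of A and C inside ABCB
print(find_embedding("BA", "ABCB"))  # no A after the first B
```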
Definition of Terms (2/2)

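SDB: sequence database, a set of tuples (sid, S).
A tuple (sid, S) contains a sequence S_a if S_a is a subsequence of S.
sup_SDB(S_a): absolute support, the number of tuples in SDB that contain S_a.
sup_SDB(S_a) / |SDB|: relative support.
S_a is frequent if sup_SDB(S_a) ≥ min_sup.
S is closed if there exists no proper super-sequence of S with the same support.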
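A small sketch of the support definitions, assuming single-item events stored as strings and a database of (sid, S) tuples. The function names are illustrative, and the four sequences are the example database used later in these slides.

```python
def contains(seq, pattern):
    """True if pattern is a subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)


def absolute_support(pattern, sdb):
    """Number of tuples (sid, S) whose S contains pattern."""
    return sum(contains(s, pattern) for _sid, s in sdb)


def relative_support(pattern, sdb):
    return absolute_support(pattern, sdb) / len(sdb)


def is_frequent(pattern, sdb, min_sup):
    """min_sup is given as a relative threshold here, e.g. 0.5 for 50%."""
    return relative_support(pattern, sdb) >= min_sup


SDB = [(1, "CAABC"), (2, "ABCB"), (3, "CABC"), (4, "ABBCA")]
print(absolute_support("CA", SDB), relative_support("CA", SDB))
```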
Frequent Closed Sequences

Find the complete set of frequent closed sequences:
SDB = {CAABC, ABCB, CABC, ABBCA}
I = {A, B, C}, min_sup = 50%
Previous algorithms: 17 frequent sequences
BIDE (and CloSpan): 6 frequent closed sequences
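The two counts can be reproduced with a brute-force check. This is not BIDE, only a naive enumeration followed by a pairwise closure filter, assuming single-item events and the four-sequence example database from the slides (min_sup = 50% is an absolute support of 2). All function names are illustrative.

```python
def contains(seq, pattern):
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, sdb):
    return sum(contains(s, pattern) for s in sdb)


def frequent_sequences(sdb, min_sup):
    """Depth-first prefix growth over all items (no closure checking)."""
    items = sorted({x for s in sdb for x in s})
    found = {}

    def grow(prefix):
        for e in items:
            cand = prefix + e
            sup = support(cand, sdb)
            if sup >= min_sup:
                found[cand] = sup
                grow(cand)

    grow("")
    return found


def closed_only(frequent):
    """Keep patterns with no proper super-sequence of equal support."""
    return {p: sup for p, sup in frequent.items()
            if not any(q != p and sup_q == sup and contains(q, p)
                       for q, sup_q in frequent.items())}


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
frequent = frequent_sequences(sdb, min_sup=2)
closed = closed_only(frequent)
print(len(frequent), sorted(closed.items()))
```

The pairwise filter needs the whole result set in memory, which is exactly the kind of candidate maintenance that BIDE is designed to avoid.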
CloSpan
Maintains the set of already mined frequent closed patterns in memory.
Sub-pattern checking: a new pattern can be absorbed by an already mined frequent pattern.
Super-pattern checking: a new pattern can absorb some already mined frequent patterns.
To save space, CloSpan stores a candidate superset of the frequent closed patterns in a hash-indexed tree structure and then prunes the tree to get the actual set of frequent closed patterns.
BIDE (1/2)
BIDE: an efficient algorithm for discovering the complete set of frequent closed sequences.
BI-Directional Extension: a new paradigm for mining frequent closed sequences without candidate maintenance.
BackScan pruning method.
ScanSkip optimization technique.
Performance study: BIDE can be over an order of magnitude faster than the previous algorithms and consumes orders of magnitude less memory.
BIDE (2/2)
How to enumerate the complete set of frequent sequences?
Upon getting a frequent sequence, how to check whether it is closed?
How to design search space pruning methods or other optimization techniques to accelerate the mining process?
First instance
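S = CAABC, Sp = AB
First instance of prefix sequence Sp in S: CAAB (the part of S from its beginning to the end of the first appearance of Sp).
Projected sequence of prefix sequence Sp in S: C (the remainder of S after the first instance).
Projected database of prefix sequence Sp: Sp_SDB, the set of projected sequences of Sp in SDB.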
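The first instance and the projected sequence can be computed with one left-to-right scan. The sketch assumes single-item events stored as strings; the helper names are hypothetical.

```python
def first_instance_length(seq, prefix):
    """Length of the first instance of prefix in seq, or None if absent."""
    pos = 0
    for item in prefix:
        pos = seq.find(item, pos)  # leftmost match after the previous item
        if pos == -1:
            return None
        pos += 1
    return pos


def first_instance(seq, prefix):
    n = first_instance_length(seq, prefix)
    return None if n is None else seq[:n]


def projected_sequence(seq, prefix):
    n = first_instance_length(seq, prefix)
    return None if n is None else seq[n:]


def projected_database(sdb, prefix):
    """Projected sequences of every sequence in sdb that contains prefix."""
    projected = (projected_sequence(s, prefix) for s in sdb)
    return [p for p in projected if p is not None]


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(first_instance("CAABC", "AB"), projected_sequence("CAABC", "AB"))
print(projected_database(sdb, "AB"))
```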
i-th last-in-first appearance

S = CAABC, Sp = CA
First instance of prefix sequence Sp in S: CA
2nd last-in-first appearance = the last A of the first instance
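i-th semi-maximum period: SMP_i(Sp)

SMP_i(Sp) is the piece of S between the end of the first instance of prefix e_1 e_2 ... e_{i-1} in S and the i-th last-in-first appearance w.r.t. Sp. For i = 1 it is the piece of S before the 1st last-in-first appearance.
S = ABCB, Sp = AC
End of the first instance of prefix A in S: A
2nd last-in-first appearance: C
SMP_2(AC) = B
SMP_1(AC) = ∅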
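A minimal sketch of both notions, assuming single-item events stored as strings and 0-based positions. The function names are illustrative, not taken from the paper.

```python
def first_instance_ends(seq, prefix):
    """0-based end position of the first instance of each prefix e_1..e_i."""
    ends, pos = [], 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos == -1:
            return None  # seq does not contain prefix
        ends.append(pos)
        pos += 1
    return ends


def last_in_first(seq, prefix):
    """0-based positions of the i-th last-in-first appearances."""
    ends = first_instance_ends(seq, prefix)
    if ends is None:
        return None
    lf = [0] * len(prefix)
    bound = ends[-1] + 1  # stay inside the first instance of the whole prefix
    for i in range(len(prefix) - 1, -1, -1):
        lf[i] = seq.rfind(prefix[i], 0, bound)  # last appearance before bound
        bound = lf[i]                           # LF_i must precede LF_{i+1}
    return lf


def semi_maximum_periods(seq, prefix):
    """List [SMP_1, ..., SMP_n] as strings ('' stands for the empty period)."""
    ends = first_instance_ends(seq, prefix)
    if ends is None:
        return None
    lf = last_in_first(seq, prefix)
    return [seq[(0 if i == 0 else ends[i - 1] + 1):lf[i]]
            for i in range(len(prefix))]


print(semi_maximum_periods("ABCB", "AC"))
```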
Back-Scan search space pruning method

Theorem: Let the prefix sequence be an n-sequence, Sp = e_1 e_2 ... e_n. If there exist an i (1 ≤ i ≤ n) and an item e that appears in each of the i-th semi-maximum periods of the prefix Sp in SDB, we can safely stop growing prefix Sp.
I = {A, B, C}, min_sup = 50%
Sp = B : 4
SMP_1 = {CAA, A, CA, A}
Item A appears in every SMP_1, so prune the sub-tree under B.
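The pruning test in the theorem can be sketched as below, assuming single-item events stored as strings and the four-sequence example database. backscan_prunable and the compact semi_maximum_periods helper are illustrative names.

```python
def semi_maximum_periods(seq, prefix):
    """[SMP_1, ..., SMP_n] of prefix in seq, or None if seq lacks prefix."""
    ends, pos = [], 0
    for item in prefix:                      # locate the first instance
        pos = seq.find(item, pos)
        if pos == -1:
            return None
        ends.append(pos)
        pos += 1
    periods = [None] * len(prefix)
    bound = ends[-1] + 1
    for i in range(len(prefix) - 1, -1, -1):
        lf = seq.rfind(prefix[i], 0, bound)  # i-th last-in-first appearance
        start = 0 if i == 0 else ends[i - 1] + 1
        periods[i] = seq[start:lf]
        bound = lf
    return periods


def backscan_prunable(sdb, prefix):
    """True if some item occurs in every i-th SMP for at least one i."""
    rows = [p for p in (semi_maximum_periods(s, prefix) for s in sdb)
            if p is not None]
    if not rows:
        return False
    for i in range(len(prefix)):
        common = set.intersection(*(set(row[i]) for row in rows))
        if common:
            return True
    return False


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print({p: backscan_prunable(sdb, p) for p in ("A", "B", "C")})
```

On the example database only prefix B is pruned among the 1-sequences, in line with the slide.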
BI-Directional Extension closure checking

Forward extension: Sp = e_1 e_2 ... e_n, Sp* = e_1 e_2 ... e_n e. If sup_SDB(Sp) = sup_SDB(Sp*), then Sp is non-closed and item e is a forward-extension event.
Backward extension: Sp = e_1 e_2 ... e_n, Sp2 = e_1 e_2 ... e_i e e_{i+1} ... e_n (1 ≤ i < n)
or
Sp2 = e e_1 e_2 ... e_n. If sup_SDB(Sp) = sup_SDB(Sp2), then Sp is non-closed and item e is a backward-extension event.
Theorem: If there exists neither a forward-extension event nor a backward-extension event w.r.t. Sp, Sp must be closed; otherwise it must be non-closed.
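The definitions can be checked directly by recounting supports. This is a brute-force sketch for illustration, not the efficient check BIDE uses; it assumes single-item events stored as strings, and extension_events is a hypothetical name.

```python
def contains(seq, pattern):
    it = iter(seq)
    return all(item in it for item in pattern)


def support(pattern, sdb):
    return sum(contains(s, pattern) for s in sdb)


def extension_events(prefix, sdb, items):
    """Return (forward, backward) extension events of prefix by definition."""
    sup = support(prefix, sdb)
    # Forward: append e after the last event.
    forward = {e for e in items if support(prefix + e, sdb) == sup}
    # Backward: insert e before the first event or between two events.
    backward = {e for pos in range(len(prefix)) for e in items
                if support(prefix[:pos] + e + prefix[pos:], sdb) == sup}
    return forward, backward


def is_closed(prefix, sdb, items):
    forward, backward = extension_events(prefix, sdb, items)
    return not forward and not backward


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(extension_events("AC", sdb, "ABC"), is_closed("ABC", sdb, "ABC"))
```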
Last instance
S = CABDCABBA, Sp = AB
Last instance of prefix sequence Sp in S: CABDCABB
i-th last-in-last appearance

S = CAABC, Sp = AB
Last instance of prefix sequence Sp in S: CAAB
1st last-in-last appearance = the last A of the last instance
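i-th maximum period: MP_i(Sp)

MP_i(Sp) is the piece of S between the end of the first instance of prefix e_1 e_2 ... e_{i-1} in S and the i-th last-in-last appearance w.r.t. Sp. For i = 1 it is the piece of S before the 1st last-in-last appearance.
S = ABCB, Sp = AB
End of the first instance of prefix A in S: A
2nd last-in-last appearance: the last B
MP_2(AB) = BC
MP_1(AB) = ∅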
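A sketch of the maximum periods under the same assumptions as before (single-item events stored as strings, 0-based positions, hypothetical function names). The only change from the semi-maximum periods is that the backward search starts at the end of the sequence rather than at the end of the first instance.

```python
def maximum_periods(seq, prefix):
    """[MP_1, ..., MP_n] of prefix in seq, or None if seq lacks prefix."""
    ends, pos = [], 0
    for item in prefix:                      # ends of the first instances
        pos = seq.find(item, pos)
        if pos == -1:
            return None
        ends.append(pos)
        pos += 1
    periods = [None] * len(prefix)
    bound = len(seq)                         # search the whole sequence
    for i in range(len(prefix) - 1, -1, -1):
        ll = seq.rfind(prefix[i], 0, bound)  # i-th last-in-last appearance
        start = 0 if i == 0 else ends[i - 1] + 1
        periods[i] = seq[start:ll]
        bound = ll                           # LL_i must precede LL_{i+1}
    return periods


print(maximum_periods("ABCB", "AB"))
```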
Backward-Extension event checking

Lemma: If there exists an item e which appears in each of the i-th maximum periods of Sp in SDB, then e is a backward-extension event w.r.t. Sp.
SDB, I = {A, B, C}, min_sup = 50%
Sp = AC : 4
MP_2 = {AB, B, B, BB}
The backward-extension event for Sp = AC is B, so AC:4 is a frequent non-closed sequence.
Sp = ABC : 4
There is no backward-extension item and no forward-extension item for Sp, so ABC:4 is a frequent closed sequence.
Scan-Skip optimization
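SDB, I = {A, B, C}, min_sup = 50%, Sp = ABC with support 4.
Once a scanned maximum period leaves no common item, no item can appear in every period, so the remaining periods need not be scanned.
MP_1 = {CA, ∅, C, ∅}: skip the last two
MP_2 = {A, ∅, ∅, B}: skip the last two
MP_3 = {∅, ∅, ∅, B}: skip the last three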
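The backward-extension lemma and the skipping idea can be sketched together: intersect the i-th maximum periods one sequence at a time and stop as soon as the running intersection is empty. The code assumes single-item events stored as strings; the function names and the returned scan counts are illustrative additions.

```python
def maximum_periods(seq, prefix):
    """[MP_1, ..., MP_n] of prefix in seq, or None if seq lacks prefix."""
    ends, pos = [], 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos == -1:
            return None
        ends.append(pos)
        pos += 1
    periods, bound = [None] * len(prefix), len(seq)
    for i in range(len(prefix) - 1, -1, -1):
        ll = seq.rfind(prefix[i], 0, bound)
        periods[i] = seq[(0 if i == 0 else ends[i - 1] + 1):ll]
        bound = ll
    return periods


def backward_extension_events(sdb, prefix):
    """Return (events, scanned); scanned[i] counts the periods read for MP_(i+1)."""
    rows = [p for p in (maximum_periods(s, prefix) for s in sdb)
            if p is not None]
    events, scanned = set(), []
    for i in range(len(prefix)):
        common, count = None, 0
        for row in rows:
            count += 1
            items = set(row[i])
            common = items if common is None else common & items
            if not common:
                break  # nothing can be common any more: skip the rest
        scanned.append(count)
        events |= common or set()
    return events, scanned


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(backward_extension_events(sdb, "AC"))
print(backward_extension_events(sdb, "ABC"))
```

For Sp = ABC the scan counts come out as 2, 2 and 1 of the four periods, which corresponds to skipping the last two, two and three periods on the slide.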
Pruning and optimization
Projected Sequence & Projected Database

SDB                                Sp_SDB [Sp = AB]
Sequence identifier   Sequence     Sequence identifier   Sequence
1                     CAABC        1                     C
2                     ABCB         2                     CB
3                     CABC         3                     C
4                     ABBCA        4                     BCA

Sp_SDB = {C, CB, C, BCA}
Frequent Sequence Enumeration (Prefix-Span)
Forward-extension event checking

Lemma: For a prefix sequence Sp, its complete set of forward-extension events is equivalent to the set of its locally frequent items whose supports are equal to sup_SDB(Sp).
SDB, I = {A, B, C}, min_sup = 50%
Sp1 = AB : 4
Sp1_SDB = {C, CB, C, BCA}
The forward-extension event for AB is C.
Sp2 = CAB : 2
Sp2_SDB = {C, ∅, C, ∅}
The forward-extension event for CAB is C.
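The lemma translates into counting, for each item, how many projected sequences contain it. A minimal sketch, assuming single-item events stored as strings; the function names are hypothetical.

```python
from collections import Counter


def projected_database(sdb, prefix):
    """Projected sequences of the sequences in sdb that contain prefix."""
    projected = []
    for seq in sdb:
        pos = 0
        for item in prefix:
            pos = seq.find(item, pos)
            if pos == -1:
                break
            pos += 1
        else:  # every prefix item was matched
            projected.append(seq[pos:])
    return projected


def forward_extension_events(sdb, prefix):
    """Local items whose support equals the support of prefix."""
    projected = projected_database(sdb, prefix)
    local = Counter()
    for suffix in projected:
        local.update(set(suffix))  # count each item once per sequence
    sup = len(projected)           # support of prefix itself
    return {item for item, count in local.items() if count == sup}


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(forward_extension_events(sdb, "AB"), forward_extension_events(sdb, "CAB"))
```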
Scheme of Bide Algorithm

Input: SDB, min_sup
1. Take a prefix Sp (start from the frequent 1-sequences).
2. BackScan pruning: if Sp can be pruned, stop growing Sp.
3. BI-Directional Extension closure checking: backward-extension event checking and forward-extension event checking.
   Extension events found: Sp is a frequent non-closed sequence.
   No extension event: Sp is a frequent closed sequence.
4. Grow Sp with an extension event and repeat from step 2.
The BIDE algorithm (1/2)

Scan the database once to find all the frequent 1-sequences.
For each frequent 1-sequence, build a pseudo-projected database and check whether it can be pruned (BackScan).
For every sequence that cannot be pruned, compute the number of backward-extension items and then call subroutine bide.
The BIDE algorithm (2/2)
Subroutine bide:
For prefix Sp, scan its projected DB.
Compute the number of forward-extension items.
If there is neither a forward-extension item nor a backward-extension item, Sp is closed.
Grow Sp with each locally frequent item to get a new prefix.
Build the pseudo-projected DB for the new prefix.
For each new prefix:
  Check whether it can be pruned.
  If not, compute the number of backward-extension items and call bide recursively.
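The two slides can be condensed into the following sketch. It is a simplified illustration, not the authors' implementation: events are single items, min_sup is absolute, the database is rescanned instead of pseudo-projected, ScanSkip is omitted, and the backward check happens inside the recursive call. All names are hypothetical.

```python
from collections import Counter


def first_ends(seq, prefix):
    """End positions of the first instances of e_1..e_i, or None."""
    ends, pos = [], 0
    for item in prefix:
        pos = seq.find(item, pos)
        if pos == -1:
            return None
        ends.append(pos)
        pos += 1
    return ends


def periods(seq, prefix, ends, semi):
    """Semi-maximum periods if semi is True, else maximum periods."""
    bound = ends[-1] + 1 if semi else len(seq)
    out = [None] * len(prefix)
    for i in range(len(prefix) - 1, -1, -1):
        last = seq.rfind(prefix[i], 0, bound)
        out[i] = seq[(0 if i == 0 else ends[i - 1] + 1):last]
        bound = last
    return out


def has_common_item(sdb, prefix, semi):
    """True if some item occurs in every i-th period for at least one i."""
    rows = [periods(s, prefix, e, semi)
            for s, e in ((s, first_ends(s, prefix)) for s in sdb)
            if e is not None]
    if not rows:
        return False
    return any(set.intersection(*(set(r[i]) for r in rows))
               for i in range(len(prefix)))


def bide(sdb, min_sup):
    closed = {}

    def grow(prefix, sup):
        local = Counter()  # supports of the locally frequent item candidates
        for seq in sdb:
            ends = first_ends(seq, prefix)
            if ends is not None:
                local.update(set(seq[ends[-1] + 1:]))
        forward = any(count == sup for count in local.values())
        backward = has_common_item(sdb, prefix, semi=False)
        if not forward and not backward:
            closed[prefix] = sup
        for item, count in sorted(local.items()):
            new_prefix = prefix + item
            if count >= min_sup and not has_common_item(sdb, new_prefix, semi=True):
                grow(new_prefix, count)  # not pruned by BackScan

    counts = Counter()
    for seq in sdb:
        counts.update(set(seq))
    for item, count in sorted(counts.items()):
        if count >= min_sup and not has_common_item(sdb, item, semi=True):
            grow(item, count)
    return closed


sdb = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(sorted(bide(sdb, min_sup=2).items()))
```

On the example database the sub-trees under B, AC, CAC, CBC and CC are cut by the BackScan test in this sketch, and no previously found pattern is consulted when deciding whether a prefix is closed.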
Performance Evaluation
BIDE can significantly outperform PrefixSpan and SPADE.
BIDE consumes much less memory and can be faster than CloSpan.
The BackScan pruning method is effective in enhancing the performance.
Experiments on three datasets.
[Figure: SPADE vs. PrefixSpan vs. CloSpan vs. BIDE]
[Figure: CloSpan vs. BIDE (1/3), Gazelle dataset]
[Figure: CloSpan vs. BIDE (2/3), Snake dataset]
[Figure: CloSpan vs. BIDE (3/3), Pi dataset]
Conclusions
Closed sequence mining:
  More compact result set
  Significantly better efficiency
BIDE:
  Avoids the curse of candidate maintenance
  Prunes the search space more deeply
  Consumes much less memory than CloSpan in closure checking
Future work:
  Push constraints into the mining process
Thank you !
