Jianyong Wang and Jiawei Han. Proc. 2004 Int. Conf. on Data Engineering (ICDE'04), Boston, MA, March 2004.
Opening
[Figure: example patterns annotated with their supports (sup = 5, sup = 5, sup < 5, sup > 5)]
Review of Algorithms
The sequential pattern mining problem was first introduced by Agrawal and Srikant (1995). Most existing work targets mining frequent sequences (closed and non-closed). Approaches to mining frequent sequences:
Apriori-based (Apriori, AprioriAll, GSP)
Sequence-enumeration tree/lattice-based (Max-Miner, SPAM, SPADE)
Constraint-based (SPIRIT family)
Apriori-based algorithms
Make multiple passes over the database.
First pass:
Find the frequent 1-sequences.
Join them to build the candidate 2-sequences.
Step k:
Join the set of frequent (k-1)-sequences to build the candidate k-sequences.
Scan the SDB to check which of the candidate k-sequences are actually frequent.
Stop at step m:
No frequent m-sequence is found.
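The level-wise passes above can be sketched in a few lines of Python. This is a minimal illustration, not any of the named algorithms: it assumes every event is a single item (so a sequence is a plain string), uses a GSP-style join, and rescans the whole database to count support. The function names and the four-sequence database are illustrative.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern appear in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def support(pattern, sdb):
    """Number of database sequences that contain pattern."""
    return sum(is_subsequence(pattern, s) for s in sdb)

def apriori_sequences(sdb, min_sup):
    """Level-wise mining: one database pass per pattern length."""
    items = sorted({x for s in sdb for x in s})
    frequent = {}
    # First pass: frequent 1-sequences.
    level = [i for i in items if support(i, sdb) >= min_sup]
    while level:  # stop at the first step that yields no frequent sequence
        for p in level:
            frequent[p] = support(p, sdb)
        # Join step: p and q combine when p without its first item
        # equals q without its last item.
        candidates = {p + q[-1] for p in level for q in level if p[1:] == q[:-1]}
        # Scan step: keep only candidates that are actually frequent.
        level = sorted(c for c in candidates if support(c, sdb) >= min_sup)
    return frequent

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
result = apriori_sequences(SDB, min_sup=2)
print(len(result), result["ABC"], result["CABC"])  # 17 4 2
```

On this small database the sketch finds 17 frequent sequences, which shows how quickly the result set grows when non-closed patterns are kept.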
If an extension event leads to a non-frequent sequence, this event is no longer used in other sequence extensions.
I = {a, b} S-step:
CloSpan
Maintains the set of already mined frequent closed patterns in memory.
Sub-pattern checking: a new pattern can be absorbed by an already mined frequent pattern.
Super-pattern checking: a new pattern can absorb some already mined frequent patterns.
To save space, CloSpan stores a candidate superset of the frequent closed patterns in a hash-indexed tree structure and then prunes the tree to get the actual set of frequent closed patterns.
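The two checks can be illustrated with a small Python sketch. It is only a stand-in for the idea: it keeps the candidates in a plain dictionary and scans it linearly instead of using CloSpan's hash-indexed tree, and it assumes single-item events. The name `add_pattern` is hypothetical.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern appear in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def add_pattern(closed, pattern, sup):
    """Insert a newly mined pattern while keeping only closed patterns."""
    # Sub-pattern check: an already stored super-pattern with the same
    # support absorbs the new pattern.
    for other, other_sup in closed.items():
        if other_sup == sup and is_subsequence(pattern, other):
            return
    # Super-pattern check: the new pattern absorbs stored sub-patterns
    # that have the same support.
    absorbed = [o for o, s in closed.items()
                if s == sup and is_subsequence(o, pattern)]
    for other in absorbed:
        del closed[other]
    closed[pattern] = sup

closed = {}
for pattern, sup in [("AB", 4), ("A", 4), ("ABC", 4), ("CA", 3)]:
    add_pattern(closed, pattern, sup)
print(closed)  # {'ABC': 4, 'CA': 3}
```

Every new pattern is compared against the stored set, which is why the set of already mined patterns has to stay in memory.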
BIDE (1/2)
BIDE: An efficient algorithm for discovering the complete set of frequent closed sequences
BIDE (2/2)
How to enumerate the complete set of frequent sequences?
Upon getting a frequent sequence, how to check if it is closed?
How to design some search space pruning methods or other optimization techniques to accelerate the mining process?
First instance
S = CAABC, Sp = AB
First instance of the prefix sequence Sp in S: CAAB
Projected sequence of S w.r.t. the prefix sequence Sp: C
Projected database of the prefix sequence Sp: Sp_SDB
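The three definitions above can be made concrete with a short Python sketch. It assumes single-item events, so each sequence is a string; the helper names are illustrative, and the four-sequence database is the example SDB used in these slides.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq, or None if absent."""
    pos = -1
    for item in prefix:
        pos = seq.find(item, pos + 1)  # leftmost match after the previous item
        if pos < 0:
            return None
    return pos + 1

def first_instance(seq, prefix):
    """Shortest leading part of seq that contains prefix."""
    end = first_instance_end(seq, prefix)
    return None if end is None else seq[:end]

def projected_sequence(seq, prefix):
    """Remainder of seq after the first instance of prefix."""
    end = first_instance_end(seq, prefix)
    return None if end is None else seq[end:]

def projected_database(sdb, prefix):
    """Projected sequences of all database sequences that contain prefix."""
    return [projected_sequence(s, prefix) for s in sdb
            if first_instance_end(s, prefix) is not None]

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(first_instance("CAABC", "AB"))      # CAAB
print(projected_sequence("CAABC", "AB"))  # C
print(projected_database(SDB, "AB"))      # ['C', 'CB', 'C', 'BCA']
```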
I = { A, B, C } min_sup = 50%
Theorem : If there exists no forward extension event nor backward extension event w.r.t. Sp, Sp must be closed; otherwise it must be non-closed.
Last instance
S = CABDCABBA, Sp = AB
Last instance of the prefix sequence Sp in S: CABDCABB
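A minimal Python sketch of the last instance, again assuming single-item events: the last instance runs from the start of S to the last appearance of the final item of Sp. The function name is illustrative.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern appear in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def last_instance(seq, prefix):
    """Leading part of seq up to the last appearance of the prefix's final item."""
    if not prefix or not is_subsequence(prefix, seq):
        return None
    return seq[: seq.rindex(prefix[-1]) + 1]

print(last_instance("CABDCABBA", "AB"))  # CABDCABB
```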
Sp = AC : 4
MP2 = {AB, B, B, BB}
The backward-extension event for Sp = AC is B.
AC:4 is a frequent non-closed sequence.
Sp = ABC : 4
No backward-extension item for Sp.
No forward-extension item for Sp.
ABC:4 is a frequent closed sequence.
Scan-Skip optimization
SDB: I = {A, B, C}, min_sup = 50%. Sp = ABC with support 4.
MP1 = {CA, ∅, C, ∅}
MP2 = {A, ∅, ∅, B}
MP3 = {∅, ∅, ∅, B}
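The maximum periods listed for Sp = AC and Sp = ABC can be reproduced with the following Python sketch. It assumes single-item events and that every input sequence contains the prefix. The i-th maximum period is taken as the piece of the sequence between the end of the first instance of the first i-1 prefix items and the last-in-last appearance of the i-th item; an item shared by the i-th maximum periods of all sequences is a backward-extension item. The function names are illustrative.

```python
def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq (0 for an empty prefix)."""
    pos = -1
    for item in prefix:
        pos = seq.find(item, pos + 1)
    return pos + 1

def last_in_last(seq, prefix):
    """Last-in-last appearance position of each prefix item (right-to-left match)."""
    positions, bound = [], len(seq)
    for item in reversed(prefix):
        bound = seq.rfind(item, 0, bound)  # last occurrence before the later item
        positions.append(bound)
    return positions[::-1]

def maximum_periods(seq, prefix):
    """One maximum period per prefix position."""
    ll = last_in_last(seq, prefix)
    return [seq[first_instance_end(seq, prefix[:i]):ll[i]]
            for i in range(len(prefix))]

def backward_extension_items(seqs, prefix):
    """For each position, the items present in that maximum period of every sequence."""
    per_seq = [maximum_periods(s, prefix) for s in seqs]
    return [set.intersection(*(set(p[i]) for p in per_seq))
            for i in range(len(prefix))]

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print([maximum_periods(s, "AC")[1] for s in SDB])  # ['AB', 'B', 'B', 'BB']
print(backward_extension_items(SDB, "AC"))         # [set(), {'B'}]
print(backward_extension_items(SDB, "ABC"))        # [set(), set(), set()]
```

The sketch always scans every period in full; it does not implement the Scan-Skip optimization.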
SDB
Sequence identifier    Sequence
1                      CAABC
2                      ABCB
3                      CABC
4                      ABBCA

Projected database for Sp = AB: Sp_SDB = {C, CB, C, BCA}
Sequence identifier    Sequence
1                      C
2                      CB
3                      C
4                      BCA
Sp1 = AB : 4
Sp1_SDB = {C, CB, C, BCA}
The forward-extension event for AB is C.
Sp2 = CAB : 2
Sp2_SDB = {C, ∅, C, ∅}
The forward-extension event for CAB is C.
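Forward-extension checking can be sketched as follows: an item is a forward-extension item of a prefix when it occurs in the projected sequence of every database sequence that contains the prefix. The sketch assumes single-item events, and the function names are illustrative.

```python
from collections import Counter

def first_instance_end(seq, prefix):
    """Index just past the first instance of prefix in seq, or None if absent."""
    pos = -1
    for item in prefix:
        pos = seq.find(item, pos + 1)
        if pos < 0:
            return None
    return pos + 1

def forward_extension_items(sdb, prefix):
    """Items whose support in the projected database equals the prefix support."""
    projections = []
    for seq in sdb:
        end = first_instance_end(seq, prefix)
        if end is not None:
            projections.append(seq[end:])
    # Count each item at most once per projected sequence.
    counts = Counter(item for p in projections for item in set(p))
    return sorted(i for i, c in counts.items() if c == len(projections))

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(forward_extension_items(SDB, "AB"))   # ['C']
print(forward_extension_items(SDB, "CAB"))  # ['C']
print(forward_extension_items(SDB, "ABC"))  # []
```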
BackScan pruning
BI-Directional Extension closure checking:
Backward-extension event checking and forward-extension event checking.
An extension event is found: frequent non-closed sequence.
No extension event is found: frequent closed sequence.
Subroutine bide:
For prefix Sp, scan its projected DB.
Compute the number of forward-extension items.
If there is neither a forward-extension item nor a backward-extension item, Sp is closed.
Grow Sp with each locally frequent item to get a new prefix.
Build the pseudo-projected DB for the new prefix.
For each new prefix:
Check if it can be pruned.
If not, compute the number of backward-extension items and call itself recursively.
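The recursion can be sketched end to end in Python. This is a simplified illustration under stated assumptions rather than the authors' implementation: events are single items, the projected database is recomputed from the original sequences instead of being pseudo-projected, and BackScan pruning and Scan-Skip are left out, so every frequent prefix is visited. All function names are illustrative.

```python
from collections import Counter

def first_end(seq, prefix):
    """Index just past the first instance of prefix in seq, or None if absent."""
    pos = -1
    for item in prefix:
        pos = seq.find(item, pos + 1)
        if pos < 0:
            return None
    return pos + 1

def maximum_periods(seq, prefix):
    """Maximum periods of prefix in seq (seq must contain prefix)."""
    ll, bound = [], len(seq)
    for item in reversed(prefix):          # last-in-last appearances
        bound = seq.rfind(item, 0, bound)
        ll.append(bound)
    ll.reverse()
    return [seq[first_end(seq, prefix[:i]):ll[i]] for i in range(len(prefix))]

def has_backward_extension(seqs, prefix):
    """True if some item occurs in the i-th maximum period of every sequence."""
    periods = [maximum_periods(s, prefix) for s in seqs]
    return any(set.intersection(*(set(p[i]) for p in periods))
               for i in range(len(prefix)))

def bide(sdb, min_sup):
    """Frequent closed sequences with supports (no BackScan pruning)."""
    closed = {}

    def grow(prefix, seqs):
        # Scan the projected DB and count locally frequent items.
        local = Counter()
        for s in seqs:
            local.update(set(s[first_end(s, prefix):]))
        forward = any(c == len(seqs) for c in local.values())
        if prefix and not forward and not has_backward_extension(seqs, prefix):
            closed[prefix] = len(seqs)
        # Grow the prefix with each locally frequent item.
        for item, count in sorted(local.items()):
            if count >= min_sup:
                new = prefix + item
                grow(new, [s for s in seqs if first_end(s, new) is not None])

    grow("", list(sdb))
    return closed

SDB = ["CAABC", "ABCB", "CABC", "ABBCA"]
print(bide(SDB, min_sup=2))
# {'AA': 2, 'ABB': 2, 'ABC': 4, 'CA': 3, 'CABC': 2, 'CB': 3}
```

Closure is decided from the prefix and the database alone, so no set of previously mined patterns has to be kept for comparison.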
Performance Evaluation
BIDE can significantly outperform PrefixSpan and SPADE.
BIDE consumes much less memory than CloSpan and can be faster.
The BackScan pruning method is effective in enhancing performance.
Experiments were run on three datasets.
Conclusions
Closed sequence mining:
More compact result set
Significantly better efficiency
BIDE:
Avoids the curse of candidate maintenance
Prunes the search space more deeply
Consumes much less memory than CloSpan in closure checking
Future work:
Push constraints into the mining process
Thank you !