• FP-growth Algorithm
[Figure: lattice of all itemsets over {A, B, C, D, E}, from the single items A ... E through the pairs AB ... DE and the triples ABC ... CDE up to ABCDE]
How can an algorithm improve on the naïve algorithm?
[Figure: the same lattice with pruning: once an itemset is found to be infrequent, all of its supersets are pruned]
L_1 = {frequent items};
for (k = 1; L_k ≠ ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do begin
        increment the count of all candidates in C_{k+1} that are contained in t
    end
    L_{k+1} = candidates in C_{k+1} with min_support
end
return ∪_k L_k;
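The pseudocode above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation: the function name `apriori`, the use of an absolute count `min_count` instead of a percentage, and the brute-force pairwise join are all assumptions made for brevity. The sample data is the four-transaction database TDB from the example slide.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # C(k+1): unions of two frequent k-itemsets that have k+1 items and
        # whose k-subsets are all frequent (the Apriori pruning step).
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        # One database scan counts all surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {s: n for s, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

tdb = ["ACD", "BCE", "ABCE", "BE"]       # Tid 10, 20, 30, 40
result = apriori(tdb, min_count=2)       # minsup = 50% of 4 transactions
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

Each pass of the `while` loop corresponds to one database scan, which is why the number of scans grows with the length of the longest frequent itemset.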
COMP9318: Data Warehousing and Data Mining 13
The Apriori Algorithm—An Example
minsup = 50%

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan → C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)
Itemset
{B, C, E}

3rd scan → L3
Itemset     sup
{B, C, E}   2
Important Details of Apriori
1. How to generate candidates?
    • Step 1: self-joining L_k (what's the join condition? why?)
    • Step 2: pruning
2. How to count supports of candidates?
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
    • abcd from abc and abd
    • acde from acd and ace
• Pruning:
    • acde is removed because ade is not in L3
• C4 = {abcd}
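The join and prune steps of this example can be sketched as below. The function name `generate_candidates` is hypothetical, and the join condition used (two k-itemsets share their first k-1 items in sorted order) is the usual textbook choice, assumed here because the slide leaves it as a question.

```python
from itertools import combinations

def generate_candidates(frequent_k):
    """Self-join L_k on the first k-1 items, then prune by the Apriori property."""
    frequent = {tuple(sorted(s)) for s in frequent_k}
    ordered = sorted(frequent)
    k = len(ordered[0]) if ordered else 0
    joined = []
    for i, p in enumerate(ordered):
        for q in ordered[i + 1:]:
            # Join condition: same first k-1 items, different last item.
            if p[:-1] == q[:-1]:
                joined.append(p + (q[-1],))
    # Prune: every k-subset of a candidate must itself be in L_k.
    pruned = [c for c in joined
              if all(sub in frequent for sub in combinations(c, k))]
    return joined, pruned

joined, c4 = generate_candidates(["abc", "abd", "acd", "ace", "bcd"])
print(["".join(c) for c in joined])   # ['abcd', 'acde']
print(["".join(c) for c in c4])       # ['abcd']
```

Joining on a shared prefix produces each candidate from exactly one pair, so no duplicate candidates need to be filtered out afterwards.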
Generating Candidates in SQL
• A → B is an association rule if
    • Confidence (A → B) ≥ min_conf,
      where support (A → B) = support (AB), and
      confidence (A → B) = support (AB) / support (A)
• 23 => 4, confidence = 100%
• 24 => 3, confidence = 100%
• 34 => 2, confidence = 67% = (N * 50%) / (N * 75%)
• 2 => 34, confidence = 67%
• 3 => 24, confidence = 67%
• 4 => 23, confidence = 67%
• All rules have support = 50%
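The confidence formula above can be applied mechanically to one frequent itemset. The sketch below is an illustration under assumptions: the helper names `support` and `rules_from` are hypothetical, and because the transactions behind the rules over items 2, 3, 4 are not shown here, it uses the four-transaction database TDB and the frequent itemset {B, C, E} instead.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from(itemset, transactions, min_conf):
    """All rules lhs -> (itemset - lhs) whose confidence reaches min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            # confidence(A -> B) = support(AB) / support(A)
            conf = support(itemset, transactions) / support(lhs, transactions)
            if conf >= min_conf:
                rules.append((sorted(lhs), sorted(itemset - lhs), conf))
    return rules

tdb = [set("ACD"), set("BCE"), set("ABCE"), set("BE")]
rules = rules_from("BCE", tdb, min_conf=0.8)
for lhs, rhs, conf in rules:
    print(lhs, "=>", rhs, f"confidence={conf:.0%}")
```

With `min_conf=0.8` only BC => E and CE => B survive; the other four rules derived from {B, C, E} have confidence 2/3.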
• # of scans: 100
• # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1$
• Bottleneck: candidate-generation-and-test
Can we avoid candidate generation altogether?
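The candidate count quoted above is the number of non-empty subsets of a 100-item set. A quick check of the identity, assuming Python 3.8 or later for `math.comb`:

```python
from math import comb

n = 100
# Sum of C(n, k) for k = 1..n: every non-empty subset of n items.
total = sum(comb(n, k) for k in range(1, n + 1))
print(total == 2**n - 1)
print(total)
```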
Alice X X
Bob X X
Charlie X X X
Dora X X
minsup = 1
• Apriori:
    • L1 = {J, L, S, P, R}
Ideas:
• Keep the support set for each frequent itemset
• DFS
[Figure: search tree rooted at ɸ with child J, whose support set is {A, C}]
J → JL? J → ???
Only need to look at the support set for J
No Pain, No Gain
Java Lisp Scheme Python Ruby
Alice X X
Bob X X
Charlie X X X
Dora X X
minsup = 1
Ideas:
• Keep the support set for each frequent itemset
• DFS
[Figure: search tree with support sets: ɸ → J {A, C} → JP {C}, JR {A, C} → JPR {C}, …]
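The two ideas (keep a support set per frequent itemset, explore depth-first) can be sketched as follows. This is a hedged illustration: the function name `mine_with_support_sets` is hypothetical, and since the rows of the programmer/language table are not fully recoverable here, the code runs on the four-transaction database TDB from the Apriori example instead.

```python
def mine_with_support_sets(transactions, min_count):
    """DFS over itemsets, carrying each itemset's support set (set of tids)."""
    # Vertical layout: item -> ids of the transactions that contain it.
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    items = sorted(i for i, s in tidsets.items() if len(s) >= min_count)
    found = {}

    def extend(prefix, prefix_tids, start):
        for idx in range(start, len(items)):
            # Only the prefix's support set is consulted, never the whole DB.
            tids = prefix_tids & tidsets[items[idx]]
            if len(tids) >= min_count:
                itemset = prefix + (items[idx],)
                found[itemset] = tids
                extend(itemset, tids, idx + 1)

    extend((), set(transactions), 0)
    return found

tdb = {10: "ACD", 20: "BCE", 30: "ABCE", 40: "BE"}
found = mine_with_support_sets(tdb, min_count=2)
for itemset, tids in found.items():
    print("".join(itemset), sorted(tids))
```

The support of an extended itemset is the size of an intersection of two support sets, so no candidate is ever counted against the full database.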
Notations and Invariants
• ConditionalDB:
    • DB|p = {t ∈ DB | t contains itemset p}
    • cf. {x | x mod 3 = 0 ∧ x ∈ even([100])}
• An FP-tree is equivalent to a DB|p
    • One can be converted to another
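The definition of DB|p is a plain filter, which a short sketch makes concrete. The function name `conditional_db` is hypothetical, the sample transactions are the four from the Apriori example, and reading [100] as {1, ..., 100} in the set-builder analogy is an assumption.

```python
def conditional_db(db, p):
    """DB|p: the transactions of db that contain every item of itemset p."""
    p = set(p)
    return [t for t in db if p <= set(t)]

db = ["ACD", "BCE", "ABCE", "BE"]
print(conditional_db(db, "B"))     # ['BCE', 'ABCE', 'BE']
print(conditional_db(db, "BC"))    # ['BCE', 'ABCE']

# Set-builder analogy: filter an already filtered set.
# Assumption: [100] denotes {1, ..., 100}.
evens = [x for x in range(1, 101) if x % 2 == 0]
multiples_of_three = [x for x in evens if x % 3 == 0]
print(multiples_of_three[:3])      # [6, 12, 18]
```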
FP-tree Essential Idea /1
• Recursive algorithm again!
• FreqItemsets(DB|p):
    • X = FindLocallyFrequentItems(DB|p)    // easy task, as only items (not itemsets) are needed
    • output { (x p) | x ∈ X }
    • Foreach x in X
        • DB*|px = GetConditionalDB+(DB*|p, x)
        • FreqItemsets(DB*|px)
• All frequent itemsets in DB|p belong to one of the following categories:
    • patterns ~ xip
    • patterns ~ ★px1, ★px2, …, ★pxi, …, ★pxn: obtained via recursion
No Pain, No Gain
DB|J
Alice X X
Charlie X X X
minsup = 1
• FreqItemsets(DB|J):
    • {P, R} ← FindLocallyFrequentItems(DB|J)
    • Output {JP, JR}
    • Get DB*|JP; FreqItemsets(DB*|JP)
    • Get DB*|JR; FreqItemsets(DB*|JR)
    • // Guaranteed no other frequent itemset in DB|J
FP-tree Essential Idea /2
• X = FindLocallyFrequentItems(DB|p)
• [optional] DB*|p = PruneDB(DB|p, X)    // Remove items not in X; potentially reduce # of transactions (∅ or dup). Improves the efficiency.
• output { (x p) | x ∈ X }
• Foreach x in X
    • DB*|px = GetConditionalDB+(DB*|p, x)    // Also gets rid of items already processed before x → avoid duplicates
    • [optional] if DB*|px is degenerated, then powerset(DB*|px)
    • FreqItemsets(DB*|px)
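The recursion above can be sketched on plain lists, without an FP-tree. This is an illustration under assumptions: the function name `freq_itemsets` is hypothetical, every transaction is assumed to list its items in one fixed global order (here f, c, a, b, m, p), and "items already processed" is implemented by keeping only the items that precede x in that order. The optional powerset shortcut is omitted. The data is the five ordered transactions of the FP-tree example with minsup = 3.

```python
from collections import Counter

def freq_itemsets(db, suffix, min_count, out):
    """db plays the role of DB*|p and suffix is the itemset p."""
    # X = FindLocallyFrequentItems(DB|p): only single items are counted.
    counts = Counter(item for t in db for item in t)
    x_items = {i for i, c in counts.items() if c >= min_count}
    # DB*|p = PruneDB(DB|p, X): drop items not in X and empty transactions.
    pruned = [[i for i in t if i in x_items] for t in db]
    pruned = [t for t in pruned if t]
    for x in x_items:
        # output (x p) together with its support count
        out[frozenset(suffix | {x})] = counts[x]
        # DB*|px: transactions containing x, cut off just before x, so that
        # every itemset is reached along exactly one path of the recursion.
        cond = [t[:t.index(x)] for t in pruned if x in t]
        freq_itemsets(cond, suffix | {x}, min_count, out)

db = [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]
out = {}
freq_itemsets(db, frozenset(), 3, out)
for itemset, count in sorted(out.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

No candidate itemsets are generated or tested: each level only counts single items inside a smaller conditional database.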
Lv 1 Recursion
Grayed items are for illustration purpose only.
• minsup = 3

DB          DB*
FCADGIMP    FCAMP
ABCFLMO     FCABM
BFHJOW      FB
…           CBP
…           FCAMP

X = {F, C, A, B, M, P}
Output: F, C, A, B, M, P

DB*|P: FCAMP, CBP, FCAMP
DB*|M (sans P): …
DB*|B (sans MP): …
DB*|A (sans BMP): FCA, FCA, FCA
Lv 2 Recursion on DB*|P
• minsup = 3

DB       DB*
FCAMP    C
CBP      C
FCAMP    C

X = {C}
Output: CP

DB*|C (which is actually FullDB*|CP): C, C, C
Context = Lv 3 recursion on DB*|CP: DB has only empty sets or X = {} → immediately returns
Lv 2 Recursion on DB*|A (sans …)
• minsup = 3

DB     DB*
FCA    FC
FCA    FC
FCA    FC

X = {F, C}
Output: FA, CA

DB*|C (which is actually FullDB*|CA): FC, FC, FC → further recursion (output: FCA)
DB*|F: F, F, F → boundary case
Different Example
• minsup = 2

DB       DB*
FCAMP    FCA
FCBP     FC
FAP      FA

X = {F, C, A}
Output: FP, CP, AP

DB*|A (which is actually FullDB*|AP): FC, F → Output: FAP
DB*|C: F, F
DB*|F
I will give you back the FP-tree
in DB
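• Header table: (item, freq, ptr)

Item   freq
f      4
c      4
a      3
b      3
m      3
p      3

Output: f, c, a, b, m, p

[Figure: the FP-tree after "Insert t1", after "Insert t2", and after "Insert all ti"]

After inserting t1: {} → f :1 → c :1 → a :1 → m :1 → p :1
After inserting t2: {} → f :2 → c :2 → a :2, which branches into m :1 → p :1 and b :1 → m :1
After inserting all ti:
{}
    f :4
        c :3
            a :3
                m :2
                    p :2
                b :1
                    m :1
        b :1
    c :1
        b :1
            p :1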
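The construction shown in the figure can be sketched as follows. The names `Node` and `build_fp_tree` are hypothetical, the transactions are assumed to be already filtered and ordered (f, c, a, b, m, p), and the header table is simplified to a dictionary of node lists rather than the (item, freq, ptr) linked chains of the slide.

```python
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions):
    """Insert each ordered transaction as a path from the root; transactions
    with a common prefix share nodes, whose counts accumulate."""
    root = Node(None, None)
    header = {}                      # item -> list of nodes holding that item
    for t in transactions:
        node = root
        for item in t:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree with one level of indentation per tree level."""
    for child in node.children.values():
        print("    " * depth + f"{child.item} :{child.count}")
        show(child, depth + 1)

root, header = build_fp_tree(["fcamp", "fcabm", "fb", "cbp", "fcamp"])
show(root)
freq = {item: sum(n.count for n in nodes) for item, nodes in header.items()}
print(freq)
```

Summing the counts of all nodes of an item recovers the `freq` column of the header table, which is one way to see that the tree loses no support information.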
TID   frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

p's conditional pattern base
    f c a m : 2
    c b : 1
Local counts: f 2, c 3, a 2, b 1, m 2
Output: pc
p's conditional FP-tree: {} → c :3
Recursion on pc: {} → STOP
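A conditional pattern base can also be computed straight from the ordered transactions, which gives the same prefix paths that the FP-tree stores in compressed form. The sketch below takes that shortcut; the function names `conditional_pattern_base` and `local_counts` are hypothetical.

```python
from collections import Counter

# Ordered frequent items of transactions 100 to 500.
transactions = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(item, transactions):
    """Prefix of every transaction that contains item, with multiplicity."""
    base = Counter()
    for t in transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base[prefix] += 1
    return base

def local_counts(base):
    """Item counts inside a conditional pattern base."""
    counts = Counter()
    for prefix, n in base.items():
        for item in prefix:
            counts[item] += n
    return counts

p_base = conditional_pattern_base("p", transactions)
p_counts = local_counts(p_base)
print(dict(p_base))       # {'fcam': 2, 'cb': 1}
print(dict(p_counts))
# With minsup = 3 only c is locally frequent, hence the single output pc.
print([i for i, n in p_counts.items() if n >= 3])   # ['c']
```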
m's conditional pattern base
    f c a : 2
    f c a b : 1
Local counts: f 3, c 3, a 3, b 1
Output: mf, mc, ma
m's conditional FP-tree: {} → f :3 → c :3 → a :3
gen_powerset
Output: mac, maf, mcf, macf
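When a conditional FP-tree is a single path, as for m, every combination of the nodes on the path is frequent together with the suffix, so the recursion can be replaced by a powerset enumeration. A minimal sketch, assuming the path contains only locally frequent items; the function name `patterns_from_single_path` is hypothetical.

```python
from itertools import combinations

def patterns_from_single_path(path, suffix):
    """path: list of (item, count) from the root down; suffix: the
    conditioning itemset as a string. Each non-empty subset of the path,
    joined with the suffix, is a frequent pattern whose support is the
    smallest count among the chosen nodes."""
    patterns = {}
    for r in range(1, len(path) + 1):
        for subset in combinations(path, r):
            items = "".join(item for item, _ in subset)
            patterns[suffix + items] = min(count for _, count in subset)
    return patterns

m_patterns = patterns_from_single_path([("f", 3), ("c", 3), ("a", 3)], "m")
print(m_patterns)
```

The seven results are the same itemsets as mf, mc, ma, mac, maf, mcf and macf above; only the order of the letters inside each name differs.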
b's conditional pattern base
    f c a : 1
    f : 1
    c : 1
Local counts: f 2, c 2, a 1
b's conditional FP-tree: {}
a's conditional pattern base
    f c : 3
Local counts: f 3, c 3
Output: af, ac
a's conditional FP-tree: {} → f :3 → c :3
gen_powerset
Output: acf
c's conditional pattern base
    f : 3
Local counts: f 3
Output: cf
c's conditional FP-tree: {} → f :3
Recursion on cf: {} → STOP
[Figure: run time (sec.) versus support threshold (%)]
• Divide-and-conquer:
    • decompose both the mining task and DB according to the frequent patterns obtained so far
    • leads to focused search of smaller databases
• Other factors
    • no candidate generation, no candidate test
    • compressed database: FP-tree structure
    • no repeated scan of entire database
    • basic ops: counting local freq items and building sub FP-tree, no pattern search and matching