6 Asso

COMP9318: Data Warehousing
and Data Mining

— L6: Association Rule Mining —

• Problem definition and preliminaries

What Is Association Mining?
• Association rule mining:
    • Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93]
• Motivation: finding regularities in data
    • What products are often purchased together? Beer and diapers?!
    • What are the subsequent purchases after buying a PC?
    • What kinds of DNA are sensitive to this new drug?
    • Can we automatically classify web documents?

Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
• Foundation for many essential data mining tasks
    • Association, correlation, causality
    • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
    • Associative classification, cluster analysis, iceberg cube, fascicles (semantic data compression)
• Broad applications
    • Basket data analysis, cross-marketing, catalog design, sales campaign analysis
    • Web log (click stream) analysis, DNA sequence analysis, etc.; cf. Google's spelling suggestion
Basic Concepts: Frequent Patterns and Association Rules
• Itemset X = {x1, …, xk}
    • Shorthand: x1 x2 … xk
• Find all the rules X → Y with min confidence and support
    • support, s: probability that a transaction contains X ∪ Y
    • confidence, c: conditional probability that a transaction having X also contains Y

Transaction-id   Items bought
10               { A, B, C }
20               { A, C }
30               { A, D }
40               { B, E, F }

[Figure: Venn diagram of "Customer buys beer" and "Customer buys diaper", with the overlap labelled "Customer buys both"]

Let min_support = 50%, min_conf = 70%:
    frequent itemset: sup(AC) = 2
    association rule:
        A → C (50%, 66.7%)
        C → A (50%, 100%)
Mining Association Rules: an Example
Min. support 50%, min. confidence 50%

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Frequent pattern   Support
{A}                75%
{B}                50%
{C}                50%
{A, C}             50%

For rule A → C:
    support = support({A} ∪ {C}) = 50%
    confidence = support({A} ∪ {C}) / support({A}) = 66.7%
Major computational challenge: calculating the support of itemsets, i.e., the frequent itemset mining problem
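The support and confidence figures above can be checked with a few lines of Python. This is a minimal sketch: the helper names `support` and `confidence` are illustrative choices, not part of the lecture material, and exact fractions are used to avoid rounding questions.

```python
from fractions import Fraction

# The four transactions from the example above.
transactions = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return Fraction(hits, len(db))

def confidence(lhs, rhs, db):
    """Estimate of P(rhs | lhs): support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support("AC", transactions))          # support of {A, C}
print(confidence("A", "C", transactions))   # confidence of A -> C
print(confidence("C", "A", transactions))   # confidence of C -> A
```

Both rules share the same support because support depends only on the union of the two sides; only the denominator of the confidence changes.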
• Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases

Association Rule Mining Algorithms
• Naïve algorithm (candidate generation & verification)
    • Enumerate all possible itemsets and check their support against min_sup
    • Generate all association rules and check their confidence against min_conf
• The Apriori property
• Apriori Algorithm
• FP-growth Algorithm

All Candidate Itemsets for {A, B, C, D, E}
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
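The lattice above is exactly the search space of the naïve algorithm. The following sketch enumerates it and checks every itemset against a minimum support count; the function names are hypothetical and the code is meant only to illustrate why the approach does not scale, since the number of candidates doubles with every additional item.

```python
from itertools import combinations

def all_itemsets(items):
    """Yield every non-empty itemset over `items`, smallest first."""
    items = sorted(items)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def naive_frequent_itemsets(db, min_count):
    """Check the support count of every possible itemset against min_count."""
    items = set().union(*db)
    frequent = {}
    for itemset in all_itemsets(items):
        count = sum(1 for t in db if itemset <= t)
        if count >= min_count:
            frequent[itemset] = count
    return frequent

candidates = list(all_itemsets("ABCDE"))
print(len(candidates))  # number of non-empty candidates: 2**5 - 1
```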

Apriori Property
• A frequent (used to be called large) itemset is an itemset whose support is ≥ min_sup.
• Apriori property (downward closure): any subset of a frequent itemset is also a frequent itemset
    • Aka the anti-monotone property of support
    • Equivalently: "any superset of an infrequent itemset is also an infrequent itemset"

[Figure: itemset lattice over {A, B, C, D}]
ABC ABD ACD BCD
AB AC AD BC BD CD
A B C D

Illustrating Apriori Principle
Q: How to design an algorithm to improve the naïve algorithm?

[Figure: itemset lattice over {A, B, C, D, E}; one itemset is marked "Found to be infrequent" and its supersets are marked "Pruned supersets"]
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Apriori: A Candidate Generation-and-test Approach
• Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
• Algorithm [Agrawal & Srikant 1994]
    1. Ck ← perform level-wise candidate generation (from singleton itemsets)
    2. Lk ← verify Ck against the database (keep the candidates with support ≥ min_sup)
    3. Ck+1 ← generated from Lk
    4. Go to 2 if Ck+1 is not empty

The Apriori Algorithm
• Pseudo-code:
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do begin
            increment the count of all candidates in Ck+1 that are contained in t
        end
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
The Apriori Algorithm: An Example
minsup = 50%

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1
Itemset   sup
{A}       2
{B}       3
{C}       3
{D}       1
{E}       3

L1
Itemset   sup
{A}       2
{B}       3
{C}       3
{E}       3

C2 (generated from L1)
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

2nd scan → C2 with counts
Itemset   sup
{A, B}    1
{A, C}    2
{A, E}    1
{B, C}    2
{B, E}    3
{C, E}    2

L2
Itemset   sup
{A, C}    2
{B, C}    2
{B, E}    3
{C, E}    2

C3 (generated from L2)
Itemset
{B, C, E}

3rd scan → L3
Itemset     sup
{B, C, E}   2
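The level-wise loop can be sketched in Python and run on the TDB above. This is an illustrative sketch rather than the reference implementation: the function name `apriori` and the use of `frozenset` keys are assumptions, the join is the simple "union of two frequent k-itemsets" variant, and supports are counted with one pass per candidate instead of one pass per level.

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {itemset: support count} for all frequent itemsets in db."""
    # L1: count single items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)

    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # Count the surviving candidates against the database.
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
result = apriori(tdb, min_count=2)  # minsup = 50% of 4 transactions
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), count)
```

On this database the pruning step removes {A, B, C} and {A, C, E} before any counting, so {B, C, E} is the only size-3 candidate that is ever checked.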
Important Details of Apriori
1. How to generate candidates?
    • Step 1: self-joining Lk (what's the join condition? why?)
    • Step 2: pruning
2. How to count supports of candidates?
Example of Candidate-generation
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
    • abcd from abc and abd
    • acde from acd and ace
• Pruning:
    • acde is removed because ade is not in L3
• C4 = {abcd}
Generating Candidates in SQL
• Suppose the items in Lk-1 are listed in an order
• Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1 and … and p.itemk-2 = q.itemk-2 and p.itemk-1 < q.itemk-1
• Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
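The same join-then-prune procedure can be written in Python. In this sketch each itemset is a sorted tuple of items, which plays the role of the fixed item order assumed by the SQL; the function name `apriori_gen` is an illustrative choice.

```python
from itertools import combinations

def apriori_gen(prev_level):
    """Generate C_k from L_(k-1); itemsets are sorted tuples of items."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Join condition: same first k-2 items, last item of p < last item of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: drop c if any (k-1)-subset is not in L_(k-1).
                if all(s in prev_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
    return candidates

l3 = [tuple(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
print(apriori_gen(l3))
```

Requiring a shared prefix means each candidate is produced by exactly one pair, which is why the join condition compares only the last items.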

Derive rules from frequent itemsets
• Frequent itemsets != association rules
• One more step is required to find association rules
• For each frequent itemset X,
  for each proper nonempty subset A of X:
    • Let B = X - A
    • A → B is an association rule if confidence(A → B) ≥ min_conf,
      where support(A → B) = support(AB), and
      confidence(A → B) = support(AB) / support(A)

Example: deriving rules from frequent itemsets
• Suppose 234 is frequent, with supp = 50%
    • Proper nonempty subsets: 23, 24, 34, 2, 3, 4, with supp = 50%, 50%, 75%, 75%, 75%, 75% respectively
• These generate the following association rules:
    • 23 => 4, confidence = 100%
    • 34 => 2, confidence = 67% = (N * 50%) / (N * 75%)
    • All rules have support = 50%
Q: Is there any optimization (e.g., pruning) for this step?
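The rule-derivation step can be sketched as a loop over the proper nonempty subsets of a frequent itemset. The function name `derive_rules` and the dictionary layout are assumptions made for illustration; the support values are the ones given in the example above.

```python
from itertools import combinations

def derive_rules(itemset, support, min_conf):
    """Return [(A, B, confidence)] for rules A -> B with A ∪ B = itemset."""
    items = tuple(sorted(itemset))
    rules = []
    for size in range(1, len(items)):            # proper, non-empty subsets
        for lhs in combinations(items, size):
            rhs = tuple(i for i in items if i not in lhs)
            conf = support[items] / support[lhs]
            if conf >= min_conf:
                rules.append((lhs, rhs, conf))
    return rules

# Supports from the example above (as fractions of the database).
support = {
    (2, 3, 4): 0.50,
    (2, 3): 0.50, (2, 4): 0.50, (3, 4): 0.75,
    (2,): 0.75, (3,): 0.75, (4,): 0.75,
}
for lhs, rhs, conf in derive_rules((2, 3, 4), support, min_conf=0.0):
    print(lhs, "=>", rhs, f"confidence={conf:.0%}")
```

No database scan is needed here: every support the loop looks up was already computed during frequent itemset mining.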

Deriving rules
• To recap, in order to obtain A → B, we need to have Support(AB) and Support(A)
• This step is not as time-consuming as frequent itemset generation
    • Why?
• It's also easy to speed up using techniques such as parallel processing.
    • How?
• Do we really need candidate generation for deriving association rules?
    • Frequent-Pattern Growth (FP-Tree)

Bottleneck of Frequent-pattern Mining
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
    • To find frequent itemset i1 i2 … i100
        • # of scans: 100
        • # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1$
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation altogether?
    • FP-growth
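The candidate count for a 100-item frequent pattern can be verified directly, since Python integers have arbitrary precision. A small check, with variable names chosen only for illustration:

```python
from math import comb

n = 100
# Every non-empty subset of the 100-item pattern is itself a candidate.
num_candidates = sum(comb(n, k) for k in range(1, n + 1))

print(num_candidates == 2**n - 1)   # the binomial sum equals 2^100 - 1
print(f"{num_candidates:.3e}")      # order of magnitude of the candidate count
```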

No Pain, No Gain
Java Lisp Scheme Python Ruby
Alice X X
Bob X X
Charlie X X X
Dora X X
minsup = 1
• Apriori:
    • L1 = {J, L, S, P, R}
    • C2 = all the $\binom{5}{2}$ combinations
    • Most of C2 do not contribute to the result
    • There is no way to tell because
No Pain, No Gain
Alice X X
Bob X X
Charlie X X X
Dora X X
minsup = 1
Ideas:
• Keep the support set for each frequent itemset
• DFS
J → JL? J → ???
Only need to look at the support set for J

[Figure: search tree ∅ → J, with J annotated by its support set {A, C}]
No Pain, No Gain
Alice X X
Bob X X
Charlie X X X
Dora X X
minsup = 1
Ideas:
• Keep the support set for each frequent itemset
• DFS

[Figure: search tree annotated with support sets]
∅
    J {A, C}
        JP {C}
            JPR {C}
        JR {A, C}
    …
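The two ideas (keep the support set, search depth-first) can be sketched as follows. Because the language table above does not survive extraction intact, this sketch reuses the four-transaction TDB from the Apriori example instead; the function name `dfs_support_sets` and the alphabetical item order are assumptions.

```python
def dfs_support_sets(db, min_count):
    """Depth-first search that carries each itemset's support set (tids)."""
    # Vertical layout: item -> set of transaction ids containing it.
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, prefix_tids, remaining):
        for i, item in enumerate(remaining):
            # Only the prefix's support set is consulted, never the whole DB.
            tids = prefix_tids & tidsets[item]
            if len(tids) >= min_count:
                itemset = prefix + (item,)
                frequent[itemset] = tids
                extend(itemset, tids, remaining[i + 1:])

    extend((), set(db), sorted(tidsets))
    return frequent

tdb = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}
for itemset, tids in dfs_support_sets(tdb, min_count=2).items():
    print("".join(itemset), sorted(tids))
```

An extension that fails the support test is never explored further, so infrequent branches are cut without generating a separate candidate set.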
Notations and Invariants
• ConditionalDB:
    • DB|p = {t ∈ DB | t contains itemset p}
    • DB = DB|∅ (i.e., conditioned on nothing)
    • Shorthand: DB|px = DB|(p∪x)
• SupportSet(p∪x, DB) = SupportSet(x, DB|p)
    • Analogy: {x | x mod 6 = 0 ∧ x ∈ [100]} = {x | x mod 3 = 0 ∧ x ∈ even([100])}
• An FP-tree is equivalent to a DB|p
    • One can be converted to another
    • Next, we illustrate the algorithm using conditionalDB
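The invariant can be checked on a small database. This sketch borrows the TDB from the Apriori example; `conditional_db` and `support_set` are illustrative names, and the arithmetic analogy reads [100] as {1, …, 100}, which is an assumption about the notation.

```python
def conditional_db(db, p):
    """DB|p: the transactions of db that contain every item of p."""
    p = set(p)
    return {tid: items for tid, items in db.items() if p <= items}

def support_set(x, db):
    """Ids of the transactions in db that contain every item of x."""
    return set(conditional_db(db, x))

tdb = {10: {"A", "C", "D"}, 20: {"B", "C", "E"}, 30: {"A", "B", "C", "E"}, 40: {"B", "E"}}

p, x = {"B"}, {"C", "E"}
lhs = support_set(p | x, tdb)                   # SupportSet(p ∪ x, DB)
rhs = support_set(x, conditional_db(tdb, p))    # SupportSet(x, DB|p)
print(lhs == rhs, sorted(lhs))

# The arithmetic analogy, reading [100] as {1, ..., 100} (an assumption).
universe = range(1, 101)
evens = [v for v in universe if v % 2 == 0]
print({v for v in universe if v % 6 == 0} == {v for v in evens if v % 3 == 0})
```

Filtering by p first and then by x selects the same transactions as filtering by both at once, which is what lets the recursion work on ever smaller databases.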
FP-tree Essential Idea /1
• Recursive algorithm again!
• FreqItemsets(DB|p):
    • X = FindLocallyFrequentItems(DB|p)
      (easy task, as only items, not itemsets, are needed)
    • output { (x p) | x ∈ X }
    • Foreach x in X
        • DB*|px = GetConditionalDB+(DB*|p, x)
        • FreqItemsets(DB*|px)
• All frequent itemsets in DB|p belong to one of the following categories:
    • patterns ~ xi p (produced by the output step)
    • patterns ~ ★px1, ★px2, …, ★pxi, …, ★pxn (obtained via recursion)
No Pain, No Gain
DB|J
Alice X X
Charlie X X X
minsup = 1
• FreqItemsets(DB|J):
    • {P, R} ← FindLocallyFrequentItems(DB|J)
    • Output {JP, JR}
    • Get DB*|JP; FreqItemsets(DB*|JP)
    • Get DB*|JR; FreqItemsets(DB*|JR)
    • // Guaranteed no other frequent itemset in DB|J
FP-tree Essential Idea /2
• FreqItemsets(DB|p):
    • If boundary condition, then …
    • X = FindLocallyFrequentItems(DB|p)
    • [optional] DB*|p = PruneDB(DB|p, X)
      (Remove items not in X; potentially reduce # of transactions (∅ or dup). Improves the efficiency.)
    • output { (x p) | x ∈ X }
      (Also output each item in X, appended with the conditional pattern.)
    • Foreach x in X
        • DB*|px = GetConditionalDB+(DB*|p, x)
          (Also gets rid of items already processed before x ⇒ avoids duplicates.)
        • [optional] if DB*|px is degenerated, then powerset(DB*|px)
        • FreqItemsets(DB*|px)
Grayed items are for illustration purposes only.
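The recursion can be sketched in Python on plain lists of transactions, without building a tree. This is a hedged sketch under several assumptions: the function name `freq_itemsets` is hypothetical, items are ordered by descending local frequency with alphabetical tie-breaking (so ties may be ordered differently from the slides), the optional powerset shortcut is omitted, and duplicates are avoided by keeping only the items ranked before x in each conditional database. It is run on the five-transaction database used in the recursion walkthrough.

```python
from collections import Counter

def freq_itemsets(db, suffix, min_count, out):
    """Mine db (a list of item lists) conditioned on the pattern `suffix`."""
    # X = FindLocallyFrequentItems(DB|p)
    counts = Counter(item for t in db for item in t)
    x_items = [i for i, c in counts.items() if c >= min_count]
    if not x_items:                      # boundary condition
        return
    # Fix an order: most frequent first, ties broken alphabetically.
    x_items.sort(key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(x_items)}
    # DB* = PruneDB(DB|p, X): drop infrequent items and empty transactions.
    pruned = [sorted((i for i in t if i in rank), key=rank.get) for t in db]
    pruned = [t for t in pruned if t]
    for x in x_items:
        pattern = (x,) + suffix
        out[frozenset(pattern)] = counts[x]      # output (x p)
        # Conditional DB: transactions with x, keeping only items ranked before x.
        cond = [t[:t.index(x)] for t in pruned if x in t]
        freq_itemsets(cond, pattern, min_count, out)

db = [list("FCADGIMP"), list("ABCFLMO"), list("BFHJOW"), list("BCKSP"), list("AFCELPMN")]
found = {}
freq_itemsets(db, (), 3, found)
print(sorted("".join(sorted(p)) for p in found))
```

Each call only counts single items in its own conditional database, so no candidate itemsets are ever generated or tested.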
Lv 1 Recursion
• minsup = 3

DB          DB*
FCADGIMP    FCAMP
ABCFLMO     FCABM
BFHJOW      FB
BCKSP       CBP
AFCELPMN    FCAMP

X = {F, C, A, B, M, P}
Output: F, C, A, B, M, P

Conditional databases for the next level:
    DB*|P: FCAMP, CBP, FCAMP
    DB*|M (sans P)
    DB*|B (sans MP)
    DB*|A (sans BMP): FCA, FCA, FCA
    DB*|C (sans ABMP)
    DB*|F (sans CABMP)
Lv 2 Recursion on DB*|P
• minsup = 3

DB       DB*
FCAMP    C
CBP      C
FCAMP    C

X = {C}
Output: CP

DB*|C: C, C, C (which is actually FullDB*|CP)
Context = Lv 3 recursion on DB*|CP: DB has only empty sets, or X = {} ⇒ immediately returns
Lv 2 Recursion on DB*|A (sans …)
• minsup = 3

DB     DB*
FCA    FC
FCA    FC
FCA    FC

X = {F, C}
Output: FA, CA

DB*|C: FC, FC, FC (which is actually FullDB*|CA); further recursion (output: FCA)
DB*|F: F, F, F; boundary case
Different Example: Lv 2 Recursion on DB*|P
• minsup = 2

DB       DB*
FCAMP    FCA
FCBP     FC
FAP      FA

X = {F, C, A}
Output: FP, CP, AP

DB*|A: FC, F (which is actually FullDB*|AP); recursing on it gives X = {F}, DB* = F, F, and Output: FAP
DB*|C: F, F
DB*|F
I will give you back the FP-tree
• An FP-tree of DB consists of:
    • A fixed order among items in DB
    • A prefix, threaded tree of sorted transactions in DB
    • Header table: (item, freq, ptr)
• When used in the algorithm, the input DB is always pruned (cf. PruneDB())
    • Remove infrequent items
    • Remove infrequent items in every transaction

FP-tree Example
minsup = 3

TID   Items bought                (ordered) frequent items
100   {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200   {a, b, c, f, l, m, o}       {f, c, a, b, m}
300   {b, f, h, j, o, w}          {f, b}
400   {b, c, k, s, p}             {c, b, p}
500   {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

Header table
Item   freq   head
f      4
c      4
a      3
b      3
m      3
p      3

Output: f, c, a, b, m, p

Insert t1:
{}
    f:1
        c:1
            a:1
                m:1
                    p:1

Insert t2:
{}
    f:2
        c:2
            a:2
                m:1
                    p:1
                b:1
                    m:1

Insert all ti:
{}
    f:4
        c:3
            a:3
                m:2
                    p:2
                b:1
                    m:1
        b:1
    c:1
        b:1
            p:1
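The insertion procedure can be sketched as a small prefix tree with a header table and node links. The class and function names (`FPNode`, `build_fp_tree`, `show`) are illustrative assumptions; the input is taken to be the already pruned and ordered transactions from the table above.

```python
class FPNode:
    """One node of the prefix tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}
        self.next = None      # thread to the next node holding the same item

def build_fp_tree(sorted_transactions):
    """Insert already-pruned, already-ordered transactions into an FP-tree."""
    root = FPNode(None)
    header = {}               # item -> [frequency, first node in the thread]
    tails = {}                # item -> last node in the thread (for appending)
    for t in sorted_transactions:
        node = root
        for item in t:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
                if item in tails:
                    tails[item].next = child
                else:
                    header[item] = [0, child]
                tails[item] = child
            child.count += 1
            header[item][0] += 1
            node = child
    return root, header

def show(node, depth=0):
    """Print the tree as an indented outline of item:count pairs."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

tdb = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
root, header = build_fp_tree(tdb)
show(root)
print({item: freq for item, (freq, _) in header.items()})
```

Transactions that share a prefix share a path, so the two copies of f c a m p cost one path with counts of 2 rather than two separate paths.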
TID   frequent items
100   {f, c, a, m, p}
200   {f, c, a, b, m}
300   {f, b}
400   {c, b, p}
500   {f, c, a, m, p}

[Figure: the full FP-tree with its header table]

p's conditional pattern base:
    f c a m : 2
    c b : 1
Item counts in the base (f, c, a, b, m): 2, 3, 2, 1, 2
Output: pc
Cleaned p's conditional pattern base:
    c : 2
    c : 1
Conditional FP-tree: {} → c:3
STOP
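A conditional pattern base can also be read directly off the ordered transactions, which gives the same prefixes as following the item's node links in the tree. The sketch below takes that shortcut; the function names are hypothetical and transactions are represented as strings for brevity.

```python
from collections import Counter

def conditional_pattern_base(sorted_transactions, item):
    """Prefixes that precede `item` in the ordered transactions, with counts."""
    base = Counter()
    for t in sorted_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:
                base[prefix] += 1
    return base

def clean(base, min_count):
    """Drop items whose total count within the base is below min_count."""
    totals = Counter()
    for prefix, n in base.items():
        for i in prefix:
            totals[i] += n
    cleaned = Counter()
    for prefix, n in base.items():
        kept = "".join(i for i in prefix if totals[i] >= min_count)
        if kept:
            cleaned[kept] += n
    return cleaned

tdb = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]
base_p = conditional_pattern_base(tdb, "p")
print(dict(base_p))
print(dict(clean(base_p, min_count=3)))
```

Note that `clean` merges identical cleaned prefixes, so the two entries c : 2 and c : 1 are reported as a single entry with count 3, matching the conditional FP-tree {} → c:3.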
m's conditional pattern base:
    f c a : 2
    f c a b : 1
Item counts in the base (f, c, a, b): 3, 3, 3, 1
Output: mf, mc, ma
Conditional FP-tree (a single path): {} → f:3 → c:3 → a:3
gen_powerset
Output: mac, maf, mcf, macf
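When the conditional FP-tree is a single path, every non-empty subset of the path combined with the suffix is frequent, so the patterns can be listed without further recursion. A minimal sketch follows, assuming all nodes on the path carry the same count (as with f:3, c:3, a:3 here); the function returns the single-item extensions as well as the longer ones.

```python
from itertools import combinations

def gen_powerset(path, suffix):
    """All patterns formed by a non-empty subset of `path` plus `suffix`."""
    patterns = []
    for size in range(1, len(path) + 1):
        for subset in combinations(path, size):
            patterns.append("".join(subset) + suffix)
    return patterns

print(gen_powerset("fca", "m"))
```

For a path of length 3 this yields 2^3 - 1 = 7 patterns: the three already output as mf, mc, ma, plus the four produced by gen_powerset on the slide.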
b's conditional pattern base:
    f c a : 1
    f : 1
    c : 1
Item counts in the base (f, c, a): 2, 2, 1
STOP (no item reaches minsup = 3)
a's conditional pattern base:
    f c : 3
Item counts in the base (f, c): 3, 3
Output: af, ac
Conditional FP-tree (a single path): {} → f:3 → c:3
gen_powerset
Output: acf
c's conditional pattern base:
    f : 3
Item count in the base (f): 3
Output: cf
Conditional FP-tree: {} → f:3
STOP

Final step, f: {} → f:4 (header table: f 4)
STOP
FP-Growth vs. Apriori: Scalability With the Support Threshold
[Figure: run time (sec.) versus support threshold (%), from 0 to 3%, on data set T25I20D10K; series: D1 FP-growth runtime and D1 Apriori runtime]
Why Is FP-Growth the Winner?
• Divide-and-conquer:
    • decompose both the mining task and DB according to the frequent patterns obtained so far
    • leads to focused search of smaller databases
• Other factors
    • no candidate generation, no candidate test
    • compressed database: FP-tree structure
    • no repeated scan of entire database
    • basic ops: counting local frequent items and building sub FP-trees; no pattern search and matching
