
Mining Frequent Patterns,

Associations and Correlations

Md. Yasser Arafat


MS Student, Dept of CSE, DU

Topics Covered

Frequent Pattern
Association
Correlation
Support & Confidence
Closed Patterns and Max-Patterns
Apriori Algorithm
FP-Growth
Comparison between Apriori and FP-Growth
Correlation Analysis


Frequent Patterns
Frequent patterns are patterns, such as itemsets, subsequences, or substructures, that appear in a data set frequently.
They help in data classification, clustering, and in mining associations, correlations, and other interesting relationships among data.
Frequent pattern mining has become an important data mining task and a focused theme in data mining research.

Association
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository.

Correlation
A mutual relationship or connection between two or more things.
The main goal is to find correlated, interesting itemsets.

Support & Confidence

Find all the rules A ⇒ B with minimum support and confidence:
support, s: the probability that a transaction contains A ∪ B
confidence, c: the conditional probability that a transaction that contains A also contains B

support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
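A minimal sketch of these two measures in Python; the transaction data is hypothetical and mirrors the example on the next slide:

```python
# A minimal sketch of support and confidence, assuming transactions are sets of items.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A u B) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(support({"A", "C"}, transactions))       # 0.5  -> 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> 66.6%
```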

Example
Transaction ID   Items
T1               A, B, C
T2               A, C
T3               A, D
T4               B, D

Min_sup = 50%

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{D}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

Closed Patterns and Max-Patterns

An itemset X is a closed frequent itemset in a data set D if X is frequent and there exists no proper super-itemset Y such that Y has the same support count as X in D.
An itemset X is a maximal frequent itemset in a data set D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.

Example
Exercise: Suppose there are only two transactions, <a1, …, a100> and <a1, …, a50>.
Let min_sup = 1.
What is the set of closed itemsets?
{a1, …, a100}: 1
{a1, …, a50}: 2
What is the set of max-patterns?
{a1, …, a100}: 1
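As a small illustration, here is a brute-force sketch that enumerates all itemsets to find the closed and maximal frequent ones; the tiny data set is hypothetical (enumeration is exponential, so toy-sized only):

```python
from itertools import combinations

# Brute-force sketch: closed = no proper superset with equal support;
# maximal = no frequent proper superset.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ab"), frozenset("c")]
min_sup = 2
items = sorted(set().union(*transactions))

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = {s: count(s)
            for k in range(1, len(items) + 1)
            for s in map(frozenset, combinations(items, k))
            if count(s) >= min_sup}

closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]
print([set(x) for x in closed])   # [{'c'}, {'a', 'b'}]
print([set(x) for x in maximal])  # [{'c'}, {'a', 'b'}]
```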

Apriori Algorithm
Finding frequent itemsets by candidate generation.
Apriori property: all nonempty subsets of a frequent itemset must also be frequent.

Apriori Algorithm
Apriori pruning principle:
If any pattern is infrequent, its supersets should not be generated or tested.

Process:
Scan the database once to get the frequent 1-itemsets.
For each level k: generate length-(k+1) candidates from the length-k frequent patterns, then scan the database and remove the infrequent candidates.
Terminate when no candidate set can be generated.

Pseudo-code (a runnable Python sketch follows):
1: Find all large 1-itemsets L1
2: For (k = 2; Lk-1 is non-empty; k++)
3:   Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero
5:   For all records r in the DB:
6:     Cr = subset(Ck, r); for each c in Cr, c.count++
7:   Set Lk := all c in Ck whose count >= minsup
8: Return all of the Lk sets
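A runnable Python sketch of this pseudo-code. Here apriori-gen is approximated by a set-union join plus subset pruning, which over-generates slightly compared to the ordered prefix join but yields the same candidates after pruning; the helper structure is illustrative, not from the slides:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori; returns {frozenset: support count}. A sketch, not optimized."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # line 1: find all large 1-itemsets
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    result, k = dict(L), 2
    while L:                                    # line 2: loop while L(k-1) is non-empty
        prev = set(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}  # join
        candidates = {c for c in candidates     # prune: all (k-1)-subsets frequent
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_sup}   # line 7
        result.update(L)
        k += 1
    return result

# The 9-transaction database from the next slide, min support count = 2:
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
print(apriori(db, 2)[frozenset({"I1", "I2", "I5"})])  # 2
```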

Example
Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%), and let the minimum confidence required be 70%.

TID    List of Item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Example
Generating the 1-itemset frequent pattern: scan D for the count of each candidate in C1, then compare each candidate's support count with the minimum support count to obtain L1.

Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Every candidate meets the minimum support count of 2, so L1 = C1.

Example
Generating the 2-itemset frequent pattern: generate the C2 candidates from L1, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count to obtain L2.

C2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

L2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Example
Generating the 3-itemset frequent pattern: in order to find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
The Join step is now complete, and the Prune step is used to reduce the size of C3, which helps to avoid heavy computation due to a large Ck: any candidate with an infrequent 2-subset is removed, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}. Scanning D for the count of each remaining candidate and comparing with the minimum support count gives L3.

C3 = L3:
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
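A sketch of just this join-and-prune step in isolation, applied to the L2 itemsets from the previous slide (again using a set-union join, which over-generates slightly; pruning removes the extras):

```python
from itertools import combinations

# Join step: union pairs of frequent 2-itemsets into 3-itemset candidates.
L2 = {frozenset(s) for s in [{"I1","I2"}, {"I1","I3"}, {"I1","I5"},
                             {"I2","I3"}, {"I2","I4"}, {"I2","I5"}]}
joined = {a | b for a in L2 for b in L2 if len(a | b) == 3}

# Prune step: drop any candidate with an infrequent 2-subset.
C3 = [c for c in joined if all(frozenset(s) in L2 for s in combinations(c, 2))]
print([set(c) for c in C3])  # [{'I1','I2','I3'}, {'I1','I2','I5'}] (in some order)
```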

Association rule generation

Procedure:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) >= the minimum confidence threshold.
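A sketch of this procedure in Python; the support counts are hard-coded from the worked example that follows (only the subsets needed for l = {I1, I2, I5} are included):

```python
from itertools import combinations

# Support counts from the example database (min confidence = 70%).
support_count = {frozenset(s): c for s, c in [
    ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2),
    ({"I1","I2"}, 4), ({"I1","I5"}, 2), ({"I2","I5"}, 2),
    ({"I1","I2","I5"}, 2)]}

def rules(l, min_conf):
    """Yield every rule s => (l - s) whose confidence meets the threshold."""
    l = frozenset(l)
    for k in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), k)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                yield set(s), set(l - s), conf

for a, b, conf in rules({"I1", "I2", "I5"}, 0.7):
    print(a, "=>", b, f"({conf:.0%})")  # the three strong rules from the next slide
```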

Example
From the previous example, the frequent itemsets are {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let us take l = {I1, I2, I5}. All of its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

Now we can calculate the confidence of the different association rules:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%

As the minimum confidence threshold is 70%, the 2nd, 3rd, and 6th rules above are the strong association rules.

Apriori Algorithm: Efficiency Improvement

Many variations of the algorithm have been proposed to improve the efficiency of the original algorithm.
Methods to improve the efficiency of the Apriori algorithm:
Hash-based itemset counting
Transaction reduction
Partitioning
Sampling

Bottlenecks of Apriori
Generate a huge number of candidate sets
Repeatedly scan the whole database
Check a large set of candidates by pattern
matching


FP-Growth
FP-growth, or Frequent Pattern Growth, adopts a divide-and-conquer strategy:
Compresses the database into an FP-tree.
Divides the compressed database into a set of conditional databases, each associated with one pattern fragment.
Examines the data set associated with each fragment separately.
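A minimal sketch of the FP-tree construction that the following slides walk through; the Node and header-table structures are simplified illustrations, not the full algorithm:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    """Order frequent items by descending support, then insert each transaction."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    root, header = Node(None, None), {i: [] for i in order}  # item -> node links
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):  # frequent items, in sorted order
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header = build_fp_tree(db, 2)
print({i: sum(n.count for n in header[i]) for i in header})
# {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}
```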

Example
Consider the same database D. Scan D once to get each item's support count, keep the frequent items, and order the items within each transaction by descending support count: I2:7, I1:6, I3:6, I4:2, I5:2. The header table holds these counts and a node link for each item; the tree starts as just the null{} root.

TID    Items (ordered by support)
T100   I2, I1, I5
T200   I2, I4
T300   I2, I3
T400   I2, I1, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I2, I1, I3, I5
T900   I2, I1, I3

Tree: null{}

Example
Insert T100 (I2, I1, I5): a single path is created from the root.

null{}
  I2:1
    I1:1
      I5:1

Example
Insert T200 (I2, I4): the transaction shares the prefix I2, so that node's count is incremented and a new I4 branch is added.

null{}
  I2:2
    I1:1
      I5:1
    I4:1

Example
Insert T300 (I2, I3):

null{}
  I2:3
    I1:1
      I5:1
    I4:1
    I3:1

Example
Insert T400 (I2, I1, I4):

null{}
  I2:4
    I1:2
      I5:1
      I4:1
    I4:1
    I3:1

Example
Insert T500 (I1, I3): there is no shared prefix with the existing branch, so a new branch starts at the root.

null{}
  I2:4
    I1:2
      I5:1
      I4:1
    I4:1
    I3:1
  I1:1
    I3:1

Example
Insert T600 (I2, I3):

null{}
  I2:5
    I1:2
      I5:1
      I4:1
    I4:1
    I3:2
  I1:1
    I3:1

Example
Insert T700 (I1, I3):

null{}
  I2:5
    I1:2
      I5:1
      I4:1
    I4:1
    I3:2
  I1:2
    I3:2

Example
Insert T800 (I2, I1, I3, I5):

null{}
  I2:6
    I1:3
      I5:1
      I4:1
      I3:1
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

Example
Insert T900 (I2, I1, I3). The FP-tree is now complete:

null{}
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

Mining now proceeds bottom-up from the header table, starting with the least frequent item.

Example
Branches containing I5:
I2, I1, I5: 1
I2, I1, I3, I5: 1

Conditional pattern base for I5: {(I2, I1: 1), (I2, I1, I3: 1)}
Conditional FP-tree: null{} → I2:2 → I1:2 (I3 is dropped: its count of 1 is below min_sup)

Item   Conditional pattern base         Conditional FP-tree   Frequent patterns generated
I5     {(I2, I1: 1), (I2, I1, I3: 1)}   <I2: 2, I1: 2>        {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}

Example
Branches containing I4:
I2, I4: 1
I2, I1, I4: 1

Conditional pattern base for I4: {(I2, I1: 1), (I2: 1)}
Conditional FP-tree: null{} → I2:2 (I1 is dropped: its count of 1 is below min_sup)

Item   Conditional pattern base   Conditional FP-tree   Frequent pattern generated
I4     {(I2, I1: 1), (I2: 1)}     <I2: 2>               {I2, I4: 2}

Example
Branches containing I3:
I2, I1, I3: 2
I2, I3: 2
I1, I3: 2

Conditional pattern base for I3: {(I2, I1: 2), (I2: 2), (I1: 2)}
Conditional FP-tree: null{} with branches <I2: 4, I1: 2> and <I1: 2>

Item   Conditional pattern base          Conditional FP-tree       Frequent patterns generated
I3     {(I2, I1: 2), (I2: 2), (I1: 2)}   <I2: 4, I1: 2>, <I1: 2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}

Example
Branches containing I1:
I2, I1: 4

Conditional pattern base for I1: {(I2: 4)}
Conditional FP-tree: null{} → I2:4

Item   Conditional pattern base   Conditional FP-tree   Frequent pattern generated
I1     {(I2: 4)}                  <I2: 4>               {I2, I1: 4}

Example
Summary of the mining results:

Item   Conditional pattern base          Conditional FP-tree       Frequent patterns generated
I5     {(I2, I1: 1), (I2, I1, I3: 1)}    <I2: 2, I1: 2>            {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {(I2, I1: 1), (I2: 1)}            <I2: 2>                   {I2, I4: 2}
I3     {(I2, I1: 2), (I2: 2), (I1: 2)}   <I2: 4, I1: 2>, <I1: 2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {(I2: 4)}                         <I2: 4>                   {I2, I1: 4}
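A sketch of the mining pass that produces this table; it reuses the Node and build_fp_tree helpers (and the db list) from the earlier sketch on the FP-Growth slide, and recursively mines each item's conditional pattern base:

```python
def fp_growth(transactions, min_sup, suffix=frozenset()):
    """Recursive FP-growth sketch; requires build_fp_tree from the earlier sketch."""
    root, header = build_fp_tree(transactions, min_sup)
    patterns = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        patterns[itemset] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path of every node for this item,
        # repeated once per count.
        cond_db = []
        for node in nodes:
            path, parent = set(), node.parent
            while parent.item is not None:
                path.add(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        if cond_db:
            patterns.update(fp_growth(cond_db, min_sup, itemset))
    return patterns

patterns = fp_growth(db, 2)  # db as defined in the earlier sketch
print(patterns[frozenset({"I2", "I1", "I5"})])  # 2
print(patterns[frozenset({"I1", "I3"})])        # 4
```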

Pros of FP-growth
No candidate generation, no candidate tests
Uses a compact data structure
Eliminates repeated database scans
Basic operations are counting and FP-tree building

Comparison between Apriori and FP-Growth

Technique
Apriori: uses the Apriori property and the join and prune steps.
FP-growth: constructs conditional pattern bases and conditional FP-trees from the database, keeping only what satisfies minimum support.

Memory utilization
Apriori: requires a large amount of memory, due to the large number of candidates generated.
FP-growth: requires less memory, due to its compact structure and the absence of candidate generation.

Number of scans
Apriori: multiple scans, one per round of candidate generation.
FP-growth: scans the database only twice.

Execution time
Apriori: larger, since time is spent on candidate generation in every pass.
FP-growth: smaller than the Apriori algorithm.

Correlation Analysis
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving the understanding of the meaning of some association rules.
Correlation measures:
Lift
χ² measure

Correlation measure: Lift

Two itemsets A and B are independent iff P(A ∪ B) = P(A) P(B); otherwise A and B are dependent and correlated.
The measure of correlation, or lift, between A and B is given by the formula:

lift(A, B) = P(A ∪ B) / (P(A) P(B))
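A small sketch computing lift directly from transactions; the four-transaction data set is the hypothetical one used for support and confidence earlier:

```python
def lift(A, B, transactions):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), with probabilities as relative supports."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if A <= t) / n
    p_b = sum(1 for t in transactions if B <= t) / n
    p_ab = sum(1 for t in transactions if A | B <= t) / n
    return p_ab / (p_a * p_b)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(lift({"A"}, {"C"}, transactions))  # 1.33... > 1: positively correlated
print(lift({"A"}, {"B"}, transactions))  # 0.66... < 1: negatively correlated
```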

Correlation measure: Lift

lift(A, B) > 1 means that A and B are positively correlated.
lift(A, B) < 1 means that the occurrence of A is negatively correlated with the occurrence of B.
lift(A, B) = 1 means that A and B are independent and there is no correlation between them.

Correlation measure: χ² measure

The χ² measure compares the observed and expected counts over the contingency table of A and B:

χ² = Σ (observed − expected)² / expected
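A sketch of the χ² computation for two itemsets over their 2×2 contingency table; this is a generic implementation of the formula above, not code from the slides (it assumes every expected count is nonzero):

```python
def chi_square(A, B, transactions):
    """Sum of (observed - expected)^2 / expected over the 2x2 table of A and B."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if A <= t)
    count_b = sum(1 for t in transactions if B <= t)
    chi2 = 0.0
    for has_a in (True, False):
        for has_b in (True, False):
            observed = sum(1 for t in transactions
                           if (A <= t) == has_a and (B <= t) == has_b)
            # Expected count under independence (assumed nonzero here).
            expected = ((count_a if has_a else n - count_a)
                        * (count_b if has_b else n - count_b)) / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(chi_square({"A"}, {"C"}, transactions))  # ~1.33 for this tiny data set
```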

Reference
Chapter 6, Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han, Micheline Kamber and Jian Pei.

Thank You

