
Mining Frequent Patterns,

Associations and Correlations

Md. Yasser Arafat


MS Student, Dept of CSE, DU

Topics Covered

Frequent Pattern
Association
Correlation
Support & Confidence
Closed Patterns and Max-Patterns
Apriori Algorithm
FP-Growth
Comparison between Apriori and FP-Growth
Correlation Analysis


Frequent Patterns
Frequent patterns are patterns, such as itemsets, subsequences, or substructures, that appear in a data set frequently.
They help in data classification, clustering, and in mining associations, correlations, and other interesting relationships among data.
Frequent pattern mining has become an important data mining task and a focused theme in data mining research.

Association
Association rules are if/then statements that help uncover relationships between seemingly unrelated data in a relational database or other information repository.

Correlation
A mutual relationship or connection between two or more things.
The main goal is to find correlated, interesting itemsets.

Support & Confidence

Find all the rules A ⇒ B with minimum support and confidence:
support, s: the probability that a transaction contains A ∪ B
confidence, c: the conditional probability that a transaction that contains A also contains B

support(A ⇒ B) = P(A ∪ B)
confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
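A minimal sketch of these two measures in Python; the transaction data is hypothetical and mirrors the example on the next slide:

```python
# A minimal sketch of support and confidence, assuming transactions are sets of items.
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """confidence(A => B) = support(A u B) / support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(support({"A", "C"}, transactions))       # 0.5  -> 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> 66.6%
```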

Example
Transaction ID   Items
T1               A, B, C
T2               A, C
T3               A, D
T4               B, D

Min_sup = 50%

Frequent itemset   Support
{A}                75%
{B}                50%
{C}                50%
{D}                50%
{A, C}             50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

Closed Patterns and Max-Patterns

An itemset X is a closed frequent itemset in a data set D if X is frequent and there exists no proper super-itemset Y such that Y has the same support count as X in D.
An itemset X is a maximal frequent itemset in a data set D if X is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent in D.

Example
Exercise: Suppose there are only two transactions, <a1, …, a100> and <a1, …, a50>.
Let min_sup = 1.
What is the set of closed itemsets?
{a1, …, a100}: 1
{a1, …, a50}: 2
What is the set of max-patterns?
{a1, …, a100}: 1
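As a small illustration, here is a brute-force sketch that enumerates all itemsets to find the closed and maximal frequent ones; the tiny data set is hypothetical (enumeration is exponential, so toy-sized only):

```python
from itertools import combinations

# Brute-force sketch: closed = no proper superset with equal support;
# maximal = no frequent proper superset.
transactions = [frozenset("abc"), frozenset("ab"), frozenset("ab"), frozenset("c")]
min_sup = 2
items = sorted(set().union(*transactions))

def count(itemset):
    return sum(1 for t in transactions if itemset <= t)

frequent = {s: count(s)
            for k in range(1, len(items) + 1)
            for s in map(frozenset, combinations(items, k))
            if count(s) >= min_sup}

closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
maximal = [x for x in frequent if not any(x < y for y in frequent)]
print([set(x) for x in closed])   # [{'c'}, {'a', 'b'}]
print([set(x) for x in maximal])  # [{'c'}, {'a', 'b'}]
```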

Apriori Algorithm
Finding frequent itemsets by candidate generation.
Apriori property: all nonempty subsets of a frequent itemset must also be frequent.

Apriori Algorithm
Apriori pruning principle:
If any pattern is infrequent, its supersets should not be generated or tested.

Process:
Scan the database once to get the frequent 1-itemsets.
For each level k: generate length-(k+1) candidates from the length-k frequent patterns, then scan the database and remove the infrequent candidates.
Terminate when no candidate set can be generated.

Pseudo-code (a runnable Python sketch follows):
1: Find all large 1-itemsets L1
2: For (k = 2; Lk-1 is non-empty; k++)
3:   Ck = apriori-gen(Lk-1)
4:   For each c in Ck, initialise c.count to zero
5:   For all records r in the DB:
6:     Cr = subset(Ck, r); for each c in Cr, c.count++
7:   Set Lk := all c in Ck whose count >= minsup
8: Return all of the Lk sets
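A runnable Python sketch of this pseudo-code. Here apriori-gen is approximated by a set-union join plus subset pruning, which over-generates slightly compared to the ordered prefix join but yields the same candidates after pruning; the helper structure is illustrative, not from the slides:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori; returns {frozenset: support count}. A sketch, not optimized."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # line 1: find all large 1-itemsets
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_sup}
    result, k = dict(L), 2
    while L:                                    # line 2: loop while L(k-1) is non-empty
        prev = set(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}  # join
        candidates = {c for c in candidates     # prune: all (k-1)-subsets frequent
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        L = {s: c for s, c in counts.items() if c >= min_sup}   # line 7
        result.update(L)
        k += 1
    return result

# The 9-transaction database from the next slide, min support count = 2:
db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
print(apriori(db, 2)[frozenset({"I1", "I2", "I5"})])  # 2
```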

Example
Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%), and let the minimum confidence required be 70%.

TID    List of Item_IDs
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Example
Generating the 1-itemset frequent pattern: scan D for the count of each candidate in C1, then compare each candidate's support count with the minimum support count to obtain L1.

Itemset   Sup. count
{I1}      6
{I2}      7
{I3}      6
{I4}      2
{I5}      2

Every candidate meets the minimum support count of 2, so L1 = C1.

Example
Generating the 2-itemset frequent pattern: generate the C2 candidates from L1, scan D for the count of each candidate, then compare the candidate support counts with the minimum support count to obtain L2.

C2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I4}   1
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2
{I3, I4}   0
{I3, I5}   1
{I4, I5}   0

L2:
Itemset    Sup. count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

Example
Generating the 3-itemset frequent pattern: in order to find C3, we compute L2 Join L2:
C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
The Join step is now complete, and the Prune step is used to reduce the size of C3, which helps to avoid heavy computation due to a large Ck: any candidate with an infrequent 2-subset is removed, leaving C3 = {{I1, I2, I3}, {I1, I2, I5}}. Scanning D for the count of each remaining candidate and comparing with the minimum support count gives L3.

C3 = L3:
Itemset        Sup. count
{I1, I2, I3}   2
{I1, I2, I5}   2
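A sketch of just this join-and-prune step in isolation, applied to the L2 itemsets from the previous slide (again using a set-union join, which over-generates slightly; pruning removes the extras):

```python
from itertools import combinations

# Join step: union pairs of frequent 2-itemsets into 3-itemset candidates.
L2 = {frozenset(s) for s in [{"I1","I2"}, {"I1","I3"}, {"I1","I5"},
                             {"I2","I3"}, {"I2","I4"}, {"I2","I5"}]}
joined = {a | b for a in L2 for b in L2 if len(a | b) == 3}

# Prune step: drop any candidate with an infrequent 2-subset.
C3 = [c for c in joined if all(frozenset(s) in L2 for s in combinations(c, 2))]
print([set(c) for c in C3])  # [{'I1','I2','I3'}, {'I1','I2','I5'}] (in some order)
```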

Association rule generation

Procedure:
For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, output the rule s ⇒ (l − s) if support_count(l) / support_count(s) >= the minimum confidence threshold.
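A sketch of this procedure in Python; the support counts are hard-coded from the worked example that follows (only the subsets needed for l = {I1, I2, I5} are included):

```python
from itertools import combinations

# Support counts from the example database (min confidence = 70%).
support_count = {frozenset(s): c for s, c in [
    ({"I1"}, 6), ({"I2"}, 7), ({"I5"}, 2),
    ({"I1","I2"}, 4), ({"I1","I5"}, 2), ({"I2","I5"}, 2),
    ({"I1","I2","I5"}, 2)]}

def rules(l, min_conf):
    """Yield every rule s => (l - s) whose confidence meets the threshold."""
    l = frozenset(l)
    for k in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), k)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                yield set(s), set(l - s), conf

for a, b, conf in rules({"I1", "I2", "I5"}, 0.7):
    print(a, "=>", b, f"({conf:.0%})")  # the three strong rules from the next slide
```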

Example
From the previous example, the frequent itemsets are {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
Let us take l = {I1, I2, I5}. All of its nonempty subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

Now we can calculate the confidence of the different association rules:
{I1, I2} ⇒ I5, confidence = 2/4 = 50%
{I1, I5} ⇒ I2, confidence = 2/2 = 100%
{I2, I5} ⇒ I1, confidence = 2/2 = 100%
I1 ⇒ {I2, I5}, confidence = 2/6 = 33%
I2 ⇒ {I1, I5}, confidence = 2/7 = 29%
I5 ⇒ {I1, I2}, confidence = 2/2 = 100%

As the minimum confidence threshold is 70%, the 2nd, 3rd, and 6th rules above are the strong association rules.

Apriori Algorithm: Efficiency Improvement

Many variations of the algorithm have been proposed to improve the efficiency of the original algorithm.
Methods to improve the efficiency of the Apriori algorithm:
Hash-based itemset counting
Transaction reduction
Partitioning
Sampling

Bottlenecks of Apriori
Generate a huge number of candidate sets
Repeatedly scan the whole database
Check a large set of candidates by pattern
matching


FP-Growth
FP-growth, or Frequent Pattern Growth, adopts a divide-and-conquer strategy:
Compresses the database into an FP-tree.
Divides the compressed database into a set of conditional databases, each associated with one pattern fragment.
Examines the data set associated with each fragment separately.
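A minimal sketch of the FP-tree construction that the following slides walk through; the Node and header-table structures are simplified illustrations, not the full algorithm:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

def build_fp_tree(transactions, min_sup):
    """Order frequent items by descending support, then insert each transaction."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    order = sorted(freq, key=lambda i: (-freq[i], i))
    root, header = Node(None, None), {i: [] for i in order}  # item -> node links
    for t in transactions:
        node = root
        for item in (i for i in order if i in t):  # frequent items, in sorted order
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

db = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"}, {"I1","I3"},
      {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"}, {"I1","I2","I3"}]
root, header = build_fp_tree(db, 2)
print({i: sum(n.count for n in header[i]) for i in header})
# {'I2': 7, 'I1': 6, 'I3': 6, 'I4': 2, 'I5': 2}
```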

Example
Consider the same database D. Scan D once to get each item's support count, keep the frequent items, and order the items within each transaction by descending support count: I2:7, I1:6, I3:6, I4:2, I5:2. The header table holds these counts and a node link for each item; the tree starts as just the null{} root.

TID    Items (ordered by support)
T100   I2, I1, I5
T200   I2, I4
T300   I2, I3
T400   I2, I1, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I2, I1, I3, I5
T900   I2, I1, I3

Tree: null{}

Example
Insert T100 (I2, I1, I5): a single path is created from the root.

null{}
  I2:1
    I1:1
      I5:1

Example
Insert T200 (I2, I4): the transaction shares the prefix I2, so that node's count is incremented and a new I4 branch is added.

null{}
  I2:2
    I1:1
      I5:1
    I4:1

Example
Insert T300 (I2, I3):

null{}
  I2:3
    I1:1
      I5:1
    I4:1
    I3:1

Example
Insert T400 (I2, I1, I4):

null{}
  I2:4
    I1:2
      I5:1
      I4:1
    I4:1
    I3:1

Example
Insert T500 (I1, I3): there is no shared prefix with the existing branch, so a new branch starts at the root.

null{}
  I2:4
    I1:2
      I5:1
      I4:1
    I4:1
    I3:1
  I1:1
    I3:1

Example
Insert T600 (I2, I3):

null{}
  I2:5
    I1:2
      I5:1
      I4:1
    I4:1
    I3:2
  I1:1
    I3:1

Example
Insert T700 (I1, I3):

null{}
  I2:5
    I1:2
      I5:1
      I4:1
    I4:1
    I3:2
  I1:2
    I3:2

Example
Insert T800 (I2, I1, I3, I5):

null{}
  I2:6
    I1:3
      I5:1
      I4:1
      I3:1
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

Example
Insert T900 (I2, I1, I3). The FP-tree is now complete:

null{}
  I2:7
    I1:4
      I5:1
      I4:1
      I3:2
        I5:1
    I4:1
    I3:2
  I1:2
    I3:2

Mining now proceeds bottom-up from the header table, starting with the least frequent item.

Example
Branches containing I5:
I2, I1, I5: 1
I2, I1, I3, I5: 1

Conditional pattern base for I5: {(I2, I1: 1), (I2, I1, I3: 1)}
Conditional FP-tree: null{} → I2:2 → I1:2 (I3 is dropped: its count of 1 is below min_sup)

Item   Conditional pattern base         Conditional FP-tree   Frequent patterns generated
I5     {(I2, I1: 1), (I2, I1, I3: 1)}   <I2: 2, I1: 2>        {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}

Example
Branches containing I4:
I2, I4: 1
I2, I1, I4: 1

Conditional pattern base for I4: {(I2, I1: 1), (I2: 1)}
Conditional FP-tree: null{} → I2:2 (I1 is dropped: its count of 1 is below min_sup)

Item   Conditional pattern base   Conditional FP-tree   Frequent pattern generated
I4     {(I2, I1: 1), (I2: 1)}     <I2: 2>               {I2, I4: 2}

Example
Branches containing I3:
I2, I1, I3: 2
I2, I3: 2
I1, I3: 2

Conditional pattern base for I3: {(I2, I1: 2), (I2: 2), (I1: 2)}
Conditional FP-tree: null{} with branches <I2: 4, I1: 2> and <I1: 2>

Item   Conditional pattern base          Conditional FP-tree       Frequent patterns generated
I3     {(I2, I1: 2), (I2: 2), (I1: 2)}   <I2: 4, I1: 2>, <I1: 2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}

Example
Branches containing I1:
I2, I1: 4

Conditional pattern base for I1: {(I2: 4)}
Conditional FP-tree: null{} → I2:4

Item   Conditional pattern base   Conditional FP-tree   Frequent pattern generated
I1     {(I2: 4)}                  <I2: 4>               {I2, I1: 4}

Example
Summary of the mining results:

Item   Conditional pattern base          Conditional FP-tree       Frequent patterns generated
I5     {(I2, I1: 1), (I2, I1, I3: 1)}    <I2: 2, I1: 2>            {I2, I5: 2}, {I1, I5: 2}, {I2, I1, I5: 2}
I4     {(I2, I1: 1), (I2: 1)}            <I2: 2>                   {I2, I4: 2}
I3     {(I2, I1: 2), (I2: 2), (I1: 2)}   <I2: 4, I1: 2>, <I1: 2>   {I2, I3: 4}, {I1, I3: 4}, {I2, I1, I3: 2}
I1     {(I2: 4)}                         <I2: 4>                   {I2, I1: 4}
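A sketch of the mining pass that produces this table; it reuses the Node and build_fp_tree helpers (and the db list) from the earlier sketch on the FP-Growth slide, and recursively mines each item's conditional pattern base:

```python
def fp_growth(transactions, min_sup, suffix=frozenset()):
    """Recursive FP-growth sketch; requires build_fp_tree from the earlier sketch."""
    root, header = build_fp_tree(transactions, min_sup)
    patterns = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        patterns[itemset] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path of every node for this item,
        # repeated once per count.
        cond_db = []
        for node in nodes:
            path, parent = set(), node.parent
            while parent.item is not None:
                path.add(parent.item)
                parent = parent.parent
            cond_db.extend([path] * node.count)
        if cond_db:
            patterns.update(fp_growth(cond_db, min_sup, itemset))
    return patterns

patterns = fp_growth(db, 2)  # db as defined in the earlier sketch
print(patterns[frozenset({"I2", "I1", "I5"})])  # 2
print(patterns[frozenset({"I1", "I3"})])        # 4
```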

Pros of FP-growth
No candidate generation, no candidate tests
Uses a compact data structure
Eliminates repeated database scans
Basic operations are counting and FP-tree building

Comparison between Apriori and FP-Growth

Technique
Apriori: uses the Apriori property and the join and prune steps.
FP-growth: constructs conditional pattern bases and conditional FP-trees from the database, keeping only what satisfies minimum support.

Memory utilization
Apriori: requires a large amount of memory, due to the large number of candidates generated.
FP-growth: requires less memory, due to its compact structure and the absence of candidate generation.

Number of scans
Apriori: multiple scans, one per round of candidate generation.
FP-growth: scans the database only twice.

Execution time
Apriori: larger, since time is spent on candidate generation in every pass.
FP-growth: smaller than the Apriori algorithm.

Correlation Analysis
Correlation analysis provides an alternative framework for finding interesting relationships, or for improving the understanding of the meaning of some association rules.
Correlation measures:
Lift
χ² measure

Correlation measure: Lift

Two itemsets A and B are independent iff P(A ∪ B) = P(A) P(B); otherwise A and B are dependent and correlated.
The measure of correlation, or lift, between A and B is given by the formula:

lift(A, B) = P(A ∪ B) / (P(A) P(B))
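A small sketch computing lift directly from transactions; the four-transaction data set is the hypothetical one used for support and confidence earlier:

```python
def lift(A, B, transactions):
    """lift(A, B) = P(A u B) / (P(A) * P(B)), with probabilities as relative supports."""
    n = len(transactions)
    p_a = sum(1 for t in transactions if A <= t) / n
    p_b = sum(1 for t in transactions if B <= t) / n
    p_ab = sum(1 for t in transactions if A | B <= t) / n
    return p_ab / (p_a * p_b)

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(lift({"A"}, {"C"}, transactions))  # 1.33... > 1: positively correlated
print(lift({"A"}, {"B"}, transactions))  # 0.66... < 1: negatively correlated
```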

Correlation measure: Lift

lift(A, B) > 1 means that A and B are positively correlated.
lift(A, B) < 1 means that the occurrence of A is negatively correlated with the occurrence of B.
lift(A, B) = 1 means that A and B are independent and there is no correlation between them.

Correlation measure: χ² measure

The χ² measure compares the observed and expected counts over the contingency table of A and B:

χ² = Σ (observed − expected)² / expected
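A sketch of the χ² computation for two itemsets over their 2×2 contingency table; this is a generic implementation of the formula above, not code from the slides (it assumes every expected count is nonzero):

```python
def chi_square(A, B, transactions):
    """Sum of (observed - expected)^2 / expected over the 2x2 table of A and B."""
    n = len(transactions)
    count_a = sum(1 for t in transactions if A <= t)
    count_b = sum(1 for t in transactions if B <= t)
    chi2 = 0.0
    for has_a in (True, False):
        for has_b in (True, False):
            observed = sum(1 for t in transactions
                           if (A <= t) == has_a and (B <= t) == has_b)
            # Expected count under independence (assumed nonzero here).
            expected = ((count_a if has_a else n - count_a)
                        * (count_b if has_b else n - count_b)) / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "D"}]
print(chi_square({"A"}, {"C"}, transactions))  # ~1.33 for this tiny data set
```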

Reference
Chapter 6, Data Mining: Concepts and Techniques, Third Edition, by Jiawei Han, Micheline Kamber and Jian Pei.

Thank You

