
jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Data Mining

Association Rules

Mining Frequent Patterns and Associations


- Basic concepts
- Apriori
- FP-Growth
- Mining multi-dimensional associations
- Exercises

What Is Frequent Pattern Analysis?


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Cheese and chips?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

Applications

Basket data analysis, cross-marketing, catalog design, Web log (click stream) analysis, and DNA
sequence analysis.

Why Is Freq. Pattern Mining Important?


- Forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential, structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Semantic data compression

Basic Concepts: Frequent Patterns and Association Rules

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

- Itemset X = {x1, ..., xk}
- Find all the rules X ==> Y with minimum support and confidence:
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction containing X also contains Y

(Figure: Venn diagram of customers buying cheese, customers buying chips, and customers buying both.)

Let supmin = 50%, confmin = 50%.
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules: A ==> D (support 60%, confidence 100%), D ==> A (support 60%, confidence 75%)

(The sketch below reproduces these numbers.)
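To make the definitions concrete, here is a minimal plain-Python sketch (the function names are ours, not from the slides) that reproduces the numbers above:

transactions = [
    {"A", "B", "D"},            # Tid 10
    {"A", "C", "D"},            # Tid 20
    {"A", "D", "E"},            # Tid 30
    {"B", "E", "F"},            # Tid 40
    {"B", "C", "D", "E", "F"},  # Tid 50
]

def support(itemset):
    # fraction of transactions containing every item of `itemset`
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # P(Y | X) = support(X u Y) / support(X)
    return support(set(X) | set(Y)) / support(set(X))

print(support({"A", "D"}))       # 0.6  -> A ==> D has 60% support
print(confidence({"A"}, {"D"}))  # 1.0  -> A ==> D has 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D ==> A has 75% confidence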



Closed Patterns and Max-Patterns


- Number of sub-patterns: e.g., {a1, ..., a100} contains 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns! (Why? Each of the 100 items is either in or out of a sub-pattern, and the empty set is excluded.)


- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X

Closed Patterns and Max-Patterns


- Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, with min_sup = 1
- What is the set of closed itemsets?
  - <a1, ..., a100>: 1
  - <a1, ..., a50>: 2
  (neither has a super-pattern with the same support)
- What is the set of max-patterns?
  - <a1, ..., a100>: 1
  (the only frequent itemset with no frequent super-pattern)
A brute-force check, feasible only for tiny databases, is sketched below.
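For intuition, closed and max patterns can be checked by brute force (this enumerates all 2^|items| subsets, so it cannot touch the <a1, ..., a100> exercise; the database below is the five-transaction example from the earlier slide, with minimum support count 3):

from itertools import combinations

transactions = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
                {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
min_count = 3
items = sorted(set().union(*transactions))

def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

# all frequent itemsets, found by exhaustive enumeration
frequent = {frozenset(c): count(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(c) >= min_count}

# closed: frequent, and no proper superset has the same support
closed = [X for X, sup in frequent.items()
          if not any(X < Y and frequent[Y] == sup for Y in frequent)]
# maximal: frequent, and no proper superset is frequent at all
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(sorted(map(sorted, closed)))   # [['A', 'D'], ['B'], ['D'], ['E']]
print(sorted(map(sorted, maximal)))  # [['A', 'D'], ['B'], ['E']]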

Scalable Methods for Mining Frequent Patterns


- The downward closure property of frequent patterns:
  - Any subset of a frequent itemset must be frequent
  - If {cheese, chips, nuts} is frequent, so is {cheese, chips}: every transaction containing {cheese, chips, nuts} also contains {cheese, chips}
- Scalable mining methods:
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)

Apriori: A Candidate Generation-and-Test Approach

- Apriori is a classic algorithm for learning association rules
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example


Database TDB (minimum support = 2):

Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

C1, after the 1st scan:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 ({D} is pruned, support 1 < 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2, generated from L1:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 counts, after the 2nd scan:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2:
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3, generated from L2:
{B, C, E}

L3, after the 3rd scan:
{B, C, E}: 2

Apriori Algorithm - An Example


- Assume minimum support = 2

(Figure: step-by-step Apriori trace, C1 through L3, on an example database over items 1, 2, 3, and 5.)

Apriori Algorithm - An Example

The final frequent itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}. These are the itemsets from which association rules will be generated.

Generating Association Rules from Frequent Itemsets


- Only strong association rules are generated:
  - Frequent itemsets satisfy the minimum support threshold
  - Strong rules also satisfy the minimum confidence threshold

confidence(A ==> B) = Pr(B | A) = support(A ∪ B) / support(A)

- For each frequent itemset f, generate all non-empty proper subsets of f
- For every non-empty subset s of f:
    if support(f) / support(s) >= min_confidence, output the rule s ==> (f - s)

A Python transcription of this loop follows.
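A minimal Python version (assuming `supports` maps frozensets to absolute counts, as produced by Apriori; the sample counts below are consistent with the worked example on the next slide):

from itertools import combinations

def generate_rules(supports, min_confidence):
    rules = []
    for f, f_sup in supports.items():
        if len(f) < 2:
            continue                     # rules need non-empty s and f - s
        for r in range(1, len(f)):
            for s in map(frozenset, combinations(f, r)):
                conf = f_sup / supports[s]         # support(f) / support(s)
                if conf >= min_confidence:
                    rules.append((set(s), set(f - s), conf))
    return rules

supports = {frozenset(s): c for s, c in [
    ({1}, 2), ({2}, 3), ({3}, 3), ({5}, 3), ({1, 3}, 2),
    ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2), ({2, 3, 5}, 2)]}
for lhs, rhs, conf in generate_rules(supports, 0.75):
    print(lhs, "==>", rhs, round(conf, 2))

Note that every subset s of a frequent itemset is itself frequent by downward closure, so its support is always available in `supports`.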

Generating Association Rules


(Example continued)

Itemsets: {1,3} and {2,3,5}

Candidate rules for {1,3}:

Rule         | Conf.
{1} ==> {3}  | 2/2 = 1.00
{3} ==> {1}  | 2/3 = 0.67

Candidate rules for {2,3,5}:

Rule           | Conf.
{2,3} ==> {5}  | 2/2 = 1.00
{2,5} ==> {3}  | 2/3 = 0.67
{3,5} ==> {2}  | 2/2 = 1.00
{2} ==> {3,5}  | 2/3 = 0.67
{3} ==> {2,5}  | 2/3 = 0.67
{5} ==> {2,3}  | 2/3 = 0.67
{2} ==> {5}    | 3/3 = 1.00
{2} ==> {3}    | 2/3 = 0.67
{3} ==> {2}    | 2/3 = 0.67
{3} ==> {5}    | 2/3 = 0.67
{5} ==> {2}    | 3/3 = 1.00
{5} ==> {3}    | 2/3 = 0.67

Assuming a minimum confidence of 75%, the final set of rules reported by Apriori is: {1} ==> {3}, {2,3} ==> {5}, {3,5} ==> {2}, {2} ==> {5}, and {5} ==> {2}.

The Apriori Algorithm


Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

A runnable Python version follows.
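A compact runnable sketch of this pseudo-code (our own implementation, not the course's reference code; the self-join below unions any two frequent k-itemsets and keeps size-(k+1) results, which, together with the prune step, yields the same candidates as the prefix join on the next slides):

from itertools import combinations

def apriori(transactions, min_support):
    # returns {frozenset: absolute support count} for all frequent itemsets
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # first scan: 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent, k = dict(Lk), 1
    while Lk:
        keys = list(Lk)
        candidates = set()                      # C(k+1): join, then prune
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                u = keys[i] | keys[j]
                if len(u) == k + 1 and all(frozenset(s) in Lk
                                           for s in combinations(u, k)):
                    candidates.add(u)
        counts = {c: 0 for c in candidates}     # one DB scan counts them all
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, c in sorted(apriori(tdb, 2).items(),
                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)   # reproduces L1, L2, L3 from the worked example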



Important Details of Apriori


- How to generate candidates?
  - Step 1: self-join Lk
  - Step 2: pruning
- How to count the supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-join: L3 * L3
    - abcd, from abc and abd
    - acde, from acd and ace
  - Pruning:
    - acde is removed because ade is not in L3
  - C4 = {abcd}

How to Generate Candidates?


- Suppose the items in Lk-1 are listed in an order
- Step 1: self-join Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

A Python sketch of this join-and-prune step follows.
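The same join-and-prune step in Python (assuming each itemset is stored as a sorted tuple), reproducing the L3 -> C4 example from the previous slide:

from itertools import combinations

def generate_candidates(Lk_1, k):
    # Lk_1: frequent (k-1)-itemsets as sorted tuples; returns Ck
    Lk_1 = sorted(Lk_1)
    Lset = set(Lk_1)
    Ck = []
    for i, p in enumerate(Lk_1):
        for q in Lk_1[i + 1:]:
            # join: equal (k-2)-prefix, last item of p before last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in Lset for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))   # [('a', 'b', 'c', 'd')] -- acde pruned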


Challenges of Frequent Pattern Mining


- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce passes of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

Construct FP-tree from a Transaction Database


min_support = 3

TID | Items bought                | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p}    | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}       | {f, c, a, b, m}
300 | {b, f, h, j, o, w}          | {f, b}
400 | {b, c, k, s, p}             | {c, b, p}
500 | {a, f, c, e, l, p, m, n}    | {f, c, a, m, p}

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

1. Scan the DB once to find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items of each transaction in descending frequency order
3. Scan the DB again and construct the FP-tree

(Figure: the resulting FP-tree. From the root: f:4 with children c:3 (then a:3, m:2, p:2, and b:1 with m:1) and b:1, plus a separate branch c:1, b:1, p:1.)

Construct FP-tree from a Transaction Database

- Another example: with a minimum support count of 2, the ordered frequent-item list is L = [I2:7, I1:6, I3:6, I4:2, I5:2]. A Python sketch of the construction follows.
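A minimal FP-tree builder following the three steps above (class and variable names are ours; a real implementation also keeps per-item node-links for the header table). Ties in frequency, such as f/c above, may be ordered either way; any consistent order gives a valid FP-tree:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # step 1: one scan for frequent single items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None, None)
    # steps 2-3: re-scan; sort each transaction by descending frequency, insert
    for t in transactions:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq

def show(node, depth=0):              # small printer for inspection
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
       list("bcksp"), list("afcelpmn")]
root, _ = build_fp_tree(tdb, 3)
show(root)   # an FP-tree equivalent to the slide's (tie order may differ)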


Benefits of the FP-tree Structure


- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
  - Never larger than the original database (not counting node-links and the count fields)

Mining Frequent Patterns With FP-trees


- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partition
- Method:
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern; see the sketch below)
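A sketch of just the single-path base case named in the last bullet: a path such as f:4 -> c:3 -> a:3 yields every non-empty combination of its items, with support equal to the smallest count along the combination (counts never increase going down a path):

from itertools import combinations

path = [("f", 4), ("c", 3), ("a", 3)]    # (item, count) down a single path
patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = frozenset(i for i, _ in combo)
        patterns[items] = min(c for _, c in combo)
for items, sup in patterns.items():
    print(sorted(items), sup)   # {f}:4; {c}, {a}, and every multi-item combo: 3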


Scaling FP-growth by DB Projection


- What if the FP-tree cannot fit in memory? => DB projection
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB

FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time in seconds versus support threshold from 0% to 3% on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime.)

Multiple-Level Association Rules


- Items often form a hierarchy
- Items at the lower level are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels

Example hierarchy:
Food
  Milk: Skim, 2%
  Bread: Wheat, White

Mining Multi-Level Associations


- A top-down, progressive-deepening approach:
  - First find high-level strong rules: milk -> bread [20%, 60%]
  - Then find their lower-level, weaker rules: 2% milk -> wheat bread [6%, 50%]
- With one support threshold for all levels: if the threshold is too high, meaningful low-level associations may be missed; if it is too low, many uninteresting rules may be generated
- Different minimum support thresholds across levels lead to different algorithms (e.g., decrease min-support at lower levels); an alternative cross-level encoding is sketched below
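One simple alternative encoding (the generalized-association-rules idea, distinct from the progressive-deepening approach above): extend each transaction with the ancestors of its items, so a single-level algorithm can mine cross-level rules. The taxonomy below is the Food example above:

hierarchy = {                      # child -> parent, from the Food taxonomy
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def extend(transaction):
    # add every ancestor of every item, walking child -> parent to the root
    out = set(transaction)
    for item in transaction:
        while item in hierarchy:
            item = hierarchy[item]
            out.add(item)
    return out

print(sorted(extend({"2% milk", "wheat bread"})))
# ['2% milk', 'bread', 'food', 'milk', 'wheat bread']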


Quantitative Association Rules

Handling quantitative attributes may require mapping continuous variables to Boolean items.

Mapping Quantitative to Boolean


- One possible solution is to map the problem to Boolean association rules:
  - Discretize each non-categorical attribute into intervals, e.g., Age: [20,29], [30,39], ...
  - Categorical attributes: each value becomes one item
  - Non-categorical attributes: each interval becomes one item (a short sketch follows)
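A minimal sketch of this mapping (attribute names and bin widths are invented for illustration): each categorical value and each numeric interval becomes one Boolean item:

def interval_item(name, value, width):
    lo = (value // width) * width          # e.g. age 34, width 10 -> [30,39]
    return f"{name}[{lo},{lo + width - 1}]"

record = {"age": 34, "married": "yes", "income": 58_000}
items = {
    interval_item("Age", record["age"], 10),        # numeric -> interval item
    interval_item("Income", record["income"], 20_000),
    f"married={record['married']}",                 # categorical -> value item
}
print(sorted(items))   # ['Age[30,39]', 'Income[40000,59999]', 'married=yes']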


Example in Web Content Mining


- Document associations
  - Find (content-based) associations among documents in a collection
  - Documents correspond to items and words correspond to transactions
  - Frequent itemsets are groups of documents in which many words occur in common
- Term associations
  - Find associations among words based on their occurrences in documents
  - Similar to the above, but with the table inverted (terms as items, documents as transactions); see the sketch below

Term-document matrix:

      | business | capital | fund | ... | invest
Doc 1 |    5     |    2    |  0   | ... |   6
Doc 2 |    5     |    4    |  0   | ... |   0
Doc 3 |    2     |    3    |  0   | ... |   0
...   |   ...    |   ...   | ...  | ... |  ...
Doc n |    1     |    5    |  1   | ... |   3
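For term associations, the inverted table turns each document into a transaction of the terms it contains; a minimal sketch using the (partial) matrix above:

matrix = {                    # rows abridged from the table above
    "Doc 1": {"business": 5, "capital": 2, "fund": 0, "invest": 6},
    "Doc 2": {"business": 5, "capital": 4, "fund": 0, "invest": 0},
    "Doc 3": {"business": 2, "capital": 3, "fund": 0, "invest": 0},
}
# each document becomes a transaction of its non-zero terms
transactions = [sorted(t for t, c in row.items() if c > 0)
                for row in matrix.values()]
print(transactions)
# [['business', 'capital', 'invest'], ['business', 'capital'], ['business', 'capital']]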

Example in Web Usage Mining


- Association rules in Web transactions
  - Discover affinities among sets of Web page references across user sessions
- Examples:
  - 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
  - 30% of clients who accessed /special-offer.html placed an online order in /products/software/
  - Actual example from the official IBM Olympics site: {Badminton, Diving} ==> {Table Tennis} [conf 69.7%, sup 0.35%]
- Applications:
  - Serve dynamic, customized content to users
  - Prefetch the files most likely to be accessed
  - Determine the best way to structure the Web site (site optimization)
  - Targeted electronic advertising and increased cross-sales

Example in Web Usage Mining


- Association rules from the Cray Research Web site:

Conf (%) | Supp (%) | Association rule
90       | 3.17     | /PUBLIC/product-info/T3E ===> /PUBLIC/product-info/T3E/CRAY_T3E.html
97.2     | 0.14     | /PUBLIC/product-info/J90/J90.html, /PUBLIC/product-info/T3E ===> /PUBLIC/product-info/T3E/CRAY_T3E.html
82.8     | 0.15     | /PUBLIC/product-info/J90, /PUBLIC/product-info/T3E/CRAY_T3E.html, /PUBLIC/product-info/T90 ===> /PUBLIC/product-info/T3E, /PUBLIC/sc.html

- Design suggestions

Exercises

1. Using the IBM synthetic data generator (number of transactions = 100,000; average transaction length = 10; average frequent-pattern length = 4), generate a dataset and compare Apriori and FP-Growth on it.
2. ( )
3. Run the Apriori and FP-Growth implementations in Weka on a dataset from the UCI repository.
