
jalali@mshdiua.ac.ir Jalali.mshdiau.ac.ir

Data Mining

Association Rules

Mining Frequent Patterns and Associations


- Basic concepts
- Apriori
- FP-Growth
- Mining multi-dimensional associations
- Exercises

What Is Frequent Pattern Analysis?


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together? Cheese and chips?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

Applications

Basket data analysis, cross-marketing, catalog design, Web log (click stream) analysis, and DNA
sequence analysis.

Why Is Freq. Pattern Mining Important?


- Forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential, structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Semantic data compression

Basic Concepts: Frequent Patterns and Association Rules

Transaction-id | Items bought
10             | A, B, D
20             | A, C, D
30             | A, D, E
40             | B, E, F
50             | B, C, D, E, F

- Itemset X = {x1, ..., xk}
- Find all the rules X ==> Y with minimum support and confidence:
  - support, s: probability that a transaction contains X ∪ Y
  - confidence, c: conditional probability that a transaction containing X also contains Y

(Figure: Venn diagram of customers buying cheese, customers buying chips, and customers buying both.)

Let supmin = 50%, confmin = 50%.
- Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
- Association rules: A ==> D (support 60%, confidence 100%), D ==> A (support 60%, confidence 75%)

(The sketch below reproduces these numbers.)
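To make the definitions concrete, here is a minimal plain-Python sketch (the function names are ours, not from the slides) that reproduces the numbers above:

transactions = [
    {"A", "B", "D"},            # Tid 10
    {"A", "C", "D"},            # Tid 20
    {"A", "D", "E"},            # Tid 30
    {"B", "E", "F"},            # Tid 40
    {"B", "C", "D", "E", "F"},  # Tid 50
]

def support(itemset):
    # fraction of transactions containing every item of `itemset`
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(X, Y):
    # P(Y | X) = support(X u Y) / support(X)
    return support(set(X) | set(Y)) / support(set(X))

print(support({"A", "D"}))       # 0.6  -> A ==> D has 60% support
print(confidence({"A"}, {"D"}))  # 1.0  -> A ==> D has 100% confidence
print(confidence({"D"}, {"A"}))  # 0.75 -> D ==> A has 75% confidence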



Closed Patterns and Max-Patterns


- Number of sub-patterns: e.g., {a1, ..., a100} contains 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns! (Why? Each of the 100 items is either in or out of a sub-pattern, and the empty set is excluded.)


- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X

Closed Patterns and Max-Patterns


- Exercise. DB = {<a1, ..., a100>, <a1, ..., a50>}, with min_sup = 1
- What is the set of closed itemsets?
  - <a1, ..., a100>: 1
  - <a1, ..., a50>: 2
  (neither has a super-pattern with the same support)
- What is the set of max-patterns?
  - <a1, ..., a100>: 1
  (the only frequent itemset with no frequent super-pattern)
A brute-force check, feasible only for tiny databases, is sketched below.
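For intuition, closed and max patterns can be checked by brute force (this enumerates all 2^|items| subsets, so it cannot touch the <a1, ..., a100> exercise; the database below is the five-transaction example from the earlier slide, with minimum support count 3):

from itertools import combinations

transactions = [{"A", "B", "D"}, {"A", "C", "D"}, {"A", "D", "E"},
                {"B", "E", "F"}, {"B", "C", "D", "E", "F"}]
min_count = 3
items = sorted(set().union(*transactions))

def count(itemset):
    return sum(set(itemset) <= t for t in transactions)

# all frequent itemsets, found by exhaustive enumeration
frequent = {frozenset(c): count(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if count(c) >= min_count}

# closed: frequent, and no proper superset has the same support
closed = [X for X, sup in frequent.items()
          if not any(X < Y and frequent[Y] == sup for Y in frequent)]
# maximal: frequent, and no proper superset is frequent at all
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(sorted(map(sorted, closed)))   # [['A', 'D'], ['B'], ['D'], ['E']]
print(sorted(map(sorted, maximal)))  # [['A', 'D'], ['B'], ['E']]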

Scalable Methods for Mining Frequent Patterns


- The downward closure property of frequent patterns:
  - Any subset of a frequent itemset must be frequent
  - If {cheese, chips, nuts} is frequent, so is {cheese, chips}: every transaction containing {cheese, chips, nuts} also contains {cheese, chips}
- Scalable mining methods:
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)

Apriori: A Candidate Generation-and-Test Approach

- Apriori is a classic algorithm for learning association rules
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm: An Example


Database TDB (minimum support = 2):

Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

C1, after the 1st scan:
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3

L1 ({D} is pruned, support 1 < 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2, generated from L1:
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 counts, after the 2nd scan:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2

L2:
{A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3, generated from L2:
{B, C, E}

L3, after the 3rd scan:
{B, C, E}: 2

Apriori Algorithm - An Example


- Assume minimum support = 2

(Figure: step-by-step Apriori trace, C1 through L3, on an example database over items 1, 2, 3, and 5.)

Apriori Algorithm - An Example

The final frequent itemsets are those remaining in L2 and L3. However, {2,3}, {2,5}, and {3,5} are all contained in the larger itemset {2,3,5}. Thus, the final group of itemsets reported by Apriori is {1,3} and {2,3,5}. These are the itemsets from which association rules will be generated.

Generating Association Rules from Frequent Itemsets


- Only strong association rules are generated:
  - Frequent itemsets satisfy the minimum support threshold
  - Strong rules also satisfy the minimum confidence threshold

confidence(A ==> B) = Pr(B | A) = support(A ∪ B) / support(A)

- For each frequent itemset f, generate all non-empty proper subsets of f
- For every non-empty subset s of f:
    if support(f) / support(s) >= min_confidence, output the rule s ==> (f - s)

A Python transcription of this loop follows.
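A minimal Python version (assuming `supports` maps frozensets to absolute counts, as produced by Apriori; the sample counts below are consistent with the worked example on the next slide):

from itertools import combinations

def generate_rules(supports, min_confidence):
    rules = []
    for f, f_sup in supports.items():
        if len(f) < 2:
            continue                     # rules need non-empty s and f - s
        for r in range(1, len(f)):
            for s in map(frozenset, combinations(f, r)):
                conf = f_sup / supports[s]         # support(f) / support(s)
                if conf >= min_confidence:
                    rules.append((set(s), set(f - s), conf))
    return rules

supports = {frozenset(s): c for s, c in [
    ({1}, 2), ({2}, 3), ({3}, 3), ({5}, 3), ({1, 3}, 2),
    ({2, 3}, 2), ({2, 5}, 3), ({3, 5}, 2), ({2, 3, 5}, 2)]}
for lhs, rhs, conf in generate_rules(supports, 0.75):
    print(lhs, "==>", rhs, round(conf, 2))

Note that every subset s of a frequent itemset is itself frequent by downward closure, so its support is always available in `supports`.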

Generating Association Rules


(Example continued)

Itemsets: {1,3} and {2,3,5}

Candidate rules for {1,3}:

Rule         | Conf.
{1} ==> {3}  | 2/2 = 1.00
{3} ==> {1}  | 2/3 = 0.67

Candidate rules for {2,3,5}:

Rule           | Conf.
{2,3} ==> {5}  | 2/2 = 1.00
{2,5} ==> {3}  | 2/3 = 0.67
{3,5} ==> {2}  | 2/2 = 1.00
{2} ==> {3,5}  | 2/3 = 0.67
{3} ==> {2,5}  | 2/3 = 0.67
{5} ==> {2,3}  | 2/3 = 0.67
{2} ==> {5}    | 3/3 = 1.00
{2} ==> {3}    | 2/3 = 0.67
{3} ==> {2}    | 2/3 = 0.67
{3} ==> {5}    | 2/3 = 0.67
{5} ==> {2}    | 3/3 = 1.00
{5} ==> {3}    | 2/3 = 0.67

Assuming a minimum confidence of 75%, the final set of rules reported by Apriori is: {1} ==> {3}, {2,3} ==> {5}, {3,5} ==> {2}, {2} ==> {5}, and {5} ==> {2}.

The Apriori Algorithm


Pseudo-code:

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

A runnable Python version follows.
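A compact runnable sketch of this pseudo-code (our own implementation, not the course's reference code; the self-join below unions any two frequent k-itemsets and keeps size-(k+1) results, which, together with the prune step, yields the same candidates as the prefix join on the next slides):

from itertools import combinations

def apriori(transactions, min_support):
    # returns {frozenset: absolute support count} for all frequent itemsets
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:                      # first scan: 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_support}
    frequent, k = dict(Lk), 1
    while Lk:
        keys = list(Lk)
        candidates = set()                      # C(k+1): join, then prune
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                u = keys[i] | keys[j]
                if len(u) == k + 1 and all(frozenset(s) in Lk
                                           for s in combinations(u, k)):
                    candidates.add(u)
        counts = {c: 0 for c in candidates}     # one DB scan counts them all
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        Lk = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(Lk)
        k += 1
    return frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for s, c in sorted(apriori(tdb, 2).items(),
                   key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(s), c)   # reproduces L1, L2, L3 from the worked example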



Important Details of Apriori


- How to generate candidates?
  - Step 1: self-join Lk
  - Step 2: pruning
- How to count the supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-join: L3 * L3
    - abcd, from abc and abd
    - acde, from acd and ace
  - Pruning:
    - acde is removed because ade is not in L3
  - C4 = {abcd}

How to Generate Candidates?


- Suppose the items in Lk-1 are listed in an order
- Step 1: self-join Lk-1

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s is not in Lk-1) then delete c from Ck

A Python sketch of this join-and-prune step follows.
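The same join-and-prune step in Python (assuming each itemset is stored as a sorted tuple), reproducing the L3 -> C4 example from the previous slide:

from itertools import combinations

def generate_candidates(Lk_1, k):
    # Lk_1: frequent (k-1)-itemsets as sorted tuples; returns Ck
    Lk_1 = sorted(Lk_1)
    Lset = set(Lk_1)
    Ck = []
    for i, p in enumerate(Lk_1):
        for q in Lk_1[i + 1:]:
            # join: equal (k-2)-prefix, last item of p before last item of q
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in Lset for s in combinations(c, k - 1)):
                    Ck.append(c)
    return Ck

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(generate_candidates(L3, 4))   # [('a', 'b', 'c', 'd')] -- acde pruned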


Challenges of Frequent Pattern Mining


- Challenges:
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce passes of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

Construct FP-tree from a Transaction Database


min_support = 3

TID | Items bought                | (Ordered) frequent items
100 | {f, a, c, d, g, i, m, p}    | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}       | {f, c, a, b, m}
300 | {b, f, h, j, o, w}          | {f, b}
400 | {b, c, k, s, p}             | {c, b, p}
500 | {a, f, c, e, l, p, m, n}    | {f, c, a, m, p}

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3

1. Scan the DB once to find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items of each transaction in descending frequency order
3. Scan the DB again and construct the FP-tree

(Figure: the resulting FP-tree. From the root: f:4 with children c:3 (then a:3, m:2, p:2, and b:1 with m:1) and b:1, plus a separate branch c:1, b:1, p:1.)

Construct FP-tree from a Transaction Database

- Another example: with a minimum support count of 2, the ordered frequent-item list is L = [I2:7, I1:6, I3:6, I4:2, I5:2]. A Python sketch of the construction follows.
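A minimal FP-tree builder following the three steps above (class and variable names are ours; a real implementation also keeps per-item node-links for the header table). Ties in frequency, such as f/c above, may be ordered either way; any consistent order gives a valid FP-tree:

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # step 1: one scan for frequent single items
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_support}
    root = Node(None, None)
    # steps 2-3: re-scan; sort each transaction by descending frequency, insert
    for t in transactions:
        ordered = sorted((i for i in t if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq

def show(node, depth=0):              # small printer for inspection
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

tdb = [list("facdgimp"), list("abcflmo"), list("bfhjow"),
       list("bcksp"), list("afcelpmn")]
root, _ = build_fp_tree(tdb, 3)
show(root)   # an FP-tree equivalent to the slide's (tie order may differ)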


Benefits of the FP-tree Structure


- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information: infrequent items are gone
  - Items in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
  - Never larger than the original database (not counting node-links and the count fields)

Mining Frequent Patterns With FP-trees


- Idea: frequent pattern growth
  - Recursively grow frequent patterns by pattern and database partition
- Method:
  - For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
  - Repeat the process on each newly created conditional FP-tree
  - Until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern; see the sketch below)
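A sketch of just the single-path base case named in the last bullet: a path such as f:4 -> c:3 -> a:3 yields every non-empty combination of its items, with support equal to the smallest count along the combination (counts never increase going down a path):

from itertools import combinations

path = [("f", 4), ("c", 3), ("a", 3)]    # (item, count) down a single path
patterns = {}
for r in range(1, len(path) + 1):
    for combo in combinations(path, r):
        items = frozenset(i for i, _ in combo)
        patterns[items] = min(c for _, c in combo)
for items, sup in patterns.items():
    print(sorted(items), sup)   # {f}:4; {c}, {a}, and every multi-item combo: 3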


Scaling FP-growth by DB Projection


- What if the FP-tree cannot fit in memory? => DB projection
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB

FP-Growth vs. Apriori: Scalability With the Support Threshold

(Figure: run time in seconds versus support threshold from 0% to 3% on data set T25I20D10K, comparing D1 FP-growth runtime and D1 Apriori runtime.)

Multiple-Level Association Rules


- Items often form a hierarchy
- Items at the lower level are expected to have lower support
- Rules regarding itemsets at appropriate levels could be quite useful
- The transaction database can be encoded based on dimensions and levels

Example hierarchy:
Food
  Milk: Skim, 2%
  Bread: Wheat, White

Mining Multi-Level Associations


- A top-down, progressive-deepening approach:
  - First find high-level strong rules: milk -> bread [20%, 60%]
  - Then find their lower-level, weaker rules: 2% milk -> wheat bread [6%, 50%]
- With one support threshold for all levels: if the threshold is too high, meaningful low-level associations may be missed; if it is too low, many uninteresting rules may be generated
- Different minimum support thresholds across levels lead to different algorithms (e.g., decrease min-support at lower levels); an alternative cross-level encoding is sketched below
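One simple alternative encoding (the generalized-association-rules idea, distinct from the progressive-deepening approach above): extend each transaction with the ancestors of its items, so a single-level algorithm can mine cross-level rules. The taxonomy below is the Food example above:

hierarchy = {                      # child -> parent, from the Food taxonomy
    "skim milk": "milk", "2% milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
    "milk": "food", "bread": "food",
}

def extend(transaction):
    # add every ancestor of every item, walking child -> parent to the root
    out = set(transaction)
    for item in transaction:
        while item in hierarchy:
            item = hierarchy[item]
            out.add(item)
    return out

print(sorted(extend({"2% milk", "wheat bread"})))
# ['2% milk', 'bread', 'food', 'milk', 'wheat bread']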


Quantitative Association Rules

Handling quantitative attributes may require mapping continuous variables to Boolean items.

Mapping Quantitative to Boolean


- One possible solution is to map the problem to Boolean association rules:
  - Discretize each non-categorical attribute into intervals, e.g., Age: [20,29], [30,39], ...
  - Categorical attributes: each value becomes one item
  - Non-categorical attributes: each interval becomes one item (a short sketch follows)
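A minimal sketch of this mapping (attribute names and bin widths are invented for illustration): each categorical value and each numeric interval becomes one Boolean item:

def interval_item(name, value, width):
    lo = (value // width) * width          # e.g. age 34, width 10 -> [30,39]
    return f"{name}[{lo},{lo + width - 1}]"

record = {"age": 34, "married": "yes", "income": 58_000}
items = {
    interval_item("Age", record["age"], 10),        # numeric -> interval item
    interval_item("Income", record["income"], 20_000),
    f"married={record['married']}",                 # categorical -> value item
}
print(sorted(items))   # ['Age[30,39]', 'Income[40000,59999]', 'married=yes']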


Example in Web Content Mining


- Document associations
  - Find (content-based) associations among documents in a collection
  - Documents correspond to items and words correspond to transactions
  - Frequent itemsets are groups of documents in which many words occur in common
- Term associations
  - Find associations among words based on their occurrences in documents
  - Similar to the above, but with the table inverted (terms as items, documents as transactions); see the sketch below

Term-document matrix:

      | business | capital | fund | ... | invest
Doc 1 |    5     |    2    |  0   | ... |   6
Doc 2 |    5     |    4    |  0   | ... |   0
Doc 3 |    2     |    3    |  0   | ... |   0
...   |   ...    |   ...   | ...  | ... |  ...
Doc n |    1     |    5    |  1   | ... |   3
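For term associations, the inverted table turns each document into a transaction of the terms it contains; a minimal sketch using the (partial) matrix above:

matrix = {                    # rows abridged from the table above
    "Doc 1": {"business": 5, "capital": 2, "fund": 0, "invest": 6},
    "Doc 2": {"business": 5, "capital": 4, "fund": 0, "invest": 0},
    "Doc 3": {"business": 2, "capital": 3, "fund": 0, "invest": 0},
}
# each document becomes a transaction of its non-zero terms
transactions = [sorted(t for t, c in row.items() if c > 0)
                for row in matrix.values()]
print(transactions)
# [['business', 'capital', 'invest'], ['business', 'capital'], ['business', 'capital']]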

Example in Web Usage Mining


- Association rules in Web transactions
  - Discover affinities among sets of Web page references across user sessions
- Examples:
  - 60% of clients who accessed /products/ also accessed /products/software/webminer.htm
  - 30% of clients who accessed /special-offer.html placed an online order in /products/software/
  - Actual example from the official IBM Olympics site: {Badminton, Diving} ==> {Table Tennis} [conf 69.7%, sup 0.35%]
- Applications:
  - Serve dynamic, customized content to users
  - Prefetch the files most likely to be accessed
  - Determine the best way to structure the Web site (site optimization)
  - Targeted electronic advertising and increased cross-sales

Example in Web Usage Mining


- Association rules from the Cray Research Web site:

Conf (%) | Supp (%) | Association rule
90       | 3.17     | /PUBLIC/product-info/T3E ===> /PUBLIC/product-info/T3E/CRAY_T3E.html
97.2     | 0.14     | /PUBLIC/product-info/J90/J90.html, /PUBLIC/product-info/T3E ===> /PUBLIC/product-info/T3E/CRAY_T3E.html
82.8     | 0.15     | /PUBLIC/product-info/J90, /PUBLIC/product-info/T3E/CRAY_T3E.html, /PUBLIC/product-info/T90 ===> /PUBLIC/product-info/T3E, /PUBLIC/sc.html

- Design suggestions

Exercises

1. Using the IBM synthetic data generator (number of transactions = 100,000; average transaction length = 10; average frequent-pattern length = 4), generate a dataset and compare Apriori and FP-Growth on it.
2. ( )
3. Run the Apriori and FP-Growth implementations in Weka on a dataset from the UCI repository.
