
SEMINAR BY NITHISH PAI B UNDER THE GUIDANCE OF MRS PUSHPALATA M P

OVERVIEW
Basic concepts and a roadmap
Scalable frequent itemset mining methods:
APRIORI ALGORITHM: a candidate generation-and-test approach
PATTERN GROWTH APPROACH: mining frequent patterns without candidate generation
CHARM / ECLAT: mining by exploring the vertical data format
Summary and conclusions

WHAT IS FREQUENT PATTERN ANALYSIS?


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?

Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

WHY IS FREQUENT PATTERN MINING IMPORTANT?


Frequent patterns: an intrinsic and important property of datasets
Foundation for many essential data mining tasks:
Association, correlation, and causality analysis
Sequential and structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cubes and cube gradients
Semantic data compression: fascicles
Broad applications

BASIC CONCEPTS: FREQUENT PATTERNS


Tid   Items bought
10    Tea, Nuts, Napkins
20    Tea, Coffee, Napkins
30    Tea, Napkins, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Napkins, Eggs, Milk

Itemset: a set of one or more items
k-itemset: X = {x1, ..., xk}
(Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
(Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold

BASIC CONCEPTS: ASSOCIATION RULES


Tid   Items bought
10    Tea, Nuts, Napkins
20    Tea, Coffee, Napkins
30    Tea, Napkins, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Napkins, Eggs, Milk

Find all rules X → Y with minimum support and confidence


Support, s: the probability that a transaction contains X ∪ Y
Confidence, c: the conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%

Frequent patterns: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3
Association rules (many more exist):
Tea → Napkins (support 60%, confidence 100%)
Napkins → Tea (support 60%, confidence 75%)
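To make the numbers concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the slides) that recomputes these support and confidence values from the table above:

# Transactions from the table above (Tid -> items bought).
transactions = [
    {"Tea", "Nuts", "Napkins"},
    {"Tea", "Coffee", "Napkins"},
    {"Tea", "Napkins", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Conditional probability that a transaction containing lhs also contains rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"Tea", "Napkins"}))        # 0.6  -> 60% support
print(confidence({"Tea"}, {"Napkins"}))   # 1.0  -> 100% confidence
print(confidence({"Napkins"}, {"Tea"}))   # 0.75 -> 75% confidence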

CLOSED PATTERNS AND MAX-PATTERNS


A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed patterns are a lossless compression of frequent patterns
Reduces the number of patterns and rules
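A small worked example clarifies the distinction. Suppose the database contains just two transactions, {a1, ..., a100} and {a1, ..., a50}, with minsup = 1. Then there are only two closed patterns, {a1, ..., a100} with support 1 and {a1, ..., a50} with support 2, and a single max-pattern, {a1, ..., a100}. The max-pattern alone tells us which itemsets are frequent but loses their support counts; the closed patterns preserve both, which is why closed-pattern mining is lossless.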

COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING


How many itemsets may potentially be generated in the worst case?
The number of frequent itemsets to be generated is sensitive to the minsup threshold
When minsup is low, there can be an exponential number of frequent itemsets
The worst case: M^N, where M is the number of distinct items and N is the maximum transaction length

The worst case complexity vs. the expected probability


Ex. Suppose Walmart has 10^4 kinds of products
The chance of picking up one particular product: 10^-4
The chance of picking up a particular set of 10 products: (10^-4)^10 = 10^-40
What is the chance that this particular set of 10 products is frequent 10^5 times in 10^9 transactions? Essentially zero: although the search space is exponential in the worst case, only a tiny fraction of long itemsets are actually frequent in practice.

THE DOWNWARD CLOSURE PROPERTY AND SCALABLE MINING METHODS


The downward closure property of frequent patterns states that:
Any subset of a frequent itemset must also be frequent
If {tea, napkins, nuts} is frequent, so is {tea, napkins}
i.e., every transaction containing {tea, napkins, nuts} also contains {tea, napkins}

Scalable mining methods: Three major approaches


Apriori algorithm
Frequent pattern growth (FP-growth)
Vertical data format approach

APRIORI ALGORITHM: A CANDIDATE GENERATION & TEST APPROACH
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
Apriori uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function that exploits the downward closure property of support.
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated

APRIORI ALGORITHM: EXAMPLE
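As a worked example, the following minimal Python sketch (function and variable names are illustrative, not from the slides) runs the generate-and-test loop on the transaction table introduced earlier, with minimum support count 3:

from itertools import combinations

# Transaction table from the earlier slides (minsup = 50%, i.e. count >= 3).
transactions = [
    {"Tea", "Nuts", "Napkins"},
    {"Tea", "Coffee", "Napkins"},
    {"Tea", "Napkins", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
]
min_count = 3

def count_support(candidates):
    # Test step: count how many transactions contain each candidate.
    return {c: sum(c <= t for t in transactions) for c in candidates}

# Initial scan: frequent 1-itemsets.
items = {item for t in transactions for item in t}
level = {c: n for c, n in count_support({frozenset([i]) for i in items}).items()
         if n >= min_count}
frequent = dict(level)

# Generate length-(k+1) candidates from length-k frequent itemsets, then test.
k = 1
while level:
    # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
    candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    # Prune step (downward closure): drop candidates with an infrequent k-subset.
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, k))}
    level = {c: n for c, n in count_support(candidates).items() if n >= min_count}
    frequent.update(level)
    k += 1

print(frequent)  # includes frozenset({'Tea', 'Napkins'}): 3

Note how the prune step never counts a candidate whose subset is already known to be infrequent; this is exactly the Apriori pruning principle stated above.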

IMPROVEMENT OF APRIORI METHOD


DISADVANTAGES:
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates

Improving Apriori: general ideas


Reduce the number of passes over the transaction database
Shrink the number of candidates
Facilitate support counting of candidates

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation


The FP-growth approach:
Depth-first search
Avoids explicit candidate generation

Major philosophy: grow long patterns from short ones, using only locally frequent items.
Suppose "abc" is a frequent pattern:
Get all transactions containing abc, i.e., project the DB on abc: DB|abc
If d is a locally frequent item in DB|abc, then abcd is a frequent pattern
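A tiny illustrative Python sketch of this projection step (the data here is made up purely for illustration):

# Illustrative projection: DB|abc keeps, for each transaction containing
# {a, b, c}, only the remaining items; d's local frequency is counted there.
db = [{"a","b","c","d"}, {"a","b","c","d","e"}, {"a","b","c"}, {"b","c","d"}]
db_abc = [t - {"a","b","c"} for t in db if {"a","b","c"} <= t]
print(db_abc)  # [{'d'}, {'d', 'e'}, set()] -> d is locally frequent in DB|abc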

CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE


Consider the following set of transactions, with minimum support count 3:

TID   Items bought                  (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE


ALGORITHM FOR CONSTRUCTION OF AN FP-TREE:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order to obtain the f-list
3. Scan the DB again and construct the FP-tree

FP-tree for the example transactions (header table: f:4, c:4, a:3, b:3, m:3, p:3):

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

F-list = f-c-a-b-m-p
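A compact Python sketch of steps 1-3 (class and variable names are illustrative assumptions, not from the slides); it rebuilds the tree shown above from the example transactions:

from collections import Counter

class FPNode:
    # One FP-tree node: an item, a count, a parent link, and children by item.
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_count = 3

# Step 1: scan once, count item frequencies, keep the frequent ones.
counts = Counter(item for t in transactions for item in t)
frequent = {i: n for i, n in counts.items() if n >= min_count}

# Step 2: f-list, frequent items in descending frequency; ties are broken
# here to match the slides' published order f-c-a-b-m-p.
order = {item: rank for rank, item in enumerate("fcabmp")}
flist = sorted(frequent, key=lambda i: (-frequent[i], order[i]))

# Step 3: scan again; insert each transaction's frequent items, f-list-ordered.
root = FPNode()
header = {i: [] for i in flist}   # header table: item -> list of tree nodes
for t in transactions:
    node = root
    for item in [i for i in flist if i in t]:
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
            header[item].append(node.children[item])  # node-link bookkeeping
        node = node.children[item]
        node.count += 1

The header table's node-link lists are what the following slides traverse to build conditional pattern bases.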

PARTITION PATTERNS AND DATABASES


The complete set of frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
Patterns containing p
Patterns containing m but not p
Patterns containing b but neither m nor p
Patterns containing a but none of b, m, p
Patterns containing c but none of a, b, m, p
Pattern f itself
This partitioning is complete and non-redundant.

FINDING PATTERNS HAVING P FROM P'S CONDITIONAL DATABASE


Starting at the frequent-item header table of the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:

Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
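Continuing the FP-tree sketch above, an item's conditional pattern base can be collected by following the header table's node-links and climbing each node's parent pointers (function name is illustrative):

def conditional_pattern_base(item):
    # For each occurrence of `item`, collect its prefix path and count.
    base = []
    for node in header[item]:          # follow node-links from the header table
        path, parent = [], node.parent
        while parent.item is not None: # climb to the root, recording the prefix
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base("m"))  # [(['f','c','a'], 2), (['f','c','a','b'], 1)]
print(conditional_pattern_base("p"))  # [(['f','c','a','m'], 2), (['c','b'], 1)]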

FROM CONDITIONAL PATTERN BASES TO CONDITIONAL FP-TREES


For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m's conditional pattern base: fca:2, fcab:1
Accumulating counts in this base gives f:3, c:3, a:3, and b:1; only f, c, and a reach the minimum support count of 3, so the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3.

All frequent patterns relating to m are obtained by combining m with the subsets of {f, c, a}: m, fm, cm, am, fcm, fam, cam, fcam.
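Because the m-conditional FP-tree is a single path, this enumeration is direct; a small illustrative Python snippet:

from itertools import combinations

# Single-path shortcut: every combination of the path's nodes, suffixed with m.
path, suffix = ["f", "c", "a"], "m"
patterns = [set(combo) | {suffix}
            for k in range(len(path) + 1)
            for combo in combinations(path, k)]
print(len(patterns))  # 8: m, fm, cm, am, fcm, fam, cam, fcam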

RECURSION: MINING EACH CONDITIONAL FP-TREE

Each conditional FP-tree is mined recursively in the same way: e.g., from the m-conditional FP-tree, am's conditional pattern base is fc:3, which yields the am-conditional FP-tree {} → f:3 → c:3, and so on until the tree is empty or reduces to a single path.

ALGORITHM FOR MINING FREQUENT ITEMSETS USING AN FP-TREE BY FREQUENT PATTERN GROWTH

Input: D, a transaction database; minsup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed in the following steps:
a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support-count-descending order as L, the list of frequent items.

b) Create the root of an FP-tree and label it "null". For each transaction Trans in D, do the following:
Select the frequent items in Trans and sort them according to the order of L. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list.
Call insert_tree([p|P], T), which is performed as follows: if T has a child N such that N.item-name = p.item-name, then increment N's count by 1; otherwise create a new node N, let its count be 1, link its parent to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.

ALGORITHM FOR MINING FREQUENT ITEMSETS USING AN FP-TREE BY FREQUENT PATTERN GROWTH
2. The FP-tree is mined by calling FP-growth(FP-tree, null), which is implemented as follows:

procedure FP-growth(Tree, α)
if Tree contains a single path P then
  for each combination β of the nodes in path P
    generate pattern β ∪ α with support_count = the minimum support count of the nodes in β;
else, for each item ai in the header table of Tree:
  generate pattern β = ai ∪ α with support_count = ai.support_count;
  construct β's conditional pattern base and then β's conditional FP-tree, Tree_β;
  if Tree_β ≠ ∅ then call FP-growth(Tree_β, β);
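For comparison, here is a minimal self-contained Python sketch of the same recursion (names are illustrative). It uses the projected-database view of pattern growth described earlier, representing each conditional database as (prefix-path, count) pairs instead of an explicit FP-tree, and it omits the single-path shortcut for brevity:

from collections import Counter

def fp_growth(cond_db, suffix, min_count, results):
    # Pattern growth over a conditional database of (path, count) pairs:
    # grow long patterns from short ones using only locally frequent items.
    counts = Counter()
    for path, n in cond_db:
        for item in path:
            counts[item] += n
    for item, n in counts.items():
        if n < min_count:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = n
        # Project the database on `item`: keep the prefix before each occurrence.
        projected = [(path[:path.index(item)], cnt)
                     for path, cnt in cond_db if item in path]
        fp_growth([p for p in projected if p[0]], pattern, min_count, results)

# Start from the f-list-ordered transactions of the example (each with count 1).
db = [(["f","c","a","m","p"], 1), (["f","c","a","b","m"], 1),
      (["f","b"], 1), (["c","b","p"], 1), (["f","c","a","m","p"], 1)]
results = {}
fp_growth(db, set(), 3, results)
print(results.get(frozenset({"f","c","a","m"})))  # 3

Because each projection only keeps items that precede the projected item in the f-list order, every frequent pattern is generated exactly once.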

BENEFITS OF FP-TREE STRUCTURE


Completeness:
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness:
Reduces irrelevant information: infrequent items are gone
Items are stored in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
The tree is never larger than the original database (not counting node-links and the count fields)

ADVANTAGES OF PATTERN GROWTH APPROACH


Divide-and-conquer:
Decompose both the mining task and the DB according to the frequent patterns obtained so far
Leads to focused searches of smaller databases

Other factors
No candidate generation, no candidate test
Compressed database: the FP-tree structure
No repeated scans of the entire database
Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching

EXTENSION OF PATTERN GROWTH MINING METHODOLOGY


Mining closed frequent itemsets and max-patterns
Mining sequential patterns
Mining graph patterns
Constraint-based mining of frequent patterns
Computing iceberg data cubes with complex measures
Pattern-growth-based clustering
Pattern-growth-based classification

CHARM / ECLAT: Mining by Exploring Vertical Data Format


Vertical format: t(AB) = {T11, T25, ...}
tid-list: the list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections:
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining:
Only keep track of differences of tids
If t(X) = {T1, T2, T3} and t(XY) = {T1, T3}, then Diffset(XY, X) = {T2}
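A brief Python sketch in the spirit of ECLAT (names are illustrative), using the earlier transaction table: supports come from tid-list intersections, and the diffset records only what changes:

# Vertical (tid-list) representation of the earlier example transactions.
tidlists = {
    "Tea":     {10, 20, 30},
    "Nuts":    {10, 40, 50},
    "Napkins": {10, 20, 30, 50},
    "Eggs":    {30, 40, 50},
}

def tids(itemset):
    # Support of an itemset = size of the intersection of its tid-lists.
    result = None
    for item in itemset:
        result = tidlists[item] if result is None else result & tidlists[item]
    return result

print(tids({"Tea", "Napkins"}))       # {10, 20, 30} -> support count 3

# Diffset(XY, X) = t(X) - t(XY): the tids lost when extending X with Y.
t_x = tidlists["Napkins"]
t_xy = tids({"Napkins", "Tea"})
print(t_x - t_xy)                     # {50}; support(XY) = |t(X)| - |diffset| = 3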
