
SEMINAR BY NITHISH PAI B UNDER THE GUIDANCE OF MRS PUSHPALATA M P

OVERVIEW
Basic concepts and a roadmap
Scalable frequent itemset mining methods:
APRIORI ALGORITHM: a candidate generation-and-test approach
PATTERN GROWTH APPROACH: mining frequent patterns without candidate generation
CHARM / ECLAT: mining by exploring the vertical data format
Summary and conclusions

WHAT IS FREQUENT PATTERN ANALYSIS?


Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
Motivation: finding inherent regularities in data
What products were often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?

Applications
Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

WHY IS FREQUENT PATTERN MINING IMPORTANT?


Frequent patterns: an intrinsic and important property of datasets
Foundation for many essential data mining tasks:
Association, correlation, and causality analysis
Sequential and structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
Classification: discriminative frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cubes and cube gradients
Semantic data compression: fascicles
Broad applications

BASIC CONCEPTS: FREQUENT PATTERNS


Tid   Items bought
10    Tea, Nuts, Napkins
20    Tea, Coffee, Napkins
30    Tea, Napkins, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Napkins, Eggs, Milk

Itemset: a set of one or more items
k-itemset: X = {x1, ..., xk}
(Absolute) support, or support count, of X: the frequency (number of occurrences) of itemset X
(Relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup threshold

BASIC CONCEPTS: ASSOCIATION RULES


Tid   Items bought
10    Tea, Nuts, Napkins
20    Tea, Coffee, Napkins
30    Tea, Napkins, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Napkins, Eggs, Milk

Find all rules X → Y with minimum support and confidence


Support, s: the probability that a transaction contains X ∪ Y
Confidence, c: the conditional probability that a transaction having X also contains Y
Let minsup = 50%, minconf = 50%

Frequent patterns: Tea:3, Nuts:3, Napkins:4, Eggs:3, {Tea, Napkins}:3
Association rules (many more exist):
Tea → Napkins (support 60%, confidence 100%)
Napkins → Tea (support 60%, confidence 75%)
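To make the numbers concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the slides) that recomputes these support and confidence values from the table above:

# Transactions from the table above (Tid -> items bought).
transactions = [
    {"Tea", "Nuts", "Napkins"},
    {"Tea", "Coffee", "Napkins"},
    {"Tea", "Napkins", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
]

def support(itemset):
    # Fraction of transactions containing every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Conditional probability that a transaction containing lhs also contains rhs.
    return support(lhs | rhs) / support(lhs)

print(support({"Tea", "Napkins"}))        # 0.6  -> 60% support
print(confidence({"Tea"}, {"Napkins"}))   # 1.0  -> 100% confidence
print(confidence({"Napkins"}, {"Tea"}))   # 0.75 -> 75% confidence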

CLOSED PATTERNS AND MAX-PATTERNS


A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, ..., a100} contains C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed patterns are a lossless compression of frequent patterns
Reduces the number of patterns and rules
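A small worked example clarifies the distinction. Suppose the database contains just two transactions, {a1, ..., a100} and {a1, ..., a50}, with minsup = 1. Then there are only two closed patterns, {a1, ..., a100} with support 1 and {a1, ..., a50} with support 2, and a single max-pattern, {a1, ..., a100}. The max-pattern alone tells us which itemsets are frequent but loses their support counts; the closed patterns preserve both, which is why closed-pattern mining is lossless.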

COMPUTATIONAL COMPLEXITY OF FREQUENT ITEMSET MINING


How many itemsets may potentially be generated in the worst case?
The number of frequent itemsets to be generated is sensitive to the minsup threshold
When minsup is low, there can be an exponential number of frequent itemsets
The worst case: M^N, where M is the number of distinct items and N is the maximum transaction length

The worst case complexity vs. the expected probability


Ex. Suppose Walmart has 10^4 kinds of products
The chance of picking up one particular product: 10^-4
The chance of picking up a particular set of 10 products: (10^-4)^10 = 10^-40
What is the chance that this particular set of 10 products is frequent 10^5 times in 10^9 transactions? Essentially zero: although the search space is exponential in the worst case, only a tiny fraction of long itemsets are actually frequent in practice.

THE DOWNWARD CLOSURE PROPERTY AND SCALABLE MINING METHODS


The downward closure property of frequent patterns states that:
Any subset of a frequent itemset must also be frequent
If {tea, napkins, nuts} is frequent, so is {tea, napkins}
i.e., every transaction containing {tea, napkins, nuts} also contains {tea, napkins}

Scalable mining methods: Three major approaches


Apriori algorithm
Frequent pattern growth (FP-growth)
Vertical data format approach

APRIORI ALGORITHM: A CANDIDATE GENERATION & TEST APPROACH
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
Apriori uses a breadth-first search strategy to count the support of itemsets, and a candidate generation function that exploits the downward closure property of support.
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be generated

APRIORI ALGORITHM: EXAMPLE
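As a worked example, the following minimal Python sketch (function and variable names are illustrative, not from the slides) runs the generate-and-test loop on the transaction table introduced earlier, with minimum support count 3:

from itertools import combinations

# Transaction table from the earlier slides (minsup = 50%, i.e. count >= 3).
transactions = [
    {"Tea", "Nuts", "Napkins"},
    {"Tea", "Coffee", "Napkins"},
    {"Tea", "Napkins", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Napkins", "Eggs", "Milk"},
]
min_count = 3

def count_support(candidates):
    # Test step: count how many transactions contain each candidate.
    return {c: sum(c <= t for t in transactions) for c in candidates}

# Initial scan: frequent 1-itemsets.
items = {item for t in transactions for item in t}
level = {c: n for c, n in count_support({frozenset([i]) for i in items}).items()
         if n >= min_count}
frequent = dict(level)

# Generate length-(k+1) candidates from length-k frequent itemsets, then test.
k = 1
while level:
    # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
    candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
    # Prune step (downward closure): drop candidates with an infrequent k-subset.
    candidates = {c for c in candidates
                  if all(frozenset(s) in level for s in combinations(c, k))}
    level = {c: n for c, n in count_support(candidates).items() if n >= min_count}
    frequent.update(level)
    k += 1

print(frequent)  # includes frozenset({'Tea', 'Napkins'}): 3

Note how the prune step never counts a candidate whose subset is already known to be infrequent; this is exactly the Apriori pruning principle stated above.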

IMPROVEMENT OF APRIORI METHOD


DISADVANTAGES:
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates

Improving Apriori: general ideas


Reduce the number of passes over the transaction database
Shrink the number of candidates
Facilitate support counting of candidates

Pattern-Growth Approach: Mining Frequent Patterns Without Candidate Generation


The FP-growth approach:
Depth-first search
Avoids explicit candidate generation

Major philosophy: grow long patterns from short ones, using only locally frequent items.
Suppose "abc" is a frequent pattern:
Get all transactions containing abc, i.e., project the DB on abc: DB|abc
If d is a locally frequent item in DB|abc, then abcd is a frequent pattern
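A tiny illustrative Python sketch of this projection step (the data here is made up purely for illustration):

# Illustrative projection: DB|abc keeps, for each transaction containing
# {a, b, c}, only the remaining items; d's local frequency is counted there.
db = [{"a","b","c","d"}, {"a","b","c","d","e"}, {"a","b","c"}, {"b","c","d"}]
db_abc = [t - {"a","b","c"} for t in db if {"a","b","c"} <= t]
print(db_abc)  # [{'d'}, {'d', 'e'}, set()] -> d is locally frequent in DB|abc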

CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE


Consider the following set of transactions, with minimum support count 3:

TID   Items bought                  (Ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

CONSTRUCTING FP-TREE FROM A TRANSACTIONAL DATABASE


ALGORITHM FOR CONSTRUCTION OF AN FP-TREE:
1. Scan the DB once and find the frequent 1-itemsets (single-item patterns)
2. Sort the frequent items in frequency-descending order to obtain the f-list
3. Scan the DB again and construct the FP-tree

FP-tree for the example transactions (header table: f:4, c:4, a:3, b:3, m:3, p:3):

{}
├── f:4
│   ├── c:3
│   │   └── a:3
│   │       ├── m:2
│   │       │   └── p:2
│   │       └── b:1
│   │           └── m:1
│   └── b:1
└── c:1
    └── b:1
        └── p:1

F-list = f-c-a-b-m-p
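A compact Python sketch of steps 1-3 (class and variable names are illustrative assumptions, not from the slides); it rebuilds the tree shown above from the example transactions:

from collections import Counter

class FPNode:
    # One FP-tree node: an item, a count, a parent link, and children by item.
    def __init__(self, item=None, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o", "w"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_count = 3

# Step 1: scan once, count item frequencies, keep the frequent ones.
counts = Counter(item for t in transactions for item in t)
frequent = {i: n for i, n in counts.items() if n >= min_count}

# Step 2: f-list, frequent items in descending frequency; ties are broken
# here to match the slides' published order f-c-a-b-m-p.
order = {item: rank for rank, item in enumerate("fcabmp")}
flist = sorted(frequent, key=lambda i: (-frequent[i], order[i]))

# Step 3: scan again; insert each transaction's frequent items, f-list-ordered.
root = FPNode()
header = {i: [] for i in flist}   # header table: item -> list of tree nodes
for t in transactions:
    node = root
    for item in [i for i in flist if i in t]:
        if item not in node.children:
            node.children[item] = FPNode(item, parent=node)
            header[item].append(node.children[item])  # node-link bookkeeping
        node = node.children[item]
        node.count += 1

The header table's node-link lists are what the following slides traverse to build conditional pattern bases.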

PARTITION PATTERNS AND DATABASES


The complete set of frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
Patterns containing p
Patterns containing m but not p
Patterns containing b but neither m nor p
Patterns containing a but none of b, m, p
Patterns containing c but none of a, b, m, p
Pattern f itself
This partitioning is complete and non-redundant.

FINDING PATTERNS HAVING P FROM P'S CONDITIONAL DATABASE


Starting at the frequent-item header table of the FP-tree:
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:

Item   Conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
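Continuing the FP-tree sketch above, an item's conditional pattern base can be collected by following the header table's node-links and climbing each node's parent pointers (function name is illustrative):

def conditional_pattern_base(item):
    # For each occurrence of `item`, collect its prefix path and count.
    base = []
    for node in header[item]:          # follow node-links from the header table
        path, parent = [], node.parent
        while parent.item is not None: # climb to the root, recording the prefix
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

print(conditional_pattern_base("m"))  # [(['f','c','a'], 2), (['f','c','a','b'], 1)]
print(conditional_pattern_base("p"))  # [(['f','c','a','m'], 2), (['c','b'], 1)]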

FROM CONDITIONAL PATTERN BASES TO CONDITIONAL FP-TREES


For each pattern base:
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m's conditional pattern base: fca:2, fcab:1
Accumulating counts in this base gives f:3, c:3, a:3, and b:1; only f, c, and a reach the minimum support count of 3, so the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3.

All frequent patterns relating to m are obtained by combining m with the subsets of {f, c, a}: m, fm, cm, am, fcm, fam, cam, fcam.
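Because the m-conditional FP-tree is a single path, this enumeration is direct; a small illustrative Python snippet:

from itertools import combinations

# Single-path shortcut: every combination of the path's nodes, suffixed with m.
path, suffix = ["f", "c", "a"], "m"
patterns = [set(combo) | {suffix}
            for k in range(len(path) + 1)
            for combo in combinations(path, k)]
print(len(patterns))  # 8: m, fm, cm, am, fcm, fam, cam, fcam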

RECURSION: MINING EACH CONDITIONAL FP-TREE

Each conditional FP-tree is mined recursively in the same way: e.g., from the m-conditional FP-tree, am's conditional pattern base is fc:3, which yields the am-conditional FP-tree {} → f:3 → c:3, and so on until the tree is empty or reduces to a single path.

ALGORITHM FOR MINING FREQUENT ITEMSETS USING AN FP-TREE BY FREQUENT PATTERN GROWTH

Input: D, a transaction database; minsup, the minimum support count threshold.
Output: The complete set of frequent patterns.
Method:
1. The FP-tree is constructed in the following steps:
a) Scan the transaction database D once. Collect F, the set of frequent items, and their support counts. Sort F in support-count-descending order as L, the list of frequent items.

b) Create the root of an FP-tree and label it "null". For each transaction Trans in D, do the following:
Select the frequent items in Trans and sort them according to the order of L. Let the sorted frequent-item list in Trans be [p|P], where p is the first element and P is the remaining list.
Call insert_tree([p|P], T), which is performed as follows: if T has a child N such that N.item-name = p.item-name, then increment N's count by 1; otherwise create a new node N, let its count be 1, link its parent to T, and link it to the nodes with the same item-name via the node-link structure. If P is nonempty, call insert_tree(P, N) recursively.

ALGORITHM FOR MINING FREQUENT ITEMSETS USING AN FP-TREE BY FREQUENT PATTERN GROWTH
2. The FP-tree is mined by calling FP-growth(FP-tree, null), which is implemented as follows:

procedure FP-growth(Tree, α)
if Tree contains a single path P then
  for each combination β of the nodes in path P
    generate pattern β ∪ α with support_count = the minimum support count of the nodes in β;
else, for each item ai in the header table of Tree:
  generate pattern β = ai ∪ α with support_count = ai.support_count;
  construct β's conditional pattern base and then β's conditional FP-tree, Tree_β;
  if Tree_β ≠ ∅ then call FP-growth(Tree_β, β);
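For comparison, here is a minimal self-contained Python sketch of the same recursion (names are illustrative). It uses the projected-database view of pattern growth described earlier, representing each conditional database as (prefix-path, count) pairs instead of an explicit FP-tree, and it omits the single-path shortcut for brevity:

from collections import Counter

def fp_growth(cond_db, suffix, min_count, results):
    # Pattern growth over a conditional database of (path, count) pairs:
    # grow long patterns from short ones using only locally frequent items.
    counts = Counter()
    for path, n in cond_db:
        for item in path:
            counts[item] += n
    for item, n in counts.items():
        if n < min_count:
            continue
        pattern = suffix | {item}
        results[frozenset(pattern)] = n
        # Project the database on `item`: keep the prefix before each occurrence.
        projected = [(path[:path.index(item)], cnt)
                     for path, cnt in cond_db if item in path]
        fp_growth([p for p in projected if p[0]], pattern, min_count, results)

# Start from the f-list-ordered transactions of the example (each with count 1).
db = [(["f","c","a","m","p"], 1), (["f","c","a","b","m"], 1),
      (["f","b"], 1), (["c","b","p"], 1), (["f","c","a","m","p"], 1)]
results = {}
fp_growth(db, set(), 3, results)
print(results.get(frozenset({"f","c","a","m"})))  # 3

Because each projection only keeps items that precede the projected item in the f-list order, every frequent pattern is generated exactly once.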

BENEFITS OF FP-TREE STRUCTURE


Completeness:
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness:
Reduces irrelevant information: infrequent items are gone
Items are stored in frequency-descending order: the more frequently an item occurs, the more likely it is to be shared
The tree is never larger than the original database (not counting node-links and the count fields)

ADVANTAGES OF PATTERN GROWTH APPROACH


Divide-and-conquer:
Decompose both the mining task and the DB according to the frequent patterns obtained so far
Leads to focused searches of smaller databases

Other factors
No candidate generation, no candidate test
Compressed database: the FP-tree structure
No repeated scans of the entire database
Basic operations: counting local frequent items and building sub-FP-trees; no pattern search and matching

EXTENSION OF PATTERN GROWTH MINING METHODOLOGY


Mining closed frequent itemsets and max-patterns
Mining sequential patterns
Mining graph patterns
Constraint-based mining of frequent patterns
Computing iceberg data cubes with complex measures
Pattern-growth-based clustering
Pattern-growth-based classification

CHARM / ECLAT: Mining by Exploring Vertical Data Format


Vertical format: t(AB) = {T11, T25, ...}
tid-list: the list of transaction ids containing an itemset
Deriving closed patterns based on vertical intersections:
t(X) = t(Y): X and Y always happen together
t(X) ⊂ t(Y): a transaction having X always has Y
Using diffsets to accelerate mining:
Only keep track of differences of tids
If t(X) = {T1, T2, T3} and t(XY) = {T1, T3}, then Diffset(XY, X) = {T2}
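A brief Python sketch in the spirit of ECLAT (names are illustrative), using the earlier transaction table: supports come from tid-list intersections, and the diffset records only what changes:

# Vertical (tid-list) representation of the earlier example transactions.
tidlists = {
    "Tea":     {10, 20, 30},
    "Nuts":    {10, 40, 50},
    "Napkins": {10, 20, 30, 50},
    "Eggs":    {30, 40, 50},
}

def tids(itemset):
    # Support of an itemset = size of the intersection of its tid-lists.
    result = None
    for item in itemset:
        result = tidlists[item] if result is None else result & tidlists[item]
    return result

print(tids({"Tea", "Napkins"}))       # {10, 20, 30} -> support count 3

# Diffset(XY, X) = t(X) - t(XY): the tids lost when extending X with Y.
t_x = tidlists["Napkins"]
t_xy = tids({"Napkins", "Tea"})
print(t_x - t_xy)                     # {50}; support(XY) = |t(X)| - |diffset| = 3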
