
Frequent Itemset Mining

Methods
The Apriori algorithm
 Finding frequent itemsets using candidate generation
 Seminal algorithm proposed by R. Agrawal and R.
Srikant in 1994
 Uses an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-
itemsets.
 Apriori property to reduce the search space: All
nonempty subsets of a frequent itemset must also be
frequent.
 If P(I) < min_sup, then I is not frequent
 If an item A is added to the itemset I, the result I ∪ A cannot occur more often than I, so P(I ∪ A) < min_sup and I ∪ A is not frequent either
 Antimonotone property – if a set cannot pass a test, all of
its supersets will fail the same test as well
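As a minimal sketch of this property (the toy transactions below are made up purely for illustration), the support of a superset can never exceed the support of any of its subsets:

```python
# Minimal sketch of the antimonotone (Apriori) property on made-up toy transactions.
transactions = [{"I1", "I2"}, {"I2", "I4"}, {"I1", "I2", "I5"}, {"I2", "I3"}]
min_sup = 2

def support(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

I = {"I4"}
print(support(I))            # 1 < min_sup, so I is not frequent
print(support(I | {"I1"}))   # 0 -- a superset of I can never have higher support
```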
 Using the apriori property in the algorithm:
 Let us look at how Lk-1 is used to find Lk, for k>=2
 Two steps:
 Join
 To find Lk, a set of candidate k-itemsets Ck is generated by joining Lk-1 with
itself
 The items within a transaction or itemset are sorted in lexicographic order
 For a (k-1)-itemset li: li[1] < li[2] < … < li[k-1]
 Members of Lk-1 are joinable if their first (k-2) items are in common
 Members l1, l2 of Lk-1 are joined if (l1[1]=l2[1]) and (l1[2]=l2[2]) and … and
(l1[k-2]=l2[k-2]) and (l1[k-1]<l2[k-1]) – no duplicates
 The resulting itemset formed by joining l1 and l2 is l1[1], l1[2],…, l1[k-2], l1[k-
1], l2[k-1]
 Prune
 Ck is a superset of Lk; Lk contains those candidates from Ck that are
frequent
 Scanning the database to determine the count of each candidate in Ck –
heavy computation
 To reduce the size of Ck, the Apriori property is used: if any (k-1)-subset of a
candidate k-itemset is not in Lk-1, then the candidate cannot be frequent
either, so it can be removed from Ck – subset testing (hash tree)
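The join and prune steps can be sketched in Python as follows (itemsets are represented as lexicographically sorted tuples; the name apriori_gen is only illustrative):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets Ck from the frequent (k-1)-itemsets L_prev."""
    L_prev = sorted(L_prev)                    # itemsets as lexicographically sorted tuples
    candidates = []
    # Join step: merge two (k-1)-itemsets whose first (k-2) items agree.
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            l1, l2 = L_prev[i], L_prev[j]
            if l1[:k - 2] == l2[:k - 2] and l1[k - 2] < l2[k - 2]:
                candidates.append(l1 + (l2[k - 2],))
    # Prune step: drop a candidate if any of its (k-1)-subsets is not in L_prev.
    frequent = set(L_prev)
    return [c for c in candidates
            if all(s in frequent for s in combinations(c, k - 1))]
```

For the L2 listed in the example below, apriori_gen(L2, 3) yields exactly the two candidates {I1, I2, I3} and {I1, I2, I5}.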
Example:
TID List of item_IDs
T100 I1, I2, I5
T200 I2, I4
T300 I2, I3
T400 I1, I2, I4
T500 I1, I3
T600 I2, I3
T700 I1, I3
T800 I1, I2, I3, I5
T900 I1, I2, I3
 Scan D for the count of each candidate
 C1: I1 – 6, I2 – 7, I3 – 6, I4 – 2, I5 – 2
 Compare each candidate support count with the minimum support count (min_sup = 2)
 L1: I1 – 6, I2 – 7, I3 – 6, I4 – 2, I5 – 2
 Generate C2 candidates from L1 and scan D for count of each candidate
 C2: {I1,I2} – 4, {I1, I3} – 4, {I1, I4} – 1, …
 Compare candidate support count with minimum support count
 L2: {I1,I2} – 4, {I1, I3} – 4, {I1, I5} – 2, {I2, I3} – 4, {I2, I4} - 2, {I2, I5} – 2
 Generate C3 candidates from L2 using the join and prune steps:
 Join: C3=L2xL2={{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4,
I5}}
 Prune: C3: {I1, I2, I3}, {I1, I2, I5}
 Scan D for count of each candidate
 C3: {I1, I2, I3} - 2, {I1, I2, I5} – 2
 Compare candidate support count with minimum support count
 L3: {I1, I2, I3} – 2, {I1, I2, I5} – 2
 Generate C4 candidates from L3
 C4 = L3 x L3 = {{I1, I2, I3, I5}}
 This itemset is pruned because its subset {I2, I3, I5} is not frequent => C4 = ∅
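Putting the level-wise search together for the example database D above (this reuses the apriori_gen sketch from the join/prune step; the variable names are illustrative):

```python
# The example database D from the walkthrough above, with min_sup = 2.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
min_sup = 2

def count_support(candidates):
    """One scan of D: count how many transactions contain each candidate itemset."""
    return {c: sum(1 for t in D if set(c) <= t) for c in candidates}

counts = count_support([(i,) for i in sorted({i for t in D for i in t})])   # C1
L = {c for c, n in counts.items() if n >= min_sup}                          # L1
frequent = {c: counts[c] for c in L}
k = 2
while L:
    counts = count_support(apriori_gen(L, k))   # join + prune, then one scan of D
    L = {c for c, n in counts.items() if n >= min_sup}
    frequent.update({c: counts[c] for c in L})
    k += 1

print(frequent[("I1", "I2", "I5")])             # 2, as found in the walkthrough
```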
Generating association rules from
frequent itemsets
Finding the frequent itemsets from
transactions in a database D
Generating strong association rules:
Confidence(A => B) = P(B|A) =
support_count(A ∪ B) / support_count(A)
support_count(A ∪ B) – the number of transactions
containing the itemset A ∪ B
support_count(A) – the number of transactions containing
the itemset A
 for every nonempty subset s of l, output the rule s => (l - s) if
support_count(l) / support_count(s) >= min_conf
 Example:
 let l = {I1, I2, I5}
 The nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}
 Generating association rules:
 I1 and I2 => I5, conf = 2/4 = 50%
 I1 and I5 => I2, conf = 2/2 = 100%
 I2 and I5 => I1, conf = 2/2 = 100%
 I1 => I2 and I5, conf = 2/6 = 33%
 I2 => I1 and I5, conf = 2/7 = 29%
 I5 => I1 and I2, conf = 2/2 = 100%
If min_conf is 70%, then only the second, third and last rules above
are output.
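A sketch of this rule-generation loop for l = {I1, I2, I5}, with the support counts taken from the example above (the dictionary below just restates those counts):

```python
from itertools import combinations

# Support counts copied from the example walkthrough.
support_count = {
    frozenset({"I1", "I2", "I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2, frozenset({"I2", "I5"}): 2,
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
}
min_conf = 0.7

l = frozenset({"I1", "I2", "I5"})
for r in range(1, len(l)):
    for s in map(frozenset, combinations(l, r)):     # every nonempty proper subset s of l
        conf = support_count[l] / support_count[s]   # confidence(s => l - s)
        if conf >= min_conf:
            print(f"{set(s)} => {set(l - s)}  conf = {conf:.0%}")
```

With min_conf = 70% this prints only the three rules with 100% confidence, matching the result above.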
Improving the efficiency of Apriori
 Hash-based technique – to reduce the size of the set of candidate k-itemsets, Ck, for k > 1
 Generate all of the 2-itemsets of each transaction and hash them into the buckets of a
hash table structure (see the sketch after this list)
 H(x, y) = ((order of x) × 10 + (order of y)) mod 7
 Transaction reduction – a transaction that does not contain any frequent k-itemsets
cannot contain any frequent (k+1)-itemsets
 Partitioning – partitioning the data to find candidate itemsets
 Sampling – mining on a subset of the given data
 searching for frequent itemsets in a subset S instead of D
 a lower support threshold is used when mining the sample
 Dynamic itemset counting – adding candidate itemsets at different points during a
scan
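A small sketch of the hash-based technique for 2-itemsets, using the bucket function H(x, y) given above (the item ordering I1 → 1, …, I5 → 5 and the reuse of D from the earlier sketch are assumptions):

```python
from itertools import combinations

order = {"I1": 1, "I2": 2, "I3": 3, "I4": 4, "I5": 5}   # assumed item ordering
buckets = [0] * 7                                       # 7 buckets, as in H(x, y) mod 7

def h(x, y):
    """H(x, y) = ((order of x) * 10 + (order of y)) mod 7."""
    return (order[x] * 10 + order[y]) % 7

for t in D:                                             # D as defined in the earlier sketch
    for x, y in combinations(sorted(t), 2):             # every 2-itemset of the transaction
        buckets[h(x, y)] += 1

# Any 2-itemset whose bucket count is below min_sup cannot be frequent,
# so it can be removed from C2 before the exact counting scan.
print(buckets)
```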
Mining Frequent Itemsets without
candidate generation
The candidate generate-and-test method:
Reduces the size of the candidate sets
Good performance
It may need to generate a huge number of
candidate sets
It may need to repeatedly scan the database
and check a large set of candidates by pattern
matching
Frequent-pattern growth method (FP-growth) –
uses a frequent-pattern tree (FP-tree)
Example:
 I5
 (I2, I1, I5:1)
 (I2, I1, I3, I5:1)
 I5 is a suffix, so the two prefixes are
 (I2, I1:1)
 (I2, I1, I3:1)
 Conditional FP-tree: (I2:2, I1:2); I3 is removed because its count is < 2 (min_sup)
 The combinations of frequent patterns:
 {I2,I5:2}
 {I1,I5:2}
 {I2, I1, I5:2}
 For I4 there exist 2 prefix paths:
 {{I2, I1:1}, {I2:1}}
 Generation of the conditional FP-tree:
 (I2:2)
 The frequent pattern: {I2, I4:2}
Item   Conditional Pattern Base         Conditional FP-tree      Frequent Patterns Generated
I5     {{I2, I1:1}, {I2, I1, I3:1}}     (I2:2, I1:2)             {I2, I5:2}, {I1, I5:2}, {I2, I1, I5:2}
I4     {{I2, I1:1}, {I2:1}}             (I2:2)                   {I2, I4:2}
I3     {{I2, I1:2}, {I2:2}, {I1:2}}     (I2:4, I1:2), (I1:2)     {I2, I3:4}, {I1, I3:4}, {I2, I1, I3:2}
I1     {{I2:4}}                         (I2:4)                   {I2, I1:4}
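The conditional pattern bases in the table can be reproduced with a short sketch that works directly on the frequency-ordered transactions instead of an explicit FP-tree (a simplification; D and min_sup are reused from the earlier sketch, and the function names are illustrative):

```python
from collections import Counter

# Order frequent items by descending support (ties broken alphabetically), as in the FP-tree.
item_count = Counter(i for t in D for i in t)
f_order = sorted((i for i, n in item_count.items() if n >= min_sup),
                 key=lambda i: (-item_count[i], i))      # I2, I1, I3, I4, I5

def ordered(t):
    """Frequent items of a transaction in FP-tree (descending support) order."""
    return [i for i in f_order if i in t]

def conditional_pattern_base(item):
    """Prefix paths ending in `item`; equal prefixes are merged, as the FP-tree would do."""
    base = Counter()
    for t in D:
        path = ordered(t)
        if item in path:
            prefix = tuple(path[:path.index(item)])
            if prefix:
                base[prefix] += 1
    return base

print(conditional_pattern_base("I5"))   # prefixes ('I2','I1'):1 and ('I2','I1','I3'):1, as in the I5 row
print(conditional_pattern_base("I3"))   # prefixes ('I2','I1'):2, ('I2',):2, ('I1',):2, as in the I3 row
```

Mining each conditional base (here every base reduces to a single path, so the frequent patterns are all combinations of its frequent items with the suffix) yields the patterns in the last column of the table.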
Mining frequent itemsets using vertical
data format
 Transforming the horizontal data format of the
transaction database D into a vertical data
format:
Itemset TID_set
I1 {T100, T400, T500, T700, T800, T900}
I2 {T100, T200, T300, T400, T600, T800, T900}
I3 {T300, T500, T600, T700, T800, T900}
I4 {T200, T400}
I5 {T100, T800}
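The transformation and the support counting by TID-set intersection can be sketched as follows (D and the TID labels are taken from the earlier example):

```python
from collections import defaultdict

tids = ["T100", "T200", "T300", "T400", "T500", "T600", "T700", "T800", "T900"]

# Horizontal -> vertical: map each item to the set of TIDs of transactions containing it.
vertical = defaultdict(set)
for tid, t in zip(tids, D):          # D as defined in the earlier sketch
    for item in t:
        vertical[item].add(tid)

# The support of a k-itemset is the size of the intersection of its TID sets,
# so no further scans of the transactions are needed.
print(sorted(vertical["I4"]))                  # ['T200', 'T400']
print(len(vertical["I1"] & vertical["I2"]))    # 4 = support({I1, I2})
```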
Thank you
