
PAPER ON: DATA MINING

K. SOWMYA
CSE-II
E-MAIL ID: sowmya_511@hotmail.com

A. SRI ASHA JYOTHI
CSE-II
E-MAIL ID: asha_544@yahoo.co.in
CLOSET: An Efficient Algorithm for Mining Frequent
Closed Itemsets
ABSTRACT
Association mining may often derive an undesirably large set of frequent itemsets and
association rules. Recent studies have proposed an interesting alternative: mining frequent closed
itemsets and their corresponding rules, which has the same power as association mining but
substantially reduces the number of rules to be presented.
CLOSET is an efficient algorithm for mining frequent closed itemsets, developed with
three techniques:
(1) applying a compressed frequent-pattern tree (FP-tree) structure for mining closed
itemsets without candidate generation,
(2) developing a single prefix path compression technique to identify frequent closed
itemsets quickly, and
(3) exploring a partition-based projection mechanism for scalable mining in large
databases. CLOSET is efficient and scalable over large databases and is faster
than previously proposed methods.

INTRODUCTION:

Overview of data mining:

Data mining (sometimes called data or knowledge discovery) is the process of analyzing data
from different perspectives and summarizing it into useful information - information that can be
used to increase revenue, cut costs, or both. Data mining software is one of a number of
analytical tools for analyzing data. It allows users to analyze data from many different
dimensions or angles, categorize it, and summarize the relationships identified. Technically, data
mining is the process of finding correlations or patterns among dozens of fields in large relational
databases.

Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most important
information in their data warehouses. Data mining tools predict future trends and behaviors,
allowing businesses to make proactive, knowledge-driven decisions. The automated, prospective
analyses offered by data mining move beyond the analyses of past events provided by
retrospective tools typical of decision support systems. Data mining tools can answer business
questions that traditionally were too time consuming to resolve. They scour databases for hidden
patterns, finding predictive information that experts may miss because it lies outside their
expectations.
Figure 1: Diagram of the data mining technique (raw information passes through data
mining to yield hidden information patterns)

Data mining is ready for application in the business community because it is supported by three
technologies that are now sufficiently mature:

• Massive data collection


• Powerful multiprocessor computers
• Data mining algorithms

The data mining process is not a simple function; it often involves a variety of feedback
loops, since while applying a particular technique the user may determine that the selected
data is of poor quality, or that the applied techniques did not produce results of the expected
quality. In such cases, the user has to repeat and refine earlier steps, possibly even restarting
the entire process from the beginning. This is best illustrated in the following figure.

Figure 2: The data mining process

One of the data mining algorithms is the CLOSET algorithm.


Introduction to CLOSET algorithm:

It has been well recognized that frequent pattern mining plays an essential role in
many important data mining tasks, e.g., associations, sequential patterns, episodes, partial
periodicity, etc. However, it is also well known that frequent pattern mining often generates a
very large number of frequent itemsets and rules, which reduces not only efficiency but also
effectiveness of mining since users have to sift through a large number of mined rules to find
useful ones.
There is an interesting alternative: instead of mining the complete set of frequent itemsets
and their associations, mine only the frequent closed itemsets and their corresponding rules.
An important implication is that mining frequent closed itemsets has the same power as
mining the complete set of frequent itemsets, but it substantially reduces the redundant rules
to be generated and increases both the efficiency and effectiveness of mining.

Definition: Association rule mining searches for interesting relationships among items in a given
data set.

Interestingness Measures:

Certainty: Each discovered pattern should have a measure of certainty associated with it that
assesses the validity or “trustworthiness” of the pattern. A certainty measure for association rules
of the form “A⇒B”, where A and B are sets of items, is confidence. Given a set of task-relevant
data tuples, the confidence of “A⇒B” is defined as

Confidence(A⇒B) = (# tuples containing both A and B) / (# tuples containing A)

Utility: The potential usefulness of a pattern is a factor defining its interestingness. It can be
estimated by a utility function, such as support. The support of an association pattern refers to the
percentage of task-relevant data tuples for which the pattern is true. For association rules of the
form “A⇒B”, where A and B are sets of items, it is defined as

Support(A⇒B) = (# tuples containing both A and B) / (total # of tuples)

Association rules that satisfy both a user-specified minimum confidence threshold and a user-
specified minimum support threshold are referred to as Strong Association Rules.
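The two measures can be computed directly. A minimal sketch, using a hypothetical four-transaction database (the item names and helper functions below are ours, not from the paper):

```python
# Hypothetical toy database: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    # Fraction of transactions containing every item of `itemset`.
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(A, B, db):
    # support(A ∪ B) / support(A): how often B occurs when A does.
    return support(A | B, db) / support(A, db)

print(support({"bread", "milk"}, transactions))       # 0.5 (2 of 4 tuples)
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666... (2 of 3)
```

With a minimum support of 0.4 and a minimum confidence of 0.6, the rule bread⇒milk would be a strong association rule in this toy database.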

Association Rule Mining is a two step process:


1. Find all frequent itemsets: each of these itemsets will occur at least as frequently as a
predetermined minimum support count.
2. Generate strong association rules from the frequent itemsets: These rules must satisfy
minimum support and minimum confidence.
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation:

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules.
The name of the algorithm is based on the fact that the algorithm uses prior knowledge of
frequent itemset properties. Apriori employs an iterative approach known as a level-wise search
where k-itemsets are used to explore (k+1)-itemsets. The finding of each Lk requires one full scan
of the database.

Two step process of finding frequent items:

The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to
the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are
sorted in lexicographic order. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are
joinable if their first (k-2) items are in common. That is, members l1 and l2 are joined if
(l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-2]=l2[k-2]) ∧ (l1[k-1]<l2[k-1]). The condition l1[k-1]<l2[k-1]
simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2
is l1[1] l1[2] … l1[k-1] l2[k-1].

The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of
the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each
candidate in Ck would result in the determination of Lk. Ck, however, can be huge, and so this
could involve heavy computation. To reduce the size of Ck, the Apriori property is used as
follows: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, and so
can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of
all frequent itemsets.
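The join and prune steps can be sketched as a single candidate-generation routine (the name apriori_gen and the Python formulation are ours; the sample L2 anticipates the worked example that follows, with a minimum support count of 2):

```python
def apriori_gen(L_prev, k):
    # Join step: two sorted (k-1)-itemsets are joinable when their first
    # k-2 items agree; l1[k-2] < l2[k-2] prevents duplicate candidates.
    # Prune step: drop any candidate with an infrequent (k-1)-subset.
    L_prev = sorted(tuple(sorted(x)) for x in L_prev)
    frequent = set(L_prev)
    Ck = []
    for i in range(len(L_prev)):
        for j in range(i + 1, len(L_prev)):
            l1, l2 = L_prev[i], L_prev[j]
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:        # join
                cand = l1 + (l2[-1],)
                if all(cand[:m] + cand[m + 1:] in frequent    # prune
                       for m in range(k)):
                    Ck.append(cand)
    return Ck

# L2 of the worked example below:
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))  # [('I1', 'I2', 'I3'), ('I1', 'I2', 'I5')]
```

Only (I1, I2, I3) and (I1, I2, I5) survive the prune step; every other joined candidate contains an infrequent 2-subset such as (I3, I5) or (I4, I5).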

Example:-
TID List Of Item_IDs
T100 I1,I2,I5
T200 I2,I4
T300 I2,I3
T400 I1,I2,I4
T500 I1,I3
T600 I2,I3
T700 I1,I3
T800 I1,I2,I3,I5
T900 I1,I2,I3

1) In the first iteration of the algorithm, each item is a member of the set of candidate 1-
itemsets, C1. The algorithm simply scans all of the transactions in order to count the
number of occurrences of each item.
2) Suppose that the minimum transaction support count required is 2. The set of frequent 1-
itemsets, L1, can then be determined. It consists of the candidate 1-itemsets satisfying
minimum support.
3) To discover the set of frequent 2-itemsets, L2, the algorithm uses the join L1 ⋈ L1 to
generate a candidate set of 2-itemsets, C2.
In this way we find candidate sets until a candidate set is empty.
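The full level-wise loop over this transaction table can be sketched as follows (a simplification: the join here takes naive pairwise unions instead of the prefix-based join, which is less efficient but yields the same candidates):

```python
# The transaction database from the table above.
db = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
MIN_SUP = 2

def count(itemset):
    # Support count: number of transactions containing `itemset`.
    return sum(1 for t in db.values() if itemset <= t)

# Level-wise search: L1 from a first scan, then C_k built from L_{k-1}.
items = sorted({i for t in db.values() for i in t})
L = [frozenset([i]) for i in items if count(frozenset([i])) >= MIN_SUP]
all_frequent = {s: count(s) for s in L}
k = 2
while L:
    # Naive join: unions of two (k-1)-itemsets that form a k-itemset.
    Ck = {a | b for a in L for b in L if len(a | b) == k}
    L = [c for c in Ck if count(c) >= MIN_SUP]  # scan DB, keep frequent ones
    all_frequent.update({c: count(c) for c in L})
    k += 1

print(len(all_frequent))  # 13 frequent itemsets in total
```

With min_sup = 2 this yields five frequent 1-itemsets, six 2-itemsets, and two 3-itemsets ({I1, I2, I3} and {I1, I2, I5}); the candidate set of 4-itemsets is infrequent, so the loop stops.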

Generating Association Rules from Frequent Itemsets


Once the frequent itemsets from transactions in a database D have been found, it is
straightforward to generate strong association rules from them. This can be done using the
following equation for confidence, where the conditional probability is expressed in terms of
the itemset support count:

Confidence(A⇒B) = support count(A∪B) / support count(A)

where support count(A∪B) is the number of transactions containing the itemset A∪B, and
support count(A) is the number of transactions containing the itemset A. Based on this equation,
association rules can be generated as follows:

• For each frequent itemset l, generate all nonempty subsets of l.

• For every nonempty subset s of l, output the rule “s⇒(l-s)” if
support count(l) / support count(s) ≥ min_conf,
where min_conf is the minimum confidence threshold.

Since the rules are generated from frequent itemsets, each one automatically satisfies minimum
support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so
that they can be accessed quickly.
Eg:- Suppose the data contain the frequent itemset l= {I1,I2,I5}. What are the association rules
that can be generated from l?
The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting
association rules are as shown below, each listed with its confidence.
I1∧I2⇒I5, confidence = 2/4 = 50%
I1∧I5⇒I2, confidence = 2/2 = 100%
I2∧I5⇒I1, confidence = 2/2 = 100%
I1⇒I2∧I5, confidence = 2/6 = 33%
I2⇒I1∧I5, confidence = 2/7 = 29%
I5⇒I1∧I2, confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70% then only the second, third and last rules above
are output, since these are the only ones generated that are strong.
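Rerunning the rule-generation procedure above on the same transaction table reproduces the listed confidences (a sketch; the function names are ours):

```python
from itertools import combinations

# Transaction table from the Apriori example above.
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def sup_count(itemset):
    return sum(1 for t in db if itemset <= t)

def rules_from(l, min_conf):
    # Emit s => (l - s) for every nonempty proper subset s of l whose
    # confidence sup_count(l) / sup_count(s) reaches min_conf.
    l = frozenset(l)
    out = []
    for r in range(1, len(l)):
        for s in combinations(sorted(l), r):
            s = frozenset(s)
            conf = sup_count(l) / sup_count(s)
            if conf >= min_conf:
                out.append((set(s), set(l - s), conf))
    return out

for lhs, rhs, conf in rules_from({"I1", "I2", "I5"}, 0.7):
    print(sorted(lhs), "=>", sorted(rhs), f"{conf:.0%}")
# ['I5'] => ['I1', 'I2'] 100%
# ['I1', 'I5'] => ['I2'] 100%
# ['I2', 'I5'] => ['I1'] 100%
```

Exactly the three strong rules listed above pass the 70% threshold, each with 100% confidence.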

Mining Frequent Itemsets without Candidate Generation

The Apriori algorithm suffers from two non-trivial costs:


1) It may need to generate a huge number of candidate sets.
2) It may need to repeatedly scan the database and check large sets of candidates by pattern
matching.
An interesting method that mines the complete set of frequent itemsets without candidate
generation is called frequent-pattern growth, or simply FP-growth, which adopts a divide-and-
conquer strategy as follows: compress the database representing frequent items into a set of
conditional databases, each associated with one frequent item, and mine each such database
separately.

Let us re-examine the mining of the transaction database above, using the frequent-pattern growth approach.

The first scan of the database is the same as in Apriori, which derives the set of frequent items
(1-itemsets) and their support counts. Let the minimum support count be 2. The set of frequent
items is sorted in the order of descending support count. The resulting set or list is denoted L.
Thus, we have L = [I2: 7, I1: 6, I3: 6, I4: 2, I5: 2].

An FP-tree is then constructed as follows. First, create the root of the tree, labeled with
“null”. Scan database D a second time. The items in each transaction are processed in L order,
and a branch is created for each transaction. For example, the scan of the first transaction,
“T100: I1, I2, I5”, which contains three items (I2, I1, I5 in L order), leads to the construction of
the first branch of the tree with three nodes, <(I2:1), (I1:1), (I5:1)>, where I2 is linked as a child
of the root, I1 is linked to I2, and I5 is linked to I1. The second transaction, T200, contains the
items I2 and I4 in L order, which would result in a branch where I2 is linked to the root and I4 is
linked to I2. However, this branch shares a common prefix, I2, with the existing path for T100.
Therefore, we instead increment the count of the I2 node by 1, giving (I2:2). In general, when
considering the branch to be added for a transaction, the count of each node along a common
prefix is incremented by 1, and nodes for the items following the prefix are created and linked
accordingly.
To facilitate tree traversal, an item header table is built so that each item points to its
occurrences in the tree via a chain of node-links. The tree obtained after scanning all of the
transactions is then mined; part of the result is summarized in the table below.

Item Conditional pattern base Conditional FP-tree Frequent patterns generated


I5 {(I2 I1: 1) , (I2 I1 I3: 1)} <I2:2, I1: 2> I2 I5: 2, I1 I5: 2, I2 I1 I5: 2
I4 {(I2 I1: 1) , (I2: 1)} <I2: 2> I2 I4: 2

The mining of the FP-tree proceeds as follows. Starting from each frequent length-1 pattern,
construct its conditional pattern base (a sub-database which consists of the set of prefix paths in
the FP-tree co-occurring with the suffix pattern), then construct its (conditional) FP-tree and
perform mining recursively on such a tree. The pattern growth is achieved by concatenating the
suffix pattern with the frequent patterns generated from the conditional FP-tree.
Let us first consider I5, which is the last item in L, rather than the first. The reasoning
behind this will become apparent as we explain the FP-tree mining process. I5 occurs in two
branches of the FP-tree. The paths formed by these branches are <(I2, I1, I5: 1)> and <(I2, I1,
I3, I5: 1)>. Therefore, considering I5 as a suffix, its corresponding two prefix paths are <(I2, I1: 1)>
and <(I2, I1, I3: 1)>, which form its conditional pattern base. Its conditional FP-tree contains
only a single path, <(I2: 2, I1: 2)>; I3 is not included because its support count of 1 is less than the
minimum support count. The single path generates all the combinations of frequent patterns: I2
I5: 2, I1 I5: 2, I2 I1 I5: 2.

In the same way find the frequent itemsets for all other Items. The FP-growth method
transforms the problem of finding long frequent patterns to looking for shorter ones recursively
and then concatenating the suffix. It uses the least frequent items as suffix, offering good
selectivity. The method substantially reduces the search costs.
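A compact sketch of FP-tree construction and of deriving I5's conditional pattern base, on the same nine transactions (the class and function names are ours; node-links are kept in a plain dictionary rather than chained through the nodes):

```python
from collections import defaultdict

# The nine transactions of the running example.
db = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"}, {"I1", "I2", "I4"},
    {"I1", "I3"}, {"I2", "I3"}, {"I1", "I3"}, {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]
MIN_SUP = 2

# Frequent items in support-descending order (the list L in the text).
counts = defaultdict(int)
for t in db:
    for i in t:
        counts[i] += 1
f_list = sorted((i for i in counts if counts[i] >= MIN_SUP),
                key=lambda i: (-counts[i], i))
print(f_list)  # ['I2', 'I1', 'I3', 'I4', 'I5']

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

root = Node(None, None)
node_links = defaultdict(list)  # header table: item -> its nodes in the tree
for t in db:
    node = root
    for i in [x for x in f_list if x in t]:  # process items in L order
        if i not in node.children:
            node.children[i] = Node(i, node)
            node_links[i].append(node.children[i])
        node = node.children[i]
        node.count += 1  # nodes on a shared prefix just gain count

def conditional_pattern_base(item):
    # Prefix paths co-occurring with `item`, each carrying item's count there.
    base = []
    for node in node_links[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((tuple(reversed(path)), node.count))
    return base

print(conditional_pattern_base("I5"))
# [(('I2', 'I1'), 1), (('I2', 'I1', 'I3'), 1)]
```

The output matches the table above: I5's conditional pattern base is {(I2 I1: 1), (I2 I1 I3: 1)}, from which the single-path conditional FP-tree <I2: 2, I1: 2> and the patterns I2 I5: 2, I1 I5: 2, I2 I1 I5: 2 follow.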
PROBLEM DEFINITION:

An itemset X is contained in transaction <tid, Y> if X ⊆ Y. Given a transaction database
TDB, the support of an itemset X, denoted as sup(X), is the number of transactions in TDB
which contain X. An association rule R: X⇒Y is an implication between two itemsets X and Y,
where X, Y ⊂ I and X ∩ Y = ∅. The support of the rule, denoted as sup(X⇒Y), is defined as
sup(X∪Y). The confidence of the rule, denoted as conf(X⇒Y), is defined as sup(X∪Y)/sup(X).

The requirement of mining the complete set of association rules leads to two problems:
1) There may exist a large number of frequent itemsets in a transaction database, especially
when the support threshold is low.
2) There may exist a huge number of association rules. It is hard for users to comprehend
and manipulate a huge number of rules.
An interesting alternative to this problem is the mining of frequent closed itemsets and their
corresponding association rules.

Frequent closed itemset: An itemset X is a closed itemset if there exists no itemset X’ such that
(1) X’ is a proper superset of X, and (2) every transaction containing X also contains X’. A closed
itemset X is frequent if its support passes the given support threshold.
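The definition can be tested mechanically: X is closed exactly when X equals the intersection of all transactions containing it. A minimal sketch over a hypothetical toy database:

```python
def is_closed(X, db):
    # X is closed iff it equals the intersection of all transactions
    # that contain it (no proper superset occurs in the same transactions).
    X = frozenset(X)
    covering = [frozenset(t) for t in db if X <= t]
    if not covering:
        return False
    return frozenset.intersection(*covering) == X

# Hypothetical toy database:
db = [{"c", "e", "f"}, {"c", "e", "f"}, {"a", "e"}]
print(is_closed({"c"}, db))            # False: every transaction with c also has e and f
print(is_closed({"c", "e", "f"}, db))  # True
```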

The problem of finding the complete set of frequent closed itemsets efficiently from a large
database is called the frequent closed itemset mining problem.

For the transaction database in Table 1 with min_sup = 2, the divide-and-conquer method for
mining frequent closed itemsets proceeds as follows.

1) Find frequent items. Scan TDB to find the set of frequent items and derive a global
frequent item list, called f_list; here f_list = {c:4, e:4, f:4, a:3, d:2}, where the items are
sorted in support-descending order and any infrequent item, such as b, is omitted.

2) Divide the search space. All the frequent closed itemsets can be divided into 5 non-
overlapping subsets based on the f_list: (1) the ones containing item d, (2) the ones
containing item a but not d, (3) the ones containing item f but neither a nor d, (4) the
ones containing item e but none of f, a, or d, and (5) the one containing only c. Once all
subsets are found, the complete set of frequent closed itemsets is done.
3) Find subsets of frequent closed itemsets. The subsets of frequent closed itemsets can be
mined by constructing the corresponding conditional databases and mining each recursively.

Find frequent closed itemsets containing d. Only transactions containing d are needed.
The d-conditional database, denoted as TDB|d, contains all the transactions having d,
which is {cefa, cfa}. Notice that item d is omitted in each transaction, since it appears in
every transaction in the d-conditional database.
The support of d is 2. Items c, f and a each appear twice in TDB|d. Therefore,
cfad: 2 is a frequent closed itemset. Since this itemset covers every frequent item in
TDB|d, the mining of TDB|d finishes.

In the same way, find the frequent closed itemsets for a, f, e, and c.

4) The set of frequent closed itemsets found is {acdf: 2, a: 3, ae: 2, cf: 4, cef: 3, e: 4}
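The result in step 4 can be reproduced by brute force rather than by CLOSET's divide-and-conquer. Since Table 1 is not reproduced in this text, the five transactions below are an assumption, chosen only to be consistent with the stated f_list and with the final answer:

```python
from itertools import combinations

# Assumed transactions, consistent with the stated f_list (c:4, e:4, f:4,
# a:3, d:2, b infrequent); the paper's Table 1 is not shown here.
TDB = [{"a", "c", "d", "e", "f"}, {"a", "b", "e"}, {"c", "e", "f"},
       {"a", "c", "d", "f"}, {"c", "e", "f"}]
MIN_SUP = 2

def closed_frequent(tdb, min_sup):
    # Brute force: an itemset is closed iff it equals the intersection
    # of the transactions that contain it.
    result = {}
    items = sorted({i for t in tdb for i in t})
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            cand = frozenset(cand)
            covering = [frozenset(t) for t in tdb if cand <= t]
            if len(covering) < min_sup:
                continue
            if frozenset.intersection(*covering) == cand:  # closure test
                result["".join(sorted(cand))] = len(covering)
    return result

print(closed_frequent(TDB, MIN_SUP))
```

The returned dictionary matches step 4: {acdf: 2, a: 3, ae: 2, cf: 4, cef: 3, e: 4}. CLOSET reaches the same set far more efficiently by projecting conditional databases instead of enumerating all candidate itemsets.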

Optimization 1: Compress transactional and conditional databases using FP-tree structures. The
FP-tree compresses databases for frequent itemset mining, and conditional databases can be
derived from an FP-tree efficiently.

Optimization 2: Extract items appearing in every transaction of conditional databases.

Optimization 3: Directly extract frequent closed itemsets from FP-tree.

Optimization 4: Prune search branches.


PERFORMANCE STUDY

In a comparison of A-close, CHARM, and CLOSET, CLOSET outperforms both CHARM
and A-close. CLOSET is efficient and scalable in mining frequent closed itemsets in large
databases; it is much faster than A-close, and also faster than CHARM.

CONCLUSION

CLOSET leads to fewer and more interesting association rules than the other previously
proposed methods.
