CP-Tree: A Tree Structure for Single-Pass Frequent Pattern Mining
Abstract. The FP-growth algorithm, built on the FP-tree, has been widely studied for
frequent pattern mining because it gives a large performance improvement over the
candidate generation-and-test paradigm of Apriori. However, it still requires two
database scans, which makes it unsuitable for processing data streams. In this paper,
we present a novel tree structure, called CP-tree (Compact Pattern tree), that captures
database information with one scan (Insertion phase) and provides the same mining
performance as the FP-growth method through a dynamic tree restructuring process
(Restructuring phase). Moreover, CP-tree provides full functionality for interactive
and incremental mining. Extensive experimental results show that the CP-tree is
efficient for frequent pattern mining, interactive mining, and incremental mining with
a single database scan.
1 Introduction
Finding frequent patterns (or itemsets) plays an essential role in data mining and
knowledge discovery techniques such as association rules, classification, and
clustering. A large number of research works [1], [7], [5], [3] have presented new
algorithms or improvements on existing algorithms to solve the frequent pattern
mining problem more efficiently. The FP-tree based FP-growth mining technique proposed
by Han et al. [5] has been found to be one of the efficient algorithms using the
prefix-tree data structure. The performance gain achieved by FP-growth is predominantly
based on the highly compact nature of the FP-tree, which stores only the frequent items
in frequency-descending order. During mining, this item arrangement not only avoids the
global infrequent node deletion process for each conditional tree but also reduces the
search space for finding the next frequent item in the item list to one item. However,
constructing such an FP-tree requires two database scans and prior knowledge of the
support threshold, which are the key limitations of applying the FP-tree to data stream
environments, incremental mining, and interactive mining.
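To make the two-scan requirement concrete, the following is a minimal sketch of the preprocessing that an FP-tree style construction needs before any transaction can be inserted. The function name `fp_tree_insertion_order`, the tie-breaking rule (alphabetical) and the sample transactions are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter


def fp_tree_insertion_order(transactions, min_sup):
    """Return each transaction reduced to frequent items in frequency-descending order.

    Hypothetical helper: it only illustrates why two passes and a known
    support threshold are needed before an FP-tree can be built.
    """
    # Scan 1: count the support of every item.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in support.items() if c >= min_sup}

    # Scan 2: drop infrequent items and sort the rest by descending support.
    # Ties are broken alphabetically (an assumption, any fixed rule would do).
    ordered = []
    for t in transactions:
        kept = [item for item in set(t) if item in frequent]
        kept.sort(key=lambda item: (-frequent[item], item))
        ordered.append(kept)
    return ordered


# Illustrative transactions (not the paper's Fig. 1(a)).
db = [["c", "a", "e"], ["a", "b", "d"], ["b", "d", "f"]]
ordered_db = fp_tree_insertion_order(db, min_sup=2)
```

Neither pass can start before `min_sup` is known, and the second pass cannot start before the first has finished, which is the obstacle for stream, incremental, and interactive settings.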
The prefix-tree based approach may suffer from the limitation of memory size
when it tries to hold whole database information. However, as currently available
memory sizes exceed gigabytes, several prefix-tree data structures capturing
partial (with an error bound) [4] or whole [6], [3] database information have been
proposed for mining frequent patterns. The AFPIM algorithm [4] performs incremental
mining mainly by adjusting the FP-tree structure; therefore, it requires two database
scans. The CATS tree [6] is a single-pass solution, but it still suffers from a complex
tree construction process. These two limitations are well addressed in CanTree [3],
which captures the complete database information in a prefix-tree structure using a
canonical order of items, in order to facilitate incremental and interactive mining
with the FP-growth mining technique. Although CanTree offers a simple single-pass
construction process, it usually yields poor compaction in tree size compared to the
FP-tree. Because the items in the tree are not stored in frequency-descending order,
it is inefficient in both storage and runtime, resulting in higher mining time.
CP-Tree: A Tree Structure for Single-Pass Frequent Pattern Mining.
T. Washio et al. (Eds.): PAKDD 2008, LNAI 5012, pp. 1022–1027, 2008.
© Springer-Verlag Berlin Heidelberg 2008
In this paper, we propose a novel tree structure, called CP-tree (Compact Pattern
tree), that constructs a compact prefix-tree structure with one database scan and
provides the same mining performance as the FP-growth technique through an efficient
tree restructuring process. Our comprehensive experimental results on both real-life
and synthetic datasets show that frequent pattern mining, interactive mining, and
incremental mining with our CP-tree outperform the state-of-the-art algorithms in
terms of both execution time and memory requirements.
The rest of the paper is organized as follows. Section 2 describes the structure and
restructuring process of CP-tree. We report our experimental results in Section 3.
Finally, Section 4 concludes the paper.
may lead to poor performance. Therefore, restructuring can be initiated (i) after each
user-given fixed-size slot, or (ii) when the combined displacement of the top-K items
in the I-list exceeds a given threshold.
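The two triggering conditions can be sketched as simple predicates. The text does not define how displacement is measured, so the sketch below assumes it is the absolute difference between an item's current I-list position and its position in the frequency-descending order, summed over the top-K items. The function names and parameters are hypothetical.

```python
def slot_elapsed(transactions_since_last, slot_size):
    """Condition (i): a user-given fixed-size slot has been filled."""
    return transactions_since_last >= slot_size


def combined_displacement(i_list_order, counts, k):
    """Sum of position shifts of the top-k items (assumed definition).

    i_list_order: items in their current I-list order.
    counts: mapping from item to its current support count.
    """
    # sorted() is stable, so items with equal counts keep their current order.
    target = sorted(i_list_order, key=lambda item: -counts[item])
    current_pos = {item: pos for pos, item in enumerate(i_list_order)}
    return sum(abs(current_pos[item] - pos) for pos, item in enumerate(target[:k]))


def top_k_displaced(i_list_order, counts, k, threshold):
    """Condition (ii): the combined displacement exceeds a given threshold."""
    return combined_displacement(i_list_order, counts, k) > threshold


# I-list taken from the running example in the text.
order = ["c", "a", "e", "b", "d", "f"]
counts = {"c": 1, "a": 2, "e": 1, "b": 2, "d": 2, "f": 1}
displacement = combined_displacement(order, counts, k=3)
```

Under this assumed definition, items a, b, and d move by 1, 2, and 2 positions respectively, so the combined displacement of the top three items is 5.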
The other performance factor is the tree restructuring mechanism. The existing path
adjusting method (PAM), proposed in [4], sorts the nodes of a prefix-tree using the
bubble sort technique. A node may be split when it needs to be swapped with a child
node whose count is smaller than its own. Otherwise, a simple exchange operation
between the two nodes is performed.
We propose a new tree restructuring technique, called the branch sorting method
(BSM), that, unlike PAM, restructures the tree by sorting the I-list in
frequency-descending order and then sorting the unsorted paths in the tree one after
another. We revisit the prefix-tree of Fig. 1(b), constructed from the first three
transactions of Fig. 1(a), where the I-list order {c:1, a:2, e:1, b:2, d:2, f:1} is
not frequency-descending. To restructure the tree into that order, the I-list is first
sorted to {a:2, b:2, d:2, c:1, e:1, f:1}. Second, tree restructuring starts with the
first path in the first branch, say {c:1→a:1→e:1}. Since the path is not sorted
according to the new I-list order, it is removed from the tree, sorted (using the
merge sort technique) into a temporary array, and then inserted back into the tree in
{a:1→c:1→e:1} order. All unsorted paths in the remaining branches are processed using
the same technique. If a path is found to be already sorted (e.g., the path of the
last branch), it is not sorted again but is merged with the previously processed
common sorted path (if any). Thus, with the processing of the last path, the
restructuring of the tree is completed and we get the frequency-descending tree of
Fig. 1(c).
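The following is a simplified sketch of the idea behind BSM, not the authors' implementation. Instead of removing and reinserting paths inside the same tree, it reads every path from the old tree, sorts it by the new I-list order, and inserts it into a fresh tree. Python's built-in `sorted` (a merge-sort hybrid) stands in for the merge sort named in the text. The `Node` class, the helper names, and the three sample transactions are assumptions chosen to be consistent with the I-list counts above, since Fig. 1(a) is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    item: str = ""
    count: int = 0
    children: dict = field(default_factory=dict)


def insert(root, items, count=1):
    """Insert one path into the prefix-tree, sharing existing prefixes."""
    node = root
    for item in items:
        child = node.children.get(item)
        if child is None:
            child = Node(item)
            node.children[item] = child
        child.count += count
        node = child


def paths(node, prefix=()):
    """Yield (items, count) for every transaction path that ends at a node."""
    for child in node.children.values():
        here = prefix + (child.item,)
        ending = child.count - sum(c.count for c in child.children.values())
        if ending > 0:
            yield here, ending
        yield from paths(child, here)


def restructure(root, i_list):
    """Rebuild the tree so that every path follows frequency-descending order."""
    # Stable sort: items with equal counts keep their previous relative order.
    new_order = sorted(i_list, key=lambda item: -i_list[item])
    rank = {item: r for r, item in enumerate(new_order)}
    new_root = Node()
    for items, count in paths(root):
        insert(new_root, sorted(items, key=rank.__getitem__), count)
    return new_root, new_order


# Hypothetical transactions, already arranged in the old I-list order c, a, e, b, d, f.
i_list = {"c": 1, "a": 2, "e": 1, "b": 2, "d": 2, "f": 1}
old_root = Node()
for transaction in (["c", "a", "e"], ["a", "b", "d"], ["b", "d", "f"]):
    insert(old_root, transaction)

new_root, new_order = restructure(old_root, i_list)
```

In this sketch an already sorted path is simply reinserted unchanged, which has the same effect as merging it with a previously processed common prefix. A real BSM would avoid the second tree by working branch by branch in place.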
The performance of PAM largely depends on the degree of displacement (DD)
among items between the two I-lists, since swapping two nodes takes a bubble sort cost
of $O(n^2)$, where $n$ is the number of nodes between them. BSM, on the other hand,
uses a merge sort approach with a complexity of $O(n \log_2 n)$ ($n$ being the number
of items in a path), so the DD has little effect on its performance. Hence, PAM is not
suitable when the DD is reasonably high, whereas BSM might be a better candidate in
such cases, since it performs almost evenly across variations of DD. Moreover, its
sorted path handling feature reduces not only the number of sorting operations but
also the size of the data to be sorted. In summary, the restructuring step can switch
dynamically between the two methods based on the value of DD.
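As a purely illustrative comparison of the two growth rates quoted above, the values below evaluate both expressions at the same $n$. Note that $n$ denotes different quantities for PAM and BSM in the text, so this shows only how quickly the gap widens, not a measured cost.

```latex
\[
\begin{array}{rrr}
n & n^2 & n \log_2 n \\
\hline
16 & 256 & 16 \cdot 4 = 64 \\
64 & 4096 & 64 \cdot 6 = 384
\end{array}
\]
```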
3 Experimental Results
We performed a comprehensive experimental analysis of the performance of CP-tree
on several synthetic and real datasets. However, due to space constraints, in the
remainder of this section we report only the results on two real dense datasets (chess
and mushroom) and one synthetic sparse dataset (T10I4D100K). All programs are written
in Microsoft Visual C++ 6.0 and run in a time-sharing environment with Windows
1026 S.K. Tanbeer et al.
[Figure: two panels plotting Time (s) against min_sup (%), with min_sup ranging from 0.055 down to 0.035 in one panel and from 0.35 down to 0.15 in the other.]
4 Conclusions
We have proposed the CP-tree, which dynamically achieves a frequency-descending
prefix-tree structure in a single pass by applying a tree restructuring technique and
considerably reduces the mining time. We also proposed the branch sorting method, a
new tree restructuring technique, and presented guidelines for choosing the values of
the tree restructuring parameters. We have shown that, despite the additional but
insignificant tree restructuring cost, CP-tree achieves a remarkable performance gain
in overall runtime. Moreover, its ease of maintenance and its property of constantly
capturing full database information in a highly compact fashion facilitate its
efficient application to interactive, incremental, and stream data mining.
References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. In: SIGMOD, pp. 207–216 (1993)
2. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent Pattern Mining: Current Status and Future Directions. Data Min. Knowl. Disc. 10th Anniversary Issue (2007)
3. Leung, C.K., Khan, Q.I., Li, Z., Hoque, T.: CanTree: A Canonical-Order Tree for Incremental Frequent-Pattern Mining. Knowledge and Information Systems 11(3), 287–311 (2007)
4. Koh, J.-L., Shieh, S.-F.: An Efficient Approach for Maintaining Association Rules Based on Adjusting FP-tree Structures. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 417–424. Springer, Heidelberg (2004)
5. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: International Conference on Management of Data (2000)
6. Cheung, W., Zaïane, O.R.: Incremental Mining of Frequent Patterns without Candidate Generation or Support Constraint. In: Seventh International Database Engineering and Applications Symposium (IDEAS) (2003)
7. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: International Conference on Very Large Databases (1994)