
CP-Tree: A Tree Structure for Single-Pass Frequent Pattern Mining

Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, and Young-Koo Lee

Department of Computer Engineering, Kyung Hee University
1 Seochun-dong, Kihung-gu, Youngin-si, Kyunggi-do, 446-701, Republic of Korea
{tanbeer, farhan, jeong, yklee}@khu.ac.kr

Abstract. The FP-growth algorithm based on the FP-tree has been widely studied for frequent pattern mining because it offers a substantial performance improvement over the candidate generation-and-test paradigm of Apriori. However, it still requires two database scans, which makes it unsuitable for processing data streams. In this paper, we present a novel tree structure, called CP-tree (Compact Pattern tree), that captures database information with one scan (Insertion phase) and provides the same mining performance as the FP-growth method (Restructuring phase) through a dynamic tree restructuring process. Moreover, the CP-tree provides full functionality for interactive and incremental mining. Extensive experimental results show that the CP-tree is efficient for frequent pattern mining, interactive mining, and incremental mining with a single database scan.

Keywords: Data mining, data stream, frequent pattern, association rule.

1 Introduction
Finding frequent patterns (or itemsets) plays an essential role in data mining and knowledge discovery techniques such as association rules, classification, and clustering. A large number of research works [1], [7], [5], [3] have been published presenting new algorithms or improvements on existing algorithms to solve the frequent pattern mining problem more efficiently. The FP-tree-based FP-growth mining technique proposed by Han et al. [5] has been found to be one of the most efficient algorithms that use a prefix-tree data structure. The performance gain achieved by FP-growth is predominantly due to the highly compact nature of the FP-tree, which stores only the frequent items, in frequency-descending order. During mining, this item arrangement not only avoids deleting globally infrequent nodes from each conditional tree but also reduces the search space for finding the next frequent item in the item list to a single item. However, constructing such an FP-tree requires two database scans and prior knowledge of the support threshold, which are the key limitations of applying the FP-tree to data stream environments and to incremental and interactive mining.
The prefix-tree-based approach may suffer from memory limitations when it tries to hold the information of the whole database. However, since currently available memory sizes now reach several gigabytes, several prefix-tree data structures capturing partial (with an error bound) [4] or whole [6], [3] database information have been
proposed for mining frequent patterns. The AFPIM algorithm [4] performs incremental mining mainly by adjusting the FP-tree structure; therefore, it requires two database scans. The CATS tree [6] is a single-pass solution, but it still suffers from a complex tree construction process. These two limitations are well addressed by CanTree [3], which captures the complete database information in a prefix-tree structure using a canonical order of items, thereby facilitating incremental and interactive mining with the FP-growth mining technique. Although CanTree offers a simple single-pass construction process, it usually achieves poorer tree compaction than the FP-tree. It is therefore storage- and runtime-inefficient, incurring a higher mining time, since the items in the tree are not stored in frequency-descending order.
In this paper, we propose a novel tree structure, called CP-tree (Compact Pattern tree), that constructs a compact prefix-tree structure with one database scan and provides the same mining performance as the FP-growth technique through an efficient tree restructuring process. Our comprehensive experimental results on both real-life and synthetic datasets show that frequent pattern mining, interactive mining, and incremental mining with our CP-tree outperform the state-of-the-art algorithms in terms of both execution time and memory requirements.

The rest of the paper is organized as follows. Section 2 describes the structure and restructuring process of the CP-tree. Section 3 reports our experimental results. Finally, Section 4 concludes the paper.

2 Overview of CP-Tree: Construction and Performance Issues


Let L = {i1, i2, …, in} be a set of literals, called items, that have ever been used as a unit of information in an application domain. A set X = {ij, …, ik} ⊆ L (1 ≤ j ≤ k ≤ n) is called a pattern. A transaction T = (tid, Y) is a pair where tid is a transaction-id and Y is a pattern. If X ⊆ Y, we say that T contains X, or that X occurs in T. A transactional database DB over L is a set of transactions, and |DB| denotes the size of DB, i.e., the total number of transactions in DB. The support of a pattern X in DB is the number of transactions in DB that contain X. A pattern is called frequent if its support is no less than a user-given support threshold min_sup = ∂, with 0 ≤ ∂ ≤ |DB|. Given ∂ and a DB, discovering the complete set of frequent patterns in DB, denoted FDB, is called the frequent pattern mining problem.
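As a small illustration of these definitions, the sketch below (Python; the toy database, patterns, and threshold are hypothetical and are not the database of Fig. 1) computes the support of a few patterns against a transactional database:

```python
# Illustrative sketch: support counting over a hypothetical toy transaction database.
db = [  # (tid, Y) pairs; each Y is a set of distinct items
    (10, {"c", "a", "e"}),
    (20, {"a", "b", "d"}),
    (30, {"b", "d", "f"}),
]

def support(pattern, db):
    """Number of transactions in db whose item set contains every item of `pattern`."""
    return sum(1 for _, items in db if pattern <= items)

min_sup = 2                         # the threshold written as min_sup (or ∂) in the text
print(support({"a"}, db))           # 2 -> frequent
print(support({"b", "d"}, db))      # 2 -> frequent
print(support({"c", "e"}, db))      # 1 -> infrequent
```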
We now discuss the preliminaries and the step-by-step construction mechanism of our CP-tree. In general, the CP-tree achieves a frequency-descending structure by capturing data from the database part by part and dynamically restructuring itself after each part using an efficient tree restructuring mechanism. Like the FP-tree, it maintains an item list, called the I-list, to facilitate tree traversal. The construction operation mainly consists of two phases: the Insertion phase, which inserts transactions into the CP-tree (in a manner similar to the FP-tree technique) according to the current sort order of the I-list and updates the frequency counts of the respective items in the I-list; and the Restructuring phase, which rearranges the I-list into frequency-descending order of items and restructures the tree nodes according to the new I-list. The two phases are executed alternately, starting with an Insertion phase (on the first part of DB) and finishing with a Restructuring phase (after the last insertion) at the end of DB.
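A minimal, self-contained sketch of this alternating construction is given below (Python, for illustration only; the authors' implementation is in C++ and is not reproduced here). The restructuring step in this sketch simply extracts every branch and re-inserts it in the new I-list order, which is a simplification of the BSM technique of Sect. 2.1; the first three transactions are chosen to be consistent with the I-list counts stated for Fig. 1(b), and the remaining ones are hypothetical.

```python
class Node:
    """A prefix-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

class CPTree:
    def __init__(self):
        self.root = Node(None)
        self.ilist = {}                         # item -> frequency; dict order = current I-list order

    def rank(self, item):
        """Position of an item in the current I-list."""
        return list(self.ilist).index(item)

    def insert(self, transaction):
        """Insertion phase: update I-list counts, then insert the transaction
        into the tree following the current I-list sort order."""
        for item in transaction:                # items within a transaction are assumed distinct
            self.ilist[item] = self.ilist.get(item, 0) + 1
        node = self.root
        for item in sorted(transaction, key=self.rank):
            node = node.children.setdefault(item, Node(item))
            node.count += 1

    def branches(self, node=None, prefix=()):
        """Yield (path, count) pairs that together reconstruct the tree's content."""
        node = node or self.root
        for child in node.children.values():
            path = prefix + (child.item,)
            ending_here = child.count - sum(c.count for c in child.children.values())
            if ending_here:                     # transactions whose path ends at this node
                yield path, ending_here
            yield from self.branches(child, path)

    def restructure(self):
        """Restructuring phase: sort the I-list by descending frequency and
        rebuild the tree by re-inserting every branch in the new order."""
        paths = list(self.branches())
        self.ilist = dict(sorted(self.ilist.items(), key=lambda kv: -kv[1]))
        self.root = Node(None)
        for path, count in paths:
            node = self.root
            for item in sorted(path, key=self.rank):
                node = node.children.setdefault(item, Node(item))
                node.count += count

# Transactions 10-30 match the I-list counts given for Fig. 1(b); the rest are hypothetical.
db = [["c", "a", "e"], ["a", "b", "d"], ["b", "d", "f"],
      ["a", "b", "d"], ["a", "c", "d"], ["b", "d", "e"]]
tree, slot_size = CPTree(), 3                   # restructure after every three transactions
for i, t in enumerate(db, 1):
    tree.insert(t)
    if i % slot_size == 0 or i == len(db):      # also restructure at the end of DB
        tree.restructure()
print(tree.ilist)                               # frequency-descending I-list
```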

Fig. 1 shows a transaction database and the step-by-step construction procedure of the CP-tree. For simplicity of description, we assume that the Restructuring phase is executed after inserting every three transactions and that the first Insertion phase follows the item-appearance order. For simplicity of the figures, we do not show the node traversal pointers in the tree; however, they are maintained in the same fashion as in the FP-tree.

Fig. 1(b) shows the exact structure of the tree and the I-list after inserting transactions 10, 20, and 30 in item-appearance order. Since the tree is restructured after every three transactions, the first Insertion phase ends here, initiating the first Restructuring phase. The Restructuring phase first rearranges the items in the I-list into frequency-descending order and then restructures the tree according to that order, as shown in Fig. 1(c). Note that items with higher counts are arranged in the upper portion of the tree; therefore, the CP-tree at this stage is a frequency-descending tree. The next Insertion phase (for transactions 40, 50, and 60) follows the new I-list order {a, b, d, c, e, f} instead of the previous order {c, a, e, b, d, f}. Fig. 1(d) and Fig. 1(e) respectively present the trees after the second Insertion phase and Restructuring phase. The final frequency-descending CP-tree, shown in Fig. 1(g), is obtained by performing the Insertion phase and Restructuring phase for the last three transactions.
Fig. 1(h) shows a lexicographic CanTree, which contains more nodes than the CP-tree for the same dataset. Databases usually share common prefix patterns among their transactions; therefore, the size of the CP-tree is usually much smaller than that of its DB and is bounded by the size of DB. Since CanTree does not guarantee a frequency-descending tree, the CP-tree is generally smaller than the CanTree. Once the CP-tree is constructed, FDB can be mined with the FP-growth mining technique for any value of the support threshold ∂ by starting from the bottom-most item in the I-list whose count is ≥ ∂.
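The sketch below (Python; the I-list counts are made up for illustration) shows why this supports interactive mining: in a frequency-descending I-list the frequent items for any ∂ form a prefix of the list, so changing the threshold only changes the starting point of mining and requires no tree rebuild.

```python
# Hypothetical frequency-descending I-list: (item, count) pairs.
ilist = [("a", 4), ("d", 4), ("b", 4), ("c", 3), ("e", 1), ("f", 1)]

def mining_start(ilist, min_sup):
    """Return the frequent items bottom-up, i.e. the order in which
    FP-growth-style mining would process them for the given threshold."""
    frequent = [item for item, count in ilist if count >= min_sup]
    return list(reversed(frequent))         # bottom-most frequent item first

print(mining_start(ilist, 3))               # ['c', 'b', 'd', 'a']
print(mining_start(ilist, 4))               # ['b', 'd', 'a'] -- same tree, higher threshold
```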
Fig. 1. Construction of CP-tree and CanTree

One of the two primary factors affecting the performance of the CP-tree is when to switch to the Restructuring phase: too many or too few restructuring operations may both lead to poor performance. Restructuring can therefore be initiated (i) after each user-given fixed-size slot of transactions, or (ii) when the combined displacement of the top-K items in the I-list exceeds a given threshold.
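The paper does not specify how the combined displacement of criterion (ii) is computed; the sketch below shows one plausible formulation (the I-list, K, and threshold values are hypothetical):

```python
# One possible formulation of trigger (ii): sum, over the current top-K items of a
# frequency-descending ordering, of how far each item sits from that position in
# the current I-list. The numbers below are hypothetical.
def combined_displacement(ilist, k):
    current = [item for item, _ in ilist]                      # current I-list order
    desired = [item for item, _ in sorted(ilist, key=lambda kv: -kv[1])]
    return sum(abs(current.index(it) - desired.index(it)) for it in desired[:k])

ilist = [("c", 1), ("a", 4), ("e", 1), ("b", 4), ("d", 4), ("f", 1)]  # (item, count)
if combined_displacement(ilist, k=3) > 2:                      # hypothetical threshold
    print("initiate Restructuring phase")
```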

2.1 Tree Restructuring

The other performance factor is the tree restructuring mechanism. The existing path adjusting method (PAM), proposed in [4], sorts the nodes of a prefix-tree using a bubble sort technique. A node may have to be split when it is swapped with a child node whose count is smaller than its own; otherwise, a simple exchange operation between the two nodes is performed.
We propose a new tree restructuring technique called the branch sorting method (BSM) that, unlike PAM, restructures the tree by first sorting the I-list into frequency-descending order and then sorting the unsorted paths of the tree one after another. Consider again the prefix-tree of Fig. 1(b), constructed from the first three transactions of Fig. 1(a), whose I-list order {c:1, a:2, e:1, b:2, d:2, f:1} is not frequency-descending. To restructure the tree into that order, the I-list is first sorted into the order {a:2, b:2, d:2, c:1, e:1, f:1}. Tree restructuring then starts with the first path of the first branch, say {c:1→a:1→e:1}. Since this path is not sorted according to the new I-list order, it is removed from the tree, sorted into a temporary array (using the merge sort technique), and re-inserted into the tree in the order {a:1→c:1→e:1}. All unsorted paths in the remaining branches are processed in the same way. If a path is found to be already sorted (e.g., the path of the last branch), it is not sorted again but simply merged with the previously processed common sorted path (if any). Once the last path has been processed, the restructuring of the tree is complete and we obtain the frequency-descending tree of Fig. 1(c).
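The sketch below (Python) renders BSM in a simplified form: it works over extracted (path, count) branches rather than in place on tree nodes, and Python's built-in sorted() (Timsort, a merge-sort derivative) stands in for the merge sort of the method. The branches are chosen to be consistent with the I-list counts reported for Fig. 1(b); they are an assumption, since the figure itself is not reproduced here.

```python
def bsm_restructure(branches, new_ilist_order):
    """Restructure extracted branches into a new prefix tree (nested dicts:
    item -> (count, children)) following the new, frequency-descending I-list order.
    Already-sorted paths are re-inserted (merged) without being sorted again."""
    rank = {item: i for i, item in enumerate(new_ilist_order)}
    tree, sorts_avoided = {}, 0
    for path, count in branches:
        if all(rank[a] <= rank[b] for a, b in zip(path, path[1:])):
            sorts_avoided += 1                              # path already sorted: merge directly
        else:
            path = sorted(path, key=rank.__getitem__)       # sort the unsorted path
        node = tree
        for item in path:                                   # (re-)insert the sorted path
            cnt, children = node.get(item, (0, {}))
            node[item] = (cnt + count, children)
            node = children
    return tree, sorts_avoided

# Branches consistent with the I-list {c:1, a:2, e:1, b:2, d:2, f:1} of Fig. 1(b).
branches = [(("c", "a", "e"), 1), (("a", "b", "d"), 1), (("b", "d", "f"), 1)]
new_order = ["a", "b", "d", "c", "e", "f"]                  # sorted I-list of Fig. 1(c)
tree, skipped = bsm_restructure(branches, new_order)
print(skipped)                                              # 2 paths needed no sorting
```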
The performance of PAM depends largely on the degree of displacement (DD) of items between the two I-lists, since swapping two nodes incurs a bubble-sort cost of O(n²), where n is the number of nodes between them. BSM, on the other hand, uses a merge-sort approach with a complexity of O(n log2 n) (n being the number of items in a path); therefore, the DD has little effect on its performance. Hence, PAM is not suitable when the DD is reasonably high, whereas BSM is the better candidate in such cases, since it performs almost evenly over variations of DD. Moreover, its handling of already-sorted paths reduces not only the number of sorting operations but also the amount of data to be sorted. In summary, the switch between the two methods during tree restructuring can be made dynamically based on the value of DD.
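The paper only suggests this dynamic switching; the sketch below shows one possible realization (an assumption), using a normalized degree-of-displacement measure and a hypothetical switching threshold:

```python
# Hypothetical switching rule between PAM and BSM based on the degree of
# displacement (DD) between the current I-list order and the sorted order.
def degree_of_displacement(current_order, sorted_order):
    """Total item displacement between the two orders, normalized to [0, 1]."""
    pos = {item: i for i, item in enumerate(sorted_order)}
    n = len(current_order)
    max_displacement = n * n / 2                    # attained when the order is fully reversed
    return sum(abs(i - pos[item]) for i, item in enumerate(current_order)) / max_displacement

def choose_method(current_order, sorted_order, dd_threshold=0.3):
    dd = degree_of_displacement(current_order, sorted_order)
    return "PAM" if dd < dd_threshold else "BSM"    # PAM is cheap only for small DD

print(choose_method(list("caebdf"), list("abdcef")))   # -> 'BSM' (DD ~= 0.56)
```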

3 Experimental Results
We performed a comprehensive experimental analysis of the performance of the CP-tree on several synthetic and real datasets. Due to space constraints, in the remainder of this section we report only the results on two real dense datasets (chess and mushroom) and one synthetic sparse dataset (T10I4D100K). All programs were written in Microsoft Visual C++ 6.0 and run in a time-sharing environment under Windows XP on a 2.66 GHz machine with 1 GB of main memory. Runtime includes tree construction, tree restructuring (for the CP-tree only), and mining time.
Table 1 shows the time required by BSM and PAM as the DB size and the restructuring frequency increase, for the T10I4D100K and chess datasets. The results indicate that the overall restructuring efficiency of BSM notably improves as the DB size grows, while that of PAM improves when restructuring is applied phase by phase (i.e., in slots) over the DB. However, the combined approach outperforms each individual approach in the phase-by-phase setting. We therefore adopt the combined approach, in which the switching depends on the value of DD.
Since it has been shown in [3] that CanTree outperforms similar algorithms such as AFPIM and CATS tree, we only report the performance comparison of the CP-tree with CanTree. To generalize the comparison, we compare the CP-tree with three versions of CanTree: lexicographic order (CTl), reverse lexicographic order (CTr), and appearance order (CTa). As shown in Table 2, for both the T10I4D100K and mushroom datasets the restructuring time of the CP-tree appears to be an overhead. In spite of this cost, however, the CP-tree significantly outperforms all versions of CanTree in overall runtime due to a dramatic reduction in mining time. Fig. 2 shows that the CP-tree significantly outperforms CanTree in overall runtime for various min_sup values.
The last row of Table 2, which shows the memory consumption of the algorithms, indicates that the size of CanTree varies with the distribution of data in the transactions and the order of items in the tree. In contrast, the size of the CP-tree is independent of such parameters and is much smaller than that of all versions of CanTree used in our experiments.

Table 1. Tree restructuring approach comparison (required time in seconds)

                    chess                                     T10I4D100K
Restructuring    DB size (K)         No. of slots             DB size (K)             No. of slots
approach                             (slot size = 1K)                                 (slot size = 20K)
                 1      2      3     1      2      3          20      60      100     1       3       5
BSM              1.02   2.50   4.20  1.02   3.0    6.5        5.86    28.59   65.19   5.86    24.98   60.66
PAM              1.34   3.50   6.98  1.34   1.53   1.89       11.83   69.78   157.41  11.83   15.41   21.05
Combined         --     --     --    1.02   1.22   1.56       --      --      --      5.86    9.47    15.14

Table 2. CP-tree vs. CanTree: time and memory comparison

                          T10I4D100K (∂ = 0.04)               mushroom (∂ = 0.15)
                          CTa      CTl      CTr      CP-tree  CTa     CTl     CTr     CP-tree
Construction time (s)     58.88    61.67    57.09    61.86    5.66    4.83    4.58    5.72
Restructuring time (s)    --       --       --       19.11    --      --      --      1.89
Mining time (s)           218.25   679.56   824.22   0.44     40.53   62.77   53.19   20.67
Total time (s)            277.11   741.23   881.31   81.41    46.19   67.59   57.77   28.28
Memory (MB)               14.51    14.97    14.99    14.29    0.95    0.70    0.56    0.50

Fig. 2. Runtime comparison of CP-tree, CTa, CTl, and CTr: total time (s) versus min_sup (%) on the T10I4D100K and mushroom datasets

4 Conclusions
We have proposed the CP-tree, which dynamically achieves a frequency-descending prefix-tree structure with a single database scan by applying a tree restructuring technique, and thereby considerably reduces the mining time. We also proposed the branch sorting method, a new tree restructuring technique, and presented guidelines for choosing the values of the tree restructuring parameters. We have shown that, despite the small additional tree restructuring cost, the CP-tree achieves a remarkable performance gain in overall runtime. Moreover, its ease of maintenance and its property of constantly capturing the full database information in a highly compact fashion make it efficiently applicable to interactive, incremental, and stream data mining.

References
1. Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules Between Sets of Items in Large Databases. In: SIGMOD, pp. 207–216 (1993)
2. Han, J., Cheng, H., Xin, D., Yan, X.: Frequent Pattern Mining: Current Status and Future Directions. Data Min. Knowl. Disc., 10th Anniversary Issue (2007)
3. Leung, C.K., Khan, Q.I., Li, Z., Hoque, T.: CanTree: A Canonical-Order Tree for Incremental Frequent-Pattern Mining. Knowledge and Information Systems 11(3), 287–311 (2007)
4. Koh, J.-L., Shieh, S.-F.: An Efficient Approach for Maintaining Association Rules Based on Adjusting FP-tree Structures. In: Lee, Y., Li, J., Whang, K.-Y., Lee, D. (eds.) DASFAA 2004. LNCS, vol. 2973, pp. 417–424. Springer, Heidelberg (2004)
5. Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: International Conference on Management of Data (SIGMOD) (2000)
6. Cheung, W., Zaïane, O.R.: Incremental Mining of Frequent Patterns without Candidate Generation or Support Constraint. In: Seventh International Database Engineering and Applications Symposium (IDEAS) (2003)
7. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: International Conference on Very Large Databases (VLDB) (1994)
