The following set of observations and theorems proves the correctness of the IFP min algorithm. They exhaustively define and verify the set of itemsets returned by the algorithm.

The first observation relates the (in)frequent itemsets of the database to those present in the residual database.

Observation 1. An itemset S (not containing the item x) is frequent (infrequent) in the residual tree TRx if and only if the itemset S is frequent (infrequent) in T, i.e.,

    S is frequent in T ⇔ S is frequent in TRx (x ∉ S)
    S is infrequent in T ⇔ S is infrequent in TRx (x ∉ S)

Proof. Since S does not contain x, it does not occur in the projected tree of x. All occurrences of S must, therefore, be only in the residual tree TRx, i.e.,

    supp(S, T) = supp(S, TRx)    [x ∉ S]    (1)

Hence, S is frequent (infrequent) in T if and only if S is frequent (infrequent) in TRx.

The following theorem shows that the MIIs that do not contain the item x can be obtained directly as MIIs from the residual tree of x.

Theorem 1. An itemset S (not containing the item x) is minimally infrequent in T if and only if the itemset S is minimally infrequent in the residual tree TRx of x, i.e.,

    S is minimally infrequent in T ⇔ S is minimally infrequent in TRx (x ∉ S)

Proof. Suppose S is minimally infrequent in T. Therefore, it is itself infrequent, but all its subsets S' ⊂ S are frequent. As S does not contain x, all occurrences of S or any of its subsets S' ⊂ S must occur in the residual tree TRx only. Hence, using Observation 1, in TRx also, S is infrequent but all S' ⊂ S are frequent. Therefore, S is minimally infrequent in TRx as well.

The converse is also true, i.e., if S is minimally infrequent in TRx, then, since all its occurrences are in TRx only, it is minimally infrequent globally (i.e., in T) as well.

Theorem 1 guides the algorithm in mining MIIs not containing the least frequent item (lf-item) x. The algorithm makes use of this theorem recursively, by reporting MIIs of residual trees as MIIs of the original tree. For mining MIIs that contain x, the following observation and theorem are presented.

The second observation relates the (in)frequent itemsets of the database to those present in the projected database.

Observation 2. An itemset S is frequent (infrequent) in the projected tree TPx if and only if the itemset obtained by including x in S (i.e., {x} ∪ S) is frequent (infrequent) in T, i.e.,

    {x} ∪ S is frequent in T ⇔ S is frequent in TPx
    {x} ∪ S is infrequent in T ⇔ S is infrequent in TPx

Proof. Consider the itemset {x} ∪ S. All occurrences of it are only in the projected tree TPx. The projected tree, however, does not list x, and therefore, we have

    supp({x} ∪ S, T) = supp({x} ∪ S, TPx) = supp(S, TPx)    (2)

Hence, {x} ∪ S is frequent (infrequent) in T if and only if S is frequent (infrequent) in TPx.

The next theorem shows that a potential MII obtained from the projected tree of an item x (by appending x to it) is a MII provided it is not a MII in the corresponding residual tree of x.
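Observations 1 and 2 are easy to sanity-check on a toy database, taking the residual part to be every transaction with x deleted, and the projected part to be the transactions containing x, again with x deleted. This is only an illustration on plain transaction sets, not the IFP-tree machinery; the helper names below are ours.

```python
from itertools import combinations

def supp(itemset, db):
    """Number of transactions in `db` containing every item of `itemset`."""
    s = set(itemset)
    return sum(1 for t in db if s <= set(t))

# Toy transaction database; x is the item being split on.
T = [{"a", "b", "x"}, {"a", "b"}, {"b", "c", "x"}, {"a", "c"}]
x = "x"

# Residual database: x deleted from every transaction.
T_Rx = [t - {x} for t in T]
# Projected database: transactions containing x, with x deleted.
T_Px = [t - {x} for t in T if x in t]

items = {"a", "b", "c"}
for k in (1, 2):
    for S in combinations(sorted(items), k):
        # Observation 1: supports agree between T and the residual part.
        assert supp(S, T) == supp(S, T_Rx)
        # Observation 2: supp({x} ∪ S, T) equals supp(S, T_Px).
        assert supp(set(S) | {x}, T) == supp(S, T_Px)
print("Observations 1 and 2 hold on the toy database")
```

Since both support equalities are exact (not just threshold-preserving), frequency and infrequency transfer for any minimum support value.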
Theorem 2. An itemset {x} ∪ S is minimally infrequent in T if and only if the itemset S is minimally infrequent in the projected tree TPx but not minimally infrequent in the residual tree TRx, i.e.,

    {x} ∪ S is minimally infrequent in T ⇔
        S is minimally infrequent in TPx and
        S is not minimally infrequent in TRx

Proof. LHS to RHS: Suppose {x} ∪ S is minimally infrequent in T. Therefore, it is itself infrequent, but all its subsets S' ⊂ {x} ∪ S, including S, are frequent.

Since S is frequent and it does not contain x, using Observation 1, S is frequent in TRx and is, therefore, not minimally infrequent in TRx.

From Observation 2, S is infrequent in TPx. Assume that S is not minimally infrequent in TPx. Since it is infrequent itself, there must exist a subset S' ⊂ S which is infrequent in TPx as well. Now, consider the itemset {x} ∪ S'. From Observation 2, it must be infrequent in T. However, since this is a subset of {x} ∪ S, this contradicts the fact that {x} ∪ S is minimally infrequent. Therefore, the assumption that S is not minimally infrequent in TPx is false.

Together, it shows that if {x} ∪ S is minimally infrequent in T, then S is minimally infrequent in TPx but not in TRx.

RHS to LHS: Given that S is minimally infrequent in TPx but not in TRx, assume that {x} ∪ S is not minimally infrequent in T. Since S is infrequent in TPx, using Observation 2, {x} ∪ S is also infrequent in T. Now, since we have assumed that {x} ∪ S is not minimally infrequent in T, it must contain a subset A ⊂ {x} ∪ S which is infrequent in T as well.

Suppose A contains x, i.e., x ∈ A. Consider the itemset B such that A = {x} ∪ B. Note that since A ⊂ {x} ∪ S, B ⊂ S. Since A = {x} ∪ B is infrequent in T, from Observation 2, B is infrequent in TPx. However, since B ⊂ S, this contradicts the fact that S is minimally infrequent in TPx. Hence, A cannot contain x.

Therefore, it must be the case that x ∉ A. To show that this leads to a contradiction as well, we first show that S is frequent in T.

Now, if S is minimally infrequent in TPx but not in TRx and {x} ∪ S is not minimally infrequent in T, then every subset S' ⊂ S is frequent in TPx. Thus, for every S' ⊂ S, S' is frequent in T as well (since TPx is only a part of T). Then, if S is infrequent, it must be a MII. However, from Theorem 1, it then becomes a MII in TRx, which is a contradiction. Therefore, S cannot be infrequent and is, therefore, frequent in T.

Since A ⊂ {x} ∪ S but x ∉ A, therefore, A ⊆ S. We have already shown that A is infrequent in T. Using the Apriori property, S cannot then be frequent. Hence, it contradicts the original assumption that {x} ∪ S is not minimally infrequent in T.

Together, we get that if S is minimally infrequent in TPx but not in TRx, then {x} ∪ S is minimally infrequent in T.

Theorem 2 guides the algorithm in mining MIIs containing the least frequent item (lf-item) x. The algorithm first obtains the MIIs from its projected tree and then removes those that are also found in the residual tree. It thus shows the connection between the two parts of the database, projected and residual.

5.3 Correctness

We now formally establish the correctness of the algorithm IFP min by showing that the MIIs as enumerated in Group 1 and Group 2 in Section 5.2 are generated by it.

In Step 1, the algorithm first finds the infrequent items present in the tree. These 1-itemsets cover Group 1(a). Consider the least frequent item x. In Step 11 and Step 12, the tree T is divided into smaller trees, the residual tree TRx and the projected tree TPx.

In Step 13, Group 1(b) MIIs are obtained by the recursive application of IFP min on the residual tree TRx. Theorem 1 proves that these are MIIs in the original dataset.

In Step 14, potential MIIs are obtained by the recursive application of IFP min on the projected tree TPx. MIIs obtained from TRx are removed from this set. Combined with the item x, these form the MIIs enumerated as Group 2(a). Theorem 2 proves that these are indeed MIIs in the original dataset.

The projected database consists of all those transactions in which x is present. The Group 2(b) MIIs are of length 2 (as shown earlier). Thus, single items that are frequent but do not appear in the projected tree of x, when combined with x, constitute MIIs with a support count of zero. These items do appear in TRx, though, as they are frequent. Hence, they are obtained as the single items that appear in TRx but not in TPx, as shown in Step 17 of the algorithm.

The algorithm is, hence, complete, and it exhaustively generates all minimally infrequent itemsets.
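The recursion underlying Theorems 1 and 2 (report the MIIs of the residual part unchanged, and lift the MIIs of the projected part by x after removing those also minimal in the residual part) can be sketched on plain transaction sets. This is our own illustrative reimplementation, not the paper's IFP-tree code: it ignores all tree-based optimizations, and the names are ours. Passing the item universe explicitly into the projected recursion is what yields the zero-support pairs of Group 2(b).

```python
def ifp_min(db, items, minsup):
    """Sketch: minimally infrequent itemsets of `db` over `items`.

    `db` is a list of sets; an itemset is infrequent if its support
    count is below `minsup`.
    """
    if not items:
        return set()
    def supp(s):
        return sum(1 for t in db if s <= t)
    # Group 1(a): every infrequent single item is itself an MII.
    miis = {frozenset({i}) for i in items if supp({i}) < minsup}
    frequent = [i for i in items if supp({i}) >= minsup]
    if not frequent:
        return miis
    # Split on the least frequent (still frequent) item x; an MII of
    # size >= 2 can contain only frequent items.
    x = min(frequent, key=lambda i: (supp({i}), i))
    rest = [i for i in frequent if i != x]
    residual = [t - {x} for t in db]             # x removed everywhere
    projected = [t - {x} for t in db if x in t]  # transactions with x
    r = ifp_min(residual, rest, minsup)   # Theorem 1: MIIs without x
    p = ifp_min(projected, rest, minsup)  # candidates for MIIs with x
    miis |= r
    # Theorem 2: {x} ∪ S is an MII iff S is an MII of the projected
    # part but not of the residual part.
    miis |= {s | {x} for s in p if s not in r}
    return miis

db = [{"a", "b", "x"}, {"a", "b"}, {"b", "c", "x"}, {"a", "c"}]
print(sorted(sorted(s) for s in ifp_min(db, {"a", "b", "c", "x"}, 2)))
# → [['a', 'c'], ['a', 'x'], ['b', 'c'], ['c', 'x']]
```

On this toy database with minsup = 2, all four pairs reported have support 1 while all their single-item subsets are frequent, matching the definition of an MII.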
5.4 MIIs using Apriori

In this section, we show how the Apriori algorithm [1] can be improved to mine the MIIs. Consider the iteration where candidate itemsets of length l + 1 are generated from frequent itemsets of length l. From the generated candidate set, itemsets whose support satisfies the minimum support threshold are reported as frequent and the rest are rejected. This rejected set of itemsets constitutes the MIIs of length l + 1. This is attributed to the fact that for such an itemset, all the subsets are frequent (due to the candidate generation procedure) while the itemset itself is infrequent. For experimentation purposes, we label this algorithm as the Apriori min algorithm.

6 Frequent Itemsets in MLMS Model

In this section, we present our proposed algorithm IFP MLMS to mine frequent itemsets in the MLMS framework. Though most applications use the constraint ξ_1 ≥ ξ_2 ≥ ... ≥ ξ_n for the different support thresholds at different lengths, our algorithm (shown in Algorithm 2) does not depend on it and works for any general ξ_1, ..., ξ_n. We use the lowest support ξ_low = min_i ξ_i in the algorithm. The algorithm is based on the item with the least support, which we term the ls-item.

IFP MLMS is again based on the concepts of residual and projected trees. It mines the frequent itemsets by first dividing the database into projected and residual trees for the ls-item x, and then mining them recursively. We show that the frequent itemsets obtained from the residual tree are frequent itemsets in the original database as well.

The itemsets obtained from the projected tree share the ls-item as a prefix. Hence, the thresholds cannot be applied directly as the length of the itemset changes. The prefix accumulates as the algorithm goes deeper into recursion, and hence, a track of the prefix is maintained at each recursion level. At any stage of the recursive processing, if supp(ls-item) < ξ_low, then this item cannot occur in a frequent itemset of any length (as any superset of it will not pass the support threshold). Thus, its sub-tree is pruned, thereby reducing the search space considerably.

To analyze the prefix items corresponding to a tree, the following two definitions are required (the extraction lost the original symbols, so we denote the prefix-set of T by P(T) and its length by |P(T)|):

Definition 2 (Prefix-set of a tree). The prefix-set of a tree is the set of items that need to be included with the itemsets in the tree, i.e., all the items on which the projections have been done. For a tree T, it is denoted by P(T).

Definition 3 (Prefix-length of a tree). The prefix-length of a tree is the length of its prefix-set. For a tree T, it is denoted by |P(T)|.

For a tree T having P(T) = S and |P(T)| = p, the corresponding values for the residual tree are P(TRx) = S and |P(TRx)| = p, while those for the projected tree are P(TPx) = {x} ∪ S and |P(TPx)| = p + 1. For the original transaction database TD and its tree T, P(T) = ∅ and |P(T)| = 0.

For any k-itemset S in a tree T, the original itemset must include all the items in the prefix-set. Therefore, if |P(T)| = p, for S to be frequent, it must satisfy the support threshold for (k + p)-length itemsets, i.e., supp(S) ≥ ξ_{k+p} (henceforth, ξ_{k,p} is also used to denote ξ_{k+p}). The definitions of frequent and infrequent itemsets in a tree T with a prefix-length |P(T)| = p are, thus, modified as follows.

Definition 4 (Frequent* itemset). A k-itemset S is frequent* in T having |P(T)| = p if supp(S, T) ≥ ξ_{k,p}.

Definition 5 (Infrequent* itemset). A k-itemset S is infrequent* in T having |P(T)| = p if supp(S, T) < ξ_{k,p}.

Algorithm 2 IFP MLMS
Input: IFP-tree T with |P(T)| = p
Output: Frequent* itemsets of T
 1: if T = ∅ then
 2:     return ∅
 3: end if
 4: x ← ls-item in T
 5: if supp({x}, T) < ξ_low then
 6:     SP ← ∅
 7: else
 8:     TPx ← projected tree of x
 9:     |P(TPx)| ← p + 1
10:     SP ← IFP MLMS(TPx)
11: end if
12: TRx ← residual tree of x
13: |P(TRx)| ← p
14: SR ← IFP MLMS(TRx)
15: if x is frequent* in T then
16:     return ({x} × SP) ∪ SR ∪ {x}
17: else
18:     return ({x} × SP) ∪ SR
19: end if

Using these definitions, we explain Algorithm 2 along with an example shown in Figure 5 for the database in Table 2. The support thresholds are ξ_1 = 4, ξ_2 = 4, ξ_3 = 3, ξ_4 = 2, and ξ_5 = 1.

The item with the least support for the initial tree T is B. Since its support is above ξ_low (Step 5), the algorithm extracts the projected tree TPB and then recursively processes it (Step 11 to Step 13).

The algorithm then processes the residual tree TRB recursively (Step 12 to Step 14). For that, it breaks it into the projected and residual trees of the ls-item there, which is A. Figure 5 shows all the frequent itemsets mined.

Since B itself is not frequent* in T (Step 15), it is not returned. The other itemsets are returned (Step 18).

6.1 Correctness of IFP MLMS

Let x be the ls-item in tree T (having |P(T)| = p) and S be a k-itemset not containing x.

For computing the frequent* itemsets for a tree T, the IFP MLMS algorithm merges the frequent* itemsets, obtained by the processing of the projected tree and the residual tree of the ls-item, using the following theorem.

Theorem 3. An itemset {x} ∪ S is frequent* in T if and only if S is frequent* in the projected tree TPx of x, i.e.,

    {x} ∪ S is frequent* in T ⇔ S is frequent* in TPx

Proof. Suppose S is a k-itemset.

    S is frequent* in TPx
    ⇔ supp(S, TPx) ≥ ξ_{k,p+1}        (since |P(TPx)| = p + 1)
    ⇔ supp({x} ∪ S, T) ≥ ξ_{k,p+1}    (using Observation 2)
    ⇔ supp({x} ∪ S, T) ≥ ξ_{k+1,p}
    ⇔ {x} ∪ S is frequent* in T       (since |P(T)| = p)
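Algorithm 2 translates almost directly into a recursion on plain transaction sets. The sketch below is our own simplification, not the paper's IFP-tree implementation: `xi` is assumed to be a dict mapping itemset length to its threshold (covering lengths 1 through the number of items), `p` is the current prefix-length, and only singletons are checked explicitly, since Theorem 3 lifts the frequent* itemsets of the projected part by x.

```python
def ifp_mlms(db, items, xi, p=0):
    """Sketch of Algorithm 2 (IFP MLMS) on plain transaction sets.

    A k-itemset of the current subproblem stands for a (k+p)-itemset of
    the original database, so a singleton {x} is frequent* at prefix
    length p iff supp({x}) >= xi[1 + p].
    """
    if not items:
        return set()
    def s1(i):
        return sum(1 for t in db if i in t)
    xi_low = min(xi.values())
    x = min(items, key=lambda i: (s1(i), i))  # ls-item (Step 4)
    rest = [i for i in items if i != x]
    if s1(x) < xi_low:
        sp = set()  # Step 5: no superset of x can pass any threshold
    else:
        projected = [t - {x} for t in db if x in t]
        sp = ifp_mlms(projected, rest, xi, p + 1)  # prefix grows by one
    residual = [t - {x} for t in db]
    sr = ifp_mlms(residual, rest, xi, p)
    # Theorem 3: lift the projected results by x; keep residual results.
    out = {s | {x} for s in sp} | sr
    if s1(x) >= xi[1 + p]:  # Step 15: is x itself frequent*?
        out.add(frozenset({x}))
    return out

db = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B", "C"}, {"A"}]
res = ifp_mlms(db, {"A", "B", "C"}, {1: 3, 2: 2, 3: 1})
print(sorted(sorted(s) for s in res))  # all 7 subsets are frequent*
```

With these (hypothetical) thresholds, every nonempty subset of {A, B, C} meets the threshold for its own length, so the recursion returns all seven itemsets; raising xi[3] to 2 would drop {A, B, C}.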
Figure 5: Frequent itemsets generated from the database represented in Table 2 using the MLMS framework. The box
associated with each tree represents the frequent* itemsets in that tree.
When there exist a large number of frequent itemsets, candidate generation-and-test methods may suffer from generating a huge number of candidates and performing several scans of the database for support-checking, thereby increasing the computational time. The corresponding computational times have not been shown in the interest of maintaining the scale of the graph present for IFP min and MINIT.

As can be observed from the figures, dense and small datasets are characterized by a neutral support threshold below which the MINIT algorithm performs better than IFP min and above which IFP min performs better than MINIT. The MINIT algorithm prunes an item based on the support threshold and the length of the itemset in which the item is present (minimum support property [7]). As the support thresholds are reduced, the pruning condition becomes activated and leads to a reduction in the search space. Above the neutral point, the pruning condition is not effective. In the IFP min algorithm, any candidate MII itemset is checked for set membership in a residual database, whereas in MINIT the candidates are validated by computing the support from the whole database. Due to the reduced validation space, IFP min outperforms MINIT.

The T10I4D100K (Figure 10) and T40I10D100K (Figure 11) datasets are sparse. Since Apriori min is a candidate-generation-and-test based algorithm, it halts when all the candidates are infrequent. As such, it avoids the complete traversal of the database for all possible lengths. However, both IFP min and MINIT, being based on the recursive elimination procedure, have to complete their full run in order to report the MIIs. This results in higher computational times for these methods.

On analyzing and comparing the MIIs generated by the IFP min, Apriori min and MINIT algorithms, we found that the MIIs belonging to Group 2(b) (i.e., the itemsets having a zero support count in the transaction database) are not reported by the MINIT algorithm, thereby leading to its incompleteness. Based on the experimental analysis, it is observed that for large dense datasets, it is preferable to use the IFP min algorithm. For small dense datasets, MINIT should be used at low support thresholds and IFP min should be used at larger thresholds. For sparse datasets, Apriori min should be used for reporting MIIs.

7.2 IFP MLMS

In this section, we report the performance of the IFP MLMS algorithm in comparison with the Apriori MLMS [5] algorithm. Several real and synthetic datasets (obtained from http://archive.ics.uci.edu/ml/datasets) have been used for testing the performance of the algorithms. For dense datasets, due to the presence of large number of transac-
[Figure: comparison of ifp-mlms and apriori-mlms on the Anonymous Microsoft Web Dataset, plotted against the number of frequent itemsets and against the number of transactions (x 1000).]
... performed in IFP MLMS, thus making IFP MLMS more efficient.

Figure 14: The IFP MLMS and Apriori MLMS algorithms on the datasets given in Table 4.

ID   Dataset         Number of items   Number of transactions
1    machine         467               209
2    vowel context   4188              990
3    abalone         3000              4177
4    audiology       311               202
5    anneal          191               798
6    cloud           2057              11539
7    housing         2922              506
8    forest fires    1039              518
9    nursery         28                12961
10   yeast           1568              1484

Table 4: Details of smaller datasets.

8 Conclusion

In this paper, we have introduced a novel algorithm, IFP min, for mining minimally infrequent itemsets (MIIs). To the best of our knowledge, this is the first paper that addresses this problem using the pattern-growth paradigm. We have also proposed an improvement of the Apriori algorithm to find the MIIs. The existing algorithms are evaluated on dense as well as sparse datasets. Experimental results show that: (i) for large dense datasets, it is preferable to use the IFP min algorithm; (ii) for small dense datasets, MINIT should be used at low support thresholds and IFP min should be used at larger thresholds; and (iii) for sparse datasets, Apriori min should be used for reporting the MIIs.

We have also designed an extension of the algorithm for finding frequent itemsets in the multiple level minimum support (MLMS) model. Experimental results show that this algorithm, IFP MLMS, outperforms the existing candidate-generation-and-test based Apriori MLMS algorithm.

In future, we plan to utilize the scalable properties of our algorithm to mine maximally frequent itemsets. It will also be useful to do a performance analysis of IFP-tree in par-

References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.

[2] K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD Rec., 28:359–370, 1999.

[3] X. Dong, Z. Niu, X. Shi, X. Zhang, and D. Zhu. Mining both positive and negative association rules from frequent and infrequent itemsets. In ADMA, pages 122–133, 2007.

[4] X. Dong, Z. Niu, D. Zhu, Z. Zheng, and Q. Jia. Mining interesting infrequent and frequent itemsets based on MLMS model. In ADMA, pages 444–451, 2008.

[5] X. Dong, Z. Zheng, and Z. Niu. Mining infrequent itemsets based on multiple level minimum supports. In ICICIC, page 528, 2007.

[6] G. Grahne and J. Zhu. Fast algorithms for frequent itemset mining using FP-trees. IEEE Trans. Knowl. Data Eng., 17(10):1347–1362, 2005.

[7] D. J. Haglin and A. M. Manning. On minimal infrequent itemset mining. In DMIN, pages 141–147, 2007.

[8] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, pages 1–12, 2000.

[9] A. M. Manning and D. J. Haglin. A new algorithm for finding minimal sample uniques for use in statistical disclosure assessment. In ICDM, pages 290–297, 2005.

[10] A. M. Manning, D. J. Haglin, and J. A. Keane. A recursive search algorithm for statistical disclosure assessment. Data Min. Knowl. Discov., 16:165–196, 2008.

[11] J. Srivastava and R. Cooley. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1:12–23, 2000.

[12] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Trans. Inf. Syst., 22(3):381–405, 2004.