support; second, to produce association rules from the frequent itemsets whose confidence is greater than the user-specified minimum confidence.

2.4.1 Generalized Frequent Itemsets. In the process of generalized association rule mining, multi-level frequent itemset mining is still a difficult point for researchers. To improve the efficiency of frequent itemset mining, this paper summarizes the earlier algorithms into two mining strategies for producing multi-level frequent itemsets, and on that basis proposes a new algorithm, GenHibFreq, suitable for mining multi-level frequent itemsets over a data cube.

1. Item depth and itemset depth
Two concepts this paper needs are given as follows:
Definition 8: For an item x ∈ I, its depth is written depth(x). If the parent item x′ of x does not exist, then depth(x) = 0; otherwise depth(x) = depth(x′) + 1.
Definition 9: For an itemset X ⊆ I, its depth is depth(X) = max({depth(x) | x ∈ X}).

2. The mining strategies of frequent itemsets
The mining strategies for multi-level frequent itemsets can be grouped into the two following categories:
(1) Obtain the candidate k-itemsets Ck by joining and pruning the frequent (k−1)-itemsets Lk−1; then traverse the transaction database to count the support of all the candidates in Ck, delete those itemsets that cannot meet the minimum support, and get the frequent k-itemsets Lk.
(2) When creating the frequent k-itemsets Lk, first traverse the transaction database to count the support of the candidate k-itemsets Ck at the highest abstraction level (i.e., depth(Ck) = 0) and delete those itemsets that cannot meet the minimum support value; then count the candidate k-itemsets at the lower abstraction levels, descending one level at a time. In this process, the number of candidate k-itemsets that need counting is decreased by deleting those whose parent itemsets cannot meet the minimum support value.
Comparing the two strategies, it is easy to see that both need to join and prune all the frequent (k−1)-itemsets Lk−1 to get all the candidate k-itemsets; the difference lies in how the database is traversed to compute the support of the candidates in Ck.

3. Algorithms for frequent itemset mining
To improve the efficiency of creating frequent itemsets in the data cube and to decrease the number of candidate itemsets that need computing as far as possible, the GenHibFreq algorithm is put forward, following strategy (2).
1) Basic idea of the GenHibFreq algorithm
According to Definition 6, when counting the support of an itemset in generalized association rule mining based on a data cube, only the relevant cells need to be accessed; there is no need to scan the whole data cube, which decreases the number of candidate itemsets that need counting as far as possible.
2) The GenHibFreq algorithm
The notation is defined as follows:
Ck,h: the candidate k-itemsets c ∈ Ck with depth(c) = h
Lk,h: the frequent k-itemsets l ∈ Lk with depth(l) = h
L: the frequent itemsets
Description of the GenHibFreq algorithm:
Input: the n-dimensional relation R(D, M, Dstr) and the user-specified minimum support value min_sup.
Output: the n-dimensional multi-level frequent itemsets L.
① k ← 1; h ← 0; L ← ∅;
② create the frequent 1-itemsets L1,0 of depth 0 for every dimension;
③ h ← 1;
   while (L1,h−1 ≠ ∅) do {
     for each 1-itemset l ∈ L1,h−1 do
       for each 1-itemset e ∈ {i | i is a child itemset of l} do {
         if (Pr{e} / totalcount ≥ min_sup) add the 1-itemset e to L1,h;
       }
     h++;
   }
   L1 = ∪h L1,h;
④ for (k ← 2; (Lk−1 ≠ ∅ and k ≤ n); k++) {
     h ← 0;
     repeat {
       if (h == 0) {
         Ck,0 ← gen_candidate_Apriori(Lk−1,0);
         Lk,0 = all candidates in Ck,0 with minimum support;
       }
       else
         Lk,h ← gen_frequent_Hib(k, h, L1,h, Lk,h−1);
       h++;
     }
     until (Lk,h−1 = ∅);
     Lk = ∪h Lk,h;
   }
⑤ Answer = ∪k Lk;

Function gen_frequent_Hib(k, h, L1,h, Lk,h−1)
Input: L1,h, the frequent 1-itemsets of depth h, and Lk,h−1, the frequent k-itemsets of depth h−1.
Output: Lk,h, the frequent k-itemsets of depth h.
① Queue FIFO ← ∅; Ck,h ← ∅;
② for each k-itemset l ∈ Lk,h−1 do {
     enqueue the k-itemset l at the end of FIFO;
   }
③ while (FIFO ≠ ∅) {
     dequeue a k-itemset A = {a1, a2, …, ak} from the head of FIFO;
     if (depth(A) == h) {
       if (Lk,h already contains the k-itemset A = {a1, a2, …, ak}) continue;
       else add the k-itemset A to Lk,h;
     }
     for (j ← 1; (depth(aj) == h − 1 and j ≤ k); j++) {
       for each item e ∈ {i | i is a child item of aj and {i} ∈ L1,h} do {
         replace the item aj in A with e;
         if (Pr{a1, …, aj−1, e, aj+1, …, ak} / totalcount ≥ min_sup)
           enqueue the k-itemset {a1, …, aj−1, e, aj+1, …, ak} at the end of FIFO;
       }
     }
   }
④ Answer = Lk,h;

This algorithm reduces the number of candidate itemsets that need computing to the minimum, and thereby improves the efficiency of creating frequent itemsets from the data cube effectively.

2.4.2 Algorithms for Mining Association Rules. To decrease the creation of redundant rules and to fit the GenHibFreq algorithm for multi-level frequent itemset mining based on the data cube, we put forward the association rule algorithm GenerateLHSs-Rule, which is composed of two parts: BorderLHSs and GenerateRule. First, the BorderLHSs algorithm finds the dividing line of the rule left-hand sides (LHSs) by means of a reverse search; then the GenerateRule algorithm creates the association rules. GenerateLHSs-Rule can decrease the creation of redundant rules efficiently. The two algorithms are described as follows.

1. BorderLHSs
Using the downward closure property of the association rule's LHS, the dividing line of the LHSs can be found by the reverse search BorderLHSs(A) under the given minimum confidence value min_conf. Description of the BorderLHSs algorithm:
[Input]: a frequent itemset A
[Output]: the dividing line LHSs of the rule conditions
① FIFO = {A}; LHSs ← ∅;
② while (FIFO ≠ ∅) do {
③   dequeue B from the head of FIFO;
④   onBorder = TRUE;
⑤   for each (|B|−1)-subset C of B do {
      if (P(A) / P(C) ≥ min_conf) then {
        onBorder = FALSE;
⑥       if (C is not in FIFO) then enqueue C at the end of FIFO;
      }
    }
⑦   if (onBorder == TRUE) then add B to LHSs;
  }
⑧ Answer = LHSs;
BorderLHSs(A) decreases the complexity enormously, because once an itemset on the dividing line of the LHSs is found, the search stops descending into its other subsets. Even in the worst case, the complexity of this algorithm is O(2^|A|).

2. GenerateRule
The GenerateRule algorithm is obtained by taking the border LHSs of one frequent itemset and removing those that cross with the border of any superset or subset.
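The reverse search of BorderLHSs can be sketched in Python. This is a minimal illustration rather than the paper's implementation: `support` is assumed to be a mapping from frozenset itemsets to support counts that covers A and its subsets, so that conf(B ⇒ A − B) = support(A) / support(B), and a `seen` set stands in for the `C is not in FIFO` test.

```python
from collections import deque
from itertools import combinations

def border_lhss(A, support, min_conf):
    """Reverse (top-down) search for the dividing line of valid rule
    left-hand sides of the frequent itemset A. Shrinking the LHS can
    only lower conf(B => A - B) = support(A) / support(B), so the
    search descends only through subsets that are still confident."""
    A = frozenset(A)
    fifo = deque([A])
    seen = {A}          # replaces the "C is not in FIFO" membership test
    lhss = []
    while fifo:
        B = fifo.popleft()
        on_border = True
        for C in (frozenset(c) for c in combinations(B, len(B) - 1)):
            # A (|B|-1)-subset that is still confident means B is not
            # minimal: keep descending through C instead.
            if C and support[A] / support[C] >= min_conf:
                on_border = False
                if C not in seen:
                    seen.add(C)
                    fifo.append(C)
        if on_border:
            lhss.append(B)
    return lhss
```

Once a set on the dividing line is reached the search stops descending below it, which is why even the worst case stays at O(2^|A|).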
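The rule-generation step described in part 2 can likewise be sketched on top of the border search of part 1. Again a hedged illustration, not the authors' code: `L` is assumed to be a list of frozenset frequent itemsets with `support` their counts; only the subtraction over immediate supersets is shown, and the analogous subtraction over child itemsets is omitted for brevity.

```python
from collections import deque
from itertools import combinations

def border_lhss(A, support, min_conf):
    # Compact restatement of the BorderLHSs reverse search (part 1).
    fifo, seen, border = deque([A]), {A}, []
    while fifo:
        B = fifo.popleft()
        on_border = True
        for C in (frozenset(c) for c in combinations(B, len(B) - 1)):
            if C and support[A] / support[C] >= min_conf:
                on_border = False
                if C not in seen:
                    seen.add(C)
                    fifo.append(C)
        if on_border:
            border.append(B)
    return border

def generate_rules(L, support, min_conf):
    """GenerateRule sketch: keep only those border LHSs of A that do
    not also lie on the border of an immediate superset of A, then
    emit the irredundant rules B => A - B."""
    rules = []
    for A in L:
        lhss = set(border_lhss(A, support, min_conf))
        for C in L:
            if A < C and len(C) == len(A) + 1:   # immediate supersets of A
                lhss -= set(border_lhss(C, support, min_conf))
        rules += [(B, A - B) for B in lhss if B != A]   # skip trivial A => {}
    return rules
```

For supports {a: 4, b: 2, ab: 2} and min_conf = 0.6, only b ⇒ a survives: a ⇒ b has confidence 0.5, and the border of {b} is subtracted away because it also lies on the border of its superset {a, b}.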
Let A1, A2, …, Am be the m frequent itemsets each of which is a superset at the (|A|+1) layer, or a subset, of A. If B ∈ (BorderLHSs(A) − ∪i=1..m BorderLHSs(Ai)), then, relative to any other rules, B ⇒ (A − B) is irredundant.
Description of the GenerateRule algorithm is as follows:
[Input]: all the frequent itemsets L
[Output]: the irredundant association rules AR
① for each A ∈ L do {
②   LHS(A) = BorderLHSs(A);
③   for each C ∈ L such that C is an (|A|+1)-superset or a child itemset of A do {
      LHS(A) = LHS(A) − BorderLHSs(C);
    }
④   for each B ∈ LHS(A) do {
      add the rule "B ⇒ (A − B)" to AR;
    }
  }
⑤ Answer = AR;
This algorithm obtains the fewest, irredundant association rules, so the efficiency measured by the number of association rules is improved greatly. Supposing the processing time of every frequent itemset in the set L is the same, the computational complexity of this algorithm is linear in the size of the set L.

3. Example Verification and Analysis

The sales transaction database is shown in Table 1; the relation sales has four attributes: the transaction identifier tid, the customer's age, income, and buys. age and income are quantitative attributes, while buys is a categorical attribute.

3.1 Sales Database and Working Data Cube

A working data cube relevant to the task is designed according to the data cube model; suppose this data cube contains 3 dimensions: age, income, and buys. Suppose the 3-itemset X = {(age, [20,29]), (income, [40k,49k]), (buys, Color Printer)}; according to Definition 4, PrX = F(age = [20,29], income = [40k,49k], buys = Color Printer) = 2, that is, there are 2 transactions in the original transaction database that contain all the items in itemset X.

Table 1. Sales transaction database
tid | age | income | buys
100 | 25  | 45k    | {IBM Laptop, HP Color Printer}
200 | 28  | 40k    | {HP Desktop, Canon Color Printer}
300 | 44  | 45k    | {IBM Desktop, HP Desktop}
400 | 21  | 20k    | {HP Desktop, Epson b/w Printer}
500 | 36  | 40k    | {IBM Laptop}
600 | 32  | 30k    | {HP Laptop, Epson b/w Printer}

3.2 Creating Association Rules

If the minimum confidence value min_conf is 60%, the association rules created according to the GenerateLHSs-Rule algorithm are as shown in Table 2.

3.3 Outcome Analysis

(1) 31 association rules are created by the general algorithms, but only 16 by the algorithm of this paper, so our algorithm can decrease the redundant rules efficiently.
(2) The Cumulate, Stratify and ML_T2L1 algorithms need larger storage space and also have distinct limitations. When counting the support of an itemset with the algorithm of this paper, however, only the relevant cells need to be accessed, without scanning the whole data cube; this decreases the number of candidate itemsets that need counting and improves the efficiency of creating the frequent itemsets.
(3) The BorderLHSs(A) algorithm guarantees that every subset of A is visited at most once; once an itemset on the condition border LHSs is found, the search stops examining all of its subsets, which keeps the complexity below O(2^|A|) and so decreases the complexity of the algorithm greatly.

4 Conclusion

This paper described a multi-dimensional data cube model, put forward a formalized definition of the generalized association rule, and summarized two categories of mining strategies for creating multi-level frequent itemsets, proposing the algorithm GenHibFreq, which is suitable for the data cube. This algorithm adequately uses the abstraction levels among the itemsets to
decrease the number of candidate itemsets that need counting and so improve its efficiency. We also put forward the algorithm for mining generalized association rules based on the data cube, GenerateLHSs-Rule, which can decrease the creation of redundant rules efficiently. Experiments show that the algorithms in this paper are superior to Cumulate, Stratify and ML_T2L1 in algorithmic efficiency and in the creation of irredundant rules. At the same time, these algorithms show good performance in flexibility, scalability, and complexity.
This paper was supported by the Natural Science Foundation of Jiangsu Province (No. BK2005021).
Table 2. Association rules created by the GenerateRule(L) algorithm
Multi-level association rule                                  | Support | Confidence
(buys, Printer) ⇒ (age, [20,29])                              | 3       | 75%
(income, [40k,49k]) ⇒ (buys, Computer)                        | 4       | 100%
(buys, Computer) ⇒ (income, [40k,49k])                        | 4       | 66.7%
(buys, Desktop) ⇒ (age, [20,29])                              | 2       | 66.7%
(age, [30,39]) ⇒ (buys, Laptop)                               | 2       | 100%
(buys, Laptop) ⇒ (age, [30,39])                               | 2       | 66.7%
(buys, Desktop) ⇒ (income, [40k,49k])                         | 2       | 66.7%
(buys, Laptop) ⇒ (income, [40k,49k])                          | 2       | 66.7%
(age, [20,29]) ⇒ (buys, HP Desktop)                           | 2       | 66.7%
(buys, HP Desktop) ⇒ (age, [20,29])                           | 2       | 66.7%
(buys, HP Desktop) ⇒ (income, [40k,49k])                      | 2       | 66.7%
(buys, IBM Desktop) ⇒ (income, [40k,49k])                     | 2       | 100%
(age, [20,29]) ⇒ (income, [40k,49k]) ∧ (buys, Computer)       | 2       | 66.7%
(income, [40k,49k]) ∧ (buys, Printer) ⇒ (age, [20,29])        | 2       | 100%
(age, [20,29]) ⇒ (income, [40k,49k]) ∧ (buys, Color Printer)  | 2       | 66.7%
(buys, Color Printer) ∧ (income, [40k,49k]) ⇒ (age, [20,29])  | 2       | 100%
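As a sanity check, the figures above can be recomputed from the transactions of Table 1. The sketch below is illustrative only: the `generalize` helper encodes an assumed product hierarchy (item → model class → Printer/Computer) that is not spelled out in the paper. One aggregation pass builds toy cube cells, after which PrX = 2 from Section 3.1 is a single cell access, and the support and confidence of the first rule of Table 2 follow by direct counting.

```python
from collections import defaultdict

# Transactions of Table 1: (age, income in $1000s, items bought).
TX = [
    (25, 45, {"IBM Laptop", "HP Color Printer"}),
    (28, 40, {"HP Desktop", "Canon Color Printer"}),
    (44, 45, {"IBM Desktop", "HP Desktop"}),
    (21, 20, {"HP Desktop", "Epson b/w Printer"}),
    (36, 40, {"IBM Laptop"}),
    (32, 30, {"HP Laptop", "Epson b/w Printer"}),
]

def generalize(item):
    """Assumed hierarchy: an item matches itself, its model class
    (e.g. 'Color Printer'), and its top level (Printer/Computer)."""
    model = item.split(" ", 1)[1]            # strip the brand
    top = "Printer" if item.endswith("Printer") else "Computer"
    return {item, model, top}

def band(v):                                  # [20,29], [40,49], ...
    lo = v // 10 * 10
    return (lo, lo + 9)

# One pass aggregates the base table into cube cells; the support of
# X = {(age,[20,29]), (income,[40k,49k]), (buys, Color Printer)} is
# then a single cell access, as claimed in Section 3.1.
cube = defaultdict(int)
for age, income, items in TX:
    for g in {g for i in items for g in generalize(i)}:
        cube[(band(age), band(income), g)] += 1

pr_x = cube[((20, 29), (40, 49), "Color Printer")]            # -> 2

# First rule of Table 2: (buys, Printer) => (age, [20,29]).
lhs = [t for t in TX if any("Printer" in generalize(i) for i in t[2])]
sup = sum(1 for t in lhs if 20 <= t[0] <= 29)                 # -> 3
conf = sup / len(lhs)                                         # 3/4 = 75%
```

Four transactions contain some Printer and three of those fall in age [20,29], reproducing the support 3 and confidence 75% reported in the first row of Table 2.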