
Quantitative Association Rule Mining Using a Hybrid PSO/ACO Algorithm (PSO/ACO-AR)

Osama M. Badawy*, Abd-Elhay A. Sallam**, Mohamed I. Habib**
*CCIT, Arab Academy for Science and Technology, obadawy@aast.edu
**FOE, Suez Canal University, Port Said, ab.sallam@scuegypt.edu.eg, m.habib@scuegypt.edu.eg

Abstract
Over the years, data mining has attracted considerable attention from the research community. Researchers attempt to develop faster, more scalable algorithms to navigate the ever increasing volumes of data in search of meaningful patterns. Association rule mining is a data mining technique that tries to identify intrinsic patterns in large data sets. It has been widely used in different applications, and many algorithms have been introduced to discover these rules. However, most of these algorithms discretize all numeric attributes, which leads to loss of knowledge. In this paper, an algorithm for mining quantitative association rules using swarm intelligence is introduced. The algorithm discovers optimized intervals in numeric attributes without the need for a discretization process. Moreover, the algorithm does not need minimum support and confidence thresholds; instead it looks for the best support that conforms a frequent itemset. The algorithm is tested with both synthetic and real datasets, and the results show that it provides better accuracy when compared to other algorithms used for quantitative rules.

Keywords: Swarm Intelligence, Data Mining, Association Rules.

1. Introduction
Due to modern information technology, which produces ever more powerful computers every year, it is possible today to collect, store, transfer, and combine huge amounts of data at very low cost. Thus an ever-increasing number of companies and scientific institutions can afford to build up large archives of documents and other data like numbers, tables, images, and sounds. However, exploiting the information contained in these archives in an intelligent way turns out to be fairly difficult. Data mining is a research area that considers the analysis of large databases in order to identify valid, useful, meaningful, unknown, and unexpected relationships [8], [9].

Association rules are an important model in data mining, since the rules discovered tend to be very helpful and simple to read and evaluate. An association rule is defined as a relationship between attributes of the form C1 ⇒ C2, where C1 and C2 are attribute conditions of the form A = v if A is a discrete attribute, or A ∈ [v1, v2] if A is numeric. The interest of a rule is evaluated by means of statistical measures such as support and confidence [2], [3]:
- The support of a rule measures the ratio of the transactions that satisfy both the antecedent and the consequent of the rule. A rule C1 ⇒ C2 has support s if s% of the database records contain both C1 and C2.
- The confidence of a rule indicates the frequency with which the consequent is fulfilled when the antecedent is fulfilled. A rule C1 ⇒ C2 has confidence c if c% of the records that contain C1 also contain C2.

Typically, association rules are considered interesting if they satisfy both a minimum support threshold and a minimum confidence threshold. Such thresholds can be set by users or domain experts. The common framework for mining association rules has two steps. First, find the frequent itemsets, i.e., the sets of attribute conditions that exceed the minimum support. Second, generate rules based on these frequent itemsets. Most algorithms focus on the first phase, since it is time-consuming, while the second phase is considered a direct process. Some of these algorithms can be seen in [2], [15], [17], [18].

Mining association rules on numeric attributes is usually based on the idea of dividing the range of each numeric attribute into intervals, as in [24]; this is called discretization. Different methods have been used for the discretization process, such as fuzzy sets; these methods can be found in [5], [26]. This discretization process, whatever method is used for partitioning the intervals, can lead to loss of information. For example, consider a medical database that contains two rules: height ∈ [160,180] ⇒ weight ∈ [60,80] with support 30%, and height ∈ [165,185] ⇒ weight ∈ [60,80] with support 35%. If the discretization process fixes the interval [160,180] for the height attribute, the interval [165,185] cannot appear in any itemset, so the second rule can never be found. This fact generates a loss of information, so there is a need to obtain association rules on numeric attributes without discretization. This paper therefore introduces an algorithm that discovers the best intervals in each attribute that conform a frequent itemset.
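As a concrete illustration of the two measures, the following minimal Python sketch computes the support and confidence of a quantitative rule over a list of records; the record layout, the condition encoding, and the toy data are assumptions made only for the example.

def satisfies(record, condition):
    # A condition is ('eq', attr, value) for a discrete attribute, or
    # ('in', attr, low, high) for a numeric interval.
    if condition[0] == 'eq':
        return record[condition[1]] == condition[2]
    _, attr, low, high = condition
    return low <= record[attr] <= high

def support_confidence(records, antecedent, consequent):
    n_ante = n_both = 0
    for r in records:
        if all(satisfies(r, c) for c in antecedent):
            n_ante += 1
            if all(satisfies(r, c) for c in consequent):
                n_both += 1
    support = n_both / len(records)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

records = [{'height': 170, 'weight': 70},
           {'height': 182, 'weight': 85},
           {'height': 165, 'weight': 62}]
# height in [160,180] => weight in [60,80]: support 2/3, confidence 1.0
print(support_confidence(records,
                         [('in', 'height', 160, 180)],
                         [('in', 'weight', 60, 80)]))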

2. Quantitative association rule mining


Many approaches have been used for mining quantitative association rules. The discretization approach is a classical method that deals with numeric attributes in a preprocessing step, in which numeric attributes are discretized into intervals before the mining task [24], [26]. This can be achieved either by using user input or by automatic discretization methods, such as equi-depth binning. However, this approach remains sensitive to outliers and does not reflect the distribution of the data. Another approach, introduced in [7], [13], is based on using fuzzy sets in the discretization process. This decreases the loss of information, although the results remain sensitive to the fuzzy sets determined by this approach. A statistical approach is presented in [5], [27], where the right side of the rule expresses the distribution of the values of numeric attributes, such as the mean value. However, the rules in this approach have a specific form that cannot handle numeric attributes in the left-hand side of the rule.

An optimization approach for numeric attributes is introduced in [10]. It proposes an optimization criterion, called the Gain, as a trade-off between support and confidence, defined by:

Gain(A ⇒ B) = Supp(AB) − MinConf × Supp(A)

However, the rules from this method can contain at most two numeric attributes. Another method based on support optimization has been proposed in [21]. This work has then been extended to a gain-optimization approach in [6]. Although these works allow disjunctions between intervals, the forms of the rules remain restricted to one or two numeric attributes.

An important algorithm named GAR is introduced in [16]. It is based on the theory of evolutionary algorithms to find the most suitable amplitude of the intervals that conform a k-itemset, so that they have a high support value without the intervals being too wide. The fitness function depends on the support of a k-itemset and three factors: two penalization factors that avoid capturing the whole domains of the attributes and avoid overlapping between itemsets with respect to support, and one factor that favors itemsets with many attributes:

f(i) = covered − (marked × α) − (amplitude × β) + (nAtr × γ)    (1)

This approach is limited to numeric attributes only. Another approach, based on halfspaces, has been suggested in [22]. In this work, the left-hand and the right-hand side of an association rule contain a linear inequality on attributes, such as:

x1 > 20 ⇒ 0.5x2 + 4x4 ≤ 55

Although it is quite original and useful in many applications, the use of such rules is restricted, and this approach cannot handle categorical attributes. A more recent approach, introduced in [4] and called QuantMiner, is a genetic-based algorithm that works directly on a set of rule templates defined by the user. On the other hand, this method suffers from some drawbacks: it needs a lot of user work to prepare the set of rule templates, on which the results depend. If the user does not include all the possible templates, a loss of information may result. Also, using a large set of templates would take a lot of execution time.
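To make the Gain criterion of [10] concrete, the short sketch below evaluates it for a candidate rule; the record layout and the helper names are assumptions, not part of [10].

def frac_covered(records, conditions):
    # Fraction of records satisfying every interval condition
    # (attr, low, high); numeric attributes only, as in this example.
    hit = sum(1 for r in records
              if all(low <= r[a] <= high for a, low, high in conditions))
    return hit / len(records)

def gain(records, antecedent, consequent, min_conf):
    # Gain(A => B) = Supp(AB) - MinConf * Supp(A): positive exactly when
    # the rule's confidence on its cover exceeds MinConf.
    supp_a = frac_covered(records, antecedent)
    supp_ab = frac_covered(records, antecedent + consequent)
    return supp_ab - min_conf * supp_a

records = [{'x': 25, 'y': 50}, {'x': 30, 'y': 10}, {'x': 5, 'y': 55}]
print(gain(records, [('x', 20, 40)], [('y', 40, 60)], min_conf=0.6))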

3. Swarm Intelligence
Swarm intelligence is an innovative intelligent paradigm for solving optimization problems that originally took its inspiration from biological examples of swarming, flocking, and the behavior of real ants. It can be mainly divided into two models, particle swarm optimization and ant colony optimization. An overview of these models can be found in [1], [14].

Particle Swarm Optimization (PSO) incorporates swarming behaviors observed in flocks of birds, schools of fish, or swarms of bees, and even human social behavior, from which the idea emerged [14], [20]. PSO is a population-based optimization tool, which can be implemented and applied easily to solve various function optimization problems. As an algorithm, the main strength of PSO is its fast convergence, which compares favorably with many global optimization algorithms such as Genetic Algorithms.

Ant Colony Optimization (ACO) deals with artificial systems that are inspired by the foraging behavior of real ants, and is thus used to solve discrete optimization problems [14]. The main idea is the indirect communication between the ants by means of chemical pheromone trails; this pheromone enables them to find short paths between their nest and food. ACO has been shown to be a powerful paradigm when used for the discovery of classification rules involving nominal (categorical) attributes [19]. PSO has been introduced for classification rule mining in [23]. However, it cannot cope directly with nominal attributes: nominal values must be converted into binary numbers in a preprocessing phase, which is not a trivial process when a nominal attribute contains more than two values. Thus, in [11], [12] a hybrid PSO/ACO algorithm is introduced. This algorithm deals directly with both continuous and nominal attribute values. The algorithm has two main steps: first it discovers classification rules containing nominal attributes only; then it adds attributes with continuous values to the discovered rules. The results show that the algorithm is very competitive compared to other classification algorithms. In this paper an algorithm is introduced to discover association rules based on both standard PSO and the hybrid PSO/ACO algorithm in [12].

4. Hybrid PSO/ACO algorithm

The algorithm uses a sequential covering approach [27] to discover one frequent itemset at a time. In the algorithm, an item is represented either as A = v, where A is a nominal attribute and v is a value from its domain, or as A ∈ [v1, v2], where A is a quantitative attribute. Each particle represents an itemset. The algorithm starts by initializing the DiscoveredItemset to the empty set and performs a loop until the desired number of frequent itemsets is obtained. For each frequent itemset, the algorithm calls the PSO/ACO-AR subroutine to discover a frequent itemset with nominal attributes only; then the PSOAR algorithm is called in order to add continuous attributes to the frequent itemset, which is then added to the DiscoveredItemset. The final step is very important: it penalizes the records covered by the itemset obtained in the previous step. This penalization affects the fitness function, so that the search tends not to revisit the same region of the space. The loop is outlined in figure 1, and a code sketch follows it.

1. Nitemset = 0
2. DiscoveredItemset = {}
3. While (Nitemset < Number of itemsets)
4.   BestNominalParticle = PSO/ACO-AR()
5.   BestParticle = PSOAR(BestNominalParticle)
6.   DiscoveredItemset = DiscoveredItemset U BestParticle
7.   Penalize records covered by BestParticle
8.   Nitemset++
9. End while
Figure 1 Sequential covering algorithm
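The following Python sketch mirrors the loop of figure 1. The two subroutine names correspond to the algorithms of sections 4.1 and 4.2, but they are placeholders here; the itemset interface (covers) and the marked-record bookkeeping are assumptions made for illustration.

def mine_frequent_itemsets(records, n_itemsets, pso_aco_ar, psoar):
    # Sketch of the sequential covering loop of figure 1.
    # pso_aco_ar and psoar are passed in as callables standing for the
    # algorithms of sections 4.1 and 4.2 (placeholders here).
    discovered = []
    marked = set()  # indices of records already covered by earlier itemsets
    for _ in range(n_itemsets):
        nominal = pso_aco_ar(records, marked)   # nominal attributes only
        best = psoar(records, marked, nominal)  # then add numeric intervals
        discovered.append(best)
        # Penalize covered records: marked records lower the fitness of
        # later particles, steering the search toward uncovered regions.
        marked.update(i for i, r in enumerate(records) if best.covers(r))
    return discovered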

4.1 PSO/ACO-AR algorithm

The pseudo-code for PSO/ACO-AR is shown in figure 2. Each particle in the population is a collection of n pheromone matrices, where n is the number of nominal attributes. Each of the n matrices has two entries: one entry represents an off state and one entry represents an on state. The off state is selected when the corresponding attribute-value pair is not included in the decoded rule. If the on state is selected when the rule is decoded, then the corresponding attribute-value pair is added to the decoded rule. Each pheromone matrix entry is a number representing a probability, which is associated with a minimum, positive, non-zero pheromone value.

1. Num_iter = 0
2. Initialize a population of (M) particles
3. Compute the fitness of each particle
4. While (Num_iter < Number of Iterations) do
5.   For each particle P
6.     For each dimension d in P
7.       Use roulette selection to choose whether the state should be set to off or on
8.     End for each dimension
9.     Compute the fitness fit(P) of P
10.    B = P's past best state
11.    If fit(P) > fit(B) Then B = P
12.  End for each particle
13.  For each particle P
14.    Find best neighbor particle N according to fit(N)
15.    For each dimension d in P
16.      If Pd = Nd Then
17.        Pheromone corresponding to the value of Nd in P is increased by Qp
18.      Else
19.        Pheromone for the off state is decreased
20.      End if
21.      Normalize pheromone
22.    End for each dimension
23.  End for each particle
24.  Num_iter++
25. End while
26. Return the particle with the best fitness
Figure 2 PSO/ACO-AR algorithm

The algorithm starts by initializing the population, where each particle is seeded with terms from a randomly selected record of the dataset. The pheromone for the on state is set to 0.9 and the pheromone for the off state is set to 0.1. The algorithm then iterates the population until it reaches the maximum number of iterations. For each particle in the population, the algorithm loops over the dimensions of the particle, where a roulette selection method is used to decide whether the on or the off state should be selected. After this, the fitness of the new particle is computed; if the new fitness is greater than the previous best fitness for the particle, the new particle's state is saved. At step 13, the pheromone is updated for every particle. The particles are assumed to be in a Von Neumann topology, so each particle has a set of fixed neighbors. Each particle finds its best neighbor's best state N according to the fitness fit(N). The pheromone updating procedure is influenced by two factors: the best state a particle has ever held, and the best state ever held by a neighbor particle. The on and off states are examined in every dimension d, and the following rules are applied: if Pd is the same as Nd, an amount of pheromone equal to fit(P) is added to the pheromone entry in d corresponding to the value of Pd; if Pd is not the same as Nd, an amount of pheromone equal to fit(P) is removed from the pheromone entry in d corresponding to the value of Pd. All pheromone entries are then normalized. At the end, the algorithm returns the best particle found and passes it to the PSOAR algorithm. A sketch of the decoding and pheromone update follows.
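A minimal Python sketch of the per-dimension roulette decoding and the neighbor-driven pheromone update just described; the two-entry pheromone representation and the 0.9/0.1 seeding follow the text, while the data structures, the pheromone floor, and the function names are assumptions.

import random

MIN_PHEROMONE = 1e-3  # assumed floor keeping every entry positive and non-zero

def seed_particle(n_attributes):
    # Seeding from the text: on state 0.9, off state 0.1 per nominal attribute.
    return [[0.1, 0.9] for _ in range(n_attributes)]

def decode(pheromone):
    # Roulette selection per dimension: True means the attribute-value
    # pair is included (on state) in the decoded itemset.
    return [random.random() < p_on / (p_off + p_on)
            for p_off, p_on in pheromone]

def update_pheromone(pheromone, state, neighbor_best, fitness):
    # If the particle agrees with its best neighbor in dimension d, add
    # fit(P) to that entry; otherwise remove fit(P) from it. Renormalize.
    for d, (p_d, n_d) in enumerate(zip(state, neighbor_best)):
        entry = int(p_d)  # 0 = off, 1 = on
        if p_d == n_d:
            pheromone[d][entry] += fitness
        else:
            pheromone[d][entry] = max(MIN_PHEROMONE,
                                      pheromone[d][entry] - fitness)
        total = sum(pheromone[d])
        pheromone[d] = [max(MIN_PHEROMONE, e / total) for e in pheromone[d]]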

The algorithm uses the fitness function in equation (2), described in [16]:

fit(P) = support(P) − (marked × α) − (amplitude × β) + (nAtr × γ)    (2)

where support(P) is the number of records covered by particle P divided by the total number of records in the dataset. The marked parameter indicates how many of the covered records have been covered previously by another itemset; it is used to drive the discovery of different itemsets in later searches. To penalize such records, the penalization factor α gives more or less weight to the marked records. This factor permits more or less overlapping between the itemsets found, and it is defined by the user. The amplitude parameter is very important in the fitness function: its mission is to penalize the amplitude of the intervals that conform the itemset. In this way, if two itemsets cover the same number of records and have the same number of attributes, the best information is given by the one whose intervals are smaller. By means of the factor β the algorithm is made more or less permissive with regard to the growth of the intervals. The number of attributes (nAtr) parameter rewards the frequent itemsets with a larger number of attributes; its effect is increased or decreased by means of the factor γ. All the parameters of the fitness function are normalized into the unit interval, so that all of them have the same weight when obtaining the fitness of each particle. A small sketch of this computation follows.
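A minimal sketch of the fitness of equation (2), assuming each term has already been normalized into the unit interval as the text requires; the argument names and the normalization choices are illustrative, and the default weights are the values used later in the experiments.

def fitness(covered, marked, amplitude, n_attr, total_records, total_attrs,
            alpha=0.8, beta=0.7, gamma=0.4):
    # covered: records covered by the particle;
    # marked: covered records already claimed by earlier itemsets;
    # amplitude: mean interval width relative to each attribute's domain;
    # n_attr: number of attributes present in the itemset.
    support = covered / total_records
    marked_term = marked / total_records
    n_attr_term = n_attr / total_attrs
    return support - alpha * marked_term - beta * amplitude + gamma * n_attr_term

# Example: 238 of 1500 records covered, 10 of them already marked,
# intervals spanning 20% of their domains, all 4 attributes used.
print(fitness(238, 10, 0.2, 4, 1500, 4))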

4.2 PSOAR algorithm

Figure 3 presents a PSO algorithm for association rules (PSOAR). This algorithm uses a standard PSO algorithm as in [14]. Each particle in the population consists of two dimensions per attribute, one for the upper bound and one for the lower bound, and each particle P has two velocity vectors, vU_pd and vL_pd, for each dimension d. A particle is evaluated by being added to the frequent itemset produced by the PSO/ACO-AR algorithm and calculating its fitness using equation (2).

1. t = 0
2. Initialize a population of (M) particles
3. While (t < Number of Iterations) do
4.   Compute the fitness of each particle
5.   For each particle P
6.     Update past experience (P)
7.     Update global best particle Pg
8.     For each dimension d in P
9.       Update the velocity of the d-dimension
10.      Move Pd according to equations (3), (4)
11.    End for each dimension
12.  End for each particle
13.  t = t + 1
14. End while
15. Return the particle with the best fitness
Figure 3 PSOAR algorithm

The algorithm starts by initializing the population, with an improved technique as introduced in [12]. It then loops for a specified number of iterations, calculating the fitness of each particle. The algorithm then updates the best position found by the particle itself (x_l) and the position of the best particle among the swarm so far (x_g). Every particle then moves iteratively through its dimensions, updating the velocity factors for the upper and lower bound of the d-dimension according to equation (3); the position of particle p in dimension d (x_pd) is updated by equation (4). After a specified number of iterations, the algorithm returns the best particle discovered, containing the nominal part that was discovered by the PSO/ACO-AR algorithm.
v_pd(t+1) = w · v_pd(t) + c1 · r1 · (x_ld(t) − x_pd(t)) + c2 · r2 · (x_gd(t) − x_pd(t))    (3)

x_pd(t+1) = x_pd(t) + v_pd(t+1)    (4)

Equations (3) and (4) are used to update the velocity (v_pd) and position (x_pd) of particle P in dimension d, where x_ld is the best position found so far by particle p in dimension d, and x_gd is the best position in the d-th dimension of the best particle in the entire population. The inertia factor w is employed to control the impact of the previous history of velocities on the current one; accordingly, the parameter w regulates the trade-off between the global (wide-ranging) and local (nearby) exploration abilities of the swarm. Two random numbers r1 and r2, uniformly distributed in the interval [0,1] for the d-th dimension of particle p, are used to maintain the diversity of the population. c1 is a positive constant called the coefficient of the self-recognition component, and c2 is a positive constant called the coefficient of the social component. A particle decides where to move next considering its own experience, which is the memory of its best past position, and the experience of the most successful particle in the swarm. The parameters c1 and c2 are not critical for the convergence of PSO; however, proper fine-tuning may result in faster convergence and alleviation of local minima. As default values, usually c1 = c2 = 2 are used. A sketch of this update appears below.
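The sketch below applies equations (3) and (4) to the interval-bound encoding described above, where the dimensions come in (lower, upper) pairs per numeric attribute. The array layout and the clamping of positions to the attribute domain are assumptions, not details from the paper.

import random

W, C1, C2 = 0.4, 2.0, 2.0  # parameter values used later in the experiments

def pso_step(position, velocity, personal_best, global_best, domains):
    # position/velocity: flat lists, two entries (lower, upper) per attribute;
    # domains[k] = (min, max) of attribute k.
    for d in range(len(position)):
        r1, r2 = random.random(), random.random()
        # Equation (3): inertia + self-recognition + social components.
        velocity[d] = (W * velocity[d]
                       + C1 * r1 * (personal_best[d] - position[d])
                       + C2 * r2 * (global_best[d] - position[d]))
        # Equation (4): move the bound; clamping to the attribute domain
        # is an assumed safeguard.
        low, high = domains[d // 2]
        position[d] = min(high, max(low, position[d] + velocity[d]))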


5. Experimental results
The algorithm is developed in C# on the .NET Framework 2.0 [25]. It is executed on a PC with a Core 2 Duo 2.6 GHz processor and 2 GB of main memory, running Windows Server 2003 R2. To carry out the tests, the algorithm is executed with a population of 75 particles and 60 iterations. The parameters of the algorithm were chosen as w = 0.4 and c1 = c2 = 2.


5.1 Synthetic datasets


The GAR algorithm was implemented in C# in order to compare its results with the Hybrid PSO/ACO-AR algorithm. A database was formed in the same way described in [16]. It is composed of four numeric attributes and 1250 records. The values were distributed, by means of a uniform distribution, into 5 sets formed by predetermined intervals. In addition, 250 new records were added to introduce noise into the data, distributing their values between the minimum and maximum values of the intervals. Table 1 gives the synthetically created sets. The exact support of each set is 16.67%, since each set covers exactly 250 records.

Table 1 Sets synthetically created

Sets                                                            Sup(%)  Records
A1 ∈ [1,15],  A2 ∈ [7,35],  A3 ∈ [60,75], A4 ∈ [0,25]           16.67   250
A1 ∈ [5,30],  A2 ∈ [25,40], A3 ∈ [10,30], A4 ∈ [25,50]          16.67   250
A1 ∈ [45,60], A2 ∈ [55,85], A3 ∈ [20,25], A4 ∈ [50,75]          16.67   250
A1 ∈ [75,77], A2 ∈ [0,40],  A3 ∈ [58,60], A4 ∈ [75,100]         16.67   250
A1 ∈ [10,30], A2 ∈ [0,30],  A3 ∈ [65,70], A4 ∈ [100,125]        16.67   250

The parameters of the fitness function were α = 0.8, β = 0.7, γ = 0.4. Table 2 shows the frequent itemsets discovered by Hybrid PSO/ACO-AR, and table 3 gives the itemsets found by the GAR algorithm. As can be seen, the support of each of the found itemsets is quite close to the synthetically generated sets. Hybrid PSO/ACO-AR was also tested with 40 iterations, and the results were the same as those in table 2, where the number of iterations is 60. This shows that the Hybrid PSO/ACO-AR algorithm behaves correctly in finding frequent itemsets, and that its results are closer to optimal than those of GAR. This suggests that swarm intelligence converges faster and more accurately to optimal solutions than the genetic algorithm.

Table 2 Frequent itemsets discovered by Hybrid PSO/ACO-AR

Frequent itemsets                                               Sup(%)  Records
A1 ∈ [1,15],  A2 ∈ [7,35],  A3 ∈ [60,74], A4 ∈ [1,25]           15.9    238
A1 ∈ [7,31],  A2 ∈ [23,40], A3 ∈ [11,32], A4 ∈ [24,50]          16.53   248
A1 ∈ [45,60], A2 ∈ [54,85], A3 ∈ [20,25], A4 ∈ [50,75]          16.67   250
A1 ∈ [75,77], A2 ∈ [0,40],  A3 ∈ [58,60], A4 ∈ [75,100]         16.67   250
A1 ∈ [10,28], A2 ∈ [1,28],  A3 ∈ [65,70], A4 ∈ [100,123]        15.0    225

Table 3 Frequent itemsets discovered by GAR

Frequent itemsets                                               Sup(%)  Records
A1 ∈ [1,15],  A2 ∈ [6,35],  A3 ∈ [61,76], A4 ∈ [1,26]           15.8    237
A1 ∈ [4,30],  A2 ∈ [24,40], A3 ∈ [10,26], A4 ∈ [26,51]          13.5    203
A1 ∈ [44,61], A2 ∈ [55,83], A3 ∈ [20,34], A4 ∈ [49,72]          15.3    229
A1 ∈ [74,77], A2 ∈ [9,40],  A3 ∈ [55,60], A4 ∈ [72,97]          12.1    182
A1 ∈ [4,29],  A2 ∈ [3,28],  A3 ∈ [62,71], A4 ∈ [105,126]        12.2    184

A second experiment was carried out to test the scalability of Hybrid PSO/ACO-AR. The same synthetically created sets of the first experiment were used, but with a different number of records, scaled from 1000 to 50000 records. Figure 4 shows the execution time in seconds required for discovering one frequent itemset for both Hybrid PSO/ACO-AR and GAR, with a population size of 100 and 60 iterations.

Figure 4 Execution time of Hybrid PSO/ACO-AR and GAR

Table 4 Execution time of Hybrid PSO/ACO-AR and GAR for one itemset (in seconds)

No. of Records        1250    10000   50000
Hybrid PSO/ACO-AR     5.1     13.0    49.8
GAR                   13.2    21.1    118.5

As seen in figure 4 and table 4, the Hybrid PSO/ACO-AR algorithm has a faster execution time compared to GAR. In figure 5 and table 5, the Hybrid PSO/ACO-AR algorithm is executed on the same dataset with different population sizes. It can be seen that increasing the population size has a large effect on the execution time, especially with large datasets: on a dataset containing 50000 records, the execution time with 100 particles is approximately double the execution time with 50 particles.

Figure 5 Execution time of Hybrid PSO/ACO-AR for different population sizes

Table 5 Execution time of Hybrid PSO/ACO-AR for one itemset (in seconds)

Population Size       1250    10000   50000
50                    2.0     5.5     20.9
75                    4.2     10.0    39.3
100                   5.1     13.0    49.8

Another experiment was done to test the ability of the Hybrid PSO/ACO-AR algorithm to work on both numerical and categorical attributes. The previous dataset was used, with a new column containing categorical values added, as seen in table 6. Each set thus has support 16.67%, as in the first experiment.

Table 6 Sets synthetically created with categorical values

Sets                                                                     Sup(%)  Records
A1 ∈ [1,15],  A2 ∈ [7,35],  A3 ∈ [60,75], A4 ∈ [0,25],   A5 = 0          16.67   250
A1 ∈ [5,30],  A2 ∈ [25,40], A3 ∈ [10,30], A4 ∈ [25,50],  A5 = 1          16.67   250
A1 ∈ [45,60], A2 ∈ [55,85], A3 ∈ [20,25], A4 ∈ [50,75],  A5 = 2          16.67   250
A1 ∈ [75,77], A2 ∈ [0,40],  A3 ∈ [58,60], A4 ∈ [75,100], A5 = 3          16.67   250
A1 ∈ [10,30], A2 ∈ [0,30],  A3 ∈ [65,70], A4 ∈ [100,125], A5 = 4         16.67   250

Running the Hybrid PSO/ACO-AR algorithm on this dataset results in more accurate frequent itemsets compared to the synthetically created ones. The results are shown in table 7.

Table 7 Frequent itemsets discovered by Hybrid PSO/ACO-AR

Frequent itemsets                                                        Sup(%)  Records
A1 ∈ [1,14],  A2 ∈ [7,35],  A3 ∈ [60,75], A4 ∈ [0,24],   A5 = 0          16.67   250
A1 ∈ [5,30],  A2 ∈ [20,40], A3 ∈ [11,33], A4 ∈ [26,50],  A5 = 1          16.20   243
A1 ∈ [45,60], A2 ∈ [55,84], A3 ∈ [20,24], A4 ∈ [50,74],  A5 = 2          16.67   250
A1 ∈ [75,77], A2 ∈ [0,40],  A3 ∈ [58,60], A4 ∈ [75,100], A5 = 3          16.67   250
A1 ∈ [10,28], A2 ∈ [0,27],  A3 ∈ [65,70], A4 ∈ [99,122], A5 = 4          15.67   235

As seen in table 7, the support of the first itemset, for example, is 16.67%, while the same itemset without the categorical attribute was discovered in table 2 with support 15.9%. This indicates that the algorithm behaves better when dealing with both categorical and numerical attributes, and obtains more accurate intervals for the numerical attributes.

5.2 Real-life datasets


In these experiments, a simple direct algorithm, as discussed in [2], is used to generate rules from the discovered frequent itemsets; a sketch of this step follows below. The first experiment is done on the Iris dataset, which is also used in [4], so the results from Hybrid PSO/ACO-AR are compared with the results obtained in [4]. The Iris dataset is composed of 150 samples of flowers from the iris species setosa, versicolor, and virginica. For each species, there are 50 observations described by the attributes Sepal Length (SL), Sepal Width (SW), Petal Length (PL), and Petal Width (PW), in mm. These attributes are shown in table 8.

Table 8 Statistical information of the Iris dataset

Species                   Attributes
Setosa (Sup = 33%)        PW ∈ [1,6],   SW ∈ [23,44], PL ∈ [10,19], SL ∈ [43,58]
Versicolor (Sup = 33%)    PW ∈ [10,18], SW ∈ [20,34], PL ∈ [30,51], SL ∈ [49,70]
Virginica (Sup = 33%)     PW ∈ [14,25], SW ∈ [22,38], PL ∈ [45,69], SL ∈ [49,79]
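The rule-generation step is not detailed in the paper beyond the reference to [2]; the sketch below shows one straightforward reading of it, enumerating the antecedent/consequent splits of a discovered frequent itemset and keeping the rules with high confidence. The condition encoding and the threshold are illustrative assumptions.

from itertools import combinations

def covers(record, cond):
    # Numeric-interval condition (attr, low, high); nominal conditions
    # would be handled analogously with an equality test.
    attr, low, high = cond
    return low <= record[attr] <= high

def generate_rules(records, itemset, min_conf=0.6):
    # Direct rule generation in the spirit of [2]: for every non-trivial
    # split of the itemset, keep antecedent => consequent if its
    # confidence reaches min_conf.
    rules = []
    n_itemset = sum(1 for r in records if all(covers(r, c) for c in itemset))
    for k in range(1, len(itemset)):
        for ante in combinations(itemset, k):
            n_ante = sum(1 for r in records if all(covers(r, c) for c in ante))
            conf = n_itemset / n_ante if n_ante else 0.0
            if conf >= min_conf:
                cons = [c for c in itemset if c not in ante]
                rules.append((list(ante), cons, n_itemset / len(records), conf))
    return rules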

The Hybrid PSO/ACO-AR algorithm found the association rules shown in table 9, alongside the rules obtained by QuantMiner in [4].
Table 9 Association rules from Hybrid PSO/ACO-AR and QuantMiner

Hybrid PSO/ACO-AR:
Species = Setosa ⇒ PW ∈ [1,6], SW ∈ [29,42], PL ∈ [10,19], SL ∈ [43,58]          Sup = 32.0%, Conf = 96%
Species = Versicolor ⇒ PW ∈ [10,18], SW ∈ [22,34], PL ∈ [30,51], SL ∈ [49,70]    Sup = 33%, Conf = 100%
Species = Virginica ⇒ PW ∈ [13,26], SW ∈ [23,37], PL ∈ [48,64], SL ∈ [56,73]     Sup = 31.3%, Conf = 94%

QuantMiner:
Species = Setosa ⇒ PW ∈ [1,6], SW ∈ [31,39], PL ∈ [10,19], SL ∈ [46,54]          Sup = 23.3%, Conf = 70%
Species = Versicolor ⇒ PW ∈ [10,15], SW ∈ [22,30], PL ∈ [35,47], SL ∈ [55,66]    Sup = 21.3%, Conf = 64%
Species = Virginica ⇒ PW ∈ [18,25], SW ∈ [27,33], PL ∈ [48,60], SL ∈ [58,72]     Sup = 20.6%, Conf = 60%

The rules from Hybrid PSO/ACO-AR are closer to the actual rules in the Iris dataset than those of QuantMiner. In the three rules discovered by Hybrid PSO/ACO-AR, the numeric intervals are more accurate compared to the intervals discovered by QuantMiner, and the support and confidence of the rules are much better; the second rule even has support 33% and confidence 100%, which is the exact support and confidence of this rule in the Iris dataset.

The Hybrid PSO/ACO-AR algorithm is also tested on a real-life medical database donated by the Faculty of Medicine, Suez Canal University. Live data is preferred over generated data or an old collected dataset, because live data can provide more accurate results. The dataset describes the profiles of patients of the Suez Canal University Hospital being treated for heart disease. It consists of 16 attributes, 6 of which are numeric while the others are categorical, and it contains 200 observations of patients. Three examples of the discovered rules follow.

Rule 1: Support = 12%, Confidence = 100%
If age ∈ [45,60], gender = Male, BPsystolic ∈ [120,140], dyspneaNYHA_grade = I, neck = Pulsating
⇒ diagnosis_pathological = IHD

Rule 2: Support = 2%, Confidence = 100%
If age ∈ [42,62], anginal_chest_pain = recurrent and severe, dyspneaNYHA_grade = I, gender = Male
⇒ diagnosis_anatomical = CAD

Rule 3: Support = 16%, Confidence = 80%
If aorticvalve_systolic_gradient ∈ [5,10], gender = Male, Married, Rural
⇒ diagnosis_pathological = CHD, diagnosis_functional_class = I

These rules were considered relevant and interesting when presented to medical experts at Suez Canal University. As shown in these rules, rule 1 provides the best interval for age as [45,60], while in rule 2 the best interval is [42,62]. With the previous techniques, the attributes are discretized before searching for the itemsets; so, if the discretization process fixes the interval [45,60] for the age attribute, the interval [42,62] cannot appear in any itemset. This generates a loss of information, as the rule cannot be discovered, and it demonstrates the importance of quantitative association rules.

In the third experiment, the Pima Indians Diabetes Database from the National Institute of Diabetes and Digestive and Kidney Diseases was used. This database contains 768 records and 9 attributes. All of the attributes are numeric except the last one, a categorical attribute that indicates whether the patient suffers from diabetes or not. The dataset contains two main categories: one indicates positive diabetes, covering 268 of the patients, and the other indicates negative diabetes, covering 500 patients. The algorithm discovered a number of rules; the four highest-confidence rules are listed below.

Rule 1: Support = 33.6%, Confidence = 96.3%
If Number_of_times_Pregnant ∈ [0,17], Plasma_Glucose ∈ [78,199], Diastolic_Blood_Pressure ∈ [0,114], Diabetes_Pedigree_Function ∈ [0.08,1.4], Body_Mass_Index ∈ [22,68], 2-hour_serum_Insulin ∈ [0,600], Triceps_Skin_Fold_Thickness ∈ [0,56], Age ∈ [21,70]
⇒ Tested_Positive_Diabetes

Rule 2: Support = 64.2%, Confidence = 98.6%
If Number_of_times_Pregnant ∈ [0,13], Plasma_Glucose ∈ [44,194], Diastolic_Blood_Pressure ∈ [0,122], Diabetes_Pedigree_Function ∈ [0.0,1.79], Body_Mass_Index ∈ [0,48], 2-hour_serum_Insulin ∈ [0,545], Triceps_Skin_Fold_Thickness ∈ [0,60], Age ∈ [21,72]
⇒ Tested_Negative_Diabetes

Rule 3: Support = 19.4%, Confidence = 96.1%
If Number_of_times_Pregnant ∈ [0,9], Plasma_Glucose ∈ [126,198], Diastolic_Blood_Pressure ∈ [52,90], Diabetes_Pedigree_Function ∈ [0.11,0.9], 2-hour_serum_Insulin ∈ [0,330], Triceps_Skin_Fold_Thickness ∈ [0,46], Age ∈ [21,51]
⇒ Body_Mass_Index ∈ [21,51]

Rule 4: Support = 27.2%, Confidence = 100%
If Number_of_times_Pregnant ∈ [0,4], Diastolic_Blood_Pressure ∈ [45,136], Diabetes_Pedigree_Function ∈ [0,1.87], Body_Mass_Index ∈ [13,80], 2-hour_serum_Insulin ∈ [0,224], Triceps_Skin_Fold_Thickness ∈ [10,27]
⇒ Plasma_Glucose ∈ [0,214]

These rules were examined by medical experts and considered interesting and adequate information. The first rule covers the positive-diabetes category with support 33.6% (258 records), which is very close to the number of positive patients in the database (268). Rule 2 indicates the negative-diabetes category with support 64.2% (493 records) out of the 500 patients in the dataset with negative diabetes results. As can be seen, all four rules have interlaced numeric intervals; for example, the Age attribute has the interval [21,72] in rule 2 and [21,51] in rule 3. With an ordinary discretization process, rule 3 could therefore not be discovered. This illustrates the significance of quantitative association rules, especially in the medical domain.

6. Conclusions
Quantitative association rule mining is very important in many applications that involve numeric attributes, such as the medical, industrial, and financial fields. The ordinary techniques for mining association rules require discretizing the numeric values, which can lead to loss of information; with quantitative rules, instead, the itemsets are optimized for the best interval in the attribute domain. A hybrid PSO/ACO-AR algorithm is presented in this paper, inspired by both particle swarm optimization and ant colony optimization. The algorithm can deal directly with the dataset without the need for any preprocessing of the data. It does not require a minimum support or confidence, as required in GAR [16], since it optimizes the itemsets for the best support. The algorithm also does not require specifying rule templates to work on, as QuantMiner does [4]. The Hybrid PSO/ACO-AR algorithm can optimize both numerical and categorical attributes, which matters because most real-life databases contain both types of attributes, unlike the GAR algorithm, which handles numeric attributes only. Several experiments have been carried out to check the behavior of the algorithm on different datasets. On synthetically created datasets, the algorithm shows good results compared to the GAR algorithm. In the test on the Iris dataset, the rules discovered were more accurate than the rules obtained by QuantMiner. For the medical databases, the algorithm also obtains satisfactory results. In future work, it is planned to investigate new measures to be included in the fitness function in order to prevent premature convergence of the rules, and to implement the algorithm using a different data structure in order to reduce the total execution time, since the algorithm's execution time is longer than that of ordinary association rule algorithms such as Apriori.

7. References
[1] Abraham, A., Guo, H., Liu, H.: Swarm Intelligence: Foundations, Perspectives and Applications. Studies in Computational Intelligence (SCI) 26, 3-25, 2006.
[2] Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. Proc. 20th Int'l Conference on Very Large Data Bases, 478-499, 1994.
[3] Agrawal, R., Imielinski, T., Swami, A.: Mining association rules between sets of items in very large databases. Proc. ACM SIGMOD Conference on Management of Data, 207-216, 1993.
[4] Salleb-Aouissi, A., Vrain, C., Nortet, C.: QuantMiner: A Genetic Algorithm for Mining Quantitative Association Rules. Proc. 20th International Joint Conference on Artificial Intelligence (IJCAI), 1035-1040, Hyderabad, India, 2007.
[5] Aumann, Y., Lindell, Y.: A Statistical Theory for Quantitative Association Rules. Proc. KDD'99, 261-270, San Diego, CA, 1999.
[6] Brin, S., Rastogi, R., Shim, K.: Mining optimized gain rules for numeric attributes. IEEE Trans. Knowl. Data Eng., 15(2):324-338, 2003.
[7] Delgado, M., Marin, N., Sanchez, D., Vila, M.-A.: Fuzzy association rules: general model and applications. IEEE Transactions on Fuzzy Systems, 214-225, Apr. 2003.
[8] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (eds.): Advances in Knowledge Discovery and Data Mining. MIT Press, Cambridge, MA, 1996.
[9] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: From data mining to knowledge discovery: An overview. In: Advances in Knowledge Discovery and Data Mining, 1-36. MIT Press, Cambridge, MA, 1996.
[10] Fukuda, T., Morimoto, Y., Morishita, S., Tokuyama, T.: Mining optimized association rules for numeric attributes. Proc. 15th ACM SIGACT-SIGMOD-SIGART PODS'96, 182-191. ACM Press, 1996.
[11] Holden, N., Freitas, A. A.: A hybrid PSO/ACO algorithm for classification. Proc. Genetic and Evolutionary Computation Conference (GECCO-2007) Workshop on Particle Swarms: The Second Decade, 2745-2750. ACM, 2007.
[12] Holden, N., Freitas, A. A.: A Hybrid PSO/ACO Algorithm for Discovering Classification Rules in Data Mining. Journal of Artificial Evolution and Applications (JAEA), 2008.
[13] Chan, K. C. C., Au, W.-H.: Mining fuzzy association rules in a database containing relational and transactional data. In: Data Mining and Computational Intelligence, 95-114. Physica-Verlag GmbH, March 2001.
[14] Kennedy, J., Eberhart, R.: Swarm Intelligence. Morgan Kaufmann Publishers, San Francisco, CA, 2001.
[15] Lin, D.-I., Kedem, Z. M.: Pincer Search: A New Algorithm for Discovering the Maximum Frequent Set. Proc. 6th Int'l Conference on Extending Database Technology (EDBT), 105-119, Valencia, 1998.
[16] Mata, J., Alvarez, J. L., Riquelme, J. C.: An evolutionary algorithm to discover numeric association rules. Proc. ACM Symposium on Applied Computing (SAC 2002), 590-594, 2002.
[17] Orlando, S., Palmerini, P., Perego, R.: Enhancing the Apriori Algorithm for Frequent Set Counting. Proc. 3rd Int. Conf. on Data Warehousing, 71-82, Munich, Germany, 2001.
[18] Park, J. S., Chen, M. S., Yu, P. S.: An Effective Hash Based Algorithm for Mining Association Rules. Proc. ACM SIGMOD Int'l Conf. on Management of Data, San Jose, CA, 1995.
[19] Parpinelli, R. S., Lopes, H. S., Freitas, A. A.: Data Mining with an Ant Colony Optimization Algorithm. IEEE Trans. on Evolutionary Computation, special issue on Ant Colony Algorithms, 6(4):321-332, 2002.
[20] Parsopoulos, K. E., Vrahatis, M. N.: On the computation of all global minimizers through particle swarm optimization. IEEE Transactions on Evolutionary Computation, 8(3):211-224, 2004.
[21] Rastogi, R., Shim, K.: Mining optimized support rules for numeric attributes. Proc. 15th ICDE, 206-215, 1999.
[22] Ruckert, U., Richter, L., Kramer, S.: Quantitative association rules based on half-spaces: An optimization approach. Proc. 4th IEEE International Conference on Data Mining (ICDM'04), 507-510, 2004.
[23] Sousa, T., Silva, A., Neves, A.: Particle Swarm based Data Mining Algorithms for classification tasks. Parallel Computing 30, 767-783, 2004.
[24] Srikant, R., Agrawal, R.: Mining Quantitative Association Rules in Large Relational Tables. Proc. ACM SIGMOD, 1-12, 1996.
[25] Troelsen, A.: Pro C# 2005 and the .NET 2.0 Platform, Third Edition, 2005.
[26] Wang, K., Tay, S. H., Liu, B.: Interestingness-Based Interval Merger for Numeric Association Rules. Proc. 4th Int. Conf. KDD, 121-128, 1998.
[27] Webb, G. I.: Discovering associations with numeric variables. Proc. 7th ACM SIGKDD, 383-388. ACM Press, 2001.
