
TEJASWI PINNIBOYINA* et al.

[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY

ISSN: 2250-3676
Volume 2, Special Issue 1, pp. 61-70

MINING ITEMS FROM LARGE DATABASE USING COHERENT RULES


Tejaswi Pinniboyina1, Navya Dhulipala2, Radha Rani Deevi3, Sushma Nathani4

1 M.Tech, C.S.E, KL University, Vaddeswaram, A.P, India, tejaswifriends@gmail.com
2 M.Tech, C.S.E, KL University, Vaddeswaram, A.P, India, navya.dhulipalla@gmail.com
3 Asst. Prof, C.S.E, KL University, Vaddeswaram, A.P, India, deevi_radharani@gmail.com
4 M.Tech, C.S.E, KL University, Vaddeswaram, A.P, India, sushma.nathani@gmail.com

Abstract
Mining frequent items from large databases with association rules requires strong domain knowledge to specify the minimum support threshold, which is error-prone. To overcome this, a new algorithm based on coherent rules is proposed. With coherent rules, users can mine items without domain knowledge, and items can be mined more efficiently than with association rules.

Index Terms: Coherent rules, Association rules, Minimum Threshold, Correlation

1. INTRODUCTION
The process of extracting information from large databases is termed data mining. Different algorithms are used to extract the data; with association rules, frequent items are discovered from a database by specifying a minimum support threshold value. Mining therefore requires domain knowledge [3] and statistical methods to set that threshold. If the threshold is set very high, rare items may be missed; if it is set low, the retrieved items may be inconsistent, and interesting rules needed for future analysis may be lost, leading to errors in decision making.

To mine rare items [12], one approach groups them into arbitrary item sets so that they become frequent; this is the rare-items problem defined by Mannila [13], according to Liu et al. Another method splits the data set into two or several blocks according to frequency and mines each block with its own minimum support threshold, although association rules involving both frequent and rare items across different blocks are lost. Later, frequent items were mined using multiple minimum thresholds, called minimum item supports (MISs), by Liu et al. [12]; later still, items were mined using minimum relative support and minimum confidence, even though the resulting rules are not necessarily well correlated [2]. Measures such as lift and leverage were then used to threshold the mined items, but these measures are not asymmetrical. To overcome these problems, an approach based on coherent rules and logical implications is used to mine items, from which association rules are discovered inherently using standard logic tables of implications. The frequent items are extracted without specifying a minimum support threshold, and items can be mined without the user's domain knowledge. Association rules can be derived inherently from these coherent rules.

2. PROBLEMS IN EXISTING ASSOCIATION RULE MINING


Existing association rule mining algorithms are based on a minimum support threshold for generating rules, which requires domain knowledge; as a result, interesting rules can be missed in future analysis. In particular:

1. Different minimum support thresholds yield inconsistent mining results, even when the mining process is performed on the same dataset.
2. Association rules discovered using support and confidence may not be statistically correlated.
3. Some rules may not be interesting due to high marginal probabilities in their consequent item sets.
4. Frequently co-occurring association rules, even with high confidence values, may not be truly related.
5. Association rule mining algorithms have a high correlation error rate.

3. RELATED WORK
In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro [4] describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based

IJESAT | Jan-Feb 2012


Available online @ http://www.ijesat.org 61

on the concept of strong rules, Agrawal et al. [2] introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale systems in supermarkets. For example, the rule {onions, potatoes} => {burger} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. Beyond market basket analysis, association rules are employed today in many application areas, including Web usage mining, intrusion detection, and bioinformatics. Following the original definition by Agrawal et al. [2], the problem of association rule mining is defined as follows: Let I = {i1, i2, ..., in} be a set of n binary attributes called items. Let D = {t1, t2, ..., tm} be a set of transactions called the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I.


The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X). For example, the rule {milk, bread} => {butter} has a confidence of 0.2 / 0.4 = 0.5 in the database, which means that the rule is correct for 50% of the transactions containing milk and bread. Confidence can be interpreted as an estimate of the conditional probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS [5].

The lift of a rule is defined as lift(X => Y) = supp(X ∪ Y) / (supp(X) * supp(Y)), the ratio of the observed support to that expected if X and Y were independent. The rule {milk, bread} => {butter} has a lift of 0.2 / (0.4 * 0.4) = 1.25.

The conviction of a rule is defined as conv(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)). The rule {milk, bread} => {butter} has a conviction of (1 - 0.4) / (1 - 0.5) = 1.2, and can be interpreted as the ratio of the expected frequency that X occurs without Y (that is, the frequency with which the rule makes an incorrect prediction) if X and Y were independent, divided by the observed frequency of incorrect predictions. In this example, the conviction value of 1.2 shows that the rule {milk, bread} => {butter} would be incorrect 20% more often (1.2 times as often) if the association between X and Y were purely random chance.

Succinctness (characterized by clear, precise expression in few words) is a property of a constraint. Usage example: instead of using the Apriori algorithm to obtain the frequent item sets, we can create all the item sets and run support counting only once. Association rules are usually required to satisfy a user-specified minimum support and a user-specified minimum confidence at the same time. Association rule generation is usually split into two separate steps:
1. First, minimum support is applied to find all frequent item sets in a database.
2. Second, these frequent item sets and the minimum confidence constraint are used to form rules.
While the second step is straightforward, the first step needs more attention. Finding all frequent item sets in a database is difficult since it involves searching all possible item sets (item combinations). The set of possible item sets is the power set over I and has size 2^n - 1 (excluding the empty set, which is not a valid item set). Although the size of the power set grows exponentially in the number of items n in I, efficient search is possible using the downward-closure property of support [2] (also called anti-monotonicity [6]), which guarantees that all subsets of a frequent item set are also frequent and, conversely, that all supersets of an infrequent item set must also be infrequent. Association rules are therefore an important class of regularities for mining items from a database. Since it was first introduced, the problem of

Table 1
Example Database with Five Transactions and Four Items [2]

Transaction ID   Beer   Milk   Bread   Butter
1                0      1      1       0
2                0      0      0       1
3                1      0      0       0
4                0      1      1       1
5                0      0      1       0

A rule is defined as an implication of the form X => Y, where X, Y ⊆ I and X ∩ Y = ∅. To select interesting rules from the set of all possible rules, constraints on various measures of significance and interest can be used. The best-known constraints are minimum thresholds on support and confidence. The support supp(X) of an item set X is defined as the proportion of transactions in the data set which contain the item set. In the example database, the item set {milk, bread, butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions).
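The definitions above can be checked directly on the example database of Table 1. The following sketch (not code from the paper) computes support, confidence, lift, and conviction and reproduces the 0.2, 0.5, 1.25, and 1.2 values quoted in the text:

```python
# Support, confidence, lift and conviction computed on the example
# database of Table 1 (five transactions over beer, milk, bread, butter).
D = [
    {"milk", "bread"},            # transaction 1
    {"butter"},                   # transaction 2
    {"beer"},                     # transaction 3
    {"milk", "bread", "butter"},  # transaction 4
    {"bread"},                    # transaction 5
]

def supp(items):
    """Proportion of transactions containing every item in `items`."""
    return sum(items <= t for t in D) / len(D)

def conf(X, Y):
    return supp(X | Y) / supp(X)

def lift(X, Y):
    return supp(X | Y) / (supp(X) * supp(Y))

def conviction(X, Y):
    return (1 - supp(Y)) / (1 - conf(X, Y))

X, Y = {"milk", "bread"}, {"butter"}
print(supp(X | Y))                 # 0.2
print(conf(X, Y))                  # 0.5
print(round(lift(X, Y), 2))        # 1.25
print(round(conviction(X, Y), 2))  # 1.2
```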


mining associations has received a great deal of attention. The classic application is market basket analysis, which analyzes how the items purchased by customers are associated. An example of an association rule is: cheese => beer [sup = 10%, conf = 80%]. This rule says that 10% of customers buy cheese and beer together, and those who buy cheese also buy beer 80% of the time. The basic model of association rules is as follows: the rule X => Y holds with confidence c if c% of the transactions in T that support X also support Y; the rule has support s in T if s% of the transactions in T contain X ∪ Y. Given a set of transactions T (the database), the problem of mining association rules is to discover all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf): first generate all large item sets that satisfy minsup, then generate from the large item sets all association rules that satisfy minconf. An item set is simply a set of items; a large item set is an item set whose transaction support is above minsup. Association rule mining has been studied extensively in the past [e.g., 2, 3, 5, 11, 4, 14, 10, 12, and 1]. The model used in all these studies, however, has always been the same, i.e., finding all rules that satisfy user-specified minimum support and minimum confidence constraints. The key element that makes association rule mining practical is the minsup: it is used to prune the search space and to limit the number of rules generated. However, using only a single minsup implicitly assumes that all items in the data are of the same nature (to be explained below) and have similar frequencies in the database. This is often not the case in real-life applications. In many applications, some items appear very frequently in the data, while others rarely appear.
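The second step of the two-step model can be sketched as follows: given one large item set, every split into an antecedent X and a consequent Y is tested against minconf. The small database and the `rules_from` helper below are illustrative stand-ins, not artifacts from the paper:

```python
from itertools import combinations

# Toy database for illustration (same shape as the Table 1 example).
D = [
    {"milk", "bread"},
    {"butter"},
    {"beer"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def supp(items):
    return sum(items <= t for t in D) / len(D)

def rules_from(itemset, minconf):
    """Emit every rule X => Y with X ∪ Y = itemset, X ∩ Y = ∅, conf >= minconf."""
    out = []
    for k in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), k):
            X = frozenset(lhs)
            Y = frozenset(itemset) - X
            c = supp(X | Y) / supp(X)
            if c >= minconf:
                out.append((X, Y, c))
    return out

for X, Y, c in rules_from({"milk", "bread"}, 0.5):
    print(sorted(X), "=>", sorted(Y), round(c, 2))
```

From the large item set {milk, bread} this yields two rules, {bread} => {milk} and {milk} => {bread}, whose confidences differ even though their support is identical.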
If the frequencies of items vary a great deal, we encounter the following problems. If minsup is set too high, we will not find rules that involve infrequent or rare items in the data. To find rules that involve both frequent and rare items, we have to set minsup very low. However, this may cause combinatorial explosion, producing too many rules, because the frequent items will be associated with one another in all possible ways, and many of the resulting rules are meaningless. Example 3: In supermarket transaction data, to find rules involving infrequently purchased items such as food processors and cooking pans (which generate more profit per item), we need to set minsup very low (say, 0.5%): food processor => cooking pan [sup = 0.5%, conf = 60%]. However, this low minsup may also cause meaningless rules such as the following to be found: bread, cheese, milk => beer [sup = 0.5%, conf = 60%]. Knowing that 0.5% of the customers buy these four items together is useless, because all these items are frequently purchased in a


supermarket. For this rule to be useful, the support needs to be much higher. As mentioned, existing algorithms for mining association rules typically consist of two steps: (1) finding all large item sets; and (2) generating association rules using the large item sets. Almost all research on association rule mining algorithms has focused on the first step, since it is computationally more expensive. Also, the second step does not lend itself as well to smart algorithms, because confidence does not possess a closure property. Support, on the other hand, is downward closed: if a set of items satisfies minsup, then all its subsets also satisfy minsup. The downward closure property holds the key to pruning in all existing mining algorithms. Efficient algorithms for finding large item sets are based on level-wise search. Let k-item set denote an item set with k items. At level 1, all large 1-item sets are generated; at level 2, all large 2-item sets are generated; and so on. If an item set is not large at level k-1, it is discarded, as any addition of items to the set cannot be large (downward closure property). All the potentially large item sets at level k are generated from the large item sets at level k-1. However, in the proposed model, if we use an existing algorithm to find all large item sets, the downward closure property no longer holds. Example 4: Consider four items 1, 2, 3 and 4 in a database, with minimum item supports MIS(1) = 10%, MIS(2) = 20%, MIS(3) = 5%, MIS(4) = 6%. If we find that item set {1, 2} has 9% support at level 2, then it satisfies neither MIS(1) nor MIS(2). Using an existing algorithm, this item set is discarded since it is not large, and the potentially large item sets {1, 2, 3} and {1, 2, 4} will not be generated at level 3. Yet item sets {1, 2, 3} and {1, 2, 4} may be large, because MIS(3) is only 5% and MIS(4) is 6%. It is thus wrong to discard {1, 2}.
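Example 4 can be made concrete with a small sketch. Assuming the multiple-minimum-support convention of Liu et al. [12], in which an item set is large when its support meets the lowest MIS among its items, the {1, 2} case works out as follows:

```python
# Minimum item supports from Example 4. Under the assumed rule, an item
# set is large when its support meets the lowest MIS among its items.
MIS = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}

def is_large(itemset, support):
    return support >= min(MIS[i] for i in itemset)

print(is_large({1, 2}, 0.09))     # False: threshold is MIS(1) = 10%
print(is_large({1, 2, 3}, 0.09))  # True: threshold drops to MIS(3) = 5%
print(is_large({1, 2, 4}, 0.09))  # True: threshold drops to MIS(4) = 6%
```

This is exactly why discarding {1, 2} at level 2 is wrong: its supersets face lower thresholds and may still qualify.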
But if we do not discard {1, 2}, the downward closure property is lost. In this setting the association rules are applied on the dataset. Exploiting the downward closure property, efficient algorithms (e.g., Apriori [7] and Eclat [8]) can find all frequent item sets.

Apriori algorithm: Apriori [7] is the best-known algorithm for mining association rules. It uses a breadth-first search strategy to count the support of item sets, and a candidate generation function that exploits the downward closure property of support.

Eclat algorithm: Eclat [8] is a depth-first search algorithm using set intersection.
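The level-wise search that Apriori performs can be sketched in a few lines. This is a minimal illustration of the breadth-first, downward-closure-pruned strategy described above, not the paper's implementation:

```python
from itertools import combinations

def apriori(D, minsup):
    """Level-wise frequent-item-set search: candidates at level k are
    kept only if all their (k-1)-subsets were frequent (downward closure)."""
    n = len(D)
    def supp(s):
        return sum(s <= t for t in D) / n
    items = {i for t in D for i in t}
    level = [frozenset([i]) for i in items if supp(frozenset([i])) >= minsup]
    frequent = list(level)
    k = 2
    while level:
        prev = set(level)
        # candidate generation: unions of frequent (k-1)-sets of size k
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # prune by downward closure before counting support
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        level = [c for c in candidates if supp(c) >= minsup]
        frequent.extend(level)
        k += 1
    return frequent

D = [{"milk", "bread"}, {"butter"}, {"beer"},
     {"milk", "bread", "butter"}, {"bread"}]
print(sorted(tuple(sorted(s)) for s in apriori(D, 0.4)))
```

On the Table 1 database with minsup = 0.4, the frequent item sets are {milk}, {bread}, {butter}, and {milk, bread}; {beer} fails at level 1 and is never extended.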


FP-growth algorithm: FP-growth (frequent pattern growth) [9] uses an extended prefix-tree (FP-tree) structure to store the database in compressed form. FP-growth adopts a divide-and-conquer approach to decompose both the mining tasks and the databases, and uses a pattern-fragment-growth method to avoid the costly candidate generation and testing used by Apriori.

There are several other forms of association rule learning. Contrast set learning is a form of associative learning; contrast set learners use rules that differ meaningfully in their distribution across subsets [10]. Weighted class learning is another form of associative learning, in which weights may be assigned to classes to give focus to a particular issue of concern for the consumer of the data mining results. K-optimal pattern discovery provides an alternative to the standard approach to association rule learning, which requires that each pattern appear frequently in the data. Mining frequent sequences uses support to find sequences in temporal data [11]. Generalized association rules use a hierarchical taxonomy (concept hierarchy). Quantitative association rules handle categorical and quantitative data.


of users to establish multiple short-range friendships and few long-range contacts. Based on the definitions, decentralized algorithms are developed to show that users can search for short paths to other users with high probability. The work presents mathematical models to further the decentralized search algorithm, enabling searches even when users are unaware of their own and others' positions in the network. Elsewhere, a hierarchical network model was developed: users are arranged at the leaves of a hierarchical structure such that the least common ancestor of two nodes in the tree is the node at which they start differing in their attributes. Thus, the least common ancestor defines the similarity of two nodes, or how likely they are to become friends: the closer the least common ancestor is to the two nodes, the higher the probability of the two nodes being friends. Based on this probability, the social network graph and the decentralized search algorithm are developed. Homophily, or the more commonly known phrase "birds of a feather flock together", has played an important role in the study of social networks. Sociologists have tried to understand the phenomenon using multiple characteristics such as gender, race, ethnicity, age, and educational level. Similarity between users due to their association with the same communities has been studied in [18]. Community associations and user keywords have been used to model user communication in social networks in [19]; information exchange between users takes place only when they share a social path and common keywords and community memberships. Decentralized search algorithms using combinations of homophily and node degree parameters have been developed. Similarity between users as a function of their topological distance has also been studied: that work measured the average fraction of similar users with a common characteristic, such as year in school or graduate status, to track the number of similar users in a data set from Club Nexus.
Their findings reported a gradual decay in similarity with increasing topological distance in the social network. The work in [18] developed functions to analyze similarity between users as a function of the frequency of a shared item. Geographic ties between online social network users are another property used to understand homophily between users. The geographic locations and friendship behaviors of bloggers were studied in [19]. These works have also studied the relationship between the geographic location of users and the relationships among them. The study showed that one-third of friendships in a social network are independent of geography. This is an interesting conclusion and raises the questions of why people at far-off locations become friends and what characteristics bring them together. Will understanding the other key interests or activities of users in online social networks explain why people become friends? In this work we answer these

Interval data association rules: e.g., partition age into 5-year-increment ranges. Maximal association rules. Sequential association rules: temporal data, e.g., first buy a computer, then CD-ROMs, then a webcam.

4. ANALYSIS
Having studied association rules with examples, the thesis work proceeds to what kind of data must be used, and the samples to be taken should be analyzed first. These data are applied to association rules; then, with the same data samples, data items are discovered by using the logical implications known as coherent rules [1]. What kind of data should be taken is discussed together with work related to the mathematics behind the small-world problem and social networks in general. Next, we discuss works that address homophily in social networks and user similarity based on user characteristics. Prior works have developed mathematical models to show how users interact with one another and establish links to build a social network [15]. The lattice model is based on the geographical distance between users. The model defines a network model based on the characteristics


questions by understanding the interest patterns of users in Facebook, how similarity between users' interests influences friendship, and the influence of user similarity on the network topology. We use the terms network and social network interchangeably in this work to mean the set of all users and the links that represent the friendships between them. The work also examines the patterns of characteristics associated with a user, i.e., a user's profile entries, and classifies the similarity between users through the semantic links between the keywords they use. Methods like Latent Semantic Indexing have previously explored the semantics of digital data to uncover the relationships between them. Analysis of user similarity through relations between profile characteristics can also help in furthering the link prediction problem [20] in social networks, to correctly identify pairs who are likely to forge friendships in future. Existing approaches are lacking in predicting linking tags, lack new prediction algorithms [16], are time-consuming, and handle missing values poorly. These shortcomings can be overcome in our approach, since it predicts and suggests the item tags in the dataset efficiently, executes faster than existing approaches, uses robust prediction algorithms for predicting tags, and efficiently handles missing values during processing. Thus data analysis is carried out, and in this way data cleaning is performed on the social network data. To this data the coherent rules are applied, and it is analyzed how the coherent rules are derived from logic and how the implications are defined.
In an implication, if an item is present this is denoted by logic 1, and if it is not present it is denoted by 0. For the derivation of coherent rules we have logic tables describing material implication, equivalence, the mapping of association rules to equivalences, and association rules and support.


has been observed. From these propositions, we have four implications: 1. p -> q, 2. p -> ¬q, 3. ¬p -> q, and 4. ¬p -> ¬q. Each is formed using the standard symbols -> and ¬. The symbol -> indicates that the relation is a mode of implication in logic, and ¬ denotes the negation of a proposition. Examples for implications 1 and 2:
1. If apples are observed in a customer market basket, then bread is observed in the basket: p -> q.
2. If apples are observed in a customer market basket, then bread is NOT observed in the basket: p -> ¬q.
The truth or falsity of an implication is judged by ANDing (∧) the truth values held by propositions p and q. In a fruit retail business where no bread is sold, the implication p -> q will be false based on the operation between the truth values, that is, 1 ∧ 0 = 0. The second implication, based on the operation 1 ∧ 1 = 1, will be true. Hence, we say that the latter implication p -> ¬q is true, but the first implication p -> q is false. Each implication has its truth and falsity based on truth table values alone. There are a number of modes of implication. In the same way, for tags that occur frequently with a common link, the occurrence of one may lead to a click on another tag, or the occurrence of one tag may prevent the other. We highlight two modes of implication and their truth table values in the next two sections.

5.1 Material Implication


A material implication (⇒) meets the logical principle of contraposition. The contrapositive (of a material implication p ⇒ q) is written ¬q ⇒ ¬p. For example, suppose that "if customers buy apples, then they buy oranges" is true as an implication. The contrapositive is that if customers do not buy oranges, then they also do not buy apples. In the case of tags, tags with similar properties may occur together, and likewise may be absent together. If an implication has the truth values of its contrapositive, it is a material implication; that is, p ⇒ q iff ¬(p ∧ ¬q). The truth table for a material implication is shown in Table 2.
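The contraposition principle can be verified exhaustively over the four truth assignments. A minimal sketch, encoding the material implication as ¬(p ∧ ¬q):

```python
# Material implication p => q as not (p and not q), checked against its
# contrapositive (not q) => (not p) for all four truth assignments.
def implies(p, q):
    return not (p and not q)

for p in (True, False):
    for q in (True, False):
        assert implies(p, q) == implies(not q, not p)  # contraposition
print("material implication always agrees with its contrapositive")
```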

5. IMPLICATION
In an argument, the truth or falsity of an implication (also known as a compound proposition) (->) necessarily relies on logic. Each implication, having met specific logical principles, can be identified (for example, one may be a material implication, while another may be an equivalence); each has a different set of truth values, as will be explained later. We highlight here that an implication is formed using two propositions, p and q. These propositions can be either true or false for the implication's interpretation. For example, "apples are observed in a customer market basket" (in the same way, in a social network, the behavior of friends within the same geographical area being similar) is a true interpretation if this

5.2 Equivalence
An equivalence (≡) is another mode of implication; in particular, it is a special case of a material implication. For an implication to qualify as an equivalence, the following condition must be met: p ≡ q iff ¬(p XOR q), with the truth table values shown in Table 3. Equivalence has an additional


necessary condition. Due to this condition, the propositions are deemed both necessary and sufficient for each other (in short, each relates to the other).
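The equivalence condition ¬(p XOR q) can also be checked exhaustively, along with the fact that an equivalence is a material implication holding in both directions. A small sketch:

```python
# Equivalence p ≡ q as not (p xor q); it holds exactly when the material
# implication runs in both directions (Table 3 values T, F, F, T).
def equiv(p, q):
    return not (p ^ q)

def implies(p, q):
    return not (p and not q)

for p in (True, False):
    for q in (True, False):
        assert equiv(p, q) == (implies(p, q) and implies(q, p))
print([equiv(p, q) for p, q in
       [(True, True), (True, False), (False, True), (False, False)]])
# [True, False, False, True]
```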

ISSN: 22503676
Volume - 2, Special Issue - 1, 61 70

Table 2
Truth Table for a Material Implication [1]

p   q   p ⇒ q
T   T   T
T   F   F
F   T   T
F   F   T
the association rules deemed interesting. In addition, the process of finding such association rules is independent of user knowledge, because the truth or falsity of any implication is based purely on logical grounds.

5.3 Mapping Association Rules to Equivalences


The distinction between an association rule and an implication is analyzed, and the motivation to map an association rule to an implication is highlighted. This section explains how to map an association rule to an equivalence. A complete mapping between the two is realized in three progressive steps, each depending on the success of the previous one. In the first step, item sets are mapped to propositions in an implication. An item set can be either observed or not observed in an association rule; similarly, a proposition can be either true or false in an implication. Analogously, the presence of an item set can be mapped to a true proposition, because that item set can be observed in transaction records. Having mapped the item sets, an association rule can be mapped to an implication in a second step. An association rule has four different combinations of presence and absence of its item sets; similarly, there are four different implications depending on the truth values held by their propositions. Hence, an association rule can be mapped to an implication that has a truth value (either true or false). Finally, in a more specific implication, the association rule has a set of four truth table values. Having mapped item sets and association rules, we now map association rules into specific modes of implication that have predefined truth table values, focusing on equivalence. Based on a single transaction record in association rule mining, the mapping from association rules to equivalences is shown below.

Table 3
Truth Table for an Equivalence [1]

p   q   p ≡ q
T   T   T
T   F   F
F   T   F
F   F   T

One of many ways to prove an equivalence is to show that the implications p -> q and ¬p -> ¬q hold true together. The latter is also called the inverse. Suppose "if customers buy apples, then they buy oranges" is a true implication; the inverse is that if customers do not buy apples, then they do not buy oranges. A typical statement of the form "if ... then" is a conditional or a rule; if this conditional also meets specific logical principles with a truth table, it is an implication. The same applies to tags that frequently occur together. Among the many modes of implication, a material implication relates propositions together. Equivalence is a special case of the former, where the propositions are necessarily related together all the time, independent of user knowledge. In other words, an equivalence is necessarily true all the time and is judged purely on logic. By finding association rules that map to an equivalence, we can expect to find association rules whose items are necessarily related, consistently and based on logic. These are

5.4 Mapping Using a Single Transaction Record


An item set has two states: in a single transaction record, an item set can be either present in or absent from the record. Correspondingly, a proposition can be either true or false. If an item set is observed in a transaction record, this is analogous to a true proposition. Item sets are thus mapped to propositions p and q as follows:
Item set X is mapped to p = T if and only if X is observed.
Absence of item set X, that is ¬X, is mapped to p = F if and only if X is not observed.
Item set Y is mapped to q = T if and only if Y is observed.
Absence of item set Y, that is ¬Y, is mapped to q = F if and only if Y is not observed.


Each component of an association rule is now mapped to propositions. Using the same mapping concept, an association rule can be mapped to a true or false implication. An association rule consists of two item sets, X and Y. Following the mappings above: item sets X and Y are mapped to p = T and q = T if and only if X and Y are observed, and then X => Y is true.


All the mappings and the required conditions are summarized in Table 4. In each mapping from an association rule to an equivalence, four conditions need to be checked. The four conditions to be passed for X => Y and ¬X => ¬Y are the same; this is highlighted in Table 4. (Note that the four conditions for X => ¬Y and ¬X => Y are likewise the same.) Generally, testing each condition requires at least one transaction record to conclude that it is either true or false. However, because there are four conditions, the mapping from an association rule to an equivalence cannot be carried out on a single transaction record. A mechanism is required to judge whether the first association rule from each group can be mapped to a true equivalence, having met all four conditions, because an association rule holds item sets that can be observed over a portion of the transaction records. This leads us to perform the mapping on multiple transaction records, as described in the next section.

Table 4
Mapping of Association Rules to Equivalences [1]

Equivalence   Association rule   Required conditions (truth values T, F, F, T)
p ≡ q         X => Y             X => Y is T; X => ¬Y is F; ¬X => Y is F; ¬X => ¬Y is T
¬p ≡ ¬q       ¬X => ¬Y           X => Y is T; X => ¬Y is F; ¬X => Y is F; ¬X => ¬Y is T

5.5 Mapping Using Multiple Transaction Records


Previously, item sets were mapped to propositions p and q according to whether each item set is observed or not observed in a single transaction. In data containing multiple transaction records, an item set X is observed over a portion of the records. The total number of observations is given by the cardinality of the transactions in database D that contain X, known as the support: S(X) = |D_X|. The support S(X) denotes the number of times X is observed in the entire data; similarly, S(¬X) denotes the number of times X is not observed in the entire data. Based on this understanding, the interestingness of an item set is a relative comparison between the total numbers of observations of its presence and of its absence: if S(X) is greater than S(¬X), then item set X is mostly observed in the entire data, and it is interesting. Conversely, if item set X is mostly not observed in the entire data, it is not interesting; instead, the absence of item set X is interesting.

That is, an association rule is mapped to an implication, and is deemed interesting, based on what is observed in a single transaction record. The four mappings from association rules to their implications are given below:
X ⇒ Y is mapped to the implication p → q if and only if both X and Y are observed.
X ⇒ ¬Y is mapped to the implication p → ¬q if and only if X is observed and Y is not observed.
¬X ⇒ Y is mapped to the implication ¬p → q if and only if X is not observed and Y is observed.
¬X ⇒ ¬Y is mapped to the implication ¬p → ¬q if and only if neither X nor Y is observed.
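The four single-transaction cases can be sketched as a small dispatch. This is an illustrative sketch with assumed set-based transactions, not the paper's code.

```python
# Sketch: within a single transaction, exactly one of the four rule forms
# applies, depending on which of the item sets X and Y are observed.
def implication_for(transaction, x, y):
    p = x <= transaction    # is item set X observed?
    q = y <= transaction    # is item set Y observed?
    if p and q:
        return "X=>Y maps to p -> q"
    if p and not q:
        return "X=>~Y maps to p -> ~q"
    if not p and q:
        return "~X=>Y maps to ~p -> q"
    return "~X=>~Y maps to ~p -> ~q"

print(implication_for({"bread", "milk"}, {"bread"}, {"milk"}))
```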

Having mapped association rules to implications, we use the same mapping concept to map association rules to equivalences based on specific truth-table values. An equivalence has the truth-table values (T, F, F, T) (see Table 3) for the implications p → q, p → ¬q, ¬p → q, and ¬p → ¬q, respectively. An association rule is mapped to an equivalence if each implication is correspondingly either true or false. For example, the association rule X ⇒ Y is mapped to p ≡ q if and only if X ⇒ Y is true; X ⇒ ¬Y is false; ¬X ⇒ Y is false; and ¬X ⇒ ¬Y is true.

Based on a relative comparison between the presence and absence of an item set, each item set can be mapped to the propositions p and q over multiple transaction records:
1. If S(X) > S(¬X), then item set X is mapped to p = T, and item set ¬X is mapped to p = F.
2. If S(¬X) > S(X), then item set X is mapped to p = F, and item set ¬X is mapped to p = T.

IJESAT | Jan-Feb 2012

Available online @ http://www.ijesat.org
An item set that has been mapped to a proposition is said to be interesting. The above mapping involves only a single item set. To judge whether a union of two item sets, such as (X, Y) (i.e., X ∪ Y), is comparatively interesting, multiple comparisons over the three other possible combinations are necessary. This ensures that none of those combinations is observed more often than the combination in question. For example, to judge whether the item set (X, Y) is mostly observed in the transactions, the numbers of transactions that contain (X, ¬Y), (¬X, Y), or (¬X, ¬Y) must each be lower than the number of transactions that contain (X, Y); otherwise, the item set (X, Y) cannot be judged interesting. Extending this understanding, among the combinations of presence and absence of the item sets contained within (X, Y), only one item set is deemed interesting, and the others are deemed not interesting.

Having established the interestingness of item sets, we extend the concept to association rules. An association rule always involves two item sets: (X, Y), (X, ¬Y), (¬X, Y), or (¬X, ¬Y). An association rule X ⇒ Y can be mapped to the implication p → q if and only if the item set (X, Y) is interesting. Otherwise, if (X, Y) is not the most observed set, the association rule X ⇒ Y cannot be judged interesting. For each combination of items contained within (X, Y), only one association rule is deemed interesting, namely the most observed one. In terms of supports, an association rule is judged interesting when the support of its two item sets, for example S(X, Y), is higher than the support values that involve one or none of the two item sets, such as S(X, ¬Y), S(¬X, Y), and S(¬X, ¬Y). Table 5 lists the support for each association rule.
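The comparison above amounts to building a 2x2 contingency table of supports and picking the most observed cell. A minimal sketch, again assuming set-based transactions and illustrative item names:

```python
# Sketch: compute the supports of the four presence/absence combinations of
# item sets X and Y, then pick the single "interesting" (most observed) one.
def contingency(transactions, x, y):
    counts = {("X", "Y"): 0, ("X", "~Y"): 0, ("~X", "Y"): 0, ("~X", "~Y"): 0}
    for t in transactions:
        key = ("X" if x <= t else "~X", "Y" if y <= t else "~Y")
        counts[key] += 1
    return counts

transactions = [
    {"bread", "milk"},
    {"bread", "milk"},
    {"bread"},
    {"tea"},
]
counts = contingency(transactions, {"bread"}, {"milk"})
interesting = max(counts, key=counts.get)
print(counts, interesting)  # ('X', 'Y') is the most observed combination
```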

We can now map association rules to implications as follows:
X ⇒ Y is mapped to the implication p → q, if and only if
- S(X, Y) > S(X, ¬Y),
- S(X, Y) > S(¬X, Y), and
- S(X, Y) > S(¬X, ¬Y).
X ⇒ ¬Y is mapped to the implication p → ¬q, if and only if
- S(X, ¬Y) > S(X, Y),
- S(X, ¬Y) > S(¬X, Y), and
- S(X, ¬Y) > S(¬X, ¬Y).
¬X ⇒ Y is mapped to the implication ¬p → q, if and only if
- S(¬X, Y) > S(X, Y),
- S(¬X, Y) > S(X, ¬Y), and
- S(¬X, Y) > S(¬X, ¬Y).
¬X ⇒ ¬Y is mapped to the implication ¬p → ¬q, if and only if
- S(¬X, ¬Y) > S(X, Y),
- S(¬X, ¬Y) > S(X, ¬Y), and
- S(¬X, ¬Y) > S(¬X, Y).

We give the name pseudo implication to association rules that are mapped to implications based on comparisons between supports. By pseudo implication, we mean that the implication approximates a real implication (in the sense of propositional logic). It is not a real implication because there are fundamental differences: a pseudo implication is judged true or false based on a comparison of supports, which range over integer values, whereas an implication is based on binary truth values. The former depends on the frequencies of co-occurrence between item sets (supports) in a data set; the latter does not, being based on truth values alone. Using the concept of pseudo implication, we can further map association rules to specific modes of implications, such as material implications and equivalences, each following the truth values of the respective relation in logic. For association rules to be mapped to equivalences according to their truth-table values, the following conditions must be met:
- S(X, Y) > S(X, ¬Y)
- S(X, Y) > S(¬X, Y)
- S(¬X, ¬Y) > S(X, ¬Y) and
- S(¬X, ¬Y) > S(¬X, Y)  (2)
The resulting mappings are given below:
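The condition set for equivalences can be sketched as a single predicate over the four integer supports. This is an illustrative sketch of the comparison, with hypothetical support values:

```python
# Sketch of the equivalence condition set above: a rule pair qualifies when
# S(X,Y) and S(~X,~Y) both dominate the two "mixed" supports. Supports are
# plain integer counts.
def is_pseudo_equivalence(s_xy, s_x_noty, s_notx_y, s_notx_noty):
    return (s_xy > s_x_noty and s_xy > s_notx_y and
            s_notx_noty > s_x_noty and s_notx_noty > s_notx_y)

print(is_pseudo_equivalence(40, 5, 3, 20))   # presence/absence co-vary: qualifies
print(is_pseudo_equivalence(40, 45, 3, 20))  # X without Y dominates: does not
```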

Table 5
Supports for Each Association Rule [1]

Association rule    Support
X ⇒ Y               S(X, Y)
X ⇒ ¬Y              S(X, ¬Y)
¬X ⇒ Y              S(¬X, Y)
¬X ⇒ ¬Y             S(¬X, ¬Y)


X ⇒ Y is mapped to p ≡ q if and only if (2) is met. X ⇒ ¬Y is mapped to p ≡ ¬q if and only if (2) is met. ¬X ⇒ Y is mapped to ¬p ≡ q if and only if (2) is met. ¬X ⇒ ¬Y is mapped to ¬p ≡ ¬q if and only if (2) is met.

Each of these rules, having been mapped to an equivalence from propositional logic, is called a pseudo implication of equivalence. This shows how pseudo implications can be created by mapping association rules to two modes of implications: material implication and equivalence. We focus on pseudo implications of equivalences because, in propositional logic, equivalences are special cases of material implications. Being a special case, a rule's left-hand side and right-hand side are more strongly related than under material implication; in fact, the two sides are so related that they are equivalent. As a result, an equivalence relation is also bidirectional: a rule's left- and right-hand sides are interchangeable, and after exchanging both sides the new equivalence still follows the truth-table values of equivalence. This characteristic is not observed in material implication, which remains a more relaxed mode of implication. In mining rules from data sets without requiring background knowledge of the domain, we need a strong reason to identify the existence of a rule; therefore, pseudo implications of equivalences are our primary focus in finding rules. The coherent rules framework is then designed, and the item sets are discovered by using all of the implications.

6. COHERENT RULES ALGORITHM

1. Collect the data from the database.
2. Preprocess the data using prediction techniques.
3. Apply the proposed association rule technique (by calculating support and confidence).
4. Discover the association tag-bundle rules and items; then, using coherent rules based on logic, discover the item sets. The modules are explained in the design phase.

7. CONCLUSION

The association rules are analyzed based on different threshold values and the data items are mined; based on logic, the coherent rules are analyzed and the tag-based items are discovered.

REFERENCES

[1] A.T.H. Sim, M. Indrawan, S. Zutshi, and B. Srinivasan, "Logic-Based Pattern Discovery," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 6, 2010.
[2] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases," SIGMOD Record, vol. 22, pp. 207-216, 1993.
[3] C. Longbing, "Introduction to Domain Driven Data Mining," Data Mining for Business Applications, L. Cao, P.S. Yu, C. Zhang, and H. Zhang, eds., pp. 3-10, Springer, 2008.
[4] G. Piatetsky-Shapiro, "Discovery, Analysis, and Presentation of Strong Rules," Knowledge Discovery in Databases, G. Piatetsky-Shapiro and W.J. Frawley, eds., AAAI/MIT Press, Cambridge, MA, 1991.
[5] J. Hipp, U. Guntzer, and G. Nakhaeizadeh, "Algorithms for Association Rule Mining - A General Survey and Comparison," SIGKDD Explorations, vol. 2, no. 2, pp. 1-58, 2000.
[6] J. Pei, J. Han, and L.V.S. Lakshmanan, "Mining Frequent Itemsets with Convertible Constraints," Proc. 17th Int'l Conf. Data Engineering, Heidelberg, Germany, pp. 433-442, 2001.
[7] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), Santiago, Chile, pp. 487-499, Sept. 1994.
[8] M.J. Zaki, "Scalable Algorithms for Association Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 372-390, May/June 2000.
[9] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining Frequent Patterns without Candidate Generation," Data Mining and Knowledge Discovery, vol. 8, pp. 53-87, 2004.
[10] T. Menzies and Y. Hu, "Data Mining for Very Busy People," IEEE Computer, pp. 18-25, Oct. 2003.
[11] M.J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences," Machine Learning, vol. 42, pp. 31-60, 2001.
[12] B. Liu, W. Hsu, and Y. Ma, "Mining Association Rules with Multiple Minimum Supports," Proc. ACM SIGKDD, pp. 337-341, 1999.
[13] H. Mannila, "Database Methods for Data Mining," Proc. Fourth Int'l Conf. Knowledge Discovery and Data Mining (Tutorial), 1998.
[14] W.-Y. Lin, M.-C. Tseng, and J.-H. Su, "A Confidence-Lift Support Specification for Interesting Associations Mining," Proc. Sixth Pacific-Asia Conf. Advances in Knowledge Discovery and Data Mining (PAKDD), pp. 148-158, 2002.
[15] E. Chi and T. Mytkowicz, "Understanding the Efficiency of Social Tagging Systems Using Information Theory," Proc. HT '08, 2008.
[16] E. Gabrilovich and S. Markovitch, "Text Categorization with Many Redundant Features: Using Aggressive Feature Selection to Make SVMs Competitive with C4.5," Proc. ICML '04, 2004.
[17] S. Golder and B.A. Huberman, "Usage Patterns of Collaborative Tagging Systems," Journal of Information Science, vol. 32, no. 2, pp. 198-208, Apr. 2006.
[18] T. Haveliwala, A. Gionis, D. Klein, and P. Indyk, "Evaluating Strategies for Similarity Search on the Web," Proc. WWW '02, 2002.
[19] P. Heymann, G. Koutrika, and H. Garcia-Molina, "Can Social Bookmarking Improve Web Search?" Proc. WSDM '08, 2008.
[20] T. Joachims, "A Support Vector Method for Multivariate Performance Measures," Proc. ICML '05, 2005.

BIOGRAPHIES

Tejaswi Pinniboyina is pursuing an M.Tech in Computer Science Engineering at KL University.

Navya Dhulipala is pursuing an M.Tech in Computer Science Engineering at KL University.

Deevi Radha Rani is working as an Assistant Professor in the Department of CSE, KL University.

Sushma Nathani is pursuing an M.Tech in the Department of Computer Science Engineering at KL University.
