
JOURNAL OF COMPUTING, VOLUME 3, ISSUE 12, DECEMBER 2011, ISSN 2151-9617 https://sites.google.com/site/journalofcomputing WWW.JOURNALOFCOMPUTING.ORG

Correlation Identification by Normalizing the Sequential Frequent Pattern Set with the Specification of Constraints and Preferences
Janga Vijay Kumar, Vanam Sravan Kumar, Dept. of Computer Science and Engineering, Jayamukhi Institute of Technological Sciences, Narsampet, Warangal (A.P.), India

ABSTRACT: Sequential frequent pattern mining (where frequent patterns can be extended to the correlations between items) is an important data mining problem with broad applications: it identifies sequentially co-occurring sets of items in a given transactional or relational database. Although there are many in-depth studies on efficient sequential frequent pattern mining algorithms, they still generate large sets of sequential frequent patterns, many of which are inappropriate or irrelevant. Such a set can be reduced by considering the preferences and constraints associated with each pattern to be mined, and the approach can further be extended to mine correlations between sets of items. In this paper, we propose that a set of constraints and preferences be specified with an itemset, in addition to itemset parameters such as support and confidence. A user can simply specify a preference over sets of items, and the constraints associated with each item determine whether that item can be included in the frequent set or not. We then formulate these preferences for mining.
Keywords: Data mining, sequential frequent pattern mining, constraints, preferences, normalization.

1. Introduction

We identify the problem of normalizing the sequential pattern set size by specifying constraints and by formulating preferences for mining. Preferences and constraints in a framework for mining sequential frequent patterns can be pushed deep into the mining process by properly employing existing efficient frequent pattern mining techniques. The results indicate that normalizing frequent pattern mining with the specification of preferences and constraints is effective. Furthermore, we extend our discussion from preference-based sequential frequent pattern mining to preference-based data mining in general. As a running example, consider the problem of sequential pattern mining introduced below.

2. Sequential pattern mining: concepts

J. Vijay Kumar is with the Jayamukhi Institute of Technological Sciences, Warangal, India. V. Sravan Kumar is with the Jayamukhi Institute of Technological Sciences, Warangal, India.

Let I = {x1, . . . , xn} be a set of items, each possibly associated with a set of attributes, such as value, price, profit, calling distance, period, etc. The value of item x on attribute A is denoted by x.A. An itemset is a non-empty subset of items, and an itemset with k items is called a k-itemset. A sequence α = ⟨X1 · · · Xl⟩ is an ordered list of itemsets. An itemset Xi (1 ≤ i ≤ l) in a sequence is called a transaction, a term originating from the analysis of customers' shopping sequences in a transaction database, as in (Agrawal, Imielinski, & Swami, 1993; Agrawal & Srikant, 1994, 1995; Srikant & Agrawal, 1996). A transaction Xi may have a special attribute, time-stamp, denoted by Xi.time, which registers the time when the transaction was executed. For a sequence α = ⟨X1 · · · Xl⟩, we assume Xi.time < Xj.time for 1 ≤ i < j ≤ l. The number of transactions in a sequence is called the length of the sequence; a sequence of length l is called an l-sequence, and for an l-sequence α we have len(α) = l. Furthermore, the i-th itemset is denoted by α[i]. An item can occur at most once in an itemset, but can occur multiple times in various itemsets in a sequence. A sequence α = ⟨X1 . . . Xn⟩ is called a
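The sequence model above can be made concrete with a short sketch (our own illustrative code, not from the paper): a sequence is an ordered list of itemsets (transactions), and len(α) counts its transactions.

```python
def make_sequence(*itemsets):
    """Build a sequence as an ordered list of frozensets; an item may
    repeat across itemsets but occurs at most once within one itemset."""
    return [frozenset(x) for x in itemsets]

def length(seq):
    """len(s) = number of transactions (itemsets) in the sequence."""
    return len(seq)

# Sequence 10 from Table 1: <a (bc) c>
s10 = make_sequence({"a"}, {"b", "c"}, {"c"})
assert length(s10) == 3                   # a 3-sequence
assert s10[1] == frozenset({"b", "c"})    # s[2] in the paper's 1-based notation
```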


subsequence of another sequence β = ⟨Y1 . . . Ym⟩ (n ≤ m), and β a super-sequence of α, denoted by α ⊑ β, if there exist integers 1 ≤ i1 < · · · < in ≤ m such that X1 ⊆ Yi1, . . . , Xn ⊆ Yin. A sequence database SDB is a set of 2-tuples (sid, α), where sid is a sequence-id and α a sequence. A tuple (sid, α) in a sequence database SDB is said to contain a sequence γ if γ is a subsequence of α. The number of tuples in SDB containing γ is called the support of γ, denoted by sup(γ). Given a positive integer min_sup as the support threshold, a sequence γ is a sequential pattern in sequence database SDB if sup(γ) ≥ min_sup. The sequential pattern mining problem is to find the complete set of sequential patterns with respect to a given sequence database SDB and a support threshold min_sup.

Example 1 (Sequential patterns) Table 1 shows a sequence database SDB with four sequences. The first sequence contains three transactions (itemsets): {a}, {b, c} and {c}. Its length is three. For the sake of brevity, the brackets are omitted if a transaction has only one item. As can be seen, an item can occur multiple times in various itemsets in a sequence; for example, item b appears twice in sequence 20. Moreover, a sequence can even contain identical transactions, such as transaction {d} in sequences 20, 30 and 40. However, there is no general assumption about the repetition of items or transactions.

Sequence_ID | Sequence
10 | (a(bc)c)
20 | (c(ab)(bc)dd)
30 | (c(acf)(abc)dd)
40 | (addcb)
Table 1: Sequence database

Sequence ((ab)d) is a subsequence of both the second sequence, (c(ab)(bc)dd), and the third one, (c(acf)(abc)dd). Thus, if the support threshold is min_sup = 2, ((ab)d) is a sequential pattern. The following constraints can be applied when mining sequential frequent itemsets.

Constraint 1 (Item constraint) An item constraint specifies a subset of items that should or should not be present in the patterns. It is of the form Citem(α) ≡ (ϕ i : 1 ≤ i ≤ len(α), α[i] θ V), where V is a subset of items, ϕ ∈ {∀, ∃} and θ ∈ {⊆, ⊈, ⊇, ⊉, ∈, ∉}.
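The subsequence and support definitions can be sketched directly in code (an illustrative greedy containment check, not the paper's implementation), reproducing the Example 1 result on the database of Table 1:

```python
def is_subsequence(sub, seq):
    """True if sub = <X1..Xn> is a subsequence of seq = <Y1..Ym>:
    there exist 1 <= i1 < ... < in <= m with each Xk a subset of Y_ik."""
    i = 0
    for Y in seq:
        if i < len(sub) and sub[i] <= Y:   # Xk is a subset of Y (set inclusion)
            i += 1
    return i == len(sub)

def support(pattern, sdb):
    """sup(pattern): number of tuples (sid, s) in the database containing it."""
    return sum(1 for _, s in sdb if is_subsequence(pattern, s))

# The database of Table 1 (frozenset("bc") builds the itemset {b, c}).
F = frozenset
sdb = [
    (10, [F("a"), F("bc"), F("c")]),
    (20, [F("c"), F("ab"), F("bc"), F("d"), F("d")]),
    (30, [F("c"), F("acf"), F("abc"), F("d"), F("d")]),
    (40, [F("a"), F("d"), F("d"), F("c"), F("b")]),
]
pattern = [F("ab"), F("d")]        # <(ab) d>
assert support(pattern, sdb) == 2  # contained in sequences 20 and 30 only
```

With min_sup = 2, this confirms that ((ab)d) is a sequential pattern.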
For the sake of brevity, we omit the strict operators (e.g., ⊊, ⊋) in our discussion here; however, the same principles apply to them. For example, when mining sequential patterns over a web log, a user may be interested only in patterns about visits to online bookstores. Let B be the set of online bookstores.

The corresponding item constraint is Cbookstore(α) ≡ (∀i : 1 ≤ i ≤ len(α), α[i] ⊆ B).

Constraint 2 (Length constraint) A length constraint specifies a requirement on the length of the patterns, where the length can be either the number of occurrences of items or the number of transactions. Length constraints can also be specified as the number of distinct items, or even the maximal number of items per transaction. For example, a user may want to find only long patterns (e.g., patterns consisting of at least 50 transactions) in bio-sequence analysis. Such a requirement can be expressed by the length constraint Clen(α) ≡ (len(α) ≥ 50).

Constraint 3 (Regular expression constraint) A regular expression constraint CRE is a constraint specified as a regular expression over the set of items using the established regular expression operators, such as disjunction and Kleene closure. A sequential pattern satisfies CRE if and only if the pattern is accepted by its equivalent deterministic finite automaton. For example, to find sequential patterns about a web click stream starting from Yahoo's home page and reaching hotels in New York City, one may use a regular expression constraint over the corresponding page sequences.

Constraint 4 (Duration constraint) A duration constraint is defined only in sequence databases where each transaction in every sequence has a time-stamp. It requires that the time-stamp difference between the first and the last transactions in a sequential pattern be longer or shorter than a given period. Formally, a duration constraint is of the form Cdur ≡ Dur(α) θ Δt, where θ ∈ {≤, ≥} and Δt is a given integer. A pattern α satisfies the constraint if and only if |{β ∈ SDB | ∃ 1 ≤ i1 < · · · < i_len(α) ≤ len(β) s.t. α[1] ⊆ β[i1], . . . , α[len(α)] ⊆ β[i_len(α)], and (β[i_len(α)].time − β[i1].time) θ Δt}| ≥ min_sup. In some other applications, the gap between adjacent transactions in a pattern may be important.
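The constraints above can be phrased as simple boolean predicates. The sketch below is our own illustration (all names and values are hypothetical): the item, length, and duration constraints, plus the adjacent-transaction gap condition that Constraint 5 formalizes, applied to a pattern or to the timestamps of its matched transactions.

```python
def item_constraint(pattern, V):
    """C_item with phi = 'for all' and theta = 'subset of': every
    itemset of the pattern must lie inside the item subset V."""
    return all(X <= V for X in pattern)

def length_constraint(pattern, min_len):
    """C_len: the pattern must contain at least min_len transactions."""
    return len(pattern) >= min_len

def duration_constraint(timestamps, max_dt):
    """C_dur with theta = '<=': time between the first and last
    matched transactions is at most max_dt."""
    return timestamps[-1] - timestamps[0] <= max_dt

def gap_constraint(timestamps, max_gap):
    """Adjacent-gap condition with theta = '<=': every two adjacent
    matched transactions are at most max_gap apart."""
    return all(b - a <= max_gap for a, b in zip(timestamps, timestamps[1:]))

B = frozenset("xy")                        # hypothetical bookstore item set
p = [frozenset("x"), frozenset("xy")]
assert item_constraint(p, B)               # all itemsets within B
assert not length_constraint(p, 50)        # far shorter than 50 transactions
assert duration_constraint([3, 5, 9], 10)  # 9 - 3 = 6 <= 10
assert not gap_constraint([3, 5, 9], 3)    # gap 9 - 5 = 4 exceeds 3
```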
Constraint 5 (Gap constraint) A gap constraint is defined only in sequence databases where each transaction in every sequence has a time-stamp. It requires that the time-stamp difference between every two adjacent transactions in a sequential pattern be longer or shorter than a given gap. Formally, a gap constraint is of the form Cgap ≡ Gap(α) θ Δt, where θ ∈ {≤, ≥} and Δt is a given integer. A pattern α satisfies the constraint if and only if |{β ∈ SDB | ∃ 1 ≤ i1 < · · · < i_len(α) ≤ len(β) s.t. α[1] ⊆ β[i1], . . . , α[len(α)] ⊆ β[i_len(α)], and for all 1 < j ≤ len(α), (β[ij].time − β[ij−1].time) θ Δt}| ≥ min_sup.

As shown in the above examples, even with the progress on efficient sequential frequent pattern mining algorithms and constraint-pushing techniques, the effectiveness of sequential frequent pattern mining remains a serious concern. The major bottleneck is that a user has to specify appropriate constraints (e.g., item, length, gap, duration), which are often beyond the user's knowledge of the data. In this study, we propose normalized, preference- and constraint-based sequential frequent pattern mining, a novel theme of sequential frequent pattern mining. Instead of specifying hard constraints only, a user can simply specify preferences as well. First, we identify the problem of preference-based sequential frequent pattern mining and formulate the preferences for mining. Second, we develop an efficient framework to mine frequent patterns with preferences. Interestingly, many preferences can be pushed deep into the mining process by properly employing existing efficient frequent pattern mining techniques. We conduct an extensive performance study to examine our method; the results indicate that preference-based sequential frequent pattern mining is effective and efficient. Last, we extend our discussion from preference-based sequential frequent pattern mining to preference-based data mining in principle and draw a general framework. The problem of preference-based frequent pattern mining is described in Section 3. In Section 4, efficient mining algorithms are developed. An extensive performance study is reported in Section 5. We review related work and discuss a general framework of preference-based data mining; Section 6 concludes the paper.

3. PREFERENCE-BASED FREQUENT PATTERN MINING

Let I = {i1, . . . , in} be the set of items. An itemset (or pattern) is a subset of I.
For the sake of brevity, we often write an itemset as a string of items and omit the braces; for example, itemset {a, b, c, d} is written as abcd. The number of items in an itemset is called its length, that is, len(X) = ||X||. A transaction T = (tid, X) is a tuple such that tid is a transaction identity and X is an itemset. A transaction database TDB is a set of transactions. A transaction T = (tid, Y) is said to contain an itemset X if X ⊆ Y. Given a transaction database TDB, the support of itemset X, denoted by sup(X), is the number of

transactions in TDB containing X, i.e., sup(X) = ||{T = (tid, Y) | (T ∈ TDB) ∧ (X ⊆ Y)}||. Given a transaction database TDB and a minimum support threshold min_sup, an itemset X is said to be a frequent pattern if sup(X) ≥ min_sup. The problem of frequent pattern mining is to find the complete set of frequent patterns from the database. In general, we define preferences as follows.

Definition 1 (Preference) A preference order >P is a strict partial order over 2^I, the set of all possible itemsets. For itemsets X and Y, X is said to be (strictly) preferable to Y if X >P Y.

Problem definition. An itemset X is called a preference pattern (with respect to preference P) if there exists no other itemset Y such that Y >P X. Given a transaction database TDB, a preference P and a support threshold, the problem of preference-based frequent pattern mining is to find the complete set of preference patterns with respect to P that are frequent.

Now, let us consider how to write a preference. A simple preference, such as one based on the support or length of the itemsets, can be written using an auxiliary function f : 2^I → R to define the preference order. For example, to prefer more frequent patterns to less frequent ones, we can set X ⪰sup Y if sup(X) ≥ sup(Y). As another example, to prefer longer patterns to shorter ones, we can set X ⪰length Y if ||X|| ≥ ||Y||. It becomes tricky when a user wants to integrate more than one preference. For example, when a user prefers either more frequent patterns or longer patterns, can we simply write the preference as "X ⪰ Y if (sup(X) ≥ sup(Y)) ∨ (len(X) ≥ len(Y))"? Unfortunately, the relation ⪰ so defined is not loyal to the user's real intention; in fact, it is not even a partial order. For example, suppose sup(abc) > sup(abcd). We have both abc ⪰ abcd (due to the support inequality) and abcd ⪰ abc (due to the length inequality). A closer look at the user's preference indicates that it should be read as follows.
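The problem definition above amounts to keeping the non-dominated frequent itemsets. A minimal sketch (our own naive enumeration, suitable only for toy databases, not an efficient mining algorithm):

```python
from itertools import combinations

def frequent_itemsets(tdb, min_sup):
    """Naive frequent-itemset enumeration over a tiny transaction database."""
    items = sorted({i for _, X in tdb for i in X})
    out = {}
    for k in range(1, len(items) + 1):
        for c in combinations(items, k):
            s = sum(1 for _, X in tdb if set(c) <= X)   # sup(c)
            if s >= min_sup:
                out[frozenset(c)] = s
    return out

def preference_patterns(freq, prefers):
    """Keep itemsets X with no Y such that prefers(Y, X) (no dominator)."""
    return [X for X in freq if not any(prefers(Y, X) for Y in freq if Y != X)]

tdb = [(1, {"a", "b"}), (2, {"a", "b", "c"}), (3, {"a", "c"})]
freq = frequent_itemsets(tdb, 2)            # a, b, c, ab, ac are frequent
# Preference: longer patterns are strictly preferable.
longest = preference_patterns(freq, lambda Y, X: len(Y) > len(X))
assert set(longest) == {frozenset("ab"), frozenset("ac")}
```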
For any two patterns X and Y, if sup(X) ≥ sup(Y), ||X|| ≥ ||Y||, and at least one inequality is strict, then X is more preferable. Based on the above analysis, we define an integration operation on preference relations as follows.

Definition 2 (Integration of preferences) Let ⪰1, . . . , ⪰k be k preference orders. We define their integration, denoted by ⪰ = (⪰1 ⊗ ⪰2 ⊗ · · · ⊗ ⪰k), as follows. For itemsets X and Y, X ⪰ Y if for every i (1 ≤ i ≤ k), X ⪰i Y; and X ≻ Y if X ⪰ Y and there exists at least one i (1 ≤ i ≤ k) such that X ≻i Y.
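Definition 2 is a Pareto-style composition, and it resolves the abc/abcd problem from the preceding paragraph: under the integrated order neither pattern dominates the other. A sketch with hypothetical support values (our own code):

```python
def integrate(*leqs):
    """Integration of preference pre-orders per Definition 2: X is strictly
    preferred to Y iff X >=_i Y under every component order and strictly
    better under at least one. Each leq(X, Y) means X <=_i Y."""
    def strictly_prefers(X, Y):
        ge_all = all(leq(Y, X) for leq in leqs)
        gt_some = any(leq(Y, X) and not leq(X, Y) for leq in leqs)
        return ge_all and gt_some
    return strictly_prefers

# Component orders: by support and by length (support table is hypothetical).
sup = {frozenset("a"): 4, frozenset("ab"): 7,
       frozenset("abc"): 5, frozenset("abcd"): 3}
by_sup = lambda X, Y: sup[X] <= sup[Y]
by_len = lambda X, Y: len(X) <= len(Y)
pref = integrate(by_sup, by_len)

# ab beats a on both support and length, so it strictly dominates.
assert pref(frozenset("ab"), frozenset("a"))
# abc beats abcd on support but loses on length: neither dominates.
assert not pref(frozenset("abc"), frozenset("abcd"))
assert not pref(frozenset("abcd"), frozenset("abc"))
```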


Theorem 1 (Integration of preferences) The integration of multiple preference relations is a strict partial order. Proof. We prove that, given partial orders ⪰1 and ⪰2, (⪰1 ⊗ ⪰2) is a partial order; the case of more than two partial orders follows by a simple induction on the number of integration operators, which is finite. The strict relation ≻ is trivially irreflexive. For asymmetry, suppose X ≻ Y and Y ≻ X. Then X ⪰i Y and Y ⪰i X for every i, so no component comparison can be strict in either direction, contradicting the requirement that at least one ≻i holds.

4. ALGORITHMS

One important guideline in our design is to reuse existing efficient mining techniques as much as possible. Breadth-first and depth-first search methods for frequent pattern mining have been studied extensively, and their effectiveness and efficiency have been justified and verified in previous work. Interestingly, we show that many preferences can be pushed deep into the mining process by properly employing these existing techniques.

4.1 Preference Mining Algorithms

In this section we present algorithms for mining the strict partial order preferences introduced above. Our methods work on log relations and use appropriate data mining and statistical methodologies in order to detect the right preference and the correct additional information, such as POS-sets. To detect basic preferences, we use the frequencies of the different values in the log relation.

Definition 3 (Frequency of a value) Let A be an attribute of a log relation R and x ∈ dom(A). The number of entries of x in R(A) is called the frequency of x, denoted freqA(x). If dom(A) is numerical, freqA([x1, x2]) denotes the number of entries of all values between x1 and x2 (x1 ≤ x2).

We have introduced the concept of user-defined preferences P = (A, <P). The actual user preferences are to be predicted from the implicit preferences hidden in the user log data. To that purpose, we introduce the concept of data-driven preferences, denoted by PD = (A, <PD).

Definition 4 (Data-driven preference) For categorical domains dom(A), a data-driven preference PD = (A, <PD) is defined as: x <PD y iff freqA(x) < freqA(y). For numerical domains dom(A), a data-driven preference PD = (A, <PD) is defined as: x <PD y iff ∃ ε > 0: freqA([x − ε, x + ε]) < freqA([y − ε, y + ε]). As can easily be shown, data-driven preferences define strict partial orders.

Depending on the design of the log data, values can be products (e.g.
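Definition 3 and the categorical case of Definition 4 reduce to counting occurrences in the log relation. A minimal sketch with a hypothetical log (our own code, not the authors' prototype):

```python
from collections import Counter

def freq(log, attr):
    """freq_A(x): number of entries of value x in column A of the log relation."""
    return Counter(row[attr] for row in log)

# Toy log relation: each row records one value selected by the user.
log = [{"author": a} for a in ["Adams"] * 5 + ["Wallace"] * 4 + ["Angier"]]
f = freq(log, "author")
assert f["Adams"] == 5 and f["Angier"] == 1
# Data-driven preference: Angier <_PD Adams, since freq(Angier) < freq(Adams).
assert f["Angier"] < f["Adams"]
```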
search results) or just product properties such as color or price. If the frequency of a value x is zero, the customer has never selected the corresponding value. If the closed world assumption (CWA) holds, freqA(x) = 0 means that the customer does not like property x, because he never selected it although he knows it. Otherwise, if the CWA does not hold, the customer may either not like property x or may never have heard of it. The relation freqA(x) < freqA(y) shows that the corresponding customer has selected y more often than x; in this sense the relation x <PD y denotes a preference.

Numeric domains need a slightly different approach to data-driven preferences. For instance, an attribute A may have the real numbers as its domain (dom(A) = R), and we want to test whether a user has a data-driven LOWEST(A) preference, i.e., lower values are better and should occur with higher frequencies. Since dom(A) consists of infinitely many different values, the log relation contains only some of them, and typically each value occurs only a few times. Therefore, we use frequencies of intervals: e.g., for a data-driven LOWEST preference, the relation freqA([x − ε, x + ε]) < freqA([y − ε, y + ε]) for y < x must hold for some ε.

4.2 Mining Categorical Preferences

Based on PD = (A, <PD) we can define data-driven preferences for categorical domains.

Definition 5 (Data-driven preferences for categorical data) Let A be a categorical attribute of a log relation R and let POS-set, NEG-set, POS1-set, POS2-set, E ⊆ dom(A).
- There is a data-driven POS preference iff ∀x ∈ POS-set, ∀y ∉ POS-set: y <PD x.
- There is a data-driven NEG preference iff ∀x ∈ NEG-set, ∀y ∉ NEG-set: x <PD y.
- There is a data-driven POS/POS preference iff ∀x ∈ POS1-set, ∀y ∈ POS2-set, ∀z ∉ (POS1-set ∪ POS2-set): y <PD x and z <PD y.
- There is a data-driven POS/NEG preference iff ∀x ∈ POS-set, ∀y ∈ NEG-set, ∀z ∉ (POS-set ∪ NEG-set): z <PD x and y <PD z.
- Let <E be a strict partial order on E. A data-driven EXPLICIT preference holds iff ∀x, y ∈ E with x <E y: x <PD y,
- and ∀u ∈ E, ∀v ∉ E: v <PD u.

For a data-driven POS preference, the values in the POS-set must occur more often than the other values; for a data-driven NEG preference, the other values must occur more often than the values in the NEG-set. POS/POS and POS/NEG behave analogously. A data-driven EXPLICIT preference with underlying E-graph exists if a value y occurs more often than any


successor x in the E-graph; values outside the E-graph occur with the lowest frequencies. The main task for an algorithm for mining categorical preferences is the detection of proper POS-sets, NEG-sets, etc. Consider the following example of frequencies for an attribute author (the CWA does not hold, the domain is static):

Author | Frequency
Douglas Adams | 50
Edgar Wallace | 49
Natalie Angier | 2
Agatha Christie | 3
John Grisham | 2
Table 2: Example of frequencies for the attribute author

4.3 Mining Numerical Preferences

The distribution of numerical log data defines a statistical density function φ(x). Properties of this density function provide information about data-driven preferences. For instance, if φ(x) has a unique maximum at z and the gradient is positive for x < z and negative for x > z, there is an AROUND preference with around-value z. This approach is consistent with the definition of numeric data-driven preferences, because an increasing density guarantees freqA([x − ε, x + ε]) < freqA([y − ε, y + ε]) for x < y, and a decreasing density implies freqA([x − ε, x + ε]) > freqA([y − ε, y + ε]). Thus, in the above example, values around z are requested most frequently, and the frequency decreases with increasing distance to z. Since the density is usually unknown, it has to be estimated from the underlying numerical log data. In our implementation we use histograms as an easy-to-use and efficient density estimation technique.

5. Experimental Evaluation

In this section we present test results and performance measurements of an efficient database-driven implementation of a Preference Miner prototype.

5.1 Preference Mining Test Results

For our test environment we defined 35 preference profiles, where each profile contains between two and six preferences. In our simulation, each user queries the product database between 25 and 50 times, where the exact number of requests is randomized. In each query a preference of the considered user is chosen, and the product database is queried with it using Preference SQL [7]. The results are stored in a log database. Afterwards we use the Preference Mining algorithms to detect preferences within the log data. A comparison of the detected preference profiles with the predefined user preferences shows the effectiveness of the Preference Mining algorithms. To assess the quality of our results, we define preference precision and preference recall.
Definition 6 (Precision and recall for preferences) Preference precision and preference recall are defined as:

Precision = (number of correctly detected preferences of user i) / (number of all detected preferences for user i)

Recall = (number of correctly detected preferences of user i) / (number of all preferences of user i)
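Definition 6 is the standard precision/recall computation applied to preference sets. A sketch for one user, with hypothetical preference labels (our own code):

```python
def precision_recall(detected, actual):
    """Preference precision and recall for one user, per Definition 6:
    precision = correctly detected / all detected,
    recall    = correctly detected / all actual preferences."""
    correct = len(detected & actual)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

detected = {"POS(author)", "LOWEST(price)", "NEG(color)"}
actual = {"POS(author)", "LOWEST(price)", "AROUND(year)", "POS(genre)"}
p, r = precision_recall(detected, actual)
assert (p, r) == (2 / 3, 0.5)   # 2 of 3 detections correct; 2 of 4 found
```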

The set {Douglas Adams} is a correct POS-set for a data-driven POS preference. But intuitively, the set {Douglas Adams, Edgar Wallace} denotes a more reasonable POS-set, since these two values occur much more frequently than Natalie Angier, Agatha Christie and John Grisham. The following algorithm for mining categorical preferences uses clustering techniques in order to detect such proper sets.

Algorithm 1: Miner for categorical preferences in static domains
INPUT: log relation R, attribute A, dom(A)
(1) Compute for each value xi its frequency in the log relation, freqA(xi).
(2) Compute a clustering of the xi with freqA(xi) ≥ 1 using a clustering technique.
(3) Depending on the clustering results, we have the following possibilities:
(a) There is only one cluster C1 and the CWA holds. Then we have a NEG(A, {x ∈ dom(A) | freqA(x) = 0}) preference.
(b) There are two clusters C1 and C2 where ∀c1 ∈ C1, ∀c2 ∈ C2: freqA(c2) < freqA(c1).
(b1) If the CWA does not hold, we have a POS(A, C1) preference.
(b2) If the CWA holds, there is a POS/NEG(A, C1; {x ∈ dom(A) | freqA(x) = 0}) preference.
(c) There are three clusters C1, C2, C3 where ∀c1 ∈ C1, ∀c2 ∈ C2, ∀c3 ∈ C3: freqA(c3) < freqA(c2) < freqA(c1). Then we have a POS/POS(A, C1; C2) preference.
(d) In all other situations there is no data-driven preference.
OUTPUT: the detected preference, or "no preference found".
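Algorithm 1 leaves the clustering technique open; the sketch below is one possible reading (our own assumption: a one-dimensional clustering that splits sorted frequencies wherever the gap to the previous value exceeds a threshold). On the author example it recovers {Douglas Adams, Edgar Wallace} as the POS-set.

```python
from collections import Counter

def cluster_by_gap(freqs, gap=10):
    """Group values whose sorted frequencies are within `gap` of the
    previous one; a larger jump starts a new cluster. Returned clusters
    are ordered from lowest to highest frequency."""
    items = sorted(freqs.items(), key=lambda kv: kv[1])
    clusters = [[items[0][0]]]
    for (x, fx), (_, fprev) in zip(items[1:], items):
        clusters.append([x]) if fx - fprev > gap else clusters[-1].append(x)
    return [set(c) for c in clusters]

def mine_categorical(log, attr, domain, cwa, gap=10):
    """Sketch of Algorithm 1: detect a NEG / POS / POS-NEG / POS-POS
    preference from value frequencies in the log relation."""
    f = Counter(row[attr] for row in log)
    zero = {x for x in domain if f[x] == 0}
    clusters = cluster_by_gap({x: f[x] for x in domain if f[x] >= 1}, gap)
    if len(clusters) == 1 and cwa:
        return ("NEG", zero)                            # case (a)
    if len(clusters) == 2:
        high = clusters[1]
        return ("POS/NEG", high, zero) if cwa else ("POS", high)  # (b1)/(b2)
    if len(clusters) == 3:
        return ("POS/POS", clusters[2], clusters[1])    # case (c)
    return None                                         # case (d)

# The frequencies from the author example above (CWA does not hold).
log = ([{"author": "Adams"}] * 50 + [{"author": "Wallace"}] * 49 +
       [{"author": "Angier"}] * 2 + [{"author": "Christie"}] * 3 +
       [{"author": "Grisham"}] * 2)
domain = {"Adams", "Wallace", "Angier", "Christie", "Grisham"}
result = mine_categorical(log, "author", domain, cwa=False)
assert result == ("POS", {"Adams", "Wallace"})
```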


Figure 1: Average preference precision (black, in %) and recall (dashed, in %) over all test users.

The test results in the figure above show average recall and precision over all test users. Mining categorical preferences leads to 60% precision and 38% recall; mining numerical preferences leads to 58% precision and over 40% recall, using histograms as density estimation; and combined preferences yield 55% precision and 15% recall. Typical problems are user preferences for values that do not occur in the product database, and dependencies between preferences: if a base preference of a complex preference is not detected, it is very difficult to detect the complex preference itself. Such problems also lower the preference recall. Note that we filled the log relations with the search results; in real-life applications even better preference mining results can be achieved if the selected results or query information can be used.

6. Summary and Outlook

Another research task is the design of an appropriate storage structure for preferences. Such a Preference Repository should not only be able to record preferences detected with the Preference Miner, but also preferences defined with Preference SQL or Preference XPATH. The integration of situations should also be possible, as well as user identifiers to assign users and user groups. Finally, the Preference Repository shall also include a set of appropriate access operations for inserting, deleting and updating preferences. It can also be used to find users with similar preferences, and with it, product recommendations based on preferences can be offered. Therefore the Preference Repository is also a major step towards advanced personalized applications.

References

1. Pei, J., Han, J., & Wang, W. (2005). Constraint-based sequential pattern mining: the pattern-growth methods. Springer Science + Business Media, LLC, 2006.
2. Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining association rules between sets of items in large databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'93) (pp. 207-216). New York: ACM; Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94) (pp. 487-499). California: Morgan Kaufmann.
3. Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In Proc. 1995 Int. Conf. Data Engineering (ICDE'95) (pp. 3-14). Washington, DC: IEEE Computer Society.
4. Ayres, J., Flannick, J., Gehrke, J., & Yiu, T. (2002). Sequential pattern mining using a bitmap representation. In Proc. 2002 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'02) (pp. 429-435). New York: ACM.
5. Bayardo, R.J., Agrawal, R., & Gunopulos, D. (1999). Constraint-based rule mining on large, dense data sets. In Proc. 1999 Int. Conf. Data Engineering (ICDE'99) (pp. 188-197). Washington, DC: IEEE Computer Society.
6. Yan, X., Han, J., & Afshar, R. (2003). CloSpan: Mining closed sequential patterns in large databases. In Proc. 2003 SIAM Int. Conf. Data Mining (pp. 406-417).
7. Yang, J., Yu, P.S., Wang, W., & Han, J. (2002). Mining long sequential patterns in a noisy environment. In Proc. 2002 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD'02). New York: ACM.
8. Zaki, M.J. (1998). Efficient enumeration of frequent sequences. In Proc. 7th Int. Conf. Information and Knowledge Management (CIKM'98) (pp. 68-75). Washington, DC.
9. Beeferman, D., & Berger, A. (2000). Agglomerative clustering of a search engine query log. In Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (pp. 407-416). Boston, MA.
10. Delgado, J., & Ishii, N. (1999). Online learning of user preferences in recommender systems. In Proc. IJCAI-99 Workshop on Machine Learning for Information Filtering. Stockholm, Sweden.
11. Estivill-Castro, V., & Houle, M.E. (1999). Robust distance-based clustering with applications to spatial data mining. In 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining (pp. 327-337). Beijing, China.
12. Holland, S., Ester, M., & Kießling, W. Preference Mining: A novel approach on mining user preferences for personalized applications.
