
Calendric Association Rule Mining from Time Series Database

Mudra C. Panchal and Ghanshyam I. Prajapati

Abstract The volume of data is growing explosively as Internet technologies and
applications multiply. Within such large collections it is cumbersome to find the
interesting data, and what counts as interesting differs from person to person,
time to time, and task to task. The data themselves also keep changing with time.
This work therefore attempts to mine important information from a large amount of
time series data on a seasonal basis. It mines calendric association rules:
frequent itemsets are found for a given calendric pattern, and association rules
are generated from them. The dataset considered is a time series dataset used for
market basket analysis on a seasonal basis. The FP-Growth algorithm is applied to
carry out the task and is compared with Temporal Apriori; the comparison shows
that FP-Growth is more time-efficient than Temporal Apriori.

Keywords Association rules ⋅ Calendric association rules ⋅ Temporal Apriori ⋅ FP-Growth

1 Introduction

Data mining has been widely researched, but data that keep changing with time,
viz., temporal data, still need further study. Many data mining tasks, such as
association rule mining, classification, clustering, and outlier detection, are of
utmost importance for many applications. This paper focuses on association rule
mining. Much research has been carried out on association rule mining, but far
less on cyclic or calendar-based association rule mining. Frequent patterns need
to be found from a transaction database. The transaction database is temporal,
viz., the time of each transaction carried out by a customer is recorded [1].
Here, the focus is on mining calendar-based association rules,

M. C. Panchal (✉) ⋅ G. I. Prajapati


Springer-Verlag, Computer Science Editorial, Tiergartenstr. 17, 69121 Heidelberg, Germany
e-mail: mudracpanchal@gmail.com

© Springer Nature Singapore Pte Ltd. 2018
K. Saeed et al. (eds.), Progress in Advanced Computing and Intelligent Engineering,
Advances in Intelligent Systems and Computing 564,
https://doi.org/10.1007/978-981-10-6875-1_28

viz., rules that hold for a specific period of time. Calendric association rules
are also called seasonal rules, because the frequent itemsets found do not occur
throughout the database but only during some period of time [2]. Cyclic
association rules, in contrast, repeat at regular intervals of time, e.g., the
purchase of milk and bread daily between 9:00 A.M. and 10:00 A.M. [3]. Various
algorithms for mining such association rules are described in detail in later
sections; among them, the FP-Growth algorithm [4] is more efficient in terms of
execution time. An extension of Apriori, viz., Temporal Apriori, which is used
specifically for time series datasets [5], is also presented, and the two
algorithms are compared.
The paper is organized as follows. Section 2 reviews related work by different
authors. Section 3 describes preliminary terminology. Section 4 states the problem
and explains the proposed system. Section 5 gives a theoretical analysis of
various algorithms. Section 6 presents the experimental results and analysis, and
Sect. 7 concludes the paper.

2 Related Work

In [6], the authors describe two algorithms: the sequential algorithm and the
interleaved algorithm, the latter of which uses cycle pruning, cycle skipping, and
cycle elimination. The two algorithms are compared with respect to dependence on
minimum support, varying noise levels, varying itemset size, and data size
scale-up. The interleaved algorithm minimizes the amount of wasted work during
mining and performs better than the sequential algorithm. However, the approach
does not support updating of association rules, and the number of frequent
itemsets generated is large.
In [5], the authors extend the well-known Apriori algorithm to find temporal
association rules based on a calendar schema and calendar patterns. They first
identify two classes of temporal association rules, viz., temporal association
rules w.r.t. full match and temporal association rules w.r.t. relaxed match. They
then apply an extended, level-wise version of Apriori, named Temporal Apriori, to
develop two techniques that find association rules of both classes. The approach
requires less prior knowledge and discovers more unexpected rules. Calendar-based
temporal mining can also be extended to other data mining tasks such as clustering
and classification. A limitation is that the time granules at a lower level must
be obtained by subdividing the time granules at the higher level.
In [7], the authors propose an efficient method that works differently from the
Temporal Apriori algorithm and scans the database at most twice. It works in three
phases. First, it discovers frequent 2-itemsets along with their 1-star candidate
calendar patterns. In the second phase, it generates candidate itemsets along with
their k-star candidate calendar patterns. In the last phase, it discovers frequent
itemsets along with their frequent calendar patterns. The method avoids multiple
scans of the database but generates slightly more candidates than Temporal Apriori.

In [8], the authors propose an approach consisting of a temporal H-mine algorithm
and a temporal FP-tree algorithm. It considers two parameters for mining frequent
itemsets, viz., time and scalability. It is more efficient because it decreases
the processing time for mining frequent itemsets, but the approach is complex. It
can be extended to design a good classifier.
In [9], the authors study the problem of generating association rules that display
regular cyclic variation over time, for which the Apriori algorithm is not
efficient. They explain two algorithms, called the sequential algorithm and the
transition algorithm, and devise a technique called cycle rules by finding the
relationship between association rules and time. They also show the difference
between the sequential and interleaved algorithms; the interleaved algorithm
performs better than the sequential algorithm. Implementing cycle rules reduces
the time needed to find association rules, and the transition algorithm scales
with increasing data, although it is not the most efficient. The interleaved
algorithm can be adapted with minor changes to find global cycles.
As noted in [10], a real-time database keeps being updated, so the discovered
temporal association rules must be maintained and updated as well. Rerunning the
mining algorithm after every update is inefficient, because it does not reuse
previously discovered rules and rescans the whole database. The authors therefore
propose an incremental algorithm, ITARM, which maintains the temporal frequent
itemsets after the temporal transaction database has been updated. It is based on
the sliding window filtering algorithm. Results are shown on both synthetic and
real databases. ITARM reduces the time needed to generate new candidates, reduces
rescanning of the database, and is scalable. However, it works efficiently only
for maintaining temporal association rules.

3 Preliminaries

The previous section reviewed the work of different authors together with its
strengths and weaknesses. This section presents the theoretical background
required for the proposed approach: support, confidence, the dataset used, types
of association rules, and algorithms for temporal association rules.

3.1 Support

Support is the number of times a particular item or itemset occurs in a
transaction database. If X and Y are itemsets, the support of XY is the portion of
the database that contains both X and Y together. If the transaction database has
5 records in total and only 1 record contains both X and Y, then the support is
20%. As per [1], support is computed as follows:

\[ \mathrm{Support}(XY) = \frac{\text{support count of } XY}{\text{total number of transactions in } D} \]
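The support definition can be illustrated with a minimal Python sketch. The function name `support` and the toy database `D` are assumptions made for illustration and do not come from [1]; the data reproduce the 5-record example given above.

```python
def support(itemset, transactions):
    """Return the fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    # A transaction counts as a hit when the itemset is a subset of it.
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


# Toy database D with 5 transactions; only the first contains both X and Y.
D = [{"X", "Y"}, {"X"}, {"Y"}, {"Z"}, {"X", "Z"}]
print(support({"X", "Y"}, D))  # 0.2, i.e., 20%
```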

3.2 Confidence

Confidence is the ratio of the number of records that contain both X and Y to the
number of records that contain X. If 10 of the 20 records in the transaction table
contain X and the confidence is 80%, then 8 of those 10 records also contain Y.
As per [1], confidence is computed as follows:

\[ \mathrm{Confidence}(X \mid Y) = \frac{\mathrm{Support}(XY)}{\mathrm{Support}(X)} \]
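A minimal Python sketch of the confidence computation follows. The function name `confidence` and the toy data are illustrative assumptions; the data reproduce the 20-record example above. Because both supports share the same denominator, the ratio of supports equals the ratio of the raw counts, which is what the sketch computes.

```python
def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing `antecedent` that also contain `consequent`."""
    x = frozenset(antecedent)
    xy = x | frozenset(consequent)
    count_x = sum(1 for t in transactions if x <= set(t))
    count_xy = sum(1 for t in transactions if xy <= set(t))
    # Support(XY) / Support(X) reduces to count_xy / count_x.
    return count_xy / count_x if count_x else 0.0


# 20 records: 10 contain X, and 8 of those 10 also contain Y.
D = [{"X", "Y"}] * 8 + [{"X"}] * 2 + [{"Z"}] * 10
print(confidence({"X"}, {"Y"}, D))  # 0.8, i.e., 80%
```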

3.3 Dataset

As the title indicates, the dataset used for the implementation is a time series
dataset, i.e., a dataset with a time stamp consisting of a date, a time, or both.
The dataset used in this work is a food market dataset for carrying out market
basket analysis. It can be downloaded from
http://recsyswiki.com/wiki/Grocery_shopping_dataset [11]. It is a sample dataset
from Microsoft and is available as a MySQL file.

3.4 Types of Association Rules

As per [12], various types of association rules are explained below.


Context-based Association Rules: Context-based association rules are a type of
association rule that concentrates on unseen (hidden) variables. These unseen
variables, known as context variables, are responsible for changing the final set
of association rules.
Generalized Association Rules: These are based on the concept of generalization,
where a concept hierarchy is climbed. For example, city-level values can be rolled
up to the state that contains them.
Quantitative Association Rules: These involve both categorical and quantitative
data [12]. For example, 70% of boys going to college will have a bike.
Interval Data Association Rules: Data are ordered and separated into specific
ranges. For example, the salaries of employees can be partitioned into slots of
Rs. 10,000.
Sequential Association Rules: Association rules are ordered in sequence. For
example, a DNA sequence is important for gene classification.
Temporal Association Rules: Most real-time data are temporal in nature, viz., they
vary with time, so the database needs to be updated continually. For example, the
purchase of an AC by a customer in summer is a seasonal rule, and the purchase of
milk every morning is a cyclic association rule.

3.5 Types of Temporal Association Rules

Interval Association Rules: Data are ordered and separated into specific ranges.
For example, the salaries of employees can be partitioned into slots of Rs. 10,000.
Each item is assigned a time interval, and association rules are discovered within
that time interval.
Sequential Association Rules: Association rules are ordered in sequence. For
example, a DNA sequence is important for gene classification.
Temporal Predicate Association Rules: The association rule is extended with a
conjunction of binary temporal predicates that specify the relationships between
the time stamps of transactions. It works in both point-based mode (purchase of an
item at a fixed time, such as exactly 10 o'clock) and interval-based mode
(purchase of an item within a time interval, such as between 9:00 A.M. and
10:00 A.M.).
Calendric Association Rules: These are based on a calendar system and are also
called seasonal association rules. Examples are more accidents during the rainy
season and more purchases of refrigerators in summer.
Cyclic Association Rules: Ozden et al. introduced the concept of cyclic
association rules and showed the relationship between association rules and
time [6]. For example, the purchase of milk and butter daily between 9:00 A.M. and
10:00 A.M.

3.6 Various Algorithms for Temporal Association Rules

Sequential Algorithm: It proceeds in a fixed sequence from start to end. Its task
is to find association rules that are cyclic in nature, i.e., that repeat
regularly after a certain period of time, and it works in two steps. In the first
step, association rules are found for each instance of time, and in the second
step, the cyclic patterns are detected. The method uses a Boolean representation:
for each time instance, a rule that holds is represented by true (1) and a rule
that does not hold by false (0), which yields a binary sequence for each rule.
In the first step, the whole binary sequence is scanned; positions with 0 are
discarded and the rest are kept to form association rules. This continues until
the last bit of the binary sequence.
In the second step, only large cycles are detected from the association rules
found in the first step.
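The second step can be sketched in Python under two assumptions that follow the usual formulation of cyclic association rules in [6] and are not spelled out in this section: a cycle is a pair (length, offset) such that the rule holds in every time unit i with i mod length = offset, and a large cycle is one that is not a multiple of another detected cycle. The function names and the example sequence are hypothetical.

```python
def detect_cycles(bits, max_length):
    """Return every (length, offset) such that bits[i] == 1 whenever i % length == offset."""
    cycles = []
    for length in range(1, max_length + 1):
        for offset in range(length):
            positions = range(offset, len(bits), length)
            # Ignore offsets that fall outside the sequence entirely.
            if len(positions) > 0 and all(bits[i] for i in positions):
                cycles.append((length, offset))
    return cycles


def is_multiple(cycle, other):
    """True if `cycle` is implied by the shorter cycle `other`."""
    length, offset = cycle
    other_length, other_offset = other
    return (cycle != other
            and length % other_length == 0
            and offset % other_length == other_offset)


def large_cycles(cycles):
    """Keep only the cycles that are not multiples of another detected cycle."""
    return [c for c in cycles if not any(is_multiple(c, d) for d in cycles)]


# Binary sequence of one rule over 8 time units: it holds in every second unit.
bits = [0, 1, 0, 1, 0, 1, 0, 1]
found = detect_cycles(bits, max_length=4)
print(found)                # [(2, 1), (4, 1), (4, 3)]
print(large_cycles(found))  # [(2, 1)]
```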
Interleaved Algorithm: It is an extension of the Apriori algorithm and works in
the reverse order of the sequential algorithm: it first determines the cycles or
patterns that occur regularly and then finds the association rules. It is more
efficient than the sequential algorithm. As per [13], it works in two steps and
uses three additional techniques.
Cycle omitting: It omits counting the support of an itemset in time units that do
not belong to any of its candidate cycles.
Cycle deletion: If an itemset is not frequent at a particular time, it cannot have
any cycle that includes that time. Such cycles are deleted, which in turn lets the
cycle omitting step skip more time units.
Cycle cutting: It simply prunes the cycles that are of no use.
As per [14], the interleaved algorithm can be adapted with minor changes to find
global cycles.
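The three techniques can be sketched as small Python helpers. This is an illustrative reading rather than the implementation of [13]: cycles are assumed to be (length, offset) pairs as in [6], cycle cutting is assumed to mean that an itemset can only keep the cycles shared by all of its subsets, and all function names are hypothetical.

```python
def cut_cycles(subset_cycle_sets):
    """Cycle cutting: an itemset can only have the cycles shared by all of its subsets."""
    sets = [set(s) for s in subset_cycle_sets]
    return set.intersection(*sets) if sets else set()


def should_count(time_unit, candidate_cycles):
    """Cycle omitting: count support only in time units covered by a candidate cycle."""
    return any(time_unit % length == offset for length, offset in candidate_cycles)


def delete_cycles(time_unit, candidate_cycles):
    """Cycle deletion: drop every cycle that includes a time unit where the itemset was not frequent."""
    return {(length, offset) for length, offset in candidate_cycles
            if time_unit % length != offset}


# Cycles already known for the two 1-item subsets of a 2-itemset.
cycles_a = {(2, 0), (2, 1), (4, 1)}
cycles_b = {(2, 1), (4, 1), (4, 3)}
candidates = cut_cycles([cycles_a, cycles_b])
print(sorted(candidates))                    # [(2, 1), (4, 1)]
print(should_count(2, candidates))           # False: time unit 2 can be skipped
print(sorted(delete_cycles(3, candidates)))  # [(4, 1)]
```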
PCAR Algorithm: It performs better than both the sequential and interleaved
algorithms. The method divides the original database into a number of partitions
or segments chosen by the user. If there are 10 transactions and the user wishes
to divide them into two segments, then the first segment consists of the first
five transactions and the second segment consists of the last five transactions.
The segments are scanned one after the other to generate the cyclic frequent
itemsets. The cyclic frequent itemsets generated from the first segment are used
when scanning the next segment, so the method works in an incremental way. Its
drawback is that it generates many association rules that are of no use to the
user.
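The partitioning step of the example above can be sketched as follows. Only the split into contiguous segments is shown; the function name `partition` and the transaction labels are illustrative assumptions, and the incremental mining of cyclic frequent itemsets over the segments is not reproduced here.

```python
def partition(transactions, num_segments):
    """Split `transactions` into `num_segments` contiguous segments of near-equal size."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    size, extra = divmod(len(transactions), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        # The first `extra` segments take one additional transaction.
        end = start + size + (1 if i < extra else 0)
        segments.append(transactions[start:end])
        start = end
    return segments


transactions = [f"t{i}" for i in range(1, 11)]  # 10 transactions
first, second = partition(transactions, 2)
print(first)   # ['t1', 't2', 't3', 't4', 't5']
print(second)  # ['t6', 't7', 't8', 't9', 't10']
```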
CBCAR Algorithm: CBCAR stands for constraint-based cyclic association rules. It is
an extension of PCAR that eliminates PCAR's problem of generating a large number
of association rules that are of no use to the user. Here, the generated rules
must satisfy user-defined constraints, which minimizes the set of generated
association rules.
IUPCAR Algorithm: IUPCAR stands for incremental update PCAR. It is an extension of
PCAR and is more efficient than PCAR. Because real-time data are temporal, keeping
the rules up to date normally requires rescanning the whole database; IUPCAR was
introduced to avoid this rescanning.
It works in three steps and uses three classes, viz., frequent cyclic itemsets,
frequent false-cyclic itemsets, and rare cyclic itemsets. In the first step, the
database is scanned and each itemset is placed in the appropriate one of the three
classes. In the second step, depending on the original class of the itemset and
its support, and on the new class and its support, the itemset is assigned to a
new class according to the weighted model [15]. In the final step, after the
update, the final cyclic frequent itemsets are found.

FP-Growth Algorithm: As per [4], it is an algorithm for mining the frequent
itemsets of a dataset efficiently. It outperforms the Apriori algorithm and many
others. It reduces scanning of the database, as it completes the whole procedure
in just two scans. The main advantage of FP-Growth is that it reduces the
generation of candidate sets. It first arranges the transactions as ordered sets,
then creates an FP-tree from the database, and finally mines frequent itemsets
directly from the FP-tree.
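The two database scans can be illustrated with a minimal Python sketch of FP-tree construction. It covers only the tree-building stage; the recursive mining of conditional trees is omitted. The class `FPNode`, the function `build_fp_tree`, the alphabetical tie-break between equally frequent items, and the toy transactions are assumptions made for illustration and are not taken from [4].

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # Scan 1: count each item once per transaction and keep the frequent ones.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}

    root = FPNode(None)
    # Scan 2: insert each transaction with its frequent items ordered by
    # descending count (ties broken alphabetically, an arbitrary choice).
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


transactions = [["milk", "bread"], ["milk", "bread", "butter"], ["milk"], ["bread", "jam"]]
root, frequent = build_fp_tree(transactions, min_count=2)
print(frequent["milk"], frequent["bread"])              # 3 3
print(root.children["bread"].count)                     # 3
print(root.children["bread"].children["milk"].count)    # 2
```

Transactions that share a prefix of frequent items share a path in the tree, which is why the tree is usually much smaller than the database it summarizes.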
The algorithms described here can similarly be used for other temporal association
rules, such as temporal predicate association rules. Along with these algorithms,
fuzzy-based techniques can also be used to find temporal association rules
accurately.

4 Proposed System Description

A time series dataset for groceries is considered for the work. The dataset is
freely available online and carries the time stamp attributes year, month, and
quarter. Four quarters are considered: quarter 1 consists of months 1, 2, 3,
quarter 2 of months 4, 5, 6, quarter 3 of months 7, 8, 9, and quarter 4 of months
10, 11, 12. As the work focuses on calendric association rules, a particular
schema needs to be chosen. Here, the schema is (Year, Month, Quarter), where the
values are (1997, (1, 2, 3, …, 12), (Q1, Q2, Q3, Q4)).
After defining the schema, the number of possible patterns needs to be determined.
For this dataset, 12 calendric patterns are possible, one for each of the 12
months, and the number of transactions for each pattern is generated. Finding
frequent itemsets for each pattern separately would require 12 FP-Growth runs;
hence, three patterns are clubbed into one and FP-Growth is applied to the
combined pattern. The selected patterns are (1997, 1, Q1), (1997, 2, Q1), and
(1997, 3, Q1), which together represent all months in Quarter 1 of the year 1997,
i.e., (1997, *, Q1).
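Selecting the transactions that belong to a calendar pattern such as (1997, *, Q1) can be sketched as follows. The helper names `quarter_of`, `matches`, and `select`, the (year, month, items) layout of the transactions, and the sample items are illustrative assumptions and do not reflect the actual layout of the dataset [11].

```python
def quarter_of(month):
    """Map months 1..12 to 'Q1'..'Q4' (months 1-3 give Q1, ..., months 10-12 give Q4)."""
    return f"Q{(month - 1) // 3 + 1}"


def matches(pattern, timestamp):
    """True if every non-wildcard field of `pattern` equals the corresponding time stamp field."""
    return all(p == "*" or p == v for p, v in zip(pattern, timestamp))


def select(transactions, pattern):
    """Return the item sets of the (year, month, items) transactions that match `pattern`."""
    return [items for year, month, items in transactions
            if matches(pattern, (year, month, quarter_of(month)))]


transactions = [
    (1997, 1, {"milk", "bread"}),
    (1997, 3, {"milk"}),
    (1997, 4, {"jam"}),
    (1997, 12, {"bread"}),
]
q1 = select(transactions, (1997, "*", "Q1"))
print(len(q1))  # 2: the January and March transactions
```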
Hence, the aim is to apply FP-Growth to the transactions that match the pattern
(1997, *, Q1), generate all the frequent itemsets together with their counts, and
find the calendric association rules for the pattern (1997, *, Q1).
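Once the frequent itemsets and their counts are available for a pattern, rules can be derived from them. The sketch below shows the standard confidence-based rule generation step under stated assumptions: the function name `generate_rules`, the dictionary layout (frozenset to count), the threshold, and the counts themselves are hypothetical and are not results from the paper.

```python
from itertools import combinations


def generate_rules(itemset_counts, min_confidence):
    """Derive rules antecedent -> consequent from frequent itemsets and their counts."""
    rules = []
    for itemset, count in itemset_counts.items():
        if len(itemset) < 2:
            continue
        # Try every non-empty proper subset of the itemset as the antecedent.
        for r in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), r):
                antecedent = frozenset(combo)
                base = itemset_counts.get(antecedent)
                if not base:
                    continue  # antecedent count unknown, so confidence cannot be computed
                conf = count / base
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical counts for the pattern (1997, *, Q1).
counts = {
    frozenset({"milk"}): 4,
    frozenset({"bread"}): 5,
    frozenset({"milk", "bread"}): 3,
}
for antecedent, consequent, conf in generate_rules(counts, min_confidence=0.7):
    print(sorted(antecedent), "->", sorted(consequent), conf)  # ['milk'] -> ['bread'] 0.75
```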

4.1 Proposed System

Fig. 1 Proposed system

The flow shown in Fig. 1 is the overall approach for finding calendric association
rules in a time series dataset. The proposed system makes use of the FP-Growth
algorithm, which gives better results than Temporal Apriori, which has been used
earlier. The execution time of the Apriori algorithm and its variations is higher
than that of the FP-Growth algorithm. Of the association rule mining algorithms
explained in the previous section, Temporal Apriori works efficiently, but the
proposed system uses FP-Growth, which executes faster than Temporal Apriori
because it generates fewer candidate sets and scans the dataset only twice. The
theoretical and practical analyses are described in later sections.

5 Theoretical Analysis

Table 1 summarizes the theoretical analysis drawn from the literature survey.
Various algorithms have been studied; the analysis shows that little work has been
done on calendric association rule mining and that, among the algorithms
considered, FP-Growth is the most efficient to use.

Table 1 Temporal association rule algorithms analysis

Algorithm | Strength | Weakness
Apriori | Simple to implement; widely used | Generates more candidate sets
Sequential | Extended from Apriori | Less efficient
Interleaved | Minimizes the amount of wasted work; more efficient than sequential | Generates candidate sets
PCAR | Outperforms the sequential and interleaved algorithms | Generates rules that do not meet the expert's expectations
IUPCAR | Works fast for incremental update of cyclic association rules | Works on already generated cyclic association rules
T-Apriori | Generates fewer candidate sets | Tedious work
FP-Growth | Generates fewer candidate sets; minimizes the number of database scans | Requires more memory for storage of the tree

6 Experimental Results and Analysis

This section gives the experimental results followed by an analysis of the
algorithms presented here.
Experimental Results: The experiments find calendric association rules using the
groceries dataset [11], which is a time series dataset. Table 2 shows the
experimental results for two algorithms, viz., Temporal Apriori and FP-Growth. The
results are based on the execution time, in milliseconds, that each algorithm
requires to process the dataset. The experiments were executed 10 times on the
same data and on the same system, using Java SDK 1.7 and NetBeans IDE 7.2.
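The measurement protocol described above (10 runs on the same data, times in milliseconds) can be sketched as a small timing harness. The original experiments were run in Java; this Python version is only an illustration, and the function name `average_runtime_ms` and the placeholder workload are assumptions.

```python
import time


def average_runtime_ms(func, *args, runs=10):
    """Run func(*args) `runs` times and return the mean wall-clock time in milliseconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        func(*args)
        total += time.perf_counter() - start
    return total / runs * 1000.0


# Placeholder workload standing in for a mining algorithm.
data = list(range(100_000))
print(f"{average_runtime_ms(sorted, data):.3f} ms")
```

Averaging over repeated runs on the same machine reduces the influence of one-off effects such as caching or background load, which matters when two algorithms are compared on execution time alone.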

Table 2 Temporal association rule algorithms analysis


Algorithms: execution time in milliseconds
Temporal Apriori algorithm | FP-Growth algorithm

Fig. 2 Analysis chart

Experimental Analysis: Fig. 2 shows that the execution time of the FP-Growth
algorithm is almost half that of the Temporal Apriori algorithm. The experiments
were executed 10 times, and on average the execution time of FP-Growth is much
lower than that of Temporal Apriori. Algorithms behave differently on different
datasets, so the same algorithm may give different timings on other datasets, but
FP-Growth is expected to take less time than Temporal Apriori.

7 Conclusion

In this paper, a survey is presented of the various types of algorithms used for
mining association rules. The focus is on temporal association rules, i.e., rules
over data that keep changing with time, and a theoretical analysis is given for
temporal association rule mining algorithms. The proposed approach is based on the
FP-Growth algorithm, which is shown to work better than Temporal Apriori in terms
of execution time, and a comparison of the algorithms is given. Special attention
is paid to time series datasets, from which calendric association rules are mined.
The performance of the different techniques still varies from dataset to dataset,
as each dataset has its own characteristics.

References

1. Arora, J., Bhalla, N., Rao, S.: A review on association rule mining algorithms. Int. J. Innov.
Res. Comput. Commun. Eng. 1(5) (2013)
2. Shirsath, P.A., Verma, V.K.: A recent survey on incremental temporal association rule
mining. Int. J. Innov. Technol. Explor. Eng. 3(1) (2013)
3. Ale, J.M., Rossi, G.H.: An approach to discovering temporal association rules. ACM (2013)
4. Borgelt, C.: An implementation of fp-growth algorithm. IEEE (2011)
5. Li, Y., Ning, P., Wang, X.S., Jajodia, S.: Discovering calendar-based temporal association
rules. Data Knowl. Eng. 44, 193–218 (2003)
6. Ozden, B., Ramaswamy, S., Silberschatz, A.: Cyclic association rules. IEEE (1998)
7. Lee, W.-J., Jiang, J.-Y., Lee, S.-J.: An efficient algorithm to discover calendar-based temporal
association rules. IEEE (2004)
8. Verma, K., Vyas, O.P.: Efficient calendar based temporal association rule. SIGMOD Rec.
34(3) (2005)
9. Srinivasan, V., Aruna, M.: Mining association rules to discover calendar based temporal
classification. IEEE (2008)
10. Gharib, T.F., Nassar, H., Taha, M., Abraham, A.: An efficient algorithm for incremental
mining of temporal association rules. Elsevier (2010)
11. http://recsyswiki.com/wiki/Grocery_shopping_datasets. Accessed Oct 2015
12. Patel, K.K.: A survey of cyclic association rules. IJEDR 3(1), 453–458 (2015)
13. Shah, K., Panchal, M.: Evaluation on different approaches of cyclic association rules.
IJRET 2(2), 184–189 (2013)
14. Nanavati, N.R., Jinwala, D.C.: Privacy preservation approaches for global cycle detections for
cyclic association rules in distributed databases, pp. 368–371. ResearchGate, July 2012
15. Ahmed, E.B.: Incremental update of cyclic association rules. Springer, Berlin, Heidelberg
(2010)
