GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) - 2016 | July 2016

e-ISSN: 2455-5703

Association Rule Mining in Big Data using MapReduce Approach in Hadoop
1J. Jenifer Nancy 2M. Jansi Rani 3Dr. D. Devaraj
1P. G. Scholar 2Assistant Professor 3Senior Professor and H.O.D
1,2Department of Computer Science & Engineering 3Department of Electrical and Electronics Engineering
1,2,3Kalasalingam University, Krishnankovil, India
Abstract
Association rule mining is an important task in data mining. In the case of big data, the large volume of data makes it
impossible to generate rules at a fast pace. By making use of parallel execution in Hadoop through the MapReduce framework,
the rules can be generated much faster and more efficiently. The existing method transforms the input dataset into a binomial
representation before processing it with MapReduce. But binomial conversion is not user-friendly, since it becomes complex for
continuous values. In this paper, an improved and scalable algorithm is proposed for association rule mining that converts
the input dataset into key-value pairs instead of a binomial representation. All the stages of the proposed association rule
mining algorithm are parallelized using MapReduce. The proposed algorithm works on high-cardinality features, so no dimension
detection is needed.
Keywords- Hadoop; MapReduce; Association rule mining; Data mining; Big data
__________________________________________________________________________________________________

I. INTRODUCTION
A. Big Data and Characteristics
Data is collected and stored every minute, every hour and every day in an organization or institute, and it accumulates in large
quantities. What matters, however, is not the amount of data but what the organization does with it to identify useful
information. This is done by analyzing the data to find insights or critical information that help the organization make
decisions that support its growth. The term big data describes a large volume of data that is available in both structured and
unstructured formats. Even though big data is a new term, the process of collecting data, storing it in large amounts and
analyzing it to gather new information was practiced long before the term came into use. The characteristics of big data can be
explained using the 3 Vs: (1) Volume, (2) Velocity and (3) Variety.
The applications of big data include areas such as health care, telecom and finance. In this paper the process of association
rule generation in big data is discussed, and an association rule mining technique is proposed to generate rules from the KDD
CUP 99 dataset.
B. Data Mining in Big Data
Big data mining deals with large amounts of data stored in data warehouses and databases. It is used to extract interesting
patterns and information from these large data collections. Many data mining techniques can be applied to big data, including
classification, clustering, association rules, prediction, estimation, documentation and description. These techniques have
long been the subject of extensive research, and many algorithms have been developed for each of them; this also applies to
big data.
One such well-known technique is association rule mining. It is an efficient data mining technique for discovering hidden
patterns and information in large databases. Here the relationships between the various attributes of the data are identified
using an association rule mining algorithm. Some basic types of association rule mining algorithms are the Apriori algorithm,
distributed algorithms and parallel algorithms.
C. Association Rule Mining
Association Rule Mining (ARM) [1] is a popular data mining approach used to analyse a given dataset and discover interesting
patterns or relationships between its items. The concept of strong association rules was first used by Agrawal et al. [2] to
identify association rules between the items sold in a large-scale transaction database collected from a supermarket using a
point-of-sale system. The relationship between the items is identified based on the purchase pattern. The ARM technique
generates a set of association rules that hold between the items of the given dataset, based on the number of occurrences of
each item combination in the dataset.

All rights reserved by www.grdjournals.com


(GRDJE / CONFERENCE / ICIET - 2016 / 029)

An association rule defines a relationship between items in the given dataset. Consider three items A, B and C. The rule
{A, B} → C says that if a person buys the two items A and B together, then he/she will most likely also buy item C. That is,
the relations between the items are generated by identifying patterns within the dataset. The Association Rule Mining (ARM)
technique [3] consists of two stages as follows:
1) Identify the itemsets that occur frequently in the dataset: The frequent itemsets are those that have a support value
(sup(itemset)) equal to or greater than the pre-defined minimum support value (min_sup). The support value of an itemset is
calculated as the number of transactions that contain all of its items. In the above example, the support of {A, B} is the
number of transactions that contain both A and B.
2) Association rule generation using frequent itemsets: In this stage the interesting rules are generated by calculating the
confidence factor (conf) for the frequent itemsets generated in the previous stage. The confidence value for the above
example rule {A, B} → C is sup({A, B, C})/sup({A, B}).
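As a small worked illustration of the two measures, the following sketch computes a support count and a confidence value using the standard definition conf(X → Y) = sup(X ∪ Y)/sup(X). The toy transactions and the helper names `support` and `confidence` are assumptions made for this example and are not taken from the experiments in this paper.

```python
# Toy transaction database; the items and values are illustrative only.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "B", "C"},
    {"B", "C"},
    {"A", "C"},
]


def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def confidence(antecedent, consequent, transactions):
    """conf(X -> Y) = sup(X union Y) / sup(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


sup_ab = support({"A", "B"}, transactions)
conf_ab_c = confidence({"A", "B"}, {"C"}, transactions)
print(sup_ab, round(conf_ab_c, 3))
```

Here three transactions contain both A and B, and two of those also contain C, so the rule {A, B} → C has a confidence of 2/3.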
D. MapReduce Approach for ARM
Association rule generation is widely used, but it faces many issues, the major one being the handling of large and
multidimensional datasets [4]. A single-processor system with ordinary CPU speed and resources cannot handle such large data,
which makes the algorithms inefficient. In recent developments, the growth of network technology, and especially of cloud
platforms, has provided new ideas for association rule generation by making use of parallel environments like Hadoop [5].
MapReduce has been a popular model for computing over large amounts of data ever since it was introduced by Google. Hadoop's
distributed file system (HDFS) follows the design of the Google File System (GFS), and cloud providers such as Amazon Web
Services (AWS) offer Hadoop and MapReduce as part of their services.
A MapReduce job usually splits the input data into chunks, each of which is processed by a map task in parallel. The Mapper
processes its chunk and emits <key, value> pairs, and the framework sorts the map outputs. The Reducer then reduces the
outputs obtained from the maps to produce the final output. The MapReduce framework consists of a single JobTracker as the
master and one TaskTracker as the slave on each cluster node. All input and output in MapReduce are <key, value> pairs.
Hadoop is a Java-based distributed programming environment, sponsored by Apache, that is used to process and handle large
amounts of data. Hadoop implements the MapReduce concept for large-scale processing across a large number of nodes and
clusters.
In the case of Association Rule Mining in MapReduce, the Mapper emits the various combinations of items as keys, and the
value is used to keep track of the number of occurrences, i.e. the support count. Finally, the Reducer aggregates the Mapper
outputs for each key and calculates the final support and confidence for all the candidate itemsets. In this way the
association rules with high support and confidence can be generated.
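The map, shuffle and reduce steps described above can be sketched in plain Python as a single-process simulation. This is not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase`, the choice of 2-item combinations and the toy transactions are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, k):
    """Emit (itemset, 1) for every k-item combination in one transaction."""
    for combo in combinations(sorted(set(transaction)), k):
        yield combo, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the partial counts to obtain the support count of one itemset."""
    return key, sum(values)


# Illustrative transactions, not data from the paper.
transactions = [["A", "B", "C"], ["A", "B"], ["B", "C"], ["A", "B", "C"]]
mapped = [pair for t in transactions for pair in map_phase(t, 2)]
support_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(support_counts)
```

Sorting the items before forming combinations gives every itemset one canonical key, so that ("A", "B") and ("B", "A") are not counted separately.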
The remainder of this paper is organized as follows: Section 2 reviews the various association rule mining algorithms that use
Hadoop and MapReduce; Section 3 describes the proposed method and its working; Section 4 presents the experimental
results of the proposed method; and finally Section 5 provides the overall conclusion of the paper.

II. LITERATURE SURVEY


MapReduce can be used to redesign existing sequential algorithms as parallel algorithms that handle large amounts of data in
a shorter time, and so it has been applied to association rule mining [6]. Some of the existing methods are discussed below.
A. State-of-art in Association Rule Mining
Yang et al. proposed a MapReduce-based programming model for generating association rules in the Hadoop framework to
handle large volumes of data. The Apriori algorithm [7] is used as the underlying association rule generation technique. But the
standard Apriori algorithm is time consuming, especially when dealing with many candidate sets. To overcome this issue, they
implemented an improved Apriori algorithm [8] that is parallelized using the Hadoop framework and handles large data by
distributing the MapReduce work over the nodes of the Hadoop platform. The use of Hadoop for association rule generation
provided a new research focus in the following years.
Lin et al. [11] proposed a similar method for association rule generation that uses the same Apriori approach for
frequent itemset generation on the Hadoop platform with MapReduce. The mining process is accelerated by parallelizing the
frequent itemset generation step. But parallelization is difficult to handle effectively on its own, so MapReduce is used for
this purpose. They proposed a parallel algorithm in MapReduce that performs better than the previously existing algorithms in
terms of both speed and rule generation accuracy [9].
Riondato et al. proposed a randomized algorithm for association rule mining that is implemented using a parallel
approach [10] in the MapReduce framework. The proposed approach generated the association rules appropriately based on the
dataset content. At first the proposed PARMA (Parallel Association Rule Mining Algorithm) approach randomized the
MapReduce algorithm to identify the appropriate frequent itemsets and association rules with a near-linear speedup.
A large number of random samples drawn from the original dataset are mined.
Jongwook Woo et al. proposed a Market Basket Analysis algorithm combined with MapReduce for association rule
generation. This is one of the most used algorithms for association rules [12]. The algorithm first sorts the given dataset in
ascending order, then converts each instance of the dataset into a (key, value) pair and feeds the pairs into MapReduce. The
execution is done on the Amazon EC2 MapReduce platform. The experimental results show that performance is increased by the
MapReduce parallel code, but a bottleneck still appears at a certain point when more nodes are used.
B. Need for Proposed Method
The binomial approach is not suitable for many datasets, and a method is needed that can be applied to datasets of any
format [13]. Binomial transformation is also complex and time consuming, and it is not necessary. Moreover, it is difficult to
handle and process large volumes of data on a single server, so a parallel environment is needed.
In this paper an improved, scalable and distributed key-value pair algorithm is proposed for selecting frequent itemsets
from the dataset and for generating association rules. The proposed algorithm is a bottom-up approach: the candidate
itemsets are generated first, and then their support values are calculated by counting their occurrences in the dataset
transactions. The minimum support value is then applied to filter the candidate itemsets down to the frequent itemsets. A very
large dataset is used here, and after the frequent itemsets are selected the association rules are generated. The
implementation uses the MapReduce platform, and the complete process is parallelized.

III. PROPOSED METHOD


The paper proposes and implements association rule mining on a very large dataset in the Hadoop platform using
MapReduce [14]. The proposed algorithm converts the input dataset into <key, value> pairs instead of a binomial
representation. This way, the final transformation step that converts binomial features back to data features is avoided.
The input dataset should first be preprocessed before the rule generation phase in MapReduce [15]. The
various phases of the proposed algorithm are discussed below.
1) Phase 1: Generate frequent 1-itemsets. The input dataset is first stored in the HDFS of the Hadoop environment to
make data access easy and fast for MapReduce operations [16]. The input data is then split into chunks and
provided to the Mapper, which emits its output as <key, value> pairs. The outputs obtained from all the maps are
combined in the combiner and then sent to the reducer. Here the support values are calculated by summing the values
corresponding to each key. The support values are then compared with the minimum support, and the items whose
support meets the minimum are taken as the output; these are the frequent 1-itemsets.
2) Phase 2: Generate candidate 2-itemsets and n-itemsets. Next, the candidate 2-itemsets are generated by the mapper
using the frequent 1-itemsets. The count of each candidate 2-itemset is verified against the input data that is
provided to the mapper. The counts are then combined using the combiner and provided to the reducer, which sums
them to obtain the support values of the 2-itemsets. The same process is repeated for candidate n-itemsets until the
previous iteration yields no frequent itemset.
3) Phase 3: Association rule generation. Finally, after generating all the frequent n-itemsets, the association rules are
generated based on confidence values. The output of the previous phases contains each selected itemset and its
support count, and it is written to an output file. These support values are then used to calculate the confidence of
each candidate rule, and the rules that have 100% confidence are produced as the output rules.
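The three phases above can be sketched as a level-wise procedure in plain Python. This is a single-process illustration under stated assumptions, not the Hadoop implementation used in this paper: the function names, the toy transactions and the minimum support of 2 are hypothetical, and candidate k-itemsets are formed by simply joining pairs of frequent (k-1)-itemsets.

```python
from itertools import combinations


def count_support(candidates, transactions):
    """Count how many transactions contain each candidate itemset."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


def keep_frequent(counts, min_sup):
    """Keep only the itemsets whose support count reaches min_sup."""
    return {c: s for c, s in counts.items() if s >= min_sup}


def frequent_itemsets(transactions, min_sup):
    """Phases 1 and 2: grow frequent itemsets level by level."""
    transactions = [frozenset(t) for t in transactions]
    singles = {frozenset([i]) for t in transactions for i in t}
    level = keep_frequent(count_support(singles, transactions), min_sup)
    frequent, k = {}, 1
    while level:
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        level = keep_frequent(count_support(candidates, transactions), min_sup)
    return frequent


def generate_rules(frequent, min_conf):
    """Phase 3: rules X -> Y with conf = sup(X union Y) / sup(X)."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[lhs]
                if conf >= min_conf:
                    rules.append((set(lhs), set(itemset - lhs), conf))
    return rules


# Illustrative data only; min_conf=1.0 mirrors the 100% confidence criterion.
toy = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "C"}, {"B"}, {"B", "C"}]
freq = frequent_itemsets(toy, min_sup=2)
rules = generate_rules(freq, min_conf=1.0)
```

The lookup `frequent[lhs]` relies on the fact that every subset of a frequent itemset is itself frequent, so its support count was recorded at an earlier level.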
The overall association rule generation discussed above is implemented in the Hadoop 0.20.0 framework by
creating a single-node Hadoop environment [17]. The time in Hadoop is synchronized with the system time, and the time
values are measured in milliseconds using the time function in Hadoop. The data flow for two iterations of MapReduce in
Hadoop is shown in Fig. 1.

Fig. 1: Data flow showing two iterations of proposed method

First the dataset is read as input by the MapReduce code from HDFS storage, and each item is processed as a separate
key to calculate the frequent 1-itemsets, as in Fig. 1. Then the frequent 2-itemsets are generated using pairs of items from
the frequent 1-itemsets. This process is repeated for as many iterations as the number of itemset levels needed. Fig. 1 shows
the calculation up to 3-itemsets using MapReduce. The key used in the Mapper represents an n-itemset, where n is the number of
items used to form the key. The flow of the proposed MapReduce framework is shown in Fig. 2.

Fig. 2: Proposed MapReduce framework

During the MapReduce operation the input dataset or file is split into many sections in the Mapper phase, with each
Mapper output identified by a key. In ARM the key represents an item (or itemset) in the dataset, and the value is the number
of occurrences of that item in the dataset. Initially the count is set to 1 in the Mapper, and for each further occurrence this
count is incremented. Finally, in the Reducer the total number of occurrences is found by merging the counts, and the support
and confidence are calculated. The output file consists of the list of rules generated based on the support and confidence.
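The counting behaviour described above, with per-split counts that are merged in the Reducer, can be sketched as follows. This is a plain-Python illustration, not Hadoop code; the names `mapper_with_combiner` and `reducer`, the two toy splits and the minimum support of 3 are assumptions made for the example.

```python
from collections import Counter


def mapper_with_combiner(split):
    """Count each item once per transaction and combine counts locally."""
    local = Counter()
    for transaction in split:
        for item in set(transaction):
            local[item] += 1
    return local


def reducer(partials, min_sup):
    """Merge per-split counts and keep items that meet the minimum support."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return {item: n for item, n in total.items() if n >= min_sup}


# Two illustrative input splits, each holding two transactions.
splits = [
    [["A", "B"], ["A", "C"]],
    [["A", "B", "C"], ["B"]],
]
frequent_1 = reducer([mapper_with_combiner(s) for s in splits], min_sup=3)
print(frequent_1)
```

Combining counts inside each split before the merge reduces the number of <key, value> pairs that have to be moved to the Reducer.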

IV. EXPERIMENTATION AND RESULTS


A. Dataset Description
The proposed approach for association rule mining is applied to the KDD CUP 99 data, and the simulation details are presented
here. The KDD CUP 99 input dataset consists of records from four categories of attacks: Denial of Service, user-to-root,
probing and remote-to-local. The dataset contains both labeled and unlabeled records, in which each labeled record consists
of 41 attributes and one target attribute. The attributes fall into three groups: basic, content-based and time-based values.
Not all the values are binary. The full training set consists of almost 5 million instances. The training set and test set
are described below:
Training Set: Contains 494,021 connections or records with a total of 22 attack types (this count corresponds to the 10% subset of the full training data).
Test Set: Contains 311,029 connections or records, with 17 new attack types not present in the training data.
No.  Value                 No.  Value
1    duration              22   is_guest_login
2    protocol_type         23   count
3    service               24   srv_count
4    flag                  25   serror_rate
5    src_bytes             26   srv_serror_rate
6    dst_bytes             27   rerror_rate
7    land                  28   srv_rerror_rate
8    wrong_fragment        29   same_srv_rate
9    urgent                30   diff_srv_rate
10   hot                   31   srv_diff_host_rate
11   num_failed_logins     32   dst_host_count
12   logged_in             33   dst_host_srv_count
13   num_compromised       34   dst_host_same_srv_rate
14   root_shell            35   dst_host_diff_srv_rate
15   su_attempted          36   dst_host_same_src_port_rate
16   num_root              37   dst_host_srv_diff_host_rate
17   num_file_creation     38   dst_host_serror_rate
18   num_shells            39   dst_host_srv_serror_rate
19   num_access_files      40   dst_host_rerror_rate
20   num_outbound_cmds     41   dst_host_srv_rerror_rate
21   is_host_login
Table 1: Features of the input dataset

The 41 features of the KDD CUP 99 dataset are shown in Table 1, and Fig. 3 shows sample values of the dataset.
The values of features 1 to 41 are separated by commas in the dataset shown in Fig. 3. That is, each instance or row of the
dataset consists of 42 attributes, 41 feature attributes and one class attribute, all separated by commas as in the figure.
Each row is split on the commas to read the attributes separately.
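A minimal sketch of this row-splitting step, and of turning the fields into key-value items, is given below. The feature names follow Table 1; the function name `row_to_items`, the `feature=value` key format and the sample record are assumptions made for illustration, and the record's values are chosen for this sketch rather than copied from Fig. 3.

```python
FEATURES = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins",
    "logged_in", "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creation", "num_shells", "num_access_files",
    "num_outbound_cmds", "is_host_login", "is_guest_login", "count",
    "srv_count", "serror_rate", "srv_serror_rate", "rerror_rate",
    "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
    "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate",
]


def row_to_items(line):
    """Split one comma-separated record into (feature, value) items and a label."""
    fields = line.strip().split(",")
    if len(fields) != len(FEATURES) + 1:
        raise ValueError("expected 41 feature values and one class attribute")
    *values, label = fields
    return list(zip(FEATURES, values)), label


# Illustrative record in the 42-field layout (values chosen for this sketch).
sample_values = (
    ["0", "tcp", "http", "SF", "181", "5450"] + ["0"] * 5 + ["1"]
    + ["0"] * 10 + ["8", "8"] + ["0.00"] * 4
    + ["1.00", "0.00", "0.00", "9", "9", "1.00", "0.00", "0.11", "0.00"]
    + ["0.00"] * 4 + ["normal."]
)
sample = ",".join(sample_values)

items, label = row_to_items(sample)
# One possible key format for the <key, value> pairs: "feature=value".
keys = [f"{name}={value}" for name, value in items]
print(keys[:4], label)
```

Because each key keeps the original attribute value, continuous and categorical attributes are handled the same way, without a separate binomial encoding step.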

Fig. 3: KDD CUP 99 dataset sample values

B. Results and Discussion


The input dataset is split into many tasks by the Map and Reduce phases in the Hadoop environment during execution. The
input data is sent to the mapper, which splits the instances of the data into <key, value> pairs. The pairs are shuffled and
sorted before they are sent to the reducer. The final result is obtained by reducing the <key, value> pairs, calculating
support and confidence and then selecting the rules based on them. From these rules it is possible to identify whether the
user of a specific instance or attack is a guest login or a host login. The values of support and confidence obtained during
the 4 levels of MapReduce operations are shown in Fig. 4.

Fig. 4: Support and Confidence

The execution of the MapReduce phase [18] in Hadoop and the final results of the reducer phase are shown in
Fig. 5 and Fig. 6 respectively. Fig. 5 shows the execution of the Reducer phase while the output file is being generated,
together with the final statistics of the MapReduce job. The generated output file is shown in Fig. 6.

Fig. 5: Mapper and Reducer execution

Fig. 6: Final output

The final output in Fig. 6 lists all the frequent itemsets that are generated, along with their support and
confidence values. The output format is <itemset, support, confidence>, and it is generated for all
possible combinations of itemsets for the given input attributes. In this case the 2-itemsets are generated.

V. CONCLUSION AND FUTURE WORK


Association rule mining can be done effectively in distributed systems that support parallel execution, such as the
Hadoop environment, because it can then be scaled up to large volumes of data with less execution time and cost and
with good accuracy. The proposed algorithm also takes the type of the input data into account and can be applied to any
data format. By dividing the input data into many splits and processing them on many nodes, the execution is made
easier. Management issues such as data transfer between the nodes, storage of data and failure of a node within the
cluster are all handled by Hadoop automatically, which makes Hadoop-based processing efficient in terms of scalability
and robustness. The proposed association rule mining algorithm inherits these properties. Also, by making use of the
key-value pair approach, the processing is much easier than with the existing binomial approach. However, the
proposed algorithm is still not the best in performance when it comes to really large datasets. In the future, fuzzy-based
association rule mining can therefore be done in Hadoop to handle data larger than that used in this paper. Further, the
input data can be classified based on the calculated support and confidence values by using a suitable classification
algorithm. This work can also be extended to perform feature selection using information gain or mutual
information [19] before implementing ARM.

REFERENCES
[1] Ashrafi, M.Z., Taniar, D., Smith, K., ODAM: An Optimized Distributed Association Rule Mining Algorithm, Distributed
Systems Online, IEEE, Volume 5, Issue 3, 2004.
[2] R. Agrawal, R. Srikant, Fast Algorithms for Mining Association Rules, In Proceedings of International Conference on
Very Large Data Bases, pp. 487-499, Santiago, Chile, September 1994.
[3] Jong Soo Park, Ming-Syan Chen, Philip S. Yu, An Effective Hash-based Algorithm for Mining Association Rules, In
Proceedings of the ACM SIGMOD International Conference on Management of Data, Michael Carey and Donovan
Schneider, ACM, 1995.
[4] Ozel, S.A., Guvenir, H.A., An Algorithm for Mining Association Rules using Perfect Hashing and Database Pruning, 10th
Turkish Symposium on Artificial Intelligence and Neural Networks, Gazimagusa, Springer, pp. 257-264, 2001.
[5] Karam Gouda, Mohammed Javeed Zaki, Efficiently Mining Maximal Frequent Itemsets, In Proceedings of the IEEE
International Conference on Data Mining, pp. 163-170, November 29 - December 02, 2001.
[6] J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, ACM SIGMOD International
Conference, Dallas, 2000.
[7] D.W. Cheung, Jiawei Han, V.T. Ng, A.W. Fu, Yongjian Fu, A Fast Distributed Algorithm for Mining Association Rules, In
Proceedings of International Conference on Parallel and Distributed Information Systems, IEEE CS Press, 1996.
[8] Ansari E, Dastghaibifard G, Keshtkaran M, Kaabi H, Distributed Frequent Itemset Mining using Trie Data Structure,
International Journal of Computer Science, Volume 35, Issue 3, pp. 337-381, 2008.
[9] Park, J.S., Chen, M.S., Yu, P.S., Efficient Parallel Data Mining for Association Rules, In Proceedings of the Fourth
International Conference on Information and Knowledge Management, pp. 31-33, 1995.
[10] Woo, J., Xu, Y., Market Basket Analysis Algorithm with Map/Reduce of Cloud Computing, In Proceedings of the
International Conference on Parallel and Distributed Processing Techniques and Applications, 2001.
[11] Lin, Ming-Yen, Pei-Yu Lee, Sue-Chen Hsueh, Apriori-based Frequent Itemset Mining Algorithms on MapReduce, In
Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication, ACM, 2012.
[12] Peddi Kishor, Sammulal Porika, Literature Survey on Association Rule Discovery in Data Mining, International Journal of
Computer Science and Management Research, Volume 2, Issue 1, January 2013.
[13] Zhang C.S., Li Z.Y., Zheng D.S., An Improved Algorithm for Apriori, In Proceedings of the 1st International Workshop on
Education Technology and Computer Science, Volume 1, pp. 995-998, 2009.
[14] C. Jin, C. Vecchiola, R. Buyya, MRPGA: An Extension of MapReduce for Parallelizing Genetic Algorithms, Fourth IEEE
International Conference on eScience, pp. 214-221, 2008.
[15] T. Elsayed, J. Lin, Douglas W. Oard, Pairwise Document Similarity in Large Collections with MapReduce, In Proceedings
of 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009.
[16] J.H.C. Yeung, C.C. Tsang, K.H. Tsoi, B. Kwan, C. Cheung, A.P.C. Chan, P.H.W. Leong, Map-reduce as a Programming
Model for Custom Computing Machines, In Proceedings of the 16th IEEE Symposium on Field-Programmable Custom
Computing Machines, pp. 149-159, 2008.
[17] M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, I. Stoica, Improving MapReduce Performance in Heterogeneous
Environments, EECS Department, University of California, Berkeley, Technical Report Number UCB/EECS-2008-99,
August 19, 2008.

[18] Mohammadhossein Barkhordari, Mahdi Niamanesh, ScadiBino: An Effective MapReduce-based Association Rule Mining
Method, ACM 16th International Conference on Electronic Commerce, August 2014.
[19] P. Ganesh Kumar, D. Devaraj, Intrusion Detection using Artificial Neural Network with Reduced Input Features,
International Journal on Soft Computing, ICTACT, Issue 1, pp. 30-36, July 2010.
