978-1-4799-3219-1/13 $31.00
978-0-7695-5134-0/13 $26.00 © 2013 IEEE
DOI 10.1109/WISA.2013.62
is required to start Hadoop with the start-all.sh command.

Hadoop is widely used nowadays. For example, Baidu mines web data and analyses search logs with it; Taobao uses it for storing e-commerce transactions; Yahoo processes more than 5 PB of web pages by operating more than 10,000 Hadoop virtual machines on 2,000 nodes.

[Figure: HDFS architecture. The client sends file access requests to the NameNode, which keeps the metadata (file storage locations) and exchanges control and status commands with the DataNodes; the DataNodes store the file data blocks and their replicated copies.]

C. Definition 3: HBase Storage System

HBase is the database developed for Hadoop: an open-source, column-oriented, distributed, sparse, sorted, multi-dimensional database, primarily used to manage real-time reads and writes and random access to very large data tables. Similar to starting Hadoop, an independent HBase instance is started with the start-hbase.sh command (HBase uses the /tmp/hbase-$USERID directory by default), and the HBase shell, launched with the hbase shell command, is used to manage the HBase instance. HBase depends on the distributed lock service ZooKeeper, which is mainly used to store the starting position of HBase data; ZooKeeper finds the server holding an HRegion and terminates after the task is finished.
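The storage model sketched above (sparse, sorted, column-oriented tables with versioned cells) can be mimicked in a few lines of Python. This is a toy in-memory stand-in of our own devising, not the real HBase API; a real deployment talks to HBase through the shell or a client library.

```python
import time

class ToyHBaseTable:
    """Toy in-memory stand-in for an HBase table (illustration only):
    rows are scanned in sorted row-key order, cells are sparse
    "family:qualifier" columns, and each put adds a new timestamped
    version, the most recent of which wins on read."""

    def __init__(self):
        # row key -> {"family:qualifier": [(timestamp, value), ...]}
        self.rows = {}

    def put(self, row_key, column, value):
        versions = self.rows.setdefault(row_key, {}).setdefault(column, [])
        versions.insert(0, (time.time(), value))  # newest version first

    def get(self, row_key, column):
        versions = self.rows.get(row_key, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self):
        for key in sorted(self.rows):  # HBase keeps rows sorted by key
            yield key, self.rows[key]

table = ToyHBaseTable()
table.put("t002", "items:book", "1")
table.put("t001", "items:book", "0")
table.put("t001", "items:book", "1")  # newer version shadows the old one
print([k for k, _ in table.scan()])   # row keys come back sorted
print(table.get("t001", "items:book"))
```

Real HBase additionally distributes regions of the sorted key space (HRegions) across region servers, which is what ZooKeeper helps coordinate.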
…based on user-defined support. Conduct a coded representation of the processed items according to the transaction records. Set the transaction set T = {t1, t2, …, tm} and the item set I = {i1, i2, …, in}; for any given transaction database D, define f: D → rij, f(D) = rij, where rij is:

rij = 1, if Ij ∈ Ti
rij = 0, if Ij ∉ Ti

in which i = 1, 2, …, m and j = 1, 2, …, n.
Treat the subset of features corresponding to the item set as a relational database named DB', composed of tuples (TID, itemset). A sample corresponds to a record in DB', and the components of each sample constitute the corresponding attributes in DB'. With m transaction records and n items, we can obtain the following database by scanning once:

TID   I1    I2    …   In
T1    r11   r12   …   r1n
T2    r21   r22   …   r2n
…     …     …     …   …
Tm    rm1   rm2   …   rmn

rij equals 1 or 0, meaning that the i-th transaction contains or does not contain the j-th item, respectively.

Algorithm 1 Code the processed items
TEMP = Count_User_Same_Data()
// First obtain the number of the same data
if (TEMP < minsupport)
    Delete_This_Columns();
else
    Make_Code_Columns();
// Judge the number of identical data items:
// if below the minimum support, directly delete this column of data;
// otherwise, encode the data column

The above method has good parallelism and scalability, and it overcomes the shortcoming of the Apriori algorithm that the database must be scanned many times.
3) Map operations: Divide the encoded database into M sections (subsets of the data), where M depends on the number of nodes in the Hadoop platform. Map scans each inputted purchase history and then cuts and divides it by columns on each node. Let Max be the largest column number, so the column range is k (1 ≤ k ≤ Max).

Algorithm 2 Map operations
Map(Chart, Ik-1)
// Chart is the transaction identifier
// Ik-1 is the corresponding value (Ik-1 = 0, 1)
{
    Scan(Hb);
    for each k in 1 to Max
        Dividedinto_Every_Col(Chart, Ik-1);
}
// In the map function, scan the database first,
// then divide the line items of the records from the first to the last

4) Reduce operations: Reduce obtains each column of the item data from Map and then performs the "and" operation among the columns of the data subsets. The Reduce function counts the number of 1s to determine whether it is greater than or equal to the minimum support, repeating until the pending item set is empty. Let Max_col be the maximum column number; the column variable is p (1 ≤ p ≤ Max_col) and the row variable is q (1 ≤ q ≤ Max_row). Also, let N denote the number of nodes.

Algorithm 3 Reduce operations
Reduce(Sign, p)
// Sign is the mark; Ip-1 is the corresponding value
for each t in 1 to N
    for each p in 1 to Max_col
        for each q in 1 to Max_row
            Ip-1 = GET_map_Context(Sign, q);
            // Get the Map data
count = (I0 ∧ I1 ∧ … ∧ Ip-1);
numt = separate_1_num(count);
// Count the number of 1s after the "and" operation
for each t in 1 to N
    All_num += numt;
// Count the number of 1s across all nodes
if (All_num ≥ min_sup)
    return Lp-1;
    // The count of 1s reaches the given minimum support
else
    Delete_this_Item();
    // Delete the items which do not meet the requirement
Then L = L1 ∪ L2 ∪ L3 ∪ … ∪ Lp-1;

5) Calculate the degree of confidence, and ultimately obtain the association rules which meet the requirements.

IV. DESIGN OF BOOK SALES SYSTEM BASED ON CLOUD COMPUTING

The book sales system runs on the latest open-source Hadoop platform, which follows Google's MapReduce model. Fig. 4 shows its overall system architecture. Functions of each level are described below:
[Figure: overall architecture. An application service layer (book recommendation algorithm based on CMR-Apriori) submits tasks 1…Q and receives returned results; a data access layer (index: add, delete, query) sits over compute nodes 1…Q; the data storage layer holds a large number of data blocks in the open-source HBase database (with ZooKeeper) on HDFS storage nodes 1…Q.]

Figure 4. Architecture of the book sales system.

1) Application service layer: To retrieve and recommend content-based books according to users' purchase records. When the support degrees have been given by the known users, search for the strongly correlated books using the CMR-Apriori algorithm, and output the search results.
2) Data access layer: To support retrieving the information of the books above, including reading, storing, additions, deletions, and other related operations on the characteristic data of the book information.
3) Data caching layer: To pull the characteristic book-information data into the cache, reduce the load, and enhance the performance of data reading.
4) Parallel computing and scheduling layer: To perform the map and reduce processing on the large amount of data in the cluster.
5) Data storage layer: To read and store data quickly by combining the advantage of HBase's column stores, building the characteristic book-information data types on HDFS.

V. IMPLEMENTATION AND ANALYSIS OF CMR-APRIORI ALGORITHM

The system runs on a Hadoop-0.20.2 cluster of five machines: the NameNode (JobTracker) machine da21 and the DataNode (TaskTracker) machines da1, da2, da22, and da23. The operating system is the open-source version of Ubuntu; each machine has a dual-core CPU, 1 GB of memory, and a 250 GB hard drive. The system data sets come from simulated information generated from the Liaoning Normal University Library (about 100,000 transaction records).

Figure 5. The project on the book sales system.

As shown in Fig. 5, the system uses the Eclipse development environment, which hosts the Hadoop plug-in, and the HDFS file system is debugged through it. There are two project files. The "cloud" project includes the source of the cloud platform; its JRE System Library is the Java runtime library used to support the Java Virtual Machine, and hadoop-0.20.2-core.jar and HBase-0.90.2.jar are then imported. The "Engineering Liaoning Normal University Online Book sales system" project is the upper implementation of the book sales system, which mainly includes the HTTP protocol set of the web services, the use of Java resource files, and so on. The application interacts through JSP, running the algorithm code on the server named da21; the system is entered by accessing port 8080 of the da21 server. As shown in Fig. 6, the initialized data are displayed, and by setting the degrees of support we finally get the resulting frequent itemsets.

Figure 6. System functions.

In order to extend the practical application of this system, a client of the cloud book sales system based on the official Android 2.3 release has been added, as shown in Fig. 7. Users can log in to the cloud servers and purchase books whenever they want.
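Before turning to the performance evaluation, the coding step (Algorithm 1) and the Map phase (Algorithm 2) of CMR-Apriori can be simulated in plain Python. This is an illustrative in-memory sketch: the transactions, names, and node count are ours, not the paper's pseudocode identifiers, and no Hadoop cluster is involved.

```python
# Illustrative simulation of the coding step (Algorithm 1) and the
# Map phase (Algorithm 2) of CMR-Apriori, using in-memory lists.
# The transactions, thresholds, and node count below are made up.

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
items = sorted({i for t in transactions for i in t})  # item set I1..In
min_support = 3  # minimum support count

# Coding: r[i][j] = 1 iff the i-th transaction contains the j-th item
r = [[1 if item in t else 0 for item in items] for t in transactions]

# Algorithm 1: count the 1s per column and delete any column whose
# support falls below the minimum support; the rest are kept encoded
kept = [j for j in range(len(items)) if sum(row[j] for row in r) >= min_support]
items = [items[j] for j in kept]
r = [[row[j] for j in kept] for row in r]

# Algorithm 2 (Map): split the coded rows into M sections, one per
# node, and divide each section by columns, emitting (item, bits)
M = 2  # number of nodes
sections = [r[k::M] for k in range(M)]
map_output = [(item, [row[j] for row in section])
              for section in sections
              for j, item in enumerate(items)]

print(items)       # columns surviving the support filter
print(map_output)  # per-node bit columns handed to Reduce
```

With real data the sections would be Map tasks on separate DataNodes; here they are just list slices.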
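The Reduce phase (Algorithm 3) and the final confidence calculation can be sketched in the same spirit. The bit columns, thresholds, and helper names below are hypothetical examples of our own, not the system's data.

```python
from itertools import combinations

# Illustrative sketch of the Reduce phase (Algorithm 3) and the final
# confidence step of CMR-Apriori. All data here is hypothetical.
columns = {
    "a": [1, 1, 1, 0],  # one bit per transaction record
    "b": [1, 0, 1, 1],
    "c": [0, 1, 1, 0],
}
n_rows = 4
min_sup = 2     # minimum support count
min_conf = 0.6  # minimum confidence

def support(itemset):
    """Reduce: AND the bit columns of the itemset, then count the 1s."""
    return sum(1 for row in range(n_rows)
               if all(columns[i][row] for i in itemset))

# Keep every itemset whose ANDed 1-count reaches the minimum support
frequent = [set(c)
            for k in range(1, len(columns) + 1)
            for c in combinations(columns, k)
            if support(c) >= min_sup]

# Confidence of a rule X -> Y is support(X U Y) / support(X)
rules = []
for f in frequent:
    if len(f) < 2:
        continue
    for k in range(1, len(f)):
        for lhs in combinations(sorted(f), k):
            x = set(lhs)
            conf = support(f) / support(x)
            if conf >= min_conf:
                rules.append((x, f - x, conf))

print(sorted(map(sorted, frequent)))
print(rules)
```

Because each item is a bit vector, the support of any candidate itemset is a single AND-and-count pass, which is what lets the method avoid rescanning the raw database.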
Figure 7. The client of the cloud book sales system.

In order to demonstrate the efficiency and accuracy of the book sales system, two performance evaluations are provided as follows:
Evaluation 1: Compare the execution time of the original and the improved Apriori algorithms on the same numbers of transactions.

[Figure: running time (s, 0-30) versus transaction number (×1000, 1-12) for the CMR-Apriori algorithm, the Apriori algorithm with parallel processing, and the traditional Apriori algorithm.]

Figure 8. The comparison of three algorithms.

Evaluation 2: Observe the performance of the CMR-Apriori algorithm as the number of computation nodes increases from 1 to 5.

[Figure: running time (s, 0-40) versus the number of nodes (1-5).]

Figure 9. Comparison of running time with different numbers of nodes.

Fig. 8 presents the comparison of the execution time of the three algorithms (the CMR-Apriori algorithm, the Apriori algorithm with parallel processing, and the traditional Apriori algorithm) over several numbers of transaction records. The original Apriori algorithm clearly performs worst, and the parallel version brings only a slight improvement in efficiency, whereas the CMR-Apriori algorithm significantly outperforms the others for the same number of processed transaction records.
Furthermore, we can observe in Fig. 9 that the efficiency of parallel processing improves as the number of nodes in the cluster increases, while the slope of the curve becomes smaller and smaller. Hence, for the same number of transactions, the running time stabilizes once the number of nodes increases past a certain value.

VI. CONCLUSION

Hadoop, one of the most popular cloud computing platforms in recent years, is considered a hotspot in the IT field. This paper introduces some background knowledge of cloud computing and Hadoop, and then analyses the traditional mining algorithm Apriori in terms of the Map/Reduce programming framework and the open-source distributed database HBase. Further, we give the details of the proposed CMR-Apriori algorithm and apply it to a book recommendation service. Finally, we provide careful performance evaluations demonstrating that CMR-Apriori significantly outperforms the traditional Apriori association rule mining algorithm in the book recommendation service model, and is thus able to provide customers with a much more convenient and efficient personalized service. Nevertheless, some slight deficiencies still exist in our experiment, such as handling the NameNode's single point of failure and the NameNode memory ceiling, which will be the focus of our future study. As we are fully confident in the promising prospects of Hadoop applications, more effort should be directed toward extensively exploring existing resources to achieve its continuous improvement.

ACKNOWLEDGMENT

The authors would like to thank the Science and Technology Plan Projects of Liaoning Province (Grant No. 2012232001), the Science and Technology Plan Projects of Dalian (Grant No. 2013A16GX116), and the Natural Science Foundation of Liaoning Province (Grant No. 201202119).