1 STUDENT NAME
Sonal
2 DISSERTATION TITLE
1. PARALLEL MINING MECHANISM OF FREQUENT ITEMSETS USING MAPREDUCE TECHNIQUE.
3 DISSERTATION DOMAIN
BIG DATA
4 KEYWORDS
1. Frequent itemsets
2. Hadoop cluster
3. Load balance
4. MapReduce
5. Energy efficiency
SYNOPSIS
5 Problem Statement
We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on
the cluster is sensitive to data distribution and dimensions, because itemsets with
different lengths have different decomposition and construction costs. To improve
FiDoop's performance, we develop a workload balance metric to measure load
balance across the cluster's computing nodes.
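One simple way to express such a metric is the coefficient of variation of the per-node loads. The sketch below is a hypothetical illustration under that assumption, not the metric actually used in FiDoop; the function name and inputs are invented for the example.

```python
# Hypothetical workload balance metric: coefficient of variation of
# per-node loads (0.0 = perfectly balanced, larger = more skew).
# This is an assumed formulation, not FiDoop's actual metric.
from statistics import mean, pstdev

def balance_metric(node_loads):
    """Return the coefficient of variation of per-node load."""
    avg = mean(node_loads)
    if avg == 0:
        return 0.0
    return pstdev(node_loads) / avg

# Example: three cluster nodes with decomposition/construction costs.
print(balance_metric([100, 100, 100]))  # 0.0 -> perfectly balanced
print(balance_metric([50, 100, 150]))   # ~0.41 -> noticeable skew
```

A scheduler could use such a score to decide when to redistribute itemsets of different lengths across nodes.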
6 Abstract
Existing parallel mining algorithms for frequent itemsets rely on multiple
mechanisms (e.g., automatic parallelization, load balancing, data distribution,
and fault tolerance on large clusters). To address this problem, we design a new
parallel frequent-itemset mining algorithm called FiDoop using the MapReduce
programming model. Rather than conventional FP-trees, FiDoop incorporates the
frequent-items ultrametric tree, which reduces storage and avoids building
conditional pattern bases. In FiDoop, three MapReduce jobs are implemented to
complete the mining task. In the third MapReduce job, the mappers decompose
itemsets independently and the reducers construct small ultrametric trees, which
are then mined separately. In this paper, we implement FiDoop on our in-house
Hadoop cluster. We show that FiDoop on the cluster is sensitive to data
distribution and dimensions, because itemsets with different lengths have
different decomposition and construction costs. To improve FiDoop's performance,
we develop a workload balance metric to measure load balance across the
cluster's computing nodes, and on top of it FiDoop-HD, an extension of FiDoop
that speeds up mining performance for high-dimensional data analysis. Extensive
experiments using real-world celestial spectral data demonstrate that our
proposed solution is efficient and scalable. In our proposed scheme, we will add
various approaches to improve the energy efficiency of FiDoop running on Hadoop
clusters.
Similar data are clustered using a clustering algorithm. The final data set is
stored in HDFS (Hadoop Distributed File System) using the MapReduce technique.
The performance after redundancy removal is evaluated.
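To make the map/reduce division of labor concrete, here is a minimal single-machine sketch of the counting step that underlies MapReduce-based itemset mining. It simulates the map, shuffle, and reduce phases in plain Python; FiDoop's actual jobs run on Hadoop and include ultrametric-tree decomposition, which is omitted here.

```python
# Single-machine simulation of one MapReduce counting round for
# k-itemsets. This sketches the general MapReduce idea only, not
# FiDoop's three-job pipeline.
from collections import defaultdict
from itertools import combinations

def map_phase(transaction, k):
    """Emit (itemset, 1) for every k-itemset in one transaction."""
    for itemset in combinations(sorted(transaction), k):
        yield itemset, 1

def shuffle(pairs):
    """Group emitted pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, min_support):
    """Sum each itemset's counts and keep those meeting min_support."""
    return {k: sum(v) for k, v in groups.items() if sum(v) >= min_support}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}]
pairs = [p for t in transactions for p in map_phase(t, 2)]
frequent = reduce_phase(shuffle(pairs), min_support=2)
print(frequent)  # {('a', 'b'): 2, ('a', 'c'): 2}
```

In a real Hadoop job, `map_phase` and `reduce_phase` would be Mapper and Reducer tasks spread across the cluster's computing nodes, with the framework performing the shuffle.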
8 INTRODUCTION
BACKGROUND:
An elementary prerequisite for mining association rules is mining frequent
itemsets. Numerous algorithms exist for frequent-itemset mining; Apriori and
FP-Growth are the traditional methods. Apriori is an algorithm for
frequent-itemset mining and association-rule learning over transactional
databases. It proceeds by identifying the frequent individual items in the
database and extending them to larger itemsets, provided those itemsets appear
sufficiently often in the database. It follows a candidate generation-and-test
approach. FP-Growth was introduced to overcome the cost of candidate generation.
FP-Growth finds frequent itemsets by representing the transaction database as a
prefix tree enhanced with links that organize the nodes into lists referring to
the same item. The search is carried out by projecting the prefix tree, working
recursively on the result, and pruning the original tree. The implementation
also supports filtering for closed and maximal itemsets with conditional itemset
repositories, although that approach differs in that it uses top-down prefix
trees rather than FP-trees. FP-Growth condenses a large database into a compact
frequent-pattern tree (FP-tree) structure that is highly compressed yet complete
for frequent-pattern mining, avoiding costly repeated database scans. It is an
efficient FP-tree-based frequent-pattern mining method with a
divide-and-conquer methodology that decomposes mining tasks into smaller ones
and avoids candidate generation. A disadvantage of such algorithms is that the
TID sets can become too long, taking considerable memory space as well as
computation time for intersecting the long sets; incremental data mining is also
not supported.
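The candidate generation-and-test loop described above for Apriori can be sketched compactly. This is a didactic implementation of the classic algorithm; real implementations add subset-based pruning and careful I/O handling.

```python
# Compact Apriori sketch: generate candidate (k+1)-itemsets from the
# frequent k-itemsets, then test each candidate against the database.
from itertools import combinations

def apriori(transactions, min_support):
    items = {i for t in transactions for i in t}
    frequent, k = {}, 1
    current = [frozenset([i]) for i in sorted(items)]
    while current:
        # Test step: count each candidate's support in the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Generation step: join frequent k-itemsets into (k+1)-itemsets.
        keys = sorted(survivors, key=sorted)
        current = sorted({a | b for a, b in combinations(keys, 2)
                          if len(a | b) == k + 1}, key=sorted)
        k += 1
    return frequent

txns = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = apriori(txns, min_support=2)
# All six itemsets of size 1 and 2 are frequent; {a, b, c} is not.
```

The loop makes the algorithm's weakness visible: every pass re-scans the whole transaction list, which is exactly the cost FP-Growth's single compressed tree avoids.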
MOTIVATION:
FREQUENT itemset mining (FIM) is a core problem in association rule mining
(ARM), clustering, and related tasks. Accelerating FIM is critical and
essential, because FIM accounts for a large share of mining time due to its high
computation and input/output (I/O) intensity. When the datasets in modern data
mining applications become too large, sequential FIM algorithms running on a
single machine suffer from performance degradation. To address this issue, we
investigate how to perform FIM using MapReduce, a widely adopted programming
model for processing big datasets, by exploiting the parallelism among the
computing nodes of a cluster. We show how to distribute a large dataset over the
cluster so as to balance load across all cluster nodes, thereby improving the
performance of parallel FIM. Big data generally means datasets with sizes beyond
the ability of commonly used software tools to capture, manage, and process
within a reasonable elapsed time. Its size is a constantly moving target: as of
2012 it ranged from a few dozen terabytes to many petabytes of data, processed
by massively parallel software running on tens, hundreds, or even thousands of
servers.
Set Theory:
S = {s, e, X, Y}
Where:
s = Start of the program.
    1. Login with the web page.
    2. Load files/data onto the cloud or HDFS.
e = End of the program.
    Retrieve the file/data from the cloud storage system or HDFS.
X = Input of the program.
    The input should be a file/data.
Y = Output of the program.
    Finally, when we request a file download, we get the file as output along
    with the list of frequent itemsets and their counts.
X, Y ⊆ U
Let U be the set of the system.
U = {Client, S, M, H, R, C}
Where Client, S, M, H, R, C are the elements of the set:
Client = Data owner, user.
S = Splitting the file.
M = Mapping the file contents.
H = Content-wise shuffling of the file data.
R = Reducing the file contents.
C = Counting the frequent itemsets.
SPACE COMPLEXITY:
The space complexity depends on the presentation and visualization of
discovered patterns. The more data stored, the higher the space complexity.
TIME COMPLEXITY:
Let n be the number of patterns available in the datasets. If n > 1, then
retrieving the information can be time consuming, so the time complexity of
this algorithm is O(n^n).
Failure and success conditions:
Failures:
1. A huge database can lead to more time consumption to get the information.
2. Hardware failure.
3. Software failure.
Success:
1. The required information is found in the available datasets.
2. The user gets results very fast according to their needs.
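The pipeline defined by the set U above (S: split, M: map, H: shuffle, R: reduce, C: count) can be illustrated end to end. The function names and the whitespace-separated file format below are assumptions made for the sketch, not part of the actual system.

```python
# Illustrative sketch of the set-model pipeline U = {Client, S, M, H, R, C}.
# File format (one transaction per line, space-separated items) is assumed.
from collections import defaultdict

def split(text):                      # S: split the file into transactions
    return [line.split() for line in text.strip().splitlines()]

def map_items(transactions):          # M: map each item to a (item, 1) pair
    return [(item, 1) for t in transactions for item in t]

def shuffle(pairs):                   # H: content-wise shuffle (group by item)
    groups = defaultdict(list)
    for item, one in pairs:
        groups[item].append(one)
    return groups

def reduce_counts(groups):            # R: reduce each group to a total
    return {item: sum(ones) for item, ones in groups.items()}

def frequent(counts, min_support):    # C: keep items meeting min_support
    return {item: n for item, n in counts.items() if n >= min_support}

data = "milk bread\nmilk butter\nbread butter milk"
out = frequent(reduce_counts(shuffle(map_items(split(data)))), min_support=2)
print(out)  # {'milk': 3, 'bread': 2, 'butter': 2}
```

Here X is the text loaded from HDFS and Y is the frequent-itemset list with counts, matching the input and output sets of the model.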
10 LITERATURE SURVEY
11 REFERENCES
[1] M. J. Zaki, "Parallel and distributed association mining: A survey," IEEE
Concurrency, vol. 7, no. 4, pp. 14-25, Oct./Dec. 1999.
[4] J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without
candidate generation: A frequent-pattern tree approach," Data Min. Knowl.
Disc., vol. 8, no. 1, pp. 53-87, 2004.
[6] R. Agrawal and J. C. Shafer, "Parallel mining of association rules," IEEE
Trans. Knowl. Data Eng., vol. 8, no. 6, pp. 962-969, Dec. 1996.
[8] M.-Y. Lin, P.-Y. Lee, and S.-C. Hsueh, "Apriori-based frequent itemset
mining algorithms on MapReduce," in Proc. 6th Int. Conf. Ubiquit. Inf. Manage.
Commun. (ICUIMC), Danang, Vietnam, 2012, pp. 76:1-76:8. [Online]. Available:
http://doi.acm.org/10.1145/2184751.2184842
[9] D. Chen et al., "Tree partition based parallel frequent pattern mining on
shared memory systems," in Proc. 20th IEEE Int. Parallel Distrib. Process.
Symp. (IPDPS), Rhodes Island, Greece, 2006, pp. 1-8.
[10] L. Liu, E. Li, Y. Zhang, and Z. Tang, "Optimization of frequent itemset
mining on multiple-core processor," in Proc. 33rd Int. Conf. Very Large Data
Bases, Vienna, Austria, 2007, pp. 1275-1285.
[11] A. Javed and A. Khokhar, "Frequent pattern mining on message passing
multiprocessor systems," Distrib. Parallel Databases, vol. 16, no. 3, pp.
321-334, 2004.
[13] Y.-J. Tsay, T.-J. Hsu, and J.-R. Yu, "FIUT: A new method for mining
frequent itemsets," Inf. Sci., vol. 179, no. 11, pp. 1724-1737, 2009.
[15] J. Dean and S. Ghemawat, "MapReduce: A flexible data processing tool,"
Commun. ACM, vol. 53, no. 1, pp. 72-77, Jan. 2010.
[17] D. W. Cheung, S. D. Lee, and Y. Xiao, "Effect of data skewness and
workload balance in parallel data mining," IEEE Trans. Knowl. Data Eng., vol.
14, no. 3, pp. 498-514, May/Jun. 2002.
[18] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, "PFP: Parallel
FP-growth for query recommendation," in Proc. ACM Conf. Recommend. Syst.,
Lausanne, Switzerland, 2008, pp. 107-114.
[20] J. S. Park, M.-S. Chen, and P. S. Yu, "Using a hash-based method with
transaction trimming for mining association rules," IEEE Trans. Knowl. Data
Eng., vol. 9, no. 5, pp. 813-825, Sep./Oct. 1997.