Sie sind auf Seite 1von 14

SYNOPSIS

1 STUDENT NAME
Sonal

2 DISSERTATION TITLE
1. PARALLEL MINING MECHANISM OF FREQUENT ITEMSETS US-
ING MAPREDUCE TECHNIQUE.

2. PARALLEL MINING OF FREQUENT ITEMSETS USING MAPREDUCE


TECHNIQUE.

3. PARALLEL MINING OF FREQUENT ITEMSETS FOR BIG DATA.

3 DISSERTATION DOMAIN
BIG DATA

4 Technical Keywords (As per ACM Keywords)


Please note ACM Keywords can be found : http://www.acm.org/about/class/ccs98-
html
Example is given as

1. Frequent itemsets,

2. M2. Frequent Items Ultrametric Tree (FIU-tree)

3. Hadoop cluster

4. Load balance

5. MapReduce.

6. Energy efficiency
SYNOPSIS

5 Problem Statement
Implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on
the cluster is sensitive to data distribution and dimensions, because itemsets with
different lengths have different decomposition and construction costs. To improve
FiDoops performance, we develop a workload balance metric to measure load
balance across the clusters computing nodes.

6 Abstract
In Existing parallel mining algorithms for frequent itemsets multiple mechanism
are used (for eg. automatic parallelization, load balancing, data distribution, and
fault tolerance on large clusters). For solution to this problem, we design a new
parallel frequent itemsets mining algorithm called FiDoop using the MapReduce
programming model. FiDoop incorporates the frequent items ultrametric tree for
achieving reduces storage and avoids building conditional pattern bases, rather
than conventional FP trees. In FiDoop, we used three MapReduce Jobs are im-
plemented to complete the mining task. In third MapReduce job, mappers de-
compose itemsets independently and reducer constructing small ultrametric trees,
mining of these trees separately. In this paper, we implement FiDoop on our
in-house Hadoop cluster. We show that FiDoop on the cluster is sensitive to data
distribution and dimensions, because itemsets with different lengths have different
decomposition and construction costs. For improving FiDoops performance and
workload balance metric to measure load balance across the clusters computing
nodes, in this paper we develop FiDoop-HD. FiDoop-HD helps to speed up the
mining performance for high-dimensional data analysis. Extensive experiments
using real-world celestial spectral data demonstrate that our proposed solution is
efficient and scalable. In our proposed scheme, we will add various approaches to
improving energy efficiency of FiDoop running on Hadoop clusters.

7 Goals and Objectives


Subset selection to achieve fast retrieval.

Compressed storage and load balancing is achieved by calculating T-relevance.

Compute F-correlation and construct a Minimum Spanning Tree to improve


speed and accuracy.

Efficiency and effectiveness of fast clustering algorithm are evaluated.


SYNOPSIS

The similar datas are clustered using clustering algorithm. The final data
set is stored in HDFS (Hadoop Distributed File System) using map reduce
technique. The performance after redundancy removal is evaluated.

Improving energy efficiency of Fidoop running on Hadoop clusters.

8 INTRODUCTION
BACKGROUND:
An elementary necessity for mining for mining association rules is mining fre-
quent itemsets. Numerous algorithms exist for frequent itemset mining. Apriori
and FP-Growth are the traditional method. Apriori is an algorithm for frequent
item set mining and association rule learning over transactional databases. It pro-
ceeds by recognizing the frequent individual items in the database and widening
them to larger item sets providing those item sets appear adequately often in the
database. It works with Candidate Generation and Test Approach. Fp-Growth is
used to overcome the problem of candidate generatio. FP-growth is a program
to find frequent item sets with the FP-growth algorithm, which corresponds to the
transaction database as a prefix tree which is enhanced with links that organize the
nodes into lists referring to the same item. The search is carried out by prognostic
the prefix tree, working recursively on the result, and trimming the original tree.
The implementation also supports sifting for closed and maximal item sets with
conditional item set repositories, although the approach used in the program dif-
fers in as far as it used top-down prefix trees rather than FP-trees. FP-growth con-
dense a large database into a compact, Frequent-Pattern tree (FP-tree) structure
with highly reduced, but complete for frequent pattern mining and avoid costly
database scans. It develops an efficient, FP-tree-based frequent pattern mining
method with a divide-and-conquer methodology which decomposes mining tasks
into smaller ones and avoids candidate generation. The disadvantage of this algo-
rithm consists in the TID set being too long, taking considerable memory space as
well as computation time for intersecting the long sets. Incremental data mining
is not hold by this algorithm.

MOTIVATION:
FREQUENT itemsets mining (FIM) is a center issue in association rule mining
(ARM), grouping mining, and so forth. Accelerating the procedure of FIM is basic
and essential, on the grounds that FIM utilization represents a huge part of min-
ing time because of its high calculation and input/output (I/O) force. At the point
when datasets in present day information mining applications turn out to be too
much substantial, consecutive FIM calculations running on a solitary machine ex-
SYNOPSIS

perience the ill effects of execution decay. To address this issue, we research how
to perform FIM utilizing MapReducea generally embraced programming model
for preparing enormous datasets by misusing the parallelism among registering
hubs of a bunch. We demonstrate to appropriate an extensive dataset over the
group to adjust load over all bunch hubs, accordingly improving the execution of
parallel FIM. Big information for the most part incorporates information set with
sizes past the capacity of generally utilized programming devices to catch, oversee
and handle information inside a fair passed time. Its size is continually moving fo-
cus starting 2012 going from a couple of Dozen of terabyte to numerous petabytes
of information greatly parallel programming running on tens, hundreds, or even
a large number of servers.

9 Relevant mathematics associated with the Project


Mapping
SYNOPSIS

Set Theory:
S={s, e, X, Y,}
W here
s = Starto f theprogram.
1.Loginwithwebpage.
2.LoadFiles/DataoncloudorHDFS.
e = Endo f theprogram.
Retrievethe f ile/Data f romcloudstoragesystemorHDFS.
X = Inputo f theprogram.
InputshouldbeFile/Data.
Y = Out puto f theprogram.
Finallywhenwerequest f or f iledownloadingweget f ileasout putgetlisto f f requentitemsetsandcount.

X,Y U
LetUbetheSeto f System.
U = {Client, S, M, H, R,C}
W hereClient, S, M, H, R,C, aretheelementso f theset.
Client = DataOwner,User
S = Splittingthe f ile.
M = Mappingthe f ilecontents.
H = Contentwiseshu f f lingthe f iledata.
R = Reducingthe f ilecontents.
SYNOPSIS

C = Countthe f requentitemset.

SPACECOMPLEXITY :
T hespacecomplexitydependsonPresentationandvisualizationo f
discovered patterns.Morethestorageo f datamoreisthespacecomplexity.

T IMECOMPLEXITY
CheckNo.o f patternsavailableinthedatasets = nI f (n > 1)thenretrievingo f
in f ormationcanbetimeconsuming.Sothetimecomplexityo f thisalgorithmisO(nn ).

= FailuresandSuccessconditions.
Failures :
Hugedatabasecanleadtomoretimeconsumptiontogetthein f ormation.
Hardware f ailure.
So f tware f ailure.

Success :
Searchtherequiredin f ormation f romavailableinDatasets.
Usergetsresultvery f astaccordingtotheirneeds.

10 LITERATURE SURVEY
SYNOPSIS

SR.No Author and Title Proposed Scheme We have re-


ferred
1 R. Agrawal, T. In this proposed system efficient Algorithm for
Imielinski, and generation for large itemsets by mining associa-
A. Swami, Min- hash method (2) effective reduction tion rules in large
ing association on itemsets scan required by the di- databases.
rules between vision approach and (3) the option
sets of items in of reducing the number of database
large databases, scans required Our proposed hash
ACM SIGMOD and division-based techniques.
Rec., vol. 22, no.
2, pp. 207216,
1993.
2 M.-Y. Lin, P.- In this proposed system evaluated Apriori Map-
Y.Lee, and S.-C. offline on relatively small data size. reduce algorithm
Hsueh, Apriori- When confronting with larger data execution.
based frequent size, which is inevitable for to-
itemset mining days organization, most if not all
algorithms on algorithms performed not as effi-
MapReduce, in cient as required to meet the real
Proc. 6th Int. time big data driven decision mak-
Conf. Ubiquit. ing needs. We therefore attempt
Inf.Manage. to solve these efficiency problems
Commun. by proposing a VAMR (Vertical-
(ICUIMC), Apriori Map-reduce) algorithm.
Danang, Viet-
nam, 2012
3 L. Liu, E. Li, Y. In this proposed system present ef- How to optimize
Zhang, and Z. ficient multi-core and GPU accel- frequent item-
Tang, Optimiza- erated parallelizations of the Tree set mining on
tion of frequent Projection, one of the most compet- multiple-core
itemset mining itive FIM algorithms. The exper- processor.
on multiple-core imental results show that our Tree
processor, in Projection implementation scales
Proc. 33rd Int. almost linearly in a CPU shared-
Conf. Very Large memory environment after careful
Data Bases, optimizations.
Vienna, Austria,
2007
SYNOPSIS

SR.No Author and Title Proposed Scheme We have re-


ferred
4 JIAWEI HAN, In this study, we propose a novel How to create
JIAN PEI, YI- frequent-pattern tree (FP-tree) frequent-pattern
WEN YIN, structure, which is an extended tree (FP-tree)
RUNYING MAO prefix-tree structure for storing structure.
Mining Frequent compressed, crucial information
Patterns with- about frequent patterns, and de-
out Candidate velop an efficient FP-tree based
Generation: A mining method, FP-growth, for
Frequent-Pattern mining the complete set of fre-
Tree Approach quent patterns by pattern fragment
growth.
5 Assaf Schus- In this paper, we present set of Multiple al-
ter, Ran Wolff, new algorithms that solved the Dis- gorithms of
Communication tributed Association Rule mining Association Rule
Efficient Dis- Problem using far less communica- mining.
tributed Mining tion.
of Association
Rules
6 Jong Soo Park; In this paper, we develop an al- Flow of Efficient
Ming-Syan Chen gorithm, called PDM, to conduct Parallel Data
and Philip S. Yu, parallel data mining for association Mining using
Efficient Parallel rules. Consider a transaction as Association
Data Mining a collection of items, and a large Rules.
for Association itemset is a set of items such that the
Rules number of transactions containing
it exceeds a pre-specilied threshold.
PDM is so designed that the global
set of large itemsets can be identi-
fied efficiently and the amount of
inter-node data exchange required
is minimized.
SYNOPSIS

SR.No Author and Title Proposed Scheme We have re-


ferred
7 Wei Lu, Yanyan In this paper, we investigate how How to perform
Shen, Su Chen, to perform kNN join using MapRe- kNN join using
Beng Chin Ooi, duce which is a well-accepted MapReduce.
Efficient Pro- framework for data-intensive appli-
cessing of k cations over clusters of computers.
Nearest Neigh- The mappers cluster objects into
bor Joins using groups; the reducers perform the
MapReduce kNN join on each group of objects
separately.

8 Sandy Moens, In this paper, we investigate the FIM techniques


Emin Akse- applicability of FIM techniques on on the MapRe-
hirli and Bart the MapReduce platform. We in- duce platform.
Goethals, Fre- troduce two new methods for min-
quent Itemset ing large datasets: Dist-Eclat fo-
Mining for Big cuses on speed while BigFIM is
Data optimized to run on really large
datasets. In our experiments we
show the scalability of our methods.
SYNOPSIS

SR.No Author and Title Proposed Scheme We have re-


ferred
9 Shekhar Gupta, We propose efficient use of Hadoop How to increase
Christian Fritz, on heterogeneous clusters as well as the efficiency of
Johan de Kleer, on virtual/cloud infrastructure, both Hadoop on het-
and Cees Wit- of which violate the peer-similarity erogeneous and
teveen, Diagnos- assumption. To this end, we have virtual clusters.
ing Heteroge- implemented and here present pre-
neous Hadoop liminary results of an approach for
Clusters automatically diagnosing the health
of nodes in the cluster, as well as the
resource requirements of incoming
MapReduce jobs. We show that
the approach can be used to iden-
tify abnormally performing cluster
nodes and to diagnose the kind of
fault occurring on the node in terms
of the system resource affected by
the fault (e.g., CPU contention, disk
I/O contention). We also describe
our future plans for using this ap-
proach to increase the efficiency of
Hadoop on heterogeneous and vir-
tual clusters, with or without faults.
10 Yi Yao, Jiayin We propose simple yet effective Idea about
Wang, Bo Sheng, schemes which use slot ratio be- what is Self-
Chiu C. Tan, tween map and reduce tasks as Adjusting Slot
Ningfang Mi, a tunable knob for reducing the Configurations
Self-Adjusting makespan of a given set. By and difference
Slot Config- leveraging the workload informa- between Ho-
urations for tion of recently completed jobs, our mogeneous and
Homogeneous schemes dynamically allocates re- Heterogeneous
and Heteroge- sources (or slots) to map and reduce Hadoop Clusters.
neous Hadoop tasks. We implemented the pre-
Clusters sented schemes in Hadoop V0.20.2
and evaluated them with represen-
tative MapReduce benchmarks at
Amazon EC2. The experimental
results demonstrate the effective-
ness and robustness of our schemes
under both simple workloads and
more complex mixed workloads.
SYNOPSIS

11 REFERENCES
[1] M. J. Zaki, Parallel and distributed association mining: A survey, IEEE Con-
currency, vol. 7, no. 4, pp. 1425, Oct./Dec. 1999.

[2] I. Pramudiono and M. Kitsuregawa, FP-tax: Tree structure based generalized


association rule mining, in Proc. 9th ACM SIGMOD Workshop Res. Issues Data
Min. Knowl. Disc., Paris, France, 2004, pp. 6063.

[3] R. Agrawal, T. Imielinski, and A. Swami, Mining association rules between


sets of items in large databases, ACM SIGMOD Rec., vol. 22, no. 2, pp. 207216,
1993.

[4] J. Han, J. Pei, Y. Yin, and R. Mao, Mining frequent patterns without candi-
date generation: A frequent-pattern tree approach, Data Min. Knowl. Disc., vol.
8, no. 1, pp. 5387, 2004.

[5] S. Kotsiantis and D. Kanellopoulos, Association rules mining: A recent overview,


GESTS Int. Trans. Comput. Sci. Eng., vol. 32, no. 1,pp. 7182, 2006.

[6] R. Agrawal and J. C. Shafer, Parallel mining of association rules, IEEE Trans.
Knowl. Data Eng., vol. 8, no. 6, pp. 962969, Dec. 1996.

[7] A. Schuster and R. Wolff, Communication-efficient distributed mining of as-


sociation rules, Data Min. Knowl. Disc., vol. 8, no. 2, pp. 171196, 2004.

[8] M.-Y. Lin, P.-Y.Lee, and S.-C. Hsueh, Apriori-based frequent itemset min-
ing algorithms on MapReduce, in Proc. 6th Int. Conf. Ubiquit. Inf.Manage.
Commun. (ICUIMC), Danang, Vietnam, 2012, pp. 76:176:8. [Online]. Avail-
able: http://doi.acm.org/10.1145/2184751.2184842

[9] D. Chen et al., Tree partition based parallel frequent pattern mining on shared
memory systems, in Proc. 20th IEEE Int. Parallel Distrib.Process.Symp. (IPDPS),
Rhodes Island, Greece, 2006, pp. 18.

[10] L. Liu, E. Li, Y. Zhang, and Z. Tang, Optimization of frequent itemset min-
ing on multiple-core processor, in Proc. 33rd Int. Conf. Very Large Data Bases,
Vienna, Austria, 2007, pp. 12751285.

[11] A. Javed and A. Khokhar, Frequent pattern mining on message passing Multi
processor systems, Distrib. Parallel Databases, vol. 16, no. 3, pp. 321334, 2004.
SYNOPSIS

[12] J. Neerbek, Message-driven FP-growth, in Proc. WICSA/ECSA Compan.


Vol., Helsinki, Finland, 2012, pp. 2936.

[13] Y.-J. Tsay, T.-J. Hsu, and J.-R. Yu, FIUT: A new method for mining fre-
quent itemsets, Inf. Sci., vol. 179, no. 11, pp. 17241737, 2009.

[14] J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large


clusters, Commun. ACM, vol. 51, no. 1, pp. 107113, Jan. 2008.

[15] J. Dean and S. Ghemawat, MapReduce: A flexible data processing tool, Com-
mun. ACM, vol. 53, no. 1, pp. 7277, Jan. 2010.

[16] W. Lu, Y. Shen, S. Chen, and B. C. Ooi, Efficient processing of k near-


est neighbor joins using MapReduce, Proc. VLDB Endow., vol. 5, no. 10, pp.
10161027, 2012.

[17] D. W. Cheung, S. D. Lee, and Y. Xiao, Effect of data skewness and workload
balance in parallel data mining, IEEE Trans. Knowl. Data Eng., vol. 14, no. 3,
pp. 498514, May/Jun. 2002.

[18] H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, PFP: Parallel FP-
growth for query recommendation, in Proc. ACM Conf. Recommend. Syst.,
Lausanne, Switzerland, 2008, pp. 107114.

[19] L. Cristofor. (2001). Artool Project[J]. [Online].

[20] J. S. Park, M.-S. Chen, and P. S. Yu, Using a hash-based method with trans-
action trimming for mining association rules, IEEE Trans. Knowl. Data Eng., vol.
9, no. 5, pp. 813825, Sep./Oct. 1997.
SYNOPSIS

12 Plan of Project Execution

Schedule. Date Project Activity


July 1st Week 01/07/2016 Formation Of Project Group
2nd Week 08/07/2016 Project Topic Selection
3rd Week 15/07/2016 Formation Of Project Group
August 1st Week 02/01/2016 Presentation On Project Ideas
2nd Week 09/01/2016 Submission Of Literature Survey
3rd Week 16/01/2016 Feasibility Assessment
September 1st Week 02/09/2016 Mid Sem Presentation Group
2nd Week 16/09/2016 Design Of Mathematical Model
3rd Week 23/09/2016 End Sem Presentation.
October 1st Week 07/10/2016 Report Preparation And Submission
January 1st Week 02/01/2017 Preparation for ANEC conference
2nd Week 09/01/2017 Discussion and implementation of 2nd
module
3rd Week 16/01/2017 Discussion about modification to Im-
proved K-means
4th Week 23/01/2017 1st and 2nd module presentation
5th Week 30/01/2017 Discussion on flow of project and design-
ing new module
February 1st Week 06/02/2017 Modification of modules.
2nd Week 13/02/2017 Designed test cases for our module.
3rd Week 20/02/2017 Worked on user interface.
March 1st Week 06/03/2017 Integration of all modules.
2nd Week 20/03/2017 Final Report and presentation.

Table 1: Project Plan

13 Graphical Time Chart


SYNOPSIS

Das könnte Ihnen auch gefallen