
Yanxi Liu

School of Science, Changchun University, China 130022

liuyx.ccu@163.com

Abstract— With the rapid development of networks and information technology, the ever-growing flood of information has attracted more and more attention. Along with the pursuit of high-speed access to information, increasing emphasis is also placed on analyzing and mining the information and rules hidden deep in the data. Data mining technology organizes and analyzes data so that knowledge can be extracted and discovered from massive data sets; how to apply data mining technology to enterprise stock management is therefore the focus of this study. In this paper, drawing on data mining theory, the Apriori algorithm for association rule mining is described in detail, its implementation process is illustrated with an example, and improved versions of the algorithm are discussed.

Keywords- data mining; information extraction; Apriori algorithm

Once all frequent item sets have been found, the association rules can be generated directly. The core of the algorithm is as follows: the Apriori algorithm calls two sub-processes, Apriori-gen() and subset(). Apriori-gen() produces the candidates and then uses the Apriori property (all non-empty subsets of a frequent item set must also be frequent) to delete those candidates that have a non-frequent subset. Once all the candidates have been generated, the database D is scanned and, for each transaction, the subset() sub-procedure identifies all the candidates contained in it and accumulates a count for each of them. Finally, all candidates meeting the minimum support form the frequent item set L.
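The two sub-processes named above can be sketched as follows. This is a minimal illustrative sketch, not code from the paper; item sets are assumed to be represented as sorted tuples, and the function names apriori_gen and count_support are my own:

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets.

    Join step: merge pairs of (k-1)-itemsets that agree on their first
    k-2 items. Prune step (Apriori property): drop any candidate that
    has an infrequent (k-1)-subset.
    """
    prev = sorted(prev_frequent)          # itemsets as sorted tuples
    prev_set = set(prev)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2]:    # first k-2 items agree
                cand = tuple(sorted(set(a) | set(b)))
                # prune: every (k-1)-subset must itself be frequent
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

def count_support(candidates, transactions):
    """subset(): accumulate a support count for each candidate
    contained in each transaction."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        t = set(t)
        for c in candidates:
            if set(c) <= t:
                counts[c] += 1
    return counts
```

With the L2 of the worked example below, apriori_gen(L2, 3) yields exactly the pruned C3 = {{I1,I2,I3}, {I1,I2,I5}}.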


The implementation process of the Apriori algorithm is illustrated below by way of an example.

I. INTRODUCTION

Although modern computer and database technology have developed rapidly and can support the storage and fast retrieval of very large databases and data warehouses, these techniques merely gather "massive" amounts of data; they do not effectively organize and use the knowledge hidden within it, which has led to today's phenomenon of "rich data, poor knowledge". The emergence of data mining technology meets this need. The technology draws on artificial intelligence, machine learning, statistical analysis and other fields, and it brings decision analysis into a new stage. This paper mainly discusses the Apriori algorithm, an association rule mining algorithm commonly used in data mining.

For example, consider the transaction database shown in Table 1, which contains nine transactions, i.e. |D| = 9. Apriori assumes that the items within a transaction are stored in lexicographic order. The minimum support threshold is minsupport = 2/9 ≈ 22%, i.e. the minimum support count is 2.

TABLE 1. TRANSACTION DATABASE D

Tid   Item list
T1    I1, I2, I5
T2    I2, I4
T3    I2, I3
T4    I1, I2, I4
T5    I1, I3
T6    I2, I3
T7    I1, I3
T8    I1, I2, I3, I5
T9    I1, I2, I3

The Apriori algorithm is one of the most influential algorithms for mining the frequent item sets of Boolean association rules. It is a two-stage, frequency-based approach, and the design of the association rule mining algorithm can be decomposed into two sub-problems:

(1) Find all the item sets whose support is greater than the minimum support; these are called the frequent item sets.

(2) From the frequent item sets obtained above, generate all the association rules: for each frequent item set A, find every non-empty subset a of A, and if the ratio support(A)/support(a) ≥ minconfidence, generate the association rule a ⇒ (A − a). That is, from the frequent item sets obtained in the first step, extract the rules whose confidence is not less than the user-specified minimum confidence, minconfidence.

The realization process of the algorithm can be described as follows. First, the Apriori algorithm finds the frequent 1-itemsets L1; from L1 it generates the candidate 2-itemsets C2, scans the transaction database D, and computes the supports to obtain L2; and so on, generating Ck and scanning D to derive Lk, until no further frequent item sets are produced.

978-0-7695-3941-6/10 $26.00 © 2010 IEEE    DOI 10.1109/ICCMS.2010.398
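The first pass over Table 1 can be reproduced directly. The following is a minimal sketch (the variable names are my own), assuming the nine transactions above and the minimum support count 2:

```python
from collections import Counter

# Transaction database D from Table 1 (|D| = 9)
D = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
MIN_SUPPORT_COUNT = 2  # minsupport = 2/9 ≈ 22%

# One scan of D counts every item (the candidate 1-itemsets C1) ...
counts = Counter(item for t in D for item in t)
# ... and the items meeting the minimum support count form L1.
L1 = {item: n for item, n in counts.items() if n >= MIN_SUPPORT_COUNT}
print(L1)  # all five items are frequent: I1:6, I2:7, I3:6, I4:2, I5:2
```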

Figure 1. The generation process of frequent item sets (reconstructed from the original figure):

C1, with support counts from scanning D: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2. Comparing with the minimum support count 2, all five item sets are frequent, so L1 = {{I1}, {I2}, {I3}, {I4}, {I5}}.

C2 = {{I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}}, with support counts 4, 4, 1, 2, 4, 2, 2, 0, 1, 0 respectively; L2 = {{I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}}.

C3 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}} before pruning; after deleting the item sets with a subset not in L2 and counting supports, {I1,I2,I3}: 2 and {I1,I2,I5}: 2 remain, so L3 = {{I1,I2,I3}, {I1,I2,I5}}.

C4 = {{I1,I2,I3,I5}}, which is pruned, so the process stops.

C3 = L2 ⋈ L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}.

Item set {I1, I3, I5} has a 2-item subset {I3, I5} which is not an element of L2; thus it is not frequent, and {I1, I3, I5} is removed from C3. Similarly, {I2,I3,I4}, {I2,I3,I5} and {I2,I4,I5} are deleted. Thus C3 = {{I1, I2, I3}, {I1, I2, I5}}. For each item set in C3, the transaction database is scanned and its support count calculated; the count is then compared with the minimum support count 2 to determine whether the item set is frequent, yielding the frequent 3-itemsets L3.

5) Find the frequent 4-itemsets L4.

L3 is used to generate the collection C4 of candidate 4-itemsets: C4 = L3 ⋈ L3 = {{I1, I2, I3, I5}}. The 3-subset {I2, I3, I5} of item set {I1, I2, I3, I5} does not belong to L3, so {I1, I2, I3, I5} is removed. Therefore C4 = ∅, the algorithm terminates, and all the frequent item sets have been found.


Once the frequent item sets have been found from the transaction database, the next step is to generate association rules from them, that is, to produce strong association rules that meet the minimum support and minimum confidence. The confidence of an association rule can be calculated with the following formula, in which the conditional probability is expressed in terms of item set support counts.


confidence(X ⇒ Y) = support(X ∪ Y) / support(X) × 100%

where support(X ∪ Y) is the number of transactions that contain item set X ∪ Y, and support(X) is the number of transactions that contain item set X. Based on the above formula, the association rules are formed as follows:
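The formula above can be evaluated directly from support counts. A minimal sketch, assuming the Table 1 transactions (the function names are my own):

```python
def support_count(itemset, transactions):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

def confidence(X, Y, transactions):
    """confidence(X => Y) = support(X ∪ Y) / support(X)."""
    return (support_count(set(X) | set(Y), transactions)
            / support_count(X, transactions))

# Transaction database D from Table 1
D = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"],
     ["I1", "I3"], ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"],
     ["I1", "I2", "I3"]]

# Rule I1 ∧ I5 ⇒ I2: support({I1,I2,I5}) / support({I1,I5}) = 2/2 = 100%
print(confidence({"I1", "I5"}, {"I2"}, D))  # 1.0
```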

1) Scan the database to initialize the source data, and form the candidate 1-itemsets from all the items appearing in the database.

2) Find the frequent 1-itemsets L1. As the figure shows, the candidate 1-itemset collection C1 = {{I1}, {I2}, {I3}, {I4}, {I5}} consists of every single-item set. For each item set in C1, the transaction database is scanned and its support count calculated; the count is then compared with the minimum support count 2, and every item set meeting the minimum support count is added to the frequent 1-itemsets L1.

3) Find the frequent 2-itemsets L2. The frequent 1-itemsets L1 are joined to generate the collection C2 of candidate 2-itemsets:

C2 = L1 ⋈ L1 = {{I1,I2}, {I1,I3}, {I1,I4}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I3,I4}, {I3,I5}, {I4,I5}}

Since every 1-item subset of the 2-itemsets in C2 is frequent, nothing needs to be deleted. For each item set in C2, the transaction database is scanned and its support count calculated; the count is then compared with the minimum support count 2 to determine whether the item set is frequent, as in step 2, yielding the frequent 2-itemsets L2.

4) Find the frequent 3-itemsets L3. As in step 3, L2 is used to generate the collection C3 of candidate 3-itemsets.

1) For each frequent item set l, generate all the non-empty subsets of it.

2) For each non-empty proper subset s of l, if support(l) / support(s) ≥ minconfidence, then generate the rule s ⇒ (l − s), where minconfidence is the minimum confidence threshold.
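The two steps above can be sketched as a single routine. This is an illustrative sketch, not the paper's implementation; the function name generate_rules is my own, and the paper does not state a minimum confidence value, so any threshold passed in is an assumption:

```python
from itertools import combinations

def generate_rules(l, transactions, min_confidence):
    """For frequent itemset l, emit every rule s => (l - s) whose
    confidence support(l)/support(s) meets min_confidence."""
    def supp(items):
        # support count: transactions containing every item
        return sum(1 for t in transactions if set(items) <= set(t))
    rules = []
    l = tuple(sorted(l))
    for r in range(1, len(l)):            # every non-empty proper subset s
        for s in combinations(l, r):
            conf = supp(l) / supp(s)
            if conf >= min_confidence:
                rules.append((set(s), set(l) - set(s), conf))
    return rules
```

Applied to l = {I1, I2, I5} on the Table 1 database with an assumed threshold of 0.7, this keeps exactly the three 100%-confidence rules listed below.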

As the rules are generated directly from frequent item sets, every item set involved in the association rules already meets the minimum support threshold.

With the above data as an example, the generation process of association rules is illustrated for the frequent item set l = {I1, I2, I5}. The non-empty proper subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}. The association rules obtained from these subsets, with their confidences, are:

(1) I1 ∧ I2 ⇒ I5, confidence = 2/4 = 50%

(2) I1 ∧ I5 ⇒ I2, confidence = 2/2 = 100%

(3) I2 ∧ I5 ⇒ I1, confidence = 2/2 = 100%

(4) I1 ⇒ I2 ∧ I5, confidence = 2/6 = 33.3%

(5) I2 ⇒ I1 ∧ I5, confidence = 2/7 = 28.6%

(6) I5 ⇒ I1 ∧ I2, confidence = 2/2 = 100%


Only rules (2), (3) and (6) have confidence greater than the minimum confidence threshold, so they are retained as the final output. That is, the output rules are: I1 ∧ I5 ⇒ I2, I2 ∧ I5 ⇒ I1, and I5 ⇒ I1 ∧ I2.

1) Method based on data partition.

The basic idea is: the large database is logically divided into several disjoint blocks, which are used to generate locally frequent item sets; these locally frequent item sets are then taken as candidate globally frequent item sets, and their supports are tested to obtain the final globally frequent item sets.
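The two-pass partition scheme can be sketched as follows. For brevity this sketch mines only 1-itemsets (the principle is the same at any size); all names are my own:

```python
from collections import Counter
from math import ceil

def locally_frequent_items(block, min_frac):
    """Items meeting the support fraction within one block."""
    threshold = ceil(min_frac * len(block))
    counts = Counter(i for t in block for i in t)
    return {i for i, n in counts.items() if n >= threshold}

def partition_frequent_items(transactions, num_blocks, min_frac):
    """Any globally frequent itemset must be locally frequent in at
    least one block, so the union of the local results (pass 1) is a
    complete candidate set, verified by one full scan (pass 2)."""
    size = ceil(len(transactions) / num_blocks)
    blocks = [transactions[i:i + size]
              for i in range(0, len(transactions), size)]
    # Pass 1: union of locally frequent items = global candidates
    candidates = set().union(
        *(locally_frequent_items(b, min_frac) for b in blocks))
    # Pass 2: one full scan verifies the actual global support
    global_counts = Counter(i for t in transactions for i in t)
    threshold = ceil(min_frac * len(transactions))
    return {i for i in candidates if global_counts[i] >= threshold}
```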

2) Method based on hashing.

A hash-based technique can be used to compress the candidate k-itemsets Ck (k > 1). For example, while scanning each transaction in the database, its item sets are hashed (that is, mapped) into the buckets of a hash table, and the corresponding bucket counts are incremented. A 2-itemset whose hash bucket count is below the support threshold cannot be a frequent 2-itemset and should be removed from the candidate item sets. This hash-based technique can greatly compress the set of candidate k-itemsets to be examined.
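A sketch of the bucket filter for 2-itemsets, under the assumption of a simple modulo hash (names are my own). Because collisions only over-count a bucket, the filter can never discard a truly frequent pair:

```python
from itertools import combinations

def hash_filter_pairs(transactions, num_buckets, min_support_count):
    """Hash every 2-itemset of every transaction into a small bucket
    table during the scan; return a predicate that rejects pairs whose
    bucket count is below the support threshold."""
    buckets = [0] * num_buckets
    def bucket(pair):
        return hash(pair) % num_buckets   # any fixed hash function works
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket(pair)] += 1
    def may_be_frequent(pair):
        # A pair in an under-full bucket cannot be frequent.
        return buckets[bucket(tuple(sorted(pair)))] >= min_support_count
    return may_be_frequent
```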

3) Method based on transaction compression.

This method reduces the number of transactions that must be scanned in later iterations: a transaction that does not contain any frequent k-itemset cannot contain any frequent (k+1)-itemset. Therefore, when such a transaction is encountered, it can be marked or removed from the transaction database, and the subsequent scans that generate the j-itemsets (j > k) need not scan and analyze these records.
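The pruning rule above can be sketched directly (the function name is my own). On the Table 1 data with L3 = {{I1,I2,I3}, {I1,I2,I5}}, only the three transactions T1, T8 and T9 survive, so the scan for C4 touches three transactions instead of nine:

```python
def compress_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions containing at least one frequent
    k-itemset; the others cannot contain any frequent (k+1)-itemset."""
    kept = []
    for t in transactions:
        ts = set(t)
        if any(set(c) <= ts for c in frequent_k_itemsets):
            kept.append(t)
    return kept
```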

4) Method based on sampling.

This method selects a random sample S of the given database D and searches for frequent item sets in S rather than in D. The approach sacrifices some accuracy in exchange for efficiency, and it only needs to scan the transactions of S once. Because the search is performed in S instead of D, some globally frequent item sets may be missed. To reduce this possibility, a support threshold lower than the minimum support can be used to find the locally frequent item sets LS in S, and the rest of the database is then used to compute the actual support of each item set in LS.
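A sketch of the sample-then-verify scheme, again shown for 1-itemsets for brevity. The slack factor of 0.75 on the threshold is an assumed illustrative value, and all names are my own:

```python
import random
from collections import Counter
from math import ceil

def sample_frequent_items(transactions, sample_frac, min_frac, slack=0.75):
    """Mine a random sample S with a lowered threshold (slack * min_frac)
    to reduce the chance of missing globally frequent items, then verify
    the surviving candidates against the full database."""
    rng = random.Random(0)                # fixed seed for repeatability
    S = rng.sample(transactions, ceil(sample_frac * len(transactions)))
    counts = Counter(i for t in S for i in t)
    lowered = ceil(slack * min_frac * len(S))
    candidates = {i for i, n in counts.items() if n >= lowered}
    # Verification pass over the full database D
    full = Counter(i for t in transactions for i in t)
    threshold = ceil(min_frac * len(transactions))
    return {i for i in candidates if full[i] >= threshold}
```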

5) Dynamic item set counting.

Dynamic item set counting divides the database into blocks marked by start points. Unlike Apriori, which determines new candidates only before each complete database scan, in this variant new candidate item sets can be added at any start point: as soon as all subsets of an item set have been identified as frequent, it is added as a new candidate. The resulting algorithm needs fewer database scans than Apriori.

The Apriori algorithm discovers the frequent item sets gradually, through a growing number of item elements. First the frequent 1-itemsets L1 are generated, then the frequent 2-itemsets L2, and so on, until the number of elements in the frequent item sets can no longer be extended, at which point the algorithm terminates. In the k-th cycle, the process first generates the collection Ck of candidate k-itemsets, then obtains the supports by scanning the database and tests them to produce the frequent k-itemsets Lk. The algorithm is simple and clear and requires no complicated derivation, but it has some shortcomings that are difficult to overcome.

1) It repeatedly scans the transaction database. For each element of each cycle's candidate set Ck, whether it joins the frequent item set Lk must be verified by scanning the database. If there is a large frequent item set containing 10 items, the transaction database must be scanned at least 10 times, which imposes a heavy I/O load.

2) It may generate a very large candidate set. The candidate k-itemsets Ck generated from Lk−1 can grow explosively: for example, 10^4 frequent 1-itemsets may generate close to 10^7 candidate 2-itemsets. Such a large candidate set is a challenge to both time and main memory.
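The growth claimed above is just a binomial count and can be checked directly:

```python
from math import comb

# Candidate 2-itemsets formed by pairing 10^4 frequent 1-itemsets
n = comb(10_000, 2)   # C(10^4, 2) = 49,995,000, on the order of 10^7
print(n)
```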

3) It relies only on support. In real life the occurrences of some transactions are very frequent while others are very rare, which creates a problem for mining: if the minimum support threshold is set too high, the algorithm runs faster but covers less of the data, and meaningful rules may not be discovered; if it is set too low, a large number of rules without practical meaning floods the entire mining process, greatly reducing the mining efficiency and the usability of the rules. Either way, decision making can be misled.

4) The scope of application of the algorithm is narrow. The algorithm considers only single Boolean association rule mining, but in practical applications there may be multi-dimensional, multi-valued and multi-level association rules. In such cases the algorithm is no longer applicable and needs to be improved or even redesigned.

To improve the efficiency of the Apriori algorithm, a series of improved algorithms has been proposed. Although these algorithms follow the theory above, the introduction of related techniques (such as data partitioning and sampling) improves the adaptability and efficiency of the Apriori algorithm to a certain extent.

1) Method based on data partition. Applying data partition technology to association rule mining can improve the flexibility of the algorithm.

CONCLUSION

The .NET framework is used to implement and apply the Apriori association rule mining algorithm. The algorithm has been studied theoretically in depth, its advantages and disadvantages have been pointed out, and finally its improvement methods have been discussed.

REFERENCES

[1] Efficient Algorithm for Mining Association Rules in Large Databases, Proceedings of the 21st International Conference on Very Large Data Bases, 1995: 175-186.

[2] Fayyad U M, Piatetsky-Shapiro G, Smyth P, The KDD process for extracting useful knowledge from volumes of data [J], Communications of the ACM, 1996, 39(11).

[3] D. W. Cheung, J. Han, V. Ng, Maintenance of discovered association rules in large databases: An incremental updating technique, In Proc. 1996 Int. Conf. Data Engineering.

[4] E. H. Han, G. Karypis, V. Kumar, Scalable parallel data mining for association rules, In Proc. 1997 ACM-SIGMOD Int. Conf. Management of Data.

[5] J. Han, Y. Fu, Discovery of multiple-level association rules from large databases, In Proc. 1995 Int. Conf. Very Large Data Bases.
