
Chapter 10

ASSOCIATION RULE

By:

Aris D. (13406054), Ricky A. (13406058), Nadia FR. (13406069), Amirah K. (13406070), Paramita AW. (13406091), Bahana W. (13406102)

Introduction

Affinity Analysis

The study of attributes or characteristics that “go together”.

Market Basket Analysis

This method uncovers rules for quantifying the relationship between two or more attributes:

“If antecedent, then consequent”

Affinity Analysis & Market Basket Analysis

Example:

A supermarket may find that of the 1,000 customers shopping on a Thursday night, 200 bought diapers, and of the 200 who bought diapers, 50 bought beer.

The association rule:

“If buy diapers, then buy beer”, with support of 50/1000 = 5% and confidence of 50/200 = 25%.

Affinity Analysis & Market Basket Analysis (2)

Examples in business & research:

Investigating the proportion of subscribers to your company’s cell phone plan who respond positively to an offer of a service upgrade

Examining the proportion of children whose parents read to them who are themselves good readers

Predicting degradation in telecommunications networks

Finding out which items in a supermarket are purchased together & which are never purchased together

Determining the proportion of cases in which a new drug will exhibit dangerous side effects

Affinity Analysis & Market Basket Analysis (3)

The number of possible association rules grows exponentially in the number of attributes.

With k binary (yes/no) attributes, there are k × 2^(k−1) possible association rules.

Example: a convenience store that sells 100 items. Possible association rules = 100 × 2^99 ≈ 6.4 × 10^31.

The a priori (preliminary) algorithm reduces the search problem to a more manageable size.
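As a quick check of this count, here is a minimal Python sketch (the function name is only illustrative) that evaluates k × 2^(k−1) for 100 items:

```python
# Count the possible association rules over k binary (yes/no) attributes,
# using the k * 2^(k-1) formula quoted above.
def possible_rules(k: int) -> int:
    return k * 2 ** (k - 1)

print(possible_rules(100))  # 63382530011411470074835160268800, on the order of 10^31
```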

Notation for Data Representation in Market Basket Analysis

A farmer sells I = {asparagus, beans, broccoli, corn, green peppers, squash, tomatoes}.

A customer puts a subset of I into a basket, e.g. {broccoli, corn}.

The subset does not keep track of how much of each item is purchased, just the names of the items.

Transactional Data Format

Tabular Data Format
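The two formats carry the same information. A minimal sketch of converting transactional records into the tabular (flag) layout, using a few illustrative baskets rather than the book's actual table:

```python
# Transactional format: one (transaction, item) pair per row,
# here collapsed into a dict of {transaction id: set of items}.
transactions = {
    1: {"broccoli", "corn"},
    2: {"asparagus", "beans", "squash"},
    3: {"corn", "tomatoes"},
}

# Tabular format: one row per transaction, one 0/1 flag column per item.
items = sorted({item for basket in transactions.values() for item in basket})
print("transaction", *items, sep="\t")
for tid, basket in sorted(transactions.items()):
    print(tid, *("1" if item in basket else "0" for item in items), sep="\t")
```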

Support, Confidence, Frequent Itemsets, & the Apriori Property

Example:

D: the set of transactions represented in Table 10.1

T: a transaction in D, representing a set of items

I: the set of all items. Set of items A: {beans, squash}. Set of items B: {asparagus}.

THEN … the association rule takes the form “if A, then B” (A ⇒ B), where A and B are PROPER subsets of I and are mutually exclusive.

Table of Transactions Made

Support and Confidence

Support, s, is the proportion of transactions in D that contain both A and B:

support = P(A ∩ B) = (number of transactions containing both A and B) / (total number of transactions)

Confidence, c, is a measure of the accuracy of the rule:

confidence = P(B | A) = P(A ∩ B) / P(A) = (number of transactions containing both A and B) / (number of transactions containing A)

Analysts prefer RULES:

High support AND High confidence
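A minimal sketch of these two formulas in Python; the toy transaction list and the rule “if {beans, squash}, then {asparagus}” are illustrative, not the book's Table 10.1:

```python
# Support and confidence of the rule A => B, computed straight from the definitions above.
transactions = [
    {"beans", "squash", "asparagus"},
    {"beans", "squash"},
    {"asparagus", "corn"},
    {"beans", "squash", "asparagus", "corn"},
    {"tomatoes"},
]
A = {"beans", "squash"}
B = {"asparagus"}

n_total = len(transactions)
n_A = sum(1 for t in transactions if A <= t)         # transactions containing all of A
n_AB = sum(1 for t in transactions if (A | B) <= t)  # transactions containing both A and B

support = n_AB / n_total   # P(A and B)
confidence = n_AB / n_A    # P(B | A)
print(f"support = {support:.2f}, confidence = {confidence:.2f}")  # 0.40 and 0.67
```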

Frequent Itemset

Definition… An itemset is a set of items contained in I, and a k-itemset contains k items.

e.g. {beans, squash} is a 2-itemset.

The itemset frequency is the number of transactions that contain the particular itemset.

A frequent itemset is an itemset that occurs at least a certain minimum number of times, i.e. has itemset frequency ≥ φ.

Example: set φ = 4; then itemsets that occur four or more times are said to be frequent.

The Apriori Property

Mining Association Rules

It is a two-step process:

1. Find all frequent itemsets (all itemsets with frequency ≥ φ).

2. From the frequent itemsets, generate association rules satisfying the minimum support and confidence conditions.

The a priori property states that if an itemset Z is not frequent, then adding another item A to the itemset Z will not make Z more frequent. This helpful property significantly reduces the search space for the a priori algorithm.

How does the Apriori Algorithm Work?

Part 1: Generating Frequent Itemsets

Part 2: Generating Association Rules

Generating Frequent Itemsets

Example:

Let φ = 4, so that an itemset is frequent if it occurs four or more times in D.

First, find the frequent 1-itemsets: F1 = {asparagus, beans, broccoli, corn, green peppers, squash, tomatoes}.

For k ≥ 2, the algorithm first constructs a set Ck of candidate k-itemsets by joining Fk−1 with itself. Then it prunes Ck using the a priori property. For k = 2, C2 consists of all the combinations of the F1 vegetables, as shown in Table 10.4.

Finding F3 is not much different from the steps for F2, but uses k = 3.

Table 10.3 (pg. 183)

Table 10.4 (pg. 185)

However, consider s = {beans, corn, squash}: the subset {corn, squash} has frequency 3 < 4 = φ, so {corn, squash} is not frequent.

By the a priori property, therefore, {beans, corn, squash} cannot be frequent; it is pruned and does not appear in F3.

The same holds for s = {beans, squash, tomatoes}, where a subset's frequency is < 4.
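Putting the join, prune, and count steps together, here is a minimal Python sketch of Part 1 (the function name and the toy baskets are illustrative, not the book's code or data):

```python
from itertools import combinations

def apriori_frequent_itemsets(transactions, phi):
    """Return every itemset that occurs in at least phi transactions."""
    # F1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= phi}
    result = {s: c for s, c in counts.items() if c >= phi}

    k = 2
    while frequent:
        # Join step: candidate k-itemsets built from unions of frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step (a priori property): every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))}
        # Count step: keep only candidates meeting the minimum frequency phi
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c for c, n in counts.items() if n >= phi}
        result.update({c: n for c, n in counts.items() if n >= phi})
        k += 1
    return result

# Toy run with phi = 2
baskets = [{"beans", "squash", "asparagus"}, {"beans", "squash"},
           {"asparagus", "corn"}, {"beans", "squash", "asparagus"}]
print(apriori_frequent_itemsets(baskets, phi=2))
```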

Generating Association Rules

1. Generate all subsets ss of each frequent itemset s.

2. For the association rule R: ss ⇒ (s − ss), generate R if it fulfills the minimum confidence requirement. (s − ss) is the set s without ss.
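Continuing that sketch, Part 2 can reuse the frequency counts: for each frequent itemset s it enumerates every proper nonempty subset ss, keeps the rules ss ⇒ (s − ss) that meet the minimum confidence, and (as on the following slides) ranks them by support × confidence. This is an illustrative sketch built on the hypothetical apriori_frequent_itemsets above:

```python
from itertools import combinations

def generate_rules(frequent_counts, n_transactions, min_confidence):
    """Rules ss => (s - ss) from frequent itemsets, filtered by min_confidence."""
    rules = []
    for s, count_s in frequent_counts.items():
        if len(s) < 2:
            continue
        support = count_s / n_transactions
        for r in range(1, len(s)):
            for ss in map(frozenset, combinations(s, r)):   # proper nonempty subsets of s
                # ss is frequent by the a priori property, so its count is in the table
                confidence = count_s / frequent_counts[ss]  # P(s - ss | ss)
                if confidence >= min_confidence:
                    rules.append((set(ss), set(s - ss), support, confidence))
    # Rank by support x confidence, as on the next slides
    return sorted(rules, key=lambda rule: rule[2] * rule[3], reverse=True)
```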

Example: Two Antecedents

All transactions = 14

Transactions including asparagus and beans = 5

Transactions including asparagus and squash = 5

Transactions including beans and squash = 6

Ranked by Support × Confidence

Minimum confidence = 80%

Clementine Generating Association Rules

Clementine Generating Association Rules (2)

In Clementine, “support” means the number of occurrences of the antecedent, which differs from the definition given earlier.

The first column indicates the number of times the antecedent occurs.

To find the actual support as defined above, multiply Clementine's support by the confidence.

Extension From Flag Data to General Categorical Data

- Association rules are not only for flag (Boolean) data.

- The a priori algorithm can also be applied to categorical data.

Example Using Clementine

Recall the normalized adult data set from Chapters 6 and 7.

Information-Theoretic Approach: Generalized Rule Induction Method

Why GRI?

The a priori algorithm is not well equipped to handle numerical attributes; they must first be discretized.

Discretization can lead to loss of information.

GRI can handle both categorical and numerical variables as inputs, but still requires a categorical variable as the output.

Generalized Rule Induction Method (2)

J-Measure

J = p(x) · [ p(y|x) · ln( p(y|x) / p(y) ) + (1 − p(y|x)) · ln( (1 − p(y|x)) / (1 − p(y)) ) ]

p(x): probability of the value of x (the antecedent)

p(y): probability of the value of y (the consequent)

p(y|x): conditional probability of y given that x has occurred

Generalized Rule Induction Method (3)

The J-measure quantifies the “interestingness” of a rule.

In GRI, the user specifies how many association rules should be reported.

If the interestingness of a new rule exceeds the current minimum J in the rule table, the new rule is inserted and the rule with the minimum J is eliminated.
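That rule-table bookkeeping can be sketched with a fixed-size min-heap keyed on J; this is only an illustrative sketch of the idea described above, not Clementine's actual implementation:

```python
import heapq

def update_rule_table(rule_table, new_rule, max_rules):
    """Keep the max_rules rules with the highest J; rule_table is a min-heap of (J, rule text) pairs."""
    if len(rule_table) < max_rules:
        heapq.heappush(rule_table, new_rule)
    elif new_rule[0] > rule_table[0][0]:          # more interesting than the current minimum J
        heapq.heapreplace(rule_table, new_rule)   # drop the minimum-J rule, insert the new one
    return rule_table
```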

Application of GRI

p(x): female, never married; p(x) = 0.1463

Application of GRI (2)

p(y): work class = private; p(y) = 0.6958

Application of GRI (3)

p(y|x): work class = private, given: female, never married

p(y|x) = conditional probability = 0.763

Application of GRI

Calculation:

J = p(x) · [ p(y|x) · ln( p(y|x) / p(y) ) + (1 − p(y|x)) · ln( (1 − p(y|x)) / (1 − p(y)) ) ]

  = 0.1463 · [ 0.763 · ln( 0.763 / 0.6958 ) + 0.237 · ln( 0.237 / 0.3042 ) ]

  = 0.1463 · [ 0.763 · ln(1.0966) + 0.237 · ln(0.7791) ]

  = 0.001637
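A minimal Python sketch that reproduces this arithmetic from the J-measure formula given earlier (the probabilities are the ones quoted on the previous slides):

```python
from math import log

def j_measure(p_x, p_y, p_y_given_x):
    """J-measure of a rule x => y, following the formula above."""
    return p_x * (p_y_given_x * log(p_y_given_x / p_y)
                  + (1 - p_y_given_x) * log((1 - p_y_given_x) / (1 - p_y)))

# p(x) = 0.1463 (female, never married), p(y) = 0.6958 (work class = private), p(y|x) = 0.763
print(round(j_measure(0.1463, 0.6958, 0.763), 6))  # 0.001637
```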

When not to use Association Rules

Association rules chosen a priori can be evaluated based on:

Confidence

Confidence Difference

Confidence Ratio

Association rules need to be applied with care, because the results are sometimes unreliable.

When not to use Association Rules (2)

Association rules chosen a priori, based on confidence

Applying this association rule actually reduces the probability of selecting the desired records, compared with random selection.

Even though the rule is useless, the software still reports it, probably because the default ranking mechanism for the a priori algorithm is confidence.

We should never simply believe the computer output without making the effort to understand the models and mechanisms underlying the results.

When not to use Association Rules (3)

Association rules chosen a priori, based on confidence

When not to use Association Rules (4)

Association rules chosen a priori, based on confidence difference

In the previous example, a random selection from the database would have provided more effective results than applying the association rule.

The confidence difference criterion instead selects the rule that provides the greatest increase in confidence from the prior to the posterior.

This evaluation measure uses the absolute difference between the prior and posterior confidences.

When not to use Association Rules (5)

Association rules chosen a priori, based on confidence difference

When not to use Association Rules (6)

Association rules chosen a priori, based on confidence ratio

Some analysts prefer to use the confidence ratio to evaluate potential rules.

Here, the confidence difference criterion yielded the very same rules as did the confidence ratio criterion.

When not to use Association Rules (7)

Association rules chosen a priori, based on confidence ratio

Example:

If Marital_Status = Divorced, then Sex = Female, with p(y) = 0.3317 and p(y|x) = 0.60.
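For this rule, the prior and posterior confidences can be compared directly. The confidence difference is the absolute prior-to-posterior difference described earlier; the confidence-ratio line below uses one plausible formulation, an assumption here since the slides do not spell the formula out:

```python
prior = 0.3317    # p(y): overall proportion of records with Sex = Female
posterior = 0.60  # p(y|x): proportion of Divorced records with Sex = Female

confidence_difference = abs(posterior - prior)  # 0.2683
# Assumed formulation of the confidence ratio: how far posterior/prior departs from 1
confidence_ratio = 1 - min(posterior / prior, prior / posterior)  # about 0.45
print(confidence_difference, confidence_ratio)
```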

Do Association Rules Represent Supervised or Unsupervised Learning?

Supervised learning:

The target variable is prespecified.

The algorithm is provided with a rich collection of examples where possible associations between the target variable and the predictor variables may be uncovered.

Unsupervised learning:

No target variable is identified explicitly.

The algorithm searches for patterns and structure among all the variables.

Association rules are generally used for unsupervised learning, but they can also be applied to supervised learning for a classification task.

Local Patterns Versus Global Models

Model: a global description or explanation of a data set.

Pattern: an essential local feature of the data.

Association rules are well suited to uncovering local patterns in data.

Applying the “if” clause drills down deep into the data set, uncovering a hidden local pattern that might be relevant.

Finding local patterns is one of the most important goals of data mining; it can lead to new, profitable initiatives.