
Revisiting Association Rules
Gordon S. Linoff
Founder
Data Miners, Inc.
gordon@data-miners.com
Agenda
What association rules are

Data used for examples
Evaluating One Association Rule
– Support, confidence, lift
– Chi-Square
Generating and evaluating all association rules
Extending the ideas
©2007 Data Miners, Inc.

http://www.data-miners.com
What Is An Association Rule?
Association rules tell us what products or events happen to occur together

– An example of undirected data mining
LHS → RHS
– “When the products on the left hand side are present in a transaction, then the
products on the right hand side are present”
– LHS = left hand side. It typically consists of zero or more products
– RHS = right hand side. It typically consists of one product
When products simply tend to occur together, we call that an item set
Used for a variety of purposes
– Embedded in automatic retailing systems to make recommendations on-line
– Recommending ring tones
– Cross-selling of financial products
– Data cleansing (exceptions to very common rules suggest data issues)

A Most Famous Example:

Beer and Diapers

The Real Story
So what are the facts? In 1992, Thomas Blischok, manager

of a retail consulting group at Teradata, and his staff
prepared an analysis of 1.2 million market baskets from
about 25 Osco Drug stores. Database queries were
developed to identify affinities. The analysis "did discover
that between 5:00 and 7:00 p.m. that consumers bought
beer and diapers". Osco managers did NOT exploit the
beer and diapers relationship by moving the products
closer together on the shelves. This decision support study
was conducted using query tools to find an association.
The true story is very bland compared to the legend.
Daniel Power, at http://dssresources.com/newsletters/66.php

Why is Revisiting Them Necessary?
Association rules often tell us what we should already know

– Customers who purchase maintenance agreements are very likely to
purchase large appliances.
– If a customer has three-way calling, then the customer has call-waiting.
– Customers purchase eggs the week before Easter.
Traditional approach does not produce very good rules

– They are biased toward rare products that happen to occur together.
Association rules are heavily biased toward large purchases,

with little or no information about small purchases
– One purchase with 100 products has 9,900 rules of the form A → B
– One purchase with 1 product has none.
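
The 9,900 figure is the number of ordered pairs of distinct products, n × (n − 1). A minimal Python sketch that checks the arithmetic by enumeration (the function name is illustrative, not from the talk):

```python
from itertools import permutations

def count_one_way_rules(num_products):
    """Count rules of the form A -> B that one purchase can generate."""
    # Every ordered pair of distinct products is one candidate rule.
    return sum(1 for _ in permutations(range(num_products), 2))

rules_for_100 = count_one_way_rules(100)
rules_for_1 = count_one_way_rules(1)
```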

Data Used for Examples
Data: a small, de-identified retail sample
Summary information
– 4,040 products in 8 categories
– 189,559 orders for 156,258 households
– 1.48 order lines per order
– 1.80 order lines per household
Four important tables
– ORDERLINE
– ORDERS
– CUSTOMER
– PRODUCT
A relatively small amount of data is used for demonstration purposes.
It is available at www.data-miners.com.
Products, Categories, and One-Time Purchasers
[Chart: Number of Households (left axis, 0 to 120,000) and Ratio of One-Time Purchasers (right axis, 0% to 60%) by product category: APPAREL, ARTWORK, BOOK, CALENDAR, FREEBIE, GAME, OCCASION, OTHER]
Which products are associated with one-time purchasers?

Co-occurrence of Product Groups
[Figure: co-occurrence grid with the product groups APPAREL, ARTWORK, BOOK, CALENDAR, FREEBIE, GAME, OCCASION, and OTHER on both axes]

Generating Association Rules

TRANSACTION DATA
HouseholdID OrderID ProductID
18111580 1164195 12820
18111580 1164195 12826
18111642 1151771 12510
18111926 1056621 10026
18112052 1186728 12175
18112052 1186728 12820
18112318 1024075 11179
18112318 1024075 11168
18112318 1024076 10834
18112318 1024076 11176
18112318 1024076 11163
18112322 1219042 12479
18112322 1219042 12820
18112322 1219042 12479
18112322 1219042 12820
18112386 1014297 11053
18112386 1014297 11088
...
18112386 1017171 11048
18112386 1017171 11196
...

RULES
10939 --> 10942
11050 --> 11051
10006 --> 10993
11047 --> 11048
11047 --> 11196
11047 --> 11052
11047 --> 11061
11048 --> 11196
10992 --> 11179
11009 --> 11061
11048 --> 11196
11048 --> 11196
11048 --> 11196
10977 --> 10983
10977 --> 10979
11048 --> 11196
...
How do we evaluate the rules?

Zero-Way Association Rules
<nothing> → RHS [right hand side]

Evaluation criterion: the proportion of orders that contain the RHS product(s) (also called the support)
Provides a performance baseline for other rules that predict the RHS
SELECT rhs, COUNT(*) * 1.0 / (SELECT COUNT(*) FROM orders) as support
FROM (SELECT orderid, productid as rhs
      FROM orderline
      GROUP BY orderid, productid) r
GROUP BY rhs
ORDER BY 2 desc
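
The same baseline can be sketched outside the database. This Python version is illustrative only: the sample rows are made up, and it counts the orders seen in the order lines rather than reading a separate orders table.

```python
from collections import Counter

def zero_way_support(order_lines):
    """Return {product: share of orders containing it}.

    order_lines is an iterable of (orderid, productid) pairs.
    """
    pairs = set(order_lines)                  # one row per (order, product), like the GROUP BY
    num_orders = len({orderid for orderid, _ in pairs})
    counts = Counter(productid for _, productid in pairs)
    return {product: n / num_orders for product, n in counts.items()}

# Hypothetical sample: product "A" appears twice in order 1 but counts once.
sample = [(1, "A"), (1, "A"), (1, "B"), (2, "A"), (3, "C")]
support = zero_way_support(sample)
```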

Evaluating More Complex Rules
Consider the two most common products in the data (12820 and 13190)
We have the following information about their individual and joint frequencies
– 189,559 orders
– 18,441 LHS
– 3,404 RHS
– 2,588 BOTH
How do we use this to measure the rule 12820 → 13190?

Support: How Often Is The Rule True?
Support is the proportion of transactions that have all the products on the left hand side and the right hand side.
– 2,588/189,559 = 1.4%
Support for a single rule is easy to calculate within a single query.
Good rules have high support.

Confidence: How Often Is The Rule Correct?
Confidence is the conditional probability of the RHS given the LHS. It is expressed as the ratio of LHS and RHS to LHS.
– 2,588 / 18,441 = 14.0%
Alternatively, it is the ratio of the support of the rule to the support of LHS.
Good rules have high confidence.

Lift: How Much Better Is The Rule Than Just Guessing the RHS?
Lift is the ratio between confidence and the support of the RHS:
– (2,588/18,441)/(3,404/189,559) = 7.8
Lift measures the ratio of the confidence and the confidence of the zero-way rule.
Good rules have high lift.
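
All three measures come from the same four counts. A small Python sketch using the counts from the slides (the function name is illustrative):

```python
def rule_measures(num_orders, num_lhs, num_rhs, num_both):
    """Support, confidence, and lift for one rule LHS -> RHS."""
    support = num_both / num_orders                # how often the rule is true
    confidence = num_both / num_lhs                # P(RHS | LHS)
    lift = confidence / (num_rhs / num_orders)     # confidence vs. the zero-way rule
    return support, confidence, lift

# Counts for 12820 -> 13190 as given in the slides.
support, confidence, lift = rule_measures(189_559, 18_441, 3_404, 2_588)
print(f"{support:.1%} {confidence:.1%} {lift:.1f}")   # 1.4% 14.0% 7.8
```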

Comparison of Four Rules

Rule: 12820 → 13190
– Support 1.4%
– Confidence 14.0%
– Lift 7.8
Inverse: 13190 → 12820
– Support 1.4%
– Confidence 76.0%
– Lift 7.8
Negative: 12820 → NOT 13190
– Support 8.4%
– Confidence 86.0%
– Lift 0.88
Negative Inverse: 13190 → NOT 12820
– Support 0.4%
– Confidence 24.0%
– Lift 0.27
Support is the same for a rule and its inverse
Lift is the same for a rule and its inverse
The product of the confidence of a rule and its inverse is Lift * Support
Support and lift of the negative rule depend on the overall number of items.
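
These relationships can be checked directly from the counts. The sketch below is illustrative Python; the helper name is made up, and the negative rules are formed by swapping the RHS count for its complement.

```python
def measures(n, n_lhs, n_rhs, n_both):
    """Return (support, confidence, lift) for LHS -> RHS."""
    support = n_both / n
    confidence = n_both / n_lhs
    lift = confidence / (n_rhs / n)
    return support, confidence, lift

N, A, B, AB = 189_559, 18_441, 3_404, 2_588    # orders, 12820, 13190, both

rule = measures(N, A, B, AB)                   # 12820 -> 13190
inverse = measures(N, B, A, AB)                # 13190 -> 12820
negative = measures(N, A, N - B, A - AB)       # 12820 -> NOT 13190
neg_inverse = measures(N, B, N - A, B - AB)    # 13190 -> NOT 12820

# Product of the two confidences equals lift * support.
identity_gap = rule[1] * inverse[1] - rule[2] * rule[0]
```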
And the best rules are . . . Lousy
Columns: Rule, LHS count, RHS count, LHSRHS count, Support, Confidence, Lift
10874 --> 10879 1 1 1 0.0% 100.0% 192,983.0
12665 --> 10705 1 1 1 0.0% 100.0% 192,983.0
12935 --> 12190 1 1 1 0.0% 100.0% 192,983.0
13224 --> 13859 1 1 1 0.0% 100.0% 192,983.0
13779 --> 13232 1 1 1 0.0% 100.0% 192,983.0
10878 --> 10892 1 1 1 0.0% 100.0% 192,983.0
13495 --> 12353 1 1 1 0.0% 100.0% 192,983.0
12717 --> 11786 1 1 1 0.0% 100.0% 192,983.0
13238 --> 13752 1 1 1 0.0% 100.0% 192,983.0
11902 --> 11915 1 1 1 0.0% 100.0% 192,983.0
We want rules with lots of support and lots of confidence,

not rules where rare products happen to occur together.
The Tendency is to Require Minimum Support and Lift, Such as Support of 0.1%
Columns: Rule, LHS count, RHS count, LHSRHS count, Support, Confidence, Lift
11051 --> 11050 369 294 220 0.11% 59.6% 391.4
11050 --> 11051 294 369 220 0.11% 74.8% 391.4
11064 --> 11067 480 347 275 0.14% 57.3% 318.6
11067 --> 11064 347 480 275 0.14% 79.3% 318.6
12823 --> 12951 747 201 201 0.10% 26.9% 258.3
12951 --> 12823 201 747 201 0.10% 100.0% 258.3
12506 --> 12830 615 1,021 530 0.27% 86.2% 162.9
12830 --> 12506 1,021 615 530 0.27% 51.9% 162.9
11097 --> 11095 1,090 548 229 0.12% 21.0% 74.0
11095 --> 11097 548 1,090 229 0.12% 41.8% 74.0
A bit better,
but is there a better way?
A Better Approach: Chi-Square
Chi-Square is a statistical test that measures how unexpectedly data is partitioned across multiple dimensions
ALL 189,559
LHS 18,441
RHS 3,404
BOTH 2,588

            RHS YES    RHS NO
LHS YES       2,588    15,853
LHS NO          816   170,302
Rules form a natural 2x2 contingency table

Calculating Chi-Square for the Rule 12820 → 13190
COUNTS
            RHS YES    RHS NO
LHS YES       2,588    15,853
LHS NO          816   170,302
EXPECTED VALUE (row sum times column sum divided by the overall sum)
            RHS YES    RHS NO
LHS YES       331.2  18,109.8
LHS NO      3,072.8 168,045.2
DEVIATION (count – expected value)
            RHS YES    RHS NO
LHS YES     2,256.8  -2,256.8
LHS NO     -2,256.8   2,256.8
CHI-SQUARE (deviation squared / expected value)
            RHS YES    RHS NO
LHS YES    15,380.6     281.2
LHS NO      1,657.5      30.3
Sum of chi-square values: 17,349.7
This is really all arithmetic.
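
As a rough illustration of that arithmetic, here is a Python sketch that builds the four cells from the rule counts and sums the chi-square contributions. The function name is made up; the counts are the ones used in the slides.

```python
def chi_square_2x2(n, n_lhs, n_rhs, n_both):
    """Chi-square for the 2x2 table of (LHS present?, RHS present?)."""
    observed = {
        (True, True): n_both,
        (True, False): n_lhs - n_both,
        (False, True): n_rhs - n_both,
        (False, False): n - n_lhs - n_rhs + n_both,
    }
    row_sum = {True: n_lhs, False: n - n_lhs}
    col_sum = {True: n_rhs, False: n - n_rhs}
    total = 0.0
    for (has_lhs, has_rhs), count in observed.items():
        expected = row_sum[has_lhs] * col_sum[has_rhs] / n   # row sum * column sum / sum
        total += (count - expected) ** 2 / expected          # deviation squared / expected
    return total

chi2 = chi_square_2x2(189_559, 18_441, 3_404, 2_588)   # roughly 17,350
```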

Negative Rules Using Chi-Square
The following rules have the same chi-square values:
– Case 1: 12820 → 13190
– Case 2: 12820 → NOT 13190
COUNTS
            RHS YES    RHS NO
LHS YES       2,588    15,853
LHS NO          816   170,302
How do we choose between the positive and

negative rule?
– Choose the one with the larger lift
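
A short Python sketch of this choice, assuming the negative rule is formed by replacing the RHS count with its complement (helper names are illustrative):

```python
def chi_square(n, n_lhs, n_rhs, n_both):
    """Sum of (observed - expected)^2 / expected over the four cells."""
    cells = [
        (n_both, n_lhs * n_rhs / n),
        (n_lhs - n_both, n_lhs * (n - n_rhs) / n),
        (n_rhs - n_both, (n - n_lhs) * n_rhs / n),
        (n - n_lhs - n_rhs + n_both, (n - n_lhs) * (n - n_rhs) / n),
    ]
    return sum((obs - exp) ** 2 / exp for obs, exp in cells)

def lift(n, n_lhs, n_rhs, n_both):
    return (n_both / n_lhs) / (n_rhs / n)

N, LHS, RHS, BOTH = 189_559, 18_441, 3_404, 2_588
positive = (N, LHS, RHS, BOTH)              # 12820 -> 13190
negative = (N, LHS, N - RHS, LHS - BOTH)    # 12820 -> NOT 13190

# Same table with the columns swapped, so chi-square cannot tell them apart;
# lift breaks the tie.
better = "positive" if lift(*positive) > lift(*negative) else "negative"
```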

And the best rules are . . . Still Lousy
Columns: Rule, LHS count, RHS count, LHSRHS count, Chi-square, Support, Confidence, Lift
10878 --> 10899 1 1 1 192,983 0.00% 100.0% 192,983.0
13573 --> 11385 1 1 1 192,983 0.00% 100.0% 192,983.0
13842 --> 12885 1 1 1 192,983 0.00% 100.0% 192,983.0
10874 --> 10872 1 1 1 192,983 0.00% 100.0% 192,983.0
13859 --> 13305 1 1 1 192,983 0.00% 100.0% 192,983.0
10879 --> 10888 1 1 1 192,983 0.00% 100.0% 192,983.0
11000 --> 14030 1 1 1 192,983 0.00% 100.0% 192,983.0
14009 --> 14004 1 1 1 192,983 0.00% 100.0% 192,983.0
11228 --> 11223 1 1 1 192,983 0.00% 100.0% 192,983.0
10901 --> 10892 1 1 1 192,983 0.00% 100.0% 192,983.0
Hmmm, these aren’t any better.

So why do I like Chi-Square?
Let’s Look Where Support is at least 0.1%
Columns: Rule, LHS count, RHS count, LHSRHS count, Chi-square, Support, Confidence, Lift

Top 10 Ordered by Chi-Square:
– Support, 0.32%
– Confidence, 63.2%
– Lift, 230.9
11064 --> 11067 480 347 275 87,447 0.14% 57.3% 318.6
11067 --> 11064 347 480 275 87,447 0.14% 79.3% 318.6
12830 --> 12506 1,021 615 530 86,003 0.27% 51.9% 162.9
12506 --> 12830 615 1,021 530 86,003 0.27% 86.2% 162.9
11051 --> 11050 369 294 220 85,953 0.11% 59.6% 391.4
11050 --> 11051 294 369 220 85,953 0.11% 74.8% 391.4
12823 --> 12951 747 201 201 51,780 0.10% 26.9% 258.3
12951 --> 12823 201 747 201 51,780 0.10% 100.0% 258.3
11196 --> 11048 4,729 3,166 1,824 40,973 0.95% 38.6% 23.5
11048 --> 11196 3,166 4,729 1,824 40,973 0.95% 57.6% 23.5

Top 10 Ordered by Lift:
– Support, 0.15%
– Confidence, 59.9%
– Lift, 241.0
11051 --> 11050 369 294 220 85,953 0.11% 59.6% 391.4
11050 --> 11051 294 369 220 85,953 0.11% 74.8% 391.4
11064 --> 11067 480 347 275 87,447 0.14% 57.3% 318.6
11067 --> 11064 347 480 275 87,447 0.14% 79.3% 318.6
12823 --> 12951 747 201 201 51,780 0.10% 26.9% 258.3
12951 --> 12823 201 747 201 51,780 0.10% 100.0% 258.3
12506 --> 12830 615 1,021 530 86,003 0.27% 86.2% 162.9
12830 --> 12506 1,021 615 530 86,003 0.27% 51.9% 162.9
11097 --> 11095 1,090 548 229 16,629 0.12% 21.0% 74.0
11095 --> 11097 548 1,090 229 16,629 0.12% 41.8% 74.0
Chi-Square is better, for both support and confidence

One-Way Rules, Expected Values >= 5

Columns: Rule, LHS count, RHS count, LHSRHS count, Chi-square, Support, Confidence, Lift

Top 10 Ordered by Chi-Square:
– Support, 0.71%
– Confidence, 40.5%
– Lift, 31.5
11048 --> 11196 3,166 4,729 1,824 40,973 0.95% 57.6% 23.5
11196 --> 11048 4,729 3,166 1,824 40,973 0.95% 38.6% 23.5
10940 --> 10943 1,100 960 396 28,171 0.21% 36.0% 72.4
10943 --> 10940 960 1,100 396 28,171 0.21% 41.3% 72.4
11052 --> 11197 1,410 1,900 542 20,440 0.28% 38.4% 39.0
11197 --> 11052 1,900 1,410 542 20,440 0.28% 28.5% 39.0
10956 --> 12139 2,773 7,063 1,483 19,805 0.77% 53.5% 14.6
12139 --> 10956 7,063 2,773 1,483 19,805 0.77% 21.0% 14.6
13190 --> 12820 3,404 18,441 2,588 17,716 1.34% 76.0% 8.0
12820 --> 13190 18,441 3,404 2,588 17,716 1.34% 14.0% 8.0

Top 10 Ordered by Lift:
– Support, 0.19%
– Confidence, 29.2%
– Lift, 45.8
10940 --> 10943 1,100 960 396 28,171 0.21% 36.0% 72.4
10943 --> 10940 960 1,100 396 28,171 0.21% 41.3% 72.4
10943 --> 10939 960 1,306 330 16,300 0.17% 34.4% 50.8
10939 --> 10943 1,306 960 330 16,300 0.17% 25.3% 50.8
11052 --> 11197 1,410 1,900 542 20,440 0.28% 38.4% 39.0
11197 --> 11052 1,900 1,410 542 20,440 0.28% 28.5% 39.0
10939 --> 10942 1,306 1,354 320 10,691 0.17% 24.5% 34.9
10942 --> 10939 1,354 1,306 320 10,691 0.17% 23.6% 34.9
10939 --> 10940 1,306 1,100 237 7,168 0.12% 18.1% 31.8
10940 --> 10939 1,100 1,306 237 7,168 0.12% 21.5% 31.8
Chi-Square rules have more support

and larger confidence.
Two-Way Rules, Support > 0.1%
Columns: Rule, LHS count, RHS count, LHSRHS count, Chi-square, Support, Confidence, Lift

Top 10 Ordered by Chi-Square:
– Support, 1.2%
– Confidence, 71.2%
– Lift, 30.4
12820, 12830 --> 12506 439 488 399 14,045 2.09% 90.9% 35.5
12506, 12820 --> 12830 469 498 399 12,843 2.09% 85.1% 32.5
12820, 12823 --> 12951 310 196 194 11,725 1.02% 62.6% 60.8
12510, 12820 --> 12507 144 94 90 11,362 0.47% 62.5% 126.7
11070, 11072 --> 11074 78 120 73 10,812 0.38% 93.6% 148.6
11052, 11196 --> 11197 276 480 275 10,754 1.44% 99.6% 39.5
11052, 11197 --> 11196 292 499 275 9,746 1.44% 94.2% 36.0
12820, 12951 --> 12823 194 382 194 9,578 1.02% 100.0% 49.9
11072, 11074 --> 11070 85 130 73 9,145 0.38% 85.9% 125.9
11070, 11074 --> 11072 80 139 73 9,088 0.38% 91.3% 125.1

Top 10 Ordered by Lift:
– Support, 0.18%
– Confidence, 78.3%
– Lift, 158.4
11053, 12820 --> 11069 35 88 35 7,556 0.18% 100.0% 216.5
10939, 10941 --> 10929 50 60 25 3,942 0.13% 50.0% 158.8
11157, 11162 --> 11156 24 100 20 3,156 0.10% 83.3% 158.8
11158, 11162 --> 11156 24 100 20 3,156 0.10% 83.3% 158.8
11158, 11162 --> 11157 24 109 21 3,192 0.11% 87.5% 152.9
11070, 11072 --> 11074 78 120 73 10,812 0.38% 93.6% 148.6
11157, 11158 --> 11156 72 100 56 8,260 0.29% 77.8% 148.2
10929, 10939 --> 10944 31 104 25 3,669 0.13% 80.6% 147.7
10939, 10944 --> 10929 54 60 25 3,647 0.13% 46.3% 147.0
10939, 10941 --> 10944 50 104 40 5,829 0.21% 80.0% 146.5
Chi-Square rules have more support

and comparable confidence.
Two-Way Rules, Expected Values >= 5

Columns: Rule, LHS count, RHS count, LHSRHS count, Chi-square, Support, Confidence, Lift

Top 10 Ordered by Chi-Square:
– Support, 1.3%
– Confidence, 68.5%
– Lift, 29.3
12820, 12830 --> 12506 439 488 399 14,045 2.09% 90.9% 35.5
12506, 12820 --> 12830 469 498 399 12,843 2.09% 85.1% 32.5
11052, 11196 --> 11197 276 480 275 10,754 1.44% 99.6% 39.5
11052, 11197 --> 11196 292 499 275 9,746 1.44% 94.2% 36.0
11196, 11197 --> 11052 365 436 275 8,881 1.44% 75.3% 32.9
10939, 10943 --> 10940 219 372 170 6,626 0.89% 77.6% 39.8
10940, 10943 --> 10939 255 504 170 4,113 0.89% 66.7% 25.2
12804, 12820 --> 11989 238 380 114 2,598 0.60% 47.9% 24.0
11989, 12820 --> 12804 326 329 114 2,160 0.60% 35.0% 20.2
12820, 13190 --> 13144 2,588 333 329 2,096 1.73% 12.7% 7.3

Top 10 Ordered by Lift:
– Support, 1.2%
– Confidence, 71.2%
– Lift, 30.4
10939, 10943 --> 10940 219 372 170 6,626 0.89% 77.6% 39.8
11052, 11196 --> 11197 276 480 275 10,754 1.44% 99.6% 39.5
11052, 11197 --> 11196 292 499 275 9,746 1.44% 94.2% 36.0
12820, 12830 --> 12506 439 488 399 14,045 2.09% 90.9% 35.5
11196, 11197 --> 11052 365 436 275 8,881 1.44% 75.3% 32.9
12506, 12820 --> 12830 469 498 399 12,843 2.09% 85.1% 32.5
10940, 10943 --> 10939 255 504 170 4,113 0.89% 66.7% 25.2
12804, 12820 --> 11989 238 380 114 2,598 0.60% 47.9% 24.0
11989, 12820 --> 12804 326 329 114 2,160 0.60% 35.0% 20.2
10834, 11168 --> 10821 236 400 93 1,618 0.49% 39.4% 18.8
Comparable support and confidence –

expected value is a good filter
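
One way such a filter might look, as a hedged Python sketch. The slides do not say which cells must reach 5; this version assumes all four expected cell values must, which is the usual rule of thumb for the chi-square test. Function names and the threshold parameter are illustrative.

```python
def min_expected_value(n, n_lhs, n_rhs):
    """Smallest expected cell value in the rule's 2x2 table."""
    rows = (n_lhs, n - n_lhs)
    cols = (n_rhs, n - n_rhs)
    return min(r * c / n for r in rows for c in cols)

def passes_expected_value_filter(n, n_lhs, n_rhs, threshold=5.0):
    # Assumption: every cell must have an expected value of at least `threshold`.
    return min_expected_value(n, n_lhs, n_rhs) >= threshold

common_rule_ok = passes_expected_value_filter(189_559, 18_441, 3_404)   # 12820 -> 13190
rare_rule_ok = passes_expected_value_filter(189_559, 1, 1)              # two one-off products
```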
From One Rule to All of them:
Generating and Measuring All Rules
This is important because there are modifications to

the basic association rule algorithm that are very useful
Two Steps
– Generate all candidate rules
– Calculate measures (lift, support, confidence, chi-square)
and choose appropriate rules
Three methods of presentation
– Dataflow Diagrams
– SQL Queries
– SAS Code

One Way to Think of the Processing:

Dataflow for One-Way Candidate Rules
This dataflow generates all candidate one-way rules

(product A → product B)
And Expressed in SQL:
Query for All One-Way Candidate Rules
CREATE TABLE candidates AS
SELECT lhs.orderid, lhs.lhs, rhs.rhs
FROM (SELECT DISTINCT orderid, productid as lhs
      FROM orderline
     ) lhs JOIN
     (SELECT DISTINCT orderid, productid as rhs
      FROM orderline) rhs
     ON lhs.orderid = rhs.orderid AND
        lhs.lhs <> rhs.rhs

Notes on the query:
– “orderid” because we are looking for products in the same order.
– “JOIN” because all products being considered must be in the same order.
– “<>” because we want all candidate rules: “A → B” is different from “B → A”.
– “DISTINCT” because multiple order lines in the same order can have the same product.
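
For readers without a database at hand, the same candidate generation can be sketched in Python. The function name and sample rows are illustrative, not from the talk.

```python
from itertools import permutations

def one_way_candidates(order_lines):
    """Return (orderid, lhs, rhs) rows for every ordered pair in each order."""
    products_by_order = {}
    for orderid, productid in order_lines:
        # A set plays the role of DISTINCT: repeated order lines count once.
        products_by_order.setdefault(orderid, set()).add(productid)
    rows = []
    for orderid, products in products_by_order.items():
        # Ordered pairs with lhs != rhs, so A -> B and B -> A both appear.
        for lhs, rhs in permutations(sorted(products), 2):
            rows.append((orderid, lhs, rhs))
    return rows

# Hypothetical order lines: order 1 repeats product 10; order 2 has one product.
candidates = one_way_candidates([(1, 10), (1, 20), (1, 10), (2, 30)])
```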

Or, in SAS:
Generating Candidate One-Way Rules
/* STEP 1: Prepare the data as orders
 * with distinct products,
 * sorted on orderid */
proc sql;
  CREATE TABLE book.order_products as
  SELECT orderid, productid,
         COUNT(*) as num
  FROM book.orderline
  GROUP BY 1, 2
  ORDER BY 1, 2
  ;

/* STEP 2: Calculate the max number
 * of products in any order */
proc sql;
  SELECT MAX(numprods)
  INTO :MAXNPRODS
  FROM (SELECT COUNT(DISTINCT productid
               ) as numprods
        FROM book.orderline
        GROUP BY orderid) a
  ;

/* STEP 3: Create all product pairs
 * within an order */
data book.candidates
     (keep=orderid lhs rhs);
  set book.order_products; by orderid;
  array prods[&MAXNPRODS] _TEMPORARY_;
  retain prods numprods;
  if first.orderid then numprods = 0;
  numprods+1;
  prods[numprods] = productid;
  if last.orderid then do;
    do i = 1 to numprods;
      lhs = prods[i];
      /* loop over all products so that both A->B and B->A are output */
      do j = 1 to numprods;
        rhs = prods[j];
        if j ~= i then output;
      end;
    end;
  end;
  else delete;
run;

Measuring the Goodness of Candidate Rules using a Dataflow
[Dataflow diagram with three aggregation branches: number of times each rule appears, number of times each LHS appears, number of times each RHS appears]

Calculating Measures in SQL:

Support, Confidence, Lift, and Chi-Square
SELECT (SQUARE(explhsrhs - numlhsrhs)/explhsrhs +
        SQUARE(explhsnorhs - numlhsnorhs)/explhsnorhs +
        SQUARE(expnolhsrhs - numnolhsrhs)/expnolhsrhs +
        SQUARE(expnolhsnorhs - numnolhsnorhs)/expnolhsnorhs) as chisquare,
       b.*
FROM (SELECT lhsrhs.*, numorders, numlhs, numrhs,
             numlhs - numlhsrhs as numlhsnorhs,
             numrhs - numlhsrhs as numnolhsrhs,
             (numorders - numlhs - numrhs + numlhsrhs) as numnolhsnorhs,
             -- multiplying by 1.0 avoids integer division
             1.0*numlhs*numrhs/numorders as explhsrhs,
             1.0*numlhs*(numorders-numrhs)/numorders as explhsnorhs,
             1.0*(numorders-numlhs)*numrhs/numorders as expnolhsrhs,
             1.0*(numorders-numlhs)*(numorders-numrhs)/numorders as expnolhsnorhs,
             1.0*numlhsrhs/numorders as support,
             1.0*numlhsrhs/numlhs as confidence,
             1.0*numlhsrhs*numorders/(1.0*numlhs*numrhs) as lift
      FROM (<LHS→RHS subquery>) lhsrhs JOIN
           (<LHS subquery>) sumlhs
           ON lhsrhs.lhs = sumlhs.lhs JOIN
           (<RHS subquery>) sumrhs
           ON lhsrhs.rhs = sumrhs.rhs CROSS JOIN
           (<ALL subquery>) a
     ) b
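Summary Calculations Are In The Subqueries
-- LHS→RHS subquery: number of orders in which each rule appears
SELECT lhs, rhs, COUNT(*) as numlhsrhs
FROM candidates
GROUP BY lhs, rhs
-- LHS subquery: number of orders in which each LHS appears
SELECT lhs, COUNT(DISTINCT orderid) as numlhs
FROM candidates
GROUP BY lhs
-- RHS subquery: number of orders in which each RHS appears
SELECT rhs, COUNT(DISTINCT orderid) as numrhs
FROM candidates
GROUP BY rhs
-- ALL subquery: total number of orders
SELECT COUNT(DISTINCT orderid) as numorders
FROM candidates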
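
The four summaries can also be sketched in Python over candidate rows. This is only an illustration: the function name and sample rows are made up, and, like the queries, it counts only orders that appear in the candidates.

```python
from collections import defaultdict

def summarize(candidates):
    """candidates: iterable of (orderid, lhs, rhs). Returns counts per rule."""
    rule_orders = defaultdict(set)
    lhs_orders = defaultdict(set)
    rhs_orders = defaultdict(set)
    all_orders = set()
    for orderid, lhs, rhs in candidates:
        rule_orders[(lhs, rhs)].add(orderid)
        lhs_orders[lhs].add(orderid)
        rhs_orders[rhs].add(orderid)
        all_orders.add(orderid)
    numorders = len(all_orders)
    return {
        (lhs, rhs): {
            "numorders": numorders,
            "numlhs": len(lhs_orders[lhs]),
            "numrhs": len(rhs_orders[rhs]),
            "numlhsrhs": len(orders),
        }
        for (lhs, rhs), orders in rule_orders.items()
    }

# Hypothetical candidates: orders 1 and 2 contain {A, B}; order 3 contains {A, C}.
rows = [(1, "A", "B"), (1, "B", "A"), (2, "A", "B"), (2, "B", "A"),
        (3, "A", "C"), (3, "C", "A")]
summary = summarize(rows)
```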

For SAS, We Start With The Same Four Summaries
proc sql;
  CREATE TABLE book.sumlhsrhs as
  SELECT lhs, rhs,
         COUNT(DISTINCT orderid
               ) as numlhsrhs
  FROM book.candidates
  GROUP BY lhs, rhs;

proc sql;
  CREATE TABLE book.sumlhs as
  SELECT lhs,
         COUNT(DISTINCT orderid
               ) as numlhs
  FROM book.candidates
  GROUP BY lhs
  ORDER BY 1;
  CREATE INDEX lhs ON book.sumlhs(lhs);

proc sql;
  CREATE TABLE book.sumrhs as
  SELECT rhs,
         COUNT(DISTINCT orderid
               ) as numrhs
  FROM book.candidates
  GROUP BY rhs
  ORDER BY 1;
  CREATE INDEX rhs ON book.sumrhs(rhs);

proc sql;
  SELECT COUNT(DISTINCT orderid
               ) as numorders
  INTO :NUMORDERS
  FROM book.candidates;

Then the Measures Can Be Calculated in a Data Step
data book.rules (keep=lhs rhs numorders numlhs numrhs numlhsrhs
                 support confidence lift chisquare);
  set book.sumlhsrhs;
  set book.sumlhs key=lhs;
  set book.sumrhs key=rhs;
  numorders = &NUMORDERS;
  support = numlhsrhs / numorders;
  confidence = numlhsrhs / numlhs;
  lift = (numlhsrhs * numorders) / (numlhs * numrhs);
  numlhsnorhs = numlhs - numlhsrhs;
  numnolhsrhs = numrhs - numlhsrhs;
  numnolhsnorhs = numorders - numlhs - numrhs + numlhsrhs;
  explhsrhs = numlhs * numrhs / numorders;
  explhsnorhs = numlhs * (numorders - numrhs) / numorders;
  expnolhsrhs = (numorders - numlhs) * numrhs / numorders;
  expnolhsnorhs = (numorders - numlhs) * (numorders - numrhs) / numorders;
  /* each term compares a count with its own expected value */
  chisquare = ((numlhsrhs - explhsrhs)**2/explhsrhs +
               (numlhsnorhs - explhsnorhs)**2/explhsnorhs +
               (numnolhsrhs - expnolhsrhs)**2/expnolhsrhs +
               (numnolhsnorhs - expnolhsnorhs)**2/expnolhsnorhs);
run;
proc sort data=book.rules; by descending chisquare;
run;
Extending the Ideas
More complicated rules

– With restrictions on orders
Sequential Association Rules

Association Rules within a household
Purchases within a household, but not at the same time
Associations on Attributes of Products
Heterogeneous rules
– Products on left hand side different from products on right hand side

More Complicated Rules
One-way rules may not be sufficient

More complicated rules involve more joins
To improve efficiency of calculation, we
introduce filters
– 2-way rules require at least 3 products in each order
– 3-way rules require at least 4 products
– And so on.

However, Filtering Out Orders with Too Few Products Can Improve Performance
– This finds orders that have three or more products to improve performance (we might want to add “and numprods < 20”)
– A,B on the left hand side is the same as B,A
Larger rules = More Joins

The SQL Version
CREATE TABLE candidates_pp2p as
SELECT lhs1.orderid, lhs1.lhs1, lhs2.lhs2,
       lhs1.lhs1||','||lhs2.lhs2 as lhs, rhs.rhs
FROM (SELECT orderid FROM orderline GROUP BY orderid
      HAVING COUNT(DISTINCT productid) > 2) o_filter JOIN
     (SELECT orderid, productid as lhs1
      FROM orderline
      GROUP BY orderid, productid) lhs1
     ON o_filter.orderid = lhs1.orderid JOIN
     (SELECT orderid, productid as lhs2
      FROM orderline
      GROUP BY orderid, productid) lhs2
     ON lhs1.orderid = lhs2.orderid AND
        lhs1.lhs1 < lhs2.lhs2 JOIN
     (SELECT orderid, productid as rhs
      FROM orderline
      GROUP BY orderid, productid) rhs
     ON lhs1.orderid = rhs.orderid AND
        rhs.rhs NOT IN (lhs1.lhs1, lhs2.lhs2)
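
The same idea as a rough Python sketch: filter out small orders, take each unordered pair of products as the left hand side, and pair it with every other product in the order. Names and sample rows are illustrative.

```python
from itertools import combinations

def two_way_candidates(order_lines, min_products=3):
    """Return (orderid, lhs1, lhs2, rhs) rows for rules of the form A,B -> C."""
    products_by_order = {}
    for orderid, productid in order_lines:
        products_by_order.setdefault(orderid, set()).add(productid)
    rows = []
    for orderid, products in products_by_order.items():
        if len(products) < min_products:      # the order filter
            continue
        ordered = sorted(products)
        # combinations() yields lhs1 < lhs2, so A,B and B,A are not both produced.
        for lhs1, lhs2 in combinations(ordered, 2):
            for rhs in ordered:
                if rhs not in (lhs1, lhs2):
                    rows.append((orderid, lhs1, lhs2, rhs))
    return rows

# Hypothetical order lines: order 1 has three products, order 2 only two.
pairs = two_way_candidates([(1, 10), (1, 20), (1, 30), (2, 10), (2, 20)])
```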

Finding Association Rules Within a Household
First generate all candidate rules for all households

– Use filter to restrict to households with at least 2 products
– This is basically joining in additional tables to get the
householdid, then using householdid instead of orderid, and
following the same procedure as just shown
By naming things carefully, we can use the same code

(SQL or SAS) for calculating the measurements

Join In the Appropriate Tables to Get the
Household ID

This is a Minor Modification to the Data Step for Generating Candidates
/* STEP 3: Create all product pairs within an order */
data book.candidates_p2p (keep=hhid lhs rhs);
  set book.order_products;
  by orderid;
  set book.orders key=orderid;
  set book.customer key=customerid;
  array prods[&MAXNPRODS] _TEMPORARY_;
  retain prods numprods;
  if first.orderid then numprods = 0;
  numprods+1;
  prods[numprods] = productid;
  if last.orderid then do;
    do i = 1 to numprods;
      lhs = prods[i];
      /* loop over all products so that both A->B and B->A are output */
      do j = 1 to numprods;
        rhs = prods[j];
        if j ~= i then output;
      end;
    end;
  end;
  else delete;
run;
Then calculating the measures (chi-square, lift, etc.) is the same as before.
Sequential Association Rules
These take place at the household level, so this

is a modification on the previous approach
These require that the products on the left hand
side are purchased before those on the right
hand side
This is a simple modification to the candidates
query
Then the measurement methods are the same

Generating the Candidates Uses the Order Date

Sequential Rule Candidates Use Order Date Information
CREATE TABLE candidates_pthenp AS
SELECT lhs.householdid, lhs.lhs, rhs.rhs
FROM (SELECT householdid
      FROM orderline ol JOIN orders o ON ol.orderid = o.orderid JOIN
           customer c ON c.customerid = o.customerid
      GROUP BY householdid
      HAVING COUNT(DISTINCT productid) > 1) filter JOIN
     (SELECT householdid, orderdate as lhsdate, productid as lhs
      FROM orderline ol JOIN orders o ON ol.orderid = o.orderid JOIN
           customer c ON c.customerid = o.customerid
      GROUP BY householdid, orderdate, productid) lhs
     ON filter.householdid = lhs.householdid JOIN
     (SELECT householdid, orderdate as rhsdate, productid as rhs
      FROM orderline ol JOIN orders o ON ol.orderid = o.orderid JOIN
           customer c ON c.customerid = o.customerid
      GROUP BY householdid, orderdate, productid) rhs
     ON lhs.householdid = rhs.householdid AND
        lhs.lhsdate < rhs.rhsdate AND
        rhs.rhs <> lhs.lhs
Then calculating the measures is the same.
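
A hedged Python sketch of the same candidate logic: within a household, pair each product with every different product bought on a strictly later date. Unlike the SQL, this version keeps each (household, lhs, rhs) combination only once; the function name and sample data are made up.

```python
from datetime import date

def sequential_candidates(purchases):
    """purchases: iterable of (householdid, orderdate, productid)."""
    items_by_household = {}
    for householdid, orderdate, productid in purchases:
        items_by_household.setdefault(householdid, set()).add((orderdate, productid))
    rows = set()
    for householdid, items in items_by_household.items():
        for lhs_date, lhs in items:
            for rhs_date, rhs in items:
                # LHS must be bought strictly before RHS, and be a different product.
                if lhs_date < rhs_date and lhs != rhs:
                    rows.add((householdid, lhs, rhs))
    return sorted(rows)

# Hypothetical purchases: household 1 buys A, then later A and B; household 2 buys A once.
purchases = [
    (1, date(2007, 1, 1), "A"),
    (1, date(2007, 2, 1), "B"),
    (1, date(2007, 2, 1), "A"),
    (2, date(2007, 1, 1), "A"),
]
sequential = sequential_candidates(purchases)
```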

What Products Are Purchased by a Household But Not in the Same Order?
Cross-selling opportunities: products purchased by a

household but not at the same time
– Ignores influences such as product bundles
The approach is similar to sequential association rules,

but ordering is not important
Once again, this changes the candidates, but not the
measure query
– The order on the right hand side is different from the orders
on the left hand side

Very Similar to the Sequential Rules

Associations on Product Attributes Instead of Products
Products have descriptive information, such as:

– Brand
– Color
– Department
– Reduced price
How can we use this information in rules?

Naively including descriptors as items “works”
– But it usually produces rules that are trivial (“diet” + “soda” → “diet Coke”)
Only consider attributes that appear in an order but not on the

same product
– This is a modification to the candidates query
Rules Contain Attributes from Different Products

Heterogeneous Associations
These are rules that have different types of items on the left
hand side and right hand side
What web page visits lead to purchases?
– Web Pages → Purchases
What direct marketing history leads to signing up?
– Marketing campaigns → Signing up
Which email campaigns lead to complaints?
– Email clicks → Complaints
These simply require setting up the candidate rules correctly

RHS Comes From a Different Source than the LHS
SELECT item_lhs1.customerid, item_lhs1.lhs as lhs1, item_lhs2.lhs as lhs2,
       (CAST(item_lhs1.lhs as VARCHAR)+', '+
        CAST(item_lhs2.lhs as VARCHAR)) as lhs, item_rhs.rhs
FROM (SELECT c.customerid
      FROM clicks c
      GROUP BY c.customerid
      HAVING COUNT(DISTINCT c.categoryid) > 1) filter JOIN
     (SELECT customerid, categoryid as lhs, timeid as lhs_time
      FROM clicks c
      GROUP BY customerid, categoryid, timeid) item_lhs1
     ON filter.customerid = item_lhs1.customerid JOIN
     (SELECT customerid, categoryid as lhs, timeid as lhs_time
      FROM clicks c
      GROUP BY customerid, categoryid, timeid) item_lhs2
     ON item_lhs1.customerid = item_lhs2.customerid AND
        item_lhs1.lhs < item_lhs2.lhs JOIN
     (SELECT customerid, categoryid as rhs, timeid as rhs_time
      FROM complaints co with (nolock)
      GROUP BY customerid, categoryid, timeid) item_rhs
     ON item_lhs1.customerid = item_rhs.customerid AND
        item_rhs.rhs NOT IN (item_lhs1.lhs, item_lhs2.lhs) AND
        item_rhs.rhs_time > item_lhs1.lhs_time AND
        item_rhs.rhs_time > item_lhs2.lhs_time
Evaluating the Rules Uses the Same Measures Query
Clicks Imply Complaint Rules Chi-Square
Telecom + Travel ==> Loans 299.0
Telecom + Government Grants ==> Credit Report 299.0
Government Grants + Gifts ==> Credit Report 299.0
Education + College/Scholarship ==> [Uncategorized] 149.0
Debt + Telecom ==> Credit Report 149.0
Debt + Government Grants ==> Credit Report 149.0
Debt + Gifts ==> Credit Report 149.0
Credit Card + Travel ==> Loans 99.0
Credit Card + Government Grants ==> Credit Report 99.0
Entrepreneurial + Credit Report ==> Home Improvement 74.0
(1) Customers do not like receiving email offers

about Credit Reports.
(2) They also do not like offers radically different from
their past interests
Conclusion
Chi-Square Measure is a very natural way to measure

association rules
By separating the creation of candidate rules from measuring
them, it is possible to generate many types of rules
– Sequential rules
– Within a household, but not within an order
– Using attributes of products
– Heterogeneous rules
– Etc.
Beer and diapers is a legend. It shows the power of accessible

ideas and demonstrates the potential value lying in data using
association rules built on mundane “query tools”.
