
Revisiting Association Rules

Gordon S. Linoff

Founder
Data Miners, Inc.
gordon@data-miners.com

Agenda

• What association rules are
• Data used for examples
• Evaluating one association rule
  – Support, confidence, lift
  – Chi-square
• Generating and evaluating all association rules
• Extending the ideas

©2007 Data Miners, Inc.


http://www.data-miners.com 2
What Is An Association Rule?

• Association rules tell us which products or events tend to occur together
  – An example of undirected data mining
• LHS --> RHS
  – “When the products on the left hand side are present in a transaction, then the products on the right hand side are present”
  – LHS = left hand side. It typically consists of zero or more products
  – RHS = right hand side. It typically consists of one product
• When products simply tend to occur together, we call that an item set
• Used for a variety of purposes
  – Embedded in automatic retailing systems to make recommendations online
  – Recommending ring tones
  – Cross-selling of financial products
  – Data cleansing (exceptions to very common rules suggest data issues)


A Most Famous Example: Beer and Diapers

The Real Story

So what are the facts? In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. that consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves. This decision support study was conducted using query tools to find an association. The true story is very bland compared to the legend.

Daniel Power, at http://dssresources.com/newsletters/66.php


Why is Revisiting Them Necessary?

• Association rules often tell us what we should already know
  – Customers who purchase maintenance agreements are very likely to purchase large appliances.
  – If a customer has three-way calling, then the customer has call-waiting.
  – Customers purchase eggs the week before Easter.
• The traditional approach does not produce very good rules
  – They are biased toward rare products that happen to occur together.
• Association rules are heavily biased toward large purchases, with little or no information about small purchases
  – One purchase with 100 products has 9,900 rules of the form A --> B
  – One purchase with 1 product has none.
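
The two counts above follow from simple counting: an order with n distinct products yields n * (n - 1) ordered pairs A --> B. A minimal Python sketch of that count (the function name is illustrative and not part of the original deck):

```python
from itertools import permutations


def one_way_rule_count(products):
    """Count ordered pairs (A, B) with A != B among an order's distinct products."""
    distinct = set(products)
    return sum(1 for _ in permutations(distinct, 2))


print(one_way_rule_count(range(100)))  # 9900
print(one_way_rule_count([12820]))     # 0
```

This is why a handful of very large orders can dominate the set of candidate rules.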

Data Used for Examples
• De-identified small retail sample
• Summary information
  – 4,040 products in 8 categories
  – 189,559 orders for 156,258 households
  – 1.48 order lines per order
  – 1.80 order lines per household
• Four important tables
  – ORDERLINE
  – ORDERS
  – CUSTOMER
  – PRODUCT
A relatively small amount of data is used for demonstration purposes.
It is available at www.data-miners.com.

Products, Categories, and One-Time Purchasers

[Chart: number of households (left axis, 0 to 120,000) and ratio of one-time purchasers (right axis, 0% to 60%) for each product category: APPAREL, ARTWORK, BOOK, CALENDAR, FREEBIE, GAME, OCCASION, OTHER]

Which products are associated with one-time purchasers?

Co-occurrence of Product Groups
[Chart: co-occurrence of product groups; both axes list APPAREL, ARTWORK, BOOK, CALENDAR, FREEBIE, GAME, OCCASION, OTHER]


Generating Association Rules


TRANSACTION DATA
HouseholdID  OrderID  ProductID
18111580     1164195  12820
18111580     1164195  12826
18111642     1151771  12510
18111926     1056621  10026
18112052     1186728  12175
18112052     1186728  12820
18112318     1024075  11179
18112318     1024075  11168
18112318     1024076  10834
18112318     1024076  11176
18112318     1024076  11163
18112322     1219042  12479
18112322     1219042  12820
18112322     1219042  12479
18112322     1219042  12820
18112386     1014297  11053
18112386     1014297  11088
18112386     1017171  11048
18112386     1017171  11196
...

RULES
10939 --> 10942
11050 --> 11051
10006 --> 10993
11047 --> 11048
11047 --> 11196
11047 --> 11052
11047 --> 11061
11048 --> 11196
10992 --> 11179
11009 --> 11061
11048 --> 11196
11048 --> 11196
11048 --> 11196
10977 --> 10983
10977 --> 10979
11048 --> 11196
...

How do we evaluate the rules?

Zero-Way Association Rules

• <nothing> --> RHS [right hand side]
• Evaluation criterion: the proportion of orders that contain the RHS product(s) (also called the support)
• Provides a performance baseline for other rules that predict the RHS

-- Multiplying by 1.0 avoids integer division in databases that truncate it.
SELECT rhs, COUNT(*) * 1.0 / (SELECT COUNT(*) FROM orders) as support
FROM (SELECT orderid, productid as rhs
      FROM orderline
      GROUP BY orderid, productid) r
GROUP BY rhs
ORDER BY 2 desc
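
The same zero-way calculation can be sketched in Python. The toy `orderlines` data and the function name are hypothetical, and the sketch assumes every order has at least one order line, so the set of orders can be read from the order lines instead of a separate ORDERS table:

```python
from collections import Counter

# Hypothetical toy data: one (orderid, productid) pair per order line.
orderlines = [(1, "A"), (1, "B"), (1, "A"), (2, "A"), (3, "C"), (4, "A"), (4, "C")]


def zero_way_support(lines):
    """Proportion of orders that contain each product."""
    distinct = set(lines)  # mirrors GROUP BY orderid, productid
    num_orders = len({orderid for orderid, _ in distinct})
    counts = Counter(productid for _, productid in distinct)
    return {productid: n / num_orders for productid, n in counts.items()}


support = zero_way_support(orderlines)
print(support["A"], support["B"], support["C"])  # 0.75 0.25 0.5
```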


Evaluating More Complex Rules

• Consider the two most common products in the data (12820 and 13190)
• We have the following information about their individual and joint frequencies:
  – 189,559 orders in total
  – 18,441 orders contain the LHS (12820)
  – 3,404 orders contain the RHS (13190)
  – 2,588 orders contain BOTH
• How do we use this to measure the rule 12820 --> 13190?

Support: How Often Is The Rule True?

• Support is the proportion of transactions that have all the products on the left hand side and the right hand side.
  – 2,588/189,559 = 1.4%
• Support for a single rule is easy to calculate within a single query.
• Good rules have high support.

Counts: 189,559 orders; 18,441 LHS; 3,404 RHS; 2,588 BOTH


Confidence: How Often Is The Rule Correct?

• Confidence is the conditional probability of the RHS given the LHS. It is expressed as the ratio of the number of orders containing both the LHS and the RHS to the number of orders containing the LHS.
  – 2,588 / 18,441 = 14.0%
• Alternatively, it is the ratio of the support of the rule to the support of the LHS.
• Good rules have high confidence.

Counts: 189,559 orders; 18,441 LHS; 3,404 RHS; 2,588 BOTH

Lift: How Much Better Is The Rule Than Just
Guessing the RHS?

• Lift is the ratio between confidence and the support of the RHS:
  – (2,588/18,441)/(3,404/189,559) = 7.8
• Equivalently, lift is the ratio of the rule's confidence to the confidence of the zero-way rule.
• Good rules have high lift.

Counts: 189,559 orders; 18,441 LHS; 3,404 RHS; 2,588 BOTH
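
The three measures for the rule 12820 --> 13190 can be reproduced from the four counts on these slides. A minimal Python sketch (the function and argument names are illustrative, not from the original deck):

```python
def rule_measures(num_orders, num_lhs, num_rhs, num_both):
    """Return (support, confidence, lift) for a rule LHS --> RHS."""
    support = num_both / num_orders
    confidence = num_both / num_lhs
    lift = confidence / (num_rhs / num_orders)  # confidence vs. zero-way baseline
    return support, confidence, lift


support, confidence, lift = rule_measures(189_559, 18_441, 3_404, 2_588)
print(f"{support:.1%} {confidence:.1%} {lift:.1f}")  # 1.4% 14.0% 7.8
```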


Comparison of Four Rules


                                         Support  Confidence  Lift
Rule:              12820 --> 13190          1.4%       14.0%  7.8
Inverse:           13190 --> 12820          1.4%       76.0%  7.8
Negative:          12820 --> NOT 13190      8.4%       86.0%  0.88
Negative Inverse:  13190 --> NOT 12820      0.4%       24.0%  0.27

• Support is the same for a rule and its inverse
• Lift is the same for a rule and its inverse
• The product of the confidence of a rule and its inverse is Lift * Support
• Support and lift of the negative rule depend on the overall number of items.
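
The four rules can be checked from the same counts, treating "NOT 13190" as the complement of the orders containing 13190. A small Python sketch under that reading (the helper name `measures` is illustrative):

```python
N, n_lhs, n_rhs, n_both = 189_559, 18_441, 3_404, 2_588


def measures(n, lhs, rhs, both):
    """Return (support, confidence, lift) from order counts."""
    return both / n, both / lhs, both * n / (lhs * rhs)


rule = measures(N, n_lhs, n_rhs, n_both)
inverse = measures(N, n_rhs, n_lhs, n_both)
# Negative rules: the "right hand side" is the set of orders WITHOUT the product.
negative = measures(N, n_lhs, N - n_rhs, n_lhs - n_both)
negative_inverse = measures(N, n_rhs, N - n_lhs, n_rhs - n_both)

for name, (s, c, l) in [("rule", rule), ("inverse", inverse),
                        ("negative", negative), ("negative inverse", negative_inverse)]:
    print(f"{name:17s} support={s:.1%} confidence={c:.1%} lift={l:.2f}")

# confidence(rule) * confidence(inverse) equals lift * support
print(abs(rule[1] * inverse[1] - rule[2] * rule[0]) < 1e-12)  # True
```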
And the best rules are . . . Lousy

                    -------- COUNT --------
Rule                LHS     RHS     LHSRHS    Support  Confidence  Lift
10874 --> 10879       1       1          1       0.0%      100.0%  192,983.0
12665 --> 10705       1       1          1       0.0%      100.0%  192,983.0
12935 --> 12190       1       1          1       0.0%      100.0%  192,983.0
13224 --> 13859       1       1          1       0.0%      100.0%  192,983.0
13779 --> 13232       1       1          1       0.0%      100.0%  192,983.0
10878 --> 10892       1       1          1       0.0%      100.0%  192,983.0
13495 --> 12353       1       1          1       0.0%      100.0%  192,983.0
12717 --> 11786       1       1          1       0.0%      100.0%  192,983.0
13238 --> 13752       1       1          1       0.0%      100.0%  192,983.0
11902 --> 11915       1       1          1       0.0%      100.0%  192,983.0

We want rules with lots of support and lots of confidence, not rules where rare products happen to occur together.

The Tendency is to Require Minimum Support and Lift, Such as Support of 0.1%

                    -------- COUNT --------
Rule                LHS     RHS     LHSRHS    Support  Confidence  Lift
11051 --> 11050       369     294      220      0.11%       59.6%  391.4
11050 --> 11051       294     369      220      0.11%       74.8%  391.4
11064 --> 11067       480     347      275      0.14%       57.3%  318.6
11067 --> 11064       347     480      275      0.14%       79.3%  318.6
12823 --> 12951       747     201      201      0.10%       26.9%  258.3
12951 --> 12823       201     747      201      0.10%      100.0%  258.3
12506 --> 12830       615   1,021      530      0.27%       86.2%  162.9
12830 --> 12506     1,021     615      530      0.27%       51.9%  162.9
11097 --> 11095     1,090     548      229      0.12%       21.0%   74.0
11095 --> 11097       548   1,090      229      0.12%       41.8%   74.0

A bit better, but is there a better way?
A Better Approach: Chi-Square

• Chi-Square is a statistical test that measures how unexpectedly data is partitioned across multiple dimensions

Counts: ALL 189,559; LHS 18,441; RHS 3,404; BOTH 2,588

              RHS YES     RHS NO
LHS YES         2,588     15,853
LHS NO            816    170,302

• Rules form a natural 2x2 contingency table


Calculating Chi-Square for the Rule 12820 --> 13190

COUNTS
              RHS YES     RHS NO
LHS YES         2,588     15,853
LHS NO            816    170,302

EXPECTED VALUE (row sum times column sum divided by sum)
              RHS YES     RHS NO
LHS YES         331.2   18,109.8
LHS NO        3,072.8  168,045.2

DEVIATION (count minus expected value)
              RHS YES     RHS NO
LHS YES       2,256.8   -2,256.8
LHS NO       -2,256.8    2,256.8

CHI-SQUARE (deviation squared / expected value)
              RHS YES     RHS NO
LHS YES      15,380.6      281.2
LHS NO        1,657.5       30.3
Sum of chi-square values: 17,349.7

This is really all arithmetic.
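
The arithmetic can be sketched in a few lines of Python. The function name and argument names are illustrative; the counts are the ones used for the rule 12820 --> 13190:

```python
def chi_square(num_orders, num_lhs, num_rhs, num_both):
    """Pearson chi-square for the 2x2 table formed by a rule LHS --> RHS."""
    observed = {
        (True, True): num_both,
        (True, False): num_lhs - num_both,
        (False, True): num_rhs - num_both,
        (False, False): num_orders - num_lhs - num_rhs + num_both,
    }
    row_sum = {True: num_lhs, False: num_orders - num_lhs}
    col_sum = {True: num_rhs, False: num_orders - num_rhs}
    total = 0.0
    for (has_lhs, has_rhs), count in observed.items():
        # Expected value: row sum times column sum divided by the grand total.
        expected = row_sum[has_lhs] * col_sum[has_rhs] / num_orders
        total += (count - expected) ** 2 / expected
    return total


chi = chi_square(189_559, 18_441, 3_404, 2_588)
print(round(chi, 1))
```

The result agrees with the sum of the four cell values shown above, up to rounding.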


Negative Rules Using Chi-Square

• The following rules have the same chi-square values:
  – Case 1: 12820 --> 13190
  – Case 2: 12820 --> NOT 13190

COUNTS
              RHS YES     RHS NO
LHS YES         2,588     15,853
LHS NO            816    170,302

• How do we choose between the positive and negative rule?
  – Choose the one with the larger lift
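
A short Python sketch of this choice. It uses the standard closed form of the 2x2 Pearson chi-square, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)), which is a swapped-in shortcut and not the cell-by-cell calculation from the slides; the variable names are illustrative:

```python
def chi_square_2x2(a, b, c, d):
    """Closed-form Pearson chi-square for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def lift(n, lhs, rhs, both):
    return both * n / (lhs * rhs)


N = 189_559
# Positive rule: rows are LHS yes/no, columns are RHS yes/no.
chi_pos = chi_square_2x2(2_588, 15_853, 816, 170_302)
# Negative rule: the two columns swap places, so the statistic is unchanged.
chi_neg = chi_square_2x2(15_853, 2_588, 170_302, 816)

lift_pos = lift(N, 18_441, 3_404, 2_588)
lift_neg = lift(N, 18_441, N - 3_404, 15_853)
chosen = "12820 --> 13190" if lift_pos >= lift_neg else "12820 --> NOT 13190"
print(chi_pos == chi_neg, chosen)  # True 12820 --> 13190
```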


And the best rules are . . . Still Lousy

                    ----- COUNT -----
Rule                LHS  RHS  LHSRHS   Chi-square  Support  Confidence  Lift
10878 --> 10899       1    1       1      192,983    0.00%      100.0%  192,983.0
13573 --> 11385       1    1       1      192,983    0.00%      100.0%  192,983.0
13842 --> 12885       1    1       1      192,983    0.00%      100.0%  192,983.0
10874 --> 10872       1    1       1      192,983    0.00%      100.0%  192,983.0
13859 --> 13305       1    1       1      192,983    0.00%      100.0%  192,983.0
10879 --> 10888       1    1       1      192,983    0.00%      100.0%  192,983.0
11000 --> 14030       1    1       1      192,983    0.00%      100.0%  192,983.0
14009 --> 14004       1    1       1      192,983    0.00%      100.0%  192,983.0
11228 --> 11223       1    1       1      192,983    0.00%      100.0%  192,983.0
10901 --> 10892       1    1       1      192,983    0.00%      100.0%  192,983.0

Hmmm, these aren’t any better. So why do I like Chi-Square?
Let’s Look Where Support is at least 0.1%
• Top 10 Ordered by Chi-Square (Support 0.32%, Confidence 63.2%, Lift 230.9):

                    -------- COUNT --------
Rule                LHS     RHS     LHSRHS   Chi-square  Support  Confidence  Lift
11064 --> 11067       480     347      275       87,447    0.14%       57.3%  318.6
11067 --> 11064       347     480      275       87,447    0.14%       79.3%  318.6
12830 --> 12506     1,021     615      530       86,003    0.27%       51.9%  162.9
12506 --> 12830       615   1,021      530       86,003    0.27%       86.2%  162.9
11051 --> 11050       369     294      220       85,953    0.11%       59.6%  391.4
11050 --> 11051       294     369      220       85,953    0.11%       74.8%  391.4
12823 --> 12951       747     201      201       51,780    0.10%       26.9%  258.3
12951 --> 12823       201     747      201       51,780    0.10%      100.0%  258.3
11196 --> 11048     4,729   3,166    1,824       40,973    0.95%       38.6%   23.5
11048 --> 11196     3,166   4,729    1,824       40,973    0.95%       57.6%   23.5

• Top 10 Ordered by Lift (Support 0.15%, Confidence 59.9%, Lift 241.0):

                    -------- COUNT --------
Rule                LHS     RHS     LHSRHS   Chi-square  Support  Confidence  Lift
11051 --> 11050       369     294      220       85,953    0.11%       59.6%  391.4
11050 --> 11051       294     369      220       85,953    0.11%       74.8%  391.4
11064 --> 11067       480     347      275       87,447    0.14%       57.3%  318.6
11067 --> 11064       347     480      275       87,447    0.14%       79.3%  318.6
12823 --> 12951       747     201      201       51,780    0.10%       26.9%  258.3
12951 --> 12823       201     747      201       51,780    0.10%      100.0%  258.3
12506 --> 12830       615   1,021      530       86,003    0.27%       86.2%  162.9
12830 --> 12506     1,021     615      530       86,003    0.27%       51.9%  162.9
11097 --> 11095     1,090     548      229       16,629    0.12%       21.0%   74.0
11095 --> 11097       548   1,090      229       16,629    0.12%       41.8%   74.0

Chi-Square is better, for both support and confidence



One-Way Rules, Expected Values >= 5


• Top 10 Ordered by Chi-Square (Support 0.71%, Confidence 40.5%, Lift 31.5):

                    -------- COUNT --------
Rule                LHS     RHS     LHSRHS   Chi-square  Support  Confidence  Lift
11048 --> 11196     3,166   4,729    1,824       40,973    0.95%       57.6%  23.5
11196 --> 11048     4,729   3,166    1,824       40,973    0.95%       38.6%  23.5
10940 --> 10943     1,100     960      396       28,171    0.21%       36.0%  72.4
10943 --> 10940       960   1,100      396       28,171    0.21%       41.3%  72.4
11052 --> 11197     1,410   1,900      542       20,440    0.28%       38.4%  39.0
11197 --> 11052     1,900   1,410      542       20,440    0.28%       28.5%  39.0
10956 --> 12139     2,773   7,063    1,483       19,805    0.77%       53.5%  14.6
12139 --> 10956     7,063   2,773    1,483       19,805    0.77%       21.0%  14.6
13190 --> 12820     3,404  18,441    2,588       17,716    1.34%       76.0%   8.0
12820 --> 13190    18,441   3,404    2,588       17,716    1.34%       14.0%   8.0

• Top 10 Ordered by Lift (Support 0.19%, Confidence 29.2%, Lift 45.8):

                    -------- COUNT --------
Rule                LHS     RHS     LHSRHS   Chi-square  Support  Confidence  Lift
10940 --> 10943     1,100     960      396       28,171    0.21%       36.0%  72.4
10943 --> 10940       960   1,100      396       28,171    0.21%       41.3%  72.4
10943 --> 10939       960   1,306      330       16,300    0.17%       34.4%  50.8
10939 --> 10943     1,306     960      330       16,300    0.17%       25.3%  50.8
11052 --> 11197     1,410   1,900      542       20,440    0.28%       38.4%  39.0
11197 --> 11052     1,900   1,410      542       20,440    0.28%       28.5%  39.0
10939 --> 10942     1,306   1,354      320       10,691    0.17%       24.5%  34.9
10942 --> 10939     1,354   1,306      320       10,691    0.17%       23.6%  34.9
10939 --> 10940     1,306   1,100      237        7,168    0.12%       18.1%  31.8
10940 --> 10939     1,100   1,306      237        7,168    0.12%       21.5%  31.8

Chi-Square rules have more support and larger confidence.
Two-Way Rules, Support > 0.1%
• Top 10 Ordered by Chi-Square (Support 1.2%, Confidence 71.2%, Lift 30.4):

                         ------ COUNT ------
Rule                     LHS   RHS   LHSRHS   Chi-square  Support  Confidence  Lift
12820, 12830 --> 12506   439   488      399       14,045    2.09%       90.9%   35.5
12506, 12820 --> 12830   469   498      399       12,843    2.09%       85.1%   32.5
12820, 12823 --> 12951   310   196      194       11,725    1.02%       62.6%   60.8
12510, 12820 --> 12507   144    94       90       11,362    0.47%       62.5%  126.7
11070, 11072 --> 11074    78   120       73       10,812    0.38%       93.6%  148.6
11052, 11196 --> 11197   276   480      275       10,754    1.44%       99.6%   39.5
11052, 11197 --> 11196   292   499      275        9,746    1.44%       94.2%   36.0
12820, 12951 --> 12823   194   382      194        9,578    1.02%      100.0%   49.9
11072, 11074 --> 11070    85   130       73        9,145    0.38%       85.9%  125.9
11070, 11074 --> 11072    80   139       73        9,088    0.38%       91.3%  125.1

• Top 10 Ordered by Lift (Support 0.18%, Confidence 78.3%, Lift 158.4):

                         ------ COUNT ------
Rule                     LHS   RHS   LHSRHS   Chi-square  Support  Confidence  Lift
11053, 12820 --> 11069    35    88       35        7,556    0.18%      100.0%  216.5
10939, 10941 --> 10929    50    60       25        3,942    0.13%       50.0%  158.8
11157, 11162 --> 11156    24   100       20        3,156    0.10%       83.3%  158.8
11158, 11162 --> 11156    24   100       20        3,156    0.10%       83.3%  158.8
11158, 11162 --> 11157    24   109       21        3,192    0.11%       87.5%  152.9
11070, 11072 --> 11074    78   120       73       10,812    0.38%       93.6%  148.6
11157, 11158 --> 11156    72   100       56        8,260    0.29%       77.8%  148.2
10929, 10939 --> 10944    31   104       25        3,669    0.13%       80.6%  147.7
10939, 10944 --> 10929    54    60       25        3,647    0.13%       46.3%  147.0
10939, 10941 --> 10944    50   104       40        5,829    0.21%       80.0%  146.5

Chi-Square rules have more support and comparable confidence.

Two-Way Rules, Expected Values >= 5


• Top 10 Ordered by Chi-Square (Support 1.3%, Confidence 68.5%, Lift 29.3):

                         ------ COUNT ------
Rule                     LHS     RHS   LHSRHS   Chi-square  Support  Confidence  Lift
12820, 12830 --> 12506     439   488      399       14,045    2.09%       90.9%  35.5
12506, 12820 --> 12830     469   498      399       12,843    2.09%       85.1%  32.5
11052, 11196 --> 11197     276   480      275       10,754    1.44%       99.6%  39.5
11052, 11197 --> 11196     292   499      275        9,746    1.44%       94.2%  36.0
11196, 11197 --> 11052     365   436      275        8,881    1.44%       75.3%  32.9
10939, 10943 --> 10940     219   372      170        6,626    0.89%       77.6%  39.8
10940, 10943 --> 10939     255   504      170        4,113    0.89%       66.7%  25.2
12804, 12820 --> 11989     238   380      114        2,598    0.60%       47.9%  24.0
11989, 12820 --> 12804     326   329      114        2,160    0.60%       35.0%  20.2
12820, 13190 --> 13144   2,588   333      329        2,096    1.73%       12.7%   7.3

• Top 10 Ordered by Lift (Support 1.2%, Confidence 71.2%, Lift 30.4):

                         ------ COUNT ------
Rule                     LHS     RHS   LHSRHS   Chi-square  Support  Confidence  Lift
10939, 10943 --> 10940     219   372      170        6,626    0.89%       77.6%  39.8
11052, 11196 --> 11197     276   480      275       10,754    1.44%       99.6%  39.5
11052, 11197 --> 11196     292   499      275        9,746    1.44%       94.2%  36.0
12820, 12830 --> 12506     439   488      399       14,045    2.09%       90.9%  35.5
11196, 11197 --> 11052     365   436      275        8,881    1.44%       75.3%  32.9
12506, 12820 --> 12830     469   498      399       12,843    2.09%       85.1%  32.5
10940, 10943 --> 10939     255   504      170        4,113    0.89%       66.7%  25.2
12804, 12820 --> 11989     238   380      114        2,598    0.60%       47.9%  24.0
11989, 12820 --> 12804     326   329      114        2,160    0.60%       35.0%  20.2
10834, 11168 --> 10821     236   400       93        1,618    0.49%       39.4%  18.8

Comparable support and confidence: expected value is a good filter
From One Rule to All of them:
Generating and Measuring All Rules

• This is important because there are modifications to the basic association rule algorithm that are very useful
• Two steps
  – Generate all candidate rules
  – Calculate measures (lift, support, confidence, chi-square) and choose appropriate rules
• Three methods of presentation
  – Dataflow diagrams
  – SQL queries
  – SAS code


One Way to Think of the Processing: Dataflow for One-Way Candidate Rules

This dataflow generates all candidate one-way rules (product A --> product B)
And Expressed in SQL:
Query for All One-Way Candidate Rules
CREATE TABLE candidates AS
SELECT lhs.orderid, lhs.lhs, rhs.rhs
FROM (SELECT DISTINCT orderid, productid as lhs
      FROM orderline) lhs JOIN
     (SELECT DISTINCT orderid, productid as rhs
      FROM orderline) rhs
     ON lhs.orderid = rhs.orderid AND
        lhs.lhs <> rhs.rhs

Notes on the query:
– “orderid” because we are looking for products in the same order.
– “JOIN” because all products being considered must be in the same order.
– “<>” because we want all candidate rules: “A --> B” is different from “B --> A”.
– “DISTINCT” because multiple order lines in the same order can have the same product.
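
For readers who think more easily in a procedural language, the same candidate generation can be sketched in Python. The toy `orderlines` list and the function name are hypothetical; the logic mirrors the self-join on orderid with distinct products and lhs <> rhs:

```python
from itertools import permutations

# Hypothetical toy data: (orderid, productid), with a repeated order line.
orderlines = [(1, 12820), (1, 12826), (1, 12820), (2, 12510), (3, 11048), (3, 11196)]


def one_way_candidates(lines):
    """All (orderid, lhs, rhs) with lhs != rhs among each order's distinct products."""
    products_by_order = {}
    for orderid, productid in lines:
        products_by_order.setdefault(orderid, set()).add(productid)  # DISTINCT
    return sorted(
        (orderid, lhs, rhs)
        for orderid, products in products_by_order.items()
        for lhs, rhs in permutations(products, 2)  # both A --> B and B --> A
    )


candidates = one_way_candidates(orderlines)
print(candidates)
# [(1, 12820, 12826), (1, 12826, 12820), (3, 11048, 11196), (3, 11196, 11048)]
```

Order 2 contributes nothing because it contains a single product.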


Or, in SAS:
Generating Candidate One-Way Rules
/* STEP 1: Prepare the data as orders with distinct products,
 * sorted on orderid */
proc sql;
   CREATE TABLE book.order_products as
   SELECT orderid, productid, COUNT(*) as num
   FROM book.orderline
   GROUP BY 1, 2
   ORDER BY 1, 2
   ;

/* STEP 2: Calculate the max number of products in any order */
proc sql;
   SELECT MAX(numprods)
   INTO :MAXNPRODS
   FROM (SELECT COUNT(DISTINCT productid) as numprods
         FROM book.orderline
         GROUP BY orderid) a
   ;

/* STEP 3: Create all product pairs within an order */
data book.candidates (keep=orderid lhs rhs);
   set book.order_products;
   by orderid;
   array prods[&MAXNPRODS] _TEMPORARY_;
   retain prods numprods;
   if first.orderid then numprods = 0;
   numprods+1;
   prods[numprods] = productid;
   if last.orderid then do;
      do i = 1 to numprods;
         lhs = prods[i];
         /* loop over every product so that both A --> B and B --> A are output */
         do j = 1 to numprods;
            rhs = prods[j];
            if j ~= i then output;
         end;
      end;
   end;
   else delete;
run;

Measuring the Goodness of Candidate Rules
using a Dataflow
– Number of times each rule appears
– Number of times each LHS appears
– Number of times each RHS appears


Calculating Measures in SQL: Support, Confidence, Lift, and Chi-Square
-- Note: if the database performs integer division, cast the counts to a
-- floating-point type (for example, in the subqueries) before dividing.
SELECT (SQUARE(explhsrhs - numlhsrhs)/explhsrhs +
        SQUARE(explhsnorhs - numlhsnorhs)/explhsnorhs +
        SQUARE(expnolhsrhs - numnolhsrhs)/expnolhsrhs +
        SQUARE(expnolhsnorhs - numnolhsnorhs)/expnolhsnorhs) as chisquare,
       b.*
FROM (SELECT lhsrhs.*, numorders, numlhs, numrhs,
             numlhs - numlhsrhs as numlhsnorhs,
             numrhs - numlhsrhs as numnolhsrhs,
             numlhs*numrhs/numorders as explhsrhs,
             numlhs*(numorders-numrhs)/numorders as explhsnorhs,
             (numorders-numlhs)*numrhs/numorders as expnolhsrhs,
             (numorders-numlhs)*(numorders-numrhs)/numorders as expnolhsnorhs,
             (numorders - numlhs - numrhs + numlhsrhs) as numnolhsnorhs,
             numlhsrhs/numorders as support,
             numlhsrhs/numlhs as confidence,
             numlhsrhs * numorders/(numlhs * numrhs) as lift
      FROM (<LHS-->RHS subquery>) lhsrhs JOIN
           (<LHS subquery>) sumlhs
           ON lhsrhs.lhs = sumlhs.lhs JOIN
           (<RHS subquery>) sumrhs
           ON lhsrhs.rhs = sumrhs.rhs CROSS JOIN
           (<ALL subquery>) a
     ) b
Summary Calculations Are In The Subqueries
-- LHS-->RHS subquery
SELECT lhs, rhs, COUNT(*) as numlhsrhs
FROM candidates
GROUP BY lhs, rhs

-- LHS subquery
SELECT lhs, COUNT(DISTINCT orderid) as numlhs
FROM candidates
GROUP BY lhs

-- RHS subquery
SELECT rhs, COUNT(DISTINCT orderid) as numrhs
FROM candidates
GROUP BY rhs

-- ALL subquery
SELECT COUNT(DISTINCT orderid) as numorders
FROM candidates
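
The four summaries are simple counts over the candidates table. A Python sketch with a hypothetical toy `candidates` list of (orderid, lhs, rhs) tuples, assuming (as in the queries above) that all counts come from the candidates themselves:

```python
from collections import Counter

# Hypothetical candidates: order 1 = {A, B}, order 2 = {A, B, C}, order 3 = {A, C}.
candidates = [
    (1, "A", "B"), (1, "B", "A"),
    (2, "A", "B"), (2, "B", "A"), (2, "A", "C"),
    (2, "C", "A"), (2, "B", "C"), (2, "C", "B"),
    (3, "A", "C"), (3, "C", "A"),
]

num_lhsrhs = Counter((lhs, rhs) for _, lhs, rhs in candidates)
# COUNT(DISTINCT orderid) per product: deduplicate (orderid, product) first.
num_lhs = Counter(lhs for _, lhs in {(o, lhs) for o, lhs, _ in candidates})
num_rhs = Counter(rhs for _, rhs in {(o, rhs) for o, _, rhs in candidates})
num_orders = len({o for o, _, _ in candidates})

lhs, rhs = "A", "B"
support = num_lhsrhs[(lhs, rhs)] / num_orders
confidence = num_lhsrhs[(lhs, rhs)] / num_lhs[lhs]
lift = num_lhsrhs[(lhs, rhs)] * num_orders / (num_lhs[lhs] * num_rhs[rhs])
print(support, confidence, lift)
```

One consequence of counting from the candidates is that orders with a single product never appear in any of the totals.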


For SAS, We Start With The Same Four Summaries
proc sql;
   CREATE TABLE book.sumlhsrhs as
   SELECT lhs, rhs,
          COUNT(DISTINCT orderid) as numlhsrhs
   FROM book.candidates
   GROUP BY lhs, rhs;

proc sql;
   CREATE TABLE book.sumlhs as
   SELECT lhs,
          COUNT(DISTINCT orderid) as numlhs
   FROM book.candidates
   GROUP BY lhs
   ORDER BY 1;
   CREATE INDEX lhs ON book.sumlhs(lhs);

proc sql;
   CREATE TABLE book.sumrhs as
   SELECT rhs,
          COUNT(DISTINCT orderid) as numrhs
   FROM book.candidates
   GROUP BY rhs
   ORDER BY 1;
   CREATE INDEX rhs ON book.sumrhs(rhs);

proc sql;
   SELECT COUNT(DISTINCT orderid) as numorders
   INTO :NUMORDERS
   FROM book.candidates;

Then the Measures Can Be Calculated in a Data Step
data book.rules (keep=lhs rhs numorders numlhs numrhs numlhsrhs
                      support confidence lift chisquare);
   set book.sumlhsrhs;
   set book.sumlhs key=lhs;
   set book.sumrhs key=rhs;
   numorders = &NUMORDERS;
   support = numlhsrhs / numorders;
   confidence = numlhsrhs / numlhs;
   lift = (numlhsrhs * numorders) / (numlhs * numrhs);
   numlhsnorhs = numlhs - numlhsrhs;
   numnolhsrhs = numrhs - numlhsrhs;
   numnolhsnorhs = numorders - numlhs - numrhs + numlhsrhs;
   explhsrhs = numlhs * numrhs / numorders;
   explhsnorhs = numlhs * (numorders - numrhs) / numorders;
   expnolhsrhs = (numorders - numlhs) * numrhs / numorders;
   expnolhsnorhs = (numorders - numlhs) * (numorders - numrhs) / numorders;
   /* each cell is compared with its own expected value */
   chisquare = ((numlhsrhs - explhsrhs)**2/explhsrhs +
                (numlhsnorhs - explhsnorhs)**2/explhsnorhs +
                (numnolhsrhs - expnolhsrhs)**2/expnolhsrhs +
                (numnolhsnorhs - expnolhsnorhs)**2/expnolhsnorhs);
run;

proc sort data=book.rules; by descending chisquare;
run;

Extending the Ideas

• More complicated rules
  – With restrictions on orders
• Sequential association rules
• Association rules within a household
• Purchases within a household, but not at the same time
• Associations on attributes of products
• Heterogeneous rules
  – Products on left hand side different from products on right hand side

More Complicated Rules

• One-way rules may not be sufficient
• More complicated rules involve more joins
• To improve efficiency of calculation, we introduce filters
  – 2-way rules require at least 3 products in each order
  – 3-way rules require at least 4 products
  – And so on.


However, Filtering Out Orders with Too Few Products Can Improve Performance

– This finds orders that have three or more products to improve performance (we might want to add “and numprods < 20”)
– A,B on the left hand side is the same as B,A
– Larger rules = more joins


The SQL Version
CREATE TABLE candidates_pp2p as
-- orderid is carried along because the measures queries count distinct orders
SELECT lhs1.orderid, lhs1.lhs1, lhs2.lhs2,
       lhs1.lhs1||','||lhs2.lhs2 as lhs, rhs.rhs
FROM (SELECT orderid FROM orderline GROUP BY orderid
      HAVING COUNT(DISTINCT productid) > 2) o_filter JOIN
     (SELECT orderid, productid as lhs1
      FROM orderline
      GROUP BY orderid, productid) lhs1
     ON o_filter.orderid = lhs1.orderid JOIN
     (SELECT orderid, productid as lhs2
      FROM orderline
      GROUP BY orderid, productid) lhs2
     ON lhs1.orderid = lhs2.orderid AND
        lhs1.lhs1 < lhs2.lhs2 JOIN
     (SELECT orderid, productid as rhs
      FROM orderline
      GROUP BY orderid, productid) rhs
     ON lhs1.orderid = rhs.orderid AND
        rhs.rhs NOT IN (lhs1.lhs1, lhs2.lhs2)
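
The two-way candidate logic can be sketched in Python as well. The toy data and the function name are hypothetical; the sketch keeps only orders with at least three distinct products, treats the left hand side as an unordered pair, and excludes the left hand side products from the right hand side:

```python
from itertools import combinations

# Hypothetical toy data: order 1 has three products, order 2 has two.
orderlines = [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20)]


def two_way_candidates(lines):
    """All (orderid, lhs1, lhs2, rhs) with lhs1 < lhs2 and rhs not on the LHS."""
    products_by_order = {}
    for orderid, productid in lines:
        products_by_order.setdefault(orderid, set()).add(productid)
    out = []
    for orderid, products in products_by_order.items():
        if len(products) < 3:  # filter: a 2-way rule needs 3 products
            continue
        for lhs1, lhs2 in combinations(sorted(products), 2):  # A,B same as B,A
            for rhs in sorted(products - {lhs1, lhs2}):
                out.append((orderid, lhs1, lhs2, rhs))
    return sorted(out)


candidates = two_way_candidates(orderlines)
print(candidates)  # [(1, 10, 20, 30), (1, 10, 30, 20), (1, 20, 30, 10)]
```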


Finding Association Rules Within a Household

• First generate all candidate rules for all households
  – Use filter to restrict to households with at least 2 products
  – This is basically joining in additional tables to get the householdid, then using householdid instead of orderid, and following the same procedure as just shown
• By naming things carefully, we can use the same code (SQL or SAS) for calculating the measurements

Join In the Appropriate Tables to Get the Household ID


This is a Minor Modification to the Data Step for Generating Candidates
/* STEP 3: Create all product pairs within an order */
data book.candidates_p2p (keep=hhid lhs rhs);
   set book.order_products;
   by orderid;
   set book.orders key=orderid;
   set book.customer key=customerid;
   array prods[&MAXNPRODS] _TEMPORARY_;
   retain prods numprods;
   if first.orderid then numprods = 0;
   numprods+1;
   prods[numprods] = productid;
   if last.orderid then do;
      do i = 1 to numprods;
         lhs = prods[i];
         /* loop over every product so that both A --> B and B --> A are output */
         do j = 1 to numprods;
            rhs = prods[j];
            if j ~= i then output;
         end;
      end;
   end;
   else delete;
run;

Then calculating the measures (chi-square, lift, etc.) is the same as before.
Sequential Association Rules

• These take place at the household level, so this is a modification of the previous approach
• These require that the products on the left hand side are purchased before those on the right hand side
• This is a simple modification to the candidates query
• Then the measurement methods are the same


Generating the Candidates Uses the Order Date

Sequential Rule Candidates Use Order Date Information
CREATE TABLE candidates_pthenp AS
SELECT lhs.householdid, lhs.lhs, rhs.rhs
FROM (SELECT householdid
      FROM orderline ol JOIN orders o ON ol.orderid = o.orderid JOIN
           customer c on c.customerid = o.customerid
      GROUP BY householdid
      HAVING COUNT(DISTINCT productid) > 1) filter JOIN
     (SELECT householdid, orderdate as lhsdate, productid as lhs
      FROM orderline ol JOIN orders o ON ol.orderid = o.orderid JOIN
           customer c on c.customerid = o.customerid
      GROUP BY householdid, orderdate, productid) lhs
     ON filter.householdid = lhs.householdid JOIN
     (SELECT householdid, orderdate as rhsdate, productid as rhs
      FROM orderline ol JOIN orders o ON ol.orderid = o.orderid JOIN
           customer c on c.customerid = o.customerid
      GROUP BY householdid, orderdate, productid) rhs
     ON lhs.householdid = rhs.householdid AND
        lhs.lhsdate < rhs.rhsdate AND
        rhs.rhs <> lhs.lhs

Then calculating the measures is the same.
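
The date condition is the only new ingredient. A Python sketch with hypothetical toy purchases; unlike the query above, it deduplicates the output so that each household contributes a given rule at most once, which is an assumption made here for simplicity:

```python
from datetime import date

# Hypothetical toy data: (householdid, orderdate, productid).
purchases = [
    (100, date(2007, 1, 5), "A"),
    (100, date(2007, 2, 9), "B"),
    (100, date(2007, 2, 9), "C"),
    (200, date(2007, 3, 1), "A"),
]


def sequential_candidates(rows):
    """(householdid, lhs, rhs) where lhs was bought strictly before rhs."""
    distinct = set(rows)
    return sorted({
        (hh_lhs, lhs, rhs)
        for hh_lhs, lhs_date, lhs in distinct
        for hh_rhs, rhs_date, rhs in distinct
        if hh_lhs == hh_rhs and lhs_date < rhs_date and lhs != rhs
    })


candidates = sequential_candidates(purchases)
print(candidates)  # [(100, 'A', 'B'), (100, 'A', 'C')]
```

B and C were bought on the same date, so neither precedes the other and no rule between them is generated.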


What Products Are Purchased by a Household But Not in the Same Order?

• Cross-selling opportunities: products purchased by a household but not at the same time
  – Ignores influences such as product bundles
• The approach is similar to sequential association rules, but ordering is not important
• Once again, this changes the candidates, but not the measure query
  – The order on the right hand side is different from the orders on the left hand side
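
A Python sketch of this variation, using hypothetical toy data. The only change from the sequential version is that the comparison is on order identity (different orders) instead of on order date:

```python
# Hypothetical toy data: (householdid, orderid, productid).
purchases = [
    (100, 1, "A"), (100, 1, "B"), (100, 2, "C"),
    (200, 3, "A"), (200, 3, "B"),
]


def cross_order_candidates(rows):
    """(householdid, lhs, rhs) where lhs and rhs come from different orders."""
    distinct = set(rows)
    return sorted({
        (hh_lhs, lhs, rhs)
        for hh_lhs, order_lhs, lhs in distinct
        for hh_rhs, order_rhs, rhs in distinct
        if hh_lhs == hh_rhs and order_lhs != order_rhs and lhs != rhs
    })


candidates = cross_order_candidates(purchases)
print(candidates)
# [(100, 'A', 'C'), (100, 'B', 'C'), (100, 'C', 'A'), (100, 'C', 'B')]
```

Household 200 bought A and B together in one order, so it produces no candidates.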

Very Similar to the Sequential Rules


Associations on Product Attributes Instead of Products

• Products have descriptive information, such as:
  – Brand
  – Color
  – Department
  – Reduced price
• How can we use this information in rules?
• Naively including descriptors as items “works”
  – But it usually produces rules that are trivial (“diet” + “soda” --> “diet Coke”)
• Only consider attributes that appear in an order but not on the same product
  – This is a modification to the candidates query
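
One possible reading of that restriction, sketched in Python: pair up attributes only when they come from two different products in the same order. The attribute data, the order contents, and the function name are all hypothetical:

```python
from itertools import permutations

# Hypothetical product descriptors and order contents.
product_attrs = {"diet cola": {"diet", "soda"}, "chips": {"snack", "salty"}}
order_products = {1: ["diet cola", "chips"], 2: ["diet cola"]}


def attribute_candidates(orders, attrs):
    """(orderid, lhs_attr, rhs_attr) taken from two different products in an order."""
    out = set()
    for orderid, products in orders.items():
        for p_lhs, p_rhs in permutations(set(products), 2):
            for a in attrs[p_lhs]:
                for b in attrs[p_rhs]:
                    if a != b:
                        out.add((orderid, a, b))
    return sorted(out)


candidates = attribute_candidates(order_products, product_attrs)
print(len(candidates))                      # 8
print((1, "diet", "soda") in candidates)    # False
```

The trivial pair "diet" and "soda" never appears because both attributes describe the same product.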
Rules Contain Attributes from Different Products


Heterogeneous Associations

• These are rules that have different types of items on the left hand side and right hand side
• What web page visits lead to purchases?
  – Web pages --> Purchases
• What direct marketing history leads to signing up?
  – Marketing campaigns --> Signing up
• Which email campaigns lead to complaints?
  – Email clicks --> Complaints
• These simply require setting up the candidate rules correctly

RHS Comes From a Different Source than the LHS
-- The filter on the original slide joined clicks to another table that is not
-- shown; this version filters on clicks alone and uses timeid throughout.
SELECT item_lhs1.customerid, item_lhs1.lhs as lhs1, item_lhs2.lhs as lhs2,
       (CAST(item_lhs1.lhs as VARCHAR)+', '+
        CAST(item_lhs2.lhs as VARCHAR)) as lhs, item_rhs.rhs
FROM (SELECT c.customerid
      FROM clicks c
      GROUP BY c.customerid
      HAVING COUNT(DISTINCT c.categoryid) > 1) filter JOIN
     (SELECT customerid, categoryid as lhs, timeid as lhs_time
      FROM clicks c
      GROUP BY customerid, categoryid, timeid) item_lhs1
     ON filter.customerid = item_lhs1.customerid JOIN
     (SELECT customerid, categoryid as lhs, timeid as lhs_time
      FROM clicks c
      GROUP BY customerid, categoryid, timeid) item_lhs2
     ON item_lhs1.customerid = item_lhs2.customerid AND
        item_lhs1.lhs < item_lhs2.lhs JOIN
     (SELECT customerid, categoryid as rhs, timeid as rhs_time
      FROM complaints co with (nolock)
      GROUP BY customerid, categoryid, timeid) item_rhs
     ON item_lhs1.customerid = item_rhs.customerid AND
        item_rhs.rhs NOT IN (item_lhs1.lhs, item_lhs2.lhs) AND
        item_rhs.rhs_time > item_lhs1.lhs_time AND
        item_rhs.rhs_time > item_lhs2.lhs_time
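
The same heterogeneous candidate logic can be sketched in Python: two click categories on the left, one later complaint category on the right. The toy `clicks` and `complaints` lists, with integer time values, are hypothetical:

```python
# Hypothetical toy data: (customerid, category, time).
clicks = [(1, "Telecom", 10), (1, "Travel", 12), (2, "Debt", 5)]
complaints = [(1, "Loans", 20), (1, "Travel", 11), (2, "Credit Report", 9)]


def click_complaint_candidates(click_rows, complaint_rows):
    """(customerid, lhs1, lhs2, rhs): two clicked categories, then a complaint."""
    out = set()
    for cust1, cat1, time1 in click_rows:
        for cust2, cat2, time2 in click_rows:
            if cust1 != cust2 or not cat1 < cat2:  # unordered LHS pair
                continue
            for cust3, cat3, time3 in complaint_rows:
                if (cust3 == cust1 and cat3 not in (cat1, cat2)
                        and time3 > time1 and time3 > time2):
                    out.add((cust1, cat1, cat2, cat3))
    return sorted(out)


candidates = click_complaint_candidates(clicks, complaints)
print(candidates)  # [(1, 'Telecom', 'Travel', 'Loans')]
```

Customer 2 clicked in only one category, so no two-item left hand side exists for that customer.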

Evaluating the Rules Uses the Same Measures Query

Clicks Imply Complaint Rules                               Chi-Square
Telecom + Travel ==> Loans                                      299.0
Telecom + Government Grants ==> Credit Report                   299.0
Government Grants + Gifts ==> Credit Report                     299.0
Education + College/Scholarship ==> [Uncategorized]             149.0
Debt + Telecom ==> Credit Report                                149.0
Debt + Government Grants ==> Credit Report                      149.0
Debt + Gifts ==> Credit Report                                  149.0
Credit Card + Travel ==> Loans                                   99.0
Credit Card + Government Grants ==> Credit Report                99.0
Entrepreneurial + Credit Report ==> Home Improvement             74.0

(1) Customers do not like receiving email offers about Credit Reports.
(2) They also do not like offers radically different from their past interests.
Conclusion

• The chi-square measure is a very natural way to measure association rules
• By separating the creation of candidate rules from measuring them, it is possible to generate many types of rules
  – Sequential rules
  – Within a household, but not within an order
  – Using attributes of products
  – Heterogeneous rules
  – Etc.
• Beer and diapers is a legend. It shows the power of accessible ideas and demonstrates the potential value lying in data using association rules built on mundane “query tools”.
