Gordon S. Linoff
Founder
Data Miners, Inc.
gordon@data-miners.com
Agenda
[Figure: chart of Number of Households by product group (APPAREL, ARTWORK, BOOK, CALENDAR, FREEBIE, GAME, OCCASION, OTHER). Left axis: Number of Households, 0 to 100,000. Right axis: 0% to 50%.]
                        COUNT
Rule               LHS   RHS   LHSRHS   Support   Confidence        Lift
10874 --> 10879      1     1        1      0.0%       100.0%   192,983.0
12665 --> 10705      1     1        1      0.0%       100.0%   192,983.0
12935 --> 12190      1     1        1      0.0%       100.0%   192,983.0
13224 --> 13859      1     1        1      0.0%       100.0%   192,983.0
13779 --> 13232      1     1        1      0.0%       100.0%   192,983.0
10878 --> 10892      1     1        1      0.0%       100.0%   192,983.0
13495 --> 12353      1     1        1      0.0%       100.0%   192,983.0
12717 --> 11786      1     1        1      0.0%       100.0%   192,983.0
13238 --> 13752      1     1        1      0.0%       100.0%   192,983.0
11902 --> 11915      1     1        1      0.0%       100.0%   192,983.0
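The three measures in this table follow directly from the three counts. Below is a minimal Python sketch of that arithmetic (Python is used for illustration only; the talk itself works in SQL and SAS). The function name `rule_measures` is made up here, and the total of 192,983 orders is inferred from the lift column rather than stated on the slide.

```python
def rule_measures(lhs_count, rhs_count, both_count, total_orders):
    """Return (support, confidence, lift) for the rule LHS --> RHS."""
    support = both_count / total_orders       # share of all orders containing both sides
    confidence = both_count / lhs_count       # share of LHS orders that also contain RHS
    baseline = rhs_count / total_orders       # share of all orders containing RHS
    lift = confidence / baseline              # how much the rule beats the baseline
    return support, confidence, lift


# A rule whose two products each appear in exactly one order, the same one.
# 192,983 is an assumed order total, inferred from the lift values in the table.
support, confidence, lift = rule_measures(1, 1, 1, 192_983)
print(f"support={support:.4%} confidence={confidence:.1%} lift={lift:,.1f}")
```

With every count equal to 1, confidence is 100% and lift equals the number of orders, which is why rules seen in a single order crowd out everything else when sorting by lift.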
A bit better,
but is there a better way?
©2007 Data Miners, Inc.
http://www.data-miners.com
A Better Approach: Chi-Square
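Observed counts (BOTH = LHS YES and RHS YES):
              RHS YES      RHS NO
  LHS YES       2,588    [missing]
  LHS NO       15,853     170,302

Expected values (row sum times column sum, divided by the overall sum):
              RHS YES      RHS NO
  LHS YES    [missing]   [missing]
  LHS NO     18,109.8   168,045.2

Deviation (observed minus expected):
              RHS YES      RHS NO
  LHS YES    [missing]   [missing]
  LHS NO     -2,256.8     2,256.8

Deviation squared / expected value:
              RHS YES      RHS NO
  LHS YES    15,380.6     1,657.5
  LHS NO     [missing]   [missing]

CHI-SQUARE: 17,349.7

– Case 2: 12820 --> NOT 13190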
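The calculation on this slide can be sketched in a few lines of Python. This is an illustration, not the author's code: the function name `chi_square_2x2` is made up, and the LHS YES / RHS NO count of 816 does not appear in the surviving slide text. It is an inferred value, chosen because it is the one consistent with the expected values and the chi-square total that do appear.

```python
def chi_square_2x2(observed):
    """Chi-square statistic for a 2x2 table given as [[a, b], [c, d]].

    Returns (chi_square, expected), where expected has the same shape
    as observed.
    """
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    chi_square = 0.0
    expected = []
    for i, row in enumerate(observed):
        expected_row = []
        for j, count in enumerate(row):
            e = row_sums[i] * col_sums[j] / total   # row sum * column sum / overall sum
            expected_row.append(e)
            chi_square += (count - e) ** 2 / e      # deviation squared / expected value
        expected.append(expected_row)
    return chi_square, expected


# Rows: LHS YES, LHS NO. Columns: RHS YES, RHS NO.
# 816 is an inferred value (assumption), not a number printed on the slide.
observed = [[2588, 816], [15853, 170302]]
chi_square, expected = chi_square_2x2(observed)
print(f"chi-square = {chi_square:,.1f}")
print(f"expected LHS NO row = {expected[1][0]:,.1f}, {expected[1][1]:,.1f}")
```

Under that assumption the result lands at roughly 17,349.7, and the expected LHS NO row comes out near 18,109.8 and 168,045.2, in line with the slide.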
                        COUNT
Rule               LHS   RHS   LHSRHS   Chi-square   Support   Confidence        Lift
10878 --> 10899      1     1        1      192,983     0.00%       100.0%   192,983.0
13573 --> 11385      1     1        1      192,983     0.00%       100.0%   192,983.0
13842 --> 12885      1     1        1      192,983     0.00%       100.0%   192,983.0
10874 --> 10872      1     1        1      192,983     0.00%       100.0%   192,983.0
13859 --> 13305      1     1        1      192,983     0.00%       100.0%   192,983.0
10879 --> 10888      1     1        1      192,983     0.00%       100.0%   192,983.0
11000 --> 14030      1     1        1      192,983     0.00%       100.0%   192,983.0
14009 --> 14004      1     1        1      192,983     0.00%       100.0%   192,983.0
11228 --> 11223      1     1        1      192,983     0.00%       100.0%   192,983.0
10901 --> 10892      1     1        1      192,983     0.00%       100.0%   192,983.0
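Why every row here shows a chi-square of 192,983 can be checked with the standard shortcut formula for a 2x2 table (a textbook identity, not something taken from the slides). Write the observed counts as a (both), b (LHS only), c (RHS only) and d (neither), with N = a + b + c + d:

```latex
\chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}
\qquad\text{and with } a = 1,\ b = c = 0,\ d = N - 1:\qquad
\chi^2 = \frac{N\,(N-1)^2}{1 \cdot (N-1) \cdot 1 \cdot (N-1)} = N
```

So for a pair of products that each occur in one order, and in the same order, the statistic equals the total number of orders, exactly as lift does.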
Top 10 Ordered by Lift (averages across the ten rules):
– Support, 0.15%
– Confidence, 59.9%
– Lift, 241.0

                         COUNT
Rule                LHS     RHS   LHSRHS   Chi-square   Support   Confidence    Lift
11051 --> 11050     369     294      220       85,953     0.11%        59.6%   391.4
11050 --> 11051     294     369      220       85,953     0.11%        74.8%   391.4
11064 --> 11067     480     347      275       87,447     0.14%        57.3%   318.6
11067 --> 11064     347     480      275       87,447     0.14%        79.3%   318.6
12823 --> 12951     747     201      201       51,780     0.10%        26.9%   258.3
12951 --> 12823     201     747      201       51,780     0.10%       100.0%   258.3
12506 --> 12830     615   1,021      530       86,003     0.27%        86.2%   162.9
12830 --> 12506   1,021     615      530       86,003     0.27%        51.9%   162.9
11097 --> 11095   1,090     548      229       16,629     0.12%        21.0%    74.0
11095 --> 11097     548   1,090      229       16,629     0.12%        41.8%    74.0
Top 10 Ordered by Lift (averages across the ten rules):
– Support, 0.19%
– Confidence, 29.2%
– Lift, 45.8

                         COUNT
Rule                LHS     RHS   LHSRHS   Chi-square   Support   Confidence    Lift
10940 --> 10943   1,100     960      396       28,171     0.21%        36.0%    72.4
10943 --> 10940     960   1,100      396       28,171     0.21%        41.3%    72.4
10943 --> 10939     960   1,306      330       16,300     0.17%        34.4%    50.8
10939 --> 10943   1,306     960      330       16,300     0.17%        25.3%    50.8
11052 --> 11197   1,410   1,900      542       20,440     0.28%        38.4%    39.0
11197 --> 11052   1,900   1,410      542       20,440     0.28%        28.5%    39.0
10939 --> 10942   1,306   1,354      320       10,691     0.17%        24.5%    34.9
10942 --> 10939   1,354   1,306      320       10,691     0.17%        23.6%    34.9
10939 --> 10940   1,306   1,100      237        7,168     0.12%        18.1%    31.8
10940 --> 10939   1,100   1,306      237        7,168     0.12%        21.5%    31.8
Top 10 Ordered by Lift (averages across the ten rules):
– Support, 0.18%
– Confidence, 78.3%
– Lift, 158.4

                               COUNT
Rule                      LHS   RHS   LHSRHS   Chi-square   Support   Confidence    Lift
11053, 12820 --> 11069     35    88       35        7,556     0.18%       100.0%   216.5
10939, 10941 --> 10929     50    60       25        3,942     0.13%        50.0%   158.8
11157, 11162 --> 11156     24   100       20        3,156     0.10%        83.3%   158.8
11158, 11162 --> 11156     24   100       20        3,156     0.10%        83.3%   158.8
11158, 11162 --> 11157     24   109       21        3,192     0.11%        87.5%   152.9
11070, 11072 --> 11074     78   120       73       10,812     0.38%        93.6%   148.6
11157, 11158 --> 11156     72   100       56        8,260     0.29%        77.8%   148.2
10929, 10939 --> 10944     31   104       25        3,669     0.13%        80.6%   147.7
10939, 10944 --> 10929     54    60       25        3,647     0.13%        46.3%   147.0
10939, 10941 --> 10944     50   104       40        5,829     0.21%        80.0%   146.5
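The rules in this table have two products on the left-hand side. One way to enumerate such candidates is sketched below in Python; this is an illustrative sketch under assumptions, not the method used to produce the table. The function name `two_way_candidates` and the sample orders are made up.

```python
from itertools import combinations


def two_way_candidates(order_products):
    """Yield (lhs_pair, rhs) candidates for each order.

    order_products maps an order id to the products in that order.
    lhs_pair is a sorted 2-tuple of distinct products; rhs is any other
    product in the same order.
    """
    for products in order_products.values():
        distinct = sorted(set(products))             # ignore repeated products
        for lhs_pair in combinations(distinct, 2):   # every unordered pair for the LHS
            for rhs in distinct:
                if rhs not in lhs_pair:              # RHS must differ from both LHS items
                    yield lhs_pair, rhs


# Made-up orders for illustration only.
orders = {1: [10939, 10941, 10929], 2: [10939, 10941]}
candidates = list(two_way_candidates(orders))
print(candidates)
```

An order with three distinct products yields three candidates, and an order with only two yields none, since nothing is left for the right-hand side. Counting the candidates across all orders gives the LHSRHS column.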
Top 10 Ordered by Lift (averages across the ten rules):
– Support, 1.2%
– Confidence, 71.2%
– Lift, 30.4

                               COUNT
Rule                      LHS   RHS   LHSRHS   Chi-square   Support   Confidence    Lift
10939, 10943 --> 10940    219   372      170        6,626     0.89%        77.6%    39.8
11052, 11196 --> 11197    276   480      275       10,754     1.44%        99.6%    39.5
11052, 11197 --> 11196    292   499      275        9,746     1.44%        94.2%    36.0
12820, 12830 --> 12506    439   488      399       14,045     2.09%        90.9%    35.5
11196, 11197 --> 11052    365   436      275        8,881     1.44%        75.3%    32.9
12506, 12820 --> 12830    469   498      399       12,843     2.09%        85.1%    32.5
10940, 10943 --> 10939    255   504      170        4,113     0.89%        66.7%    25.2
12804, 12820 --> 11989    238   380      114        2,598     0.60%        47.9%    24.0
11989, 12820 --> 12804    326   329      114        2,160     0.60%        35.0%    20.2
10834, 11168 --> 10821    236   400       93        1,618     0.49%        39.4%    18.8
Or, in SAS:
Generating Candidate One-Way Rules
/* STEP 1: Prepare the data as orders
 * with distinct products,
 * sorted on orderid */
proc sql;
    CREATE TABLE book.order_products as
    SELECT orderid, productid,
           COUNT(*) as num
    FROM book.orderline
    GROUP BY 1, 2
    ORDER BY 1, 2
    ;
quit;

/* STEP 2: Calculate the max number
 * of products in any order */
proc sql;
    SELECT MAX(numprods)
    INTO :MAXNPRODS
    FROM (SELECT COUNT(DISTINCT productid
                      ) as numprods
          FROM book.orderline
          GROUP BY orderid) a
    ;
quit;

/* STEP 3: Create all product pairs
 * within an order */
data book.candidates
     (keep=orderid lhs rhs);
    set book.order_products; by orderid;
    array prods[&MAXNPRODS] _TEMPORARY_;
    retain prods numprods;
    /* collect the products of the current order */
    if first.orderid then numprods = 0;
    numprods+1;
    prods[numprods] = productid;
    if last.orderid then do;
        do i = 1 to numprods;
            lhs = prods[i];
            /* pair each product with every other product in the order,
             * so that both A --> B and B --> A become candidates */
            do j = 1 to numprods;
                rhs = prods[j];
                if j ~= i then output;
            end;
        end;
    end;
    else delete;
run;
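For readers who do not use SAS, the same idea can be sketched in Python: reduce each order to its distinct products, then emit product pairs within the order. This is a hedged illustration, not a translation of the author's code. The function name `candidate_pairs` and the sample rows are made up, and the sketch emits both orderings of each pair on the assumption that a rule and its reverse are both wanted, as in the rule tables earlier in the deck.

```python
from collections import defaultdict
from itertools import permutations


def candidate_pairs(order_lines):
    """Build candidate one-way rules from (orderid, productid) rows.

    Rows may repeat a product within an order. Returns a sorted list of
    (orderid, lhs, rhs) tuples with lhs != rhs.
    """
    products_by_order = defaultdict(set)
    for orderid, productid in order_lines:
        products_by_order[orderid].add(productid)   # distinct products per order
    rows = []
    for orderid, products in products_by_order.items():
        # permutations(..., 2) yields every ordered pair of different products
        for lhs, rhs in permutations(sorted(products), 2):
            rows.append((orderid, lhs, rhs))
    return sorted(rows)


# Made-up order lines for illustration only.
order_lines = [(1, 11051), (1, 11050), (1, 11051), (2, 11064)]
pairs = candidate_pairs(order_lines)
print(pairs)
```

Orders with a single distinct product contribute no candidates. Grouping the result by (lhs, rhs) and counting orders gives the LHSRHS count for each rule.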
Heterogeneous Associations
These are rules that have different types of items on the left-hand side and the right-hand side.
What web page visits lead to purchases?
– Web Pages --> purchases
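A heterogeneous rule of this kind can be counted much like a product rule, except that the two sides come from different data sets. The Python sketch below is only an illustration under assumptions: it links page visits and purchases by a customer id (a session id would work the same way), and the function name `page_to_purchase_rules`, the page paths and the product names are all made up.

```python
from collections import Counter


def page_to_purchase_rules(visits, purchases):
    """Confidence of each rule page --> product.

    visits maps a customer id to the set of pages visited.
    purchases maps a customer id to the set of products bought.
    Returns {(page, product): confidence}.
    """
    page_counts = Counter()   # customers who visited each page (LHS count)
    pair_counts = Counter()   # customers who visited the page and bought the product
    for customer, pages in visits.items():
        bought = purchases.get(customer, set())
        for page in pages:
            page_counts[page] += 1
            for product in bought:
                pair_counts[(page, product)] += 1
    return {pair: n / page_counts[pair[0]] for pair, n in pair_counts.items()}


# Made-up data for illustration only.
visits = {"c1": {"/home", "/books"}, "c2": {"/books"}, "c3": {"/home"}}
purchases = {"c1": {"BOOK"}, "c2": {"BOOK"}}
rules = page_to_purchase_rules(visits, purchases)
print(rules)
```

Because pages only ever appear on the left and products only on the right, there is no need to exclude an item pairing with itself, and no reverse rule is generated.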