
Decision Support Systems 2012/2013

MEIC - TagusPark

Homework #5
Due: 15.Apr.2013

1 Frequent Pattern Mining


1. Consider the database D depicted in Table 1, containing five transactions, each composed of several items.
Consider minsup = 60% and minconf = 80%.

Table 1: Database D of transactions to be analyzed.


TID Items

T100 {B, O, N, E, C, O}
T200 {B, O, N, E, C, A}
T300 {C, A, N, E, C, A}
T400 {F, A, N, E, C, A}
T500 {F, A, C, A}

(a) (1 val.) Using FP-growth algorithm, find all frequent 4- and 3-itemsets in the database D.

Solution:
The FP-growth algorithm starts by building the set C1 of candidate 1-itemsets and their supports, from which the FP-tree is then computed. From the provided data, we get

C1:  Item  Count
     B     2
     O     2
     N     4
     E     4
     C     5
     A     4
     F     2

where N, E, C and A are the itemsets above minsup.¹ Sorting the frequent 1-itemsets in
decreasing order of support, we get C → N → E → A and use this order to build the following
FP-tree:
Root
└── C:5
    ├── N:4
    │   └── E:4
    │       └── A:3
    └── A:1

Header table item order: C → N → E → A.
To determine the frequent 4- and 3-itemsets, we build our conditional pattern bases, including only those itemsets with 3 and 4 items. This leads to:
Item  Cond. Pattern Base  Cond. Tree       Frequent Pattern

A     {{CNE} : 3}         ⟨C:3, N:3, E:3⟩  ∅
E     {{CN} : 4}          ⟨C:4, N:4⟩       {CNE} : 4
We can then conclude that the only frequent 3-itemset is {CN E} and there are no frequent 4-itemsets.
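As a sanity check, this result can be reproduced by brute-force support counting over the five transactions (this is not FP-growth itself, only a direct verification sketch):

```python
from itertools import combinations

# The five transactions as sets (repeated items, e.g. the two O's in T100, collapse).
D = [set("BONEC"), set("BONECA"), set("CANE"), set("FANEC"), set("FAC")]
minsup = 0.6  # strict reading: frequent means support > minsup (see footnote)

def support(S):
    """Relative support of itemset S: fraction of transactions containing S."""
    return sum(S <= t for t in D) / len(D)

items = sorted(set().union(*D))
frequent = {k: [c for c in combinations(items, k) if support(set(c)) > minsup]
            for k in (3, 4)}
print(frequent)  # {3: [('C', 'E', 'N')], 4: []}
```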

(b) (1 val.) Consider the frequent itemsets computed in (a). Without computing the corresponding
support, show that any subitemset of such frequent itemsets must also be frequent. Use this fact to
compute frequent 2- and 1-itemsets.

Solution:
If minsup denotes the minimum (relative) support, an itemset S is a frequent itemset if
sup% (S, D) ≥ minsup or, equivalently, if supc (S, D) ≥ minsup × |D|, where |D| is the number
of transactions in D.
Let S′ be any nonempty subset of S. Since S′ appears in every transaction in which S appears, supc (S′, D) ≥ supc (S, D) ≥ minsup × |D|. Thus, S′ is also a frequent itemset.
In our case, we have the frequent 3-itemset {CNE}, from which we can derive the frequent 2-itemsets {CN}, {CE} and {NE}. Similarly, we obtain the frequent 1-itemsets {C}, {N} and {E}.
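The anti-monotonicity argument can be checked in a few lines of Python: every nonempty proper subset of {C, N, E} occurs in every transaction containing {C, N, E}, so its count is at least as large and no recounting is needed.

```python
from itertools import combinations

# The five transactions as sets.
D = [set("BONEC"), set("BONECA"), set("CANE"), set("FANEC"), set("FAC")]

sup_CNE = sum({"C", "N", "E"} <= t for t in D)  # absolute support of {C, N, E}

# Each nonempty proper subset appears wherever {C, N, E} appears,
# hence sup(S') >= sup({C, N, E}) and each subset is frequent as well.
derived = [c for k in (2, 1) for c in combinations(("C", "E", "N"), k)]
for c in derived:
    assert sum(set(c) <= t for t in D) >= sup_CNE
print(derived)  # [('C', 'E'), ('C', 'N'), ('E', 'N'), ('C',), ('E',), ('N',)]
```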

(c) (1 val.) From the frequent itemsets you discovered, list all of the strong association rules matching
the following metarule, where X is a variable representing customers, and Itemi denotes variables
representing items (e.g., “A”, “C”, etc.)

∀X, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [S, C].

Do not forget to include the values for the support S and confidence C for any rules you may
discover.

Solution:

¹ In these solutions, we considered a strict minimum support, i.e., we considered as frequent only those itemsets I such that supp(I) > minsup. However, for grading purposes, we also accepted solutions that considered as frequent those itemsets I such that supp(I) ≥ minsup.

In our case, since the provided metarule involves 3 items, we only need to consider the association rules derived from the frequent itemset {CNE}. In particular, we get three possible association rules verifying the provided metarule:

{CN} ⇒ {E}  [0.8, 1]
{CE} ⇒ {N}  [0.8, 1]
{EN} ⇒ {C}  [0.8, 1].

Since all rules are above the minconf threshold, all three are strong rules.
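These supports and confidences can be verified directly, with the transactions hard-coded from Table 1 (a brute-force check, for illustration only):

```python
D = [set("BONEC"), set("BONECA"), set("CANE"), set("FANEC"), set("FAC")]

def sup(S):
    """Relative support: fraction of transactions containing itemset S."""
    return sum(S <= t for t in D) / len(D)

rules = []
for antecedent in ({"C", "N"}, {"C", "E"}, {"E", "N"}):
    consequent = {"C", "N", "E"} - antecedent
    S = sup({"C", "N", "E"})       # rule support = support of the full itemset
    C = S / sup(antecedent)        # confidence = sup(X u Y) / sup(X)
    rules.append((S, C))
    print(sorted(antecedent), "=>", sorted(consequent), [S, C])

# minconf = 80%: every rule clears the threshold, so all three are strong.
print(all(C >= 0.8 for _, C in rules))  # True
```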

(d) (1 val.) Design an example to illustrate that, in general, computing 2- and 1-frequent itemsets from discovered 3-frequent itemsets is not sufficient to guarantee that all frequent itemsets have been discovered. Is this the case for database D?

Solution:
As an example, we consider the dataset provided. As can easily be seen in part (a), the itemset {A} is a frequent 1-itemset that, however, is not a subset of the only frequent 3-itemset {CNE} determined in part (a).
Similarly, by running FP-growth completely, we can conclude that the 2-itemset {CA} is frequent but, again, is not a subset of the frequent itemset {CNE}. This shows that computing 2- and 1-frequent itemsets from the discovered 3-frequent itemsets is not sufficient to guarantee that all frequent itemsets are discovered.

2. (1 val.) Discuss advantages and disadvantages of FP-growth versus Apriori.

Solution:
Apriori has to do multiple scans of the database while FP-growth builds the FP-Tree with a single
scan and requires no additional scans of the database. Moreover, Apriori requires that candidate
itemsets are generated, an operation that is computationally expensive (owing to the self-join involved), while FP-growth does not generate any candidates.
On the other hand, FP-growth requires handling an FP-tree, a more complex data structure than those involved in Apriori. Scenarios involving a large number of possible items and itemsets of large cardinality may lead to complex FP-trees, the storage and handling of which become computationally expensive.
Though debate exists, it is not established that either method is computationally more efficient.
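The candidate-generation cost mentioned above comes from Apriori's self-join step. A minimal sketch of that step (illustrative only, not an optimized implementation):

```python
from itertools import combinations

def apriori_gen(Lk):
    """One Apriori candidate-generation step: self-join frequent k-itemsets
    (stored as sorted tuples) that share their first k-1 items, then prune
    any candidate having an infrequent k-subset."""
    Lk = sorted(Lk)
    Lk_set = set(Lk)
    k = len(Lk[0])
    candidates = []
    for i, a in enumerate(Lk):
        for b in Lk[i + 1:]:
            if a[:-1] == b[:-1]:                      # join condition
                c = a + (b[-1],)
                if all(s in Lk_set for s in combinations(c, k)):
                    candidates.append(c)
    return candidates

# Frequent 2-itemsets from Question 1(b): {CN}, {CE}, {NE}.
print(apriori_gen([("C", "E"), ("C", "N"), ("E", "N")]))  # [('C', 'E', 'N')]
```

FP-growth avoids this generate-and-test loop entirely, reading frequent patterns off the conditional trees instead.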

1.1 Practical Questions (Using SQL Server 2012)

3. Using SQL Server Management Studio, connect to the database AdventureWorksDW2012.


(a) (1 val.) Write an SQL query to determine the number of transactions in the view vAssocSeqOrders.
In your answer document, include both the SQL query and the obtained value.

Solution:
One possible query would be:

SELECT
COUNT(*)
FROM
dbo.vAssocSeqOrders
leading to the value 21,255.

(b) (1 val.) Write an SQL query to identify, in the view vAssocSeqLineItems, which models appear in more than 1,500 orders. In your answer document, include both the SQL query and the obtained result.

Solution:
One possible query would be:
SELECT
I.Model, COUNT(*) AS ’Total’
FROM
dbo.vAssocSeqLineItems I
GROUP BY
I.Model
HAVING
COUNT(*) > 1500
ORDER BY
COUNT(*) DESC
resulting in the following table:

Model Total

Sport-100 6,171
Water Bottle 4,076
Patch kit 3,010
Mountain Tire Tube 2,908
Mountain-200 2,477
Road Tire Tube 2,216
Cycling Cap 2,095
Fender Set - Mountain 2,014
Mountain Bottle Cage 1,941
Road Bottle Cage 1,702
Long-Sleeve Logo Jersey 1,642
Short-Sleeve Classic Jersey 1,537

(c) (1 val.) Write an SQL query to identify, in the view vAssocSeqLineItems, which pairs of models appear in more than 1,500 orders (do not include pairs in which both elements are the same). In your answer document, include both the SQL query and the obtained result.

Solution:
One possible query would be:
SELECT
I.Model, J.Model, COUNT(*) AS ’Total’
FROM
dbo.vAssocSeqLineItems I
INNER JOIN
dbo.vAssocSeqLineItems J
ON
I.OrderNumber = J.OrderNumber AND I.Model < J.Model
GROUP BY
I.Model, J.Model
HAVING
COUNT(*) > 1500
ORDER BY
COUNT(*) DESC
resulting in the following table:

Model                 Model         Total

Mountain Bottle Cage  Water Bottle  1,623
Road Bottle Cage      Water Bottle  1,513

(d) (1 val.) Write an SQL query to identify, in the view vAssocSeqLineItems, which triplets of models appear in more than 1,500 orders (do not include triplets with repeated elements). In your answer document, include both the SQL query and the obtained result.

Solution:
One possible query would be:
SELECT
I.Model, J.Model, K.Model, COUNT(*) AS ’Total’
FROM
dbo.vAssocSeqLineItems I,
dbo.vAssocSeqLineItems J,
dbo.vAssocSeqLineItems K
WHERE
I.OrderNumber = J.OrderNumber AND
J.OrderNumber = K.OrderNumber AND
I.Model < J.Model AND
J.Model < K.Model
GROUP BY
I.Model, J.Model, K.Model
HAVING
COUNT(*) > 1500
ORDER BY
COUNT(*) DESC
resulting in an empty table.
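The three queries mirror the first levels of Apriori: count 1-itemsets, then 2-itemsets, then 3-itemsets. The same counting logic can be sketched in Python over (OrderNumber, Model) rows; the rows below are toy stand-in data, not the actual contents of the view:

```python
from collections import Counter
from itertools import combinations

# Toy stand-in for dbo.vAssocSeqLineItems (illustrative rows only).
rows = [(1, "Water Bottle"), (1, "Mountain Bottle Cage"), (1, "Patch kit"),
        (2, "Water Bottle"), (2, "Road Bottle Cage"),
        (3, "Water Bottle"), (3, "Mountain Bottle Cage")]

# Group line items into baskets: one set of models per order.
baskets = {}
for order, model in rows:
    baskets.setdefault(order, set()).add(model)

def count_itemsets(k):
    """Count k-itemsets per order; sorting plays the role of I.Model < J.Model."""
    counts = Counter()
    for models in baskets.values():
        counts.update(combinations(sorted(models), k))
    return counts

print(count_itemsets(2)[("Mountain Bottle Cage", "Water Bottle")])  # 2
```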

4. The different queries in Question 3 roughly correspond to the main steps of the Apriori algorithm.
(a) (1 val.) From the results in Question 3, determine the minimum (relative) support implicitly used
in the aforementioned SQL queries.

Solution:
Since we selected only itemsets appearing in more than 1,500 orders, we have a minimum relative support of

minsup = 1,500 / 21,255 ≈ 7.06%.

(b) (2 val.) Determine all possible associations obtained from the frequent itemsets identified in Question 3. Indicate the confidence associated with each such association rule and all relevant calculations. Which of the calculated association rules correspond to strong rules for a minimum confidence of 60%?

Solution:
Possible associations arise from frequent k-itemsets, with k > 1. In our case, we have, as
possible associations,

Water Bottle ⇒ Mountain Bottle Cage
Mountain Bottle Cage ⇒ Water Bottle
Water Bottle ⇒ Road Bottle Cage
Road Bottle Cage ⇒ Water Bottle

To determine which of the associations above are strong, we compute the corresponding confidences:

Water Bottle ⇒ Mountain Bottle Cage    conf = 1,623 / 4,076 = 39.8%
Mountain Bottle Cage ⇒ Water Bottle    conf = 1,623 / 1,941 = 83.6%
Water Bottle ⇒ Road Bottle Cage        conf = 1,513 / 4,076 = 37.1%
Road Bottle Cage ⇒ Water Bottle        conf = 1,513 / 1,702 = 88.9%

and we can conclude that, for minconf = 60%, only Mountain Bottle Cage ⇒ Water Bottle and
Road Bottle Cage ⇒ Water Bottle are strong association rules.
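Plugging in the counts obtained in Question 3, these confidences can be checked in a few lines (the threshold test uses minconf = 60%):

```python
# Absolute supports taken from the queries in Question 3.
count = {
    frozenset({"Water Bottle"}): 4076,
    frozenset({"Mountain Bottle Cage"}): 1941,
    frozenset({"Road Bottle Cage"}): 1702,
    frozenset({"Mountain Bottle Cage", "Water Bottle"}): 1623,
    frozenset({"Road Bottle Cage", "Water Bottle"}): 1513,
}

def conf(X, Y):
    """Confidence of X => Y: sup(X u Y) / sup(X)."""
    return count[frozenset(X) | frozenset(Y)] / count[frozenset(X)]

strong = []
for X, Y in [({"Water Bottle"}, {"Mountain Bottle Cage"}),
             ({"Mountain Bottle Cage"}, {"Water Bottle"}),
             ({"Water Bottle"}, {"Road Bottle Cage"}),
             ({"Road Bottle Cage"}, {"Water Bottle"})]:
    c = conf(X, Y)
    print(sorted(X), "=>", sorted(Y), f"{c:.1%}")
    if c >= 0.6:  # minconf = 60%
        strong.append((tuple(sorted(X)), tuple(sorted(Y))))

print(strong)
```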

5. In SQL Server Data Tools, run the Microsoft Association algorithm you experimented with in the lab, but setting the minimum support to the value computed in Question 4 and the minimum confidence to 60%.
(a) (2 val.) Provide a screenshot of the Itemset pane containing the frequent itemsets discovered by
the algorithm. Compare these with your results from Question 4.

Solution:
As seen in Question 3, the frequent itemsets are:

Model                               Total

Sport-100                           6,171
Water Bottle                        4,076
Patch kit                           3,010
Mountain Tire Tube                  2,908
Mountain-200                        2,477
Road Tire Tube                      2,216
Cycling Cap                         2,095
Fender Set - Mountain               2,014
Mountain Bottle Cage                1,941
Road Bottle Cage                    1,702
Long-Sleeve Logo Jersey             1,642
Short-Sleeve Classic Jersey         1,537
Mountain Bottle Cage, Water Bottle  1,623
Road Bottle Cage, Water Bottle      1,513
This corresponds to the result obtained by the Microsoft Association algorithm:

The only two 2-itemsets observed are precisely those appearing in the associations determined
in Question 4, as expected.

(b) (2 val.) Provide a screenshot of the Rules pane containing the strong association rules discovered
by the algorithm. Compare these with your results from Question 4.

Solution:

As seen in Question 4, the only strong associations are:

Mountain bottle cage ⇒ Water bottle  [sup = 32.4%, C = 83.6%]
Road bottle cage ⇒ Water bottle      [sup = 30.2%, C = 88.9%]

This corresponds to the result obtained by the Microsoft Association algorithm:

(c) (2 val.) Indicate the dependency network computed by the algorithm and explain its meaning.

Solution:
The dependency network portrayed by the Microsoft Association algorithm is:

and indicates that the presence of either item Road bottle cage or Mountain bottle cage is a strong indicator of the presence of item Water bottle.

6. (2 val.) Note that, besides the confidence associated with each association rule, MS SQL Server also indicates the importance of the rule. Importance determines how "useful" a given rule is, and is computed as

importance(X ⇒ Y) = log( (sup(X ∪ Y) / sup(X)) × (sup(¬X) / sup(¬X ∪ Y)) ),

where sup(¬X) corresponds to the number of transactions that do not include item X, and sup(¬X ∪ Y) to the number of transactions that include Y but not X. In the data-mining literature, a quantity providing similar information is the lift, computed as

lift(X ⇒ Y) = sup%(X ∪ Y) / (sup%(X) × sup%(Y)).

Compute the importance and lift for the association rules mined. For this purpose, take into consideration the total number of transactions you computed in Question 3. Confirm the value of importance provided by Microsoft Association. Indicate your calculations, and verify that the rules with larger lift are also ranked by the Microsoft Association algorithm as more important.

Solution:
Computing the importance for the mined rules, we get:

importance(RBC ⇒ WB) = log( (1,513 × 19,553) / (1,702 × 2,563) ) = 0.831
importance(MBC ⇒ WB) = log( (1,623 × 19,314) / (1,941 × 2,453) ) = 0.818.

Computing now the lift, we get:

lift(RBC ⇒ WB) = (1,513 × 21,255) / (1,702 × 4,076) = 4.64
lift(MBC ⇒ WB) = (1,623 × 21,255) / (1,941 × 4,076) = 4.36,

which agrees with the importance results from the Microsoft Association algorithm.
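The same arithmetic can be reproduced in Python (base-10 logarithm, matching the values reported above; the counts are those from Questions 3 and 4):

```python
import math

N = 21255  # total transactions in vAssocSeqOrders (Question 3(a))
sup = {"WB": 4076, "MBC": 1941, "RBC": 1702,
       ("MBC", "WB"): 1623, ("RBC", "WB"): 1513}

def importance(X, Y):
    """importance(X => Y) = log10( sup(XuY)/sup(X) * sup(~X)/sup(~X u Y) )."""
    not_X = N - sup[X]                  # transactions without X
    Y_without_X = sup[Y] - sup[(X, Y)]  # transactions with Y but without X
    return math.log10((sup[(X, Y)] / sup[X]) * (not_X / Y_without_X))

def lift(X, Y):
    """lift(X => Y) = sup%(XuY) / (sup%(X) * sup%(Y))."""
    return (sup[(X, Y)] / N) / ((sup[X] / N) * (sup[Y] / N))

for X in ("RBC", "MBC"):
    print(f"{X} => WB  importance={importance(X, 'WB'):.3f}  lift={lift(X, 'WB'):.2f}")
```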
