Topics
Data mining basics and the KDD process
Why data mining? Different views of data mining. Related terms and
disciplines. Definitions.
Readings: slides and Ch. 1.1 and Ch. 1.2 in T2
Clustering
Objectives
Fundamentals: Cluster, distance measure and calculations, and comparative
criteria
Bottom-up hierarchical algorithm
Applications and comparative criteria
a) Calculate the support, confidence and lift of the following association rule. Indicate if
the items in the association rule are independent of each other or have negative or
positive impacts on each other. (8 points)
{10} -> {50,70}
b) The following is the list of large two-itemsets. Show the steps to apply the Apriori
property to generate and prune the candidates for large three-itemsets. Describe how
the Apriori property is used in the steps. Give the final list of candidate large
three-itemsets. (10 points)
{10,20} {10,30} {20,30} {20,40}
c) Does customer 1 support the sequence <{20} {50,70} {10}>? Justify your answer. (5
points)
e) Based on the types of association rules discussed in class, identify which type(s) of
rule {10} -> {50,70} is. (3 points)
CUSTOMER             PRODUCT
Name                 Name
Date of Birth        Type
Annual Income        Introduction Date
City
State

SALESFACT
TransactionID
Quantity
Amount

DATE
Day of Year
Year
Note:
1. TransactionID is used as the primary key in the fact table because there might be more
than one transaction for each customer and product in a given day.
2. The Introduction Date for a product is the date when it is first introduced into the
market.
a) The clustering task was selected to identify customer segmentation. Suggest the
attributes including derived attributes to be used in the clustering task and justify
your answer. (10 points)
b) Recommend a standardization or normalization method for the attributes in a
distance function. (10 points)
i. Specify the input and class label attributes you choose for this
classification/prediction task. Give an example of business decision(s)
that can benefit from the classification/prediction results using the
input and class label attributes of your choice. (10 points)
ii. Define and give an example of noise using the data set above. (5 points)
iii. Assume that you will use a decision tree classifier. Specify and compare
the different tree pruning approaches. (10 points)
iv. Suppose you are using a neural network instead of a decision tree. List at
least three possible parameters you want to tune to improve its
performance during the training period. (5 points)
The task attributes of the four data mining tasks discussed in class are briefly described
below:
Association rule and sequential pattern mining - Customer ID, Transaction ID and Item.
Classification/prediction - input and the class label attributes
Clustering - input attributes
The following are the data fields in the data mining server log:
User ID, Session ID, Dataset ID, MiningTask ID, Parameter Value, Accuracy
a) Which task will you perform to identify the data mining tasks that tend to be
performed in the same session? Describe the attributes you choose and how they are
mapped to the data mining task attributes listed above. (6 points)
b) Which task will you perform to identify the sequence of data mining tasks that users
tend to perform on the same data set over time? Describe the attributes you choose
and how they are mapped to the data mining task attributes listed above. (6 points)
c) Which task will you perform to determine whether the Parameter Value level (low, medium
or high) and the level of Parameter Value adjustment (small, moderate or large) tend
to have a positive or negative impact on Accuracy? Describe the attributes you choose
and how they are mapped to the data mining task attributes listed above. (8 points)
Answers to Sample Exam Questions
Question 1:
a)
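Support = Support({10,50,70}) = 2/10 = 20%
Confidence = Support({10,50,70}) / Support({10}) = 0.2/0.7 = 2/7 = 29%
Lift = Confidence / Support({50,70}) = (2/7)/0.2 = 10/7 = 1.43 > 1
Because the lift is greater than 1, {10} and {50,70} are not independent: they have a
positive impact on each other.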
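The arithmetic in a) can be re-checked with a short Python sketch. It starts from the three support values used in the answer (2/10, 7/10 and 2/10) rather than from the transaction table; the function names `rule_metrics` and `interpret` are illustrative and not part of the course material.

```python
from fractions import Fraction

def rule_metrics(sup_both, sup_lhs, sup_rhs):
    """Return (support, confidence, lift) for a rule LHS -> RHS."""
    support = sup_both
    confidence = sup_both / sup_lhs
    lift = confidence / sup_rhs
    return support, confidence, lift

def interpret(lift):
    """Classify the relationship between LHS and RHS from the lift value."""
    if lift > 1:
        return "positive"
    if lift < 1:
        return "negative"
    return "independent"

# Supports of {10,50,70}, {10} and {50,70}, as used in the answer.
support, confidence, lift = rule_metrics(
    Fraction(2, 10), Fraction(7, 10), Fraction(2, 10)
)
print(support, confidence, lift)  # 1/5 2/7 10/7
print(interpret(lift))            # positive
```

Exact fractions are used so that the result 10/7 is not blurred by floating-point rounding.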
b)
{10,20} {10,30} {20,30} {20,40}
Note: describe how the Apriori property is used to decide which two large 2-itemsets are
joined together and to determine which candidate 3-itemsets should be pruned.
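For the four large 2-itemsets listed in b), the join and prune steps can be sketched in Python as follows. This is a minimal illustration of the standard candidate-generation procedure, not necessarily the exact presentation used in class; the function name `apriori_gen` is an assumption.

```python
from itertools import combinations

def apriori_gen(large_k):
    """Join large k-itemsets, then prune with the Apriori property.

    Returns (kept, pruned) lists of candidate (k+1)-itemsets as sorted tuples.
    """
    large = {tuple(sorted(s)) for s in large_k}
    k = len(next(iter(large)))
    kept, pruned = [], []
    for a, b in combinations(sorted(large), 2):
        # Join: two itemsets are joined only if they share their first k-1 items.
        if a[:-1] == b[:-1]:
            cand = a + (b[-1],)
            # Prune: every k-item subset of a large itemset must itself be large.
            if all(sub in large for sub in combinations(cand, k)):
                kept.append(cand)
            else:
                pruned.append(cand)
    return kept, pruned

large_2 = [{10, 20}, {10, 30}, {20, 30}, {20, 40}]
candidates, pruned = apriori_gen(large_2)
print(candidates)  # [(10, 20, 30)]
print(pruned)      # [(20, 30, 40)]
```

{10,20} and {10,30} join into {10,20,30}, which is kept because {20,30} is also large. {20,30} and {20,40} join into {20,30,40}, which is pruned because its subset {30,40} is not in the list of large 2-itemsets.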
d)
The sequence of customer 1 is:
<{10,20} {10,30,50,70} {10,20,30,40}>
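Whether this customer sequence supports <{20} {50,70} {10}> can be checked with a small Python sketch. It assumes the usual containment definition: each element of the pattern must be a subset of a distinct transaction, in the same order. The function name `supports` is illustrative.

```python
def supports(customer_seq, pattern):
    """Return True if the pattern is contained in the customer sequence."""
    pos = 0
    for element in pattern:
        # Advance to the next transaction that contains this pattern element.
        while pos < len(customer_seq) and not element <= customer_seq[pos]:
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # The next element must match a strictly later transaction.
    return True

customer_1 = [{10, 20}, {10, 30, 50, 70}, {10, 20, 30, 40}]
print(supports(customer_1, [{20}, {50, 70}, {10}]))  # True
```

Here {20} is contained in the first transaction, {50,70} in the second and {10} in the third, so the order is preserved and the sequence is supported.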
e)
Only customer 1 supports the sequence <{10} {30}> and there are 5 customers;
therefore, the support of the sequence is 1/5 = 20%.
f)
The association rule {10} -> {50,70} is a single-level, single-dimensional and Boolean
association rule.
Question 2.
iv) Number of hidden-layer nodes, learning rate, number of epochs, momentum,
accuracy threshold, and number of hidden layers.
Question 3:
a)
I will suggest using association rule mining. In this data mining task, Session ID can be
mapped to Transaction ID and MiningTask ID can be mapped to Item.
b)
I will suggest using sequential pattern mining. In this data mining task, Dataset ID can be
mapped to Customer ID, Session ID can be mapped to Transaction ID and MiningTask
ID can be mapped to Item.
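The two mappings in a) and b) can be illustrated with a short Python sketch that reshapes log rows into the inputs each task expects. The log rows, the function names and the assumption that Session IDs sort in time order are all hypothetical; the source does not say how sessions are ordered.

```python
from collections import defaultdict

# Hypothetical log rows: (user_id, session_id, dataset_id, mining_task_id).
log = [
    ("u1", "s1", "d1", "clustering"),
    ("u1", "s1", "d1", "classification"),
    ("u2", "s2", "d1", "association"),
    ("u2", "s3", "d2", "clustering"),
]

def to_transactions(rows):
    """Association rules: Session ID -> Transaction ID, MiningTask ID -> Item."""
    baskets = defaultdict(set)
    for _user, session, _dataset, task in rows:
        baskets[session].add(task)
    return dict(baskets)

def to_sequences(rows):
    """Sequential patterns: Dataset ID -> Customer ID, Session ID -> Transaction ID."""
    per_dataset = defaultdict(lambda: defaultdict(set))
    for _user, session, dataset, task in rows:
        per_dataset[dataset][session].add(task)
    # Assumption: session IDs sort in chronological order.
    return {
        dataset: [sessions[s] for s in sorted(sessions)]
        for dataset, sessions in per_dataset.items()
    }

print(to_transactions(log))
print(to_sequences(log))
```

The first function yields one itemset of mining tasks per session; the second yields, for each data set, an ordered list of such itemsets.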
c)
I will suggest using classification. In this data mining task, input attributes include
parameter value level and level of parameter value adjustment, and the class label
attribute is the impact on accuracy (i.e., positive, negative, or no impact).