Data Mining Question Set

DATA MINING FOR
BUSINESS INTELLIGENCE
PROF. MAYTAL SAAR-TSECHANSKY
Assignment 1
I. Data Mining Concepts
Answer the questions below. Please be concise and phrase your answers carefully.
1. For each of the data mining tasks (i.e., classification, regression, clustering, link analysis and
sequence analysis) provide an example of a business problem that can be supported by these
methods. For each business problem (e.g., customer attrition) formulate clearly the business
goal (e.g., prevent attrition of any customer who is likely to switch to a competitor), and
explain how the data mining method can be used to obtain it (e.g., build a classification
model to predict whether a customer will switch). Please do not use the examples discussed
in class. (10 points)
2. Because descriptive data mining is not used to predict values of interest, it is beneficial to
gain insights on past events, but is not useful to support future decisions. Do you agree with
this statement? Explain your answer. (15 points)
3. An analyst in a telecommunication company analyzed a subset of the firm's data and found
that 20% of customers who made at least two calls to the company's customer service center
within two months had switched to a competing provider. Later that day, the analyst repeated
the analysis, using another subset from the same database. However, this time the analysis
suggested that only 10% of customers switched once they made at least two calls to
customer service within a period of two months. Suggest possible reason(s) for the
discrepancy between the patterns the analyst found in each case. (15 points)
II. Classification Trees
Note: This part of the assignment can be prepared electronically. However,
if you prefer to use paper and pencil, you may submit a hard copy instead.
Classification trees are one of the most widely used data mining algorithms; they
are simple yet effective. To get started, consider the problem of predicting
whether or not the new president goes jogging on a particular day. You have
observed the president's decisions in the past and constructed a training set of
historical examples including the weather on a given day, whether the president
jogged the previous day, and the president's jogging decision on that day.
You now want to generate a predictive model to predict the president's decisions
in the future.
The values that the different attributes can take are provided below:
Attribute                    Possible Values
WEATHER                      Warm, Cold, Raining
JOGGED_YESTERDAY             Yes, No
Jog today (target variable)  Yes, No
Because each of an attribute's values starts with a different letter, for shorthand we'll
just use that initial letter, e.g., 'W' for Warm. Our target/class variable (the
variable whose value we want to predict) is whether or not the president will jog today.
Here is our TRAINING data set, which we will use to build a predictive model
of the president's decisions:
WEATHER  JOGGED_YESTERDAY  Target (Jog Today)
C        N                 Yes
W        Y                 No
R        Y                 No
C        Y                 No
R        N                 No
W        Y                 No
C        N                 No
W        N                 Yes
C        Y                 No
W        Y                 Yes
W        N                 Yes
C        N                 Yes
R        Y                 No
W        Y                 No
(a) Constructing the Initial Decision Tree (25 points)

Apply the classification tree building steps described in class (and in Chapter 6 of the
text) to the TRAINING set, using information gain as the criterion for selecting splits to
include in the tree. Show all your work, including what splits were considered at each
step, all entropy calculations used to decide among alternative splits, and the final
decision tree model you constructed.
As a reminder of the class discussion, recall that the process for selecting the best split at
each node can be simplified into one simple rule: For each node in the tree, split the
examples based on the attribute that produces the largest information gain (formula
provided in class notes).
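For reference, the standard textbook definitions of entropy and information gain are given below. This is a sketch of the usual formulation, and the class notes may use different notation. The symbols are assumptions introduced here: S is the set of examples at a node, p_c is the fraction of examples in S with class c, and S_v is the subset of S for which attribute A takes the value v.

```latex
H(S) = -\sum_{c} p_c \log_2 p_c
\qquad
IG(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|} \, H(S_v)
```

A pure node has H(S) = 0, and a node split evenly between two classes has H(S) = 1.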
Your initial tree is just a single leaf node, containing all the examples in the training data.
For each leaf node, consider all available attributes as candidates for splits, and calculate
the information gain obtained in each case. Then specify which of the possible
splits/attributes provides the largest information gain and incorporate this split in your
tree model. If multiple attributes tie for the best one, choose the one whose name appears
earliest in alphabetical order (e.g., JOGGED_YESTERDAY before WEATHER). Recall
that once a split is decided and subgroups (descendant nodes in the tree) are created in the
tree model, these subgroups themselves can also be split, unless one of the stopping rules
applies. See the class notes for examples of how this is done.
This process continues for each new leaf node until either of two conditions is
met:
1. All available attributes have already been included along the path through
the tree, or
2. the training examples associated with this leaf node all have the same class
value (i.e., the leaf is already pure).
You may use Excel to calculate the information gain from partitioning the training
examples on a given attribute. Use the function =LOG(number, base), where base is 2 for
entropy calculations.
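As an alternative to a spreadsheet, the same calculation can be sketched in Python. This is a minimal illustration, not the method prescribed in class: the function names (`entropy`, `information_gain`, `best_split`) and the toy data set are hypothetical, and the toy data is deliberately not the assignment's training set.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    parent = entropy([r[target] for r in rows])
    children = {}
    for r in rows:
        children.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return parent - weighted

def best_split(rows, attributes, target):
    """Pick the largest gain; ties go to the alphabetically earliest name."""
    gains = {a: information_gain(rows, a, target) for a in attributes}
    # Rounding guards against floating-point noise when comparing tied gains.
    chosen = min(attributes, key=lambda a: (-round(gains[a], 12), a))
    return chosen, gains

# Hypothetical toy data, NOT the assignment's training set.
toy = [
    {"SUNNY": "Y", "WEEKEND": "Y", "PLAY": "Yes"},
    {"SUNNY": "Y", "WEEKEND": "N", "PLAY": "Yes"},
    {"SUNNY": "N", "WEEKEND": "Y", "PLAY": "No"},
    {"SUNNY": "N", "WEEKEND": "N", "PLAY": "No"},
]
chosen, gains = best_split(toy, ["SUNNY", "WEEKEND"], "PLAY")
print(chosen, gains)
```

In the toy data, SUNNY separates the classes perfectly (gain of 1 bit) while WEEKEND leaves both child nodes evenly mixed (gain of 0), so SUNNY is chosen.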
* If you prepare this question (a) using paper and pencil, please make sure your
answers can be read by the TA (in other words, type your answers or print neatly).
Please don't spend too much time on any given part. If you get stuck and have
questions, please request to meet us (TA and instructor) and we will help you.
(b) Using the tree model for prediction, and estimating the model's predictive accuracy
(15 points)
Here is a Test Set of examples for which you would like to generate predictions:
WEATHER  JOGGED_YESTERDAY  Target (Jog Today)
W        Y                 ?
R        N                 ?
C        N                 ?
C        Y                 ?
W        N                 ?
R        Y                 ?
Use the decision tree produced in part (a) to predict the class of (classify) each example in
the TEST set.
The table below contains the same examples as in the Test Set, but also includes the
correct classification of each example. What proportion of the cases in the Test Set was
predicted accurately by your model?
WEATHER  JOGGED_YESTERDAY  Target (Jog Today)
W        Y                 No
R        N                 Yes
C        N                 Yes
C        Y                 No
W        N                 Yes
R        Y                 Yes
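The prediction and accuracy steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the nested-dictionary tree representation, the `classify` and `accuracy` helpers, and the toy tree and examples are all hypothetical and are not the tree or test set from this assignment.

```python
def classify(tree, example):
    """Walk from the root to a leaf, following the example's attribute values.

    An internal node is a dict {attribute: {value: subtree}};
    a leaf is a plain class label.
    """
    while isinstance(tree, dict):
        attribute, branches = next(iter(tree.items()))
        tree = branches[example[attribute]]
    return tree

def accuracy(predicted, actual):
    """Fraction of predictions that match the true class labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical one-split tree and labeled examples, NOT the assignment's data.
toy_tree = {"SUNNY": {"Y": "Yes", "N": "No"}}
toy_examples = [{"SUNNY": "Y"}, {"SUNNY": "N"}, {"SUNNY": "N"}]
toy_actual = ["Yes", "No", "Yes"]

predicted = [classify(toy_tree, e) for e in toy_examples]
score = accuracy(predicted, toy_actual)
print(predicted, score)
```

Here two of the three toy predictions match the true labels, so the accuracy is 2/3.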
(d) Now assume that you did not record the weather on each day or whether the president
jogged the previous day, and thus cannot induce a classification tree model. All you
recorded was the last column in your training data: whether the president jogged that
day.
* If you were to use only this information, what would be your best prediction for each of
the cases in the Test Set? Explain clearly how you arrived at this prediction.
* Given these predictions, in what proportion of the cases in the Test Set are your predictions
correct?
* Compare this result with the prediction accuracy of the classification tree model
calculated in (b). Is the classification model better at predicting the president's decisions?
(15 points)
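One common way to predict from the class column alone is a majority-class baseline: always predict the class that occurs most often in the training data. The sketch below assumes that interpretation, which the assignment does not state explicitly. The helper name `majority_class` and the labels are hypothetical, not the assignment's data.

```python
from collections import Counter

def majority_class(training_labels):
    """Return the most frequent class label in the training data."""
    return Counter(training_labels).most_common(1)[0][0]

# Hypothetical labels, NOT the assignment's training or test data.
training_labels = ["No", "No", "Yes", "No", "Yes"]
test_labels = ["Yes", "No", "No"]

guess = majority_class(training_labels)
# The same guess is made for every test case, since no attributes are available.
baseline_accuracy = sum(guess == y for y in test_labels) / len(test_labels)
print(guess, baseline_accuracy)
```

A tree model is only worth its added complexity if its test accuracy exceeds this kind of baseline.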
(e) Generating Rules (5 points)
Extract two rules from the decision tree generated in step (a).
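Each root-to-leaf path in a classification tree corresponds to one IF-THEN rule. A minimal sketch of that idea follows, assuming a nested-dictionary tree where an internal node is {attribute: {value: subtree}} and a leaf is a class label. The `extract_rules` helper and the toy tree are hypothetical and are not the tree from part (a).

```python
def extract_rules(tree, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):  # leaf: a class label
        premise = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {premise} THEN class = {tree}"]
    rules = []
    attribute, branches = next(iter(tree.items()))
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attribute, value),)))
    return rules

# Hypothetical tree, NOT the tree built in part (a).
toy_tree = {"SUNNY": {"Y": "Yes", "N": {"WEEKEND": {"Y": "Yes", "N": "No"}}}}
rules = extract_rules(toy_tree)
for rule in rules:
    print(rule)
```

The toy tree has three leaves, so it yields three rules, one per path.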
