
CS5228: Knowledge Discovery and Data Mining

(2013-14, Semester I)
Assignment 1 (100 points)
1 Notes and Requirements
This assignment contributes 15% to the final course grade.
Submission options: 1) hand in hard copies, or 2) submit soft copies via
IVLE (submission folder: Student Submission/Assignment 1).
Due date: Oct. 4, 2013. If you prefer to hand in your assignments,
please hand them in before or right after the class on Oct. 4, 2013. If you prefer
to submit your assignments via IVLE, please upload them to the folder Student
Submission/Assignment 1 before 11:59pm on Oct. 4, 2013.
Note: Late submission of an assignment will result in a reduced grade for
the assignment, unless an extension has been granted by the instructor. A late
submission receives an additional 20% penalty for every 24 hours of delay.
2 Question Sets
Question 1: (15 points) Consider the data set shown in Table 1 for a binary
classification problem.
1. Calculate the information gain when splitting on A and B. Which attribute would
the decision tree induction algorithm choose? (5 points)
2. Calculate the gain in the Gini index when splitting on A and B. Which attribute
would the decision tree induction algorithm choose? (5 points)
3. Figure 1 (on page 48 of the lecture notes L3: Classification I) shows
that entropy and the Gini index are both monotonically increasing on the range
[0, 0.5] and both monotonically decreasing on the range [0.5, 1]. Is it
possible that information gain and the gain in the Gini index favor different
attributes? Explain. (5 points)
Table 1: Data set for Question 1.
A B Class Label
T F +
T T +
T T +
T F -
T T +
F F -
F F -
F F -
T T -
T F -
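The impurity calculations in parts 1 and 2 can be checked mechanically. Below is a minimal Python sketch (function names are my own) that encodes Table 1 and computes the information gain and the gain in the Gini index for splits on A and B:

```python
from math import log2

# Table 1, encoded as (A, B, class) tuples; True = T / "+", False = F / "-"
DATA = [
    (True,  False, True),  (True,  True,  True),  (True,  True,  True),
    (True,  False, False), (True,  True,  True),  (False, False, False),
    (False, False, False), (False, False, False), (True,  True,  False),
    (True,  False, False),
]

def entropy(labels):
    """Entropy of a list of binary class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

def gini(labels):
    """Gini index of a list of binary class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1 - p * p - (1 - p) * (1 - p)

def split_gain(data, attr_index, impurity):
    """Decrease in impurity when splitting on the given attribute column."""
    labels = [row[-1] for row in data]
    parent = impurity(labels)
    children = 0.0
    for value in {row[attr_index] for row in data}:
        subset = [row[-1] for row in data if row[attr_index] == value]
        children += len(subset) / len(data) * impurity(subset)
    return parent - children

for name, idx in (("A", 0), ("B", 1)):
    print(name,
          "info gain = %.4f" % split_gain(DATA, idx, entropy),
          "gini gain = %.4f" % split_gain(DATA, idx, gini))
```

Running this prints the four gains needed for parts 1 and 2; part 3's explanation of why the two criteria can disagree still has to be argued by hand.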
Figure 1: For a binary class problem.
Question 2: (20 points) Consider a binary classification problem with the following
set of attributes and attribute values:
Air Conditioner = {Working, Broken}
Engine = {Good, Bad}
Mileage = {High, Medium, Low}
Rust = {Yes, No}
Suppose a rule-based classifier produces the following rule set:
Mileage = High → Value = Low
Mileage = Low → Value = High
Air Conditioner = Working, Engine = Good → Value = High
Air Conditioner = Working, Engine = Bad → Value = Low
Air Conditioner = Broken → Value = Low
1. Are the rules mutually exclusive? (5 points)
2. Is the rule set exhaustive? (5 points)
3. Is ordering needed for this set of rules? (5 points)
4. Do you need a default class for the rule set? (5 points)
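Mutual exclusivity and exhaustiveness can both be verified by brute force, since the attribute space has only 2 × 2 × 3 × 2 = 24 records. The following sketch (names and encoding are my own) counts how many rules fire for every possible record; the rule set is mutually exclusive iff no record fires more than one rule, and exhaustive iff every record fires at least one:

```python
from itertools import product

# The five rules from Question 2, each as (conditions, conclusion);
# a condition maps attribute name -> required value.
RULES = [
    ({"Mileage": "High"}, "Low"),
    ({"Mileage": "Low"}, "High"),
    ({"Air Conditioner": "Working", "Engine": "Good"}, "High"),
    ({"Air Conditioner": "Working", "Engine": "Bad"}, "Low"),
    ({"Air Conditioner": "Broken"}, "Low"),
]

DOMAINS = {
    "Air Conditioner": ["Working", "Broken"],
    "Engine": ["Good", "Bad"],
    "Mileage": ["High", "Medium", "Low"],
    "Rust": ["Yes", "No"],
}

def fired(record):
    """Indices of rules whose conditions all hold for the record."""
    return [i for i, (cond, _) in enumerate(RULES)
            if all(record[a] == v for a, v in cond.items())]

names = list(DOMAINS)
counts = [len(fired(dict(zip(names, combo))))
          for combo in product(*DOMAINS.values())]
print("mutually exclusive:", max(counts) <= 1)
print("exhaustive:", min(counts) >= 1)
```

The answers to parts 3 and 4 then follow from these two properties: ordering matters only if overlapping rules can disagree, and a default class is needed only if some record is uncovered.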
Question 3: (35 points) Consider the data set shown in Table 2.
Table 2: Data set for Question 3.
Record A B C Class
1 0 0 0 +
2 0 0 1 -
3 0 1 1 -
4 0 1 1 -
5 0 0 1 +
6 1 0 1 +
7 1 0 1 -
8 1 0 1 -
9 1 1 1 +
10 1 0 1 +
1. Estimate the conditional probabilities for P(A|+), P(B|+), P(C|+), P(A|−),
P(B|−), and P(C|−). (5 points)
2. Use the estimate of conditional probabilities given in the previous question to
predict the class label for a test example (A = 0, B = 1, C = 0) using the naïve
Bayes approach. (10 points)
3. Estimate the conditional probabilities using the m-estimate approach, with p =
1/2, and m = 4. (5 points)
4. Repeat part (2) using the conditional probabilities given in part (3). (10 points)
5. Compare the two methods for estimating probabilities. Which method is better
and why? (5 points)
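The hand calculations for parts 1–4 can be cross-checked with a short naïve Bayes sketch. The code below (my own helper names; columns indexed 0 = A, 1 = B, 2 = C) covers both the plain maximum-likelihood estimates and the m-estimate P = (n_c + mp)/(n + m) with m = 4, p = 1/2:

```python
# Table 2: (A, B, C, class); class True means "+"
DATA = [
    (0, 0, 0, True),  (0, 0, 1, False), (0, 1, 1, False), (0, 1, 1, False),
    (0, 0, 1, True),  (1, 0, 1, True),  (1, 0, 1, False), (1, 0, 1, False),
    (1, 1, 1, True),  (1, 0, 1, True),
]

def cond_prob(attr, value, cls, m=0, p=0.5):
    """P(attr = value | class), with optional m-estimate smoothing.

    m = 0 gives the plain maximum-likelihood estimate (parts 1-2);
    the question's m-estimate (parts 3-4) uses m = 4, p = 1/2."""
    rows = [r for r in DATA if r[-1] == cls]
    n_c = sum(1 for r in rows if r[attr] == value)
    return (n_c + m * p) / (len(rows) + m)

def predict(a, b, c, m=0, p=0.5):
    """Naive Bayes prediction for (A=a, B=b, C=c): label and class scores."""
    scores = {}
    for cls, label in ((True, "+"), (False, "-")):
        prior = sum(1 for r in DATA if r[-1] == cls) / len(DATA)
        scores[label] = (prior
                         * cond_prob(0, a, cls, m, p)
                         * cond_prob(1, b, cls, m, p)
                         * cond_prob(2, c, cls, m, p))
    return max(scores, key=scores.get), scores

print(predict(0, 1, 0))        # plain estimates (part 2)
print(predict(0, 1, 0, m=4))   # m-estimate (part 4)
```

Comparing the two score dictionaries also illustrates the point of part 5: with plain estimates, a single zero conditional probability annihilates one class's score, while the m-estimate keeps all factors nonzero.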
Figure 2: Bayesian belief network for Question 4.
Question 4: (30 points) Given the Bayesian network shown in Figure 2, compute the
following probabilities:
1. P(B = good, F = empty, G = empty, S = yes). (10 points)
2. P(B = bad, F = empty, G = not empty, S = no). (10 points)
3. Given that the battery is bad, compute the probability that the car will start. (10
points)
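All three parts reduce to the same mechanics: multiply the network's conditional probability tables along its chain-rule factorisation for a joint probability, and sum joint probabilities over the unobserved variables for a conditional one. Since Figure 2 is not reproduced here, the sketch below uses a PLACEHOLDER structure and PLACEHOLDER numbers (a common car-start network where Gauge and Start each depend on Battery and Fuel); substitute the structure and CPT values from the figure before using it:

```python
# PLACEHOLDER priors and CPTs -- replace with the values in Figure 2.
P_B = {"good": 0.9, "bad": 0.1}             # P(Battery), placeholder
P_F = {"not empty": 0.8, "empty": 0.2}      # P(Fuel), placeholder
P_G_EMPTY = {                               # P(Gauge = empty | B, F), placeholder
    ("good", "not empty"): 0.1, ("good", "empty"): 0.8,
    ("bad",  "not empty"): 0.2, ("bad",  "empty"): 0.9,
}
P_S_YES = {                                 # P(Start = yes | B, F), placeholder
    ("good", "not empty"): 0.95, ("good", "empty"): 0.05,
    ("bad",  "not empty"): 0.1,  ("bad",  "empty"): 0.01,
}

def joint(b, f, g, s):
    """P(B=b, F=f, G=g, S=s) via the assumed factorisation
    P(B) P(F) P(G | B, F) P(S | B, F)."""
    pg = P_G_EMPTY[(b, f)]
    ps = P_S_YES[(b, f)]
    return (P_B[b] * P_F[f]
            * (pg if g == "empty" else 1 - pg)
            * (ps if s == "yes" else 1 - ps))

def p_start_given_battery(b):
    """P(S = yes | B = b), summing out Fuel and Gauge (part 3's pattern)."""
    g_vals = ("empty", "not empty")
    num = sum(joint(b, f, g, "yes") for f in P_F for g in g_vals)
    den = sum(joint(b, f, g, s)
              for f in P_F for g in g_vals for s in ("yes", "no"))
    return num / den

print(joint("good", "empty", "empty", "yes"))   # pattern for parts 1-2
print(p_start_given_battery("bad"))             # pattern for part 3
```

With the figure's actual CPTs plugged in, parts 1 and 2 are single `joint(...)` calls and part 3 is `p_start_given_battery("bad")`.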
