
CS 4701: Data Science

Assignment 2
Date Assigned: May 24th, 2018 Date Due: May 27th, 2018

(Total Score: 66 points)

(Questions: 1)

Instructions:

Report all numerical results to two decimal places. The label of each figure and table appears
below it.

Q 1. Consider the datasets provided in Tables 1 and 2, in order to answer the following questions.
Assume , and ̂ ,̂ ,̂ ,̂ ,̂ ,
̂ , as obtained using Least Squares approach.

Multiple Linear Regression:

a) Calculate F-statistic using Training Data. What can be inferred from the determined value?
(6 points)
b) Implement Forward Selection technique with a Stopping Rule: “Maximum three Predictors”.
(15 points)
c) Suppose an Interaction Effect exists between “Student” and “Annual Income”.
i. Extend your originally developed Multiple Linear Regression model by including the
Interaction Term. (3 points)
ii. Determine the relationship between the Interaction Term and “Credit Limit” in terms of
magnitude and direction. (2 points)
iii. Calculate Standard Error (SE) of the Interaction Term Coefficient Estimate (ITCE)
determined in part (i). Is the determined coefficient a good estimate? (5 points)
iv. Calculate t-statistic for the ITCE determined in part (i). What can be inferred from the
determined value? Will you revert to your originally developed Multiple Linear
Regression model, or will you keep this new model? (6 points)
v. Calculate the $R^2$ statistic for the extended model over Training Data. What can be inferred
from the determined value? (13 points)
vi. Calculate Test MSE using the extended model. Is the model performing well on Test
Data? [Hint: Test MSE on the originally developed Multiple Linear Regression model =
73,281,963.26] (12 points)

Non-Linear Regression:
d) Observe the graph of “Annual Income” vs. “Credit Limit” in Figure 1. Transform your originally
developed Multiple Linear Regression model by including a cubic term for “Annual Income”.
(2 points)
e) What is your opinion on the transformation carried out in part (d) for: (2 points)
i. Quadratic Regression Model

CS 4701: Data Science Assignment 2 Answer Key

ii. Cubic Regression Model

S#   Gender   Age   Student   Married   Annual Income   Credit Limit
1    1        34    0         1         $14,891         $3,606
2    0        82    1         1         $106,025        $6,645
3    1        71    0         0         $104,593        $7,075
4    0        36    0         0         $148,924        $9,504
5    1        68    0         1         $55,882         $4,897
6    1        77    0         0         $80,180         $8,047
7    0        41    1         1         $71,061         $6,819
Table 1. Credit Card Customers - Training Data

S#   Gender   Age   Student   Married   Annual Income   Credit Limit
8    0        37    0         0         $20,996         $3,388
9    1        87    0         0         $71,408         $7,114
10   0        66    0         0         $15,125         $3,300
Table 2. Credit Card Customers - Test Data

[Plot: Credit Limit (y-axis, 0 to 10,000) against Annual Income (x-axis, 0 to 160,000)]

Figure 1. Annual Income vs. Credit Limit


A 1.

Multiple Linear Regression

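a) $F = \dfrac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$, with $n = 7$ observations and $p = 5$ predictors.

Here, $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{7} (y_i - \bar{y})^2$

$= (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_7 - \bar{y})^2$

where, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$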

Hence,

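Here, $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

$= \sum_{i=1}^{7} (y_i - \hat{y}_i)^2$

$= (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_7 - \hat{y}_7)^2$

where, $\hat{y}_i$ is the value predicted by the fitted model for observation $i$.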

Hence,

Hence,

Since the value of the F-statistic is found to be very low, none of the predictors has a strong
relationship with the response variable.
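As a numerical cross-check on the hand calculation, the F-statistic can be recomputed from Table 1. The following is a minimal sketch, assuming NumPy is available and that the coefficients are re-estimated by ordinary least squares rather than taken from the values assumed in the question, so the printed numbers need not match the key. It uses the standard form F = ((TSS - RSS)/p) / (RSS/(n - p - 1)); the names `X`, `y`, and `f_stat` are illustrative.

```python
import numpy as np

# Training data from Table 1.
# Columns: Gender, Age, Student, Married, Annual Income ($).
X = np.array([
    [1, 34, 0, 1,  14891],
    [0, 82, 1, 1, 106025],
    [1, 71, 0, 0, 104593],
    [0, 36, 0, 0, 148924],
    [1, 68, 0, 1,  55882],
    [1, 77, 0, 0,  80180],
    [0, 41, 1, 1,  71061],
], dtype=float)
y = np.array([3606, 6645, 7075, 9504, 4897, 8047, 6819], dtype=float)  # Credit Limit ($)

n, p = X.shape  # n = 7 observations, p = 5 predictors

# Assumption: coefficients are refit here by ordinary least squares.
A = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta_hat

tss = np.sum((y - y.mean()) ** 2)             # total sum of squares
rss = np.sum((y - y_hat) ** 2)                # residual sum of squares
f_stat = ((tss - rss) / p) / (rss / (n - p - 1))

print(f"TSS = {tss:.2f}, RSS = {rss:.2f}, F = {f_stat:.2f}")
```

With seven observations and five predictors, only one residual degree of freedom remains, so the denominator of F rests on a single degree of freedom and the statistic is very sensitive to the fit.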

b) We begin with the Null Model: $\hat{y} = \hat{\beta}_0$

We then fit Simple Linear Regressions and calculate their respective RSS.

For :

̂ ̂


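$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{7} (y_i - \hat{y}_i)^2$

$= (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_7 - \hat{y}_7)^2$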

For :

̂ ̂

For :

̂ ̂


For :

̂ ̂

For :

̂ ̂


We select to be added to our model, since it results in the lowest RSS among all predictors.

̂ ̂ ̂

We then fit Simple Linear Regressions and calculate their respective RSS.

For :

̂ ̂

For :

̂ ̂


For :

̂ ̂

For :

̂ ̂


We select to be added to our model, since it results in the lowest RSS among all predictors.

̂ ̂ ̂ ̂

We then fit Simple Linear Regressions and calculate their respective RSS.

For :

̂ ̂

For :

̂ ̂


For :

̂ ̂

We select to be added to our model, since it results in the lowest RSS among all predictors.
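The selection steps above can be sketched in code. This is a minimal illustration, assuming NumPy and the textbook form of forward selection, in which each step refits a least-squares model containing the predictors already selected plus one candidate and keeps the candidate with the lowest training RSS. That may differ in detail from the hand procedure in this key. The helper names `fit_rss` and `forward_selection` are hypothetical.

```python
import numpy as np

names = ["Gender", "Age", "Student", "Married", "Annual Income"]

# Training data from Table 1 (columns in the order of `names`).
X = np.array([
    [1, 34, 0, 1,  14891],
    [0, 82, 1, 1, 106025],
    [1, 71, 0, 0, 104593],
    [0, 36, 0, 0, 148924],
    [1, 68, 0, 1,  55882],
    [1, 77, 0, 0,  80180],
    [0, 41, 1, 1,  71061],
], dtype=float)
y = np.array([3606, 6645, 7075, 9504, 4897, 8047, 6819], dtype=float)


def fit_rss(columns):
    """RSS of the least-squares fit of y on an intercept plus the given columns of X."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in columns])
    beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ beta_hat) ** 2))


def forward_selection(max_predictors=3):
    """Greedy forward selection; stops once `max_predictors` have been added."""
    selected = []
    rss_path = [fit_rss([])]                  # null model: intercept only
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_predictors:
        # Candidate whose addition gives the lowest training RSS.
        best = min(remaining, key=lambda j: fit_rss(selected + [j]))
        selected.append(best)
        remaining.remove(best)
        rss_path.append(fit_rss(selected))
    return selected, rss_path


selected, rss_path = forward_selection(max_predictors=3)
for step, j in enumerate(selected, start=1):
    print(f"step {step}: add {names[j]}, RSS = {rss_path[step]:.2f}")
```

The stopping rule from the question ("Maximum three Predictors") maps to the `max_predictors` argument; `rss_path[0]` is the RSS of the null model, which equals TSS.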

̂ ̂ ̂ ̂ ̂

c)
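i. $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\,\text{Gender} + \hat{\beta}_2\,\text{Age} + \hat{\beta}_3\,\text{Student} + \hat{\beta}_4\,\text{Married} + \hat{\beta}_5\,\text{Annual Income} + \hat{\beta}_6\,(\text{Student} \times \text{Annual Income})$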

∑ ̅ ̅
where, ̂ ̂
∑ ̅
[Don’t calculate ̂ using ̂ ̂ ̂]

∑ ̅ ̅
∑ ̅
̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅
̅ ̅ ̅ ̅ ̅ ̅ ̅

Here,


and, ̅ ∑ ∑

Hence, ̂

and, ̂
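To make the interaction term concrete, the sketch below builds the Student × Annual Income column from Table 1. It assumes NumPy, and it expresses income in thousands of dollars purely for numerical conditioning. It then shows what a joint least-squares refit of all seven coefficients would do. That joint refit is not the approach this key takes, since the key estimates the interaction coefficient on its own, so the results are not expected to agree.

```python
import numpy as np

# Training data from Table 1.
gender  = np.array([1, 0, 1, 0, 1, 1, 0], dtype=float)
age     = np.array([34, 82, 71, 36, 68, 77, 41], dtype=float)
student = np.array([0, 1, 0, 0, 0, 0, 1], dtype=float)
married = np.array([1, 1, 0, 0, 1, 0, 1], dtype=float)
income  = np.array([14.891, 106.025, 104.593, 148.924,
                    55.882, 80.180, 71.061])  # Annual Income, thousands of $
limit   = np.array([3606, 6645, 7075, 9504, 4897, 8047, 6819], dtype=float)

# Interaction term: non-zero only for the two students (S# 2 and S# 7).
interaction = student * income

# Extended design matrix: intercept, five predictors, interaction term.
A_ext = np.column_stack([np.ones(7), gender, age, student, married,
                         income, interaction])
rank = np.linalg.matrix_rank(A_ext)

# Assumption: joint ordinary least squares over all seven coefficients.
beta_ext, *_ = np.linalg.lstsq(A_ext, limit, rcond=None)
rss_ext = float(np.sum((limit - A_ext @ beta_ext) ** 2))

print("interaction column:", interaction)
print("design shape:", A_ext.shape, "rank:", rank)
print(f"joint-fit training RSS = {rss_ext:.6f}")
```

Because the extended design has seven linearly independent columns and only seven rows, a joint fit reproduces the training responses exactly and leaves no residual degrees of freedom for estimating a standard error.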

ii. The Interaction Term (Student × Annual Income) has no relationship with “Credit Limit”,
since its coefficient estimate is zero.
iii. { ̂} { ̂}
∑ ̅

̅ ̅ ̅ ̅ ̅ ̅ ̅

Hence,
{ ̂}

Since $\mathrm{SE}\{\hat{\beta}_6\}$ is zero, $\hat{\beta}_6$ is exactly the same as $\beta_6$. Hence, $\hat{\beta}_6$ is the best estimate of $\beta_6$.

iv. For :

̂
(̂)

There is some relationship between “Interaction Term” and “Credit Limit”. Hence, I will keep
this new model.

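v. $R^2 = 1 - \dfrac{\mathrm{RSS}}{\mathrm{TSS}}$

where, $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$

and, $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$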


$= \sum_{i=1}^{7} (y_i - \hat{y}_i)^2$

$= (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_7 - \hat{y}_7)^2$

Hence,

We can infer that 82% of the variance in “Credit Limit” is explained by regressing it onto the
five predictors. Hence, the relationship between “Credit Limit” and the five predictors is
quite strong.
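The $R^2$ computation can be sketched as follows. This is a minimal illustration, assuming NumPy and a five-predictor model refit by ordinary least squares on Table 1. Because the coefficients are re-estimated rather than taken from the values assumed in the question, the result need not reproduce the 82% quoted above. The function name `r_squared` is illustrative.

```python
import numpy as np

# Training data from Table 1.
# Columns: Gender, Age, Student, Married, Annual Income ($).
X = np.array([
    [1, 34, 0, 1,  14891],
    [0, 82, 1, 1, 106025],
    [1, 71, 0, 0, 104593],
    [0, 36, 0, 0, 148924],
    [1, 68, 0, 1,  55882],
    [1, 77, 0, 0,  80180],
    [0, 41, 1, 1,  71061],
], dtype=float)
y = np.array([3606, 6645, 7075, 9504, 4897, 8047, 6819], dtype=float)


def r_squared(y_true, y_pred):
    """Proportion of variance explained: 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - rss / tss)


# Assumption: five-predictor model refit by ordinary least squares.
A = np.column_stack([np.ones(len(y)), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
r2 = r_squared(y, A @ beta_hat)

print(f"R^2 = {r2:.2f}")
```

A perfect fit gives an $R^2$ of 1 and predicting the mean for every observation gives 0, which is a quick sanity check on the function.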

vi. The Test MSE using the extended model remains the same as that of the original model
(73,281,963.26, as given in the hint), since the extended model adds only the Interaction Term,
whose coefficient is zero.

Since the value of the Test MSE is found to be very high, the extended model is not
performing well on Test Data.
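The Test MSE calculation can be sketched as below. This is a minimal illustration, assuming NumPy and a five-predictor model refit by ordinary least squares on Table 1 and then evaluated on Table 2. The refit coefficients are not the ones assumed in the question, so the printed value is not expected to equal the hinted 73,281,963.26. The helper names `add_intercept` and `mse` are hypothetical.

```python
import numpy as np

# Columns: Gender, Age, Student, Married, Annual Income ($).
X_train = np.array([
    [1, 34, 0, 1,  14891],
    [0, 82, 1, 1, 106025],
    [1, 71, 0, 0, 104593],
    [0, 36, 0, 0, 148924],
    [1, 68, 0, 1,  55882],
    [1, 77, 0, 0,  80180],
    [0, 41, 1, 1,  71061],
], dtype=float)
y_train = np.array([3606, 6645, 7075, 9504, 4897, 8047, 6819], dtype=float)

# Test data from Table 2 (S# 8 to 10).
X_test = np.array([
    [0, 37, 0, 0, 20996],
    [1, 87, 0, 0, 71408],
    [0, 66, 0, 0, 15125],
], dtype=float)
y_test = np.array([3388, 7114, 3300], dtype=float)


def add_intercept(X):
    """Prepend a column of ones to the predictor matrix."""
    return np.column_stack([np.ones(X.shape[0]), X])


def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return float(np.mean((y_true - y_pred) ** 2))


# Assumption: coefficients are estimated on the training data only.
beta_hat, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)
y_test_hat = add_intercept(X_test) @ beta_hat
test_mse = mse(y_test, y_test_hat)

print(f"Test MSE = {test_mse:.2f}")
```

Comparing `test_mse` against the squared scale of the responses (credit limits of a few thousand dollars) gives a rough sense of whether the error is large.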

Non-Linear Regression:

d) ̂ ̂
̂

∑ ̅ ̅
where, ̂ ̂ [Don’t calculate ̂ using ̂ ̂ ]
∑ ̅

∑ ̅ ̅
∑ ̅
̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅
̅ ̅ ̅ ̅ ̅ ̅ ̅

Here,


and,
̅ ∑ ∑

Hence,
̂

∑ ̅ ̅
where, ̂ ̂ [Don’t calculate ̂ using ̂ ̂ ]
∑ ̅

∑ ̅ ̅
∑ ̅
̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅ ̅
̅ ̅ ̅ ̅ ̅ ̅ ̅

Here,

and,
̅ ∑ ∑

Hence, ̂


Hence, ̂

e) Since the coefficients of both the Quadratic and the Cubic terms are equal to zero, neither
term has a relationship with “Credit Limit”. Thus, the Quadratic Regression Model and the Cubic
Regression Model are the same as the original model.
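One way to examine the polynomial terms numerically is to compare linear, quadratic, and cubic fits. The sketch below is a simplified illustration, assuming NumPy, and it regresses “Credit Limit” on “Annual Income” alone (the relationship plotted in Figure 1) rather than on all five predictors, because seven observations cannot support the full model plus two extra terms in a joint fit. Income is rescaled to thousands of dollars for numerical conditioning. The coefficients are fitted jointly by least squares, so they need not match the values derived in part (d).

```python
import numpy as np

# Training data from Table 1.
income_k = np.array([14.891, 106.025, 104.593, 148.924,
                     55.882, 80.180, 71.061])   # Annual Income, thousands of $
limit = np.array([3606, 6645, 7075, 9504, 4897, 8047, 6819], dtype=float)

tss = float(np.sum((limit - limit.mean()) ** 2))

rss = {}
for degree in (1, 2, 3):
    # np.polyfit returns coefficients from the highest power down to the constant.
    coeffs = np.polyfit(income_k, limit, deg=degree)
    fitted = np.polyval(coeffs, income_k)
    rss[degree] = float(np.sum((limit - fitted) ** 2))
    print(f"degree {degree}: RSS = {rss[degree]:.2f}, "
          f"R^2 = {1 - rss[degree] / tss:.2f}")
```

Training RSS can only stay the same or decrease as the degree increases, so a higher-degree term is worth keeping only if the reduction is substantial relative to the linear fit.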
