Assignment 2

Date Assigned: May 24th, 2018 Date Due: May 27th, 2018

(Questions: 1)

Instructions:

Provide all numerical results with two digits of precision only. Labels of all figures and tables are mentioned below the figures and tables respectively.

Q 1. Consider the datasets provided in Tables 1 and 2, in order to answer the following questions.

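Assume …, and $\hat{\beta}_0 = \dots$, $\hat{\beta}_1 = \dots$, $\hat{\beta}_2 = \dots$, $\hat{\beta}_3 = \dots$, $\hat{\beta}_4 = \dots$, $\hat{\beta}_5 = \dots$, as obtained using the Least Squares approach.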

a) Calculate the F-statistic using the Training Data. What can be inferred from the determined value? (6 points)

b) Implement the Forward Selection technique with a Stopping Rule: “Maximum three Predictors”. (15 points)

c) Suppose an Interaction Effect exists between “Student” and “Annual Income”.

i. Extend your originally developed Multiple Linear Regression model by including the Interaction Term. (3 points)

ii. Determine the relationship between the Interaction Term and “Credit Limit” in terms of magnitude and direction. (2 points)

iii. Calculate the Standard Error (SE) of the Interaction Term Coefficient Estimate (ITCE) determined in part (i). Is the determined coefficient a good estimate? (5 points)

iv. Calculate the t-statistic for the ITCE determined in part (i). What can be inferred from the determined value? Will you revert to your originally developed Multiple Linear Regression model, or will you keep this new model? (6 points)

v. Calculate the $R^2$ statistic for the extended model over the Training Data. What can be inferred from the determined value? (13 points)

vi. Calculate the Test MSE using the extended model. Is the model performing well on Test Data? [Hint: Test MSE on the originally developed Multiple Linear Regression model = 73,281,963.26] (12 points)

Non-Linear Regression:

d) Observe the graph of “Annual Income” vs. “Credit Limit” in Figure 1. Transform your originally developed Multiple Linear Regression model by including a cubic term for “Annual Income”. (2 points)

e) What is your opinion on the transformation carried out in part (d) for: (2 points)

i. Quadratic Regression Model

Page 1 of 13

CS 4701: Data Science Assignment 2 Answer Key

S#   [?]   Age   [?]   [?]   Annual Income   Credit Limit
1    1     34    0     1     $14,891         $3,606
2    0     82    1     1     $106,025        $6,645
3    1     71    0     0     $104,593        $7,075
4    0     36    0     0     $148,924        $9,504
5    1     68    0     1     $55,882         $4,897
6    1     77    0     0     $80,180         $8,047
7    0     41    1     1     $71,061         $6,819

Table 1. Credit Card Customers - Training Data

S#   [?]   Age   [?]   [?]   Annual Income   Credit Limit
8    0     37    0     0     $20,996         $3,388
9    1     87    0     0     $71,408         $7,114
10   0     66    0     0     $15,125         $3,300

Table 2. Credit Card Customers - Test Data

[Figure 1. Graph of “Annual Income” (horizontal axis, 0 to 160,000) vs. “Credit Limit” (vertical axis, 0 to 10,000)]


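A 1.

a) $F = \dfrac{(\mathrm{TSS} - \mathrm{RSS})/p}{\mathrm{RSS}/(n - p - 1)}$

Here, $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{7} (y_i - \bar{y})^2$

$= (y_1 - \bar{y})^2 + (y_2 - \bar{y})^2 + \dots + (y_7 - \bar{y})^2$

where, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$

Hence, $\mathrm{TSS} = \dots$

Here, $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{7} (y_i - \hat{y}_i)^2$

$= (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_7 - \hat{y}_7)^2$

where, $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \hat{\beta}_3 x_{i3} + \hat{\beta}_4 x_{i4} + \hat{\beta}_5 x_{i5}$

Hence, $\mathrm{RSS} = \dots$

Hence, $F = \dots$

Since the value of $F$ is found to be very low, none of the predictors has a strong relationship with the response variable.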
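The F-statistic in part (a) can be sketched in a few lines of Python. This is only an illustration: TSS is computed from the “Credit Limit” column of Table 1, but RSS is an assumption here, backed out of the 82% explained variance quoted in part (c)(v) rather than computed from the coefficient estimates. The function names are hypothetical.

```python
def total_sum_of_squares(y):
    """TSS: squared deviations of the responses from their mean."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)


def f_statistic(tss, rss, n, p):
    """F = ((TSS - RSS) / p) / (RSS / (n - p - 1))."""
    return ((tss - rss) / p) / (rss / (n - p - 1))


# "Credit Limit" column of Table 1 (training data).
credit_limit = [3606, 6645, 7075, 9504, 4897, 8047, 6819]
n, p = len(credit_limit), 5  # seven observations, five predictors

tss = total_sum_of_squares(credit_limit)
# Assumption: RSS is derived from the 82% explained variance in part (c)(v),
# not from the fitted coefficients.
rss = (1 - 0.82) * tss
f_value = f_statistic(tss, rss, n, p)
print(f"TSS = {tss:.2f}, F = {f_value:.2f}")
```

Under that assumption F comes out at roughly 0.91. With n = 7 and p = 5 the denominator has a single degree of freedom, so even a large explained share of the variance yields a small F.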

b)

We then fit Simple Linear Regressions and calculate their respective RSS.

For :

̂ ̂


∑ ̂

∑ ̂

̂ ̂ ̂ ̂ ̂ ̂ ̂

For :

̂ ̂

For :

̂ ̂


For :

̂ ̂

For :

̂ ̂


We select … to be added to our model, since it results in the lowest RSS among all predictors.

̂ ̂ ̂

We then fit Simple Linear Regressions and calculate their respective RSS.

For :

̂ ̂

For :

̂ ̂


For :

̂ ̂

For :

̂ ̂


We select … to be added to our model, since it results in the lowest RSS among all predictors.

̂ ̂ ̂ ̂

We then fit Simple Linear Regressions and calculate their respective RSS.

For :

̂ ̂

For :

̂ ̂


For :

̂ ̂

We select … to be added to our model, since it results in the lowest RSS among all predictors.

̂ ̂ ̂ ̂ ̂
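The selection loop in part (b) can be sketched as follows. This is a minimal illustration of textbook Forward Selection, which refits the model with the already-selected predictors plus each candidate and keeps the candidate with the lowest RSS; it may differ in detail from the hand calculation above. The toy predictors `a`, `b`, `c` and the helper names are hypothetical, and the least-squares solver assumes the predictors are not collinear.

```python
def fit_rss(columns, y):
    """Least-squares fit of y on an intercept plus `columns`; returns the RSS."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X^T X) beta = X^T y, stored as an augmented matrix.
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_select(predictors, y, max_predictors=3):
    """Greedily add the predictor that gives the lowest RSS, up to the limit."""
    selected, remaining = [], list(predictors)
    while remaining and len(selected) < max_predictors:
        best = min(remaining, key=lambda name: fit_rss(
            [predictors[s] for s in selected + [name]], y))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy data: y depends on a and b exactly, c is unrelated.
toy = {
    "a": [1, 2, 3, 4, 5, 6, 7],
    "b": [1, 0, 1, 0, 0, 1, 0],
    "c": [2, 7, 1, 8, 2, 8, 1],
}
y = [1 + 2 * a + 5 * b for a, b in zip(toy["a"], toy["b"])]
print(forward_select(toy, y, max_predictors=2))
```

The `max_predictors` argument plays the role of the “Maximum three Predictors” stopping rule; the toy call uses 2 only so that the unrelated predictor is left out.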

c)

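i. $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \hat{\beta}_3 X_3 + \hat{\beta}_4 X_4 + \hat{\beta}_5 X_5 + \hat{\beta}_6 (\text{Student} \times \text{Annual Income})$

where, $\hat{\beta}_6 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, with $x_i$ the value of the Interaction Term for observation $i$

[Don’t calculate $\hat{\beta}_6$ using …]

$= \dfrac{\sum_{i=1}^{7} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{7} (x_i - \bar{x})^2}$

$= \dfrac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \dots + (x_7 - \bar{x})(y_7 - \bar{y})}{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_7 - \bar{x})^2}$

Here, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$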


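and, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$

Hence, $\hat{\beta}_6 = 0$

and, $\hat{y} = \dots$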

ii. The Interaction Term (“Student” × “Annual Income”) has no relationship with “Credit Limit”, since its coefficient ($\hat{\beta}_6$) is zero.

iii. { ̂} { ̂}

∑ ̅

̅ ̅ ̅ ̅ ̅ ̅ ̅

Hence,

{ ̂}

iv. For the Interaction Term:

$t = \dfrac{\hat{\beta}_6}{SE(\hat{\beta}_6)}$

There is some relationship between the Interaction Term and “Credit Limit”. Hence, I will keep this new model.
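The slope, standard error and t-statistic used in parts (i), (iii) and (iv) can be sketched together. This is a minimal illustration assuming the single-predictor formulas, with the error variance estimated as RSS / (n - 2); the toy data and the function name are hypothetical and do not come from Tables 1 or 2.

```python
import math


def slope_se_t(x, y):
    """Slope of y on x, its standard error, and the t-statistic for slope = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = rss / (n - 2)         # estimate of the error variance
    se = math.sqrt(sigma2 / sxx)   # SE of the slope estimate
    t = slope / se if se > 0 else float("inf")
    return slope, se, t


# Hypothetical toy data, for illustration only.
x = [0, 1, 2, 3, 4]
y = [1, 3, 4, 8, 9]
slope, se, t = slope_se_t(x, y)
print(f"slope = {slope:.2f}, SE = {se:.2f}, t = {t:.2f}")
```

A t-statistic far from zero suggests the slope is distinguishable from zero; when the estimated slope is itself zero, the t-statistic is zero as well.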

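v. $R^2 = \dfrac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \dfrac{\mathrm{RSS}}{\mathrm{TSS}}$

where, $\mathrm{TSS} = \dots$

and, $\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$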


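$= \sum_{i=1}^{7} (y_i - \hat{y}_i)^2$

$= (y_1 - \hat{y}_1)^2 + (y_2 - \hat{y}_2)^2 + \dots + (y_7 - \hat{y}_7)^2$

Hence, $R^2 = 0.82$

We can infer that 82% of the variance in “Credit Limit” is explained by regressing it onto the five different predictors. Hence, the relationship between “Credit Limit” and the five different predictors is quite strong.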
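The $R^2$ computation can be sketched as below. The responses are the “Credit Limit” values of Table 1; the `fitted` values are hypothetical placeholders, since the coefficient estimates needed for the real predictions are not reproduced here.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - RSS / TSS for observed y and fitted y_hat."""
    y_bar = sum(y) / len(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return 1 - rss / tss


# "Credit Limit" column of Table 1 (training data).
credit_limit = [3606, 6645, 7075, 9504, 4897, 8047, 6819]
# Hypothetical fitted values, for illustration only.
fitted = [4000, 6500, 7000, 9000, 5000, 7500, 7000]
print(round(r_squared(credit_limit, fitted), 2))
```

Predicting the mean for every observation gives an $R^2$ of 0, and a perfect fit gives 1, which is why a value of 0.82 is read as a strong fit on the Training Data.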

vi. Test MSE using the extended model would remain the same as that of the original model, since the extended model contains only a new Interaction Term, whose coefficient is zero.

Since the value of Test MSE is found to be very high, the extended model is not performing well on Test Data.
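To put the Test MSE on the same scale as “Credit Limit”, one can take its square root. The sketch below defines a generic MSE helper (a hypothetical name) and applies the square root to the Test MSE given in the hint to part (vi); it does not recompute the predictions themselves.

```python
import math


def mse(y, y_hat):
    """Mean squared error between observed and predicted values."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat)) / len(y)


# "Credit Limit" column of Table 2 (test data).
test_credit_limit = [3388, 7114, 3300]

# Test MSE quoted in the hint to part (vi).
reported_test_mse = 73_281_963.26
rmse = math.sqrt(reported_test_mse)
print(f"RMSE = {rmse:.2f}, largest test Credit Limit = {max(test_credit_limit)}")
```

The resulting root mean squared error is about 8,560, which exceeds every “Credit Limit” value in Table 2, consistent with the conclusion that the model performs poorly on Test Data.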

Non-Linear Regression:

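d) $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_5 X_5 + \hat{\beta}_6 (\text{Annual Income})^2 + \hat{\beta}_7 (\text{Annual Income})^3$

where, $\hat{\beta}_6 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, with $x_i$ the squared “Annual Income” of observation $i$ [Don’t calculate $\hat{\beta}_6$ using …]

$= \dfrac{\sum_{i=1}^{7} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{7} (x_i - \bar{x})^2}$

$= \dfrac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \dots + (x_7 - \bar{x})(y_7 - \bar{y})}{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_7 - \bar{x})^2}$

Here, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$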


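and, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$

Hence, $\hat{\beta}_6 = 0$

where, $\hat{\beta}_7 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, with $x_i$ the cubed “Annual Income” of observation $i$ [Don’t calculate $\hat{\beta}_7$ using …]

$= \dfrac{\sum_{i=1}^{7} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{7} (x_i - \bar{x})^2}$

$= \dfrac{(x_1 - \bar{x})(y_1 - \bar{y}) + (x_2 - \bar{x})(y_2 - \bar{y}) + \dots + (x_7 - \bar{x})(y_7 - \bar{y})}{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_7 - \bar{x})^2}$

Here, $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

and, $\bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = \frac{1}{7} \sum_{i=1}^{7} y_i$

Hence, $\hat{\beta}_7 = 0$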


Hence, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \dots + \hat{\beta}_5 X_5$

e) Since the coefficients of both the Quadratic and the Cubic terms are equal to zero, neither term has a relationship with “Credit Limit”. Thus, the Quadratic Regression Model and the Cubic Regression Model are the same as the original model.
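The scale of the polynomial-term coefficients can be checked with a short sketch. It assumes the single-predictor slope formula is applied to the squared and cubed “Annual Income” values of Table 1; the `slope` helper is a hypothetical name.

```python
def slope(x, y):
    """Single-predictor least-squares slope of y on x."""
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# "Annual Income" and "Credit Limit" columns of Table 1 (training data).
income = [14891, 106025, 104593, 148924, 55882, 80180, 71061]
credit_limit = [3606, 6645, 7075, 9504, 4897, 8047, 6819]

quadratic_slope = slope([v ** 2 for v in income], credit_limit)
cubic_slope = slope([v ** 3 for v in income], credit_limit)

# Scientific notation shows the raw magnitudes; two decimals shows rounding.
print(f"{quadratic_slope:.2e}  {cubic_slope:.2e}")
print(f"{quadratic_slope:.2f}  {cubic_slope:.2f}")
```

Because squared and cubed incomes are on the order of 10^10 and 10^15, both slopes are far smaller than 0.01 and print as 0.00 at two digits of precision. The magnitude of a coefficient depends on the units of its predictor, so rescaling income (for example to thousands of dollars) would change the printed values.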


