Business Analytics
Week 6
Keongtae Kim
Assistant Professor
Office: CYT 950
Email: keongkim@cuhk.edu.hk
Housekeeping Issues
One-page proposal for the group project
Due this Friday (22nd) at 11:59pm
Clearly indicate your data source and your business analytics objective
Submit via both Blackboard and VeriGuide
One submission per group
Please name your group if you have not already done so
Housekeeping Issues
Midterm exam
18th Mar during class (70 minutes, 2:30 to 3:40)
Covers concepts and interpretation of results
No laptop needed and no need to memorize Python code
Accounts for 20% of the total course grade
Regression
Linear Regression
Let's start with a simple linear model – one predictor
Marketing problem: I know current egg prices on Monday in California. Can I predict egg sales for the rest of the week in California?
I have: a two-year historical database
Target variable: weekly case sales of eggs
Predictor variable: egg prices
Step one: look at the raw data…
Some of the historical data:

Cases Sold / Week   Egg.Prices
96343               90.42
96345               89.33
96928               89.89
93519               90.71
99032               85.99
91539               91.83
89969               87.29
90859               96.36
99697               99.71
88350               99.38
100383              97.53
94415               100.77
91813               98
100466              99.89

Let's use a graphical model…
[Scatterplot: Cases sold vs. Egg.Pr]

Prediction: At $0.80, how many cases would you predict would be sold? At $1.05?
What cautions would you include with your predictions?
We want:
A formula that we can plug the prices into, which gives us the predicted case sales
And gives us the error range of the predicted case sales
First try: use the simplest formula around, a linear relation: Cases = a + b × Prices
Use the data to calibrate (estimate a and b) this linear model
[Diagram: the fitted regression line over the Cases–Egg.Pr scatterplot. For a given x_i, the observed value of y, the predicted value of y on the line, and the error e_i between them; slope = w_1, intercept = w_0.]
Linear Regression

$$\mathrm{SSE} = \sum_i e_i^2 = \sum_i (y_i - \hat{y}_i)^2 = \sum_i \big(y_i - (w_0 + w_1 x_i)\big)^2$$

$w_1$ can be obtained by setting the partial derivative of the SSE to 0 and solving for $w_1$, ultimately resulting in:

$$w_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$
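The closed-form solution above can be sketched in a few lines of Python; as an illustration, the sample reuses a few rows from the egg table earlier in the deck.

```python
import numpy as np

# Least-squares slope and intercept via the closed-form formulas above.
# Data: a few (price, cases) rows from the historical egg table.
prices = np.array([90.42, 89.33, 89.89, 90.71, 85.99, 91.83, 87.29])
cases = np.array([96343, 96345, 96928, 93519, 99032, 91539, 89969])

x_bar, y_bar = prices.mean(), cases.mean()
w1 = np.sum((prices - x_bar) * (cases - y_bar)) / np.sum((prices - x_bar) ** 2)
w0 = y_bar - w1 * x_bar  # intercept from the second formula

predicted = w0 + w1 * prices  # plug prices into Cases = w0 + w1 * Prices
```

The same coefficients come out of any standard least-squares routine (e.g. `np.polyfit(prices, cases, 1)`), since both minimize the SSE.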
Goodness of Fit - RMSE
RMSE: Root Mean Squared Error
Square root of the error variance (the average squared difference between the true value and the estimated value)

True Value   Estimated Value   Error
127          132               -5
78           76                2
120          122               -2
130          129               1
95           91                4

RMSE = $\sqrt{(25 + 4 + 4 + 1 + 16)/5} = \sqrt{10} \approx 3.16$
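The RMSE for the five pairs in the table can be computed directly:

```python
import math

# RMSE for the true/estimated pairs from the table above.
true_vals = [127, 78, 120, 130, 95]
estimated = [132, 76, 122, 129, 91]

errors = [t - e for t, e in zip(true_vals, estimated)]  # -5, 2, -2, 1, 4
mse = sum(err ** 2 for err in errors) / len(errors)     # mean squared error
rmse = math.sqrt(mse)                                   # sqrt(10) ≈ 3.16
```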
Goodness of Fit - R2
R2 = (SST − SSE)/SST
SST: Sum of Squares Total
SSE: Sum of Squares Error
Example: R2 = (4.907 − 0.86)/4.907 ≈ 0.82
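The R² arithmetic from the example can be checked in one line:

```python
# R^2 from the total and error sums of squares quoted in the example above.
sst = 4.907  # Sum of Squares Total
sse = 0.86   # Sum of Squares Error
r_squared = (sst - sse) / sst  # fraction of total variation explained
```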
Interpreting the output
e.g. One predictor, Egg.Pr, of Case sales

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 153414.5    15992.9   9.593 5.96e-16 ***
Egg.Pr        -553.9      168.2  -3.293  0.00136 **
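A quick way to see how the columns in this output relate: the t value is the coefficient estimate divided by its standard error (numbers taken from the output above).

```python
# The t value in regression output is estimate / standard error.
estimate = -553.9   # Egg.Pr coefficient from the output above
std_error = 168.2   # its standard error
t_value = estimate / std_error  # ≈ -3.293, matching the printed t value
```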
We could:
Drop them (the outlier points) and re-calibrate
Explore to see if there is anything else that predicts them
Go back…
Back to the original Prices–Cases Scatterplot
Weeks 40 and 91 are the Easter weeks

[Scatterplot: Cases vs. Egg.Pr, with the outlying Easter weeks 40 and 91 labeled]

Predict from the graphical model: It is next year, egg prices are 1.10, and it is Easter week. What do you expect sales to be?
Predicting, knowing price and Easter season

Some values of Cases, Egg.Pr, and the nominal Easter variable…
Easter takes the values: Non Easter, Pre Easter, Easter, and Post Easter
That's a problem: a regression needs numeric predictors, not category labels
Create new indicator variables:

New Model:
Sales = a + b × Price + c × IndEaster + d × IndPreEaster + e × IndPostEaster
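Creating the indicator variables can be sketched with pandas; the column names here are assumptions for illustration.

```python
import pandas as pd

# Turn the nominal Easter variable into indicator (dummy) columns, as the
# model above requires. Values match the four levels listed on the slide.
df = pd.DataFrame({
    "Easter": ["Non Easter", "Pre Easter", "Easter", "Post Easter", "Non Easter"],
})

# One indicator column per level; drop "Non Easter" as the baseline so the
# model keeps an intercept (the convention the R output later also uses).
dummies = pd.get_dummies(df["Easter"]).drop(columns=["Non Easter"])
```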
Points to note with categorical variables and predictive models

R CALIBRATION OUTPUT:

Coefficients:
                        Estimate  Pr(>|t|)
(Intercept)            115387.19   < 2e-16 ***
Egg.Pr                   -170.15    0.0813
Easter[T.Pre Easter]    32728.55  1.94e-08 ***
Easter[T.Easter]        76946.67   < 2e-16 ***
Easter[T.Post Easter]  -22096.43  8.25e-05 ***
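Plugging the fitted coefficients above into the model answers the earlier question, an Easter week at a price of 1.10. Treating 1.10 dollars as 110 in the cents-style units the Egg.Pr column appears to use is an assumption.

```python
# Prediction from the calibrated model above for an Easter week.
intercept = 115387.19
b_price = -170.15      # Egg.Pr coefficient
b_easter = 76946.67    # Easter[T.Easter] indicator coefficient

price = 110  # $1.10 expressed in the same units as Egg.Pr (assumption)
predicted_cases = intercept + b_price * price + b_easter * 1  # ≈ 173,617
```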
Classification
The response (dependent) variable is binary
“Hit” or “Flop” (e.g. a film or a song)
“Positive” or “Negative” (e.g. medical test results)
“Legitimate” or “Fraudulent” (e.g. credit card transaction)
“Payment” or “No Payment” (e.g., newspaper subscription)
Logistic Regression or Logit Model

PHENOMENA we want to capture:
1. While the target data is binary, we model the probability of choice
2. Probability is a continuous variable bounded by 0 and 1
3. The probability of choosing an option is related to the utility the individual would derive from the choice
4. Utilities are a linear function of customer characteristics (and possibly the attributes of the choice)
Logistic Regression
Linear regression for a binary outcome?
What are the limitations of using a linear regression here?
Logistic Regression
A regression model for binary outcomes
Logit function
Instead of using Y (or the probability p) as the dependent variable, we use a function of it, called the logit
Inverting the logit maps any value of the independent variables into a probability in [0, 1]
The Logistic Regression Model
A nonlinear regression model:

$$\text{logit} = b_0 + b_1\,\text{Gender} + b_2\,\text{Married} + b_3\,\text{Income} + b_4\,\text{Age} + e$$

Inserting logit = log(odds), with odds = p/(1−p):

$$\text{odds} = \frac{p}{1-p} = \exp\{b_0 + b_1\,\text{Gender} + b_2\,\text{Married} + b_3\,\text{Income} + b_4\,\text{Age} + e\}$$

Solving for p:

$$p = \frac{1}{1 + e^{-(b_0 + b_1\,\text{Gender} + b_2\,\text{Married} + b_3\,\text{Income} + b_4\,\text{Age} + e)}}$$

Bottom line: Logistic regression is a nonlinear function that maps any values of the input variables into a probability
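The bottom line above can be sketched directly: the logistic (inverse logit) function maps any real-valued linear utility to a probability in (0, 1). The coefficient values below are rounded from the beer preference example later in the deck and are for illustration only.

```python
import math

def logistic(z: float) -> float:
    """Inverse logit: map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative coefficients (rounded from the beer example table).
b0, b_income, b_age = -0.68, 0.00028, -0.228

# Linear utility for one hypothetical customer, then its probability.
z = b0 + b_income * 60000 + b_age * 40
p = logistic(z)  # a probability strictly between 0 and 1
```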
The Coefficients of the Logistic Model
Odds:

$$w(x_1, x_2, \dots, x_k) = \exp(a + b_1 x_1 + b_2 x_2 + \dots + b_k x_k)$$
Interpreting Coefficients of Continuous Predictors: Beer Preference Example (Optional)

Input variables   Coefficient   Std. Error   p-value      Odds
Constant term     -0.68189073   1.93081641   0.72396708   *
Gender            -0.77788508   0.71664554   0.27772108   0.45937654
Married            0.16966102   0.79447782   0.83089775   1.18490314
Income             0.00027846   0.00006335   0.00001103   1.00027847
Age               -0.22822094   0.05238947   0.00001323   0.79594839
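The "Odds" column in the table is exp(coefficient), the multiplicative change in odds for a one-unit increase in that predictor, which can be verified directly:

```python
import math

# Verify the Odds column: odds multiplier = exp(coefficient).
coefficients = {
    "Gender": -0.77788508,
    "Married": 0.16966102,
    "Income": 0.00027846,
    "Age": -0.22822094,
}

odds_multipliers = {name: math.exp(b) for name, b in coefficients.items()}
# e.g. each extra year of Age multiplies the odds by exp(-0.228) ≈ 0.796
```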