LOGISTIC REGRESSION
Recap: Linear Regression
What is regression? A regression assesses whether predictor variables account for variability in a dependent variable.
Types of regression analysis: Linear, Multiple, Logistic, Multinomial, etc.
Purpose of regression analysis: Typically, a regression analysis is used for (1) modelling the relationship between x and y, and (2) prediction of the target variable (forecasting).
Understanding Simple Linear Regression with a use case
Understanding Multiple Linear Regression with a use case
Outline of the Supervised Learning Program
Day 1: Introduction to Analytics and its Applications
Day 2: Basics of Data/Statistics/R (Analytical tool) - I
Day 3: Basics of Data/Statistics/R/Alteryx Demo (Analytical tool) - II
Day 4: Linear Regression
Day 5: Logistic Regression
Day 6: Clustering
Day 7: Decision Tree
Day 8: Time Series Modelling
Day 9: Practical Session on Use Cases
Day 10: Market Basket Analysis
Day 11: Text Mining
Day 12: Data Visualization
Plan:
Program Duration: 3 months
Start Date: 7th May '17
Sessions every Saturday & Sunday from 2 pm to 5 pm IST
8 weeks of support after the completion of the program (12 hrs, based on pre-booked appointment)
Changes in dates will be notified in advance as needed
[Figure: scatter plot of Sales against Advertising cost; the outcome variable Y is clearly not normally distributed.]
Generalized Linear Models (GLMs)
The generalized linear model (GLM) is a flexible generalization of ordinary linear
regression that allows for response variables that have other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the
response variable via a link function and by allowing the magnitude of the variance of
each measurement to be a function of its predicted value.
It uses an iteratively reweighted least squares method for maximum likelihood
estimation of the model parameters.
The GLM consists of three elements:
1. A probability distribution from the exponential family.
2. A linear predictor η = Xβ .
3. A link function g such that E(Y) = μ = g^-1(η).
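The three elements map directly onto R's glm() call. A minimal sketch, using the built-in mtcars data purely as a stand-in (the variables am, wt, and hp are assumptions for illustration, not from these slides):

```r
# Hypothetical illustration on the built-in mtcars data:
# 1. Probability distribution: binomial (a member of the exponential family)
# 2. Linear predictor: eta = X beta
# 3. Link function: logit, so E(Y) = mu = g^-1(eta) = 1 / (1 + exp(-eta))
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))
eta <- predict(fit, type = "link")   # the linear predictor eta for each row
mu  <- 1 / (1 + exp(-eta))           # applying the inverse link by hand
# Recovering the fitted probabilities confirms E(Y) = g^-1(eta)
all.equal(as.numeric(mu), as.numeric(fitted(fit)))
```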
Logistic Regression
Logistic regression is a form of regression that allows the prediction of a discrete outcome from a mix of continuous and discrete predictors.
If an independent variable is nominal with more than two levels, we need to dummy code the variable.
In binary logistic regression, the outcome can have only two possible values (e.g. "Yes" or "No", "Success" or "Failure").
log(p / (1 - p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

p = e^(β₀ + β₁x₁ + … + βₖxₖ) / (1 + e^(β₀ + β₁x₁ + … + βₖxₖ))
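The two formulas are inverses of each other: the logit of p recovers the linear predictor. A quick numeric check in R, with made-up coefficients (b0 = -1.5 and b1 = 0.8 are assumptions for illustration only):

```r
# Hypothetical coefficients, for illustration only
b0 <- -1.5
b1 <- 0.8
x  <- 2
eta <- b0 + b1 * x                  # linear predictor (the log-odds)
p   <- exp(eta) / (1 + exp(eta))    # inverse logit turns log-odds into a probability
log(p / (1 - p))                    # the logit of p gives eta back
```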
Linear regression and Logistic regression
Logistic Regression is used when the response variable is categorical in nature. For instance: Yes/No, True/False, Red/Green/Blue, 1st/2nd/3rd/4th, etc.
Linear Regression is used when the response variable is continuous. For instance: weight, height, number of hours, etc.
Linear Regression gives an equation of the form Y = mX + C, i.e. an equation of degree 1.
However, Logistic Regression gives an equation of the form
Y = e^X / (1 + e^X), equivalently Y = 1 / (1 + e^-X).
Linear Regression uses the Ordinary Least Squares method to minimise the errors and arrive at the best possible fit, while Logistic Regression uses the maximum likelihood method to arrive at the solution.
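The contrast can be sketched in R; mtcars here is just a convenient stand-in dataset (mpg is continuous, am is binary), not the data used in these slides:

```r
# Continuous response -> linear regression, fitted by Ordinary Least Squares
lin.fit <- lm(mpg ~ wt, data = mtcars)     # an equation of the form Y = mX + C
coef(lin.fit)                              # intercept C and slope m

# Binary response -> logistic regression, fitted by maximum likelihood
log.fit <- glm(am ~ wt, data = mtcars, family = binomial)
# The fitted values are probabilities, so they always stay inside (0, 1)
range(fitted(log.fit))
```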
Types of logistic regression
BINARY LOGISTIC REGRESSION
It is used when the dependent variable is dichotomous.
A catalog company wants to increase the proportion of mailings that result in sales.
A doctor wants to determine whether a tumor is more likely to be benign or malignant.
A loan officer wants to know whether the next customer is likely to default.
Using the Binary Logistic Regression procedure, the catalog company can send mailings to the people who are most likely to respond, the doctor can determine whether the tumor is more likely to be benign or malignant, and the loan officer can assess the risk of extending credit to a particular customer.
When and Why Binary Logistic Regression?
When the dependent variable does not meet parametric assumptions and we don't have homoscedasticity (the variance of the DV is not constant across values of the IVs).
Used when the dependent variable has only two levels (yes/no, male/female, taken/not taken).
When we don't have linearity.
Assumptions
No assumptions are made about the distributions of the predictor variables:
Predictors do not have to be normally distributed.
They do not have to be linearly related to the outcome.
They do not have to have equal variance within each group.
Performance of the model
Unlike R² and adjusted R² in Linear Regression, we have the null deviance, residual deviance, and AIC in Logistic Regression.
Null Deviance: the difference between actual and predicted values when only the intercept is used to predict the outcome.
Residual Deviance: the same measure when the entire equation is used to predict the outcome.
AIC (Akaike Information Criterion):
AIC = Residual deviance + 2 * number of estimated parameters
It is a measure that helps decide which model to choose.
So the lower the AIC, the better the model (provided the coefficients pass their significance tests and the model passes the chi-square deviance test).
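The AIC relationship above can be checked numerically. A sketch with two nested logistic models on the built-in mtcars data (again only a stand-in for the bank data):

```r
# Two nested logistic models; m2 adds one predictor
m1 <- glm(am ~ wt, data = mtcars, family = binomial)
m2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
# For a binary 0/1 outcome, AIC = residual deviance + 2 * number of parameters
# (m1 estimates 2 parameters: the intercept and the wt coefficient)
all.equal(AIC(m1), deviance(m1) + 2 * 2)
# The model with the lower AIC is preferred, other checks being equal
c(AIC(m1), AIC(m2))
```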
AIC
Akaike Information Criterion is a measure of the relative quality of a model that
accounts for fit and the number of terms in the model.
The statistic has no interpretation without a comparison value.
Interpretation
Use AIC to compare different models. The smaller the AIC, the better the model fits
the data.
However, the model with the smallest AIC for a set of predictors does not necessarily
fit the data well.
Also use goodness-of-fit tests and residual plots to assess how well a model fits the
data.
Use Case – German Bank
This German credit data set contains entries for 1000 customers, for whom the bank holds some information along with their credit ratings.
The response variable in this data set has two outcomes: 1 indicates the credit rating is good and 0 indicates it is not good.
We need to build a model based on those parameters to predict which customers will have a high credit rating and which will have a low one.
If this prediction can be made accurately, banks would be able to easily segment the customers to whom they should extend credit (loans and credit cards).
R code
library(MASS)
# Setting working directory
setwd("E:\\Data\\Course Modules\\Basic Foundation\\Day 5\\Usecase")
# This data set we are using here is a Bank data which contains variables like age, job, salary etc.
# The details of the data set is provided in PDF in the same folder.
# With this data our aim is to predict whether a person will default on the payment or not, which is our "Response" variable
# Let's import the dataset and view its variables.
bank.data <- read.csv(file = "bank.csv",header = T, sep = ",")
View(bank.data)
str(bank.data) # check the structure: are the categorical variables stored as factors?
# All the variables are integers, so we have to convert those that are actually categorical into factors
for (i in c(1, 3:9, 11:21, 23:31)) {
  bank.data[, i] <- factor(x = bank.data[, i], levels = sort(unique(bank.data[, i])))
}
str(bank.data)
# We will do random data partition for training and testing sets
set.seed(24)
temp <- sample(x=c("train","test"),size = nrow(bank.data),replace= T, prob=c(0.7,0.3))
# Creating training and testing datasets
training.bank <- bank.data[temp=="train",]
testing.bank <- bank.data[temp=="test",]
# Creating the logistic regression model on the training data
train.model <- glm("RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR +
FURNITURE + RADIO_TV + EDUCATION + RETRAINING + AMOUNT +
SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + MALE_SINGLE +
MALE_MAR_or_WID + CO_APPLICANT + GUARANTOR + PRESENT_RESIDENT +
REAL_ESTATE + PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT +
OWN_RES + NUM_CREDITS + JOB + NUM_DEPENDENTS + TELEPHONE +
FOREIGN",data = training.bank,family = "binomial")
summary(train.model)
# Let's do stepwise regression to find the significant variables
step.train.model <- stepAIC(train.model, direction = "both")
# With the above step we got the following model:
# RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + FURNITURE + EDUCATION + AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE +
#   MALE_SINGLE + GUARANTOR + PRESENT_RESIDENT + OTHER_INSTALL + NUM_CREDITS + FOREIGN
# Now let's check how good our model is
library(MKmisc) # To check goodness of fit for the logistic regression model
library(pROC) # To draw the ROC curve and compute the AUC
# Hosmer-Lemeshow goodness of fit test
HLgof.test(fit = step.train.model$fitted.values, obs = step.train.model$y, verbose = T)
# verbose is a logical flag: also print intermediate results
# p-value = 0.6367 (from the C statistic), so do not reject the null hypothesis
# This means the model is a good fit. We will check with other measures also
# Plotting the ROC curve
proc <- roc(response=step.train.model$y,predictor = step.train.model$fitted.values,plot = T)
proc$auc
# Area under the curve = 0.83, i.e. the model ranks a randomly chosen good-rating customer above a randomly chosen bad-rating one 83% of the time
#### PREDICTION with testing data
predict.test <- predict.glm(object = step.train.model,newdata = testing.bank,type = "response")
summary(predict.test)
test.roc <- roc(response=testing.bank$RESPONSE,predictor = predict.test,plot=T)
test.roc$auc #Testing with ROC curve
result <- as.data.frame(predict.test) # Making data frame of the testing prediction
# Converting the probabilities into categorical 1 and 0 (rule: <= 0.5 -> 0, > 0.5 -> 1)
result <- ifelse(test = result$predict.test>0.5,yes = 1,no = 0)
result
original <- testing.bank$RESPONSE
original <- as.data.frame(original) # Extracting the original response & converting to data frame
comp <- cbind(original,result) # Combining the original vs result to compare
View(comp)
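From the comp data frame above, a confusion matrix and overall accuracy are one step away. A self-contained sketch with simulated stand-ins, since bank.csv is not bundled here (original and the predicted probabilities are faked for illustration):

```r
set.seed(24)
# Simulated stand-ins for the actual response and the predicted probabilities
original <- rbinom(100, size = 1, prob = 0.7)
prob <- ifelse(original == 1, rbeta(100, 4, 2), rbeta(100, 2, 4))
result <- ifelse(prob > 0.5, 1, 0)            # same 0.5 cut-off rule as above
# Confusion matrix: rows are actual classes, columns are predicted classes
conf.mat <- table(actual = original, predicted = result)
conf.mat
# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
accuracy
```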
Accomplishments today!
What is logistic regression?
4. In simple logistic regression, the traditional goodness-of-fit measure, -2(log likelihood of
current model - log likelihood of previous model), is: (one correct choice)
a. a statistic that does not follow a Chi square PDF.
b. indicates the spread of answers to a question.
c. an index of how closely the analysis reaches statistical significance.
d. how close the predicted findings are to actual findings.
THANK YOU