LOGISTIC REGRESSION
Recap: Linear Regression
What is regression? A regression assesses whether predictor variables account for variability in a dependent variable.
Types of regression analysis: Linear, Multiple, Logistic, Multinomial, etc.
Purpose of regression analysis: Typically, a regression analysis is used for (1) modelling the relationship between x and y, and (2) prediction of the target variable (forecasting).
Understanding Simple Linear Regression with a use case
Understanding Multiple Linear Regression with a use case
Outline of the Supervised Learning Program
Day 1: Introduction to Analytics and its Applications
Day 2: Basics of Data/Statistics/R (Analytical tool) - I
Day 3: Basics of Data/Statistics/R/Alteryx Demo (Analytical tool) - II
Day 4: Linear Regression
Day 5: Logistic Regression
Day 6: Clustering
Day 7: Decision Tree
Day 8: Time Series Modelling
Day 9: Practical Session on Use Cases
Day 10: Market Basket Analysis
Day 11: Text Mining
Day 12: Data Visualization
Plan:
Program Duration: 3 months
Start Date: 7th May '17
Sessions every Saturday & Sunday from 2 pm to 5 pm IST
8 weeks of support after the completion of the program (12 hrs, based on pre-booked appointment)
Changes in dates will be notified in advance as needed
[Figure: scatter plot of Sales against Advertising cost; the outcome variable Y is clearly not normally distributed.]
Generalized Linear Models (GLMs)
The generalized linear model (GLM) is a flexible generalization of ordinary linear
regression that allows for response variables that have other than a normal distribution.
The GLM generalizes linear regression by allowing the linear model to be related to the
response variable via a link function and by allowing the magnitude of the variance of
each measurement to be a function of its predicted value.
It uses an iteratively reweighted least squares method for maximum likelihood
estimation of the model parameters.
The GLM consists of three elements:
1. A probability distribution from the exponential family.
2. A linear predictor η = Xβ .
3. A link function g such that E(Y) = μ = g^-1(η).
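The three elements map directly onto R's glm() call. A minimal sketch, using the built-in mtcars data purely as a stand-in (the variables am, wt, and hp are assumptions for illustration, not from these slides):

```r
# Hypothetical illustration on the built-in mtcars data:
# 1. Probability distribution: binomial (a member of the exponential family)
# 2. Linear predictor: eta = X beta
# 3. Link function: logit, so E(Y) = mu = g^-1(eta) = 1 / (1 + exp(-eta))
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial(link = "logit"))
eta <- predict(fit, type = "link")   # the linear predictor eta for each row
mu  <- 1 / (1 + exp(-eta))           # applying the inverse link by hand
# Recovering the fitted probabilities confirms E(Y) = g^-1(eta)
all.equal(as.numeric(mu), as.numeric(fitted(fit)))
```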
Logistic Regression
Logistic regression is a form of regression that allows the prediction of a discrete outcome from a mix of continuous and discrete predictors.
If an independent variable is nominal with more than two levels, we need to dummy code the variable.
In binary logistic regression, the outcome can have only two possible values (e.g. "Yes" or "No", "Success" or "Failure").
log(p / (1 - p)) = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

p = e^(β₀ + β₁x₁ + … + βₖxₖ) / (1 + e^(β₀ + β₁x₁ + … + βₖxₖ))
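The two formulas are inverses of each other: the logit of p recovers the linear predictor. A quick numeric check in R, with made-up coefficients (b0 = -1.5 and b1 = 0.8 are assumptions for illustration only):

```r
# Hypothetical coefficients, for illustration only
b0 <- -1.5
b1 <- 0.8
x  <- 2
eta <- b0 + b1 * x                  # linear predictor (the log-odds)
p   <- exp(eta) / (1 + exp(eta))    # inverse logit turns log-odds into a probability
log(p / (1 - p))                    # the logit of p gives eta back
```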
Linear regression and Logistic regression
Logistic Regression is used when the response variable is categorical in nature. For instance: Yes/No, True/False, Red/Green/Blue, 1st/2nd/3rd/4th, etc.
Linear Regression is used when the response variable is continuous. For instance: weight, height, number of hours, etc.
Linear Regression gives an equation of the form Y = mX + C, i.e. an equation of degree 1.
However, Logistic Regression gives an equation of the form
Y = e^X / (1 + e^X), equivalently Y = 1 / (1 + e^-X).
Linear Regression uses the Ordinary Least Squares method to minimise the errors and arrive at the best possible fit, while Logistic Regression uses the maximum likelihood method to arrive at the solution.
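The contrast can be sketched in R; mtcars here is just a convenient stand-in dataset (mpg is continuous, am is binary), not the data used in these slides:

```r
# Continuous response -> linear regression, fitted by Ordinary Least Squares
lin.fit <- lm(mpg ~ wt, data = mtcars)     # an equation of the form Y = mX + C
coef(lin.fit)                              # intercept C and slope m

# Binary response -> logistic regression, fitted by maximum likelihood
log.fit <- glm(am ~ wt, data = mtcars, family = binomial)
# The fitted values are probabilities, so they always stay inside (0, 1)
range(fitted(log.fit))
```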
Types of logistic regression
BINARY LOGISTIC REGRESSION
It is used when the dependent variable is dichotomous.
A catalog company wants to increase the proportion of mailings that result in sales.
A doctor wants to determine whether a tumor is more likely to be benign or malignant.
A loan officer wants to know whether the next customer is likely to default.
Using the Binary Logistic Regression procedure, the catalog company can send mailings to the people who are most likely to respond, the doctor can determine whether the tumor is more likely to be benign or malignant, and the loan officer can assess the risk of extending credit to a particular customer.
When and Why Binary Logistic Regression?
When the dependent variable does not meet parametric assumptions and we don't have homoscedasticity (the variance of the DV is not constant across values of the IVs).
Used when the dependent variable has only two levels (yes/no, male/female, taken/not taken).
When we don't have linearity.
Assumptions
No assumptions are made about the distributions of the predictor variables:
Predictors do not have to be normally distributed.
They do not have to be linearly related to the outcome.
They do not have to have equal variance within each group.
Performance of the model
Unlike R² and adjusted R² in Linear Regression, we have the null deviance, residual deviance, and AIC in Logistic Regression.
Null Deviance: the difference between actual and predicted values when only the intercept is used to predict the outcome.
Residual Deviance: the same measure when the entire equation is used to predict the outcome.
AIC (Akaike Information Criterion):
AIC = Residual deviance + 2 * number of estimated parameters
It is a measure that helps decide which model to choose.
So the lower the AIC, the better the model (provided the coefficients pass their significance tests and the model passes the chi-square deviance test).
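The AIC relationship above can be checked numerically. A sketch with two nested logistic models on the built-in mtcars data (again only a stand-in for the bank data):

```r
# Two nested logistic models; m2 adds one predictor
m1 <- glm(am ~ wt, data = mtcars, family = binomial)
m2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)
# For a binary 0/1 outcome, AIC = residual deviance + 2 * number of parameters
# (m1 estimates 2 parameters: the intercept and the wt coefficient)
all.equal(AIC(m1), deviance(m1) + 2 * 2)
# The model with the lower AIC is preferred, other checks being equal
c(AIC(m1), AIC(m2))
```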
AIC
Akaike Information Criterion is a measure of the relative quality of a model that
accounts for fit and the number of terms in the model.
The statistic has no interpretation without a comparison value.
Interpretation
Use AIC to compare different models. The smaller the AIC, the better the model fits
the data.
However, the model with the smallest AIC for a set of predictors does not necessarily
fit the data well.
Also use goodness-of-fit tests and residual plots to assess how well a model fits the
data.
Use Case – German Bank
This German credit data set contains entries for 1000 customers, for whom the bank holds some information along with their credit ratings.
The response variable in this data set has two outcomes: 1 indicates the credit rating is good and 0 indicates it is not good.
We need to build a model based on those parameters to predict which customers will have a high credit rating and which will have a low one.
If this prediction can be made accurately, banks would be able to easily segment the customers to whom they should extend credit (loans and credit cards).
R code
library(MASS)
# Setting working directory
setwd("E:\\Data\\Course Modules\\Basic Foundation\\Day 5\\Usecase")
# This data set we are using here is a Bank data which contains variables like age, job, salary etc.
# The details of the data set is provided in PDF in the same folder.
# With this data our aim is to predict whether a person will default on the payment or not, which is our "Response" variable
# Let's import the dataset and view its variables.
bank.data <- read.csv(file = "bank.csv",header = T, sep = ",")
View(bank.data)
str(bank.data) # check the structure: are the categorical variables stored as factors?
# All the variables are integers, so we have to convert those that are actually categorical into factors
for (i in c(1, 3:9, 11:21, 23:31)) {
  bank.data[, i] <- factor(x = bank.data[, i], levels = sort(unique(bank.data[, i])))
}
str(bank.data)
# We will do random data partition for training and testing sets
set.seed(24)
temp <- sample(x=c("train","test"),size = nrow(bank.data),replace= T, prob=c(0.7,0.3))
# Creating training and testing datasets
training.bank <- bank.data[temp=="train",]
testing.bank <- bank.data[temp=="test",]
# Creating the logistic regression model on the training data
train.model <- glm("RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR +
FURNITURE + RADIO_TV + EDUCATION + RETRAINING + AMOUNT +
SAV_ACCT + EMPLOYMENT + INSTALL_RATE + MALE_DIV + MALE_SINGLE +
MALE_MAR_or_WID + CO_APPLICANT + GUARANTOR + PRESENT_RESIDENT +
REAL_ESTATE + PROP_UNKN_NONE + AGE + OTHER_INSTALL + RENT +
OWN_RES + NUM_CREDITS + JOB + NUM_DEPENDENTS + TELEPHONE +
FOREIGN",data = training.bank,family = "binomial")
summary(train.model)
# Let's do stepwise regression to find the significant variables
step.train.model <- stepAIC(train.model, direction = "both")
# With the above step we got the following model:
# RESPONSE ~ CHK_ACCT + DURATION + HISTORY + NEW_CAR + USED_CAR + FURNITURE + EDUCATION + AMOUNT + SAV_ACCT + EMPLOYMENT + INSTALL_RATE +
#   MALE_SINGLE + GUARANTOR + PRESENT_RESIDENT + OTHER_INSTALL + NUM_CREDITS + FOREIGN
# Now let's check how good our model is
library(MKmisc) # To check goodness of fit for the logistic regression model
library(pROC) # To draw the ROC curve and compute the AUC
# Hosmer-Lemeshow goodness of fit test
HLgof.test(fit = step.train.model$fitted.values, obs = step.train.model$y, verbose = T)
# verbose is a logical flag: also print intermediate results
# p-value = 0.6367 (from the C statistic), so do not reject the null hypothesis
# This means the model is a good fit. We will check with other measures also
# Plotting the ROC curve
proc <- roc(response=step.train.model$y,predictor = step.train.model$fitted.values,plot = T)
proc$auc
# Area under the curve = 0.83, i.e. the model ranks a randomly chosen good-rating customer above a randomly chosen bad-rating one 83% of the time
#### PREDICTION with testing data
predict.test <- predict.glm(object = step.train.model,newdata = testing.bank,type = "response")
summary(predict.test)
test.roc <- roc(response=testing.bank$RESPONSE,predictor = predict.test,plot=T)
test.roc$auc #Testing with ROC curve
result <- as.data.frame(predict.test) # Making data frame of the testing prediction
# Converting the probabilities into categorical 1 and 0 (rule: <= 0.5 -> 0, > 0.5 -> 1)
result <- ifelse(test = result$predict.test>0.5,yes = 1,no = 0)
result
original <- testing.bank$RESPONSE
original <- as.data.frame(original) # Extracting the original response & converting to data frame
comp <- cbind(original,result) # Combining the original vs result to compare
View(comp)
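From the comp data frame above, a confusion matrix and overall accuracy are one step away. A self-contained sketch with simulated stand-ins, since bank.csv is not bundled here (original and the predicted probabilities are faked for illustration):

```r
set.seed(24)
# Simulated stand-ins for the actual response and the predicted probabilities
original <- rbinom(100, size = 1, prob = 0.7)
prob <- ifelse(original == 1, rbeta(100, 4, 2), rbeta(100, 2, 4))
result <- ifelse(prob > 0.5, 1, 0)            # same 0.5 cut-off rule as above
# Confusion matrix: rows are actual classes, columns are predicted classes
conf.mat <- table(actual = original, predicted = result)
conf.mat
# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)
accuracy
```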
Accomplishments today!
What is logistic regression?
4. In simple logistic regression, the traditional goodness-of-fit measure, -2(log likelihood of
current model - log likelihood of previous model), is: (one correct choice)
a. a statistic that does not follow a Chi square PDF.
b. indicates the spread of answers to a question.
c. an index of how closely the analysis reaches statistical significance.
d. how close the predicted findings are to actual findings.
THANK YOU