
Introduction to Regression
In this session
• Introduction to Regression.
– What is Regression?
– Why do we need Regression?
– Different types of Regression Models
– How to create a Regression model?

• Method of Least Squares for simple linear regression.


What is Regression?
• Regression is a tool for finding the existence of an association
relationship between a dependent variable (Y) and one or more
independent variables (X1, X2, …, Xn) in a study.

• The relationship can be linear or non-linear.


Regression
An important tool in Predictive Analytics

In Machine Learning terminology, regression is a supervised learning algorithm.
Regression – Definition

A statistical technique that attempts to determine the existence of a
possible relationship between one dependent variable (usually denoted
by Y) and a collection of independent variables.

Regression is used for generating new hypotheses and for validating a
hypothesis.
Regression vs Correlation
• Regression is the study of the existence of a relationship between
two variables. The main objective is to estimate the change in the
mean value of the dependent variable for a change in the independent
variable.

• Correlation is the study of the strength of the relationship between
two variables.
Importance of Regression

• In 1980, the Supreme Court of the USA recognized regression as a
valid method of identifying discrimination.

• The American Food and Drug Administration (FDA) uses regression as
an approved tool for validating food and drug products.
Why do we need Regression?
• Companies would like to know the factors that have a significant
impact on their Key Performance Indicators (KPIs).

• Regression helps to create new hypotheses that may assist companies
in improving their performance.
Types of Regression

Regression models are classified first by the number of independent
variables and then by the form of the relationship:

• Simple Regression – one independent variable (linear or non-linear).
• Multiple Regression – more than one independent variable (linear or
non-linear).
Types of Regression
• Simple linear regression – refers to a regression model between two
variables.

Y = β0 + β1X1 + ε

• Multiple linear regression – refers to a regression model with more
than one independent variable.

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

• Nonlinear regression, for example:

Y = β0 + 1 / (β1 + β2X1) + X2^β3 + ε
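The simple linear model above has a closed-form least-squares solution. A minimal sketch (toy data assumed for illustration, not from the slides):

```python
# A minimal sketch: fitting the simple linear model Y = b0 + b1*X + e
# by ordinary least squares, using the closed-form formulas.
def fit_simple_linear(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    b1 = num / den
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Noise-free data lying on y = 2 + 3x, so OLS recovers it exactly.
b0, b1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
print(b0, b1)  # -> 2.0 3.0
```

With noisy data the same formulas return the best-fitting line rather than an exact recovery.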
Linear Regression

• Linear regression stands for a function that is linear in the
regression coefficients.

• The following equation is treated as linear as far as regression is
concerned:

Y = β0 + β1X1 + β2X1X2 + β3X2²
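The point is that X1X2 and X2² are just derived features, so the model stays linear in the betas. A sketch with hypothetical example coefficients:

```python
# Sketch (hypothetical coefficients): a model that is nonlinear in
# X1, X2 but linear in the betas, because each term is a fixed,
# known function of the X values.
def design_row(x1, x2):
    # Derived features: intercept, x1, x1*x2, x2^2
    return [1.0, x1, x1 * x2, x2 ** 2]

def predict(betas, x1, x2):
    # Linear in betas: a plain dot product of coefficients and features
    return sum(b * f for b, f in zip(betas, design_row(x1, x2)))

betas = [1.0, 2.0, 0.5, -1.0]    # assumed example values, not estimated
print(predict(betas, 2.0, 3.0))  # 1 + 2*2 + 0.5*(2*3) - 1*3^2 = -1.0
```

Because prediction is a dot product of betas with known features, the same least-squares machinery used for plain linear models applies unchanged.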
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters
(beta values). The following are examples of multiple linear regression:

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Y = β0 + β1x1 + β2x2 + β3x1x2 + β4x2² + ... + βkxk + ε

An important task in multiple regression is to estimate the beta
values (β1, β2, β3, etc.).
Regression Model Development

1. Pre-process the data.
2. Explore the data, derive and analyze descriptive statistics, and
divide the data into training and validation data.
3. Define the functional form of the relationship.
4. Estimate the regression parameters.
5. Perform diagnostic tests.
6. If the model satisfies the diagnostic tests, stop; otherwise,
revisit the functional form and re-estimate.
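The training/validation split in the development flow can be sketched as follows (the 70/30 ratio and fixed seed are assumptions for illustration, not from the slides):

```python
import random

def train_validation_split(rows, train_frac=0.7, seed=42):
    # Shuffle a copy so the original ordering is untouched, then slice.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, valid = train_validation_split(data)
print(len(train), len(valid))  # -> 7 3
```

The model is then estimated on the training rows and the diagnostic tests are run against the held-out validation rows.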
Regression Functional Form
1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of Random Error Term


– Estimate Standard Deviation of Error

4. Validate Model for its fitness.

5. Use Model for Prediction & Estimation


Functional Form

• Specify the explanatory variables.

• Specify the nature of the relationship between the dependent
variable and the explanatory variables.
Linear Regression Model

The relationship between the variables is a linear function:

Yi = β0 + β1Xi + εi

where β0 is the population Y-intercept, β1 is the population slope,
and εi is the random error. Yi is the dependent (response) variable
(e.g., income) and Xi is the independent (explanatory) variable
(e.g., education).
Deterministic Component in Regression

General form of regression models:

Y = Deterministic Component + Random Error, where
E(Y) = Deterministic Component.

The deterministic component is a mathematical combination of the
independent variables.
Scatter Plot
What is a scatter plot?
• A scatter plot (also called a scatter diagram) is a graph used to
display and compare two or more variables.

• A scatter plot doesn't require the user to specify the dependent and
independent variables.
Interpreting a Scatter Plot

• In any graph of data, look for
– the overall pattern, and
– striking deviations from that pattern.
• You can describe the overall pattern of a scatter plot by its
– form (linear pattern?),
– direction (positive, negative, or flat),
– strength of the relationship (correlation).
• Watch for outliers.
– An outlier is an individual value that falls outside the overall
pattern of the relationship.
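Direction and strength can be quantified with the correlation coefficient. A sketch on toy data (the example points are assumptions, not from the slides):

```python
from math import sqrt

def pearson_r(xs, ys):
    # r = Cov(X, Y) / (S_X * S_Y), computed from deviations about the means.
    # Sign gives the direction; magnitude (0 to 1) gives the strength.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Perfectly increasing points: positive direction, maximal strength.
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))  # -> 1.0
```

A value near 0 corresponds to the "flat / no relationship" pattern; note that r only measures linear association, so a strong quadratic or exponential pattern can still give a small r.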
[Example scatter plots: No Relationship, Strong Positive, Strong
Negative, Quadratic, Exponential, Outlier]

Model Assumptions
Linear Regression Model Assumptions
• The regression model is linear in parameters.
• The explanatory variable X is assumed to be non-stochastic.
• Given the value of X (say Xi), the mean of the random error term
εi is zero.
• The error term, εi, follows a normal distribution.
• Given the value of X, the variance of εi is constant
(Homoscedasticity).
• There is no autocorrelation between two εi values.
Assumptions Continued…

• Low correlation between Xi and εi.


• The number of observations n must be greater than the
number of parameters to be estimated.
• The X values in a given sample must not all be the same.
Technically, Var(X) must be a finite positive number.
• The regression model is correctly specified.
• There is no perfect multi-collinearity (no perfect linear
relationship) among explanatory variables.
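Several of these assumptions can be checked from the fitted residuals. A hedged sketch with assumed toy data and an assumed fitted line y = 1 + 2x (neither is from the slides):

```python
# Two quick residual checks tied to the assumptions above,
# using assumed toy data and an assumed fitted line y_hat = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
residuals = [y - (1.0 + 2.0 * x) for x, y in zip(xs, ys)]

# Check 1: the mean of the residuals should be near zero.
mean_res = sum(residuals) / len(residuals)

# Check 2 (homoscedasticity, rough version): the residual spread over
# the lower half of the X range should be comparable to the upper half.
half = len(residuals) // 2
def spread(rs):
    return max(rs) - min(rs)

print(round(mean_res, 6), spread(residuals[:half]), spread(residuals[half:]))
```

In practice these eyeball checks are backed by formal diagnostics (e.g., normality and autocorrelation tests on the residuals) in the diagnostic-test step of the development flow.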
Estimation of Parameters

[Figure: a random sample is drawn from the population, whose unknown
relationship is Yi = β0 + β1Xi + εi; the sample is used to estimate
this relationship.]
Population Linear Regression Model

Yi = β0 + β1Xi + εi, where εi is the random error.

The fitted line estimates the conditional mean of Y given X:

E(Y|X) = β̂0 + β̂1Xi
What is the best fit?
How would you draw a line through the points?
How do you determine which line "fits best"?

[Figure: scatter plot of Y against X, both ranging over 0–60]
Method of Ordinary Least Squares (OLS)
Least Squares Graphically

Least squares minimizes the sum of squared residuals:

Σ(i=1 to n) ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²

Each observation, e.g. Y2 = β̂0 + β̂1X2 + ε̂2, deviates from the
fitted line Ŷi = β̂0 + β̂1Xi by the residual ε̂i.
Estimation of Parameters in Regression

The least squares function is given by

SSE = Σ(i=1 to n) εi² = Σ(i=1 to n) ( yi − β0 − Σ(j=1 to k) βj xij )²
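The SSE above is what OLS minimizes; it is straightforward to compute directly. A sketch on assumed toy data, showing that the true line achieves a smaller SSE than an alternative:

```python
# Sketch: SSE for a candidate coefficient pair (b0, b1) in the simple
# (one-predictor) case, on assumed toy data lying exactly on y = 1 + 2x.
def sse(b0, b1, xs, ys):
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # exactly y = 1 + 2x

print(sse(1.0, 2.0, xs, ys))     # the true line: SSE = 0.0
print(sse(0.0, 2.5, xs, ys))     # any other line has a larger SSE
```

OLS searches over all (b0, b1) pairs for the minimizer; for linear models this minimizer has the closed form given on the next slide.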
Regression Coefficient (β1) in SLR

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Cov(X, Y) / Var(X)

β̂1 = r × (SY / SX)

where r is the correlation coefficient between X and Y,
SY is the standard deviation of Y, and
SX is the standard deviation of X.
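The two formulas for β̂1 are algebraically identical, which is easy to confirm numerically. A sketch on assumed toy data:

```python
from math import sqrt

# Assumed toy data (not from the slides), chosen so the slope is 1.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 4.0, 4.5, 6.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

b1_cov = sxy / sxx                    # Cov(X, Y) / Var(X)
r = sxy / sqrt(sxx * syy)             # correlation coefficient
b1_corr = r * sqrt(syy) / sqrt(sxx)   # r * (S_Y / S_X)

print(abs(b1_cov - b1_corr) < 1e-9)   # -> True: the formulas agree
```

Note that the (n − 1) divisors in Cov, Var, SY, and SX cancel in both formulas, so they can be dropped from the ratios.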
Why Least Squares Estimates?
• OLS beta estimates are "Best Linear Unbiased Estimates (BLUE)",
provided the error terms are uncorrelated (no autocorrelation) and
have equal variance (homoscedasticity). That is,

E[β̂ − β] = 0
Advantages of OLS Estimates
• They are unbiased estimates.

• They (the estimates) have minimum variance.

• They are consistent: as the sample size increases, the estimate β̂i
converges to the true population parameter value βi.
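Consistency can be illustrated by simulation: with a large sample from a known model, the OLS slope lands very close to the true slope. A hedged sketch (model, sample size, and tolerance are all assumptions for illustration):

```python
import random

# Consistency illustration: simulate y = 1 + 2x + noise and check that
# the OLS slope estimate settles near the true slope of 2.0.
def ols_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)  # fixed seed for reproducibility
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [1.0 + 2.0 * x + rng.uniform(-1, 1) for x in xs]

print(abs(ols_slope(xs, ys) - 2.0) < 0.05)  # -> True
```

With a much smaller sample the estimate fluctuates more around 2.0, which is exactly what the shrinking-variance (consistency) property describes.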
