
Introduction to Regression
In this session
• Introduction to Regression.
– What is Regression?
– Why do we need Regression?
– Different types of Regression Models
– How to create a Regression model?

• Method of Least Squares for simple linear regression.


What is Regression?
• Regression is a tool for finding the existence of an association
relationship between a dependent variable (Y) and one or more
independent variables (X1, X2, …, Xn) in a study.

• The relationship can be linear or non-linear.


Regression
An important tool in Predictive Analytics

In Machine Learning terminology, regression is a supervised learning algorithm.
Regression – Definition

A statistical technique that attempts to determine the existence of a
possible relationship between one dependent variable (usually denoted
by Y) and a collection of independent variables.

Regression is used for generating new hypotheses and for validating a
hypothesis.
Regression vs Correlation
• Regression is the study of the existence of a relationship between
two variables. The main objective is to estimate the change in the
mean value of the dependent variable for a change in the independent
variable.

• Correlation is the study of the strength of the relationship between
two variables.
Importance of Regression

• In 1980, the Supreme Court of the USA recognized regression as a
valid method of identifying discrimination.

• The American Food and Drug Administration (FDA) uses regression as
an approved tool for validating food and drug products.
Why do we need Regression?
• Companies would like to know the factors that have a significant
impact on their Key Performance Indicators (KPIs).

• Regression helps to create new hypotheses that may assist companies
in improving their performance.
Types of Regression

Regression models are classified first by the number of independent
variables and then by the form of the relationship:

• Simple Regression – one independent variable (linear or non-linear).
• Multiple Regression – more than one independent variable (linear or
non-linear).
Types of Regression
• Simple linear regression – refers to a regression model between two
variables.

Y = β0 + β1X1 + ε

• Multiple linear regression – refers to a regression model with more
than one independent variable.

Y = β0 + β1X1 + β2X2 + ... + βkXk + ε

• Nonlinear regression, for example:

Y = β0 + 1 / (β1 + β2X1) + X2^β3 + ε
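The simple linear model above has a closed-form least-squares solution. A minimal sketch (toy data assumed for illustration, not from the slides):

```python
# A minimal sketch: fitting the simple linear model Y = b0 + b1*X + e
# by ordinary least squares, using the closed-form formulas.
def fit_simple_linear(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    b1 = num / den
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Noise-free data lying on y = 2 + 3x, so OLS recovers it exactly.
b0, b1 = fit_simple_linear([0.0, 1.0, 2.0, 3.0], [2.0, 5.0, 8.0, 11.0])
print(b0, b1)  # -> 2.0 3.0
```

With noisy data the same formulas return the best-fitting line rather than an exact recovery.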
Linear Regression

• Linear regression stands for a function that is linear in the
regression coefficients.

• The following equation is treated as linear as far as regression is
concerned:

Y = β0 + β1X1 + β2X1X2 + β3X2²
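The point is that X1X2 and X2² are just derived features, so the model stays linear in the betas. A sketch with hypothetical example coefficients:

```python
# Sketch (hypothetical coefficients): a model that is nonlinear in
# X1, X2 but linear in the betas, because each term is a fixed,
# known function of the X values.
def design_row(x1, x2):
    # Derived features: intercept, x1, x1*x2, x2^2
    return [1.0, x1, x1 * x2, x2 ** 2]

def predict(betas, x1, x2):
    # Linear in betas: a plain dot product of coefficients and features
    return sum(b * f for b, f in zip(betas, design_row(x1, x2)))

betas = [1.0, 2.0, 0.5, -1.0]    # assumed example values, not estimated
print(predict(betas, 2.0, 3.0))  # 1 + 2*2 + 0.5*(2*3) - 1*3^2 = -1.0
```

Because prediction is a dot product of betas with known features, the same least-squares machinery used for plain linear models applies unchanged.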
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters
(beta values). The following are examples of multiple linear regression:

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Y = β0 + β1x1 + β2x2 + β3x1x2 + β4x2² + ... + βkxk + ε

An important task in multiple regression is to estimate the beta
values (β1, β2, β3, etc.).
Regression Model Development

1. Pre-process the data.
2. Explore the data, derive and analyze descriptive statistics, and
divide the data into training and validation data.
3. Define the functional form of the relationship.
4. Estimate the regression parameters.
5. Perform diagnostic tests.
6. If the model satisfies the diagnostic tests, stop; otherwise,
revisit the functional form and re-estimate.
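The training/validation split in the development flow can be sketched as follows (the 70/30 ratio and fixed seed are assumptions for illustration, not from the slides):

```python
import random

def train_validation_split(rows, train_frac=0.7, seed=42):
    # Shuffle a copy so the original ordering is untouched, then slice.
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

data = list(range(10))
train, valid = train_validation_split(data)
print(len(train), len(valid))  # -> 7 3
```

The model is then estimated on the training rows and the diagnostic tests are run against the held-out validation rows.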
Regression Functional Form
1. Hypothesize Deterministic Component

2. Estimate Unknown Model Parameters

3. Specify Probability Distribution of Random Error Term


– Estimate Standard Deviation of Error

4. Validate Model for its fitness.

5. Use Model for Prediction & Estimation


Functional Form

• Specify the explanatory variables.

• Specify the nature of the relationship between the dependent
variable and the explanatory variables.
Linear Regression Model

The relationship between the variables is a linear function:

Yi = β0 + β1Xi + εi

where β0 is the population Y-intercept, β1 is the population slope,
and εi is the random error. Yi is the dependent (response) variable
(e.g., income) and Xi is the independent (explanatory) variable
(e.g., education).
Deterministic Component in Regression

General form of regression models:

Y = Deterministic Component + Random Error, where
E(Y) = Deterministic Component.

The deterministic component is a mathematical combination of the
independent variables.
Scatter Plot
What is a scatter plot?
• A scatter plot (also called a scatter diagram) is a graph used to
display and compare two or more variables.

• A scatter plot doesn't require the user to specify the dependent and
independent variables.
Interpreting a Scatter Plot

• In any graph of data, look for
– the overall pattern, and
– striking deviations from that pattern.
• You can describe the overall pattern of a scatter plot by its
– form (linear pattern?),
– direction (positive, negative, or flat),
– strength of the relationship (correlation).
• Watch for outliers.
– An outlier is an individual value that falls outside the overall
pattern of the relationship.
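Direction and strength can be quantified with the correlation coefficient. A sketch on toy data (the example points are assumptions, not from the slides):

```python
from math import sqrt

def pearson_r(xs, ys):
    # r = Cov(X, Y) / (S_X * S_Y), computed from deviations about the means.
    # Sign gives the direction; magnitude (0 to 1) gives the strength.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Perfectly increasing points: positive direction, maximal strength.
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))  # -> 1.0
```

A value near 0 corresponds to the "flat / no relationship" pattern; note that r only measures linear association, so a strong quadratic or exponential pattern can still give a small r.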
[Example scatter plots: No Relationship, Strong Positive, Strong
Negative, Quadratic, Exponential, Outlier]

Model Assumptions
Linear Regression Model Assumptions
• The regression model is linear in parameters.
• The explanatory variable X is assumed to be non-stochastic.
• Given the value of X (say Xi), the mean of the random error term
εi is zero.
• The error term, εi, follows a normal distribution.
• Given the value of X, the variance of εi is constant
(Homoscedasticity).
• There is no autocorrelation between two εi values.
Assumptions Continued…

• Low correlation between Xi and εi.


• The number of observations n must be greater than the
number of parameters to be estimated.
• The X values in a given sample must not all be the same.
Technically, Var(X) must be a finite positive number.
• The regression model is correctly specified.
• There is no perfect multi-collinearity (no perfect linear
relationship) among explanatory variables.
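Several of these assumptions can be checked from the fitted residuals. A hedged sketch with assumed toy data and an assumed fitted line y = 1 + 2x (neither is from the slides):

```python
# Two quick residual checks tied to the assumptions above,
# using assumed toy data and an assumed fitted line y_hat = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
residuals = [y - (1.0 + 2.0 * x) for x, y in zip(xs, ys)]

# Check 1: the mean of the residuals should be near zero.
mean_res = sum(residuals) / len(residuals)

# Check 2 (homoscedasticity, rough version): the residual spread over
# the lower half of the X range should be comparable to the upper half.
half = len(residuals) // 2
def spread(rs):
    return max(rs) - min(rs)

print(round(mean_res, 6), spread(residuals[:half]), spread(residuals[half:]))
```

In practice these eyeball checks are backed by formal diagnostics (e.g., normality and autocorrelation tests on the residuals) in the diagnostic-test step of the development flow.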
Estimation of Parameters

[Figure: a random sample is drawn from the population, whose unknown
relationship is Yi = β0 + β1Xi + εi; the sample is used to estimate
this relationship.]
Population Linear Regression Model

Yi = β0 + β1Xi + εi, where εi is the random error.

The fitted line estimates the conditional mean of Y given X:

E(Y|X) = β̂0 + β̂1Xi
What is the best fit?
How would you draw a line through the points?
How do you determine which line "fits best"?

[Figure: scatter plot of Y against X, both ranging over 0–60]
Method of Ordinary Least Squares (OLS)
Least Squares Graphically

Least squares minimizes the sum of squared residuals:

Σ(i=1 to n) ε̂i² = ε̂1² + ε̂2² + ε̂3² + ε̂4²

Each observation, e.g. Y2 = β̂0 + β̂1X2 + ε̂2, deviates from the
fitted line Ŷi = β̂0 + β̂1Xi by the residual ε̂i.
Estimation of Parameters in Regression

The least squares function is given by

SSE = Σ(i=1 to n) εi² = Σ(i=1 to n) ( yi − β0 − Σ(j=1 to k) βj xij )²
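The SSE above is what OLS minimizes; it is straightforward to compute directly. A sketch on assumed toy data, showing that the true line achieves a smaller SSE than an alternative:

```python
# Sketch: SSE for a candidate coefficient pair (b0, b1) in the simple
# (one-predictor) case, on assumed toy data lying exactly on y = 1 + 2x.
def sse(b0, b1, xs, ys):
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]        # exactly y = 1 + 2x

print(sse(1.0, 2.0, xs, ys))     # the true line: SSE = 0.0
print(sse(0.0, 2.5, xs, ys))     # any other line has a larger SSE
```

OLS searches over all (b0, b1) pairs for the minimizer; for linear models this minimizer has the closed form given on the next slide.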
Regression Coefficient (β1) in SLR

β̂1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² = Cov(X, Y) / Var(X)

β̂1 = r × (SY / SX)

where r is the correlation coefficient between X and Y,
SY is the standard deviation of Y, and
SX is the standard deviation of X.
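The two formulas for β̂1 are algebraically identical, which is easy to confirm numerically. A sketch on assumed toy data:

```python
from math import sqrt

# Assumed toy data (not from the slides), chosen so the slope is 1.0.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 2.5, 4.0, 4.5, 6.0]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

b1_cov = sxy / sxx                    # Cov(X, Y) / Var(X)
r = sxy / sqrt(sxx * syy)             # correlation coefficient
b1_corr = r * sqrt(syy) / sqrt(sxx)   # r * (S_Y / S_X)

print(abs(b1_cov - b1_corr) < 1e-9)   # -> True: the formulas agree
```

Note that the (n − 1) divisors in Cov, Var, SY, and SX cancel in both formulas, so they can be dropped from the ratios.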
Why Least Squares Estimates?
• OLS beta estimates are "Best Linear Unbiased Estimates (BLUE)",
provided the error terms are uncorrelated (no autocorrelation) and
have equal variance (homoscedasticity). That is,

E[β̂ − β] = 0
Advantages of OLS Estimates
• They are unbiased estimates.

• They (the estimates) have minimum variance.

• They are consistent: as the sample size increases, the estimate β̂i
converges to the true population parameter value βi.
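Consistency can be illustrated by simulation: with a large sample from a known model, the OLS slope lands very close to the true slope. A hedged sketch (model, sample size, and tolerance are all assumptions for illustration):

```python
import random

# Consistency illustration: simulate y = 1 + 2x + noise and check that
# the OLS slope estimate settles near the true slope of 2.0.
def ols_slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)  # fixed seed for reproducibility
xs = [rng.uniform(0, 10) for _ in range(5000)]
ys = [1.0 + 2.0 * x + rng.uniform(-1, 1) for x in xs]

print(abs(ols_slope(xs, ys) - 2.0) < 0.05)  # -> True
```

With a much smaller sample the estimate fluctuates more around 2.0, which is exactly what the shrinking-variance (consistency) property describes.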
