
Father of Regression Analysis
Carl Friedrich Gauss (1777-1855)

The first person to explain the common phenomenon of regression
Sir Francis Galton (1822-1911)
Regression analysis is widely used for prediction
and forecasting, where its use has substantial
overlap with the field of machine learning.
In general, it is used to predict the value of one
variable (dependent variable) on the basis of
other variables (independent variables).
[Figure: Direct Relationship. Price (y) plotted against Quantity (x).]
[Figure: Inverse Relationship. Price (y) plotted against Quantity (x).]
REGRESSION
LINEAR REGRESSION
STEPS
Analyzing the Scatter Plot
Perform Regression Analysis
Interpret
A scatter diagram is used to show the relationship between
two variables.


\[ Y = a + bX \]

Where
Y: Dependent variable
a: Y-intercept
b: slope of the line
X: Independent variable









Graph showing the effect of precipitation on
the yield of cucumber in Uttar Pradesh in 2010.
We need to identify how we can fit a line
between the points which are scattered.

The line will have a GOOD FIT if it
minimizes the error between the estimated
points on the line and actual observed points.
FORMULAE:

The Estimating Line
\[ \hat{Y} = a + bX \]

slope
\[ b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} \]

Y-intercept
\[ a = \bar{Y} - b\bar{X} \]
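The slope and Y-intercept formulae can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `fit_line` and the four sample points are hypothetical, chosen so that the points lie exactly on Y = 1 + 2X.

```python
def fit_line(xs, ys):
    """Return (a, b) for the estimating line Y-hat = a + b*X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # slope: (sum XY - n * X-bar * Y-bar) / (sum X^2 - n * X-bar^2)
    b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)
    # Y-intercept: Y-bar - b * X-bar
    a = y_bar - b * x_bar
    return a, b


# Hypothetical points that lie exactly on Y = 1 + 2X
a, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the sample points are perfectly linear, the fitted intercept and slope should come out as 1 and 2 up to floating-point rounding.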
Standard Error of Estimate
The standard error of the estimate is a
measure of the accuracy of predictions.


\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} \]

\[ s_e = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}} \]

where,
Y: value of dependent variable
\(\hat{Y}\): estimated values from the estimating equation
n: number of data points
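The two forms of the standard error of estimate should give the same number when a and b are the least-squares values. The sketch below checks this on a small, hypothetical data set; the function names and the five data points are assumptions made for illustration only.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y-hat = a + b*X."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
        sum(x * x for x in xs) - n * x_bar ** 2
    )
    return y_bar - b * x_bar, b


def se_definition(xs, ys, a, b):
    """s_e from the squared residuals: sqrt(sum (Y - Y-hat)^2 / (n - 2))."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


def se_shortcut(xs, ys, a, b):
    """s_e from the shortcut: sqrt((sum Y^2 - a sum Y - b sum XY) / (n - 2))."""
    sum_y2 = sum(y * y for y in ys)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    return math.sqrt((sum_y2 - a * sum_y - b * sum_xy) / (len(xs) - 2))


# Hypothetical data, not taken from the slides
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
se_def = se_definition(xs, ys, a, b)
se_short = se_shortcut(xs, ys, a, b)
print(se_def, se_short)
```

The shortcut form avoids computing each residual, which is convenient by hand, but it subtracts large sums and can lose precision on a computer, so the residual-based form is usually the safer choice in code.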
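[Figure: Regression line \(\hat{Y} = a + bX\) with bands at \(\pm 1s_e\), \(\pm 2s_e\) and \(\pm 3s_e\). About 68% of the points lie within \(\pm 1s_e\) of the line, 95.5% within \(\pm 2s_e\) and 99.7% within \(\pm 3s_e\).]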
Case Study 1-a
Background:
An analyst wants to determine the relationship between years of experience possessed by workers and the hourly wage rate.

Experience (yrs)    Hourly Earnings ($)
1                   12.98
2                   13.94
3                   17.07
4                   18.27
5                   21.63
6                   22.84
7                   24.04
8                   25.84
9                   26.08
10                  28.61
[Figure: Scatter Diagram. Hourly Rate ($) plotted against Experience (yrs).]
Formula for Estimating Line:

\[ \hat{Y} = a + bX \]

After calculating the values of b and a, we get
b = 1.746 and a = 11.529
Substituting X = 11 into the estimating equation, we get
\(\hat{Y}\) = 30.735
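The values of b and a can be reproduced directly from the experience and earnings table. The following is a minimal sketch using the slope and intercept formulae; the variable names are illustrative and not part of the original material.

```python
# Data from the case study table: experience (yrs) and hourly earnings ($)
experience = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
earnings = [12.98, 13.94, 17.07, 18.27, 21.63, 22.84, 24.04, 25.84, 26.08, 28.61]

n = len(experience)
x_bar = sum(experience) / n
y_bar = sum(earnings) / n
sum_xy = sum(x * y for x, y in zip(experience, earnings))
sum_x2 = sum(x * x for x in experience)

b = (sum_xy - n * x_bar * y_bar) / (sum_x2 - n * x_bar ** 2)  # slope
a = y_bar - b * x_bar                                         # Y-intercept

predicted_at_11 = a + b * 11  # estimated hourly rate for 11 years of experience
print(round(b, 4), round(a, 3), round(predicted_at_11, 2))
```

With unrounded coefficients the prediction comes out at about 30.73; the value 30.735 in the text results from rounding b to 1.746 before substituting.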

[Figure: Scatter Diagram with Regression Line. Hourly Rate ($) plotted against Experience (yrs), with the fitted line \(\hat{Y} = 1.7456X + 11.529\).]

Hence, we can conclude that the hourly rate of a worker with eleven years of experience would be about $30.74.
Calculating the Standard Error of Estimate using,

\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - 2}} \]

We get, \( s_e = 0.842 \)
Therefore, we can conclude that the predicted hourly rate for a worker will typically differ from the actual rate by approximately $0.84.
SAMPLE COEFFICIENT OF DETERMINATION
It provides a measure of how well future outcomes are likely to be predicted by the applied method.

VARIATION OF THE Y VALUES AROUND THE REGRESSION LINE
\[ \sum (Y - \hat{Y})^2 \]

VARIATION OF THE Y VALUES AROUND THEIR OWN MEAN
\[ \sum (Y - \bar{Y})^2 \]

\[ r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \]
Based on the previous data, we can calculate the Coefficient of Determination using the formula,

\[ r^2 = 1 - \frac{\sum (Y - \hat{Y})^2}{\sum (Y - \bar{Y})^2} \]

We get, \( r^2 = 0.978 \)
Hence, we can conclude that there is a strong correlation between a worker's experience and the respective hourly rate.
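The coefficient of determination for the case study can be checked with a short script. This is a sketch under the same least-squares formulae used earlier; the variable names are illustrative only.

```python
# Data from the case study table: experience (yrs) and hourly earnings ($)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [12.98, 13.94, 17.07, 18.27, 21.63, 22.84, 24.04, 25.84, 26.08, 28.61]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
    sum(x * x for x in xs) - n * x_bar ** 2
)
a = y_bar - b * x_bar

# Variation of the Y values around the regression line
var_around_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
# Variation of the Y values around their own mean
var_around_mean = sum((y - y_bar) ** 2 for y in ys)

r_squared = 1 - var_around_line / var_around_mean
print(round(r_squared, 3))
```

A value close to 1 means that almost all of the variation in hourly earnings around their mean is accounted for by the fitted line.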
[Figure: Scatter Diagram. Hourly Rate ($) plotted against Experience (yrs), with the fitted line \(\hat{Y} = 1.7456X + 11.529\) and \(r^2 = 0.9779\).]
Population Regression line
\[ Y = A + BX \]

Population Regression line with random disturbance
\[ Y = A + BX + e \]

Where:
e is random disturbance from population regression line
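Standardized value of b
\[ t = \frac{b - B_{H_0}}{s_b} \]

Where:
\( B_{H_0} \): Slope of population regression line (hypothesized)
\( s_b \): Standard error of b (regression coefficient)

Standard error of b
\[ s_b = \frac{s_e}{\sqrt{\sum X^2 - n\bar{X}^2}} \]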
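As an illustration of the standardized value of b, the sketch below applies the two formulae to the earlier experience and earnings data. The hypothesized population slope of 0 is an assumption made here for the example (the slides do not state one), and the variable names are illustrative.

```python
import math

# Data from the earlier case study: experience (yrs) and hourly earnings ($)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [12.98, 13.94, 17.07, 18.27, 21.63, 22.84, 24.04, 25.84, 26.08, 28.61]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum(x * x for x in xs) - n * x_bar ** 2  # sum X^2 - n * X-bar^2
b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / sxx
a = y_bar - b * x_bar

# Standard error of estimate, then standard error of the slope
s_e = math.sqrt(sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / (n - 2))
s_b = s_e / math.sqrt(sxx)

B_H0 = 0.0  # assumed null hypothesis: no linear relationship in the population
t = (b - B_H0) / s_b
print(round(s_b, 4), round(t, 2))
```

The resulting t would then be compared with a tabulated t value for n - 2 degrees of freedom at the chosen significance level.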
The estimating equation is valid only over the same range as the one from which the initial sample was taken.
Misinterpreting r and \(r^2\).

Regression and correlation analyses can in no way determine cause and effect.
MULTIPLE REGRESSION
The principles of Linear Regression can be
extended to two or more explanatory variables.




As an example, if a hypotensive agent is administered prior to surgery, the time for blood pressure to recover to its normal value will depend on the dose of the hypotensive agent and the blood pressure during surgery.
\[ \hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + \dots \]
Minitab is a statistics package developed at the Pennsylvania State University in 1972.

It computes the regression coefficients and
several statistics associated with the regression
equation.


The Regression Equation
\[ \hat{Y} = a + b_1 X_1 + b_2 X_2 + b_3 X_3 \]

Standard Error Of Estimate
\[ s_e = \sqrt{\frac{\sum (Y - \hat{Y})^2}{n - k - 1}} \]

Approximate Confidence Interval
\[ \hat{Y} \pm t\,(s_e) \]
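To make the multiple regression equation and its standard error concrete, here is a rough sketch in plain Python that solves the least-squares normal equations by Gaussian elimination. It is not how Minitab computes its output; the helper names and the six data rows are hypothetical, and the data are built to lie exactly on Y = 1 + 2X1 + 3X2 so the result is easy to check. Here k is taken to be the number of independent variables.

```python
import math


def solve(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_multiple(rows, ys):
    """Return [a, b1, ..., bk] minimizing the squared error (normal equations)."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept a
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)


def predict(coeffs, row):
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


def standard_error(rows, ys, coeffs):
    """s_e = sqrt(sum (Y - Y-hat)^2 / (n - k - 1)), k = number of X variables."""
    n, k = len(ys), len(coeffs) - 1
    sse = sum((y - predict(coeffs, r)) ** 2 for r, y in zip(rows, ys))
    return math.sqrt(sse / (n - k - 1))


# Hypothetical data lying exactly on Y = 1 + 2*X1 + 3*X2
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coeffs = fit_multiple(rows, ys)
s_e = standard_error(rows, ys, coeffs)
print([round(c, 6) for c in coeffs], s_e)
```

Because the sample points lie exactly on a plane, the standard error should be essentially zero; with real data it would be positive. For larger problems a statistics package or a linear-algebra library is a more robust choice than hand-rolled elimination.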
\[ Y = A + B_1 X_1 + B_2 X_2 + \dots + B_k X_k + e \]

Y: Dependent (response) variable
A: Population Y-intercept
\( B_1, \dots, B_k \): Population slopes
\( X_1, \dots, X_k \): Independent (explanatory) variables
e: Random error

Test Statistic:
Standardized Regression Coefficient:

\[ t = \frac{b_i - B_{i0}}{s_{b_i}} \]

\( b_i \): slope of fitted regression
\( B_{i0} \): actual slope for the population
\( s_{b_i} \): standard error of the regression coefficient
Decomposing the total variation in Y

\[ SST = \sum (Y - \bar{Y})^2 \]
\[ SSR = \sum (\hat{Y} - \bar{Y})^2 \]
\[ SSE = \sum (Y - \hat{Y})^2 \]
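For a least-squares fit with an intercept, the three sums of squares satisfy SST = SSR + SSE. The sketch below checks this numerically on the single-variable experience and earnings data from the earlier case study; the function name `decompose` is illustrative only.

```python
def decompose(xs, ys):
    """Return (SST, SSR, SSE) for a least-squares line fitted to xs, ys."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
        sum(x * x for x in xs) - n * x_bar ** 2
    )
    a = y_bar - b * x_bar
    fitted = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total variation
    ssr = sum((f - y_bar) ** 2 for f in fitted)             # explained by regression
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # unexplained (error)
    return sst, ssr, sse


# Data from the earlier case study: experience (yrs) and hourly earnings ($)
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [12.98, 13.94, 17.07, 18.27, 21.63, 22.84, 24.04, 25.84, 26.08, 28.61]
sst, ssr, sse = decompose(xs, ys)
print(round(sst, 3), round(ssr, 3), round(sse, 3))
```

The same identity holds for multiple regression, which is what allows a missing sum of squares to be recovered from the other two.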

F ratio
\[ F = \frac{SSR / k}{SSE / (n - k - 1)} \]
Regression for Aparna Chatterjee

Aparna Chatterjee is a busy executive in TradeDhaba.com. She is late for a meeting because she is unable to locate the multiple regressions that her secretary has worked out for her. If the total regression is significant at the 0.05 level of significance, she wants to use the computer output as evidence to support some of her ideas at the meeting. The secretary, however, is sick today and Aparna has been unable to locate his work. As a matter of fact, the only information she possesses concerning the multiple regressions is on a piece of scrap paper:

SSR    872.4     with *** df
SSE    ***       with 17 df
SST    1023.6    with 24 df

As the scrap paper does not even have a complete set of
numbers on it, she has concluded that it must be useless.
Now we have to find out whether that scrap paper is
useful or not.

The formula to be used is:
\[ SST = SSR + SSE \]
therefore, SSE = SST - SSR = 1023.6 - 872.4 = 151.2

Again, df SST = df SSR + df SSE
df SSR = df SST - df SSE = 24 - 17 = 7

We know,
\[ F = \frac{SSR / k}{SSE / (n - k - 1)} \]

with k = 7 and n - k - 1 = 17. Calculating the value of F we get,
F = 14.01
Tabulated F = 2.61
Hence, it can be concluded that
the overall regression is significant
and she can use it in the meeting.
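The arithmetic on the scrap paper can be re-checked in a few lines. This is a minimal sketch; the critical value 2.61 is simply the tabulated F quoted in the text, not something computed here, and the variable names are illustrative.

```python
# Figures from the scrap paper
ssr = 872.4
sst = 1023.6
df_sse = 17
df_sst = 24

# Recover the missing entries
sse = sst - ssr              # SST = SSR + SSE
df_ssr = df_sst - df_sse     # degrees of freedom add up the same way (this is k)

# F ratio: (SSR / k) / (SSE / (n - k - 1))
f_ratio = (ssr / df_ssr) / (sse / df_sse)

f_tabulated = 2.61           # tabulated F at the 0.05 level, as quoted in the text
significant = f_ratio > f_tabulated
print(round(sse, 1), df_ssr, round(f_ratio, 2), significant)
```

Since the computed F is well above the tabulated value, the check agrees with the conclusion that the overall regression is significant.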
Modeling techniques are the ways in which we can include the explanatory variables and check the appropriateness of the regression model.

Modeling Techniques
Dummy variables
Qualitative data
Trend Line Analysis
Risk Analysis for Investments
Sales or Market Forecasts
Linear Regression in Human Resources
