
Regression and Correlation

In a simple case we have only one independent variable, x, and one dependent
variable, y. Regression analysis assumes that there is no error in the independent
variable, but there is random error in the dependent variable. Thus, all the errors
due to measurement and to approximations in the modeling equations appear in the
dependent variable, y. In any example of regression, the expectation or expected
value of y varies as a function of x, and errors cause measured values of y to
deviate from the expected value of y at any particular value of x. If there are
several measured values of y at one value of x, the mean of the measured values of
y will give an approximation of the expected value of y at that value of x.

Engineers often encounter situations where an independent variable affects the
value of a dependent variable, and errors of measurement produce random
fluctuations about the expected values. Thus, change in stress produces change in
strain plus variation in measured strain due to error. The power produced by an
electric motor changes with variation of the input voltage, and measurements of
output include measuring errors. The methods of regression are used to summarize
sets of data in a useful form. The values of x and y and any other quantities are
already known from measurements and are therefore fixed, so it is not quite right
to speak of them in this development as variables. The true variables will be the
coefficients that are adjusted to give the best fit.

Simple Linear Regression

The simplest situation is a linear or straight-line relation between a single input and
the response. Say the input and response are x and y, respectively. For this simple
situation the mean of the probability distribution is

E(Y) = α + βx
where α and β are constant parameters that we want to estimate. They are often
called regression coefficients. From a sample consisting of n pairs of data (xi,yi),
we calculate estimates, a for α and b for β. If at x = xi, ŷi is the estimated value of
E(Y), we have the fitted regression line

ŷi = a + bxi

where the "hat" on ŷ indicates that this is an estimated value.

Method of Least Squares

The problem now is to determine a and b to give the best fit to the sample data
points (xi, yi). For the present analysis we need a systematic recipe or algorithm.
The sum of squares of deviations from the mean of a sample is less than the sum of
squares of deviations from any other constant value. We can adapt that
requirement to the present case as follows. Let ei = yi − ŷi be the deviation in the
y-direction of any data point from the fitted regression line. Then the estimates
a and b are chosen so that the sum of the squares of the deviations of all the
points,

Σ ei² = Σ (yi − ŷi)²,

is smaller than for any other choice of a and b. Thus, a and b are chosen so that
this sum has a minimum value. This is called the method of least squares, and the
resulting equation is called the regression line of y on x, where y is the
response and x is the input. Say the points are as shown in the figure below; this is
called a scatter plot for the data. We can see that the points seem to roughly follow
a straight line, but there are appreciable deviations from any straight line that might
be drawn through the points.

Now let us consider the method of least squares in more detail. If the points or
pairs of values are (xi, yi) and the estimated equation of the line is taken to be
ŷ = a + bx, then the errors or deviations from the line in the y-direction are
ei = [yi − (a + bxi)]. These deviations are often called residuals, the variations in y
that are not explained by regression. The squares of the deviations are
ei² = [yi − (a + bxi)]², and the sum of the squares of the deviations or errors or
residuals for all n points is abbreviated as SSE. The quantity we want to minimize
in this case is

SSE = Σ ei² = Σ [yi − (a + bxi)]²

where the sums run from i = 1 to n.
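To make the objective concrete, here is a minimal sketch in Python of the quantity being minimized; the data values are invented for illustration and are not from the text's examples.

    # SSE for a candidate intercept a and slope b. The least-squares
    # estimates are the pair (a, b) that makes this quantity smallest.
    # Data values are invented for illustration.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]

    def sse(a, b):
        return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

    print(sse(0.0, 2.0))  # one candidate line
    print(sse(0.1, 2.0))  # a slightly different line, different SSE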

Remember that the n values of x and the n values of y come from observations and
so are now all fixed and not subject to variation. We will minimize SSE by varying
a and b, so a and b become the independent variables at this point in the analysis.
You should remember from calculus that to minimize a quantity we take the
derivative with respect to the independent variable and set it equal to zero. In this
case there are two independent variables, a and b, so we take partial derivatives
with respect to each of them and set the derivatives equal to zero. Omitting some
of the algebra we have

∂(SSE)/∂a = −2 Σ [yi − (a + bxi)] = 0, which gives Σ yi = n a + b Σ xi
∂(SSE)/∂b = −2 Σ xi [yi − (a + bxi)] = 0, which gives Σ xi yi = a Σ xi + b Σ xi²

These are called the least squares equations (or normal equations) for estimating
the coefficients, a and b. The right-hand equalities of these two equations give
equations that are linear in the coefficients a and b, so they can be solved
simultaneously. The results are

b = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n] = Sxy / Sxx

where Sxy = Σ (xi − x̄)(yi − ȳ) and Sxx = Σ (xi − x̄)², and

a = ȳ − b x̄

The two forms of the equation for b are equivalent, as can be shown easily. The
first form is usually used for calculations. The second form is preferred when
rounding errors in calculations may become appreciable.

Since a = ȳ − b x̄, the fitted line can be written as ŷ = ȳ + b(x − x̄), so at x = x̄
the fitted value is exactly ȳ. This indicates that the best-fit line passes through
the point (x̄, ȳ), which is called the centroid point and is the center of mass of the
data points.
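As a sketch of these formulas in Python (invented illustrative data, not the data of the examples in this text):

    # Least-squares estimates a and b from the formulas above.
    # Data values are invented for illustration.
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(x)

    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Second ("deviation") form of b, less sensitive to rounding error
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    Sxx = sum((xi - x_bar) ** 2 for xi in x)

    b = Sxy / Sxx          # slope
    a = y_bar - b * x_bar  # intercept; the line passes through (x_bar, y_bar)

    print(f"fitted line: y = {a:.3f} + {b:.3f} x")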

Example 1

Variance of Experimental Points Around the Line

Now we need to estimate the variance of points from the least-squares regression
line for y on x. This must be found from the residuals, deviations of points from the
least squares line in the y-direction:

s²y|x = Σ (yi − ŷi)² / (n − 2) = SSE / (n − 2)

The divisor (n − 2) reflects that two coefficients, a and b, have been estimated
from the data, leaving n − 2 degrees of freedom. This quantity is a measure of the
scatter of experimental points around the line.
The square root of this quantity is, of course, the estimated standard deviation or
standard error of the points from the line. The subscript, y|x, is meant to emphasize
that the estimated variance around the line is found from deviations in the y-
direction and at fixed values of x.
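Continuing in Python, the standard error about the line can be computed as follows (again with invented illustrative data):

    # Estimated standard error of points about the fitted line,
    # s(y|x) = sqrt(SSE / (n - 2)); two coefficients were estimated,
    # leaving n - 2 degrees of freedom. Same invented data as above.
    import math

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(x)

    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar

    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s_y_given_x = math.sqrt(sse / (n - 2))

    print(f"SSE = {sse:.4f}, s(y|x) = {s_y_given_x:.4f}")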

Other Forms Linear in the Coefficients

Equations of the form log y = a + bx, where a and b are coefficients to be
determined by least squares, can be handled easily. Remember that x and y are
known quantities, numbers. Then we can calculate without difficulty the value of
log y for each data point. Then log y can be used in place of y, and so the
regression coefficients can be calculated as before.

(1) The exponential function, y = a bˣ, can be modified suitably by taking
logarithms of both sides. This gives log y = log a + x log b. Notice that this is the
form that gives straight lines on semi-log graph paper.

(2) The power function, y = a xᵇ, can also be treated by taking logarithms of both
sides. The result is log y = log a + b log x. Notice that this form would give
straight lines on log-log graph paper.
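As a sketch of the second transformation in Python, with invented data that roughly follow a power law (base-10 logarithms, matching the text's log notation):

    # Fitting the power function y = a * x**b by least squares on
    # (log x, log y). Base-10 logs to match the text; data invented.
    import math

    x = [1.0, 2.0, 4.0, 8.0, 16.0]
    y = [3.1, 5.0, 8.2, 13.5, 21.9]   # roughly y = 3 * x**0.7

    u = [math.log10(xi) for xi in x]  # log x
    v = [math.log10(yi) for yi in y]  # log y
    n = len(u)

    u_bar, v_bar = sum(u) / n, sum(v) / n
    b = sum((ui - u_bar) * (vi - v_bar) for ui, vi in zip(u, v)) \
        / sum((ui - u_bar) ** 2 for ui in u)
    log_a = v_bar - b * u_bar
    a = 10 ** log_a                   # undo the logarithm to recover a

    print(f"fitted power function: y = {a:.3f} * x**{b:.3f}")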

Example 2: The shear resistance of soil, y kN m⁻², is determined by measurements
as a function of the normal stress, x kN m⁻². The data are as shown below:

Solution:

Example 3

For the data of Example 2, calculate the standard deviation of points about the
regression line, then plot residuals against the input, and comment on the
results.
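A minimal sketch of such a residual plot in Python, using matplotlib and invented data in place of the soil measurements:

    # Residuals e_i = y_i - (a + b * x_i) plotted against the input x.
    # A patternless band around zero supports the straight-line model.
    # Data invented; matplotlib assumed available.
    import matplotlib.pyplot as plt

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(x)

    x_bar, y_bar = sum(x) / n, sum(y) / n
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
        / sum((xi - x_bar) ** 2 for xi in x)
    a = y_bar - b * x_bar

    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    plt.scatter(x, residuals)
    plt.axhline(0.0, linestyle="--")
    plt.xlabel("x (input)")
    plt.ylabel("residual")
    plt.show()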

Correlation

Correlation is a measure of the association between two random variables, say X
and Y. We do not have for this calculation the assumption that one of these
variables is known without error: both variables are assumed to vary randomly.
We do assume for this analysis that X and Y are related linearly, so the usual
correlation coefficient gives a measure of the linear association between X and Y.
Although the underlying correlation is defined in terms of variances and
covariance, in practice we work with the sample correlation coefficient. This is
calculated as

rxy = Sxy / √(Sxx Syy)

where Sxx = Σ (xi − x̄)², Syy = Σ (yi − ȳ)², and Sxy = Σ (xi − x̄)(yi − ȳ), as in the
preceding equations. This correlation coefficient is often denoted simply by r. If
the points (xi, yi) lie on a perfect straight line with positive slope, rxy = 1. If the
points lie on a perfect straight line with negative slope, rxy = −1. If there is no
systematic relation between X and Y at all, rxy ≈ 0, and rxy differs from zero only
because of random variation in the sample points. If X and Y follow a linear
relation affected by random errors, rxy will be close to +1 or −1. These cases are
illustrated in the figure below. In all cases, because of the definitions,
−1 ≤ rxy ≤ +1.
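A minimal sketch of this calculation in Python (invented illustrative data, not the soil data of Example 2):

    # Sample correlation coefficient r = Sxy / sqrt(Sxx * Syy).
    # Data values are invented for illustration.
    import math

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]
    n = len(x)

    x_bar, y_bar = sum(x) / n, sum(y) / n
    Sxx = sum((xi - x_bar) ** 2 for xi in x)
    Syy = sum((yi - y_bar) ** 2 for yi in y)
    Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    r = Sxy / math.sqrt(Sxx * Syy)   # always between -1 and +1
    print(f"r = {r:.4f}")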

Example 4

a) Calculate the correlation coefficient for the data of Example 2.
