In a simple case we have only one independent variable, x, and one dependent
variable, y. Regression analysis assumes that there is no error in the independent
variable, but there is random error in the dependent variable. Thus, all the errors
due to measurement and to approximations in the modeling equations appear in the
dependent variable, y. In any example of regression, the expectation or expected
value of y varies as a function of x, and errors cause measured values of y to
deviate from the expected value of y at any particular value of x. If there are
several measured values of y at one value of x, the mean of the measured values of
y will give an approximation of the expected value of y at that value of x.
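The averaging idea can be sketched in a few lines of Python (the measurement values below are hypothetical):

```python
# Hypothetical repeated measurements of y taken at one fixed value of x.
# The sample mean of these measurements estimates E(Y) at that x.
measurements_at_x = [4.1, 3.9, 4.3, 4.0, 3.7]

mean_y = sum(measurements_at_x) / len(measurements_at_x)
print(mean_y)  # estimate of the expected value of y at this x
```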
The simplest situation is a linear or straight-line relation between a single input and
the response. Say the input and response are x and y, respectively. For this simple
situation the mean of the probability distribution of the response is

E(Y) = α + βx

where α and β are constant parameters that we want to estimate. They are often
called regression coefficients. From a sample consisting of n pairs of data (xi, yi),
we calculate estimates, a for α and b for β. If ŷi is the estimated value of E(Y) at
x = xi, we have the fitted regression line

ŷi = a + b·xi
The problem now is to determine a and b to give the best fit with the sample data
(xi, yi); for that we need a systematic recipe or algorithm. The sum of squares of
deviations from the mean of a sample is less than the sum of squares of deviations
from any other constant value. We can adapt that requirement to the present case
as follows. Let ei = yi − ŷi be the deviation in the y-direction of any data point
from the fitted regression line. Then the estimates a and b are chosen so that the
sum of the squares of the deviations of all the points,

Σ ei² = Σ (yi − ŷi)²,

is smaller than for any other choice of a and b. Thus, a and b are chosen so that
this sum has a minimum value. This is called the method of least squares, and the
resulting equation is called the regression line of y on x, where y is the response
and x is the input. Say the points are as shown in the figure below. This is called
a scatter plot for the data. We can see that the points seem to roughly follow a
straight line, but there are appreciable deviations from any straight line that might
be drawn through the points.
Now let us consider the method of least squares in more detail. If the points or
pairs of values are (xi, yi) and the estimated equation of the line is taken to be
ŷ = a + bx, then the errors or deviations from the line in the y-direction are
ei = yi − (a + bxi). These deviations are often called residuals, the variations in
y that are not explained by regression. The squares of the deviations are
ei² = [yi − (a + bxi)]², and the sum of the squares of the deviations or errors or
residuals for all n points is abbreviated as SSE. The quantity we want to minimize
in this case is

SSE = Σ ei² = Σ [yi − (a + bxi)]²

where the sum runs over i = 1, …, n.
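Before minimizing, SSE can be evaluated for any trial pair (a, b); a minimal Python sketch with hypothetical data shows that a line near the trend of the points gives a smaller SSE than one far from it:

```python
def sse(a, b, xs, ys):
    """Sum of squared deviations of the points from the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data points roughly following y = 2x
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

print(sse(0.0, 2.0, xs, ys))  # trial line close to the trend
print(sse(0.0, 0.0, xs, ys))  # poor trial line: SSE is much larger
```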
Remember that the n values of x and the n values of y come from observations and
so are now all fixed and not subject to variation. We will minimize SSE by varying
a and b, so a and b become the independent variables at this point in the analysis.
You should remember from calculus that to minimize a quantity we take the
derivative with respect to the independent variable and set it equal to zero. In this
case there are two independent variables, a and b, so we take partial derivatives
with respect to each of them and set the derivatives equal to zero. Setting
∂(SSE)/∂a = 0 and ∂(SSE)/∂b = 0 and omitting some of the algebra, we have

Σ yi = n·a + b·Σ xi
Σ xi yi = a·Σ xi + b·Σ xi²
These are called the least squares equations (or normal equations) for estimating
the coefficients, a and b. They are linear in a and b, so they can be solved
simultaneously. The results are

b = Sxy / Sxx = [Σ xi yi − (Σ xi)(Σ yi)/n] / [Σ xi² − (Σ xi)²/n]
  = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²

a = ȳ − b·x̄

The two forms of the equation for b are equivalent, as can be shown easily. The
first form is usually used for calculations. The second, deviation form is preferred
when rounding errors in calculations may become appreciable.
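These closed-form results can be sketched in Python using the deviation forms (the data values below are hypothetical and lie exactly on a straight line):

```python
def fit_line(xs, ys):
    """Least-squares estimates a, b for the line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Deviation forms of Sxy and Sxx (less sensitive to rounding errors)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar  # so the fitted line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data lying exactly on y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)  # → 1.0 2.0
```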
Since a = ȳ − b·x̄, the best-fit line passes through the point (x̄, ȳ), which is
called the centroid point and is the center of mass of the data points.
Example 1
Variance of Experimental Points Around the Line
Now we need to estimate the variance of points from the least-squares regression
line for y on x. This must be found from the residuals, the deviations of points
from the least squares line in the y-direction:

s²_{y|x} = SSE / (n − 2) = Σ (yi − ŷi)² / (n − 2)

The divisor is n − 2 because two coefficients, a and b, have been estimated from
the data. This quantity is a measure of the scatter of experimental points around
the line.
The square root of this quantity is, of course, the estimated standard deviation or
standard error of the points from the line. The subscript, y|x, is meant to emphasize
that the estimated variance around the line is found from deviations in the y-
direction and at fixed values of x.
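A minimal sketch of this estimate, using hypothetical data and an assumed line y = 1 + 2x (not necessarily the exact least-squares fit):

```python
import math

def std_error_about_line(xs, ys, a, b):
    """Estimated standard deviation s_{y|x} of points about the line y = a + b*x."""
    n = len(xs)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (n - 2))  # n - 2 degrees of freedom

# Hypothetical data scattered around the assumed line y = 1 + 2x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
s = std_error_about_line(xs, ys, 1.0, 2.0)
print(s)
```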
Other Forms Linear in the Coefficients
(1) The exponential function, y = a·bˣ, can be modified suitably by taking
logarithms of both sides. This gives log y = log a + x log b. Notice that this is
the form that gives straight lines on semi-log graph paper.
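The transformed data (x, log y) can then be fitted with the ordinary least-squares formulas; a minimal sketch, with hypothetical data generated exactly from y = 2·3ˣ:

```python
import math

# Hypothetical data following y = 2 * 3**x exactly
xs = [0.0, 1.0, 2.0, 3.0]
ys = [2.0, 6.0, 18.0, 54.0]

# Transform: log y = log a + x log b is linear in x
log_ys = [math.log10(y) for y in ys]

n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(log_ys) / n
num = sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
den = sum((x - x_bar) ** 2 for x in xs)
slope = num / den                 # estimate of log b
intercept = ly_bar - slope * x_bar  # estimate of log a

a = 10 ** intercept  # recovers a
b = 10 ** slope      # recovers b
print(a, b)
```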
Example 2: The shear resistance of soil, y kN m⁻², is determined by measurements
as a function of the normal stress, x kN m⁻². The data are as shown below:
Solution:
Example 3
For the data of Example 2, calculate the standard deviation of points about the
regression line, then plot the residuals against the input and comment on the
results.
Correlation
The sample correlation coefficient of X and Y is defined as

r_xy = Sxy / √(Sxx·Syy)

where Sxx = Σ (xi − x̄)², Syy = Σ (yi − ȳ)², and Sxy = Σ (xi − x̄)(yi − ȳ), as in
the preceding equations. This correlation coefficient is often denoted simply by r.
If the points (xi, yi) lie on a perfect straight line and the slope of that line is
positive, r_xy = 1. If the points lie on a perfect straight line and the slope is
negative, r_xy = −1. If there is no systematic relation between X and Y at all,
r_xy ≈ 0, and r_xy differs from zero only because of random variation in the
sample points. If X and Y follow a linear relation affected by random errors, r_xy
will be close to +1 or −1. These cases are illustrated in the figure. In all cases,
because of the definitions, −1 ≤ r_xy ≤ +1.
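The definition can be sketched directly in Python (the data points are hypothetical):

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r_xy = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Perfect straight line with positive slope gives r = 1
print(correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))   # → 1.0
# Perfect straight line with negative slope gives r = -1
print(correlation([1.0, 2.0, 3.0], [6.0, 4.0, 2.0]))   # → -1.0
```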