
Curve Fitting, Regression

Field data is often accompanied by noise. Even though all control parameters (independent
variables) remain constant, the resultant outcomes (dependent variables) vary. A process of
quantitatively estimating the trend of the outcomes, also known as regression or curve fitting,
therefore becomes necessary.

The curve fitting process fits equations of approximating curves to the raw field data.
Nevertheless, for a given set of data, the fitting curves of a given type are generally NOT unique.
Thus, a curve with a minimal deviation from all data points is desired. This best-fitting curve can
be obtained by the method of least squares.

The Method of Least Squares

The method of least squares assumes that the best-fit curve of a given type is the curve that has
the minimal sum of the deviations squared (least square error) from a given set of data.

Suppose that the data points are (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where x is the independent
variable and y is the dependent variable. The fitting curve f(x) has the deviation (error) d from
each data point, i.e., d_1 = y_1 - f(x_1), d_2 = y_2 - f(x_2), ..., d_n = y_n - f(x_n). According to the
method of least squares, the best fitting curve has the property that

\Pi = d_1^2 + d_2^2 + \cdots + d_n^2 = \sum_{i=1}^{n} [y_i - f(x_i)]^2 = \text{a minimum}

The Least-Squares Line

The least-squares line uses a straight line

y = a + b x

to approximate the given set of data, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where n \ge 2. The best
fitting curve f(x) has the least square error, i.e.,

\Pi = \sum_{i=1}^{n} [y_i - f(x_i)]^2 = \sum_{i=1}^{n} [y_i - (a + b x_i)]^2 = \min

Please note that a and b are unknown coefficients while all x_i and y_i are given. To obtain the
least square error, the unknown coefficients a and b must yield zero first derivatives:

\frac{\partial \Pi}{\partial a} = 2 \sum_{i=1}^{n} [y_i - (a + b x_i)](-1) = 0

\frac{\partial \Pi}{\partial b} = 2 \sum_{i=1}^{n} [y_i - (a + b x_i)](-x_i) = 0

Expanding the above equations, we have:

\sum y_i = a n + b \sum x_i

\sum x_i y_i = a \sum x_i + b \sum x_i^2

The unknown coefficients a and b can therefore be obtained:

a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}

where \sum stands for \sum_{i=1}^{n}.
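As a quick check of these closed-form expressions, the following Python sketch (NumPy assumed;
the data values are invented for illustration) computes a and b directly from the sums above and
compares them with numpy.polyfit:

import numpy as np

# Illustrative data: y is roughly 2 + 3x plus a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.0])
n = len(x)

# Closed-form least-squares line coefficients from the normal equations.
Sx, Sy, Sxx, Sxy = x.sum(), y.sum(), (x * x).sum(), (x * y).sum()
a = (Sy * Sxx - Sx * Sxy) / (n * Sxx - Sx**2)   # intercept
b = (n * Sxy - Sx * Sy) / (n * Sxx - Sx**2)     # slope

print("a =", a, "b =", b)
print("numpy.polyfit:", np.polyfit(x, y, 1))     # returns [slope, intercept]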

The Least-Squares Parabola


The least-squares parabola uses a second degree curve

y = a + b x + c x^2

to approximate the given set of data, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where n \ge 3. The best
fitting curve f(x) has the least square error, i.e.,

\Pi = \sum_{i=1}^{n} [y_i - f(x_i)]^2 = \sum_{i=1}^{n} [y_i - (a + b x_i + c x_i^2)]^2 = \min

Please note that a, b, and c are unknown coefficients while all x_i and y_i are given. To obtain
the least square error, the unknown coefficients a, b, and c must yield zero first derivatives:

\frac{\partial \Pi}{\partial a} = 2 \sum [y_i - (a + b x_i + c x_i^2)](-1) = 0

\frac{\partial \Pi}{\partial b} = 2 \sum [y_i - (a + b x_i + c x_i^2)](-x_i) = 0

\frac{\partial \Pi}{\partial c} = 2 \sum [y_i - (a + b x_i + c x_i^2)](-x_i^2) = 0

Expanding the above equations, we have

\sum y_i = a n + b \sum x_i + c \sum x_i^2

\sum x_i y_i = a \sum x_i + b \sum x_i^2 + c \sum x_i^3

\sum x_i^2 y_i = a \sum x_i^2 + b \sum x_i^3 + c \sum x_i^4

The unknown coefficients a, b, and c can hence be obtained by solving the above linear
equations.
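The 3 x 3 system above can be solved directly. A minimal Python sketch (illustrative data; NumPy
assumed):

import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.array([ 8.9,  2.6, 1.1, 3.4, 10.2, 20.4])   # roughly 1 + 0.5x + 2x^2

# Normal equations for y = a + b*x + c*x^2, written exactly as in the text.
M = np.array([
    [len(x),        x.sum(),        (x**2).sum()],
    [x.sum(),       (x**2).sum(),   (x**3).sum()],
    [(x**2).sum(),  (x**3).sum(),   (x**4).sum()],
])
rhs = np.array([y.sum(), (x * y).sum(), (x**2 * y).sum()])
a, b, c = np.linalg.solve(M, rhs)
print(a, b, c)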

The Least-Squares mth Degree Polynomials

When using an mth degree polynomial

y = a_0 + a_1 x + a_2 x^2 + \cdots + a_m x^m

to approximate the given set of data, (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), where n \ge m + 1, the best
fitting curve f(x) has the least square error, i.e.,

\Pi = \sum_{i=1}^{n} [y_i - f(x_i)]^2 = \sum_{i=1}^{n} [y_i - (a_0 + a_1 x_i + \cdots + a_m x_i^m)]^2 = \min

Please note that a_0, a_1, a_2, ..., and a_m are unknown coefficients while all x_i and y_i are given.
To obtain the least square error, the unknown coefficients a_0, a_1, a_2, ..., and a_m must yield
zero first derivatives:

\frac{\partial \Pi}{\partial a_0} = 2 \sum [y_i - (a_0 + a_1 x_i + \cdots + a_m x_i^m)](-1) = 0

\frac{\partial \Pi}{\partial a_1} = 2 \sum [y_i - (a_0 + a_1 x_i + \cdots + a_m x_i^m)](-x_i) = 0

\vdots

\frac{\partial \Pi}{\partial a_m} = 2 \sum [y_i - (a_0 + a_1 x_i + \cdots + a_m x_i^m)](-x_i^m) = 0

Expanding the above equations, we have

\sum y_i = a_0 n + a_1 \sum x_i + \cdots + a_m \sum x_i^m

\sum x_i y_i = a_0 \sum x_i + a_1 \sum x_i^2 + \cdots + a_m \sum x_i^{m+1}

\vdots

\sum x_i^m y_i = a_0 \sum x_i^m + a_1 \sum x_i^{m+1} + \cdots + a_m \sum x_i^{2m}

The unknown coefficients a_0, a_1, a_2, ..., and a_m can hence be obtained by solving the above
(m + 1) linear equations.
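A sketch of the same construction for a general degree m, building the (m + 1) x (m + 1) system of
sums of powers (illustrative data; NumPy assumed):

import numpy as np

def polyfit_normal_equations(x, y, m):
    """Fit y = a_0 + a_1*x + ... + a_m*x^m by solving the normal equations."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Moment matrix M[j][k] = sum(x^(j+k)); right-hand side r[j] = sum(x^j * y).
    # Note: for large m this matrix becomes ill-conditioned; a QR/SVD-based solver is then preferable.
    M = np.array([[np.sum(x**(j + k)) for k in range(m + 1)] for j in range(m + 1)])
    r = np.array([np.sum(x**j * y) for j in range(m + 1)])
    return np.linalg.solve(M, r)           # coefficients a_0, ..., a_m

x = np.linspace(0, 4, 9)
y = 1 - 2 * x + 0.5 * x**3 + np.random.normal(0, 0.1, x.size)
print(polyfit_normal_equations(x, y, 3))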

Multiple Regression

Multiple regression estimates the outcomes (dependent variables) that may be affected by more
than one control parameter (independent variable), or cases in which more than one control
parameter is changed at the same time.

An example is the case of two independent variables x and y and one dependent variable z in the
linear relationship case:

z = a + b x + c y

For a given data set (x_1, y_1, z_1), (x_2, y_2, z_2), ..., (x_n, y_n, z_n), where n \ge 3, the best fitting
curve f(x, y) has the least square error, i.e.,

\Pi = \sum_{i=1}^{n} [z_i - f(x_i, y_i)]^2 = \sum_{i=1}^{n} [z_i - (a + b x_i + c y_i)]^2 = \min

Please note that a, b, and c are unknown coefficients while all x_i, y_i, and z_i are given. To
obtain the least square error, the unknown coefficients a, b, and c must yield zero first
derivatives:

\frac{\partial \Pi}{\partial a} = 2 \sum [z_i - (a + b x_i + c y_i)](-1) = 0

\frac{\partial \Pi}{\partial b} = 2 \sum [z_i - (a + b x_i + c y_i)](-x_i) = 0

\frac{\partial \Pi}{\partial c} = 2 \sum [z_i - (a + b x_i + c y_i)](-y_i) = 0

Expanding the above equations, we have

\sum z_i = a n + b \sum x_i + c \sum y_i

\sum x_i z_i = a \sum x_i + b \sum x_i^2 + c \sum x_i y_i

\sum y_i z_i = a \sum y_i + b \sum x_i y_i + c \sum y_i^2

The unknown coefficients a, b, and c can hence be obtained by solving the above linear
equations.
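A minimal sketch of this two-variable linear regression, solving the 3 x 3 system above
(illustrative data; NumPy assumed):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
z = 1.5 + 2.0 * x - 0.7 * y + np.random.normal(0, 0.05, x.size)   # z = a + b*x + c*y + noise

n = len(x)
M = np.array([
    [n,         x.sum(),       y.sum()],
    [x.sum(),   (x**2).sum(),  (x*y).sum()],
    [y.sum(),   (x*y).sum(),   (y**2).sum()],
])
rhs = np.array([z.sum(), (x*z).sum(), (y*z).sum()])
a, b, c = np.linalg.solve(M, rhs)
print("a, b, c =", a, b, c)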
Least Squares Fitting

A mathematical procedure for finding the best-fitting curve to a given set of points by
minimizing the sum of the squares of the offsets ("the residuals") of the points from the curve.
The sum of the squares of the offsets is used instead of the offset absolute values because this
allows the residuals to be treated as a continuous differentiable quantity. However, because
squares of the offsets are used, outlying points can have a disproportionate effect on the fit, a
property which may or may not be desirable depending on the problem at hand.

In practice, the vertical offsets from a line (polynomial, surface, hyperplane, etc.) are almost
always minimized instead of the perpendicular offsets. This provides a fitting function for the
independent variable x that estimates y for a given x (most often what an experimenter wants),
allows uncertainties of the data points along the x- and y-axes to be incorporated simply, and
also provides a much simpler analytic form for the fitting parameters than would be obtained
using a fit based on perpendicular offsets. In addition, the fitting technique can be easily
generalized from a best-fit line to a best-fit polynomial when sums of vertical distances are used.
In any case, for a reasonable number of noisy data points, the difference between vertical and
perpendicular fits is quite small.

The linear least squares fitting technique is the simplest and most commonly applied form of
linear regression and provides a solution to the problem of finding the best fitting straight line
through a set of points. In fact, if the functional relationship between the two quantities being
graphed is known to within additive or multiplicative constants, it is common practice to
transform the data in such a way that the resulting line is a straight line, say by plotting T vs. \sqrt{\ell}
instead of T vs. \ell in the case of analyzing the period T of a pendulum as a function of its length \ell.
For this reason, standard forms for exponential, logarithmic, and power laws are often explicitly
computed. The formulas for linear least squares fitting were independently derived by Gauss and
Legendre.

For nonlinear least squares fitting to a number of unknown parameters, linear least squares fitting
may be applied iteratively to a linearized form of the function until convergence is achieved.
However, it is often also possible to linearize a nonlinear function at the outset and still use linear
methods for determining fit parameters without resorting to iterative procedures. This approach
does commonly violate the implicit assumption that the distribution of errors is normal, but often
still gives acceptable results using normal equations, a pseudoinverse, etc. Depending on the type
of fit and initial parameters chosen, the nonlinear fit may have good or poor convergence
properties. If uncertainties (in the most general case, error ellipses) are given for the points,
points can be weighted differently in order to give the high-quality points more weight.

Vertical least squares fitting proceeds by finding the sum of the squares of the vertical deviations
R^2 of a set of n data points

R^2 = \sum_{i=1}^{n} \left[y_i - f(x_i; a_1, a_2, \ldots, a_p)\right]^2    (1)

from a function f(x; a_1, a_2, \ldots, a_p) with p adjustable parameters. Note that this procedure does
not minimize the actual deviations from the line (which would be measured perpendicular to the
given function). In addition, although the unsquared sum of distances might seem a more
appropriate quantity to minimize, use of the absolute value results in discontinuous derivatives
which cannot be treated analytically. The square deviations from each point are therefore
summed, and the resulting residual is then minimized to find the best fit line. This procedure
results in outlying points being given disproportionately large weighting.

The condition for R^2 to be a minimum is that

\frac{\partial (R^2)}{\partial a_i} = 0    (2)

for i = 1, ..., p. For a linear fit,

f(a, b) = a + b x    (3)

so

R^2(a, b) = \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right]^2    (4)

\frac{\partial (R^2)}{\partial a} = -2 \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right] = 0    (5)

\frac{\partial (R^2)}{\partial b} = -2 \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right] x_i = 0    (6)

These lead to the equations

n a + b \sum x_i = \sum y_i    (7)

a \sum x_i + b \sum x_i^2 = \sum x_i y_i    (8)

In matrix form,

\begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}    (9)

so

\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}    (10)

The 2 \times 2 matrix inverse is

\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{n \sum x_i^2 - \left(\sum x_i\right)^2} \begin{bmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{bmatrix} \begin{bmatrix} \sum y_i \\ \sum x_i y_i \end{bmatrix}    (11)

so

a = \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}    (12)

  = \frac{\bar{y} \sum x_i^2 - \bar{x} \sum x_i y_i}{\sum x_i^2 - n \bar{x}^2}    (13)

b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}    (14)

  = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}    (15)

(Kenney and Keeping 1962). These can be rewritten in a simpler form by defining the sums of
squares

ss_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2    (16)

       = \sum x_i^2 - n \bar{x}^2    (17)

ss_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2    (18)

       = \sum y_i^2 - n \bar{y}^2    (19)

ss_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})    (20)

       = \sum x_i y_i - n \bar{x} \bar{y}    (21)

which are also written as

ss_{xx} = n \sigma_x^2    (22)

ss_{yy} = n \sigma_y^2    (23)

ss_{xy} = n \,\mathrm{cov}(x, y)    (24)

Here, cov(x, y) is the covariance and \sigma_x^2 and \sigma_y^2 are variances. Note that the quantities
\sum x_i y_i and \sum x_i^2 can also be interpreted as the dot products

\sum x_i^2 = \mathbf{x} \cdot \mathbf{x}    (25)

\sum x_i y_i = \mathbf{x} \cdot \mathbf{y}    (26)

In terms of the sums of squares, the regression coefficient b is given by

b = \frac{\mathrm{cov}(x, y)}{\sigma_x^2} = \frac{ss_{xy}}{ss_{xx}}    (27)

and a is given in terms of b using (7) as

a = \bar{y} - b \bar{x}    (28)

The overall quality of the fit is then parameterized in terms of a quantity known as the
correlation coefficient, defined by

r^2 = \frac{ss_{xy}^2}{ss_{xx}\, ss_{yy}}    (29)

which gives the proportion of ss_yy which is accounted for by the regression.

Let \hat{y}_i be the vertical coordinate of the best-fit line with x-coordinate x_i, so

\hat{y}_i = a + b x_i    (30)

then the error between the actual vertical point y_i and the fitted point is given by

e_i = y_i - \hat{y}_i    (31)

Now define s^2 as an estimator for the variance in e_i,

s^2 = \frac{1}{n - 2} \sum_{i=1}^{n} e_i^2    (32)

Then s can be given by

s = \sqrt{\frac{ss_{yy} - b\, ss_{xy}}{n - 2}} = \sqrt{\frac{ss_{yy} - ss_{xy}^2 / ss_{xx}}{n - 2}}    (33)

(Acton 1966, pp. 32-35; Gonick and Smith 1993, pp. 202-204).

The standard errors for a and b are

SE(a) = s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{ss_{xx}}}    (34)

SE(b) = \frac{s}{\sqrt{ss_{xx}}}    (35)
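The following Python sketch pulls the whole linear case together: it computes ss_xx, ss_yy, ss_xy,
the coefficients b and a from (27)-(28), r^2 from (29), s from (33), and the standard errors
(34)-(35). The data are invented for illustration; NumPy is assumed.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.2, 4.8, 6.1, 6.8, 8.1, 8.7])
n = len(x)
xbar, ybar = x.mean(), y.mean()

ss_xx = np.sum((x - xbar)**2)
ss_yy = np.sum((y - ybar)**2)
ss_xy = np.sum((x - xbar) * (y - ybar))

b = ss_xy / ss_xx                 # slope, eq. (27)
a = ybar - b * xbar               # intercept, eq. (28)
r2 = ss_xy**2 / (ss_xx * ss_yy)   # squared correlation coefficient, eq. (29)
s = np.sqrt((ss_yy - b * ss_xy) / (n - 2))      # residual standard deviation, eq. (33)

se_a = s * np.sqrt(1.0 / n + xbar**2 / ss_xx)   # eq. (34)
se_b = s / np.sqrt(ss_xx)                       # eq. (35)
print(f"a = {a:.4f} +/- {se_a:.4f}, b = {b:.4f} +/- {se_b:.4f}, r^2 = {r2:.4f}")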
ANOVA
"Analysis of Variance." A statistical test for heterogeneity of means by analysis of group
variances. ANOVA is implemented as ANOVA[data] in the Mathematica package ANOVA` .

To apply the test, assume random sampling of a variate y with equal variances, independent
errors, and a normal distribution. Let n be the number of replicates (sets of identical
observations) within each of K factor levels (treatment groups), and y_{ij} be the jth observation
within factor level i. Also assume that the ANOVA is "balanced" by restricting n to be the same
for each factor level.

Now define the sum of square terms

SST = \sum_{i=1}^{K} \sum_{j=1}^{n} y_{ij}^2 - \frac{1}{Kn}\left(\sum_{i=1}^{K} \sum_{j=1}^{n} y_{ij}\right)^2    (1)

SSA = \frac{1}{n} \sum_{i=1}^{K} \left(\sum_{j=1}^{n} y_{ij}\right)^2 - \frac{1}{Kn}\left(\sum_{i=1}^{K} \sum_{j=1}^{n} y_{ij}\right)^2    (2)

    = n \sum_{i=1}^{K} (\bar{y}_i - \bar{y})^2    (3)

SSE = \sum_{i=1}^{K} \sum_{j=1}^{n} (y_{ij} - \bar{y}_i)^2    (4)

    = SST - SSA    (5)

which are the total, treatment, and error sums of squares. Here, \bar{y}_i is the mean of observations
within factor level i, and \bar{y} is the "group" mean (i.e., mean of means). Compute the entries in the
following table, obtaining the P-value corresponding to the calculated F-ratio of the mean
squared values

F = \frac{MSA}{MSE}    (6)

category   freedom     SS     mean squared            F-ratio
model      K - 1       SSA    MSA = SSA/(K - 1)       MSA/MSE
error      K(n - 1)    SSE    MSE = SSE/[K(n - 1)]
total      Kn - 1      SST

If the P-value is small, reject the null hypothesis that all means are the same for the different
groups.
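A minimal one-way (balanced) ANOVA sketch in Python. The group data are invented for
illustration; the P-value uses scipy.stats, which is assumed to be available.

import numpy as np
from scipy import stats

# K = 3 factor levels, n = 5 replicates each (illustrative numbers).
groups = np.array([
    [6.1, 5.8, 6.4, 6.0, 5.9],
    [7.2, 7.5, 6.9, 7.1, 7.4],
    [6.5, 6.7, 6.2, 6.6, 6.4],
])
K, n = groups.shape
grand_mean = groups.mean()
group_means = groups.mean(axis=1)

SSA = n * np.sum((group_means - grand_mean)**2)    # treatment sum of squares
SSE = np.sum((groups - group_means[:, None])**2)   # error sum of squares
MSA, MSE = SSA / (K - 1), SSE / (K * (n - 1))
F = MSA / MSE
p = stats.f.sf(F, K - 1, K * (n - 1))              # upper-tail P-value
print(f"F = {F:.3f}, p = {p:.4g}")

# Cross-check against SciPy's built-in one-way ANOVA:
print(stats.f_oneway(*groups))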
Correlation Coefficient
The correlation coefficient, sometimes also called the cross-correlation coefficient, is a quantity
that gives the quality of a least squares fitting to the original data. To define the correlation
coefficient, first consider the sum of squared values ss_xx, ss_xy, and ss_yy of a set of n data
points (x_i, y_i) about their respective means,

ss_{xx} = \sum (x_i - \bar{x})^2    (1)

      = \sum x_i^2 - 2\bar{x} \sum x_i + n\bar{x}^2    (2)

      = \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2    (3)

      = \sum x_i^2 - n\bar{x}^2    (4)

ss_{yy} = \sum (y_i - \bar{y})^2    (5)

      = \sum y_i^2 - 2\bar{y} \sum y_i + n\bar{y}^2    (6)

      = \sum y_i^2 - 2n\bar{y}^2 + n\bar{y}^2    (7)

      = \sum y_i^2 - n\bar{y}^2    (8)

ss_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y})    (9)

      = \sum x_i y_i - \bar{x} \sum y_i - \bar{y} \sum x_i + n\bar{x}\bar{y}    (10)

      = \sum x_i y_i - 2n\bar{x}\bar{y} + n\bar{x}\bar{y}    (11)

      = \sum x_i y_i - n\bar{x}\bar{y}    (12)
These quantities are simply unnormalized forms of the variances and covariance of x and y given
by

\sigma_x^2 = \frac{ss_{xx}}{n}    (13)

\sigma_y^2 = \frac{ss_{yy}}{n}    (14)

\mathrm{cov}(x, y) = \frac{ss_{xy}}{n}    (15)
For linear least squares fitting, the coefficient b in

y = a + b x    (16)

is given by

b = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}    (17)

  = \frac{ss_{xy}}{ss_{xx}}    (18)

and the coefficient b' in

x = a' + b' y    (19)

is given by

b' = \frac{ss_{xy}}{ss_{yy}}    (20)

The correlation coefficient r (sometimes also denoted R) is then defined by

r = \sqrt{b\, b'}    (21)

  = \frac{ss_{xy}}{\sqrt{ss_{xx}\, ss_{yy}}}    (22)
The correlation coefficient is also known as the product-moment coefficient of correlation or
Pearson's correlation.
[Figure omitted: correlation coefficients for linear fits to increasingly noisy data.]
The correlation coefficient has an important physical interpretation. To see this, let

\hat{y}_i = a + b x_i    (23)

denote the "expected" (fitted) value of y_i predicted by the regression line, with a = \bar{y} - b\bar{x}
and b = ss_{xy}/ss_{xx}. The sum of squared errors about the regression line is then

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = ss_{yy} - b\, ss_{xy} = ss_{yy} - \frac{ss_{xy}^2}{ss_{xx}}    (24)

and the sum of squared residuals of the fitted values about the mean (the regression sum of
squares) is

SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b^2\, ss_{xx} = \frac{ss_{xy}^2}{ss_{xx}}    (25)

But

ss_{yy} = SSR + SSE    (26)

so the square of the correlation coefficient is given by

r^2 = \frac{ss_{xy}^2}{ss_{xx}\, ss_{yy}} = \frac{SSR}{ss_{yy}} = 1 - \frac{SSE}{ss_{yy}}    (27)

In other words, r^2 is the proportion of ss_yy which is accounted for by the regression.


If there is complete correlation, then the lines obtained by solving for best-fit (a, b) and (a', b')
coincide (since all data points lie on them), so solving (16) for x and equating to (19) gives

x = \frac{y - a}{b} = a' + b' y    (28)

Therefore, a' = -a/b and b' = 1/b, giving

r^2 = b\, b' = 1    (29)

The correlation coefficient is independent of both origin and scale, so

r(u, v) = r(x, y)    (30)

where

u = \frac{x - x_0}{h}, \qquad v = \frac{y - y_0}{k}    (31)
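The claim that r^2 equals the proportion of ss_yy explained by the regression is easy to verify
numerically. A short sketch (illustrative data; NumPy assumed; np.corrcoef is used only as a
cross-check):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 0.8 * x + rng.normal(0, 1.0, x.size)

xbar, ybar = x.mean(), y.mean()
ss_xx = np.sum((x - xbar)**2)
ss_yy = np.sum((y - ybar)**2)
ss_xy = np.sum((x - xbar) * (y - ybar))

b = ss_xy / ss_xx
a = ybar - b * xbar
yhat = a + b * x

r2_def = ss_xy**2 / (ss_xx * ss_yy)             # definition (22), squared
r2_sse = 1.0 - np.sum((y - yhat)**2) / ss_yy    # 1 - SSE/ss_yy, eq. (27)
print(r2_def, r2_sse, np.corrcoef(x, y)[0, 1]**2)   # all three agree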
Least Squares Fitting--Exponential

To fit a functional form

y = A e^{B x}    (1)

take the logarithm of both sides

\ln y = \ln A + B x    (2)

The best-fit values are then

a = \frac{\sum (\ln y_i) \sum x_i^2 - \sum x_i \sum x_i \ln y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}    (3)

b = \frac{n \sum x_i \ln y_i - \sum x_i \sum \ln y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2}    (4)

where B = b and A = e^{a}.

This fit gives greater weights to small y values so, in order to weight the points equally, it is often
better to minimize the function

\sum_{i=1}^{n} y_i (\ln y_i - a - b x_i)^2    (5)

Applying least squares fitting gives

\frac{\partial}{\partial a} \sum y_i (\ln y_i - a - b x_i)^2 = -2 \sum y_i (\ln y_i - a - b x_i) = 0    (6)

\frac{\partial}{\partial b} \sum y_i (\ln y_i - a - b x_i)^2 = -2 \sum x_i y_i (\ln y_i - a - b x_i) = 0    (7)

which lead to the linear system

\begin{bmatrix} \sum y_i & \sum x_i y_i \\ \sum x_i y_i & \sum x_i^2 y_i \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} \sum y_i \ln y_i \\ \sum x_i y_i \ln y_i \end{bmatrix}    (8)

Solving for a and b,

a = \frac{\sum (x_i^2 y_i) \sum (y_i \ln y_i) - \sum (x_i y_i) \sum (x_i y_i \ln y_i)}{\sum y_i \sum (x_i^2 y_i) - \left(\sum x_i y_i\right)^2}    (9)

b = \frac{\sum y_i \sum (x_i y_i \ln y_i) - \sum (x_i y_i) \sum (y_i \ln y_i)}{\sum y_i \sum (x_i^2 y_i) - \left(\sum x_i y_i\right)^2}    (10)

In the original plot, the short-dashed curve is the fit computed from (3) and (4) and the long-
dashed curve is the weighted fit computed from (9) and (10).
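Both variants are easy to implement. The sketch below (illustrative data; NumPy assumed)
computes the plain log-transform fit of (3)-(4) and the y-weighted fit of (9)-(10):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 4, 25)
y = 2.5 * np.exp(0.9 * x) * np.exp(rng.normal(0, 0.05, x.size))   # noisy A*exp(B*x)

n = len(x)
ly = np.log(y)

# Unweighted fit of ln y = a + b x, eqs. (3)-(4).
D = n * np.sum(x**2) - np.sum(x)**2
a1 = (np.sum(ly) * np.sum(x**2) - np.sum(x) * np.sum(x * ly)) / D
b1 = (n * np.sum(x * ly) - np.sum(x) * np.sum(ly)) / D

# Weighted fit, eqs. (9)-(10): weights each point by y to undo the bias of the log transform.
Dw = np.sum(y) * np.sum(x**2 * y) - np.sum(x * y)**2
a2 = (np.sum(x**2 * y) * np.sum(y * ly) - np.sum(x * y) * np.sum(x * y * ly)) / Dw
b2 = (np.sum(y) * np.sum(x * y * ly) - np.sum(x * y) * np.sum(y * ly)) / Dw

print("unweighted: A =", np.exp(a1), "B =", b1)
print("weighted:   A =", np.exp(a2), "B =", b2)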

Least Squares Fitting--Logarithmic

Given a function of the form

y = a + b \ln x    (1)

the coefficients can be found from least squares fitting as

b = \frac{n \sum (y_i \ln x_i) - \sum y_i \sum \ln x_i}{n \sum (\ln x_i)^2 - \left(\sum \ln x_i\right)^2}, \qquad a = \frac{\sum y_i - b \sum \ln x_i}{n}    (2)
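A direct transcription of (2) in Python (illustrative data; this is equivalent to fitting a straight
line of y against ln x):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
y = np.array([0.9, 2.5, 3.8, 5.2, 6.4, 8.1])   # roughly 1 + 2*ln(x)

n, lx = len(x), np.log(x)
b = (n * np.sum(y * lx) - np.sum(y) * np.sum(lx)) / (n * np.sum(lx**2) - np.sum(lx)**2)
a = (np.sum(y) - b * np.sum(lx)) / n
print("a =", a, "b =", b)     # compare: np.polyfit(np.log(x), y, 1)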
Least Squares Fitting--Perpendicular Offsets

In practice, the vertical offsets from a line (polynomial, surface, hyperplane, etc.) are almost
always minimized instead of the perpendicular offsets. This provides a fitting function for the
independent variable x that estimates y for a given x (most often what an experimenter wants),
allows uncertainties of the data points along the x- and y-axes to be incorporated simply, and
also provides a much simpler analytic form for the fitting parameters than would be obtained
using a fit based on perpendicular offsets.

The residuals of the best-fit line y = a + b x for a set of n points using unsquared perpendicular
distances d_i of points (x_i, y_i) are given by

R_\perp = \sum_{i=1}^{n} d_i    (1)

Since the perpendicular distance from the line to point i is given by

d_i = \frac{|y_i - (a + b x_i)|}{\sqrt{1 + b^2}}    (2)

the function to be minimized is

R_\perp = \sum_{i=1}^{n} \frac{|y_i - (a + b x_i)|}{\sqrt{1 + b^2}}    (3)

Unfortunately, because the absolute value function does not have continuous derivatives,
minimizing R_\perp is not amenable to analytic solution. However, if the square of the perpendicular
distances

R_\perp^2 = \sum_{i=1}^{n} \frac{\left[y_i - (a + b x_i)\right]^2}{1 + b^2}    (4)

is minimized instead, the problem can be solved in closed form. R_\perp^2 is a minimum when

\frac{\partial (R_\perp^2)}{\partial a} = \frac{-2}{1 + b^2} \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right] = 0    (5)

and

\frac{\partial (R_\perp^2)}{\partial b} = \frac{-2}{1 + b^2} \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right] x_i - \frac{2b}{(1 + b^2)^2} \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right]^2 = 0    (6)

The former gives

a = \frac{\sum y_i - b \sum x_i}{n}    (7)

  = \bar{y} - b \bar{x}    (8)

and the latter

\sum_{i=1}^{n} \left[y_i - (a + b x_i)\right] x_i + \frac{b}{1 + b^2} \sum_{i=1}^{n} \left[y_i - (a + b x_i)\right]^2 = 0    (9)

But

\sum_{i=1}^{n} \left[y_i - (a + b x_i)\right] x_i = \sum x_i y_i - a \sum x_i - b \sum x_i^2    (10)

\sum_{i=1}^{n} \left[y_i - (a + b x_i)\right]^2 = \sum y_i^2 + n a^2 + b^2 \sum x_i^2 - 2a \sum y_i - 2b \sum x_i y_i + 2ab \sum x_i    (11)

so substituting (10) and (11) into (9), and then eliminating a using (8), gives, after a fair bit of
algebra, the quadratic

b^2 + \frac{\left(\sum y_i^2 - n\bar{y}^2\right) - \left(\sum x_i^2 - n\bar{x}^2\right)}{n\bar{x}\bar{y} - \sum x_i y_i}\; b - 1 = 0    (12)

So define

B = \frac{1}{2} \cdot \frac{\left(\sum y_i^2 - n\bar{y}^2\right) - \left(\sum x_i^2 - n\bar{x}^2\right)}{n\bar{x}\bar{y} - \sum x_i y_i}    (13)

so that (12) reads b^2 + 2Bb - 1 = 0, and the quadratic formula gives

b = -B \pm \sqrt{B^2 + 1}    (14)

with a then found using (8). Note the rather unwieldy form of the best-fit parameters in this
formulation. In addition, minimizing R_\perp^2 for a second- or higher-degree polynomial leads to
polynomial equations of higher order, so this formulation cannot be extended.
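The closed-form perpendicular (total least squares) slope is easy to compute and compare with
the ordinary vertical fit. A sketch under the formulas above (illustrative data; NumPy assumed):

import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 40) + rng.normal(0, 0.3, 40)   # noise in x as well as y
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, 40)

n, xbar, ybar = len(x), x.mean(), y.mean()

# B from eq. (13); b from eq. (14), choosing the root that minimizes R_perp^2.
B = 0.5 * ((np.sum(y**2) - n * ybar**2) - (np.sum(x**2) - n * xbar**2)) / (n * xbar * ybar - np.sum(x * y))
candidates = [-B + np.sqrt(B**2 + 1), -B - np.sqrt(B**2 + 1)]

def r_perp2(b):
    a = ybar - b * xbar
    return np.sum((y - (a + b * x))**2) / (1 + b**2)

b = min(candidates, key=r_perp2)
a = ybar - b * xbar

print("perpendicular fit:", a, b)
print("vertical fit:     ", np.polyfit(x, y, 1)[::-1])   # intercept, slope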

Least Squares Fitting--Polynomial


Generalizing from a straight line (i.e., first degree polynomial) to a kth degree polynomial

y = a_0 + a_1 x + \cdots + a_k x^k    (1)

the residual is given by

R^2 = \sum_{i=1}^{n} \left[y_i - (a_0 + a_1 x_i + \cdots + a_k x_i^k)\right]^2    (2)

The partial derivatives (again dropping superscripts) are

\frac{\partial (R^2)}{\partial a_0} = -2 \sum_{i=1}^{n} \left[y - (a_0 + a_1 x + \cdots + a_k x^k)\right] = 0    (3)

\frac{\partial (R^2)}{\partial a_1} = -2 \sum_{i=1}^{n} \left[y - (a_0 + a_1 x + \cdots + a_k x^k)\right] x = 0    (4)

\frac{\partial (R^2)}{\partial a_k} = -2 \sum_{i=1}^{n} \left[y - (a_0 + a_1 x + \cdots + a_k x^k)\right] x^k = 0    (5)

These lead to the equations

a_0 n + a_1 \sum x_i + \cdots + a_k \sum x_i^k = \sum y_i    (6)

a_0 \sum x_i + a_1 \sum x_i^2 + \cdots + a_k \sum x_i^{k+1} = \sum x_i y_i    (7)

a_0 \sum x_i^k + a_1 \sum x_i^{k+1} + \cdots + a_k \sum x_i^{2k} = \sum x_i^k y_i    (8)

or, in matrix form,

\begin{bmatrix} n & \sum x_i & \cdots & \sum x_i^k \\ \sum x_i & \sum x_i^2 & \cdots & \sum x_i^{k+1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum x_i^k & \sum x_i^{k+1} & \cdots & \sum x_i^{2k} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} \sum y_i \\ \sum x_i y_i \\ \vdots \\ \sum x_i^k y_i \end{bmatrix}    (9)

We can also obtain the coefficients directly from the design matrix, which for a polynomial fit is
a Vandermonde matrix:

\begin{bmatrix} 1 & x_1 & \cdots & x_1^k \\ 1 & x_2 & \cdots & x_2^k \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_n & \cdots & x_n^k \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_k \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}    (10)

In matrix notation, the equation for a polynomial fit is given by

\mathbf{X} \mathbf{a} = \mathbf{y}    (11)

Premultiplying both sides by the transpose \mathbf{X}^{\mathrm{T}} of the design matrix then gives the normal
equations

\mathbf{X}^{\mathrm{T}} \mathbf{X} \mathbf{a} = \mathbf{X}^{\mathrm{T}} \mathbf{y}    (12)

whose coefficient matrix is exactly the matrix of sums of powers in (9). This matrix equation can
be solved numerically, or can be inverted directly if it is well conditioned, to yield the solution
vector

\mathbf{a} = \left(\mathbf{X}^{\mathrm{T}} \mathbf{X}\right)^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{y}    (13)

Setting k = 1 in the above equations reproduces the linear solution.
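In code, building the Vandermonde matrix and solving the normal equations (or, better, using a
least-squares solver) looks like this. A sketch with invented data; numpy.polyfit is shown only as
a cross-check:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-2, 2, 30)
y = 1.0 - 0.5 * x + 2.0 * x**3 + rng.normal(0, 0.2, x.size)

k = 3
X = np.vander(x, k + 1, increasing=True)     # columns 1, x, x^2, ..., x^k

# Normal equations, eq. (13). For higher degrees prefer np.linalg.lstsq (QR/SVD based),
# since X^T X becomes ill-conditioned.
a_normal = np.linalg.solve(X.T @ X, X.T @ y)
a_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(a_normal)
print(a_lstsq)
print(np.polyfit(x, y, k)[::-1])             # reversed to match a_0, ..., a_k order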

Least Squares Fitting--Power Law


Given a function of the form

y = A x^B    (1)

taking the logarithm of both sides gives \ln y = \ln A + B \ln x, so least squares fitting gives the
coefficients as

b = \frac{n \sum (\ln x_i \ln y_i) - \sum \ln x_i \sum \ln y_i}{n \sum (\ln x_i)^2 - \left(\sum \ln x_i\right)^2}    (2)

a = \frac{\sum \ln y_i - b \sum \ln x_i}{n}    (3)

where B = b and A = e^{a}.
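A short sketch of the power-law fit (illustrative data; NumPy assumed), which is just a
straight-line fit in log-log space:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.1, 8.3, 23.0, 66.0, 180.0])   # roughly 3 * x^1.5

n, lx, ly = len(x), np.log(x), np.log(y)
b = (n * np.sum(lx * ly) - np.sum(lx) * np.sum(ly)) / (n * np.sum(lx**2) - np.sum(lx)**2)
a = (np.sum(ly) - b * np.sum(lx)) / n
print("A =", np.exp(a), "B =", b)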

Nonlinear Least Squares Fitting


Given a function f(x) of a variable x tabulated at m values y_1 = f(x_1), ..., y_m = f(x_m), assume the
function is of known analytic form depending on n parameters \lambda_1, ..., \lambda_n, and consider the
overdetermined set of m equations

y_1 = f(x_1; \lambda_1, \lambda_2, \ldots, \lambda_n)    (1)

\vdots

y_m = f(x_m; \lambda_1, \lambda_2, \ldots, \lambda_n)    (2)

We desire to solve these equations to obtain the values \lambda_1, ..., \lambda_n which best satisfy this system of
equations. Pick an initial guess for the \lambda_i and then define

d\beta_i = y_i - f(x_i; \lambda_1, \lambda_2, \ldots, \lambda_n)    (3)

Now obtain a linearized estimate for the changes d\lambda_j needed to reduce d\beta_i to 0,

d\beta_i = \sum_{j=1}^{n} \frac{\partial f}{\partial \lambda_j}\, d\lambda_j    (4)

for i = 1, ..., m, where f \equiv f(x_i; \lambda_1, \ldots, \lambda_n). This can be written in component form as

d\beta_i = A_{ij}\, d\lambda_j    (5)

where A is the m \times n matrix

A_{ij} = \frac{\partial f(x_i; \lambda_1, \ldots, \lambda_n)}{\partial \lambda_j}    (6)

In more concise matrix form,

d\boldsymbol{\beta} = \mathbf{A}\, d\boldsymbol{\lambda}    (7)

where d\boldsymbol{\beta} is an m-vector and d\boldsymbol{\lambda} is an n-vector.

Applying the transpose of \mathbf{A} to both sides gives

\mathbf{A}^{\mathrm{T}}\, d\boldsymbol{\beta} = \left(\mathbf{A}^{\mathrm{T}} \mathbf{A}\right) d\boldsymbol{\lambda}    (8)

Defining

\mathbf{a} \equiv \mathbf{A}^{\mathrm{T}} \mathbf{A}    (9)

\mathbf{b} \equiv \mathbf{A}^{\mathrm{T}}\, d\boldsymbol{\beta}    (10)

in terms of the known quantities \mathbf{A} and d\boldsymbol{\beta} then gives the matrix equation

\mathbf{a}\, d\boldsymbol{\lambda} = \mathbf{b}    (11)

which can be solved for d\boldsymbol{\lambda} using standard matrix techniques such as Gaussian elimination. This
offset is then applied to \boldsymbol{\lambda} and a new d\boldsymbol{\beta} is calculated. By iteratively applying this procedure until
the elements of d\boldsymbol{\lambda} become smaller than some prescribed limit, a solution is obtained. Note that
the procedure may not converge very well for some functions and also that convergence is often
greatly improved by picking initial values close to the best-fit values. The sum of square residuals
is given by R^2 = \sum_{i=1}^{m} d\beta_i^2 after the final iteration.

An example of a nonlinear least squares fit to a noisy Gaussian function

f(x) = A \exp\!\left[-\frac{(x - x_0)^2}{2\sigma^2}\right]    (12)

is shown in the original figure, where the thin solid curve is the initial guess, the dotted curves
are intermediate iterations, and the heavy solid curve is the fit to which the solution converges.
The actual parameters are (A, x_0, \sigma) = (1, 20, 5), the initial guess was (0.8, 15, 4), and the
converged values are (1.03105, 20.1369, 4.86022). The partial derivatives used to construct the
matrix \mathbf{A} are

\frac{\partial f}{\partial A} = \exp\!\left[-\frac{(x - x_0)^2}{2\sigma^2}\right]    (13)

\frac{\partial f}{\partial x_0} = \frac{A (x - x_0)}{\sigma^2} \exp\!\left[-\frac{(x - x_0)^2}{2\sigma^2}\right]    (14)

\frac{\partial f}{\partial \sigma} = \frac{A (x - x_0)^2}{\sigma^3} \exp\!\left[-\frac{(x - x_0)^2}{2\sigma^2}\right]    (15)

The technique could obviously be generalized to multiple Gaussians, to include slopes, etc.,
although the convergence properties generally worsen as the number of free parameters is
increased.
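The iteration described above (essentially the Gauss-Newton method) is short to implement for
the Gaussian example. A sketch, assuming NumPy; the data are synthesized here with the
parameters quoted above, so the recovered values will be close to (1, 20, 5) but not identical to
the figures quoted from the original example:

import numpy as np

def f(x, A, x0, sigma):
    return A * np.exp(-(x - x0)**2 / (2 * sigma**2))

rng = np.random.default_rng(4)
x = np.linspace(0, 40, 81)
y = f(x, 1.0, 20.0, 5.0) + rng.normal(0, 0.05, x.size)   # noisy Gaussian

lam = np.array([0.8, 15.0, 4.0])       # initial guess (A, x0, sigma)
for _ in range(20):
    A, x0, sigma = lam
    g = np.exp(-(x - x0)**2 / (2 * sigma**2))
    # Jacobian A_ij = df/dlambda_j, eqs. (13)-(15).
    J = np.column_stack([g,
                         A * (x - x0) / sigma**2 * g,
                         A * (x - x0)**2 / sigma**3 * g])
    dbeta = y - f(x, *lam)                          # eq. (3)
    dlam = np.linalg.solve(J.T @ J, J.T @ dbeta)    # eqs. (9)-(11)
    lam = lam + dlam
    if np.max(np.abs(dlam)) < 1e-9:
        break

print("converged parameters:", lam)
print("sum of squared residuals:", np.sum((y - f(x, *lam))**2))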

An analogous technique can be used to solve an overdetermined set of equations. This problem
might, for example, arise when solving for the best-fit Euler angles corresponding to a noisy
rotation matrix, in which case there are three unknown angles, but nine correlated matrix
elements. In such a case, write the m different functions as f_i(\lambda_1, \ldots, \lambda_n) for i = 1, ..., m, call their
actual values y_i, and define

A_{ij} = \frac{\partial f_i}{\partial \lambda_j}    (16)

and

d\beta_i = y_i - f_i\!\left(\lambda_1^{(k)}, \ldots, \lambda_n^{(k)}\right)    (17)

where \lambda_j^{(k)} are the numerical values obtained after the kth iteration. Again, set up the equations as

\mathbf{A}\, d\boldsymbol{\lambda} = d\boldsymbol{\beta}    (18)

and proceed exactly as before.


In the application to analytical calibration, the concentration of the sample, Cx, is given by Cx =
(Sx - intercept)/slope, where Sx is the signal given by the sample solution. The uncertainties of all
three terms contribute to the uncertainty of Cx. The standard deviation of Cx can be estimated
from the standard deviations of the slope, the intercept, and Sx using the rules for mathematical error
propagation. The problem is that, in analytical chemistry, the labor and cost of preparing and
running large numbers of standard solutions often limit the number of standards to a rather small set
by statistical standards, so these estimates of standard deviation are often fairly rough. A spreadsheet
that performs these error-propagation calculations for your own first-order (linear) analytical calibration
data can be downloaded from http://terpconnect.umd.edu/~toh/models/CalibrationLinear.xls. For
the linear calibration example given in the previous section, where the "true" value of the slope was 10
and the intercept was zero, this spreadsheet predicts that the slope is 9.8 with a standard deviation of
0.407 (4.2%) and that the intercept is 0.197 with a standard deviation of 0.25 (128%), both well within
one standard deviation of the true values. The spreadsheet also performs the propagation-of-error
calculations for the calculated concentrations of each unknown in the last two columns on the right. In
that example, the instrument readings of the standards are taken as the unknowns, showing that the
predicted percent concentration errors range from about 5% to 19% of the true values of those
standards. (Note that the percent standard deviation of the concentration is greater than that of the
slope at high concentrations, and considerably greater at low concentrations, because of the greater
influence of the uncertainty in the intercept.) For a further discussion and some examples,
see http://terpconnect.umd.edu/~toh/models/Bracket.html#Cal_curve_linear.
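A minimal sketch of the propagation-of-error calculation described above, in Python. The formula
for the standard deviation of Cx assumes, for simplicity, that the errors in the slope, the intercept,
and Sx are independent (the spreadsheet linked above handles the details); the numbers are
illustrative only.

import numpy as np

# Calibration result (illustrative values in the spirit of the example above).
slope, sd_slope = 9.8, 0.407
intercept, sd_intercept = 0.197, 0.25

# Measured sample signal and its standard deviation (hypothetical).
Sx, sd_Sx = 52.0, 0.8

# Concentration of the sample from the calibration line.
Cx = (Sx - intercept) / slope

# First-order propagation of error, treating the three uncertainties as independent:
# Cx = (Sx - intercept)/slope  =>  var(Cx) ~ (sd_Sx^2 + sd_intercept^2)/slope^2
#                                            + Cx^2 * (sd_slope/slope)^2
sd_Cx = np.sqrt((sd_Sx**2 + sd_intercept**2) / slope**2 + Cx**2 * (sd_slope / slope)**2)

print(f"Cx = {Cx:.3f} +/- {sd_Cx:.3f} ({100 * sd_Cx / Cx:.1f}%)")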
