Curve fitting is the process of constructing a curve, or mathematical function, that has
the best fit to a series of data points, possibly subject to constraints. Curve fitting can involve
either interpolation, where an exact fit to the data is required, or smoothing, in which a "smooth"
function is constructed that approximately fits the data. A related topic is regression analysis,
which focuses more on questions of statistical inference such as how much uncertainty is present
in a curve that is fit to data observed with random errors. Fitted curves can be used as an aid for
data visualization, to infer values of a function where no data are available, and to summarize the
relationships among two or more variables. Extrapolation refers to the use of a fitted curve
beyond the range of the observed data, and is subject to a degree of uncertainty since it may
reflect the method used to construct the curve as much as it reflects the observed data.
This graph shows a series of points (generated by a sine function) approximated by polynomial
curves (the red curve is linear, green is quadratic, orange is cubic, and blue is fourth degree).
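As a sketch of this setup (assuming, as the figure suggests, samples of sin(x) at the integer points 0 through 6; the sample points are an assumption, not stated in the text), the four polynomial fits can be reproduced with NumPy's least-squares polyfit:

```python
# Fit polynomials of degree 1-4 to points sampled from a sine function,
# as in the figure; np.polyfit performs an ordinary least-squares fit.
import numpy as np

x = np.arange(7)            # sample points 0..6 (assumed from the figure)
y = np.sin(x)               # data generated by a sine function

fits = {deg: np.polyfit(x, y, deg) for deg in (1, 2, 3, 4)}

def sse(coeffs):
    """Sum of squared residuals of a fitted polynomial at the data points."""
    return float(np.sum((np.polyval(coeffs, x) - y) ** 2))

errors = {deg: sse(c) for deg, c in fits.items()}
```

Because each higher-degree model contains the lower-degree ones as special cases, the residual error can only decrease as the degree grows.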
Starting with a first degree polynomial equation:

    y = ax + b
This is a line with slope a. A line will connect any two points, so a first degree polynomial
equation is an exact fit through any two points with distinct x coordinates.
If the order of the equation is increased to a second degree polynomial, the following results:

    y = ax² + bx + c

This will exactly fit any three points with distinct x coordinates.
Even if an exact match exists, it does not necessarily follow that it can be readily
discovered. Depending on the algorithm used there may be a divergent case,
where the exact fit cannot be calculated, or it might take too much computer time
to find the solution. This situation might require an approximate solution.
The effect of averaging out questionable data points in a sample, rather than
distorting the curve to fit them exactly, may be desirable.
Low-order polynomials tend to be smooth and high order polynomial curves tend
to be "lumpy". To define this more precisely, the maximum number of inflection
points possible in a polynomial curve is n-2, where n is the order of the
polynomial equation. An inflection point is a location on the curve where the curvature
changes sign, i.e. where the radius of curvature switches from positive to negative. We can
also say this is where the curve transitions from "holding water" to "shedding water". Note that it is only
"possible" that high order polynomials will be lumpy; they could also be smooth,
but there is no guarantee of this, unlike with low order polynomial curves. A
fifteenth degree polynomial could have, at most, thirteen inflection points, but
could also have twelve, eleven, or any number down to zero.
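This bound is easy to check numerically: inflection points of a polynomial occur at real roots of its second derivative, which has degree n − 2. A sketch, using an arbitrarily chosen fifth-degree polynomial:

```python
# Inflection points of f(x) = x^5 - 5x^3 + 4x (degree n = 5): real roots of f''.
import numpy as np

coeffs = [1, 0, -5, 0, 4, 0]              # example polynomial, degree 5
second = np.polyder(np.polyder(coeffs))   # second derivative, degree n - 2 = 3
real_roots = sorted(r.real for r in np.roots(second) if abs(r.imag) < 1e-9)
```

Here f″ has three real roots, matching the maximum of n − 2 = 3 inflection points; a different degree-5 polynomial could have fewer.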
The degree of the polynomial curve being higher than needed for an exact fit is undesirable for
all the reasons listed previously for high order polynomials, but also leads to a case where there
are an infinite number of solutions. For example, a first degree polynomial (a line) constrained
by only a single point, instead of the usual two, would give an infinite number of solutions. This
brings up the problem of how to compare and choose just one solution, which can be a problem
for software and for humans, as well. For this reason, it is usually best to choose as low a degree
as possible for an exact match on all constraints, and perhaps an even lower degree, if an
approximate fit is acceptable.
Fitting other curves to data points:
Other types of curves, such as conic sections (circular, elliptical, parabolic, and
hyperbolic arcs) or trigonometric functions (such as sine and cosine), may also be used, in
certain cases. For example, trajectories of objects under the influence of gravity follow a
parabolic path, when air resistance is ignored. Hence, matching trajectory data points to a
parabolic curve would make sense. Tides follow sinusoidal patterns, hence tidal data points
should be matched to a sine wave, or the sum of two sine waves of different periods, if the
effects of the Moon and Sun are both considered.
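When the constituent periods are taken as known, fitting such a sum of sinusoids reduces to linear least squares on a sine/cosine basis. A sketch with illustrative periods and amplitudes (the numeric values are assumptions for the example, not real tidal constants):

```python
# Fit a sum of two sinusoids of known, different periods by linear least squares.
import numpy as np

t = np.linspace(0.0, 96.0, 400)      # time in hours (illustrative)
periods = [12.42, 12.00]             # assumed lunar-like and solar-like periods

# Synthetic "tidal" observations: two sine waves of different periods.
data = (1.5 * np.sin(2 * np.pi * t / periods[0] + 0.3)
        + 0.7 * np.sin(2 * np.pi * t / periods[1] + 1.1))

# Design matrix: a constant column plus one sine and one cosine per period
# (a sine with unknown phase equals a sin/cos combination at that period).
cols = [np.ones_like(t)]
for p in periods:
    w = 2 * np.pi / p
    cols += [np.sin(w * t), np.cos(w * t)]
A = np.column_stack(cols)

coef, *_ = np.linalg.lstsq(A, data, rcond=None)
fitted = A @ coef
```

Because the unknowns enter linearly once the periods are fixed, no iterative solver is needed.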
In spectroscopy, curves may be fitted with Gaussian, Lorentzian, Voigt and related functions.
Algebraic fit versus geometric fit for curves:
For algebraic analysis of data, "fitting" usually means trying to find the curve that
minimizes the vertical (y-axis) displacement of a point from the curve (e.g., ordinary least
squares). However, for graphical and image applications, geometric fitting seeks to provide the
best visual fit, which usually means trying to minimize the orthogonal distance to the curve
(e.g., total least squares), or to otherwise include both axes of displacement of a point from the
curve. Geometric fits are not popular because they usually require non-linear and/or iterative
calculations, although they have the advantage of a more aesthetic and geometrically accurate
result.
Fitting a circle by geometric fit:
Coope approaches the problem of finding the best visual fit of a circle to a set of 2D data
points (in the illustrated example, a circle with centre C(1, 1) and radius r = 4).
The method elegantly transforms the ordinarily non-linear problem into a linear problem that can
be solved without using iterative numerical methods, and is hence an order of magnitude faster
than previous techniques.
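A sketch of the linearisation idea (the data points here are illustrative, placed exactly on the figure's circle with centre C(1, 1) and radius 4): writing the circle equation as x² + y² = 2ax + 2by + c makes it linear in the unknowns a, b and c, so an ordinary least-squares solve recovers the centre (a, b) and radius √(c + a² + b²).

```python
# Linearised circle fit: solve for (a, b, c) in x^2 + y^2 = 2ax + 2by + c.
import numpy as np

theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)
x = 1 + 4 * np.cos(theta)            # points on the circle C(1, 1), r = 4
y = 1 + 4 * np.sin(theta)

A = np.column_stack([2 * x, 2 * y, np.ones_like(x)])
rhs = x ** 2 + y ** 2
(a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
radius = np.sqrt(c + a ** 2 + b ** 2)
```

One linear solve replaces the iterative minimisation that a direct geometric distance criterion would require.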
Application to surfaces:
Note that while this discussion was in terms of 2D curves, much of this logic also extends
to 3D surfaces, each patch of which is defined by a net of curves in two parametric directions,
typically called u and v. A surface may be composed of one or more surface patches in each
direction.
REGRESSION ANALYSIS
In statistics, regression analysis is a statistical process for estimating the relationships among
variables. It includes many techniques for modeling and analyzing several variables, when the
focus is on the relationship between a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the
dependent variable changes when any one of the independent variables is varied, while the other
independent variables are held fixed. Most commonly, regression analysis estimates the
conditional expectation of the dependent variable given the independent variables – that is,
the average value of the dependent variable when the independent variables are fixed. Less
commonly, the focus is on a quantile, or other location parameter of the conditional distribution
of the dependent variable given the independent variables. In all cases, the estimation target is
a function of the independent variables called the regression function. In regression analysis, it
is also of interest to characterize the variation of the dependent variable around the regression
function, which can be described by a probability distribution.
Regression analysis is widely used for prediction and forecasting, where its use has substantial
overlap with the field of machine learning. Regression analysis is also used to understand which
among the independent variables are related to the dependent variable, and to explore the forms
of these relationships. In restricted circumstances, regression analysis can be used to infer causal
relationships between the independent and dependent variables. However, this can lead to
illusory or spurious relationships, so caution is advisable; for example, correlation does not imply
causation.
Many techniques for carrying out regression analysis have been developed. Familiar methods
such as linear regression and ordinary least squares regression are parametric, in that the
regression function is defined in terms of a finite number of unknown parameters that are
estimated from the data. Nonparametric regression refers to techniques that allow the regression
function to lie in a specified set of functions, which may be infinite-dimensional.
The performance of regression analysis methods in practice depends on the form of the data
generating process, and how it relates to the regression approach being used. Since the true form
of the data-generating process is generally not known, regression analysis often depends to some
extent on making assumptions about this process. These assumptions are sometimes testable if a
lot of data are available. Regression models for prediction are often useful even when the
assumptions are moderately violated, although they may not perform optimally. However, in
many applications, especially with small effects or questions of causality based on observational
data, regression methods can give misleading results.[2][3]
REGRESSION MODELS:
Regression models involve the following variables:
- The unknown parameters, denoted as β, which may represent a scalar or a vector. Let k
denote the number of unknown parameters.
- The independent variables, X.
- The dependent variable, Y.
A regression model relates Y to a function of X and β: Y ≈ f(X, β).
If N data points of the form (Y, X) are observed, where N < k, most classical approaches
to regression analysis cannot be performed: since the system of equations defining the
regression model is underdetermined, there are not enough data to recover β.
If exactly N = k data points are observed, and the function f is linear, the
equations Y = f(X, β) can be solved exactly rather than approximately. This reduces to
solving a set of N equations with N unknowns (the elements of β), which has a unique
solution as long as the X are linearly independent. If f is nonlinear, a solution may not
exist, or many solutions may exist.
The most common situation is where N > k data points are observed. In this case, there is
enough information in the data to estimate a unique value for β that best fits the data in
some sense, and the regression model when applied to the data can be viewed as an
overdetermined system in β.
In the last case, the regression analysis provides the tools for:
1. Finding a solution for unknown parameters β that will, for example, minimize the
distance between the measured and predicted values of the dependent
variable Y (also known as the method of least squares).
2. Under certain statistical assumptions, the regression analysis uses the surplus of
information to provide statistical information about the unknown parameters β and
predicted values of the dependent variable Y.
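A minimal sketch of point 1 for the common N > k case, using synthetic data (the true parameters and noise level are assumptions for illustration):

```python
# Overdetermined linear model: estimate beta by minimising the squared
# distance between measured and predicted Y (ordinary least squares).
import numpy as np

rng = np.random.default_rng(0)
N, k = 50, 2
X = np.column_stack([np.ones(N), rng.uniform(0, 10, N)])  # intercept + regressor
beta_true = np.array([3.0, 0.5])                          # illustrative parameters
Y = X @ beta_true + rng.normal(0.0, 0.05, N)              # noisy observations

beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)          # least-squares estimate
```

With N = 50 observations and k = 2 parameters, the surplus information pins down β closely despite the noise.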
Underlying assumptions:
In simple linear regression, the model specifies that the dependent variable y_i depends
linearly on a single independent variable x_i, giving a straight line:

    y_i = β₀ + β₁x_i + ε_i,

or, adding a squared term, a parabola:

    y_i = β₀ + β₁x_i + β₂x_i² + ε_i.

This is still linear regression; although the expression on the right hand side is quadratic in the
independent variable x_i, it is linear in the parameters β₀, β₁ and β₂.
In both cases, ε_i is an error term and the subscript i indexes a particular observation.
Given a random sample from the population, we estimate the population parameters and obtain
the sample linear regression model:

    ŷ_i = b₀ + b₁x_i.

The residual, e_i = y_i − ŷ_i, is the difference between the value of the dependent variable
predicted by the model, ŷ_i, and the true value of the dependent variable, y_i. One method of
estimation is ordinary least squares. This method obtains parameter estimates that minimize the
sum of squared residuals, SSE, also sometimes denoted RSS:

    SSE = Σ e_i².

Minimization of this function results in a set of normal equations, a set of simultaneous linear
equations in the parameters, which are solved to yield the parameter estimators b₀ and b₁.
In the case of simple regression, the formulas for the least squares estimates are

    b₁ = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²,    b₀ = ȳ − b₁x̄,

where x̄ is the mean (average) of the x values and ȳ is the mean of the y values.
Under the assumption that the population error term has a constant variance, the estimate of that
variance is given by:

    σ̂² = SSE / (n − 2).

This is called the mean square error (MSE) of the regression. The denominator is the sample size
reduced by the number of model parameters estimated from the same data: (n − p)
for p regressors, or (n − p − 1) if an intercept is used.[22] In this case, p = 1 so the denominator is n − 2.
The standard errors of the parameter estimates are given by

    SE(b₀) = σ̂ √(1/n + x̄² / Σ(x_i − x̄)²),    SE(b₁) = σ̂ / √(Σ(x_i − x̄)²).
Under the further assumption that the population error term is normally distributed, the
researcher can use these estimated standard errors to create confidence intervals and
conduct hypothesis tests about the population parameters.
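These quantities can be sketched end to end for a small illustrative data set (the numbers are made up for the example):

```python
# Simple regression: least-squares estimates, residuals, MSE (denominator n - 2)
# and the standard errors of the intercept and slope.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])   # roughly y = 2x

xbar, ybar = x.mean(), y.mean()
sxx = np.sum((x - xbar) ** 2)

b1 = np.sum((x - xbar) * (y - ybar)) / sxx          # slope estimate
b0 = ybar - b1 * xbar                               # intercept estimate

resid = y - (b0 + b1 * x)
n = len(x)
mse = np.sum(resid ** 2) / (n - 2)                  # mean square error

se_b1 = np.sqrt(mse / sxx)                          # standard error of the slope
se_b0 = np.sqrt(mse * (1.0 / n + xbar ** 2 / sxx))  # standard error of the intercept
```

The standard errors then feed directly into confidence intervals and hypothesis tests as described above.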
In the more general multiple regression model, there are p independent variables:

    y_i = β₁x_i1 + β₂x_i2 + … + β_p x_ip + ε_i,

where x_ij is the ith observation on the jth independent variable, and where the first
independent variable takes the value 1 for all i (so β₁ is the regression intercept).
The least squares parameter estimates are obtained from p normal equations. The residual
can be written as

    e_i = y_i − β̂₁x_i1 − … − β̂_p x_ip.
INTERPOLATION
In the mathematical field of numerical analysis, interpolation is a method of
constructing new data points within the range of a discrete set of known data points.
In engineering and science, one often has a number of data points, obtained
by sampling or experimentation, which represent the values of a function for a limited number of
values of the independent variable. It is often required to interpolate (i.e. estimate) the value of
that function for an intermediate value of the independent variable. This may be achieved
by curve fitting or regression analysis.
A different problem which is closely related to interpolation is the approximation of a
complicated function by a simple function. Suppose the formula for some given function is
known, but too complex to evaluate efficiently. A few known data points from the original
function can be used to create an interpolation based on a simpler function. Of course, when a
simple function is used to estimate data points from the original, interpolation errors are usually
present; however, depending on the problem domain and the interpolation method used, the gain
in simplicity may be of greater value than the resultant loss in accuracy.
Example
For example, suppose we have a table like this, which gives some values of an unknown
function f.
    x    f(x)
    0     0
    1     0.8415
    2     0.9093
    3     0.1411
    4    −0.7568
    5    −0.9589
    6    −0.2794

Interpolation provides a means of estimating the function at intermediate points, such as x = 2.5.
Linear interpolation:
One of the simplest methods is linear interpolation (sometimes known as lerp). Consider
the above example of estimating f(2.5). Since 2.5 is midway between 2 and 3, it is reasonable to
take f(2.5) midway between f(2) = 0.9093 and f(3) = 0.1411, which yields 0.5252.
Generally, linear interpolation takes two data points, say (xa, ya) and (xb, yb), and the interpolant
at a point x between xa and xb is given by:

    y = ya + (yb − ya) · (x − xa) / (xb − xa).
Linear interpolation is quick and easy, but it is not very precise. Another disadvantage is that the
interpolant is not differentiable at the data points xk, where adjacent linear pieces meet.
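As a sketch in code, the linear interpolant through two points reproduces the f(2.5) example:

```python
# Linear interpolation (lerp) between two data points.
def lerp(xa, ya, xb, yb, x):
    """Value at x of the line through (xa, ya) and (xb, yb)."""
    return ya + (yb - ya) * (x - xa) / (xb - xa)

estimate = lerp(2, 0.9093, 3, 0.1411, 2.5)   # midway between f(2) and f(3)
```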
The following error estimate shows that linear interpolation is not very precise. Denote the
function which we want to interpolate by g, and suppose that x lies between xa and xb and
that g is twice continuously differentiable. Then the linear interpolation error is

    |g(x) − L(x)| ≤ C (xb − xa)²,  where C = (1/8) max |g″(r)| over r in [xa, xb]

and L denotes the linear interpolant through (xa, g(xa)) and (xb, g(xb)).
In words, the error is proportional to the square of the distance between the data points. The error
in some other methods, including polynomial interpolation and spline interpolation (described
below), is proportional to higher powers of the distance between the data points. These methods
also produce smoother interpolants.
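A rough numerical check of this quadratic behaviour (using sin as the function g, an illustrative choice): halving the node spacing should cut the maximum error by roughly a factor of four.

```python
# Maximum linear-interpolation error over [xa, xb], estimated by dense sampling.
import math

def lerp_error(g, xa, xb, samples=1000):
    worst = 0.0
    for i in range(samples + 1):
        x = xa + (xb - xa) * i / samples
        line = g(xa) + (g(xb) - g(xa)) * (x - xa) / (xb - xa)
        worst = max(worst, abs(g(x) - line))
    return worst

e1 = lerp_error(math.sin, 2.0, 3.0)   # node spacing 1
e2 = lerp_error(math.sin, 2.0, 2.5)   # node spacing 1/2
```

The ratio is only approximately four because g″ is not constant across the interval.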
Polynomial interpolation:
Polynomial interpolation is a generalization of linear interpolation. Note that the linear
interpolant is a linear function. We now replace this interpolant with a polynomial of
higher degree.
Consider again the problem given above. The following sixth degree polynomial goes through all
the seven points:
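The coefficients are not reproduced here, but the interpolating polynomial can be computed numerically (a sketch, assuming the table's values come from sin as stated earlier):

```python
# Degree-6 polynomial through the seven tabulated points: an exact interpolant.
import numpy as np

x = np.arange(7)
y = np.sin(x)                      # 0, 0.8415, 0.9093, ... as in the table

coeffs = np.polyfit(x, y, 6)       # degree 6 through 7 points: interpolation
value = np.polyval(coeffs, 2.5)    # estimate of f(2.5)
```

Substituting x = 2.5 gives an estimate much closer to sin(2.5) than the linear interpolant's 0.5252.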